From the course: Site Reliability Engineering Essential Training

Introduction

Hello and welcome to this Site Reliability Engineering Essentials video course. My name is Karun Subramanian, and I will be your instructor. I'm an IT operations expert with over two decades of experience in the field of Site Reliability Engineering.

Site Reliability Engineering, or SRE, is the discipline of managing IT operations using software and systems engineering. Originally developed by Google around 2003, SRE has gained immense popularity. The concepts, technologies, and processes in SRE are broad and often require a significant investment of your time to learn. I created this course specifically to teach you the essentials so that you can start implementing SRE right away. In these lessons, you will learn the basic principles of SRE and their practical applications. I will provide many real-world examples and demonstrations to help you understand complex topics.

In Lesson 1, we'll begin by explaining the core tenets of SRE. You'll also learn the difference between DevOps, platform engineering, and SRE. Lastly, I'll walk you through a typical day of a site reliability engineer.

With the basics out of the way, in Lesson 2, we will dive right into observability, one of the foundational capabilities for SRE. You'll get to know telemetry signals such as logs, metrics, and traces, and discover the four golden signals you need to monitor. I'll also demonstrate observability in action using Splunk.

In Lesson 3, you will learn key SRE concepts: SLI, Service Level Indicator; SLO, Service Level Objective; and SLA, Service Level Agreement. You'll look at several real-world examples to grasp these ideas.

In Lesson 4, you'll study how incident management is handled in SRE. We'll review how on-call is implemented and learn how to implement blameless postmortem practices. I'll explain how to use postmortem templates to create your postmortems efficiently.

Lesson 5 is all about reliable systems architecture, including core technologies like load balancing and autoscaling. We will discuss the various ways to handle failures using techniques like circuit breakers.

Most outages are caused by changes to code or configuration, so in Lesson 6, we'll discuss release management. You will learn the tenets of release management such as canary deployment, progressive rollouts, and safe rollbacks.

In Lesson 7, you will discover how to implement SRE in your organization. You'll learn the various SRE implementation models and the characteristics of each of them. You'll also learn about using a production readiness review, a proven method to implement reliable systems in production.

Finally, in Lesson 8, I will summarize the key information from this course and suggest the next steps to grow in your SRE journey. There is a lot of valuable content to cover, so let's dive in.
