From the course: Site Reliability Engineering Essential Training

Handling failures

Handling failures. Failures are going to happen; there's no way around it. But as an SRE, you can design and implement systems so that they can handle failures.

Points of failure. This is just a very small subset of what can fail: applications, hardware, disks, network, DNS, the application runtime, the cloud service provider, or a bug, and the list goes on and on. This is just to let you know that there are many points of failure in a distributed system.

So what is the solution for handling failures? In the SRE world, there are two aspects. One is fault-tolerant infrastructure architecture. The other is fault-tolerant application architecture. You need both in order to handle failures successfully. So let's take a look at them.

Fault-tolerant infrastructure architecture. So what do we mean by that, and what are the traits of a fault-tolerant infrastructure architecture? First, you need to have redundancy, because servers will fail at some point. Second, we talked about load balancer in the…
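
To make the redundancy trait concrete, here is a minimal Python sketch of failing over between redundant servers. It is not from the course. The names `call_with_failover`, `failing_server`, `healthy_server`, and `AllReplicasFailed` are hypothetical and exist only for illustration, and the "servers" are plain functions standing in for real network calls.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed (hypothetical name)."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Only reached when no replica answered successfully.
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def failing_server() -> str:
    # Hypothetical stand-in for a server that has crashed.
    raise ConnectionError("server down")


def healthy_server() -> str:
    # Hypothetical stand-in for a server that is still up.
    return "200 OK"


if __name__ == "__main__":
    # The first replica fails, so the call falls through to the second one.
    print(call_with_failover([failing_server, healthy_server]))
```

With only one server in the list, a single failure takes the whole call down. With a second, independent server, the same failure is absorbed, which is the point of redundancy.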
