From the course: Site Reliability Engineering Essential Training

Handling failures

Failures are going to happen; there's no way around it. But as an SRE, you can design and implement systems so that they can handle failures.

Points of failure. This is just a small subset of what can fail: applications, hardware, disks, network, DNS, application runtime, cloud service provider, or a bug, and the list goes on and on. The point is that a distributed system has many points of failure.

So what is the solution to handle failures? In the SRE world, there are two aspects. One is fault-tolerant infrastructure architecture. The other is fault-tolerant application architecture. You need both in order to handle failures successfully. So let's take a look at them.

Fault-tolerant infrastructure architecture. So what do we mean by that, and what are its traits? First, you need to have redundancy because servers will fail at some point. Second, we talked about load balancer in the…
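
The redundancy trait can be illustrated with a small sketch. It is not taken from the course: the names `call_with_failover`, `broken_server`, and `healthy_server` are hypothetical, and the "servers" are plain Python functions standing in for real redundant instances. The idea is only that when one replica fails, the request is retried against the next one.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # any failure moves us on to the next replica
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


# Hypothetical stand-ins for two redundant servers.
def broken_server() -> str:
    raise ConnectionError("server is down")


def healthy_server() -> str:
    return "200 OK"


result = call_with_failover([broken_server, healthy_server])
print(result)
```

With only one server in the list, a single failure becomes an outage; with a second, healthy replica the caller still gets a response. In practice this routing is usually done by a load balancer rather than by client code.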
