The Oct 20th AWS outage got me scratching my head 🤔
As a heavy AWS user, I couldn’t help but pause and reflect. When a region as critical as AWS US-East-1 goes down, it reminds us that even the most resilient cloud platforms can and will fail. Private data centers are not immune either.
AWS attributed the outage to DNS resolution issues for the regional DynamoDB service endpoints, which triggered a cascading effect across many AWS services, including Network Load Balancer failures and EC2 launch failures. The outage impacted many customers’ applications, including Snapchat.
This outage was another real-world test of Murphy’s Law: Anything that can go wrong, will go wrong.
That’s why a failover strategy is not optional; it is essential to business viability.
I understand that every organization faces trade-offs due to limited funds, and not all organizations can afford a multi-region “always active” or even “passive” site architecture. But if your critical workloads rely on technology, I highly recommend having at least a plan B of some sort, ready to take over based on your business needs.
To identify your business needs, talk with your IT department and communicate the maximum downtime you can tolerate (your recovery time objective) and how much data you can afford to lose (your recovery point objective). This is your opportunity for an honest conversation about reasonable uptime needs, so your IT team can build a resilient architecture that sustains your business when technology fails.
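To make that conversation concrete, here is a minimal Python sketch of how downtime and data-loss targets can narrow the choices. Everything in it is an assumption for illustration: the option names, the recovery times, and the monthly costs are made-up placeholders, not AWS figures or pricing.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryOption:
    """One hypothetical plan B, described by what it delivers and costs."""
    name: str
    recovery_minutes: float    # how long until service is restored
    data_loss_minutes: float   # worst-case window of data that could be lost
    monthly_cost: float        # illustrative figure only


def cheapest_option_meeting(
    options: Sequence[RecoveryOption],
    max_downtime_minutes: float,
    max_data_loss_minutes: float,
) -> Optional[RecoveryOption]:
    """Return the lowest-cost option that satisfies both targets, or None."""
    candidates = [
        o for o in options
        if o.recovery_minutes <= max_downtime_minutes
        and o.data_loss_minutes <= max_data_loss_minutes
    ]
    return min(candidates, key=lambda o: o.monthly_cost, default=None)


# Placeholder numbers: replace with estimates from your own IT team.
OPTIONS = [
    RecoveryOption("backup-and-restore", 24 * 60, 24 * 60, 500.0),
    RecoveryOption("warm-standby-second-region", 30, 5, 4000.0),
    RecoveryOption("active-active-multi-region", 1, 0, 15000.0),
]

# A business that can tolerate 1 hour of downtime and 15 minutes of data loss.
choice = cheapest_option_meeting(OPTIONS, max_downtime_minutes=60, max_data_loss_minutes=15)
print(choice.name if choice else "no option meets the targets")
```

The point of the sketch is the order of operations: the business states its targets first, and only then does cost decide among the options that meet them.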
In this case, customers who had failover configured to US-East-2 (Ohio) or any other region outside US-East-1 were able to pivot with minimal disruption.
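For the technically curious, the core idea of a failover can be sketched in a few lines of Python: try the primary region first, and fall back to the next one when its health check fails. This is only a rough illustration. The endpoint URLs and the `is_healthy` check are hypothetical, and a real setup would more likely rely on managed DNS or load-balancer health checks than on application code like this.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as an unhealthy endpoint.
            continue
    return None


# Hypothetical endpoints, listed in priority order (primary first).
ENDPOINTS = [
    "https://app.us-east-1.example.com",
    "https://app.us-east-2.example.com",
]


def simulated_health_check(endpoint: str) -> bool:
    """Stand-in for a real check: pretend the primary region is down."""
    return "us-east-1" not in endpoint


print(pick_endpoint(ENDPOINTS, simulated_health_check))
```

Passing the health check in as a function keeps the selection logic easy to test, which matters because a failover path that is never exercised tends to fail when it is finally needed.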
The key is balance. Of course, the farther away your failover region, the more your data replication will typically cost. But it doesn’t have to be a cross-coast failover; even a nearby region can make all the difference.
The lesson is to always remember that IT will fail. Systems will fail.
What matters is how prepared we are when it happens.
We need to always assume failure will occur.
We need to always design for resilience, and TEST the strategy.
And lastly, always, always have a plan B.
#CloudArchitecture #Resilience #DisasterRecovery #Leadership #TechnologyStrategy