Lessons from the AWS outage: Plan for IT failure with a strategy

Mignon Edorh, PMP, CISSP

Author of Forward with Grit and Grace | Digital Transformation Lead | Women in Tech advocate | Emerging Technologies Enthusiast | Continuous Learner | Speaker

The Oct 20th AWS outage got me scratching my head 🤔 As a heavy AWS user, I couldn’t help but pause and reflect. When a region as critical as AWS US-East-1 goes down, it reminds us that even the most resilient cloud platforms can and will fail. Private data centers are not immune either.

AWS attributed the root cause of the outage to DNS resolution issues for the regional DynamoDB service endpoints, which triggered a cascading effect across many AWS services, including Network Load Balancer failures that in turn led to EC2 launch failures and more. The outage impacted many customers’ applications, Snapchat among them. This outage was another real-world test of Murphy’s Law: anything that can go wrong, will go wrong.

That’s why a failover strategy is not optional; it is essential to business viability. I understand that every organization faces trade-off decisions due to limited funds, and not all organizations can afford multi-region “always active” or even “passive” site architectures. But if your critical workloads rely on any IT technology, it’s highly recommended to at least have a plan B of some sort, ready to take over based on your business needs.

To identify those needs, talk with your IT department and communicate your maximum tolerable downtime and your data recovery needs. This is your opportunity for an honest conversation about your realistic business uptime needs, so your IT people can build a resilient architecture to sustain your business in case of technology failure. In this case, customers who had failover configured to US-East-2 (Ohio) or another region outside US-East-1 were able to pivot with minimal disruption.

The key is balance. Of course, the farther away your failover region, the higher your data replication costs, but it doesn’t have to be a cross-coast failover; even a nearby-region failover can make all the difference.

The lesson is to always remember that IT will fail. Systems will fail. What matters is how prepared we are when it happens. We need to always assume failure will occur. We need to always design for resilience, and TEST the strategy. And lastly, always, always have a plan B.

#CloudArchitecture #Resilience #DisasterRecovery #Leadership #TechnologyStrategy
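To make the “plan B” idea concrete, here is a minimal sketch of client-side regional failover, assuming a hypothetical DynamoDB global table named "orders" replicated from us-east-1 to us-east-2. The table name, key schema, and region order are illustrative assumptions, not the architecture from the post, and a real setup would pair this with health checks and DNS or load-balancer failover.

```python
# Minimal sketch: try the primary region first, then fall back to a nearby replica.
# Assumes a hypothetical DynamoDB global table "orders" replicated to us-east-2.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-east-2"]  # primary first, nearby failover second
TABLE_NAME = "orders"                 # hypothetical table name

def get_order(order_id: str):
    """Read an item from the first region that responds."""
    last_error = None
    for region in REGIONS:
        try:
            dynamodb = boto3.resource(
                "dynamodb",
                region_name=region,
                # Short timeouts and a single attempt so a dead region fails fast.
                config=Config(connect_timeout=2, read_timeout=2,
                              retries={"max_attempts": 1}),
            )
            table = dynamodb.Table(TABLE_NAME)
            response = table.get_item(Key={"order_id": order_id})
            return response.get("Item")
        except (BotoCoreError, ClientError) as err:
            last_error = err  # endpoint unreachable or erroring; try the next region
    raise RuntimeError(f"All configured regions failed: {last_error}")

if __name__ == "__main__":
    print(get_order("12345"))
```

The design choice mirrors the post’s point about balance: a nearby replica (Ohio) keeps replication cost and latency low while still removing the single-region dependency, and the fail-fast timeouts determine how quickly reads pivot when the primary endpoint stops resolving.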

Mignon Edorh, PMP, CISSP

1mo

And failover is insurance we have more control over 😁.

Shawn C.

NASA Distinguished Digital Service Expert; Product, design and software engineering leadership. Accelerating NASA missions at the speed of digital. Co-Founded NASA Digital Service. Brought Figma to NASA.

1mo

It's super $$$$ to do failovers. Which is why most companies don't!

Jason Smith

Applications Enterprise Architect

1mo

I was amazed at all the services using AWS that were affected, including Signal. No one likes insurance until they use it, just like failover.

Mignon Edorh, PMP, CISSP This outage could have been averted. Not because AWS is infallible, but because architecture defines blast radius. On Oct 20, AWS US-East-1 went down when DynamoDB endpoints hit a DNS resolution failure, triggering a domino effect: Load Balancer failures, EC2 issues, and widespread downtime. That’s the danger of glue-layered stacks: when services (DynamoDB + EC2 + LB + queues) are chained, one break collapses all.

For example’s sake, how MonkDB would have reduced impact:
- Sovereign multi-region, no single-region dependency like DynamoDB
- Unified data plane, multi-modal support, and no DNS-based chaining
- Policy-driven failover (MCP) ensures auto recovery
- Graceful degradation for cached/read-only continuity
- Data sovereignty and control, no vendor collapse

Lesson: outages don’t begin with disaster; they begin with one fragile link. Resilience isn’t reacting fast. It’s not breaking at all.
