The Oct 20th AWS outage got me scratching my head 🤔 As a heavy AWS user, I couldn’t help but pause and reflect. When a region as critical as AWS US-East-1 goes down, it reminds us that even the most resilient cloud platforms can and will fail. Private data centers are not immune either.

AWS communicated the root cause as DNS resolution issues for the regional DynamoDB service endpoints, which triggered a cascading effect across many AWS services, including Network Load Balancer failures that led to EC2 launch failures and more. The outage impacted many customers’ applications, including Snapchat. It was another real-world test of Murphy’s Law: anything that can go wrong, will go wrong.

That’s why a failover strategy is not optional; it is essential for business viability. I understand that every organization faces trade-off decisions due to limited funds, and not every organization can afford a multi-region “always active” or even “passive” site architecture. But if your critical workloads rely on any IT technology, it’s highly recommended to at least have a plan B of some sort, ready to take over based on your business needs.

To identify those needs, talk with your IT department and communicate your maximum tolerable downtime and your data recovery needs. This is your opportunity for an honest conversation about your realistic business uptime requirements, so your IT people can build a resilient architecture that sustains your business in case of technology failure.

In this case, customers who had failover configured to US-East-2 (Ohio) or other regions besides US-East-1 were able to pivot with minimal disruption. The key is balance. Of course, the farther away your failover region, the higher your data replication cost, but it doesn’t have to be a cross-coast failover; even a nearby region can make all the difference.

The lesson is to always remember that IT will fail. Systems will fail. What matters is how prepared we are when it happens. We need to always assume failure will occur, design for resilience, and TEST the strategy. And lastly, always, always have a plan B.

#CloudArchitecture #Resilience #DisasterRecovery #Leadership #TechnologyStrategy
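If you are wondering what a minimal “plan B” can look like in code, here is a rough sketch (illustrative only, not the author’s setup or a full DR design): a Python/boto3 read path that falls back from US-East-1 to US-East-2 when the primary DynamoDB endpoint fails to resolve or respond. The "orders" table name is hypothetical, and a real failover plan also needs replicated data (for example, DynamoDB Global Tables), a story for writes, and regular testing.

# Minimal sketch: regional read failover for DynamoDB.
# Assumes the (hypothetical) "orders" table is replicated to the failover region.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"
FAILOVER_REGION = "us-east-2"
TABLE_NAME = "orders"  # hypothetical table name, for illustration only

# Short timeouts and a single attempt so a regional outage fails fast instead of hanging.
_fast_fail = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

def get_order(order_id: str):
    """Read from the primary region; fall back to the replica region on endpoint failure."""
    for region in (PRIMARY_REGION, FAILOVER_REGION):
        table = boto3.resource("dynamodb", region_name=region, config=_fast_fail).Table(TABLE_NAME)
        try:
            return table.get_item(Key={"order_id": order_id}).get("Item")
        except (EndpointConnectionError, ConnectTimeoutError, ClientError):
            continue  # primary region unreachable or erroring; try the next region
    return None  # both regions unavailable; surface a degraded-mode response upstream

Even a small pattern like this only pays off if the fallback region actually has your data and you exercise the path before the outage, which is exactly why the RTO/RPO conversation with your IT team matters.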
It's super $$$$ to do failovers. Which is why most companies don't!
I was amazed at all the services using AWS that were affected, including Signal. No one likes insurance until they use it, just like failover.
Mignon Edorh, PMP, CISSP This outage could have been averted. Not because AWS is infallible, but because architecture defines blast radius. On Oct 20, AWS US-East-1 went down when DynamoDB endpoints hit a DNS resolution failure, triggering a domino effect: Load Balancer failures, EC2 issues, and widespread downtime. That’s the danger of glue-layered stacks: when services (DynamoDB + EC2 + LB + queues) are chained, one break collapses all. For example’s sake, how MonkDB would have reduced the impact:
- Sovereign multi-region deployment, no single-region dependency like DynamoDB
- Unified data plane with multi-modal support and no DNS-based chaining
- Policy-driven failover (MCP) to ensure automatic recovery
- Graceful degradation for cached/read-only continuity
- Data sovereignty and control, no vendor collapse
Lesson: outages don’t begin with disaster; they begin with one fragile link. Resilience isn’t reacting fast. It’s not breaking at all.
And failover is insurance we have more control over 😁.