Lessons from the AWS outage: Plan for IT failure with a strategy

Mignon Edorh, PMP, CISSP

Author of Forward with Grit and Grace | Digital Transformation Lead | Women in Tech advocate | Emerging Technologies Enthusiast | Continuous Learner | Speaker

The Oct 20th AWS outage got me scratching my head 🤔 As a heavy AWS user, I couldn’t help but pause and reflect. When a region as critical as AWS US-East-1 goes down, it reminds us that even the most resilient cloud platforms can and will fail. Private data centers are not immune either.

AWS attributed the root cause of the outage to DNS resolution issues for the regional DynamoDB service endpoints, which triggered a cascading effect across many AWS services, including Network Load Balancer failures that in turn led to EC2 launch failures and more. The outage impacted many customers’ applications, Snapchat among them. This outage was another real-world test of Murphy’s Law: anything that can go wrong, will go wrong.

That’s why a failover strategy is not optional; it is essential to business viability. I understand that every organization faces trade-off decisions due to limited funds, and not all organizations can afford multi-region “always active” or even “passive” site architectures. But if your critical workloads rely on any IT technology, it’s highly recommended to at least have a plan B of some sort, ready to take over based on your business needs.

To identify those needs, talk with your IT department and communicate your maximum tolerable downtime and your data recovery needs. This is your opportunity for an honest conversation about your realistic business uptime needs, so your IT people can build a resilient architecture to sustain your business in case of technology failure. In this case, customers who had failover configured to US-East-2 (Ohio) or another region outside US-East-1 were able to pivot with minimal disruption.

The key is balance. Of course, the farther away your failover region, the higher your data replication costs, but it doesn’t have to be a cross-coast failover; even a nearby-region failover can make all the difference.

The lesson is to always remember that IT will fail. Systems will fail. What matters is how prepared we are when it happens. We need to always assume failure will occur. We need to always design for resilience, and TEST the strategy. And lastly, always, always have a plan B.

#CloudArchitecture #Resilience #DisasterRecovery #Leadership #TechnologyStrategy
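To make the “plan B” idea concrete, here is a minimal sketch of client-side regional failover, assuming a hypothetical DynamoDB global table named "orders" replicated from us-east-1 to us-east-2. The table name, key schema, and region order are illustrative assumptions, not the architecture from the post, and a real setup would pair this with health checks and DNS or load-balancer failover.

```python
# Minimal sketch: try the primary region first, then fall back to a nearby replica.
# Assumes a hypothetical DynamoDB global table "orders" replicated to us-east-2.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-east-2"]  # primary first, nearby failover second
TABLE_NAME = "orders"                 # hypothetical table name

def get_order(order_id: str):
    """Read an item from the first region that responds."""
    last_error = None
    for region in REGIONS:
        try:
            dynamodb = boto3.resource(
                "dynamodb",
                region_name=region,
                # Short timeouts and a single attempt so a dead region fails fast.
                config=Config(connect_timeout=2, read_timeout=2,
                              retries={"max_attempts": 1}),
            )
            table = dynamodb.Table(TABLE_NAME)
            response = table.get_item(Key={"order_id": order_id})
            return response.get("Item")
        except (BotoCoreError, ClientError) as err:
            last_error = err  # endpoint unreachable or erroring; try the next region
    raise RuntimeError(f"All configured regions failed: {last_error}")

if __name__ == "__main__":
    print(get_order("12345"))
```

The design choice mirrors the post’s point about balance: a nearby replica (Ohio) keeps replication cost and latency low while still removing the single-region dependency, and the fail-fast timeouts determine how quickly reads pivot when the primary endpoint stops resolving.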

Mignon Edorh, PMP, CISSP

1mo

And failover is insurance we have more control over 😁.

Shawn C.

NASA Distinguished Digital Service Expert; Product, design and software engineering leadership. Accelerating NASA missions at the speed of digital. Co-Founded NASA Digital Service. Brought Figma to NASA.

1mo

It's super $$$$ to do failovers. Which is why most companies don't!

Jason Smith

Applications Enterprise Architect

1mo

I was amazed at all the services using AWS that were affected, including Signal. No one likes insurance until they use it, just like failover.

Mignon Edorh, PMP, CISSP This outage could have been averted. Not because AWS is infallible, but because architecture defines blast radius. On Oct 20, AWS US-East-1 went down when DynamoDB endpoints hit a DNS resolution failure, triggering a domino effect: Load Balancer failures, EC2 issues, and widespread downtime. That’s the danger of glue-layered stacks: when services (DynamoDB + EC2 + LB + queues) are chained, one break collapses all.

For example’s sake, how MonkDB would have reduced impact:
- Sovereign multi-region, no single-region dependency like DynamoDB
- Unified data plane, multi-modal support, and no DNS-based chaining
- Policy-driven failover (MCP) ensures auto recovery
- Graceful degradation for cached/read-only continuity
- Data sovereignty and control, no vendor collapse

Lesson: outages don’t begin with disaster; they begin with one fragile link. Resilience isn’t reacting fast. It’s not breaking at all.
