When AWS goes down, the internet feels it. 🌐 The recent AWS outage proved that even the most reliable clouds can crumble under hidden dependencies. The disruption started in US-EAST-1 but spread globally, impacting services like Slack, Zoom, and Atlassian.

Our latest blog explores the outage’s root cause and key lessons for infrastructure architects:
- Avoid single-region control planes
- Separate control and data planes
- Design for true multi-region, active-active operation
- Continuously test failure scenarios

Learn how Wallarm’s Security Edge already applies these principles to stay resilient when providers stumble.
👉 Read the full breakdown: https://lnkd.in/dn8jGHnu
#APIsecurity #CloudSecurity #Wallarm #AppSec #Resilience #AWS
AWS outage: lessons for infrastructure architects
More Relevant Posts
-
🔌 Takeaway from the recent Amazon Web Services (AWS) outage

🔹 Earlier this week, AWS reported an incident in its US-EAST-1 region: from ~11:49 PM PDT on October 19 to ~2:24 AM PDT on October 20, a DNS-resolution issue for the Amazon DynamoDB service endpoints led to elevated error rates across a number of services. In other words: a core database API endpoint couldn’t be resolved correctly, triggering cascading failures and widespread disruption well beyond a single service.

🔹 Why it matters
Even the largest, most robust cloud infrastructures are vulnerable when a foundational piece (like DNS) fails. It highlights that our modern digital architecture is built in layers, and a breakdown deep in the stack affects everything above it. For organizations relying on cloud services, it’s a reminder that designing for resilience means accounting for unexpected failures, even when using best-in-class providers. As one expert put it: “Failures increasingly trace to integrity … until we better understand and protect integrity, our total focus on uptime is an illusion.”

🔹 What we can do as professionals
- Re-evaluate critical dependencies: is your architecture too tightly coupled to one provider, region or service?
- Adopt multi-region, multi-endpoint failover strategies for critical services.
- Ensure you have observability not just into service health, but into infrastructural components like DNS, networking and the resolution path (a minimal probe sketch follows this post).
- Run regular failure-mode drills (e.g., “What happens if our primary DB endpoint cannot be resolved?”) and treat them like security/fire drills.
- Communicate transparently with stakeholders when incidents happen: outages are never just “tech issues”; they impact operations, brand trust and user experience.

🧠 Final thought
Outages are never convenient, but they are inevitable in complex distributed systems. What differentiates high-performing organisations is not that they never fail, but that they anticipate weakness, build in resilient countermeasures, and respond and learn swiftly. Let’s treat this AWS outage not just as a headline, but as a learning opportunity: to strengthen our infrastructure, improve our incident readiness, and sharpen our commitment to reliability.

Would love to hear your thoughts:
🔍 How do you approach dependency risk in your cloud architecture?
⚙️ What failure drills have you found most revealing in your organisation?

#CloudComputing #AWS #ReliabilityEngineering #Resilience #IncidentManagement #DynamoDB
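The list above mentions DNS-level observability and failure-mode drills; here is a minimal sketch of what a resolution probe for critical endpoints could look like. It is not from the original post: the endpoint list and the print-based alert are placeholder assumptions, and in practice the probe would feed your real monitoring and alerting.

```python
# Minimal DNS-resolution probe for critical dependencies (illustrative sketch).
# The endpoint names are examples; substitute the hosts your system depends on.
import socket
from datetime import datetime, timezone

CRITICAL_ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",   # primary data store (example)
    "dynamodb.us-east-2.amazonaws.com",   # failover region (example)
    "sts.amazonaws.com",                  # credentials/auth (example)
]

def resolves(host: str) -> bool:
    """Return True if the host currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, 443)) > 0
    except socket.gaierror:
        return False

def run_probe() -> list:
    """Check every critical endpoint and return the ones that failed to resolve."""
    failed = [host for host in CRITICAL_ENDPOINTS if not resolves(host)]
    stamp = datetime.now(timezone.utc).isoformat()
    for host in failed:
        # Placeholder alert: replace with your paging/metrics integration.
        print(f"[{stamp}] ALERT: {host} is not resolving")
    return failed

if __name__ == "__main__":
    run_probe()
```

Run something like this on a schedule (cron, a job in a different region, or your existing monitoring agent) so a resolution failure in one dependency shows up as an explicit signal rather than as mysterious application errors.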
-
Everyone talks about the October AWS outage, but one detail is often overlooked: even a “multi-region” AWS setup can fail if your control plane lives in a single region, and for several global AWS services it does. AWS’s own docs confirm it: Route 53’s control plane (record updates, health checks, APIs) is hosted entirely in us-east-1; only the DNS query-serving network is globally distributed. You can see this in the AWS Console: Route 53 doesn’t ask for a region, and S3 defaults to us-east-1. If us-east-1 stalls, you can’t update records or health checks. https://lnkd.in/eCRtRBzG
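The practical implication, not stated explicitly above, is that any failover you want to survive a us-east-1 control-plane outage has to be provisioned in advance, so the decision at incident time is made by Route 53’s globally distributed data plane (health checks evaluating records that already exist) rather than by API calls you may not be able to make. Below is a hedged boto3 sketch of that pre-provisioning; the hosted zone ID, record name, and endpoints are placeholders, and this is one illustrative pattern rather than the only way to do it.

```python
# Sketch: pre-provision Route 53 failover records ahead of time (placeholder values).
# At incident time the failover is driven by health checks on the DNS data plane,
# not by control-plane API calls that may be unavailable if us-east-1 is impaired.
import uuid
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "ZEXAMPLE123"           # placeholder hosted zone
RECORD_NAME = "api.example.com"          # placeholder record name

# Health check against the primary region's endpoint (placeholder host).
health = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-use1.example.com",
        "Port": 443,
        "ResourcePath": "/health",
    },
)

def failover_change(set_id, role, target, health_check_id=None):
    """Build an UPSERT for one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": set_id,
        "Failover": role,                          # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_change("use1", "PRIMARY", "api-use1.example.com",
                            health["HealthCheck"]["Id"]),
            failover_change("use2", "SECONDARY", "api-use2.example.com"),
        ]
    },
)
```

Because the records and health check exist before the incident, shifting traffic does not depend on the Route 53 API being reachable; the trade-off is that the failover condition and targets have to be decided, and tested, up front.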
-
🔴 If you were affected by today's AWS outage in the US-EAST-1 region, here's what went wrong:

The Domino Effect 🎯
Think of AWS services like a chain of dominoes. When one critical piece falls, it can knock down everything connected to it. That's exactly what happened last night.

What Triggered It? 🔍
Late last night, AWS's DNS system (think of it as the internet's phone book 📖) stopped working correctly for DynamoDB, a database service. Services couldn't find DynamoDB anymore.

The Cascade Begins ⚡
Here's where it got complicated:
▪️ EC2 (virtual servers 💻) relies on DynamoDB to launch new instances. When DynamoDB's address couldn't be found, new servers couldn't start.
▪️ Network Load Balancers couldn't perform health checks, leading to connectivity issues 🌐
▪️ This rippled out to Lambda, CloudWatch, and other services that depend on these core systems.

The Recovery 🛠️
AWS worked to restore services throughout the day:
▪️ Fixed the initial DNS issue
▪️ Resolved the Network Load Balancer problems
▪️ Temporarily slowed down some operations to prevent overwhelming the recovering systems
▪️ Full recovery by mid-afternoon ✅

The Takeaway 💡
This incident highlights why cloud architecture matters. A single point of failure in a foundational service can cascade through an entire ecosystem. It's also a reminder that multi-region deployments and disaster recovery plans aren't optional; they're essential.

#AWS #CloudComputing #DevOps #SiteReliability #TechOutage #TechNews #Infrastructure
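For teams downstream of a cascade like this, one defensive pattern at the application layer is to give database clients a tested escape hatch: bounded retries against the primary region, then a fallback to a replica region. The sketch below is a hedged illustration, not AWS guidance; it assumes the table is replicated (for example via DynamoDB Global Tables), and the table name, key schema, and regions are placeholders.

```python
# Sketch: bounded retries in the primary region, then fall back to a replica region.
# Assumes the table is replicated (e.g., DynamoDB Global Tables); names are placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

PRIMARY_REGION = "us-east-1"
FALLBACK_REGION = "us-east-2"
TABLE_NAME = "orders"  # placeholder table name

# Keep client-side retries bounded so a regional incident fails fast
# instead of queueing up blocked requests.
RETRY_CONFIG = Config(retries={"max_attempts": 3, "mode": "adaptive"})

def get_item(key: dict):
    """Try the primary region first, then the replica; return the item or None."""
    for region in (PRIMARY_REGION, FALLBACK_REGION):
        client = boto3.client("dynamodb", region_name=region, config=RETRY_CONFIG)
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (BotoCoreError, ClientError) as exc:
            # EndpointConnectionError (a BotoCoreError) covers "endpoint won't resolve".
            print(f"DynamoDB call failed in {region}: {exc}; trying next region")
    return None

# Example usage with a placeholder key schema:
# item = get_item({"order_id": {"S": "12345"}})
```

Note the trade-off: reads served from a replica region can be slightly stale, so this pattern fits read paths that tolerate eventual consistency.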
-
I keep being asked how Ably performed during the AWS us-east-1 outage this week. The bottom line is that there was no customer impact: no downtime, no errors and imperceptible impact on latency. Ably is hosted on AWS with services operating in multiple regions globally. Each region scales independently based on traffic, and us-east-1 is normally the busiest region. When AWS services in us-east-1 started failing, the Ably data plane kept operating as normal. The AWS disruption meant that we couldn’t add capacity in that region, so new connections were routed to us-east-2 instead; this is a routine intervention that we make in response to disruption in a region. Existing connections in us-east-1 stayed live, serving traffic with the same latency and zero errors. This was our globally-distributed system doing exactly what it was built to do. Read the full breakdown: https://lnkd.in/eeVsie4a
-
“…at around 1200 UTC we made DNS changes so that new connections were not routed to us-east-1; traffic that would have ordinarily been routed there (based on latency) were instead handled in us-east-2.” ☝️ This is a key tenet for high availability (if not fault tolerance): fail away from the incident. Of course, your system needs to be multi-region; you need to have tested it, to be confident in doing it, and to have a lightweight process in place to execute quickly. #AWS
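To make “fail away from the incident” concrete, here is a small illustrative sketch; it is not Ably's actual implementation. New connections normally go to the lowest-latency healthy region, and an operator (or an automated health signal) can mark a region as impaired so new traffic is steered elsewhere while existing connections are left alone. Region names and latency numbers are placeholders.

```python
# Sketch of "fail away from the incident": steer NEW connections off an impaired
# region while leaving existing connections untouched. All values are placeholders.
from dataclasses import dataclass, field

@dataclass
class RegionRouter:
    # Measured client latency per region, in milliseconds (placeholder numbers).
    latency_ms: dict = field(default_factory=lambda: {
        "us-east-1": 18.0,
        "us-east-2": 24.0,
        "eu-west-1": 85.0,
    })
    impaired: set = field(default_factory=set)

    def mark_impaired(self, region: str) -> None:
        """Incident response: stop routing new connections to this region."""
        self.impaired.add(region)

    def clear_impaired(self, region: str) -> None:
        """Recovery: allow new connections to this region again."""
        self.impaired.discard(region)

    def pick_region_for_new_connection(self) -> str:
        """Choose the lowest-latency region that is not marked impaired."""
        healthy = {r: ms for r, ms in self.latency_ms.items() if r not in self.impaired}
        if not healthy:
            raise RuntimeError("no healthy regions available")
        return min(healthy, key=healthy.get)

router = RegionRouter()
assert router.pick_region_for_new_connection() == "us-east-1"
router.mark_impaired("us-east-1")            # fail away from the incident
assert router.pick_region_for_new_connection() == "us-east-2"
```

The important property, echoed in the quote above, is that the change only affects where new connections land; sessions already established in the impaired region are not disturbed unless they fail on their own.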
-
Earlier this week, AWS suffered a major disruption: the US-EAST-1 region experienced “increased error rates and latencies” across multiple services, which cascaded into widespread outages for many popular platforms. Here are a few thoughts and take-aways:

✅ What Happened (in short)
The outage started early in the US-EAST-1 region, impacting the DNS subsystem and internal networking. Because AWS underpins so much of the internet's infrastructure, the ripple effects were enormous: major apps, services and websites across the globe experienced downtime or degraded performance. The incident has already triggered broader discussion about cloud dependency, concentration risk and resilience.

🎯 Why This Matters to Us
- Resilience in architecture matters: even best-in-class clouds can have large-scale disruptions. We must assume failure scenarios and have fallback/mitigation plans.
- Regional single-point risk: hosting everything in one region (US-EAST-1 in this case) increases exposure. Multi-region design (and cross-cloud, if feasible) improves resilience.
- Upstream dependencies: many businesses may think “we’re fine”, but if you rely on a provider (or on services hosted by them) that is affected, you are still vulnerable. The outage shows how interlinked the ecosystem is.
- Communication and expectations: during an incident, transparency from providers and timely communication with your stakeholders are key.
- Post-event review culture: learning from this incident is essential. The public “Post-Event Summaries” promised by AWS highlight that.
-
AWS DNS Outage Explained

This morning, AWS US-EAST-1 suffered a DNS resolution failure that took down key services like DynamoDB, authentication systems, and APIs. When AWS’s internal DNS couldn’t translate service endpoints, dependent apps simply couldn’t “find” their backends, which caused global timeouts and cascading failures. Even services outside AWS felt it, because they rely on APIs, data pipelines, or auth layers hosted in US-EAST-1. DNS is the nervous system of the internet: when it fails, everything depending on it feels the shock.

Root Cause:
> Internal AWS DNS resolvers in US-EAST-1 failed to route requests correctly.
> Dependent services (like DynamoDB, S3, and internal APIs) became unreachable.
> Clients with hardcoded region dependencies couldn’t automatically switch to a backup region when that region went down.

How do you mitigate this risk?
> Use multi-region DNS routing and consider a secondary DNS provider for cross-cloud redundancy.
> Design for region independence: no single region should be a single point of truth.
> Cache DNS results and credentials locally where safe, to reduce reliance on live lookups (see the sketch after this post).

Today’s outage reminded us that large distributed systems are incredibly complex, and every outage offers lessons in resilience and design. Build like everything can fail, because eventually it will.
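As a hedged illustration of the "cache where safe" point in the list above (not taken from the post), here is a minimal resolver wrapper that remembers the last addresses that resolved successfully and serves them when a live lookup fails. The hostname in the usage example is a placeholder, and stale answers can themselves be wrong, so this is a stopgap to ride out a resolver hiccup, not a substitute for multi-region failover.

```python
# Sketch: serve last-known-good DNS answers when live resolution fails.
# A stopgap for resolver outages; stale addresses may be wrong, so log loudly.
import socket
import time

class CachingResolver:
    def __init__(self):
        # host -> (monotonic timestamp of last success, list of addresses)
        self._cache: dict = {}

    def resolve(self, host: str, port: int = 443) -> list:
        now = time.monotonic()
        try:
            addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, port)})
            self._cache[host] = (now, addresses)   # refresh last-known-good entry
            return addresses
        except socket.gaierror:
            cached = self._cache.get(host)
            if cached is not None:
                stored_at, addresses = cached
                age = now - stored_at
                # Serve the stale answer, but make the degradation visible.
                print(f"DNS lookup for {host} failed; using cached answer ({age:.0f}s old)")
                return addresses
            raise  # nothing cached to fall back on

# Example usage with a placeholder hostname:
# resolver = CachingResolver()
# print(resolver.resolve("dynamodb.us-east-1.amazonaws.com"))
```

Credential caching follows the same idea but needs more care: expiry, secure storage, and a hard limit on how stale a cached credential may be before you refuse to use it.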
-
🚨 AWS Outage – Virginia (US-East-1) | Key Takeaways

Recently, AWS faced a major outage in its US-East-1 (Virginia) region, impacting several global platforms like Reddit, Snapchat, and Venmo. 🌍

🔹 Root Cause: An internal DNS resolution failure disrupted multiple AWS core services (EC2, S3, DynamoDB, Load Balancer).
🔹 Impact: Applications worldwide faced login failures, API errors, and downtime.
🔹 Reason: Many services depend on the US-East-1 region; once DNS failed, the impact cascaded rapidly.

💪 AWS Engineering Response:
- Activated backup DNS routes
- Throttled heavy services to reduce load
- Gradually restored functionality
- Cleared backlogs and validated health checks

🛡️ Prevention for the Future:
- Strengthen multi-region DNS redundancy
- Enhance automated failover mechanisms
- Improve real-time observability and alerting
- Conduct chaos testing for region-failure readiness (see the test sketch after this post)

📘 Lesson Learned: Even top-tier cloud providers face downtime; building resilient, multi-region architectures is key for business continuity.

#AWS #CloudComputing #Outage #DevOps #Reliability #Observability #CloudArchitecture #DNS #HighAvailability
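On the chaos-testing item above, here is a minimal pytest-style sketch of what a region-failure drill could look like inside a test suite. It is an assumption-heavy illustration: pick_endpoint stands in for your own failover logic, the endpoint names are placeholders, and the fault is injected by faking DNS resolution so the test never touches the network.

```python
# Sketch of a chaos-style test: simulate "the primary region's endpoint stops
# resolving" and assert the failover path is taken. pick_endpoint() is a
# hypothetical stand-in for real routing logic; hostnames are placeholders.
import socket

PRIMARY = "api.us-east-1.example.com"
SECONDARY = "api.us-east-2.example.com"

def pick_endpoint() -> str:
    """Hypothetical failover logic: use the primary if it resolves, else the secondary."""
    for host in (PRIMARY, SECONDARY):
        try:
            socket.getaddrinfo(host, 443)
            return host
        except socket.gaierror:
            continue
    raise RuntimeError("no endpoint resolvable")

def test_failover_when_primary_dns_breaks(monkeypatch):
    def fake_getaddrinfo(host, *args, **kwargs):
        if host == PRIMARY:
            raise socket.gaierror("simulated DNS failure")    # injected fault
        # Pretend every other host resolves to a single test address (TEST-NET-1).
        return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 443))]

    monkeypatch.setattr(socket, "getaddrinfo", fake_getaddrinfo)
    assert pick_endpoint() == SECONDARY
```

Running drills like this in CI keeps the failover path exercised continuously, so "we have a failover plan" stays a tested claim rather than a hope.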
-
Look, another major AWS outage, this time centered on the US-EAST-1 (North Virginia) region and lasting for a brutal 15 hours. It wasn't a cyberattack; it was the classic, dreaded "it's always DNS" problem. Specifically, a DNS resolution issue for the DynamoDB API endpoint triggered a massive cascading failure that took down countless services, from banking to gaming.

The main takeaways for anyone in the cloud space:
◾ US-EAST-1 is still the Achilles' heel: It's the oldest region, and too many "global" services still rely on its control plane. When it coughs, the whole internet catches a cold.
◾ Multi-Region is Not Just a Buzzword: If your critical architecture doesn't have a solid multi-region failover plan that you have actually tested, you just had a very expensive learning experience.
◾ The Black Box Risk: We saw countless SaaS platforms go down with zero visibility because their entire resilience strategy was trusting the cloud provider's default settings. You have to build resilience into your application layer (a small sketch of one such pattern follows this post).

It's a stark, annoying reminder that 100% uptime is a myth, and we need to stop designing as if it weren't. Reliance on a single point of failure, even a giant one, is a systemic risk we have to address now.

#AWS #CloudOutage #AWSDowntime #US_EAST_1 #CloudArchitecture #DevOps #ResilienceEngineering #AlwaysDNS
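On the application-layer resilience point in the last bullet, here is one generic pattern, offered as a hedged sketch rather than anything from the post: a small circuit breaker that fails fast (or serves a fallback) after repeated failures instead of hammering a dependency that is already struggling. The thresholds and fallback behaviour are illustrative assumptions.

```python
# Minimal circuit-breaker sketch: after repeated failures, stop calling the
# struggling dependency for a while and serve a fallback instead.
# Thresholds are illustrative; real code would track state per dependency.
import time
from typing import Callable

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when the breaker opened

    def call(self, func: Callable, fallback: Callable):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            return fallback()                  # breaker open: skip the dependency
        try:
            result = func()                    # closed, or half-open trial call
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now           # (re)open the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None                  # success: close the breaker
        return result

# Example usage with hypothetical callables:
# breaker = CircuitBreaker()
# value = breaker.call(lambda: fetch_from_primary(), lambda: serve_cached_copy())
```

The point is not this exact code but the property it gives you: when a foundational service misbehaves, your application degrades in a way you chose and tested, instead of inheriting the provider's failure mode wholesale.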
-
I think AWS consistently does RCAs well, so I've been waiting for their official summary of the recent outage to come out. It's a good read and can be found below: https://lnkd.in/eEuVjnXt

When you consider the breadth, scope and scale of all the services that AWS provides, I think it's amazing that large-scale, customer-facing impacts happen as infrequently as they do. And kudos to AWS for delivering this sort of open, thorough after-action report.

My takeaway goes back to the basics. This is a reminder to plan appropriately for failure. I understand that simply being "multi-region" in this case *may* not have prevented an impact. It's impossible to plan for every possible scenario; that's a fool's errand. But it *is* possible to identify the Availability, Disaster Recovery and Business Continuity requirements of your application(s) and make thoughtful, cost/risk-balanced decisions about tech stack, architecture and deployment.

It's easy to point the finger at your CSP when they screw up or make a poor architectural decision (hindsight is always 20/20), but... now is also the time to take a look at your own decisions around your application. "Everything fails all the time." We now have a new data point for potential failure. Do you have a new plan?