When AWS goes down, the internet feels it. 🌐 The recent AWS outage proved that even the most reliable clouds can crumble under hidden dependencies. The disruption started in US-EAST-1 but spread globally, impacting services like Slack, Zoom, and Atlassian.

Our latest blog explores the outage’s root cause and key lessons for infrastructure architects:
- Avoid single-region control planes
- Separate control and data planes
- Design for true multi-region, active-active operation
- Continuously test failure scenarios

Learn how Wallarm’s Security Edge already applies these principles to stay resilient when providers stumble.
👉 Read the full breakdown: https://lnkd.in/dn8jGHnu
#APIsecurity #CloudSecurity #Wallarm #AppSec #Resilience #AWS
AWS outage: lessons for infrastructure architects
More Relevant Posts
-
🔌 Takeaway from the recent Amazon Web Services (AWS) outage

🔹 Earlier this week, AWS reported an incident in its US-EAST-1 region: from ~11:49 PM PDT on October 19 to ~2:24 AM PDT on October 20, a DNS-resolution issue for the Amazon DynamoDB service endpoints led to elevated error rates across a number of services. In other words: a core database API endpoint couldn’t be resolved correctly, triggering cascading failures and widespread disruption well beyond a single service.

🔹 Why it matters
Even the largest, most robust cloud infrastructures are vulnerable when a foundational piece (like DNS) fails. It highlights that our modern digital architecture is built in layers, and a breakdown deep in the stack affects everything above it. For organizations relying on cloud services, it’s a reminder that designing for resilience means accounting for unexpected failures, even when using best-in-class providers. As one expert put it: “Failures increasingly trace to integrity … until we better understand and protect integrity, our total focus on uptime is an illusion.”

🔹 What we can do as professionals
- Re-evaluate critical dependencies: is your architecture too tightly coupled to one provider, region or service?
- Adopt multi-region, multi-endpoint failover strategies for critical services.
- Ensure you have observability not just into service health, but into infrastructural components like DNS, networking and the resolution path (a minimal probe sketch follows this post).
- Run regular failure-mode drills (e.g., “What happens if our primary DB endpoint cannot be resolved?”) and treat them like security/fire drills.
- Communicate transparently with stakeholders when incidents happen: outages are never just “tech issues”; they impact operations, brand trust and user experience.

🧠 Final thought
Outages are never convenient, but they are inevitable in complex distributed systems. What differentiates high-performing organisations is not that they never fail, but that they anticipate weakness, build in resilient countermeasures, and respond and learn swiftly. Let’s treat this AWS outage not just as a headline, but as a learning opportunity: to strengthen our infrastructure, improve our incident readiness, and sharpen our commitment to reliability.

Would love to hear your thoughts:
🔍 How do you approach dependency risk in your cloud architecture?
⚙️ What failure drills have you found most revealing in your organisation?

#CloudComputing #AWS #ReliabilityEngineering #Resilience #IncidentManagement #DynamoDB
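The list above mentions DNS-level observability and failure-mode drills; here is a minimal sketch of what a resolution probe for critical endpoints could look like. It is not from the original post: the endpoint list and the print-based alert are placeholder assumptions, and in practice the probe would feed your real monitoring and alerting.

```python
# Minimal DNS-resolution probe for critical dependencies (illustrative sketch).
# The endpoint names are examples; substitute the hosts your system depends on.
import socket
from datetime import datetime, timezone

CRITICAL_ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",   # primary data store (example)
    "dynamodb.us-east-2.amazonaws.com",   # failover region (example)
    "sts.amazonaws.com",                  # credentials/auth (example)
]

def resolves(host: str) -> bool:
    """Return True if the host currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, 443)) > 0
    except socket.gaierror:
        return False

def run_probe() -> list:
    """Check every critical endpoint and return the ones that failed to resolve."""
    failed = [host for host in CRITICAL_ENDPOINTS if not resolves(host)]
    stamp = datetime.now(timezone.utc).isoformat()
    for host in failed:
        # Placeholder alert: replace with your paging/metrics integration.
        print(f"[{stamp}] ALERT: {host} is not resolving")
    return failed

if __name__ == "__main__":
    run_probe()
```

Run something like this on a schedule (cron, a job in a different region, or your existing monitoring agent) so a resolution failure in one dependency shows up as an explicit signal rather than as mysterious application errors.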
-
Everyone talks about the October AWS outage, but one detail is often overlooked: even a “multi-region” AWS setup can fail if your control plane lives in a single region, and for several global AWS services it does. AWS’s own docs confirm it: Route 53’s control plane (record updates, health checks, APIs) is hosted entirely in us-east-1; only the DNS query-serving network is globally distributed. You can see this in the AWS Console: Route 53 doesn’t ask for a region, and S3 defaults to us-east-1. If us-east-1 stalls, you can’t update records or health checks. https://lnkd.in/eCRtRBzG
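The practical implication, not stated explicitly above, is that any failover you want to survive a us-east-1 control-plane outage has to be provisioned in advance, so the decision at incident time is made by Route 53’s globally distributed data plane (health checks evaluating records that already exist) rather than by API calls you may not be able to make. Below is a hedged boto3 sketch of that pre-provisioning; the hosted zone ID, record name, and endpoints are placeholders, and this is one illustrative pattern rather than the only way to do it.

```python
# Sketch: pre-provision Route 53 failover records ahead of time (placeholder values).
# At incident time the failover is driven by health checks on the DNS data plane,
# not by control-plane API calls that may be unavailable if us-east-1 is impaired.
import uuid
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "ZEXAMPLE123"           # placeholder hosted zone
RECORD_NAME = "api.example.com"          # placeholder record name

# Health check against the primary region's endpoint (placeholder host).
health = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-use1.example.com",
        "Port": 443,
        "ResourcePath": "/health",
    },
)

def failover_change(set_id, role, target, health_check_id=None):
    """Build an UPSERT for one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": set_id,
        "Failover": role,                          # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_change("use1", "PRIMARY", "api-use1.example.com",
                            health["HealthCheck"]["Id"]),
            failover_change("use2", "SECONDARY", "api-use2.example.com"),
        ]
    },
)
```

Because the records and health check exist before the incident, shifting traffic does not depend on the Route 53 API being reachable; the trade-off is that the failover condition and targets have to be decided, and tested, up front.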
-
🔴 If you were affected by today's AWS outage in the US-EAST-1 region, here's what went wrong:

The Domino Effect 🎯
Think of AWS services like a chain of dominoes. When one critical piece falls, it can knock down everything connected to it. That's exactly what happened last night.

What Triggered It? 🔍
Late last night, AWS's DNS system (think of it as the internet's phone book 📖) stopped working correctly for DynamoDB, a database service. Services couldn't find DynamoDB anymore.

The Cascade Begins ⚡
Here's where it got complicated:
▪️ EC2 (virtual servers 💻) relies on DynamoDB to launch new instances. When DynamoDB's address couldn't be found, new servers couldn't start.
▪️ Network Load Balancers couldn't perform health checks, leading to connectivity issues 🌐
▪️ This rippled out to Lambda, CloudWatch, and other services that depend on these core systems.

The Recovery 🛠️
AWS worked to restore services throughout the day:
▪️ Fixed the initial DNS issue
▪️ Resolved the Network Load Balancer problems
▪️ Temporarily slowed down some operations to prevent overwhelming the recovering systems
▪️ Full recovery by mid-afternoon ✅

The Takeaway 💡
This incident highlights why cloud architecture matters. A single point of failure in a foundational service can cascade through an entire ecosystem. It's also a reminder that multi-region deployments and disaster recovery plans aren't optional; they're essential.

#AWS #CloudComputing #DevOps #SiteReliability #TechOutage #TechNews #Infrastructure
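For teams downstream of a cascade like this, one defensive pattern at the application layer is to give database clients a tested escape hatch: bounded retries against the primary region, then a fallback to a replica region. The sketch below is a hedged illustration, not AWS guidance; it assumes the table is replicated (for example via DynamoDB Global Tables), and the table name, key schema, and regions are placeholders.

```python
# Sketch: bounded retries in the primary region, then fall back to a replica region.
# Assumes the table is replicated (e.g., DynamoDB Global Tables); names are placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

PRIMARY_REGION = "us-east-1"
FALLBACK_REGION = "us-east-2"
TABLE_NAME = "orders"  # placeholder table name

# Keep client-side retries bounded so a regional incident fails fast
# instead of queueing up blocked requests.
RETRY_CONFIG = Config(retries={"max_attempts": 3, "mode": "adaptive"})

def get_item(key: dict):
    """Try the primary region first, then the replica; return the item or None."""
    for region in (PRIMARY_REGION, FALLBACK_REGION):
        client = boto3.client("dynamodb", region_name=region, config=RETRY_CONFIG)
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (BotoCoreError, ClientError) as exc:
            # EndpointConnectionError (a BotoCoreError) covers "endpoint won't resolve".
            print(f"DynamoDB call failed in {region}: {exc}; trying next region")
    return None

# Example usage with a placeholder key schema:
# item = get_item({"order_id": {"S": "12345"}})
```

Note the trade-off: reads served from a replica region can be slightly stale, so this pattern fits read paths that tolerate eventual consistency.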
-
I keep being asked how Ably performed during the AWS us-east-1 outage this week. The bottom line is that there was no customer impact: no downtime, no errors and imperceptible impact on latency. Ably is hosted on AWS with services operating in multiple regions globally. Each region scales independently based on traffic, and us-east-1 is normally the busiest region. When AWS services in us-east-1 started failing, the Ably data plane kept operating as normal. The AWS disruption meant that we couldn’t add capacity in that region, so new connections were routed to us-east-2 instead; this is a routine intervention that we make in response to disruption in a region. Existing connections in us-east-1 stayed live, serving traffic with the same latency and zero errors. This was our globally-distributed system doing exactly what it was built to do. Read the full breakdown: https://lnkd.in/eeVsie4a
-
“…at around 1200 UTC we made DNS changes so that new connections were not routed to us-east-1; traffic that would have ordinarily been routed there (based on latency) were instead handled in us-east-2.” ☝️ This is a key tenet for high availability (if not fault tolerance): fail away from the incident. Of course, your system needs to be multi-region; you need to have tested it, to be confident in doing it, and to have a lightweight process in place to execute quickly. #AWS
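To make “fail away from the incident” concrete, here is a small illustrative sketch; it is not Ably's actual implementation. New connections normally go to the lowest-latency healthy region, and an operator (or an automated health signal) can mark a region as impaired so new traffic is steered elsewhere while existing connections are left alone. Region names and latency numbers are placeholders.

```python
# Sketch of "fail away from the incident": steer NEW connections off an impaired
# region while leaving existing connections untouched. All values are placeholders.
from dataclasses import dataclass, field

@dataclass
class RegionRouter:
    # Measured client latency per region, in milliseconds (placeholder numbers).
    latency_ms: dict = field(default_factory=lambda: {
        "us-east-1": 18.0,
        "us-east-2": 24.0,
        "eu-west-1": 85.0,
    })
    impaired: set = field(default_factory=set)

    def mark_impaired(self, region: str) -> None:
        """Incident response: stop routing new connections to this region."""
        self.impaired.add(region)

    def clear_impaired(self, region: str) -> None:
        """Recovery: allow new connections to this region again."""
        self.impaired.discard(region)

    def pick_region_for_new_connection(self) -> str:
        """Choose the lowest-latency region that is not marked impaired."""
        healthy = {r: ms for r, ms in self.latency_ms.items() if r not in self.impaired}
        if not healthy:
            raise RuntimeError("no healthy regions available")
        return min(healthy, key=healthy.get)

router = RegionRouter()
assert router.pick_region_for_new_connection() == "us-east-1"
router.mark_impaired("us-east-1")            # fail away from the incident
assert router.pick_region_for_new_connection() == "us-east-2"
```

The important property, echoed in the quote above, is that the change only affects where new connections land; sessions already established in the impaired region are not disturbed unless they fail on their own.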
-
Earlier this week, AWS suffered a major disruption: the US-EAST-1 region experienced “increased error rates and latencies” across multiple services, which cascaded into widespread outages for many popular platforms. Here are a few thoughts and take-aways:

✅ What Happened (in short)
The outage started early in the US-EAST-1 region, impacting the DNS subsystem and internal networking. Because AWS underpins so much of the internet's infrastructure, the ripple effects were enormous: major apps, services and websites across the globe experienced downtime or degraded performance. The incident has already triggered broader discussion about cloud dependency, concentration risk and resilience.

🎯 Why This Matters to Us
- Resilience in architecture matters: even best-in-class clouds can have large-scale disruptions. We must assume failure scenarios and have fallback/mitigation plans.
- Regional single-point risk: hosting everything in one region (US-EAST-1 in this case) increases exposure. Multi-region design (and cross-cloud, if feasible) improves resilience.
- Upstream dependencies: many businesses may think “we’re fine”, but if you rely on a provider (or on services hosted by them) that is affected, you are still vulnerable. The outage shows how interlinked the ecosystem is.
- Communication and expectations: during an incident, transparency from providers and timely communication with your stakeholders are key.
- Post-event review culture: learning from this incident is essential. The public “Post-Event Summaries” promised by AWS highlight that.
-
AWS DNS Outage Explained

This morning, AWS US-EAST-1 suffered a DNS resolution failure that took down key services like DynamoDB, authentication systems, and APIs. When AWS’s internal DNS couldn’t translate service endpoints, dependent apps simply couldn’t “find” their backends, which caused global timeouts and cascading failures. Even services outside AWS felt it, because they rely on APIs, data pipelines, or auth layers hosted in US-EAST-1. DNS is the nervous system of the internet: when it fails, everything depending on it feels the shock.

Root Cause:
> Internal AWS DNS resolvers in US-EAST-1 failed to route requests correctly.
> Dependent services (like DynamoDB, S3, and internal APIs) became unreachable.
> Clients with hardcoded region dependencies couldn’t automatically switch to a backup region when that region went down.

How do you mitigate this risk?
> Use multi-region DNS routing and consider a secondary DNS provider for cross-cloud redundancy.
> Design for region independence: no single region should be a single point of truth.
> Cache DNS results and credentials locally where safe, to reduce reliance on live lookups (see the sketch after this post).

Today’s outage reminded us that large distributed systems are incredibly complex, and every outage offers lessons in resilience and design. Build like everything can fail, because eventually it will.
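As a hedged illustration of the "cache where safe" point in the list above (not taken from the post), here is a minimal resolver wrapper that remembers the last addresses that resolved successfully and serves them when a live lookup fails. The hostname in the usage example is a placeholder, and stale answers can themselves be wrong, so this is a stopgap to ride out a resolver hiccup, not a substitute for multi-region failover.

```python
# Sketch: serve last-known-good DNS answers when live resolution fails.
# A stopgap for resolver outages; stale addresses may be wrong, so log loudly.
import socket
import time

class CachingResolver:
    def __init__(self):
        # host -> (monotonic timestamp of last success, list of addresses)
        self._cache: dict = {}

    def resolve(self, host: str, port: int = 443) -> list:
        now = time.monotonic()
        try:
            addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, port)})
            self._cache[host] = (now, addresses)   # refresh last-known-good entry
            return addresses
        except socket.gaierror:
            cached = self._cache.get(host)
            if cached is not None:
                stored_at, addresses = cached
                age = now - stored_at
                # Serve the stale answer, but make the degradation visible.
                print(f"DNS lookup for {host} failed; using cached answer ({age:.0f}s old)")
                return addresses
            raise  # nothing cached to fall back on

# Example usage with a placeholder hostname:
# resolver = CachingResolver()
# print(resolver.resolve("dynamodb.us-east-1.amazonaws.com"))
```

Credential caching follows the same idea but needs more care: expiry, secure storage, and a hard limit on how stale a cached credential may be before you refuse to use it.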
-
🚨 AWS Outage – Virginia (US-East-1) | Key Takeaways

Recently, AWS faced a major outage in its US-East-1 (Virginia) region, impacting several global platforms like Reddit, Snapchat, and Venmo. 🌍

🔹 Root Cause: An internal DNS resolution failure disrupted multiple AWS core services (EC2, S3, DynamoDB, Load Balancer).
🔹 Impact: Applications worldwide faced login failures, API errors, and downtime.
🔹 Reason: Many services depend on the US-East-1 region; once DNS failed, the impact cascaded rapidly.

💪 AWS Engineering Response:
- Activated backup DNS routes
- Throttled heavy services to reduce load
- Gradually restored functionality
- Cleared backlogs and validated health checks

🛡️ Prevention for the Future:
- Strengthen multi-region DNS redundancy
- Enhance automated failover mechanisms
- Improve real-time observability and alerting
- Conduct chaos testing for region-failure readiness (see the test sketch after this post)

📘 Lesson Learned: Even top-tier cloud providers face downtime; building resilient, multi-region architectures is key for business continuity.

#AWS #CloudComputing #Outage #DevOps #Reliability #Observability #CloudArchitecture #DNS #HighAvailability
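On the chaos-testing item above, here is a minimal pytest-style sketch of what a region-failure drill could look like inside a test suite. It is an assumption-heavy illustration: pick_endpoint stands in for your own failover logic, the endpoint names are placeholders, and the fault is injected by faking DNS resolution so the test never touches the network.

```python
# Sketch of a chaos-style test: simulate "the primary region's endpoint stops
# resolving" and assert the failover path is taken. pick_endpoint() is a
# hypothetical stand-in for real routing logic; hostnames are placeholders.
import socket

PRIMARY = "api.us-east-1.example.com"
SECONDARY = "api.us-east-2.example.com"

def pick_endpoint() -> str:
    """Hypothetical failover logic: use the primary if it resolves, else the secondary."""
    for host in (PRIMARY, SECONDARY):
        try:
            socket.getaddrinfo(host, 443)
            return host
        except socket.gaierror:
            continue
    raise RuntimeError("no endpoint resolvable")

def test_failover_when_primary_dns_breaks(monkeypatch):
    def fake_getaddrinfo(host, *args, **kwargs):
        if host == PRIMARY:
            raise socket.gaierror("simulated DNS failure")    # injected fault
        # Pretend every other host resolves to a single test address (TEST-NET-1).
        return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 443))]

    monkeypatch.setattr(socket, "getaddrinfo", fake_getaddrinfo)
    assert pick_endpoint() == SECONDARY
```

Running drills like this in CI keeps the failover path exercised continuously, so "we have a failover plan" stays a tested claim rather than a hope.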
-
Look, another major AWS outage, this time centered on the US-EAST-1 (North Virginia) region and lasting for a brutal 15 hours. It wasn't a cyberattack; it was the classic, dreaded "it's always DNS" problem. Specifically, a DNS resolution issue for the DynamoDB API endpoint triggered a massive cascading failure that took down countless services, from banking to gaming.

The main takeaways for anyone in the cloud space:
◾ US-EAST-1 is still the Achilles' heel: It's the oldest region, and too many "global" services still rely on its control plane. When it coughs, the whole internet catches a cold.
◾ Multi-Region is Not Just a Buzzword: If your critical architecture doesn't have a solid multi-region failover plan that you have actually tested, you just had a very expensive learning experience.
◾ The Black Box Risk: We saw countless SaaS platforms go down with zero visibility because their entire resilience strategy was trusting the cloud provider's default settings. You have to build resilience into your application layer (a small sketch of one such pattern follows this post).

It's a stark, annoying reminder that 100% uptime is a myth, and we need to stop designing as if it weren't. Reliance on a single point of failure, even a giant one, is a systemic risk we have to address now.

#AWS #CloudOutage #AWSDowntime #US_EAST_1 #CloudArchitecture #DevOps #ResilienceEngineering #AlwaysDNS
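On the application-layer resilience point in the last bullet, here is one generic pattern, offered as a hedged sketch rather than anything from the post: a small circuit breaker that fails fast (or serves a fallback) after repeated failures instead of hammering a dependency that is already struggling. The thresholds and fallback behaviour are illustrative assumptions.

```python
# Minimal circuit-breaker sketch: after repeated failures, stop calling the
# struggling dependency for a while and serve a fallback instead.
# Thresholds are illustrative; real code would track state per dependency.
import time
from typing import Callable

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when the breaker opened

    def call(self, func: Callable, fallback: Callable):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            return fallback()                  # breaker open: skip the dependency
        try:
            result = func()                    # closed, or half-open trial call
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now           # (re)open the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None                  # success: close the breaker
        return result

# Example usage with hypothetical callables:
# breaker = CircuitBreaker()
# value = breaker.call(lambda: fetch_from_primary(), lambda: serve_cached_copy())
```

The point is not this exact code but the property it gives you: when a foundational service misbehaves, your application degrades in a way you chose and tested, instead of inheriting the provider's failure mode wholesale.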
-
I think AWS consistently does RCAs well, so I've been waiting for their official summary of the recent outage to come out. It's a good read and can be found below: https://lnkd.in/eEuVjnXt

When you consider the breadth, scope and scale of all the services that AWS provides, I think it's amazing that large-scale, customer-facing impacts happen as infrequently as they do. And kudos to AWS for delivering this sort of open, thorough after-action report.

My takeaway goes back to the basics. This is a reminder to plan appropriately for failure. I understand that simply being "multi-region" in this case *may* not have prevented an impact. It's impossible to plan for every possible scenario; that's a fool's errand. But it *is* possible to identify the Availability, Disaster Recovery and Business Continuity requirements of your application(s) and make thoughtful, cost/risk-balanced decisions about tech stack, architecture and deployment.

It's easy to point the finger at your CSP when they screw up or make a poor architectural decision (hindsight is always 20/20), but... now is also the time to take a look at your own decisions around your application. "Everything fails all the time." We now have a new data point for potential failure. Do you have a new plan?