How to build resilient cloud systems: lessons from failures


Ten Mile Square Technologies, LLC

2,548 followers

1mo

Architecting and building technology systems for resiliency is critical when you rely on public cloud infrastructure. Ryan's article looks at what happens when things go wrong and what you can do to keep your systems operational. #cloud #technology #resilience #architecture

Ryan Van Fleet

Problem Solver - MOWG22 #menofwarsociety

1mo

Who was impacted by the AWS US-East-1 incident yesterday? Take a look at how you can protect your business with multi-region design. https://lnkd.in/eYpPrjcz

How to Protect Your Business from the Next AWS Outage with Multi-Region Design - Ten Mile Square Technologies https://tenmilesquare.com


More Relevant Posts

Darren Kirby - MBA

Fractional CIO/CISO | AI, Technology & Cyber Leadership | FTSE 100, VC/PE-Backed Experience - IMMEDIATELY AVAILABLE
1mo
A major AWS outage, triggered by a DNS failure in the US East region, rippled across the globe, bringing down or severely impacting services for some of the world’s largest brands and platforms. Major companies and applications, including Snapchat, Fortnite, Roblox, and Reddit, as well as numerous financial institutions and government services, experienced extended downtime or disruption.

💡 Key Learnings from the AWS US-EAST-1 Outage: Resilient Architectures Require Deep Control Plane Awareness!

The AWS outage was more than a local glitch; it was a global wake-up call. Despite multi-region deployments, critical control plane dependencies (DNS, authentication, service configuration) left even “resilient” systems exposed.

🔑 Key Takeaways:
- Audit your control plane: Find and eliminate single-region dependencies for DNS, authentication, and configuration.
- Embrace active-active or warm standby across regions and providers for critical workloads; don’t settle for local failover.
- Simulate control plane failures in your recovery drills: Go beyond data plane outages; your team should know how to recover if region-level APIs or core routing are unavailable.
- Reduce “blast radius”: Decouple systems, distribute trust, and avoid relying on any one AWS region for foundational services.
- Keep architecture reviews fresh: Evolving scale and complexity mean resilience must be re-evaluated and re-tested often.

Resilience isn’t a checkbox; it’s a culture of constant vigilance, design, and practice.

#cloudarchitecture #AWS #resilience #disasterrecovery #infosec #CIO #TechLeadership
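As a rough illustration of the "audit your control plane" takeaway, the sketch below flags dependencies that only one region can serve. It is a minimal example under stated assumptions: the `Dependency` record, the `single_region_dependencies` helper, and the sample inventory are all hypothetical and not taken from the post or from any AWS API. A real audit would build the inventory from your own infrastructure data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """Hypothetical inventory record for one control plane dependency."""
    name: str
    kind: str       # e.g. "dns", "auth", "config"
    regions: tuple  # regions that are able to serve this dependency


def single_region_dependencies(deps):
    """Return the dependencies that exactly one region can serve."""
    return [d for d in deps if len(set(d.regions)) == 1]


# Illustrative inventory; names and regions are made up for the example.
inventory = [
    Dependency("login-tokens", "auth", ("us-east-1",)),
    Dependency("feature-flags", "config", ("us-east-1", "us-west-2")),
    Dependency("public-dns", "dns", ("us-east-1",)),
]

for dep in single_region_dependencies(inventory):
    print(f"{dep.kind}: {dep.name} depends only on {dep.regions[0]}")
```

Keeping the check as a pure function over an inventory makes it easy to run in a scheduled review, which fits the advice to re-evaluate resilience often.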
Mignon Edorh, PMP, CISSP

Author of Forward with Grit and Grace | Digital Transformation Lead | Women in Tech advocate | Emerging Technologies Enthusiast | Continuous Learner | Speaker
1mo
The Oct 20th AWS outage got me scratching my head 🤔 As a heavy AWS user, I couldn’t help but pause and reflect. When a region as critical as AWS US-East-1 goes down, it reminds us that even the most resilient cloud platforms can and will fail. Private data centers are not immune either.

AWS communicated the root cause of the outage as DNS resolution issues for the regional DynamoDB service endpoints, which triggered a cascading effect on many AWS services, including Network Load Balancer failures that led to EC2 launch failures and more. The outage impacted many customers’ applications, including Snapchat.

This outage was another real-world test of Murphy’s Law: anything that can go wrong, will go wrong. That’s why a failover strategy is not optional; it is essential to a business’s viability.

I understand that every organization faces trade-off decisions due to limited funds, and not all organizations can afford multi-region “always active” or even “passive” site architectures. But if your critical workloads rely on any IT technology, it’s highly recommended to at least have a plan B of some sort, ready to take over based on your business need. To identify your business need, talk with your IT department and communicate the maximum downtime you can tolerate and your data recovery needs. This is your opportunity for an honest conversation about your reasonable business uptime needs, so that your IT people can build a resilient architecture that sustains your business in case of technology failure.

In this case, customers who had failover configured to US-East-2 (Ohio) or any region other than US-EAST-1 were able to pivot with minimal disruption. The key is balance. Of course, the farther away your failover region, the more expensive your data replication will be, but it doesn’t have to be a cross-coast failover; even a nearby region can make all the difference.

The lesson is to always remember that IT will fail. Systems will fail. What matters is how prepared we are when it happens. We need to always assume failure will occur. We need to always design for resilience, and TEST the strategy. And lastly, always, always have a plan B.

#CloudArchitecture #Resilience #DisasterRecovery #Leadership #TechnologyStrategy
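The "plan B" idea above, pivoting to a nearby region when the primary is down, can be sketched as a priority-ordered region picker. This is only an illustration: `pick_region`, the `REGIONS` list, and the simulated `outage_health` check are hypothetical names, and a real system would replace the health function with actual probes.

```python
def pick_region(preferred_regions, is_healthy):
    """Return the first healthy region in priority order, or None."""
    for region in preferred_regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical priority list: primary first, nearby failover second.
REGIONS = ["us-east-1", "us-east-2", "us-west-2"]


def outage_health(region):
    """Simulated health check: us-east-1 is down, everything else is up."""
    return region != "us-east-1"


print(pick_region(REGIONS, outage_health))  # prints us-east-2
```

Returning `None` when nothing is healthy forces the caller to decide what "no region available" means for the business, which is the kind of conversation the post asks for.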
Chanse Cunningham

Principal Consultant @ Improving | Cloud Architect, Engineering
1mo
The AWS outage this morning is still ongoing and is happening in us-east-1, AWS's oldest and busiest region.

This is a good moment to reflect on the business impact of downtime: lost revenue, delayed roadmap implementation, frustrated customers, compliance risks, reputational damage. Even short outages have a ripple effect across multiple areas of the business.

Equally important is the recovery impact. Consider the effort, time, and cost required to restore systems, reconcile data, and communicate with stakeholders. Recovery is part of the total business impact of downtime, and it stretches well beyond the outage window.

Many teams design for high availability inside a single region, but true resilience accounts for regional failure. At Improving, we help teams architect systems that stay up when a region goes down. Resilience doesn't happen by luck; it happens by design, planning, and testing.

Wallarm: API Security Leader

23,786 followers
1mo
When AWS goes down, the internet feels it. 🌐

The recent AWS outage proved that even the most reliable clouds can crumble under hidden dependencies. The disruption started in US-EAST-1 but spread globally, impacting services like Slack, Zoom, and Atlassian.

Our latest blog explores the outage’s root cause and key lessons for infrastructure architects:
- Avoid single-region control planes
- Separate control and data planes
- Design for true multi-region, active-active operation
- Continuously test failure scenarios

Learn how Wallarm’s Security Edge already applies these principles to stay resilient when providers stumble.

👉 Read the full breakdown: https://lnkd.in/dn8jGHnu

#APIsecurity #CloudSecurity #Wallarm #AppSec #Resilience #AWS

AWS Outage: Lessons Learned — API Security lab.wallarm.com
Sai Mohit Kumar

Jack of All Trades | Astra | Caffeine Holic | Business Growth Hacker | Automating Digital Landscape | UI/UX Designer, Cloud Security, Security Analyst, Bug Bounty Hunter, Tech Enthusiast, Brand & Social Media Strategist
1mo
AWS US-East-1 Outage: A Case Study in Over-Reliance and Under-Design

When AWS East-1 blinked today, half the internet flinched. After my Snapchat took a hit today, it became clear again: the cloud isn’t magic. It’s architecture. And even the best can break. That’s not an AWS problem; it’s an architecture problem. Let’s unpack why this matters 👇

💡 1. The Illusion of “High Availability”
Most companies say multi-AZ and call it resilience. But the control plane, DNS (Route 53), IAM tokens, and regional API gateways often still flow through a single dependency chain, usually East-1. When that control plane slows, your entire “redundant” design becomes a single point of failure. Most teams don’t realize how deeply tied their services are to East-1: IAM, Route 53, STS, and CloudFront are invisible threads that all snap together.

🧠 2. Resilience Lives in Design, Not Deployment
Running in two Availability Zones isn’t a DR plan. True resilience needs cross-region replication, DNS failover, automated backups, and service-mesh awareness. If you can’t simulate a region outage without panic, your DR plan is theory, not practice.

🔍 3. Visibility and Observability Are the First Lines of Defense
You can’t fix what you can’t see. Centralized logging (CloudWatch, OpenTelemetry), synthetic health checks, and chaos-testing pipelines should be part of your CI/CD lifecycle, not just your post-mortems.

🧩 4. Shared Responsibility = Shared Accountability
AWS guarantees infrastructure. You guarantee availability. That means designing for graceful degradation, not perfect uptime.

⚙️ The Takeaway: The cloud never fails; our assumptions do. Every outage is a free rehearsal for the next one. Use it to measure what your system can survive, not just what it can deliver.

#AWS #CloudComputing #DevOps #ResilienceEngineering #CyberSecurity #SRE #CloudArchitecture #AWSUSEast1 #DisasterRecovery #EngineeringLeadership
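One way to picture the synthetic health checks mentioned in point 3 is a small tracker that marks a target unhealthy only after several consecutive failed probes, so a single blip does not trigger failover. This is a sketch under assumptions: `HealthTracker`, its `unhealthy_after` threshold, and the probe callables are hypothetical, and the probes here are stand-ins for real HTTP or DNS checks.

```python
class HealthTracker:
    """Tracks consecutive probe failures for one endpoint or region."""

    def __init__(self, unhealthy_after=3):
        self.unhealthy_after = unhealthy_after
        self.consecutive_failures = 0

    @property
    def healthy(self):
        return self.consecutive_failures < self.unhealthy_after

    def record(self, probe):
        """Run one probe; an exception or a falsy result counts as a failure."""
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.healthy


def broken_probe():
    # Stand-in for a real check that cannot reach its target.
    raise ConnectionError("endpoint unreachable")


tracker = HealthTracker(unhealthy_after=2)
tracker.record(broken_probe)   # first failure, still considered healthy
tracker.record(broken_probe)   # second consecutive failure, now unhealthy
print(tracker.healthy)
```

A successful probe resets the counter, so recovery is detected on the next good check.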
John Reister

Founder @ GoPowerEV ⚡️ | Turning Multifamily Properties into Virtual Power Plants
4w
If Monday’s AWS outage didn’t make you rethink your “multi-region” setup, it should have. Half the internet went dark. GoPowerEV didn't, because our systems were built for this kind of failure.

We've deployed a hybrid, multi-cloud, local architecture with automated failover and regional isolation, so when the centralized DNS failure hit AWS, our mission-critical services chugged on as if nothing happened. Our customers continued to charge without a hitch. For most companies, this wasn't the case.

Here are 3 lessons every tech leader should take from Monday's outage:

→ Don't confuse multi-region with resilience. "Multi-region" setups often still rely on a single global point (often US-EAST-1) for authentication or configuration. True resilience requires isolation and redundancy across different failure domains.

→ Architecture is your ONLY insurance policy. Unfortunately, you can't patch your way out of a centralized design flaw. You must architect for failure, not for perfection. That means continuous investment in Chaos Engineering and battle-testing your recovery playbooks before the alarm rings.

→ The vendor is a partner, not a solution. AWS is best-in-class, but our job as technology leaders is to eliminate single points of failure, even if that means abstracting the dependency on a single provider.

Outages expose design choices. What did last week reveal about yours?
3R Resilience HUB

29 followers
1mo
🌐 Reflections on the AWS Outage: A Lesson in Resilience and Integrity

Last night’s AWS outage reminded the entire tech industry that no system, no matter how advanced or globally distributed, is immune to failure. A simple Domain Name System (DNS) issue cascaded across the internet, disrupting major platforms, financial services, and millions of users worldwide.

As professionals in data center and cloud operations, we often emphasize uptime, redundancy, and automation. Yet this event teaches us something deeper: resilience is not built by technology alone; it’s reinforced by culture, integrity, and preparedness.

🔹 Key Lessons:
1️⃣ Architect for failure: design systems with multi-region redundancy and hybrid-cloud backup strategies.
2️⃣ Strengthen DNS governance: implement diverse DNS providers and health checks to prevent single points of failure.
3️⃣ Communicate transparently: in a crisis, honesty and timely updates sustain customer trust.
4️⃣ Test resilience regularly: simulate large-scale failures to refine incident response and recovery playbooks.

This outage is not just about downtime; it’s about how we lead during disruption. True operational excellence lies in the balance between technical precision and ethical responsibility. Technology will fail from time to time, but our response defines our credibility.

#Leadership #Resilience #AWSOutage #Integrity #OperationalExcellence #DataCenter #CloudInfrastructure #Trust #CrisisManagement #DigitalReliability

Aniket Gupta

Staff Engineer || CFA Level 1 || Spring Security || Spring API Gateway || JPA || Springboot || Microservices || Rate limiter || Healthcare || Investment || Gaming || HLD || LLD
4w
On October 20, AWS us-east-1 experienced a major disruption, reminding us all that even the largest cloud providers can hit critical breaking points. This outage, rooted in DNS resolution failures, rippled through DynamoDB and core services, affecting platforms globally within minutes.

For those designing resilient systems, several lessons stood out:
• No Service Is an Island: Regional “isolation” can break down quickly when foundational services like DNS are shared across control and data planes.
• Proactive Resilience Matters: Application-level DNS caching, circuit breakers, and graceful degradation are essential for surviving not just hardware blips, but full control-plane failures.
• Multi-Region Is Non-Negotiable: True resilience comes from active-active/active-passive deployments, tested failovers, and diversified service discovery, even using multi-provider DNS when possible.
• Test for Chaos: Borrow from chaos engineering and simulate DNS and endpoint failures before production faces them.

This incident echoes past outages and serves as a wake-up call to revisit our architectural assumptions. As dependency chains grow deeper, the need for robust, fault-tolerant design grows ever more critical. Let’s build systems prepared to survive the next “impossible” event.

#CloudReliability #AWSOutage #ResilienceEngineering #DevOps #SystemDesign #SaaS #CloudArchitecture #DNSEngineering #ChaosEngineering #SiteReliability #IncidentResponse #TechLeadership
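The circuit breaker and graceful degradation ideas in the list above can be sketched together: after repeated failures the breaker stops calling the dependency for a while and serves a fallback instead. This is a minimal single-threaded illustration, not a production library; the `CircuitBreaker` class, its parameter names, and the default thresholds are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, retries later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        """Call func unless the breaker is open; degrade to fallback on failure."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: skip the dependency entirely
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the breaker
        return result
```

A caller might wrap a database read as `breaker.call(read_live, read_cached)`, where both function names are placeholders, so users see slightly stale data rather than an error while the dependency is down.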

Inrupt

9,803 followers
4w
One DNS failure. Thousands of organizations offline. This week's AWS outage showed us what happens when critical infrastructure lives in vertical stovepipes with no way out. The real question: How many wake-up calls do enterprises need about the fragility of centralized systems? Our latest blog explores why integrity controls and distributed, interoperable architecture are essential for operational resilience in the AI age: https://hubs.li/Q03PX-B90

Monday's AWS Outage & Why Interoperability is Key for Enterprise Resilience inrupt.com
Tej Pratap Singh

Senior Cybersecurity Manager | SOC Leader | Security Architecture | Red Team & Threat Detection | Cloud & DevSecOps Security
1mo
AWS just went down, and huge parts of the internet staggered. If your product or stack relies on a single cloud region or service, tonight’s outage is a stark reminder: resilience needs design, not hope.

Many public-facing apps, APIs, and integrations showed cascading failures as monitoring, auth, and storage endpoints faltered. For engineering leaders and on-call teams, three immediate moves can reduce blast radius: validate failover paths, check DNS TTL and caching, and confirm downstream retry/backoff behavior. ✅

Longer term, this is a prompt to test runbooks, invest in cross-region or multi-provider fallbacks where business-critical, and tighten customer communications during incidents. The goal isn’t to avoid every outage; it’s to shrink time-to-recovery and preserve trust.

What did your team learn from tonight’s incident? Share one practical change you’ll make in the next 30 days, and let’s crowdsource better resilience practices.

Read More: https://lnkd.in/gT7F8Xam

#CloudOutage #IncidentResponse #Resilience #CloudArchitecture #SRE #FutureOfWork
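For the "confirm downstream retry/backoff behavior" step above, a common pattern is exponential backoff with jitter, which spaces out retries so clients do not hammer a recovering service in lockstep. The sketch below is an assumption-laden illustration: `retry_with_backoff` and its default values are hypothetical, and the `sleep` and `rng` parameters exist only so the behavior can be checked without real waiting.

```python
import random
import time


def retry_with_backoff(func, attempts=5, base_delay=0.2, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying on any exception with capped, jittered backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": wait a random fraction of the ceiling.
            sleep(rng() * cap)
```

Re-raising after the final attempt matters during a long outage: the caller learns quickly that the dependency is down and can switch to a fallback instead of retrying forever.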
