How to Manage IT Outages
Explore top LinkedIn content from expert professionals.
Summary
Managing IT outages involves identifying and resolving system disruptions to minimize downtime and protect business continuity. It requires preparation, communication, and decisive action to handle unexpected technology failures effectively.
- Create a response plan: Design a comprehensive incident management plan, including clear roles, communication protocols, and mitigation steps for various outage scenarios.
- Communicate proactively: Keep stakeholders and teams informed with timely updates, ensuring transparency and coordination throughout the resolution process.
- Focus on prevention: Regularly back up systems, monitor infrastructure health, and test software updates in controlled stages to avoid widespread disruptions.
-
I was the VP of technology for the 2020 Olympics. 175+ million people watched. Here are 5 steps I took when something went wrong:

When something breaks and 175 million people are watching, it should immediately be brought to the attention of the event's shift leader. From there, an incident leader must be chosen and will need to:
1. Determine the scope and scale of the issue
2. Investigate the cause of the issue
3. Take corrective action
4. Communicate
5. Repeat

Scope and scale ⬇️
🔹 Is it widespread and affecting all customers?
🔹 Is it limited to customers in a certain country?
🔹 Is it pervasive, or does it only happen when customers first open your app?
It is better to assume the issue is widespread until you can prove with certainty the specific conditions under which it happens. As the scope and scale are refined, the shift leader and senior executive staff should be updated.

Investigation ⬇️
Investigating the root cause should happen as quickly as possible and with the correct people involved. During a live event, this is especially critical. Getting the right people to evaluate the problem is a best practice: it is better to have too many people engaged than too few and miss something.

Corrective Action ⬇️
Once the cause is known, take corrective action. If there is any way the correction can be made through configuration or server-side controls (a kill-switch pattern is sketched after this post), the process will be relatively easy. If it requires code changes or other "hot fix" deployments, that painful process will need to start immediately. In almost all cases, it is advisable to inform the shift leader and senior executives before taking action. Even when their sign-off is not necessary for a small change, any change must be communicated to the broader team.

Communication ⬇️
I've listed communication as step 4, but you can see that it runs through all of the steps. As the incident leader, you must communicate early and often. Updates should go out at least every 10-15 minutes, and more frequently as causes and options are identified. This is not a time to be quiet.
Tip: Avoid letting your updates become a "stream of consciousness" by quickly summarizing for the group where you are and what you are doing.

Repeat ⬇️
Continue these steps until the incident is resolved. Even once it is resolved, there will likely be follow-ups and debriefs after the event.

How do your incident leaders operate?

💡 Olympic Countdown - Day 3: Incident Leadership. Follow along for my countdown posts every day leading up to the launch of the Paris Olympics! I will share insights on launching large-scale events gathered over my 30+ year career delivering many large events.
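The "configuration or server-side controls" point in Corrective Action is worth making concrete. Below is a minimal Python sketch of a server-side kill switch; the flag URL, the highlight_reel feature name, and the fail-closed behavior are assumptions for illustration, not how any particular Olympics system worked.

```python
# Minimal sketch of the "configuration over hot fix" idea from the Corrective
# Action step: a server-side kill switch lets an incident leader disable a
# misbehaving feature without shipping new code.
# The flag URL and "highlight_reel" feature name are hypothetical.

import json
import urllib.request

FLAGS_URL = "https://config.example.com/flags.json"  # hypothetical flag service


def load_flags(url: str = FLAGS_URL) -> dict:
    """Fetch the current server-side flags; fail closed if the fetch breaks."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)
    except Exception:
        # During an incident, assume risky features are off if config is unreachable.
        return {}


def feature_enabled(name: str, flags: dict) -> bool:
    return bool(flags.get(name, False))


flags = load_flags()
if feature_enabled("highlight_reel", flags):
    print("Serving highlight reel")
else:
    print("Highlight reel disabled via server-side flag")
```

The design choice that matters here is that the flag lives server-side, so flipping it changes behavior for every client on the next config fetch, with no "hot fix" deployment in the critical path.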
-
One of my clients who was affected by the CrowdStrike outage did a post-incident review yesterday. It was a great session. My biggest takeaway was the client's commitment to transparency and accountability: they acknowledged what they did right and were not afraid to point out what went wrong. Note: they were up and running within half a day, so good job to them.

Here are some additional learnings from this client, other clients, and the incident in general. These learnings are 100% CrowdStrike-independent, so if you are thinking "we don't need to worry because we have SentinelOne," these takeaways are for you too!

Some generic takeaways for any Windows environment:
● Shutting down laptops at night can prevent this and other bad updates from taking effect overnight.
● When your Domain Controller is down, your organization is (mostly) down. Work hard for redundancy.
● Make sure your BitLocker recovery keys are persisted securely in a non-Windows environment, especially those for your Domain Controllers.
● Lock down your Domain Controllers. Do not put any unnecessary software on them.
● Back up your Domain Controllers nightly. (And any other important servers.)
● Consider moving off of a Domain Controller network architecture. Know which services depend on your Domain Controllers.
● If you don't need to worry about BitLocker or other encryption keys because you don't encrypt, then you have another big problem to worry about!

But this type of situation could occur in non-Windows environments too. Some generic takeaways for everyone:
● Agents are scary. They are on all of your machines and typically have administrative access. Perform due diligence on all vendors who put agents on your machines.
● Make sure backup communication channels are available. When your network and computers are down, you will really regret not having your team members' phone numbers or other channels of communication.
● When pushing out any software broadly, do it in waves, and investigate any failures in the early waves. If possible, run the early waves in geographic proximity to headquarters. (See the sketch after this post.)
● Dedicated employees are critical when things go wrong. Contrast Delta Airlines' response with the companies that handled it best: the best companies were back up and running within a day or two. Even when their disaster recovery plans fell apart because of cascading failures, a few heroic employees helped solve the problem for everyone.

Did you have any lessons learned from the incident? Let us know in the comments below!
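The rollout-in-waves takeaway maps cleanly to code. Here is a minimal Python sketch; the host names, wave sizes, failure threshold, and the deploy_to stub are hypothetical stand-ins for whatever your endpoint-management tooling actually does.

```python
# Minimal sketch of "push software in waves": deploy to progressively larger
# fractions of the fleet and halt if an early wave shows too many failures.
# Host names, wave sizes, and deploy_to() are illustrative only.

import random


def deploy_to(host: str) -> bool:
    """Pretend to push the update to one host; return True on success."""
    return random.random() > 0.02  # assume ~2% of pushes fail, for illustration


def rollout_in_waves(hosts: list[str], wave_sizes=(0.01, 0.10, 0.50, 1.0),
                     max_failure_rate: float = 0.05) -> None:
    """Deploy in waves, stopping the rollout when a wave's failure rate is too high."""
    done = 0
    for fraction in wave_sizes:
        target = int(len(hosts) * fraction)
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(0 if deploy_to(h) else 1 for h in wave)
        print(f"Wave of {len(wave)} hosts: {failures} failures")
        if failures / len(wave) > max_failure_rate:
            print("Failure rate too high; halting rollout and investigating.")
            return
        done = target
    print("Rollout complete.")


# Example: a hypothetical fleet where the first hosts are near headquarters.
fleet = [f"hq-host-{i}" for i in range(50)] + [f"remote-host-{i}" for i in range(950)]
rollout_in_waves(fleet)
```

The key design choice is that the rollout halts automatically when an early wave crosses the failure threshold, so a bad update never reaches the whole fleet at once.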
-
On July 19, 2024, the tech world witnessed what many consider the largest IT outage in history. The CrowdStrike/Microsoft disruption affected millions of devices worldwide. Are you prepared for the next big outage?

The impact:
Global Disruption: The outage affected approximately 8.5 million Windows devices worldwide. (Source: Microsoft)
Travel Chaos: Over 4,000 flights were cancelled globally, with over 500 major airlines affected. (Source: CNBC & CrowdStrike)
Financial Toll: Downtime costs the world's largest companies $400 billion a year. While this figure is not specific to the CrowdStrike/Microsoft outage, it provides context for the potential financial impact of such large-scale IT disruptions. (Source: Splunk)

While some organizations crumbled, others emerged unscathed. What set them apart? They took proactive steps to safeguard their systems and processes. Here are 10 critical steps to help you avoid similar chaos:

1. Implement Staged Rollouts: Slow and steady wins the race. Avoid rolling out software updates across all systems at once; test updates on a small subset first.
2. Use Extra Monitoring Tools: Eyes everywhere! Deploy tools like Fleet to monitor endpoints and detect issues early (a simple check-in monitor is sketched after this post).
3. Non-Kernel-Level Security: This will be a key topic for many tech leaders now. Explore security solutions that operate outside the kernel to minimize risk.
4. Enhance Cloud Observability: It's their cloud until it is your outage; watch for storms at all times. Invest in tools to detect and prevent issues from buggy software updates.
5. Maintain Analog Backups: In some crucial cases analog beats digital, and not just for recorded music. Keep analog backups for critical sectors to ensure continuity during outages.
6. Improve Testing and Debugging: Test like you mean it, then test some more. Ensure rigorous testing and debugging of software and system updates before deployment.
7. Robust Crisis Management Protocols: Plan for every manner of chaos; think zombie apocalypse. Have well-defined procedures for responding to major outages.
8. Diversify Your Technology Stack: Avoid relying on a single vendor or technology to reduce risk. This can be argued 'til the end of time, but fewer points of failure is better, unless all your points of failure are in the same tech basket.
9. Regular System Backups: Think of backups as your get-out-of-jail-free card. Maintain recent backups or snapshots for quick rollbacks.
10. Staff Training: Train for trouble. Train IT staff in crisis response and workaround procedures.

The next crisis isn't a matter of if, but when. Will you be the hero who saw it coming, or the one who kept smashing that snooze button? What steps are you taking today to ensure your systems are secure and prepared?
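Step 2 (extra monitoring) can start as simply as flagging endpoints that stop checking in, which was the first visible symptom for many fleets on July 19. The Python sketch below is a hypothetical outline; the heartbeat data, host names, and 15-minute threshold are illustrative and not tied to Fleet or any other product.

```python
# Minimal sketch of endpoint check-in monitoring: flag machines whose last
# heartbeat is older than a threshold, often the first sign of a bad update.
# The hosts, timestamps, and threshold below are made up for illustration.

from datetime import datetime, timedelta, timezone


def silent_endpoints(last_checkin: dict[str, datetime],
                     max_silence: timedelta = timedelta(minutes=15)) -> list[str]:
    """Return endpoints whose last heartbeat is older than max_silence."""
    now = datetime.now(timezone.utc)
    return sorted(host for host, seen in last_checkin.items()
                  if now - seen > max_silence)


# Hypothetical heartbeat data, as an agent or MDM inventory might report it.
now = datetime.now(timezone.utc)
checkins = {
    "finance-laptop-01": now - timedelta(minutes=2),
    "ops-kiosk-07": now - timedelta(hours=3),        # likely boot-looping
    "warehouse-scanner-12": now - timedelta(minutes=40),
}

for host in silent_endpoints(checkins):
    print(f"ALERT: {host} has not checked in recently")
```

In practice you would pull the check-in times from your agent or MDM inventory and page someone when the count of silent hosts suddenly spikes.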
-
14 lessons I learned about working with large distributed systems over the last 8 years of my career at Google, Cisco, and Dell EMC. I love exploring system design and distributed systems. These are the insights I would give to my younger self if I were starting again:

1. Infrastructure Health Monitoring
- Monitor CPU utilization, memory usage, and other basics.
- Ensure auto-scaling and proactive alerting when resources are overloaded.
2. Service Health Monitoring: Traffic, Errors, and Latency
- Track traffic volume, error rates, and response times.
- Focus on latency percentiles (p95, p99) for a more accurate picture of user experience. (A percentile sketch follows this post.)
3. Business Metrics Monitoring
- Track key business events to ensure the system enables "business as usual."
- Customize business metrics for specific services, such as payments.
4. On-call and Anomaly Detection
- Teams should own their services, including the on-call responsibilities.
- Use machine learning for anomaly detection to reduce false positives.
5. Efficient Alerting
- Set thresholds for actionable alerts to avoid burning out on-call engineers.
- Regularly review alerts and tag non-actionable ones for future adjustment.
6. Runbooks for Mitigation
- Always have updated runbooks for common outages.
- Ensure mitigation steps are easy to follow, even for engineers unfamiliar with the system.
7. Outage Communication
- Establish clear channels for communicating outages across teams.
- Use central chat groups for faster, collaborative incident resolution.
8. Mitigate First, Investigate Later
- Focus on rolling back changes during outages instead of deploying fixes in haste.
- Root cause analysis can wait until after the incident is resolved.
9. Blameless Postmortems
- Investigate outages without assigning blame and identify root causes.
- Use techniques like the "5 Whys" to get to the heart of the issue.
10. Incident Reviews
- Have senior engineers and management review severe incidents.
- Ensure accountability for implementing system-level improvements.
11. Failover Drills and Capacity Planning
- Regularly test data center failovers to ensure services can handle increased traffic.
- Plan for future traffic with accurate capacity forecasting to avoid resource bottlenecks.
12. Blackbox Testing
- Simulate real user flows to ensure systems function correctly in real-world scenarios.
- Use blackbox tests for quick feedback during failover drills.
13. SLOs and SLAs
- Define service-level objectives (SLOs) for capacity, latency, and availability.
- Regularly measure and report on SLOs to ensure system performance is on track.
14. SRE Team Involvement
- Dedicated SRE teams should manage monitoring, alerting, and incident reviews.
- SREs ensure system reliability through failover drills, blackbox tests, and capacity planning.
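Lesson 2's point about latency percentiles deserves a concrete example: an average can look healthy while the tail is terrible. The Python sketch below uses made-up latencies and a simple nearest-rank percentile; it is an illustration, not a production metrics pipeline.

```python
# Minimal sketch of why p95/p99 matter: the average hides tail latency.
# The sample latencies are invented for illustration.


def percentile(samples: list[float], pct: float) -> float:
    """Simple nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]


# 1,000 hypothetical request latencies in ms: mostly fast, with a slow tail.
latencies = [20.0] * 940 + [400.0] * 45 + [2000.0] * 15

avg = sum(latencies) / len(latencies)
print(f"average: {avg:.0f} ms")                         # looks healthy
print(f"p95:     {percentile(latencies, 95):.0f} ms")   # shows degradation
print(f"p99:     {percentile(latencies, 99):.0f} ms")   # shows the slow tail
```

Here the average (about 67 ms) looks fine, while p99 reveals that one request in a hundred takes two seconds, which is what your unhappiest users actually experience.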
-
What a night! How to Manage a Major Technology Outage: Lessons from Last Night 🔴

It's time we rethink our reliance on automated updates without adequate safeguards. Last night, our organization faced a significant challenge due to a faulty update from CrowdStrike, our security software that protects against viruses and cyber threats. Here's what we learned:

1️⃣ Identify the Issue Quickly:
➟ At 9:30 PM, just as we were about to start scheduled maintenance, all our systems crashed.
➟ Our team swiftly identified the problem: a faulty global update from CrowdStrike.
➟ We immediately cancelled the planned work and focused on resolving the issue.

2️⃣ Work Around the Clock:
➟ We received a confirmed fix from CrowdStrike around 1 AM.
➟ Our dedicated team worked through the night to apply the fix and bring systems back online.
➟ By early morning, most systems were operational, with ongoing efforts to fully restore functionality.

3️⃣ Prepare for the Aftermath:
➟ Anticipate a busy day following a major outage; many users will need assistance.
➟ Create simple, user-friendly guides to help staff resolve common issues themselves.
➟ Ensure open communication with all stakeholders, updating them on progress and next steps.

The real lesson here? Always be prepared for unexpected technology failures and have a robust crisis management plan in place. Our team's dedication and quick response were crucial in minimizing downtime and restoring services.

Did you face the challenge last night? How did your team handle it? Share your tips in the comments!

#Crisis #Leadership #Crowdstrike #Teamwork
-
I'm incredibly proud of the work my team and I have accomplished in transforming our organization's approach to Product Health Management. We sparked a cultural shift that significantly reduced incidents, improved application performance, and enhanced uptime. 💥

When I first joined, we faced constant disruptions and a reactive approach to problem-solving. But instead of accepting the status quo, we took bold steps to make a real difference and create lasting change: a cultural transformation. Here's how we made it happen:

💼 Leap into Action: Building the SWAT Team 💼
🔧 We formed a dedicated SWAT team, always on standby to leap into action during incidents.
🤝 This fostered a sense of ownership and urgency across departments, making everyone responsible for product health.

📊 Assessing and Monitoring: Status Page & Dashboards 📊
🖥️ We introduced a status page for monitoring and alert notifications, and developed a Product Health Dashboard to track trends.
🔍 These tools became catalysts for transparency and proactive problem-solving.

📋 Strategic Planning & Response 📋
📝 We implemented a step-by-step response strategy involving key executives.
💡 This kept collaboration, strategic thinking, and teamwork at the forefront of our operations.

📞 Communication & Collaboration 📞
🔄 Weekly connects with stakeholders helped us review incidents and plan for changes.
🤝 This proactive communication fostered continuous learning, transparency, and shared accountability.

🔍 Monitoring the Situation: Continuous Vigilance 🔍
⏰ Even after a crisis, we didn't let our guard down.
🛡️ Continuous monitoring and reassessment became the norm, embedding vigilance and adaptability.

📈 Continuous Learning: Adapting and Evolving 📈
💡 We analyzed every incident to improve, cultivating a growth mindset.
🎉 We celebrated successes and learned from mistakes, fueling continuous improvement.

🚀 The result? By combining rapid action with thoughtful planning, clear communication, and a strong focus on both immediate and long-term solutions, we didn't just manage incidents; we transformed our entire approach. This led to quicker recoveries, fewer recurring issues, and a more robust, reliable suite of applications.

Key Takeaway: Transforming product health management isn't just about technology; it's about building a proactive problem-solving culture, transparency, ownership, and continuous learning.

Kudos to Jim Baron, Xiaoyu (Judy) Hopkins, Phong Truong, Alison Waghorn, Richard Entwistle, Travis Kieboom, Magnus Tautra, Kelli Delp

#IncidentManagement #ApplicationPortfolio #TechLeadership #ShortTermFixes #LongTermSolutions #CTO #SWATTeam #MonitoringSystems #StakeholderEngagement #TechStrategy #ChangeManagement #OutagePlanning #CustomerCommunication #Innovation