My daughter was in Seattle a couple of weeks ago, preparing to fly home to Houston on Saturday. As you probably noticed, the day before there was a massive computer outage with global impact. Flights for most of the major airlines were halted for much of the day.
When my daughter arrived at the airport on Saturday morning, check-in was complete chaos. With some help from a couple of good Samaritans who stepped up and implemented a system to streamline things that United Airlines should have done itself, she made it through TSA and got to her gate. Then her flight was delayed, and delayed again, finally taking off three hours after its scheduled time.
That is just one small, personal anecdote of the event's impact. My daughter was a few hours late, but she made it back. Many experienced far more serious consequences from the outage.
While it was bad news for everyone involved, Morgan Wright, chief security officer for SentinelOne, emphasized that it did offer a unique opportunity to experience what a global cyber attack would look like. “An attack against America and our critical infrastructure would unfold as we saw with the CrowdStrike outage: cascading failures that would trigger more interdependent system failures. The failures would overwhelm the ability to respond immediately, and the lack of response to vital systems would trigger additional failures.”
Every organization should now examine how it was impacted and explore ways to improve its resilience.
What Happened?
Unlike other cybersecurity industry “wake-up call” moments like WannaCry or NotPetya, this catastrophe was not the result of a cyber attack. We all know now what happened. According to a statement from CrowdStrike CEO George Kurtz posted on the company’s website, “The outage was caused by a defect found in a Falcon content update for Windows hosts.”
As unfortunate as the incident was, it provides valuable insight into the sheer scale of our reliance on technology and the interconnected nature of the world we live and work in.
The balance between speed and quality assurance is a delicate one for any industry. This event has underscored the need for a resilient approach to technology development and deployment, highlighting how critical it is to adhere to basic engineering and quality assurance (QA) principles.
In a way, this incident has a silver lining. While this was not a cyber attack, it ultimately gives us an incredible opportunity to examine what went wrong and learn valuable lessons to enable organizations to be more resilient in the future.
Balancing Speed and Quality Assurance
Quality assurance is the backbone of reliable technology systems. It involves rigorous testing and validation processes to ensure that every component of a system functions correctly and securely before it goes live. In the rush to deliver new features and updates, organizations strive to find the right balance between diligence and expedience.
Tilting the scale too far in either direction has consequences. Extreme vigilance risks delaying deployment unnecessarily without any additional reduction in risk, while erring on the side of speed raises the likelihood of deploying flawed or vulnerable software.
Back to Basics: Building Resilient Systems
There is no such thing as perfect code, and there is no such thing as invulnerable security. Flaws can still occur even with a process that finds the right equilibrium between vigilance and speed. That’s why it’s important to have additional safeguards in place to limit the fallout if an issue occurs.
“Every organization produces defects or bugs, regardless of how robust their validation engineering practices are. We have all used technology long enough to appreciate this reality,” said Ric Smith, chief product and technology officer at SentinelOne. “However, how you deploy changes or updates into your environment—or your customer’s environment in the case of endpoints—is crucial for mitigating the risks introduced by potentially faulty updates. The most fundamental practice is to introduce changes incrementally, and through these phased increments, you can tier the risk of who gets exposed to updates.”
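The tiered, incremental approach Smith describes can be sketched roughly as follows. This is a hypothetical illustration, not any vendor's actual rollout mechanism: hosts are deterministically hashed into rings (a small canary tier first, then progressively larger tiers), and the update only advances to the next ring if the current ring stays within a failure budget.

```python
import hashlib

# Rings ordered smallest and most risk-tolerant first. The ring names,
# percentages, and failure budget below are illustrative assumptions.
RINGS = ["canary", "early", "broad"]

def ring_for(host_id: str, canary_pct: int = 1, early_pct: int = 9) -> str:
    """Deterministically place a host in a ring by hashing its ID."""
    bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
    if bucket < canary_pct:
        return "canary"
    if bucket < canary_pct + early_pct:
        return "early"
    return "broad"

def rollout(hosts, deploy, healthy, failure_budget=0.02):
    """Deploy ring by ring; halt if any ring exceeds the failure budget."""
    by_ring = {r: [] for r in RINGS}
    for h in hosts:
        by_ring[ring_for(h)].append(h)
    for ring in RINGS:
        members = by_ring[ring]
        for h in members:
            deploy(h)
        failures = sum(1 for h in members if not healthy(h))
        if members and failures / len(members) > failure_budget:
            return f"halted at {ring}: {failures}/{len(members)} unhealthy"
    return "completed"
```

The key property is that a faulty update is caught while its blast radius is still a small fraction of the fleet, rather than every host at once.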
Resiliency in technology systems means more than just preventing outages; it involves creating systems that can quickly recover and continue to function even when problems arise. This requires a proactive approach to both design and maintenance.
- Architectural Decisions: The foundation of a resilient system lies in its architecture. Decisions made during the design phase can set the stage for long-term stability and security. For example, implementing redundancy and failover mechanisms can ensure that if one component fails, others can take over without service disruption.
- Continuous Improvement: Resiliency is not a one-time effort but an ongoing process. Regularly updating and testing systems to handle new threats and challenges is crucial. This includes conducting regular stress tests and simulations to identify potential weaknesses before they can be exploited.
- Comprehensive QA Processes: QA should be an integral part of the development lifecycle, not an afterthought. This means applying the same rigorous testing standards to all updates and changes, whether they are major product releases or minor patches. Automated testing tools can help ensure consistency and thoroughness.
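The failover idea from the first bullet above can be illustrated with a minimal sketch: a caller that tries a redundant backend automatically when the primary fails, so one component's failure does not become a service outage. The backend names and error handling here are hypothetical assumptions, not a production design.

```python
def with_failover(backends):
    """Wrap a list of redundant backends; try each until one succeeds."""
    def call(request):
        last_error = None
        for backend in backends:
            try:
                return backend(request)
            except Exception as exc:  # production code would catch narrower errors
                last_error = exc      # remember the failure, try the next backend
        raise RuntimeError("all backends failed") from last_error
    return call
```

In a real architecture the same principle appears as load-balancer health checks, database replicas, or multi-region deployments; the pattern is the same: detect the failure and route around it rather than letting it propagate.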
The Path Forward
“It’s easy to jump on the bandwagon when angry villagers show up at the castle gates with spears and arrows looking for retribution,” said John Wood, CEO at Telos Corporation. “However, it’s important to remember that no one is perfect. Accidents happen. The real test is how good people and good companies respond to such events.”
One lesson from this incident is that organizations should consider running redundant or backup systems on a different operating system. After all, organizations running on Linux or macOS were not affected. Southwest Airlines apparently dodged the bullet by stubbornly relying on outdated versions of Windows that are no longer supported by Microsoft, but I don't recommend that as a strategy.
Every business that was impacted by this incident should perform a post-mortem analysis. This is the perfect opportunity to identify dependencies and critical points of failure so that changes can be made to avoid or reduce the effects of future incidents and make the organization more resilient.
Companies that design and deploy software must recognize that while rapid innovation is essential, it should never come at the cost of reliability and security.
- Emphasize Engineering Principles: Companies should reaffirm their commitment to basic engineering and QA principles. This includes rigorous testing, thorough code reviews, and a cautious approach to deploying updates.
- Foster a Culture of Resiliency: Encourage a mindset that values resilience and reliability. This can be achieved through training, clear policies, and a leadership team that prioritizes long-term stability over short-term gains.
- Engage with the Community: Collaboration with industry peers, analysts, and researchers can provide valuable insights and help identify best practices. Sharing knowledge and experiences can lead to collective improvements in the industry.
- Invest in Technology and Tools: Leveraging the latest technology and tools for QA and system monitoring can enhance a company’s ability to detect and respond to issues quickly. This includes automated testing, real-time monitoring, and advanced analytics.
“This event underscores the need for robust incident response plans and resilience strategies,” emphasized John Chirhart, founder and CEO of GTG.Online. “Organizations should take advantage of this experience to improve communication channels and enhance their redundancy measures.”
Organizations around the world just lived through a real-time case study in the need for better preparation and vigilance to avoid such cascading failures in the future. Hopefully, organizations can learn from this incident and use it as a catalyst for change.
By prioritizing resiliency and quality assurance, companies can build more robust systems that not only withstand disruptions but also provide a reliable foundation for future innovations.