It’s been a busy week for cybersecurity company CrowdStrike. After a botched update caused a massive Microsoft Windows outage worldwide, CrowdStrike’s CEO has now been called to testify before the US Congress to explain what happened.

But ahead of that, the firm has released a Preliminary Post Incident Review (PIR), revealing more details about the outage and what went wrong to cause it.

So, what went wrong, according to CrowdStrike’s report—and what is the firm doing to prevent it happening again?

CrowdStrike Reveals What Happened

As you will certainly know by now, Windows computers around the world crashed en masse with the blue screen of death (BSOD) when an update for CrowdStrike’s Falcon Sensor product went very wrong.

Falcon is “the CrowdStrike platform purpose-built to stop breaches via a unified set of cloud-delivered technologies that prevent all types of attacks—including malware and much more,” according to the company.

Since then, a lot has happened. IT admins scrambled to remediate the issue, in many cases manually, since a Windows BSOD can’t simply be undone. Microsoft has also released a tool to help people recover from the incident.

CrowdStrike also released a short explainer detailing how a bug in the way it delivers updates led to the problem that ultimately caused the BSODs.

What Does CrowdStrike’s Preliminary Post Incident Review (PIR) Say?

This has now been analyzed further in CrowdStrike’s Preliminary Post Incident Review (PIR). The firm says it will also release a Root Cause Analysis soon.

In its PIR, CrowdStrike explains how it delivers security content configuration updates to its sensors in two ways: Sensor Content, which ships directly with the sensor itself, and Rapid Response Content, which is “designed to respond to the changing threat landscape at operational speed.”

Rapid Response Content updates appear to happen little and often, allowing the Falcon platform to tackle new cybersecurity threats as they emerge. In this case, however, a bug caused an issue. “The issue on Friday involved a Rapid Response Content update with an undetected error,” CrowdStrike says in its PIR.

Indeed, CrowdStrike says the “problematic Rapid Response Content configuration update resulted in a Windows system crash.”

“When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception,” CrowdStrike writes. “This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
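CrowdStrike has not published the interpreter’s code, but the failure mode it describes can be sketched in a few lines. The example below is a hypothetical illustration only: a parser that expects more fields than the content file actually supplies reads past the end of its data, and the resulting exception, harmless in user-mode Python, has no equivalent safety net in a kernel-mode driver, which is why Windows crashed.

```python
# Hypothetical sketch only -- not CrowdStrike's code. It shows how an
# interpreter that expects more fields than a content file supplies ends up
# performing an out-of-bounds read that raises an exception.

def load_channel_file(fields: list[int], expected_fields: int) -> list[int]:
    """Read a fixed number of fields from a content update payload."""
    # If the payload contains fewer fields than expected, the final read
    # goes past the end of the list and raises IndexError.
    return [fields[i] for i in range(expected_fields)]

well_formed = list(range(21))  # field counts here are purely illustrative
malformed = list(range(20))    # one field short of what the interpreter expects

load_channel_file(well_formed, expected_fields=21)  # loads cleanly

try:
    load_channel_file(malformed, expected_fields=21)
except IndexError:
    # User-mode code can catch this. A kernel-mode driver performing the
    # equivalent out-of-bounds memory read has no such safety net, so the
    # exception cannot be gracefully handled and Windows halts with a BSOD.
    print("out-of-bounds read on a malformed content file")
```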

CrowdStrike’s Quality Assurance Process

Following last week’s outage, many people are questioning CrowdStrike’s quality assurance (QA) process.

CrowdStrike says its updates “go through an extensive QA process, which includes automated testing, manual testing, validation and rollout steps.”

The firm explains that the sensor release process begins with automated testing, both prior to and after merging into its code base. Once a release is made available, customers can update their fleets through a managed process.

However, the update on Friday was Rapid Response Content, which goes through a different process.

How CrowdStrike Will Prevent It Happening Again

CrowdStrike has outlined a number of steps it will take to stop anything this devastating happening again. This includes better testing processes such as “a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.”
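To make the idea concrete, here is a minimal sketch of what a staggered, canary-first rollout can look like. It illustrates the general technique rather than CrowdStrike’s actual tooling: the ring sizes, the deploy and healthy callbacks, and the abort logic are all assumptions.

```python
# Minimal sketch of a staggered (canary-first) rollout -- not CrowdStrike's
# tooling. Ring sizes and the deploy/healthy callbacks are assumptions.

from typing import Callable, Iterable

ROLLOUT_RINGS = [0.01, 0.10, 0.50, 1.00]  # canary first, then wider slices


def staged_rollout(
    hosts: list[str],
    deploy: Callable[[str], None],
    healthy: Callable[[Iterable[str]], bool],
) -> bool:
    """Deploy an update ring by ring, halting at the first sign of trouble."""
    deployed: set[str] = set()
    for fraction in ROLLOUT_RINGS:
        ring = hosts[: max(1, int(len(hosts) * fraction))]
        for host in ring:
            if host not in deployed:
                deploy(host)
                deployed.add(host)
        # If the canary (or any later ring) reports crashes, stop here: the
        # blast radius is limited to the hosts updated so far.
        if not healthy(deployed):
            return False
    return True
```

The point of the canary stage is that a faulty update would crash only the first small ring of machines, and the rollout would stop there rather than reaching the whole fleet.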

It will also improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.

CrowdStrike says it will provide customers with greater control over the delivery of Rapid Response Content updates by “allowing granular selection of when and where these updates are deployed.”
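CrowdStrike has not yet published what those controls will look like, but conceptually they amount to a per-host-group policy. The sketch below is purely hypothetical; every field name is invented to illustrate the idea of “when and where” controls, not to document a real Falcon setting.

```python
# Purely hypothetical policy sketch -- CrowdStrike has not published a schema,
# so every field name here is invented for illustration.
content_update_policy = {
    "host_group": "production-servers",
    "rapid_response_content": {
        "release_lag": "N-1",                    # stay one content release behind the newest
        "deployment_window": "02:00-04:00 UTC",  # when updates may be applied
        "canary_hosts": ["srv-eu-001", "srv-us-001"],  # where they land first
    },
}
```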

Meanwhile, CrowdStrike will provide content update details via release notes, which customers can subscribe to.

Experts Have Their Say

Security experts say a staged rollout procedure when publishing Rapid Response Content updates would have helped prevent the issue. “The crash would have been detected early in the first rollout stages and the number of impacted computers would have been significantly limited,” says Talal Haj Bakry, a security researcher at Mysk.

However, he says, staged rollouts are not necessarily the norm for an update such as this. He likens the issue to a firm running a large blog that uses a content management system (CMS). “When you want to update the CMS software itself, you may roll out the CMS update to some of your visitors at first to make sure nothing broke. But let’s say you simply want to publish a new blogpost.

“Normally you’d want all your visitors to see the new post once it’s published, and there’s no need to do a staged rollout. Now imagine you published a new blogpost, and somehow it broke your entire website and made it inaccessible. This is similar to what happened with CrowdStrike publishing a Rapid Response Content Update.”

Even so, he says, “there’s no absolving CrowdStrike from responsibility of this incident.”

“It’s clear that for such mission critical software running on millions of computers, every change—no matter how small it may seem—should be subject to a full QA procedure, including staged rollouts.”

While it is understandable that CrowdStrike would want to get new threat detections out quickly, doing so comes with the risk that something could go wrong, and at scale, “as we recently saw,” says Sean Wright, head of application security at Featurespace.

Many organizations taking this type of approach typically do a smaller rollout first. “This helps them to be absolutely sure nothing obvious is broken before rolling out to a wider audience,” he says.

And while there appears to have been some tooling for testing the validity of the update, that tooling itself had a flaw, says Wright. This highlights why rolling out to production-like instances first for some initial validation is incredibly important, he says. “Long story short, if you don’t test in some production-like environment before rolling an update out to all systems, there is the risk—albeit rather small likelihood—that it could affect all systems.”

It’s good that CrowdStrike is communicating with customers and giving timely updates on what went wrong, but its investigation so far shows that its processes could have been a lot better.

CrowdStrike is certainly learning this lesson the hard way. More details will be coming soon, but the damage to CrowdStrike’s reputation is already done, so the firm will need to work hard to recover.
