Jon Tamplin

CrowdStrike Outage

Why the 'eggs in one basket' approach may not persist

CrowdStrike Outage: 19th July 2024

Today, numerous organisations using CrowdStrike’s Falcon platform experienced significant disruption after a faulty content update caused Blue Screen of Death (BSOD) errors on Windows systems worldwide. The crashes affected critical services including emergency response, airports, hospitals, and TV stations. Importantly, this was not a cyber-attack but a technical failure.

CrowdStrike Overview

CrowdStrike is a leading cybersecurity firm known for its Falcon platform, which provides endpoint protection by using artificial intelligence and machine learning to detect and prevent cyber threats in real time. Its solutions are widely used across many sectors to secure digital assets and infrastructure.

Remediation Efforts

CrowdStrike’s engineers identified and reverted the problematic update. Affected users were provided with the following workaround:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment.
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
  3. Delete the file matching “C-00000291*.sys”.
  4. Boot the host normally.
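
Step 3 of the workaround above can be sketched in Python for cases where a scripting environment is available on the affected host or on a machine with the affected disk mounted. This is a minimal sketch, not CrowdStrike's tooling: the function name `remove_faulty_channel_files` and the `dry_run` flag are illustrative assumptions, and only the directory and file pattern come from the published steps. It assumes the script runs with sufficient privileges to delete files in that directory.

```python
from pathlib import Path

# Directory and pattern taken from the workaround steps above.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir=DEFAULT_DRIVER_DIR, dry_run=True):
    """Return the files matching the faulty pattern, deleting them unless dry_run.

    `dry_run` is a hypothetical safety switch: with the default of True
    nothing is deleted, so the matches can be reviewed first.
    """
    matches = sorted(
        path
        for path in Path(driver_dir).glob(CHANNEL_FILE_PATTERN)
        if path.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_channel_files()` with no arguments only lists the matching files; passing `dry_run=False` deletes them. For a disk attached to a repair machine, pass the mounted path of the CrowdStrike driver directory as `driver_dir`.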

Additionally, Microsoft offered further guidance for users running Windows Client and Windows Server on Virtual Machines. Microsoft suggested multiple restart operations, with some reports indicating that up to 15 restarts might be necessary. Customers were also advised either to restore from backups created before 19:00 UTC on the 18th of July, or to repair the OS disk offline by attaching it to a repair VM, deleting the problematic file, and reattaching the disk to the original VM. Detailed instructions for unlocking an encrypted OS disk on a separate virtual machine (repair VM) for offline remediation are available in Microsoft's documentation.
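
When restoring from backup, the restore point must predate the cutoff quoted in Microsoft's guidance. The selection can be sketched as follows, assuming backup timestamps are already available as timezone-aware `datetime` values; the function name `latest_backup_before` is illustrative and not part of any vendor tool.

```python
from datetime import datetime, timezone

# Cutoff quoted in Microsoft's guidance: 19:00 UTC on 18 July 2024.
CUTOFF = datetime(2024, 7, 18, 19, 0, tzinfo=timezone.utc)


def latest_backup_before(backup_times, cutoff=CUTOFF):
    """Return the most recent backup time strictly before cutoff, or None."""
    candidates = [t for t in backup_times if t < cutoff]
    return max(candidates, default=None)
```

A result of `None` means no backup predates the cutoff, in which case the offline disk repair route is the remaining option.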

For AWS users, the recommended steps were to force a shutdown of the affected instance, detach its volume and attach it to a working instance, delete the problematic file as per CrowdStrike communications, reattach the volume to the original instance, and then reboot. Rebooting was recommended primarily to give the machine a chance to contact CrowdStrike servers and retrieve the fix, but this isn't feasible when the machine is stuck in a boot loop. For those unable to boot into Windows, whether on a VM or a physical machine, the advice is to boot into Safe Mode with Networking and then either delete the offending file as described above or restore from backup.

Impact on End Users

The outage led to significant disruptions across various sectors:

  • Emergency Services: 911 services in parts of the U.S. and Canada were affected, with some agencies resorting to manual operations.
  • Airports: Major airports worldwide, including those in Zurich, Melbourne, Amsterdam, and London, experienced check-in delays and flight cancellations. More than 1,300 flights were cancelled, with many others delayed.
  • Hospitals: Facilities in the Netherlands and Spain faced interruptions, impacting patient care.
  • Media: Sky News in the UK and Australia's ABC faced significant operational disruptions.
  • Retail and Healthcare: In the UK, payment systems in shops, pharmacies, and GP surgeries were affected. Supermarket Morrisons reported initial issues, which were later resolved.

Timeline of Events

Initial reports placed the first BSODs at around 19:00 UTC on 18 July 2024. CrowdStrike's own account gives a later window: the faulty update was released at 04:09 UTC on 19 July 2024 and reverted at 05:27 UTC the same day. Restoration efforts continued into the following day, with CrowdStrike and Microsoft providing periodic updates.

Ongoing Issues

Despite the deployed fix, many organisations continue to deal with residual effects. IT teams around the world are working to identify and remediate affected assets that have not recovered automatically. This task is particularly challenging for assets that must be started in Safe Mode and are spread across many geographic locations. CrowdStrike CEO George Kurtz has apologised for the disruption, explaining that the problem stemmed from a bug in a single update which interacted negatively with Microsoft’s operating system. Kurtz said that CrowdStrike is "working with each and every customer to make sure that we can bring them back online," but noted that some systems may take time to return to normal.

Mitigating Risks and Enhancing Visibility

The widespread outages highlight the risks of relying on a single technology for vital services. To mitigate such risks, it's crucial to have alternative communication links and diversified software solutions. Although managing multiple products increases security and maintenance responsibilities, it reduces the risk of total failure when one system encounters issues.

Gaining visibility across multiple systems is essential for effective risk management, and this is where ThreatAware excels. ThreatAware helps organisations integrate and monitor their various cybersecurity tools via a unified platform, providing a comprehensive overview of their security posture. This approach ensures that potential issues can be identified and addressed promptly, reducing the likelihood of widespread disruptions.

For ThreatAware customers, the following query identifies devices that were online when the content update was published but have not since reconnected to CrowdStrike:

systems:{name:Crowdstrike AND properties.lastSeen>"2024-07-19T00:00:00" AND properties.lastSeen<"2024-07-19T08:00:00"}
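
The same query can be generated for a different time window with a small helper. This is a sketch only: the query syntax is inferred solely from the single example above, the function name `build_last_seen_query` is hypothetical, and whether ThreatAware accepts other windows or timestamp formats is an assumption.

```python
from datetime import datetime


def build_last_seen_query(system_name, start, end):
    """Build a lastSeen window query in the format of the example above."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    fmt = "%Y-%m-%dT%H:%M:%S"  # timestamp format used in the example query
    return (
        f"systems:{{name:{system_name} AND "
        f'properties.lastSeen>"{start.strftime(fmt)}" AND '
        f'properties.lastSeen<"{end.strftime(fmt)}"}}'
    )


# Reproduces the window used in the query above.
query = build_last_seen_query(
    "Crowdstrike",
    datetime(2024, 7, 19, 0, 0, 0),
    datetime(2024, 7, 19, 8, 0, 0),
)
print(query)
```

The system name is kept exactly as it appears in the example query, since the platform may match it literally.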

Conclusion

The CrowdStrike outage underscores the importance of rigorous testing and validation in software updates, especially for critical cybersecurity tools. The incident highlights the need for robust incident response plans and the importance of maintaining regular communication with cybersecurity vendors for updates and support. Organisations should consider diversifying their technological solutions to mitigate the impact of such incidents. Affected organisations should continue to monitor communications from CrowdStrike for further guidance and assistance in fully restoring their systems.
