In the dynamic world of cybersecurity, even the most reputable companies can encounter significant setbacks. Last week, CrowdStrike, a leading cybersecurity firm, faced a critical issue that disrupted many of their customers’ operations globally. This incident highlights the ever-present challenges in software development and the crucial need for advanced error detection solutions.
The Problem
On July 19, 2024, CrowdStrike released a configuration update for their Falcon sensor, which inadvertently caused a system crash (BSOD) on Windows systems. This issue, triggered by a logic error in the update, affected systems running Falcon sensor for Windows version 7.11 and above that downloaded the faulty configuration during a specific timeframe.
The fallout was immediate and widespread, impacting banks, airlines, and various other critical sectors globally. Systems crashed, operations halted, and organizations were left scrambling to restore functionality. This disruption also opened the door for cybercriminals to exploit the situation, as they began distributing phishing emails and malicious files masquerading as solutions to the issue.
How Railtown AI Could Have Made a Difference
A development team’s software development lifecycle (SDLC) and tooling lay the foundation for building great products. If CrowdStrike had been equipped with a better tooling platform such as Railtown AI, integrated with their CI/CD pipelines across multiple environments, this meltdown would have been caught before it was released. Our mission is to leverage artificial intelligence to provide more than what mainstream tools can offer. Our error bucketing and source environment tracking features could have made a difference and prevented a global software outage.
Smart error bucketing
Template type changes are a very common activity in software development. When teams push these changes in local environments, they trigger a constant stream of error notifications. Because of this frequent exposure to similar alerts, teams are likely to become desensitised to them, which leads to slower responses or a tendency to ignore alerts altogether.
Our error bucketing feature clusters similar errors and alerts only on errors detected for the first time, allowing teams to prioritise and address the most critical issues first. In CrowdStrike’s case, the team would have received one alert about the routine template type change and a separate alert about the mismatch of input parameters in the new template type.
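To make the bucketing idea concrete, here is a minimal Python sketch, assuming errors are grouped by a fingerprint built from the error type, a normalized message, and the code location. The names (`fingerprint`, `ErrorBucketer`) and the normalization rule are hypothetical illustrations, not Railtown AI’s actual API or algorithm.

```python
import hashlib
import re
from collections import Counter
from dataclasses import dataclass, field


def fingerprint(error_type: str, message: str, location: str) -> str:
    """Build a bucket key; hex ids and numbers are masked so similar errors match."""
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", message)
    raw = f"{error_type}|{normalized}|{location}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


@dataclass
class ErrorBucketer:
    """Hypothetical bucketer: counts occurrences per fingerprint."""
    counts: Counter = field(default_factory=Counter)

    def record(self, error_type: str, message: str, location: str) -> bool:
        """Add the error to its bucket; return True only the first time a bucket is seen."""
        key = fingerprint(error_type, message, location)
        is_new = key not in self.counts
        self.counts[key] += 1
        return is_new


# Usage: two routine template changes and one genuine mismatch.
bucketer = ErrorBucketer()
events = [
    ("TemplateTypeChanged", "template 12 updated to version 3", "loader.py"),
    ("TemplateTypeChanged", "template 14 updated to version 4", "loader.py"),
    ("InputCountMismatch", "expected 21 inputs, got 20", "interpreter.py"),
]
alerts = []
for etype, msg, loc in events:
    if bucketer.record(etype, msg, loc):
        alerts.append(etype)
print(alerts)  # ['TemplateTypeChanged', 'InputCountMismatch']
```

Masking numbers is what lets both template change events land in one bucket while the input mismatch raises its own alert; a production system would likely use richer signals, such as full stack traces.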
Source environment mapping
Because each environment has its own configuration, our source environment mapping feature is another safeguard against a global meltdown like this one. Local and test environments are usually set up to match the new code changes that are about to be released, so some errors go uncaught there. By integrating Railtown AI with multiple environments, errors that would have been missed locally would have been caught in the staging environment.
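A minimal sketch of environment-aware tracking, assuming each error already has a fingerprint (for example, from a bucketing step) and is tagged with the environment that produced it. `EnvironmentErrorMap`, `can_promote`, and the environment names are hypothetical; the release gate simply blocks promotion when staging reports errors that were never seen locally.

```python
from collections import defaultdict


class EnvironmentErrorMap:
    """Hypothetical tracker of which error fingerprints appeared in which environment."""

    def __init__(self) -> None:
        self.seen: dict[str, set[str]] = defaultdict(set)

    def record(self, environment: str, fingerprint: str) -> None:
        self.seen[environment].add(fingerprint)

    def new_in(self, target: str, baseline: str) -> set[str]:
        """Fingerprints observed in `target` that never appeared in `baseline`."""
        return self.seen[target] - self.seen[baseline]


def can_promote(env_map: EnvironmentErrorMap) -> bool:
    """Release gate: block promotion if staging surfaced errors never seen locally."""
    return not env_map.new_in("staging", "local")


# Usage: the routine change shows up everywhere; the mismatch only in staging.
env_map = EnvironmentErrorMap()
env_map.record("local", "template-type-changed")
env_map.record("staging", "template-type-changed")
env_map.record("staging", "input-count-mismatch")
print(env_map.new_in("staging", "local"))  # {'input-count-mismatch'}
print(can_promote(env_map))  # False
```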
In detail, here’s how Railtown AI could have mitigated the issues faced by CrowdStrike:
- Validate the number of Template Type fields
The mismatch between the 21 inputs expected by the Content Validator and the 20 inputs actually provided to the Content Interpreter would be flagged in Railtown AI’s dashboard as soon as the update reaches the staging environment.
- Catch unchecked array access
Because the Rapid Response Content for Channel File 291 instructed the Content Interpreter to read the 21st entry of an input pointer array that held only 20 entries, the resulting out-of-bounds access would be flagged as an error.
- Holistic individual and matching tests for the Template Type and Content Interpreter
Railtown AI would consider and test all available criteria for the Template Type and Content Interpreter. Testing would cover functional validation of the Template Type, including the correct flow of security-relevant data through it and the evaluation of that data to generate appropriate detection alerts based on criteria created in development test cases, as well as matching the number of Template Type fields to the number of fields read by the Content Interpreter.
- Identify the Content Validator’s logic error
An error would be raised when the problematic Template Instance, which had passed the Content Validator, is sent to the Content Interpreter.
- Detect errors in the staging environment
This confluence of matching, logic, and validation errors would all be caught in the staging environment.
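The checks above can be sketched in Python, assuming a simplified model in which a Template Type declares a number of fields, the sensor code supplies a number of inputs, and a Template Instance is a list of per-input criteria where `"*"` is a wildcard. All names here (`TemplateType`, `validate_template_type`, `validate_instance`, `read_input`) are hypothetical and do not reflect CrowdStrike’s code or Railtown AI’s implementation. The instance check compares criteria against the inputs actually supplied rather than the declared field count. Python would raise `IndexError` on its own; the explicit check in `read_input` stands in for the runtime bounds check a native interpreter needs.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass
class TemplateType:
    """Simplified, hypothetical model of a Template Type."""
    name: str
    declared_fields: int  # fields listed in the Template Type definition
    supplied_inputs: int  # inputs the sensor code actually passes to the interpreter


def validate_template_type(tt: TemplateType) -> list[str]:
    """Flag a mismatch between the declared fields and the inputs really supplied."""
    if tt.declared_fields != tt.supplied_inputs:
        return [
            f"{tt.name}: definition declares {tt.declared_fields} fields, "
            f"sensor code supplies {tt.supplied_inputs} inputs"
        ]
    return []


def validate_instance(tt: TemplateType, criteria: list[str]) -> list[str]:
    """Reject non-wildcard criteria that point past the inputs actually supplied."""
    errors = []
    for index, criterion in enumerate(criteria):
        if criterion != WILDCARD and index >= tt.supplied_inputs:
            errors.append(f"criterion for input {index + 1} has no matching input")
    return errors


def read_input(inputs: list[str], index: int) -> str:
    """Bounds-checked read: report a bad index instead of reading past the array."""
    if not 0 <= index < len(inputs):
        raise ValueError(f"input index {index} out of range for {len(inputs)} inputs")
    return inputs[index]


# Usage: 21 declared fields, 20 supplied inputs, and a non-wildcard 21st criterion.
ipc = TemplateType("IPC", declared_fields=21, supplied_inputs=20)
criteria = [WILDCARD] * 20 + ["match-pattern"]
problems = validate_template_type(ipc) + validate_instance(ipc, criteria)
for problem in problems:
    print(problem)

try:
    read_input(["value"] * 20, 20)
except ValueError as err:
    print(err)  # input index 20 out of range for 20 inputs
```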
Not only does Railtown AI provide a single pane of glass, it goes beyond connecting data: it ingests information to become a proactive engine that gives teams actionable insights from information buried deep in ticketing systems and internal documentation. The Conductor, an AI engine, synthesizes technical and high-level information to provide insights for any stakeholder.
Why This Matters
The state of complex software development today makes it clear that traditional methods of error detection are no longer sufficient. The stakes are higher than ever, with user data, company reputation, and financial stability on the line. Railtown AI provides a cutting-edge solution that meets the demands of modern software development:
- Efficiency and Accuracy
CrowdStrike used both manual and automated testing and still ended up with a software meltdown. Relying on two methods of quality assurance does not guarantee a bug-free deployment if testing and validation do not offer a holistic view. With Railtown AI, secure and bug-free deployment is ensured through our continuous integration with your platforms.
- Proactivity
Railtown AI integrates with multiple environments and catches errors at the earliest stage of the pipeline. This guarantees that errors are caught before they reach production.
- Resilience
Smart error bucketing and source environment tracking enable the system to continue operating properly in the event of unexpected conditions or disruptions.
In conclusion, the recent issue faced by CrowdStrike is a powerful reminder of the need for advanced tools like Railtown AI in the software development lifecycle. By leveraging our AI capabilities, companies can significantly reduce the risk of defects, enhance security, and maintain the trust of their users. The future of software development lies in embracing AI-driven solutions to navigate the complexities and demands of today’s digital world.
At Railtown AI, we are committed to empowering developers to build better, safer, and more reliable software. Let’s ensure that issues like the one faced by CrowdStrike become a thing of the past.
More details on CrowdStrike’s meltdown
The Falcon sensor update had multiple layers. The sensor’s detection engine uses Rapid Response Content to gather telemetry and apply behavioural heuristics. Rapid Response Content is delivered through channel files, which are built on a specific Template Type and evaluated by the Content Interpreter. In this case, the sensor provided 20 inputs to the Content Interpreter, while the Template Type definition used by the Content Validator specified 21 parameters. This mismatch was not detected during development of the Template Type, and the test cases and Rapid Response Content used to test the Template Type did not trigger a fault during feature development or during testing of the sensor 7.11 release. The following specific issues contributed to the outage:
- Number of fields in the Template Type was not validated
The sensor code for the Template Type described 20 different input sources for use by the Template Instance. This means that when the sensor wanted to make a detection decision based on the Template Type, the sensor code would supply 20 different input sources to the Content Interpreter. However, the definition of the Template Type in the Template Type Definitions file stated that it expected 21 input fields. This definition resulted in Template Instances in Channel File 291 that were designed to operate on 21 inputs. This mismatch was not detected during the development of the Template Type.
- Runtime array bounds check wasn’t implemented for the Content Interpreter
The Rapid Response Content for Channel File 291 instructed the Content Interpreter to read the 21st entry of the input pointer array. However, the IPC Template Type only generates 20 inputs. As a result, once Rapid Response Content was delivered that used a non-wildcard matching criterion for the 21st input, the Content Interpreter performed an out-of-bounds read of the input array.
- Minimal test coverage for the Template Type
Both manual and automated Template Type testing focused on functional validation, checking that security-relevant data flowed through it properly and generated appropriate detection alerts, and did not include a non-wildcard matching criterion for the 21st input.
- Content Validator contained a logic error
The Content Validator evaluated the new Template Instances. However, it based its assessment on the expectation that the Template Type would be provided with 21 inputs. This resulted in the problematic Template Instance being sent to the Content Interpreter.
- Lack of Template Instance validation and Content Interpreter matching test
Testing of the Template Instance with the Content Validator did not reveal that the mismatched number of inputs would cause a system crash once the IPC Template Type provided them to the Content Interpreter.
- Absence of a staged deployment strategy
Each Template Instance was released to all systems at once.
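A staged (ring-based) rollout is one common way to limit the blast radius of a bad update. The sketch below assumes cumulative ring sizes and caller-supplied `deploy` and `crash_rate` functions; `staged_rollout`, the 1% threshold, and the simulation are hypothetical illustrations rather than CrowdStrike’s or Railtown AI’s deployment process.

```python
from collections.abc import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    ring_sizes: Sequence[int],
    deploy: Callable[[str], None],
    crash_rate: Callable[[Sequence[str]], float],
    max_crash_rate: float = 0.01,
) -> list[str]:
    """Deploy to cumulative rings of hosts, halting if a ring exceeds the crash threshold.

    Returns the hosts that actually received the update.
    """
    updated: list[str] = []
    for size in ring_sizes:
        ring = list(hosts[len(updated):min(size, len(hosts))])
        for host in ring:
            deploy(host)
        updated.extend(ring)
        if ring and crash_rate(ring) > max_crash_rate:
            print(f"halting rollout after {len(updated)} hosts: crash rate too high")
            break
    return updated


# Usage: simulate a faulty update that crashes every host it reaches.
hosts = [f"host-{i}" for i in range(100)]
crashed: set[str] = set()


def faulty_deploy(host: str) -> None:
    crashed.add(host)


def observed_crash_rate(ring: Sequence[str]) -> float:
    return sum(host in crashed for host in ring) / len(ring)


updated = staged_rollout(hosts, (1, 10, 100), faulty_deploy, observed_crash_rate)
print(len(updated))  # 1
```

In the simulation, the faulty update crashes the first host, so the rollout halts after a ring of one host instead of reaching all 100.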
For more details on the CrowdStrike incident, you can refer to the official CrowdStrike technical update and additional coverage on The Register.