Software Testing Insights From the CrowdStrike Incident

August 8, 2024

The CrowdStrike incident highlights the importance of software testing and code quality. Read on for insightful lessons about the balance between business risk and software quality cost, along with the time and cost of testing.

The CrowdStrike catastrophe will go down in history as an example of how insufficient focus on quality and software testing damages both a vendor’s image and its users. Whatever the root causes, it’s fair to give CrowdStrike time to analyze its problems and share results. However, let’s use this situation to offer some general observations about software quality.

The Balance of Business Risk & Software Quality Cost

Given the frequency of updates CrowdStrike releases in their difficult security business, it’s obvious that their development process automation is excellent. Highly automated CI/CD-based workflows are absolutely essential in modern software development. Continuous integration and delivery improve software teams’ efficiency and enable software development at scale.

When you have the whole process automated and instrumented, you can see numbers and statistics and better understand the costs. When you can measure, you are tempted to optimize, especially in an industry under constant pressure to be lean, agile, and fast-responding. Balancing business risk and the cost of software quality is a well-known problem in the industry.

Testing and quality processes are easy optimization targets. Optimization often means reducing testing. We don’t imply that this is what happened in the CrowdStrike case, but this mindset is visible across software organizations. Implementing multiple testing techniques takes time and effort. There’s also a maintenance cost.

Time & Cost of Testing

Even a moderately complex system, with a frontend, a backend built from microservices, and code written in C++, Java, and maybe Python, needs a solid quality process that combines several testing techniques.

Also, code coverage should be collected from all types of runtime testing. That’s a considerable effort to implement and maintain.
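As a rough illustration of what collecting coverage from all types of runtime testing can look like, the Python sketch below merges per-suite coverage and shows which lines only one suite reaches. The suite names, file names, and line numbers are invented for illustration. Real data would come from whatever coverage tools a team uses.

```python
from typing import Dict, Set, Tuple

# A covered line is identified by (file name, line number).
# All data in this sketch is made up for illustration.
Line = Tuple[str, int]


def merge_coverage(per_suite: Dict[str, Set[Line]]) -> Set[Line]:
    """Return the union of the lines covered by every test suite."""
    merged: Set[Line] = set()
    for lines in per_suite.values():
        merged |= lines
    return merged


def unique_contribution(per_suite: Dict[str, Set[Line]]) -> Dict[str, Set[Line]]:
    """Return, per suite, the lines that no other suite reaches."""
    result: Dict[str, Set[Line]] = {}
    for name, lines in per_suite.items():
        others: Set[Line] = set()
        for other_name, other_lines in per_suite.items():
            if other_name != name:
                others |= other_lines
        result[name] = lines - others
    return result


# Hypothetical coverage reported by three kinds of runtime testing.
coverage = {
    "unit": {("parser.py", 10), ("parser.py", 11), ("loader.py", 5)},
    "api": {("loader.py", 5), ("loader.py", 6)},
    "ui": {("loader.py", 6), ("views.py", 1)},
}

merged = merge_coverage(coverage)
unique = unique_contribution(coverage)
print(f"lines covered by any suite: {len(merged)}")
for suite, lines in unique.items():
    print(f"{suite}: {len(lines)} line(s) covered by no other suite")
```

A view like this also cuts both ways when a team asks whether a testing practice is worth the effort: a suite with no unique coverage may still catch bugs the others miss, so coverage alone is not a reason to drop it.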

Even if successfully deployed, testing techniques that are more expensive to maintain tend to decay over time and are eventually abandoned. Teams periodically check statistics on the bugs caught by a given testing practice and ask: Is it worth our effort?

Over time, teams tend to focus on one or two types of testing that are easy to automate and maintain. Green checks next to pull requests gradually build false confidence and discourage asking whether the risk is sufficiently mitigated.

There’s nothing groundbreaking in the conclusion above. It has always been like that. However, based on our observations, teams’ tolerance for testing-related expenses and delays paradoxically decreases as the level of automation increases.

A Comprehensive, Shift-Left Approach to Testing

The truth about software testing is that there is no silver bullet. You can’t test quality into software. Instead, it requires a holistic approach for the entire SDLC that includes:

Building quality software from the start with a shift-left testing approach.
Cultivating a robust and rich testing strategy to increase code confidence.

There are many testing techniques, each good at detecting a specific class of problems. High software quality and reliability require combining several testing methodologies. So-called optimization that cuts your entire testing process down to one or two methodologies is a dead end.

The temptation to reduce testing is even stronger when there’s no mandate for certain practices. For safety-critical systems, there are dedicated standards that mandate risk analysis and recommend employing a collection of software testing techniques. If you want to introduce a product to the market, you must prove it complies with the standard.

Functional safety standards typically categorize functionalities by their level of criticality and provide various levels of recommendations for specific testing techniques. Below, you can see an example from ISO 26262, a widely used standard in the automotive industry.

[Figure: screenshot of Table 7 in the ISO 26262 standard, shown as an example of safety standards and levels of criticality for systems.]

The details of this table aren’t important to our discussion. The relevant takeaway is that safety standards distinguish different levels of criticality for systems (A-D) and come with various recommendations on how to test systems. When you want to skip a specific testing practice that the standard deems appropriate for a given risk level, you need a really solid story for your assessor.
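The idea of tying required testing practices to a criticality level can be sketched as a simple release gate. The Python example below is a minimal sketch under stated assumptions: the technique names and the level-to-technique mapping are hypothetical and do not reproduce the ISO 26262 table, and the waiver mechanism only loosely mirrors the justification an assessor would expect.

```python
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical team policy: techniques required at each criticality level.
# This mapping is invented for illustration and is NOT taken from ISO 26262.
REQUIRED: Dict[str, FrozenSet[str]] = {
    "A": frozenset({"static_analysis", "unit_tests"}),
    "B": frozenset({"static_analysis", "unit_tests", "api_tests"}),
    "C": frozenset({"static_analysis", "unit_tests", "api_tests", "coverage"}),
    "D": frozenset({"static_analysis", "unit_tests", "api_tests",
                    "coverage", "fault_injection"}),
}


def missing_techniques(level: str,
                       performed: Set[str],
                       waivers: Optional[Dict[str, str]] = None) -> Set[str]:
    """Return required techniques that were neither performed nor waived.

    A waiver maps a technique to a written justification. A waiver with
    an empty justification is ignored.
    """
    justified = {t for t, reason in (waivers or {}).items() if reason.strip()}
    return set(REQUIRED[level]) - performed - justified


# A level D component where only two techniques were run.
performed = {"static_analysis", "unit_tests"}
gaps = missing_techniques("D", performed)

# The same component with one justified and one unjustified waiver.
gaps_with_waivers = missing_techniques(
    "D",
    performed,
    waivers={"fault_injection": "covered by a separate hardware analysis",
             "coverage": ""},
)

if gaps_with_waivers:
    print("release blocked, missing:", sorted(gaps_with_waivers))
```

Making the skipped practice an explicit, justified entry rather than a silent omission is the point of the design: the gap stays visible even when every pull request shows a green check.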

The safety process includes hazard analysis and risk assessment. This is the foundation for defining safety goals, implementing features, and testing appropriately for the defined level of risk. Without getting into industry-specific lingo, it’s all about defining critical scenarios and making sure to prevent them or sufficiently test against them.

Conclusion

In the case of Falcon, it would seem that any scenario that prevents the computer from booting is critical. There should be some safety measures against it. Minimally, there should be extensive tests to detect it. And those tests should be prioritized.
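Prioritizing tests for critical scenarios can be as simple as running them first and letting any failure among them block the release. The Python sketch below illustrates that ordering. The scenario names and the pass/fail lambdas are hypothetical stand-ins, not a description of how Falcon is or should be tested.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScenarioTest:
    name: str
    critical: bool             # True if the scenario was rated critical
    run: Callable[[], bool]    # returns True when the test passes


def run_prioritized(tests: List[ScenarioTest]) -> List[str]:
    """Run critical tests first and stop at the first critical failure."""
    log: List[str] = []
    # sorted() is stable, so tests keep their relative order within each group.
    ordered = sorted(tests, key=lambda t: not t.critical)
    for test in ordered:
        passed = test.run()
        log.append(f"{test.name}: {'pass' if passed else 'FAIL'}")
        if test.critical and not passed:
            log.append("release blocked")
            break
    return log


# Hypothetical scenarios. The lambdas stand in for real test logic.
tests = [
    ScenarioTest("ui_theme_renders", critical=False, run=lambda: True),
    ScenarioTest("host_boots_after_update", critical=True, run=lambda: False),
    ScenarioTest("report_export", critical=False, run=lambda: True),
]

log = run_prioritized(tests)
for entry in log:
    print(entry)
```

In this run, the critical boot scenario executes first, fails, and stops the pipeline before any lower-priority test consumes time.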

Developing safety-critical software is tremendously expensive. No, we don’t suggest that all products be developed with the same constraints as they are in the safety-critical world. That would be impractical and difficult to justify from the business point of view.

However, it’s interesting to observe how seemingly noncritical systems, such as CrowdStrike’s Falcon, affect our lives. Disruption in 911 line operations certainly puts the health and maybe even the lives of several unlucky individuals at risk.

So, what are the lessons for development teams?

Be aware of and resist the dynamic that constantly pushes teams to cut corners and compromise quality in favor of short-term business gains.
Understand the true criticality level of the systems they build, including the business risk for the organization.
Study the best practices and processes of industry verticals where products directly affect human lives. This can be a helpful exercise and a source of valuable inspiration.

By leveraging Parasoft C and C++ solutions for static analysis, unit testing, continuous integration, and monitoring, CrowdStrike might have detected and resolved its software update issues in real time, maintaining the integrity of its software releases.

Our team of experts believes software testing is an engineering discipline, and preventing software-originating catastrophes requires time and effort. Our solutions help teams minimize costs and regain control over business risk. Any type of business with high stakes can benefit from our expertise in static code analyzers, code coverage, unit testing frameworks, API testing, service virtualization, and more.
