Jan 11, 2025
What a cascading failure of technology!
On July 18, people sitting in front of their computers were greeted by the infamous “blue screen of death”, and the world encountered a Microsoft outage.
The result: Flights were grounded, airports were thronged with agitated passengers; a wide range of businesses were disrupted; healthcare providers had to cancel appointments, and even defer surgeries; and stores and broadcasters in several countries went offline.
The cybersecurity company CrowdStrike, which provides cybersecurity services and software for many large corporations including Microsoft, said that the outage resulted from a routine software update that had gone wrong and was “not a security incident or cyberattack”.
Microsoft later estimated that CrowdStrike’s update “affected 8.5 million Windows devices”. In a blog post it said, “While the percentage (less than one percent) was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services”.
Now, security experts opine that the latest update of the Falcon sensor software, which was meant to make CrowdStrike’s customers’ systems more secure against cyberattacks, might not have undergone adequate quality checks before it was released. One probable reason for such a lapse is that, with updates being released so frequently (almost once a day), CrowdStrike might not have tested this one as thoroughly as it should have.
This raises a question: Why are we so bad at preventing such failures? One simple answer is that these technological systems are too complicated for anyone to fully understand. The programs are not built by a single individual; they are developed by many people over several years, and they may have millions of lines of code that no one entirely grasps.
On top of that, there are countless components that might have been designed long ago in a specific way for a specific purpose that no one remembers. It is the interaction of these countless components and millions of lines of code that keeps the system functioning. So a malfunction in any one of them can bring the whole system down. And that’s what the faulty update did!
And we do not appreciate all this until something unintended happens. Now that it has happened, and the fragility of the digital ecosystem is exposed, it is time to work towards creating a more resilient system by putting in place sound disaster recovery protocols. First things first: treat this outage as a “dry run” and evaluate our digital dependencies and the policies governing them.
Security providers such as CrowdStrike are uploading updates constantly, often many times a day, with no visibility to customers, no accountability and no regulatory scrutiny. Of course, they may have a reason for such secrecy: to stop hackers from learning about the underlying software. But that doesn’t mean cybersecurity vendors can simply get away with no transparency and no control.
For instance, let us take another look at the current outage. CrowdStrike, a cybersecurity vendor with kernel access in Windows, uploaded the update directly to its customers’ systems. Had the update first been tested by Microsoft before the vendor uploaded it, Microsoft’s engineers could perhaps have identified the fault, and the outage might have been averted.
Another important point that emerges from this episode is that Microsoft cannot absolve itself of responsibility for the current damage by simply saying it is the fault of the security vendor. After all, we the users place immense faith in Microsoft when buying its products. So there is an implicit responsibility on Microsoft to ensure that what its vendors upload is faultless and indeed provides cybersecurity, not an outage.
No doubt CrowdStrike, to its credit, took the lead and did a speedy job: it withdrew the update and restored the systems reasonably quickly. Secondly, an excellent sense of collaboration was witnessed across the entire system in recovering from the damage and moving forward quickly. Credit also goes to Microsoft for mustering such collaboration.
Yet, as the world is becoming increasingly digitized and interconnected, there needs to be a mechanism set in place to make security vendors’ behavior less and less risky. And Microsoft must take the lead in fostering such confidence among its customers.
To manage the risks associated with automated software updates, businesses have to put in place an effective vendor management system. Some experts even suggest developing technologies that could afford visibility and control over the software supply chain.
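One form such control could take is a staged rollout, in which an update reaches a small test group first and is allowed to go further only if those machines stay healthy. The Python sketch below is purely illustrative and assumes nothing about how CrowdStrike or Microsoft actually distribute updates: the names `Ring`, `staged_rollout`, `install`, `is_healthy` and the 1% failure threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Ring:
    """A named group of machines that receives an update together (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, for illustration only
) -> Tuple[List[str], Optional[str]]:
    """Install ring by ring and stop as soon as one ring looks unhealthy.

    Returns the names of the rings that completed cleanly and the name of
    the ring at which the rollout halted (None if it never halted).
    """
    completed: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            install(host)
            if not is_healthy(host):
                failures += 1
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            # Later rings are never touched, which limits the damage.
            return completed, ring.name
        completed.append(ring.name)
    return completed, None


if __name__ == "__main__":
    rings = [
        Ring("canary", ["test-pc-1", "test-pc-2"]),
        Ring("everyone", ["desk-1", "desk-2", "desk-3"]),
    ]
    touched: List[str] = []
    # Simulate a faulty update: every machine that installs it becomes unhealthy.
    completed, halted_at = staged_rollout(rings, touched.append, lambda host: False)
    print(completed, halted_at, touched)
    # prints: [] canary ['test-pc-1', 'test-pc-2']
```

In this toy run the faulty update reaches only the two test machines, and the "everyone" group is spared. A real system would also need rollback, monitoring over time and a customer-controlled schedule, none of which is shown here.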
Lastly, it is perhaps time for big users like airlines to plan for alternative system availability to limit the damage from such future eventualities. Simultaneously, businesses may have to look for a greater range of suppliers of security applications to obviate the concentration risk in the cybersecurity market.
And regulators must build a mechanism to analyze technical failures of this kind and suggest improvements, just as they analyze cyberattacks, to sustain the resilience of global digital infrastructure.
Else, such failures can cause severe damage to the global economic system.
17-Aug-2024
More by : Gollamudi Radha Krishna Murty