I have learned about an almost depressing number of costly, high-profile software failures this week. Holy cow. Alright, so I don’t see a huge tie-in to Security Engineering in these readings, because almost all of the software failures seemed to result from internal failures rather than external threats. One notable exception was the Keep Out article’s discussion of product features that are intended to secretly limit the consumer’s use of the product. In the case of Sony’s rootkit, for example, Sony intended to curb music piracy by including a hidden rootkit on CDs; the rootkit modified system settings to conceal itself, and in doing so it left the system vulnerable to malicious hackers who were well aware of the secret modifications. This enormously short-sighted failure on Sony’s part paved the way for external threats, and there have been and will continue to be examples of private and public organizations exposing consumers and citizens to vulnerabilities like these with the best of intentions. All of the other readings concerned epic internal failures of planning, management, evaluation, and communication.
Planning failures abound in these examples. The FBI Trilogy fiasco, and to a lesser extent Sentinel, was a perfectly terrible example of poor planning. Software engineering is difficult enough just from the perspective of developers trying to meet myriad system requirements under tight deadlines and budgets, but the ever-moving goalposts set, reset, and reset again by the FBI project leaders were, in my view, at the root of the decade of wasted time and the hundreds of millions of wasted dollars. I’m sure that FBI agents are fantastic improvisers and outside-the-box thinkers, if Mulder has taught me anything at all, but they can only operate successfully because laws are quite rigid. If an agent were in the middle of building a bank fraud case or out on a violent-crime arrest, and they heard on their walkie-talkies that the relevant laws had just changed out from under their plan for the day, they would experience what it probably felt like to be one of the software contractors working on the case management system overhaul. The same generally goes for a builder trying to follow blueprints with the reasonable expectation that the plans won’t change after the 40th floor has already been finished. System specification is clearly overlooked and underestimated, and it is of the utmost importance.
There have clearly been management problems at the heart of most of these failures as well. Ever since I heard the story about one of the Challenger engineers failing to get his concerns over the launch day’s extremely low temperatures passed up the chain of command, I’ve thought a lot about the role of management in connecting the dots. In many of these stories, there were early signs of failure that were simply not properly managed; they were often ignored or even shut down. For example, in the hours leading up to the disastrous Titan/Milstar launch, there were system readings that were way off, but no one knew whose responsibility it was to monitor and react to them. Just as we practiced on day one of our team term project, it is critical for teams to unambiguously define roles and thoroughly divvy up responsibilities.
None of these projects would have failed as spectacularly, or as tragically, as they did had they been subjected to robust testing and evaluation before deployment. Often the failure to test the product was due to overconfidence and an underappreciation of the emergent complexities of systems, which go far beyond their individual components. The Therac-25 disaster was a maddening example of this overconfidence. The previous model had worked so well (or so they thought at the time) that software components were reused in the new hardware without reevaluation. To top it all off, hardware-level safety interlocks were removed entirely in favor of untested software-level controls that were simply assumed to be more trustworthy. Trust should be earned rather than assumed. If your organization doesn’t have enough time to test its product, then it doesn’t have time to sell it.
Finally, poor communication was a common ingredient in most of these failures. My favorite example was the missed conversion between US customary and metric units that led to the loss of the Mars Climate Orbiter. The teams working at different locations and on different parts of the system did not clearly communicate who was responsible for what, or even precisely what each team was testing on its own compartmentalized components. Most often, it seems, the communication breakdown occurs between the client and the contractor, when the two never fully agree on what the goal is. I’ve been part of that problem myself as a web development intern, and it’s nice to know that highly paid professionals also struggle to communicate successfully.
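As a side note, this kind of unit mismatch is exactly the sort of thing that explicit unit handling in code can turn into a loud error instead of a silent one. Here is a minimal Python sketch of the idea; it is purely illustrative (the class, unit names, and conversion table are my own invention, not anything from the actual spacecraft software):

```python
from dataclasses import dataclass

# Conversion factors into SI (newton-seconds) for impulse values.
# 1 pound-force second = 4.448222 newton-seconds.
_TO_NEWTON_SECONDS = {
    "N*s": 1.0,
    "lbf*s": 4.448222,
}


@dataclass(frozen=True)
class Impulse:
    """An impulse value tagged with its unit, so mismatches can't hide."""
    value: float
    unit: str

    def in_newton_seconds(self) -> float:
        if self.unit not in _TO_NEWTON_SECONDS:
            raise ValueError(f"Unknown impulse unit: {self.unit}")
        return self.value * _TO_NEWTON_SECONDS[self.unit]


def total_impulse(readings: list[Impulse]) -> float:
    """Sum impulse readings, converting everything to N*s first."""
    return sum(r.in_newton_seconds() for r in readings)


if __name__ == "__main__":
    # One team reports in pound-force seconds, another in newton-seconds.
    readings = [Impulse(10.0, "lbf*s"), Impulse(44.5, "N*s")]
    print(f"Total impulse: {total_impulse(readings):.2f} N*s")

    # An unexpected unit fails loudly instead of silently skewing the math.
    try:
        total_impulse([Impulse(10.0, "slug*ft/s")])
    except ValueError as err:
        print(f"Caught: {err}")
```

The point isn’t the particular code; it’s that carrying the unit alongside the number forces the conversion question to be answered somewhere explicit, rather than assumed differently by two teams who never talked to each other.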
The chapter on resilience engineering might have helped all of these developers, engineers, and managers. No project can be free of system failures, but every project should be able to anticipate those failures in order to reduce the risk of them occurring, mitigate their effects, and initiate recovery asap. Easier said than done, I’m sure!