Apr 30, 2016

How Complex Systems Fail

Came across this summary paper on the nature of failure, how failure is evaluated, and how failure is attributed to causes. While the paper is written in the context of hospitals and patient safety, it applies to big data systems as well. Following are some highlights...

Complex systems are intrinsically hazardous systems
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature.
Complex systems are heavily and successfully defended against failure
The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
Catastrophe requires multiple failures - single point failures are not enough
The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.
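
Translated into the big data setting mentioned above, this point can be made concrete with a small simulation. This is only an illustrative sketch: the defense layers, their miss rates, and the assumption that they fail independently are hypothetical, not taken from the paper.

```python
import random

# Hypothetical per-layer probability that a defense misses a given latent fault.
# Each individual layer is deliberately imperfect.
LAYER_MISS_RATES = {
    "input validation": 0.05,
    "automated tests": 0.10,
    "canary deploy": 0.08,
    "monitoring alert": 0.03,
}

def fault_becomes_incident(rng: random.Random) -> bool:
    """A fault turns into an overt incident only if every layer misses it."""
    return all(rng.random() < miss for miss in LAYER_MISS_RATES.values())

def estimate_incident_rate(trials: int = 1_000_000, seed: int = 0) -> float:
    """Estimate how often a single fault slips past all defenses at once."""
    rng = random.Random(seed)
    incidents = sum(fault_becomes_incident(rng) for _ in range(trials))
    return incidents / trials

if __name__ == "__main__":
    # Individually each layer misses 3-10% of faults, yet jointly the expected
    # miss rate is 0.05 * 0.10 * 0.08 * 0.03 = 1.2e-5 (under the optimistic
    # assumption that the layers fail independently of one another).
    print(f"estimated incident rate: {estimate_incident_rate():.2e}")
```

In practice the independence assumption is optimistic; the latent failures the paper describes tend to line up the layers' blind spots, which is exactly why the small residual risk never reaches zero.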
Complex systems contain changing mixtures of failures latent in them
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure, they are regarded as minor factors during operations.

Complex systems run in degraded mode
The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.
Post-accident attribution of accidents to a ‘root cause’ is fundamentally wrong
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.
Evaluations based on ‘root cause’ reasoning do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
Hindsight biases post-accident assessments of human performance
Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case.
Human operators have dual roles: as producers & as defenders against failure
In accident-free times, the production role is emphasized. After accidents, the defense-against-failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles.
All practitioner activities are gambles
That practitioner actions are gambles appears clear after accidents; but the converse, that successful outcomes are also the result of gambles, is not widely appreciated.
Actions at the sharp end resolve all ambiguity 
After an accident, practitioner actions may be regarded as ‘errors’ or ‘violations’ but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.
Change introduces new forms of failure
The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low-consequence but high-frequency failures. These changes may actually create opportunities for new, low-frequency but high-consequence failures.
Views of ‘cause’ limit the effectiveness of defenses against future events
Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. Instead of increasing safety, these remedies usually increase the coupling and complexity of the system.
Safety is a characteristic of systems and not of their components
The state of safety in any system is always dynamic; continuous systemic change ensures that hazard and its management are constantly changing.
Failure-free operations require experience with failure
Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”. This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered.

You can find the PDF version of the paper here. It is short and well worth reading.