One key component in developing and evaluating such a resilient system is the ability to simulate faults in a controlled manner.
Chaos engineering is about studying a system empirically: simulating chaos, evaluating the system's resilience, and building confidence in its capability to withstand turbulent real-world conditions.
Many developers don't distinguish between faults, errors, and failures, but this distinction is fundamental to engineering chaos in a distributed system. A fault is a defect that exists in a service and may be either "active" or "dormant". An error occurs when a fault becomes active. A failure occurs when the error is not suppressed and becomes visible outside the service.
*Figure: Pathology of a Failure*
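
To make the fault, error, and failure chain concrete, here is a minimal Python sketch. The names `AverageService` and `handle_request`, and the `suppress_errors` flag, are hypothetical and invented purely for illustration. The sketch assumes a service whose defect (no guard for empty input) stays dormant until a caller happens to trigger it.

```python
class AverageService:
    """Hypothetical service carrying a latent fault: empty input is not handled."""

    def average(self, values):
        # Fault: there is no guard for an empty list.
        # It stays dormant for as long as callers pass non-empty data.
        return sum(values) / len(values)


def handle_request(service, values, suppress_errors):
    """Hypothetical service boundary: what the outside world gets to see."""
    try:
        return {"status": "ok", "result": service.average(values)}
    except ZeroDivisionError:
        # Error: the fault has become active inside the service.
        if suppress_errors:
            # The error is suppressed, so callers see a degraded response
            # instead of a failure.
            return {"status": "degraded", "result": None}
        # Failure: the error is not suppressed and becomes visible
        # outside the service.
        raise
```

Calling `handle_request(AverageService(), [1, 2, 3], suppress_errors=False)` never activates the fault. Passing an empty list activates it, and the `suppress_errors` flag then decides whether the resulting error stays inside the service or surfaces as a failure. A chaos experiment deliberately drives a system down the second path to check that the suppression actually works.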
