May 11, 2016

Designing Services for Resilience

This technique falls under the fault prevention and fault removal categories... In my opinion, scalability is a commodity now. Resiliency and intelligence are the next frontiers.


May 8, 2016

Chaos engineering for resilient distributed systems

Errors are a fact of life in large-scale distributed systems. A system can contain errors, but they are significant only if they interrupt the service, i.e., produce a failure. The goal of a resilient system is to mask such errors before they surface as failures and to continue functioning at an acceptable level.

One key component in developing and evaluating such a resilient system is the ability to simulate faults in a controlled manner.
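
As a rough illustration of what "controlled" can mean here, the sketch below wraps a call so that it fails with a configurable probability. Everything in it (FaultInjector, InjectedFault, lookup_price, the failure_rate and seed parameters) is a hypothetical name invented for this example, not part of any real chaos tool. The seed makes a run repeatable, and the enabled flag acts as a simple kill switch.

```python
import random


class InjectedFault(Exception):
    """Raised instead of calling the real function when a fault is injected."""


class FaultInjector:
    """Hypothetical helper: fails a fraction of calls on purpose."""

    def __init__(self, failure_rate, seed=None, enabled=True):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self.enabled = enabled            # kill switch for the experiment
        self._rng = random.Random(seed)   # private RNG so a seed gives repeatable runs
        self.injected = 0                 # how many faults have been injected so far

    def call(self, func, *args, **kwargs):
        """Call func, or raise InjectedFault with probability failure_rate."""
        if self.enabled and self._rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFault(f"injected fault before {func.__name__}")
        return func(*args, **kwargs)


def lookup_price(item):
    """Stand-in for a call to a remote dependency (assumed for illustration)."""
    return {"apple": 3}.get(item, 0)


if __name__ == "__main__":
    injector = FaultInjector(failure_rate=0.3, seed=42)
    served = 0
    for _ in range(100):
        try:
            injector.call(lookup_price, "apple")
            served += 1
        except InjectedFault:
            pass  # a resilient caller would retry or fall back here
    print(f"served={served} injected={injector.injected}")
```

Keeping the failure rate, the seed and the kill switch explicit is one way to bound the experiment: the same fault pattern can be replayed while the caller's handling is being fixed, and the injection can be turned off without touching the calling code.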

Chaos engineering is about empirically studying a system: simulating chaos, evaluating the system's resilience, and building confidence in its capability to withstand turbulent real-world conditions.


Many developers don't distinguish between faults, errors, and failures, but this distinction is fundamental to engineering chaos in a distributed system. A fault is a defect that exists in a service; it may be "active" or "dormant". An error occurs when a fault becomes active. A failure occurs when the error is not suppressed and becomes visible outside the service.
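
A small sketch may make the three terms concrete. The functions below (average_latency, report_unmasked, report_masked) are hypothetical and exist only for this illustration: the missing empty-input guard is the fault, the ZeroDivisionError is the error that appears once the fault is activated, and a failure happens only when that error reaches the caller.

```python
def average_latency(samples):
    """Return the mean of the given latency samples.

    Fault: there is no guard for an empty list. The defect stays dormant
    for as long as every caller passes at least one sample.
    """
    return sum(samples) / len(samples)


def report_unmasked(samples):
    """Expose the result directly.

    If the fault is activated, the resulting error propagates to the
    caller, which is a failure in the sense used above.
    """
    return average_latency(samples)


def report_masked(samples, fallback=0.0):
    """Expose the result, but suppress the error inside the service."""
    try:
        return average_latency(samples)
    except ZeroDivisionError:
        # The fault became active and produced an error, but the error is
        # handled here, so nothing abnormal is visible outside the service.
        return fallback


if __name__ == "__main__":
    print(report_masked([10, 20]))  # fault stays dormant
    print(report_masked([]))        # error occurs but is masked
    try:
        report_unmasked([])         # error escapes: a failure
    except ZeroDivisionError as exc:
        print("failure visible to caller:", exc)
```

Whether returning a fallback value is the right way to mask an error depends on the service; the point of the sketch is only where the boundary between error and failure lies.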


Pathology of a Failure