May 11, 2016

Designing Services for Resilience

This technique falls under the fault prevention and fault removal categories... In my opinion, scalability is a commodity now. Resiliency and intelligence are the next frontiers.


May 8, 2016

Chaos engineering for resilient distributed systems

Errors are a fact of life in large-scale distributed systems. While a system can have errors, they are significant only if they interrupt the service, i.e., produce a failure. The goal of resilient systems is to mask these failures and continue functioning at an acceptable level.

One key component in the development and evaluation of such a resilient system is having the ability to simulate faults in a controlled manner.

Chaos engineering is basically about studying the system empirically: simulating chaos, evaluating the system's resilience and building confidence in its capability to withstand real-world turbulent conditions.


Many developers don't distinguish between faults, errors and failures, but this distinction is fundamental to engineering chaos in a distributed system. A fault is a defect that exists in a service and may be either "active" or "dormant". An error occurs when a fault becomes active. A failure occurs when the error is not suppressed and becomes visible outside the service.


[Figure: Pathology of a Failure]
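To make the fault/error/failure chain and the idea of controlled fault simulation a little more concrete, here is a small, purely illustrative Python sketch; it is my own, not taken from any particular chaos tool, and names like InjectedFault and flaky_dependency are hypothetical. A fault is injected into a dependency call with some probability, a retry layer suppresses most of the resulting errors, and only the errors that escape the retries surface as failures. Running the loop is the "empirical study" part: it measures how well the system masks the injected faults.

    import random

    class InjectedFault(Exception):
        """A simulated, controlled fault (e.g. a dropped connection)."""

    def flaky_dependency(fault_rate=0.3):
        # The dormant fault is activated with some probability; when it is,
        # the call produces an error.
        if random.random() < fault_rate:
            raise InjectedFault("simulated dependency error")
        return "ok"

    def resilient_call(retries=3):
        # The retry layer tries to suppress the error. If every attempt
        # errors out, the error escapes and becomes a failure visible
        # to the caller.
        for _ in range(retries):
            try:
                return flaky_dependency()
            except InjectedFault:
                continue
        raise RuntimeError("failure: error was not masked")

    # Empirically evaluate resilience under injected faults.
    failures = 0
    for _ in range(10_000):
        try:
            resilient_call()
        except RuntimeError:
            failures += 1
    print(f"observed availability: {1 - failures / 10_000:.4f}")

With a 30% injected fault rate and three attempts per call, roughly 0.3^3, or about 2.7%, of calls should still fail, and the measured availability should hover around 97%.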

Apr 30, 2016

How Complex Systems Fail

Came across this summary paper on the nature of failure, how failure is evaluated and how failure is attributed to causes. While the paper is written in the context of hospitals and patient safety, it is just as applicable to big data systems. Following are some highlights...

Complex systems are intrinsically hazardous systems
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature.
Complex systems are heavily and successfully defended against failure
The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
Catastrophe requires multiple failures - single point failures are not enough
The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.
Complex systems contain changing mixtures of failures latent in them.
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.
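A back-of-the-envelope illustration (mine, not from the paper) of the "multiple failures" point: if three independent defense layers each let a fault through 1% of the time, a fault slips past all of them only about once in a million attempts. Real layers are rarely that independent, which is exactly what the latent-failure point above warns about.

    per_layer_leak = 0.01            # chance a single defense layer fails to catch a fault
    layers = 3
    print(per_layer_leak ** layers)  # 1e-06, assuming (optimistically) independent layers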

Apr 20, 2016

Designing distributed applications with code mobility paradigms

When we work on a distributed system, we often assume during the design phase that the location of code is static. In other words, once a component is created, it can change neither its location nor its code during its lifetime.

But there are also scenarios where considering code location and mobility during the design phase makes a huge difference to the underlying distributed application in terms of fault tolerance, concurrency, lower latency and higher flexibility.

The rest of this post covers the two notions of code mobility, the various code mobility paradigms, and some scenarios where a distributed application can benefit considerably by exploiting mobile code.
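As a minimal illustration of one such paradigm, remote evaluation, here is a Python sketch that ships a small piece of code to the node holding the data instead of pulling the data to the client. The DataNode class and the code-as-string transport are my own simplifications; a real system would add serialization, sandboxing and authentication.

    import textwrap

    class DataNode:
        """Simulates a remote node that holds a large dataset."""

        def __init__(self, records):
            self.records = records

        def evaluate(self, source):
            # Run the shipped code against the local data; only the small
            # result has to travel back to the caller.
            namespace = {}
            exec(source, namespace)
            return namespace["compute"](self.records)

    node = DataNode(records=list(range(1_000_000)))

    # Remote evaluation: the client ships a small function instead of
    # pulling a million records across the network.
    shipped_code = textwrap.dedent("""
        def compute(records):
            return sum(r for r in records if r % 2 == 0)
    """)

    print(node.evaluate(shipped_code))   # only the aggregate crosses the "wire"

The same shape underlies stored procedures and the "move computation to the data" principle in MapReduce-style systems.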

Apr 16, 2016

Understanding RBAC (Role based Access Control) model

Nearly all enterprise and large-scale systems need authentication and authorization for using the underlying system resources. The key model, and probably the most popular one, is RBAC.

The central idea in RBAC is:
  • Permissions (or permission sets) are associated with roles. 
  • Users (or user groups) are associated with roles.
  • A permission is basically an approval of an operation on one or more objects.
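To make these associations concrete, here is a minimal in-memory sketch in Python; the Role, User and has_permission names are illustrative and not tied to any particular framework.

    class Role:
        def __init__(self, name, permissions):
            self.name = name
            self.permissions = set(permissions)   # e.g. {("read", "report")}

    class User:
        def __init__(self, name, roles):
            self.name = name
            self.roles = roles                    # users are associated with roles

        def has_permission(self, operation, obj):
            # A user is authorized if any of their roles grants approval for
            # the (operation, object) pair.
            return any((operation, obj) in role.permissions
                       for role in self.roles)

    analyst = Role("analyst", [("read", "report")])
    admin = Role("admin", [("read", "report"), ("delete", "report")])

    alice = User("alice", [analyst])
    bob = User("bob", [admin])

    print(alice.has_permission("delete", "report"))   # False
    print(bob.has_permission("delete", "report"))     # True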

Apr 9, 2016

Software Resilience Patterns

Good slide deck on patterns for software resiliency...



Apr 6, 2016

Techniques to ensure eventual consistency

Recently I came across a paper on eventual consistency that touches, among other things, on techniques to ensure eventual consistency for updates to replicas. While the term eventual consistency is fairly popular nowadays, I am not sure some of its subtleties are equally well known. So, a small detour before diving into the techniques themselves.

Eventual consistency basically guarantees that if no further updates are made to a given data item, then all reads of that data item will eventually return the same value. 

What the above also means is that, in an eventually consistent system:

  • The system can return arbitrary data and still be eventually consistent, so the client has no way to know whether a given read response reflects correct behavior. 
  • If there are multiple concurrent updates, under eventual consistency you do not know which update eventually gets chosen; the order is unpredictable. The only guarantee is that there will eventually be convergence. 
  • In short, eventual consistency tells us that something good will happen eventually, but there are no guarantees about what happens in the interim and no behavior is ruled out in the meantime!
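As one illustration of how replicas can be driven to that eventual convergence (my own sketch, not taken from the paper), here is a last-writer-wins register combined with pairwise state exchange; the timestamps are supplied by the caller to keep the example deterministic.

    class LWWRegister:
        def __init__(self):
            self.value = None
            self.timestamp = -1

        def write(self, value, timestamp):
            # Keep the write with the highest timestamp; a real system would
            # break ties with a replica id.
            if timestamp > self.timestamp:
                self.value, self.timestamp = value, timestamp

        def merge(self, other):
            # Anti-entropy style exchange: merging just replays the other
            # replica's latest write, so the merge order does not matter.
            self.write(other.value, other.timestamp)

    a, b = LWWRegister(), LWWRegister()
    a.write("red", timestamp=1)    # two concurrent updates land on different replicas
    b.write("blue", timestamp=2)

    a.merge(b)                     # until this exchange, reads of a and b disagree
    b.merge(a)
    print(a.value, b.value)        # both replicas converge on "blue"

The client-visible subtleties from the list above show up here too: until the replicas exchange state, reads of a and b disagree, and which of the concurrent writes "wins" is decided only by the timestamp ordering.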