Apr 30, 2016

How Complex Systems Fail

Came across this summary paper on the nature of failure, how failure is evaluated, and how failure is attributed to causes. While the paper is written in the context of hospitals and patient safety, it is applicable to big data systems as well. Following are some highlights...

Complex systems are intrinsically hazardous systems
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature.
Complex systems are heavily and successfully defended against failure
The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
Catastrophe requires multiple failures - single point failures are not enough
The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.
Complex systems contain changing mixtures of failures latent within them
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.

Apr 20, 2016

Designing distributed applications with code mobility paradigms

When we are working on distributed systems, many times during the design phase we automatically assume the location of code is static. In other words, once a component is created, it cannot change either its location or its code during its lifetime.

But there are also scenarios where considering code location and mobility during the design phase makes a big difference to the underlying distributed application in terms of fault tolerance, concurrency, lower latency and higher flexibility.

The rest of this post covers two notions of code mobility, various code mobility paradigms, and some scenarios where a distributed application can benefit considerably by exploiting mobile code paradigms.

Apr 16, 2016

Understanding RBAC (Role based Access Control) model

Nearly all enterprise and large-scale systems need authentication and authorization for using the underlying system resources. The key model, and probably the most popular one, is RBAC.

The central idea in RBAC is
  • Permissions (or Permission sets) are associated with Roles. 
  • Users (or User Groups) are associated with Roles.
  • A permission is basically an approval of an operation on one or more objects.
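
A minimal sketch of these relationships in Python might look like the following. The names here (Permission, Role, User, is_authorized) are illustrative only, not taken from any particular framework.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Permission:
    operation: str   # e.g. "read", "write"
    obj: str         # the object the operation applies to, e.g. "invoice"

@dataclass
class Role:
    name: str
    permissions: set = field(default_factory=set)   # set of Permission

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)          # names of assigned roles

def is_authorized(user, roles_by_name, operation, obj):
    # A user is authorized if any of their roles carries the needed permission.
    needed = Permission(operation, obj)
    return any(needed in roles_by_name[r].permissions for r in user.roles)

# Permissions attach to roles; users attach to roles.
auditor = Role("auditor", {Permission("read", "invoice")})
clerk = Role("clerk", {Permission("read", "invoice"), Permission("write", "invoice")})
roles_by_name = {r.name: r for r in (auditor, clerk)}

alice = User("alice", roles={"auditor"})
print(is_authorized(alice, roles_by_name, "read", "invoice"))   # True
print(is_authorized(alice, roles_by_name, "write", "invoice"))  # False

Note that the user never holds permissions directly; changing what auditors can do only requires editing the auditor role.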

Apr 9, 2016

Software Resilience Patterns

Good slide deck on patterns for software resiliency...

Apr 6, 2016

Techniques to ensure eventual consistency

Recently I came across a paper on eventual consistency that touches, among other things, on techniques to ensure eventual consistency for updates to replicas. While the term eventual consistency is fairly popular nowadays, I am not sure some of its subtleties are equally well known. So a small detour before diving into the techniques to ensure eventual consistency.

Eventual consistency basically guarantees that, if there are no additional updates to a given data item, all reads for that data item will eventually return the same value.

What the above also means is that in an eventually consistent system,

  • The system can return any arbitrary data and still be eventually consistent, so the client has no way to know whether a given read response is wrong behavior. 
  • If there are multiple concurrent updates, under eventual consistency you do not know which update eventually gets chosen. The order is unpredictable; the only guarantee is that the replicas will eventually converge. 
  • In short, eventual consistency tells us that something good will happen eventually, but there are no guarantees as to what happens in the interim and no behavior is ruled out in the meantime!
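
To make the convergence guarantee concrete, here is a minimal sketch of one common way to bring replicas back together: an anti-entropy (merge) step with a last-writer-wins rule. This is my own illustration rather than the paper's exact mechanism, and the timestamps are simplified values supplied by the caller.

class LWWReplica:
    """A replica holding a single (timestamp, value) pair."""
    def __init__(self, name):
        self.name = name
        self.timestamp = 0
        self.value = None

    def write(self, value, timestamp):
        # Keep the write only if it is newer than what this replica has.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def merge(self, other):
        # Anti-entropy step: take the peer's state if it is newer.
        self.write(other.value, other.timestamp)

# Two replicas receive conflicting concurrent updates...
r1, r2 = LWWReplica("r1"), LWWReplica("r2")
r1.write("red", timestamp=1)
r2.write("blue", timestamp=2)

# ...so reads can disagree for a while (the "anything can happen" window).
print(r1.value, r2.value)   # red blue

# After one round of anti-entropy in each direction, both replicas converge
# on the update with the higher timestamp, whichever replica is read.
r1.merge(r2)
r2.merge(r1)
print(r1.value, r2.value)   # blue blue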

Apr 4, 2016

Consistency considerations in distributed data stores

In a large distributed system, network partitions are a given. This means we cannot achieve both consistency and availability (CAP theorem). So our choices are either to relax consistency so the system remains highly available under partitions, or to make consistency a priority, in which case the system will not be available under certain conditions. Both of these options require developers to be aware of what the system is offering.

For example, if the system emphasizes consistency, then the developer has to deal with availability issues. So if an update fails because of system unavailability, the developer needs to plan what to do with that update.

On the other hand, if the system emphasizes availability, then the developer should assume there will be times when reads will not return the latest updates. The application needs to be tolerant of this, i.e., work with slightly stale data.
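
A small sketch of these two obligations, written against a hypothetical key-value client (store.put and store.get are assumed names, not a real library's API), might look like this.

import time

def write_with_retry(store, key, value, attempts=3, backoff_seconds=0.5):
    # Consistency-first system: an update can fail while the store is
    # unavailable, so the developer must decide what to do with it
    # (here: retry with backoff, then surface the failure to the caller).
    for attempt in range(attempts):
        try:
            store.put(key, value)
            return True
        except ConnectionError:
            time.sleep(backoff_seconds * (2 ** attempt))
    return False  # caller must queue the update, drop it, or report an error

def read_tolerating_staleness(store, key, default=None):
    # Availability-first system: a read may return an older value, so the
    # application works with possibly stale data instead of assuming the
    # latest update is visible.
    record = store.get(key)  # may be stale under a partition
    return record if record is not None else default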

Apr 2, 2016

Many faces of replication...

Recently I was looking into replication and consistency papers and thought it would be a good topic to summarize. Replication is one of the most studied topics in distributed systems and is quite an important tool for the designer.

It improves system availability by removing single points of failure, improves performance by reducing communication overheads and improves scalability by enabling the system to grow with acceptable response times. But the benefits of replication come with their own challenges. Nothing comes for free in distributed systems...

For example, some of the challenges anyone dealing with replication has to address are:

  • How to manage the updates, i.e., the replication strategy 
  • Data consistency and availability tradeoffs 
  • How to handle downtime during new replica creation 
  • Maintenance overhead 
  • Lower write performance, etc.