Mar 2, 2016

Designing scalable failure detection for distributed sensor nodes

Distributed systems need to provide reliable and continuous service despite failure of some of the components or nodes. So failure detection is one of the key aspects in building these systems. Also we generally will need failure detection for lot of other stuff like consensus, group membership, high availability etc.

The challenge is designing failure detection in a way that is both accurate and efficient is not an easy task though. One reason is the delays in a network are unpredictable. Another reason is the FLP impossibility theorem i.e., in any asynchronous distributed systems, it is impossible to determine accurately whether a remote process failed or has just taking its own time to respond. 

Bright side, a node really don't need to know if a remote process has failed or just taking its own time. As long as we have a way for all nodes to come to same conclusion (i.e., failed or delay) about a remote process then we can work from there. You do this by assuming an upper bound on the delays (ex: timeouts) after which a remote process is considered as failed. In academic speak, it is called a partially synchronous model

In rest of the post we go thru 

  1. Eight factors to consider in designing failure detection, 
  2. Four key metrics to consider in evaluating failure detection and 
  3. Static, Lazy, Gossip, Hierarchical, Adaptive and Accrual based failure detection techniques.
  4. Failure detection approaches in some popular distributed systems.