Mar 2, 2016

Designing scalable failure detection for distributed sensor nodes

Distributed systems need to provide reliable and continuous service despite the failure of some of their components or nodes, which makes failure detection one of the key aspects of building these systems. We also generally need failure detection for a lot of other things, such as consensus, group membership, and high availability.

Designing failure detection in a way that is both accurate and efficient is not an easy task, though. One reason is that delays in a network are unpredictable. Another reason is the FLP impossibility result: in an asynchronous distributed system, it is impossible to determine with certainty whether a remote process has failed or is just taking its own time to respond.

On the bright side, a node does not really need to know whether a remote process has failed or is just taking its time. As long as all nodes have a way to come to the same conclusion (i.e., failed or delayed) about a remote process, we can work from there. We do this by assuming an upper bound on the delays (e.g., a timeout) after which a remote process is considered failed. In academic speak, this is called a partially synchronous model.

In the rest of the post we go through:

  1. Eight factors to consider in designing failure detection, 
  2. Four key metrics to consider in evaluating failure detection,
  3. Static, Gossip, Hierarchical, Lazy, Adaptive, and Accrual-based failure detection techniques, and
  4. Failure detection approaches in some popular distributed systems.

2) Design considerations for failure detection functionality...

In this article, let's assume a failure detector is an application responsible for detecting node failures or crashes in the distributed system. Following are some factors I think we need to consider while designing any failure detection functionality...
  • Interaction Model
    • Do we use pull, push, or push-pull based monitoring for failure detection? What about lazy monitoring, i.e., leveraging existing communication between the applications and falling back to the above failure detection interactions only when needed?
  • Static vs Adaptive
    • Do we use static or adaptive monitoring rules for failure detection? 
    • Similarly do we use static or adaptive thresholds for failure detection?
  • Active vs Passive -
    • Do we continuously send keep-alive messages (i.e., Active)? Or do we piggyback on application messages (i.e., Passive/Lazy)?
    • If data traffic is frequent, it is usually sufficient for failure detection, although there are situations where a purely Passive approach is not enough.
  • Failure Interpretation
    • Do we interpret failures in binary terms (i.e., up or down) or in probabilities, i.e., the accrual model? In the accrual model, clients are free to set their own thresholds to determine whether a node is up or down.
  • Centralized vs Distributed 
    • Do we use centralized failure detection architecture or do we use distributed architecture? 
    • Centralized failure detectors are easy to maintain but are a single point of failure. A distributed architecture improves availability of the service but is more of a pain to maintain.
  • Decision Making -  
    • Do we make failure decisions in isolation or do we utilize quorum/peer confirmation? Quorum helps address split-brain scenarios.
  • Peer Selection Patterns -
    • Do we use an all-to-all approach? In this model, each node monitors all other nodes for failure detection. With a small number of processes this method is fine, but the network load grows quadratically as the number of processes increases.
    • Do we use a randomized approach? In this model, each node randomly selects a subset of peer nodes to monitor for failure detection. Detection time depends on the probability of being randomly selected, so as the number of failure detectors increases, the probability of a node not being monitored decreases accordingly. Epidemic-based algorithms typically fall into this category. Gossiping protocols are a special kind of randomized failure detector based on heartbeats.
    • Do we use a neighborhood-based approach? In this model, we organize the system into localized groups to take advantage of locality, and each node monitors only its local peers for failure detection. Neighbors are selected statically and do not change over time unless a failure is detected. Typically this model is used when the underlying system spans multiple data centers.
  • Propagation Patterns -
    • Once a failure is detected, how do we propagate that news within the system? 
    • Do we do one-to-all propagation? This is fine for small systems, but large or geographically separated systems will have a problem: network traffic can be intense if failures and churn are frequent. IP multicast is one consideration.
    • Do we do gossip-based propagation? The probability that a process has not yet received an update decreases exponentially with the number of gossip rounds, so news spreads quickly even in large systems.
    • Do we do structured propagation? For example, in circular failure detectors the nodes are arranged in a virtual ring and communication happens only between adjacent nodes. In hierarchical failure detectors, the nodes are arranged in a multi-level hierarchy and failures are reported along a tree to improve scalability. Typically, nodes in large networks are structured in a hierarchical layout.

Now that we have gone through the various design considerations, next on our list are the QoS metrics we should consider in designing a failure detector.

3) Key metrics to consider...

The metrics below help us understand how fast and how accurate a failure detector is. The first metric deals with speed and the remaining three deal with accuracy. Note: these metrics are applicable to all types of failure detection functionality. (A small sketch of how the accuracy metrics might be computed follows the list.)
  • Detection Time -
    • Basically, the time it takes to detect a failure, i.e., from the time a process crashes to the time we detect that the failure has occurred.
    • The lower bound for detection time is "send computation delay" + "network delay" + "receive computation delay". 
    • In most networks, the "network delay" behaves like a random variable and is pretty difficult to predict.
  • Average Mistake Rate -
    • The rate at which our failure detection logic mistakenly decides that a process/node has crashed.
    • This is a useful metric, especially for long-lived applications or big data systems (NoSQL stores, in-memory grids, etc.) where costly operations like repartitioning of data happen on node failures.
    • Network unreliability affects accuracy: delayed or dropped messages cause our failure detector to mistakenly suspect healthy nodes.
  • Average Mistake Duration -
    • This is basically the average time it takes for our failure detector to correct a mistake about a node's availability.
    • This metric is useful if we have components that work in degraded mode (e.g., going into read-only mode) when a node in the cluster is down. If you are using quorums in your system, that is another area where this metric is useful.
  • Query Accuracy Probability -
    • This is basically the probability that our failure detector returns a correct response when queried at a random time.
    • This metric is useful if we have components in the system that query for node availability; the reliability of those components is influenced by this metric.
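To make the accuracy metrics concrete, here is a small illustrative sketch (the function and the event-log format are my own, not from any standard tool) that computes them from a log of a detector's suspect/trust transitions over a window in which the monitored node was actually alive. Detection time would additionally need the real crash time, so it is not covered here.

```python
# Hypothetical sketch: computing the accuracy metrics from a detector's output.
# `transitions` is a sorted list of (timestamp_seconds, state) pairs, where state
# is "suspect" or "trust", observed while the monitored node was actually alive.

def accuracy_metrics(transitions, window_start, window_end):
    mistakes = 0              # number of wrong "suspect" transitions
    mistake_time = 0.0        # total time spent wrongly suspecting the node
    state, since = "trust", window_start
    for ts, new_state in transitions:
        if state == "suspect":
            mistake_time += ts - since
        if new_state == "suspect" and state == "trust":
            mistakes += 1
        state, since = new_state, ts
    if state == "suspect":
        mistake_time += window_end - since
    duration = window_end - window_start
    return {
        "avg_mistake_rate": mistakes / duration,            # mistakes per second
        "avg_mistake_duration": mistake_time / mistakes if mistakes else 0.0,
        "query_accuracy_probability": 1.0 - mistake_time / duration,
    }

# Example: two false suspicions over a 100 second window.
print(accuracy_metrics([(10, "suspect"), (12, "trust"),
                        (50, "suspect"), (55, "trust")], 0, 100))
```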
Next on our list are the failure detection techniques. It is tempting to ignore the above and jump straight into the techniques. I did that. Although now I think it is good to know the above as background first.

4) Failure Detection Techniques


Failure detection is a fairly rich field. I am certainly not an expert, but I think the following covers fairly good ground. We go through the various failure detection categories and the strengths and drawbacks of the techniques in each category. This does not mean any one category or technique is better than another; each technique is a better fit in some situations and not much of a fit in others.


The categories we touch on in this article are -

  1. Static failure detection techniques
  2. Gossip style failure detection techniques
  3. Hierarchical failure detection techniques
  4. Lazy failure detection techniques
  5. Adaptive failure detection techniques and
  6. Accrual failure detection techniques

4a) Static failure detection techniques...


The Push Model

In this model, the monitoring node (our failure detector) is a passive entity. Each component periodically sends messages (heartbeats) to the failure detector. The failure detector suspects that a component has failed if it does not receive a heartbeat message within a certain time interval.





When we use this approach in a distributed setting, the push model does not require all processes to monitor all other processes. Also, compared to the pull model, the push model requires only half the number of network messages to implement failure detection.
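To make the push model concrete, here is a minimal sketch (the class and method names are my own, not from any particular system): components report heartbeats and the detector suspects anything it has not heard from within a fixed timeout.

```python
import time

# Minimal sketch of a push-model failure detector: components call record_heartbeat()
# (e.g., from a network handler), and the detector suspects anyone it has not heard
# from within `timeout` seconds.
class PushFailureDetector:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heartbeat = {}          # component id -> last heartbeat arrival time

    def record_heartbeat(self, component_id):
        self.last_heartbeat[component_id] = time.monotonic()

    def suspected(self, now=None):
        now = now or time.monotonic()
        return [c for c, t in self.last_heartbeat.items() if now - t > self.timeout]
```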

The Pull Model

In this model, the monitoring node (our failure detector) is an active entity. Our failure detector periodically sends "are you alive?" requests to components. If a component responds before a timeout, the component is considered alive. Otherwise the component is suspected to have failed.
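The pull model is the mirror image: the detector actively probes each component and suspects those that do not answer in time. A minimal sketch, assuming a plain TCP connect stands in for the "are you alive?" request and using placeholder addresses:

```python
import socket

# Minimal sketch of a pull-model failure detector: a TCP connect acts as the
# "are you alive?" request; failure to connect within the timeout means "suspect".
def probe(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True                   # component answered: considered alive
    except OSError:
        return False                      # no answer before timeout: suspected

# Placeholder component addresses for illustration.
components = {"node-a": ("10.0.0.1", 7000), "node-b": ("10.0.0.2", 7000)}
suspects = [name for name, addr in components.items() if not probe(*addr)]
print("suspected:", suspects)
```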




The Push-Pull Model
When we have a heterogeneous environment, it is often necessary to utilize both push and pull based failure detection. In this model, our failure detector relies on heartbeat messages (push model) but also sends "are you alive?" requests (pull model) where needed.



For example, we may need this kind of failure detection if our distributed system spans multiple data centers separated by a WAN. The idea is that within a data center one can efficiently synchronize time, so we utilize the push model. Across data centers, it is hard to keep time reliably and to deliver heartbeat messages periodically, so we utilize the pull model for failure detection across data centers.


4b) Gossip Style failure detection techniques...

The technique of gossiping is a popular approach for spreading information rapidly in a large distributed system. Protocols based on this approach are also sometimes called epidemic protocols.

This model relies on a variation of heartbeats. Instead of broadcasting a heartbeat to every other node, a node periodically gossips its entire membership list to one of the other nodes chosen at random. The gossip period is typically much more frequent than a heartbeat interval in static failure detection techniques.

The receiving node reconciles the membership lists by keeping, for each entry, whichever version is the latest. If after a fail-time period there has been no updated heartbeat number for a particular node, that node is deemed failed.
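Here is a minimal sketch of that bookkeeping (names and constants are illustrative, not from a specific implementation): each node bumps its own heartbeat counter, gossips its list to one random peer, merges incoming lists by keeping the higher counter, and fails entries that have not advanced within the fail-time period.

```python
import random
import time

# Minimal sketch of gossip-style failure detection bookkeeping. Each node keeps a
# membership entry of [heartbeat counter, local time last updated] per peer.
T_FAIL = 10.0   # illustrative fail-time period in seconds

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                                  # ids of other nodes
        self.members = {node_id: [0, time.monotonic()]}     # id -> [counter, last_update]

    def gossip_round(self, send):
        self.members[self.node_id][0] += 1                  # bump own heartbeat counter
        self.members[self.node_id][1] = time.monotonic()
        target = random.choice(self.peers)                  # pick one random peer
        send(target, {nid: counter for nid, (counter, _) in self.members.items()})

    def on_gossip(self, remote_counters):
        now = time.monotonic()
        for nid, counter in remote_counters.items():
            local = self.members.get(nid)
            if local is None or counter > local[0]:         # remote entry is newer
                self.members[nid] = [counter, now]

    def failed_nodes(self):
        now = time.monotonic()
        return [nid for nid, (_, updated) in self.members.items()
                if now - updated > T_FAIL]
```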

A couple of variations of gossip-style failure detectors -
  • Basic Gossiping
    • In the basic gossiping mode, the failure detector is present on each host in the network. It keeps track of every other failure detector it knows about and the last time it heard from them.
    • At regular intervals, each failure detector randomly picks another detector and sends its list to it, regardless of the underlying network topology.
  • Multi-Level Gossiping
    • This variant is aimed at large-scale networks. Typically the hierarchy is defined using the structure of internet domains and sub-domains. The longer the common prefix of the IP addresses of two failure detectors, the closer they are in the hierarchy.
    • In this model, most of the gossip messages are sent within a subnet. Then fewer gossip messages are sent across different subnets. Then even fewer gossip messages are sent across domains.
  • Round-Robin Gossiping
    • In this variant, instead of choosing a failure detector (to gossip the membership list to) at random, we choose it deterministically. For example, we pick detectors sequentially or in a binary round-robin pattern.
    • The main advantage of this approach is that, because nodes are picked deterministically, we can immediately fail a node if an expected gossip message does not arrive.



Advantages...
Because each node gossips to only one random node (or a small fixed set of random nodes) per round, the per-round message count drops from quadratic (all-to-all) to linear. This avoids message explosion.

Also, one of the strengths of gossip-style failure detection is that it is quite resilient to changes in the underlying network topology. This is a fairly important factor.

Drawbacks...
While the above is great, there are a couple of drawbacks to gossip-style failure detectors. They do not work well when a large percentage of components crash or become partitioned away; the detector may then take a long time to detect crashed nodes via gossip messages.

Another drawback is that this approach is less efficient than approaches based on hierarchical failure detection techniques.

4c) Hierarchical failure detection techniques...

When the number of nodes in a system becomes very high, it is impractical to have all of them monitor each other. To address this, the approach typically taken is to form some sort of hierarchy (tree, forest) for monitoring and for propagating information about failures.

There are various approaches, but in general the hierarchy is assumed to closely match the underlying physical topology of the network. In the hierarchical approach, each failure detector monitors nodes directly or indirectly through underlying failure detectors. This reduces communication by combining information about several nodes and by temporarily caching information at several places in the system.
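As a rough sketch (the class names are mine, not a specific system's API), a subnet-level detector can aggregate the status of its local nodes and report a single summary upward, so that cross-subnet traffic stays small:

```python
# Rough sketch of hierarchical aggregation: each subnet-level detector monitors its
# local nodes however it likes (push, pull, gossip) and exposes only a small
# aggregated summary, which a higher-level detector collects across subnets.
class SubnetDetector:
    def __init__(self, subnet_id, local_suspects):
        self.subnet_id = subnet_id
        self.local_suspects = local_suspects      # callable returning failed local nodes

    def summary(self):
        # One small message per subnet instead of one message per node.
        return {"subnet": self.subnet_id, "failed": self.local_suspects()}

class TopLevelDetector:
    def __init__(self, subnet_detectors):
        self.subnet_detectors = subnet_detectors

    def failed_nodes(self):
        # In a real deployment each summary() call would be a cross-subnet message.
        return {s["subnet"]: s["failed"]
                for s in (sd.summary() for sd in self.subnet_detectors)}

# Usage with the push-style detector sketched earlier:
#   lan1 = SubnetDetector("lan-1", push_detector.suspected)
#   top  = TopLevelDetector([lan1, lan2, lan3])
```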

For example, in the below image we have 3 LANs, and each subnet has its own failure detectors. All monitoring is done only within a subnet; only messages between failure detectors are sent across subnets. The same goes for client communication, i.e., from failure detectors to client applications.



4d) Lazy failure detection techniques...

In this model, the nodes monitor each other via application messages: the failure detector piggybacks on application messages whenever possible to reduce the number of monitoring messages sent across the network.


If there are no application messages being exchanged, the failure detector sends control messages (something like ping messages) for failure detection.

The primary benefits of the lazy failure detection technique are reduced message explosion (think grid/cloud environments) and faster failure detection (assuming frequent application messages). On the other hand, this technique depends heavily on the communication patterns between applications and may perform poorly in some scenarios.
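A minimal sketch of the lazy idea (illustrative only, with a placeholder transport hook): every application message counts as liveness evidence, and an explicit ping is sent only when a peer has been quiet for too long.

```python
import time

# Minimal sketch of lazy failure detection: application traffic doubles as liveness
# evidence; an explicit control ping is sent only when a peer has been silent longer
# than `quiet_limit`, and a timeout on that ping raises suspicion.
class LazyFailureDetector:
    def __init__(self, quiet_limit=5.0, ping_timeout=2.0, send_ping=None):
        self.quiet_limit = quiet_limit
        self.ping_timeout = ping_timeout
        self.send_ping = send_ping or (lambda peer: None)   # placeholder transport hook
        self.last_seen = {}                                 # peer -> last message time
        self.pending_ping = {}                              # peer -> time ping was sent

    def on_application_message(self, peer):
        # Any application message (request or reply) counts as a heartbeat.
        self.last_seen[peer] = time.monotonic()
        self.pending_ping.pop(peer, None)

    def check(self):
        now, suspects = time.monotonic(), []
        for peer, seen in self.last_seen.items():
            sent = self.pending_ping.get(peer)
            if sent is not None and now - sent > self.ping_timeout:
                suspects.append(peer)                       # ping also went unanswered
            elif sent is None and now - seen > self.quiet_limit:
                self.send_ping(peer)                        # fall back to a control message
                self.pending_ping[peer] = now
        return suspects
```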

Problems...
If you notice, all of the above models assume some form of a timeout. The challenge with timeouts is:
  • If the timeout value is short, crashes are detected quickly but the mistake rate (i.e., mistaking a live node for a crashed one) will also be high.
  • If the timeout value is long, the mistake rate will be low, but at the expense of detection time.
Another issue with the above models is the fixed timeout and the assumption that we know the optimal timeout beforehand. That is a fairly big assumption. Why? A few reasons -
  • Network delays are unpredictable. The expected delay can vary significantly between low-traffic and high-traffic periods, and so can the variance of the delays.
  • The probability of message loss changes with network conditions.
  • Clock drift among the nodes is non-negligible, which makes it hard to measure these network delays reliably.

4e) Adaptive failure detection techniques...

Adaptive protocols are designed to adapt dynamically to their environment and in particular to adapt to changing network conditions. Typically there are 3 types of adaptation:

  1. Adapting to network conditions
  2. Adapting to application requirements
  3. Adapting to application behavior
Adapting to network conditions...
There are many adaptive strategies in this category, although most of them are variations of the heartbeat strategy. This does not mean we cannot use other strategies like interrogation.

The principal difference between heartbeat strategies in adaptive and static failure detection techniques is in how the timeout value is chosen. In adaptive failure detectors, the timeout value is modified dynamically according to network conditions.






Following are some of the more popular failure detection techniques in this category:
  • Chen Failure Detection
    • This technique is based on a probabilistic analysis of network traffic.
    • Basically, we sample the arrival times of recent heartbeat messages (i.e., a sliding window) and then compute an estimate of the arrival time of the next heartbeat.
    • The timeout is set according to this estimate plus some safety margin (a rough sketch follows this list).
  • Bertier Failure Detection -
    • This technique uses the same mechanism as Chen failure detection for estimating the expected arrival time of heartbeats. The difference is in the estimation algorithm.
    • The estimate of the next heartbeat arrival is computed using Jacobson's round-trip-time estimation (as used in the TCP protocol). This makes Bertier's failure detection technique more adaptive to network conditions.
    • This failure detector is principally intended for LAN environments.
  • Sotoma Failure Detection -
    • This technique computes a timeout based on the average time intervals of heartbeat messages + ratio between arrival intervals.
  • Fetzer Failure Detection -
    • This technique computes the timeout based on the maximum arrival interval of heartbeat messages.
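Here is the rough sketch promised above, approximating Chen-style estimation (window size and safety margin are illustrative): keep a sliding window of heartbeat arrival times, predict the next arrival from the mean inter-arrival time, and add a safety margin alpha.

```python
from collections import deque

# Rough sketch of a Chen-style adaptive timeout: the expected arrival of the next
# heartbeat is estimated from a sliding window of recent arrival times, and the
# deadline is that estimate plus a constant safety margin (alpha).
class ChenEstimator:
    def __init__(self, window_size=100, alpha=0.5):
        self.arrivals = deque(maxlen=window_size)   # recent heartbeat arrival times
        self.alpha = alpha                          # safety margin in seconds

    def record_arrival(self, t):
        self.arrivals.append(t)

    def next_deadline(self):
        if len(self.arrivals) < 2:
            return None                             # not enough samples yet
        samples = list(self.arrivals)
        intervals = [b - a for a, b in zip(samples, samples[1:])]
        mean_interval = sum(intervals) / len(intervals)
        return samples[-1] + mean_interval + self.alpha

    def suspect(self, now):
        deadline = self.next_deadline()
        return deadline is not None and now > deadline
```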

Adapting to application requirements...
Sometimes we need to adapt our failure detection technique based on the application requirements. For example, say our distributed system has two applications -
  • An interactive application and
  • A database application that manages multiple terabytes of data.
Let's assume both of the above applications rely on the same system-wide failure detection. Now the challenge is:
  • For the Interactive Application, we need failures to be detected immediately. Otherwise it will affect user experience.
  • For the Database Application, we need failures to be detected accurately. Otherwise we will end up causing a lot of re-partitioning churn every time.
In this scenario, we need our failure detectors to emphasize different properties of failure detection depending on the application's requirements.

The adaptive failure detectors mentioned above can be tailored by using QoS requirements as one of the inputs to the failure detector's parameters. The failure detectors can then adapt their timeout values to changing network conditions while at the same time doing their best to meet the QoS requirements. Many times that is sufficient.


One problem with the above approach, though, is what happens if the application's failure detection requirements change dynamically, i.e., the application requires faster failure detection during certain periods of the day and is more relaxed at other times. Another big drawback is that most of the above failure detectors are designed with a single application in mind. What if you have two micro-services or applications with different failure detection requirements?


Hint: We will address this under accrual based failure detection techniques. 


Adapting to application behavior...

What if the system requires us to adapt not only to changing network conditions but also to changing behavioral patterns of the application?

For example, say we have a sensor node that is not being monitored by any other node in the system. Would it make sense for the sensor node to temporarily stop sending useless heartbeat messages?


Adhoc heartbeat failure detector -

This technique basically involves designating some of the messages exchanged by the application as critical messages. The idea is that whenever a node (say, the receiver) gets a critical message, it is an indication that some node has started monitoring it, so the receiver starts sending heartbeat messages. When the receiver is done with the work (or receives another critical message), it stops sending heartbeat messages.
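A minimal sketch of this behavior (illustrative, not from a specific implementation): heartbeating is switched on when a critical message arrives and switched off again when the associated work is done.

```python
import threading

# Minimal sketch of an adhoc heartbeat sender: heartbeats are emitted only while the
# node believes someone is monitoring it, i.e., between receiving a critical message
# and finishing the associated work.
class AdhocHeartbeater:
    def __init__(self, send_heartbeat, interval=1.0):
        self.send_heartbeat = send_heartbeat    # transport hook, e.g., a UDP send
        self.interval = interval
        self._timer = None

    def on_critical_message(self):
        # Someone started monitoring us: begin heartbeating if not already doing so.
        if self._timer is None:
            self._tick()

    def on_work_finished(self):
        # Nobody needs our heartbeats anymore: stop sending them.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None

    def _tick(self):
        self.send_heartbeat()
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True
        self._timer.start()
```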

4f) Accrual failure detection techniques...


To be truly generic, a failure detection service must be equally adaptive to

  1. Changing network conditions,
  2. Changing application requirements and
  3. Changing application behaviors.
These conflicting requirements cannot be handled by timeout-based failure detectors, no matter how adaptive they are to changing network conditions. In addition to adapting to changing network conditions, adapting to changing application requirements and behaviors means the failure detector has to maintain and manage several timeout values on behalf of the applications. That is generally very difficult and error prone.

Another alternative is to let each application set its own timeout values. But if we do that, our failure detector loses the ability to adapt to changing network conditions. This is also not a good approach. So how do we solve this?

What if our detector, instead of providing a binary output (i.e., up or down), provides a probability estimate of failure on a continuous scale? The scale is basically the degree of suspicion about a node's status: the higher the suspicion, the higher the chance that the monitored node has crashed. The suspicion level is estimated from the time since the last heartbeat message was received. Would this help?



The above helps because each application can then set its own suspicion threshold to determine a node's failure status. The applications can also change their suspicion thresholds according to their runtime behavior. As you can see, the interaction between the failure detector and the application in this model is fundamentally different from traditional failure detection models.

Another way to see this is that any failure detector typically has 3 parts -
  1. Monitoring component to gather information about nodes usually via network
  2. Interpretation component to determine the failure status based on gathered data
  3. Action component to execute actions based on the failure status of a process.
The fundamental difference is that in traditional failure detectors the monitoring and interpretation components are tightly coupled, whereas in an accrual failure detector they are decoupled: the monitoring is done by the failure detector while the interpretation is done by the application.


The method used for estimating the suspicion level (i.e., Phi) is simple. It involves 3 phases (a minimal sketch follows the list) -
  1. Heartbeats arrive and their arrival times are stored in a sampling window.
  2. Past samples are used to determine the distribution of inter-arrival times.
  3. The distribution is in turn used to calculate the current suspicion level.
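A minimal sketch of these three phases, using the exponential approximation that some implementations adopt in place of the original normal-distribution model (window size and the example threshold are illustrative):

```python
import math
from collections import deque

# Minimal sketch of a phi accrual failure detector: heartbeat inter-arrival times are
# sampled into a window, and phi grows with the time elapsed since the last heartbeat
# relative to the observed mean inter-arrival time. Applications pick their own phi
# threshold (e.g., suspect at phi > 8).
class PhiAccrualDetector:
    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)   # sampled inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Exponential approximation: P(heartbeat still to come) = exp(-elapsed / mean),
        # and phi = -log10 of that probability.
        return elapsed / (mean * math.log(10))

# Usage: the application decides for itself what "failed" means, e.g.
#   if detector.phi(time.monotonic()) > 8: treat the node as crashed.
```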
Following image summarizes the above steps -



Problems..
In some scenarios, the probabilistic behavior of the network (message delays, message loss, etc.) can vary. Accrual algorithms adapt their behavior to gradually changing network conditions, but there are times when network conditions change rapidly due to bursty traffic, e.g., on WAN links.

The challenge is that all of the above failure detection techniques that compute expected arrival times adapt well only provided:
  1. The occurrences of bursts are independent of each other and follow some slowly changing probability distribution, and
  2. The duration of each burst is short (i.e., smaller than the heartbeat interval).
Otherwise, these approaches cannot react quickly to sudden changes because they use a large history to compute their estimates of expected arrival times. For those situations, one would need to consider algorithms like the Two Windows Failure Detector (2W-FD).

5) Wrap up...

This article became way longer than I initially thought. If you have made it this far, you can see that designing good, scalable failure detection is not a trivial task.

For example, Facebook, in designing the Cassandra NoSQL database, chose to base their failure detection on the Phi Accrual failure detector only after finding for themselves that a gossip-based model would be too slow to detect failures. Given the number of users Facebook has, a too-slow failure detector would cause significant service disruptions for a significant number of people.

On the other hand, there is no one universal failure detector that is good for all situations. For example, Google's Chubby service for distributed locks extensively uses heartbeats to detect failures. The scale of Google's operation required them to utilize hierarchical failure detection to avoid crippling the system with a heartbeat message explosion on their networks.

In the Globus toolkit, the failure detector service has a two-layer architecture - the lower layer consists of local monitors and the upper layer consists of data collectors.
  • local monitors - These are responsible for monitoring the host and selected processes on that host. They periodically send heartbeat messages, including information on the selected processes, to the data collectors.
  • data collectors - These receive heartbeats from local monitors, identify failed components, and notify applications about relevant events concerning the monitored components.
In Amazon Dynamo DB, the failure detector relies on a random gossip technique, i.e., randomized pinging and distributed failure confirmation. A node is considered alive if it replies to a ping. If no ack has been received, the original node asks k other nodes to ping it as well. If none of the k nodes receives a response, the node is marked as failed by the original node.