Designing failure detection in a way that is both accurate and efficient is not an easy task, though. One reason is that network delays are unpredictable. Another is the FLP impossibility result: in an asynchronous distributed system, it is impossible to determine with certainty whether a remote process has failed or is just taking its time to respond.
On the bright side, a node doesn't really need to know whether a remote process has failed or is just taking its time. As long as all nodes have a way to reach the same conclusion (failed or delayed) about a remote process, we can work from there. You do this by assuming an upper bound on delays (e.g., a timeout) after which a remote process is considered failed. In academic speak, this is called a partially synchronous model.
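To make the timeout idea concrete, here is a minimal sketch of a heartbeat-plus-timeout detector. The class name, the `TIMEOUT_SECONDS` value, and the API are illustrative assumptions, not from any particular system: any process that hasn't sent a heartbeat within the assumed upper bound is marked as failed.

```python
import time

# Assumed upper bound on delay; in practice this is tuned per system and network.
TIMEOUT_SECONDS = 5.0


class TimeoutFailureDetector:
    """Marks a remote process as failed if no heartbeat arrives within the timeout."""

    def __init__(self, timeout=TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_heartbeat = {}  # process id -> timestamp of the last heartbeat seen

    def heartbeat(self, process_id):
        # Record that we just heard from this process.
        self.last_heartbeat[process_id] = time.monotonic()

    def is_alive(self, process_id):
        # A process we have never heard from is treated as failed here;
        # real systems usually track cluster membership separately.
        last = self.last_heartbeat.get(process_id)
        if last is None:
            return False
        return (time.monotonic() - last) <= self.timeout
```

Because every node applies the same fixed bound, nodes that receive the same heartbeats reach the same verdict, which is exactly the agreement the partially synchronous assumption buys us.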
In the rest of the post we go through:
- Eight factors to consider in designing failure detection,
- Four key metrics to consider in evaluating failure detection,
- Static, Lazy, Gossip, Hierarchical, Adaptive, and Accrual-based failure detection techniques, and
- Failure detection approaches in some popular distributed systems.