Jan 27, 2016

Failover Strategies & Techniques in Distributed Applications

Systems fail. Networks fail. Processes fail. The secret: let it fail, but fix it fast. If service is restored before the user even notices, did the failure really occur?

Broadly there are four types of failover strategies -
  1. No Failover
  2. Cold Failover
  3. Warm Failover
  4. Hot Failover
These strategies vary in their recovery time, cost, and impact. Often a combination of these strategies is used; for example, a hot failover strategy for high availability combined with a cold failover strategy for disaster recovery.

Next come the techniques. The techniques typically used for hot failover are -
  1. Client-based failover
  2. DNS-based failover
  3. Network-based failover
  4. IP Address takeover
  5. Gratuitous ARP based failover
  6. Server-based failover
Some failover techniques, like DNS-based failover, are offered as a service by many cloud providers. At the other extreme, techniques like gratuitous ARP based failover are not cloud friendly at all.

Distributed systems are all about trade-offs. Each failover strategy mentioned above has its own pros and cons, and the same goes for the failover techniques. In the rest of the post we dive deeper into these details. At the end we touch on some of the challenges in failover like tug-of-war, split-brain, and quorums.


Failover Strategy Trade-offs...

Strategy         Recovery Time           Cost                  Customer Impact
No Failover      Unpredictable           No cost to low cost   High
Cold Failover    Minutes to Hours        Moderate              Moderate
Warm Failover    Seconds to Minutes      Moderate to high      Low
Hot Failover     Immediate to Seconds    High                  None

No Failover Strategy:

As the name implies, the system here does not do any failover. In this mode, a system failure can result in significant downtime. The time to resolve depends on -
  1. time to detect the failure,
  2. time to figure out the cause and solution,
  3. time to decide and obtain approvals,
  4. time to apply the solution.
The risk is acceptable if the system is not business critical. Even so, it is good to have clear and detailed operational recovery plans, responsive IT staff, etc.

Cold Failover Strategy:
This is a common and often inexpensive approach to recovering from failures. In this strategy, a second node (also called the standby) acts as a backup for an identical primary node.


When the primary node breaks down, the standby node is powered on and the data is restored. Data from the primary node is periodically backed up to a storage system and restored on the standby node as and when required.
This strategy generally provides a recovery time of a few hours.
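
To make the periodic backup step concrete, below is a minimal sketch of a cron-style backup loop in Python, assuming rsync is available over SSH; the paths, host name, and interval are illustrative placeholders, not part of the original setup.

    import subprocess
    import time

    DATA_DIR = "/var/lib/myapp/"                     # data directory on the primary node (placeholder)
    BACKUP_TARGET = "backup-store:/backups/myapp/"   # storage system reachable over SSH (placeholder)
    BACKUP_INTERVAL_SECS = 6 * 60 * 60               # back up every 6 hours

    def backup_once():
        # Push an incremental copy of the data directory to the backup store.
        subprocess.run(["rsync", "-az", "--delete", DATA_DIR, BACKUP_TARGET], check=True)

    def restore_on_standby():
        # Run on the standby after it is powered on, before starting the application.
        subprocess.run(["rsync", "-az", BACKUP_TARGET, DATA_DIR], check=True)

    if __name__ == "__main__":
        while True:
            backup_once()
            time.sleep(BACKUP_INTERVAL_SECS)
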
Warm Failover Strategy:
Here the second node is identical to the primary node, just like in the cold failover strategy. The difference is that the second node is up and running. In case of a failure on the primary node, the software components are started on the second node; this process is usually automated using a cluster manager. Data is regularly mirrored to the secondary system using disk-based replication or a shared disk.


This strategy generally provides a recovery time of a few minutes.
Hot Failover Strategy:
This is the approach we want if the goal is continuous availability. It requires the same failover configuration as the other strategies, i.e., a second node identical to the primary node and running. The difference is that state and data are mirrored in near real time, i.e., both systems have identical data.
This strategy generally provides a recovery time of a few seconds.

On Failover Techniques...
Fast and reliable recovery from a node failure in the hot failover strategy is conceptually simple: just redirect the traffic generated by affected users to the surviving nodes in the network. Failover will be fast, so fast that users may not even notice there has been an outage. However, the devil is in the details.

Note:
One aspect we are not covering in this post is the loss of the replication network. When connectivity is lost, unless something is done, each node will continue to process independently and the database copies will begin to diverge. If this is unacceptable, there should be mechanisms in place to take one of the nodes offline until replication can be restored. This also brings up a raft of consistency issues, CAP trade-offs, etc.

Client-based failover
Client-based failover is the least complex and fastest way to do failover. This approach avoids the need for a server-side cluster as long as data can be replicated. However, it requires that the client is intelligent (i.e., has failover logic) and that there are mechanisms to let the client know which node is the primary and which are the backup nodes. This may not always be possible.


In this mode, failover works as follows -
  1. The client normally interacts with its primary node (i.e., IP1 in the above picture).
  2. If the client does not receive a response or hits a connection exception, it resends the request to the second node (i.e., IP2 in the above picture).
  3. In parallel, the client periodically sends test messages to the primary node.
  4. When the client begins to receive responses from the primary node (i.e., IP1), it can switch back to its primary node.
Some notes on this model (a minimal client sketch follows these notes) -
  1. Failover takes more or less the same amount of time as any other retried/aborted request.
  2. The failover time here is primarily the time it takes to detect the failure. The restoration time is simply the time it takes for the client to switch back to the primary node.
  3. The user of the client is typically unaware of both the node failure and the node restoration.
  4. Configure the client to retry a few times before failing over to another node. If the failure is transient, the client can then handle it without doing a failover.
  5. If there are multiple nodes in the system, the client can be given an ordered list of the nodes. This helps the client survive multiple node failures.
  6. Clients should be able to accept external commands to change their primary node and their failover list. This enables the system or the admins to redistribute the workload to accommodate short-term peaks or trends.
  7. Session loss and capacity constraints of individual nodes are other aspects to consider during failover. If statefulness is a must, it is worth exploring distributed memory caches for storing session state.
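
Below is a minimal sketch of such a client in Python, assuming the nodes expose a simple HTTP request/response interface; the node addresses, health-check path, retry counts, and timeouts are illustrative placeholders.

    import time
    import urllib.error
    import urllib.request

    # Ordered failover list: primary first, then backup nodes (placeholder addresses).
    NODES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
    RETRIES_PER_NODE = 3           # retry a few times before failing over (note 4)
    REQUEST_TIMEOUT_SECS = 2

    class FailoverClient:
        def __init__(self, nodes):
            self.nodes = list(nodes)
            self.active = 0        # index of the node currently in use

        def request(self, path):
            # Try the active node first, then walk the ordered failover list (note 5).
            for offset in range(len(self.nodes)):
                idx = (self.active + offset) % len(self.nodes)
                for _ in range(RETRIES_PER_NODE):
                    try:
                        with urllib.request.urlopen(self.nodes[idx] + path,
                                                    timeout=REQUEST_TIMEOUT_SECS) as resp:
                            self.active = idx      # stick with the node that answered
                            return resp.read()
                    except (urllib.error.URLError, OSError):
                        time.sleep(0.1)            # transient error: brief pause, then retry
            raise RuntimeError("all nodes are unreachable")

        def probe_primary(self, path="/health"):
            # Periodically test the primary and switch back once it responds
            # (steps 3 and 4 of the failover flow above).
            if self.active == 0:
                return
            try:
                urllib.request.urlopen(self.nodes[0] + path, timeout=REQUEST_TIMEOUT_SECS)
                self.active = 0
            except (urllib.error.URLError, OSError):
                pass

A client like this would call probe_primary() on a timer so that it falls back to its primary node once the primary recovers.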
Network-based failover
In this model it is the responsibility of the network to detect failures and redirect traffic from a failed node to a functioning node.

The figure below shows a typical configuration in this model, i.e., two nodes, A and B, which are geographically separated. A replication engine keeps the databases of these nodes continuously in sync by propagating the database changes.


Here each router normally routes all traffic to its local node, i.e., Router-A routes all of its local clients' traffic to Node-A. Similarly, Router-B routes all of its local client traffic to Node-B. See the dotted-line arrows.

The clients are neither aware of, nor do they care about, which node they are connected to. The clients just know a virtual IP address assigned to them for communication.

The routers have the intelligence to route around an outage, a capability inherent in most routers today. They maintain routing tables that designate alternate routes to use if a primary route fails. In the above figure, a communication link connecting Router-A and Router-B provides that alternate route for each router.

In this model, failover works as follows -
  1. Let's say Node-A fails. 
  2. Router-A detects that it can no longer send traffic to Node-A. 
  3. Router-A consults its routing table and determines that alternate route is to Router-B.
  4. Router-A begins sending traffic for the local clients to Router-B. 
  5. Router-B will forward that traffic to Node-B. 
  6. Node-B is now handling traffic from local clients of both Router-A and Router-B.
Once Node-A is back up and running, a command needs to be sent to Router-A to redirect local traffic to Node-A again.

Note: In the above image there are multiple single points of failure; both the network links and the routers require redundancy. This is fairly expensive, but on the flip side the clients stay simple, and the cost is hopefully spread out across multiple applications.

DNS-based failover
Over the years, DNS-based failover has become very popular, especially in the cloud service community, primarily because it is relatively inexpensive and fairly easy to deploy and manage.


In this model, the client communicates with the node using the node's hostname instead of its IP address. The hostname must be resolved to an IP address, and this task is done by the DNS server. Now, if we want to route the traffic to a different node, all that needs to be done is to update the DNS server with a new IP address for the hostname.

DNS-based failover service provider model:
There are many cloud service providers who offer DNS-based failover as a service. The typical model is as follows - the provider continuously health-checks your endpoints, and when the primary endpoint fails its checks, the provider updates the DNS record to point traffic at a healthy backup endpoint.

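As a rough illustration of what such a service does under the hood, here is a minimal health-check-and-update loop in Python. The update_dns_record function is a hypothetical stand-in for the provider's API (each provider exposes its own), and the hostname, endpoint addresses, and check interval are placeholders.

    import socket
    import time

    HOSTNAME = "app.example.com"                      # record the clients resolve (placeholder)
    ENDPOINTS = ["203.0.113.10", "203.0.113.20"]      # primary first, then backup (placeholders)
    CHECK_PORT = 443
    CHECK_INTERVAL_SECS = 30

    def is_healthy(ip, port=CHECK_PORT, timeout=2.0):
        # A health check can be as simple as a TCP connect to the service port.
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def update_dns_record(hostname, ip):
        # Hypothetical stand-in: in reality this calls the DNS provider's API.
        # Keeping a low TTL on the record helps the change propagate faster.
        print("pointing %s -> %s" % (hostname, ip))

    if __name__ == "__main__":
        current = ENDPOINTS[0]
        while True:
            healthy = [ip for ip in ENDPOINTS if is_healthy(ip)]
            if healthy and current not in healthy:
                current = healthy[0]                  # fail over to the first healthy endpoint
                update_dns_record(HOSTNAME, current)
            time.sleep(CHECK_INTERVAL_SECS)
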
Limitations of DNS-based failover:
Overall, the DNS-based failover solution is simple but has one major problem: DNS entries are typically cached, and it may take several minutes for the caches to pick up the address change. Until the caches are updated, DNS resolution of the hostname will keep returning the failed node's address.

Therefore, continuous availability cannot generally be achieved with DNS redirection. This is an extremely tough issue to resolve because the caching is not within the DNS failover service provider's control.

Gratuitous ARP based Failover
Before going into details of gratuitous ARP, first a quick overview of ARP protocol.
The ARP protocol is used in IP networks to resolve an IP address to a MAC address. Why? Routers use the MAC address in their routing tables to route traffic to the appropriate host or to another router.
Routing tables are self-discovering, i.e., not only do routers periodically exchange their routing tables with their neighboring routers, but they also monitor traffic to discover new IP/MAC pairs. Every message contains the IP/MAC address pairs of the source and destination, and routers use these pairs to update their tables.
Now, if a sending node does not know the MAC address of the destination, it sends an ARP request asking for the MAC address corresponding to the target IP address. The device or router that is handling that IP address responds with the MAC address to use. This completes our quick overview of ARP :-)
A gratuitous ARP is an unsolicited ARP request made by a sender (host, switch, etc.) to resolve its own IP address.

In a gratuitous ARP, the source and destination IP addresses are both the IP address of the sender. The ARP message also contains the MAC address of the sender, which allows the routers on the subnet to update their tables. In short, a gratuitous ARP is used by a node to advertise its presence on the network. When a system first boots up, it will often send a gratuitous ARP to indicate it is "up" and available.

Now let's say we have two load balancers and we want to provide high availability for our load balancing service -
  1. The two load balancers will share an IP address (i.e., floating IP). 
  2. The client devices recognize which node is the active load balancer by means of a simple ARP entry. If you remember, an ARP entry ties an IP address to a MAC address. So if the ARP entry maps the floating IP to the MAC address of the first load balancer, in effect we are telling the client devices that the first load balancer is the active one.
  3. Let's say the active load balancer fails. Our secondary load balancer immediately notices (due to heartbeat monitoring between the two) and will send out a gratuitous ARP request indicating it is now associated with the floating IP address. 
  4. The second load balancer does this by sending an ARP request with the sender IP address set to the floating IP address and the sender MAC address set to its own MAC address (see the sketch after this list). This forces all future traffic to the floating IP address to be routed to the second load balancer.
  5. When the first load balancer is restored, it sends an equivalent gratuitous ARP request, and all traffic gets re-routed back to the first load balancer.
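
As a rough sketch, here is what sending that gratuitous ARP could look like in Python using the scapy packet library, assuming scapy is installed and the script runs with root privileges; the floating IP, MAC address, and interface name are placeholders.

    from scapy.all import ARP, Ether, sendp   # requires scapy; needs root privileges

    FLOATING_IP = "10.1.1.1"         # shared (floating) IP of the load-balancer pair (placeholder)
    MY_MAC = "02:00:00:aa:bb:02"     # MAC address of the load balancer taking over (placeholder)
    IFACE = "eth0"

    def send_gratuitous_arp():
        # Gratuitous ARP: sender and target IP are both the floating IP,
        # broadcast on the subnet so ARP tables get updated with MY_MAC.
        pkt = Ether(dst="ff:ff:ff:ff:ff:ff", src=MY_MAC) / ARP(
            op=2,                    # 2 = ARP reply ("is-at"); some stacks use op=1 instead
            hwsrc=MY_MAC,
            psrc=FLOATING_IP,
            hwdst="ff:ff:ff:ff:ff:ff",
            pdst=FLOATING_IP,
        )
        sendp(pkt, iface=IFACE, verbose=False)

    if __name__ == "__main__":
        # The standby load balancer would call this right after detecting the active one is gone.
        send_gratuitous_arp()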

Limitations:
Most cloud environments do not allow broadcast traffic of this nature. The primary reason is that the user is most likely sharing a network segment with other tenants, so broadcast traffic could create issues like disrupting the traffic of other tenants. Another concern is security: with ARP there is no security, no ability to verify, no authentication, nothing.

A server configured to accept gratuitous ARPs does so at the risk of being tricked into explicitly trusting every gratuitous ARP, even those that may be attempting to fool the network into believing the sender is a device it is not. That, in essence, is ARP poisoning.

Gratuitous ARP can also be used to execute denial-of-service, man-in-the-middle, and MAC flooding attacks. Imagine what happens if these attacks occur in a cloud environment: they would be against shared infrastructure, potentially impacting many tenants. Now you can understand why cloud providers hesitate to provide this functionality.

On the other hand, data centers typically do have protections against these attacks, precisely because various infrastructure services rely on gratuitous ARP. Most of these protections use a technique that accepts a gratuitous ARP but does not enter it into the ARP cache unless it carries a valid IP-to-MAC mapping, as defined by the device configuration. Validation can take the form of matching against DHCP-assigned addresses or checking for existence in a trusted database.

In short, gratuitous ARP is not cloud-friendly.

IP Address takeover based Failover:
Say for some reason we cannot use either DNS-based failover or gratuitous ARP based failover. Another technique is IP address takeover.

Say our floating IP is 10.1.1.1 and it is initially associated with Node-A (IP1). Node-B (IP2) periodically monitors the activity of Node-A. When Node-B detects that Node-A is down, it takes over the floating IP address 10.1.1.1.

A good thing about this model is that it is transparent to the rest of the system, i.e., to the clients and DNS, so it can be implemented without many external assumptions. There can be a small failover latency due to ARP caches though.
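
A minimal sketch of this on a Linux host might look like the following, assuming the iproute2 "ip" utility is available and the script runs as root; the addresses, interface name, and thresholds are placeholders.

    import socket
    import subprocess
    import time

    FLOATING_IP = "10.1.1.1"         # address the clients use
    NODE_A = ("10.1.1.2", 8080)      # Node-A's real address and service port to monitor (placeholder)
    IFACE = "eth0"
    CHECK_INTERVAL_SECS = 2
    FAILURES_BEFORE_TAKEOVER = 3     # do not fail over on a single missed check

    def node_a_alive():
        try:
            with socket.create_connection(NODE_A, timeout=1.0):
                return True
        except OSError:
            return False

    def take_over_floating_ip():
        # Attach the floating IP to this node's interface. Clients reach this node once
        # their ARP caches refresh (the small latency mentioned above); sending a
        # gratuitous ARP, as in the previous section, can shorten that window.
        subprocess.run(["ip", "addr", "add", FLOATING_IP + "/24", "dev", IFACE], check=True)

    if __name__ == "__main__":
        misses = 0
        while misses < FAILURES_BEFORE_TAKEOVER:
            misses = 0 if node_a_alive() else misses + 1
            time.sleep(CHECK_INTERVAL_SECS)
        take_over_floating_ip()
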

Server-based failover:
In server-based failover, the servers are responsible for failure detection. There are multiple ways to detect failures, although the most common approach is to use heartbeats. A heartbeat may be a simple ping, or it may be a more complex exchange of information.
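
As a toy illustration of heartbeat-based detection, below is a minimal sketch in Python using UDP heartbeats between two servers; the addresses, port, interval, and timeout are placeholders.

    import socket
    import time

    PEER = ("10.0.0.2", 9999)        # where this server sends its heartbeats (placeholder)
    LISTEN = ("0.0.0.0", 9999)       # where this server listens for the peer's heartbeats
    INTERVAL_SECS = 1.0              # how often a heartbeat is sent
    TIMEOUT_SECS = 5.0               # silence longer than this marks the peer as down

    def send_heartbeats():
        # Runs on each server: periodically tell the peer "I am alive".
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(b"heartbeat", PEER)
            time.sleep(INTERVAL_SECS)

    def monitor_heartbeats(on_peer_down):
        # Runs on each server: declare the peer dead after TIMEOUT_SECS of silence.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(LISTEN)
        sock.settimeout(TIMEOUT_SECS)
        while True:
            try:
                sock.recvfrom(1024)          # any heartbeat resets the timer
            except socket.timeout:
                on_peer_down()               # kick off one of the redirection mechanisms below
                return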

Note: Like failover techniques and strategies, failure detection is a big area in its own right. We can discuss it more in another post; this post's focus is failover.

Let's say our servers can detect the failure of their peers. Once a server failure is detected, the clients using the failed node must be redirected to the surviving servers. A surviving server can make this happen in a few different ways -
  1. via Commanding the Clients
  2. via Commanding the Network
  3. via Updating the DNS Server
  4. via Gratuitous ARP
  5. via IP Address Takeover

Commanding the Clients -
Basically, in this approach the surviving server sends a supervisory message to all the clients it wishes to acquire. After receiving the supervisory message, those clients direct all their future communication to the requesting server.

This technique requires that the clients have the intelligence to process such requests and to switch the server address they talk to. Furthermore, if there is no broadcast or multicast capability for communicating with the clients, the acquiring server will have to send individual messages to each client, which could delay the takeover process.

Commanding the Network -
Say our clients do not have the intelligence to process a switchover request. Our clients are dumb and know only that they need to communicate with a specific IP address (e.g., a floating IP). This is basically network-based failover.

So here the surviving server sends directives to the appropriate routers to re-route all communication addressed to the floating IP to its own address. The routers then start routing that communication to the surviving node.

This technique depends upon the routers having the intelligence to accept such directives. It solves the problem associated with the client-commanding technique, since the clients no longer need any re-routing intelligence.

Commanding the DNS -
Say neither the clients nor the routers have the intelligence to accept directives. In this approach, the surviving server sends an update to the DNS server to map the hostname to its own IP address.

Gratuitous ARP -
Under this approach, the surviving node sends a gratuitous ARP request to bind the floating IP address to its MAC address. More details can be found in the Gratuitous ARP section above.

IP Address Takeover -
Under this approach, the surviving node does not have to send any directives to clients, routers, or DNS servers. It simply takes over the floating IP address from the failed node. More details can be found in the IP Address Takeover section above.

Reliability Challenges -
One implicit assumption in failure detection via heartbeats is that the underlying network is highly reliable. That means the servers in the network should be interconnected with redundant, independent links. Otherwise, we run into problems like tug-of-war and split-brain.

Tug-of-War:
Say the heartbeats between our two servers are lost. Each server will think that the other server is down, so each server will issue supervisory commands or attempt a floating IP takeover to acquire the clients connected to the other server :-)

In the best case, this happens only once, with the net result that the client/server relationships have been reversed. In the worst case, the clients will continuously ping-pong back and forth between the competing servers.
A tug-of-war will not happen with either client-based failover or network-based failover because the failure condition is not being determined by the servers.
Split-Brain:
Say in the above scenario there is also replication between the two servers. If no action is taken, both servers will continue updating their databases based on the transactions received from their clients. The database on each server will not see the updates being made to the database on the other server. So the databases will diverge, and transactions will be executed against stale data. When the replication link is restored, there will be a whole lot of data collisions to resolve.
Split-brain is a problem with all forms of failover i.e., client, network, and server.
If database divergence or data collisions are not acceptable, then one of the servers must be shut down and all clients switched to the other server until the replication network is fixed. Also, if we are going to automate the shutdown, we need to make sure there is distributed consensus between the servers so that a tug-of-war does not happen.

There are a few ways to make sure a tug-of-war does not happen. The key element in all of them is that there must be a third party to break ties. Following are some solutions to this problem -
  1. Cluster Lock - In this model, the failover is automated using a cluster lock, which must live on some third, independent system. A server cannot force a switchover unless it has acquired the cluster lock (a minimal sketch of this follows the list).
  2. Quorum System - This is commonly used in cluster architectures to reconfigure a cluster following a failure. The quorum system receives the heartbeats from all servers, determines network health, and decides when a server should be shut down to avoid split-brain operation or a tug-of-war.
  3. Manual Intervention - This is the simplest solution: shift the decision-making to the system admins :-). Here, when a heartbeat or replication network failure is detected, all servers pause and wait for manual commands to continue. The system admin diagnoses the problem and determines what action to take.
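
As a rough sketch of the cluster-lock idea, here is what the gate might look like in Python, assuming the lock lives as a file on a third, independent system whose filesystem both servers can reach (e.g., an NFS mount); the path is a placeholder, and a real deployment would typically use a proper lock or lease service instead.

    import os

    # Hypothetical path on a third, independent system (e.g., an NFS mount).
    CLUSTER_LOCK_PATH = "/mnt/witness/cluster.lock"

    def try_acquire_cluster_lock(node_name):
        # O_CREAT | O_EXCL makes the creation atomic: exactly one node can win the lock.
        try:
            fd = os.open(CLUSTER_LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False                  # another node already holds the lock
        os.write(fd, node_name.encode())
        os.close(fd)
        return True

    def maybe_take_over(node_name, do_switchover):
        # A server may force a switchover only if it first wins the cluster lock,
        # which prevents both nodes from grabbing the clients at the same time.
        if try_acquire_cluster_lock(node_name):
            do_switchover()
        # else: assume the peer (or an admin) is handling it and do nothing.
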
Wrap up...
The gist is that when a failure occurs in a distributed system, several factors influence the failover time: detecting the failure, determining its cause, deciding upon the course of action, obtaining approval, rebuilding the database if necessary, restarting the applications, reconfiguring the network if necessary, and testing the system before restoring it to service.

There are several options for achieving failover; some are reactive and others are proactive. Client-based failover is reactive: a failover is not initiated until the client detects a problem. Network-based and server-based failovers are proactive, i.e., the failure is detected and the failover completed before most of the clients even know there has been a failure.

The correct choice of failover strategy and technique depends upon where the failover intelligence can be placed and what failover parameters we desire.

I hope you found this post useful and interesting. Please feel free to leave a comment below.