Apr 6, 2016

Techniques to ensure eventual consistency

Recently I came across a paper on eventual consistency that touches, among other things, on techniques to ensure eventual consistency for updates to replicas. While the term eventual consistency is fairly popular nowadays, I am not sure some of its subtleties are equally well known. So, a small detour before diving into the techniques to ensure eventual consistency.

Eventual consistency basically guarantees that if there are no additional updates to a given data item, then all reads for that data item will eventually return the same value. 

What the above also means is that in an eventually consistent system,

  • The system can return arbitrary data and still be eventually consistent. So the client has no way to know whether a read response reflects wrong behavior. 
  • If there are multiple concurrent updates, under eventual consistency you do not know which update eventually gets chosen. The order is unpredictable. The only guarantee is that there will eventually be convergence. 
  • In short, what eventual consistency tells us is that something good will happen eventually, but there are no guarantees about what happens in the interim and no behavior is ruled out in the meantime!

On the other hand, eventual consistency is fairly useful for a few reasons -
  • Many real-world activities are inherently eventually consistent. 
  • Recent studies indicate the inconsistency window of many real-world large systems is fairly small (i.e., in milliseconds), making these systems appear strongly consistent most of the time. 
  • System architects can deal with inconsistencies either via external compensatory mechanisms or by using data structures that avoid inconsistencies altogether. 
  • Finally, it brings benefits like scalability, availability and low latency. 

Techniques to ensure Eventual Consistency...

The techniques to ensure eventual consistency (i.e., convergence of data) broadly fall into 3 categories -
  1. Commutative Updates 
  2. Ordered Updates 
  3. Custom Convergent Merges
At a high level, these techniques all involve replicas exchanging information about which writes they have seen. This information exchange is called anti-entropy, i.e., reversing the disorder/randomness.

1) Commutative Updates

Last write wins:
Let's say we want to turn a single-node database into an eventually consistent database. 
The simplest approach is: whenever the db receives an update, it acknowledges it to the client and asynchronously propagates the write to all other replicas. These replicas in turn update their local copies. This write forwarding becomes the anti-entropy process.

In the event of concurrent writes, replicas deterministically choose a final value using techniques like "last write wins". It is important that the primary node does not forward the writes in a synchronous manner, i.e., it should not wait for responses from replicas before returning to the user.

Implicit in the eventual consistency model is the assumption that system partitions will eventually be healed and the writes eventually propagated. The same assumption applies to replica nodes that might be down temporarily when a write is forwarded. Note: The forwarding node should make sure the write eventually reaches the replica.
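To make the "last write wins" idea concrete, here is a minimal sketch (in Python, with hypothetical class and method names) of an LWW register. It assumes each write is stamped with a (timestamp, replica id) pair so ties break deterministically, and that the anti-entropy process eventually delivers every forwarded write to every replica:

```python
import time

class LWWRegister:
    """A last-write-wins register: concurrent writes are resolved
    deterministically by (timestamp, replica_id), so every replica
    that sees all writes converges to the same value."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.value = None
        self.stamp = (0.0, replica_id)  # (timestamp, tie-breaker)

    def write(self, value):
        """Local write; returns what should be forwarded asynchronously."""
        self.stamp = (time.time(), self.replica_id)
        self.value = value
        return value, self.stamp

    def merge(self, value, stamp):
        """Apply a forwarded write only if it is 'newer'."""
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

# Two replicas receive the same pair of writes in opposite orders...
a, b = LWWRegister("A"), LWWRegister("B")
w1 = a.write("x")
w2 = b.write("y")
a.merge(*w2)
b.merge(*w1)
assert a.value == b.value  # ...yet converge to the same value
```

Note the tuple comparison: if two timestamps collide, the replica id decides, which keeps the outcome deterministic across replicas.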

CRDTs:
Another approach to eventual consistency is to use convergent or commutative replicated data types, i.e., CRDTs. There are basically two types of CRDTs -

  • Convergent Replicated Data Types (state-based)
    • In this model the replicas propagate their state (i.e., data) and converge by merging. The merge function should be commutative, associative and idempotent (note: definitions below). 
    • Similarly, the update operations should monotonically increase the internal state according to some partial ordering. Think of the update operation as an append instead of a change to the current value. 
    • Typically one uses timestamps, vector clocks, etc. to facilitate the merge operation.
  • Commutative Replicated Data Types (operation-based)
    • In this model the replicas propagate the operations. These operations must be commutative, i.e., they can be applied in any order. 
    • For example, the following operation sequence is not commutative: [create(x), modify(x), delete(x)] != [delete(x), create(x), modify(x)]. 
    • 2+3 = 3+2, i.e., addition is a commutative operation. 
Note:
  • Commutative means the function arguments are order-insensitive, e.g., f(a,b) = f(b,a). Using an example outside math, putting socks on your feet is a commutative operation: the order in which the socks go on does not matter. On the other hand, putting on socks & sneakers is not a commutative operation. Order matters!
  • Associative means the grouping of applications does not matter, e.g., f(a, f(b,c)) = f(f(a,b), c). Again using an example outside math, making concrete is not an associative operation, i.e., you cannot first mix water & cement and pour in the gravel after a while to make concrete.
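A classic state-based CRDT is the grow-only counter (G-Counter). The sketch below (class and method names are mine, not from the paper) shows why the merge properties matter: element-wise max is commutative, associative and idempotent, so replicas can exchange state in any order, any number of times, and still converge:

```python
class GCounter:
    """State-based (convergent) CRDT: a grow-only counter.
    Each replica increments only its own slot; merge takes the
    element-wise max, which is commutative, associative and idempotent."""

    def __init__(self, replica_id, n_replicas):
        self.i = replica_id
        self.counts = [0] * n_replicas

    def increment(self, amount=1):
        # Monotone update: the internal state only grows.
        self.counts[self.i] += amount

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Idempotent: merging the same state twice is a no-op.
        self.counts = [max(x, y) for x, y in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Because each replica only ever touches its own slot, concurrent increments never conflict; the merge simply combines what each replica has seen.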

2) Ordered updates 


Another way to ensure eventual consistency is to send all updates in the same order. The order could be a total order or a partial order.

Total order updates:
In the primary-copy (i.e., master-slave) model, all updates to the slave replicas are applied in the same order as they were applied to the master. 



Logging is another approach to totally ordering updates: the order in which updates are written to the log is the order in which they are applied to the replicas. Log shipping is one application of this approach.


Consensus algorithms are another approach. Here, algorithms like Paxos or Raft are used to reach agreement on the ordering, and group communication is used for totally-ordered broadcast of the updates. The update order is then based on the message order.
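The common thread in all three approaches can be sketched in a few lines (this is an illustration of the ordering idea only, not a real replication or consensus implementation): if every replica replays the same totally-ordered log of updates, even non-commutative updates converge.

```python
# A hypothetical totally-ordered log of non-commutative updates.
log = [("set", "x", 1), ("incr", "x", 10), ("set", "x", 5)]

def apply_log(ordered_log):
    """Replay updates in log order. Any replica that replays the
    same totally-ordered log reaches the same final state."""
    state = {}
    for op, key, arg in ordered_log:
        if op == "set":
            state[key] = arg
        elif op == "incr":
            state[key] = state.get(key, 0) + arg
    return state

replica1 = apply_log(log)
replica2 = apply_log(log)          # same order -> same state
assert replica1 == replica2 == {"x": 5}
```

Reordering that log (e.g., applying the final "set" before the "incr") would yield a different state, which is exactly why these updates need a total order in the first place.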

One advantage of total ordering is that it simplifies the developer's life. On the other hand, each of the above approaches has its own disadvantages. For example,

  • In the primary-copy approach, the primary copy is a bottleneck and a single point of failure.
  • In the logging approach, the tail of the log is a bottleneck and a point of failure.
  • In the consensus approach, the algorithms have higher communication overhead and complex implementations.

Partial order updates:
One approach to establish a partial ordering among updates is to use vector clocks or version vectors. The image below, from Wikipedia, summarizes vector clocks. 


A challenge with vector clocks is that the vectors grow larger as the number of copies grows. Note: it is the number of copies, not the number of nodes. E.g., we may have 10 Cassandra nodes but our replication factor (copies) will likely be 3. 
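The essence of a vector clock comparison is a small partial-order check, sketched below (function and label names are mine). Two clocks are ordered when one dominates the other element-wise; when neither dominates, the updates are concurrent and need conflict resolution:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (one entry per replica/copy).
    Returns 'equal', 'before', 'after', or 'concurrent'."""
    a_le_b = all(x <= y for x, y in zip(vc_a, vc_b))
    b_le_a = all(y <= x for x, y in zip(vc_a, vc_b))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # vc_a happened-before vc_b
    if b_le_a:
        return "after"       # vc_b happened-before vc_a
    return "concurrent"      # neither dominates: a real conflict

assert compare([1, 0, 0], [2, 1, 0]) == "before"
assert compare([2, 0, 0], [0, 0, 1]) == "concurrent"
```

The "one entry per copy" signature is where the growth challenge above shows up: every additional copy adds a slot to every clock.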

3) Custom Convergent Merges

If the updates in our system do not automatically converge and are not ordered, then the remaining alternative is to resolve conflicts manually or via some application-specific logic. 

So in this model, the technique is to take two versions of an object and merge them to form a new object. For eventual consistency, the merges need to be commutative and associative.

One example of this approach is a developer manually resolving conflicts to check code into a version control system. Other examples include collaborative editing systems.
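As a toy example of an application-specific merge, consider two diverged versions of a shopping cart (the function and the "keep the larger quantity" policy are my illustration, not from the paper). Taking the union of items with the max quantity per item is commutative, associative and idempotent, so replicas converge regardless of merge order; a real cart would also need to handle removals, e.g., via tombstones:

```python
def merge_carts(cart_a, cart_b):
    """Application-specific merge for two diverged cart versions:
    union of items, keeping the larger quantity per item.
    Commutative, associative and idempotent by construction."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

v1 = {"book": 1, "pen": 2}
v2 = {"pen": 3, "mug": 1}
assert merge_carts(v1, v2) == merge_carts(v2, v1) \
    == {"book": 1, "pen": 3, "mug": 1}
```

The policy itself (max quantity? sum? newest wins?) is a business decision, which is precisely the point of this category of techniques.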

The primary challenge with this approach is that the convergence logic is highly application specific.

Wrap up...
One topic I did not touch on in this post is the consistency-availability tradeoff in the presence of network partitions. For interested readers, I recommend the paper Rethinking Eventual Consistency. Finally, the following is an image from that paper on how to reason about weak consistency.


References-
  1. Rethinking Eventual Consistency, Philip A. Bernstein, Sudipto Das
  2. Don't Settle for Eventual Consistency: Stronger Properties for Low-Latency Geo-Replicated Storage, Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen
  3. Eventual Consistency Today: Limitations, Extensions, and Beyond, Peter Bailis, Ali Ghodsi
  4. Wikipedia