Apr 2, 2016

Many faces of replication...

Recently I was looking into replication and consistency papers and thought this would be a good topic to summarize. Replication is one of the most studied topics in distributed systems and a quite important tool for the system designer.

It improves system availability by removing single points of failure, improves performance by reducing communication overheads and improves scalability by enabling the system to grow while keeping acceptable response times. But the benefits of replication come with their own challenges. Nothing comes for free in distributed systems...

For example, some of the challenges anyone dealing with replication has to address are -

  • How to manage the updates i.e., replication strategy? 
  • Data Consistency & Availability tradeoffs? 
  • How to handle downtime during new replica creation 
  • Maintenance Overhead 
  • Lower write performance etc. 

Replication Basics...
A replication strategy can typically be described via two characteristics, i.e., by When & Where?

When - when the updates should be propagated i.e.,

  • Synchronous (i.e., eager) propagation (or)
  • Asynchronous (i.e., lazy) propagation

Where - where the updates should be placed i.e.,

  • Primary copy (i.e., master-slave/single-master) model (or)
  • Update everywhere (i.e., multi-master) model


Typically the replica propagation is done via either push-based or pull-based mechanisms. Some systems combine both approaches. In general, the quicker the propagation happens, the smaller the inconsistency window (i.e., the degree of divergence) and the rate of conflicts. So push-based propagation generally provides a better degree of consistency.

In eager (sync) propagation,
  • All replicas are updated synchronously as part of one atomic transaction, typically using a commitment protocol like 2PC. The main advantage is that we avoid divergence between replicas. The disadvantages are that all replica nodes need to be available for a write to succeed, and that writes incur higher latency.
  • Eager replication is typically not an option for mobile applications where most nodes are normally disconnected.
  • Other propagation approaches typically used to improve fault-tolerance for writes are ROWAA (read-one/write-all-available), i.e., update only all available copies, and quorum protocols, i.e., let a write succeed as long as a quorum of replicas acknowledge it (sketched below).
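
To make the quorum idea concrete, here is a minimal Python sketch of a quorum write. The replica count, quorum size and failure probability are illustrative assumptions, not taken from any particular system.

    import random

    N, W = 5, 3  # assumed replica count and write quorum (N/2 < W <= N)

    def send_write(replica_id, key, value, stores):
        """Simulate one replica, which may be down; returns True on ack."""
        if random.random() < 0.2:              # assume ~20% of replicas are unavailable
            return False
        stores.setdefault(replica_id, {})[key] = value
        return True

    def quorum_write(key, value, stores):
        acks = sum(send_write(r, key, value, stores) for r in range(N))
        return acks >= W                       # commit only if a write quorum acked

    stores = {}
    print("committed" if quorum_write("x", 42, stores) else "aborted")

A matching read quorum R chosen so that R + W > N guarantees every read overlaps the latest committed write.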

In lazy (async) propagation, only a subset of the replicas is updated immediately. The other replicas are brought up to date lazily after the transaction commits. While eager propagation schemes typically use locks, lazy propagation schemes typically use multi-version concurrency control (MVCC) approaches to detect non-serializable behavior.

There are two types of lazy propagation -

  • Optimistic asynchronous replication
    • This approach assumes conflicts will be rare. So updates are propagated in the background and the occasional conflicts are reconciled after they happen. This means the application must tolerate some level of divergence between the replicas, which is actually acceptable for a large range of applications.
    • Examples: DNS, CVS, Mobile Database Systems, Collaborative Software, Distributed Log Collection systems etc. 
  • Pessimistic asynchronous replication
    • This approach assumes update conflicts are likely to occur and so tries to implement mechanisms to avoid divergence. The goal of this approach is to provide strict consistency.


In the primary-copy (aka master-slave) approach, there is only one primary copy that is updatable by clients. Updates flow from the primary to the secondaries, and the slaves are used for read-only queries. Most web-based systems are read-heavy, so this model suits them well. Network partitions are a problem though.
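
A minimal sketch of the request routing this model implies, with made-up node names: writes always go to the single primary, while reads are spread across the read-only secondaries.

    import itertools

    class PrimaryCopyRouter:
        def __init__(self, primary, secondaries):
            self.primary = primary
            self.secondaries = itertools.cycle(secondaries)  # round-robin reads

        def route(self, op):
            # Writes must hit the one updatable copy; reads can be served
            # by any secondary since the workload is read-heavy.
            return self.primary if op == "write" else next(self.secondaries)

    router = PrimaryCopyRouter("db-primary", ["db-replica-1", "db-replica-2"])
    print(router.route("write"))   # db-primary
    print(router.route("read"))    # db-replica-1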


In the update-everywhere (aka multi-master) approach, the copies are independently updatable. Conflicting updates, i.e., replica divergences, are likely. So to ensure eventual consistency, either the updates need to be designed to be commutative so that replicas converge (or) there needs to be automated/manual conflict resolution in place to converge the replicas.



Summarizing all the above combinations: 
  • Eager primary-copy replication - In this approach, an update operation is first performed at the primary/master copy, then synchronously propagated from the master copy to the slave copies. When the primary has confirmation that all slave copies have performed the update, it commits and returns a notification to the user.
  • Eager update-everywhere replication - In this approach, the server sends a lock request to all replica nodes synchronously. If all replica nodes grant the lock, then the update is executed at all nodes, basically a 2PC protocol. Afterwards, a notification is returned to the user. Another approach is to use group communication primitives to implement database replication, i.e., the client sends a request to its local server and the server sends an atomic broadcast.
  • Lazy primary-copy replication - In this approach, an update operation is first performed at primary/master copy. The updates from the primary are then sent asynchronously to slave copies. This asynchronous propagation could be event driven (i.e., in response to a commit event) or could be a periodically executing transaction. 
  • Lazy multi-master replication - In this approach, the updates can originate from any node and are sent asynchronously to all the replica copies. The key here is to ensure the updates are convergent and reflect all committed transactions, i.e., if no new updates occur and all nodes are connected together, then eventually all replicas converge to the same value.

Timestamped appends are one such technique. The idea being, if all nodes are in contact with all other nodes, then all their data will eventually converge to the same state. Note: timestamped updates do not work as well though. With timestamped updates we can still achieve convergence, but we run into the lost-updates problem, as the sketch below shows.
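
A small Python sketch of the difference, with made-up values: two replicas accept writes independently and then exchange state.

    # Appends: each entry is (timestamp, value); merging is set union, which is
    # commutative and idempotent, so both concurrent writes survive everywhere.
    a = {(1, "debit 10")}
    b = {(2, "credit 5")}             # concurrent append on another replica
    merged = a | b
    print(sorted(merged))             # both operations retained on both replicas

    # Updates: a single register keeps only the highest-timestamped value, so
    # replicas still converge, but one of the concurrent writes is silently lost.
    reg_a = (1, "debit 10")
    reg_b = (2, "credit 5")
    winner = max(reg_a, reg_b)        # (2, 'credit 5')
    print(winner)                     # the debit is gone: the lost-updates problem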

Another ad hoc technique for achieving convergence and eventual consistency is "last-write-wins". In recent years, CRDTs, i.e., conflict-free replicated data types, which come in convergent (state-based) and commutative (operation-based) flavors, are another mechanism gaining more traction.
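
As a concrete example, here is a minimal sketch of one of the simplest state-based CRDTs, a grow-only counter (G-Counter). Each replica increments only its own slot, and merge takes the element-wise max, which is commutative, associative and idempotent, so replicas converge regardless of delivery order.

    class GCounter:
        def __init__(self, replica_id, n_replicas):
            self.id = replica_id
            self.counts = [0] * n_replicas   # one slot per replica

        def increment(self):
            self.counts[self.id] += 1        # only touch our own slot

        def merge(self, other):
            # Element-wise max: safe to apply in any order, any number of times.
            self.counts = [max(x, y) for x, y in zip(self.counts, other.counts)]

        def value(self):
            return sum(self.counts)

    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment(); b.increment(); b.increment()
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 3       # both replicas converge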

Conflict detection & resolution...
When dealing with conflicts and replica divergences, it is generally useful to decompose the problem into two sub-parts, i.e.,
  • conflict-detection and 
  • conflict-resolution. 

Conflict detection policies can typically be classified as none, concurrency-based, or semantic-based
  • In systems where the conflict detection policy is none, conflicts are ignored. The conflicting operation is simply overwritten by a newer operation, causing lost updates. Example: DNS
  • In systems with concurrency-based detection policies, conflicts are typically detected based on the timestamps or timing of the operations. Concurrency-based policies are typically simpler and more generic but report more conflicts (see the version-vector sketch after this list).
  • Finally, in systems with semantic-based detection policies, the detection is typically application specific and is based on an understanding of the semantics of the data or operation.
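
A common concurrency-based mechanism is the version vector; the post only names the general category, so treat this as one assumed instance. Two versions conflict exactly when neither vector dominates the other:

    def compare(v1, v2):
        """Compare two version vectors (one counter per replica)."""
        le = all(x <= y for x, y in zip(v1, v2))
        ge = all(x >= y for x, y in zip(v1, v2))
        if le and ge:
            return "equal"
        if le:
            return "before"       # v1 causally precedes v2: safe to overwrite
        if ge:
            return "after"
        return "concurrent"       # neither dominates: a genuine conflict

    print(compare([1, 0], [1, 1]))   # before
    print(compare([2, 0], [0, 1]))   # concurrent -> needs resolution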

Conflict resolution policies can be either manual or automatic. In manual mode, the conflict is removed/merged manually and requires user intervention. In contrast, automatic approaches do not require user intervention. For example, concurrent updates to a mail folder can be resolved by automatically computing the union of the messages from both replicas.
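
The mail-folder example in code form, with hypothetical message ids; the resolver computes the union, so no user intervention is needed:

    replica_a = {"msg-1", "msg-2", "msg-4"}   # folder state on replica A
    replica_b = {"msg-1", "msg-3"}            # diverged state on replica B

    def resolve(folder_a, folder_b):
        # Automatic, semantic resolution: keep every message from both sides.
        return folder_a | folder_b

    print(sorted(resolve(replica_a, replica_b)))   # all four messages survive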

Reconciliation is a fundamental piece of conflict resolution. We can define different types of reconciliation strategies based on the

  • Type of input
    • The inputs to the reconciliation engine could be either the updated states of the replicas or the update operations. If the inputs are updated replica states, it is state-based reconciliation. If the inputs are update operations, it is operation-based reconciliation.
    • In state-based reconciliation, you take the state of each replica and try to reconcile them to a common state. In operation-based reconciliation, you take the operations performed on each replica and try to build a common sequence of operations.
  • Criteria for ordering updates
    • The criteria for ordering the updates could be based on semantic properties, like information about when, where and by whom an update was made (i.e., a semantic reconciler), or it could be based on timestamps, version vectors etc. (i.e., an ordinal reconciler).
    • Timestamp-based reconciliation is the most popular strategy (sketched below).
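
A minimal sketch of an operation-based, ordinal (timestamp-ordered) reconciler under an assumed log format: merge the two replicas' operation logs into one common sequence and replay it against a fresh state.

    def reconcile(log_a, log_b):
        # Order all operations by (timestamp, replica id, ...) to get one
        # common sequence, then replay it.
        ops = sorted(log_a + log_b)
        state = {}
        for ts, replica, key, value in ops:
            state[key] = value            # later timestamps win per key
        return state

    log_a = [(1, "A", "x", 10), (3, "A", "y", 7)]   # (ts, replica, key, value)
    log_b = [(2, "B", "x", 20)]
    print(reconcile(log_a, log_b))                   # {'x': 20, 'y': 7}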

Replication Schemes...
Replication is a widely used technique. For example, in distributed DBMS we come across various forms of replication strategies like - 
  • Snapshot Replication - In snapshot replication, you take a snapshot of the data from one server and move it to another server or to another database. Once the initial synchronization is done, you periodically sync the deltas.
  • Transactional Replication - In transactional replication, a replication agent monitors the servers for changes to the database and transmits those changes to the other backup servers. The transmission could be either event driven or on a periodic basis. This type of replication is typically used for server-to-server scenarios.
  • Merge Replication - In merge replication, the replicas work independently, i.e., replicas may be offline. When the replicas are connected, they are reconciled. If a conflict occurs, then typically a pre-defined conflict resolution algorithm is used to achieve consistency. This type of replication is used mostly in wireless environments.
  • Statement based Replication - In statement based replication, instead of data, the SQL queries are sent to the replicas. The replicas execute the queries in the same order to achieve consistency (sketched below). The write queries must be sent to all replica servers, whereas read-only queries can be sent to only a subset of servers.
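
A toy statement-based replication sketch using in-memory SQLite databases as stand-in replicas; the table and statements are made up for illustration:

    import sqlite3

    replicas = [sqlite3.connect(":memory:") for _ in range(3)]
    statements = [                            # the write stream, shipped as SQL
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)",
        "INSERT INTO accounts VALUES (1, 100)",
        "UPDATE accounts SET balance = balance - 25 WHERE id = 1",
    ]
    for stmt in statements:                   # every write goes to every replica,
        for db in replicas:                   # applied in the same order
            db.execute(stmt)

    # A read-only query can be served by any single replica.
    print(replicas[0].execute("SELECT balance FROM accounts").fetchone())  # (75,)

One known caveat of this scheme: non-deterministic statements (e.g., ones reading the current time or random values) can make replicas diverge even when applied in the same order.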

Similarly for file systems, the replication schemes typically utilized are -

  • Full file replication i.e., full files are replicated at multiple peers based on which node downloads the file. This strategy is simple to implement, but replicating larger files is a problem in both space & time.
  • Block level replication i.e., divide each file into an ordered sequence of fixed size blocks. An advantage of this approach is better de-duplication and efficiency during replication in both space & time. A disadvantage is that if even a single block is unavailable, the file cannot be reconstructed.
  • Erasure codes replication i.e., a scheme that provides the capability to reconstruct the original file from a smaller number of available blocks. In this approach, the source data is passed through a data encoder which adds redundancy bits (parity) to the pieces. After the blocks are retrieved, they are sent through a data decoder, which can recover the original file even if some blocks are missing (sketched below).
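
A deliberately tiny erasure-coding sketch using a single XOR parity block (a simplified RAID-4-style code, not the Reed-Solomon codes real systems use): any one lost data block can be rebuilt from the surviving blocks plus the parity.

    from functools import reduce

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    data = [b"blk1", b"blk2", b"blk3"]       # k equal-sized data blocks
    parity = reduce(xor, data)               # the encoder's redundancy block

    # Simulate losing the middle block and decoding it from the survivors.
    survivors = [data[0], data[2]]
    recovered = reduce(xor, survivors, parity)
    assert recovered == data[1]
    print(recovered)                          # b'blk2'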

So far in the above replication discussions, we made a few assumptions i.e.,
  • The replica distribution is uniform for the duration of the system. Ex: 3 replicas. 
  • The replica placement is static i.e., once a replica location is determined (ex: via consistent hashing, see the sketch after this list), the replica location stays the same unless the cluster membership changes.
  • The replica selection is static i.e., more or less we select the same replicas every time. 
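
A minimal consistent-hashing placement sketch (node names are made up, and real systems add virtual nodes for balance): nodes are hashed onto a ring, and a key's N replicas are the next N distinct nodes clockwise from the key's hash.

    import bisect, hashlib

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    nodes = ["node-a", "node-b", "node-c", "node-d"]
    ring = sorted((h(n), n) for n in nodes)        # the hash ring

    def replicas_for(key, n=3):
        start = bisect.bisect(ring, (h(key), ""))  # first node clockwise of key
        chosen = []
        for i in range(len(ring)):                 # walk the ring, wrapping around
            node = ring[(start + i) % len(ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen

    print(replicas_for("user:42"))   # the same 3 nodes while membership is stable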

While the above assumptions are fine for database management & file systems, they do not necessarily hold for P2P systems, CDNs, Grids etc. These latter systems utilize a variety of replica distribution, creation and location strategies. For example, Cassandra uses a dynamic snitch layer to select replicas dynamically based on their performance.




Replica distribution schemes - Following are some schemes used in P2P systems -

  • Uniform distribution - In this scheme, all data is replicated equally. 
  • Proportional distribution - The number of replica copies of an item is proportional to its popularity. If a file is more popular, then there is a higher chance of finding it closer to the site that is submitting the query.
  • Square-root distribution - The number of replicas is proportional to the square root of the item's query rate (the sketch below compares all three schemes).
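
A toy comparison of the three rules with hypothetical query rates and a fixed replica budget (rounding means the totals are approximate):

    import math

    query_rates = {"fileA": 100, "fileB": 25, "fileC": 1}   # assumed popularity
    total_replicas = 30                                     # fixed budget

    def allocate(weights):
        total = sum(weights.values())
        return {k: round(total_replicas * w / total) for k, w in weights.items()}

    print(allocate({k: 1 for k in query_rates}))                        # uniform
    print(allocate(query_rates))                                        # proportional
    print(allocate({k: math.sqrt(v) for k, v in query_rates.items()}))  # square-root

Note how square-root allocation gives the rare file noticeably more copies than proportional allocation, without over-replicating the popular one.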

Replica creation & selection schemes -
  • Owner based replication - The object is replicated only at the site/node that is requesting the copy after the object/file is found. Ex: Gnutella uses owner replication. 
  • Path based replication - The file is replicated at all nodes along the path through which the request is satisfied. Ex: Freenet uses path replication. 
  • Random replication - This scheme is similar to path replication, but instead of placing replicas along the path, the replicas are distributed randomly. The number of replicas created will be the same as in the path replication scheme.
  • Best client based replication - Each node basically maintains a record of the access history for each replica. If the access history exceeds a threshold, a replica is created at the requester site. 
  • Cascading replication - In this scheme, instead of creating the replica at the best client, the replica is created at the next level on the path to the best client.
  • Economy based replication - The basic principle behind economy-based replication policies is to use social concepts to simulate emergent marketplace behavior such that local optimization leads to global optimization. For example, in this model each site tries to buy a data item to create a replica at its own node and generate revenue in future by selling them to other interested nodes. 
  • Cost estimation based model - This is similar to the economic model. The difference is that in the economy model, the investment is measured only based on data access. In the cost-estimation model, it is more elaborate and the calculations are based on network latency, bandwidth, read/write statistics, replica size, downtime etc. This model is driven by the estimation of data access gains and the maintenance cost of the replica.

World Wide Web...
The WWW is another fertile area for applying replication techniques. Caching and replication are the two major techniques used for reducing latencies on the web. Caching is typically implemented on the client side to reduce access latency, whereas replication is typically implemented on the server side to ensure data is located close to where it is requested. Most caching techniques have an equivalent in replica systems.

The traffic on the WWW is typically read-heavy, so the most common replication strategy used on the internet is the primary-copy approach. Secondary web servers are replicated at different sites so they are close to the users whose requests they serve.

Wrap up...
Replication is a vast field and the above just scratches the surface. I have in mind a couple of additional items, like consistency models and replication in NoSQL systems, but that is for future posts. If you have read this far, that is great. Please feel free to share your thoughts in the comments.

References:

  • Data replication strategies in wide area distributed systems, Sushant Goel, Rajkumar Buyya 
  • Understanding Replication in Databases and Distributed Systems, M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, G. Alonso 
  • Survey of data replication in P2P systems, Vidal Martins, Esther Pacitti, Patrick Valduriez 
  • The Dangers of Replication and a Solution, Jim Gray, Pat Helland, Patrick O'Neil, Dennis Shasha
  • Database Replication: A Survey of Open Source and Commercial Tools, Salman Abdul Moiz, Sailaja P., Venkataswamy G., Supriya Pal