DynamoDB offered by Amazon pioneered the idea of Eventual Consistency as a way to achieve higher availability and scalability.
Big dataset is broken in chunks and these chunks are then sent to different machines. Some replicas of these chunks also sent to these machines to address fault-tolerance. So the two requirements which we need to deal with here are:-
- We need to ensure High Availability, so if something goes wrong data is still available(Replicas address this).
- We want to Support updates. If changes are made in some chunk on a machine, these changes need to be propagated to the replicas of chunk as well present on other machines.
Eventual consistency serves the problem of updates. Data fetched are not guaranteed to be up to date but Eventual Consistency claims that updates are guaranteed to be propagated to all the nodes eventually. If no immediate update is made after an update all replicas converge towards identical copies. What application sees in the meantime is sensitive to replication mechanics and is difficult to predict.
Let us understand how Eventual Consistency is achieved through Vector Clocks. Vector Clocks detect conflicts in a concurrent read-write scenario but don’t do anything about them automatically.
Each data item is associated with the list of (server, timestamp) pairs indicating its version history.
- A client reads D0 item and writes back D1 item handled by server SX. (Vector Clock- D1([SX,1])) (where D1- data item, SX- server, 1- timestamp)
- Another client reads D1, writes back D2 which is also handled by server SX. (Vector Clock- D2([SX,1] ,[SX,2])). Since a client from same server updates the data item later so the D2 value is kept and D1 can be garbage collected. Hence vector clock is D2([SX,2])
- Now two independent clients read D2 and write back different values. One at server Y writes back D3 and one at server Z writes back D4. Since these requests are made by different servers these instances get recorded. (Vector Clocks – D3([SX,2],[SY,1]) and D4([SX,2],[SZ,1]))
- Now if another read request comes in from a client in Server SX, a conflict will occur because there’s a same timestamp but different servers and then the Server SX will then report the conflict. Conflict resolution is done either by Last write Wins or is application specific.
In reality these timestamps are not such integers but actual clock values. So the server will read that vector clock with greater timestamp value as it is the most recent update.
Important Note– Vector clocks conflict when all values in one clock are not later than or equal to all values in other clock. So the vector clock [(SX,A),(SY,B),(SZ,C)] does not conflict with vector clock [(SX,D),(SY,E),(SZ,F)] if and only if [A≤D, B≤E and C≤F] or [D≤A, E≤B and F≤C]. If a server has no entry in a specific clock, assume its timestamp to be zero.