Insights from paper (Part II) — Dynamo: Amazon’s Highly Available Key-value Store
In the previous part, I introduced the paper and covered its requirements, design considerations, and related work, and started on a few critical parts of the system's architecture. This post covers the rest of the paper.
Handling Failures: Hinted Handoff
Had Dynamo used a traditional quorum approach, it would have been unavailable during server failures and network partitions; instead, it uses a sloppy quorum to remain highly available. So what is a sloppy quorum?
All read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring. Let’s understand this with an example.
Suppose node A is temporarily down or unreachable during a write operation. In that case, the replica that would normally have lived on node A is sent to another healthy node further down the ring.
This write carries a hint in its metadata identifying the node that was the intended recipient of the replica.
Nodes that receive hinted replicas keep them in a separate local database that is scanned periodically. Once the intended node is reachable again, the hinted replica is delivered to it and can then be removed from the local store.
With this hinted handoff, Dynamo ensures that read and write operations do not fail due to temporary node or network failures.
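To make this concrete, here is a minimal Python sketch of hinted handoff under my own simplifying assumptions: the `Node` class, the `hinted_store` dictionary, and the fallback selection are illustrative stand-ins, not Dynamo's actual implementation.

```python
# A minimal sketch of hinted handoff (illustrative; not Dynamo's real code).

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}          # normal replica storage
        self.hinted_store = {}  # hinted replicas kept in a separate local store

    def write(self, key, value, hint=None):
        if hint:
            # Remember which node this replica was really intended for.
            self.hinted_store[(key, hint)] = value
        else:
            self.data[key] = value


def replicate(key, value, preference_list, fallback_nodes):
    """Send the replica to the intended nodes; if one is down, hand the
    replica to a fallback node together with a hint naming the intended node."""
    for intended, fallback in zip(preference_list, fallback_nodes):
        if intended.alive:
            intended.write(key, value)
        else:
            fallback.write(key, value, hint=intended.name)


def handoff_scan(node, ring):
    """Periodically scan the local hinted store and deliver each replica once
    its intended node is reachable again, then drop the local copy."""
    for (key, intended_name), value in list(node.hinted_store.items()):
        intended = ring[intended_name]          # ring: name -> Node
        if intended.alive:
            intended.write(key, value)
            del node.hinted_store[(key, intended_name)]
```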
Handling permanent failures: Replica Synchronization
Hinted handoff works best if system membership churn is low and node failures are transient. In real life, node failures can last much longer. So how does Dynamo handle this?
Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized. To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees. So what is a Merkle tree?
Once again, we will cover only a high-level overview of Merkle trees here; I will cover them in a full paper post.
A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children.
If the hash values of the root of two trees (representing two nodes) are equal, then the values of the leaf nodes in the tree are equal, and the nodes require no synchronization.
If not, nodes may exchange the hash values of children, and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”.
Merkle trees minimize the amount of data that must be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.
Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date.
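As a rough illustration, the sketch below builds a tiny Merkle tree over one key range (values assumed to be bytes) and finds the leaves that disagree between two replicas. The hash choice and tree layout are simplifications of my own, not the exact structure Dynamo uses.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(items):
    """Build a Merkle tree over (key, value) pairs for one key range.
    Leaves are hashes of individual values; parents hash their children."""
    level = [h(value) for _, value in sorted(items)]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]          # duplicate last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                                  # tree[-1][0] is the root hash

def differing_leaves(tree_a, tree_b):
    """Equal roots mean the replicas agree and need no synchronization.
    Otherwise, return indices of leaves that disagree; a real implementation
    would descend level by level to exchange as few hashes as possible."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```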
Membership and Failure Detection
Ring membership
A node outage rarely signifies a permanent departure. Similarly, manual error could result in the unintentional startup of new Dynamo nodes. Because of these scenarios, Dynamo deliberately uses an explicit mechanism to initiate the addition and removal of nodes from the ring.
An administrator connects to a node in the ring and issues a membership change to join a node to the ring or remove a node from it. Note that the node which serves this request writes the membership change and its time of issue to its persistent store.
A gossip-based protocol propagates these membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second, and the two nodes efficiently reconcile their persisted membership change histories.
That covers how membership is managed. The next question is how the mapping from token ranges to (virtual) nodes is managed.
When a node starts for the first time, it chooses its set of tokens and maps existing nodes in the ring to their respective token sets. This mapping is persisted on the disk and added to the communication exchange that reconciles the membership change histories.
In summary, both the partitioning information (token mappings) and the placement information (membership changes) propagate via the gossip-based protocol.
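Here is a minimal sketch of that gossip reconciliation, assuming a simple `(status, time of issue)` history entry of my own design; the paper does not specify the exact on-disk format.

```python
import random, time

class Member:
    """A node's persisted, eventually consistent view of ring membership."""
    def __init__(self, name):
        self.name = name
        # Membership-change history: node name -> (status, time of issue).
        self.history = {name: ("joined", time.time())}
        # Token mapping gossiped alongside membership: node name -> tokens.
        self.tokens = {}

    def issue_change(self, node_name, status):
        # The node serving the administrator's request records the change
        # and its time of issue; gossip spreads it to everyone else.
        self.history[node_name] = (status, time.time())

    def gossip_once(self, peers):
        """Called roughly once per second: pick a random peer and reconcile
        histories, keeping whichever change has the later time of issue."""
        peer = random.choice(peers)
        for a, b in ((self, peer), (peer, self)):
            for node, (status, ts) in b.history.items():
                if node not in a.history or a.history[node][1] < ts:
                    a.history[node] = (status, ts)
            a.tokens.update(b.tokens)
```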
External Discovery
The challenge with the previous approach is that it can temporarily lead to a logically partitioned Dynamo ring. How? Consider two new nodes that each issue an "add to the ring" membership change at about the same time: each considers itself a member of the ring, but neither is immediately aware of the other.
To prevent logical partitions, some Dynamo nodes play the role of seeds. These seed nodes are discovered via an external mechanism and are known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely.
Typically seeds are fully functional nodes in the Dynamo ring.
Failure Detection
In Dynamo's case, a purely local notion of failure detection is sufficient: node A may consider node B failed if B does not respond to A's messages, even if B is responsive to some other node C.
Decentralized failure detection protocols use a simple gossip-style protocol that enables each node to learn about the arrival (or departure) of other nodes. Early designs of Dynamo used such a decentralized failure detector, but it is no longer in use. Failure detection in distributed systems is too vast a topic to cover here; I will have a full paper post covering the “On Scalable and Efficient Distributed Failure Detectors” paper referenced in the Dynamo paper.
As Dynamo has explicit node join and leave methods, there is no need to have a global view of the failure state.
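A small sketch of what such a purely local failure detector might look like; the timeout and retry values are arbitrary illustrative choices of mine, not figures from the paper.

```python
import time

class LocalFailureDetector:
    """Purely local view: a node marks a peer unreachable when requests to it
    time out, and retries it periodically to detect recovery."""
    def __init__(self, retry_interval=30.0):
        self.retry_interval = retry_interval
        self.unreachable = {}   # peer -> time it was marked unreachable

    def record_timeout(self, peer):
        self.unreachable[peer] = time.time()

    def record_success(self, peer):
        self.unreachable.pop(peer, None)

    def should_try(self, peer):
        # Skip peers recently marked unreachable; retry after a while to
        # check whether they have recovered.
        marked = self.unreachable.get(peer)
        return marked is None or time.time() - marked > self.retry_interval
```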
Adding/Removing Storage Nodes
In the ring membership section, we discussed that when a node is added, it chooses its set of tokens and maps existing nodes in the ring to their respective token sets.
For every key range assigned to the newly added node, there may be several nodes currently in charge of those ranges, and each of them transfers the relevant keys to the new node. These transfers are uniformly distributed across the storage nodes, which ensures fast bootstrapping.
Also, there is a confirmation round between the source and the destination, which ensures that the destination node does not receive duplicate transfers for a given key range.
When a node is removed from the system, keys reallocate in reverse, and the rest of the process remains the same as adding a node.
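The sketch below shows, under simplifying assumptions of my own (single ownership per range, no replication factor, no confirmation round), how the current owners of a new node's token positions can be identified as transfer sources.

```python
import bisect, hashlib

def token(name: str) -> int:
    """Hash a node/token name onto the ring (illustrative, not Dynamo's exact scheme)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

def ranges_to_transfer(ring, new_tokens):
    """For each token of the newly added node, find the existing node that
    currently owns that position on the ring; that node hands the matching
    key range over to the newcomer."""
    ring = sorted(ring)                          # [(token, node name), ...]
    transfers = []
    for t in sorted(new_tokens):
        idx = bisect.bisect_left(ring, (t,)) % len(ring)
        current_owner = ring[idx][1]             # clockwise successor owns the range
        transfers.append((current_owner, t))
    return transfers

# Example: existing nodes A, B, C with three tokens each, and a new node D.
# ring = [(token(f"{n}#{i}"), n) for n in "ABC" for i in range(3)]
# print(ranges_to_transfer(ring, [token(f"D#{i}") for i in range(3)]))
```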
Implementation
Each storage node in Dynamo contains many software components. The most important ones are:
Request coordination
Membership and Failure detection
Local persistence engine
The local persistence component allows for different storage engines to be plugged in. Some examples are Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with a persistent backing store. Applications choose Dynamo’s local persistence engine based on their object size distribution.
The request coordination component is built on top of an event-driven messaging substrate. The message processing pipeline is split into multiple stages, similar to the SEDA architecture. I will cover the SEDA paper in a full paper post. Let’s learn how it works.
Each client request creates a state machine on the node that received the request. The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies, and packaging the response to the client.
Each state machine instance handles exactly one client request.
After the read response has been returned to the caller, the state machine waits for a short period to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version.
The above process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it.
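Here is a condensed sketch of the read path with read repair. It is not Dynamo's code: versions are plain integers instead of vector clocks, the RPCs are sequential calls, and the `Replica` class is a toy stand-in.

```python
class Replica:
    """Toy in-memory replica; get/put stand in for Dynamo's network calls."""
    def __init__(self):
        self.store = {}                          # key -> (version, value)

    def get(self, key):
        return self.store.get(key, (0, None))

    def put(self, key, version, value):
        self.store[key] = (version, value)


def coordinate_read(key, replicas, R):
    """Collect responses from the replicas, answer the client once R have
    replied, then repair replicas that returned stale versions (read repair)."""
    responses = []                               # [(replica, version, value)]
    client_reply = None
    for replica in replicas:                     # in reality: parallel RPCs
        version, value = replica.get(key)
        responses.append((replica, version, value))
        if client_reply is None and len(responses) >= R:
            client_reply = max(responses, key=lambda r: r[1])[2]
    # Read repair: push the newest version to any replica that missed it.
    newest = max(responses, key=lambda r: r[1])
    for replica, version, _ in responses:
        if version < newest[1]:
            replica.put(key, newest[1], newest[2])
    return client_reply
```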
The request load is not uniformly distributed across objects. To counter this, any of the top N nodes in the preference list is allowed to coordinate writes. In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read, as sketched below.
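A tiny illustrative helper for that coordinator choice, assuming the client library keeps latency measurements from the preceding read (the data structure and function name are mine). Picking the fastest previous responder also increases the chance of read-your-writes consistency, since that node likely holds the data that was just read.

```python
def pick_write_coordinator(read_latencies, preference_list):
    """Among the top-N preference-list nodes, pick the one that answered the
    preceding read fastest. read_latencies: node -> seconds."""
    candidates = [n for n in preference_list if n in read_latencies]
    return min(candidates, key=read_latencies.get) if candidates else preference_list[0]
```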
Experiences and Lessons Learned
There are three main patterns in which Dynamo is used:
Business-logic-specific reconciliation: shopping-cart-style systems
Timestamp-based reconciliation: last-write-wins style systems
High-performance read engine: read-heavy workloads
The main advantage of Dynamo is that its client applications can tune the values of N, R, and W to achieve their desired levels of performance, availability, and durability.
If W is set to 1, the system will never reject a write request as long as at least one node can successfully process it.
Low values of W and R can increase the risk of inconsistency. They also introduce a durability vulnerability window: a write request is acknowledged to the client even though it has been persisted at only a small number of nodes.
The traditional view is that durability and availability go hand in hand; in Dynamo’s case this is not necessarily true. The vulnerability window for durability can be decreased by increasing W, although that may increase the probability of rejecting requests and thus reduce availability.
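A short sketch of how N, R, and W interact; the helper below is mine, but the (N, R, W) = (3, 2, 2) example is the common configuration reported in the paper.

```python
def quorum_properties(N, R, W):
    """Summarize the main N/R/W trade-offs described above."""
    return {
        # R + W > N gives an overlap between read and write sets
        # (a quorum-like consistency guarantee).
        "read_write_overlap": R + W > N,
        # With W = 1, a write is accepted as long as a single node can
        # process it, at the cost of a wider durability vulnerability window.
        "write_never_rejected": W == 1,
    }

# The common configuration reported in the paper:
print(quorum_properties(N=3, R=2, W=2))
# {'read_write_overlap': True, 'write_never_rejected': False}
```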
Balancing Performance and Durability
To achieve higher performance levels, Dynamo can trade off durability guarantees. This can be done by each storage node maintaining an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically written to storage by a writer thread.
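A minimal sketch of such a write buffer with a background writer thread; the flush interval and the `storage.put` interface are assumptions for illustration. The paper also notes that, to reduce the durability risk, the coordinator can ask one of the N replicas to perform a “durable write” while the others buffer.

```python
import threading, time

class BufferedStore:
    """Writes land in an in-memory buffer and are flushed to durable storage
    by a background writer thread; a crash can lose buffered writes, which is
    the durability/performance trade-off described above."""
    def __init__(self, storage, flush_interval=1.0):
        self.storage = storage              # assumed to expose put(key, value)
        self.buffer = {}
        self.lock = threading.Lock()
        threading.Thread(target=self._flusher, args=(flush_interval,),
                         daemon=True).start()

    def put(self, key, value):
        with self.lock:
            self.buffer[key] = value        # acknowledged before hitting disk

    def _flusher(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                pending, self.buffer = self.buffer, {}
            for key, value in pending.items():
                self.storage.put(key, value)
```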
Ensuring Uniform Load distribution
We discussed that Dynamo uses consistent hashing to partition its key space. A uniform key distribution can help us achieve uniform load distribution, assuming the access distribution of keys is balanced.
Dynamo assumes that even when there is a significant skew in the access distribution, there are enough keys at the popular end that the popular keys can be spread uniformly across the nodes, avoiding load imbalance. Dynamo's partitioning scheme was observed and analyzed so it could be improved over time. The following three partitioning strategies were evaluated (a small sketch of the third one follows the list):
T random tokens per node and partition by token value.
T random tokens per node and equal-sized partitions.
Q/S tokens per node, equal-sized partitions (where Q is the number of equal-sized partitions and S is the number of nodes in the system).
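As a quick illustration of the third strategy, the sketch below splits the hash space into Q equal-sized partitions and hands each of S nodes Q/S of them; the random assignment is my simplification. The paper found this strategy achieved the best load-balancing efficiency and simplified bootstrapping and archival.

```python
import random

def assign_partitions(Q, nodes):
    """Strategy 3: the hash space is split into Q equal-sized partitions and
    each of the S nodes is assigned Q/S of them (randomly here, for
    illustration)."""
    S = len(nodes)
    assert Q % S == 0, "Q is chosen so that Q/S is an integer"
    partitions = list(range(Q))
    random.shuffle(partitions)
    per_node = Q // S
    return {node: sorted(partitions[i * per_node:(i + 1) * per_node])
            for i, node in enumerate(nodes)}

# Example: 12 equal partitions spread over 3 nodes, 4 partitions each.
print(assign_partitions(12, ["A", "B", "C"]))
```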
Divergent Versions: When and How Many?
To understand the precise impact of different failures on consistency, a good summary metric is the number of divergent versions seen by the application in a live production environment.
In one experiment, it was found that 99.94% of requests saw exactly one version, 0.00057% saw two versions, 0.00047% saw three versions, and 0.00009% saw four versions.
Client-driven or Server-driven Coordination
In the implementation section, we saw a request coordination software component at the storage node, with a state machine that handles incoming requests. This state machine can instead run on the client side. Let’s see what happens in both cases.
Server-driven: Client requests are uniformly assigned to nodes in the ring by a load balancer. Read requests can be served by any node. Write requests are coordinated by a node in the key’s current preference list; if the versioning scheme is based on physical timestamps, any node can coordinate the write.
Client-driven: The client periodically obtains the current membership view. Using this information, the client can determine which set of nodes forms the preference list for any given key. A write request can then be forwarded directly to a node in the key’s preference list, or coordinated locally if Dynamo is using timestamp-based versioning. A read request can be coordinated at the client node, avoiding the extra network hop.
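A sketch of client-driven coordination under my own assumptions about the client library (the `membership_view()` call and the ring layout are illustrative); the 10-second refresh interval matches the polling frequency mentioned in the paper.

```python
import bisect, hashlib, time

class ClientCoordinator:
    """Client-driven coordination: the client refreshes its membership view
    periodically and computes preference lists locally, avoiding the extra
    hop through a server-side coordinator."""
    def __init__(self, dynamo_node, N=3, refresh_interval=10.0):
        self.node = dynamo_node           # any Dynamo node the client can poll
        self.N = N
        self.refresh_interval = refresh_interval
        self.ring = []                    # sorted [(token, node name), ...]
        self.last_refresh = 0.0

    def _refresh_membership(self):
        if time.time() - self.last_refresh > self.refresh_interval:
            self.ring = sorted(self.node.membership_view())   # assumed API
            self.last_refresh = time.time()

    def preference_list(self, key):
        self._refresh_membership()
        t = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
        start = bisect.bisect_left(self.ring, (t,)) % len(self.ring)
        # Walk the ring clockwise, taking the first N distinct nodes.
        nodes, i = [], start
        while len(nodes) < self.N and i < start + len(self.ring):
            name = self.ring[i % len(self.ring)][1]
            if name not in nodes:
                nodes.append(name)
            i += 1
        return nodes
```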
Balancing background vs. foreground tasks
Each node performs different kinds of background tasks for replica synchronization and data handoff in addition to its normal foreground put/get operations. These background tasks caused resource contention and affected the performance of the regular put and get operations, so Dynamo added an admission control mechanism to solve this problem.
The admission controller constantly monitors the behavior of resource accesses while executing foreground put/get operations. It uses this information to decide how many runtime slices of the resource to allocate to background tasks, and it continues to monitor foreground behavior while those tasks run.
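A rough sketch of such an admission controller; the latency-based feedback signal and the thresholds are illustrative choices of mine, since the paper describes the mechanism only at a high level.

```python
class AdmissionController:
    """Monitors foreground get/put resource behavior (e.g., disk latency) and
    only admits background work (replica sync, handoff) when foreground
    latencies are within their target; the numbers here are illustrative."""
    def __init__(self, latency_target_ms=10.0, window=100):
        self.latency_target_ms = latency_target_ms
        self.window = window
        self.samples = []

    def record_foreground_latency(self, latency_ms):
        self.samples.append(latency_ms)
        if len(self.samples) > self.window:
            self.samples.pop(0)

    def admit_background_slice(self):
        """Grant a time slice to background tasks only when recent foreground
        latencies look healthy."""
        if not self.samples:
            return True
        recent = sum(self.samples) / len(self.samples)
        return recent < self.latency_target_ms
```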
Conclusions
Dynamo is a highly available and scalable data store. It has provided the desired levels of availability and performance and has successfully handled server failures, data center failures, and network partitions. Dynamo is incrementally scalable, and it lets service owners meet their desired performance, durability, and consistency SLAs by tuning the parameters N, R, and W.
The next paper I will cover is the Amazon DynamoDB paper.
References:
Dynamo paper link: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Amazon DynamoDB paper: https://www.usenix.org/system/files/atc22-elhemali.pdf