Insights from paper (Part I)— Dynamo: Amazon’s Highly Available Key-value Store
Introduction:
Amazon runs an e-commerce platform that serves millions of customers worldwide and relies on tens of thousands of servers spread across many data centers.
You can imagine that such a platform has to be reliable: even the slightest outage has significant financial consequences and erodes customer trust. The platform is also growing continuously, so it has to be highly scalable.
With millions of components in this infrastructure, some servers and network links are always failing. The software built on top of it must therefore treat failure handling as the normal case, without impacting availability or performance.
Dynamo provides high reliability and tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance. This control is necessary as different services using the platform have different requirements.
Before we proceed, let’s understand the query access pattern of these services. They only need primary-key access to a data store, so Dynamo is built as a key-value store. That means there are no tables or rows here as there are in relational database systems.
The question is how Dynamo can provide high availability, scalability, and performance for ever-growing services.
The answer is that Dynamo combines many well-known techniques from the distributed systems world. The most important ones are listed below:
The data cannot be stored on one single node, so it is partitioned. Dynamo uses a hashing-based technique called consistent hashing for partitioning. I will cover the paper behind this technique in a separate post; a high-level overview appears in the following sections.
To provide high availability, data is stored in more than one node. This is called data replication, and the consistent hashing technique mentioned in the previous point helps do this.
The next question is how data stays consistent when it is stored on multiple nodes. Dynamo uses object versioning with client-side conflict resolution. It also uses a quorum-like technique and a decentralized replica synchronization protocol to maintain consistency among replicas.
Requirements and Design Considerations:
Dynamo has the following requirements to fulfill:
Query Model — Simple read and write operations to a single data item identified by a key. Stored objects are small (usually less than 1 MB).
ACID properties — No ACID guarantees are needed, as data stores that provide them tend to have poor availability. Weaker consistency is acceptable.
Efficiency — The system needs to function on commodity hardware while still meeting stringent latency requirements.
Security — There are no security-related requirements like authentication and authorization.
Design Considerations
When we are dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously. Dynamo favors high availability over consistency. The availability can be increased using optimistic replication techniques, where changes can propagate to replicas in the background, and concurrent, disconnected work is tolerated.
The challenge with the above approach is that it can lead to conflicting changes, which must be detected and resolved. This conflict resolution process introduces two problems: when to resolve them and who resolves them.
Dynamo has an “always writeable” requirement, so conflict resolution is performed when data is read, not when it is written. This is one of the key ideas Dynamo is built on.
As for who should resolve the conflicts, there are two choices: the data store or the application. Since the data store can only apply simple policies, Dynamo lets application (client-side) logic do it.
Beyond this, there are a few fundamental principles on which Dynamo is built.
Incremental scalability — Dynamo should be able to scale out.
Symmetry — Every node in Dynamo should have the same set of responsibilities.
Decentralization —The design should favor decentralized peer-to-peer techniques over centralized control.
Heterogeneity — The system should be able to exploit the heterogeneity in the infrastructure it runs on.
One important thing to note before we proceed: at Amazon, Service Level Agreements (SLAs) are expressed and measured at the 99.9th percentile of the distribution rather than on the mean or median. An example SLA: respond within 300 ms for 99.9% of requests at a peak load of 500 requests per second.
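To make the percentile-based SLA concrete, here is a minimal Python sketch; the function name, sample data, and exact percentile computation are my own illustration, not anything from the paper:

```python
# Minimal sketch (not from the paper): checking a 99.9th-percentile latency SLA.
def meets_sla(latencies_ms, threshold_ms=300.0, percentile=99.9):
    """Return True if `percentile` percent of requests finished within the threshold."""
    ordered = sorted(latencies_ms)
    # Smallest index k such that at least `percentile`% of samples are <= ordered[k].
    k = max(0, int(len(ordered) * percentile / 100.0 + 0.5) - 1)
    return ordered[min(k, len(ordered) - 1)] <= threshold_ms

# Example: 10,000 fast requests at 100 ms plus 5 slow outliers at 900 ms.
samples = [100.0] * 10_000 + [900.0] * 5
print(meets_sla(samples))   # True: the 99.9th-percentile latency is still 100 ms
```

Note how a handful of slow outliers does not break the SLA, while measuring only the mean would hide tail latency entirely.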
Related Work
We talked about how Dynamo was built by combining many techniques from distributed systems. In this section, we will look at the related work that influenced Dynamo. It falls into two categories: peer-to-peer (P2P) systems, and distributed file systems and databases.
Peer-to-Peer Systems:
First-generation systems — Freenet and Gnutella (file sharing systems) are unstructured P2P networks. Here peers (nodes) were connected via arbitrary overlay links, and a search query had to be flooded through the network to find the required data.
Second-generation systems — Pastry, Chord, and Beehive are structured P2P networks. A globally consistent protocol routes queries from any peer to the peer that holds the data. To reduce the additional latency introduced by multi-hop routing, some systems employed O(1) routing algorithms.
Oceanstore and PAST — These storage systems were built on top of second-generation routing overlays. They provide global, transactional, persistent storage that supports serialized updates, with conflict resolution built in; Oceanstore, for example, creates a total order of updates and then applies them atomically in that order.
Distributed File Systems and Databases:
The P2P systems mentioned in the previous section support only flat namespaces. Distributed file systems, in general, support hierarchical namespaces. Let’s have a quick review of those.
Ficus and Coda — These file systems replicate files for high availability at the expense of consistency and allow disconnected operations. Specialized conflict resolution procedures typically manage update conflicts in these.
Farsite — It is a distributed file system that does not rely on any centralized server. It achieves high availability and scalability using replication.
The Google File System (GFS) — It is a distributed file system that has a single master server for hosting the entire metadata and where the data is split into chunks and stored in chunk servers.
Bayou — It is a distributed relational database that allows disconnected operations and provides eventual consistency.
FAB — It is a distributed block storage system that splits large objects into small blocks and stores them in a highly available manner.
Antiquity — It is a wide-area distributed storage system that can handle multiple server failures. It uses a secure log to preserve data integrity and Byzantine fault tolerance protocols to ensure data consistency.
Google Bigtable — It is a distributed storage system that manages structured data.
Notably, Ficus, Coda, and Bayou are resilient to network partitions and outages. They provide an eventual consistency guarantee.
I will cover most of these systems in their own full-paper posts.
System Architecture
System Interface
Let’s start with what operations Dynamo exposes to the user:
get(key) operation — It locates the object replicas associated with the key in the storage system. It returns a single object or a list of objects with conflicting versions and a context.
put(key, context, object) operation determines where the object’s replicas should be placed based on the associated key and writes the replicas to disk.
The context information is stored along with the object so the system can verify the object’s validity supplied in the put request.
Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash on the key to generate a 128-bit identifier and determine the storage nodes responsible for serving it.
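As a rough Python sketch of this interface (the class and method bodies are placeholders I made up; only the MD5-to-128-bit hashing step is described in the paper):

```python
import hashlib

def key_to_ring_position(key: bytes) -> int:
    """Hash the opaque key with MD5 to a 128-bit integer that places it on the ring."""
    return int.from_bytes(hashlib.md5(key).digest(), byteorder="big")

class DynamoLikeStore:
    """Illustrative interface only; the real storage and routing logic are omitted."""

    def get(self, key: bytes):
        """Return (objects, context): a single object, or conflicting versions plus context."""
        raise NotImplementedError

    def put(self, key: bytes, context, obj: bytes) -> None:
        """Write a new version of obj; `context` carries version metadata from an earlier get."""
        raise NotImplementedError
```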
Partitioning Algorithms
We discussed that a single node cannot store all the data, so it must be spread across many nodes. How should we partition it, i.e., which data should live on which node?
Also, remember that Dynamo must scale incrementally, so it requires a mechanism to partition the data over the set of nodes dynamically.
There are different ways to partition the data. One of them is hash-based partitioning. In one line, the concept is that the data is hashed, and the hashed value is used to determine the partition.
Dynamo’s partitioning scheme relies on an advanced version of hashing called consistent hashing to distribute the load across multiple storage hosts. So what is consistent hashing? We will learn a very high-level overview here to understand this paper. The Dynamo paper references a consistent hashing paper. I will cover that in a full paper post.
In this hashing-based scheme, a hash function is applied to the data, and the output range of the hash function is treated as a fixed circular space or ring: after the largest output hash value, the range wraps around to the smallest.
Each node in the system is assigned a random value within this output space which we can think of as a logical representation of a position on the ring.
Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position.
It means each node becomes responsible for the region in the ring between it and its predecessor node on the ring.
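Here is a minimal sketch of this basic scheme in Python (one position per node, illustrative names; not Dynamo's actual implementation):

```python
import bisect
import hashlib

def ring_position(value: bytes) -> int:
    """Map a value onto the 128-bit circular hash space."""
    return int.from_bytes(hashlib.md5(value).digest(), "big")

class BasicRing:
    def __init__(self, node_names):
        # Each node gets a single position on the ring by hashing its name.
        self.positions = sorted((ring_position(n.encode()), n) for n in node_names)

    def node_for_key(self, key: bytes) -> str:
        """Walk clockwise from the key's position to the first node at or past it."""
        idx = bisect.bisect_left(self.positions, (ring_position(key), ""))
        return self.positions[idx % len(self.positions)][1]   # wrap around at the end

ring = BasicRing(["node-a", "node-b", "node-c"])
print(ring.node_for_key(b"shopping-cart:12345"))
```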
Now we understand the basics of consistent hashing. Next, let's look at the pros and cons of using it.
Pros: The departure or arrival of a node only affects its immediate neighbors.
Cons: (1) Random position assignment of each node on the ring leads to non-uniform data and load distribution. (2) The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
Dynamo overcomes the cons by using a variant of consistent hashing which adds a bit of complexity to it.
Instead of mapping a node to a single point on the ring, each node gets assigned to multiple points. Each point is a virtual node belonging to one physical node (a short sketch of this variant follows the list below). This has the following benefits:
(1) If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes.
(2) When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
(3) The number of virtual nodes a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
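Extending the earlier sketch, the virtual-node variant might look as follows; the token counts per node are illustrative stand-ins for capacity, not values from the paper:

```python
import bisect
import hashlib

def ring_position(value: bytes) -> int:
    return int.from_bytes(hashlib.md5(value).digest(), "big")

class VirtualNodeRing:
    def __init__(self, tokens_per_node):
        # tokens_per_node: e.g. {"node-a": 3, "node-b": 1} - more tokens for bigger machines.
        self.positions = sorted(
            (ring_position(f"{node}#{i}".encode()), node)
            for node, tokens in tokens_per_node.items()
            for i in range(tokens)
        )

    def node_for_key(self, key: bytes) -> str:
        idx = bisect.bisect_left(self.positions, (ring_position(key), ""))
        return self.positions[idx % len(self.positions)][1]

# node-a has three times node-b's capacity, so it claims three virtual nodes.
ring = VirtualNodeRing({"node-a": 3, "node-b": 1})
print(ring.node_for_key(b"shopping-cart:12345"))
```

Because each physical node owns many small, scattered ranges instead of one large one, the load of a failed node is spread across many survivors rather than dumped on a single neighbor.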
Replication
In the last section, we solved one of the core parts of the system. The next big part is how to maintain availability and durability. Dynamo replicates its data on multiple hosts to achieve this.
Each data item is replicated at N (configurable) hosts.
Each key, k, is assigned to a virtual node and, through it, to a physical node. This node is called the coordinator node for that key in Dynamo.
In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring.
The list of nodes responsible for storing a particular key is called the preference list.
Every node in the system can determine which nodes should be in this list for any particular key.
The preference list contains more than N nodes to account for node failures.
Dynamo does one more piece of fine-tuning. With virtual nodes, the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes. To address this, the preference list for a key is constructed by skipping positions in the ring so that the list contains only distinct physical nodes.
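A sketch of how such a preference list could be built, reusing the `positions` list of (ring position, physical node) pairs from the sketches above; the function name and structure are my own:

```python
import bisect

def preference_list(positions, key_position: int, n: int):
    """Walk the ring clockwise from the key, collecting N distinct physical nodes
    and skipping additional virtual nodes owned by hosts already in the list."""
    chosen = []
    start = bisect.bisect_left(positions, (key_position, ""))
    for step in range(len(positions)):
        _, node = positions[(start + step) % len(positions)]
        if node not in chosen:           # skip extra virtual nodes of the same host
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen
```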
Data Versioning
In the requirements section, we saw that strict consistency is not required. What does that mean in practice?
A put() call returns to its caller before the update has been applied at all the replicas. This can result in scenarios where a subsequent get() operation returns an object that does not have the latest updates.
Also, in some failure scenarios, the updates may arrive at only some replicas for an extended period of time.
In real life, an add or delete on the shopping cart may not yet have reached all replicas. So how is this solved?
Dynamo treats the result of each modification (put call) as a new and immutable version of the data. It allows multiple versions of an object to be simultaneously present in the system. Dynamo uses vector clocks to capture causality between different versions of the same object. So what is a vector clock?
Vector clocks are a classic distributed-systems concept, and I will cover them in a full paper post, referring to Leslie B. Lamport's paper. Here we only need a high-level overview.
A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. Each time an object is modified, the new version's vector clock is derived from the version it was based on.
The key point is that by comparing the vector clocks of two versions of an object, we can determine whether there is a happened-before relation between them or whether they are concurrent.
If the data store can resolve the conflict itself, it does; otherwise, the conflicting versions along with their vector clock information are returned to the application (client-side code), and the client must resolve the conflict.
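A minimal sketch of that comparison, representing a vector clock as a {node: counter} dict; the representation and function names are my own illustration, not Dynamo's:

```python
# Determine whether one version descends from another or the two conflict.
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a successor of clock `b` (b happened-before a)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"          # b can be discarded automatically
    if descends(b, a):
        return "b supersedes a"
    return "conflict"                    # concurrent versions: client must reconcile

# Example: two writes coordinated by different nodes from the same ancestor conflict.
print(compare({"sx": 2, "sy": 1}, {"sx": 2, "sz": 1}))   # -> "conflict"
```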
Execution of get() and put() operations
By now, you may be wondering how a get or put call reaches the correct node. There are two strategies a client can use:
(1) Route its request through a generic load balancer that will select a node based on load information.
(2) Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.
From the partitioning section, we know the coordinator node. The coordinator node handles get or put for a key. Typically this is the first of the N nodes present in the preference list.
If the requests are received through a load balancer, requests may be routed to any random node in the ring. If the node is not in the preference list, that node will forward the request to the first among the top N nodes in the preference list.
The next critical question is how Dynamo maintains consistency among its replicas. Dynamo uses a consistency protocol similar to those used in quorum systems. Let's understand that.
This protocol has two key configurable values: R and W.
R — the minimum number of nodes that must participate in a successful read operation.
W — the minimum number of nodes that must participate in a successful write operation.
R + W > N
In summary, because R + W > N, the set of nodes read from and the set of nodes written to overlap in at least one node, so a read always reaches at least one replica that received the latest successful write for the key.
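A small sketch to illustrate that overlap argument; the brute-force check and the example values (N=3, R=2, W=2) are mine, chosen only for demonstration:

```python
# With N replicas, R + W > N means any W-node write set and any R-node read set
# must share at least one node, so a read always sees the latest successful write.
from itertools import combinations

def sets_always_overlap(n: int, r: int, w: int) -> bool:
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )

print(sets_always_overlap(3, 2, 2))   # True:  R + W > N, overlap is guaranteed
print(sets_always_overlap(3, 1, 2))   # False: R + W == N, a stale read is possible
```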
In the next part, I will continue on system architecture and the rest of the paper.
References:
Dynamo paper link: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Pastry: https://www.cs.cornell.edu/people/egs/615/pastry.pdf
Chord: https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf
Beehive: http://www.cs.cornell.edu/people/egs/cs615-spring07/beehive-nsdi.pdf
Oceanstore: https://dl.acm.org/doi/pdf/10.1145/356989.357007
PAST: https://people.mpi-sws.org/~druschel/publications/PAST-hotos.pdf
Ficus: https://css.csail.mit.edu/6.824/2014/papers/ficus.pdf
Coda: https://www.cs.cmu.edu/~satya/docdir/satya-wwos1-1987.pdf
Farsite: http://www.usenix.org/events/osdi02/tech/full_papers/adya/adya.pdf
GFS: https://research.google.com/archive/gfs-sosp2003.pdf
Bayou: https://dl.acm.org/doi/pdf/10.1145/224056.224070
FAB: https://www.cs.princeton.edu/courses/archive/fall07/cos518/papers/fab.pdf
Antiquity: http://fireless.cs.cornell.edu/publications/antiquity06.pdf
BigTable: https://research.google.com/archive/bigtable-osdi06.pdf
Consistent Hashing: https://dl.acm.org/doi/pdf/10.1145/258533.258660
Vector Clocks: http://lamport.azurewebsites.net/pubs/time-clocks.pdf