Insights from paper (Part II): Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service
Global admission control
We saw in part I of this post that bursting and adaptive capacity helped reduce throughput problems for non-uniform key access.
Both techniques have their limitations, though, and the fundamental problem is the same: partition-level capacity was tightly coupled to admission control.
DynamoDB realized this and broke the coupling. Now admission control is not dependent on the partition. A partition can always burst based on workload requirements. How did DynamoDB do this? Let’s understand that.
DynamoDB replaced adaptive capacity with global admission control (GAC), which builds on the idea of token buckets. As the paper puts it, “The GAC service centrally tracks the total consumption of the table capacity in terms of tokens.”
The process looks like below:
The request routers maintain time-limited tokens locally.
A request router deducts tokens as requests arrive.
When a request router runs out of tokens, it requests more from the GAC.
The GAC instance estimates the global token consumption and vends tokens.
Each GAC server can be stopped and restarted without any impact on the overall operation of the service. Each GAC server can track one or more token buckets configured independently.
Also note that, in addition to the GAC, the partition-level token buckets were retained for defense-in-depth.
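The token flow above can be sketched as a local, time-limited bucket that falls back to a central vendor when it runs dry. This is a minimal illustration, not DynamoDB's actual implementation; the class and parameter names (`vend`, `batch`, `ttl`) are invented for the sketch:

```python
import time

class GlobalAdmissionControl:
    """Hypothetical central vendor: tracks total table consumption in tokens."""
    def __init__(self, table_capacity_per_sec):
        self.capacity = table_capacity_per_sec
        self.consumed = 0.0

    def vend(self, requested):
        # Estimate global consumption and vend at most what remains.
        available = max(self.capacity - self.consumed, 0)
        granted = min(requested, available)
        self.consumed += granted
        return granted

class RequestRouter:
    """Holds time-limited tokens locally; deducts one per request."""
    def __init__(self, gac, batch=100, ttl=1.0):
        self.gac, self.batch, self.ttl = gac, batch, ttl
        self.tokens, self.expires = 0.0, 0.0

    def admit(self):
        now = time.monotonic()
        if now >= self.expires:          # local tokens are time-limited
            self.tokens = 0.0
        if self.tokens < 1:              # ran out: ask the GAC for more
            self.tokens += self.gac.vend(self.batch)
            self.expires = now + self.ttl
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                     # throttle the request

gac = GlobalAdmissionControl(table_capacity_per_sec=1000)
router = RequestRouter(gac)
print(router.admit())  # True: the first request vends a batch and is admitted
```

Because admission is tracked at the table level by the GAC, no single partition's bucket caps the request rate, which is exactly the decoupling described above.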
Balancing consumed capacity
From the previous section, we learned that a partition can always burst. This creates a new challenge: managing burst capacity across the fleet effectively.
To see why, consider that the fleet contains storage nodes of different hardware instance types, hosting customer tables with varied traffic patterns.
Without bursting and adaptive capacity, it was easy to find a storage node that could accommodate a partition based on its allocated capacity.
Bursting allows a storage node to go above its prescribed capacity, which makes colocating tenants a more complex challenge. So how did DynamoDB solve it?
DynamoDB implemented a system to balance the partitions allocated across the storage nodes proactively.
It is implemented by having each storage node independently monitor its own overall throughput and report to the autoadmin service a list of candidate partition replicas that should be moved off the node.
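The reporting step might look like the following sketch, assuming a simple "busiest replicas first" heuristic; the headroom factor and all names here are illustrative, not taken from the paper:

```python
def candidates_to_move(partitions, node_capacity, headroom=0.9):
    """partitions: dict of partition id -> consumed throughput on this node.
    If total consumption exceeds the node's safe threshold, propose moving
    the busiest replicas until the node is back under the threshold."""
    threshold = node_capacity * headroom
    total = sum(partitions.values())
    if total <= threshold:
        return []
    moves = []
    # Report the busiest replicas first as candidates for the autoadmin.
    for pid, used in sorted(partitions.items(), key=lambda kv: -kv[1]):
        if total <= threshold:
            break
        moves.append(pid)
        total -= used
    return moves

print(candidates_to_move({"p1": 500, "p2": 300, "p3": 100}, node_capacity=800))
# → ['p1']  (900 consumed vs. a 720 threshold; moving p1 brings the node to 400)
```

The autoadmin service, not the storage node, decides where the reported candidates actually land.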
Splitting for consumption
There is one more thing DynamoDB does to handle skewed traffic to a specific set of items.
Once the consumed throughput of a partition crosses a certain threshold, the partition is split for consumption. Note that this split is based on consumption, not storage size. The split point in the key range is chosen based on the observed key access distribution.
The above scheme still does not help in a few cases, such as a partition receiving high traffic to a single item: splitting cannot separate one key.
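A minimal sketch of a consumption-based split point, assuming the node tracks per-key request counts (the paper does not describe the key-distribution statistics in this detail):

```python
def choose_split_point(access_counts):
    """access_counts: sorted list of (key, observed_request_count).
    Pick the first key at which half of the observed traffic is covered,
    so the split balances traffic rather than storage."""
    total = sum(count for _, count in access_counts)
    running = 0
    for key, count in access_counts:
        running += count
        if running * 2 >= total:
            return key
    return access_counts[-1][0]

# Skewed traffic: most requests hit the low end of the key range.
counts = [("a", 400), ("b", 300), ("c", 200), ("d", 50), ("e", 50)]
print(choose_split_point(counts))  # 'b': left side gets ~700 requests, right side ~300
```

Note that if all the traffic were on key "a" alone, no split point could help, which is exactly the single-hot-item limitation mentioned above.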
On-demand provisioning
DynamoDB's original provisioning model was based on read and write capacity units. Many customers found this model hard to reason about, and most ended up over-provisioning or under-provisioning.
To improve the customer experience for spiky workloads, DynamoDB launched on-demand tables.
DynamoDB provisions on-demand tables based on consumed capacity. In the background, on-demand scales a table by splitting partitions for consumption, as discussed in the previous section; the split decision is driven by traffic.
Durability and correctness
The D in ACID stands for durability: once a write is committed, the data must never be lost.
Data loss can come from hardware failures, silent hardware errors, or software bugs. DynamoDB has mechanisms to prevent, detect, and correct potential data loss. Let's talk about each case one by one.
Hardware failures
In DynamoDB, write-ahead logs (WALs) provide durability and crash recovery.
The WAL is stored on all three replicas of a partition. The logs are also periodically archived to S3; at any point, each replica typically holds some recent, unarchived log entries.
When a node fails, all replication groups hosted on the node are down to two copies. Upon detecting an unhealthy storage replica, the leader of a replication group adds a log replica (see the diagram in part I to know more) to ensure no impact on durability.
As a refresher, a log replica doesn't store the B-tree, so adding one takes only a few seconds: the system has to copy only the recent unarchived WAL entries from a healthy replica to the new replica.
Silent data errors
Hardware failures can also silently store or return incorrect data. These errors are challenging to detect because they can happen anywhere in the system.
DynamoDB makes extensive use of checksums to detect these silent errors.
DynamoDB maintains checksums within every log entry, message, and log file and also validates data integrity for every data transfer between two nodes.
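As an illustration of the idea, here is a sketch of framing a log entry with a checksum before a node-to-node transfer and validating it on receipt. CRC32 is used only for brevity; the paper does not specify the checksum algorithm, and the frame format is invented:

```python
import json
import zlib

def frame(log_entry: dict) -> bytes:
    """Serialize a log entry with a trailing CRC32 so the receiver can detect
    silent corruption introduced anywhere between the two nodes."""
    payload = json.dumps(log_entry, sort_keys=True).encode()
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def unframe(data: bytes) -> dict:
    payload, crc = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: silent data error detected")
    return json.loads(payload)

msg = frame({"lsn": 42, "op": "put", "key": "user#1"})
assert unframe(msg) == {"lsn": 42, "op": "put", "key": "user#1"}

corrupted = bytearray(msg)
corrupted[3] ^= 0x01   # flip a single bit "in transit"
try:
    unframe(bytes(corrupted))
except ValueError as e:
    print(e)           # checksum mismatch: silent data error detected
```

Validating every message and every log file this way bounds how far a silent error can propagate before it is caught.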
Continuous verification
DynamoDB also continuously verifies data-at-rest.
An example is the scrub process. The goal of scrub is to detect errors we had not anticipated, such as bit rot. The scrub process runs and verifies two things:
All three copies of the replicas in a replication group have the same data.
The data of the live replicas matches with a copy of a replica built offline using the archived write-ahead log entries.
A similar technique of continuous verification is used to verify replicas of global tables.
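The two scrub checks can be sketched as follows, using a content hash to compare replicas and a trivial WAL replay to build the offline copy (all names and formats here are hypothetical):

```python
import hashlib

def digest(items: dict) -> str:
    """Stable content hash of a replica's key/value data."""
    h = hashlib.sha256()
    for key in sorted(items):
        h.update(f"{key}={items[key]};".encode())
    return h.hexdigest()

def rebuild_from_wal(wal_entries):
    """Replay archived write-ahead-log entries into a fresh offline replica."""
    state = {}
    for op, key, value in wal_entries:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

def scrub(replicas, archived_wal):
    # Check 1: all three live replicas hold the same data.
    digests = {digest(replica) for replica in replicas}
    assert len(digests) == 1, "live replicas diverged"
    # Check 2: live data matches a replica rebuilt offline from the archived WAL.
    assert digest(rebuild_from_wal(archived_wal)) in digests, \
        "live data does not match offline rebuild (possible bit rot)"

wal = [("put", "a", 1), ("put", "b", 2), ("delete", "a", None)]
scrub([{"b": 2}, {"b": 2}, {"b": 2}], wal)  # passes: replicas agree and match the rebuild
```

Rebuilding from the archived WAL is what lets the scrub catch bit rot that all three live copies might otherwise agree on.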
Software bugs
Because of DynamoDB's high complexity, the probability of human error in design, code, and operations increases, and such errors could cause loss or corruption of data.
DynamoDB uses formal methods extensively to ensure the correctness of its replication protocols.
When a new feature that affects the replication protocol is added, it is incorporated into the specification, and the model is checked. DynamoDB also employs extensive failure injection testing and stress testing.
Backups and restores
One great feature DynamoDB has is backup and restore. Let’s understand the nitty-gritty.
Backups and restores don't affect the performance or availability of the table, because they are built from the WALs archived in S3.
The backups are consistent across multiple partitions up to the nearest second.
The backups are full copies of DynamoDB tables stored in an Amazon S3 bucket.
A backup can be restored to a new DynamoDB table at any time.
Beyond this, DynamoDB also supports point-in-time restoration. A table that existed at any time in the previous 35 days can be restored to a different DynamoDB table in the same AWS region.
Availability
From the paper “To achieve high availability, DynamoDB tables are distributed and replicated across multiple Availability Zones (AZ) in a Region.”
It sums up clearly how high availability is achieved in DynamoDB. Let's dive into more than a decade of learnings from solving availability challenges.
Write and consistent read availability
Write availability requires a healthy leader and a healthy write quorum (two out of three replicas).
To reiterate, if one of the replicas is unresponsive, the leader adds a log replica to the group. This is the fastest way to ensure that the write quorum of the group is always met.
Introducing log replicas was a significant change to the system, and the formally proven implementation of Paxos provided the confidence to make it. With log replicas, DynamoDB is able to run millions of Paxos groups in a Region.
The leader replica serves consistent reads; eventually consistent reads can be served by any of the replicas.
Failure detection
One of the critical components of a highly available system is failure detection for the leader, because false positives in failure detection cause availability disruptions of their own.
It is easy to detect a failure where every replica loses connectivity to the leader. However, nodes can also experience gray network failures, which are much harder to detect. How does DynamoDB solve this?
When a follower replica loses connectivity to the leader, instead of immediately triggering a failover, it sends a message to the other replicas in the replication group asking whether they can communicate with the leader. If any replica responds that the leader is healthy, the follower drops its attempt to trigger a leader election.
This change in DynamoDB's failure detection algorithm significantly reduced the number of false positives, and hence the number of spurious leader elections.
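The peer-confirmation step can be sketched like this; `leader_reachable_from` stands in for the "can you talk to the leader?" message exchange and is an invented name:

```python
def should_trigger_election(peers, leader_reachable_from):
    """A follower that lost its own connection to the leader first asks its
    peers. leader_reachable_from(peer) reports whether that peer can still
    talk to the leader."""
    for peer in peers:
        if leader_reachable_from(peer):
            # A peer sees a healthy leader: our link is likely a gray
            # network failure, so drop the spurious election attempt.
            return False
    return True  # no peer can reach the leader: a real failover is needed

# Only this follower's link to the leader is broken:
print(should_trigger_election(["r2", "r3"], lambda peer: True))   # False
# The leader is really down:
print(should_trigger_election(["r2", "r3"], lambda peer: False))  # True
```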
Measuring availability
From the paper, “DynamoDB is designed for 99.999 percent availability for global tables and 99.99 percent availability for Regional tables.”
DynamoDB continuously monitors availability at the service and table levels. In addition to real-time tracking, the system runs daily jobs that aggregate availability metrics per customer.
DynamoDB also does something distinctive to measure user-perceived availability. It relies on two sets of clients: internal Amazon services and DynamoDB canary applications. Both report the availability they observe for DynamoDB API calls, which closely represents what customers are actually experiencing.
Deployments
DynamoDB pushes software updates at a regular cadence. As usual, new software goes through a full development and test cycle to build confidence before deployment. Beyond that, the software is deliberately rolled back and then tested by running functional tests.
A new software version might introduce a new message type or change the protocol in a way that the old software doesn't understand.
DynamoDB handles these kinds of changes with read-write deployments, a two-step process: the first step deploys software that can read the new message format or protocol; only once this step succeeds is the software updated to also send new messages.
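The two-step order can be sketched as follows, with a hypothetical `Node` interface; the point is the invariant that no node sends the new format until every node can read it:

```python
def deploy_read_write(nodes):
    """Step 1 teaches every node to *read* the new message format; only after
    all of them succeed does step 2 let them *send* it, so no node ever
    receives a message it cannot parse."""
    for node in nodes:                 # step 1: readers everywhere first
        if not node.deploy_reader():
            return False               # stop (and roll back): do not proceed to step 2
    for node in nodes:                 # step 2: now it is safe to send
        node.enable_new_sender()
    return True

class Node:
    """Toy stand-in for a fleet member."""
    def __init__(self):
        self.reads_new = self.sends_new = False
    def deploy_reader(self):
        self.reads_new = True
        return True
    def enable_new_sender(self):
        assert self.reads_new          # invariant: never send before reading works
        self.sends_new = True

fleet = [Node(), Node(), Node()]
print(deploy_read_write(fleet))  # True: every node now reads and sends the new format
```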
Dependencies on external services
DynamoDB depends on AWS IAM, AWS KMS, and other services on the request path. While these services are highly available, DynamoDB is designed to keep operating when they are unavailable, without sacrificing any of its security properties.
DynamoDB caches results from IAM and AWS KMS in the request routers, which authenticate every request. The cache is refreshed asynchronously.
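A sketch of such a cache, simplified to refresh synchronously on access (the real refresh is asynchronous) and to serve the stale entry when the dependency is unreachable; `AuthCache` and its parameters are invented names:

```python
import time

class AuthCache:
    """Hypothetical request-router cache of IAM/KMS results. A stale entry is
    still served if the upstream service is momentarily unavailable, so
    authentication keeps working through a short dependency outage."""
    def __init__(self, fetch, refresh_after=300.0):
        self.fetch = fetch                # the call to IAM/KMS
        self.refresh_after = refresh_after
        self.entries = {}                 # key -> (result, fetched_at)

    def get(self, key):
        now = time.monotonic()
        entry = self.entries.get(key)
        if entry is None:
            result = self.fetch(key)      # first use must hit the service
            self.entries[key] = (result, now)
            return result
        result, fetched_at = entry
        if now - fetched_at > self.refresh_after:
            try:                          # refresh, but never fail the request
                self.entries[key] = (self.fetch(key), now)
                return self.entries[key][0]
            except ConnectionError:
                pass                      # dependency down: serve the stale result
        return result

calls = []
def iam(key):
    calls.append(key)
    return f"allow:{key}"

cache = AuthCache(iam)
print(cache.get("user#1"))  # allow:user#1 (fetched from the service)
print(cache.get("user#1"))  # allow:user#1 (served from cache, no new call)
```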
Metadata availability
The most crucial metadata is the mapping between a table's primary keys and storage nodes. It contains all the partitions of a table, the key range of each partition, and the storage nodes hosting each partition.
When a request router receives a request for a table it has not seen before, it downloads the routing information for the entire table and caches it locally.
From the paper, “Since the configuration information about partition replicas rarely changes, the cache hit rate was approximately 99.75 percent.”
There is one challenge: in the case of a cold start (empty caches), every DynamoDB request would result in a metadata lookup. How is this solved?
From the paper, “DynamoDB built an in-memory distributed datastore called MemDS. MemDS stores all the metadata in memory and replicates it across the MemDS fleet.”
The data stored in MemDS is highly compressed. MemDS uses a Perkle tree data structure (a combination of the Patricia and Merkle trees), which supports lookups by full key or by key prefix, with keys stored in sorted order.
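As a stand-in for the Perkle tree's lookup contract (exact key or key prefix over sorted keys), here is a minimal sorted-array version; it omits the Patricia compression and Merkle hashing entirely, and all names are invented:

```python
import bisect

class SortedKeyIndex:
    """Toy stand-in for MemDS lookups: exact-key lookup plus prefix lookup
    over keys kept in sorted order (no trie compression, no hashing)."""
    def __init__(self, mapping):
        self.keys = sorted(mapping)
        self.mapping = dict(mapping)

    def lookup(self, key):
        return self.mapping.get(key)

    def prefix(self, p):
        # Sorted order makes a prefix scan a binary search plus a short walk.
        i = bisect.bisect_left(self.keys, p)
        out = []
        while i < len(self.keys) and self.keys[i].startswith(p):
            out.append(self.keys[i])
            i += 1
        return out

idx = SortedKeyIndex({"tableA#p0": "node1", "tableA#p1": "node2", "tableB#p0": "node3"})
print(idx.lookup("tableA#p1"))  # node2
print(idx.prefix("tableA#"))    # ['tableA#p0', 'tableA#p1']
```

A prefix lookup like `tableA#` is what lets a router fetch routing information for a whole table at once.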
To avoid the cold-start issue, each request router hosts a partition map cache. Notably, even a cache hit results in an asynchronous call to MemDS to refresh the cache, which keeps the load on MemDS steady regardless of the hit rate.
Each partition membership update at the storage nodes is propagated to all MemDS nodes, and MemDS is horizontally scalable. If the MemDS information is stale, the incorrectly contacted storage node either responds with the latest membership or returns an error code that triggers another MemDS lookup by the request router.
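The request router's handling of a stale partition map can be sketched as follows; the status codes and function shapes are invented for illustration:

```python
def route_request(key, cache, memds_lookup, storage_nodes):
    """cache: key -> node mapping held by the request router (may be stale).
    storage_nodes[node](key) returns ("ok", result), ("moved", latest_node)
    when the node knows the latest membership, or ("unknown", None) when it
    no longer hosts the partition at all."""
    node = cache.get(key) or memds_lookup(key)
    status, payload = storage_nodes[node](key)
    if status == "ok":
        cache[key] = node
        return payload
    if status == "moved":                 # node replies with the latest membership
        cache[key] = payload
        return storage_nodes[payload](key)[1]
    cache.pop(key, None)                  # error code: redo the MemDS lookup
    node = memds_lookup(key)
    cache[key] = node
    return storage_nodes[node](key)[1]

nodes = {
    "n1": lambda k: ("moved", "n2"),      # n1 no longer hosts the partition
    "n2": lambda k: ("ok", f"value-of-{k}"),
}
cache = {"user#1": "n1"}                  # stale routing entry
print(route_request("user#1", cache, lambda k: "n2", nodes))  # value-of-user#1
print(cache["user#1"])                    # n2: the stale entry is corrected
```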
Micro benchmarks
DynamoDB ran YCSB workloads to show that scale does not affect latencies.
Conclusion
DynamoDB is a cloud-native NoSQL database that provides steady performance, high availability, and low operational complexity. It has maintained these critical properties for more than a decade, while offering fantastic features like on-demand capacity, point-in-time backup and restore, multi-Region replication, and atomic transactions, to name a few.
References
Amazon DynamoDB paper: https://www.usenix.org/system/files/atc22-elhemali.pdf
Part I of the post: https://medium.com/@hemant-gupta/insights-from-paper-part-i-amazon-dynamodb-a-scalable-predictably-performant-and-fully-6d93dfbfe2fb
Formal methods: https://assets.amazon.science/67/f9/92733d574c11ba1a11bd08bfb8ae/how-amazon-web-services-uses-formal-methods.pdf
YCSB: https://courses.cs.duke.edu/fall13/cps296.4/838-CloudPapers/ycsb.pdf
Paxos: https://lamport.azurewebsites.net/pubs/paxos-simple.pdf