Insights from paper: CockroachDB: The Resilient Geo-Distributed SQL Database
1. Abstract
Organizations with a global user base are replacing legacy DBMSs with cloud-based systems to scale OLTP workloads for millions of users.
CockroachDB is a scalable SQL DBMS that handles global OLTP workloads while maintaining high availability and consistency.
The paper presents the design of CockroachDB. It discusses the transaction model used to support consistent geo-distributed transactions on commodity hardware and the issue of fault tolerance.
Finally, the lessons learned are shared.
2. Introduction
In practice, large organizations have a user base spread across the world, which imposes several requirements on the database.
For example, an organization has to comply with GDPR, so the personal data of EU users must be kept inside the EU. Users expect an always-on experience, so the system should be fault-tolerant. The database should support SQL transactions to avoid data anomalies.
Considering all the above requirements, CockroachDB focuses on the following features:
Fault tolerance and high availability:
CockroachDB maintains at least three replicas of every partition in the database across diverse geographic zones. It provides automatic recovery mechanisms for node failure.
Geo-distributed partitioning and replica placement:
It is horizontally scalable. It automatically migrates the data as nodes are added.
Users can also control how data is partitioned across nodes and where replicas should be located.
High-performance transactions:
Its transaction protocol supports performant geo-distributed transactions that can span multiple partitions.
It provides serializable isolation, uses standard clock synchronization mechanisms such as NTP, and can run on any public or private cloud.
3. System Overview
3.1 Architecture of CockroachDB
CockroachDB uses a shared-nothing architecture.
All the nodes are used for both data storage and computation. A cluster consists of an arbitrary number of nodes.
Nodes may be co-located in the same data center or spread globally. Clients can connect to any node in the cluster.
There are different layers in a node. Let’s examine them one by one.
3.1.1 SQL: SQL is the highest-level layer. It is the interface for all user interactions with the database. It includes the parser, optimizer, and SQL execution engine. This layer converts high-level SQL statements to low-level read-and-write requests to the underlying key-value (KV) store.
3.1.2 Transactional KV: This is the next layer. The requests from the SQL layer are passed to this layer, which ensures the atomicity of changes and is also largely responsible for the database’s isolation guarantees.
3.1.3 Distribution: This layer presents the abstraction of a single, ordered, monolithic logical key space. All user data and system metadata are addressed within this key space.
The database uses range partitioning on the keys to divide the data into contiguous, ordered chunks of ~64 MiB called Ranges.
This layer is responsible for identifying which Ranges should handle which subset of each query and routing the subsets accordingly; a minimal sketch of this key-to-Range routing follows the layer list below.
3.1.4 Replication: By default, each range is replicated three ways, each replica stored in a different node. This layer ensures the durability of modifications using consensus-based replication.
3.1.5 Storage: This is the bottommost level. It represents a local disk-backed KV store. It provides efficient writes and range scans. At the time of writing, RocksDB powered this layer.
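To make the Distribution layer concrete, here is a minimal Go sketch of key-to-Range routing. The rangeDescriptor type and lookupRange function are illustrative assumptions of mine, not CockroachDB's actual API; they only show how an ordered key space split into ~64 MiB Ranges could be consulted to route a request.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeDescriptor is a hypothetical descriptor for one contiguous
// chunk of the ordered key space (a "Range" in CockroachDB terms).
type rangeDescriptor struct {
	startKey, endKey string // half-open interval [startKey, endKey)
	replicaNodes     []int  // node IDs holding replicas of this Range
}

// lookupRange finds the Range responsible for key by binary search
// over descriptors sorted by startKey.
func lookupRange(ranges []rangeDescriptor, key string) rangeDescriptor {
	i := sort.Search(len(ranges), func(i int) bool {
		return ranges[i].endKey > key
	})
	return ranges[i]
}

func main() {
	// Three Ranges covering the whole key space, each replicated 3 ways.
	ranges := []rangeDescriptor{
		{startKey: "", endKey: "g", replicaNodes: []int{1, 2, 3}},
		{startKey: "g", endKey: "p", replicaNodes: []int{2, 3, 4}},
		{startKey: "p", endKey: "\xff", replicaNodes: []int{1, 4, 5}},
	}
	r := lookupRange(ranges, "k")
	fmt.Printf("key %q -> Range [%q, %q) on nodes %v\n", "k", r.startKey, r.endKey, r.replicaNodes)
}
```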
3.2 Fault Tolerance and High Availability
3.2.1 Replication using Raft: CockroachDB uses the Raft consensus algorithm for consistent replication.
Replicas of a Range form a Raft group. Each replica is either a long-lived leader coordinating all writes to the Raft group or a follower.
The unit of replication is a command. It represents a sequence of low-level edits to the storage engine.
Raft maintains a consistent, ordered log of updates across a Range’s replicas.
3.2.2 Membership changes and automatic load (re)balancing: Nodes can be added to or removed from running CockroachDB clusters.
Raft allows the database to operate seamlessly through short-term failures as long as a majority of replicas remain available.
For longer-term failures, the database automatically creates new replicas of under-replicated Ranges.
I am not covering Raft's internals here. You can read that in detail in my post, based on the Extended Raft paper.
3.2.3 Replica placement: CockroachDB has both manual and automatic mechanisms to control replica placement.
For manual placement, users configure individual nodes in the database with a set of attributes. These attributes may specify node capability (RAM/CPU) and/or node locality.
For automatic placement, the database spreads replicas across failure domains to tolerate varying severities of failure modes.
3.3 Data Placement Policies
CockroachDB allows a wide range of possible data placement policies. A few example patterns are listed below:
Geo-Partitioned Replicas: Tables can be partitioned by access location, with each partition pinned to a specific region. This makes for fast intra-region reads and writes and the survival of availability zone (AZ) failures.
Geo-Partitioned Leaseholders: Leaseholders for partitions in a geo-partitioned table can be pinned to the access region. The remaining replicas are pinned to the remaining regions. This helps in fast intra-region reads and the survival of regional failures. However, it comes at the cost of slower cross-region writes.
Duplicated Indexes: Indexes are stored in Ranges that can be pinned to specific regions. The database can serve fast local reads by duplicating indexes on a table and pinning each index’s leaseholder to a specific region.
4. Transactions
CockroachDB transactions can span the entire key space, meaning they can touch data across the distributed cluster.
CockroachDB uses a variation of multi-version concurrency control (MVCC) to provide serializable isolation.
4.1 Overview
The SQL client sends its statements to the gateway node it is connected to, and the transaction starts there.
The gateway node receives requests from and responds to the SQL client, and it acts as the transaction coordinator: it orchestrates the transaction and ultimately decides whether to commit or roll it back.
Let’s see the coordination algorithms.
4.1.1 Execution at the transaction coordinator:
The diagram below shows the high-level steps of the transaction from the coordinator's perspective.
The coordinator receives a series of requested KV operations from the SQL layer during transaction processing. The operations are represented in the for loop at line 2.
The SQL client requires a response to the current operation before issuing the next one.
CockroachDB replicates each operation, and we do not want to stall the transaction while that replication is in flight.
To achieve this, the coordinator employs two important optimizations: Write Pipelining and Parallel Commits. Let's understand them first.
Write Pipelining:
In very simple terms, write pipelining allows for the return of results without waiting to replicate the current operation.
How does the coordinator do that?
The coordinator tracks operations that may not have been fully replicated yet. It also maintains the transaction timestamp, initialized to the current time. Both are set up in line 1 of the algorithm.
Just a quick reminder: CockroachDB uses MVCC, so the transaction timestamp is the timestamp at which the transaction performs its reads and writes.
Each operation includes the key that must be read or updated. It also includes metadata indicating whether the transaction should commit with the current operation.
Let's understand the tricky part of the algorithm at line 6, where the operation does not attempt to commit. Such an operation can be executed immediately if it does not overlap with any earlier operation.
In this way, multiple operations on different keys are pipelined. Now, the coordinator sends operation(s) to the leaseholder for execution and waits for a response, as shown in line 9.
There are two possibilities. The response may or may not contain an incremented timestamp. If not, it can be sent to the SQL layer directly.
If the timestamp increases, as shown in the other case at line 10, we have extra work to do.
Now, the coordinator has to verify that none of the keys the transaction has read changed between the original transaction timestamp and the incremented timestamp, as shown in line 11. This check preserves consistency.
Based on the verification, if no conflicting changes exist, the transaction timestamp will be updated; otherwise, the transaction will be marked as failed.
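To tie the write-pipelining steps together, here is a rough Go sketch of the coordinator loop under heavy simplification. Every type and helper (op, coordinator, sendToLeaseholder, waitForInflight, refreshReads) is a hypothetical stand-in of mine, not CockroachDB's code, and the sketch only mirrors the paper's algorithm loosely.

```go
package main

import "fmt"

// All types and helpers below are illustrative stand-ins, not CockroachDB's API.

type op struct {
	key    string
	commit bool // does this operation also try to commit the transaction?
}

type response struct {
	value string
	ts    int64 // possibly incremented by the leaseholder
}

type coordinator struct {
	txnTs    int64           // provisional transaction timestamp (an HLC value in reality)
	inflight map[string]bool // keys whose writes may not be fully replicated yet
}

// sendToLeaseholder stands in for the real RPC; here it just echoes a response.
func (c *coordinator) sendToLeaseholder(o op) response {
	return response{value: "ok", ts: c.txnTs} // pretend the timestamp was not pushed
}

// waitForInflight stands in for waiting until earlier pipelined writes replicate.
func (c *coordinator) waitForInflight() {}

// refreshReads stands in for checking that no key read so far changed in
// (c.txnTs, newTs]; a real implementation would consult MVCC history.
func (c *coordinator) refreshReads(newTs int64) bool { return true }

func (c *coordinator) run(ops []op) error {
	for _, o := range ops {
		// Pipelining: an op that does not commit and does not overlap an
		// earlier in-flight write proceeds without waiting for replication.
		if o.commit || c.inflight[o.key] {
			c.waitForInflight()
		}
		resp := c.sendToLeaseholder(o)
		if resp.ts > c.txnTs {
			// The leaseholder pushed the timestamp: verify our reads are
			// still valid at the new timestamp before advancing it.
			if !c.refreshReads(resp.ts) {
				return fmt.Errorf("conflicting change found: transaction must retry")
			}
			c.txnTs = resp.ts
		}
		c.inflight[o.key] = true
		fmt.Printf("op on %q -> %s at ts=%d\n", o.key, resp.value, c.txnTs)
	}
	return nil
}

func main() {
	c := &coordinator{txnTs: 100, inflight: map[string]bool{}}
	_ = c.run([]op{{key: "a"}, {key: "b"}, {key: "a", commit: true}})
}
```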
Parallel Commits
Again, in simple terms, parallel commits let the commit operation and the write pipeline replicate in parallel.
Again, the question is how the coordinator does that.
A typical implementation would be straightforward: first verify that all writes have been replicated, and only then commit the transaction.
The Parallel Commits protocol instead introduces a new staging transaction status, which makes the transaction's true status conditional on whether all of its writes have been replicated.
The coordinator is free to do two things in parallel:
Initiate the replication of the staging status
Verify the replication of the outstanding writes
If both succeed, the coordinator can immediately acknowledge the transaction as committed to the SQL layer.
Also, the coordinator asynchronously records the transaction status as being explicitly committed.
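Here is a small Go sketch of the Parallel Commits idea: the staging record is replicated while the outstanding writes are verified concurrently. The helpers replicateStagingRecord and verifyWrite are assumptions that always succeed; in reality both steps go through consensus-replicated state.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins: replicating the STAGING record and verifying that a
// pipelined write has been replicated. Both just succeed in this sketch.
func replicateStagingRecord() bool { return true }

func verifyWrite(key string) bool { return true }

// parallelCommit replicates a STAGING transaction record while, in parallel,
// verifying that every outstanding pipelined write has been replicated.
func parallelCommit(outstandingWrites []string) bool {
	results := make(chan bool, len(outstandingWrites)+1)
	var wg sync.WaitGroup

	// 1) Replicate the transaction record with STAGING status.
	wg.Add(1)
	go func() {
		defer wg.Done()
		results <- replicateStagingRecord()
	}()

	// 2) In parallel, verify that each pipelined write has replicated.
	for _, key := range outstandingWrites {
		wg.Add(1)
		go func(k string) {
			defer wg.Done()
			results <- verifyWrite(k)
		}(key)
	}

	wg.Wait()
	close(results)
	for ok := range results {
		if !ok {
			return false // cannot acknowledge the commit
		}
	}
	// Both steps succeeded: acknowledge the commit to the SQL layer now and
	// mark the record explicitly COMMITTED asynchronously (elided).
	return true
}

func main() {
	fmt.Println("committed:", parallelCommit([]string{"a", "b", "c"}))
}
```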
4.1.2 Execution at the Leaseholder
Now, let’s understand the case of execution on the leaseholder. It is shown in the diagram below:
The leaseholder receives an operation from the coordinator.
The first step is to check if its lease is still valid, as shown in line 2.
Next, it acquires latches on the keys of op and of all the operations op depends on, as shown in line 3.
Next, it verifies that the operations op depends on have succeeded, as shown in line 4.
Let's discuss the interesting case in which the leaseholder evaluates the operation.
It determines what data modifications are needed in the storage engine. Note that this is only an evaluation step; the changes are not applied yet.
The rest of the cases are simple.
If the operation is not committing, send the response to the coordinator without waiting for replication, as shown in line 10.
If the operation is for writing, replicate and apply the command to its local storage engine, as shown in line 12.
The leaseholder releases its latches at this point.
If the operation is committing, return the response to the coordinator node, as shown in line 15.
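Putting the leaseholder-side steps together, here is a simplified Go sketch. All types and helpers are illustrative stand-ins; latching, evaluation, and Raft replication are reduced to stubs, and the distinction between committing and pipelined writes is compressed into a comment.

```go
package main

import (
	"errors"
	"fmt"
)

// Every type and helper here is an illustrative stand-in for the steps above.

type op struct {
	keys    []string
	isWrite bool
}

type leaseholder struct {
	leaseValid bool
}

func (l *leaseholder) acquireLatches(keys []string) {}
func (l *leaseholder) releaseLatches(keys []string) {}
func (l *leaseholder) depsSucceeded(o op) bool      { return true }
func (l *leaseholder) evaluate(o op) string         { return "write-batch" } // compute changes only
func (l *leaseholder) replicateAndApply(cmd string) {}

func (l *leaseholder) execute(o op) (string, error) {
	if !l.leaseValid { // 1) the lease must still be valid
		return "", errors.New("no longer the leaseholder")
	}
	l.acquireLatches(o.keys) // 2) serialize with overlapping operations
	defer l.releaseLatches(o.keys)

	if !l.depsSucceeded(o) { // 3) operations op depends on must have succeeded
		return "", errors.New("a dependency failed")
	}
	cmd := l.evaluate(o) // 4) determine the needed changes without applying them
	if o.isWrite {
		// Committing writes wait for replication before responding; pipelined
		// (non-committing) writes respond without waiting (simplified here).
		l.replicateAndApply(cmd)
	}
	return "ok", nil // 5) respond to the coordinator
}

func main() {
	l := &leaseholder{leaseValid: true}
	fmt.Println(l.execute(op{keys: []string{"a"}, isWrite: true}))
}
```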
4.2 Atomicity Guarantees
CockroachDB achieves atomic commits for a transaction by considering all of its writes provisional until commit time.
The provisional values are called write intents. An intent is a regular MVCC KV pair preceded by metadata indicating that what follows is an intent.
The metadata points to a transaction record. This record is a special key (unique per transaction) that expresses status like pending, staging, committed, or aborted.
If a reader encounters an intent, it follows the redirect and reads the status.
If the status is committed, the reader considers the intent as a regular value.
If the status is aborted, the reader ignores the intent.
If the status is pending, the reader blocks and waits for the writing transaction to complete.
If the status is staging, the reader is unsure whether the transaction committed and attempts to abort it.
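The intent-resolution rules above can be summarized in a short Go sketch. The txnStatus values follow the text, but resolveIntent and the blocking/abort behavior are simplified stand-ins, not CockroachDB's actual intent-resolution code.

```go
package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	staging
	committed
	aborted
)

// intent is a provisional value whose metadata points at a transaction record.
type intent struct {
	key, value string
	txnRecord  *txnStatus
}

// resolveIntent returns (value, ok): ok=false means the reader ignores the
// intent. A pending status would block the reader and a staging status would
// make it try to abort the writer; both are reduced to messages here.
func resolveIntent(in intent) (string, bool) {
	switch *in.txnRecord {
	case committed:
		return in.value, true // treat the intent as a regular value
	case aborted:
		return "", false // ignore the intent
	case pending:
		fmt.Println("writer still pending: block until it finishes")
		return "", false
	default: // staging
		fmt.Println("status unclear: attempt to abort the writer")
		return "", false
	}
}

func main() {
	st := committed
	v, ok := resolveIntent(intent{key: "a", value: "v1", txnRecord: &st})
	fmt.Println(v, ok)
}
```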
4.3 Concurrency Control
CockroachDB is an MVCC system. Each transaction has a commit timestamp.
Commit timestamps provide a total ordering of all transactions in the system, which represents the serializable execution order.
Conflicts between transactions may exist, requiring adjustments to the commit timestamp. Let’s discuss these conflicts.
4.3.1 Write-read conflicts: A read conflicts with a write when it encounters an uncommitted intent on the key being read. If the intent's timestamp is lower than the read's timestamp, the read waits for the writer to finish; otherwise, it ignores the intent.
4.3.2 Read-write conflicts: A write to a key at timestamp ta cannot be performed if the same key has already been read at a timestamp tb >= ta. CockroachDB forces the writing transaction to advance its commit timestamp past tb.
4.3.3 Write-write conflicts: A writer conflicts with another writer when it encounters an uncommitted intent on the same key. If the intent's timestamp is lower, the current write waits for the earlier transaction to finish. If the intent's timestamp is higher, the current write advances its commit timestamp past it. Such waiting can lead to deadlocks, and CockroachDB uses a distributed deadlock-detection algorithm to abort one of the transactions involved.
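The three conflict rules can be boiled down to decisions over timestamps, as in the Go sketch below. These helper functions are my own simplification; the real system also involves intent resolution, wait queues, and the deadlock detector mentioned above.

```go
package main

import "fmt"

// writeReadConflict: a reader at readTs sees an uncommitted intent at intentTs.
func writeReadConflict(readTs, intentTs int64) string {
	if intentTs < readTs {
		return "wait for the writer to finish"
	}
	return "ignore the intent" // the write is in the reader's future
}

// readWriteConflict: a writer at writeTs finds a prior read at readTs >= writeTs.
func readWriteConflict(writeTs, readTs int64) int64 {
	if readTs >= writeTs {
		return readTs + 1 // advance the commit timestamp past the read
	}
	return writeTs
}

// writeWriteConflict: a writer at writeTs finds an uncommitted intent at intentTs.
func writeWriteConflict(writeTs, intentTs int64) (string, int64) {
	if intentTs < writeTs {
		return "wait for the earlier writer", writeTs
	}
	return "advance past the later intent", intentTs + 1
}

func main() {
	fmt.Println(writeReadConflict(10, 5))
	fmt.Println(readWriteConflict(10, 12))
	fmt.Println(writeWriteConflict(10, 15))
}
```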
4.4 Read Refreshes
In the previous conflict scenarios, we saw that the commit timestamp sometimes needs to be advanced.
To maintain serializability, the read timestamp must then be advanced to match the new commit timestamp.
In very simple terms, this advancement of a transaction’s read timestamp from ta to tb is possible only if we can prove that none of the data that the transaction read at ta has been updated in the interval (ta, tb ].
To make this check possible, CockroachDB tracks the set of keys in the transaction's read set.
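Here is a minimal Go sketch of the read-refresh check, assuming a toy mvccHistory map in place of the real MVCC store: advancing the read timestamp from ta to tb is allowed only if no key in the read set was written in (ta, tb].

```go
package main

import "fmt"

// mvccHistory is a toy stand-in for the MVCC store: for each key it records
// the timestamps of committed writes.
type mvccHistory map[string][]int64

func canRefresh(readSet []string, history mvccHistory, ta, tb int64) bool {
	for _, key := range readSet {
		for _, writeTs := range history[key] {
			if writeTs > ta && writeTs <= tb {
				return false // someone wrote this key in (ta, tb]: must retry
			}
		}
	}
	return true // safe to advance the read timestamp to tb
}

func main() {
	history := mvccHistory{"x": {5, 20}, "y": {8}}
	fmt.Println(canRefresh([]string{"x", "y"}, history, 10, 15)) // true
	fmt.Println(canRefresh([]string{"x", "y"}, history, 10, 25)) // false: x written at 20
}
```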
5. Clock Synchronization
CockroachDB uses NTP or Amazon Time Sync Service to maintain the clock sync of its nodes so that it can be run on any public or private cloud. Google Spanner uses special hardware for the same purpose.
5.1 Hybrid-Logical Clocks
Each node within a cluster maintains a hybrid logical clock (HLC).
The clock provides timestamps that are a combination of physical and logical time.
The physical time is based on a node's system clock, and the logical time is based on Lamport's clocks.
There is a maximum allowable offset in HLC between any two nodes in the cluster. The default value is 500 ms.
HLC provides the following important properties:
Causality tracking: The logical component of the HLC tracks causality; HLC timestamps are attached to messages exchanged between nodes.
Strict monotonicity: When a node restarts, it waits out the maximum clock offset before serving any requests to ensure strict monotonicity.
Self-stabilization: Because timestamps are exchanged on inter-node messages, nodes whose clocks drift correct themselves.
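For intuition, here is a toy hybrid logical clock in Go. It captures the combination of physical and logical time and the monotonic update on message receipt, but it is a simplification of the HLC algorithm, not CockroachDB's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// hlcTimestamp combines physical wall time with a logical counter.
type hlcTimestamp struct {
	wall    int64 // physical component (nanoseconds)
	logical int32 // logical component (Lamport-style counter)
}

type hlc struct{ last hlcTimestamp }

// Now returns a timestamp for a local event; it never goes backwards even if
// the wall clock does.
func (c *hlc) Now() hlcTimestamp {
	wall := time.Now().UnixNano()
	if wall > c.last.wall {
		c.last = hlcTimestamp{wall: wall}
	} else {
		c.last.logical++ // wall clock did not advance: bump the logical part
	}
	return c.last
}

// Update incorporates a timestamp received from another node, preserving
// causality: the local clock never lags behind what it has observed.
func (c *hlc) Update(remote hlcTimestamp) {
	if remote.wall > c.last.wall ||
		(remote.wall == c.last.wall && remote.logical > c.last.logical) {
		c.last = remote
	}
	c.Now()
}

func main() {
	var c hlc
	fmt.Println(c.Now())
	c.Update(hlcTimestamp{wall: time.Now().UnixNano() + 1_000_000, logical: 3})
	fmt.Println(c.Now())
}
```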
5.2 Uncertainty Intervals
CockroachDB satisfies single-key linearizability for reads and writes in normal conditions.
It simply means that every operation on a given key appears atomically. Also, there is some total linear order, which is consistent with the real-time ordering of those operations.
One critical point is that CockroachDB does not support strict serializability: there is no guarantee that the ordering of transactions touching disjoint key sets will match their ordering in real time.
The above is not a problem in real life unless there is a very low-latency communication channel between clients, which is extremely rare.
So, how does CockroachDB achieve single-key linearizability property?
The database tracks an uncertainty interval for each transaction.
It is considered that within this interval, the causal ordering between two transactions is indeterminate.
When a transaction is created, it is given a provisional commit timestamp commit_ts from the local HLC, and its uncertainty interval is [commit_ts, commit_ts + max_offset].
Due to clock skew, this provisional commit timestamp can be up to the cluster's max_offset earlier than the timestamps assigned by other nodes, including those of transactions that causally preceded it.
When a transaction encounters a value on a key at a timestamp above its provisional commit timestamp but within its uncertainty interval, it moves the provisional commit timestamp (lower bound) above the uncertain value.
Note that the transaction keeps the upper bound of its uncertainty interval fixed.
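The uncertainty-interval rule can be captured in a few lines of Go. The txn struct and observe method are illustrative; timestamps are plain integers here instead of HLC values.

```go
package main

import "fmt"

// txn tracks a provisional commit timestamp (the interval's lower bound) and
// a fixed upper bound equal to the initial commit timestamp plus max_offset.
type txn struct {
	commitTs       int64
	uncertaintyMax int64
}

func newTxn(now, maxOffset int64) *txn {
	return &txn{commitTs: now, uncertaintyMax: now + maxOffset}
}

// observe handles a value found at valueTs during a read.
func (t *txn) observe(valueTs int64) {
	if valueTs > t.commitTs && valueTs <= t.uncertaintyMax {
		// The value may causally precede us: move our timestamp above it.
		t.commitTs = valueTs + 1
		// t.uncertaintyMax is deliberately left unchanged.
	}
}

func main() {
	t := newTxn(100, 500) // max offset of 500 (units are arbitrary here)
	t.observe(130)        // uncertain value: push commitTs above it
	t.observe(700)        // outside the interval: ignore
	fmt.Println(t.commitTs, t.uncertaintyMax) // 131 600
}
```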
5.3 Behavior under Clock Skew
There is one more critical point to handle.
What if the clock skew is more than the maximum clock offset?
For a single range (values within it), consistency is maintained through Raft.
Raft does not have a clock dependency, so the ordering of changes will remain linearizable regardless of clock skew.
The problem arises when Range leases allow reads to be served from a leaseholder without going through Raft.
CockroachDB employs two safeguards to ensure such situations do not affect transaction isolation.
First, a leaseholder cannot serve reads for MVCC timestamps above its lease interval, nor writes for MVCC timestamps outside its lease interval.
Second, each write to a Range's Raft log includes the sequence number of the Range lease under which it was proposed. On application, the sequence number is checked against the currently active lease; if they do not match, the write is rejected.
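The second safeguard can be sketched as a lease-sequence check at command-application time. The raftCommand and replica types below are assumptions for illustration, not CockroachDB's internals.

```go
package main

import (
	"errors"
	"fmt"
)

// raftCommand carries the sequence number of the lease under which it was
// proposed, alongside its payload.
type raftCommand struct {
	leaseSeq uint64
	payload  string
}

type replica struct {
	activeLeaseSeq uint64
}

func (r *replica) applyCommand(cmd raftCommand) error {
	if cmd.leaseSeq != r.activeLeaseSeq {
		// The lease changed between evaluation and application: reject, so a
		// stale leaseholder cannot slip in writes under an expired lease.
		return errors.New("proposed under an invalid lease; rejected")
	}
	fmt.Println("applied:", cmd.payload)
	return nil
}

func main() {
	r := &replica{activeLeaseSeq: 7}
	fmt.Println(r.applyCommand(raftCommand{leaseSeq: 7, payload: "put k=v"}))
	fmt.Println(r.applyCommand(raftCommand{leaseSeq: 6, payload: "stale put"}))
}
```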
6. SQL
CockroachDB supports much of the PostgreSQL dialect of ANSI standard SQL with some extensions.
Let's understand the SQL data model and how it maps to the layers beneath it.
6.1 SQL Data Model
Every SQL table and index is stored in one or more Ranges.
All user data is stored in one or more ordered indexes. One of them is the primary index.
The primary index is keyed on the primary key, and all the other columns are stored in the value (this is how a row maps onto the KV store).
Secondary indexes are keyed on the index key. They store the primary key columns and any additional columns specified by the index schema.
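To make the mapping concrete, here is a rough Go sketch of how a row could be encoded as primary- and secondary-index KV pairs. The key format used here is purely illustrative and much simpler than CockroachDB's real key encoding.

```go
package main

import "fmt"

// row is a toy SQL row: id is the primary key, email is secondary-indexed.
type row struct {
	id    int
	email string
	name  string
}

// primaryKV builds the primary-index entry: keyed on the primary key, with the
// remaining columns in the value.
func primaryKV(tableID int, r row) (string, string) {
	key := fmt.Sprintf("/%d/primary/%d", tableID, r.id)
	value := fmt.Sprintf("email=%s,name=%s", r.email, r.name)
	return key, value
}

// secondaryKV builds a secondary-index entry: keyed on the indexed column,
// carrying the primary key columns so the row can be looked up.
func secondaryKV(tableID int, r row) (string, string) {
	key := fmt.Sprintf("/%d/email_idx/%s/%d", tableID, r.email, r.id)
	return key, fmt.Sprintf("id=%d", r.id)
}

func main() {
	r := row{id: 42, email: "a@example.com", name: "Ada"}
	fmt.Println(primaryKV(53, r))
	fmt.Println(secondaryKV(53, r))
}
```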
6.2 Query Optimizer
A Cascades-style query optimizer performs SQL query planning using over 200 transformation rules.
CockroachDB uses Optgen, a DSL for query transformations. Optgen compiles the transformation rules to Go code, integrating them with the rest of the CockroachDB codebase.
CockroachDB uses some transformation rules specific to the geo-distributed and partitioned nature of the database.
6.3 Query Planning and Execution
SQL query is executed in one of two modes:
Gateway-only mode: The node that planned the query is responsible for all SQL processing.
Distributed mode: Other nodes in the cluster participate in SQL processing.
At the time the paper was written, only read-only queries could be executed in distributed mode.
The distribution decision uses a heuristic that estimates the data that must be sent over the network.
To produce a distributed query plan, CockroachDB performs a physical planning stage that transforms the query optimizer’s plan into a directed acyclic graph (DAG) of physical SQL operators.
Physical planning splits logical scan operations into multiple TableReader operators, one for each node containing a Range read by the scan.
The logical operators are scheduled on the same nodes as the TableReaders. This helps push down filters, joins, and aggregations close to the physical data.
The diagram above shows an example of a distributed hash join across the primary indexes of two tables, a and b, on a 3-node cluster.
CockroachDB uses two different execution engines, Row-at-a-time and Vectorized, depending on input cardinality and plan complexity.
6.4 Schema Changes
CockroachDB has a protocol for schema changes.
It allows the tables to remain online during the schema change, and allows different nodes to transition to a new table schema at different times asynchronously.
It implements the solution used by Google F1. The protocol decomposes each schema change into a sequence of incremental changes.
7. Lessons Learned
7.1 Raft Made Live
The team initially chose Raft as the consensus algorithm.
They found a few issues in practice, such as excessive chatter between nodes and the restriction that replication group membership could only change one node at a time.
To address the latter, the team implemented atomic replication changes, also known as Joint Consensus.
7.2 Removal of Snapshot Isolation
The database initially offered two isolation levels, SNAPSHOT and SERIALIZABLE.
Now, it supports only SERIALIZABLE.
7.3 Postgres Compatibility
The team chose to adopt PostgreSQL’s SQL dialect and network protocol.
8. Related Work
Google Spanner is a SQL system that offers the strongest isolation level, strict serializability.
Calvin, FaunaDB, and SLOG also provide strict serializability. Their deterministic execution framework requires the read/write sets up front.
H-Store and VoltDB are main-memory databases that support serializable isolation.
Slicer performs range partitioning of hashed keys and splits/merges ranges based on load.
TiDB is an open-source distributed SQL DBMS compatible with the MySQL wire protocol and designed to support HTAP workloads.
9. Conclusion and Future Work
CockroachDB is an open-source, scalable SQL database.
The transaction protocol achieves serializable isolation at scale without specialized hardware.
Consensus-based replication provides fault tolerance, high availability, and performance optimizations.
CockroachDB’s SQL layer provides users with flexibility and familiarity with SQL.
The upcoming releases will include a completely redesigned storage layer and geo-aware query optimizations.
References
Slicer: Auto-sharding for data center applications
Volley: Automated data placement for geo-distributed cloud services
MonetDB/X100: Hyper-Pipelining Query Execution
Serializable Isolation for Snapshot Databases
Adapting to Access Locality via Live Data Migration in Globally Distributed Datastores
FoundationDB Record Layer: A Multi-Tenant Structured Datastore
PNUTS: Yahoo!’s hosted data serving platform
G-store: a scalable data store for transactional multi-key access in the cloud
Logical physical clocks and consistent snapshots in globally distributed databases
Ocean vista: gossip-based visibility control for speedy geo-distributed transactions
How fast can a distributed transaction commit?
Linearizability: A correctness condition for concurrent objects
H-store: a high-performance, distributed main memory transaction processing system
Riak core: Building distributed applications without shared state
MDCC: Multi-data center consistency
Time, clocks, and the ordering of events in a distributed system
Towards a non-2pc transaction management in distributed database systems
Low-latency multi-datacenter databases using replicated commit
Serializable snapshot isolation in PostgreSQL
SLOG: serializable, low-latency, geo-replicated transactions
Clay: Fine-grained adaptive partitioning for general database schemas
Wren: Nonblocking reads in a partitioned transactional causally consistent data store
E-store: Fine-grained elastic partitioning for distributed transaction processing systems
Calvin: fast distributed transactions for partitioned database systems
Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB
Spanstore: Cost-effective geo-replicated storage spanning multiple cloud services
Carousel: low-latency transaction processing for globally distributed data
Solar: towards a shared-everything database on distributed log-structured storage