Insights from paper - Spanner: Google’s Globally-Distributed Database
Introduction
Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database.
Spanner is the first system to distribute data globally and support externally-consistent distributed transactions.
In this paper, Google introduces a novel time API that exposes clock uncertainty in a distributed system.
At the highest level of abstraction, it is a database that automatically shards data across data centers spread worldwide. Replication is used for global availability and geographic locality. Clients automatically fail over between replicas.
Spanner can automatically migrate data across machines (even across data centers) to balance the load and in response to failures.
Spanner is designed to scale up to millions of machines across hundreds of data centers and trillions of database rows.
Spanner’s initial customer was F1, a rewrite of Google’s advertising backend. F1 uses five replicas spread across the United States.
Spanner’s primary focus is managing cross-datacenter replicated data.
Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database.
Spanner provides several exciting features:
The replication configurations for data can be dynamically controlled at a fine grain by applications.
Applications can specify constraints to control which data centers contain which data.
How far data is from its users (to control read latency)
How far replicas are from each other (to control write latency)
How many replicas are maintained (to control durability, availability, and read performance)
It provides externally consistent reads and writes.
Globally consistent reads across the database at a timestamp.
Atomic schema updates, globally, and even in the presence of ongoing transactions.
These features are enabled by the fact that Spanner assigns globally-meaningful commit timestamps to transactions, even though transactions may be distributed.
The timestamps reflect the serialization order, which satisfies external consistency (or equivalently, linearizability). Don’t worry about these terms; we will discuss them later so they make sense.
Spanner can do this because of a new TrueTime API and its implementation, which keeps clock uncertainty small (generally less than 10 ms). We will discuss this later.
Implementation
We will discuss four things here:
Structure of implementation
Rationale of implementation
Directory abstraction
Data Model
Structure Of Implementation
A Spanner deployment is called a universe. Spanner is organized as a set of zones. Zones are the unit of administrative deployment. The set of zones is the set of locations across which data can be replicated. Zones can be added to or removed from a running system (for example, when a data center is added or turned off). Zones are also the unit of physical isolation: there may be one or more zones in a data center.
A zone has one zonemaster and between one hundred and several thousand spanservers. The zonemaster assigns data to spanservers.
A zone also contains location proxies, which clients use to locate the spanservers assigned to serve their data.
The paper mentions two more components: the universe master (a console that displays status information) and the placement driver (which handles automated data movement across zones). See the diagram above.
Rationale Of Implementation
Let’s discuss the spanserver implementation to understand how replication and distributed transactions are layered onto the Bigtable-based implementation.
Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. A tablet is similar to Bigtable’s tablet abstraction. You can read my post about the Bigtable paper.
In Spanner, the tablet implements a bag of mappings of the following form:
(key:string, timestamp:int64) → string
Unlike Bigtable, where the user/application assigns timestamps to data, Spanner assigns the timestamps itself.
This makes it clear that Spanner is more of a multi-version database than a key-value store.
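To make the multi-version point concrete, here is a minimal sketch (a toy model, not Spanner’s tablet code) of a bag of (key, timestamp) → string mappings where a read returns the latest value at or before a given timestamp:

from collections import defaultdict

class MultiVersionBag:
    """Toy model of the (key:string, timestamp:int64) -> string mapping."""

    def __init__(self):
        # For each key, keep (timestamp, value) pairs sorted by timestamp.
        self._versions = defaultdict(list)

    def write(self, key, timestamp, value):
        self._versions[key].append((timestamp, value))
        self._versions[key].sort()

    def read(self, key, timestamp):
        """Return the newest value whose timestamp is <= the read timestamp."""
        result = None
        for ts, value in self._versions.get(key, []):
            if ts <= timestamp:
                result = value
            else:
                break
        return result

bag = MultiVersionBag()
bag.write("user/1", 10, "v1")
bag.write("user/1", 20, "v2")
assert bag.read("user/1", 15) == "v1"  # a read in the past sees the older version
assert bag.read("user/1", 25) == "v2"  # a current read sees the latest version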
A tablet’s state is stored in a set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to the Google File System, GFS). You can read my post about the GFS paper.
Each spanserver implements a single Paxos state machine on top of each tablet (see the diagram above). Each state machine stores its metadata and log in its corresponding tablet.
You may have a doubt in mind now: what is the Paxos state machine for?
Recall the bag of key-value mappings in the tablet. The Paxos state machine is responsible for consistently replicating this bag of mappings across the replicas’ tablets.
A write initiates the Paxos protocol at the leader.
A read is served directly from the underlying tablet at any sufficiently up-to-date replica.
Spanner calls this set of replicas a Paxos group.
At every replica that is a leader, each spanserver implements a lock table for concurrency control.
Spanner is designed for long-lived transactions. Operations that require synchronization acquire locks in the lock table; other operations bypass it.
Each spanserver implements a transaction manager at every leader replica to support distributed transactions. The transaction manager is used to implement a participant leader; the other replicas in the group will be referred to as participant slaves.
A transaction involving only one Paxos group can bypass the transaction manager, since the lock table and Paxos together provide transactionality.
If a transaction involves more than one Paxos Group, those groups’ leaders coordinate to perform a two-phase commit.
Directory abstraction
We have the bag of key-value mappings. On top of that, Spanner implements a bucketing abstraction called a directory.
Directories allow applications to control the locality of their data by choosing keys carefully.
A directory is the unit of data placement. All data in a directory has the same replication configuration.
Data is moved between Paxos groups directory by directory, as shown in the diagram below:
An example of data movement is moving a directory to shed load from a Paxos group. Directories can be moved while client operations are ongoing. One could expect that a 50MB directory can be moved in a few seconds.
Movedir is the background task used to move directories between Paxos groups. Movedir is also used to add or remove replicas to Paxos groups. Movedir is not implemented as a single transaction to avoid blocking ongoing reads and writes on a bulky data move.
Let’s cover one more critical aspect of the directory - placement.
A directory is also the smallest unit whose geographic replication properties (or placement, for short) can be specified by an application.
Administrators control two dimensions:
The number and types of replicas
The geographic placement of those replicas.
An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options.
A good example of the above: an application can store user A’s data with three replicas in Europe and user B’s data with five replicas in North America.
Data Model
Spanner exposes the following data features:
A data model based on schematized semi-relational tables.
A query language.
General purpose transactions.
The popularity of Megastore supported the need for schematized semi-relational tables and synchronous replication. Inside Google, 300+ applications use Megastore.
The popularity of Dremel (an interactive data analysis tool) supported the need for an SQL-like query language in Spanner.
The lack of cross-row transactions in Bigtable and the building of Percolator partly to address this failure supported the need for general-purpose transactions.
See the quote below from the paper, which has become well known in the context of Spanner’s general-purpose transactions:
We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.
The application data model is layered on top of the directory-bucketed key-value mappings.
An application creates one or more databases in a universe. Each database can contain an unlimited number of schematized tables. Tables look like relational-database tables, with rows, columns, and versioned values.
Spanner’s data model is not purely relational. In Spanner, rows must have names. It means every table must have an ordered set of one or more primary-key columns.
A row has existence only if some value (even if it is NULL) is defined for the row’s keys.
Let’s see an example Spanner schema for storing photo metadata on a per-user, per-album basis. Clients must partition a Spanner database into one or more hierarchies of tables. Client applications declare the hierarchies in database schemas via the INTERLEAVE IN declarations.
CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;
In this example, the Users table is the parent table in the hierarchy.
The table at the top of a hierarchy is a directory table. Users table here is declared using the keyword DIRECTORY.
So how is a directory formed?
A directory table has rows. Take a row with primary key K: this row, together with all the rows in descendant tables whose keys start with K, forms a directory.
Example: see the diagram where a directory (Directory 3665) is formed by the row with primary key 1 in the parent table Users and all the rows whose keys start with 1 in the child table Albums.
In the given example ON DELETE CASCADE means that deleting a row in the directory table also deletes any associated child rows.
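To make this concrete, here is a toy sketch (with made-up keys, not Spanner code) of how interleaving groups rows into a directory by the parent key prefix:

# Toy illustration of directory formation under INTERLEAVE IN.
# Users rows are keyed by (uid,), Albums rows by (uid, aid); all keys are made up.
users = {(1,): "user-1", (2,): "user-2"}
albums = {(1, 1): "album-1-1", (1, 2): "album-1-2", (2, 1): "album-2-1"}

def directory(uid):
    """A directory = the Users row with key uid plus all Albums rows sharing that key prefix."""
    rows = {k: v for k, v in users.items() if k[0] == uid}
    rows.update({k: v for k, v in albums.items() if k[0] == uid})
    return rows

# All rows whose keys start with uid=1 form one directory; they share a
# replication configuration and can be placed and moved together.
print(directory(1))  # {(1,): 'user-1', (1, 1): 'album-1-1', (1, 2): 'album-1-2'}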
TrueTime API
It is time to discuss the heart of the system, the TrueTime API.
The paper is focused on demonstrating the power of the API.
The API methods are listed in the table below:
The argument t given in the table is of type TTstamp.
In the API, time is represented as a TTinterval, which is an interval with bounded time uncertainty.
The core of the API is the method TT.now(). This method returns a TTinterval that is guaranteed to contain the absolute time during which TT.now() was invoked.
TT.after() and TT.before() methods are convenience wrappers around TT.now().
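Here is a minimal sketch of the API’s shape (the epsilon value and the implementation are placeholders; the real uncertainty comes from the machinery described next):

import time
from dataclasses import dataclass

@dataclass
class TTinterval:
    earliest: float  # lower bound on absolute time, in seconds
    latest: float    # upper bound on absolute time, in seconds

class TrueTime:
    """Sketch of the TrueTime interface; epsilon here is a made-up constant."""

    def __init__(self, epsilon_seconds=0.007):
        self.epsilon = epsilon_seconds

    def now(self):
        # Returns an interval guaranteed (in the real system) to contain
        # the absolute time at which now() was invoked.
        t = time.time()
        return TTinterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True if t has definitely passed.
        return t < self.now().earliest

    def before(self, t):
        # True if t has definitely not arrived.
        return t > self.now().latest

tt = TrueTime()
interval = tt.now()
assert interval.earliest <= interval.latest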
The underlying time references used by TrueTime are GPS and atomic clocks.
TrueTime is implemented by a set of time master machines per data center and a time slave daemon per machine.
The majority of masters have GPS receivers with dedicated antennas. These masters are separated physically to reduce the effects of antenna failures, radio interference, and spoofing.
The remaining masters (which Spanner refers to as Armageddon masters) are equipped with atomic clocks.
All masters’ time references are regularly compared against each other. Each master also cross-checks the rate at which its reference advances time against its own local clock and evicts itself if there is substantial divergence.
Every daemon polls a variety of masters to reduce vulnerability to errors from any one master. Daemons apply a variant of Marzullo’s algorithm to detect and reject liars, and synchronize the local machine clocks to the nonliars.
Between synchronizations, a daemon advertises a slowly increasing time uncertainty. ε (epsilon) is derived from conservatively applied worst-case local clock drift. ε also depends on time-master uncertainty and communication delay to the time masters.
Let’s see the production numbers:
ε is typically a sawtooth function of time, varying from about 1 to 7 ms.
The daemon’s poll interval is 30 seconds.
The applied drift rate is set at 200 microseconds/second, so the drift contribution to the sawtooth varies between 0 and 200 µs/s × 30 s = 6000 microseconds, i.e., 0 to 6 ms.
The remaining ~1 ms in the bounds comes from the communication delay to the time masters, which gives the 1 to 7 ms range.
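A back-of-the-envelope model of that sawtooth, using the numbers above (the constants are the paper’s figures; the function itself is just an illustration):

# Back-of-the-envelope model of the epsilon sawtooth.
POLL_INTERVAL_S = 30          # the daemon polls the time masters every 30 seconds
DRIFT_RATE_US_PER_S = 200     # conservatively applied worst-case local clock drift
BASE_UNCERTAINTY_MS = 1       # roughly: time-master uncertainty + communication delay

def epsilon_ms(seconds_since_last_sync):
    """Epsilon grows with the time since the last synchronization, then resets."""
    t = seconds_since_last_sync % POLL_INTERVAL_S
    drift_ms = DRIFT_RATE_US_PER_S * t / 1000.0
    return BASE_UNCERTAINTY_MS + drift_ms

print(epsilon_ms(0))     # ~1 ms right after a synchronization
print(epsilon_ms(29.9))  # ~7 ms just before the next poll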
Concurrency Control
We have learned the TrueTime API. Now it is time to learn how the API is used to guarantee correctness properties around concurrency control, and then how those properties are used to implement features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads in the past.
In the Spanner system, there are two different kinds of writes. Let’s keep them distinct:
Writes as seen by Paxos (Paxos writes)
Writes issued by Spanner clients
Timestamp Management
The Spanner implementation supports read-write transactions, read-only transactions, and snapshot reads, as shown in the diagram below.
In Spanner, standalone writes are implemented as read-write transactions.
Non-snapshot standalone reads are implemented as read-only transactions.
In Spanner, a read-only transaction must be predeclared as not having any writes. It is not simply a read-write transaction without any writes. Reads in a read-only transaction execute at a system-chosen timestamp without locking and they can proceed on any replica that is sufficiently up-to-date.
There is another kind of read available in Spanner that you may not have seen in other databases. You can read data from the past. It is called a snapshot read, and it is a read in the past that executes without locking. A client can either specify a timestamp for a snapshot read or provide an upper bound on the desired timestamp’s staleness.
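As a client-side sketch (the helper names are made up, and the real system can pick any timestamp that satisfies the bound), resolving a snapshot read request to a concrete timestamp might look like this:

import time

EPSILON_S = 0.007  # placeholder clock uncertainty, not a real measurement

def tt_now_latest():
    """Upper bound of a TrueTime-style interval (sketch only)."""
    return time.time() + EPSILON_S

def snapshot_read_timestamp(exact_timestamp=None, max_staleness_s=None):
    """Pick the timestamp at which a snapshot read executes.

    The client either names a timestamp explicitly or gives an upper bound
    on staleness, which is converted into a timestamp on its behalf.
    """
    if exact_timestamp is not None:
        return exact_timestamp
    if max_staleness_s is not None:
        # A simple choice: the oldest timestamp the bound allows, which lets
        # a lagging replica serve the read without blocking.
        return tt_now_latest() - max_staleness_s
    raise ValueError("specify a timestamp or a staleness bound")

t_read = snapshot_read_timestamp(max_staleness_s=10.0)  # read data at most ~10 s stale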
Paxos Leader Leases
Spanner’s Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default).
A potential leader sends requests for timed lease votes. Upon receiving a quorum of lease votes, the leader knows it has a lease. The leader requests lease-vote extensions if they are near expiration.
A replica extends its lease vote implicitly on a successful write.
A design point worth mentioning: within each Paxos group, each leader’s lease interval is disjoint from every other leader’s.
Spanner allows a Paxos leader to abdicate by releasing its slaves from their lease votes. Let’s define Smax to be the maximum timestamp used by a leader (we will see how it changes in the next section). A leader must wait until TT.after(Smax) is true before abdicating.
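A small sketch of the abdication rule, assuming a TrueTime-style helper: the outgoing leader waits until Smax has definitely passed, so its lease interval cannot overlap with a successor’s.

import time

EPSILON_S = 0.007  # placeholder clock uncertainty

def tt_after(t):
    """True if absolute time t has definitely passed (sketch of TT.after)."""
    return t < time.time() - EPSILON_S

def abdicate(s_max):
    """Before abdicating, wait until TT.after(s_max) holds.

    s_max is the maximum timestamp this leader has used; waiting it out keeps
    this leader's lease interval disjoint from the next leader's.
    """
    while not tt_after(s_max):
        time.sleep(0.001)
    # ...now release the slaves from their lease votes...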
Assigning Timestamps to RW Transactions
Transactional reads and writes are implemented using two-phase locking. Spanner assigns each transaction the timestamp of the Paxos write that represents the transaction commit.
Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. It is trivial to achieve this monotonicity inside a single Paxos group. How does Spanner achieve it across leaders?
It is enforced across leaders by using the disjointness invariant: a leader must only assign timestamps within the interval of its leader lease. Whenever a timestamp s is assigned, Smax is advanced to s to preserve disjointness.
There is one more invariant that spanner uses to achieve external consistency: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
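One way to picture how both invariants can be enforced with TrueTime (a sketch, not the paper’s exact pseudocode): the leader picks a commit timestamp no earlier than TT.now().latest and no earlier than any timestamp it has already used, and then delays the commit’s visibility until that timestamp has definitely passed, the paper’s commit wait.

import time

EPSILON_S = 0.007  # placeholder clock uncertainty

def tt_now():
    """Sketch of TT.now(): (earliest, latest) bounds on absolute time."""
    t = time.time()
    return t - EPSILON_S, t + EPSILON_S

s_max = 0.0  # maximum timestamp handed out by this leader

def assign_commit_timestamp():
    """Monotonic per leader; must also lie within the leader's lease interval."""
    global s_max
    s = max(tt_now()[1], s_max + 1e-6)  # strictly above anything assigned before
    s_max = s                           # advance s_max to preserve disjointness
    return s

def commit_wait(s):
    """Do not make the commit visible until s has definitely passed.

    Any transaction that starts after this wait sees TT.now().latest > s, so its
    commit timestamp will be greater than s: external consistency.
    """
    while tt_now()[0] <= s:
        time.sleep(0.0001)

s = assign_commit_timestamp()
commit_wait(s)  # typically a wait on the order of the clock uncertainty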
Serving Reads at a Timestamp
The monotonicity invariant we discussed in the last section helps determine whether a replica’s state is sufficiently up-to-date to satisfy a read.
Every replica tracks a value called safe time (Tsafe) which is the maximum timestamp at which a replica is up-to-date. Using this, a replica can satisfy a read at a timestamp t if t is less than or equal to Tsafe.
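In code, the check is as simple as it sounds (sketch; a read whose timestamp is above Tsafe must wait for Tsafe to advance):

def can_serve_read(t, t_safe):
    """A replica can satisfy a read at timestamp t if t <= t_safe, the maximum
    timestamp at which the replica is known to be up-to-date."""
    return t <= t_safe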
Assigning Timestamps to RO Transactions
A read-only transaction executes in two phases:
Assign a timestamp Sread
Execute the transaction’s reads as snapshot reads at Sread.
Details
In this section, the paper explains some of the practical details of read-write transactions and read-only transactions.
Read-Write Transactions
Writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s writes.
Let’s understand reads within a read-write transaction.
The client issues reads to the leader replica, which acquires read locks and then reads the most recent data.
The client sends keepalive messages to prevent participant leaders from timing out its transaction.
When a client has completed all reads and buffered all writes, it begins two-phase commit.
The client chooses a coordinator group and sends a commit message to each participant’s leader with the identity of the coordinator and any buffered writes.
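A rough sketch of that commit step from the client’s point of view (the Group class and its send_commit method are made up for illustration, not Spanner’s RPC interface):

# Rough, illustrative sketch of the client-side commit flow.
class Group:
    def __init__(self, name):
        self.name = name

    def send_commit(self, coordinator, writes):
        # In the real system this is an RPC to the group's participant leader,
        # which takes part in a two-phase commit run by `coordinator`.
        print(f"{self.name}: commit request, coordinator={coordinator.name}, writes={writes}")

def commit(buffered_writes):
    """buffered_writes maps each participant Paxos group to the writes buffered for it."""
    participants = list(buffered_writes)
    coordinator = participants[0]  # the client picks one group as coordinator
    for group in participants:
        group.send_commit(coordinator, buffered_writes[group])

g1, g2 = Group("group-1"), Group("group-2")
commit({g1: ["write A"], g2: ["write B"]})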
Read-Only Transactions
Assigning a timestamp requires a negotiation phase between all of the Paxos groups that are involved in the reads.
As a result, Spanner requires a scope expression for every read-only transaction: an expression that summarizes the keys that will be read by the entire transaction.
If the scope’s values are served by a single Paxos group, then the client issues the read-only transaction to that group’s leader.
If the scope’s values are served by multiple Paxos groups, Spanner does the following (see the sketch after this list):
The client avoids a round of negotiation with all of the groups’ leaders.
It executes its reads at Sread = TT.now().latest (which may need to wait for Tsafe to advance).
All reads go to replicas that are sufficiently up-to-date.
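A sketch of that routing decision (the helper names groups_for_scope, leader_read, and snapshot_read_at are made up for illustration):

import time

EPSILON_S = 0.007  # placeholder clock uncertainty

def tt_now_latest():
    return time.time() + EPSILON_S

def execute_read_only(scope_keys, groups_for_scope, leader_read, snapshot_read_at):
    """Route a read-only transaction based on its scope expression (sketch)."""
    groups = groups_for_scope(scope_keys)
    if len(groups) == 1:
        # Single group: issue the transaction to that group's leader, which
        # picks the read timestamp itself.
        return leader_read(groups[0], scope_keys)
    # Multiple groups: skip the negotiation round and read at TT.now().latest;
    # replicas may have to wait for their safe time to catch up to s_read.
    s_read = tt_now_latest()
    return snapshot_read_at(groups, scope_keys, s_read)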
Schema-Change Transactions
TrueTime enables Spanner to support atomic schema changes. A Spanner schema-change transaction is a generally non-blocking variant of a standard transaction.
It is explicitly assigned a timestamp in the future, which is registered in the preparation phase.
Reads and writes, which implicitly depend on the schema, synchronize with any registered schema-change timestamp at time t.
Reads and writes may proceed if their timestamps precede t, but they must block behind the schema-change transaction if their timestamps are after t.
Related Work
Consistent replication across datacenters as a storage service has been provided by Megastore and DynamoDB.
Spanner follows Megastore in providing a semi-relational data model.
Pavlo has compared the performance of databases and MapReduce. The two worlds are converging, and Spanner agrees with the conclusion.
Spanner demonstrates that integrating multiple layers has advantages: integrating concurrency control with replication reduces the cost of commit wait.
The notion of layering transactions on top of a replicated store dates back to Gifford’s dissertation.
Scatter is a DHT-based key-value store that layers transactions on top of consistent replication.
Gray and Lamport describe a non-blocking commit protocol based on Paxos.
Walter provides a variant of snapshot isolation that works within, but not across data centers.
There has been a spate of recent work on reducing or eliminating locking overheads.
Calvin eliminates concurrency control: it pre-assigns timestamps and then executes the transactions in timestamp order.
VoltDB is a sharded in-memory database that supports master-slave replication over the wide area for disaster recovery, but not more general replication configurations.
A number of commercial databases implement reads in the past, such as MarkLogic and Oracle’s Total Recall.
Farsite derived bounds on clock uncertainty relative to a trusted clock reference.
Future Work
The team has spent most of the last year working with the F1 team to transition Google’s advertising backend from MySQL to Spanner.
Now, the team is improving its monitoring and support tools, as well as the functionality and performance of its backup/restore system.
The team is implementing the Spanner schema language, automatic maintenance of secondary indices, and automatic load-based resharding.
Conclusions
Spanner combines and extends ideas from two research communities: from the database community, a familiar, easy-to-use, semi-relational interface, transactions, and an SQL-based query language; from the systems community, scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution. The team took 5 years to iterate to the current design and implementation. The team has shown that reifying clock uncertainty in the time API makes it possible to build distributed systems with much stronger time semantics.