Insights from paper - Bigtable: A Distributed Storage System for Structured Data
Introduction
The Bigtable paper was published at OSDI 2006. Bigtable is a distributed storage system for managing structured data at a massive scale (petabytes). Google built it over roughly two and a half years.
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
More than 60 products inside Google use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
In many ways, Bigtable resembles a database, but it does not support a full relational data model. Instead, it provides clients with a simple data model that supports dynamic control over data layout and format.
Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Data Model
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp. Each value in the map is an uninterpreted array of bytes. So the model looks like this:
(row:string, column:string, time:int64) → string
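To make this concrete, here is a minimal Python sketch (a toy illustration, not Bigtable code) of the map: every cell is keyed by (row, column, timestamp), and the versions of a cell are kept in decreasing timestamp order. The row and column names follow the paper's Webtable example.

```python
# Toy illustration of the Bigtable data model:
# (row:string, column:string, time:int64) -> string
class ToyBigtable:
    def __init__(self):
        # cells[(row, column)] is a list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # keep versions in decreasing timestamp order

    def get(self, row, column, timestamp=None):
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return ts, value  # newest version at or before the requested time
        return None

table = ToyBigtable()
table.put("com.cnn.www", "contents:", 3, "<html>...v3...</html>")
table.put("com.cnn.www", "contents:", 5, "<html>...v5...</html>")
print(table.get("com.cnn.www", "contents:"))     # (5, '<html>...v5...</html>')
print(table.get("com.cnn.www", "contents:", 4))  # (3, '<html>...v3...</html>')
```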
Rows
The row keys in a table are arbitrary strings. Every read or write of data under a single row key is atomic. Bigtable maintains data in lexicographic order by row key.
The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.
Reads of short row ranges are efficient and typically require communication with only a few machines. Clients can exploit this property by choosing row keys so that data they access together ends up close together in lexicographic order.
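For example, the paper's Webtable stores pages under row keys made of reversed hostname components, so all pages from the same domain sit next to each other and a short range scan stays within a few tablets. A tiny sketch of that key scheme:

```python
# Reversing hostname components (as in the paper's Webtable example) keeps pages
# from the same domain adjacent in lexicographic row order, so a short range scan
# touches only a few tablets.
def row_key(url_host, path):
    reversed_host = ".".join(reversed(url_host.split(".")))
    return reversed_host + path

keys = sorted([
    row_key("maps.google.com", "/index.html"),
    row_key("www.google.com", "/"),
    row_key("www.cnn.com", "/world"),
])
print(keys)
# ['com.cnn.www/world', 'com.google.maps/index.html', 'com.google.www/']
# All google.com pages sort next to each other under the 'com.google.' prefix.
```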
Column Families
Column keys are grouped into sets called column families. Column families are the basic unit of access control.
All data stored in a column family is usually of the same type, and data in the same column family is compressed together.
A column family must be created before data can be stored under any column key in that family. The Bigtable team suggests that the number of distinct column families in a table be kept small (in the hundreds at most).
A column key is named using the following syntax: family:qualifier. Access control and disk and memory accounting are performed at the column-family level.
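As a concrete example from the paper, the Webtable uses column keys such as anchor:cnnsi.com, where anchor is the family and the referring site is the qualifier. A trivial sketch of splitting such a key:

```python
# Split a column key of the form "family:qualifier"; access control and
# accounting apply to the family part.
def split_column_key(column_key):
    family, _, qualifier = column_key.partition(":")
    return family, qualifier

print(split_column_key("anchor:cnnsi.com"))   # ('anchor', 'cnnsi.com')
print(split_column_key("contents:"))          # ('contents', '') - empty qualifier
```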
Timestamps
Each cell in a Bigtable can contain multiple versions of the same data. These versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. Different versions of a cell are stored in decreasing timestamp order, so the most recent version is read first.
Bigtable can assign timestamps automatically, in which case they represent real time in microseconds. Client applications can also assign timestamps explicitly.
There are two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept.
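Here is a toy sketch of those two garbage-collection settings; the function name and signature are mine, purely for illustration (keep only the last n versions, or only versions newer than a given age):

```python
import time

# Toy sketch of per-column-family GC: keep the last n versions and/or only
# "new enough" versions (timestamps are microseconds, newest first).
def gc_versions(versions, max_versions=None, max_age_seconds=None, now_micros=None):
    now_micros = now_micros if now_micros is not None else int(time.time() * 1_000_000)
    kept = versions
    if max_versions is not None:
        kept = kept[:max_versions]                 # "only the last n versions of a cell"
    if max_age_seconds is not None:
        cutoff = now_micros - max_age_seconds * 1_000_000
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]   # "only new-enough versions"
    return kept

versions = [(5_000_000, "c"), (3_000_000, "b"), (1_000_000, "a")]   # newest first
print(gc_versions(versions, max_versions=2))
# [(5000000, 'c'), (3000000, 'b')]
print(gc_versions(versions, max_age_seconds=2, now_micros=6_000_000))
# [(5000000, 'c')] - only versions written within the last 2 seconds survive
```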
API
The Bigtable API provides functions for creating and deleting tables and column families.
It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key.
Bigtable allows cells to be used as integer counters.
Bigtable supports the execution of client-supplied scripts in the address spaces of the servers.
Bigtable can be used with MapReduce. A set of wrappers allows a Bigtable to be used both as an input source and as an output target for MapReduce jobs.
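The paper illustrates the API with a short C++ example that builds a RowMutation and applies it atomically. Below is a Python-flavored paraphrase of that example; the class and function names here are illustrative stand-ins, not a real client API.

```python
# Illustrative stand-in for the paper's C++ RowMutation example (not a real API).
class RowMutation:
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

def apply_mutation(mutation):
    # All operations under a single row key are applied atomically.
    print(f"atomically applying {len(mutation.ops)} ops to row {mutation.row_key!r}")

m = RowMutation("webtable", "com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")     # add/update an anchor cell
m.delete("anchor:www.abc.com")            # delete another anchor cell
apply_mutation(m)                         # applied atomically to one row
```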
Building Blocks
Bigtable is built on several other pieces of Google infrastructure. It uses the Google File System (GFS), a distributed file system, to store log and data files.
The Google SSTable file format is used internally to store Bigtable data. An SSTable is an ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.
Internally, each SSTable contains a sequence of blocks. A block index is used to locate blocks. There are two different ways to work with an SSTable:
The block index is loaded into memory when the SSTable is opened. A lookup then needs at most one disk seek: first find the appropriate block with a search in the in-memory index, then read that block from disk.
Alternatively, the complete SSTable can be mapped into memory, which allows lookups to be performed without touching the disk.
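Here is a toy Python sketch of the first mode (a simplified layout I am assuming, not the real SSTable format): a small in-memory index maps each block's first key to the block, so a lookup searches the index in memory and then reads at most one block from disk.

```python
import bisect

# Toy sketch of an SSTable block-index lookup (assumed, simplified layout).
class ToySSTable:
    def __init__(self, blocks):
        # blocks: list of dicts of key -> value, each dict standing in for one on-disk block
        self.blocks = blocks
        self.index = [min(block) for block in blocks]  # first key of each block, sorted

    def lookup(self, key):
        i = bisect.bisect_right(self.index, key) - 1   # find the candidate block in memory
        if i < 0:
            return None
        block = self.blocks[i]                         # in reality: one disk seek + block read
        return block.get(key)

sst = ToySSTable([{"a": 1, "b": 2}, {"m": 3, "p": 4}, {"x": 5}])
print(sst.lookup("m"))  # 3
print(sst.lookup("z"))  # None (the key would live in the last block, but is absent)
```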
Bigtable uses Chubby for a variety of tasks, some of which are the following:
To ensure that there is at most one active master at any time
To store the bootstrap location of Bigtable data
To discover tablet servers and finalize tablet server deaths
To store Bigtable schema information (the column family information)
To store access control lists.
So let’s understand what Chubby is and how it works in the context of Bigtable.
Chubby is a highly-available and persistent distributed lock service. For context, ZooKeeper is modeled after the Chubby lock service.
A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests. Chubby uses the Paxos algorithm to keep its replicas consistent.
Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. Each Chubby client maintains a session with a Chubby service. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.
One important thing to note: if Chubby becomes unavailable for an extended period, Bigtable becomes unavailable as well.
Implementation
The Bigtable implementation has three major components:
A library linked to every client
One master server
Many tablet servers
Tablet servers can be added or removed from the cluster dynamically.
The heart of the system is the master server, which does the following:
Assigning tablets to tablet servers
Detecting the addition and expiration of tablet servers
Balancing tablet-server load
Garbage-collecting files in GFS
Handling schema changes (such as table and column family creations)
Each tablet server manages a set of tablets (ten to a thousand tablets). The tablet server handles read and write requests to the tablets. It splits tablets that have grown too large.
In Bigtable, client data does not move through the master: clients communicate directly with tablet servers for reads and writes.
Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.
Tablet Location
Bigtable uses a three-level hierarchy, analogous to that of a B+-tree, to store tablet location information:
The first level is a file stored in Chubby containing the location of the root tablet.
The root tablet contains the location of all tablets in a special METADATA table.
Each METADATA tablet contains the location of a set of user tablets.
To keep the hierarchy at three levels, the root tablet is simply the first tablet in the METADATA table, and it is never split. The METADATA table stores the location of a tablet under a row key that encodes the tablet's table identifier and its end row.
The client library caches tablet locations. If the client does not know a tablet's location, or discovers that its cached location is stale, it recursively moves up the tablet location hierarchy. If the client's cache is empty, the location algorithm requires three network round trips.
We must remember that tablet locations are stored in memory, so no GFS accesses are required. The client library also reduces this cost by prefetching tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.
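Putting the hierarchy and the cache together, a simplified (assumed) sketch of the lookup path might look like this, with each dict standing in for an RPC to Chubby, the root tablet, or a METADATA tablet:

```python
# Toy sketch of the three-level tablet location lookup with a client-side cache.
def locate_tablet(table, row_key, cache, chubby_file, servers):
    key = (table, row_key)
    if key in cache:
        return cache[key]                                # cache hit: zero round trips
    root_addr = chubby_file["root_tablet_location"]      # round trip 1: read the Chubby file
    meta_addr = servers[root_addr][key]                  # round trip 2: ask the root tablet
    tablet_addr = servers[meta_addr][key]                # round trip 3: ask a METADATA tablet
    cache[key] = tablet_addr                             # later lookups skip all three hops
    return tablet_addr

cache = {}
chubby_file = {"root_tablet_location": "root-server"}
servers = {
    "root-server": {("users", "row42"): "meta-server-1"},      # root tablet contents
    "meta-server-1": {("users", "row42"): "tablet-server-7"},  # a METADATA tablet
}
print(locate_tablet("users", "row42", cache, chubby_file, servers))  # tablet-server-7
print(locate_tablet("users", "row42", cache, chubby_file, servers))  # cached, no hops
```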
Tablet Assignment
Each tablet is assigned to at most one tablet server at a time. The master keeps track of this assignment and of which tablets are unassigned.
When a tablet is unassigned and a tablet server with sufficient room is available, the master assigns the tablet by sending a tablet load request to that server.
Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a specified Chubby directory. A tablet server stops serving its tablets if it loses its exclusive lock.
A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, the tablet server will never be able to serve again, so it kills itself.
From another angle, whenever a tablet server terminates, it attempts to release its lock so that the master will reassign its tablets more quickly.
The master is responsible for detecting when a tablet server is no longer serving its tablets and reassigning them as soon as possible. The master periodically asks each tablet server for the status of its lock.
If a tablet server reports that it has lost its lock or the master cannot reach a server, the master attempts to acquire an exclusive lock on the server's file.
I want to highlight this step because it is critical. If the master can acquire the lock, then Chubby is live and the tablet server is either dead or cut off from Chubby, so it can no longer serve. The master therefore deletes the server’s file so the server can never serve again, and then moves all of the tablets that were assigned to that server into the set of unassigned tablets.
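A hypothetical sketch of that decision, with the Chubby calls passed in as plain callables (the names and signatures are mine, for illustration only):

```python
# Hypothetical sketch of the master's check for one tablet server.
def handle_suspect_server(server_reachable, server_holds_lock,
                          chubby_try_acquire_lock, chubby_delete_file,
                          server_tablets, unassigned_tablets):
    if server_reachable and server_holds_lock:
        return                                      # healthy: the server still owns its lock
    # The server lost its lock or is unreachable; confirm through Chubby, not the server.
    if chubby_try_acquire_lock():                   # master got the lock => Chubby is live,
        chubby_delete_file()                        # so the server is dead; delete its file
        unassigned_tablets.update(server_tablets)   # and mark its tablets for reassignment

unassigned = set()
handle_suspect_server(server_reachable=False, server_holds_lock=False,
                      chubby_try_acquire_lock=lambda: True,
                      chubby_delete_file=lambda: None,
                      server_tablets={"tablet-A", "tablet-B"},
                      unassigned_tablets=unassigned)
print(unassigned)   # {'tablet-A', 'tablet-B'} (set order may vary)
```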
We discussed how the lifecycle of a tablet server works, but what about the master server? How does its lifecycle work?
At startup, the master needs to discover the current tablet assignments. It does the following steps:
The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
The master scans the servers directory in Chubby to find the live tablet servers.
The master communicates with every live tablet server to discover what tablets are assigned to each server.
The master scans the METADATA table to learn the set of tablets.
In the above process, whenever the master encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets.
The set of existing tablets only changes in the following cases:
A table is created or deleted.
Two existing tablets are merged to form one larger tablet.
An existing tablet is split into two smaller tablets.
The master can keep track of these changes because it initiates all but the last one. Tablet splits are initiated by a tablet server, which commits the split by recording information for the new tablet in the METADATA table and then notifies the master. If this notification is lost, the master detects the new tablet when it later asks a tablet server to load the tablet that has now split.
Tablet Serving
The tablets are persisted in GFS.
When a write operation arrives at a tablet server, the server checks that the request is well-formed and that the sender is authorized to perform the mutation. A valid mutation is then written to the commit log. After the write has been committed, its contents are inserted into the memtable.
When a read operation arrives at a tablet server, the server performs similar well-formedness and authorization checks. A valid read operation is then executed on a merged view of the sequence of SSTables and the memtable.
Beyond these read and write paths, it is worth understanding how updates are stored and how a tablet is recovered.
Updates are committed to a commit log that stores redo records. The most recently committed updates are stored in memory in a sorted buffer called a memtable; older updates are stored in a sequence of SSTables.
To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into the commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have been committed since the redo points.
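A toy sketch (an assumed simplification, not real Bigtable code) tying these pieces together: writes hit the commit log and then the memtable, reads see a merged view, and recovery replays the log from the redo point.

```python
# Toy sketch of a tablet's write path, read path, and recovery.
class ToyTablet:
    def __init__(self):
        self.commit_log = []        # redo records; persisted in GFS in real Bigtable
        self.memtable = {}          # recently committed updates, in memory
        self.sstables = []          # older updates, immutable; newest SSTable last

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append a redo record to the commit log
        self.memtable[key] = value             # 2. only then insert into the memtable

    def read(self, key):
        if key in self.memtable:                  # the memtable holds the most recent updates
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # then search newest SSTable to oldest
            if key in sstable:
                return sstable[key]
        return None

    def recover(self, sstables, commit_log, redo_point):
        # Rebuild state after the tablet is reassigned to a new tablet server.
        self.sstables = list(sstables)
        self.commit_log = list(commit_log)
        self.memtable = {}
        for key, value in commit_log[redo_point:]:   # replay updates since the redo point
            self.memtable[key] = value

tablet = ToyTablet()
tablet.write("com.cnn.www/contents:", "<html>v1</html>")
print(tablet.read("com.cnn.www/contents:"))   # '<html>v1</html>'
```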
Compactions
The size of the memtable increases with write operations. When it reaches a threshold size, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This is called a minor compaction, and every minor compaction creates a new SSTable.
Since each minor compaction adds another SSTable, a merging compaction runs periodically in the background: it reads the contents of a few SSTables and the memtable and writes out a single new SSTable. A merging compaction that rewrites all of the SSTables into exactly one SSTable is called a major compaction; only major compactions produce SSTables that are guaranteed to contain no deleted data.
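A toy sketch of the two compaction kinds (the threshold value and the dict-based layout are my assumptions for illustration):

```python
# Toy sketch of minor and merging compactions.
MEMTABLE_THRESHOLD = 4

def minor_compaction(memtable, sstables):
    """Freeze the memtable and write it out as a new SSTable; return a fresh memtable."""
    sstables.append(dict(sorted(memtable.items())))   # written to GFS in real Bigtable
    return {}

def merging_compaction(memtable, sstables):
    """Merge a few SSTables (here: all of them) plus the memtable into one new SSTable.
    Rewriting *all* SSTables into exactly one is a major compaction."""
    merged = {}
    for sstable in sstables:        # oldest first, so newer values overwrite older ones
        merged.update(sstable)
    merged.update(memtable)
    return {}, [dict(sorted(merged.items()))]

memtable, sstables = {"b": 2, "a": 1, "c": 3, "d": 4}, []
if len(memtable) >= MEMTABLE_THRESHOLD:
    memtable = minor_compaction(memtable, sstables)    # memtable frozen -> new SSTable
memtable["a"] = 10                                     # new writes land in a fresh memtable
memtable, sstables = merging_compaction(memtable, sstables)
print(sstables)   # [{'a': 10, 'b': 2, 'c': 3, 'd': 4}]
```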
Refinements
To achieve high performance, availability, and reliability, the implementation includes a number of refinements. Some of them are listed below:
Locality groups
Clients can group multiple column families into a locality group. A separate SSTable is generated for each locality group in each tablet. Separating column families that are not typically accessed together into different locality groups enables more efficient reads.
Compression
Clients can control whether or not the SSTables for a locality group are compressed and, if so, which compression format is used. The user-specified compression format is applied to each SSTable block.
Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data.
Caching for read performance
To improve read performance, tablet servers use two levels of caching.
The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code.
The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.
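A toy sketch of the two cache levels; the paper does not spell out an eviction policy here, so the LRU behaviour below is purely an illustrative assumption.

```python
from collections import OrderedDict

# Toy LRU cache standing in for both cache levels (eviction policy is assumed).
class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)            # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict the least recently used entry

scan_cache = LRUCache(capacity=10_000)   # key-value pairs returned by the SSTable interface
block_cache = LRUCache(capacity=1_000)   # raw SSTable blocks read from GFS
block_cache.put(("sstable-17", 3), b"...raw block bytes...")
scan_cache.put(("com.cnn.www", "contents:"), "<html>...</html>")
print(block_cache.get(("sstable-17", 3)))
```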
Bloom filters
A read operation may have to read from all of the SSTables that make up a tablet. If these SSTables are not in memory, this can mean many disk accesses. A Bloom filter created for an SSTable allows us to ask whether that SSTable might contain any data for a specified row/column pair.
A small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations; most lookups for non-existent rows or columns do not need to touch disk at all.
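A minimal Bloom filter sketch (my own simplified implementation, not Bigtable's): it can answer "definitely not present", so a read can skip an SSTable entirely without any disk access.

```python
import hashlib

# Toy Bloom filter: "might contain" or "definitely does not contain" a row/column key.
class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))    # True: must check this SSTable
print(bf.might_contain("com.example.www/anchor:"))  # almost certainly False: skip the SSTable
```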
Commit-log implementation
If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently to GFS. To avoid this, Bigtable appends mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file.
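A toy sketch of this idea; the record format here is an assumption (the real log keys entries by table, row name, and log sequence number): one shared log per tablet server, with each record tagged by its tablet so recovery can pick out just the mutations it needs.

```python
# Toy sketch of one shared commit log per tablet server.
class SharedCommitLog:
    def __init__(self):
        self.records = []     # one physical, append-only log file per tablet server

    def append(self, tablet, row_key, mutation):
        self.records.append((tablet, row_key, mutation))   # tag each record with its tablet

    def records_for(self, tablet):
        # During recovery, a server replays only the mutations of the tablets it now owns.
        return [rec for rec in self.records if rec[0] == tablet]

log = SharedCommitLog()
log.append("tablet-A", "row1", "set contents:")
log.append("tablet-B", "row9", "delete anchor:")
log.append("tablet-A", "row2", "set contents:")
print(log.records_for("tablet-A"))   # only tablet-A's mutations
```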
Exploiting immutability
Many parts of the Bigtable system are simplified by the fact that all of the SSTables that are generated are immutable. For example, no synchronization of accesses to the file system is needed when reading from SSTables.
The only mutable data structure accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
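A toy sketch (an assumed simplification) of the copy-on-write idea: a write builds a private copy of the row and swaps it in, so a concurrent reader holding the old row object is never disturbed.

```python
# Toy sketch of copy-on-write memtable rows.
memtable = {"com.cnn.www": {"contents:": "v1"}}

def read_row(row_key):
    return memtable.get(row_key)            # a reader just grabs the current row object

def write_cell(row_key, column, value):
    old_row = memtable.get(row_key, {})
    new_row = dict(old_row)                 # copy-on-write: modify a private copy...
    new_row[column] = value
    memtable[row_key] = new_row             # ...then swap the new row in

snapshot = read_row("com.cnn.www")          # a concurrent reader holds the old row
write_cell("com.cnn.www", "contents:", "v2")
print(snapshot)                   # {'contents:': 'v1'} - unaffected by the write
print(read_row("com.cnn.www"))    # {'contents:': 'v2'}
```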
Lessons
While designing, implementing, maintaining, and supporting Bigtable, the team gained useful experience and learned several interesting lessons. A few critical ones are listed below:
Large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. For example:
Memory and network corruption
Large clock skew
Hung machines
Extended and asymmetric network partitions
Bugs in other systems that we are using (Chubby for example)
Overflow of GFS quotas
Planned and unplanned hardware maintenance.
Another lesson we learned is that it is crucial to delay adding new features until it is clear how they will be used.
The most important lesson learned is the value of simple designs.
Related Work
The Boxwood project has components that overlap in some ways with Chubby, GFS, and Bigtable.
Several database vendors have developed parallel databases that can store large volumes of data. Examples include Oracle's Real Application Cluster database and IBM's DB2 Parallel Edition, the latter of which is based on a shared-nothing architecture similar to Bigtable's.
The way Bigtable uses memtables and SSTables to store tablet updates is analogous to the way a Log-Structured Merge (LSM) Tree stores updates to index data.
C-Store and Bigtable share many characteristics.
Conclusions
Before we conclude, note that I have not covered two sections from the paper, the performance evaluation and the real applications, which are more about facts and figures than concepts. The numbers in those sections are outdated in terms of hardware and software capacity, since we are in 2023 and the paper was written in 2006.
Bigtable is a distributed system for storing structured data at Google. Bigtable clusters have been in production use since April 2005, and the team spent roughly seven person-years on design and implementation before that date.
The team is implementing several additional Bigtable features, such as support for secondary indices and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas.
I have written summary insights like this for some papers and will continue with my #100PapersChallenge. To get those in your inbox, please subscribe to my newsletter.
References
BigTable paper: https://research.google.com/archive/bigtable-osdi06.pdf
Chubby: https://research.google.com/archive/chubby-osdi06.pdf
GFS: https://research.google.com/archive/gfs-sosp2003.pdf
LSM-tree: https://www.cs.umb.edu/~poneil/lsmtree.pdf