Insights from paper - Cassandra - A Decentralized Structured Storage System
Introduction
Facebook engineers Avinash Lakshman and Prashant Malik developed Cassandra to power Facebook’s inbox search feature.
This paper was published at LADIS 2009. The social networking platform Facebook has strict operational requirements regarding performance, reliability, efficiency, and high scalability.
Cassandra uses a synthesis of well-known techniques to achieve scalability and availability. Cassandra was designed to fulfill the storage needs of the Inbox Search feature. The feature enables users to search through their Facebook inbox.
The system was required to handle two critical things:
A very high write throughput, billions of writes per day
Scale with the number of users
Replicating data across data centers was crucial for keeping search latencies down.
Related Work
Distributing data for performance, availability, and durability has been widely studied in the file system and database communities. Let’s talk about the existing systems that influenced Cassandra.
Ficus and Coda are file systems that replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures.
Farsite is a distributed file system that does not use any centralized server. It achieves high availability and scalability using replication.
The Google File System (GFS) is a distributed file system in which a single master server hosts the entire metadata and the data is split into chunks stored on chunk servers. However, the GFS master is now made fault-tolerant using Chubby.
Bayou is a distributed relational database system that allows disconnected operations and provides eventual data consistency.
Bayou, Coda, and Ficus allow disconnected operations and are resilient to issues such as network partitions and outages. They guarantee eventual consistency.
Dynamo allows read and write operations to continue even during network partitions, and resolves update conflicts using different conflict resolution mechanisms, some of them client-driven.
Bigtable provides both structure and data distribution but relies on a distributed file system for its durability.
Data Model
Let’s first learn the basics of the data model in Cassandra.
A table in Cassandra is a distributed multi-dimensional map indexed by a key.
The row key in a table is a string with no size restriction.
The value is an object which is highly structured.
Columns are grouped into sets called column families.
There are two kinds of column families, Simple and Super column families.
Super column families can be visualized as a column family within a column family.
Every operation under a single row key is atomic per replica. Applications can specify the sort order of columns within a Super Column or Simple Column family.
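To make the two kinds of column families concrete, here is a minimal sketch (in Python, not Cassandra's actual code) that models a table as a nested map: row key -> column family -> column -> (value, timestamp), with a super column family adding one extra level of nesting. All table, key, and column names here are illustrative.

    # Illustrative sketch of the data model as nested maps:
    # table -> row key -> column family -> column -> (value, timestamp).
    # A super column family simply adds one more level of nesting.
    import time

    inbox_search = {}  # the "table"

    def insert_column(table, row_key, column_family, column, value):
        """Store (value, timestamp) under table[row_key][column_family][column]."""
        row = table.setdefault(row_key, {})
        cf = row.setdefault(column_family, {})
        cf[column] = (value, time.time())  # the timestamp lets readers pick the latest value

    # Simple column family: one level of columns under the row key.
    insert_column(inbox_search, "user42", "terms", "hello", "msg_id_17")

    # Super column family: a column family within a column family.
    row = inbox_search.setdefault("user42", {})
    super_cf = row.setdefault("interactions", {})
    super_cf.setdefault("user99", {})["msg_id_17"] = ("2009-03-01", time.time())

    print(inbox_search["user42"]["terms"]["hello"])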
API
The Cassandra API consists of the following three simple methods.
insert(table; key; rowMutation)
get(table; key; columnName)
delete(table; key; columnName)
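As a rough illustration of how a client might drive these three calls, here is a hedged sketch; the CassandraClient class, the in-memory store standing in for the cluster, and the "column_family:column" naming scheme are all hypothetical, not part of the paper.

    # Hypothetical client-side wrapper around the paper's three-call API.
    class CassandraClient:
        def __init__(self):
            self.store = {}  # stand-in for the remote cluster in this sketch

        def insert(self, table, key, row_mutation):
            """Apply a row mutation shaped as {column_family: {column: value}}."""
            row = self.store.setdefault((table, key), {})
            for cf, columns in row_mutation.items():
                row.setdefault(cf, {}).update(columns)

        def get(self, table, key, column_name):
            """column_name is 'column_family:column' in this sketch."""
            cf, column = column_name.split(":")
            return self.store.get((table, key), {}).get(cf, {}).get(column)

        def delete(self, table, key, column_name):
            cf, column = column_name.split(":")
            self.store.get((table, key), {}).get(cf, {}).pop(column, None)

    client = CassandraClient()
    client.insert("InboxSearch", "user42", {"terms": {"hello": "msg_id_17"}})
    print(client.get("InboxSearch", "user42", "terms:hello"))  # -> msg_id_17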
System Architecture
The paper focuses only on the core distributed systems techniques used in Cassandra: partitioning, replication, membership, failure handling, and scaling. Let’s see all these one by one.
Partitioning
Cassandra can scale incrementally. This comes from the ability to partition the data over the set of nodes dynamically.
There are different ways to partition the data. One of them is a hash-based partition. In one line, the concept is that the data is hashed, and the hashed value is used to determine the partition.
Cassandra partitions data across the cluster using consistent hashing, a refinement of hash-based partitioning, and uses an order-preserving hash function to do so.
The output range of a hash function is treated as a fixed circular space or ring (i.e. the largest hash value wraps around to the smallest hash value).
Each node in the system is assigned a random value within this space representing its position on the ring.
A key (representing data) is assigned to a node by
First, hashing the key to yield its position on the ring
And then walking the ring clockwise to find the first node with a position larger than the key's position.
This node is deemed the coordinator for the key.
In this way, each node becomes responsible for the region of the ring between itself and its predecessor node on the ring. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors; the sketch below illustrates the ring lookup.
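A minimal sketch of that lookup, using a sorted list of node positions and a binary search; the MD5-based hash and 32-bit ring below are illustrative stand-ins (the paper notes Cassandra uses an order-preserving hash function instead).

    # Minimal consistent-hashing sketch: nodes sit at positions on a ring,
    # and a key is owned by the first node clockwise from the key's hash.
    import bisect
    import hashlib

    RING_SIZE = 2 ** 32  # illustrative hash space; wrapping around makes it a ring

    def ring_hash(value):
        # Illustrative hash; not the order-preserving function Cassandra uses.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

    class Ring:
        def __init__(self, nodes):
            # Each node is assigned a position (token) on the ring.
            self.tokens = sorted((ring_hash(n), n) for n in nodes)

        def coordinator(self, key):
            """Return the first node at or past the key's position, wrapping around."""
            position = ring_hash(key)
            index = bisect.bisect_left(self.tokens, (position, ""))
            if index == len(self.tokens):  # walked past the largest token: wrap to the smallest
                index = 0
            return self.tokens[index][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.coordinator("user42"))  # the coordinator node for this key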
There are two challenges in this basic version of consistent hashing.
The random position assignment of each node on the ring leads to non-uniform data and load distribution.
The algorithm is not making use of heterogeneity in the performance of nodes.
How does Cassandra overcome these challenges? It analyzes load information on the ring and has lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
Replication
Cassandra uses replication to achieve high availability and durability.
Each data item is replicated at N (configurable) hosts.
Each key is assigned to a coordinator node.
The coordinator is responsible for replicating the data items in its key range.
Cassandra provides the client with various options for how data needs to be replicated. It provides multiple replication policies such as "Rack Unaware", "Rack Aware", and "Datacenter Aware".
Cassandra uses Zookeeper to elect a leader amongst its nodes. All nodes, on joining the cluster, contact the leader, which tells them the ranges for which they are replicas.
The leader maintains the invariant that no node is responsible for more than N-1 ranges in the ring.
The metadata mapping key ranges to nodes is cached locally at each node and stored in a fault-tolerant manner inside Zookeeper. In this way, Cassandra retains Dynamo's concept of a preference list for a key range.
One important point to note - the preference list of a key is constructed such that the storage nodes are spread across multiple data centers.
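Under the simplest "Rack Unaware" policy, the replicas for a key are the coordinator plus the next N-1 distinct successors on the ring. A minimal sketch of that selection, with illustrative token values:

    # "Rack Unaware" replica selection sketch: the coordinator for a key plus
    # the next N-1 distinct successors clockwise on the ring form its preference list.
    import bisect

    tokens = sorted([(10, "node-a"), (40, "node-b"), (70, "node-c"), (95, "node-d")])

    def preference_list(tokens, key_position, n):
        index = bisect.bisect_left(tokens, (key_position, ""))
        replicas = []
        while len(replicas) < min(n, len(tokens)):
            node = tokens[index % len(tokens)][1]  # wrap around the ring
            if node not in replicas:
                replicas.append(node)
            index += 1
        return replicas

    # A key at position 55 is coordinated by node-c and replicated on node-d and node-a.
    print(preference_list(tokens, 55, 3))

A policy such as "Datacenter Aware" would instead pick successors so that the replicas span multiple data centers, which is what keeps the preference list spread as described above.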
Membership
Cassandra uses Scuttlebutt to manage a node’s membership to the cluster. It is a very efficient anti-entropy Gossip-based mechanism.
Failure Detection
Cassandra uses a modified version of the Accrual Failure Detector. This failure detection module emits a value representing a suspicion level for each monitored node in place of a boolean value of true/false.
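A rough sketch of the accrual idea: each node records the inter-arrival times of heartbeats from a peer and computes a suspicion value PHI that grows the longer a heartbeat is overdue. Assuming the inter-arrival times follow an exponential distribution (an approximation the paper says Cassandra's implementation uses), PHI reduces to the formula in the sketch below; the window size and threshold handling are illustrative.

    # Accrual failure detector sketch: instead of a boolean up/down, emit a
    # suspicion level PHI that grows the longer a heartbeat is overdue.
    import math
    from collections import deque

    class AccrualDetector:
        def __init__(self, window=1000):
            self.intervals = deque(maxlen=window)  # recent heartbeat inter-arrival times
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now):
            """PHI = -log10(P(heartbeat later than now)), assuming exponential inter-arrivals."""
            if not self.intervals:
                return 0.0
            mean_interval = sum(self.intervals) / len(self.intervals)
            elapsed = now - self.last_heartbeat
            # For an exponential distribution, P(X > t) = exp(-t / mean),
            # so PHI = t / (mean * ln(10)).
            return elapsed / (mean_interval * math.log(10))

    detector = AccrualDetector()
    for t in (0, 1, 2, 3, 4):  # heartbeats arriving once per second
        detector.heartbeat(t)
    print(round(detector.phi(4.5), 2), round(detector.phi(15), 2))  # suspicion rises with silence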
Bootstrapping
When a node starts for the first time, it chooses a random token for its position in the consistent hashing ring. This information is persisted to disk locally, in Zookeeper, and also gossiped around the cluster. This enables any node to route a request for a key to the correct node in the cluster.
In the bootstrap case, a node joining the cluster reads its configuration file, which lists a few contact points (seeds), and builds the key-range mapping from them.
Cassandra has an explicit mechanism to initiate the addition and removal of nodes from the Cassandra instance. The paper mentions:
An administrator uses a command line tool or a browser to connect to a Cassandra node and issue a membership change to join or leave the cluster.
Scaling the Cluster
When a new node is added to the system, it gets assigned a token to alleviate a heavily loaded node.
The node giving up the data streams the data over to the new node using kernel copy techniques.
Local Persistence
The Cassandra system relies on the local file system for data persistence.
A typical write operation has two steps:
Write into a commit log for durability and recoverability
Write into an in-memory data structure only if the previous write is successful
Cassandra has a dedicated disk on each machine for the commit log.
When the in-memory data structure crosses a certain threshold, calculated based on data size and the number of objects, it dumps itself to one of many commodity disks.
There are a couple of significant things here:
All writes are sequential to disk.
These writes generate an index for efficient lookup based on the row key.
These indexes are stored with the data file.
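A hedged sketch of this write path: append to a commit log first, then update the in-memory structure, and dump the in-memory structure to a sorted data file once a threshold is crossed. The JSON file format, the count-based threshold, and the file names are illustrative, not Cassandra's actual on-disk layout.

    # Illustrative write path: commit log first (durability), then memtable,
    # then a sorted dump to disk when a threshold is crossed.
    import json, os, tempfile

    class Store:
        def __init__(self, directory, flush_threshold=4):
            self.directory = directory
            self.flush_threshold = flush_threshold  # illustrative; real thresholds track size and object count
            self.commit_log = open(os.path.join(directory, "commit.log"), "a")
            self.memtable = {}  # row key -> value
            self.generation = 0

        def write(self, key, value):
            # Step 1: commit log for durability and recoverability.
            self.commit_log.write(json.dumps({"key": key, "value": value}) + "\n")
            self.commit_log.flush()
            # Step 2: in-memory structure, only after the log write succeeded.
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Sequential dump to disk, sorted by row key (a real flush also writes
            # a row-key index alongside the data file).
            self.generation += 1
            path = os.path.join(self.directory, f"data-{self.generation}.json")
            with open(path, "w") as f:
                json.dump(dict(sorted(self.memtable.items())), f)
            self.memtable.clear()

    store = Store(tempfile.mkdtemp())
    for i in range(5):
        store.write(f"user{i}", f"value{i}")  # the fourth write triggers a flush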
A typical read works as below:
The query is run against the in-memory data structure.
If data is not found, files on disk are looked at in the order of newest to oldest.
It is possible that the key has to be looked up in many data files on disk. To expedite this lookup, Cassandra uses a Bloom filter. So what is a Bloom filter? In simple words, it is a data structure used to test whether an element is a member of a set. It can return false positives but never false negatives, and it is very space-efficient.
Cassandra stores the Bloom filter in each data file and also keeps it in memory. A key in a column family could have many columns. To prevent scanning every column on disk, Cassandra maintains column indices.
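To make the read path concrete, here is a sketch that checks the in-memory structure first and then data files from newest to oldest, consulting a small Bloom filter per file to skip files that definitely do not contain the key. The filter size, hash choice, and file representation are illustrative.

    # Read-path sketch: memtable first, then data files newest to oldest,
    # with a Bloom filter per file to skip files that cannot contain the key.
    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, hashes=3):
            self.size, self.hashes, self.bits = size, hashes, 0

        def _positions(self, key):
            for i in range(self.hashes):
                digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            # May return a false positive, but never a false negative.
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    def read(key, memtable, data_files):
        """data_files: list of (bloom_filter, {key: value}) pairs, newest to oldest."""
        if key in memtable:
            return memtable[key]
        for bloom, contents in data_files:
            if bloom.might_contain(key):  # skip files that definitely lack the key
                if key in contents:
                    return contents[key]
        return None

    bloom = BloomFilter()
    bloom.add("user42")
    print(read("user42", {}, [(bloom, {"user42": "msg_id_17"})]))  # -> msg_id_17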
Implementation Details
Cassandra has the following primary modules:
Partitioning module
Cluster membership and failure detection module
Storage engine module
Each module relies on an event-driven architecture in which the message processing pipeline and the task pipeline are split into multiple stages, along the lines of the SEDA (staged event-driven architecture) design.
Each module is written in Java.
The cluster membership and failure detection module is built on a network layer that uses non-blocking I/O.
All system control messages rely on UDP, while application-related messages for replication and request routing rely on TCP.
The request routing modules are implemented using a state machine. When a read/write request arrives, the state machine moves through the following states:
(i) Identify the node(s) that own the data for the key
(ii) Route the requests to the nodes and wait on the responses to arrive
(iii) If the replies do not arrive within a configured timeout value, fail the request and return to the client
(iv) Figure out the latest response based on the timestamp
(v) Schedule a repair of the data at any replica that does not have the latest data.
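Those five states can be sketched as a small coordinator routine; the replica lookup, transport, and repair hooks below are stubs standing in for the real modules, and the timeout handling is simplified.

    # Sketch of the request-routing state machine for a read: find replicas,
    # fan out, wait (with a timeout), pick the newest response, repair stale replicas.
    import time

    def route_read(key, find_replicas, send_request, schedule_repair, timeout=0.05):
        # (i) Identify the node(s) that own the data for the key.
        replicas = find_replicas(key)
        # (ii) Route the request to those nodes and wait for responses.
        deadline = time.time() + timeout
        responses = []
        for node in replicas:
            response = send_request(node, key, deadline)  # returns (value, timestamp) or None
            if response is not None:
                responses.append((node, response))
        # (iii) Fail the request if nothing arrived within the timeout.
        if not responses:
            raise TimeoutError(f"no replica answered for {key!r}")
        # (iv) Figure out the latest response based on the timestamp.
        latest_node, (latest_value, latest_ts) = max(responses, key=lambda r: r[1][1])
        # (v) Schedule a repair at any replica that does not have the latest data.
        for node, (_, ts) in responses:
            if ts < latest_ts:
                schedule_repair(node, key, latest_value, latest_ts)
        return latest_value

    # Stub wiring for a quick run.
    data = {"node-a": ("old", 1), "node-b": ("new", 2)}
    value = route_read(
        "user42",
        find_replicas=lambda key: ["node-a", "node-b"],
        send_request=lambda node, key, deadline: data[node],
        schedule_repair=lambda node, key, value, ts: print(f"repair {node}"),
    )
    print(value)  # -> new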
Cassandra can be configured to perform either synchronous or asynchronous writes. In Cassandra, a new commit log is rolled out after the older one exceeds a particular configurable size.
Both in-memory data structure and data file are generated per column family. Whenever the in-memory data structure for a particular column family is dumped to disk, Cassandra sets its bit in the commit log, stating that this column family has been successfully persisted to disk.
Cassandra's system indexes all data based on the primary key. The data file on the disk is broken down into a sequence of blocks. Each block contains a maximum of 128 keys and is demarcated by a block index.
Cassandra performs a compaction process, like the Bigtable system, which merges multiple files into one. It is essentially a merge sort on a bunch of sorted data files.
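Since each data file is already sorted by key, compaction can be sketched as a k-way merge that keeps only the newest version of each key; the (key, timestamp, value) record format below is illustrative.

    # Compaction sketch: merge several sorted data files into one, keeping only
    # the newest (highest-timestamp) version of each key.
    import heapq

    def compact(sorted_files):
        """sorted_files: iterables of (key, timestamp, value), each sorted by key."""
        merged = []
        for key, timestamp, value in heapq.merge(*sorted_files):
            if merged and merged[-1][0] == key:
                if timestamp > merged[-1][1]:  # a newer version of the key wins
                    merged[-1] = (key, timestamp, value)
            else:
                merged.append((key, timestamp, value))
        return merged

    old_file = [("alice", 1, "a@v1"), ("carol", 1, "c@v1")]
    new_file = [("alice", 2, "a@v2"), ("bob", 2, "b@v1")]
    print(compact([old_file, new_file]))
    # -> [('alice', 2, 'a@v2'), ('bob', 2, 'b@v1'), ('carol', 1, 'c@v1')]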
Practical Experiences
The paper mentions the team’s learning in this section. A few interesting ones are listed below:
The team had to index 7 TB of data, stored in MySQL, and load it into Cassandra; Map/Reduce jobs did this. The team exposed some background channels to the Map/Reduce process so that the serialization/deserialization overhead was avoided.
The team experienced that the time to detect failures increased beyond an acceptable limit as the size of the cluster grew. In one experiment with a cluster of 100 nodes, detecting a failed node initially took on the order of 2 minutes, which was unacceptable; with the accrual failure detector and a PHI value of 5, the average time to detect failures came down to about 12 seconds.
Monitoring should not be taken for granted. Cassandra used Ganglia as its distributed performance monitoring tool.
The team learned that some amount of coordination is essential to make the implementation of some distributed features tractable. Cassandra is integrated with Zookeeper for this.
Conclusion
Cassandra is built, implemented, and operated as a storage system providing scalability, high performance, and wide applicability. Cassandra can support a very high update throughput while delivering low latency. Adding compression, the ability to support atomicity across keys, and secondary index support are the features to be built next.
References
Original paper: https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
Annotated paper: https://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html
Ficus: https://css.csail.mit.edu/6.824/2014/papers/ficus.pdf
Coda: https://www.cs.cmu.edu/~satya/docdir/satya-wwos1-1987.pdf
Farsite: http://www.usenix.org/events/osdi02/tech/full_papers/adya/adya.pdf
GFS: https://research.google.com/archive/gfs-sosp2003.pdf
Bayou: https://dl.acm.org/doi/pdf/10.1145/224056.224070
BigTable: https://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf