Insights from paper: Cloud-Native Transactions and Analytics in SingleStore
1. Abstract
In the last decade, there has been a rise in specialized database systems.
Now there are specific systems available for transaction processing, data warehousing, time series analysis, full-text search, data lakes, in-memory caching, document storage, queuing, graph processing, and geo-replicated operational workloads.
The reason for this movement is a belief that a single general-purpose database cannot run varied workloads at a reasonable cost with strong performance at the level of scale and concurrency required nowadays.
SingleStore Database (S2DB), previously known as MemSQL, is a distributed general-purpose SQL database designed with the versatility to run both operational and analytical workloads with good performance.
It is one of the earliest distributed HTAP databases on the market.
It has unified table storage which runs both transactional and analytical workloads efficiently.
SingleStore is able to achieve this using a combination of row-store and column-store with other techniques like vectorization, ability to seek efficiently into a column-store using secondary indexes, and using in-memory row-store buffers for recently modified data.
2. Introduction
As of January 2022, there are 350 different databases listed on the DB-Engines website. The market is saturated, and applications often have to combine many different databases, which is cumbersome.
The following two trends are driving the movement towards general-purpose databases.
Elastic cloud infrastructure for storage allows databases to be almost limitless, highly available, and durable. Elastic compute instances allow databases to bring in more compute almost immediately.
There is a requirement from developers to store more data and access it with lower latency and with higher throughput.
S2DB enables both high-throughput low-latency writes and complex analytical queries over ever-changing data, with end-to-end latency of seconds to sub-seconds from new data arriving to analytical results.
This paper presents two key components of S2DB that are important for cloud-native transactional and analytical workloads.
Separation of storage and compute
S2DB can efficiently use the cloud storage hierarchy (local memory, local disks, and blob storage) based on data hotness.
S2DB can commit on local disk and push data asynchronously to blob storage.
This gives S2DB all the advantages of separation of storage and compute without the write latency penalty.
S2DB has the following features:
It can store more data than fits on local disks by keeping cold data in blob storage.
It stores history in blob storage. This enables point-in-time restores to points in the past without needing to take explicit backups or copy any data on a restore.
It can provision multiple read-only replicas of a database from blob storage without any impact to the read-write master copy of the database.
Unified table storage
S2DB tables support transactions that need both
The scan performance of a column-store (trillions of rows per second)
The seek performance of row-store indexes to speed up point reads and writes
S2DB’s unified table storage internally uses both row-store and column-store formats, but end users need not be aware of this.
The column-store data is organized as a log-structured merge tree (LSM) with secondary hash indexes supported to speed up OLTP workloads.
Unified tables support sort keys, secondary keys, shard keys, unique keys and row-level locking.
3. Background on SingleStoreDB
SingleStoreDB is a horizontally partitioned, shared-nothing DBMS.
An S2DB cluster comprises aggregator nodes and leaf nodes.
The aggregator nodes coordinate queries.
The leaf nodes hold copies of partitions of data and are responsible for the bulk of compute for queries.
Each partition is either a master that can serve both reads and writes or a replica that can only serve reads.
Tables are distributed across partitions by hash-partitioning of a user-configurable set of columns called the shard key.
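The shard-key routing described above can be sketched as follows. This is a minimal illustration of hash-partitioning; the hash function and key encoding here are assumptions, not S2DB's actual internal scheme:

```python
import hashlib

def partition_for_row(row: dict, shard_key: tuple, num_partitions: int) -> int:
    """Route a row to a partition by hashing its shard-key columns.

    The key encoding and choice of SHA-256 are illustrative only;
    S2DB's real hash function is internal to the product.
    """
    key_bytes = "|".join(str(row[col]) for col in shard_key).encode()
    digest = hashlib.sha256(key_bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Rows with equal shard-key values always land on the same partition,
# which enables single-partition lookups and co-located joins.
row_a = {"user_id": 42, "amount": 10}
row_b = {"user_id": 42, "amount": 99}
assert partition_for_row(row_a, ("user_id",), 8) == partition_for_row(row_b, ("user_id",), 8)
```

The useful property is that any query filtering on the full shard key can be routed to exactly one partition.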
SingleStore also supports full-query code generation targeting LLVM through intermediate bytecode.
S2DB maintains high availability by storing multiple replicas of each partition on different nodes in the cluster.
Data is replicated synchronously to the replicas by default as transactions commit on the master partitions.
Read queries never run on HA replicas. Queries only run on master partitions or specifically created read replicas.
HA replicas are hot copies of the data on the master partition.
A special aggregator node called the master aggregator coordinates failovers and auto-healing.
S2DB also supports the creation of asynchronously replicated replicas in different regions or data centers for disaster recovery.
3.1 Table storage formats
S2DB uses two storage types internally.
An in-memory row-store backed by a lock-free skiplist
A disk-based column-store.
Unified table storage combines both formats internally to support OLAP and OLTP workloads using a single storage design.
3.1.1 Rowstore storage
Each index in an S2DB in-memory row-store table uses a lock-free skip list.
A node in the skip list corresponds to a row, and each node stores a linked list of versions of the row to implement multi-version concurrency control so that readers don’t need to wait on writers.
Writes use pessimistic concurrency control, implemented using row locks stored on each skip list node to handle concurrent writes to the same row.
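A minimal sketch of the skip-list node described above: a version chain so readers never wait, plus a per-row lock that serializes writers. The class and field names are hypothetical, not S2DB's actual code:

```python
import threading

class RowVersion:
    """One committed version of a row in the MVCC version chain."""
    def __init__(self, value, commit_ts):
        self.value = value
        self.commit_ts = commit_ts
        self.older = None  # next-older version in the chain

class RowNode:
    """One skip-list node: a lock for writers, a version list for readers."""
    def __init__(self, key, value, commit_ts):
        self.key = key
        self.lock = threading.Lock()  # pessimistic per-row writer lock
        self.latest = RowVersion(value, commit_ts)

    def read(self, read_ts):
        # Readers walk the version chain lock-free to find the newest
        # version visible at their read timestamp.
        v = self.latest
        while v is not None and v.commit_ts > read_ts:
            v = v.older
        return None if v is None else v.value

    def write(self, value, commit_ts):
        # Writers serialize on the row lock and prepend a new version.
        with self.lock:
            v = RowVersion(value, commit_ts)
            v.older = self.latest
            self.latest = v

node = RowNode("k", "v1", commit_ts=10)
node.write("v2", commit_ts=20)
assert node.read(15) == "v1"  # snapshot taken before the second write
assert node.read(25) == "v2"
```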
In addition to writing to the in-memory skip lists, write operations append the affected rows to a log before committing.
A log is created for each database partition, persisted to disk, and replicated to guarantee the durability of writes.
On node restarts, the state of each database partition is recovered by replaying the writes in the persisted log.
A background process periodically creates a snapshot file containing the serialized state of the in-memory row store tables at a particular log position.
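Recovery from a snapshot plus log replay can be sketched as below. The record format `(log_pos, op, key, value)` is an assumption made for illustration:

```python
def recover(snapshot: dict, snapshot_log_pos: int, log: list) -> dict:
    """Rebuild in-memory row-store state: load the snapshot, then replay
    every log record written after the snapshot's log position.

    The record shape (log_pos, op, key, value) is illustrative only.
    """
    state = dict(snapshot)
    for log_pos, op, key, value in log:
        if log_pos <= snapshot_log_pos:
            continue  # already captured by the snapshot
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

snapshot = {"a": 1, "b": 2}  # serialized state at log position 5
log = [(4, "put", "a", 1), (6, "put", "c", 3), (7, "delete", "b", None)]
assert recover(snapshot, 5, log) == {"a": 1, "c": 3}
```

Snapshots bound recovery time: only the log suffix after the snapshot position needs to be replayed.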
3.1.2 Columnstore storage
The data in a column-store table is organized into segments.
Each segment stores a disjoint subset of rows as a set of data files on disk.
Each column is stored in the same row order but compressed separately within a segment.
The same column can use a different encoding in each segment optimized for the data specific to that segment.
The segment metadata is stored in a durable in-memory row-store table.
The column encodings are each implemented to be seekable to allow efficient reads at a specific row offset without decoding all the rows.
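One way a seekable encoding can work is run-length encoding with cumulative run boundaries, so a read at a given row offset is a binary search rather than a full decode. This is a sketch of the general idea only; S2DB's actual encodings are not specified here:

```python
import bisect

class RunLengthColumn:
    """Run-length-encoded column supporting seeks to a row offset
    without decoding earlier rows (illustrative encoding)."""
    def __init__(self, values):
        self.run_values = []
        self.run_ends = []  # exclusive end row offset of each run
        for v in values:
            if self.run_values and self.run_values[-1] == v:
                self.run_ends[-1] += 1
            else:
                prev_end = self.run_ends[-1] if self.run_ends else 0
                self.run_values.append(v)
                self.run_ends.append(prev_end + 1)

    def seek(self, row_offset: int):
        # Binary-search the run containing row_offset: O(log runs),
        # no sequential decode of the preceding rows.
        run = bisect.bisect_right(self.run_ends, row_offset)
        return self.run_values[run]

col = RunLengthColumn(["a", "a", "b", "b", "b", "c"])
assert col.seek(0) == "a"
assert col.seek(4) == "b"
assert col.seek(5) == "c"
```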
A sort key can be specified on each column-store table for more efficient segment elimination.
For each column-store table, S2DB creates a row-store table as a write-optimized store that absorbs small writes, avoiding the creation of small sorted runs across many files.
A background flusher process periodically deletes rows from the row store and converts those rows into a column store segment in a transaction.
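The flush step can be sketched as follows: sort the buffered rows by the sort key, pivot them into per-column arrays, and record min/max metadata for later segment elimination. Function and field names here are hypothetical:

```python
def flush_rowstore(buffer: list, sort_key: str) -> dict:
    """Convert an in-memory write buffer into one column-store segment.

    Rows are sorted by the table's sort key and split into per-column
    arrays; min/max metadata enables later segment elimination.
    """
    rows = sorted(buffer, key=lambda r: r[sort_key])
    columns = {col: [r[col] for r in rows] for col in rows[0]}
    meta = {
        "row_count": len(rows),
        "min": {c: min(v) for c, v in columns.items()},
        "max": {c: max(v) for c, v in columns.items()},
    }
    return {"columns": columns, "metadata": meta}

buffer = [{"ts": 3, "val": "x"}, {"ts": 1, "val": "y"}]
segment = flush_rowstore(buffer, "ts")
assert segment["columns"]["ts"] == [1, 3]
assert segment["metadata"]["min"]["ts"] == 1
```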
Columnstore tables support vectorized execution.
4. Separation of storage and compute
S2DB can run with and without access to a blob store for separated storage.
When running without access to blob storage, S2DB behaves like a typical shared-nothing distributed database.
When running with a blob store, S2DB keeps long-lived data in blob storage and uses local storage only for transient data, such as the tail of the transaction log and hot data cached for performance.
This design allows S2DB to commit transactions without the latency penalty of writing all the transaction data to blob storage before acknowledging the commit.
Durability is managed by the cluster on each partition using replication.
Replicating out-of-order allows small transactions to commit without waiting for big transactions, guaranteeing that commits have low and predictable latency.
By default, data is committed when replicated in-memory to at least one replica partition for every master partition involved in a transaction.
While SingleStore also supports synchronously committing to local disk, this tradeoff often doesn’t make sense in cloud environments. S2DB doesn’t synchronously commit transactions to the local disk by default.
The data for a partition of a column-store table is stored in an LSM tree, where the top level is an in-memory row store, and the lower levels are HTAP-optimized column-store data files.
The diagram above presents the database state after a few example write operations.
4.1 Staging Data from Local to Remote Storage
S2DB has durable, low-latency data storage in which all data for a given partition is recorded in a single log.
The log is the only file that is ever updated via appends.
The data files containing column-store data are immutable once written.
S2DB’s separation of storage and compute design is shown in the diagram below.
Transactions are committed to the tail of the log and replicated to other nodes, just as when S2DB runs without blob storage available.
Newly committed column-store data files are uploaded asynchronously to blob storage as quickly as possible after being committed.
Transaction logs are uploaded to blob storage in chunks below a position in the log known to contain only fully durable and replicated data.
Snapshots of row-store data are taken only on master partitions and written directly to the blob store, reducing local disk IO compared to S2DB running without blob storage.
To add more compute to the cluster, new replica databases get the snapshots and logs they need from blob storage and replicate the tail of the log from the master databases.
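The chunked log upload described above can be sketched as follows: only whole chunks that lie entirely below the known-durable position are eligible for upload. The function and its parameters are illustrative assumptions:

```python
def chunks_to_upload(log_length: int, durable_pos: int,
                     uploaded_pos: int, chunk_size: int):
    """Yield (start, end) byte ranges of the log that are safe to push
    to blob storage: whole chunks entirely below the position known to
    contain only fully durable, replicated data."""
    start = uploaded_pos
    while start + chunk_size <= durable_pos:
        yield (start, start + chunk_size)
        start += chunk_size
    # Bytes in [start, log_length) stay on local disk until the
    # durable position advances past the next chunk boundary.

# 100 bytes of log, 64 durable, nothing uploaded yet, 32-byte chunks:
assert list(chunks_to_upload(100, 64, 0, 32)) == [(0, 32), (32, 64)]
```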
4.2 Capabilities Enabled by Separated Storage
S2DB’s separated storage design gives it many of the benefits expected of systems using shared remote storage.
Some example capabilities enabled by remote storage are:
S2DB uses faster ephemeral SSDs for local storage instead of more expensive and slower network block storage (EBS).
S2DB can keep months of history since storing data at rest in blob storage is cheap. This history is used by a point-in-time restore (PITR) command to restore a database to the state it was in at a given time in the past.
S2DB supports the creation of read-only workspaces.
5. Unified table storage
The team designed unified table storage to provide a single table type that works well for both OLTP and OLAP access.
S2DB column-store storage is LSM-tree based, which writes data in large consecutive chunks and works well with tiered storage. The column-oriented data format is well known to be ideal for OLAP workloads.
The main problem is to make this store work efficiently for OLTP use cases without sacrificing its OLAP performance.
No merge-based reconciliation during reads:
Common LSM tree implementations use tombstone entries to represent deletes in the LSM tree. Readers must reconcile results from all LSM levels to read the latest data in this representation.
S2DB column-store storage avoids this reconciliation process for reads.
S2DB represents deletes using a bit vector stored as part of the segment metadata, which is cheaper to apply on the data files for the segment to filter out deleted rows.
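A scan under this scheme only has to apply the segment's delete bit vector as a filter; no tombstones from other LSM levels need to be reconciled. A minimal sketch, with an assumed list-of-bits representation:

```python
def scan_segment(column, deletes):
    """Scan one segment column, filtering deleted rows with the
    segment's delete bit vector (1 means the row was deleted).
    No cross-level tombstone reconciliation is needed."""
    return [v for i, v in enumerate(column) if not deletes[i]]

column = [10, 20, 30, 40]
deletes = [0, 1, 0, 1]  # rows 1 and 3 were deleted
assert scan_segment(column, deletes) == [10, 30]
```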
Minimize disk access and blocking during writes:
S2DB column-store storage performs streaming inserts on an in-memory write-optimized store to achieve low latency writes.
To avoid tombstone records, S2DB's writes modify only the segment metadata and the in-memory store.
5.1 Secondary indexes
Secondary indexes are important for efficient point access in transactional workloads.
The following are two important approaches:
Per-segment filtering structure
Building bloom filters or inverted indexes for each on-disk segment allows skipping the individual segments when there isn’t a match.
External index structure
With an index structure outside of the LSM tree, the secondary index columns are mapped to the values of the primary index columns.
S2DB secondary indexes use a two-level structure integrated with the LSM tree storage. It is shown in the diagram below:
For each segment, an inverted index is built to map values of the indexed column to a postings list, which stores row offsets in the segment with that value.
Across segments in the table, a global index maps the values of the indexed column to the IDs of the segments with that value and the starting location of the corresponding postings list in the inverted index for each segment.
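A lookup through this two-level structure can be sketched as below. The dictionary-based representations of the global index and per-segment inverted indexes are simplifying assumptions:

```python
def lookup(value, global_index, inverted_indexes):
    """Two-level secondary-index lookup: the global index maps
    value -> segment IDs; each segment's inverted index maps
    value -> row offsets (a postings list)."""
    hits = []
    for seg_id in global_index.get(value, []):
        for row_offset in inverted_indexes[seg_id].get(value, []):
            hits.append((seg_id, row_offset))
    return hits

global_index = {"alice": [1, 3]}  # value -> segments containing it
inverted_indexes = {
    1: {"alice": [0, 7]},         # per-segment postings lists
    3: {"alice": [2]},
}
assert lookup("alice", global_index, inverted_indexes) == [(1, 0), (1, 7), (3, 2)]
```

The global index prunes the search to the few segments that contain the value; the per-segment inverted index then pinpoints exact row offsets, so a point read touches very little data.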
5.1.1 Multi-column secondary index
To support multi-column secondary indexes while minimizing storage costs, S2DB builds a secondary index for each indexed column and allows the single-column indexes to be shared across multiple indexes referring to the same columns.
5.1.2 Uniqueness enforcement
Most column-store implementations don’t support the enforcement of uniqueness constraints.
The S2DB column store supports uniqueness constraint enforcement without forcing the sort key to be the unique key column or duplicating the data.
5.2 Row-level locking
S2DB column-store storage represents deleted rows in a segment as a bit vector in the segment metadata.
Instead of the naive implementation of having update and delete queries update the bit vector directly, S2DB implements a row-level locking mechanism to avoid blocking during transactional workloads.
Rows to be updated or deleted are first moved to the in-memory row-store part of the table in an autonomous transaction.
The primary key of the in-memory row-store acts as the lock manager.
Inserting a copy of the row locks the row, preventing concurrent modifications.
To ensure that the locked rows aren’t modified before inserting their copies, an extra scanning pass on newly created segments is performed after locking to find the latest versions of the locked rows.
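The "insert-a-copy-to-lock" idea can be sketched as follows; the class is a hypothetical illustration of the mechanism, not S2DB's implementation:

```python
class UnifiedTable:
    """Row-level locking sketch: updating a column-store row first
    moves a copy into the in-memory row-store, whose primary key
    doubles as the lock manager."""
    def __init__(self):
        self.rowstore = {}  # primary key -> hot, lockable row copy

    def lock_row(self, pk, row):
        if pk in self.rowstore:
            raise RuntimeError(f"row {pk} is locked by another transaction")
        # Inserting the copy *is* acquiring the lock: a concurrent
        # writer's insert of the same primary key fails.
        self.rowstore[pk] = dict(row)
        return self.rowstore[pk]

t = UnifiedTable()
t.lock_row(1, {"id": 1, "v": "a"})       # first writer wins
try:
    t.lock_row(1, {"id": 1, "v": "b"})   # concurrent writer is rejected
    blocked = False
except RuntimeError:
    blocked = True
assert blocked
```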
6. Adaptive query execution
Unified table storage supports multiple data access methods for transaction and analytical processing.
Data access on S2DB unified table storage has 3 high-level steps:
Find the list of segments to read
Run filters to find the rows to read from each segment
Selectively decode and output the rows.
In this section, we will discuss the first two steps.
6.1 Segment skipping
Segments can be skipped using the global secondary index structures or the min/max values stored in the segment metadata.
First, a secondary index check is done. S2DB dynamically disables a secondary index if the number of keys to look up is too high relative to the table size.
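Segment elimination via min/max metadata can be sketched as below: a segment is skipped when its value range cannot intersect the filter range. The segment-metadata layout is an assumed simplification:

```python
def segments_to_scan(segments, column, low, high):
    """Skip segments whose [min, max] metadata for `column` cannot
    intersect the filter range low <= column <= high."""
    return [
        s["id"] for s in segments
        if not (s["max"][column] < low or s["min"][column] > high)
    ]

segments = [
    {"id": 0, "min": {"ts": 0},  "max": {"ts": 9}},
    {"id": 1, "min": {"ts": 10}, "max": {"ts": 19}},
    {"id": 2, "min": {"ts": 20}, "max": {"ts": 29}},
]
assert segments_to_scan(segments, "ts", 12, 25) == [1, 2]
```

Elimination is most effective when the filtered column is the sort key, since sorting keeps each segment's value range narrow and disjoint.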
6.2 Filtering
There are up to four different ways to evaluate each clause in the filter condition.
Regular filter: It selectively decodes the column for rows that passed previous filters and then executes the filter on the decoded values.
Encoded filter: It executes directly on the compressed values. Compared with a regular filter, this strategy is ideal when the column has a small set of distinct values.
Group filter: It decodes all filtered columns and runs the entire filter condition instead of running the filter clauses separately.
Secondary index filter: It reads the postings list for the value stored in the index to find the filtered row offsets. Using a secondary index is usually better than using a regular filter.
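The encoded-filter strategy can be illustrated with dictionary encoding: the predicate is evaluated once per distinct value, and each row's small compressed code is then tested against the result. This is a sketch of the general technique, not S2DB's code:

```python
def encoded_filter(codes, dictionary, predicate):
    """Encoded filter on a dictionary-compressed column: evaluate the
    predicate once per dictionary entry, then test each row's code
    against the passing set -- cheap when distinct values are few."""
    passing = {code for code, value in enumerate(dictionary) if predicate(value)}
    return [i for i, code in enumerate(codes) if code in passing]

dictionary = ["DE", "FR", "US"]  # distinct column values
codes = [2, 0, 1, 2, 2]          # one small code per row
rows = encoded_filter(codes, dictionary, lambda v: v == "US")
assert rows == [0, 3, 4]
```

The predicate runs 3 times here instead of 5, and the per-row work is a set-membership test on an integer rather than a string comparison on a decoded value.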
7. Related Work
Oracle and SQL Server have both augmented their popular row store engines by adding column store storage to them.
Oracle’s column-store is in-memory only.
SQL Server’s column store is a traditional on-disk column store and can be created alongside secondary B-tree row-store indexes.
TiDB was initially built as a distributed, highly available row store and later added the capability to transparently replicate data into column store format to improve the performance of analytical queries.
Hyper supports a high-performance in-memory hybrid row-store and column-store engine.
SAP HANA supports in-memory row store and in-memory column store tables.
Most cloud data warehouses today use a blob store for persistent storage and keep only frequently queried data cached. Snowflake, Redshift and Databricks all follow this pattern.
Wildfire database adds HTAP capabilities to Apache Spark.
Google Procella database powers the real-time analytical features of YouTube.
8. Conclusion
S2DB is designed to handle transactional and analytical workloads with strong performance.
It uses blob storage to get cost, durability, and elasticity benefits. Cold data lives only in blob storage, while hot data is also kept in local storage.
S2DB’s unified table storage uses a combination of an in-memory row store and an on-disk column store that supports secondary and unique keys via inverted indexes.
S2DB has the fast scan performance of a traditional column store and provides efficient point queries via indexing.
References
A column store engine for real-time streaming analytics
The Story Behind SingleStore’s Skiplist Indexes
Oracle Database In-Memory: A dual format in-memory database
Parallel Replication across Formats in SAP HANA for Scaling Out Mixed OLTP/OLAP Workloads
Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation
WiSer: A Highly Available HTAP DBMS for IoT Applications
How Careful Engineering Led to Processing Over a Trillion Rows Per Second