Insights from paper: The Snowflake Elastic Data Warehouse
1. Abstract
Traditional data warehousing systems are designed for fixed resources and cannot leverage the cloud's elasticity. They also rely on complex ETL pipelines and physical tuning.
The Snowflake team’s mission was to build an enterprise-ready data warehousing solution for the cloud.
The Snowflake Elastic Data Warehouse, also known as Snowflake, is a multi-tenant, transactional, secure, highly scalable, and elastic system.
It has full SQL support and built-in extensions for semi-structured and schema-less data.
Snowflake is available as a pay-as-you-go service in the Amazon cloud.
Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces.
The system runs several million queries per day over multiple petabytes of data.
The paper describes the design of Snowflake and its novel multi-cluster, shared-data architecture.
2. Introduction
The cloud has moved execution away from local servers and onto shared data centers. This shift is excellent, but we can only take full advantage of it if the software itself scales elastically in the cloud.
Traditional data warehousing software predates the cloud: it was designed to run on small, static clusters of well-behaved machines.
Besides the software, the data has also changed. Previously, most of the data in a data warehouse came from within the organization; now it increasingly comes from less controllable, external sources.
The data frequently arrives in schema-less and semi-structured formats.
The Snowflake team decided to build an entirely new data warehousing system specifically for the cloud.
Following are the key features of Snowflake. We will discuss them in detail later.
Pure Software-as-a-Service (SaaS) Experience
Relational
Semi-Structured
Elastic Storage and Compute Resources
Highly Available
Durable
Cost-efficient
Secure
3. Storage vs. Compute
Shared-nothing architectures have dominated high-performance data warehousing: every query-processing node has its own local disks, tables are horizontally partitioned across nodes, and the system scales well on commodity hardware.
This design scales well for star-schema queries because very little bandwidth is required to join a small (broadcast) dimension table with a large (partitioned) fact table.
The architecture has a significant drawback. It tightly couples compute resources and storage resources.
Due to this coupling, the following problematic scenarios occur.
Heterogeneous Workload: The workload is typically heterogeneous while the hardware is homogeneous; for example, bulk loading is I/O-heavy while complex queries are CPU-heavy, so no single node configuration fits both well. This is an impedance mismatch for performance.
Membership Changes: If the set of nodes changes, large amounts of data must be reshuffled.
Online Upgrade: Software and hardware upgrades eventually affect every node in the system. Because storage and compute are coupled, no node can be upgraded or replaced without affecting the rest of the cluster.
For these reasons, Snowflake separates storage and compute.
Compute is provided through Snowflake's (proprietary) shared-nothing engine.
Storage is provided through Amazon S3.
To reduce network traffic between compute nodes and storage nodes, each compute node caches some table data on the local disk.
The local disk is used exclusively for temporary data and caches, both of which are hot (frequently accessed) data. This helps achieve high performance.
4. Architecture
Snowflake is designed to be highly available. It is a service-oriented architecture composed of highly fault-tolerant and independently scalable services that communicate through RESTful interfaces.
As shown in the above diagram, there are three layers of services: data storage, virtual warehouses, and cloud services.
The team calls this multi-cluster, shared-data architecture.
4.1 Data Storage
Snowflake selected AWS because it is the most mature offering and it has the largest pool of potential users.
Compared to local storage, S3 has much higher access latency and CPU overhead for every single I/O request. S3 is a blob storage, so files can only be (over-)written in full.
The above properties had a strong influence on Snowflake's table format and concurrency control scheme.
Tables are horizontally partitioned into large, immutable files. Within each file, the values of each attribute/column are grouped together and heavily compressed, a well-known scheme called PAX or hybrid columnar storage.
Each table file has a header that contains metadata and, among other things, the offsets of each column within the file. Because S3 allows GET requests over parts of files, queries need only download the file headers and the columns they require.
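To make this concrete, here is a minimal Python sketch of reading such a file; the byte layout, field names, and helpers are illustrative assumptions, not Snowflake's actual file format.

import json
import struct

# Hypothetical PAX-style table file: [4-byte header length][JSON header][column data].
# The header maps each column name to its (offset, length) within the data area.
def write_table_file(columns):
    body, offsets = b"", {}
    for name, data in columns.items():
        offsets[name] = (len(body), len(data))
        body += data
    header = json.dumps(offsets).encode()
    return struct.pack(">I", len(header)) + header + body

def read_columns(blob, wanted):
    # Fetch only the header and the requested columns; on S3 these would be
    # ranged GET requests instead of slices of a local byte string.
    (header_len,) = struct.unpack(">I", blob[:4])
    offsets = json.loads(blob[4:4 + header_len])
    base = 4 + header_len
    return {name: blob[base + off:base + off + length]
            for name, (off, length) in offsets.items() if name in wanted}

blob = write_table_file({"id": b"\x01\x02\x03", "name": b"abc", "city": b"xyz"})
print(read_columns(blob, ["name"]))  # {'name': b'abc'}; 'id' and 'city' are never read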
Snowflake also uses S3 to store temporary data generated by query operators once local disk space is exhausted, as well as large query results.
Storing query results in S3 enables new forms of client interactions and simplifies query processing. There is no need for server-side cursors.
Metadata such as catalog objects (the mapping from tables to S3 files), statistics, locks, transaction logs, etc., is stored in a scalable, transactional key-value store.
The store is part of the Cloud Services layer.
4.2 Virtual Warehouses
The Virtual Warehouses layer consists of clusters of EC2 instances.
Each such cluster is presented to its single user through an abstraction called a virtual warehouse (VW). The individual EC2 instances are called worker nodes.
Users never interact with worker nodes directly; VWs come in abstract "T-shirt" sizes ranging from X-Small to XX-Large.
4.2.1 Elasticity and Isolation
VWs are pure compute resources. They can be created, destroyed, or resized at any point.
Each query runs on exactly one VW. Worker nodes are not shared across VWs. When a new query is submitted, each worker node in the respective VW spawns a new worker process.
Each worker process lives only for the duration of its query.
Each user may have multiple VWs running at any given time. Every VW has access to the same shared tables without physically copying data.
This elasticity and isolation enable some novel use cases, such as running a heavy batch load on its own VW without slowing down interactive queries elsewhere.
4.2.2 Local Caching and File Stealing
Each worker node maintains a cache of table data on the local disk. The cache is a collection of table files, specifically file headers and individual file columns.
The cache lives for the duration of the worker node.
To improve the hit rate and avoid redundant caching of individual table files across the worker nodes of a VW, Snowflake uses consistent hashing.
The query optimizer assigns input file sets to worker nodes using consistent hashing over the table file names. Consistent hashing in Snowflake is lazy: when the set of worker nodes changes, no data is shuffled immediately; caches simply converge as new queries run. To handle stragglers, a worker that finishes its own input early may "steal" files from a slower peer, which is the file stealing of this subsection's title.
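The following is a minimal Python sketch of consistent hashing as one might apply it here; the class and method names are invented for illustration and do not come from the paper.

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    # Maps table file names to worker nodes. Adding or removing a node only
    # remaps the files whose hashes fall in the affected ring segments.
    def __init__(self, nodes, replicas=64):
        self.ring = sorted((_hash(f"{node}#{i}"), node)
                           for node in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, filename):
        idx = bisect.bisect(self.keys, _hash(filename)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.node_for("table_17/part-00042"))
# Laziness: when the VW is resized, nothing is shuffled eagerly; each worker's
# cache simply converges to the new assignment as queries read files.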
4.2.3 Execution Engine
In Snowflake, per-node efficiency is essential. To give the best performance, Snowflake implemented its own state-of-the-art SQL execution engine.
For analytical workloads, columnar storage and execution are generally superior to row-wise storage, due to more effective use of CPU caches and SIMD instructions, and more opportunities for compression.
Vectorized execution avoids the materialization of intermediate results. Instead, data is processed in a pipelined fashion, in batches of a few thousand rows in columnar format. This approach saves I/O and significantly improves cache efficiency.
Push-based execution means that relational operators push their results to their downstream operators, rather than waiting for downstream operators to pull data. This removes control-flow logic from tight loops and improves cache efficiency.
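A tiny push-based pipeline in Python may make the contrast clearer; the operator names and tiny batch size are invented for illustration.

BATCH = 4  # a real engine would use batches of a few thousand rows

class Scan:
    # Source operator: cuts a column into batches and pushes them downstream.
    def __init__(self, column, downstream):
        self.column, self.downstream = column, downstream
    def run(self):
        for i in range(0, len(self.column), BATCH):
            self.downstream.push(self.column[i:i + BATCH])
        self.downstream.finish()

class FilterGreater:
    # Keeps values above a threshold and pushes the surviving batch onward,
    # without ever materializing the full intermediate result.
    def __init__(self, threshold, downstream):
        self.threshold, self.downstream = threshold, downstream
    def push(self, batch):
        self.downstream.push([v for v in batch if v > self.threshold])
    def finish(self):
        self.downstream.finish()

class SumSink:
    def __init__(self):
        self.total = 0
    def push(self, batch):
        self.total += sum(batch)
    def finish(self):
        print("sum =", self.total)

Scan(list(range(10)), FilterGreater(4, SumSink())).run()  # prints: sum = 35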
4.3 Cloud Services
The Cloud Services layer is multi-tenant.
Each service of this layer—access control, query optimizer, transaction manager, and others—is long-lived and shared among many users.
Each service is replicated for high availability and scalability.
4.3.1 Query Management and Optimization
Cloud services handle all the early stages of the query life cycle, such as parsing, object resolution, access control, and plan optimization.
Snowflake's query optimizer follows a typical Cascades-style approach with top-down cost-based optimization.
Since Snowflake does not use indices, the plan search space is smaller than in other systems.
Once the optimizer completes, the resulting execution plan is distributed to all the worker nodes involved in the query.
Cloud Services continuously tracks the state of each query to collect performance counters and detect node failures.
Users can monitor and analyze past and ongoing queries through the Snowflake graphical user interface.
4.3.2 Concurrency Control
Concurrency control is handled entirely by the Cloud Services layer.
Snowflake implements ACID transactions via Snapshot Isolation (SI). All reads by a transaction see a consistent snapshot of the database as of the time the transaction started.
SI is implemented on top of multi-version concurrency control (MVCC), which is a natural choice given that table files are immutable.
The write operations (insert, update, delete, merge) on a table produce a newer version of the table by adding and removing whole files relative to the prior table version.
File additions and removals are tracked in the metadata store, which is a transactional key-value store.
Snowflake also uses these snapshots to implement time travel and efficient cloning of database objects.
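A minimal Python sketch of the idea follows; the Table class and its methods are illustrative assumptions, not Snowflake's metadata schema (which keys versions by timestamp for the AT/BEFORE syntax shown later).

class Table:
    # Every write commits a brand-new immutable set of files; old versions
    # stay readable as long as their removed files are retained.
    def __init__(self):
        self.versions = [frozenset()]  # version 0: empty table
    def write(self, add=(), remove=()):
        current = self.versions[-1]
        self.versions.append((current - frozenset(remove)) | frozenset(add))
    def files_at(self, version=-1):
        # A reader pins one version and sees a consistent snapshot (SI).
        return self.versions[version]

t = Table()
t.write(add=["f1", "f2"])            # version 1
t.write(add=["f3"], remove=["f1"])   # version 2
print(sorted(t.files_at()))          # ['f2', 'f3'] -- the latest version
print(sorted(t.files_at(1)))         # ['f1', 'f2'] -- time travel
# Cloning a table only copies the version pointer; no table files are copied.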
4.3.3 Pruning
Most large-scale data processing uses a technique called min-max-based pruning. It is also known as materialized aggregates, zone maps, and data skipping.
For a given chunk of data, the system maintains the minimum and maximum values of each column.
Based on the query predicates, these values determine whether a given chunk of data is needed at all for a given query.
Snowflake keeps pruning-related metadata for every individual table file.
Besides this static pruning, Snowflake also performs dynamic pruning during execution.
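Here is a small Python sketch of static min-max pruning; the file names and statistics are invented for illustration.

# Per-file min/max statistics for column "x"; a predicate such as
# "WHERE x BETWEEN 50 AND 60" lets us skip files whose range cannot match.
file_stats = {
    "part-001": (0, 40),
    "part-002": (35, 80),
    "part-003": (90, 120),
}

def prune(stats, lo, hi):
    # Keep only files whose [min, max] interval overlaps [lo, hi].
    return [name for name, (mn, mx) in stats.items() if mx >= lo and mn <= hi]

print(prune(file_stats, 50, 60))  # ['part-002'] -- the other files are never read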
5. Feature Highlights
Let’s discuss in detail the features provided by Snowflake.
5.1 Pure Software-as-a-Service Experience
Snowflake supports standard database interfaces like JDBC and ODBC.
It works with third-party tools and services like Tableau, Informatica, or Looker.
Snowflake provides a web UI that makes it easy to access from any location and environment.
The UI allows SQL operations, access to the database catalog, user and system management, monitoring, usage information, and more.
5.2 Continuous Availability
Snowflake offers continuous availability, which rests on two technical features:
5.2.1 Fault Resilience
Snowflake tolerates individual and correlated node failures at all levels of the architecture.
The data storage layer of Snowflake is S3, which is replicated across multiple data centers, called availability zones (AZs).
Snowflake's metadata store is also distributed and replicated across multiple AZs. If a node fails, other nodes can pick up its activities without impacting end users.
Virtual Warehouses (VWs), in contrast, are not distributed across AZs; this is a deliberate choice for the sake of performance.
If an entire AZ becomes unavailable, all queries running on a VW in that AZ fail, and the user needs to actively re-provision the VW in a different AZ.
5.2.2 Online Upgrade
Snowflake provides continuous availability when failures occur and during software upgrades.
The system is designed to allow multiple versions of the various services to be deployed side-by-side. Whenever the metadata schema changes, backward compatibility with the previous version is ensured.
This is possible because all services are effectively stateless; all hard state is kept in a transactional key-value store.
The diagram below shows a snapshot of an ongoing upgrade process.
5.3 Semi-Structured and Schema-Less Data
Snowflake extends the standard SQL type system with three types for semi-structured data: VARIANT, ARRAY, and OBJECT.
Values of type VARIANT can store:
Any value of native SQL type (DATE, VARCHAR, etc.)
Variable-length ARRAYs of values
JavaScript-like OBJECTs, i.e., maps from strings to VARIANT values
ARRAY and OBJECT are just restrictions of type VARIANT; the internal representation is the same.
The VARIANT type allows Snowflake to be used in an ELT (Extract-Load-Transform) manner rather than the traditional ETL (Extract-Transform-Load) manner.
Another advantage of ELT with Snowflake is that, if transformation is later desired, it can be performed using the full power of a parallel SQL database, including operations such as joins, sorting, aggregation, and complex predicates.
5.3.1 Post-relational Operations
The most critical operation on documents is extracting data elements. Snowflake provides extraction operations in functional SQL notation and JavaScript-like path syntax. The internal encoding makes extraction very efficient.
Flattening is the second most common operation; it means pivoting a nested document into multiple rows. Snowflake uses SQL lateral views to represent flattening operations.
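The following Python sketch shows what extraction and flattening compute on a nested document; the helper names are illustrative, and in Snowflake itself these operations are expressed in SQL.

doc = {"name": "alice",
       "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}]}

def extract(document, path):
    # Follow a dotted path, in the spirit of the JavaScript-like syntax doc.orders
    value = document
    for part in path.split("."):
        value = value[part]
    return value

def flatten(document, path):
    # Pivot the nested array at `path` into one output row per element.
    return [{"name": document["name"], **element}
            for element in extract(document, path)]

print(extract(doc, "name"))   # alice
for row in flatten(doc, "orders"):
    print(row)  # {'name': 'alice', 'id': 1, 'total': 9.5}, then the row for id 2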
5.3.2 Columnar Storage and Processing
Apache Impala and Google Dremel have demonstrated that columnar storage of semi-structured data is possible and beneficial.
When storing semi-structured data, Snowflake automatically performs statistical analysis of the collection of documents within a single table file.
The most frequently used columns are removed from the documents and stored separately. These columns use the same compressed columnar format as native relational data.
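A minimal Python sketch of that analysis, with made-up documents and a made-up frequency threshold:

from collections import Counter

docs = [
    {"id": 1, "city": "Oslo", "tags": ["a"]},
    {"id": 2, "city": "Bergen"},
    {"id": 3, "city": "Tromso", "note": "vip"},
]

# Count how often each path occurs across the documents of one table file.
path_counts = Counter(key for d in docs for key in d)
frequent = {p for p, c in path_counts.items() if c >= 0.9 * len(docs)}

# Frequent paths are peeled off into their own columnar arrays; everything
# else stays in the serialized (VARIANT-like) remainder of each document.
columns = {p: [d.get(p) for d in docs] for p in sorted(frequent)}
remainder = [{k: v for k, v in d.items() if k not in frequent} for d in docs]
print(columns)    # {'city': ['Oslo', 'Bergen', 'Tromso'], 'id': [1, 2, 3]}
print(remainder)  # [{'tags': ['a']}, {}, {'note': 'vip'}]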
5.3.3 Optimistic Conversion
In semi-structured input, values of some native SQL types, such as dates and times, usually arrive represented as strings.
These values need to be converted from strings to their actual type either at write time (during insert or update) or at read time.
Without a typed schema, doing this at read time for every query would be expensive. Snowflake solves the problem with optimistic conversion: it performs the conversion at write time and preserves both the conversion result and the original string, in separate columns.
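A small sketch of the write-time side of optimistic conversion, using Python's ISO date parsing as a stand-in for Snowflake's detection logic:

from datetime import date

def optimistic_convert(raw_values):
    # Try to parse each string as a date; keep BOTH the parsed value and the
    # original string in separate columns, so neither kind of query loses out.
    parsed, originals = [], []
    for s in raw_values:
        try:
            parsed.append(date.fromisoformat(s))
        except ValueError:
            parsed.append(None)  # not a date after all; only the string column applies
        originals.append(s)
    return parsed, originals

dates, raw = optimistic_convert(["2016-06-14", "not-a-date"])
print(dates)  # [datetime.date(2016, 6, 14), None]
print(raw)    # ['2016-06-14', 'not-a-date'] -- the originals are always preserved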
5.4 Time Travel and Cloning
When a new table version removes files, they are retained for a configurable duration. This retention allows Snowflake to read earlier versions of a table very efficiently, a feature called time travel.
Users can access this feature from SQL using the convenient AT or BEFORE syntax.
Timestamps can be absolute, relative with respect to current time, or relative with respect to previous statements.
One can even access different versions of the same table in a single query.
SELECT new.key, new.value, old.value
FROM my_table new
JOIN my_table AT(OFFSET => -86400) old -- 1 day ago
ON new.key = old.key WHERE new.value <> old.value;
5.5 Security
Snowflake is designed to protect user data against attacks on all levels of the architecture.
Snowflake implements two-factor authentication, (client-side) encrypted data import and export, secure data transfer and storage, and role-based access control for database objects.
Data is encrypted before being sent over the network and written to a local disk or shared storage (S3).
5.5.1 Key Hierarchy
Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in AWS CloudHSM.
Encryption and key management are entirely transparent to the user and require no configuration or management.
The diagram below shows the key hierarchy:
There are four levels: root keys, account keys, table keys, and file keys. Each layer of parent keys encrypts the child keys below. Each account key corresponds to one user account, each table key corresponds to one database table, and each file key corresponds to one table file.
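The wrapping scheme can be sketched in a few lines of Python using the third-party cryptography package; the helper names and the use of Fernet are illustrative assumptions, not Snowflake's actual cipher construction.

from cryptography.fernet import Fernet  # pip install cryptography

# Each parent key only ever encrypts ("wraps") the keys one level below it;
# file data is encrypted with file keys, never directly with the root key.
def wrap(parent_key, child_key):
    return Fernet(parent_key).encrypt(child_key)

def unwrap(parent_key, wrapped_child):
    return Fernet(parent_key).decrypt(wrapped_child)

root_key = Fernet.generate_key()     # in Snowflake, the root lives in AWS CloudHSM
account_key = Fernet.generate_key()
table_key = Fernet.generate_key()
file_key = Fernet.generate_key()

wrapped_account = wrap(root_key, account_key)
wrapped_table = wrap(account_key, table_key)
wrapped_file = wrap(table_key, file_key)
ciphertext = Fernet(file_key).encrypt(b"table file contents")

# Reading a file means unwrapping down the hierarchy, then decrypting the data.
ak = unwrap(root_key, wrapped_account)
tk = unwrap(ak, wrapped_table)
fk = unwrap(tk, wrapped_file)
print(Fernet(fk).decrypt(ciphertext))  # b'table file contents'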
5.5.2 Key Life Cycle
Snowflake also constrains the duration of time during which a key is usable.
Encryption keys go through four phases:
A pre-operational creation phase
An operational phase, where keys are used to encrypt and decrypt
A post-operational phase, where keys are no longer in use
A destroyed phase
The operational phase requires limiting both the originator-usage period (the time during which a key encrypts data) and the recipient-usage period (the time during which it decrypts data).
Snowflake limits the originator-usage period using key rotation and the recipient-usage period using rekeying.
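A toy sketch of the two mechanisms, with dictionaries standing in for encrypted files; all names are invented:

key_versions = ["k-v1"]  # the active key version is always the last entry

def rotate():
    # Key rotation: new data is encrypted under a fresh key version, which
    # bounds the originator-usage period of every version.
    key_versions.append(f"k-v{len(key_versions) + 1}")

def encrypt_new_file(name, catalog):
    catalog[name] = key_versions[-1]

def rekey(catalog, retired):
    # Rekeying: old data encrypted under a retired version is re-encrypted
    # under the active key, which bounds the recipient-usage period.
    for f, k in catalog.items():
        if k == retired:
            catalog[f] = key_versions[-1]

catalog = {}
encrypt_new_file("file-1", catalog)  # file-1 -> k-v1
rotate()
encrypt_new_file("file-2", catalog)  # file-2 -> k-v2
rekey(catalog, "k-v1")
print(catalog)  # {'file-1': 'k-v2', 'file-2': 'k-v2'}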
5.5.3 End-to-End Security
Snowflake uses AWS CloudHSM as a tamper-proof, highly secure way to generate, store, and use the root keys of the key hierarchy.
AWS CloudHSM is a set of hardware security modules (HSMs). The root keys never leave the HSM devices.
This way, lower-level keys cannot be unwrapped without authorized access to the HSM devices.
The HSMs also generate keys at the account and table levels.
6. Related Work
Cloud-based Parallel Database Systems
Amazon Redshift is the data warehousing product in the AWS cloud.
Google's Cloud Platform offers a fully managed query service, BigQuery.
Microsoft SQL Data Warehouse is available in the Azure cloud.
Document Stores and Big Data
A few famous document stores are MongoDB, Couchbase Server, and Apache Cassandra.
Many Big Data engines, including Apache Hive, Apache Spark, Apache Drill, and Presto, now support queries over nested data.
7. Conclusion
Snowflake was founded in 2012. At that time, the database world focused entirely on SQL on Hadoop, and building a classic data warehouse system for the cloud was not the norm.
After three years of development, the team was confident they built the right solution.
People were looking for a relational database that was more efficient, more flexible, and better suited for the cloud. Snowflake meets all of these expectations.