Insight from paper: MotherDuck: DuckDB in the cloud and in the client
1. Abstract
MotherDuck is a service that connects DuckDB to the cloud.
If you are unaware of the DuckDB, Please read my post here.
MotherDuck provides the concept of hybrid query processing: the ability to execute queries partly on the client and partly in the cloud.
The paper describes the motivation for MotherDuck and some of its use cases.
MotherDuck allows existing DuckDB users to start using cloud computing without changing their queries.
It also allows users to share DuckDB databases with others through the cloud for collaboration.
2. Introduction
So, what does MotherDuck offer?
It offers DuckDB data storage and serverless query processing in the cloud.
DuckDB is an embedded analytical database system.
Before we proceed, let’s make it clear what is embedded here.
Embedded means that DuckDB runs inside the driver/API library used by the client process.
DuckDB is popular because
It has high performance.
It is easy to use.
It is tightly integrated with Python and R data science libraries.
Hybrid Query Processing:
MotherDuck allows customers to store DuckDB databases in the cloud. The cloud data can be in Parquet, CSV, or JSON files.
Users can simultaneously query local and cloud databases within the same query.
The MotherDuck optimizer will plan query operators to be executed close to where the data is.
It uses bridge operators to upload and download tuple streams between local and cloud. It is called hybrid query processing.
Reducing Cloud Footprint:
The team analyzed traces of cloud data warehouses. It found that >95% of databases are <1TB in size, and >95% of queries involve <10GB of data.
The current analytical cloud data platforms rely on scale-out architectures, which add system complexity but are not required by 95% of users.
MotherDuck advocates a serverless architecture that avoids scale-out, opting for the simplicity of a single-node architecture. The team has a slogan for it: Big Data is Dead.
Variating the number of cores and RAM given to a container provides elasticity for a single user.
Enabling New Applications:
Processing in both client and server opens up new applications since clients can process specific queries locally.
DuckDB can be compiled into Web Assembly (wasm).
An early demonstrator is the Mosaic framework for scalable, interactive data visualizations. DuckDB's web interface also uses this new application architecture.
MotherDuck could also make applications more secure.
Scaling Existing Applications:
DuckDB is popular among data scientists for designing and evaluating ETL or analysis pipelines on a laptop.
Now, these scenarios can be enhanced by scheduling the designed pipelines to run in production in the cloud.
These production pipelines also benefit from MotherDuck’s ability to scale memory and CPU resources beyond a laptop.
Data scientists can also use the MotherDuck SHARE and ATTACH features to share databases in the cloud and collaborate in a team.
3. Background DuckDB
State-of-the-Art Architecture: DuckDB provides the following features:
It has columnar storage that is skippable.
It uses primary keys and foreign keys backed by the ART index.
Its columns can store nested datatypes.
Its database transactions are optimized for handling changes in bulk.
The MVCC is optimized for fast scans.
It uses a vectorized query execution engine that supports operating
directly on lightweight-compressed data.
It has a LeanStore-inspired buffer manager for main memory
query processing speeds in SSD.
It has hyperloglog powered statistics and cardinality estimation.
It has a rule-based optimizer using dynamic programming.
It supports push-based operator execution.
User-Friendly Features
It has friendly SQL language extensions.
It has rich timezone support and SQL macros.
It supports reading and writing for Parquet and Iceberg files.
It has "zero-copy" integration with dplyr, numpy, pandas, and Arrow.
Extension Modules
DuckDB can get new functionality via extension modules.
The system can be extended on several dimensions: data types, operators, optimizer rules, and even the parser.
MotherDuck takes advantage of the DuckDB extension mechanism. A signed (trusted) extension provides connectivity from DuckDB clients to MotherDuck in the cloud using hybrid query processing.
4.MotherDuck Architecture
4.1 Infrastructure
MotherDuck is a SaaS. It runs a control plane with components responsible for many administrative tasks.
Compute
MotherDuck’s compute platform is built on top of shortlived, on-demand, allocated containers. Each container runs a single DuckDB instance.
The containers run in a VM that may have a local SSD, which the MotherDuck storage service uses for caching.
The amount of RAM and CPU resources allocated to a container can be changed.
MotherDuck shuts down containers completely when they are not used.
Storage
The storage layer is built on a shared, scale-out, distributed storage fabric provided by the Cloud vendor.
MotherDuck’s storage service provides client data isolation and improves performance by
Leveraging local SSD resources for caching
Adapting DuckDB database storage to better suit cloud storage systems.
DuckDB’s native data format stores compressed columnar data in a single file.
The duckling storage extension allows for differential storage, where changed data is stored independently as a mutation tree.
It enables zero-copy duplication, sharing, branching, and time travel.
DuckDB provides multi-database storage. A single DuckDB can ATTACH multiple database files simultaneously in read-only or read-write mode.
MotherDuck allows multiple users to have read-only access to a database created and modified by another user.
4.2 Hybrid Query Processing
A DuckDB query goes through four phases -
Parsing, Binding, Query optimization, and Execution
The MotherDuck client extension module hooks into all four phases.
Order-aware Bridge Operators
The bridge operators download and upload tuple streams between the client and the cloud.
These operators aim to work well in situations where upload and download bandwidth is asymmetrical and handle the possibly different endianness from the client.
They can materialize tuples in the sink in the same order they were stored in the original pipeline source.
Remote-local Optimizer
The MotherDuck client extension adds a new rule to the DuckDB optimizer for planning hybrid query processing.
This rule splits the plan into fragments and designates each fragment to run locally or remotely, inserting bridge operators in between.
Virtual Catalog
Query planning is done locally, so local DuckDB should have access to metadata information about cloud-resident databases.
DuckDB’s extensible catalog uses a MotherDuck proxy catalog to create and maintain catalogs for all relevant cloud databases.
SQL Language Extensions
MotherDuck has changed the binding process of the table functions to read Parquet, JSON, and CSV files.
Now, these can be used in the DuckDB’s FROM clause.
The MotherDuck client extension further registers a parser that adds support for new SQL statements.
The CREATE SHARE statement allows to share a cloud database with other MotherDuck users.
4.3 User Interface
MotherDuck comes with a built-in web-based user interface.
Interactive Result Set Exploration
An analytical task is like an iterative, user-driven optimization process for finding a suitable SQL query to answer an analytic question.
The user enters an SQL query, examines the query results, and then revises the query to improve its accuracy in answering the original analytic question.
The MotherDuck Web UI supports interactions by providing users with a notebook-style interface.
Declarative Caching
When executing a query, the MotherDuck GUI accumulates the results in a local results cache.
CREATE TABLE localMemDb.main.cacheTable1 AS
SELECT * FROM ({userQuery})
LIMIT {CLIENT_CACHE_ROW_LIMIT}
DuckDB-wasm manages this in-memory table and runs in the user’s browser.
AI assistant
The MotherDuck GUI came with the possibility of formulating notebook queries in natural language.
In the background, MotherDuck AI will automatically generate an SQL statement, iteratively fix it if needed, and execute it against the current database.
5. Conclusion
MotherDuck connects the popular, lightweight, user-friendly open-source analytics database DuckDB to the cloud.
It empowers users to use more client-side resources. It simplifies cloud data system architecture by using a scale-up approach.
At the time of paper writing, MotherDuck has completed one year and is in beta launch.
References
Data Management for Data Science Towards Embedded Analytics
Scalability! But at what COST?