Insights from paper: Velox: Meta’s Unified Execution Engine
1. Abstract
Meta created Velox, a novel open-source C++ database acceleration library.
Velox provides reusable, extensible, high-performance, and dialect-agnostic data processing components for building execution engines and enhancing data management systems.
The library heavily relies on vectorization and adaptivity.
It is designed from the ground up to support efficient computation over complex data types.
At the time of the paper's writing, Velox was integrated with more than a dozen data systems at Meta.
These include analytical query engines such as Presto and Spark, stream processing platforms, message buses, data warehouse ingestion infrastructure, and machine learning systems.
2. Introduction
Data in today's world has two defining aspects: exponential growth and an increasing diversity of workloads.
There are many data processing requirements, for example:
Simple transaction processing
Batch and interactive analytics
ETL
Bulk data movement
Realtime stream processing
Logging
Time series processing for monitoring
Machine learning use cases
For these requirements, specialized query and computation engines are being created. These engines have created a siloed data ecosystem, and evolving and optimizing them is challenging.
The main differences between these engines are the language frontend, the optimizer, the way tasks are distributed, and the IO layer.
The execution engines at the core of these systems are all somewhat similar.
All these engines need:
A type system to represent scalar and complex data types
An in-memory representation of these datasets
An expression evaluation system
Operators (such as joins, aggregation, and sort)
Storage and network serialization, encoding formats
Resource management primitives
Meta has developed Velox, a library providing reusable, extensible, high-performance data processing components.
These components can be used to build, enhance, or replace execution engines in existing data management systems. These are language, dialect, and engine-agnostic, providing many extensibility points.
Velox takes a fully optimized query plan as input and performs the described computation using the resources available in the local node.
Velox provides:
Efficiency
Consistency
Engineering Efficiency
This paper answers the following questions:
How do the Velox library and its components work?
How is Velox being integrated with compute engines?
How is Velox transforming Meta’s data landscape?
3. Library Overview
Velox's components usually reside on the data plane, while the individual engines provide the control plane.
The high-level components provided by Velox are:
Type, Vector, Expression Eval, Functions, Operators, I/O, Serializers, and Resource Management.
4. Use Cases
Let’s go through some real-life use cases to understand how different specialized engines are leveraging Velox to accelerate, unify, and consolidate user workloads.
4.1 Presto
Presto is an open-source distributed query engine created by Meta in 2013. You can read my post about Presto here.
Presto allows users to run SQL queries over data stored in Hive and other environments.
Presto is organized in a two-tiered architecture composed of
A coordinator node, responsible for receiving user queries, SQL parsing, metadata resolution, global query optimization, and resource management.
Worker nodes, which are responsible for the actual execution of a query given a query plan fragment.
Both coordinator and worker processes share the same Java codebase and communicate via an HTTP REST interface.
Prestissimo is the codename of the project aimed at replacing Presto's Java workers with a C++ process based on Velox.
4.2 Spark
Spark is an open-source unified computation engine for large-scale data processing. You can read my post about Spark here.
Spark manages and coordinates the execution of tasks on data across a cluster of servers.
Spark applications consist of a driver process and a set of executor processes.
The driver is responsible for task planning, scheduling, and communicating with an external resource manager.
Executors are responsible for performing the actual computation and communication with the remote storage system.
Spruce is the codename for the Velox implementation for Spark.
Spruce leverages a pre-existing Spark interface, called script transform, that allows users to execute arbitrary binaries.
It offloads execution to an external C++ process in which Velox runs.
At query time, a Spark executor receives a query plan fragment, serializes it, and forwards it to the external C++ process using the transform interface.
The external process is called SparkCpp. It deserializes the plan, converts it to a Velox plan, and executes it using Velox.
4.3 Realtime Data Infrastructure
Velox is also being used by Meta’s realtime data infrastructure. There are three use cases for it.
Stream Processing:
XStream is Meta’s stream processing platform which allows users to create stream processing applications, expressed either using SQL or a dataframe-like fluent API.
XStream applications commonly read data continuously from Meta’s messaging infrastructure system called Scribe.
Most data processing operations available in XStream map directly to Velox operators and hence are directly reused.
Messaging Bus:
Scribe is a distributed messaging system for collecting, aggregating, and delivering high volumes of data with low latency.
It serves as the main entry point to data ingestion pipelines at Meta.
Data is written to Scribe on a row-by-row basis (log production) and was traditionally read in the same manner.
The Scribe Read Service is now able to leverage the wire serialization formats available in Velox.
Velox usage in the Scribe Read Service allows data consumers to push down operations such as projections and filters closer to the storage.
Data Ingestion:
FBETL is Meta’s data ingestion engine, responsible for two main use cases: data warehouse ingestion, and database ingestion.
Data warehouse ingestion is the process of converting data read from Scribe pipes into warehouse files.
Using Velox in FBETL allows users to specify data transformations (projections) including expressions, UDFs, and filtering applied to the data at ingestion time.
This saves users from having to create a full stream processing application to achieve the same result.
The database ingestion process scrapes operational database logs and saves snapshots to the data warehouse.
Velox aids in the implementation of snapshotting.
5. Deep Dive
5.1 Type System
Velox provides a type system that allows users to represent primitive types, strings, dates, timestamps, and functions (lambdas).
It also supports complex types such as arrays, maps, and rows/structs, all of which can be arbitrarily nested. Velox provides serialization/deserialization methods.
In addition to all these types, Velox provides an opaque data type that developers can use to wrap arbitrary C++ data structures easily.
The type system is extensible. Developers can add engine-specific types, such as Presto’s HyperLogLog type, for cardinality estimation.
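To make the nesting concrete, here is a minimal sketch of how such a recursive type tree could be modeled. The TypeKind/Type classes below are illustrative stand-ins, not the actual Velox type API.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for a recursive type system: a type is a kind
// plus (for complex types) a list of named child types.
enum class TypeKind { BIGINT, VARCHAR, DOUBLE, ARRAY, MAP, ROW };

struct Type {
  TypeKind kind;
  std::vector<std::string> names;                      // field names (ROW only)
  std::vector<std::shared_ptr<const Type>> children;   // element/key/value/field types

  std::string toString() const {
    switch (kind) {
      case TypeKind::BIGINT:  return "BIGINT";
      case TypeKind::VARCHAR: return "VARCHAR";
      case TypeKind::DOUBLE:  return "DOUBLE";
      case TypeKind::ARRAY:   return "ARRAY<" + children[0]->toString() + ">";
      case TypeKind::MAP:
        return "MAP<" + children[0]->toString() + "," + children[1]->toString() + ">";
      case TypeKind::ROW: {
        std::string s = "ROW<";
        for (size_t i = 0; i < children.size(); ++i) {
          s += (i ? "," : "") + names[i] + ":" + children[i]->toString();
        }
        return s + ">";
      }
    }
    return "";
  }
};

using TypePtr = std::shared_ptr<const Type>;

TypePtr scalar(TypeKind k) { return std::make_shared<Type>(Type{k, {}, {}}); }
TypePtr array(TypePtr elem) { return std::make_shared<Type>(Type{TypeKind::ARRAY, {}, {elem}}); }
TypePtr map(TypePtr k, TypePtr v) { return std::make_shared<Type>(Type{TypeKind::MAP, {}, {k, v}}); }
TypePtr row(std::vector<std::string> names, std::vector<TypePtr> types) {
  return std::make_shared<Type>(Type{TypeKind::ROW, std::move(names), std::move(types)});
}

int main() {
  // A row with a scalar column, an array column, and a map column,
  // showing that complex types can nest arbitrarily.
  auto schema = row({"id", "tags", "scores"},
                    {scalar(TypeKind::BIGINT),
                     array(scalar(TypeKind::VARCHAR)),
                     map(scalar(TypeKind::VARCHAR), scalar(TypeKind::DOUBLE))});
  std::cout << schema->toString() << "\n";
  // Prints: ROW<id:BIGINT,tags:ARRAY<VARCHAR>,scores:MAP<VARCHAR,DOUBLE>>
}
```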
5.2 Vectors
Velox Vectors allow developers to represent columnar datasets in memory. A variety of encoding formats are supported.
The basic memory layout extends the Apache Arrow format. It comprises a size variable, the data type, and an optional nullability bitmap to represent null values.
Vectors can represent fixed-size or variable-size elements.
Vectors can also be nested in arbitrary ways.
Vectors can leverage different encoding formats such as flat, dictionary, constant, sequence/RLE, and bias.
All Vector data is stored using Velox Buffers, which are contiguous pieces of memory allocated from a memory pool. These buffers can also be subclassed to support different ownership modes.
Any Vector and Buffer can be made writable via copy-on-write.
Velox provides the concept of Lazy Vectors, which are Vectors that only get populated upon first use. Lazy Vectors are useful in cardinality reduction operations such as joins and conditionals in projections.
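To illustrate the lazy-loading idea (this is a toy sketch, not the actual LazyVector API), the column below sits behind a loader callback that only runs the first time the data is touched; if a selective join or filter ends up never touching the column, the load never happens.

```cpp
#include <functional>
#include <iostream>
#include <optional>
#include <vector>

// Illustrative sketch of a lazily materialized column: the loader runs
// only on first access, so columns that end up unused are never loaded.
template <typename T>
class LazyColumn {
 public:
  explicit LazyColumn(std::function<std::vector<T>()> loader)
      : loader_(std::move(loader)) {}

  const std::vector<T>& values() {
    if (!values_) {
      std::cout << "loading column...\n";
      values_ = loader_();  // populated on first use only
    }
    return *values_;
  }

 private:
  std::function<std::vector<T>()> loader_;
  std::optional<std::vector<T>> values_;
};

int main() {
  LazyColumn<int64_t> col([] {
    // Pretend this is an expensive read from storage.
    return std::vector<int64_t>{10, 20, 30};
  });

  bool anyRowSurvivedFilter = false;  // e.g., a selective join matched nothing
  if (anyRowSurvivedFilter) {
    std::cout << col.values().size() << " values\n";
  } else {
    std::cout << "column never loaded\n";  // loader was never invoked
  }
}
```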
Velox provides the Decoded Vector abstraction.
It transforms an arbitrarily encoded Vector into a flat vector plus a set of indices covering all or part of its elements, and exposes a logically consistent API regardless of the input encoding.
Decoded Vectors are zero-copy for flat, constant, and single-level dictionary-encoded inputs.
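The sketch below mimics the idea with a toy dictionary-encoded column: the "decoded" view exposes flat-looking access as a pair of base values and indices without copying the base data, which is what makes decoding zero-copy for dictionary inputs. The class names are hypothetical, not the Velox API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Toy dictionary-encoded column: row i holds base[indices[i]].
struct DictColumn {
  std::vector<std::string> base;  // distinct values
  std::vector<int32_t> indices;   // one entry per row
};

// Illustrative "decoded" view: exposes flat-looking access over an
// encoded column without materializing (copying) each row's value.
class DecodedView {
 public:
  explicit DecodedView(const DictColumn& col) : col_(col) {}

  size_t size() const { return col_.indices.size(); }
  const std::string& valueAt(size_t row) const {
    return col_.base[col_.indices[row]];  // zero-copy indirection
  }

 private:
  const DictColumn& col_;
};

int main() {
  DictColumn country{{"US", "BR", "IN"}, {0, 0, 2, 1, 2, 2}};
  DecodedView decoded(country);
  for (size_t i = 0; i < decoded.size(); ++i) {
    std::cout << decoded.valueAt(i) << " ";
  }
  std::cout << "\n";  // US US IN BR IN IN
}
```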
Velox Vectors are based on and compatible with the Apache Arrow format.
The team deliberately decided to extend the standard to accelerate data processing operations. Velox Vectors and Apache Arrow formats diverge in the three areas below.
Strings
Out-of-order write Support
More Encodings
5.3 Expression Eval
Velox provides a vectorized expression evaluation engine. It is used in the following ways:
It is used by the FilterProject operator to evaluate filter and projection expressions.
It is used by TableScan and IO connectors to evaluate pushed-down predicates consistently.
It can be a standalone component for engines requiring expression evaluation capabilities.
Expression evaluation takes expression trees as input and is divided into compilation and evaluation.
Compilation:
The compilation step takes a list of one or more input expression trees and produces a compiled (executable) expression. The main runtime optimizations applied during this process are:
Common Subexpression Elimination - Identify subexpressions that occur more than once and evaluate them only once.
Constant Folding - Pre-evaluate deterministic subexpressions that do not depend on any input columns (a sketch follows this list).
Adaptive Conjunct Reordering - Reorder the conjuncts of AND/OR expressions at runtime so that the most selective and cheapest ones are evaluated first.
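As a toy illustration of constant folding (the expression classes below are hypothetical, not Velox's expression tree), a deterministic subtree whose children are all literals is pre-evaluated once during compilation:

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Toy expression tree: either a literal, a column reference, or a "+" call.
struct Expr {
  enum Kind { kLiteral, kColumn, kAdd } kind;
  int64_t literal;                              // valid when kind == kLiteral
  std::string column;                           // valid when kind == kColumn
  std::vector<std::shared_ptr<Expr>> children;  // valid when kind == kAdd
};

using ExprPtr = std::shared_ptr<Expr>;

ExprPtr lit(int64_t v) { return std::make_shared<Expr>(Expr{Expr::kLiteral, v, "", {}}); }
ExprPtr col(std::string name) { return std::make_shared<Expr>(Expr{Expr::kColumn, 0, std::move(name), {}}); }
ExprPtr add(ExprPtr a, ExprPtr b) { return std::make_shared<Expr>(Expr{Expr::kAdd, 0, "", {a, b}}); }

// Constant folding: if every child of a deterministic call is a literal,
// evaluate it once at compile time and replace the subtree with the result.
ExprPtr fold(const ExprPtr& e) {
  if (e->kind != Expr::kAdd) return e;
  ExprPtr l = fold(e->children[0]);
  ExprPtr r = fold(e->children[1]);
  if (l->kind == Expr::kLiteral && r->kind == Expr::kLiteral) {
    return lit(l->literal + r->literal);
  }
  return add(l, r);
}

int main() {
  // c1 + (2 + 3)  ==>  c1 + 5
  auto e = add(col("c1"), add(lit(2), lit(3)));
  auto folded = fold(e);
  std::cout << "right child is literal " << folded->children[1]->literal << "\n";  // 5
}
```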
Evaluation:
The evaluation process takes a compiled expression and an input dataset. After calculating the results, it returns an output dataset.
This process consists of a recursive descent of the expression tree.
Peeling - When inputs are dictionary-encoded, deterministic expressions can be computed efficiently by evaluating them only over the distinct base values and rewrapping the result with the original indices (sketched after this list).
Memoization - The evaluation step can be invoked repeatedly to process multiple batches of data while reusing the same compiled expression object.
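A rough sketch of the peeling idea, again using hypothetical types rather than the Velox API: evaluate a deterministic function once per distinct base value, then reuse the dictionary indices for the result.

```cpp
#include <cctype>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Dictionary-encoded column: row i holds base[indices[i]].
struct Dict {
  std::vector<std::string> base;
  std::vector<int32_t> indices;
};

// "Peeling": for a deterministic function, compute results over the
// (usually small) base vector only, then reuse the same indices to
// produce a dictionary-encoded result for every row.
Dict evalPeeled(const Dict& input,
                const std::function<std::string(const std::string&)>& fn) {
  Dict result;
  result.base.reserve(input.base.size());
  for (const auto& v : input.base) {
    result.base.push_back(fn(v));   // |base| calls instead of |rows| calls
  }
  result.indices = input.indices;   // index buffer is reused as-is
  return result;
}

int main() {
  Dict names{{"alice", "bob"}, {0, 1, 0, 0, 1, 0}};
  auto upper = evalPeeled(names, [](const std::string& s) {
    std::string out = s;
    for (auto& c : out) c = std::toupper(static_cast<unsigned char>(c));
    return out;
  });
  // The function was called 2 times, but the result covers 6 rows.
  for (auto i : upper.indices) std::cout << upper.base[i] << " ";
  std::cout << "\n";  // ALICE BOB ALICE ALICE BOB ALICE
}
```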
Velox provides experimental support for expression evaluation through code generation (codegen).
5.4 Functions
Velox provides APIs that allow developers to build custom scalar and aggregate functions.
Scalar Functions - Scalar functions take values from a single row as parameters and produce a single output value for that row.
The Velox scalar function API is vectorized: functions receive their input as vectors (batch by batch), along with nullability buffers and a bitmap describing the set of active rows.
Simple Functions - Velox also provides a simple scalar function API. It is designed for simplicity and ease of use and hides as many details of the underlying engine and data layout as possible.
It provides the same level of performance as vectorized functions.
The simple scalar function API allows developers to express their business logic by providing a C++ function that takes a single row of values at a time.
The paper includes a performance comparison of three functions initially implemented using the vectorized API versus their implementations using the simple API.
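The contrast can be sketched roughly as follows (the vectorize helper is hypothetical, not the actual Velox function registration API): the author writes per-row logic, and a generic adapter turns it into a batch-at-a-time function, which is where the framework can layer in null handling, encoding awareness, and other optimizations.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical adapter illustrating the "simple function" idea: the author
// supplies row-at-a-time logic; the framework wraps it into a vectorized
// (batch-at-a-time) function and can layer null handling, dictionary
// peeling, etc. on top without the author having to know about them.
template <typename In, typename Out>
std::function<std::vector<Out>(const std::vector<In>&)>
vectorize(std::function<Out(const In&)> perRow) {
  return [perRow](const std::vector<In>& batch) {
    std::vector<Out> out;
    out.reserve(batch.size());
    for (const auto& v : batch) {
      out.push_back(perRow(v));
    }
    return out;
  };
}

int main() {
  // Author's view: plain single-row business logic.
  auto plusOne = vectorize<int64_t, int64_t>([](const int64_t& x) { return x + 1; });

  // Engine's view: batches of values.
  std::vector<int64_t> batch{1, 2, 3};
  for (auto v : plusOne(batch)) std::cout << v << " ";
  std::cout << "\n";  // 2 3 4
}
```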
Aggregate Functions - These functions summarize multiple rows from a particular group into a single output row.
Aggregate functions in Velox are typically calculated in two steps (a sketch for AVG follows):
Partial aggregation takes raw input data and produces intermediate results.
Final aggregation takes intermediate results and produces the final result.
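For example, AVG fits this two-step model naturally: partial aggregation emits (sum, count) pairs, and final aggregation merges them. A minimal sketch, not the Velox aggregate function API:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Intermediate state for AVG: partial aggregation produces (sum, count)
// pairs; final aggregation merges them and computes sum / count.
struct AvgAccumulator {
  double sum = 0;
  int64_t count = 0;
};

// Partial step: runs close to the data, e.g., on each worker, over raw input.
AvgAccumulator partialAvg(const std::vector<double>& values) {
  AvgAccumulator acc;
  for (double v : values) {
    acc.sum += v;
    acc.count += 1;
  }
  return acc;
}

// Final step: merges intermediate results (possibly from many workers).
double finalAvg(const std::vector<AvgAccumulator>& partials) {
  AvgAccumulator merged;
  for (const auto& p : partials) {
    merged.sum += p.sum;
    merged.count += p.count;
  }
  return merged.count == 0 ? 0.0 : merged.sum / merged.count;
}

int main() {
  auto p1 = partialAvg({1.0, 2.0, 3.0});   // worker 1
  auto p2 = partialAvg({10.0, 20.0});      // worker 2
  std::cout << finalAvg({p1, p2}) << "\n"; // (6 + 30) / 5 = 7.2
}
```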
5.5 Operators
Velox query plans are composed of a tree of PlanNodes.
Some examples of PlanNodes are Filter, Project, TableScan, Aggregation, HashJoin, and Exchange.
To execute a query plan, plan nodes are first converted into Operators.
The top-level Velox execution concept is Task.
It is a unit of function shipping in distributed execution and corresponds to a query plan fragment along with its Operator tree.
A Task starts with a TableScan or an Exchange (shuffle) source as input and ends in another Exchange.
The Operator tree of a Task is decomposed into one or more linear sub-trees called Pipelines.
For example, a HashJoin node is converted into HashProbe and HashBuild operators, which are mapped to one Pipeline each.
Each Pipeline is executed by one or more Drivers. Drivers are put on or taken off a thread depending on whether they have work to perform.
Tasks can be canceled or paused by other Velox actors at any time.
All operators implement the same base API. Velox already provides an extensive set of commonly used Operators.
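The sketch below illustrates the general shape of such a uniform interface and of a Driver loop pulling batches through a linear Pipeline; it is a simplified stand-in, not Velox's actual Operator class.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <optional>
#include <vector>

using Batch = std::vector<int64_t>;  // stand-in for a columnar row batch

// Hypothetical uniform operator interface: every operator consumes and
// produces batches, so a Driver can run any linear pipeline the same way.
class Operator {
 public:
  virtual ~Operator() = default;
  virtual std::optional<Batch> next(std::optional<Batch> input) = 0;
};

// Leaf operator: produces batches from a "scan" until exhausted.
class ScanOp : public Operator {
 public:
  explicit ScanOp(std::vector<Batch> batches) : batches_(std::move(batches)) {}
  std::optional<Batch> next(std::optional<Batch>) override {
    if (pos_ >= batches_.size()) return std::nullopt;
    return batches_[pos_++];
  }
 private:
  std::vector<Batch> batches_;
  size_t pos_ = 0;
};

// Filter operator: keeps only even values from its input batch.
class FilterEvenOp : public Operator {
 public:
  std::optional<Batch> next(std::optional<Batch> input) override {
    if (!input) return std::nullopt;
    Batch out;
    for (auto v : *input) if (v % 2 == 0) out.push_back(v);
    return out;
  }
};

// Driver: runs one pipeline (a linear chain of operators) to completion.
void runPipeline(std::vector<std::unique_ptr<Operator>>& pipeline) {
  while (true) {
    std::optional<Batch> batch = pipeline[0]->next(std::nullopt);
    if (!batch) break;  // source exhausted
    for (size_t i = 1; i < pipeline.size(); ++i) {
      batch = pipeline[i]->next(std::move(batch));
    }
    for (auto v : *batch) std::cout << v << " ";
    std::cout << "\n";
  }
}

int main() {
  std::vector<std::unique_ptr<Operator>> pipeline;
  pipeline.push_back(std::make_unique<ScanOp>(std::vector<Batch>{{1, 2, 3, 4}, {5, 6}}));
  pipeline.push_back(std::make_unique<FilterEvenOp>());
  runPipeline(pipeline);  // prints "2 4" then "6"
}
```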
Let’s discuss characteristics and optimizations present in common operators.
5.5.1 Table Scans, Filter, and Project
Table scans are done column-by-column with filter pushdown support.
Columns that have filters on them are processed first.
Filters are adaptively reordered at run time (sketched below).
Simple filters evaluate multiple values at a time using SIMD.
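A crude sketch of the adaptive-ordering idea: track, per filter, how many values it drops, and periodically reorder the filters so the most effective ones run first on subsequent batches. The heuristics and names below are illustrative, not Velox's implementation, which also weighs per-value cost.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Per-filter statistics collected while scanning: filters that drop many
// rows per unit of work should be evaluated first on subsequent batches.
struct ScanFilter {
  std::string name;
  std::function<bool(int64_t)> predicate;
  uint64_t valuesIn = 0;
  uint64_t valuesOut = 0;

  double dropRate() const {
    return valuesIn == 0 ? 0.0 : 1.0 - double(valuesOut) / double(valuesIn);
  }
};

// Apply all filters to a batch (conjunction), recording selectivity stats.
std::vector<int64_t> applyFilters(std::vector<ScanFilter>& filters,
                                  const std::vector<int64_t>& batch) {
  std::vector<int64_t> rows = batch;
  for (auto& f : filters) {
    std::vector<int64_t> kept;
    f.valuesIn += rows.size();
    for (auto v : rows) if (f.predicate(v)) kept.push_back(v);
    f.valuesOut += kept.size();
    rows = std::move(kept);
  }
  // Adaptive step: reorder so the filter that drops the most rows runs first
  // on the next batch (a real system would also weigh per-value cost).
  std::sort(filters.begin(), filters.end(),
            [](const ScanFilter& a, const ScanFilter& b) {
              return a.dropRate() > b.dropRate();
            });
  return rows;
}

int main() {
  std::vector<ScanFilter> filters{
      {"x > 100", [](int64_t v) { return v > 100; }},
      {"x % 2 == 0", [](int64_t v) { return v % 2 == 0; }}};

  applyFilters(filters, {1, 2, 3, 101, 102, 200});
  std::cout << "first filter after one batch: " << filters[0].name << "\n";
}
```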
5.5.2 Aggregate and Hash Joins
Velox provides a carefully designed hash table implementation.
Hashing keys are processed in a columnar manner using an abstraction called VectorHasher. It recognizes the key ranges and cardinality and, where applicable, translates keys to a smaller integer domain.
The hash table layout is similar to Meta’s F14. The hash table values are stored row-wise to minimize cache misses.
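A simplified illustration of the "smaller integer domain" idea: when the observed ranges of the key columns are small, each key combination can be mapped to a dense integer and used directly as an array index, bypassing a general-purpose hash table. This is hypothetical code, not the VectorHasher implementation.

```cpp
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// If key1 is known to lie in [0, 3] and key2 in [0, 9], the pair can be
// mapped to a dense integer in [0, 40): groups become array slots instead
// of hash-table entries, which avoids hashing and probing altogether.
int64_t normalizedKey(int64_t key1, int64_t key2) {
  constexpr int64_t kKey2Range = 10;  // observed range of key2
  return key1 * kKey2Range + key2;    // unique slot per (key1, key2)
}

int main() {
  constexpr int64_t kNumSlots = 4 * 10;
  std::vector<int64_t> counts(kNumSlots, 0);  // COUNT(*) GROUP BY key1, key2

  // Some input rows (key1, key2).
  std::vector<std::pair<int64_t, int64_t>> rows{{0, 1}, {0, 1}, {3, 9}, {2, 5}};
  for (const auto& [k1, k2] : rows) {
    counts[normalizedKey(k1, k2)]++;  // direct array indexing, no hashing
  }

  std::cout << "count for (0,1): " << counts[normalizedKey(0, 1)] << "\n";  // 2
}
```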
5.6 Memory Management
Velox Tasks track memory usage via memory pools.
Small objects like query plans, expression trees, and other control structures are allocated directly from the C++ heap.
Larger objects are allocated using a custom allocator designed to avoid fragmentation for large allocations.
Memory consumers may provide mechanisms to recover memory when the system runs low.
Velox provides support for both memory and SSD caching. Memory caching acts as a special memory user and can consume all memory that is not allocated otherwise.
All IO buffers are allocated directly from the memory cache and can have arbitrary sizes.
Cached columns are first read from disaggregated storage systems (such as S3 or HDFS), kept in RAM for their first use, and eventually persisted to local SSD.
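A coarse sketch of the tiered lookup order this implies (RAM first, then local SSD, then remote storage), with invented names; the real caching machinery is considerably more involved, with quotas, eviction, and asynchronous loading, and SSD persistence happens eventually rather than on every miss.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Invented two-tier cache sketch: look in RAM, then local SSD, and only
// then fall back to disaggregated storage, filling the faster tiers on the way back.
class TieredCache {
 public:
  std::string read(const std::string& key) {
    if (auto hit = lookup(ram_, key)) return *hit;   // tier 1: RAM
    if (auto hit = lookup(ssd_, key)) {              // tier 2: local SSD
      ram_[key] = *hit;
      return *hit;
    }
    std::string data = readFromRemoteStorage(key);   // e.g., S3/HDFS
    ram_[key] = data;                                // keep hot in RAM
    ssd_[key] = data;                                // simplified: persist to SSD immediately
    return data;
  }

 private:
  static std::optional<std::string> lookup(
      const std::unordered_map<std::string, std::string>& tier,
      const std::string& key) {
    auto it = tier.find(key);
    if (it == tier.end()) return std::nullopt;
    return it->second;
  }

  std::string readFromRemoteStorage(const std::string& key) {
    std::cout << "remote read: " << key << "\n";
    return "bytes-of-" + key;
  }

  std::unordered_map<std::string, std::string> ram_;
  std::unordered_map<std::string, std::string> ssd_;
};

int main() {
  TieredCache cache;
  cache.read("file1:col_a");  // triggers a remote read
  cache.read("file1:col_a");  // served from RAM, no remote read
}
```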
6. Related Work
DuckDB is an embeddable analytical RDBMS developed as a C++ library.
The Apache Arrow project provides a module containing analytical functions that process Arrow columnar data.
The Apache Arrow library provides Gandiva, an LLVM-based execution environment for analytical kernels over Arrow encoded data.
Photon is a proprietary C++ vectorized execution engine developed by Databricks.
7. Conclusion
The paper presented Velox.
It is a novel open-source C++ database acceleration library.
It provides reusable, extensible, high-performance, and dialect-agnostic data processing components.
These components are being used to unify existing computation engines at Meta.
References
Apache Structured Streaming paper post
Scribe: Transporting petabytes per hour