Insights from Paper: Apache Spark: A Unified Engine for Big Data Processing
1. Abstract
Data volumes are growing rapidly, and single machines can no longer handle them. As a result, many new cluster programming models have been developed.
Most of these models are specialized. For example, MapReduce supports batch processing, Dremel (now Google BigQuery) supports interactive SQL queries, and Google Pregel supports iterative graph algorithms.
Most big data applications must combine these different processing types to meet their needs, which is challenging.
The Apache Spark project was started to design a unified engine for all distributed data processing needs.
Spark’s programming model is built around a data-sharing abstraction called “Resilient Distributed Datasets” (RDDs), an extension of the MapReduce model.
Spark uses this model to capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing.
So what are the benefits? Here are a few:
Applications are more straightforward to develop because they use a unified API.
Combining processing tasks is more efficient because intermediate data can usually stay in memory instead of being written to disk.
It enables new applications (such as interactive queries on a graph and streaming machine learning) that were not possible with earlier models.
2. Programming Model
Let’s dive deep into the programming model.
Spark introduced an abstraction called Resilient Distributed Datasets, or RDDs for short. In simple terms, an RDD is a collection of objects.
Its power comes from a few other aspects.
It is fault-tolerant.
It is partitioned across the machines of the cluster.
It can be manipulated in parallel.
Spark exposes RDDs through a functional programming API in Scala, Java, Python, and R.
Users need to provide local functions to run on the cluster.
Let’s take an example. The following Scala code creates an RDD representing the error messages in a log file by searching for lines that start with ERROR, and then prints the total number of errors.
val lines = spark.textFile("hdfs://...") // Defines RDD from HDFS file
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())
The API provides transformation methods like map, filter, and groupBy. These methods work on RDDs and create new RDDs.
Spark evaluates RDDs lazily. This allows Spark to find an efficient plan for the user’s computation.
When an action is called, Spark examines the whole graph of transformations and creates an execution plan.
RDDs provide explicit support for data sharing among computations.
Users can also persist selected RDDs in memory or on disk.
For example, we can persist the errors RDD created in the example above with the following code:
errors.persist()
The good part is that we can then run further queries on this in-memory data:
// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL"))
      .count()

// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
      .map(line => line.split('\t')(3))
      .collect()
Fault Tolerance
Spark has an approach called “lineage” to provide fault tolerance.
Each RDD tracks the graph of transformations used to build it. If partitions are lost, Spark can rerun those operations on the base data to reconstruct them.
In the example query, the lineage graph records that the errors RDD was derived from the lines RDD (read from HDFS) through a filter transformation.
Let’s say a node holding an in-memory partition of errors fails. Spark will rebuild it by applying the filter on the corresponding block of the HDFS file.
Lineage-based recovery is significantly more efficient than replication-based approaches because it avoids copying the data ahead of time.
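Lineage is also visible to users: each RDD can print the transformation graph Spark recorded for it. A minimal sketch, reusing the errors RDD from the example above:

// Print the lineage (chain of transformations) recorded for errors;
// this is the graph Spark replays to rebuild lost partitions.
println(errors.toDebugString)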
Spark integrates with many storage systems. It is commonly used with cluster file systems like HDFS and key-value stores like S3 and Cassandra.
An Example
The paper provides a more extended example that shows where Spark does much better than MapReduce. The code below implements logistic regression in Spark; because the points RDD is persisted in memory, every iteration can reuse it instead of re-reading the input from storage, as a chain of MapReduce jobs would have to.
// Load data into an RDD
val points = sc.textFile(...).map(readPoint).persist()

// Start with a random parameter vector
var w = DenseVector.random(D)

// On each iteration, update the parameter vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w.dot(p.x))))-1) * p.y
  }.reduce((a, b) => a+b)
  w -= gradient
}
3. Higher-Level Libraries
The Spark project has built a variety of higher-level libraries on top of the core engine.
These libraries often achieve state-of-the-art performance on each task while offering significant benefits when users combine them.
Let’s discuss a few of these libraries.
SQL and DataFrames—Spark SQL implements SQL on Spark. In Spark SQL, each record in an RDD holds a series of rows stored in binary format, and the system generates code to run directly against this layout.
On top of the SQL engine, Spark provides one more higher-level abstraction for basic data transformations: DataFrames. These are RDDs of records with a known schema, conceptually similar to R and Python data frames.
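As a rough sketch of what this looks like (using the SparkSession entry point from later Spark releases; the file path and column names are made up for illustration), here is a declarative query that Spark SQL can plan and code-generate:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("logs").getOrCreate()

// Read structured data into a DataFrame; path and schema are hypothetical.
val logs = spark.read.json("hdfs://.../logs.json")

// Declarative transformations: Spark SQL optimizes the whole pipeline.
logs.filter(col("level") === "ERROR")
    .groupBy(col("service"))
    .count()
    .show()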
Spark Streaming - Spark Streaming implements incremental stream processing using a model called discretized streams: the input data is split into small batches, and Spark regularly combines these batches with state stored in RDDs to produce new results.
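A minimal discretized-stream sketch (the host, port, and one-second batch interval are illustrative), counting error lines in each micro-batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-errors")
// Split the input into one-second micro-batches.
val ssc = new StreamingContext(conf, Seconds(1))

// Each batch of lines becomes an RDD; the usual transformations apply.
val lines = ssc.socketTextStream("localhost", 9999)
val errors = lines.filter(_.startsWith("ERROR"))
errors.count().print()

ssc.start()
ssc.awaitTermination()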
GraphX - GraphX provides a graph computation interface. It is similar to Google Pregel.
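A short GraphX sketch (the edge-list file is hypothetical) that loads a graph and runs PageRank in this Pregel-like style:

import org.apache.spark.graphx.GraphLoader

// Load a graph from an edge-list file (one "srcId dstId" pair per line).
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)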
MLlib - It is Spark’s machine learning library. It implements more than 50 standard training algorithms.
Since all the above libraries operate on the RDD abstraction, combining them in an application is easy. Let's see an example.
// Load historical data as an RDD using Spark SQL
val trainingData = sql(
  "SELECT location, language FROM old_tweets")

// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)

// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))
4. Applications
At the time the paper was written, the team had identified more than 1,000 companies using Spark, in areas ranging from web services to biotechnology to finance.
Let’s talk about a few critical use cases of Spark.
Batch processing - It is the most common use case.
Interactive queries - Organizations use Spark SQL for relational queries, often through business intelligence tools like Tableau. Developers and data scientists can interactively use Spark’s Scala, Python, and R interfaces through shells or visual notebook environments.
Stream processing - Real-time processing has widespread uses, such as network security monitoring, prescriptive analytics, and log mining.
Scientific applications - Spark is used for large-scale spam detection, image processing, and genomic data processing.
5. Why Is the Spark Model General?
Let’s examine what the paper has to say on this topic.
We study RDDs from two perspectives. First, from an expressiveness point of view, we argue that RDDs can emulate any distributed computation and will do so efficiently in many cases unless the computation is sensitive to network latency.
Second, from a systems point of view, we show that RDDs give applications control over the most common bottleneck resources in clusters—network and storage I/O—and thus make it possible to express the same optimizations for these resources that characterize specialized systems.
Expressiveness perspective
Let’s start by comparing RDDs to the MapReduce model. What computations can MapReduce express?
Any distributed computation can be emulated with the MapReduce model. However, there are two problems.
First, MapReduce is inefficient at sharing data across timesteps because it relies on replicated external storage systems. Second, the latency of the MapReduce steps determines how well the emulation will match a real network, and most MapReduce implementations were designed for batch environments with minutes to hours of latency.
RDDs and Spark address both of these limitations. RDDs speed up data sharing by avoiding replication of intermediate data. Spark can run MapReduce-like steps on large clusters with 100ms latency.
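To make the emulation concrete, here is a rough word-count sketch in which the MapReduce map phase becomes flatMap/map over an RDD and the shuffle-plus-reduce phase becomes reduceByKey (the input path is illustrative):

// "Map" phase: emit (word, 1) pairs.
val pairs = sc.textFile("hdfs://.../docs")
              .flatMap(line => line.split(" "))
              .map(word => (word, 1))

// "Shuffle + reduce" phase: group by key and sum the counts.
val counts = pairs.reduceByKey(_ + _)
counts.take(5).foreach(println)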
Systems perspective
Let’s start by asking what the bottleneck resources are in cluster computations.
Current data centers have a steep storage hierarchy that limits most applications similarly.
Local storage. Most nodes have local memory with approximately 50GB/s of bandwidth, plus 10 to 20 local disks giving roughly 1GB/s to 2GB/s of aggregate disk bandwidth.
Links. Most nodes have a 10Gbps (1.3GB/s) link.
Racks. Most nodes are organized into racks of 20 to 40 machines, with 40Gbps–80Gbps bandwidth out of each rack.
Here, the most critical performance concern is the placement of data and computation in the network.
RDDs provide facilities to control this placement, including control over data partitioning and colocation, as the sketch below illustrates.
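A hedged sketch of that control (the key-value datasets are hypothetical): hash-partitioning two RDDs with the same partitioner and persisting them lets a later join run without shuffling either side across the network.

import org.apache.spark.HashPartitioner

// Partition both datasets identically so matching keys land on the
// same nodes; persist them to keep that placement in memory.
val partitioner = new HashPartitioner(100)
val users = sc.textFile("hdfs://.../users")
              .map(line => (line.split(",")(0), line))
              .partitionBy(partitioner).persist()
val events = sc.textFile("hdfs://.../events")
              .map(line => (line.split(",")(0), line))
              .partitionBy(partitioner).persist()

// Because both sides share the partitioner, this join avoids a full shuffle.
val joined = users.join(events)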
Beyond network and I/O bandwidth, the most common bottleneck tends to be CPU time. When it is, Spark can run the same algorithms and libraries used by specialized systems on each node, so little performance is lost relative to a specialized engine.
6. Ongoing Work
At the time the paper was written, four major lines of work were in progress.
DataFrames and Declarative APIs
Spark’s development has increasingly focused on improving usability through more declarative APIs like DataFrames and Datasets.
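A small sketch of the Dataset-style API (the case class, file path, and column values are made up), where a typed, declarative pipeline gives the optimizer room to plan execution as a whole:

import org.apache.spark.sql.SparkSession

case class Tweet(location: String, language: String)

val spark = SparkSession.builder().appName("datasets").getOrCreate()
import spark.implicits._

// Read a DataFrame and view it as a typed Dataset[Tweet].
val tweets = spark.read.json("hdfs://.../tweets.json").as[Tweet]

// Declarative, typed transformations that Spark can optimize end to end.
tweets.filter(_.language == "en")
      .groupBy("location")
      .count()
      .show()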
Performance Optimizations
Much of the recent work in Spark has focused on optimizing performance. Notable efforts include Project Tungsten, which aims to remove Java Virtual Machine (JVM) overhead using techniques like code generation and non-garbage-collected memory.
Language Support
The Spark ecosystem continues to expand with support for additional programming languages, including R, through the SparkR project.
Research libraries
Apache Spark continues to be used to build higher-level data processing libraries. Recent projects include Thunder for neuroscience, ADAM for genomics, and Kira for image processing in astronomy.
7. Conclusion
The Spark project has introduced a unified programming model and engine for big data applications. This model can efficiently support today’s workloads and substantially benefit users.
References
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark SQL: Relational Data Processing in Spark
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
GraphX: Graph Processing in a Distributed Dataflow Framework