
Spark Beam Runner (Spark-Runner)

Intro

The Spark-Runner allows users to execute data pipelines written against the Apache Beam API with Apache Spark. It can execute both batch and streaming pipelines on top of the Spark engine.

Overview

Features

  • ParDo
  • GroupByKey
  • Combine
  • Windowing
  • Flatten
  • View
  • Side inputs/outputs
  • Encoding
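
As a rough illustration of how several of these primitives compose, here is a minimal sketch (class and data names below are ours, not code from this repository; it assumes the Beam Java SDK is on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PrimitivesSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // ParDo: split each line into words.
    PCollection<String> words = p
        .apply(Create.of("carry on", "carry on beam"))
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String word : c.element().split(" ")) {
              c.output(word);
            }
          }
        }));

    // Count.perElement() is a composite built on GroupByKey and Combine.
    PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

    p.run().waitUntilFinish();
  }
}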

Fault-Tolerance

The Spark runner provides the same fault-tolerance guarantees as Apache Spark.

Monitoring

The Spark runner supports user-defined counters via Beam Aggregators, implemented on top of Spark's Accumulators.
The Aggregators (defined by the pipeline author) and Spark's internal metrics are reported using Spark's metrics system.
Spark also provides a web UI for monitoring; see the Spark monitoring documentation for more details.
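
As a sketch of what such a user-defined counter can look like in pipeline code (the names below are illustrative, and the exact Aggregator API depends on the SDK version in use):

// Illustrative only: a DoFn that counts empty lines with a Beam Aggregator,
// which this runner backs with a Spark Accumulator. createAggregator and
// Sum.SumLongFn are assumed from the SDK version this README targets.
static class FilterEmptyLinesFn extends DoFn<String, String> {
  private final Aggregator<Long, Long> emptyLines =
      createAggregator("emptyLines", new Sum.SumLongFn());

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.element().trim().isEmpty()) {
      emptyLines.addValue(1L);
    } else {
      c.output(c.element());
    }
  }
}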

Beam Model support

Batch

The Spark runner provides full support for the Beam Model in batch processing via Spark RDDs.

Streaming

Providing full support for the Beam Model in streaming pipelines is under development. To follow up, you can subscribe to our mailing list.

Issue Tracking

See Beam JIRA (runner-spark)

Getting Started

To get the latest version of the Spark Runner, first clone the Beam repository:

git clone https://github.com/apache/beam

Then switch to the newly created directory and run Maven to build Apache Beam:

cd beam
mvn clean install -DskipTests

Now Apache Beam and the Spark Runner are installed in your local Maven repository.

If we wanted to run a Beam pipeline with the default options of a Spark instance in local mode, we would do the following:

Pipeline p = <logic for pipeline creation>
PipelineResult result = p.run();
result.waitUntilFinish();
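
One common way to obtain those default options is from command-line arguments; a minimal sketch, assuming the standard PipelineOptionsFactory entry points:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LocalModeSketch {
  public static void main(String[] args) {
    // Run with e.g.: --runner=SparkRunner --sparkMaster=local
    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(SparkPipelineOptions.class);

    Pipeline p = Pipeline.create(options);  // attach transforms to p here
    PipelineResult result = p.run();
    result.waitUntilFinish();
  }
}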

To run against a different Spark cluster with a custom master URL, we would do the following:

SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);
options.setSparkMaster("spark://host:port");
Pipeline p = <logic for pipeline creation>
PipelineResult result = p.run();
result.waitUntilFinish();
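
Putting the pieces together, a complete (if trivial) pipeline might look like the following sketch; the class name, master URL, and transforms are illustrative, not code from this repository:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

public class ClusterSketch {
  public static void main(String[] args) {
    SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("spark://host:port");  // point at your Spark master

    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("to", "be", "or", "not", "to", "be"))
     .apply(Count.<String>perElement());

    PipelineResult result = p.run();
    result.waitUntilFinish();
  }
}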

Word Count Example

First download a text document to use as input:

curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt

Switch to the Spark runner directory:

cd runners/spark

Then run the word count example from the SDK using a Spark instance in local mode:

mvn exec:exec -DmainClass=org.apache.beam.runners.spark.examples.WordCount \
      -Dinput=/tmp/kinglear.txt -Doutput=/tmp/out -Drunner=SparkRunner \
      -DsparkMaster=local

Check the output by running:

head /tmp/out-00000-of-00001

Note: running examples using mvn exec:exec only works for Spark local mode at the moment. See the next section for how to run on a cluster.

Running on a Cluster

Beam pipelines can be run on a Spark cluster using the spark-submit command.

TBD pending native HDFS support (currently blocked by BEAM-59).