The Aggregators (defined by the pipeline author) and Spark’s internal metrics are reported using Spark's metrics system.
Spark also provides a web UI for monitoring, more details here.
The Spark runner provides full support for the Beam Model in batch processing via Spark RDDs.
Providing full support for the Beam Model in streaming pipelines is under development. To follow-up you can subscribe to our mailing list.
See Beam JIRA (runner-spark)
To get the latest version of the Spark Runner, first clone the Beam repository:
git clone https://github.com/apache/incubator-beam
Then switch to the newly created directory and run Maven to build the Apache Beam:
cd incubator-beam mvn clean install -DskipTests
Now Apache Beam and the Spark Runner are installed in your local maven repository.
If we wanted to run a Beam pipeline with the default options of a Spark instance in local mode, we would do the following:
Pipeline p = <logic for pipeline creation > PipelineResult result = p.run(); result.waitUntilFinish();
To create a pipeline runner to run against a different Spark cluster, with a custom master url we would do the following:
SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class); options.setSparkMaster("spark://host:port"); Pipeline p = <logic for pipeline creation > PipelineResult result = p.run(); result.waitUntilFinish();
First download a text document to use as input:
curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
Switch to the Spark runner directory:
cd runners/spark
Then run the word count example from the SDK using a Spark instance in local mode:
mvn exec:exec -DmainClass=org.apache.beam.runners.spark.examples.WordCount \ -Dinput=/tmp/kinglear.txt -Doutput=/tmp/out -Drunner=SparkRunner \ -DsparkMaster=local
Check the output by running:
head /tmp/out-00000-of-00001
Note: running examples using mvn exec:exec
only works for Spark local mode at the moment. See the next section for how to run on a cluster.
Spark Beam pipelines can be run on a cluster using the spark-submit
command.
TBD pending native HDFS support (currently blocked by BEAM-59).