learning/prompts/documentation-lookup-nolinks/54_spark

Prompt: How to execute Apache Beam pipelines with Apache Spark?

Response: Apache Beam supports multiple runners for executing pipelines on various platforms. One such runner is the Spark Runner, tailored for running pipelines using Apache Spark. The Spark Runner enables you to leverage the scalability and parallel processing capabilities of Apache Spark for your data processing tasks.

Key features of the Spark Runner include:

Support for batch, streaming, and unified pipelines.
Offering the same fault-tolerance guarantees and security features as provided by Apache Spark.
Built-in metrics reporting using Apache Spark’s metrics system, which also reports Beam Aggregators.
Native support for Apache Beam side inputs via Apache Spark’s broadcast variables.

There are three types of Spark Runners available:

Legacy Spark Runner: supports Java (and other JVM-based languages) exclusively, based on Apache Spark’s RDD and DStream.
Structured Streaming Spark Runner: supports Java (and other JVM-based languages) exclusively, based on Apache Spark's Datasets and Structured Streaming framework. Currently, it only supports batch mode with limited coverage of the Apache Beam model.
Portable Spark Runner: supports Java, Python, and Go.

For Java-based applications, consider using the Java-based runners, while for Python or Go pipelines, opt for the portable Runner.

The Spark Runner can execute Spark pipelines similar to a native Spark application, allowing deployment as a self-contained application for local mode, running on Spark Standalone Resource Manager (RM), or using YARN or Mesos.

To execute your Apache Beam pipeline on a Spark Standalone RM, follow these steps:

Java-based Non-portable Spark Runners (Java Only)

1. Specify Dependencies:

In the pom.xml file of your Java project directory, specify your dependency on the latest version of the Spark Runner:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark-3</artifactId>
  <version>2.54.0</version>
</dependency>

2. Deploy Spark with Your Application:

When running pipelines in a Spark Standalone mode, ensure that your self-contained application includes Spark dependencies explicitly in your pom.xml file:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>

Shade the application JAR using the Maven shade plugin and make sure the shaded JAR file is visible in the target directory by running is target.

To run pipelines in a Spark Standalone mode using the legacy RDD/DStream-based Spark Runner, use the following command:

spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner

To run pipelines in a Spark Standalone mode using the Structured Streaming Spark Runner, run the following command:

spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkStructuredStreamingRunner

3. Configure Pipeline Options:

Set the runner option in your pipeline options to specify that you want to use the Spark Runner. In Java, you can do this as follows:

SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);

For additional pipeline configuration options, refer to the Spark Runner documentation.

4. Run Your Pipeline:

In Java, you can use the PipelineRunner to run your pipeline:

Pipeline p = Pipeline.create(options);
// Add transforms to your pipeline
p.run();

5. Monitor Your Job:

Monitor the execution of your pipeline using the Apache Spark Web Interfaces, which provides information about tasks, stages, and overall progress. Access the Spark UI by navigating to the appropriate URL (usually localhost:4040). Metrics are also accessible via the Apache Beam REST API. Apache Spark offers a metrics system for reporting metrics to various sinks.

Portable Spark Runner (Python)

1. Deploy Spark with Your Application:

You will need Docker installed in your execution environment. Pre-built Spark Job Service Docker images are available on Docker Hub.

Start the JobService endpoint:

docker run --net=host apache/beam_spark_job_server:latest

A Beam JobService is a central instance where you submit your Apache Beam pipeline. It needs to be provided with the Spark master address to create a job for execution on your Spark cluster.

Submit the Python pipeline to this endpoint, providing Beam JobService with the Spark master address to execute the job on a Spark cluster:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099", # localhost:8099 is the default address of the JobService
    "--environment_type=LOOPBACK"
])
with beam.Pipeline(options) as p:
    ...

2. Configure Pipeline Options:

Set the runner option in your pipeline options to specify that you want to use the Spark Runner. In Python, you can do this as follows:

from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions()
options.view_as(SparkRunnerOptions).runner = 'SparkRunner'

For additional pipeline configuration options, refer to the Spark Runner documentation.

3. Run Your Pipeline:

In Python, you can use the run() method of your pipeline object to execute the pipeline:

# Run your pipeline
p.run()

4. Monitor Your Job:

You can monitor the execution of your pipeline using the Apache Spark Web Interfaces, which provides information about tasks, stages, and overall progress. Access the Spark UI by navigating to the appropriate URL (usually localhost:4040) in your web browser. Spark metrics are not yet supported on the portable runner.

For more details and alternative methods of deployment, check the Apache Beam documentation on Spark Runner capabilities.