Prompt: How to execute Apache Beam pipelines with Apache Spark?
Response:
Apache Beam supports multiple runners for executing pipelines on various platforms. One such runner is the [Spark Runner](https://beam.apache.org/documentation/runners/spark/), tailored for running pipelines using [Apache Spark](https://spark.apache.org/). The Spark Runner enables you to leverage the scalability and parallel processing capabilities of Apache Spark for your data processing tasks.
Key features of the Spark Runner include:
* Support for batch, streaming, and unified pipelines.
* The same [fault-tolerance guarantees](https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics) and [security features](https://spark.apache.org/docs/latest/security.html) as Apache Spark.
* Built-in metrics reporting using Apache Spark’s [metrics system](https://spark.apache.org/docs/latest/monitoring.html#metrics), which also reports Beam Aggregators.
* Native support for Apache Beam side inputs via Apache Spark’s [broadcast variables](https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables).
There are three types of Spark Runners available:
1. Legacy Spark Runner: supports Java (and other JVM-based languages) exclusively, based on Apache Spark’s [RDD and DStream](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html).
2. Structured Streaming Spark Runner: supports Java (and other JVM-based languages) exclusively, based on Apache Spark's [Datasets](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) and [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) framework. Currently, it only supports batch mode with limited coverage of the Apache Beam model.
3. Portable Spark Runner: supports Java, Python, and Go.
For Java pipelines, use one of the Java-based runners; for Python or Go pipelines, use the Portable Spark Runner.
The Spark Runner executes Beam pipelines just like a native Spark application, allowing deployment as a self-contained application in local mode, on the [Spark Standalone Resource Manager (RM)](https://spark.apache.org/docs/latest/spark-standalone.html), or using [YARN](https://spark.apache.org/docs/latest/running-on-yarn.html) or [Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html).
To execute your Apache Beam pipeline on a Spark Standalone RM, follow these steps:
***Java-based Non-portable Spark Runners (Java Only)***
***1. Specify Dependencies:***
In the `pom.xml` file of your Java project directory, specify your dependency on the latest version of the Spark Runner:
```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark-3</artifactId>
  <version>2.54.0</version>
</dependency>
```
***2. Deploy Spark with Your Application:***
When running pipelines in Spark Standalone mode, your self-contained application must explicitly include the Spark dependencies in its `pom.xml` file:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>
```
Shade the application JAR using the [Maven shade plugin](https://maven.apache.org/plugins/maven-shade-plugin/) and confirm that the shaded JAR file appears in the target directory by running `ls target`.
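For reference, a minimal shade-plugin configuration might look like the following. This is a sketch: the plugin version is an illustrative assumption, and your project may need additional settings (for example, filters for signed JARs):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.5.1</version>
  <executions>
    <execution>
      <!-- Bind the shade goal to the package phase so `mvn package`
           produces the shaded JAR. -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the shaded JAR alongside the main artifact with a
             "-shaded" classifier, matching the spark-submit command below. -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
```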
To run pipelines in Spark Standalone mode using the legacy RDD/DStream-based Spark Runner, use the following command:
```shell
spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner
```
To run pipelines in Spark Standalone mode using the Structured Streaming Spark Runner, run the following command:
```shell
spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkStructuredStreamingRunner
```
***3. Configure Pipeline Options:***
Set the runner option in your pipeline options to specify that you want to use the Spark Runner. In Java, you can do this as follows:
```java
SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);
```
For additional pipeline configuration options, refer to the [Pipeline Options for the Spark Runner](https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner) section in the Spark Runner documentation.
***4. Run Your Pipeline:***
In Java, call `run()` on your `Pipeline` object to execute the pipeline:
```java
Pipeline p = Pipeline.create(options);
// Add transforms to your pipeline
p.run().waitUntilFinish();
```
***5. Monitor Your Job:***
Monitor the execution of your pipeline using the Apache Spark [Web Interfaces](https://spark.apache.org/docs/latest/monitoring.html#web-interfaces), which provide information about tasks, stages, and overall progress. Access the Spark UI by navigating to the appropriate URL (usually `localhost:4040`). Metrics are also accessible via the Apache Spark [REST API](https://spark.apache.org/docs/latest/monitoring.html#rest-api). In addition, Apache Spark offers a [metrics system](https://spark.apache.org/docs/latest/monitoring.html#metrics) for reporting metrics to various sinks.
***Portable Spark Runner (Python)***
***1. Deploy Spark with Your Application:***
You will need Docker installed in your execution environment. Pre-built Spark Job Service Docker images are available on [Docker Hub](https://hub.docker.com/r/apache/beam_spark_job_server).
Start the JobService endpoint:
```shell
docker run --net=host apache/beam_spark_job_server:latest

# To target an existing Spark cluster, pass its master URL:
docker run --net=host apache/beam_spark_job_server:latest \
    --spark-master-url=spark://localhost:7077
```
A Beam JobService is the central instance to which you submit your Apache Beam pipeline; it must be given the Spark master address so it can create jobs for execution on your Spark cluster. Submit the Python pipeline to this endpoint:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",  # localhost:8099 is the default address of the JobService
    "--environment_type=LOOPBACK"
])
with beam.Pipeline(options=options) as p:
    ...
```
***2. Configure Pipeline Options:***
Set the runner option in your pipeline options to specify that you want to use the Spark Runner. In Python, you can do this as follows:
```python
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'SparkRunner'
```
For additional pipeline configuration options, refer to the [Pipeline Options for the Spark Runner](https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner) section in the Spark Runner documentation.
***3. Run Your Pipeline:***
In Python, you can use the `run()` method of your pipeline object to execute the pipeline:
```python
# Run your pipeline and block until it finishes
p.run().wait_until_finish()
```
***4. Monitor Your Job:***
You can monitor the execution of your pipeline using the Apache Spark [Web Interfaces](https://spark.apache.org/docs/latest/monitoring.html#web-interfaces), which provide information about tasks, stages, and overall progress. Access the Spark UI by navigating to the appropriate URL (usually `localhost:4040`) in your web browser. Note that Spark metrics are not yet supported on the portable runner.
For more details and alternative methods of deployment, check the Apache Beam documentation on [Spark Runner capabilities](https://beam.apache.org/documentation/runners/spark/).