Prompt: How to execute Apache Beam pipelines with Cloud Dataflow?
Response: Apache Beam supports multiple runners for executing pipelines on various platforms. One such runner is the Cloud Dataflow Runner, tailored for running pipelines on the Google Cloud Dataflow service. Cloud Dataflow offers fully managed, unified stream and batch data processing with dynamic work rebalancing and built-in autoscaling.
When you execute a pipeline on Cloud Dataflow, the Runner uploads your code and dependencies to a Cloud Storage bucket and creates a Dataflow job, which then executes your pipeline on managed resources within the Google Cloud Platform.
To execute Apache Beam pipelines using the Cloud Dataflow Runner, follow these steps:
1. Setup Your Cloud Project and Resources:
Complete the steps outlined in the ‘Before You Begin’ section of the Cloud Dataflow quickstart for your chosen programming language.
2. Specify Dependencies (Java Only):
When using the Apache Beam Java SDK, specify your dependency on the Cloud Dataflow Runner in the pom.xml file of your Java project directory.
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>2.54.0</version>
  <scope>runtime</scope>
</dependency>
Ensure that you include all necessary dependencies to create a self-contained application. In some cases, such as when starting a pipeline using a scheduler, you'll need to package a self-executing JAR by explicitly adding a dependency in the Project section of your pom.xml file. For more details about running self-executing JARs on Cloud Dataflow, refer to the Self-executing JAR section in the Apache Beam documentation on Cloud Dataflow Runner.
3. Configure Pipeline Options:
Configure the execution details, including the runner (set to dataflow or DataflowRunner), Cloud project ID, region, and streaming mode, using the GoogleCloudOptions interface for Python or the DataflowPipelineOptions interface for Java.
You can utilize pipeline options to control various aspects of how Cloud Dataflow executes your job. For instance, you can specify whether your pipeline runs on worker virtual machines, on the Cloud Dataflow service backend, or locally. For additional pipeline configuration options, refer to the reference documentation for the GoogleCloudOptions interface (Python) or the DataflowPipelineOptions interface (Java).
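As an illustration, here is a minimal Java sketch of configuring these options programmatically instead of passing them on the command line. The class name is arbitrary, and PROJECT_ID, REGION, and BUCKET_NAME are placeholders for your own values.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsExample {
  public static void main(String[] args) {
    // Parse any command-line flags, then view the result as Dataflow-specific options.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);               // execute on Cloud Dataflow
    options.setProject("PROJECT_ID");                      // your Cloud project ID
    options.setRegion("REGION");                           // e.g. us-central1
    options.setGcpTempLocation("gs://BUCKET_NAME/temp/");  // Cloud Storage path for temporary files

    Pipeline pipeline = Pipeline.create(options);
    // ... apply transforms and run the pipeline (see step 4) ...
  }
}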
4. Run Your Pipeline on Cloud Dataflow:
Execute your pipeline on Cloud Dataflow using the appropriate command for your SDK.
The following example code, taken from the Cloud Dataflow quickstarts for Java and Python, shows how to run the WordCount example pipeline on Dataflow.
For the Apache Beam Java SDK, in your terminal, run the following command (from your word-count-beam directory):
mvn -Pdataflow-runner compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--project=PROJECT_ID \
    --gcpTempLocation=gs://BUCKET_NAME/temp/ \
    --output=gs://BUCKET_NAME/output \
    --runner=DataflowRunner \
    --region=REGION"
For the Apache Beam Python SDK, in your terminal, run the following command:
python -m apache_beam.examples.wordcount \
  --region DATAFLOW_REGION \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://STORAGE_BUCKET/results/outputs \
  --runner DataflowRunner \
  --project PROJECT_ID \
  --temp_location gs://STORAGE_BUCKET/tmp/
Replace placeholders such as PROJECT_ID, BUCKET_NAME or STORAGE_BUCKET, and REGION or DATAFLOW_REGION with your Cloud project-specific details.
To learn more about running pipelines on Cloud Dataflow, visit the Cloud Dataflow quickstart for your preferred programming language.
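If you prefer to launch the job from your own main method instead of using the commands above, a minimal Java sketch might look like the following. It assumes the options object configured in step 3, and for brevity it simply copies the sample input file to your bucket rather than running the full WordCount.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;

// Assumes 'options' was configured as shown in step 3.
Pipeline pipeline = Pipeline.create(options);
pipeline
    .apply("ReadLines", TextIO.read().from("gs://dataflow-samples/shakespeare/kinglear.txt"))
    .apply("WriteLines", TextIO.write().to("gs://BUCKET_NAME/output"));

// Submits the job to the Cloud Dataflow service and blocks until it completes.
pipeline.run().waitUntilFinish();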
5. Monitor Your Cloud Dataflow Job:
Monitor the job's progress, view execution details, and check the pipeline's results using the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
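In addition to the console and command-line tools, you can inspect the job from code through the PipelineResult handle returned by run(). This is a minimal sketch assuming the pipeline object from step 4.
import org.apache.beam.sdk.PipelineResult;

// run() returns a handle to the Dataflow job that can be polled from code.
PipelineResult result = pipeline.run();
System.out.println("Job state: " + result.getState());   // e.g. RUNNING
result.waitUntilFinish();                                 // block until the job finishes
System.out.println("Final state: " + result.getState()); // e.g. DONE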
For more information on the supported capabilities of the Cloud Dataflow Runner, refer to the Beam Capability Matrix in the Apache Beam documentation.