{{< language-switcher java py >}}
The Google Cloud Dataflow Runner uses the Cloud Dataflow managed service. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.
The Cloud Dataflow Runner and service are suitable for large-scale, continuous jobs, and provide:

* a fully managed service
* autoscaling of the number of workers throughout the lifetime of the job
* dynamic work rebalancing
The Beam Capability Matrix documents the supported capabilities of the Cloud Dataflow Runner.
To use the Cloud Dataflow Runner, you must complete the setup in the Before you begin section of the Cloud Dataflow quickstart for your chosen language.
{{< paragraph class="language-java" >}} When using Java, you must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`. {{< /paragraph >}}

{{< highlight java >}}
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>{{< param release_latest >}}</version>
  <scope>runtime</scope>
</dependency>
{{< /highlight >}}
{{< paragraph class="language-py" >}} This section is not applicable to the Beam SDK for Python. {{< /paragraph >}}
{{< paragraph class="language-java" >}} In some cases, such as starting a pipeline using a scheduler such as Apache Airflow, you must have a self-contained application. You can pack a self-executing JAR by explicitly adding the following dependency in the Project section of your `pom.xml`, in addition to the existing dependency shown in the previous section. {{< /paragraph >}}
{{< highlight java >}}
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>${beam.version}</version>
  <scope>runtime</scope>
</dependency>
{{< /highlight >}}
{{< paragraph class="language-java" >}} Then, add the mainClass name in the Maven JAR plugin. {{< /paragraph >}}
{{< highlight java >}}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <version>${maven-jar-plugin.version}</version>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <mainClass>YOUR_MAIN_CLASS_NAME</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
{{< /highlight >}}
{{< paragraph class="language-java" >}} After running `mvn package -Pdataflow-runner`, run `ls target` and you should see (assuming your `artifactId` is `beam-examples` and the version is `1.0.0`) the following output. {{< /paragraph >}}
{{< highlight java >}} beam-examples-bundled-1.0.0.jar {{< /highlight >}}
{{< paragraph class="language-java" >}} To run the self-executing JAR on Cloud Dataflow, use the following command. {{< /paragraph >}}
{{< highlight java >}}
java -jar target/beam-examples-bundled-1.0.0.jar \
  --runner=DataflowRunner \
  --project=<YOUR_GCP_PROJECT_ID> \
  --region=<GCP_REGION> \
  --tempLocation=gs://<YOUR_GCS_BUCKET>/temp/ \
  --output=gs://<YOUR_GCS_BUCKET>/output
{{< /highlight >}}
{{< paragraph class="language-java" >}} When executing your pipeline with the Cloud Dataflow Runner (Java), consider these common pipeline options. {{< /paragraph >}}

{{< paragraph class="language-py" >}} When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options. {{< /paragraph >}}
See the reference documentation for the <span class="language-java">[DataflowPipelineOptions](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)</span><span class="language-py">[PipelineOptions](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.options.pipeline_options.html#apache_beam.options.pipeline_options.PipelineOptions)</span> interface (and any subinterfaces) for additional pipeline configuration options.
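{{< paragraph class="language-java" >}} As a minimal Java sketch, these options can also be set programmatically instead of on the command line. The class name and the project, region, and bucket values below are placeholders, not values from this guide. {{< /paragraph >}}

{{< highlight java >}}
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsExample {
  public static void main(String[] args) {
    // Parse any --key=value flags, then fill in Dataflow-specific options.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project-id");         // placeholder project ID
    options.setRegion("us-central1");                // placeholder region
    options.setTempLocation("gs://my-bucket/temp/"); // placeholder bucket

    Pipeline pipeline = Pipeline.create(options);
    // ... apply your transforms here ...
    pipeline.run();
  }
}
{{< /highlight >}}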
While your pipeline executes, you can monitor the job's progress, view details on execution, and receive updates on the pipeline's results by using the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
To block until your job completes, call <span class="language-java">`waitUntilFinish`</span><span class="language-py">`wait_until_finish`</span> on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. Note that pressing Ctrl+C from the command line does not cancel your job, even while the result is connected to the active job. To cancel the job, you can use the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
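{{< paragraph class="language-java" >}} A short Java sketch, assuming a `pipeline` object built as in the previous examples: {{< /paragraph >}}

{{< highlight java >}}
import org.apache.beam.sdk.PipelineResult;

// Run the pipeline and block until the Dataflow job reaches a terminal state.
PipelineResult result = pipeline.run();
// Job status updates and console messages are printed while waiting.
PipelineResult.State finalState = result.waitUntilFinish();
System.out.println("Final job state: " + finalState);
{{< /highlight >}}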
If your pipeline uses an unbounded data source or sink, you must set the `streaming` option to `true`.
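{{< paragraph class="language-java" >}} In Java, a minimal sketch of enabling streaming on an existing `options` object (as built above): {{< /paragraph >}}

{{< highlight java >}}
import org.apache.beam.sdk.options.StreamingOptions;

// Enable streaming execution; this is what the --streaming
// command-line flag sets under the hood.
options.as(StreamingOptions.class).setStreaming(true);
{{< /highlight >}}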
When using streaming execution, keep the following considerations in mind:

1. Streaming pipelines do not terminate unless explicitly cancelled by the user. You can cancel your streaming job from the Dataflow Monitoring Interface or with the Dataflow Command-line Interface (`gcloud dataflow jobs cancel` command); a programmatic variant is sketched after this list.
2. Streaming jobs use a Google Compute Engine machine type of `n1-standard-2` or higher by default. You must not override this, as `n1-standard-2` is the minimum required machine type for running streaming jobs.
3. Streaming execution pricing differs from batch execution.
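{{< paragraph class="language-java" >}} As a hedged Java sketch, a streaming job can also be cancelled from the driver program through the `PipelineResult`, assuming the result of `pipeline.run()` is still in scope: {{< /paragraph >}}

{{< highlight java >}}
import java.io.IOException;
import org.apache.beam.sdk.PipelineResult;

PipelineResult result = pipeline.run();
try {
  // Request cancellation of the running streaming job.
  result.cancel();
} catch (IOException e) {
  // Fall back to the Monitoring Interface or gcloud if cancellation fails.
  System.err.println("Failed to cancel job: " + e.getMessage());
}
{{< /highlight >}}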