---
layout: section
title: "Beam Quickstart for Java"
permalink: /get-started/quickstart-java/
section_menu: section-menu/get-started.html
redirect_from:
---
This Quickstart will walk you through executing your first Beam pipeline to run [WordCount]({{ site.baseurl }}/get-started/wordcount-example), written using Beam's [Java SDK]({{ site.baseurl }}/documentation/sdks/java), on a [runner]({{ site.baseurl }}/documentation#runners) of your choice.
Download and install the Java Development Kit (JDK) version 8. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
Download and install Apache Maven by following Maven's installation guide for your specific operating system.
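Before generating the project, you can sanity-check the toolchain by querying the running JVM directly. This `CheckJava` class is a hypothetical helper for verification only, not part of the quickstart project:

```java
// Sanity check: print the JVM version and JAVA_HOME that builds will use.
public class CheckJava {
    public static void main(String[] args) {
        // Version of the JVM executing this class (the quickstart targets JDK 8)
        System.out.println("java.version = " + System.getProperty("java.version"));
        // JAVA_HOME as child processes such as Maven would see it
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("JAVA_HOME = " + (javaHome == null ? "<not set>" : javaHome));
    }
}
```

If `JAVA_HOME` prints `<not set>`, fix your environment before running Maven.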
The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam's WordCount examples and builds against the most recent Beam release:
{:.shell-unix}
```
$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion={{ site.release_latest }} \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
```
{:.shell-PowerShell}
```
PS> mvn archetype:generate `
 -D archetypeGroupId=org.apache.beam `
 -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples `
 -D archetypeVersion={{ site.release_latest }} `
 -D groupId=org.example `
 -D artifactId=word-count-beam `
 -D version="0.1" `
 -D package=org.apache.beam.examples `
 -D interactiveMode=false
```
This will create a directory `word-count-beam` that contains a simple `pom.xml` and a series of example pipelines that count words in text files.
{:.shell-unix}
```
$ cd word-count-beam/
$ ls
pom.xml	src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java	WindowedWordCount.java	common
MinimalWordCount.java	WordCount.java
```
{:.shell-PowerShell}
```
PS> cd .\word-count-beam
PS> dir
...
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                src
-a----        7/19/2018  11:00 PM          16051 pom.xml

PS> dir .\src\main\java\org\apache\beam\examples
...
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                common
d-----        7/19/2018  11:00 PM                complete
d-----        7/19/2018  11:00 PM                subprocess
-a----        7/19/2018  11:00 PM           7073 DebuggingWordCount.java
-a----        7/19/2018  11:00 PM           5945 MinimalWordCount.java
-a----        7/19/2018  11:00 PM           9490 WindowedWordCount.java
-a----        7/19/2018  11:00 PM           7662 WordCount.java
```
For a detailed introduction to the Beam concepts used in these examples, see the [WordCount Example Walkthrough]({{ site.baseurl }}/get-started/wordcount-example). Here, we'll just focus on executing `WordCount.java`.
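Conceptually, WordCount tokenizes each line of input into words and tallies how often each word appears. Stripped of Beam's distributed execution, the core transformation looks roughly like the plain-Java sketch below (the tokenizer regex is an assumption for illustration; it is not the pipeline code itself):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class NaiveWordCount {
    // Count word occurrences across the given lines -- roughly what the
    // pipeline's tokenize-then-Count steps compute, but sequentially.
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Split on runs of non-letter characters (an assumed tokenizer)
            for (String word : line.split("[^\\p{L}]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. two lines of input, "hi" appears twice
        System.out.println(count(Arrays.asList("hi world", "hi")));
    }
}
```

The real pipeline expresses the same logic as parallel transforms, which is what lets a runner distribute the work.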
A single Beam pipeline can run on multiple Beam [runners]({{ site.baseurl }}/documentation#runners), including the [ApexRunner]({{ site.baseurl }}/documentation/runners/apex), [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink), [SparkRunner]({{ site.baseurl }}/documentation/runners/spark) or [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow). The [DirectRunner]({{ site.baseurl }}/documentation/runners/direct) is a common runner for getting started, as it runs locally on your machine and requires no specific setup.
After you've chosen which runner you'd like to use, complete any runner-specific setup, then build your command line, specifying the runner with `--runner=<runner>` (defaults to the [DirectRunner]({{ site.baseurl }}/documentation/runners/direct)).

For Unix shells:
{:.runner-direct}
```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
```
{:.runner-apex}
```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
```
{:.runner-flink-local}
```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner
```
{:.runner-flink-cluster}
```
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner

You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081
```
{:.runner-spark}
```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
```
{:.runner-dataflow}
```
Make sure you complete the setup steps at {{ site.baseurl }}/documentation/runners/dataflow/#setup

$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
```
{:.runner-samza-local}
```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=/tmp/counts --runner=SamzaRunner" -Psamza-runner
```
For Windows PowerShell:
{:.runner-direct}
```
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=pom.xml --output=counts" -P direct-runner
```
{:.runner-apex}
```
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -P apex-runner
```
{:.runner-flink-local}
```
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -P flink-runner
```
{:.runner-flink-cluster}
```
PS> mvn package exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=.\target\word-count-beam-bundled-0.1.jar `
               --inputFile=C:\path\to\quickstart\pom.xml --output=C:\tmp\counts" -P flink-runner

You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081
```
{:.runner-spark}
```
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -P spark-runner
```
{:.runner-dataflow}
```
Make sure you complete the setup steps at {{ site.baseurl }}/documentation/runners/dataflow/#setup

PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=DataflowRunner --project=<your-gcp-project> `
               --gcpTempLocation=gs://<your-gcs-bucket>/tmp `
               --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" `
 -P dataflow-runner
```
{:.runner-samza-local}
```
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=pom.xml --output=/tmp/counts --runner=SamzaRunner" -P samza-runner
```
Once the pipeline has completed, you can view the output. You'll notice that there may be multiple output files prefixed by `count`. The exact number of these files is decided by the runner, giving it the flexibility to perform efficient, distributed execution.
{:.runner-direct}
```
$ ls counts*
```
{:.runner-apex}
```
$ ls counts*
```
{:.runner-flink-local}
```
$ ls counts*
```
{:.runner-flink-cluster}
```
$ ls /tmp/counts*
```
{:.runner-spark}
```
$ ls counts*
```
{:.runner-dataflow}
```
$ gsutil ls gs://<your-gcs-bucket>/counts*
```
{:.runner-samza-local}
```
$ ls /tmp/counts*
```
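Runners typically write one file per output shard (with names like `counts-00000-of-00003`). If a single combined file is more convenient for inspection, the shards can simply be concatenated. A minimal sketch using `java.nio.file` (the class and file names here are hypothetical, and shard order in the merged file depends on the filesystem):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MergeShards {
    // Append every file in `dir` whose name starts with `prefix` to `out`.
    static void merge(Path dir, String prefix, Path out) throws IOException {
        Files.deleteIfExists(out);
        try (DirectoryStream<Path> shards = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path shard : shards) {
                Files.write(out, Files.readAllBytes(shard),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. merge all counts-* shards in the current directory
        merge(Paths.get("."), "counts", Paths.get("all-counts.txt"));
    }
}
```

On a Unix shell, `cat counts* > all-counts.txt` achieves the same thing.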
When you look into the contents of the files, you'll see that they contain unique words and the number of occurrences of each word. The order of elements within the files may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.
{:.runner-direct}
```
$ more counts*
api: 9
bundled: 1
old: 4
Apache: 2
The: 1
limitations: 1
Foundation: 1
...
```
{:.runner-apex}
```
$ cat counts*
BEAM: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...
```
{:.runner-flink-local}
```
$ more counts*
The: 1
api: 9
old: 4
Apache: 2
limitations: 1
bundled: 1
Foundation: 1
...
```
{:.runner-flink-cluster}
```
$ more /tmp/counts*
The: 1
api: 9
old: 4
Apache: 2
limitations: 1
bundled: 1
Foundation: 1
...
```
{:.runner-spark}
```
$ more counts*
beam: 27
SF: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
```
{:.runner-dataflow}
```
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
deserves: 32
barrenly: 1
...
```
{:.runner-samza-local}
```
$ more /tmp/counts*
api: 7
are: 2
can: 2
com: 14
end: 14
for: 14
has: 2
...
```
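Because element ordering is not guaranteed, outputs from different runs or runners are easiest to compare after sorting the lines. A small sketch of that normalization in plain Java (the class name and line format are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NormalizeOutput {
    // Sort "word: count" lines so output files from different runs can be
    // compared line-by-line despite nondeterministic element ordering.
    static List<String> normalize(List<String> lines) {
        return lines.stream()
                .filter(l -> !l.trim().isEmpty())  // drop blank lines
                .sorted()                          // lexicographic order
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(normalize(Arrays.asList("old: 4", "api: 9", "Apache: 2")));
    }
}
```

On a Unix shell, `sort counts* | diff - other-run.txt` is the equivalent quick check.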
Please don't hesitate to [reach out]({{ site.baseurl }}/community/contact-us) if you encounter any issues!