title: “WordCount quickstart for Java” aliases:
This quickstart shows you how to set up a Java development environment and run an example pipeline written with the Apache Beam Java SDK, using a runner of your choice.
If you're interested in contributing to the Apache Beam Java codebase, see the Contribution Guide.
On this page:
{{< toc >}}
Generate a Maven example project that builds against the latest Beam release: {{< shell unix >}} mvn archetype:generate
-DarchetypeGroupId=org.apache.beam
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples
-DarchetypeVersion={{< param release_latest >}}
-DgroupId=org.example
-DartifactId=word-count-beam
-Dversion=“0.1”
-Dpackage=org.apache.beam.examples
-DinteractiveMode=false {{< /shell >}} {{< shell powerShell >}} mvn archetype:generate -D archetypeGroupId=org.apache.beam -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples -D archetypeVersion={{< param release_latest >}} -D groupId=org.example -D artifactId=word-count-beam -D version=“0.1” -D package=org.apache.beam.examples -D interactiveMode=false {{< /shell >}}
Maven creates a new project in the word-count-beam directory.
Change into word-count-beam: {{< shell unix >}} cd word-count-beam/ {{< /shell >}} {{< shell powerShell >}} cd .\word-count-beam {{< /shell >}} The directory contains a pom.xml and a src directory with example pipelines.
List the example pipelines: {{< shell unix >}} ls src/main/java/org/apache/beam/examples/ {{< /shell >}} {{< shell powerShell >}} dir .\src\main\java\org\apache\beam\examples {{< /shell >}} You should see the following examples:
The example used in this tutorial, WordCount.java, defines a Beam pipeline that counts words from an input file (by default, a .txt file containing Shakespeare's “King Lear”). To learn more about the examples, see the WordCount Example Walkthrough.
The steps below explain how to convert the build from Maven to Gradle for the following runners:
The conversion process for other runners is similar. For additional guidance, see Migrating Builds From Apache Maven.
repositories, replace mavenLocal() with mavenCentral().repositories, declare a repository for Confluent Kafka dependencies: {{< highlight >}} maven { url = uri(“https://packages.confluent.io/maven/”) } {{< /highlight >}}If you're planning to use the DataflowRunner, you can skip this step. The runner will pull text directly from Google Cloud Storage.
A single Beam pipeline can run on multiple Beam runners. The DirectRunner is useful for getting started, because it runs on your machine and requires no specific setup. If you‘re just trying out Beam and you’re not sure what to use, use the DirectRunner.
The general process for running a pipeline goes like this:
--runner=<runner> (defaults to the DirectRunner).To run the WordCount pipeline:
Follow the setup steps for your runner:
The DirectRunner will work without additional setup.
Run the corresponding Maven or Gradle command below.
For Unix shells:
{{< runner direct >}} mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--inputFile=sample.txt --output=counts” -Pdirect-runner {{< /runner >}} {{< runner flink >}} mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--runner=FlinkRunner --inputFile=sample.txt --output=counts” -Pflink-runner {{< /runner >}} {{< runner flinkCluster >}} mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--runner=FlinkRunner --flinkMaster= --filesToStage=target/word-count-beam-bundled-0.1.jar
--inputFile=sample.txt --output=/tmp/counts” -Pflink-runner {{< /runner >}} {{< runner spark >}} mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--runner=SparkRunner --inputFile=sample.txt --output=counts” -Pspark-runner {{< /runner >}} {{< runner dataflow >}} mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--runner=DataflowRunner --project=
--region=
--gcpTempLocation=gs:///tmp
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs:///counts”
-Pdataflow-runner {{< /runner >}} {{< runner samza >}} mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
-Dexec.args=“--inputFile=sample.txt --output=/tmp/counts --runner=SamzaRunner” -Psamza-runner {{< /runner >}} {{< runner nemo >}} mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount
--runner=NemoRunner --inputFile=pwd/sample.txt --output=counts {{< /runner >}} {{< runner jet >}} mvn package -Pjet-runner java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount
--runner=JetRunner --jetLocalMode=3 --inputFile=pwd/sample.txt --output=counts {{< /runner >}}
For Windows PowerShell:
{{< runner direct >}} mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args="--inputFile=sample.txt --output=counts" -P direct-runner {{< /runner >}} {{< runner flink >}} mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args=“--runner=FlinkRunner --inputFile=sample.txt --output=counts” -P flink-runner {{< /runner >}} {{< runner flinkCluster >}} mvn package exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=.\target\word-count-beam-bundled-0.1.jar --inputFile=C:\path\to\quickstart\sample.txt --output=C:\tmp\counts" -P flink-runner {{< /runner >}} {{< runner spark >}} mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args="--runner=SparkRunner --inputFile=sample.txt --output=counts" -P spark-runner {{< /runner >}} {{< runner dataflow >}} mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args=“--runner=DataflowRunner --project= --region=<your-gcp-region> \ --gcpTempLocation=gs://<your-gcs-bucket>/tmp --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs:///counts” -P dataflow-runner {{< /runner >}} {{< runner samza >}} mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount -D exec.args=“--inputFile=sample.txt --output=/tmp/counts --runner=SamzaRunner” -P samza-runner {{< /runner >}} {{< runner nemo >}} mvn package -P nemo-runner -DskipTests java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount --runner=NemoRunner --inputFile=pwd/sample.txt --output=counts {{< /runner >}} {{< runner jet >}} mvn package -P jet-runner java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount --runner=JetRunner --jetLocalMode=3 --inputFile=$pwd/sample.txt --output=counts {{< /runner >}}
For Unix shells:
{{< runner direct>}} gradle clean execute -DmainClass=org.apache.beam.examples.WordCount
--args=“--inputFile=sample.txt --output=counts” {{< /runner >}} {{< runner flink>}} TODO: document Flink on Gradle: https://github.com/apache/beam/issues/21498 {{< /runner >}} {{< runner flinkCluster>}} TODO: document FlinkCluster on Gradle: https://github.com/apache/beam/issues/21499 {{< /runner >}} {{< runner spark >}} TODO: document Spark on Gradle: https://github.com/apache/beam/issues/21502 {{< /runner >}} {{< runner dataflow >}} gradle clean execute -DmainClass=org.apache.beam.examples.WordCount
--args=“--project= --inputFile=gs://apache-beam-samples/shakespeare/*
--output=gs:///counts --runner=DataflowRunner” -Pdataflow-runner {{< /runner >}} {{< runner samza>}} TODO: document Samza on Gradle: https://github.com/apache/beam/issues/21500 {{< /runner >}} {{< runner nemo>}} TODO: document Nemo on Gradle: https://github.com/apache/beam/issues/21503 {{< /runner >}} {{< runner jet>}} TODO: document Jet on Gradle: https://github.com/apache/beam/issues/21501 {{< /runner >}}
For Windows PowerShell:
{{< runner direct>}} .\gradlew clean execute -D mainClass=org.apache.beam.examples.WordCount
--args=“--inputFile=sample.txt --output=counts” {{< /runner >}} {{< runner flink>}} TODO: document Flink on Gradle: https://github.com/apache/beam/issues/21498 {{< /runner >}} {{< runner flinkCluster>}} TODO: document FlinkCluster on Gradle: https://github.com/apache/beam/issues/21499 {{< /runner >}} {{< runner spark >}} TODO: document Spark on Gradle: https://github.com/apache/beam/issues/21502 {{< /runner >}} {{< runner dataflow >}} .\gradlew clean execute -DmainClass=org.apache.beam.examples.WordCount
--args=“--project= --inputFile=gs://apache-beam-samples/shakespeare/*
--output=gs:///counts --runner=DataflowRunner” -Pdataflow-runner {{< /runner >}} {{< runner samza>}} TODO: document Samza on Gradle: https://github.com/apache/beam/issues/21500 {{< /runner >}} {{< runner nemo>}} TODO: document Nemo on Gradle: https://github.com/apache/beam/issues/21503 {{< /runner >}} {{< runner jet>}} TODO: document Jet on Gradle: https://github.com/apache/beam/issues/21501 {{< /runner >}}
After the pipeline has completed, you can view the output. There might be multiple output files prefixed by count. The number of output files is decided by the runner, giving it the flexibility to do efficient, distributed execution.
... Think: 3 slower: 1 Having: 1 revives: 1 these: 33 wipe: 1 arrives: 1 concluded: 1 begins: 3 ...
Please don't hesitate to reach out if you encounter any issues!