blob: b58153e887c3d5882a5eb6c1bbe703f8f2a0562f [file] [log] [blame] [view]
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Example Pipelines
The examples included in this module serve to demonstrate the basic
functionality of Google Cloud Dataflow, and act as starting points for
the development of more complex pipelines.
## Word Count
A good starting point for new users is our set of
[word count](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples) examples, which computes word frequencies. This series of four successively more detailed pipelines is described in detail in the accompanying [walkthrough](https://cloud.google.com/dataflow/examples/wordcount-example).
1. [`MinimalWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java) is the simplest word count pipeline and introduces basic concepts like [Pipelines](https://cloud.google.com/dataflow/model/pipelines),
[PCollections](https://cloud.google.com/dataflow/model/pcollection),
[ParDo](https://cloud.google.com/dataflow/model/par-do),
and [reading and writing data](https://cloud.google.com/dataflow/model/reading-and-writing-data) from external storage.
1. [`WordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java) introduces Dataflow best practices like [PipelineOptions](https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline#Creating) and custom [PTransforms](https://cloud.google.com/dataflow/model/composite-transforms).
1. [`DebuggingWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java)
shows how to view live aggregators in the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf), get the most out of
[Cloud Logging](https://cloud.google.com/dataflow/pipelines/logging) integration, and start writing
[good tests](https://cloud.google.com/dataflow/pipelines/testing-your-pipeline).
1. [`WindowedWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java) shows how to run the same pipeline over either unbounded PCollections in streaming mode or bounded PCollections in batch mode.
## Building and Running
Change directory into `examples/java` and run the examples:
mvn compile exec:java \
-Dexec.mainClass=<MAIN CLASS> \
-Dexec.args="<EXAMPLE-SPECIFIC ARGUMENTS>"
For example, you can execute the `WordCount` pipeline on your local machine as follows:
mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=<LOCAL INPUT FILE> --output=<LOCAL OUTPUT FILE>"
Once you have followed the general Cloud Dataflow
[Getting Started](https://cloud.google.com/dataflow/getting-started) instructions, you can execute
the same pipeline on fully managed resources in Google Cloud Platform:
mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--tempLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowRunner"
Make sure to use your project id, not the project number or the descriptive name.
The Cloud Storage location should be entered in the form of
`gs://bucket/path/to/staging/directory`.
Alternatively, you may choose to bundle all dependencies into a single JAR and
execute it outside of the Maven environment. For example, you can execute the
following commands to create the
bundled JAR of the examples and execute it both locally and in Cloud
Platform:
mvn package
java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
org.apache.beam.examples.WordCount \
--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>
java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
org.apache.beam.examples.WordCount \
--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--tempLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowRunner
Other examples can be run similarly by replacing the `WordCount` class path with the example classpath, e.g.
`org.apache.beam.examples.cookbook.BigQueryTornadoes`,
and adjusting runtime options under the `Dexec.args` parameter, as specified in
the example itself.
Note that when running Maven on Microsoft Windows platform, backslashes (`\`)
under the `Dexec.args` parameter should be escaped with another backslash. For
example, input file pattern of `c:\*.txt` should be entered as `c:\\*.txt`.
## Beyond Word Count
After you've finished running your first few word count pipelines, take a look at the [cookbook](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/cookbook)
directory for some common and useful patterns like joining, filtering, and combining.
The [complete](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete)
directory contains a few realistic end-to-end pipelines.
See the
[Java 8](https://github.com/apache/beam/tree/master/examples/java8/src/main/java/org/apache/beam/examples)
examples as well. This directory includes a Java 8 version of the
MinimalWordCount example, as well as series of examples in a simple 'mobile
gaming' domain. This series introduces some advanced concepts and provides
additional examples of using Java 8 syntax. Other than usage of Java 8 lambda
expressions, the concepts that are used apply equally well in Java 7.