Google Cloud Dataflow SDK for Java Examples

The examples included in this module serve to demonstrate the basic functionality of Google Cloud Dataflow, and act as starting points for the development of more complex pipelines.

A good starting point for new users is our WordCount example, which runs over the provided input text file(s) and computes how many times each word occurs in the input.
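The computation WordCount performs can be sketched in plain Java, without the Dataflow SDK, to show what the pipeline is expected to produce. This is only an illustration: the class name WordCountSketch, the method countWords, and the tokenization rule (split on anything that is not a letter or an apostrophe, case-sensitive) are assumptions made for this sketch, not the code of the example itself.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the word-count computation; not the SDK's WordCount pipeline. */
public class WordCountSketch {

    /** Tallies how many times each word occurs across all input lines. */
    public static Map<String, Long> countWords(Iterable<String> lines) {
        // TreeMap keeps the words sorted, which makes the printed output stable.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Assumed tokenization: split on runs of characters that are
            // neither letters nor apostrophes.
            for (String word : line.split("[^\\p{L}']+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countWords(Arrays.asList("to be or not to be", "that is the question"));
        counts.forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

The actual example expresses the same steps as pipeline transforms, so that the counting can be distributed across workers rather than run in a single loop.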

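Besides WordCount, this module includes further examples, among them BigQueryTornadoes, DatastoreWordCount, TfIdf, TopWikipediaSessions, TrafficMaxLaneFlow, and TrafficRoutes.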

Running the Examples

After building and installing the SDK and Examples modules, as explained in this README, you can execute the WordCount and other example pipelines using the DirectPipelineRunner on your local machine:

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"

You can use the BlockingDataflowPipelineRunner to execute the WordCount example on the Google Cloud Dataflow Service, using managed resources in the Google Cloud Platform. Start by following the general Cloud Dataflow Getting Started instructions. You need a Google Cloud Platform project with the Cloud Dataflow API enabled, a Google Cloud Storage bucket to serve as a staging location, and an installed and authenticated Google Cloud SDK. With these in place, invoke the example as follows:

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner"

Enter your Cloud Storage location in the form gs://bucket/path/to/staging/directory. For the Cloud Platform project, specify the project name, not the project number.
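A quick check of the staging location before launching can catch a malformed value early. The following is a minimal sketch of such a check; StagingLocationCheck and looksLikeGcsPath are hypothetical names introduced here, and the rule applied (a gs:// prefix followed by a non-empty bucket name) is an assumption for illustration, not the validation the SDK or the service performs.

```java
/** Hypothetical pre-flight check for a --stagingLocation value; not part of the SDK. */
public class StagingLocationCheck {

    private static final String SCHEME = "gs://";

    /** Returns true if the value starts with gs:// and names a non-empty bucket. */
    public static boolean looksLikeGcsPath(String location) {
        if (location == null || !location.startsWith(SCHEME)) {
            return false;
        }
        String rest = location.substring(SCHEME.length());
        int slash = rest.indexOf('/');
        // Everything up to the first slash (or the whole remainder) is the bucket name.
        String bucket = slash < 0 ? rest : rest.substring(0, slash);
        return !bucket.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGcsPath("gs://bucket/path/to/staging/directory"));
        System.out.println(looksLikeGcsPath("/local/path/to/staging"));
    }
}
```

The check only looks at the shape of the string; it does not confirm that the bucket exists or that you have access to it.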

Alternatively, you may bundle all dependencies into a single JAR and execute it outside of the Maven environment. For example, after building and installing as usual, the following commands create the bundled JAR of the Examples module and execute it, first locally and then on Cloud Platform:

mvn bundle:bundle -pl examples

java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
com.google.cloud.dataflow.examples.WordCount \
--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>

java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
com.google.cloud.dataflow.examples.WordCount \
--project=<YOUR CLOUD PLATFORM PROJECT NAME> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner

Other examples can be run similarly by replacing the WordCount class name with BigQueryTornadoes, DatastoreWordCount, TfIdf, TopWikipediaSessions, etc. and adjusting the runtime options passed in the -Dexec.args parameter, as specified in the example itself. If you are running the streaming pipeline examples, see the additional setup instructions below.

Note that when running Maven on Microsoft Windows, each backslash (\) in the -Dexec.args parameter must be escaped with another backslash. For example, an input file pattern of c:\*.txt should be entered as c:\\*.txt.
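If the arguments are assembled by a script or tool rather than typed by hand, the escaping rule above amounts to doubling every backslash. A minimal sketch, assuming a hypothetical helper class ExecArgsEscaper that is not part of the SDK or of Maven:

```java
/** Hypothetical helper that doubles backslashes in a value destined for -Dexec.args. */
public class ExecArgsEscaper {

    /** Replaces every single backslash with two backslashes. */
    public static String escapeBackslashes(String value) {
        // String.replace works on literal text, so no regex escaping is involved;
        // the doubled backslashes below are only Java string-literal syntax.
        return value.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String pattern = "c:\\*.txt"; // Java literal for the text c:\*.txt
        System.out.println(escapeBackslashes(pattern)); // prints c:\\*.txt
    }
}
```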

Running the “Traffic” Streaming Examples

The TrafficMaxLaneFlow and TrafficRoutes pipelines, when run in streaming mode (with the --streaming=true option), require the publication of traffic sensor data to a Google Cloud Pub/Sub topic. You can run the example with default settings using the following command:

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.TrafficMaxLaneFlow \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=DataflowPipelineRunner \
--streaming=true"

By default, these pipelines launch a separate batch pipeline that publishes previously gathered traffic sensor data to the Cloud Pub/Sub topic, which then serves as the input source for the streaming pipeline.

The default traffic sensor data file (--inputFile) can be downloaded as follows:

curl -O \
http://storage.googleapis.com/aju-sd-traffic/unzipped/Freeways-5Minaa2010-01-01_to_2010-02-15_test2.csv

This file contains real traffic sensor data from San Diego freeways. See this file for copyright information.

You may override the default --inputFile with an alternative complete data set (~2GB). It is provided in the Google Cloud Storage bucket gs://dataflow-samples/traffic_sensor/Freeways-5Minaa2010-01-01_to_2010-02-15.csv.

You may also set --inputFile to an empty string, which disables the automatic Pub/Sub injection and lets you use a separate tool to control the input to this example. Example code that publishes traffic sensor data to a Pub/Sub topic is provided in traffic_pubsub_generator.py.

Note: If you set --streaming=false, these traffic pipelines will run in batch mode, using the timestamps applied to the original dataset to process the data in a batch. For further information on how these pipelines operate, see TrafficMaxLaneFlow and TrafficRoutes.