The examples included in this module serve to demonstrate the basic functionality of Google Cloud Dataflow, and act as starting points for the development of more complex pipelines.
A good starting point for new users is our WordCount
example, which runs over the provided input text file(s) and computes how many times each word occurs in the input.
Besides WordCount
, the following examples are included:
After building and installing the SDK
and Examples
modules, as explained in this README, you can execute the WordCount
and other example pipelines using the DirectPipelineRunner
on your local machine:
mvn exec:java -pl examples \ -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \ -Dexec.args="--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"
You can use the BlockingDataflowPipelineRunner
to execute the WordCount
example on Google Cloud Dataflow Service using managed resources in the Google Cloud Platform. Start by following the general Cloud Dataflow Getting Started instructions. You should have a Google Cloud Platform project that has a Cloud Dataflow API enabled, a Google Cloud Storage bucket that will serve as a staging location, and installed and authenticated Google Cloud SDK. In this case, invoke the example as follows:
mvn exec:java -pl examples \ -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \ -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \ --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \ --runner=BlockingDataflowPipelineRunner"
Your Cloud Storage location should be entered in the form of gs://bucket/path/to/staging/directory
. The Cloud Platform project refers to its name (not number).
Alternatively, you may choose to bundle all dependencies into a single JAR and execute it outside of the Maven environment. For example, after building and installing as usual, you can execute the following commands to create the bundled JAR of the Examples
module and execute it both locally and in Cloud Platform:
mvn bundle:bundle -pl examples java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \ com.google.cloud.dataflow.examples.WordCount \ --inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE> java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \ com.google.cloud.dataflow.examples.WordCount \ --project=<YOUR CLOUD PLATFORM PROJECT NAME> \ --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \ --runner=BlockingDataflowPipelineRunner
Other examples can be run similarly by replacing the WordCount
class name with BigQueryTornadoes
, DatastoreWordCount
, TfIdf
, TopWikipediaSessions
, etc. and adjusting runtime options under the Dexec.args
parameter, as specified in the example itself. If you are running the streaming pipeline examples, see the additional setup instruction, below.
Note that when running Maven on Microsoft Windows platform, backslashes (\
) under the Dexec.args
parameter should be escaped with another backslash. For example, input file pattern of c:\*.txt
should be entered as c:\\*.txt
.
The TrafficMaxLaneFlow
and TrafficRoutes
pipelines, when run in streaming mode (with the --streaming=true
option), require the publication of traffic sensor data to a Google Cloud Pub/Sub topic. You can run the example with default settings using the following command:
mvn exec:java -pl examples \ -Dexec.mainClass=com.google.cloud.dataflow.examples.TrafficMaxLaneFlow \ -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \ --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \ --runner=DataflowPipelineRunner \ --streaming=true"
By default, they use a separate batch pipeline to publish previously gathered traffic sensor data to the Cloud Pub/Sub topic, which is used as an input source for the streaming pipeline.
The default traffic sensor data --inputFile
is downloaded from
curl -O \ http://storage.googleapis.com/aju-sd-traffic/unzipped/Freeways-5Minaa2010-01-01_to_2010-02-15_test2.csv
This file contains real traffic sensor data from San Diego freeways. See this file for copyright information.
You may override the default --inputFile
with an alternative complete data set (~2GB). It is provided in the Google Cloud Storage bucket gs://dataflow-samples/traffic_sensor/Freeways-5Minaa2010-01-01_to_2010-02-15.csv
.
You may also set --inputFile
to an empty string, which will disable the automatic Pub/Sub injection, and allow you to use separate tool to control the input to this example. An example code, which publishes traffic sensor data to a Pub/Sub topic, is provided in traffic_pubsub_generator.py
Note: If you set --streaming=false
, these traffic pipelines will run in batch mode, using the timestamps applied to the original dataset to process the data in a batch. For further information on how these pipelines operate, see TrafficMaxLaneFlow and TrafficRoutes.