tag | ecf0c949ca8340b3fa91e9c68815650553f2a5d5 | |
---|---|---|
tagger | Kenneth Knowles <kenn@apache.org> | Tue Oct 24 20:23:05 2017 -0700 |
object | 6ad521097f50d6390ba99489d0dffb87c9991715 |
v0.3.20141216
commit | 6ad521097f50d6390ba99489d0dffb87c9991715 | |
---|---|---|
author | davor <davor@google.com> | Tue Dec 16 10:37:48 2014 -0800 |
committer | Davor Bonaci <davor@google.com> | Tue Dec 16 10:41:21 2014 -0800 |
tree | 0ec4daca642d5e449949cc6c427c0c7779dc92fa | |
parent | 9e90654cd39af8d2b91290055b5b5066d07c6d94 | |
Dataflow launch: update pom.xml for certain modules.

In the context of bringing in dependencies from com.google.apis group, add exclusion of Guava transitive dependency to all com.google.apis dependencies.

Notes:
* Artifacts from com.google.apis, version 1.19 in particular, brings in an old version of Guava, which is not compatible with the SDK content.
* We need to exclude this transitive dependency to ensure build works.

-------------
Created by MOE: http://code.google.com/p/moe-java
MOE_MIGRATED_REVID=82246218

Google Cloud Dataflow provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The key concepts in this programming model are:
Currently there are three runners:
The Dataflow Service is currently in the Alpha phase of development and access is limited to whitelisted users.
This repository consists of two modules:
The following command will build both modules and install them in your local Maven repository:
mvn clean install
You can speed up the build and install process by using the following options:
To skip execution of the unit tests, run:
mvn install -DskipTests
While iterating on a specific module, use the following command to compile and reinstall it. For example, to reinstall the ‘examples’ module, run:
mvn install -pl examples
Be careful, however: this command uses the most recently installed SDK from the local repository (or Maven Central), even if you have since changed the SDK source locally.
To run Maven using multiple threads, run:
mvn -T 4 install
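The options above can be combined into one invocation. The following is a minimal sketch, not part of the original instructions: `build_cmd` is a hypothetical helper that only prints the command, and it adds Maven's `-am` (`--also-make`) flag, which also rebuilds the modules that the selected module depends on, so that local SDK changes are picked up.

```shell
#!/bin/sh
# Sketch: compose the build options above into a single Maven command line.
# build_cmd is a hypothetical helper; it prints the command and does not run it.
build_cmd() {
  module="$1"          # module to rebuild, e.g. "examples"
  threads="${2:-4}"    # thread count passed to -T (defaults to 4)
  # -am (--also-make) also builds the modules that $module depends on.
  echo "mvn -T $threads install -DskipTests -pl $module -am"
}

build_cmd examples
# prints: mvn -T 4 install -DskipTests -pl examples -am
```

Printing the command first makes it easy to inspect before running it; copy the printed line into the shell to execute the build.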
After building and installing, the following command will execute the WordCount example using the DirectPipelineRunner on your local machine:
mvn exec:java -pl examples \
  -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
  -Dexec.args="--input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"
If you have been whitelisted for Alpha access to the Dataflow Service and have followed the developer setup steps, you can use the BlockingDataflowPipelineRunner to run the same program on Google Cloud Platform (GCP):
mvn exec:java -pl examples \
  -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
  -Dexec.args="--project=<YOUR GCP PROJECT NAME> --stagingLocation=<YOUR GCS LOCATION> --runner=BlockingDataflowPipelineRunner"
Enter the Google Cloud Storage (GCS) location in the form gs://bucket/path/to/staging/directory. The Google Cloud Platform (GCP) project is identified by its name (not its number), and it must have been whitelisted for Cloud Dataflow. Refer to the Google Cloud Platform documentation for instructions on getting started.
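A quick check of the staging location format can catch typos before a job is submitted. This is a minimal sketch: `is_gcs_location` is a hypothetical helper, and it assumes that a valid location needs both a bucket and a path after `gs://`, as in the form shown above. It checks only the shape of the string, not whether the bucket exists.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the argument looks like gs://bucket/path.
# Assumption: both a bucket name and a non-empty path are required.
is_gcs_location() {
  case "$1" in
    gs://?*/?*) return 0 ;;   # gs://, a bucket, a slash, then a path
    *)          return 1 ;;
  esac
}

staging="gs://bucket/path/to/staging/directory"
if is_gcs_location "$staging"; then
  echo "staging location looks valid: $staging"
else
  echo "not a GCS location: $staging" >&2
fi
```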
Other examples can be run similarly by replacing the WordCount class name with BigQueryTornadoes, DatastoreWordCount, TfIdf, TopWikipediaSessions, etc., and adjusting the runtime options in the -Dexec.args parameter, as specified in the example itself.
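Swapping the class name can be sketched as a small wrapper. `example_cmd` is a hypothetical helper that only prints the Maven command. It assumes the other examples live in the same `com.google.cloud.dataflow.examples` package as WordCount; the options are placeholders to be replaced with whatever the chosen example documents.

```shell
#!/bin/sh
# Hypothetical helper: print the mvn command for a given example class.
# Assumption: the class is in the com.google.cloud.dataflow.examples package.
example_cmd() {
  if [ "$#" -lt 1 ]; then
    echo "usage: example_cmd <ExampleClass> [options...]" >&2
    return 1
  fi
  example="$1"
  shift
  # The remaining arguments are joined with spaces and passed via -Dexec.args.
  printf 'mvn exec:java -pl examples -Dexec.mainClass=com.google.cloud.dataflow.examples.%s -Dexec.args="%s"\n' \
    "$example" "$*"
}

example_cmd TfIdf "<OPTIONS FOR THIS EXAMPLE>"
```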