v0.3.20141216
Dataflow launch: update pom.xml for certain modules. In the context of bringing in dependencies from com.google.apis group, add exclusion of Guava transitive dependency to all com.google.apis dependencies.

Notes:
* Artifacts from com.google.apis, version 1.19 in particular, bring in an old version of Guava, which is not compatible with the SDK content.
* We need to exclude this transitive dependency to ensure build works.

-------------
Created by MOE: http://code.google.com/p/moe-java
MOE_MIGRATED_REVID=82246218
2 files changed
tree: 0ec4daca642d5e449949cc6c427c0c7779dc92fa
  1. examples/
  2. sdk/
  3. .gitignore
  4. checkstyle.xml
  5. LICENSE
  6. pom.xml
  7. README.md
README.md

Cloud Dataflow Java SDK (Alpha)

Google Cloud Dataflow provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.

Overview

The key concepts in this programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections, which is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
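
To make the first three concepts concrete, here is a toy model in plain Java. It is only a conceptual sketch and does not use the SDK: ConceptSketch, ToyCollection, ToyTransform, splitWords and lengths are hypothetical names invented for this illustration, not SDK classes or methods. The model also simplifies heavily. It handles only bounded, in-memory data and applies each transform immediately, whereas a real pipeline first builds a graph that a PipelineRunner executes later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConceptSketch {
  /** Hypothetical stand-in for a PCollection: a bounded, in-memory collection. */
  static final class ToyCollection<T> {
    final List<T> elements;

    ToyCollection(List<T> elements) {
      this.elements = new ArrayList<>(elements);
    }

    /** Applying a transform produces a new collection; this one is not modified. */
    <O> ToyCollection<O> apply(ToyTransform<T, O> transform) {
      return transform.expand(this);
    }
  }

  /** Hypothetical stand-in for a PTransform: input collection in, output collection out. */
  interface ToyTransform<I, O> {
    ToyCollection<O> expand(ToyCollection<I> input);
  }

  /** Splits each line on whitespace and emits the individual words. */
  static ToyTransform<String, String> splitWords() {
    return input -> {
      List<String> out = new ArrayList<>();
      for (String line : input.elements) {
        for (String word : line.split("\\s+")) {
          if (!word.isEmpty()) {
            out.add(word);
          }
        }
      }
      return new ToyCollection<>(out);
    };
  }

  /** Maps each word to its length. */
  static ToyTransform<String, Integer> lengths() {
    return input -> {
      List<Integer> out = new ArrayList<>();
      for (String word : input.elements) {
        out.add(word.length());
      }
      return new ToyCollection<>(out);
    };
  }

  public static void main(String[] args) {
    ToyCollection<String> lines =
        new ToyCollection<>(Arrays.asList("to be or", "not to be"));
    // Chaining apply() calls mirrors how transforms are linked into a pipeline.
    ToyCollection<Integer> result = lines.apply(splitWords()).apply(lengths());
    System.out.println(result.elements); // prints [2, 2, 2, 3, 2, 2]
  }
}
```

Each apply() call returns a new collection, so the chain of calls forms the same kind of linear graph of collections and transforms that a Pipeline manages.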

Currently there are three runners:

  1. The DirectPipelineRunner runs the pipeline on your local machine.
  2. The DataflowPipelineRunner submits the pipeline to the Dataflow Service, where it runs using managed resources in the Google Cloud Platform.
  3. The BlockingDataflowPipelineRunner submits the pipeline to the Dataflow Service via the DataflowPipelineRunner and then prints messages about the job status until execution is complete.

The Dataflow Service is currently in the Alpha phase of development and access is limited to whitelisted users.

Getting Started

This repository consists of two modules:

  • Java SDK module provides a set of basic Java APIs to program against.
  • Examples module provides a few samples to get started. We recommend starting with the WordCount example.

The following command will build both modules and install them in your local Maven repository:

mvn clean install

You can speed up the build and install process by using the following options:

  1. To skip execution of the unit tests, run:

    mvn install -DskipTests

  2. While iterating on a specific module, use the following command to compile and reinstall it. For example, to reinstall the ‘examples’ module, run:

    mvn install -pl examples

Be careful, however: this command uses the most recently installed SDK from your local repository (or from Maven Central), even if you have since changed the SDK source locally.

  3. To run Maven using multiple threads, run:

    mvn -T 4 install

After building and installing, the following command will execute the WordCount example using the DirectPipelineRunner on your local machine:

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"

If you have been whitelisted for Alpha access to the Dataflow Service and followed the developer setup steps, you can use the BlockingDataflowPipelineRunner to run the same program in the Google Cloud Platform (GCP):

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--project=<YOUR GCP PROJECT NAME> --stagingLocation=<YOUR GCS LOCATION> --runner=BlockingDataflowPipelineRunner"

The Google Cloud Storage (GCS) location should be entered in the form gs://bucket/path/to/staging/directory. The Google Cloud Platform (GCP) project is identified by its name (not its number) and must have been whitelisted for Cloud Dataflow. See the Google Cloud Platform documentation for instructions on getting started.
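
The shape of the arguments passed through -Dexec.args in the commands above can be sketched in plain Java. This is not the SDK's own option parser, which this README does not describe. ExecArgsSketch, parse and looksLikeGcsPath are hypothetical helpers, and my-project and my-bucket are placeholder values. The path check is deliberately strict and accepts only the gs://bucket/path form given above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExecArgsSketch {
  /** Splits "--name=value" tokens into a map; any other token shape is rejected. */
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new LinkedHashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      // Require the "--" prefix and at least one name character before '='.
      if (!arg.startsWith("--") || eq < 3) {
        throw new IllegalArgumentException("Expected --name=value but got: " + arg);
      }
      options.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    return options;
  }

  /** Returns true if the location has the form gs://bucket/path with both parts non-empty. */
  static boolean looksLikeGcsPath(String location) {
    if (location == null || !location.startsWith("gs://")) {
      return false;
    }
    String rest = location.substring("gs://".length());
    int slash = rest.indexOf('/');
    return slash > 0 && slash < rest.length() - 1;
  }

  public static void main(String[] args) {
    // Placeholder values; substitute your own project name and bucket.
    String[] sample = {
      "--project=my-project",
      "--stagingLocation=gs://my-bucket/staging",
      "--runner=BlockingDataflowPipelineRunner"
    };
    Map<String, String> options = parse(sample);
    System.out.println(options.get("runner")); // prints BlockingDataflowPipelineRunner
    System.out.println(looksLikeGcsPath(options.get("stagingLocation"))); // prints true
  }
}
```

A check like this can catch a local path or a missing bucket name before a job is submitted, but it does not confirm that the bucket exists or that the project has been whitelisted.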

Other examples can be run similarly by replacing the WordCount class name with BigQueryTornadoes, DatastoreWordCount, TfIdf, TopWikipediaSessions, etc. and adjusting the runtime options in the -Dexec.args parameter, as specified in each example.

More Information