tag	ecf0c949ca8340b3fa91e9c68815650553f2a5d5
tagger	Kenneth Knowles <kenn@apache.org>	Tue Oct 24 20:23:05 2017 -0700
object	6ad521097f50d6390ba99489d0dffb87c9991715

commit	6ad521097f50d6390ba99489d0dffb87c9991715
author	davor <davor@google.com>	Tue Dec 16 10:37:48 2014 -0800
committer	Davor Bonaci <davor@google.com>	Tue Dec 16 10:41:21 2014 -0800
tree	0ec4daca642d5e449949cc6c427c0c7779dc92fa
parent	9e90654cd39af8d2b91290055b5b5066d07c6d94

README.md

Cloud Dataflow Java SDK (Alpha)

Google Cloud Dataflow provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.

Overview

The key concepts in this programming model are:

PCollection: represents a collection of data, which could be bounded or unbounded in size.
PTransform: represents a computation that transforms input PCollections into output PCollections.
Pipeline: manages a directed acyclic graph of PTransforms and PCollections, which is ready for execution.
PipelineRunner: specifies where and how the pipeline should execute.
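
To make the relationship between these four concepts concrete, here is a minimal in-memory sketch in plain Java. It does not use the SDK: `Transform`, `ToyPipeline`, and `runLocally` are hypothetical stand-ins invented for illustration, a `List` plays the role of a bounded PCollection, and the "graph" is just a linear chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy stand-ins for the concepts above. These are NOT the SDK's classes;
// every name here is hypothetical, and only bounded, in-memory data is modeled.
class ConceptSketch {
    /** Stand-in for a PTransform: maps an input collection to an output collection. */
    interface Transform<A, B> extends Function<List<A>, List<B>> {}

    /** Stand-in for a Pipeline: a chain of transforms that is built first and executed later. */
    static final class ToyPipeline<A, B> {
        private final Function<List<A>, List<B>> steps;

        private ToyPipeline(Function<List<A>, List<B>> steps) {
            this.steps = steps;
        }

        static <A> ToyPipeline<A, A> create() {
            return new ToyPipeline<A, A>(in -> in);
        }

        /** Appends a transform; nothing runs until a runner is invoked. */
        <C> ToyPipeline<A, C> apply(Transform<B, C> t) {
            return new ToyPipeline<A, C>(steps.andThen(t));
        }
    }

    /** Stand-in for a PipelineRunner: decides where and how to execute (here: in-process). */
    static <A, B> List<B> runLocally(ToyPipeline<A, B> pipeline, List<A> input) {
        return pipeline.steps.apply(input);
    }

    /** Splits each line into whitespace-separated words. */
    static Transform<String, String> splitWords() {
        return lines -> {
            List<String> out = new ArrayList<>();
            for (String line : lines) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.add(word);
                    }
                }
            }
            return out;
        };
    }

    /** Maps each word to its length. */
    static Transform<String, Integer> lengths() {
        return words -> {
            List<Integer> out = new ArrayList<>();
            for (String word : words) {
                out.add(word.length());
            }
            return out;
        };
    }

    static List<Integer> demo() {
        ToyPipeline<String, Integer> pipeline =
            ToyPipeline.<String>create().apply(splitWords()).apply(lengths());
        return runLocally(pipeline, Arrays.asList("to be", "or not"));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [2, 2, 2, 3]
    }
}
```

The point of the sketch is the separation of concerns: the pipeline is described once, and the choice of runner is made only when it is executed.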

Currently there are three runners:

The DirectPipelineRunner runs the pipeline on your local machine.
The DataflowPipelineRunner submits the pipeline to the Dataflow Service, where it runs using managed resources in the Google Cloud Platform.
The BlockingDataflowPipelineRunner submits the pipeline to the Dataflow Service via the DataflowPipelineRunner and then prints messages about the job status until execution is complete.

The Dataflow Service is currently in the Alpha phase of development and access is limited to whitelisted users.

Getting Started

This repository consists of two modules:

The Java SDK module provides a set of basic Java APIs to program against.
The Examples module provides a few samples to get started. We recommend starting with the WordCount example.

The following command will build both modules and install them in your local Maven repository:

mvn clean install

You can speed up the build and install process by using the following options:

To skip execution of the unit tests, run:
mvn install -DskipTests
While iterating on a specific module, you can compile and reinstall only that module. For example, to reinstall the ‘examples’ module, run:
mvn install -pl examples

Be careful, however: this command builds against the most recently installed SDK from the local repository (or Maven Central), even if you have since changed the SDK sources locally.

To run Maven using multiple threads, run:
mvn -T 4 install

After building and installing, the following command will execute the WordCount example using the DirectPipelineRunner on your local machine:

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"
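
For orientation, the following plain-Java sketch shows the kind of result a word count produces, assuming the example counts occurrences of each word in its input. It does not use the SDK or reproduce the example's actual code; `WordCountSketch` and `countWords` are hypothetical names, and splitting on whitespace is an assumption about tokenization.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: an in-memory word count, not the SDK's WordCount example.
class WordCountSketch {
    /** Counts occurrences of each whitespace-separated token across all lines. */
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys give stable output
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(Arrays.asList("to be or not", "to be")));
    }
}
```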

If you have been whitelisted for Alpha access to the Dataflow Service and followed the developer setup steps, you can use the BlockingDataflowPipelineRunner to run the same program in the Google Cloud Platform (GCP):

mvn exec:java -pl examples \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--project=<YOUR GCP PROJECT NAME> --stagingLocation=<YOUR GCS LOCATION> --runner=BlockingDataflowPipelineRunner"

The Google Cloud Storage (GCS) location should be entered in the form gs://bucket/path/to/staging/directory. The Google Cloud Platform (GCP) project is identified by its name (not its number) and must have been whitelisted for Cloud Dataflow. Refer here for instructions to get started with Google Cloud Platform.
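
If you script these commands, a quick sanity check on the staging location can catch a malformed value before a job is submitted. The sketch below is a hypothetical helper, not part of the SDK; it only checks the gs://bucket/path shape described above and does not enforce the real GCS bucket naming rules, which are stricter.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a shape check only, not an SDK API.
class StagingLocationCheck {
    // "gs://", then a non-empty bucket segment, then an optional "/path".
    private static final Pattern GCS_PATH = Pattern.compile("gs://[^/\\s]+(/[^\\s]*)?");

    /** Returns true if the value has the form gs://bucket or gs://bucket/path. */
    static boolean looksLikeGcsPath(String value) {
        return value != null && GCS_PATH.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGcsPath("gs://bucket/path/to/staging/directory")); // prints true
        System.out.println(looksLikeGcsPath("bucket/path/to/staging/directory"));      // prints false
    }
}
```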

Other examples can be run similarly by replacing the WordCount class name with BigQueryTornadoes, DatastoreWordCount, TfIdf, TopWikipediaSessions, etc., and adjusting the runtime options in the -Dexec.args parameter, as specified in the example itself.
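
As a sketch of that substitution, the hypothetical helper below assembles the Maven argument list for a given example class. It assumes the other examples live in the same package as WordCount (com.google.cloud.dataflow.examples) and leaves the runtime options to the caller, since they differ per example. The result is a list of separate arguments, suitable for something like `ProcessBuilder`, so no shell quoting is applied.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for scripting; not part of the SDK or the examples module.
class ExampleCommand {
    // Assumption: all examples share the package used by WordCount above.
    static final String EXAMPLES_PACKAGE = "com.google.cloud.dataflow.examples";

    /** Builds the mvn argument list for one example class and its runtime options. */
    static List<String> mavenArgs(String exampleClass, String execArgs) {
        return Arrays.asList(
            "mvn", "exec:java", "-pl", "examples",
            "-Dexec.mainClass=" + EXAMPLES_PACKAGE + "." + exampleClass,
            "-Dexec.args=" + execArgs);
    }

    public static void main(String[] args) {
        // The option values are placeholders, as in the commands above.
        System.out.println(String.join(" ",
            mavenArgs("TfIdf", "--project=<YOUR GCP PROJECT NAME>")));
    }
}
```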
