This closes #4091: Merge branch 'master' up to commit 269bf89 into mr-runner

  mr-runner: re-enable sdks/python module.
  mr-runner: update to 2.3.0-SNAPSHOT.
  Check that bigtableWriter is non-null before calling close().
  Added a preprocessing step to the Cloud Spanner sink.
  Fix Repackaging Configuration in the DirectRunner
  Migrate shared Fn Execution code to Java7
  update dataflow.version
  [BEAM-3114] Generate text proto config properly in container boot code
  [BEAM-3113] Disable stack trace optimization in java container
  Fix Go package comment for syscallx
  Add all portability protos to Go
  [BEAM-2728] Extension for sketch-based statistics: HyperLogLog
  Add a runners/java-fn-execution module
  Add sdks/java/fn-execution
  [BEAM-3135] Adding futures dependency to python SDK
  Updates BigQueryTableSource to consider data in streaming buffer when determining estimated size.
  Getting AutoValue 1.5.1 working in Beam.
  Add License Header to SqlTypeUtils
  [BEAM-2203] Implement TIMESTAMPADD
  Fix working dir in website precommits
  Do not relocate generated Model Classes
  Updates Python datastore wordcount example to take a dataset parameter.
  Remove obsolete extra parameter
  [BEAM-2482] - CodedValueMutationDetector should use the coder's structural value
  Changed the mutation detector to be based on structural value only
  Allows setting a Cloud Spanner host. https://batch-spanner.googleapis.com/ is set as the default host name.
  [BEAM-2468] Reading Kinesis records in the background
  [BEAM-3054] Uses locale-insensitive number formatting in ESIO and tests
  Added VoidSerializer for KafkaIO. Modified KafkaIO.Write.values() to automatically add the VoidSerializer as the key.serializer config for the Kafka producer
  [BEAM-1542] SpannerIO: mutation encoding and size estimation improvements
  [BEAM-2979] Fix a race condition in getWatermark() in KafkaIO.
  [BEAM-3112] Improve error messages in ElasticsearchIO test utils
  [BEAM-3111] Upgrade Elasticsearch to 5.6.3 and clean pom
  Temporarily disable Dataflow pipeline_url metadata
  NonNull by default in sdk/transforms/splittabledofn
  NonNull by default in sdk/transforms/join
  NonNull by default in sdk/transforms/display
  NonNull by default in sdk/transforms/reflect
  NonNull by default in sdk/transforms/windowing
  NonNull by default in sdk/transforms
  Reading Spanner schema transform
  Suppress AutoValue warnings in TextIO
  Remove extraneous type arguments in Latest.java
  Remove extraneous type arguments in PAssert
  Remove coveralls invocations from all jobs
  Adds ParseResult.failure()
  Many improvements to TikaIO
  Add missing @RunWith to test.
  [BEAM-2566] Decouple SDK harness from Dataflow runner by elevating experiments and SDK harness configuration to java-sdk-core.
  Add python_requires to setup.py
  CR: [BEAM-3005] Set JVM max heap size in java container
  Update rat exclusion for python and Go protos
  Declare .go and Dockerfile as text in gitattributes
  [BEAM-3005] Set JVM max heap size in java container
  [BEAM-3005] Add resource limits to provision proto
  Stage the pipeline in Python DataflowRunner
  Add zip to the list of accepted extra package file types.
  Fix remaining nullability errors
  Increase Java postcommit timeout to 240
  Manage RAT plugin more centrally; only toggle skipping
  Fix RAT exclusions
  Rearrange .gitignore slightly
  Fix typo in seed job
  Rename seed job so it is first in glob used by prior seed job
  Fix typo in seed jobs
  Increase seed job(s) timeout to 60 minutes
  Increase seed job(s) timeout to 30 minutes
  Make the main seed job standalone
  Increase job_beam_PreCommit_Java_MavenInstall timeout from 2.5 to 4 hours.
  Implement FnApi side inputs in Python.
  Unit test for label pipeline option
  NonNull by default in metrics
  Ignore findbugs in AutoValue generated classes
  NonNull by default for sdk/testing
  NonNull by default in sdk/state
  NonNull by default in sdk/runners
  NonNull by default in sdk/annotations
  NonNull by default in sdk/coders
  Make Java core SDK root dir NonNull by default
  Add dep on Apache-licensed findbugs-annotations implementation
  [BEAM-2682] Deletes AvroIOTransformTest
  Clone source to a distinguished subdirectory of Jenkins workspace
  Adding lull tracking for python sampler
  Avoids generating proto files for Windows if grpcio-tools is not installed.
  Do not crash when RawPTransform has null spec
  Unit test to repro NPE in PTransformTranslation
  Created Java snippets file
  Add standalone version of seed job
  Pin runner harness also for official BEAM releases.
  Remove duplicate mocking in DataflowRunnerTest
  Add assertion that valid jobs must have staged pipeline
  Stage the pipeline without using a temp file
  Stage the portable pipeline in Dataflow
  Add ability to stage explicit file list
  Improve GcsFileSystem error messages slightly
  [BEAM-2720] Update kafka client version to 0.11.0.1
  Update PipelineTest.testReplacedNames
  DirectRunner: Replace use of RawPTransform with NotSerializable.forUrn translators
  Wordcount on fnapi pipeline and IT test.
  Clearer getOrDefault style in RehydratedComponents
  ...
tree: 6b4735cdf171c3cb5134d135e70b0aed1b82199c
  .github/
  .test-infra/
  examples/
  model/
  runners/
  sdks/
  .gitattributes
  .gitignore
  LICENSE
  NOTICE
  pom.xml
  README.md
README.md

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Status

[Build Status and Coverage Status badges]

Overview

Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which has relatively disparate backgrounds and needs.

  1. End Users: Writing pipelines with an existing SDK and running them on an existing runner. These users want to focus on writing their application logic and have everything else just work.
  2. SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc.). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
  3. Runner Writers: Have an execution environment for distributed processing and would like to support programs written against the Beam Model. Would prefer to be shielded from details of multiple SDKs.

The Beam Model

The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model”.

To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.

The key concepts in the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
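
To make these concrete, here is a minimal sketch using the Beam Java SDK; the class name BeamConceptsSketch and the input values are made up for illustration:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BeamConceptsSketch {
  public static void main(String[] args) {
    // PipelineOptions determine where and how the pipeline executes;
    // without a --runner flag, the local DirectRunner is used.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // A Pipeline manages the DAG of PTransforms and PCollections.
    Pipeline p = Pipeline.create(options);

    // A bounded PCollection built from an in-memory list (made-up values).
    PCollection<String> words = p.apply(Create.of("hello", "world", "hello"));

    // A PTransform that turns one PCollection into another.
    PCollection<KV<String, Long>> counts = words.apply(Count.perElement());

    // Hand the graph to the configured PipelineRunner for execution.
    p.run().waitUntilFinish();
  }
}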

SDKs

Beam supports multiple language-specific SDKs for writing pipelines against the Beam Model.

Currently, this repository contains SDKs for both Java and Python.

Have ideas for new SDKs or DSLs? See the JIRA.

Runners

Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:

  • The DirectRunner runs the pipeline on your local machine.
  • The ApexRunner runs the pipeline on an Apache Hadoop YARN cluster (or in embedded mode).
  • The DataflowRunner submits the pipeline to Google Cloud Dataflow.
  • The FlinkRunner runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.
  • The SparkRunner runs the pipeline on an Apache Spark cluster. The code has been donated from cloudera/spark-dataflow and is now part of Beam.
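
As a sketch of how a runner is selected with the Beam Java SDK's standard options (FlinkRunner below is just an example value; the corresponding runner artifact must be on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelectionSketch {
  public static void main(String[] args) {
    // Passing e.g. --runner=FlinkRunner on the command line selects the
    // runner; with no flag, the DirectRunner executes locally.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run();
  }
}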

Have ideas for new Runners? See the JIRA.

Getting Started

Please refer to the Quickstart (Java, Python) available on our website.

If you'd like to build and install the whole project from the source distribution, you may need some additional tools installed on your system. On a Debian-based distribution:

sudo apt-get install \
    openjdk-8-jdk \
    maven \
    python-setuptools \
    python-pip

Then please use the standard mvn clean install command.
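
For reference (the -DskipTests variant is a general Maven convention for skipping test execution, not something this README prescribes):

mvn clean install

# Optionally skip tests for a faster, unverified build:
mvn clean install -DskipTests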

Spark Runner

See the Spark Runner README.

Contact Us

To get involved in Apache Beam:

  • Subscribe to or mail the user@beam.apache.org list.
  • Subscribe to or mail the dev@beam.apache.org list.
  • Report issues in JIRA.

We also have a contributor's guide.

More Information