commit | b6f22aa76ad217fff6078e9351676895ac793ecc | [log] [tgz] |
---|---|---|
author | Kenneth Knowles <kenn@apache.org> | Tue Nov 07 08:53:31 2017 -0800 |
committer | Kenneth Knowles <kenn@apache.org> | Tue Nov 07 08:53:31 2017 -0800 |
tree | 6b4735cdf171c3cb5134d135e70b0aed1b82199c | |
parent | 5fa0b14d20fab007d9e2d954eb4a34155a6f199f [diff] | |
parent | 202a2cb242df8ce8b11ff19c7d8307991d61af3a [diff] |
This closes #4091: Merge branch 'master' upto commit 269bf89 into mr-runner mr-runner: re-enable sdks/python module. mr-runner: update to 2.3.0-SNAPSHOT. Check that bigtableWriter is non-null before calling close(). Added a preprocessing step to the Cloud Spanner sink. Fix Repackaging Configuration in the the DirectRunner Migrate shared Fn Execution code to Java7 update dataflow.version [BEAM-3114] Generate text proto config properly in container boot code [BEAM-3113] Disable stack trace optimization in java container Fix Go package comment for syscallx Add all portability protos to Go [BEAM-2728] Extension for sketch-based statistics : HyperLogLog Add a runners/java-fn-execution module Add sdks/java/fn-execution [BEAM-3135] Adding futures dependency to python SDK Updates BigQueryTableSource to consider data in streaming buffer when determining estimated size. Getting AutoValue 1.5.1 working in Beam. Add License Header to SqlTypeUtils [BEAM-2203] Implement TIMESTAMPADD Fix working dir in website precommits Do not relocate generated Model Classes Updates Python datastore wordcount example to take a dataset parameter. Remove obsolete extra parameter [BEAM-2482] - CodedValueMutationDetector should use the coders structural value Changed the mutation detector to be based on structural value only Allows to set a Cloud Spanner host. https://batch-spanner.googleapis.com/ is set as a default host name. [BEAM-2468] Reading Kinesis records in the background [BEAM-3054] Uses locale-insensitive number formatting in ESIO and tests Added VoidSerializer for KafkaIO. Modified KafkaIO.Write.values() to auto add the VoidSerializer for the key.serializer config for kafka producer [BEAM-1542] SpannerIO: mutation encoding and size estimation improvements [BEAM-2979] Fix a race condition in getWatermark() in KafkaIO. [BEAM-3112] Improve error messages in ElasticsearchIO test utils [BEAM-3111] Upgrade Elasticsearch to 5.6.3 and clean pom Temporarily disable Dataflow pipeline_url metadata NonNull by default in sdk/transforms/splittabledofn NonNull by default in sdk/transforms/join NonNull by default in sdk/transforms/display NonNull by default in sdk/transforms/reflect NonNull by default in sdk/transforms/windowing NonNull by default in sdk/transforms Reading spanner schema transform Suppress AutoValue warnings in TextIO Remove extraneous type arguments in Latest.java Remove extraneous type arguments in PAssert Remove coveralls invocations from all jobs Adds ParseResult.failure() Many improvements to TikaIO Add missing @RunWith to test. [BEAM-2566] Decouple SDK harness from Dataflow runner by elevating experiments and SDK harness configuration to java-sdk-core. Add python_requires to setup.py CR: [BEAM-3005] Set JVM max heap size in java container Update rat exclusion for python and Go protos Declare .go and Dockerfile as text in gitattributes [BEAM-3005] Set JVM max heap size in java container [BEAM-3005] Add resource limits to provision proto Stage the pipeline in Python DataflowRunner Add zip to the list of accepted extra package file types. Fix remaining nullability errors Increase Java postcommit timeout to 240 Manage RAT plugin more centrally; only toggle skipping Fix RAT exclusions Rearrange .gitignore slightly Fix typo in seed job Rename seed job so it is first in glob used by prior seed job Fix typo in seed jobs Increase seed job(s) timeout to 60 minutes Increase seed job(s) timeout to 30 minutes Make the main seed job standalone Increase job_beam_PreCommit_Java_MavenInstall timeout from 2.5 to 4 hours. Implement FnApi side inputs in Python. Unit test for label pipeline option NonNull by default in metrics Ignore findbugs in AutoValue generated classes NonNull by default for sdk/testing NonNull by default in sdk/state NonNull by default in sdk/runners NonNull by default in sdk/annotations NonNull by default in sdk/coders Make Java core SDK root dir NonNull by default Add dep on Apache-licensed findbugs-annotations implementation [BEAM-2682] Deletes AvroIOTransformTest Clone source to a distinguished subdirectory of Jenkins workspace Adding lull tracking for python sampler Avoids generating proto files for Windows if grpcio-tools is not installed. Do not crash when RawPTransform has null spec Unit test to repro NPE in PTransformTranslation Created Java snippets file Add standalone version of seed job Pin runner harness also for official BEAM releases. Remove duplicate mocking in DataflowRunnerTest Add assertion that valid jobs must have staged pipeline Stage the pipeline without using a temp file Stage the portable pipeline in Dataflow Add ability to stage explicit file list Improve GcsFileSystem errors messages slightly [BEAM-2720] Update kafka client version to 0.11.0.1 Update PipelineTest.testReplacedNames DirectRunner: Replace use of RawPTransform with NotSerializable.forUrn translators Wordcount on fnapi pipeline and IT test. Clearer getOrDefault style in RehydratedComponents ...
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which have relatively disparate backgrounds and needs.
The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the “Dataflow Model”.
To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.
The key concepts in the Beam programming model are:
PCollection
: represents a collection of data, which could be bounded or unbounded in size.PTransform
: represents a computation that transforms input PCollections into output PCollections.Pipeline
: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.PipelineRunner
: specifies where and how the pipeline should execute.Beam supports multiple language specific SDKs for writing pipelines against the Beam Model.
Currently, this repository contains SDKs for both Java and Python.
Have ideas for new SDKs or DSLs? See the JIRA.
Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:
DirectRunner
runs the pipeline on your local machine.ApexRunner
runs the pipeline on an Apache Hadoop YARN cluster (or in embedded mode).DataflowRunner
submits the pipeline to the Google Cloud Dataflow.FlinkRunner
runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.SparkRunner
runs the pipeline on an Apache Spark cluster. The code has been donated from cloudera/spark-dataflow and is now part of Beam.Have ideas for new Runners? See the JIRA.
Please refer to the Quickstart[Java, Python] available on our website.
If you'd like to build and install the whole project from the source distribution, you may need some additional tools installed in your system. In a Debian-based distribution:
sudo apt-get install \ openjdk-8-jdk \ maven \ python-setuptools \ python-pip
Then please use the standard mvn clean install
command.
See the Spark Runner README.
To get involved in Apache Beam:
We also have a contributor's guide.