[2.22.0] - Unreleased

Highlights

  • New highly anticipated feature X added to Python SDK (BEAM-X).
  • New highly anticipated feature Y added to Java SDK (BEAM-Y).

I/Os

  • Support for X source added (Java/Python) (BEAM-X).

New Features / Improvements

  • The --workerCacheMB flag is now supported in Dataflow streaming pipelines (BEAM-9964).
  • --direct_num_workers=0 is now supported for the FnApi runner. It sets the number of threads/subprocesses to the number of cores of the machine executing the pipeline (BEAM-9443).
  • Python SDK now has experimental support for SqlTransform (BEAM-8603).
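
The --direct_num_workers=0 behavior described in this list can be pictured with a small sketch. The helper name resolve_direct_num_workers is hypothetical and this is not the runner's actual code; it only illustrates the rule "0 means one worker per core", assuming the core count is read from the standard library.

```python
import os


def resolve_direct_num_workers(requested: int) -> int:
    """Hypothetical helper: map the flag value to an effective worker count."""
    if requested < 0:
        raise ValueError("direct_num_workers must be >= 0")
    if requested == 0:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        return max(os.cpu_count() or 1, 1)
    # Any positive value is used as given.
    return requested


# 0 expands to the number of cores on the machine running the pipeline.
effective_workers = resolve_direct_num_workers(0)
```

Under this sketch, a positive value is passed through unchanged and only 0 is expanded.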

Breaking Changes

  • The Python SDK now requires --job_endpoint to be set when using --runner=PortableRunner (BEAM-9860). Users seeking the old default behavior should set --runner=FlinkRunner instead.
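
A launcher script can catch the missing --job_endpoint before a pipeline is submitted. The following is a minimal sketch using only argparse; the function name check_portable_runner_args and the endpoint value localhost:8099 are assumptions made for illustration and are not part of the SDK.

```python
import argparse


def check_portable_runner_args(argv):
    """Hypothetical pre-flight check for the PortableRunner requirement."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--runner", default=None)
    parser.add_argument("--job_endpoint", default=None)
    # Ignore every other pipeline flag; only these two matter here.
    known, _unused = parser.parse_known_args(argv)
    if known.runner == "PortableRunner" and not known.job_endpoint:
        raise ValueError(
            "--runner=PortableRunner requires --job_endpoint; "
            "use --runner=FlinkRunner for the old default behavior"
        )
    return known


# Example endpoint value is an assumption, not a documented default.
args = check_portable_runner_args(
    ["--runner=PortableRunner", "--job_endpoint=localhost:8099"]
)
```

Passing --runner=FlinkRunner without an endpoint is accepted by this check, which mirrors the migration advice in the note.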

Deprecations

  • X behavior is deprecated and will be removed in X versions (BEAM-X).

Known Issues

  • Fixed X (Java/Python) (BEAM-X).

[2.21.0] - Unreleased (In Progress)

Highlights

I/Os

  • Python: Deprecated module apache_beam.io.gcp.datastore.v1 has been removed as the client it uses is out of date and does not support Python 3 (BEAM-9529). Please migrate your code to use apache_beam.io.gcp.datastore.v1new. See the updated datastore_wordcount for example usage.
  • Python SDK: Added integration tests and updated batch write functionality for Google Cloud Spanner transform (BEAM-8949).

New Features / Improvements

  • Python SDK will now use Python 3 type annotations as pipeline type hints. (#10717)

    If you suspect that this feature is causing your pipeline to fail, calling apache_beam.typehints.disable_type_annotations() before pipeline creation will disable it completely, and decorating specific functions (such as process()) with @apache_beam.typehints.no_annotations will disable it for those functions.

    More details will be in Ensuring Python Type Safety and an upcoming blog post.

  • Java SDK: Introduced the concept of options in Beam Schemas. These options add extra context to fields and schemas. They replace the current Beam metadata, which is present only in a FieldType; options are available in both fields and row schemas. Schema options are fully typed and can contain complex rows. Remark: Schema awareness is still experimental. (BEAM-9035)

  • Java SDK: The protobuf extension is fully schema-aware and also converts protobuf options to Beam schema options. Remark: Schema awareness is still experimental. (BEAM-9044)

  • Added ability to write to BigQuery via Avro file loads (Python) (BEAM-8841)

    By default, file loads use JSON, but you can set the temp_file_format parameter to perform file loads with Avro. Avro-based file loads work by exporting Python types into Avro types, so to switch to them you will need to change your data from JSON-compatible types (dates and timestamps as strings, long numeric values as strings) to Python native types that are written to Avro (Python's date and datetime types, decimal, etc.). For more information, see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions.

  • Added integration of Java SDK with Google Cloud AI VideoIntelligence service (BEAM-9147)

  • Added integration of Java SDK with Google Cloud AI natural language processing API (BEAM-9634)

  • A docker-pull-licenses tag was introduced. Licenses/notices of third-party dependencies are added to the Docker images when docker-pull-licenses is set. The files are added to /opt/apache/beam/third_party_licenses/. By default, no licenses/notices are added to the Docker images. (BEAM-9136)
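
The data-type change needed for Avro-based BigQuery file loads, described in this list, can be sketched as a row conversion step. The field names and the helper to_native_row are hypothetical, and the sketch uses only the standard library; it shows the shift from JSON-compatible strings to Python native types rather than any Beam API.

```python
import datetime
from decimal import Decimal


def to_native_row(row):
    """Hypothetical converter from a JSON-style row to Python native types."""
    return {
        "name": row["name"],
        # String-typed date becomes datetime.date.
        "birth_date": datetime.date.fromisoformat(row["birth_date"]),
        # String-typed timestamp becomes datetime.datetime.
        "updated_at": datetime.datetime.fromisoformat(row["updated_at"]),
        # Long numeric value carried as a string becomes Decimal.
        "balance": Decimal(row["balance"]),
    }


json_row = {
    "name": "example",
    "birth_date": "2020-04-15",
    "updated_at": "2020-04-15T12:30:00",
    "balance": "12345678901234567890.12",
}
native_row = to_native_row(json_row)
```

A step like this would run before the write transform, so that the rows reaching it already hold date, datetime, and Decimal values.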

Breaking Changes

  • Dataflow runner now requires the --region option to be set, unless a default value is set in the environment (BEAM-9199). See here for more details.
  • HBaseIO.ReadAll now requires a PCollection of HBaseIO.Read objects instead of HBaseQuery objects (BEAM-9279).
  • ProcessContext.updateWatermark has been removed in favor of using a WatermarkEstimator (BEAM-9430).
  • Coder inference for PCollection of Row objects has been disabled (BEAM-9569).
  • Go SDK docker images are no longer released until further notice.

Deprecations

  • Java SDK: Beam Schema FieldType.getMetadata is now deprecated in favor of Beam Schema Options. It will be removed in version 2.23.0. (BEAM-9704)

Known Issues

[2.20.0] - 2020-04-15

Highlights

I/Os

  • Java SDK: Adds support for Thrift encoded data via ThriftIO. (BEAM-8561)
  • Java SDK: KafkaIO supports schema resolution using Confluent Schema Registry. (BEAM-7310)
  • Java SDK: Add Google Cloud Healthcare IO connectors: HL7v2IO and FhirIO (BEAM-9468)
  • Python SDK: Support for Google Cloud Spanner. This is an experimental module for reading and writing data from Google Cloud Spanner (BEAM-7246).
  • Python SDK: Adds support for standard HDFS URLs (with server name). (#10223).

New Features / Improvements

  • New AnnotateVideo & AnnotateVideoWithContext PTransforms that integrate GCP Video Intelligence functionality. (Python) (BEAM-9146)
  • New AnnotateImage & AnnotateImageWithContext PTransforms for element-wise & batch image annotation using Google Cloud Vision API. (Python) (BEAM-9247)
  • Added a PTransform for inspection and deidentification of text using Google Cloud DLP. (Python) (BEAM-9258)
  • New AnnotateText PTransform that integrates Google Cloud Natural Language functionality (Python) (BEAM-9248)
  • ReadFromBigQuery now supports value providers for the query string (Python) (BEAM-9305)
  • Direct runner for FnApi supports further parallelism (Python) (BEAM-9228)
  • Support for @RequiresTimeSortedInput in Flink and Spark (Java) (BEAM-8550)

Breaking Changes

  • ReadFromPubSub(topic=) in Python previously created a subscription under the same project as the topic. Now it will create the subscription under the project specified in pipeline_options. If the project is not specified in pipeline_options, then it will create the subscription under the same project as the topic. (BEAM-3453).
  • SpannerAccessor in Java is now package-private to reduce API surface. SpannerConfig.connectToSpanner has been moved to SpannerAccessor.create. (BEAM-9310).
  • The ParquetIO Hadoop dependency must now be provided by users (BEAM-8616).
  • Docker images are deployed to apache/beam repositories starting with 2.20. They were previously deployed to the apachebeam repository. (BEAM-9063)
  • PCollections now have tags inferred from the result type (e.g. the keys of a dict or the index of a tuple). Users may expect the old implementation, which gave PCollection output ids a monotonically increasing id. To go back to the old implementation, use the force_generated_pcollection_output_ids experiment.
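
The new ReadFromPubSub(topic=) rule in this list reduces to a simple precedence: the project from pipeline_options wins, and the topic's project is the fallback. The sketch below expresses that rule with a hypothetical helper, subscription_project, assuming topics use the projects/<project>/topics/<topic> path form; it is not the SDK's implementation.

```python
import re

# Assumed topic path form: projects/<project>/topics/<topic>
_TOPIC_RE = re.compile(r"^projects/([^/]+)/topics/([^/]+)$")


def subscription_project(topic, options_project=None):
    """Hypothetical helper: pick the project that will own the subscription."""
    match = _TOPIC_RE.match(topic)
    if match is None:
        raise ValueError("expected projects/<project>/topics/<topic>")
    topic_project = match.group(1)
    # The pipeline_options project takes precedence when it is set.
    return options_project or topic_project


topic = "projects/topic-project/topics/events"
with_option = subscription_project(topic, options_project="pipeline-project")
without_option = subscription_project(topic)
```

Pipelines that relied on the subscription living in the topic's project may need to leave the project unset or adjust permissions accordingly.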

Deprecations

Bugfixes

  • Fixed numpy operators in ApproximateQuantiles (Python) (BEAM-9579).
  • Fixed exception when running in IPython notebook (Python) (BEAM-9277).
  • Fixed Flink uberjar job termination bug. (BEAM-9225)
  • Fixed SyntaxError in process worker startup (BEAM-9503)
  • Key is now available in @OnTimer methods (Java) (BEAM-1819).

Known Issues

[2.19.0] - 2020-01-31