Use `pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord]` to install all of the optional extras, though most users will not need all of these extras (#34554).
Added the `--gbek` pipeline option to automatically replace all GroupByKey transforms (Java/Python) (#36214).
The `--element_processing_timeout_minutes` option will also interrupt the SDK process if slowness happens during DoFn initialization, for example in `DoFn.setup()` (#36518).
`dill` is no longer a required, default dependency for Apache Beam (#21298).
Pipelines that still need dill-based pickling must set the `pickle_library=dill` pipeline option (see the sketch after this list).
While `dill==0.3.1.1` is still pre-installed on the official Beam SDK base images, it is no longer a direct dependency of the `apache-beam` Python package. This means it can be overridden by other dependencies in your environment.
If you use `pickle_library=dill`, you must manually ensure `dill==0.3.1.1` is installed in both your submission and runtime environments.
For example, install with `pip install apache-beam[gcp,dill]`.
When your custom container is based on a Beam SDK image (e.g., `FROM apache/beam_python3.10_sdk:2.69.0`), add `dill==0.3.1.1` to your worker's requirements file (e.g., `requirements.txt`).
When building from a plain Python base image, install the dill extra in the Dockerfile, e.g. `FROM python:3.9-slim` followed by `RUN pip install --no-cache-dir apache-beam[gcp,dill]`.
If dill is missing or mismatched at runtime, you may see errors such as `Can't get attribute '_create_code' on <module 'dill._dill' from....` or `ImportError: Pipeline option pickle_library=dill is set, but dill is not installed...`.
Report any issues with `pickle_library=dill` to the GitHub issue (#21298).
Added the `pickle_library=dill_unsafe` pipeline option. This allows overriding `dill==0.3.1.1` while still using dill as the pickle_library. Use with extreme caution; other versions of dill have not been tested with Apache Beam (#21298).
To restore the previous behavior, set `--update_compatibility_version=2.68.0` (#36345).
`--transform_name_mapping`.
If you relied on packages previously installed by default with `pip install apache-beam`, use `pip install apache-beam[hadoop,gcp,interactive,redis,test,tfrecord]`.
Added a new module, `sdks/java/testing/junit`, which provides `TestPipelineExtension` for JUnit 5 while maintaining backward compatibility with existing JUnit 4 `TestRule`-based tests (Java) (#18733, #35688).
The new artifact is `org.apache.beam:beam-sdks-java-testing-junit`.
Added `ReactiveThrottler` and `ThrottlingSignaler` classes to streamline throttling behavior in DoFns and expose throttling mechanisms for users (#35984).
Added the `--element_processing_timeout_minutes` option to reduce the chance of stalled pipelines due to unexpected cases of slow processing, where slowness might not happen again if processing of the same element is retried.
To restore the previous behavior, set `--update_compatibility_version=2.67.0` (#35725). Report any pickling-related issues to #34903.
`JdbcIO.WriteWithResults` now honors the configured settings; previously, these settings were ignored (#35669).
The `ParquetTable` external table provider has changed its handling of the `LOCATION` property. To read from a directory, the path must now end with a trailing slash (e.g., `LOCATION '/path/to/data/'`). Previously, a trailing slash was not required. This change was made to enable support for glob patterns and single-file paths (#35582).
If you previously relied on abbreviated pipeline option names (e.g., `--r` for `--runner`), you will now need to specify the whole argument (#34934).
Set the `updateCompatibilityVersion` option to the SDK version used for the original pipeline, for example `--updateCompatibilityVersion=2.64.0`.
Updated the `AlloyDBVectorWriterConfig` API to align with the new `PostgresVectorWriter` transform. Here's a quick guide to update your code (#35225):
Switch calls (e.g., `v17()`) to `current()` for better backward compatibility in the future.
cloudpickle is now the default `pickle_library`, where previously dill was the default (#34695). For known issues, reporting new issues, and understanding cloudpickle behavior, refer to #34903.
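A minimal sketch of opting back into dill-based pickling with the `pickle_library=dill` option described above, assuming `dill==0.3.1.1` is installed in both the submission and runtime environments; the pipeline itself is a placeholder.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Opt back into dill-based pickling. Requires dill==0.3.1.1 to be installed
# in both the submission and runtime environments, as noted above.
options = PipelineOptions(["--pickle_library=dill"])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a", "b", "c"])  # placeholder input
     | beam.Map(str.upper)
     | beam.Map(print))
```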
To restore the previous behavior, use `update_compatibility_version` (#34767).
`TypedDict` typehints are now compatible with `Mapping` and `Dict` type annotations.
To keep using dill for pickling, set `pickle_library=dill` in pipeline options (#35062).
If needed, set the `VLLM_CONFIGURE_LOGGING=0` environment variable in your custom container.
`number_of_worker_harness_threads`.
Added an Iceberg CDC source, available as `Managed.ICEBERG_CDC` (#33504).
Support for the `--add-modules` JVM option is added through a new pipeline option, `JdkAddRootModules`. This allows extending the module graph with optional modules such as SDK incubator modules. Sample usage: `<pipeline invocation> --jdkAddRootModules=jdk.incubator.vector` (#30281).
Support for `collections.abc.Sequence` type hints was added, which can lead to pipelines failing type hint checks that were previously passing erroneously. These issues will be most commonly seen when trying to consume a PCollection with a `Sequence` type hint after a GroupByKey or a CoGroupByKey (#33999); see the sketch after this list.
In `SpannerConfig` for SpannerIO, calling `withHost` with a null or empty host will now result in a NullPointerException (`java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null`). See https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/34489 for context.
Added the `--groupFilesFileLoad` pipeline option to mitigate side-input related issues in BigQueryIO batch FILE_LOAD on certain runners (including Dataflow Runner V2) (Java) (#33587).
`--windmillRequestBatchedGetWorkResponse=false`.
`--files_to_stage` flag (Python) (#34208).
Support for `date`, `time`, `timestamp`, and `timestamp(tz)` types (#32939).
Support for writing to Solace message queues (`SolaceIO.Write`) added (Java) (#31905).
String values such as `"123"` are preserved as strings rather than silently coerced (and possibly truncated) to numeric values. To retain the old behavior, pass `dtype=True` (or any other value accepted by `pandas.read_json`).
Set the `updateCompatibilityVersion` option to the SDK version used for the original pipeline, for example `--updateCompatibilityVersion=2.58.1`.
`--runner` flag.
The `upload_graph` feature when using Dataflow Runner V2 (#32159).
Prism is under active development and does not yet support all pipelines. See #29650 for progress.
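A minimal sketch of the `collections.abc.Sequence` type-hint pitfall described above; the names are illustrative, not from the release notes. After a `GroupByKey`, the grouped values arrive as an `Iterable`, so a `Sequence` hint on the consumer can now fail type checks that previously passed erroneously.

```python
from collections.abc import Sequence
from typing import Iterable, Tuple

import apache_beam as beam

# Illustrative only: a Sequence-typed signature like this one may now fail
# type-hint checks when applied to the output of a GroupByKey.
def format_with_sequence(kv: Tuple[str, Sequence[int]]) -> str:
    key, values = kv
    return f"{key}: {sorted(values)}"

# A hint that matches what GroupByKey actually produces (an Iterable of values).
def format_with_iterable(kv: Tuple[str, Iterable[int]]) -> str:
    key, values = kv
    return f"{key}: {sorted(values)}"

with beam.Pipeline() as p:
    (p
     | beam.Create([("a", 1), ("a", 2), ("b", 3)])
     | beam.GroupByKey()
     | beam.Map(format_with_iterable)
     | beam.Map(print))
```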
Java file-based IOs that read or write a large number (100k+) of files could experience slowness and/or broken metrics visualization on the Dataflow UI (#32649).
BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform (#32780):
[Managed Iceberg] DataFile metadata is assigned incorrect partition values (#33497).
[FileBasedIO] StringSet metrics can grow unboundedly large when a pipeline reads or writes a large number of files, degrading functionality such as metrics monitoring and Dataflow job upgrades.
Records read with `ReadFromKafkaViaSDF` are redistributed and may contain duplicates regardless of the configuration. This affects Java pipelines with the Dataflow v2 runner and xlang pipelines reading from Kafka (#32196).
Large Dataflow graphs using runner v2, or pipelines explicitly enabling the `upload_graph` experiment, will fail at construction time (#32159).
Added an `inMemory()` variant of Java List and Map side inputs for more efficient lookups when the entire side input fits into memory.
`--jinja_variables` flag.
Added `DoFn.SetupContextParam` and `DoFn.BundleContextParam`, which can be used as a Python `DoFn.process`, `Map`, or `FlatMap` parameter to invoke a context manager per DoFn setup or bundle (analogous to using setup/teardown or start_bundle/finish_bundle, respectively).
`--requirements_file` (Python).
The `--enableBundling` pipeline option for BigQueryIO DIRECT_READ is replaced by `--enableStorageReadApiV2`. Both were considered experimental and may be subject to change (Java) (#26354).
Changed the default value of the `--max_cache_memory_usage_mb` pipeline option from 100 to 0. This option was first introduced in the 2.52.0 SDK. This change restores the behavior of the 2.51.0 SDK, which does not use the state cache. If your pipeline uses iterable side input views, consider increasing the cache size by setting the option manually (#30360).
To restore the previous behavior, pass the `--max_cache_memory_usage_mb=0` pipeline option (#30360).
The Python SDK now type checks `collections.abc.Collections` types properly. Some type hints that were erroneously allowed by the SDK may now fail (#29272).
Use `beam-sdks-java-extensions-avro` instead. This will allow easily updating the Avro version in user code without potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and should handle this (#25252).
Added the `UseDataStreamForBatch` pipeline option to the Flink runner. When it is set to true, the Flink runner will run batch jobs using the DataStream API. By default the option is set to false, so batch jobs are still executed using the DataSet API.
`upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for the Java SDK (PR #28621).
Added the `--max_cache_memory_usage_mb` pipeline option to configure state and side input cache size. The cache has been enabled to a default of 100 MB. Use `--max_cache_memory_usage_mb=X` to provide the cache size for the user state API and side inputs (#28770).
`org.apache.beam.sdk.io.CountingSource.CounterMark` uses a custom `CounterMarkCoder` as a default coder since all Avro-dependent classes finally moved to extensions/avro. If it is still required to use `AvroCoder` for `CounterMark`, then, as a workaround, a copy of the "old" `CountingSource` class should be placed into the project code and used directly (#25252).
Renamed `host` to `firestoreHost` in `FirestoreOptions` to avoid a potential conflict of command line arguments (Java) (#29201).
Transforms that use `SnappyCoder` are update incompatible with previous versions of the same transform (Java) on some runners. This includes PubSubIO's read (#28655).
The `watch_file_pattern` arg of RunInference had no effect prior to 2.52.0. To use the behavior of `watch_file_pattern` prior to 2.52.0, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use the `WatchFilePattern` PTransform as a side input (#28948); see the sketch below.
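A rough sketch of the `WatchFilePattern` side-input pattern referenced above, assuming the utility lives at `apache_beam.ml.inference.utils`; the file pattern, `examples`, and `model_handler` are placeholders, not part of the release notes.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.utils import WatchFilePattern  # assumed module path

def build_pipeline(p, examples, model_handler):
    # Emits model metadata whenever a new file matching the pattern appears;
    # the file pattern below is a placeholder.
    model_updates = (
        p | "WatchModelUpdates" >> WatchFilePattern(
            file_pattern="gs://my-bucket/models/*.pt"))

    # RunInference consumes the updates as a side input to refresh the model.
    return (
        p
        | "CreateExamples" >> beam.Create(examples)
        | "Inference" >> RunInference(
            model_handler=model_handler,
            model_metadata_pcoll=model_updates))
```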
MLTransform doesn't output artifacts such as min, max, and quantiles. Instead, MLTransform will add a feature to output these artifacts in a human-readable format (#29017). For now, to use artifacts such as min and max that were produced by an earlier MLTransform, use `read_artifact_location` of MLTransform, which reads artifacts that were produced earlier in a different MLTransform (#29016).
Support for running `mypy` on user pipelines (#27906).
Worker-startup log content is now captured by the worker logger. Users who relied on `print()` statements for logging might notice that some logs don't flush before the pipeline succeeds; we strongly advise using the `logging` package instead of `print()` statements for logging (#28317). See the sketch below.
Removed the `t reflect.Type` parameter from `parquetio.Write`. The element type is derived from the input PCollection (Go) (#28490).
Pin the `fastavro` dependency to 1.8.3 or earlier on some runners that don't use Beam Docker containers (#28811).
`--staging_location` at pipeline submission; custom container images that are not based on Beam's default image must include an Apache Beam installation (#26996).
Go tests: `func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }`.
`direct_num_workers!=1` (#27373).
Pipelines with the `orjson` dependency might experience segmentation faults or get stuck (#28318).
`BigtableIO.ReadChangeStream` (#27183).
Images prebuilt with `--prebuild_sdk_container_engine=cloud_build`, like images depending on tensorflow or torch (#27023).
Disabled the `pip` cache when installing packages on the workers. This reduces the size of prebuilt Python container images (#27035).
Fixed a NullPointerException when a progress check is made before the reader is started (IO) (#23868).
Pipelines using the `--impersonate_service_account` option with BigQuery IOs might fail on Dataflow (#32030). This is fixed in the 2.59.0 release.
(Tip: use `df | (Transform1 | Transform2 | ...)` to avoid excessive conversions.)
`key` and `reverse` (#25888).
`:sdks:java:extensions:avro` (#24748).
LZMA compression/decompression of text files added to the Python SDK (#25316).
`use_bounded_concurrent_output_for_sdf`.
Added the `WatchFilePattern` transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern (#24042).
Added support for loading TorchScript models with `PytorchModelHandler`. The TorchScript model path can be passed to PytorchModelHandler using `torch_script_model_path=<path_to_model>` (#25321).
Pass `target_batch_duration_secs_including_fixed_cost=1` to `BatchElements`.
Ensure in your Dockerfile that the Beam container entrypoint and/or Beam base image version match the Beam SDK version used at job submission.
Avro-related classes are deprecated in the module `beam-sdks-java-core` and will eventually be removed. Please migrate to the new module `beam-sdks-java-extensions-avro` instead by importing the classes from the `org.apache.beam.sdk.extensions.avro` package. For the sake of migration simplicity, the relative package path and the whole class hierarchy of Avro-related classes in the new module are preserved the same as before. For example, import the `org.apache.beam.sdk.extensions.avro.coders.AvroCoder` class instead of `org.apache.beam.sdk.coders.AvroCoder` (#24749).
The `disable_runner_v2`, `disable_runner_v2_until_2023`, and `disable_prime_runner_v2` experiments will raise an error during pipeline construction. You can no longer specify the Dataflow worker jar override. Note that non-portable Java jobs and non-portable Python batch jobs are not impacted (#24515).
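A small sketch of the logging advice above (using the standard `logging` module rather than `print()` inside a DoFn); the DoFn itself is illustrative.

```python
import logging

import apache_beam as beam

class LogAndPassThrough(beam.DoFn):
    """Illustrative DoFn: worker logs go through the logging package."""

    def process(self, element):
        # Preferred over print(): messages are captured by the worker logger
        # and flushed reliably.
        logging.info("Processing element: %s", element)
        yield element

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    with beam.Pipeline() as p:
        (p | beam.Create([1, 2, 3]) | beam.ParDo(LogAndPassThrough()))
```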
Beam now requires `pyarrow>=3` and `pandas>=1.4.3`, since older versions are not compatible with `numpy==1.24.0`.
See `python -m apache_beam.runners.render --help` for more details.
`--extra_package` option (Python) (#23684).
`--resource_hints` flag (Go) (#23990).
`map_windows` urn added to the Go SDK (#24307).
`ParquetIO.withSplit` was removed since splittable reading has been the default behavior since 2.35.0. The effect of this change is to drop support for non-splittable reading (Java) (#23832).
`beam-sdks-java-extensions-google-cloud-platform-core` is no longer a dependency of the Java SDK Harness. Some users of a portable runner (such as Dataflow Runner v2) may have an undeclared dependency on this package (for example using GCS with TextIO) and will now need to declare the dependency.
`beam-sdks-java-core` is no longer a dependency of the Java SDK Harness. Users of a portable runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
Run `poetry config virtualenvs.create false` before installing deps; see an example in #25085. If you were negatively impacted by this change and cannot find a workaround, feel free to chime in on #16658. To disable this behavior, you could upgrade to Beam 2.48.0 and set an environment variable `ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1` in your Dockerfile.
Select the Spark version with `--spark_version`. Spark 2 support is deprecated and will be removed soon (#23728).
Fixed `AfterProcessingTime` behavior in Python's DirectRunner to match Java (#23071); see the sketch below.
To disable projection pushdown, use `--experiments=disable_projection_pushdown`.
`BoundedSourceAsSDFWrapperFn` and `UnboundedSourceAsSDFWrapper`.
`javax.jms.Message` (Java) (BEAM-16308). A `topicNameMapper` must be set to extract the topic name from the input value. A `valueMapper` must be set to convert the input value to a JMS message.
`DataFrame.unstack()`, `DataFrame.pivot()` and `Series.unstack()` implemented for the DataFrame API (BEAM-13948, BEAM-13966).
`ShallowCloneParDoPayload()`, `ShallowCloneSideInput()`, and `ShallowCloneFunctionSpec()` have been removed from the Go SDK's pipelinex package (BEAM-13739).
JmsIO writes now require a `valueMapper` to be set (BEAM-16308). You can use the `TextMessageMapper` to convert String inputs to JMS TextMessages: `JmsIO.<String>write().withConnectionFactory(jmsConnectionFactory).withValueMapper(new TextMessageMapper());`
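A small illustrative sketch of configuring an `AfterProcessingTime` trigger in Python, related to the DirectRunner behavior fix noted above; the windowing and trigger values are placeholders, not part of the release notes.

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    (p
     | beam.Create([("k", 1), ("k", 2), ("k", 3)])
     | beam.WindowInto(
         window.FixedWindows(60),  # placeholder window size (seconds)
         # Fire repeatedly, 10 seconds of processing time after the first element.
         trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | beam.GroupByKey()
     | beam.Map(print))
```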
`metadata()` added to `io.filesystem.FileSystem` in the Python SDK (BEAM-14314).
To disable projection pushdown, use `--experiments=disable_projection_pushdown`.
`amazon-web-services2` has reached feature parity and is finally recommended over the earlier `amazon-web-services` and `kinesis` modules (Java). These will be deprecated in one of the next releases (BEAM-13174).
Kinesis was added (BEAM-13175).
`AwsOptions` (BEAM-13563, BEAM-13663, BEAM-13587).
S3 Filesystem (BEAM-13245, BEAM-13246, BEAM-13441, BEAM-13445, BEAM-14011), DynamoDB IO (BEAM-13209, BEAM-13209), SQS IO (BEAM-13631, BEAM-13510), and others.
Packages specified in `--requirements_file` will now be staged to the runner using binary distributions (wheels) of the PyPI packages for the linux_x86_64 platform (BEAM-4032). To restore the behavior of using source distributions, set the pipeline option `--requirements_cache_only_sources`. To skip staging the packages at submission time, set the pipeline option `--requirements_cache=skip` (Python).
Previously, `DoFn.infer_output_type` was expected to return `Iterable[element_type]`, where `element_type` is the PCollection element type. It is now expected to return `element_type`. Take care if you have overridden `infer_output_type` in a DoFn (this is not common). See BEAM-13860.
(`amazon-web-services2`) The types of `awsRegion` / `endpoint` in `AwsOptions` changed from `String` to `Region` / `URI` (BEAM-13563).
(`amazon-web-services2`) Client providers (`withXYZClientProvider()`) as well as IO-specific `RetryConfiguration`s are deprecated; instead use `withClientConfiguration()` or `AwsOptions` to configure AWS IOs / clients. Custom implementations of client providers shall be replaced with a respective `ClientBuilderFactory` and configured through `AwsOptions` (BEAM-13563).
Fixed `pyarrow` version parsing (Python) (BEAM-14235).
A field of `SqsMessage` for AWS IOs for SDK v2 was changed from `String` to `long`, and the visibility of all fields was fixed from package private to public (BEAM-13638).
`JdbcIO.readWithPartitions` changed from `int` to `long` (BEAM-13149). This is a relatively minor breaking change, which we're implementing to improve the usability of the transform without increasing cruft. This transform is relatively new, so we may implement other breaking changes in the future to improve its usability.
Added the `with_exception_handling` option for easily ignoring bad records and implementing the dead letter pattern.
`ReadFromBigQuery` and `ReadAllFromBigQuery` now run queries with BATCH priority by default. The `query_priority` parameter is introduced to the same transforms to allow configuring the query priority (Python) (BEAM-12913).
Support for the BigQuery Storage Read API was added to `ReadFromBigQuery`. The newly introduced `method` parameter can be set to `DIRECT_READ` to use the Storage Read API. The default is `EXPORT`, which invokes a BigQuery export request (Python) (BEAM-10917). See the sketch below.
Added a `use_native_datetime` parameter to `ReadFromBigQuery` to configure the return type of DATETIME fields when using ReadFromBigQuery. This parameter can only be used when `method = DIRECT_READ` (Python) (BEAM-10917).
Added a `dataframe` extra to the Python SDK that tracks pandas versions we've verified compatibility with. We now recommend installing Beam with `pip install apache-beam[dataframe]` when you intend to use the DataFrame API (BEAM-12906).
`go.mod` files will need to change to require `github.com/apache/beam/sdks/v2`, and imports change from `.../sdks/go/...` to `.../sdks/v2/go/...`.
`useReflectApi` setting to control it (BEAM-12628).
`--allow_unsafe_triggers` (BEAM-9487).
The `--allow_unsafe_triggers` flag starting with Beam 2.34 (BEAM-9487).
Update Jackson (`spark.jackson.version`) to at least version 2.9.2, due to Beam updating its dependencies.
`VARCHAR`, `NVARCHAR`, `LONGVARCHAR`, `LONGNVARCHAR`, `DATE`, `TIME` (Java) (BEAM-12385).
`--allow_unsafe_triggers` (BEAM-9487).
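A minimal sketch of the ReadFromBigQuery options mentioned above (`method`, `query_priority`, `use_native_datetime`); the project, query, and temp location are placeholders, and accepted values should be checked against your SDK version.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(["--temp_location=gs://my-bucket/tmp"])  # placeholder

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromBigQuery(
         query="SELECT name, ts FROM `my-project.my_dataset.my_table`",  # placeholder
         use_standard_sql=True,
         method="DIRECT_READ",          # use the Storage Read API instead of EXPORT
         use_native_datetime=True,      # only valid with method="DIRECT_READ"
         query_priority="INTERACTIVE")  # queries default to BATCH priority
     | beam.Map(print))
```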
The `--allow_unsafe_triggers` flag starting with Beam 2.33 (BEAM-9487).
`CREATE FUNCTION` DDL statement added to Calcite SQL syntax. `JAR` and `AGGREGATE` are now reserved keywords (BEAM-12339).
`TriggerFn` has a new `may_lose_data` method to signal potential data loss. Default behavior assumes safe (necessary for backwards compatibility). See Deprecations for potential impact of overriding this (BEAM-9487).
`Row(x=3, y=4)` is no longer considered equal to `Row(y=4, x=3)` (BEAM-11929).
`TopCombineFn` disallows `compare` as its argument (Python) (BEAM-7372).
`--allow_unsafe_triggers` (BEAM-9487).
Register `FakeDeterministicFastPrimitivesCoder` with `beam.coders.registry.register_fallback_coder(beam.coders.coders.FakeDeterministicFastPrimitivesCoder())` or use the `allow_non_deterministic_key_coders` pipeline option.
`dependencyManagement` in Maven and `force` in Gradle.
Added a new transform, `ReadAllFromBigQuery`, that can receive multiple requests to read data from BigQuery at pipeline runtime. See PR 13170 and BEAM-9650.
The `--region` flag in `amazon-web-services2` was replaced by `--awsRegion` (BEAM-11331).
To opt out, use `--experiments=use_deprecated_read`. The Apache Beam community is looking for feedback on this change as the community is planning to make this change permanent with no opt-out. If you run into an issue requiring the opt-out, please send an e-mail to user@beam.apache.org specifically referencing BEAM-10670 in the subject line and why you needed to opt out (Java) (BEAM-10670).
Use `--HTTPWriteTimeout=0` to revert to the old behavior (BEAM-6103).
Enable `--experiments=use_runner_v2` before using this feature (BEAM-3736).
Using `CombineFn.from_callable()` or `CombineFn.maybe_from_callable()` can lead to incorrect behavior (BEAM-11522).
`apache_beam.io.kinesis` (BEAM-10138, BEAM-10137).
`apache_beam.io.snowflake` (BEAM-9898).
Added delete support to `ElasticsearchIO#Write`. Now, Java's ElasticsearchIO can be used to selectively delete documents using the `withIsDeleteFn` function (BEAM-5757).
`ReadFromBigQuery` added (Python) (BEAM-10524).
Type checking of `@ptransform_fn` decorators in the Python SDK (BEAM-4091). This has not been enabled by default, to preserve backwards compatibility; use the `--type_check_additional=ptransform_fn` flag to enable. It may be enabled by default in future versions of Beam.
Snowflake IO was moved from `apache_beam.io.external.snowflake` to `apache_beam.io.snowflake`. The previous path will be removed in future versions.
`--experiments=use_legacy_bq_sink`.
`apache_beam.io.jdbc` (BEAM-10135, BEAM-10136).
`apache_beam.io.external.snowflake` (BEAM-9897).
Added support for `typing.FrozenSet` type hints, which are not interchangeable with `typing.Set`. You may need to update your pipelines if type checking fails (BEAM-10197).
Added `apache_beam.io.gcp.bigquery.ReadFromBigQuery`. This transform is experimental. It reads data from BigQuery by exporting data to Avro files and reading those files. It also supports reading data by exporting to JSON files. This has small differences in behavior for Time and Date-related fields. See the Pydoc for more information.
`RowJson.RowJsonDeserializer`, `JsonToRow`, and `PubsubJsonTableProvider` now accept "implicit nulls" by default when deserializing JSON (Java) (BEAM-10220).
Previously nulls could only be represented with explicit null values, as in `{"foo": "bar", "baz": null}`, whereas an implicit null like `{"foo": "bar"}` would raise an exception. Now both JSON strings will yield the same result by default. This behavior can be overridden with `RowJson.RowJsonDeserializer#withNullBehavior`.
Fixed the `GroupIntoBatches` experimental transform in Python to actually group batches by key. This changes the output type for this transform (BEAM-6696).
The `--workerCacheMB` flag is supported in Dataflow streaming pipelines (BEAM-9964).
`--direct_num_workers=0` is supported for the FnApi runner. It will set the number of threads/subprocesses to the number of cores of the machine executing the pipeline (BEAM-9443).
`--job_endpoint` is now required to be set when using `--runner=PortableRunner` (BEAM-9860). Users seeking the old default behavior should set `--runner=FlinkRunner` instead.
`apache_beam.io.gcp.datastore.v1` has been removed, as the client it uses is out of date and does not support Python 3 (BEAM-9529). Please migrate your code to use `apache_beam.io.gcp.datastore.v1new`. See the updated datastore_wordcount for example usage.
Python SDK will now use Python 3 type annotations as pipeline type hints. (#10717)
If you suspect that this feature is causing your pipeline to fail, calling `apache_beam.typehints.disable_type_annotations()` before pipeline creation will disable it completely, and decorating specific functions (such as `process()`) with `@apache_beam.typehints.no_annotations` will disable it for that function.
More details will be in [Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/) and an upcoming blog post.
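A brief sketch of how the annotation-based type hints above look in practice, including the two opt-outs mentioned; the DoFn and element format are illustrative.

```python
from typing import Iterable, Tuple

import apache_beam as beam

class ParseKV(beam.DoFn):
    # The annotations below are read as Beam type hints: elements are strings
    # and the output PCollection has element type Tuple[str, int].
    def process(self, element: str) -> Iterable[Tuple[str, int]]:
        key, value = element.split(",")
        yield key, int(value)

# Opt-outs described above:
#   apache_beam.typehints.disable_type_annotations()  # before pipeline creation
#   @apache_beam.typehints.no_annotations              # on a specific function

with beam.Pipeline() as p:
    (p
     | beam.Create(["a,1", "b,2"])
     | beam.ParDo(ParseKV())
     | beam.Map(print))
```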
Java SDK: Introducing the concept of options in Beam Schemas. These options add extra context to fields and schemas. This replaces the current Beam metadata that is present in a FieldType only; options are available in fields and row schemas. Schema options are fully typed and can contain complex rows. Remark: schema awareness is still experimental. (BEAM-9035)
Java SDK: The protobuf extension is fully schema aware and also includes protobuf option conversion to Beam schema options. Remark: schema awareness is still experimental. (BEAM-9044)
Added ability to write to BigQuery via Avro file loads (Python) (BEAM-8841)
By default, file loads will be done using JSON, but it is possible to specify the temp_file_format parameter to perform file exports with AVRO. AVRO-based file loads work by exporting Python types into Avro types, so to switch to Avro-based loads, you will need to change your data types from JSON-compatible types (string-type dates and timestamps, long numeric values as strings) into Python-native types that are written to Avro (Python's date and datetime types, decimal, etc.). For more information see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions.
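A minimal sketch of switching file loads to Avro via `temp_file_format`, assuming a placeholder table and schema; note the Python-native `datetime` value rather than a string timestamp, per the conversion note above.

```python
import datetime

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([
         # Python-native datetime (not a string), as required for Avro-based loads.
         {"name": "example", "created_at": datetime.datetime(2020, 1, 1, 12, 0, 0)},
       ])
     | beam.io.WriteToBigQuery(
         "my-project:my_dataset.my_table",   # placeholder table
         schema="name:STRING,created_at:TIMESTAMP",
         temp_file_format="AVRO",            # default is JSON-based file loads
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```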
Added integration of Java SDK with Google Cloud AI VideoIntelligence service (BEAM-9147)
Added integration of Java SDK with Google Cloud AI natural language processing API (BEAM-9634)
The docker-pull-licenses tag was introduced. Licenses/notices of third-party dependencies will be added to the docker images when docker-pull-licenses is set. The files are added to /opt/apache/beam/third_party_licenses/. By default, no licenses/notices are added to the docker images. (BEAM-9136)
The Dataflow runner now requires the `--region` option to be set, unless a default value is set in the environment (BEAM-9199). See here for more details.
2.23.0. (BEAM-9704)
The `--zone` option in the Dataflow runner is now deprecated. Please use `--worker_zone` instead (BEAM-9716); see the sketch below.
`SpannerConfig.connectToSpanner` has been moved to `SpannerAccessor.create` (BEAM-9310).
`force_generated_pcollection_output_ids` experiment.
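A small sketch of Dataflow options reflecting the `--region` requirement and the `--zone` to `--worker_zone` rename above; all values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, region, and zone values.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",          # now required unless set in the environment
    "--worker_zone=us-central1-a",   # replaces the deprecated --zone option
    "--temp_location=gs://my-bucket/tmp",
])
```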