<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Apache Beam</title><description>Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also brings DSL in different languages, allowing users to easily implement their data integration processes.</description><link>/</link><generator>Hugo -- gohugo.io</generator><item><title>Apache Beam 2.56.0</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>We are happy to present the new 2.56.0 release of Beam.
This release includes both improvements and new functionality.
See the &lt;a href="/get-started/downloads/">download page&lt;/a> for this release.&lt;/p>
&lt;p>For more information on changes in 2.56.0, check out the &lt;a href="https://github.com/apache/beam/milestone/20">detailed release notes&lt;/a>.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>Added FlinkRunner for Flink 1.17 and removed support for Flink 1.12 and 1.13. A pipeline running on Flink 1.16 or below can be upgraded to 1.17 by first updating it to Beam 2.56.0 while keeping the same Flink version; once the pipeline runs with Beam 2.56.0, it should be possible to upgrade to the FlinkRunner for Flink 1.17. (&lt;a href="https://github.com/apache/beam/issues/29939">#29939&lt;/a>)&lt;/li>
&lt;li>New Managed I/O Java API (&lt;a href="https://github.com/apache/beam/pull/30830">#30830&lt;/a>).&lt;/li>
&lt;li>New Ordered Processing PTransform added for processing order-sensitive stateful data (&lt;a href="https://github.com/apache/beam/pull/30735">#30735&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="ios">I/Os&lt;/h2>
&lt;ul>
&lt;li>Upgraded Avro version to 1.11.3, kafka-avro-serializer and kafka-schema-registry-client versions to 7.6.0 (Java) (&lt;a href="https://github.com/apache/beam/pull/30638">#30638&lt;/a>).
The newer Avro package is known to have breaking changes. If you are affected, you can stay pinned to an older Avro version, which is also tested with Beam.&lt;/li>
&lt;li>Iceberg read/write support is available through the new Managed I/O Java API (&lt;a href="https://github.com/apache/beam/pull/30830">#30830&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="new-features--improvements">New Features / Improvements&lt;/h2>
&lt;ul>
&lt;li>Profiling of Cythonized code has been disabled by default. This might improve performance for some Python pipelines (&lt;a href="https://github.com/apache/beam/pull/30938">#30938&lt;/a>).&lt;/li>
&lt;li>Bigtable enrichment handler now accepts a custom function to build a composite row key. (Python) (&lt;a href="https://github.com/apache/beam/issues/30975">#30974&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="breaking-changes">Breaking Changes&lt;/h2>
&lt;ul>
&lt;li>Default consumer polling timeout for KafkaIO.Read was increased from 1 second to 2 seconds. Use KafkaIO.read().withConsumerPollingTimeout(Duration duration) to configure this timeout value when necessary (&lt;a href="https://github.com/apache/beam/issues/30870">#30870&lt;/a>).&lt;/li>
&lt;li>Python Dataflow users no longer need to manually specify &lt;code>--streaming&lt;/code> for pipelines that use unbounded sources such as ReadFromPubSub; see the sketch after this list.&lt;/li>
&lt;/ul>
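&lt;p>Below is a minimal, illustrative sketch of such a pipeline. The project, topic, and bucket names are placeholders, and the pipeline options shown are only one way to configure a Dataflow run; with Beam 2.56.0 and later the streaming mode is inferred from the unbounded source.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch: a streaming pipeline reading from Pub/Sub.
# All resource names below are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    # With Beam 2.56.0 and later on Dataflow, the --streaming flag no longer
    # has to be set explicitly for unbounded sources such as ReadFromPubSub.
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/my-input-topic')
     | 'Process' >> beam.Map(lambda msg: msg.upper())  # bytes in, bytes out
     | 'Write' >> beam.io.WriteToPubSub(
         topic='projects/my-project/topics/my-output-topic'))
&lt;/code>&lt;/pre>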
&lt;h2 id="bugfixes">Bugfixes&lt;/h2>
&lt;ul>
&lt;li>Fixed locking issue when shutting down inactive bundle processors. Symptoms of this issue include slowness or stuckness in long-running jobs (Python) (&lt;a href="https://github.com/apache/beam/pull/30679">#30679&lt;/a>).&lt;/li>
&lt;li>Fixed a logging issue that silenced pip output when installing dependencies provided in &lt;code>--requirements_file&lt;/code> (Python).&lt;/li>
&lt;/ul>
&lt;h2 id="list-of-contributors">List of Contributors&lt;/h2>
&lt;p>According to git shortlog, the following people contributed to the 2.56.0 release. Thank you to all contributors!&lt;/p>
&lt;p>Abacn&lt;/p>
&lt;p>Ahmed Abualsaud&lt;/p>
&lt;p>Andrei Gurau&lt;/p>
&lt;p>Andrey Devyatkin&lt;/p>
&lt;p>Aravind Pedapudi&lt;/p>
&lt;p>Arun Pandian&lt;/p>
&lt;p>Arvind Ram&lt;/p>
&lt;p>Bartosz Zablocki&lt;/p>
&lt;p>Brachi Packter&lt;/p>
&lt;p>Byron Ellis&lt;/p>
&lt;p>Chamikara Jayalath&lt;/p>
&lt;p>Clement DAL PALU&lt;/p>
&lt;p>Damon&lt;/p>
&lt;p>Danny McCormick&lt;/p>
&lt;p>Daria Bezkorovaina&lt;/p>
&lt;p>Dip Patel&lt;/p>
&lt;p>Evan Burrell&lt;/p>
&lt;p>Hai Joey Tran&lt;/p>
&lt;p>Jack McCluskey&lt;/p>
&lt;p>Jan Lukavský&lt;/p>
&lt;p>JayajP&lt;/p>
&lt;p>Jeff Kinard&lt;/p>
&lt;p>Julien Tournay&lt;/p>
&lt;p>Kenneth Knowles&lt;/p>
&lt;p>Luís Bianchin&lt;/p>
&lt;p>Maciej Szwaja&lt;/p>
&lt;p>Melody Shen&lt;/p>
&lt;p>Oleh Borysevych&lt;/p>
&lt;p>Pablo Estrada&lt;/p>
&lt;p>Rebecca Szper&lt;/p>
&lt;p>Ritesh Ghorse&lt;/p>
&lt;p>Robert Bradshaw&lt;/p>
&lt;p>Sam Whittle&lt;/p>
&lt;p>Sergei Lilichenko&lt;/p>
&lt;p>Shahar Epstein&lt;/p>
&lt;p>Shunping Huang&lt;/p>
&lt;p>Svetak Sundhar&lt;/p>
&lt;p>Timothy Itodo&lt;/p>
&lt;p>Veronica Wasson&lt;/p>
&lt;p>Vitaly Terentyev&lt;/p>
&lt;p>Vlado Djerek&lt;/p>
&lt;p>Yi Hu&lt;/p>
&lt;p>akashorabek&lt;/p>
&lt;p>bzablocki&lt;/p>
&lt;p>clmccart&lt;/p>
&lt;p>damccorm&lt;/p>
&lt;p>dependabot[bot]&lt;/p>
&lt;p>dmitryor&lt;/p>
&lt;p>github-actions[bot]&lt;/p>
&lt;p>liferoad&lt;/p>
&lt;p>martin trieu&lt;/p>
&lt;p>tvalentyn&lt;/p>
&lt;p>xianhualiu&lt;/p></description><link>/blog/beam-2.56.0/</link><pubDate>Wed, 01 May 2024 10:00:00 -0400</pubDate><guid>/blog/beam-2.56.0/</guid><category>blog</category><category>release</category></item><item><title>Introducing Beam YAML: Apache Beam's First No-code SDK</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>Writing a Beam pipeline can be a daunting task. Learning the Beam model, downloading dependencies for the SDK language
of choice, debugging the pipeline, and maintaining the pipeline code is a lot of overhead for users who want to write a
simple to intermediate data processing pipeline. There have been strides in making the SDK&amp;rsquo;s entry points easier, but
for many, it is still a long way from being a painless process.&lt;/p>
&lt;p>To address some of these issues and simplify the entry point to Beam, we have introduced a new way to specify Beam
pipelines by using configuration files rather than code. This new SDK, known as
&lt;a href="https://beam.apache.org/documentation/sdks/yaml/">Beam YAML&lt;/a>, employs a declarative approach to creating
data processing pipelines using &lt;a href="https://yaml.org/">YAML&lt;/a>, a widely used data serialization language.&lt;/p>
&lt;h1 id="benefits-of-using-beam-yaml">Benefits of using Beam YAML&lt;/h1>
&lt;p>The primary goal of Beam YAML is to make the entry point to Beam as welcoming as possible. However, this should not come at the expense of the rich features that Beam offers.&lt;/p>
&lt;p>Here are some of the benefits of using Beam YAML:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>No-code development:&lt;/strong> Allows users to develop pipelines without writing any code. This makes it easier to get
started with Beam and to develop pipelines quickly and easily.&lt;/li>
&lt;li>&lt;strong>Maintainability&lt;/strong>: Configuration-based pipelines are easier to maintain than code-based pipelines. The YAML format enables a clear separation of concerns, simplifying changes and updates without affecting other parts of the pipeline.&lt;/li>
&lt;li>&lt;strong>Declarative language:&lt;/strong> Provides a declarative language, which means that it is based on the description of the desired outcome rather than expressing the intent through code. This makes it easy to understand the structure and flow of a pipeline. YAML is also widely used, with a rich community of resources for learning and leveraging the syntax.&lt;/li>
&lt;li>&lt;strong>Powerful features:&lt;/strong> Supports a wide range of features, including a variety of data sources and sinks, turn-key
transforms, and execution parameters. This makes it possible to develop complex data processing pipelines with Beam
YAML.&lt;/li>
&lt;li>&lt;strong>Reusability&lt;/strong>: Beam YAML promotes code reuse by providing a way to define and share common pipeline patterns. You
can create reusable YAML snippets or blocks that can be easily shared and reused in different pipelines. This reduces
the need to write repetitive tasks and helps maintain consistency across pipelines.&lt;/li>
&lt;li>&lt;strong>Extensibility&lt;/strong>: Beam YAML offers a structure for integrating custom transformations into a pipeline, enabling
organizations to contribute or leverage a pre-existing catalog of transformations that can be seamlessly accessed
using the Beam YAML syntax across multiple pipelines. It is also possible to build third-party extensions, including
custom parsers and other tools, that do not need to depend on Beam directly.&lt;/li>
&lt;li>&lt;strong>Backwards Compatibility&lt;/strong>: Beam YAML is still being actively worked on, bringing exciting new features and capabilities, but as these features are added, backwards compatibility will be preserved. This way, once a pipeline is written, it will continue to work on future releases of the SDK.&lt;/li>
&lt;/ul>
&lt;p>Overall, using Beam YAML provides a number of advantages. It makes pipeline development and management more efficient
and effective, enabling users to focus on the business logic and data processing tasks, rather than spending time on
low-level coding details.&lt;/p>
&lt;h1 id="case-study-a-simple-business-analytics-use-case">Case Study: A simple business analytics use-case&lt;/h1>
&lt;p>Let&amp;rsquo;s take the following sample transaction data for a department store:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">transaction_id&lt;/th>
&lt;th style="text-align:left">product_name&lt;/th>
&lt;th style="text-align:left">category&lt;/th>
&lt;th style="text-align:left">price&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">T0012&lt;/td>
&lt;td style="text-align:left">Headphones&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">59.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T5034&lt;/td>
&lt;td style="text-align:left">Leather Jacket&lt;/td>
&lt;td style="text-align:left">Apparel&lt;/td>
&lt;td style="text-align:left">109.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T0024&lt;/td>
&lt;td style="text-align:left">Aluminum Mug&lt;/td>
&lt;td style="text-align:left">Kitchen&lt;/td>
&lt;td style="text-align:left">29.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T0104&lt;/td>
&lt;td style="text-align:left">Headphones&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">59.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T0302&lt;/td>
&lt;td style="text-align:left">Monitor&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">249.99&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Now, let&amp;rsquo;s say that the business wants to get a record of transactions for all purchases made in the Electronics
department for audit purposes. Assuming the records are stored as a CSV file, a Beam YAML pipeline may look something
like this:&lt;/p>
&lt;p>Source code for this example can be found
&lt;a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/simple_filter.yaml">here&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">pipeline&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">transforms&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadFromCsv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadInputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/path/to/input.csv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Filter&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">FilterWithCategory&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">input&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadInputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">language&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">python&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">keep&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">category == &amp;#34;Electronics&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">WriteToCsv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">WriteOutputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">input&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">FilterWithCategory&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/path/to/output&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This would leave us with the following data:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">transaction_id&lt;/th>
&lt;th style="text-align:left">product_name&lt;/th>
&lt;th style="text-align:left">category&lt;/th>
&lt;th style="text-align:left">price&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">T0012&lt;/td>
&lt;td style="text-align:left">Headphones&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">59.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T0104&lt;/td>
&lt;td style="text-align:left">Headphones&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">59.99&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">T0302&lt;/td>
&lt;td style="text-align:left">Monitor&lt;/td>
&lt;td style="text-align:left">Electronics&lt;/td>
&lt;td style="text-align:left">249.99&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Now, let&amp;rsquo;s say the business wants to determine how much of each Electronics item is being sold to ensure that the
correct number is being ordered from the supplier. Let&amp;rsquo;s also assume that they want to determine the total revenue for
each item. This simple aggregation can follow the Filter from the previous example as such:&lt;/p>
&lt;p>Source code for this example can be found
&lt;a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/simple_filter_and_combine.yaml">here&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">pipeline&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">transforms&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadFromCsv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadInputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/path/to/input.csv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Filter&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">FilterWithCategory&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">input&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">ReadInputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">language&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">python&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">keep&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">category == &amp;#34;Electronics&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">Combine&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">CountNumberSold&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">input&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">FilterWithCategory&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">group_by&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">product_name&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">combine&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">num_sold&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">product_name&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">fn&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">count&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">total_revenue&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">value&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">price&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">fn&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">sum&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">WriteToCsv&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">WriteOutputFile&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">input&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">CountNumberSold&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/path/to/output&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This would leave us with the following data:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">product_name&lt;/th>
&lt;th style="text-align:left">num_sold&lt;/th>
&lt;th style="text-align:left">total_revenue&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Headphones&lt;/td>
&lt;td style="text-align:left">2&lt;/td>
&lt;td style="text-align:left">119.98&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Monitor&lt;/td>
&lt;td style="text-align:left">1&lt;/td>
&lt;td style="text-align:left">249.99&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>While this was a relatively simple use-case, it shows the power of Beam YAML and how easy it is to go from business
use-case to a prototype data pipeline in just a few lines of YAML.&lt;/p>
&lt;h1 id="getting-started-with-beam-yaml">Getting started with Beam YAML&lt;/h1>
&lt;p>There are several resources that have been compiled to help users get familiar with Beam YAML.&lt;/p>
&lt;h2 id="day-zero-notebook">Day Zero Notebook&lt;/h2>
&lt;a target="_blank" href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb">
&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
&lt;/a>
&lt;p>To help get started with Apache Beam, there is a Day Zero Notebook available on
&lt;a href="https://colab.sandbox.google.com/">Google Colab&lt;/a>, an online Python notebook environment with a free attachable
runtime, containing some basic YAML pipeline examples.&lt;/p>
&lt;h2 id="documentation">Documentation&lt;/h2>
&lt;p>The Apache Beam website provides a set of &lt;a href="https://beam.apache.org/documentation/sdks/yaml/">docs&lt;/a> that demonstrate the
current capabilities of the Beam YAML SDK. There is also a catalog of currently-supported turnkey transforms found
&lt;a href="https://beam.apache.org/releases/yamldoc/current/">here&lt;/a>.&lt;/p>
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>A catalog of examples can be found
&lt;a href="https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples">here&lt;/a>. These examples showcase
all the turnkey transforms that can be utilized in Beam YAML. There are also a number of Dataflow Cookbook examples
that can be found &lt;a href="https://github.com/GoogleCloudPlatform/dataflow-cookbook/tree/main/Python/yaml">here&lt;/a>.&lt;/p>
&lt;h2 id="contributing">Contributing&lt;/h2>
&lt;p>Developers who wish to help build out and add functionality are welcome to start contributing to the effort in the Beam YAML module found &lt;a href="https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml">here&lt;/a>.&lt;/p>
&lt;p>There is also a list of open &lt;a href="https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml">bugs&lt;/a> found
on the GitHub repo - now marked with the &amp;lsquo;yaml&amp;rsquo; tag.&lt;/p>
&lt;p>While Beam YAML has been marked stable as of Beam 2.52, it is still under heavy development, with new features being
added with each release. Those who wish to be part of the design decisions and give insight into how the framework is being used are highly encouraged to join the dev mailing list, as those discussions will be directed there. A link to
the dev list can be found &lt;a href="https://beam.apache.org/community/contact-us/">here&lt;/a>.&lt;/p></description><link>/blog/beam-yaml-release/</link><pubDate>Thu, 11 Apr 2024 10:00:00 -0400</pubDate><guid>/blog/beam-yaml-release/</guid><category>blog</category></item><item><title>Apache Beam 2.55.0</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>We are happy to present the new 2.55.0 release of Beam.
This release includes both improvements and new functionality.
See the &lt;a href="/get-started/downloads/">download page&lt;/a> for this release.&lt;/p>
&lt;p>For more information on changes in 2.55.0, check out the &lt;a href="https://github.com/apache/beam/milestone/19">detailed release notes&lt;/a>.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>The Python SDK will now include automatically generated wrappers for external Java transforms! (&lt;a href="https://github.com/apache/beam/pull/29834">#29834&lt;/a>)&lt;/li>
&lt;/ul>
&lt;h2 id="ios">I/Os&lt;/h2>
&lt;ul>
&lt;li>Added support for handling bad records to BigQueryIO (&lt;a href="https://github.com/apache/beam/pull/30081">#30081&lt;/a>).
&lt;ul>
&lt;li>Full Support for Storage Read and Write APIs&lt;/li>
&lt;li>Partial Support for File Loads (Failures writing to files supported, failures loading files to BQ unsupported)&lt;/li>
&lt;li>No Support for Extract or Streaming Inserts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Added support for handling bad records to PubSubIO (&lt;a href="https://github.com/apache/beam/pull/30372">#30372&lt;/a>).
&lt;ul>
&lt;li>Support is not available for handling schema mismatches, and enabling error handling for writing to Pub/Sub topics with schemas is not recommended&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>--enableBundling&lt;/code> pipeline option for BigQueryIO DIRECT_READ is replaced by &lt;code>--enableStorageReadApiV2&lt;/code>. Both were considered experimental and subject to change (Java) (&lt;a href="https://github.com/apache/beam/issues/26354">#26354&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="new-features--improvements">New Features / Improvements&lt;/h2>
&lt;ul>
&lt;li>Allow writing clustered and not time-partitioned BigQuery tables (Java) (&lt;a href="https://github.com/apache/beam/pull/30094">#30094&lt;/a>).&lt;/li>
&lt;li>Redis cache support added to RequestResponseIO and Enrichment transform (Python) (&lt;a href="https://github.com/apache/beam/pull/30307">#30307&lt;/a>)&lt;/li>
&lt;li>Merged &lt;code>sdks/java/fn-execution&lt;/code> and &lt;code>runners/core-construction-java&lt;/code> into the main SDK. These artifacts were never meant for users, but note that they no longer exist. This is a step toward bringing portability into the core SDK alongside all other core functionality.&lt;/li>
&lt;li>Added Vertex AI Feature Store handler for Enrichment transform (Python) (&lt;a href="https://github.com/apache/beam/pull/30388">#30388&lt;/a>)&lt;/li>
&lt;/ul>
&lt;h2 id="breaking-changes">Breaking Changes&lt;/h2>
&lt;ul>
&lt;li>Arrow version was bumped to 15.0.0 from 5.0.0 (&lt;a href="https://github.com/apache/beam/pull/30181">#30181&lt;/a>).&lt;/li>
&lt;li>Go SDK users who build custom worker containers may run into issues with the move to distroless containers as a base (see Security Fixes).
&lt;ul>
&lt;li>The issue stems from distroless containers lacking additional tools, which current custom container processes may rely on.&lt;/li>
&lt;li>See &lt;a href="https://beam.apache.org/documentation/runtime/environments/#from-scratch-go">https://beam.apache.org/documentation/runtime/environments/#from-scratch-go&lt;/a> for instructions on building and using a custom container.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Python SDK has changed the default value for the &lt;code>--max_cache_memory_usage_mb&lt;/code> pipeline option from 100 to 0. This option was first introduced in the 2.52.0 SDK version. This change restores the behavior of the 2.51.0 SDK, which does not use the state cache. If your pipeline uses iterable side input views, consider increasing the cache size by setting the option manually, as shown in the sketch after this list (&lt;a href="https://github.com/apache/beam/issues/30360">#30360&lt;/a>).&lt;/li>
&lt;/ul>
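&lt;p>For completeness, a small, illustrative sketch of opting back into the state cache on an affected pipeline is shown below; the value of 100 simply mirrors the 2.52.0 default mentioned above.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch: restoring a non-zero state cache for pipelines that
# rely heavily on iterable side input views. The value is illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--max_cache_memory_usage_mb=100'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(print))
&lt;/code>&lt;/pre>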
&lt;h2 id="deprecations">Deprecations&lt;/h2>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h2 id="bug-fixes">Bug fixes&lt;/h2>
&lt;ul>
&lt;li>Fixed &lt;code>SpannerIO.readChangeStream&lt;/code> to support propagating credentials from pipeline options
to the &lt;code>getDialect&lt;/code> calls for authenticating with Spanner (Java) (&lt;a href="https://github.com/apache/beam/pull/30361">#30361&lt;/a>).&lt;/li>
&lt;li>Reduced the number of HTTP requests in GCSIO function calls (Python) (&lt;a href="https://github.com/apache/beam/pull/30205">#30205&lt;/a>)&lt;/li>
&lt;/ul>
&lt;h2 id="security-fixes">Security Fixes&lt;/h2>
&lt;ul>
&lt;li>Go SDK base container image moved to distroless/base-nossl-debian12, reducing vulnerable container surface to kernel and glibc (&lt;a href="https://github.com/apache/beam/pull/30011">#30011&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="known-issues">Known Issues&lt;/h2>
&lt;ul>
&lt;li>In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (&lt;a href="https://github.com/apache/beam/pull/30679">#30679&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="list-of-contributors">List of Contributors&lt;/h2>
&lt;p>According to git shortlog, the following people contributed to the 2.55.0 release. Thank you to all contributors!&lt;/p>
&lt;p>Ahmed Abualsaud&lt;/p>
&lt;p>Anand Inguva&lt;/p>
&lt;p>Andrew Crites&lt;/p>
&lt;p>Andrey Devyatkin&lt;/p>
&lt;p>Arun Pandian&lt;/p>
&lt;p>Arvind Ram&lt;/p>
&lt;p>Chamikara Jayalath&lt;/p>
&lt;p>Chris Gray&lt;/p>
&lt;p>Claire McGinty&lt;/p>
&lt;p>Damon Douglas&lt;/p>
&lt;p>Dan Ellis&lt;/p>
&lt;p>Danny McCormick&lt;/p>
&lt;p>Daria Bezkorovaina&lt;/p>
&lt;p>Dima I&lt;/p>
&lt;p>Edward Cui&lt;/p>
&lt;p>Ferran Fernández Garrido&lt;/p>
&lt;p>GStravinsky&lt;/p>
&lt;p>Jan Lukavský&lt;/p>
&lt;p>Jason Mitchell&lt;/p>
&lt;p>JayajP&lt;/p>
&lt;p>Jeff Kinard&lt;/p>
&lt;p>Jeffrey Kinard&lt;/p>
&lt;p>Kenneth Knowles&lt;/p>
&lt;p>Mattie Fu&lt;/p>
&lt;p>Michel Davit&lt;/p>
&lt;p>Oleh Borysevych&lt;/p>
&lt;p>Ritesh Ghorse&lt;/p>
&lt;p>Ritesh Tarway&lt;/p>
&lt;p>Robert Bradshaw&lt;/p>
&lt;p>Robert Burke&lt;/p>
&lt;p>Sam Whittle&lt;/p>
&lt;p>Scott Strong&lt;/p>
&lt;p>Shunping Huang&lt;/p>
&lt;p>Steven van Rossum&lt;/p>
&lt;p>Svetak Sundhar&lt;/p>
&lt;p>Talat UYARER&lt;/p>
&lt;p>Ukjae Jeong (Jay)&lt;/p>
&lt;p>Vitaly Terentyev&lt;/p>
&lt;p>Vlado Djerek&lt;/p>
&lt;p>Yi Hu&lt;/p>
&lt;p>akashorabek&lt;/p>
&lt;p>case-k&lt;/p>
&lt;p>clmccart&lt;/p>
&lt;p>dengwe1&lt;/p>
&lt;p>dhruvdua&lt;/p>
&lt;p>hardshah&lt;/p>
&lt;p>johnjcasey&lt;/p>
&lt;p>liferoad&lt;/p>
&lt;p>martin trieu&lt;/p>
&lt;p>tvalentyn&lt;/p></description><link>/blog/beam-2.55.0/</link><pubDate>Mon, 25 Mar 2024 10:00:00 -0400</pubDate><guid>/blog/beam-2.55.0/</guid><category>blog</category><category>release</category></item><item><title>Apache Beam 2.54.0</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>We are happy to present the new 2.54.0 release of Beam.
This release includes both improvements and new functionality.
See the &lt;a href="/get-started/downloads/">download page&lt;/a> for this release.&lt;/p>
&lt;p>For more information on changes in 2.54.0, check out the &lt;a href="https://github.com/apache/beam/milestone/18">detailed release notes&lt;/a>.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://s.apache.org/enrichment-transform">Enrichment Transform&lt;/a> along with GCP BigTable handler added to Python SDK (&lt;a href="https://github.com/apache/beam/pull/30001">#30001&lt;/a>).&lt;/li>
&lt;li>Beam Java Batch pipelines run on Google Cloud Dataflow will default to the Portable Runner (v2) starting with this version. (All other languages are already on Runner V2.) See &lt;a href="https://cloud.google.com/dataflow/docs/runner-v2">Runner V2 documentation&lt;/a> for how to enable or disable it intentionally.&lt;/li>
&lt;/ul>
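&lt;p>A minimal, illustrative sketch of the new transform is shown below. The Bigtable project, instance, table, and field names are placeholders; running it requires access to a real Bigtable table and Google Cloud credentials.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch: enriching elements against a Bigtable table.
# All resource and field names are illustrative placeholders.
import apache_beam as beam
from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.bigtable import BigTableEnrichmentHandler

handler = BigTableEnrichmentHandler(
    project_id='my-project',
    instance_id='my-instance',
    table_id='product-catalog',
    row_key='product_id',  # field of the input row used as the Bigtable row key
)

with beam.Pipeline() as p:
    (p
     | beam.Create([beam.Row(product_id='T0012', quantity=2)])
     | Enrichment(handler)
     | beam.Map(print))
&lt;/code>&lt;/pre>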
&lt;h2 id="ios">I/Os&lt;/h2>
&lt;ul>
&lt;li>Added support for writing to BigQuery dynamic destinations with Python&amp;rsquo;s Storage Write API (&lt;a href="https://github.com/apache/beam/pull/30045">#30045&lt;/a>); see the sketch after this list.&lt;/li>
&lt;li>Added support for the Tuples DataType in ClickHouse (Java) (&lt;a href="https://github.com/apache/beam/pull/29715">#29715&lt;/a>).&lt;/li>
&lt;li>Added support for handling bad records to FileIO, TextIO, AvroIO (&lt;a href="https://github.com/apache/beam/pull/29670">#29670&lt;/a>).&lt;/li>
&lt;li>Added support for handling bad records to BigtableIO (&lt;a href="https://github.com/apache/beam/pull/29885">#29885&lt;/a>).&lt;/li>
&lt;/ul>
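&lt;p>Below is a small, illustrative sketch of the dynamic-destinations feature: the destination table is computed per element by a callable. The project, dataset, and table names are placeholders, and the routing rule is hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch: per-element BigQuery destinations with the Storage Write API.
# Project, dataset, and table names are illustrative placeholders.
import apache_beam as beam

def route_to_table(row):
    # Hypothetical routing rule: one table per product category.
    return 'my-project:analytics.transactions_{}'.format(row['category'].lower())

with beam.Pipeline() as p:
    (p
     | beam.Create([
         {'transaction_id': 'T0012', 'category': 'Electronics', 'price': 59.99},
         {'transaction_id': 'T5034', 'category': 'Apparel', 'price': 109.99},
       ])
     | beam.io.WriteToBigQuery(
         table=route_to_table,
         schema='transaction_id:STRING,category:STRING,price:FLOAT',
         method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
&lt;/code>&lt;/pre>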
&lt;h2 id="new-features--improvements">New Features / Improvements&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://s.apache.org/enrichment-transform">Enrichment Transform&lt;/a> along with GCP BigTable handler added to Python SDK (&lt;a href="https://github.com/apache/beam/pull/30001">#30001&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="breaking-changes">Breaking Changes&lt;/h2>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h2 id="deprecations">Deprecations&lt;/h2>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h2 id="bugfixes">Bugfixes&lt;/h2>
&lt;ul>
&lt;li>Fixed a memory leak affecting some Go SDK pipelines since 2.46.0. (&lt;a href="https://github.com/apache/beam/pull/28142">#28142&lt;/a>)&lt;/li>
&lt;/ul>
&lt;h2 id="security-fixes">Security Fixes&lt;/h2>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h2 id="known-issues">Known Issues&lt;/h2>
&lt;ul>
&lt;li>Some Python pipelines that run with 2.52.0-2.54.0 SDKs and use large materialized side inputs might be affected by a performance regression. To restore the prior behavior on these SDK versions, supply the &lt;code>--max_cache_memory_usage_mb=0&lt;/code> pipeline option. (&lt;a href="https://github.com/apache/beam/issues/30360">#30360&lt;/a>).&lt;/li>
&lt;li>Python pipelines that run with 2.53.0-2.54.0 SDKs and perform file operations on GCS might be affected by excess HTTP requests. This could lead to a performance regression or a permission issue. (&lt;a href="https://github.com/apache/beam/issues/28398">#28398&lt;/a>)&lt;/li>
&lt;li>In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (&lt;a href="https://github.com/apache/beam/pull/30679">#30679&lt;/a>).&lt;/li>
&lt;/ul>
&lt;p>For the most up to date list of known issues, see &lt;a href="https://github.com/apache/beam/blob/master/CHANGES.md">https://github.com/apache/beam/blob/master/CHANGES.md&lt;/a>&lt;/p>
&lt;h2 id="list-of-contributors">List of Contributors&lt;/h2>
&lt;p>According to git shortlog, the following people contributed to the 2.54.0 release. Thank you to all contributors!&lt;/p>
&lt;p>Ahmed Abualsaud&lt;/p>
&lt;p>Alexey Romanenko&lt;/p>
&lt;p>Anand Inguva&lt;/p>
&lt;p>Andrew Crites&lt;/p>
&lt;p>Arun Pandian&lt;/p>
&lt;p>Bruno Volpato&lt;/p>
&lt;p>caneff&lt;/p>
&lt;p>Chamikara Jayalath&lt;/p>
&lt;p>Changyu Li&lt;/p>
&lt;p>Cheskel Twersky&lt;/p>
&lt;p>Claire McGinty&lt;/p>
&lt;p>clmccart&lt;/p>
&lt;p>Damon&lt;/p>
&lt;p>Danny McCormick&lt;/p>
&lt;p>dependabot[bot]&lt;/p>
&lt;p>Edward Cheng&lt;/p>
&lt;p>Ferran Fernández Garrido&lt;/p>
&lt;p>Hai Joey Tran&lt;/p>
&lt;p>hugo-syn&lt;/p>
&lt;p>Issac&lt;/p>
&lt;p>Jack McCluskey&lt;/p>
&lt;p>Jan Lukavský&lt;/p>
&lt;p>JayajP&lt;/p>
&lt;p>Jeffrey Kinard&lt;/p>
&lt;p>Jerry Wang&lt;/p>
&lt;p>Jing&lt;/p>
&lt;p>Joey Tran&lt;/p>
&lt;p>johnjcasey&lt;/p>
&lt;p>Kenneth Knowles&lt;/p>
&lt;p>Knut Olav Løite&lt;/p>
&lt;p>liferoad&lt;/p>
&lt;p>Marc&lt;/p>
&lt;p>Mark Zitnik&lt;/p>
&lt;p>martin trieu&lt;/p>
&lt;p>Mattie Fu&lt;/p>
&lt;p>Naireen Hussain&lt;/p>
&lt;p>Neeraj Bansal&lt;/p>
&lt;p>Niel Markwick&lt;/p>
&lt;p>Oleh Borysevych&lt;/p>
&lt;p>pablo rodriguez defino&lt;/p>
&lt;p>Rebecca Szper&lt;/p>
&lt;p>Ritesh Ghorse&lt;/p>
&lt;p>Robert Bradshaw&lt;/p>
&lt;p>Robert Burke&lt;/p>
&lt;p>Sam Whittle&lt;/p>
&lt;p>Shunping Huang&lt;/p>
&lt;p>Svetak Sundhar&lt;/p>
&lt;p>S. Veyrié&lt;/p>
&lt;p>Talat UYARER&lt;/p>
&lt;p>tvalentyn&lt;/p>
&lt;p>Vlado Djerek&lt;/p>
&lt;p>Yi Hu&lt;/p>
&lt;p>Zechen Jian&lt;/p></description><link>/blog/beam-2.54.0/</link><pubDate>Wed, 14 Feb 2024 09:00:00 -0400</pubDate><guid>/blog/beam-2.54.0/</guid><category>blog</category><category>release</category></item><item><title>Behind the Scenes: Crafting an Autoscaler for Apache Beam in a High-Volume Streaming Environment</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;h3 id="introduction-to-the-design-of-our-autoscaler-for-apache-beam-jobs">Introduction to the Design of Our Autoscaler for Apache Beam Jobs&lt;/h3>
&lt;p>Welcome to the third and final part of our blog series on building a scalable, self-managed streaming infrastructure with Beam and Flink. &lt;a href="https://beam.apache.org/blog/apache-beam-flink-and-kubernetes/">In our previous post&lt;/a>, we delved into the scale of our streaming platforms, highlighting our capacity to manage over 40,000 streaming jobs and process upwards of 10 million events per second. This impressive scale sets the stage for the challenge we address today: the intricate task of resource allocation in a dynamic streaming environment.&lt;/p>
&lt;p>In this blog post &lt;a href="https://www.linkedin.com/in/talatuyarer/">Talat Uyarer (Architect / Senior Principal Engineer)&lt;/a> and &lt;a href="https://www.linkedin.com/in/rishabhkedia/">Rishabh Kedia (Principal Engineer)&lt;/a> describe our Autoscaler in more detail. Imagine a scenario where your streaming system is inundated with fluctuating workloads. Our case presents a unique challenge, as our customers, equipped with firewalls distributed globally, generate logs at various times of the day. This results in workloads that not only vary by time but also escalate over time due to changes in settings or the addition of new cybersecurity solutions from PANW. Furthermore, updates to our codebase necessitate rolling out changes across all streaming jobs, leading to a temporary surge in demand as the system processes unprocessed data.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/resource-allocation.png"
alt="Resource Allocation">&lt;/p>
&lt;p>Traditionally, managing this ebb and flow of demand involves a manual, often inefficient approach. One might over-provision resources to handle peak loads, inevitably leading to resource wastage during off-peak hours. Conversely, a more cost-conscious strategy might involve accepting delays during peak times, with the expectation of catching up later. However, both methods demand constant monitoring and manual adjustment - a far from ideal situation.&lt;/p>
&lt;p>In this modern era, where automated scaling of web front-ends is a given, we aspire to bring the same level of efficiency and automation to streaming infrastructure. Our goal is to develop a system that can dynamically track and adjust to the workload demands of our streaming operations. In this blog post, we will introduce you to our innovative solution - an autoscaler designed specifically for Apache Beam jobs.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/auto-tuned-worker.png"
alt="Auto Tuned Resource Allocation">&lt;/p>
&lt;p>For clarity, when we refer to &amp;ldquo;resources&amp;rdquo; in this context, we mean the number of Flink Task Managers, or Kubernetes Pods, that process your streaming pipeline. These Task Managers aren&amp;rsquo;t just about CPU; they also involve RAM, Network, Disk IO, and other computational resources.&lt;/p>
&lt;p>However, our solution is predicated on certain assumptions. Primarily, it&amp;rsquo;s geared towards operations processing substantial data volumes. If your workload only requires a couple of Task Managers, this system might not be the best fit. In our case we have 10K+ workloads, and each of them has a different workload. Manual tuning was not an option for us. We also assume that the data is evenly distributed, allowing for increased throughput with the addition of more Task Managers. This assumption is crucial for effective horizontal scaling. While there are real-world complexities that might challenge these assumptions, for the scope of this discussion, we will focus on scenarios where these conditions hold true.&lt;/p>
&lt;p>Join us as we delve into the design and functionality of our autoscaler, a solution tailored to bring efficiency, adaptability, and a touch of intelligence to the world of streaming infrastructure.&lt;/p>
&lt;h2 id="identifying-the-right-signals-for-autoscaling">Identifying the Right Signals for Autoscaling&lt;/h2>
&lt;p>When we&amp;rsquo;re overseeing a system like Apache Beam jobs on Flink, it&amp;rsquo;s crucial to identify key signals that help us understand the relationship between our workload and resources. These signals are our guiding lights, showing us when we&amp;rsquo;re lagging behind or wasting resources. By accurately identifying these signals, we can formulate effective scaling policies and implement changes in real-time. Imagine needing to expand from 100 to 200 TaskManagers — how do we smoothly make that transition? That&amp;rsquo;s where these signals come into play.&lt;/p>
&lt;p>Remember, we&amp;rsquo;re aiming for a universal solution applicable to any workload and pipeline. While specific problems might benefit from unique signals, our focus here is on creating a one-size-fits-all approach.&lt;/p>
&lt;p>In Flink, tasks form the basic execution unit and consist of one or more operators, such as map, filter, or reduce. Flink optimizes performance by chaining these operators into single tasks when possible, minimizing overheads like thread context switching and network I/O. Your pipeline, when optimized, turns into a directed acyclic graph of stages, each processing elements based on your code. Don&amp;rsquo;t confuse stages with physical machines — they&amp;rsquo;re separate concepts. In our job we measure backlog information by using Apache Beam&amp;rsquo;s &lt;a href="https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/SourceMetrics.java#L32">&lt;code>backlog_bytes&lt;/code> and &lt;code>backlog_elements&lt;/code>&lt;/a> metrics.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/flink-operator-chaining.png"
alt="Apache Beam Pipeline Optimization by Apache Flink">&lt;/p>
&lt;h5 id="upscaling-signals">&lt;strong>Upscaling Signals&lt;/strong>&lt;/h5>
&lt;h5 id="backlog-growth">&lt;em>Backlog Growth&lt;/em>&lt;/h5>
&lt;p>Let’s take a practical example. Consider a pipeline reading from Kafka, where different operators handle data parsing, formatting, and accumulation. The key metric here is throughput — how much data each operator processes over time. But throughput alone isn&amp;rsquo;t enough. We need to examine the queue size or backlog at each operator. A growing backlog indicates we&amp;rsquo;re falling behind. We measure this as backlog growth — the first derivative of backlog size over time, highlighting our processing deficit.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/backlog_growth.png"
alt="Backlog Growth Calculation">&lt;/p>
&lt;h5 id="backlog-time">&lt;em>Backlog Time&lt;/em>&lt;/h5>
&lt;p>This leads us to backlog time, a derived metric that compares backlog size with throughput. It’s a measure of how long it would take to clear the current backlog, assuming no new data arrives. This helps us identify if a backlog of a certain size is acceptable or problematic, based on our specific processing needs and thresholds.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/backlog_time.png"
alt="Backlog Time Calculation">&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/operator-backlog.png"
alt="Close looks at Operator Backlog">&lt;/p>
&lt;h4 id="downscaling-when-less-is-more">&lt;strong>Downscaling: When Less is More&lt;/strong>&lt;/h4>
&lt;h5 id="cpu-utilization">&lt;em>CPU Utilization&lt;/em>&lt;/h5>
&lt;p>A key signal for downscaling is CPU utilization. Low CPU utilization suggests we&amp;rsquo;re using more resources than necessary. By monitoring this, we can scale down efficiently without compromising performance.&lt;/p>
&lt;h4 id="signals-summary">&lt;strong>Signals Summary&lt;/strong>&lt;/h4>
&lt;p>In summary, the signals we&amp;rsquo;ve identified for effective autoscaling are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Throughput:&lt;/strong> The baseline of our performance.&lt;/li>
&lt;li>&lt;strong>Backlog Growth:&lt;/strong> Indicates if we’re keeping pace with incoming data.&lt;/li>
&lt;li>&lt;strong>Backlog Time:&lt;/strong> Helps understand the severity of backlog.&lt;/li>
&lt;li>&lt;strong>CPU Utilization:&lt;/strong> Guides us in resource optimization.&lt;/li>
&lt;/ol>
&lt;p>These signals might seem straightforward, but their simplicity is key to a scalable, workload-agnostic autoscaling solution.&lt;/p>
&lt;h2 id="simplifying-autoscaling-policies-for-apache-beam-jobs-on-flink">Simplifying Autoscaling Policies for Apache Beam Jobs on Flink&lt;/h2>
&lt;p>In the world of Apache Beam jobs running on Flink, deciding when to scale up or down is a bit like being a chef in a busy kitchen. You need to keep an eye on several ingredients — your workload, virtual machines (VMs), and how they interact. It&amp;rsquo;s about maintaining a perfect balance. Our main goals? Avoid falling behind in processing (no backlog growth), ensure that any existing backlog is manageable (short backlog time), and use our resources (like CPU) efficiently.&lt;/p>
&lt;h4 id="up-scaling-keeping-up-and-catching-up">&lt;strong>Up-scaling: Keeping Up and Catching Up&lt;/strong>&lt;/h4>
&lt;p>Imagine your system is like a team of chefs working together. Here&amp;rsquo;s how we decide when to bring more chefs into the kitchen (a.k.a. upscaling):&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Keeping Up:&lt;/strong> First, we look at our current team size (number of VMs) and how much they&amp;rsquo;re processing (throughput). We then adjust our team size based on the amount of incoming orders (input rate). It&amp;rsquo;s about ensuring that our team is big enough to handle the current demand.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Catching Up:&lt;/strong> Sometimes, we might have a backlog of orders. In that case, we decide how many extra chefs we need to clear this backlog within a desired time (like 60 seconds). This part of the policy helps us get back on track swiftly.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h4 id="scaling-example-a-practical-look">&lt;strong>Scaling Example: A Practical Look&lt;/strong>&lt;/h4>
&lt;p>Let&amp;rsquo;s paint a picture with an example. Initially, we have a steady flow of orders (input rate) matching our processing capacity (throughput), so there&amp;rsquo;s no backlog. But suddenly, orders increase, and our team starts falling behind, creating a backlog. We respond by increasing our team size to match the new rate of orders. Though the backlog doesn&amp;rsquo;t grow further, it still exists. Finally, we add a few more chefs to the team, which allows us to clear the backlog quickly and return to a new, balanced state.&lt;/p>
&lt;h4 id="downscaling-when-to-reduce-resources">&lt;strong>Downscaling: When to Reduce Resources&lt;/strong>&lt;/h4>
&lt;p>Downscaling is like knowing when some chefs can take a break after a rush hour. We consider this when:&lt;/p>
&lt;ul>
&lt;li>Our backlog is low — we&amp;rsquo;ve caught up with the orders.&lt;/li>
&lt;li>The backlog isn&amp;rsquo;t growing — we&amp;rsquo;re keeping up with incoming orders.&lt;/li>
&lt;li>Our kitchen (CPU) isn&amp;rsquo;t working too hard — we&amp;rsquo;re using our resources efficiently.&lt;/li>
&lt;/ul>
&lt;p>Downscaling is all about reducing resources without affecting the quality of service. It&amp;rsquo;s about ensuring that we&amp;rsquo;re not overstaffed when the rush hour is over.&lt;/p>
&lt;h4 id="summary-a-recipe-for-effective-scaling">&lt;strong>Summary: A Recipe for Effective Scaling&lt;/strong>&lt;/h4>
&lt;p>In summary, our scaling policy works as follows. To scale up, we first ensure that the time to drain the backlog is beyond the threshold (120s) or the CPU is above the threshold (90%).&lt;/p>
&lt;p>Increasing Backlog aka Backlog Growth &amp;gt; 0 :&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/worker_require.png"
alt="Required Worker Calculation">&lt;/p>
&lt;p>Consistent Backlog aka Backlog Growth = 0:&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/worker_extra.png"
alt="Extra Worker Calculation">&lt;/p>
&lt;p>To Sum up:&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/worker_scaleup.png"
alt="Scale up Worker Calculation">&lt;/p>
&lt;p>To scale down, we need to ensure the machine utilization is low (&amp;lt; 70%), there is no backlog growth, and the current time to drain the backlog is less than the limit (10s).&lt;/p>
&lt;p>So the only driving factor for calculating the required resources after a scale-down is CPU utilization.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/cpurate_desired.png"
alt="Desired Cpu Rate Calculation">&lt;/p>
&lt;h2 id="executing-autoscaling-decision">Executing Autoscaling Decision&lt;/h2>
&lt;p>In our setup we use Reactive Mode, which uses the Adaptive Scheduler and the Declarative Resource Manager. We wanted to align resources with slots. As advised in most of the Flink documentation, we set one slot per vCPU. Most of our jobs use a 1 vCPU / 4 GB memory combination for the TaskManager.&lt;/p>
&lt;p>Reactive Mode, a unique feature of the Adaptive Scheduler, operates under the principle of one job per cluster, a rule enforced in Application Mode. In this mode, a job is configured to utilize all available resources within the cluster. Adding a TaskManager will increase the job&amp;rsquo;s scale, while removing resources will decrease it. In this setup, Flink autonomously manages the job&amp;rsquo;s parallelism, always maximizing it.&lt;/p>
&lt;p>During a rescaling event, Reactive Mode restarts the job using the most recent checkpoint. This eliminates the need for creating a savepoint, typically required for manual job rescaling. The volume of data reprocessed after rescaling is influenced by the checkpointing interval (10 seconds for us), and the time it takes to restore depends on the size of the state.&lt;/p>
&lt;p>The scheduler determines the parallelism of each operator within a job. This setting is not user-configurable, and any attempt to set it, whether for individual operators or for the entire job, is ignored.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part3/adaptive_scheduler_rescale.png"
alt="How Reactive Mode Works">&lt;/p>
&lt;p>Parallelism can only be influenced by setting a maximum for the pipelines, which the scheduler honors. Our maxParallelism is bounded by the total number of partitions that the pipeline processes, as well as by the job itself. We cap the maximum number of TaskManagers with a maxWorker count and control the job&amp;rsquo;s key count in shuffles by setting maxParallelism. Additionally, we set maxParallelism per pipeline to manage pipeline parallelism. In terms of workers, the job cannot exceed its maxParallelism.&lt;/p>
&lt;p>After the autoscaler analysis, we tag the job as needing to scale up, scale down, or take no action. To interact with the job, we use a library we built on top of the Flink Kubernetes Operator. This library lets us interact with our Flink jobs through a simple Java method call, which it converts into a Kubernetes command.&lt;/p>
&lt;p>In the Kubernetes world, the call for a scale up looks like this:&lt;/p>
&lt;p>&lt;code>kubectl scale flinkdeployment job-name --replicas=100&lt;/code>&lt;/p>
&lt;p>Apache Flink will handle the rest of the work needed to scale up.&lt;/p>
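&lt;p>For completeness, here is a hypothetical sketch of what such a wrapper method can look like. The class and method names are illustrative only and are not the actual library API; the sketch simply mirrors the kubectl command above.&lt;/p>
&lt;pre>&lt;code>// Hypothetical wrapper around the kubectl command above; not the actual library API.
import java.io.IOException;

public class FlinkDeploymentScaler {
  public void scale(String deploymentName, int replicas) throws IOException, InterruptedException {
    // Equivalent to: kubectl scale flinkdeployment job-name --replicas=100
    Process process = new ProcessBuilder(
            "kubectl", "scale", "flinkdeployment", deploymentName, "--replicas=" + replicas)
        .inheritIO()
        .start();
    if (process.waitFor() != 0) {
      throw new IOException("kubectl scale failed for " + deploymentName);
    }
  }
}
&lt;/code>&lt;/pre>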
&lt;h2 id="maintaining-state-for-stateful-streaming-application-with-autoscaling">Maintaining State for Stateful Streaming Application with Autoscaling&lt;/h2>
&lt;p>Adapting Apache Flink&amp;rsquo;s state recovery mechanisms for autoscaling involves leveraging its robust features like max parallelism, checkpointing, and the Adaptive Scheduler to ensure efficient and resilient stream processing, even as the system dynamically adjusts to varying loads. Here&amp;rsquo;s how these components work together in an autoscaling context:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Max Parallelism&lt;/strong> sets an upper limit on how much a job can scale out, ensuring that state can be redistributed across a larger or smaller number of nodes without exceeding predefined boundaries. This is crucial for autoscaling because it allows Flink to manage state effectively, even as the number of task slots changes to accommodate varying workloads.&lt;/li>
&lt;li>&lt;strong>Checkpointing&lt;/strong> is at the heart of Flink&amp;rsquo;s fault tolerance mechanism, periodically saving the state of each job to durable storage (in our case, a GCS bucket). In an autoscaling scenario, checkpointing enables Flink to recover to a consistent state after scaling operations. When the system scales out (adds resources) or scales in (removes resources), Flink can restore the state from these checkpoints, ensuring data integrity and processing continuity without losing critical information. During a scale up or scale down, some data may need to be reprocessed from the last checkpoint. To reduce that amount, we lower the checkpointing interval to 10 seconds (see the configuration sketch after this list).&lt;/li>
&lt;li>&lt;strong>Reactive Mode&lt;/strong> is a special mode of the Adaptive Scheduler that assumes a single job per cluster (enforced by Application Mode). Reactive Mode configures a job so that it always uses all resources available in the cluster. Adding a TaskManager scales the job up; removing resources scales it down. Flink manages the parallelism of the job, always setting it to the highest possible value. When a job is resized, Reactive Mode triggers a restart from the most recent successful checkpoint.&lt;/li>
&lt;/ol>
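&lt;p>As a minimal sketch of how these pieces are surfaced when building the pipeline with the Beam Java SDK on the Flink runner, the options below are assumed to be exposed through FlinkPipelineOptions. The checkpointing interval matches the 10 seconds mentioned above; the maxParallelism value is only a placeholder, because ours depends on the partition count of each pipeline.&lt;/p>
&lt;pre>&lt;code>import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingFriendlyPipeline {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    // Upper bound for rescaling: state is split into this many key groups (placeholder value).
    options.setMaxParallelism(1024);
    // Frequent checkpoints (10s) limit how much data is reprocessed after a rescale.
    options.setCheckpointingInterval(10_000L);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run();
  }
}
&lt;/code>&lt;/pre>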
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this blog series, we&amp;rsquo;ve taken a deep dive into the creation of an autoscaler for Apache Beam in a high-volume streaming environment, highlighting the journey from conceptualization to implementation. This endeavor not only tackled the complexities of dynamic resource allocation but also set a new standard for efficiency and adaptability in streaming infrastructure. By marrying intelligent scaling policies with the robust capabilities of Apache Beam and Flink, we&amp;rsquo;ve showcased a scalable solution that optimizes resource use and maintains performance under varying loads. This project stands as a testament to the power of teamwork, innovation, and a forward-thinking approach to streaming data processing. As we wrap up this series, we express our gratitude to all contributors and look forward to the continuous evolution of this technology, inviting the community to join us in further discussions and developments.&lt;/p>
&lt;h1 id="references">References&lt;/h1>
&lt;p>[1] Streaming Auto-scaling in Google Cloud Dataflow &lt;a href="https://www.infoq.com/presentations/google-cloud-dataflow/">https://www.infoq.com/presentations/google-cloud-dataflow/&lt;/a>&lt;/p>
&lt;p>[2] Pipeline lifecycle &lt;a href="https://cloud.google.com/dataflow/docs/pipeline-lifecycle">https://cloud.google.com/dataflow/docs/pipeline-lifecycle&lt;/a>&lt;/p>
&lt;p>[3] Flink Elastic Scaling &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/elastic_scaling/">https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/elastic_scaling/&lt;/a>&lt;/p>
&lt;h1 id="acknowledgements">Acknowledgements&lt;/h1>
&lt;p>Building the new infrastructure and migrating applications with a large customer base from cloud-provider-managed streaming infrastructure to self-managed, Flink-based infrastructure at scale was a large effort. Thanks to the Palo Alto Networks CDL streaming team who helped make this happen: Kishore Pola, Andrew Park, Hemant Kumar, Manan Mangal, Helen Jiang, Mandy Wang, Praveen Kumar Pasupuleti, JM Teo, Rishabh Kedia, Talat Uyarer, Naitik Dani, and David He.&lt;/p>
&lt;hr>
&lt;p>&lt;strong>Explore More:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://beam.apache.org/blog/apache-beam-flink-and-kubernetes/">Part 1: Introduction to Building and Managing Apache Beam Flink Services on Kubernetes&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://beam.apache.org/blog/apache-beam-flink-and-kubernetes-part2/">Part 2: Build a scalable, self-managed streaming infrastructure with Flink: Tackling Autoscaling Challenges - Part 2&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Join the conversation and share your experiences on our &lt;a href="https://beam.apache.org/community/">Community&lt;/a> or contribute to our ongoing projects on &lt;a href="https://github.com/apache/beam">GitHub&lt;/a>. Your feedback is invaluable. If you have any comments or questions about this series, please feel free to reach out to us via the &lt;a href="https://beam.apache.org/community/contact-us/">user mailing list&lt;/a>.&lt;/em>&lt;/p>
&lt;p>&lt;em>Stay connected with us for more updates and insights into Apache Beam, Flink, and Kubernetes.&lt;/em>&lt;/p></description><link>/blog/apache-beam-flink-and-kubernetes-part3/</link><pubDate>Mon, 05 Feb 2024 09:00:00 -0400</pubDate><guid>/blog/apache-beam-flink-and-kubernetes-part3/</guid><category>blog</category></item><item><title>Apache Beam 2.53.0</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>We are happy to present the new 2.53.0 release of Beam.
This release includes both improvements and new functionality.
See the &lt;a href="/get-started/downloads/">download page&lt;/a> for this release.&lt;/p>
&lt;p>For more information on changes in 2.53.0, check out the &lt;a href="https://github.com/apache/beam/milestone/17">detailed release notes&lt;/a>.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (&lt;a href="https://github.com/apache/beam/issues/27330">#27330&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="ios">I/Os&lt;/h2>
&lt;ul>
&lt;li>TextIO now supports skipping multiple header lines (Java) (&lt;a href="https://github.com/apache/beam/issues/17990">#17990&lt;/a>).&lt;/li>
&lt;li>Python GCSIO is now implemented with GCP GCS Client instead of apitools (&lt;a href="https://github.com/apache/beam/issues/25676">#25676&lt;/a>)&lt;/li>
&lt;li>Adding support for LowCardinality DataType in ClickHouse (Java) (&lt;a href="https://github.com/apache/beam/pull/29533">#29533&lt;/a>).&lt;/li>
&lt;li>Added support for handling bad records to KafkaIO (Java) (&lt;a href="https://github.com/apache/beam/pull/29546">#29546&lt;/a>)&lt;/li>
&lt;li>Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models (&lt;a href="https://github.com/apache/beam/pull/29564">#29564&lt;/a>).&lt;/li>
&lt;li>NATS IO connector added (Go) (&lt;a href="https://github.com/apache/beam/issues/29000">#29000&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="new-features--improvements">New Features / Improvements&lt;/h2>
&lt;ul>
&lt;li>The Python SDK now type checks &lt;code>collections.abc.Collections&lt;/code> types properly. Some type hints that were erroneously allowed by the SDK may now fail. (&lt;a href="https://github.com/apache/beam/pull/29272">#29272&lt;/a>)&lt;/li>
&lt;li>Running multi-language pipelines locally no longer requires Docker.
Instead, the same (generally auto-started) subprocess used to perform the
expansion can also be used as the cross-language worker.&lt;/li>
&lt;li>Framework for adding Error Handlers to composite transforms added in Java (&lt;a href="https://github.com/apache/beam/pull/29164">#29164&lt;/a>).&lt;/li>
&lt;li>Python 3.11 images now include google-cloud-profiler (&lt;a href="https://github.com/apache/beam/pull/29651">#29561&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="deprecations">Deprecations&lt;/h2>
&lt;ul>
&lt;li>Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (&lt;a href="https://github.com/apache/beam/issues/29451">#29451&lt;/a>)&lt;/li>
&lt;/ul>
&lt;h2 id="bugfixes">Bugfixes&lt;/h2>
&lt;ul>
&lt;li>(Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (&lt;a href="https://github.com/apache/beam/issues/27330">#27330&lt;/a>).&lt;/li>
&lt;li>(Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (&lt;a href="https://github.com/apache/beam/issues/29600">#29600&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="security-fixes">Security Fixes&lt;/h2>
&lt;ul>
&lt;li>Upgraded to go 1.21.5 to build, fixing &lt;a href="https://security-tracker.debian.org/tracker/CVE-2023-45285">CVE-2023-45285&lt;/a> and &lt;a href="https://security-tracker.debian.org/tracker/CVE-2023-39326">CVE-2023-39326&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="known-issues">Known Issues&lt;/h2>
&lt;ul>
&lt;li>Potential race condition causing NPE in DataflowExecutionStateSampler in Dataflow Java Streaming pipelines (&lt;a href="https://github.com/apache/beam/issues/29987">#29987&lt;/a>).&lt;/li>
&lt;li>Some Python pipelines that run with 2.52.0-2.54.0 SDKs and use large materialized side inputs might be affected by a performance regression. To restore the prior behavior on these SDK versions, supply the &lt;code>--max_cache_memory_usage_mb=0&lt;/code> pipeline option. (&lt;a href="https://github.com/apache/beam/issues/30360">#30360&lt;/a>).&lt;/li>
&lt;li>Python pipelines that run with 2.53.0-2.54.0 SDKs and perform file operations on GCS might be affected by excess HTTP requests. This could lead to a performance regression or a permission issue. (&lt;a href="https://github.com/apache/beam/issues/28398">#28398&lt;/a>)&lt;/li>
&lt;li>In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (&lt;a href="https://github.com/apache/beam/pull/30679">#30679&lt;/a>).&lt;/li>
&lt;/ul>
&lt;p>For the most up to date list of known issues, see &lt;a href="https://github.com/apache/beam/blob/master/CHANGES.md">https://github.com/apache/beam/blob/master/CHANGES.md&lt;/a>&lt;/p>
&lt;h2 id="list-of-contributors">List of Contributors&lt;/h2>
&lt;p>According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!&lt;/p>
&lt;p>Ahmed Abualsaud&lt;/p>
&lt;p>Ahmet Altay&lt;/p>
&lt;p>Alexey Romanenko&lt;/p>
&lt;p>Anand Inguva&lt;/p>
&lt;p>Arun Pandian&lt;/p>
&lt;p>Balázs Németh&lt;/p>
&lt;p>Bruno Volpato&lt;/p>
&lt;p>Byron Ellis&lt;/p>
&lt;p>Calvin Swenson Jr&lt;/p>
&lt;p>Chamikara Jayalath&lt;/p>
&lt;p>Clay Johnson&lt;/p>
&lt;p>Damon&lt;/p>
&lt;p>Danny McCormick&lt;/p>
&lt;p>Ferran Fernández Garrido&lt;/p>
&lt;p>Georgii Zemlianyi&lt;/p>
&lt;p>Israel Herraiz&lt;/p>
&lt;p>Jack McCluskey&lt;/p>
&lt;p>Jacob Tomlinson&lt;/p>
&lt;p>Jan Lukavský&lt;/p>
&lt;p>JayajP&lt;/p>
&lt;p>Jeffrey Kinard&lt;/p>
&lt;p>Johanna Öjeling&lt;/p>
&lt;p>Julian Braha&lt;/p>
&lt;p>Julien Tournay&lt;/p>
&lt;p>Kenneth Knowles&lt;/p>
&lt;p>Lawrence Qiu&lt;/p>
&lt;p>Mark Zitnik&lt;/p>
&lt;p>Mattie Fu&lt;/p>
&lt;p>Michel Davit&lt;/p>
&lt;p>Mike Williamson&lt;/p>
&lt;p>Naireen&lt;/p>
&lt;p>Naireen Hussain&lt;/p>
&lt;p>Niel Markwick&lt;/p>
&lt;p>Pablo Estrada&lt;/p>
&lt;p>Radosław Stankiewicz&lt;/p>
&lt;p>Rebecca Szper&lt;/p>
&lt;p>Reuven Lax&lt;/p>
&lt;p>Ritesh Ghorse&lt;/p>
&lt;p>Robert Bradshaw&lt;/p>
&lt;p>Robert Burke&lt;/p>
&lt;p>Sam Rohde&lt;/p>
&lt;p>Sam Whittle&lt;/p>
&lt;p>Shunping Huang&lt;/p>
&lt;p>Svetak Sundhar&lt;/p>
&lt;p>Talat UYARER&lt;/p>
&lt;p>Tom Stepp&lt;/p>
&lt;p>Tony Tang&lt;/p>
&lt;p>Vlado Djerek&lt;/p>
&lt;p>Yi Hu&lt;/p>
&lt;p>Zechen Jiang&lt;/p>
&lt;p>clmccart&lt;/p>
&lt;p>damccorm&lt;/p>
&lt;p>darshan-sj&lt;/p>
&lt;p>gabry.wu&lt;/p>
&lt;p>johnjcasey&lt;/p>
&lt;p>liferoad&lt;/p>
&lt;p>lrakla&lt;/p>
&lt;p>martin trieu&lt;/p>
&lt;p>tvalentyn&lt;/p></description><link>/blog/beam-2.53.0/</link><pubDate>Thu, 04 Jan 2024 09:00:00 -0400</pubDate><guid>/blog/beam-2.53.0/</guid><category>blog</category><category>release</category></item><item><title>Scaling a streaming workload on Apache Beam, 1 million events per second and beyond</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/0-intro.png"
alt="Streaming Processing">&lt;/p>
&lt;p>Scaling a streaming workload is critical for ensuring that a pipeline can process large amounts of data while also minimizing latency and executing efficiently. Without proper scaling, a pipeline may experience performance issues or even fail entirely, delaying the time to insights for the business.&lt;/p>
&lt;p>Given the Apache Beam support for the sources and sinks needed by the workload, developing a streaming pipeline can be easy. You can focus on the processing (transformations, enrichments, or aggregations) and on setting the right configurations for each case.&lt;/p>
&lt;p>However, you need to identify the key performance bottlenecks and make sure that the pipeline has the resources it needs to handle the load efficiently. This can involve right-sizing the number of workers, understanding the settings needed for the source and sinks of the pipeline, optimizing the processing logic, and even determining the transport formats.&lt;/p>
&lt;p>This article illustrates how to manage the problem of scaling and optimizing a streaming workload developed in Apache Beam and run on Google Cloud using Dataflow. The goal is to reach one million events per second, while also minimizing latency and resource use during execution. The workload uses Pub/Sub as the streaming source and BigQuery as the sink. We describe the reasoning behind the configuration settings and code changes we used to help the workload achieve the desired scale and beyond.&lt;/p>
&lt;p>The progression described in this article maps to the evolution of a real-life workload, with simplifications. After the initial business requirements for the pipeline were achieved, the focus shifted to optimizing the performance and reducing the resources needed for the pipeline execution.&lt;/p>
&lt;h2 id="execution-setup">Execution setup&lt;/h2>
&lt;p>For this article, we created a test suite that creates the necessary components for the pipelines to execute. You can find the code in &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests">this Github repository&lt;/a>. You can find the subsequent configuration changes that are introduced on every run in this &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/scaling-streaming-workload-blog">folder&lt;/a> as scripts that you can run to achieve similar results.&lt;/p>
&lt;p>All of the execution scripts can also execute a Terraform-based automation to create a Pub/Sub topic and subscription as well as a BigQuery dataset and table to run the workload. Also, it launches two pipelines: one data generation pipeline that pushes events to the Pub/Sub topic, and an ingestion pipeline that demonstrates the potential improvement points.&lt;/p>
&lt;p>In all cases, the pipelines start with an empty Pub/Sub topic and subscription and an empty BigQuery table. The plan is to generate one million events per second and, after a few minutes, review how the ingestion pipeline scales with time. The autogenerated data is based on the schemas or IDL (Interface Description Language) provided in the configuration, and the goal is to have messages ranging between 800 bytes and 2 KB, adding up to approximately 1 GB/s of volume throughput. Also, the ingestion pipelines use the same worker type configuration on all runs (&lt;code>n2d-standard-4&lt;/code> GCE machines) and cap the maximum number of workers to avoid very large fleets.&lt;/p>
&lt;p>All of the executions run on Google Cloud using Dataflow, but you can apply all of the configurations and format changes to the suite while executing on other supported Apache Beam runners. Changes and recommendations are not runner specific.&lt;/p>
&lt;h3 id="local-environment-requirements">Local environment requirements&lt;/h3>
&lt;p>Before launching the startup scripts, install the following items in your local environment:&lt;/p>
&lt;ul>
&lt;li>&lt;code>gcloud&lt;/code>, along with the correct permissions&lt;/li>
&lt;li>Terraform&lt;/li>
&lt;li>JDK 17 or later&lt;/li>
&lt;li>Maven 3.6 or later&lt;/li>
&lt;/ul>
&lt;p>For more information, see the &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests#requisites">requirements&lt;/a> section in the GitHub repository.&lt;/p>
&lt;p>Also, review the service quotas and resources available in your Google Cloud project. Specifically: Pub/Sub regional capacity, BigQuery ingestion quota, and Compute Engine instances available in the selected region for the tests.&lt;/p>
&lt;h3 id="workload-description">Workload description&lt;/h3>
&lt;p>Focusing on the ingestion pipeline, our &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/blob/main/canonical-streaming-pipelines/src/main/java/com/google/cloud/pso/beam/pipelines/StreamingSourceToBigQuery.java#L55">workload&lt;/a> is straightforward. It completes the following steps:&lt;/p>
&lt;ol>
&lt;li>reads data in a specific format from Pub/Sub (Apache Thrift in this case)&lt;/li>
&lt;li>deals with potential compression and batching settings (not enabled by default)&lt;/li>
&lt;li>executes a UDF (identity function by default)&lt;/li>
&lt;li>transforms the input format to one of the formats supported by the &lt;code>BigQueryIO&lt;/code> transform&lt;/li>
&lt;li>writes the data to the configured table&lt;/li>
&lt;/ol>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/0-pipeline.png"
alt="Example Workload">&lt;/p>
&lt;p>The pipeline we used for the tests is highly configurable. For more details about how to tweak the ingestion, see the &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/blob/main/canonical-streaming-pipelines/src/main/java/com/google/cloud/pso/beam/pipelines/StreamingSourceToBigQuery.java#L39">options&lt;/a> in the file. No code changes are needed on any of our steps. The execution scripts take care of the configurations needed.&lt;/p>
&lt;p>Although these tests are focused on reading data from Pub/Sub, the ingestion pipeline is capable of reading data from a generic streaming source. The repository contains other &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/example-suite-scripts">examples&lt;/a> that show how to launch this same test suite reading data from Pub/Sub Lite and Kafka. In all cases, the pipeline automation sets up the streaming infrastructure.&lt;/p>
&lt;p>Finally, you can see in the &lt;a href="https://github.com/prodriguezdefino/apache-beam-ptransforms/blob/a0dd229081625c7b593512543614daf995a9f870/common/src/main/java/com/google/cloud/pso/beam/common/formats/options/TransportFormatOptions.java">configuration options&lt;/a> that the pipeline supports many transport format options for the input, such as Thrift, Avro, and JSON. This suite focuses on Thrift, because it is a common open source format, and because it generates a format transformation need. The intent is to put some strain in the workload processing. You can run similar tests for Avro and JSON input data. The streaming data generator pipeline can generate random data for the &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/streaming-data-generator/src/main/java/com/google/cloud/pso/beam/generator/formats">three supported formats&lt;/a> by walking directly on the schema (Avro and JSON) or IDL (Thrift) provided for execution.&lt;/p>
&lt;h2 id="first-run-default-settings">First run: default settings&lt;/h2>
&lt;p>The default configuration for the execution writes the data to BigQuery using the &lt;code>STREAMING_INSERTS&lt;/code> mode of &lt;code>BigQueryIO&lt;/code>. This mode corresponds to the &lt;a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll">&lt;code>tableData insertAll&lt;/code> API&lt;/a> for BigQuery. This API supports data in JSON format. From the Apache Beam perspective, using the &lt;code>BigQueryIO.writeTableRows&lt;/code> method lets us resolve the writes into BigQuery.&lt;/p>
&lt;p>For our ingestion pipeline, the Thrift format needs to be transformed into &lt;code>TableRow&lt;/code>. To do that, we need to translate the Thrift IDL into a BigQuery table schema. That can be achieved by translating the Thrift IDL into an Avro schema and then using Beam utilities to derive the BigQuery table schema from it. We can do this at bootstrap, and the schema transformation is cached at the &lt;code>DoFn&lt;/code> level.&lt;/p>
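&lt;p>As a rough sketch, and assuming the table schema has already been derived from the Thrift IDL as described above, the default write step looks approximately like this (a simplified rendition, not the exact pipeline code from the repository):&lt;/p>
&lt;pre>&lt;code>import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

class StreamingInsertsWrite {
  // rows: output of the Thrift-to-TableRow conversion; schema: derived from the Thrift IDL via Avro.
  static void storeInBigQuery(PCollection&amp;lt;TableRow&amp;gt; rows, TableSchema schema, String table) {
    rows.apply("StoreInBigQuery",
        BigQueryIO.writeTableRows()
            .to(table) // for example "project:dataset.table"
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
  }
}
&lt;/code>&lt;/pre>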
&lt;p>After setting up the data generation and ingestion pipelines, and after letting the pipelines run for some minutes, we see that the pipeline is unable to sustain the desired throughput.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/1-default-ps.png"
alt="PubSub metrics">&lt;/p>
&lt;p>The previous image shows that the messages the ingestion pipeline cannot keep up with start to accumulate as unacknowledged messages in the Pub/Sub metrics.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/1-default-throughput.png"
alt="Throughput">&lt;/p>
&lt;p>Reviewing the per-stage performance metrics, we see that the pipeline shows a sawtooth pattern, which is often associated with the throttling mechanisms the Dataflow runner uses when some of the stages act as bottlenecks for the throughput. We also see that the &lt;code>Reshuffle&lt;/code> step in the &lt;code>BigQueryIO&lt;/code> write transform does not scale as expected.&lt;/p>
&lt;p>This behavior happens because by default the &lt;a href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java#L57">&lt;code>BigQueryOptions&lt;/code>&lt;/a> uses 50 different keys to shuffle data to workers before the writes happen on BigQuery. To solve this problem, we can add a configuration to our launch script that enables the write operations to scale to a larger number of workers, which improves performance.&lt;/p>
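&lt;p>A minimal sketch of that change, assuming the knob is exposed as the numStreamingKeys option in the &lt;code>BigQueryOptions&lt;/code> source linked above (the same value can normally be passed as a command-line flag from the launch script):&lt;/p>
&lt;pre>&lt;code>import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class StreamingKeysConfig {
  static BigQueryOptions parse(String[] args) {
    BigQueryOptions options = PipelineOptionsFactory.fromArgs(args).as(BigQueryOptions.class);
    // Raise the number of shuffle keys used before the BigQuery writes (default is 50).
    options.setNumStreamingKeys(512); // assumed equivalent to --numStreamingKeys=512
    return options;
  }
}
&lt;/code>&lt;/pre>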
&lt;h2 id="second-run-improve-the-write-bottleneck">Second run: improve the write bottleneck&lt;/h2>
&lt;p>After increasing the number of streaming keys to 512, we restarted the test suite. The Pub/Sub metrics started to improve. After an initial ramp in the size of the backlog, the curve started to ease off.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/2-skeys-ps.png"
alt="PubSub metrics">&lt;/p>
&lt;p>This is good, but we should take a look at the throughput per stage numbers to understand if we are achieving the goal we set up for this exercise.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/2-skeys-throughput.png"
alt="Throughput">&lt;/p>
&lt;p>Although the performance has clearly improved and the Pub/Sub backlog no longer increases monotonically, we are still far from the goal of processing one million events per second (1 GB/s) in our ingestion pipeline. In fact, the throughput metrics fluctuate widely, indicating that bottlenecks are still preventing the processing from scaling further.&lt;/p>
&lt;h2 id="third-run-unleash-autoscale">Third run: unleash autoscale&lt;/h2>
&lt;p>Luckily for us, when writing into BigQuery, we can autoscale the writes. This step simplifies the configuration so that we don&amp;rsquo;t have to guess the right number of shards. We switched the pipeline’s configuration and enabled this setting for the next &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests/blob/main/scaling-streaming-workload-blog/3-ps2bq-si-tr-streamingautoshard.sh">launch script&lt;/a>.&lt;/p>
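&lt;p>In code, enabling autosharding is a one-line change on the write transform. The following sketch reuses the rows, schema, and table placeholders from the earlier streaming inserts sketch:&lt;/p>
&lt;pre>&lt;code>// Sketch: same write as before, but letting the runner pick the number of shards dynamically.
rows.apply("StoreInBigQuery",
    BigQueryIO.writeTableRows()
        .to(table)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withAutoSharding());
&lt;/code>&lt;/pre>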
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-parallelism.png"
alt="Key Parallelism">&lt;/p>
&lt;p>Immediately, we see that the autosharding mechanism tweaks the number of keys very aggressively and in a dynamic way. This change is good, because different moments in time might have different scale needs, such as early backlog recoveries and spikes in the execution.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-throughput-tr.png"
alt="Throughput">&lt;/p>
&lt;p>Inspecting the throughput performance per stage, we see that as the number of keys increases, the performance of the writes also increases. In fact, it reaches very large numbers!&lt;/p>
&lt;p>After the initial backlog was consumed and the pipeline stabilized, we saw that the desired performance numbers were reached. The pipeline can sustain well over a million events per second from Pub/Sub and several GB/s of BigQuery ingestion. Yay!&lt;/p>
&lt;p>Still, we want to see if we can do better. We can introduce several improvements to the pipeline to make the execution more efficient. In most cases, the improvements are configuration changes. We just need to know where to focus next.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-autoscale.png"
alt="Resources">&lt;/p>
&lt;p>The previous image shows that the number of workers needed to sustain this throughput is still quite high. The workload itself is not CPU intensive. Most of the cost is spent on transforming formats and on I/O interactions, such as shuffles and the actual writes. To understand what to improve, we first investigate the transport formats.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-tr-input.png"
alt="Thrift Input Size">
&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-tr-output.png"
alt="TableRow Output Size">&lt;/p>
&lt;p>Looking at the input size, right before the identity UDF execution, the data format is binary Thrift, which is a decently compact format even when no compression is used. However, while comparing the &lt;code>PCollection&lt;/code> approximated size with the &lt;code>TableRow&lt;/code> format needed for BigQuery ingestion, a clear size increase is visible. This is something we can improve by changing the BigQuery write API in use.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/3-autoshard-tr-overhead.png"
alt="Translation Overhead">&lt;/p>
&lt;p>When we inspect the &lt;code>StoreInBigQuery&lt;/code> transform, we see that the majority of the wall time is spent on the actual writes. Compared with the wall time spent converting data to the destination format (&lt;code>TableRows&lt;/code>), the time spent on the writes is quite large: roughly 13 times bigger. To improve this behavior, we can switch the pipeline write mode.&lt;/p>
&lt;h2 id="fourth-run-in-with-the-new">Fourth run: in with the new&lt;/h2>
&lt;p>In this run, we use the &lt;code>StorageWrite&lt;/code> API. Enabling the &lt;code>StorageWrite&lt;/code> API for this pipeline is straightforward. We set the write mode to &lt;code>STORAGE_WRITE_API&lt;/code> and define a write triggering frequency. For this test, we write data at most every ten seconds. The write triggering frequency controls how long the per-stream data accumulates. A higher value produces a larger output to write after the stream assignment, but it also imposes a larger end-to-end latency for every element read from Pub/Sub. As with the &lt;code>STREAMING_INSERTS&lt;/code> configuration, &lt;code>BigQueryIO&lt;/code> can handle autosharding for the writes, which we already demonstrated to be the best setting for performance.&lt;/p>
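&lt;p>A sketch of the new write configuration, again reusing the placeholders from the earlier sketches; the ten-second triggering frequency is the value described above:&lt;/p>
&lt;pre>&lt;code>// Sketch: switch the write method to the Storage Write API with a 10-second trigger.
// Duration here is org.joda.time.Duration.
rows.apply("StoreInBigQuery",
    BigQueryIO.writeTableRows()
        .to(table)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withTriggeringFrequency(Duration.standardSeconds(10))
        .withAutoSharding());
&lt;/code>&lt;/pre>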
&lt;p>After both pipelines become stable, the performance benefits of using the &lt;code>StorageWrite&lt;/code> API in &lt;code>BigQueryIO&lt;/code> are apparent. With the new implementation, the wall time ratio between the format transformation and the write operation decreases: the wall time spent on writes is only about 34 percent larger than the time spent on the format transformation.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/4-format-transformation.png"
alt="Translation Overhead">&lt;/p>
&lt;p>After stabilization, the pipeline throughput is also quite smooth. The runner can quickly and steadily downscale the pipeline resources needed to sustain the desired throughput.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/4-throughput.png"
alt="Throughput">&lt;/p>
&lt;p>Looking at the resource scale needed to process the data, another dramatic improvement is visible. Whereas the streaming inserts-based pipeline needed more than 80 workers to sustain the throughput, the storage writes pipeline only needs 49, a 40 percent improvement.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/4-ingestion-scale.png"
alt="Resources">&lt;/p>
&lt;p>We can use the data generation pipeline as a reference. This pipeline only needs to randomly generate data and write the events to Pub/Sub, and it runs steadily with an average of 40 workers. With the right configuration for the workload, the improvements to the ingestion pipeline bring its resource needs much closer to those of the generation pipeline.&lt;/p>
&lt;p>Similar to the streaming inserts-based pipeline, writing the data into BigQuery requires running a format translation, from Thrift to &lt;code>TableRow&lt;/code> in the former and from Thrift to Protocol Buffers (protobuf) in the latter. Because we are using the &lt;code>BigQueryIO.writeTableRows&lt;/code> method, we add another step in the format translation. Because the &lt;code>TableRow&lt;/code> format also increases the size of the &lt;code>PCollection&lt;/code> being processed, we want to see if we can improve this step.&lt;/p>
&lt;h2 id="fifth-run-a-better-write-format">Fifth run: a better write format&lt;/h2>
&lt;p>When using &lt;code>STORAGE_WRITE_API&lt;/code>, the &lt;code>BigQueryIO&lt;/code> transform exposes a method that we can use to write the Beam row type directly into BigQuery. This step is useful because of the flexibility that the row type provides for interoperability and schema management. Also, it&amp;rsquo;s both efficient for shuffling and denser than &lt;code>TableRow&lt;/code>, so our pipeline will have smaller &lt;code>PCollection&lt;/code> sizes.&lt;/p>
&lt;p>For the next run, because our data volume is not small, we decrease the triggering frequency when writing to BigQuery. Because we use a different format, slightly different code runs. For this change, the test pipeline script is configured with the flag &lt;code>--formatToStore=BEAM_ROW&lt;/code>.&lt;/p>
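&lt;p>A sketch of this variant, assuming an upstream step (not shown) that converts the Thrift records into schema-aware Beam Rows; the triggering frequency value shown is illustrative:&lt;/p>
&lt;pre>&lt;code>// Sketch: write schema-aware Beam Rows, deriving the BigQuery schema from the Row schema.
// beamRows is a PCollection of org.apache.beam.sdk.values.Row with an attached schema.
// Duration here is org.joda.time.Duration.
beamRows.apply("StoreInBigQuery",
    BigQueryIO.&amp;lt;org.apache.beam.sdk.values.Row&amp;gt;write()
        .to(table)
        .useBeamSchema() // no explicit TableSchema needed
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withTriggeringFrequency(Duration.standardSeconds(10))
        .withAutoSharding());
&lt;/code>&lt;/pre>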
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/5-input-size.png"
alt="Thrift input size">
&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/5-output-size.png"
alt="Row output size">&lt;/p>
&lt;p>The &lt;code>PCollection&lt;/code> size written into BigQuery is considerably smaller than in previous executions. In fact, for this particular execution, the Beam row format is even smaller than the Thrift format. A larger &lt;code>PCollection&lt;/code> composed of bigger per-element sizes can put nontrivial memory pressure on smaller worker configurations, reducing the overall throughput.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/5-format-trasformation.png"
alt="Translation overhead">&lt;/p>
&lt;p>The wall clock ratio between the format transformation and the actual BigQuery writes also stays very similar. Handling the Beam row format does not impose a performance penalty on the format translation and the subsequent writes. This is confirmed by the number of workers the pipeline uses once throughput becomes stable: slightly smaller than in the previous run, but clearly in the same range.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/5-ingestion-scale.png"
alt="Resources">&lt;/p>
&lt;p>Although we are in a much better position than when we started, given our test pipeline input format, there&amp;rsquo;s still room for improvement.&lt;/p>
&lt;h2 id="sixth-run-further-reduce-the-format-translation-effort">Sixth run: further reduce the format translation effort&lt;/h2>
&lt;p>Another format supported for the input &lt;code>PCollection&lt;/code> of the &lt;code>BigQueryIO&lt;/code> transform might be advantageous for our input format. The &lt;code>writeGenericRecords&lt;/code> method enables the transform to convert Avro &lt;code>GenericRecords&lt;/code> directly into protobuf before the write operation. Apache Thrift can be transformed into Avro &lt;code>GenericRecords&lt;/code> very efficiently. We can make another test run by configuring our test ingestion pipeline with the option &lt;code>--formatToStore=AVRO_GENERIC_RECORD&lt;/code> in our execution script.&lt;/p>
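&lt;p>A sketch of the corresponding write, assuming an upstream step (not shown) that converts the Thrift records into Avro GenericRecords, and reusing the schema and table placeholders from the earlier sketches:&lt;/p>
&lt;pre>&lt;code>// Sketch: write Avro GenericRecords, which BigQueryIO converts directly to protobuf.
// genericRecords is a PCollection of org.apache.avro.generic.GenericRecord.
// Duration here is org.joda.time.Duration.
genericRecords.apply("StoreInBigQuery",
    BigQueryIO.writeGenericRecords()
        .to(table)
        .withSchema(schema) // BigQuery schema derived from the Thrift IDL, as before
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withTriggeringFrequency(Duration.standardSeconds(10))
        .withAutoSharding());
&lt;/code>&lt;/pre>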
&lt;p>This time, the difference between format translation and writes increases significantly, improving performance. The translation to Avro &lt;code>GenericRecords&lt;/code> is only 20 percent of the write effort spent on writing those records into BigQuery. Given that the test pipelines had similar runtimes and that the wall clock seen in the &lt;code>WriteIntoBigQuery&lt;/code> stage is also aligned with other &lt;code>StorageWrite&lt;/code> related runs, using this format is appropriate for this workload.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/6-format-transformation.png"
alt="Translation overhead">&lt;/p>
&lt;p>We see further gains when we look at resource utilization. We need less CPU time to execute the format translations for our workload while achieving the desired throughput.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/6-ingestion-scale.png"
alt="Resources">&lt;/p>
&lt;p>This pipeline improves upon the previous run, running steadily on 42 workers when throughput is stable. Given the worker configuration used (&lt;code>n2d-standard-4&lt;/code>) and the volume throughput of the workload (about 1 GB/s), we are achieving about 6 MB/s of throughput per CPU core, which is quite impressive for a streaming pipeline with exactly-once semantics.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/6-latencies.png"
alt="Latencies">&lt;/p>
&lt;p>When we add up all of the stages executed in the main path of the pipeline, the latency seen at this scale achieves sub-second end-to-end latencies during sustained periods of time.&lt;/p>
&lt;p>Given the workload requirements and the implemented pipeline code, this performance is the best that we can extract without further tuning the runner’s specific settings.&lt;/p>
&lt;h2 id="seventh-run--lets-just-relax-at-least-some-constraints">Seventh run : lets just relax (at least some constraints)&lt;/h2>
&lt;p>When using the &lt;code>STORAGE_WRITE_API&lt;/code> setting for &lt;code>BigQueryIO&lt;/code>, we enforce exactly-once semantics on the writes. This configuration is great for use cases that need strong consistency on the data that gets processed, but it imposes a performance and cost penalty.&lt;/p>
&lt;p>From a high-level perspective, writes into BigQuery are made in batches, which are released based on the current sharding and the triggering frequency. If a write fails during the execution of a particular bundle, it is retried. A bundle of data is committed into BigQuery only when all the data in that particular bundle is correctly appended to a stream. This implementation needs to shuffle the full volume of data to create the batches that are written, and also the information of the finished batches for later commit (although this last piece is very small compared with the first).&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-previous-data-input.png"
alt="Read data size">&lt;/p>
&lt;p>Looking at the previous pipeline execution, the total data being processed for the pipeline by Streaming Engine is larger than the data being read from Pub/Sub. For example, 7 TB of data is read from Pub/Sub, whereas the processing of data for the whole execution of the pipeline moves 25 TB of data to and from Streaming Engine.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-previous-shuffle-total.png"
alt="Streamed data size">&lt;/p>
&lt;p>When data consistency is not a hard requirement for ingestion, you can use at-least-once semantics with &lt;code>BigQueryIO&lt;/code> write mode. This implementation avoids shuffling and grouping data for the writes. However, this change might cause a small number of repeated rows to be written into the destination table. This can happen with append errors, infrequent worker restarts, and other even less frequent errors.&lt;/p>
&lt;p>Therefore, we add the configuration to use the &lt;code>STORAGE_API_AT_LEAST_ONCE&lt;/code> write mode. To instruct the &lt;code>StorageWrite&lt;/code> client to reuse connections while writing data, we also add the configuration flag &lt;code>--useStorageApiConnectionPool&lt;/code>. This configuration option only works with the &lt;code>STORAGE_API_AT_LEAST_ONCE&lt;/code> mode, and it reduces the occurrence of warnings similar to &lt;code>Storage Api write delay more than 8 seconds&lt;/code>.&lt;/p>
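&lt;p>A sketch of the relaxed configuration, where &lt;code>options&lt;/code> stands for the pipeline options already used to build the pipeline, and the connection pool flag is assumed to be available through the corresponding &lt;code>BigQueryOptions&lt;/code> setter when configured programmatically:&lt;/p>
&lt;pre>&lt;code>// Sketch: at-least-once writes with connection pooling for the StorageWrite client.
BigQueryOptions bqOptions = options.as(BigQueryOptions.class);
bqOptions.setUseStorageApiConnectionPool(true); // assumed equivalent to --useStorageApiConnectionPool

genericRecords.apply("StoreInBigQuery",
    BigQueryIO.writeGenericRecords()
        .to(table)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE));
&lt;/code>&lt;/pre>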
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-resources.png"
alt="Resources">&lt;/p>
&lt;p>When pipeline throughput stabilizes, we see a similar pattern for resource utilization for the workload. The number of workers in use reaches 40, a small improvement compared with the last run. However, the amount of data being moved from Streaming Engine is much closer to the amount of data read from Pub/Sub.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-current-input.png"
alt="Read data size">
&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-current-shuffle-total.png"
alt="Streamed data size">&lt;/p>
&lt;p>Considering all of these factors, this change further optimizes the workload, achieving a throughput of 6.4 MB/s per CPU core. The improvement is small compared with the same workload using consistent writes into BigQuery, but it uses fewer streaming data resources. This configuration represents the optimal setup for our workload, with the highest throughput per resource and the lowest amount of data streamed across workers.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/scaling-streaming-workload/7-latency.png"
alt="Streamed data size">&lt;/p>
&lt;p>This configuration also has impressively low latency for the end-to-end processing. Given that the main path of our pipeline has been fused in a single execution stage from reads to writes, we see that even at p99, the latency tends to be below 300 milliseconds at a quite large volume throughput (as previously mentioned around 1 GB/s).&lt;/p>
&lt;h2 id="recap">Recap&lt;/h2>
&lt;p>Optimizing Apache Beam streaming workloads for low latency and efficient execution requires careful analysis and decision-making, and the right configurations.&lt;/p>
&lt;p>Considering the scenario discussed in this article, it is essential to consider factors like overall CPU utilization, throughput and latency per stage, &lt;code>PCollection&lt;/code> sizes, wall time per stage, write mode, and transport formats, in addition to writing the right pipeline for the workload.&lt;/p>
&lt;p>Our experiments revealed that using the &lt;code>StorageWrite&lt;/code> API, autosharding for writes, and Avro &lt;code>GenericRecords&lt;/code> as the transport format yielded the most efficient results. Relaxing the consistency for writes can further improve performance.&lt;/p>
&lt;p>The accompanying &lt;a href="https://github.com/prodriguezdefino/apache-beam-streaming-tests">Github repository&lt;/a> contains a test suite that you can use to replicate the analysis on your Google Cloud project or with a different runner setup. Feel free to take it for a spin. Comments and PRs are always welcome.&lt;/p></description><link>/blog/scaling-streaming-workload/</link><pubDate>Wed, 03 Jan 2024 00:00:01 -0800</pubDate><guid>/blog/scaling-streaming-workload/</guid><category>blog</category></item><item><title>Build a scalable, self-managed streaming infrastructure with Beam and Flink: Tackling Autoscaling Challenges - Part 2</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;h1 id="build-a-scalable-self-managed-streaming-infrastructure-with-flink-tackling-autoscaling-challenges---part-2">Build a scalable, self-managed streaming infrastructure with Flink: Tackling Autoscaling Challenges - Part 2&lt;/h1>
&lt;p>Welcome to Part 2 of our in-depth series about building and managing a service for Apache Beam Flink on Kubernetes. In this segment, we&amp;rsquo;re taking a closer look at the hurdles we encountered while implementing autoscaling. These challenges weren&amp;rsquo;t just roadblocks. They were opportunities for us to innovate and enhance our system. Let’s break down these issues, understand their context, and explore the solutions we developed.&lt;/p>
&lt;h2 id="understand-apache-beam-backlog-metrics-in-the-flink-runner-environment">Understand Apache Beam backlog metrics in the Flink runner environment&lt;/h2>
&lt;p>&lt;strong>The Challenge:&lt;/strong> In our current setup, we are using Apache Flink for processing data streams. However, we&amp;rsquo;ve encountered a puzzling issue: our Flink job isn&amp;rsquo;t showing the backlog metrics from Apache Beam. These metrics are critical for understanding the state and performance of our data pipelines.&lt;/p>
&lt;p>&lt;strong>What We Found:&lt;/strong> Interestingly, we noticed that the metrics are actually being generated in &lt;code>KafkaIO&lt;/code>, which is a part of our data pipeline that handles Kafka streams. But when we try to monitor these metrics through the Apache Flink Metric system, we can&amp;rsquo;t find them. We suspected that there might be an issue with the integration (or &amp;lsquo;wiring&amp;rsquo;) between Apache Beam and Apache Flink.&lt;/p>
&lt;p>&lt;strong>Digging Deeper:&lt;/strong> On closer inspection, we found that the metrics should be emitted during the &amp;lsquo;Checkpointing&amp;rsquo; phase of the data stream processing. During this crucial step, the system takes a snapshot of the stream&amp;rsquo;s state, and the metrics in question are generated for unbounded sources, that is, sources that continuously stream data, like Kafka.&lt;/p>
&lt;p>&lt;strong>A Potential Solution:&lt;/strong> We believe the root of the problem lies in how the metric context is set during the checkpointing phase. A disconnect appears to prevent the Beam metrics from being properly captured in the Flink Metric system. We proposed a fix for this issue, which you can review and contribute to on our GitHub pull request: &lt;a href="https://github.com/apache/beam/pull/29793">Apache Beam PR #29793&lt;/a>.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part2/flink-backlog-metrics.png"
alt="Apache Flink Beam Backlog Metrics">&lt;/p>
&lt;h2 id="overcoming-challenges-in-checkpoint-size-reduction-for-autoscaling-beam-jobs">Overcoming challenges in checkpoint size reduction for autoscaling Beam jobs&lt;/h2>
&lt;p>In this section we will discuss strategies for reducing the size of checkpoints in autoscaling Apache Beam jobs, focusing on efficient checkpointing in Apache Flink and optimizing bundle sizes and PipelineOptions to manage frequent checkpoint timeouts and large-scale job requirements.&lt;/p>
&lt;h3 id="understand-the-basics-of-checkpointing-in-apache-flink">Understand the basics of checkpointing in Apache Flink&lt;/h3>
&lt;p>In stream processing, maintaining state consistency and fault tolerance is crucial. Apache Flink achieves this through a process called &lt;em>checkpointing&lt;/em>. Checkpointing periodically captures the state of a job&amp;rsquo;s operators and stores it in a stable storage location, like Google Cloud Storage or AWS S3. In our configuration, Flink checkpoints a job every ten seconds and allows up to one minute for this process to complete. This process is vital for ensuring that, in case of failures, the job can resume from the last checkpoint, providing exactly-once semantics and fault tolerance.&lt;/p>
&lt;h3 id="the-role-of-bundles-in-apache-beam">The role of bundles in Apache Beam&lt;/h3>
&lt;p>Apache Beam introduces the concept of a &lt;em>bundle&lt;/em>. A bundle is essentially a group of elements that are processed together. This step enhances processing efficiency and throughput by reducing the overhead of handling each element separately. For more information, see &lt;a href="https://beam.apache.org/documentation/runtime/model/#bundling-and-persistence">Bundling and persistence&lt;/a>. In the Flink runner &lt;a href="https://beam.apache.org/releases/javadoc/2.52.0/org/apache/beam/runners/flink/FlinkPipelineOptions.html#getMaxBundleSize--">default configuration&lt;/a>, a bundle&amp;rsquo;s default size is 1000 elements with a one-second timeout. However, based on our performance tests, we adjusted the bundle size to &lt;em>10,000 elements with a 10-second timeout&lt;/em>.&lt;/p>
&lt;h3 id="challenge-frequent-checkpoint-timeouts">Challenge: frequent checkpoint timeouts&lt;/h3>
&lt;p>When we configured checkpointing every 10 seconds, we faced frequent checkpoint timeouts, often exceeding 1 minute. This was due to the large size of the checkpoints.&lt;/p>
&lt;h3 id="solution-manage-checkpoint-size">Solution: Manage checkpoint size&lt;/h3>
&lt;p>In Apache Beam Flink jobs, the &lt;code>finishBundleBeforeCheckpointing&lt;/code> option plays a pivotal role. When enabled, it ensures that all bundles are completely processed before initiating a checkpoint. This results in checkpoints that only contain the state post-bundle completion, significantly reducing checkpoint size. Initially, our checkpoints were around 2 MB per pipeline. With this change, they consistently dropped to 150 KB.&lt;/p>
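&lt;p>A minimal sketch of these settings with the Beam Java SDK, assuming they are exposed through FlinkPipelineOptions; the values are the ones discussed in this section:&lt;/p>
&lt;pre>&lt;code>import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class CheckpointFriendlyOptions {
  static FlinkPipelineOptions create(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setMaxBundleSize(10_000L);           // 10,000 elements per bundle
    options.setMaxBundleTimeMills(10_000L);      // 10-second bundle timeout
    options.setCheckpointingInterval(10_000L);   // checkpoint every 10 seconds
    options.setCheckpointTimeoutMillis(60_000L); // allow up to one minute per checkpoint
    options.setFinishBundleBeforeCheckpointing(true); // keep checkpoints small
    return options;
  }
}
&lt;/code>&lt;/pre>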
&lt;h3 id="address-the-checkpoint-size-in-large-scale-jobs">Address the checkpoint size in large-scale jobs&lt;/h3>
&lt;p>Despite reducing checkpoint sizes, a 150 KB checkpoint every ten seconds can still be substantial, especially in jobs that run multiple pipelines. For instance, with 100 pipelines in a single job, this size balloons to 15 MB per 10-second interval.&lt;/p>
&lt;h3 id="further-optimization-reduce-checkpoint-size-with-pipelineoptions">Further optimization: reduce checkpoint size with PipelineOptions&lt;/h3>
&lt;p>We discovered that due to a specific issue (BEAM-8577), our Flink runner was including our large &lt;code>PipelineOptions&lt;/code> objects in every checkpoint. We solved this problem by removing unnecessary application-related options from &lt;code>PipelineOptions&lt;/code>, further reducing the checkpoint size to a more manageable 10 KB per pipeline.&lt;/p>
&lt;h2 id="kafka-reader-wait-time-solving-autoscaling-challenges-in-beam-jobs">Kafka Reader wait time: solving autoscaling challenges in Beam jobs&lt;/h2>
&lt;h3 id="understand-unaligned-checkpointing">Understand unaligned checkpointing&lt;/h3>
&lt;p>In our system, we use unaligned checkpointing to speed up the process of checkpointing, which is essential for ensuring data consistency in distributed systems. However, when we activated the &lt;code>finishBundleBeforeCheckpointing&lt;/code> feature, we began facing checkpoint timeout issues and delays in checkpointing steps. Apache Beam leverages Apache Flink&amp;rsquo;s legacy source implementation for processing unbounded sources. In Flink, tasks are categorized into two types: source tasks and non-source tasks.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Source tasks&lt;/strong>: fetch data from external systems into a Flink job&lt;/li>
&lt;li>&lt;strong>Non-source tasks&lt;/strong>: process the incoming data&lt;/li>
&lt;/ul>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part2/kafkaio-wait-reader.png"
alt="Apache Flink Task Types">&lt;/p>
&lt;p>In the standard configuration, non-source tasks check for an available buffer before pulling data. Source tasks do not perform this check, so they can be blocked while writing data to a full output buffer. This delay hurts unaligned checkpoints, because legacy source tasks only recognize an unaligned checkpoint barrier when an output buffer is available.&lt;/p>
&lt;h3 id="address-the-challenge-with-unboundedsourcewrapper-in-beam">Address the challenge with UnboundedSourceWrapper in Beam&lt;/h3>
&lt;p>To solve this problem, Apache Flink introduced a new source implementation that operates in a pull mode. In this mode, a task checks for a free buffer before fetching data, aligning with the approach of non-source tasks.&lt;/p>
&lt;p>However, the legacy source, still used by Apache Beam&amp;rsquo;s Flink Runner, operates in a push mode. It sends data to downstream tasks immediately. This setup might create bottlenecks when buffers are full, causing delays in detecting unaligned checkpoint barriers.&lt;/p>
&lt;h3 id="our-solution">Our solution&lt;/h3>
&lt;p>Despite its deprecation, Apache Beam&amp;rsquo;s Flink Runner still uses the legacy source implementation. To address its issues, we implemented our modifications and the quick workarounds suggested in &lt;a href="https://issues.apache.org/jira/browse/FLINK-26759">FLINK-26759&lt;/a>. These enhancements are detailed in our &lt;a href="#">Pull Request&lt;/a>. You can also find more information about unaligned checkpoint issues in the &lt;a href="https://blog.51cto.com/u_14286418/7000028">Flink Unaligned Checkpoint&lt;/a> blog post.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part2/checkpoint_monitoring-history-subtasks.png"
alt="Apache Flink UI Checkpoint History">&lt;/p>
&lt;h2 id="address-slow-reads-in-high-traffic-scenarios">Address slow reads in high-traffic scenarios&lt;/h2>
&lt;p>In our journey with Apache Beam and the Flink Runner, we encountered a significant challenge similar to one documented in the post &lt;a href="https://antonio-si.medium.com/how-intuit-debug-consumer-lags-in-apache-beam-22ca3b39602e">How Intuit Debug Consumer Lags in Apache Beam&lt;/a> by &lt;a href="https://antonio-si.medium.com/">Antonio Si&lt;/a> in his experience at Intuit. Their real-time data processing pipelines had increasing Kafka consumer lag, particularly with topics experiencing high message traffic. This issue was traced to Apache Beam&amp;rsquo;s handling of Kafka partitions through &lt;code>UnboundedSourceWrapper&lt;/code> and &lt;code>KafkaUnboundedReader&lt;/code>. Specifically, for topics with lower traffic, the processing thread paused unnecessarily, delaying the processing of high-traffic topics. We faced a parallel situation in our system, where the imbalance in processing speeds between high- and low-traffic topics led to inefficiencies.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part2/adaptive-timeout-kafka.png"
alt="UnboundedSourceWrapper Design">&lt;/p>
&lt;p>To resolve this issue, we developed an innovative solution: an adaptive timeout strategy in &lt;code>KafkaIO&lt;/code>. This strategy dynamically adjusts the timeout duration based on the traffic of each topic. For low-traffic topics, it shortens the timeout, preventing unnecessary delays. For high-traffic topics, it extends the timeout, providing more processing opportunities. This approach is detailed in our recent pull request.&lt;/p>
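&lt;p>The following is a purely conceptual sketch of the idea, not the code from the pull request: the reader waits only briefly on topics that returned nothing recently and waits longer on topics that keep producing records. All names and bounds are illustrative.&lt;/p>
&lt;pre>&lt;code>// Conceptual sketch of an adaptive poll timeout; names and bounds are illustrative.
import java.time.Duration;

class AdaptivePollTimeout {
  private static final Duration MIN_TIMEOUT = Duration.ofMillis(1);
  private static final Duration MAX_TIMEOUT = Duration.ofMillis(100);
  private Duration current = MIN_TIMEOUT;

  Duration next(int recordsReturnedByLastPoll) {
    if (recordsReturnedByLastPoll == 0) {
      // Low-traffic topic: do not block the shared reader thread for long.
      current = MIN_TIMEOUT;
    } else {
      // High-traffic topic: allow more time to pull the available records.
      current = current.multipliedBy(2);
      if (current.compareTo(MAX_TIMEOUT) &amp;gt; 0) {
        current = MAX_TIMEOUT;
      }
    }
    return current;
  }
}
&lt;/code>&lt;/pre>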
&lt;h2 id="unbalanced-partition-distribution-in-beam-job-autoscaling">Unbalanced partition distribution in Beam job autoscaling&lt;/h2>
&lt;p>At the heart of this system is the adaptive scheduler, a component designed for rapid resource allocation. It intelligently adjusts the number of parallel tasks (parallelism) a job performs based on the availability of computing slots. These slots are like individual workstations, each capable of handling certain parts of the job.&lt;/p>
&lt;p>However, we encountered a problem. Our jobs consist of multiple independent pipelines, each needing its own set of resources. Initially, the system tended to overburden the first few workers by assigning them more tasks, while others remained underutilized. This issue was due to the way Flink allocated tasks, favoring the first workers for each pipeline.&lt;/p>
&lt;p>&lt;img class="center-block"
src="/images/blog/apache-beam-flink-and-kubernetes-part2/flink-partition-assignment.png"
alt="Flink split assignment on slots">&lt;/p>
&lt;p>To address this issue, we developed a custom patch for Flink&amp;rsquo;s &lt;em>SlotSharingSlotAllocator&lt;/em>, a component responsible for task distribution. This patch ensures a more balanced workload distribution across all available workers, improving efficiency and preventing bottlenecks.
With this improvement, each worker gets a fair share of tasks, leading to better resource utilization and smoother operation of our Beam Jobs.&lt;/p>
&lt;h2 id="drain-support-in-kubernetes-operator-with-flink">Drain support in Kubernetes Operator with Flink&lt;/h2>
&lt;h3 id="the-challenge">The challenge&lt;/h3>
&lt;p>In the world of data processing with Apache Flink, a common task is to manage and update data-processing jobs. These jobs could be either stateful, where they remember past data, or stateless, where they don&amp;rsquo;t.&lt;/p>
&lt;p>In the past, when we needed to update or delete a Flink job managed by the Kubernetes Operator, the system saved the current state of the job using a savepoint or checkpoint. However, a crucial step was missing: the system didn&amp;rsquo;t stop the job from processing new data (this is what we mean by draining the job). This oversight could lead to two major issues:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>For stateful jobs:&lt;/strong> potential data inconsistencies, because the job might process new data that wasn&amp;rsquo;t accounted for in the savepoint&lt;/li>
&lt;li>&lt;strong>For stateless jobs:&lt;/strong> data duplication, because the job might reprocess data it already processed&lt;/li>
&lt;/ol>
&lt;h3 id="the-solution-drain-function">The solution: drain function&lt;/h3>
&lt;p>This is where the update tracked as &lt;a href="https://issues.apache.org/jira/browse/FLINK-32700">FLINK-32700&lt;/a> comes in: it introduced a drain function. Think of it as telling the job, &amp;ldquo;Finish what you&amp;rsquo;re currently processing, but don&amp;rsquo;t take on anything new.&amp;rdquo; Here&amp;rsquo;s how it works:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Stop new data:&lt;/strong> The job stops reading new input.&lt;/li>
&lt;li>&lt;strong>Mark the source:&lt;/strong> The job marks the source with an infinite watermark. Think of this watermark as a marker that tells the system that there&amp;rsquo;s no more new data to process.&lt;/li>
&lt;li>&lt;strong>Propagate through the pipeline:&lt;/strong> This marker is then passed through the job&amp;rsquo;s processing pipeline, ensuring that every part of the job knows not to expect any new data.&lt;/li>
&lt;/ol>
&lt;p>This seemingly small change has a big impact. It ensures that when a job is updated or deleted, the data it processes remains consistent and accurate. This is crucial for any data-processing task, because it maintains the integrity and reliability of the data. Furthermore, in cases where the drain fails, you can cancel the job without needing a savepoint, which adds a layer of flexibility and safety to the whole process.&lt;/p>
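&lt;p>For intuition, the drain sequence can be modeled as a tiny state machine. The following is a purely illustrative sketch in plain Java with hypothetical names; the real mechanics live inside Flink and the Kubernetes Operator. Once drain is requested, the source stops reading, emits a final &amp;ldquo;infinite&amp;rdquo; watermark, and downstream stages treat that watermark as the end of input.&lt;/p>
&lt;pre>&lt;code class="language-java">/**
 * Purely illustrative model of the drain sequence (not Flink or operator code).
 * Flink represents the "no more data" marker as a watermark at Long.MAX_VALUE.
 */
public class DrainModel {
  static final long MAX_WATERMARK = Long.MAX_VALUE;

  private boolean draining = false;

  /** Step 1: stop reading new input. */
  void requestDrain() {
    draining = true;
  }

  /** Step 2: the source marks itself with the infinite watermark. */
  long nextWatermark(long currentEventTime) {
    return draining ? MAX_WATERMARK : currentEventTime;
  }

  /** Step 3: downstream operators flush and finish once the marker arrives. */
  boolean isEndOfInput(long watermark) {
    return watermark == MAX_WATERMARK;
  }
}
&lt;/code>&lt;/pre>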
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>As we conclude Part 2 of our series on building and managing Apache Beam Flink services on Kubernetes, it&amp;rsquo;s evident that the journey of implementing autoscaling has been both challenging and enlightening. The obstacles we faced, from understanding Apache Beam backlog metrics in the Flink Runner environment to addressing slow reads in high-traffic scenarios, pushed us to develop innovative solutions and deepen our understanding of streaming infrastructure.&lt;/p>
&lt;p>Our exploration into the intricacies of checkpointing, Kafka Reader wait times, and unbalanced partition distribution revealed the complexities of autoscaling Beam jobs. These challenges prompted us to devise strategies like the adaptive timeout in &lt;code>KafkaIO&lt;/code> and the balanced workload distribution in Flink&amp;rsquo;s &lt;code>SlotSharingSlotAllocator&lt;/code>. Additionally, the introduction of the drain support in Kubernetes Operator with Flink marks a significant advancement in managing stateful and stateless jobs effectively.&lt;/p>
&lt;p>This journey has not only enhanced the robustness and efficiency of our system but has also contributed valuable insights to the broader community working with Apache Beam and Flink. We hope that our experiences and solutions will aid others facing similar challenges in their projects.&lt;/p>
&lt;p>Stay tuned for our next blog post, where we&amp;rsquo;ll delve into the specifics of autoscaling in Apache Beam. We&amp;rsquo;ll break down the concepts, strategies, and best practices to effectively scale your Beam jobs. Thank you for following our series, and we look forward to sharing more of our journey and learnings with you.&lt;/p>
&lt;h2 id="acknowledgements">Acknowledgements&lt;/h2>
&lt;p>Building the new infrastructure and migrating large customer-based applications from cloud-provider-managed streaming infrastructure to self-managed, Flink-based infrastructure at scale was a major effort. Thanks to the Palo Alto Networks CDL streaming team who helped make this happen: Kishore Pola, Andrew Park, Hemant Kumar, Manan Mangal, Helen Jiang, Mandy Wang, Praveen Kumar Pasupuleti, JM Teo, Rishabh Kedia, Talat Uyarer, Naitk Dani, and David He.&lt;/p>
&lt;hr>
&lt;p>&lt;strong>Explore More:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://beam.apache.org/blog/apache-beam-flink-and-kubernetes/">Part 1: Introduction to Building and Managing Apache Beam Flink Services on Kubernetes&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Join the conversation and share your experiences on our &lt;a href="https://beam.apache.org/community/">Community&lt;/a> or contribute to our ongoing projects on &lt;a href="https://github.com/apache/beam">GitHub&lt;/a>. Your feedback is invaluable. If you have any comments or questions about this series, please feel free to reach out to us via the &lt;a href="https://beam.apache.org/community/contact-us/">user mailing list&lt;/a>.&lt;/em>&lt;/p>
&lt;p>&lt;em>Stay connected with us for more updates and insights into Apache Beam, Flink, and Kubernetes.&lt;/em>&lt;/p></description><link>/blog/apache-beam-flink-and-kubernetes-part2/</link><pubDate>Mon, 18 Dec 2023 09:00:00 -0400</pubDate><guid>/blog/apache-beam-flink-and-kubernetes-part2/</guid><category>blog</category></item><item><title>Apache Beam 2.52.0</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the &lt;a href="/get-started/downloads/#2520-2023-11-17">download page&lt;/a> for this release.&lt;/p>
&lt;p>For more information on changes in 2.52.0, check out the &lt;a href="https://github.com/apache/beam/milestone/16">detailed release notes&lt;/a>.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>Avro-dependent code that was previously deprecated (in Beam release 2.46.0) has finally been removed from the Java SDK &amp;ldquo;core&amp;rdquo; package.
Please use &lt;code>beam-sdks-java-extensions-avro&lt;/code> instead; a minimal example follows this list. This allows you to easily update the Avro version in user code without
potential breaking changes in Beam &amp;ldquo;core&amp;rdquo;, because the Beam Avro extension already supports the latest Avro versions and
should handle this. (&lt;a href="https://github.com/apache/beam/issues/25252">#25252&lt;/a>).&lt;/li>
&lt;li>Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process. (&lt;a href="https://github.com/apache/beam/issues/28120">#28120&lt;/a>)
&lt;ul>
&lt;li>The Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
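&lt;p>For Java users migrating off the removed &amp;ldquo;core&amp;rdquo; Avro classes, the change is usually just a dependency and import swap. A minimal sketch, assuming the &lt;code>beam-sdks-java-extensions-avro&lt;/code> artifact is on the classpath; the &lt;code>SensorReading&lt;/code> class here is purely illustrative:&lt;/p>
&lt;pre>&lt;code class="language-java">// AvroCoder now comes from the extensions/avro package, not the SDK "core".
import org.apache.beam.sdk.extensions.avro.coders.AvroCoder;

public class AvroCoderMigration {
  /** Hypothetical POJO used only for illustration. */
  public static class SensorReading {
    public String sensorId;
    public double value;
  }

  public static void main(String[] args) {
    // Same AvroCoder API as before; only the owning artifact and package changed.
    AvroCoder&amp;lt;SensorReading> coder = AvroCoder.of(SensorReading.class);
    System.out.println(coder);
  }
}
&lt;/code>&lt;/pre>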
&lt;h2 id="new-features--improvements">New Features / Improvements&lt;/h2>
&lt;ul>
&lt;li>Added the &lt;code>UseDataStreamForBatch&lt;/code> pipeline option to the Flink runner. When set to true, the Flink runner runs batch
jobs using the DataStream API; see the sketch after this list. By default the option is set to false, so batch jobs are still executed
using the DataSet API.&lt;/li>
&lt;li>The &lt;code>upload_graph&lt;/code> experiment option for DataflowRunner is no longer required when the graph is larger than 10 MB for the Java SDK (&lt;a href="https://github.com/apache/beam/pull/28621">PR#28621&lt;/a>).&lt;/li>
&lt;li>The state and side input cache has been enabled with a default size of 100 MB. Use &lt;code>--max_cache_memory_usage_mb=X&lt;/code> to set the cache size for the user state API and side inputs. (Python) (&lt;a href="https://github.com/apache/beam/issues/28770">#28770&lt;/a>).&lt;/li>
&lt;li>Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework, which includes a preliminary set of I/Os and turnkey transforms. More information can be found in the YAML root folder and in the &lt;a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/README.md">README&lt;/a>.&lt;/li>
&lt;/ul>
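&lt;p>For Java pipelines on the Flink runner, opting in to the DataStream API for batch is a single pipeline-option change. A minimal sketch, assuming the option is exposed on &lt;code>FlinkPipelineOptions&lt;/code> as &lt;code>useDataStreamForBatch&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-java">import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataStreamForBatchExample {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Run this batch job on Flink's DataStream API instead of the legacy DataSet API.
    options.setUseDataStreamForBatch(true);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the batch pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}
&lt;/code>&lt;/pre>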
&lt;h2 id="breaking-changes">Breaking Changes&lt;/h2>
&lt;ul>
&lt;li>&lt;code>org.apache.beam.sdk.io.CountingSource.CounterMark&lt;/code> now uses a custom &lt;code>CounterMarkCoder&lt;/code> as its default coder, because all Avro-dependent
classes have finally moved to &lt;code>extensions/avro&lt;/code>. If &lt;code>AvroCoder&lt;/code> is still required for &lt;code>CounterMark&lt;/code>,
as a workaround place a copy of the &amp;ldquo;old&amp;rdquo; &lt;code>CountingSource&lt;/code> class in your project code and use it directly
(&lt;a href="https://github.com/apache/beam/issues/25252">#25252&lt;/a>).&lt;/li>
&lt;li>Renamed &lt;code>host&lt;/code> to &lt;code>firestoreHost&lt;/code> in &lt;code>FirestoreOptions&lt;/code> to avoid potential conflict of command line arguments (Java) (&lt;a href="https://github.com/apache/beam/pull/29201">#29201&lt;/a>).&lt;/li>
&lt;/ul>
&lt;h2 id="bugfixes">Bugfixes&lt;/h2>
&lt;ul>
&lt;li>Fixed &amp;ldquo;Desired bundle size 0 bytes must be greater than 0&amp;rdquo; in Java SDK&amp;rsquo;s BigtableIO.BigtableSource when you have more cores than bytes to read (Java) &lt;a href="https://github.com/apache/beam/issues/28793">#28793&lt;/a>.&lt;/li>
&lt;li>The &lt;code>watch_file_pattern&lt;/code> arg of &lt;a href="https://github.com/apache/beam/blob/104c10b3ee536a9a3ea52b4dbf62d86b669da5d9/sdks/python/apache_beam/ml/inference/base.py#L997">RunInference&lt;/a> had no effect prior to 2.52.0. To get the behavior intended for &lt;code>watch_file_pattern&lt;/code> in versions prior to 2.52.0, follow the documentation at &lt;a href="https://beam.apache.org/documentation/ml/side-input-updates/">https://beam.apache.org/documentation/ml/side-input-updates/&lt;/a> and use the &lt;code>WatchFilePattern&lt;/code> PTransform as a side input. (&lt;a href="https://github.com/apache/beam/pulls/28948">#28948&lt;/a>)&lt;/li>
&lt;li>&lt;code>MLTransform&lt;/code> no longer outputs artifacts such as min, max, and quantiles. Instead, &lt;code>MLTransform&lt;/code> will add a feature to output these artifacts in a human-readable format (&lt;a href="https://github.com/apache/beam/issues/29017">#29017&lt;/a>). For now, to use artifacts such as min and max that were produced by an earlier &lt;code>MLTransform&lt;/code>, use &lt;code>read_artifact_location&lt;/code> of &lt;code>MLTransform&lt;/code>, which reads artifacts produced earlier in a different &lt;code>MLTransform&lt;/code> (&lt;a href="https://github.com/apache/beam/pull/29016/">#29016&lt;/a>)&lt;/li>
&lt;li>Fixed a memory leak, which affected some long-running Python pipelines: &lt;a href="https://github.com/apache/beam/issues/28246">#28246&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="security-fixes">Security Fixes&lt;/h2>
&lt;ul>
&lt;li>Fixed &lt;a href="https://www.cve.org/CVERecord?id=CVE-2023-39325">CVE-2023-39325&lt;/a> (Java/Python/Go) (&lt;a href="https://github.com/apache/beam/issues/29118">#29118&lt;/a>).&lt;/li>
&lt;li>Mitigated &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2023-47248">CVE-2023-47248&lt;/a> (Python) &lt;a href="https://github.com/apache/beam/issues/29392">#29392&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="list-of-contributors">List of Contributors&lt;/h2>
&lt;p>According to git shortlog, the following people contributed to the 2.52.0 release. Thank you to all contributors!&lt;/p>
&lt;p>Ahmed Abualsaud&lt;/p>
&lt;p>Ahmet Altay&lt;/p>
&lt;p>Aleksandr Dudko&lt;/p>
&lt;p>Alexey Romanenko&lt;/p>
&lt;p>Anand Inguva&lt;/p>
&lt;p>Andrei Gurau&lt;/p>
&lt;p>Andrey Devyatkin&lt;/p>
&lt;p>BjornPrime&lt;/p>
&lt;p>Bruno Volpato&lt;/p>
&lt;p>Bulat&lt;/p>
&lt;p>Chamikara Jayalath&lt;/p>
&lt;p>Damon&lt;/p>
&lt;p>Danny McCormick&lt;/p>
&lt;p>Devansh Modi&lt;/p>
&lt;p>Dominik Dębowczyk&lt;/p>
&lt;p>Ferran Fernández Garrido&lt;/p>
&lt;p>Hai Joey Tran&lt;/p>
&lt;p>Israel Herraiz&lt;/p>
&lt;p>Jack McCluskey&lt;/p>
&lt;p>Jan Lukavský&lt;/p>
&lt;p>JayajP&lt;/p>
&lt;p>Jeff Kinard&lt;/p>
&lt;p>Jeffrey Kinard&lt;/p>
&lt;p>Jiangjie Qin&lt;/p>
&lt;p>Jing&lt;/p>
&lt;p>Joar Wandborg&lt;/p>
&lt;p>Johanna Öjeling&lt;/p>
&lt;p>Julien Tournay&lt;/p>
&lt;p>Kanishk Karanawat&lt;/p>
&lt;p>Kenneth Knowles&lt;/p>
&lt;p>Kerry Donny-Clark&lt;/p>
&lt;p>Luís Bianchin&lt;/p>
&lt;p>Minbo Bae&lt;/p>
&lt;p>Pranav Bhandari&lt;/p>
&lt;p>Rebecca Szper&lt;/p>
&lt;p>Reuven Lax&lt;/p>
&lt;p>Ritesh Ghorse&lt;/p>
&lt;p>Robert Bradshaw&lt;/p>
&lt;p>Robert Burke&lt;/p>
&lt;p>RyuSA&lt;/p>
&lt;p>Shunping Huang&lt;/p>
&lt;p>Steven van Rossum&lt;/p>
&lt;p>Svetak Sundhar&lt;/p>
&lt;p>Tony Tang&lt;/p>
&lt;p>Vitaly Terentyev&lt;/p>
&lt;p>Vivek Sumanth&lt;/p>
&lt;p>Vlado Djerek&lt;/p>
&lt;p>Yi Hu&lt;/p>
&lt;p>aku019&lt;/p>
&lt;p>brucearctor&lt;/p>
&lt;p>caneff&lt;/p>
&lt;p>damccorm&lt;/p>
&lt;p>ddebowczyk92&lt;/p>
&lt;p>dependabot[bot]&lt;/p>
&lt;p>dpcollins-google&lt;/p>
&lt;p>edman124&lt;/p>
&lt;p>gabry.wu&lt;/p>
&lt;p>illoise&lt;/p>
&lt;p>johnjcasey&lt;/p>
&lt;p>jonathan-lemos&lt;/p>
&lt;p>kennknowles&lt;/p>
&lt;p>liferoad&lt;/p>
&lt;p>magicgoody&lt;/p>
&lt;p>martin trieu&lt;/p>
&lt;p>nancyxu123&lt;/p>
&lt;p>pablo rodriguez defino&lt;/p>
&lt;p>tvalentyn&lt;/p></description><link>/blog/beam-2.52.0/</link><pubDate>Fri, 17 Nov 2023 09:00:00 -0400</pubDate><guid>/blog/beam-2.52.0/</guid><category>blog</category><category>release</category></item><item><title>Contributor Spotlight: Johanna Öjeling</title><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>Johanna Öjeling is a Senior Software Engineer at &lt;a href="https://normative.io/">Normative&lt;/a>. She started using Apache Beam in 2020 at her previous company &lt;a href="http://datatonic.com">Datatonic&lt;/a> and began contributing in 2022 in a personal capacity. We interviewed Johanna to learn more about her interests, and we hope this will inspire a new, diverse set of future contributors to participate in OSS projects.&lt;/p>
&lt;p>&lt;strong>What areas of interest are you passionate about in your career?&lt;/strong>&lt;/p>
&lt;p>My core interest lies in distributed and data-intensive systems, and I enjoy working on challenges related to performance, scalability and maintainability. I also feel strongly about developer experience, and like to build tools and frameworks that make developers happier and more productive. Aside from that, I take pleasure in mentoring and coaching other software engineers to grow their skills and pursue a fulfilling career.&lt;/p>
&lt;p>&lt;strong>What motivated you to make your first contribution?&lt;/strong>&lt;/p>
&lt;p>I was already a user of the Apache Beam Java and Python SDKs and Google Cloud Dataflow in my previous job, and had started to play around with the Go SDK to learn Go. When I noticed that a feature I wanted was missing, it seemed like a great opportunity to implement it. I had been curious about developing open source software for some time, but until then did not have a good idea of what to contribute.&lt;/p>
&lt;p>&lt;strong>In which way have you contributed to Apache Beam?&lt;/strong>&lt;/p>
&lt;p>I have primarily worked on the Go SDK with implementation of new features, bug fixes, tests, documentation and code reviews. Some examples include a MongoDB I/O connector with dynamically scalable reads and writes, a file I/O connector supporting continuous file discovery, and an Amazon S3 file system implementation.&lt;/p>
&lt;p>&lt;strong>How has your open source engagement impacted your personal or professional growth?&lt;/strong>&lt;/p>
&lt;p>Contributing to open source is one of the best decisions I have taken professionally. The Beam community has been incredibly welcoming and appreciative, and it has been rewarding to collaborate with talented people around the world to create software that is free for anyone to benefit from. Open source has opened up new opportunities to challenge myself, dive deeper into technologies I like, and learn from highly skilled professionals. To me, it has served as an outlet for creativity, problem solving and purposeful work.&lt;/p>
&lt;p>&lt;strong>How have you noticed contributing to open source is different from contributing to closed source/proprietary software?&lt;/strong>&lt;/p>
&lt;p>My observation has been that there are higher requirements for software quality in open source, and it is more important to get things right the first time. My closed source software experience is from startups/scale-ups where speed is prioritized. When not working on public facing APIs or libraries, one can also more easily change things, whereas we need to be mindful about breaking changes in Beam. I care for software quality and value the high standards the Beam committers hold.&lt;/p>
&lt;p>&lt;strong>What do you like to do with your spare time when you&amp;rsquo;re not contributing to Beam?&lt;/strong>&lt;/p>
&lt;p>Coding is a passion of mine so I tend to spend a lot of my free time on hobby projects, reading books and articles, listening to talks and attending events. When I was younger I loved learning foreign languages and studied English, French, German and Spanish. Later I discovered an interest in computer science and switched focus to programming languages. I decided to change careers to software engineering and have tried to learn as much as possible ever since. I love that it never ends.&lt;/p>
&lt;p>&lt;strong>What future features/improvements are you most excited about, or would you like to see on Beam?&lt;/strong>&lt;/p>
&lt;p>The multi-language pipeline support is an impressive feature of Beam, and I like that new SDKs such as TypeScript and Swift are emerging, which enables developers to write pipelines in their preferred language. Naturally, I am also excited to see where the Go SDK is headed and how we can make use of newer features of the Go language.&lt;/p>
&lt;p>&lt;strong>What types of contributions or support do you think the Beam community needs more of?&lt;/strong>&lt;/p>
&lt;p>Many data and machine learning engineers feel more comfortable with Python than Java and wish the Python SDK were as feature rich as the Java SDK. This presents great opportunities for Python developers to start contributing to Beam. As an SDK author, one can take advantage of Beam&amp;rsquo;s multiple SDKs. When I have developed in Go I have often studied the Java and Python implementations to get ideas for how to solve specific problems and make sure the Go SDK follows a similar pattern.&lt;/p>
&lt;p>&lt;strong>What advice would you give to someone who wants to contribute but does not know where to begin?&lt;/strong>&lt;/p>
&lt;p>Start with asking yourself what prior knowledge you have and what you would like to learn, then look for opportunities that match that. The contribution guidelines will tell you where to find open issues and what the process looks like. There are tasks labeled as &amp;ldquo;good first issue&amp;rdquo; which can be a good starting point. I was quite nervous about making my first contribution and had my mentor pre-review my PR. There was no need to worry though, as people will be grateful for your effort to improve the project. The pride I felt when a committer approved my PR and welcomed me to Beam is something I still remember.&lt;/p>
&lt;p>&lt;strong>What advice would you give to the Beam community? What could we improve?&lt;/strong>&lt;/p>
&lt;p>We can make it easier for new community members to get involved by providing more examples of tasks that we need help with, both in the form of code and non-code contributions. I will take it as an action point myself to label more issues accordingly and tailor the descriptions for newcomers. However, this is contingent on community members visiting the GitHub project. To address this, we could also proactively promote opportunities through social channels and the user mailing list.&lt;/p>
&lt;p>&lt;em>We thank Johanna for the interview and for her contributions! If you would like to learn more about contributing to Beam you can learn more about it here: &lt;a href="https://beam.apache.org/contribute/">https://beam.apache.org/contribute/&lt;/a>.&lt;/em>&lt;/p></description><link>/blog/contributor-spotlight-johanna-ojeling/</link><pubDate>Sat, 11 Nov 2023 15:00:00 -0800</pubDate><guid>/blog/contributor-spotlight-johanna-ojeling/</guid><category>blog</category></item></channel></rss>