<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Apache Beam – Case Studies</title><link>/case-studies/</link><description>Recent content in Case Studies on Apache Beam</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/case-studies/index.xml" rel="self" type="application/rss+xml"/><item><title>Case-Studies: High-Performing and Efficient Transactional Data Processing for OCTO Technology’s Clients</title><link>/case-studies/octo/</link><pubDate>Thu, 10 Aug 2023 00:12:00 +0000</pubDate><guid>/case-studies/octo/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/octo.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Oftentimes, I tell the clients that Apache Beam is the “Docker” of data processing. It's highly portable, runs seamlessly anywhere, unifies batch and streaming processing, and offers numerous out-of-the-box templates. Adopting Beam enables accelerated migration from batch to streaming, effortless pipeline reuse across contexts, and faster enablement of new use cases. The benefits and great performance of Apache Beam are a game-changer for many!”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/octo/godefroy-clair.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Godefroy Clair
&lt;/div>
&lt;div class="case-study-quote-author-position">
Data Architect @ OCTO Technology
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="high-performing-and-efficient-transactional-data-processing-for-octo-technologys-clients">High-Performing and Efficient Transactional Data Processing for OCTO Technology’s Clients&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://octo.com/">OCTO Technology&lt;/a>, part of &lt;a href="https://www.accenture.com/">Accenture&lt;/a>, stands at the forefront of technology consultancy and software engineering, specializing in new technologies and digital transformation. Since 1998, OCTO has been dedicated to crafting scalable digital solutions that drive business transformations for clients, ranging from startups to multinational corporations. OCTO leverages its deep technology expertise and a strong culture of successful innovation to help clients explore, test, and embrace emerging technologies or implement mature digital solutions at scale.&lt;/p>
&lt;p>With the powerful Apache Beam unified portable model, OCTO has unlocked the potential to empower, transform, and scale the data ecosystems of several clients, including renowned names like a famous French newspaper and one of France&amp;rsquo;s largest grocery retailers.&lt;/p>
&lt;p>In this spotlight, OCTO’s Data Architect, Godefroy Clair, and Data Engineers, Florian Bastin and Leo Babonnaud, unveil the remarkable impact of Apache Beam on the data processing of a leading French grocery retailer. The implementation led to expedited migration from batch to streaming, a 4x acceleration in transactional data processing, and a 5x improvement in infrastructure cost efficiency.&lt;/p>
&lt;h2 id="high-performing-transactional-data-processing">High-performing transactional data processing&lt;/h2>
&lt;p>OCTO’s Client, a prominent grocery and convenience store retailer with tens of thousands of stores across several countries, relies on an internal web app to empower store managers with informed purchasing decisions and effective store management. The web app provides access to crucial product details, stock quantities, pricing, promotions, and more, sourced from various internal data stores, platforms, and systems.&lt;/p>
&lt;p>Before 2022, the Client used an orchestration engine to run batch pipelines that consolidated and processed data from Cloud Storage files and Pub/Sub messages and wrote the output to BigQuery. However, because most source data was uploaded at night, batch processing made it hard to meet SLAs and to provide the most recent information to store managers before the stores opened. Moreover, incorrect or missing data uploads required cumbersome database state reverts, involving a substantial amount of transactional data and logs. The Client’s internal team dedicated significant time to maintaining massive SQL queries or manually updating the database state, resulting in high maintenance costs.&lt;/p>
&lt;p>To address these issues, the Client sought OCTO&amp;rsquo;s expertise to transform their data ecosystem and migrate their core use case from batch to streaming. The objectives included faster data processing, ensuring the freshest data in the web app, simplifying pipeline and database maintenance, ensuring scalability and resilience, and efficiently handling spikes in data volumes.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The Client needed to very quickly consolidate and process a huge number of files in different formats from Cloud Storage and Pub/Sub events to have the freshest info about new products, promotions, etc. in their web app every day. For all that, Apache Beam was the perfect tool.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/octo/godefroy-clair.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Godefroy Clair
&lt;/div>
&lt;div class="case-study-quote-author-position">
Data Architect @ OCTO Technology
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam and its unified model emerged as the perfect solution, enabling both near-real-time streaming for the Client’s core transactional data processing and batch processing for their standalone use cases. It also offered the added benefit of autoscaling with the &lt;a href="/documentation/runners/dataflow/">Dataflow runner&lt;/a>. With Apache Beam&amp;rsquo;s &lt;a href="/documentation/sdks/python/">Python SDK&lt;/a> and the out-of-the-box &lt;a href="/documentation/io/connectors/">I/O connectors&lt;/a>, OCTO was able to reuse Python components between the existing and new batch and streaming pipelines, and leverage native optimized connectivity with Pub/Sub and Cloud Storage, expediting the migration.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/octo/scheme-14.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/octo/scheme-14.png" alt="streaming pipelines">
&lt;/a>
&lt;/div>
&lt;p>The streaming Apache Beam pipeline behind the Client’s web app now processes product and inventory data from Pub/Sub messages and Cloud Storage files of varying sizes - from several rows to 1.7 million rows - that arrive in Cloud Storage at various times, in unpredictable order, and in various formats (such as CSV, JSON, and zip files). Apache Beam&amp;rsquo;s &lt;a href="/blog/timely-processing/">timely processing&lt;/a> capabilities enable the streaming pipeline to handle that data efficiently. Its &lt;a href="/documentation/basics/#state-and-timers">timers&lt;/a> provide a way to control aggregations by waiting until all the necessary events and files come in and then processing them in the right order, while the &lt;a href="/documentation/transforms/python/aggregation/groupbykey/">GroupByKey&lt;/a> and &lt;a href="/documentation/transforms/python/aggregation/groupintobatches/">GroupIntoBatches&lt;/a> transforms allow for efficient grouping by key and for splitting the input into batches of the desired size. Every day, the Apache Beam pipeline consolidates, deduplicates, enriches, and outputs the data to &lt;a href="https://firebase.google.com/docs/firestore">Firestore&lt;/a> and &lt;a href="https://www.algolia.com/">Algolia&lt;/a>, processing over 100 million rows and consolidating hundreds of gigabytes of transactional data with over a terabyte of external state in less than 3 hours.&lt;/p>
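&lt;p>As a rough illustration of the per-key batching idea described above, the following plain-Python sketch mimics what a batching step does with keyed input. It is not the Beam API and not OCTO&amp;rsquo;s code: the function name batch_per_key and the sample store keys are hypothetical, and a real pipeline would rely on the GroupIntoBatches transform instead.&lt;/p>

```python
from collections import defaultdict


def batch_per_key(pairs, batch_size):
    """Yield (key, batch) tuples, each batch holding at most batch_size values."""
    buffers = defaultdict(list)
    for key, value in pairs:
        buffers[key].append(value)
        # Emit a full batch as soon as the buffer for this key reaches the limit.
        if len(buffers[key]) == batch_size:
            yield key, buffers.pop(key)
    # Flush the incomplete batches that remain once the input is exhausted.
    for key, values in buffers.items():
        if values:
            yield key, values


# Hypothetical keyed rows: (store key, row payload).
rows = [("store-1", "a"), ("store-2", "b"), ("store-1", "c"), ("store-1", "d")]
batches = list(batch_per_key(rows, batch_size=2))
```

&lt;p>Batching this way bounds the number of rows handled per downstream call, which matters when each batch triggers a write to an external store.&lt;/p>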
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The web app requires fresh data early in the morning before the stores open. Previously, handling the entirety of the Client’s data in time was not feasible. Thanks to Apache Beam, they can now process it within just 3 hours, ensuring data availability even if the input files arrive late at night.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/octo/leo-babonnaud.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Leo Babonnaud
&lt;/div>
&lt;div class="case-study-quote-author-position">
Data Scientist @ OCTO Technology
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Client’s specific use case posed unique challenges: the enrichment data was too large to keep in memory, and the unpredictable order and arrival time of the files made the timers and state API impractical. Unable to leverage Apache Beam&amp;rsquo;s native stateful processing, OCTO found a solution in externalizing the state of &lt;a href="/documentation/programming-guide/#pardo">DoFns&lt;/a> to a transactional &lt;a href="https://cloud.google.com/sql/docs/introduction">Cloud SQL&lt;/a> Postgres database. When processing new events and files, the Apache Beam pipeline uses streaming queries to select, insert, upsert, and delete rows with states in the Cloud SQL database. Apache Beam excels in complex state consolidation when processing files, Pub/Sub events, and logs representing the past, present, and future state of the records in the sink databases. If the incoming data is wrong and the sink data stores need to be reverted, Apache Beam processes a huge amount of logs about data movements that happened within a specific timeframe and consolidates them into states, eliminating manual efforts.&lt;/p>
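&lt;p>To make the idea of externalized state more concrete, here is a minimal sketch of an upsert that tolerates out-of-order events. It makes several assumptions: an in-memory SQLite database stands in for the Cloud SQL Postgres instance, and the table, columns, and helper name upsert_state are hypothetical rather than taken from OCTO&amp;rsquo;s pipeline. Postgres supports a similar INSERT ... ON CONFLICT ... DO UPDATE statement.&lt;/p>

```python
import sqlite3

# An in-memory SQLite database stands in for the Cloud SQL Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_state ("
    "product_id TEXT PRIMARY KEY, price REAL, updated_at INTEGER)"
)


def upsert_state(conn, product_id, price, updated_at):
    """Insert a new state row, or overwrite it only when the event is newer."""
    conn.execute(
        """
        INSERT INTO product_state (product_id, price, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(product_id) DO UPDATE SET
            price = excluded.price,
            updated_at = excluded.updated_at
        WHERE excluded.updated_at > product_state.updated_at
        """,
        (product_id, price, updated_at),
    )
    conn.commit()


# Events may arrive out of order; the stale one (updated_at=1) is ignored.
upsert_state(conn, "sku-42", 2.50, updated_at=2)
upsert_state(conn, "sku-42", 1.99, updated_at=1)
state = conn.execute("SELECT price, updated_at FROM product_state").fetchall()
```

&lt;p>Keeping the ordering rule inside the SQL statement means each worker can apply events as they arrive, without holding the full state in memory.&lt;/p>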
&lt;p>By leveraging Apache Beam, the Client has achieved a groundbreaking transformation in data processing, empowering their internal web app with fresh and historical data, enhancing overall operational efficiency, and meeting business requirements with improved processing latency.&lt;/p>
&lt;h2 id="custom-io-and-fine-grained-control-over-sql-connections">Custom I/O and fine-grained control over SQL connections&lt;/h2>
&lt;p>The Client’s specific use case demanded CRUD operations in a Cloud SQL database based on a value in a PCollection, and although the built-in &lt;a href="https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html">JDBC I/O connector&lt;/a> supported reading from and writing to a Cloud SQL database, it did not cater to such SQL operations. However, Apache Beam&amp;rsquo;s custom I/O frameworks open the door for &lt;a href="/documentation/io/developing-io-overview/">creating new connectors&lt;/a> tailored to complex use cases, offering the same connectivity as out-of-the-box I/Os. Capitalizing on this advantage and leveraging &lt;a href="/documentation/transforms/python/elementwise/pardo/">ParDo&lt;/a> and &lt;a href="/documentation/transforms/python/aggregation/groupbykey/">GroupByKey&lt;/a> transforms, OCTO successfully developed a new Apache Beam I/O. This custom I/O seamlessly interacts with a Cloud SQL database using the &lt;a href="https://pypi.org/project/cloud-sql-python-connector/">Cloud SQL Python Connector&lt;/a>, instantiating the latter as a connection object in the &lt;a href="https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.setup">DoFn.setup&lt;/a> method.&lt;/p>
&lt;p>Moreover, Apache Beam offered OCTO fine-grained control over parallelism, enabling them to maximize worker processes&amp;rsquo; efficiency. With the Dataflow runner&amp;rsquo;s potent parallelism and autoscaling capabilities, OCTO had to address the &lt;a href="https://cloud.google.com/sql/docs/quotas">constraints on the number of concurrent connections&lt;/a> imposed by Cloud SQL. To overcome this challenge, OCTO used the Apache Beam DoFn.setup method, in which the maximum number of concurrent operations can be defined. OCTO also leveraged the &lt;a href="https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.utils.shared.html">apache_beam.utils.shared&lt;/a> module to create a connection pool for the Cloud SQL database, effectively sharing it across all processes at the worker level.&lt;/p>
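&lt;p>The connection-limiting idea can be sketched in plain Python as a pool that hands out a fixed number of connections and makes callers wait when none is free. This is an illustration under stated assumptions, not OCTO&amp;rsquo;s implementation: the class name BoundedConnectionPool is hypothetical, SQLite stands in for Cloud SQL, and Beam itself is not used here. In a Beam pipeline, an object like this would typically be created once in DoFn.setup and reused for every element.&lt;/p>

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedConnectionPool:
    """Hand out at most max_connections connections; callers block when none is free."""

    def __init__(self, connect, max_connections):
        self._idle = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


# SQLite stands in for Cloud SQL; check_same_thread=False lets threads share connections.
pool = BoundedConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False),
    max_connections=2,
)
with pool.connection() as conn:
    result = conn.execute("SELECT 1 + 1").fetchone()[0]
```

&lt;p>Because the pool size is fixed up front, the total number of open connections is bounded by the pool size multiplied by the number of pools, regardless of how many threads request a connection.&lt;/p>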
&lt;p>OCTO&amp;rsquo;s data engineers showcased these innovative developments powered by Apache Beam at &lt;a href="https://beamsummit.org/sessions/2023/how-to-balance-power-and-control-when-using-dataflow-with-an-oltp-sql-database/">Beam Summit 2023&lt;/a>.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam enabled OCTO to revolutionize data processing of one of the most prominent French grocery retailers with a 5x optimization in infrastructure costs and a 4x increase in data processing performance. Apache Beam&amp;rsquo;s unified model and Python SDK proved instrumental in accelerating the migration from batch to streaming processing by providing the ability to reuse components, packages, and modules across pipelines.&lt;/p>
&lt;p>Apache Beam&amp;rsquo;s powerful transforms and robust streaming capabilities enabled the Client’s streaming pipeline to efficiently process over 100 million rows daily, consolidating transactional data with over a terabyte of external state in under 3 hours, a feat that was previously unattainable. The flexibility and extensibility of Apache Beam empowered OCTO to tackle use-case-specific technical constraints, achieving the perfect balance of power and control to align with the Client’s specific business objectives.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Oftentimes, I tell the clients that Apache Beam is the “Docker” of data processing. It's highly portable, runs seamlessly anywhere, unifies batch and streaming processing, and offers numerous out-of-the-box templates. Adopting Beam enables accelerated migration from batch to streaming, effortless pipeline reuse across contexts, and faster enablement of new use cases. The benefits and great performance of Apache Beam are a game-changer for many!
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/octo/godefroy-clair.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Godefroy Clair
&lt;/div>
&lt;div class="case-study-quote-author-position">
Data Architect @ OCTO Technology
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="learn-more">Learn More&lt;/h2>
&lt;iframe class="video video--medium-size" width="560" height="315" src="https://www.youtube.com/embed/TueDlUBJsQU" frameborder="0" allowfullscreen>&lt;/iframe>
&lt;br>&lt;br>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'OCTO')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'OCTO')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Revolutionizing Real-Time Stream Processing: 4 Trillion Events Daily at LinkedIn</title><link>/case-studies/linkedin/</link><pubDate>Thu, 10 Aug 2023 00:12:00 +0000</pubDate><guid>/case-studies/linkedin/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/linkedin.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Apache Beam empowers LinkedIn to create timely recommendations and personalized experiences by leveraging the freshest data and processing it in real-time, ultimately benefiting LinkedIn's vast network of over 950 million members worldwide.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/bingfeng-xia.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Bingfeng Xia
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="revolutionizing-real-time-stream-processing-4-trillion-events-daily-at-linkedin">Revolutionizing Real-Time Stream Processing: 4 Trillion Events Daily at LinkedIn&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers. This robust framework empowers near real-time data processing for critical services and platforms, ranging from machine learning and notifications to anti-abuse AI modeling. With over 950 million members, ensuring that our platform is running smoothly is critical to connecting members to opportunities worldwide.&lt;/p>
&lt;p>In this case study, LinkedIn&amp;rsquo;s Bingfeng Xia, Engineering Manager, and Xinyu Liu, Senior Staff Engineer, shed light on how the Apache Beam programming model&amp;rsquo;s unified, portable, and user-friendly data processing framework has enabled a multitude of sophisticated use cases and revolutionized stream processing at LinkedIn. This technology has &lt;a href="https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc">optimized cost-to-serve by 2x&lt;/a> by unifying stream and batch processing through Apache Samza and Apache Spark runners, enabled real-time ML feature generation, reduced time-to-production for new pipelines from months to days, allowed for processing time-series events at over 3 million queries per second, and more. For our members, this means that we’re able to serve more accurate job recommendations, improve feed recommendations, and identify fake profiles faster, among other benefits.&lt;/p>
&lt;h2 id="linkedin-open-source-ecosystem-and-journey-to-beam">LinkedIn Open-Source Ecosystem and Journey to Beam&lt;/h2>
&lt;p>LinkedIn has a rich history of actively contributing to the open-source community, demonstrating its commitment by creating, managing, and utilizing various open-source software projects. The LinkedIn engineering team has &lt;a href="https://engineering.linkedin.com/content/engineering/en-us/open-source">open-sourced over 75 projects&lt;/a> across multiple categories, with several gaining widespread adoption and becoming part of &lt;a href="https://www.apache.org/">the Apache Software Foundation&lt;/a>.&lt;/p>
&lt;p>To enable the ingestion and real-time processing of enormous volumes of data, LinkedIn built a custom stream processing ecosystem largely with tools developed in-house (and subsequently open-sourced). In 2010, they introduced &lt;a href="https://kafka.apache.org/">Apache Kafka&lt;/a>, a pivotal Big Data ingestion backbone for LinkedIn’s real-time infrastructure. To transition from batch-oriented processing and respond to Kafka events within minutes or seconds, they built an in-house distributed event streaming framework, &lt;a href="https://samza.apache.org/">Apache Samza&lt;/a>. This framework, along with Apache Spark for batch processing, formed the basis of LinkedIn’s &lt;a href="https://en.wikipedia.org/wiki/Lambda_architecture">lambda architecture&lt;/a> for data processing jobs. Over time, LinkedIn&amp;rsquo;s engineering team expanded the stream processing ecosystem with more proprietary tools like &lt;a href="https://github.com/linkedin/Brooklin/">Brooklin&lt;/a>, facilitating data streaming across multiple stores and messaging systems, and &lt;a href="https://github.com/linkedin/venice">Venice&lt;/a>, serving as a storage system for ingesting batch and stream processing job outputs, among others.&lt;/p>
&lt;p>Though the stream processing ecosystem with Apache Samza at its core enabled large-scale stateful data processing, LinkedIn’s ever-evolving demands required higher scalability and efficiency, as well as lower latency for the streaming pipelines. The lambda architecture approach led to operational complexity and inefficiencies, because it required maintaining two different codebases and two different engines for batch and streaming data. To address these challenges, data engineers sought a higher level of stream processing abstraction and out-of-the-box support for advanced aggregations and transformations. Additionally, they needed the ability to experiment with streaming pipelines in batch mode. There was also a growing need for multi-language support within the overall Java-prevalent teams due to emerging machine learning use cases requiring Python.&lt;/p>
&lt;p>The release of &lt;a href="/about/">Apache Beam&lt;/a> in 2016 proved to be a game-changer for LinkedIn. Apache Beam offers an open-source, advanced unified programming model for both batch and stream processing, making it possible to create a large-scale common data infrastructure across various applications. With support for Python, Go, and Java SDKs and a rich, versatile API layer, Apache Beam provided the ideal solution for building sophisticated multi-language pipelines and running them on any engine.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
When we started looking at Apache Beam, we realized it was a very attractive data processing framework for LinkedIn’s demands: not only does it provide an advanced API, but it also allows for converging stream and batch processing and multi-language support. Everything we were looking for and out-of-the-box.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/xinyu-liu.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Xinyu Liu
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Staff Engineer @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Recognizing the advantages of Apache Beam&amp;rsquo;s unified data processing API, advanced capabilities, and multi-language support, LinkedIn began onboarding its first use cases and developed the &lt;a href="/documentation/runners/samza/">Apache Samza runner for Beam&lt;/a> in 2018. By 2019, Apache Beam pipelines were powering several critical use cases, and the programming model and framework saw extensive adoption across LinkedIn teams. Xinyu Liu showcased the benefits of migrating to Apache Beam pipelines during &lt;a href="https://www.youtube.com/watch?v=uQcpr34RUKY&amp;amp;t=1694s">Beam Summit Europe 2019&lt;/a>.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-1.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-1.png" alt="scheme">
&lt;/a>
&lt;/div>
&lt;h2 id="apache-beam-use-cases-at-linkedin">Apache Beam Use Cases at LinkedIn&lt;/h2>
&lt;h3 id="unified-streaming-and-batch-pipelines">Unified Streaming And Batch Pipelines&lt;/h3>
&lt;p>Some of the first use cases that LinkedIn migrated to Apache Beam pipelines involved both real-time computations and periodic backfilling. One example was LinkedIn&amp;rsquo;s standardization process. Standardization consists of a series of pipelines that use complex AI models to map LinkedIn user inputs, such as job titles, skills, or education history, into predefined internal IDs. For example, a LinkedIn member who lists their current position as &amp;ldquo;Chief Data Scientist&amp;rdquo; has their job title standardized for relevant job recommendations.&lt;/p>
&lt;p>LinkedIn&amp;rsquo;s standardization process requires both real-time processing to reflect immediate user updates and periodic backfilling to refresh data when new AI models are introduced. Before adopting Apache Beam, running backfilling as a streaming job required over 5,000 GB-hours in memory and nearly 4,000 hours in total CPU time. This heavy load led to extended backfilling times and scaling issues, causing the backfilling pipeline to act as a &amp;ldquo;noisy neighbor&amp;rdquo; to colocated streaming pipelines and failing to meet latency and throughput requirements. Although LinkedIn engineers considered migrating the backfilling logic to a batch Spark pipeline, they abandoned the idea due to the unnecessary overhead of maintaining two different codebases.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
We came to the question: is it possible to only maintain one codebase but with the ability to run it as either a batch job or streaming job? The unified Apache Beam model was the solution.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/bingfeng-xia.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Bingfeng Xia
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Apache Beam APIs enabled LinkedIn engineers to implement business logic once within a unified Apache Beam pipeline that efficiently handles both real-time standardization and backfilling. Apache Beam offers &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/options/PipelineOptions.html">PipelineOptions&lt;/a>, enabling the configuration and customization of various aspects, such as the pipeline runner and runner-specific configurations. The extensibility of Apache Beam transforms allowed LinkedIn to &lt;a href="https://beam.apache.org/documentation/programming-guide/#composite-transforms">create a custom composite transform&lt;/a> to abstract away I/O differences and switch target processing on the fly based on data source type (bounded or unbounded). In addition, Apache Beam’s abstraction of the underlying infrastructure and the ability to &amp;ldquo;write once, run anywhere&amp;rdquo; empowered LinkedIn to seamlessly switch between data processing engines. Depending on the target processing type (streaming or batch), the unified Apache Beam standardization pipeline can be deployed through the Samza cluster as a streaming job or through the Spark cluster as a batch backfilling job.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-2.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-2.png" alt="scheme">
&lt;/a>
&lt;/div>
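&lt;p>The &amp;ldquo;implement the business logic once&amp;rdquo; idea can be illustrated with a plain-Python analogy. This is only a sketch and not LinkedIn&amp;rsquo;s code or the Beam API: the function names, the title-to-ID mapping, and the bounded flag are all hypothetical. The same standardize function is applied whether the source is a finite collection (batch) or a lazily consumed iterator (streaming).&lt;/p>

```python
def standardize(record):
    """Business logic written once: map a free-text job title to an internal ID."""
    title_ids = {"chief data scientist": 101, "data engineer": 102}  # made-up IDs
    key = record["title"].strip().lower()
    return {"member": record["member"], "title_id": title_ids.get(key)}


def run(source, bounded):
    """Apply the same logic to a bounded (batch) or unbounded (streaming) source."""
    results = (standardize(record) for record in source)
    # A bounded source can be fully materialized; an unbounded one stays lazy.
    return list(results) if bounded else results


profiles = [{"member": 1, "title": " Chief Data Scientist "}]
batch_output = run(profiles, bounded=True)
stream_output = next(run(iter(profiles), bounded=False))
```

&lt;p>Only the run wrapper knows about the source type, so a change to the standardization rule is made in one place and picked up by both modes.&lt;/p>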
&lt;p>Hundreds of streaming Apache Beam jobs now power real-time standardization, listening to events 24/7, enriching streams with additional data from remote tables, performing necessary processing, and writing results to output databases. The batch Apache Beam backfilling job runs weekly, effectively handling 950 million member profiles at a rate of over 40,000 profiles per second. The Apache Beam pipelines feed data points into sophisticated AI and machine learning models for inference and join complex data such as job types and work experiences, thus standardizing user data for search indexing or for running recommendation models.&lt;/p>
&lt;p>The migration of backfilling logic to a unified Apache Beam pipeline and its execution in batch mode resulted in a significant 50% improvement in memory and CPU usage efficiency (from ~5000 GB-hours and ~4000 CPU hours to ~2000 GB-hours and ~1700 CPU hours) and a 94% reduction in processing time (from 7.5 hours to 25 minutes). More details about this use case can be found on &lt;a href="https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc">LinkedIn’s engineering blog&lt;/a>.&lt;/p>
&lt;h3 id="anti-abuse--near-real-time-ai-modeling">Anti-Abuse &amp;amp; Near Real-Time AI Modeling&lt;/h3>
&lt;p>LinkedIn is firmly committed to creating a trusted environment for its members, and this dedication extends to safeguarding against various types of abuse on the platform. To achieve this, the Anti-Abuse AI Team at LinkedIn plays a crucial role in creating, deploying, and maintaining AI and deep learning models that can detect and prevent different forms of abuse, such as fake account creation, member profile scraping, automated spam, and account takeovers.&lt;/p>
&lt;p>Apache Beam fortifies LinkedIn’s internal anti-abuse platform, Chronos, enabling abuse detection and prevention in near real-time. Chronos relies on two streaming Apache Beam pipelines: the Filter pipeline and the Model pipeline. The Filter pipeline reads user activity events from Kafka, extracts relevant fields, aggregates and filters the events, and then generates filtered Kafka messages for downstream AI processing. Subsequently, the Model pipeline consumes these filtered messages, aggregates member activity within specific time windows, triggers AI scoring models, and writes the resulting abuse scores to various internal applications, services, and stores for offline processing.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-3.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-3.png" alt="scheme">
&lt;/a>
&lt;/div>
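&lt;p>The Model pipeline&amp;rsquo;s step of aggregating member activity within time windows before scoring can be sketched in plain Python as follows. The sketch makes simplifying assumptions that are not from the source: events are (member ID, timestamp in seconds) pairs, windows are fixed-size, and a simple count threshold stands in for the AI scoring models. All names are hypothetical.&lt;/p>

```python
from collections import Counter


def count_per_window(events, window_seconds):
    """Count events per (member, fixed window start) pair."""
    counts = Counter()
    for member_id, timestamp in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(member_id, window_start)] += 1
    return counts


def flag_suspicious(counts, threshold):
    """Stand-in for the scoring model: flag windows with at least threshold events."""
    return sorted(key for key, n in counts.items() if n >= threshold)


# Hypothetical activity events: (member ID, event time in seconds).
events = [("m1", 3), ("m1", 10), ("m1", 59), ("m2", 20), ("m1", 61)]
flagged = flag_suspicious(count_per_window(events, window_seconds=60), threshold=3)
```

&lt;p>Here member m1 has three events in the window starting at 0 and is flagged, while its single event in the next window and the activity of m2 stay below the threshold.&lt;/p>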
&lt;p>The flexibility of Apache Beam&amp;rsquo;s pluggable architecture and the availability of various I/O options seamlessly integrated the anti-abuse pipelines with Kafka and key-value stores. LinkedIn has dramatically reduced the time it takes to label abusive actions, cutting it down from 1 day to just 5 minutes and processing time-series events at an impressive rate of over 3 million queries per second. Apache Beam empowered near real-time processing, significantly bolstering LinkedIn&amp;rsquo;s anti-abuse defenses. The nearline defenses can catch scrapers within minutes after they start to scrape, which has led to a more than 6% improvement in detecting logged-in scraping profiles.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam enabled revolutionary, phenomenal performance improvements - the anti-abuse processing accelerated from 1 day to 5 minutes. We have seen more than 6% improvement in detecting logged-in scraping profiles.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/xinyu-liu.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Xinyu Liu
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Staff Engineer @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h3 id="notifications-platform">Notifications Platform&lt;/h3>
&lt;p>As a social media network, LinkedIn heavily relies on instant notifications to drive member engagement. To achieve this, Apache Beam and Apache Samza together power LinkedIn’s large-scale Notifications Platform that generates notification content, pinpoints the target audience, and ensures the timely and relevant distribution of content.&lt;/p>
&lt;p>The streaming Apache Beam pipelines have intricate business logic and handle enormous volumes of data in a near real-time fashion. The pipelines consume, aggregate, partition, and process events from over 950 million LinkedIn members and feed the data to downstream machine learning models. The ML models perform distributed targeting and scalable scoring on the order of millions of candidate notifications per second based on the recipient member’s historical actions and make personalized decisions for the recipient for each notification on the fly. As a result, LinkedIn members receive timely, relevant, and actionable activity-based notifications, such as connection invites, job recommendations, daily news digests, and other activities within their social network, through the right channels.&lt;/p>
&lt;p>The advanced Apache Beam API offers complex aggregation and filtering capabilities out-of-the-box, and its programming model allows for the creation of reusable components. These features enable LinkedIn to expedite development and streamline the scaling of the Notifications platform as they transition more notification use cases from Samza to Beam pipelines.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
LinkedIn’s user engagement is greatly driven by how timely we can send relevant notifications. Apache Beam enabled a scalable, near real-time infrastructure behind this business-critical use case.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/bingfeng-xia.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Bingfeng Xia
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h3 id="real-time-ml-feature-generation">Real-Time ML Feature Generation&lt;/h3>
&lt;p>LinkedIn&amp;rsquo;s core functionalities, such as job recommendations and search feed, heavily rely on ML models that consume thousands of features related to various entities like companies, job postings, and members. However, before the adoption of Apache Beam, the original offline ML feature generation pipeline suffered from a delay of 24 to 48 hours between member actions and the impact of those actions on the recommendation system. This delay resulted in missed opportunities, because the system lacked sufficient data about infrequent members and failed to capture the short-term intent and preferences of frequent members. In response to the growing demand for a scalable, real-time ML feature generation platform, LinkedIn turned to Apache Beam to address the challenge.&lt;/p>
&lt;p>Using Managed Beam as the foundation, LinkedIn developed a hosted platform for ML feature generation. The ML platform provides AI engineers with real-time features and an efficient pipeline authoring experience, all while abstracting away deployment and operational complexities. AI engineers create feature definitions and deploy them using Managed Beam. When LinkedIn members take actions on the platform, the streaming Apache Beam pipeline generates fresher machine learning features by filtering, processing, and aggregating the events emitted to Kafka in real-time and writes them to the feature store. Additionally, LinkedIn introduced other Apache Beam pipelines responsible for retrieving the data from the feature store, processing it, and feeding it into the recommendation system.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-4.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-4.png" alt="scheme">
&lt;/a>
&lt;/div>
&lt;p>The powerful Apache Beam Stream Processing platform played a pivotal role in eliminating the delay between member actions and data availability, achieving an impressive end-to-end pipeline latency of just a few seconds. This significant improvement allowed LinkedIn&amp;rsquo;s ML models to take advantage of up-to-date information and deliver more personalized and timely recommendations to our members, leading to significant gains in business metrics.&lt;/p>
&lt;h3 id="managed-stream-processing-platform">Managed Stream Processing Platform&lt;/h3>
&lt;p>As LinkedIn&amp;rsquo;s data infrastructure grew to encompass over 3,000 Apache Beam pipelines, catering to a diverse range of business use cases, LinkedIn&amp;rsquo;s AI and data engineering teams found themselves overwhelmed with managing these streaming applications 24/7. The AI engineers encountered several technical challenges while creating new pipelines, including the intricacy of integrating multiple streaming tools and infrastructures into their frameworks, and limited knowledge of the underlying infrastructure when it came to deployment, monitoring, and operations. These challenges led to a time-consuming pipeline development cycle, often lasting one to two months. Apache Beam enabled LinkedIn to create Managed Beam, a managed Stream Processing platform that is designed to streamline and automate internal processes. This platform makes it easier and faster for teams to develop and operate sophisticated streaming applications while reducing the burden of on-call support.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-5.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-5.png" alt="scheme">
&lt;/a>
&lt;/div>
&lt;p>The Apache Beam SDK empowered LinkedIn engineers to create custom workflow components as reusable sub-DAGs (Directed Acyclic Graphs) and expose them as standard PTransforms. These PTransforms serve as ready-to-use building blocks for new pipelines, significantly speeding up the authoring and testing process for LinkedIn AI engineers. By abstracting the low-level details of underlying engines and runtime environments, Apache Beam allows engineers to focus solely on business logic, further accelerating time to development.&lt;/p>
&lt;p>When the pipelines are ready for deployment, Managed Beam&amp;rsquo;s central control plane comes into play, providing essential features like a deployment UI, operational dashboard, administrative tools, and automated pipeline lifecycle management.&lt;/p>
&lt;p>Apache Beam&amp;rsquo;s abstraction facilitated the isolation of user code from framework evolution during build, deployment, and runtime. To ensure the separation of runner processes from user-defined functions (UDFs), Managed Beam packages the pipeline business logic and the framework logic as two separate JAR files: framework-less artifacts and framework artifacts. During pipeline execution on a YARN cluster, these pipeline artifacts run in a Samza container as two distinct processes, communicating through gRPC. This setup enabled LinkedIn to take advantage of automated framework upgrades, scalable UDF execution, log separation for easier troubleshooting, and multi-language APIs, fostering flexibility and efficiency.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/linkedin/scheme-6.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/linkedin/scheme-6.png" alt="scheme">
&lt;/a>
&lt;/div>
&lt;p>Apache Beam also underpinned Managed Beam&amp;rsquo;s autosizing controller tool, which automates hardware resource tuning and provides auto-remediation for streaming pipelines. Streaming Apache Beam pipelines self-report diagnostic information, such as metrics and key deployment logs, in the form of Kafka topics. Additionally, LinkedIn&amp;rsquo;s internal monitoring tools report runtime errors, such as heartbeat failures, out-of-memory events, and processing lags. The Apache Beam diagnostics processor pipeline aggregates, repartitions, and windows these diagnostic events before passing them to the autosizing controller and writing them to Apache Pinot, LinkedIn&amp;rsquo;s OLAP store for Managed Beam&amp;rsquo;s operational and analytics dashboards. Based on the pre-processed and time-windowed diagnostic data, the autosizing controller generates sizing actions or restarting actions, and then forwards them to the Managed Beam control plane. The Managed Beam control plane then scales LinkedIn&amp;rsquo;s streaming applications and clusters.&lt;/p>
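&lt;p>The decision step of such a controller could look roughly like the sketch below. The field names, thresholds, and action names are hypothetical and are not taken from Managed Beam; the sketch only shows how one window of pre-aggregated diagnostics might be mapped to a sizing or restarting action.&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class WindowedDiagnostics:
    """One time window of aggregated diagnostics for a pipeline (hypothetical)."""
    pipeline: str
    avg_cpu_utilization: float  # 0.0 to 1.0, averaged over the window
    max_lag_seconds: float      # worst processing lag seen in the window
    oom_events: int             # out-of-memory events in the window
    heartbeat_failures: int     # missed heartbeats in the window


def decide_action(diag):
    """Map one window of diagnostics to a hypothetical control-plane action."""
    if diag.heartbeat_failures > 0:
        return ("restart", diag.pipeline)
    if diag.oom_events > 0:
        return ("increase_memory", diag.pipeline)
    if diag.max_lag_seconds > 60 or diag.avg_cpu_utilization > 0.8:
        return ("scale_up", diag.pipeline)
    if 0.2 > diag.avg_cpu_utilization and diag.max_lag_seconds == 0:
        return ("scale_down", diag.pipeline)
    return ("no_op", diag.pipeline)


lagging = WindowedDiagnostics("notifications", 0.55, 120.0, 0, 0)
idle = WindowedDiagnostics("digest", 0.05, 0.0, 0, 0)
crashed = WindowedDiagnostics("features", 0.40, 10.0, 0, 2)
actions = [decide_action(d) for d in (lagging, idle, crashed)]
```

&lt;p>Ordering the checks from most to least severe means that a failing pipeline is restarted before any resizing is attempted.&lt;/p>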
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam helped streamline operations management and enabled fully-automated autoscaling, significantly reducing the time to onboard new applications. Previously, onboarding required a lot of manual 'trial and error' iterations and deep knowledge of the internal system and metrics.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/bingfeng-xia.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Bingfeng Xia
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The extensibility, pluggability, portability, and abstraction of Apache Beam formed the backbone of LinkedIn&amp;rsquo;s Managed Beam platform. The Managed Beam platform accelerated the time to author, test, and stabilize streaming pipelines from months to days, facilitated fast experimentation, and almost entirely eliminated operational costs for AI engineers.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Apache Beam played a pivotal role in revolutionizing and scaling LinkedIn&amp;rsquo;s data infrastructure. Beam&amp;rsquo;s powerful streaming capabilities enable real-time processing for critical business use cases, at a scale of over 4 trillion events daily through more than 3,000 pipelines.&lt;/p>
&lt;p>The versatility of Apache Beam empowered LinkedIn’s engineering teams to optimize their data processing for various business use cases:&lt;/p>
&lt;ul>
&lt;li>Apache Beam&amp;rsquo;s unified and portable framework allowed LinkedIn to consolidate streaming and batch processing into unified pipelines. These unified pipelines resulted in a 2x optimization in cost-to-serve, a 2x improvement in processing performance, and a 2x improvement in memory and CPU usage efficiency.&lt;/li>
&lt;li>LinkedIn&amp;rsquo;s anti-abuse platform leveraged Apache Beam to process user activity events from Kafka in near real-time, achieving a remarkable acceleration from days to minutes in labeling abusive actions. The nearline defenses are able to catch scrapers within minutes after they start to scrape, which leads to a more than 6% improvement in detecting logged-in scraping profiles.&lt;/li>
&lt;li>By adopting Apache Beam, LinkedIn was able to transition from an offline ML feature generation pipeline with a 24- to 48-hour delay to a real-time platform with an end-to-end pipeline latency at the millisecond or second level.&lt;/li>
&lt;li>Apache Beam’s abstraction and powerful programming model enabled LinkedIn to create a fully managed stream processing platform, thus facilitating easier authoring, testing, and deployment and accelerating time-to-production for new pipelines from months to days.&lt;/li>
&lt;/ul>
&lt;p>Apache Beam boasts seamless plug-and-play capabilities, integrating smoothly with Apache Kafka, Apache Pinot, and other core technologies at LinkedIn, all while ensuring optimal performance at scale. As LinkedIn continues experimenting with new engines and tooling, the Apache Beam portability future-proofs our ecosystem against any changes in the underlying infrastructure.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
By enabling a scalable, near real-time infrastructure behind business-critical use cases, Apache Beam empowers LinkedIn to leverage the freshest data and process it in real-time to create timely recommendations and personalized experiences, ultimately benefiting LinkedIn's vast network of over 950 million members worldwide.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/linkedin/xinyu-liu.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Xinyu Liu
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Staff Engineer @LinkedIn
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: High-Performance Quantitative Risk Analysis with Apache Beam at HSBC</title><link>/case-studies/hsbc/</link><pubDate>Tue, 20 Jun 2023 00:12:00 +0000</pubDate><guid>/case-studies/hsbc/</guid><description>
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/hsbc.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Our data volume is huge and Apache Beam helps make it manageable. For each of the millions of trades per day, we generate a simulation of their market valuation evolution at granular level of future time increment, then by scanning multiple plausible market scenarios we aggregate this massive data into meaningful statistics. Apache Beam helps harness all of that data and makes data distribution much easier than before.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/hsbc/andrzej_golonka.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Andrzej Golonka
&lt;/div>
&lt;div class="case-study-quote-author-position">
Lead Assistant Vice President @ HSBC
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“The Apache Beam Python SDK brought math to the orchestration level by providing an easy way for HSBC’s model development team to emulate sophisticated mathematical dependencies between nodes of business logic workflow in pipelines written in Python. We used to spend at least 6 months deploying even small changes to a system of equations in production. With the new team structure driven by Apache Beam, we now can deploy changes as quickly as within 1 week.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/hsbc/chup_cheng.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Chup Cheng
&lt;/div>
&lt;div class="case-study-quote-author-position">
VP of XVA and CCR Capital Analytics @ HSBC
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="high-performance-quantitative-risk-analysis-with-apache-beam-at-hsbc">High-Performance Quantitative Risk Analysis with Apache Beam at HSBC&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.hsbc.com/">HSBC Holdings plc&lt;/a>, the parent company of HSBC, is headquartered in London. HSBC serves customers worldwide from offices in 62 countries and territories. With assets of $2,990bn as of 31 March 2023, HSBC is one of the world’s largest banking and financial services organisations. HSBC’s &lt;a href="https://www.gbm.hsbc.com/">Global Banking and Markets business&lt;/a> provides a wide range of financial services and products to multinational corporates, governments, and financial institutions.&lt;/p>
&lt;p>HSBC’s Chup Cheng, VP of XVA and CCR Capital Analytics, and Andrzej Golonka, Lead Assistant Vice President, shared how Apache Beam serves as a computation platform and a risk engine that helps HSBC manage counterparty credit risk and XVA across their customer investment portfolios, which arises from the trillions of dollars in trading volumes between numerous counterparties every day. Apache Beam empowered HSBC to &lt;a href="https://2022.beamsummit.org/sessions/hpc-grid/">integrate their existing C++ HPC workloads into batch Apache Beam pipelines&lt;/a>, to streamline data distribution, and to improve processing performance. Apache Beam also enabled new pipelines that were not possible before and enhanced development efficiency.&lt;/p>
&lt;h2 id="risk-management-at-scale">Risk Management at Scale&lt;/h2>
&lt;p>To appreciate the scale and value of the Apache Beam-powered data processing at HSBC, let us delve deeper into why counterparty credit risk calculations at financial institutions require particularly extreme compute capacity.&lt;/p>
&lt;p>The value of an investment portfolio moves along with the financial markets and is impacted by a variety of external factors. To neutralize the risks and determine the capital reserves required by regulations, investment banks need to estimate risk exposure and make corresponding adjustments to the value of individual trades and portfolios. The &lt;a href="https://en.wikipedia.org/wiki/XVA">XVA (X-Value Adjustment) model&lt;/a> plays a crucial role in analyzing counterparty credit risk in the financial industry and covers different valuation adjustments, such as &lt;a href="https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/credit-valuation-adjustment-cva/">Credit Valuation Adjustment (CVA)&lt;/a>, &lt;a href="https://finpricing.com/lib/cvaFva.html">Funding Valuation Adjustment (FVA)&lt;/a>, and &lt;a href="https://www.risk.net/definition/capital-valuation-adjustment-kva#:~:text=Capital%20valuation%20adjustment%20reflects%20the,trades%20that%20are%20not%20cleared.">Capital Valuation Adjustment (KVA)&lt;/a>. Calculating XVA involves complex models, multi-layered matrices, and &lt;a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulations&lt;/a> to account for risks based on plausible future scenarios. Valuation functions process multiple &lt;a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">stochastic differential equation (SDE)&lt;/a> matrices that represent probable trade values over a timeframe of possibly up to 70 years, and then output &lt;a href="https://en.wikipedia.org/wiki/Mark-to-market_accounting">MTM (mark-to-market)&lt;/a> matrices that represent the distribution of the current market value of a financial asset depending on future scenarios. Collapsed MTM matrices determine the vector of future risk exposure and the required XVA adjustment.&lt;/p>
&lt;p>XVA calculations require a vast amount of computational capacity due to the extensive matrix data and long time horizons involved. To calculate the MTM matrix for one trade, a valuation function needs to iterate hundreds of thousands of times through multiple SDE matrices that weigh a few megabytes and contain hundreds of thousands of elements each.&lt;/p>
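&lt;p>To make the shape of this computation concrete, the following toy sketch simulates a scenario matrix, values a single trade on it to obtain an MTM matrix, and collapses that matrix into a vector of expected positive exposure per time step. It is an illustration under simplifying assumptions, not HSBC&amp;rsquo;s model: the geometric Brownian motion dynamics, the forward-like payoff, and all parameter values are hypothetical.&lt;/p>

```python
import math
import random


def simulate_paths(spot, drift, vol, n_steps, n_paths, dt, seed=0):
    """Simulate geometric Brownian motion: n_paths rows of n_steps values."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        value, row = spot, []
        for _ in range(n_steps):
            shock = rng.gauss(0.0, 1.0)
            value *= math.exp((drift - 0.5 * vol ** 2) * dt + vol * math.sqrt(dt) * shock)
            row.append(value)
        paths.append(row)
    return paths


def mtm_matrix(paths, strike):
    """Toy valuation function: MTM of a forward-like trade is underlying minus strike."""
    return [[value - strike for value in row] for row in paths]


def expected_positive_exposure(mtm):
    """Collapse the MTM matrix over scenarios into one exposure per time step."""
    n_paths = len(mtm)
    n_steps = len(mtm[0])
    return [sum(max(row[t], 0.0) for row in mtm) / n_paths for t in range(n_steps)]


# 1,000 scenarios over 12 monthly steps (hypothetical sizes).
paths = simulate_paths(spot=100.0, drift=0.02, vol=0.2, n_steps=12, n_paths=1000, dt=1 / 12)
mtm = mtm_matrix(paths, strike=100.0)
exposure = expected_positive_exposure(mtm)
```

&lt;p>Even this toy version touches every element of the scenario matrix once per trade, which hints at why the cost grows so quickly with the number of trades, scenarios, and time steps.&lt;/p>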
&lt;div class="post-scheme">
&lt;a href="/images/case-study/hsbc/scheme-9.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/hsbc/scheme-9.png" alt="Monte Carlo Path">
&lt;/a>
&lt;/div>
&lt;p>Calculating XVA on a multi-counterparty portfolio involves much more complex computations in a large system of equations. A valuation function goes through hundreds of GBs of SDE matrices, generates millions of trade-level MTM matrices, aggregates them to counterparty-level matrices, and then calculates the future exposure and XVA for each counterparty.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/hsbc/scheme-10.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/hsbc/scheme-10.png" alt="Calculating XVA">
&lt;/a>
&lt;/div>
&lt;p>The technical challenge escalates when dealing with XVA sensitivities to numerous market factors. To neutralize all market risks across counterparty portfolios, investment banks need to calculate XVA sensitivity for hundreds of market factors. There are two primary ways to compute XVA sensitivity:&lt;/p>
&lt;ol>
&lt;li>analytically through backpropagation to input&lt;/li>
&lt;li>numerically through observing how gradients move for XVA&lt;/li>
&lt;/ol>
&lt;p>To obtain XVA variance, a valuation function must iterate hundreds of billions of times through the enormous system of equations, which is an extremely compute-intensive process.&lt;/p>
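&lt;p>One common way to realize the numerical approach is bump-and-revalue with central finite differences, sketched below. This is an assumption about how such a calculation can be organized, not a description of HSBC&amp;rsquo;s implementation; the valuation function and the factor names are hypothetical stand-ins.&lt;/p>

```python
def toy_xva(factors):
    """Hypothetical stand-in for a full XVA valuation over market factors."""
    return 0.5 * factors["rate"] ** 2 + 3.0 * factors["fx"] + factors["rate"] * factors["fx"]


def bump_and_revalue(valuation, factors, bump=1e-4):
    """Central finite-difference sensitivity of the valuation to each factor."""
    sensitivities = {}
    for name in factors:
        up = dict(factors, **{name: factors[name] + bump})
        down = dict(factors, **{name: factors[name] - bump})
        sensitivities[name] = (valuation(up) - valuation(down)) / (2 * bump)
    return sensitivities


factors = {"rate": 2.0, "fx": 1.5}
sens = bump_and_revalue(toy_xva, factors)
```

&lt;p>Each factor costs two full revaluations in this scheme, so hundreds of market factors multiply an already expensive valuation by a factor of several hundred.&lt;/p>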
&lt;p>The XVA model is indispensable for understanding counterparty credit risk in the financial industry, and its accurate computation is vital for assessing the all-in price of derivatives. With an extensive amount of data and complex calculations involved, efficient and timely execution of these calculations is essential for ensuring that traders can make well-informed decisions.&lt;/p>
&lt;h2 id="journey-to-beam">Journey to Beam&lt;/h2>
&lt;p>NOLA is HSBC’s internal data infrastructure for XVA computations. Its previous version - NOLA1 - was an on-premise solution that used a 10 TB file server as media, processing several petabytes of data in a single batch, going through a huge network of interdependencies within each system of equations, then repeating the process. HSBC’s model development team was creating new statistical models and building the Quantitative library while their IT team was fetching the necessary data to the library, and both teams were working together to lay out orchestration between systems of equations.&lt;/p>
&lt;p>The 2007-2008 financial crisis highlighted the need for robust and efficient computation of XVAs across the industry, and led to additional regulations in the financial sector that required a vastly larger number of computations. HSBC, therefore, sought a numerical solution for calculating the XVA sensitivities for hundreds of market factors. Processing data in a single batch had become a blocker and a throughput bottleneck. The NOLA1 infrastructure and its intensive I/O utilization were not conducive to scaling at the time.&lt;/p>
&lt;p>HSBC engineers started looking for a new approach that would allow for scaling their data processing, maximizing throughput, and meeting critical business timelines.&lt;/p>
&lt;p>Then, HSBC’s engineering team selected Apache Beam as a risk engine for NOLA for its scalability, flexibility, and ability to process large volumes of data in parallel. They found Apache Beam to be a natural process executor for the transformational, directed acyclic graphing process of XVA computations. The &lt;a href="https://beam.apache.org/documentation/sdks/python/">Apache Beam Python SDK&lt;/a> offered a simple API to build new data pipelines in Python, while its abstraction presented a way to &lt;a href="https://cloud.google.com/dataflow/docs/hpc-ep">reuse the prevalent analytics in C++&lt;/a>. The variety of Apache Beam runners offered portability, and HSBC’s engineers built the new version of their data infrastructure - NOLA2 - with Apache Beam pipelines running on &lt;a href="/documentation/runners/flink/">Apache Flink&lt;/a> and &lt;a href="/documentation/runners/dataflow/">Cloud Dataflow&lt;/a>.&lt;/p>
&lt;h2 id="data-distribution-easier-than-before">Data Distribution Easier than Before&lt;/h2>
&lt;p>Apache Beam has greatly simplified data distribution for XVA calculations and allowed for handling the inter-correlated Monte Carlo simulations with distributed processing between workers.&lt;/p>
&lt;p>The Apache Beam SDK enables users to build expressive DAGs and easily create stream or batch multi-stage pipelines in which parallel pipelined stages can be brought back together using &lt;a href="/documentation/programming-guide/#side-inputs">side inputs&lt;/a> or &lt;a href="/documentation/pipelines/design-your-pipeline/#merging-pcollections">joins&lt;/a>. Data movement is handled by the runner, with data expressed as PCollection objects, which are immutable parallel element collections.&lt;/p>
&lt;p>Apache Beam provides several methods for distributing C++ components:&lt;/p>
&lt;ul>
&lt;li>sideloading C++ components to custom worker container images (for example, custom Apache Beam or Cloud Dataflow containers) and then using DoFns to interact with C++ components out-of-the-box&lt;/li>
&lt;li>bundling C++ with a JAR file in Apache Beam, where the C++ elements (binaries, configuration, etc.) are extracted to the local disk during the setup/teardown process in DoFn&lt;/li>
&lt;li>including the C++ components in a &lt;a href="/documentation/programming-guide/#pcollections">PCollection&lt;/a> as a side input, which is then deployed to the local disk&lt;/li>
&lt;/ul>
&lt;p>The seamless integration of Apache Beam with C++ allowed HSBC’s engineers to reuse the prevalent analytics (relying on &lt;a href="https://www.nag.com/">NAG&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Math_Kernel_Library">MKL&lt;/a> libraries) and select between the logic distribution methods depending on the use case and deployment environment. HSBC found protobufs especially useful for data exchange when PCollections carry the calls and input data to the C++ libraries from Java or Python pipelines. Protobuf data can be shared over disk, network, or directly with the usage of tools like &lt;a href="https://github.com/pybind/pybind11">pybind11&lt;/a>.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/hsbc/scheme-11.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/hsbc/scheme-11.png" alt="beam DAG">
&lt;/a>
&lt;/div>
&lt;p>HSBC migrated their XVA calculations to a batch Apache Beam pipeline. Every day, the XVA pipeline computes multiple thousands of billions of valuations within just 1 hour, consuming around 2 GB of external input data, processing from 10 to 20 TB of data inside the system of equations, and producing about 4 GB of output reports. Apache Beam distributes XVA calculations into a number of PCollections with tasks, performs the necessary transformations independently and in parallel, and produces map-reduced results - all in one pass.&lt;/p>
&lt;p>Apache Beam provides powerful &lt;a href="/documentation/programming-guide/#transforms">transforms&lt;/a> and orchestration capabilities that helped HSBC engineers to optimize the analytical approach to XVA sensitivities calculation and enable the numerical one, which was not possible before. Instead of iterating a valuation function through the whole system of equations, HSBC’s engineers treat the system of equations as a computation graph, breaking it down into clusters with reusable elements and iterating through the minimal computation graph. They use Apache Beam orchestration to efficiently process tens of thousands of clusters for each portfolio by calling a C++ “executable”. Apache Beam enabled HSBC to bundle multiple market factors using &lt;a href="/documentation/programming-guide/#core-beam-transforms">KV PCollections&lt;/a> to associate each input element of a PCollection with a key and calculate the XVA sensitivity for hundreds of market factors within a single Apache Beam batch pipeline.&lt;/p>
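&lt;p>The idea of keying elements by market factor and combining per key can be sketched in plain Python as follows. This is an analogue of a keyed collection followed by a per-key combine, not Beam code and not HSBC&amp;rsquo;s pipeline; the factor names and the partial results are hypothetical.&lt;/p>

```python
from collections import defaultdict


def group_by_key(pairs):
    """Plain-Python analogue of grouping a keyed collection by its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical (market_factor, partial_sensitivity) pairs, one per processed cluster.
cluster_results = [
    ("usd_rate", 0.12),
    ("eur_usd_fx", -0.05),
    ("usd_rate", 0.08),
    ("eur_usd_fx", 0.02),
]

# Combine the partial results of each market factor into a single value.
per_factor = {
    factor: sum(values)
    for factor, values in group_by_key(cluster_results).items()
}
```

&lt;p>Because every element carries its own key, results for many market factors can flow through the same pipeline stages and only be separated at the final combine.&lt;/p>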
&lt;p>The Apache Beam pipeline that performs analytical and numerical XVA sensitivities calculations runs daily in two separate batches. The first batch pipeline, which runs at midnight, determines the credit line consumption and capital utilisation for traders, directly affecting their trading volume the following day. The second batch, which completes before 8 am, calculates XVA sensitivities that could impact traders&amp;rsquo; risk management and hedging strategies. The pipeline consumes about 2 GB of external market data daily, processing up to 400 TB of internal data in the system of equations, and aggregating the data into the output report of just about 4 GB. At the end of each month, the pipeline processes over 5 PB of monthly data inside the system of equations to produce a full-scope XVA sensitivities report. Apache Beam addresses data distribution and hot spot issues, assisting in managing the data involved in intricate calculations.&lt;/p>
&lt;p>Apache Beam offers HSBC all the traits of a traditional risk engine and more, enabling HSBC to scale the infrastructure and maximize throughput with distributed processing.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam makes data distribution much easier than before. The Monte Carlo method significantly increases the amount of data to be processed. Apache Beam helped us harness all of that data volume.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/hsbc/andrzej_golonka.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Andrzej Golonka
&lt;/div>
&lt;div class="case-study-quote-author-position">
Lead Assistant Vice President @ HSBC
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="beam-as-a-platform">Beam as a Platform&lt;/h2>
&lt;p>Apache Beam is more than a data processing framework. It is also a computational platform that enables experimentation, accelerates time-to-market for new development, and simplifies deployment.&lt;/p>
&lt;p>Apache Beam&amp;rsquo;s dataset abstraction as PCollection increased HSBC’s model development efficiency by providing a way to organize component ownership and reduce organizational dependencies and bottlenecks. The model development team now owns the data pipelines: it implements new systems of equations, defines the data transfer within a system of equations in a black box mode, and sends it to the IT team. The IT team fetches, controls, and orchestrates the external data required by a system of equations as a whole.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
In general, we used to spend at least 6 months deploying even small changes to a system of equations in production. With the new team structure driven by Apache Beam, we now can deploy changes as quickly as within 1 week.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/hsbc/chup_cheng.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Chup Cheng
&lt;/div>
&lt;div class="case-study-quote-author-position">
VP of XVA and CCR Capital Analytics @ HSBC
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>By leveraging the abstractions provided by the Apache Beam unified programming model, HSBC&amp;rsquo;s model development team can seamlessly create new data pipelines, choose an appropriate runner, and conduct experiments on Big Data without managing the underlying infrastructure. The Apache Beam model rules ensure the high quality of the experimental code and make it very easy to move pipelines from experimentation to production.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
With Apache Beam, it’s easy to experiment with “What if” questions. If we want to know the impact of changing some parameters, we can write a simple Apache Beam code, run the pipeline, and have the answer within minutes.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/hsbc/andrzej_golonka.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Andrzej Golonka
&lt;/div>
&lt;div class="case-study-quote-author-position">
Lead Assistant Vice President @ HSBC
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>One of the key advantages of Apache Beam for Monte Carlo simulations and counterparty credit risk analysis is its ability to run the same complex simulation logic in various environments, on-premises or in the cloud, with different runners. This flexibility is especially critical in situations requiring local risk analysis in different countries and compliance zones, where sensitive financial data and information cannot be transferred beyond the local perimeter. With Apache Beam, HSBC can easily switch between runners and future-proof its data processing against regulatory changes. HSBC runs the majority of workflows in Cloud Dataflow, benefiting from its managed service and autoscaling capabilities to handle spikes of up to 18,000 vCPUs when running batch pipelines twice a day. In select countries, they also use the Apache Beam Flink runner to comply with local regulations specific to data storage and processing.&lt;/p>
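&lt;p>To make the runner switch concrete, here is a minimal sketch of how the same pipeline code could be launched with different options per compliance zone. The zone names, project, bucket, host, and job name are hypothetical placeholders; only the option flags themselves (&lt;code>--runner&lt;/code>, &lt;code>--project&lt;/code>, &lt;code>--region&lt;/code>, &lt;code>--temp_location&lt;/code>, &lt;code>--flink_master&lt;/code>) are standard Beam pipeline options. This is not HSBC’s actual configuration.&lt;/p>

```python
# Hypothetical per-zone settings; none of these values come from the case study.
ZONES = {
    "cloud": {
        "runner": "dataflow",
        "project": "example-project",
        "region": "europe-west2",
        "temp_location": "gs://example-bucket/tmp",
    },
    "local": {
        "runner": "flink",
        "flink_master": "flink.internal.example:8081",
    },
}


def pipeline_args(zone):
    """Build Beam command-line style options for one compliance zone."""
    common = ["--job_name=xva-sensitivities"]  # hypothetical job name
    if zone["runner"] == "dataflow":
        return common + [
            "--runner=DataflowRunner",
            f"--project={zone['project']}",
            f"--region={zone['region']}",
            f"--temp_location={zone['temp_location']}",
        ]
    if zone["runner"] == "flink":
        return common + [
            "--runner=FlinkRunner",
            f"--flink_master={zone['flink_master']}",
        ]
    raise ValueError(f"unknown runner: {zone['runner']}")


# The pipeline definition itself would stay the same; in a real job these
# lists would be passed to apache_beam.options.pipeline_options.PipelineOptions.
cloud_args = pipeline_args(ZONES["cloud"])
local_args = pipeline_args(ZONES["local"])
```

&lt;p>The point of the sketch is that only the options differ between zones, while the transforms that implement the simulation logic are left untouched.&lt;/p>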
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam harnesses enormous volumes of financial market data and metrics, generates billions of trade valuations to scan plausible future scenarios reaching about 70 years ahead, and aggregates them into meaningful statistics, enabling HSBC to model their future scenarios and quantitatively account for risk in forecasting and decision-making.&lt;/p>
&lt;p>With Apache Beam, HSBC’s engineers achieved a 2x increase in data processing performance and scaled their XVA batch pipeline by 100x compared to the original solution. The Apache Beam abstraction opened up a way to implement a numerical approach to XVA sensitivity calculation in production, which was not possible before. The batch Apache Beam pipeline calculates XVA sensitivities for hundreds of market factors, processing about 400 TB of internal data every day and up to 5 PB of data once per month.&lt;/p>
&lt;p>Apache Beam portability enabled HSBC to use different runners in different regions depending on local data processing requirements and future-proof their data processing for regulatory changes.&lt;/p>
&lt;p>Apache Beam provides seamless integration and out-of-the-box interaction with highly optimized computational components in C++, which saved HSBC the effort needed to rewrite the C++ analytics accumulated for years into Python.&lt;/p>
&lt;p>The Apache Beam Python SDK brought math to the orchestration level by providing an easy way for HSBC’s model development team to build new Python pipelines. The new work structure driven by Apache Beam accelerated time-to-market by 24x, enabling HSBC’s teams to deploy changes and new models to production within just a few weeks.&lt;/p>
&lt;p>By leveraging the versatile and scalable nature of Apache Beam for computing directed acyclic graphs that process large systems of differential equations and Monte Carlo simulations, financial institutions can assess and manage counterparty credit risk efficiently, even in situations that demand localized analysis and strict compliance with data security regulations.&lt;/p>
&lt;h2 id="learn-more">Learn More&lt;/h2>
&lt;iframe class="video video--medium-size" width="560" height="315" src="https://www.youtube.com/embed/QoKWdOXyBw4" frameborder="0" allowfullscreen>&lt;/iframe>
&lt;br>&lt;br>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Efficient Streaming Analytics: Making the Web a Safer Place with Project Shield</title><link>/case-studies/projectshield/</link><pubDate>Thu, 08 Jun 2023 00:12:00 +0000</pubDate><guid>/case-studies/projectshield/</guid><description>
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/project_shield.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Apache Beam supports our mission to make the web a safer and better place by providing near-real-time visibility into traffic data for our customers, providing ongoing analysis and adjustments to our defenses, and neutralizing the impact of traffic spikes during DDoS attacks on the performance and efficiency of our platform.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/marc_howard.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marc Howard
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="efficient-streaming-analytics-making-the-web-a-safer-place-with-project-shield">Efficient Streaming Analytics: Making the Web a Safer Place with Project Shield&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://projectshield.withgoogle.com/landing">Project Shield&lt;/a>, offered by &lt;a href="https://cloud.google.com/">Google Cloud&lt;/a> and &lt;a href="https://jigsaw.google.com/">Jigsaw&lt;/a> (a subsidiary of Google), is a service that counters &lt;a href="https://en.wikipedia.org/wiki/Distributed-denial-of-service">distributed-denial-of-service&lt;/a> (DDoS) attacks. Project Shield is available free of charge to eligible websites that have media, elections, and human rights related content. &lt;a href="https://www.forbes.com/sites/andygreenberg/2013/10/21/googles-project-shield-will-offer-free-cyberattack-protection-to-hundreds-of-at-risk-sites/">Founded in 2013&lt;/a>, Project Shield’s mission is to protect freedom of speech and to make sure that, when people have access to democracy-related information, the information isn’t compromised, censored, or silenced in a politically-motivated way.&lt;/p>
&lt;p>In the first half of 2022, Project Shield defended websites of vulnerable users - such as human rights, news, and civil society organizations or governments under exigent circumstances - &lt;a href="https://cloud.google.com/blog/products/identity-security/ddos-attack-trends-during-us-midterm-elections">against more than 25,000 attacks&lt;/a>. Notably, Project Shield helped ensure &lt;a href="https://cloud.google.com/blog/products/identity-security/ddos-attack-trends-during-us-midterm-elections">unhindered access to election-related information during the U.S. 2022 midterm election season&lt;/a>. It also &lt;a href="https://cio.economictimes.indiatimes.com/news/internet/google-expands-project-shield-to-protect-govts-from-hacking/90072091">enables Ukrainian critical infrastructure and media websites to defend against non-stop attacks&lt;/a> and to continue providing crucial services and information during the invasion of Ukraine.&lt;/p>
&lt;p>Marc Howard and Chad Hansen, the co-founding engineers, explain how Project Shield uses Apache Beam to deliver some of their core value. The streaming Apache Beam pipelines process about 3 TB of log data daily at significantly over 10,000 queries per second. These pipelines enable multiple product needs. For example, Apache Beam produces real-time analytics and critical metrics for &lt;a href="https://www.washingtonpost.com/opinions/2022/06/21/russia-ukraine-cyberwar-intelligence-agencies-tech-companies/">over 3000 customer websites in 150 countries&lt;/a>. These metrics power long-term attack analytics at scale, fine-tuning Project Shield’s defenses and supporting them in the effort of making the web a safe and free space.&lt;/p>
&lt;h2 id="journey-to-beam">Journey To Beam&lt;/h2>
&lt;p>The Project Shield platform is built using Google Cloud technologies and provides multi-layered defenses. To absorb part of the traffic and keep websites online even if their servers are down, it uses &lt;a href="https://cloud.google.com/cdn">Cloud CDN&lt;/a> for &lt;a href="https://en.wikipedia.org/wiki/Cache_(computing)">caching&lt;/a>. To protect websites from DDoS and other malicious attacks, it leverages &lt;a href="https://cloud.google.com/armor">Cloud Armor&lt;/a> features, such as adaptive protection, rate limiting, and bot management.&lt;/p>
&lt;p>Project Shield acts as a &lt;a href="https://en.wikipedia.org/wiki/Reverse_proxy">reverse proxy&lt;/a>: it receives traffic requests on a website’s behalf, absorbs traffic through caching, filters harmful traffic, bans attackers, and then sends safe traffic to a website’s origin server. This configuration allows websites to stay up and running when someone tries to take them down with a DDoS attack. The challenge is that, with a large portion of traffic blocked, the logs on customers’ origin servers no longer have accurate analytics about website traffic. Instead, customers rely on Project Shield to provide all of the traffic analytics.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/projectShield/project_shield_mechanism.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/projectShield/project_shield_mechanism.png" alt="Project Shield Mechanism">
&lt;/a>
&lt;span>Project Shield Mechanism&lt;/span>
&lt;/div>
&lt;p>Originally, Project Shield stored traffic logs in &lt;a href="https://cloud.google.com/bigquery">BigQuery&lt;/a>. It used one large query to produce analytics and charts with different traffic metrics, such as the amount of traffic, QPS, the share of cached traffic, and the attack data. However, querying all of the logs, especially with dramatic spikes in traffic, was slow and expensive.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Often people want to know traffic metrics during critical times: if their website is under attack, they want to see what's happening right now. We need the data points to appear on the UI fast.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/marc_howard.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marc Howard
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Project Shield’s team then added &lt;a href="https://firebase.google.com/docs/firestore">Firestore&lt;/a> as an intermediate step, running a &lt;a href="https://en.wikipedia.org/wiki/Cron">cron&lt;/a> every minute to query logs from BigQuery and write the interim reports to Firestore, then using these reports to build charts. This step improved performance slightly, but the gain was not sufficient to meet critical business timelines, and it didn’t provide adequate visibility into historical traffic for customers.&lt;/p>
&lt;p>Unlike BigQuery, Firestore was designed for medium-sized workloads. Therefore, it wasn’t possible to fetch many models at the same time to provide customers with cumulative statistics for extended time frames. Querying the logs every minute was inefficient from a cost perspective. In addition, some of the logs were coming to BigQuery with a delay, and it was necessary to run the cron again in 24 hours to double-check for the late-coming logs.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Querying over all of our traffic logs every minute is very expensive, especially when you consider that we are a DDoS defense service - the number of logs that we see can often spike dramatically.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/marc_howard.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marc Howard
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Project Shield’s team looked for a data processing framework that would minimize end-to-end latency, meet their scaling needs for better customer visibility, and ensure cost efficiency.&lt;/p>
&lt;p>They selected Apache Beam for its &lt;a href="/documentation/runtime/model/">strong processing guarantees&lt;/a>, streaming capabilities, and out-of-the-box &lt;a href="/documentation/io/connectors/">I/Os&lt;/a> for BigQuery and Pub/Sub. By pairing Apache Beam with the &lt;a href="/documentation/runners/dataflow/">Dataflow runner&lt;/a>, they also benefited from built-in autoscaling. In addition, the &lt;a href="/documentation/sdks/python/">Apache Beam Python SDK&lt;/a> lets the team use Python across the board and simplifies reading data model types, which must be identical in the backend that consumes them and in the pipeline that writes them.&lt;/p>
&lt;h2 id="use-cases">Use Cases&lt;/h2>
&lt;p>Project Shield became one of the early adopters of Apache Beam and migrated their workflows to streaming Apache Beam pipelines in 2020. Currently, Apache Beam powers multiple product needs with multiple streaming pipelines.&lt;/p>
&lt;p>The unified streaming pipeline that produces user-facing analytics reads the logs from Pub/Sub as they arrive from Cloud Logging, groups them into one-minute windows, splits them by the hostname of the request, generates reports, and writes the reports to BigQuery. The pipeline aggregates log data, removes &lt;a href="https://en.wikipedia.org/wiki/Personal_data">Personally Identifiable Information&lt;/a> (PII) without using a &lt;a href="https://en.wikipedia.org/wiki/Data_loss_prevention_software">DLP&lt;/a>, and allows for storing the data in BigQuery for a longer period while meeting the regulatory requirements. &lt;a href="/documentation/transforms/python/aggregation/combineperkey/#example-5-combining-with-a-combinefn">CombineFn&lt;/a> allows Project Shield’s team to create a custom accumulator that takes the key into account when keying log data off the hostname and minute, &lt;a href="/documentation/programming-guide/#combine">combining the log data per key&lt;/a>. If logs come in late, Apache Beam creates a new report and aggregates several reports per hostname per minute.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The fact that we can just key data, use CombinePerKey, and the accumulator works like magic was a big win for us.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/marc_howard.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marc Howard
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
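&lt;p>As a rough illustration of the accumulator pattern described above, the following plain-Python sketch mirrors the four methods of the Beam Python SDK’s &lt;code>CombineFn&lt;/code> and stands in for &lt;code>CombinePerKey&lt;/code> with a local loop. The log fields (&lt;code>host&lt;/code>, &lt;code>ts&lt;/code>, &lt;code>cache_hit&lt;/code>, &lt;code>bytes&lt;/code>), the report shape, and the helper names are assumptions made for the example, not Project Shield’s actual schema or code.&lt;/p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrafficReport:
    """Hypothetical accumulator for one hostname and one minute."""
    requests: int = 0
    cached: int = 0
    bytes_sent: int = 0


class TrafficReportFn:
    """Same four-method shape as apache_beam.CombineFn (not subclassed here)."""

    def create_accumulator(self):
        return TrafficReport()

    def add_input(self, acc, log):
        # Fold one log entry into the running report.
        acc.requests += 1
        acc.cached += 1 if log["cache_hit"] else 0
        acc.bytes_sent += log["bytes"]
        return acc

    def merge_accumulators(self, accumulators):
        # Combine partial reports, e.g. ones built on different workers.
        merged = TrafficReport()
        for acc in accumulators:
            merged.requests += acc.requests
            merged.cached += acc.cached
            merged.bytes_sent += acc.bytes_sent
        return merged

    def extract_output(self, acc):
        share = acc.cached / acc.requests if acc.requests else 0.0
        return {"requests": acc.requests, "cached_share": share, "bytes": acc.bytes_sent}


def minute_key(log):
    # Key = (hostname, start of the one-minute window containing the event).
    return (log["host"], log["ts"] - log["ts"] % 60)


def combine_per_key(logs, fn):
    """Local stand-in for CombinePerKey over logs keyed by host and minute."""
    accs = defaultdict(fn.create_accumulator)
    for log in logs:
        key = minute_key(log)
        accs[key] = fn.add_input(accs[key], log)
    return {key: fn.extract_output(acc) for key, acc in accs.items()}


# Timestamps are seconds since an arbitrary epoch; hostnames are made up.
logs = [
    {"host": "news.example", "ts": 120, "cache_hit": True, "bytes": 500},
    {"host": "news.example", "ts": 150, "cache_hit": False, "bytes": 1500},
    {"host": "news.example", "ts": 185, "cache_hit": True, "bytes": 200},
    {"host": "vote.example", "ts": 130, "cache_hit": True, "bytes": 100},
]
reports = combine_per_key(logs, TrafficReportFn())
```

&lt;p>In a real pipeline the class would subclass &lt;code>apache_beam.CombineFn&lt;/code>, and the runner, not a local loop, would decide when to call &lt;code>merge_accumulators&lt;/code>. That merge step is what allows partial reports for the same hostname and minute to be folded together.&lt;/p>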
&lt;p>The Apache Beam log processing pipeline provides Project Shield with the ability to query only the relevant reports, thus enabling Project Shield to load the data to the customer dashboard within just 2 minutes. The pipeline also provides enhanced visibility for Project Shield’s customers, because the queried reports are much smaller in size and easier to store than the traffic logs.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The end-to-end pipeline latency was very meaningful to us, and the Apache Beam streaming allowed for displaying all the traffic metrics on charts within 2 minutes, but also to look back on days, weeks, or months of data and show graphs on a scalable timeframe to customers.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/chad_hansen.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Chad Hansen
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Project Shield extended the streaming Apache Beam log processing pipeline to generate a different type of report based on the logs and requests during attacks. The Apache Beam pipeline analyzes attacks and generates critical defense recommendations. These recommendations are then used by the internal long-term attack analysis system to fine-tune Project Shield’s defenses.&lt;/p>
&lt;p>Apache Beam also powers Project Shield’s traffic rate-limiting decisions by analyzing patterns in the traffic logs. The streaming Apache Beam pipeline gathers information about the legitimate usage of a website, excludes abusive traffic from that analysis, and crafts traffic rate limits that divide the two groups safely. Those limits are then enforced in the form of Cloud Armor rules and policies and used to restrict abusive traffic or to ban it entirely.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/projectShield/apache_beam_streaming_log_analytics.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/projectShield/apache_beam_streaming_log_analytics.png" alt="Apache Beam Streaming Log Analytics">
&lt;/a>
&lt;span>Apache Beam Streaming Log Analytics&lt;/span>
&lt;/div>
&lt;p>The combination of Apache Beam and Cloud Dataflow greatly simplifies Project Shield’s operational management of streaming pipelines. Apache Beam provides easy-to-use streaming primitives, while Dataflow enables &lt;a href="https://cloud.google.com/dataflow/docs/pipeline-lifecycle">out-of-the-box pipeline lifecycle management&lt;/a> and complements Pub/Sub’s delivery model with &lt;a href="https://www.cloudskillsboost.google/focuses/18457?parent=catalog">message deduplication and exactly-once, in-order processing&lt;/a>. The Apache Beam &lt;a href="https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.pubsub.html">Pub/Sub I/O&lt;/a> provides the ability to key log data off Pub/Sub event timestamps. This feature enables Project Shield to improve the overall data accuracy by windowing log data after all the relevant logs come in. &lt;a href="https://cloud.google.com/dataflow/docs/guides/pipeline-workflows#update-streaming-pipelines-prod">Various options for managing the pipeline lifecycle&lt;/a> allow Project Shield to employ simple and reliable deployment processes. The Apache Beam Dataflow runner’s &lt;a href="https://cloud.google.com/dataflow/docs/horizontal-autoscaling">autoscaling and managed service capabilities&lt;/a> help handle dramatic spikes in resource consumption during DDoS attacks and provide instant visibility for customers.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
When attacks come in, we are ready to handle high volumes of traffic and deliver on-time metrics during critical windows with Apache Beam.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/chad_hansen.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Chad Hansen
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The adoption of Apache Beam enabled Project Shield to embrace streaming, scale its pipelines, maximize efficiency, and minimize latency. The streaming Apache Beam pipelines process about 3 TB of log data daily at significantly over 10,000 queries per second to produce user-facing analytics, tailored traffic rate limits, and defense recommendations for over 3K customers all over the world. The streaming processing and powerful transforms of Apache Beam ensure delivery of critical metrics within just 2 minutes, enabling customer visibility into historical traffic, and resulting in a 91% compute efficiency gain, compared to the original solution.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The Apache Beam model and the autoscaling capabilities of its Dataflow runner help prevent the spikes in traffic during DDoS attacks from having a meaningful impact on our platform from an efficiency and financial perspective.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/projectShield/marc_howard.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marc Howard
&lt;/div>
&lt;div class="case-study-quote-author-position">
Founding Engineer @ Project Shield
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Apache Beam data processing framework supports Project Shield’s goal to eliminate the DDoS attack as a weapon for silencing the voices of journalists and others who speak the truth, making the web a safer place.&lt;/p>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Mass Ad Bidding With Beam at Booking.com</title><link>/case-studies/booking/</link><pubDate>Sun, 16 Apr 2023 00:12:00 +0000</pubDate><guid>/case-studies/booking/</guid><description>
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/booking.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“We query across >2 petabytes of analytical data and several terabytes of transactional data, processing 1 billion events daily. Apache Beam enabled us to parallelize data processing, maximize throughput, and accelerate the movement/ETL of these datasets.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/booking.ico">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Booking.com's PPC Team
&lt;/div>
&lt;div class="case-study-quote-author-position">
Marketing Technology Department
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="mass-ad-bidding-with-beam-at-bookingcom">Mass Ad Bidding With Beam at Booking.com&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.booking.com/">Booking.com&lt;/a> seamlessly connects millions of travelers to memorable experiences by investing in technology that takes the friction out of travel and making it easier for everyone to experience the world. Booking.com is a brand of &lt;a href="https://www.bookingholdings.com/">Booking Holdings&lt;/a>, the world’s largest provider of online travel &amp;amp; related services to consumers and local partners. To help people discover destinations in more than 220 countries and territories, Booking Holdings as a whole spent &lt;a href="https://s201.q4cdn.com/865305287/files/doc_financials/2022/q4/BKNG-Earnings-Release-Q4-2022.pdf">$5.99&lt;/a> billion in marketing in 2022, with Booking.com being a leading travel advertiser on &lt;a href="https://ads.google.com/home/campaigns/search-ads/">Google Pay Per Click (PPC) Search Ads&lt;/a>.&lt;/p>
&lt;p>The PPC team at Booking.com’s Marketing Technology department is in charge of the infrastructure and internal tooling necessary to run PPC advertising at this scale. The PPC team’s primary goal is to reliably and efficiently optimize their PPC across all search engine providers, measure and analyze ad performance data, manage ad hierarchies, and adjust ad criteria. Apache Beam supports this goal by providing an effective abstraction that helps build reliable, performant, and cost-efficient data processing infrastructure at a very large scale.&lt;/p>
&lt;h2 id="journey-to-beam">Journey To Beam&lt;/h2>
&lt;p>PPC advertising is a business-critical promotional channel for Booking.com’s marketing. With billions of searches per day on search engines, they use PPC Search Ads to make sure users will get the most relevant offerings in their search results. Behind the scenes, the PPC team manages the operational infrastructure to process ad performance feedback, assess historical performance, support machine learning operations that generate bids, and communicate the bids back to a search engine provider.&lt;/p>
&lt;p>The earlier implementation of Booking.com’s mass ad bidding was a custom stack batch pipeline, with MySQL data storage, &lt;a href="https://en.wikipedia.org/wiki/Cron">cron&lt;/a> scheduling, and Perl scripts to implement business logic. The design eventually hit a performance bottleneck, struggling to keep up with the increasing throughput demands on bids per second. The opportunity cost, combined with the cost of maintaining the complexity, became larger than the cost of a full rewrite.&lt;/p>
&lt;p>The mass bidding infrastructure had undergone several rewrites before Apache Beam came into play. The core idea behind the PPC team’s latest implementation of the Apache Beam-powered data ecosystem originated in late 2020. Attending &lt;a href="https://beamsummit.org/">Beam Summit&lt;/a> and &lt;a href="https://beamcollege.dev/">Beam College&lt;/a> community events helped the team to learn about the open-source Apache Beam abstraction model that is also available as a managed service with the &lt;a href="/documentation/runners/dataflow/">Dataflow runner&lt;/a>.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
It was straightforward to introduce the idea to the rest of the team - the Apache Beam model is easy to understand because it isolates the business logic and helps build a mental model.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/igor_dralyuk.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Igor Dralyuk
&lt;/div>
&lt;div class="case-study-quote-author-position">
Principal Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The PPC team decided to pilot this framework by creating a new prototype Java pipeline that downloads ad performance reports.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
With Apache Beam, we achieved a multifactor speed-up in the development time. We were able to actually deliver one pipeline in a matter of 3 weeks. We would have spent a solid three months if we had to implement the pipeline using any other framework.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/warren_qi.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Warren Qi
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Once the first POC proved to be successful, the PPC team placed Apache Beam at the core of their data infrastructure. Spinning up the Dataflow managed service provided an opportunity to focus on new capabilities instead of maintaining their own compute and storage infrastructure. The Apache Beam abstraction from the runtime implementation details allows the PPC team to focus on business logic and leverage parallelism to easily scale horizontally. Heavy users of various &lt;a href="https://cloud.google.com/">GCP&lt;/a> products, they also leveraged the Apache Beam &lt;a href="/documentation/io/connectors/">I/O connectors&lt;/a> to natively integrate with various sinks and sources, such as &lt;a href="/documentation/io/built-in/google-bigquery/">BigQuery&lt;/a>, &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/firestore/FirestoreIO.html">Firestore&lt;/a>, and &lt;a href="https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.html">Spanner&lt;/a>.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam serves as an effective abstraction for our data infrastructure and processing. With Dataflow runner, we also don’t need to worry about maintaining the runtime and storage infrastructure, as well as about cloud provider lock-in.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/sergey_dovgal.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Sergey Dovgal
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The quality of documentation, as well as the vibrant Apache Beam open-source community, lively discussions in mailing lists, and plenty of user-created content made it very easy to onboard new use cases.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The Apache Beam open source community, documentation, and Youtube content were very helpful when designing and implementing the new ecosystem.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/igor_dralyuk.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Igor Dralyuk
&lt;/div>
&lt;div class="case-study-quote-author-position">
Principal Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Currently, Apache Beam powers batch and streaming pipelines for the PPC team’s large-scale ad bidding infrastructure.&lt;/p>
&lt;h2 id="mass-ad-bidding">Mass Ad Bidding&lt;/h2>
&lt;p>Conceptually, the mass ad bidding infrastructure accepts ad bidding requests and assets, stages them for submission to multiple services, and processes ad performance results at a massive scale. The ad bidding infrastructure relies on Apache Beam batch and streaming pipelines to interact with BigQuery, Spanner, Firestore, and Pub/Sub sources and sinks, and uses Beam&amp;rsquo;s stateful processing for ads services API calls.&lt;/p>
&lt;p>When designing the data infrastructure, the PPC team’s primary goal was to maximize the throughput of bids per second while still respecting the request rate limits imposed by the search engines’ Ads APIs on the account level. The PPC team implemented streaming Apache Beam pipelines that utilize keyed &lt;a href="/documentation/programming-guide/#pcollections">PCollections&lt;/a> to cluster outgoing Ads modifications by account ID, &lt;a href="/documentation/transforms/python/aggregation/groupintobatches/">group them into batches&lt;/a>, and execute data processing in parallel. This approach helped optimize the throughput for the team&amp;rsquo;s thousands of Ads accounts and achieve improved performance and reliability at scale.&lt;/p>
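&lt;p>The keying and batching step described above can be sketched in plain Python. This is only an illustration of the idea, not the PPC team&amp;rsquo;s pipeline code: the function name &lt;code>batch_by_account&lt;/code>, the account IDs, and the batch size are assumptions made for the example. In a Beam pipeline, the same grouping would typically be expressed with a keyed PCollection and the GroupIntoBatches transform, with the runner processing different keys in parallel.&lt;/p>

```python
from collections import defaultdict


def batch_by_account(mutations, batch_size):
    """Group (account_id, mutation) pairs by account, then split each group into batches.

    Hypothetical helper for illustration; it mimics keying by account ID
    followed by batching, but runs sequentially in one process.
    """
    per_account = defaultdict(list)
    for account_id, mutation in mutations:
        per_account[account_id].append(mutation)

    batches = []
    for account_id, items in per_account.items():
        # Each batch holds at most batch_size mutations for a single account,
        # so per-account rate limits can be applied batch by batch.
        for start in range(0, len(items), batch_size):
            batches.append((account_id, items[start:start + batch_size]))
    return batches


# Made-up sample data: three mutations for one account, one for another.
mutations = [
    ("acct-1", "bid-a"),
    ("acct-2", "bid-b"),
    ("acct-1", "bid-c"),
    ("acct-1", "bid-d"),
]
batches = batch_by_account(mutations, batch_size=2)
```

&lt;p>Because every batch belongs to exactly one account, batches for different accounts can be sent independently while each account stays within its own limit.&lt;/p>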
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam enabled us to parallelize data processing and maximize throughput at a very large scale.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/prasanjit_barua.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Prasanjit Barua
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The PPC team uses an internal API interface to submit queries to the mass bidding infrastructure, which routes the queries to the respective ad bidding pipelines for Google and Bing. For the Google branch, the API calls an Invoker cloud function, which reads data from BigQuery, aggregates it, and performs analysis before storing intermediate results in staging tables in BigQuery. The Invoker then calls an Ingestor Apache Beam batch pipeline, which publishes the data into Pub/Sub.&lt;/p>
&lt;p>On the other end, the Google Ads Mutator Apache Beam streaming pipeline listens to over 1 billion Pub/Sub events per day and sends corresponding requests to the Google Ads API. This job is designed with backpressure in mind, ensuring optimal performance while also considering factors such as &lt;a href="/documentation/runtime/model/">partitioning, parallelism, and key-ordered delivery guarantees&lt;/a>. The results are then written to the Result Table in BigQuery and the Inventory in Spanner, with over 100 GB processed daily.&lt;/p>
&lt;p>Finally, the Daily Importer Apache Beam batch pipeline grabs the inventory data and disseminates it for downstream tasks, also processing 100 GB daily. Data analysts then match the incoming stream of hotel reservations with the inventory data on what was advertised and evaluate PPC ads performance.&lt;/p>
&lt;p>The versatility and flexibility of the Apache Beam framework are key to the entire process, as it allows for combining batch and streaming processing into a unified flow, while also enabling integration with a wide range of sources and destinations with different characteristics and semantics. The framework also provides guarantees for delivery and order, all at scale and with optimal tradeoffs for streaming processing.&lt;/p>
&lt;div class="post-scheme">
&lt;a href="/images/case-study/booking/streaming-processing.png" target="_blank" title="Click to enlarge">
&lt;img src="/images/case-study/booking/streaming-processing.png" alt="Streaming processing">
&lt;/a>
&lt;/div>
&lt;p>The Google Ads Mutator and Bing Ads Mutator pipelines are an integral part of Booking.com’s mass bidding infrastructure. These streaming Apache Beam pipelines process all the data coming to and from the search engines’ Ads APIs and write massive ad performance reports to the inventory in Cloud Spanner. The Apache Beam built-in &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.html">Cloud Spanner SpannerIO.Write&lt;/a> transform writes data more efficiently by &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/spanner/MutationGroup.html">grouping mutations&lt;/a> into batches, maximizing throughput while respecting the Spanner per-transaction limits. With Apache Beam reducing the key-range of the mutations, the PPC team was able to achieve cost optimization in Spanner and improve processing performance.&lt;/p>
&lt;p>Similarly, to stay within the Ads API rate limits, the PPC team leverages Apache Beam &lt;a href="/blog/timely-processing/">timely&lt;/a> and &lt;a href="/blog/stateful-processing/">stateful processing&lt;/a> together with a Redis-based microservice that maintains the rate limits for bids. The PPC team has a custom “Aggregate and Send” function that accumulates incoming mutation commands in a buffer until it is full. The function then requests mutation quota from the rate limiter and sends a request to the Ads API. If the internal rate limiter or the Ads API requests a wait, the function starts a retry timer and continues buffering incoming commands. If no wait is requested, the function clears the command buffer and publishes the queries to Pub/Sub.&lt;/p>
&lt;p>Apache Beam provides windowed aggregations to pre-aggregate mutation commands and assures delivery guarantees through the use of &lt;a href="/documentation/programming-guide/#timers">timers&lt;/a> and stateful operations. By using &lt;a href="/documentation/programming-guide/#valuestate">BagState&lt;/a>, Apache Beam can add elements to a collection to accumulate an unordered set of mutations. &lt;a href="/documentation/programming-guide/#valuestate">ValueState&lt;/a>, on the other hand, stores typed values for batch, batch size, and buffer size that can be read and modified within the &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/DoFn.html">DoFn’s&lt;/a> &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html">ProcessElement&lt;/a> and &lt;a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/DoFn.OnTimer.html">OnTimer&lt;/a> methods.&lt;/p>
&lt;p>Runners that support paged reads can handle individual bags that are larger than available memory. The Apache Beam retry timer is used to output data buffered in state after some amount of processing time. The flush timer is used to prevent commands from remaining in the buffer for too long, particularly when the commands are infrequent and unable to fill the buffer.&lt;/p>
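&lt;p>The buffering behavior described above can be approximated with a small plain-Python model. It is a sketch under stated assumptions rather than the team&amp;rsquo;s implementation: the class name, the &lt;code>acquire_quota&lt;/code> and &lt;code>send&lt;/code> callables, and the single &lt;code>on_timer&lt;/code> entry point are hypothetical stand-ins for the rate limiter, the Ads API call, and the Beam retry and flush timers. A plain list plays the role of BagState here.&lt;/p>

```python
class AggregateAndSend:
    """Plain-Python stand-in for a per-account stateful buffer (illustrative only)."""

    def __init__(self, buffer_size, acquire_quota, send):
        self.buffer_size = buffer_size      # flush threshold (assumed parameter)
        self.acquire_quota = acquire_quota  # callable(n): True if n mutations are allowed now
        self.send = send                    # callable(batch): True if the API accepted the batch
        self.buffer = []                    # plays the role of BagState
        self.retry_pending = False          # True while a retry timer would be set

    def process(self, command):
        """Buffer one command and flush once the buffer is full."""
        self.buffer.append(command)
        if len(self.buffer) >= self.buffer_size and not self.retry_pending:
            self._flush()

    def on_timer(self):
        """Models a retry or flush timer firing: try to send whatever is buffered."""
        self.retry_pending = False
        if self.buffer:
            self._flush()

    def _flush(self):
        batch = list(self.buffer)
        if self.acquire_quota(len(batch)) and self.send(batch):
            self.buffer.clear()
        else:
            # Asked to wait: keep buffering until the timer fires.
            self.retry_pending = True


sent = []
quota_answers = iter([False, True])  # first request is denied, second is granted


def fake_send(batch):
    sent.append(batch)
    return True


fn = AggregateAndSend(2, lambda n: next(quota_answers), fake_send)
fn.process("m1")
fn.process("m2")  # buffer is full, quota denied, retry pending
fn.process("m3")  # still buffering while waiting
fn.on_timer()     # quota granted, all three commands are sent together
```

&lt;p>In this model a denied quota request never drops commands. They stay in the buffer and go out with the next successful flush, which mirrors the retry behavior described in the text.&lt;/p>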
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The stateful capabilities of Beam model allowed us to gain fine-grained control over bids per second by buffering the incoming data until it can be consumed, and maintain a higher processing performance than the other potential solutions.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/sergey_dovgal.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Sergey Dovgal
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/booking/stateful-capabilities.png" alt="Stateful capabilities">
&lt;/div>
&lt;p>To increase observability and provide users with a way to monitor their submission status, the PPC team also developed a custom function that produces metrics with custom keys to count the number of received and sent requests. Apache Beam extensibility allowed the PPC team to implement this supplementary monitoring tool right inside the Ad Mutator pipelines as an additional block.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam powers the data logistics behind Booking.com’s massive performance marketing “flywheel” with 1M+ queries monthly for ad bidding workflows across multiple data systems scanning 2 PB+ of analytical data and terabytes of transactional data daily, processing over 1 billion events at thousands of messages per second at peak.&lt;/p>
&lt;p>Apache Beam provided the much-needed parallelism, connectivity, and partitioning capabilities, as well as strong key-ordered delivery guarantees, to parallelize processing for several thousands of Booking.com’s Ads accounts, optimize costs, and ensure performance and reliability for ad bids processing at scale.&lt;/p>
&lt;p>Apache Beam accelerated processing time for the streaming pipelines from 6 hours to 10 minutes, which is an eye-opening 36x reduction in running time. The high-performing and fast PPC ad bidding infrastructure now drives 2M+ nights booked daily from search ads. The Apache Beam abstraction simplifies the onboarding of new team members, makes it easier to write and maintain pipelines, and accelerates time-to-market from a design document to go-live on production by as much as 4x.&lt;/p>
&lt;p>The PPC team is planning to expand the use of the Apache Beam unified processing capabilities to combine several batch and streaming pipelines into a single pipeline.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam as a model allows us to write business logic in a very declarative way. In the next development iterations, we are planning to combine several Ads Mutator pipelines into a single streaming pipeline.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/booking/sergey_dovgal.jpg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Sergey Dovgal
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ Booking.com
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Self-service Machine Learning Workflows and Scaling MLOps with Apache Beam</title><link>/case-studies/creditkarma/</link><pubDate>Thu, 01 Dec 2022 00:12:00 +0000</pubDate><guid>/case-studies/creditkarma/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/credit-karma.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block pb-0">
&lt;p class="case-study-quote-text">
“Apache Beam has been the ideal solution for us. Scaling, backfilling historical data, experimenting with new ML models and new use cases… it is all very easy to do with Beam.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/avneesh_pratap.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Avneesh Pratap
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Apache Beam enabled self-service ML for our data scientists. They can plug in pieces of code, and those transformations will be automatically attached to models without any engineering involvement. Within seconds, our data science team can move from experimentation to production.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/raj_katakam.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Raj Katakam
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior ML Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="self-service-machine-learning-workflows-and-scaling-mlops-with-apache-beam">Self-service Machine Learning Workflows and Scaling MLOps with Apache Beam&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.creditkarma.com/">Credit Karma&lt;/a> is an American multinational personal finance company &lt;a href="https://en.wikipedia.org/wiki/Credit_Karma">founded in 2007&lt;/a>, now part of &lt;a href="/case-studies/intuit/">Intuit&lt;/a>. With a free credit and financial management platform, Credit Karma enables financial progress for nearly 130 million members by providing them with personalized financial insights and recommendations.&lt;/p>
&lt;p>Credit Karma’s data science and engineering teams use machine learning to serve members the most relevant content and offers optimized for each member&amp;rsquo;s financial profile and goals. Avneesh Pratap and Raj Katakam, senior data engineers at Credit Karma, shared how Apache Beam enabled them to build a robust, resilient and scalable data and ML infrastructure. They also shared details on how unified Apache Beam data processing reduces the gap between experimenting with new ML pipelines and deploying them to production.&lt;/p>
&lt;h2 id="democratizing--scaling-mlops">Democratizing &amp;amp; Scaling MLOps&lt;/h2>
&lt;p>Before 2018, Credit Karma used a PHP-based ETL pipeline to ingest data from multiple financial services partners, perform different transformations and record the output into their own data warehouse. As the number of partners and members kept growing, the data teams at Credit Karma found it challenging to scale their MLOps. Making any changes and experimenting with new pipelines and attributes required significant engineering overhead. For instance, it took several weeks just to onboard a new partner. Their data engineering team was looking for ways to overcome performance drawbacks when ingesting data and scoring ML models, and to backfill new features within the same pipeline. In 2018, Credit Karma started designing their new data and ML platform - Vega - to keep up with the growing scale, understand members better, and increase their engagement with highly personalized offers.&lt;/p>
&lt;p>Apache Beam, the industry standard for unified distributed processing, has been placed at the core of Vega.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
When we started exploring Apache Beam, we found this programming model very promising. At first, we migrated just one partner [to an Apache Beam pipeline]. We were very impressed with the results and migrated to other partner pipelines right away.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/avneesh_pratap.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Avneesh Pratap
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>&lt;a href="/documentation/runners/capability-matrix/">With the Apache Beam Dataflow runner&lt;/a>, Credit Karma benefitted from the &lt;a href="https://cloud.google.com/dataflow">Google Cloud Dataflow&lt;/a> managed service to ensure increased scalability and efficiency. The Apache Beam &lt;a href="/documentation/io/built-in/">built-in I/O connectors&lt;/a> provide native support for a variety of sinks and sources, which has allowed Credit Karma to seamlessly integrate Beam into their ecosystem with various Google Cloud tools and services, including &lt;a href="https://cloud.google.com/pubsub/docs/overview">Pub/Sub&lt;/a>, &lt;a href="https://cloud.google.com/bigquery">BigQuery&lt;/a>, and &lt;a href="https://cloud.google.com/storage">Cloud Storage&lt;/a>.&lt;/p>
&lt;p>Credit Karma leveraged an Apache Beam kernel and &lt;a href="https://jupyter.org/">Jupyter Notebook&lt;/a> to create an exploratory environment in Vega and enable their data scientists to create new experimental data pipelines without engineering involvement.&lt;/p>
&lt;p>The data scientists at Credit Karma mostly use &lt;a href="https://en.wikipedia.org/wiki/SQL">SQL&lt;/a> and &lt;a href="https://www.python.org/">Python&lt;/a> to create new pipelines. Apache Beam provides powerful &lt;a href="/documentation/dsls/sql/extensions/user-defined-functions/">user-defined functions&lt;/a> with multi-language capabilities that allow for authoring scalar or aggregate functions in Java or Scala, and invoking them in SQL queries. To democratize Scala transformations for their data science team, Credit Karma’s engineers abstracted out the UDFs, &lt;a href="https://www.tensorflow.org/">TensorFlow Transforms&lt;/a>, and other complex transformations with numerous components - reusable and shareable “building blocks” - to create Credit Karma’s data and ML platform. Apache Beam and custom abstractions allow data scientists to operate these components when creating experimental pipelines and transformations, which can be easily reproduced in staging and production environments. Credit Karma’s data science team commits their code changes to a common GitHub repository; the pipelines are then merged into a staging environment and combined into a production application.&lt;/p>
&lt;p>The Apache Beam abstraction layer plays a crucial part in turning hypotheses and experiments into production pipelines when it comes to working with financials and sensitive information. Apache Beam enables masking and filtering data right inside data pipelines before writing it to the data warehouse. Credit Karma uses &lt;a href="https://thrift.apache.org/">Apache Thrift&lt;/a> annotations to label the column metadata, and Apache Beam pipelines filter specific elements from the data based on those annotations before it reaches the data warehouse. Credit Karma’s data science team can use the available abstractions or write data transformations on top of them to calculate new metrics and validate the ML models without seeing the actual data.&lt;/p>
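&lt;p>As a rough illustration of annotation-driven filtering, the following plain-Python sketch drops columns whose metadata carries a restricted label. The column names, the label names, and the &lt;code>mask_record&lt;/code> helper are all hypothetical. The case study does not describe Credit Karma&amp;rsquo;s actual Thrift annotations, and here a simple dictionary stands in for them.&lt;/p>

```python
# Hypothetical column metadata; in the case study the labels come from Thrift annotations.
COLUMN_ANNOTATIONS = {
    "member_id": {"sensitive"},
    "offer_id": set(),
    "partner_cost": {"financial"},
    "clicked": set(),
}

# Labels that must never reach the warehouse (assumed for this example).
RESTRICTED = {"sensitive", "financial"}


def mask_record(record, annotations=COLUMN_ANNOTATIONS, restricted=RESTRICTED):
    """Return a copy of the record with restricted and unlabeled columns removed."""
    return {
        column: value
        for column, value in record.items()
        # Columns missing from the metadata are dropped as well, as a cautious default.
        if column in annotations and annotations[column].isdisjoint(restricted)
    }


raw = {
    "member_id": 42,
    "offer_id": "o-1",
    "partner_cost": 3.5,
    "clicked": True,
    "debug_blob": "unlabeled",
}
masked = mask_record(raw)
```

&lt;p>Applied as a per-element step before the warehouse write, a filter like this lets downstream users compute metrics on the remaining columns without ever receiving the restricted ones.&lt;/p>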
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam helped us to ‘black-box’ the financial aspects and non-disclosable information so that teams can work with costs and financials without actually having access to all the data.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/raj_katakam.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Raj Katakam
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior ML Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Currently, about 20 Apache Beam pipelines are running in production and over 100 experimental pipelines are on the way. Plenty of the upcoming experimental pipelines leverage Apache Beam stateful processing to compute user aggregates right inside the streaming pipelines, instead of computing them in the data warehouse. Credit Karma’s data science team is also planning to leverage &lt;a href="/documentation/dsls/sql/overview/">Beam SQL&lt;/a> to use SQL syntax directly within the stream processing pipeline and easily create aggregations. The Apache Beam abstraction of the execution engines and a variety of runners allow Credit Karma to test data pipeline performance with different engines on mock data, create benchmarks and compare the results of different data ecosystems to optimize performance depending on specific use cases.&lt;/p>
&lt;h2 id="unified-stream--batch-data-ingestion">Unified Stream &amp;amp; Batch Data Ingestion&lt;/h2>
&lt;p>Apache Beam enabled Credit Karma to revamp one of their most significant use cases - the data ingestion pipeline. Numerous Credit Karma partners send data about their financial products and offerings via gateways to Pub/Sub for downstream processing. The streaming Apache Beam pipeline written in &lt;a href="https://spotify.github.io/scio/">Scio&lt;/a> consumes Pub/Sub topics in real-time and works with deeply nested JSON data, flattening it to the database row format. The pipeline also structures and partitions the data, then writes the outcome into the BigQuery data warehouse for ML model training.&lt;/p>
&lt;p>The Apache Beam unified programming model executes business logic for batch and streaming use cases, which allowed Credit Karma to develop one uniform pipeline. The data ingestion pipeline handles both real-time data and batched data ingestion to backfill historical data from partners into the data warehouse. Some of Credit Karma’s partners send historical data using object stores like GCS or S3, while some use Pub/Sub. Apache Beam unifies batch and stream processing by creating bounded and &lt;a href="/documentation/basics/">unbounded PCollections in the same pipeline&lt;/a> depending on the use case. Reading from a batch object store creates a bounded PCollection. Reading from a streaming and continuously-updating Pub/Sub creates an unbounded PCollection. In case of backfilling just new features for past dates, Credit Karma’s data engineering team configures the same streaming Apache Beam pipeline to process chunks of historical data sent by partners in a batch fashion: read the entire data set once and join historical data elements with the data for a particular date, in a job of finite length.&lt;/p>
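&lt;p>The idea of one piece of business logic serving both bounded and unbounded input can be mimicked in plain Python, where a list stands in for a bounded source and an endless generator for an unbounded one. This is a simplified analogy, not Beam or Scio code, and the message shape, field names, and helper functions are invented for the example.&lt;/p>

```python
import json
from itertools import islice


def flatten(nested, prefix=""):
    """Flatten nested dictionaries into one row with dotted column names."""
    row = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def ingest(messages):
    """Shared logic: parse and flatten each message from any iterable, finite or endless."""
    for raw in messages:
        yield flatten(json.loads(raw))


# Bounded input: a finite backfill batch, processed once to completion.
backfill = ['{"partner": "p1", "offer": {"apr": 19.9, "limit": 5000}}']
backfill_rows = list(ingest(backfill))


def live_feed():
    """Unbounded input: a made-up stream that never ends."""
    n = 0
    while True:
        yield json.dumps({"partner": "p2", "offer": {"apr": 20.0 + n}})
        n += 1


# The same ingest function handles the stream; only two rows are taken here.
first_two = list(islice(ingest(live_feed()), 2))
```

&lt;p>Only the source differs between the two runs. The parsing and flattening code is untouched, which is the property that makes backfills with the regular ingestion logic possible.&lt;/p>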
&lt;div class="post-scheme">
&lt;img src="/images/case-study/credit_karma/scheme-5.png" alt="Credit Karma Scheme">
&lt;/div>
&lt;p>Apache Beam is flexible: its constructs allow for generic pipeline coding and ease of configuration to add new data attributes, sources and partners without changing the pipeline code. The Cloud Dataflow service provides advanced features, such as dynamically &lt;a href="https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline">replacing an ongoing streaming job&lt;/a> with a new pipeline. The Apache Beam Dataflow runner enables Credit Karma’s data engineering team to deploy pipeline code updates without draining ongoing jobs.&lt;/p>
&lt;p>Credit Karma offers a way for third-party data provider partners to deploy their own models for internal decision-making and predictions. Some of those models require custom attributes for the past 3 to 8 months to be backfilled for model training, which creates huge data spikes. The Apache Beam abstraction layer and its Dataflow runner help minimize infrastructure management efforts when dealing with these regular spikes.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
With Apache Beam, you can easily add complex processing logic, for example, you can add configurable triggers on processing time. At the same time, Dataflow runner will manage execution for you, it uploads your executable code and dependencies automatically. And you have Dataflow auto-scaling working out of the box. You don’t have to worry about scaling horizontally.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/avneesh_pratap.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Avneesh Pratap
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Currently, the data ingestion pipeline processes and transforms more than 100 million messages, along with regular backfills, which is equivalent to around 5-10 TB worth of data.&lt;/p>
&lt;h2 id="self-service-machine-learning">Self-Service Machine Learning&lt;/h2>
&lt;p>At Credit Karma, the data scientists deal with modeling and analyzing the data, and it was crucial for the company to give them the power and flexibility to easily create, test and deploy new models. Apache Beam provided an abstraction that enabled the data scientists to write their own transformations on raw feature space for efficient ML engineering, while keeping the model serving layer independent of any custom code.&lt;/p>
&lt;p>Apache Beam helped to automate Credit Karma’s machine learning workflows, chain and score models, and prepare data for ML model training. Apache Beam provides the &lt;a href="/documentation/dsls/dataframes/overview/">Beam DataFrame API&lt;/a> to identify and implement the required &lt;a href="/documentation/ml/data-processing/">preprocessing&lt;/a> steps to iterate faster towards production. Apache Beam’s built-in I/O transforms allow for reading and writing &lt;a href="https://www.tensorflow.org/tutorials/load_data/tfrecord">TensorFlow TFRecord&lt;/a> files natively, and Credit Karma leverages this connectivity to preprocess data, score models, and use the model scores to recommend financial offers and content.&lt;/p>
&lt;p>Apache Beam enables Credit Karma to process large volumes of data, both for &lt;a href="/documentation/ml/overview/">preprocessing and model validation&lt;/a>, and experiment with data during preprocessing. They use &lt;a href="https://www.tensorflow.org/tfx/tutorials/transform/simple">TensorFlow Transforms&lt;/a> for applying transformations on data in batch and real-time model inferences. The output of TensorFlow Transforms is exported as a TensorFlow graph and is attached to models, making prediction services independent of any transformations. Credit Karma was able to offload ad hoc changes on prediction services by performing on-the-fly transformations on raw data, rather than aggregated data that required the involvement of their data engineering team. Their data scientists can now write any type of transformation on the raw data in SQL and deploy new models without any changes to the infrastructure.&lt;/p>
&lt;div class="post-scheme post-scheme--centered">
&lt;img src="/images/case-study/credit_karma/scheme-2.png" alt="Credit Karma Scheme">
&lt;/div>
&lt;p>Apache Beam and custom abstractions enable Credit Karma&amp;rsquo;s data science team to create new models, specifically for powering Credit Karma’s recommendations, without engineering overhead. The pieces of code created by data scientists are automatically compiled into an Airflow DAG, deployed to a staging sandbox and then to production. On the model training and inference side, Credit Karma’s data engineers use TensorFlow libraries built on top of Apache Beam - &lt;a href="https://www.tensorflow.org/tfx/model_analysis/get_started">TensorFlow Model Analysis (TFMA)&lt;/a> and &lt;a href="https://www.tensorflow.org/tfx/data_validation/get_started">TensorFlow Data Validation (TFDV)&lt;/a> - to perform validation of ML models and features and enable automated ML model refresh. For model analysis, they leverage native Apache Beam transforms to compute statistics and have built internal library transforms which validate new models for performance and accuracy. For instance, the batch Apache Beam pipeline calculates algorithmic features (scores) for ML models.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Apache Beam enabled self-service ML for our data scientists. They can plug in pieces of code, and those transformations will be automatically attached to models without any engineering involvement. Within seconds, our data science team can move DAGs from experimentation to production by just changing the deploy path.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/raj_katakam.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Raj Katakam
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior ML Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam-powered ML pipelines have proven to be incredibly reliable, processing more than 100 million events and updating ML models with fresh data daily.&lt;/p>
&lt;h2 id="enabling-real-time-data-availability">Enabling Real-Time Data Availability&lt;/h2>
&lt;p>Credit Karma leverages machine learning to analyze user behavior and recommend the most relevant offers and content. Before using Apache Beam, collecting user actions (logs) across multiple systems required a myriad of manual steps and multiple tools, which resulted in processing performance drawbacks and back-and-forth between teams whenever any changes were needed. Apache Beam helped to automate this logging pipeline. The cross-system user session logs are recorded in Kafka topics and are stored in Google Cloud Storage. The batch Apache Beam pipeline written in Scio parses the user actions for a particular tracking ID, transforms and cleans the data, and writes it to BigQuery.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Now that we have migrated the logging pipeline to Apache Beam, we are very happy with its speed and performance, and we are planning to transform this batch pipeline into a streaming one.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/credit_karma/raj_katakam.jpeg">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Raj Katakam
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior ML Engineer II @ Credit Karma
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/credit_karma/scheme-3.png" alt="Credit Karma Scheme">
&lt;/div>
&lt;p>With a subset of their ML models powering recommendations and processing data for nearly 130 million members, Credit Karma employs a FinOps culture to continuously explore ways to optimize infrastructure costs while increasing processing performance. TensorFlow models used in Credit Karma were historically scored sequentially, one at a time, even though input features were the same, which resulted in excessive compute costs.&lt;/p>
&lt;p>Apache Beam provided an opportunity to reconsider this approach. The data engineering team has developed an Apache Beam batch pipeline that combines multiple TensorFlow models into a single merged model, processes data for the last 3-9 months (~2-3 TB) at 5,000 events per second, and stores the output in the feature store. The features are then used in lightweight models for real-time predictions to recommend relevant content the very second the member logs in to the platform. This elegant solution saved compute resources and significantly decreased the associated costs, while increasing processing performance. The configuration is dynamic and allows data scientists to experiment and deploy new models seamlessly.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/credit_karma/scheme-4.png" alt="Credit Karma Scheme">
&lt;/div>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam has future-proofed Credit Karma’s data ecosystem for scalability and resilience, enabling them to manage over 20,000 features processed by 200 ML models, powering recommendations for nearly 130 million members daily. The scale of data processing has grown 2x since Apache Beam adoption, and their data engineering team did not have to undertake any significant changes to the infrastructure. Onboarding new partners now requires only minimal changes to the pipelines, whereas it took several weeks before Apache Beam was adopted. The Apache Beam ingestion pipeline accelerated data loading to the warehouse from days to under an hour, processing around 5-10 TB of data daily. The Apache Beam batch-scoring pipeline processes historic data and generates features for lightweight ML models, enabling real-time experiences for Credit Karma members.&lt;/p>
&lt;p>Apache Beam paved the way for an end-to-end data science process and efficient ML engineering at Credit Karma by abstracting the low-level details of the infrastructure and providing the data processing framework for the unified, self-service ML workflows. Credit Karma’s data scientists can now experiment with new models and have them deployed to production automatically, without the need for any engineering resources or infrastructure changes. Credit Karma presented their experience of building a self-service data and ML platform and scaling MLOps pipelines with Apache Beam at &lt;a href="https://2022.beamsummit.org/sessions/vega-mlops/">Beam Summit 2022&lt;/a>.&lt;/p>
&lt;h2 id="impact">Impact&lt;/h2>
&lt;p>These scalability initiatives enable Credit Karma to provide its members with a financial experience that is grounded in transparency, choice and personalization. People’s financial situations are always in flux, as are financial institutions&amp;rsquo; eligibility criteria when it comes to approving consumers for financial products, especially during times of economic uncertainty. As Credit Karma continues to scale its data ecosystem, including automated model refreshes, members have peace of mind that when they use Credit Karma, they can shop for financial products with more confidence by knowing their likelihood of approval – creating a win-win scenario for both its members and partners, no matter how uncertain times are.&lt;/p>
&lt;h2 id="learn-more">Learn More&lt;/h2>
&lt;p>&lt;a href="https://2022.beamsummit.org/sessions/vega-mlops/">Vega: Scaling MLOps Pipelines at Credit Karma using Apache Beam and Dataflow&lt;/a>&lt;/p>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Powering Streaming and Real-time ML at Intuit</title><link>/case-studies/intuit/</link><pubDate>Tue, 16 Aug 2022 00:12:00 +0000</pubDate><guid>/case-studies/intuit/</guid><description>
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/intuit.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“We feel that the runner agnosticism of Apache Beam affords flexibility and future-proofs our Stream Processing Platform as new runtimes are developed. Apache Beam enabled the democratization of stream processing at Intuit and the migration of many batch jobs to streaming applications.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/intuit/nick_hwang.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Nick Hwang
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager, Stream Processing Platform @ Intuit
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="powering-streaming-and-real-time-ml-at-intuit">Powering Streaming and Real-time ML at Intuit&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.intuit.com/">Intuit®&lt;/a> is a global technology platform that provides a range of financial and marketing automation solutions, including &lt;a href="https://turbotax.intuit.com/">TurboTax&lt;/a>, &lt;a href="https://quickbooks.intuit.com/">QuickBooks&lt;/a>, &lt;a href="https://mint.intuit.com/">Mint&lt;/a>, &lt;a href="https://www.creditkarma.com/">Credit Karma&lt;/a>, and &lt;a href="https://mailchimp.com/">Mailchimp&lt;/a>, on its mission to power prosperity around the world. Over 100 million people trust their tax preparation, small business accounting, and personal financial management to Intuit products.&lt;/p>
&lt;p>Intuit developed an internal self-service Stream Processing Platform that leverages Apache Beam to accelerate time-to-market for real-time applications.&lt;/p>
&lt;p>Nick Hwang, an Engineering Manager on the Intuit Data Infrastructure team, shared the story of how Apache Beam was used to build Intuit’s self-service Stream Processing Platform and provided a simple, intuitive way for developers to author, deploy, and manage streaming pipelines.&lt;/p>
&lt;h2 id="self-service-stream-processing">Self-service Stream Processing&lt;/h2>
&lt;p>When looking for AI and data-driven solutions to enhance their portfolio of financial management products, the Intuit Data Infrastructure and product teams saw an immense need for a self-service data processing platform. Their data engineers and developers needed a “paved road” to develop real-time applications while abstracting the low-level operational and infrastructure management details.&lt;/p>
&lt;p>In 2019, Intuit’s Data Infrastructure team started designing their Stream Processing Platform with a mission to enable developers to focus on business logic, while the platform handles all the operational and infrastructure management details on their behalf.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The promise of our platform is that you don't have to worry about the deployment at first. You just update your code artifact, add the transformations that you want, point the pipeline to your sources and sinks, and we'll take care of the rest. You click a button and the platform will deploy your jobs for you.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/intuit/nick_hwang.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Nick Hwang
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager, Stream Processing Platform @ Intuit
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam was selected as the core data processing technology of Intuit’s Stream Processing Platform because it offers a choice of &lt;a href="/documentation/sdks/java/">programming languages&lt;/a> and &lt;a href="/documentation/runners/capability-matrix/">execution engines&lt;/a>. Apache Beam’s portability and ease of adoption provided the necessary “jump-start” for the launch of the initial platform version, which used &lt;a href="https://samza.apache.org/">Apache Samza&lt;/a> as an execution engine and Apache Beam streaming pipelines to read from and write to &lt;a href="https://kafka.apache.org/">Kafka&lt;/a>.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The primary reason why we chose Apache Beam was runner agnosticism. Our platform was a long-term investment and we wanted to be prepared for whatever may be coming eventually.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/intuit/nick_hwang.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Nick Hwang
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager, Stream Processing Platform @ Intuit
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>In January 2020, the first version of Intuit’s Stream Processing Platform &lt;a href="https://www.ververica.com/blog/how-intuit-built-a-self-serve-stream-processing-platform-with-flink">was launched&lt;/a>. Soon enough, the Apache Beam abstraction of the execution engines proved its benefits, allowing Intuit to seamlessly switch its data processing infrastructure from Apache Samza to &lt;a href="https://flink.apache.org/">Apache Flink&lt;/a> without causing any user pain points or production downtime.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
When we decided to pivot from Apache Samza to Apache Flink, we had a couple dozen use cases and pipelines running in production, but none of the users had to change their code. The benefits of Apache Beam really showcased themselves in that case.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/intuit/nick_hwang.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Nick Hwang
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager, Stream Processing Platform @ Intuit
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Intuit Stream Processing Platform team benefitted from Apache Beam’s extensibility, which allowed them to easily wrap Apache Beam with a custom SDK layer for better interoperability with their specific Kafka installation. They paired the SDK with a graphical user interface to provide a visual way to design, manage, deploy, monitor, and debug data processing pipelines, as well as with &lt;a href="https://argoproj.github.io/">Argo Workflows&lt;/a> to facilitate deployment on Kubernetes. The Intuit Stream Processing Platform team has also developed an internal service that filters and manages metrics by category when routing them to &lt;a href="https://docs.wavefront.com/wavefront_introduction.html">Wavefront&lt;/a>, improving observability and monitoring of pipeline health. The Apache Beam &lt;a href="/documentation/io/built-in/">built-in I/O connectors&lt;/a> helped provide native support for a variety of sinks and sources.&lt;/p>
&lt;p>The Stream Processing Platform provides developers with a full-stack environment to visually design streaming pipelines; test, provision, and promote them to production; and monitor the pipelines in production. Developers create Apache Beam pipelines with the Beam Java SDK at the Stream Processing Platform’s Application Layer (see layers below). Intuit’s graphical user interface (the UX Layer) enables visual configuration of sinks and sources, compute resource scaling, pipeline lifecycle management, monitoring, and metrics. At the Control Layer, the &lt;a href="https://spring.io/">Spring-based&lt;/a> backend maintains metadata on all pipelines running on the platform and interacts with the Intuit ecosystem for data governance, asset management, and data lineage. The UX Layer communicates with the Control Layer, which invokes Argo Workflows to deploy Apache Beam pipelines onto an Apache Flink runtime layer hosted on Kubernetes.&lt;/p>
&lt;p>With the promise of an out-of-the-box solution, Intuit’s Stream Processing Platform has been designed to allow reusable templated implementations to accelerate the development of common use cases, while still providing the ability to customize for standalone applications. For instance, Intuit created its own DSL interface to provide custom configurations for simple transformations of the clickstream topics.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/intuit/stream_processing_tech_stack.png" alt="Intuit Stream Processing Platform’s Tech Stack">
&lt;span>Intuit Stream Processing Platform’s Tech Stack&lt;/span>
&lt;/div>
&lt;p>The platform empowered much easier adoption of stream processing, providing self-service capabilities for Intuit’s data engineers and developers.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The whole idea of our platform is to minimize the barrier to entry to get your real-time application up and running. Like, “I just want to run this SQL query on a Kafka topic and write it to some sink, tell me how to do that in a day and not two months.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/intuit/nick_hwang.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Nick Hwang
&lt;/div>
&lt;div class="case-study-quote-author-position">
Engineering Manager, Stream Processing Platform @ Intuit
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="powering-real-time-data">Powering Real-time Data&lt;/h2>
&lt;p>Apache Beam-powered unified &lt;a href="https://www.gartner.com/en/information-technology/glossary/clickstream-analysis">clickstream&lt;/a> processing is the most impactful of Intuit’s use cases. The Apache Beam streaming pipeline consumes, aggregates, and processes raw clickstream events, such as website visits, from Kafka across the large portfolio of Intuit’s products. The clickstream pipeline enriches the data with geolocation and other new features, then sessionizes and standardizes it for writing to Kafka and use by downstream applications, processing over 60,000 transactions per second. The Intuit Data Infrastructure team values Apache Beam capabilities such as &lt;a href="/documentation/programming-guide/#windowing">windowing&lt;/a>, &lt;a href="/documentation/programming-guide/#state-and-timers">timers&lt;/a>, and &lt;a href="/blog/stateful-processing/">stateful processing&lt;/a> for fine-grained control over data freshness. Apache Beam stream processing allows Intuit to enrich clickstream data with new features every minute instead of every 4 hours, improving the availability of real-time data by 240x, and to reduce costs associated with memory and compute resources by 5x.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/intuit/intuit_stream_processing_pipeline.png" alt="Intuit Stream Processing Platform’s Pipeline Topology">
&lt;span>Intuit Stream Processing Platform’s Pipeline Topology&lt;/span>
&lt;/div>
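&lt;p>To make the sessionization step more concrete, here is a minimal plain-Python sketch of gap-based session grouping. It is not Intuit’s code and does not use the Apache Beam API; the &lt;code>sessionize&lt;/code> function, the event shape, and the 30-minute inactivity gap are all assumptions chosen for illustration.&lt;/p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical inactivity gap; the case study does not state the real value.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(
    events: List[Tuple[str, int]], gap: int = SESSION_GAP_SECONDS
) -> Dict[str, List[List[int]]]:
    """Group (user_id, epoch_seconds) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event is at least `gap` seconds.
    """
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for user_id, timestamps in by_user.items():
        timestamps.sort()  # tolerate out-of-order arrival
        user_sessions = [[timestamps[0]]]
        for ts in timestamps[1:]:
            if ts - user_sessions[-1][-1] >= gap:
                user_sessions.append([ts])  # gap exceeded: open a new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user_id] = user_sessions
    return sessions
```

&lt;p>In a streaming engine the same grouping is done incrementally per key rather than over a complete list, which is where windowing, state, and timers come in.&lt;/p>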
&lt;p>Another front-and-center Apache Beam use case from Intuit’s business perspective is the feature store ingestion platform that enables new AI and ML-powered customer experiences. Several Apache Beam pipelines take in real-time features generated by other Apache Beam pipelines on the platform from Kafka and write them to the Intuit feature store for ML model training and inference. Pipelines generating real-time features can also use a capability offered by the platform to &amp;ldquo;backfill&amp;rdquo; features when historic data needs to be re-featurized, even if the features are stateful. The same stream processing code will first read Intuit&amp;rsquo;s historic data from the data lake, reprocess the data to bootstrap the pipeline&amp;rsquo;s state, then switch to a streaming context that uses the bootstrapped state. This is all done in a way that abstracts the complexity of the backfill process from the machine learning engineer or data scientist owning the pipeline.&lt;/p>
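&lt;p>The backfill idea described above, where the same code first replays historic data to build state and then continues on the live stream, can be sketched in plain Python. This is only an illustration under stated assumptions: the &lt;code>RunningMean&lt;/code> feature, the event shape, and the &lt;code>run_with_backfill&lt;/code> helper are hypothetical and are not part of Intuit’s platform or the Apache Beam API.&lt;/p>

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (key, value); a hypothetical event shape


class RunningMean:
    """Stateful feature: per-key running mean of event values."""

    def __init__(self) -> None:
        self._count: Dict[str, int] = {}
        self._total: Dict[str, float] = {}

    def process(self, event: Event) -> Tuple[str, float]:
        key, value = event
        self._count[key] = self._count.get(key, 0) + 1
        self._total[key] = self._total.get(key, 0.0) + value
        return key, self._total[key] / self._count[key]


def run_with_backfill(
    historic: Iterable[Event], live: Iterable[Event]
) -> Iterator[Tuple[str, str, float]]:
    """Run the same stateful logic over historic data, then live data."""
    feature = RunningMean()
    # Phase 1: replay historic data to bootstrap the state.
    for event in historic:
        key, mean = feature.process(event)
        yield "backfill", key, mean
    # Phase 2: continue on the live stream, reusing the bootstrapped state.
    for event in live:
        key, mean = feature.process(event)
        yield "live", key, mean
```

&lt;p>Because both phases call the same &lt;code>process&lt;/code> method on the same object, the first live result already reflects the historic events.&lt;/p>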
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Since Intuit Stream Processing Platform’s launch, the number of Apache Beam-powered streaming pipelines has been growing 2x per year and, as of July 2022, had reached over 160 active production pipelines running on 710 nodes across 6 different Kubernetes clusters. The Apache Beam pipelines handle ~17.3 billion events and 82 TB of data, processing 800,000 transactions per second in peak seasons.&lt;/p>
&lt;p>Apache Beam and its abstraction of the execution engines allowed Intuit to seamlessly switch its primary runner without rewriting pipeline code for the new execution environment. It also provided confidence by future-proofing the Intuit Stream Processing Platform for flexibility as new execution runtimes keep evolving. Apache Beam helped lower the entry barrier, democratize stream processing across Intuit’s development teams, and ensure fast onboarding for engineers without prior experience with Apache Flink or other streaming data processing tools. Apache Beam facilitated the migration from batch jobs to streaming applications, enabling new real-time and ML-powered experiences for Intuit customers.&lt;/p>
&lt;p>With Apache Beam, Intuit accelerated the development and launch of production-grade streaming data pipelines 3x, from 3 months to just 1 month. The time to design pipelines for preproduction shrank to just 10 days. Migration from batch jobs to Apache Beam streaming pipelines resulted in a 5x memory and compute cost optimization. Intuit continues developing Apache Beam streaming pipelines for new use cases, with 150 more pipelines in preproduction and coming to production soon.&lt;/p>
&lt;h2 id="learn-more">Learn More&lt;/h2>
&lt;iframe class="video video--medium-size" width="560" height="315" src="https://www.youtube.com/embed/H4s7rAlk68w" frameborder="0" allowfullscreen>&lt;/iframe>
&lt;br>&lt;br>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Real-time ML with Beam at Lyft</title><link>/case-studies/lyft/</link><pubDate>Fri, 17 Jun 2022 00:12:00 +0000</pubDate><guid>/case-studies/lyft/</guid><description>
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img class="case-study-opinion-img-cropped case-study-opinion-img-center" src="/images/logos/powered-by/lyft.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Lyft Marketplace team aims to improve our business efficiency by being nimble to real-world dynamics. Apache Beam has enabled us to meet the goal of having a robust and scalable ML infrastructure for improving model accuracy with features in real-time. These real-time features support critical functions like Forecasting, Primetime, Dispatch.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/lyft/ravi_kiran_magham.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Ravi Kiran Magham
&lt;/div>
&lt;div class="case-study-quote-author-position">
Software Engineer @ Lyft
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="real-time-ml-with-beam-at-lyft">Real-time ML with Beam at Lyft&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.lyft.com/">Lyft, Inc.&lt;/a> is an American mobility-as-a-service provider that offers ride-hailing, car and motorized scooter rentals, bicycle-sharing, food delivery, and business transportation solutions. Lyft is based in San Francisco, California, and &lt;a href="https://www.lyft.com/rider/cities">operates in&lt;/a> 644 cities in the United States and 12 cities in Canada.&lt;/p>
&lt;p>As you might expect from a company as large as Lyft, connecting drivers and riders in space and time at such a scale requires a powerful real-time streaming infrastructure. Ravi Kiran Magham, Software Engineer at Lyft, shared the story of how Apache Beam has become a mission-critical and integral real-time data processing technology for Lyft by enabling large-scale streaming data processing and machine learning pipelines.&lt;/p>
&lt;h2 id="democratizing-stream-processing">Democratizing Stream Processing&lt;/h2>
&lt;p>Lyft originally built streaming ETL pipelines to transform, enrich, and sink events generated by application services to their data lake in &lt;a href="https://aws.amazon.com/s3/">AWS S3&lt;/a> using &lt;a href="https://aws.amazon.com/kinesis/">Amazon Kinesis&lt;/a> and &lt;a href="https://flink.apache.org/">Apache Flink&lt;/a>. Apache Flink is the foundation of Lyft’s streaming architecture and was chosen over Apache Spark due to its robust, fault-tolerant, and intuitive API for distributed stateful stream processing, exactly-once processing, and variety of I/O connectors.&lt;/p>
&lt;p>Lyft’s popularity and growth were bringing new demands to data streaming infrastructure: more teams with diverse programming language preferences wanted to explore event-driven streaming applications, and build streaming features for real-time machine learning models to make business more efficient, enhance customer experiences, and support time-sensitive compliance operations. The Data Platform team looked into improving the prime time (surge pricing) computation for the Marketplace team, which had a service orchestrating an ensemble of ML models, exchanging data over &lt;a href="https://redis.com/">Redis&lt;/a>. The teams aimed at reducing code complexity and improving latency (from 5 minutes to under 1 minute end to end). With Python being a prerequisite for the Marketplace team and Java being heavily used by the Data Platform team, Lyft started exploring the &lt;a href="/">Apache Beam&lt;/a> &lt;a href="/roadmap/portability/">portability framework&lt;/a> in 2019 to democratize streaming for all teams.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The Apache Beam portability and multi-language capabilities were the key pique and the primary reason for us to start exploring Beam in a bigger way.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/lyft/ravi_kiran_magham.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Ravi Kiran Magham
&lt;/div>
&lt;div class="case-study-quote-author-position">
Software Engineer @ Lyft
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam provides a solution to the programming language and data processing engine dilemma, as it offers a variety of &lt;a href="/documentation/basics/#runner">runners&lt;/a> (including the &lt;a href="/documentation/runners/flink/">Beam Flink runner&lt;/a> for Apache Flink) and a &lt;a href="/documentation/sdks/java/">variety of programming language SDKs&lt;/a>. Apache Beam offers an ultimate level of portability with its concept of “write once, run anywhere” and its ability to create &lt;a href="/documentation/programming-guide/#multi-language-pipelines">multi-language pipelines&lt;/a>, that is, data pipelines that use transforms from more than one programming language.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Leveraging Apache Beam has been a “win-win” decision for us because our data infra teams use Java but we are able to offer Python SDK for our product teams, as it has been the de-facto language that they prefer. We write streaming pipelines with ease and comfort and run them on the Beam Flink runner.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/lyft/ravi_kiran_magham.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Ravi Kiran Magham
&lt;/div>
&lt;div class="case-study-quote-author-position">
Software Engineer @ Lyft
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Data Platform team built a control plane of in-house services and &lt;a href="https://github.com/lyft/flinkk8soperator">FlinkK8sOperator&lt;/a> to manage Flink applications on a Kubernetes cluster and deploy streaming Apache Beam and Apache Flink jobs. Lyft uses a blue/green deployment strategy on critical pipelines to minimize any downtime and uses custom macros for improved observability and seamless integration of the CI/CD deployments. To improve developer productivity, the Data Platform team offers a lightweight, YAML-based DSL to abstract the source and sink configurations, and provides reusable Apache Beam PTransforms for filtering and enrichment of incoming events.&lt;/p>
&lt;h2 id="powering-real-time-machine-learning-pipelines">Powering Real-time Machine Learning Pipelines&lt;/h2>
&lt;p>Lyft Marketplace plays a pivotal role in optimizing fleet demand and supply prediction, dynamic pricing, ETA calculation, and more. The Apache Beam Python SDK and Flink Runner enable the team to be nimble to change and support the demands for real-time ML – streaming feature generation and model execution. The Data Platform team has extended the streaming infrastructure to support Continual Learning use cases. Apache Beam powers continuous training of ML models with real-time data over larger windows of 2 hours to identify and fine-tune biases in cost and ETA.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/lyft/apache_beam_ml_features_generation.svg" alt="Apache Beam Feature Generation and ML Model Execution">
&lt;span>Apache Beam Feature Generation and ML Model Execution&lt;/span>
&lt;/div>
&lt;p>Lyft separated Feature Generation and ML Model Execution into multiple streaming pipelines. The streaming Apache Beam pipeline generates features in real-time and writes them to a Kafka topic to be consumed by the model execution pipeline. Based on user configuration, the features are replicated and keyed out by model ID to &lt;a href="/blog/stateful-processing/">stateful&lt;/a> ParDo transforms, which leverage &lt;a href="/documentation/programming-guide/#timers">timers&lt;/a> and/or data (feature) availability to invoke ML models. Features are stored in a global window and the &lt;a href="/documentation/programming-guide/#state-and-timers">state&lt;/a> is explicitly cleaned up. The ML models run as part of the Model Serving infrastructure and their output can be an input feature to another ML model. To support this DAG workflow, Apache Beam pipelines write the output to Kafka and feed it to the model execution streaming pipeline for processing, in addition to writing it to Redis.&lt;/p>
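&lt;p>The trigger logic described above, where features are buffered per model ID and the model is invoked either when the required features have arrived or when a timer fires, can be sketched in plain Python. This is a simplified illustration, not Lyft’s implementation and not the Apache Beam state and timers API; the &lt;code>ModelTrigger&lt;/code> class, its parameters, and the explicit &lt;code>now&lt;/code> arguments are assumptions.&lt;/p>

```python
from typing import Callable, Dict, Optional, Set


class ModelTrigger:
    """Buffers features for one model ID and decides when to invoke the model."""

    def __init__(
        self,
        required: Set[str],
        timeout: float,
        invoke: Callable[[Dict[str, float]], None],
    ) -> None:
        self.required = required
        self.timeout = timeout  # seconds to wait after the first feature
        self.invoke = invoke
        self.features: Dict[str, float] = {}  # buffered state
        self.deadline: Optional[float] = None  # pending timer, if any

    def on_feature(self, name: str, value: float, now: float) -> None:
        if self.deadline is None:
            self.deadline = now + self.timeout  # set the timer on first arrival
        self.features[name] = value
        if self.required.issubset(self.features):
            self._fire()  # data-availability trigger

    def on_time(self, now: float) -> None:
        if self.deadline is not None and now >= self.deadline:
            self._fire()  # timer trigger, possibly with partial features

    def _fire(self) -> None:
        self.invoke(dict(self.features))
        self.features.clear()  # explicit state cleanup
        self.deadline = None
```

&lt;p>In a real pipeline, one such buffer would exist per key (model ID), and the runner, not the caller, would deliver timer callbacks.&lt;/p>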
&lt;p>The complex real-time Feature Generation involves processing ~4 million 1 KB events per minute with sub-second latency, generating ~100 features on multiple event attributes across space and time granularities (1 and 5 minutes). Apache Beam allowed the Lyft Marketplace team to reduce latency by &lt;a href="https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/290/The%20magic%20behind%20your%20Lyft%20ride%20prices_%20A%20case%20study%20on%20machine%20learning%20and%20streaming%20Presentation.pdf">60%&lt;/a>, significantly simplify the code, and onboard many teams and use cases onto streaming.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
The Marketplace team are &lt;a href="https://eng.lyft.com/gotchas-of-stream-processing-data-skewness-cfba58eb45d4">heavy users of Apache Beam&lt;/a> for real-time feature computation and model executions. Processing events in real-time with a sub-second latency allows our ML models to understand marketplace dynamics early and make informed decisions.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/lyft/ravi_kiran_magham.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Ravi Kiran Magham
&lt;/div>
&lt;div class="case-study-quote-author-position">
Software Engineer @ Lyft
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="amplifying-use-cases">Amplifying Use Cases&lt;/h2>
&lt;p>Lyft has leveraged Apache Beam for more than 60 use cases and enabled them to complete critical business commitments and improve real-time user experiences.&lt;/p>
&lt;p>For example, Lyft&amp;rsquo;s Map Data Delivery team moved from a batch process to a streaming pipeline for identifying road closures in real-time. Their Routing Engine uses this information to determine the best routes, improve ETA and provide a better driver and customer experience. The job processes ~400k events per second, conflates streams of data coming from 3rd party road closures and real-time traffic data to determine actual closures and publish them as events to Kafka. A custom S3 PTransform allows for the job to regularly publish a snapshot of closures for downstream batch processing.&lt;/p>
&lt;p>Apache Beam enabled Lyft to optimize a very specific use case that relates to reporting pick-ups and drop-offs at airports. Airports require mobility applications to report every pick-up and drop-off and match them with the time of fleet entry and exit. Failing to do so results in a lower compliance score and even risk of being penalized. Originally, Lyft had a complicated implementation using the &lt;a href="https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-py.html">KCL library&lt;/a> to consume events and store them in Redis. Python worker processes ran at regular intervals to consume data from Redis, join and enrich the data with service API calls, and send the output to airport applications. With that implementation, late-arriving updates and out-of-order events significantly impacted the completeness score. Lyft migrated the use case to a streaming Apache Beam pipeline with state and timers to keep events in a global window and manage sessions. Apache Beam helped Lyft achieve a top compliance score by improving the latency of event reporting from 5 to 2 seconds and reducing missing entry/exit data to 1.3%.&lt;/p>
&lt;p>Like many companies shaking up standard business models, Lyft relies on open-source software and likes to give back to the community. Many of the big data frameworks, tools, and implementations developed by Lyft are open-sourced on their &lt;a href="https://github.com/orgs/lyft/repositories">GitHub&lt;/a>. Lyft has been an active Apache Beam contributor since 2018, and Lyft engineers have presented their Apache Beam integrations at various events, such as &lt;a href="https://www.youtube.com/watch?v=D_NA-LY1xP0">Beam Summit North America&lt;/a>, &lt;a href="https://2019.berlinbuzzwords.de/sites/2019.berlinbuzzwords.de/files/media/documents/streaming_at_lyft_-_berlin_buzzwords_2019.pdf">Berlin Buzzwords&lt;/a>, &lt;a href="https://conferences.oreilly.com/strata/strata-ca-2019/cdn.oreillystatic.com/en/assets/1/event/290/The%20magic%20behind%20your%20Lyft%20ride%20prices_%20A%20case%20study%20on%20machine%20learning%20and%20streaming%20Presentation.pdf">O’Reilly Strata Data &amp;amp; AI&lt;/a>, and more.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The portability of the Apache Beam model is the key to distributed execution. It enabled Lyft to run mission-critical data pipelines written in a non-JVM language on a JVM-based runner. Thus, they avoided code rewrites and sidestepped the potential cost of many API styles and runtime environments, reducing pipeline development time from multiple days to just hours. Full isolation of user code and native CPython execution without library restrictions resulted in easy onboarding and adoption. Apache Beam’s multi-language and cross-language capabilities solved Lyft’s programming language dilemma. With the unified programming model, Lyft is no longer tied to a specific technology stack.&lt;/p>
&lt;p>Apache Beam enabled Lyft to switch from batch ML model training to real-time ML training with granular control of data freshness using windowing. Their data engineering and product teams can use both Python and Java, based on the appropriateness for a particular task or their preference. Apache Beam has helped Lyft successfully build and scale 60+ streaming pipelines processing events at very low latencies in near-real-time. New use cases keep coming, and Lyft is planning on leveraging &lt;a href="/documentation/dsls/sql/overview/">Beam SQL&lt;/a> and the &lt;a href="/documentation/sdks/go/">Go SDK&lt;/a> to provide a full range of Apache Beam multi-language capabilities for their teams.&lt;/p>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'Lyft')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'Lyft')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Real-time Event Stream Processing at Scale for Palo Alto Networks</title><link>/case-studies/paloalto/</link><pubDate>Tue, 22 Feb 2022 20:19:00 +0000</pubDate><guid>/case-studies/paloalto/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/paloalto.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“I know one thing: Beam is very powerful and the abstraction is its most significant feature. With the right abstraction we have the flexibility to run workloads where needed. Thanks to Beam, we are not locked to any vendor, and we don’t need to change anything else if we make the switch.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/paloalto/talat_uyarer.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Talat Uyarer
&lt;/div>
&lt;div class="case-study-quote-author-position">
Sr Principal Software Engineer
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="real-time-event-stream-processing-at-scale-for-palo-alto-networks">Real-time Event Stream Processing at Scale for Palo Alto Networks&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.paloaltonetworks.com/">Palo Alto Networks, Inc.&lt;/a> is a global cybersecurity leader with a comprehensive
portfolio of enterprise products. Palo Alto Networks protects and provides visibility, trusted intelligence, automation,
and flexibility to &lt;a href="https://www.paloaltonetworks.com/about-us">over 85K customers&lt;/a> across clouds, networks, and devices.&lt;/p>
&lt;p>Palo Alto Networks’ integrated security operations platform - &lt;a href="https://www.paloaltonetworks.com/cortex">Cortex™&lt;/a> -
applies AI and machine learning to enable security automation, advanced threat intelligence, and effective rapid
security responses for Palo Alto Networks’
customers. &lt;a href="https://www.paloaltonetworks.com/cortex/cortex-data-lake">Cortex™ Data Lake&lt;/a> infrastructure collects,
integrates, and normalizes enterprises’ security data combined with trillions of multi-source artifacts.&lt;/p>
&lt;p>Cortex™ data infrastructure currently processes ~10 million security log events per second, at ~3 PB per day, which
is on the high end of real-time streaming processing scale in the industry. Palo Alto Networks’ Sr Principal Software
Engineer, Talat Uyarer, shared insights on how Apache Beam provides a high-performing, reliable, and resilient data
processing framework to support this scale.&lt;/p>
&lt;h2 id="large-scale-streaming-infrastructure">Large-scale Streaming Infrastructure&lt;/h2>
&lt;p>When building the data infrastructure from the ground up, Palo Alto Networks’ Cortex Data Lake team faced a challenging
task. They needed to ensure that the Cortex platform could stream and process petabyte-sized data coming from customers’
firewalls, networks, and all kinds of devices to customers and internal apps with low latency and perfect quality.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/paloalto/data_lake_scheme.png" alt="Cortex™ Data Lake">
&lt;/div>
&lt;p>To meet the SLAs, the Cortex Data Lake team had to design a large-scale data infrastructure for real-time processing and
reduce time-to-value. One of their initial architectural decisions was to leverage Apache Beam, the industry standard
for unified distributed processing, due to its portability and abstraction.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Beam is very flexible, its abstraction from implementation details of distributed data processing is wonderful for delivering proofs of concept really fast.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/paloalto/talat_uyarer.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Talat Uyarer
&lt;/div>
&lt;div class="case-study-quote-author-position">
Sr Principal Software Engineer
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam provides a variety of runners, offering freedom of choice between different data processing engines. Palo
Alto Networks’ data infrastructure is hosted entirely on &lt;a href="https://cloud.google.com/gcp/">Google Cloud Platform&lt;/a>,
and &lt;a href="/documentation/runners/capability-matrix/">with the Apache Beam Dataflow runner&lt;/a>, the team could
easily benefit from &lt;a href="https://cloud.google.com/dataflow">Google Cloud Dataflow&lt;/a>’s managed service and
&lt;a href="https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#horizontal-autoscaling">autotuning&lt;/a> capabilities.
Apache Kafka was selected as the message broker for the backend, and all events were stored as binary data with a common
schema on multiple Kafka clusters.&lt;/p>
&lt;p>Since the system had to serve multiple tenants, the Cortex Data Lake team considered the option of having separate
data processing infrastructures for each customer, with multiple upstream applications creating their own streaming jobs,
consuming and processing events from Kafka directly. However, the team anticipated possible issues related to
Kafka migrations and partition creation, as well as a lack of visibility into the tenant use cases, which might arise
when having multiple infrastructures.&lt;/p>
&lt;p>Hence, the Cortex Data Lake team took a common streaming infrastructure approach. At the core of the common data
infrastructure, Apache Beam served as a unified programming model to implement business logic just once for all internal
and customer tenant applications.&lt;/p>
&lt;p>The first data workflows that the Cortex Data Lake team implemented were simple: reading from Kafka, creating a batch
job, and writing the results to a sink. The release of
the &lt;a href="/get-started/downloads/#releases">Apache Beam version with SQL support&lt;/a> opened up new
possibilities. &lt;a href="/documentation/dsls/sql/calcite/overview/">Beam Calcite SQL&lt;/a> provides full
support for &lt;a href="/documentation/dsls/sql/calcite/data-types/">complex Apache Calcite data types&lt;/a>,
including nested rows, in SQL statements, so developers can use SQL queries in an Apache Beam pipeline for composite
transforms. The Cortex Data Lake team decided to take advantage of
&lt;a href="/documentation/dsls/sql/overview/">Beam SQL&lt;/a> to write Beam pipelines with standard SQL
statements.&lt;/p>
&lt;p>The main challenge of the common infrastructure was to support a variety of business logic customizations and
user-defined functions and transform them to a variety of sink formats. Tenant applications needed to consume data from
dynamically-changing Kafka clusters, and streaming pipeline &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">DAGs&lt;/a>
had to be regenerated if the jobs’ source had been updated.&lt;/p>
&lt;p>The Cortex Data Lake team developed their own “subscription” model that allows tenant applications to “subscribe” to the
streaming job when sending job deployment requests to the REST API service. The Subscription service abstracts tenant
applications from the changes in the DAG by storing infrastructure-specific information in a metadata service. This way, the
streaming jobs stay in sync with the dynamic Kafka infrastructure.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/case-study/paloalto/subscription_service_scheme.png" alt="Cortex™ Data Lake Subscription Service">
&lt;/div>
&lt;p>Apache Beam is flexible: it allows creating streaming jobs dynamically, on the fly. The Apache Beam constructs allow for
generic pipeline coding, enabling pipelines that process data even if schemas are not fully defined in advance. Cortex’s
Subscription Service generates an Apache Beam pipeline DAG based on the tenant application’s REST payload and submits the
job to the runner. When the job is
running, &lt;a href="https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/io/kafka/KafkaIO.html">Apache Beam SDK’s Kafka I/O&lt;/a>
returns an unbounded collection of Kafka records as
a &lt;a href="https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/values/PCollection.html">PCollection&lt;/a>.
&lt;a href="https://avro.apache.org/">Apache Avro&lt;/a> turns the binary Kafka representation into generic records, which are further
converted to the &lt;a href="https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/values/Row.html">Apache Beam Row&lt;/a>
format. The Row structure supports primitives, byte arrays, and containers, and allows organizing values in the same
order as the schema definition.&lt;/p>
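&lt;p>The schema-ordered layout described above can be illustrated with a small plain-Python sketch. It does not use the Beam or Avro APIs; the schema, its field names, and the to_row helper are hypothetical and only show how a decoded record's values might be arranged in the order the schema defines.&lt;/p>

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: ordered (field name, type) pairs, listed the way a
# row schema would list them. These field names are illustrative only.
SCHEMA_FIELDS: List[Tuple[str, type]] = [
    ("device_id", str),
    ("bytes_sent", int),
    ("payload", bytes),
]


def to_row(record: Dict[str, Any]) -> Tuple[Any, ...]:
    """Arrange a decoded record's values in schema order, checking types."""
    values = []
    for name, expected_type in SCHEMA_FIELDS:
        value = record[name]
        if not isinstance(value, expected_type):
            raise TypeError(f"{name}: expected {expected_type.__name__}")
        values.append(value)
    return tuple(values)
```

&lt;p>Because positions follow the schema, downstream steps in such a sketch can address values by index instead of looking them up by name in every record.&lt;/p>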
&lt;p>Apache Beam’s cross-language transforms allow the Cortex Data Lake team to execute SQL with Java. The output of
an &lt;a href="https://beam.apache.org/releases/javadoc/2.7.0/org/apache/beam/sdk/extensions/sql/SqlTransform.html">SQL Transform&lt;/a>
performed inside the Apache Beam pipeline is sequentially converted from Beam Row format to a generic record, then to
the output format required by a subscriber application, such as Avro, JSON, CSV, etc.&lt;/p>
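&lt;p>As a rough illustration of the last conversion step, the sketch below renders one schema-ordered row in the format a subscriber requested. It is plain Python rather than the team's pipeline code, the format_row helper is a hypothetical name, and Avro output is left out because it would need a third-party library.&lt;/p>

```python
import csv
import io
import json
from typing import Any, Sequence


def format_row(field_names: Sequence[str], row: Sequence[Any], fmt: str) -> str:
    """Render one schema-ordered row in the format a subscriber asked for."""
    if fmt == "json":
        # Pair each value with its field name; dict keeps insertion order.
        return json.dumps(dict(zip(field_names, row)))
    if fmt == "csv":
        buffer = io.StringIO()
        # An empty line terminator returns the bare record without a newline.
        csv.writer(buffer, lineterminator="").writerow(row)
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {fmt}")
```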
&lt;p>Once the base use cases had been implemented, the Cortex Data Lake team turned to more complex transformations, such as
filtering a subset of events directly inside Apache Beam pipelines, and kept looking into customization and
optimization.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
We have more than 10 use cases running across customers and apps. More are coming, like the machine learning use cases .... for these use cases, Beam provides a really good programming model.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/paloalto/talat_uyarer.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Talat Uyarer
&lt;/div>
&lt;div class="case-study-quote-author-position">
Sr Principal Software Engineer
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam provides a pluggable data processing model that seamlessly integrates with various tools and technologies,
which allowed the Cortex Data Lake team to customize their data processing to performance requirements and specific use
cases.&lt;/p>
&lt;h2 id="customizing-serialization-for-use-cases">Customizing Serialization for Use Cases&lt;/h2>
&lt;p>Palo Alto Networks’ streaming data infrastructure deals with hundreds of billions of real-time security events every
day, and even a sub-second difference in processing times is crucial.&lt;/p>
&lt;p>To enhance performance, the Cortex Data Lake team developed their own library for direct serialization and
deserialization. The library reads Avro binary records from Kafka and turns them into the Beam Row format, then converts
the Beam Row format pipeline output to the required sink format.&lt;/p>
&lt;p>This custom library replaced serializing data into generic records with steps optimized for Palo Alto Networks’ specific
use cases. Direct serialization eliminated shuffling and creating additional memory copies from processing steps.&lt;/p>
&lt;p>This customization increased serialization performance 10x, making it possible to process up to 3K events per second per
vCPU with reduced latency and infrastructure costs.&lt;/p>
&lt;div class="post-scheme vertical-scheme">
&lt;img src="/images/case-study/paloalto/direct_serialization.png" alt="Direct Serialization from Avro to Beam Row">
&lt;/div>
&lt;h2 id="in-flight-streaming-job-updates">In-flight Streaming Job Updates&lt;/h2>
&lt;p>At a scale of thousands of jobs running concurrently, the Cortex Data Lake team faced cases when they needed to improve the
pipeline code or fix bugs for an ongoing job. Google Cloud Dataflow provides a way
to &lt;a href="https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline">replace an “in-flight” streaming job&lt;/a> with a new
job that runs updated Apache Beam pipeline code. However, Palo Alto Networks needed to expand the supported
scenarios.&lt;/p>
&lt;p>To address updating jobs in the dynamically-changing Kafka infrastructure, the Cortex Data Lake team created an
additional workflow in their deployment service
which &lt;a href="https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain">drains the jobs&lt;/a> if the change
is &lt;a href="https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#UpdateSchemas">not permitted&lt;/a> by the Dataflow
update and starts a new job with the exact same naming. This internal job replacement workflow allows the Cortex Data
Lake team to update the jobs and payloads automatically for all use cases.&lt;/p>
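&lt;p>The decision at the heart of this workflow can be sketched as follows. This is a minimal illustration, not the team's deployment service: the update, drain, and start callbacks are hypothetical stand-ins for whatever calls the service makes to Dataflow, and the update_permitted flag is assumed to be computed elsewhere.&lt;/p>

```python
from typing import Callable, List


def replace_job(
    job_name: str,
    update_permitted: bool,
    update: Callable[[str], None],
    drain: Callable[[str], None],
    start: Callable[[str], None],
) -> List[str]:
    """Update in place when allowed; otherwise drain and relaunch.

    Returns the names of the steps taken, in order.
    """
    if update_permitted:
        update(job_name)
        return ["update"]
    drain(job_name)  # let in-flight data finish processing before stopping
    start(job_name)  # relaunch the new pipeline under the exact same name
    return ["drain", "start"]
```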
&lt;h2 id="handling-schema-changes-in-beam-sql">Handling Schema Changes In Beam SQL&lt;/h2>
&lt;p>Another use case that Palo Alto Networks tackled is handling changes in data schemas for ongoing jobs. Apache Beam
allows PCollections to have &lt;a href="/documentation/programming-guide/#schemas">schemas&lt;/a> with named
fields that are validated at the pipeline construction step. When a job is submitted, an execution plan in the form of a
Beam pipeline fragment is generated based on the latest schema. Beam SQL does not yet have built-in support for relaxed
schema compatibility for running jobs. For optimized performance, Beam SQL’s
Schema &lt;a href="https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/coders/RowCoder.html">RowCoder&lt;/a> has a fixed
data format and doesn&amp;rsquo;t handle schema evolution, so it is necessary to restart the jobs to regenerate their execution
plan. At a scale of 10K+ streaming jobs, the Cortex Data Lake team wanted to avoid resubmitting the jobs as much as
possible.&lt;/p>
&lt;p>The team created an internal workflow to identify the jobs with SQL queries relevant to the schema change. The schema update
workflow stores the reader schema of each job (Avro schema) and the writer schema of each Kafka message (metadata in the Kafka
header) in the internal Schema Registry, compares them to the SQL queries of the running jobs, and restarts only the affected
jobs. This optimization allowed them to utilize resources more efficiently.&lt;/p>
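&lt;p>One simplified way to picture this selection step is shown below. The sketch assumes schemas are reduced to sets of field names and that a job is affected when its query text mentions a changed field or uses a wildcard; the real workflow, its Schema Registry, and its SQL analysis are not described in enough detail to reproduce, so every name here is hypothetical.&lt;/p>

```python
import re
from typing import Dict, List, Set


def changed_fields(reader_fields: Set[str], writer_fields: Set[str]) -> Set[str]:
    """Fields present in only one of the two schemas (added or removed)."""
    return reader_fields ^ writer_fields


def jobs_to_restart(job_queries: Dict[str, str], changed: Set[str]) -> List[str]:
    """Return ids of jobs whose SQL text references a changed field.

    Deliberately conservative: any query containing "*" counts as affected.
    """
    changed_lower = {name.lower() for name in changed}
    affected = []
    for job_id, query in sorted(job_queries.items()):
        tokens = set(re.findall(r"[a-z_][a-z0-9_]*|\*", query.lower()))
        if "*" in tokens or tokens.intersection(changed_lower):
            affected.append(job_id)
    return affected
```

&lt;p>A token match like this can produce false positives, for example when a field name collides with a SQL keyword, which errs on the side of restarting a job rather than leaving it on a stale plan.&lt;/p>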
&lt;h2 id="fine-tuning-performance-for-kafka-changes">Fine-tuning Performance for Kafka Changes&lt;/h2>
&lt;p>With multiple clusters and topics, and over 100K partitions in Kafka, Palo Alto Networks needed to make sure that
actively running jobs are not affected by frequent Kafka infrastructure changes such as cluster migrations or
changes in partition count.&lt;/p>
&lt;p>The Cortex Data Lake team developed several internal Kafka lifecycle support tools, including a “Self Healing” service.
Depending on the amount of traffic per topic coming from a specific tenant, the internal service increases the number of
partitions or creates new topics with fewer partitions. The “Self Healing” service compares the Kafka states in the data
store and then finds and updates all related streaming Apache Beam jobs on Cloud Dataflow automatically.&lt;/p>
&lt;p>With the &lt;a href="/blog/beam-2.28.0/">release of Apache Beam 2.28.0&lt;/a> in early
2021, &lt;a href="https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/kafka/KafkaIO.html">the pre-built Kafka I/O dynamic read feature&lt;/a>
provides an out-of-the-box solution for detecting Kafka partition changes to enable cost savings and increased
performance. Kafka I/O uses WatchKafkaTopicPartitionDoFn to emit
new &lt;a href="https://kafka.apache.org/24/javadoc/index.html?org/apache/kafka/common/TopicPartition.html">TopicPartitions&lt;/a>,
starts reading from Kafka partitions dynamically when they are added, and stops reading from them once they are
deleted. This feature eliminated the need to create in-house Kafka monitoring tools.&lt;/p>
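&lt;p>The general idea of detecting partition changes can be sketched as a comparison of two metadata snapshots. The snippet below is a plain-Python illustration under that assumption, not the Kafka I/O implementation; the TopicPartition alias and the diff_partitions helper are hypothetical.&lt;/p>

```python
from typing import Set, Tuple

# (topic name, partition number); a stand-in for Kafka's TopicPartition.
TopicPartition = Tuple[str, int]


def diff_partitions(
    known: Set[TopicPartition], current: Set[TopicPartition]
) -> Tuple[Set[TopicPartition], Set[TopicPartition]]:
    """Return (added, removed) partitions between two metadata snapshots."""
    added = current - known    # start reading from these
    removed = known - current  # stop reading from these
    return added, removed
```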
&lt;p>In addition to performance optimization, the Cortex Data Lake team has been exploring ways to optimize the Cloud
Dataflow costs. The team looked into resource usage optimization in cases when streaming jobs consume very few incoming
events. For cost efficiency, Google Cloud Dataflow provides
the &lt;a href="https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#streaming-autoscaling">streaming autoscaling&lt;/a>
feature that adaptively changes the number of workers in response to changes in the load and resource utilization. For
some of the Cortex Data Lake team’s use cases, where input data streams may quiesce for prolonged periods of time, the team
implemented an internal “Cold Starter” service that analyzes Kafka topic traffic, hibernates pipelines whose input
dries up, and reactivates them once their input resumes.&lt;/p>
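&lt;p>The hibernate-and-reactivate behavior can be pictured as a two-state machine. The sketch below is an assumption-laden illustration rather than the “Cold Starter” service itself: the state names, the idle threshold, and the next_state helper are all hypothetical.&lt;/p>

```python
def next_state(
    state: str,
    idle_seconds: float,
    idle_threshold: float,
    has_new_input: bool,
) -> str:
    """Decide whether a pipeline should run or hibernate.

    state is "running" or "hibernated"; idle_seconds is the time since the
    last event was seen on the pipeline's input topics.
    """
    if state == "running" and not has_new_input and idle_seconds >= idle_threshold:
        return "hibernated"  # input has dried up for long enough
    if state == "hibernated" and has_new_input:
        return "running"     # input resumed, bring the pipeline back
    return state
```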
&lt;p>Talat Uyarer presented the Cortex Data Lake’s experience of building and customizing the large-scale streaming
infrastructure during &lt;a href="https://2021.beamsummit.org/sessions/large-scale-streaming-infrastructure/">Beam Summit 2021&lt;/a>.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
I really enjoy working with Beam. If you understand its internals, the understanding empowers you to fine-tune the open source, customize it, so that it provides the best performance for your specific use case.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/case-study/paloalto/talat_uyarer.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Talat Uyarer
&lt;/div>
&lt;div class="case-study-quote-author-position">
Sr Principal Software Engineer
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The level of abstraction of Apache Beam empowered the Cortex Data Lake team to create a common infrastructure across
their internal apps and tens of thousands of customers. With Apache Beam, the team implements business logic just once and
dynamically generates 10K+ streaming pipelines running in parallel for over 10 use cases.&lt;/p>
&lt;p>The Cortex Data Lake team took advantage of Apache Beam’s portability and pluggability to fine-tune and enhance their
data processing infrastructure with custom libraries and services. Palo Alto Networks ultimately achieved high
performance and low latency, processing 3K+ streaming events per second per vCPU. Combining the benefits of open source
Apache Beam and the Cloud Dataflow managed service, the team was able to implement use-case-specific customizations and
reduce costs by more than 60%.&lt;/p>
&lt;p>The Apache Beam open source community welcomes and encourages the contributions of its numerous members, such as Palo
Alto Networks, that leverage the powerful capabilities of Apache Beam, bring new optimizations, and empower future
innovation by sharing their expertise and actively participating in the community.&lt;/p>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'Palo Alto')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'Palo Alto')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Beam visual pipeline development with Hop</title><link>/case-studies/hop/</link><pubDate>Tue, 15 Feb 2022 12:21:00 +0000</pubDate><guid>/case-studies/hop/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img class="case-study-opinion-img-center" src="/images/logos/powered-by/hop.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Apache Beam and its abstraction of the execution engines is a big thing for us. The amount of work that that saves...it would be hard to build that support for Dataflow or Spark all by yourself. It is amazing that this technology exists in the first place, really amazing! Not having to worry about all those underlying platforms - that is tremendous!”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/matt_casters_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Matt Casters
&lt;/div>
&lt;div class="case-study-quote-author-position">
Chief Solutions Architect, Neo4j, Apache Hop co-founder
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="visual-apache-beam-pipeline-design-and-orchestration-with-apache-hop">Visual Apache Beam Pipeline Design and Orchestration with Apache Hop&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://hop.apache.org/">Apache Hop&lt;/a> is an open source data orchestration and data engineering
platform that aims to facilitate all aspects of data processing with visual pipeline development
environment. This easy-to-use, fast, and flexible platform enables developers to create and manage
Apache Beam batch and streaming pipelines in Hop GUI. Apache Hop uses metadata and kernel to
describe how the data should be processed, and Apache Beam to “design once, run anywhere”.&lt;/p>
&lt;p>&lt;a href="https://neo4j.com/">Neo4j’s&lt;/a> Chief Solutions
Architect, &lt;a href="https://be.linkedin.com/in/mattcasters">Matt Casters&lt;/a>, has been an early adopter of
Apache Beam and its abstraction of execution engines. Matt has been an active member of the Apache
open-source community for years and has leveraged Apache Beam as an execution engine to build Apache
Hop.&lt;/p>
&lt;h2 id="apache-hop-project">Apache Hop Project&lt;/h2>
&lt;p>The thriving popularity of Apache Beam and its growing number of users across the globe inspired Matt
Casters to expand the idea of abstraction to visual pipeline lifecycle management and development.
Matt co-founded and incubated the
&lt;a href="https://hop.apache.org/">Apache Hop&lt;/a> project that became a top-level project at
the &lt;a href="https://www.apache.org/">Apache Software Foundation&lt;/a>
in December 2021. The platform enables users of all skill levels to build, test, launch, and deploy
powerful data workflows without writing code. Apache Hop’s intuitive drag-and-drop interface
provides a visual representation of Apache Beam pipelines, simplifying pipeline design, execution,
preview, monitoring, and debugging.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
I was a big fan of Beam from the get go. Apache Beam is now a very important part of the Apache Hop project.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/matt_casters_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Matt Casters
&lt;/div>
&lt;div class="case-study-quote-author-position">
Chief Solutions Architect, Neo4j,
&lt;br>Apache Hop co-founder
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The Apache Hop GUI allows data professionals to work visually and focus on “what” they need to do
rather than “how”, using metadata to describe how the Apache Beam pipelines should be processed.
Apache
Hop’s &lt;a href="https://hop.apache.org/manual/latest/pipeline/create-pipeline.html#_concepts">transform-agnostic&lt;/a>
action
plugins (&lt;a href="https://hop.apache.org/manual/latest/pipeline/create-pipeline.html#_concepts">“hops”&lt;/a>)
link transforms together, creating a pipeline. Various Apache Beam runners, such as
&lt;a href="https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-spark-pipeline-engine.html">Spark&lt;/a>,
&lt;a href="https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html">Flink&lt;/a>,
&lt;a href="https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-dataflow-pipeline-engine.html">Dataflow&lt;/a>, and
the &lt;a href="https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-direct-pipeline-engine.html">Direct&lt;/a>
runner, read the metadata with the help of Apache
Hop&amp;rsquo;s &lt;a href="https://hop.apache.org/dev-manual/latest/sdk/hop-sdk.html#_hop_metadata_providers">Metadata Provider&lt;/a>
and &lt;a href="https://hop.apache.org/dev-manual/latest/sdk/hop-sdk.html#_workflow_execution">workflow engines (plugins)&lt;/a>,
and execute the pipeline.&lt;/p>
&lt;p>Apache Hop’s custom plugins and metadata objects
for &lt;a href="https://hop.apache.org/manual/latest/technology/technology.html">some of the most popular technologies&lt;/a>,
such as &lt;a href="https://neo4j.com/">Neo4j&lt;/a>, empower users to execute database- and technology-specific
transforms inside the Apache Beam pipelines, which allows for native optimized connectivity and
flexible Apache Beam pipeline configurations. For instance, Apache
Hop’s &lt;a href="https://hop.apache.org/manual/latest/technology/neo4j/index.html#_description">Neo4j plugin&lt;/a>
stores logging and execution lineage of Apache Beam pipelines in the Neo4j graph database and
enables users to query this information for more details, such as quickly jumping to the place where an
error occurred. The combination of Apache Hop
transforms, &lt;a href="/documentation/io/built-in/">Apache Beam built-in I/Os&lt;/a>, and
Apache Beam-powered data processing opens up new horizons for more sinks and sources and custom use
cases.&lt;/p>
&lt;p>Apache Hop aims to bring a no-code approach to Apache Beam data pipelines. Sometimes the choice of a
particular programming language, framework, or engine is driven by developers&amp;rsquo; preferences, which
results in businesses becoming tied to a specific technology skill set and stack. Apache Hop
eliminates this dependency by abstracting out the I/Os with fully pluggable runtime support and
providing a graphical user interface on top of Apache Beam pipelines. All settings for pipeline
elements are configured in Hop’s visual editor just once, and the pipeline is automatically described
as metadata in JSON and CSV formats. Programming data pipelines’ source code becomes an option, not
a necessity. Apache Hop does not require knowledge of a particular programming language to create
pipelines, helping with the adoption of Apache Beam unified streaming and batch processing
technology.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
In general, a visual pipeline design interface is really valuable for a non-developer audience…
We categorically choose the side of the organization when it comes to lowering setup costs,
maintenance costs, increasing ROI, and safeguarding an investment over time.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/matt_casters_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Matt Casters
&lt;/div>
&lt;div class="case-study-quote-author-position">
Chief Solutions Architect, Neo4j,
&lt;br>Apache Hop co-founder
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam continuously expands the number of use cases and scenarios it supports and makes it
possible to turn advanced technology solutions into reality. As an early adopter of Apache
Beam and its powerful abstraction, Matt Casters leveraged this knowledge and experience to create
Apache Hop. The platform adds value for Apache Beam users by enabling visual pipeline
development and lifecycle management.&lt;/p>
&lt;p>Matt sees Apache Beam as a foundation and a driving force behind Apache Hop. Communication between
the Apache Beam and Apache Hop projects keeps fostering co-creation and enriches both products with new
features.&lt;/p>
&lt;p>The Apache Hop project is an example of continuous improvement driven by the Apache open source
community and amplified by collaborative organizations.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Knowledge sharing and collaboration is something that comes naturally in the community. If we
see some room for improvement, we exchange ideas and this way, we keep driving Apache Beam and
Apache Hop projects forward. Together, we can work with the most complex problems and just solve them.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/matt_casters_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Matt Casters
&lt;/div>
&lt;div class="case-study-quote-author-position">
Chief Solutions Architect, Neo4j,
&lt;br>Apache Hop co-founder
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'Hop')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'Hop')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Scalability and Cost Optimization for Search Engine's Workloads</title><link>/case-studies/seznam/</link><pubDate>Tue, 15 Feb 2022 01:56:00 +0000</pubDate><guid>/case-studies/seznam/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/seznam.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Apache Beam is a well-defined data processing model that lets you concentrate on business logic rather than low-level details of distributed processing.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/marek_simunek_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marek Simunek
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ seznam.cz
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="scalability-and-cost-optimization-for-search-engines-workloads">Scalability and Cost Optimization for Search Engine&amp;rsquo;s Workloads&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.seznam.cz/">Seznam.cz&lt;/a> is a Czech search engine that serves over 25% of local organic search traffic.
Seznam employs over 1,500 people and runs a portfolio of more than 30 web services and associated brands,
processing around &lt;a href="https://www.searchenginejournal.com/seznam-interview/302851/#close">15 million queries a day&lt;/a>.&lt;/p>
&lt;p>Seznam continuously optimizes their big data infrastructure, web crawlers, algorithms,
and ML models on a mission to achieve excellence in accuracy, quality, and usefulness of search results for their users.
Seznam has been an early contributor to and adopter of Apache Beam, and they migrated several petabyte-scale workloads
to Apache Beam pipelines running in Apache Spark and Apache Flink clusters in Seznam’s on-premises data center.&lt;/p>
&lt;h2 id="journey-to-apache-beam">Journey to Apache Beam&lt;/h2>
&lt;p>Seznam started using MapReduce in a Hadoop YARN cluster back in 2010 to facilitate concurrent batch job processing
for the web crawler components of their search engine.
Within several years, their data infrastructure evolved to &lt;a href="https://www.youtube.com/watch?v=rJIpva0tD0g">over 40 billion rows with 400 terabytes&lt;/a>
in HBase, 2 on-premises data centers with over 1,100 bare metal servers, 13 PB storage, and 50 TB memory, which made their business logic more complex.
MapReduce no longer provided enough flexibility, &lt;a href="https://youtu.be/rJIpva0tD0g?t=130">cost efficiency, and performance&lt;/a>
to support this growth, and Seznam rewrote the jobs to native Spark.
Spark &lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations">shuffle operations&lt;/a>
enabled Seznam to split large data keys into partitions, load them in-memory one by one, and process them iteratively.
However, exponential data skews and the inability to fit all values for a single key into an in-memory buffer resulted in
&lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#performance-impact">increased disk space utilization and memory overhead&lt;/a>.
Some tasks took an unexpectedly long time to complete, and it was challenging
to debug Spark pipelines due to generic exceptions. Thus, Seznam needed a data processing framework that could scale more efficiently.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
To manage this kind of scale, you need the abstraction.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/marek_simunek_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marek Simunek
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ seznam.cz
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>In 2014, Seznam started work on Euphoria API, a proprietary programming model that can express business logic
in batch and streaming pipelines and allow for runner-independent implementation.&lt;/p>
&lt;p>Apache Beam was released in 2016 and became a readily available and well-defined unified programming model.
This engine-independent model has been evolving very fast, supports multiple shuffle operators and fits perfectly
into Seznam’s existing on-premises data infrastructure. For a while, Seznam continued to develop Euphoria,
but soon the high cost and the amount of effort needed to maintain the solution and create their own
runners in-house surpassed the benefits of having a proprietary framework.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/seznam_scheme_1.png">
&lt;/div>
&lt;p>Seznam started migrating their key workloads to Apache Beam.
They decided to merge the &lt;a href="/documentation/sdks/java/euphoria/">Euphoria API&lt;/a>
into Apache Beam as a high-level DSL for the Java SDK.
This significant contribution to Apache Beam was a starting point for Seznam’s active participation in the community,
later presenting their unique experience and findings at &lt;a href="https://www.youtube.com/watch?v=ZIFtmx8nBow">Beam Summit Europe 2019&lt;/a>
and developer conferences.&lt;/p>
&lt;h2 id="adopting-apache-beam">Adopting Apache Beam&lt;/h2>
&lt;p>Apache Beam enabled Seznam to execute batch and stream jobs much faster without increasing memory and disk space,
thus maximizing scalability, performance, and efficiency.&lt;/p>
&lt;p>Apache Beam offers a variety of ways to distribute skewed data evenly.
&lt;a href="/documentation/programming-guide/#windowing">Windowing&lt;/a>
for processing unbounded and &lt;a href="/documentation/transforms/java/elementwise/partition/">Partition&lt;/a>
for bounded data sets transform input into finite
collections of elements that can be reshuffled. Apache Beam provides a byte-based shuffle that can be
executed by the Spark runner or the Flink runner, without requiring Apache Spark or Apache Flink to deserialize the full key.
Apache Beam SDKs provide effective coders to serialize and deserialize elements and pass them to distributed workers.
Using Apache Beam serialization and byte-based shuffle resulted in substantial performance gains for many of
Seznam’s use cases and reduced the memory required for shuffling by the Apache Spark execution environment.
Seznam’s infrastructure costs associated with &lt;a href="https://youtu.be/rJIpva0tD0g?t=522">disk I/O and memory splits&lt;/a>
decreased significantly.&lt;/p>
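&lt;p>As a rough illustration of the partitioning idea, the plain-Python sketch below spreads the values of a single hot key across a fixed number of partitions. It is not Seznam’s implementation and does not call Apache Beam itself; the names &lt;code>partition_fn&lt;/code> and &lt;code>split_into_partitions&lt;/code> and the sample data are hypothetical. The &lt;code>(element, num_partitions)&lt;/code> signature is chosen to resemble the function shape that the Partition transform in the Beam Python SDK accepts.&lt;/p>

```python
import zlib
from collections import Counter


def partition_fn(element, num_partitions):
    """Return a partition index in [0, num_partitions) for a (key, value) pair.

    Hashing the whole element, not only the key, spreads the values of one
    hot key over several partitions. crc32 is used because it is stable
    across processes, unlike the built-in hash() for strings.
    """
    key, value = element
    digest = zlib.crc32(f"{key}:{value}".encode("utf-8"))
    return digest % num_partitions


def split_into_partitions(elements, num_partitions):
    """Group elements into a list of num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        partitions[partition_fn(element, num_partitions)].append(element)
    return partitions


if __name__ == "__main__":
    # Hypothetical skewed input: one hot URL with many backlinks, a few cold ones.
    links = [("hot.example", i) for i in range(1000)]
    links += [("cold-%d.example" % i, 0) for i in range(10)]
    parts = split_into_partitions(links, 4)
    # Show how many elements landed in each partition.
    print(Counter({i: len(p) for i, p in enumerate(parts)}))
```

&lt;p>Because the hot key’s values end up in different partitions, any logic that needs all values of a key together would have to merge per-partition partial results afterwards.&lt;/p>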
&lt;p>One of the most valuable use cases is Seznam’s LinkRevert job, which analyzes the web graph to improve search relevance.
This data pipeline figuratively “turns the Internet upside down”, processing over 150 TB daily,
extending redirect chains to identify every successor of a specific URL, and discovering backlinks that point to a specific web page.
The Apache Beam pipeline executes multiple large-scale skewed joins, and scores the URLs for search results based on the redirect and backlinking factors.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/seznam_scheme_2.png">
&lt;/div>
&lt;p>Apache Beam allows for unified, engine-independent execution, so Seznam was able to choose between
the Spark and Flink runners depending on the use case. For example, the Apache Beam batch pipeline executed by
the Spark runner on a Hadoop YARN cluster parses new web documents, enriches data with additional features,
and scores the web pages based on their relevance, ensuring timely database updates and accurate search results.
Apache Beam stream processing runs in the Apache Flink execution environment on a Kubernetes cluster for thumbnail
requests that are displayed in users’ search results. Another example of stream event processing is the Apache Beam Flink
runner pipeline that maps, joins, and processes search logs to calculate SLO metrics and other features.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/seznam_scheme_3.png">
&lt;/div>
&lt;div class="post-scheme">
&lt;img src="/images/seznam_scheme_4.png">
&lt;/div>
&lt;p>Over the years, Seznam’s approach has evolved. They have realized the tremendous benefits of Apache Beam
for balancing petabyte-size workloads and optimizing memory and compute resources in on-premises data centers.
Apache Beam is Seznam’s go-to platform for batch and stream pipelines that require multiple shuffle operations,
processing skewed data, and implementing complex business logic. The Apache Beam unified model, with sources
and sinks exposed as transforms, increased business logic maintainability and traceability with unit tests.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
One of the biggest benefits is Apache Beam sinks and sources. By exposing your source or sink as a transform, your implementation is hidden and later on, you can add additional functionality without breaking the existing implementation for users.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/marek_simunek_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Marek Simunek
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Software Engineer @ seznam.cz
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="monitoring-and-debugging">Monitoring and Debugging&lt;/h2>
&lt;p>Monitoring and debugging Apache Beam pipelines was critical for cases with complex business logic and
multiple data transformations. Seznam engineers identified optimal tools depending on the execution engine.
Seznam leveraged &lt;a href="https://github.com/criteo/babar">Babar from Criteo&lt;/a> to profile Apache Beam pipelines
on the Spark runner and identify the root causes
of performance drops. Babar allows for easier monitoring, debugging, and performance optimization
by analyzing cluster resource utilization, memory allocated, CPU used, etc. For Apache Beam pipelines executed by the Flink runner
on a Kubernetes cluster, Seznam employs Elasticsearch to store, search, and analyze metrics.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam offered a unified model for Seznam’s stream and batch processing that provided performance at scale.
Apache Beam supported multiple runners, language SDKs, and built-in and custom pluggable I/O transforms,
thus eliminating the need to invest in the development and support of proprietary runners and solutions.
After evaluation, Seznam transitioned their workloads to Apache Beam and integrated
&lt;a href="/documentation/sdks/java/euphoria/">Euphoria API&lt;/a>
(a fast prototyping framework developed by Seznam), contributing to the Apache Beam open source community.&lt;/p>
&lt;p>The Apache Beam abstraction and execution model allowed Seznam to robustly scale their data processing.
It also provided the flexibility to write the business logic just once and keep freedom of choice between runners.
The model was especially valuable for pipeline maintainability in complex use cases.
Apache Beam helped overcome memory and compute resource constraints by reshuffling unevenly distributed data into manageable partitions.&lt;/p>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'Seznam')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'Seznam')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: Apache Beam Amplified Ricardo’s Real-time and ML Data Processing for eCommerce Platform</title><link>/case-studies/ricardo/</link><pubDate>Wed, 01 Dec 2021 01:36:00 +0000</pubDate><guid>/case-studies/ricardo/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div class="case-study-opinion">
&lt;div class="case-study-opinion-img">
&lt;img src="/images/logos/powered-by/ricardo.png"/>
&lt;/div>
&lt;blockquote class="case-study-quote-block">
&lt;p class="case-study-quote-text">
“Without Beam, without all this data and real time information, we could not deliver the services we are providing and handle the volumes of data we are processing.”
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;/div>
&lt;div class="case-study-post">
&lt;h1 id="apache-beam-amplified-ricardos-real-time-and-ml-data-processing-for-ecommerce-platform">Apache Beam Amplified Ricardo’s Real-time and ML Data Processing for eCommerce Platform.&lt;/h1>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>&lt;a href="https://www.ricardo.ch/">Ricardo&lt;/a> is a leading second hand marketplace in Switzerland. The site supports over 4 million
registered buyers and sellers, processing more than 6.5 million article transactions via the platform annually. Ricardo
needs to process high volumes of streaming events and manage over 5 TB of articles, assets, and analytical data.&lt;/p>
&lt;p>With the scale that came from 20 years in the market, Ricardo made the decision to migrate from their on-premises data
center to the cloud to easily grow and evolve further and reduce operational costs through managed cloud services. Data
intelligence and engineering teams took the lead on this transformation and development of new AI/ML-enabled customer
experiences. Apache Beam has been a technology amplifier that expedited Ricardo’s transformation.&lt;/p>
&lt;h2 id="challenge">Challenge&lt;/h2>
&lt;p>Migrating from an on-premises data center to the cloud presented Ricardo with an opportunity to modernize their
marketplace from heavy legacy reliance on transactional SQL, switch to BigQuery for analytics, and take advantage of the
event-based streaming architecture.&lt;/p>
&lt;p>Ricardo’s data intelligence team identified two key success factors: a carefully designed data model and a framework
that provides unified stream and batch data pipelines execution, both on-premises and in the cloud.&lt;/p>
&lt;p>Ricardo needed a data processing framework that could scale easily, enrich event streams with historic data from multiple
sources, provide granular control over data freshness, and provide an abstract pipeline operational infrastructure, thus
helping their team focus on creating new value for customers and the business.&lt;/p>
&lt;h2 id="journey-to-beam">Journey to Beam&lt;/h2>
&lt;p>Ricardo’s data intelligence team began modernizing their stack in 2018. They selected frameworks that provide reliable
and scalable data processing both on-premises and in the cloud. Apache Beam enables users to create pipelines in their
favorite programming language, offering SDKs in Java, Python, Go, SQL, and Scala (Scio).
A &lt;a href="/documentation/#available-runners">Beam Runner&lt;/a> runs a Beam pipeline on a specific (often
distributed) data processing system. Ricardo selected the Apache Beam Flink runner for executing pipelines on-premises
and the Dataflow runner as a managed cloud service for the same pipelines developed using the Apache Beam Java SDK. Apache
Flink is well known for its reliability and cost-efficiency, and an on-premises cluster was spun up at Ricardo’s
datacenter as the initial environment.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
We wanted to implement a solution that would multiply our possibilities, and that’s exactly where Beam comes in. One of the major drivers in this decision was the ability to evolve without adding too much operational load.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Beam pipelines for core business workloads that ingest event data from Apache Kafka into BigQuery were running stably within
just one month. As Ricardo’s cloud migration progressed, the data intelligence
team &lt;a href="https://www.youtube.com/watch?v=EcvnFH5LDE4">migrated the Flink cluster from Kubernetes&lt;/a> in their on-premises
datacenter to GKE.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
I knew Beam, I knew it works. When you need to move from Kafka to BigQuery and you know that Beam is exactly the right tool, you just need to choose the right executor for it.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>The flexibility to refresh data every hour, every minute, or in real time, depending on the specific use case and
need, helped the team improve data freshness, which was a significant advancement for Ricardo’s eCommerce platform
analytics and reporting.&lt;/p>
&lt;p>Ricardo’s team found benefits in running the Apache Beam Flink runner on a self-managed Flink cluster in GKE for streaming pipelines.
Full control over Flink provisioning enabled the team to set up the required connectivity from the Flink cluster to an external peered
Kafka managed service. The data intelligence team significantly optimized operating costs through better cluster resource
utilization. For batch pipelines, the team chose the Dataflow managed service for its on-demand autoscaling and cost
reduction features like FlexRS, which is especially efficient for training ML models over TBs of historic data. This hybrid
approach has been serving Ricardo’s needs well and proved to be a reliable production solution.&lt;/p>
&lt;h2 id="evolution-of-use-cases">Evolution of Use Cases&lt;/h2>
&lt;p>Thinking of a stream as data in motion and a table as data at rest provided a fortuitous chance to take a look at some
data model decisions that were made as far back as 20 years earlier. Articles that are on the marketplace have assets
that describe them, and for performance and cost optimization purposes, data entities that belong together were split
into separate database instances. Apache Beam enabled Ricardo’s data intelligence team
to &lt;a href="https://youtu.be/PiwLC-YK_Zw">join assets and articles streams&lt;/a> and optimize BigQuery scans to reduce costs. When
designing the pipeline, the team created streams for assets and articles. Since the assets stream is the primary one,
they shifted the stream 5 minutes back and created a lookup schema with it in Bigtable. This elegant solution ensures
that the assets stream is always processed first, while Bigtable allows for matching the latest asset to an article and
Apache Beam joins them both together.&lt;/p>
&lt;div class="post-scheme">
&lt;img src="/images/post_scheme.png">
&lt;/div>
&lt;p>The successful case of joining different data streams facilitated further Apache Beam adoption by Ricardo in areas like
data science and ML.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
Once you start laying out the simple use cases, you will always figure out the edge case scenarios. This pipeline has been running for a year now, and Beam handles it all, from super simple use cases to something crazy.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>As an eCommerce retailer, Ricardo faces the increasing scale and sophistication of fraudulent transactions and takes a
strategic approach by employing Beam pipelines for fraud detection and prevention. Beam pipelines act on an external
intelligent API to identify the signs of fraudulent behaviour, like device characteristics or user activity. The Apache Beam
&lt;a href="/documentation/programming-guide/#state-and-timers">stateful processing&lt;/a> feature enables Ricardo
to apply an associating operation to the streams of data (for example, triggering a ban of a user). Thus, Apache Beam saves
Ricardo’s customer care team&amp;rsquo;s time and effort on investigating duplicate cases. Ricardo also runs batch pipelines
to &lt;a href="https://www.youtube.com/watch?v=LXnh9jNNfYY">find linked accounts&lt;/a>, associate products with categories by
encapsulating an ML model, or calculate the likelihood that something is going to sell, at a scale or precision that was
previously not possible.&lt;/p>
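&lt;p>To make the role of per-key state more concrete, here is a minimal plain-Python analogue, under the assumption that each event is a hypothetical &lt;code>(user_id, is_suspicious)&lt;/code> pair. It is not Ricardo’s pipeline and does not use the Beam state API; the class &lt;code>OncePerUserFlagger&lt;/code> is invented for illustration. In a real Beam pipeline, comparable state would be declared on a DoFn and managed by the runner per key and window.&lt;/p>

```python
class OncePerUserFlagger:
    """Keeps per-key state so that each user is flagged at most once."""

    def __init__(self):
        # Hypothetical stand-in for per-key state managed by the runner.
        self._already_flagged = set()

    def process(self, event):
        """Return a ban action for a user's first fraud signal, else None."""
        user_id, is_suspicious = event
        if not is_suspicious or user_id in self._already_flagged:
            return None
        self._already_flagged.add(user_id)
        return ("ban", user_id)


def run(events):
    """Feed events through the flagger and collect the emitted actions."""
    flagger = OncePerUserFlagger()
    actions = []
    for event in events:
        action = flagger.process(event)
        if action is not None:
            actions.append(action)
    return actions


if __name__ == "__main__":
    # Hypothetical stream: user u2 produces three suspicious events.
    stream = [("u1", False), ("u2", True), ("u2", True), ("u3", True), ("u2", True)]
    print(run(stream))
```

&lt;p>Repeated signals for the same user produce a single action, which is the kind of deduplication that spares a support team from reviewing the same case several times.&lt;/p>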
&lt;p>Originally implemented by Ricardo’s data intelligence team, Apache Beam has proven to be a powerful framework that
supports advanced scenarios and acts as a glue between Kafka, BigQuery, and platform and external APIs, which encouraged
other teams at Ricardo to adopt it.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
[Apache Beam] is a framework that is so good that other teams are picking up the idea and starting to work with it after we tested it.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Apache Beam has provided Ricardo with a scalable and reliable data processing framework that supported Ricardo’s
fundamental business scenarios and enabled new use cases to respond to events in real time.&lt;/p>
&lt;p>Throughout Ricardo’s transformation, Apache Beam has been a unified framework that runs batch and stream pipelines,
offers on-premises and cloud managed services execution, and provides programming language options like Java and Python.
It empowered data science and research teams to advance customer experience with new real-time scenarios, fast-tracking time
to value.&lt;/p>
&lt;blockquote class="case-study-quote-block case-study-quote-wrapped">
&lt;p class="case-study-quote-text">
After this first pipeline, we are working on other use cases and planning to move them to Beam. I was always trying to spread the idea that this a framework that is reliable, it actually helps you to get the stuff done in a consistent way.
&lt;/p>
&lt;div class="case-study-quote-author">
&lt;div class="case-study-quote-author-img">
&lt;img src="/images/tobias_kaymak_photo.png">
&lt;/div>
&lt;div class="case-study-quote-author-info">
&lt;div class="case-study-quote-author-name">
Tobias Kaymak
&lt;/div>
&lt;div class="case-study-quote-author-position">
Senior Data Engineer @ Ricardo
&lt;/div>
&lt;/div>
&lt;/div>
&lt;/blockquote>
&lt;p>Apache Beam has been a technology that multiplied possibilities, allowing Ricardo to maximize technology benefits at all
stages of their modernization and cloud journey.&lt;/p>
&lt;h2 id="learn-more">Learn More&lt;/h2>
&lt;iframe class="video video--medium-size" width="560" height="315" src="https://www.youtube.com/embed/v-MclVrGJcQ" frameborder="0" allowfullscreen>&lt;/iframe>
&lt;br>&lt;br>
&lt;div class="case-study-feedback" id="case-study-feedback">
&lt;p class="case-study-feedback-title">Was this information useful?&lt;/p>
&lt;div>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(true, 'Ricardo')">Yes&lt;/button>
&lt;button class="btn case-study-feedback-btn" onclick="sendCaseStudyFeedback(false, 'Ricardo')">No&lt;/button>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="clear-nav">&lt;/div></description></item><item><title>Case-Studies: 163 Net Ease</title><link>/case-studies/163netease/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/163netease/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Accenture</title><link>/case-studies/accenture/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/accenture/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Akvelon</title><link>/case-studies/akvelon/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/akvelon/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Align</title><link>/case-studies/align/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/align/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Aliz</title><link>/case-studies/aliz/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/aliz/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Amazon</title><link>/case-studies/amazon/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/amazon/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Android</title><link>/case-studies/android/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/android/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Arquivei</title><link>/case-studies/arquivei/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/arquivei/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Bahwan Cybertek</title><link>/case-studies/bahwancybertek/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/bahwancybertek/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: BBC</title><link>/case-studies/bbc/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/bbc/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Beam and Geocoding</title><link>/case-studies/goga/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/goga/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div>
&lt;header class="case-study-header">
&lt;h2 itemprop="name headline">Beam and Geocoding&lt;/h2>
&lt;/header>
&lt;p>GOGA Data Analysis and Consulting is a company based in Japan that specializes in analytics of geospatial and mapping data. They use Apache Beam and Cloud Dataflow to transform data smoothly for analysis. This use case focuses on handling multiple extraction, geocoding, and insertion steps by wrangling each record and issuing an API request for it based on the location provided.&lt;/p>
&lt;/div></description></item><item><title>Case-Studies: Behalf</title><link>/case-studies/behalf/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/behalf/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Bell Labs</title><link>/case-studies/belllabs/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/belllabs/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: BenchSci</title><link>/case-studies/benchsci/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/benchsci/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: BetterUp</title><link>/case-studies/betterup/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/betterup/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Big Data Institute</title><link>/case-studies/bigdatainstitute/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/bigdatainstitute/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Calico</title><link>/case-studies/calico/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/calico/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Car Finance 24/7</title><link>/case-studies/carfinance247/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/carfinance247/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: CitiBank</title><link>/case-studies/citibank/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/citibank/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Cloud Dataflow</title><link>/case-studies/dataflow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/dataflow/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div>
&lt;header class="case-study-header">
&lt;h2 itemprop="name headline">Cloud Dataflow&lt;/h2>
&lt;/header>
&lt;p>&lt;strong>&lt;a href="https://cloud.google.com/dataflow">Cloud Dataflow&lt;/a>:&lt;/strong> Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.&lt;/p>
&lt;/div></description></item><item><title>Case-Studies: Cognite</title><link>/case-studies/cognite/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/cognite/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Cruise</title><link>/case-studies/cruise/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/cruise/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Datatonic</title><link>/case-studies/datatonic/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/datatonic/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: DeepMind</title><link>/case-studies/deepmind/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/deepmind/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Dun &amp; Bradstreet</title><link>/case-studies/dunbradstreet/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/dunbradstreet/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Ericsson</title><link>/case-studies/ericsson/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/ericsson/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Evolve24</title><link>/case-studies/evolve24/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/evolve24/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Feature Powered by Apache Beam - Beyond Lambda</title><link>/case-studies/ebay/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/ebay/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div>
&lt;header class="case-study-header">
&lt;h2 itemprop="name headline">Feature Powered by Apache Beam - Beyond Lambda&lt;/h2>
&lt;/header>
&lt;p>eBay is an American e-commerce company that provides business-to-consumer and consumer-to-consumer sales through its website. They build feature pipelines with Apache Beam to unify feature extraction and selection across online and offline environments, speed up end-to-end iteration for model training, evaluation, and serving, and support different types of features (streaming, runtime, batch), among other goals. eBay uses Apache Beam as the foundation of its streaming feature SDK, which integrates with Kafka, Hadoop, Flink, Airflow, and other systems at eBay.&lt;/p>
&lt;/div></description></item><item><title>Case-Studies: Fitbit</title><link>/case-studies/fitbit/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/fitbit/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: From Apache Beam to Leukemia early detection</title><link>/case-studies/oriel/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/oriel/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;div>
&lt;header class="case-study-header">
&lt;h2 itemprop="name headline">From Apache Beam to Leukemia early detection&lt;/h2>
&lt;/header>
&lt;p>Oriel Research Therapeutics (ORT) is a startup in the greater Boston area that provides early detection services for
multiple medical conditions, using cutting-edge artificial intelligence technologies and next-generation sequencing (NGS). ORT uses Apache Beam pipelines to process over 1 million samples of genomics and clinical information. ORT uses the processed data to detect leukemia, sepsis, and other medical conditions.&lt;/p>
&lt;/div></description></item><item><title>Case-Studies: Google Chrome</title><link>/case-studies/chrome/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/chrome/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Google Play</title><link>/case-studies/googleplay/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/googleplay/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: GraalSystems</title><link>/case-studies/graalsystems/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/graalsystems/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
&lt;p>&lt;strong>&lt;a href="https://graal.systems">GraalSystems&lt;/a>&lt;/strong> is a cloud-native data platform providing support for Beam, Spark, TensorFlow, Samza, and many other data processing solutions. At the heart of our architecture is a set of distributed processing and analytics modules that use Beam to route over 2 billion events per day from our Apache Pulsar clusters. For our clients, we also run more than 2,000 Beam jobs per day at very large scale on our production platform.&lt;/p></description></item><item><title>Case-Studies: Hazelcast</title><link>/case-studies/hazelcast/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/hazelcast/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Hoxton AI</title><link>/case-studies/hoxtonai/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/hoxtonai/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: iBiblio</title><link>/case-studies/ibiblio/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/ibiblio/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Industrial Technology Research Institute</title><link>/case-studies/industrialtechnologyresearchinstitute/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/industrialtechnologyresearchinstitute/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item><item><title>Case-Studies: Ineat</title><link>/case-studies/ineat/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/case-studies/ineat/</guid><description>
&lt;!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--></description></item></channel></rss>