title: "Accenture Baltics' Journey with Apache Beam to Streamlined Data Workflows for a Sustainable Energy Leader"
name: "Accenture Baltics"
icon: /images/logos/powered-by/accenture.png
hasNav: true
category: study
cardTitle: "Accenture Baltics' Journey with Apache Beam"
cardDescription: "Accenture Baltics uses Apache Beam on Google Cloud to build a robust data processing infrastructure for a sustainable energy leader. They use Beam to democratize data access, process data in real-time, and handle complex ETL tasks."
authorName: "Jana Polianskaja"
authorPosition: "Data Engineer @ Accenture Baltics"
authorImg: /images/case-study/accenture/Jana_Polianskaja_sm.jpg
publishDate: 2024-11-25T00:12:00+00:00

Accenture Baltics' Journey with Apache Beam to Streamlined Data Workflows for a Sustainable Energy Leader

Background

Accenture Baltics, a branch of the global professional services company Accenture, leverages its expertise across various industries to provide consulting, technology, and outsourcing solutions to clients worldwide. A specific project at Accenture Baltics highlights the effective implementation of Apache Beam to support a client who is a global leader in sustainable energy and uses Google Cloud.

Journey to Apache Beam

The team responsible for transforming, curating, and preparing data (including transactional, analytics, and sensor data) for data scientists and other teams has been using Dataflow with Apache Beam for about five years. Dataflow with Beam is a natural choice for both streaming and batch data processing. Our workloads typically run on the Dataflow runner with n1-standard-2 or n1-standard-4 worker machine types, and the maximum number of workers is capped at up to five, depending on the pipeline.
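As a rough illustration of that configuration, the sketch below builds the corresponding Beam pipeline flags. The helper names `dataflow_args` and `make_options`, and the project, region, and bucket values, are hypothetical placeholders rather than the team's actual settings.

```python
def dataflow_args(project, region, temp_location,
                  machine_type="n1-standard-2", max_workers=5):
    """Build command-line style flags for a Dataflow job.

    The defaults mirror the worker settings described above; all other
    values are supplied by the caller.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--worker_machine_type={machine_type}",
        f"--max_num_workers={max_workers}",
    ]


def make_options(args):
    """Wrap the flags in a Beam PipelineOptions object."""
    # Imported lazily so the flag builder above works without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(args)


if __name__ == "__main__":
    # Placeholder project, region, and bucket names (assumptions).
    for flag in dataflow_args("my-project", "europe-west1", "gs://my-bucket/tmp"):
        print(flag)
```

A heavier job could pass `machine_type="n1-standard-4"` while keeping the same worker cap.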

As an example, a streaming pipeline ingests transaction data from Pub/Sub, performs basic ETL and data cleaning, and outputs the results to BigQuery. A separate batch Dataflow pipeline evaluates a binary classification model, reading input and writing results to Google Cloud Storage. The following diagram shows a workflow that uses Pub/Sub to feed Dataflow pipelines across three Google Cloud projects. It also shows how Dataflow, Composer, Cloud Storage, BigQuery, and Grafana integrate into the architecture.
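A streaming pipeline of the kind described above might look roughly like the following sketch. It is not the client's pipeline: the field names (`transaction_id`, `amount`, `currency`), the BigQuery schema, and the `run` function are assumptions made for illustration.

```python
import json


def clean_transaction(raw: bytes) -> dict:
    """Decode one Pub/Sub payload and normalize a few hypothetical fields."""
    record = json.loads(raw.decode("utf-8"))
    return {
        "transaction_id": str(record["transaction_id"]).strip(),
        "amount": float(record["amount"]),
        "currency": str(record["currency"]).upper(),
    }


def run(subscription, table, pipeline_args):
    """Read from Pub/Sub, clean each message, and append rows to BigQuery."""
    # Imported lazily so clean_transaction can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args, streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
            | "Clean" >> beam.Map(clean_transaction)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table,
                # Assumed schema matching the hypothetical fields above.
                schema="transaction_id:STRING,amount:FLOAT,currency:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

Keeping the cleaning step as a plain function makes it easy to unit test separately from the pipeline wiring.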

Use Cases

Apache Beam is an invaluable tool for our use cases, particularly in the following areas:

Democratizing data access: Beam empowers team members without data engineering backgrounds to directly access and analyze BigQuery data using their SQL skills. The data scientists, the finance department, and production optimization teams all benefit from improved data accessibility, gaining immediate access to critical information for faster analysis and decision-making.
Real-time data processing: Beam excels at ingesting and processing data in real time from sources like Pub/Sub.
ETL (extract, transform, load): Beam effectively manages the full spectrum of data transformation and cleaning tasks, even when dealing with complex data structures.
Data routing and partitioning: Beam enables sophisticated data routing and partitioning strategies. For example, it can automatically route failed transactions to a separate BigQuery table for further analysis.
Data deduplication and error handling: Beam has been instrumental in tackling challenging tasks like deduplicating Pub/Sub messages and implementing robust error handling (for example, for JSON parsing), both of which are crucial for maintaining data integrity and pipeline reliability.
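The routing and error-handling ideas in the list above can be sketched with Beam's tagged outputs: messages that fail JSON parsing go to a separate output, which could then be written to its own BigQuery table. The names `parse_message`, `split_parsed_and_failed`, and the `failed` tag are hypothetical, and deduplication is not covered here.

```python
import json

FAILED_TAG = "failed"


def parse_message(raw: bytes):
    """Return ("parsed", dict) for valid JSON objects, else ("failed", dict)."""
    try:
        record = json.loads(raw.decode("utf-8"))
        if not isinstance(record, dict):
            raise ValueError("expected a JSON object")
        return "parsed", record
    except ValueError as err:
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return FAILED_TAG, {
            "payload": raw.decode("utf-8", errors="replace"),
            "error": str(err),
        }


def split_parsed_and_failed(messages):
    """Split a PCollection of raw bytes into parsed and failed PCollections."""
    # Imported lazily so parse_message can be tested without Beam installed.
    import apache_beam as beam

    def route(raw):
        tag, value = parse_message(raw)
        if tag == FAILED_TAG:
            # Send unparseable messages to the side output.
            yield beam.pvalue.TaggedOutput(FAILED_TAG, value)
        else:
            yield value

    results = messages | "ParseJson" >> beam.FlatMap(route).with_outputs(
        FAILED_TAG, main="parsed"
    )
    return results.parsed, results[FAILED_TAG]
```

Each returned PCollection can be given its own sink, so bad records are kept for analysis instead of failing the pipeline.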

We also use Grafana (shown below) with custom notification emails and tickets for comprehensive monitoring of our Beam pipelines. Notifications are generated from Google's Cloud Logging and Cloud Monitoring services to keep us informed about the performance and health of our pipelines. The integration of Airflow with Dataflow and Beam further enhances our workflow, allowing us to use operators such as DataflowCreatePythonJobOperator and BeamRunPythonPipelineOperator in Airflow 2.
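For readers unfamiliar with the Airflow operators mentioned above, here is a minimal sketch of a DAG that launches a Beam Python pipeline on Dataflow with BeamRunPythonPipelineOperator. The DAG id, task id, file path, and option values are placeholders, and the helper functions are hypothetical; the exact arguments a real deployment needs may differ by provider version.

```python
from datetime import datetime


def beam_task_kwargs(py_file, project, region, temp_location):
    """Keyword arguments for BeamRunPythonPipelineOperator (placeholder values)."""
    return {
        "task_id": "run_beam_pipeline",
        "py_file": py_file,
        "runner": "DataflowRunner",
        "pipeline_options": {
            "project": project,
            "region": region,
            "temp_location": temp_location,
            "max_num_workers": 5,
        },
    }


def build_dag():
    """Create a manually triggered DAG containing a single Beam task."""
    # Imported lazily so the kwargs helper works without Airflow installed.
    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import (
        BeamRunPythonPipelineOperator,
    )

    with DAG(
        dag_id="beam_on_dataflow_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # Airflow 2 style; trigger manually
        catchup=False,
    ) as dag:
        BeamRunPythonPipelineOperator(
            **beam_task_kwargs(
                "gs://my-bucket/pipelines/etl.py",  # hypothetical path
                "my-project",
                "europe-west1",
                "gs://my-bucket/tmp",
            )
        )
    return dag
```

In a real DAG file, a module-level assignment such as `dag = build_dag()` is needed so the Airflow scheduler can discover the DAG.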

Results

Our data processing infrastructure uses 12 distinct pipelines to manage and transform data across various projects within the organization. These pipelines are divided into two primary categories:

Streaming pipelines: These pipelines are designed to handle real-time or near real-time data streams. In our current setup, they process an average of 10,000 messages per second from Pub/Sub and write about 200,000 rows per hour to BigQuery, ensuring that time-sensitive data is ingested and processed with minimal latency.
Batch pipelines: These pipelines are optimized for processing large volumes of data in scheduled batches. Our current batch pipelines handle approximately two gigabytes of data per month, transforming and loading this data into our data warehouse for further analysis and reporting.

Apache Beam has proven to be a highly effective solution for orchestrating and managing the complex data pipelines required by the client. By leveraging the capabilities of Dataflow, a fully managed service for executing Beam pipelines, we have successfully addressed and fulfilled the client's specific data processing needs. This powerful combination has enabled us to achieve scalability, reliability, and efficiency in handling large volumes of data, ensuring timely and accurate delivery of insights to the client.

Check out my Medium blog! I usually post about using Beam/Dataflow as an ETL tool with Python and how it works with other data engineering tools. My focus is on building projects that are easy to understand and learn from, especially if you want to get some hands-on experience with Beam.