
Learning Resources

Welcome to our learning resources. This page collects resources that will help you get started with and use Apache Beam. If you’re just starting, you can treat this page as a guided tour; otherwise, jump straight to any section that interests you.

If you have additional material that you would like to see here, please let us know at user@beam.apache.org!


Getting Started

Quickstart

Learning the Basics

  • WordCount - Walks you through the code of a simple WordCount pipeline, which demonstrates the most basic concepts of data processing. WordCount is the “Hello World” of data processing.
  • Mobile Gaming - Introduces how to account for time while processing data, along with user-defined transforms, windowing, filtering data, streaming pipelines, triggers, and session analysis. This is a great place to start once you get the hang of WordCount.
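As a rough idea of what WordCount computes, here is a minimal plain-Python sketch. It deliberately does not use the Beam SDK, so it shows only the logic and not how a pipeline is built. The function name `count_words` and the word-splitting regular expression are assumptions made for illustration.

```python
import re
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Split each line into words; this regex is an illustrative choice.
        counts.update(re.findall(r"[A-Za-z']+", line))
    return dict(counts)


if __name__ == "__main__":
    print(count_words(["to be or not to be"]))
```

The WordCount walkthrough expresses the same steps (read lines, split into words, count per word) as transforms in a Beam pipeline so that a runner can distribute the work.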

Fundamentals

  • Programming Guide - Contains more in-depth information on most topics in the Apache Beam SDK, including descriptions of how each part works and code snippets that show how to use it. You can use it as a reference guide.
  • The world beyond batch: Streaming 101 - Covers some basic background information, terminology, time domains, batch processing, and streaming.
  • The world beyond batch: Streaming 102 - A tour of the unified batch and streaming programming model in Beam, along with an example that explains many of the concepts.
  • Apache Beam Execution Model - Explains how runners execute an Apache Beam pipeline, including why serialization is important and how a runner might distribute the work in parallel across multiple machines.

Common Patterns

  • Common Use Case Patterns Part 1 - Common patterns such as writing data to multiple storage locations, slowly-changing lookup cache, calling external services, dealing with bad data, and starting jobs through a REST endpoint.
  • Common Use Case Patterns Part 2 - Common patterns such as GroupBy using multiple data properties, joining two PCollections on a common key, streaming large lookup tables, merging two streams with different window lengths, and threshold detection with time-series data.
  • Retry Policy - Adding a retry policy to a DoFn.
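To make the retry idea concrete, the following is a minimal sketch of a generic retry helper with exponential backoff, written in plain Python. It is not the approach from the linked Retry Policy resource and it does not use the Beam SDK. The name `call_with_retry` and its parameters are hypothetical; inside a DoFn you might wrap the call to an external service in something similar.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so that the backoff schedule can be checked without actually waiting. A production policy would usually retry only on errors known to be transient rather than on every exception.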

Articles

Data Analysis

  • Predicting news social engagement - Uses multiple data sources, many common design patterns, and sentiment analysis with TensorFlow and Dataflow to gain insights into news articles.
  • Processing IoT Data - IoT sensors continuously stream data to the cloud. Learn how to handle sensor data, which can be useful for real-time monitoring, alerts, long-term data storage for analysis, performance improvement, and model training.

Data Migration

Machine Learning

Advanced Concepts

  • Running on AppEngine - Use a Dataflow template to launch a pipeline from Google App Engine, and run the pipeline periodically via a cron job.
  • Stateful Processing - Learn how to access persistent mutable state while processing input elements, which allows for side effects in a DoFn. One use is arbitrary-but-consistent index assignment: assigning a unique, incrementing index to each incoming element when order doesn't matter.
  • Timely and Stateful Processing - An example of how to batch RPC calls. Requests are stored in mutable state as they are received. Once enough requests have accumulated or a certain amount of time has passed, the batch of requests is sent.
  • Running External Libraries - Call an external library written in a language, such as C++, that does not have a native Apache Beam SDK.
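The batching behavior described under Timely and Stateful Processing can be sketched in plain Python as follows. This is only an illustration of the logic, not Beam code: in Beam, the buffer and the deadline would be held in per-key state and timers managed by the runner. The class `RequestBatcher`, the callable `send_batch`, and the explicit `now` timestamps are all assumptions made for this sketch.

```python
class RequestBatcher:
    """Buffer requests; flush on a size limit or after max_wait seconds."""

    def __init__(self, send_batch, max_size=3, max_wait=10.0):
        self.send_batch = send_batch  # Hypothetical callable that performs the RPC.
        self.max_size = max_size
        self.max_wait = max_wait
        self._buffer = []        # Mutable state: pending requests.
        self._first_seen = None  # Arrival time of the oldest pending request.

    def add(self, request, now):
        """Store a request that arrived at time `now` (seconds); flush if due."""
        self.check_timer(now)  # First flush anything that has waited too long.
        if not self._buffer:
            self._first_seen = now
        self._buffer.append(request)
        if len(self._buffer) >= self.max_size:
            self.flush()

    def check_timer(self, now):
        """Flush if the oldest pending request has waited at least max_wait."""
        if self._buffer and now - self._first_seen >= self.max_wait:
            self.flush()

    def flush(self):
        """Send all pending requests as one batch and clear the state."""
        if self._buffer:
            self.send_batch(list(self._buffer))
            self._buffer.clear()
            self._first_seen = None
```

Passing `now` explicitly keeps the sketch deterministic. A real pipeline would rely on the runner's timers instead of calling `check_timer` by hand.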

Interactive Labs

Java

  • Big Data Text Processing Pipeline (40m) - Run a word count pipeline on the Dataflow runner.
  • Real Time Machine Learning (45m) - Create a real-time flight delay prediction service using historical data on internal flights in the United States.
  • Visualize Real-Time Geospatial Data (60m) - Process streaming data from a real-world historical data set, store the results in BigQuery, and visualize the geospatial data in Data Studio.
  • Processing Time Windowed Data (90m) - Implement time-windowed aggregation to augment the raw data and produce consistent training and test datasets for a machine learning model.

Python

Beam Katas

Beam Katas are interactive Beam coding exercises (i.e., code katas) that help you learn Apache Beam concepts and its programming model hands-on. Built on JetBrains Educational Products, Beam Katas aim to provide a series of structured, hands-on learning experiences that help learners understand Apache Beam and its SDKs by solving exercises of gradually increasing complexity. Beam Katas are available for both the Java and Python SDKs.

Java

  • Download IntelliJ Edu
  • Upon opening the IDE, expand the “Learn and Teach” menu, then select “Browse Courses”
  • Search for “Beam Katas - Java”
  • Expand the “Advanced Settings” and modify the “Location” and “Jdk” appropriately
  • Click “Join”
  • Learn more about how to use the Education product

Python

  • Download PyCharm Edu
  • Upon opening the IDE, expand the “Learn and Teach” menu, then select “Browse Courses”
  • Search for “Beam Katas - Python”
  • Expand the “Advanced Settings” and modify the “Location” and “Interpreter” appropriately
  • Click “Join”
  • Learn more about how to use the Education product

Code Examples

Java

  • Snippets 1 - Commonly used data analysis patterns such as using BigQuery, applying a CombinePerKey transform, removing duplicate lines in files, filtering, joining PCollections, and getting the maximum value of a PCollection.
  • Snippets 2 - Additional examples of common tasks such as configuring BigQuery and PubSub, and writing one file per window.
  • Complete Examples - End-to-end example pipelines such as autocomplete, a streaming word extract, calculating the Term Frequency-Inverse Document Frequency (TF-IDF), getting the top Wikipedia sessions, traffic max lane flow, and traffic routes.

Python

  • Snippets - Commonly-used data analysis patterns such as how to use BigQuery, Datastore, coders, combiners, filters, custom PTransforms, etc.
  • Complete Examples - End-to-end example pipelines such as autocomplete, getting mobile gaming statistics, calculating the Julia set, solving distributed optimization tasks, estimating pi, calculating the Term Frequency-Inverse Document Frequency (TF-IDF), and getting the top Wikipedia sessions.
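For readers unfamiliar with TF-IDF, mentioned among the complete examples, here is a small plain-Python sketch of one common formulation: term frequency is a term's count divided by the document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The Beam example pipelines may use a different variant. The function name `tf_idf` and the input shape are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_name: {term: score}} for a {doc_name: [terms]} mapping."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        scores[name] = {
            # tf = count / document length; idf = ln(n_docs / df).
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores
```

With this formulation, a term that appears in every document scores zero, while terms concentrated in a few documents score higher.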

API Reference

Feedback and Suggestions

We are open to feedback and suggestions. You can find different ways to reach out to the community on the Contact Us page.

If you have a bug report or want to suggest a new feature, you can let us know by submitting a new issue.

How to Contribute

We welcome contributions from everyone! To learn more about how to contribute, see our Contribution Guide.