learning/prompts/documentation-lookup/08_basic

Prompt: What is windowing in Apache Beam?

Response: In stream processing, windowing allows you to divide data streams into logical units for efficient and correct parallel processing.

With an unbounded data set, collecting all of the elements is impossible, because new elements are constantly being added. In the Beam model, any PCollection (including unbounded PCollection objects) can be subdivided into logical windows. Grouping transforms then consider the elements of each PCollection on a per-window basis.

Because Apache Beam's default windowing strategy is to assign each element to a single, global window, you must explicitly specify a windowing function for your pipeline.

The following code snippet shows how to divide a PCollection into 60-second windows:

from apache_beam import beam
from apache_beam import window
fixed_windowed_items = (
    items | 'window' >> beam.WindowInto(window.FixedWindows(60)))

Beam provides a number of built-in windowing functions that you can use to subdivide your data into different types of windows:

Fixed Time Windows
Sliding Time Windows
Per-Session Windows
Single Global Window
Calendar-based Windows (not supported by the Beam SDK for Python or Go)

You can also create your own custom windowing function WindowFn.

You need to specify a triggering strategy to determine when to emit the results of your pipeline’s windowed computations.

You can adjust the windowing strategy to allow for late data, which is data that arrives after the watermark has passed the end of the window. You can also specify how to handle late data, such as discarding or adding it to the next window.