Prompt: What are side inputs in Apache Beam?

Response: In Apache Beam, side inputs are additional inputs that can be provided to a ParDo transform, in addition to the main input PCollection. These side inputs can be used by the ParDo transform's DoFn to access and read additional data during the processing of each element in the main input PCollection.

Side inputs are particularly useful when a Beam pipeline needs to ingest additional data dynamically at runtime when processing each element, rather than relying on preset or hard-coded values. By leveraging side inputs, Apache Beam can determine additional data based on the main PCollection data or even from a different branch of your pipeline. For example, you can leverage side inputs for your streaming analytics use cases, enriching streaming events with additional data retrieved from a lookup table.

In the Apache Beam Python SDK, you can provide side inputs as extra arguments in the DoFn's process method or the Map/FlatMap transforms. The Python SDK supports optional, positional, and keyword arguments.

class MyDoFn(beam.DoFn):
  def process(self, element, side_input):
    ...

In the Apache Beam Java SDK, side inputs are provided to the ParDo transform by using the .withSideInputs method. To access a side input from within a DoFn, the DoFn.ProcessContext.sideInput method is invoked.

PCollection<Integer> input = ...;
PCollectionView<Integer> sideInput = ...;
PCollection<Integer> output =
    input.apply(ParDo.of(new DoFn<Integer, Integer>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Integer sideInputValue = c.sideInput(sideInput);
        ...
      }
    }).withSideInputs(sideInput));

Side inputs can also be used to process windowed data. Apache Beam uses the windows for the main input elements to look up the appropriate windows for the side input elements. Apache Beam projects the main input’s window to the side input's window set and uses the side input from the resulting window. Main and side inputs might have identical or different windowing strategies.

For example, if the main input PCollection is windowed into ten-minute windows, and the side input is windowed into one-hour windows, Apache Beam projects the main input window set and selects the value from the corresponding one-hour side input window. For more information, refer to side inputs and windowing.

For common patterns and code samples to implement side inputs in Apache Beam pipelines, refer to side input patterns.