Prompt: What are schemas in Apache Beam?

Response: A Schema in Apache Beam is a language-independent type definition for a PCollection. The schema defines elements of that PCollection as an ordered list of named fields.

In many cases, the element type in a PCollection has a structure that can be introspected. Some examples are JSON, Protocol Buffer, Avro, and database row objects. All of these formats can be converted to Beam Schemas.

To take advantage of schemas, your PCollection objects must have a schema attached. Often, the source itself attaches a schema to the PCollection.

Although schemas are language-independent, they're designed to embed naturally into the Beam SDK programming languages. This feature allows you to continue using native types while leveraging the advantage of having Beam understand your element schemas.

Apache Beam provides a collection of schema transforms that operate on schemas. For an example, see SqlTransform in the ‘Beam SQL overview’ section of the Apache Beam documentation.

The following list includes key capabilities provided by schema transforms:

  • field selection
  • grouping and aggregation
  • join operations
  • filtering data
  • adding and removing fields
  • renaming fields
  • type conversion
  • enhanced ParDo functionality

For more information about schemas and the Apache Beam best practices when working with structured data, see the ‘Schema Usage Patterns’ section in the Apache Beam documentation.