Prompt: What are schemas in Apache Beam?
Response: A Schema in Apache Beam is a language-independent type definition for a PCollection. The schema defines elements of that PCollection as an ordered list of named fields.
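To make "an ordered list of named fields" concrete, here is a minimal plain-Python sketch that needs no Beam install. The `Purchase` type and the `describe_fields` helper are hypothetical names used only for illustration; the Beam Python SDK can infer a schema from a `typing.NamedTuple` class shaped like this one.

```python
import typing


class Purchase(typing.NamedTuple):
    """Hypothetical element type: three named, typed fields in a fixed order."""
    user_id: str
    item: str
    amount: float


def describe_fields(row_type):
    """Return (field name, type name) pairs in declaration order."""
    hints = typing.get_type_hints(row_type)
    return [(name, hints[name].__name__) for name in row_type._fields]


if __name__ == "__main__":
    print(describe_fields(Purchase))
    # → [('user_id', 'str'), ('item', 'str'), ('amount', 'float')]
```

The field order is part of the definition, which is what distinguishes a schema from an unordered mapping of names to types.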
In many cases, the element type in a PCollection has a structure that can be introspected. Some examples are JSON, Protocol Buffer, Avro, and database row objects. All of these formats can be converted to Beam Schemas.
To take advantage of schemas, your PCollection objects must have a schema attached. Often, the source itself attaches a schema to the PCollection.
Although schemas are language-independent, they're designed to embed naturally into the Beam SDK programming languages. This lets you keep using native types while Beam still understands the schema of your elements.
Apache Beam provides a collection of schema transforms that operate on schemas. For an example, see SqlTransform in the ‘Beam SQL overview’ section of the Apache Beam documentation.
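Key capabilities provided by schema transforms include field selection, grouping and aggregation, joins, filtering, adding and removing fields, renaming fields, and type conversion.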
For more information about schemas and the Apache Beam best practices when working with structured data, see the ‘Schema Usage Patterns’ section in the Apache Beam documentation.