Prompt:
What are schemas in Apache Beam?
Response:
A [schema in Apache Beam](https://beam.apache.org/documentation/programming-guide/#schemas) is a language-independent type definition for the elements of a `PCollection`. The schema describes each element as an ordered list of named fields, where each field has a name and a type.
In many cases, the element type in a `PCollection` has a structure that can be introspected. Examples include JSON, Protocol Buffers, Avro, and database row objects. All of these formats can be converted to Beam schemas.
To take advantage of schemas, your `PCollection` objects must have a schema attached. Often, the source itself attaches a schema to the `PCollection`.
Although schemas are language-independent, they're designed to embed naturally into the programming languages of the Beam SDKs. As a result, you can continue [to use native types](https://beam.apache.org/documentation/programming-guide/#schemas-for-pl-types) while still benefiting from Beam understanding your element schemas.
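As a minimal sketch of using a native type with a schema in the Python SDK, the following example defines a `typing.NamedTuple` and registers `RowCoder` for it so that Beam can infer the schema from its fields. The `Purchase` type, its fields, and the `build_purchases` helper are hypothetical names chosen for illustration. The example assumes the `apache_beam` package is installed when the helper is called.

```python
import typing


class Purchase(typing.NamedTuple):
    """Hypothetical element type; its fields become the schema fields."""
    user_id: str
    item: str
    amount: float


def build_purchases(pipeline):
    """Returns a schema-aware PCollection of Purchase elements."""
    # Imported lazily so the type definition above works without Beam installed.
    import apache_beam as beam

    # Registering RowCoder tells Beam to treat Purchase as a schema type.
    beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)

    return (
        pipeline
        | "CreatePurchases" >> beam.Create([
            Purchase("alice", "book", 12.5),
            Purchase("bob", "pen", 1.2),
        ]).with_output_types(Purchase)
    )
```

The field order of the `NamedTuple` (`user_id`, `item`, `amount`) is the field order of the resulting schema.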
Beam provides a collection of [schema transforms](https://beam.apache.org/documentation/programming-guide/#662-schema-transforms) that operate natively on schemas. For an example, see [SqlTransform](https://beam.apache.org/documentation/dsls/sql/overview/).
The following list includes key capabilities provided by schema transforms:
* field selection
* grouping and aggregation
* join operations
* filtering data
* adding and removing fields
* renaming fields
* type conversion
* enhanced ParDo functionality
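Two of these capabilities, field selection and grouping with aggregation, can be sketched in the Python SDK with `beam.Select` and `beam.GroupBy`. The sample data, field names, and the `total_amount_per_user` helper are hypothetical and used only for illustration. The example assumes the `apache_beam` package is installed when the helper is called.

```python
# Hypothetical sample data: (user_id, item, amount) tuples.
SAMPLE_PURCHASES = [
    ("alice", "book", 12.5),
    ("bob", "pen", 1.2),
    ("alice", "lamp", 30.0),
]


def total_amount_per_user(pipeline, purchases=SAMPLE_PURCHASES):
    """Selects two fields, then sums `amount` for each `user_id`."""
    # Imported lazily so the sample data above is usable without Beam installed.
    import apache_beam as beam

    return (
        pipeline
        | "CreatePurchases" >> beam.Create(purchases)
        # Converting to beam.Row lets Beam infer a schema for the elements.
        | "ToRows" >> beam.Map(
            lambda p: beam.Row(
                user_id=str(p[0]), item=str(p[1]), amount=float(p[2])))
        # Field selection: keep only the fields the aggregation needs.
        | "SelectFields" >> beam.Select("user_id", "amount")
        # Grouping and aggregation: one output row per user_id.
        | "TotalPerUser" >> beam.GroupBy("user_id").aggregate_field(
            "amount", sum, "total_amount")
    )
```

To run the sketch locally, call the helper inside a pipeline context, for example `with beam.Pipeline() as p: total_amount_per_user(p) | beam.Map(print)`. Each output element carries the fields `user_id` and `total_amount`.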
For more information about schemas and Apache Beam best practices for working with structured data, see [Schema Usage Patterns](https://beam.apache.org/documentation/patterns/schema/).