| Prompt: |
| Can I create Apache Beam pipelines using YAML? |
| |
| Response: |
| The Beam YAML API enables the creation of basic Apache Beam pipelines using YAML files, which can be edited with any text editor. The API provides an intuitive, declarative syntax that simplifies pipeline authoring by eliminating the need for coding experience or SDK familiarity. Manipulating and understanding YAML semantics is easier compared to working directly with Beam protos. Additionally, YAML's readability and ease of authorship make it a suitable intermediate representation for tools such as pipeline authoring GUIs or lineage analysis tools. |
| |
| The Beam YAML parser is currently integrated into the Apache Beam Python SDK. |
| |
| The Beam YAML SDK offers a wide range of built-in transforms and allows the creation of custom transforms through the notion of `providers` that leverage expansion services and schema transforms, or by integrating Python packages into your pipeline. You can give names to transforms to help you monitor, debug, and disambiguate identical transforms, enhancing readability. |
| |
| Beam YAML enables authoring both linear and non-linear pipelines. For linear pipelines, Beam YAML allows the inputs to be implicit by specifying the pipeline as a `chain` type. Additionally, you can denote the first and last transforms as `source` and `sink`, respectively. For non-linear pipelines, inputs must be explicitly named, although `chains` can be nested within them. |
| |
| The following Beam YAML code snippet demonstrates a linear pipeline that reads data from a CSV file, filters rows based on a specified condition, executes an SQL query, and writes the resulting data to a JSON file: |
| |
| ```yaml |
| pipeline: |
| type: chain |
| |
| source: |
| type: ReadFromCsv |
| config: |
| path: /path/to/input*.csv |
| |
| transforms: |
| - type: Filter |
| config: |
| language: python |
| keep: "col3 > 100" |
| |
| - type: Sql |
| name: MySqlTransform |
| config: |
| query: "select col1, count(*) as cnt from PCOLLECTION group by col1" |
| |
| sink: |
| type: WriteToJson |
| config: |
| path: /path/to/output.json |
| ``` |
| |
| With Beam YAML, you can define both streaming and batch pipelines. Beam YAML also supports Apache Beam’s windowing and triggering, allowing windowing to be declared using the standard `WindowInto` transform or by tagging a transform with specified windowing. The latter applies the windowing to its inputs and the transform itself. |
| |
| The following example demonstrates a streaming Beam YAML pipeline with a chain of transforms. It reads from a Pub/Sub topic, applies a grouping transform with sliding windowing, and then writes the output to another Pub/Sub topic: |
| |
| ```yaml |
| pipeline: |
| type: chain |
| transforms: |
| - type: ReadFromPubSub |
| config: |
| topic: myPubSubTopic |
| format: ... |
| schema: ... |
| - type: SomeGroupingTransform |
| config: |
| arg: ... |
| windowing: |
| type: sliding |
| size: 60s |
| period: 10s |
| - type: WriteToPubSub |
| config: |
| topic: anotherPubSubTopic |
| format: json |
| ``` |
| |
| You can execute a pipeline defined in a YAML file using the standard `python -m` command: |
| |
| ```python |
| python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner] |
| ``` |
| |