learning/prompts/documentation-lookup-nolinks/50_beam_yaml.md - beam - Git at Google

 Prompt:
 Can I create Apache Beam pipelines using YAML?

 Response:
 The Beam YAML API enables the creation of basic Apache Beam pipelines using YAML files, which can be edited with any text editor. The API provides an intuitive, declarative syntax that simplifies pipeline authoring by eliminating the need for coding experience or SDK familiarity. Manipulating and understanding YAML semantics is easier compared to working directly with Beam protos. Additionally, YAML's readability and ease of authorship make it a suitable intermediate representation for tools such as pipeline authoring GUIs or lineage analysis tools.

 The Beam YAML parser is currently integrated into the Apache Beam Python SDK.

 The Beam YAML SDK offers a wide range of built-in transforms and allows the creation of custom transforms through the notion of `providers` that leverage expansion services and schema transforms, or by integrating Python packages into your pipeline. You can give names to transforms to help you monitor, debug, and disambiguate identical transforms, enhancing readability.

 Beam YAML enables authoring both linear and non-linear pipelines. For linear pipelines, Beam YAML allows the inputs to be implicit by specifying the pipeline as a `chain` type. Additionally, you can denote the first and last transforms as `source` and `sink`, respectively. For non-linear pipelines, inputs must be explicitly named, although `chains` can be nested within them.

 The following Beam YAML code snippet demonstrates a linear pipeline that reads data from a CSV file, filters rows based on a specified condition, executes an SQL query, and writes the resulting data to a JSON file:

 ```yaml
 pipeline:
   type: chain

   source:
     type: ReadFromCsv
     config:
       path: /path/to/input*.csv

   transforms:
     - type: Filter
       config:
         language: python
         keep: "col3 > 100"

     - type: Sql
       name: MySqlTransform
       config:
         query: "select col1, count(*) as cnt from PCOLLECTION group by col1"

   sink:
     type: WriteToJson
     config:
       path: /path/to/output.json
 ```

 With Beam YAML, you can define both streaming and batch pipelines. Beam YAML also supports Apache Beam’s windowing and triggering, allowing windowing to be declared using the standard `WindowInto` transform or by tagging a transform with specified windowing. The latter applies the windowing to its inputs and the transform itself.

 The following example demonstrates a streaming Beam YAML pipeline with a chain of transforms. It reads from a Pub/Sub topic, applies a grouping transform with sliding windowing, and then writes the output to another Pub/Sub topic:

 ```yaml
 pipeline:
   type: chain
   transforms:
     - type: ReadFromPubSub
       config:
         topic: myPubSubTopic
         format: ...
         schema: ...
     - type: SomeGroupingTransform
       config:
         arg: ...
       windowing:
         type: sliding
         size: 60s
         period: 10s
     - type: WriteToPubSub
       config:
         topic: anotherPubSubTopic
         format: json
 ```

 You can execute a pipeline defined in a YAML file using the standard `python -m` command:

 ```python
 python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner]
 ```
	Prompt:
	Can I create Apache Beam pipelines using YAML?

	Response:
	The Beam YAML API enables the creation of basic Apache Beam pipelines using YAML files, which can be edited with any text editor. The API provides an intuitive, declarative syntax that simplifies pipeline authoring by eliminating the need for coding experience or SDK familiarity. Manipulating and understanding YAML semantics is easier compared to working directly with Beam protos. Additionally, YAML's readability and ease of authorship make it a suitable intermediate representation for tools such as pipeline authoring GUIs or lineage analysis tools.

	The Beam YAML parser is currently integrated into the Apache Beam Python SDK.

	The Beam YAML SDK offers a wide range of built-in transforms and allows the creation of custom transforms through the notion of `providers` that leverage expansion services and schema transforms, or by integrating Python packages into your pipeline. You can give names to transforms to help you monitor, debug, and disambiguate identical transforms, enhancing readability.

	Beam YAML enables authoring both linear and non-linear pipelines. For linear pipelines, Beam YAML allows the inputs to be implicit by specifying the pipeline as a `chain` type. Additionally, you can denote the first and last transforms as `source` and `sink`, respectively. For non-linear pipelines, inputs must be explicitly named, although `chains` can be nested within them.

	The following Beam YAML code snippet demonstrates a linear pipeline that reads data from a CSV file, filters rows based on a specified condition, executes an SQL query, and writes the resulting data to a JSON file:

	```yaml
	pipeline:
	type: chain

	source:
	type: ReadFromCsv
	config:
	path: /path/to/input*.csv

	transforms:
	- type: Filter
	config:
	language: python
	keep: "col3 > 100"

	- type: Sql
	name: MySqlTransform
	config:
	query: "select col1, count(*) as cnt from PCOLLECTION group by col1"

	sink:
	type: WriteToJson
	config:
	path: /path/to/output.json
	```

	With Beam YAML, you can define both streaming and batch pipelines. Beam YAML also supports Apache Beam’s windowing and triggering, allowing windowing to be declared using the standard `WindowInto` transform or by tagging a transform with specified windowing. The latter applies the windowing to its inputs and the transform itself.

	The following example demonstrates a streaming Beam YAML pipeline with a chain of transforms. It reads from a Pub/Sub topic, applies a grouping transform with sliding windowing, and then writes the output to another Pub/Sub topic:

	```yaml
	pipeline:
	type: chain
	transforms:
	- type: ReadFromPubSub
	config:
	topic: myPubSubTopic
	format: ...
	schema: ...
	- type: SomeGroupingTransform
	config:
	arg: ...
	windowing:
	type: sliding
	size: 60s
	period: 10s
	- type: WriteToPubSub
	config:
	topic: anotherPubSubTopic
	format: json
	```

	You can execute a pipeline defined in a YAML file using the standard `python -m` command:

	```python
	python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner]
	```