---
title: "Google Summer of Code 2025 - Beam YAML, Kafka and Iceberg User
Accessibility"
date: 2025-09-23 00:00:00 -0400
categories:
- blog
- gsoc
aliases:
- /blog/2025/09/23/gsoc-25-yaml-user-accessibility.html
authors:
- charlespnh
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
The relatively new Beam YAML SDK was introduced in the spirit of making data processing easy,
but it has gained little adoption for complex ML tasks and hasn't been widely used with
[Managed I/O](https://beam.apache.org/documentation/io/managed-io/) such as Kafka and Iceberg.
As part of Google Summer of Code 2025, new illustrative, production-ready pipeline examples
covering ML use cases with Kafka and Iceberg data sources were developed with the YAML SDK
to address this adoption gap.
## Context
The YAML SDK was introduced in Spring 2024 as Beam's first no-code SDK. It follows a declarative approach
of defining a data processing pipeline using a YAML DSL, as opposed to the programming-language-specific SDKs.
At the time, it shipped with little documentation and few meaningful examples. Key missing examples
were ML workflows and integration with the Kafka and Iceberg Managed I/O. Foundational work had already been done
to add support for ML capabilities as well as Kafka and Iceberg IO connectors in the YAML SDK, but there were no
end-to-end examples demonstrating their usage.
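To make the declarative style concrete, here is a minimal sketch of what a Beam YAML pipeline looks like. The file paths and the field name are hypothetical placeholders; see the [Beam YAML documentation](https://beam.apache.org/documentation/sdks/yaml/) for the full transform catalog:

```yaml
# A minimal batch pipeline: read a CSV file, keep rows matching a
# predicate, and write the result as JSON. No Python or Java required.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/input/*.csv      # hypothetical input location
    - type: Filter
      config:
        language: python
        keep: "rating >= 4"                   # hypothetical field
    - type: WriteToJson
      config:
        path: gs://my-bucket/output/results   # hypothetical output location
```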
Beam, Kafka, and Iceberg are all mainstream big data technologies, but each comes with a learning curve.
The overall theme of the project is to help democratize data processing for scientists and analysts who traditionally
don't have a strong background in software engineering. They can now refer to these meaningful examples as a starting point,
helping them onboard faster and be more productive when authoring ML and data pipelines for their use cases with Beam and its YAML DSL.
## Contributions
The data pipelines/workflows developed are production-ready: Kafka and Iceberg data sources are set up on GCP,
and the data used are raw public datasets. The pipelines are tested end-to-end on Google Cloud Dataflow and
are also unit tested to ensure correct transformation logic.
The delivered pipelines and workflows, each documented in a README.md, address the four main ML use cases below:
1. **Streaming Classification Inference**: A streaming ML pipeline that demonstrates the Beam YAML SDK's ability to perform
classification inference on a stream of incoming data from Kafka (a minimal sketch of this Kafka-to-inference shape follows the list). The overall workflow also includes
DistilBERT model deployment and serving on Google Cloud Vertex AI, which the pipeline calls for remote inference.
The pipeline is applied to a sentiment analysis task on a stream of YouTube comments, preprocessing data and classifying
whether a comment is positive or negative. See [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis/streaming_sentiment_analysis.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis).
2. **Streaming Regression Inference**: A streaming ML pipeline that demonstrates the Beam YAML SDK's ability to perform
regression inference on a stream of incoming data from Kafka. The overall workflow also includes
custom model training, deployment, and serving on Google Cloud Vertex AI, which the pipeline calls for remote inference.
The pipeline is applied to a regression task on a stream of taxi rides, preprocessing data and predicting the fare amount
for every ride. See [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare/streaming_taxifare_prediction.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare).
3. **Batch Anomaly Detection**: An ML workflow that demonstrates ML-specific transformations
and reading from/writing to Iceberg I/O (a minimal Iceberg-plus-MLTransform sketch also follows the list). The workflow contains unsupervised model training and several pipelines that leverage
Iceberg for storing results, BigQuery for storing vector embeddings, and MLTransform for computing embeddings, demonstrating
an end-to-end anomaly detection workflow on a dataset of system logs. See [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis/batch_log_analysis.sh) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis).
4. **Feature Engineering & Model Evaluation**: An ML workflow that demonstrates the Beam YAML SDK's ability to perform feature engineering,
with the resulting features subsequently used for model evaluation, and its integration with Iceberg I/O. The workflow contains model training
and several pipelines, showcasing an end-to-end fraud detection MLOps solution that generates features and evaluates models
to detect fraudulent credit card transactions. See [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection/fraud_detection_mlops_beam_yaml_sdk.ipynb) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection).
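To give a flavor of the streaming examples (items 1 and 2), the sketch below shows the core Kafka-to-inference shape of such a pipeline. The topic, schema, endpoint, project, and table values are hypothetical placeholders, and the exact model-handler configuration may differ from the real examples linked above:

```yaml
pipeline:
  type: chain
  transforms:
    # Consume a stream of JSON records from a Kafka topic.
    - type: ReadFromKafka
      config:
        topic: my-topic                        # hypothetical topic
        format: JSON
        schema: |
          {"type": "object", "properties": {"text": {"type": "string"}}}
        bootstrap_servers: "BOOTSTRAP_SERVER:9092"
    # Call a model deployed on a Vertex AI endpoint for each element.
    - type: RunInference
      config:
        inference_tag: inference
        model_handler:
          type: VertexAIModelHandlerJSON
          config:
            endpoint_id: "ENDPOINT_ID"         # hypothetical endpoint
            project: "MY_PROJECT"
            location: "us-central1"
    # Persist predictions for downstream analysis.
    - type: WriteToBigQuery
      config:
        table: "MY_PROJECT.results.predictions"
```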
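Similarly, for the batch examples (items 3 and 4), the following sketch illustrates how a YAML pipeline might read from an Iceberg table and compute embeddings with `MLTransform`. The catalog, table, column, and model names are hypothetical, and the real workflows linked above are considerably more involved:

```yaml
pipeline:
  type: chain
  transforms:
    # Read raw records from an Iceberg table managed by a Hadoop catalog.
    - type: ReadFromIceberg
      config:
        table: logs_db.raw_logs                # hypothetical table
        catalog_name: my_catalog
        catalog_properties:
          type: hadoop
          warehouse: gs://MY_BUCKET/warehouse
    # Compute sentence embeddings for the log message column.
    - type: MLTransform
      config:
        write_artifact_location: gs://MY_BUCKET/ml-artifacts
        transforms:
          - type: SentenceTransformerEmbeddings
            config:
              model_name: all-MiniLM-L6-v2     # hypothetical model choice
              columns: [log_message]
    # Persist the enriched rows to another Iceberg table.
    - type: WriteToIceberg
      config:
        table: logs_db.log_embeddings
        catalog_name: my_catalog
        catalog_properties:
          type: hadoop
          warehouse: gs://MY_BUCKET/warehouse
```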
## Challenges
The main challenge of the project was the lack of prior YAML pipeline examples and good documentation to rely on.
Unlike the Python or Java SDKs, where many notebooks and end-to-end examples demonstrate various use cases,
the examples for the YAML SDK only involved simple transformations such as filter, group by, etc. More complex transforms like
`MLTransform` and `ReadFromIceberg` had no examples and required configurations that didn't have a clear API reference at the time.
As a result, the work involved many deep dives into the actual implementation of the PTransforms across the YAML, Python, and Java SDKs to
understand the error messages and how to correctly use the transforms.
Another challenge was writing unit tests to ensure that each pipeline's transformation logic is correct.
Understanding how the existing test suite is set up, and how it can be used to write unit tests for
the data pipelines, was a learning curve of its own. A lot of time was spent properly writing mocks for each pipeline's sources and sinks, as well as for the
transforms that require external services such as Vertex AI.
## Conclusion & Personal Thoughts
These production-ready pipelines demonstrate the potential of the Beam YAML SDK for authoring complex ML workflows
that interact with Iceberg and Kafka. The examples are a valuable addition to Beam, especially with the upcoming Beam 3.0.0
milestones, which focus on low-code/no-code authoring, ML capabilities, and Managed I/O.
I had an amazing time working with the big data technologies Beam, Iceberg, and Kafka, as well as many Google Cloud services
(Dataflow, Vertex AI, and Google Kubernetes Engine, to name a few). I've always wanted to work more in the ML space, and this
experience has been a great growth opportunity for me. Google Summer of Code was selective this year, and the project's success
would not have been possible without the support of my mentor, Chamikara Jayalath. It's been a pleasure working closely
with him and the broader Beam community to contribute to this open-source project that has a meaningful impact on the
data engineering community.
My advice for future Google Summer of Code participants is, first and foremost, to research and choose a project that aligns closely
with your interests. Most importantly, spend a lot of time making yourself visible and writing a good proposal when the program
opens for applications. Being visible (e.g. by sharing your proposal, or any ideas and questions, on the project's
communication channel early on) makes it more likely that you will be selected, and a good proposal not only makes you even
more likely to get into the program but also gives you a lot of confidence when contributing to and completing the project.
## References
- [Google Summer of Code Project Listing](https://summerofcode.withgoogle.com/programs/2025/projects/f4kiDdus)
- [Google Summer of Code Final Report](https://docs.google.com/document/d/1MSAVF6X9ggtVZbqz8YJGmMgkolR_dve0Lr930cByyac/edit?usp=sharing)