| # Apache Beam starter for Python Providers |
| |
| This provides an example of how to write a catalog of transforms via a provider |
| in Python that can be used from Beam YAML. |
| |
| If you want to clone this repository to start your own project, |
| you can choose the license you prefer and feel free to delete anything |
| related to the license you are dropping. |
| |
| ## Before you begin |
| |
| Make sure you have a [Python 3](https://www.python.org/) development environment ready. |
| If you don't, you can download and install it from the |
| [Python downloads page](https://www.python.org/downloads/). |
| |
| This project uses Poetry to manage its dependencies; however, it can be used |
| within any virtual environment where its dependencies (as listed in |
| `pyproject.toml`) are installed. |
| |
| Simple steps to set up your environment using [Poetry](https://python-poetry.org/): |
| |
| ```shell |
| # Install Poetry if you haven't already |
| pip install poetry |
| |
| # Install project dependencies into a virtual environment |
| poetry install |
| |
| # Build the Python package containing your PTransforms |
| # This creates a distributable tarball (e.g., dist/beam_starter_python_provider-*.tar.gz) |
| # which Beam YAML needs to find your custom transforms. |
| poetry build |
| |
| # Activate the virtual environment managed by Poetry |
| # Alternatively, you can prefix commands with `poetry run` |
| eval $(poetry env activate) |
| ``` |
| |
| ## Overview |
| |
| Beam YAML transforms can be ordinary Python transforms that are simply |
| enumerated in a YAML file that indicates where to find them and how |
| to instantiate them. This allows one to author arbitrarily complex |
| transforms in Python and offer them for easy use from within a Beam |
| YAML pipeline. The main steps that are involved are: |
| |
| 1. Author your PTransforms which accept and produce [schema'd PCollections](https://beam.apache.org/documentation/programming-guide/#schemas). |
| In practice, this means that the elements are named tuples or `beam.Row` objects. |
| |
| 2. Publish these transforms as a standard Python package. For local development |
| this can be a simple tarball; for production use, the package can be published |
| to PyPI or hosted anywhere else that is accessible to its users. |
| |
| 3. Write a simple YAML file enumerating these transforms, providing both their |
| qualified names and the package(s) in which they live. This file can then |
| be referenced from a Beam YAML pipeline, which can then use these transforms. |
| |
| ## This repository |
| |
| This repository is a complete working example of the above steps. |
| |
| ### PTransform definitions |
| |
| Several transforms are defined in [my_provider.py](./my_provider.py). |
| Ordinary unit tests can be found in |
| [my_provider_test.py](./my_provider_test.py) |
| and can be run with pytest. |
| A real-world example would probably have a more structured layout than |
| putting all transforms in a single top-level Python module, but this |
| is a Python package structuring question and would not change anything |
| essential here. |
| |
| ### Publishing the package |
| |
| Run `poetry build` to build the package. |
| This will create the file `dist/beam_starter_python_provider-0.1.0.tar.gz` |
| which is referenced elsewhere. |
| |
| ### Write the provider listing file |
| |
| The next step is to write a file that tells Beam YAML how and where to find the |
| transforms that were defined above. |
| An example of how to do this is given in |
| [examples/provider_listing.yaml](examples/provider_listing.yaml). |
| These listings can also be specified inline with the pipeline definition |
| as done in [examples/simple.yaml](examples/simple.yaml). |
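For orientation, an inline listing might look roughly like the fragment below. This is a sketch, not a copy of the files in `examples/`: the transform name `WordLength` and its qualified name `my_provider.WordLength` are hypothetical, and the relative tarball path assumes the pipeline is launched from the `examples` directory.

```yaml
pipeline:
  type: chain
  transforms:
    - type: Create
      config:
        elements:
          - {word: "beam"}
          - {word: "python"}
    # Hypothetical transform name, declared in the providers section below.
    - type: WordLength
    - type: LogForDebugging

providers:
  - type: pythonPackage
    config:
      packages:
        # Assumed location of the tarball produced by `poetry build`.
        - ../dist/beam_starter_python_provider-0.1.0.tar.gz
    transforms:
      # YAML transform name -> fully qualified Python name (hypothetical).
      WordLength: "my_provider.WordLength"
```

Each key under `transforms` is the name used as a `type` in the pipeline, and each value is the fully qualified name of the Python PTransform to instantiate.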
| |
| ### Use the transforms |
| |
| The [examples](./examples/) directory contains several examples of how to invoke the provided transforms from Beam YAML. |
| The script at [examples/run_all.sh](./examples/run_all.sh) shows how they can be run locally. |
| |
| **Running Locally:** |
| |
| Make sure you have activated your virtual environment. |
| |
| ```shell |
| cd examples |
| python -m apache_beam.yaml.main --yaml_pipeline_file=./simple.yaml |
| ``` |
| |
| **Running with Dataflow:** |
| |
| You will need a Google Cloud project and a GCS bucket for staging. |
| |
| * **Using `gcloud`:** (Requires gcloud CLI installed and configured) |
| |
| Replace `<YOUR_PROJECT_ID>` and `<YOUR_REGION>` with your details. |
| |
| ```shell |
| cd examples |
| gcloud dataflow yaml run my-yaml-job \ |
| --yaml-pipeline-file=./simple.yaml \ |
| --project=<YOUR_PROJECT_ID> \ |
| --region=<YOUR_REGION> |
| ``` |
| |
| * **Using `python -m apache_beam.yaml.main`:** |
| |
| Replace `<YOUR_PROJECT_ID>`, `<YOUR_REGION>`, and `<YOUR_GCS_BUCKET>` with your details. |
| |
| ```shell |
| cd examples |
| python -m apache_beam.yaml.main \ |
| --yaml_pipeline_file=./simple.yaml \ |
| --runner=DataflowRunner \ |
| --project=<YOUR_PROJECT_ID> \ |
| --region=<YOUR_REGION> \ |
| --temp_location=gs://<YOUR_GCS_BUCKET>/temp \ |
| --staging_location=gs://<YOUR_GCS_BUCKET>/staging |
| ``` |