This guide shows you how to set up your Python development environment, get the Apache Beam SDK for Python, and run an example pipeline.
If you're interested in contributing to the Apache Beam Python codebase, see the Contribution Guide.
{{< toc >}}
The Python SDK supports Python 3.8, 3.9, 3.10, 3.11 and 3.12. Beam 2.48.0 was the last release with support for Python 3.7.
For details, see Set up your development environment.
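As a quick sanity check before installing, the snippet below compares the running interpreter against the version range listed above. It is a minimal sketch: the bounds are copied from this page, and `is_supported` is a hypothetical helper name, not part of the Beam SDK.

```python
import sys

# Version range taken from the text above: Python 3.8 through 3.12.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "%d.%d" % sys.version_info[:2]
    status = "supported" if is_supported() else "not supported"
    print("Python %s is %s by this SDK release" % (current, status))
```

If the check reports an unsupported version, install one of the listed Python versions before creating the virtual environment.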
A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, run:
{{< shell unix >}} python -m venv /path/to/directory {{< /shell >}}
{{< shell powerShell >}} PS> python -m venv C:\path\to\directory {{< /shell >}}
You must activate a virtual environment in each shell that uses it. Activation sets environment variables that point to the virtual environment's directories.
To activate a virtual environment in Bash or PowerShell, run:
{{< shell unix >}} . /path/to/directory/bin/activate {{< /shell >}}
{{< shell powerShell >}} PS> C:\path\to\directory\Scripts\activate.ps1 {{< /shell >}}
That is, execute the activate script under the virtual environment directory you created.
For instructions using other shells, see the venv documentation.
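To confirm that activation took effect, you can ask the interpreter itself. The sketch below relies on the standard behavior of environments created with `venv`: inside one, `sys.prefix` points at the environment directory while `sys.base_prefix` still points at the base installation. The helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base installation's."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("Virtual environment active at", sys.prefix)
    else:
        print("No virtual environment is active")
```

Run it with the same `python` command you plan to use for the pipeline, so that it checks the interpreter the shell actually resolves.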
Install the latest Python SDK from PyPI:
{{< shell unix >}} pip install apache-beam {{< /shell >}}
{{< shell powerShell >}} PS> python -m pip install apache-beam {{< /shell >}}
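One way to verify the result of the install step is to look up the distribution's metadata from Python. This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, and it returns `None` rather than raising when the package is missing.

```python
from importlib import metadata


def installed_version(distribution="apache-beam"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("apache-beam is not installed in this environment")
    else:
        print("apache-beam", version, "is installed")
```

Because the lookup reads package metadata instead of importing the SDK, it stays fast and reports on the environment of whichever interpreter runs it.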
The installation command above does not include the extra dependencies needed for features such as the Google Cloud Dataflow runner. The extra packages required for different features are listed below. To install several extras at once, use a command like pip install 'apache-beam[feature1,feature2]'.
- pip install 'apache-beam[gcp]'
- pip install 'apache-beam[aws]'
- pip install 'apache-beam[azure]'
- pip install 'apache-beam[yaml]'
- pip install 'apache-beam[dataframe]'
- pip install 'apache-beam[test]'
- pip install 'apache-beam[docs]'

The Apache Beam examples directory has many examples. All examples can be run locally by passing the required arguments described in the example script.
For example, run wordcount.py with the following command:
{{< runner direct >}} python -m apache_beam.examples.wordcount --input /path/to/inputfile --output /path/to/write/counts {{< /runner >}}
{{< runner flink >}} python -m apache_beam.examples.wordcount --input /path/to/inputfile \
    --output /path/to/write/counts \
    --runner FlinkRunner {{< /runner >}}
{{< runner spark >}} python -m apache_beam.examples.wordcount --input /path/to/inputfile \
    --output /path/to/write/counts \
    --runner SparkRunner {{< /runner >}}
{{< runner dataflow >}}
pip install 'apache-beam[gcp]'
python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs:///counts \
    --runner DataflowRunner \
    --project your-gcp-project \
    --region your-gcp-region \
    --temp_location gs:///tmp/ {{< /runner >}}
{{< runner nemo >}} This runner is not yet available for the Python SDK. {{< /runner >}}
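If you want to launch the example from a script instead of typing the command, the same flags can be assembled as an argument list. This is only a sketch: `wordcount_command` is a hypothetical helper, it uses just the `--input`, `--output`, and `--runner` flags shown above, and any further keyword arguments are passed through as `--name value` pairs without validation.

```python
import sys


def wordcount_command(input_path, output_path, runner=None, **options):
    """Build an argv list for the wordcount example module."""
    argv = [sys.executable, "-m", "apache_beam.examples.wordcount",
            "--input", input_path, "--output", output_path]
    if runner is not None:
        argv += ["--runner", runner]
    # Extra pipeline options become "--name value" pairs, in the order given.
    for name, value in options.items():
        argv += ["--" + name, str(value)]
    return argv


if __name__ == "__main__":
    # Placeholder paths, as in the commands above.
    print(" ".join(wordcount_command("/path/to/inputfile", "/path/to/write/counts")))
```

Passing the returned list to `subprocess.run(..., check=True)` would start the pipeline with the current interpreter, which keeps the run inside the active virtual environment.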
After the pipeline completes, you can view the output files at your specified output path. For example, if you specify /dir1/counts for the --output parameter, the pipeline writes the files to /dir1/ and names them sequentially in a format such as counts-00000-of-00001.
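Because the output may be split across several sequentially named files, a short script can help inspect the combined result. The sketch below merges every file that shares the output prefix; it assumes each line has the form `word: count`, and `read_counts` is a hypothetical helper, not part of the SDK.

```python
import glob
from collections import Counter


def read_counts(output_prefix):
    """Merge every shard whose name starts with output_prefix into one Counter."""
    totals = Counter()
    pattern = glob.escape(output_prefix) + "-*-of-*"
    for shard in sorted(glob.glob(pattern)):
        with open(shard, encoding="utf-8") as handle:
            for line in handle:
                # Assumed line format: "<word>: <count>"; other lines are skipped.
                word, sep, count = line.rstrip("\n").rpartition(": ")
                if sep and count.isdigit():
                    totals[word] += int(count)
    return totals
```

For example, `read_counts("/dir1/counts").most_common(5)` would list the five most frequent words from a local run that used `/dir1/counts` as its output path.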
Please don't hesitate to reach out if you encounter any issues!