This guide shows you how to set up your Python development environment, get the Apache Beam SDK for Python, and run an example pipeline.
If you're interested in contributing to the Apache Beam Python codebase, see the [Contribution Guide]({{ site.baseurl }}/contribute).
The Python SDK supports Python 2.7, 3.5, 3.6, and 3.7. New Python SDK releases will stop supporting Python 2.7 in 2020 (BEAM-8371). For best results, use Beam with Python 3.
The Beam SDK requires Python 2 users to use Python 2.7 and Python 3 users to use Python 3.5 or higher. Check your version by running:
python --version
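
The same check can be made from inside the interpreter, which helps when several Python installations share one machine. The following is a minimal sketch: `is_supported_python` is a hypothetical helper name, and its bounds simply restate the requirement above (Python 2.7, or Python 3.5 and higher).

```python
import sys


def is_supported_python(version_info):
    """Return True for Python 2.7, or for Python 3.5 and newer 3.x releases."""
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor == 7
    return major == 3 and minor >= 5


if __name__ == "__main__":
    # sys.version_info describes the interpreter running this script.
    print("Supported interpreter: %s" % is_supported_python(sys.version_info))
```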
Install pip, Python's package manager. Check that you have version 7.0.0 or newer by running:
pip --version
If you do not have `pip` version 7.0.0 or newer, run the following command to install it. This command might require administrative privileges.
{:.shell-unix}
pip install --upgrade pip
{:.shell-PowerShell}
PS> python -m pip install --upgrade pip
It is recommended that you install a Python virtual environment for initial experiments. If you do not have `virtualenv` version 13.1.0 or newer, run the following command to install it. This command might require administrative privileges.
{:.shell-unix}
pip install --upgrade virtualenv
{:.shell-PowerShell}
PS> python -m pip install --upgrade virtualenv
If you do not want to use a Python virtual environment (not recommended), ensure `setuptools` is installed on your machine. If you do not have `setuptools` version 17.1 or newer, run the following command to install it.
{:.shell-unix}
pip install --upgrade setuptools
{:.shell-PowerShell}
PS> python -m pip install --upgrade setuptools
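
The minimum versions quoted above can be compared programmatically. This is a rough sketch, not part of the Beam SDK: `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical names, and the parser assumes purely numeric dotted versions (it does not handle suffixes such as `rc1`).

```python
# Minimum versions quoted in this guide.
MINIMUM_VERSIONS = {
    "pip": "7.0.0",
    "virtualenv": "13.1.0",
    "setuptools": "17.1",
}


def parse_version(text):
    """Turn a dotted numeric version such as '17.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(package, installed_version):
    """Compare an installed version string against the guide's minimum."""
    installed = parse_version(installed_version)
    required = parse_version(MINIMUM_VERSIONS[package])
    # Pad the shorter tuple with zeros so '17.1' and '17.1.0' compare equal.
    width = max(len(installed), len(required))
    installed += (0,) * (width - len(installed))
    required += (0,) * (width - len(required))
    return installed >= required
```

For example, `meets_minimum("pip", "6.1.1")` is false, so that installation would need the upgrade command shown earlier.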
A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:
{:.shell-unix}
virtualenv /path/to/directory
{:.shell-PowerShell}
PS> virtualenv C:\path\to\directory
A virtual environment needs to be activated for each shell that is to use it. Activating it sets some environment variables that point to the virtual environment's directories.
To activate a virtual environment in Bash, run:
{:.shell-unix}
. /path/to/directory/bin/activate
{:.shell-PowerShell}
PS> C:\path\to\directory\Scripts\activate.ps1
That is, execute the `activate` script under the virtual environment directory you created.
For instructions using other shells, see the virtualenv documentation.
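
To confirm that activation took effect, you can ask the interpreter itself. The sketch below is a best-effort check under stated assumptions: `in_virtual_environment` is a hypothetical helper, and it relies on the attributes that `virtualenv` and the standard `venv` module are known to set on the `sys` module.

```python
import os
import sys


def in_virtual_environment():
    """Best-effort check for a virtual environment interpreter."""
    # Older virtualenv releases set sys.real_prefix; the venv module and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("Inside a virtual environment: %s" % in_virtual_environment())
    # The activate script exports VIRTUAL_ENV; it stays unset when the
    # environment's interpreter is invoked directly without activation.
    print("VIRTUAL_ENV = %s" % os.environ.get("VIRTUAL_ENV", "(not set)"))
```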
Install the latest Python SDK from PyPI:
{:.shell-unix}
pip install apache-beam
{:.shell-PowerShell}
PS> python -m pip install apache-beam
The above installation will not install all the extra dependencies for using features like the Google Cloud Dataflow runner. Information on what extra packages are required for different features is highlighted below. It is possible to install multiple extra requirements using something like `pip install apache-beam[feature1,feature2]`.
pip install apache-beam[gcp]
pip install apache-beam[test]
pip install apache-beam[docs]
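
If you script the installation, the extras can be combined into a single requirement string. The following sketch only builds the command and does not run it; `pip_install_command` is a hypothetical helper, and the extras names are the ones listed above.

```python
import sys


def pip_install_command(extras=()):
    """Build a 'python -m pip install' command for apache-beam with extras."""
    requirement = "apache-beam"
    if extras:
        requirement += "[%s]" % ",".join(extras)
    # A list of arguments bypasses the shell, so the square brackets need no quoting.
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(pip_install_command(["gcp", "test"])))
```

The returned list can be passed to `subprocess.check_call` to perform the installation. When typing the command by hand, note that some shells, such as zsh, treat square brackets as glob patterns, so the requirement may need quotes.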
The Apache Beam examples directory has many examples. All examples can be run locally by passing the required arguments described in the example script.
For example, run `wordcount.py` with the following command:
{:.runner-direct}
python -m apache_beam.examples.wordcount --input /path/to/inputfile --output /path/to/write/counts
{:.runner-apex}
This runner is not yet available for the Python SDK.
{:.runner-flink-local}
Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-flink for more information.
{:.runner-flink-cluster}
Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/documentation/runners/flink/ for more information.
{:.runner-spark}
Currently, running wordcount.py on Spark requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-spark for more information.
{:.runner-dataflow}
# As part of the initial setup, install Google Cloud Platform specific extra components. Make sure you
# complete the setup steps at {{ site.baseurl }}/documentation/runners/dataflow/#setup
pip install apache-beam[gcp]
python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://<your-gcs-bucket>/counts \
    --runner DataflowRunner \
    --project your-gcp-project \
    --temp_location gs://<your-gcs-bucket>/tmp/
{:.runner-nemo}
This runner is not yet available for the Python SDK.
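
To give a sense of what a word-count pipeline looks like, here is a minimal sketch written against the Beam Python API. It is loosely modeled on the idea of the example and is not the source of `wordcount.py`: the helper names, the word regular expression, and the `word: count` output format are assumptions made for illustration.

```python
import re


def extract_words(line):
    """Split one line of text into words (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", line)


def format_result(word, count):
    """Render one output line, for example 'king: 3'."""
    return "%s: %d" % (word, count)


def run(input_path, output_prefix):
    # Imported lazily so the two helpers above work without Beam installed.
    import apache_beam as beam

    # With no options given, the pipeline runs on the local direct runner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Split" >> beam.FlatMap(extract_words)
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.Map(lambda kv: format_result(kv[0], kv[1]))
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

Calling `run("/path/to/inputfile", "/path/to/write/counts")` would mirror the direct-runner command shown earlier, using the same placeholder paths.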
After the pipeline completes, you can view the output files at your specified output path. For example, if you specify `/dir1/counts` for the `--output` parameter, the pipeline writes the files to `/dir1/` and names the files sequentially in the format `counts-0000-of-0001`.
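
Because the output may be split across several sequentially named files, a small script can gather them back together. The sketch below is an illustration under stated assumptions: `read_counts` is a hypothetical helper, each output line is assumed to look like `word: count`, and the glob pattern deliberately does not depend on how many digits the shard numbers use.

```python
import glob


def read_counts(output_prefix):
    """Collect 'word: count' lines from every shard written under output_prefix."""
    counts = {}
    # Match files such as <output_prefix>-0000-of-0001, whatever the padding width.
    for shard in sorted(glob.glob(output_prefix + "-*-of-*")):
        with open(shard) as shard_file:
            for line in shard_file:
                word, _, count = line.rstrip("\n").rpartition(": ")
                if word:  # Skip blank or malformed lines.
                    counts[word] = int(count)
    return counts
```

For the example above, `read_counts("/dir1/counts")` would return a dictionary that maps each word to its count.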
Please don't hesitate to [reach out]({{ site.baseurl }}/community/contact-us) if you encounter any issues!