The Beam SDK runtime environment is containerized with Docker, which isolates it from other runtime systems. As a result, any execution engine can run the Beam SDK.
This page describes how to customize, build, and push Beam SDK container images.
Before you begin, install Docker on your workstation.
You can add extra dependencies to container images so that you don't have to supply the dependencies to execution engines.
To customize a container image, either write a new Dockerfile on top of the original image, or modify the original Dockerfile.
It's often easier to write a new Dockerfile. However, by modifying the original Dockerfile, you can customize anything (including the base OS).
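As a rough illustration of writing a new Dockerfile, the sketch below renders the text of a child Dockerfile that uses a prebuilt SDK image as its parent and installs extra packages with pip. The helper name `render_child_dockerfile` and the example package are hypothetical, and the sketch assumes that pip is available in the parent image.

```python
def render_child_dockerfile(parent_image, pip_packages=()):
    """Return Dockerfile text for a child image built on parent_image.

    Hypothetical helper for illustration; not part of the Beam SDK.
    """
    lines = ["FROM {}".format(parent_image)]
    if pip_packages:
        # One RUN layer that installs every extra dependency.
        lines.append("RUN pip install --no-cache-dir " + " ".join(pip_packages))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "requests" is only an example dependency.
    print(render_child_dockerfile("apachebeam/python3.7_sdk", ["requests"]), end="")
```

Writing the returned text to a file named Dockerfile gives a starting point that can then be edited by hand.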
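To write a new Dockerfile, first pull a prebuilt SDK container image to use as the parent image. For example, the following command pulls the Python 3.7 SDK image:
docker pull apachebeam/python3.7_sdk
To modify the original Dockerfile, clone the beam repository:
git clone https://github.com/apache/beam.git
If you're adding dependencies from PyPI, use base_image_requirements.txt instead of editing the Dockerfile.
To test a customized image locally, run a pipeline with PortableRunner and set the --environment_config flag to the image path: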
{:.runner-direct}
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_config=path/to/container/image
{:.runner-flink-local}
# Start a Flink job server on localhost:8099
./gradlew :runners:flink:1.5:job-server:runShadow
# Run a pipeline on the Flink job server
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_config=path/to/container/image
{:.runner-spark-local}
# Start a Spark job server on localhost:8099
./gradlew :runners:spark:job-server:runShadow
# Run a pipeline on the Spark job server
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_config=path/to/container/image
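The local test commands above differ only in the --job_endpoint value (embed in the first command, localhost:8099 for a Flink or Spark job server). A minimal sketch of a helper that assembles the same argument list follows. The function name and its default paths are assumptions made for illustration, and the command is only built, not executed.

```python
import sys


def portable_runner_command(image, job_endpoint="embed",
                            input_path="/path/to/inputfile",
                            output_path="/path/to/write/counts"):
    """Build the wordcount command line for PortableRunner (hypothetical helper)."""
    return [
        sys.executable, "-m", "apache_beam.examples.wordcount",
        "--input={}".format(input_path),
        "--output={}".format(output_path),
        "--runner=PortableRunner",
        "--job_endpoint={}".format(job_endpoint),
        # The customized container image under test.
        "--environment_config={}".format(image),
    ]


if __name__ == "__main__":
    # Print the command for a job server listening on localhost:8099.
    print(" ".join(portable_runner_command("path/to/container/image",
                                           job_endpoint="localhost:8099")))
```

With Beam installed and a job server running, the returned list could be passed to `subprocess.run` to launch the pipeline.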
To test a customized image on the Google Cloud Dataflow runner, use the DataflowRunner option and the worker_harness_container_image flag:
# The {version} placeholder in sdk_location accepts four Python version variables: 2, 35, 36, and 37
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=DataflowRunner \
  --project={gcp_project_id} \
  --temp_location={gcs_location} \
  --experiment=beam_fn_api \
  --sdk_location=[…]/beam/sdks/python/container/py{version}/build/target/apache-beam.tar.gz \
  --worker_harness_container_image=path/to/container/image
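The sdk_location path in the Dataflow command depends on the Python version variable. The sketch below builds that path for a given version and rejects values outside the four listed ones. The function name and the `beam_parent_dir` parameter, which stands in for the elided leading part of the path, are assumptions for illustration.

```python
import posixpath

# The four version variables named in the Dataflow command above.
SUPPORTED_VERSIONS = ("2", "35", "36", "37")


def sdk_location(beam_parent_dir, version):
    """Return the SDK tarball path for a Python version variable.

    beam_parent_dir is the directory that contains the beam checkout;
    it is supplied by the caller and is not known to this sketch.
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported version variable: {!r}".format(version))
    return posixpath.join(beam_parent_dir, "beam", "sdks", "python", "container",
                          "py{}".format(version), "build", "target",
                          "apache-beam.tar.gz")
```

Validating the version up front is a small design choice: a typo such as 38 fails immediately instead of producing a path that does not exist.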
To build Beam SDK container images, run Gradle with the docker target from your local copy of the beam repository. If you're building a child image, set the optional --file flag to the new Dockerfile. If you're building an image from an original Dockerfile, ignore the --file flag and use a default repository:
# The default repository of each SDK
./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:docker
./gradlew [--file=path/to/new/Dockerfile] :sdks:go:container:docker
./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py2:docker
./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py35:docker
./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py36:docker
./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py37:docker
# Shortcut for building all four Python SDKs
./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:buildAll
To examine the containers that you built, run docker images from any directory on the command line. If you successfully built all of the container images, the command prints a table like the following:
REPOSITORY                 TAG      IMAGE ID       CREATED       SIZE
apachebeam/java_sdk        latest   16ca619d489e   2 weeks ago   550MB
apachebeam/python2.7_sdk   latest   b6fb40539c29   2 weeks ago   1.78GB
apachebeam/python3.5_sdk   latest   bae309000d09   2 weeks ago   1.85GB
apachebeam/python3.6_sdk   latest   42faad307d1a   2 weeks ago   1.86GB
apachebeam/python3.7_sdk   latest   18267df54139   2 weeks ago   1.86GB
apachebeam/go_sdk          latest   30cf602e9763   2 weeks ago   124MB
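To check the result without reading the table by eye, the output of docker images can be parsed. The sketch below is one way to do that, assuming the default column layout shown above (repository first, tag second). The helper name is hypothetical, and the expected image names are taken from the sample table rather than from any Beam API.

```python
# Image names copied from the sample table above.
EXPECTED_IMAGES = (
    "apachebeam/java_sdk",
    "apachebeam/python2.7_sdk",
    "apachebeam/python3.5_sdk",
    "apachebeam/python3.6_sdk",
    "apachebeam/python3.7_sdk",
    "apachebeam/go_sdk",
)


def missing_images(docker_images_output, expected=EXPECTED_IMAGES, tag="latest"):
    """Return the expected repositories that do not appear with the given tag."""
    found = set()
    for line in docker_images_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] == tag:
            found.add(fields[0])
    return [name for name in expected if name not in found]
```

The function takes the command output as a string, so it can be fed from `subprocess.check_output(["docker", "images"])` after decoding, or from a saved log.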
The default tag is latest and the default repositories are in the Docker Hub apachebeam namespace. The docker command-line tool implicitly pushes container images to this location.
To tag a local image, set the docker-tag option when building the container. The following command tags a Python SDK image with a date:
./gradlew :sdks:python:container:py2:docker -Pdocker-tag=2019-10-04
To change the repository, set the docker-repository-root option to a new location. The following command sets the docker-repository-root to a Bintray repository named apache:
./gradlew :sdks:python:container:py2:docker -Pdocker-repository-root=$USER-docker-apache.bintray.io/beam/python
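Both options are passed as Gradle project properties (the -P prefix), so they should be combinable in a single invocation. The sketch below assembles such a command line without running it. The helper name is hypothetical, and the property names are the two described above.

```python
def gradle_docker_command(task, docker_tag=None, repository_root=None):
    """Build a ./gradlew command line for a container docker task.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    cmd = ["./gradlew", task]
    if docker_tag is not None:
        cmd.append("-Pdocker-tag={}".format(docker_tag))
    if repository_root is not None:
        cmd.append("-Pdocker-repository-root={}".format(repository_root))
    return cmd


if __name__ == "__main__":
    # Reproduces the date-tagging example from this page.
    print(" ".join(gradle_docker_command(":sdks:python:container:py2:docker",
                                         docker_tag="2019-10-04")))
```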
After building a container image, you can store it in a remote Docker repository.
The following steps push a Python SDK image to the repository set by docker-repository-root.
Sign in to your Docker registry:
docker login
Upload the image to the remote repository:
docker push apachebeam/python2.7_sdk
To download the image again, run docker pull:
docker pull apachebeam/python2.7_sdk
Note: After pushing a container image, the remote image ID and digest match the local image ID and digest.