Beam allows configuration of the SDK harness to accommodate varying cluster setups. (The options below are for Python, but much of this information should apply to the Java and Go SDKs as well.)
environment_type determines where user code will be executed. environment_config configures the environment depending on the value of environment_type.DOCKER (default): User code is executed within a container started on each worker node. This requires docker to be installed on worker nodes.PROCESS: User code is executed by processes that are automatically started by the runner on each worker node.environment_config: JSON of the form {"os": "<OS>", "arch": "<ARCHITECTURE>", "command": "<process to execute>", "env":{"<Environment variables 1>": "<ENV_VAL>"} }. All fields in the JSON are optional except command.command, it is recommended to use the bootloader executable, which can be built from source with ./gradlew :sdks:python:container:build and copied from sdks/python/container/build/target/launcher/linux_amd64/boot to worker machines. Note that the Python bootloader assumes Python and the apache_beam module are installed on each worker machine.EXTERNAL: User code will be dispatched to an external service. For example, one can start an external service for Python workers by running docker run -p=50000:50000 apache/beam_python3.6_sdk --worker_pool.environment_config: Address for the external service, e.g. localhost:50000.BEAM_WORKER_POOL_IN_DOCKER_VM environment variable on the client: export BEAM_WORKER_POOL_IN_DOCKER_VM=1.LOOPBACK: User code is executed within the same process that submitted the pipeline. This option is useful for local testing. However, it is not suitable for a production environment, as it performs work on the machine the job originated from.environment_config is not used for the LOOPBACK environment.sdk_worker_parallelism sets the number of SDK workers that run on each worker node. The default is 1. If 0, the value is automatically set by the runner by looking at different parameters, such as the number of CPU cores on the worker machine. Only used for Python pipelines on Flink and Spark runners.