Apache Beam Playground

About

Beam Playground makes it easier to try out and adopt Apache Beam: prospective Beam users can view and run examples of Apache Beam pipelines in a web interface that requires no setup.

Getting Started

See playground/README.md for details on installing development dependencies.

This section describes what is needed to run the backend application.

  • Go commands to run/test the backend locally
  • Set up environment variables to run the backend locally
  • Running the backend via Docker

Go commands to run/test application locally

Run/build

Go to the backend directory:

cd backend

To run the backend server on a development machine without Docker, first prepare a working directory anywhere outside the Beam source tree:

mkdir -p ~/path/to/workdir

Then copy datasets/, configs/, and logging.properties from the playground/backend/ directory:

cp -r {logging.properties,datasets/,configs/} ~/path/to/workdir

To start the backend for the Go SDK, you also need to create a prepared mod directory and export an additional environment variable:

export PREPARED_MOD_DIR=~/path/to/workdir/prepared_folder
SDK_TAG=2.44.0 bash ./containers/go/setup_sdk.sh $PREPARED_MOD_DIR

The following command will build and serve the backend locally:

SERVER_PORT=<port> \
BEAM_SDK=<beam_sdk_type> \
APP_WORK_DIR=<path_to_workdir> \
DATASTORE_EMULATOR_HOST=127.0.0.1:8888 \
DATASTORE_PROJECT_ID=test \
SDK_CONFIG=../sdks-emulator.yaml \
go run ./cmd/server

Here <port> is the port on which the backend server should be available; <beam_sdk_type> is the desired Beam SDK, one of SDK_UNSPECIFIED, SDK_JAVA, SDK_PYTHON, SDK_GO, or SDK_SCIO; and <path_to_workdir> is the path to your working directory, e.g. ~/path/to/workdir.

Run the following command to generate a release build file:

go build ./cmd/server/server.go

Test

Playground tests may be run using this command:

go test ./... -v

The full list of commands can be found here.

Set up environment variables to run the backend locally

These environment variables should be set to run the backend locally:

  • BEAM_SDK - the SDK that the backend can process (SDK_GO / SDK_JAVA / SDK_PYTHON / SDK_SCIO / SDK_UNSPECIFIED)
  • APP_WORK_DIR - the directory where the folders created for each code processing request are placed
  • PREPARED_MOD_DIR - the directory where the prepared go.mod and go.sum files are placed. It is used only for the Go SDK

There are also environment variables that are needed to deploy Apache Beam Playground. These variables have default values, so there is no need to set them to launch locally:

  • SERVER_IP - the IP address of the backend server (default value = localhost)
  • SERVER_PORT - the port of the backend server (default value = 8080)
  • CACHE_TYPE - the type of cache service used by the backend server. If it is set to remote, the backend server uses Redis to keep all cache values (default value = local)
  • CACHE_ADDRESS - the address of the Redis server. It is used only when CACHE_TYPE=remote (default value = localhost:6379)
  • BEAM_PATH - the location of all libraries required for the Java SDK (default value = /opt/apache/beam/jars/*)
  • KEY_EXPIRATION_TIME - the expiration time of the keys in the cache (default value = 15 min)
  • PIPELINE_EXPIRATION_TIMEOUT - the expiration time of the code processing (default value = 15 min)
  • PROTOCOL_TYPE - the type of the backend server protocol. It can be TCP or HTTP (default value = HTTP)
  • NUM_PARALLEL_JOBS - the maximum number of code processing requests that the backend server can process at the same time (default value = 20). This value is used to check the readiness of the backend server. If the server reaches the maximum number of concurrent code processing requests, the load balancer routes all other incoming requests to other instances for as long as this instance is not ready.
  • LAUNCH_SITE - the value that configures logging (default value = local). To use the logging service on App Engine, change this value to app_engine.
  • SDK_CONFIG - the path to the SDK configuration file, which holds, for example, the default example for the corresponding SDK. The configuration is saved to Cloud Datastore during application startup (default value = ../sdks.yaml)
  • DATASTORE_EMULATOR_HOST - the address of the Datastore emulator. If it is set in the environment, the application connects to the Datastore emulator.
  • PROPERTY_PATH - the path to the application properties (default value = .)
  • CACHE_REQUEST_TIMEOUT - the timeout for requesting data from the cache (default value = 5 sec)

Application properties

These properties are stored in the backend/properties.yaml file:

  • playground_salt - the salt used to generate the hash, to avoid the problems a collision may cause.
  • max_snippet_size - the limit on file content size. Since 1 character occupies 1 byte of memory (for ASCII text) and 1 MB is approximately equal to 1000000 bytes, the maximum size of a snippet is 1000000.
  • id_length - the length of the identifier used to store data in Cloud Datastore. The length is chosen to save storage space in Cloud Datastore while providing good randomness.
  • removing_unused_snippets_cron - the cron expression for the scheduled task that removes unused snippets.
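
To make the roles of playground_salt, max_snippet_size, and id_length concrete, here is a hedged Go sketch of how a salted hash might be truncated into a snippet identifier and how a size limit might be enforced. The function names checkSnippetSize and snippetID, the choice of SHA-256 with URL-safe base64, and the example values for the salt and identifier length are all assumptions made for illustration; the backend's actual algorithm may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// errSnippetTooLarge is returned when content exceeds the configured limit.
var errSnippetTooLarge = errors.New("snippet exceeds max_snippet_size")

// checkSnippetSize rejects content longer than maxSnippetSize bytes.
// len on a Go string counts bytes, not characters.
func checkSnippetSize(content string, maxSnippetSize int) error {
	if len(content) > maxSnippetSize {
		return errSnippetTooLarge
	}
	return nil
}

// snippetID hashes salt+content with SHA-256, encodes the digest as
// URL-safe base64, and keeps the first idLength characters.
// (Hypothetical scheme, chosen only to illustrate the three properties.)
func snippetID(salt, content string, idLength int) (string, error) {
	sum := sha256.Sum256([]byte(salt + content))
	encoded := base64.RawURLEncoding.EncodeToString(sum[:])
	if idLength <= 0 || idLength > len(encoded) {
		return "", fmt.Errorf("id length %d out of range 1..%d", idLength, len(encoded))
	}
	return encoded[:idLength], nil
}

func main() {
	const (
		salt           = "example-salt" // placeholder, not the real playground_salt
		maxSnippetSize = 1000000        // limit stated for max_snippet_size
		idLength       = 11             // placeholder value for id_length
	)

	content := "package main\n\nfunc main() {}\n"
	if err := checkSnippetSize(content, maxSnippetSize); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	id, err := snippetID(salt, content, idLength)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("snippet id:", id)
}
```

Because the size check counts bytes, text containing multi-byte UTF-8 characters reaches the limit with fewer than 1000000 characters. A shorter identifier saves storage but raises the chance that two different snippets truncate to the same value, which is the trade-off id_length controls.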

Running the server app via Docker

To run the server with Docker, use the Dockerfiles in the containers folder for the Java, Python, and Go languages. Each of them processes the corresponding SDK, so the backend with the Go SDK works with Go examples/katas/tests only.

Alternatively, run the server locally as described above.

Calling the server from another client

To call the server from another client, generate the models and client code from the playground/api/v1/api.proto file. More information about generating models and client code from .proto files for each language can be found here.

Running the Beam Code

RunCode representation

The following diagram represents the execution of Beam code on the server: RunCode

Validators/preparators representation

To clarify which validators and preparators are used with the code: