Apache Beam

CI Environment

Continuous Integration is an important component of making Apache Beam robust and stable.

Our main execution environment for CI is Jenkins, available at https://ci-beam.apache.org/. See .test-infra/jenkins/README for the trigger phrase, status, and link of every Jenkins job. See the Apache Beam Developer Guide for Jenkins tips.

An additional execution environment for CI is GitHub Actions. GitHub Actions (GA) is well integrated with GitHub code and workflows, and it evolved quickly in 2019/2020 into a fully fledged CI environment that is easy to use and develop for, so we decided to use it for building the Python source distribution and wheels.

GitHub Actions

GitHub Actions run types

The following GA CI job runs are currently executed for Apache Beam. Each run type has a different purpose and context.

Pull request run

These runs result from PRs opened from contributors' forks. Most builds for Apache Beam fall into this category. They are executed in the context of the fork, not the main Beam code repository, which means they have only read permission to GitHub resources (container registry, code repository). This is necessary because the code in those PRs (including the CI job definitions) might be modified by people who are not committers for the Apache Beam code repository.

The main purpose of these jobs is to check whether the PR builds cleanly, whether the tests run properly, and whether the PR is ready for review and merge.

Direct Push/Merge Run

These runs result from direct pushes by committers or from committers merging a pull request. They execute in the context of the Apache Beam code repository and also have write permission to GitHub resources (container registry, code repository). The main purpose of these runs is to check that the code still holds all its assertions after the merge, for example that it still builds and that all tests are green.

This is needed because conflicting changes from multiple PRs might cause build and test failures after merge even if they do not fail in isolation.

Scheduled runs

These runs result from a scheduled (nightly) job and run only for the master branch. The main purpose is to check whether changes in external dependencies have affected the Apache Beam code (for example, newly released transitive dependencies that break the build). Another reason for the nightly build is that it tags the most recent master with nightly-master.

All runs consist of the same jobs, but the jobs behave slightly differently or are skipped depending on the run category. The tables under Workflows summarize which jobs run in each category. Many jobs use a matrix strategy that runs several variations of the job (for example, with different platform types or Python versions).
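As a rough illustration of how the three run types described above relate to the event that triggers a workflow, the sketch below maps the standard GITHUB_EVENT_NAME environment variable to a run category. The function name classify_run and the category labels are hypothetical and are not part of the Beam workflows. The mapping assumes that pull request runs come from pull_request events, direct push/merge runs from push events, and scheduled runs from schedule events.

```python
import os

# Assumed mapping from GitHub Actions event names to the run categories
# used in this document. The labels are illustrative only.
RUN_CATEGORIES = {
    "pull_request": "Pull request run",
    "push": "Direct push/merge run",
    "schedule": "Scheduled run",
}


def classify_run(event_name: str) -> str:
    """Return the run category for an event name, or "Other" if unknown."""
    return RUN_CATEGORIES.get(event_name, "Other")


if __name__ == "__main__":
    # GITHUB_EVENT_NAME is set by GitHub Actions inside a workflow run.
    # Outside of a run it is usually unset, so fall back to an empty string.
    print(classify_run(os.environ.get("GITHUB_EVENT_NAME", "")))
```

A job could use a check like this to decide whether to skip steps that need write access, which pull request runs from forks do not have.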

Google Cloud Platform Credentials

Some of the jobs require variables stored as GitHub Secrets to perform operations on Google Cloud Platform. These variables are:

  • GCP_PROJECT_ID - ID of the Google Cloud project. For example: apache-beam-testing.
  • GCP_REGION - Region of the bucket and Dataflow jobs. For example: us-central1.
  • GCP_TESTING_BUCKET - Name of the bucket where temporary files for Dataflow tests will be stored. For example: beam-github-actions-tests.
  • GCP_PYTHON_WHEELS_BUCKET - Name of the bucket where the Python source distribution and wheels will be stored. For example: beam-wheels-staging.
  • GCP_SA_EMAIL - Service account email address. This is usually of the format <name>@<project-id>.iam.gserviceaccount.com.
  • GCP_SA_KEY - Service account key. This key should be created and encoded as a Base64 string (e.g. cat my-key.json | base64 on macOS).
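For contributors who prefer not to depend on a platform-specific base64 command, the following minimal sketch shows one way to produce and verify the Base64 string expected by GCP_SA_KEY. The helper names encode_key and decode_key, and the sample key contents, are hypothetical; a real key file downloaded from Google Cloud contains more fields.

```python
import base64
import json


def encode_key(raw_key: bytes) -> str:
    """Encode the raw bytes of a service account key file as Base64 text."""
    return base64.b64encode(raw_key).decode("ascii")


def decode_key(encoded_key: str) -> dict:
    """Decode a Base64 string back into the key's JSON content."""
    return json.loads(base64.b64decode(encoded_key))


if __name__ == "__main__":
    # Illustrative stand-in for the contents of a key file such as my-key.json.
    sample_key = json.dumps({"type": "service_account"}).encode("utf-8")
    encoded = encode_key(sample_key)
    # Round-trip check: decoding must give back the original JSON document.
    print(decode_key(encoded) == json.loads(sample_key))
```

base64.b64encode returns a single line without line breaks, which is convenient when pasting the value into a GitHub Secret.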

The service account must have the following permissions (IAM roles):

  • Storage Admin (roles/storage.admin)
  • Dataflow Admin (roles/dataflow.admin)

Workflows

Build python source distribution and wheels - build_wheels.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Check GCP variables | Checks that GCP variables are set. Jobs that require them depend on the output of this job. | Yes | Yes | Yes | Yes/No |
| Build python source distribution | Builds the Python source distribution and uploads it to artifacts. Artifacts from the release branch are used in the release process (build_release_candidate.sh). | Yes | Yes | Yes | - |
| Prepare GCS | Clears the target path on GCS if it already exists. | - | Yes | Yes | Yes |
| Upload python source distribution to GCS bucket | Uploads the Python source distribution to the GCS bucket, to a path unique to the specific workflow run. | - | Yes | Yes | Yes |
| Build python wheels on linux/macos/windows | Builds Python wheels on the linux/macos/windows platforms using cibuildwheel and uploads them to artifacts. Artifacts from the release branch are used in the release process (build_release_candidate.sh). | Yes | Yes | Yes | - |
| Upload python wheels to GCS bucket | Uploads Python wheels to the GCS bucket, to a path unique to the specific workflow run. Additionally uploads workflow run data. | - | Yes | Yes | Yes |
| List files on Google Cloud Storage Bucket | Lists files on GCS for verification purposes. | - | Yes | Yes | Yes |
| Branch repo nightly | Branches the repo with nightly-master if the Python source distribution and Python wheels builds finished successfully. | - | - | Yes | - |
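The "Check GCP variables" job in the table above gates the jobs that need credentials. The sketch below shows one way such a check could be expressed, assuming the secrets are exposed to the process as environment variables with the names listed under Google Cloud Platform Credentials. The function names are hypothetical, and the actual job in build_wheels.yml may be implemented differently.

```python
import os

# Variable names taken from the Google Cloud Platform Credentials section.
REQUIRED_GCP_VARIABLES = (
    "GCP_PROJECT_ID",
    "GCP_REGION",
    "GCP_TESTING_BUCKET",
    "GCP_PYTHON_WHEELS_BUCKET",
    "GCP_SA_EMAIL",
    "GCP_SA_KEY",
)


def missing_gcp_variables(env) -> list:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_GCP_VARIABLES if not env.get(name)]


def gcp_variables_set(env) -> bool:
    """Return True only when every required variable has a non-empty value."""
    return not missing_gcp_variables(env)


if __name__ == "__main__":
    missing = missing_gcp_variables(os.environ)
    if missing:
        print("Missing GCP variables: " + ", ".join(missing))
    else:
        print("All GCP variables are set.")
```

Downstream jobs can then be skipped when the check reports missing variables, which is the expected situation for pull request runs from forks that have no access to the secrets.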

Python tests - python_tests.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Check GCP variables | Checks that GCP variables are set. Jobs that require them depend on the output of this job. | Yes | Yes | Yes | Yes/No |
| Build python source distribution | Builds the Python source distribution and uploads it to artifacts. Artifacts are used in the Python Wordcount Dataflow job. | - | Yes | Yes | Yes |
| Python Unit Tests | Runs Python unit tests. | Yes | Yes | Yes | - |
| Python Wordcount Direct Runner | Runs the Python WordCount example with the Direct Runner. | Yes | Yes | Yes | - |
| Python Wordcount Dataflow | Runs the Python WordCount example with the Dataflow Runner. | - | Yes | Yes | Yes |

Java tests - java_tests.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Check GCP variables | Checks that GCP variables are set. Jobs that require them depend on the output of this job. | Yes | Yes | Yes | Yes/No |
| Java Unit Tests | Runs Java unit tests. | Yes | Yes | Yes | - |
| Java Wordcount Direct Runner | Runs the Java WordCount example with the Direct Runner. | Yes | Yes | Yes | - |
| Java Wordcount Dataflow | Runs the Java WordCount example with the Dataflow Runner. | - | Yes | Yes | Yes |

Release Preparation and Validation Workflows

Start Snapshot Build - start_snapshot_build.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Start Snapshot Build | Creates PR against apache:master and triggers a job to build a snapshot | No | No | No | No |

Choose RC Commit - choose_rc_commit.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Choose RC Commit | Chooses a commit to be the basis of a release candidate and pushes a new tagged commit for that RC. | No | No | No | No |

Cut Release Branch - cut_release_branch.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Update Master | Update Apache Beam master branch with next release version | No | No | No | No |
| Update Release Branch | Cut release branch for current development version | No | No | No | No |

Verify Release Build - verify_release_build.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Verify Release Build | Verifies full life cycle of Gradle Build and all PostCommit/PreCommit tests against Release Branch on CI. | No | No | No | No |

Git tag Release Version - git_tag_released_version.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Git Tag Release Version | Create and push a new tag for the released version by copying the tag for the final release candidate. | No | No | No | No |

Run RC Validation - run_rc_validation.yml

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Requires GCP Credentials |
| --- | --- | --- | --- | --- | --- |
| Python Release Candidate | Comment on PR to trigger Python ReleaseCandidate Jenkins job. | No | No | No | No |
| Python XLang SQL Taxi | Runs Python XLang SQL Taxi with DataflowRunner | No | No | No | Yes |
| Python XLang Kafka | Runs Python XLang Kafka Taxi with DataflowRunner | No | No | No | Yes |
| Direct Runner Leaderboard | Runs Python Leaderboard with DirectRunner | No | No | No | Yes |
| Direct Runner GameStats | Runs Python GameStats with DirectRunner. | No | No | No | Yes |
| Dataflow Runner Leaderboard | Runs Python Leaderboard with DataflowRunner | No | No | No | Yes |
| Dataflow Runner GameStats | Runs Python GameStats with DataflowRunner | No | No | No | Yes |

All migrated workflows run based on the following triggers

| Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run | Workflow Dispatch |
| --- | --- | --- | --- | --- |
| PostCommit | No | Yes | Yes | Yes |
| PreCommit | Yes | Yes | Yes | Yes |

PreCommit Workflows

| Workflow | Description | Requires GCP Credentials |
| --- | --- | --- |
| job-precommit-placeholder.yml | Description placeholder | Yes/No |

PostCommit Workflows

| Workflow | Description | Requires GCP Credentials |
| --- | --- | --- |
| job-postcommit-placeholder.yml | Description placeholder | Yes/No |

GitHub Action Tips

  • All migrated workflows are executed on pre-configured self-hosted runners. For this reason, GCP credentials are only needed when running the workflows on a different runner.
  • If you change a workflow, your changes may not be present in the check run triggered by the pull request. In this case, please attach a link to the modified workflow run executed on your fork.
  • Possible timeouts with the macOS runner. Existing issue: "This check failed - sometimes happens on macOS runner" (#841).
  • GitHub Actions Documentation