title: “IO Standards”

I/O Standards

Overview

This Apache Beam I/O Standards document lays out prescriptive guidance for first-party (1P) and third-party (3P) developers building an Apache Beam I/O connector. These guidelines aim to establish best practices for documentation, development, and testing in a simple and concise manner.

What are built-in I/O Connectors?

An I/O connector (I/O) that lives in the Apache Beam GitHub repository is known as a built-in I/O connector. Built-in I/Os have their integration tests and performance tests routinely run by the Google Cloud Dataflow team using the Dataflow Runner, with metrics published publicly for reference. Otherwise, the following guidelines apply to both built-in and non-built-in I/Os unless explicitly stated.

Guidance

Documentation

This section lays out the superset of all documentation that is expected to be made available with an I/O. The Apache Beam documentation referenced throughout this section can be found here. A good example to follow is the built-in Snowflake I/O.

Built-in I/O

I/O (not built-in)

Custom I/Os are not included in the Apache Beam GitHub repository. One example is SolaceIO.

This section outlines API syntax, semantics and recommendations for features that should be adopted for new as well as existing Apache Beam I/O Connectors.

The I/O Connector development guidelines are written with the following principles in mind:

  • Consistency makes an API easier to learn.
    • If there are multiple ways of doing something, we should strive to be consistent first.
  • After a couple of minutes studying the documentation, users should be able to pick up most I/O connectors.
  • The design of a new I/O should consider the possibility of evolution.
  • Transforms should integrate well with other Beam utilities.

All SDKs

Pipeline Configuration / Execution / Streaming / Windowing semantics guidelines

Java

General

Classes / Methods / Properties

{{< highlight java >}}
BigQueryIO.write()
  .withWriteConfig(StreamingInsertsConfig.withDetailedError()
    .withExactlyOnce().etc..)
{{< /highlight >}}
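The `withWriteConfig` snippet above groups the options for one write mode into a single config object. As a rough illustration of that shape, here is a minimal sketch in plain Java. It assumes nothing about the real `BigQueryIO` API: `SketchWrite`, `WriteConfigDemo`, the `create()` factory, and this version of `StreamingInsertsConfig` are hypothetical stand-ins written for this example only.

```java
import java.util.Objects;

// Hypothetical stand-in for a mode-specific config object; NOT the real BigQueryIO API.
final class StreamingInsertsConfig {
  private final boolean detailedError;
  private final boolean exactlyOnce;

  private StreamingInsertsConfig(boolean detailedError, boolean exactlyOnce) {
    this.detailedError = detailedError;
    this.exactlyOnce = exactlyOnce;
  }

  // Assumed factory: starts from defaults so callers only state what they change.
  static StreamingInsertsConfig create() {
    return new StreamingInsertsConfig(false, false);
  }

  // Each with* method returns a new immutable copy instead of mutating this one.
  StreamingInsertsConfig withDetailedError() {
    return new StreamingInsertsConfig(true, exactlyOnce);
  }

  StreamingInsertsConfig withExactlyOnce() {
    return new StreamingInsertsConfig(detailedError, true);
  }

  boolean isDetailedError() {
    return detailedError;
  }

  boolean isExactlyOnce() {
    return exactlyOnce;
  }
}

// Hypothetical stand-in for the transform an I/O's write() method would return.
final class SketchWrite {
  private final StreamingInsertsConfig writeConfig; // null until configured

  private SketchWrite(StreamingInsertsConfig writeConfig) {
    this.writeConfig = writeConfig;
  }

  static SketchWrite write() {
    return new SketchWrite(null);
  }

  // All options for the streaming-inserts mode travel together in one argument.
  SketchWrite withWriteConfig(StreamingInsertsConfig config) {
    return new SketchWrite(Objects.requireNonNull(config, "config"));
  }

  StreamingInsertsConfig getWriteConfig() {
    return writeConfig;
  }
}

final class WriteConfigDemo {
  public static void main(String[] args) {
    SketchWrite write =
        SketchWrite.write()
            .withWriteConfig(
                StreamingInsertsConfig.create().withDetailedError().withExactlyOnce());
    System.out.println("detailedError=" + write.getWriteConfig().isDetailedError());
    System.out.println("exactlyOnce=" + write.getWriteConfig().isExactlyOnce());
  }
}
```

One possible benefit of this shape is that options belonging to a single mode cannot be combined with options from another mode by accident, and new options can be added to the config class later without widening the transform's own method list.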

Types

Evolution

Over time, I/Os need to evolve to address new use cases or to use new APIs under the covers. Some examples of necessary evolution of an I/O:

Python

General

Classes / Methods / Properties

Types

GoLang

General

TypeScript

Classes / Methods / Properties

Testing

An I/O should have unit tests, integration tests, and performance tests. The following guidance explains what each type of test aims to achieve and provides a baseline standard of test coverage. The actual test cases and test logic will vary with the specifics of each source or sink, but we include some suggested test cases as a baseline.

This guide complements the Apache Beam I/O transform testing guide by adding specific test cases and scenarios. For general information regarding testing Beam I/O connectors, please refer to that guide.

Unit Tests

I/O unit tests need to efficiently test the functionality of the code. Given that unit tests are expected to be executed many times over multiple test suites (for example, for each Python version), these tests should execute quickly and should not have side effects. We recommend trying to achieve 100% code coverage through unit tests.

When possible, unit tests are favored over integration tests because of their faster execution time and lower resource usage. Additionally, unit tests can be easily included in pre-commit test suites (for example, the Jenkins beam_PreCommit_* test suites) and therefore have a better chance of discovering regressions early. Unit tests are also preferred for error conditions.
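To make the "fast, no side effects" advice concrete, here is a minimal sketch of a unit-style check that swaps the remote data store for an in-memory fake and covers both a happy path and an error condition. Everything in it is hypothetical and written for this example: `RecordStore`, `InMemoryRecordStore`, `BatchingWriter`, and `BatchingWriterCheck` are not Beam APIs. A real Beam test would more likely use JUnit together with Beam's own testing utilities; plain Java is used here only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client interface that an I/O might wrap; not a Beam API.
interface RecordStore {
  void writeBatch(List<String> records);
}

// In-memory fake: no network calls and no state that outlives the test.
final class InMemoryRecordStore implements RecordStore {
  final List<List<String>> batches = new ArrayList<>();

  @Override
  public void writeBatch(List<String> records) {
    batches.add(new ArrayList<>(records)); // copy, because the caller reuses its buffer
  }
}

// Hypothetical unit under test: buffers records and flushes them in fixed-size batches.
final class BatchingWriter {
  private final RecordStore store;
  private final int batchSize;
  private final List<String> buffer = new ArrayList<>();

  BatchingWriter(RecordStore store, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive, got " + batchSize);
    }
    this.store = store;
    this.batchSize = batchSize;
  }

  void write(String record) {
    buffer.add(record);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  void flush() {
    if (!buffer.isEmpty()) {
      store.writeBatch(buffer);
      buffer.clear();
    }
  }
}

final class BatchingWriterCheck {
  public static void main(String[] args) {
    // Happy path: three records with a batch size of two should yield two batches.
    InMemoryRecordStore store = new InMemoryRecordStore();
    BatchingWriter writer = new BatchingWriter(store, 2);
    writer.write("a");
    writer.write("b");
    writer.write("c");
    writer.flush();
    if (store.batches.size() != 2) {
      throw new AssertionError("expected 2 batches, got " + store.batches.size());
    }

    // Error condition: an invalid batch size should be rejected up front.
    boolean rejected = false;
    try {
      new BatchingWriter(store, 0);
    } catch (IllegalArgumentException expected) {
      rejected = true;
    }
    if (!rejected) {
      throw new AssertionError("expected IllegalArgumentException for batchSize=0");
    }
    System.out.println("all checks passed");
  }
}
```

Because the fake keeps everything in memory, a check like this can run on every pre-commit without credentials or a live service, which is the property the paragraphs above ask for.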

Suggested Test Cases

Integration Tests

Integration tests test end-to-end interactions between the Beam runner and the data store a given I/O connects to. Since these usually involve RPC calls to remote services, integration tests take longer to execute. Additionally, Beam runners may use more than one worker when executing integration tests. Due to these costs, an integration test should only be implemented when a given scenario cannot be covered by a unit test.

Suggested Test Cases

Performance Tests

Because the Performance testing framework is still in flux, performance tests can be a follow-up submission after the actual I/O code.

The Performance testing framework does not yet support GoLang or TypeScript.

Performance benchmarks are a critical part of best practices for I/Os as they effectively address several areas:

  • To evaluate whether the cost and performance of a specific I/O or Dataflow template meet the customer’s business requirements.
  • To illustrate performance regressions and improvements to I/Os or Dataflow templates between code changes.
  • To help end customers estimate costs and plan capacity to meet their SLOs.
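The arithmetic behind the three goals above can be sketched in a few lines. The helper below is a hypothetical illustration, not part of Beam's performance testing framework: the class name `BenchmarkSummary`, the tolerance-based regression rule, and the assumption that throughput scales linearly with worker count are all choices made for this example, and the numbers in the usage note are made up.

```java
// Hypothetical helper for summarizing benchmark runs; not part of any Beam framework.
final class BenchmarkSummary {
  private BenchmarkSummary() {}

  // Records processed per second for a single run.
  static double throughput(long records, double seconds) {
    if (seconds <= 0) {
      throw new IllegalArgumentException("seconds must be positive, got " + seconds);
    }
    return records / seconds;
  }

  // Relative change of the candidate against the baseline; negative means slower.
  static double relativeChange(double baseline, double candidate) {
    if (baseline <= 0) {
      throw new IllegalArgumentException("baseline must be positive, got " + baseline);
    }
    return (candidate - baseline) / baseline;
  }

  // Assumed rule: a regression is a throughput drop larger than the given tolerance.
  static boolean isRegression(double baseline, double candidate, double tolerance) {
    return relativeChange(baseline, candidate) < -tolerance;
  }

  // Capacity estimate that ASSUMES linear scaling with worker count, rounded up.
  static int workersNeeded(double targetRecordsPerSec, double perWorkerRecordsPerSec) {
    if (perWorkerRecordsPerSec <= 0) {
      throw new IllegalArgumentException("perWorkerRecordsPerSec must be positive");
    }
    return (int) Math.ceil(targetRecordsPerSec / perWorkerRecordsPerSec);
  }
}
```

For instance, with illustrative numbers, a run that processes 1,000,000 records in 50 seconds has a throughput of 20,000 records per second. If a later change drops that to 18,000, the relative change is -10%, which a 5% tolerance would flag as a regression. Real I/Os rarely scale perfectly linearly, so a worker estimate from this kind of formula is a starting point for capacity planning rather than a guarantee.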

Dashboard

Google runs performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for Java and Python.

Guidance