This Apache Beam I/O Standards document lays out prescriptive guidance for first-party and third-party (1P/3P) developers building an Apache Beam I/O connector. These guidelines aim to establish concise best practices covering documentation, development, and testing.
An I/O connector (I/O) that lives in the Apache Beam GitHub repository is known as a built-in I/O connector. Built-in I/Os have their integration tests and performance tests run routinely by the Google Cloud Dataflow team using the Dataflow Runner, and the resulting metrics are published publicly for reference. Unless explicitly stated otherwise, the following guidelines apply to both built-in and custom I/Os.
This section lays out the superset of all documentation that is expected to be made available with an I/O. The Apache Beam documentation referenced throughout this section can be found here. A good example to follow is the built-in Snowflake I/O.
Custom I/Os are not included in the Apache Beam GitHub repository. One example is SolaceIO.
This section outlines API syntax, semantics and recommendations for features that should be adopted for new as well as existing Apache Beam I/O Connectors.
The I/O Connector development guidelines are written with the following principles in mind:
{{< highlight java >}}
BigQueryIO.write()
  .withWriteConfig(StreamingInsertsConfig.withDetailedError()
    .withExactlyOnce().etc..)
{{< /highlight >}}
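The snippet above reads as grouping the options that belong to one write mode into a dedicated configuration object, rather than exposing each option directly on the transform. The following is a minimal, self-contained sketch of that pattern in plain Java. It is not Beam code: `WriteConfigSketch`, `WriteConfig`, `StreamingConfig`, `Write`, and every method on them are hypothetical names chosen for illustration.

```java
import java.util.Objects;

class WriteConfigSketch {

  /** Hypothetical marker for mode-specific write options. */
  interface WriteConfig {
    String describe();
  }

  /** Hypothetical options that only make sense for a streaming style of write. */
  static final class StreamingConfig implements WriteConfig {
    final boolean detailedErrors;
    final boolean exactlyOnce;

    private StreamingConfig(boolean detailedErrors, boolean exactlyOnce) {
      this.detailedErrors = detailedErrors;
      this.exactlyOnce = exactlyOnce;
    }

    static StreamingConfig create() {
      return new StreamingConfig(false, false);
    }

    // Each "with" method returns a new immutable instance.
    StreamingConfig withDetailedErrors() {
      return new StreamingConfig(true, exactlyOnce);
    }

    StreamingConfig withExactlyOnce() {
      return new StreamingConfig(detailedErrors, true);
    }

    @Override
    public String describe() {
      return "streaming(detailedErrors=" + detailedErrors + ", exactlyOnce=" + exactlyOnce + ")";
    }
  }

  /** Hypothetical stand-in for a write transform that accepts one config object. */
  static final class Write {
    final String table;
    final WriteConfig config;

    private Write(String table, WriteConfig config) {
      this.table = table;
      this.config = config;
    }

    static Write to(String table) {
      return new Write(table, StreamingConfig.create());
    }

    Write withWriteConfig(WriteConfig config) {
      return new Write(table, Objects.requireNonNull(config));
    }

    String describe() {
      return table + ": " + config.describe();
    }
  }

  public static void main(String[] args) {
    Write write = Write.to("my_table")
        .withWriteConfig(StreamingConfig.create().withDetailedErrors().withExactlyOnce());
    System.out.println(write.describe());
  }
}
```

One possible benefit of this shape is that options valid only in one mode cannot be set when another mode's config object is supplied, so invalid combinations are ruled out by the types rather than by runtime validation.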
Over time, I/Os need to evolve to address new use cases or to use new APIs under the covers. Some examples of necessary evolution of an I/O:
An I/O should have unit tests, integration tests, and performance tests. The following guidance explains what each type of test aims to achieve and provides a baseline standard of test coverage. Note that the actual test cases and test logic will vary with the specifics of each source or sink, but we have included some suggested test cases as a baseline.
This guide complements the Apache Beam I/O transform testing guide by adding specific test cases and scenarios. For general information regarding testing Beam I/O connectors, please refer to that guide.
I/O unit tests need to efficiently test the functionality of the code. Given that unit tests are expected to be executed many times over multiple test suites (for example, for each Python version) these tests should execute relatively fast and should not have side effects. We recommend trying to achieve 100% code coverage through unit tests.
When possible, unit tests are favored over integration tests because they execute faster and use fewer resources. Additionally, unit tests can easily be included in pre-commit test suites (for example, Jenkins beam_PreCommit_* test suites) and hence have a better chance of discovering regressions early. Unit tests are also preferred for error conditions.
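As an illustration of a fast, side-effect-free unit test that also covers an error condition, here is a minimal sketch in plain Java. It replaces the data store client with an in-memory fake that can inject failures. All names (`FakeClientUnitTestSketch`, `RecordClient`, `FakeRecordClient`, `RetryingWriter`, `TransientStoreException`) are hypothetical, and the retry logic is only a stand-in for whatever a given connector actually does.

```java
import java.util.ArrayList;
import java.util.List;

class FakeClientUnitTestSketch {

  /** Hypothetical client interface a connector would use to reach its data store. */
  interface RecordClient {
    void write(String record);
  }

  /** Hypothetical transient error raised by the data store client. */
  static final class TransientStoreException extends RuntimeException {
    TransientStoreException(String message) {
      super(message);
    }
  }

  /** In-memory fake: no network calls, and state that lives only inside the test. */
  static final class FakeRecordClient implements RecordClient {
    final List<String> written = new ArrayList<>();
    private int failuresRemaining;

    FakeRecordClient(int failuresToInject) {
      this.failuresRemaining = failuresToInject;
    }

    @Override
    public void write(String record) {
      if (failuresRemaining > 0) {
        failuresRemaining--;
        throw new TransientStoreException("injected failure");
      }
      written.add(record);
    }
  }

  /** Hypothetical connector logic under test: bounded retries around a single write. */
  static final class RetryingWriter {
    private final RecordClient client;
    private final int maxAttempts;

    RetryingWriter(RecordClient client, int maxAttempts) {
      if (maxAttempts < 1) {
        throw new IllegalArgumentException("maxAttempts must be at least 1");
      }
      this.client = client;
      this.maxAttempts = maxAttempts;
    }

    /** Returns the number of attempts used, or rethrows the last error once attempts run out. */
    int write(String record) {
      TransientStoreException last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          client.write(record);
          return attempt;
        } catch (TransientStoreException e) {
          last = e;
        }
      }
      throw last;
    }
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  /** Happy path and error condition, both covered without a real service. */
  static void runChecks() {
    FakeRecordClient flaky = new FakeRecordClient(2);
    check(new RetryingWriter(flaky, 3).write("a") == 3, "succeeds on the third attempt");
    check(flaky.written.size() == 1 && flaky.written.get(0).equals("a"), "written exactly once");

    FakeRecordClient broken = new FakeRecordClient(5);
    boolean surfaced = false;
    try {
      new RetryingWriter(broken, 3).write("b");
    } catch (TransientStoreException expected) {
      surfaced = true;
    }
    check(surfaced, "error surfaces once attempts are exhausted");
    check(broken.written.isEmpty(), "nothing is written on failure");
  }

  public static void main(String[] args) {
    runChecks();
    System.out.println("all checks passed");
  }
}
```

In an actual Java connector, checks like these would more typically live in a JUnit test class and, where a pipeline is involved, run through Beam's `TestPipeline` with `PAssert`. The sketch avoids those dependencies only to stay self-contained.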
Integration tests exercise end-to-end interactions between the Beam runner and the data store that a given I/O connects to. Because these usually involve remote procedure calls, integration tests take longer to execute. Additionally, Beam runners may use more than one worker when executing integration tests. Due to these costs, an integration test should only be implemented when a given scenario cannot be covered by a unit test.
Because the Performance testing framework is still in flux, performance tests can be a follow-up submission after the actual I/O code.
The Performance testing framework does not yet support Go or TypeScript.
Performance benchmarks are a critical part of best practices for I/Os as they effectively address several areas:
Google runs performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for Java and Python.