This README describes how cross-module code coverage reporting works for Hudi by leveraging JaCoCo.
We used to report code coverage on each PR in the early days (see https://github.com/apache/hudi/pull/1667#issuecomment-633665810, screenshot below). However, we have since disabled it due to several problems.
JaCoCo is a free, open-source code coverage library for Java. It helps developers understand how much of their codebase is actually exercised by their tests. It is still the de facto standard for Java code coverage reporting.
JaCoCo supports report-aggregate for multi-module projects, but there are certain limitations as of the 0.8.12 release (1, 2, 3, 4, 5, 6). One workaround is creating a new source module solely for report aggregation, which is something we want to avoid if possible.
However, JaCoCo also provides a powerful CLI tool (https://www.jacoco.org/jacoco/trunk/doc/cli.html) that can manipulate reports at the file level, which we can use for custom report aggregation.
At a high level, here's how JaCoCo generates the code coverage report:
(1) While running tests, JaCoCo generates binary execution data for later reporting. The execution data can be stored in a jacoco.exec file if enabled. It's not a human-readable text format; it's designed for consumption by JaCoCo's reporting tools. The following key information is stored in jacoco.exec: session information, plus, for each class, its id (a hash of the class file), its name, and a boolean array recording which probes were executed.
(2) Once tests finish, JaCoCo generates a code coverage report in HTML and/or XML from the binary execution data (jacoco.exec).
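The two phases above can be sketched with the JaCoCo agent and CLI as follows. This is a minimal illustration, not Hudi's actual invocation; jar locations, the test jar name, and directory paths are placeholders:

```shell
# Phase 1: attach the JaCoCo agent so the JVM records execution data
# into jacoco.exec while the tests run (jar paths are illustrative).
java -javaagent:jacocoagent.jar=destfile=jacoco.exec \
     -jar my-tests.jar

# Phase 2: turn the binary execution data into a human-readable report,
# pointing the CLI at the compiled classes and the matching sources.
java -jar jacococli.jar report jacoco.exec \
     --classfiles target/classes \
     --sourcefiles src/main/java \
     --html coverage-report
```

The key point is the separation: the agent only records probe hits at runtime; all analysis and rendering happens offline from jacoco.exec plus the class and source files.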
To make cross-module code coverage reporting work in Azure DevOps Pipeline (or in other similar CI environments) for Hudi, here's the workflow:
(1) When running tests via the mvn command in each job, enable binary execution data to be written to storage, i.e., through the prepare-agent goal (see pom.xml). As we run multiple mvn test commands in the same job with different args, a unique destFile is configured for each command to avoid collisions (see azure-pipelines-20230430.yml);
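As a sketch of step (1), the jacoco-maven-plugin's prepare-agent goal exposes a jacoco.destFile property, so each mvn invocation can write to its own file. The module and file names below are illustrative, not Hudi's actual ones (the real configuration lives in pom.xml and azure-pipelines-20230430.yml):

```shell
# Each mvn test command in the same job writes execution data to a
# distinct .exec file, so concurrent/sequential runs don't collide.
mvn test -pl module-a -Djacoco.destFile=target/jacoco-a.exec
mvn test -pl module-b -Djacoco.destFile=target/jacoco-b.exec
```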
(2) Once each job finishes, the multiple *.exec binary execution data files are merged into one merged-jacoco.exec through the JaCoCo CLI (see the Merge JaCoCo Execution Data Files task in azure-pipelines-20230430.yml). The merged execution data file is published as an artifact for later analysis (see the Publish Merged JaCoCo Execution Data File task in azure-pipelines-20230430.yml).
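The per-job merge in step (2) uses the CLI's merge command. A minimal sketch, with illustrative paths (the actual invocation is in merge_jacoco_exec_files.sh):

```shell
# Combine all per-command execution files from this job into a single
# file; probe data for the same class is unioned across inputs.
java -jar jacococli.jar merge \
     module-a/target/*.exec module-b/target/*.exec \
     --destfile merged-jacoco.exec
```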
(3) Once all jobs finish running all tests, all the JaCoCo execution data files are processed (see the MergeAndPublishCoverage job in azure-pipelines-20230430.yml). The execution data files from multiple jobs are downloaded and merged again into a single jacoco.exec file through the JaCoCo CLI;
(4) To generate the final report, the source files (*.java, *.scala) and class files (*.class) must be under the same directory, not spread across modules, due to the limitation that the JaCoCo CLI takes only a single directory path for each. So a new Maven plugin execution target is added to do that (see copy-source-files and copy-class-files in pom.xml). Once that's done, the final report is generated through the JaCoCo CLI using the aggregated source files, class files, and jacoco.exec (see the MergeAndPublishCoverage job in azure-pipelines-20230430.yml). Both jacoco.exec and the final reports are published.
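Step (4) can be sketched as a single CLI report invocation over the aggregated directories. The directory names below are illustrative stand-ins for wherever copy-source-files and copy-class-files place their output (the actual command is in generate_jacoco_coverage_report.sh):

```shell
# Sources and classes from all modules have already been copied under
# single roots, so one --sourcefiles and one --classfiles path suffice.
java -jar jacococli.jar report jacoco.exec \
     --sourcefiles aggregate/sources \
     --classfiles aggregate/classes \
     --html coverage-html \
     --xml jacoco.xml
```

Publishing both the XML and HTML outputs keeps the report consumable by humans (HTML) and by downstream tooling (XML).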
Azure Run
JaCoCo Coverage Report
Published Artifacts
download_jacoco.sh: downloads JaCoCo binaries, especially the CLI jar, for usage.
merge_jacoco_exec_files.sh: merges multiple JaCoCo execution data files in multiple modules.
merge_jacoco_job_files.sh: merges multiple JaCoCo execution data files from multiple Azure pipeline jobs.
generate_jacoco_coverage_report.sh: generates the JaCoCo code coverage report by taking the execution data file, source files, and class files.