RocketMQ Flaky Test Detection Plan

Background & Goals

RocketMQ's mainline CI frequently experiences intermittent test failures, causing developer trust fatigue toward red builds and masking real regressions. This plan uses large-scale repeated execution to statistically measure failure rates, marks unstable methods (≥1%) with @Ignore to restore CI reliability, and retains the data to prioritize subsequent fixes.

Methodology References

  • GoogleFlaky Tests at Google and How We Mitigate Them (2016): Introduced “deflake” (run N times to measure failure rate) and “quarantine” (isolate flaky tests from mainline CI). Internal data: ~1.5% of tests are flaky, 16% have been flaky at some point.
  • MetaPredictive Test Selection (2018): Uses aggressive retry to separate flaky failures from real regressions.
  • SpotifyTest Flakiness Methods (2019): Three-stage framework of repeated execution + isolation + tracking.

Core Idea: Three-Layer Funnel

A “coarse → fine → pinpoint” strategy to progressively narrow scope and avoid wasting compute at the full method level:

Layer 1: Module level (16 modules × 100 runs) → filter out modules with failures
Layer 2: Class level (only classes in unstable modules × 100 runs) → filter out classes with failures
Layer 3: Method level (only methods in unstable classes × 100 runs) → precisely locate each method's failure rate

After each layer, Surefire XML reports are analyzed and the unstable list feeds the next layer. After marking, a full re-run verifies stability; if new flaky tests surface, the mark + verify cycle repeats until zero failures.

Execution Architecture

  • Control node (local): Orchestrates task distribution, result collection, data analysis
  • Worker nodes (10 ECS, 16C 64G each): Max 4 Docker containers per node in parallel, each test run isolated

Execution Flow

1. Build      → Compile RocketMQ with JDK 8 inside Docker, package as test image
2. Distribute → Relay image via internal network to all worker nodes
3. Dispatch   → Generate task list, split evenly across nodes, start workers
4. Collect    → Poll until complete, retrieve Surefire XML reports
5. Analyze    → Parse XML, compute failure count and rate per method
6. Mark       → Add @Ignore to methods exceeding threshold
7. Verify     → Rebuild and run full suite to confirm trunk stability

Key Design Decisions

Decision PointChoiceRationale
Build environmentJDK 8 inside DockerLocal JDK versions vary; container ensures consistency
Image distributionUpload to one node, relay via internal networkInternal bandwidth far exceeds public internet
Test isolationIndependent container per runAvoids residual processes, port conflicts
Failure threshold≥1% failure rate~10 failures across 1000 effective runs; balances false positives vs. missed cases
Marking approach@Ignore + failure rate commentMinimal intrusion, easy to re-enable later
Verification loopFull re-run after markingHandles “hidden flaky” problem

Follow-up Plan

  • Prioritize root-cause analysis and fix for high failure rate methods (>10%); remove @Ignore and re-verify after fix
  • Consider integrating the detection tool into periodic CI tasks for continuous stability monitoring