83af09f7e9f0fd55b58c7f159979a49882b92ef5 - kudu

commit	83af09f7e9f0fd55b58c7f159979a49882b92ef5
author	Alexey Serbin <alexey@apache.org>	Sat Sep 10 12:10:12 2022 -0700
committer	Yifan Zhang <chinazhangyifan@163.com>	Thu Sep 15 04:03:49 2022 +0000
tree	7c639334d3c3c2ad28160ffdc82b7fc9b1e8cb02
parent	1ad8c1f18a35e06d4aafd7813de71ee335c68ea1

[tests] fix flakiness in TestFailDuringScanWorkload

This patch fixes flakiness in the
TabletServerDiskErrorITest.TestFailDuringScanWorkload scenario.
There was a prior attempt to make the scenario more stable [1],
but it did not rule out sporadic test failures due to
  * various scheduler anomalies
  * the random distribution of replicas the client chooses to read
    data from

With that, the scenario was failing in about 1 out of 10 runs for
RELEASE and ASAN builds [2].

To eliminate the flakiness, it's necessary to make sure that
  * the dedicated tablet server ends up with at least one replica
    from which the client tries to fetch the data
  * scan requests arrive at tablet replicas hosted by the dedicated
    tablet server only after IO failures have been injected
This patch does so by
  * having more control over the selection of tablet replicas that
    the client sends scan requests to
  * starting the scan operation only after injecting IO errors
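
The ordering described above can be illustrated with a toy model. This is
only a sketch, not the code in disk_failure-itest.cc: ToyTabletServer,
ScanOutcome, ScanReplica and RunScenario are hypothetical names made up for
this illustration, and no Kudu API is used. It assumes that a scan observes
a disk error only if the fault was injected before the request arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a tablet server in the test cluster.
struct ToyTabletServer {
  std::string uuid;
  bool io_errors_injected;
};

// Outcome of a single toy scan request.
enum class ScanOutcome { kOk, kDiskError };

// A scan sees a disk error only if the fault is already in place
// when the request reaches the replica.
ScanOutcome ScanReplica(const ToyTabletServer& ts) {
  return ts.io_errors_injected ? ScanOutcome::kDiskError : ScanOutcome::kOk;
}

// Pin the scan to the dedicated server, inject the fault first,
// and only then send the scan request.
ScanOutcome RunScenario(std::vector<ToyTabletServer>& cluster,
                        std::size_t dedicated_idx) {
  assert(dedicated_idx < cluster.size());
  ToyTabletServer& dedicated = cluster[dedicated_idx];
  dedicated.io_errors_injected = true;  // step 1: inject IO errors
  return ScanReplica(dedicated);        // step 2: scan the pinned replica
}
```

In this model the outcome no longer depends on which replica a client
happens to pick or on when the scan starts relative to the injection,
which are the two sources of flakiness listed earlier.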

To verify the fix, I ran the test scenario built in ASAN configuration
with and without this patch.  Without this patch, 96 out of 1024 runs
failed [3].  With the patch applied, 0 out of 1024 runs failed [4].
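
As a sanity check on these numbers, the sketch below computes the observed
failure rate and the chance of seeing a clean 1024-run batch if that rate
had stayed the same. FailureRate and ProbAllRunsPass are hypothetical helper
names, and the calculation assumes independent runs with a fixed per-run
failure probability, which is a simplification.

```cpp
#include <cassert>
#include <cmath>

// Fraction of failed runs in a batch.
double FailureRate(int failed_runs, int total_runs) {
  assert(total_runs > 0);
  assert(failed_runs >= 0 && failed_runs <= total_runs);
  return static_cast<double>(failed_runs) / total_runs;
}

// Probability that every one of `total_runs` independent runs passes
// when each run fails with probability `failure_rate`.
double ProbAllRunsPass(double failure_rate, int total_runs) {
  assert(failure_rate >= 0.0 && failure_rate <= 1.0);
  assert(total_runs >= 0);
  return std::pow(1.0 - failure_rate, total_runs);
}
```

FailureRate(96, 1024) is 0.09375, consistent with the "about 1 out of 10"
figure above. Under the stated assumptions, ProbAllRunsPass(0.09375, 1024)
is far below 1e-40, so 0 failures out of 1024 runs is very unlikely to be
a lucky batch.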

[1] https://github.com/apache/kudu/commit/ccbbfb3006314f2c37f3a40bfec355db9fc90e02
[2] http://dist-test.cloudera.org:8080/test_drilldown?test_name=disk_failure-itest
[3] http://dist-test.cloudera.org/job?job_id=aserbin.1662847551.105230
[4] http://dist-test.cloudera.org/job?job_id=aserbin.1662873124.94488

Change-Id: Ia29bfdc9761139426344532bab3e5d0b3c1b12ad
Reviewed-on: http://gerrit.cloudera.org:8080/18967
Reviewed-by: Yifan Zhang <chinazhangyifan@163.com>
Tested-by: Yifan Zhang <chinazhangyifan@163.com>

src/kudu/integration-tests/disk_failure-itest.cc

1 file changed
