[tests] fix flakiness in TestFailDuringScanWorkload

This patch fixes flakiness in the
TabletServerDiskErrorITest.TestFailDuringScanWorkload scenario.
A prior attempt to make the scenario more stable [1] did not rule out
sporadic test failures due to
  * various scheduler anomalies
  * the random distribution of replicas chosen by the client to read
    data from

As a result, the scenario was still failing in about 1 out of 10 runs
for RELEASE and ASAN builds [2].

To eliminate the flakiness, it's necessary to make sure that
  * the dedicated tablet server ends up with at least one replica
    from which the client tries to fetch the data
  * scan requests arrive at tablet replicas hosted by the dedicated
    tablet server only after IO failures have been injected
This patch does so by
  * having more control over the selection of tablet replicas that
    the client sends scan requests to
  * starting the scan operation only after injecting IO errors
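
The two conditions above can be illustrated with a toy model.  This is
only a sketch, not the code in this patch: the function names, the
server and tablet counts, and the uniform random choices are
assumptions made up for illustration, and no Kudu API is used.  A run
"passes" only if the client reads from the dedicated server and the
scan starts after the IO errors are injected.

```python
import random


def scenario_passes(rng, num_servers=3, num_tablets=4, controlled=False):
    """Toy model of one run; True if both required conditions hold."""
    target = 0  # index of the dedicated tablet server (hypothetical)
    if controlled:
        # Assumed fixed behavior: scans are directed at replicas on the
        # dedicated server, and the scan starts after error injection.
        chosen = [target] * num_tablets
        inject_time, scan_time = 0.0, 1.0
    else:
        # Assumed old behavior: replicas are picked at random and the
        # relative timing of injection and scan is left to the scheduler.
        chosen = [rng.randrange(num_servers) for _ in range(num_tablets)]
        inject_time, scan_time = rng.random(), rng.random()
    reads_from_target = target in chosen
    scan_after_injection = scan_time > inject_time
    return reads_from_target and scan_after_injection


def count_failures(runs, controlled, seed=0):
    """Count failing runs of the toy model out of `runs` attempts."""
    rng = random.Random(seed)
    return sum(
        not scenario_passes(rng, controlled=controlled) for _ in range(runs)
    )
```

In this model, count_failures(1024, controlled=True) is always 0,
while the uncontrolled variant fails in some runs.  The toy failure
rate is not meant to match the rate observed for the real scenario.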

To verify the fix, I ran the test scenario built in the ASAN
configuration with and without this patch.  Without this patch, 96 out
of 1024 runs failed [3].  With the patch applied, 0 out of 1024 runs
failed [4].
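
As a rough sanity check on these numbers, the sketch below computes
the observed failure rate and the chance of seeing 0 failures in 1024
runs if that rate had stayed the same.  It assumes that runs are
independent and share a constant failure probability, which is a
simplification; the variable names are illustrative only.

```python
failed_before, runs = 96, 1024

# Observed failure rate without the patch: 96 / 1024 = 0.09375,
# i.e. roughly 1 failure in 10 runs.
rate_before = failed_before / runs

# Probability of 1024 consecutive passes if the failure rate were
# still rate_before (assumes independent runs).
p_all_pass_unfixed = (1 - rate_before) ** runs
```

Under that assumption, p_all_pass_unfixed is vanishingly small (well
below 1e-40), so 0 failures out of 1024 runs is very unlikely to be a
lucky streak.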

[1] https://github.com/apache/kudu/commit/ccbbfb3006314f2c37f3a40bfec355db9fc90e02
[2] http://dist-test.cloudera.org:8080/test_drilldown?test_name=disk_failure-itest
[3] http://dist-test.cloudera.org/job?job_id=aserbin.1662847551.105230
[4] http://dist-test.cloudera.org/job?job_id=aserbin.1662873124.94488

Change-Id: Ia29bfdc9761139426344532bab3e5d0b3c1b12ad
Reviewed-on: http://gerrit.cloudera.org:8080/18967
Reviewed-by: Yifan Zhang <chinazhangyifan@163.com>
Tested-by: Yifan Zhang <chinazhangyifan@163.com>
1 file changed