refs/heads/branch-1.5.x - kudu

commit	737930f3cd0c0caba3c36944f91058ffaead4714
author	Adar Dembo <adar@cloudera.com>	Tue Sep 19 18:45:51 2017 -0700
committer	Adar Dembo <adar@cloudera.com>	Thu Jul 19 00:28:48 2018 +0000
tree	27b92f15adfbbc1d3ca4b7f739b7d91c1eb623b1
parent	81325bf5f2572e5d3fb7af0e53db7a1b4ac2c500

KUDU-2149: avoid election stacking by restoring failure monitor semantics

Prior to commit 21b0f3d, the dedicated failure monitor thread invoked
RaftConsensus::StartElection() synchronously, thus preventing it from
surfacing additional failures during that time. This patch attempts to
restore these semantics by short-circuiting and ignoring any failures
detected while a Raft thread is in StartElection().
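
The short-circuit described above can be sketched roughly as follows. This is
a minimal illustration only, not the code in this patch: the ElectionGuard
class, its ReportFailure() method, and the counters are hypothetical names
invented for the example, and the real RaftConsensus logic involves locking
and state that is omitted here.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for the consensus object; NOT Kudu's actual class.
class ElectionGuard {
 public:
  // Called when a failure detector reports a suspected leader failure.
  // Returns true if an election was started, or false if the failure was
  // ignored because another caller is already inside the election start path.
  bool ReportFailure(const std::function<void()>& start_election) {
    bool expected = false;
    // Only one caller at a time can flip the flag from false to true.
    if (!in_start_election_.compare_exchange_strong(expected, true)) {
      ++ignored_failures_;  // Short-circuit: drop the stacked failure.
      return false;
    }
    start_election();  // May be slow, e.g. while flushing metadata to disk.
    assert(in_start_election_.load());
    in_start_election_.store(false);
    ++elections_started_;
    return true;
  }

  int elections_started() const { return elections_started_.load(); }
  int ignored_failures() const { return ignored_failures_.load(); }

 private:
  std::atomic<bool> in_start_election_{false};
  std::atomic<int> elections_started_{0};
  std::atomic<int> ignored_failures_{0};
};
```

In this sketch, a failure reported while start_election() is still running
returns false immediately instead of queueing a second election. The sketch
assumes start_election() does not throw; an exception would leave the flag
set.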

This is a super targeted fix geared towards a point release; a more correct
fix would be to completely disable failure detection while an election is
running, but that'll require more work.

Originally I had written a test that injects latency into
ConsensusMetadata::Flush(), toggles the fix, and compares the number of vote
request RPCs. I couldn't get it to be totally robust, and the "feature flag"
used in the toggle is likely to become obsolete quickly. So in the end I
decided to drop the test from the patch.

Change-Id: Ifeaf99ce57f7d5cd01a6c786c178567a98438ced
Reviewed-on: http://gerrit.cloudera.org:8080/8107
Reviewed-by: Mike Percy <mpercy@apache.org>
Tested-by: Kudu Jenkins
(cherry picked from commit edd41cb40fbad206e2c356983baba8fbc57199b5)
Reviewed-on: http://gerrit.cloudera.org:8080/10987
Reviewed-by: Adar Dembo <adar@cloudera.com>
Tested-by: Adar Dembo <adar@cloudera.com>

2 files changed
