Autorecovery may hang indefinitely when the ZooKeeper connection blips
Descriptions of the changes in this PR:
### Motivation
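In certain circumstances, all Autorecovery (AR) processes were running but not performing any replication, so ledgers remained under-replicated for longer than the threshold defined by our monitoring system. The cause is that the latch in `ZkLedgerUnderreplicationManager` is counted down only when the watcher sees a `NodeChildrenChanged`, `NodeDeleted`, or `NodeCreated` event, or an `Expired` or `Disconnected` session state. Any other ZooKeeper event leaves the latch untouched, so the thread waiting on it can block indefinitely.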
### Changes
Count down the latch on every ZooKeeper event, regardless of its type or state, and log the event at INFO level.
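
The difference between the two behaviors can be sketched in isolation. This is a minimal illustration, not the code in this PR: `LatchReleaseSketch`, `EventKind`, and the helper methods are hypothetical stand-ins for ZooKeeper's `WatchedEvent` handling. `SYNC_CONNECTED` is used only as one example of an event the old condition did not match; it is not necessarily the event behind the observed hang. The bounded wait exists only so the sketch terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchReleaseSketch {
    // Hypothetical stand-in for the type/state carried by a ZooKeeper WatchedEvent.
    enum EventKind {
        NODE_CHILDREN_CHANGED, NODE_DELETED, NODE_CREATED,
        EXPIRED, DISCONNECTED,
        SYNC_CONNECTED // example of an event the old condition ignored
    }

    // Old behavior: count down only for the kinds listed in the removed condition.
    static void processFiltered(EventKind e, CountDownLatch latch) {
        switch (e) {
            case NODE_CHILDREN_CHANGED:
            case NODE_DELETED:
            case NODE_CREATED:
            case EXPIRED:
            case DISCONNECTED:
                latch.countDown();
                break;
            default:
                break; // latch stays at 1, so the waiter keeps blocking
        }
    }

    // New behavior: every event releases the waiter.
    static void processAlways(EventKind e, CountDownLatch latch) {
        latch.countDown();
    }

    // Delivers one event, then reports whether a waiter would have been released.
    static boolean released(boolean filtered, EventKind e) {
        CountDownLatch latch = new CountDownLatch(1);
        if (filtered) {
            processFiltered(e, latch);
        } else {
            processAlways(e, latch);
        }
        try {
            // Bounded wait so the sketch cannot hang.
            return latch.await(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("filtered, SYNC_CONNECTED: " + released(true, EventKind.SYNC_CONNECTED));
        System.out.println("always,   SYNC_CONNECTED: " + released(false, EventKind.SYNC_CONNECTED));
        System.out.println("filtered, NODE_DELETED:   " + released(true, EventKind.NODE_DELETED));
    }
}
```

With the filtered handler, an unmatched event leaves the waiter blocked until some later matching event arrives, if one ever does. Releasing on every event trades an occasional spurious wake-up for the guarantee that the waiter re-checks state after any ZooKeeper activity.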
Master Issue: #2302
Reviewers: Enrico Olivelli <eolivelli@apache.org>, Anup Ghatage <gathage@apache.org>
This closes #2471 from karanmehta93/ar-zk-bug
diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java
index dfb1c2a..a50261a 100644
--- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java
+++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java
@@ -588,13 +588,8 @@
Watcher w = new Watcher() {
@Override
public void process(WatchedEvent e) {
- if (e.getType() == Watcher.Event.EventType.NodeChildrenChanged
- || e.getType() == Watcher.Event.EventType.NodeDeleted
- || e.getType() == Watcher.Event.EventType.NodeCreated
- || e.getState() == Watcher.Event.KeeperState.Expired
- || e.getState() == Watcher.Event.KeeperState.Disconnected) {
- changedLatch.countDown();
- }
+ LOG.info("Latch countdown due to ZK event: " + e);
+ changedLatch.countDown();
}
};
try (SubTreeCache.WatchGuard wg = subTreeCache.registerWatcherWithGuard(w)) {