commit c8d4086697f85c9093f4da2c907a13e17c198914

author: Raúl Gracia <raul.gracia@emc.com> | Wed Aug 25 14:05:34 2021 +0200
committer: GitHub <noreply@github.com> | Wed Aug 25 14:05:34 2021 +0200
tree: 01db192464edddad17c6c8a01c2c2cf40ef09ad8
parent: 393834b6131423c0b7c5d7baa54b1bd867afe72a
ISSUE #2482: Added TCP_USER_TIMEOUT to Epoll channel config

### Motivation

Added `TCP_USER_TIMEOUT` to the Epoll channel config to limit the time a connection keeps sending keepalives to a non-responding Bookie.

### Changes

The original issue reported that in scenarios where Bookies may go down unexpectedly and change their IP (e.g., Kubernetes), the Bookkeeper client may be left for some time attempting to connect to the old IP of the restarted Bookie (see #2482 for details). To prevent this problem (in Epoll channels), we introduce the following changes:

- Epoll channels are now configured with `TCP_USER_TIMEOUT`. This parameter takes precedence over the underlying TCP keepalive configuration (see https://datatracker.ietf.org/doc/html/rfc5482), which may default to retrying for too long depending on the environment (e.g., 10-15 minutes in our experience).
- To avoid adding more configuration parameters, the existing `clientConnectTimeoutMillis` value in `ClientConfiguration` is reused to set `TCP_USER_TIMEOUT`, given the similarity between the two settings.

### Validation

We reproduced the original testing environment in which this problem appears consistently:

- A cluster with 4 Bookies and 3 Kubernetes nodes, plus https://pravega.io, which uses the Bookkeeper client.
- Deployed an application to do IO to Pravega (and therefore, to Bookkeeper).
- Periodically shut down a Kubernetes node, so the Bookkeeper pods on it are restarted as well.

With this test procedure, without the proposed PR we consistently observe Bookkeeper clients getting stuck trying to contact old Bookie IPs. With this change, we confirmed via logs that the configuration change takes effect, and we have not been able to reproduce the original problem after multiple node reboots.
Master Issue: #2482

Reviewers: Flavio Junqueira <fpj@apache.org>, Enrico Olivelli <eolivelli@apache.org>

This closes #2761 from RaulGracia/issue-2482-close-idle-bookie-connection, closes #2482
Apache BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for append-only workloads.
It is suitable for use in the following scenarios:
Please visit the Documentation on the project website for more information.
For filing bugs, suggesting improvements, or requesting new features, help us out by opening a GitHub issue or an Apache JIRA issue.
Subscribe or mail the user@bookkeeper.apache.org list - Ask questions, find answers, and also help other users.
Subscribe or mail the dev@bookkeeper.apache.org list - Join development discussions, propose new ideas and connect with contributors.
Join us on Slack - This is the most immediate way to connect with Apache BookKeeper committers and contributors.
We believe that a welcoming, open community is important, and we welcome contributions.
See Developer Setup to set up your local environment.
Take a look at our open issues: JIRA Issues or GitHub Issues.
Review our coding style and follow our pull requests to learn about our conventions.
Make your changes according to our contribution guide.