blob: 485baa464537604104042f63ac4507bcf517286a [file] [log] [blame]
---
title: Handling Forced Cache Disconnection Using Autoreconnect
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
A <%=vars.product_name%> member may be forcibly disconnected from a <%=vars.product_name%> cluster if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the cluster.
## How the Autoreconnection Process Works
After being disconnected from a cluster,
a <%=vars.product_name%> member shuts down and, by default, automatically restarts into
a "reconnecting" state,
while periodically attempting to rejoin the cluster
by contacting a list of known locators.
If the member succeeds in reconnecting to a known locator, the member rebuilds its view of the cluster from existing members and receives a new distributed member ID.
If the member cannot connect to a known locator, the member will then check to see if it itself is a locator (or hosting an embedded locator process). If the member is a locator, then the member does a quorum-based reconnect; it will attempt to contact a quorum of the members that were in the membership view just before it became disconnected. If a quorum of members can be contacted, then startup of the cluster is allowed to begin. Since the reconnecting member does not know which members survived the network partition event, all members that are in a reconnecting state will keep their UDP unicast ports open and respond to ping requests.
Membership quorum is determined using the same member weighting system used in network partition detection. See [Membership Coordinators, Lead Members and Member Weighting](network_partitioning/membership_coordinators_lead_members_and_weighting.html#concept_23C2606D59754106AFBFE17515DF4330).
Note that when a locator is in the reconnecting state,
it provides no discovery services for the cluster.
The default settings for reconfiguration of the cache once
reconnected assume that the cluster configuration service has
a valid (XML) configuration.
This will not be the case if the cluster was configured using
API calls.
To handle this case,
either disable autoreconnect by setting the property to
```
disable-auto-reconnect = true
```
or, disable the cluster configuration service by setting the property to
```
enable-cluster-configuration = false
```
After the cache has reconnected, applications must fetch a reference to the new Cache, Regions, DistributedSystem and other artifacts. Old references will continue to throw cancellation exceptions like `CacheClosedException(cause=ForcedDisconnectException)`.
See the <%=vars.product_name%> `DistributedSystem` and `Cache` Java API documentation for more information.
## Managing the Autoreconnection Process
By default a <%=vars.product_name%> member will try to reconnect until it is told to stop by using the `DistributedSystem.stopReconnecting()` or `Cache.stopReconnecting()` method. You can disable automatic reconnection entirely by setting `disable-auto-reconnect` <%=vars.product_name%> property to "true."
You can use `DistributedSystem` and `Cache` callback methods to perform actions during the reconnect process, or to cancel the reconnect process if necessary.
The `DistributedSystem` and `Cache` API provide several methods you can use to take actions while a member is reconnecting to the cluster:
- `DistributedSystem.isReconnecting()` returns true if the member is in the process of reconnecting and recreating the cache after having been removed from the system by other members.
- `DistributedSystem.waitUntilReconnected(long, TimeUnit)` waits for a period of time, and then returns a boolean value to indicate whether the member has reconnected to the DistributedSystem. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the member shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
- `DistributedSystem.getReconnectedSystem()` returns the reconnected DistributedSystem.
- `DistributedSystem.stopReconnecting()` stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
- `Cache.isReconnecting()` returns true if the cache is attempting to reconnect to a cluster.
- `Cache.waitUntilReconnected(long, TimeUnit)` waits for a period of time, and then returns a boolean value to indicate whether the DistributedSystem has reconnected. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the cache shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
- `Cache.getReconnectedCache()` returns the reconnected Cache.
- `Cache.stopReconnecting()` stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
## Operator Intervention
You may need to intervene in the autoreconnection process if processes or hardware have crashed or are otherwise shut down before the network connection is healed. In this case the members in a "reconnecting" state will not be able to find the lost processes through UDP probes and will not rejoin the system until they are able to contact a locator.