---
title: Diagnosing System Problems
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
This section provides possible causes and suggested responses for system problems.
- [Locator does not start](diagnosing_system_probs.html#diagnosing_system_probs__section_7BC1FF8CE0FC492CB49235FC4BC4060B)
- [Application or cache server process does not start](diagnosing_system_probs.html#diagnosing_system_probs__section_D51F5FA86ABA43C699B593D890BC3E28)
- [Application or cache server does not join the cluster](diagnosing_system_probs.html#diagnosing_system_probs__section_53D97CED679443F28E20E8B08C699056)
- [Member process seems to hang](diagnosing_system_probs.html#diagnosing_system_probs__section_D607C96A6CBE42FD880F1463A20A8BEF)
- [Member process does not read settings from the gemfire.properties file](diagnosing_system_probs.html#diagnosing_system_probs__section_E3B4A6DB81AB4C659C6093D2D61EFD71)
- [Cache creation fails - must match schema definition root](diagnosing_system_probs.html#diagnosing_system_probs__section_B0698527A4DF4D84877B1AF66291ABFD)
- [Cache is not configured properly](diagnosing_system_probs.html#diagnosing_system_probs__section_B2DAD06E80A4475D96FF2ACCF30FE198)
- [Unexpected results for keySetOnServer and containsKeyOnServer](diagnosing_system_probs.html#diagnosing_system_probs__section_6B4E2AD4ECBB4C08B8F1DB5E07AFE7F6)
- [Data operation returns PartitionOfflineException](diagnosing_system_probs.html#diagnosing_system_probs__section_9276E09D9FAC408E899F73B7068E80C6)
- [Entries are not being evicted or expired as expected](diagnosing_system_probs.html#diagnosing_system_probs__section_A3BB709B754949C6981C431F1F8023D6)
- [Cannot find the log file](diagnosing_system_probs.html#diagnosing_system_probs__section_346C62F16B19491E83B59B0A51D9E2B6)
- [OutOfMemoryError](diagnosing_system_probs.html#diagnosing_system_probs__section_3CFAA7BA258B43A795AEAB09F9DD9AAB)
- [PartitionedRegionDistributionException](diagnosing_system_probs.html#diagnosing_system_probs__section_B49BD03F4CA241C7BED4A2C4D5936A7A)
- [PartitionedRegionStorageException](diagnosing_system_probs.html#diagnosing_system_probs__section_7DE15A6C99974821B6CA418BC2AF98F1)
- [Application crashes without producing an exception](diagnosing_system_probs.html#diagnosing_system_probs__section_AFA1D06BC3AA44A4AB0593FD1EF0B0B7)
- [Timeout alert](diagnosing_system_probs.html#diagnosing_system_probs__section_06C68EA0DACC46C58AA88E98C19AD2D8)
- [Member produces SocketTimeoutException](diagnosing_system_probs.html#diagnosing_system_probs__section_66D11C8E84F941B58800EDB52194B087)
- [Member logs ForcedDisconnectException, Cache and DistributedSystem forcibly closed](diagnosing_system_probs.html#diagnosing_system_probs__section_8C7CB2EA0A274DAF90083FECE0BF3B1F)
- [Members cannot see each other](diagnosing_system_probs.html#diagnosing_system_probs__section_778D150443044847B1C73B9E02BE247B)
- [One part of the cluster cannot see another part](diagnosing_system_probs.html#diagnosing_system_probs__section_E31AFADE4A3A45C7A6EABB67697CFF33)
- [Data distribution has stopped, although member processes are running](diagnosing_system_probs.html#diagnosing_system_probs__section_04CEF27475924E5D9860BEE6D64C49E2)
- [Distributed-ack operations take a very long time to complete](diagnosing_system_probs.html#diagnosing_system_probs__section_7A6113ED20044B8C868483AABC45216E)
- [Slow system performance](diagnosing_system_probs.html#diagnosing_system_probs__section_E5DB25F2CC454510A9E58790C09C8CE3)
- [Can't get Windows performance data](diagnosing_system_probs.html#diagnosing_system_probs__section_F93DD765FF2A43439D3FF7936F8883DE)
- [Java applications on 64-bit platforms hang or use 100% CPU](diagnosing_system_probs.html#diagnosing_system_probs__section_E70C332303A242BEAE9D2C0A2EE70E0A)
## <a id="diagnosing_system_probs__section_7BC1FF8CE0FC492CB49235FC4BC4060B" class="no-quick-link"></a>Locator does not start
Invocation of a locator with gfsh fails with an error like this:
``` pre
Starting a GemFire Locator in C:\devel\gfcache\locator\locator
The Locator process terminated unexpectedly with exit status 1. Please refer to the log
file in C:\devel\gfcache\locator\locator for full details.
Exception in thread "main" java.lang.RuntimeException: An IO error occurred while
starting a Locator in C:\devel\gfcache\locator\locator on 192.0.2.0[10999]: Network is
unreachable; port (10999) is not available on 192.0.2.0.
at
org.apache.geode.distributed.LocatorLauncher.start(LocatorLauncher.java:622)
at
org.apache.geode.distributed.LocatorLauncher.run(LocatorLauncher.java:513)
at
org.apache.geode.distributed.LocatorLauncher.main(LocatorLauncher.java:188)
Caused by: java.net.BindException: Network is unreachable; port (10999) is not available on
192.0.2.0.
at
org.apache.geode.distributed.AbstractLauncher.assertPortAvailable(AbstractLauncher.java:136)
at
org.apache.geode.distributed.LocatorLauncher.start(LocatorLauncher.java:596)
...
```
This error indicates a mismatch in the address and port pairs used for locator startup and configuration. The address and port you use to start the locator must match the entry for that locator in the `gemfire.properties` locators specification. Every member of the cluster, including the locator itself, must have the complete locators specification in its `gemfire.properties`.
Response:
- Check that your locators specification includes the address you are using to start your locator; a minimal sketch follows this list.
- If you use a bind address, you must use numeric addresses for the locator specification. The bind address will not resolve to the machine's default address.
- If you are using a 64-bit Linux system, check whether your system is experiencing the leap second bug. See [Java applications on 64-bit platforms hang or use 100% CPU](diagnosing_system_probs.html#diagnosing_system_probs__section_E70C332303A242BEAE9D2C0A2EE70E0A) for more information.
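The following sketch shows the relationship; the member name, address, and port are hypothetical:

``` pre
# gemfire.properties, present on every member (including the locator itself):
locators=192.0.2.0[10999]

# Locator startup must use the same address and port:
gfsh>start locator --name=locator1 --bind-address=192.0.2.0 --port=10999
```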
## <a id="diagnosing_system_probs__section_D51F5FA86ABA43C699B593D890BC3E28" class="no-quick-link"></a>Application or cache server process does not start
On Windows, a process that tries to start and then silently disappears usually indicates a memory problem.
Response:
- On a Windows host, decrease the maximum JVM heap size. This property is specified on the `gfsh` command line:
``` pre
gfsh>start server --name=server_name --max-heap=1024m
```
For details, see [JVM Memory Settings and System Performance](../monitor_tune/system_member_performance_jvm_mem_settings.html#sys_mem_perf).
- If this doesn't work, try rebooting.
## <a id="diagnosing_system_probs__section_53D97CED679443F28E20E8B08C699056" class="no-quick-link"></a>Application or cache server does not join the cluster
Response: Check these possible causes.
- Network problem. This is the most common cause; first, try to ping the other hosts.
- Firewall problems. If members of your <%=vars.product_name%> cluster are located outside the LAN, check whether a firewall is blocking communication. <%=vars.product_name%> is a network-centric distributed system, so a firewall that restricts inbound or outbound permissions for Java-based sockets can cause connection failures. You may need to modify your firewall configuration to permit traffic to Java applications running on your machine. The specific configuration depends on the firewall you are using.
- Wrong multicast port when using multicast for membership. Check the `gemfire.properties` file of this application or cache server to see that the `mcast-port` is configured correctly. If you are running multiple clusters at your site, each cluster must use a unique multicast port. A properties sketch follows this list.
- Cannot connect to the locator (when using TCP for discovery).
- Check that the `locators` attribute in this process's `gemfire.properties` has the correct IP address for the locator.
- Check that the locator process is running. If not, see instructions for related problem, [Data distribution has stopped, although member processes are running](diagnosing_system_probs.html#diagnosing_system_probs__section_04CEF27475924E5D9860BEE6D64C49E2).
- Bind address set incorrectly on a multi-homed host. When you specify the bind address, use the IP address rather than the host name. Sometimes multiple network adapters are configured with the same hostname. See [Topology and Communication General Concepts](../../topologies_and_comm/topology_concepts/chapter_overview.html#concept_7628F498DB534A2D8A99748F5DA5DC94) for more information about using bind addresses.
- Wrong version of <%=vars.product_name%>. A version mismatch can cause the process to hang or crash. Check the software version with the `gfsh version` command.
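The membership-related properties discussed above must agree across the cluster. A minimal `gemfire.properties` sketch, with hypothetical values:

``` pre
# For multicast membership: every member of one cluster uses the same port,
# and each cluster at a site uses a unique port.
mcast-port=10334

# For locator-based discovery: disable multicast and list the locators instead.
#mcast-port=0
#locators=192.0.2.0[10999]

# On a multi-homed host, bind to a numeric IP address.
#bind-address=192.0.2.5
```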
## <a id="diagnosing_system_probs__section_D607C96A6CBE42FD880F1463A20A8BEF" class="no-quick-link"></a>Member process seems to hang
Response:
- **During initialization**—For persistent regions, the member may be waiting for another member with more recent data to start and load from its disk stores. See [Disk Storage](../disk_storage/chapter_overview.html). Wait for the initialization to finish or time out. The process could be busy; some caches have millions of entries, and they can take a long time to load. Look for this especially with cache servers, because their regions are typically replicas and therefore store all the entries in the region. Applications, on the other hand, typically store just a subset of the entries. For partitioned regions, if the initialization eventually times out and produces an exception, the system architect needs to repartition the data.
- **For a running process**—Investigate whether another member is initializing. Under some optional cluster configurations, a process can be required to wait for a response from other processes before it proceeds.
## <a id="diagnosing_system_probs__section_E3B4A6DB81AB4C659C6093D2D61EFD71" class="no-quick-link"></a>Member process does not read settings from the gemfire.properties file
Either the process can't find the configuration file or, if it is an application, it may be doing programmatic configuration.
Response:
- Check that the `gemfire.properties` file is in the right directory.
- Make sure the process is not picking up settings from another `gemfire.properties` file earlier in the search path. <%=vars.product_name%> looks for a `gemfire.properties` file in the current working directory, the home directory, and the CLASSPATH, in that order. You can also point a process at a specific file, as sketched after this list.
- For an application, check the documentation to see whether it does programmatic configuration. If so, the properties that are set programmatically cannot be reset in a `gemfire.properties` file. See your application's customer support group for configuration changes.
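As a sketch, assuming a hypothetical path, the `gemfirePropertyFile` system property directs a member to a specific properties file, overriding the default search:

``` pre
gfsh>start server --name=server1 --J=-DgemfirePropertyFile=/opt/config/gemfire.properties
```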
## <a id="diagnosing_system_probs__section_B0698527A4DF4D84877B1AF66291ABFD" class="no-quick-link"></a>Cache creation fails - must match schema definition root
System member startup fails with an error like one of these:
``` pre
Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/client_cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "client-cache", must match DOCTYPE root "cache".
```
``` pre
Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "cache", must match DOCTYPE root "client-cache".
```
<%=vars.product_name%> declarative cache creation uses one of two root elements: `cache` for peers and servers, or `client-cache` for clients. The root element in the XML file must match the type of cache the member creates.
Response:
- Modify your `cache.xml` file so it has the proper XML namespace and schema definition.
**For peers and servers:**
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<cache
xmlns="http://geode.apache.org/schema/cache"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
version="1.0”>
...
</cache>
```
**For clients:**
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<client-cache
xmlns="http://geode.apache.org/schema/cache"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
version="1.0">
...
</client-cache>
```
## <a id="diagnosing_system_probs__section_B2DAD06E80A4475D96FF2ACCF30FE198" class="no-quick-link"></a>Cache is not configured properly
An empty cache can be a normal condition. Some applications start with an empty cache and populate it programmatically, but others are designed to bulk load data during initialization.
Response:
If your application should start with a full cache but it comes up empty, check these possible causes:
- **No regions**—If the cache has no regions, the process isn't reading the cache configuration file. Check that the name and location of the cache configuration file match those configured in the `cache-xml-file` attribute in `gemfire.properties` (a sketch follows this list). If they match, the process may not be reading `gemfire.properties`. See [Member process does not read settings from the gemfire.properties file](diagnosing_system_probs.html#diagnosing_system_probs__section_E3B4A6DB81AB4C659C6093D2D61EFD71).
- **Regions without data**—If the cache starts with regions, but no data, this process may not have joined the correct cluster. Check the log file for messages that indicate other members. If you don't see any, the process may be running alone in its own cluster. In a process that is clearly part of the correct cluster, regions without data may indicate an implementation design error.
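A minimal sketch of the attribute; the file path is hypothetical:

``` pre
# gemfire.properties -- the declarative cache configuration this process reads
cache-xml-file=/opt/config/myCache.xml
```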
## <a id="diagnosing_system_probs__section_6B4E2AD4ECBB4C08B8F1DB5E07AFE7F6" class="no-quick-link"></a>Unexpected results for keySetOnServer and containsKeyOnServer
Client calls to `keySetOnServer` and `containsKeyOnServer` can return incomplete or inconsistent results if your server regions are not configured as partitioned or replicated regions.
A non-partitioned, non-replicate server region may not hold all data for the distributed region, so these methods would operate on a partial view of the data set.
In addition, the client methods use the least-loaded server for each method call, so they may use different servers for two calls. If the servers do not have a consistent view in their local data sets, responses to client requests will vary.
A consistent view is guaranteed only by configuring the server regions with partitioned or replicate data-policy settings. Non-server members of the server system can use any allowable configuration, because they do not take client requests.
The following server region configurations give inconsistent results. These configurations allow different data on different servers. There is no additional messaging on the servers, so no union of keys across servers or checking other servers for the key in question.
- Normal
- Mix (replicated, normal, empty) for a single distributed region. Results are inconsistent depending on which server the client sends the request to.
These configurations provide consistent results:
- Partitioned server region
- Replicated server region
- Empty server region: keySetOnServer returns the empty set and containsKeyOnServer returns false
Response: Use a partitioned or replicate data-policy for your server regions. This is the only way to provide a consistent view to clients of your server data set. See [Region Data Storage and Distribution Options](../../developing/region_options/chapter_overview.html). A client-side sketch of the two calls follows.
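A minimal Java client sketch, with hypothetical locator address, region name, and key; the results are consistent only when the server region is partitioned or replicated:

``` pre
import java.util.Set;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

ClientCache cache = new ClientCacheFactory()
    .addPoolLocator("192.0.2.0", 10999)
    .create();
Region<String, String> region = cache
    .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
    .create("exampleRegion");

boolean onServer = region.containsKeyOnServer("key1"); // asks one server
Set<String> serverKeys = region.keySetOnServer();      // may ask a different server
```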
## <a id="diagnosing_system_probs__section_9276E09D9FAC408E899F73B7068E80C6" class="no-quick-link"></a>Data operation returns PartitionOfflineException
In partitioned regions that are persisted to disk, if you have any members offline, the partitioned region will still be available but may have some buckets represented only in offline disk stores. In this case, methods that access the bucket entries return a PartitionOfflineException, similar to this:
``` pre
org.apache.geode.cache.persistence.PartitionOfflineException:
Region /__PR/_B__root_partitioned__region_7 has persistent data that is no
longer online stored at these locations:
[/192.0.2.1:/export/straw3/users/jpearson/bugfix_Apr10/testCL/hostB/backupDirectory
created at timestamp 1270834766733 version 0]
```
Response: Bring the missing member online, if possible. This restores the buckets to memory and you can work with them again. If the missing member cannot be brought back online, or the disk stores for the member are corrupt, you may need to revoke the member, which allows the system to create the buckets in new members and resume operations with the entries. See [Handling Missing Disk Stores](../disk_storage/handling_missing_disk_stores.html#handling_missing_disk_stores). A gfsh sketch of the revoke workflow follows.
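The disk store ID below is hypothetical; revoke only when the member truly cannot return, because its offline data is discarded:

``` pre
gfsh>show missing-disk-stores
gfsh>revoke missing-disk-store --id=60399215-532b-406f-b81f-9b5bd8d1b55a
```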
## <a id="diagnosing_system_probs__section_A3BB709B754949C6981C431F1F8023D6" class="no-quick-link"></a>Entries are not being evicted or expired as expected
Check these possible causes.
- **Transactions**—Entries that are due to be expired may remain in the cache if they are involved in a transaction. Further, transactions never time out, so if a transaction hangs, the entries involved in it remain stuck in the cache. If you have a process with a hung transaction, you may need to end the process to remove the transaction. In your application programming, do not leave transactions open-ended. Program all transactions to end with a commit or a rollback, as in the sketch after this list.
- **Partitioned regions**—For performance reasons, eviction and expiration behave differently in partitioned regions and can cause entries to be removed before you expect. See [Eviction](../../developing/eviction/chapter_overview.html) and [Expiration](../../developing/expiration/chapter_overview.html).
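A minimal Java sketch of the commit-or-rollback pattern; the cache, region, and key are hypothetical:

``` pre
import org.apache.geode.cache.CacheTransactionManager;

CacheTransactionManager txMgr = cache.getCacheTransactionManager();
txMgr.begin();
try {
    region.put("key1", "value1");
    txMgr.commit();
} finally {
    if (txMgr.exists()) {  // transaction still open: commit did not complete
        txMgr.rollback();
    }
}
```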
## <a id="diagnosing_system_probs__section_346C62F16B19491E83B59B0A51D9E2B6" class="no-quick-link"></a>Cannot find the log file
Operating without a log file can be a normal condition, so the process does not log a warning.
Response:
- Check whether the `log-file` attribute is configured in `gemfire.properties` (a sketch follows this list). If not, logging defaults to standard output, and on Windows it may not be visible at all.
- If log-file is configured correctly, the process may not be reading `gemfire.properties`. See [Member process does not read settings from the gemfire.properties file](diagnosing_system_probs.html#diagnosing_system_probs__section_E3B4A6DB81AB4C659C6093D2D61EFD71).
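A minimal sketch of the attribute; the path is hypothetical:

``` pre
# gemfire.properties
log-file=/var/log/geode/server1.log
```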
## <a id="diagnosing_system_probs__section_3CFAA7BA258B43A795AEAB09F9DD9AAB" class="no-quick-link"></a>OutOfMemoryError
An application gets an `OutOfMemoryError` when it needs more object memory than the JVM can provide. The error messages include `java.lang.OutOfMemoryError`.
Response:
The process may be hitting its virtual address space limits. The virtual address space has to be large enough to accommodate the heap, code, data, and dynamic link libraries (DLLs).
- If your application is out of memory frequently, you may want to profile it to determine the cause.
- If you suspect your heap size is set too low, increase the maximum heap size using `-Xmx`. For details, see [JVM Memory Settings and System Performance](../monitor_tune/system_member_performance_jvm_mem_settings.html#sys_mem_perf).
- You may need to lower the thread stack size. The default thread stack size is quite large: 512kb on Sparc and 256kb on Intel for 1.3 and 1.4 32-bit JVMs, 1mb for the 64-bit Sparc 1.4 JVM, and 128kb for 1.2 JVMs. If you have thousands of threads, you might be wasting a significant amount of stack space. If this is your problem, the error may be this:
``` pre
OutOfMemoryError: unable to create new native thread
```
The minimum setting in 1.3 and 1.4 is 64kb, and in 1.2 is 32kb. You can change the stack size using the `-Xss` flag, like this: `-Xss64k`. A sketch of passing both flags through gfsh follows this list.
- You can also control memory use by setting entry limits for the regions.
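A hedged sketch of passing these JVM flags through gfsh; the values are hypothetical starting points, not recommendations:

``` pre
gfsh>start server --name=server1 --J=-Xmx1024m --J=-Xss256k
```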
## <a id="diagnosing_system_probs__section_B49BD03F4CA241C7BED4A2C4D5936A7A" class="no-quick-link"></a>PartitionedRegionDistributionException
The org.apache.geode.cache.PartitionedRegionDistributionException appears when <%=vars.product_name%> fails after many attempts to complete a distributed operation. This exception indicates that no data store member can be found to perform a destroy, invalidate, or get operation.
Response:
- Check the network for traffic congestion or a broken connection to a member.
- Look at the overall installation for problems, such as operations at the application level set to a higher priority than the <%=vars.product_name%> processes.
- If you keep seeing PartitionedRegionDistributionException, you should evaluate whether you need to start more members.
## <a id="diagnosing_system_probs__section_7DE15A6C99974821B6CA418BC2AF98F1" class="no-quick-link"></a>PartitionedRegionStorageException
The org.apache.geode.cache.PartitionedRegionStorageException appears when <%=vars.product_name%> can't create a new entry. This exception arises from a lack of storage space for put and create operations or for get operations with a loader. PartitionedRegionStorageException often indicates data loss or impending data loss.
The text string indicates the cause of the exception, as in these examples:
``` pre
Unable to allocate sufficient stores for a bucket in the partitioned region....
```
``` pre
Ran out of retries attempting to allocate a bucket in the partitioned region....
```
Response:
- Check the network for traffic congestion or a broken connection to a member.
- Look at the overall installation for problems, such as operations at the application level set to a higher priority than the <%=vars.product_name%> processes.
- If you keep seeing PartitionedRegionStorageException, you should evaluate whether you need to start more members.
## <a id="diagnosing_system_probs__section_AFA1D06BC3AA44A4AB0593FD1EF0B0B7" class="no-quick-link"></a>Application crashes without producing an exception
If an application crashes without any exception, this may be caused by an object memory problem. The process is probably hitting its virtual address space limits. For details, see [OutOfMemoryError](diagnosing_system_probs.html#diagnosing_system_probs__section_3CFAA7BA258B43A795AEAB09F9DD9AAB).
Response: Control memory use by setting entry limits for the regions.
## <a id="diagnosing_system_probs__section_06C68EA0DACC46C58AA88E98C19AD2D8" class="no-quick-link"></a>Timeout alert
If a distributed message does not get a response within a specified time, the sender issues an alert to signal that something might be wrong with the system member that hasn't responded. The alert is logged in the sender's log as a warning.
A timeout alert can be considered normal.
Response:
- If you're seeing a lot of timeouts and you haven't seen them before, check whether your network is flooded.
- If you see these alerts constantly during normal operation, consider raising the `ack-wait-threshold` above the default 15 seconds, as sketched after this list.
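A minimal sketch; the value is hypothetical and in seconds:

``` pre
# gemfire.properties
ack-wait-threshold=30
```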
## <a id="diagnosing_system_probs__section_66D11C8E84F941B58800EDB52194B087" class="no-quick-link"></a>Member produces SocketTimeoutException
A client or server produces a SocketTimeoutException when it stops waiting for a response from the other side of the connection and closes the socket. This exception typically happens on the handshake or when establishing a callback connection.
Response:
Increase the default socket timeout setting for the member. This timeout is set separately for the client Pool. For a client/server configuration, adjust the `read-timeout` value as described in [&lt;pool&gt;](../../reference/topics/client-cache.html#cc-pool) or use the `org.apache.geode.cache.client.PoolFactory.setReadTimeout` method, as in the sketch below.
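A minimal Java sketch of the programmatic approach; the server address, timeout value, and pool name are hypothetical, and the timeout is in milliseconds:

``` pre
import org.apache.geode.cache.client.Pool;
import org.apache.geode.cache.client.PoolFactory;
import org.apache.geode.cache.client.PoolManager;

PoolFactory factory = PoolManager.createFactory();
factory.addServer("192.0.2.10", 40404);
factory.setReadTimeout(30000);  // wait up to 30 seconds for a server response
Pool pool = factory.create("serverPool");
```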
## <a id="diagnosing_system_probs__section_8C7CB2EA0A274DAF90083FECE0BF3B1F" class="no-quick-link"></a>Member logs ForcedDisconnectException, Cache and DistributedSystem forcibly closed
A cluster member's Cache and DistributedSystem are forcibly closed by the system membership coordinator if the member becomes sick or too slow to respond to heartbeat requests. When this happens, listeners receive a RegionDestroyed notification with an opcode of FORCED\_DISCONNECT. The <%=vars.product_name%> log file for the member shows a ForcedDisconnectException with the message
``` pre
This member has been forced out of the cluster because it did not respond
within member-timeout milliseconds
```
Response:
To minimize the chances of this happening, you can increase the DistributedSystem property `member-timeout`, as sketched below. Take care, however, as this setting also controls the length of time required to notice a network failure. It should not be set too high.
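A minimal sketch; the value is hypothetical and in milliseconds:

``` pre
# gemfire.properties
member-timeout=10000
```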
## <a id="diagnosing_system_probs__section_778D150443044847B1C73B9E02BE247B" class="no-quick-link"></a>Members cannot see each other
Suspect a network problem or a problem in the transport configuration for membership and discovery.
Response:
- Check your network monitoring tools to see whether the network is down or flooded.
- If you are using multi-homed hosts, make sure a bind address is set and consistent for all system members. For details about using bind addresses, see [Topology and Communication General Concepts](../../topologies_and_comm/topology_concepts/chapter_overview.html#concept_7628F498DB534A2D8A99748F5DA5DC94).
- Check that all the applications and cache servers are using the same locator address.
## <a id="diagnosing_system_probs__section_E31AFADE4A3A45C7A6EABB67697CFF33" class="no-quick-link"></a>One part of the cluster cannot see another part
This situation can leave your caches in an inconsistent state. In networking circles, this kind of network outage is called the "split brain problem."
Response:
- Restart all the processes to ensure data consistency.
- Going forward, set up network monitoring tools to detect these kinds of outages quickly.
- Enable network partition detection.
Also see
[Understanding and Recovering from Network Outages](recovering_from_network_outages.html#rec_network_crash).
## <a id="diagnosing_system_probs__section_04CEF27475924E5D9860BEE6D64C49E2" class="no-quick-link"></a>Data distribution has stopped, although member processes are running
Suspect a problem with the network, the locator, or the multicast configuration, depending on the transport your cluster is using.
Response:
- Check the health of your system members. Search the logs for this string:
``` pre
Uncaught exception
```
An uncaught exception means a severe error, often an OutOfMemoryError. See [OutOfMemoryError](diagnosing_system_probs.html#diagnosing_system_probs__section_3CFAA7BA258B43A795AEAB09F9DD9AAB).
- Check your network monitoring tools to see whether the network is down or flooded.
- If you are using multicast, check whether the existing configuration is no longer appropriate for the current network traffic.
- Check whether the locators have stopped. For a list of the locators in use, check the `locators` property in one of the application `gemfire.properties` files.
- Restart the locator processes on the same hosts, if possible. The cluster begins normal operation, and data distribution restarts automatically.
- If a locator must be moved to another host or a different IP address, complete these steps:
1. Shut down all the members of the cluster in the usual order.
2. Restart the locator process in its new location.
3. Edit all the `gemfire.properties` files to change this locator's IP address in the `locators` attribute.
4. Restart the applications and cache servers in the usual order.
- Create a watchdog daemon or service on each locator host to restart the locator process when it stops.
## <a id="diagnosing_system_probs__section_7A6113ED20044B8C868483AABC45216E" class="no-quick-link"></a>Distributed-ack operations take a very long time to complete
This problem can occur in systems with a great number of distributed-no-ack operations. That is, the presence of many no-ack operations can cause distributed-ack operations to take a long time to complete.
Response:
For information on alleviating this problem, see [Slow distributed-ack Messages](../monitor_tune/slow_messages.html#slow_mess).
## <a id="diagnosing_system_probs__section_E5DB25F2CC454510A9E58790C09C8CE3" class="no-quick-link"></a>Slow system performance
Slow system performance is sometimes caused by a buffer size that is too small for the objects being distributed.
Response:
If you are experiencing slow performance and are sending large objects (multiple megabytes), try increasing the socket buffer size settings in your system. For more information, see [Socket Communication](../monitor_tune/socket_communication.html).
## <a id="diagnosing_system_probs__section_F93DD765FF2A43439D3FF7936F8883DE" class="no-quick-link"></a>Can’t get Windows performance data
Attempting to run performance measurements for <%=vars.product_name%> on Windows can produce this error message:
``` pre
Can't get Windows performance data. RegQueryValueEx returned 5
```
This error can occur because incorrect information is returned when a Win32 application calls the ANSI version of RegQueryValueEx Win32 API with HKEY\_PERFORMANCE\_DATA. This error is described in Microsoft KB article ID 226371 at [http://support.microsoft.com/kb/226371/en-us](http://support.microsoft.com/kb/226371/en-us).
Response:
To successfully acquire Windows performance data, you need to verify that you have the proper registry key access permissions in the system registry. In particular, make sure that Perflib in the following registry path is readable (KEY\_READ access) by the <%=vars.product_name%> process:
``` pre
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib
```
An example of reasonable security on the performance data would be to grant administrators KEY\_ALL\_ACCESS access and interactive users KEY\_READ access. This particular configuration would prevent non-administrator remote users from querying performance data.
See [http://support.microsoft.com/kb/310426](http://support.microsoft.com/kb/310426) and [http://support.microsoft.com/kb/146906](http://support.microsoft.com/kb/146906) for instructions about how to ensure that <%=vars.product_name%> processes have access to the registry keys associated with performance.
## <a id="diagnosing_system_probs__section_E70C332303A242BEAE9D2C0A2EE70E0A" class="no-quick-link"></a>Java applications on 64-bit platforms hang or use 100% CPU
If your Java applications suddenly start to use 100% CPU, you may be experiencing the leap second bug. This bug is found in the Linux kernel and can severely affect Java programs. In particular, you may notice that method invocations using `Thread.sleep(n)`, where `n` is a small number, actually sleep for a much longer period of time than the method defines. To verify that you are experiencing this bug, check the host's `dmesg` output for the following message:
``` pre
[10703552.860274] Clock: inserting leap second 23:59:60 UTC
```
To fix this problem, issue the following commands on your affected Linux machines:
``` pre
prompt> /etc/init.d/ntp stop
prompt> date -s "$(date)"
```
See the following web site for more information:
[http://blog.wpkg.org/2012/07/01/java-leap-second-bug-30-june-1-july-2012-fix/](http://blog.wpkg.org/2012/07/01/java-leap-second-bug-30-june-1-july-2012-fix/)