Debugging Geode Applications

This article provides some practical suggestions for debugging Geode Applications.

Geode Artifacts

  • cache.xml - provided by the application developer to configure the caches and regions for Geode client and server processes.
  • Properties files - provided by the application developer to configure Geode system properties and membership discovery
  • System logs - logs generated by Geode clients, servers, and locators. The logs contain information about membership, client connections, warnings about outstanding message requests and errors. The log also displays both Geode and Java system properties.
  • Statistics - archive files generated by Geode clients and/or servers containing information about the Geode application. GemFire VSD (Visual Statistics Display) is used to graph the Geode and system metrics recorded in these archives.

Geode Shell (gfsh)

Geode gfsh provides a command-line interface from which you can launch, manage, and monitor Geode processes, data, and applications. The shell provides commands useful in debugging and to bring all the information to a single location for analysis. The following gfsh commands require first executing the gfsh connect command to establish the connection to the locator or JMX Manager of the distributed system. Please refer to the Geode documentation for more details.

  • export logs
  • export stack-traces
  • show log
  • show dead-locks

Geode MergeLogFiles

Use Geode MergeLogFiles (com.gemstone.gemfire.internal.logging.MergeLogFiles) to merge your log files based on timestamps.

$ java -classpath $CLASSPATH -Xmx1200M com.gemstone.gemfire.internal.logging.MergeLogFiles

Usage: java MergeLogFiles [(directory | logFile)]+
-dirCount n      Number of parent dirs to print
-mergeFile file  File in which to put merged logs
-pids            Search for PIDs in file names and use them to identify files
-align           Align non-timestamped lines with others
-noblanks        Suppress output of blank lines
-threaded        Use multithreading to take advantage of multiple CPUs

Merges multiple Geode log files and sorts them by timestamp.
The merged log file is written to System.out (or a file).

If a directory is specified, all .log files in that directory are merged.

General Guidelines

  1. Check your environment - Machine (e.g. ulimit settings), JDK, JVM properties (-Xmx -Xms), GC parameters

  2. Draw out a diagram of your system topology (servers, clients) and make a note of Listeners, Writers and other plug-ins.

  3. Verify your cache and region configuration

  4. Confirm your system properties (Review properties files and display in system log)

  5. On your system topology diagram, add notes on the initialization and processing being done in various members or classes of members.

  6. If you are debugging a specific interaction, draw a sequence diagram.

Specific search strings and patterns

  1. If possible, bring all the system logs and stack dumps together into a single directory for inspection (use gfsh commands above). Here's a simple script which will search for specific strings in the logs.

!bin/bash

find . -name hs_err*.log -print find . -name jrockit.*.dump -print find . -name core -print find . -name *.hprof -print

fgrep SerializationException */system.log fgrep OutOfMemory *.log /.log | grep -v HeapDumpOnOutOfMemoryError fgrep NullPointerException *.log /.log fgrep ConcurrentModificationException *.log /.log fgrep Assert *.log /.log fgrep deadlock *.log /.log fgrep InternalGemFire *.log /.log fgrep “Too many” *.log /.log fgrep “Suspect notification” *.log /.log

grep Exception *.log /.log | grep -v continuing | grep -v HeapDumpOnOutOfMemoryError

```
  1. Search the system logs for warning, error or severe messages

  2. Search the system logs for any underlying Exceptions. For example: ConcurrentModificationException, NullPointerException, SerializationException.

  3. Search the system logs for warnings about resources causing delays in statistics sampling. If found, use VSD to investigate further.

    [warning 2015/03/29 04:47:23.028 PDT gemfire1_w1-gst-dev26_15651 <Thread-5 StatSampler> tid=0x55] Statistics 
    

sampling thread detected a wakeup delay of 8,310 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics. ```

  1. Verify there are no HotSpot (hs_err_pid.log files) indicating a HotSpot error in the JVM Refer to the Oracle Troubleshooting Guide for more details.

  2. Verify that there are no heapdump (*.hprof) files or OutOfMemoryErrors. Tools such as jhat, jmap, visualvm and Eclipse Memory Analyzer can provide heap histograms and leak suspects.

  3. Search the stack dumps for Java-level deadlock. Dumping the stacks using jstack or the Linux command kill -3 <pid> will highlight any Java-level deadlocks including the threads involved in the deadlock as well as the stack dumps for each of those threads. When debugging, it is best to get stack dumps for all JVMs. To determine if progress is being made, execute multiple thread dumps several seconds apart for comparison.
    You can also search for state=BLOCKED threads and trace waiting to lock (e.g. waiting to lock java.lang.Object@16ce6f90) threads to whichever thread locked the object (e.g. locked java.lang.Object@16ce6f90). Follow this pattern until you find the root thread.

  4. Search the system logs for any 15 seconds have elapsed messages which don't have corresponding wait for replies has completed logs. You can match these log messages together via the thread id or native thread id. Note that these messages are only logged between peers in the Distributed System. See “Special Considerations for Clients” for messages specific to Geode clients.

    In this example, we can see that the request did complete, so while we should be concerned (and possibly check stats in vsd to see what system resources are causing this delay), it will not be the cause of our hang.

    [warning 2014/07/26 02:02:54.384 PDT dataStoregemfire5_w1-gst-dev12_25904 <Pooled Waiting Message Processor 11> tid=0xb5] 
    15 seconds have elapsed while waiting for replies: <DeposePrimaryBucketMessage$DeposePrimaryBucketResponse 2308 
    waiting for 1 replies from [10.138.44.112(dataStoregemfire2_w1-gst-dev12_37336:37336)<v82>:52253]> 
    

on 10.138.44.112(dataStoregemfire5_w1-gst-dev12_25904:25904):14125 whose current membership list is: [[10.138.44.112(dataStoregemfire4_w1-gst-dev12_14688:14688):31076, 10.138.44.112(accessorgemfire1_w1-gst-dev12_37800:37800):59828, 10.138.44.112(accessorgemfire3_w1-gst-dev12_6348:6348):16162, 10.138.44.112(dataStoregemfire2_w1-gst-dev12_37336:37336):52253, 10.138.44.112(accessorgemfire4_w1-gst-dev12_24644:24644):64839, 10.138.44.112(accessorgemfire2_w1-gst-dev12_40732:40732):54553, 10.138.44.112(dataStoregemfire3_w1-gst-dev12_3060:3060):43488, 10.138.44.112(accessorgemfire5_w1-gst-dev12_25924:25924):56180,10.138.44.112(dataStoregemfire5_w1-gst-dev12_25904:25904):14125, 10.138.44.112(locatorgemfire1_w1-gst-dev12_50584:50584:locator):16639, 10.138.44.112(dataStoregemfire1_w1-gst-dev12_31492:31492):40311]]

[info 2014/07/26 02:03:00.437 PDT dataStoregemfire5_w1-gst-dev12_25904 <Pooled Waiting Message Processor 11> tid=0xb5] 
DeposePrimaryBucketMessage$DeposePrimaryBucketResponse wait for replies completed
```  

If the request is never satisfied (there is no corresponding ```wait for replies completed```), look at the stack dumps for the non-responsive member.  There could be a Java-level deadlock within that JVM.  

There can also be distributed deadlocks between members.  This requires following the ```15 seconds have elapsed``` warnings to the remote members and looking at the stack dumps.  Searching for ```waiting to lock``` in the stack dumps can also help to identify the problematic JVM.  Once found in a non-responsive member, find the thread in that JVM that holds the lock and determine what prevents it from releasing the lock.    

This example shows the outstanding request from the system log and the relevant stack dumps from the non-responding JVM.

The system log shows that vm 12659 is still awaiting a response from vm 12706  
```

dataStoregemfire1_12659/system.log:[warning 2014/03/04 08:18:27.954 PST dataStoregemfire1_w1-gst-dev25_12659 tid=0x60] 15 seconds have elapsed while waiting for replies: <ManageBucketMessage$NodeResponse 6407 waiting for 1 replies from [10.138.44.125(dataStoregemfire3_w1-gst-dev25_12706:12706):2225]> on 10.138.44.125(dataStoregemfire1_w1-gst-dev25_12659:12659):55872 whose current membership list is: [[10.138.44.125(dataStoregemfire2_w1-gst-dev25_12480:12480):24218, 10.138.44.125(locatorgemfire1_w1-gst-dev25_22478:22478:locator):13070, 10.138.44.125(accessorgemfire3_w1-gst-dev25_12501:12501):18675, 10.138.44.125(dataStoregemfire1_w1-gst-dev25_12659:12659):55872, 10.138.44.125(dataStoregemfire5_w1-gst-dev25_12483:12483):27445, 10.138.44.125(accessorgemfire1_w1-gst-dev25_12607:12607):46692, 10.138.44.125(dataStoregemfire4_w1-gst-dev25_12500:12500):48232, 10.138.44.125(accessorgemfire2_w1-gst-dev25_12705:12705):11732, 10.138.44.125(dataStoregemfire3_w1-gst-dev25_12706:12706):2225]] ```

The stack dumps from 12706 show the ```waiting to lock <monitor>``` and ```locked <monitor>``` in the stack dumps  
```

“Pooled Waiting Message Processor 12” daemon prio=10 tid=0xdf040000 nid=0x383f waiting for monitor entry [0xd960b000] java.lang.Thread.State: BLOCKED (on object monitor) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:424) - waiting to lock <0xe47a7fa8> (a com.gemstone.gemfire.internal.cache.ProxyBucketRegion) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2940) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1032) at com.gemstone.gemfire.internal.cache.partitioned.ManageBucketMessage.operateOnPartitionedRegion(ManageBucketMessage.java:123) at com.gemstone.gemfire.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:297) at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:357) at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:420) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:720) at com.gemstone.gemfire.distributed.internal.DistributionManager$6$1.run(DistributionManager.java:1057) at java.lang.Thread.run(Thread.java:662)

“Recovery thread for bucket _B__PR__1_46” daemon prio=10 tid=0xdc965400 nid=0x36dd in Object.wait() [0xdb3b7000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl$MembershipChangeListener.waitForChange(PersistenceAdvisorImpl.java:1047) - locked <0xe6223758> (a com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl$MembershipChangeListener) at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:820) at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47) at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1377) at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1196) at com.gemstone.gemfire.internal.cache.BucketRegion.initialize(BucketRegion.java:264) at com.gemstone.gemfire.internal.cache.LocalRegion.createSubregion(LocalRegion.java:1229) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:742) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:451) - locked <0xe47a7fa8> (a com.gemstone.gemfire.internal.cache.ProxyBucketRegion) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:301) at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2956) at com.gemstone.gemfire.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:417) at com.gemstone.gemfire.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:386) at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$5.run2(PRHARedundancyProvider.java:1934) at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$RecoveryRunnable.run(PRHARedundancyProvider.java:2243) at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$5.run(PRHARedundancyProvider.java:1926) at java.lang.Thread.run(Thread.java:662) ```

Special considerations for Geode clients

Geode clients can fail with ServerConnectivityExceptions when the servers are too busy to handle client requests. For example, with large GC pauses or distributed deadlocks.

Look for is being terminated because its client timeout messages in the server system logs and to determine whether or not this is occurring in your application. If so, review the server side system logs, stack dumps and statistics to determine the cause.

[warning 2015/03/14 06:00:15.062 PDT bridgegemfire_1_3_w1-gst-dev08_25525 <ClientHealthMonitor Thread> tid=0x54] 
Server connection from [identity(w1-gst-dev08(edgegemfire_1_1_w1-gst-dev08_25635:25635:loner):39198:65125d18:edgegemfire_1_1_w1-gst-dev08_25635,connection=1; port=49383] 
is being terminated because its client timeout of 30,000 has expired.

Improving traceability during initial development

Function Execution

To trace function execution from the initiator to the member executing the function, pass the initiating thread id to the function using withArgs and log in both. Of course, this could easily be a string containing the pid, DistributedMemberId or any other identifying information for your application.

    ArrayList aList = new ArrayList();
    aList.add("myFunctionOperation");
    aList.add(Thread.currentThread().getId());
    System.out.println("Executing " + aList.get(0) + " from " + aList.get(1));
    ResultCollector drc = dataSet.withFilter(keySet).withArgs(aList)
        .execute(myFunction.getId());
    }

Log these values within the function to help with tracing during development.

  public void execute(FunctionContext context) {
    ArrayList arguments = (ArrayList)(context.getArguments());
    String operation = (String)arguments.get(0);
    Object initiatingThreadID = arguments.get(1);
    String aStr = "In execute with context " + context + " and operation " + operation +
      " initiated by thread thr_" + initiatingThreadID + "_";
    for (int i = 2; i < arguments.size(); i++) {
      aStr = aStr + " additional arg: " + arguments.get(i);
    }
    System.out.println(aStr);

Events

Since Geode supports multiple CacheListeners, consider adding a LogListener which simply logs the relevant portion of the events as they are processed. This provides another way to enable traceability in your application during the early stages of development. For client/server applications, it can help to identify the originating member of an operation and the server that forwarded that event to a specific client.

If you are not using the CallbackArgument for your application, use the callbackArgument to encode information about the caller or the data, which you can log in your LogListener.

Events for operations initiated in the local JVM are logged by the calling thread as shown below. In this case vm_1_thr_10_edge4_w1-gst-dev18_10648.

[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Calling remove with key Object_395 value null, containsKey true, containsKeyOnServer true

[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Invoked util.SilenceListener for key Object_395: afterDestroy in edge4 event=EntryEventImpl[op=DESTROY;key=Object_395;oldValue=null;newValue=null;callbackArg=null;originRemote=false;originMember=w1-gst-dev18(edgegemfire4_w1-gst-dev18_10648:10648:loner):9766:1aa5f04e:edgegemfire4_w1-gst-dev18_10648;callbacksInvoked;version={v3; rv5; mbr=b90d31569c3243e8-8a2bcb43babe154a; ds=1; time=1401483850674; remote};id=EventID[threadID=2;sequenceID=0]]
     whereIWasRegistered: 10648
     event.getKey(): Object_395
     event.getOldValue(): null
     event.getNewValue(): null
     event.isLoad(): false
     event.isLocalLoad(): false
     event.isNetLoad(): false
     event.isNetSearch(): false
     event.isConcurrencyConflict(): false
     event.getDistributedMember(): w1-gst-dev18(edgegemfire4_w1-gst-dev18_10648:10648:loner):9766:1aa5f04e:edgegemfire4_w1-gst-dev18_10648
     event.getCallbackArgument(): null
     event.getRegion(): /testRegion
     event.isDistributed(): true
     event.isExpiration(): false
     event.isOriginRemote(): false
     Operation: DESTROY
     Operation.isLoad(): false
     Operation.isLocalLoad(): false
     Operation.isNetLoad(): false
     Operation.isNetSearch(): false
     Operation.isPutAll(): false
     Operation.isDistributed(): true
     Operation.isExpiration(): false

[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Done calling remove with key Object_395 value null, return value is true

Events fired in remote members are fired on asynchronous threads. In the case of clients, this asynchronous thread provides the identity of the server hosting the client's HARegionQueue. In this case bridgegemfire5_w1_gst_dev18_79056.

[info 2014/05/30 14:04:10.674 PDT <Cache Client Updater Thread  on w1-gst-dev18(bridgegemfire5_w1-gst-dev18_79056:79056)<v60>:3080 port 27043> tid=0x26] Invoked util.SilenceListener for key Object_395: afterDestroy in edge3 event=EntryEventImpl[op=DESTROY;key=Object_395;oldValue=null;newValue=null;callbackArg=null;originRemote=true;originMember=w1-gst-dev18(:loner):9766:1aa5f04e;callbacksInvoked;version={v3; rv5; mbr=b90d31569c3243e8-8a2bcb43babe154a; ds=1; time=1401483850674; remote};id=EventID[threadID=2;sequenceID=0];isFromServer]
     whereIWasRegistered: 53128
     event.getKey(): Object_395
     event.getOldValue(): null
     event.getNewValue(): null
     event.isLoad(): false
     event.isLocalLoad(): false
     event.isNetLoad(): false
     event.isNetSearch(): false
     event.isConcurrencyConflict(): false
     event.getDistributedMember(): w1-gst-dev18(:loner):9766:1aa5f04e
     event.getCallbackArgument(): null
     event.getRegion(): /testRegion
     event.isDistributed(): true
     event.isExpiration(): false
     event.isOriginRemote(): true
     Operation: DESTROY
     Operation.isLoad(): false
     Operation.isLocalLoad(): false
     Operation.isNetLoad(): false
     Operation.isNetSearch(): false
     Operation.isPutAll(): false
     Operation.isDistributed(): true
     Operation.isExpiration(): false