<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent"> | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.async.mt"> | |
<title>Monitoring, Tuning and Debugging</title> | |
<para> | |
UIMA AS deployments can involve many separate parts running on many | |
different machines. Monitoring facilities and tools built into UIMA AS help | |
in collecting information on the performance of these parts. You can | |
use the monitoring information to identify deployment issues, such as | |
bottlenecks, and address these with various approaches that alter the | |
deployment choices; this is what we mean by "tuning the deployment". | |
</para> | |
<para> | |
Monitoring happens in several parts: | |
<itemizedlist> | |
<listitem><para>Each node running a JVM hosting UIMA AS services or clients provides | |
JMX information tracking many items of interest.</para></listitem> | |
<listitem> | |
<para>UIMA AS services include some of these measurements in the information
passed back to their clients, along with the returned CAS. This allows
clients to collect and aggregate measurements over a cluster of remotely deployed
components.</para>
</listitem> | |
<!--listitem> | |
<para>UIMA AS includes a Monitor component that can optionally be turned on to | |
sample the JMX data at | |
a specified interval, and write the results into the UIMA log (or to the | |
console if no log is configured) in several formats, one of which is | |
convenient for reading, and the other is convenient for importing into | |
a spreadsheet program.</para> | |
</listitem--> | |
</itemizedlist> | |
</para> | |
<para>Tuning a UIMA AS application is done using several approaches: | |
<itemizedlist> | |
<listitem><para>changing the topology of the scaleout, for instance, allocating more
nodes to some parts and fewer to others</para></listitem>
<listitem> | |
<para>adjusting deployment parameters, such as the number of CASes in a CasPool, or | |
the number of threads assigned to do various tasks</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
<para> | |
In addition, tuning can involve changing the analytic algorithms
themselves, but that is beyond the scope of this chapter.
</para> | |
<para> | |
UIMA AS scale out configurations add multithreaded and out-of-order execution complexities to | |
core UIMA applications. Debugging a UIMA AS application is aided by UIMA's modular architecture | |
and an approach that exercises the code gradually from simpler to more complex configurations. | |
Two useful built-in debug features are: | |
<itemizedlist> | |
<listitem> | |
<para>Java errors at any component level are propagated back to the component originating | |
the request, with a full call chain of UIMA AS components, within | |
colocated aggregate components and across remote services which are | |
shared by multiple clients.</para> | |
</listitem> | |
<listitem> | |
<para>CASes can be saved before sending to any local or remote | |
delegate and later used to reproduce problems in a simple unit testing environment.</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
<section id="ugr.async.mt.monitoring"> | |
<title>Monitoring</title> | |
<section id="ugr.async.mt.jmx"> | |
<title>JMX</title> | |
<para>JMX (Java Management Extensions) is a standard Java mechanism that
is used to monitor and control Java applications. A standard tool
provided with most Java distributions, called
<code>jconsole</code>, is a GUI-based application that can connect to
a JVM, display the information JMX is providing, and control
the application in application-defined ways.</para>
<para>JMX information is provided by a hierarchy of JMX Beans. More | |
background and information on JMX and the jconsole tool is available on the web.</para> | |
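<para>As a minimal sketch of the kind of query <code>jconsole</code> performs, the following
Java program lists the JMX Beans registered in its own JVM and reads one attribute. It uses only
the standard <code>java.lang.management</code> and <code>javax.management</code> APIs. The class
name <code>ListJmxBeans</code> is hypothetical, and the bean it reads
(<code>java.lang:type=Runtime</code>) is a standard JVM bean chosen for illustration, not a
UIMA AS bean.</para>
<programlisting><![CDATA[import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical illustration class; not part of UIMA AS.
public class ListJmxBeans {

    /** Returns the sorted names of all MBeans in this JVM that match the pattern. */
    static Set<String> beanNames(String pattern) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<String> names = new TreeSet<>();
        for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    /** Reads a single attribute value from the named MBean. */
    static Object readAttribute(String beanName, String attribute) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return server.getAttribute(new ObjectName(beanName), attribute);
    }

    public static void main(String[] args) throws Exception {
        // "*:*" matches every domain and every key property list.
        for (String name : beanNames("*:*")) {
            System.out.println(name);
        }
        // Uptime is an instantaneous measure exposed by the standard Runtime bean.
        System.out.println("Uptime (ms) = " + readAttribute("java.lang:type=Runtime", "Uptime"));
    }
}]]></programlisting>
<para>Run inside a JVM that hosts a UIMA AS service or client, the same query would also
return the beans that UIMA AS registers there.</para>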
<!--para>This section will first describe the basic JMX Beans, and then | |
later describe a UIMA AS monitor tool that can sample the values of these beans at | |
a specified interval and write the results to the UIMA log in various | |
formats.</para--> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring"> | |
<title>JMX Information from UIMA AS</title> | |
<para>JMX information is provided by every UIMA AS service or client as it runs. | |
Each item provided is either an instantaneous measurement
(e.g., the number of items in a queue) or an accumulating measurement
(e.g., the number of CASes processed). Accumulating measures
can be reset to 0 using standard JMX mechanisms.</para>
<para> | |
JMX information is provided on a JVM basis; a JVM can be hosting 0 or more | |
UIMA AS Services and/or clients. A UIMA AS Service is defined as a component | |
that connects to a queue | |
and accepts CASes to process. A UIMA AS Client, in contrast, sends CASes to | |
be processed; it can be a top level client, or | |
a UIMA AS Service having one or more AS Aggregate delegates, to which it is | |
sending CASes to be processed. | |
</para> | |
<para> | |
UIMA AS Services send | |
some of their measurements back to the UIMA AS Clients that sent them CASes; those | |
clients incorporate these measurements into aggregate statistics that they provide. | |
This allows accumulating information among components deployed over many nodes | |
interconnected on a network. | |
</para> | |
<para> | |
Some JMX measurement items are constant, and document various settings, descriptors, | |
names, etc., in use by the (one or more) UIMA AS services and/or | |
clients running on this JVM.</para> | |
<para>Some time measurements are associated with running some process. These,
where possible, are cpu times, as measured by the thread or threads running the process, using the
<code>ThreadMXBean</code> interface. Some JVMs may not support thread-based cpu time, however. In that
case, wall-clock time is used instead.</para>
<para> | |
If the process is multi-threaded, and the cpu has multiple cores, | |
you can get time measurements which exceed the wall clock interval, due to the process consuming | |
cpu time on multiple threads at once.</para> | |
<para>Timing information not associated with running code, such as idle time, is measured as wall-clock time.</para> | |
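<para>The choice between cpu time and wall-clock time described above can be sketched as follows.
This is an illustration using the standard <code>java.lang.management.ThreadMXBean</code> API,
not the UIMA AS implementation; the class name <code>ThreadTimeSketch</code> and its methods
are hypothetical.</para>
<programlisting><![CDATA[import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration class; not part of UIMA AS.
public class ThreadTimeSketch {

    /** True if this JVM can report cpu time for the current thread. */
    static boolean cpuTimeAvailable() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.isCurrentThreadCpuTimeSupported() && threads.isThreadCpuTimeEnabled();
    }

    /** Current reading in nanoseconds: thread cpu time if available, else wall-clock. */
    static long now() {
        return cpuTimeAvailable()
            ? ManagementFactory.getThreadMXBean().getCurrentThreadCpuTime()
            : System.nanoTime();
    }

    /** Runs the task on the calling thread and returns the time it consumed, in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = now();
        task.run();
        return (now() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            double x = 0;
            for (int i = 1; i < 5_000_000; i++) {
                x += Math.sqrt(i);
            }
            if (x < 0) {
                System.out.println(x); // never true; keeps the loop from being optimized away
            }
        });
        System.out.println((cpuTimeAvailable() ? "cpu" : "wall-clock") + " time: " + ms + " ms");
    }
}]]></programlisting>
<para>Because each thread reports its own cpu time, summing the readings of several threads
that run concurrently can give a total larger than the elapsed wall-clock interval.</para>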
<para>The following sections describe the JMX Beans implemented by UIMA AS. The | |
Notes in the tables include the following flags: | |
<itemizedlist> | |
<listitem> | |
<para><emphasis role="bold">inst/acc/const</emphasis> - instantaneous, accumulating, or constant measurement</para> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold">sent</emphasis> - sent up to the invoking client with returning CAS</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
<section id="ugr.async.mt.jmx_monitoring.service"> | |
<title>UIMA AS Services JMX measures</title> | |
<para>The next 4 tables detail the JMX measures provided by UIMA AS services.</para> | |
<section id="ugr.async.mt.jmx_monitoring.constant.service"> | |
<title>Service information</title> | |
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1.5*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>state</entry> | |
<entry>The state of the service (Running, Initializing, Disabled, Stopping, Failed)</entry> | |
<entry>string</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>input queueName</entry> | |
<entry>The name of the input queue</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>reply queueName</entry> | |
<entry>The internally generated name of the reply queue</entry> | |
<entry>string</entry> | |
<entry>const (but could change due to reconnection recovery)</entry> | |
</row> | |
<row> | |
<entry>broker URL</entry> | |
<entry>The URL of the JMS queue broker</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>deployment descriptor</entry> | |
<entry>The path to the deployment descriptor for this service</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>is CAS Multiplier</entry> | |
<entry>is this Service a CAS Multiplier</entry> | |
<entry>boolean</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>is top level</entry> | |
<entry>is this Service a top level service, meaning that it connects to | |
an input queue on a queue broker</entry> | |
<entry>boolean</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>service key</entry> | |
<entry>The key name used in the associated Analysis Engine aggregate that specifies | |
this as a delegate</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>is Aggregate</entry> | |
<entry>is this service an AS Aggregate (i.e., has delegates and | |
is marked async="true")</entry> | |
<entry>boolean</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>analysisEngine instance count</entry> | |
<entry>The number of replications of the AS Primitive</entry> | |
<entry>count</entry> | |
<entry>const</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring.service.performance"> | |
<title>Service Performance Measurements</title> | |
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>number of CASes processed</entry> | |
<entry>The number of CASes processed by a component</entry> | |
<entry>count - CASes</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>cas deserialization time</entry> | |
<entry>The thread time spent deserializing CASes (receiving, either from client, or replies from delegates)</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>cas serialization time</entry> | |
<entry>The thread time spent serializing CASes (sending, either to delegates or back to client)</entry> | |
<entry>milli seconds</entry>
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>analysis time</entry> | |
<entry>The thread time spent in AS Primitive analytics</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>idle time</entry> | |
<entry>The wall clock time a service has been idle. Measurement starts
after a reply is sent and runs until the next request is received; it excludes
serialization/deserialization times.</entry>
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>cas pool wait time</entry> | |
<entry>The time spent waiting for a CAS to become available in the CAS Pool</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>shadow cas pool wait time</entry> | |
<entry>A shadow cas pool is established for services which are Cas Multipliers. | |
This is the time spent waiting for a CAS to become available in the Shadow CAS Pool.</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>time spent in CM getNext</entry> | |
<entry>The time spent inside Cas Multipliers, getting another CAS.
Whether this time includes the time spent waiting for a CAS to become
available in the CAS Pool is not specified here.</entry>
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>process thread count</entry> | |
<entry>The number of threads available to process requests (number | |
of instances of a primitive)</entry> | |
<entry>count</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>reply thread count</entry> | |
<entry>The number of threads available to process replies</entry> | |
<entry>count</entry> | |
<entry>const</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring.service.internal.queues"> | |
<title>Co-located Service Queues</title> | |
<para>Co-located services use light-weight, internal (not JMS) queues.
These have measures similar to those used with JMS queues, and include
the following measures for both the input queues and the reply (output) queues:
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>consumer count</entry> | |
<entry>The number of threads configured to read the queue</entry> | |
<entry>count</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>dequeue count</entry> | |
<entry>The number of CASes that have been read from this queue</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>queue size</entry> | |
<entry>The number of CASes in the queue</entry> | |
<entry>count</entry> | |
<entry>inst</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</para> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring.service.error"> | |
<title>Service Error Measurements</title> | |
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>process Errors</entry> | |
<entry>The number of process errors</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>getMetadata Errors</entry> | |
<entry>The number of getMetadata errors</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>cpc Errors</entry> | |
<entry>The number of Collection Process Complete (cpc) errors</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring.client"> | |
<title>Application Client information</title> | |
<para>This section describes monitoring | |
information provided by the UIMA AS Client APIs. | |
Any code that uses the <xref linkend="ugr.ref.async.api.organization"></xref>, | |
such as the example application | |
client <code>RunRemoteAsyncAE</code>, will have a set of these | |
JMX measures. Currently no additional
tooling (beyond standard tools like <code>jconsole</code>) is provided to
view these.
</para> | |
<section id="ugr.async.mt.jmx_monitoring.client.measures"> | |
<title>Client Measures</title> | |
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>application Name</entry> | |
<entry>A user-supplied string identifying the application</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>service queue name</entry> | |
<entry>The name of the service queue this client connects to</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>serialization method</entry> | |
<entry>either xmi or binary. This is the serialization the client will use to send | |
CASes to the service, and also tells the service which serialization to use | |
in sending the CASes back.</entry> | |
<entry>string</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>cas pool size</entry> | |
<entry>This client's cas pool size, limiting the number of simultaneous outstanding requests in process</entry> | |
<entry>count</entry> | |
<entry>const</entry> | |
</row> | |
<row> | |
<entry>total number of CASes processed</entry> | |
<entry>count of the total number of CASes sent from this client. Note: in the case | |
where the service is a Cas Multiplier, the "child" CASes are not included in this count.</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>total time to process</entry> | |
<entry>total thread time spent in processing all CASes, including time in remote delegates</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average process time</entry> | |
<entry>total time to process / total number of CASes processed</entry>
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max process time</entry> | |
<entry>maximum thread time spent in processing a CAS, including time in remote delegates</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total serialization time</entry> | |
<entry>total thread time spent in serializing, both to delegates | |
(and recursively, to their delegates) and replies back to senders</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average serialization time</entry> | |
<entry>average thread time spent in serializing a CAS, both to delegates | |
(and recursively, to their delegates) and replies back to senders</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max serialization time</entry> | |
<entry>maximum thread time spent in serializing a CAS, both to delegates | |
(and recursively, to their delegates) and replies back to senders</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total deserialization time</entry> | |
<entry>total thread time spent in deserializing, both replies from delegates and CASes from upper | |
level components being sent to lower level ones.</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average deserialization time</entry> | |
<entry>average thread time spent in deserializing, both replies from delegates and CASes from upper | |
level components being sent to lower level ones.</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max deserialization time</entry> | |
<entry>maximum thread time spent in deserializing, both replies from delegates and CASes from upper | |
level components being sent to lower level ones.</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total idle time</entry> | |
<entry>total wall clock time a top-level service thread has been idle since the thread was last used. | |
If there is more than one service thread, this number is the sum.</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average idle time</entry> | |
<entry>average wall clock time all top-level service threads have been idle since they were last used</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max idle time</entry> | |
<entry>maximum wall clock time a top-level service thread has been idle since the thread was last used</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total time waiting for reply</entry> | |
<entry>total wall clock time, measured from the time a CAS is sent to the top-level queue, until that CAS | |
is returned. Any generated CASes from Cas Multipliers are not counted in this measurement.</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average time waiting for reply</entry> | |
<entry>average wall clock time from the time a CAS is sent until the reply is received</entry>
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max time waiting for reply</entry> | |
<entry>maximum wall clock time from the time a CAS is sent until the reply is received</entry>
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total response latency time</entry> | |
<entry>total wall clock time, measured from the time a CAS is sent to the top-level queue, including | |
the serialization and deserialization times at the client, until that CAS | |
is returned. Any generated CASes from Cas Multipliers are not counted in this measurement.</entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average response latency time</entry> | |
<entry>average wall clock time, measured from the time a CAS is sent to the top-level queue, including | |
the serialization and deserialization times at the client, until that CAS | |
is returned.</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max response latency time</entry> | |
<entry>maximum wall clock time, measured from the time a CAS is sent to the top-level queue, including | |
the serialization and deserialization times at the client, until that CAS | |
is returned.</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total time waiting for CAS</entry> | |
<entry>total wall-clock time spent waiting for a | |
free CAS to be available in the client's CAS pool, before | |
sending the CAS to input queue for the top level service. </entry> | |
<entry>milli seconds</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>average time waiting for CAS</entry> | |
<entry>average wall-clock time spent waiting for a | |
free CAS to be available in the client's CAS pool</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>max time waiting for CAS</entry> | |
<entry>maximum wall-clock time spent waiting for a | |
free CAS to be available in the client's CAS pool</entry> | |
<entry>milli seconds</entry> | |
<entry>inst</entry> | |
</row> | |
<row> | |
<entry>total number of CASes requested</entry> | |
<entry>total number of CASes fetched from the CAS pool</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
<section id="ugr.async.mt.jmx_monitoring.client.error"> | |
<title>Client Error Measurements</title> | |
<informaltable frame="all"> | |
<tgroup cols="4" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<colspec colname="c4" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry> | |
<entry align="center">Notes</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>getMeta Timeout Error Count</entry> | |
<entry>number of times a getMeta timed out</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>getMeta Error Count</entry> | |
<entry>number of times a getMeta request returned with an error</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>process Timeout Error Count</entry> | |
<entry>number of times a process call timed out</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
<row> | |
<entry>process Error Count</entry> | |
<entry>number of times a process call returned with an error</entry> | |
<entry>count</entry> | |
<entry>acc</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.async.mt.jmx_sampling"> | |
<title>Logging Sampled JMX information at intervals</title> | |
<para> | |
A common tuning procedure is to run a deployment for a fairly long time with a | |
typical load, and to see what and where hot spots develop. During this process, | |
it is sometimes useful to convert accumulating measurements into averages, perhaps | |
averages per CAS processed. | |
</para> | |
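<para>As a sketch of this conversion, the following Java code takes two samples of a pair of
accumulating measures and derives the average analysis time per CAS for the interval between
them. The class <code>IntervalAverages</code>, its nested <code>Sample</code> type, and the
sample values are hypothetical and serve only to illustrate the arithmetic.</para>
<programlisting><![CDATA[// Hypothetical illustration class; not part of UIMA AS.
public class IntervalAverages {

    /** One reading of two accumulating measures taken at a point in time. */
    static final class Sample {
        final long casesProcessed; // accumulating count of CASes
        final long analysisMillis; // accumulating analysis time, in milliseconds

        Sample(long casesProcessed, long analysisMillis) {
            this.casesProcessed = casesProcessed;
            this.analysisMillis = analysisMillis;
        }
    }

    /** Average analysis time per CAS between two samples; 0 if no CAS completed. */
    static double avgMillisPerCas(Sample earlier, Sample later) {
        long cases = later.casesProcessed - earlier.casesProcessed;
        long millis = later.analysisMillis - earlier.analysisMillis;
        return cases <= 0 ? 0.0 : (double) millis / cases;
    }

    public static void main(String[] args) {
        Sample first = new Sample(100, 4000);
        Sample second = new Sample(150, 7000);
        // 50 CASes and 3000 ms of analysis in the interval: prints 60.0
        System.out.println(avgMillisPerCas(first, second));
    }
}]]></programlisting>
<para>Working from differences between samples, not from the running totals, keeps an early
warm-up phase from masking a hot spot that develops later in the run.</para>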
<para> | |
UIMA AS includes a monitor component, org.apache.uima.aae.jmx.monitor.JmxMonitor, | |
to sample JMX measures at specified intervals, | |
compute various averages, and write the results into the UIMA Log (or on the console | |
if no log is configured). The monitor program can be enabled automatically for any deployed service
by specifying <code>-D</code> parameters on the JVM command
line that launches the service, or it can be run stand-alone. When run stand-alone, you provide an
argument specifying the JVM from which it obtains the JMX information. It connects
to only one JVM per run; typically, you would connect it to the top-level service.
</para> | |
<para> | |
The monitor outputs information for that service and its immediate delegates (local or remote); however, it | |
includes information from the complete recursive chain of delegates when computing its measures. You can | |
get detailed monitoring for sub-services by starting or attaching a monitor to those sub-services. | |
</para> | |
<para> | |
ActiveMQ uses Queue Brokers to manage the JMS queues used by UIMA AS. These brokers have JMX information | |
that is useful in tuning applications. The Monitor program identifies the Queue Broker being used by the | |
service, and connects to it and incorporates information about queue lengths (both the input queue | |
and the reply queue) into its measurements. | |
</para> | |
<section id="ugr.async.mt.jmx_sampling.configuring"> | |
<title>Configuring JVM to run the monitor</title> | |
<para>Specify the following JVM System Variable parameters to configure a UIMA AS Client or Service to enable | |
sampling and logging of JMX measures: | |
<itemizedlist> | |
<listitem><para><code>-Duima.jmx.monitor.interval=1000</code> - (default is 1000) specifies the | |
sampling interval in milliseconds</para></listitem> | |
<listitem><para><code>-Duima.jmx.monitor.formatter=&lt;CustomFormatterClassName&gt;</code></para></listitem>
<listitem><para><code>-Dcom.sun.management.jmxremote</code> - enable JMX (only needed for local monitoring, not needed if port is specified)</para></listitem> | |
<listitem><para><code>-Dcom.sun.management.jmxremote.port=8009</code></para></listitem> | |
<listitem><para><code>-Dcom.sun.management.jmxremote.authenticate=false</code></para></listitem> | |
<listitem><para><code>-Dcom.sun.management.jmxremote.ssl=false</code></para></listitem> | |
</itemizedlist> | |
This configures JMX to run on port 8009 with no authentication and no SSL, sets the sampling
interval to 1 second, and specifies a custom formatter class name.
</para> | |
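<para>Put together, a launch command using these parameters might look like the following
sketch. The shell variables <code>SERVICE_CLASSPATH</code> and
<code>SERVICE_MAIN_AND_ARGS</code> are hypothetical placeholders for whatever classpath, main
class, and arguments your deployment already uses; the formatter shown is one of the classes
shipped with UIMA AS.</para>
<programlisting><![CDATA[# Sketch only: SERVICE_CLASSPATH and SERVICE_MAIN_AND_ARGS are placeholders.
java -Dcom.sun.management.jmxremote.port=8009 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -Duima.jmx.monitor.interval=1000 \
     -Duima.jmx.monitor.formatter=org.apache.uima.aae.jmx.monitor.SingleLineUimaJmxMonitorListener \
     -cp "$SERVICE_CLASSPATH" $SERVICE_MAIN_AND_ARGS]]></programlisting>
<para>Turning off authentication and SSL is convenient on a trusted test network, but it leaves
the JMX port open to anyone who can reach it.</para>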
<para>There are two <code>formatter-classes</code> provided with UIMA AS: | |
<itemizedlist> | |
<listitem><para><code>org.apache.uima.aae.jmx.monitor.BasicUimaJmxMonitorListener</code> -
this is a multi-line formatter that formats for human-readable output</para></listitem>
<listitem><para><code>org.apache.uima.aae.jmx.monitor.SingleLineUimaJmxMonitorListener</code> -
this is a formatter that produces one line per interval, suitable for importing into
a spreadsheet program.</para></listitem>
</itemizedlist> | |
Both of these log to the UIMA log at the INFO log level. | |
</para> | |
<para>You can also write your own formatter. The monitor provides an API to plug in a custom formatter | |
for displaying service metrics. A custom formatter must implement the <code>JmxMonitorListener</code> interface.
See the method <code>startMonitor</code> in the class <code>UIMA_Service</code> for an | |
example of how custom JMX Listeners are plugged into the monitor. | |
</para> | |
</section> | |
<section id="ugr.async.mt.jmx_sampling.standalone"> | |
<title>Running the Monitor program standalone</title> | |
<para>The monitor program can be started separately and pointed to a running UIMA AS Client or Service. | |
To start the program, invoke Java with the following classpath and parameters: | |
<itemizedlist> | |
<listitem> | |
<para>ClassPath:</para> | |
<itemizedlist> | |
<listitem><para>%UIMA_HOME%/lib/uimaj-as-activemq.jar</para></listitem> | |
<listitem><para>%UIMA_HOME%/lib/uimaj-as-core.jar</para></listitem> | |
<listitem><para>%UIMA_HOME%/lib/uima-core.jar</para></listitem> | |
<listitem><para>%UIMA_HOME%/apache-activemq/activemq-all-5.6.0.jar</para></listitem> | |
</itemizedlist> | |
</listitem> | |
<listitem> | |
<para>Parameters:</para> | |
<itemizedlist> | |
<listitem><para><code>-Djava.util.logging.config.file=%UIMA_HOME%/config/MonitorLogger.properties</code>
- specifies the logging configuration file, which determines where the information is written</para></listitem>
<listitem><para><code>org.apache.uima.aae.jmx.monitor.JmxMonitor</code> - | |
the class whose main method is invoked</para></listitem> | |
<listitem><para><code>uri</code> - the URI of the jmx instance to monitor.</para></listitem> | |
<listitem><para><code>interval</code> - the (optional) | |
sampling interval, in milliseconds (default = 1000)</para></listitem> | |
</itemizedlist> | |
</listitem> | |
</itemizedlist> | |
</para> | |
<para>When run in this manner, it is not (currently) possible to specify the | |
log message formatting class; the multi-line output format is always used.</para> | |
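<para>A standalone invocation assembled from the classpath and parameters listed above might
look like the following sketch, written for a Unix-style shell with <code>UIMA_HOME</code> set.
The JMX URI shown is an assumption: it uses the standard JMX service URL form for a JVM
listening on port 8009 of the local host, and the exact form the monitor expects should be
checked against your installation.</para>
<programlisting><![CDATA[# Sketch only: the JMX URI below is an assumed example value.
CP="$UIMA_HOME/lib/uimaj-as-activemq.jar:$UIMA_HOME/lib/uimaj-as-core.jar"
CP="$CP:$UIMA_HOME/lib/uima-core.jar:$UIMA_HOME/apache-activemq/activemq-all-5.6.0.jar"

java -cp "$CP" \
     -Djava.util.logging.config.file="$UIMA_HOME/config/MonitorLogger.properties" \
     org.apache.uima.aae.jmx.monitor.JmxMonitor \
     "service:jmx:rmi:///jndi/rmi://localhost:8009/jmxrmi" 1000]]></programlisting>
<para>The trailing <code>1000</code> is the optional sampling interval in milliseconds; omitting
it leaves the default of 1000 in effect.</para>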
</section> | |
<section id="ugr.async.mt.jmx_sampling.output"> | |
<title>Monitoring output</title> | |
<para>The monitoring program combines information from the JMX measures, including the associated | |
Queue Broker, sampling accumulating measurements at the specified sampling interval, and produces | |
the following outputs: | |
<informaltable frame="all"> | |
<tgroup cols="3" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="5*"/> | |
<colspec colname="c3" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
<entry align="center">Units</entry>
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>Input queue depth</entry> | |
<entry>number of CASes waiting to be processed by a service</entry> | |
<entry>count</entry> | |
</row> | |
<row> | |
<entry>Reply queue depth</entry> | |
<entry>number of CASes returned to the client but not yet picked up by the client</entry> | |
<entry>count</entry> | |
</row> | |
<row> | |
<entry>CASes processed in interval</entry> | |
<entry>Number of CASes processed in this sampling interval</entry> | |
<entry>count</entry> | |
</row> | |
<row> | |
<entry>Idle time in interval</entry> | |
<entry>The total time this service has been idle during this interval</entry> | |
<entry>milliseconds</entry>
</row> | |
<row> | |
<entry>Analysis time in interval</entry> | |
<entry>The sum of the times spent in analysis by the service during this interval, | |
including analysis time spent in delegates, recursively</entry> | |
<entry>milliseconds</entry>
</row> | |
<row> | |
<entry>Cas Pool free Cas Count</entry> | |
<entry>Number of available CASes in the Cas Pool at the end of the interval</entry> | |
<entry>count</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</para> | |
<para>In addition to the performance metrics the monitor also provides basic service information: | |
<itemizedlist> | |
<listitem> | |
<para>Service name</para> | |
</listitem> | |
<listitem> | |
<para>Is service top level</para> | |
</listitem> | |
<listitem> | |
<para>Is service remote</para> | |
</listitem> | |
<listitem> | |
<para>Is service a cas multiplier</para> | |
</listitem> | |
<listitem> | |
<para>Number of processing threads</para> | |
</listitem> | |
<listitem> | |
<para>Service uptime (milliseconds)</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
</section> | |
<section id="ugr.async.mt.jmx_disable"> | |
<title>Disabling JMX in UIMA-AS</title> | |
<para>When opening a JMX RMI port is not possible because of security concerns, UIMA AS can be started
without JMX support. To disable JMX, add the following JVM argument: <code>-Duima.as.enable.jmx=false</code>
</para>
</section> | |
<section id="ugr.async.mt.tuning"> | |
<title>Tuning</title> | |
<section id="ugr.async.mt.tuning.approach"> | |
<title>Tuning procedure</title> | |
<para>This section is a cookbook of best practices for tuning a UIMA AS deployment. The summary information | |
provided by the Monitor program is used to guide the tuning.</para> | |
<para>The main metric for detecting an overloaded service is the input queue depth. If it is growing or high, the service
is not able to keep up with the load: more CASes are arriving at the queue than the service can process.
Consider increasing the number of instances of the service within the JVM (if it runs on a multi-core machine with
spare capacity), or deploying additional instances of the service.</para>
<para>The main metric for detecting an idle service is the idle time. If it is high, it can indicate that the service is not
receiving enough CASes. This can be caused by a bottleneck in the service's client; supporting evidence for this | |
can be a high reply queue depth for the client - indicating the client is overloaded. | |
If the idle time is zero, the service may be saturated; adding more instances could | |
relieve a bottleneck.</para> | |
<para>A CasPool free Cas Count of 0 can point to a bottleneck in a service's client; supporting | |
evidence for this can be a high idle time. In this case, the service does not have enough CASes in its pool and is | |
forced to wait. Remember that a CAS is not returned to the Service's CAS pool until the client | |
(which can be a parent asynchronous aggregate) signals it can be. | |
A typical reason is a slow client (look for evidence such as a high reply queue depth). Consider
increasing the size of the service's CAS pool, and check the client's metrics to determine why it is slow.</para>
<para>An asynchronous system must have something that limits the generation of | |
new work to do. CasPools are the mechanism used by UIMA AS to do this. | |
Also, because CASes can have large memory requirements, it is | |
important to limit the number and sizes of CASes in a process.</para> | |
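<para>The rules of thumb above can be written down as a small decision routine. The following is a minimal sketch, not part of UIMA AS: the class <code>TuningAdvisor</code>, the threshold values, and the use of idle time divided by the sampling interval as a measure of "high" idle time are all assumptions chosen for illustration.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns one monitor sample into tuning hints.
final class TuningAdvisor {
    // Assumed thresholds; pick values that suit your own deployment.
    static final long HIGH_QUEUE_DEPTH = 10;
    static final double HIGH_IDLE_FRACTION = 0.8;

    /** Maps one sample (plus the previous input queue depth) to a list of hints. */
    static List<String> advise(long prevInputDepth, long inputDepth, long replyDepth,
                               long idleMillis, long intervalMillis, long freeCasCount) {
        List<String> hints = new ArrayList<>();
        // Input queue high or growing: the service cannot keep up with the load.
        if (inputDepth >= HIGH_QUEUE_DEPTH || inputDepth > prevInputDepth) {
            hints.add("OVERLOADED: add service instances or pipeline threads");
        }
        if (idleMillis == 0) {
            // No idle time at all: the service may be saturated.
            hints.add("SATURATED: more instances could relieve a bottleneck");
        } else if (intervalMillis > 0
                && (double) idleMillis / intervalMillis >= HIGH_IDLE_FRACTION) {
            // Mostly idle: the service is probably not receiving enough CASes.
            hints.add("STARVED: check the client for a bottleneck");
        }
        // Empty CAS pool: the service is waiting for CASes to be released.
        if (freeCasCount == 0) {
            hints.add("CAS_POOL_EMPTY: consider a larger CAS pool and check the client");
        }
        // High reply queue depth is supporting evidence for a slow client.
        if (replyDepth >= HIGH_QUEUE_DEPTH) {
            hints.add("SLOW_CLIENT: replies are not being picked up quickly");
        }
        return hints;
    }

    public static void main(String[] args) {
        advise(2, 25, 0, 0, 1000, 3).forEach(System.out::println);
    }
}
```
]]></programlisting>
<para>The hints only restate the guidance in this section; they are starting points for investigation rather than definitive diagnoses.</para>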
</section> | |
<section id="ugr.async.mt.tuning.settings"> | |
<title>Tuning Settings</title> | |
<para>This section lists the tuning parameters and describes what each one does and how they interact.</para>
<informaltable frame="all"> | |
<tgroup cols="2" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="2*"/> | |
<colspec colname="c2" colwidth="4*"/> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Description</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>number of service instances</entry> | |
<entry>You can adjust the number of service processes assigned to a particular service, | |
even dynamically, by just starting / stopping additional servers that specify | |
the same input queue.</entry> | |
</row> | |
<row> | |
<entry>number of pipeline threads in a service instance</entry> | |
<entry>Similar to the number of service processes above, this | |
specifies replication of an AS Primitive within each JVM process. This allows multiple | |
processing threads to share large in-heap objects and thus utilize more CPU in multi-core machines | |
without running out of RAM.</entry> | |
</row> | |
<row> | |
<entry>CAS pool size</entry> | |
<entry>This size limits the number of CASes being processed asynchronously.</entry> | |
</row> | |
<row> | |
<entry>casMultiplier poolSize</entry> | |
<entry>This size limits the number of CASes generated by a CAS Multiplier that are being processed asynchronously.</entry> | |
</row> | |
<row> | |
<entry>service instance warm up</entry> | |
<entry>Allow warm up of each JVM process before attaching to the service input queue by processing | |
a collection of input CASes. | |
This feature is enabled by specifying -DWarmUpDataPath=<code>zipFile</code>, where the | |
zipFile contains CASes in Xmi or compressed binary formats. | |
If the zipFile name ends in '.zip' the name is assumed to be a fully rooted or relative file path, | |
else the name is assumed to be in the classpath with '.zip' appended. | |
For compressed binary, the first entry in the zipfile must be typesystem.xml containing | |
the full typesystem of the serialized CASes.</entry> | |
</row> | |
<row> | |
<entry>Service input queue prefetch</entry> | |
<entry>If set greater than 0, allows up to "n" CASes to be pulled into one service provider, at a time. | |
This can increase throughput, but can hurt latency, since one service may have several CASes pulled into it, | |
queued up, while another instance of the service could be "starved" and be sitting there idle. </entry> | |
</row> | |
<row> | |
<entry>Specifying async="true"/"false" on an aggregate</entry> | |
<entry>The default is false, because there is less overhead (no queues are set up, etc.). Setting this to | |
"true" allows multiple CASes to flow simultaneously in the aggregate.</entry> | |
</row> | |
<row> | |
<entry>remoteReplyQueueScaleout</entry> | |
<entry>This parameter indicates the number of threads that will be deployed to read from the remote reply queue. | |
Set to > 1 if deserialization time of replies is a bottleneck.</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
</section> | |
<section id="ugr.async.mt.debugging"> | |
<title>Debugging</title> | |
<para>One of the strongest UIMA features is the ability to develop and
debug components in isolation from each other, and then to incrementally
combine components and scale out complexity. All that is needed to exercise
each configuration is one or more appropriate input CASes.
</para> | |
<para>It is strongly advised to first test UIMA components in the core UIMA | |
environment with a variety of input CASes. If the entire | |
application will not fit in a single process, deploy remote delegates as | |
UIMA AS primitives with only a single instance | |
(see <xref linkend="ugr.async.ov.concepts.deploying.multiples"></xref>), | |
and access them via JMS service descriptors | |
(see <xref linkend="ugr.async.ov.concepts.jms_descriptor"/>). | |
Run as much input data through this "single-threaded" configuration as needed
to eliminate most "algorithmic" errors and to measure performance against | |
analysis time objectives. Thread safety and analysis | |
ordering issues can then be addressed separately.</para> | |
<para><emphasis role="bold">Thread safety bugs.</emphasis> Components intended to be run | |
multi-threaded should first be deployed as a multiple instance UIMA AS service | |
(again see <xref linkend="ugr.async.ov.concepts.deploying.multiples"></xref>), | |
and fed their input CASes with a driver | |
capable of keeping all instances busy at the same time. A good choice
is the sample driver $UIMA_HOME/bin/runRemoteAsyncAE; use the -p argument
to increase the number of outstanding CAS requests sent to the target service.
When looking for threading problems, try a static analysis tool such as FindBugs (http://findbugs.sourceforge.net/).
In addition to looking for exceptions caused by thread-unsafe code, check that
the single-threaded and multi-threaded analysis results are the same.
</para> | |
<para><emphasis role="bold">Analysis ordering bugs.</emphasis> | |
In a core UIMA aggregate, CASes are processed by each delegate in input order.
This relationship changes for the same aggregate deployed asynchronously if one of the delegates
is replicated, because CASes are processed in parallel and can then progress through the subsequent aggregate
flow in a different order than they were received.
Similarly, with a delegate CasMultiplier in a core UIMA aggregate, each child CAS is processed
to completion before the next child CAS is started, and the parent CAS is processed last.
When running asynchronously the parent CAS | |
can arrive at downstream components ahead of its children because the parent | |
is released from a CasMultiplier immediately after the last child is created. | |
For applications which require all children to be processed before their parent, | |
use the processParentLast flag (see <xref linkend="ugr.ref.async.deploy.descriptor.ae"></xref>). | |
</para> | |
<para><emphasis role="bold">Timing issues.</emphasis> | |
Invariably with complex analytics, some components will be slower and some artifacts | |
will take longer to process than desired. Making performance improvements relies on | |
identifying components running slower than expected and capturing the slow-running artifacts | |
to study in detail. | |
</para> | |
<section id="ugr.async.mt.debugging.tracing"> | |
<title>Error Reporting and Tracing</title> | |
<para>After the system is scaled out and substantially more data is being processed | |
it is likely that additional errors will occur. | |
</para> | |
<para>Java errors at any component level | |
are propagated back to the component originating the request | |
(unless suppressed by UIMA AS error handling options, | |
see <xref linkend="ugr.async.eh.error_handling_overview"></xref>). | |
The error stack traces the call chain of UIMA AS components, within | |
colocated aggregate components and across remote services which are | |
shared by multiple clients. Some errors can be resolved with this | |
information alone. | |
</para> | |
<para>If process timeouts are not used | |
(see <xref linkend="ugr.ref.async.deploy.descriptor.errorconfig"></xref>) | |
an asynchronous system can hang if one analysis step somewhere in | |
the system has hung. Given many CASes in process at the same time it can | |
be useful to create a custom trace of CAS activity by appropriate logging | |
in <emphasis role="bold">a custom flow controller</emphasis>. | |
Such logging relies on a unique identifier in every CAS,
usually a singleton FeatureStructure with a unique String feature. Identifiers | |
for child CASes should include some reference to the CasMultiplier they were | |
created from as well as their parent CAS. | |
</para> | |
<para>The flow controller is also the ideal place to measure timing statistics | |
for components of interest. Global statistics can easily be measured using the
time between flow steps, and time
thresholds can be used to flag specific CASes causing problems. Again the unique
CAS identifier can be quite useful here. | |
</para> | |
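<para>The bookkeeping described above can be sketched independently of the UIMA flow controller API. In the following minimal sketch, the class <code>CasFlowTrace</code>, the identifier layout <code>parent/multiplierKey#n</code>, and the threshold handling are hypothetical; a custom flow controller would call <code>step</code> each time it routes a CAS, passing the CAS's unique identifier and the current time.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trace helper; it does not use any UIMA classes.
final class CasFlowTrace {
    private final long thresholdMillis;
    private final Map<String, Long> lastStepAt = new HashMap<>();
    private final List<String> slow = new ArrayList<>();

    CasFlowTrace(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Child identifier that names both the parent CAS and the creating CasMultiplier. */
    static String childId(String parentId, String casMultiplierKey, int sequence) {
        return parentId + "/" + casMultiplierKey + "#" + sequence;
    }

    /**
     * Records that casId reached the given flow step at nowMillis.
     * Returns the milliseconds since its previous step, or -1 for its first step.
     * Synchronized because several threads may route CASes concurrently.
     */
    synchronized long step(String casId, String stepName, long nowMillis) {
        Long previous = lastStepAt.put(casId, nowMillis);
        if (previous == null) {
            return -1;
        }
        long elapsed = nowMillis - previous;
        if (elapsed > thresholdMillis) {
            // Flag the CAS so the slow artifact can be captured and studied later.
            slow.add(casId + " before " + stepName + ": " + elapsed + " ms");
        }
        return elapsed;
    }

    /** Snapshot of the CASes flagged so far. */
    synchronized List<String> slowCases() {
        return new ArrayList<>(slow);
    }

    public static void main(String[] args) {
        CasFlowTrace trace = new CasFlowTrace(500);
        String child = childId("doc42", "Segmenter", 3);
        trace.step(child, "Tokenizer", 1000);
        trace.step(child, "Parser", 1800);
        trace.slowCases().forEach(System.out::println);
    }
}
```
]]></programlisting>
<para>Passing the timestamp in as a parameter keeps the helper easy to test; in a running system the caller would typically supply <code>System.currentTimeMillis()</code>.</para>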
</section> | |
<section id="ugr.async.mt.debugging.caslogging"> | |
<title>CAS Logging</title> | |
<para>Within a UIMA AS asynchronous aggregate, CASes can be saved before sending to any local or remote | |
delegate and later used to reproduce a problem in a simple unit testing environment. | |
Logging is currently not supported for delegates of a UIMA aggregate deployed synchronously. | |
</para> | |
<para>CASes are stored | |
in XmiCas format in a separate directory for each delegate with logging enabled. | |
Along with the CAS files in each directory is a file "typesystem.xml" containing the complete CAS type system. | |
A delegate's directory name is the full delegate context with '/' chars converted to '-'. | |
If not specified, the base CAS logging directory is the process current directory. | |
By default the name of each CAS file is the time in milliseconds after CAS logging begins | |
for a delegate. If a string exists in the CAS that should be used as the file name, | |
it can be specified by {type, feature} found in a specific view. | |
</para> | |
<para>A list of delegates to enable for CAS logging can be specified as a Java property. | |
Logging can also be enabled/disabled dynamically via JMX. CAS logging control is | |
enabled in the "Annotator_Service Info" bean for each asynchronous delegate.
</para> | |
<para> | |
Java properties used for CAS logging: | |
</para> | |
<informaltable frame="all"> | |
<tgroup cols="2" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="1*"/> | |
<colspec colname="c2" colwidth="1*"/> | |
<thead> | |
<row> | |
<entry align="center">Property</entry> | |
<entry align="center">Description</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>UIMA_CASLOG_BASE_DIRECTORY</entry> | |
<entry>optional; this is the directory under which sub-directories with | |
XmiCas files will be created. If not specified, the process's current directory | |
will be the base.</entry> | |
</row> | |
<row> | |
<entry>UIMA_CASLOG_COMPONENT_ARRAY</entry> | |
<entry>This is a space-separated list of delegate keys. If a
delegate is nested inside a co-located async aggregate, the name would include the key | |
name of the aggregate, e.g. "someAggName/someDelName". The XmiCas files will then be | |
written into directory $UIMA_CASLOG_BASE_DIRECTORY/someAggName-someDelName. | |
Note that delegates for the top level aggregate do not require an aggregate name context. | |
</entry> | |
</row> | |
<row> | |
<entry>UIMA_CASLOG_TYPE_NAME</entry> | |
<entry>optional; this is the name of a FeatureStructure in the CAS | |
containing a unique string to use to name each XmiCas file. If not specified, the XmiCas
file name will be N.xmi, where N is the time in microseconds since the component was
initialized.</entry>
</row> | |
<row> | |
<entry>UIMA_CASLOG_FEATURE_NAME</entry> | |
<entry>optional unless the TYPE_NAME is specified; this parameter | |
gives the string feature to use. If the string value contains one or more | |
"/" characters only the text after the last "/" will be used.</entry> | |
</row> | |
<row> | |
<entry>UIMA_CASLOG_VIEW_NAME</entry> | |
<entry>optional; if the TYPE_NAME and FEATURE_NAME parameters are specified | |
this string selects the CAS view used to access the FeatureStructure with | |
unique string feature.</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
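<para>The naming rules in the table can be illustrated with plain string handling. The following is a minimal sketch of those rules as described here, not the UIMA AS implementation: the class <code>CasLogNaming</code> and its method names are hypothetical, and any file extension added to a feature-derived name is deliberately left out because it is not specified above.</para>
<programlisting><![CDATA[
```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper mirroring the CAS logging naming rules described in the table.
final class CasLogNaming {

    /** Directory for a delegate: base directory plus the context with '/' replaced by '-'. */
    static Path delegateDirectory(String baseDir, String delegateContext) {
        // With no base directory, the process's current directory is the base.
        String base = (baseDir == null || baseDir.isEmpty()) ? "." : baseDir;
        return Paths.get(base, delegateContext.replace('/', '-'));
    }

    /** File name derived from a string feature: only the text after the last '/' is used. */
    static String baseNameFromFeature(String featureValue) {
        // lastIndexOf returns -1 when there is no '/', so the whole value is kept.
        return featureValue.substring(featureValue.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(delegateDirectory("caslogs", "someAggName/someDelName"));
        System.out.println(baseNameFromFeature("file:/data/input/doc1.txt"));
    }
}
```
]]></programlisting>
<para>The properties themselves are passed to the JVM in the usual way, for example <code>-DUIMA_CASLOG_COMPONENT_ARRAY="someAggName/someDelName"</code>.</para>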
</section> | |
</section> | |
<!--section id="ugr.async.mt.limits"> | |
<title>Limitations</title> | |
<para>The current (2.3.0) implementation has the following limitations: | |
<itemizedlist> | |
<listitem><para>Monitoring program</para> | |
<itemizedlist> | |
<listitem><para>The monitoring program reads the JMS Queue Broker URL | |
from the configuration information provided by JMX for the UIMA AS Service | |
being monitored. It uses this information to connect to JMX on that broker, but | |
currently assumes that JMX is set up on the default port (1099). This is | |
currently hardcoded into the Monitor program, so be aware of this if you | |
change the port number for JMX on the JMS Queue Broker (a parameter in | |
ActiveMQ's configuration for the broker). | |
</para></listitem> | |
<listitem><para>When the Monitor program is run as a stand-alone program, | |
it is not (currently) possible to specify alternatives for the | |
log message formatting class; the multi-line output format is always used.</para></listitem> | |
</itemizedlist> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section--> | |
</chapter> |