<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tutorials_and_users_guides/tug.application/">
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent">
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tug.application">
<title>Application Developer&apos;s Guide</title>
<para>This chapter describes how to develop an application using the Unstructured Information Management
Architecture (UIMA). The term <emphasis>application</emphasis> describes a program that provides end-user
functionality. A UIMA application incorporates one or more UIMA components such as Analysis Engines,
Collection Processing Engines, a Search Engine, and/or a Document Store and adds application-specific logic
and user interfaces.</para>
<section id="ugr.tug.appication.uimaframework_class">
<title>The UIMAFramework Class</title>
<para>An application developer's starting point for accessing UIMA framework functionality is the
<literal>org.apache.uima.UIMAFramework</literal> class. The following is a short introduction to some
important methods on this class. Several of these methods are used in examples in the rest of this chapter. For
more details, see the Javadocs (in the docs/api directory of the UIMA SDK).
<itemizedlist>
<listitem>
<para>UIMAFramework.getXMLParser(): Returns an instance of the UIMA XML Parser class, which then can be
used to parse the various types of UIMA component descriptors. Examples of this can be found in the
remainder of this chapter.</para>
</listitem>
<listitem>
<para>UIMAFramework.produceXXX(ResourceSpecifier): There are various produce methods that are used
to create different types of UIMA components from their descriptors. The argument type,
ResourceSpecifier, is the base interface that subsumes all types of component descriptors in UIMA. You
can get a ResourceSpecifier from the XMLParser. Examples of produce methods are:
<itemizedlist>
<listitem>
<para>produceAnalysisEngine</para>
</listitem>
<listitem>
<para>produceCasConsumer</para>
</listitem>
<listitem>
<para>produceCasInitializer</para>
</listitem>
<listitem>
<para>produceCollectionProcessingEngine</para>
</listitem>
<listitem>
<para>produceCollectionReader</para>
</listitem>
</itemizedlist>
There are other variations of each of these methods that take additional, optional arguments. See the
Javadocs for details. </para>
</listitem>
<listitem>
<para>UIMAFramework.getLogger(&lt;optional-logger-name&gt;): Gets a reference to the UIMA Logger,
to which you can write log messages. If no logger name is passed, the name of the returned logger instance
is <quote>org.apache.uima</quote>.</para>
</listitem>
<listitem>
<para>UIMAFramework.getVersionString(): Gets the number of the UIMA version you are using.</para>
</listitem>
<listitem>
<para>UIMAFramework.newDefaultResourceManager(): Gets an instance of the UIMA ResourceManager. The
key method on ResourceManager is setDataPath, which allows you to specify the location where UIMA
components will go to look for their external resources. Once you've obtained and initialized a
ResourceManager, you can pass it to any of the produceXXX methods. </para>
</listitem>
</itemizedlist></para>
</section>
<section id="ugr.tug.application.using_aes">
<title>Using Analysis Engines</title>
<para>This section describes how to add analysis capability to your application by using Analysis Engines
developed using the UIMA SDK. An <emphasis>Analysis Engine (AE)</emphasis> is a component that analyzes
artifacts (e.g. documents) and infers information about them.</para>
<para>An Analysis Engine consists of two parts - Java classes (typically packaged as one or more JAR files) and
<emphasis>AE descriptors</emphasis> (one or more XML files). You must put the Java classes in your
application&apos;s class path, but thereafter you will not need to directly interact with them. The UIMA
framework insulates you from them by providing the standard AnalysisEngine interface.</para>
<para>The term <emphasis>Text Analysis Engine (TAE)</emphasis> is sometimes used to describe an Analysis
Engine that analyzes a text document. In the UIMA SDK v1.x, there was a TextAnalysisEngine interface that was
commonly used. However, as of the UIMA SDK v2.0, this interface has been deprecated and all applications should
switch to using the standard AnalysisEngine interface.</para>
<para>The AE descriptor XML files contain the configuration settings for the Analysis Engine as well as a
description of the AE&apos;s input and output requirements. You may need to edit these files in order to
configure the AE appropriately for your application - the supplier of the AE may have provided documentation
(or comments in the XML descriptor itself) about how to do this.</para>
<section id="ugr.tug.application.instantiating_an_ae">
<title>Instantiating an Analysis Engine</title>
<para>The following code shows how to instantiate an AE from its XML descriptor:
<programlisting> //get Resource Specifier from XML file
XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
ResourceSpecifier specifier =
UIMAFramework.getXMLParser().parseResourceSpecifier(in);
//create AE here
AnalysisEngine ae =
UIMAFramework.produceAnalysisEngine(specifier);</programlisting></para>
<para>The first two lines parse the XML descriptor (for AEs with multiple descriptor files, one of them is the
<quote>main</quote> descriptor - the AE documentation should indicate which it is). The result of the parse
is a <literal>ResourceSpecifier</literal> object. The third line of code invokes a static factory method
<literal>UIMAFramework.produceAnalysisEngine</literal>, which takes the specifier and instantiates
an <literal>AnalysisEngine</literal> object.</para>
<para>There is one caveat to using this approach - the Analysis Engine instance that you create will not support
multiple threads running through it concurrently. If you need to support this, see <xref
linkend="ugr.tug.applications.multi_threaded"/>.</para>
</section>
<section id="ugr.tug.application.analyzing_text_documents">
<title>Analyzing Text Documents</title>
<para>There are two ways to use the AE interface to analyze documents. You can either use the
<emphasis>JCas</emphasis> interface, which is described in detail in <olink
targetdoc="&uima_docs_ref;"/> <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.jcas"/> or you can directly use the
<emphasis>CAS</emphasis> interface, which is described in detail in <olink
targetdoc="&uima_docs_ref;"/> <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>. Besides text documents, other kinds of
artifacts can also be analyzed; see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aas"/> for more information.</para>
<para>The basic structure of your application will look similar in both cases:</para>
<para>Using the JCas
<programlisting> //create a JCas, given an Analysis Engine (ae)
JCas jcas = ae.newJCas();
//analyze a document
jcas.setDocumentText(doc1text);
ae.process(jcas);
doSomethingWithResults(jcas);
jcas.reset();
//analyze another document
jcas.setDocumentText(doc2text);
ae.process(jcas);
doSomethingWithResults(jcas);
jcas.reset();
...
//done
ae.destroy();</programlisting></para>
<para>Using the CAS
<programlisting>//create a CAS
CAS aCasView = ae.newCAS();
//analyze a document
aCasView.setDocumentText(doc1text);
ae.process(aCasView);
doSomethingWithResults(aCasView);
aCasView.reset();
//analyze another document
aCasView.setDocumentText(doc2text);
ae.process(aCasView);
doSomethingWithResults(aCasView);
aCasView.reset();
...
//done
ae.destroy();</programlisting></para>
<para>First, you create the CAS or JCas that you will use. Then, you repeat the following four steps for each
document:</para>
<orderedlist spacing="compact">
<listitem>
<para>Put the document text into the CAS or JCas.</para>
</listitem>
<listitem>
<para>Call the AE's process method, passing the CAS or JCas as an argument.</para>
</listitem>
<listitem>
<para>Do something with the results that the AE has added to the CAS or JCas.</para>
</listitem>
<listitem>
<para>Call the CAS's or JCas's reset() method to prepare for another analysis.</para>
</listitem>
</orderedlist>
</section>
<section id="ugr.tug.applications.analyzing_non_text_artifacts">
<title>Analyzing Non-Text Artifacts</title>
<para>Analyzing non-text artifacts is similar to analyzing text documents. The main difference is that
instead of using the <literal>setDocumentText</literal> method, you need to use the Sofa APIs to set the
artifact into the CAS. See <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>
for details.</para>
</section>
<section id="ugr.tug.applications.accessing_analysis_results">
<title>Accessing Analysis Results</title>
<para>Annotators (and applications) access the results of analysis via the CAS, using the CAS or JCas
interfaces. These results are accessed using the CAS Indexes. There is one built-in index for instances of
the built-in type <literal>uima.tcas.Annotation</literal> that can be used to retrieve instances of
<literal>Annotation</literal> or any subtype of Annotation. You can also define additional indexes over
other types. </para>
<para>Indexes provide a method to obtain an iterator over their contents; the iterator returns the matching
elements one at a time from the CAS.</para>
<section id="ugr.tug.applications.accessing_results_using_jcas">
<title>Accessing Analysis Results using the JCas</title>
<para>See:</para>
<itemizedlist>
<listitem>
<para> <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae.reading_results_previous_annotators"/> </para>
</listitem>
<listitem>
<para> <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.jcas"/></para>
</listitem>
<listitem>
<para>The Javadocs for <literal>org.apache.uima.jcas.JCas</literal>. </para>
</listitem>
</itemizedlist>
</section>
<section id="ugr.tug.application.accessing_results_using_cas">
<title>Accessing Analysis Results using the CAS</title>
<para>See:</para>
<itemizedlist>
<listitem>
<para> <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/></para>
</listitem>
<listitem>
<para> The source code for <literal>org.apache.uima.examples.PrintAnnotations</literal>, which
is in <literal>examples\src.</literal></para>
</listitem>
<listitem>
<para>The Javadocs for the <literal>org.apache.uima.cas</literal> and
<literal>org.apache.uima.cas.text</literal> packages. </para>
</listitem>
</itemizedlist>
</section>
</section>
<section id="ugr.tug.applications.multi_threaded">
<title>Multi-threaded Applications</title>
<para>You may be running on a multi-core system, and want to run multiple CASes at once through your pipeline. To support this, UIMA provides multiple approaches.
The most flexible and recommended way to do this is to use the features of UIMA-AS, which not only allows scale-up (multiple threads in one CPU), but also
supports scale-out (exploiting a cluster of machines).</para>
<para>This section describes the simplest way to use an AE in a multi-threaded environment.
First, note that most Analysis Engines are written with the assumption that only one thread will access
an instance at any one time; that is, Analysis Engines are generally not written to be thread-safe. Their authors
assume that multiple instances of the Analysis Engine class will be instantiated as needed to support multiple
threads.
</para>
<para>If your application has multiple threads that might invoke an Analysis Engine,
you can use the Java synchronized keyword to ensure that only one thread at a time
uses the AE and its CAS. For example:
<programlisting>public class MyApplication {
private AnalysisEngine mAnalysisEngine;
private CAS mCAS;
public MyApplication() {
//get Resource Specifier from XML file
XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
ResourceSpecifier specifier =
UIMAFramework.getXMLParser().parseResourceSpecifier(in);
//create Analysis Engine here
mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier);
mCAS = mAnalysisEngine.newCAS();
}
// Assume some other part of your multi-threaded application could
// call <quote>analyzeDocument</quote> on different threads, asynchronously
public synchronized void analyzeDocument(String aDoc) {
//analyze a document
mCAS.setDocumentText(aDoc);
mAnalysisEngine.process(mCAS);
doSomethingWithResults(mCAS);
mCAS.reset();
}
...
}</programlisting></para>
<para>Without the synchronized keyword, this application would not be thread-safe. If multiple threads
called the analyzeDocument method simultaneously, they would all use the same CAS and clobber each other's
results. The synchronized keyword ensures that no more than one thread is executing this method at any given
time. For more information on thread synchronization in Java, see <ulink
url="http://docs.oracle.com/javase/tutorial/essential/concurrency/"/>.</para>
<para>The synchronized keyword ensures thread-safety, but does not allow you to process more than one
document at a time. If you need to process multiple documents simultaneously (for example, to make use of a
multiprocessor machine), you&apos;ll need to use more than one CAS instance.</para>
<para>Because CAS instances use memory and can take some time to construct, you don't want to create a new CAS
instance for each request. Instead, you should use a feature of the UIMA SDK called the <emphasis>CAS
Pool</emphasis>, implemented by the type <literal>CasPool</literal>.</para>
<para>A CAS Pool contains some number of CAS instances (you specify how many when you create the pool). When a
thread wants to use a CAS, it <emphasis>checks out</emphasis> an instance from the pool. When the thread is
done using the CAS, it must <emphasis>release</emphasis> the CAS instance back into the pool. If all
instances are checked out, additional threads will block and wait for an instance to become available. Here
is some example code:
<programlisting>public class MyApplication {
private CasPool mCasPool;
private AnalysisEngine mAnalysisEngine;
public MyApplication()
{
//get Resource Specifier from XML file
XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
ResourceSpecifier specifier =
UIMAFramework.getXMLParser().parseResourceSpecifier(in);
//Create multithreadable AE that will
//Accept 3 simultaneous requests
//The 3rd parameter specifies a timeout.
//When the number of simultaneous requests exceeds 3,
// additional requests will wait for other requests to finish.
// This parameter determines the maximum number of milliseconds
// that a new request should wait before throwing an exception
// - a value of 0 will cause them to wait forever.
mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier,3,0);
//create CAS pool with 3 CAS instances
mCasPool = new CasPool(3, mAnalysisEngine);
}
// Notice this is no longer "synchronized"
public void analyzeDocument(String aDoc) {
//check out a CAS instance (argument 0 means no timeout)
CAS cas = mCasPool.getCas(0);
try {
//analyze a document
cas.setDocumentText(aDoc);
mAnalysisEngine.process(cas);
doSomethingWithResults(cas);
} finally {
//MAKE SURE we release the CAS instance
mCasPool.releaseCas(cas);
}
}
...
}</programlisting></para>
<para>There is not much more code required here than in the previous example. First, there is one additional
parameter to the AnalysisEngine producer, specifying the number of annotator instances to
create<footnote>
<para> Both the UIMA Collection Processing Manager framework and the remote deployment services framework
have implementations which use CAS pools in this manner, and thereby relieve the annotator developer of
the necessity to make their annotators thread-safe.</para> </footnote>. Then, instead of creating a
single CAS in the constructor, we now create a CasPool containing 3 instances. In the analyze method, we check
out a CAS, use it, and then release it.</para> <note>
<para>Frequently, the two numbers (number of CASes, and the number of AEs) will be the same. It would not make
sense to have the number of CASes less than the number of AEs
&ndash; the extra AE instances would always block waiting for a CAS from the pool. It could make sense to have
additional CASes, though &ndash; if you had other multi-threaded processes that were using the CASes, other
than the AEs. </para> </note>
<para>The getCas() method returns a CAS which is not specialized to any particular subject of analysis. To
work with a particular subject of analysis, please refer to <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aas"/>.</para>
<para>Note the use of the try...finally block. This is very important, as it ensures that the CAS we have checked
out will be released back into the pool, even if the analysis code throws an exception. You should always use
try...finally when using the CAS pool; if you do not, you risk exhausting the pool and causing
deadlock.</para>
<para>The parameter 0 passed to the CasPool.getCas() method is a timeout value. If this is set to a positive
integer, it is the maximum number of milliseconds that the thread will wait for an instance to become
available in the pool. If this time elapses, the getCas method will return null, and the application can do
something intelligent, like ask the user to try again later. A value of 0 will cause the thread to wait for an
available CAS, potentially forever.</para>
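<para>The checkout, timeout, and release behavior described above can be modeled with a small stand-alone
sketch. The <literal>PoolSketch</literal> class below is hypothetical and uses only the JDK; it is not the UIMA
<literal>CasPool</literal> implementation, and plain strings stand in for CAS instances. It only mirrors the
contract stated in this section: a timeout of 0 waits until an instance is free, and a positive timeout yields
null once it elapses.</para>
<programlisting><![CDATA[import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a fixed-size instance pool; not part of UIMA.
class PoolSketch<T> {
  private final BlockingQueue<T> free;

  PoolSketch(List<T> instances) {
    // Capacity equals the number of pooled instances.
    free = new ArrayBlockingQueue<>(instances.size(), false, instances);
  }

  // timeoutMillis == 0 waits until an instance is free; a positive value
  // returns null if none becomes free in time (or if the wait is interrupted).
  T checkOut(long timeoutMillis) {
    try {
      return timeoutMillis == 0
          ? free.take()
          : free.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }

  // Returns a previously checked-out instance to the pool.
  void release(T instance) {
    free.add(instance);
  }

  public static void main(String[] args) {
    PoolSketch<String> pool = new PoolSketch<>(Arrays.asList("cas-1"));
    String first = pool.checkOut(0);
    try {
      // The only instance is checked out, so this request times out.
      String second = pool.checkOut(50);
      System.out.println(second == null ? "timed out" : second);
    } finally {
      pool.release(first);
    }
  }
}]]></programlisting>
<para>As in the CAS pool example, the release happens in a finally block, so the instance returns to the pool
even if the work in between fails.</para>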
<para>All of this can better be done using UIMA-AS. Besides taking care of setting up the CAS pools, etc.,
UIMA-AS allows a pipeline having several delegates to be scaled up optimally for each delegate;
one delegate might have 5 instances, while another might have 3. It also does
a different kind of initialization, in that it creates a thread pool itself, and ensures that each
annotator instance gets its process() method called using the same thread that was used for that annotator
instance's initialization call; some annotators could be written assuming that this is the case.</para>
</section>
<section id="ugr.tug.application.using_multiple_aes">
<title>Using Multiple Analysis Engines and Creating Shared CASes</title>
<titleabbrev>Multiple AEs &amp; Creating Shared CASes</titleabbrev>
<para>In most cases, the easiest way to use multiple Analysis Engines from within an application is to combine
them into an aggregate AE. For instructions, see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae.building_aggregates"/>. Be sure that you understand this method before
deciding to use the more advanced feature described in this section.</para>
<para>If you decide that your application does need to instantiate multiple AEs and have those AEs share a
single CAS, then you will no longer be able to use the various methods on the
<literal>AnalysisEngine</literal> class that create CASes (or JCases) to create your CAS. This is because
these methods create a CAS with a data model specific to a single AE and which therefore cannot be shared by
other AEs. Instead, you create a CAS as follows:</para>
<para>Suppose you have two analysis engines, and one CAS Consumer, and you want to create one type system from
the merge of all of their type specifications. Then you can do the following:</para>
<programlisting>AnalysisEngineDescription aeDesc1 =
UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);
AnalysisEngineDescription aeDesc2 =
UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);
CasConsumerDescription ccDesc =
UIMAFramework.getXMLParser().parseCasConsumerDescription(...);
List&lt;MetaDataObject&gt; list = new ArrayList&lt;&gt;();
list.add(aeDesc1);
list.add(aeDesc2);
list.add(ccDesc);
CAS cas = CasCreationUtils.createCas(list);
// (optional, if using the JCas interface)
JCas jcas = cas.getJCas();</programlisting>
<para>The CasCreationUtils class takes care of the work of merging the AEs&apos; type systems and producing a
CAS for the combined type system. If the type systems are not compatible, an exception will be thrown.</para>
</section>
<section id="ugr.tug.application.saving_cases_to_file_systems">
<title>Saving CASes to file systems or general Streams</title>
<para>The UIMA framework provides multiple APIs to save and restore the contents of a CAS to streams.
Two common uses of this are to save CASes to the file system, and to send CASes to other processes, running
on remote systems.</para>
<para>
The CASes can be serialized in multiple formats:
<itemizedlist>
<listitem>
<para>Binary formats:
<itemizedlist>
<listitem>
<para>Plain binary: This is used to communicate with remote services, and also for interfacing from Java with
annotators written in C/C++ or related languages via JNI.</para>
</listitem>
<listitem>
<para>Compressed binary: There are two forms of compressed binary. The recommended one is form 6, which also allows
type filtering. See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.compress.overview"/>.</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>XML formats: There are two forms of this format. The preferred form is the XMI form (see
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xmi"/>). An older format is also available,
called XCAS.</para>
</listitem>
<listitem>
<para>JSON formats (as of version 2.7.0):
This is intended for exposing results in the CAS as JSON objects for use by
web applications. See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.json.overview"/>.
For JSON, only serialization is supported.</para>
</listitem>
<listitem>
<para>Java Object serialization: There are APIs to convert a CAS to a Java object that can be serialized
and deserialized
using the standard Java object serialization mechanism (<literal>writeObject</literal> and <literal>readObject</literal>). There is also a way to include the CAS's type system and
index definition.</para>
</listitem>
</itemizedlist>
</para>
<para>Each of these serializations has different capabilities, summarized in the table below.
<table frame="all" id="ugr.tug.tbl.serialization_capabilities">
<title>Serialization Capabilities</title>
<tgroup cols="8" rowsep="1" colsep="1">
<colspec colname="c1"/>
<colspec colname="c2"/>
<colspec colname="c3"/>
<colspec colname="c4"/>
<colspec colname="c5"/>
<colspec colname="c6"/>
<colspec colname="c7"/>
<colspec colname="c8"/>
<thead>
<row>
<entry align="center"></entry>
<entry align="center">XCAS</entry>
<entry align="center">XMI</entry>
<entry align="center">JSON</entry>
<entry align="center">Binary</entry>
<entry align="center">Cmpr 4</entry>
<entry align="center">Cmpr 6</entry>
<entry align="center">JavaObj</entry>
</row>
</thead>
<tbody>
<row>
<entry>Output</entry>
<entry>Output Stream</entry>
<entry>Output Stream</entry>
<entry>Output Stream, File, Writer</entry>
<entry>Output Stream</entry>
<entry>Output Stream, Data Output Stream, File</entry>
<entry>Output Stream, Data Output Stream, File</entry>
<entry>-</entry>
</row>
<row>
<entry>Lists/Arrays inline formatting?</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
</row>
<row>
<entry>Formatted?</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
</row>
<row>
<entry>Type Filtering?</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>-</entry>
</row>
<row>
<entry>Delta Cas?</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
</row>
<row>
<entry>OOTS?</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
</row>
<row>
<entry>Only send indexed + reachable FSs?</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>send all</entry>
<entry>send all</entry>
<entry>Yes</entry>
<entry>send all</entry>
</row>
<row>
<entry>NameSpace/Schemas?</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
</row>
<row>
<entry>lenient available?</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>-</entry>
<entry>-</entry>
<entry>-</entry>
<entry>Yes</entry>
<entry>-</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<para>In the above table, Cmpr 4 and Cmpr 6 refer to Compressed forms of the serialization,
and JavaObj refers to Java Object serialization.</para>
<para>For the XMI and JSON formats, lists and arrays can sometimes be formatted "inline".
In this representation, the elements are formatted directly as the value of a particular
feature. This is only done if the arrays and lists are not multiply-referenced.</para>
<para>Type Filtering support enables only a subset of the types and/or features to be
serialized. An additional type system object is used to specify the types to be included
in the serialization. This can be useful, for instance, when sending a CAS to a remote service,
where the remote service only uses a small number of the types and features, to reduce the size
of the serialized CAS.</para>
<para>Delta Cas support makes use of a "mark" set in the CAS, and only serializes changes in the CAS,
both new and modified Feature Structures, that were added or changed after the mark was set.
This is useful for remote services, supporting the use-case where a large CAS is sent to the service,
which sets the mark in the received CAS, and then adds a small amount of information;
the Delta CAS then serializes only that small amount as the "reply" sent back to the sender.</para>
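<para>The mark-and-delta idea described above can be modeled with a toy sketch. The
<literal>DeltaSketch</literal> class below is hypothetical and uses only the JDK; strings stand in for Feature
Structures, and the <literal>new:</literal> and <literal>modified:</literal> prefixes are invented for
illustration. It does not reflect how UIMA encodes a Delta CAS.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy model of "serialize only what changed after the mark".
class DeltaSketch {
  private final List<String> items = new ArrayList<>();
  private final TreeSet<Integer> modified = new TreeSet<>();
  private int mark = 0;

  // Appends an item and returns its position.
  int add(String value) {
    items.add(value);
    return items.size() - 1;
  }

  // Overwrites an item; remembers it if it existed before the mark.
  void set(int position, String value) {
    items.set(position, value);
    if (position < mark) {
      modified.add(position);
    }
  }

  // Everything present now counts as "already sent".
  void mark() {
    mark = items.size();
    modified.clear();
  }

  // Pre-mark items modified after the mark, then items added after it.
  List<String> delta() {
    List<String> out = new ArrayList<>();
    for (int position : modified) {
      out.add("modified:" + items.get(position));
    }
    for (int i = mark; i < items.size(); i++) {
      out.add("new:" + items.get(i));
    }
    return out;
  }

  public static void main(String[] args) {
    DeltaSketch store = new DeltaSketch();
    store.add("a");
    store.add("b");
    store.mark();
    store.set(0, "a2");
    store.add("c");
    System.out.println(store.delta()); // prints [modified:a2, new:c]
  }
}]]></programlisting>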
<para>OOTS means "Out of Type System" support, intended to support the use-case where a CAS is being sent
to a remote application. This supports deserializing an incoming CAS where
some of the types and/or features may not be present in the receiving CAS's type system. A "lenient"
option on the deserialization permits the deserialization to proceed, with the out-of-type-system
information preserved so that when the CAS is subsequently reserialized (in the use-case, to be
returned back to the sender), the out-of-type-system information is re-merged back into the output stream.
</para>
<para>The Binary, Java Object, and Compressed Form 4 serializations send all the Feature Structures in the CAS,
in the order they were created in the CAS. The other methods only
send Feature Structures that are reachable, either by
their being in some CAS index, or being referenced
as a feature of another Feature Structure which is reachable.</para>
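<para>The reachability rule above amounts to a graph walk that starts from the indexed Feature Structures. The
following sketch is a toy model using only the JDK: integer ids stand in for Feature Structures, and the
<literal>ReachabilitySketch</literal> class and its method are hypothetical names, not UIMA APIs.</para>
<programlisting><![CDATA[import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of "indexed or referenced from something reachable".
class ReachabilitySketch {
  // indexed: ids present in some index
  // refs: id -> ids referenced by that item's features
  static Set<Integer> reachable(Set<Integer> indexed,
                                Map<Integer, List<Integer>> refs) {
    Set<Integer> seen = new TreeSet<>(indexed);
    Deque<Integer> todo = new ArrayDeque<>(indexed);
    while (!todo.isEmpty()) {
      Integer id = todo.pop();
      for (Integer target : refs.getOrDefault(id, Collections.emptyList())) {
        if (seen.add(target)) {   // first visit only
          todo.push(target);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    Map<Integer, List<Integer>> refs = new HashMap<>();
    refs.put(1, Arrays.asList(2));  // 1 is indexed and references 2
    refs.put(3, Arrays.asList(4));  // 3 and 4 are neither indexed nor referenced from 1
    Set<Integer> indexed = new HashSet<>(Arrays.asList(1));
    System.out.println(reachable(indexed, refs)); // prints [1, 2]
  }
}]]></programlisting>
<para>In this model, a serialization that sends everything would emit all four items, while one that sends only
reachable items would emit items 1 and 2.</para>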
<para>The NameSpace/Schema support allows specifying a set of schemas, each one corresponding to a particular
namespace, used in XMI serialization.</para>
<para>Lenient allows the receiving Type System to be missing types and/or features that are present in the data being deserialized.
Normally this causes an exception, but with the lenient flag turned on, these extra types and/or features are
skipped over and ignored, with no error indicated.</para>
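<para>The effect of the lenient flag can be sketched with a toy loader. The <literal>LenientSketch</literal> class
below is hypothetical and uses only the JDK; the <literal>typeName:payload</literal> record format is invented
for illustration and is unrelated to any UIMA serialization format.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of strict versus lenient loading.
class LenientSketch {
  // Each record is "typeName:payload". A record whose type is not in
  // knownTypes causes an exception unless lenient is true, in which
  // case the record is skipped without any error.
  static List<String> load(List<String> records, Set<String> knownTypes,
                           boolean lenient) {
    List<String> accepted = new ArrayList<>();
    for (String record : records) {
      String type = record.substring(0, record.indexOf(':'));
      if (knownTypes.contains(type)) {
        accepted.add(record);
      } else if (!lenient) {
        throw new IllegalArgumentException("Unknown type: " + type);
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("Token:the", "Custom:x");
    Set<String> known = Collections.singleton("Token");
    System.out.println(load(records, known, true)); // prints [Token:the]
  }
}]]></programlisting>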
<para>To save an XMI representation of a CAS, use the <code>save</code> method in <code>CasIOUtils</code> or the
<literal>serialize</literal> method of the class
<literal>org.apache.uima.util.XmlCasSerializer</literal>. To save an XCAS representation of a CAS,
use the <code>save</code> method in <code>CasIOUtils</code> or the <literal>org.apache.uima.cas.impl.XCASSerializer</literal> class; see the Javadocs
for details.</para>
<para>All the external forms (except JSON) can be read back in with standard options using the <code>CasIOUtils load</code> methods.
The <code>CasIOUtils load</code> methods also support loading type system and index definition information
at the same time (usually from additional input sources).
The XCAS and XMI external forms can also be read back in using the <literal>deserialize</literal> method of
the class <literal>org.apache.uima.util.XmlCasDeserializer</literal>. All of these methods deserialize
into a pre-existing CAS, which you must create ahead of time. See the
Javadocs for details.</para>
<para>The <code>CasIOUtils</code> class has a collection of static methods to load (deserialize) and save (serialize) CASes,
optionally with their type system and index definitions.
The <code>Serialization</code> class has various static methods for serializing and deserializing Java Object forms and
compressed forms, with finer control over available options.
See the Javadocs for that class for details.</para>
<para>Several of the APIs use or return instances of <code>SerialFormat</code>, which is an enum specifying the various
forms of serialization.</para>
</section>
</section>
<section id="ugr.tug.application.using_cpes">
<title>Using Collection Processing Engines</title>
<para>A <emphasis>Collection Processing Engine (CPE)</emphasis> processes collections of artifacts
(documents) through the combination of the following components: a Collection Reader, an optional CAS
Initializer, Analysis Engines, and CAS Consumers. Collection Processing Engines and their components are
described in <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.cpe"/>.</para>
<para>Like Analysis Engines, CPEs consist of a set of Java classes and a set of descriptors. You need to make sure
the Java classes are in your classpath, but otherwise you only deal with descriptors.</para>
<section id="ugr.tug.application.running_a_cpe_from_a_descriptor">
<title>Running a Collection Processing Engine from a Descriptor</title>
<titleabbrev>Running a CPE from a Descriptor</titleabbrev>
<para><olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe.running_cpe_from_application"/> describes how to use the APIs to read a CPE
descriptor and run it from an application.</para>
</section>
<section id="ugr.tug.application.configuring_a_cpe_descriptor_programmatically">
<title>Configuring a Collection Processing Engine Descriptor Programmatically</title>
<titleabbrev>Configuring a CPE Descriptor Programmatically</titleabbrev>
<para>For the finest level of control over the CPE descriptor settings, the CPE offers programmatic access to
the descriptor via an API. With this API, a developer can create a complete descriptor and then save the result
to a file. This also can be used to read in a descriptor (using XMLParser.parseCpeDescription as shown in the
previous section), modify it, and write it back out again. The CPE Descriptor API allows a developer to
redefine default behavior related to error handling for each component, turn on checkpointing, change
performance characteristics of the CPE, and plug in a custom timer.</para>
<para>Below is some example code that illustrates how this works. See the Javadocs for package
org.apache.uima.collection.metadata for more details.</para>
<programlisting>// Create a descriptor with default settings
CpeDescription cpe = CpeDescriptorFactory.produceDescriptor();
// Add CollectionReader
cpe.addCollectionReader([descriptor]);
// Add CasInitializer (deprecated)
cpe.addCasInitializer(&lt;cas initializer descriptor&gt;);
// Provide the number of CASes the CPE will use
cpe.setCasPoolSize(2);
// Define and add Analysis Engine
CpeIntegratedCasProcessor personTitleProcessor =
    CpeDescriptorFactory.produceCasProcessor("Person");
// Provide descriptor for the Analysis Engine
personTitleProcessor.setDescriptor([descriptor]);
// Continue despite errors, and skip the bad CAS
personTitleProcessor.setActionOnMaxError("continue");
// Increase the amount of time (in ms) the CPE waits for a response
// from this Analysis Engine
personTitleProcessor.setTimeout(100000);
// Add Analysis Engine to the descriptor
cpe.addCasProcessor(personTitleProcessor);
// Define and add CAS Consumer
CpeIntegratedCasProcessor consumerProcessor =
    CpeDescriptorFactory.produceCasProcessor("Printer");
consumerProcessor.setDescriptor([descriptor]);
// Define batch size
consumerProcessor.setBatchSize(100);
// Terminate CPE on max errors
consumerProcessor.setActionOnMaxError("terminate");
// Add CAS Consumer to the descriptor
cpe.addCasProcessor(consumerProcessor);
// Add checkpoint file and define checkpoint frequency (ms)
cpe.setCheckpoint("[path]/checkpoint.dat", 3000);
// Plug in custom timer class used for timing events
cpe.setTimer("org.apache.uima.internal.util.JavaTimer");
// Define number of documents to process
cpe.setNumToProcess(1000);
// Dump the descriptor to System.out
((CpeDescriptionImpl) cpe).toXML(System.out);</programlisting>
<para>The CPE descriptor for the above configuration looks like this:
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
<collectionReader>
<collectionIterator>
<descriptor>
<include href="[descriptor]"/>
</descriptor>
<configurationParameterSettings>...
</configurationParameterSettings>
</collectionIterator>
<casInitializer>
<descriptor>
<include href="[descriptor]"/>
</descriptor>
<configurationParameterSettings>...
</configurationParameterSettings>
</casInitializer>
</collectionReader>
<casProcessors casPoolSize="2" processingUnitThreadCount="1">
<casProcessor deployment="integrated" name="Person">
<descriptor>
<include href="[descriptor]"/>
</descriptor>
<deploymentParameters/>
<errorHandling>
<errorRateThreshold action="continue" value="100/1000"/>
<maxConsecutiveRestarts action="terminate" value="30"/>
<timeout max="100000"/>
</errorHandling>
<checkpoint batch="100" time="1000ms"/>
</casProcessor>
<casProcessor deployment="integrated" name="Printer">
<descriptor>
<include href="[descriptor]"/>
</descriptor>
<deploymentParameters/>
<errorHandling>
<errorRateThreshold action="terminate"
value="100/1000"/>
<maxConsecutiveRestarts action="terminate"
value="30"/>
<timeout max="100000" default="-1"/>
</errorHandling>
<checkpoint batch="100" time="1000ms"/>
</casProcessor>
</casProcessors>
<cpeConfig>
<numToProcess>1000</numToProcess>
<deployAs>immediate</deployAs>
<checkpoint file="[path]/checkpoint.dat" time="3000ms"/>
<timerImpl>
org.apache.uima.internal.util.JavaTimer
</timerImpl>
</cpeConfig>
</cpeDescription>]]></programlisting></para>
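<para>Reading in an existing descriptor, modifying it, and writing it back out, as described at the start of
this section, might look like the following. This is a minimal sketch, not code from the UIMA SDK: the class
name <literal>CpeDescriptorRewriter</literal>, the use of command-line arguments for the file names, and the
chosen thread count and timeout are assumptions made for illustration. It also assumes that
<literal>CpeDescription</literal> exposes <literal>toXML(OutputStream)</literal> directly; if it does not in
your UIMA version, cast to <literal>CpeDescriptionImpl</literal> as in the listing above. The UIMA core JAR
must be on the classpath.</para>
<programlisting><![CDATA[import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.uima.UIMAFramework;
import org.apache.uima.collection.metadata.CpeCasProcessor;
import org.apache.uima.collection.metadata.CpeDescription;
import org.apache.uima.util.XMLInputSource;

// Hypothetical helper class, for illustration only
public class CpeDescriptorRewriter {
  public static void main(String[] args) throws Exception {
    // args[0]: existing CPE descriptor; args[1]: file for the modified descriptor
    CpeDescription cpe = UIMAFramework.getXMLParser()
        .parseCpeDescription(new XMLInputSource(args[0]));

    // Change a performance characteristic: the number of processing threads
    cpe.setProcessingUnitThreadCount(2);

    // Raise the timeout (in ms) of every CAS Processor in the descriptor
    for (CpeCasProcessor processor
        : cpe.getCpeCasProcessors().getAllCpeCasProcessors()) {
      processor.setTimeout(100000);
    }

    // Write the modified descriptor back out as XML
    // (assumes toXML is available on the CpeDescription interface)
    try (OutputStream out = new FileOutputStream(args[1])) {
      cpe.toXML(out);
    }
  }
}]]></programlisting>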
</section>
</section>
<section id="ugr.tug.application.setting_configuration_parameters">
<title>Setting Configuration Parameters</title>
<para>Configuration parameters can be set using APIs as well as configured using the XML descriptor metadata
specification (see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae.configuration_parameters"/>).</para>
<para>There are two points at which you can set the parameters via the APIs:</para>
<itemizedlist spacing="compact">
<listitem>
<para>After reading the XML descriptor for a component, but before you produce the component itself,
and</para>
</listitem>
<listitem>
<para>After the component has been produced. </para>
</listitem>
</itemizedlist>
<para>Setting the parameters before you produce the component is done using the
<literal>ConfigurationParameterSettings</literal> object. You get an instance of this for a particular
component by accessing that component description&apos;s metadata. For instance, if you produced a component
description by using one of the <literal>UIMAFramework.getXMLParser().parse...</literal> methods, you can use
that component description&apos;s <literal>getMetaData()</literal> method to get the metadata, and then the
metadata&apos;s <literal>getConfigurationParameterSettings</literal> method to get the
<literal>ConfigurationParameterSettings</literal> object. Using that object, you can set individual
parameters using the <literal>setParameterValue</literal> method. Here&apos;s an example for a
CAS Consumer component:
<programlisting>// Create a description object by reading the XML for the descriptor
CasConsumerDescription casConsumerDesc =
UIMAFramework.getXMLParser().parseCasConsumerDescription(new
XMLInputSource("descriptors/cas_consumer/InlineXmlCasConsumer.xml"));
// get the settings from the metadata
ConfigurationParameterSettings consumerParamSettings =
casConsumerDesc.getMetaData().getConfigurationParameterSettings();
// Set a parameter value
consumerParamSettings.setParameterValue(
InlineXmlCasConsumer.PARAM_OUTPUTDIR,
outputDir.getAbsolutePath());</programlisting></para>
<para>Then you might produce this component using:
<programlisting>CasConsumer component =
UIMAFramework.produceCasConsumer(casConsumerDesc);</programlisting></para>
<para>A side effect of producing a component is calling the component's <literal>initialize</literal> method,
allowing it to read its configuration parameters. If you want to change parameters after this, use
<programlisting>component.setConfigParameterValue(
    "&lt;parameter-name&gt;",
    "&lt;parameter-value&gt;");</programlisting>
and then signal the component to re-read its configuration by calling the component's
<literal>reconfigure</literal> method:
<programlisting>component.reconfigure();</programlisting></para>
<para>Although these examples are for a CAS Consumer component, the parameter APIs also work for other kinds of
components.</para>
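<para>As an illustration of the same two steps for an Analysis Engine, the following sketch sets a parameter
before the engine is produced and changes it again afterwards. The class name
<literal>AeParameterExample</literal>, the descriptor path taken from the command line, and the parameter name
<literal>MyParam</literal> are hypothetical; the sketch assumes the descriptor declares
<literal>MyParam</literal> as a single-valued String parameter, and it requires the UIMA core JAR on the
classpath.</para>
<programlisting><![CDATA[import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.resource.metadata.ConfigurationParameterSettings;
import org.apache.uima.util.XMLInputSource;

// Hypothetical example class, for illustration only
public class AeParameterExample {
  public static void main(String[] args) throws Exception {
    // args[0]: path to an Analysis Engine descriptor that declares "MyParam"
    AnalysisEngineDescription aeDesc = UIMAFramework.getXMLParser()
        .parseAnalysisEngineDescription(new XMLInputSource(args[0]));

    // 1. Set the parameter on the description, before producing the engine
    ConfigurationParameterSettings settings =
        aeDesc.getAnalysisEngineMetaData().getConfigurationParameterSettings();
    settings.setParameterValue("MyParam", "initial value");

    // Producing the engine calls initialize(), which reads the settings
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aeDesc);

    // 2. Change the parameter on the live engine, then ask it to re-read
    ae.setConfigParameterValue("MyParam", "new value");
    ae.reconfigure();

    // Release the engine's resources when done
    ae.destroy();
  }
}]]></programlisting>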
</section>
<section id="ugr.tug.application.integrating_text_analysis_and_search">
<title>Integrating Text Analysis and Search</title>
<para>The UIMA SDK on IBM's alphaWorks <ulink url="http://www.alphaworks.ibm.com/tech/uima"/> includes a
semantic search engine that you can use to build a search index that includes the results of the analysis done by
your AE. This combination of AEs with a search engine capable of indexing both words and annotations over spans
of text enables what UIMA refers to as <emphasis>semantic search</emphasis>. Over time we expect to provide
additional information on integrating other open source search engines.</para>
<para>Semantic search is a search where the semantic intent of the query is specified using one or more entity or
relation specifiers. For example, one could specify that they are looking for a person named
<quote>Bush</quote>. Such a query would then not return results about the kind of bushes that grow in your
garden.</para>
<section id="ugr.tug.application.building_an_index">
<title>Building an Index</title>
<para>To build a semantic search index using the UIMA SDK, you run a Collection Processing Engine that includes
your AE along with a CAS Consumer that takes the tokens and annotations, together with sentence
boundaries, and feeds them to a semantic searcher's index term input. The alphaWorks semantic search
component includes a CAS Consumer called the <emphasis>Semantic Search CAS Indexer</emphasis> that does
this; this component is available from the alphaWorks site. Your AE must include an annotator that produces
Token and Sentence annotations, which the Indexer requires, along with any <quote>semantic</quote>
annotations. The Semantic Search CAS Indexer's descriptor is located here:
<literal>examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml</literal>.</para>
<section id="ugr.tug.application.search.configuring_indexer">
<title>Configuring the Semantic Search CAS Indexer</title>
<para>Since there are several ways you might want to build a search index from the information in the CAS
produced by your AE, you need to supply the Semantic Search CAS Indexer with
configuration information in the form of an <emphasis>Index Build Specification</emphasis> file.
Apache UIMA includes code for parsing Index Build Specification files (see the Javadocs for details). An
example of an Index Build Specification tailored to the AE from the tutorial in the <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/> is located in
<literal>examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml</literal>. It looks
like this:
<programlisting><![CDATA[<indexBuildSpecification>
<indexBuildItem>
<name>org.apache.uima.examples.tokenizer.Token</name>
<indexRule>
<style name="Term"/>
</indexRule>
</indexBuildItem>
<indexBuildItem>
<name>org.apache.uima.examples.tokenizer.Sentence</name>
<indexRule>
<style name="Breaking"/>
</indexRule>
</indexBuildItem>
<indexBuildItem>
<name>org.apache.uima.tutorial.Meeting</name>
<indexRule>
<style name="Annotation"/>
</indexRule>
</indexBuildItem>
<indexBuildItem>
<name>org.apache.uima.tutorial.RoomNumber</name>
<indexRule>
<style name="Annotation">
<attributeMappings>
<mapping>
<feature>building</feature>
<indexName>building</indexName>
</mapping>
</attributeMappings>
</style>
</indexRule>
</indexBuildItem>
<indexBuildItem>
<name>org.apache.uima.tutorial.DateAnnot</name>
<indexRule>
<style name="Annotation"/>
</indexRule>
</indexBuildItem>
<indexBuildItem>
<name>org.apache.uima.tutorial.TimeAnnot</name>
<indexRule>
<style name="Annotation"/>
</indexRule>
</indexBuildItem>
</indexBuildSpecification>]]></programlisting></para>
<para>The index build specification is a series of index build items, each of which identifies a CAS
annotation type (a subtype of <literal>uima.tcas.Annotation</literal> &ndash; see <olink
targetdoc="&uima_docs_ref;"/> <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>) and a style.</para>
<para>The first item in this example specifies that the annotation type
<literal>org.apache.uima.examples.tokenizer.Token</literal> should be indexed with the
<quote>Term</quote> style. This means that each span of text annotated by a Token will be considered a
single token for standard text search purposes.</para>
<para>The second item in this example specifies that the annotation type
<literal>org.apache.uima.examples.tokenizer.Sentence</literal> should be indexed with the
<quote>Breaking</quote> style. This means that each span of text annotated by a Sentence will be
considered a single sentence, which can affect the search engine's algorithm for matching queries. The
semantic search engine available from alphaWorks always requires tokens and sentences in order to index a
document.</para> <note>
<para>Requirements for Term and Breaking rules: The Semantic Search indexer from alphaWorks requires that
the items to be indexed as words be designated using the Term rule. </para></note>
<para>The remaining items all use the <quote>Annotation</quote> style. This indicates that each
annotation of the specified types will be stored in the index as a searchable span, with a name equal to the
annotation name (without the namespace).</para>
<para>Also, features of annotations can be indexed using the
<literal>&lt;attributeMappings&gt;</literal> subelement. In the example index build
specification, we declare that the <literal>building</literal> feature of the type
<literal>org.apache.uima.tutorial.RoomNumber</literal> should be indexed. The
<literal>&lt;indexName&gt;</literal> element can be used to map the feature name to a different name in
the index, but in this example we have opted to use the same name, <literal>building</literal>. </para>
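<para>For comparison, a build item that stores the feature under a different index name might look like the
following fragment. This is only a sketch based on the structure of the example above; the index name
<literal>bldg</literal> is a hypothetical choice and does not appear in the shipped
<literal>MeetingIndexBuildSpec.xml</literal>.</para>
<programlisting><![CDATA[<indexBuildItem>
  <name>org.apache.uima.tutorial.RoomNumber</name>
  <indexRule>
    <style name="Annotation">
      <attributeMappings>
        <mapping>
          <!-- name of the feature on the CAS type -->
          <feature>building</feature>
          <!-- hypothetical name under which the feature is indexed -->
          <indexName>bldg</indexName>
        </mapping>
      </attributeMappings>
    </style>
  </indexRule>
</indexBuildItem>]]></programlisting>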
<para> At the end of the batch or collection, the Semantic Search CAS Indexer builds the index. This index can
be queried with simple tokens or with XML tags.</para>
<para>Examples:
<itemizedlist spacing="compact">
<listitem>
<para>A query on the word <quote>UIMA</quote> will retrieve all documents that contain
that word. But a query of the type <literal>&lt;Meeting&gt;UIMA&lt;/Meeting&gt;</literal>
will retrieve only those documents that contain a Meeting annotation (produced by our
MeetingDetector TAE, for example), where that Meeting annotation contains the word
<quote>UIMA</quote>.</para>
</listitem>
<listitem>
<para>A query for <literal>&lt;RoomNumber building="Yorktown"/&gt;</literal> will return
documents that have a RoomNumber annotation whose <literal>building</literal> feature
contains the term <quote>Yorktown</quote>. </para>
</listitem>
</itemizedlist></para>
<para>More information on the syntax of these kinds of queries, called XML Fragments, can be found in
documentation for the semantic search engine component on <ulink
url="http://www.alphaworks.ibm.com/tech/uima"/>. For more information on the Index Build
Specification format, see the UIMA Javadocs for class
<literal>org.apache.uima.search.IndexBuildSpecification</literal>. Accessing the Javadocs is
described in <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.javadocs"/>.</para>
</section>
<section id="ugr.tug.application.search.cpe_with_semantic_search_cas_consumer">
<title>Building and Running a CPE including the Semantic Search CAS Indexer</title>
<titleabbrev>Using Semantic Search CAS Indexer</titleabbrev>
<para>The following steps illustrate how to build and run a CPE that uses the UIMA Meeting Detector TAE and the
Simple Token and Sentence Annotator, discussed in the <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/> along with a CAS Consumer
called the Semantic Search CAS Indexer, to build an index that allows you to query for documents based not
only on textual content but also on whether they contain mentions of Meetings detected by the TAE.</para>
<para>Run the CPE Configurator tool by executing the <literal>cpeGui</literal> shell script in the
<literal>bin</literal> directory of the UIMA SDK. (For instructions on using this tool, see the <olink
targetdoc="&uima_docs_tools;"/> <olink
targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.)</para>
<para>In the CPE Configurator tool, select the following components by browsing to their
descriptors:</para>
<itemizedlist spacing="compact">
<listitem>
<para>Collection Reader: <literal>%UIMA_HOME%/examples/descriptors/collectionReader/
FileSystemCollectionReader.xml</literal></para>
</listitem>
<listitem>
<para>Analysis Engine: include both of these; one produces tokens/sentences, required by the indexer
in all cases and the other produces the meeting annotations of interest.
<itemizedlist spacing="compact">
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/analysis_engine/SimpleTokenAndSentenceAnnotator.xml</literal></para></listitem>
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/tutorial/ex6/UIMAMeetingDetectorTAE.xml</literal></para></listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>Two CAS Consumers:
<itemizedlist spacing="compact">
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml</literal></para></listitem>
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml</literal></para></listitem>
</itemizedlist>
</para>
</listitem>
</itemizedlist>
<para>Set up parameters:</para>
<itemizedlist spacing="compact">
<listitem>
<para> Set the File System Collection Reader's <quote>Input Directory</quote> parameter to point to
the <literal>%UIMA_HOME%/examples/data</literal> directory.</para>
</listitem>
<listitem>
<para>Set the Semantic Search CAS Indexer's <quote>Indexing Specification Descriptor</quote>
parameter to point to <literal>%UIMA_HOME%/examples/descriptors/tutorial/search/
MeetingIndexBuildSpec.xml</literal></para>
</listitem>
<listitem>
<para>Set the Semantic Search CAS Indexer's <quote>Index Dir</quote> parameter to the
directory where you want the indexer to write its index files. <warning>
<para>The Indexer <emphasis>erases</emphasis> old versions of the files it creates in this
directory. </para></warning> </para>
</listitem>
<listitem>
<para>Set the XMI Writer CAS Consumer's <quote>Output Directory</quote> parameter to the
directory where you want to store the XMI files containing the results of your analysis for each
document. </para>
</listitem>
</itemizedlist>
<para>Click the Run button. Once the run completes, a statistics dialog should appear, in which you can see
how much time was spent in each of the components involved in the run.</para>
</section>
</section>
<section id="ugr.tug.application.search.query_tool">
<title>Semantic Search Query Tool</title>
<para>The Semantic Search component from UIMA on alphaWorks contains a simple tool for running queries
against a semantic search index. After building an index as described in the previous section, you can launch
this tool from the command prompt by running the <literal>semanticSearch</literal> shell script, found in the
<literal>/bin</literal> subdirectory of the Semantic Search UIMA install. If you are using Eclipse, and have
installed the UIMA examples, there will be a Run configuration you can use to conveniently launch this, called
<literal>UIMA Semantic Search</literal>. This will display the following screen:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image002.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of the Semantic Search tool set up to run
semantic queries against a semantic search index</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>Configure the fields on this screen as follows:
<itemizedlist spacing="compact">
<listitem>
<para>Set the <quote>Index Directory</quote> to the directory where you built your index. This is the
same value that you supplied for the <quote>Index Dir</quote> parameter of the Semantic Search CAS
Indexer in the CPE Configurator.</para>
</listitem>
<listitem>
<para>Set the <quote>XMI/XCAS Directory</quote> to the directory where you stored the results of your
analysis. This is the same value that you supplied for the <quote>Output Directory</quote>
parameter of XMI Writer CAS Consumer in the CPE Configurator.</para>
</listitem>
<listitem>
<para>Optionally, set the <quote>Original Documents Directory</quote> to the directory containing
the original plain text documents that were analyzed and indexed. This is only needed for the
<quote>View Original Document</quote> button.</para>
</listitem>
<listitem>
<para> Set the <quote>Type System Descriptor</quote> to the location of the descriptor that describes
your type system. For this example, this will be
<literal>%UIMA_HOME%/examples/descriptors/tutorial/ex4/TutorialTypeSystem.xml</literal>
</para>
</listitem>
</itemizedlist></para>
<para>Now, in the <quote>XML Fragments</quote> field, you can type in single words or XML queries where the XML
tags correspond to the labels in the index build specification file (e.g.
<literal>&lt;Meeting&gt;UIMA&lt;/Meeting&gt;</literal>). XML Fragments are described in the
documentation for the semantic search engine component on <ulink
url="http://www.alphaworks.ibm.com/tech/uima"/>.</para>
<para>After you enter a query and click the <quote>Search</quote> button, a list of hits will appear. Select
one of the documents and click <quote>View Analysis</quote> to view the document in the UIMA Annotation
Viewer.</para>
<para>The source code for the Semantic Search query program is in
<literal>examples/src/com/ibm/apache-uima/search/examples/SemanticSearchGUI.java</literal>. A simple
command-line query program is also provided in
<literal>examples/src/com/ibm/apache-uima/search/examples/SemanticSearch.java</literal>. Using these
as a model, you can build a query interface for your own application. For details on the Semantic Search
Engine query language and interface, see the documentation for the semantic search engine component on
<ulink url="http://www.alphaworks.ibm.com/tech/uima"/>.</para>
</section>
</section>
<section id="ugr.tug.application.remote_services">
<title>Working with Remote Services</title>
<note><para>This section describes older methods of working with Remote Services. These approaches do not support
some of the newer CAS features, such as multiple views and CAS Multipliers. These methods have been supplanted by
UIMA-AS, which has full support for the new CAS features.</para></note>
<para>The UIMA SDK allows you to easily take any Analysis Engine or CAS Consumer and deploy it as a service. That
Analysis Engine or CAS Consumer can then be called from a remote machine using various network
protocols.</para>
<para>The UIMA SDK provides support for two communications protocols:
<itemizedlist spacing="compact">
<listitem>
<para>SOAP, the standard Web Services protocol</para>
</listitem>
<listitem>
<para>Vinci, a lightweight version of SOAP, included as a part of Apache UIMA. </para>
</listitem>
</itemizedlist></para>
<para>The UIMA framework can make use of these services in two different ways:
<orderedlist>
<listitem>
<para>An Analysis Engine can create a proxy to a remote service; this proxy acts like a local component, but
connects to the remote. The proxy has limited error handling and retry capabilities. Both Vinci and SOAP
are supported.</para>
</listitem>
<listitem>
<para>A Collection Processing Engine can specify non-Integrated mode (see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.cpe.deploying_a_cpe"/>). The
CPE provides more extensive error recovery capabilities. This mode only supports the Vinci
communications protocol. </para>
</listitem>
</orderedlist></para>
<section id="ugr.tug.application.how_to_deploy_as_soap">
<title>Deploying a UIMA Component as a SOAP Service</title>
<titleabbrev>Deploying as SOAP Service</titleabbrev>
<para>To deploy a UIMA component as a SOAP Web Service, you need to first install the following software
components:
<itemizedlist spacing="compact">
<listitem>
<para>Apache Tomcat 5.0 or 5.5 (<ulink url="http://jakarta.apache.org/tomcat/"/>) </para>
</listitem>
<listitem>
<para>Apache Axis 1.3 or 1.4 (<ulink url="http://ws.apache.org/axis/"/>) </para>
</listitem>
</itemizedlist></para>
<para>Later versions of these components will likely also work, but have not been tested.</para>
<para>Next, you need to do the following setup steps:
<itemizedlist>
<listitem>
<para>Set the CATALINA_HOME environment variable to the location where Tomcat is installed.</para>
</listitem>
<listitem>
<para>Copy all of the JAR files from <literal>%UIMA_HOME%/lib</literal> to the
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/lib</literal> directory in your installation.</para>
</listitem>
<listitem>
<para>Copy the JAR files for the UIMA components that you wish to deploy to the
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/lib</literal> directory in your installation.</para>
</listitem>
<listitem>
<para><emphasis role="bold-italic">IMPORTANT</emphasis>: any time you add JAR files to Tomcat (for
instance, in the two steps above), you must shut down and restart Tomcat before it
<quote>notices</quote> them. So now, please shut down and restart Tomcat.</para>
</listitem>
<listitem>
<para>All the Java classes for the UIMA Examples are packaged in the
<literal>uima-examples.jar</literal> file which is included in the
<literal>%UIMA_HOME%/lib</literal> folder.</para>
</listitem>
<listitem>
<para>In addition, if an annotator needs to locate resource files in the classpath, those resources
must be available in the Axis classpath, so copy these also to
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/classes</literal>.</para>
<para>As an example, if you are deploying the GovernmentTitleRecognizer (found in
<literal>examples/descriptors/analysis_engine/
GovernmentOfficialRecognizer_RegEx_TAE</literal>) as a SOAP service, you need to copy the file
<literal>examples/resources/GovernmentTitlePatterns.dat</literal> into
<literal>.../WEB-INF/classes</literal>. </para>
</listitem>
</itemizedlist></para>
<para>Test your installation of Tomcat and Axis by starting Tomcat and going to
<literal>http://localhost:8080/axis/happyaxis.jsp</literal> in your browser. Check to be sure that
this reports that all of the required Axis libraries are present. One common missing file may be
activation.jar, which you can get from java.sun.com.</para>
<para>After completing these setup instructions, you can deploy Analysis Engines or CAS Consumers as SOAP web
services by using the <literal>deploytool</literal> utility, which is located in the
<literal>/bin</literal> directory of the UIMA SDK. <literal>deploytool</literal> is a command-line
utility that takes as an argument a web services deployment descriptor (WSDD file); example WSDD
files are provided in the <literal>examples/deploy/soap</literal> directory of the UIMA SDK. Deployment
descriptors have been provided for deploying and undeploying some of the example Analysis Engines that come
with the SDK.</para>
<para>As an example, the WSDD file for deploying the example Person Title annotator looks like this (important
parts are in bold italics):
<programlisting>&lt;deployment name="<emphasis role="bold-italic">PersonTitleAnnotator</emphasis>"
xmlns="http://xml.apache.org/axis/wsdd/"
xmlns:java="http://xml.apache.org/axis/wsdd/providers/java"&gt;
&lt;service name="<emphasis role="bold-italic">urn:PersonTitleAnnotator</emphasis>" provider="java:RPC"&gt;
&lt;parameter name="scope" value="Request"/&gt;
&lt;parameter name="className"
value="org.apache.uima.reference_impl.analysis_engine
.service.soap.AxisAnalysisEngineService_impl"/&gt;
&lt;parameter name="allowedMethods" value="getMetaData process"/&gt;
&lt;parameter name="allowedRoles" value="*"/&gt;
&lt;parameter name="resourceSpecifierPath"
value="<emphasis role="bold-italic">C:/Program Files/apache/uima/examples/
descriptors/analysis_engine/PersonTitleAnnotator.xml</emphasis>"/&gt;
&lt;parameter name="numInstances" value="3"/&gt;
&lt;!-- Type Mappings omitted from this document;
you will not need to edit them. --&gt;
&lt;typeMapping .../&gt;
&lt;typeMapping .../&gt;
&lt;typeMapping .../&gt;
&lt;/service&gt;
&lt;/deployment&gt;</programlisting></para>
<para>To modify this WSDD file to deploy your own Analysis Engine or CAS Consumer, just replace the areas
indicated in bold italics (deployment name, service name, and resource specifier path) with values
appropriate for your component.</para>
<para>The <literal>numInstances</literal> parameter specifies how many instances of your Analysis Engine
or CAS Consumer will be created. This allows your service to support multiple clients concurrently. When a
new request comes in, if all of the instances are busy, the new request will wait until an instance becomes
available.</para>
<para>To deploy the Person Title annotator service, issue the following command:
<programlisting>C:/Program Files/apache/uima/bin&gt;deploytool
../examples/deploy/soap/Deploy_PersonTitleAnnotator.wsdd</programlisting></para>
<para>Test if the deployment was successful by starting up a browser, pointing it to your Tomcat
installation's <quote>axis</quote> webpage (e.g., <literal>http://localhost:8080/axis</literal>)
and clicking on the List link. This should bring up a page which shows the deployed services, where you should
see the service you just deployed.</para>
<para>The other components can be deployed by replacing
<literal>Deploy_PersonTitleAnnotator.wsdd</literal> with one of the other Deploy descriptors in the
deploy directory. The deploytool utility can also undeploy services when passed one of the Undeploy
descriptors.</para> <note>
<para>The <literal>deploytool</literal> shell script assumes that the web services are to be installed at
<literal>http://localhost:8080/axis</literal>. If this is not the case, you will need to update the shell
script appropriately.</para> </note>
<para>Once you have deployed your component as a web service, you may call it from a remote machine. See <xref
linkend="ugr.tug.application.how_to_call_a_uima_service"/> for instructions.</para>
</section>
<section id="ugr.tug.application.how_to_deploy_a_vinci_service">
<title>Deploying a UIMA Component as a Vinci Service</title>
<titleabbrev>Deploying as a Vinci Service</titleabbrev>
<para>There are no software prerequisites for deploying a Vinci service. The necessary libraries are part of
the UIMA SDK. However, before you can use Vinci services you need to deploy the Vinci Naming Service (VNS), as
described in section <xref linkend="ugr.tug.application.vns"/>.</para>
<para>To deploy a service, you have to ensure that any components you want to include can be found on the class path.
One way to do this is to set the environment variable UIMA_CLASSPATH to the set of class paths you need for any
included components. Then run the <literal>startVinciService</literal> shell script, which is located
in the <literal>bin</literal> directory, and pass it the path to a Vinci deployment descriptor, for
example: <literal>C:\UIMA&gt;bin/startVinciService
../examples/deploy/vinci/Deploy_PersonTitleAnnotator.xml</literal>.
If you are running Eclipse, and have the <literal>uimaj-examples</literal> project
in your workspace, you can use the Eclipse Menu &rarr; Run &rarr; Run... and then
pick <quote>UIMA Start Vinci Service</quote>.</para>
<para>This example deployment descriptor looks like:
<programlisting>&lt;deployment name=<emphasis role="bold-italic">"Vinci Person Title Annotator Service"</emphasis>&gt;
&lt;service name=<emphasis role="bold-italic">"uima.annotator.PersonTitleAnnotator"</emphasis> provider="vinci"&gt;
&lt;parameter name="resourceSpecifierPath"
value=<emphasis role="bold-italic">"C:/Program Files/apache/uima/examples/descriptors/
analysis_engine/PersonTitleAnnotator.xml"</emphasis>/&gt;
&lt;parameter name="numInstances" value="1"/&gt;
&lt;parameter name="serverSocketTimeout" value="120000"/&gt;
&lt;/service&gt;
&lt;/deployment&gt;</programlisting></para>
<para>To modify this deployment descriptor to deploy your own Analysis Engine or CAS Consumer, just replace
the areas indicated in bold italics (deployment name, service name, and resource specifier path) with
values appropriate for your component.</para>
<para>The <literal>numInstances</literal> parameter specifies how many instances of your Analysis Engine
or CAS Consumer will be created. This allows your service to support multiple clients concurrently. When a
new request comes in, if all of the instances are busy, the new request will wait until an instance becomes
available.</para>
<para>The <literal>serverSocketTimeout</literal> parameter specifies the number of milliseconds
(default = 5 minutes) that the service will wait between requests to process something. After this amount of
time, the server will presume the client may have gone away, and it <quote>cleans up</quote>, releasing any
resources it is holding. The next call to process on the service will result in a cycle which will cause the
client to re-establish its connection with the service (some additional overhead).</para>
<para>There are two additional parameters that you can add to your deployment descriptor:
</para>
<itemizedlist>
<listitem><para><literal>&lt;parameter name="threadPoolMinSize" value="[Integer]"/></literal>:
Specifies the number of threads that the Vinci service creates on startup in order to
serve clients' requests.</para></listitem>
<listitem><para><literal>&lt;parameter name="threadPoolMaxSize" value="[Integer]"/></literal>:
Specifies the maximum number of threads that the Vinci service will create. When the number of
concurrent requests exceeds the <literal>threadPoolMinSize</literal>, additional threads will be
created to serve requests, until the <literal>threadPoolMaxSize</literal> is reached.</para></listitem>
</itemizedlist>
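<para>As an illustration, the two thread pool parameters sit alongside the other
<literal>&lt;parameter&gt;</literal> elements of the service. The following is a sketch only: it reuses the
PersonTitleAnnotator deployment descriptor shown above, and the pool sizes (2 and 8) are arbitrary values
chosen for the example, not recommended settings:
<programlisting><![CDATA[<deployment name="Vinci Person Title Annotator Service">
  <service name="uima.annotator.PersonTitleAnnotator" provider="vinci">
    <parameter name="resourceSpecifierPath"
      value="C:/Program Files/apache/uima/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml"/>
    <parameter name="numInstances" value="1"/>
    <parameter name="serverSocketTimeout" value="120000"/>
    <!-- Hypothetical values: start with 2 threads, grow to at most 8 -->
    <parameter name="threadPoolMinSize" value="2"/>
    <parameter name="threadPoolMaxSize" value="8"/>
  </service>
</deployment>]]></programlisting></para>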
<para>The <literal>startVinciService</literal> script takes two additional optional parameters. The
first one overrides the value of the <literal>VNS_HOST</literal> environment variable, allowing you to specify the name server
to use. The second parameter, if specified, must be a non-negative number that is unique on this server; it
identifies the instance of this service. When used, this number allows multiple instances of the same named
service to be started on one server; they all register with the Vinci name service and are made available to
client requests.</para>
<para>Once you have deployed your component as a web service, you may call it from a remote machine. See <xref
linkend="ugr.tug.application.how_to_call_a_uima_service"/> for instructions.</para>
</section>
<section id="ugr.tug.application.how_to_call_a_uima_service">
<title>How to Call a UIMA Service</title>
<titleabbrev>Calling a UIMA Service</titleabbrev>
<para>Once an Analysis Engine or CAS Consumer has been deployed as a service, it can be used from any UIMA
application, in the exact same way that a local Analysis Engine or CAS Consumer is used. For example, you can
call an Analysis Engine service from the Document Analyzer or use the CPE Configurator to build a CPE that
includes Analysis Engine and CAS Consumer services.</para>
<para>To do this, you use a <emphasis>service client descriptor</emphasis> in place of the usual Analysis
Engine or CAS Consumer Descriptor. A service client descriptor is a simple XML file that indicates the
location of the remote service and a few parameters. Example service client descriptors are provided in the
UIMA SDK under the directories <literal>examples/descriptors/soapService</literal> and
<literal>examples/descriptors/vinciService</literal>. The contents of these descriptors are
explained below.</para>
<para>Also, before you can call a SOAP service, you need to have the necessary Axis JAR files in your classpath.
If you use any of the scripts in the <literal>bin</literal> directory of the UIMA installation to launch your
application, such as documentAnalyzer, these JARs are added to the classpath automatically, using the
<literal>CATALINA_HOME</literal> environment variable. The required files are the following (all part
of the Apache Axis download):
<itemizedlist spacing="compact">
<listitem>
<para>activation.jar</para>
</listitem>
<listitem>
<para>axis.jar</para>
</listitem>
<listitem>
<para>commons-discovery.jar</para>
</listitem>
<listitem>
<para>commons-logging.jar</para>
</listitem>
<listitem>
<para>jaxrpc.jar</para>
</listitem>
<listitem>
<para>saaj.jar</para>
</listitem>
</itemizedlist></para>
<section id="ugr.tug.application.soap_service_client_descriptor">
<title>SOAP Service Client Descriptor</title>
<para>The descriptor used to call the PersonTitleAnnotator SOAP service from the example above is:
<programlisting><![CDATA[<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
<resourceType>AnalysisEngine</resourceType>
<uri>http://localhost:8080/axis/services/urn:PersonTitleAnnotator</uri>
<protocol>SOAP</protocol>
<timeout>60000</timeout>
</uriSpecifier>]]></programlisting></para>
<para>The &lt;resourceType&gt; element must contain either AnalysisEngine or CasConsumer. This
specifies what type of component you expect to be at the specified service address.</para>
<para>The &lt;uri&gt; element describes which service to call. It specifies the host (localhost, in this
example) and the service name (urn:PersonTitleAnnotator), which must match the name specified in the
deployment descriptor used to deploy the service.</para>
</section>
<section id="ugr.tug.application.vinci_service_client_descriptor">
<title>Vinci Service Client Descriptor</title>
<para>To call a Vinci service, a similar descriptor is used:
<programlisting><![CDATA[<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
<resourceType>AnalysisEngine</resourceType>
<uri>uima.annotator.PersonTitleAnnotator</uri>
<protocol>Vinci</protocol>
<timeout>60000</timeout>
<parameters>
<parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/>
<parameter name="VNS_PORT" value="9000"/>
</parameters>
</uriSpecifier>]]></programlisting></para>
<para>Note that Vinci uses a centralized naming server, so the host where the service is deployed does not
need to be specified. Only a name (<literal>uima.annotator.PersonTitleAnnotator</literal>) is given,
which must match the name specified in the deployment descriptor used to deploy the service.</para>
<para>The host and/or port where your Vinci Naming Service (VNS) server is running can be specified by the
optional &lt;parameter&gt; elements. If not specified, the value is taken from the specification given on
your Java command line (if present) using the <literal>-DVNS_HOST=&lt;host&gt;</literal> and
<literal>-DVNS_PORT=&lt;port&gt;</literal> system arguments. If not specified on the Java command
line, defaults are used: <literal>localhost</literal> for the <literal>VNS_HOST</literal>, and <literal>9000</literal>
for the <literal>VNS_PORT</literal>. See <xref linkend="ugr.tug.application.vns"/> for details on setting up a VNS server.</para>
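<para>The lookup order described above (descriptor parameter first, then the Java system property, then the
default) can be sketched in plain Java. This is an illustration of the precedence only, not the framework's
own code; the class and method names (<literal>VnsSettings</literal>, <literal>resolveHost</literal>,
<literal>resolvePort</literal>) are hypothetical:
<programlisting><![CDATA[// Hypothetical helper illustrating the documented precedence; not part of UIMA.
final class VnsSettings {
  static final String DEFAULT_HOST = "localhost";
  static final int DEFAULT_PORT = 9000;

  // The descriptor value wins; otherwise -DVNS_HOST; otherwise the default.
  static String resolveHost(String descriptorValue) {
    if (descriptorValue != null && !descriptorValue.isEmpty()) {
      return descriptorValue;
    }
    return System.getProperty("VNS_HOST", DEFAULT_HOST);
  }

  // The descriptor value wins; otherwise -DVNS_PORT; otherwise the default.
  static int resolvePort(String descriptorValue) {
    if (descriptorValue != null && !descriptorValue.isEmpty()) {
      return Integer.parseInt(descriptorValue);
    }
    return Integer.getInteger("VNS_PORT", DEFAULT_PORT);
  }
}]]></programlisting></para>
<para>Passing <literal>null</literal> for the descriptor value models a service client descriptor that omits
the corresponding <literal>&lt;parameter&gt;</literal> element.</para>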
</section>
</section>
<section id="ugr.tug.application.restrictions_on_remotely_deployed_services">
<title>Restrictions on remotely deployed services</title>
<para>Remotely deployed services are started on remote machines, using UIMA component descriptors on those
remote machines. These descriptors supply any configuration and resource parameters for the service
(configuration parameters are not transmitted from the calling instance to the remote one). Likewise, the
remote descriptors supply the type system specification for the remote annotators that will be run (the type
system of the calling instance is not transmitted to the remote one).</para>
<para>The remote service wrapper, when it receives a CAS from the caller, instantiates it for the remote
service, making instances of all types which the remote service specifies. Instances in the incoming
CAS of types for which the remote service has no type specification are kept aside, and when the remote
service returns the CAS to the caller, these instances are merged back into the CAS being
transmitted. Because of this design, a remote service which doesn't declare a type system
won't receive any type instances.</para> <note>
<para>This behavior may change in future releases, to one where configuration parameters and/or type systems
are transmitted to remote services.</para></note>
</section>
<section id="ugr.tug.application.vns">
<title>The Vinci Naming Service (VNS)</title>
<para>Vinci consists of components for building network-accessible services, clients for accessing those
services, and an infrastructure for locating and managing services. The primary infrastructure component
is the Vinci directory, known as VNS (for Vinci Naming Service).</para>
<para>On startup, Vinci services locate the VNS and provide it with information that is used by the VNS during
service discovery. Each Vinci service provides the name of the host machine on which it runs and the name of the
service. The VNS internally creates a binding for the service name and returns the port number on which the
Vinci service will wait for client requests. The VNS stores its bindings on the filesystem in a file called
<literal>vns.services</literal>.</para>
<para>In Vinci, services are identified by their service name. If there is more than one physical service with
the same service name, then Vinci assumes they are equivalent and will route queries to them randomly,
provided that they are all running on different hosts. You should therefore use a unique service name if you
don't want to conflict with other services listed in whatever VNS you have configured jVinci to use.</para>
<section id="ugr.tug.application.vns.starting">
<title>Starting VNS</title>
<para>To run the VNS use the <literal>startVNS</literal> script found in the
<literal>bin</literal> directory of the UIMA installation,
or launch it from Eclipse. If you've installed the <literal>uimaj-examples</literal> project,
it will supply a pre-configured launch script you can access in Eclipse by selecting
Menu &rarr; Run &rarr; Run... and picking <quote>UIMA Start VNS</quote>.</para>
<note><para>VNS runs on port 9000 by default so please make sure this port is
available. If you see the following exception:
<programlisting>java.net.BindException: Address already in use:
JVM_Bind</programlisting>
it indicates that another process is running on port 9000. In this case, add the parameter <literal>-p
&lt;port&gt;</literal> to the <literal>startVNS</literal> command, using
<literal>&lt;port&gt;</literal> to specify an alternative port to use. </para></note>
<para>When started, the VNS produces output similar to the following:
<programlisting><?db-font-size 80% ?>[10/6/04 3:44 PM | main] WARNING: Config file doesn't exist,
creating a new empty config file!
[10/6/04 3:44 PM | main] Loading config file : .vns.services
[10/6/04 3:44 PM | main] Loading workspaces file : .vns.workspaces
[10/6/04 3:44 PM | main] ====================================
(WARNING) Unexpected exception:
java.io.FileNotFoundException: .vns.workspaces (The system cannot find
the file specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.&lt;init&gt;(Unknown Source)
at java.io.FileInputStream.&lt;init&gt;(Unknown Source)
at java.io.FileReader.&lt;init&gt;(Unknown Source)
at org.apache.vinci.transport.vns.service.VNS.loadWorkspaces(VNS.java:339)
at org.apache.vinci.transport.vns.service.VNS.startServing(VNS.java:237)
at org.apache.vinci.transport.vns.service.VNS.main(VNS.java:179)
[10/6/04 3:44 PM | main] WARNING: failed to load workspace.
[10/6/04 3:44 PM | main] VNS Workspace : null
[10/6/04 3:44 PM | main] Loading counter file : .vns.counter
[10/6/04 3:44 PM | main] Could not load the counter file : .vns.counter
[10/6/04 3:44 PM | main] Starting backup thread,
using files .vns.services.bak
and .vns.services
[10/6/04 3:44 PM | main] Serving on port : 9000
[10/6/04 3:44 PM | Thread-0] Backup thread started
[10/6/04 3:44 PM | Thread-0] Saving to config file : .vns.services.bak
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; VNS is up and running! &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Type 'quit' and hit ENTER to terminate VNS &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving to config file : .vns.services
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving counter file : .vns.counter</programlisting></para>
<note>
<para>Disregard the <emphasis>java.io.FileNotFoundException</emphasis> reported for the
<literal>vns.workspaces</literal> file. It is only a warning, not a serious problem; VNS Workspace is a
feature of the VNS that is not critical. The important information to note is <literal>[10/6/04 3:44 PM |
main] Serving on port : 9000</literal>, which states the actual port where the VNS will listen for incoming
requests. All Vinci services and all clients connecting to services must provide the VNS port on the
command line if the port is not the default. Again, the default port is 9000. Please see <xref
linkend="ugr.tug.application.launching_vinci_services"/> below for details about the command
line and parameters.</para> </note>
</section>
<section id="ugr.tug.application.vns_files">
<title>VNS Files</title>
<para>The VNS maintains two external files:
<itemizedlist spacing="compact">
<listitem>
<para><literal>vns.services</literal></para>
</listitem>
<listitem>
<para><literal>vns.counter</literal></para>
</listitem>
</itemizedlist></para>
<para>These files are generated by the VNS in the directory from which the VNS is launched. Since these
files may contain old information, it is best to remove them before starting the VNS. This step ensures that
the VNS always has the most current information and will not attempt to connect to a service that has been
shut down.</para>
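<para>If you start the VNS from your own launcher, the cleanup step can be automated. The following is a minimal
sketch using only the JDK; the class name <literal>VnsFileCleaner</literal> is hypothetical, and the file names
are assumed to be exactly the two listed above, located in the directory from which the VNS is launched:
<programlisting><![CDATA[import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of UIMA or Vinci.
final class VnsFileCleaner {
  // File names as listed in this section (assumed to be the on-disk names).
  static final String[] STATE_FILES = { "vns.services", "vns.counter" };

  // Deletes stale VNS files from the launch directory and returns how many were removed.
  static int clean(Path launchDir) throws IOException {
    int removed = 0;
    for (String name : STATE_FILES) {
      if (Files.deleteIfExists(launchDir.resolve(name))) {
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    System.out.println("Removed " + clean(dir) + " stale VNS file(s)");
  }
}]]></programlisting></para>
<para>Run it with the VNS launch directory as the only argument, before starting the VNS. Files that do not
exist are skipped silently.</para>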
</section>
<section id="ugr.tug.application.launching_vinci_services">
<title>Launching Vinci Services</title>
<para>When launching a Vinci service, you must indicate which VNS the service will
connect to. A Vinci service is typically started using the script
<literal>startVinciService</literal>, found in the <literal>bin</literal>
directory of the UIMA installation. (If you're using Eclipse and have the
<literal>uimaj-examples</literal> project in the workspace, you will also find
an Eclipse launcher named <quote>UIMA Start Vinci Service</quote> you can use.)
For the script, the environment variable <literal>VNS_HOST</literal> should
be set to the name or IP address of the machine hosting the Vinci Naming Service. The
default is localhost, the machine the service is deployed on. This name can also be
passed as the second argument to the <literal>startVinciService</literal> script. The default port
for VNS is 9000 but can be overridden with the <literal>VNS_PORT</literal> environment
variable.</para>
<para>If you write your own startup script, to define Vinci&apos;s default VNS you must provide the
following JVM parameters:
<programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9000 ...</programlisting></para>
<para>The above setting is for a VNS running on the same machine as the service. If the
VNS is deployed on a different machine, change the JVM parameters as follows:
<programlisting>java -DVNS_HOST=&lt;host&gt; -DVNS_PORT=9000 ...</programlisting></para>
<para>where <quote>&lt;host&gt;</quote> is the name or IP address of the machine where the VNS is running.</para>
<note>
<para>VNS runs on port 9000 by default. If you see the following exception:
<programlisting>(WARNING) Unexpected exception:
org.apache.vinci.transport.ServiceDownException:
VNS inaccessible: java.net.ConnectException: Connection refused: connect</programlisting>
then either the VNS is not running, or the VNS is running but using a different port. To correct the
latter, set the environment variable <literal>VNS_PORT</literal> to the correct port before starting the service.</para>
</note>
<para>To get the right port check the VNS output for something similar to the following:
<programlisting>[10/6/04 3:44 PM | main] Serving on port : 9000</programlisting></para>
<para>It is printed by the VNS on startup.</para>
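<para>If a launcher script captures the VNS console output, the port can be extracted from that line
mechanically. The following is a small sketch using only the JDK; the class name
<literal>VnsPortParser</literal> is hypothetical, and the pattern assumes the log line has the form shown above:
<programlisting><![CDATA[import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of UIMA or Vinci.
final class VnsPortParser {
  private static final Pattern SERVING =
      Pattern.compile("Serving on port\\s*:\\s*(\\d+)");

  // Returns the port announced in a VNS log line, or -1 if the line does not announce one.
  static int parsePort(String logLine) {
    Matcher m = SERVING.matcher(logLine);
    return m.find() ? Integer.parseInt(m.group(1)) : -1;
  }
}]]></programlisting></para>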
</section>
</section>
<section id="ugr.tug.configuring_timeout_settings">
<title>Configuring Timeout Settings</title>
<para>UIMA has several timeout specifications, summarized here. The timeouts associated with remote
services are discussed below. In addition there are timeouts that can be specified for:
<itemizedlist>
<listitem><para><emphasis role="bold">Acquiring an empty CAS from a CAS Pool:</emphasis>
See <xref linkend="ugr.tug.applications.multi_threaded"/>.</para></listitem>
<listitem><para><emphasis role="bold">Reassembling chunks of a large document:</emphasis>
See <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters"/></para>
</listitem>
</itemizedlist></para>
<para>If your application uses remote UIMA services it is important to consider how to set the
<emphasis>timeout</emphasis> values appropriately. This is particularly important if your service can
take a long time to process each request.</para>
<para>There are two types of timeout settings in UIMA, the <emphasis>client timeout</emphasis> and the
<emphasis>server socket timeout</emphasis>. The client timeout is usually the more important of the two; it
specifies how long the client is willing to wait for the service to process each CAS. The client timeout can be
specified for both Vinci and SOAP. The server socket timeout (Vinci only) specifies how long the service
holds the connection open between calls from the client. After this amount of time, the server presumes that
the client may have gone away and <quote>cleans up</quote>, releasing any resources it is holding. The
next call to process on the service causes the client to re-establish its connection with the service,
which adds some overhead.</para>
<section id="ugr.tug.setting_client_timeout">
<title>Setting the Client Timeout</title>
<para>The way to set the client timeout is different depending on what deployment mode you use in your CPE (if
any).</para>
<para>If you are using the default <quote>integrated</quote> deployment mode in your CPE, or if you are not
using a CPE at all, then the client timeout is specified in your Service Client Descriptor (see <xref
linkend="ugr.tug.application.how_to_call_a_uima_service"/>). For example:</para>
<programlisting>&lt;uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
&lt;resourceType>AnalysisEngine&lt;/resourceType>
&lt;uri>uima.annotator.PersonTitleAnnotator&lt;/uri>
&lt;protocol>Vinci&lt;/protocol>
<emphasis role="bold-italic">&lt;timeout>60000&lt;/timeout></emphasis>
&lt;parameters>
&lt;parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/>
&lt;parameter name="VNS_PORT" value="9000"/>
&lt;/parameters>
&lt;/uriSpecifier></programlisting>
<para>The client timeout in this example is <literal>60000</literal>. This value specifies the number of
milliseconds that the client will wait for the service to respond to each request. In this example, the
client will wait for one minute.</para>
<para>If the service does not respond within this amount of time, processing of the current CAS will abort. If
you called the <literal>AnalysisEngine.process</literal> method directly from your application, an
Exception will be thrown. If you are running a CPE, what happens next is dependent on the error handling
settings in your CPE descriptor (see <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling"/>
). The default action is for the CPE to terminate, but you can override this. </para>
<para>If you are using the <quote>managed</quote> or <quote>non-managed</quote> deployment mode in your
CPE, then the client timeout is specified in your CPE descriptor's <literal>errorHandling</literal>
element. For example:</para>
<programlisting><![CDATA[<errorHandling>
<maxConsecutiveRestarts .../>
<errorRateThreshold .../>
<timeout max="60000"/>
</errorHandling>]]></programlisting>
<para>As in the previous example, the client timeout is set to <literal>60000</literal>, and this
specifies the number of milliseconds that the client will wait for the service to respond to each
request.</para>
<para>If the service does not respond within the specified amount of time, the action is determined by the
settings for <literal>maxConsecutiveRestarts</literal> and
<literal>errorRateThreshold</literal>. These settings support such things as restarting the process
(for <quote>managed</quote> deployment mode), dropping and reestablishing the connection (for
<quote>non-managed</quote> deployment mode), and removing the offending service from the pipeline. See
<olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling"/>
for details.</para>
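<para>For orientation, a fully populated <literal>errorHandling</literal> element might look like the
following sketch. The <literal>action</literal> and <literal>value</literal> attributes follow the CPE
descriptor reference as understood here, and every value shown is a hypothetical choice for illustration;
check the reference cited above before relying on them:
<programlisting><![CDATA[<errorHandling>
  <!-- Hypothetical values: give up after 3 consecutive restarts -->
  <maxConsecutiveRestarts action="terminate" value="3"/>
  <!-- Hypothetical values: tolerate up to 10 errors per 1000 CASes -->
  <errorRateThreshold action="continue" value="10/1000"/>
  <!-- Client timeout in milliseconds (one minute) -->
  <timeout max="60000"/>
</errorHandling>]]></programlisting></para>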
<para>Note that the client timeout does not apply to the <literal>GetMetaData</literal>
request that is made when the client first connects to the service. This call is typically
very fast and does not need a large timeout (the default is 60 seconds). However, if many
clients are competing for a small number of services, it may be necessary to increase this
value. See <olink targetdoc="&uima_docs_ref;"/> <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.service_client"/></para>
</section>
<section id="ugr.tug.setting_server_socket_timeout">
<title>Setting the Server Socket Timeout</title>
<para>The Server Socket Timeout applies only to Vinci services, and is specified in the Vinci deployment
descriptor as discussed in section <xref
linkend="ugr.tug.application.how_to_deploy_a_vinci_service"/>. For example:
<programlisting>&lt;deployment name="Vinci Person Title Annotator Service"&gt;
&lt;service name="uima.annotator.PersonTitleAnnotator" provider="vinci"&gt;
&lt;parameter name="resourceSpecifierPath"
value="C:/Program Files/apache/uima/examples/descriptors/
analysis_engine/PersonTitleAnnotator.xml"/&gt;
&lt;parameter name="numInstances" value="1"/&gt;
&lt;parameter name="serverSocketTimeout" value=<emphasis role="bold-italic">"120000"</emphasis>/&gt;
&lt;/service&gt;
&lt;/deployment&gt;</programlisting>
</para>
<para>The server socket timeout here is set to <literal>120000</literal> milliseconds, or two minutes.
This parameter specifies how long the service will wait between requests to process something. After this
amount of time, the server will presume the client may have gone away - and it <quote>cleans up</quote>,
releasing any resources it is holding. The next call to process on the service will cause the client to
re-establish its connection with the service (some additional overhead). The service may print a
<quote>Read Timed Out</quote> message to the console when the server socket timeout elapses.</para>
<para>In most cases, it is not a problem if the server socket timeout elapses. The client will simply
reconnect. However, if you notice <quote>Read Timed Out</quote> messages on your server console,
followed by other connection problems, it is possible that the client is having trouble reconnecting for
some reason. In this situation it may help increase the stability of your application if you increase the
server socket timeout so that it does not elapse during actual processing.</para>
</section>
</section>
</section>
<section id="ugr.tug.application.increasing_performance_using_parallelism">
<title>Increasing performance using parallelism</title>
<para>There are several ways to exploit parallelism to increase performance in the UIMA Framework. These range
from running with additional threads within one Java virtual machine on one host (which might be a
multi-processor or hyper-threaded host) to deploying analysis engines on a set of remote machines.</para>
<para>The Collection Processing facility in UIMA provides the ability to scale the pipeline of analysis
engines. This scale-out runs multiple threads within the Java virtual machine running the CPM, one for each
replicated pipeline. To activate it, in the <literal>&lt;casProcessors&gt;</literal> descriptor
element, set the attribute <literal>processingUnitThreadCount</literal>, which specifies the number of
replicated processing pipelines, to a value greater than 1, and ensure that the size of the CAS pool is equal to or
greater than this number (the attribute of <literal>&lt;casProcessors&gt;</literal> to set is
<literal>casPoolSize</literal>). For more details on these settings, see <olink
targetdoc="&uima_docs_ref;"/> <olink
targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors"/>.</para>
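<para>As a sketch of these two attributes used together, the following fragment requests four replicated
pipelines and a CAS pool of six. Both numbers are arbitrary illustration values, and the nested
<literal>&lt;casProcessor&gt;</literal> elements are omitted:
<programlisting><![CDATA[<casProcessors casPoolSize="6" processingUnitThreadCount="4">
  <!-- one <casProcessor> element per component in the pipeline goes here -->
</casProcessors>]]></programlisting></para>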
<para>For deployments that incorporate remote analysis engines in the Collection Manager pipeline, running
on multiple remote hosts, scale-out is supported through the Vinci naming service. If multiple instances of
a service with the same name, but running on different hosts, are registered with the Vinci Name Server, it will
assign these instances to incoming requests.</para>
<para>There are two modes supported: a <quote>random</quote> assignment, and an <quote>exclusive</quote>
one. The <quote>random</quote> mode distributes load using an algorithm that selects a service instance at
random. The UIMA framework supports this only for the case where all of the instances are running on unique
hosts; the framework does not support starting two or more instances on the same host.</para>
<para>The exclusive mode dedicates a particular remote instance to each Collection Manager pipeline instance.
This mode is enabled by adding a configuration parameter in the
&lt;casProcessor&gt; section of the CPE descriptor:</para>
<literallayout>&lt;deploymentParameters&gt;
&lt;parameter name="service-access" value="exclusive" /&gt;
&lt;/deploymentParameters&gt;</literallayout>
<para>If this is not specified, the <quote>random</quote> mode is used.</para>
<para>In addition, remote UIMA engine services can be started with a parameter that specifies the number of
instances the service should support (see the <literal>&lt;parameter name="numInstances"&gt;</literal>
XML element in the remote deployment descriptor, <xref linkend="ugr.tug.application.remote_services"/>).
Specifying more than one causes the service wrapper for the analysis engine to use multi-threading (within the
single Java Virtual Machine &ndash; which can take advantage of multi-processor and hyper-threaded
architectures).</para> <note>
<para>When using Vinci in <quote>exclusive</quote> mode (see service access under <olink
targetdoc="&uima_docs_ref;"/> <olink
targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.deployment_parameters"/>
), only one thread is used. To achieve multi-processing on a server in this case, use multiple instances of the
service, instead of multiple threads (see <xref
linkend="ugr.tug.application.how_to_deploy_a_vinci_service"/>).</para> </note>
</section>
<section id="ugr.tug.application.jmx">
<title>Monitoring AE Performance using JMX</title>
<para>As of version 2, UIMA supports remote monitoring of Analysis Engine performance via the Java Management
Extensions (JMX) API. JMX is a standard part of the Java Runtime Environment v5.0; there is also a reference
implementation available from Sun for Java 1.4. An introduction to JMX is available from Sun here: <ulink
url="http://java.sun.com/developer/technicalArticles/J2SE/jmx.html"/>. When you run a UIMA application with a
JVM that supports JMX, the UIMA framework will automatically detect the presence of JMX and will register
<emphasis>MBeans</emphasis> that provide access to the performance statistics.</para>
<para>Note: The Sun JVM supports local monitoring; for others you can configure your
application for remote monitoring (even when on the same host) by specifying a unique port number, e.g.
<literal>
-Dcom.sun.management.jmxremote.port=1098
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false</literal></para>
<para>Now, you can use any JMX client to view the statistics. JDK 5.0 or later provides a standard client that you can use.
Simply open a command prompt, make sure the JDK <literal>bin</literal> directory is in your path, and
execute the <literal>jconsole</literal> command. This should bring up a window allowing you to
select one of the local JMX-enabled applications currently running, or to enter a remote (or local) host and
port, e.g. localhost:1098. The next screen will show a summary of
information about the Java process that you connected to. Click on the <quote>MBeans</quote> tab, then expand
<quote>org.apache.uima</quote> in the tree at the left. You should see a view like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of JMX console monitoring UIMA components</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>Each of the nodes under <quote><literal>org.apache.uima</literal></quote> in the tree represents one
of the UIMA Analysis Engines in the application that you connected to. You can select one of the analysis engines
to view its performance statistics in the view at the right.</para>
<para>Probably the most useful statistic is <quote>CASes Per Second</quote>, which is the number of CASes that
this AE has processed divided by the amount of time spent in the AE's process method, in seconds. Note that this is
the total elapsed time, not CPU time. Even so, it can be useful to compare the <quote>CASes Per Second</quote>
numbers of all of your Analysis Engines to discover where the bottlenecks occur in your application.</para>
<para>The <literal>AnalysisTime</literal>, <literal>BatchProcessCompleteTime</literal>, and
<literal>CollectionProcessCompleteTime</literal> properties show the total elapsed time, in
milliseconds, that has been spent in the AnalysisEngine's <literal>process()</literal>,
<literal>batchProcessComplete()</literal>, and <literal>collectionProcessComplete()</literal> methods,
respectively. (Note that for
CAS Multipliers, time spent in the <literal>hasNext()</literal> and <literal>next()</literal> methods is
also counted towards the AnalysisTime.)</para>
<para>Note that once your UIMA application terminates, you can no longer view the statistics through the JMX
console. If you want to use JMX to view processes that have completed, you will need to write your application so
that the JVM remains running after processing completes, waiting for some user signal before
terminating.</para>
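<para>The statistics can also be read programmatically from inside the application's own JVM, for example just
before it exits. The following is a minimal sketch that uses only the standard JMX API of the JDK. The class
name <literal>UimaMBeanDump</literal> is hypothetical, and the sketch assumes that UIMA registered its MBeans
with the platform MBean server under the <literal>org.apache.uima</literal> domain seen in the console tree.
It makes no assumption about attribute names and simply reports every readable attribute it finds:
<programlisting><![CDATA[import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper; uses only standard JMX calls.
class UimaMBeanDump {

  // Collects every readable attribute of every MBean registered in the given domain.
  static Map<String, Object> readAttributes(MBeanServer server, String domain)
      throws Exception {
    Map<String, Object> values = new TreeMap<String, Object>();
    for (ObjectName name : server.queryNames(new ObjectName(domain + ":*"), null)) {
      for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
        if (!attr.isReadable()) {
          continue;
        }
        String key = name + " " + attr.getName();
        try {
          values.put(key, server.getAttribute(name, attr.getName()));
        } catch (Exception e) {
          // An MBean may refuse an individual attribute; record that instead of failing.
          values.put(key, "unavailable: " + e);
        }
      }
    }
    return values;
  }

  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    for (Map.Entry<String, Object> entry
        : readAttributes(server, "org.apache.uima").entrySet()) {
      System.out.println(entry.getKey() + " = " + entry.getValue());
    }
  }
}]]></programlisting></para>
<para>Calling <literal>readAttributes</literal> at the end of your application's processing, in the same JVM
that ran the Analysis Engines, is one way to keep the numbers without holding the process open for a JMX
console.</para>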
<para>It is possible to override the default JMX MBean names UIMA uses, for
example to better organize the UIMA MBeans with respect to MBeans exposed by
other parts of your application. This is done using the
<literal>AnalysisEngine.PARAM_MBEAN_NAME_PREFIX</literal> additional parameter
when creating your AnalysisEngine:
<programlisting> //set up Map with custom JMX MBean name prefix
Map paramMap = new HashMap();
paramMap.put(AnalysisEngine.PARAM_MBEAN_NAME_PREFIX,
"org.myorg:category=MyApp");
// create Analysis Engine
AnalysisEngine ae =
UIMAFramework.produceAnalysisEngine(specifier, paramMap);
</programlisting>
</para>
<para>Similarly, you can use the <literal>AnalysisEngine.PARAM_MBEAN_SERVER</literal>
parameter to specify a particular instance of a JMX MBean Server with which UIMA
should register the MBeans. If not specified, the default is to register with
the platform MBeanServer (Java 5+ only).</para>
<para>More information on JMX can be found in the <ulink
url="http://java.sun.com/j2se/1.5.0/docs/api/javax/management/package-summary.html#package_description">
Java 5 documentation</ulink>.</para>
</section>
<section id="tug.application.pto">
<title>Performance Tuning Options</title>
<para>
There are a small number of performance tuning options available to
influence the runtime behavior of UIMA applications. Performance
tuning options need to be set programmatically when an analysis
engine is created. You simply create a Java Properties object with
the relevant options and pass it to the UIMA framework on the call
to create an analysis engine. Below is an example.
<programlisting>
XMLParser parser = UIMAFramework.getXMLParser();
ResourceSpecifier spec = parser.parseResourceSpecifier(
new XMLInputSource(descriptorFile));
// Create a new properties object to hold the settings.
Properties performanceTuningSettings = new Properties();
// Set the initial CAS heap size.
performanceTuningSettings.setProperty(
UIMAFramework.CAS_INITIAL_HEAP_SIZE,
"1000000");
// Disable JCas cache.
performanceTuningSettings.setProperty(
UIMAFramework.JCAS_CACHE_ENABLED,
"false");
// Create a wrapper properties object that can
// be passed to the framework.
Properties additionalParams = new Properties();
// Set the performance tuning properties as value to
// the appropriate parameter.
additionalParams.put(
Resource.PARAM_PERFORMANCE_TUNING_SETTINGS,
performanceTuningSettings);
// Create the analysis engine with the parameters.
// The second, unused argument here is a custom
// resource manager.
this.ae = UIMAFramework.produceAnalysisEngine(
spec, null, additionalParams);
</programlisting>
</para>
<para>
The following options are supported:
<itemizedlist>
<listitem>
<para><literal>UIMAFramework.JCAS_CACHE_ENABLED</literal>: allows you to disable
the JCas cache (true/false). The JCas cache is an internal data structure that caches any JCas
object created
by the CAS. This may result in better performance for applications that make extensive use of
the JCas, but also incurs a steep memory overhead. If you're processing large documents and have
memory issues, you should disable the cache. In general, run a few experiments to
see which setting works better for your application. The JCas cache is enabled by default.
</para>
</listitem>
<listitem>
<para><literal>UIMAFramework.CAS_INITIAL_HEAP_SIZE</literal>: set the initial CAS heap size in
number of cells (integer valued). The CAS uses 32-bit integer cells, so four times the initial
size is the
approximate minimum size of the CAS in bytes. This is another space/time trade-off, as growing
the CAS heap is relatively expensive. On the other hand, setting the initial size too high
wastes memory. Unless you know you are processing very small or very large documents, you should
probably leave this option unchanged.
</para>
</listitem>
<listitem>
<para><literal>UIMAFramework.PROCESS_TRACE_ENABLED</literal>: enable the process trace mechanism
(true/false). When enabled, UIMA tracks the time spent in individual components of an aggregate
AE or CPE. For more information, see the API documentation of
<literal>org.apache.uima.util.ProcessTrace</literal>.
</para>
</listitem>
<listitem>
<para><literal>UIMAFramework.SOCKET_KEEPALIVE_ENABLED</literal>: enable socket KeepAlive
(true/false). This setting is currently only supported by Vinci clients. Defaults to
<literal>true</literal>.
</para>
</listitem>
</itemizedlist>
</para>
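<para>To make the heap size arithmetic concrete, the following sketch converts an initial heap size in cells
to the approximate minimum number of bytes, using the four-bytes-per-cell figure given above. The class name
<literal>CasHeapEstimate</literal> is hypothetical, and the result is only the lower bound described in the
text, not a measurement of actual memory use:
<programlisting><![CDATA[// Hypothetical helper; plain arithmetic, not part of UIMA.
final class CasHeapEstimate {
  static final int BYTES_PER_CELL = 4; // one 32-bit integer per cell

  // Approximate minimum CAS heap size in bytes for an initial size given in cells.
  static long minimumBytes(long cells) {
    return cells * BYTES_PER_CELL;
  }

  public static void main(String[] args) {
    long cells = 1000000L; // the CAS_INITIAL_HEAP_SIZE value used in the example above
    System.out.println(cells + " cells = about " + minimumBytes(cells) + " bytes");
  }
}]]></programlisting></para>
<para>For the value of 1000000 cells used in the example above, this gives roughly 4000000 bytes per
CAS.</para>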
</section>
</chapter>