<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tutorials_and_users_guides/tug.application/">
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent">
%uimaents;
]>
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.tug.application"> | |
<title>Application Developer's Guide</title> | |
<para>This chapter describes how to develop an application using the Unstructured Information Management | |
Architecture (UIMA). The term <emphasis>application</emphasis> describes a program that provides end-user | |
functionality. A UIMA application incorporates one or more UIMA components such as Analysis Engines, | |
Collection Processing Engines, a Search Engine, and/or a Document Store and adds application-specific logic | |
and user interfaces.</para> | |
<section id="ugr.tug.appication.uimaframework_class"> | |
<title>The UIMAFramework Class</title> | |
<para>An application developer's starting point for accessing UIMA framework functionality is the | |
<literal>org.apache.uima.UIMAFramework</literal> class. The following is a short introduction to some | |
important methods on this class. Several of these methods are used in examples in the rest of this chapter. For | |
more details, see the Javadocs (in the docs/api directory of the UIMA SDK). | |
<itemizedlist> | |
<listitem> | |
<para>UIMAFramework.getXMLParser(): Returns an instance of the UIMA XML Parser class, which then can be | |
used to parse the various types of UIMA component descriptors. Examples of this can be found in the | |
remainder of this chapter.</para> | |
</listitem> | |
<listitem> | |
<para>UIMAFramework.produceXXX(ResourceSpecifier): There are various produce methods that are used | |
to create different types of UIMA components from their descriptors. The argument type, | |
ResourceSpecifier, is the base interface that subsumes all types of component descriptors in UIMA. You | |
can get a ResourceSpecifier from the XMLParser. Examples of produce methods are: | |
<itemizedlist> | |
<listitem> | |
<para>produceAnalysisEngine</para> | |
</listitem> | |
<listitem> | |
<para>produceCasConsumer</para> | |
</listitem> | |
<listitem> | |
<para>produceCasInitializer</para> | |
</listitem> | |
<listitem> | |
<para>produceCollectionProcessingEngine</para> | |
</listitem> | |
<listitem> | |
<para>produceCollectionReader</para> | |
</listitem> | |
</itemizedlist> | |
There are other variations of each of these methods that take additional, optional arguments. See the | |
Javadocs for details. </para> | |
</listitem> | |
<listitem> | |
<para>UIMAFramework.getLogger(&lt;optional-logger-name&gt;): Gets a reference to the UIMA Logger,
to which you can write log messages. If no logger name is passed, the name of the returned logger instance | |
is <quote>org.apache.uima</quote>.</para> | |
</listitem> | |
<listitem> | |
<para>UIMAFramework.getVersionString(): Gets the number of the UIMA version you are using.</para> | |
</listitem> | |
<listitem> | |
<para>UIMAFramework.newDefaultResourceManager(): Gets an instance of the UIMA ResourceManager. The | |
key method on ResourceManager is setDataPath, which allows you to specify the location where UIMA | |
components will go to look for their external resources. Once you've obtained and initialized a | |
ResourceManager, you can pass it to any of the produceXXX methods. </para> | |
</listitem> | |
</itemizedlist></para> | |
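<para>The methods listed above can be combined along the following lines. This is a minimal sketch rather than
SDK code: the class name <literal>FrameworkTour</literal>, its method names, and the descriptor and data path
arguments are hypothetical placeholders, and passing <literal>null</literal> for the additional-parameters map
is assumed to be acceptable when no extra parameters are needed.</para>
<programlisting>import java.net.MalformedURLException;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.resource.ResourceManager;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.Level;
import org.apache.uima.util.Logger;
import org.apache.uima.util.XMLInputSource;

// Hypothetical helper class, for illustration only
public class FrameworkTour {

  // Creates a ResourceManager and tells it where to look for external resources
  public static ResourceManager newResourceManager(String dataPath)
      throws MalformedURLException {
    ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
    resMgr.setDataPath(dataPath);
    return resMgr;
  }

  // Parses a descriptor and produces an AE that uses the ResourceManager above
  public static AnalysisEngine createEngine(String descriptorPath, String dataPath)
      throws Exception {
    Logger logger = UIMAFramework.getLogger(FrameworkTour.class);
    logger.log(Level.INFO, "UIMA version: " + UIMAFramework.getVersionString());

    XMLInputSource in = new XMLInputSource(descriptorPath);
    ResourceSpecifier specifier =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);

    // null: no additional parameters (assumption, see text)
    return UIMAFramework.produceAnalysisEngine(
        specifier, newResourceManager(dataPath), null);
  }
}</programlisting>
<para>An application might call <literal>FrameworkTour.createEngine("MyDescriptor.xml", dataPath)</literal>
once at startup and reuse the returned Analysis Engine afterwards.</para>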
</section> | |
<section id="ugr.tug.application.using_aes"> | |
<title>Using Analysis Engines</title> | |
<para>This section describes how to add analysis capability to your application by using Analysis Engines | |
developed using the UIMA SDK. An <emphasis>Analysis Engine (AE)</emphasis> is a component that analyzes | |
artifacts (e.g. documents) and infers information about them.</para> | |
<para>An Analysis Engine consists of two parts - Java classes (typically packaged as one or more JAR files) and | |
<emphasis>AE descriptors</emphasis> (one or more XML files). You must put the Java classes in your | |
application's class path, but thereafter you will not need to directly interact with them. The UIMA | |
framework insulates you from this by providing a standard AnalysisEngine interface.</para>
<para>The term <emphasis>Text Analysis Engine (TAE)</emphasis> is sometimes used to describe an Analysis | |
Engine that analyzes a text document. In the UIMA SDK v1.x, there was a TextAnalysisEngine interface that was | |
commonly used. However, as of the UIMA SDK v2.0, this interface has been deprecated and all applications should | |
switch to using the standard AnalysisEngine interface.</para> | |
<para>The AE descriptor XML files contain the configuration settings for the Analysis Engine as well as a | |
description of the AE's input and output requirements. You may need to edit these files in order to | |
configure the AE appropriately for your application - the supplier of the AE may have provided documentation | |
(or comments in the XML descriptor itself) about how to do this.</para> | |
<section id="ugr.tug.application.instantiating_an_ae"> | |
<title>Instantiating an Analysis Engine</title> | |
<para>The following code shows how to instantiate an AE from its XML descriptor: | |
<programlisting> //get Resource Specifier from XML file | |
XMLInputSource in = new XMLInputSource("MyDescriptor.xml"); | |
ResourceSpecifier specifier = | |
UIMAFramework.getXMLParser().parseResourceSpecifier(in); | |
//create AE here | |
AnalysisEngine ae = | |
UIMAFramework.produceAnalysisEngine(specifier);</programlisting></para> | |
<para>The first two lines parse the XML descriptor (for AEs with multiple descriptor files, one of them is the | |
<quote>main</quote> descriptor - the AE documentation should indicate which it is). The result of the parse | |
is a <literal>ResourceSpecifier</literal> object. The third line of code invokes a static factory method | |
<literal>UIMAFramework.produceAnalysisEngine</literal>, which takes the specifier and instantiates | |
an <literal>AnalysisEngine</literal> object.</para> | |
<para>There is one caveat to using this approach - the Analysis Engine instance that you create will not support | |
multiple threads running through it concurrently. If you need to support this, see <xref | |
linkend="ugr.tug.applications.multi_threaded"/>.</para> | |
</section> | |
<section id="ugr.tug.application.analyzing_text_documents"> | |
<title>Analyzing Text Documents</title> | |
<para>There are two ways to use the AE interface to analyze documents. You can either use the | |
<emphasis>JCas</emphasis> interface, which is described in detail in <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.jcas"/> or you can directly use the | |
<emphasis>CAS</emphasis> interface, which is described in detail in <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>. Besides text documents, other kinds of | |
artifacts can also be analyzed; see <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aas"/> for more information.</para> | |
<para>The basic structure of your application will look similar in both cases:</para> | |
<para>Using the JCas | |
<programlisting> //create a JCas, given an Analysis Engine (ae) | |
JCas jcas = ae.newJCas(); | |
//analyze a document | |
jcas.setDocumentText(doc1text); | |
ae.process(jcas); | |
doSomethingWithResults(jcas); | |
jcas.reset(); | |
//analyze another document | |
jcas.setDocumentText(doc2text); | |
ae.process(jcas); | |
doSomethingWithResults(jcas); | |
jcas.reset(); | |
... | |
//done | |
ae.destroy();</programlisting></para> | |
<para>Using the CAS | |
<programlisting>//create a CAS | |
CAS aCasView = ae.newCAS(); | |
//analyze a document | |
aCasView.setDocumentText(doc1text); | |
ae.process(aCasView); | |
doSomethingWithResults(aCasView); | |
aCasView.reset(); | |
//analyze another document | |
aCasView.setDocumentText(doc2text); | |
ae.process(aCasView); | |
doSomethingWithResults(aCasView); | |
aCasView.reset(); | |
... | |
//done | |
ae.destroy();</programlisting></para> | |
<para>First, you create the CAS or JCas that you will use. Then, you repeat the following four steps for each | |
document:</para> | |
<orderedlist spacing="compact"> | |
<listitem> | |
<para>Put the document text into the CAS or JCas.</para> | |
</listitem> | |
<listitem> | |
<para>Call the AE's process method, passing the CAS or JCas as an argument.</para>
</listitem>
<listitem>
<para>Do something with the results that the AE has added to the CAS or JCas.</para>
</listitem> | |
<listitem> | |
<para>Call the CAS's or JCas's reset() method to prepare for another analysis </para> | |
</listitem> | |
</orderedlist> | |
</section> | |
<section id="ugr.tug.applications.analyzing_non_text_artifacts"> | |
<title>Analyzing Non-Text Artifacts</title> | |
<para>Analyzing non-text artifacts is similar to analyzing text documents. The main difference is that | |
instead of using the <literal>setDocumentText</literal> method, you need to use the Sofa APIs to set the | |
artifact into the CAS. See <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/> | |
for details.</para> | |
</section> | |
<section id="ugr.tug.applications.accessing_analysis_results"> | |
<title>Accessing Analysis Results</title> | |
<para>Annotators (and applications) access the results of analysis via the CAS, using the CAS or JCas | |
interfaces. These results are accessed using the CAS Indexes. There is one built-in index for instances of | |
the built-in type <literal>uima.tcas.Annotation</literal> that can be used to retrieve instances of | |
<literal>Annotation</literal> or any subtype of Annotation. You can also define additional indexes over | |
other types. </para> | |
<para>Indexes provide a method to obtain an iterator over their contents; the iterator returns the matching
elements one at a time from the CAS.</para>
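<para>As an illustration of iterating over the built-in annotation index, the following minimal sketch prints
every annotation in a CAS. The class name <literal>AnnotationLister</literal> and the method name
<literal>printAnnotations</literal> are hypothetical; the sketch also assumes a UIMA version in which an index
can be used directly in a Java for-each loop.</para>
<programlisting>import org.apache.uima.cas.CAS;
import org.apache.uima.cas.text.AnnotationFS;

// Hypothetical helper class, for illustration only
public class AnnotationLister {

  // Prints type, span and covered text of each annotation in the
  // built-in annotation index and returns how many were visited
  public static int printAnnotations(CAS cas) {
    int count = 0;
    for (AnnotationFS annot : cas.getAnnotationIndex()) {
      System.out.println(annot.getType().getName()
          + " [" + annot.getBegin() + ", " + annot.getEnd() + "): "
          + annot.getCoveredText());
      count++;
    }
    return count;
  }
}</programlisting>
<para>A method like this could serve as the <literal>doSomethingWithResults</literal> step of the earlier
examples; it must be called before the CAS is reset.</para>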
<section id="ugr.tug.applications.accessing_results_using_jcas"> | |
<title>Accessing Analysis Results using the JCas</title> | |
<para>See:</para> | |
<itemizedlist> | |
<listitem> | |
<para> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae.reading_results_previous_annotators"/> </para> | |
</listitem> | |
<listitem> | |
<para> <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.jcas"/></para> | |
</listitem> | |
<listitem> | |
<para>The Javadocs for <literal>org.apache.uima.jcas.JCas</literal>. </para> | |
</listitem> | |
</itemizedlist> | |
</section> | |
<section id="ugr.tug.application.accessing_results_using_cas"> | |
<title>Accessing Analysis Results using the CAS</title> | |
<para>See:</para> | |
<itemizedlist> | |
<listitem> | |
<para> <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/></para> | |
</listitem> | |
<listitem> | |
<para> The source code for <literal>org.apache.uima.examples.PrintAnnotations</literal>, which
is in <literal>examples\src</literal>.</para>
</listitem> | |
<listitem> | |
<para>The Javadocs for the <literal>org.apache.uima.cas</literal> and | |
<literal>org.apache.uima.cas.text</literal> packages. </para> | |
</listitem> | |
</itemizedlist> | |
</section> | |
</section> | |
<section id="ugr.tug.applications.multi_threaded"> | |
<title>Multi-threaded Applications</title> | |
<para>You may be running on a multi-core system, and want to run multiple CASes at once through your pipeline. To support this, UIMA provides multiple approaches. | |
The most flexible and recommended way to do this is to use the features of UIMA-AS, which not only allows scale-up (multiple threads in one CPU), but also | |
supports scale-out (exploiting a cluster of machines).</para> | |
<para>This section describes the simplest way to use an AE in a multi-threaded environment.
First, note that most Analysis Engines are written with the assumption that only one thread will access
them at any one time; that is, Analysis Engines are generally not written to be thread-safe. Their authors
assume that multiple instances of the Analysis Engine will be instantiated as needed to support multiple
threads.
</para>
<para>If your application has multiple threads that might invoke the same Analysis Engine instance,
you can use the Java <literal>synchronized</literal> keyword to
ensure that only one thread at a time is using the AE and its CAS. For example:
<programlisting>public class MyApplication { | |
private AnalysisEngine mAnalysisEngine; | |
private CAS mCAS; | |
public MyApplication() { | |
//get Resource Specifier from XML file | |
XMLInputSource in = new XMLInputSource("MyDescriptor.xml"); | |
ResourceSpecifier specifier = | |
UIMAFramework.getXMLParser().parseResourceSpecifier(in); | |
//create Analysis Engine here | |
mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier); | |
mCAS = mAnalysisEngine.newCAS(); | |
} | |
// Assume some other part of your multi-threaded application could | |
// call <quote>analyzeDocument</quote> on different threads, asynchronously | |
public synchronized void analyzeDocument(String aDoc) { | |
//analyze a document | |
mCAS.setDocumentText(aDoc); | |
mAnalysisEngine.process(mCAS);
doSomethingWithResults(mCAS); | |
mCAS.reset(); | |
} | |
... | |
}</programlisting></para> | |
<para>Without the synchronized keyword, this application would not be thread-safe. If multiple threads
called the analyzeDocument method simultaneously, they would all use the same CAS and clobber each other's
results. The synchronized keyword ensures that no more than one thread is executing this method at any given
time. For more information on thread synchronization in Java, see <ulink
url="http://docs.oracle.com/javase/tutorial/essential/concurrency/"/>.</para>
<para>The synchronized keyword ensures thread-safety, but does not allow you to process more than one | |
document at a time. If you need to process multiple documents simultaneously (for example, to make use of a | |
multiprocessor machine), you'll need to use more than one CAS instance.</para> | |
<para>Because CAS instances use memory and can take some time to construct, you don't want to create a new CAS | |
instance for each request. Instead, you should use a feature of the UIMA SDK called the <emphasis>CAS | |
Pool</emphasis>, implemented by the type <literal>CasPool</literal>.</para> | |
<para>A CAS Pool contains some number of CAS instances (you specify how many when you create the pool). When a | |
thread wants to use a CAS, it <emphasis>checks out</emphasis> an instance from the pool. When the thread is | |
done using the CAS, it must <emphasis>release</emphasis> the CAS instance back into the pool. If all | |
instances are checked out, additional threads will block and wait for an instance to become available. Here | |
is some example code: | |
<programlisting>public class MyApplication { | |
private CasPool mCasPool; | |
private AnalysisEngine mAnalysisEngine; | |
public MyApplication() | |
{ | |
//get Resource Specifier from XML file | |
XMLInputSource in = new XMLInputSource("MyDescriptor.xml"); | |
ResourceSpecifier specifier = | |
UIMAFramework.getXMLParser().parseResourceSpecifier(in); | |
//Create multithreadable AE that will
//accept 3 simultaneous requests.
//The 3rd parameter specifies a timeout.
//When the number of simultaneous requests exceeds 3,
// additional requests will wait for other requests to finish.
// This parameter determines the maximum number of milliseconds
// that a new request should wait before throwing an exception
// - a value of 0 will cause them to wait forever.
mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier,3,0); | |
//create CAS pool with 3 CAS instances | |
mCasPool = new CasPool(3, mAnalysisEngine); | |
} | |
// Notice this is no longer "synchronized" | |
public void analyzeDocument(String aDoc) { | |
//check out a CAS instance (argument 0 means no timeout) | |
CAS cas = mCasPool.getCas(0); | |
try { | |
//analyze a document | |
cas.setDocumentText(aDoc); | |
mAnalysisEngine.process(cas); | |
doSomethingWithResults(cas); | |
} finally { | |
//MAKE SURE we release the CAS instance | |
mCasPool.releaseCas(cas); | |
} | |
} | |
... | |
}</programlisting></para> | |
<para>There is not much more code required here than in the previous example. First, there is one additional | |
parameter to the AnalysisEngine producer, specifying the number of annotator instances to | |
create<footnote> | |
<para> Both the UIMA Collection Processing Manager framework and the remote deployment services framework | |
have implementations which use CAS pools in this manner, and thereby relieve the annotator developer of | |
the necessity to make their annotators thread-safe.</para> </footnote>. Then, instead of creating a | |
single CAS in the constructor, we now create a CasPool containing 3 instances. In the analyze method, we check | |
out a CAS, use it, and then release it.</para> <note> | |
<para>Frequently, the two numbers (number of CASes, and the number of AEs) will be the same. It would not make | |
sense to have the number of CASes less than the number of AEs | |
– the extra AE instances would always block waiting for a CAS from the pool. It could make sense to have | |
additional CASes, though – if you had other multi-threaded processes that were using the CASes, other | |
than the AEs. </para> </note> | |
<para>The <literal>getCas()</literal> method returns a CAS which is not specialized to any particular subject of analysis. To
work with a specific subject of analysis, please refer to <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aas"/>.</para>
<para>Note the use of the try...finally block. This is very important, as it ensures that the CAS we have checked | |
out will be released back into the pool, even if the analysis code throws an exception. You should always use | |
try...finally when using the CAS pool; if you do not, you risk exhausting the pool and causing | |
deadlock.</para> | |
<para>The parameter 0 passed to the CasPool.getCas() method is a timeout value. If this is set to a positive | |
integer, it is the maximum number of milliseconds that the thread will wait for an instance to become | |
available in the pool. If this time elapses, the getCas method will return null, and the application can do | |
something intelligent, like ask the user to try again later. A value of 0 will cause the thread to wait for an | |
available CAS, potentially forever.</para> | |
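<para>The timeout behavior described above can be handled along the following lines. This is a minimal sketch
under stated assumptions: the class name <literal>TimedAnalysis</literal> and the method name
<literal>tryAnalyze</literal> are hypothetical, and what the application does when no CAS becomes available
is left to the caller.</para>
<programlisting>import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.CasPool;

// Hypothetical helper class, for illustration only
public class TimedAnalysis {

  // Returns false if no CAS became available within timeoutMillis,
  // true if the document was analyzed
  public static boolean tryAnalyze(AnalysisEngine ae, CasPool pool,
      String doc, long timeoutMillis) throws Exception {
    CAS cas = pool.getCas(timeoutMillis);
    if (cas == null) {
      // pool exhausted: let the caller decide (retry, report, ...)
      return false;
    }
    try {
      cas.setDocumentText(doc);
      ae.process(cas);
      // read the results from the CAS here, before it is released
      return true;
    } finally {
      // always hand the CAS back, even if analysis failed
      pool.releaseCas(cas);
    }
  }
}</programlisting>
<para>Returning a boolean keeps the decision about retrying or reporting an error in the calling code, where
the application-specific context is available.</para>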
<para>All of this can better be done using UIMA-AS. Besides taking care of setting up the CAS pools, etc.,
UIMA-AS allows a pipeline having several delegates to be scaled up optimally for each delegate;
one delegate might have 5 instances, while another might have 3. It also does
a different kind of initialization, in that it creates a thread pool itself, and ensures that each
annotator instance gets its process() method called using the same thread that was used for that annotator
instance's initialization call; some annotators could be written assuming that this is the case.</para>
</section> | |
<section id="ugr.tug.application.using_multiple_aes"> | |
<title>Using Multiple Analysis Engines and Creating Shared CASes</title> | |
<titleabbrev>Multiple AEs &amp; Creating Shared CASes</titleabbrev>
<para>In most cases, the easiest way to use multiple Analysis Engines from within an application is to combine | |
them into an aggregate AE. For instructions, see <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae.building_aggregates"/>. Be sure that you understand this method before | |
deciding to use the more advanced feature described in this section.</para> | |
<para>If you decide that your application does need to instantiate multiple AEs and have those AEs share a | |
single CAS, then you will no longer be able to use the various methods on the | |
<literal>AnalysisEngine</literal> class that create CASes (or JCases) to create your CAS. This is because | |
these methods create a CAS with a data model specific to a single AE and which therefore cannot be shared by | |
other AEs. Instead, you create a CAS as follows:</para> | |
<para>Suppose you have two analysis engines, and one CAS Consumer, and you want to create one type system from | |
the merge of all of their type specifications. Then you can do the following:</para> | |
<programlisting>AnalysisEngineDescription aeDesc1 = | |
UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...); | |
AnalysisEngineDescription aeDesc2 = | |
UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...); | |
CasConsumerDescription ccDesc = | |
UIMAFramework.getXMLParser().parseCasConsumerDescription(...); | |
List list = new ArrayList(); | |
list.add(aeDesc1); | |
list.add(aeDesc2); | |
list.add(ccDesc); | |
CAS cas = CasCreationUtils.createCas(list); | |
// (optional, if using the JCas interface) | |
JCas jcas = cas.getJCas();</programlisting> | |
<para>The CasCreationUtils class takes care of the work of merging the AEs' type systems and producing a | |
CAS for the combined type system. If the type systems are not compatible, an exception will be thrown.</para> | |
</section> | |
<section id="ugr.tug.application.saving_cases_to_file_systems"> | |
<title>Saving CASes to file systems or general Streams</title> | |
<para>The UIMA framework provides multiple APIs to save and restore the contents of a CAS to streams. | |
Two common uses of this are to save CASes to the file system, and to send CASes to other processes, running | |
on remote systems.</para> | |
<para> | |
The CASes can be serialized in multiple formats: | |
<itemizedlist> | |
<listitem> | |
<para>Binary formats: | |
<itemizedlist> | |
<listitem> | |
<para>Plain binary: This is used to communicate with remote services, and also for interfacing from Java with
annotators written in C/C++ or related languages via the Java Native Interface (JNI).</para>
</listitem>
<listitem>
<para>Compressed binary: There are two forms of compressed binary. The recommended one is form 6, which also allows
type filtering. See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.compress.overview"/>.</para>
</listitem> | |
</itemizedlist> | |
</para> | |
</listitem> | |
<listitem> | |
<para>XML formats: There are two forms of this format. The preferred form is the XMI form (see | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xmi"/>). An older format is also available, | |
called XCAS.</para> | |
</listitem> | |
<listitem> | |
<para>JSON formats (as of version 2.7.0): | |
This is intended for exposing results in the CAS as JSON objects for use by | |
web applications. See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.json.overview"/>. | |
For JSON, only serialization is supported.</para> | |
</listitem> | |
<listitem> | |
<para>Java Object serialization: There are APIs to convert a CAS to a Java object that can be serialized
and deserialized
using the standard Java object serialization methods (<literal>writeObject</literal> and <literal>readObject</literal>). There is also a way to include the CAS's type system and
index definitions.</para>
</listitem> | |
</itemizedlist> | |
</para> | |
<para>Each of these serializations has different capabilities, summarized in the table below. | |
<table frame="all" id="ugr.tug.tbl.serialization_capabilities"> | |
<title>Serialization Capabilities</title> | |
<tgroup cols="8" rowsep="1" colsep="1"> | |
<colspec colname="c1"/> | |
<colspec colname="c2"/> | |
<colspec colname="c3"/> | |
<colspec colname="c4"/> | |
<colspec colname="c5"/> | |
<colspec colname="c6"/> | |
<colspec colname="c7"/> | |
<colspec colname="c8"/> | |
<thead> | |
<row> | |
<entry align="center"></entry> | |
<entry align="center">XCAS</entry> | |
<entry align="center">XMI</entry> | |
<entry align="center">JSON</entry> | |
<entry align="center">Binary</entry> | |
<entry align="center">Cmpr 4</entry> | |
<entry align="center">Cmpr 6</entry>
<entry align="center">JavaObj</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>Output</entry> | |
<entry>Output Stream</entry> | |
<entry>Output Stream</entry> | |
<entry>Output Stream, File, Writer</entry> | |
<entry>Output Stream</entry> | |
<entry>Output Stream, Data Output Stream, File</entry> | |
<entry>Output Stream, Data Output Stream, File</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>Lists/Arrays inline formatting?</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>Formatted?</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>Type Filtering?</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>Delta Cas?</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>OOTS?</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>Only send indexed + reachable FSs?</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>send all</entry> | |
<entry>send all</entry> | |
<entry>Yes</entry> | |
<entry>send all</entry> | |
</row> | |
<row> | |
<entry>NameSpace/Schemas?</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
</row> | |
<row> | |
<entry>lenient available?</entry> | |
<entry>Yes</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>-</entry> | |
<entry>Yes</entry> | |
<entry>-</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</table> | |
</para> | |
<para>In the above table, Cmpr 4 and Cmpr 6 refer to Compressed forms of the serialization, | |
and JavaObj refers to Java Object serialization.</para> | |
<para>For the XMI and JSON formats, lists and arrays can sometimes be formatted "inline". | |
In this representation, the elements are formatted directly as the value of a particular | |
feature. This is only done if the arrays and lists are not multiply-referenced.</para> | |
<para>Type Filtering support enables only a subset of the types and/or features to be | |
serialized. An additional type system object is used to specify the types to be included | |
in the serialization. This can be useful, for instance, when sending a CAS to a remote service, | |
where the remote service only uses a small number of the types and features, to reduce the size | |
of the serialized CAS.</para> | |
<para>Delta Cas support makes use of a "mark" set in the CAS, and only serializes changes in the CAS, | |
both new and modified Feature Structures, that were added or changed after the mark was set. | |
This is useful for remote services, supporting the use-case where a large CAS is sent to the service, | |
which sets the mark in the received CAS, and then adds a small amount of information; | |
the Delta CAS then serializes only that small amount as the "reply" sent back to the sender.</para> | |
<para>OOTS means "Out of Type System" support, intended to support the use-case where a CAS is being sent | |
to a remote application. This supports deserializing an incoming CAS where | |
some of the types and/or features may not be present in the receiving CAS's type system. A "lenient" | |
option on the deserialization permits the deserialization to proceed, with the out-of-type-system | |
information preserved so that when the CAS is subsequently reserialized (in the use-case, to be | |
returned back to the sender), the out-of-type-system information is re-merged back into the output stream. | |
</para> | |
<para>The Binary, Java Object, and Compressed Form 4 serializations send all the Feature Structures in the CAS, | |
in the order they were created in the CAS. The other methods only | |
send Feature Structures that are reachable, either by | |
their being in some CAS index, or being referenced | |
as a feature of another Feature Structure which is reachable.</para> | |
<para>The NameSpace/Schema support allows specifying a set of schemas, each one corresponding to a particular | |
namespace, used in XMI serialization.</para> | |
<para>Lenient allows the receiving Type System to be missing types and/or features that are being deserialized.
Normally this causes an exception, but with the lenient flag turned on, these extra types and/or features are
skipped over and ignored, with no error indicated.</para>
<para>To save an XMI representation of a CAS, use the <code>save</code> method in <code>CasIOUtils</code> or the
<literal>serialize</literal> method of the class
<literal>org.apache.uima.util.XmlCasSerializer</literal>. To save an XCAS representation of a CAS,
use the <code>save</code> method in <code>CasIOUtils</code> or the
<literal>org.apache.uima.cas.impl.XCASSerializer</literal> class; see the Javadocs
for details.</para>
<para>All the external forms (except JSON) can be read back in with standard options using the <code>CasIOUtils load</code> methods. | |
The <code>CasIOUtils load</code> methods also support loading type system and index definition information | |
at the same time (usually from additional input sources).
The XCAS and XMI external forms can also be read back in using the <literal>deserialize</literal> method of | |
the class <literal>org.apache.uima.util.XmlCasDeserializer</literal>. All of these methods deserialize | |
into a pre-existing CAS, which you must create ahead of time. See the | |
Javadocs for details.</para> | |
<para>The <code>CasIOUtils</code> class has a collection of static methods to load (deserialize) and save (serialize) CASes, | |
optionally with their type system and index definitions. | |
The <code>Serialization</code> class has various static methods for serializing and deserializing Java Object forms and | |
compressed forms, with finer control over available options. | |
See the Javadocs for that class for details.</para> | |
<para>Several of the APIs use or return instances of <code>SerialFormat</code>, which is an enum specifying the various | |
forms of serialization.</para> | |
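<para>A round trip through the XMI format might look like the following minimal sketch. The class name
<literal>CasRoundTrip</literal> and its method names are hypothetical, and the sketch assumes the
<code>CasIOUtils</code> overloads <code>save(CAS, OutputStream, SerialFormat)</code> and
<code>load(InputStream, CAS)</code>; check the Javadocs of your UIMA version for the exact signatures.
The target CAS must already exist and must have a compatible type system.</para>
<programlisting>import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.SerialFormat;
import org.apache.uima.util.CasIOUtils;

// Hypothetical helper class, for illustration only
public class CasRoundTrip {

  // Serializes the CAS contents to an in-memory XMI document
  public static byte[] toXmi(CAS cas) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    CasIOUtils.save(cas, out, SerialFormat.XMI);
    return out.toByteArray();
  }

  // Loads serialized data into a pre-existing CAS and
  // returns the format that was detected
  public static SerialFormat fromBytes(byte[] data, CAS target)
      throws IOException {
    return CasIOUtils.load(new ByteArrayInputStream(data), target);
  }
}</programlisting>
<para>The same pattern applies to files: replace the in-memory streams with a
<literal>FileOutputStream</literal> and a <literal>FileInputStream</literal>.</para>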
</section> | |
</section> | |
<section id="ugr.tug.application.using_cpes"> | |
<title>Using Collection Processing Engines</title> | |
<para>A <emphasis>Collection Processing Engine (CPE)</emphasis> processes collections of artifacts | |
(documents) through the combination of the following components: a Collection Reader, an optional CAS | |
Initializer, Analysis Engines, and CAS Consumers. Collection Processing Engines and their components are | |
described in <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.cpe"/>.</para>
<para>Like Analysis Engines, CPEs consist of a set of Java classes and a set of descriptors. You need to make sure | |
the Java classes are in your classpath, but otherwise you only deal with descriptors.</para> | |
<section id="ugr.tug.application.running_a_cpe_from_a_descriptor"> | |
<title>Running a Collection Processing Engine from a Descriptor</title> | |
<titleabbrev>Running a CPE from a Descriptor</titleabbrev> | |
<para><olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.cpe.running_cpe_from_application"/> describes how to use the APIs to read a CPE | |
descriptor and run it from an application.</para> | |
</section> | |
<section id="ugr.tug.application.configuring_a_cpe_descriptor_programmatically"> | |
<title>Configuring a Collection Processing Engine Descriptor Programmatically</title> | |
<titleabbrev>Configuring a CPE Descriptor Programmatically</titleabbrev> | |
<para>For the finest level of control over the CPE descriptor settings, the CPE offers programmatic access to | |
the descriptor via an API. With this API, a developer can create a complete descriptor and then save the result | |
to a file. This also can be used to read in a descriptor (using XMLParser.parseCpeDescription as shown in the | |
previous section), modify it, and write it back out again. The CPE Descriptor API allows a developer to | |
redefine default behavior related to error handling for each component, turn on checkpointing, change
performance characteristics of the CPE, and plug in a custom timer.</para>
<para>Below is some example code that illustrates how this works. See the Javadocs for package | |
org.apache.uima.collection.metadata for more details.</para> | |
<programlisting>//Creates descriptor with default settings | |
CpeDescription cpe = CpeDescriptorFactory.produceDescriptor(); | |
//Add CollectionReader | |
cpe.addCollectionReader([descriptor]); | |
//Add CasInitializer (deprecated) | |
cpe.addCasInitializer(<cas initializer descriptor>); | |
// Provide the number of CASes the CPE will use | |
cpe.setCasPoolSize(2); | |
// Define and add Analysis Engine | |
CpeIntegratedCasProcessor personTitleProcessor = | |
CpeDescriptorFactory.produceCasProcessor (<quote>Person</quote>); | |
// Provide descriptor for the Analysis Engine | |
personTitleProcessor.setDescriptor([descriptor]); | |
//Continue, despite errors and skip bad Cas | |
personTitleProcessor.setActionOnMaxError(<quote>continue</quote>); | |
//Increase amount of time in ms the CPE waits for response | |
//from this Analysis Engine | |
personTitleProcessor.setTimeout(100000); | |
//Add Analysis Engine to the descriptor | |
cpe.addCasProcessor(personTitleProcessor); | |
// Define and add CAS Consumer | |
CpeIntegratedCasProcessor consumerProcessor = | |
CpeDescriptorFactory.produceCasProcessor(<quote>Printer</quote>); | |
consumerProcessor.setDescriptor([descriptor]); | |
//Define batch size | |
consumerProcessor.setBatchSize(100); | |
//Terminate CPE on max errors | |
consumerProcessor.setActionOnMaxError(<quote>terminate</quote>); | |
//Add CAS Consumer to the descriptor | |
cpe.addCasProcessor(consumerProcessor); | |
// Add Checkpoint file and define checkpoint frequency (ms) | |
cpe.setCheckpoint(<quote>[path]/checkpoint.dat</quote>, 3000); | |
// Plug in custom timer class used for timing events | |
cpe.setTimer(<quote>org.apache.uima.internal.util.JavaTimer</quote>); | |
// Define number of documents to process | |
cpe.setNumToProcess(1000); | |
// Dump the descriptor to the System.out | |
((CpeDescriptionImpl)cpe).toXML(System.out);</programlisting> | |
<para>The CPE descriptor for the above configuration looks like this: | |
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?> | |
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier"> | |
<collectionReader> | |
<collectionIterator> | |
<descriptor> | |
<include href="[descriptor]"/> | |
</descriptor> | |
<configurationParameterSettings>... | |
</configurationParameterSettings> | |
</collectionIterator> | |
<casInitializer> | |
<descriptor> | |
<include href="[descriptor]"/> | |
</descriptor> | |
<configurationParameterSettings>... | |
</configurationParameterSettings> | |
</casInitializer> | |
</collectionReader> | |
<casProcessors casPoolSize="2" processingUnitThreadCount="1"> | |
<casProcessor deployment="integrated" name="Person"> | |
<descriptor> | |
<include href="[descriptor]"/> | |
</descriptor> | |
<deploymentParameters/> | |
<errorHandling> | |
<errorRateThreshold action="continue" value="100/1000"/>
<maxConsecutiveRestarts action="terminate" value="30"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="100" time="1000ms"/> | |
</casProcessor> | |
<casProcessor deployment="integrated" name="Printer"> | |
<descriptor> | |
<include href="[descriptor]"/> | |
</descriptor> | |
<deploymentParameters/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" | |
value="100/1000"/> | |
<maxConsecutiveRestarts action="terminate" | |
value="30"/> | |
<timeout max="100000" default="-1"/> | |
</errorHandling> | |
<checkpoint batch="100" time="1000ms"/> | |
</casProcessor> | |
</casProcessors> | |
<cpeConfig> | |
<numToProcess>1000</numToProcess> | |
<deployAs>immediate</deployAs> | |
<checkpoint file="[path]/checkpoint.dat" time="3000ms"/> | |
<timerImpl> | |
org.apache.uima.internal.util.JavaTimer
</timerImpl> | |
</cpeConfig> | |
</cpeDescription>]]></programlisting></para> | |
</section> | |
</section> | |
<section id="ugr.tug.application.setting_configuration_parameters"> | |
<title>Setting Configuration Parameters</title> | |
<para>Configuration parameters can be set using APIs as well as configured using the XML descriptor metadata | |
specification (see <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae.configuration_parameters"/>).</para>
<para>There are two different places you can set the parameters via the APIs.</para> | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>After reading the XML descriptor for a component, but before you produce the component itself, | |
and</para> | |
</listitem> | |
<listitem> | |
<para>After the component has been produced. </para> | |
</listitem> | |
</itemizedlist> | |
<para>Setting the parameters before you produce the component is done using the | |
ConfigurationParameterSettings object. You get an instance of this for a particular component by accessing | |
that component description's metadata. For instance, if you produced a component description by using one of the
<literal>UIMAFramework.getXMLParser().parse...</literal> methods, you can use that component
description's getMetaData() method to get the metadata, and then the metadata's | |
getConfigurationParameterSettings method to get the ConfigurationParameterSettings object. Using that | |
object, you can set individual parameters using the setParameterValue method. Here's an example, for a | |
CAS Consumer component: | |
<programlisting>// Create a description object by reading the XML for the descriptor | |
CasConsumerDescription casConsumerDesc = | |
UIMAFramework.getXMLParser().parseCasConsumerDescription(new | |
XMLInputSource("descriptors/cas_consumer/InlineXmlCasConsumer.xml")); | |
// get the settings from the metadata | |
ConfigurationParameterSettings consumerParamSettings = | |
casConsumerDesc.getMetaData().getConfigurationParameterSettings(); | |
// Set a parameter value | |
consumerParamSettings.setParameterValue( | |
InlineXmlCasConsumer.PARAM_OUTPUTDIR, | |
outputDir.getAbsolutePath());</programlisting></para> | |
<para>Then you might produce this component using: | |
<programlisting>CasConsumer component = | |
UIMAFramework.produceCasConsumer(casConsumerDesc);</programlisting></para> | |
<para>A side effect of producing a component is calling the component's <quote>initialize</quote> method, | |
allowing it to read its configuration parameters. If you want to change parameters after this, use | |
<programlisting>component.setConfigParameterValue( | |
<quote><parameter-name></quote>, | |
<quote><parameter-value></quote>);</programlisting> | |
and then signal the component to re-read its configuration by calling the component's reconfigure method: | |
<programlisting>component.reconfigure();</programlisting></para> | |
<para>Although these examples are for a CAS Consumer component, the parameter APIs also work for other kinds of | |
components.</para> | |
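<para>The value set through the API above corresponds to a
<literal>configurationParameterSettings</literal> entry in the component's descriptor. The fragment below is
only a sketch of such an entry: the parameter name <literal>OutputDirectory</literal> and the path are
placeholders chosen for illustration, so check the parameter declarations in your own descriptor for the
actual names and types.
<programlisting><![CDATA[<configurationParameterSettings>
  <nameValuePair>
    <!-- assumed parameter name; it must match a declared configuration parameter -->
    <name>OutputDirectory</name>
    <value>
      <!-- placeholder path -->
      <string>/path/to/output</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>
Calling <literal>setParameterValue</literal> before producing the component changes only the in-memory
description; the descriptor file itself is not modified.</para>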
</section> | |
<section id="ugr.tug.application.integrating_text_analysis_and_search"> | |
<title>Integrating Text Analysis and Search</title> | |
<para>The UIMA SDK on IBM's alphaWorks <ulink url="http://www.alphaworks.ibm.com/tech/uima"/> includes a | |
semantic search engine that you can use to build a search index that includes the results of the analysis done by | |
your AE. This combination of AEs with a search engine capable of indexing both words and annotations over spans | |
of text enables what UIMA refers to as <emphasis>semantic search</emphasis>. Over time we expect to provide | |
additional information on integrating other open source search engines.</para> | |
<para>Semantic search is a search where the semantic intent of the query is specified using one or more entity or | |
relation specifiers. For example, one could specify that they are looking for a person named
<quote>Bush</quote>. Such a query would then not return results about the kind of bushes that grow in your
garden.</para> | |
<section id="ugr.tug.application.building_an_index"> | |
<title>Building an Index</title> | |
<para>To build a semantic search index using the UIMA SDK, you run a Collection Processing Engine that includes | |
your AE along with a CAS Consumer that takes the tokens and annotations, together with sentence
boundaries, and feeds them to a semantic searcher's index term input. The alphaWorks semantic search | |
component includes a CAS Consumer called the <emphasis>Semantic Search CAS Indexer</emphasis> that does | |
this; this component is available from the alphaWorks site. Your AE must include an annotator that produces | |
Tokens and Sentence annotations, along with any <quote>semantic</quote> annotations, because the | |
Indexer requires this. The Semantic Search CAS Indexer's descriptor is located here: | |
<literal>examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml</literal>.</para>
<section id="ugr.tug.application.search.configuring_indexer"> | |
<title>Configuring the Semantic Search CAS Indexer</title> | |
<para>Since there are several ways you might want to build a search index from the information in the CAS | |
produced by your AE, you need to supply the Semantic Search CAS Indexer with
configuration information in the form of an <emphasis>Index Build Specification</emphasis> file. | |
Apache UIMA includes code for parsing Index Build Specification files (see the Javadocs for details). An | |
example of an Indexing specification tailored to the AE from the tutorial in the <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/> is located in | |
<literal>examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml</literal>. It looks
like this: | |
<programlisting><![CDATA[<indexBuildSpecification> | |
<indexBuildItem> | |
<name>org.apache.uima.examples.tokenizer.Token</name> | |
<indexRule> | |
<style name="Term"/> | |
</indexRule> | |
</indexBuildItem> | |
<indexBuildItem> | |
<name>org.apache.uima.examples.tokenizer.Sentence</name> | |
<indexRule> | |
<style name="Breaking"/> | |
</indexRule> | |
</indexBuildItem> | |
<indexBuildItem> | |
<name>org.apache.uima.tutorial.Meeting</name> | |
<indexRule> | |
<style name="Annotation"/> | |
</indexRule> | |
</indexBuildItem> | |
<indexBuildItem> | |
<name>org.apache.uima.tutorial.RoomNumber</name> | |
<indexRule> | |
<style name="Annotation"> | |
<attributeMappings> | |
<mapping> | |
<feature>building</feature> | |
<indexName>building</indexName> | |
</mapping> | |
</attributeMappings> | |
</style> | |
</indexRule> | |
</indexBuildItem> | |
<indexBuildItem> | |
<name>org.apache.uima.tutorial.DateAnnot</name> | |
<indexRule> | |
<style name="Annotation"/> | |
</indexRule> | |
</indexBuildItem> | |
<indexBuildItem> | |
<name>org.apache.uima.tutorial.TimeAnnot</name> | |
<indexRule> | |
<style name="Annotation"/> | |
</indexRule> | |
</indexBuildItem> | |
</indexBuildSpecification>]]></programlisting></para> | |
<para>The index build specification is a series of index build items, each of which identifies a CAS | |
annotation type (a subtype of <literal>uima.tcas.Annotation</literal> – see <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>) and a style.</para> | |
<para>The first item in this example specifies that the annotation type | |
<literal>org.apache.uima.examples.tokenizer.Token</literal> should be indexed with the | |
<quote>Term</quote> style. This means that each span of text annotated by a Token will be considered a | |
single token for standard text search purposes.</para> | |
<para>The second item in this example specifies that the annotation type | |
<literal>org.apache.uima.examples.tokenizer.Sentence</literal> should be indexed with the | |
<quote>Breaking</quote> style. This means that each span of text annotated by a Sentence will be | |
considered a single sentence, which can affect the search engine's algorithm for matching queries. The
semantic search engine available from alphaWorks always requires tokens and sentences in order to index a | |
document.</para> <note> | |
<para>Requirements for Term and Breaking rules: The Semantic Search indexer from alphaWorks requires that | |
the items to be indexed as words be designated using the Term rule. </para></note> | |
<para>The remaining items all use the <quote>Annotation</quote> style. This indicates that each | |
annotation of the specified types will be stored in the index as a searchable span, with a name equal to the | |
annotation name (without the namespace).</para> | |
<para>Also, features of annotations can be indexed using the | |
<literal><attributeMappings></literal> subelement. In the example index build | |
specification, we declare that the <literal>building</literal> feature of the type | |
<literal>org.apache.uima.tutorial.RoomNumber</literal> should be indexed. The | |
<literal><indexName></literal> element can be used to map the feature name to a different name in | |
the index, but in this example we have opted to use the same name, <literal>building</literal>. </para> | |
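<para>To illustrate the renaming case, the sketch below maps the <literal>building</literal> feature to a
different name in the index. The index name <literal>bldg</literal> is a hypothetical choice made for this
example and does not appear in the shipped specification; presumably, queries against such an index would
then refer to the attribute by that name.
<programlisting><![CDATA[<indexBuildItem>
  <name>org.apache.uima.tutorial.RoomNumber</name>
  <indexRule>
    <style name="Annotation">
      <attributeMappings>
        <mapping>
          <!-- feature name as declared in the type system -->
          <feature>building</feature>
          <!-- hypothetical name under which the feature is indexed -->
          <indexName>bldg</indexName>
        </mapping>
      </attributeMappings>
    </style>
  </indexRule>
</indexBuildItem>]]></programlisting></para>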
<para> At the end of the batch or collection, the Semantic Search CAS Indexer builds the index. This index can | |
be queried with simple tokens or with XML tags.</para> | |
<para>Examples: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>A query on the word <quote>UIMA</quote> will retrieve all documents that contain an occurrence
of the word. But a query of the type <literal><Meeting>UIMA</Meeting></literal> | |
will retrieve only those documents that contain a Meeting annotation (produced by our | |
MeetingDetector TAE, for example), where that Meeting annotation contains the word | |
<quote>UIMA</quote>.</para> | |
</listitem> | |
<listitem> | |
<para>A query for <literal><RoomNumber building="Yorktown"/></literal> will return | |
documents that have a RoomNumber annotation whose <literal>building</literal> feature | |
contains the term <quote>Yorktown</quote>. </para> | |
</listitem> | |
</itemizedlist></para> | |
<para>More information on the syntax of these kinds of queries, called XML Fragments, can be found in | |
documentation for the semantic search engine component on <ulink | |
url="http://www.alphaworks.ibm.com/tech/uima"/>. For more information on the Index Build | |
Specification format, see the UIMA Javadocs for class | |
<literal>org.apache.uima.search.IndexBuildSpecification</literal>. Accessing the Javadocs is | |
described in <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.javadocs"/>.</para> | |
</section> | |
<section id="ugr.tug.application.search.cpe_with_semantic_search_cas_consumer"> | |
<title>Building and Running a CPE including the Semantic Search CAS Indexer</title> | |
<titleabbrev>Using Semantic Search CAS Indexer</titleabbrev> | |
<para>The following steps illustrate how to build and run a CPE that uses the UIMA Meeting Detector TAE and the | |
Simple Token and Sentence Annotator, discussed in the <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/> along with a CAS Consumer | |
called the Semantic Search CAS Indexer, to build an index that allows you to query for documents based not | |
only on textual content but also on whether they contain mentions of Meetings detected by the TAE.</para> | |
<para>Run the CPE Configurator tool by executing the <literal>cpeGui</literal> shell script in the | |
<literal>bin</literal> directory of the UIMA SDK. (For instructions on using this tool, see the <olink | |
targetdoc="&uima_docs_tools;"/> <olink | |
targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.)</para> | |
<para>In the CPE Configurator tool, select the following components by browsing to their | |
descriptors:</para> | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>Collection Reader: <literal>%UIMA_HOME%/examples/descriptors/collectionReader/ | |
FileSystemCollectionReader.xml</literal></para> | |
</listitem> | |
<listitem> | |
<para>Analysis Engines: include both of these. One produces the tokens and sentences that the indexer
always requires; the other produces the meeting annotations of interest.
<itemizedlist spacing="compact"> | |
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/analysis_engine/SimpleTokenAndSentenceAnnotator.xml</literal></para></listitem> | |
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/tutorial/ex6/UIMAMeetingDetectorTAE.xml</literal></para></listitem> | |
</itemizedlist> | |
</para> | |
</listitem> | |
<listitem> | |
<para>Two CAS Consumers: | |
<itemizedlist spacing="compact"> | |
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml</literal></para></listitem> | |
<listitem><para><literal><?db-font-size 70% ?>%UIMA_HOME%/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml</literal></para></listitem> | |
</itemizedlist> | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para>Set up parameters:</para> | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para> Set the File System Collection Reader's <quote>Input Directory</quote> parameter to point to | |
the <literal>%UIMA_HOME%/examples/data</literal> directory.</para> | |
</listitem> | |
<listitem> | |
<para>Set the Semantic Search CAS Indexer's <quote>Indexing Specification Descriptor</quote> | |
parameter to point to <literal>%UIMA_HOME%/examples/descriptors/tutorial/search/ | |
MeetingIndexBuildSpec.xml</literal></para> | |
</listitem> | |
<listitem> | |
<para>Set the Semantic Search CAS Indexer's <quote>Index Dir</quote> parameter to the
directory into which you want the indexer to write its index files. <warning> | |
<para>The Indexer <emphasis>erases</emphasis> old versions of the files it creates in this | |
directory. </para></warning> </para> | |
</listitem> | |
<listitem> | |
<para>Set the XMI Writer CAS Consumer's <quote>Output Directory</quote> parameter to the
directory into which you want to store the XMI files containing the results of your analysis for each | |
document. </para> | |
</listitem> | |
</itemizedlist> | |
<para>Click on the Run Button. Once the run completes, a statistics dialog should appear, in which you can see | |
how much time was spent in each of the components involved in the run.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.application.search.query_tool"> | |
<title>Semantic Search Query Tool</title> | |
<para>The Semantic Search component from UIMA on alphaWorks contains a simple tool for running queries | |
against a semantic search index. After building an index as described in the previous section, you can launch
this tool from the command prompt by running the <literal>semanticSearch</literal> shell script, found in the
<literal>/bin</literal> subdirectory of the Semantic Search UIMA install. If you are using Eclipse, and have installed the
UIMA examples, there will be a Run configuration you can use to conveniently launch this, called | |
<literal>UIMA Semantic Search</literal>. This will display the following screen: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image002.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of the Semantic Search tool set up to run | |
semantic queries against a semantic search index</phrase></textobject> | |
</mediaobject> | |
</screenshot></para> | |
<para>Configure the fields on this screen as follows: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>Set the <quote>Index Directory</quote> to the directory where you built your index. This is the | |
same value that you supplied for the <quote>Index Dir</quote> parameter of the Semantic Search CAS | |
Indexer in the CPE Configurator.</para> | |
</listitem> | |
<listitem> | |
<para>Set the <quote>XMI/XCAS Directory</quote> to the directory where you stored the results of your | |
analysis. This is the same value that you supplied for the <quote>Output Directory</quote> | |
parameter of XMI Writer CAS Consumer in the CPE Configurator.</para> | |
</listitem> | |
<listitem> | |
<para>Optionally, set the <quote>Original Documents Directory</quote> to the directory containing | |
the original plain text documents that were analyzed and indexed. This is only needed for the
<quote>View Original Document</quote> button.</para>
</listitem> | |
<listitem> | |
<para> Set the <quote>Type System Descriptor</quote> to the location of the descriptor that describes | |
your type system. For this example, this will be | |
<literal>%UIMA_HOME%/examples/descriptors/tutorial/ex4/TutorialTypeSystem.xml</literal> | |
</para> | |
</listitem> | |
</itemizedlist></para> | |
<para>Now, in the <quote>XML Fragments</quote> field, you can type in single words or XML queries where the XML | |
tags correspond to the labels in the index build specification file (e.g. | |
<literal><Meeting>UIMA</Meeting></literal>). XML Fragments are described in the | |
documentation for the semantic search engine component on <ulink | |
url="http://www.alphaworks.ibm.com/tech/uima"/>.</para> | |
<para>After you enter a query and click the <quote>Search</quote> button, a list of hits will appear. Select | |
one of the documents and click <quote>View Analysis</quote> to view the document in the UIMA Annotation | |
Viewer.</para> | |
<para>The source code for the Semantic Search query program is in | |
<literal>examples/src/com/ibm/apache-uima/search/examples/SemanticSearchGUI.java</literal>. A simple
command-line query program is also provided in | |
<literal>examples/src/com/ibm/apache-uima/search/examples/SemanticSearch.java</literal>. Using these
as a model, you can build a query interface from your own application. For details on the Semantic Search | |
Engine query language and interface, see the documentation for the semantic search engine component on | |
<ulink url="http://www.alphaworks.ibm.com/tech/uima"/>.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.application.remote_services"> | |
<title>Working with Remote Services</title> | |
<note><para>This section describes older methods of working with Remote Services. These approaches do not support
some of the newer CAS features, such as multiple views and CAS Multipliers. These methods have been supplanted by | |
UIMA-AS, which has full support for the new CAS features.</para></note> | |
<para>The UIMA SDK allows you to easily take any Analysis Engine or CAS Consumer and deploy it as a service. That | |
Analysis Engine or CAS Consumer can then be called from a remote machine using various network | |
protocols.</para> | |
<para>The UIMA SDK provides support for two communications protocols: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>SOAP, the standard Web Services protocol</para> | |
</listitem> | |
<listitem> | |
<para>Vinci, a lightweight version of SOAP, included as a part of Apache UIMA. </para> | |
</listitem> | |
</itemizedlist></para> | |
<para>The UIMA framework can make use of these services in two different ways: | |
<orderedlist> | |
<listitem> | |
<para>An Analysis Engine can create a proxy to a remote service; this proxy acts like a local component, but | |
connects to the remote. The proxy has limited error handling and retry capabilities. Both Vinci and SOAP | |
are supported.</para> | |
</listitem> | |
<listitem> | |
<para>A Collection Processing Engine can specify non-Integrated mode (see <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.cpe.deploying_a_cpe"/>). The
CPE provides more extensive error recovery capabilities. This mode only supports the Vinci | |
communications protocol. </para> | |
</listitem> | |
</orderedlist></para> | |
<section id="ugr.tug.application.how_to_deploy_as_soap"> | |
<title>Deploying a UIMA Component as a SOAP Service</title> | |
<titleabbrev>Deploying as SOAP Service</titleabbrev> | |
<para>To deploy a UIMA component as a SOAP Web Service, you need to first install the following software | |
components: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>Apache Tomcat 5.0 or 5.5 (<ulink url="http://jakarta.apache.org/tomcat/"/>) </para>
</listitem> | |
<listitem> | |
<para>Apache Axis 1.3 or 1.4 (<ulink url="http://ws.apache.org/axis/"/>) </para> | |
</listitem> | |
</itemizedlist></para> | |
<para>Later versions of these components will likely also work, but have not been tested.</para> | |
<para>Next, you need to do the following setup steps: | |
<itemizedlist> | |
<listitem> | |
<para>Set the CATALINA_HOME environment variable to the location where Tomcat is installed.</para> | |
</listitem> | |
<listitem> | |
<para>Copy all of the JAR files from <literal>%UIMA_HOME%/lib</literal> to the | |
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/lib</literal> in your installation.</para> | |
</listitem> | |
<listitem> | |
<para>Copy the JAR files for the UIMA components that you wish to deploy to
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/lib</literal> in your installation.</para> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold-italic">IMPORTANT</emphasis>: any time you add JAR files to Tomcat (for
instance, in the above 2 steps), you must shut down and restart Tomcat before it
<quote>notices</quote> them. So now, please shut down and restart Tomcat.</para>
</listitem> | |
<listitem> | |
<para>All the Java classes for the UIMA Examples are packaged in the | |
<literal>uima-examples.jar</literal> file which is included in the | |
<literal>%UIMA_HOME%/lib</literal> folder.</para> | |
</listitem> | |
<listitem> | |
<para>In addition, if an annotator needs to locate resource files in the classpath, those resources | |
must be available in the Axis classpath, so copy these also to | |
<literal>%CATALINA_HOME%/webapps/axis/WEB-INF/classes</literal> .</para> | |
<para>As an example, if you are deploying the GovernmentTitleRecognizer (found in | |
<literal>examples/descriptors/analysis_engine/ | |
GovernmentOfficialRecognizer_RegEx_TAE</literal>) as a SOAP service, you need to copy the file | |
<literal>examples/resources/GovernmentTitlePatterns.dat</literal> into | |
<literal>.../WEB-INF/classes</literal>. </para> | |
</listitem> | |
</itemizedlist></para> | |
<para>Test your installation of Tomcat and Axis by starting Tomcat and going to | |
<literal>http://localhost:8080/axis/happyaxis.jsp</literal> in your browser. Check to be sure that | |
this reports that all of the required Axis libraries are present. One common missing file may be | |
activation.jar, which you can get from java.sun.com.</para> | |
<para>After completing these setup instructions, you can deploy Analysis Engines or CAS Consumers as SOAP web | |
services by using the <literal>deploytool</literal> utility, which is located in the
<literal>/bin</literal> directory of the UIMA SDK. <literal>deploytool</literal> is a command-line
utility that takes as an argument a web services deployment descriptor (WSDD file); example WSDD
files are provided in the <literal>examples/deploy/soap</literal> directory of the UIMA SDK. Deployment | |
Descriptors have been provided for deploying and undeploying some of the example Analysis Engines that come | |
with the SDK.</para> | |
<para>As an example, the WSDD file for deploying the example Person Title annotator looks like this (important | |
parts are in bold italics): | |
<programlisting><deployment name="<emphasis role="bold-italic">PersonTitleAnnotator</emphasis>" | |
xmlns="http://xml.apache.org/axis/wsdd/" | |
xmlns:java="http://xml.apache.org/axis/wsdd/providers/java"> | |
<service name="<emphasis role="bold-italic">urn:PersonTitleAnnotator</emphasis>" provider="java:RPC"> | |
<parameter name="scope" value="Request"/> | |
<parameter name="className" | |
value="org.apache.uima.reference_impl.analysis_engine | |
.service.soap.AxisAnalysisEngineService_impl"/> | |
<parameter name="allowedMethods" value="getMetaData process"/> | |
<parameter name="allowedRoles" value="*"/> | |
<parameter name="resourceSpecifierPath" | |
value="<emphasis role="bold-italic">C:/Program Files/apache/uima/examples/ | |
descriptors/analysis_engine/PersonTitleAnnotator.xml</emphasis>"/> | |
<parameter name="numInstances" value="3"/> | |
<!-- Type Mappings omitted from this document; | |
you will not need to edit them. --> | |
<typeMapping .../> | |
<typeMapping .../> | |
<typeMapping .../> | |
</service> | |
</deployment></programlisting></para> | |
<para>To modify this WSDD file to deploy your own Analysis Engine or CAS Consumer, just replace the areas | |
indicated in bold italics (deployment name, service name, and resource specifier path) with values | |
appropriate for your component.</para> | |
<para>The <literal>numInstances</literal> parameter specifies how many instances of your Analysis Engine | |
or CAS Consumer will be created. This allows your service to support multiple clients concurrently. When a | |
new request comes in, if all of the instances are busy, the new request will wait until an instance becomes | |
available.</para> | |
<para>To deploy the Person Title annotator service, issue the following command: | |
<programlisting>C:/Program Files/apache/uima/bin>deploytool | |
../examples/deploy/soap/Deploy_PersonTitleAnnotator.wsdd</programlisting></para> | |
<para>Test if the deployment was successful by starting up a browser, pointing it to your Tomcat | |
installation's <quote>axis</quote> webpage (e.g., <literal>http://localhost:8080/axis</literal>) | |
and clicking on the List link. This should bring up a page which shows the deployed services, where you should | |
see the service you just deployed.</para> | |
<para>The other components can be deployed by replacing | |
<literal>Deploy_PersonTitleAnnotator.wsdd</literal> with one of the other Deploy descriptors in the | |
deploy directory. The deploytool utility can also undeploy services when passed one of the Undeploy | |
descriptors.</para> <note> | |
<para>The <literal>deploytool</literal> shell script assumes that the web services are to be installed at | |
<literal>http://localhost:8080/axis</literal>. If this is not the case, you will need to update the shell | |
script appropriately.</para> </note> | |
<para>Once you have deployed your component as a web service, you may call it from a remote machine. See <xref | |
linkend="ugr.tug.application.how_to_call_a_uima_service"/> for instructions.</para> | |
</section> | |
<section id="ugr.tug.application.how_to_deploy_a_vinci_service"> | |
<title>Deploying a UIMA Component as a Vinci Service</title> | |
<titleabbrev>Deploying as a Vinci Service</titleabbrev> | |
<para>There are no software prerequisites for deploying a Vinci service. The necessary libraries are part of | |
the UIMA SDK. However, before you can use Vinci services you need to deploy the Vinci Naming Service (VNS), as | |
described in section <xref linkend="ugr.tug.application.vns"/>.</para> | |
<para>To deploy a service, you have to ensure that any components you want to include can be found on the class path.
One way to do this is to set the environment variable UIMA_CLASSPATH to the set of class paths you need for any | |
included components. Then run the <literal>startVinciService</literal> shell script, which is located | |
in the <literal>bin</literal> directory, and pass it the path to a Vinci deployment descriptor, for | |
example: <literal>C:UIMA>bin/startVinciService | |
../examples/deploy/vinci/Deploy_PersonTitleAnnotator.xml</literal>. | |
If you are running Eclipse, and have the <literal>uimaj-examples</literal> project | |
in your workspace, you can use the Eclipse Menu → Run → Run... and then | |
pick <quote>UIMA Start Vinci Service</quote>.</para> | |
<para>This example deployment descriptor looks like: | |
<programlisting><deployment name=<emphasis role="bold-italic">"Vinci Person Title Annotator Service"</emphasis>> | |
<service name=<emphasis role="bold-italic">"uima.annotator.PersonTitleAnnotator"</emphasis> provider="vinci"> | |
<parameter name="resourceSpecifierPath" | |
value=<emphasis role="bold-italic">"C:/Program Files/apache/uima/examples/descriptors/ | |
analysis_engine/PersonTitleAnnotator.xml"</emphasis>/> | |
<parameter name="numInstances" value="1"/> | |
<parameter name="serverSocketTimeout" value="120000"/> | |
</service> | |
</deployment></programlisting></para> | |
<para>To modify this deployment descriptor to deploy your own Analysis Engine or CAS Consumer, just replace | |
the areas indicated in bold italics (deployment name, service name, and resource specifier path) with | |
values appropriate for your component.</para> | |
<para>The <literal>numInstances</literal> parameter specifies how many instances of your Analysis Engine | |
or CAS Consumer will be created. This allows your service to support multiple clients concurrently. When a | |
new request comes in, if all of the instances are busy, the new request will wait until an instance becomes | |
available.</para> | |
<para>The <literal>serverSocketTimeout</literal> parameter specifies the number of milliseconds | |
(default = 5 minutes) that the service will wait between requests to process something. After this amount of | |
time, the server will presume the client may have gone away and <quote>cleans up</quote>, releasing any
resources it is holding. The next call to process on the service will cause the
client to re-establish its connection with the service (some additional overhead).</para>
<para>There are two additional parameters that you can add to your deployment descriptor: | |
</para> | |
<itemizedlist> | |
<listitem><para><literal><parameter name="threadPoolMinSize" value="[Integer]"/></literal>: | |
Specifies the number of threads that the Vinci service creates on startup in order to | |
serve clients' requests.</para></listitem> | |
<listitem><para><literal><parameter name="threadPoolMaxSize" value="[Integer]"/></literal>: | |
Specifies the maximum number of threads that the Vinci service will create. When the number of | |
concurrent requests exceeds the <literal>threadPoolMinSize</literal>, additional threads will be | |
created to serve requests, until the <literal>threadPoolMaxSize</literal> is reached.</para></listitem> | |
</itemizedlist> | |
<para>The <literal>startVinciService</literal> script takes two additional optional parameters. The | |
first one overrides the value of the VNS_HOST environment variable, allowing you to specify the name server | |
to use. The second parameter if specified needs to be a unique (on this server) non-negative number, | |
specifying the instance of this service. When used, this number allows multiple instances of the same named | |
service to be started on one server; they will all register with the Vinci name service and be made available to | |
client requests.</para> | |
<para>Once you have deployed your component as a Vinci service, you may call it from a remote machine. See <xref
linkend="ugr.tug.application.how_to_call_a_uima_service"/> for instructions.</para> | |
</section> | |
<section id="ugr.tug.application.how_to_call_a_uima_service"> | |
<title>How to Call a UIMA Service</title> | |
<titleabbrev>Calling a UIMA Service</titleabbrev> | |
<para>Once an Analysis Engine or CAS Consumer has been deployed as a service, it can be used from any UIMA | |
application, in the exact same way that a local Analysis Engine or CAS Consumer is used. For example, you can | |
call an Analysis Engine service from the Document Analyzer or use the CPE Configurator to build a CPE that | |
includes Analysis Engine and CAS Consumer services.</para> | |
<para>To do this, you use a <emphasis>service client descriptor</emphasis> in place of the usual Analysis | |
Engine or CAS Consumer Descriptor. A service client descriptor is a simple XML file that indicates the | |
location of the remote service and a few parameters. Example service client descriptors are provided in the | |
UIMA SDK under the directories <literal>examples/descriptors/soapService</literal> and | |
<literal>examples/descriptors/vinciService</literal>. The contents of these descriptors are | |
explained below.</para> | |
<para>Also, before you can call a SOAP service, you need to have the necessary Axis JAR files in your classpath. | |
If you use any of the scripts in the <literal>bin</literal> directory of the UIMA installation to launch your | |
application, such as documentAnalyzer, these JARs are added to the classpath automatically, using the
<literal>CATALINA_HOME</literal> environment variable. The required files are the following (all part
of the Apache Axis download):
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>activation.jar</para> | |
</listitem> | |
<listitem> | |
<para>axis.jar</para> | |
</listitem> | |
<listitem> | |
<para>commons-discovery.jar</para> | |
</listitem> | |
<listitem> | |
<para>commons-logging.jar</para> | |
</listitem> | |
<listitem> | |
<para>jaxrpc.jar</para> | |
</listitem> | |
<listitem> | |
<para>saaj.jar</para> | |
</listitem> | |
</itemizedlist></para> | |
<section id="ugr.tug.application.soap_service_client_descriptor"> | |
<title>SOAP Service Client Descriptor</title> | |
<para>The descriptor used to call the PersonTitleAnnotator SOAP service from the example above is: | |
<programlisting><![CDATA[<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> | |
<resourceType>AnalysisEngine</resourceType> | |
<uri>http://localhost:8080/axis/services/urn:PersonTitleAnnotator</uri> | |
<protocol>SOAP</protocol> | |
<timeout>60000</timeout> | |
</uriSpecifier>]]></programlisting></para> | |
<para>The <resourceType> element must contain either AnalysisEngine or CasConsumer. This | |
specifies what type of component you expect to be at the specified service address.</para> | |
<para>The <uri> element describes which service to call. It specifies the host (localhost, in this | |
example) and the service name (urn:PersonTitleAnnotator), which must match the name specified in the | |
deployment descriptor used to deploy the service.</para> | |
</section> | |
<section id="ugr.tug.application.vinci_service_client_descriptor"> | |
<title>Vinci Service Client Descriptor</title> | |
<para>To call a Vinci service, a similar descriptor is used: | |
<programlisting><![CDATA[<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> | |
<resourceType>AnalysisEngine</resourceType> | |
  <uri>uima.annotator.PersonTitleAnnotator</uri>
  <protocol>Vinci</protocol>
  <timeout>60000</timeout>
  <parameters>
    <parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/>
    <parameter name="VNS_PORT" value="9000"/>
  </parameters>
</uriSpecifier>]]></programlisting></para>
<para>Note that Vinci uses a centralized naming server, so the host where the service is deployed does not
need to be specified. Only a name (<literal>uima.annotator.PersonTitleAnnotator</literal>) is given,
which must match the name specified in the deployment descriptor used to deploy the service.</para> | |
<para>The host and/or port where your Vinci Naming Service (VNS) server is running can be specified by the | |
optional <parameter> elements. If not specified, the value is taken from the specification given on
your Java command line (if present) using the <literal>-DVNS_HOST=<host></literal> and
<literal>-DVNS_PORT=<port></literal> system arguments. If not specified on the Java command | |
line, defaults are used: localhost for the <literal>VNS_HOST</literal>, and <literal>9000</literal> | |
for the <literal>VNS_PORT</literal>. See the next section for details on setting up a VNS server.</para> | |
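<para>The lookup order described above (descriptor parameter, then Java system property, then default) can
be sketched as follows. This is not the framework's own code; the class and method names are hypothetical,
and the map simply stands in for the optional <parameter> elements of the service client
descriptor.</para>

```java
import java.util.Map;

/** Illustrative only: resolves VNS settings in the order the text describes. */
final class VnsSettingsSketch {
    static final String DEFAULT_HOST = "localhost";
    static final int DEFAULT_PORT = 9000;

    /** Descriptor parameter first, then the -D system property, then the fallback. */
    static String resolve(String name, Map<String, String> descriptorParams, String fallback) {
        String fromDescriptor = descriptorParams.get(name);
        if (fromDescriptor != null) {
            return fromDescriptor;
        }
        return System.getProperty(name, fallback);
    }

    static String host(Map<String, String> descriptorParams) {
        return resolve("VNS_HOST", descriptorParams, DEFAULT_HOST);
    }

    static int port(Map<String, String> descriptorParams) {
        return Integer.parseInt(resolve("VNS_PORT", descriptorParams, String.valueOf(DEFAULT_PORT)));
    }
}
```

<para>For example, with no <parameter> elements and <literal>-DVNS_PORT=9100</literal> on the command
line, this sketch resolves the host to <literal>localhost</literal> and the port to
<literal>9100</literal>.</para>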
</section> | |
</section> | |
<section id="ugr.tug.application.restrictions_on_remotely_deployed_services"> | |
<title>Restrictions on remotely deployed services</title> | |
<para>Remotely deployed services are started on remote machines, using UIMA component descriptors on those | |
remote machines. These descriptors supply any configuration and resource parameters for the service | |
(configuration parameters are not transmitted from the calling instance to the remote one). Likewise, the | |
remote descriptors supply the type system specification for the remote annotators that will be run (the type | |
system of the calling instance is not transmitted to the remote one).</para> | |
<para>The remote service wrapper, when it receives a CAS from the caller, instantiates it for the remote | |
service, making instances of all types which the remote service specifies. Other instances in the incoming | |
CAS for types which the remote service has no type specification for are kept aside, and when the remote | |
service returns the CAS back to the caller, these type instances are re-merged back into the CAS being | |
transmitted back to the caller. Because of this design, a remote service which doesn't declare a type system | |
won't receive any type instances.</para> <note> | |
<para>This behavior may change in future releases, to one where configuration parameters and / or type systems | |
are transmitted to remote services. </para></note> | |
</section> | |
<section id="ugr.tug.application.vns"> | |
<title>The Vinci Naming Service (VNS)</title>
<para>Vinci consists of components for building network-accessible services, clients for accessing those | |
services, and an infrastructure for locating and managing services. The primary infrastructure component | |
is the Vinci directory, known as VNS (for Vinci Naming Service).</para> | |
<para>On startup, Vinci services locate the VNS and provide it with information that is used by VNS during | |
service discovery. A Vinci service provides the name of the host machine on which it runs, and the name of the
service. The VNS internally creates a binding for the service name and returns the port number on which the
Vinci service will wait for client requests. The VNS stores its bindings in the file system, in a file called
vns.services.</para>
<para>In Vinci, services are identified by their service name. If there is more than one physical service with | |
the same service name, then Vinci assumes they are equivalent and will route queries to them randomly, | |
provided that they are all running on different hosts. You should therefore use a unique service name if you | |
don't want to conflict with other services listed in whatever VNS you have configured jVinci to use.</para> | |
<section id="ugr.tug.application.vns.starting"> | |
<title>Starting VNS</title> | |
<para>To run the VNS use the <literal>startVNS</literal> script found in the | |
<literal>bin</literal> directory of the UIMA installation, | |
or launch it from Eclipse. If you've installed the <literal>uimaj-examples</literal> project, | |
it will supply a pre-configured launch script you can access in Eclipse by selecting | |
Menu → Run → Run... and picking <quote>UIMA Start VNS</quote>.</para> | |
<note><para>VNS runs on port 9000 by default so please make sure this port is | |
available. If you see the following exception: | |
<programlisting>java.net.BindException: Address already in use: | |
JVM_Bind</programlisting> | |
it indicates that another process is running on port 9000. In this case, add the parameter <literal>-p | |
<port></literal> to the <literal>startVNS</literal> command, using | |
<literal><port></literal> to specify an alternative port to use. </para></note> | |
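<para>If you want to check in advance whether a port is free, a small stand-alone probe such as the
following can help. It is only a sketch using the standard <literal>java.net.ServerSocket</literal> class;
<literal>PortCheckSketch</literal> and its methods are hypothetical names and are not part of the UIMA or
Vinci distributions.</para>

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Illustrative only: probes whether a TCP port can currently be bound. */
final class PortCheckSketch {

    /** Returns true if a server socket can be bound to the port right now. */
    static boolean isFree(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;
        } catch (IOException e) { // includes java.net.BindException
            return false;
        }
    }

    /** Returns the first free port in [start, start + attempts), or -1 if none is free. */
    static int firstFreePort(int start, int attempts) {
        for (int port = start; port < start + attempts; port++) {
            if (isFree(port)) {
                return port;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Port 9000 free: " + isFree(9000));
        System.out.println("First free port from 9000: " + firstFreePort(9000, 50));
    }
}
```

<para>The result is only a snapshot: another process may still take the port between the probe and the
start of the VNS.</para>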
<para>When started, the VNS produces output similar to the following: | |
<programlisting><?db-font-size 80% ?>[10/6/04 3:44 PM | main] WARNING: Config file doesn't exist, | |
creating a new empty config file! | |
[10/6/04 3:44 PM | main] Loading config file : .vns.services | |
[10/6/04 3:44 PM | main] Loading workspaces file : .vns.workspaces | |
[10/6/04 3:44 PM | main] ==================================== | |
(WARNING) Unexpected exception: | |
java.io.FileNotFoundException: .vns.workspaces (The system cannot find | |
the file specified) | |
at java.io.FileInputStream.open(Native Method) | |
at java.io.FileInputStream.<init>(Unknown Source) | |
at java.io.FileInputStream.<init>(Unknown Source) | |
at java.io.FileReader.<init>(Unknown Source) | |
at org.apache.vinci.transport.vns.service.VNS.loadWorkspaces(VNS.java:339 | |
at org.apache.vinci.transport.vns.service.VNS.startServing(VNS.java:237) | |
at org.apache.vinci.transport.vns.service.VNS.main(VNS.java:179) | |
[10/6/04 3:44 PM | main] WARNING: failed to load workspace. | |
[10/6/04 3:44 PM | main] VNS Workspace : null | |
[10/6/04 3:44 PM | main] Loading counter file : .vns.counter | |
[10/6/04 3:44 PM | main] Could not load the counter file : .vns.counter | |
[10/6/04 3:44 PM | main] Starting backup thread, | |
using files .vns.services.bak | |
and .vns.services | |
[10/6/04 3:44 PM | main] Serving on port : 9000 | |
[10/6/04 3:44 PM | Thread-0] Backup thread started | |
[10/6/04 3:44 PM | Thread-0] Saving to config file : .vns.services.bak | |
>>>>>>>>>>>>> VNS is up and running! <<<<<<<<<<<<<<<<< | |
>>>>>>>>>>>>> Type 'quit' and hit ENTER to terminate VNS <<<<<<<<<<<<< | |
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis. | |
[10/6/04 3:44 PM | Thread-0] Saving to config file : .vns.services | |
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis. | |
[10/6/04 3:44 PM | Thread-0] Saving counter file : .vns.counter</programlisting></para> | |
<note> | |
<para>Disregard the <emphasis>java.io.FileNotFoundException: .\vns.workspaces (The system cannot
find the file specified)</emphasis> message. It is only a warning, not a serious problem: VNS Workspace is a
feature of the VNS that is not critical. The important information to note is <literal>[10/6/04 3:44 PM |
main] Serving on port : 9000</literal>, which states the actual port where VNS will listen for incoming
requests. All Vinci services and all clients connecting to services must provide the VNS port on the
command line if the port is not the default. Again, the default port is 9000. Please see <xref
linkend="ugr.tug.application.launching_vinci_services"/> below for details about the command
line and parameters.</para> </note>
</section> | |
<section id="ugr.tug.application.vns_files"> | |
<title>VNS Files</title> | |
<para>The VNS maintains two external files: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para><literal>vns.services</literal></para> | |
</listitem> | |
<listitem> | |
<para><literal>vns.counter</literal></para> | |
</listitem> | |
</itemizedlist></para> | |
<para>These files are generated by the VNS in the directory from which the VNS is launched. Since these
files may contain old information, it is best to remove them before starting the VNS. This step ensures that
the VNS always has the newest information and will not attempt to connect to a service that has been
shut down.</para>
</section> | |
<section id="ugr.tug.application.launching_vinci_services"> | |
<title>Launching Vinci Services</title> | |
<para>When launching a Vinci service, you must indicate which VNS the service will
connect to. A Vinci service is typically started using the script
<literal>startVinciService</literal>, found in the <literal>bin</literal>
directory of the UIMA installation. (If you're using Eclipse and have the
<literal>uimaj-examples</literal> project in the workspace, you will also find
an Eclipse launcher named <quote>UIMA Start Vinci Service</quote> you can use.)
For the script, the environment variable VNS_HOST should
be set to the name or IP address of the machine hosting the Vinci Naming Service. The
default is localhost, the machine the service is deployed on. This name can also be
passed as the second argument to the startVinciService script. The default port
for VNS is 9000 but can be overridden with the VNS_PORT environment
variable.</para>
<para>If you write your own startup script, to define Vinci's default VNS you must provide the | |
following JVM parameters: | |
<programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9000 ...</programlisting></para> | |
<para>The above setting is for a VNS running on the same machine as the service. If the VNS is deployed
on a different machine, change the JVM parameters to:
<programlisting>java -DVNS_HOST=<host> -DVNS_PORT=9000 ...</programlisting></para>
<para>where <quote><host></quote> is the name or IP address of the machine where the VNS is running.</para>
<note> | |
<para>VNS runs on port 9000 by default. If you see the following exception: | |
<programlisting>(WARNING) Unexpected exception: | |
org.apache.vinci.transport.ServiceDownException: | |
VNS inaccessible: java.net.ConnectException: Connection refused: connect</programlisting>
then either the VNS is not running, or it is running but using a different port. To correct the
latter, set the environment variable VNS_PORT to the correct port before starting the service.</para>
</note> | |
<para>To get the right port check the VNS output for something similar to the following: | |
<programlisting>[10/6/04 3:44 PM | main] Serving on port : 9000</programlisting></para> | |
<para>It is printed by the VNS on startup.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.configuring_timeout_settings"> | |
<title>Configuring Timeout Settings</title> | |
<para>UIMA has several timeout specifications, summarized here. The timeouts associated with remote | |
services are discussed below. In addition there are timeouts that can be specified for: | |
<itemizedlist> | |
<listitem><para><emphasis role="bold">Acquiring an empty CAS from a CAS Pool:</emphasis> | |
See <xref linkend="ugr.tug.applications.multi_threaded"/>.</para></listitem> | |
<listitem><para><emphasis role="bold">Reassembling chunks of a large document:</emphasis>
See <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters"/>.</para>
</listitem> | |
</itemizedlist></para> | |
<para>If your application uses remote UIMA services it is important to consider how to set the | |
<emphasis>timeout</emphasis> values appropriately. This is particularly important if your service can | |
take a long time to process each request.</para> | |
<para>There are two types of timeout settings in UIMA, the <emphasis>client timeout</emphasis> and the | |
<emphasis>server socket timeout</emphasis>. The client timeout is usually the more important of the two; it
specifies how long the client is willing to wait for the service to process each CAS. The client timeout can be
specified for both Vinci and SOAP. The server socket timeout (Vinci only) specifies how long the service | |
holds the connection open between calls from the client. After this amount of time, the server will presume | |
the client may have gone away - and it <quote>cleans up</quote>, releasing any resources it is holding. The | |
next call to process on the service will cause the client to re-establish its connection with the service | |
(some additional overhead).</para> | |
<section id="ugr.tug.setting_client_timeout"> | |
<title>Setting the Client Timeout</title> | |
<para>How you set the client timeout depends on which deployment mode you use in your CPE (if
any).</para>
<para>If you are using the default <quote>integrated</quote> deployment mode in your CPE, or if you are not | |
using a CPE at all, then the client timeout is specified in your Service Client Descriptor (see <xref | |
linkend="ugr.tug.application.how_to_call_a_uima_service"/>). For example:</para> | |
<programlisting><uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> | |
<resourceType>AnalysisEngine</resourceType> | |
  <uri>uima.annotator.PersonTitleAnnotator</uri>
<protocol>Vinci</protocol> | |
<emphasis role="bold-italic"><timeout>60000</timeout></emphasis> | |
<parameters> | |
<parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/> | |
<parameter name="VNS_PORT" value="9000"/> | |
</parameters> | |
</uriSpecifier></programlisting> | |
<para>The client timeout in this example is <literal>60000</literal>. This value specifies the number of | |
milliseconds that the client will wait for the service to respond to each request. In this example, the | |
client will wait for one minute.</para> | |
<para>If the service does not respond within this amount of time, processing of the current CAS will abort. If | |
you called the <literal>AnalysisEngine.process</literal> method directly from your application, an | |
Exception will be thrown. If you are running a CPE, what happens next is dependent on the error handling | |
settings in your CPE descriptor (see <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling"/> | |
). The default action is for the CPE to terminate, but you can override this. </para> | |
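<para>The effect of a client timeout on a direct call can be illustrated without any UIMA classes. The
sketch below is an analogy only: <literal>ClientTimeoutSketch</literal> is a hypothetical name, the
supplier stands in for a call to a remote service, and the exception type is an arbitrary choice for this
example rather than the one the framework throws.</para>

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Illustrative only: waits a bounded time for a reply, then gives up with an exception. */
final class ClientTimeoutSketch {

    /** Returns the reply, or throws IllegalStateException if none arrives within timeoutMillis. */
    static <T> T callWithTimeout(Supplier<T> remoteCall, long timeoutMillis) {
        CompletableFuture<T> reply = CompletableFuture.supplyAsync(remoteCall);
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            reply.cancel(true); // stop waiting for this reply
            throw new IllegalStateException("no reply within " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the reply", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("the call itself failed", e.getCause());
        }
    }
}
```

<para>A timeout value of <literal>60000</literal> in this sketch corresponds to the one-minute wait of the
descriptor example above.</para>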
<para>If you are using the <quote>managed</quote> or <quote>non-managed</quote> deployment mode in your | |
CPE, then the client timeout is specified in your CPE descriptor's <literal>errorHandling</literal>
element. For example:</para> | |
<programlisting><![CDATA[<errorHandling> | |
<maxConsecutiveRestarts .../> | |
<errorRateThreshold .../> | |
<timeout max="60000"/> | |
</errorHandling>]]></programlisting> | |
<para>As in the previous example, the client timeout is set to <literal>60000</literal>, and this | |
specifies the number of milliseconds that the client will wait for the service to respond to each | |
request.</para> | |
<para>If the service does not respond within the specified amount of time, the action is determined by the | |
settings for <literal>maxConsecutiveRestarts</literal> and | |
<literal>errorRateThreshold</literal>. These settings support such things as restarting the process | |
(for <quote>managed</quote> deployment mode), dropping and reestablishing the connection (for | |
<quote>non-managed</quote> deployment mode), and removing the offending service from the pipeline. See | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling"/> | |
for details. </para>
<para>Note that the client timeout does not apply to the <literal>GetMetaData</literal> | |
request that is made when the client first connects to the service. This call is typically | |
very fast and does not need a large timeout (the default is 60 seconds). However, if many | |
clients are competing for a small number of services, it may be necessary to increase this | |
value. See <olink targetdoc="&uima_docs_ref;"/> <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.service_client"/>.</para>
</section> | |
<section id="ugr.tug.setting_server_socket_timeout"> | |
<title>Setting the Server Socket Timeout</title> | |
<para>The Server Socket Timeout applies only to Vinci services, and is specified in the Vinci deployment | |
descriptor as discussed in section <xref | |
linkend="ugr.tug.application.how_to_deploy_a_vinci_service"/>. For example: | |
<programlisting><deployment name="Vinci Person Title Annotator Service"> | |
<service name="uima.annotator.PersonTitleAnnotator" provider="vinci"> | |
<parameter name="resourceSpecifierPath" | |
value="C:/Program Files/apache/uima/examples/descriptors/ | |
analysis_engine/PersonTitleAnnotator.xml"/> | |
<parameter name="numInstances" value="1"/> | |
<parameter name="serverSocketTimeout" value=<emphasis role="bold-italic">"120000"</emphasis>/> | |
</service> | |
</deployment></programlisting> | |
</para> | |
<para>The server socket timeout here is set to <literal>120000</literal> milliseconds, or two minutes. | |
This parameter specifies how long the service will wait between requests to process something. After this | |
amount of time, the server will presume the client may have gone away - and it <quote>cleans up</quote>, | |
releasing any resources it is holding. The next call to process on the service will cause the client to | |
re-establish its connection with the service (some additional overhead). The service may print a | |
<quote>Read Timed Out</quote> message to the console when the server socket timeout elapses.</para> | |
<para>In most cases, it is not a problem if the server socket timeout elapses. The client will simply | |
reconnect. However, if you notice <quote>Read Timed Out</quote> messages on your server console, | |
followed by other connection problems, it is possible that the client is having trouble reconnecting for | |
some reason. In this situation it may help increase the stability of your application if you increase the | |
server socket timeout so that it does not elapse during actual processing.</para> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tug.application.increasing_performance_using_parallelism"> | |
<title>Increasing performance using parallelism</title> | |
<para>There are several ways to exploit parallelism to increase performance in the UIMA Framework. These range | |
from running with additional threads within one Java virtual machine on one host (which might be a | |
multi-processor or hyper-threaded host) to deploying analysis engines on a set of remote machines.</para> | |
<para>The Collection Processing facility in UIMA provides the ability to scale the pipe-line of analysis | |
engines. This scale-out runs multiple threads within the Java virtual machine running the CPM, one for each | |
pipe in the pipe-line. To activate it, in the <literal><casProcessors></literal> descriptor | |
element, set the attribute <literal>processingUnitThreadCount</literal>, which specifies the number of | |
replicated processing pipelines, to a value greater than 1, and ensure that the size of the CAS pool is equal to or
greater than this number (the attribute of <literal><casProcessors></literal> to set is | |
<literal>casPoolSize</literal>). For more details on these settings, see <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors"/> .</para> | |
<para>For deployments that incorporate remote analysis engines in the Collection Manager pipe-line, running
on multiple remote hosts, scale-out is supported through the Vinci Naming Service (VNS). If multiple instances of
a service with the same name, but running on different hosts, are registered with the VNS, it will
assign these instances to incoming requests.</para>
<para>There are two modes supported: a <quote>random</quote> assignment, and an <quote>exclusive</quote>
one. The <quote>random</quote> mode distributes load using an algorithm that selects a service instance at | |
random. The UIMA framework supports this only for the case where all of the instances are running on unique | |
hosts; the framework does not support starting 2 or more instances on the same host.</para> | |
<para>The exclusive mode dedicates a particular remote instance to each Collection Manager pipe-line instance.
This mode is enabled by adding a configuration parameter in the | |
<casProcessor> section of the CPE descriptor:</para> | |
<literallayout><deploymentParameters> | |
<parameter name="service-access" value="exclusive" /> | |
</deploymentParameters></literallayout> | |
<para>If this is not specified, the <quote>random</quote> mode is used.</para> | |
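<para>The difference between the two modes can be pictured with the following sketch. It is not how the
CPM or Vinci implement service selection; the class, its methods, and the index-based mapping used for the
exclusive case are assumptions made purely for illustration.</para>

```java
import java.util.List;
import java.util.Random;

/** Illustrative only: contrasts "random" and "exclusive" assignment of service instances. */
final class ServiceAccessSketch {

    /** "random" mode: any registered instance may serve any request. */
    static String pickRandom(List<String> hosts, Random rng) {
        return hosts.get(rng.nextInt(hosts.size()));
    }

    /** "exclusive" mode: each pipeline keeps one dedicated instance (assumed mapping by index). */
    static String pickExclusive(List<String> hosts, int pipelineIndex) {
        if (pipelineIndex < 0 || pipelineIndex >= hosts.size()) {
            throw new IllegalArgumentException("no dedicated instance for pipeline " + pipelineIndex);
        }
        return hosts.get(pipelineIndex);
    }
}
```

<para>In this picture, exclusive access needs at least as many service instances as there are processing
pipelines, whereas random access works with any number of instances.</para>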
<para>In addition, remote UIMA engine services can be started with a parameter that specifies the number of | |
instances the service should support (see the <literal><parameter name="numInstances"></literal>
XML element in the remote deployment descriptor, <xref linkend="ugr.tug.application.remote_services"/>).
Specifying more than one causes the service wrapper for the analysis engine to use multi-threading (within the
single Java Virtual Machine – which can take advantage of multi-processor and hyper-threaded | |
architectures).</para> <note> | |
<para>When using Vinci in <quote>exclusive</quote> mode (see service access under <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.deployment_parameters"/> | |
), only one thread is used. To achieve multi-processing on a server in this case, use multiple instances of the | |
service, instead of multiple threads (see <xref | |
linkend="ugr.tug.application.how_to_deploy_a_vinci_service"/>).</para> </note>
</section> | |
<section id="ugr.tug.application.jmx"> | |
<title>Monitoring AE Performance using JMX</title> | |
<para>As of version 2, UIMA supports remote monitoring of Analysis Engine performance via the Java Management | |
Extensions (JMX) API. JMX is a standard part of the Java Runtime Environment v5.0; there is also a reference | |
implementation available from Sun for Java 1.4. An introduction to JMX is available from Sun here: <ulink | |
url="http://java.sun.com/developer/technicalArticles/J2SE/jmx.html"/>. When you run a UIMA application with a
JVM that supports JMX, the UIMA framework will automatically detect the presence of JMX and will register | |
<emphasis>MBeans</emphasis> that provide access to the performance statistics.</para> | |
<para>Note: The Sun JVM supports local monitoring; for others you can configure your | |
application for remote monitoring (even when on the same host) by specifying a unique port number, e.g. | |
<literal> | |
-Dcom.sun.management.jmxremote.port=1098 | |
-Dcom.sun.management.jmxremote.authenticate=false | |
-Dcom.sun.management.jmxremote.ssl=false</literal></para> | |
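<para>As an illustration of what these flags enable, the following minimal sketch connects to such a JVM
from a separate Java program using only the standard <literal>javax.management.remote</literal> API.
The class name <literal>RemoteJmxProbe</literal> and its methods are hypothetical, and the sketch assumes
authentication and SSL are disabled as in the flags above.</para>

```java
import java.io.IOException;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of UIMA.
final class RemoteJmxProbe {

  /** Builds the RMI connector address of a JVM started with jmxremote.port. */
  static JMXServiceURL serviceUrl(String host, int port) {
    try {
      return new JMXServiceURL(
          "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Bad host or port: " + host + ":" + port, e);
    }
  }

  /** Connects to the remote JVM and counts the MBeans registered in a domain. */
  static int countMBeans(String host, int port, String domain)
      throws IOException, MalformedObjectNameException {
    try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
      MBeanServerConnection connection = connector.getMBeanServerConnection();
      return connection.queryNames(new ObjectName(domain + ":*"), null).size();
    }
  }

  public static void main(String[] args) throws Exception {
    // Fails with an IOException unless an application is already
    // running with the flags shown above (port 1098).
    System.out.println(countMBeans("localhost", 1098, "org.apache.uima"));
  }
}
```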
<para>Now, you can use any JMX client to view the statistics. JDK 5.0 or later provides a standard client that you can use. | |
Simply open a command prompt, make sure the JDK <literal>bin</literal> directory is in your path, and | |
execute the <literal>jconsole</literal> command. This should bring up a window allowing you to | |
select one of the local JMX-enabled applications currently running, or to enter a remote (or local) host and | |
port, e.g. localhost:1098. The next screen will show a summary of | |
information about the Java process that you connected to. Click on the <quote>MBeans</quote> tab, then expand | |
<quote>org.apache.uima</quote> in the tree at the left. You should see a view like this: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of JMX console monitoring UIMA components</phrase></textobject> | |
</mediaobject> | |
</screenshot></para> | |
<para>Each of the nodes under <quote><literal>org.apache.uima</literal></quote> in the tree represents one | |
of the UIMA Analysis Engines in the application that you connected to. You can select one of the analysis engines | |
to view its performance statistics in the view at the right.</para> | |
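<para>The same nodes can also be listed programmatically. The sketch below is one possible way to do it,
using only the standard <literal>javax.management</literal> API; the class
<literal>UimaMBeanLister</literal> is a hypothetical name. It must run inside the same JVM as the UIMA
application, because it queries the platform MBean server of the current process.</para>

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of UIMA.
final class UimaMBeanLister {

  /** Returns the sorted canonical names of all MBeans registered in a JMX domain. */
  static Set<String> listMBeanNames(MBeanServer server, String domain) {
    ObjectName pattern;
    try {
      pattern = new ObjectName(domain + ":*");
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException("Invalid JMX domain: " + domain, e);
    }
    Set<String> names = new TreeSet<String>();
    for (ObjectName name : server.queryNames(pattern, null)) {
      names.add(name.getCanonicalName());
    }
    return names;
  }

  public static void main(String[] args) {
    MBeanServer platform = ManagementFactory.getPlatformMBeanServer();
    // Prints nothing unless Analysis Engines were created in this JVM.
    for (String name : listMBeanNames(platform, "org.apache.uima")) {
      System.out.println(name);
    }
  }
}
```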
<para>Probably the most useful statistic is <quote>CASes Per Second</quote>, which is the number of CASes that | |
this AE has processed divided by the amount of time spent in the AE's process method, in seconds. Note that this is | |
the total elapsed time, not CPU time. Even so, it can be useful to compare the <quote>CASes Per Second</quote> | |
numbers of all of your Analysis Engines to discover where the bottlenecks occur in your application.</para> | |
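<para>The arithmetic behind this statistic can be sketched as follows. This is an illustration of the
definition given above, not UIMA's own code; the class name <literal>CasThroughput</literal> is
hypothetical, and returning zero when no time has been recorded is an assumption made for the example.</para>

```java
// Hypothetical illustration of the "CASes Per Second" definition.
final class CasThroughput {

  /**
   * Divides the number of CASes processed by the elapsed time in seconds.
   * Returns 0.0 when no elapsed time has been recorded (an assumption).
   */
  static double casesPerSecond(long casesProcessed, long analysisTimeMillis) {
    if (analysisTimeMillis <= 0) {
      return 0.0;
    }
    return casesProcessed / (analysisTimeMillis / 1000.0);
  }

  public static void main(String[] args) {
    // 500 CASes in 20,000 ms of elapsed time is 25 CASes per second.
    System.out.println(casesPerSecond(500, 20000)); // prints 25.0
  }
}
```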
<para>The <literal>AnalysisTime</literal>, <literal>BatchProcessCompleteTime</literal>, and | |
<literal>CollectionProcessCompleteTime</literal> properties show the total elapsed time, in | |
milliseconds, that has been spent in the AnalysisEngine's <literal>process(), batchProcessComplete(), | |
</literal>and <literal>collectionProcessComplete()</literal> methods, respectively. (Note that for | |
CAS Multipliers, time spent in the <literal>hasNext()</literal> and <literal>next()</literal> methods is | |
also counted towards the AnalysisTime.)</para> | |
<para>Note that once your UIMA application terminates, you can no longer view the statistics through the JMX | |
console. If you want to use JMX to view processes that have completed, you will need to write your application so | |
that the JVM remains running after processing completes, waiting for some user signal before | |
terminating.</para> | |
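<para>One simple way to keep the JVM alive is to block on standard input once processing has finished.
The sketch below assumes a console application; the class <literal>KeepAliveForJmx</literal> is a
hypothetical name and the UIMA processing itself is omitted.</para>

```java
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical helper, not part of UIMA.
final class KeepAliveForJmx {

  /**
   * Blocks until a full line arrives on the given stream.
   * Returns true if a line was read, false if the stream ended first.
   */
  static boolean waitForUserSignal(InputStream in) {
    // The scanner is deliberately not closed, so System.in stays usable.
    Scanner scanner = new Scanner(in);
    if (scanner.hasNextLine()) {
      scanner.nextLine();
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // ... run the UIMA processing here (omitted) ...
    System.out.println("Processing complete. Press Enter to exit.");
    waitForUserSignal(System.in);
  }
}
```

<para>While the program waits, the MBeans stay registered, so a JMX client can still connect and read
the final statistics.</para>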
<para>It is possible to override the default JMX MBean names UIMA uses, for | |
example to better organize the UIMA MBeans with respect to MBeans exposed by | |
other parts of your application. This is done using the | |
<literal>AnalysisEngine.PARAM_MBEAN_NAME_PREFIX</literal> additional parameter | |
when creating your AnalysisEngine: | |
<programlisting> //set up Map with custom JMX MBean name prefix | |
Map paramMap = new HashMap(); | |
paramMap.put(AnalysisEngine.PARAM_MBEAN_NAME_PREFIX, | |
"org.myorg:category=MyApp"); | |
// create Analysis Engine | |
AnalysisEngine ae = | |
UIMAFramework.produceAnalysisEngine(specifier, paramMap); | |
</programlisting> | |
</para> | |
<para>Similarly, you can use the <literal>AnalysisEngine.PARAM_MBEAN_SERVER</literal>
parameter to specify a particular instance of a JMX MBean Server with which UIMA
should register the MBeans. If it is not specified, UIMA registers the MBeans with
the platform MBeanServer (Java 5+ only).</para>
<para>More information on JMX can be found in the <ulink | |
url="http://java.sun.com/j2se/1.5.0/docs/api/javax/management/package-summary.html#package_description"> | |
Java 5 documentation</ulink>.</para> | |
</section> | |
<section id="tug.application.pto"> | |
<title>Performance Tuning Options</title> | |
<para> | |
There are a small number of performance tuning options available to | |
influence the runtime behavior of UIMA applications. Performance | |
tuning options need to be set programmatically when an analysis | |
engine is created. You simply create a Java Properties object with | |
the relevant options and pass it to the UIMA framework on the call | |
to create an analysis engine. Below is an example. | |
<programlisting> | |
XMLParser parser = UIMAFramework.getXMLParser(); | |
ResourceSpecifier spec = parser.parseResourceSpecifier( | |
new XMLInputSource(descriptorFile)); | |
// Create a new properties object to hold the settings. | |
Properties performanceTuningSettings = new Properties(); | |
// Set the initial CAS heap size. | |
performanceTuningSettings.setProperty( | |
UIMAFramework.CAS_INITIAL_HEAP_SIZE, | |
"1000000"); | |
// Disable JCas cache. | |
performanceTuningSettings.setProperty( | |
UIMAFramework.JCAS_CACHE_ENABLED, | |
"false"); | |
// Create a wrapper properties object that can | |
// be passed to the framework. | |
Properties additionalParams = new Properties(); | |
// Set the performance tuning properties as value to | |
// the appropriate parameter. | |
additionalParams.put( | |
Resource.PARAM_PERFORMANCE_TUNING_SETTINGS, | |
performanceTuningSettings); | |
// Create the analysis engine with the parameters. | |
// The second, unused argument here is a custom | |
// resource manager. | |
this.ae = UIMAFramework.produceAnalysisEngine( | |
spec, null, additionalParams); | |
</programlisting> | |
</para> | |
<para> | |
The following options are supported: | |
<itemizedlist> | |
<listitem> | |
<para><literal>UIMAFramework.JCAS_CACHE_ENABLED</literal>: allows you to disable
the JCas cache (true/false). The JCas cache is an internal data structure that caches any JCas
object created by the CAS. This may result in better performance for applications that make
extensive use of the JCas, but it also incurs a steep memory overhead. If you are processing large
documents and have memory issues, you should disable this option. In general, run a few
experiments to see which setting works better for your application. The JCas cache is enabled
by default.
</para> | |
</listitem> | |
<listitem> | |
<para><literal>UIMAFramework.CAS_INITIAL_HEAP_SIZE</literal>: sets the initial CAS heap size in
number of cells (integer valued). The CAS uses 32-bit integer cells, so four times the initial
size is the approximate minimum size of the CAS in bytes. This is another space/time trade-off:
growing the CAS heap is relatively expensive, but setting the initial size too high
wastes memory. Unless you know you are processing very small or very large documents, you should
probably leave this option unchanged.
</para> | |
</listitem> | |
<listitem> | |
<para><literal>UIMAFramework.PROCESS_TRACE_ENABLED</literal>: enable the process trace mechanism | |
(true/false). When enabled, UIMA tracks the time spent in individual components of an aggregate | |
AE or CPE. For more information, see the API documentation of | |
<literal>org.apache.uima.util.ProcessTrace</literal>. | |
</para> | |
</listitem> | |
<listitem> | |
<para><literal>UIMAFramework.SOCKET_KEEPALIVE_ENABLED</literal>: enable socket KeepAlive | |
(true/false). This setting is currently only supported by Vinci clients. Defaults to | |
<literal>true</literal>. | |
</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
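<para>To make the relationship between heap cells and bytes concrete, the following sketch applies the
four-bytes-per-cell rule described for <literal>UIMAFramework.CAS_INITIAL_HEAP_SIZE</literal>. The class
<literal>CasHeapEstimate</literal> is a hypothetical name, and the result is only the approximate
minimum stated above, not a measurement of actual memory use.</para>

```java
// Hypothetical illustration of the cells-to-bytes rule of thumb.
final class CasHeapEstimate {

  /** Each CAS heap cell is a 32-bit integer, that is, 4 bytes. */
  static final int BYTES_PER_CELL = 4;

  /** Approximate minimum CAS size in bytes for a given initial heap size in cells. */
  static long approximateMinimumBytes(int initialHeapCells) {
    return (long) initialHeapCells * BYTES_PER_CELL;
  }

  public static void main(String[] args) {
    // An initial heap of 1,000,000 cells is about 4,000,000 bytes.
    System.out.println(approximateMinimumBytes(1000000)); // prints 4000000
  }
}
```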
</section> | |
</chapter> |