<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tutorials_and_users_guides/tug.aae/">
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent">
%uimaents;
]>
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.tug.aae"> | |
<title>Annotator and Analysis Engine Developer's Guide</title> | |
<titleabbrev>Annotator &amp; AE Developer's Guide</titleabbrev>
<para>This chapter describes how to develop UIMA <emphasis>type systems</emphasis>, | |
<emphasis>Annotators</emphasis> and <emphasis>Analysis Engines</emphasis> using | |
the UIMA SDK. It is helpful to read the UIMA Conceptual Overview chapter first for a review of
these concepts.</para>
<para>An <emphasis>Analysis Engine (AE)</emphasis> is a program that analyzes artifacts | |
(e.g. documents) and infers information from them.</para> | |
<para>Analysis Engines are constructed from building blocks called | |
<emphasis>Annotators</emphasis>. An annotator is a component that contains analysis | |
logic. Annotators analyze an artifact (for example, a text document) and create | |
additional data (metadata) about that artifact. It is a goal of UIMA that annotators need | |
not be concerned with anything other than their analysis logic – for example the | |
details of their deployment or their interaction with other annotators.</para> | |
<para>An Analysis Engine (AE) may contain a single annotator (this is referred to as a
<emphasis>Primitive AE</emphasis>), or it may be a composition of others and therefore
contain multiple annotators (this is referred to as an <emphasis>Aggregate | |
AE</emphasis>). Primitive and aggregate AEs implement the same interface and can be used | |
interchangeably by applications.</para> | |
<para>Annotators produce their analysis results in the form of typed <emphasis>Feature | |
Structures</emphasis>, which are simply data structures that have a type and a set of | |
(attribute, value) pairs. An <emphasis>annotation</emphasis> is a particular type of | |
Feature Structure that is attached to a region of the artifact being analyzed (a span of | |
text in a document, for example).</para> | |
<para>For example, an annotator may produce an Annotation over the span of text | |
<literal>President Bush</literal>, where the type of the Annotation is | |
<literal>Person</literal> and the attribute <literal>fullName</literal> has the | |
value <literal>George W. Bush</literal>, and its position in the artifact is character | |
position 12 through character position 26.</para> | |
<para>It is also possible for annotators to record information associated with the entire | |
document rather than a particular span (these are considered Feature Structures but not | |
Annotations).</para> | |
<para>All feature structures, including annotations, are represented in the UIMA | |
<emphasis>Common Analysis Structure (CAS)</emphasis>. The CAS is the central data
structure through which all UIMA components communicate. Included with the UIMA SDK is an | |
easy-to-use, native Java interface to the CAS called the <emphasis>JCas</emphasis>. | |
The JCas represents each feature structure as a Java object; the example feature | |
structure from the previous paragraph would be an instance of a Java class Person with | |
getFullName() and setFullName() methods. Though the examples in this guide all use the | |
JCas, it is also possible to directly access the underlying CAS system; for more
information see <olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
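<para>As a rough picture of what such a class looks like from the annotator's point of view, the
following plain-Java stand-in models the Person example. It is only a sketch: the class name
<literal>PersonSketch</literal>, the sample sentence, and the <literal>coveredText</literal>
helper are assumptions made for illustration, and a real JCas class is generated from the type
system and backed by the CAS rather than by Java fields. The sketch also assumes that the end
offset is exclusive, so positions 12 and 26 select the 14 characters of
<literal>President Bush</literal>.</para>
<programlisting><![CDATA[
```java
/** Plain-Java stand-in for a Person annotation (hypothetical; not a generated JCas class). */
class PersonSketch {
  private final int begin;  // offset of the first character of the span
  private final int end;    // offset just past the last character of the span
  private String fullName;  // the single feature of this illustrative type

  PersonSketch(int begin, int end) {
    this.begin = begin;
    this.end = end;
  }

  int getBegin() { return begin; }

  int getEnd() { return end; }

  String getFullName() { return fullName; }

  void setFullName(String fullName) { this.fullName = fullName; }

  /** Returns the text that this annotation covers in the given document. */
  String coveredText(String documentText) {
    return documentText.substring(begin, end);
  }

  public static void main(String[] args) {
    // Sample sentence chosen so that "President Bush" starts at offset 12.
    String doc = "Later today President Bush spoke.";
    PersonSketch person = new PersonSketch(12, 26);
    person.setFullName("George W. Bush");
    System.out.println(person.coveredText(doc) + " -> " + person.getFullName());
  }
}
```
]]></programlisting>
<para>The point of the sketch is the separation it shows: the span (begin and end) says where in
the artifact the annotation applies, while the feature (fullName) carries the inferred
information.</para>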
<para>The remainder of this chapter will refer to the analysis of text documents and the | |
creation of annotations that are attached to spans of text in those documents. Keep in mind | |
that the CAS can represent arbitrary types of feature structures, and feature structures | |
can refer to other feature structures. For example, you can use the CAS to represent a parse | |
tree for a document. Also, the artifact that you are analyzing need not be a text | |
document.</para> | |
<para>This guide is organized as follows:</para> | |
<itemizedlist> | |
<listitem> | |
<para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.getting_started"/></emphasis> is a | |
tutorial with step-by-step instructions for how to develop and test a simple UIMA annotator.</para> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.configuration_logging"/> | |
</emphasis> discusses how to make your UIMA annotator configurable, and how it can write messages to the UIMA | |
log file.</para> | |
</listitem> | |
<listitem> | |
<para> <emphasis role="bold-italic"><xref linkend="ugr.tug.aae.building_aggregates"/></emphasis> | |
describes how annotators can be combined into aggregate analysis engines. It also describes how one | |
annotator can make use of the analysis results produced by an annotator that has run previously.</para> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.other_examples"/></emphasis> | |
describes several other examples you may find interesting, including</para> | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>SimpleTokenAndSentenceAnnotator | |
– a simple tokenizer and sentence annotator.</para> | |
</listitem> | |
<listitem> | |
<para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational | |
database with some annotations. It uses JDBC and in this example, hooks up with the Open Source Apache | |
Derby database. </para> | |
</listitem> | |
</itemizedlist> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.additional_topics"/></emphasis> | |
describes additional features of the UIMA SDK that may help you in building your own annotators and analysis | |
engines.</para> | |
</listitem> | |
<listitem> | |
<para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.common_pitfalls"/> </emphasis> | |
contains some useful guidelines to help you ensure that your annotators will work correctly in any UIMA | |
application.</para> | |
</listitem> | |
</itemizedlist> | |
<para>This guide does not discuss how to build UIMA Applications, which are programs that | |
use Analysis Engines, along with other components, e.g. a search engine, document store, | |
and user interface, to deliver a complete package of functionality to an end-user. For | |
information on application development, see <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.application" | |
xrefstyle="select: label quotedtitle"/>.</para>
<section id="ugr.tug.aae.getting_started"> | |
<title>Getting Started</title> | |
<para>This section is a step-by-step tutorial that will get you started developing UIMA | |
annotators. All of the files referred to by the examples in this chapter are in the | |
<literal>examples</literal> directory of the UIMA SDK. This directory is designed to | |
be imported into your Eclipse workspace; see <olink targetdoc="&uima_docs_overview;"/> | |
<olink targetdoc="&uima_docs_overview;" | |
targetptr="ugr.ovv.eclipse_setup.example_code"/> for instructions on how to do | |
this. | |
See <olink targetdoc="&uima_docs_overview;"/> <olink targetdoc="&uima_docs_overview;" | |
targetptr="ugr.ovv.eclipse_setup.linking_uima_javadocs"/> for how to attach the UIMA | |
Javadocs to the jar files. | |
Also you may wish to refer to the UIMA SDK Javadocs located at <ulink
url="api/index.html">docs/api/index.html</ulink>.</para>
<note><para>In Eclipse 3.1, if you highlight a UIMA class or method defined in the UIMA SDK | |
Javadocs, you can conveniently have Eclipse open the corresponding Javadoc for that | |
class or method in a browser, by pressing Shift + F2.</para></note> | |
<note><para>If you downloaded the source distribution for UIMA, you can attach that as | |
well to the library Jar files; for information on how to do this, see | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.javadocs"/>.</para></note> | |
<para>The example annotator that we are going to walk through will detect room numbers for | |
rooms where the room numbering scheme follows some simple conventions. In our example, | |
there are two kinds of patterns we want to find; here are some examples, together with | |
their corresponding regular expression patterns: | |
<variablelist> | |
<varlistentry> | |
<term>Yorktown patterns:</term> | |
<listitem><para>20-001, 31-206, 04-123 (Regular Expression Pattern:
##-[0-2]##)</para></listitem>
</varlistentry> | |
<varlistentry> | |
<term>Hawthorne patterns:</term> | |
<listitem><para>GN-K35, 1S-L07, 4N-B21 (Regular Expression Pattern: | |
[G1-4][NS]-[A-Z]##)</para></listitem> | |
</varlistentry> | |
</variablelist> </para> | |
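<para>The informal patterns above can be checked against the sample room numbers with a few lines
of standalone Java. This is a sketch under one stated assumption: each <literal>#</literal> in the
informal notation is read as <quote>any single digit</quote>. The class and method names
(<literal>RoomPatternCheck</literal>, <literal>isYorktown</literal>,
<literal>isHawthorne</literal>) are made up for this illustration and are not part of the UIMA
examples.</para>
<programlisting><![CDATA[
```java
import java.util.regex.Pattern;

/** Standalone check of the two room-number conventions (illustrative only). */
class RoomPatternCheck {
  // Yorktown: two digits, a hyphen, then a digit 0-2 followed by two more digits.
  static final Pattern YORKTOWN = Pattern.compile("\\d\\d-[0-2]\\d\\d");

  // Hawthorne: G or 1-4, then N or S, a hyphen, an uppercase letter, two digits.
  static final Pattern HAWTHORNE = Pattern.compile("[G1-4][NS]-[A-Z]\\d\\d");

  /** True if the whole string is a Yorktown-style room number. */
  static boolean isYorktown(String s) {
    return YORKTOWN.matcher(s).matches();
  }

  /** True if the whole string is a Hawthorne-style room number. */
  static boolean isHawthorne(String s) {
    return HAWTHORNE.matcher(s).matches();
  }

  public static void main(String[] args) {
    String[] samples = {"20-001", "31-206", "04-123", "GN-K35", "1S-L07", "4N-B21"};
    for (String s : samples) {
      System.out.println(s + " yorktown=" + isYorktown(s) + " hawthorne=" + isHawthorne(s));
    }
  }
}
```
]]></programlisting>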
<para>There are several steps to develop and test a simple UIMA annotator.</para> | |
<orderedlist spacing="compact"><listitem><para>Define the CAS types that the | |
annotator will use.</para></listitem> | |
<listitem><para>Generate the Java classes for these types.</para></listitem> | |
<listitem><para>Write the actual annotator Java code.</para></listitem> | |
<listitem><para>Create the Analysis Engine descriptor.</para></listitem> | |
<listitem><para>Test the annotator. </para></listitem></orderedlist> | |
<para>These steps are discussed in the next sections.</para> | |
<section id="ugr.tug.aae.defining_types"> | |
<title>Defining Types</title> | |
<para>The first step in developing an annotator is to define the CAS Feature Structure | |
types that it creates. This is done in an XML file called a <emphasis>Type System | |
Descriptor</emphasis>. UIMA defines basic primitive types such as | |
Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive | |
types. UIMA also defines the built-in types <literal>TOP</literal>, which is the root | |
of the type system, analogous to Object in Java; <literal>FSArray</literal>, which is | |
an array of Feature Structures (i.e. an array of instances of TOP); and | |
<literal>Annotation</literal>, which we will discuss in more detail in this section.</para> | |
<para>UIMA includes an Eclipse plug-in that will help you edit Type System | |
Descriptors, so if you are using Eclipse you will not need to worry about the details of | |
the XML syntax. See <olink targetdoc="&uima_docs_overview;"/> <olink targetdoc="&uima_docs_overview;" | |
targetptr="ugr.ovv.eclipse_setup"/> for instructions on setting up Eclipse and | |
installing the plugin.</para> | |
<para>The Type System Descriptor for our annotator is located in the file | |
<literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>. (This
and all other examples are located in the <literal>examples</literal> directory of | |
the installation of the UIMA SDK, which can be imported into an Eclipse project for | |
your convenience, as described in <olink targetdoc="&uima_docs_overview;"/> | |
<olink targetdoc="&uima_docs_overview;" | |
targetptr="ugr.ovv.eclipse_setup.example_code"/>.)</para> | |
<para>In Eclipse, expand the <literal>uimaj-examples</literal> project in the | |
Package Explorer view, and browse to the file | |
<literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>. | |
Right-click on the file in the navigator and select Open With → Component | |
Descriptor Editor. Once the editor opens, click on the <quote>Type System</quote> | |
tab at the bottom of the editor window. You should see a view such as the | |
following:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata scale="100" format="JPG" fileref="&imgroot;image002.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of editor for Type System Definitions</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>Our annotator will need only one type – | |
<literal>org.apache.uima.tutorial.RoomNumber</literal>. (We use the same | |
namespace conventions as are used for Java classes.) Just as in Java, types have | |
supertypes. The supertype is listed in the second column of the left table. In this | |
case our RoomNumber annotation extends from the built-in type | |
<literal>uima.tcas.Annotation</literal>.</para> | |
<para>Descriptions can be included with types and features. In this example, there is a | |
description associated with the <literal>building</literal> feature. To see it, | |
hover the mouse over the feature.</para> | |
<para>The bottom tab labeled <quote>Source</quote> will show you the XML source file | |
associated with this descriptor.</para> | |
<para>The built-in Annotation type declares three fields (called | |
<emphasis>Features</emphasis> in CAS terminology). The features <literal>begin</literal> | |
and <literal>end</literal> store the character offsets of the span of text to which the | |
annotation refers. The feature <literal>sofa</literal> (Subject of Analysis) indicates | |
which document the begin and end offsets point into. The <literal>sofa</literal> feature | |
can be ignored for now since we assume in this tutorial that the CAS contains only one | |
subject of analysis (document).</para> | |
<para>Our RoomNumber type will inherit these three features from | |
<literal>uima.tcas.Annotation</literal>, its supertype; they are not visible in | |
this view because inherited features are not shown. One additional feature, | |
<literal>building</literal>, is declared. It takes a String as its value. Instead | |
of String, we could have declared the range-type of our feature to be any other CAS type | |
(defined or built-in).</para> | |
<para>If you are not using Eclipse and need to edit the type system, you can do so directly with
any XML or text editor. The following is the actual XML representation of the Type
System displayed above in the editor:</para>
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TutorialTypeSystem</name>
  <description>Type System Definition for the tutorial examples -
    as of Exercise 1</description>
  <vendor>Apache Software Foundation</vendor>
  <version>1.0</version>
  <types>
    <typeDescription>
      <name>org.apache.uima.tutorial.RoomNumber</name>
      <description></description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>building</name>
          <description>Building containing this room</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>]]></programlisting>
</section> | |
<section id="ugr.tug.aae.generating_jcas_sources"> | |
<title>Generating Java Source Files for CAS Types</title> | |
<para>When you save a descriptor that you have modified, the Component Descriptor | |
Editor will automatically generate Java classes corresponding to the types that are | |
defined in that descriptor (unless this has been disabled), using a utility called | |
JCasGen. These Java classes will have the same name (including package) as the CAS | |
types, and will have get and set methods for each of the features that you have | |
defined.</para> | |
<para>This feature is enabled/disabled using the UIMA menu pulldown (or the Eclipse | |
Preferences → UIMA). If automatic running of JCasGen is not happening, please | |
make sure the option is checked:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image004.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of enabling automatic running of JCasGen</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>The Java class for the example org.apache.uima.tutorial.RoomNumber type can | |
be found in <literal>src/org/apache/uima/tutorial/RoomNumber.java</literal>.
You will see how to use these generated classes in the next section.</para>
<para>If you are not using the Component Descriptor Editor, you will need to generate | |
these Java classes by using the <emphasis>JCasGen</emphasis> tool. JCasGen reads a | |
Type System Descriptor XML file and generates the corresponding Java classes that | |
you can then use in your annotator code. To launch JCasGen, run the jcasgen shell | |
script located in the <literal>/bin</literal> directory of the UIMA SDK | |
installation. This should launch a GUI that looks something like this:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of JCasGen</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>Use the <quote>Browse</quote> buttons to select your input file | |
(TutorialTypeSystem.xml) and output directory (the root of the source tree into | |
which you want the generated files placed). Then click the <quote>Go</quote> | |
button. If the Type System Descriptor has no errors, new Java source files will be | |
generated under the specified output directory.</para> | |
<para>There are some additional options to choose from when running JCasGen; please | |
refer to the <olink targetdoc="&uima_docs_tools;"/> <olink targetdoc="&uima_docs_tools;" | |
targetptr="ugr.tools.jcasgen"/> for details.</para> | |
</section> | |
<section id="ugr.tug.aae.developing_annotator_code"> | |
<title>Developing Your Annotator Code</title> | |
<para>Annotator implementations all implement a standard interface (AnalysisComponent), having several | |
methods, the most important of which are: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para><literal>initialize</literal>, </para> | |
</listitem> | |
<listitem> | |
<para><literal>process</literal>, and </para> | |
</listitem> | |
<listitem> | |
<para><literal>destroy</literal>. </para> | |
</listitem> | |
</itemizedlist></para> | |
<para><literal>initialize</literal> is called by the framework once when it first creates an instance of the | |
annotator class. <literal>process</literal> is called once per item being processed. | |
<literal>destroy</literal> may be called by the application when it is done using your annotator. There is a | |
default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which | |
has implementations of all required methods except for the process method.</para> | |
<para>Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend | |
from this class, so they only have to implement the process method. This class is not restricted to handling | |
just text; see <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>.</para> | |
<para>Annotators are not required to extend from the JCasAnnotator_ImplBase class; they may instead | |
directly implement the AnalysisComponent interface, and provide all method implementations themselves. | |
<footnote> | |
<para>Note that AnalysisComponent is not specific to JCas. There is a method getRequiredCasInterface()
which the user would have to implement to return <literal>JCas.class</literal>. Then in the | |
<literal>process(AbstractCas cas)</literal> method, they would need to typecast | |
<literal>cas</literal> to type <literal>JCas</literal>.</para></footnote> This allows you to have | |
your annotator inherit from some other superclass if necessary. If you would like to do this, see the Javadocs | |
for JCasAnnotator for descriptions of the methods you must implement.</para> | |
<para>Annotator classes need to be public, cannot be declared abstract, and must have public, 0-argument
constructors, so that they can be instantiated by the framework.<footnote>
<para>Although Java classes in which you do not define any constructor will, by default, have a 0-argument
constructor that doesn't do anything, a class in which you have defined at least one constructor does
not get a default 0-argument constructor.</para></footnote></para>
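<para>The constructor rule can be verified with plain Java reflection, independently of UIMA. The
following is a minimal sketch: the class names and the <literal>isInstantiable</literal> helper
are hypothetical, and the check only approximates what a framework needs in order to create an
instance from a class name; it is not taken from the UIMA implementation.</para>
<programlisting><![CDATA[
```java
import java.lang.reflect.Modifier;

/** Demonstrates which classes can be created through a public 0-argument constructor. */
class ConstructorCheck {
  /** No constructor declared: the compiler supplies a public 0-argument one. */
  public static class ImplicitDefault { }

  /** Declares only a 1-argument constructor, so no 0-argument constructor exists. */
  public static class OnlyOneArg {
    public OnlyOneArg(String name) { }
  }

  /** Abstract classes cannot be instantiated even though a constructor exists. */
  public abstract static class AbstractOne { }

  /** True if the class is public, concrete, and has a public 0-argument constructor. */
  static boolean isInstantiable(Class<?> c) {
    int mods = c.getModifiers();
    if (!Modifier.isPublic(mods) || Modifier.isAbstract(mods)) {
      return false;
    }
    try {
      c.getConstructor();  // looks up a public constructor with no parameters
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("ImplicitDefault: " + isInstantiable(ImplicitDefault.class));
    System.out.println("OnlyOneArg: " + isInstantiable(OnlyOneArg.class));
    System.out.println("AbstractOne: " + isInstantiable(AbstractOne.class));
  }
}
```
]]></programlisting>
<para>If your annotator needs a constructor with arguments for use in unit tests, declare an
explicit public 0-argument constructor alongside it.</para>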
<para>The class definition for our RoomNumberAnnotator implements the process method, and is shown here. You
can find the source for this in
<literal>uimaj-examples/src/org/apache/uima/tutorial/ex1/RoomNumberAnnotator.java</literal>.
<note> | |
<para>In Eclipse, in the <quote>Package Explorer</quote> view, this will appear by default in the project | |
<literal>uimaj-examples</literal>, in the folder <literal>src</literal>, in the package | |
<literal>org.apache.uima.tutorial.ex1</literal>.</para></note> In Eclipse, open the | |
RoomNumberAnnotator.java in the uimaj-examples project, under the src directory.</para> | |
<programlisting>package org.apache.uima.tutorial.ex1;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;

/**
 * Example annotator that detects room numbers using
 * Java 1.4 regular expressions.
 */
public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern mYorktownPattern =
          Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  private Pattern mHawthornePattern =
          Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

  public void process(JCas aJCas) {
    // Discussed Later
  }
}</programlisting>
<para>The two Java class fields, mYorktownPattern and mHawthornePattern, hold regular expressions that | |
will be used in the process method. Note that these two fields are part of the Java implementation of the | |
annotator code, and not a part of the CAS type system. We are using the regular expression facility that is | |
built into Java 1.4. It is not critical that you know the details of how this works, but if you are curious the | |
details can be found in the Java API docs for the java.util.regex package.</para> | |
<para>The only method that we are required to implement is <literal>process</literal>. This method is typically | |
called once for each document that is being analyzed. This method takes one argument, which is a JCas instance; | |
this holds the document to be analyzed and all of the analysis results. <footnote> | |
<para>Version 1 of UIMA specified an additional parameter, the ResultSpecification. This provides a | |
specification of which types and features are desired to be computed and "output" from this annotator. Its | |
use is optional; many annotators ignore it.</para> | |
<para> This parameter has been replaced by specific set/getResultSpecification() methods, which allow | |
the annotator to receive a signal (a method call) when the result specification changes.</para> | |
</footnote></para> | |
<programlisting>public void process(JCas aJCas) {
  // get document text
  String docText = aJCas.getDocumentText();
  // search for Yorktown room numbers
  Matcher matcher = mYorktownPattern.matcher(docText);
  int pos = 0;
  while (matcher.find(pos)) {
    // found one - create annotation
    RoomNumber annotation = new RoomNumber(aJCas);
    annotation.setBegin(matcher.start());
    annotation.setEnd(matcher.end());
    annotation.setBuilding("Yorktown");
    annotation.addToIndexes();
    pos = matcher.end();
  }
  // search for Hawthorne room numbers
  matcher = mHawthornePattern.matcher(docText);
  pos = 0;
  while (matcher.find(pos)) {
    // found one - create annotation
    RoomNumber annotation = new RoomNumber(aJCas);
    annotation.setBegin(matcher.start());
    annotation.setEnd(matcher.end());
    annotation.setBuilding("Hawthorne");
    annotation.addToIndexes();
    pos = matcher.end();
  }
}</programlisting>
<para>The Matcher class is part of the java.util.regex package and is used to find the room numbers in the | |
document text. When we find one, recording the annotation is as simple as creating a new Java object and | |
calling some set methods:</para> | |
<programlisting>RoomNumber annotation = new RoomNumber(aJCas);
annotation.setBegin(matcher.start());
annotation.setEnd(matcher.end());
annotation.setBuilding("Yorktown");</programlisting>
<para>The <literal>RoomNumber</literal> class was generated from the type system description by the | |
Component Descriptor Editor or the JCasGen tool, as discussed in the previous section.</para> | |
<para>Finally, we call <literal>annotation.addToIndexes()</literal> to add the new annotation to the | |
indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps | |
an index of all annotations in their order from beginning to end of the document. Subsequent annotators or | |
applications use the indexes to iterate over the annotations. </para> | |
<note> | |
<para> If you don't add the instance to the indexes, it cannot be retrieved by down-stream annotators, | |
using the indexes. </para></note> | |
<note> | |
<para>You can also call <literal>addToIndexes()</literal> on Feature Structures that are not subtypes of | |
<literal>uima.tcas.Annotation</literal>, but these will not be sorted in any particular way. If you want | |
to specify a sort order, you can define your own custom indexes in the CAS: see | |
<olink targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/> and <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes.index"/> for details.</para></note> | |
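<para>To see which offsets the matching loop in <literal>process</literal> would record, the same
loop can be run outside the framework on a plain string. The following is a sketch only: the
class <literal>RoomScanSketch</literal>, the <literal>findSpans</literal> helper, and the sample
sentence are assumptions made for illustration, and the begin and end offsets are collected into
a list instead of being stored in <literal>RoomNumber</literal> annotations.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Runs the Yorktown matching loop on a plain string (no CAS involved). */
class RoomScanSketch {
  // Same expression as mYorktownPattern in the tutorial annotator.
  static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  /** Returns {begin, end} offset pairs for every Yorktown room number in the text. */
  static List<int[]> findSpans(String docText) {
    List<int[]> spans = new ArrayList<>();
    Matcher matcher = YORKTOWN.matcher(docText);
    int pos = 0;
    while (matcher.find(pos)) {
      // In the annotator, these two values go to setBegin() and setEnd().
      spans.add(new int[] {matcher.start(), matcher.end()});
      pos = matcher.end();  // continue searching after this match
    }
    return spans;
  }

  public static void main(String[] args) {
    String doc = "Meet in 20-001, then move to 31-206.";
    for (int[] span : findSpans(doc)) {
      System.out.println(span[0] + ".." + span[1] + " " + doc.substring(span[0], span[1]));
    }
  }
}
```
]]></programlisting>
<para>Because the expression is anchored with word boundaries, a longer digit run such as
<literal>920-001</literal> produces no match in this sketch.</para>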
<para>We're almost ready to test the RoomNumberAnnotator. There is just one more step | |
remaining.</para> | |
</section> | |
<section id="ugr.tug.aae.creating_xml_descriptor"> | |
<title>Creating the XML Descriptor</title> | |
<para>The UIMA architecture requires that descriptive information about an | |
annotator be represented in an XML file and provided along with the annotator class | |
file(s) to the UIMA framework at run time. This XML file is called an | |
<emphasis>Analysis Engine Descriptor</emphasis>. The descriptor includes: | |
<itemizedlist><listitem><para>Name, description, version, and vendor</para> | |
</listitem> | |
<listitem><para>The annotator's inputs and outputs, defined in terms of | |
the types in a Type System Descriptor</para></listitem> | |
<listitem><para>Declaration of the configuration parameters that the | |
annotator accepts </para></listitem></itemizedlist> </para> | |
<para>The <emphasis>Component Descriptor Editor</emphasis> plugin, which we | |
previously used to edit the Type System descriptor, can also be used to edit Analysis | |
Engine Descriptors.</para> | |
<para>A descriptor for our RoomNumberAnnotator is provided with the UIMA | |
distribution under the name | |
<literal>descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>. To
edit it in Eclipse, right-click on that file in the navigator and select Open With | |
→ Component Descriptor Editor.</para> <tip><para>In Eclipse, you can double | |
click on the tab at the top of the Component Descriptor Editor's window | |
identifying the currently selected editor, and the window will | |
<quote>Maximize</quote>. Double click it again to restore the original size.</para> | |
</tip> | |
<para>If you are not using Eclipse, you will need to edit Analysis Engine descriptors | |
manually. See <xref linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for an | |
introduction to the Analysis Engine descriptor XML syntax. The remainder of this | |
section assumes you are using the Component Descriptor Editor plug-in to edit the | |
Analysis Engine descriptor.</para> | |
<para>The Component Descriptor Editor consists of several tabbed pages; we will only | |
need to use a few of them here. For more information on using this editor, see <olink | |
targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para> | |
<para>The initial page of the Component Descriptor Editor is the Overview page, which | |
appears as follows:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image008.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Component Descriptor Editor overview page</phrase> | |
</textobject> | |
</mediaobject> | |
</screenshot> | |
<para>This presents an overview of the RoomNumberAnnotator Analysis Engine (AE). The | |
left side of the page shows that this descriptor is for a | |
<emphasis>Primitive</emphasis> AE (meaning it consists of a single annotator), | |
and that the annotator code is developed in Java. Also, it specifies the Java class | |
that implements our logic (the code which was discussed in the previous section). | |
Finally, on the right side of the page are listed some descriptive attributes of our | |
annotator.</para> | |
<para>The other two pages that need to be filled out are the Type System page and the | |
Capabilities page. You can switch to these pages using the tabs at the bottom of the | |
Component Descriptor Editor. In the tutorial, these are already filled out for | |
you.</para> | |
<para>The RoomNumberAnnotator will be using the TutorialTypeSystem we looked at in | |
Section <xref linkend="ugr.tug.aae.defining_types"/>. To specify this, we add | |
this type system to the Analysis Engine's list of Imported Type Systems, using | |
the Type System page's right side panel, as shown here:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image010.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of CDE Type System page</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>On the Capabilities page, we define our annotator's inputs and outputs, in | |
terms of the types in the type system. The Capabilities page is shown below:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.3in" format="JPG" fileref="&imgroot;image012.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of CDE Capabilities page</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>Although capabilities come in sets, having multiple sets is deprecated; here | |
we're just using one set. The RoomNumberAnnotator is very simple. It requires | |
no input types, as it operates directly on the document text, which is supplied as a
part of the CAS initialization (and which is always assumed to be present). It | |
produces only one output type (RoomNumber), and it sets the value of the | |
<literal>building</literal> feature on that type. This is all represented on the | |
Capabilities page.</para> | |
<para>The Capabilities page has two other parts for specifying languages and Sofas. | |
The languages section allows you to specify which languages your Analysis Engine | |
supports. The RoomNumberAnnotator happens to be language-independent, so we can | |
leave this blank. The Sofas section allows you to specify the names of additional | |
subjects of analysis. This capability and the Sofa Mappings at the bottom are | |
advanced topics, described in <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aas"/>. </para> | |
<para>This is all of the information we need to provide for a simple annotator. If you | |
want to peek at the XML that this tool saves you from having to write, click on the | |
<quote>Source</quote> tab at the bottom to view the generated XML.</para> | |
</section> | |
<section id="ugr.tug.aae.testing_your_annotator"> | |
<title>Testing Your Annotator</title> | |
<para>Having developed an annotator, we need a way to try it out on some example | |
documents. The UIMA SDK includes a tool called the Document Analyzer that will allow | |
us to do this. To run the Document Analyzer, execute the documentAnalyzer shell | |
script that is in the <literal>bin</literal> directory of your UIMA SDK | |
installation, or, if you are using the example Eclipse project, execute the | |
<quote>UIMA Document Analyzer</quote> run configuration supplied with that | |
        project. (To do this, choose Run → Run ... from the menu bar, and under Java
        Applications in the left box, click on UIMA Document Analyzer.)</para>
<para>You should see a screen that looks like this:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image014.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of UIMA Document Analyzer GUI</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>There are six options on this screen:</para> | |
<orderedlist><listitem><para>Directory containing documents to analyze</para> | |
</listitem> | |
<listitem><para>Directory where analysis results will be written</para> | |
</listitem> | |
<listitem><para>The XML descriptor for the Analysis Engine (AE) you want to | |
run</para></listitem> | |
<listitem><para>(Optional) an XML tag, within the input documents, that contains | |
the text to be analyzed. For example, the value TEXT would cause the AE to only | |
analyze the portion of the document enclosed within | |
          &lt;TEXT&gt;...&lt;/TEXT&gt; tags.</para></listitem>
<listitem><para>Language of the document </para></listitem> | |
<listitem><para>Character encoding </para></listitem></orderedlist> | |
      <para>Use the Browse button next to the third item to set the <quote>Location of AE XML
        Descriptor</quote> field to the descriptor we've just been discussing:
        <literal>&lt;where-you-installed-uima-e.g.UIMA_HOME&gt;/examples/descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>.
        Set the other fields to the values shown in the screen shot above (which should be the
        default values if this is the first time you've run the Document Analyzer). Then
        click the <quote>Run</quote> button to start processing.</para>
<para>When processing completes, an <quote>Analysis Results</quote> window should | |
appear.</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="3.5in" format="JPG" fileref="&imgroot;image016.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of UIMA Document Analyzer Results GUI</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>Make sure <quote>Java Viewer</quote> is selected as the Results Display | |
Format, and <emphasis role="bold">double-click</emphasis> on the document | |
UIMASummerSchool2003.txt to view the annotations that were discovered. The view | |
should look something like this:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image018.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of UIMA CAS Annotation Viewer GUI</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>You can click the mouse on one of the highlighted annotations to see a list of all | |
its features in the frame on the right.</para> <note><para>The legend will only show | |
those types which have at least one instance in the CAS, and are declared as outputs in the | |
        capabilities section of the descriptor (see <xref
        linkend="ugr.tug.aae.creating_xml_descriptor"/>). </para></note>
      <para>You can use the Document Analyzer to test any UIMA annotator; just make sure
        that the annotator's classes are on the class path.</para>
</section> | |
</section> | |
<section id="ugr.tug.aae.configuration_logging"> | |
<title>Configuration and Logging</title> | |
<section id="ugr.tug.aae.configuration_parameters"> | |
<title>Configuration Parameters</title> | |
      <para>The example RoomNumberAnnotator from the previous section used hardcoded
        regular expressions and location names, which is not very flexible: supporting a
        new room number pattern means changing the annotator's Java code. A better
        solution is to supply the patterns and location names through configuration
        parameters.</para>
<para>UIMA allows annotators to declare configuration parameters in their | |
descriptors. The descriptor also specifies default values for the parameters, | |
though these can be overridden at runtime.</para> | |
<section id="ugr.tug.aae.declaring_parameters_in_the_descriptor"> | |
<title>Declaring Parameters in the Descriptor</title> | |
<para>The example descriptor | |
<literal>descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal> is | |
the same as the descriptor from the previous section except that information has | |
been filled in for the Parameters and Parameter Settings pages of the Component | |
Descriptor Editor.</para> | |
<para>First, in Eclipse, open example two's RoomNumberAnnotator in the | |
Component Descriptor Editor, and then go to the Parameters page (click on the | |
parameters tab at the bottom of the window), which is shown below:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image020.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameters page</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
        <para>Two parameters, Patterns and Locations, have been declared. In this
screen shot, the mouse (not shown) is hovering over Patterns to show its | |
description in the small popup window. Every parameter has the following | |
information associated with it:</para> | |
<itemizedlist><listitem><para>name – the name by which the annotator code | |
refers to the parameter</para></listitem> | |
<listitem><para>description – a natural language description of the | |
intent of the parameter</para></listitem> | |
<listitem><para>type – the data type of the parameter's value | |
– must be one of String, Integer, Float, or Boolean.</para></listitem> | |
          <listitem><para>multiValued – true if the parameter can take
            multiple values (an array), false if the parameter takes only a single value.
            Shown above as <literal>Multi</literal>.</para></listitem>
<listitem><para>mandatory – true if a value must be provided for the | |
parameter. Shown above as <literal>Req</literal> (for required). </para> | |
</listitem></itemizedlist> | |
<para>Both of our parameters are mandatory and accept an array of Strings as their | |
value.</para> | |
<para>Next, default values are assigned to the parameters on the Parameter Settings | |
page:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image022.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameter Settings page</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>Here the <quote>Patterns</quote> parameter is selected, and the right pane | |
shows the list of values for this parameter, in this case the regular expressions | |
that match particular room numbering conventions. Notice the third pattern is | |
new, for matching the style of room numbers in the third building, which has room | |
numbers such as <literal>J2-A11</literal>.</para> | |
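        <para>To make the new pattern concrete, here is a minimal standalone sketch of how
          a list of regular expressions like this could be tried against some text. The class
          name <literal>RoomPatternSketch</literal> and the three pattern strings are
          assumptions made for illustration (check the Parameter Settings page for the
          actual values); the sketch uses only <literal>java.util.regex</literal> and is
          not part of the tutorial code.</para>
        <programlisting>import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoomPatternSketch {
  // Assumed values for the "Patterns" parameter (illustrative only).
  static final String[] PATTERN_STRINGS = {
    "\\b[0-4]\\d-[0-2]\\d\\d\\b",
    "\\b[G1-4][NS]-[A-Z]\\d\\d\\b",
    "\\bJ[12]-[A-Z]\\d\\d\\b"   // third-building style, e.g. J2-A11
  };

  // Returns the first room number found by any pattern, or null if none matches.
  static String findRoomNumber(String text) {
    for (String regex : PATTERN_STRINGS) {
      Matcher matcher = Pattern.compile(regex).matcher(text);
      if (matcher.find()) {
        return matcher.group();
      }
    }
    return null;
  }
}</programlisting>
        <para>For example, <literal>RoomPatternSketch.findRoomNumber("Meet in J2-A11 at
          noon")</literal> finds the room number through the third pattern only; the first
          two patterns do not match it.</para>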
</section> | |
<section id="ugr.tug.aae.accessing_parameter_values_from_annotator"> | |
<title>Accessing Parameter Values from the Annotator Code</title> | |
<para>The class | |
<literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal> has | |
overridden the initialize method. The initialize method is called by the UIMA | |
framework when the annotator is instantiated, so it is a good place to read | |
configuration parameter values. The default initialize method does nothing with | |
configuration parameters, so you have to override it. To see the code in Eclipse, | |
switch to the src folder, and open | |
<literal>org.apache.uima.tutorial.ex2</literal>. Here is the method | |
body:</para> | |
<programlisting>/** | |
* @see AnalysisComponent#initialize(UimaContext) | |
*/ | |
public void initialize(UimaContext aContext) | |
throws ResourceInitializationException { | |
super.initialize(aContext); | |
// Get config. parameter values | |
String[] patternStrings = | |
(String[]) aContext.getConfigParameterValue("Patterns"); | |
mLocations = | |
(String[]) aContext.getConfigParameterValue("Locations"); | |
// compile regular expressions | |
mPatterns = new Pattern[patternStrings.length]; | |
  for (int i = 0; i &lt; patternStrings.length; i++) {
mPatterns[i] = Pattern.compile(patternStrings[i]); | |
} | |
}</programlisting> | |
<para>Configuration parameter values are accessed through the UimaContext. As you | |
will see in subsequent sections of this chapter, the UimaContext is the | |
annotator's access point for all of the facilities provided by the UIMA | |
framework – for example logging and external resource access.</para> | |
<para>The UimaContext's <literal>getConfigParameterValue</literal> | |
method takes the name of the parameter as an argument; this must match one of the | |
parameters declared in the descriptor. The return value of this method is a Java | |
Object, whose type corresponds to the declared type of the parameter. It is up to the | |
annotator to cast it to the appropriate type, String[] in this case.</para> | |
<para>If there is a problem retrieving the parameter values, the framework throws an | |
exception. Generally annotators don't handle these, and just let them | |
propagate up.</para> | |
        <para>To see the configuration parameters working, run the Document Analyzer
          application and select the descriptor
          <literal>examples/descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal>.
          In the example document <literal>WatsonConferenceRooms.txt</literal>, you
          should see some examples of Hawthorne II room numbers that would not have been
          detected by the ex1 version of RoomNumberAnnotator.</para>
</section> | |
<section id="ugr.tug.aae.supporting_reconfiguration"> | |
<title>Supporting Reconfiguration</title> | |
<para>If you take a look at the Javadocs (located in the <ulink | |
url="api/index.html">docs/api</ulink> directory) for | |
          <literal>org.apache.uima.analysis_component.AnalysisComponent</literal>
(which our annotator implements indirectly through JCasAnnotator_ImplBase), | |
you will see that there is a reconfigure() method, which is called by the containing | |
application through the UIMA framework, if the configuration parameter values | |
are changed.</para> | |
<para>The AnalysisComponent_ImplBase class provides a default implementation | |
that just calls the annotator's destroy method followed by its initialize | |
method. This works fine for our annotator. The only situation in which you might | |
want to override the default reconfigure() is if your annotator has very expensive | |
initialization logic, and you don't want to reinitialize everything if just | |
one configuration parameter has changed. In that case, you can provide a more | |
intelligent implementation of reconfigure() for your annotator.</para> | |
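        <para>As a rough illustration of what <quote>more intelligent</quote> could mean,
          the following standalone sketch recompiles a set of regular expressions only when
          the configured strings have actually changed. The class
          <literal>PatternCacheSketch</literal> and its methods are hypothetical and use
          only the Java standard library; in a real annotator, an overridden reconfigure()
          would re-read the parameter values from the UimaContext and hand them to
          something like <literal>update</literal>, which is not shown here.</para>
        <programlisting>import java.util.Arrays;
import java.util.regex.Pattern;

class PatternCacheSketch {
  private String[] lastPatternStrings = new String[0];
  private Pattern[] compiled = new Pattern[0];
  private int compileCount = 0;

  // Recompiles only if the pattern strings differ from the previous call.
  // Returns true if a recompile happened.
  boolean update(String[] patternStrings) {
    if (Arrays.equals(lastPatternStrings, patternStrings)) {
      return false; // nothing changed, keep the existing compiled patterns
    }
    Pattern[] fresh = new Pattern[patternStrings.length];
    int i = 0;
    for (String regex : patternStrings) {
      fresh[i++] = Pattern.compile(regex);
    }
    compiled = fresh;
    lastPatternStrings = patternStrings.clone();
    compileCount++;
    return true;
  }

  int getCompileCount() {
    return compileCount;
  }

  Pattern[] getCompiled() {
    return compiled;
  }
}</programlisting>
        <para>For an annotator as small as RoomNumberAnnotator this bookkeeping is not worth
          the trouble; it pays off only when the work being skipped is expensive.</para>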
</section> | |
<section id="ugr.tug.aae.configuration_parameter_groups"> | |
<title>Configuration Parameter Groups</title> | |
<para>For annotators with many sets of configuration parameters, UIMA supports | |
organizing them into groups. It is possible to define a parameter with the same name | |
in multiple groups; one common use for this is for annotators that can process | |
documents in several languages and which want to have different parameter | |
settings for the different languages.</para> | |
<para>The syntax for defining parameter groups in your descriptor is fairly | |
straightforward – see <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor"/> for details. Values of | |
parameters defined within groups are accessed through the two-argument version | |
of <literal>UimaContext.getConfigParameterValue</literal>, which takes | |
both the group name and the parameter name as its arguments.</para> | |
</section> | |
<section id="ugr.tug.aae.configuration_parameter_overrides"> | |
<title>Overriding Configuration Parameter Settings</title> | |
<para>There are two ways that the value assigned to a configuration parameter can be | |
overridden. An aggregate may declare a parameter that overrides one or more of the | |
parameters in one or more of its delegates. The aggregate must also define a value for the | |
parameter, unless the parameter is itself overridden by a setting in the parent | |
aggregate.</para> | |
<para>An alternative method that avoids these strict hierarchical override constraints is to | |
associate an external global name with a parameter and to assign values to these external | |
names in an external properties file. With this approach a particular parameter setting can | |
be easily shared by multiple descriptors, even across different applications. For applications | |
with many levels of descriptor nesting it avoids the need to edit aggregate override | |
definitions when the location of an annotator in the hierarchy is changed. | |
For details see | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides"/> | |
</para> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.logging"> | |
<title>Logging</title> | |
<para>The UIMA SDK provides a logging facility, which is very similar to the | |
java.util.logging.Logger class that was introduced in Java 1.4.</para> | |
<para>In the Java architecture, each logger instance is associated with a name. By | |
convention, this name is often the fully qualified class name of the component | |
issuing the logging call. The name can be referenced in a configuration file when | |
specifying which kinds of log messages to actually log, and where they should | |
go.</para> | |
<para>The UIMA framework supports this convention using the | |
<literal>UimaContext</literal> object. If you access a logger instance using | |
<literal>getContext().getLogger()</literal> within an Annotator, the logger | |
name will be the fully qualified name of the Annotator implementation class.</para> | |
<para>Here is an example from the process method of | |
<literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal>: | |
<programlisting>getContext().getLogger().log(Level.FINEST,"Found: " + annotation);</programlisting> | |
</para> | |
<para>The first argument to the log method is the level of the log output. Here, a value of | |
FINEST indicates that this is a highly-detailed tracing message. While useful for | |
debugging, it is likely that real applications will not output log messages at this | |
level, in order to improve their performance. Other defined levels, from lowest to | |
highest importance, are FINER, FINE, CONFIG, INFO, WARNING, and SEVERE.</para> | |
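      <para>The effect of a level threshold can be tried out with plain
        <literal>java.util.logging</literal>, which the UIMA logger resembles. The
        following is a minimal sketch; the class name <literal>LevelFilterSketch</literal>
        and the helper <literal>wouldLog</literal> are made up for this illustration and
        are not part of the UIMA SDK.</para>
      <programlisting>import java.util.logging.Level;
import java.util.logging.Logger;

class LevelFilterSketch {
  // Returns true if a logger whose threshold is 'threshold'
  // would emit a message logged at 'messageLevel'.
  static boolean wouldLog(Level threshold, Level messageLevel) {
    Logger logger = Logger.getAnonymousLogger();
    logger.setLevel(threshold);
    return logger.isLoggable(messageLevel);
  }
}</programlisting>
      <para>With a threshold of INFO, a FINEST message such as the one above is discarded,
        while WARNING and SEVERE messages pass. Checking <literal>isLoggable</literal>
        before building an expensive message string is a common way to keep tracing calls
        cheap when they are switched off.</para>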
      <para>If no logging configuration file is provided (see next section), the Java
        Virtual Machine defaults are used, which typically log messages at level INFO and
        higher, and direct output to the console.</para>
      <para>If you specify the standard UIMA SDK <literal>Logger.properties</literal>,
the output will be directed to a file named uima.log, in the current working directory | |
(often the <quote>project</quote> directory when running from Eclipse, for | |
instance).</para> <note><para>When using Eclipse, the uima.log file, if written | |
into the Eclipse workspace in the project uimaj-examples, for example, may not appear | |
in the Eclipse package explorer view until you right-click the uimaj-examples project | |
with the mouse, and select <quote>Refresh</quote>. This operation refreshes the | |
        Eclipse display to conform to what may have changed on the file system. Also, you can set
        the Eclipse preferences for the workspace to automatically refresh (Window →
        Preferences → General → Workspace, then click the <quote>refresh
        automatically</quote> checkbox).</para></note>
<section id="ugr.tug.aae.logging.configuring"> | |
<title>Specifying the Logging Configuration</title> | |
<para>The standard UIMA logger uses the underlying Java 1.4 logging mechanism. You | |
can use the APIs that come with that to configure the logging. In addition, the | |
standard Java 1.4 logging initialization mechanisms will look for a Java System | |
Property named <literal>java.util.logging.config.file</literal> and if | |
found, will use the value of this property as the name of a standard | |
<quote>properties</quote> file, for setting the logging level. Please refer to | |
          the Java 1.4 documentation for more information on the format and use of this
file.</para> | |
<para>Two sample logging specification property files can be found in the UIMA_HOME | |
directory where the UIMA SDK is installed: | |
<literal>config/Logger.properties</literal>, and | |
<literal>config/FileConsoleLogger.properties</literal>. These specify the same | |
logging, except the first logs just to a file, while the second logs both to a file and | |
to the console. You can edit these files, or create additional ones, as described | |
below, to change the logging behavior.</para> | |
<para>When running your own Java application, you can specify the location of the | |
logging configuration file on your Java command line by setting the Java system | |
property <literal>java.util.logging.config.file</literal> to be the logging | |
configuration filename. This file specification can be either absolute or | |
relative to the working directory. For example: | |
<programlisting><?db-font-size 65% ?>java "-Djava.util.logging.config.file=C:/Program Files/apache-uima/config/Logger.properties"</programlisting> | |
<note><para>In a shell script, you can use environment variables such as | |
UIMA_HOME if convenient.</para></note> </para> | |
<para>If you are using Eclipse to launch your application, you can set this property | |
in the VM arguments section of the Arguments tab of the run configuration screen. If | |
          you've set an environment variable UIMA_HOME, you could, for example, use the
          string:
          <literal>"-Djava.util.logging.config.file=${env_var:UIMA_HOME}/config/Logger.properties"</literal>.
</para> | |
        <para>If you are running the .bat or .sh files in the UIMA SDK's <literal>bin</literal> directory, you can specify the location of your
logger configuration file by setting the <literal>UIMA_LOGGER_CONFIG_FILE</literal> environment variable prior to running the script, | |
for example (on Windows): | |
<programlisting><?db-font-size 70% ?>set UIMA_LOGGER_CONFIG_FILE=C:/myapp/MyLogger.properties</programlisting> | |
</para> | |
</section> | |
<section id="ugr.tug.aae.logging.setting_logging_levels"> | |
<title>Setting Logging Levels</title> | |
        <para>Within the logging control file, the default global logging level specifies
          which kinds of events are logged across all loggers; for example:
          <literal>.level= INFO</literal>. For any given facility this
          global level can be overridden by a facility-specific level. Multiple handlers are
          supported, which allows messages to be directed to a log file as well as to a
          <quote>console</quote>. Note that the ConsoleHandler also has a separate level
          setting to limit messages printed to the console.</para>
<para>The properties file can change where the log is written, as well.</para> | |
        <para>Facility-specific properties allow a different logging level for each
          logger. For example, to set the com.xyz.foo logger to only log SEVERE messages:
          <literal>com.xyz.foo.level = SEVERE</literal></para>
        <para>If you have a sample annotator class
          <literal>org.apache.uima.SampleAnnotator</literal>, you can set its log level
          by specifying: <literal>org.apache.uima.SampleAnnotator.level =
          ALL</literal></para>
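        <para>To see how the global and facility-specific levels interact, the following
          sketch loads a small properties text through the standard
          <literal>java.util.logging.LogManager</literal> API instead of a file. The class
          <literal>LevelConfigSketch</literal> and its helper are hypothetical, and the
          logger name <literal>com.xyz.bar</literal> used afterwards is just another
          made-up name. Note that <literal>readConfiguration</literal> replaces the logging
          configuration of the whole JVM, so this is meant for experiments only.</para>
        <programlisting>import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

class LevelConfigSketch {
  // Applies 'config' (properties-file syntax) as the JVM-wide logging
  // configuration, then reports whether the named logger would emit
  // a message at the given level.
  static boolean wouldLog(String config, String loggerName, Level level) {
    try {
      LogManager.getLogManager().readConfiguration(
          new ByteArrayInputStream(config.getBytes(StandardCharsets.ISO_8859_1)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Logger logger = Logger.getLogger(loggerName);
    return logger.isLoggable(level);
  }
}</programlisting>
        <para>With the two lines <literal>.level=INFO</literal> and
          <literal>com.xyz.foo.level=SEVERE</literal> as the configuration, the
          <literal>com.xyz.foo</literal> logger drops WARNING messages, while a logger
          without its own setting, such as <literal>com.xyz.bar</literal>, inherits the
          global INFO level.</para>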
<para>There are other logging controls; for a full discussion, please read the | |
contents of the <literal>Logger.properties</literal> file and the Java | |
specification for logging in Java 1.4.</para> | |
</section> | |
<section id="ugr.tug.aae.logging.output_format"> | |
<title>Format of logging output</title> | |
<para>The logging output is formatted by handlers specified in the properties file | |
for configuring logging, described above. The default formatter that comes with | |
the UIMA SDK formats logging output as follows:</para> | |
<para><literal>Timestamp - threadID: sourceInfo: Message level: | |
message</literal></para> | |
<para> Here's an example:</para> | |
<para><literal>7/12/04 2:15:35 PM - 10: | |
org.apache.uima.util.TestClass.main(62): INFO: You are not logged | |
in!</literal></para> | |
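        <para>For readers who want a similar layout from their own handler, here is a
          minimal sketch of a <literal>java.util.logging.Formatter</literal> that produces
          lines of roughly this shape. It is not the formatter shipped with the UIMA SDK;
          the class name <literal>SketchLogFormatter</literal> and the date pattern are
          assumptions, and the source line number is left out because
          <literal>LogRecord</literal> does not carry it.</para>
        <programlisting>import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

class SketchLogFormatter extends Formatter {
  @Override
  public String format(LogRecord record) {
    // Timestamp in a layout similar to the example above (assumed pattern).
    String timestamp =
        new SimpleDateFormat("M/d/yy h:mm:ss a").format(new Date(record.getMillis()));
    // Source info: class and method that issued the logging call.
    String source = record.getSourceClassName() + "." + record.getSourceMethodName();
    return timestamp + " - " + record.getThreadID() + ": " + source + ": "
        + record.getLevel() + ": " + formatMessage(record) + System.lineSeparator();
  }
}</programlisting>
        <para>A formatter like this would be attached to a handler through the
          <literal>formatter</literal> property of that handler in the logging properties
          file, or with <literal>Handler.setFormatter</literal> in code.</para>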
</section> | |
<section id="ugr.tug.aae.logging.meaning_of_severity_levels"> | |
<title>Meaning of the logging severity levels</title> | |
<para>These levels are defined by the Java logging framework, which was | |
incorporated into Java as of the 1.4 release level. The levels are defined in the | |
Javadocs for java.util.logging.Level, and include both logging and tracing | |
levels: | |
<itemizedlist spacing="compact"> | |
<listitem><para>OFF is a special level that can be used to turn off | |
logging.</para></listitem> | |
<listitem><para>ALL indicates that all messages should be logged. </para> | |
</listitem> | |
<listitem><para>CONFIG is a message level for configuration messages. These | |
would typically occur once (during configuration) in methods like | |
<literal>initialize()</literal>. </para></listitem> | |
<listitem><para>INFO is a message level for informational messages, for | |
example, connected to server IP: 192.168.120.12 </para></listitem> | |
<listitem><para>WARNING is a message level indicating a potential | |
problem.</para></listitem> | |
<listitem><para>SEVERE is a message level indicating a serious | |
failure.</para></listitem> | |
</itemizedlist></para> | |
<para> Tracing levels, typically used for debugging: | |
<itemizedlist> | |
<listitem><para>FINE is a message level providing tracing information, | |
typically at a collection level (messages occurring once per collection). | |
</para></listitem> | |
<listitem><para>FINER indicates a fairly detailed tracing message, | |
typically at a document level (once per document).</para></listitem> | |
<listitem><para>FINEST indicates a highly detailed tracing message. </para> | |
</listitem></itemizedlist></para> | |
</section> | |
<section id="ugr.tug.aae.logging.using_outside_of_an_annotator"> | |
<title>Using the logger outside of an annotator</title> | |
<para>An application using UIMA may want to log its messages using the same logging | |
framework. This can be done by getting a reference to the UIMA logger, as follows: | |
<programlisting>Logger logger = UIMAFramework.getLogger(TestClass.class);</programlisting> | |
</para> | |
<para>The optional class argument allows filtering by class (if the log handler | |
supports this). If not specified, the name of the returned logger instance is | |
<quote>org.apache.uima</quote>.</para> | |
</section> | |
<section id="ugr.tug.aae.logging.change_logger_implementation"> | |
<title>Changing the underlying UIMA logging implementation</title> | |
        <para>By default, the UIMA framework uses the Java logging framework under the hood of the UIMA Logger interface.
          It is possible to replace Java logging with an arbitrary logging system
          by specifying the system property
        <programlisting>-Dorg.apache.uima.logger.class=&lt;loggerClass&gt;</programlisting>
          when the UIMA framework is started.
        </para>
<para> | |
          The specified logger class must be available on the classpath and must implement the 
          <code>org.apache.uima.util.Logger</code> interface.
</para> | |
<para> | |
          UIMA also provides a logging implementation that uses Apache Log4j instead of Java logging. To
          use Log4j, you have to provide the Log4j jars on the classpath, and your application
          must specify the logging configuration as shown below.
<programlisting><?db-font-size 80% ?>-Dorg.apache.uima.logger.class=org.apache.uima.util.impl.Log4jLogger_impl</programlisting> | |
</para> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.building_aggregates"> | |
<title>Building Aggregate Analysis Engines</title> | |
<section id="ugr.tug.aae.combining_annotators"> | |
<title>Combining Annotators</title> | |
<para>The UIMA SDK makes it very easy to combine any sequence of Analysis Engines to | |
form an <emphasis>Aggregate Analysis Engine</emphasis>. This is done through an | |
XML descriptor; no Java code is required!</para> | |
<para>If you go to the <literal>examples/descriptors/tutorial/ex3</literal> | |
folder (in Eclipse, it's in your uimaj-examples project, under the | |
<literal>descriptors/tutorial/ex3</literal> folder), you will find a | |
descriptor for a TutorialDateTime annotator. This annotator detects dates and | |
times. To see what this annotator can do, try it out | |
using the Document Analyzer. If you are curious as to how this annotator works, the | |
source code is included, but it is not necessary to understand the code at this | |
time.</para> | |
<para>We are going to combine the TutorialDateTime annotator with the | |
RoomNumberAnnotator to create an aggregate Analysis Engine. This is illustrated | |
in the following figure: | |
<figure id="ugr.tug.aae.fig.combining_annotators"> | |
<title>Combining Annotators to form an Aggregate Analysis Engine</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="PNG" | |
fileref="&imgroot;image024.png"/> | |
</imageobject> | |
<textobject> <phrase>Combining Annotators to form an Aggregate Analysis | |
Engine</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> </para> | |
<para>The descriptor that does this is named | |
<literal>RoomNumberAndDateTime.xml</literal>, which you can open in the | |
Component Descriptor Editor plug-in. This is in the uimaj-examples project in the | |
folder <literal>descriptors/tutorial/ex3</literal>. </para> | |
<para>The <quote>Aggregate</quote> page of the Component Descriptor Editor is | |
used to define which components make up the aggregate. A screen shot is shown below. | |
(If you are not using Eclipse, see <xref | |
linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for the actual XML syntax | |
for Aggregate Analysis Engine Descriptors.)</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image026.jpg"/> | |
</imageobject> | |
<textobject> | |
<phrase>Aggregate page of the Component Descriptor Editor (CDE)</phrase> | |
</textobject> | |
</mediaobject> | |
</screenshot> | |
<para>On the left side of the screen is the list of component engines that make up the | |
aggregate – in this case, the TutorialDateTime annotator and the | |
RoomNumberAnnotator. To add a component, you can click the <quote>Add</quote> | |
button and browse to its descriptor. You can also click the <quote>Find AE</quote> | |
button and search for an Analysis Engine in your Eclipse workspace. | |
<note><para>The <quote>AddRemote</quote> button is used for adding components | |
which run remotely (for example, on another machine using a remote networking | |
connection). This capability is described in section <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
            targetptr="ugr.tug.application.how_to_call_a_uima_service"/>.</para>
</note> </para> | |
<para>The order of the components in the left pane does not imply an order of | |
execution. The order of execution, or <quote>flow</quote> is determined in the | |
<quote>Component Engine Flow</quote> section on the right. UIMA supports | |
different types of algorithms (including user-definable) for determining the | |
flow. Here we pick the simplest: <literal>FixedFlow</literal>. We have chosen to | |
have the RoomNumberAnnotator execute first, although in this case it | |
doesn't really matter, since the RoomNumber and DateTime annotators do not | |
have any dependencies on one another.</para> | |
<para>If you look at the <quote>Type System</quote> page of the Component | |
Descriptor Editor, you will see that it displays the type system but is not | |
editable. The Type System of an Aggregate Analysis Engine is automatically | |
computed by merging the Type Systems of all of its components.</para> | |
      <warning><para>If the components have different definitions for the same type name,
      the Component Descriptor Editor will show a warning. It is possible to continue past
      this warning, in which case your aggregate's type system will have the correct
      <quote>merged</quote>
      type definition that contains all of the features defined on that type by all of your
      components. However, it is not recommended to use this feature in conjunction with JCas,
      since the JCas Java class definitions cannot be so easily merged. See
<olink targetdoc="&uima_docs_ref;"/> | |
<olink | |
targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.jcas.merging_types_from_other_specs"/> for more information. | |
</para></warning> | |
<para>The Capabilities page is where you explicitly declare the aggregate Analysis | |
Engine's inputs and outputs. Sofas and Languages are described later. | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image028.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screen shot of the Capabilities page of the Component Descriptor Editor | |
</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>Note that it is not automatically assumed that all outputs of each component | |
Analysis Engine (AE) are passed through as outputs of the aggregate AE. If, for example, | |
the TutorialDateTime annotator also produced Word and Sentence annotations, | |
but those were not of interest as output in this case, we can exclude them from the | |
list of outputs.</para> | |
      <para>You can run this AE using the Document Analyzer in the same way that you run any
        other AE. Just select the
        <literal>examples/descriptors/tutorial/ex3/RoomNumberAndDateTime.xml</literal>
        descriptor and click the Run button. You
        should see that RoomNumbers, Dates, and Times are all shown:</para>
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image030.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screen shot results of running the Document Analyzer | |
</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</section> | |
<section id="ugr.tug.aae.aaes_can_contain_cas_consumers"> | |
<title>AAEs can also contain CAS Consumers</title> | |
      <para>In addition to aggregating Analysis Engines, Aggregates can also contain CAS
      Consumers (see <olink targetdoc="&uima_docs_tutorial_guides;"
      targetptr="ugr.tug.cpe"/>), or even a mixture of these components with regular
      Analysis Engines. The UIMA examples include an Aggregate that contains
      both an analysis engine and a CAS consumer, in
      <literal>examples/descriptors/MixedAggregate.xml</literal>.</para>
<para>Analysis Engines support the <literal>collectionProcessComplete</literal> | |
method, which is particularly important for many CAS Consumers. If | |
an application (or a Collection Processing Engine) calls | |
<literal>collectionProcessComplete</literal> on an aggregate, the framework | |
will deliver that call to all of the components of the aggregate. If you use | |
one of the built-in flow types (fixedFlow or capabilityLanguageFlow), then the | |
order specified in that flow will be the same order in which the | |
<literal>collectionProcessComplete</literal> calls are made to the components. | |
If a custom flow is used, then the calls will be made in arbitrary order. | |
</para> | |
</section> | |
<section id="ugr.tug.aae.reading_results_previous_annotators"> | |
<title>Reading the Results of Previous Annotators</title> | |
<para>So far, we have been looking at annotators that look directly at the document text. However, annotators | |
can also use the results of other annotators. One useful thing we can do at this point is look for the | |
co-occurrence of a Date, a RoomNumber, and two Times – and annotate that as a Meeting.</para> | |
<para>The CAS maintains <emphasis>indexes</emphasis> of annotations, and from an index you can obtain an | |
iterator that allows you to step through all annotations of a particular type. Here's some example code | |
that would iterate over all of the TimeAnnot annotations in the JCas: | |
<programlisting>for (TimeAnnot timeAnnot : aJCas.select(TimeAnnot.class)) {
  // do something with timeAnnot
}</programlisting></para>
<note> | |
<para>You can also use the method | |
<literal>aJCas.getAllIndexedFS(YourClass.type)</literal>, which returns an iterator | |
over instances of <literal>YourClass</literal> in no particular order. | |
<!-- Fixed by UIMA-4111 But beware - if you've defined | |
a <literal>set</literal> index for this type, and haven't defined any non-set indexes for this type, then, | |
the method would return only those instances in the set. So, in a pathological case, if you defined the | |
set so that the key was some particular field, and all instances of this type had the same key, then | |
only one instance of this type would be found.</para> | |
<para>To guarantee the existance of an index that would have an entry for all unique indexed | |
Feature Structures, define a bag or sorted index for the type. | |
</para>. | |
<para>All types which are subtypes of the built-in Annotation type have a sorted index, and so all instances of those | |
types are guaranteed to be found (at least once) by this iterator. --> | |
</para> | |
<para>Also, if you've defined your own custom index as described in <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes.index"/>, you can get an iterator over that | |
specific index by calling <literal>aJCas.getIndex(label, clazz)</literal>. | |
The <literal>getIndex(...)</literal> method's second argument
specializes the index to a subtype of the type the index was declared over. For instance,
if you defined an index called "allEvents" over the type <literal>Event</literal>, and wanted | |
to get an index over just a particular subtype of event, say, <literal>TimeEvent</literal>, | |
you can ask for that index using | |
<literal>aJCas.getIndex("allEvents", TimeEvent.class)</literal>.</para></note> | |
<para>Now that we've explained the basics, let's take a look at the process method for | |
<literal>org.apache.uima.tutorial.ex4.MeetingAnnotator</literal>. Since we're looking for a | |
combination of a RoomNumber, a Date, and two Times, there are four nested iterators. (There's surely a | |
better algorithm for doing this, but to keep things simple we're just going to look at every combination | |
of the four items.)</para> | |
<para>For each combination of the four annotations, we compute the span of text that includes all of them, and | |
then we check to see if that span is smaller than a <quote>window</quote> size, a configuration parameter. | |
There are also some checks to make sure that we don't annotate the same span of text multiple times. If all | |
the checks pass, we create a Meeting annotation over the whole span. There's really nothing to | |
it!</para> | |
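<para>The span and window check described above can be sketched in plain Java as follows.
This is only an illustrative sketch under stated assumptions, not the code of the tutorial's
MeetingAnnotator: the class <literal>MeetingWindowSketch</literal>, its nested
<literal>Span</literal> type, and the method names are hypothetical stand-ins for the
annotation objects, and offsets are plain character positions.</para>
<programlisting><![CDATA[import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of the UIMA examples.
class MeetingWindowSketch {
  /** Stand-in for an annotation: a character range from begin to end. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
  }

  /** Smallest span that covers all of the given spans. */
  static Span cover(List<Span> parts) {
    int begin = Integer.MAX_VALUE;
    int end = Integer.MIN_VALUE;
    for (Span s : parts) {
      begin = Math.min(begin, s.begin);
      end = Math.max(end, s.end);
    }
    return new Span(begin, end);
  }

  /** True if the covering span is smaller than the configured window size. */
  static boolean fitsWindow(Span room, Span date, Span time1, Span time2,
                            int windowSize) {
    Span all = cover(Arrays.asList(room, date, time1, time2));
    return all.end - all.begin < windowSize;
  }
}]]></programlisting>
<para>In a real annotator, a combination that passes this check (and the duplicate-span
checks) would lead to a Meeting annotation being created over the covering span.</para>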
<para>The XML descriptor, located in | |
<literal>examples/descriptors/tutorial/ex4/MeetingAnnotator.xml</literal>, is also very
straightforward. An important difference from previous descriptors is that this is the first annotator | |
we've discussed that has input requirements. This can be seen on the <quote>Capabilities</quote> | |
page of the Component Descriptor Editor:</para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image032.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screen shot of Capabilities page of the Component Descriptor Editor | |
</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>If we were to run the MeetingAnnotator on its own, it wouldn't detect anything because it | |
wouldn't have any input annotations to work with. The required input annotations can be produced by the | |
RoomNumber and DateTime annotators. So, we create an aggregate Analysis Engine containing these two | |
annotators, followed by the Meeting annotator. This aggregate is illustrated in <xref | |
linkend="ugr.tug.aae.fig.aggregate_for_meeting_annotator"/>. The descriptor for this is in | |
<literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>. Give it a try in the
Document Analyzer. | |
<figure id="ugr.tug.aae.fig.aggregate_for_meeting_annotator"> | |
<title>An Aggregate Analysis Engine where an internal component uses output from previous | |
engines</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="PNG" fileref="&imgroot;image034.png"/> | |
</imageobject> | |
<textobject><phrase>An Aggregate Analysis Engine where an internal component uses output from | |
previous engines. </phrase> | |
</textobject> | |
</mediaobject> | |
</figure> </para> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.other_examples"> | |
<title>Other examples</title> | |
<para>The UIMA SDK includes several other examples you may find interesting,
including:</para>
<itemizedlist spacing="compact"> | |
<listitem><para>SimpleTokenAndSentenceAnnotator – a simple tokenizer and | |
sentence annotator.</para></listitem> | |
<listitem><para>XmlDetagger – A multi-sofa annotator that does XML | |
detagging. Multiple Sofas (Subjects of Analysis) are described in a later chapter;
see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.mvs"/>. The annotator reads XML data from the input Sofa
(named "xmlDocument"); this data can be stored in the CAS as a string or array, or it can | |
be a URI to a remote file. The XML is parsed using the JVM's default parser, and the | |
plain-text content is written to a new sofa called "plainTextDocument".</para> | |
</listitem> | |
<listitem><para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer | |
which populates a relational database with some annotations. It uses JDBC and in this | |
example, hooks up with the Open Source Apache Derby database. </para></listitem> | |
</itemizedlist> | |
</section> | |
<section id="ugr.tug.aae.additional_topics"> | |
<title>Additional Topics</title> | |
<section id="ugr.tug.aae.contract_for_annotator_methods"> | |
<title>Contract: Annotator Methods Called by the Framework</title> | |
<titleabbrev>Annotator Methods</titleabbrev> | |
<para>The UIMA framework ensures that an Annotator instance is called by only one | |
thread at a time. An instance never has to worry about running some method on one | |
thread, and then asynchronously being called using another thread. This approach | |
simplifies the design of annotators – they do not have to be designed to support | |
multi-threading. When multithreading is wanted for performance, multiple
instances of the Annotator are created, each one running on just one thread.</para>
<para>The following table defines the methods called by the framework, when they are | |
called, and the requirements annotator implementations must follow.</para> | |
<informaltable frame="all"> | |
<tgroup cols="3" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="1*"/> | |
<colspec colname="c2" colwidth="2*"/> | |
<colspec colname="c3" colwidth="2*"/> | |
<thead> | |
<row> | |
<entry align="center">Method</entry> | |
<entry align="center">When Called by Framework</entry> | |
<entry align="center">Requirements</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>initialize</entry> | |
<entry>Typically only called once, when the instance is created. Can be called
again if the application does a reconfigure call and the default behavior
isn't overridden (the default behavior for reconfigure is to call
<literal>destroy</literal> followed by
<literal>initialize</literal>).</entry>
<entry>Normally does one-time initialization, including reading of | |
configuration parameters. If the application changes the parameters, it | |
can call initialize to have the annotator re-do its | |
initialization.</entry> | |
</row> | |
<row> | |
<entry>typeSystemInit</entry> | |
<entry>Called before <literal>process</literal> whenever the type system | |
in the CAS being passed in differs from what was previously passed in a | |
<literal>process</literal> call (and called for the first CAS passed in, | |
too). The Type System being passed to an annotator only changes in the case of | |
remote annotators that are active as servers, receiving possibly | |
different type systems to operate on.</entry> | |
<entry>Typically, users of JCas do not implement any method for this. An | |
annotator can use this call to read the CAS type system and setup any instance | |
variables that make accessing the types and features convenient.</entry> | |
</row> | |
<row> | |
<entry>process</entry> | |
<entry>Called once for each CAS. Called by the application if not using the
Collection Processing Manager (CPM); the application calls the process
method on the analysis engine, which is then delegated by the framework to
all the annotators in the engine. For Collection Processing applications,
the CPM calls the process method. If the application creates and manages
its own Collection Processing Engine via API calls (see Javadocs), the
application calls this on the Collection Processing Engine, and it is
delegated by the framework to the components.</entry>
<entry>Process the CAS, adding and/or modifying elements in it</entry> | |
</row> | |
<row> | |
<entry>destroy</entry> | |
<entry>This method can be called by applications, and is also called by the | |
Collection Processing Manager framework when the collection processing | |
completes. When a delegate (or flow controller) in an aggregate fails to
initialize, it is also called on those delegate components of the aggregate
that had already completed their <literal>initialize</literal> call successfully.
This allows components which need to clean up things done during initialization
to do so. It is up to the component writer to use a try/finally construct during initialization
to clean up after errors that occur during initialization within one component.
The <literal>destroy</literal> call on an aggregate is | |
propagated to all contained analysis engines.</entry> | |
<entry>An annotator should release all resources, close files, close | |
database connections, etc., and return to a state where another initialize | |
call could be received to restart. Typically, after a destroy call, no | |
further calls will be made to an annotator instance.</entry> | |
</row> | |
<row> | |
<entry>reconfigure</entry> | |
<entry><para>This method is never called by the framework, unless an
application calls it on the Engine object – in which case the
framework propagates it to all annotators contained in the Engine.</para>
<para>Its purpose is to signal that the configuration parameters have | |
changed.</para></entry> | |
<entry>A default implementation of this calls destroy, followed by | |
initialize. This is the only case where initialize would be called more than | |
once. Users should implement whatever logic is needed to return the | |
annotator to an initialized state, including re-reading the | |
configuration parameter data.</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
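<para>The call order implied by the table can be illustrated with a small stand-alone
sketch. The class <literal>LifecycleSketch</literal> below is hypothetical and does not
extend any UIMA class; it only records calls, and its <literal>reconfigure</literal>
method mirrors the default behavior described above (destroy, followed by
initialize).</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a real annotator would extend a UIMA base class instead.
class LifecycleSketch {
  /** Records the order of calls so the contract can be observed. */
  final List<String> calls = new ArrayList<>();

  /** One-time setup: read configuration parameters, acquire resources. */
  void initialize() { calls.add("initialize"); }

  /** Called once per artifact; here the artifact is just a String. */
  void process(String document) { calls.add("process"); }

  /** Release resources so that a later initialize could start afresh. */
  void destroy() { calls.add("destroy"); }

  /** Default described in the table: destroy, then initialize. */
  void reconfigure() {
    destroy();
    initialize();
  }
}]]></programlisting>
<para>An annotator that can cheaply re-read its parameters might override
<literal>reconfigure</literal> rather than pay for a full destroy and initialize
cycle.</para>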
</section> | |
<section id="ugr.tug.aae.reporting_errors_from_annotators"> | |
<title>Reporting errors from Annotators</title> | |
<para>There are two broad classes of errors that can occur: recoverable and | |
unrecoverable. Because Annotators are often expected to process very large numbers | |
of artifacts (for example, text documents), they should be written to recover where | |
possible.</para> | |
<para>For example, if an upstream annotator created some input for an annotator which | |
is invalid, the annotator may want to log this event, ignore the bad input and | |
continue. It may include a notification of this event in the CAS, for further | |
downstream annotators to consider. Or, it may throw an exception (see next section) | |
– but in this case, it cannot do any further processing on that | |
document.</para> <note><para>The choice of what to do can be made configurable, | |
using the configuration parameters. </para></note> | |
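<para>A configurable choice between skipping bad input and failing could look roughly like
the following sketch. Everything here is an assumption made for illustration: the class
<literal>ErrorPolicySketch</literal>, the <literal>failOnInvalidInput</literal> flag (which
would come from a configuration parameter), and the use of a plain list in place of a
logger. A real annotator would throw one of the UIMA exception classes described in the
next section rather than <literal>IllegalStateException</literal>.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the UIMA examples.
class ErrorPolicySketch {
  /** Assumed configuration parameter: fail fast, or skip bad input. */
  private final boolean failOnInvalidInput;

  /** Stand-in for a logger; collects warnings about skipped input. */
  final List<String> warnings = new ArrayList<>();

  ErrorPolicySketch(boolean failOnInvalidInput) {
    this.failOnInvalidInput = failOnInvalidInput;
  }

  /** Returns the number of valid upstream values that were processed. */
  int process(List<String> upstreamValues) {
    int processed = 0;
    for (String value : upstreamValues) {
      if (value == null || value.isEmpty()) {
        if (failOnInvalidInput) {
          // Unrecoverable by configuration: give up on this document.
          throw new IllegalStateException("invalid input from upstream annotator");
        }
        // Recoverable: record the event, ignore the bad input, continue.
        warnings.add("skipped invalid input");
        continue;
      }
      processed++;
    }
    return processed;
  }
}]]></programlisting>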
</section> | |
<section id="ugr.tug.aae.throwing_exceptions_from_annotators"> | |
<title>Throwing Exceptions from Annotators</title> | |
<para>Let's say an invalid regular expression was passed as a parameter to the | |
RoomNumberAnnotator. Because this is an error related to the overall | |
configuration, and not something we could expect to ignore, we should throw an | |
appropriate exception, and most Java programmers would expect to do so like | |
this:</para> | |
<programlisting>throw new ResourceInitializationException( | |
"The regular expression " + x + " is not valid.");</programlisting> | |
<para>UIMA, however, does not do it this way. All UIMA exceptions are | |
<emphasis>internationalized</emphasis>, meaning that they support translation | |
into other languages. This is accomplished by eliminating hardcoded message | |
strings and instead using external message digests. Message digests are files | |
containing (key, value) pairs. The key is used in the Java code instead of the actual | |
message string. This allows the message string to be easily translated later by | |
modifying the message digest file, not the Java code. Also, message strings in the | |
digest can contain parameters that are filled in when the exception is thrown. The | |
format of the message digest file is described in the Javadocs for the Java class | |
<literal>java.util.PropertyResourceBundle</literal> and in the load method of | |
<literal>java.util.Properties</literal>.</para> | |
<para>The first thing an annotator developer must choose is what Exception class to | |
use. There are three to choose from: | |
<orderedlist><listitem><para>ResourceConfigurationException should be | |
thrown from the annotator's reconfigure() method if invalid configuration | |
parameter values have been specified. | |
</para></listitem> | |
<listitem><para>ResourceInitializationException should be thrown from the | |
annotator's initialize() method if initialization fails for any | |
reason (including invalid configuration parameters).</para></listitem> | |
<listitem><para>AnalysisEngineProcessException should be thrown from the | |
annotator's process() method if the processing of a particular document | |
fails for any reason. </para></listitem></orderedlist></para> | |
<para>Generally you will not need to define your own custom exception classes, but if | |
you do they must extend one of these three classes, which are the only types of | |
Exceptions that the annotator interface permits annotators to throw.</para> | |
<para>All of the UIMA Exception classes share common constructor varieties. There are | |
four possible arguments:</para> | |
<orderedlist>
<listitem><para>The name of the message digest to use (optional – if not specified the
default UIMA message digest is used).</para></listitem>
<listitem><para>The key string used to select the message in the message digest.</para></listitem>
<listitem><para>An object array containing the parameters to include in the message. Messages
can have substitutable parts. When the message is given, the string representation
of the objects passed are substituted into the message. The object array is often
created using the syntax <literal>new Object[]{x, y}</literal>.</para></listitem>
<listitem><para>Another exception which is the <quote>cause</quote> of the exception you are
throwing. This feature is commonly used when you catch another exception and rethrow
it. (optional)</para></listitem>
</orderedlist>
<para>If you look at the source file (folder: src in Eclipse)
<literal>org.apache.uima.tutorial.ex5.RoomNumberAnnotator</literal>, you | |
will see the following code: | |
<programlisting>try { | |
mPatterns[i] = Pattern.compile(patternStrings[i]); | |
} | |
catch (PatternSyntaxException e) { | |
throw new ResourceInitializationException( | |
MESSAGE_DIGEST, "regex_syntax_error", | |
new Object[]{patternStrings[i]}, e); | |
}</programlisting> | |
where the MESSAGE_DIGEST constant has the value
<literal>"org.apache.uima.tutorial.ex5.RoomNumberAnnotator_Messages"</literal>.
</para> | |
<para>Message digests are specified using a dotted name, just like Java classes. This | |
file, with the .properties extension, must be present in the class path. In Eclipse, | |
you find this file under the src folder, in the package | |
org.apache.uima.tutorial.ex5, with the name | |
RoomNumberAnnotator_Messages.properties. Outside of Eclipse, you can find this | |
in the <literal>uimaj-examples.jar</literal> with the name | |
<literal>org/apache/uima/tutorial/ex5/RoomNumberAnnotator_Messages.properties</literal>.
If you look in this file you will see the line: | |
<programlisting>regex_syntax_error = {0} is not a valid regular expression.</programlisting> | |
which is the error message for the example exception we showed above. The placeholder | |
{0} will be filled by the toString() value of the argument passed to the exception | |
constructor – in this case, the regular expression pattern that didn't | |
compile. If there were additional arguments, their locations in the message would be | |
indicated as {1}, {2}, and so on.</para> | |
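<para>The lookup and placeholder substitution can be tried out with standard Java classes
alone. The sketch below assumes that placeholders follow the conventions of
<literal>java.text.MessageFormat</literal>, which the <literal>{0}</literal> syntax
suggests; the class <literal>MessageDigestSketch</literal> and its
<literal>localize</literal> method are hypothetical, and the digest is passed as a String
instead of being loaded from the class path.</para>
<programlisting><![CDATA[import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical sketch; not how the UIMA exception classes are implemented.
class MessageDigestSketch {
  /** Looks up a key in the digest text and fills in the {n} placeholders. */
  static String localize(String digestText, String key, Object... params) {
    try {
      // A real digest is a .properties file on the class path;
      // a String keeps this sketch self-contained.
      ResourceBundle bundle =
          new PropertyResourceBundle(new StringReader(digestText));
      return MessageFormat.format(bundle.getString(key), params);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}]]></programlisting>
<para>For example, calling <literal>localize</literal> with the digest line shown above,
the key <literal>regex_syntax_error</literal>, and the parameter <literal>"[0-9"</literal>
yields the text <literal>[0-9 is not a valid regular expression.</literal></para>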
<para>If a message digest is not specified in the call to the exception constructor, the | |
default is <literal>UIMAException.STANDARD_MESSAGE_CATALOG</literal> (whose | |
value is <quote><literal>org.apache.uima.UIMAException_Messages</literal> | |
</quote> in the current release but may change). This message digest is located in the | |
<literal>uima-core.jar</literal> file at | |
<literal>org/apache/uima/UIMAException_Messages.properties</literal>;
you can take a look to see whether any of these exception messages are useful to
you.</para>
<para>To try out the regex_syntax_error exception, just use the Document Analyzer to | |
run | |
<literal>examples/descriptors/tutorial/ex5/RoomNumberAnnotator.xml</literal>,
which happens to have an invalid regular expression in its configuration parameter
settings.</para> | |
<para>To summarize, here are the steps to take if you want to define your own exception | |
message:</para> | |
<orderedlist>
<listitem><para>Create a file with the .properties extension, where you declare message keys and
their associated messages, using the same syntax as shown above for the
regex_syntax_error exception. The properties file syntax is more completely
described in the Javadocs for the <ulink
url="http://java.sun.com/j2se/1.5.0/docs/api/java/util/Properties.html#load(java.io.InputStream)">
load</ulink> method of the java.util.Properties class.</para></listitem>
<listitem><para>Put your properties file somewhere in your class path (it can be in your
annotator's .jar file).</para></listitem>
<listitem><para>Define a String constant (called MESSAGE_DIGEST for example) in your annotator
code whose value is the dotted name of this properties file. For example, if your
properties file is inside your jar file at the location
<literal>org/myorg/myannotator/Messages.properties</literal>, then this
String constant should have the value
<literal>org.myorg.myannotator.Messages</literal>. Do not include the
.properties extension. In Java Internationalization terminology, this is called
the Resource Bundle name. For more information see the Javadocs for the <ulink
url="http://java.sun.com/j2se/1.5.0/docs/api/java/util/PropertyResourceBundle.html">
PropertyResourceBundle</ulink> class.</para></listitem>
<listitem><para>In your annotator code, throw an exception like this:
<programlisting>throw new ResourceInitializationException(
  MESSAGE_DIGEST, "your_message_name",
  new Object[]{param1,param2,...});</programlisting></para></listitem>
</orderedlist>
<para>You may also wish to look at the Javadocs for the UIMAException class.</para> | |
<para>For more information on Java's internationalization features, see the | |
<ulink url="http://java.sun.com/j2se/1.5.0/docs/guide/intl/index.html"> | |
Java Internationalization Guide</ulink>.</para> | |
</section> | |
<section id="ugr.tug.aae.accessing_external_resource_files"> | |
<title>Accessing External Resources</title> | |
<para>External Resources are Java objects that have a life cycle where they | |
are (optionally) initialized at startup time by reading external data from | |
a file or via a URL (which can access information over the http protocol, for instance). | |
It is not <emphasis>required</emphasis> that External Resource objects
do any external data reading to initialize themselves. However, this is such a | |
common use case, that we will presume this mode of operation in the description below.</para> | |
<para>Sometimes you may want an annotator to read from an external resource, | |
such as a URL or a file – for | |
example, a long list of keys and values that you are going to build into a HashMap. You | |
could, of course, just introduce a configuration parameter that holds the absolute | |
path or URL to this resource, and build the HashMap in your annotator's | |
initialize method. However, this is not the best solution for three reasons:</para> | |
<orderedlist><listitem><para>Including an absolute path in your descriptor to | |
specify the initialization data makes | |
your annotator difficult for others to use. Each user will need to edit this | |
descriptor and set the absolute path to a value appropriate for his or her | |
installation.</para></listitem> | |
<listitem><para>You cannot share the created Java object(s), e.g., a HashMap, | |
between multiple annotators. Also, | |
in some deployment scenarios there may be more than one instance of your annotator, | |
and you would like to have the option for them to share the same Java Object(s).</para></listitem> | |
<listitem><para>Your annotator would become dependent on a particular
implementation of the Java Object(s). It would be better if there were
a decoupling between the actual implementation and the API used to
access it.</para></listitem></orderedlist>
<para>A better way to create these sharable Java objects and initialize them | |
via external disk or URL sources is through the ResourceManager | |
component. In this section we are going to show an example of how to use the Resource | |
Manager.</para> | |
<para>This example annotator will annotate UIMA acronyms (e.g. UIMA, AE, CAS, JCas) | |
and store the acronym's expanded form as a feature of the annotation. The | |
acronyms and their expanded forms are stored in an external file.</para> | |
<para>First, look at the | |
<literal>examples/descriptors/tutorial/ex6/UimaAcronymAnnotator.xml</literal> | |
descriptor. | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image036.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screen shot of Component Descriptor Editor page for configuring External Resources | |
</phrase></textobject> | |
</mediaobject> | |
</screenshot></para> | |
<para>The values of the rows in the two tables are longer than can be easily shown. You can | |
click the small button at the top right to shift the layout from two side-by-side | |
tables, to a vertically stacked layout. You can also click the small twisty on the | |
<quote>Imports for External Resources and Bindings</quote> to collapse this | |
section, because it's not used here. Then the same screen will appear like this: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image038.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screen shot of Component Descriptor Editor page for configuring External Resources after | |
adjusting the layout | |
</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>The top window has a scroll bar allowing you to see the rest of the line.</para> | |
<section id="ugr.tug.aae.resources.declaring_dependencies"> | |
<title>Declaring Resource Dependencies</title> | |
<para>The bottom window is where an annotator declares an external resource | |
dependency. The XML for this is as follows:</para> | |
<programlisting><![CDATA[<externalResourceDependency> | |
<key>AcronymTable</key> | |
<description>Table of acronyms and their expanded forms.</description> | |
<interfaceName> | |
org.apache.uima.tutorial.ex6.StringMapResource | |
</interfaceName> | |
</externalResourceDependency> | |
]]></programlisting> | |
<para>The &lt;key&gt; value (AcronymTable) is the name by which the annotator
identifies this resource. The key must be unique for all resources that this | |
annotator accesses, but the same key could be used by different annotators to mean | |
different things. The interface name | |
(<literal>org.apache.uima.tutorial.ex6.StringMapResource</literal>) is | |
the Java interface through which the annotator accesses the data. Specifying an | |
interface name is optional. If you do not specify an interface name, annotators | |
will instead get an interface which can provide direct access to the | |
data resource (file or URL) that is | |
associated with this external resource.</para> | |
</section> | |
<section id="ugr.tug.aae.resources.accessing_from_uimacontext"> | |
<title>Accessing the Resource from the UimaContext</title> | |
<para> If you look at the | |
<literal>org.apache.uima.tutorial.ex6.UimaAcronymAnnotator</literal> | |
source, you will see that the annotator accesses this resource from the | |
UimaContext by calling: | |
<programlisting>StringMapResource mMap = | |
(StringMapResource)getContext().getResourceObject("AcronymTable");</programlisting> | |
</para> | |
<para>The object returned from the <literal>getResourceObject</literal> method | |
will implement the interface declared in the | |
<literal>&lt;interfaceName&gt;</literal> section of the descriptor,
<literal>StringMapResource</literal> in this case. The annotator code does not | |
need to know the location of external data that may be used to initialize this
object, nor the Java class that might be used to read the | |
data and implement the <literal>StringMapResource</literal> | |
interface.</para> | |
<para>Note that if we did not specify a Java interface in our descriptor, our | |
annotator could directly access the resource data as follows: | |
<programlisting>InputStream stream = getContext().getResourceAsStream("AcronymTable");</programlisting></para> | |
<para>If necessary, the annotator could also determine the location of the resource | |
file, by calling: | |
<programlisting>URI uri = getContext().getResourceURI("AcronymTable");</programlisting></para> | |
<para>These last two options are only available in the case where the descriptor does | |
not declare a Java interface.</para> | |
<note><para>The methods for getting access to resources include <literal>getResourceURL</literal>. That | |
method returns a URL, which may contain spaces encoded as %20. url.getPath() would | |
return the path without decoding these %20 into spaces. <literal>getResourceURI</literal> | |
on the other hand, returns a URI, and the uri.getPath() <emphasis>does</emphasis> | |
do the conversion of %20 into spaces. See also <literal>getResourceFilePath</literal>, | |
which does a getResourceURI followed by uri.getPath().</para></note> | |
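<para>The difference between the two <literal>getPath()</literal> behaviors can be seen
with the standard <literal>java.net</literal> classes alone. The following sketch does not
call any UIMA method; the class <literal>ResourcePathSketch</literal> and the example
location are hypothetical.</para>
<programlisting><![CDATA[import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch using only java.net; no UIMA calls are made.
class ResourcePathSketch {
  /** Path as reported by URL.getPath(): %20 escapes are left encoded. */
  static String urlPath(String location) {
    try {
      URL url = URI.create(location).toURL();
      return url.getPath();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** Path as reported by URI.getPath(): %20 escapes are decoded to spaces. */
  static String uriPath(String location) {
    return URI.create(location).getPath();
  }
}]]></programlisting>
<para>For the location <literal>file:/opt/my%20resources/uimaAcronyms.txt</literal>,
<literal>urlPath</literal> returns <literal>/opt/my%20resources/uimaAcronyms.txt</literal>
while <literal>uriPath</literal> returns
<literal>/opt/my resources/uimaAcronyms.txt</literal>.</para>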
</section> | |
<section id="ugr.tug.aae.resources.declaring_and_bindings"> | |
<title>Declaring Resources and Bindings</title> | |
<para>Refer back to the top window in the Resources page of the Component Descriptor | |
Editor. This is where we specify the location of the resource data, and the Java | |
class used to read the data. For the example, this corresponds to the following | |
section of the descriptor: | |
<programlisting><![CDATA[<resourceManagerConfiguration> | |
<externalResources> | |
<externalResource> | |
<name>UimaAcronymTableFile</name> | |
<description> | |
A table containing UIMA acronyms and their expanded forms. | |
</description> | |
<fileResourceSpecifier> | |
<fileUrl>file:org/apache/uima/tutorial/ex6/uimaAcronyms.txt | |
</fileUrl> | |
</fileResourceSpecifier> | |
<implementationName> | |
org.apache.uima.tutorial.ex6.StringMapResource_impl | |
</implementationName> | |
</externalResource> | |
</externalResources> | |
<externalResourceBindings> | |
<externalResourceBinding> | |
<key>AcronymTable</key> | |
<resourceName>UimaAcronymTableFile</resourceName> | |
</externalResourceBinding> | |
</externalResourceBindings> | |
</resourceManagerConfiguration> | |
]]></programlisting></para> | |
<para>The first section of this XML declares an externalResource, the | |
<literal>UimaAcronymTableFile</literal>. Within it, the fileUrl element
specifies the path to the data file. This can be a file on the file system,
but can also be a remote resource accessed via, e.g., the http protocol.
The fileUrl element doesn't have to be a "file"; it can be any URL.
This can be an absolute URL (e.g. one that starts | |
with file:/ or file:///, or file://my.host.org/), but that is not recommended | |
because it makes installation of your component more difficult, as noted earlier. | |
Better is a relative URL, which will be looked up within the classpath (and/or | |
datapath), as used in this example. In this case, the file | |
<literal>org/apache/uima/tutorial/ex6/uimaAcronyms.txt</literal> is | |
located in <literal>uimaj-examples.jar</literal>, which is in the classpath. | |
If you look in this file you will see the definitions of several UIMA | |
acronyms.</para> | |
<para>The second section of the XML declares an externalResourceBinding, which | |
connects the key <literal>AcronymTable</literal>, declared in the | |
annotator's external resource dependency, to the actual resource name | |
<literal>UimaAcronymTableFile</literal>. This is rather trivial in this case; | |
for more on bindings see the example | |
<literal>UimaMeetingDetectorAE.xml</literal> below. There is no global | |
repository for external resources; it is up to the user to define each resource | |
needed by a particular set of annotators.</para> | |
<para>In the Component Descriptor Editor, bindings are indicated below the | |
external resource. To create a new binding, you select an external resource (which | |
must have previously been defined), and an external resource dependency, and then | |
click the <literal>Bind</literal> button, which only enables if you have | |
selected two things to bind together.</para> | |
<para>When the Analysis Engine is initialized, it creates a single instance of
<literal>StringMapResource_impl</literal> and loads it with the contents of
the data file. That is, the framework calls the instance's <literal>load</literal>
method, passing it an instance of DataResource, from which you can obtain
a stream or URI/URL for the data declared in the external resource.
For resources where
loading does not make sense, you can implement a <literal>load</literal> method
that ignores its argument and just returns, or performs whatever
initialization is appropriate at startup time. See the Javadocs for
SharedResourceObject for details.</para>
<para> | |
The UimaAcronymAnnotator then accesses the data through the | |
<literal>StringMapResource</literal> interface. This single instance could | |
be shared among multiple annotators, as will be explained later.</para> | |
<warning><para> | |
Because the implementation of the resource is shared,
you should ensure that your implementation is thread-safe, as it
could be called on multiple threads simultaneously.</para></warning>
<para>Note that all resource implementation classes (e.g. | |
StringMapResource_impl in the provided example) must be declared public,
must not be declared abstract, and must have public, 0-argument constructors, so | |
that they can be instantiated by the framework. (Although Java classes in which | |
you do not define any constructor will, by default, have a 0-argument constructor | |
that doesn't do anything, a class in which you have defined at least one | |
constructor does not get a default 0-argument constructor.)</para> | |
<para>All resource implementation classes that provide access to resource data | |
must also implement the interface org.apache.uima.resource.SharedResourceObject. | |
The UIMA Framework | |
will invoke this interface's only method, <code>load</code>, | |
after this object has been instantiated. The implementation of this method | |
can then read data from the specified <code>DataResource</code> | |
and use that data to initialize this object. It can also do whatever | |
resource initialization might be appropriate to do at startup time.</para> | |
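<para>The requirements above (a no-argument constructor, a <code>load</code> step that
reads the data once, and thread-safe access afterwards) can be illustrated with a minimal,
framework-free sketch. It is not the code of <literal>StringMapResource_impl</literal>: the class name
<literal>AcronymTableSketch</literal> is hypothetical, a plain <literal>InputStream</literal> stands in
for the <code>DataResource</code> argument, and the tab-separated line format is an assumption
made for illustration only.</para>
<programlisting><![CDATA[
```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a shared resource object; not a UIMA class. */
class AcronymTableSketch {
  // Replaced once by load(), then only read. The volatile field publishes the
  // finished, unmodifiable map safely to every thread that calls get().
  private volatile Map<String, String> table = Collections.emptyMap();

  // No-argument constructor. In a real resource implementation both the class
  // and this constructor would have to be public so the framework can use them.
  AcronymTableSketch() {}

  /** Reads "key<TAB>value" lines (an assumed format) from the stream. */
  void load(InputStream in) {
    Map<String, String> loaded = new HashMap<>();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        int tab = line.indexOf('\t');
        if (tab > 0) { // skip lines without a key or without a separator
          loaded.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    table = Collections.unmodifiableMap(loaded);
  }

  /** Returns the expanded form for a key, or null if the key is unknown. */
  String get(String key) {
    return table.get(key);
  }

  public static void main(String[] args) {
    String data = "UIMA\tUnstructured Information Management Architecture\n"
        + "CAS\tCommon Analysis Structure\n";
    AcronymTableSketch resource = new AcronymTableSketch();
    resource.load(new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)));
    System.out.println(resource.get("CAS"));
  }
}
```
]]></programlisting>
<para>Building the map in a local variable and publishing it only when complete is one simple
way to keep readers from ever seeing a half-loaded table; other designs (for example a
concurrent map) would serve equally well.</para>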
<para>This annotator is illustrated in <xref | |
linkend="ugr.tug.aae.fig.external_resource_binding"/>. To see it in | |
action, just run it using the Document Analyzer. When it finishes, open the
UIMA_Seminars document in the processed results window (double-click it), and
then left-click on one of the highlighted terms to see the expandedForm
feature's value.
<figure id="ugr.tug.aae.fig.external_resource_binding"> | |
<title>External Resource Binding</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="3.7in" format="PNG" | |
fileref="&imgroot;image040.png"/> | |
</imageobject> | |
<textobject><phrase>External Resource Binding</phrase></textobject> | |
</mediaobject> | |
</figure> </para> | |
<para>By designing our annotator in this way, we have gained some flexibility. We can | |
freely replace the StringMapResource_impl class with any other implementation | |
that implements the simple StringMapResource interface. (For example, for very | |
large resources we might not be able to have the entire map in memory.) We have also | |
made our external resource dependencies explicit in the descriptor, which will | |
help others to deploy our annotator.</para> | |
</section> | |
<section id="ugr.tug.aae.resources.sharing_among_annotators"> | |
<title>Sharing Resources among Annotators</title> | |
<para>Another advantage of the Resource Manager is that it allows our data to be | |
shared between annotators. To demonstrate this we have developed another | |
annotator that will use the same acronym table. The UimaMeetingAnnotator will | |
iterate over Meeting annotations discovered by the Meeting Detector we | |
previously developed and attempt to determine whether the topic of the meeting is | |
related to UIMA. It will do this by looking for occurrences of UIMA acronyms in close | |
proximity to the meeting annotation. We could implement this by using the | |
UimaAcronymAnnotator, of course, but for the sake of this example we will have the | |
UimaMeetingAnnotator access the acronym map directly.</para> | |
<para>The Java code for the UimaMeetingAnnotator in example 6 creates a new type, | |
UimaMeeting, if it finds a meeting within 50 characters of the UIMA | |
acronym.</para> | |
<para>We combine three analysis engines: the UimaAcronymAnnotator to annotate
UIMA acronyms, the MeetingDetector from example 4 to find meetings, and finally
the UimaMeetingAnnotator to annotate just meetings about UIMA. Together these
are assembled to form the new aggregate analysis engine, UimaMeetingDetector.
This aggregate and the sharing of a common resource are illustrated in <xref | |
linkend="ugr.tug.aae.fig.sharing_common_resource"/>. | |
<figure id="ugr.tug.aae.fig.sharing_common_resource"> | |
<title>Component engines of an aggregate share a common resource</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="PNG" | |
fileref="&imgroot;image042.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of Component engines of an aggregate sharing a | |
common resource</phrase></textobject> | |
</mediaobject> | |
</figure> The important thing to notice is in the | |
<literal>UimaMeetingDetectorAE.xml</literal> aggregate descriptor. It | |
includes both the UimaMeetingAnnotator and the UimaAcronymAnnotator, and | |
contains a single declaration of the UimaAcronymTableFile resource. (The actual | |
example has the order of the first two annotators reversed versus the above | |
picture, which is OK since they do not depend on one another).</para> | |
<para>It also binds the resources as follows: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image044.jpg"/> | |
</imageobject> | |
<textobject><phrase>UimaMeetingDetectorAE.xml binding a common resource</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<programlisting><![CDATA[<externalResourceBindings> | |
<externalResourceBinding> | |
<key>UimaAcronymAnnotator/AcronymTable</key> | |
<resourceName>UimaAcronymTableFile</resourceName> | |
</externalResourceBinding> | |
<externalResourceBinding> | |
<key>UimaMeetingAnnotator/UimaTermTable</key> | |
<resourceName>UimaAcronymTableFile</resourceName> | |
</externalResourceBinding> | |
</externalResourceBindings> | |
]]></programlisting> | |
</para> | |
<para>This binds the resource dependencies of both the UimaAcronymAnnotator | |
(which uses the name AcronymTable) and UimaMeetingAnnotator (which uses | |
UimaTermTable) to the single declared resource named UimaAcronymTableFile.
Therefore they will share the same instance. Resource bindings in the aggregate | |
descriptor <emphasis role="bold-italic">override</emphasis> any resource | |
declarations in individual annotator descriptors.</para> | |
<para>If we wanted to have the annotators use different acronym tables, we could | |
easily do that. We would simply have to change the resourceName elements in the | |
bindings so that they referred to two different resources. The Resource Manager | |
gives us the flexibility to make this decision at deployment time, without | |
changing any Java code.</para> | |
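<para>The effect of these bindings can be pictured with a toy model. The sketch below is
not how the Resource Manager is implemented; the class <literal>BindingSketch</literal> and
its methods are hypothetical and exist only to show that two dependency keys bound to the
same resource name resolve to one shared object, while rebinding a key to a different
resource name changes what the annotator sees without touching its code.</para>
<programlisting><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of resource declarations and bindings; not a UIMA class. */
class BindingSketch {
  // resource name -> the single instance created for that declared resource
  private final Map<String, Object> resources = new HashMap<>();
  // dependency key (annotator/key) -> name of the resource it is bound to
  private final Map<String, String> bindings = new HashMap<>();

  void declareResource(String name, Object instance) {
    resources.put(name, instance);
  }

  void bind(String key, String resourceName) {
    bindings.put(key, resourceName);
  }

  /** Resolves a dependency key to its bound instance, or null if unbound. */
  Object lookup(String key) {
    String name = bindings.get(key);
    return name == null ? null : resources.get(name);
  }

  public static void main(String[] args) {
    BindingSketch manager = new BindingSketch();
    manager.declareResource("UimaAcronymTableFile", new Object());
    manager.bind("UimaAcronymAnnotator/AcronymTable", "UimaAcronymTableFile");
    manager.bind("UimaMeetingAnnotator/UimaTermTable", "UimaAcronymTableFile");

    Object a = manager.lookup("UimaAcronymAnnotator/AcronymTable");
    Object b = manager.lookup("UimaMeetingAnnotator/UimaTermTable");
    System.out.println("same instance: " + (a == b));
  }
}
```
]]></programlisting>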
</section> | |
<section id="ugr.tug.aae.resources.threading"> | |
<title>Threading and Shared Resources</title> | |
<para>Sharing can also occur when multiple instances of an annotator are | |
created by the framework in response to run-time deployment specifications. | |
If an implementation class is specified in the external resource, | |
only one instance of that implementation class | |
is created for a given binding, and is shared among all | |
annotators. Because of this, the implementation of that shared instance must be written to be | |
thread-safe - that is, to operate correctly when called at arbitrary times | |
by multiple threads. Writing thread-safe code in Java is addressed in several | |
books, such as Brian Goetz's <emphasis>Java Concurrency in Practice</emphasis>.</para> | |
<para> | |
If no implementation class is specified, then the getResource method returns a
DataResource object, from which each annotator instance can obtain its
own (non-shared) input stream, so threading is not an issue in this case.
</para> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.result_specification_setting"> | |
<title>Result Specifications</title> | |
<para>Annotators are often written to do a lot of computation and produce many different outputs.
For example, a tokenizer can, in addition to identifying tokens, look them up in dictionaries, create
lemma forms (dropping suffixes and prefixes), etc. Result Specifications provide a way to dynamically
specify what results are desired for a particular CAS being processed.</para>
<para>It is up to the annotator writer to take advantage of the result specification; using it is optional. | |
If it is used, the annotator writer checks if a particular output is wanted, by asking the result specification | |
if it contains a specific Type and/or Feature. If it does, then the annotator produces that type/feature; if not, | |
it skips the computations for producing that type/feature.</para> | |
<para>The Result Specification querying may | |
include the language. A typical use case: The CAS contains a document written in some language, and some | |
upstream Annotator has discovered what this language is. | |
The Annotator extracts the previously discovered language specification from the CAS and | |
then includes it when querying the Result Specification. The exact method of encoding | |
language specifications in the CAS is left up to annotator developers; however, | |
the framework provides a commonly used type for this - the org.apache.uima.tcas.DocumentAnnotation | |
type.</para> | |
<para>The Result Specification is passed to the annotator instance by calling its
setResultSpecification method (this call is typically made by the framework, based on Capability specifications).
When called, the default implementation saves the | |
result specification in an instance variable of the Annotator instance, which can be | |
accessed by the annotator using the protected | |
<literal>getResultSpecification()</literal> method.</para> | |
<para>A Result Specification is a list of output types and / or type:feature
names, categorized by language(s), which are expected to be output from (produced by) the
annotator. Annotators may use this to optimize their operations, when possible, for
those cases where only particular outputs are wanted. The interface to the Result | |
Specification object (see the Javadocs) allows querying both types and particular | |
features of types.</para> | |
<para>The language specifications used by Result Specifications are the same as those that can be
specified in Capability Specifications; examples include "en" for English, "en-uk" for
British English, etc. There is also a language type, "x-unspecified", which is presumed
if no language specification is given.</para>
<para>If a query of the Result Specification doesn't include a language, it is treated as if the
language "x-unspecified" was specified. Language matching is hierarchically defaulted,
in one direction: if a query includes the language "en-uk", meaning that the document
being processed is in that language, it will match
Result Specifications whose languages are "en-uk", "en", or "x-unspecified". In other words, if the
Result Specification says to produce output when the actual document's language
is en-uk, en, or x-unspecified, then a document whose language is
en-uk matches any of these. However, the reverse is not true:
if the query asks about producing output for a document whose language is "x-unspecified",
it does not match a Result Specification that says to produce output only if the
document is en-uk or en; the Result Specification would need to say to
produce output for "x-unspecified".
</para>
<para>If the Result Specification indicates it wants output | |
produced for "en-uk", but the annotator is given a language which is unknown, | |
or one that is known, but isn't "en-uk", then the query (using the language | |
of the document) will return false. This is true even if the language is "en". | |
However, if the Result Specification indicates it wants output for "en", | |
and the query is for a document whose language is "en-uk" then the query will return true. | |
</para> | |
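<para>The one-directional matching rule described above can be written down as a small
stand-alone sketch. It models only the rule as stated in this section and is not the
framework's implementation; the class <literal>LanguageMatchSketch</literal> and its
<literal>matches</literal> method are hypothetical names, and lower-casing the query
language is an assumption made here for convenience.</para>
<programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Models the language-matching rule described in the text; not a UIMA class. */
class LanguageMatchSketch {
  static final String UNSPECIFIED = "x-unspecified";

  /**
   * Returns true if a query for queryLanguage matches a Result Specification
   * entry that asks for output in any of specLanguages.
   */
  static boolean matches(String queryLanguage, Set<String> specLanguages) {
    // A query without a language is treated as "x-unspecified".
    String lang = (queryLanguage == null)
        ? UNSPECIFIED
        : queryLanguage.toLowerCase(Locale.ROOT);
    // A specification for "x-unspecified" matches every query language.
    if (specLanguages.contains(UNSPECIFIED)) {
      return true;
    }
    // An unspecified query matches nothing more specific.
    if (lang.equals(UNSPECIFIED)) {
      return false;
    }
    // Walk from the specific tag to its parents: "en-uk", then "en".
    while (true) {
      if (specLanguages.contains(lang)) {
        return true;
      }
      int dash = lang.lastIndexOf('-');
      if (dash < 0) {
        return false;
      }
      lang = lang.substring(0, dash);
    }
  }

  public static void main(String[] args) {
    Set<String> en = new HashSet<>(Arrays.asList("en"));
    Set<String> enUk = new HashSet<>(Arrays.asList("en-uk"));
    System.out.println(matches("en-uk", en));   // document is en-uk, spec asks for en
    System.out.println(matches("en", enUk));    // document is en, spec asks for en-uk
  }
}
```
]]></programlisting>
<para>The first call corresponds to the case where the query returns true (a more specific
document language satisfies a more general specification); the second corresponds to the
case where it returns false.</para>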
<para>Sometimes you can specify the Result Specification; at other times you cannot
(for instance, inside a Collection Processing Engine). When you cannot
specify it, or choose not to (for example, by using the form of the
process(...) call on an Analysis Engine that doesn't include the Result
Specification), a <quote>Default</quote> Result Specification is used.</para>
<section id="ugr.tug.aae.result_spec.default"> | |
<title>Default ResultSpecification</title> | |
<para>The default Result Specification is taken from the Engine's output | |
Capability Specification. Remember that a Capability Specification has both | |
inputs and outputs, can specify types and / or features, and there can be more than one | |
Capability Set. If there is more than one set, the logical union by language of these sets is used. | |
Each set can have a different "language(s)" specified; the default Result Specification | |
will have the outputs by language(s), so that the annotator can query which outputs | |
should be provided for particular languages. The methods to query the Result Specification | |
take a type and (optionally) a feature, and optionally, a language. If the queried type is | |
a subtype of some otherwise matching type in the Result Specification, it will match the query. | |
See the Javadocs for more details on this. | |
</para> | |
</section> | |
<section id="ugr.tug.aae.result_spec.passing_to_annotators"> | |
<title>Passing Result Specifications to Annotators</title> | |
<para>If you are not using a Collection Processing Engine, you can specify a Result | |
Specification for your AnalysisEngine(s) by calling the | |
<literal>AnalysisEngine.setResultSpecification(ResultSpecification)</literal> | |
method.</para> | |
<para>It is also possible to pass a Result Specification on each call to | |
<literal>AnalysisEngine.process(CAS, ResultSpecification)</literal>. However, | |
this is not recommended if your Result Specification will stay constant across | |
multiple calls to | |
<literal>process</literal>. In that case it will be more efficient to call | |
<literal>AnalysisEngine.setResultSpecification(ResultSpecification)</literal> | |
only when the Result Specification changes.</para> | |
<para> For primitive Analysis Engines, whatever Result Specification you pass in is | |
passed along to the annotator's | |
<literal>setResultSpecification(ResultSpecification)</literal> method. For | |
aggregate Analysis Engines, see below.</para> | |
</section> | |
<section id="ugr.tug.aae.result_spec.aggregates"> | |
<title>Aggregates</title> | |
<para>For aggregate engines, the Result Specification passed to the | |
<code>AnalysisEngine.setResultSpecification(ResultSpecification)</code> | |
method is intended to specify the set of output types/features that the aggregate | |
should produce. This is not necessarily equivalent to the set of output | |
types/features that each annotator should produce. For example, an annotator may | |
need to produce an intermediate type that is then consumed by a downstream annotator, | |
even though that intermediate type is not part of the Result Specification.</para> | |
<para>To handle this situation, when | |
<code>AnalysisEngine.setResultSpecification(ResultSpecification)</code> | |
is called on an aggregate, the framework computes the union of the passed Result | |
Specification with the set of | |
<emphasis>all</emphasis> input types and features of | |
<emphasis>all</emphasis> component AnalysisEngines within that aggregate. This forms the | |
complete set of types and features that any component of the aggregate might need to | |
produce. This derived Result Specification is then intersected with each
delegate's output capabilities, and the result is passed to the
<code>AnalysisEngine.setResultSpecification(ResultSpecification)</code> | |
of each component AnalysisEngine. In the case of nested aggregates, this procedure | |
is applied recursively.</para> | |
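<para>The union-then-intersect computation just described can be sketched with ordinary set
operations. This is an illustration of the rule as stated above, not the framework's code;
the class <literal>AggregateResultSpecSketch</literal>, the delegate names, and the type
names (<literal>Token</literal>, <literal>Sentence</literal>, <literal>Meeting</literal>,
<literal>DateAnnot</literal>) are hypothetical, and languages and features are left out
for brevity.</para>
<programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Models how an aggregate derives per-delegate result specs; not a UIMA class. */
class AggregateResultSpecSketch {

  static Map<String, Set<String>> derive(
      Set<String> requested,
      Map<String, Set<String>> delegateInputs,
      Map<String, Set<String>> delegateOutputs) {
    // Step 1: union of the requested outputs with all inputs of all delegates.
    Set<String> needed = new TreeSet<>(requested);
    for (Set<String> inputs : delegateInputs.values()) {
      needed.addAll(inputs);
    }
    // Step 2: for each delegate, keep only what that delegate can produce.
    Map<String, Set<String>> perDelegate = new LinkedHashMap<>();
    for (Map.Entry<String, Set<String>> entry : delegateOutputs.entrySet()) {
      Set<String> spec = new TreeSet<>(needed);
      spec.retainAll(entry.getValue());
      perDelegate.put(entry.getKey(), spec);
    }
    return perDelegate;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> inputs = new LinkedHashMap<>();
    inputs.put("Tokenizer", new TreeSet<String>());
    inputs.put("MeetingDetector", new TreeSet<>(Arrays.asList("Token")));

    Map<String, Set<String>> outputs = new LinkedHashMap<>();
    outputs.put("Tokenizer", new TreeSet<>(Arrays.asList("Token", "Sentence")));
    outputs.put("MeetingDetector", new TreeSet<>(Arrays.asList("Meeting", "DateAnnot")));

    // The caller asks the aggregate for Meeting only.
    Set<String> requested = new TreeSet<>(Arrays.asList("Meeting"));
    System.out.println(derive(requested, inputs, outputs));
  }
}
```
]]></programlisting>
<para>In this example the hypothetical Tokenizer is still asked for <literal>Token</literal>,
because a downstream delegate consumes it, but not for <literal>Sentence</literal>, which
nobody requested or consumes.</para>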
</section> | |
<section id="ugr.tug.aae.result_spec.aggregates.cpes"> | |
<title>Collection Processing Engines</title>
<para>The Default Result Specification is always used for all components of a | |
Collection Processing Engine.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.classpath_when_using_jcas"> | |
<title>Class path setup when using JCas</title> | |
<para>JCas provides Java classes that correspond to each CAS type in an application. | |
These classes are generated by the JCasGen utility (which can be automatically | |
invoked from the Component Descriptor Editor).</para> | |
<para>The Java source classes generated by the JCasGen utility are typically compiled | |
and packaged into a JAR file. This JAR file must be present in the classpath of the UIMA | |
application.</para> | |
<para>For more details on issues around setting up this class path, including | |
deployment issues where class loaders are being used to isolate multiple UIMA | |
applications inside a single running Java Virtual Machine, please see | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.jcas.class_loaders"/> | |
.</para> | |
</section> | |
<section id="ugr.tug.aae.using_shell_scripts"> | |
<title>Using the Shell Scripts</title> | |
<para>The SDK includes a <literal>/bin</literal> subdirectory containing shell | |
scripts, for Windows (.bat files) and Unix (.sh files). Many of these scripts invoke | |
sample Java programs which require a class path; they call a common shell script, | |
<literal>setUimaClassPath</literal> to set up the UIMA required files and | |
directories on the class path.</para> | |
<para>If you need to include additional files on the class path, the scripts will add anything you
specify in the environment variables CLASSPATH or UIMA_CLASSPATH to the classpath. So, for
example, if you are running the document analyzer and want it to find Java classes packaged in a
JAR file named (on Windows) c:\a\b\c\myProject\myJarFile.jar, you could first issue a
<literal>set</literal> command to set UIMA_CLASSPATH to this file, followed by
the documentAnalyzer script:
<programlisting>set UIMA_CLASSPATH=c:\a\b\c\myProject\myJarFile.jar | |
documentAnalyzer</programlisting> | |
</para> | |
<para>Other environment variables are used by the shell scripts, as follows: | |
<table frame="all" id="ugr.aae.tbl.env_vars_used_by_shell_scripts"> | |
<title>Environment variables used by the shell scripts</title> | |
<tgroup cols="2" rowsep="1" colsep="1"> | |
<colspec colname="c1"/> | |
<colspec colname="c2"/> | |
<thead> | |
<row> | |
<entry align="center">Environment Variable</entry> | |
<entry align="center">Description</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>UIMA_HOME</entry> | |
<entry>Path where the UIMA SDK was installed.</entry> | |
</row> | |
<row> | |
<entry>JAVA_HOME</entry> | |
<entry>(Optional) Path to a Java Runtime Environment (JRE). If not set, the
JRE that is in your system PATH is used.</entry>
</row> | |
<row> | |
<entry>UIMA_CLASSPATH</entry> | |
<entry>(Optional) if specified, a path specification to use as the default | |
ClassPath. You can also set the CLASSPATH variable. If you set both, they | |
will be concatenated.</entry> | |
</row> | |
<row> | |
<entry>UIMA_DATAPATH</entry> | |
<entry>(Optional) if specified, a path specification to use as the default | |
DataPath (see <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.datapath"/>)</entry> | |
</row> | |
<row> | |
<entry>UIMA_LOGGER_CONFIG_FILE</entry> | |
<entry>(Optional) if specified, a path to a Java Logger properties file | |
(see <xref linkend="ugr.tug.aae.configuration_logging"/>)</entry> | |
</row> | |
<row> | |
<entry>UIMA_JVM_OPTS</entry> | |
<entry>(Optional) if specified, the JVM arguments to be used when the Java | |
process is started. This can be used for example to set the maximum Java | |
heap size or to define system properties.</entry> | |
</row> | |
<row> | |
<entry>VNS_PORT</entry> | |
<entry>(Optional) if specified, the network IP port number of the Vinci | |
Name Server (VNS) (see <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.vns"/>)</entry> | |
</row> | |
<row> | |
<entry>ECLIPSE_HOME</entry> | |
<entry>(Optional) Needs to be set to the root of your Eclipse installation | |
when using shell scripts that invoke Eclipse (e.g. | |
jcasgen_merge)</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</table> </para> | |
</section> | |
</section> | |
<section id="ugr.tug.aae.common_pitfalls"> | |
<title>Common Pitfalls</title> | |
<para>Here are some things to avoid doing in your annotator code:</para> | |
<para><emphasis role="bold">Retaining references to JCas objects between calls to | |
process()</emphasis></para> | |
<para>The JCas will be cleared between calls to your annotator's process() method. | |
All of the analysis results related to the previous document will be deleted to make way | |
for analysis of a new document. Therefore, you should never save a reference to a JCas | |
Feature Structure object (i.e. an instance of a class created using JCasGen) and | |
attempt to reuse it in a future invocation of the process() method. If you do so, the | |
results will be undefined.</para> | |
<para><emphasis role="bold">Careless use of static data</emphasis></para> | |
<para>Always keep in mind that an application that uses your annotator may create | |
multiple instances of your annotator class. A multithreaded application may attempt | |
to use two instances of your annotator to process two different documents | |
simultaneously. This will generally not cause any problems as long as your annotator | |
instances do not share static data.</para> | |
<para>In general, you should not use static variables other than static final constants
of primitive or immutable types (int, float, String, etc). Other types of static variables may
allow one annotator instance to set a value that affects another annotator instance, | |
which can lead to unexpected effects. Also, static references to classes that | |
aren't thread-safe are likely to cause errors in multithreaded | |
applications.</para> | |
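<para>The difference between shared static state and per-instance state can be seen in a
few lines of plain Java. The two nested classes below are hypothetical stand-ins for
annotator classes and use no UIMA API; the sketch is single-threaded, so it shows only the
cross-instance interference, not the additional data races that multithreading would
introduce.</para>
<programlisting><![CDATA[
```java
/** Contrasts static and instance state; the nested classes are not UIMA annotators. */
class StaticStateSketch {

  /** Pitfall: the counter is static, so every instance updates the same value. */
  static class LeakyAnnotator {
    private static int documentsSeen = 0;

    int process() {
      documentsSeen++;
      return documentsSeen;
    }
  }

  /** Safer: each instance keeps its own counter. */
  static class IsolatedAnnotator {
    private int documentsSeen = 0;

    int process() {
      documentsSeen++;
      return documentsSeen;
    }
  }

  public static void main(String[] args) {
    int leakyFirst = new LeakyAnnotator().process();
    int leakySecond = new LeakyAnnotator().process();
    // The second, brand-new instance already sees the first instance's update.
    System.out.println("leaky difference: " + (leakySecond - leakyFirst));

    int isolatedFirst = new IsolatedAnnotator().process();
    int isolatedSecond = new IsolatedAnnotator().process();
    // Each new instance starts from its own zero.
    System.out.println("isolated difference: " + (isolatedSecond - isolatedFirst));
  }
}
```
]]></programlisting>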
</section> | |
<section id="ugr.tug.aae.viewing_UIMA_objects_in_eclipse_debugger"> | |
<title>Viewing UIMA objects in the Eclipse debugger</title> | |
<titleabbrev>UIMA Objects in Eclipse Debugger</titleabbrev> | |
<para>Eclipse (version 3.1 or later) has a feature for viewing Java Logical
Structures. When enabled, it will permit you to see a view of UIMA objects (such as | |
feature structure instances, CAS or JCas instances, etc.) which displays the logical | |
subparts. For example, here is a view of a feature structure for the RoomNumber | |
annotation, from the tutorial example 1: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image046.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Eclipse debugger showing non-logical-structure display of | |
a feature structure</phrase></textobject> | |
</mediaobject> | |
</screenshot></para> | |
<para>The <quote>annotation</quote> object in Java shows as a two-element object, which is not very
convenient for seeing the features or the part of the input that is being annotated. But
if you turn on the Java Logical Structure mode by pushing this button:
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.6in" format="JPG" fileref="&imgroot;image048.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Eclipse debugger showing button to push to | |
enable viewing logical structures</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
the features of the FeatureStructure instance will be shown: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image050.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Eclipse debugger showing logical structure display of | |
an annotation</phrase></textobject> | |
</mediaobject> | |
</screenshot></para> | |
</section> | |
<section id="ugr.tug.aae.xml_intro_ae_descriptor"> | |
<title>Introduction to Analysis Engine Descriptor XML Syntax</title> | |
<titleabbrev>Analysis Engine XML Descriptor</titleabbrev> | |
<para>This section is an introduction to the syntax used for Analysis Engine | |
Descriptors. Most users do not need to understand these details; they can use the | |
Component Descriptor Editor Eclipse plugin to edit Analysis Engine Descriptors | |
rather than editing the XML directly.</para> | |
<para>This section walks through the actual XML descriptor for the RoomNumberAnnotator | |
example introduced in section <xref linkend="ugr.tug.aae.getting_started"/>. The | |
discussion is divided into several logical sections of the descriptor.</para> | |
<para>The full specification for Analysis Engine Descriptors is defined in | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/> | |
.</para> | |
<section id="ugr.tug.aae.header_annotator_class_identification"> | |
<title>Header and Annotator Class Identification</title> | |
<programlisting><?db-font-size 80% ?><![CDATA[<?xml version="1.0" encoding="UTF-8" ?> | |
<!-- Descriptor for the example RoomNumberAnnotator. --> | |
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier"> | |
<frameworkImplementation>org.apache.uima.java</frameworkImplementation> | |
<primitive>true</primitive> | |
<annotatorImplementationName> | |
org.apache.uima.tutorial.ex1.RoomNumberAnnotator | |
</annotatorImplementationName> | |
]]></programlisting> | |
<para>The document begins with a standard XML header and a comment. The root element of
the document is named <literal>&lt;analysisEngineDescription&gt;</literal>,
and must specify the XML namespace
<literal>http://uima.apache.org/resourceSpecifier</literal>.</para>
<para>The first subelement,
<literal>&lt;frameworkImplementation&gt;</literal>, must contain the value
<literal>org.apache.uima.java</literal>. The second subelement,
<literal>&lt;primitive&gt;</literal>, contains the Boolean value true,
indicating that this XML document describes a <emphasis>Primitive</emphasis>
Analysis Engine. A Primitive Analysis Engine consists of a single annotator. It
is also possible to construct XML descriptors for non-primitive or
<emphasis>Aggregate</emphasis> Analysis Engines; this is covered later.</para>
<para>The next element,
<literal>&lt;annotatorImplementationName&gt;</literal>, contains the
fully-qualified class name of our annotator class. This is how the UIMA framework
determines which annotator class to instantiate.</para>
</section> | |
<section id="ugr.tug.aae.xml_intro_simple_metadata_attributes"> | |
<title>Simple Metadata Attributes</title> | |
<programlisting><![CDATA[<analysisEngineMetaData> | |
<name>Room Number Annotator</name> | |
<description>An example annotator that searches for room numbers in | |
the IBM Watson research buildings.</description> | |
<version>1.0</version> | |
<vendor>The Apache Software Foundation</vendor>
]]></programlisting> | |
<para>Shown here are four simple metadata fields – name, description, version,
and vendor. Providing values for these fields is optional, but recommended.</para>
</section> | |
<section id="ugr.tug.aae.xml_intro_type_system_definition"> | |
<title>Type System Definition</title> | |
<programlisting><![CDATA[<typeSystemDescription> | |
<imports> | |
<import location="TutorialTypeSystem.xml"/> | |
</imports> | |
</typeSystemDescription> | |
]]></programlisting> | |
<para>This section of the XML descriptor defines which types the annotator works with. | |
The recommended way to do this is to <emphasis>import</emphasis> the type system | |
definition from a separate file, as shown here. The location specified here should be
a relative path, and it will be resolved relative to the location of the descriptor
that contains the import. It is also possible to define types directly in the Analysis Engine
descriptor, but these types will not be easily shareable by others.</para> | |
</section> | |
<section id="ugr.tug.aae.xml_intro_capabilities"> | |
<title>Capabilities</title> | |
<programlisting><![CDATA[<capabilities> | |
<capability> | |
<inputs /> | |
<outputs> | |
<type>org.apache.uima.tutorial.RoomNumber</type> | |
<feature>org.apache.uima.tutorial.RoomNumber:building</feature> | |
</outputs> | |
</capability> | |
</capabilities> | |
]]></programlisting> | |
<para>The last section of the descriptor describes the | |
<emphasis>Capabilities</emphasis> of the annotator – the Types/Features | |
it consumes (input) and the Types/Features that it produces (output). These must be | |
the names of types and features that exist in the Analysis Engine descriptor's
type system definition.</para> | |
<para>Our annotator outputs only one Type, RoomNumber and one feature, | |
RoomNumber:building. The fully-qualified names (including namespace) are | |
needed.</para> | |
<para>The building feature is listed separately here, but clearly specifying every | |
feature for a complex type would be cumbersome. Therefore, a shortcut syntax exists. | |
The &lt;outputs&gt; section above could be replaced with the equivalent section:
<programlisting><![CDATA[<outputs> | |
<type allAnnotatorFeatures="true">
org.apache.uima.tutorial.RoomNumber | |
</type> | |
</outputs>]]></programlisting></para> | |
</section> | |
<section id="ugr.tug.aae.xml_intro.configuration_parameters"> | |
<title>Configuration Parameters (Optional)</title> | |
<section id="ugr.tug.aae.xml_intro.configuration_parameters_declarations"> | |
<title>Configuration Parameter Declarations</title> | |
<programlisting><![CDATA[<configurationParameters> | |
<configurationParameter> | |
<name>Patterns</name> | |
<description>List of room number regular expression patterns. | |
</description> | |
<type>String</type> | |
<multiValued>true</multiValued> | |
<mandatory>true</mandatory> | |
</configurationParameter> | |
<configurationParameter> | |
<name>Locations</name> | |
<description>List of locations corresponding to the room number | |
expressions specified by the Patterns parameter. | |
</description> | |
<type>String</type> | |
<multiValued>true</multiValued> | |
<mandatory>true</mandatory> | |
</configurationParameter> | |
</configurationParameters>]]></programlisting> | |
<para>The <literal>&lt;configurationParameters&gt;</literal> element
contains the definitions of the configuration parameters that our annotator | |
accepts. We have declared two parameters. For each configuration parameter, the | |
following are specified: | |
<itemizedlist><listitem><para><emphasis role="bold">name</emphasis> | |
– the name that the annotator code uses to refer to the parameter</para> | |
</listitem> | |
<listitem><para><emphasis role="bold">description</emphasis> | |
– a natural language description of the intent of the parameter</para> | |
</listitem> | |
<listitem><para><emphasis role="bold">type</emphasis> – the data
type of the parameter's value; must be one of String, Integer,
Float, or Boolean.</para></listitem>
<listitem><para><emphasis role="bold">multiValued</emphasis>
– true if the parameter can take multiple values (an array), false if
the parameter takes only a single value.</para></listitem>
<listitem><para><emphasis role="bold">mandatory</emphasis> – true
if a value must be provided for the parameter, false if it is optional</para></listitem>
</itemizedlist></para> | |
<para>Both of our parameters are mandatory and accept an array of Strings as their | |
value.</para> | |
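<para>For contrast, the following sketch declares an optional, single-valued
parameter. The name <literal>CaseSensitive</literal> is hypothetical and is not part
of the tutorial's annotator; it is shown only to illustrate how the
<literal>type</literal>, <literal>multiValued</literal>, and
<literal>mandatory</literal> elements combine for a parameter that may be left
unset. Such a declaration would sit alongside the others inside the
<literal>configurationParameters</literal> element:</para>
<programlisting><![CDATA[<!-- Hypothetical parameter, for illustration only -->
<configurationParameter>
  <name>CaseSensitive</name>
  <description>Whether pattern matching distinguishes upper and lower case.
  </description>
  <type>Boolean</type>
  <multiValued>false</multiValued>
  <mandatory>false</mandatory>
</configurationParameter>]]></programlisting>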
</section> | |
<section id="ugr.tug.aae.xml_intro_configuration_parameter_settings"> | |
<title>Configuration Parameter Settings</title> | |
<programlisting><![CDATA[<configurationParameterSettings> | |
<nameValuePair> | |
<name>Patterns</name> | |
<value> | |
<array> | |
<string>\b[0-4]\d-[0-2]\d\d\b</string>
<string>\b[G1-4][NS]-[A-Z]\d\d\b</string>
<string>\bJ[12]-[A-Z]\d\d\b</string>
</array> | |
</value> | |
</nameValuePair> | |
<nameValuePair> | |
<name>Locations</name> | |
<value> | |
<array> | |
<string>Watson - Yorktown</string> | |
<string>Watson - Hawthorne I</string> | |
<string>Watson - Hawthorne II</string> | |
</array> | |
</value> | |
</nameValuePair> | |
</configurationParameterSettings>]]></programlisting> | |
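<para>Each <literal>nameValuePair</literal> assigns a value to one declared
parameter. As the description of the Locations parameter indicates, the two arrays
are meant to be read in parallel: the first location belongs to the first pattern,
and so on. A single-valued parameter omits the <literal>array</literal> wrapper. The
sketch below assumes a hypothetical Boolean parameter named
<literal>CaseSensitive</literal>, which is not part of the tutorial's annotator:</para>
<programlisting><![CDATA[<!-- Hypothetical single-valued setting, for illustration only -->
<nameValuePair>
  <name>CaseSensitive</name>
  <value>
    <boolean>true</boolean>
  </value>
</nameValuePair>]]></programlisting>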
</section> | |
<section id="ugr.tug.aae.xml_intro.aggregate"> | |
<title>Aggregate Analysis Engine Descriptor</title> | |
<programlisting><?db-font-size 80% ?><![CDATA[<?xml version="1.0" encoding="UTF-8" ?> | |
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier"> | |
<frameworkImplementation>org.apache.uima.java</frameworkImplementation> | |
<primitive>false</primitive> | |
<delegateAnalysisEngineSpecifiers> | |
<delegateAnalysisEngine key="RoomNumber"> | |
<import location="../ex2/RoomNumberAnnotator.xml"/> | |
</delegateAnalysisEngine> | |
<delegateAnalysisEngine key="DateTime"> | |
<import location="TutorialDateTime.xml" /> | |
</delegateAnalysisEngine> | |
</delegateAnalysisEngineSpecifiers>]]></programlisting> | |
<para>The first difference between this descriptor and an individual | |
annotator's descriptor is that the | |
<literal>&lt;primitive&gt;</literal> element contains the value
<literal>false</literal>. This indicates that this Analysis Engine (AE) is an | |
aggregate AE rather than a primitive AE.</para> | |
<para>Then, instead of a single annotator class name, we have a list of
<literal>delegateAnalysisEngineSpecifiers</literal>. Each entry specifies one of
the components that constitute our aggregate and assigns it a key. We locate each
component by the relative path from this XML descriptor to the component AE's XML
descriptor.</para>
<para>The order of this list does not determine the order in which the component AEs
run in the execution pipeline. The execution order is specified in another section of the descriptor:
<programlisting><![CDATA[<analysisEngineMetaData> | |
<name>Aggregate AE - Room Number and DateTime Annotators</name> | |
<description>Detects Room Numbers, Dates, and Times</description> | |
<flowConstraints> | |
<fixedFlow> | |
<node>RoomNumber</node> | |
<node>DateTime</node> | |
</fixedFlow> | |
</flowConstraints>]]></programlisting></para> | |
<para>Here a fixedFlow is adequate: we list the delegate keys in the exact order in
which the AEs will be executed. In this case the order does not actually matter, since
the RoomNumber and DateTime annotators do not depend on one
another.</para>
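<para>Because the two delegates are independent, the reverse order would be an
equally valid flow. A minimal sketch, reusing the delegate keys declared in the
<literal>delegateAnalysisEngineSpecifiers</literal> list:</para>
<programlisting><![CDATA[<flowConstraints>
  <fixedFlow>
    <!-- Each node names a delegate by its key attribute -->
    <node>DateTime</node>
    <node>RoomNumber</node>
  </fixedFlow>
</flowConstraints>]]></programlisting>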
<para>Finally, the descriptor has a capabilities section, which has exactly the | |
same syntax as a primitive AE's capabilities section: | |
<programlisting><![CDATA[<capabilities> | |
<capability> | |
<inputs /> | |
<outputs> | |
<type allAnnotatorFeatures="true"> | |
org.apache.uima.tutorial.RoomNumber | |
</type> | |
<type allAnnotatorFeatures="true"> | |
org.apache.uima.tutorial.DateAnnot | |
</type> | |
<type allAnnotatorFeatures="true"> | |
org.apache.uima.tutorial.TimeAnnot | |
</type> | |
</outputs> | |
<languagesSupported> | |
<language>en</language> | |
</languagesSupported> | |
</capability> | |
</capabilities>]]></programlisting> | |
</para> | |
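<para>The <literal>allAnnotatorFeatures</literal> shortcut can also be written out
in full. As a sketch, the RoomNumber entry could instead list its single feature,
building, explicitly by its fully qualified name. Only the RoomNumber entry is shown
here; the DateAnnot and TimeAnnot entries would be kept as they are:</para>
<programlisting><![CDATA[<outputs>
  <type>org.apache.uima.tutorial.RoomNumber</type>
  <!-- Feature names take the form typeName:featureName -->
  <feature>org.apache.uima.tutorial.RoomNumber:building</feature>
</outputs>]]></programlisting>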
</section> | |
</section> | |
</section> | |
</chapter> |