<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/references/ref.cas/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.ref.cas">
<title>CAS Reference</title>
<para>The CAS (Common Analysis System) is the part of the Unstructured Information
Management Architecture (UIMA) that is concerned with creating and handling the data
that annotators manipulate.</para>
<para>Java users typically use the JCas (Java interface to the CAS) when manipulating
objects in the CAS. This chapter describes an alternative interface to the CAS that
allows discovery and specification of types and features at run time. It is recommended
when the calling code cannot know ahead of time which type system it will be dealing
with.</para>
<para>Use of the CAS as described here is also recommended (or necessary) when components add
to the definitions of types of other components. This UIMA feature allows users to add features
to a type that was already defined elsewhere. When this feature is used in conjunction with the
JCas, it can lead to problems with class loading. This is because different JCas representations
of a single type are generated by the different components, and only one of them is loaded
(unless you are using Pear descriptors). Note:
we do not recommend that you add features to pre-existing types. A type should be defined in one
place only, and then there is no problem with using the JCas. However, if you do use this feature,
do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's
UIMA application, and you're not sure that they won't add features to your types, do not use the
JCas for the same reasons.
</para>
<para>CASes passed to Annotator Components are either a base CAS or a regular CAS. Base CASes
are passed only to multi-view components; they are like regular CASes, but do not have
user-accessible indexes or Sofas. The component uses them only for switching to other CAS
views, which are regular CASes.</para>
<section id="ugr.ref.cas.javadocs">
<title>Javadocs</title>
<para>The subdirectory <literal>docs/api</literal> contains the documentation
details of all the classes, methods, and constants for the APIs discussed here. Please
refer to this for details on the methods, classes and constants, specifically in the
packages <literal>org.apache.uima.cas.*</literal>.</para>
</section>
<section id="ugr.ref.cas.overview">
<title>CAS Overview</title>
<para>There are three<footnote><para>A fourth part, the Subject of Analysis,
is discussed in <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aas"/>.</para></footnote> main parts to the CAS: the type system, data creation and
manipulation, and indexing. We will start with a brief
description of these components.</para>
<section id="ugr.ref.cas.type_system">
<title>The Type System</title>
<para>The type system specifies what kind of data you will be able to manipulate in your
annotators. The type system defines two kinds of entities, types and features. Types
are arranged in a single inheritance tree and define the kinds of entities (objects)
you can manipulate in the CAS. Features optionally specify slots or fields within a
type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS
Features to fields within the type. A critical difference is that CAS types have no
methods; they are just data structures with named slots (features). These features can
have as values primitive things like integers, floating point numbers, and strings,
and they also can hold references to other instances of objects in the CAS. We call
instances of the data structures declared by the type system <quote>feature
structures</quote> (not to be confused with <quote>features</quote>). Feature
structures are similar to the many variants of record structures found in computer
science.<footnote><para> The name <quote>feature structure</quote> comes from
terminology used in linguistics.</para></footnote></para>
<para>Each CAS Type defines a supertype; it is a subtype of that supertype. This means
that any features that the supertype defines are features of the subtype; in other
words, it inherits its supertype&apos;s features. Only single inheritance is
supported; a type&apos;s feature set is the union of all of the features in its
supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top,
root node of the inheritance tree. It defines no features.</para>
<para>The values that can be stored in features are either built-in primitive values or
references to other feature structures. The primitive values are
<literal>boolean</literal>, <literal>byte</literal>,
<literal>short</literal> (16 bit integers), <literal>integer</literal> (32
bit), <literal>long</literal> (64 bit), <literal>float</literal> (32 bit),
<literal>double</literal> (64 bit floats) and strings; the official names of these
are <literal>uima.cas.Boolean</literal>, <literal>uima.cas.Byte</literal>,
<literal>uima.cas.Short</literal>, <literal>uima.cas.Integer</literal>,
<literal>uima.cas.Long</literal>, <literal>uima.cas.Float</literal>,
<literal>uima.cas.Double</literal> and <literal>uima.cas.String</literal>.
The strings are Java strings, and characters are Java characters. Technically, this means
that characters are UTF-16 code units, which is not quite the same as a Unicode character:
a character outside the Basic Multilingual Plane occupies two code units.
This distinction should make no difference for almost all applications.
The CAS also defines other basic built-in types for arrays of these, plus arrays of
references to other objects, called <literal>uima.cas.IntegerArray</literal>,
<literal>uima.cas.FloatArray</literal>,
<literal>uima.cas.StringArray</literal>,
<literal>uima.cas.FSArray</literal>, etc.</para>
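<para>The difference between Java characters and Unicode characters can be seen in plain
Java, without any UIMA classes. The following is a minimal sketch; the class name
<literal>CodeUnitDemo</literal> and its methods are made up for illustration. A character
outside the Basic Multilingual Plane is stored as two Java <literal>char</literal> values
(a surrogate pair), so string lengths count it twice.</para>
<programlisting><![CDATA[public class CodeUnitDemo {
  // "A" followed by U+1D538, which is written here as a surrogate pair.
  static final String TEXT = "A\uD835\uDD38";

  // Number of Java chars (UTF-16 code units) in the string.
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points in the string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    System.out.println(codeUnits(TEXT) + " code units, "
        + codePoints(TEXT) + " code points");
  }
}]]></programlisting>
<para>For this string, <literal>codeUnits(TEXT)</literal> is 3 while
<literal>codePoints(TEXT)</literal> is 2.</para>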
<para>The CAS also defines a built-in type called
<literal>uima.tcas.Annotation</literal> which inherits from
<literal>uima.cas.AnnotationBase</literal> which in turn inherits from
<literal>uima.cas.TOP</literal>. There are two features defined by this type,
called <literal>begin</literal> and <literal>end</literal>, both of which are
integer valued.</para>
</section>
<section id="ugr.ref.cas.creating_accessing_manipulating_data">
<title>Creating, accessing and manipulating data</title>
<titleabbrev>Creating/Accessing/Changing data</titleabbrev>
<para>
Creating and accessing data in the CAS requires knowledge about the types and features
defined in the type system. The idea is similar to other data access APIs, such as the XML
DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the
CAS does not use the names of type system entities directly in the APIs. Rather, you use
the type system to access type and feature entities by name, then use these entities in the
data manipulation APIs. This can be compared to the Java reflection APIs: the type system
is comparable to the Java class loader, and the type and feature objects to the
<literal>java.lang.Class</literal> and <literal>java.lang.reflect.Field</literal> classes.
</para>
<para>
Why does it have to be this complicated? You wouldn&apos;t normally use reflection to create a
Java object, either. As mentioned earlier, the JCas provides the more straightforward
method to manipulate CAS data. The CAS access methods described here need only be used for
generic types of applications that need to be able to handle any kind of data (e.g., generic
tooling) or when the JCas may not be used for other reasons. The generic kinds of applications
are exactly the ones where you would use the reflection API in Java as well.
</para>
</section>
<section id="ugr.ref.cas.creating_using_indexes">
<title>Creating and using indexes</title>
<para>Each view of a CAS provides a set of indexes for that view. Instances of feature
structures can be added to a view&apos;s indexes. These indexes provide
the only way for other annotators to locate existing data in the CAS: to use data
that another annotator has created, an annotator retrieves the feature structures
through an index (or through the method <literal>getAllIndexedFS</literal> of the
object <literal>FSIndexRepository</literal>). If you want the data you
create to be visible to other annotators, you must explicitly call methods which
add it to the indexes; that is, you must index it.</para>
<para>Indexes are named and are associated with a CAS Type; they are used to index
instances of that CAS type (including instances of that type&apos;s subtypes). If
you are using multiple views (see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>),
each view contains a separate instantiation of all of the indexes.
To access an index, you
minimally need to know its name. A CAS view provides an index repository which you can
query for indexes for that view. Once you have a handle to an index, you can get
information about the feature structures in the index, the size of the index, as well
as an iterator over the feature structures.</para>
<para>Indexes are defined in the XML descriptor metadata for the application. Each CAS
View has its own, separate instantiation of indexes based on these definitions,
kept in the view's index repository. When you obtain an index, it is always from a
particular CAS view. When you index an item, it is always added to all indexes where it
belongs, within just one repository. You can specify different repositories
(associated with different CAS views) to use; a given Feature Structure instance
may be indexed in more
than one CAS View.</para>
<para>Iterators allow you to enumerate the feature structures in an index. FS iterators
provide two kinds of APIs: the regular Java iterator API, and a specific FS iterator API
where the usual Java iterator APIs (<literal>hasNext()</literal> and <literal>next()</literal>)
are replaced by <literal>isValid()</literal>, <literal>moveToNext()</literal> (which does
not return an element) and <literal>get()</literal>. Which API style you use is up to you,
but we do not recommend mixing the styles as the results are sometimes unexpected. If you
just want to iterate over an index from start to finish, either style is equally appropriate.
If you also use <literal>moveTo(FeatureStructure fs)</literal> and
<literal>moveToPrevious()</literal>, it is better to use the special FS iterator style.
</para>
<note><para>The reason not to mix these styles is that you might expect
next() followed by moveToPrevious() to always work. It does not, because
next() returns the "current" element and then advances to the next position, which might be
beyond the last element. At that point, the iterator becomes "invalid", and by the iterator
contract, moveToNext and moveToPrevious are not allowed on "invalid" iterators;
when an iterator is not valid, all bets are off. You can, however, reset an invalid
iterator by calling moveToFirst(), moveToLast(), or moveTo(FS) on it.</para></note>
<para>Indexes are created by specifying them in the annotator&apos;s or
aggregate&apos;s resource descriptor. An index specification includes its name,
the CAS type being indexed, the kind of index it is, and an (optional) ordering
relation on the feature structures to be indexed. At startup time, all index
specifications are combined; duplicate definitions (having the same name) are
allowed only if their definitions are the same. </para>
<para>Feature structure instances need to be explicitly added to the index repository by a
method call. Feature structures that are not indexed will not be visible to other
annotators (unless they can be reached by following a reference from a feature of
another feature structure that is indexed, or through a chain of such references).</para>
<para>The framework defines an unnamed bag index which indexes all types. The
only access provided for this index is the getAllIndexedFS(type) method on the
index repository, which returns an iterator over all indexed instances of the
specified type (including its subtypes) for that CAS View.
</para>
<para>The framework defines one standard, built-in annotation index, called
AnnotationIndex, which indexes the <literal>uima.tcas.Annotation</literal>
type: all feature structures of type <literal>uima.tcas.Annotation</literal> or
its subtypes are automatically indexed with this built-in index.</para>
<para>The ordering relation used by this index is to first order by the value of the
<quote>begin</quote> features (in ascending order) and then by the value of the
<quote>end</quote> feature (in descending order). This ordering ensures that
longer annotations starting at the same spot come before shorter ones. For Subjects
of Analysis other than Text, this may not be an appropriate index.</para>
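<para>The ordering rule can be illustrated with a plain Java comparator. This is only a
sketch of the begin/end ordering described above, not the framework&apos;s implementation;
the <literal>Span</literal> class is a hypothetical stand-in for the
<literal>begin</literal> and <literal>end</literal> values of an annotation.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AnnotationOrderSketch {
  // Hypothetical stand-in for an annotation: only its begin and end offsets.
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
    @Override public String toString() { return "(" + begin + ", " + end + ")"; }
  }

  // Order by begin ascending, then by end descending.
  static final Comparator<Span> ORDER = new Comparator<Span>() {
    @Override public int compare(Span a, Span b) {
      if (a.begin != b.begin) {
        return Integer.compare(a.begin, b.begin);
      }
      return Integer.compare(b.end, a.end);
    }
  };

  // Returns a sorted copy; the input list is left untouched.
  static List<Span> sorted(List<Span> spans) {
    List<Span> copy = new ArrayList<Span>(spans);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Span> spans = new ArrayList<Span>();
    spans.add(new Span(0, 3));
    spans.add(new Span(4, 13));
    spans.add(new Span(0, 13));
    System.out.println(sorted(spans));
  }
}]]></programlisting>
<para>With this ordering, a span covering offsets 0 to 13 sorts before a span covering
0 to 3, which in turn sorts before a span starting at offset 4.</para>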
</section>
</section>
<section id="ugr.ref.cas.builtin_types">
<title>Built-in CAS Types</title>
<para>The CAS has two kinds of built-in types &ndash; primitive and non-primitive. The
primitive types are:
<itemizedlist spacing="compact">
<listitem><para>uima.cas.Boolean</para></listitem>
<listitem><para>uima.cas.Byte</para></listitem>
<listitem><para>uima.cas.Short</para></listitem>
<listitem><para>uima.cas.Integer</para></listitem>
<listitem><para>uima.cas.Long</para></listitem>
<listitem><para>uima.cas.Float</para></listitem>
<listitem><para>uima.cas.Double</para></listitem>
<listitem><para>uima.cas.String</para></listitem>
</itemizedlist></para>
<para>The <literal>Byte</literal>, <literal>Short</literal>, <literal>Integer</literal>,
and <literal>Long</literal> types are all signed integer types, of length 8, 16, 32, and 64
bits, respectively. The <literal>Float</literal> type is 32 bit floating point, and the
<literal>Double</literal> type is 64 bit floating point. The
<literal>String</literal> type can be subtyped to create sets of allowed values; see
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.type_system.string_subtypes"/>.
These types can be used to specify the range of a String-valued feature. They act like
Strings, but have additional checking to ensure that a value set into them
is one of the allowed values, or null (which is the value if it is not set).
Note that the other primitive types cannot be used
as a supertype for another type definition; only
<literal>uima.cas.String</literal> can be sub-typed.</para>
<para>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the
type <literal>uima.cas.TOP</literal>. All other non-primitive types inherit from
some supertype.</para>
<para>There are 9 built-in array types. These arrays have a size specified when they are
created; the size is fixed at creation time. They are named:
<itemizedlist spacing="compact">
<listitem><para>uima.cas.BooleanArray</para></listitem>
<listitem><para>uima.cas.ByteArray</para></listitem>
<listitem><para>uima.cas.ShortArray</para></listitem>
<listitem><para>uima.cas.IntegerArray</para></listitem>
<listitem><para>uima.cas.LongArray</para></listitem>
<listitem><para>uima.cas.FloatArray</para></listitem>
<listitem><para>uima.cas.DoubleArray</para></listitem>
<listitem><para>uima.cas.StringArray</para></listitem>
<listitem><para>uima.cas.FSArray</para></listitem>
</itemizedlist></para>
<para>The <literal>uima.cas.FSArray</literal> type is an array whose elements are
arbitrary other feature structures (instances of non-primitive types).</para>
<para>There are 3 built-in types associated with the artifact being analyzed:
<itemizedlist spacing="compact">
<listitem><para>uima.cas.AnnotationBase</para></listitem>
<listitem><para>uima.tcas.Annotation</para></listitem>
<listitem><para>uima.tcas.DocumentAnnotation</para></listitem>
</itemizedlist></para>
<para>The <literal>AnnotationBase</literal> type defines one system-used feature
which specifies for an annotation the subject of analysis (Sofa) to which it refers. The
Annotation type extends from this and defines 2 features, taking
<literal>uima.cas.Integer</literal> values, called <literal>begin</literal>
and <literal>end</literal>. The <literal>begin</literal> feature typically
identifies the start of a span of text the annotation covers; the
<literal>end</literal> feature identifies the end. The values refer to character
offsets; the starting index is 0. An annotation of the word <quote>CAS</quote> in a text
<quote>CAS Reference</quote> would have a start index of 0, and an end index of 3; the
difference between end and start is the length of the span the annotation refers
to.</para>
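<para>This offset convention matches that of <literal>String.substring</literal> in Java:
the begin offset is inclusive and the end offset is exclusive. A minimal sketch using the
example above follows; the class and method names are illustrative only and are not part
of the UIMA API.</para>
<programlisting><![CDATA[public class OffsetDemo {
  // Returns the text covered by a span with the given begin and end offsets.
  static String coveredText(String document, int begin, int end) {
    return document.substring(begin, end);
  }

  // The length of a span is the difference between end and begin.
  static int spanLength(int begin, int end) {
    return end - begin;
  }

  public static void main(String[] args) {
    String document = "CAS Reference";
    System.out.println(coveredText(document, 0, 3));
    System.out.println(spanLength(0, 3));
  }
}]]></programlisting>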
<para>Annotations are always with respect to some Sofa (Subject of Analysis &ndash; see
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>).</para>
<note><para>Artifacts which are not text strings may have a different interpretation of
the meaning of begin and end, or may define their own kind of annotation, extending from
<literal>AnnotationBase</literal>. </para></note>
<para id="ugr.ref.cas.document_annotation">The <literal>DocumentAnnotation</literal> type has one special instance. It is
a subtype of the Annotation type, and the built-in definition defines one feature,
<literal>language</literal>, which is a string indicating the language of the
document in the CAS. The value of this language feature is used by the system to control
flow among annotators when the <quote>CapabilityLanguageFlow</quote> mode is used,
allowing the flow to skip over annotators that don&apos;t process particular
languages. Users may extend this type by adding additional features to it, using the XML
Descriptor element for defining a type.</para>
<note><para>
We do <emphasis>not</emphasis> recommend extending the <literal>DocumentAnnotation</literal>
type. If you do, you must <emphasis>not</emphasis> use the JCas, for the reasons stated
earlier.
</para></note>
<para>Each CAS view has a different associated instance of the
<literal>DocumentAnnotation</literal> type. On the CAS, use
<literal>getDocumentAnnotation()</literal> to access the
<literal>DocumentAnnotation</literal>.</para>
<para>There are also built-in types supporting linked lists, similar to the ones available in
Java and other programming languages. Their use is
constrained by the usual properties of linked lists: not very space efficient, no (efficient)
random access, but an easy choice if you don't know how long your list will be ahead of time. The
implementation is type specific; there are different list building objects for the
Float, Integer, and String primitive types, plus one for general feature structures.
Here are the type names:
<itemizedlist spacing="compact">
<listitem><para>uima.cas.FloatList</para></listitem>
<listitem><para>uima.cas.IntegerList</para></listitem>
<listitem><para>uima.cas.StringList</para></listitem>
<listitem><para>uima.cas.FSList</para>
<para></para></listitem>
<listitem><para>uima.cas.EmptyFloatList</para></listitem>
<listitem><para>uima.cas.EmptyIntegerList</para></listitem>
<listitem><para>uima.cas.EmptyStringList</para></listitem>
<listitem><para>uima.cas.EmptyFSList</para>
<para></para></listitem>
<listitem><para>uima.cas.NonEmptyFloatList</para></listitem>
<listitem><para>uima.cas.NonEmptyIntegerList</para></listitem>
<listitem><para>uima.cas.NonEmptyStringList</para></listitem>
<listitem><para>uima.cas.NonEmptyFSList</para></listitem>
</itemizedlist></para>
<para>For each of the element kinds <literal>Float</literal>,
<literal>Integer</literal>, <literal>String</literal> and
<literal>FeatureStructure</literal>, there is a base type, for instance,
<literal>uima.cas.FloatList</literal>. For each of these, there are two subtypes,
corresponding to a non-empty element, and a marker that serves to indicate the end of the
list, or an empty list. The non-empty types define two features &ndash;
<literal>head</literal> and <literal>tail</literal>. The head feature holds the
particular value for that part of the list. The tail refers to the next list object
(either a non-empty one or the empty version to indicate the end of the list).</para>
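<para>The head/tail structure can be modeled in a few lines of plain Java. The sketch
below is not the UIMA list API; the <literal>IntList</literal>,
<literal>NonEmpty</literal> and <literal>Empty</literal> classes are hypothetical
stand-ins for a list base type and its two subtypes. It shows how a list is traversed
by following tail references until the empty marker is reached.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

public class HeadTailListSketch {
  // Hypothetical stand-in for a list base type such as uima.cas.IntegerList.
  interface IntList { }

  // Stand-in for the "empty" subtype: the end of a list, or an empty list.
  static final class Empty implements IntList { }

  // Stand-in for the "non-empty" subtype with its head and tail features.
  static final class NonEmpty implements IntList {
    final int head;
    final IntList tail;
    NonEmpty(int head, IntList tail) { this.head = head; this.tail = tail; }
  }

  // Collect the head values by following tail until the empty marker.
  static List<Integer> toJavaList(IntList list) {
    List<Integer> values = new ArrayList<Integer>();
    IntList current = list;
    while (current instanceof NonEmpty) {
      NonEmpty node = (NonEmpty) current;
      values.add(node.head);
      current = node.tail;
    }
    return values;
  }

  public static void main(String[] args) {
    IntList list =
        new NonEmpty(1, new NonEmpty(2, new NonEmpty(3, new Empty())));
    System.out.println(toJavaList(list));
  }
}]]></programlisting>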
<para>There are no other built-in types. Users are free to define their own type systems,
building upon these types.</para>
</section>
<section id="ugr.ref.cas.accessing_the_type_system">
<title>Accessing the type system</title>
<para>
During annotator processing, or outside an annotator, access the type system by calling
<literal>CAS.getTypeSystem()</literal>.
</para>
<para>However, CAS annotators implement an additional method,
<literal>typeSystemInit()</literal>, which is called by the UIMA framework before the
annotator&apos;s process method. This method, implemented by the annotator writer,
is passed a reference to the CAS&apos;s type system metadata. The method typically uses
the type system APIs to obtain type and feature objects corresponding to all the types
and features the annotator will be using in its process method. This initialization
step should not be done during an annotator&apos;s initialize method since the type
system can change after the initialize method is called; it should not be done during the
process method, since this is presumably work that is identical for each incoming
document, and so should be performed only when the type system changes (which will be a
rare event). The UIMA framework guarantees it will call the
<literal>typeSystemInit()</literal> method of an annotator whenever the type system changes,
before calling the annotator&apos;s <literal>process()</literal> method.</para>
<para>The initialization done by <literal>typeSystemInit()</literal> is done by the
UIMA framework when you use the JCas APIs; you only need to provide a
<literal>typeSystemInit()</literal> method, as described here, when you are not using
the JCas approach.</para>
<section id="ugr.ref.cas.type_system.printer_example">
<title>TypeSystemPrinter example</title>
<para>Here is a code fragment that, given a CAS Type System, will print a list of all
types.</para>
<programlisting>// Get all type names from the type system
// and print them to stdout.
private void listTypes1(TypeSystem ts) {
  // Get an iterator over types
  Iterator typeIterator = ts.getTypeIterator();
  Type t;
  System.out.println("Types in the type system:");
  while (typeIterator.hasNext()) {
    // Retrieve a type...
    t = (Type) typeIterator.next();
    // ...and print its name.
    System.out.println(t.getName());
  }
  System.out.println();
}</programlisting>
<para>This method is passed the type system as a parameter. From the type system, we can
get an iterator over all known types. If we run this against a CAS created with no additional
user-defined types, we should see something like this on the console:</para>
<programlisting>Types in the type system:
uima.cas.Boolean
uima.cas.Byte
uima.cas.Short
uima.cas.Integer
uima.cas.Long
uima.cas.ArrayBase
...
</programlisting>
<para>If the type system had user-defined types these would show up too. Note that some
of these types are not directly creatable &ndash; they are types used by the framework
in the type hierarchy (e.g. uima.cas.ArrayBase).</para>
<para>CAS type names include a name-space prefix. The components of a type name are
separated by the dot (.). A type name component must start with a Unicode letter,
followed by an arbitrary sequence of letters, digits and the underscore (_). By
convention, the last component of a type name starts with an uppercase letter, the
rest start with a lowercase letter.</para>
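<para>The naming rule can be approximated with a regular expression. The following is a
minimal sketch under the rule as stated above, not the check the framework itself
performs; the class name <literal>TypeNameCheck</literal> and the method
<literal>looksLikeTypeName</literal> are hypothetical. It does not enforce the
upper/lowercase convention, which is a convention only.</para>
<programlisting><![CDATA[import java.util.regex.Pattern;

public class TypeNameCheck {
  // One name component: a Unicode letter, then letters, digits or underscores.
  private static final String COMPONENT = "\\p{L}[\\p{L}\\p{Nd}_]*";

  // One or more components separated by dots.
  private static final Pattern TYPE_NAME =
      Pattern.compile(COMPONENT + "(\\." + COMPONENT + ")*");

  static boolean looksLikeTypeName(String name) {
    return TYPE_NAME.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(looksLikeTypeName("com.xyz.proj.Entity"));
    System.out.println(looksLikeTypeName("com.xyz.1proj.Entity"));
    System.out.println(looksLikeTypeName("com..Entity"));
  }
}]]></programlisting>
<para>Of the three names tried in <literal>main</literal>, only the first is accepted: the
second has a component that starts with a digit, and the third has an empty
component.</para>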
<para>Listing the type names is mildly useful, but it would be even better if we could see
the inheritance relation between the types. The following code prints the
inheritance tree in indented format.</para>
<programlisting>private static final int INDENT = 2;

private void listTypes2(TypeSystem ts) {
  // Get the root of the inheritance tree.
  Type top = ts.getTopType();
  // Recursively print the tree.
  printInheritanceTree(ts, top, 0);
}

private void printInheritanceTree(TypeSystem ts, Type type, int level) {
  indent(level); // Print indentation.
  System.out.println(type.getName());
  // Get a vector of the immediate subtypes.
  Vector subTypes =
      ts.getDirectlySubsumedTypes(type);
  ++level; // Increase the indentation level.
  for (int i = 0; i &lt; subTypes.size(); i++) {
    // Print the subtypes.
    printInheritanceTree(ts, (Type) subTypes.get(i), level);
  }
}

// A simple, inefficient indenter
private void indent(int level) {
  int spaces = level * INDENT;
  for (int i = 0; i &lt; spaces; i++) {
    System.out.print(" ");
  }
}</programlisting>
<para>This example shows that you can traverse the type hierarchy by starting at the top
with <literal>TypeSystem.getTopType()</literal> and by retrieving subtypes with
<literal>TypeSystem.getDirectlySubsumedTypes()</literal>.</para>
<para>The type system APIs (see the Javadocs) also allow you to access the features, as well
as the allowed value type for each feature. Here is sample code which prints out all the
features of all the types, together with the allowed value types (the feature
<quote>range</quote>). Each feature has a <quote>domain</quote> which is the type
where it is defined, as well as a <quote>range</quote>.
<programlisting>private void listFeatures2(TypeSystem ts) {
  Iterator featureIterator = ts.getFeatures();
  Feature f;
  System.out.println("Features in the type system:");
  while (featureIterator.hasNext()) {
    f = (Feature) featureIterator.next();
    System.out.println(
        f.getShortName() + ": " +
        f.getDomain() + " -&gt; " + f.getRange());
  }
  System.out.println();
}</programlisting></para>
<para>We can ask a feature object for its domain (the type it is defined on) and its range
(the type of the value of the feature). The terminology derives from the fact that
features can be viewed as functions on subspaces of the object space.</para>
</section>
<section id="ugr.ref.cas.cas_apis_create_modify_feature_structures">
<title>Using the CAS APIs to create and modify feature structures</title>
<titleabbrev>Using CAS APIs: Feature Structures</titleabbrev>
<para>Assume a type system declaration that defines two types: Entity and Person.
Entity has no features defined within it but inherits from uima.tcas.Annotation
&ndash; so it has the begin and end features. Person is, in turn, a subtype of Entity,
and adds firstName and lastName features. CAS type systems are declaratively
specified using XML; the format of this XML is described in <olink
targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.type_system"/>.
<programlisting><![CDATA[<!-- Type System Definition -->
<typeSystemDescription>
  <types>
    <typeDescription>
      <name>com.xyz.proj.Entity</name>
      <description />
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>com.xyz.proj.Person</name>
      <description />
      <supertypeName>com.xyz.proj.Entity</supertypeName>
      <features>
        <featureDescription>
          <name>firstName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lastName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>]]></programlisting></para>
<para>
To be able to access types and features, we need to know their names. The CAS interface defines
constants that hold the names of built-in types and features, for example
<literal>CAS.TYPE_NAME_INTEGER</literal>. It is good programming practice to create such
constants for the types and features you define, for your own use as well as for others who will
be using your annotators.
</para>
<programlisting>/** Entity type name constant. */
public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity";
/** Person type name constant. */
public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";
/** First name feature name constant. */
public static final String FIRST_NAME_FEAT_NAME = "firstName";
/** Last name feature name constant. */
public static final String LAST_NAME_FEAT_NAME = "lastName";</programlisting>
<para>Next we define type and feature member variables; these will hold the values of the
type and feature objects needed by the CAS APIs, to be assigned during
<literal>typeSystemInit()</literal>.</para>
<programlisting>// Type system object variables
private TypeSystem typeSystem;
private Type entityType;
private Type personType;
private Feature firstNameFeature;
private Feature lastNameFeature;
private Type stringType;</programlisting>
<para>The type system does not throw an exception if we ask for something that is
not known; it simply returns null. The code therefore checks for this and throws a proper
exception. We require all these types and features to be defined for the annotator to
work. One might imagine situations where certain computations are predicated on some type
or feature being defined in the type system, but that is not the case here.</para>
<programlisting>// Get a type object corresponding to a name.
// If it doesn&apos;t exist, throw an exception.
private Type initType(String typeName)
    throws AnnotatorInitializationException {
  // typeSystem is the member variable set in typeSystemInit().
  Type type = this.typeSystem.getType(typeName);
  if (type == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.TYPE_NOT_FOUND,
        new Object[] { this.getClass().getName(), typeName });
  }
  return type;
}

// We add similar code for retrieving feature objects.
// Get a feature object from a name and a type object.
// If it doesn&apos;t exist, throw an exception.
private Feature initFeature(String featName, Type type)
    throws AnnotatorInitializationException {
  Feature feat = type.getFeatureByBaseName(featName);
  if (feat == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.FEATURE_NOT_FOUND,
        new Object[] { this.getClass().getName(), featName });
  }
  return feat;
}</programlisting>
<para>Using these two functions, code for initializing the type system described
above would be:
<programlisting>public void typeSystemInit(TypeSystem aTypeSystem)
    throws AnalysisEngineProcessException {
  this.typeSystem = aTypeSystem;
  try {
    // Set type system member variables.
    this.entityType = initType(ENTITY_TYPE_NAME);
    this.personType = initType(PERSON_TYPE_NAME);
    this.firstNameFeature =
        initFeature(FIRST_NAME_FEAT_NAME, personType);
    this.lastNameFeature =
        initFeature(LAST_NAME_FEAT_NAME, personType);
    this.stringType = initType(CAS.TYPE_NAME_STRING);
  } catch (AnnotatorInitializationException e) {
    // The helper methods report a missing type or feature with a
    // different checked exception; wrap it in the declared one.
    throw new AnalysisEngineProcessException(e);
  }
}</programlisting></para>
<para>Note that we initialize the string type by using a type name constant from the
CAS.</para>
</section>
</section>
<section id="ugr.ref.cas.creating_feature_structures">
<title>Creating feature structures</title>
<para>To create feature structures in JCas, we use the Java <quote>new</quote>
operator. In the CAS, we use one of several different API methods on the CAS object,
depending on which of the 10 basic kinds of feature structures we are creating (a plain
feature structure, or an instance of the built-in primitive type arrays or FSArray).
There is also a method to create an instance of a
<literal>uima.tcas.Annotation</literal>, setting the begin and end
values.</para>
<para>Once a feature structure is created, it needs to be added to the CAS indexes (unless
it will be accessed via some reference from another accessible feature structure).
Assuming <literal>aCAS</literal> holds a reference to a CAS, and <literal>token</literal> holds a
reference to a newly created feature structure, here&apos;s the code to add that
feature structure to all the relevant CAS indexes:</para>
<programlisting> // Add the token to the index repository.
aCAS.addFsToIndexes(token);</programlisting>
<para>There is also a corresponding <literal>removeFsFromIndexes(token)</literal>
method on CAS objects.</para>
<para>Some of the indexes (the Sorted and Set types) use comparators defined
on the values of particular features of an indexed type. If you need to change the value of
a feature that is used as an index key while the feature structure is indexed, the correct way to do this is to
<orderedlist spacing="compact">
<listitem><para>remove the item from all indexes where it is indexed, in all views
where it is indexed,</para>
</listitem>
<listitem><para>update the value of the features being used as keys,</para></listitem>
<listitem><para>add the item back to the indexes, in all views.</para></listitem>
</orderedlist></para>
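<para>The reason for this sequence can be seen in a small plain-Java model. The sketch below is
not UIMA code: <literal>Item</literal>, <literal>newIndex</literal> and <literal>updateKey</literal>
are hypothetical names, and a <literal>java.util.TreeSet</literal> merely stands in for a sorted index.
It assumes only that, like a sorted index, the collection locates an element by comparing key values,
so an element whose key is changed in place may no longer be found where it was filed.</para>
<programlisting><![CDATA[
```java
import java.util.Comparator;
import java.util.TreeSet;

// Plain-Java model (not UIMA code): a sorted collection keyed on a mutable field.
final class SortedKeyUpdateSketch {

  /** Hypothetical stand-in for a feature structure with one key feature. */
  static final class Item {
    final String name;
    int key; // plays the role of a feature used as an index key
    Item(String name, int key) {
      this.name = name;
      this.key = key;
    }
  }

  /** Stand-in for a sorted index: ordered by key, then by name. */
  static TreeSet<Item> newIndex() {
    return new TreeSet<>(
        Comparator.comparingInt((Item i) -> i.key).thenComparing((Item i) -> i.name));
  }

  /** The safe pattern: remove, update the key, add back. */
  static void updateKey(TreeSet<Item> index, Item item, int newKey) {
    index.remove(item); // 1. remove while the old key is still in place
    item.key = newKey;  // 2. update the key value
    index.add(item);    // 3. add back so the item is filed under the new key
  }

  public static void main(String[] args) {
    TreeSet<Item> index = newIndex();
    Item a = new Item("a", 1);
    index.add(a);
    index.add(new Item("b", 2));
    index.add(new Item("c", 3));

    updateKey(index, a, 10);
    System.out.println("first=" + index.first().name + " last=" + index.last().name);
  }
}
```
]]></programlisting>
<para>In a CAS, the corresponding steps would use <literal>removeFsFromIndexes</literal>, the
appropriate feature setter, and <literal>addFsToIndexes</literal>, repeated for each view in which
the feature structure is indexed.</para>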
</section>
<section id="ugr.ref.cas.accessing_modifying_features_of_feature_structures">
<title>Accessing or modifying features of feature structures</title>
<titleabbrev>Accessing or modifying Features</titleabbrev>
<para>Values of individual features for a feature structure can be set or referenced,
using a set of methods that depend on the type of value that feature is declared to have.
There are methods on FeatureStructure for this: getBooleanValue, getByteValue,
getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue,
getStringValue, and getFeatureValue (which means to get a value which in turn is a
reference to a feature structure). There are corresponding <quote>setter</quote>
methods, as well. These methods on the feature structure object take as arguments the
feature object retrieved earlier in the typeSystemInit method.</para>
<para>Using the previous example, with the type system initialized with type personType
and feature lastNameFeature, here&apos;s a sample code fragment that gets and sets
that feature:</para>
<programlisting>// Assume aPerson is a variable holding an object of type Person
// get the lastNameFeature value from the feature structure
String lastName = aPerson.getStringValue(lastNameFeature);
// set the lastNameFeature value
aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</programlisting>
<para>The getters and setters for each of the primitive types are defined in the Javadocs
as methods of the FeatureStructure interface.</para>
</section>
<section id="ugr.ref.cas.indexes_and_iterators">
<title>Indexes and Iterators</title>
<para>Each CAS can have many indexes associated with it; each CAS View contains
a complete set of instantiations of the indexes. Each index is represented by an
instance of the type org.apache.uima.cas.FSIndex. You use the object
org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to
retrieve instances of indexes. There are methods that let you select the index
by name, by type, or by both name and type. Since each index is already associated with a type,
passing both a name and a type is valid only if the type passed in is the same
type or a subtype of the one declared in the index specification for the named index. If you
pass in a subtype, the returned FSIndex object refers to an index that will return only
items belonging to that subtype (or subtypes of that subtype).</para>
<para>The returned FSIndex objects are used, in turn, to create iterators.
There is also a method on the Index Repository, <literal>getAllIndexedFS</literal>,
which will return an iterator over all indexed Feature Structures (for that CAS View),
in no particular order. The iterators
created can be used like ordinary Java iterators to retrieve the indexed items
sequentially. If the index is a sorted index, the items are returned in sorted
order, where the sort order is specified in the XML index definition. This XML is part of
the Component Descriptor; see <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.index"/>.</para>
<para>Feature structures should not be added to or removed from indexes while iterating
over them; a ConcurrentModificationException is thrown when this is detected.
Certain iterator operations are still allowed after a modification and
<quote>reset</quote> this condition: moving to the beginning, moving to the end, or moving to a
particular feature structure. So if you have to modify the index, you can afterwards move the
iterator back to the last FS you retrieved from it, and then continue, if that makes sense in
your application.</para>
<section id="ugr.ref.cas.index.built_in_indexes">
<title>Built-in Indexes</title>
<para>An unnamed built-in bag index holds all feature structures that are indexed.
The only access to this index is the method <literal>getAllIndexedFS(Type)</literal>, which returns an iterator
over all indexed Feature Structures of the given type (including its subtypes).</para>
<para>The CAS also contains a built-in index for the type <literal>uima.tcas.Annotation</literal>, which sorts
annotations in the order in which they appear in the document. Annotations are sorted first by increasing
<literal>begin</literal> position. Ties are then broken by <emphasis>decreasing</emphasis>
<literal>end</literal> position (so that longer annotations come first). Annotations that match in both
their <literal>begin</literal> and <literal>end</literal> features are sorted using the Type Priority
(see <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.type_priority"/>).</para>
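<para>The ordering just described can be modelled, as a rough sketch, by a plain Java comparator.
This is not the UIMA implementation: <literal>Span</literal>, <literal>order</literal> and
<literal>sorted</literal> are hypothetical names, and representing type priorities as a list of
type names in which earlier entries sort first is an assumption made for illustration.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the annotation sort order (not UIMA code).
final class AnnotationOrderSketch {

  /** Hypothetical stand-in for an annotation: a type name plus begin and end offsets. */
  static final class Span {
    final String type;
    final int begin;
    final int end;
    Span(String type, int begin, int end) {
      this.type = type;
      this.begin = begin;
      this.end = end;
    }
    @Override
    public String toString() {
      return type + "[" + begin + "," + end + "]";
    }
  }

  /**
   * Builds the comparator. Assumption: typePriority lists type names,
   * and a type that appears earlier in the list sorts first.
   */
  static Comparator<Span> order(List<String> typePriority) {
    return Comparator.comparingInt((Span s) -> s.begin)                       // increasing begin
        .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed()) // decreasing end
        .thenComparingInt(s -> typePriority.indexOf(s.type));                 // type priority
  }

  /** Returns a sorted copy, leaving the input list untouched. */
  static List<Span> sorted(List<Span> spans, List<String> typePriority) {
    List<Span> copy = new ArrayList<>(spans);
    copy.sort(order(typePriority));
    return copy;
  }

  public static void main(String[] args) {
    List<String> priority = Arrays.asList("Paragraph", "Sentence", "Token");
    List<Span> spans = Arrays.asList(
        new Span("Token", 0, 5),
        new Span("Sentence", 0, 20),
        new Span("Token", 6, 10),
        new Span("Paragraph", 0, 20));
    System.out.println(sorted(spans, priority));
  }
}
```
]]></programlisting>
<para>With these inputs, the two annotations spanning 0 to 20 precede the shorter token that
starts at 0, and their relative order is decided only by the assumed priority list.</para>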
</section>
<section id="ugr.ref.cas.index.adding_to_indexes">
<title>Adding Feature Structures to the Indexes</title>
<para>Feature Structures are added to the indexes by calling the
<literal>FSIndexRepository.addFS(FeatureStructure)</literal> method or the equivalent convenience
method <literal>CAS.addFsToIndexes(FeatureStructure)</literal>. This adds the Feature Structure to
<emphasis>all</emphasis> indexes that are defined for the type of that FeatureStructure (or any of its
supertypes). Note that you should not add a Feature Structure to the indexes until you have set values for all
of the features that may be used as sort keys in an index.</para>
</section>
<section id="ugr.ref.cas.index.iterators">
<title>Iterators</title>
<para>Iterators are objects that implement the interface <literal>org.apache.uima.cas.FSIterator</literal>. This interface
extends <literal>java.util.Iterator</literal> and provides the normal Java iterator methods, plus
additional ones that allow moving both forwards and backwards.</para>
</section>
<section id="ugr.ref.cas.index.annotation_index">
<title>Special iterators for Annotation types</title>
<para>The built-in index over the <literal>uima.tcas.Annotation</literal> type
named <quote><literal>AnnotationIndex</literal></quote> has additional
capabilities. To use them, you first get a reference to this built-in index using
either the <literal>getAnnotationIndex</literal> method on a CAS View object, or
by asking the <literal>FSIndexRepository</literal> object for an index having the
particular name <quote>AnnotationIndex</quote>, for example:
<programlisting>AnnotationIndex idx = aCAS.getAnnotationIndex();
// or, to get an index over a specific subtype of Annotation:
AnnotationIndex subtypeIdx = aCAS.getAnnotationIndex(aType); </programlisting></para>
<para>This object can be used to produce several additional kinds of iterators. It can
produce unambiguous iterators; such an iterator skips over elements until it finds one whose
start position is equal to or greater than the end position of
the previously returned annotation.</para>
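<para>The skipping rule of an unambiguous iterator can be illustrated with a plain-Java sketch.
This models the rule as stated above and is not the UIMA implementation; <literal>Span</literal>
and <literal>unambiguous</literal> are hypothetical names, and the input list is assumed to be
in annotation index order already.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of unambiguous iteration (not UIMA code).
final class UnambiguousSketch {

  /** Hypothetical stand-in for an annotation with begin and end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
    @Override
    public String toString() {
      return "[" + begin + "," + end + "]";
    }
  }

  /**
   * Input must already be in index order (increasing begin, then decreasing end).
   * Keeps a span only if it starts at or after the end of the last span kept.
   */
  static List<Span> unambiguous(List<Span> inIndexOrder) {
    List<Span> result = new ArrayList<>();
    int lastEnd = Integer.MIN_VALUE; // nothing returned yet
    for (Span s : inIndexOrder) {
      if (s.begin >= lastEnd) {
        result.add(s);
        lastEnd = s.end;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Span> spans = Arrays.asList(
        new Span(0, 20), new Span(0, 5), new Span(6, 10),
        new Span(20, 25), new Span(22, 30), new Span(25, 28));
    System.out.println(unambiguous(spans));
  }
}
```
]]></programlisting>
<para>Here the spans nested inside 0 to 20 are skipped, as is the span 22 to 30, which overlaps
the previously returned span 20 to 25.</para>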
<para>It can also produce several kinds of subiterators; these are iterators whose
annotations fall within the span of another (controlling) annotation. A subiterator can
also have the unambiguous property, if desired. It can also be
<quote>strict</quote> or not: strict means that the returned annotation lies
completely within the span of the controlling annotation, whereas non-strict requires only
that the beginning of the returned annotation falls within the span of the
controlling annotation.</para>
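<para>The difference between strict and non-strict can be sketched in plain Java as follows.
This is a simplified model, not the UIMA subiterator: <literal>Span</literal> and
<literal>within</literal> are hypothetical names, and both the exclusion of the controlling
annotation itself and the treatment of offsets that fall exactly on the boundary are assumptions
of the sketch.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of strict versus non-strict subiteration (not UIMA code).
final class SubiteratorSketch {

  /** Hypothetical stand-in for an annotation with begin and end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
    @Override
    public String toString() {
      return "[" + begin + "," + end + "]";
    }
  }

  /** Selects the spans that a subiterator over 'control' would return in this model. */
  static List<Span> within(List<Span> inIndexOrder, Span control, boolean strict) {
    List<Span> result = new ArrayList<>();
    for (Span s : inIndexOrder) {
      if (s == control) {
        continue; // assumption: the controlling annotation itself is not returned
      }
      // Non-strict: only the beginning has to fall within the controlling span.
      boolean beginsInside = s.begin >= control.begin && s.begin <= control.end;
      // Strict: the end has to fall within the controlling span as well.
      boolean endsInside = s.end <= control.end;
      if (beginsInside && (!strict || endsInside)) {
        result.add(s);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Span sentence = new Span(0, 20);
    List<Span> spans = Arrays.asList(
        sentence, new Span(0, 5), new Span(6, 10), new Span(15, 25), new Span(30, 35));
    System.out.println("strict:     " + within(spans, sentence, true));
    System.out.println("non-strict: " + within(spans, sentence, false));
  }
}
```
]]></programlisting>
<para>The span 15 to 25 begins inside the controlling span but ends outside it, so in this model
only the non-strict variant returns it.</para>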
<para>There is also a method which produces an <literal>AnnotationTree</literal>
object, which contains nodes representing the results of doing a strict,
unambiguous subiterator over the span of some controlling annotation. For more
details, please refer to the Javadocs for the
<literal>org.apache.uima.cas.text</literal> package.</para>
</section>
<section id="ugr.ref.cas.index.constraints_and_filtered_iterators">
<title>Constraints and Filtered iterators</title>
<para>There is a set of API calls that build constraint objects. These objects can be
used directly to test if a particular feature structure matches (satisfies) the
constraint, or they can be passed to the createFilteredIterator method to create an
iterator that skips over instances which fail to satisfy the constraint.</para>
<para>It is possible to specify a feature value located by following a chain of
references starting from the feature structure being tested. Here&apos;s a
scenario to explore this concept. Let&apos;s suppose you have the following type
system (namespaces are omitted for clarity):
<blockquote>
<para><emphasis role="bold">Token</emphasis>, having a feature PartOfSpeech
which holds a reference to another type (POS)</para>
<para><emphasis role="bold">POS</emphasis> (a type with many subtypes, each
representing a different part of speech)</para>
<para><emphasis role="bold">Noun</emphasis> (a subtype of POS)</para>
<para><emphasis role="bold">ProperName</emphasis> (a subtype of Noun),
having a feature Class which holds an integer value encoding some information
about the proper noun.</para></blockquote></para>
<para>If you want to filter Token instances, such that only those tokens get through
which are proper names of class 3 (for example), you would need a test that started with
a Token instance, followed its PartOfSpeech reference to another instance (the
ProperName instance) and then tested the Class feature of that instance for a value
equal to 3.</para>
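<para>The test described above can be modelled in plain Java to show the idea of following a
reference and then testing the value reached. This sketch does not use the UIMA constraint API;
the classes mirror the example type system, and <literal>embed</literal>,
<literal>properNameOfClass</literal> and <literal>nameClass</literal> are hypothetical names
chosen for illustration.</para>
<programlisting><![CDATA[
```java
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of a test combined with a path (not the UIMA constraint API).
final class PathConstraintSketch {

  // Hypothetical Java classes mirroring the example type system.
  static class Pos {}
  static class Noun extends Pos {}
  static final class ProperName extends Noun {
    final int nameClass; // stands in for the integer feature "Class"
    ProperName(int nameClass) {
      this.nameClass = nameClass;
    }
  }
  static final class Token {
    final Pos partOfSpeech; // reference followed by the path
    Token(Pos partOfSpeech) {
      this.partOfSpeech = partOfSpeech;
    }
  }

  /** Applies a test to the value reached by following a path from the start object. */
  static <S, V> Predicate<S> embed(Function<S, V> path, Predicate<V> test) {
    return start -> {
      V value = path.apply(start);
      return value != null && test.test(value);
    };
  }

  /** Matches tokens whose PartOfSpeech is a ProperName with the wanted class value. */
  static Predicate<Token> properNameOfClass(int wanted) {
    Predicate<Pos> isProperName = pos -> pos instanceof ProperName;
    Predicate<Pos> hasClass = pos -> ((ProperName) pos).nameClass == wanted;
    // The type test runs first, so the cast in hasClass is only reached when it is safe.
    return embed((Token t) -> t.partOfSpeech, isProperName.and(hasClass));
  }

  public static void main(String[] args) {
    Predicate<Token> filter = properNameOfClass(3);
    System.out.println(filter.test(new Token(new ProperName(3))));
    System.out.println(filter.test(new Token(new Noun())));
  }
}
```
]]></programlisting>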
<para>To support this, the filtering approach has components that specify tests, and
components that specify <quote>paths</quote>. The tests that can be done include
testing references to type instances to see if they are instances of some type or its
subtypes; this is done with a FSTypeConstraint constraint. Other tests check for
equality or, for numeric values, ranges.</para>
<para>Each test may be combined with a path that leads to the value to test. Tests that
start from a feature structure instance can be combined with <quote>and</quote> and <quote>or</quote> connectors.
The Javadocs for these are in the package <literal>org.apache.uima.cas</literal>, in the classes whose names end
in <literal>Constraint</literal>, plus the classes <literal>ConstraintFactory</literal>, <literal>FeaturePath</literal> and <literal>CAS</literal>.
Here&apos;s an example; assume the variable <literal>cas</literal> holds a reference to a CAS instance.
<programlisting>// Start by getting the constraint factory from the CAS.
ConstraintFactory cf = cas.getConstraintFactory();
// To specify a path to an item to test, you start by
// creating an empty path.
FeaturePath path = cas.createFeaturePath();
// Add POS feature to path, creating one-element path.
path.addFeature(posFeat);
// You can extend the chain arbitrarily by adding additional
// features.
// Create a new type constraint.
// Type constraints will check that structures
// they match against have a type at least as specific
// as the type specified in the constraint.
FSTypeConstraint nounConstraint = cf.createTypeConstraint();
// Set the type (by default it is TOP).
// This succeeds if the type being tested by this constraint
// is nounType or a subtype of nounType.
nounConstraint.add(nounType);
// Embed the noun constraint under the pos path.
// This means, associate the test with the path, so it tests the
// proper value.
// The result is a test which will
// match a feature structure that has a posFeat defined
// which has a value which is an instance of a nounType or
// one of its subtypes.
FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint);
// Create a type constraint for token (or a subtype of it)
FSTypeConstraint tokenConstraint = cf.createTypeConstraint();
// Set the type.
tokenConstraint.add(tokenType);
// Create the final constraint by conjoining the embedded noun
// constraint (not the bare type constraint) with the token constraint.
FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);
// Create a filtered iterator from some annotation iterator.
FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</programlisting>
</para></section></section>
<section id="ugr.ref.cas.guide_to_javadocs">
<title>The CAS APIs &ndash; a guide to the Javadocs</title>
<titleabbrev>CAS API Javadocs</titleabbrev>
<para>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most
of the APIs described here are in the cas package. The cas.impl package contains classes
used in serializing and deserializing (reading and writing to external strings) the
XCAS form of the CAS (XCAS is an XML serialization of the CAS). The XCAS form is used for
transporting the CAS among local and remote annotators, or for storing the CAS in
permanent storage. The cas.text package contains the APIs that extend the CAS to support
artifact (including <quote>text</quote>) analysis.</para>
<section id="ugr.ref.cas.javadocs.cas_package">
<title>APIs in the CAS package</title>
<para>The main objects implementing the APIs discussed here are shown in the diagram
below. The hierarchy shows how to get from an upper object to an
instance of the lower object, usually by calling a method on the upper object; it is not
an inheritance hierarchy.
<figure id="ugr.ref.cas.fig.api_hierarchy">
<title>CAS Object hierarchy</title>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG"
fileref="&imgroot;image001.png"/>
</imageobject>
<textobject><phrase>CAS object hierarchy</phrase></textobject>
</mediaobject>
</figure> </para>
<para>The main Interface is the CAS interface. This has most of the functionality of the
CAS, except for the type system metadata access, and the indexing access. JCas and CAS
are alternative representations and API approaches to the CAS; each has a method to
get the other. You can mix JCas and CAS APIs in your application as needed. To use the
JCas APIs, you have to create the Java classes that correspond to the CAS types, and
include them in the Java class path of the application. If you have a CAS object, you can
get a JCas object by using the getJCas() method call on the CAS object; likewise, you
can get the CAS object from a JCas by using the getCAS() method call on the JCas object.
There is also a low level CAS interface that is not part of the official API, and is
intended for internal use only &ndash; it is not documented here.</para>
<para>The type system metadata APIs are found in the TypeSystem interface. The objects
defining each type and feature are defined by the interfaces Type and Feature. The
TypeSystem interface has methods to see what types subsume other types and to iterate over the
types available; the Type interface has methods to extract information about a type, including what
features it has. The Feature interface has methods that return the type it belongs to,
its name, and its range (the kind of values it can hold).</para>
<para>The FSIndexRepository gives you access to methods to get instances of indexes, and
also provides access to the iterator over all indexed feature structures:
<literal>getAllIndexedFS(aType)</literal>.
The FSIndex and AnnotationIndex objects give you methods to create instances of
iterators.</para>
<para>Iterators and the CAS methods that create new feature structures return
FeatureStructure objects. These objects can be used to set and get the values of
defined features within them.</para>
</section>
</section>
</chapter>