<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/references/ref.cas/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.ref.cas"> | |
<title>CAS Reference</title> | |
<para>The CAS (Common Analysis System) is the part of the Unstructured Information | |
Management Architecture (UIMA) that is concerned with creating and handling the data | |
that annotators manipulate.</para> | |
<para>Java users typically use the JCas (Java interface to the CAS) when manipulating | |
objects in the CAS. This chapter describes an alternative interface to the CAS which | |
allows discovery and specification of types and features at run time. It is recommended
when the calling code cannot know ahead of time which type system it will be dealing
with.</para>
<para>Use of the CAS as described here is also recommended (or necessary) when components add | |
to the definitions of types of other components. This UIMA feature allows users to add features | |
to a type that was already defined elsewhere. When this feature is used in conjunction with the | |
JCas, it can lead to problems with class loading. This is because different JCas representations | |
of a single type are generated by the different components, and only one of them is loaded | |
(unless you are using PEAR descriptors). Note:
we do not recommend that you add features to pre-existing types. A type should be defined in one | |
place only, and then there is no problem with using the JCas. However, if you do use this feature, | |
do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's | |
UIMA application, and you're not sure that they won't add features to your types, do not use the | |
JCas for the same reasons. | |
</para> | |
<para>CASes passed to Annotator Components are either a base CAS or a regular CAS. Base CASes | |
are only passed to Multi-View components - they are like regular CASes, but do not have user | |
accessible indexes or Sofas. They are used by the component only for switching to other CAS | |
views, which are regular CASes.</para> | |
<section id="ugr.ref.cas.javadocs"> | |
<title>Javadocs</title> | |
<para>The subdirectory <literal>docs/api</literal> contains the documentation | |
details of all the classes, methods, and constants for the APIs discussed here. Please | |
refer to this for details on the methods, classes and constants, specifically in the | |
packages <literal>org.apache.uima.cas.*</literal>.</para> | |
</section> | |
<section id="ugr.ref.cas.overview"> | |
<title>CAS Overview</title> | |
<para>There are three<footnote><para>A fourth part, the Subject of Analysis, | |
is discussed in <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aas"/>.</para></footnote> main parts to the CAS: the type system, data creation and | |
manipulation, and indexing. We will start with a brief | |
description of these components.</para> | |
<section id="ugr.ref.cas.type_system"> | |
<title>The Type System</title> | |
<para>The type system specifies what kind of data you will be able to manipulate in your | |
annotators. The type system defines two kinds of entities, types and features. Types | |
are arranged in a single inheritance tree and define the kinds of entities (objects) | |
you can manipulate in the CAS. Features optionally specify slots or fields within a | |
type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS | |
Features to fields within the type. A critical difference is that CAS types have no | |
methods; they are just data structures with named slots (features). These features can | |
have as values primitive things like integers, floating point numbers, and strings, | |
and they also can hold references to other instances of objects in the CAS. We call | |
instances of the data structures declared by the type system <quote>feature | |
structures</quote> (not to be confused with <quote>features</quote>). Feature | |
structures are similar to the many variants of record structures found in computer | |
science.<footnote><para> The name <quote>feature structure</quote> comes from | |
terminology used in linguistics.</para></footnote></para> | |
<para>Each CAS Type defines a supertype; it is a subtype of that supertype. This means | |
that any features that the supertype defines are features of the subtype; in other | |
words, it inherits its supertype's features. Only single inheritance is | |
supported; a type's feature set is the union of all of the features in its | |
supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top, | |
root node of the inheritance tree. It defines no features.</para> | |
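<para>The inheritance rule can be illustrated with a small stand-alone model. This is only a
sketch in plain Java, not the UIMA type system API: the class
<literal>FeatureInheritanceSketch</literal>, its methods, and the type and feature names
(<literal>Top</literal>, <literal>Entity</literal>, <literal>Person</literal>,
<literal>id</literal>) are hypothetical and chosen just for illustration.</para>
<programlisting><![CDATA[import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of single inheritance of features (not the UIMA API).
final class FeatureInheritanceSketch {
  private final Map<String, String> supertypeOf = new HashMap<>();
  private final Map<String, List<String>> ownFeatures = new HashMap<>();

  // Declare a type with its single supertype (null for the root)
  // and the features it defines itself.
  void declare(String type, String supertype, String... features) {
    supertypeOf.put(type, supertype);
    ownFeatures.put(type, Arrays.asList(features));
  }

  // The feature set of a type is the union of the features declared
  // along its supertype chain, listed here root first.
  List<String> allFeatures(String type) {
    Deque<String> chain = new ArrayDeque<>();
    for (String t = type; t != null; t = supertypeOf.get(t)) {
      chain.push(t);
    }
    List<String> result = new ArrayList<>();
    for (String t : chain) {
      result.addAll(ownFeatures.getOrDefault(t, Collections.emptyList()));
    }
    return result;
  }

  // Hypothetical three-level hierarchy: Top <- Entity <- Person.
  static FeatureInheritanceSketch example() {
    FeatureInheritanceSketch ts = new FeatureInheritanceSketch();
    ts.declare("Top", null);
    ts.declare("Entity", "Top", "id");
    ts.declare("Person", "Entity", "firstName", "lastName");
    return ts;
  }
}]]></programlisting>
<para>In this model, asking for the features of <literal>Person</literal> yields the feature
inherited from <literal>Entity</literal> followed by the two features that
<literal>Person</literal> defines itself, while the root type has none.</para>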
<para>The values that can be stored in features are either built-in primitive values or | |
references to other feature structures. The primitive values are | |
<literal>boolean</literal>, <literal>byte</literal>, | |
<literal>short</literal> (16 bit integers), <literal>integer</literal> (32 | |
bit), <literal>long</literal> (64 bit), <literal>float</literal> (32 bit), | |
<literal>double</literal> (64 bit floats) and strings; the official names of these | |
are <literal>uima.cas.Boolean</literal>, <literal>uima.cas.Byte</literal>, | |
<literal>uima.cas.Short</literal>, <literal>uima.cas.Integer</literal>, | |
<literal>uima.cas.Long</literal>, <literal>uima.cas.Float</literal>,
<literal>uima.cas.Double</literal> and <literal>uima.cas.String</literal>.
The strings are Java strings, and characters are Java characters. Technically, this means
that characters are UTF-16 code units, which is not quite the same as Unicode characters
(a character outside the Basic Multilingual Plane occupies two code units).
This distinction should make no difference for almost all applications.
The CAS also defines other basic built-in types for arrays of these, plus arrays of
references to other objects, called <literal>uima.cas.IntegerArray</literal>,
<literal>uima.cas.FloatArray</literal>,
<literal>uima.cas.StringArray</literal>, | |
<literal>uima.cas.FSArray</literal>, etc.</para> | |
<para>The CAS also defines a built-in type called | |
<literal>uima.tcas.Annotation</literal> which inherits from | |
<literal>uima.cas.AnnotationBase</literal> which in turn inherits from | |
<literal>uima.cas.TOP</literal>. There are two features defined by this type, | |
called <literal>begin</literal> and <literal>end</literal>, both of which are | |
integer valued.</para> | |
</section> | |
<section id="ugr.ref.cas.creating_accessing_manipulating_data"> | |
<title>Creating, accessing and manipulating data</title> | |
<titleabbrev>Creating/Accessing/Changing data</titleabbrev> | |
<para> | |
Creating and accessing data in the CAS requires knowledge about the types and features | |
defined in the type system. The idea is similar to other data access APIs, such as the XML | |
DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the | |
CAS does not use the names of type system entities directly in the APIs. Rather, you use | |
the type system to access type and feature entities by name, then use these entities in the | |
data manipulation APIs. This can be compared to the Java reflection APIs: the type system | |
is comparable to the Java class loader, and the type and feature objects to the | |
<literal>java.lang.Class</literal> and <literal>java.lang.reflect.Field</literal> classes. | |
</para> | |
<para> | |
Why does it have to be this complicated? You wouldn't normally use reflection to create a | |
Java object, either. As mentioned earlier, the JCas provides the more straightforward | |
method to manipulate CAS data. The CAS access methods described here need only be used for | |
generic types of applications that need to be able to handle any kind of data (e.g., generic | |
tooling) or when the JCas may not be used for other reasons. The generic kinds of applications | |
are exactly the ones where you would use the reflection API in Java as well. | |
</para> | |
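<para>The reflection analogy can be made concrete with a short, plain Java sketch. It does not
use the CAS API at all; the class <literal>ReflectionAnalogySketch</literal> and its nested
<literal>Person</literal> class are hypothetical. The point is only the two-step pattern:
first resolve handles by name, then use the handles (not the names) to create and
manipulate data.</para>
<programlisting><![CDATA[import java.lang.reflect.Field;

// Illustration of the reflection analogy in plain Java (not the CAS API).
final class ReflectionAnalogySketch {
  // Hypothetical data class standing in for a CAS type with one feature.
  public static final class Person {
    public String firstName;
  }

  // Resolve the class and field handles once, then use the handles
  // to create an object and to set and read a value.
  static String setAndGetFirstName(String value) {
    try {
      Class<?> personClass = Person.class;                  // compare: a Type object
      Field firstName = personClass.getField("firstName");  // compare: a Feature object
      Object person = personClass.getDeclaredConstructor().newInstance();
      firstName.set(person, value);
      return (String) firstName.get(person);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(setAndGetFirstName("Ada"));
  }
}]]></programlisting>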
</section> | |
<section id="ugr.ref.cas.creating_using_indexes"> | |
<title>Creating and using indexes</title> | |
<para>Each view of a CAS provides a set of indexes for that view. Instances of feature | |
structures can be added to a view's indexes. These indexes provide | |
the only way for other annotators to locate existing data in the CAS. The only way for an | |
annotator to use data that another annotator has created is by using an index (or the | |
method <literal>getAllIndexedFS</literal> of the object <literal>FSIndexRepository</literal>) to | |
retrieve feature structures the first annotator created. If you want the data you | |
create to be visible to other annotators, you must explicitly call methods which | |
add it to the indexes — you must index it.</para> | |
<para>Indexes are named and are associated with a CAS Type; they are used to index | |
instances of that CAS type (including instances of that type's subtypes). If | |
you are using multiple views (see <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>), | |
each view contains a separate instantiation of all of the indexes. | |
To access an index, you | |
minimally need to know its name. A CAS view provides an index repository which you can | |
query for indexes for that view. Once you have a handle to an index, you can get | |
information about the feature structures in the index, the size of the index, as well | |
as an iterator over the feature structures.</para> | |
<para>Indexes are defined in the XML descriptor metadata for the application. Each CAS | |
View has its own, separate instantiation of indexes based on these definitions, | |
kept in the view's index repository. When you obtain an index, it is always from a | |
particular CAS view. When you index an item, it is always added to all indexes where it | |
belongs, within just one repository. You can specify different repositories | |
(associated with different CAS views) to use; a given Feature Structure instance | |
may be indexed in more | |
than one CAS View.</para> | |
<para>Iterators allow you to enumerate the feature structures in an index. FS iterators | |
provide two kinds of APIs: the regular Java iterator API, and a specific FS iterator API | |
where the usual Java iterator APIs (<literal>hasNext()</literal> and <literal>next()</literal>) | |
are replaced by <literal>isValid()</literal>, <literal>moveToNext()</literal> (which does | |
not return an element) and <literal>get()</literal>. Which API style you use is up to you, | |
but we do not recommend mixing the styles as the results are sometimes unexpected. If you | |
just want to iterate over an index from start to finish, either style is equally appropriate. | |
If you also use <literal>moveTo(FeatureStructure fs)</literal> and | |
<literal>moveToPrevious()</literal>, it is better to use the special FS iterator style. | |
</para> | |
<note><para>The reason not to mix these styles is that you might expect
<literal>next()</literal> followed by <literal>moveToPrevious()</literal> to always work. It does not, because
<literal>next()</literal> returns the <quote>current</quote> element and then advances to the next position, which might be
beyond the last element. At that point, the iterator becomes <quote>invalid</quote>, and by the iterator
contract, <literal>moveToNext()</literal> and <literal>moveToPrevious()</literal> are not allowed on invalid iterators;
when an iterator is not valid, all bets are off. You can, however, call
<literal>moveToFirst()</literal>, <literal>moveToLast()</literal>, or <literal>moveTo(FS)</literal> on the iterator to reset it.</para></note>
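<para>The pitfall described in the note can be reproduced with a toy cursor. The following is a
minimal sketch in plain Java, not UIMA's FS iterator: the class
<literal>CursorSketch</literal> is hypothetical, and its choice to throw an
<literal>IllegalStateException</literal> when an invalid cursor is moved is an assumption
made here to make the contract violation visible.</para>
<programlisting><![CDATA[import java.util.List;
import java.util.NoSuchElementException;

// Toy cursor over a list offering both API styles (not UIMA's FS iterator).
final class CursorSketch<T> {
  private final List<T> items;
  private int pos = 0;

  CursorSketch(List<T> items) {
    this.items = items;
  }

  // --- FS-iterator style ---
  boolean isValid() {
    return pos >= 0 && pos < items.size();
  }

  T get() {
    if (!isValid()) {
      throw new NoSuchElementException();
    }
    return items.get(pos);
  }

  void moveToNext() { requireValid(); pos++; }
  void moveToPrevious() { requireValid(); pos--; }
  void moveToFirst() { pos = 0; }
  void moveToLast() { pos = items.size() - 1; }

  // --- Java-iterator style ---
  boolean hasNext() {
    return isValid();
  }

  // Returns the current element, then advances (possibly past the end).
  T next() {
    T current = get();
    pos++;
    return current;
  }

  // Assumption of this sketch: moving an invalid cursor fails loudly.
  private void requireValid() {
    if (!isValid()) {
      throw new IllegalStateException("cursor is not valid");
    }
  }
}]]></programlisting>
<para>With a two-element list, two calls to <literal>next()</literal> return both elements and
leave the cursor invalid; a following <literal>moveToPrevious()</literal> fails in this sketch,
whereas <literal>moveToLast()</literal> makes the cursor valid again.</para>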
<para>Indexes are created by specifying them in the annotator's or | |
aggregate's resource descriptor. An index specification includes its name, | |
the CAS type being indexed, the kind of index it is, and an (optional) ordering | |
relation on the feature structures to be indexed. At startup time, all index | |
specifications are combined; duplicate definitions (having the same name) are | |
allowed only if their definitions are the same. </para> | |
<para>Feature structure instances need to be explicitly added to the index repository by a
method call. Feature structures that are not indexed will not be visible to other
annotators (unless they can be reached through a reference from a feature of
another feature structure that is indexed, or through a chain of such references).</para>
<para>The framework defines an unnamed bag index which indexes all types. The | |
only access provided for this index is the getAllIndexedFS(type) method on the | |
index repository, which returns an iterator over all indexed instances of the | |
specified type (including its subtypes) for that CAS View. | |
</para> | |
<para>The framework defines one standard, built-in annotation index, called | |
AnnotationIndex, which indexes the <literal>uima.tcas.Annotation</literal> | |
type: all feature structures of type <literal>uima.tcas.Annotation</literal> or | |
its subtypes are automatically indexed with this built-in index.</para> | |
<para>The ordering relation used by this index is to first order by the value of the | |
<quote>begin</quote> features (in ascending order) and then by the value of the | |
<quote>end</quote> feature (in descending order). This ordering ensures that
longer annotations starting at the same spot come before shorter ones. For Subjects | |
of Analysis other than Text, this may not be an appropriate index.</para> | |
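<para>The ordering relation described above can be sketched as a plain Java comparison. This
is an illustration only, not the framework's implementation: <literal>Span</literal> is a
hypothetical stand-in for an annotation that carries just its <literal>begin</literal> and
<literal>end</literal> values.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

// Sketch of the ordering: begin ascending, then end descending.
final class AnnotationOrderSketch {
  // Hypothetical stand-in for an annotation: just the two offsets.
  static final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + begin + "," + end + ")";
    }
  }

  static int compare(Span a, Span b) {
    if (a.begin != b.begin) {
      return Integer.compare(a.begin, b.begin);  // smaller begin first
    }
    return Integer.compare(b.end, a.end);        // larger end first
  }

  // Returns a sorted copy; the input list is left unchanged.
  static List<Span> sorted(List<Span> spans) {
    List<Span> copy = new ArrayList<>(spans);
    copy.sort(AnnotationOrderSketch::compare);
    return copy;
  }
}]]></programlisting>
<para>Sorting the spans (4,13), (0,3) and (0,13) with this comparison puts (0,13) first, then
(0,3), then (4,13): of the two spans starting at 0, the longer one comes first.</para>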
</section> | |
</section> | |
<section id="ugr.ref.cas.builtin_types"> | |
<title>Built-in CAS Types</title> | |
<para>The CAS has two kinds of built-in types – primitive and non-primitive. The | |
primitive types are: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.Boolean</para></listitem> | |
<listitem><para>uima.cas.Byte</para></listitem> | |
<listitem><para>uima.cas.Short</para></listitem> | |
<listitem><para>uima.cas.Integer</para></listitem> | |
<listitem><para>uima.cas.Long</para></listitem> | |
<listitem><para>uima.cas.Float</para></listitem> | |
<listitem><para>uima.cas.Double</para></listitem> | |
<listitem><para>uima.cas.String</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>Byte</literal>, <literal>Short</literal>, <literal>Integer</literal>, and <literal>Long</literal> types are
all signed integer types, of length 8, 16, 32, and 64 bits. The
<literal>Float</literal> type is 32 bit floating point and the
<literal>Double</literal> type is 64 bit floating point. The
<literal>String</literal> type can be sub-typed to create sets of allowed values; see | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.type_system.string_subtypes"/>. | |
These types can be used to specify the range of a String-valued feature. They act like | |
Strings, but have additional checking to ensure the setting of values into them
conforms to one of the allowed values. Note that the other primitive types cannot be used | |
as a supertype for another type definition; only | |
<literal>uima.cas.String</literal> can be sub-typed.</para> | |
<para>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the | |
type <literal>uima.cas.TOP</literal>. All other non-primitive types inherit from | |
some supertype.</para> | |
<para>There are 9 built-in array types. These arrays have a size specified when they are | |
created; the size is fixed at creation time. They are named: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.BooleanArray</para></listitem> | |
<listitem><para>uima.cas.ByteArray</para></listitem> | |
<listitem><para>uima.cas.ShortArray</para></listitem> | |
<listitem><para>uima.cas.IntegerArray</para></listitem> | |
<listitem><para>uima.cas.LongArray</para></listitem> | |
<listitem><para>uima.cas.FloatArray</para></listitem> | |
<listitem><para>uima.cas.DoubleArray</para></listitem> | |
<listitem><para>uima.cas.StringArray</para></listitem> | |
<listitem><para>uima.cas.FSArray</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>uima.cas.FSArray</literal> type is an array whose elements are | |
arbitrary other feature structures (instances of non-primitive types).</para> | |
<para>There are 3 built-in types associated with the artifact being analyzed: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.AnnotationBase</para></listitem> | |
<listitem><para>uima.tcas.Annotation</para></listitem> | |
<listitem><para>uima.tcas.DocumentAnnotation</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>AnnotationBase</literal> type defines one system-used feature | |
which specifies for an annotation the subject of analysis (Sofa) to which it refers. The | |
Annotation type extends from this and defines 2 features, taking | |
<literal>uima.cas.Integer</literal> values, called <literal>begin</literal> | |
and <literal>end</literal>. The <literal>begin</literal> feature typically | |
identifies the start of a span of text the annotation covers; the | |
<literal>end</literal> feature identifies the end. The values refer to character | |
offsets; the starting index is 0. An annotation of the word <quote>CAS</quote> in a text | |
<quote>CAS Reference</quote> would have a start index of 0, and an end index of 3; the | |
difference between end and start is the length of the span the annotation refers | |
to.</para> | |
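<para>The offset convention can be checked with ordinary Java string operations. The sketch
below does not use the CAS; <literal>OffsetSketch</literal> and its methods are hypothetical
helpers that treat <literal>begin</literal> as inclusive and <literal>end</literal> as
exclusive, which matches the example in the text.</para>
<programlisting><![CDATA[// Plain-Java illustration of begin/end offsets (not the UIMA API).
final class OffsetSketch {
  // Text covered by a span: begin is inclusive, end is exclusive.
  static String coveredText(String documentText, int begin, int end) {
    return documentText.substring(begin, end);
  }

  // The length of the span is the difference between end and begin.
  static int spanLength(int begin, int end) {
    return end - begin;
  }

  public static void main(String[] args) {
    String text = "CAS Reference";
    System.out.println(coveredText(text, 0, 3));  // prints "CAS"
    System.out.println(spanLength(0, 3));         // prints 3
  }
}]]></programlisting>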
<para>Annotations are always with respect to some Sofa (Subject of Analysis; see
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>).</para>
<note><para>Artifacts which are not text strings may have a different interpretation of | |
the meaning of begin and end, or may define their own kind of annotation, extending from | |
<literal>AnnotationBase</literal>. </para></note> | |
<para id="ugr.ref.cas.document_annotation">The <literal>DocumentAnnotation</literal> type has one special instance. It is | |
a subtype of the Annotation type, and the built-in definition defines one feature, | |
<literal>language</literal>, which is a string indicating the language of the | |
document in the CAS. The value of this language feature is used by the system to control | |
flow among annotators when the <quote>CapabilityLanguageFlow</quote> mode is used, | |
allowing the flow to skip over annotators that don't process particular | |
languages. Users may extend this type by adding additional features to it, using the XML | |
Descriptor element for defining a type.</para> | |
<note><para> | |
We do <emphasis>not</emphasis> recommend extending the <literal>DocumentAnnotation</literal> | |
type. If you do, you must <emphasis>not</emphasis> use the JCas, for the reasons stated | |
earlier. | |
</para></note> | |
<para>Each CAS view has a different associated instance of the | |
<literal>DocumentAnnotation</literal> type. On the CAS, use | |
<literal>getDocumentAnnotation()</literal> to access the
<literal>DocumentAnnotation</literal>.</para> | |
<para>There are also built-in types supporting linked lists, similar to the ones available in | |
Java and other programming languages. Their use is | |
constrained by the usual properties of linked lists: not very space efficient, no (efficient) | |
random access, but an easy choice if you don't know how long your list will be ahead of time. The | |
implementation is type specific; there are different list building objects for the
Float, Integer, and String primitive types, plus one for general feature structures. Here are the type names:
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.FloatList</para></listitem> | |
<listitem><para>uima.cas.IntegerList</para></listitem> | |
<listitem><para>uima.cas.StringList</para></listitem> | |
<listitem><para>uima.cas.FSList</para> | |
<para></para></listitem> | |
<listitem><para>uima.cas.EmptyFloatList</para></listitem> | |
<listitem><para>uima.cas.EmptyIntegerList</para></listitem> | |
<listitem><para>uima.cas.EmptyStringList</para></listitem> | |
<listitem><para>uima.cas.EmptyFSList</para> | |
<para></para></listitem> | |
<listitem><para>uima.cas.NonEmptyFloatList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyIntegerList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyStringList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyFSList</para></listitem> | |
</itemizedlist></para> | |
<para>For each of the four element kinds <literal>Float</literal>,
<literal>Integer</literal>, <literal>String</literal> and
<literal>FeatureStructure</literal>, there is a base type, for instance,
<literal>uima.cas.FloatList</literal>. For each of these, there are two subtypes, | |
corresponding to a non-empty element, and a marker that serves to indicate the end of the | |
list, or an empty list. The non-empty types define two features – | |
<literal>head</literal> and <literal>tail</literal>. The head feature holds the | |
particular value for that part of the list. The tail refers to the next list object | |
(either a non-empty one or the empty version to indicate the end of the list).</para> | |
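<para>The head/tail structure can be sketched with three small Java classes. These are
hypothetical stand-ins that mirror the shape of <literal>uima.cas.IntegerList</literal> and
its two subtypes; they are not the CAS list types themselves, and the helper methods
<literal>of</literal> and <literal>sum</literal> are assumptions added for illustration.</para>
<programlisting><![CDATA[// Toy model of a typed linked list with empty and non-empty subtypes
// (not the built-in CAS list types).
final class ListSketch {
  // Base type: every list object is either empty or non-empty.
  static abstract class IntList {}

  // Marks the end of a list, or an empty list.
  static final class EmptyIntList extends IntList {}

  // One list cell: head holds the value, tail refers to the next list object.
  static final class NonEmptyIntList extends IntList {
    final int head;
    final IntList tail;

    NonEmptyIntList(int head, IntList tail) {
      this.head = head;
      this.tail = tail;
    }
  }

  // Build a list back to front so that the first value ends up at the head.
  static IntList of(int... values) {
    IntList list = new EmptyIntList();
    for (int i = values.length - 1; i >= 0; i--) {
      list = new NonEmptyIntList(values[i], list);
    }
    return list;
  }

  // Walk the tail references until the empty marker is reached.
  static int sum(IntList list) {
    int total = 0;
    while (list instanceof NonEmptyIntList) {
      NonEmptyIntList cell = (NonEmptyIntList) list;
      total += cell.head;
      list = cell.tail;
    }
    return total;
  }
}]]></programlisting>
<para>Traversal visits every cell in order, which is why such lists offer no efficient random
access, as noted above.</para>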
<para>There are no other built-in types. Users are free to define their own type systems, | |
building upon these types.</para> | |
</section> | |
<section id="ugr.ref.cas.accessing_the_type_system"> | |
<title>Accessing the type system</title> | |
<para> | |
During annotator processing, or outside an annotator, access the type system by calling | |
<literal>CAS.getTypeSystem()</literal>. | |
</para> | |
<para>However, CAS annotators implement an additional method, | |
<literal>typeSystemInit()</literal>, which is called by the UIMA framework before the | |
annotator's process method. This method, implemented by the annotator writer, | |
is passed a reference to the CAS's type system metadata. The method typically uses | |
the type system APIs to obtain type and feature objects corresponding to all the types | |
and features the annotator will be using in its process method. This initialization | |
step should not be done during an annotator's initialize method since the type | |
system can change after the initialize method is called; it should not be done during the | |
process method, since this is presumably work that is identical for each incoming | |
document, and so should be performed only when the type system changes (which will be a | |
rare event). The UIMA framework guarantees it will call the <literal>typeSystemInit | |
</literal>method of an annotator whenever the type system changes, before calling the | |
annotator's <literal>process()</literal> method.</para> | |
<para>The initialization done by <literal>typeSystemInit()</literal> is done by the | |
UIMA framework when you use the JCas APIs; you only need to provide a | |
<literal>typeSystemInit()</literal> method, as described here, when you are not using | |
the JCas approach.</para> | |
<section id="ugr.ref.cas.type_system.printer_example"> | |
<title>TypeSystemPrinter example</title> | |
<para>Here is a code fragment that, given a CAS Type System, will print a list of all | |
types.</para> | |
<programlisting>// Get all type names from the type system
// and print them to stdout.
private void listTypes1(TypeSystem ts) {
  // Get an iterator over types
  Iterator&lt;Type&gt; typeIterator = ts.getTypeIterator();
  System.out.println("Types in the type system:");
  while (typeIterator.hasNext()) {
    // Retrieve a type...
    Type t = typeIterator.next();
    // ...and print its name.
    System.out.println(t.getName());
  }
  System.out.println();
}</programlisting>
<para>This method is passed the type system as a parameter. From the type system, we can
get an iterator over all known types. If we run this against a CAS created with no additional
user-defined types, we should see something like this on the console:</para>
<programlisting>Types in the type system: | |
uima.cas.Boolean | |
uima.cas.Byte | |
uima.cas.Short | |
uima.cas.Integer | |
uima.cas.Long | |
uima.cas.ArrayBase | |
... | |
</programlisting> | |
<para>If the type system had user-defined types these would show up too. Note that some | |
of these types are not directly creatable – they are types used by the framework | |
in the type hierarchy (e.g. uima.cas.ArrayBase).</para> | |
<para>CAS type names include a name-space prefix. The components of a type name are | |
separated by the dot (.). A type name component must start with a Unicode letter, | |
followed by an arbitrary sequence of letters, digits and the underscore (_). By | |
convention, the last component of a type name starts with an uppercase letter, the | |
rest start with a lowercase letter.</para> | |
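<para>The naming rule above can be expressed as a small check. This is a sketch in plain Java
and not part of the UIMA API: <literal>TypeNameSketch.isValidTypeName</literal> is a
hypothetical helper, and for brevity it inspects Java <literal>char</literal> values, so
letters outside the Basic Multilingual Plane are not handled.</para>
<programlisting><![CDATA[// Sketch of the type-name rule: dot-separated components, each starting
// with a letter and continuing with letters, digits, or underscores.
final class TypeNameSketch {
  static boolean isValidTypeName(String name) {
    // The -1 limit keeps empty trailing components, so "a.b." is rejected.
    for (String component : name.split("\\.", -1)) {
      if (component.isEmpty() || !Character.isLetter(component.charAt(0))) {
        return false;
      }
      for (int i = 1; i < component.length(); i++) {
        char c = component.charAt(i);
        if (!Character.isLetterOrDigit(c) && c != '_') {
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValidTypeName("com.xyz.proj.Entity"));  // prints true
    System.out.println(isValidTypeName("com..Entity"));          // prints false
  }
}]]></programlisting>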
<para>Listing the type names is mildly useful, but it would be even better if we could see | |
the inheritance relation between the types. The following code prints the | |
inheritance tree in indented format.</para> | |
<programlisting>private static final int INDENT = 2;

private void listTypes2(TypeSystem ts) {
  // Get the root of the inheritance tree.
  Type top = ts.getTopType();
  // Recursively print the tree.
  printInheritanceTree(ts, top, 0);
}

private void printInheritanceTree(TypeSystem ts, Type type, int level) {
  indent(level); // Print indentation.
  System.out.println(type.getName());
  // Get a list of the immediate subtypes.
  List&lt;Type&gt; subTypes = ts.getDirectlySubsumedTypes(type);
  ++level; // Increase the indentation level.
  for (Type subType : subTypes) {
    // Print the subtypes.
    printInheritanceTree(ts, subType, level);
  }
}

// A simple, inefficient indenter
private void indent(int level) {
  int spaces = level * INDENT;
  for (int i = 0; i &lt; spaces; i++) {
    System.out.print(" ");
  }
}</programlisting>
<para>This example shows that you can traverse the type hierarchy by starting at the top
with <literal>TypeSystem.getTopType()</literal> and by retrieving subtypes with
<literal>TypeSystem.getDirectlySubsumedTypes()</literal>.</para>
<para>The type system API also lets you access the features, as well as
the allowed value type for each feature. Here is sample code which prints out all the
features of all the types, together with the allowed value types (the feature
<quote>range</quote>). Each feature has a <quote>domain</quote> which is the type
where it is defined, as well as a <quote>range</quote>.
<programlisting>private void listFeatures2(TypeSystem ts) {
  Iterator&lt;Feature&gt; featureIterator = ts.getFeatures();
  System.out.println("Features in the type system:");
  while (featureIterator.hasNext()) {
    Feature f = featureIterator.next();
    // Print the feature name, then its domain and range type names.
    System.out.println(
        f.getShortName() + ": " +
        f.getDomain().getName() + " -&gt; " + f.getRange().getName());
  }
  System.out.println();
}</programlisting></para>
<para>We can ask a feature object for its domain (the type it is defined on) and its range | |
(the type of the value of the feature). The terminology derives from the fact that | |
features can be viewed as functions on subspaces of the object space.</para> | |
</section> | |
<section id="ugr.ref.cas.cas_apis_create_modify_feature_structures"> | |
<title>Using the CAS APIs to create and modify feature structures</title> | |
<titleabbrev>Using CAS APIs: Feature Structures</titleabbrev> | |
<para>Assume a type system declaration that defines two types: Entity and Person. | |
Entity has no features defined within it but inherits from uima.tcas.Annotation | |
– so it has the begin and end features. Person is, in turn, a subtype of Entity, | |
and adds firstName and lastName features. CAS type systems are declaratively | |
specified using XML; the format of this XML is described in <olink | |
targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.type_system"/>. | |
<programlisting><![CDATA[<!-- Type System Definition -->
<typeSystemDescription>
  <types>
    <typeDescription>
      <name>com.xyz.proj.Entity</name>
      <description />
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>com.xyz.proj.Person</name>
      <description />
      <supertypeName>com.xyz.proj.Entity</supertypeName>
      <features>
        <featureDescription>
          <name>firstName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lastName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>]]></programlisting></para>
<para> | |
To be able to access types and features, we need to know their names. The CAS interface defines
constants that hold the names of built-in types and features, for example
<literal>CAS.TYPE_NAME_INTEGER</literal>. It is good programming practice to create such
constants for the types and features you define, for your own use as well as for others who will | |
be using your annotators. | |
</para> | |
<programlisting>/** Entity type name constant. */
public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity";

/** Person type name constant. */
public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";

/** First name feature name constant. */
public static final String FIRST_NAME_FEAT_NAME = "firstName";

/** Last name feature name constant. */
public static final String LAST_NAME_FEAT_NAME = "lastName";</programlisting>
<para>Next we define type and feature member variables; these will hold the values of the | |
type and feature objects needed by the CAS APIs, to be assigned during | |
<literal>typeSystemInit()</literal>.</para> | |
<programlisting>// Type system object variables
private TypeSystem typeSystem;
private Type entityType;
private Type personType;
private Feature firstNameFeature;
private Feature lastNameFeature;
private Type stringType;</programlisting>
<para>The type system does not throw an exception if we ask for something that is
not known; it simply returns null. The code below therefore checks for this and throws a proper
exception. We require all these types and features to be defined for the annotator to | |
work. One might imagine situations where certain computations are predicated on some type | |
or feature being defined in the type system, but that is not the case here.</para> | |
<programlisting>// Get a type object corresponding to a name.
// If it doesn't exist, throw an exception.
private Type initType(String typeName)
    throws AnnotatorInitializationException {
  Type type = typeSystem.getType(typeName);
  if (type == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.TYPE_NOT_FOUND,
        new Object[] { this.getClass().getName(), typeName });
  }
  return type;
}

// We add similar code for retrieving feature objects.
// Get a feature object from a name and a type object.
// If it doesn't exist, throw an exception.
private Feature initFeature(String featName, Type type)
    throws AnnotatorInitializationException {
  Feature feat = type.getFeatureByBaseName(featName);
  if (feat == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.FEATURE_NOT_FOUND,
        new Object[] { this.getClass().getName(), featName });
  }
  return feat;
}</programlisting>
<para>Using these two functions, code for initializing the type system described | |
above would be: | |
<programlisting>public void typeSystemInit(TypeSystem aTypeSystem)
    throws AnalysisEngineProcessException {
  this.typeSystem = aTypeSystem;
  // Set type system member variables.
  this.entityType = initType(ENTITY_TYPE_NAME);
  this.personType = initType(PERSON_TYPE_NAME);
  this.firstNameFeature =
      initFeature(FIRST_NAME_FEAT_NAME, personType);
  this.lastNameFeature =
      initFeature(LAST_NAME_FEAT_NAME, personType);
  this.stringType = initType(CAS.TYPE_NAME_STRING);
}</programlisting></para>
<para>Note that we initialize the string type by using a type name constant from the | |
CAS.</para> | |
</section> | |
</section> | |
<section id="ugr.ref.cas.creating_feature_structures"> | |
<title>Creating feature structures</title> | |
<para>To create feature structures in JCas, we use the Java <quote>new</quote> | |
operator. In the CAS, we use one of several different API methods on the CAS object, | |
depending on which of the 10 basic kinds of feature structures we are creating (a plain | |
feature structure, or an instance of the built-in primitive type arrays or FSArray). | |
There is also a method to create an instance of a
<literal>uima.tcas.Annotation</literal>, setting the begin and end | |
values.</para> | |
<para>Once a feature structure is created, it needs to be added to the CAS indexes (unless
it will be accessed via some reference from another accessible feature structure).
Assuming <literal>aCAS</literal> holds a reference to a CAS, and <literal>token</literal> holds a
reference to a newly created feature structure, the following code adds that
feature structure to all the relevant CAS indexes:</para>
<programlisting> // Add the token to the index repository. | |
aCAS.addFsToIndexes(token);</programlisting> | |
<para>There is also a corresponding <literal>removeFsFromIndexes(token)</literal> | |
method on CAS objects.</para> | |
<para>Some of the indexes (the Sorted and Set types) use comparators defined
on the values of particular features of an indexed type. To change the values of
features that are used as index keys, the correct procedure is to
<orderedlist spacing="compact"> | |
<listitem><para>remove the item from all indexes where it is indexed, in all views | |
where it is indexed,</para> | |
</listitem> | |
<listitem><para>update the value of the features being used as keys,</para></listitem> | |
<listitem><para>add the item back to the indexes, in all views.</para></listitem> | |
</orderedlist></para> | |
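<para>The three steps above can be illustrated with a minimal plain-Java sketch. It does not use
any UIMA classes: a <literal>java.util.TreeSet</literal> stands in for a sorted index, and the
hypothetical <literal>Item</literal> class stands in for a feature structure whose
<literal>key</literal> field plays the role of a key feature. In CAS code, steps 1 and 3 would
correspond to <literal>removeFsFromIndexes</literal> and <literal>addFsToIndexes</literal>,
repeated for each view in which the feature structure is indexed.</para>
<programlisting><![CDATA[
```java
import java.util.Comparator;
import java.util.TreeSet;

// Plain-Java model of a sorted index; no UIMA classes are used.
class ReindexSketch {

  // Hypothetical stand-in for a feature structure with one key feature.
  static final class Item {
    final String name;
    int key;

    Item(String name, int key) {
      this.name = name;
      this.key = key;
    }
  }

  // Runs the remove / update / add-back sequence and returns the resulting order.
  static String reindexDemo() {
    Comparator<Item> byKeyThenName = (x, y) ->
        x.key != y.key ? Integer.compare(x.key, y.key) : x.name.compareTo(y.name);
    TreeSet<Item> sortedIndex = new TreeSet<>(byKeyThenName);

    Item a = new Item("a", 10);
    Item b = new Item("b", 20);
    sortedIndex.add(a);
    sortedIndex.add(b);

    // Step 1: remove the item while its key still matches its position.
    sortedIndex.remove(b);
    // Step 2: update the value used as the sort key.
    b.key = 5;
    // Step 3: add the item back so it is placed according to the new key.
    sortedIndex.add(b);

    StringBuilder order = new StringBuilder();
    for (Item item : sortedIndex) {
      if (order.length() > 0) {
        order.append(' ');
      }
      order.append(item.name).append('=').append(item.key);
    }
    return order.toString();
  }

  public static void main(String[] args) {
    System.out.println(reindexDemo()); // prints "b=5 a=10"
  }
}
```
]]></programlisting>
<para>Changing the key while the item is still in the sorted collection would leave it at a
position that no longer matches its key, which is the situation the procedure above avoids.</para>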
</section> | |
<section id="ugr.ref.cas.accessing_modifying_features_of_feature_structures"> | |
<title>Accessing or modifying features of feature structures</title> | |
<titleabbrev>Accessing or modifying Features</titleabbrev> | |
<para>Values of individual features for a feature structure can be set or referenced, | |
using a set of methods that depend on the type of value that feature is declared to have. | |
There are methods on FeatureStructure for this: getBooleanValue, getByteValue, | |
getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue, | |
getStringValue, and getFeatureValue (which means to get a value which in turn is a | |
reference to a feature structure). There are corresponding <quote>setter</quote> | |
methods, as well. These methods on the feature structure object take as arguments the | |
feature object retrieved earlier in the typeSystemInit method.</para> | |
<para>Using the previous example, with the type system initialized with type personType | |
and feature lastNameFeature, here's a sample code fragment that gets and sets | |
that feature:</para> | |
<programlisting>// Assume aPerson is a variable holding an object of type Person | |
// get the lastNameFeature value from the feature structure | |
String lastName = aPerson.getStringValue(lastNameFeature); | |
// set the lastNameFeature value | |
aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</programlisting> | |
<para>The getters and setters for each of the primitive types are defined in the Javadocs | |
as methods of the FeatureStructure interface.</para> | |
</section> | |
<section id="ugr.ref.cas.indexes_and_iterators"> | |
<title>Indexes and Iterators</title> | |
<para>Each CAS can have many indexes associated with it; each CAS View contains | |
a complete set of instantiations of the indexes. Each index is represented by an
instance of the type org.apache.uima.cas.FSIndex. You use the object | |
org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to | |
retrieve instances of indexes. There are methods that let you select the index | |
by name, by type, or by both name and type. Since each index is already associated with a type, | |
passing both a name and a type is valid only if the type passed in is the same | |
type or a subtype of the one declared in the index specification for the named index. If you | |
pass in a subtype, the returned FSIndex object refers to an index that will return only | |
items belonging to that subtype (or subtypes of that subtype).</para> | |
<para>The returned FSIndex objects are used, in turn, to create iterators. | |
There is also a method on the Index Repository, <literal>getAllIndexedFS</literal>, | |
which will return an iterator over all indexed Feature Structures (for that CAS View), | |
in no particular order. The iterators | |
created can be used like common Java iterators, to sequentially retrieve items | |
indexed. If the index represents a sorted index, the items are returned in a sorted | |
order, where the sort order is specified in the XML index definition. This XML is part of | |
the Component Descriptor; see <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.index"/>.</para> | |
<para>Feature structures should not be added to or removed from indexes while iterating | |
over them; a ConcurrentModificationException is thrown when this is detected. | |
Certain operations are allowed with the iterators after modification, which can | |
<quote>reset</quote> this condition, such as moving to beginning, end, or moving to a | |
particular feature structure. So, if you have to modify the index, you can move the iterator back to
the last FS you had retrieved from it, and then continue, if that makes sense in
your application.</para>
<section id="ugr.ref.cas.index.built_in_indexes"> | |
<title>Built-in Indexes</title> | |
<para>An unnamed built-in bag index exists which holds all feature structures which are indexed. | |
The only access to this index is the method getAllIndexedFS(Type) which returns an iterator | |
over all indexed Feature Structures.</para> | |
<para>The CAS also contains a built-in index for the type <literal>uima.tcas.Annotation</literal>, which sorts | |
annotations in the order in which they appear in the document. Annotations are sorted first by increasing | |
<literal>begin</literal> position. Ties are then broken by <emphasis>decreasing</emphasis> | |
<literal>end</literal> position (so that longer annotations come first). Annotations that match in both | |
their <literal>begin</literal> and <literal>end</literal> features are sorted using the Type Priority | |
(see <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes.type_priority"/>).</para>
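<para>The ordering rules above can be modeled with a short plain-Java comparison function. This
is only a sketch of the described ordering, not the UIMA implementation: the
<literal>Span</literal> class and the representation of type priorities as a list of type names
(highest priority first) are assumptions made for illustration.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the annotation ordering; no UIMA classes are used.
class AnnotationOrderSketch {

  // Hypothetical stand-in for an annotation: a type name plus begin/end offsets.
  static final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
      this.type = type;
      this.begin = begin;
      this.end = end;
    }

    @Override
    public String toString() {
      return type + "[" + begin + "," + end + ")";
    }
  }

  // typePriority lists type names from highest to lowest priority (an assumption
  // of this sketch; a type missing from the list would sort first).
  static int compare(Span a, Span b, List<String> typePriority) {
    if (a.begin != b.begin) {
      return Integer.compare(a.begin, b.begin); // increasing begin
    }
    if (a.end != b.end) {
      return Integer.compare(b.end, a.end);     // decreasing end: longer first
    }
    return Integer.compare(typePriority.indexOf(a.type), typePriority.indexOf(b.type));
  }

  // Sorts a small sample and returns the labels in index order.
  static String sortedLabels() {
    List<String> typePriority = Arrays.asList("Sentence", "Name", "Token");
    List<Span> spans = new ArrayList<>(Arrays.asList(
        new Span("Token", 0, 5),
        new Span("Token", 6, 10),
        new Span("Name", 0, 5),
        new Span("Sentence", 0, 20)));
    spans.sort((a, b) -> compare(a, b, typePriority));

    StringBuilder out = new StringBuilder();
    for (Span span : spans) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(span);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // prints "Sentence[0,20) Name[0,5) Token[0,5) Token[6,10)"
    System.out.println(sortedLabels());
  }
}
```
]]></programlisting>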
</section> | |
<section id="ugr.ref.cas.index.adding_to_indexes"> | |
<title>Adding Feature Structures to the Indexes</title> | |
<para>Feature Structures are added to the indexes by calling the | |
<literal>FSIndexRepository.addFS(FeatureStructure)</literal> method or the equivalent convenience | |
method <literal>CAS.addFsToIndexes(FeatureStructure)</literal>. This adds the Feature Structure to | |
<emphasis>all</emphasis> indexes that are defined for the type of that FeatureStructure (or any of its | |
supertypes). Note that you should not add a Feature Structure to the indexes until you have set values for all | |
of the features that may be used as sort keys in an index.</para> | |
</section> | |
<section id="ugr.ref.cas.index.iterators"> | |
<title>Iterators</title> | |
<para>Iterators are objects implementing the interface <literal>org.apache.uima.cas.FSIterator</literal>. This interface
extends <literal>java.util.Iterator</literal> and, in addition to the normal Java iterator methods, provides
methods that allow moving both forwards and backwards.</para>
</section> | |
<section id="ugr.ref.cas.index.annotation_index"> | |
<title>Special iterators for Annotation types</title> | |
<para>The built-in index over the <literal>uima.tcas.Annotation</literal> type | |
named <quote><literal>AnnotationIndex</literal></quote> has additional | |
capabilities. To use them, you first get a reference to this built-in index using | |
either the <literal>getAnnotationIndex</literal> method on a CAS View object, or | |
by asking the <literal>FSIndexRepository</literal> object for an index having the | |
particular name <quote>AnnotationIndex</quote>, for example: | |
<programlisting>AnnotationIndex idx = aCAS.getAnnotationIndex(); | |
// or you can iterate over a specific subtype of Annotation: | |
AnnotationIndex idx = aCAS.getAnnotationIndex(aType); </programlisting></para> | |
<para>This object can be used to produce several additional kinds of iterators. It can
produce unambiguous iterators; these skip over elements until they find one whose
start position is equal to or greater than the end position of
the previously returned annotation.</para>
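<para>The skipping rule of an unambiguous iterator can be sketched in plain Java as follows.
This is an illustrative model under stated assumptions, not the UIMA implementation: each
annotation is reduced to a hypothetical <literal>{begin, end}</literal> pair, and the input
list is assumed to be in annotation index order already.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of unambiguous iteration; no UIMA classes are used.
class UnambiguousSketch {

  // Each span is {begin, end}; the input must already be in index order.
  static List<int[]> unambiguous(List<int[]> sorted) {
    List<int[]> kept = new ArrayList<>();
    int lastEnd = 0;
    for (int[] span : sorted) {
      // Keep the first span, then only spans starting at or after the
      // end of the previously returned span.
      if (kept.isEmpty() || span[0] >= lastEnd) {
        kept.add(span);
        lastEnd = span[1];
      }
    }
    return kept;
  }

  // Applies the rule to a small sample and returns the kept spans as text.
  static String demo() {
    List<int[]> sorted = new ArrayList<>();
    sorted.add(new int[] {0, 10});
    sorted.add(new int[] {0, 4});
    sorted.add(new int[] {5, 8});
    sorted.add(new int[] {10, 12});
    sorted.add(new int[] {11, 15});
    sorted.add(new int[] {12, 20});

    StringBuilder out = new StringBuilder();
    for (int[] span : unambiguous(sorted)) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(span[0]).append('-').append(span[1]);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints "0-10 10-12 12-20"
  }
}
```
]]></programlisting>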
<para>It can also produce several kinds of subiterators; these are iterators whose | |
annotations fall within the span of another annotation. This kind of iterator can | |
also have the unambiguous property, if desired. It also can be | |
<quote>strict</quote> or not; strict means that the returned annotation lies | |
completely within the span of the controlling annotation. Non-strict only implies | |
that the beginning of the returned annotation falls within the span of the | |
controlling annotation.</para> | |
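<para>The difference between strict and non-strict subiterators can be sketched in plain Java.
The sketch below is a model of the description above, not the UIMA implementation: annotations
are hypothetical <literal>{begin, end}</literal> pairs, and treating the controlling span as
half-open (its end offset excluded) is an assumption of this example.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of strict and non-strict subiteration; no UIMA classes are used.
class SubiteratorSketch {

  // Returns the candidate spans selected for the controlling span.
  static List<int[]> covered(int[] controlling, List<int[]> candidates, boolean strict) {
    List<int[]> result = new ArrayList<>();
    for (int[] span : candidates) {
      // Both modes require the candidate to begin inside the controlling span.
      boolean beginsInside = span[0] >= controlling[0] && span[0] < controlling[1];
      // Strict mode also requires the candidate to end inside it.
      boolean endsInside = span[1] <= controlling[1];
      if (beginsInside && (!strict || endsInside)) {
        result.add(span);
      }
    }
    return result;
  }

  // Applies the rule to a small sample with controlling span 10-20.
  static String describe(boolean strict) {
    int[] controlling = {10, 20};
    List<int[]> candidates = new ArrayList<>();
    candidates.add(new int[] {5, 12});
    candidates.add(new int[] {10, 15});
    candidates.add(new int[] {12, 25});
    candidates.add(new int[] {18, 20});
    candidates.add(new int[] {20, 22});

    StringBuilder out = new StringBuilder();
    for (int[] span : covered(controlling, candidates, strict)) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(span[0]).append('-').append(span[1]);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(describe(true));  // prints "10-15 18-20"
    System.out.println(describe(false)); // prints "10-15 12-25 18-20"
  }
}
```
]]></programlisting>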
<para>There is also a method which produces an <literal>AnnotationTree</literal> | |
object, which contains nodes representing the results of doing a strict, | |
unambiguous subiterator over the span of some controlling annotation. For more | |
details, please refer to the Javadocs for the | |
<literal>org.apache.uima.cas.text</literal> package.</para> | |
</section> | |
<section id="ugr.ref.cas.index.constraints_and_filtered_iterators"> | |
<title>Constraints and Filtered iterators</title> | |
<para>There is a set of API calls that build constraint objects. These objects can be | |
used directly to test if a particular feature structure matches (satisfies) the | |
constraint, or they can be passed to the createFilteredIterator method to create an | |
iterator that skips over instances which fail to satisfy the constraint.</para> | |
<para>It is possible to specify a feature value located by following a chain of | |
references starting from the feature structure being tested. Here's a | |
scenario to explore this concept. Let's suppose you have the following type | |
system (namespaces are omitted for clarity): | |
<blockquote> | |
<para><emphasis role="bold">Token</emphasis>, having a feature PartOfSpeech | |
which holds a reference to another type (POS)</para> | |
<para><emphasis role="bold">POS</emphasis> (a type with many subtypes, each | |
representing a different part of speech)</para> | |
<para><emphasis role="bold">Noun</emphasis> (a subtype of POS)</para> | |
<para><emphasis role="bold">ProperName</emphasis> (a subtype of Noun), | |
having a feature Class which holds an integer value encoding some information | |
about the proper noun.</para></blockquote></para> | |
<para>If you want to filter Token instances, such that only those tokens get through | |
which are proper names of class 3 (for example), you would need a test that started with | |
a Token instance, followed its PartOfSpeech reference to another instance (the | |
ProperName instance) and then tested the Class feature of that instance for a value | |
equal to 3.</para> | |
<para>To support this, the filtering approach has components that specify tests, and | |
components that specify <quote>paths</quote>. The tests that can be done include | |
testing references to type instances to see if they are instances of some type or its | |
subtypes; this is done with a FSTypeConstraint constraint. Other tests check for | |
equality or, for numeric values, ranges.</para> | |
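<para>The idea of combining a path with a test can be modeled in plain Java before turning to
the CAS API. The sketch below does not use any UIMA classes: Java classes stand in for the
types of the scenario, the field <literal>nameClass</literal> stands in for the Class feature,
and the helper <literal>embed</literal> is a hypothetical name for attaching a test to the
value found at the end of a path.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of path-based constraints; no UIMA classes are used.
class ConstraintPathSketch {

  static class Pos {}
  static class Noun extends Pos {}
  static class Verb extends Pos {} // hypothetical extra subtype for the demo
  static class ProperName extends Noun {
    final int nameClass;
    ProperName(int nameClass) {
      this.nameClass = nameClass;
    }
  }
  static class Token {
    final Pos partOfSpeech;
    Token(Pos partOfSpeech) {
      this.partOfSpeech = partOfSpeech;
    }
  }

  // Attaches a test to the value reached by following a path from the start object.
  static <A, B> Predicate<A> embed(Function<A, B> path, Predicate<B> test) {
    return start -> {
      B value = path.apply(start);
      return value != null && test.test(value);
    };
  }

  // Counts the sample tokens that are proper names of class 3.
  static int countProperNamesOfClass3() {
    Function<Token, Pos> posPath = t -> t.partOfSpeech;
    Predicate<Pos> isProperName = p -> p instanceof ProperName;
    // Only evaluated after isProperName succeeds, so the cast is safe.
    Predicate<Pos> classIs3 = p -> ((ProperName) p).nameClass == 3;
    Predicate<Token> filter = embed(posPath, isProperName.and(classIs3));

    List<Token> tokens = new ArrayList<>();
    tokens.add(new Token(new ProperName(3)));
    tokens.add(new Token(new ProperName(1)));
    tokens.add(new Token(new Noun()));
    tokens.add(new Token(new Verb()));
    tokens.add(new Token(null));

    int matches = 0;
    for (Token token : tokens) {
      if (filter.test(token)) {
        matches++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    System.out.println(countProperNamesOfClass3()); // prints "1"
  }
}
```
]]></programlisting>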
<para>Each test may be combined with a path that leads to the value to test. Tests that
start from a feature structure instance can be combined with <quote>and</quote> and <quote>or</quote> connectors.
The Javadocs for these are in the package org.apache.uima.cas in the classes that end | |
in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS. | |
Here's an example; assume the variable cas holds a reference to a CAS instance. | |
<programlisting>// Start by getting the constraint factory from the CAS. | |
ConstraintFactory cf = cas.getConstraintFactory(); | |
// To specify a path to an item to test, you start by | |
// creating an empty path. | |
FeaturePath path = cas.createFeaturePath(); | |
// Add POS feature to path, creating one-element path. | |
path.addFeature(posFeat); | |
// You can extend the chain arbitrarily by adding additional | |
// features. | |
// Create a new type constraint. | |
// Type constraints will check that structures | |
// they match against have a type at least as specific | |
// as the type specified in the constraint. | |
FSTypeConstraint nounConstraint = cf.createTypeConstraint(); | |
// Set the type (by default it is TOP). | |
// This succeeds if the type being tested by this constraint | |
// is nounType or a subtype of nounType. | |
nounConstraint.add(nounType); | |
// Embed the noun constraint under the pos path. | |
// This means, associate the test with the path, so it tests the | |
// proper value. | |
// The result is a test which will | |
// match a feature structure that has a posFeat defined | |
// which has a value which is an instance of a nounType or | |
// one of its subtypes. | |
FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint); | |
// Create a type constraint for token (or a subtype of it) | |
FSTypeConstraint tokenConstraint = cf.createTypeConstraint(); | |
// Set the type. | |
tokenConstraint.add(tokenType); | |
// Create the final constraint by conjoining the two constraints. | |
FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);
// Create a filtered iterator from some annotation iterator. | |
FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</programlisting> | |
</para></section></section> | |
<section id="ugr.ref.cas.guide_to_javadocs"> | |
<title>The CAS APIs: a guide to the Javadocs</title>
<titleabbrev>CAS APIs Javadocs</titleabbrev>
<para>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most | |
of the APIs described here are in the cas package. The cas.impl package contains classes | |
used in serializing and deserializing (reading and writing to external strings) the | |
XCAS form of the CAS (XCAS is an XML serialization of the CAS). The XCAS form is used for | |
transporting the CAS among local and remote annotators, or for storing the CAS in | |
permanent storage. The cas.text package contains the APIs that extend the CAS to support
artifact (including <quote>text</quote>) analysis.</para> | |
<section id="ugr.ref.cas.javadocs.cas_package"> | |
<title>APIs in the CAS package</title> | |
<para>The main objects implementing the APIs discussed here are shown in the diagram | |
below. The hierarchy represents that there is a way to get from an upper object to an | |
instance of the lower object, usually by using a method on the upper object; this is not | |
an inheritance hierarchy. | |
<figure id="ugr.ref.cas.fig.api_hierarchy"> | |
<title>CAS Object hierarchy</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.8in" format="JPG" | |
fileref="&imgroot;image001.png"/> | |
</imageobject> | |
<textobject><phrase>CAS object hierarchy</phrase></textobject> | |
</mediaobject> | |
</figure> </para> | |
<para>The main interface is the CAS interface. This has most of the functionality of the
CAS, except for the type system metadata access, and the indexing access. JCas and CAS | |
are alternative representations and API approaches to the CAS; each has a method to | |
get the other. You can mix JCas and CAS APIs in your application as needed. To use the | |
JCas APIs, you have to create the Java classes that correspond to the CAS types, and | |
include them in the Java class path of the application. If you have a CAS object, you can | |
get a JCas object by using the getJCas() method call on the CAS object; likewise, you | |
can get the CAS object from a JCas by using the getCAS() method call on the JCas object. | |
There is also a low level CAS interface that is not part of the official API, and is | |
intended for internal use only; it is not documented here.</para>
<para>The type system metadata APIs are found in the TypeSystem interface. Individual types
and features are represented by objects implementing the interfaces Type and Feature. The
Type interface has methods to see what types subsume other types, to iterate over the | |
types available, and to extract information about the types, including what | |
features it has. The Feature interface has methods that get what type it belongs to, | |
its name, and its range (the kind of values it can hold).</para> | |
<para>The FSIndexRepository gives you access to methods to get instances of indexes, and | |
also provides access to the iterator over all indexed feature structures: | |
<literal>getAllIndexedFS(aType)</literal>. | |
The FSIndex and AnnotationIndex objects give you methods to create instances of | |
iterators.</para> | |
<para>Iterators and the CAS methods that create new feature structures return | |
FeatureStructure objects. These objects can be used to set and get the values of | |
defined features within them.</para> | |
</section> | |
</section> | |
</chapter> |