<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/references/ref.cas/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.ref.cas"> | |
<title>CAS Reference</title> | |
<para>The CAS (Common Analysis System) is the part of the Unstructured Information | |
Management Architecture (UIMA) that is concerned with creating and handling the data | |
that annotators manipulate.</para> | |
<para>Java users typically use the JCas (Java interface to the CAS) when manipulating | |
objects in the CAS. This chapter describes an alternative interface to the CAS which | |
allows discovery and specification of types and features at run time. It is recommended | |
for use when the using code cannot know ahead of time the type system it will be dealing | |
with.</para> | |
<para>Use of the CAS as described here is also recommended (or necessary) when components add | |
to the definitions of types of other components. This UIMA feature allows users to add features | |
to a type that was already defined elsewhere. When this feature is used in conjunction with the | |
JCas, it can lead to problems with class loading. This is because different JCas representations | |
of a single type are generated by the different components, and only one of them is loaded | |
(unless you are using Pear descriptors). Note: | |
we do not recommend that you add features to pre-existing types. A type should be defined in one | |
place only, and then there is no problem with using the JCas. However, if you do use this feature, | |
do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's | |
UIMA application, and you're not sure that they won't add features to your types, do not use the | |
JCas for the same reasons. | |
</para> | |
<section id="ugr.ref.cas.javadocs"> | |
<title>Javadocs</title> | |
<para>The subdirectory <literal>docs/api</literal> contains the Javadocs for
all the classes, methods, and constants of the APIs discussed here. For details,
see in particular the
packages <literal>org.apache.uima.cas.*</literal>.</para>
</section> | |
<section id="ugr.ref.cas.overview"> | |
<title>CAS Overview</title> | |
<para>There are three<footnote><para>A fourth part, the Subject of Analysis, | |
is discussed in <olink targetdoc="&uima_docs_tutorial_guides;" | |
/> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aas"/>.</para></footnote> main parts to the CAS: the type system, data creation and | |
manipulation, and indexing. We will start with a brief | |
description of these components.</para> | |
<section id="ugr.ref.cas.type_system"> | |
<title>The Type System</title> | |
<para>The type system specifies what kind of data you will be able to manipulate in your | |
annotators. The type system defines two kinds of entities, types and features. Types | |
are arranged in a single inheritance tree and define the kinds of entities (objects) | |
you can manipulate in the CAS. Features optionally specify slots or fields within a | |
type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS | |
Features to fields within the type. A critical difference is that CAS types have no | |
methods; they are just data structures with named slots (features). These features can | |
have as values primitive things like integers, floating point numbers, and strings, | |
and they also can hold references to other instances of objects in the CAS. We call | |
instances of the data structures declared by the type system <quote>feature | |
structures</quote> (not to be confused with <quote>features</quote>). Feature | |
structures are similar to the many variants of record structures found in computer | |
science.<footnote><para> The name <quote>feature structure</quote> comes from | |
terminology used in linguistics.</para></footnote></para> | |
<para>Each CAS Type defines a supertype; it is a subtype of that supertype. This means | |
that any features that the supertype defines are features of the subtype; in other | |
words, it inherits its supertype's features. Only single inheritance is | |
supported; a type's feature set is the union of all of the features in its | |
supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top, | |
root node of the inheritance tree. It defines no features.</para> | |
<para>The values that can be stored in features are either built-in primitive values or | |
references to other feature structures. The primitive values are | |
<literal>boolean</literal>, <literal>byte</literal>, | |
<literal>short</literal> (16 bit integers), <literal>integer</literal> (32 | |
bit), <literal>long</literal> (64 bit), <literal>float</literal> (32 bit), | |
<literal>double</literal> (64 bit floats) and strings; the official names of these | |
are <literal>uima.cas.Boolean</literal>, <literal>uima.cas.Byte</literal>, | |
<literal>uima.cas.Short</literal>, <literal>uima.cas.Integer</literal>, | |
<literal>uima.cas.Long</literal>, <literal>uima.cas.Float</literal>,
<literal>uima.cas.Double</literal> and <literal>uima.cas.String</literal>.
The strings are Java strings, and characters are Java characters. Technically, this means
that characters are UTF-16 code units, which is not quite the same as a Unicode character:
a character outside the Basic Multilingual Plane takes two code units.
This distinction should make no difference for almost all applications.
The CAS also defines other basic built-in types for arrays of these, plus arrays of
references to other objects, called <literal>uima.cas.IntegerArray</literal>,
<literal>uima.cas.FloatArray</literal>,
<literal>uima.cas.StringArray</literal>, | |
<literal>uima.cas.FSArray</literal>, etc.</para> | |
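<para>The difference between a Java character and a Unicode character can be seen in
plain Java. The following is a minimal sketch that uses only the Java standard library;
the class name <literal>CodeUnitDemo</literal> and its helper methods are hypothetical
and are not part of UIMA.</para>
<programlisting><![CDATA[public class CodeUnitDemo {
  // Number of UTF-16 code units, which is what String.length() reports.
  public static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points; a supplementary character counts once.
  public static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String bmp = "CAS"; // every character is in the Basic Multilingual Plane
    String clef = new String(Character.toChars(0x1D11E)); // U+1D11E, outside the BMP
    System.out.println(codeUnits(bmp) + " " + codePoints(bmp));   // 3 3
    System.out.println(codeUnits(clef) + " " + codePoints(clef)); // 2 1
  }
}]]></programlisting>
<para>For text that stays within the Basic Multilingual Plane the two counts agree,
which is why the distinction rarely matters in practice.</para>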
<para>The CAS also defines a built-in type called | |
<literal>uima.tcas.Annotation</literal> which inherits from | |
<literal>uima.cas.AnnotationBase</literal> which in turn inherits from | |
<literal>uima.cas.TOP</literal>. There are two features defined by this type, | |
called <literal>begin</literal> and <literal>end</literal>, both of which are | |
integer valued.</para> | |
</section> | |
<section id="ugr.ref.cas.creating_accessing_manipulating_data"> | |
<title>Creating, accessing and manipulating data</title> | |
<titleabbrev>Creating/Accessing/Changing data</titleabbrev> | |
<para> | |
Creating and accessing data in the CAS requires knowledge about the types and features | |
defined in the type system. The idea is similar to other data access APIs, such as the XML | |
DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the | |
CAS does not use the names of type system entities directly in the APIs. Rather, you use | |
the type system to access type and feature entities by name, then use these entities in the | |
data manipulation APIs. This can be compared to the Java reflection APIs: the type system | |
is comparable to the Java class loader, and the type and feature objects to the | |
<literal>java.lang.Class</literal> and <literal>java.lang.reflect.Field</literal> classes. | |
</para> | |
<para> | |
Why does it have to be this complicated? You wouldn't normally use reflection to create a | |
Java object, either. As mentioned earlier, the JCas provides the more straightforward | |
method to manipulate CAS data. The CAS access methods described here need only be used for | |
generic types of applications that need to be able to handle any kind of data (e.g., generic | |
tooling) or when the JCas may not be used for other reasons. The generic kinds of applications | |
are exactly the ones where you would use the reflection API in Java as well. | |
</para> | |
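<para>The reflection analogy can be made concrete with a short sketch in plain Java.
It does not use any UIMA API; the <literal>Person</literal> class and the
<literal>createWithField</literal> helper are hypothetical stand-ins. The point is the
two-step pattern: first resolve a handle by name, then use the handle to access data.</para>
<programlisting><![CDATA[import java.lang.reflect.Field;

public class ReflectionAnalogy {
  // Hypothetical stand-in for a type with two String-valued features.
  public static class Person {
    public String firstName;
    public String lastName;
  }

  // Step 1: resolve the field handle by name.
  // Step 2: use the handle (not the name) to set the value.
  public static Object createWithField(Class<?> type, String fieldName, Object value) {
    try {
      Object instance = type.getDeclaredConstructor().newInstance();
      Field field = type.getField(fieldName); // lookup by name
      field.set(instance, value);             // data access through the handle
      return instance;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Unknown type or field: " + fieldName, e);
    }
  }

  public static void main(String[] args) {
    Person p = (Person) createWithField(Person.class, "firstName", "Ada");
    System.out.println(p.firstName); // Ada
  }
}]]></programlisting>
<para>In the CAS APIs the type system plays the role of the lookup step, and the resulting
type and feature objects play the role of the <literal>Field</literal> handle.</para>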
</section> | |
<section id="ugr.ref.cas.creating_using_indexes"> | |
<title>Creating and using indexes</title> | |
<para>Each view of a CAS provides a set of indexes for that view. Instances of Types (that is, Feature | |
Structures) can be added to a view's indexes. These indexes provide | |
a way for annotators to locate existing data in the CAS, using a specific index (or the | |
method <literal>getAllIndexedFS</literal> of the object <literal>FSIndexRepository</literal>) to | |
retrieve the Feature Structures that were previously created.
Newly created Feature Structures are not automatically added to the indexes; you choose which | |
Feature Structures to add and use one of several APIs to add them. | |
</para> | |
<para>Indexes are named and are associated with a CAS Type; they are used to index | |
instances of that CAS type (including instances of that type's subtypes). If | |
you are using multiple views (see <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>), | |
each view contains a separate instantiation of all of the indexes. | |
To access an index, you | |
minimally need to know its name. A CAS view provides an index repository which you can | |
query for indexes for that view. Once you have a handle to an index, you can get | |
information about the feature structures in the index, the size of the index, as well | |
as an iterator over the feature structures.</para> | |
<para>There are three kinds of indexes: | |
<itemizedlist spacing="compact"> | |
<listitem> | |
<para>bag - no ordering</para> | |
</listitem> | |
<listitem> | |
<para>set - uses a user-specified set of keys to define equality; holds one instance of the set of equal items.</para>
</listitem> | |
<listitem> | |
<para>sorted - uses a user-specified set of keys to define ordering.</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
<para>For set indexes, the comparator keys are augmented with an implicit additional field: the type of the
feature structure. This means that a set index over Annotation, where Token is a subtype of Annotation and the key is the "begin" value,
behaves as follows:
<itemizedlist> | |
<listitem><para>If you make two Tokens (or two Annotations), both having a begin value of 17, and add both of them to the indexes, | |
only one of them will be in the index.</para> | |
</listitem> | |
<listitem><para>If you make 1 Token and 1 Annotation, both having a begin value of 17, and add both of them to the indexes, | |
both of them will be in the index (because the types are different). | |
</para></listitem> | |
</itemizedlist> | |
</para> | |
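<para>The effect of the implicit type field can be modeled in plain Java. This is a sketch
only, not the UIMA implementation: the class <literal>SetIndexModel</literal>, the nested
<literal>Fs</literal> class, and <literal>indexedCount</literal> are hypothetical, and a
<literal>java.util.Set</literal> stands in for the set index.</para>
<programlisting><![CDATA[import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

public class SetIndexModel {
  // Hypothetical stand-in for a feature structure: a type name plus a begin value.
  public static final class Fs {
    public final String type;
    public final int begin;

    public Fs(String type, int begin) {
      this.type = type;
      this.begin = begin;
    }

    // Equality uses the declared key (begin) plus the implicit type field.
    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Fs)) {
        return false;
      }
      Fs other = (Fs) o;
      return begin == other.begin && type.equals(other.type);
    }

    @Override
    public int hashCode() {
      return Objects.hash(type, begin);
    }
  }

  // Adds all items to a set that models the index; returns how many were retained.
  public static int indexedCount(Fs... items) {
    Set<Fs> index = new LinkedHashSet<>();
    for (Fs fs : items) {
      index.add(fs);
    }
    return index.size();
  }

  public static void main(String[] args) {
    System.out.println(indexedCount(new Fs("Token", 17), new Fs("Token", 17)));      // 1
    System.out.println(indexedCount(new Fs("Token", 17), new Fs("Annotation", 17))); // 2
  }
}]]></programlisting>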
<para>Indexes are defined in the XML descriptor metadata for the application. Each CAS | |
View has its own, separate instantiation of indexes based on these definitions, | |
kept in the view's index repository. When you obtain an index, it is always from a | |
particular CAS view's index repository. | |
When you index an item, it is always added to all indexes where it | |
belongs, within just the view's repository. You can specify different repositories | |
(associated with different CAS views) to use; a given Feature Structure instance | |
may be indexed in more than one CAS View (unless it is a subtype of AnnotationBase).</para> | |
<para>Indexes implement the Iterable interface, so you may use the Java enhanced for loop to iterate over them.</para> | |
<para>You can also get iterators from indexes; | |
iterators allow you to enumerate the feature structures in an index. There are two kinds of iterators supported: | |
the regular Java iterator API, and a specific FS iterator API | |
where the usual Java iterator APIs (<literal>hasNext()</literal> and <literal>next()</literal>) | |
are augmented by <literal>isValid()</literal>, <literal>moveToNext() / moveToPrevious()</literal> (which does | |
not return an element) and <literal>get()</literal>. Finally, there is a <literal>moveTo(FeatureStructure)</literal> | |
API, which, for sorted indexes, moves the iteration point to the left-most (among otherwise "equal") item | |
in the index which compares "equal" to the given FeatureStructure, using the index's defined comparator. | |
</para> | |
<para> | |
Which API style you use is up to you, | |
but we do not recommend mixing the styles as the results are sometimes unexpected. If you | |
just want to iterate over an index from start to finish, either style is equally appropriate. | |
If you also use <literal>moveTo(FeatureStructure fs)</literal> and | |
<literal>moveToPrevious()</literal>, it is better to use the special FS iterator style. | |
</para> | |
<note><para>The reason not to mix these styles is that you might expect
next() followed by moveToPrevious() to always work. It does not, because
next() returns the "current" element and then advances to the next position, which might be
beyond the last element. At that point the iterator becomes "invalid", and
moveToNext and moveToPrevious no longer move it. To reset an invalid iterator,
call moveToFirst(), moveToLast(), or moveTo(FS).</para></note>
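<para>The interaction between the two styles can be illustrated with a small model
in plain Java. This is a sketch of the behavior described in the note, not the UIMA
<literal>FSIterator</literal> implementation; the class <literal>CursorModel</literal>
is hypothetical and only borrows the method names.</para>
<programlisting><![CDATA[import java.util.Arrays;
import java.util.List;

public class CursorModel<T> {
  private final List<T> items;
  private int pos = 0; // may run one past the last element

  public CursorModel(List<T> items) {
    this.items = items;
  }

  public boolean isValid() {
    return pos >= 0 && pos < items.size();
  }

  public T get() {
    return items.get(pos);
  }

  // Java-style next(): return the current element, then advance.
  public T next() {
    T current = items.get(pos);
    pos++;
    return current;
  }

  // FS-style moves: they have no effect once the cursor is invalid.
  public void moveToNext() {
    if (isValid()) {
      pos++;
    }
  }

  public void moveToPrevious() {
    if (isValid()) {
      pos--;
    }
  }

  // Explicit repositioning resets an invalid cursor.
  public void moveToFirst() {
    pos = 0;
  }

  public void moveToLast() {
    pos = items.size() - 1;
  }

  public static void main(String[] args) {
    CursorModel<String> it = new CursorModel<>(Arrays.asList("a", "b"));
    it.next();
    it.next();                        // consumed the last element; now invalid
    it.moveToPrevious();              // does nothing in this model
    System.out.println(it.isValid()); // false
    it.moveToLast();                  // explicit reset
    System.out.println(it.get());     // b
  }
}]]></programlisting>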
<para>Indexes are created by specifying them in the annotator's or | |
aggregate's resource descriptor. An index specification includes its name, | |
the CAS type being indexed, the kind (bag, set or sorted) of index it is, and an (optional) set of keys. | |
The keys are used for set and sorted indexes, and specify what values are used for | |
ordering, or (for sets) what values are used to determine set equality. | |
When a CAS pipeline is created, all index | |
specifications are combined; duplicate definitions (having the same name) are | |
allowed only if their definitions are the same. </para> | |
<para>Feature structure instances need to be explicitly added to the index repository by a | |
method call. Feature structures that are not indexed will not be visible to other | |
annotators, (unless they are located via being referenced by some other feature of | |
another feature structure, which is indexed, or through a chain of these).</para> | |
<para>The framework defines an unnamed bag index which indexes all types. The | |
only access provided for this index is the getAllIndexedFS(type) method on the | |
index repository, which returns an iterator over all indexed instances of the | |
specified type (including its subtypes) for that CAS View. | |
</para> | |
<para>The framework defines one standard, built-in annotation index, called | |
AnnotationIndex, which indexes the <literal>uima.tcas.Annotation</literal> | |
type: all feature structures of type <literal>uima.tcas.Annotation</literal> or | |
its subtypes are automatically indexed with this built-in index.</para> | |
<para>The ordering relation used by this index is to first order by the value of the | |
<quote>begin</quote> features (in ascending order) and then by the value of the | |
<quote>end</quote> feature (in descending order), and then, finally, by the | |
Type Priority. This ordering ensures that
longer annotations starting at the same spot come before shorter ones. For Subjects | |
of Analysis other than Text, this may not be an appropriate index.</para> | |
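<para>The first two ordering criteria can be sketched as a plain Java comparator.
This is an illustration under simplifying assumptions: the <literal>Span</literal>
class and <literal>AnnotationOrderModel</literal> are hypothetical, and the final
tie-breaker (Type Priority) is deliberately left out.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class AnnotationOrderModel {
  // Hypothetical stand-in for an annotation: just its begin and end offsets.
  public static final class Span {
    public final int begin;
    public final int end;

    public Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + begin + "," + end + "]";
    }
  }

  // begin ascending, then end descending (longer spans first).
  // Type Priority, the last criterion in the text, is not modeled here.
  public static final Comparator<Span> ORDER = (a, b) -> {
    if (a.begin != b.begin) {
      return Integer.compare(a.begin, b.begin);
    }
    return Integer.compare(b.end, a.end);
  };

  // Returns a sorted copy; the input list is left unchanged.
  public static List<Span> sorted(List<Span> spans) {
    List<Span> copy = new ArrayList<>(spans);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Span> spans = Arrays.asList(new Span(4, 13), new Span(0, 3), new Span(0, 13));
    System.out.println(sorted(spans)); // [[0,13], [0,3], [4,13]]
  }
}]]></programlisting>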
<para>In addition to normal iterators, there is a <literal>select</literal> API, documented | |
in the Version 3 Users guide, which provides additional capabilities for accessing | |
Feature Structures via the indexes.</para> | |
</section> | |
</section> | |
<section id="ugr.ref.cas.builtin_types"> | |
<title>Built-in CAS Types</title> | |
<para>The CAS has two kinds of built-in types – primitive and non-primitive. The | |
primitive types are: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.Boolean</para></listitem> | |
<listitem><para>uima.cas.Byte</para></listitem> | |
<listitem><para>uima.cas.Short</para></listitem> | |
<listitem><para>uima.cas.Integer</para></listitem> | |
<listitem><para>uima.cas.Long</para></listitem> | |
<listitem><para>uima.cas.Float</para></listitem> | |
<listitem><para>uima.cas.Double</para></listitem> | |
<listitem><para>uima.cas.String</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>Byte</literal>, <literal>Short</literal>, <literal>Integer</literal>, and <literal>Long</literal> types are
signed integer types of length 8, 16, 32, and 64 bits, respectively. The
<literal>Float</literal> and <literal>Double</literal> types are 32 bit and 64 bit floating point. The
<literal>String</literal> type can be subtyped to create sets of allowed values; see | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.type_system.string_subtypes"/>. | |
These types can be used to specify the range of a String-valued feature. They act like | |
Strings, but have additional checking to ensure that any value set into them
conforms to one of the allowed values, or to null (which is the value if it is not set). | |
Note that the other primitive types cannot be used | |
as a supertype for another type definition; only | |
<literal>uima.cas.String</literal> can be sub-typed.</para> | |
<para>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the | |
type <literal>uima.cas.TOP</literal>. All other non-primitive types inherit from | |
some supertype.</para> | |
<para>There are 9 built-in array types. These arrays have a size specified when they are | |
created; the size is fixed at creation time. They are named: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.BooleanArray</para></listitem> | |
<listitem><para>uima.cas.ByteArray</para></listitem> | |
<listitem><para>uima.cas.ShortArray</para></listitem> | |
<listitem><para>uima.cas.IntegerArray</para></listitem> | |
<listitem><para>uima.cas.LongArray</para></listitem> | |
<listitem><para>uima.cas.FloatArray</para></listitem> | |
<listitem><para>uima.cas.DoubleArray</para></listitem> | |
<listitem><para>uima.cas.StringArray</para></listitem> | |
<listitem><para>uima.cas.FSArray</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>uima.cas.FSArray</literal> type is an array whose elements are | |
arbitrary other feature structures (instances of non-primitive types).</para> | |
<para>The JCas cover classes for the array types support the Iterable API, so you may | |
write extended for loops over instances of these. For example: | |
<programlisting>FSArray&lt;MyType&gt; myArray = ...
for (MyType fs : myArray) { | |
some_method(fs); | |
}</programlisting> | |
</para> | |
<para>There are 3 built-in types associated with the artifact being analyzed: | |
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.AnnotationBase</para></listitem> | |
<listitem><para>uima.tcas.Annotation</para></listitem> | |
<listitem><para>uima.tcas.DocumentAnnotation</para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>AnnotationBase</literal> type defines one system-used feature | |
which specifies for an annotation the subject of analysis (Sofa) to which it refers. The | |
Annotation type extends from this and defines 2 features, taking | |
<literal>uima.cas.Integer</literal> values, called <literal>begin</literal> | |
and <literal>end</literal>. The <literal>begin</literal> feature typically | |
identifies the start of a span of text the annotation covers; the | |
<literal>end</literal> feature identifies the end. The values refer to character | |
offsets; the starting index is 0. An annotation of the word <quote>CAS</quote> in a text | |
<quote>CAS Reference</quote> would have a start index of 0, and an end index of 3; the | |
difference between end and start is the length of the span the annotation refers | |
to.</para> | |
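<para>The offset convention (begin inclusive, end exclusive) matches that of
<literal>String.substring</literal> in Java, as the following sketch shows. It uses
only the Java standard library; <literal>SpanOffsets</literal> and
<literal>coveredText</literal> are hypothetical names, not UIMA APIs.</para>
<programlisting><![CDATA[public class SpanOffsets {
  // Returns the text covered by the span: begin is inclusive, end is exclusive.
  public static String coveredText(String document, int begin, int end) {
    return document.substring(begin, end);
  }

  public static void main(String[] args) {
    String document = "CAS Reference";
    int begin = 0;
    int end = 3;
    System.out.println(coveredText(document, begin, end)); // CAS
    System.out.println(end - begin);                        // 3, the span length
  }
}]]></programlisting>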
<para>Annotations are always with respect to some Sofa (Subject of Analysis – see | |
<olink targetdoc="&uima_docs_tutorial_guides;"/> | |
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/> | |
.</para> | |
<note><para>Artifacts which are not text strings may have a different interpretation of | |
the meaning of begin and end, or may define their own kind of annotation, extending from | |
<literal>AnnotationBase</literal>. </para></note> | |
<para id="ugr.ref.cas.document_annotation">The <literal>DocumentAnnotation</literal> type has one special instance. It is | |
a subtype of the Annotation type, and the built-in definition defines one feature, | |
<literal>language</literal>, which is a string indicating the language of the | |
document in the CAS. The value of this language feature is used by the system to control | |
flow among annotators when the <quote>CapabilityLanguageFlow</quote> mode is used, | |
allowing the flow to skip over annotators that don't process particular | |
languages. Users may extend this type by adding additional features to it, using the XML | |
Descriptor element for defining a type.</para> | |
<note><para> | |
We do <emphasis>not</emphasis> recommend extending the <literal>DocumentAnnotation</literal> | |
type. If you do, you must <emphasis>not</emphasis> use the JCas, for the reasons stated | |
earlier. | |
</para></note> | |
<para>Each CAS view has a different associated instance of the | |
<literal>DocumentAnnotation</literal> type. On the CAS, use | |
<literal>getDocumentAnnotation()</literal> to access the
<literal>DocumentAnnotation</literal>.</para> | |
<para>There are also built-in types supporting linked lists, similar to the ones available in | |
Java and other programming languages. Their use is | |
constrained by the usual properties of linked lists: not very space efficient, no (efficient) | |
random access, but an easy choice if you don't know how long your list will be ahead of time. The | |
implementation is type specific; there are different list building objects for the
Float, Integer, and String primitive types, plus one for general feature structures. Here are the type names:
<itemizedlist spacing="compact"> | |
<listitem><para>uima.cas.FloatList</para></listitem> | |
<listitem><para>uima.cas.IntegerList</para></listitem> | |
<listitem><para>uima.cas.StringList</para></listitem> | |
<listitem><para>uima.cas.FSList</para> | |
<para></para></listitem> | |
<listitem><para>uima.cas.EmptyFloatList</para></listitem> | |
<listitem><para>uima.cas.EmptyIntegerList</para></listitem> | |
<listitem><para>uima.cas.EmptyStringList</para></listitem> | |
<listitem><para>uima.cas.EmptyFSList</para> | |
<para></para></listitem> | |
<listitem><para>uima.cas.NonEmptyFloatList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyIntegerList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyStringList</para></listitem> | |
<listitem><para>uima.cas.NonEmptyFSList</para></listitem> | |
</itemizedlist></para> | |
<para>For each of the element kinds <literal>Float</literal>,
<literal>Integer</literal>, <literal>String</literal> and
<literal>FeatureStructure</literal>, there is a base type, for instance,
<literal>uima.cas.FloatList</literal>. For each of these, there are two subtypes, | |
corresponding to a non-empty element, and a marker that serves to indicate the end of the | |
list, or an empty list. The non-empty types define two features – | |
<literal>head</literal> and <literal>tail</literal>. The head feature holds the | |
particular value for that part of the list. The tail refers to the next list object | |
(either a non-empty one or the empty version to indicate the end of the list).</para> | |
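<para>The head and tail structure can be modeled in a few lines of plain Java. This is
a sketch of the data layout only; <literal>StringListModel</literal> and its nested
<literal>Node</literal>, <literal>Empty</literal>, and <literal>NonEmpty</literal>
classes are hypothetical and are not the UIMA list types.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

public class StringListModel {
  // Base type; the two subtypes mirror the empty / non-empty split.
  public abstract static class Node { }

  // Marker for the end of a list, or for an empty list.
  public static final class Empty extends Node { }

  // A list element: head holds the value, tail refers to the next list object.
  public static final class NonEmpty extends Node {
    public final String head;
    public final Node tail;

    public NonEmpty(String head, Node tail) {
      this.head = head;
      this.tail = tail;
    }
  }

  // Follows the tail links until something other than a non-empty node is reached.
  public static List<String> toJavaList(Node list) {
    List<String> result = new ArrayList<>();
    Node n = list;
    while (n instanceof NonEmpty) {
      NonEmpty element = (NonEmpty) n;
      result.add(element.head);
      n = element.tail;
    }
    return result;
  }

  public static void main(String[] args) {
    Node list = new NonEmpty("1", new NonEmpty("2", new Empty()));
    System.out.println(toJavaList(list)); // [1, 2]
  }
}]]></programlisting>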
<para>For JCas users, the new operator for the NonEmptyXyzList classes includes a 3 argument version | |
where you may specify the head and tail values as part of the constructor. The JCas | |
cover classes for these implement | |
a <code>push(item)</code> method which creates a new non-empty node, sets the <code>head</code> value | |
to <code>item</code>, and the tail to the node it is called on, and returns the new node. | |
These classes also implement Iterable, so you can use the enhanced Java <code>for</code> operator. | |
The iterator stops when it gets to the end of the list, determined by either the tail being null or | |
the element being one of the EmptyXyzList elements.
Here's a StringList example: | |
<programlisting>StringList sl = jcas.emptyStringList(); | |
sl = sl.push("2"); | |
sl = sl.push("1"); | |
for (String s : sl) { | |
someMethod(s); // some sample use | |
}</programlisting> | |
</para> | |
<para>There are no other built-in types. Users are free to define their own type systems, | |
building upon these types.</para> | |
</section> | |
<section id="ugr.ref.cas.accessing_the_type_system"> | |
<title>Accessing the type system</title> | |
<para> | |
During annotator processing, or outside an annotator, access the type system by calling | |
<literal>CAS.getTypeSystem()</literal>. | |
</para> | |
<para>However, CAS annotators implement an additional method, | |
<literal>typeSystemInit()</literal>, which is called by the UIMA framework before the | |
annotator's process method. This method, implemented by the annotator writer, | |
is passed a reference to the CAS's type system metadata. The method typically uses | |
the type system APIs to obtain type and feature objects corresponding to all the types | |
and features the annotator will be using in its process method. This initialization | |
step should not be done during an annotator's initialize method since the type | |
system can change after the initialize method is called; it should not be done during the | |
process method, since this is presumably work that is identical for each incoming | |
document, and so should be performed only when the type system changes (which will be a | |
rare event). The UIMA framework guarantees it will call the <literal>typeSystemInit()</literal>
method of an annotator whenever the type system changes, before calling the
annotator's <literal>process()</literal> method.</para> | |
<para>The initialization done by <literal>typeSystemInit()</literal> is done by the | |
UIMA framework when you use the JCas APIs; you only need to provide a | |
<literal>typeSystemInit()</literal> method, as described here, when you are not using | |
the JCas approach.</para> | |
<section id="ugr.ref.cas.type_system.printer_example"> | |
<title>TypeSystemPrinter example</title> | |
<para>Here is a code fragment that, given a CAS Type System, will print a list of all | |
types.</para> | |
<programlisting>// Get all type names from the type system
// and print them to stdout.
private void listTypes1(TypeSystem ts) {
  System.out.println("Types in the type system:");
  for (Type t : ts) {
    // Print the fully qualified type name.
    System.out.println(t.getName());
  }
}</programlisting>
<para>This method is passed the type system as a parameter. The type system can be
iterated over directly to visit
all the types. If you run this against a CAS created with no additional
user-defined types, you should see something like this on the console:</para>
<programlisting>Types in the type system: | |
uima.cas.Boolean | |
uima.cas.Byte | |
uima.cas.Short | |
uima.cas.Integer | |
uima.cas.Long | |
uima.cas.ArrayBase | |
... | |
</programlisting> | |
<para>If the type system had user-defined types these would show up too. Note that some | |
of these types are not directly creatable – they are types used by the framework | |
in the type hierarchy (e.g. uima.cas.ArrayBase).</para> | |
<para>CAS type names include a name-space prefix. The components of a type name are | |
separated by the dot (.). A type name component must start with a Unicode letter, | |
followed by an arbitrary sequence of letters, digits and the underscore (_). By | |
convention, the last component of a type name starts with an uppercase letter, the | |
rest start with a lowercase letter.</para> | |
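<para>The naming rule can be expressed as a regular expression. The following sketch is
one reading of the rule above, not UIMA's own validation code: the class
<literal>TypeNameCheck</literal> is hypothetical, and it assumes that
<quote>letters</quote> and <quote>digits</quote> mean the Unicode categories
<literal>L</literal> and <literal>Nd</literal>.</para>
<programlisting><![CDATA[import java.util.regex.Pattern;

public class TypeNameCheck {
  // One component: a Unicode letter, then letters, digits, or underscores.
  private static final String COMPONENT = "\\p{L}[\\p{L}\\p{Nd}_]*";

  // A full name: one or more components separated by dots.
  private static final Pattern TYPE_NAME =
      Pattern.compile(COMPONENT + "(\\." + COMPONENT + ")*");

  public static boolean isWellFormed(String name) {
    return TYPE_NAME.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("uima.tcas.Annotation")); // true
    System.out.println(isWellFormed("com.xyz.proj.Person"));  // true
    System.out.println(isWellFormed("com.1xyz.Person"));      // false: starts with a digit
    System.out.println(isWellFormed("com..Person"));          // false: empty component
  }
}]]></programlisting>
<para>The capitalization convention is not checked here, since it is a convention rather
than a requirement.</para>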
<para>Listing the type names is mildly useful, but it would be even better if we could see | |
the inheritance relation between the types. The following code prints the | |
inheritance tree in indented format.</para> | |
<programlisting>// Requires: import java.util.List;
private static final int INDENT = 2;

private void listTypes2(TypeSystem ts) {
  // Get the root of the inheritance tree.
  Type top = ts.getTopType();
  // Recursively print the tree.
  printInheritanceTree(ts, top, 0);
}

private void printInheritanceTree(TypeSystem ts, Type type, int level) {
  indent(level); // Print indentation.
  System.out.println(type.getName());
  // Get the immediate subtypes.
  List&lt;Type&gt; subTypes = ts.getDirectlySubsumedTypes(type);
  // Print each subtype one indentation level deeper.
  for (Type subType : subTypes) {
    printInheritanceTree(ts, subType, level + 1);
  }
}

// A simple, inefficient indenter
private void indent(int level) {
  int spaces = level * INDENT;
  for (int i = 0; i &lt; spaces; i++) {
    System.out.print(" ");
  }
}</programlisting>
<para> This example shows that you can traverse the type hierarchy by starting at the top | |
with TypeSystem.getTopType and by retrieving subtypes with | |
<literal>TypeSystem.getDirectlySubsumedTypes()</literal>.</para> | |
<para>The type system APIs (see the Javadocs) also allow you to access the features, as well as
the allowed value type of each feature. Here is sample code which prints out all the
features of all the types, together with the allowed value types (the feature | |
<quote>range</quote>). Each feature has a <quote>domain</quote> which is the type | |
where it is defined, as well as a <quote>range</quote>. | |
<programlisting>// Requires: import java.util.Iterator;
private void listFeatures2(TypeSystem ts) {
  Iterator&lt;Feature&gt; featureIterator = ts.getFeatures();
  System.out.println("Features in the type system:");
  while (featureIterator.hasNext()) {
    Feature f = featureIterator.next();
    // Print "shortName: domain -> range".
    System.out.println(
        f.getShortName() + ": " +
        f.getDomain().getName() + " -> " + f.getRange().getName());
  }
  System.out.println();
}</programlisting></para>
<para>We can ask a feature object for its domain (the type it is defined on) and its range | |
(the type of the value of the feature). The terminology derives from the fact that | |
features can be viewed as functions on subspaces of the object space.</para> | |
</section> | |
<section id="ugr.ref.cas.cas_apis_create_modify_feature_structures"> | |
<title>Using the CAS APIs to create and modify feature structures</title> | |
<titleabbrev>Using CAS APIs: Feature Structures</titleabbrev> | |
<para>Assume a type system declaration that defines two types: Entity and Person. | |
Entity has no features defined within it but inherits from uima.tcas.Annotation | |
– so it has the begin and end features. Person is, in turn, a subtype of Entity, | |
and adds firstName and lastName features. CAS type systems are declaratively | |
specified using XML; the format of this XML is described in <olink | |
targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.type_system"/>. | |
<programlisting><![CDATA[<!-- Type System Definition --> | |
<typeSystemDescription> | |
<types> | |
<typeDescription> | |
<name>com.xyz.proj.Entity</name> | |
<description /> | |
<supertypeName>uima.tcas.Annotation</supertypeName> | |
</typeDescription> | |
<typeDescription> | |
<name>com.xyz.proj.Person</name>
<description />
<supertypeName>com.xyz.proj.Entity</supertypeName>
<features> | |
<featureDescription> | |
<name>firstName</name> | |
<description /> | |
<rangeTypeName>uima.cas.String</rangeTypeName> | |
</featureDescription> | |
<featureDescription> | |
<name>lastName</name> | |
<description /> | |
<rangeTypeName>uima.cas.String</rangeTypeName> | |
</featureDescription> | |
</features> | |
</typeDescription> | |
</types> | |
</typeSystemDescription>]]></programlisting></para> | |
<para> | |
To be able to access types and features, we need to know their names. The CAS interface defines
constants that hold the names of built-in types and features, for example
<literal>CAS.TYPE_NAME_INTEGER</literal>. It is good programming practice to create such
constants for the types and features you define, for your own use as well as for others who will
be using your annotators.
</para> | |
<programlisting>/** Entity type name constant. */ | |
public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity"; | |
/** Person type name constant. */ | |
public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";
/** First name feature name constant. */ | |
public static final String FIRST_NAME_FEAT_NAME = "firstName"; | |
/** Last name feature name constant. */ | |
public static final String LAST_NAME_FEAT_NAME = "lastName";</programlisting> | |
<para>Next we define type and feature member variables; these will hold the values of the | |
type and feature objects needed by the CAS APIs, to be assigned during | |
<literal>typeSystemInit()</literal>.</para> | |
<programlisting>// Type system object variables | |
private Type entityType; | |
private Type personType; | |
private Feature firstNameFeature; | |
private Feature lastNameFeature; | |
private Type stringType;</programlisting> | |
<para>The type system does not throw an exception if we ask for something that is
not known; it simply returns null. The code therefore checks for null and throws a proper
exception. We require all these types and features to be defined for the annotator to
work. One might imagine situations where certain computations are predicated on some type
or feature being defined in the type system, but that is not the case here.</para>
<programlisting>// Get a type object corresponding to a name. | |
// If it doesn't exist, throw an exception. | |
private Type initType(String typeName) | |
throws AnnotatorInitializationException { | |
Type type = this.typeSystem.getType(typeName);
if (type == null) { | |
throw new AnnotatorInitializationException( | |
AnnotatorInitializationException.TYPE_NOT_FOUND, | |
new Object[] { this.getClass().getName(), typeName }); | |
} | |
return type; | |
} | |
// We add similar code for retrieving feature objects. | |
// Get a feature object from a name and a type object. | |
// If it doesn't exist, throw an exception. | |
private Feature initFeature(String featName, Type type) | |
throws AnnotatorInitializationException { | |
Feature feat = type.getFeatureByBaseName(featName); | |
if (feat == null) { | |
throw new AnnotatorInitializationException( | |
AnnotatorInitializationException.FEATURE_NOT_FOUND, | |
new Object[] { this.getClass().getName(), featName }); | |
} | |
return feat; | |
}</programlisting> | |
<para>Using these two functions, code for initializing the type system described | |
above would be: | |
<programlisting>public void typeSystemInit(TypeSystem aTypeSystem) | |
throws AnalysisEngineProcessException { | |
this.typeSystem = aTypeSystem; | |
// Set type system member variables. | |
this.entityType = initType(ENTITY_TYPE_NAME); | |
this.personType = initType(PERSON_TYPE_NAME); | |
this.firstNameFeature = | |
initFeature(FIRST_NAME_FEAT_NAME, personType); | |
this.lastNameFeature = | |
initFeature(LAST_NAME_FEAT_NAME, personType); | |
this.stringType = initType(CAS.TYPE_NAME_STRING); | |
}</programlisting></para> | |
<para>Note that we initialize the string type by using a type name constant from the | |
CAS.</para> | |
</section> | |
</section> | |
<section id="ugr.ref.cas.creating_feature_structures"> | |
<title>Creating feature structures</title> | |
<para>To create feature structures in JCas, we use the Java <quote>new</quote> | |
operator. In the CAS, we use one of several different API methods on the CAS object, | |
depending on which of the 10 basic kinds of feature structures we are creating (a plain | |
feature structure, or an instance of the built-in primitive type arrays or FSArray). | |
There is also a method to create an instance of a
<literal>uima.tcas.Annotation</literal>, setting the begin and end | |
values.</para> | |
<para>Once a feature structure is created, it needs to be added to the CAS indexes (unless
it will be accessed via some reference from another accessible feature structure).
Assuming <literal>aCAS</literal> holds a reference to a CAS, and <literal>token</literal> holds a
reference to a newly created feature structure, the following code adds that
feature structure to all the relevant CAS indexes:</para>
<programlisting> // Add the token to the index repository. | |
aCAS.addFsToIndexes(token);</programlisting> | |
<para>There is also a corresponding <literal>removeFsFromIndexes(token)</literal> | |
method on CAS objects.</para> | |
<para>As of version 2.4.1, there are two methods you can use on an index repository
to efficiently bulk-remove all
instances of particular types of feature structures from a particular view. One of these,
<code>aCAS.getIndexRepository().removeAllIncludingSubtypes(aType)</code>, removes all instances of a particular
type, including instances of subtypes of the specified type. The other,
<code>aCAS.getIndexRepository().removeAllExcludingSubtypes(aType)</code>, removes only instances of exactly the
specified type. In both cases, the removal is done from the particular view of the CAS referenced
by <literal>aCAS</literal>.</para>
<section id="ugr.ref.cas.updating_indexed_feature_structures"> | |
<title>Updating indexed feature structures</title> | |
<para>Version 2.7.0 added protection for indexes when feature structure key | |
value features are updated. By default this protection is automatic, but | |
at some performance cost. Users may optimize this further.</para> | |
<para>Protection is needed because some of the indexes (the Sorted and Set types) use comparators defined | |
to use values of the particular features; if these values | |
need to be changed after the feature structure is added to the indexes, | |
the correct way to do this is to: | |
<orderedlist spacing="compact"> | |
<listitem><para>completely remove the item from all indexes where it is indexed, in all views | |
where it is indexed,</para> | |
</listitem> | |
<listitem><para>update the value of the features being used as keys,</para></listitem> | |
<listitem><para>add the item back to the indexes, in all views.</para></listitem> | |
</orderedlist></para> | |
<note><para>It’s OK to change feature values which are not used in determining
sort ordering (or set membership) without removing the item from the index and adding it back.
</para></note> | |
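<para>The remove / update / add-back sequence above can be illustrated with a small stand-alone model.
This is only a sketch: it uses <literal>java.util.TreeSet</literal> as a stand-in for a sorted index, and the
names <literal>SortedIndexUpdateDemo</literal>, <literal>Item</literal>, <literal>updateKeyInPlace</literal> and
<literal>updateKeySafely</literal> are hypothetical, not part of the UIMA API. It shows why changing a key value
in place can leave an item unfindable in a sorted structure, while the three-step sequence keeps it consistent.</para>

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Stand-in for a sorted index, built on java.util.TreeSet; this is not the UIMA API. */
final class SortedIndexUpdateDemo {

    /** Hypothetical indexed item: "begin" is the sort key, "label" is not a key. */
    public static final class Item {
        public int begin;
        public String label;

        public Item(int begin, String label) {
            this.begin = begin;
            this.label = label;
        }
    }

    /** Creates an empty sorted "index" keyed on Item.begin. */
    public static TreeSet<Item> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Item item) -> item.begin));
    }

    /** Unsafe: changes the key while the item is still inside the sorted structure. */
    public static boolean updateKeyInPlace(TreeSet<Item> index, Item item, int newBegin) {
        item.begin = newBegin;
        // The lookup follows the new key, but the item still sits at its old position.
        return index.contains(item);
    }

    /** Safe: remove the item, update the key, then add the item back. */
    public static boolean updateKeySafely(TreeSet<Item> index, Item item, int newBegin) {
        index.remove(item);
        item.begin = newBegin;
        index.add(item);
        return index.contains(item);
    }

    public static void main(String[] args) {
        TreeSet<Item> index = newIndex();
        Item a = new Item(10, "a");
        index.add(a);
        index.add(new Item(20, "b"));
        index.add(new Item(30, "c"));

        a.label = "renamed"; // non-key change: no remove / add-back needed
        System.out.println("found after safe update: " + updateKeySafely(index, a, 40));
        System.out.println("found after in-place update: " + updateKeyInPlace(index, a, 5));
    }
}
```

<para>The model covers a single index; in UIMA the same sequence has to be applied across every index and view
in which the feature structure is indexed.</para>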
<!-- <para>To completely remove an item from the indexes may entail removing it multiple times, if it was | |
added multiple times and (as of version 2.7.0) the JVM global property | |
<code>uima.allow_duplicate_add_to_indexes</code> is true.</para> --> | |
<para>The automatic protection checks for updates of
features being used as keys, and if it finds an update like this for a feature structure that
is in the indexes, it removes the feature structure from the indexes, does the update,
and adds it back. It will do this for every feature update. This is inefficient
when multiple features are being updated; in that case it would be better to
remove the feature structure, do all the updates to all the features needing updates, and then
do a single add-back operation.</para>
<para>This is supported in user code by the method <code>protectIndexes</code>,
available in both the CAS and JCas interfaces.
Here are the ways
of using it: with try / finally, with try-with-resources, and with a Runnable:
<programlisting>// An approach using try / finally.
// AutoCloseable.close() is declared to throw Exception, so the
// enclosing method must catch or declare it.
AutoCloseable ac = my_cas.protectIndexes(); // my_cas is a CAS or a JCas
try {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
} finally {
  ac.close();
}

// This can be written more compactly using try-with-resources:
try (AutoCloseable ac = my_cas.protectIndexes()) {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
}

// An approach using a Runnable, written in Java 8 lambda syntax.
my_cas.protectIndexes(() -> {
  // ... arbitrary user code updating "key" features,
  //     but no checked exceptions are permitted
});</programlisting></para>
<para>The <code>protectIndexes</code> implementation only removes feature structures that | |
have features being updated which are used as keys in some index(es). At the end of the scope | |
of the protectIndexes, it adds all of these back. It also skips removing feature structures | |
from bag indexes, since these have no keys.</para> | |
<para>Within a <code>protectIndexes</code> block, do not do any operations which depend on the | |
indexes being valid, such as creating and using an iterator. This is because the removed FSs | |
are only added back at the end of the protectIndexes block.</para> | |
<para>The JVM property <code>-Duima.report_fs_update_corrupts_index</code> will generate a log entry
every time the framework finds (and automatically surrounds with a remove / add-back) an update to
a feature which could corrupt the index. The log entries can be identified by scanning for messages
starting with <code>While FS was in the index, the feature</code>; the message goes on to identify
the feature in question. Users can use these reports to find the places in their code where
they can either change the design to avoid updating these values after the item is indexed, or
surround the updates with their own <code>protectIndexes</code> blocks.</para>
<para>Out of the box, the UIMA framework runs with automatic (but somewhat inefficient)
protection. To improve upon this,
users would:
<itemizedlist> | |
<listitem><para>Turn on reporting using a global JVM flag <code> | |
-Duima.report_fs_update_corrupts_index</code>. | |
This will cause a message to be logged each time the automatic protection is being invoked, | |
and allows the user to find the spots to improve.</para> | |
</listitem> | |
<listitem><para>Improve each spot, perhaps by surrounding the update code with a protectIndexes | |
block, or by rearranging code to reduce updating feature values used as index keys.</para> | |
</listitem> | |
<listitem><para>Once the code is no longer generating any reports, you can turn off the | |
automatic protection for production runs using the JVM global property | |
<code>-Duima.disable_auto_protect_indexes</code>, and rely on the protectIndexes blocks. | |
If protection is disabled, then the corruption detection is skipped, making the production | |
runs perhaps a bit faster, although this is not significant in most cases.</para></listitem> | |
<listitem><para>For automated build systems, there’s a JVM parameter,
<code>-Duima.exception_when_fs_update_corrupts_index</code>, which will throw an
exception if any automatic recovery situation is encountered. You can use this
in build/test scenarios to ensure
(after adding all needed protectIndexes blocks) that the code remains safe for
turning off the checking in production runs.</para></listitem>
</itemizedlist> | |
</para> | |
</section> | |
</section> | |
<section id="ugr.ref.cas.accessing_modifying_features_of_feature_structures"> | |
<title>Accessing or modifying features of feature structures</title> | |
<titleabbrev>Accessing or modifying Features</titleabbrev> | |
<para>Values of individual features for a feature structure can be set or referenced, | |
using a set of methods that depend on the type of value that feature is declared to have. | |
There are methods on FeatureStructure for this: getBooleanValue, getByteValue, | |
getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue, | |
getStringValue, and getFeatureValue (which means to get a value which in turn is a | |
reference to a feature structure). There are corresponding <quote>setter</quote> | |
methods, as well. These methods on the feature structure object take as arguments the | |
feature object retrieved earlier in the typeSystemInit method.</para> | |
<para>Using the previous example, with the type system initialized with type personType | |
and feature lastNameFeature, here's a sample code fragment that gets and sets | |
that feature:</para> | |
<programlisting>// Assume aPerson is a variable holding an object of type Person | |
// get the lastNameFeature value from the feature structure | |
String lastName = aPerson.getStringValue(lastNameFeature); | |
// set the lastNameFeature value | |
aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</programlisting> | |
<para>The getters and setters for each of the primitive types are defined in the Javadocs | |
as methods of the FeatureStructure interface.</para> | |
</section> | |
<section id="ugr.ref.cas.indexes_and_iterators"> | |
<title>Indexes and Iterators</title> | |
<para>Each CAS can have many indexes associated with it; each CAS View contains | |
a complete set of instantiations of the indexes. Each index is represented by an | |
instance of the type org.apache.uima.cas.FSIndex. You use the object | |
org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to | |
retrieve instances of indexes. There are methods that let you select the index | |
by name, by type, or by both name and type. Since each index is already associated with a type, | |
passing both a name and a type is valid only if the type passed in is the same | |
type or a subtype of the one declared in the index specification for the named index. If you | |
pass in a subtype, the returned FSIndex object refers to an index that will return only | |
items belonging to that subtype (or subtypes of that subtype).</para> | |
<para>The returned FSIndex objects are used, in turn, to create iterators. | |
There is also a method on the Index Repository, <literal>getAllIndexedFS</literal>, | |
which will return an iterator over all indexed Feature Structures (for that CAS View), | |
in no particular order. The iterators | |
created can be used like common Java iterators, to sequentially retrieve items | |
indexed. If the index represents a sorted index, the items are returned in a sorted | |
order, where the sort order is specified in the XML index definition. This XML is part of | |
the Component Descriptor, see <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes.index"/>.</para> | |
<para>In UIMA V3, feature structures may be added to or removed from indexes while iterating
over them. If this happens, any iterators already created will continue to operate over the
before-modification version of the index, unless or until the iterator is re-synchronized with the current
value of the index via one of the following three iterator API calls:
<literal>moveToFirst</literal>, <literal>moveToLast</literal>, or <literal>moveTo(FeatureStructure)</literal>.
<literal>ConcurrentModificationException</literal> is no longer thrown in UIMA V3.
</para> | |
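<para>As a rough analogy only, the Java standard library's <literal>CopyOnWriteArrayList</literal> behaves in a
similar way: an iterator keeps working on the contents that existed when it was created, and only a newly obtained
iterator sees later changes. The sketch below uses that class, not the UIMA index implementation, and the names
<literal>SnapshotIterationDemo</literal> and <literal>iterateWhileAdding</literal> are hypothetical.</para>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Analogy only: CopyOnWriteArrayList iterators operate on a snapshot of the list. */
final class SnapshotIterationDemo {

    /**
     * Walks the list with a single iterator while adding one element per step,
     * and returns the elements that this iterator actually visited.
     */
    public static List<String> iterateWhileAdding(CopyOnWriteArrayList<String> index) {
        List<String> seen = new ArrayList<>();
        Iterator<String> it = index.iterator(); // snapshot is taken here
        while (it.hasNext()) {
            String item = it.next();
            seen.add(item);
            index.add(item + "-copy"); // changes the list, not the iterator's snapshot
        }
        return seen;
    }

    public static void main(String[] args) {
        CopyOnWriteArrayList<String> index = new CopyOnWriteArrayList<>(List.of("a", "b"));
        System.out.println("seen during iteration: " + iterateWhileAdding(index));
        System.out.println("seen by a fresh iterator: " + new ArrayList<>(index));
    }
}
```

<para>Obtaining a fresh iterator in this model plays the role that the re-synchronizing calls play for an
existing UIMA iterator.</para>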
<para>Feature structures being iterated over may have features updated that are used as the "keys" of an index.
If this is done, UIMA will protect the indexes (to prevent index corruption) by automatically removing the
feature structure from the indexes,
updating the feature, and adding the feature structure back to the index (possibly in a new position).
This automatic remove / add-back operation no longer makes the iterator throw a ConcurrentModificationException
(as it did in UIMA Version 2) if the iterator is incremented or decremented;
existing iterators will continue to operate as if no index modification occurred.
</para> | |
<!-- <para>As of version 2.7.0, a new method on FSIndex, <code>withSnapshotIterators(),</code> | |
allows creating a light-weight FSIndex based on the original FSIndex | |
that supports doing arbitrary index operations while iterating, and will not throw | |
<code>ConcurrentModificationException</code>. Iterators obtained from this instance use a | |
<emphasis>snapshot</emphasis> technique - they create a snapshot of the original index when the | |
iterator is created, and then use that snapshot while operating, so the iteration is unaffected by any | |
modifications to the actual index.</para> --> | |
<section id="ugr.ref.cas.index.built_in_indexes"> | |
<title>Built-in Indexes</title> | |
<para>An unnamed built-in bag index exists which holds all feature structures which are indexed. | |
The only access to this index is the method getAllIndexedFS(Type) which returns an iterator | |
over all indexed Feature Structures.</para> | |
<para>The CAS also contains a built-in index for the type <literal>uima.tcas.Annotation</literal>, which sorts | |
annotations in the order in which they appear in the document. Annotations are sorted first by increasing | |
<literal>begin</literal> position. Ties are then broken by <emphasis>decreasing</emphasis> | |
<literal>end</literal> position (so that longer annotations come first). Annotations that match in both | |
their <literal>begin</literal> and <literal>end</literal> features are sorted using the type priorities,
if any are defined
(see <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.type_priority"/>).</para>
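<para>The ordering just described can be written down as a comparator. The following is a minimal sketch of the
rule, not the framework's implementation: <literal>Span</literal>, <literal>annotationOrder</literal> and the
list-based representation of type priorities are hypothetical, and it assumes every type that occurs is listed in
the priority list.</para>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Model of the built-in annotation ordering; not the UIMA implementation. */
final class AnnotationOrderDemo {

    /** Hypothetical stand-in for an annotation: a type name plus begin and end offsets. */
    public record Span(String type, int begin, int end) {}

    /**
     * Orders by increasing begin, then decreasing end, then by position of the
     * type name in the given priority list (earlier entries sort first).
     */
    public static Comparator<Span> annotationOrder(List<String> typePriority) {
        return Comparator.comparingInt(Span::begin)
                .thenComparing(Comparator.comparingInt(Span::end).reversed())
                .thenComparingInt(span -> typePriority.indexOf(span.type()));
    }

    /** Returns a sorted copy of the given spans. */
    public static List<Span> sorted(List<Span> spans, List<String> typePriority) {
        List<Span> result = new ArrayList<>(spans);
        result.sort(annotationOrder(typePriority));
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("Token", 6, 10),
                new Span("Token", 0, 5),
                new Span("Sentence", 0, 20),
                new Span("Entity", 0, 5));
        sorted(spans, List.of("Sentence", "Entity", "Token")).forEach(System.out::println);
    }
}
```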
</section> | |
<section id="ugr.ref.cas.index.adding_to_indexes"> | |
<title>Adding Feature Structures to the Indexes</title> | |
<para>Feature Structures are added to the indexes by various APIs. These add the Feature Structure to | |
<emphasis>all</emphasis> indexes that are defined for the type of that FeatureStructure (or any of its | |
supertypes), in a particular view. | |
Note that you should not add a Feature Structure to the indexes until you have set values for all | |
of the features that may be used as sort keys in an index.</para> | |
<para>There are multiple APIs for adding FSs to the index. | |
<itemizedlist> | |
<listitem><para>(preferred) myFeatureStructure.addToIndexes(). This adds the feature structure instance to the | |
view in which it was originally created.</para> | |
</listitem> | |
<listitem><para>(preferred) myFeatureStructure.addToIndexes(JCas or CAS). This adds the feature structure instance to the | |
view represented by the argument.</para> | |
</listitem> | |
<listitem><para>(older form) casView.addFsToIndexes(myFeatureStructure) or jcasView.addFsToIndexes(myFeatureStructure). | |
This adds the feature structure instance to the | |
view represented by the cas (or jcas).</para> | |
</listitem> | |
<listitem><para>(older form) fsIndexRepositoryView.addFsToIndexes(myFeatureStructure). | |
This adds the feature structure instance to the | |
view represented by the fsIndexRepository instance.</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
<section id="ugr.ref.cas.index.iterators"> | |
<title>Iterators over UIMA Indexes</title> | |
<para>Iterators are objects that implement the interface <literal>org.apache.uima.cas.FSIterator</literal>. This interface
extends <literal>java.util.Iterator</literal>, so it provides the normal Java iterator methods, plus
additional ones that allow moving both forwards and backwards.</para>
<para>UIMA indexes implement <literal>Iterable</literal>, so you can use an index directly in a Java enhanced for loop.</para>
</section> | |
<section id="ugr.ref.cas.index.annotation_index"> | |
<title>Special iterators for Annotation types</title> | |
<para>Note: we recommend using the UIMA V3 select framework, instead of the following. | |
It implements all of the following capabilities, and more, in a uniform manner.</para> | |
<para>The built-in index over the <literal>uima.tcas.Annotation</literal> type | |
named <quote><literal>AnnotationIndex</literal></quote> has additional | |
capabilities. To use them, you first get a reference to this built-in index using | |
either the <literal>getAnnotationIndex</literal> method on a CAS View object, or | |
by asking the <literal>FSIndexRepository</literal> object for an index having the | |
particular name <quote>AnnotationIndex</quote>, for example: | |
<programlisting>AnnotationIndex idx = aCAS.getAnnotationIndex(); | |
// or you can iterate over a specific subtype of Annotation: | |
AnnotationIndex idx = aCAS.getAnnotationIndex(aType); </programlisting></para> | |
<para>This object can be used to produce several additional kinds of iterators. It can
produce unambiguous iterators; these skip over annotations until they find one whose
start position is equal to or greater than the end position of
the previously returned annotation.</para>
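<para>The selection rule of an unambiguous iterator can be sketched as a simple pass over annotations that are
already in index order. This is a model of the rule as stated above, not the UIMA implementation;
<literal>Span</literal> and <literal>unambiguous</literal> are hypothetical names.</para>

```java
import java.util.ArrayList;
import java.util.List;

/** Model of unambiguous iteration over spans that are already in index order. */
final class UnambiguousIterationDemo {

    /** Hypothetical stand-in for an annotation with begin and end offsets. */
    public record Span(int begin, int end) {}

    /**
     * Keeps the first span, then skips every span that starts before the end
     * of the most recently kept span.
     */
    public static List<Span> unambiguous(List<Span> sortedSpans) {
        List<Span> kept = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Span span : sortedSpans) {
            if (span.begin() >= lastEnd) {
                kept.add(span);
                lastEnd = span.end();
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> sorted = List.of(
                new Span(0, 20), new Span(0, 5), new Span(6, 10),
                new Span(20, 25), new Span(22, 30));
        System.out.println(unambiguous(sorted));
    }
}
```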
<para>It can also produce several kinds of subiterators; these are iterators whose | |
annotations fall within the span of another annotation. This kind of iterator can | |
also have the unambiguous property, if desired. It also can be | |
<quote>strict</quote> or not; strict means that the returned annotation lies | |
completely within the span of the controlling annotation. Non-strict only implies | |
that the beginning of the returned annotation falls within the span of the | |
controlling annotation.</para> | |
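<para>The difference between strict and non-strict subiterators can be modeled as two filters over the same
list. This is a sketch under assumptions: <literal>Span</literal> and <literal>within</literal> are hypothetical
names, the controlling annotation itself is excluded, and a begin offset equal to the controlling annotation's end
is treated as outside the span; the framework's exact boundary handling is not taken from this text.</para>

```java
import java.util.ArrayList;
import java.util.List;

/** Model of strict and non-strict subiteration; not the UIMA implementation. */
final class SubiteratorDemo {

    /** Hypothetical stand-in for an annotation with begin and end offsets. */
    public record Span(int begin, int end) {}

    /**
     * Returns the spans whose begin falls within the container (non-strict),
     * or which lie completely within the container (strict).
     */
    public static List<Span> within(List<Span> spans, Span container, boolean strict) {
        List<Span> result = new ArrayList<>();
        for (Span span : spans) {
            if (span == container) {
                continue; // the controlling annotation is not returned (assumption)
            }
            boolean beginsInside =
                    span.begin() >= container.begin() && span.begin() < container.end();
            boolean endsInside = span.end() <= container.end();
            if (beginsInside && (!strict || endsInside)) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Span sentence = new Span(0, 10);
        List<Span> spans = List.of(
                sentence, new Span(0, 4), new Span(3, 12), new Span(8, 10), new Span(10, 15));
        System.out.println("non-strict: " + within(spans, sentence, false));
        System.out.println("strict:     " + within(spans, sentence, true));
    }
}
```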
<para>There is also a method which produces an <literal>AnnotationTree</literal> | |
object, which contains nodes representing the results of doing a strict, | |
unambiguous subiterator over the span of some controlling annotation. For more | |
details, please refer to the Javadocs for the | |
<literal>org.apache.uima.cas.text</literal> package.</para> | |
</section> | |
<section id="ugr.ref.cas.index.constraints_and_filtered_iterators"> | |
<title>Constraints and Filtered iterators</title> | |
<para>Note: for new code, consider using the select framework plus Streams, instead of | |
the following.</para> | |
<para>There is a set of API calls that build constraint objects. These objects can be | |
used directly to test if a particular feature structure matches (satisfies) the | |
constraint, or they can be passed to the createFilteredIterator method to create an | |
iterator that skips over instances which fail to satisfy the constraint.</para> | |
<para>It is possible to specify a feature value located by following a chain of | |
references starting from the feature structure being tested. Here's a | |
scenario to explore this concept. Let's suppose you have the following type | |
system (namespaces are omitted for clarity): | |
<blockquote> | |
<para><emphasis role="bold">Token</emphasis>, having a feature PartOfSpeech | |
which holds a reference to another type (POS)</para> | |
<para><emphasis role="bold">POS</emphasis> (a type with many subtypes, each | |
representing a different part of speech)</para> | |
<para><emphasis role="bold">Noun</emphasis> (a subtype of POS)</para> | |
<para><emphasis role="bold">ProperName</emphasis> (a subtype of Noun), | |
having a feature Class which holds an integer value encoding some information | |
about the proper noun.</para></blockquote></para> | |
<para>If you want to filter Token instances, such that only those tokens get through | |
which are proper names of class 3 (for example), you would need a test that started with | |
a Token instance, followed its PartOfSpeech reference to another instance (the | |
ProperName instance) and then tested the Class feature of that instance for a value | |
equal to 3.</para> | |
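<para>The idea of pairing a path with a test can be modeled with plain Java functions. This sketch does not use
the UIMA constraint API: the classes <literal>POS</literal>, <literal>Noun</literal>, <literal>Verb</literal>,
<literal>ProperName</literal> and <literal>Token</literal> are hypothetical Java stand-ins for the types in the
scenario, and <literal>embed</literal> is a hypothetical helper that follows a path and then applies a test to
the value found there.</para>

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Model of "path plus test" filtering; not the UIMA constraint API. */
final class PathConstraintDemo {

    public static class POS {}
    public static class Noun extends POS {}
    public static class Verb extends POS {}

    /** A proper name carries an integer class code. */
    public static class ProperName extends Noun {
        public final int nameClass;

        public ProperName(int nameClass) {
            this.nameClass = nameClass;
        }
    }

    /** A token refers to its part of speech, which may be unset (null). */
    public record Token(String text, POS partOfSpeech) {}

    /** Associates a test with a path: follow the path, then test the value found there. */
    public static <A, B> Predicate<A> embed(Function<A, B> path, Predicate<B> test) {
        return start -> {
            B value = path.apply(start);
            return value != null && test.test(value);
        };
    }

    /** Matches tokens whose part of speech is a ProperName with the wanted class. */
    public static Predicate<Token> properNameOfClass(int wanted) {
        Predicate<POS> isProperName = pos -> pos instanceof ProperName;
        // The cast is safe because and() only evaluates this after isProperName passed.
        Predicate<POS> hasClass = pos -> ((ProperName) pos).nameClass == wanted;
        return embed(Token::partOfSpeech, isProperName.and(hasClass));
    }

    /** Returns the text of every token that satisfies the constraint. */
    public static List<String> filter(List<Token> tokens, Predicate<Token> constraint) {
        return tokens.stream().filter(constraint).map(Token::text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("Paris", new ProperName(3)),
                new Token("city", new Noun()),
                new Token("runs", new Verb()),
                new Token("Bob", new ProperName(1)));
        System.out.println(filter(tokens, properNameOfClass(3)));
    }
}
```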
<para>To support this, the filtering approach has components that specify tests, and | |
components that specify <quote>paths</quote>. The tests that can be done include | |
testing references to type instances to see if they are instances of some type or its | |
subtypes; this is done with a FSTypeConstraint constraint. Other tests check for | |
equality or, for numeric values, ranges.</para> | |
<para>Each test may be combined with a path that leads to the value to test. Tests that
start from a feature structure instance can be combined with <quote>and</quote> and <quote>or</quote> connectors.
The Javadocs for these are in the package org.apache.uima.cas in the classes that end
in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS.
Here's an example; assume the variable cas holds a reference to a CAS instance.
<programlisting>// Start by getting the constraint factory from the CAS. | |
ConstraintFactory cf = cas.getConstraintFactory(); | |
// To specify a path to an item to test, you start by | |
// creating an empty path. | |
FeaturePath path = cas.createFeaturePath(); | |
// Add POS feature to path, creating one-element path. | |
path.addFeature(posFeat); | |
// You can extend the chain arbitrarily by adding additional | |
// features. | |
// Create a new type constraint. | |
// Type constraints will check that structures | |
// they match against have a type at least as specific | |
// as the type specified in the constraint. | |
FSTypeConstraint nounConstraint = cf.createTypeConstraint(); | |
// Set the type (by default it is TOP). | |
// This succeeds if the type being tested by this constraint | |
// is nounType or a subtype of nounType. | |
nounConstraint.add(nounType); | |
// Embed the noun constraint under the pos path. | |
// This means, associate the test with the path, so it tests the | |
// proper value. | |
// The result is a test which will | |
// match a feature structure that has a posFeat defined | |
// which has a value which is an instance of a nounType or | |
// one of its subtypes. | |
FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint); | |
// Create a type constraint for token (or a subtype of it) | |
FSTypeConstraint tokenConstraint = cf.createTypeConstraint(); | |
// Set the type. | |
tokenConstraint.add(tokenType); | |
// Create the final constraint by conjoining the embedded noun
// constraint (not the bare type constraint) with the token constraint.
FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);
// Create a filtered iterator from some annotation iterator. | |
FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</programlisting> | |
</para></section></section> | |
<section id="ugr.ref.cas.guide_to_javadocs"> | |
<title>The CAS APIs – a guide to the Javadocs</title>
<titleabbrev>CAS APIs Javadocs</titleabbrev>
<para>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most | |
of the APIs described here are in the cas package. The cas.impl package contains classes | |
used in serializing and deserializing (reading and writing external representations) the | |
CAS in various formats, for | |
transporting the CAS among local and remote annotators, or for storing the CAS in | |
permanent storage. The cas.text contains the APIs that extend the CAS to support | |
artifact (including <quote>text</quote>) analysis.</para> | |
<section id="ugr.ref.cas.javadocs.cas_package"> | |
<title>APIs in the CAS package</title> | |
<para>The main objects implementing the APIs discussed here are shown in the diagram | |
below. The hierarchy represents that there is a way to get from an upper object to an | |
instance of the lower object, usually by using a method on the upper object; this is not | |
an inheritance hierarchy. | |
<figure id="ugr.ref.cas.fig.api_hierarchy"> | |
<title>CAS Object hierarchy</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.8in" format="PNG"
fileref="&imgroot;image001.png"/> | |
</imageobject> | |
<textobject><phrase>CAS object hierarchy</phrase></textobject> | |
</mediaobject> | |
</figure> </para> | |
<para>The main interface is the CAS interface. This has most of the functionality of the
CAS, except for the type system metadata access, and the indexing access. JCas and CAS | |
are alternative representations and API approaches to the CAS; each has a method to | |
get the other. You can mix JCas and CAS APIs in your application as needed. To use the | |
JCas APIs, you have to create the Java classes that correspond to the CAS types, and | |
include them in the Java class path of the application. If you have a CAS object, you can | |
get a JCas object by using the getJCas() method call on the CAS object; likewise, you | |
can get the CAS object from a JCas by using the getCAS() method call on the JCas object. | |
There is also a low level CAS interface that is not part of the official API, and is | |
intended for internal use only – it is not documented here.</para> | |
<para>The type system metadata APIs are found in the TypeSystem interface. The objects | |
defining each type and feature are defined by the interfaces Type and Feature. The | |
Type interface has methods to see what types subsume other types, to iterate over the | |
types available, and to extract information about the types, including what | |
features it has. The Feature interface has methods that get what type it belongs to, | |
its name, and its range (the kind of values it can hold).</para> | |
<para>The FSIndexRepository gives you access to methods to get instances of indexes, and | |
also provides access to the iterator over all indexed feature structures: | |
<literal>getAllIndexedFS(aType)</literal>. | |
The FSIndex and AnnotationIndex objects give you methods to create instances of | |
iterators.</para> | |
<para>Iterators and the CAS methods that create new feature structures return | |
FeatureStructure objects. These objects can be used to set and get the values of | |
defined features within them.</para> | |
</section> | |
</section> | |
<section id="ugr.ref.cas.typemerging"> | |
<title>Type Merging</title> | |
<para>When annotators are combined in an aggregate, their defined type systems are merged. | |
This is designed to support independent development of annotator components. The merge | |
results in a single defined type system for CASes that flow through a particular set of | |
annotators.</para> | |
<para>The basic operation of a type system merge is to iterate through all the defined types, | |
and if two annotators define the same fully qualified type name, | |
to take the features defined for those types | |
and form a logical union of those features. This operation requires that same-named features | |
have the same range type names. The resulting type system has features comprising the union | |
of all features over all the various definitions for this type in different annotators. | |
</para> | |
<para>Feature merging checks that all features having the same name in a type have
identical range types; otherwise an error is signaled.</para>
<para>Types are combined for merging when their fully qualified names are the same. | |
Two different definitions can be merged even if their supertype definitions do not match, if | |
one supertype subsumes the other supertype; otherwise an error is signaled. Likewise, two types | |
with the same name can be merged only if their features can be merged. | |
</para> | |
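<para>The feature part of the merge can be sketched as a union of two maps from feature name to range type
name. This is an illustrative model only, not the framework's merge code: <literal>FeatureMergeDemo</literal> and
<literal>mergeFeatures</literal> are hypothetical names, the choice of <literal>IllegalStateException</literal>
to signal the error is an assumption, and the supertype check is left out.</para>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Model of merging the features of two same-named type definitions. */
final class FeatureMergeDemo {

    /**
     * Forms the union of two feature maps (feature name to range type name).
     * Same-named features must have the same range type name; otherwise an
     * exception is thrown.
     */
    public static Map<String, String> mergeFeatures(
            Map<String, String> first, Map<String, String> second) {
        Map<String, String> merged = new LinkedHashMap<>(first);
        for (Map.Entry<String, String> feature : second.entrySet()) {
            String existingRange = merged.putIfAbsent(feature.getKey(), feature.getValue());
            if (existingRange != null && !existingRange.equals(feature.getValue())) {
                throw new IllegalStateException(
                        "Feature " + feature.getKey() + " has conflicting range types: "
                                + existingRange + " and " + feature.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> annotatorA = Map.of("firstName", "uima.cas.String");
        Map<String, String> annotatorB =
                Map.of("firstName", "uima.cas.String", "lastName", "uima.cas.String");
        System.out.println(mergeFeatures(annotatorA, annotatorB));
    }
}
```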
</section> | |
<section id="ugr.ref.cas.limitedmultipleaccess"> | |
<title>Limited multi-thread access to read-only CASs</title> | |
<para>Some applications may find it useful to scale up pipelines and run these in parallel.</para> | |
<para> | |
Generally, CASes are not thread-safe, and only one thread at a time may operate on a given CAS. In many
scenarios, a CAS may be initialized and then filled with Feature Structures, and after some point,
no more updates to that particular CAS will be done.</para>
<para> | |
If a CAS is no longer going to be changed, it is possible to | |
access it on multiple threads in a read-only mode, simultaneously, with some limitations. Limitations | |
arise because some UIMA Framework activities may update internal CAS data structures.</para> | |
<para>Operational data is updated while running a pipeline when a PEAR is entered or exited,
because PEARs establish new class loaders and can potentially switch the JCas classes being used
(the class loaders might define different JCas cover classes
implementing the same UIMA type).
Because of this, you cannot have multiple pipelines accessing a CAS in read-only mode if one or more of those | |
pipelines contains a PEAR. There are other edge cases where this may happen as well; for example, if you are | |
running a pipeline with an Extension Class Loader, | |
and have a callback routine loaded under a different class loader, UIMA will switch the JCas classes when | |
calling the callback. | |
</para> | |
</section> | |
</chapter> |