// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
[[ugr.ref.cas]]
= CAS Reference
The CAS (Common Analysis System) is the part of the Unstructured Information Management Architecture (UIMA) that is concerned with creating and handling the data that annotators manipulate.
Java users typically use the JCas (Java interface to the CAS) when manipulating objects in the CAS.
This chapter describes an alternative interface to the CAS which allows discovery and specification of types and features at run time.
It is recommended for use when the using code cannot know ahead of time the type system it will be dealing with.
Use of the CAS as described here is also recommended (or necessary) when components add to the definitions of types of other components.
This UIMA feature allows users to add features to a type that was already defined elsewhere.
When this feature is used in conjunction with the JCas, it can lead to problems with class loading.
This is because different JCas representations of a single type are generated by the different components, and only one of them is loaded (unless you are using Pear descriptors). Note: we do not recommend that you add features to pre-existing types.
A type should be defined in one place only, and then there is no problem with using the JCas.
However, if you do use this feature, do not use the JCas.
Similarly, if you distribute your components for inclusion in somebody else's UIMA application, and you're not sure that they won't add features to your types, do not use the JCas for the same reasons.
[[ugr.ref.cas.javadocs]]
== Javadocs
The subdirectory `docs/api` contains the documentation details of all the classes, methods, and constants for the APIs discussed here.
Please refer to this for details on the methods, classes and constants, specifically in the packages ``org.apache.uima.cas.*``.
[[ugr.ref.cas.overview]]
== CAS Overview
There are threefootnote:[A fourth part, the Subject of Analysis,
is discussed in .] main parts to the CAS: the type system, data creation and manipulation, and indexing.
We will start with a brief description of these components.
[[ugr.ref.cas.type_system]]
=== The Type System
The type system specifies what kind of data you will be able to manipulate in your annotators.
The type system defines two kinds of entities, types and features.
Types are arranged in a single inheritance tree and define the kinds of entities (objects) you can manipulate in the CAS.
Features optionally specify slots or fields within a type.
In Java terms, a CAS Type corresponds to a Java class, and its CAS Features correspond to fields of that class.
A critical difference is that CAS types have no methods; they are just data structures with named slots (features). These features can have as values primitive things like integers, floating point numbers, and strings, and they also can hold references to other instances of objects in the CAS.
We call instances of the data structures declared by the type system "`feature structures`" (not to be confused with "`features`").
Feature structures are similar to the many variants of record structures found in computer science.footnote:[The name feature structure comes from terminology used in linguistics.]
Each CAS Type defines a supertype; it is a subtype of that supertype.
This means that any features that the supertype defines are features of the subtype; in other words, it inherits its supertype's features.
Only single inheritance is supported; a type's feature set is the union of all of the features in its supertype hierarchy.
There is a built-in type called uima.cas.TOP; this is the top, root node of the inheritance tree.
It defines no features.
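The inheritance rule can be sketched in plain Java.
This is only a model of the idea, not the UIMA `Type` API: the class `TypeModel`, the type names `example.Entity` and `example.Person`, and the feature names `id` and `firstName` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a CAS type; NOT the UIMA Type interface.
final class TypeModel {
    final String name;
    final TypeModel supertype;        // null only for the root type
    final List<String> ownFeatures;   // features declared directly on this type

    TypeModel(String name, TypeModel supertype, String... ownFeatures) {
        this.name = name;
        this.supertype = supertype;
        this.ownFeatures = Arrays.asList(ownFeatures);
    }

    // Walk up the single-inheritance chain; inherited features come first.
    List<String> allFeatures() {
        List<String> result = (supertype == null)
                ? new ArrayList<String>()
                : supertype.allFeatures();
        result.addAll(ownFeatures);
        return result;
    }

    public static void main(String[] args) {
        TypeModel top = new TypeModel("uima.cas.TOP", null);            // defines no features
        TypeModel entity = new TypeModel("example.Entity", top, "id");  // hypothetical type
        TypeModel person = new TypeModel("example.Person", entity, "firstName");
        System.out.println(person.allFeatures()); // [id, firstName]
    }
}
```

Because each type has exactly one supertype, collecting the feature set is a simple walk to the root.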
The values that can be stored in features are either built-in primitive values or references to other feature structures.
The primitive values are `boolean`, `byte`, `short` (16-bit integers), `integer` (32-bit), `long` (64-bit), `float` (32-bit), `double` (64-bit floats) and strings; the official names of these are `uima.cas.Boolean`, `uima.cas.Byte`, `uima.cas.Short`, `uima.cas.Integer`, `uima.cas.Long`, `uima.cas.Float`, `uima.cas.Double` and `uima.cas.String`.
The strings are Java strings, and characters are Java characters.
Technically, this means that characters are UTF-16 code units, which is not quite the same as a Unicode character.
This distinction should make no difference for almost all applications.
The CAS also defines other basic built-in types for arrays of these, plus arrays of references to other objects, called `uima.cas.IntegerArray`, `uima.cas.FloatArray`, `uima.cas.StringArray`, `uima.cas.FSArray`, etc.
The CAS also defines a built-in type called `uima.tcas.Annotation` which inherits from `uima.cas.AnnotationBase` which in turn inherits from ``uima.cas.TOP``.
There are two features defined by this type, called `begin` and ``end``, both of which are integer valued.
[[ugr.ref.cas.creating_accessing_manipulating_data]]
=== Creating, accessing and manipulating data
// <titleabbrev>Creating/Accessing/Changing data</titleabbrev>
Creating and accessing data in the CAS requires knowledge about the types and features defined in the type system.
The idea is similar to other data access APIs, such as the XML DOM or SAX APIs, or database access APIs such as JDBC.
Contrary to those APIs, however, the CAS does not use the names of type system entities directly in the APIs.
Rather, you use the type system to access type and feature entities by name, then use these entities in the data manipulation APIs.
This can be compared to the Java reflection APIs: the type system is comparable to the Java class loader, and the type and feature objects to the `java.lang.Class` and `java.lang.reflect.Field` classes.
Why does it have to be this complicated? You wouldn't normally use reflection to create a Java object, either.
As mentioned earlier, the JCas provides the more straightforward method to manipulate CAS data.
The CAS access methods described here need only be used for generic types of applications that need to be able to handle any kind of data (e.g., generic tooling) or when the JCas may not be used for other reasons.
The generic kinds of applications are exactly the ones where you would use the reflection API in Java as well.
[[ugr.ref.cas.creating_using_indexes]]
=== Creating and using indexes
Each view of a CAS provides a set of indexes for that view.
Instances of Types (that is, Feature Structures) can be added to a view's indexes.
These indexes provide a way for annotators to locate existing data in the CAS, using a specific index (or the method `getAllIndexedFS` of the object ``FSIndexRepository``) to retrieve the Feature Structures that were previously created.
Newly created Feature Structures are not automatically added to the indexes; you choose which Feature Structures to add and use one of several APIs to add them.
Indexes are named and are associated with a CAS Type; they are used to index instances of that CAS type (including instances of that type's subtypes). If you are using xref:tug.adoc#ugr.tug.mvs[multiple views], each view contains a separate instantiation of all of the indexes.
To access an index, you minimally need to know its name.
A CAS view provides an index repository which you can query for indexes for that view.
Once you have a handle to an index, you can get information about the feature structures in the index, the size of the index, as well as an iterator over the feature structures.
There are three kinds of indexes:
* bag - no ordering
* set - uses a user-specified set of keys to define equality; holds one instance of the set of equal items.
* sorted - uses a user-specified set of keys to define ordering.
For set indexes, the comparator keys are augmented with an implicit additional field - the type of the feature structure.
This means that an index over Annotations, having subtype Token, and a key of the "begin" value, will behave as follows:
* If you make two Tokens (or two Annotations), both having a begin value of 17, and add both of them to the indexes, only one of them will be in the index.
* If you make 1 Token and 1 Annotation, both having a begin value of 17, and add both of them to the indexes, both of them will be in the index (because the types are different).
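The two cases above can be reproduced with a plain-Java model of a set index.
This is a sketch of the comparison rule only, not the UIMA index implementation; the classes `SetIndexModel` and `Fs` are hypothetical, and the type is modeled as a simple name string.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Plain-Java model of a set index; NOT the UIMA index implementation.
final class SetIndexModel {
    // Hypothetical stand-in for a feature structure: a type name plus a "begin" key.
    static final class Fs {
        final String typeName;
        final int begin;
        Fs(String typeName, int begin) { this.typeName = typeName; this.begin = begin; }
    }

    // Compare on the "begin" key, then on the type as an implicit additional key.
    static final Comparator<Fs> KEYS = new Comparator<Fs>() {
        @Override public int compare(Fs a, Fs b) {
            int c = Integer.compare(a.begin, b.begin);
            return (c != 0) ? c : a.typeName.compareTo(b.typeName);
        }
    };

    // Adds all items to a fresh set index and reports how many it holds.
    static int indexedCount(Fs... items) {
        TreeSet<Fs> index = new TreeSet<Fs>(KEYS);
        for (Fs fs : items) {
            index.add(fs); // items that compare equal are kept only once
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(indexedCount(new Fs("Token", 17), new Fs("Token", 17)));      // 1
        System.out.println(indexedCount(new Fs("Token", 17), new Fs("Annotation", 17))); // 2
    }
}
```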
Indexes are defined in the XML descriptor metadata for the application.
Each CAS View has its own, separate instantiation of indexes based on these definitions, kept in the view's index repository.
When you obtain an index, it is always from a particular CAS view's index repository.
When you index an item, it is always added to all indexes where it belongs, within just the view's repository.
You can specify different repositories (associated with different CAS views) to use; a given Feature Structure instance may be indexed in more than one CAS View (unless it is a subtype of AnnotationBase).
Indexes implement the Iterable interface, so you may use the Java enhanced for loop to iterate over them.
You can also get iterators from indexes; iterators allow you to enumerate the feature structures in an index.
There are two kinds of iterators supported: the regular Java iterator API, and a specific FS iterator API where the usual Java iterator APIs (`hasNext()` and `next()`) are augmented by `isValid()`, `moveToNext()` and `moveToPrevious()` (which do not return an element), and `get()`.
Finally, there is a `moveTo(FeatureStructure)` API, which, for sorted indexes, moves the iteration point to the left-most (among otherwise "equal") item in the index which compares "equal" to the given FeatureStructure, using the index's defined comparator.
Which API style you use is up to you, but we do not recommend mixing the styles as the results are sometimes unexpected.
If you just want to iterate over an index from start to finish, either style is equally appropriate.
If you also use `moveTo(FeatureStructure fs)` and ``moveToPrevious()``, it is better to use the special FS iterator style.
[NOTE]
====
The reason to not mix these styles is that you might be thinking that next() followed by moveToPrevious() would always work.
This is not true, because next() returns the "current" element, and advances to the next position, which might be beyond the last element.
At that point, the iterator becomes "invalid", and moveToNext and moveToPrevious no longer move the iterator.
But you can call these methods on the iterator -- `moveToFirst()`, `moveToLast()`, or `moveTo(FS)` -- to reset it.
====
Indexes are created by specifying them in the annotator's or aggregate's resource descriptor.
An index specification includes its name, the CAS type being indexed, the kind (bag, set or sorted) of index it is, and an (optional) set of keys.
The keys are used for set and sorted indexes, and specify what values are used for ordering, or (for sets) what values are used to determine set equality.
When a CAS pipeline is created, all index specifications are combined; duplicate definitions (having the same name) are allowed only if their definitions are the same.
Feature structure instances need to be explicitly added to the index repository by a method call.
Feature structures that are not indexed will not be visible to other annotators, unless they can be reached through a reference from a feature of some other feature structure that is indexed, or through a chain of such references.
The framework defines an unnamed bag index which indexes all types.
The only access provided for this index is the getAllIndexedFS(type) method on the index repository, which returns an iterator over all indexed instances of the specified type (including its subtypes) for that CAS View.
The framework defines one standard, built-in annotation index, called AnnotationIndex, which indexes the `uima.tcas.Annotation` type: all feature structures of type `uima.tcas.Annotation` or its subtypes are automatically indexed with this built-in index.
The ordering relation used by this index is to first order by the value of the "`begin`" features (in ascending order) and then by the value of the "`end`" feature (in descending order), and then, finally, by the Type Priority.
This ordering ensures that longer annotations starting at the same spot come before shorter ones.
For Subjects of Analysis other than Text, this may not be an appropriate index.
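The begin/end part of this ordering can be illustrated with an ordinary Java comparator.
The sketch below is a model, not the UIMA `AnnotationIndex`: spans are represented as hypothetical `{begin, end}` int pairs, and the final tie-break on Type Priority is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the begin/end ordering; Type Priority is NOT modeled.
final class AnnotationOrderModel {
    // Each span is {begin, end}; a hypothetical stand-in for an annotation.
    static final Comparator<int[]> BEGIN_ASC_END_DESC = new Comparator<int[]>() {
        @Override public int compare(int[] a, int[] b) {
            int c = Integer.compare(a[0], b[0]);                // begin, ascending
            return (c != 0) ? c : Integer.compare(b[1], a[1]);  // end, descending
        }
    };

    // Returns a new list holding the given spans in index order.
    static List<int[]> sorted(int[]... spans) {
        List<int[]> list = new ArrayList<int[]>(Arrays.asList(spans));
        Collections.sort(list, BEGIN_ASC_END_DESC);
        return list;
    }

    public static void main(String[] args) {
        List<int[]> order = sorted(new int[] {0, 3}, new int[] {4, 13}, new int[] {0, 13});
        for (int[] span : order) {
            System.out.println(span[0] + ".." + span[1]);
        }
        // Prints 0..13, then 0..3, then 4..13: the longer span at offset 0 comes first.
    }
}
```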
In addition to normal iterators, there is a `select` API, documented in the Version 3 Users guide, which provides additional capabilities for accessing Feature Structures via the indexes.
[[ugr.ref.cas.builtin_types]]
== Built-in CAS Types
The CAS has two kinds of built-in types: primitive and non-primitive.
The primitive types are:
* uima.cas.Boolean
* uima.cas.Byte
* uima.cas.Short
* uima.cas.Integer
* uima.cas.Long
* uima.cas.Float
* uima.cas.Double
* uima.cas.String
The `Byte`, `Short`, `Integer`, and `Long` types are all signed integer types, of length 8, 16, 32, and 64 bits.
The `Float` type is 32-bit floating point and the `Double` type is 64-bit floating point.
The `String` type can be subtyped to create sets of allowed values; see xref:ref.adoc#ugr.ref.xml.component_descriptor.type_system.string_subtypes[String Subtypes].
These types can be used to specify the range of a String-valued feature.
They act like Strings, but have additional checking to ensure that a value set into them is one of the allowed values, or null (which is the value if it is not set).
Note that the other primitive types cannot be used as a supertype for another type definition; only `uima.cas.String` can be sub-typed.
The non-primitive types exist in a type hierarchy; the top of the hierarchy is the type ``uima.cas.TOP``.
All other non-primitive types inherit from some supertype.
There are 9 built-in array types.
These arrays have a size specified when they are created; the size is fixed at creation time.
They are named:
* uima.cas.BooleanArray
* uima.cas.ByteArray
* uima.cas.ShortArray
* uima.cas.IntegerArray
* uima.cas.LongArray
* uima.cas.FloatArray
* uima.cas.DoubleArray
* uima.cas.StringArray
* uima.cas.FSArray
The `uima.cas.FSArray` type is an array whose elements are arbitrary other feature structures (instances of non-primitive types).
The JCas cover classes for the array types support the Iterable API, so you may write extended for loops over instances of these.
For example:
[source]
----
FSArray<MyType> myArray = ...
for (MyType fs : myArray) {
  some_method(fs);
}
----
There are 3 built-in types associated with the artifact being analyzed:
* uima.cas.AnnotationBase
* uima.tcas.Annotation
* uima.tcas.DocumentAnnotation
The `AnnotationBase` type defines one system-used feature which specifies for an annotation the subject of analysis (Sofa) to which it refers.
The Annotation type extends from this and defines 2 features, taking `uima.cas.Integer` values, called `begin` and ``end``.
The `begin` feature typically identifies the start of a span of text the annotation covers; the `end` feature identifies the end.
The values refer to character offsets; the starting index is 0.
An annotation of the word "`CAS`" in a text "`CAS Reference`" would have a start index of 0, and an end index of 3; the difference between end and start is the length of the span the annotation refers to.
Annotations are always with respect to some Sofa (Subject of Analysis; see xref:tug.adoc#ugr.tug.aas[Annotations, Artifacts, and Sofas]).
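The offset convention maps directly onto Java's `String.substring(begin, end)`, as this small sketch shows.
The class and method names here are hypothetical and only illustrate the arithmetic for text Sofas.

```java
final class SpanExample {
    // Returns the text covered by a span: begin is inclusive, end is exclusive.
    static String coveredText(String document, int begin, int end) {
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String document = "CAS Reference";
        int begin = 0;
        int end = 3;
        System.out.println(coveredText(document, begin, end)); // CAS
        System.out.println(end - begin);                       // 3, the length of the span
    }
}
```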
[NOTE]
====
Artifacts which are not text strings may have a different interpretation of the meaning of begin and end, or may define their own kind of annotation, extending from ``AnnotationBase``.
====
The `DocumentAnnotation` type has one special instance.
It is a subtype of the Annotation type, and the built-in definition defines one feature, ``language``, which is a string indicating the language of the document in the CAS.
The value of this language feature is used by the system to control flow among annotators when the "`CapabilityLanguageFlow`" mode is used, allowing the flow to skip over annotators that don't process particular languages.
Users may extend this type by adding additional features to it, using the XML Descriptor element for defining a type.
[NOTE]
====
We do _not_ recommend extending the `DocumentAnnotation` type.
If you do, you must _not_ use the JCas, for the reasons stated earlier.
====
Each CAS view has a different associated instance of the `DocumentAnnotation` type.
On the CAS, use `getDocumentAnnotation()` to access the `DocumentAnnotation`.
There are also built-in types supporting linked lists, similar to the ones available in Java and other programming languages.
Their use is constrained by the usual properties of linked lists: not very space efficient, no (efficient) random access, but an easy choice if you don't know how long your list will be ahead of time.
The implementation is type specific; there are different list building objects for each of the primitive types, plus one for general feature structures.
Here are the type names:
* uima.cas.FloatList
* uima.cas.IntegerList
* uima.cas.StringList
* uima.cas.FSList
+
* uima.cas.EmptyFloatList
* uima.cas.EmptyIntegerList
* uima.cas.EmptyStringList
* uima.cas.EmptyFSList
+
* uima.cas.NonEmptyFloatList
* uima.cas.NonEmptyIntegerList
* uima.cas.NonEmptyStringList
* uima.cas.NonEmptyFSList
For each of the four element kinds (`Float`, `Integer`, `String` and general feature structures), there is a base type, for instance, `uima.cas.FloatList`.
For each of these, there are two subtypes, corresponding to a non-empty element, and a marker that serves to indicate the end of the list, or an empty list.
The non-empty types define two features: `head` and `tail`.
The head feature holds the particular value for that part of the list.
The tail refers to the next list object (either a non-empty one or the empty version to indicate the end of the list).
For JCas users, the new operator for the NonEmptyXyzList classes includes a 3 argument version where you may specify the head and tail values as part of the constructor.
The JCas cover classes for these implement a `push(item)` method which creates a new non-empty node, sets the `head` value to ``item``, and the tail to the node it is called on, and returns the new node.
These classes also implement Iterable, so you can use the enhanced Java `for` operator.
The iterator stops when it gets to the end of the list, determined by either the tail being null or the element being one of the EmptyXXXList elements.
Here's a StringList example:
[source]
----
StringList sl = jcas.emptyStringList();
sl = sl.push("2");
sl = sl.push("1");
for (String s : sl) {
  someMethod(s); // some sample use
}
----
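To make the `head` / `tail` structure and the effect of `push` concrete, here is a plain-Java model of a string list.
It is a sketch of the data structure only and does not use the JCas cover classes; `ListModel`, `StrList`, `Empty` and `NonEmpty` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the head/tail list types; NOT the JCas cover classes.
final class ListModel {
    abstract static class StrList {
        // Creates a new non-empty node whose tail is the node push was called on.
        StrList push(String item) {
            return new NonEmpty(item, this);
        }

        // Follows tail links until the empty marker (or a null tail) is reached.
        List<String> toJavaList() {
            List<String> out = new ArrayList<String>();
            StrList node = this;
            while (node instanceof NonEmpty) {
                NonEmpty n = (NonEmpty) node;
                out.add(n.head);
                node = n.tail;
            }
            return out;
        }
    }

    // Marker for the end of a list, or for an empty list.
    static final class Empty extends StrList { }

    static final class NonEmpty extends StrList {
        final String head;   // the value stored in this node
        final StrList tail;  // the rest of the list
        NonEmpty(String head, StrList tail) { this.head = head; this.tail = tail; }
    }

    public static void main(String[] args) {
        StrList sl = new Empty();
        sl = sl.push("2");
        sl = sl.push("1");
        System.out.println(sl.toJavaList()); // [1, 2]
    }
}
```

Since `push` prepends, elements come out in the reverse of the order in which they were pushed.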
There are no other built-in types.
Users are free to define their own type systems, building upon these types.
[[ugr.ref.cas.accessing_the_type_system]]
== Accessing the type system
During annotator processing, or outside an annotator, access the type system by calling ``CAS.getTypeSystem()``.
However, CAS annotators implement an additional method, ``typeSystemInit()``, which is called by the UIMA framework before the annotator's process method.
This method, implemented by the annotator writer, is passed a reference to the CAS's type system metadata.
The method typically uses the type system APIs to obtain type and feature objects corresponding to all the types and features the annotator will be using in its process method.
This initialization step should not be done during an annotator's initialize method, since the type system can change after the initialize method is called.
Nor should it be done during the process method, since this is presumably work that is identical for each incoming document, and so should be performed only when the type system changes (which will be a rare event).
The UIMA framework guarantees it will call the `typeSystemInit()` method of an annotator whenever the type system changes, before calling the annotator's `process()` method.
The initialization done by `typeSystemInit()` is done by the UIMA framework when you use the JCas APIs; you only need to provide a `typeSystemInit()` method, as described here, when you are not using the JCas approach.
[[ugr.ref.cas.type_system.printer_example]]
=== TypeSystemPrinter example
Here is a code fragment that, given a CAS Type System, will print a list of all types.
[source]
----
// Get all type names from the type system
// and print them to stdout.
private void listTypes1(TypeSystem ts) {
  System.out.println("Types in the type system:");
  for (Type t : ts) {
    // print its name.
    System.out.println(t.getName());
  }
}
----
This method is passed the type system as a parameter.
From the type system, we can get an iterator over all the types.
If you run this against a CAS created with no additional user-defined types, you should see something like this on the console:
[source]
----
Types in the type system:
uima.cas.Boolean
uima.cas.Byte
uima.cas.Short
uima.cas.Integer
uima.cas.Long
uima.cas.ArrayBase
...
----
If the type system had user-defined types these would show up too.
Note that some of these types are not directly creatable; they are types used by the framework in the type hierarchy (e.g. `uima.cas.ArrayBase`).
CAS type names include a name-space prefix.
The components of a type name are separated by the dot (.). A type name component must start with a Unicode letter, followed by an arbitrary sequence of letters, digits and the underscore (_). By convention, the last component of a type name starts with an uppercase letter, the rest start with a lowercase letter.
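The naming rule can be written down as a small checker.
This sketch follows the rule exactly as stated above and is not taken from the UIMA sources, so the framework's own validation may differ in details; the class `TypeNameCheck` and its methods are hypothetical.

```java
// Checks a type name against the rule stated in the text; NOT UIMA's own validator.
final class TypeNameCheck {
    // A component starts with a letter, followed by letters, digits or underscores.
    // Works on Java chars, so supplementary Unicode characters are not handled.
    static boolean isValidComponent(String component) {
        if (component.isEmpty() || !Character.isLetter(component.charAt(0))) {
            return false;
        }
        for (int i = 1; i < component.length(); i++) {
            char ch = component.charAt(i);
            if (!Character.isLetterOrDigit(ch) && ch != '_') {
                return false;
            }
        }
        return true;
    }

    // A type name is a dot-separated sequence of valid components.
    static boolean isValidTypeName(String name) {
        // The limit -1 keeps empty trailing components, so "a.b." is rejected.
        for (String component : name.split("\\.", -1)) {
            if (!isValidComponent(component)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidTypeName("com.xyz.proj.Entity")); // true
        System.out.println(isValidTypeName("com..Entity"));         // false
        System.out.println(isValidTypeName("com.1abc.Entity"));     // false
    }
}
```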
Listing the type names is mildly useful, but it would be even better if we could see the inheritance relation between the types.
The following code prints the inheritance tree in indented format.
[source]
----
private static final int INDENT = 2;

private void listTypes2(TypeSystem ts) {
  // Get the root of the inheritance tree.
  Type top = ts.getTopType();
  // Recursively print the tree.
  printInheritanceTree(ts, top, 0);
}

private void printInheritanceTree(TypeSystem ts, Type type, int level) {
  indent(level); // Print indentation.
  System.out.println(type.getName());
  // Get a vector of the immediate subtypes.
  Vector subTypes = ts.getDirectlySubsumedTypes(type);
  ++level; // Increase the indentation level.
  for (int i = 0; i < subTypes.size(); i++) {
    // Print the subtypes.
    printInheritanceTree(ts, (Type) subTypes.get(i), level);
  }
}

// A simple, inefficient indenter
private void indent(int level) {
  int spaces = level * INDENT;
  for (int i = 0; i < spaces; i++) {
    System.out.print(" ");
  }
}
----
This example shows that you can traverse the type hierarchy by starting at the top with TypeSystem.getTopType and by retrieving subtypes with ``TypeSystem.getDirectlySubsumedTypes()``.
The Javadocs also have APIs that allow you to access the features, as well as what the allowed value type is for that feature.
Here is sample code which prints out all the features of all the types, together with the allowed value types (the feature "`range`"). Each feature has a "`domain`" which is the type where it is defined, as well as a "`range`".
[source]
----
private void listFeatures2(TypeSystem ts) {
  Iterator featureIterator = ts.getFeatures();
  Feature f;
  System.out.println("Features in the type system:");
  while (featureIterator.hasNext()) {
    f = (Feature) featureIterator.next();
    System.out.println(
        f.getShortName() + ": " +
        f.getDomain() + " -> " + f.getRange());
  }
  System.out.println();
}
----
We can ask a feature object for its domain (the type it is defined on) and its range (the type of the value of the feature). The terminology derives from the fact that features can be viewed as functions on subspaces of the object space.
[[ugr.ref.cas.cas_apis_create_modify_feature_structures]]
=== Using the CAS APIs to create and modify feature structures
// <titleabbrev>Using CAS APIs: Feature Structures</titleabbrev>
Assume a type system declaration that defines two types: Entity and Person.
Entity has no features defined within it but inherits from uima.tcas.Annotation -- so it has the begin and end features.
Person is, in turn, a subtype of Entity, and adds firstName and lastName features.
CAS type systems are declaratively specified using XML; the format of this XML is described in the xref:ref.adoc#ugr.ref.xml.component_descriptor.type_system[Type System Reference].
[source]
----
<!-- Type System Definition -->
<typeSystemDescription>
  <types>
    <typeDescription>
      <name>com.xyz.proj.Entity</name>
      <description />
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>com.xyz.proj.Person</name>
      <description />
      <supertypeName>com.xyz.proj.Entity</supertypeName>
      <features>
        <featureDescription>
          <name>firstName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lastName</name>
          <description />
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
----
To be able to access types and features, we need to know their names.
The CAS interface defines constants that hold the names of built-in types and features, for example `CAS.TYPE_NAME_INTEGER`.
It is good programming practice to create such constants for the types and features you define, for your own use as well as for others who will be using your annotators.
[source]
----
/** Entity type name constant. */
public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity";
/** Person type name constant. */
public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";
/** First name feature name constant. */
public static final String FIRST_NAME_FEAT_NAME = "firstName";
/** Last name feature name constant. */
public static final String LAST_NAME_FEAT_NAME = "lastName";
----
Next we define type and feature member variables; these will hold the values of the type and feature objects needed by the CAS APIs, to be assigned during ``typeSystemInit()``.
[source]
----
// Type system object variables
private TypeSystem typeSystem;
private Type entityType;
private Type personType;
private Feature firstNameFeature;
private Feature lastNameFeature;
private Type stringType;
----
The type system does not throw an exception if we ask for something that is not known; it simply returns null, so the code checks for this and throws a proper exception.
We require all these types and features to be defined for the annotator to work.
One might imagine situations where certain computations are predicated on some type or feature being defined in the type system, but that is not the case here.
[source]
----
// Get a type object corresponding to a name.
// If it doesn't exist, throw an exception.
private Type initType(String typeName)
    throws AnnotatorInitializationException {
  Type type = this.typeSystem.getType(typeName);
  if (type == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.TYPE_NOT_FOUND,
        new Object[] { this.getClass().getName(), typeName });
  }
  return type;
}

// We add similar code for retrieving feature objects.
// Get a feature object from a name and a type object.
// If it doesn't exist, throw an exception.
private Feature initFeature(String featName, Type type)
    throws AnnotatorInitializationException {
  Feature feat = type.getFeatureByBaseName(featName);
  if (feat == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.FEATURE_NOT_FOUND,
        new Object[] { this.getClass().getName(), featName });
  }
  return feat;
}
----
Using these two functions, code for initializing the type system described above would be:
[source]
----
public void typeSystemInit(TypeSystem aTypeSystem)
    throws AnalysisEngineProcessException {
  this.typeSystem = aTypeSystem;
  // Set type system member variables.
  this.entityType = initType(ENTITY_TYPE_NAME);
  this.personType = initType(PERSON_TYPE_NAME);
  this.firstNameFeature =
      initFeature(FIRST_NAME_FEAT_NAME, personType);
  this.lastNameFeature =
      initFeature(LAST_NAME_FEAT_NAME, personType);
  this.stringType = initType(CAS.TYPE_NAME_STRING);
}
----
Note that we initialize the string type by using a type name constant from the CAS.
[[ugr.ref.cas.creating_feature_structures]]
== Creating feature structures
To create feature structures in JCas, we use the Java "`new`" operator.
In the CAS, we use one of several different API methods on the CAS object, depending on which of the 10 basic kinds of feature structures we are creating (a plain feature structure, or an instance of the built-in primitive type arrays or FSArray).
There is also a method to create an instance of a `uima.tcas.Annotation`, setting the begin and end values.
Once a feature structure is created, it needs to be added to the CAS indexes (unless it will be accessed via some reference from another accessible feature structure).
Assuming `aCAS` holds a reference to a CAS, and `token` holds a reference to a newly created feature structure, here's the code to add that feature structure to all the relevant CAS indexes:
[source]
----
// Add the token to the index repository.
aCAS.addFsToIndexes(token);
----
There is also a corresponding `removeFsFromIndexes(token)` method on CAS objects.
As of version 2.4.1, there are two methods you can use on an index repository to efficiently bulk-remove all instances of particular types of feature structures from a particular view.
One of these, `aCas.getIndexRepository().removeAllIncludingSubtypes(aType)` removes all instances of a particular type, including instances which are subtypes of the specified type.
The other, `aCas.getIndexRepository().removeAllExcludingSubtypes(aType)`, removes only the instances whose type is exactly the specified type.
In both cases, the removal is done from the particular view of the CAS referenced by aCas.
[[ugr.ref.cas.updating_indexed_feature_structures]]
=== Updating indexed feature structures
Version 2.7.0 added protection for indexes when feature structure key value features are updated.
By default this protection is automatic, but at some performance cost.
Users may optimize this further.
Protection is needed because some of the indexes (the Sorted and Set types) use comparators defined to use values of the particular features; if these values need to be changed after the feature structure is added to the indexes, the correct way to do this is to:
. completely remove the item from all indexes where it is indexed, in all views where it is indexed,
. update the value of the features being used as keys,
. add the item back to the indexes, in all views.
[NOTE]
====
It's OK to change feature values which are not used in determining sort ordering (or set membership), without removing and re-adding back to the index.
====
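The remove, update, add-back sequence can be demonstrated with a plain-Java sorted set.
This is a model of why the order of the steps matters, not the UIMA index implementation; `KeyUpdateModel`, `Fs` and `updateKey` are hypothetical names, and only a single index in a single view is modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of a sorted index keyed on "begin"; NOT the UIMA implementation.
final class KeyUpdateModel {
    // Hypothetical stand-in for an indexed feature structure.
    static final class Fs {
        int begin;
        Fs(int begin) { this.begin = begin; }
    }

    static final Comparator<Fs> BY_BEGIN = new Comparator<Fs>() {
        @Override public int compare(Fs a, Fs b) {
            return Integer.compare(a.begin, b.begin);
        }
    };

    // Changes a key value without leaving the sorted set in an inconsistent state.
    static void updateKey(TreeSet<Fs> index, Fs fs, int newBegin) {
        boolean wasIndexed = index.remove(fs); // 1. remove while the old key is still in place
        fs.begin = newBegin;                   // 2. update the key feature
        if (wasIndexed) {
            index.add(fs);                     // 3. add back under the new key
        }
    }

    // Builds an index with begins 1, 5, 9, moves the first item to 7, returns the order.
    static List<Integer> demo() {
        TreeSet<Fs> index = new TreeSet<Fs>(BY_BEGIN);
        Fs first = new Fs(1);
        index.add(first);
        index.add(new Fs(5));
        index.add(new Fs(9));
        updateKey(index, first, 7);
        List<Integer> begins = new ArrayList<Integer>();
        for (Fs fs : index) {
            begins.add(fs.begin);
        }
        return begins;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [5, 7, 9]
    }
}
```

Changing `begin` while the item is still in the set would leave it at its old position, so later lookups and iteration order could be wrong.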
The automatic protection checks for updates of features being used as keys, and if it finds an update like this for a feature structure that is in the indexes, it removes the feature structure from the indexes, does the update, and adds it back.
It will do this for every feature update.
This is obviously not efficient when multiple features are being updated; in that case it would be better to remove the feature structure, do all the updates to all the features needing updates, and then do a single add-back operation.
This is supported in user's code by using the new method `protectIndexes` available in both the CAS and JCas interface.
Here's two ways of using this, one with a try / finally and the other with a Runnable:
[source]
----
// an approach using try / finally
AutoCloseable ac = my_cas.protectIndexes(); // my_cas is a CAS or a JCas
try {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
} finally {
  ac.close();
}

// This can more compactly be written using the auto-close feature of try:
try (AutoCloseable ac = my_cas.protectIndexes()) {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
}

// an approach using a Runnable, written in Java 8 lambda syntax
my_cas.protectIndexes(() -> {
  // ... arbitrary user code updating "key" features,
  //     but no checked exceptions are permitted
});
----
The `protectIndexes` implementation only removes feature structures that have features being updated which are used as keys in some index(es). At the end of the scope of the protectIndexes, it adds all of these back.
It also skips removing feature structures from bag indexes, since these have no keys.
Within a `protectIndexes` block, do not do any operations which depend on the indexes being valid, such as creating and using an iterator.
This is because the removed FSs are only added back at the end of the protectIndexes block.
The JVM property `-Duima.report_fs_update_corrupts_index` will generate a log entry every time the framework finds (and automatically surrounds with a remove / add-back) an update to a feature which could corrupt the index.
The log entries can be identified by scanning for messages starting with `While FS was in the index, the feature` - the message goes on to identify the feature in question.
Users can use these reports to find the places in their code where they can either change the design to avoid updating these values after the item is indexed, or surround the updates with their own `protectIndexes` blocks.
Initially, the out-of-the-box defaults for the UIMA framework will run with an automatic (but somewhat inefficient) protection.
To improve upon this, users would:
* Turn on reporting using the global JVM flag ``-Duima.report_fs_update_corrupts_index``. This causes a message to be logged each time the automatic protection is invoked, and allows the user to find the spots to improve.
* Improve each spot, perhaps by surrounding the update code with a protectIndexes block, or by rearranging code to reduce updating feature values used as index keys.
* Once the code is no longer generating any reports, you can turn off the automatic protection for production runs using the JVM global property ``-Duima.disable_auto_protect_indexes``, and rely on the protectIndexes blocks. If protection is disabled, then the corruption detection is skipped, making the production runs perhaps a bit faster, although this is not significant in most cases.
* For automated build systems, there's a JVM parameter, ``-Duima.exception_when_fs_update_corrupts_index``, which will throw an exception if any automatic recovery situation is encountered. You can use this in build/test scenarios to ensure (after adding all needed protectIndexes blocks) that the code remains safe for turning off the checking in production runs.
[[ugr.ref.cas.accessing_modifying_feature_structures]]
== Accessing or modifying feature structures
Values of individual features for a feature structure can be set or referenced, using a set of methods that depend on the type of value that feature is declared to have.
There are methods on FeatureStructure for this: getBooleanValue, getByteValue, getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue, getStringValue, and getFeatureValue (which means to get a value which in turn is a reference to a feature structure). There are corresponding "`setter`" methods, as well.
These methods on the feature structure object take as arguments the feature object retrieved earlier in the typeSystemInit method.
Using the previous example, with the type system initialized with type personType and feature lastNameFeature, here's a sample code fragment that gets and sets that feature:
[source]
----
// Assume aPerson is a variable holding an object of type Person
// get the lastNameFeature value from the feature structure
String lastName = aPerson.getStringValue(lastNameFeature);
// set the lastNameFeature value
aPerson.setStringValue(lastNameFeature, newStringValueForLastName);
----
The getters and setters for each of the primitive types are defined in the Javadocs as methods of the FeatureStructure interface.
[[ugr.ref.cas.indexes_and_iterators]]
== Indexes and Iterators
Each CAS can have many indexes associated with it; each CAS View contains a complete set of instantiations of the indexes.
Each index is represented by an instance of the type org.apache.uima.cas.FSIndex.
You use the object org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to retrieve instances of indexes.
There are methods that let you select the index by name, by type, or by both name and type.
Since each index is already associated with a type, passing both a name and a type is valid only if the type passed in is the same type or a subtype of the one declared in the index specification for the named index.
If you pass in a subtype, the returned FSIndex object refers to an index that will return only items belonging to that subtype (or subtypes of that subtype).
The returned FSIndex objects are used, in turn, to create iterators.
There is also a method on the Index Repository, ``getAllIndexedFS``, which will return an iterator over all indexed Feature Structures (for that CAS View), in no particular order.
The iterators created can be used like common Java iterators, to sequentially retrieve items indexed.
If the index represents a sorted index, the items are returned in a sorted order, where the sort order is specified in the XML index definition.
This XML is part of the Component Descriptor, see xref:ref.adoc#ugr.ref.xml.component_descriptor.aes.index[Index Definition].
In UIMA V3, Feature structures may be added to or removed from indexes while iterating over them.
If this happens, any iterators already created will continue to operate over the before-modification version of the index, unless or until the iterator is re-synchronized with the current state of the index via one of three iterator API calls: `moveToFirst`, `moveToLast`, or `moveTo(FeatureStructure)`. `ConcurrentModificationException` is no longer thrown in UIMA v3.
Features that are used as the "keys" of an index may also be updated on feature structures that are being iterated over.
If this is done, UIMA will protect the indexes (to prevent index corruption) by automatically removing the Feature Structure from the indexes, updating the field, and adding the FS back to the index (possibly in a new position). This automatic remove / add-back operation no longer makes the iterator throw a ConcurrentModificationException (as it did in UIMA Version 2) if the iterator is incremented or decremented; existing iterators will continue to operate as if no index modification occurred.
[[ugr.ref.cas.index.built_in_indexes]]
=== Built-in Indexes
An unnamed built-in bag index exists which holds all feature structures which are indexed.
The only access to this index is the method `getAllIndexedFS(Type)`, which returns an iterator over all indexed Feature Structures.
The CAS also contains a built-in index for the type ``uima.tcas.Annotation``, which sorts annotations in the order in which they appear in the document.
Annotations are sorted first by increasing `begin` position.
Ties are then broken by _decreasing_ `end` position (so that longer annotations come first). Annotations that match in both their `begin` and `end` features are sorted using the xref:ref.adoc#ugr.ref.xml.component_descriptor.aes.type_priority[Type Priority], if any are defined.
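The ordering rule can be written as an ordinary comparator. The sketch below is plain Java rather than UIMA API: the hypothetical `Span` class stands in for an annotation, and the type-priority tie-break is left out, so spans with identical `begin` and `end` compare as equal here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model, not UIMA API: Span stands in for an annotation.
final class AnnotationOrderSketch {

    static final class Span {
        final int begin;
        final int end;
        final String label;
        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Increasing begin first; ties broken by decreasing end (longer spans first).
    static final Comparator<Span> ORDER = (x, y) -> {
        int byBegin = Integer.compare(x.begin, y.begin);
        if (byBegin != 0) {
            return byBegin;
        }
        return Integer.compare(y.end, x.end); // arguments swapped: larger end sorts first
    };

    // Returns the labels of the spans in annotation order; the input is not modified.
    static List<String> sortedLabels(List<Span> spans) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (Span s : copy) {
            labels.add(s.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span(5, 9, "token2"),
            new Span(0, 4, "token1"),
            new Span(0, 9, "sentence"));
        System.out.println(sortedLabels(spans)); // prints [sentence, token1, token2]
    }
}
```

The enclosing `sentence` span sorts ahead of `token1` even though both begin at 0, because its `end` is larger.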
[[ugr.ref.cas.index.adding_to_indexes]]
=== Adding Feature Structures to the Indexes
Feature Structures are added to the indexes by various APIs.
These add the Feature Structure to _all_ indexes that are defined for the type of that `FeatureStructure` (or any of its supertypes), in a particular view.
Note that you should not add a Feature Structure to the indexes until you have set values for all of the features that may be used as sort keys in an index.
There are multiple APIs for adding FSs to the index.
* (preferred) `myFeatureStructure.addToIndexes()`. This adds the feature structure instance to the view in which it was originally created.
* (preferred) `myFeatureStructure.addToIndexes(JCas or CAS)`. This adds the feature structure instance to the view represented by the argument.
* (older form) `casView.addFsToIndexes(myFeatureStructure)` or `jcasView.addFsToIndexes(myFeatureStructure)`. This adds the feature structure instance to the view represented by the cas (or jcas).
* (older form) `fsIndexRepositoryView.addFsToIndexes(myFeatureStructure)`. This adds the feature structure instance to the view represented by the `fsIndexRepository` instance.
[[ugr.ref.cas.index.iterators]]
=== Iterators over UIMA Indexes
Iterators are objects that implement the interface `org.apache.uima.cas.FSIterator`. This interface extends `java.util.Iterator` and declares the normal Java iterator methods, plus additional ones that allow moving both forwards and backwards.
UIMA Indexes implement `Iterable`, so you can use the index directly in a Java extended for loop.
[[ugr.ref.cas.index.annotation_index]]
=== Special iterators for Annotation types
Note: we recommend using the UIMA V3 select framework, instead of the following.
It implements all of the following capabilities, and more, in a uniform manner.
The built-in index over the `uima.tcas.Annotation` type named "``AnnotationIndex``" has additional capabilities.
To use them, you first get a reference to this built-in index using either the `getAnnotationIndex` method on a CAS View object, or by asking the `FSIndexRepository` object for an index having the particular name "`AnnotationIndex`", for example:
[source]
----
AnnotationIndex idx = aCAS.getAnnotationIndex();
// or you can iterate over a specific subtype of Annotation:
AnnotationIndex subtypeIdx = aCAS.getAnnotationIndex(aType);
----
This object can be used to produce several additional kinds of iterators.
It can produce unambiguous iterators; these skip over elements until they find one whose start position is equal to or greater than the end position of the previously returned annotation.
It can also produce several kinds of subiterators; these are iterators whose annotations fall within the span of another annotation.
This kind of iterator can also have the unambiguous property, if desired.
It also can be "`strict`" or not; strict means that the returned annotation lies completely within the span of the controlling annotation.
Non-strict only implies that the beginning of the returned annotation falls within the span of the controlling annotation.
There is also a method which produces an `AnnotationTree` object, which contains nodes representing the results of doing a strict, unambiguous subiterator over the span of some controlling annotation.
For more details, please refer to the Javadocs for the `org.apache.uima.cas.text` package.
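The unambiguous, strict and non-strict behaviors described above can be modeled with simple filters. This is a rough sketch in plain Java, not the UIMA implementation: the hypothetical `Span` class stands in for an annotation, the input list is assumed to be in annotation-index order already, and spans are treated as half-open intervals, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, not UIMA API: Span stands in for an annotation.
final class SubiteratorSketch {

    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
        @Override public String toString() { return "[" + begin + "," + end + ")"; }
    }

    // Unambiguous: keep a span only if it starts at or after the end of the
    // previously returned span.
    static List<Span> unambiguous(List<Span> sorted) {
        List<Span> out = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Span s : sorted) {
            if (s.begin >= lastEnd) {
                out.add(s);
                lastEnd = s.end;
            }
        }
        return out;
    }

    // Strict: the whole span lies inside the controlling span.
    // Non-strict: only the begin position has to lie inside it.
    static List<Span> within(List<Span> sorted, Span control, boolean strict) {
        List<Span> out = new ArrayList<>();
        for (Span s : sorted) {
            boolean beginInside = s.begin >= control.begin && s.begin < control.end;
            boolean endInside = s.end <= control.end;
            if (beginInside && (!strict || endInside)) {
                out.add(s);
            }
        }
        return out;
    }

    // Sample data, already in annotation order.
    static List<Span> sample() {
        return List.of(new Span(0, 5), new Span(3, 8), new Span(5, 12), new Span(12, 15));
    }

    public static void main(String[] args) {
        Span control = new Span(0, 10);
        System.out.println(unambiguous(sample()));            // [[0,5), [5,12), [12,15)]
        System.out.println(within(sample(), control, true));  // [[0,5), [3,8)]
        System.out.println(within(sample(), control, false)); // [[0,5), [3,8), [5,12)]
    }
}
```

The span `[5,12)` shows the difference between the two modes: it begins inside the controlling span `[0,10)` but ends outside it, so only the non-strict filter returns it.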
[[ugr.ref.cas.index.constraints_and_filtered_iterators]]
=== Constraints and Filtered iterators
Note: for new code, consider using the select framework plus Streams, instead of the following.
There is a set of API calls that build constraint objects.
These objects can be used directly to test if a particular feature structure matches (satisfies) the constraint, or they can be passed to the createFilteredIterator method to create an iterator that skips over instances which fail to satisfy the constraint.
It is possible to specify a feature value located by following a chain of references starting from the feature structure being tested.
Here's a scenario to explore this concept.
Let's suppose you have the following type system (namespaces are omitted for clarity):
____
**Token**, having a feature PartOfSpeech which holds a reference to another type (POS)
*POS* (a type with many subtypes, each representing a different part of speech)
*Noun* (a subtype of POS)
*ProperName* (a subtype of Noun), having a feature Class which holds an integer value encoding some information about the proper noun.
____
If you want to filter Token instances, such that only those tokens get through which are proper names of class 3 (for example), you would need a test that started with a Token instance, followed its PartOfSpeech reference to another instance (the ProperName instance) and then tested the Class feature of that instance for a value equal to 3.
To support this, the filtering approach has components that specify tests, and components that specify "`paths`".
The tests that can be done include testing references to type instances to see if they are instances of some type or its subtypes; this is done with a FSTypeConstraint constraint.
Other tests check for equality or, for numeric values, ranges.
Each test may be combined with a path -- to get to the value to test.
Tests that start from a feature structure instance can be combined with `and` and `or` connectors.
The Javadocs for these are in the package org.apache.uima.cas in the classes that end in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS.
Here's an example; assume the variable cas holds a reference to a CAS instance.
[source]
----
// Start by getting the constraint factory from the CAS.
ConstraintFactory cf = cas.getConstraintFactory();
// To specify a path to an item to test, you start by
// creating an empty path.
FeaturePath path = cas.createFeaturePath();
// Add POS feature to path, creating one-element path.
path.addFeature(posFeat);
// You can extend the chain arbitrarily by adding additional
// features.
// Create a new type constraint.
// Type constraints will check that structures
// they match against have a type at least as specific
// as the type specified in the constraint.
FSTypeConstraint nounConstraint = cf.createTypeConstraint();
// Set the type (by default it is TOP).
// This succeeds if the type being tested by this constraint
// is nounType or a subtype of nounType.
nounConstraint.add(nounType);
// Embed the noun constraint under the pos path.
// This means, associate the test with the path, so it tests the
// proper value.
// The result is a test which will
// match a feature structure that has a posFeat defined
// which has a value which is an instance of a nounType or
// one of its subtypes.
FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint);
// Create a type constraint for token (or a subtype of it)
FSTypeConstraint tokenConstraint = cf.createTypeConstraint();
// Set the type.
tokenConstraint.add(tokenType);
// Create the final constraint by conjoining the two constraints.
FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);
// Create a filtered iterator from some annotation iterator.
FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);
----
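The complete test from the scenario (a Token whose PartOfSpeech value is a ProperName of class 3) combines a path with two tests. The following plain-Java sketch models that idea only; it does not use the UIMA constraint API. A `Function` stands in for a feature path, a `Predicate` stands in for a constraint, and all class and method names are hypothetical. Treating a missing value along the path as "no match" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Simplified model, not UIMA API: plain classes stand in for the scenario's types.
final class PathConstraintSketch {

    static class POS {}
    static class Verb extends POS {}
    static class Noun extends POS {}
    static class ProperName extends Noun {
        final int nameClass; // the "Class" feature of the scenario
        ProperName(int nameClass) { this.nameClass = nameClass; }
    }

    static final class Token {
        final String text;
        final POS partOfSpeech;
        Token(String text, POS partOfSpeech) {
            this.text = text;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // "Embed" a test under a path: follow the path, then test the value found there.
    static <S, V> Predicate<S> embed(Function<S, V> path, Predicate<V> test) {
        return source -> {
            V value = path.apply(source);
            return value != null && test.test(value);
        };
    }

    // Matches tokens whose part of speech is a ProperName with the wanted class.
    static Predicate<Token> properNameOfClass(int wanted) {
        Function<Token, POS> posPath = t -> t.partOfSpeech;
        Predicate<POS> isProperName = p -> p instanceof ProperName;
        Predicate<POS> hasClass = p -> ((ProperName) p).nameClass == wanted;
        // and() short-circuits, so the cast only runs after the type test passed.
        return embed(posPath, isProperName.and(hasClass));
    }

    // Returns the text of every token that satisfies the constraint.
    static List<String> filter(List<Token> tokens, Predicate<Token> constraint) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (constraint.test(t)) {
                out.add(t.text);
            }
        }
        return out;
    }

    static List<Token> sample() {
        return List.of(
            new Token("Paris", new ProperName(3)),
            new Token("runs", new Verb()),
            new Token("city", new Noun()),
            new Token("Alice", new ProperName(1)));
    }

    public static void main(String[] args) {
        System.out.println(filter(sample(), properNameOfClass(3))); // prints [Paris]
    }
}
```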
[[ugr.ref.cas.guide_to_javadocs]]
== The CAS APIs -- a guide to the Javadocs
// <titleabbrev>CAS API's Javadocs</titleabbrev>
The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text.
Most of the APIs described here are in the cas package.
The cas.impl package contains classes used in serializing and deserializing (reading and writing external representations) the CAS in various formats, for transporting the CAS among local and remote annotators, or for storing the CAS in permanent storage.
The cas.text package contains the APIs that extend the CAS to support artifact (including "`text`") analysis.
[[ugr.ref.cas.javadocs.cas_package]]
=== APIs in the CAS package
The main objects implementing the APIs discussed here are shown in the diagram below.
The hierarchy represents that there is a way to get from an upper object to an instance of the lower object, usually by using a method on the upper object; this is not an inheritance hierarchy.
.CAS Object hierarchy
image::images/references/ref.cas/image001.png[CAS object hierarchy]
The main Interface is the CAS interface.
This has most of the functionality of the CAS, except for the type system metadata access, and the indexing access.
JCas and CAS are alternative representations and API approaches to the CAS; each has a method to get the other.
You can mix JCas and CAS APIs in your application as needed.
To use the JCas APIs, you have to create the Java classes that correspond to the CAS types, and include them in the Java class path of the application.
If you have a CAS object, you can get a JCas object by using the `getJCas()` method call on the CAS object; likewise, you can get the CAS object from a JCas by using the `getCAS()` method call on the JCas object.
There is also a low level CAS interface that is not part of the official API, and is intended for internal use only -- it is not documented here.
The type system metadata APIs are found in the TypeSystem interface.
The objects defining each type and feature are defined by the interfaces Type and Feature.
The Type interface has methods to see what types subsume other types, to iterate over the types available, and to extract information about a type, including what features it has.
The Feature interface has methods that get what type it belongs to, its name, and its range (the kind of values it can hold).
The FSIndexRepository gives you access to methods to get instances of indexes, and also provides access to the iterator over all indexed feature structures: ``getAllIndexedFS(aType)``.
The FSIndex and AnnotationIndex objects give you methods to create instances of iterators.
Iterators and the CAS methods that create new feature structures return FeatureStructure objects.
These objects can be used to set and get the values of defined features within them.
[[ugr.ref.cas.typemerging]]
== Type Merging
When annotators are combined in an aggregate, their defined type systems are merged.
This is designed to support independent development of annotator components.
The merge results in a single defined type system for CASes that flow through a particular set of annotators.
The basic operation of a type system merge is to iterate through all the defined types, and if two annotators define the same fully qualified type name, to take the features defined for those types and form a logical union of those features.
This operation requires that same-named features have the same range type names.
The resulting type system has features comprising the union of all features over all the various definitions for this type in different annotators.
Feature merging checks that all features having the same name in a type have an identical range type; otherwise an error is signaled.
Types are combined for merging when their fully qualified names are the same.
Two different definitions can be merged even if their supertype definitions do not match, if one supertype subsumes the other supertype; otherwise an error is signaled.
Likewise, two types with the same name can be merged only if their features can be merged.
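The feature-union part of the merge can be sketched in a few lines. This is an illustrative model in plain Java, not the UIMA merge implementation: a type system is represented as a map from fully qualified type name to that type's features (feature name to range type name), supertype checking is left out, and the type name `example.Person` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model, not UIMA API: type name -> (feature name -> range type name).
final class TypeMergeSketch {

    // Merges two type systems; same-named types get the union of their features.
    static Map<String, Map<String, String>> merge(
            Map<String, Map<String, String>> first,
            Map<String, Map<String, String>> second) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        addAll(merged, first);
        addAll(merged, second);
        return merged;
    }

    private static void addAll(Map<String, Map<String, String>> merged,
                               Map<String, Map<String, String>> source) {
        for (Map.Entry<String, Map<String, String>> type : source.entrySet()) {
            Map<String, String> features =
                merged.computeIfAbsent(type.getKey(), k -> new LinkedHashMap<>());
            for (Map.Entry<String, String> f : type.getValue().entrySet()) {
                String existing = features.putIfAbsent(f.getKey(), f.getValue());
                // Same-named features must agree on their range type.
                if (existing != null && !existing.equals(f.getValue())) {
                    throw new IllegalStateException("Feature " + type.getKey() + ":" + f.getKey()
                        + " has conflicting range types " + existing + " and " + f.getValue());
                }
            }
        }
    }

    // Builds a one-type, one-feature type system for the examples.
    static Map<String, Map<String, String>> typeSystem(String type, String feature, String range) {
        Map<String, String> features = new LinkedHashMap<>();
        features.put(feature, range);
        Map<String, Map<String, String>> ts = new LinkedHashMap<>();
        ts.put(type, features);
        return ts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> a = typeSystem("example.Person", "lastName", "uima.cas.String");
        Map<String, Map<String, String>> b = typeSystem("example.Person", "age", "uima.cas.Integer");
        System.out.println(merge(a, b).get("example.Person").keySet()); // prints [lastName, age]
    }
}
```

Merging a third definition that declares `lastName` with a different range type would raise the exception, mirroring the error case described above.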
[[ugr.ref.cas.limitedmultipleaccess]]
== Limited multi-thread access to read-only CASs
Some applications may find it useful to scale up pipelines and run these in parallel.
Generally, a CAS is not thread-safe, and only one thread at a time may operate on it.
In many scenarios, a CAS may be initialized and then filled with Feature Structures, and after some point, no more updates to that particular CAS will be done.
If a CAS is no longer going to be changed, it is possible to access it on multiple threads in a read-only mode, simultaneously, with some limitations.
Limitations arise because some UIMA Framework activities may update internal CAS data structures.
Operational data is updated while running a pipeline when a PEAR is entered or exited, because PEARs establish new class loaders and can potentially switch the JCas classes being used (this happens because the class loaders might define different JCas cover classes implementing the same UIMA type). Because of this, you cannot have multiple pipelines accessing a CAS in read-only mode if one or more of those pipelines contains a PEAR.
There are other edge cases where this may happen as well; for example, if you are running a pipeline with an Extension Class Loader, and have a callback routine loaded under a different class loader, UIMA will switch the JCas classes when calling the callback.