| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" |
| "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[ |
| <!ENTITY imgroot "../images/references/ref.cas/" > |
| <!ENTITY % uimaents SYSTEM "../entities.ent" > |
| %uimaents; |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <chapter id="ugr.ref.cas"> |
| <title>CAS Reference</title> |
| |
| <para>The CAS (Common Analysis System) is the part of the Unstructured Information |
| Management Architecture (UIMA) that is concerned with creating and handling the data |
| that annotators manipulate.</para> |
| |
| <para>Java users typically use the JCas (Java interface to the CAS) when manipulating |
| objects in the CAS. This chapter describes an alternative interface to the CAS which |
| allows discovery and specification of types and features at run time. It is recommended |
| when the calling code cannot know ahead of time which type system it will be dealing |
| with.</para> |
| |
| <para>Use of the CAS as described here is also recommended (or necessary) when components add |
| to the definitions of types of other components. This UIMA feature allows users to add features |
| to a type that was already defined elsewhere. When this feature is used in conjunction with the |
| JCas, it can lead to problems with class loading. This is because different JCas representations |
| of a single type are generated by the different components, and only one of them is loaded |
| (unless you are using PEAR descriptors). Note: |
| we do not recommend that you add features to pre-existing types. A type should be defined in one |
| place only, and then there is no problem with using the JCas. However, if you do use this feature, |
| do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's |
| UIMA application, and you're not sure that they won't add features to your types, do not use the |
| JCas for the same reasons. |
| </para> |
| |
| <para>CASes passed to Annotator Components are either a base CAS or a regular CAS. Base CASes |
| are only passed to Multi-View components; they are like regular CASes, but do not have |
| user-accessible indexes or Sofas. They are used by the component only for switching to other CAS |
| views, which are regular CASes.</para> |
| |
| <section id="ugr.ref.cas.javadocs"> |
| <title>Javadocs</title> |
| |
| <para>The subdirectory <literal>docs/api</literal> contains the documentation |
| details of all the classes, methods, and constants for the APIs discussed here. Please |
| refer to this for details on the methods, classes and constants, specifically in the |
| packages <literal>org.apache.uima.cas.*</literal>.</para> |
| </section> |
| |
| <section id="ugr.ref.cas.overview"> |
| <title>CAS Overview</title> |
| |
| <para>There are three<footnote><para>A fourth part, the Subject of Analysis, |
| is discussed in <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.aas"/>.</para></footnote> main parts to the CAS: the type system, data creation and |
| manipulation, and indexing. We will start with a brief |
| description of these components.</para> |
| <section id="ugr.ref.cas.type_system"> |
| <title>The Type System</title> |
| |
| <para>The type system specifies what kind of data you will be able to manipulate in your |
| annotators. The type system defines two kinds of entities, types and features. Types |
| are arranged in a single inheritance tree and define the kinds of entities (objects) |
| you can manipulate in the CAS. Features optionally specify slots or fields within a |
| type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS |
| Features to fields within the type. A critical difference is that CAS types have no |
| methods; they are just data structures with named slots (features). These features can |
| have as values primitive things like integers, floating point numbers, and strings, |
| and they also can hold references to other instances of objects in the CAS. We call |
| instances of the data structures declared by the type system <quote>feature |
| structures</quote> (not to be confused with <quote>features</quote>). Feature |
| structures are similar to the many variants of record structures found in computer |
| science.<footnote><para> The name <quote>feature structure</quote> comes from |
| terminology used in linguistics.</para></footnote></para> |
| |
| <para>Each CAS Type defines a supertype; it is a subtype of that supertype. This means |
| that any features that the supertype defines are features of the subtype; in other |
| words, it inherits its supertype's features. Only single inheritance is |
| supported; a type's feature set is the union of all of the features in its |
| supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top, |
| root node of the inheritance tree. It defines no features.</para> |
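| |
| <para>The inheritance rule just described can be modelled in a few lines of plain Java. |
| This is only an illustrative sketch: <literal>InheritanceSketch</literal>, |
| <literal>TypeDef</literal> and the <literal>example.*</literal> type and feature names are |
| hypothetical and are not part of the UIMA API. It shows a type's feature set being |
| computed as the union of the features declared along its supertype chain.</para> |
| |
| <programlisting><![CDATA[import java.util.ArrayList; |
| import java.util.Arrays; |
| import java.util.List; |
| |
| // Illustrative model only; these classes are not part of the UIMA API. |
| final class InheritanceSketch { |
| |
|   static final class TypeDef { |
|     final String name; |
|     final TypeDef supertype;          // null only for the root type |
|     final List<String> ownFeatures;   // features declared on this type itself |
| |
|     TypeDef(String name, TypeDef supertype, String... ownFeatures) { |
|       this.name = name; |
|       this.supertype = supertype; |
|       this.ownFeatures = Arrays.asList(ownFeatures); |
|     } |
| |
|     // Inherited features first, then the ones declared here. |
|     List<String> allFeatures() { |
|       List<String> result = (supertype == null) |
|           ? new ArrayList<String>() |
|           : supertype.allFeatures(); |
|       result.addAll(ownFeatures); |
|       return result; |
|     } |
|   } |
| |
|   public static void main(String[] args) { |
|     TypeDef top = new TypeDef("uima.cas.TOP", null);   // the root defines no features |
|     TypeDef entity = new TypeDef("example.Entity", top, "id"); |
|     TypeDef person = new TypeDef("example.Person", entity, "firstName", "lastName"); |
|     System.out.println(person.name + " has features " + person.allFeatures()); |
|   } |
| }]]></programlisting> |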
| |
| <para>The values that can be stored in features are either built-in primitive values or |
| references to other feature structures. The primitive values are |
| <literal>boolean</literal>, <literal>byte</literal>, |
| <literal>short</literal> (16 bit integers), <literal>integer</literal> (32 |
| bit), <literal>long</literal> (64 bit), <literal>float</literal> (32 bit), |
| <literal>double</literal> (64 bit floats) and strings; the official names of these |
| are <literal>uima.cas.Boolean</literal>, <literal>uima.cas.Byte</literal>, |
| <literal>uima.cas.Short</literal>, <literal>uima.cas.Integer</literal>, |
| <literal>uima.cas.Long</literal>, <literal>uima.cas.Float</literal>, |
| <literal>uima.cas.Double</literal> and <literal>uima.cas.String</literal>. |
| The strings are Java strings, and characters are Java characters. Technically, this means |
| that characters are UTF-16 code units, which is not quite the same as Unicode characters: |
| a character outside the Basic Multilingual Plane occupies two code units (a surrogate pair). |
| This distinction should make no difference for almost all applications. |
| The CAS also defines other basic built-in types for arrays of these, plus arrays of |
| references to other objects, called <literal>uima.cas.IntegerArray</literal>, |
| <literal>uima.cas.FloatArray</literal>, |
| <literal>uima.cas.StringArray</literal>, |
| <literal>uima.cas.FSArray</literal>, etc.</para> |
| |
| <para>The CAS also defines a built-in type called |
| <literal>uima.tcas.Annotation</literal> which inherits from |
| <literal>uima.cas.AnnotationBase</literal> which in turn inherits from |
| <literal>uima.cas.TOP</literal>. There are two features defined by this type, |
| called <literal>begin</literal> and <literal>end</literal>, both of which are |
| integer valued.</para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.creating_accessing_manipulating_data"> |
| <title>Creating, accessing and manipulating data</title> |
| <titleabbrev>Creating/Accessing/Changing data</titleabbrev> |
| |
| <para> |
| Creating and accessing data in the CAS requires knowledge about the types and features |
| defined in the type system. The idea is similar to other data access APIs, such as the XML |
| DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the |
| CAS does not use the names of type system entities directly in the APIs. Rather, you use |
| the type system to access type and feature entities by name, then use these entities in the |
| data manipulation APIs. This can be compared to the Java reflection APIs: the type system |
| is comparable to the Java class loader, and the type and feature objects to the |
| <literal>java.lang.Class</literal> and <literal>java.lang.reflect.Field</literal> classes. |
| </para> |
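| |
| <para>For readers less familiar with the reflection side of this comparison, here is a |
| minimal sketch in plain Java. It uses only <literal>java.lang.reflect</literal>; the class |
| names <literal>ReflectionAnalogy</literal> and <literal>Person</literal> are hypothetical, |
| and no UIMA classes are involved. The two-step pattern (resolve an entity by name once, then |
| use the resolved object to access data) is the part that carries over to the CAS APIs.</para> |
| |
| <programlisting><![CDATA[import java.lang.reflect.Field; |
| |
| // Plain Java reflection, shown only as an analogy; no UIMA classes are involved. |
| final class ReflectionAnalogy { |
| |
|   // Hypothetical data class standing in for a CAS type with one feature. |
|   public static final class Person { |
|     public String firstName; |
|     public Person(String firstName) { this.firstName = firstName; } |
|   } |
| |
|   // Step 1: resolve the field by name (compare: looking up a feature object by name). |
|   static Field lookUpField(Class<?> type, String fieldName) { |
|     try { |
|       return type.getField(fieldName); |
|     } catch (NoSuchFieldException e) { |
|       throw new IllegalArgumentException("No such field: " + fieldName, e); |
|     } |
|   } |
| |
|   // Step 2: use the resolved Field object to read data from an instance. |
|   static Object read(Field field, Object target) { |
|     try { |
|       return field.get(target); |
|     } catch (IllegalAccessException e) { |
|       throw new IllegalStateException(e); |
|     } |
|   } |
| |
|   public static void main(String[] args) { |
|     Field firstName = lookUpField(Person.class, "firstName"); |
|     System.out.println(read(firstName, new Person("Ada"))); |
|   } |
| }]]></programlisting> |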
| |
| <para> |
| Why does it have to be this complicated? You wouldn't normally use reflection to create a |
| Java object, either. As mentioned earlier, the JCas provides the more straightforward |
| method to manipulate CAS data. The CAS access methods described here need only be used for |
| generic types of applications that need to be able to handle any kind of data (e.g., generic |
| tooling) or when the JCas may not be used for other reasons. The generic kinds of applications |
| are exactly the ones where you would use the reflection API in Java as well. |
| </para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.creating_using_indexes"> |
| <title>Creating and using indexes</title> |
| |
| <para>Each view of a CAS provides a set of indexes for that view. Instances of feature |
| structures can be added to a view's indexes. These indexes provide |
| the only way for other annotators to locate existing data in the CAS. The only way for an |
| annotator to use data that another annotator has created is by using an index (or the |
| method <literal>getAllIndexedFS</literal> of the object <literal>FSIndexRepository</literal>) to |
| retrieve feature structures the first annotator created. If you want the data you |
| create to be visible to other annotators, you must explicitly call methods which |
| add it to the indexes — you must index it.</para> |
| |
| <para>Indexes are named and are associated with a CAS Type; they are used to index |
| instances of that CAS type (including instances of that type's subtypes). If |
| you are using multiple views (see <olink |
| targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>), |
| each view contains a separate instantiation of all of the indexes. |
| To access an index, you |
| minimally need to know its name. A CAS view provides an index repository which you can |
| query for indexes for that view. Once you have a handle to an index, you can get |
| information about the feature structures in the index, the size of the index, as well |
| as an iterator over the feature structures.</para> |
| |
| <para>Indexes are defined in the XML descriptor metadata for the application. Each CAS |
| View has its own, separate instantiation of indexes based on these definitions, |
| kept in the view's index repository. When you obtain an index, it is always from a |
| particular CAS view. When you index an item, it is always added to all indexes where it |
| belongs, within just one repository. You can specify different repositories |
| (associated with different CAS views) to use; a given Feature Structure instance |
| may be indexed in more |
| than one CAS View.</para> |
| |
| <para>Iterators allow you to enumerate the feature structures in an index. FS iterators |
| provide two kinds of APIs: the regular Java iterator API, and a specific FS iterator API |
| where the usual Java iterator APIs (<literal>hasNext()</literal> and <literal>next()</literal>) |
| are replaced by <literal>isValid()</literal>, <literal>moveToNext()</literal> (which does |
| not return an element) and <literal>get()</literal>. Which API style you use is up to you, |
| but we do not recommend mixing the styles as the results are sometimes unexpected. If you |
| just want to iterate over an index from start to finish, either style is equally appropriate. |
| If you also use <literal>moveTo(FeatureStructure fs)</literal> and |
| <literal>moveToPrevious()</literal>, it is better to use the special FS iterator style. |
| </para> |
| <note><para>The reason not to mix these styles is that you might expect |
| <literal>next()</literal> followed by <literal>moveToPrevious()</literal> to always work. It does not, because |
| <literal>next()</literal> returns the "current" element and then advances to the next position, which might be |
| beyond the last element. At that point, the iterator becomes "invalid", and by the iterator |
| contract, <literal>moveToNext()</literal> and <literal>moveToPrevious()</literal> are not allowed on "invalid" iterators; |
| when an iterator is not valid, all bets are off. You can, however, call |
| <literal>moveToFirst()</literal>, <literal>moveToLast()</literal>, or <literal>moveTo(FS)</literal> on an invalid iterator to reset it.</para></note> |
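| |
| <para>The pitfall described in the note can be reproduced with a small stand-alone model. |
| The class <literal>CursorSketch</literal> below is hypothetical and is not the UIMA |
| <literal>FSIterator</literal> implementation; it only borrows the method names used above. |
| One assumption is made explicit in the code: this model chooses to ignore |
| <literal>moveToNext()</literal> and <literal>moveToPrevious()</literal> on an invalid |
| cursor, whereas the real contract simply leaves that case undefined.</para> |
| |
| <programlisting><![CDATA[import java.util.Arrays; |
| import java.util.List; |
| import java.util.NoSuchElementException; |
| |
| // Hypothetical model of a cursor offering both API styles; not UIMA code. |
| final class CursorSketch<T> { |
|   private final List<T> items; |
|   private int pos = 0; |
| |
|   CursorSketch(List<T> items) { this.items = items; } |
| |
|   // FS iterator style |
|   boolean isValid() { return pos >= 0 && pos < items.size(); } |
|   T get() { |
|     if (!isValid()) throw new NoSuchElementException(); |
|     return items.get(pos); |
|   } |
|   // Assumption of this model: moves on an invalid cursor are ignored. |
|   void moveToNext() { if (isValid()) pos++; } |
|   void moveToPrevious() { if (isValid()) pos--; } |
|   void moveToFirst() { pos = 0; } |
|   void moveToLast() { pos = items.size() - 1; } |
| |
|   // Java iterator style: return the current element, then advance. |
|   boolean hasNext() { return isValid(); } |
|   T next() { |
|     T current = get(); |
|     pos++; |
|     return current; |
|   } |
| |
|   public static void main(String[] args) { |
|     CursorSketch<String> it = new CursorSketch<String>(Arrays.asList("a", "b")); |
|     it.next(); |
|     it.next();                          // now positioned past the last element |
|     it.moveToPrevious();                // does not bring the cursor back |
|     System.out.println(it.isValid()); |
|     it.moveToLast();                    // explicit reset |
|     System.out.println(it.get()); |
|   } |
| }]]></programlisting> |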
| |
| <para>Indexes are created by specifying them in the annotator's or |
| aggregate's resource descriptor. An index specification includes its name, |
| the CAS type being indexed, the kind of index it is, and an (optional) ordering |
| relation on the feature structures to be indexed. At startup time, all index |
| specifications are combined; duplicate definitions (having the same name) are |
| allowed only if their definitions are the same. </para> |
| |
| <para>Feature structure instances need to be explicitly added to the index repository by a |
| method call. Feature structures that are not indexed will not be visible to other |
| annotators (unless they can be reached through a reference held in a feature of |
| another feature structure that is indexed, or through a chain of such references).</para> |
| |
| <para>The framework defines an unnamed bag index which indexes all types. The |
| only access provided for this index is the getAllIndexedFS(type) method on the |
| index repository, which returns an iterator over all indexed instances of the |
| specified type (including its subtypes) for that CAS View. |
| </para> |
| |
| <para>The framework defines one standard, built-in annotation index, called |
| AnnotationIndex, which indexes the <literal>uima.tcas.Annotation</literal> |
| type: all feature structures of type <literal>uima.tcas.Annotation</literal> or |
| its subtypes are automatically indexed with this built-in index.</para> |
| |
| <para>The ordering relation used by this index is to first order by the value of the |
| <quote>begin</quote> feature (in ascending order) and then by the value of the |
| <quote>end</quote> feature (in descending order). This ordering ensures that |
| longer annotations starting at the same spot come before shorter ones. For Subjects |
| of Analysis other than Text, this may not be an appropriate index.</para> |
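| |
| <para>The ordering described above can be sketched as an ordinary Java comparator. The |
| <literal>Span</literal> class here is a hypothetical stand-in that carries only the two |
| offsets; it is not a UIMA class, and the sketch covers only the two sort keys named in |
| this section.</para> |
| |
| <programlisting><![CDATA[import java.util.ArrayList; |
| import java.util.Comparator; |
| import java.util.List; |
| |
| // Illustrative only: models the begin/end ordering, not the UIMA index itself. |
| final class SpanOrderSketch { |
| |
|   // Hypothetical stand-in for an annotation: just its begin and end offsets. |
|   static final class Span { |
|     final int begin; |
|     final int end; |
|     Span(int begin, int end) { this.begin = begin; this.end = end; } |
|     @Override public String toString() { return "[" + begin + "," + end + ")"; } |
|   } |
| |
|   // begin ascending, then end descending (longer spans first at the same begin). |
|   static final Comparator<Span> ANNOTATION_ORDER = new Comparator<Span>() { |
|     @Override public int compare(Span a, Span b) { |
|       if (a.begin != b.begin) { |
|         return Integer.compare(a.begin, b.begin); |
|       } |
|       return Integer.compare(b.end, a.end); |
|     } |
|   }; |
| |
|   static List<Span> sorted(List<Span> spans) { |
|     List<Span> copy = new ArrayList<Span>(spans); |
|     copy.sort(ANNOTATION_ORDER); |
|     return copy; |
|   } |
| |
|   public static void main(String[] args) { |
|     List<Span> spans = new ArrayList<Span>(); |
|     spans.add(new Span(0, 3)); |
|     spans.add(new Span(4, 13)); |
|     spans.add(new Span(0, 13)); |
|     // The span covering 0..13 sorts before the shorter span covering 0..3. |
|     System.out.println(sorted(spans)); |
|   } |
| }]]></programlisting> |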
| |
| </section> |
| </section> |
| |
| <section id="ugr.ref.cas.builtin_types"> |
| <title>Built-in CAS Types</title> |
| |
| <para>The CAS has two kinds of built-in types – primitive and non-primitive. The |
| primitive types are: |
| |
| <itemizedlist spacing="compact"> |
| <listitem><para>uima.cas.Boolean</para></listitem> |
| <listitem><para>uima.cas.Byte</para></listitem> |
| <listitem><para>uima.cas.Short</para></listitem> |
| <listitem><para>uima.cas.Integer</para></listitem> |
| <listitem><para>uima.cas.Long</para></listitem> |
| <listitem><para>uima.cas.Float</para></listitem> |
| <listitem><para>uima.cas.Double</para></listitem> |
| <listitem><para>uima.cas.String</para></listitem> |
| </itemizedlist></para> |
| |
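| <para>The <literal>Byte</literal>, <literal>Short</literal>, <literal>Integer</literal>, and <literal>Long</literal> types are |
| all signed integer types, of length 8, 16, 32, and 64 bits respectively. The |
| <literal>Float</literal> type is 32 bit floating point, and the |
| <literal>Double</literal> type is 64 bit floating point. The |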
| <literal>String</literal> type can be sub-typed to create sets of allowed values; see |
| <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.xml.component_descriptor.type_system.string_subtypes"/>. |
| These types can be used to specify the range of a String-valued feature. They act like |
| Strings, but have additional checking to ensure that any value set into them |
| is one of the allowed values. Note that the other primitive types cannot be used |
| as a supertype for another type definition; only |
| <literal>uima.cas.String</literal> can be sub-typed.</para> |
| |
| <para>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the |
| type <literal>uima.cas.TOP</literal>. All other non-primitive types inherit from |
| some supertype.</para> |
| |
| <para>There are 9 built-in array types. These arrays have a size specified when they are |
| created; the size is fixed at creation time. They are named: |
| |
| <itemizedlist spacing="compact"> |
| <listitem><para>uima.cas.BooleanArray</para></listitem> |
| <listitem><para>uima.cas.ByteArray</para></listitem> |
| <listitem><para>uima.cas.ShortArray</para></listitem> |
| <listitem><para>uima.cas.IntegerArray</para></listitem> |
| <listitem><para>uima.cas.LongArray</para></listitem> |
| <listitem><para>uima.cas.FloatArray</para></listitem> |
| <listitem><para>uima.cas.DoubleArray</para></listitem> |
| <listitem><para>uima.cas.StringArray</para></listitem> |
| <listitem><para>uima.cas.FSArray</para></listitem> |
| </itemizedlist></para> |
| |
| <para>The <literal>uima.cas.FSArray</literal> type is an array whose elements are |
| arbitrary other feature structures (instances of non-primitive types).</para> |
| |
| <para>There are 3 built-in types associated with the artifact being analyzed: |
| |
| <itemizedlist spacing="compact"> |
| <listitem><para>uima.cas.AnnotationBase</para></listitem> |
| <listitem><para>uima.tcas.Annotation</para></listitem> |
| <listitem><para>uima.tcas.DocumentAnnotation</para></listitem> |
| </itemizedlist></para> |
| |
| <para>The <literal>AnnotationBase</literal> type defines one system-used feature |
| which specifies for an annotation the subject of analysis (Sofa) to which it refers. The |
| Annotation type extends from this and defines 2 features, taking |
| <literal>uima.cas.Integer</literal> values, called <literal>begin</literal> |
| and <literal>end</literal>. The <literal>begin</literal> feature typically |
| identifies the start of a span of text the annotation covers; the |
| <literal>end</literal> feature identifies the end. The values refer to character |
| offsets; the starting index is 0. An annotation of the word <quote>CAS</quote> in a text |
| <quote>CAS Reference</quote> would have a start index of 0, and an end index of 3; the |
| difference between end and start is the length of the span the annotation refers |
| to.</para> |
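| |
| <para>The offset convention can be checked with plain Java string operations, since |
| <literal>String.substring(begin, end)</literal> uses the same zero-based, end-exclusive |
| offsets. The class and method names below are hypothetical helpers for illustration and |
| are not part of the UIMA API.</para> |
| |
| <programlisting><![CDATA[// Illustrative only: begin/end are treated as a half-open character range. |
| final class OffsetSketch { |
| |
|   // Returns the text covered by the span that starts at begin and stops before end. |
|   static String coveredText(String documentText, int begin, int end) { |
|     return documentText.substring(begin, end); |
|   } |
| |
|   public static void main(String[] args) { |
|     String text = "CAS Reference"; |
|     int begin = 0; |
|     int end = 3; |
|     System.out.println(coveredText(text, begin, end)); // the covered word |
|     System.out.println(end - begin);                   // the length of the span |
|   } |
| }]]></programlisting> |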
| |
| <para>Annotations are always with respect to some Sofa (Subject of Analysis; see |
| <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>).</para> |
| <note><para>Artifacts which are not text strings may have a different interpretation of |
| the meaning of begin and end, or may define their own kind of annotation, extending from |
| <literal>AnnotationBase</literal>. </para></note> |
| |
| <para id="ugr.ref.cas.document_annotation">The <literal>DocumentAnnotation</literal> type has one special instance. It is |
| a subtype of the Annotation type, and the built-in definition defines one feature, |
| <literal>language</literal>, which is a string indicating the language of the |
| document in the CAS. The value of this language feature is used by the system to control |
| flow among annotators when the <quote>CapabilityLanguageFlow</quote> mode is used, |
| allowing the flow to skip over annotators that don't process particular |
| languages. Users may extend this type by adding additional features to it, using the XML |
| Descriptor element for defining a type.</para> |
| |
| <note><para> |
| We do <emphasis>not</emphasis> recommend extending the <literal>DocumentAnnotation</literal> |
| type. If you do, you must <emphasis>not</emphasis> use the JCas, for the reasons stated |
| earlier. |
| </para></note> |
| |
| <para>Each CAS view has a different associated instance of the |
| <literal>DocumentAnnotation</literal> type. On the CAS, use |
| <literal>getDocumentAnnotation()</literal> to access the |
| <literal>DocumentAnnotation</literal>.</para> |
| |
| <para>There are also built-in types supporting linked lists, similar to the ones available in |
| Java and other programming languages. Their use is |
| constrained by the usual properties of linked lists: not very space efficient, no (efficient) |
| random access, but an easy choice if you don't know how long your list will be ahead of time. The |
| implementation is type specific; there are separate list types for the Float, Integer and |
| String primitive types, plus one for general feature structures. Here are the type names: |
| <itemizedlist spacing="compact"> |
| <listitem><para>uima.cas.FloatList</para></listitem> |
| <listitem><para>uima.cas.IntegerList</para></listitem> |
| <listitem><para>uima.cas.StringList</para></listitem> |
| <listitem><para>uima.cas.FSList</para> |
| <para></para></listitem> |
| <listitem><para>uima.cas.EmptyFloatList</para></listitem> |
| <listitem><para>uima.cas.EmptyIntegerList</para></listitem> |
| <listitem><para>uima.cas.EmptyStringList</para></listitem> |
| <listitem><para>uima.cas.EmptyFSList</para> |
| <para></para></listitem> |
| <listitem><para>uima.cas.NonEmptyFloatList</para></listitem> |
| <listitem><para>uima.cas.NonEmptyIntegerList</para></listitem> |
| <listitem><para>uima.cas.NonEmptyStringList</para></listitem> |
| <listitem><para>uima.cas.NonEmptyFSList</para></listitem> |
| |
| </itemizedlist></para> |
| |
| <para>For each of the element kinds <literal>Float</literal>, |
| <literal>Integer</literal>, <literal>String</literal> and |
| <literal>FeatureStructure</literal>, there is a base type, for instance, |
| <literal>uima.cas.FloatList</literal>. For each of these, there are two subtypes, |
| corresponding to a non-empty element, and a marker that serves to indicate the end of the |
| list, or an empty list. The non-empty types define two features – |
| <literal>head</literal> and <literal>tail</literal>. The head feature holds the |
| particular value for that part of the list. The tail refers to the next list object |
| (either a non-empty one or the empty version to indicate the end of the list).</para> |
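| |
| <para>The head/tail structure described above can be modelled in plain Java as follows. |
| The classes <literal>IntList</literal>, <literal>EmptyIntList</literal> and |
| <literal>NonEmptyIntList</literal> are hypothetical stand-ins that play the roles of |
| <literal>uima.cas.IntegerList</literal>, <literal>uima.cas.EmptyIntegerList</literal> and |
| <literal>uima.cas.NonEmptyIntegerList</literal>; they are not the UIMA classes themselves.</para> |
| |
| <programlisting><![CDATA[// Illustrative model of the list types; not UIMA code. |
| final class ListSketch { |
| |
|   // Plays the role of the base type. |
|   static abstract class IntList { } |
| |
|   // Plays the role of the empty type: marks the end of a list, or an empty list. |
|   static final class EmptyIntList extends IntList { } |
| |
|   // Plays the role of the non-empty type, with its head and tail features. |
|   static final class NonEmptyIntList extends IntList { |
|     final int head;       // the value held by this list element |
|     final IntList tail;   // the rest of the list |
|     NonEmptyIntList(int head, IntList tail) { |
|       this.head = head; |
|       this.tail = tail; |
|     } |
|   } |
| |
|   // Builds a list back to front, ending in the empty marker. |
|   static IntList of(int... values) { |
|     IntList list = new EmptyIntList(); |
|     for (int i = values.length - 1; i >= 0; i--) { |
|       list = new NonEmptyIntList(values[i], list); |
|     } |
|     return list; |
|   } |
| |
|   // Walks the list by following tail until the empty marker is reached. |
|   static int sum(IntList list) { |
|     int total = 0; |
|     while (list instanceof NonEmptyIntList) { |
|       NonEmptyIntList node = (NonEmptyIntList) list; |
|       total += node.head; |
|       list = node.tail; |
|     } |
|     return total; |
|   } |
| |
|   public static void main(String[] args) { |
|     System.out.println(sum(of(1, 2, 3))); |
|   } |
| }]]></programlisting> |
| |
| <para>Note that the traversal has no way to jump to the n-th element directly, which is the |
| lack of efficient random access mentioned earlier.</para> |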
| |
| <para>There are no other built-in types. Users are free to define their own type systems, |
| building upon these types.</para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.accessing_the_type_system"> |
| <title>Accessing the type system</title> |
| |
| <para> |
| During annotator processing, or outside an annotator, access the type system by calling |
| <literal>CAS.getTypeSystem()</literal>. |
| </para> |
| |
| <para>However, CAS annotators implement an additional method, |
| <literal>typeSystemInit()</literal>, which is called by the UIMA framework before the |
| annotator's process method. This method, implemented by the annotator writer, |
| is passed a reference to the CAS's type system metadata. The method typically uses |
| the type system APIs to obtain type and feature objects corresponding to all the types |
| and features the annotator will be using in its process method. This initialization |
| step should not be done during an annotator's initialize method since the type |
| system can change after the initialize method is called; it should not be done during the |
| process method, since this is presumably work that is identical for each incoming |
| document, and so should be performed only when the type system changes (which will be a |
| rare event). The UIMA framework guarantees it will call the |
| <literal>typeSystemInit()</literal> method of an annotator whenever the type system changes, before calling the |
| annotator's <literal>process()</literal> method.</para> |
| |
| <para>The initialization done by <literal>typeSystemInit()</literal> is done by the |
| UIMA framework when you use the JCas APIs; you only need to provide a |
| <literal>typeSystemInit()</literal> method, as described here, when you are not using |
| the JCas approach.</para> |
| |
| <section id="ugr.ref.cas.type_system.printer_example"> |
| <title>TypeSystemPrinter example</title> |
| |
| <para>Here is a code fragment that, given a CAS Type System, will print a list of all |
| types.</para> |
| |
| |
| <programlisting>// Get all type names from the type system |
| // and print them to stdout. |
| private void listTypes1(TypeSystem ts) { |
| // Get an iterator over types |
| Iterator typeIterator = ts.getTypeIterator(); |
| Type t; |
| System.out.println("Types in the type system:"); |
| while (typeIterator.hasNext()) { |
| // Retrieve a type... |
| t = (Type) typeIterator.next(); |
| // ...and print its name. |
| System.out.println(t.getName()); |
| } |
| System.out.println(); |
| }</programlisting> |
| |
| <para>This method is passed the type system as a parameter. From the type system, we can |
| get an iterator over all known types. If you run this against a CAS created with no additional |
| user-defined types, you should see something like this on the console:</para> |
| |
| <programlisting>Types in the type system: |
| uima.cas.Boolean |
| uima.cas.Byte |
| uima.cas.Short |
| uima.cas.Integer |
| uima.cas.Long |
| uima.cas.ArrayBase |
| ... |
| </programlisting> |
| |
| <para>If the type system had user-defined types these would show up too. Note that some |
| of these types are not directly creatable – they are types used by the framework |
| in the type hierarchy (e.g. uima.cas.ArrayBase).</para> |
| |
| <para>CAS type names include a name-space prefix. The components of a type name are |
| separated by the dot (.). A type name component must start with a Unicode letter, |
| followed by an arbitrary sequence of letters, digits and the underscore (_). By |
| convention, the last component of a type name starts with an uppercase letter, the |
| rest start with a lowercase letter.</para> |
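| |
| <para>The naming rule can be written down as a regular expression. The following is a |
| sketch of the rule exactly as stated in the previous paragraph; whether the framework |
| performs precisely these checks is an assumption, and the class and method names are |
| hypothetical. The sketch also accepts a name with a single component, i.e. it does not |
| insist on a name-space prefix.</para> |
| |
| <programlisting><![CDATA[import java.util.regex.Pattern; |
| |
| // Illustrative check of the stated naming rule; not the framework's validator. |
| final class TypeNameSketch { |
| |
|   // One component: a Unicode letter, then letters, digits or underscores. |
|   private static final String COMPONENT = "\\p{L}[\\p{L}\\p{Nd}_]*"; |
| |
|   // Components separated by dots. |
|   private static final Pattern TYPE_NAME = |
|       Pattern.compile(COMPONENT + "(\\." + COMPONENT + ")*"); |
| |
|   static boolean isWellFormed(String name) { |
|     return TYPE_NAME.matcher(name).matches(); |
|   } |
| |
|   // Convention only: last component capitalized, the others lowercase-initial. |
|   static boolean followsConvention(String name) { |
|     if (!isWellFormed(name)) { |
|       return false; |
|     } |
|     String[] parts = name.split("\\."); |
|     for (int i = 0; i < parts.length - 1; i++) { |
|       if (!Character.isLowerCase(parts[i].charAt(0))) { |
|         return false; |
|       } |
|     } |
|     return Character.isUpperCase(parts[parts.length - 1].charAt(0)); |
|   } |
| |
|   public static void main(String[] args) { |
|     System.out.println(isWellFormed("com.xyz.proj.Entity")); |
|     System.out.println(followsConvention("uima.tcas.Annotation")); |
|   } |
| }]]></programlisting> |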
| |
| <para>Listing the type names is mildly useful, but it would be even better if we could see |
| the inheritance relation between the types. The following code prints the |
| inheritance tree in indented format.</para> |
| |
| |
| <programlisting>private static final int INDENT = 2; |
| private void listTypes2(TypeSystem ts) { |
| // Get the root of the inheritance tree. |
| Type top = ts.getTopType(); |
| // Recursively print the tree. |
| printInheritanceTree(ts, top, 0); |
| } |
| |
| private void printInheritanceTree(TypeSystem ts, Type type, int level) { |
| indent(level); // Print indentation. |
| System.out.println(type.getName()); |
| // Get a vector of the immediate subtypes. |
| Vector subTypes = |
| ts.getDirectlySubsumedTypes(type); |
| ++level; // Increase the indentation level. |
| for (int i = 0; i &lt; subTypes.size(); i++) { |
| // Print the subtypes. |
| printInheritanceTree(ts, (Type) subTypes.get(i), level); |
| } |
| } |
| |
| // A simple, inefficient indenter |
| private void indent(int level) { |
| int spaces = level * INDENT; |
| for (int i = 0; i &lt; spaces; i++) { |
| System.out.print(" "); |
| } |
| }</programlisting> |
| |
| <para>This example shows that you can traverse the type hierarchy by starting at the top |
| with <literal>TypeSystem.getTopType()</literal> and by retrieving subtypes with |
| <literal>TypeSystem.getDirectlySubsumedTypes()</literal>.</para> |
| |
| <para>The type system APIs (see the Javadocs) also let you access the features of a type, as well as |
| the allowed value type of each feature. Here is sample code which prints out all the |
| features of all the types, together with the allowed value types (the feature |
| <quote>range</quote>). Each feature has a <quote>domain</quote>, which is the type |
| where it is defined, as well as a <quote>range</quote>. |
| |
| |
| <programlisting>private void listFeatures2(TypeSystem ts) { |
| Iterator featureIterator = ts.getFeatures(); |
| Feature f; |
| System.out.println("Features in the type system:"); |
| while (featureIterator.hasNext()) { |
| f = (Feature) featureIterator.next(); |
| System.out.println( |
| f.getShortName() + ": " + |
| f.getDomain() + " -> " + f.getRange()); |
| } |
| System.out.println(); |
| }</programlisting></para> |
| |
| <para>We can ask a feature object for its domain (the type it is defined on) and its range |
| (the type of the value of the feature). The terminology derives from the fact that |
| features can be viewed as functions on subspaces of the object space.</para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.cas_apis_create_modify_feature_structures"> |
| <title>Using the CAS APIs to create and modify feature structures</title> |
| <titleabbrev>Using CAS APIs: Feature Structures</titleabbrev> |
| |
| <para>Assume a type system declaration that defines two types: Entity and Person. |
| Entity has no features defined within it but inherits from uima.tcas.Annotation |
| – so it has the begin and end features. Person is, in turn, a subtype of Entity, |
| and adds firstName and lastName features. CAS type systems are declaratively |
| specified using XML; the format of this XML is described in <olink |
| targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.xml.component_descriptor.type_system"/>. |
| |
| |
| <programlisting><![CDATA[<!-- Type System Definition --> |
| <typeSystemDescription> |
| <types> |
| <typeDescription> |
| <name>com.xyz.proj.Entity</name> |
| <description /> |
| <supertypeName>uima.tcas.Annotation</supertypeName> |
| </typeDescription> |
| <typeDescription> |
| <name>com.xyz.proj.Person</name> |
| <description /> |
| <supertypeName>com.xyz.proj.Entity</supertypeName> |
| <features> |
| <featureDescription> |
| <name>firstName</name> |
| <description /> |
| <rangeTypeName>uima.cas.String</rangeTypeName> |
| </featureDescription> |
| <featureDescription> |
| <name>lastName</name> |
| <description /> |
| <rangeTypeName>uima.cas.String</rangeTypeName> |
| </featureDescription> |
| </features> |
| </typeDescription> |
| </types> |
| </typeSystemDescription>]]></programlisting></para> |
| |
| <para> |
| To be able to access types and features, we need to know their names. The CAS interface defines |
| constants that hold the names of built-in types and features, such as |
| <literal>CAS.TYPE_NAME_INTEGER</literal>. It is good programming practice to create such |
| constants for the types and features you define, for your own use as well as for others who will |
| be using your annotators. |
| </para> |
| |
| |
| <programlisting>/** Entity type name constant. */ |
| public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity"; |
| |
| /** Person type name constant. */ |
| public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person"; |
| |
| /** First name feature name constant. */ |
| public static final String FIRST_NAME_FEAT_NAME = "firstName"; |
| |
| /** Last name feature name constant. */ |
| public static final String LAST_NAME_FEAT_NAME = "lastName";</programlisting> |
| |
| <para>Next we define type and feature member variables; these will hold the values of the |
| type and feature objects needed by the CAS APIs, to be assigned during |
| <literal>typeSystemInit()</literal>.</para> |
| |
| |
| <programlisting>// Type system object variables |
| private TypeSystem typeSystem; |
| private Type entityType; |
| private Type personType; |
| private Feature firstNameFeature; |
| private Feature lastNameFeature; |
| private Type stringType;</programlisting> |
| |
| <para>The type system does not throw an exception if we ask for something that is |
| not known; it simply returns null. The code therefore checks for this and throws a proper |
| exception. We require all these types and features to be defined for the annotator to |
| work. One might imagine situations where certain computations are predicated on some type |
| or feature being defined in the type system, but that is not the case here.</para> |
| |
| |
| <programlisting>// Get a type object corresponding to a name. |
| // If it doesn't exist, throw an exception. |
| private Type initType(String typeName) |
| throws AnnotatorInitializationException { |
| Type type = this.typeSystem.getType(typeName); |
| if (type == null) { |
| throw new AnnotatorInitializationException( |
| AnnotatorInitializationException.TYPE_NOT_FOUND, |
| new Object[] { this.getClass().getName(), typeName }); |
| } |
| return type; |
| } |
| |
| // We add similar code for retrieving feature objects. |
| // Get a feature object from a name and a type object. |
| // If it doesn't exist, throw an exception. |
| private Feature initFeature(String featName, Type type) |
| throws AnnotatorInitializationException { |
| Feature feat = type.getFeatureByBaseName(featName); |
| if (feat == null) { |
| throw new AnnotatorInitializationException( |
| AnnotatorInitializationException.FEATURE_NOT_FOUND, |
| new Object[] { this.getClass().getName(), featName }); |
| } |
| return feat; |
| }</programlisting> |
| |
| <para>Using these two functions, code for initializing the type system described |
| above would be: |
| |
| |
| <programlisting>public void typeSystemInit(TypeSystem aTypeSystem) |
| throws AnalysisEngineProcessException { |
| this.typeSystem = aTypeSystem; |
| // Set type system member variables. |
| this.entityType = initType(ENTITY_TYPE_NAME); |
| this.personType = initType(PERSON_TYPE_NAME); |
| this.firstNameFeature = |
| initFeature(FIRST_NAME_FEAT_NAME, personType); |
| this.lastNameFeature = |
| initFeature(LAST_NAME_FEAT_NAME, personType); |
| this.stringType = initType(CAS.TYPE_NAME_STRING); |
| }</programlisting></para> |
| |
| <para>Note that we initialize the string type by using a type name constant from the |
| CAS.</para> |
| |
| </section> |
| </section> |
| |
| <section id="ugr.ref.cas.creating_feature_structures"> |
| <title>Creating feature structures</title> |
| |
| <para>To create feature structures in JCas, we use the Java <quote>new</quote> |
| operator. In the CAS, we use one of several different API methods on the CAS object, |
| depending on which of the 10 basic kinds of feature structures we are creating (a plain |
| feature structure, or an instance of the built-in primitive type arrays or FSArray). |
| There is also a method to create an instance of a |
| <literal>uima.tcas.Annotation</literal>, setting the begin and end |
| values.</para> |
| |
| <para>Once a feature structure is created, it needs to be added to the CAS indexes (unless |
| it will be accessed via some reference from another accessible feature structure). |
| Assuming <literal>aCAS</literal> holds a reference to a CAS, and <literal>token</literal> holds a |
| reference to a newly created feature structure, the following code adds that |
| feature structure to all the relevant CAS indexes:</para> |
| |
| |
| <programlisting> // Add the token to the index repository. |
| aCAS.addFsToIndexes(token);</programlisting> |
| |
| <para>There is also a corresponding <literal>removeFsFromIndexes(token)</literal> |
| method on CAS objects.</para> |
| |
| <para>Some of the indexes (the Sorted and Set types) use comparators defined |
| on the values of particular features of an indexed type. To change the value of a feature |
| that is used in an index key, the correct procedure is to |
| <orderedlist spacing="compact"> |
| <listitem><para>remove the item from all indexes where it is indexed, in all views |
| where it is indexed,</para> |
| </listitem> |
| <listitem><para>update the value of the features being used as keys,</para></listitem> |
| <listitem><para>add the item back to the indexes, in all views.</para></listitem> |
| </orderedlist></para> |
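| |
| <para>The three steps above can be sketched with a plain-Java model. This is only an |
| illustration and does not use the UIMA API: the <literal>Item</literal> class and the |
| <literal>TreeSet</literal> standing in for a sorted index are assumptions made for this example. |
| The point it shows is that the key is changed only while the item is out of the index.</para> |
| |
| <programlisting><![CDATA[ |
```java
import java.util.TreeSet;

/** Plain-Java model of a sorted index; this is NOT the UIMA API. */
class ReindexSketch {

    /** Hypothetical stand-in for a feature structure with one key feature. */
    static final class Item {
        final String name;
        int key;
        Item(String name, int key) { this.name = name; this.key = key; }
    }

    /** Stand-in for a sorted index whose comparator reads the key feature. */
    static TreeSet<Item> newSortedIndex() {
        return new TreeSet<Item>((a, b) -> {
            if (a.key != b.key) {
                return Integer.compare(a.key, b.key);
            }
            return a.name.compareTo(b.name);
        });
    }

    /** Remove, update, re-add: the index never holds an item with a stale key. */
    static void updateKey(TreeSet<Item> index, Item item, int newKey) {
        index.remove(item);   // 1. remove while the old key is still in place
        item.key = newKey;    // 2. change the key feature
        index.add(item);      // 3. add back so it is filed under the new key
    }

    public static void main(String[] args) {
        TreeSet<Item> index = newSortedIndex();
        Item a = new Item("a", 10);
        Item b = new Item("b", 20);
        index.add(a);
        index.add(b);

        updateKey(index, a, 30);

        System.out.println(index.first().name);  // prints "b"
        System.out.println(index.contains(a));   // prints "true"
    }
}
```
| ]]></programlisting> |
| |
| <para>Changing <literal>key</literal> while the item is still in the set would leave it filed |
| under its old position, which is the situation the procedure above is meant to avoid.</para> |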
| </section> |
| |
| <section id="ugr.ref.cas.accessing_modifying_features_of_feature_structures"> |
| <title>Accessing or modifying features of feature structures</title> |
| <titleabbrev>Accessing or modifying Features</titleabbrev> |
| |
| <para>Values of individual features for a feature structure can be set or referenced, |
| using a set of methods that depend on the type of value that feature is declared to have. |
| There are methods on FeatureStructure for this: getBooleanValue, getByteValue, |
| getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue, |
| getStringValue, and getFeatureValue (which means to get a value which in turn is a |
| reference to a feature structure). There are corresponding <quote>setter</quote> |
| methods, as well. These methods on the feature structure object take as arguments the |
| feature object retrieved earlier in the typeSystemInit method.</para> |
| |
| <para>Using the previous example, with the type system initialized with type personType |
| and feature lastNameFeature, here's a sample code fragment that gets and sets |
| that feature:</para> |
| |
| |
| <programlisting>// Assume aPerson is a variable holding an object of type Person |
| // get the lastNameFeature value from the feature structure |
| String lastName = aPerson.getStringValue(lastNameFeature); |
| // set the lastNameFeature value |
| aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</programlisting> |
| |
| <para>The getters and setters for each of the primitive types are defined in the Javadocs |
| as methods of the FeatureStructure interface.</para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.indexes_and_iterators"> |
| <title>Indexes and Iterators</title> |
| |
| <para>Each CAS can have many indexes associated with it; each CAS View contains |
| a complete set of instantiations of the indexes. Each index is represented by an |
| instance of the type org.apache.uima.cas.FSIndex. You use the object |
| org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to |
| retrieve instances of indexes. There are methods that let you select the index |
| by name, by type, or by both name and type. Since each index is already associated with a type, |
| passing both a name and a type is valid only if the type passed in is the same |
| type or a subtype of the one declared in the index specification for the named index. If you |
| pass in a subtype, the returned FSIndex object refers to an index that will return only |
| items belonging to that subtype (or subtypes of that subtype).</para> |
| |
| <para>The returned FSIndex objects are used, in turn, to create iterators. |
| There is also a method on the Index Repository, <literal>getAllIndexedFS</literal>, |
| which will return an iterator over all indexed Feature Structures (for that CAS View), |
| in no particular order. The iterators |
| created can be used like common Java iterators, to sequentially retrieve items |
| indexed. If the index represents a sorted index, the items are returned in a sorted |
| order, where the sort order is specified in the XML index definition. This XML is part of |
| the Component Descriptor, see <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.xml.component_descriptor.aes.index"/>.</para> |
| |
| <para>Feature structures should not be added to or removed from indexes while iterating |
| over them; a ConcurrentModificationException is thrown when this is detected. |
| Certain operations are allowed with the iterators after modification, which can |
| <quote>reset</quote> this condition, such as moving to beginning, end, or moving to a |
| particular feature structure. So, if you have to modify the index, you can move the iterator back to |
| the last FS you had retrieved from the iterator, and then continue, if that makes sense in |
| your application.</para> |
| |
| <section id="ugr.ref.cas.index.built_in_indexes"> |
| <title>Built-in Indexes</title> |
| |
| <para>An unnamed built-in bag index exists which holds all feature structures which are indexed. |
| The only access to this index is the method getAllIndexedFS(Type) which returns an iterator |
| over all indexed Feature Structures.</para> |
| |
| <para>The CAS also contains a built-in index for the type <literal>uima.tcas.Annotation</literal>, which sorts |
| annotations in the order in which they appear in the document. Annotations are sorted first by increasing |
| <literal>begin</literal> position. Ties are then broken by <emphasis>decreasing</emphasis> |
| <literal>end</literal> position (so that longer annotations come first). Annotations that match in both |
| their <literal>begin</literal> and <literal>end</literal> features are sorted using the Type Priority |
| (see <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.xml.component_descriptor.aes.type_priority"/>).</para> |
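| |
| <para>The ordering just described can be modelled with an ordinary Java comparator. The sketch |
| below does not use the UIMA API; the <literal>Span</literal> class and the list of type names |
| used as a priority list are assumptions made for the example, and it assumes that a type listed |
| earlier in the priority list sorts first.</para> |
| |
| <programlisting><![CDATA[ |
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of the annotation sort order; this is NOT the UIMA API. */
class AnnotationOrderSketch {

    /** Hypothetical stand-in for an annotation. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
        @Override public String toString() {
            return type + "[" + begin + "," + end + "]";
        }
    }

    /**
     * typePriority lists type names, highest priority first (an assumption).
     * Types missing from the list get index -1 and therefore sort first.
     */
    static Comparator<Span> order(List<String> typePriority) {
        return (a, b) -> {
            if (a.begin != b.begin) {
                return Integer.compare(a.begin, b.begin);   // increasing begin
            }
            if (a.end != b.end) {
                return Integer.compare(b.end, a.end);       // decreasing end
            }
            return Integer.compare(typePriority.indexOf(a.type),
                                   typePriority.indexOf(b.type));
        };
    }

    public static void main(String[] args) {
        List<String> priority = Arrays.asList("Paragraph", "Sentence", "Token");
        List<Span> spans = new ArrayList<Span>(Arrays.asList(
                new Span("Token", 0, 5),
                new Span("Sentence", 0, 20),
                new Span("Token", 6, 10),
                new Span("Paragraph", 0, 20)));
        spans.sort(order(priority));
        // prints [Paragraph[0,20], Sentence[0,20], Token[0,5], Token[6,10]]
        System.out.println(spans);
    }
}
```
| ]]></programlisting> |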
| </section> |
| |
| |
| <section id="ugr.ref.cas.index.adding_to_indexes"> |
| <title>Adding Feature Structures to the Indexes</title> |
| |
| <para>Feature Structures are added to the indexes by calling the |
| <literal>FSIndexRepository.addFS(FeatureStructure)</literal> method or the equivalent convenience |
| method <literal>CAS.addFsToIndexes(FeatureStructure)</literal>. This adds the Feature Structure to |
| <emphasis>all</emphasis> indexes that are defined for the type of that FeatureStructure (or any of its |
| supertypes). Note that you should not add a Feature Structure to the indexes until you have set values for all |
| of the features that may be used as sort keys in an index.</para> |
| </section> |
| |
| <section id="ugr.ref.cas.index.iterators"> |
| <title>Iterators</title> |
| |
| <para>Iterators are objects that implement the interface <literal>org.apache.uima.cas.FSIterator</literal>. This interface |
| extends <literal>java.util.Iterator</literal> and provides the normal Java iterator methods, plus |
| additional ones that allow moving both forwards and backwards.</para> |
| </section> |
| |
| <section id="ugr.ref.cas.index.annotation_index"> |
| <title>Special iterators for Annotation types</title> |
| |
| <para>The built-in index over the <literal>uima.tcas.Annotation</literal> type |
| named <quote><literal>AnnotationIndex</literal></quote> has additional |
| capabilities. To use them, you first get a reference to this built-in index using |
| either the <literal>getAnnotationIndex</literal> method on a CAS View object, or |
| by asking the <literal>FSIndexRepository</literal> object for an index having the |
| particular name <quote>AnnotationIndex</quote>, for example: |
| |
| <programlisting>AnnotationIndex idx = aCAS.getAnnotationIndex(); |
| // or you can iterate over a specific subtype of Annotation: |
| AnnotationIndex idx = aCAS.getAnnotationIndex(aType); </programlisting></para> |
| |
| <para>This object can be used to produce several additional kinds of iterators. It can |
| produce unambiguous iterators; these skip over annotations until they find one whose |
| start position is equal to or greater than the end position of |
| the previously returned annotation.</para> |
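| |
| <para>The skipping rule can be illustrated with a small plain-Java model. It does not use the |
| UIMA API; spans are represented as hypothetical <literal>{begin, end}</literal> arrays that are |
| assumed to be supplied already in annotation index order.</para> |
| |
| <programlisting><![CDATA[ |
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of unambiguous iteration; this is NOT the UIMA API. */
class UnambiguousSketch {

    /**
     * Keeps a span only if it starts at or after the end of the span
     * that was kept before it. Each span is a {begin, end} pair and the
     * input is assumed to be in annotation index order.
     */
    static List<int[]> unambiguous(List<int[]> sorted) {
        List<int[]> kept = new ArrayList<int[]>();
        int lastEnd = Integer.MIN_VALUE;
        for (int[] span : sorted) {
            if (span[0] >= lastEnd) {
                kept.add(span);
                lastEnd = span[1];
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[] {0, 10},
                new int[] {2, 5},    // skipped: starts inside 0-10
                new int[] {5, 8},    // skipped: starts inside 0-10
                new int[] {10, 12},  // kept: starts where 0-10 ends
                new int[] {11, 15}); // skipped: starts inside 10-12
        for (int[] span : unambiguous(sorted)) {
            System.out.println(span[0] + "-" + span[1]);
        }
        // prints "0-10" and then "10-12"
    }
}
```
| ]]></programlisting> |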
| |
| <para>It can also produce several kinds of subiterators; these are iterators whose |
| annotations fall within the span of another annotation. This kind of iterator can |
| also have the unambiguous property, if desired. It also can be |
| <quote>strict</quote> or not; strict means that the returned annotation lies |
| completely within the span of the controlling annotation. Non-strict only implies |
| that the beginning of the returned annotation falls within the span of the |
| controlling annotation.</para> |
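| |
| <para>The difference between strict and non-strict can be expressed as two predicates. The |
| sketch below is a plain-Java model, not the UIMA API. Spans are hypothetical |
| <literal>{begin, end}</literal> arrays, and the handling of a span that begins exactly at the |
| end of the controlling annotation is an assumption of the sketch; the real subiterators also |
| take index position and type priority into account, which this model ignores.</para> |
| |
| <programlisting><![CDATA[ |
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of strict and non-strict subiterators; NOT the UIMA API. */
class SubiteratorSketch {

    /** Strict: the span lies completely within the controlling span. */
    static boolean strict(int[] controlling, int[] span) {
        return span[0] >= controlling[0] && span[1] <= controlling[1];
    }

    /** Non-strict: only the begin of the span must fall within the controlling span. */
    static boolean nonStrict(int[] controlling, int[] span) {
        // Treating the controlling end as exclusive is an assumption of this sketch.
        return span[0] >= controlling[0] && span[0] < controlling[1];
    }

    /** Returns the candidates that a strict or non-strict model would let through. */
    static List<int[]> select(int[] controlling, List<int[]> candidates, boolean isStrict) {
        List<int[]> result = new ArrayList<int[]>();
        for (int[] span : candidates) {
            boolean accepted = isStrict ? strict(controlling, span)
                                        : nonStrict(controlling, span);
            if (accepted) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] controlling = {10, 20};
        List<int[]> candidates = Arrays.asList(
                new int[] {5, 12},    // begins before the controlling span
                new int[] {12, 15},   // completely inside
                new int[] {18, 25},   // begins inside, ends outside
                new int[] {22, 30});  // completely outside
        System.out.println(select(controlling, candidates, true).size());   // prints 1
        System.out.println(select(controlling, candidates, false).size());  // prints 2
    }
}
```
| ]]></programlisting> |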
| |
| <para>There is also a method which produces an <literal>AnnotationTree</literal> |
| object, which contains nodes representing the results of doing a strict, |
| unambiguous subiterator over the span of some controlling annotation. For more |
| details, please refer to the Javadocs for the |
| <literal>org.apache.uima.cas.text</literal> package.</para> |
| |
| </section> |
| |
| <section id="ugr.ref.cas.index.constraints_and_filtered_iterators"> |
| <title>Constraints and Filtered iterators</title> |
| |
| <para>There is a set of API calls that build constraint objects. These objects can be |
| used directly to test if a particular feature structure matches (satisfies) the |
| constraint, or they can be passed to the createFilteredIterator method to create an |
| iterator that skips over instances which fail to satisfy the constraint.</para> |
| |
| <para>It is possible to specify a feature value located by following a chain of |
| references starting from the feature structure being tested. Here's a |
| scenario to explore this concept. Let's suppose you have the following type |
| system (namespaces are omitted for clarity): |
| |
| <blockquote> |
| <para><emphasis role="bold">Token</emphasis>, having a feature PartOfSpeech |
| which holds a reference to another type (POS)</para> |
| |
| <para><emphasis role="bold">POS</emphasis> (a type with many subtypes, each |
| representing a different part of speech)</para> |
| |
| <para><emphasis role="bold">Noun</emphasis> (a subtype of POS)</para> |
| |
| <para><emphasis role="bold">ProperName</emphasis> (a subtype of Noun), |
| having a feature Class which holds an integer value encoding some information |
| about the proper noun.</para></blockquote></para> |
| |
| <para>If you want to filter Token instances, such that only those tokens get through |
| which are proper names of class 3 (for example), you would need a test that started with |
| a Token instance, followed its PartOfSpeech reference to another instance (the |
| ProperName instance) and then tested the Class feature of that instance for a value |
| equal to 3.</para> |
| |
| <para>To support this, the filtering approach has components that specify tests, and |
| components that specify <quote>paths</quote>. The tests that can be done include |
| testing references to type instances to see if they are instances of some type or its |
| subtypes; this is done with a FSTypeConstraint constraint. Other tests check for |
| equality or, for numeric values, ranges.</para> |
| |
| <para>Each test may be combined with a path that leads to the value to test. Tests that |
| start from a feature structure instance can be combined with <quote>and</quote> and <quote>or</quote> connectors. |
| The Javadocs for these are in the package org.apache.uima.cas in the classes that end |
| in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS. |
| Here's an example; assume the variable cas holds a reference to a CAS instance. |
| |
| |
| <programlisting>// Start by getting the constraint factory from the CAS. |
| ConstraintFactory cf = cas.getConstraintFactory(); |
| |
| // To specify a path to an item to test, you start by |
| // creating an empty path. |
| FeaturePath path = cas.createFeaturePath(); |
| |
| // Add POS feature to path, creating one-element path. |
| path.addFeature(posFeat); |
| |
| // You can extend the chain arbitrarily by adding additional |
| // features. |
| |
| // Create a new type constraint. |
| |
| // Type constraints will check that structures |
| // they match against have a type at least as specific |
| // as the type specified in the constraint. |
| FSTypeConstraint nounConstraint = cf.createTypeConstraint(); |
| |
| // Set the type (by default it is TOP). |
| // This succeeds if the type being tested by this constraint |
| // is nounType or a subtype of nounType. |
| nounConstraint.add(nounType); |
| |
| // Embed the noun constraint under the pos path. |
| // This means, associate the test with the path, so it tests the |
| // proper value. |
| |
| // The result is a test which will |
| // match a feature structure that has a posFeat defined |
| // which has a value which is an instance of a nounType or |
| // one of its subtypes. |
| FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint); |
| |
| // Create a type constraint for token (or a subtype of it) |
| FSTypeConstraint tokenConstraint = cf.createTypeConstraint(); |
| |
| // Set the type. |
| tokenConstraint.add(tokenType); |
| |
| // Create the final constraint by conjoining the two constraints. |
| FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint); |
| |
| // Create a filtered iterator from some annotation iterator. |
| FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</programlisting> |
| </para></section></section> |
| |
| <section id="ugr.ref.cas.guide_to_javadocs"> |
| <title>The CAS APIs – a guide to the Javadocs</title> |
| <titleabbrev>CAS APIs Javadocs</titleabbrev> |
| |
| <para>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most |
| of the APIs described here are in the cas package. The cas.impl package contains classes |
| used in serializing and deserializing (reading and writing to external strings) the |
| XCAS form of the CAS (XCAS is an XML serialization of the CAS). The XCAS form is used for |
| transporting the CAS among local and remote annotators, or for storing the CAS in |
| permanent storage. The cas.text package contains the APIs that extend the CAS to support |
| artifact (including <quote>text</quote>) analysis.</para> |
| |
| <section id="ugr.ref.cas.javadocs.cas_package"> |
| <title>APIs in the CAS package</title> |
| |
| <para>The main objects implementing the APIs discussed here are shown in the diagram |
| below. The hierarchy represents that there is a way to get from an upper object to an |
| instance of the lower object, usually by using a method on the upper object; this is not |
| an inheritance hierarchy. |
| <figure id="ugr.ref.cas.fig.api_hierarchy"> |
| <title>CAS Object hierarchy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata width="5.8in" format="JPG" |
| fileref="&imgroot;image001.png"/> |
| </imageobject> |
| <textobject><phrase>CAS object hierarchy</phrase></textobject> |
| </mediaobject> |
| </figure> </para> |
| |
| <para>The main interface is the CAS interface. This has most of the functionality of the |
| CAS, except for the type system metadata access, and the indexing access. JCas and CAS |
| are alternative representations and API approaches to the CAS; each has a method to |
| get the other. You can mix JCas and CAS APIs in your application as needed. To use the |
| JCas APIs, you have to create the Java classes that correspond to the CAS types, and |
| include them in the Java class path of the application. If you have a CAS object, you can |
| get a JCas object by using the getJCas() method call on the CAS object; likewise, you |
| can get the CAS object from a JCas by using the getCAS() method call on the JCas object. |
| There is also a low level CAS interface that is not part of the official API, and is |
| intended for internal use only – it is not documented here.</para> |
| |
| <para>The type system metadata APIs are found in the TypeSystem interface. The objects |
| defining each type and feature are defined by the interfaces Type and Feature. The |
| Type interface has methods to see what types subsume other types, to iterate over the |
| types available, and to extract information about the types, including what |
| features it has. The Feature interface has methods that get what type it belongs to, |
| its name, and its range (the kind of values it can hold).</para> |
| |
| <para>The FSIndexRepository gives you access to methods to get instances of indexes, and |
| also provides access to the iterator over all indexed feature structures: |
| <literal>getAllIndexedFS(aType)</literal>. |
| The FSIndex and AnnotationIndex objects give you methods to create instances of |
| iterators.</para> |
| |
| <para>Iterators and the CAS methods that create new feature structures return |
| FeatureStructure objects. These objects can be used to set and get the values of |
| defined features within them.</para> |
| </section> |
| </section> |
| </chapter> |