 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 <!ENTITY imgroot "images/references/ref.cas/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <chapter id="ugr.ref.cas">
   <title>CAS Reference</title>

   <para>The CAS (Common Analysis System) is the part of the Unstructured Information
     Management Architecture (UIMA) that is concerned with creating and handling the data
     that annotators manipulate.</para>

   <para>Java users typically use the JCas (Java interface to the CAS) when manipulating
     objects in the CAS. This chapter describes an alternative interface to the CAS which
     allows discovery and specification of types and features at run time. It is recommended
     when the calling code cannot know ahead of time which type system it will be dealing
     with.</para>

   <para>Use of the CAS as described here is also recommended (or necessary) when components add
   to the definitions of types of other components.  This UIMA feature allows users to add features
   to a type that was already defined elsewhere.  When this feature is used in conjunction with the
   JCas, it can lead to problems with class loading.  This is because different JCas representations
   of a single type are generated by the different components, and only one of them is loaded
   (unless you are using Pear descriptors).  Note:
   we do not recommend that you add features to pre-existing types.  A type should be defined in one
   place only, and then there is no problem with using the JCas.  However, if you do use this feature,
   do not use the JCas.  Similarly, if you distribute your components for inclusion in somebody else's
   UIMA application, and you're not sure that they won't add features to your types, do not use the
   JCas for the same reasons.
   </para>

   <para>CASes passed to Annotator Components are either a base CAS or a regular CAS. Base CASes
     are only passed to Multi-View components - they are like regular CASes, but do not have user
     accessible indexes or Sofas. They are used by the component only for switching to other CAS
     views, which are regular CASes.</para>

   <section id="ugr.ref.cas.javadocs">
     <title>Javadocs</title>

     <para>The subdirectory <literal>docs/api</literal> contains the documentation
       details of all the classes, methods, and constants for the APIs discussed here. Please
       refer to this for details on the methods, classes and constants, specifically in the
       packages <literal>org.apache.uima.cas.*</literal>.</para>
   </section>

   <section id="ugr.ref.cas.overview">
     <title>CAS Overview</title>

     <para>There are three<footnote><para>A fourth part, the Subject of Analysis,
       is discussed in <olink targetdoc="&uima_docs_tutorial_guides;"
         targetptr="ugr.tug.aas"/>.</para></footnote> main parts to the CAS: the type system, data creation and
       manipulation, and indexing.  We will start with a brief
       description of these components.</para>
     <section id="ugr.ref.cas.type_system">
       <title>The Type System</title>

       <para>The type system specifies what kind of data you will be able to manipulate in your
         annotators. The type system defines two kinds of entities, types and features. Types
         are arranged in a single inheritance tree and define the kinds of entities (objects)
         you can manipulate in the CAS. Features optionally specify slots or fields within a
         type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS
         Features to fields within the type. A critical difference is that CAS types have no
         methods; they are just data structures with named slots (features). These features can
         have as values primitive things like integers, floating point numbers, and strings,
         and they also can hold references to other instances of objects in the CAS. We call
         instances of the data structures declared by the type system <quote>feature
         structures</quote> (not to be confused with <quote>features</quote>). Feature
         structures are similar to the many variants of record structures found in computer
         science.<footnote><para> The name <quote>feature structure</quote> comes from
         terminology used in linguistics.</para></footnote></para>

       <para>Each CAS Type defines a supertype; it is a subtype of that supertype. This means
         that any features that the supertype defines are features of the subtype; in other
         words, it inherits its supertype&apos;s features. Only single inheritance is
         supported; a type&apos;s feature set is the union of all of the features in its
         supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top,
         root node of the inheritance tree. It defines no features.</para>

       <para>The values that can be stored in features are either built-in primitive values or
         references to other feature structures. The primitive values are
         <literal>boolean</literal>, <literal>byte</literal>,
         <literal>short</literal> (16 bit integers), <literal>integer</literal> (32
         bit), <literal>long</literal> (64 bit), <literal>float</literal> (32 bit),
         <literal>double</literal> (64 bit floats) and strings; the official names of these
         are <literal>uima.cas.Boolean</literal>, <literal>uima.cas.Byte</literal>,
         <literal>uima.cas.Short</literal>, <literal>uima.cas.Integer</literal>,
         <literal>uima.cas.Long</literal>, <literal>uima.cas.Float</literal>,
         <literal>uima.cas.Double</literal> and <literal>uima.cas.String</literal>.
         The strings are Java strings, and characters are Java characters.  Technically, this means
         that characters are UTF-16 code units, which is not quite the same as Unicode characters
         (code points): a character outside the Basic Multilingual Plane occupies two code units.
         This distinction should make no difference for almost all applications.
         The CAS also defines other basic built-in types for arrays of these, plus arrays of
         references to other objects, called <literal>uima.cas.IntegerArray</literal>,
         <literal>uima.cas.FloatArray</literal>,
         <literal>uima.cas.StringArray</literal>,
         <literal>uima.cas.FSArray</literal>, etc.</para>
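
       <para>The difference between code units and code points can be seen with ordinary Java
         strings. The following is a minimal sketch that uses only the JDK; the class name
         <literal>CodeUnitDemo</literal> and its helper methods are invented for this
         illustration and are not part of UIMA.</para>

       <programlisting><![CDATA[// Plain Java illustration; no UIMA classes are involved.
class CodeUnitDemo {
  // Number of UTF-16 code units, which is what String.length() reports.
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points in the string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
    // Plane, so Java stores it as a surrogate pair of two char values.
    String clef = new String(Character.toChars(0x1D11E));
    System.out.println(codeUnits(clef));   // prints 2
    System.out.println(codePoints(clef));  // prints 1
  }
}]]></programlisting>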

       <para>The CAS also defines a built-in type called
         <literal>uima.tcas.Annotation</literal> which inherits from
         <literal>uima.cas.AnnotationBase</literal> which in turn inherits from
         <literal>uima.cas.TOP</literal>. There are two features defined by this type,
         called <literal>begin</literal> and <literal>end</literal>, both of which are
         integer valued.</para>

     </section>

     <section id="ugr.ref.cas.creating_accessing_manipulating_data">
       <title>Creating, accessing and manipulating data</title>
       <titleabbrev>Creating/Accessing/Changing data</titleabbrev>

       <para>
         Creating and accessing data in the CAS requires knowledge about the types and features
         defined in the type system.  The idea is similar to other data access APIs, such as the XML
         DOM or SAX APIs, or database access APIs such as JDBC.  Contrary to those APIs, however, the
         CAS does not use the names of type system entities directly in the APIs.  Rather, you use
         the type system to access type and feature entities by name, then use these entities in the
         data manipulation APIs.  This can be compared to the Java reflection APIs: the type system
         is comparable to the Java class loader, and the type and feature objects to the
         <literal>java.lang.Class</literal> and <literal>java.lang.reflect.Field</literal> classes.
       </para>

       <para>
         Why does it have to be this complicated?  You wouldn&apos;t normally use reflection to create a
         Java object, either.  As mentioned earlier, the JCas provides the more straightforward
         method to manipulate CAS data.  The CAS access methods described here need only be used for
         generic types of applications that need to be able to handle any kind of data (e.g., generic
         tooling) or when the JCas may not be used for other reasons.  The generic kinds of applications
         are exactly the ones where you would use the reflection API in Java as well.
       </para>

     </section>

     <section id="ugr.ref.cas.creating_using_indexes">
       <title>Creating and using indexes</title>

       <para>Each view of a CAS provides a set of indexes for that view. Instances of feature
         structures can be added to a view&apos;s indexes. These indexes provide
         the only way for other annotators to locate existing data in the CAS: to use data
         that another annotator has created, an annotator must retrieve it through an index (or the
         method <literal>getAllIndexedFS</literal> of the object <literal>FSIndexRepository</literal>).
         If you want the data you
         create to be visible to other annotators, you must explicitly call methods which
         add it to the indexes; that is, you must index it.</para>

       <para>Indexes are named and are associated with a CAS Type; they are used to index
         instances of that CAS type (including instances of that type&apos;s subtypes). If
         you are using multiple views (see <olink
           targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>),
         each view contains a separate instantiation of all of the indexes.
         To access an index, you
         minimally need to know its name. A CAS view provides an index repository which you can
         query for indexes for that view. Once you have a handle to an index, you can get
         information about the feature structures in the index, the size of the index, as well
         as an iterator over the feature structures.</para>

       <para>Indexes are defined in the XML descriptor metadata for the application. Each CAS
         View has its own, separate instantiation of indexes based on these definitions,
         kept in the view's index repository. When you obtain an index, it is always from a
         particular CAS view. When you index an item, it is always added to all indexes where it
         belongs, within just one repository. You can specify different repositories
         (associated with different CAS views) to use; a given Feature Structure instance
         may be indexed in more
         than one CAS View.</para>

       <para>Iterators allow you to enumerate the feature structures in an index.  FS iterators
         provide two kinds of APIs: the regular Java iterator API, and a specific FS iterator API
         where the usual Java iterator APIs (<literal>hasNext()</literal> and <literal>next()</literal>)
         are replaced by <literal>isValid()</literal>, <literal>moveToNext()</literal> (which does
         not return an element) and <literal>get()</literal>.  Which API style you use is up to you,
         but we do not recommend mixing the styles as the results are sometimes unexpected.  If you
         just want to iterate over an index from start to finish, either style is equally appropriate.
         If you also use <literal>moveTo(FeatureStructure fs)</literal> and
         <literal>moveToPrevious()</literal>, it is better to use the special FS iterator style.
       </para>
       <note><para>The reason not to mix these styles is that you might expect
         <literal>next()</literal> followed by <literal>moveToPrevious()</literal> to always work.
         It does not, because <literal>next()</literal> returns the "current" element and then
         advances to the next position, which might be beyond the last element.  At that point,
         the iterator becomes "invalid", and by the iterator contract,
         <literal>moveToNext()</literal> and <literal>moveToPrevious()</literal> are not allowed
         on "invalid" iterators; when an iterator is not valid, all bets are off.  You can,
         however, reset an invalid iterator by calling <literal>moveToFirst()</literal>,
         <literal>moveToLast()</literal>, or <literal>moveTo(FS)</literal>.</para></note>

       <para>Indexes are created by specifying them in the annotator&apos;s or
         aggregate&apos;s resource descriptor. An index specification includes its name,
         the CAS type being indexed, the kind of index it is, and an (optional) ordering
         relation on the feature structures to be indexed. At startup time, all index
         specifications are combined; duplicate definitions (having the same name) are
         allowed only if their definitions are the same. </para>

       <para>Feature structure instances need to be explicitly added to the index repository by a
         method call. Feature structures that are not indexed will not be visible to other
         annotators (unless they are reachable through a feature of another feature structure
         that is indexed, or through a chain of such references).</para>

       <para>The framework defines an unnamed bag index which indexes all types.  The
       only access provided for this index is the getAllIndexedFS(type) method on the
         index repository, which returns an iterator over all indexed instances of the
         specified type (including its subtypes) for that CAS View.
       </para>

       <para>The framework defines one standard, built-in annotation index, called
         AnnotationIndex, which indexes the <literal>uima.tcas.Annotation</literal>
         type: all feature structures of type <literal>uima.tcas.Annotation</literal> or
         its subtypes are automatically indexed with this built-in index.</para>

       <para>The ordering relation used by this index is to first order by the value of the
         <quote>begin</quote> feature (in ascending order) and then by the value of the
         <quote>end</quote> feature (in descending order). This ordering ensures that
         longer annotations starting at the same spot come before shorter ones. For Subjects
         of Analysis other than Text, this may not be an appropriate index.</para>
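
       <para>The effect of this ordering can be sketched in plain Java. The example below models
         only the two keys described above (ascending begin, then descending end); the
         <literal>Span</literal> class is a hypothetical stand-in for an annotation, not a UIMA
         class, and the sketch makes no claim about how the framework implements its index.</para>

       <programlisting><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain Java model of the ordering only; Span is a hypothetical stand-in
// for an annotation and is not a UIMA class.
class AnnotationOrderDemo {
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
    @Override public String toString() { return begin + "-" + end; }
  }

  // Ascending begin, then descending end.
  static final Comparator<Span> ORDER =
      Comparator.comparingInt((Span s) -> s.begin)
          .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed());

  // Returns a sorted copy and leaves the input list untouched.
  static List<Span> sorted(List<Span> spans) {
    List<Span> copy = new ArrayList<>(spans);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Span> spans =
        Arrays.asList(new Span(0, 3), new Span(4, 13), new Span(0, 13));
    // The longer span starting at 0 comes before the shorter one.
    System.out.println(sorted(spans)); // prints [0-13, 0-3, 4-13]
  }
}]]></programlisting>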

     </section>
   </section>

   <section id="ugr.ref.cas.builtin_types">
     <title>Built-in CAS Types</title>

     <para>The CAS has two kinds of built-in types &ndash; primitive and non-primitive. The
       primitive types are:

       <itemizedlist spacing="compact">
         <listitem><para>uima.cas.Boolean</para></listitem>
         <listitem><para>uima.cas.Byte</para></listitem>
         <listitem><para>uima.cas.Short</para></listitem>
         <listitem><para>uima.cas.Integer</para></listitem>
         <listitem><para>uima.cas.Long</para></listitem>
         <listitem><para>uima.cas.Float</para></listitem>
         <listitem><para>uima.cas.Double</para></listitem>
         <listitem><para>uima.cas.String</para></listitem>
       </itemizedlist></para>

     <para>The <literal>Byte</literal>, <literal>Short</literal>, <literal>Integer</literal>,
       and <literal>Long</literal> types are
       all signed integer types, of length 8, 16, 32, and 64 bits. The
       <literal>Double</literal> type is 64 bit floating point. The
       <literal>String</literal> type can be subtyped to create sets of allowed values; see
         <olink targetdoc="&uima_docs_ref;"
         targetptr="ugr.ref.xml.component_descriptor.type_system.string_subtypes"/>.
       These subtypes can be used to specify the range of a String-valued feature. They act like
       Strings, but have additional checking to ensure that any value set into them
       is one of the allowed values, or null (which is the value if it is not set).
       Note that the other primitive types cannot be used
       as a supertype for another type definition; only
       <literal>uima.cas.String</literal> can be sub-typed.</para>

     <para>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the
       type <literal>uima.cas.TOP</literal>. All other non-primitive types inherit from
       some supertype.</para>

     <para>There are 9 built-in array types. These arrays have a size specified when they are
       created; the size is fixed at creation time. They are named:

       <itemizedlist spacing="compact">
         <listitem><para>uima.cas.BooleanArray</para></listitem>
         <listitem><para>uima.cas.ByteArray</para></listitem>
         <listitem><para>uima.cas.ShortArray</para></listitem>
         <listitem><para>uima.cas.IntegerArray</para></listitem>
         <listitem><para>uima.cas.LongArray</para></listitem>
         <listitem><para>uima.cas.FloatArray</para></listitem>
         <listitem><para>uima.cas.DoubleArray</para></listitem>
         <listitem><para>uima.cas.StringArray</para></listitem>
         <listitem><para>uima.cas.FSArray</para></listitem>
       </itemizedlist></para>

     <para>The <literal>uima.cas.FSArray</literal> type is an array whose elements are
       arbitrary other feature structures (instances of non-primitive types).</para>

     <para>There are 3 built-in types associated with the artifact being analyzed:

       <itemizedlist spacing="compact">
         <listitem><para>uima.cas.AnnotationBase</para></listitem>
         <listitem><para>uima.tcas.Annotation</para></listitem>
         <listitem><para>uima.tcas.DocumentAnnotation</para></listitem>
       </itemizedlist></para>

     <para>The <literal>AnnotationBase</literal> type defines one system-used feature
       which specifies for an annotation the subject of analysis (Sofa) to which it refers. The
       Annotation type extends from this and defines 2 features, taking
       <literal>uima.cas.Integer</literal> values, called <literal>begin</literal>
       and <literal>end</literal>. The <literal>begin</literal> feature typically
       identifies the start of a span of text the annotation covers; the
       <literal>end</literal> feature identifies the end. The values refer to character
       offsets; the starting index is 0. An annotation of the word <quote>CAS</quote> in a text
       <quote>CAS Reference</quote> would have a <literal>begin</literal> value of 0 and an
       <literal>end</literal> value of 3; the difference between <literal>end</literal> and
       <literal>begin</literal> is the length of the span the annotation refers
       to.</para>
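
     <para>The offset arithmetic can be checked with an ordinary Java string. This is a minimal
       sketch that uses only the JDK; <literal>OffsetDemo</literal> and
       <literal>coveredText</literal> are names invented for this illustration.</para>

     <programlisting><![CDATA[// Plain Java illustration of begin/end offsets; no UIMA classes are involved.
class OffsetDemo {
  // Returns the text between offset begin (inclusive) and end (exclusive).
  static String coveredText(String document, int begin, int end) {
    return document.substring(begin, end);
  }

  public static void main(String[] args) {
    String document = "CAS Reference";
    int begin = 0;
    int end = 3;
    System.out.println(coveredText(document, begin, end)); // prints CAS
    System.out.println(end - begin);                       // prints 3
  }
}]]></programlisting>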

     <para>Annotations are always with respect to some Sofa (Subject of Analysis; see
         <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>).</para>
     <note><para>Artifacts which are not text strings may have a different interpretation of
     the meaning of begin and end, or may define their own kind of annotation, extending from
     <literal>AnnotationBase</literal>. </para></note>

     <para id="ugr.ref.cas.document_annotation">The <literal>DocumentAnnotation</literal> type has one special instance. It is
       a subtype of the Annotation type, and the built-in definition defines one feature,
       <literal>language</literal>, which is a string indicating the language of the
       document in the CAS. The value of this language feature is used by the system to control
       flow among annotators when the <quote>CapabilityLanguageFlow</quote> mode is used,
       allowing the flow to skip over annotators that don&apos;t process particular
       languages. Users may extend this type by adding additional features to it, using the XML
       Descriptor element for defining a type.</para>

     <note><para>
       We do <emphasis>not</emphasis> recommend extending the <literal>DocumentAnnotation</literal>
       type.  If you do, you must <emphasis>not</emphasis> use the JCas, for the reasons stated
       earlier.
     </para></note>

     <para>Each CAS view has a different associated instance of the
       <literal>DocumentAnnotation</literal> type.  On the CAS, use
       <literal>getDocumentAnnotation()</literal> to access the
       <literal>DocumentAnnotation</literal>.</para>

     <para>There are also built-in types supporting linked lists, similar to the ones available in
     Java and other programming languages. Their use is
       constrained by the usual properties of linked lists: not very space efficient, no (efficient)
       random access, but an easy choice if you don't know how long your list will be ahead of time. The
       implementation is type specific; there are different list building objects for each of
       the primitive types, plus one for general feature structures. Here are the type names:
       <itemizedlist spacing="compact">
         <listitem><para>uima.cas.FloatList</para></listitem>
         <listitem><para>uima.cas.IntegerList</para></listitem>
         <listitem><para>uima.cas.StringList</para></listitem>
         <listitem><para>uima.cas.FSList</para>
           <para></para></listitem>
         <listitem><para>uima.cas.EmptyFloatList</para></listitem>
         <listitem><para>uima.cas.EmptyIntegerList</para></listitem>
         <listitem><para>uima.cas.EmptyStringList</para></listitem>
         <listitem><para>uima.cas.EmptyFSList</para>
           <para></para></listitem>
         <listitem><para>uima.cas.NonEmptyFloatList</para></listitem>
         <listitem><para>uima.cas.NonEmptyIntegerList</para></listitem>
         <listitem><para>uima.cas.NonEmptyStringList</para></listitem>
         <listitem><para>uima.cas.NonEmptyFSList</para></listitem>

       </itemizedlist></para>

     <para>For each of the element kinds <literal>Float</literal>,
       <literal>Integer</literal>, <literal>String</literal> and
       <literal>FeatureStructure</literal>, there is a base type, for instance,
       <literal>uima.cas.FloatList</literal>. For each of these, there are two subtypes:
       one for a non-empty element, and a marker that serves to indicate the end of the
       list, or an empty list. The non-empty types define two features &ndash;
       <literal>head</literal> and <literal>tail</literal>. The head feature holds the
       particular value for that part of the list. The tail refers to the next list object
       (either a non-empty one or the empty version to indicate the end of the list).</para>
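
     <para>The shape of these lists can be pictured with a plain Java analogy. The classes below
       are hypothetical stand-ins that merely borrow the names of
       <literal>uima.cas.IntegerList</literal> and its two subtypes; they are not UIMA classes,
       and the sketch only shows how a list is walked by following <literal>tail</literal>
       until the empty marker is reached.</para>

     <programlisting><![CDATA[// Plain Java analogy only; these classes are hypothetical stand-ins for
// uima.cas.IntegerList and its two subtypes, not UIMA classes.
class ListShapeDemo {
  static abstract class IntegerList {}

  // Marks the end of a list; on its own it represents the empty list.
  static final class EmptyIntegerList extends IntegerList {}

  static final class NonEmptyIntegerList extends IntegerList {
    final int head;          // value stored in this cell
    final IntegerList tail;  // next cell, or the empty marker
    NonEmptyIntegerList(int head, IntegerList tail) {
      this.head = head;
      this.tail = tail;
    }
  }

  // Follows tail references until the empty marker is reached.
  static int sum(IntegerList list) {
    int total = 0;
    while (list instanceof NonEmptyIntegerList) {
      NonEmptyIntegerList cell = (NonEmptyIntegerList) list;
      total += cell.head;
      list = cell.tail;
    }
    return total;
  }

  public static void main(String[] args) {
    IntegerList list = new NonEmptyIntegerList(1,
        new NonEmptyIntegerList(2,
            new NonEmptyIntegerList(3, new EmptyIntegerList())));
    System.out.println(sum(list)); // prints 6
  }
}]]></programlisting>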

     <para>There are no other built-in types. Users are free to define their own type systems,
       building upon these types.</para>

   </section>

   <section id="ugr.ref.cas.accessing_the_type_system">
     <title>Accessing the type system</title>

     <para>
       During annotator processing, or outside an annotator, access the type system by calling
       <literal>CAS.getTypeSystem()</literal>.
     </para>

     <para>However, CAS annotators implement an additional method,
       <literal>typeSystemInit()</literal>, which is called by the UIMA framework before the
       annotator&apos;s process method. This method, implemented by the annotator writer,
       is passed a reference to the CAS&apos;s type system metadata. The method typically uses
       the type system APIs to obtain type and feature objects corresponding to all the types
       and features the annotator will be using in its process method. This initialization
       step should not be done during an annotator&apos;s initialize method since the type
       system can change after the initialize method is called; it should not be done during the
       process method, since this is presumably work that is identical for each incoming
       document, and so should be performed only when the type system changes (which will be a
        rare event). The UIMA framework guarantees it will call the
        <literal>typeSystemInit()</literal> method of an annotator whenever the type system
        changes, before calling the
        annotator&apos;s <literal>process()</literal> method.</para>

     <para>The initialization done by <literal>typeSystemInit()</literal> is done by the
       UIMA framework when you use the JCas APIs; you only need to provide a
       <literal>typeSystemInit()</literal> method, as described here, when you are not using
       the JCas approach.</para>

     <section id="ugr.ref.cas.type_system.printer_example">
       <title>TypeSystemPrinter example</title>

       <para>Here is a code fragment that, given a CAS Type System, will print a list of all
         types.</para>


       <programlisting>// Get all type names from the type system
 // and print them to stdout.
 private void listTypes1(TypeSystem ts) {
   // Get an iterator over types
   Iterator&lt;Type&gt; typeIterator = ts.getTypeIterator();
   System.out.println("Types in the type system:");
   while (typeIterator.hasNext()) {
     // Retrieve a type...
     Type t = typeIterator.next();
     // ...and print its name.
     System.out.println(t.getName());
   }
   System.out.println();
 }</programlisting>

       <para>This method is passed the type system as a parameter.  From the type system, we can
         get an iterator
         over all known types. If we run this against a CAS created with no additional
         user-defined types, we should see something like this on the console:</para>

       <programlisting>Types in the type system:
 uima.cas.Boolean
 uima.cas.Byte
 uima.cas.Short
 uima.cas.Integer
 uima.cas.Long
 uima.cas.ArrayBase
 ...
         </programlisting>

       <para>If the type system had user-defined types these would show up too. Note that some
         of these types are not directly creatable &ndash; they are types used by the framework
         in the type hierarchy (e.g. uima.cas.ArrayBase).</para>

       <para>CAS type names include a name-space prefix. The components of a type name are
         separated by the dot (.). A type name component must start with a Unicode letter,
         followed by an arbitrary sequence of letters, digits and the underscore (_). By
         convention, the last component of a type name starts with an uppercase letter, the
         rest start with a lowercase letter.</para>
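
       <para>The naming rule can be expressed as a small check in plain Java. This is a sketch of
         the rule exactly as stated above, not the framework&apos;s own validation;
         <literal>TypeNameCheck</literal> and its methods are hypothetical names, and the check
         works on <literal>char</literal> values, so letters outside the Basic Multilingual
         Plane are not handled.</para>

       <programlisting><![CDATA[// Plain Java sketch of the naming rule described in the text;
// TypeNameCheck is a hypothetical helper, not part of UIMA.
class TypeNameCheck {
  // A component starts with a letter, followed by letters, digits or '_'.
  static boolean isValidComponent(String component) {
    if (component.isEmpty() || !Character.isLetter(component.charAt(0))) {
      return false;
    }
    for (int i = 1; i < component.length(); i++) {
      char ch = component.charAt(i);
      if (!Character.isLetterOrDigit(ch) && ch != '_') {
        return false;
      }
    }
    return true;
  }

  // A type name is a dot-separated sequence of valid components.
  static boolean isValidTypeName(String name) {
    // The limit -1 keeps trailing empty strings, so "uima.cas." is rejected.
    for (String component : name.split("\\.", -1)) {
      if (!isValidComponent(component)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValidTypeName("uima.tcas.Annotation")); // prints true
    System.out.println(isValidTypeName("com. xyz.proj.Person")); // prints false
    System.out.println(isValidTypeName("uima..cas"));            // prints false
  }
}]]></programlisting>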

       <para>Listing the type names is mildly useful, but it would be even better if we could see
         the inheritance relation between the types. The following code prints the
         inheritance tree in indented format.</para>


       <programlisting>private static final int INDENT = 2;
 private void listTypes2(TypeSystem ts) {
   // Get the root of the inheritance tree.
   Type top = ts.getTopType();
   // Recursively print the tree.
   printInheritanceTree(ts, top, 0);
 }

 private void printInheritanceTree(TypeSystem ts, Type type, int level) {
   indent(level); // Print indentation.
   System.out.println(type.getName());
   // Get a vector of the immediate subtypes.
   Vector&lt;Type&gt; subTypes =
     ts.getDirectlySubsumedTypes(type);
   ++level; // Increase the indentation level.
   for (int i = 0; i &lt; subTypes.size(); i++) {
     // Print the subtypes.
     printInheritanceTree(ts, subTypes.get(i), level);
   }
 }

 // A simple, inefficient indenter
 private void indent(int level) {
   int spaces = level * INDENT;
   for (int i = 0; i &lt; spaces; i++) {
     System.out.print(" ");
   }
 }</programlisting>

       <para>This example shows that you can traverse the type hierarchy by starting at the top
         with <literal>TypeSystem.getTopType()</literal> and by retrieving subtypes with
         <literal>TypeSystem.getDirectlySubsumedTypes()</literal>.</para>

       <para>The type system APIs (see the Javadocs) also let you access the features of a type,
         as well as the allowed value type for each feature. Here is sample code which prints out all the
         features of all the types, together with the allowed value types (the feature
         <quote>range</quote>). Each feature has a <quote>domain</quote> which is the type
         where it is defined, as well as a <quote>range</quote>.


         <programlisting>private void listFeatures2(TypeSystem ts) {
   Iterator&lt;Feature&gt; featureIterator = ts.getFeatures();
   System.out.println("Features in the type system:");
   while (featureIterator.hasNext()) {
     Feature f = featureIterator.next();
     System.out.println(
       f.getShortName() + ": " +
       f.getDomain() + " -&gt; " + f.getRange());
   }
   System.out.println();
 }</programlisting></para>

       <para>We can ask a feature object for its domain (the type it is defined on) and its range
         (the type of the value of the feature). The terminology derives from the fact that
         features can be viewed as functions on subspaces of the object space.</para>

     </section>

     <section id="ugr.ref.cas.cas_apis_create_modify_feature_structures">
       <title>Using the CAS APIs to create and modify feature structures</title>
       <titleabbrev>Using CAS APIs: Feature Structures</titleabbrev>

       <para>Assume a type system declaration that defines two types: Entity and Person.
         Entity has no features defined within it but inherits from uima.tcas.Annotation
         &ndash; so it has the begin and end features. Person is, in turn, a subtype of Entity,
         and adds firstName and lastName features. CAS type systems are declaratively
         specified using XML; the format of this XML is described in <olink
           targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.xml.component_descriptor.type_system"/>.


         <programlisting><![CDATA[<!-- Type System Definition -->
 <typeSystemDescription>
   <types>
     <typeDescription>
       <name>com.xyz.proj.Entity</name>
       <description />
       <supertypeName>uima.tcas.Annotation</supertypeName>
     </typeDescription>
     <typeDescription>
       <name>com.xyz.proj.Person</name>
       <description />
       <supertypeName>com.xyz.proj.Entity</supertypeName>
       <features>
         <featureDescription>
           <name>firstName</name>
           <description />
           <rangeTypeName>uima.cas.String</rangeTypeName>
         </featureDescription>
         <featureDescription>
           <name>lastName</name>
           <description />
           <rangeTypeName>uima.cas.String</rangeTypeName>
         </featureDescription>
       </features>
     </typeDescription>
   </types>
 </typeSystemDescription>]]></programlisting></para>

   <para>
     To be able to access types and features, we need to know their names.  The CAS interface defines
     constants that hold the names of built-in types and features, such as
     <literal>CAS.TYPE_NAME_INTEGER</literal>.  It is good programming practice to create such
     constants for the types and features you define, for your own use as well as for others who will
     be using your annotators.
   </para>


       <programlisting>/** Entity type name constant. */
 public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity";

 /** Person type name constant. */
 public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";

 /** First name feature name constant. */
 public static final String FIRST_NAME_FEAT_NAME = "firstName";

 /** Last name feature name constant. */
 public static final String LAST_NAME_FEAT_NAME = "lastName";</programlisting>

       <para>Next we define type and feature member variables; these will hold the values of the
         type and feature objects needed by the CAS APIs, to be assigned during
         <literal>typeSystemInit()</literal>.</para>


       <programlisting>// Type system object variables
 private Type entityType;
 private Type personType;
 private Feature firstNameFeature;
 private Feature lastNameFeature;
 private Type stringType;</programlisting>

       <para>The type system does not throw an exception if we ask for something that is
         not known; it simply returns null. The code therefore checks for this and throws a proper
         exception.  We require all these types and features to be defined for the annotator to
         work.  One might imagine situations where certain computations are predicated on some type
         or feature being defined in the type system, but that is not the case here.</para>


       <programlisting>// Get a type object corresponding to a name.
 // If it doesn&apos;t exist, throw an exception.
 private Type initType(String typeName)
   throws AnnotatorInitializationException {
   Type type = typeSystem.getType(typeName);
   if (type == null) {
     throw new AnnotatorInitializationException(
       AnnotatorInitializationException.TYPE_NOT_FOUND,
       new Object[] { this.getClass().getName(), typeName });
   }
   return type;
 }

 // We add similar code for retrieving feature objects.
 // Get a feature object from a name and a type object.
 // If it doesn&apos;t exist, throw an exception.
 private Feature initFeature(String featName, Type type)
   throws AnnotatorInitializationException {
   Feature feat = type.getFeatureByBaseName(featName);
   if (feat == null) {
     throw new AnnotatorInitializationException(
       AnnotatorInitializationException.FEATURE_NOT_FOUND,
       new Object[] { this.getClass().getName(), featName });
   }
   return feat;
 }</programlisting>

       <para>Using these two functions, code for initializing the type system described
         above would be:


         <programlisting>public void typeSystemInit(TypeSystem aTypeSystem)
     throws AnalysisEngineProcessException {
   this.typeSystem = aTypeSystem;
   // Set type system member variables.
   this.entityType = initType(ENTITY_TYPE_NAME);
   this.personType = initType(PERSON_TYPE_NAME);
   this.firstNameFeature =
     initFeature(FIRST_NAME_FEAT_NAME, personType);
   this.lastNameFeature =
     initFeature(LAST_NAME_FEAT_NAME, personType);
   this.stringType = initType(CAS.TYPE_NAME_STRING);
 }</programlisting></para>

       <para>Note that we initialize the string type by using a type name constant from the
         CAS.</para>

     </section>
   </section>

   <section id="ugr.ref.cas.creating_feature_structures">
     <title>Creating feature structures</title>

     <para>To create feature structures in JCas, we use the Java <quote>new</quote>
       operator. In the CAS, we use one of several different API methods on the CAS object,
       depending on which of the 10 basic kinds of feature structures we are creating (a plain
       feature structure, or an instance of the built-in primitive type arrays or FSArray).
       There is also a method to create an instance of a
       <literal>uima.tcas.Annotation</literal>, setting the begin and end
       values.</para>

     <para>Once a feature structure is created, it needs to be added to the CAS indexes (unless
       it will be accessed via some reference from another accessible feature structure).
       Assuming <literal>aCAS</literal> holds a reference to a CAS, and <literal>token</literal>
       holds a reference to a newly created feature structure, the following code adds that
       feature structure to all the relevant CAS indexes:</para>


     <programlisting>    // Add the token to the index repository.
     aCAS.addFsToIndexes(token);</programlisting>

     <para>There is also a corresponding <literal>removeFsFromIndexes(token)</literal>
       method on CAS objects.</para>

     <para>Some of the indexes (the Sorted and Set types) use comparators defined
     on particular feature values of an indexed type. To change the value of a
     feature that is used as an index key, the correct procedure is to
     <orderedlist spacing="compact">
       <listitem><para>remove the item from all indexes where it is indexed, in all views
       where it is indexed,</para>
       </listitem>
       <listitem><para>update the value of the features being used as keys,</para></listitem>
       <listitem><para>add the item back to the indexes, in all views.</para></listitem>
     </orderedlist></para>
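
     <para>The remove, update, add-back sequence above can be illustrated with a minimal sketch.
       It does not use the UIMA API: a <literal>java.util.TreeSet</literal> stands in for a
       sorted index, and the <literal>Item</literal> class, its <literal>begin</literal> key, and
       the <literal>updateBegin</literal> helper are hypothetical names chosen for illustration.
       The point is only that a sorted container locates an element by its key, so the key
       should change while the element is outside the container.</para>

     <programlisting><![CDATA[
```java
import java.util.Comparator;
import java.util.TreeSet;

/** Plain-Java stand-in for a sorted index; TreeSet is an assumption, not a UIMA class. */
final class SortedIndexSketch {

    /** Hypothetical indexed item whose 'begin' field is the sort key. */
    static final class Item {
        final String label;
        int begin;

        Item(String label, int begin) {
            this.label = label;
            this.begin = begin;
        }
    }

    /** Builds an index ordered by the 'begin' key, with the label as tie-breaker. */
    static TreeSet<Item> newIndex() {
        return new TreeSet<>(Comparator.<Item>comparingInt(i -> i.begin)
                                       .thenComparing(i -> i.label));
    }

    /** Safe key update: remove, change the key, then add back. */
    static void updateBegin(TreeSet<Item> index, Item item, int newBegin) {
        boolean wasIndexed = index.remove(item); // 1. remove while the key is still valid
        item.begin = newBegin;                   // 2. update the key value
        if (wasIndexed) {
            index.add(item);                     // 3. add back so the item is re-sorted
        }
    }

    public static void main(String[] args) {
        TreeSet<Item> index = newIndex();
        Item a = new Item("a", 0);
        Item b = new Item("b", 5);
        index.add(a);
        index.add(b);
        updateBegin(index, a, 10);
        System.out.println(index.first().label); // item b now sorts first
    }
}
```
]]></programlisting>

     <para>In a real CAS, the remove and add steps would be repeated for every view in which
       the feature structure is indexed, as the list above states.</para>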
   </section>

   <section id="ugr.ref.cas.accessing_modifying_features_of_feature_structures">
     <title>Accessing or modifying features of feature structures</title>
     <titleabbrev>Accessing or modifying Features</titleabbrev>

     <para>Values of individual features for a feature structure can be set or referenced,
       using a set of methods that depend on the type of value that feature is declared to have.
       There are methods on FeatureStructure for this: getBooleanValue, getByteValue,
       getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue,
       getStringValue, and getFeatureValue (which means to get a value which in turn is a
       reference to a feature structure). There are corresponding <quote>setter</quote>
       methods, as well. These methods on the feature structure object take as arguments the
       feature object retrieved earlier in the typeSystemInit method.</para>

     <para>Using the previous example, with the type system initialized with type personType
       and feature lastNameFeature, here&apos;s a sample code fragment that gets and sets
       that feature:</para>


     <programlisting>// Assume aPerson is a variable holding an object of type Person
 // get the lastNameFeature value from the feature structure
 String lastName = aPerson.getStringValue(lastNameFeature);
 // set the lastNameFeature value
 aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</programlisting>

     <para>The getters and setters for each of the primitive types are defined in the Javadocs
       as methods of the FeatureStructure interface.</para>

   </section>

   <section id="ugr.ref.cas.indexes_and_iterators">
     <title>Indexes and Iterators</title>

     <para>Each CAS can have many indexes associated with it; each CAS View contains
       a complete set of instantiations of the indexes.   Each index is represented by an
       instance of the type org.apache.uima.cas.FSIndex. You use the object
       org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to
       retrieve instances of indexes. There are methods that let you select the index
       by name, by type, or by both name and type. Since each index is already associated with a type,
       passing both a name and a type is valid only if the type passed in is the same
       type or a subtype of the one declared in the index specification for the named index. If you
       pass in a subtype, the returned FSIndex object refers to an index that will return only
       items belonging to that subtype (or subtypes of that subtype).</para>

     <para>The returned FSIndex objects are used, in turn, to create iterators.
       There is also a method on the Index Repository, <literal>getAllIndexedFS</literal>,
       which will return an iterator over all indexed Feature Structures (for that CAS View),
       in no particular order.  The iterators
       created can be used like common Java iterators, to sequentially retrieve items
       indexed. If the index represents a sorted index, the items are returned in a sorted
       order, where the sort order is specified in the XML index definition. This XML is part of
       the Component Descriptor, see <olink targetdoc="&uima_docs_ref;"
         targetptr="ugr.ref.xml.component_descriptor.aes.index"/>.</para>

     <para>Feature structures should not be added to or removed from indexes while iterating
       over them; a ConcurrentModificationException is thrown when this is detected.
       Certain operations are allowed with the iterators after modification, which can
       <quote>reset</quote> this condition, such as moving to beginning, end, or moving to a
       particular feature structure. So, if you have to modify the index, you can move the
       iterator back to the last FS you had retrieved from it, and then continue, if that makes
       sense in your application.</para>
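
     <para>An alternative that avoids the exception altogether is to defer the modification until
       iteration has finished. The following is a plain-Java sketch of that pattern, not UIMA
       code: an <literal>ArrayList</literal> stands in for the index (its iterators are
       fail-fast in a comparable way), and <literal>removeNegatives</literal> is a hypothetical
       helper.</para>

     <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in: an ArrayList plays the role of an index here (an assumption). */
final class DeferredRemovalSketch {

    /** Removes all negative values, deferring the removal until iteration is done. */
    static List<Integer> removeNegatives(List<Integer> index) {
        List<Integer> toRemove = new ArrayList<>();
        for (Integer value : index) {   // read-only pass over the index
            if (value < 0) {
                toRemove.add(value);    // remember the item, but do not modify yet
            }
        }
        index.removeAll(toRemove);      // modify only after iteration has ended
        return index;
    }

    public static void main(String[] args) {
        List<Integer> index = new ArrayList<>(Arrays.asList(3, -1, 4, -1, 5));
        System.out.println(removeNegatives(index));
    }
}
```
]]></programlisting>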

     <section id="ugr.ref.cas.index.built_in_indexes">
       <title>Built-in Indexes</title>

       <para>An unnamed built-in bag index exists which holds all feature structures which are indexed.
       The only access to this index is the method getAllIndexedFS(Type) which returns an iterator
       over all indexed Feature Structures.</para>

       <para>The CAS also contains a built-in index for the type <literal>uima.tcas.Annotation</literal>, which sorts
         annotations in the order in which they appear in the document. Annotations are sorted first by increasing
         <literal>begin</literal> position. Ties are then broken by <emphasis>decreasing</emphasis>
         <literal>end</literal> position (so that longer annotations come first). Annotations that match in both
         their <literal>begin</literal> and <literal>end</literal> features are sorted using the Type Priority
         (see <olink targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.xml.component_descriptor.aes.type_priority"/>).</para>
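
       <para>The ordering just described can be written down as a comparator. This is a sketch
         in plain Java under stated assumptions, not the UIMA implementation: the
         <literal>Span</literal> class is a hypothetical stand-in for an annotation, and type
         priority is modelled as an integer rank where a lower rank sorts first.</para>

       <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of the annotation sort order; Span is a hypothetical stand-in. */
final class AnnotationOrderSketch {

    static final class Span {
        final String type;
        final int begin;
        final int end;
        final int typePriority; // assumed encoding: lower rank sorts first

        Span(String type, int begin, int end, int typePriority) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.typePriority = typePriority;
        }
    }

    /** Increasing begin, then decreasing end (longer first), then type priority rank. */
    static final Comparator<Span> ORDER =
        Comparator.<Span>comparingInt(s -> s.begin)
                  .thenComparing(Comparator.<Span>comparingInt(s -> s.end).reversed())
                  .thenComparingInt(s -> s.typePriority);

    /** Returns the type names of the given spans in index order. */
    static List<String> sortedTypes(List<Span> spans) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        List<String> names = new ArrayList<>();
        for (Span s : copy) {
            names.add(s.type);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Token", 0, 5, 2));
        spans.add(new Span("Sentence", 0, 20, 0));
        spans.add(new Span("Person", 0, 5, 1));
        System.out.println(sortedTypes(spans));
    }
}
```
]]></programlisting>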
     </section>


     <section id="ugr.ref.cas.index.adding_to_indexes">
       <title>Adding Feature Structures to the Indexes</title>

       <para>Feature Structures are added to the indexes by calling the
         <literal>FSIndexRepository.addFS(FeatureStructure)</literal> method or the equivalent convenience
         method <literal>CAS.addFsToIndexes(FeatureStructure)</literal>. This adds the Feature Structure to
         <emphasis>all</emphasis> indexes that are defined for the type of that FeatureStructure (or any of its
         supertypes). Note that you should not add a Feature Structure to the indexes until you have set values for all
         of the features that may be used as sort keys in an index.</para>
     </section>

     <section id="ugr.ref.cas.index.iterators">
       <title>Iterators</title>

       <para>Iterators are objects implementing the interface
         <literal>org.apache.uima.cas.FSIterator</literal>. This interface
         extends <literal>java.util.Iterator</literal> and declares the normal Java iterator methods, plus
         additional ones that allow moving both forwards and backwards.</para>
     </section>

     <section id="ugr.ref.cas.index.annotation_index">
       <title>Special iterators for Annotation types</title>

       <para>The built-in index over the <literal>uima.tcas.Annotation</literal> type
         named <quote><literal>AnnotationIndex</literal></quote> has additional
         capabilities. To use them, you first get a reference to this built-in index using
         either the <literal>getAnnotationIndex</literal> method on a CAS View object, or
         by asking the <literal>FSIndexRepository</literal> object for an index having the
         particular name <quote>AnnotationIndex</quote>, for example:

         <programlisting>AnnotationIndex idx = aCAS.getAnnotationIndex();
 // or you can get an index over a specific subtype of Annotation:
 AnnotationIndex subtypeIdx = aCAS.getAnnotationIndex(aType);</programlisting></para>

       <para>This object can be used to produce several additional kinds of iterators. It can
         produce unambiguous iterators; such an iterator skips over elements until it finds one
         whose start position is equal to or greater than the end position of
         the previously returned annotation.</para>
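
       <para>The skipping rule for unambiguous iteration can be sketched in a few lines of plain
         Java. This is an illustration of the rule as stated above, not the UIMA implementation;
         annotations are reduced to hypothetical <literal>int[]{begin, end}</literal> pairs that
         are assumed to be in index order already.</para>

       <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of unambiguous iteration over spans given as {begin, end} pairs. */
final class UnambiguousSketch {

    /** Keeps a span only if it begins at or after the end of the last span kept. */
    static List<int[]> unambiguous(List<int[]> sortedSpans) {
        List<int[]> result = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (int[] span : sortedSpans) {
            if (span[0] >= lastEnd) {  // does not overlap the previously returned span
                result.add(span);
                lastEnd = span[1];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {0, 10});
        spans.add(new int[] {2, 5});   // overlaps {0, 10}, so it is skipped
        spans.add(new int[] {10, 12});
        System.out.println(unambiguous(spans).size());
    }
}
```
]]></programlisting>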

       <para>It can also produce several kinds of subiterators; these are iterators whose
         annotations fall within the span of another annotation. This kind of iterator can
         also have the unambiguous property, if desired. It also can be
         <quote>strict</quote> or not; strict means that the returned annotation lies
         completely within the span of the controlling annotation. Non-strict only implies
         that the beginning of the returned annotation falls within the span of the
         controlling annotation.</para>
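
       <para>The difference between strict and non-strict subiterators can be shown with a small
         plain-Java filter. This is only a sketch: spans are hypothetical
         <literal>int[]{begin, end}</literal> pairs, <literal>select</literal> is an invented
         helper, and treating the controlling span as the half-open range from its begin up to,
         but not including, its end is an assumption made for the example rather than a
         statement about UIMA&apos;s boundary handling.</para>

       <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of strict and non-strict selection under a controlling span. */
final class SubiteratorSketch {

    /**
     * Selects spans relative to the controlling span [ctrlBegin, ctrlEnd).
     * Non-strict: the begin must fall inside the controlling span.
     * Strict: additionally, the end must not extend past ctrlEnd.
     */
    static List<int[]> select(List<int[]> spans, int ctrlBegin, int ctrlEnd, boolean strict) {
        List<int[]> result = new ArrayList<>();
        for (int[] span : spans) {
            boolean beginInside = span[0] >= ctrlBegin && span[0] < ctrlEnd;
            boolean endInside = span[1] <= ctrlEnd;
            if (beginInside && (!strict || endInside)) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {0, 4});
        spans.add(new int[] {5, 12});  // starts inside {0, 10} but ends outside it
        System.out.println(select(spans, 0, 10, true).size());
        System.out.println(select(spans, 0, 10, false).size());
    }
}
```
]]></programlisting>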

       <para>There is also a method which produces an <literal>AnnotationTree</literal>
         object, which contains nodes representing the results of doing a strict,
         unambiguous subiterator over the span of some controlling annotation. For more
         details, please refer to the Javadocs for the
         <literal>org.apache.uima.cas.text</literal> package.</para>

     </section>

     <section id="ugr.ref.cas.index.constraints_and_filtered_iterators">
       <title>Constraints and Filtered iterators</title>

       <para>There is a set of API calls that build constraint objects. These objects can be
         used directly to test if a particular feature structure matches (satisfies) the
         constraint, or they can be passed to the createFilteredIterator method to create an
         iterator that skips over instances which fail to satisfy the constraint.</para>

       <para>It is possible to specify a feature value located by following a chain of
         references starting from the feature structure being tested. Here&apos;s a
         scenario to explore this concept. Let&apos;s suppose you have the following type
         system (namespaces are omitted for clarity):

         <blockquote>
           <para><emphasis role="bold">Token</emphasis>, having a feature PartOfSpeech
             which holds a reference to another type (POS)</para>

           <para><emphasis role="bold">POS</emphasis> (a type with many subtypes, each
             representing a different part of speech)</para>

           <para><emphasis role="bold">Noun</emphasis> (a subtype of POS)</para>

           <para><emphasis role="bold">ProperName</emphasis> (a subtype of Noun),
             having a feature Class which holds an integer value encoding some information
             about the proper noun.</para></blockquote></para>

       <para>If you want to filter Token instances, such that only those tokens get through
         which are proper names of class 3 (for example), you would need a test that started with
         a Token instance, followed its PartOfSpeech reference to another instance (the
         ProperName instance) and then tested the Class feature of that instance for a value
         equal to 3.</para>

       <para>To support this, the filtering approach has components that specify tests, and
         components that specify <quote>paths</quote>. The tests that can be done include
         testing references to type instances to see if they are instances of some type or its
         subtypes; this is done with a FSTypeConstraint constraint. Other tests check for
         equality or, for numeric values, ranges.</para>
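
       <para>The combination of a path and a test can be pictured with ordinary Java predicates.
         The sketch below is not the UIMA constraint API: the nested classes are hypothetical
         Java mirrors of the example type system, <literal>nameClass</literal> models the
         Class feature, and <literal>embed</literal> is an invented helper that follows a path
         and then applies a test to the value it reaches.</para>

       <programlisting><![CDATA[
```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java stand-in for path-based constraints; none of these are UIMA classes. */
final class PathConstraintSketch {

    // Hypothetical Java classes mirroring the example type system.
    static class Pos { }
    static class Noun extends Pos { }
    static final class ProperName extends Noun {
        final int nameClass; // models the integer Class feature
        ProperName(int nameClass) { this.nameClass = nameClass; }
    }
    static final class Token {
        final Pos partOfSpeech; // models the PartOfSpeech reference; may be null
        Token(Pos partOfSpeech) { this.partOfSpeech = partOfSpeech; }
    }

    /** Embeds a test under a path: follow the path, then test the value it reaches. */
    static <A, B> Predicate<A> embed(Function<A, B> path, Predicate<B> test) {
        return a -> {
            B value = path.apply(a);
            return value != null && test.test(value);
        };
    }

    /** Matches tokens whose part of speech is a ProperName with the wanted class. */
    static Predicate<Token> properNameOfClass(int wanted) {
        Predicate<Pos> isProperName = p -> p instanceof ProperName;
        Predicate<Pos> hasWantedClass = p -> ((ProperName) p).nameClass == wanted;
        Function<Token, Pos> path = t -> t.partOfSpeech;
        return embed(path, isProperName.and(hasWantedClass));
    }

    public static void main(String[] args) {
        Token token = new Token(new ProperName(3));
        System.out.println(properNameOfClass(3).test(token));
    }
}
```
]]></programlisting>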

       <para>Each test may be combined with a path that leads to the value to test. Tests that
         start from a feature structure instance can be combined with <quote>and</quote> and
         <quote>or</quote> connectors.
         The Javadocs for these are in the package org.apache.uima.cas, in the classes whose names end
         in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS.
         Here&apos;s an example; assume the variable cas holds a reference to a CAS instance.


         <programlisting>// Start by getting the constraint factory from the CAS.
 ConstraintFactory cf = cas.getConstraintFactory();

 // To specify a path to an item to test, you start by
 // creating an empty path.
 FeaturePath path = cas.createFeaturePath();

 // Add POS feature to path, creating one-element path.
 path.addFeature(posFeat);

 // You can extend the chain arbitrarily by adding additional
 // features.

 // Create a new type constraint.

 // Type constraints will check that structures
 // they match against have a type at least as specific
 // as the type specified in the constraint.
 FSTypeConstraint nounConstraint = cf.createTypeConstraint();

 // Set the type (by default it is TOP).
 // This succeeds if the type being tested by this constraint
 // is nounType or a subtype of nounType.
 nounConstraint.add(nounType);

 // Embed the noun constraint under the pos path.
 // This means, associate the test with the path, so it tests the
 // proper value.

 // The result is a test which will
 // match a feature structure that has a posFeat defined
 // which has a value which is an instance of a nounType or
 // one of its subtypes.
 FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint);

 // Create a type constraint for token (or a subtype of it)
 FSTypeConstraint tokenConstraint = cf.createTypeConstraint();

 // Set the type.
 tokenConstraint.add(tokenType);

 // Create the final constraint by conjoining the two constraints.
 FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);

 // Create a filtered iterator from some annotation iterator.
 FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</programlisting>
         </para></section></section>

   <section id="ugr.ref.cas.guide_to_javadocs">
     <title>The CAS APIs &ndash; a guide to the Javadocs</title>
     <titleabbrev>CAS API Javadocs</titleabbrev>

     <para>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most
       of the APIs described here are in the cas package. The cas.impl package contains classes
       used in serializing and deserializing (reading and writing to external strings) the
       XCAS form of the CAS (XCAS is an XML serialization of the CAS). The XCAS form is used for
       transporting the CAS among local and remote annotators, or for storing the CAS in
       permanent storage. The cas.text package contains the APIs that extend the CAS to support
       artifact (including <quote>text</quote>) analysis.</para>

     <section id="ugr.ref.cas.javadocs.cas_package">
       <title>APIs in the CAS package</title>

       <para>The main objects implementing the APIs discussed here are shown in the diagram
         below. The hierarchy represents that there is a way to get from an upper object to an
         instance of the lower object, usually by using a method on the upper object; this is not
         an inheritance hierarchy.
         <figure id="ugr.ref.cas.fig.api_hierarchy">
           <title>CAS Object hierarchy</title>
           <mediaobject>
             <imageobject>
               <imagedata width="5.8in" format="JPG"
                 fileref="&imgroot;image001.png"/>
             </imageobject>
             <textobject><phrase>CAS object hierarchy</phrase></textobject>
           </mediaobject>
         </figure> </para>

       <para>The main interface is the CAS interface. This has most of the functionality of the
         CAS, except for the type system metadata access, and the indexing access. JCas and CAS
         are alternative representations and API approaches to the CAS; each has a method to
         get the other. You can mix JCas and CAS APIs in your application as needed. To use the
         JCas APIs, you have to create the Java classes that correspond to the CAS types, and
         include them in the Java class path of the application. If you have a CAS object, you can
         get a JCas object by using the getJCas() method call on the CAS object; likewise, you
         can get the CAS object from a JCas by using the getCAS() method call on the JCas object.
         There is also a low level CAS interface that is not part of the official API, and is
         intended for internal use only &ndash; it is not documented here.</para>

       <para>The type system metadata APIs are found in the TypeSystem interface. The objects
         defining each type and feature are defined by the interfaces Type and Feature. The
         Type interface has methods to see what types subsume other types, to iterate over the
         types available, and to extract information about the types, including what
         features it has. The Feature interface has methods that get what type it belongs to,
         its name, and its range (the kind of values it can hold).</para>

       <para>The FSIndexRepository gives you access to methods to get instances of indexes, and
         also provides access to the iterator over all indexed feature structures:
         <literal>getAllIndexedFS(aType)</literal>.
         The FSIndex and AnnotationIndex objects give you methods to create instances of
         iterators.</para>

       <para>Iterators and the CAS methods that create new feature structures return
         FeatureStructure objects. These objects can be used to set and get the values of
         defined features within them.</para>
     </section>
   </section>
 </chapter>