 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
 <!ENTITY imgroot "../images/tutorials_and_users_guides/tug.cas_multiplier/">
 <!ENTITY % uimaents SYSTEM "../entities.ent">
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <chapter id="ugr.tug.cm">
   <title>CAS Multiplier Developer&apos;s Guide</title>
   <titleabbrev>CAS Multiplier</titleabbrev>

   <para>The UIMA analysis components (Annotators and CAS Consumers) described previously in this manual all take a
     single CAS as input, optionally make modifications to it, and output that same CAS. This chapter describes an
     advanced feature that became available in the UIMA SDK v2.0: a new type of analysis component called a
     <emphasis>CAS Multiplier</emphasis>, which can create new CASes during processing.</para>

   <para>CAS Multipliers are often used to split a large artifact into manageable pieces. This is a common requirement
     of audio and video analysis applications, but can also occur in text analysis on very large documents. A CAS
     Multiplier would take as input a single CAS representing the large artifact (perhaps by a remote reference to the
     actual data &mdash; see <olink targetdoc="&uima_docs_tutorial_guides;"
       targetptr="ugr.tug.aas.sofa_data_formats"/>) and produce as output a series of new CASes each of which
     contains only a small portion of the original artifact.</para>

   <para>CAS Multipliers are not limited to dividing an artifact into smaller pieces, however. A CAS Multiplier can
     also be used to combine smaller segments together to form larger segments. In general, a CAS Multiplier is used to
     <emphasis>change</emphasis> the segmentation of a series of CASes; that is, to change how a stream of data is
     divided among discrete CAS objects.</para>

   <section id="ugr.tug.cm.developing_multiplier_code">
     <title>Developing the CAS Multiplier Code</title>

     <section id="ugr.tug.cm.cm_interface_overview">
       <title>CAS Multiplier Interface Overview</title>

       <para>CAS Multiplier implementations should extend the
         <literal>JCasMultiplier_ImplBase</literal> or <literal>CasMultiplier_ImplBase</literal>
         classes, depending on which CAS interface they prefer to use. As with other types of analysis components, the
         CAS Multiplier ImplBase classes define optional <literal>initialize</literal>,
         <literal>destroy</literal>, and <literal>reconfigure</literal> methods. There are then three
         required methods: <literal>process</literal>, <literal>hasNext</literal>, and
         <literal>next</literal>. The framework interacts with these methods as follows:</para>

       <orderedlist>
         <listitem>
           <para>The framework calls the CAS Multiplier&apos;s <literal>process</literal> method, passing it an
             input CAS. The process method returns, but may hold on to a reference to the input CAS.</para>
         </listitem>

         <listitem>
           <para>The framework then calls the CAS Multiplier&apos;s <literal>hasNext</literal> method. The CAS
             Multiplier should return <literal>true</literal> from this method if it intends to output one or more
             new CASes (for instance, segments of this CAS), and <literal>false</literal> if not.</para>
         </listitem>

         <listitem>
           <para>If <literal>hasNext</literal> returned true, the framework will call the CAS Multiplier&apos;s
             <literal>next</literal> method. The CAS Multiplier creates a new CAS (we will see how in a moment),
             populates it, and returns it from the <literal>next</literal> method.</para>
         </listitem>

         <listitem>
           <para>Steps 2 and 3 continue until <literal>hasNext</literal> returns false. </para>
         </listitem>
       </orderedlist>

       <para>From the time when <literal>process</literal> is called until the <literal>hasNext</literal>
         method returns false, the CAS Multiplier <quote>owns</quote> the CAS that was passed to its
         <literal>process</literal> method. The CAS Multiplier can store a reference to this CAS in a local field and
         can read from it or write to it during this time. Once <literal>hasNext</literal> returns false, the CAS
         Multiplier gives up ownership of the input CAS and should no longer retain a reference to it.</para>
     </section>

     <section id="ugr.tug.cm.how_to_get_empty_cas_instance">
       <title>How to Get an Empty CAS Instance</title>
       <titleabbrev>Getting an empty CAS Instance</titleabbrev>

       <para>The CAS Multiplier&apos;s <literal>next</literal> method must return a CAS instance that holds
         a new representation of the input artifact. Since CAS instances are managed by the framework, the CAS
         Multiplier cannot actually create a new CAS; instead it should request an empty CAS by calling one of the methods:

         <programlisting>CAS getEmptyCAS()

 or

 JCas getEmptyJCas()</programlisting> which are
         defined on the <literal>CasMultiplier_ImplBase</literal> and
         <literal>JCasMultiplier_ImplBase</literal> classes, respectively.</para>

       <para>Note that if it is more convenient you can request an empty CAS during the <literal>process</literal> or
         <literal>hasNext</literal> methods, not just during the <literal>next</literal> method.</para>

       <para>By default, a CAS Multiplier is only allowed to hold one output CAS instance at a time. You must return the
         CAS from the <literal>next</literal> method before you can request a second CAS. If you call
         <literal>getEmptyCAS</literal> a second time before doing so, you will get an Exception. You can change this default behavior by overriding the
         method <literal>getCasInstancesRequired</literal> to return the number of CAS instances that you need.
         Be aware that CAS instances consume a significant amount of memory, so setting this to a large value will cause
         your application to use a lot of RAM. So, for example, it is not a good practice to attempt to generate a large
         number of new CASes in the CAS Multiplier&apos;s <literal>process</literal> method. Instead, you should
         spread your processing out across the calls to the <literal>hasNext</literal> or
         <literal>next</literal> methods.</para>

       <note><para>You can only call <literal>getEmptyCAS()</literal> or <literal>getEmptyJCas()</literal>
         from your CAS Multiplier's <literal>process</literal>, <literal>hasNext</literal>, or
         <literal>next</literal> methods.  You cannot call them from other methods such as
         <literal>initialize</literal>.  This is because the Aggregate AE's Type System is not available
         until all of the components of the aggregate have finished their initialization.
       </para></note>

       <para>The Type System of the empty CAS will contain all of the type definitions for all
         components of the outermost Aggregate Analysis Engine or Collection Processing Engine
         that contains your CAS Multiplier.  Therefore downstream components that receive
         these CASes can add new instances of any type that they define.</para>

       <warning><para>Be careful to keep the Feature Structures that belong to each CAS separate.  You
         cannot create references from a Feature Structure in one CAS to a Feature Structure in another CAS.
         You also cannot add a Feature Structure created in one CAS to the indexes of a different CAS.
         If you attempt to do this, the results are undefined.
       </para>
       </warning>
     </section>

     <section id="ugr.tug.cm.example_code">
       <title>Example Code</title>

       <para>This section walks through the source code of an example CAS Multiplier that breaks text documents into
         smaller pieces. The Java class for the example is
         <literal>org.apache.uima.examples.casMultiplier.SimpleTextSegmenter</literal> and the source
         code is included in the UIMA SDK under the <literal>examples/src</literal> directory.</para>

       <section id="ugr.tug.cm.example_code.overall_structure">
         <title>Overall Structure</title>


         <programlisting>public class SimpleTextSegmenter extends JCasMultiplier_ImplBase {
   private String mDoc;
   private int mPos;
   private int mSegmentSize;
   private String mDocUri;

   public void initialize(UimaContext aContext)
           throws ResourceInitializationException
   { ... }

   public void process(JCas aJCas) throws AnalysisEngineProcessException
   { ... }

   public boolean hasNext() throws AnalysisEngineProcessException
   { ... }

   public AbstractCas next() throws AnalysisEngineProcessException
   { ... }
 }</programlisting>

         <para>The <literal>SimpleTextSegmenter</literal> class extends
           <literal>JCasMultiplier_ImplBase</literal> and implements the optional
           <literal>initialize</literal> method as well as the required <literal>process</literal>,
           <literal>hasNext</literal>, and <literal>next</literal> methods. Each method is described
           below.</para>

       </section>

       <section id="ugr.tug.cm.example_code.initialize">
         <title>Initialize Method</title>


         <programlisting>public void initialize(UimaContext aContext) throws
                     ResourceInitializationException {
   super.initialize(aContext);
   mSegmentSize = ((Integer)aContext.getConfigParameterValue(
                             "segmentSize")).intValue();
 }</programlisting>

         <para>Like an Annotator, a CAS Multiplier can override the initialize method and read configuration
           parameter values from the UimaContext. The SimpleTextSegmenter defines one parameter, <quote>Segment
           Size</quote>, which determines the approximate size (in characters) of each segment that it will
           produce.</para>
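
         <para>As a hedged sketch, a parameter like this could be declared and given a value in the CAS
           Multiplier&apos;s descriptor roughly as follows. The name <literal>segmentSize</literal> is taken
           from the code listing above; the description text, the <literal>mandatory</literal> setting, and
           the value <literal>100000</literal> are illustrative assumptions, and the descriptor shipped with
           the SDK may differ. Marking the parameter mandatory seems reasonable here because the
           <literal>initialize</literal> code shown above dereferences the value without a null check.</para>

         <programlisting>&lt;!-- illustrative fragment, not copied from the SDK descriptor -->
 &lt;configurationParameters>
   &lt;configurationParameter>
     &lt;name>segmentSize&lt;/name>
     &lt;description>Approximate size, in characters, of each segment&lt;/description>
     &lt;type>Integer&lt;/type>
     &lt;multiValued>false&lt;/multiValued>
     &lt;mandatory>true&lt;/mandatory>
   &lt;/configurationParameter>
 &lt;/configurationParameters>

 &lt;configurationParameterSettings>
   &lt;nameValuePair>
     &lt;name>segmentSize&lt;/name>
     &lt;value>
       &lt;integer>100000&lt;/integer>   &lt;!-- assumed value -->
     &lt;/value>
   &lt;/nameValuePair>
 &lt;/configurationParameterSettings></programlisting>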

       </section>

       <section id="ugr.tug.cm.example_code.process">
         <title>Process Method</title>


         <programlisting>public void process(JCas aJCas)
        throws AnalysisEngineProcessException {
   mDoc = aJCas.getDocumentText();
   mPos = 0;
   // retrieve the filename of the input file from the CAS so that it can
   // be added to each segment
   FSIterator it = aJCas.
           getAnnotationIndex(SourceDocumentInformation.type).iterator();
   if (it.hasNext()) {
     SourceDocumentInformation fileLoc =
           (SourceDocumentInformation)it.next();
     mDocUri = fileLoc.getUri();
   }
   else {
     mDocUri = null;
   }
  }</programlisting>

         <para>The process method receives a new JCas to be processed (segmented) by this CAS Multiplier. The
           SimpleTextSegmenter extracts some information from this JCas and stores it in fields (the document text
           is stored in the field mDoc and the source URI in the field mDocUri). Recall that the CAS Multiplier is
           considered to <quote>own</quote> the JCas from the time when process is called until the time when hasNext
           returns false. Therefore it is acceptable to retain references to objects from the JCas in a CAS
           Multiplier, whereas this should never be done in an Annotator. The CAS Multiplier could have chosen to
           store a reference to the JCas itself, but that was not necessary for this example.</para>

         <para>The CAS Multiplier also initializes the mPos variable to 0. This variable is a position into the
           document text and will be incremented as each new segment is produced.</para>

       </section>

       <section id="ugr.tug.cm.example_code.hasnext">
         <title>HasNext Method</title>


         <programlisting>public boolean hasNext() throws AnalysisEngineProcessException {
   return mPos &lt; mDoc.length();
 }</programlisting>

         <para>The job of the hasNext method is to report whether there are any additional output CASes to produce. For
           this example, the CAS Multiplier will break the entire input document into segments, so we know there will
           always be a next segment until the very end of the document has been reached.</para>

       </section>

       <section id="ugr.tug.cm.example_code.next">
         <title>Next Method</title>


         <programlisting>public AbstractCas next() throws AnalysisEngineProcessException {
   int breakAt = mPos + mSegmentSize;
   if (breakAt > mDoc.length())
     breakAt = mDoc.length();

   // search for the next newline character.
   // Note: this example segmenter implementation
   // assumes that the document contains many newlines.
   // In the worst case, if this segmenter
   // is run on a document with no newlines,
   // it will produce only one segment containing the
   // entire document text.
   // A better implementation might specify a maximum segment size as
   // well as a minimum.

   while (breakAt &lt; mDoc.length() &amp;&amp;
          mDoc.charAt(breakAt - 1) != '\n')
     breakAt++;

   JCas jcas = getEmptyJCas();
   try {
     jcas.setDocumentText(mDoc.substring(mPos, breakAt));
     // if original CAS had SourceDocumentInformation,
     // also add SourceDocumentInformation to each segment
     if (mDocUri != null) {
       SourceDocumentInformation sdi =
           new SourceDocumentInformation(jcas);
       sdi.setUri(mDocUri);
       sdi.setOffsetInSource(mPos);
       sdi.setDocumentSize(breakAt - mPos);
       sdi.addToIndexes();

       if (breakAt == mDoc.length()) {
         sdi.setLastSegment(true);
       }
     }

     mPos = breakAt;
     return jcas;
   } catch (Exception e) {
     jcas.release();
     throw new AnalysisEngineProcessException(e);
   }
 }</programlisting>

         <para>The <literal>next</literal> method actually produces the next segment and returns it. The
           framework guarantees that it will not call <literal>next</literal> unless
           <literal>hasNext</literal> has returned true since the last call to <literal>process</literal> or
           <literal>next</literal>.</para>

         <para>Note that in order to produce a segment, the CAS Multiplier must get an empty JCas to populate. This is
           done by the line:</para>

         <programlisting>JCas jcas = getEmptyJCas();</programlisting>

         <para>This requests an empty JCas from the framework, which maintains a pool of JCas instances to draw
           from.</para>

         <para>Also, note the use of the <literal>try...catch</literal> block to ensure that a JCas is released back
           to the pool if an exception occurs. This is very important to allow a CAS Multiplier to recover from
           errors.</para>

       </section>
     </section>
   </section>

   <section id="ugr.tug.cm.creating_cm_descriptor">
     <title>Creating the CAS Multiplier Descriptor</title>
     <titleabbrev>CAS Multiplier Descriptor</titleabbrev>

     <para>There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are considered a type of
       Analysis Engine, and so their descriptors use the same syntax as any other Analysis Engine Descriptor.</para>

     <para>The descriptor for the <literal>SimpleTextSegmenter</literal> is the file
       <literal>examples/descriptors/cas_multiplier/SimpleTextSegmenter.xml</literal> in the
       UIMA SDK.</para>

     <para>The Analysis Engine Description, in its <quote>Operational Properties</quote> section, now contains a
       new <quote>outputsNewCASes</quote> property which takes a Boolean value. If the Analysis Engine is a CAS
       Multiplier, this property should be set to true.</para>

     <para>If you use the CDE, be sure to check the <quote>Outputs new CASes</quote> box in the Runtime Information
       section on the Overview page, as shown here:


       <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="5.2in" align="center" format="JPG" fileref="&imgroot;image002.jpg"/>
       </imageobject>
       <textobject><phrase>Screen shot of Component Descriptor Editor on Overview
         showing checking of "Outputs new CASes" box</phrase>
       </textobject>
     </mediaobject>
   </screenshot></para>

     <para>If you edit the Analysis Engine Descriptor by hand, you need to add a
       <literal>&lt;outputsNewCASes&gt;</literal> element to your descriptor as shown here:</para>


     <programlisting>&lt;operationalProperties&gt;
     &lt;modifiesCas&gt;false&lt;/modifiesCas&gt;
     &lt;multipleDeploymentAllowed&gt;true&lt;/multipleDeploymentAllowed&gt;
     <emphasis role="bold">&lt;outputsNewCASes&gt;true&lt;/outputsNewCASes&gt;</emphasis>
   &lt;/operationalProperties&gt;</programlisting>
     <note>
     <para>The <quote>modifiesCas</quote> operational property refers to the input CAS, not the new output CASes
       produced. So our example SimpleTextSegmenter has modifiesCas set to false since it doesn&apos;t modify the
       input CAS. </para></note>

   </section>

   <section id="ugr.tug.cm.using_cm_in_aae">
     <title>Using a CAS Multiplier in an Aggregate Analysis Engine</title>
     <titleabbrev>Using CAS Multipliers in Aggregates</titleabbrev>

     <para>You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For example, this allows
       you to construct an Aggregate Analysis Engine that takes each input CAS, breaks it up into segments, and runs a
       series of Annotators on each segment.</para>

     <section id="ugr.tug.cm.adding_cm_to_aggregate">
       <title>Adding the CAS Multiplier to the Aggregate</title>
       <titleabbrev>Aggregate: Adding the CAS Multiplier</titleabbrev>

       <para>Since CAS Multipliers are considered a type of Analysis Engine, adding them to an aggregate works the same
         way as for other Analysis Engines. Using the CDE, you just click the <quote>Add...</quote> button in the
         Component Engines view and browse to the Analysis Engine Descriptor of your CAS Multiplier. If editing the
         aggregate descriptor directly, just <literal>import</literal> the Analysis Engine Descriptor of your
         CAS Multiplier as usual.</para>
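
       <para>When editing by hand, the relevant parts of an aggregate descriptor might look roughly like the
         following sketch. The keys <literal>Segmenter</literal> and <literal>Tokenizer</literal> and the two
         import locations are hypothetical placeholders chosen for illustration; this is not the content of
         any descriptor shipped with the SDK. The <literal>&lt;flowConstraints></literal> element belongs
         inside the aggregate&apos;s <literal>&lt;analysisEngineMetaData></literal> element.</para>

       <programlisting>&lt;!-- illustrative fragment; keys and locations are assumptions -->
 &lt;delegateAnalysisEngineSpecifiers>
   &lt;delegateAnalysisEngine key="Segmenter">
     &lt;import location="MyCasMultiplier.xml"/>
   &lt;/delegateAnalysisEngine>
   &lt;delegateAnalysisEngine key="Tokenizer">
     &lt;import location="MyTokenizer.xml"/>
   &lt;/delegateAnalysisEngine>
 &lt;/delegateAnalysisEngineSpecifiers>

 &lt;!-- inside analysisEngineMetaData: the CAS Multiplier runs first -->
 &lt;flowConstraints>
   &lt;fixedFlow>
     &lt;node>Segmenter&lt;/node>
     &lt;node>Tokenizer&lt;/node>
   &lt;/fixedFlow>
 &lt;/flowConstraints></programlisting>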

       <para>An example descriptor for an Aggregate Analysis Engine containing a CAS Multiplier is provided in
         <literal>examples/descriptors/cas_multiplier/SegmenterAndTokenizerAE.xml</literal>. This
         Aggregate runs the <literal>SimpleTextSegmenter</literal> example to break a large document into
         segments, and then runs each segment through the <literal>SimpleTokenAndSentenceAnnotator</literal>.
         Try running it in the Document Analyzer tool with a large text file as input, to see that it outputs multiple
         output CASes, one for each segment produced by the <literal>SimpleTextSegmenter</literal>.</para>

     </section>

     <section id="ugr.tug.cm.cm_and_fc">
       <title>CAS Multipliers and Flow Control</title>

       <para>CAS Multipliers are only supported in the context of Fixed Flow or custom Flow Control. If you use the
         built-in <quote>Fixed Flow</quote> for your Aggregate Analysis Engine, you can position the CAS
         Multiplier anywhere in that flow. Processing then works as follows: When a CAS is input to the Aggregate AE,
         that CAS is routed to the components in the order specified by the Fixed Flow, until that CAS reaches a CAS
         Multiplier.</para>

       <para>Upon reaching a CAS Multiplier, if that CAS Multiplier produces new output CASes, then each output CAS
         from that CAS Multiplier will continue through the flow, starting at the node immediately after the CAS
         Multiplier in the Fixed Flow. No further processing will be done on the original input CAS after it has reached
         a CAS Multiplier &ndash; it will <emphasis>not</emphasis> continue in the flow.</para>

       <para>If the CAS Multiplier does <emphasis>not</emphasis> produce any output CASes for a given input CAS,
         then that input CAS <emphasis>will</emphasis> continue in the flow. This behavior is appropriate, for
         example, for a CAS Multiplier that may segment an input CAS into pieces but only does so if the input CAS is
         larger than a certain size.</para>

       <para>It is possible to put more than one CAS Multiplier in your flow. In this case, when a new CAS output from the
         first CAS Multiplier reaches the second CAS Multiplier and if the second CAS Multiplier produces output
         CASes, then no further processing will occur on the input CAS, and any new output CASes produced by the second
         CAS Multiplier will continue the flow starting at the node after the second CAS Multiplier.</para>

       <para>This default behavior can be customized. The <literal>FixedFlowController</literal> component
         that implements UIMA&apos;s default flow defines a configuration parameter
         <literal>ActionAfterCasMultiplier</literal> that can take the following values:</para>
       <itemizedlist>
         <listitem>
           <para><literal>continue</literal> &ndash; the CAS continues on to the next element in the flow</para>
         </listitem>
         <listitem>
           <para><literal>stop</literal> &ndash; the CAS will no longer continue in the flow, and will be returned
             from the aggregate if possible.</para>
         </listitem>
         <listitem>
           <para><literal>drop</literal> &ndash; the CAS will no longer continue in the flow, and will be dropped
             (not returned from the aggregate) if possible.</para>
         </listitem>
         <listitem>
           <para><literal>dropIfNewCasProduced</literal> (the default) &ndash; if the CAS multiplier produced
             a new CAS as a result of processing this CAS, then this CAS will be dropped. If not, then this CAS will
             continue.</para>
         </listitem>
       </itemizedlist>

       <para>You can override this parameter in your Aggregate Analysis Engine the same way you would override a
         parameter in a delegate Analysis Engine. But to do so you must first explicitly identify that you are using the
         <literal>FixedFlowController</literal> implementation by importing its descriptor into your
         aggregate as follows:</para>


       <programlisting>&lt;flowController key="FixedFlowController">
           &lt;import name="org.apache.uima.flow.FixedFlowController"/>
         &lt;/flowController>      </programlisting>

       <para>The parameter could then be overridden as, for example:</para>


       <programlisting>&lt;configurationParameters>
           &lt;configurationParameter>
             &lt;name>ActionForIntermediateSegments&lt;/name>
             &lt;type>String&lt;/type>
             &lt;multiValued>false&lt;/multiValued>
             &lt;mandatory>false&lt;/mandatory>
             &lt;overrides>
               &lt;parameter>
                 FixedFlowController/ActionAfterCasMultiplier
               &lt;/parameter>
             &lt;/overrides>
           &lt;/configurationParameter>
         &lt;/configurationParameters>

        &lt;configurationParameterSettings>
          &lt;nameValuePair>
            &lt;name>ActionForIntermediateSegments&lt;/name>
            &lt;value>
              &lt;string>drop&lt;/string>
            &lt;/value>
          &lt;/nameValuePair>
        &lt;/configurationParameterSettings></programlisting>

       <para>This overriding can also be done using the Component Descriptor Editor tool. An example of an Analysis
         Engine that overrides this parameter can be found in
         <literal>examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml</literal>. For more
         information about how to specify a flow controller as part of your Aggregate Analysis Engine descriptor, see
           <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc.adding_fc_to_aggregate"/>.</para>

       <para>If you would like to further customize the flow, you will need to implement a custom FlowController as
         described in <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/>. For example,
         you could implement a flow where a CAS that is input to a CAS Multiplier will be processed further by
         <emphasis>some</emphasis> downstream components, but not others.</para>

     </section>

     <section id="ugr.tug.cm.aggregate_cms">
       <title>Aggregate CAS Multipliers</title>

       <para>An important consideration when you put a CAS Multiplier inside an Aggregate Analysis Engine is whether
         you want the Aggregate to also function as a CAS Multiplier
         &ndash; that is, whether you want the new output CASes produced within the Aggregate to be output from the
         Aggregate. This is controlled by the <literal>&lt;outputsNewCASes&gt;</literal> element in the
         Operational Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as what was
         described in <xref linkend="ugr.tug.cm.creating_cm_descriptor"/>.</para>

       <para>If you set this property to <literal>true</literal>, then any new output CASes produced by a CAS
         Multiplier inside this Aggregate will be output from the Aggregate. Thus the Aggregate will function as a CAS
         Multiplier and can be used in any of the ways in which a primitive CAS Multiplier can be used.</para>

       <para>If you set the <literal>&lt;outputsNewCASes&gt;</literal> property to <literal>false</literal>, then any new output
         CASes produced by a CAS Multiplier inside the Aggregate will be dropped (i.e. the CASes will be released back
         to the pool) once they have finished being processed. Such an Aggregate Analysis Engine functions just like a
         <quote>normal</quote> non-CAS-Multiplier Analysis Engine; the fact that CAS Multiplication is
         occurring inside it is hidden from users of that Analysis Engine.</para> <note>
       <para>If you want to output some new Output CASes and not others, you need to implement a custom Flow Controller
         that makes this decision &mdash; see <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.fc.using_fc_with_cas_multipliers"/>. </para> </note>

     </section>
   </section>

   <section id="ugr.tug.cm.using_cm_in_cpe">
     <title>Using a CAS Multiplier in a Collection Processing Engine</title>
     <titleabbrev>CAS Multipliers in CPEs</titleabbrev>

     <para>It is currently a limitation that a CAS Multiplier cannot be deployed directly in a Collection Processing
       Engine. The only way that you can use a CAS Multiplier in a CPE is to first wrap it in an Aggregate Analysis Engine
       whose <literal>outputsNewCASes</literal> property is set to <literal>false</literal>, which in effect
       hides the existence of the CAS Multiplier from the CPE.</para>
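
     <para>For such a wrapping aggregate, the Operational Properties section might look roughly like the
       following sketch. Only the <literal>outputsNewCASes</literal> value is dictated by the text above; the
       values shown for <literal>modifiesCas</literal> and <literal>multipleDeploymentAllowed</literal> are
       assumptions that depend on the components inside your aggregate.</para>

     <programlisting>&lt;!-- illustrative fragment for the wrapping aggregate -->
 &lt;operationalProperties>
   &lt;modifiesCas>true&lt;/modifiesCas>                             &lt;!-- assumed -->
   &lt;multipleDeploymentAllowed>true&lt;/multipleDeploymentAllowed> &lt;!-- assumed -->
   &lt;outputsNewCASes>false&lt;/outputsNewCASes>
 &lt;/operationalProperties></programlisting>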

     <para>Note that you can build an Aggregate Analysis Engine that consists of CAS Multipliers and Annotators,
       followed by CAS Consumers. This can simulate what a CPE would do, but without the deployment and error handling
       options that the CPE provides.</para>

   </section>

   <section id="ugr.tug.cm.calling_cm_from_app">
     <title>Calling a CAS Multiplier from an Application</title>
     <titleabbrev>Applications: Calling CAS Multipliers</titleabbrev>

     <section id="ugr.tug.cm.retrieving_output_cases">
       <title>Retrieving Output CASes from the CAS Multiplier</title>
       <titleabbrev>Output CASes</titleabbrev>
       <para>The <literal>AnalysisEngine</literal> interface has the following methods that allow you to
         interact with a CAS Multiplier:
         <itemizedlist>
           <listitem>
             <para><literal>CasIterator processAndOutputNewCASes(CAS)</literal></para>
           </listitem>
           <listitem>
             <para><literal>JCasIterator processAndOutputNewCASes(JCas)</literal></para>
           </listitem>
         </itemizedlist></para>

       <para>From your application, you call <literal>processAndOutputNewCASes</literal> and pass it the input
         CAS. An iterator is returned that allows you to step through each of the new output CASes that are produced by
         the Analysis Engine.</para>

       <para>It is very important to realize that CASes are pooled objects and so your application must release each
         CAS (by calling the <literal>CAS.release()</literal> method) that it obtains from the CasIterator
         <emphasis>before</emphasis> it calls the <literal>CasIterator.next</literal> method again.
         Otherwise, the CAS pool will be exhausted and a deadlock will occur.</para>

       <para>The example code in the class
         <literal>org.apache.uima.examples.casMultiplier.CasMultiplierExampleApplication</literal>
         illustrates this. Here is the main processing loop:</para>


       <programlisting>CasIterator casIterator = ae.processAndOutputNewCASes(initialCas);
 while (casIterator.hasNext()) {
   CAS outCas = casIterator.next();

   //dump the document text and annotations for this segment
   System.out.println("********* NEW SEGMENT *********");
   System.out.println(outCas.getDocumentText());
   PrintAnnotations.printAnnotations(outCas, System.out);

   //release the CAS (important)
   outCas.release();
 }</programlisting>

       <para>Note that as defined by the CAS Multiplier contract in <xref
           linkend="ugr.tug.cm.cm_interface_overview"/>, the CAS Multiplier owns the input CAS
         (<literal>initialCas</literal> in the example) until the last new output CAS has been produced. This means
         that the application should not try to make changes to <literal>initialCas</literal> until after the
         <literal>CasIterator.hasNext</literal> method has returned false, indicating that the segmenter has
         finished.</para>

       <para>Note that the processing time of the Analysis Engine is spread out over the calls to the
         <literal>CasIterator</literal>&apos;s <literal>hasNext</literal> and <literal>next</literal> methods. That is, the next
         output CAS may not actually be produced and annotated until the application asks for it. So the application
         should not expect calls to the <literal>CasIterator</literal> to necessarily complete quickly.</para>

       <para>Also, calls to the <literal>CasIterator</literal> may throw Exceptions indicating an error has
         occurred during processing. If an Exception is thrown, all processing of the input CAS will stop, and no more
         output CASes will be produced. There is currently no error recovery mechanism that will allow processing to
         continue after an exception.</para>

     </section>
     <section id="ugr.tug.cm.using_cm_with_other_aes">
       <title>Using a CAS Multiplier with other Analysis Engines</title>
       <titleabbrev>CAS Multipliers with other AEs</titleabbrev>
      <para>In your application you can take the output CASes from a CAS Multiplier and pass them to
        the <literal>process</literal> method of other Analysis Engines.  However, there are some
        special considerations regarding the Type System of these CASes.</para>
      <para>By default, the output CASes of a CAS Multiplier have a Type System that contains all
        of the types and features declared by any component in the outermost Aggregate Analysis Engine or
        Collection Processing Engine that contains the CAS Multiplier.  If your application
        creates a CAS Multiplier and another Analysis Engine separately, without enclosing them in an aggregate,
        then the output CASes from the CAS Multiplier will not support any types or features that are
        declared in the latter Analysis Engine but not in the CAS Multiplier.
      </para>
       <para>This can be remedied by forcing the CAS Multiplier and Analysis Engine to share a single
         <literal>UimaContext</literal> when they are created, as follows:
       <programlisting>//create a "root" UIMA context for your whole application

 UimaContextAdmin rootContext =
    UIMAFramework.newUimaContext(UIMAFramework.getLogger(),
       UIMAFramework.newDefaultResourceManager(),
       UIMAFramework.newConfigurationManager());

 XMLInputSource input = new XMLInputSource("MyCasMultiplier.xml");
 AnalysisEngineDescription desc = UIMAFramework.getXMLParser().
         parseAnalysisEngineDescription(input);

 //create a UIMA Context for the new AE we are about to create

 //first argument is unique key among all AEs used in the application
 UimaContextAdmin childContext = rootContext.createChild(
         "myCasMultiplier", Collections.EMPTY_MAP);

 //instantiate CAS Multiplier AE, passing the UIMA Context through the
 //additional parameters map

 Map additionalParams = new HashMap();
 additionalParams.put(Resource.PARAM_UIMA_CONTEXT, childContext);

 AnalysisEngine casMultiplierAE = UIMAFramework.produceAnalysisEngine(
         desc,additionalParams);

 //repeat for another AE
 XMLInputSource input2 = new XMLInputSource("MyAE.xml");
 AnalysisEngineDescription desc2 = UIMAFramework.getXMLParser().
         parseAnalysisEngineDescription(input2);

 UimaContextAdmin childContext2 = rootContext.createChild(
         "myAE", Collections.EMPTY_MAP);

 Map additionalParams2 = new HashMap();
 additionalParams2.put(Resource.PARAM_UIMA_CONTEXT, childContext2);

 AnalysisEngine myAE = UIMAFramework.produceAnalysisEngine(
         desc2, additionalParams2);</programlisting>

       </para>
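
      <para>Once both engines have been produced, the output CASes of the CAS Multiplier can be handed to the
        second engine. The following is a sketch only: the class <literal>SegmentPipeline</literal> and its
        <literal>run</literal> method are hypothetical names, and the two engines are assumed to have been
        created with a shared root <literal>UimaContext</literal> as described in this section.</para>

      <programlisting>import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.analysis_engine.CasIterator;
import org.apache.uima.cas.CAS;

// Hypothetical helper class, shown for illustration only
public class SegmentPipeline {

  // Sends every output CAS of casMultiplierAE through myAE
  public static void run(AnalysisEngine casMultiplierAE,
          AnalysisEngine myAE, CAS initialCas)
          throws AnalysisEngineProcessException {
    CasIterator casIterator =
            casMultiplierAE.processAndOutputNewCASes(initialCas);
    while (casIterator.hasNext()) {
      CAS outCas = casIterator.next();
      try {
        // works only if outCas supports the types that myAE declares
        myAE.process(outCas);
      } finally {
        outCas.release();
      }
    }
  }
}</programlisting>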
     </section>

   </section>

   <section id="ugr.tug.cm.using_cm_to_merge_cases">
     <title>Using a CAS Multiplier to Merge CASes</title>
     <titleabbrev>Merging with CAS Multipliers</titleabbrev>

     <para>A CAS Multiplier can also be used to combine smaller CASes together to form larger CASes. In this section we
       describe how this works and walk through an example.</para>

     <section id="ugr.tug.cm.overview_of_how_to_merge_cases">
       <title>Overview of How to Merge CASes</title>
       <titleabbrev>CAS Merging Overview</titleabbrev>

       <orderedlist>
         <listitem>
           <para>When the framework first calls the CAS Multiplier&apos;s <literal>process</literal> method,
             the CAS Multiplier requests an empty CAS (which we'll call the "merged CAS") and copies relevant data
             from the input CAS into the merged CAS. The class
             <literal>org.apache.uima.util.CasCopier</literal> provides utilities for copying Feature
             Structures between CASes.</para>
         </listitem>

         <listitem>
           <para>When the framework then calls the CAS Multiplier&apos;s <literal>hasNext</literal> method, the
             CAS Multiplier returns <literal>false</literal> to indicate that it has no output at this
             time.</para>
         </listitem>

         <listitem>
           <para>When the framework calls <literal>process</literal> again with a new input CAS, the CAS
             Multiplier copies data from that input CAS into the merged CAS, combining it with the data that was
             previously copied.</para>
         </listitem>

         <listitem>
           <para>Eventually, when the CAS Multiplier decides that it wants to output the merged CAS, it returns
             <literal>true</literal> from the <literal>hasNext</literal> method, and then when the framework
             subsequently calls the <literal>next</literal> method, the CAS Multiplier returns the merged
             CAS.</para>
         </listitem>
      </orderedlist>

      <note>
        <para>There is no explicit call to flush out any pending CASes from a CAS Multiplier when collection processing
          completes. It is up to the application to provide some mechanism to let a CAS Multiplier recognize the last CAS
          in a collection so that it can ensure that its final output CASes are complete.</para>
      </note>
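
      <para>The call pattern in the steps above can be tried out without the UIMA framework. The following toy
        sketch uses plain strings in place of CASes; the class <literal>MergeProtocolSketch</literal> and the
        <literal>lastSegment</literal> argument are assumptions made for illustration and are not part of the
        UIMA API.</para>

      <programlisting>// Toy stand-in for a merging CAS Multiplier; strings play the role of CASes
final class MergeProtocolSketch {
  private final StringBuilder buffer = new StringBuilder();
  private boolean readyToOutput = false;

  // Steps 1 and 3: fold one input segment into the pending merged result
  void process(String segment, boolean lastSegment) {
    buffer.append(segment);
    if (lastSegment) {
      readyToOutput = true;
    }
  }

  // Steps 2 and 4: false while segments are still being collected
  boolean hasNext() {
    return readyToOutput;
  }

  // Step 4: hand over the merged result and reset for the next artifact
  String next() {
    if (!readyToOutput) {
      throw new IllegalStateException("No merged output is ready");
    }
    String merged = buffer.toString();
    buffer.setLength(0);
    readyToOutput = false;
    return merged;
  }

  public static void main(String[] args) {
    MergeProtocolSketch merger = new MergeProtocolSketch();
    String[] segments = {"The quick ", "brown fox ", "jumps."};
    for (int i = 0; i != segments.length; i++) {
      boolean last = (i == segments.length - 1);
      merger.process(segments[i], last);
      // the caller polls hasNext() after every call to process()
      while (merger.hasNext()) {
        System.out.println(merger.next());
      }
    }
  }
}</programlisting>

      <para>Run as written, the sketch prints a single merged line only after the third segment has been
        processed, which mirrors the way the framework sees no output until the merger reports that it is
        ready.</para>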
     </section>
     <section id="ugr.tug.cm.example_cas_merger">
       <title>Example CAS Merger</title>
      <para>An example CAS Multiplier that merges CASes is provided in the UIMA SDK. The Java class for
         this example is <literal>org.apache.uima.examples.casMultiplier.SimpleTextMerger</literal> and
         the source code is located under the <literal>examples/src</literal> directory.</para>
       <section id="ugr.tug.cm.example_cas_merger.process">
         <title>Process Method</title>
         <para>Almost all of the code for this example is in the <literal>process</literal> method. The first part of
           the <literal>process</literal> method shows how to copy Feature Structures from the input CAS to the
           "merged CAS":</para>


         <programlisting>public void process(JCas aJCas) throws AnalysisEngineProcessException {
     // procure a new CAS if we don't have one already
     if (mMergedCas == null) {
       mMergedCas = getEmptyJCas();
     }

     // append document text
     String docText = aJCas.getDocumentText();
     int prevDocLen = mDocBuf.length();
     mDocBuf.append(docText);

     // copy specified annotation types
     // CasCopier takes two args: the CAS to copy from.
     //                           the CAS to copy into.
     CasCopier copier = new CasCopier(aJCas.getCas(), mMergedCas.getCas());

     // needed in case one annotation is in two indexes (could
     // happen if specified annotation types overlap)
     Set copiedIndexedFs = new HashSet();
     for (int i = 0; i &lt; mAnnotationTypesToCopy.length; i++) {
       Type type = mMergedCas.getTypeSystem()
           .getType(mAnnotationTypesToCopy[i]);
       FSIndex index = aJCas.getCas().getAnnotationIndex(type);
       Iterator iter = index.iterator();
       while (iter.hasNext()) {
         FeatureStructure fs = (FeatureStructure) iter.next();
         if (!copiedIndexedFs.contains(fs)) {
           Annotation copyOfFs = (Annotation) copier.copyFs(fs);
           // update begin and end
           copyOfFs.setBegin(copyOfFs.getBegin() + prevDocLen);
           copyOfFs.setEnd(copyOfFs.getEnd() + prevDocLen);
           mMergedCas.addFsToIndexes(copyOfFs);
           copiedIndexedFs.add(fs);
         }
       }
     }</programlisting>

         <para>The <literal>CasCopier</literal> class is used to copy Feature Structures of certain types
           (specified by a configuration parameter) to the merged CAS. The <literal>CasCopier</literal> does deep
           copies, meaning that if the copied FeatureStructure references another FeatureStructure, the
           referenced FeatureStructure will also be copied.</para>

         <para>This example also merges the document text using a separate <literal>StringBuffer</literal>. Note
           that we cannot append document text to the Sofa data of the merged CAS because Sofa data cannot be modified
           once it is set.</para>
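
        <para>The reason for adding <literal>prevDocLen</literal> to the begin and end offsets can be checked in
          isolation. The following sketch is independent of UIMA: <literal>Span</literal> is a hypothetical
          stand-in for an annotation, and <literal>OffsetShiftSketch</literal> is an illustrative name, not a
          class from the SDK.</para>

        <programlisting>// Hypothetical stand-in for a UIMA annotation: a labelled character span
final class Span {
  final String label;
  final int begin;
  final int end;

  Span(String label, int begin, int end) {
    this.label = label;
    this.begin = begin;
    this.end = end;
  }

  // Returns the text that this span covers in the given document
  String coveredText(String document) {
    return document.substring(begin, end);
  }
}

final class OffsetShiftSketch {
  // Copies a span, moving it right by the length of the text already merged
  static Span shift(Span span, int prevDocLen) {
    return new Span(span.label, span.begin + prevDocLen,
            span.end + prevDocLen);
  }

  public static void main(String[] args) {
    String first = "Alice met ";
    String second = "Bob today.";
    // offsets are relative to the second segment only
    Span bob = new Span("Name", 0, 3);

    StringBuilder docBuf = new StringBuilder(first);
    int prevDocLen = docBuf.length();
    docBuf.append(second);

    // after shifting, the span points at the same text in the merged document
    Span shifted = shift(bob, prevDocLen);
    System.out.println(shifted.coveredText(docBuf.toString())); // prints "Bob"
  }
}</programlisting>

        <para>Without the shift, the span would still cover offsets 0 to 3 and would select text from the first
          segment instead.</para>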

         <para>The remainder of the <literal>process</literal> method determines whether it is time to output a new
           CAS. For this example, we are attempting to merge all CASes that are segments of one original artifact. This
           is done by checking the
           <code>SourceDocumentInformation</code> Feature Structure in the CAS to see if its
           <code>lastSegment</code> feature is set to <literal>true</literal>. That feature (which is set by the
           example
           <code>SimpleTextSegmenter</code> discussed previously) marks the CAS as being the last segment of an
           artifact, so when the CAS Multiplier sees this segment it knows it is time to produce an output CAS.</para>


         <programlisting>// get the SourceDocumentInformation FS,
 // which indicates the sourceURI of the document
 // and whether the incoming CAS is the last segment
 FSIterator it = aJCas
         .getAnnotationIndex(SourceDocumentInformation.type).iterator();
 if (!it.hasNext()) {
   throw new RuntimeException("Missing SourceDocumentInformation");
 }
 SourceDocumentInformation sourceDocInfo =
       (SourceDocumentInformation) it.next();
 if (sourceDocInfo.getLastSegment()) {
   // time to produce an output CAS
   // set the document text
   mMergedCas.setDocumentText(mDocBuf.toString());

   // add source document info to destination CAS
   SourceDocumentInformation destSDI =
       new SourceDocumentInformation(mMergedCas);
   destSDI.setUri(sourceDocInfo.getUri());
   destSDI.setOffsetInSource(0);
   destSDI.setLastSegment(true);
   destSDI.addToIndexes();

   mDocBuf = new StringBuffer();
   mReadyToOutput = true;
 }</programlisting>

         <para>When it is time to produce an output CAS, the CAS Multiplier makes final updates to the merged CAS
           (setting the document text and adding a <literal>SourceDocumentInformation</literal>
           FeatureStructure), and then sets the <literal>mReadyToOutput</literal> field to true. This field is
           then used in the <literal>hasNext</literal> and <literal>next</literal> methods.</para>
       </section>
       <section id="ugr.tug.cm.example_cas_merger.hasnext_and_next">
         <title>HasNext and Next Methods</title>
         <para>These methods are relatively simple:</para>


         <programlisting>public boolean hasNext() throws AnalysisEngineProcessException {
     return mReadyToOutput;
   }

   public AbstractCas next() throws AnalysisEngineProcessException {
     if (!mReadyToOutput) {
       throw new RuntimeException("No next CAS");
     }
     JCas casToReturn = mMergedCas;
     mMergedCas = null;
     mReadyToOutput = false;
     return casToReturn;
   }</programlisting>
         <para>When the merged CAS is ready to be output, <literal>hasNext</literal> will return true, and
           <literal>next</literal> will return the merged CAS, taking care to set the
           <literal>mMergedCas</literal> field to
           <code>null</code> so that the next call to
           <code>process</code> will start with a fresh CAS.</para>
       </section>
     </section>
     <section id="ugr.tug.cm.using_the_simple_text_merger_in_an_aggregate_ae">
       <title>Using the SimpleTextMerger in an Aggregate Analysis Engine</title>
       <titleabbrev>SimpleTextMerger in an Aggregate</titleabbrev>

       <para>An example descriptor for an Aggregate Analysis Engine that uses the
         <literal>SimpleTextMerger</literal> is provided in
         <literal>examples/descriptors/cas_multiplier/Segment_Annotate_Merge_AE.xml</literal>. This
         Aggregate first runs the <literal>SimpleTextSegmenter</literal> example to break a large document into
         segments. It then runs each segment through the example tokenizer and name recognizer annotators. Finally
         it runs the <literal>SimpleTextMerger</literal> to reassemble the segments back into one CAS. The
         <literal>Name</literal> annotations are copied to the final merged CAS but the <literal>Token</literal>
         annotations are not.</para>
       <para>This example illustrates how you can break large artifacts into pieces for more efficient processing
         and then reassemble a single output CAS containing only the results most useful to the application.
         Intermediate results such as tokens, which may consume a lot of space, need not be retained over the entire
         input artifact.</para>

       <para>The intermediate segments are dropped and are never output from the Aggregate Analysis Engine.  This
         is done by configuring the Fixed Flow Controller as described in
         <xref linkend="ugr.tug.cm.cm_and_fc"/>, above.</para>

       <para>Try running this Analysis Engine in the Document Analyzer tool with a large text file as input, to see that
         it outputs just one CAS per input file, and that the final CAS contains only the <literal>Name</literal> annotations. </para>
     </section>
   </section>
 </chapter>