
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
 <!ENTITY imgroot "../images/references/ref.xml.cpe_descriptor/">
 <!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
 <!ENTITY % uimaents SYSTEM "../entities.ent" >
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <chapter id="ugr.ref.xml.cpe_descriptor">
   <title>Collection Processing Engine Descriptor Reference</title>
   <titleabbrev>CPE Descriptor Reference</titleabbrev>

   <para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
     of UIMA components assembled to analyze a collection of artifacts. A CPE is an
     instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
     which defines the collection processing components, interfaces, and APIs. A CPE is
     executed by a UIMA framework component called the <emphasis>Collection Processing
     Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
     running CPEs, and handling errors.</para>

   <para>A CPE can be assembled programmatically within a Java application, or it can be
     assembled declaratively via a CPE configuration specification, called a CPE
     Descriptor. This chapter describes the format of the CPE Descriptor.</para>

   <para>Details about the CPE, including its function, sub-components, APIs, and related
     tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
       targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
     provide context for the later sections that describe the CPE Descriptor.</para>

   <section id="&tp;overview">
     <title>CPE Overview</title>

     <figure id="&tp;overview.fig.runtime">
       <title>CPE Runtime Overview</title>
       <mediaobject>
         <imageobject>
           <imagedata width="5.8in" format="PNG"
             fileref="&imgroot;image002.png"/>
         </imageobject>
         <textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
       </mediaobject>
     </figure>

     <para>An illustration of the CPE runtime is shown in <xref
         linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
       <emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
       internal to the CPE, but their behavior and deployment may be configured using the CPE
       Descriptor. Other CPE components, such as the <emphasis>Collection
       Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
       configured externally from the CPE and then plugged in to the CPE to create the overall
       engine. The parts of a CPE are:

       <variablelist>
         <varlistentry>
           <term>Collection Reader</term>
           <listitem><para>understands the native data collection format and iterates
             over the collection producing subjects of analysis</para></listitem>
         </varlistentry>

         <varlistentry>
           <term>CAS Initializer<footnote><para>Deprecated</para></footnote>
             </term>
           <listitem><para>initializes a CAS with a subject of analysis</para>
             </listitem>
         </varlistentry>

         <varlistentry>
           <term>Artifact Producer</term>
           <listitem><para>asynchronously pulls CASes from the Collection Reader,
             creates batches of CASes and puts them into the work queue</para></listitem>
         </varlistentry>

         <varlistentry>
           <term>Work Queue</term>
           <listitem><para>shared queue containing batches of CASes queued by the Artifact
             Producer for analysis by Analysis Engines</para>
           </listitem>
         </varlistentry>

         <varlistentry>
           <term>B1-Bn</term>
           <listitem><para>individual batches containing 1 or more CASes</para>
             </listitem>
         </varlistentry>

         <varlistentry>
           <term>AE1-AEn</term>
           <listitem><para>Analysis Engines arranged by a CPE descriptor</para>
             </listitem>
         </varlistentry>

         <varlistentry>
           <term>Processing Pipelines</term>
           <listitem><para>each pipeline runs in a separate thread and contains a
             replicated set of the Analysis Engines running in the defined sequence</para>
             </listitem>
         </varlistentry>

         <varlistentry>
           <term>Output Queue</term>
           <listitem><para>holds batches of CASes with analysis results intended for CAS
             Consumers</para></listitem>
         </varlistentry>

         <varlistentry>
           <term>CAS Consumers</term>
           <listitem><para>perform collection level analysis over the CASes and extract
             analysis results, e.g., creating indexes or databases</para></listitem>
         </varlistentry>
       </variablelist>
       </para>
   </section>

   <section id="&tp;notation">
     <title>Notation</title>

     <para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
       the syntax of CPE Descriptors.</para>

     <para>The notation used in this chapter is:

       <itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
         that the substructure of that element has been omitted (to be described in another
         section of this chapter). An example of this would be:


         <programlisting>&lt;collectionReader&gt;
 ...
 &lt;/collectionReader&gt;</programlisting></para>
         </listitem>

         <listitem><para>An ellipsis immediately after an element indicates that the
           element type may be repeated arbitrarily many times. For example:


           <programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
 &lt;parameter&gt;[String]&lt;/parameter&gt;
 ...</programlisting>
           indicates that there may be arbitrarily many parameter elements in this
           context.</para></listitem>

         <listitem><para>An ellipsis inside an element&apos;s start tag means that the attributes
           associated with that element are defined later, e.g.:

           <programlisting>&lt;casProcessor ...&gt;</programlisting></para>
           </listitem>

         <listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
           indicate the type of value that may be used at that location.</para></listitem>

         <listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
           alternatives. This can be applied to literal values, bracketed type names, and
           elements. </para></listitem></itemizedlist></para>

     <para>Which elements are optional and which are required is specified in prose, not in the
       syntax definition.</para>

   </section>

   <section id="&tp;imports">
     <title>Imports</title>

     <para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
       as other component descriptors.  This allows referring to component
       descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
       or the classpath/datapath.  For details see <olink targetdoc="&uima_docs_ref;"
       targetptr="ugr.ref.xml.component_descriptor"/>.</para>
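
     <para>As a minimal sketch of the <literal>import</literal> form, a
       <literal>&lt;descriptor&gt;</literal> element can reference a component either by
       location or by name. The file path and the dotted name below are hypothetical
       placeholders, not files shipped with UIMA:

       <programlisting><![CDATA[<!-- Resolved relative to the location of the CPE descriptor -->
<descriptor>
    <import location="descriptors/MyAnnotator.xml"/>
</descriptor>

<!-- Resolved by name through the classpath/datapath -->
<descriptor>
    <import name="com.example.descriptors.MyAnnotator"/>
</descriptor>]]></programlisting></para>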

     <para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:

       <programlisting><![CDATA[<descriptor>
     <include href="[URL or File]"/>
 </descriptor>]]></programlisting></para>

     <para>The <literal>href</literal> attribute value, shown as <literal>[URL or File]</literal>, is a URL or a filename for the descriptor of the
       incorporated component. An attempt is first made to resolve the value as a URL.</para>

     <para>
       Relative paths in an <literal>include</literal> are resolved relative to the current working directory
       (NOT the CPE descriptor location as is the case for <literal>import</literal>).
       A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
       variable, e.g.,
     <programlisting>&lt;descriptor&gt;
     &lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
 &lt;/descriptor&gt;</programlisting>

       In this case, the value for the <literal>CPM_HOME</literal> variable must be
       provided to the CPE by specifying it on the Java command line, e.g.,

     <programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>

   </para>

   </section>

   <section id="&tp;descriptor">
     <title>CPE Descriptor Overview</title>

     <para>A CPE Descriptor consists of information describing the following four main
       elements.</para>

     <orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
       is responsible for gathering artifacts and initializing the Common Analysis
       Structure (CAS) used to support processing in the UIMA collection processing
       engine.</para></listitem>

       <listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
         analyzing individual artifacts, analyzing across artifacts, and extracting
         analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
         and <emphasis>CAS Consumers</emphasis>.</para></listitem>

       <listitem><para>Operational parameters of the <emphasis>Collection Processing
         Manager</emphasis> (CPM), such as checkpoint frequency and deployment
         mode.</para></listitem>

       <listitem><para>Resource Manager Configuration (optional). </para></listitem>
       </orderedlist>

     <para>The CPE Descriptor has the following high level skeleton:


       <programlisting><![CDATA[<?xml version="1.0"?>
 <cpeDescription>
    <collectionReader>
 ...
    </collectionReader>
    <casProcessors>
 ...
    </casProcessors>
    <cpeConfig>
 ...
    </cpeConfig>
    <resourceManagerConfiguration>
 ...
    </resourceManagerConfiguration>
 </cpeDescription>]]></programlisting></para>

     <para>Details of each of the four main elements are described in the sections that
       follow.</para>
  </section>
     <section id="&tp;descriptor.collection_reader">
       <title>Collection Reader</title>

       <para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
         Collection Reader and optional CAS Initializer that are to be used in the CPE. The
         Collection Reader is responsible for retrieval of artifacts from a collection
         outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
         is responsible for initializing the CAS with the artifact.</para>

       <para>A Collection Reader may initialize the CAS itself, in which case it does not
         require a CAS Initializer. This should be clearly specified in the documentation for
         the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
         does not make use of a CAS Initializer will not cause an error, but the specified CAS
         Initializer will not be used.</para>

       <para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
         section is:


         <programlisting><![CDATA[<collectionReader>
   <collectionIterator>
     <descriptor>
       <import .../> | <include .../>
     </descriptor>
     <configurationParameterSettings>...</configurationParameterSettings>
     <sofaNameMappings>...</sofaNameMappings>
   </collectionIterator>
   <casInitializer>
     <descriptor>
       <import .../> | <include .../>
     </descriptor>
     <configurationParameterSettings>...</configurationParameterSettings>
     <sofaNameMappings>...</sofaNameMappings>
   </casInitializer>
 </collectionReader>]]></programlisting></para>

       <para>The <literal>&lt;collectionIterator&gt;</literal> identifies the
         descriptor for the Collection Reader, and the
         <literal>&lt;casInitializer&gt;</literal> identifies the descriptor for the
         CAS Initializer. The format and
         details of the Collection Reader and CAS Initializer descriptors are described in
           <olink targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
         The <literal>&lt;configurationParameterSettings&gt;</literal> and the
         <literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
         section.</para>
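
       <para>The following sketch shows how a <literal>&lt;collectionReader&gt;</literal>
         section without a CAS Initializer might look. The descriptor location, the parameter
         name <literal>InputDirectory</literal>, and its value are assumptions made for
         illustration; the parameter must be one that the referenced Collection Reader
         descriptor actually declares:

         <programlisting><![CDATA[<collectionReader>
  <collectionIterator>
    <descriptor>
      <!-- hypothetical path, resolved relative to this CPE descriptor -->
      <import location="readers/MyCollectionReader.xml"/>
    </descriptor>
    <configurationParameterSettings>
      <nameValuePair>
        <!-- hypothetical parameter declared by the reader -->
        <name>InputDirectory</name>
        <value>
          <string>/data/input</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
  </collectionIterator>
</collectionReader>]]></programlisting></para>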

       <section id="&tp;descriptor.collection_reader.error_handling">
         <title>Error handling for Collection Readers</title>

         <para>The CPM will abort if the Collection Reader throws a large number of
           consecutive exceptions (default = 100). This default can be changed by setting the
           Java system property <literal>MaxCRErrorThreshold</literal> on the command line, e.g.,
           <literal>-DMaxCRErrorThreshold=xxx</literal>.</para>
       </section>
     </section>

     <section id="&tp;descriptor.cas_processors">
       <title>CAS Processors</title>

       <para>The <literal>&lt;casProcessors&gt;</literal> section identifies the
         components that perform the analysis on the input data, including CAS analysis
         (Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
         Consumers may also perform collection level analysis, where the analysis is
         performed (or aggregated) over multiple CASes. The basic structure of the CAS
         Processors section is:


         <programlisting><![CDATA[<casProcessors
     dropCasOnException="true|false"
     casPoolSize="[Number]"
     processingUnitThreadCount="[Number]">

   <casProcessor ...>
         ...
   </casProcessor>

   <casProcessor ...>
         ...
   </casProcessor>
     ...
 </casProcessors>]]></programlisting></para>

       <para>The <literal>&lt;casProcessors&gt;</literal> section has two mandatory
         attributes and one optional attribute that configure the characteristics of the CAS
         Processor flow in the CPE. The first mandatory attribute is
         <literal>casPoolSize</literal>, which
         defines the fixed number of CAS instances that the CPM will create and use during
         processing. All CAS instances are maintained in a CAS Pool with check-in and
         check-out access. Each CAS is checked out from the CAS Pool by the Collection Reader
         and initialized with an initial subject of analysis. The CAS is checked back in to the
         CAS Pool when it is completely processed, at the end of the processing chain. A larger
         CAS Pool size will result in more memory being used by the CPM. CAS objects can be large,
         and care should be taken to determine the optimum size of the CAS Pool, weighing memory
         use against performance.</para>

       <para>The second mandatory <literal>&lt;casProcessors&gt;</literal> attribute
         is <literal>processingUnitThreadCount</literal>, which specifies the number of
         replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
         Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
         each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
         contains one or more Analysis Engines invoked in a given sequence. If more than one
         Processing Pipeline is specified, the CPM replicates instances of each Analysis
         Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
         independently, consuming CASes from the work queue and depositing CASes with analysis
         results onto the output queue. On multiprocessor machines, multiple Processing
         Pipelines can run in parallel, improving overall throughput of the CPM.</para>
       <note><para>The number of Processing Pipelines should be equal to or greater than the CAS
       Pool size. </para></note>

       <para>Elements in the pipeline (each represented by a
         <literal>&lt;casProcessor&gt;</literal> element) may indicate in their Analysis Engine
         descriptor that they do not permit multiple deployment. If so, even though multiple
         pipelines are being used, all CASes passing
         through the pipelines will be routed through one instance of these marked Engines.
         </para>

       <para>The final, optional, <literal>&lt;casProcessors&gt;</literal> attribute is
         <literal>dropCasOnException</literal>. It defines a policy that determines what
         happens with the CAS when an exception happens during processing. If the value of this
         attribute is set to true and an exception happens, the CPM will notify all registered
         listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS and check the CAS
         back into the CAS Pool so that it can be re-used. The presumption is that an exception
         may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
         to move through the processing chain. When this attribute is omitted the CPM&apos;s
         default is the same as specifying
         <literal>dropCasOnException="false"</literal>.</para>
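
       <para>Putting the three attributes together, a <literal>&lt;casProcessors&gt;</literal>
         start tag might be written as in the sketch below. The numeric values are arbitrary
         illustrations rather than recommendations; suitable values depend on available memory
         and processors:

         <programlisting><![CDATA[<!-- 2 CAS instances in the pool, 2 replicated Processing Pipelines;
     a CAS is dropped and returned to the pool if processing throws -->
<casProcessors casPoolSize="2"
    processingUnitThreadCount="2"
    dropCasOnException="true">

  <casProcessor ...>
        ...
  </casProcessor>
</casProcessors>]]></programlisting></para>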

       <section id="&tp;descriptor.cas_processors.individual">
         <title>Specifying an Individual CAS Processor</title>

         <para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
           pipeline are specified with the <literal>&lt;casProcessor&gt;</literal>
           element, which appears within the <literal>&lt;casProcessors&gt;</literal>
           element. It may appear multiple times, once for each CAS Processor specified for
           this CPE.</para>

         <para>The order of the <literal>&lt;casProcessor&gt;</literal> elements within
           the <literal>&lt;casProcessors&gt;</literal> section specifies the order in
           which the CAS Processors will run. Although CAS Consumers are usually put at the end
           of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
           Consumers.</para>

         <para>The overall format of the <literal>&lt;casProcessor&gt;</literal> element
           is:


           <programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" >
     <descriptor>
       <import .../> | <include .../>
     </descriptor>
     <configurationParameterSettings>...</configurationParameterSettings>
     <sofaNameMappings>...</sofaNameMappings>
     <runInSeparateProcess>...</runInSeparateProcess>
     <deploymentParameters>...</deploymentParameters>
     <filter/>
     <errorHandling>...</errorHandling>
     <checkpoint batch="[Number]"/>
 </casProcessor>]]></programlisting></para>

         <para>The <literal>&lt;casProcessor&gt;</literal> element has two mandatory
           attributes, <literal>deployment</literal> and <literal>name</literal>. The
           mandatory <literal>name</literal> attribute specifies a unique string
           identifying the CAS Processor.</para>

         <para>The mandatory <literal>deployment</literal> attribute specifies the CAS
           Processor deployment mode. Currently, three deployment options are supported:

           <variablelist>
             <varlistentry>
               <term>integrated</term>
               <listitem><para>indicates <emphasis>integrated</emphasis> deployment
                 of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
                 same process space as the CPM. This type of deployment is recommended to
                 increase the performance of the CPE. However, it is NOT recommended to
                 deploy annotators containing JNI this way. Such CAS Processors may cause a
                 fatal exception and force the JVM to exit without cleanup (bringing down the
                 CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
                 this way.</para>
                 <para>The descriptor for an integrated deployment can, in fact, be a remote
                   service descriptor. When used this way, however, the CPM error recovery
                   options (see below) operate in the integrated mode, which means that many
                   of the retry options are not available.</para></listitem>
             </varlistentry>
             <varlistentry>
               <term>remote</term>
               <listitem><para>indicates <emphasis>non-managed</emphasis>
                 deployment of the CAS Processor. The CAS Processor descriptor referenced
                 in the <literal>&lt;descriptor&gt;</literal> element must be a Vinci
                 <emphasis>Service Client Descriptor</emphasis>, which identifies a
                 remotely deployed CAS Processor service (see <olink
                   targetdoc="&uima_docs_tutorial_guides;"
                   targetptr="ugr.tug.application.remote_services"/>). The CPM
                 assumes that the CAS Processor is already running as a remote service and
                 will connect to it using the URI provided in the client service descriptor.
                 The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
                 so appropriate infrastructure should be in place to start/restart such CAS
                 Processors when necessary. This deployment provides fault isolation and
                 is implementation (i.e., programming language) neutral.</para>
                 </listitem>
             </varlistentry>
             <varlistentry>
               <term>local</term>
               <listitem><para>indicates <emphasis>managed</emphasis> deployment of
                 the CAS Processor. The CAS Processor descriptor referenced in the
                 <literal>&lt;descriptor&gt;</literal> element must be a Vinci
                 <emphasis>Service Deployment Descriptor</emphasis>, which configures
                 a CAS Processor for deployment as a Vinci service (see <olink
                   targetdoc="&uima_docs_tutorial_guides;"
                   targetptr="ugr.tug.application.remote_services"/>). The CPM
                 deploys the CAS Processor in a separate process and manages the life cycle
                 (start/stop) of the CAS Processor. Communication between the CPM and the
                 CAS Processor is done with Vinci. When the CPM completes processing, the
                 process containing the CAS Processor is terminated. This deployment mode
                 insulates the CPM from the CAS Processor, creating a more robust deployment
                 at the cost of a small communication overhead. On multiprocessor machines,
                 the separate processes may run concurrently and improve overall
                 throughput.</para></listitem>
             </varlistentry>
           </variablelist></para>
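
         <para>As an illustration of the deployment modes described above, the sketch below
           declares one integrated and one remote CAS Processor. The
           <literal>name</literal> values and descriptor locations are hypothetical, and the
           remaining child elements are elided using the notation of this chapter:

           <programlisting><![CDATA[<!-- Runs in the same process as the CPM; references an
     Analysis Engine descriptor -->
<casProcessor deployment="integrated" name="Example Title Annotator">
    <descriptor>
        <import location="annotators/ExampleTitleAnnotator.xml"/>
    </descriptor>
    ...
</casProcessor>

<!-- Connects to an already running Vinci service; references a
     Service Client Descriptor -->
<casProcessor deployment="remote" name="Example Remote Service">
    <descriptor>
        <import location="services/ExampleServiceClient.xml"/>
    </descriptor>
    ...
</casProcessor>]]></programlisting></para>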

         <para>A number of elements may appear within the
           <literal>&lt;casProcessor&gt;</literal> element.</para>

         <section id="&tp;descriptor.cas_processors.individual.descriptor">
           <title>&lt;descriptor&gt; Element</title>

           <para>The <literal>&lt;descriptor&gt;</literal> element is mandatory. It
             identifies the descriptor for the referenced CAS Processor using the syntax
             described in <olink targetdoc="&uima_docs_ref;"
               targetptr="ugr.ref.xml.component_descriptor.aes"/>.

             <itemizedlist spacing="compact"><listitem><para>For
               <emphasis><literal>remote</literal></emphasis> CAS Processors, the
               referenced descriptor must be a Vinci <emphasis>Service Client
               Descriptor</emphasis>, which identifies a remotely deployed CAS Processor
               service.</para></listitem>

               <listitem><para>For <emphasis>local</emphasis> CAS Processors, the
                 referenced descriptor must be a Vinci <emphasis>Service Deployment
                 Descriptor</emphasis>.</para></listitem>

               <listitem><para>For <emphasis>integrated</emphasis> CAS Processors,
                 the referenced descriptor must be an Analysis Engine Descriptor
                 (primitive or aggregate). </para></listitem></itemizedlist> </para>

           <para>See <olink targetdoc="&uima_docs_tutorial_guides;"
               targetptr="ugr.tug.application.remote_services"/> for more
             information on creating these descriptors and deploying services.</para>

         </section>

         <section
           id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings">
           <title>&lt;configurationParameterSettings&gt; Element</title>

           <para>This element provides a way to override the contained Analysis
             Engine&apos;s parameter settings. Any entry specified here must already be
             defined; values specified replace the corresponding values for each
             parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
             is only available when they are deployed in <quote>integrated</quote>
             mode.</emphasis> For Collection Readers and CAS Initializers, it is always
             available.</para>

           <para>The content of this element uses the same syntax as the component descriptor
             uses for specifying parameter settings (in the case where no parameter groups are
             specified)<footnote><para>An earlier UIMA version required these to have a
             suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. This is no
             longer required, but the suffixed form is still accepted for backward
             compatibility.</para></footnote>. Here is an example:


             <programlisting><![CDATA[<configurationParameterSettings>
   <nameValuePair>
     <name>CivilianTitles</name>
     <value>
       <array>
         <string>Mr.</string>
         <string>Ms.</string>
         <string>Mrs.</string>
         <string>Dr.</string>
       </array>
     </value>
   </nameValuePair>
   ...
 </configurationParameterSettings>]]></programlisting></para>
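
           <para>Scalar parameters are overridden in the same way, with the value element
             naming the parameter type instead of wrapping an
             <literal>&lt;array&gt;</literal>. In the sketch below the parameter names
             <literal>MinimumConfidence</literal> and <literal>CaseSensitive</literal> are
             hypothetical; each would have to be declared in the overridden component&apos;s
             descriptor with a matching type:

             <programlisting><![CDATA[<configurationParameterSettings>
  <nameValuePair>
    <!-- hypothetical Float parameter -->
    <name>MinimumConfidence</name>
    <value>
      <float>0.75</float>
    </value>
  </nameValuePair>
  <nameValuePair>
    <!-- hypothetical Boolean parameter -->
    <name>CaseSensitive</name>
    <value>
      <boolean>true</boolean>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting></para>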

         </section>

         <section
           id="&tp;descriptor.cas_processors.individual.sofa_name_mappings">
           <title>&lt;sofaNameMappings&gt; Element</title>

           <para>This optional element maps Sofa names defined in the
             component, or the default Sofa name (if the component does not declare any Sofa
             names), to Sofa names used in the CPE. The form of this element is:


             <programlisting>&lt;sofaNameMappings&gt;
   &lt;sofaNameMapping cpeSofaName="a_CPE_name"
                    componentSofaName="a_component_Name"/&gt;
   ...
 &lt;/sofaNameMappings&gt;</programlisting></para>

           <para>There can be any number of
             <literal>&lt;sofaNameMapping&gt;</literal> elements contained in the
             <literal>&lt;sofaNameMappings&gt;</literal> element. The
             <literal>componentSofaName</literal> attribute is optional; leave it out to
             specify a mapping for the <literal>_InitialView</literal>, that is, for
             Single-View components.</para>
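
           <para>The two cases can be sketched as follows. All Sofa names shown
             (<literal>RawDocument</literal>, <literal>InputText</literal>,
             <literal>EnglishText</literal>) are hypothetical and chosen only for
             illustration:

             <programlisting><![CDATA[<!-- Multi-view component: its "InputText" Sofa is mapped to
     the CPE Sofa "RawDocument" -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="RawDocument"
                   componentSofaName="InputText"/>
</sofaNameMappings>

<!-- Single-View component: componentSofaName is omitted, so the
     mapping applies to _InitialView -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="EnglishText"/>
</sofaNameMappings>]]></programlisting></para>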

         </section>

         <section id="&tp;descriptor.cas_processors.run_in_separate_process">
           <title>&lt;runInSeparateProcess&gt; Element</title>

           <para>The <literal>&lt;runInSeparateProcess&gt;</literal> element is
             mandatory for <literal>local</literal> CAS Processors, but should not appear
             for <literal>remote</literal> or <literal>integrated</literal> CAS
             Processors. It enables the CPM to create external processes using the provided
             runtime environment. Applications launched this way communicate with the CPM
             using the Vinci protocol and connectivity is enabled by a local instance of the
             VNS that the CPM manages. Since communication is based on Vinci, the application
             need not be implemented in Java. Any language for which Vinci provides support
             may be used to create an application, and the CPM will seamlessly communicate
             with it. The overall structure of this element is:


             <programlisting><![CDATA[<runInSeparateProcess>
     <exec dir="[String]" executable="[String]">
         <env key="[String]" value ="[String]"/>
         ...
         <arg>[String]</arg>
         ...
     </exec>
 </runInSeparateProcess>]]></programlisting></para>

           <para>The <literal>&lt;exec&gt;</literal> element provides information
             about how to execute the referenced CAS Processor. Two attributes are defined
             for the <literal>&lt;exec&gt;</literal> element. The
             <literal>dir</literal> attribute is currently not used &ndash; it is reserved
             for future functionality. The <literal>executable</literal> attribute
             specifies the actual Vinci service executable that will be run by the CPM, e.g.,
             <literal>java</literal>, a batch script, an application (.exe), etc. The
             executable must be specified with a fully qualified path, or be found in the
             <literal>PATH</literal> of the CPM.</para>

           <para>The <literal>&lt;exec&gt;</literal> element has two elements within it
             that define parameters used to construct the command line for executing the CAS
             Processor. These elements must be listed in the order in which they should be
             defined for the CAS Processor.</para>

           <para>The optional <literal>&lt;env&gt;</literal> element is used to set an
             environment variable. The variable <literal>key</literal> will be set to
             <literal>value</literal>. For example,


             <programlisting>&lt;env key="CLASSPATH" value="C:\Java\lib"/&gt;</programlisting>
             will set the environment variable <literal>CLASSPATH</literal> to the value
             <literal>C:\Java\lib</literal>. The <literal>&lt;env&gt;</literal>
             element may be repeated to set multiple environment variables. All of the
             key/value pairs will be added to the environment by the CPM prior to launching the
             executable.</para>
           <note><para>The CPM actually adds ALL system environment variables when it
           launches the program. It queries the Operating System for its current system
           variables and one by one adds them to the program&apos;s process
           configuration.</para></note>

           <para>The <literal>&lt;arg&gt;</literal> element is used to specify arbitrary
             string arguments that will appear on the command line when the CPM runs the
             command specified in the <literal>executable</literal> attribute.</para>

           <para>For example, the following would be used to invoke the UIMA Java
             implementation of the Vinci service wrapper on a Java CAS Processor:


             <programlisting><![CDATA[<runInSeparateProcess>
     <exec executable="java">
         <arg>-DVNS_HOST=localhost</arg>
         <arg>-DVNS_PORT=9099</arg>
         <arg>org.apache.uima.reference_impl.analysis_engine.service.
 vinci.VinciAnalysisEngineService_impl</arg>
         <arg>C:uimadescdeployCasProcessor.xml</arg>
     </exec>
 </runInSeparateProcess>]]></programlisting></para>

           <para>This will cause the CPM to run the following command line when starting the
             CAS Processor:


             <programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099
   org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
               VinciAnalysisEngineService_impl
   C:uimadescdeployCasProcessor.xml</programlisting></para>

           <para>The first argument specifies that the Vinci Naming Service is running on the
             <literal>localhost</literal>. The second argument specifies that the Vinci
             Naming Service port number is <literal>9099</literal>. The third argument
             (split over 2 lines in this documentation)
             identifies the UIMA implementation of the Vinci service wrapper. This class
             contains the <literal>main</literal> method that will execute. That main
             method in turn takes a single argument &ndash; the filename for the CAS Processor
             service deployment descriptor. Thus the last argument identifies the Vinci
             service deployment descriptor file for the CAS Processor. Since this is the same
             descriptor file specified earlier in the
             <literal>&lt;descriptor&gt;</literal> element, the string
             <literal>${descriptor}</literal> can be used to refer to the descriptor,
             e.g.:


             <programlisting>&lt;arg&gt;${descriptor}&lt;/arg&gt;</programlisting></para>

           <para>The CPM will expand this out to the service deployment descriptor file
             referenced in the <literal>&lt;descriptor&gt;</literal> element.</para>
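
           <para>Putting this together, the earlier example might be written with the
             placeholder in place of a hard-coded path. This is only a sketch; it reuses
             the class name and VNS settings shown above, and the class name is again
             split over two lines for this documentation:

             <programlisting><![CDATA[<runInSeparateProcess>
     <exec executable="java">
         <arg>-DVNS_HOST=localhost</arg>
         <arg>-DVNS_PORT=9099</arg>
         <arg>org.apache.uima.reference_impl.analysis_engine.service.
 vinci.VinciAnalysisEngineService_impl</arg>
         <!-- expanded by the CPM to the file named in <descriptor> -->
         <arg>${descriptor}</arg>
     </exec>
 </runInSeparateProcess>]]></programlisting></para>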

         </section>

         <section
           id="&tp;descriptor.cas_processors.individual.deployment_parameters">
           <title>&lt;deploymentParameters&gt; Element</title>

           <para>The <literal>&lt;deploymentParameters&gt;</literal> element defines
             a number of deployment parameters that control how the CPM will interact with the
             CAS Processor. This element has the following overall form:


             <programlisting>&lt;deploymentParameters&gt;
     &lt;parameter name="[String]" value="..." type="string|integer" /&gt;
     ...
 &lt;/deploymentParameters&gt;</programlisting></para>

           <para>The <literal>name</literal> attribute identifies the parameter, the
             <literal>value</literal> attribute specifies the value that will be assigned
             to the parameter, and the <literal>type</literal> attribute indicates the
             type of the parameter, either <literal>string</literal> or
             <literal>integer</literal>. The available parameters include:

             <variablelist>

               <varlistentry>
                 <term>service-access</term>
                 <listitem><para>string parameter whose value must be
                   <quote>exclusive</quote>, if present. This parameter is only
                   effective for remote deployments. It modifies the Vinci service
                   connections to be preallocated and dedicated, one service instance per
                   pipeline. It is only relevant for non-integrated deployment modes. If
                   fewer service instances are available (and alive &ndash;
                   responding to a <quote>ping</quote> request) than there are pipelines,
                   the number of pipelines (the number of concurrent threads) is reduced to
                   match the number of available instances. If not specified, the VNS is
                   queried each time a service is needed, and a <quote>random</quote>
                   instance is assigned from the pool of available instances. If a service
                   dies during processing, the CPM will use its normal error handling
                   procedures to attempt to reconnect. The number of attempts is specified
                   in the CPE descriptor for each CAS Processor using the
                   <literal>&lt;maxConsecutiveRestarts value="10"
                   action="kill-pipeline"
                   waitTimeBetweenRetries="50"/&gt;</literal> XML element. The
                   <quote>value</quote> attribute is the number of reconnection tries;
                   the <quote>action</quote> attribute says what to do if the retries
                   exceed the limit. The <quote>kill-pipeline</quote> action stops the
                   pipeline that was associated with the failing service (other pipelines
                   will continue to work). The CAS in process within a killed pipeline will
                   be dropped. These events are communicated to the application using the
                   normal event listener mechanism. The
                   <literal>waitTimeBetweenRetries</literal> attribute says how many
                   milliseconds to wait between attempts to reconnect.</para>
                   </listitem>
               </varlistentry>

               <varlistentry>
                 <term>vnsHost</term>
                 <listitem><para>(Deprecated) string parameter specifying the VNS host,
                   e.g., <literal>localhost</literal> for local CAS Processors, host
                   name or IP address of VNS host for remote CAS Processors. This parameter is
                   deprecated; use the parameter specification instead inside the Vinci
                   <emphasis>Service Client Descriptor</emphasis>, if needed. It is
                   ignored for integrated and local deployments. If present, for remote
                   deployments, it specifies the VNS Host to use, unless that is specified in
                   the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
                   </listitem>
               </varlistentry>

               <varlistentry>
                 <term>vnsPort</term>
                 <listitem><para>(Deprecated) integer parameter specifying the VNS port
                   number. This parameter is deprecated; use the parameter specification
                   instead inside the Vinci <emphasis>Service Client
                   Descriptor</emphasis>, if needed. It is ignored for integrated and
                   local deployments. If present, for remote deployments, it specifies the
                   VNS Port number to use, unless that is specified in the Vinci
                   <emphasis>Service Client Descriptor</emphasis>.</para>
                   </listitem>
               </varlistentry>
             </variablelist></para>

           <para>For example, the following parameters might be used with a CAS Processor
             deployed in local mode:


             <programlisting>&lt;deploymentParameters&gt;
   &lt;parameter name="service-access" value="exclusive" type="string"/&gt;
 &lt;/deploymentParameters&gt;</programlisting></para>

         </section>

         <section id="&tp;descriptor.cas_processors.individual.filter">
           <title>&lt;filter&gt; Element</title>

           <para>The &lt;filter&gt; element is a required element but currently should be
             left empty. This element is reserved for future use.</para>

         </section>

         <section id="&tp;descriptor.cas_processors.individual.error_handling">
           <title>&lt;errorHandling&gt; Element</title>

           <para>The mandatory <literal>&lt;errorHandling&gt;</literal> element
             defines error and restart policies for the CAS Processor. Each CAS Processor may
             define different actions in the event of errors and restarts. The CPM monitors
             and logs errant behaviors and attempts to recover the component based on the
             policies specified in this element.</para>

           <para>There are two kinds of faults:

             <orderedlist><listitem><para>One kind only occurs with non-integrated CAS
               Processors &ndash; this fault is either a timeout attempting to launch or
               connect to the non-integrated component, or some other kind of connection
               related exception (for instance, the network connection might time out or be
               reset).</para></listitem>

               <listitem><para>The other kind happens when the CAS Processor component (an
                 Annotator, for example) throws any kind of exception. This kind may occur
                 with any kind of deployment, integrated or not. </para></listitem>
               </orderedlist></para>

           <para>The <literal>&lt;errorHandling&gt;</literal> element has
             specifications for each of these kinds of faults. The format of this element is:


             <programlisting><![CDATA[<errorHandling>
   <maxConsecutiveRestarts action="continue|disable|terminate"
                            value="[Number]"/>
   <errorRateThreshold action="continue|disable|terminate" value="[Rate]"/>
   <timeout max="[Number]"/>
 </errorHandling>]]></programlisting></para>

           <para>The mandatory <literal>&lt;maxConsecutiveRestarts&gt;</literal>
             element applies only to faults of the first kind, and therefore, only applies to
             non-integrated deployments. If such a fault occurs, a retry is attempted, up to
             <literal>value="[Number]"</literal> times. This retry resets the
             connection (if one was made) and attempts to reconnect and perhaps re-launch
             (see below for details). The original CAS (not a partially updated one) is sent to
             the CAS Processor as part of the retry, once the deployed component has been
             successfully restarted or reconnected to.</para>

           <para>The <literal>action</literal> attribute specifies the action to take
             when the threshold specified by the <literal>value="[Number]"</literal> is
             exceeded. The possible actions are:

             <variablelist>
               <varlistentry>
                 <term>continue</term>
                 <listitem><para>skip any further processing for this CAS by this CAS
                   Processor, and pass the CAS to the next CAS Processor in the Pipeline.
                   </para>
                   <para>The <quote>restart</quote> action is done, because it is needed
                     for the next CAS.</para>

                   <para>If <literal>dropCasOnException="true"</literal> is set, the CPM
                     will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
                     CPM will abort processing of this CAS, release the CAS back to the CAS
                     Pool and will process the next CAS in the queue.</para>

                   <para>The counter counting the restarts toward the threshold is only
                     reset after a CAS is successfully processed.</para></listitem>
               </varlistentry>

               <varlistentry>
                 <term>disable</term>
                 <listitem><para>the current CAS is handled just as in the
                   <literal>continue</literal> case, but in addition, the CAS Processor
                   is marked so that its <emphasis>process()</emphasis> method will not be
                   called again (i.e., it will be <quote>skipped</quote> for future
                   CASes)</para></listitem>
               </varlistentry>

               <varlistentry>
                 <term>terminate</term>
                 <listitem><para>the CPM will terminate all processing and exit.</para>
                   </listitem>
               </varlistentry>
             </variablelist></para>

           <para>The definition of an error for the
             <literal>&lt;maxConsecutiveRestarts&gt;</literal> element differs
             slightly for each of the three CAS Processor deployment modes:
             <variablelist>
               <varlistentry>
                 <term>local</term>
                 <listitem><para>Local CAS Processors experience two general error
                   types:
                   <itemizedlist>
                     <listitem><para>launch errors &ndash; errors associated with
                       launching a process</para></listitem>
                     <listitem><para>processing errors &ndash; errors associated with
                       sending Vinci commands to the process</para></listitem>
                   </itemizedlist></para>

                   <para>A launch error is defined by a failure of the process to
                     successfully register with the local VNS within a default time window.
                     The current timeout is 15 minutes. Multiple local CAS Processors are
                     launched sequentially, with a subsequent processor launched
                     immediately after its previous processor successfully registers
                     with the VNS.</para>

                   <para>A processing error is detected if a connection to the CAS Processor
                     is lost or if the processing time exceeds a specified timeout
                     value.</para>

                   <para>For local CAS Processors, the
                     &lt;maxConsecutiveRestarts&gt; element specifies the number of
                     consecutive attempts made to launch the CAS Processor at CPM startup or
                     after the CPM has lost a connection to the CAS Processor.</para>
                   </listitem>
               </varlistentry>

               <varlistentry>
                 <term>remote</term>
                 <listitem><para>For remote CAS Processors, the
                   &lt;maxConsecutiveRestarts&gt; element applies to errors from
                   sending Vinci commands. An error is detected if a connection to the CAS
                   Processor is lost, or if the processing time exceeds the timeout value
                   specified in the &lt;timeout&gt; element (see below).</para>
                   </listitem>
               </varlistentry>

               <varlistentry>
                 <term>integrated</term>
                 <listitem><para>Although mandatory, the
                   &lt;maxConsecutiveRestarts&gt; element is NOT used for integrated CAS
                   Processors, because Integrated CAS Processors are not
                   re-instantiated/restarted on exceptions. The CPM ignores this setting
                   for integrated CAS Processors, but the element is still required. A
                   future version of the CPM will make this element mandatory for remote
                   and local CAS Processors only.</para></listitem>
               </varlistentry>

             </variablelist></para>

           <para>The mandatory <literal>&lt;errorRateThreshold&gt;</literal> element
             is used for all faults &ndash; both those above, and exceptions thrown by the CAS
             Processor itself. It specifies the number of retries for exceptions thrown by
             the CAS Processor itself, a maximum error rate, and the corresponding action to
             take when this rate is exceeded. The <literal>value</literal> attribute
             specifies the error rate in terms of errors per sample size in the form
             <quote><literal>N/M</literal></quote>, where <literal>N</literal> is the
             number of errors and <literal>M</literal> is the sample size, defined in terms
             of the number of documents.</para>

           <para>The first number is also used to indicate the maximum number of retries. If
             this number is less than the <literal>&lt;maxConsecutiveRestarts
             value="[Number]"&gt;</literal> setting, it will override that setting,
             reducing the number of <quote>restarts</quote> attempted. A retry is done only
             if <literal>dropCasOnException</literal> is false. If it is set to true, no
             retry occurs, but the error is counted.</para>

           <para>When the error rate is exceeded (that is, more than <literal>N</literal>
             errors are counted within a sample of <literal>M</literal> documents), the
             action specified by the <literal>action</literal> attribute is taken. The possible
             actions and their meaning are the same as described above for the
             <literal>&lt;maxConsecutiveRestarts&gt;</literal> element:
             <itemizedlist spacing="compact">
               <listitem><para><literal>continue</literal></para></listitem>
               <listitem><para><literal>disable</literal></para></listitem>
               <listitem><para><literal>terminate</literal></para></listitem>
             </itemizedlist></para>

           <para>The <literal>dropCasOnException="true"</literal> attribute of the
             <literal>&lt;casProcessors&gt;</literal> element modifies the action
             taken for continue and disable, in the same manner as above. For example:


             <programlisting>&lt;errorRateThreshold value="3/1000" action="disable"/&gt;</programlisting>
             specifies that each error thrown by the CAS Processor itself will be retried up to
             3 times (if <literal>dropCasOnException</literal> is false) and the CAS
             Processor will be disabled if the error rate exceeds 3 errors in 1000
             documents.</para>

           <para>If a document causes an error and the error rate threshold for the CAS
             Processor is not exceeded, the CPM increments the CAS Processor&apos;s error
             count and retries processing that document (if
             <literal>dropCasOnException</literal> is false). The retry means that the
             CPM calls the CAS Processor&apos;s process() method again, passing in as an
             argument the same CAS that previously caused an exception.</para>
           <note><para>The CPM does not attempt to rollback any partial changes that may have
           been applied to the CAS in the previous process() call. </para></note>

           <para>Errors are accumulated across documents. For example, assume the error
             rate threshold is <literal>3/1000</literal>. The same document may fail three
             times before finally succeeding on the fourth try, but the error count is now 3. If
             one more error occurs within the current sample of 1000 documents, the error rate
             threshold will be exceeded and the specified action will be taken. If no more
             errors occur within the current sample, the error counter is reset to 0 for the
             next sample of 1000 documents.</para>

           <para>The <literal>&lt;timeout&gt;</literal> element is a mandatory element.
             Although mandatory for all CAS Processors, this element is only relevant for
             local and remote CAS Processors. For integrated CAS Processors, this element is
             ignored. In the current CPM implementation the integrated CAS Processor
             process() method is not subject to timeouts.</para>

           <para>The <literal>max</literal> attribute specifies the maximum amount of
             time in milliseconds the CPM will wait for a process() method to complete. When
             exceeded, the CPM will generate an exception and will treat this as an error
             subject to the threshold defined in the
             <literal>&lt;errorRateThreshold&gt;</literal> element above, including
             doing retries.</para>
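
           <para>As a sketch of how the three child elements fit together, a
             non-integrated CAS Processor might be configured as follows. The numeric
             values are illustrative assumptions, not recommended settings:

             <programlisting><![CDATA[<errorHandling>
   <!-- up to 30 consecutive restart/reconnect attempts, then stop the CPM -->
   <maxConsecutiveRestarts action="terminate" value="30"/>
   <!-- disable the CAS Processor if more than 100 errors occur
        within a sample of 1000 documents -->
   <errorRateThreshold action="disable" value="100/1000"/>
   <!-- wait at most 100000 milliseconds for each process() call -->
   <timeout max="100000"/>
 </errorHandling>]]></programlisting>
             Because the first number of the error rate (100) is not less than the
             <literal>&lt;maxConsecutiveRestarts&gt;</literal> value (30), it does not
             reduce the number of restarts attempted in this sketch.</para>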

           <section
             id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action">
             <title>Retry action taken on a timeout</title>

             <para>The action taken depends on whether the CAS Processor is local (managed)
               or remote (unmanaged). Local CAS Processors (which are services) are killed
               and restarted, and a new connection to them is established. For remote CAS
               Processors, the connection to them is dropped, and a new connection is
               established (which may actually connect to a different instance of the
               remote service, if it has multiple instances).</para>
           </section>
         </section>

         <section id="&tp;descriptor.cas_processors.individual.checkpoint">
           <title>&lt;checkpoint&gt; Element</title>

           <para>The <literal>&lt;checkpoint&gt;</literal> element is an optional
             element used to improve the performance of CAS Consumers. It has a single
             attribute, <literal>batch</literal>, which specifies the number of CASes in a
             batch, e.g.:


             <programlisting>&lt;checkpoint batch="1000"/&gt;</programlisting></para>

           <para>This sets the batch size to 1000 CASes. The batch size is the interval used to mark a
             point in processing requiring special handling. The CAS Processor&apos;s
             <literal>batchProcessComplete()</literal> method will be called by the CPM
             when this mark is reached so that the processor can take appropriate action. This
             mark could be used as a mechanism to buffer up results in CAS Consumers and perform
             time-consuming operations, such as check-pointing, that should not be done on a
             per-document basis.</para>

         </section>
       </section>
     </section>

     <section id="&tp;descriptor.operational_parameters">
       <title>CPE Operational Parameters</title>

       <para>The parameters for configuring the overall CPE and CPM are specified in the
         <literal>&lt;cpeConfig&gt;</literal> section. The overall format of this
         section is:


         <programlisting><![CDATA[<cpeConfig>
   <startAt>[NumberOrID]</startAt>

   <numToProcess>[Number]</numToProcess>

   <outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" />

   <checkpoint file="[File]" time="[Number]" batch="[Number]"/>

   <timerImpl>[ClassName]</timerImpl>

   <deployAs>vinciService|interactive|immediate|single-threaded
   </deployAs>

 </cpeConfig>]]></programlisting></para>

       <para>This section of the CPE descriptor allows for defining the starting entity, the
         number of entities to process, a checkpoint file and frequency, a pluggable timer, an
         optional output queue implementation, and finally a mode of operation. The mode of
         operation determines how the CPM interacts with users and other systems.</para>

       <para>The <literal>&lt;startAt&gt;</literal> element is an optional argument. It
         defines the starting entity in the collection at which the CPM should start
         processing.</para>

       <para>The implementation in the CPM passes this argument to the Collection Reader
         as the value of the parameter <quote><literal>startNumber</literal></quote>.
         The CPM does not do anything else with this parameter; in particular, the CPM has no
         ability to skip to a specific document - that function, if available, is only provided
         by a particular Collection Reader implementation.</para>

       <para>If the <literal>&lt;startAt&gt;</literal> element is used, the Collection
         Reader descriptor must define a single-valued configuration parameter with the
         name <literal>startNumber</literal>. It can declare this value to be of any type;
         the value passed in this XML element must be convertible to that type.</para>

       <para>A typical use is to declare this to be an integer type, and to pass the sequential
         document number where processing should start. An alternative implementation
         might take a specific document ID; the collection reader could search through its
         collection until it reaches this ID and then start there.</para>

       <para>This parameter will only make sense if the particular collection reader is
         implemented to use the <literal>startNumber</literal> configuration
         parameter.</para>
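
       <para>As a minimal sketch, assuming a Collection Reader that interprets
         <literal>startNumber</literal> as a sequential document number, the CPE
         descriptor could contain the following (the value 100 is a hypothetical
         example):

         <programlisting><![CDATA[<cpeConfig>
   <!-- passed to the Collection Reader as the value of "startNumber" -->
   <startAt>100</startAt>
 </cpeConfig>]]></programlisting>
         and the Collection Reader descriptor would then declare a matching
         single-valued parameter, for example:

         <programlisting><![CDATA[<configurationParameter>
   <name>startNumber</name>
   <description>Sequential number of the first document to process
     (illustrative description)</description>
   <type>Integer</type>
   <multiValued>false</multiValued>
   <mandatory>false</mandatory>
 </configurationParameter>]]></programlisting></para>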

       <para>The <literal>&lt;numToProcess&gt;</literal> element is an optional
         element. It specifies the total number of entities to process. Use -1 to indicate ALL.
         If not defined, the number of entities to process will be taken from the Collection
         Reader configuration. If present, this value overrides the Collection Reader
         configuration.</para>

       <para>The <literal>&lt;outputQueue&gt;</literal> element is an optional element.
         It enables plugging in a custom implementation for the Output Queue. When omitted,
         the CPM will use a default output queue that is based on a First-In First-Out (FIFO)
         model.</para>

       <para>The UIMA SDK provides a second implementation for the Output Queue that can be
         plugged in to the CPM, named <quote>
         <literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal>
         </quote>.</para>

       <para>This implementation supports handling very large documents that are split into
         <quote>chunks</quote>; it provides a delivery mechanism that ensures the
         sequential order of the chunks using information carried in the CAS metadata. This
         metadata, which is required for this implementation to work correctly, must be added
         as an instance of a Feature Structure of type
         <literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an
         additional feature named <literal>esDocumentMetaData</literal> in the special
         instance of <literal>uima.tcas.DocumentAnnotation</literal> that is
         associated with the CAS. This is usually done by the Collection Reader; the instance
         contains the following features:

         <variablelist>
           <varlistentry>
             <term>sequenceNumber</term>
             <listitem><para>[Number] the sequential number of a chunk, starting at 1. If
               not a chunk (i.e. complete document), the value should be 0.</para>
               </listitem>
           </varlistentry>
           <varlistentry>
             <term>documentId</term>
             <listitem><para>[Number] current document id. Chunks belonging to the same
               document have identical document id.</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>isCompleted</term>
             <listitem><para>[Number] 1 if the chunk is the last in a sequence, 0
               otherwise.</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>url</term>
             <listitem><para>[String] document url.</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>throttleID</term>
             <listitem><para>[String] special attribute currently used by
               OmniFind.</para></listitem>
           </varlistentry>
         </variablelist></para>
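
       <para>A type system fragment declaring such a metadata type might look like the
         following sketch. The supertype <literal>uima.cas.TOP</literal> and the mapping
         of [Number] to <literal>uima.cas.Integer</literal> and [String] to
         <literal>uima.cas.String</literal> are assumptions made for illustration; the
         additional <literal>esDocumentMetaData</literal> feature on
         <literal>uima.tcas.DocumentAnnotation</literal> is not shown:

         <programlisting><![CDATA[<typeDescription>
   <name>org.apache.es.tt.DocumentMetaData</name>
   <!-- supertype and range types are assumptions, see text -->
   <supertypeName>uima.cas.TOP</supertypeName>
   <features>
     <featureDescription>
       <name>sequenceNumber</name>
       <rangeTypeName>uima.cas.Integer</rangeTypeName>
     </featureDescription>
     <featureDescription>
       <name>documentId</name>
       <rangeTypeName>uima.cas.Integer</rangeTypeName>
     </featureDescription>
     <featureDescription>
       <name>isCompleted</name>
       <rangeTypeName>uima.cas.Integer</rangeTypeName>
     </featureDescription>
     <featureDescription>
       <name>url</name>
       <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
     <featureDescription>
       <name>throttleID</name>
       <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
   </features>
 </typeDescription>]]></programlisting></para>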

       <para>This implementation of a sequenced queue supports proper sequencing of CASes in
         CPM deployments that use document chunking. Chunking is a technique of splitting
         large documents into pieces to reduce overall memory consumption. Chunking does not
         depend on the number of CASes in the CAS Pool. It works equally well with one or more
         CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
         Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
         CAS is released back to the pool by the processing threads. A document may be split into
         1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
         document correctly, the CAS Consumer can depend on receiving the chunks in the same
         sequential order that the chunks were <quote>produced</quote>, when this
         sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
         the following specification:


         <programlisting>&lt;outputQueue dequeueTimeout="100000" queueClass=
 "org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/&gt;</programlisting>

         where the mandatory <literal>queueClass</literal> attribute defines the name of
         the class, and the second mandatory attribute, <literal>dequeueTimeout</literal>,
         specifies the maximum number of milliseconds to wait for the expected chunk.</para>

       <note><para>The value for this timeout must be carefully determined to avoid
       excessive occurrences of timeouts. Typically, the size of a chunk and the type of
       analysis being done are the most important factors when deciding on the value for the
       timeout. The larger the chunk and the more complicated the analysis, the more time it takes
       for the chunk to go from source to sink. You may specify 0, in which case, the timeout is
       disabled - i.e., it is equivalent to an infinitely long timeout.</para></note>

       <para>If the chunk doesn&apos;t arrive in the configured time window, the entire
         document is presumed to be invalid and the CAS is dropped from further processing.
         This action occurs regardless of any other error action specification. The
         SequencedQueue invalidates the document, adding the offending document&apos;s
         metadata to a local cache of invalid documents. </para>

       <para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
           targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.cpe.using_listeners"/>) by calling
         entityProcessComplete(). As part of this call, the SequencedQueue will pass null
         instead of a CAS as the first argument, and a special exception &ndash;
         CPMChunkTimeoutException. Null is passed because the expected chunk was not
         received within the configured timeout window, so there is no CAS available when
         the timeout event occurs.</para>

       <para>The CPMChunkTimeoutException object includes an API that allows the listener
         to retrieve the offending document id as well as the other metadata attributes as
         defined above. These attributes are part of each chunk&apos;s metadata and are added
         by the Collection Reader.</para>

       <para>Each chunk that SequencedQueue works on is subjected to a test to determine if the
         chunk belongs to an invalid document. This test checks the chunk&apos;s metadata
         against the data in the local cache. If there is a match, the chunk is dropped. This
         check is performed only for chunks; complete documents are not subject to
         it.</para>

       <para>If there is an exception during the processing of a chunk, the CPM sends a
         notification to all registered listeners. The notification includes the CAS and an
         exception. When the listener notification is completed, the CPM also sends separate
         notifications, containing the CAS, to the Artifact Producer and the
         SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong
         to an <quote>invalid</quote> document and also to deal with chunks that are
         en-route, being processed by the processing threads.</para>

       <para>In response to the notification, the Artifact Producer drops all CASes that
         belong to an <quote>invalid</quote> document and releases them back to the CAS Pool.
         Currently, the CollectionReader&apos;s API offers no way to tell it to stop
         generating chunks. The CollectionReader keeps producing chunks, but the
         Artifact Producer immediately drops them and releases them to the CAS Pool. Before a CAS is
         released back to the CAS Pool, the Artifact Producer sends a notification to all
         registered listeners. This notification includes the CAS and an exception,
         SkipCasException.</para>

       <para>In response to the notification of an exception involving a chunk, the
         SequencedQueue retrieves the metadata from the CAS and adds it to its local cache of
         <quote>invalid</quote> documents. All chunks de-queued from the OutputQueue that
         belong to <quote>invalid</quote> documents are dropped and released back to
         the CAS Pool. Before dropping a CAS, the CPM sends a notification to all registered
         listeners. The notification includes the CAS and a SkipCasException.</para>

       <para>The <literal>&lt;checkpoint&gt;</literal> element is optional.
         It specifies the CPE checkpoint file, the checkpoint frequency, and the checkpoint
         strategy (time-based or count-based). At checkpoint time, the CPM saves status
         information and statistics to the checkpoint file. The checkpoint file is specified
         in the <literal>file</literal> attribute, which has the same form as the
         <literal>href</literal> attribute of the <literal>&lt;include&gt;</literal>
         element described in <xref linkend="&tp;imports"/>. The
         <literal>time</literal> attribute indicates that a checkpoint should be taken
         every <literal>[Number]</literal> seconds, and the <literal>batch</literal>
         attribute indicates that a checkpoint should be taken every
         <literal>[Number]</literal> batches.</para>
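
       <para>As a minimal sketch of the attributes described above, the following fragment
         shows a time-based checkpoint and a count-based alternative. The file name
         <literal>cpe_checkpoint.dat</literal> and the interval values are assumptions chosen
         for illustration, not framework defaults:

         <programlisting><![CDATA[<!-- Time-based: take a checkpoint every 300 seconds -->
<!-- (file name is hypothetical) -->
<checkpoint file="cpe_checkpoint.dat" time="300"/>

<!-- Count-based alternative: take a checkpoint every 10 batches -->
<checkpoint file="cpe_checkpoint.dat" batch="10"/>]]></programlisting></para>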

       <para>The <literal>&lt;timerImpl&gt;</literal> element is optional. It
         identifies a custom timer plug-in class that generates time stamps during CPM
         execution. The value of the element is a Java class name.</para>
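
       <para>For illustration only, a custom timer might be configured as shown below. The
         class name <literal>com.example.cpe.MyCustomTimer</literal> is hypothetical; an
         actual class would need to be available on the classpath and to implement the timer
         interface that the framework expects:

         <programlisting><![CDATA[<!-- Hypothetical class name, shown only to illustrate the element's content -->
<timerImpl>com.example.cpe.MyCustomTimer</timerImpl>]]></programlisting></para>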

       <para>The <literal>&lt;deployAs&gt;</literal> element indicates the type of CPM
         deployment. Valid contents for this element include:

         <variablelist>
           <varlistentry>
             <term>vinciService</term>
             <listitem><para>Vinci service exposing APIs for stop, pause, resume, and
               getStats</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>interactive</term>
             <listitem><para>provide command line menus (start, stop, pause,
               resume)</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>immediate</term>
             <listitem><para>run the CPM without menus or a service API</para></listitem>
           </varlistentry>
           <varlistentry>
             <term>single-threaded</term>
             <listitem><para>run the CPM in single-threaded mode. In this mode, the
               Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
               all run in one thread, without the work queue and the output
               queue.</para></listitem>
           </varlistentry>
         </variablelist></para>
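
       <para>As a sketch, selecting single-threaded deployment could look like the
         following <literal>&lt;cpeConfig&gt;</literal> fragment. It reuses the values of
         the example CPE descriptor at the end of this chapter and changes only the content
         of <literal>&lt;deployAs&gt;</literal>; the other values are carried over for
         context and are not recommendations:

         <programlisting><![CDATA[<cpeConfig>
  <numToProcess>1</numToProcess>
  <!-- One of: vinciService, interactive, immediate, single-threaded -->
  <deployAs>single-threaded</deployAs>
  <checkpoint file="" time="3000"/>
  <timerImpl/>
</cpeConfig>]]></programlisting></para>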

     </section>

     <section id="&tp;descriptor.resource_manager_configuration">
       <title>Resource Manager Configuration</title>

       <para>External resource bindings for the CPE may optionally be specified in a
         <literal>&lt;resourceManagerConfiguration&gt;</literal> element:

         <programlisting>&lt;resourceManagerConfiguration href="..."/&gt;</programlisting></para>

       <para>For an introduction to external resources, refer to <olink
           targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para>

       <para>In the <literal>resourceManagerConfiguration</literal> element, the value
         of the <literal>href</literal> attribute refers to another file that contains definitions and bindings
         for the external resources used by the CPE. The format of this file is the same as that of the XML
         snippet described in <olink targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
         For example, in a CPE containing an aggregate analysis engine with two annotators
         and a CAS Consumer, the following resource manager configuration file binds
         external resource dependencies in all three components to the same physical
         resource:


         <programlisting><![CDATA[<resourceManagerConfiguration>

   <!-- Declare Resource -->

   <externalResources>
     <externalResource>
       <name>ExampleResource</name>
       <fileResourceSpecifier>
         <fileUrl>file:MyResourceFile.dat</fileUrl>
       </fileResourceSpecifier>
     </externalResource>
   </externalResources>

   <!-- Bind component resource dependencies to ExampleResource -->

   <externalResourceBindings>
     <externalResourceBinding>
       <key>MyAE/annotator1/myResourceKey</key>
       <resourceName>ExampleResource</resourceName>
     </externalResourceBinding>

     <externalResourceBinding>
       <key>MyAE/annotator2/someResourceKey</key>
       <resourceName>ExampleResource</resourceName>
     </externalResourceBinding>

     <externalResourceBinding>
       <key>MyCasConsumer/otherResourceKey</key>
       <resourceName>ExampleResource</resourceName>
     </externalResourceBinding>

   </externalResourceBindings>

 </resourceManagerConfiguration>]]></programlisting></para>

       <para>In this example, <literal>MyAE</literal> and
         <literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS
         Consumer, as specified by the name attributes of the CPE&apos;s
         <literal>&lt;casProcessor&gt;</literal> elements.
         <literal>annotator1</literal> and <literal>annotator2</literal> are the
         annotator keys specified within the Aggregate AE Descriptor, and
         <literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and
         <literal>otherResourceKey</literal> are the keys of the resource dependencies
         declared in the individual annotator and CAS Consumer descriptors.</para>
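
       <para>To make the key structure concrete, the following abbreviated CPE descriptor
         sketch shows where the first segment of each binding key would come from. The
         descriptor file names (<literal>MyAggregateAE.xml</literal>,
         <literal>MyCasConsumer.xml</literal>, <literal>MyResourceConfig.xml</literal>) are
         hypothetical, the placement of
         <literal>&lt;resourceManagerConfiguration&gt;</literal> after
         <literal>&lt;cpeConfig&gt;</literal> is an assumption, and child elements not
         relevant to the bindings are omitted:

         <programlisting><![CDATA[<cpeDescription>
  <!-- collectionReader omitted -->
  <casProcessors casPoolSize="1" processingUnitThreadCount="1">
    <!-- name="MyAE" supplies the "MyAE" prefix of the keys
         MyAE/annotator1/myResourceKey and MyAE/annotator2/someResourceKey -->
    <casProcessor deployment="integrated" name="MyAE">
      <descriptor>
        <import location="MyAggregateAE.xml"/>
      </descriptor>
    </casProcessor>
    <!-- name="MyCasConsumer" supplies the prefix of the key
         MyCasConsumer/otherResourceKey -->
    <casProcessor deployment="integrated" name="MyCasConsumer">
      <descriptor>
        <import location="MyCasConsumer.xml"/>
      </descriptor>
    </casProcessor>
  </casProcessors>
  <!-- cpeConfig omitted -->
  <!-- Points at the resource manager configuration file shown above -->
  <resourceManagerConfiguration href="MyResourceConfig.xml"/>
</cpeDescription>]]></programlisting></para>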

     </section>

     <section id="&tp;descriptor.example">
       <title>Example CPE Descriptor</title>


       <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
 <cpeDescription>
   <collectionReader>
     <collectionIterator>
       <descriptor>
         <import location=
            "../collection_reader/FileSystemCollectionReader.xml"/>
       </descriptor>
     </collectionIterator>
   </collectionReader>
   <casProcessors dropCasOnException="true" casPoolSize="1"
       processingUnitThreadCount="1">
     <casProcessor deployment="integrated"
       name="Aggregate TAE - Name Recognizer and Person Title Annotator">
       <descriptor>
         <import location=
            "../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
       </descriptor>
       <deploymentParameters/>
       <filter/>
       <errorHandling>
         <errorRateThreshold action="terminate" value="100/1000"/>
         <maxConsecutiveRestarts action="terminate" value="30"/>
         <timeout max="100000"/>
       </errorHandling>
       <checkpoint batch="1"/>
     </casProcessor>
     <casProcessor deployment="integrated" name="Annotation Printer">
       <descriptor>
         <import location="../cas_consumer/AnnotationPrinter.xml"/>
       </descriptor>
       <deploymentParameters/>
       <filter/>
       <errorHandling>
         <errorRateThreshold action="terminate" value="100/1000"/>
         <maxConsecutiveRestarts action="terminate" value="30"/>
         <timeout max="100000"/>
       </errorHandling>
       <checkpoint batch="1"/>
     </casProcessor>
   </casProcessors>
   <cpeConfig>
     <numToProcess>1</numToProcess>
     <deployAs>immediate</deployAs>
     <checkpoint file="" time="3000"/>
     <timerImpl/>
   </cpeConfig>
 </cpeDescription>]]></programlisting>
     </section>

 </chapter>