<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
<!ENTITY imgroot "../images/references/ref.xml.cpe_descriptor/">
<!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
<!ENTITY % uimaents SYSTEM "../entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.ref.xml.cpe_descriptor">
<title>Collection Processing Engine Descriptor Reference</title>
<titleabbrev>CPE Descriptor Reference</titleabbrev>
<para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
of UIMA components assembled to analyze a collection of artifacts. A CPE is an
instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
which defines the collection processing components, interfaces, and APIs. A CPE is
executed by a UIMA framework component called the <emphasis>Collection Processing
Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
running CPEs, and handling errors.</para>
<para>A CPE can be assembled programmatically within a Java application, or it can be
assembled declaratively via a CPE configuration specification, called a CPE
Descriptor. This chapter describes the format of the CPE Descriptor.</para>
<para>Details about the CPE, including its function, sub-components, APIs, and related
tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
provide context for the later sections that describe the CPE Descriptor.</para>
<section id="&tp;overview">
<title>CPE Overview</title>
<figure id="&tp;overview.fig.runtime">
<title>CPE Runtime Overview</title>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="PNG"
fileref="&imgroot;image002.png"/>
</imageobject>
<textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
</mediaobject>
</figure>
<para>An illustration of the CPE runtime is shown in <xref
linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
<emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
internal to the CPE, but their behavior and deployment may be configured using the CPE
Descriptor. Other CPE components, such as the <emphasis>Collection
Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
configured externally from the CPE and then plugged in to the CPE to create the overall
engine. The parts of a CPE are:
<variablelist>
<varlistentry>
<term>Collection Reader</term>
<listitem><para>understands the native data collection format and iterates
over the collection producing subjects of analysis</para></listitem>
</varlistentry>
<varlistentry>
<term>CAS Initializer<footnote><para>Deprecated</para></footnote>
</term>
<listitem><para>initializes a CAS with a subject of analysis</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Artifact Producer</term>
<listitem><para>asynchronously pulls CASes from the Collection Reader,
creates batches of CASes and puts them into the work queue</para></listitem>
</varlistentry>
<varlistentry>
<term>Work Queue</term>
<listitem><para>shared queue containing batches of CASes queued by the Artifact
Producer for analysis by Analysis Engines</para>
</listitem>
</varlistentry>
<varlistentry>
<term>B1-Bn</term>
<listitem><para>individual batches containing 1 or more CASes</para>
</listitem>
</varlistentry>
<varlistentry>
<term>AE1-AEn</term>
<listitem><para>Analysis Engines arranged by a CPE descriptor</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Processing Pipelines</term>
<listitem><para>each pipeline runs in a separate thread and contains a
replicated set of the Analysis Engines running in the defined sequence</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Output Queue</term>
<listitem><para>holds batches of CASes with analysis results intended for CAS
Consumers</para></listitem>
</varlistentry>
<varlistentry>
<term>CAS Consumers</term>
<listitem><para>perform collection level analysis over the CASes and extract
analysis results, e.g., creating indexes or databases</para></listitem>
</varlistentry>
</variablelist>
</para>
</section>
<section id="&tp;notation">
<title>Notation</title>
<para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
the syntax of CPE Descriptors.</para>
<para>The notation used in this chapter is:
<itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
that the substructure of that element has been omitted (to be described in another
section of this chapter). An example of this would be:
<programlisting>&lt;collectionReader&gt;
...
&lt;/collectionReader&gt;</programlisting></para>
</listitem>
<listitem><para>An ellipsis immediately after an element indicates that the
element type may be repeated arbitrarily many times. For example:
<programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
...</programlisting>
indicates that there may be arbitrarily many parameter elements in this
context.</para></listitem>
<listitem><para>An ellipsis inside an element's start tag means details of the attributes
associated with that element are defined later, e.g.:
<programlisting>&lt;casProcessor ...&gt;</programlisting></para>
</listitem>
<listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
indicate the type of value that may be used at that location.</para></listitem>
<listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
alternatives. This can be applied to literal values, bracketed type names, and
elements. </para></listitem></itemizedlist></para>
<para>Which elements are optional and which are required is specified in prose, not in the
syntax definition.</para>
</section>
<section id="&tp;imports">
<title>Imports</title>
<para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
as other component descriptors. This allows referring to component
descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
or the classpath/datapath. For details see <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor"/>.</para>
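<para>As a minimal sketch of the <literal>import</literal> syntax, the two fragments below
reference a component descriptor first by relative location and then by name. The file name
and the dotted name are hypothetical placeholders chosen for this illustration:
<programlisting><![CDATA[<!-- resolved relative to the location of the CPE descriptor -->
<descriptor>
  <import location="../collection_reader/MyCollectionReader.xml"/>
</descriptor>

<!-- looked up by name using the classpath/datapath -->
<descriptor>
  <import name="com.example.MyAnnotator"/>
</descriptor>]]></programlisting></para>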
<para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
<programlisting><![CDATA[<descriptor>
<include href="[URL or File]"/>
</descriptor>]]></programlisting></para>
<para>The <literal>href</literal> value, shown above as <literal>[URL or File]</literal>, is a URL or a filename for the descriptor of the
incorporated component. The value is first tried as a URL.</para>
<para>
Relative paths in an <literal>include</literal> are resolved relative to the current working directory
(NOT the CPE descriptor location as is the case for <literal>import</literal>).
A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
variable, e.g.,
<programlisting>&lt;descriptor&gt;
&lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
&lt;/descriptor&gt;</programlisting>
In this case, the value for the <literal>CPM_HOME</literal> variable must be
provided to the CPE by specifying it on the Java command line, e.g.,
<programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>
</para>
</section>
<section id="&tp;descriptor">
<title>CPE Descriptor Overview</title>
<para>A CPE Descriptor consists of information describing the following four main
elements.</para>
<orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
is responsible for gathering artifacts and initializing the Common Analysis
Structure (CAS) used to support processing in the UIMA collection processing
engine.</para></listitem>
<listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
analyzing individual artifacts, analyzing across artifacts, and extracting
analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
and <emphasis>CAS Consumers</emphasis>.</para></listitem>
<listitem><para>Operational parameters of the <emphasis>Collection Processing
Manager</emphasis> (CPM), such as checkpoint frequency and deployment
mode.</para></listitem>
<listitem><para>Resource Manager Configuration (optional). </para></listitem>
</orderedlist>
<para>The CPE Descriptor has the following high level skeleton:
<programlisting><![CDATA[<?xml version="1.0"?>
<cpeDescription>
<collectionReader>
...
</collectionReader>
<casProcessors>
...
</casProcessors>
<cpeConfig>
...
</cpeConfig>
<resourceManagerConfiguration>
...
</resourceManagerConfiguration>
</cpeDescription>]]></programlisting></para>
<para>Details of each of the four main elements are described in the sections that
follow.</para>
</section>
<section id="&tp;descriptor.collection_reader">
<title>Collection Reader</title>
<para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
Collection Reader and optional CAS Initializer that are to be used in the CPE. The
Collection Reader is responsible for retrieval of artifacts from a collection
outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
is responsible for initializing the CAS with the artifact.</para>
<para>A Collection Reader may initialize the CAS itself, in which case it does not
require a CAS Initializer. This should be clearly specified in the documentation for
the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
does not make use of a CAS Initializer will not cause an error, but the specified CAS
Initializer will not be used.</para>
<para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
section is:
<programlisting><![CDATA[<collectionReader>
<collectionIterator>
<descriptor>
<import ...> | <include .../>
</descriptor>
<configurationParameterSettings>...</configurationParameterSettings>
<sofaNameMappings>...</sofaNameMappings>
</collectionIterator>
<casInitializer>
<descriptor>
<import ...> | <include .../>
</descriptor>
<configurationParameterSettings>...</configurationParameterSettings>
<sofaNameMappings>...</sofaNameMappings>
</casInitializer>
</collectionReader>]]></programlisting></para>
<para>The <literal>&lt;collectionIterator&gt;</literal> identifies the
descriptor for the Collection Reader, and the
<literal>&lt;casInitializer&gt;</literal> identifies the descriptor for the CAS
Initializer. The format and
details of the Collection Reader and CAS Initializer descriptors are described in
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
The <literal>&lt;configurationParameterSettings&gt;</literal> and the
<literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
section.</para>
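<para>For illustration, a <literal>&lt;collectionReader&gt;</literal> section for a
Collection Reader that initializes the CAS itself (so that no
<literal>&lt;casInitializer&gt;</literal> is given) might look like the following sketch.
The descriptor location, the parameter name <literal>InputDirectory</literal>, and its value
are assumptions made for this example; the parameter would have to be declared in the
referenced Collection Reader descriptor:
<programlisting><![CDATA[<collectionReader>
  <collectionIterator>
    <descriptor>
      <import location="MyCollectionReader.xml"/>
    </descriptor>
    <configurationParameterSettings>
      <nameValuePair>
        <name>InputDirectory</name>
        <value>
          <string>/data/input</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
  </collectionIterator>
</collectionReader>]]></programlisting></para>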
<section id="&tp;descriptor.collection_reader.error_handling">
<title>Error handling for Collection Readers</title>
<para>The CPM will abort if the Collection Reader throws a large number of
consecutive exceptions (default = 100). This default can be changed by using the
Java initialization parameter
<literal>-DMaxCRErrorThreshold=xxx</literal>, where <literal>xxx</literal> is the new threshold.</para>
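<para>For example, assuming the threshold is passed as a standard Java system property,
a limit of 500 consecutive exceptions (an arbitrary value chosen for illustration) could be
requested on the command line as follows:
<programlisting>java -DMaxCRErrorThreshold=500 ...</programlisting></para>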
</section>
</section>
<section id="&tp;descriptor.cas_processors">
<title>CAS Processors</title>
<para>The <literal>&lt;casProcessors&gt;</literal> section identifies the
components that perform the analysis on the input data, including CAS analysis
(Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
Consumers may also perform collection level analysis, where the analysis is
performed (or aggregated) over multiple CASes. The basic structure of the CAS
Processors section is:
<programlisting><![CDATA[<casProcessors
dropCasOnException="true|false"
casPoolSize="[Number]"
processingUnitThreadCount="[Number]">
<casProcessor ...>
...
</casProcessor>
<casProcessor ...>
...
</casProcessor>
...
</casProcessors>]]></programlisting></para>
<para>The <literal>&lt;casProcessors&gt;</literal> section has two mandatory
attributes and one optional attribute that configure the characteristics of the CAS
Processor flow in the CPE. The first mandatory attribute is
<literal>casPoolSize</literal>, which
defines the fixed number of CAS instances that the CPM will create and use during
processing. All CAS instances are maintained in a CAS Pool with check-in and
check-out access. Each CAS is checked out of the CAS Pool by the Collection Reader
and initialized with an initial subject of analysis. The CAS is checked back into the
CAS Pool when it is completely processed, at the end of the processing chain. A larger
CAS Pool size will result in more memory being used by the CPM. CAS objects can be large,
and care should be taken to determine the optimum size of the CAS Pool, weighing memory
tradeoffs with performance.</para>
<para>The second mandatory <literal>&lt;casProcessors&gt;</literal> attribute
is <literal>processingUnitThreadCount</literal>, which specifies the number of
replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
contains one or more Analysis Engines invoked in a given sequence. If more than one
Processing Pipeline is specified, the CPM replicates instances of each Analysis
Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
independently, consuming CASes from the work queue and depositing CASes with analysis
results onto the output queue. On multiprocessor machines, multiple Processing
Pipelines can run in parallel, improving overall throughput of the CPM.</para>
<note><para>The number of Processing Pipelines should be equal to or greater than CAS
Pool size. </para></note>
<para>Elements in the pipeline (each represented by a <literal>&lt;casProcessor&gt;</literal> element)
may indicate that they do not permit multiple deployment in their Analysis Engine
descriptor. If so, even though multiple pipelines are being used, all CASes passing
through the pipelines will be routed through one instance of these marked Engines.
</para>
<para>The final, optional, <literal>&lt;casProcessors&gt;</literal> attribute is
<literal>dropCasOnException</literal>. It defines a policy that determines what
happens with the CAS when an exception happens during processing. If the value of this
attribute is set to true and an exception happens, the CPM will notify all registered
listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS and check the CAS
back into the CAS Pool so that it can be re-used. The presumption is that an exception
may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
to move through the processing chain. When this attribute is omitted the CPM&apos;s
default is the same as specifying
<literal>dropCasOnException="false"</literal>.</para>
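<para>As a sketch of how the three attributes fit together, the following
<literal>&lt;casProcessors&gt;</literal> element requests two Processing Pipelines and a
CAS Pool of two CASes, and drops any CAS whose processing raised an exception. The numeric
values are illustrative only and should be tuned for the actual workload:
<programlisting><![CDATA[<casProcessors casPoolSize="2"
               processingUnitThreadCount="2"
               dropCasOnException="true">
  <casProcessor ...>
    ...
  </casProcessor>
</casProcessors>]]></programlisting></para>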
<section id="&tp;descriptor.cas_processors.individual">
<title>Specifying an Individual CAS Processor</title>
<para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
pipeline are specified with the <literal>&lt;casProcessor&gt;</literal>
element, which appears within the <literal>&lt;casProcessors&gt;</literal>
element. It may appear multiple times, once for each CAS Processor specified for
this CPE.</para>
<para>The order of the <literal>&lt;casProcessor&gt;</literal> elements within
the <literal>&lt;casProcessors&gt;</literal> section specifies the order in
which the CAS Processors will run. Although CAS Consumers are usually put at the end
of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
Consumers.</para>
<para>The overall format of the <literal>&lt;casProcessor&gt;</literal> element
is:
<programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" >
<descriptor>
<import ...> | <include .../>
</descriptor>
<configurationParameterSettings>...</configurationParameterSettings>
<sofaNameMappings>...</sofaNameMappings>
<runInSeparateProcess>...</runInSeparateProcess>
<deploymentParameters>...</deploymentParameters>
<filter/>
<errorHandling>...</errorHandling>
<checkpoint batch="[Number]"/>
</casProcessor>]]></programlisting></para>
<para>The <literal>&lt;casProcessor&gt;</literal> element has two mandatory
attributes, <literal>deployment</literal> and <literal>name</literal>. The
mandatory <literal>name</literal> attribute specifies a unique string
identifying the CAS Processor.</para>
<para>The mandatory <literal>deployment</literal> attribute specifies the CAS
Processor deployment mode. Currently, three deployment options are supported:
<variablelist>
<varlistentry>
<term>integrated</term>
<listitem><para>indicates <emphasis>integrated</emphasis> deployment
of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
same process space as the CPM. This type of deployment is recommended to
increase the performance of the CPE. However, it is NOT recommended to
deploy annotators containing JNI this way. Such CAS Processors may cause a
fatal exception and force the JVM to exit without cleanup (bringing down the
CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
this way.</para>
<para>The descriptor for an integrated deployment can, in fact, be a remote
service descriptor. When used this way, however, the CPM error recovery
options (see below) operate in the integrated mode, which means that many
of the retry options are not available.</para></listitem>
</varlistentry>
<varlistentry>
<term>remote</term>
<listitem><para>indicates <emphasis>non-managed</emphasis>
deployment of the CAS Processor. The CAS Processor descriptor referenced
in the <literal>&lt;descriptor&gt;</literal> element must be a Vinci
<emphasis>Service Client Descriptor</emphasis>, which identifies a
remotely deployed CAS Processor service (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.remote_services"/>). The CPM
assumes that the CAS Processor is already running as a remote service and
will connect to it using the URI provided in the client service descriptor.
The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
so appropriate infrastructure should be in place to start/restart such CAS
Processors when necessary. This deployment provides fault isolation and
is implementation (i.e., programming language) neutral.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>local</term>
<listitem><para>indicates <emphasis>managed</emphasis> deployment of
the CAS Processor. The CAS Processor descriptor referenced in the
<literal>&lt;descriptor&gt;</literal> element must be a Vinci
<emphasis>Service Deployment Descriptor</emphasis>, which configures
a CAS Processor for deployment as a Vinci service (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.remote_services"/>). The CPM
deploys the CAS Processor in a separate process and manages the life cycle
(start/stop) of the CAS Processor. Communication between the CPM and the
CAS Processor is done with Vinci. When the CPM completes processing, the
process containing the CAS Processor is terminated. This deployment mode
insulates the CPM from the CAS Processor, creating a more robust deployment
at the cost of a small communication overhead. On multiprocessor machines,
the separate processes may run concurrently and improve overall
throughput.</para></listitem>
</varlistentry>
</variablelist></para>
<para>A number of elements may appear within the
<literal>&lt;casProcessor&gt;</literal> element.</para>
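<para>Before looking at each element in turn, the following sketch shows what a complete
<literal>&lt;casProcessor&gt;</literal> for an integrated deployment might look like. The
name, the descriptor location, and all numeric values are assumptions made for this
illustration; the child elements of <literal>&lt;errorHandling&gt;</literal> are covered
in their own section:
<programlisting><![CDATA[<casProcessor deployment="integrated" name="Example Annotator">
  <descriptor>
    <import location="MyAnnotator.xml"/>
  </descriptor>
  <deploymentParameters/>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="terminate" value="100/1000"/>
    <maxConsecutiveRestarts action="terminate" value="30"/>
    <timeout max="100000"/>
  </errorHandling>
  <checkpoint batch="10000"/>
</casProcessor>]]></programlisting></para>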
<section id="&tp;descriptor.cas_processors.individual.descriptor">
<title>&lt;descriptor&gt; Element</title>
<para>The <literal>&lt;descriptor&gt;</literal> element is mandatory. It
identifies the descriptor for the referenced CAS Processor using the syntax
described in <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes"/>.
<itemizedlist spacing="compact"><listitem><para>For
<emphasis><literal>remote</literal></emphasis> CAS Processors, the
referenced descriptor must be a Vinci <emphasis>Service Client
Descriptor</emphasis>, which identifies a remotely deployed CAS Processor
service.</para></listitem>
<listitem><para>For <emphasis>local</emphasis> CAS Processors, the
referenced descriptor must be a Vinci <emphasis>Service Deployment
Descriptor</emphasis>.</para></listitem>
<listitem><para>For <emphasis>integrated</emphasis> CAS Processors,
the referenced descriptor must be an Analysis Engine Descriptor
(primitive or aggregate). </para></listitem></itemizedlist> </para>
<para>See <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.remote_services"/> for more
information on creating these descriptors and deploying services.</para>
</section>
<section
id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings">
<title>&lt;configurationParameterSettings&gt; Element</title>
<para>This element provides a way to override the contained Analysis
Engine&apos;s parameter settings. Any entry specified here must already be
defined; values specified replace the corresponding values for each
parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
is only available when they are deployed in <quote>integrated</quote>
mode.</emphasis> For Collection Readers and CAS Initializers, it is always
available.</para>
<para>The content of this element uses the same syntax that component descriptors use for
specifying parameter settings (in the case where no parameter groups are
specified)<footnote><para>An earlier UIMA version required these to have a
suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. This is no
longer required, but the older format is still accepted for backward
compatibility.</para></footnote>. Here is an example:
<programlisting><![CDATA[<configurationParameterSettings>
<nameValuePair>
<name>CivilianTitles</name>
<value>
<array>
<string>Mr.</string>
<string>Ms.</string>
<string>Mrs.</string>
<string>Dr.</string>
</array>
</value>
</nameValuePair>
...
</configurationParameterSettings>]]></programlisting></para>
</section>
<section
id="&tp;descriptor.cas_processors.individual.sofa_name_mappings">
<title>&lt;sofaNameMappings&gt; Element</title>
<para>This optional element maps the Sofa names defined in the
component, or the default Sofa name (if the component does not declare any Sofa
names), to Sofa names used in the CPE. The form of this element is:
<programlisting>&lt;sofaNameMappings&gt;
&lt;sofaNameMapping cpeSofaName="a_CPE_name"
componentSofaName="a_component_Name"/&gt;
...
&lt;/sofaNameMappings&gt;</programlisting></para>
<para>There can be any number of
<literal>&lt;sofaNameMapping&gt;</literal> elements contained in the
<literal>&lt;sofaNameMappings&gt;</literal> element. The
<literal>componentSofaName</literal> attribute is optional; leave it out to
specify a mapping for the <literal>_InitialView</literal>, that is, for
Single-View components.</para>
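<para>The two cases can be sketched as follows. All Sofa names shown
(<literal>EnglishDocument</literal>, <literal>GermanDocument</literal>,
<literal>SourceText</literal>, <literal>TranslatedText</literal>) are hypothetical and
stand in for whatever names the CPE and the component actually use:
<programlisting><![CDATA[<!-- Single-View component: componentSofaName is omitted, so the
     component's _InitialView is mapped to the CPE Sofa "EnglishDocument" -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="EnglishDocument"/>
</sofaNameMappings>

<!-- Multi-View component: each declared component Sofa name is mapped
     to a CPE Sofa name -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="EnglishDocument"
                   componentSofaName="SourceText"/>
  <sofaNameMapping cpeSofaName="GermanDocument"
                   componentSofaName="TranslatedText"/>
</sofaNameMappings>]]></programlisting></para>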
</section>
<section id="&tp;descriptor.cas_processors.run_in_separate_process">
<title>&lt;runInSeparateProcess&gt; Element</title>
<para>The <literal>&lt;runInSeparateProcess&gt;</literal> element is
mandatory for <literal>local</literal> CAS Processors, but should not appear
for <literal>remote</literal> or <literal>integrated</literal> CAS
Processors. It enables the CPM to create external processes using the provided
runtime environment. Applications launched this way communicate with the CPM
using the Vinci protocol and connectivity is enabled by a local instance of the
VNS that the CPM manages. Since communication is based on Vinci, the application
need not be implemented in Java. Any language for which Vinci provides support
may be used to create an application, and the CPM will seamlessly communicate
with it. The overall structure of this element is:
<programlisting><![CDATA[<runInSeparateProcess>
<exec dir="[String]" executable="[String]">
<env key="[String]" value="[String]"/>
...
<arg>[String]</arg>
...
</exec>
</runInSeparateProcess>]]></programlisting></para>
<para>The <literal>&lt;exec&gt;</literal> element provides information
about how to execute the referenced CAS Processor. Two attributes are defined
for the <literal>&lt;exec&gt;</literal> element. The
<literal>dir</literal> attribute is currently not used &ndash; it is reserved
for future functionality. The <literal>executable</literal> attribute
specifies the actual Vinci service executable that will be run by the CPM, e.g.,
<literal>java</literal>, a batch script, an application (.exe), etc. The
executable must be specified with a fully qualified path, or be found in the
<literal>PATH</literal> of the CPM.</para>
<para>The <literal>&lt;exec&gt;</literal> element has two elements within it
that define parameters used to construct the command line for executing the CAS
Processor. These elements must be listed in the order in which they should be
defined for the CAS Processor.</para>
<para>The optional <literal>&lt;env&gt;</literal> element is used to set an
environment variable. The variable <literal>key</literal> will be set to
<literal>value</literal>. For example,
<programlisting>&lt;env key="CLASSPATH" value="C:\Java\lib"/&gt;</programlisting>
will set the environment variable <literal>CLASSPATH</literal> to the value
<literal>C:\Java\lib</literal>. The <literal>&lt;env&gt;</literal>
element may be repeated to set multiple environment variables. All of the
key/value pairs will be added to the environment by the CPM prior to launching the
executable.</para>
<note><para>The CPM actually adds ALL system environment variables when it
launches the program. It queries the Operating System for its current system
variables and one by one adds them to the program&apos;s process
configuration.</para></note>
<para>The <literal>&lt;arg&gt;</literal> element is used to specify arbitrary
string arguments that will appear on the command line when the CPM runs the
command specified in the <literal>executable</literal> attribute.</para>
<para>For example, the following would be used to invoke the UIMA Java
implementation of the Vinci service wrapper on a Java CAS Processor:
<programlisting><![CDATA[<runInSeparateProcess>
<exec executable="java">
<arg>-DVNS_HOST=localhost</arg>
<arg>-DVNS_PORT=9099</arg>
<arg>org.apache.uima.reference_impl.analysis_engine.service.
vinci.VinciAnalysisEngineService_impl</arg>
<arg>C:\uima\desc\deployCasProcessor.xml</arg>
</exec>
</runInSeparateProcess>]]></programlisting></para>
<para>This will cause the CPM to run the following command line when starting the
CAS Processor:
<programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099
org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
VinciAnalysisEngineService_impl
C:\uima\desc\deployCasProcessor.xml</programlisting></para>
<para>The first argument specifies that the Vinci Naming Service is running on the
<literal>localhost</literal>. The second argument specifies that the Vinci
Naming Service port number is <literal>9099</literal>. The third argument
(split over 2 lines in this documentation)
identifies the UIMA implementation of the Vinci service wrapper. This class
contains the <literal>main</literal> method that will execute. That main
method in turn takes a single argument &ndash; the filename for the CAS Processor
service deployment descriptor. Thus the last argument identifies the Vinci
service deployment descriptor file for the CAS Processor. Since this is the same
descriptor file specified earlier in the
<literal>&lt;descriptor&gt;</literal> element, the string
<literal>${descriptor}</literal> can be used to refer to the descriptor,
e.g.:
<programlisting>&lt;arg&gt;${descriptor}&lt;/arg&gt;</programlisting></para>
<para>The CPM will expand this out to the service deployment descriptor file
referenced in the <literal>&lt;descriptor&gt;</literal> element.</para>
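<para>Putting this together, the earlier example could be written without repeating the
descriptor path, as in the sketch below. It reuses the class name and VNS settings from
the example above; the class name is written on a single line here, on the assumption
that the line break in the earlier listing exists only to fit the page:
<programlisting><![CDATA[<runInSeparateProcess>
  <exec executable="java">
    <arg>-DVNS_HOST=localhost</arg>
    <arg>-DVNS_PORT=9099</arg>
    <arg>org.apache.uima.reference_impl.analysis_engine.service.vinci.VinciAnalysisEngineService_impl</arg>
    <!-- expanded by the CPM to the file named in the <descriptor> element -->
    <arg>${descriptor}</arg>
  </exec>
</runInSeparateProcess>]]></programlisting></para>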
</section>
<section
id="&tp;descriptor.cas_processors.individual.deployment_parameters">
<title>&lt;deploymentParameters&gt; Element</title>
<para>The <literal>&lt;deploymentParameters&gt;</literal> element defines
a number of deployment parameters that control how the CPM will interact with the
CAS Processor. This element has the following overall form:
<programlisting>&lt;deploymentParameters&gt;
&lt;parameter name="[String]" value="..." type="string|integer" /&gt;
...
&lt;/deploymentParameters&gt;</programlisting></para>
<para>The <literal>name</literal> attribute identifies the parameter, the
<literal>value</literal> attribute specifies the value that will be assigned
to the parameter, and the <literal>type</literal> attribute indicates the
type of the parameter, either <literal>string</literal> or
<literal>integer</literal>. The available parameters include:
<variablelist>
<varlistentry>
<term>service-access</term>
<listitem><para>string parameter whose value must be
<quote>exclusive</quote>, if present. This parameter is only
effective for remote deployments. It modifies the Vinci service
connections to be preallocated and dedicated, one service instance per
pipeline. It is only relevant for non-integrated deployment modes. If
fewer service instances are available (and alive, that is,
responding to a <quote>ping</quote> request) than there are pipelines,
the number of pipelines (the number of concurrent threads) is reduced to
match the number of available instances. If not specified, the VNS is
queried each time a service is needed, and a <quote>random</quote>
instance is assigned from the pool of available instances. If a service
dies during processing, the CPM will use its normal error handling
procedures to attempt to reconnect. The number of attempts is specified
in the CPE descriptor for each CAS Processor using the
<literal>&lt;maxConsecutiveRestarts value="10"
action="kill-pipeline"
waitTimeBetweenRetries="50"/&gt;</literal> XML element. The
<quote>value</quote> attribute is the number of reconnection tries;
the <quote>action</quote> attribute says what to do if the retries exceed the
limit. The <quote>kill-pipeline</quote> action stops the pipeline
that was associated with the failing service (other pipelines will
continue to work). The CAS in process within a killed pipeline will be
dropped. These events are communicated to the application using the
normal event listener mechanism. The
<literal>waitTimeBetweenRetries</literal> attribute says how many
milliseconds to wait between attempts to reconnect.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>vnsHost</term>
<listitem><para>(Deprecated) string parameter specifying the VNS host,
e.g., <literal>localhost</literal> for local CAS Processors, host
name or IP address of VNS host for remote CAS Processors. This parameter is
deprecated; use the parameter specification instead inside the Vinci
<emphasis>Service Client Descriptor</emphasis>, if needed. It is
ignored for integrated and local deployments. If present, for remote
deployments, it specifies the VNS Host to use, unless that is specified in
the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>vnsPort</term>
<listitem><para>(Deprecated) integer parameter specifying the VNS port
number. This parameter is deprecated; use the parameter specification
inside the Vinci <emphasis>Service Client Descriptor</emphasis> instead,
if needed. It is ignored for integrated and
local deployments. If present, for remote deployments, it specifies the
VNS Port number to use, unless that is specified in the Vinci
<emphasis>Service Client Descriptor</emphasis>.</para>
</listitem>
</varlistentry>
</variablelist></para>
<para>For example, the following parameters might be used with a CAS Processor
deployed in local mode:
<programlisting>&lt;deploymentParameters&gt;
&lt;parameter name="service-access" value="exclusive" type="string"/&gt;
&lt;/deploymentParameters&gt;</programlisting></para>
</section>
<section id="&tp;descriptor.cas_processors.individual.filter">
<title>&lt;filter&gt; Element</title>
<para>The &lt;filter&gt; element is a required element but currently should be
left empty. This element is reserved for future use.</para>
</section>
<section id="&tp;descriptor.cas_processors.individual.error_handling">
<title>&lt;errorHandling&gt; Element</title>
<para>The mandatory <literal>&lt;errorHandling&gt;</literal> element
defines error and restart policies for the CAS Processor. Each CAS Processor may
define different actions in the event of errors and restarts. The CPM monitors
and logs errant behaviors and attempts to recover the component based on the
policies specified in this element.</para>
<para>There are two kinds of faults:
<orderedlist><listitem><para>One kind only occurs with non-integrated CAS
Processors &ndash; this fault is either a timeout attempting to launch or
connect to the non-integrated component, or some other kind of
connection-related exception (for instance, the network connection might time out or get
reset).</para></listitem>
<listitem><para>The other kind happens when the CAS Processor component (an
Annotator, for example) throws any kind of exception. This kind may occur
with any kind of deployment, integrated or not. </para></listitem>
</orderedlist></para>
<para>The &lt;errorHandling&gt; element has specifications for each of these kinds of
faults. The format of this element is:
<programlisting><![CDATA[<errorHandling>
<maxConsecutiveRestarts action="continue|disable|terminate"
value="[Number]"/>
<errorRateThreshold action="continue|disable|terminate" value="[Rate]"/>
<timeout max="[Number]"/>
</errorHandling>]]></programlisting></para>
<para>The mandatory <literal>&lt;maxConsecutiveRestarts&gt;</literal>
element applies only to faults of the first kind, and therefore, only applies to
non-integrated deployments. If such a fault occurs, a retry is attempted, up to
the number of times given by <literal>value="[Number]"</literal>. This retry resets the
connection (if one was made) and attempts to reconnect and perhaps re-launch
(see below for details). The original CAS (not a partially updated one) is sent to
the CAS Processor as part of the retry, once the deployed component has been
successfully restarted or reconnected to.</para>
<para>The <literal>action</literal> attribute specifies the action to take
when the threshold specified by the <literal>value="[Number]"</literal> is
exceeded. The possible actions are:
<variablelist>
<varlistentry>
<term>continue</term>
<listitem><para>skip any further processing for this CAS by this CAS
Processor, and pass the CAS to the next CAS Processor in the Pipeline.
</para>
<para>The <quote>restart</quote> is still performed, because the CAS Processor is needed
for the next CAS.</para>
<para>If <literal>dropCasOnException="true"</literal> is set, the CPM
will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
CPM will abort processing of this CAS, release the CAS back to the CAS
Pool and will process the next CAS in the queue.</para>
<para>The counter counting the restarts toward the threshold is only
reset after a CAS is successfully processed.</para></listitem>
</varlistentry>
<varlistentry>
<term>disable</term>
<listitem><para>the current CAS is handled just as in the
<literal>continue</literal> case, but in addition, the CAS Processor
is marked so that its <emphasis>process()</emphasis> method will not be
called again (i.e., it will be <quote>skipped</quote> for future
CASes).</para></listitem>
</varlistentry>
<varlistentry>
<term>terminate</term>
<listitem><para>the CPM will terminate all processing and exit.</para>
</listitem>
</varlistentry>
</variablelist></para>
<para>The definition of an error for the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element differs
slightly for each of the three CAS Processor deployment modes:
<variablelist>
<varlistentry>
<term>local</term>
<listitem><para>Local CAS Processors experience two general error
types:
<itemizedlist>
<listitem><para>launch errors &ndash; errors associated with
launching a process</para></listitem>
<listitem><para>processing errors &ndash; errors associated with
sending Vinci commands to the process</para></listitem>
</itemizedlist></para>
<para>A launch error is defined by a failure of the process to
successfully register with the local VNS within a default time window.
The current timeout is 15 minutes. Multiple local CAS Processors are
launched sequentially, with a subsequent processor launched
immediately after its previous processor successfully registers
with the VNS.</para>
<para>A processing error is detected if a connection to the CAS Processor
is lost or if the processing time exceeds a specified timeout
value.</para>
<para>For local CAS Processors, the
&lt;maxConsecutiveRestarts&gt; element specifies the number of
consecutive attempts made to launch the CAS Processor at CPM startup or
after the CPM has lost a connection to the CAS Processor.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>remote</term>
<listitem><para>For remote CAS Processors, the
&lt;maxConsecutiveRestarts&gt; element applies to errors from
sending Vinci commands. An error is detected if a connection to the CAS
Processor is lost, or if the processing time exceeds the timeout value
specified in the &lt;timeout&gt; element (see below).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>integrated</term>
<listitem><para>Although mandatory, the
&lt;maxConsecutiveRestarts&gt; element is NOT used for integrated CAS
Processors, because integrated CAS Processors are not
re-instantiated/restarted on exceptions. The CPM ignores this setting
for integrated CAS Processors, but the element must still be present. A future version
of the CPM will make this element mandatory for remote and local CAS
Processors only.</para></listitem>
</varlistentry>
</variablelist></para>
<para>The mandatory <literal>&lt;errorRateThreshold&gt;</literal> element
is used for all faults &ndash; both those above, and exceptions thrown by the CAS
Processor itself. It specifies the number of retries for exceptions thrown by
the CAS Processor itself, a maximum error rate, and the corresponding action to
take when this rate is exceeded. The <literal>value</literal> attribute
specifies the error rate in terms of errors per sample size in the form
<quote><literal>N/M</literal></quote>, where <literal>N</literal> is the
number of errors and <literal>M</literal> is the sample size, defined in terms
of the number of documents.</para>
<para>The first number, <literal>N</literal>, also sets the maximum number of retries. If
this number is less than the <literal>&lt;maxConsecutiveRestarts
value="[Number]"&gt;</literal> setting, it takes precedence, reducing the number of
<quote>restarts</quote> attempted. A retry is done only if
<literal>dropCasOnException</literal> is false. If it is set to true, no retry
occurs, but the error is counted.</para>
<para>When the number of errors counted within a sample exceeds <literal>N</literal>, the action
specified by the <literal>action</literal> attribute is taken. The possible
actions and their meaning are the same as described above for the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element:
<itemizedlist spacing="compact">
<listitem><para><literal>continue</literal></para></listitem>
<listitem><para><literal>disable</literal></para></listitem>
<listitem><para><literal>terminate</literal></para></listitem>
</itemizedlist></para>
<para>The <literal>dropCasOnException="true"</literal> attribute of the
<literal>&lt;casProcessors&gt;</literal> element modifies the action
taken for continue and disable, in the same manner as above. For example:
<programlisting>&lt;errorRateThreshold value="3/1000" action="disable"/&gt;</programlisting>
specifies that each error thrown by the CAS Processor itself will be retried up to
3 times (if <literal>dropCasOnException</literal> is false) and the CAS
Processor will be disabled if the error rate exceeds 3 errors in 1000
documents.</para>
<para>If a document causes an error and the error rate threshold for the CAS
Processor is not exceeded, the CPM increments the CAS Processor&apos;s error
count and retries processing that document (if
<literal>dropCasOnException</literal> is false). The retry means that the
CPM calls the CAS Processor&apos;s process() method again, passing in as an
argument the same CAS that previously caused an exception.</para>
<note><para>The CPM does not attempt to roll back any partial changes that may have
been applied to the CAS in the previous process() call. </para></note>
<para>Errors are accumulated across documents. For example, assume the error
rate threshold is <literal>3/1000</literal>. The same document may fail three
times before finally succeeding on the fourth try, but the error count is now 3. If
one more error occurs within the current sample of 1000 documents, the error rate
threshold will be exceeded and the specified action will be taken. If no more
errors occur within the current sample, the error counter is reset to 0 for the
next sample of 1000 documents.</para>
<para>The <literal>&lt;timeout&gt;</literal> element is a mandatory element.
Although mandatory for all CAS Processors, this element is only relevant for
local and remote CAS Processors. For integrated CAS Processors, this element is
ignored. In the current CPM implementation the integrated CAS Processor
process() method is not subject to timeouts.</para>
<para>The <literal>max</literal> attribute specifies the maximum amount of
time in milliseconds the CPM will wait for a process() method to complete. When
exceeded, the CPM will generate an exception and will treat this as an error
subject to the threshold defined in the
<literal>&lt;errorRateThreshold&gt;</literal> element above, including
doing retries.</para>
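<para>As a hedged illustration of how the three elements described above fit together,
the following sketch shows one possible <literal>&lt;errorHandling&gt;</literal>
specification for a remote CAS Processor. The specific values are assumptions chosen
for the example, not recommended settings:
<programlisting><![CDATA[<errorHandling>
  <!-- Connection faults: reconnect up to 3 times in a row, then disable -->
  <maxConsecutiveRestarts action="disable" value="3"/>
  <!-- All faults: disable if more than 3 errors occur per 1000 documents -->
  <errorRateThreshold action="disable" value="3/1000"/>
  <!-- Wait at most 60000 milliseconds for each process() call -->
  <timeout max="60000"/>
</errorHandling>]]></programlisting></para>
<para>With these assumed settings, a <literal>process()</literal> call that runs longer
than 60 seconds is treated as an error, counted toward the
<literal>3/1000</literal> threshold, and retried only if
<literal>dropCasOnException</literal> is false.</para>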
<section
id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action">
<title>Retry action taken on a timeout</title>
<para>The action taken depends on whether the CAS Processor is local (managed)
or remote (unmanaged). Local CAS Processors (which are services) are killed
and restarted, and a new connection to them is established. For remote CAS
Processors, the connection to them is dropped, and a new connection is
established (which may actually connect to a different instance of the
remote service, if it has multiple instances).</para>
</section>
</section>
<section id="&tp;descriptor.cas_processors.individual.checkpoint">
<title>&lt;checkpoint&gt; Element</title>
<para>The <literal>&lt;checkpoint&gt;</literal> element is an optional
element used to improve the performance of CAS Consumers. It has a single
attribute, <literal>batch</literal>, which specifies the number of CASes in a
batch. For example, the following sets the batch size to 1000 CASes:
<programlisting>&lt;checkpoint batch="1000"/&gt;</programlisting></para>
<para>The batch size is the interval used to mark a
point in processing requiring special handling. The CAS Processor&apos;s
<literal>batchProcessComplete()</literal> method will be called by the CPM
when this mark is reached so that the processor can take appropriate action. This
mark could be used as a mechanism to buffer up results in CAS Consumers and perform
time-consuming operations, such as check-pointing, that should not be done on a
per-document basis.</para>
</section>
</section>
</section>
<section id="&tp;descriptor.operational_parameters">
<title>CPE Operational Parameters</title>
<para>The parameters for configuring the overall CPE and CPM are specified in the
<literal>&lt;cpeConfig&gt;</literal> section. The overall format of this
section is:
<programlisting><![CDATA[<cpeConfig>
<startAt>[NumberOrID]</startAt>
<numToProcess>[Number]</numToProcess>
<outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" />
<checkpoint file="[File]" time="[Number]" batch="[Number]"/>
<timerImpl>[ClassName]</timerImpl>
<deployAs>vinciService|interactive|immediate|single-threaded
</deployAs>
</cpeConfig>]]></programlisting></para>
<para>This section of the CPE descriptor allows for defining the starting entity, the
number of entities to process, a checkpoint file and frequency, a pluggable timer, an
optional output queue implementation, and finally a mode of operation. The mode of
operation determines how the CPM interacts with users and other systems.</para>
<para>The <literal>&lt;startAt&gt;</literal> element is an optional argument. It
defines the starting entity in the collection at which the CPM should start
processing.</para>
<para>The implementation in the CPM passes this argument to the Collection Reader
as the value of the parameter <quote><literal>startNumber</literal></quote>.
The CPM does not do anything else with this parameter; in particular, the CPM has no
ability to skip to a specific document. That function, if available, is provided only
by a particular Collection Reader implementation.</para>
<para>If the <literal>&lt;startAt&gt;</literal> element is used, the Collection
Reader descriptor must define a single-valued configuration parameter with the
name <literal>startNumber</literal>. It can declare this value to be of any type;
the value passed in this XML element must be convertible to that type.</para>
<para>A typical use is to declare this to be an integer type, and to pass the sequential
document number where processing should start. An alternative implementation
might take a specific document ID; the collection reader could search through its
collection until it reaches this ID and then start there.</para>
<para>This parameter will only make sense if the particular collection reader is
implemented to use the <literal>startNumber</literal> configuration
parameter.</para>
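<para>The following is a minimal sketch of the two pieces involved, assuming a
Collection Reader that interprets <literal>startNumber</literal> as a sequential
document number. The value <literal>100</literal> and the description text are
illustrative assumptions. The CPE descriptor would contain:
<programlisting><![CDATA[<cpeConfig>
  <startAt>100</startAt>
  <deployAs>immediate</deployAs>
</cpeConfig>]]></programlisting>
and the Collection Reader descriptor would declare a matching single-valued
parameter:
<programlisting><![CDATA[<configurationParameters>
  <configurationParameter>
    <name>startNumber</name>
    <description>Sequential number of the first document to process
      (assumed meaning for this example)</description>
    <type>Integer</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
</configurationParameters>]]></programlisting></para>
<para>The declaration alone does not cause any documents to be skipped; the Collection
Reader code must read the parameter and act on it.</para>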
<para>The <literal>&lt;numToProcess&gt;</literal> element is an optional
element. It specifies the total number of entities to process. Use -1 to indicate ALL.
If not defined, the number of entities to process will be taken from the Collection
Reader configuration. If present, this value overrides the Collection Reader
configuration.</para>
<para>The <literal>&lt;outputQueue&gt;</literal> element is an optional element.
It enables plugging in a custom implementation for the Output Queue. When omitted,
the CPM will use a default output queue that is based on a First-In First-Out (FIFO)
model.</para>
<para>The UIMA SDK provides a second implementation for the Output Queue that can be
plugged in to the CPM, named <quote>
<literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal>
</quote>.</para>
<para>This implementation supports handling very large documents that are split into
<quote>chunks</quote>; it provides a delivery mechanism that ensures the
sequential order of the chunks using information carried in the CAS metadata. This
metadata, which is required for this implementation to work correctly, must be added
as an instance of a Feature Structure of type
<literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an
additional feature named <literal>esDocumentMetaData</literal> in the special
instance of <literal>uima.tcas.DocumentAnnotation</literal> that is
associated with the CAS. This is usually done by the Collection Reader; the instance
contains the following features:
<variablelist>
<varlistentry>
<term>sequenceNumber</term>
<listitem><para>[Number] the sequential number of a chunk, starting at 1. If
not a chunk (i.e., a complete document), the value should be 0.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>documentId</term>
<listitem><para>[Number] current document id. Chunks belonging to the same
document share the same document id.</para></listitem>
</varlistentry>
<varlistentry>
<term>isCompleted</term>
<listitem><para>[Number] 1 if the chunk is the last in a sequence, 0
otherwise.</para></listitem>
</varlistentry>
<varlistentry>
<term>url</term>
<listitem><para>[String] document url.</para></listitem>
</varlistentry>
<varlistentry>
<term>throttleID</term>
<listitem><para>[String] special attribute currently used by
OmniFind.</para></listitem>
</varlistentry>
</variablelist></para>
<para>This implementation of a sequenced queue supports proper sequencing of CASes in
CPM deployments that use document chunking. Chunking is a technique of splitting
large documents into pieces to reduce overall memory consumption. Chunking does not
depend on the number of CASes in the CAS Pool. It works equally well with one or more
CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
CAS is released back to the pool by the processing threads. A document may be split into
1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
document correctly, the CAS Consumer can depend on receiving the chunks in the same
sequential order that the chunks were <quote>produced</quote>, when this
sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
the following specification:
<programlisting>&lt;outputQueue dequeueTimeout="100000" queueClass=
"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/&gt;</programlisting>
where the mandatory <literal>queueClass</literal> attribute defines the name of
the class and the second mandatory attribute, <literal>dequeueTimeout</literal>,
specifies the maximum number of milliseconds to wait for the expected chunk.</para>
<note><para>The value for this timeout must be carefully determined to avoid
excessive occurrences of timeouts. Typically, the size of a chunk and the type of
analysis being done are the most important factors when deciding on the value for the
timeout. The larger the chunk and the more complicated the analysis, the more time it takes
for the chunk to go from source to sink. You may specify 0, in which case the timeout is
disabled (i.e., it is equivalent to an infinitely long timeout).</para></note>
<para>If the chunk doesn&apos;t arrive in the configured time window, the entire
document is presumed to be invalid and the CAS is dropped from further processing.
This action occurs regardless of any other error action specification. The
SequencedQueue invalidates the document, adding the offending document&apos;s
metadata to a local cache of invalid documents.</para>
<para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe.using_listeners"/>) by calling
entityProcessComplete(). As part of this call, the SequencedQueue will pass null
instead of a CAS as the first argument, and a special exception &ndash;
CPMChunkTimeoutException. Null is passed as the first argument because the
expected chunk was not received within the configured timeout window, so there
is no CAS available when the timeout event occurs.</para>
<para>The CPMChunkTimeoutException object includes an API that allows the listener
to retrieve the offending document id as well as the other metadata attributes as
defined above. These attributes are part of each chunk&apos;s metadata and are added
by the Collection Reader.</para>
<para>Each chunk that SequencedQueue works on is subjected to a test to determine if the
chunk belongs to an invalid document. This test checks the chunk&apos;s metadata
against the data in the local cache. If there is a match, the chunk is dropped. This
check is performed only for chunks; complete documents are not subject to
it.</para>
<para>If there is an exception during the processing of a chunk, the CPM sends a
notification to all registered listeners. The notification includes the CAS and an
exception. When the listener notification is completed, the CPM also sends separate
notifications, containing the CAS, to the Artifact Producer and the
SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong
to an <quote>invalid</quote> document and also to deal with chunks that are
en-route, being processed by the processing threads.</para>
<para>In response to the notification, the Artifact Producer will drop and release
back to the CAS Pool all CASes that belong to an <quote>invalid</quote> document.
Currently, there is no support in the CollectionReader&apos;s API to tell it to stop
generating chunks. The CollectionReader keeps producing the chunks but the
Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is
released back to the CAS Pool, the Artifact Producer sends a notification to all
registered listeners. This notification includes the CAS and an exception &ndash;
SkipCasException.</para>
<para>In response to the notification of an exception involving a chunk, the
SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of
<quote>invalid</quote> documents. All chunks de-queued from the OutputQueue and
belonging to <quote>invalid</quote> documents will be dropped and released back to
the CAS Pool. Before dropping the CAS, the CPM sends a notification to all registered
listeners. The notification includes the CAS and a SkipCasException.</para>
<para>The <literal>&lt;checkpoint&gt;</literal> element is an optional element.
It specifies a CPE checkpoint file, checkpoint frequency, and strategy for
checkpoints (time or count based). At checkpoint time, the CPM saves status
information and statistics to the checkpoint file. The checkpoint file is specified
in the <literal>file</literal> attribute, which has the same form as the
<literal>href</literal> attribute of the <literal>&lt;include&gt;</literal>
element described in <xref linkend="&tp;imports"/>. The
<literal>time</literal> attribute indicates that a checkpoint should be taken
every <literal>[Number]</literal> seconds, and the <literal>batch</literal>
attribute indicates that a checkpoint should be taken every
<literal>[Number]</literal> batches.</para>
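<para>As a sketch of the two strategies, assuming a hypothetical checkpoint file named
<literal>cpe_checkpoint.dat</literal> and illustrative interval values, a time-based
checkpoint might be written as:
<programlisting><![CDATA[<!-- Take a checkpoint every 300 seconds -->
<checkpoint file="cpe_checkpoint.dat" time="300"/>]]></programlisting>
and a count-based checkpoint as:
<programlisting><![CDATA[<!-- Take a checkpoint every 10 batches -->
<checkpoint file="cpe_checkpoint.dat" batch="10"/>]]></programlisting></para>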
<para>The <literal>&lt;timerImpl&gt;</literal> element is optional. It is used to
identify a custom timer plug-in class to generate time stamps during the CPM
execution. The value of the element is a Java class name.</para>
<para>The <literal>&lt;deployAs&gt;</literal> element indicates the type of CPM
deployment. Valid contents for this element include:
<variablelist>
<varlistentry>
<term>vinciService</term>
<listitem><para>Vinci service exposing APIs for stop, pause, resume, and
getStats</para></listitem>
</varlistentry>
<varlistentry>
<term>interactive</term>
<listitem><para>provide command line menus (start, stop, pause,
resume)</para></listitem>
</varlistentry>
<varlistentry>
<term>immediate</term>
<listitem><para>run the CPM without menus or a service API</para></listitem>
</varlistentry>
<varlistentry>
<term>single-threaded</term>
<listitem><para>run the CPM in a single-threaded mode. In this mode, the
Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
are all running in one thread without the work queue and the output
queue.</para></listitem>
</varlistentry>
</variablelist></para>
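<para>Putting several of these elements together, the following is one possible
<literal>&lt;cpeConfig&gt;</literal> section, shown as a sketch rather than a
recommended configuration. The checkpoint file name and the numeric values are
assumptions made for this example; the optional <literal>&lt;startAt&gt;</literal>
and <literal>&lt;timerImpl&gt;</literal> elements are omitted:
<programlisting><![CDATA[<cpeConfig>
  <!-- Process all entities supplied by the Collection Reader -->
  <numToProcess>-1</numToProcess>
  <!-- Deliver chunks in order; wait up to 100000 ms for each expected chunk -->
  <outputQueue dequeueTimeout="100000" queueClass=
      "org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/>
  <!-- Hypothetical file name; checkpoint every 300 seconds -->
  <checkpoint file="cpe_checkpoint.dat" time="300"/>
  <!-- Run without menus or a service API -->
  <deployAs>immediate</deployAs>
</cpeConfig>]]></programlisting></para>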
</section>
<section id="&tp;descriptor.resource_manager_configuration">
<title>Resource Manager Configuration</title>
<para>External resource bindings for the CPE may optionally be specified in an
element:
<programlisting>&lt;resourceManagerConfiguration href="..."/&gt;</programlisting></para>
<para>For an introduction to external resources, refer to <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para>
<para>In the <literal>resourceManagerConfiguration</literal> element, the value
of the href attribute refers to another file that contains definitions and bindings
for the external resources used by the CPE. The format of this file is the same as the XML
snippet described in <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
For example, in a CPE containing an aggregate analysis engine with two annotators,
and a CAS Consumer, the following resource manager configuration file would bind
external resource dependencies in all three components to the same physical
resource:
<programlisting><![CDATA[<resourceManagerConfiguration>
<!-- Declare Resource -->
<externalResources>
<externalResource>
<name>ExampleResource</name>
<fileResourceSpecifier>
<fileUrl>file:MyResourceFile.dat</fileUrl>
</fileResourceSpecifier>
</externalResource>
</externalResources>
<!-- Bind component resource dependencies to ExampleResource -->
<externalResourceBindings>
<externalResourceBinding>
<key>MyAE/annotator1/myResourceKey</key>
<resourceName>ExampleResource</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>MyAE/annotator2/someResourceKey</key>
<resourceName>ExampleResource</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>MyCasConsumer/otherResourceKey</key>
<resourceName>ExampleResource</resourceName>
</externalResourceBinding>
</externalResourceBindings>
</resourceManagerConfiguration>]]></programlisting></para>
<para>In this example, <literal>MyAE</literal> and
<literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS
Consumer, as specified by the name attributes of the CPE&apos;s
<literal>&lt;casProcessor&gt;</literal> elements.
<literal>annotator1</literal> and <literal>annotator2</literal> are the
annotator keys specified within the Aggregate AE Descriptor, and
<literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and
<literal>otherResourceKey</literal> are the keys of the resource dependencies
declared in the individual annotator and CAS Consumer descriptors.</para>
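<para>To make the key naming concrete, the following sketch shows
<literal>&lt;casProcessor&gt;</literal> elements whose <literal>name</literal>
attributes would match the binding keys in the example above. The import locations
<literal>MyAggregateAE.xml</literal> and <literal>MyCasConsumer.xml</literal> are
hypothetical, and the pool size, thread count, and error handling values are
illustrative only:
<programlisting><![CDATA[<casProcessors casPoolSize="1" processingUnitThreadCount="1">
  <!-- name="MyAE" is the first segment of MyAE/annotator1/myResourceKey -->
  <casProcessor deployment="integrated" name="MyAE">
    <descriptor>
      <import location="MyAggregateAE.xml"/>
    </descriptor>
    <deploymentParameters/>
    <filter/>
    <errorHandling>
      <errorRateThreshold action="terminate" value="100/1000"/>
      <maxConsecutiveRestarts action="terminate" value="30"/>
      <timeout max="100000"/>
    </errorHandling>
  </casProcessor>
  <!-- name="MyCasConsumer" is the first segment of MyCasConsumer/otherResourceKey -->
  <casProcessor deployment="integrated" name="MyCasConsumer">
    <descriptor>
      <import location="MyCasConsumer.xml"/>
    </descriptor>
    <deploymentParameters/>
    <filter/>
    <errorHandling>
      <errorRateThreshold action="terminate" value="100/1000"/>
      <maxConsecutiveRestarts action="terminate" value="30"/>
      <timeout max="100000"/>
    </errorHandling>
  </casProcessor>
</casProcessors>]]></programlisting></para>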
</section>
<section id="&tp;descriptor.example">
<title>Example CPE Descriptor</title>
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<cpeDescription>
<collectionReader>
<collectionIterator>
<descriptor>
<import location=
"../collection_reader/FileSystemCollectionReader.xml"/>
</descriptor>
</collectionIterator>
</collectionReader>
<casProcessors dropCasOnException="true" casPoolSize="1"
processingUnitThreadCount="1">
<casProcessor deployment="integrated"
name="Aggregate TAE - Name Recognizer and Person Title Annotator">
<descriptor>
<import location=
"../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
</descriptor>
<deploymentParameters/>
<filter/>
<errorHandling>
<errorRateThreshold action="terminate" value="100/1000"/>
<maxConsecutiveRestarts action="terminate" value="30"/>
<timeout max="100000"/>
</errorHandling>
<checkpoint batch="1"/>
</casProcessor>
<casProcessor deployment="integrated" name="Annotation Printer">
<descriptor>
<import location="../cas_consumer/AnnotationPrinter.xml"/>
</descriptor>
<deploymentParameters/>
<filter/>
<errorHandling>
<errorRateThreshold action="terminate" value="100/1000"/>
<maxConsecutiveRestarts action="terminate" value="30"/>
<timeout max="100000"/>
</errorHandling>
<checkpoint batch="1"/>
</casProcessor>
</casProcessors>
<cpeConfig>
<numToProcess>1</numToProcess>
<deployAs>immediate</deployAs>
<checkpoint file="" time="3000"/>
<timerImpl/>
</cpeConfig>
</cpeDescription>]]></programlisting>
</section>
</chapter>