<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" | |
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[ | |
<!ENTITY imgroot "../images/references/ref.xml.cpe_descriptor/"> | |
<!ENTITY tp "ugr.ref.xml.cpe_descriptor."> | |
<!ENTITY % uimaents SYSTEM "../entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.ref.xml.cpe_descriptor"> | |
<title>Collection Processing Engine Descriptor Reference</title> | |
<titleabbrev>CPE Descriptor Reference</titleabbrev> | |
<para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination | |
of UIMA components assembled to analyze a collection of artifacts. A CPE is an | |
instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>, | |
which defines the collection processing components, interfaces, and APIs. A CPE is | |
executed by a UIMA framework component called the <emphasis>Collection Processing | |
Manager</emphasis> (CPM), which provides a number of services for deploying CPEs, | |
running CPEs, and handling errors.</para> | |
<para>A CPE can be assembled programmatically within a Java application, or it can be | |
assembled declaratively via a CPE configuration specification, called a CPE | |
Descriptor. This chapter describes the format of the CPE Descriptor.</para> | |
<para>Details about the CPE, including its function, sub-components, APIs, and related | |
tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and | |
provide context for the later sections that describe the CPE Descriptor.</para> | |
<section id="&tp;overview"> | |
<title>CPE Overview</title> | |
<figure id="&tp;overview.fig.runtime"> | |
<title>CPE Runtime Overview</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.8in" format="PNG" | |
fileref="&imgroot;image002.png"/> | |
</imageobject> | |
<textobject><phrase>CPE Runtime Overview diagram</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para>An illustration of the CPE runtime is shown in <xref | |
linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the | |
<emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are | |
internal to the CPE, but their behavior and deployment may be configured using the CPE | |
Descriptor. Other CPE components, such as the <emphasis>Collection | |
Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and | |
configured externally from the CPE and then plugged in to the CPE to create the overall | |
engine. The parts of a CPE are: | |
<variablelist> | |
<varlistentry> | |
<term>Collection Reader</term> | |
<listitem><para>understands the native data collection format and iterates | |
over the collection producing subjects of analysis</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>CAS Initializer<footnote><para>Deprecated</para></footnote> | |
</term> | |
<listitem><para>initializes a CAS with a subject of analysis</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>Artifact Producer</term> | |
<listitem><para>asynchronously pulls CASes from the Collection Reader, | |
creates batches of CASes and puts them into the work queue</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>Work Queue</term> | |
<listitem><para>shared queue containing batches of CASes queued by the Artifact | |
Producer for analysis by Analysis Engines</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>B1-Bn</term> | |
<listitem><para>individual batches containing 1 or more CASes</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>AE1-AEn</term> | |
<listitem><para>Analysis Engines arranged by a CPE descriptor</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>Processing Pipelines</term> | |
<listitem><para>each pipeline runs in a separate thread and contains a | |
replicated set of the Analysis Engines running in the defined sequence</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>Output Queue</term> | |
<listitem><para>holds batches of CASes with analysis results intended for CAS | |
Consumers</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>CAS Consumers</term> | |
<listitem><para>perform collection level analysis over the CASes and extract | |
analysis results, e.g., creating indexes or databases</para></listitem> | |
</varlistentry> | |
</variablelist> | |
</para> | |
</section> | |
<section id="&tp;notation"> | |
<title>Notation</title> | |
<para>CPE Descriptors are XML files. This chapter uses an informal notation to specify | |
the syntax of CPE Descriptors.</para> | |
<para>The notation used in this chapter is: | |
<itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates | |
that the substructure of that element has been omitted (to be described in another | |
section of this chapter). An example of this would be: | |
<programlisting>&lt;collectionReader&gt;
...
&lt;/collectionReader&gt;</programlisting></para>
</listitem> | |
<listitem><para>An ellipsis immediately after an element indicates that the | |
element type may be repeated arbitrarily many times. For example: | |
<programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
...</programlisting>
indicates that there may be arbitrarily many parameter elements in this | |
context.</para></listitem> | |
<listitem><para>An ellipsis inside an element means details of the attributes | |
associated with that element are defined later, e.g.: | |
<programlisting>&lt;casProcessor ...&gt;</programlisting></para>
</listitem> | |
<listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>) | |
indicate the type of value that may be used at that location.</para></listitem> | |
<listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates | |
alternatives. This can be applied to literal values, bracketed type names, and | |
elements. </para></listitem></itemizedlist></para> | |
<para>Which elements are optional and which are required is specified in prose, not in the | |
syntax definition.</para> | |
</section> | |
<section id="&tp;imports"> | |
<title>Imports</title> | |
<para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism | |
as other component descriptors. This allows referring to component | |
descriptors using either relative paths (resolved relative to the location of the CPE descriptor) | |
or the classpath/datapath. For details see <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor"/>.</para> | |
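<para>As a minimal sketch of the <literal>import</literal> forms mentioned above, the
following fragments reference a component descriptor by relative path and by name. The
file path and the dotted name are hypothetical placeholders, not descriptors shipped
with UIMA:
<programlisting><![CDATA[<!-- by location: resolved relative to the CPE descriptor -->
<descriptor>
  <import location="descriptors/MyAnnotator.xml"/>
</descriptor>

<!-- by name: looked up via the classpath/datapath -->
<descriptor>
  <import name="org.example.descriptors.MyAnnotator"/>
</descriptor>]]></programlisting></para>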
<para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
<programlisting><![CDATA[<descriptor> | |
<include href="[URL or File]"/> | |
</descriptor>]]></programlisting></para> | |
<para>The value shown as <literal>[URL or File]</literal> is a URL or a filename for the
descriptor of the incorporated component. The framework first attempts to resolve the value as a URL.</para>
<para> | |
Relative paths in an <literal>include</literal> are resolved relative to the current working directory | |
(NOT the CPE descriptor location as is the case for <literal>import</literal>). | |
A filename relative to another directory can be specified using the <literal>CPM_HOME</literal> | |
variable, e.g., | |
<programlisting>&lt;descriptor&gt;
&lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
&lt;/descriptor&gt;</programlisting>
In this case, the value for the <literal>CPM_HOME</literal> variable must be | |
provided to the CPE by specifying it on the Java command line, e.g., | |
<programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting> | |
</para> | |
</section> | |
<section id="&tp;descriptor"> | |
<title>CPE Descriptor Overview</title> | |
<para>A CPE Descriptor consists of information describing the following four main | |
elements.</para> | |
<orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which | |
is responsible for gathering artifacts and initializing the Common Analysis | |
Structure (CAS) used to support processing in the UIMA collection processing | |
engine.</para></listitem> | |
<listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for | |
analyzing individual artifacts, analyzing across artifacts, and extracting | |
analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis> | |
and <emphasis>CAS Consumers</emphasis>.</para></listitem> | |
<listitem><para>Operational parameters of the <emphasis>Collection Processing | |
Manager</emphasis> (CPM), such as checkpoint frequency and deployment | |
mode.</para></listitem> | |
<listitem><para>Resource Manager Configuration (optional). </para></listitem> | |
</orderedlist> | |
<para>The CPE Descriptor has the following high level skeleton: | |
<programlisting><![CDATA[<?xml version="1.0"?> | |
<cpeDescription> | |
<collectionReader> | |
... | |
</collectionReader> | |
<casProcessors> | |
... | |
</casProcessors> | |
<cpeConfig> | |
... | |
</cpeConfig> | |
<resourceManagerConfiguration> | |
... | |
</resourceManagerConfiguration> | |
</cpeDescription>]]></programlisting></para> | |
<para>Details of each of the four main elements are described in the sections that | |
follow.</para> | |
</section> | |
<section id="&tp;descriptor.collection_reader"> | |
<title>Collection Reader</title> | |
<para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
Collection Reader and optional CAS Initializer that are to be used in the CPE. The | |
Collection Reader is responsible for retrieval of artifacts from a collection | |
outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2) | |
is responsible for initializing the CAS with the artifact.</para> | |
<para>A Collection Reader may initialize the CAS itself, in which case it does not | |
require a CAS Initializer. This should be clearly specified in the documentation for | |
the Collection Reader. Specifying a CAS Initializer for a Collection Reader that | |
does not make use of a CAS Initializer will not cause an error, but the specified CAS | |
Initializer will not be used.</para> | |
<para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
section is: | |
<programlisting><![CDATA[<collectionReader> | |
<collectionIterator> | |
<descriptor> | |
<import ...> | <include .../> | |
</descriptor> | |
<configurationParameterSettings>...</configurationParameterSettings> | |
<sofaNameMappings>...</sofaNameMappings> | |
</collectionIterator> | |
<casInitializer> | |
<descriptor> | |
<import ...> | <include .../> | |
</descriptor> | |
<configurationParameterSettings>...</configurationParameterSettings> | |
<sofaNameMappings>...</sofaNameMappings> | |
</casInitializer> | |
</collectionReader>]]></programlisting></para> | |
<para>The <literal>&lt;collectionIterator&gt;</literal> identifies the
descriptor for the Collection Reader, and the
<literal>&lt;casInitializer&gt;</literal> identifies the descriptor for the CAS
Initializer. The format and details of the Collection Reader and CAS Initializer
descriptors are described in
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
The <literal>&lt;configurationParameterSettings&gt;</literal> and the
<literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
section.</para>
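<para>For illustration, a <literal>&lt;collectionReader&gt;</literal> section with no CAS
Initializer might look like the following sketch. The descriptor location, the parameter
name <literal>InputDirectory</literal>, and its value are assumptions made for this
example; substitute the ones defined by the Collection Reader actually used:
<programlisting><![CDATA[<collectionReader>
  <collectionIterator>
    <descriptor>
      <!-- hypothetical path, resolved relative to this CPE descriptor -->
      <import location="descriptors/MyCollectionReader.xml"/>
    </descriptor>
    <configurationParameterSettings>
      <nameValuePair>
        <!-- hypothetical parameter; it must be declared by the reader -->
        <name>InputDirectory</name>
        <value>
          <string>/data/input</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
  </collectionIterator>
</collectionReader>]]></programlisting></para>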
<section id="&tp;descriptor.collection_reader.error_handling"> | |
<title>Error handling for Collection Readers</title> | |
<para>The CPM will abort if the Collection Reader throws a large number of
consecutive exceptions (default = 100). This default can be changed by using the
Java initialization parameter <literal>-DMaxCRErrorThreshold=xxx</literal>,
where <literal>xxx</literal> is the new threshold.</para>
</section> | |
</section> | |
<section id="&tp;descriptor.cas_processors"> | |
<title>CAS Processors</title> | |
<para>The <literal>&lt;casProcessors&gt;</literal> section identifies the
components that perform the analysis on the input data, including CAS analysis | |
(Analysis Engines) and analysis results extraction (CAS Consumers). The CAS | |
Consumers may also perform collection level analysis, where the analysis is | |
performed (or aggregated) over multiple CASes. The basic structure of the CAS | |
Processors section is: | |
<programlisting><![CDATA[<casProcessors | |
dropCasOnException="true|false" | |
casPoolSize="[Number]" | |
processingUnitThreadCount="[Number]"> | |
<casProcessor ...> | |
... | |
</casProcessor> | |
<casProcessor ...> | |
... | |
</casProcessor> | |
... | |
</casProcessors>]]></programlisting></para> | |
<para>The <literal>&lt;casProcessors&gt;</literal> section has two mandatory
attributes and one optional attribute that configure the characteristics of the CAS
Processor flow in the CPE. The first mandatory attribute is
<literal>casPoolSize</literal>, which defines the fixed number of CAS instances that
the CPM will create and use during processing. All CAS instances are maintained in a
CAS Pool with check-in and check-out access. Each CAS is checked out from the CAS Pool
by the Collection Reader and initialized with an initial subject of analysis. The CAS
is checked back into the CAS Pool when it is completely processed, at the end of the
processing chain. A larger CAS Pool size will result in more memory being used by the
CPM. CAS objects can be large, and care should be taken to determine the optimum size
of the CAS Pool, weighing memory use against performance.</para>
<para>The second mandatory <literal>&lt;casProcessors&gt;</literal> attribute
is <literal>processingUnitThreadCount</literal>, which specifies the number of
replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
contains one or more Analysis Engines invoked in a given sequence. If more than one
Processing Pipeline is specified, the CPM replicates instances of each Analysis
Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
independently, consuming CASes from the work queue and depositing CASes with analysis
results onto the output queue. On multiprocessor machines, multiple Processing
Pipelines can run in parallel, improving overall throughput of the CPM.</para>
<note><para>The number of Processing Pipelines should be equal to or greater than CAS | |
Pool size. </para></note> | |
<para>Elements in the pipeline (each represented by a
<literal>&lt;casProcessor&gt;</literal> element) may indicate in their Analysis
Engine descriptor that they do not permit multiple deployment. If so, even though
multiple pipelines are being used, all CASes passing through the pipelines will be
routed through one instance of these marked Engines.
</para>
<para>The final, optional, <literal>&lt;casProcessors&gt;</literal> attribute is
<literal>dropCasOnException</literal>. It defines a policy that determines what
happens to the CAS when an exception occurs during processing. If the value of this
attribute is set to true and an exception occurs, the CPM will notify all registered
listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS, and check the CAS
back into the CAS Pool so that it can be re-used. The presumption is that an exception
may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
to move through the processing chain. When this attribute is omitted, the CPM's
default is the same as specifying
<literal>dropCasOnException="false"</literal>.</para>
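<para>As an illustrative sketch of the three attributes, the following fragment asks
for a pool of two CASes, two Processing Pipelines, and dropping of any CAS whose
processing raised an exception. The numeric values are arbitrary examples, not
recommendations:
<programlisting><![CDATA[<casProcessors
    dropCasOnException="true"
    casPoolSize="2"
    processingUnitThreadCount="2">
  <casProcessor ...>
    ...
  </casProcessor>
</casProcessors>]]></programlisting></para>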
<section id="&tp;descriptor.cas_processors.individual"> | |
<title>Specifying an Individual CAS Processor</title> | |
<para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
pipeline are specified with the <literal>&lt;casProcessor&gt;</literal>
element, which appears within the <literal>&lt;casProcessors&gt;</literal>
element. It may appear multiple times, once for each CAS Processor specified for
this CPE.</para>
<para>The order of the <literal>&lt;casProcessor&gt;</literal> elements within
the <literal>&lt;casProcessors&gt;</literal> section specifies the order in
which the CAS Processors will run. Although CAS Consumers are usually put at the end
of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
Consumers.</para>
<para>The overall format of the <literal>&lt;casProcessor&gt;</literal> element
is:
<programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" > | |
<descriptor> | |
<import ...> | <include .../> | |
</descriptor> | |
<configurationParameterSettings>...</configurationParameterSettings> | |
<sofaNameMappings>...</sofaNameMappings> | |
<runInSeparateProcess>...</runInSeparateProcess> | |
<deploymentParameters>...</deploymentParameters> | |
<filter/> | |
<errorHandling>...</errorHandling> | |
<checkpoint batch="[Number]"/>
</casProcessor>]]></programlisting></para> | |
<para>The <literal>&lt;casProcessor&gt;</literal> element has two mandatory
attributes, <literal>deployment</literal> and <literal>name</literal>. The | |
mandatory <literal>name</literal> attribute specifies a unique string | |
identifying the CAS Processor.</para> | |
<para>The mandatory <literal>deployment</literal> attribute specifies the CAS | |
Processor deployment mode. Currently, three deployment options are supported: | |
<variablelist> | |
<varlistentry> | |
<term>integrated</term> | |
<listitem><para>indicates <emphasis>integrated</emphasis> deployment | |
of the CAS Processor. The CPM deploys and collocates the CAS Processor in the | |
same process space as the CPM. This type of deployment is recommended to | |
increase the performance of the CPE. However, it is NOT recommended to | |
deploy annotators containing JNI this way. Such CAS Processors may cause a | |
fatal exception and force the JVM to exit without cleanup (bringing down the | |
CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed | |
this way.</para> | |
<para>The descriptor for an integrated deployment can, in fact, be a remote | |
service descriptor. When used this way, however, the CPM error recovery | |
options (see below) operate in the integrated mode, which means that many | |
of the retry options are not available.</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>remote</term> | |
<listitem><para>indicates <emphasis>non-managed</emphasis> | |
deployment of the CAS Processor. The CAS Processor descriptor referenced | |
in the <literal>&lt;descriptor&gt;</literal> element must be a Vinci
<emphasis>Service Client Descriptor</emphasis>, which identifies a | |
remotely deployed CAS Processor service (see <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.remote_services"/>). The CPM | |
assumes that the CAS Processor is already running as a remote service and | |
will connect to it using the URI provided in the client service descriptor. | |
The lifecycle of a remotely deployed CAS Processor is not managed by the CPM, | |
so appropriate infrastructure should be in place to start/restart such CAS | |
Processors when necessary. This deployment provides fault isolation and | |
is implementation (i.e., programming language) neutral.</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>local</term> | |
<listitem><para>indicates <emphasis>managed</emphasis> deployment of | |
the CAS Processor. The CAS Processor descriptor referenced in the | |
<literal>&lt;descriptor&gt;</literal> element must be a Vinci
<emphasis>Service Deployment Descriptor</emphasis>, which configures | |
a CAS Processor for deployment as a Vinci service (see <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.remote_services"/>). The CPM | |
deploys the CAS Processor in a separate process and manages the life cycle | |
(start/stop) of the CAS Processor. Communication between the CPM and the | |
CAS Processor is done with Vinci. When the CPM completes processing, the | |
process containing the CAS Processor is terminated. This deployment mode | |
insulates the CPM from the CAS Processor, creating a more robust deployment | |
at the cost of a small communication overhead. On multiprocessor machines, | |
the separate processes may run concurrently and improve overall | |
throughput.</para></listitem> | |
</varlistentry> | |
</variablelist></para> | |
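<para>To make the deployment modes concrete, here is a sketch of an integrated CAS
Processor entry. The name and descriptor location are hypothetical, the checkpoint
batch size is arbitrary, and the elided elements follow the notation of this chapter:
<programlisting><![CDATA[<casProcessor deployment="integrated" name="Example Annotator">
  <descriptor>
    <!-- hypothetical Analysis Engine descriptor -->
    <import location="descriptors/ExampleAnnotator.xml"/>
  </descriptor>
  <deploymentParameters>...</deploymentParameters>
  <errorHandling>...</errorHandling>
  <checkpoint batch="1000"/>
</casProcessor>]]></programlisting>
For a <literal>remote</literal> or <literal>local</literal> deployment, the
<literal>deployment</literal> attribute changes and the referenced descriptor must be
the corresponding Vinci descriptor, as described above.</para>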
<para>A number of elements may appear within the | |
<literal>&lt;casProcessor&gt;</literal> element.</para>
<section id="&tp;descriptor.cas_processors.individual.descriptor"> | |
<title>&lt;descriptor&gt; Element</title>
<para>The <literal>&lt;descriptor&gt;</literal> element is mandatory. It
identifies the descriptor for the referenced CAS Processor using the syntax | |
described in <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.aes"/>. | |
<itemizedlist spacing="compact"><listitem><para>For | |
<emphasis><literal>remote</literal></emphasis> CAS Processors, the | |
referenced descriptor must be a Vinci <emphasis>Service Client | |
Descriptor</emphasis>, which identifies a remotely deployed CAS Processor | |
service.</para></listitem> | |
<listitem><para>For <emphasis>local</emphasis> CAS Processors, the | |
referenced descriptor must be a Vinci <emphasis>Service Deployment | |
Descriptor</emphasis>.</para></listitem> | |
<listitem><para>For <emphasis>integrated</emphasis> CAS Processors, | |
the referenced descriptor must be an Analysis Engine Descriptor | |
(primitive or aggregate). </para></listitem></itemizedlist> </para> | |
<para>See <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.remote_services"/> for more | |
information on creating these descriptors and deploying services.</para> | |
</section> | |
<section | |
id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings"> | |
<title>&lt;configurationParameterSettings&gt; Element</title>
<para>This element provides a way to override the contained Analysis
Engine's parameter settings. Any parameter specified here must already be
defined by the component; the values specified replace the corresponding values for
each parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
is only available when they are deployed in <quote>integrated</quote>
mode.</emphasis> For Collection Readers and CAS Initializers, it is always
available.</para>
<para>The content of this element is identical to the syntax used in component
descriptors for specifying parameter settings (in the case where no parameter groups
are specified)<footnote><para>An earlier UIMA version required the value type
elements to have a suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. This
is no longer required, but the older format is still accepted for backward
compatibility.</para></footnote>. Here is an example:
<programlisting><![CDATA[<configurationParameterSettings> | |
<nameValuePair> | |
<name>CivilianTitles</name> | |
<value> | |
<array> | |
<string>Mr.</string> | |
<string>Ms.</string> | |
<string>Mrs.</string> | |
<string>Dr.</string> | |
</array> | |
</value> | |
</nameValuePair> | |
... | |
</configurationParameterSettings>]]></programlisting></para> | |
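<para>The example above sets an array-valued parameter. A single-valued parameter is
set the same way without the <literal>&lt;array&gt;</literal> wrapper. In the
following sketch the parameter names are hypothetical and must match parameters
declared by the component:
<programlisting><![CDATA[<configurationParameterSettings>
  <nameValuePair>
    <!-- hypothetical integer parameter -->
    <name>MaxSentenceLength</name>
    <value>
      <integer>200</integer>
    </value>
  </nameValuePair>
  <nameValuePair>
    <!-- hypothetical boolean parameter -->
    <name>CaseSensitive</name>
    <value>
      <boolean>false</boolean>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting></para>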
</section> | |
<section | |
id="&tp;descriptor.cas_processors.individual.sofa_name_mappings"> | |
<title>&lt;sofaNameMappings&gt; Element</title>
<para>This optional element provides a mapping from the Sofa names defined in the
component, or from the default Sofa name (if the component does not declare any Sofa
names), to Sofa names used in the CPE. The form of this element is:
<programlisting>&lt;sofaNameMappings&gt;
&lt;sofaNameMapping cpeSofaName="a_CPE_name"
componentSofaName="a_component_Name"/&gt;
...
&lt;/sofaNameMappings&gt;</programlisting></para>
<para>There can be any number of
<literal>&lt;sofaNameMapping&gt;</literal> elements contained in the
<literal>&lt;sofaNameMappings&gt;</literal> element. The
<literal>componentSofaName</literal> attribute is optional; leave it out to
specify a mapping for the <literal>_InitialView</literal>, that is, for
Single-View components.</para>
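<para>As a sketch with hypothetical Sofa names, the first fragment below maps the
component's <literal>InputText</literal> Sofa to the CPE Sofa
<literal>EnglishDocument</literal>. The second fragment omits
<literal>componentSofaName</literal>, so the mapping applies to the component's
<literal>_InitialView</literal>:
<programlisting><![CDATA[<!-- multi-view component: map a named component Sofa -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="EnglishDocument"
                   componentSofaName="InputText"/>
</sofaNameMappings>

<!-- single-view component: componentSofaName left out -->
<sofaNameMappings>
  <sofaNameMapping cpeSofaName="EnglishDocument"/>
</sofaNameMappings>]]></programlisting></para>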
</section> | |
<section id="&tp;descriptor.cas_processors.run_in_separate_process"> | |
<title>&lt;runInSeparateProcess&gt; Element</title>
<para>The <literal>&lt;runInSeparateProcess&gt;</literal> element is
mandatory for <literal>local</literal> CAS Processors, but should not appear | |
for <literal>remote</literal> or <literal>integrated</literal> CAS | |
Processors. It enables the CPM to create external processes using the provided | |
runtime environment. Applications launched this way communicate with the CPM | |
using the Vinci protocol and connectivity is enabled by a local instance of the | |
VNS that the CPM manages. Since communication is based on Vinci, the application | |
need not be implemented in Java. Any language for which Vinci provides support | |
may be used to create an application, and the CPM will seamlessly communicate | |
with it. The overall structure of this element is: | |
<programlisting><![CDATA[<runInSeparateProcess> | |
<exec dir="[String]" executable="[String]"> | |
<env key="[String]" value="[String]"/>
... | |
<arg>[String]</arg> | |
... | |
</exec> | |
</runInSeparateProcess>]]></programlisting></para> | |
<para>The <literal>&lt;exec&gt;</literal> element provides information
about how to execute the referenced CAS Processor. Two attributes are defined
for the <literal>&lt;exec&gt;</literal> element. The
<literal>dir</literal> attribute is currently not used; it is reserved
for future functionality. The <literal>executable</literal> attribute
specifies the actual Vinci service executable that will be run by the CPM, e.g.,
<literal>java</literal>, a batch script, an application (.exe), etc. The
executable must be specified with a fully qualified path, or be found in the
<literal>PATH</literal> of the CPM.</para>
<para>The <literal>&lt;exec&gt;</literal> element has two elements within it
that define parameters used to construct the command line for executing the CAS | |
Processor. These elements must be listed in the order in which they should be | |
defined for the CAS Processor.</para> | |
<para>The optional <literal>&lt;env&gt;</literal> element is used to set an
environment variable. The variable named by the <literal>key</literal> attribute
is set to the contents of the <literal>value</literal> attribute. For example,
<programlisting>&lt;env key="CLASSPATH" value="C:\Java\lib"/&gt;</programlisting>
will set the environment variable <literal>CLASSPATH</literal> to the value
<literal>C:\Java\lib</literal>. The <literal>&lt;env&gt;</literal>
element may be repeated to set multiple environment variables. All of the
key/value pairs will be added to the environment by the CPM prior to launching the
executable.</para>
<note><para>The CPM actually adds ALL system environment variables when it | |
launches the program. It queries the Operating System for its current system | |
variables and one by one adds them to the program's process | |
configuration.</para></note> | |
<para>The <literal>&lt;arg&gt;</literal> element is used to specify arbitrary
string arguments that will appear on the command line when the CPM runs the | |
command specified in the <literal>executable</literal> attribute.</para> | |
<para>For example, the following would be used to invoke the UIMA Java | |
implementation of the Vinci service wrapper on a Java CAS Processor: | |
<programlisting><![CDATA[<runInSeparateProcess> | |
<exec executable="java"> | |
<arg>-DVNS_HOST=localhost</arg> | |
<arg>-DVNS_PORT=9099</arg> | |
<arg>org.apache.uima.reference_impl.analysis_engine.service. | |
vinci.VinciAnalysisEngineService_impl</arg> | |
<arg>C:\uima\desc\deployCasProcessor.xml</arg>
</exec>
</runInSeparateProcess>]]></programlisting></para>
<para>This will cause the CPM to run the following command line when starting the
CAS Processor:
<programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099
org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
VinciAnalysisEngineService_impl
C:\uima\desc\deployCasProcessor.xml</programlisting></para>
<para>The first argument specifies that the Vinci Naming Service is running on the | |
<literal>localhost</literal>. The second argument specifies that the Vinci | |
Naming Service port number is <literal>9099</literal>. The third argument | |
(split over 2 lines in this documentation) | |
identifies the UIMA implementation of the Vinci service wrapper. This class | |
contains the <literal>main</literal> method that will execute. That main | |
method in turn takes a single argument – the filename for the CAS Processor | |
service deployment descriptor. Thus the last argument identifies the Vinci | |
service deployment descriptor file for the CAS Processor. Since this is the same | |
descriptor file specified earlier in the | |
<literal>&lt;descriptor&gt;</literal> element, the string
<literal>${descriptor}</literal> can be used to refer to the descriptor,
e.g.:
<programlisting>&lt;arg&gt;${descriptor}&lt;/arg&gt;</programlisting></para>
<para>The CPM will expand this out to the service deployment descriptor file
referenced in the <literal>&lt;descriptor&gt;</literal> element.</para>
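<para>Putting these pieces together, the earlier example could be written as the
following sketch, which also sets an environment variable and uses
<literal>${descriptor}</literal> in place of the explicit file name. The
<literal>CLASSPATH</literal> value is a hypothetical placeholder, and the class name
is again split over two lines for this documentation:
<programlisting><![CDATA[<runInSeparateProcess>
  <exec executable="java">
    <!-- hypothetical jar location; added to the launched process environment -->
    <env key="CLASSPATH" value="C:/myapp/lib/myCasProcessor.jar"/>
    <arg>-DVNS_HOST=localhost</arg>
    <arg>-DVNS_PORT=9099</arg>
    <arg>org.apache.uima.reference_impl.analysis_engine.service.
         vinci.VinciAnalysisEngineService_impl</arg>
    <!-- expanded by the CPM to the file named in <descriptor> -->
    <arg>${descriptor}</arg>
  </exec>
</runInSeparateProcess>]]></programlisting></para>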
</section> | |
<section | |
id="&tp;descriptor.cas_processors.individual.deployment_parameters"> | |
<title>&lt;deploymentParameters&gt; Element</title>
<para>The <literal>&lt;deploymentParameters&gt;</literal> element defines
a number of deployment parameters that control how the CPM will interact with the
CAS Processor. This element has the following overall form:
<programlisting>&lt;deploymentParameters&gt;
&lt;parameter name="[String]" value="..." type="string|integer" /&gt;
...
&lt;/deploymentParameters&gt;</programlisting></para>
<para>The <literal>name</literal> attribute identifies the parameter, the | |
<literal>value</literal> attribute specifies the value that will be assigned | |
to the parameter, and the <literal>type</literal> attribute indicates the | |
type of the parameter, either <literal>string</literal> or | |
<literal>integer</literal>. The available parameters include: | |
<variablelist> | |
<varlistentry> | |
<term>service-access</term> | |
<listitem><para>string parameter whose value must be
<quote>exclusive</quote>, if present. This parameter is only
effective for remote deployments. It causes the Vinci service
connections to be preallocated and dedicated, one service instance per
pipeline. It is only relevant for non-integrated deployment modes. If
fewer service instances are available (and alive, that is,
responding to a <quote>ping</quote> request) than there are pipelines,
the number of pipelines (the number of concurrent threads) is reduced to
match the number of available instances. If this parameter is not specified, the
VNS is queried each time a service is needed, and a <quote>random</quote>
instance is assigned from the pool of available instances. If a service
dies during processing, the CPM will use its normal error handling
procedures to attempt to reconnect. The number of attempts is specified
in the CPE descriptor for each CAS Processor using the
<literal>&lt;maxConsecutiveRestarts value="10"
action="kill-pipeline"
waitTimeBetweenRetries="50"/&gt;</literal> XML element. The
<quote>value</quote> attribute is the number of reconnection tries;
the <quote>action</quote> attribute says what to do if the retries exceed the
limit. The <quote>kill-pipeline</quote> action stops the pipeline
that was associated with the failing service (other pipelines will
continue to work). The CAS in process within a killed pipeline will be
dropped. These events are communicated to the application using the
normal event listener mechanism. The
<literal>waitTimeBetweenRetries</literal> attribute says how many
milliseconds to wait between attempts to reconnect.</para>
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>vnsHost</term> | |
<listitem><para>(Deprecated) string parameter specifying the VNS host,
e.g., <literal>localhost</literal> for local CAS Processors, or the host
name or IP address of the VNS host for remote CAS Processors. This parameter is
deprecated; if needed, specify the VNS host inside the Vinci
<emphasis>Service Client Descriptor</emphasis> instead. It is
ignored for integrated and local deployments. If present, for remote
deployments, it specifies the VNS host to use, unless that is specified in
the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>vnsPort</term> | |
<listitem><para>(Deprecated) integer parameter specifying the VNS port
number. This parameter is deprecated; if needed, specify the VNS port
inside the Vinci <emphasis>Service Client
Descriptor</emphasis> instead. It is ignored for integrated and
local deployments. If present, for remote deployments, it specifies the
VNS port number to use, unless that is specified in the Vinci
<emphasis>Service Client Descriptor</emphasis>.</para>
</listitem> | |
</varlistentry> | |
</variablelist></para> | |
<para>For example, the following parameters might be used with a CAS Processor | |
deployed in local mode: | |
<programlisting><deploymentParameters> | |
<parameter name="service-access" value="exclusive" type="string"/> | |
</deploymentParameters></programlisting></para> | |
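<para>As a further, purely illustrative sketch, a remote CAS Processor that requests
dedicated service connections might pair the <literal>service-access</literal>
parameter with the <literal>kill-pipeline</literal> restart policy described above.
The processor name, the descriptor path, and all numeric values below are hypothetical
placeholders rather than values taken from a shipped example:
<programlisting><![CDATA[<!-- Hypothetical remote CAS Processor; name, path, and values are placeholders -->
<casProcessor deployment="remote" name="Example Remote Annotator">
  <descriptor>
    <import location="descriptors/ExampleServiceClient.xml"/>
  </descriptor>
  <deploymentParameters>
    <!-- Preallocate one dedicated service instance per pipeline -->
    <parameter name="service-access" value="exclusive" type="string"/>
  </deploymentParameters>
  <filter/>
  <errorHandling>
    <!-- Try to reconnect up to 10 times, waiting 50 ms between attempts;
         if the limit is exceeded, stop only the affected pipeline -->
    <maxConsecutiveRestarts value="10" action="kill-pipeline"
        waitTimeBetweenRetries="50"/>
    <errorRateThreshold value="3/1000" action="disable"/>
    <timeout max="100000"/>
  </errorHandling>
</casProcessor>]]></programlisting></para>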
</section> | |
<section id="&tp;descriptor.cas_processors.individual.filter"> | |
<title><filter> Element</title> | |
<para>The <literal>&lt;filter&gt;</literal> element is required but currently should be
left empty. This element is reserved for future use.</para>
</section> | |
<section id="&tp;descriptor.cas_processors.individual.error_handling"> | |
<title><errorHandling> Element</title> | |
<para>The mandatory <literal><errorHandling></literal> element | |
defines error and restart policies for the CAS Processor. Each CAS Processor may | |
define different actions in the event of errors and restarts. The CPM monitors | |
and logs errant behaviors and attempts to recover the component based on the | |
policies specified in this element.</para> | |
<para>There are two kinds of faults: | |
<orderedlist><listitem><para>One kind occurs only with non-integrated CAS
Processors. This fault is either a timeout while attempting to launch or
connect to the non-integrated component, or some other kind of
connection-related exception (for instance, the network connection might time out or be
reset).</para></listitem>
<listitem><para>The other kind happens when the CAS Processor component (an | |
Annotator, for example) throws any kind of exception. This kind may occur | |
with any kind of deployment, integrated or not. </para></listitem> | |
</orderedlist></para> | |
<para>The <literal>&lt;errorHandling&gt;</literal> element has specifications for each of these kinds of
faults. The format of this element is:
<programlisting><![CDATA[<errorHandling> | |
<maxConsecutiveRestarts action="continue|disable|terminate" | |
value="[Number]"/> | |
<errorRateThreshold action="continue|disable|terminate" value="[Rate]"/> | |
<timeout max="[Number]"/> | |
</errorHandling>]]></programlisting></para> | |
<para>The mandatory <literal>&lt;maxConsecutiveRestarts&gt;</literal>
element applies only to faults of the first kind, and therefore only to
non-integrated deployments. If such a fault occurs, a retry is attempted, up to
the number of times given by <literal>value="[Number]"</literal>. Each retry resets the
connection (if one was made) and attempts to reconnect and perhaps re-launch
(see below for details). Once the deployed component has been
successfully restarted or reconnected to, the original CAS (not a partially updated one)
is sent to the CAS Processor as part of the retry.</para>
<para>The <literal>action</literal> attribute specifies the action to take | |
when the threshold specified by the <literal>value="[Number]"</literal> is | |
exceeded. The possible actions are: | |
<variablelist> | |
<varlistentry> | |
<term>continue</term> | |
<listitem><para>skip any further processing for this CAS by this CAS | |
Processor, and pass the CAS to the next CAS Processor in the Pipeline. | |
</para> | |
<para>The <quote>restart</quote> is still performed, because the CAS Processor is needed
for the next CAS.</para>
<para>If <literal>dropCasOnException="true"</literal> is set, the CPM
will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
CPM will abort processing of this CAS, release the CAS back to the CAS
Pool, and process the next CAS in the queue.</para>
<para>The counter that tracks restarts toward the threshold is reset
only after a CAS is successfully processed.</para></listitem>
</varlistentry> | |
<varlistentry> | |
<term>disable</term> | |
<listitem><para>the current CAS is handled just as in the | |
<literal>continue</literal> case, but in addition, the CAS Processor | |
is marked so that its <emphasis>process()</emphasis> method will not be | |
called again (i.e., it will be <quote>skipped</quote> for future | |
CASes)</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>terminate</term> | |
<listitem><para>the CPM will terminate all processing and exit.</para> | |
</listitem> | |
</varlistentry> | |
</variablelist></para> | |
<para>The definition of an error for the | |
<literal><maxConsecutiveRestarts></literal> element differs | |
slightly for each of the three CAS Processor deployment modes: | |
<variablelist> | |
<varlistentry> | |
<term>local</term> | |
<listitem><para>Local CAS Processors experience two general error | |
types: | |
<itemizedlist> | |
<listitem><para>launch errors – errors associated with | |
launching a process</para></listitem> | |
<listitem><para>processing errors – errors associated with | |
sending Vinci commands to the process</para></listitem> | |
</itemizedlist></para> | |
<para>A launch error is defined by a failure of the process to | |
successfully register with the local VNS within a default time window. | |
The current timeout is 15 minutes. Multiple local CAS Processors are | |
launched sequentially, with a subsequent processor launched | |
immediately after its previous processor successfully registers | |
with the VNS.</para> | |
<para>A processing error is detected if a connection to the CAS Processor | |
is lost or if the processing time exceeds a specified timeout | |
value.</para> | |
<para>For local CAS Processors, the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element specifies the number of
consecutive attempts made to launch the CAS Processor at CPM startup or
after the CPM has lost a connection to the CAS Processor.</para>
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>remote</term> | |
<listitem><para>For remote CAS Processors, the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element applies to errors from
sending Vinci commands. An error is detected if a connection to the CAS
Processor is lost, or if the processing time exceeds the timeout value
specified in the <literal>&lt;timeout&gt;</literal> element (see below).</para>
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>integrated</term> | |
<listitem><para>Although mandatory, the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element is NOT used for integrated CAS
Processors, because integrated CAS Processors are not
re-instantiated or restarted on exceptions. The CPM ignores this setting
for integrated CAS Processors, but the element must still be present. A future version
of the CPM will make this element mandatory for remote and local CAS
Processors only.</para></listitem>
</varlistentry> | |
</variablelist></para> | |
<para>The mandatory <literal><errorRateThreshold></literal> element | |
is used for all faults – both those above, and exceptions thrown by the CAS | |
Processor itself. It specifies the number of retries for exceptions thrown by | |
the CAS Processor itself, a maximum error rate, and the corresponding action to | |
take when this rate is exceeded. The <literal>value</literal> attribute | |
specifies the error rate in terms of errors per sample size in the form | |
<quote><literal>N/M</literal></quote>, where <literal>N</literal> is the | |
number of errors and <literal>M</literal> is the sample size, defined in terms | |
of the number of documents.</para> | |
<para>The first number is also used as the maximum number of retries. If
this number is less than the <literal>&lt;maxConsecutiveRestarts
value="[Number]"&gt;</literal> value, it overrides that value, reducing the number of
<quote>restarts</quote> attempted. A retry is done only if
<literal>dropCasOnException</literal> is false. If it is set to true, no retry
occurs, but the error is counted.</para>
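<para>The interaction between the two limits can be sketched as follows. The values are
illustrative assumptions, chosen only to show which limit takes effect:
<programlisting><![CDATA[<errorHandling>
  <!-- Up to 10 restarts are nominally allowed here -->
  <maxConsecutiveRestarts action="terminate" value="10"/>
  <!-- N = 3 is smaller than 10, so at most 3 restarts are attempted.
       Retries happen only when dropCasOnException is false. -->
  <errorRateThreshold action="disable" value="3/1000"/>
  <timeout max="100000"/>
</errorHandling>]]></programlisting></para>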
<para>When the number of errors counted within the current sample exceeds
<literal>N</literal>, the action
specified by the <literal>action</literal> attribute is taken. The possible
actions and their meanings are the same as described above for the
<literal>&lt;maxConsecutiveRestarts&gt;</literal> element:
<itemizedlist spacing="compact"> | |
<listitem><para><literal>continue</literal></para></listitem> | |
<listitem><para><literal>disable</literal></para></listitem> | |
<listitem><para><literal>terminate</literal></para></listitem> | |
</itemizedlist></para> | |
<para>The <literal>dropCasOnException="true"</literal> attribute of the | |
<literal><casProcessors></literal> element modifies the action | |
taken for continue and disable, in the same manner as above. For example: | |
<programlisting><errorRateThreshold value="3/1000" action="disable"/></programlisting> | |
specifies that each error thrown by the CAS Processor itself will be retried up to | |
3 times (if <literal>dropCasOnException</literal> is false) and the CAS | |
Processor will be disabled if the error rate exceeds 3 errors in 1000 | |
documents.</para> | |
<para>If a document causes an error and the error rate threshold for the CAS | |
Processor is not exceeded, the CPM increments the CAS Processor's error | |
count and retries processing that document (if | |
<literal>dropCasOnException</literal> is false). The retry means that the | |
CPM calls the CAS Processor's process() method again, passing in as an | |
argument the same CAS that previously caused an exception.</para> | |
<note><para>The CPM does not attempt to roll back any partial changes that may have
been applied to the CAS in the previous <literal>process()</literal> call.</para></note>
<para>Errors are accumulated across documents. For example, assume the error | |
rate threshold is <literal>3/1000</literal>. The same document may fail three | |
times before finally succeeding on the fourth try, but the error count is now 3. If | |
one more error occurs within the current sample of 1000 documents, the error rate | |
threshold will be exceeded and the specified action will be taken. If no more | |
errors occur within the current sample, the error counter is reset to 0 for the | |
next sample of 1000 documents.</para> | |
<para>The <literal><timeout></literal> element is a mandatory element. | |
Although mandatory for all CAS Processors, this element is only relevant for | |
local and remote CAS Processors. For integrated CAS Processors, this element is | |
ignored. In the current CPM implementation the integrated CAS Processor | |
process() method is not subject to timeouts.</para> | |
<para>The <literal>max</literal> attribute specifies the maximum amount of
time, in milliseconds, that the CPM will wait for a <literal>process()</literal> method
to complete. When this time is exceeded, the CPM will generate an exception and will treat
this as an error subject to the threshold defined in the
<literal>&lt;errorRateThreshold&gt;</literal> element above, including
doing retries.</para>
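<para>For instance, a non-integrated CAS Processor could be given a 30-second limit per
<literal>process()</literal> call, as in the sketch below. The numeric values are assumed
for illustration only:
<programlisting><![CDATA[<errorHandling>
  <maxConsecutiveRestarts action="continue" value="5"/>
  <errorRateThreshold action="terminate" value="10/1000"/>
  <!-- 30000 ms = 30 seconds; each timeout is counted as an error
       against the 10/1000 threshold above -->
  <timeout max="30000"/>
</errorHandling>]]></programlisting></para>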
<section | |
id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action"> | |
<title>Retry action taken on a timeout</title> | |
<para>The action taken depends on whether the CAS Processor is local (managed)
or remote (unmanaged). Local CAS Processors (which are services) are killed
and restarted, and a new connection to them is established. For remote CAS
Processors, the connection is dropped and a new connection is
established (which may actually connect to a different instance of the
remote service, if it has multiple instances).</para>
</section> | |
</section> | |
<section id="&tp;descriptor.cas_processors.individual.checkpoint"> | |
<title><checkpoint> Element</title> | |
<para>The <literal><checkpoint></literal> element is an optional | |
element used to improve the performance of CAS Consumers. It has a single | |
attribute, <literal>batch</literal>, which specifies the number of CASes in a | |
batch, e.g.: | |
<programlisting>&lt;checkpoint batch="1000"/&gt;</programlisting></para>
<para>sets the batch size to 1000 CASes. The batch size is the interval used to mark a | |
point in processing requiring special handling. The CAS Processor's | |
<literal>batchProcessComplete()</literal> method will be called by the CPM | |
when this mark is reached so that the processor can take appropriate action. This | |
mark could be used as a mechanism to buffer up results in CAS Consumers and perform | |
time-consuming operations, such as check-pointing, that should not be done on a | |
per-document basis.</para> | |
</section> | |
</section> | |
</section> | |
<section id="&tp;descriptor.operational_parameters"> | |
<title>CPE Operational Parameters</title> | |
<para>The parameters for configuring the overall CPE and CPM are specified in the | |
<literal><cpeConfig></literal> section. The overall format of this | |
section is: | |
<programlisting><![CDATA[<cpeConfig> | |
<startAt>[NumberOrID]</startAt> | |
<numToProcess>[Number]</numToProcess> | |
<outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" /> | |
<checkpoint file="[File]" time="[Number]" batch="[Number]"/> | |
<timerImpl>[ClassName]</timerImpl> | |
<deployAs>vinciService|interactive|immediate|single-threaded | |
</deployAs> | |
</cpeConfig>]]></programlisting></para> | |
<para>This section of the CPE descriptor allows for defining the starting entity, the | |
number of entities to process, a checkpoint file and frequency, a pluggable timer, an | |
optional output queue implementation, and finally a mode of operation. The mode of | |
operation determines how the CPM interacts with users and other systems.</para> | |
<para>The <literal><startAt></literal> element is an optional argument. It | |
defines the starting entity in the collection at which the CPM should start | |
processing.</para> | |
<para>The implementation in the CPM passes this argument to the Collection Reader | |
as the value of the parameter <quote><literal>startNumber</literal></quote>. | |
The CPM does not do anything else with this parameter; in particular, the CPM has no | |
ability to skip to a specific document - that function, if available, is only provided | |
by a particular Collection Reader implementation.</para> | |
<para>If the <literal><startAt></literal> element is used, the Collection | |
Reader descriptor must define a single-valued configuration parameter with the | |
name <literal>startNumber</literal>. It can declare this value to be of any type; | |
the value passed in this XML element must be convertible to that type.</para> | |
<para>A typical use is to declare this to be an integer type, and to pass the sequential | |
document number where processing should start. An alternative implementation | |
might take a specific document ID; the collection reader could search through its | |
collection until it reaches this ID and then start there.</para> | |
<para>This parameter will only make sense if the particular collection reader is | |
implemented to use the <literal>startNumber</literal> configuration | |
parameter.</para> | |
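<para>A minimal sketch of this arrangement is shown below. The start value 101 and the
parameter description are hypothetical, and the sketch assumes the Collection Reader declares
<literal>startNumber</literal> as an Integer. The parameter declaration uses the usual
component descriptor syntax for configuration parameters and belongs in the Collection Reader
descriptor, not in the CPE descriptor:
<programlisting><![CDATA[<!-- In the CPE descriptor -->
<cpeConfig>
  <startAt>101</startAt>
  ...
</cpeConfig>

<!-- In the Collection Reader descriptor (assumed Integer parameter) -->
<configurationParameters>
  <configurationParameter>
    <name>startNumber</name>
    <description>Sequential number of the first document to process</description>
    <type>Integer</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
</configurationParameters>]]></programlisting></para>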
<para>The <literal><numToProcess></literal> element is an optional | |
element. It specifies the total number of entities to process. Use -1 to indicate ALL. | |
If not defined, the number of entities to process will be taken from the Collection | |
Reader configuration. If present, this value overrides the Collection Reader | |
configuration.</para> | |
<para>The <literal><outputQueue></literal> element is an optional element. | |
It enables plugging in a custom implementation for the Output Queue. When omitted, | |
the CPM will use a default output queue that is based on First-in First-out (FIFO) | |
model.</para> | |
<para>The UIMA SDK provides a second implementation for the Output Queue that can be | |
plugged in to the CPM, named <quote> | |
<literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal> | |
</quote>.</para> | |
<para>This implementation supports handling very large documents that are split into | |
<quote>chunks</quote>; it provides a delivery mechanism that ensures the
sequential order of the chunks using information carried in the CAS metadata. This | |
metadata, which is required for this implementation to work correctly, must be added | |
as an instance of a Feature Structure of type | |
<literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an | |
additional feature named <literal>esDocumentMetaData</literal> in the special | |
instance of <literal>uima.tcas.DocumentAnnotation</literal> that is | |
associated with the CAS. This is usually done by the Collection Reader; the instance | |
contains the following features: | |
<variablelist> | |
<varlistentry> | |
<term>sequenceNumber</term> | |
<listitem><para>[Number] the sequential number of a chunk, starting at 1. If
the CAS is not a chunk (i.e., it is a complete document), the value should be 0.</para>
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>documentId</term> | |
<listitem><para>[Number] current document id. Chunks belonging to the same | |
document have identical document id.</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>isCompleted</term> | |
<listitem><para>[Number] 1 if the chunk is the last in a sequence, 0 | |
otherwise.</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>url</term> | |
<listitem><para>[String] document url.</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>throttleID</term> | |
<listitem><para>[String] special attribute currently used by | |
OmniFind.</para></listitem> | |
</varlistentry> | |
</variablelist></para> | |
<para>This implementation of a sequenced queue supports proper sequencing of CASes in | |
CPM deployments that use document chunking. Chunking is a technique of splitting | |
large documents into pieces to reduce overall memory consumption. Chunking does not | |
depend on the number of CASes in the CAS Pool. It works equally well with one or more | |
CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work | |
Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a | |
CAS is released back to the pool by the processing threads. A document may be split into | |
1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the | |
document correctly, the CAS Consumer can depend on receiving the chunks in the same | |
sequential order that the chunks were <quote>produced</quote>, when this | |
sequenced queue implementation is used. To plug in this sequenced queue to the CPM use | |
the following specification: | |
<programlisting><outputQueue dequeueTimeout="100000" queueClass= | |
"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/></programlisting> | |
where the mandatory <literal>queueClass</literal> attribute defines the name of
the class, and the second mandatory attribute, <literal>dequeueTimeout</literal>,
specifies the maximum number of milliseconds to wait for the expected chunk.</para>
<note><para>The value for this timeout must be chosen carefully to avoid
excessive timeouts. Typically, the size of a chunk and the type of
analysis being done are the most important factors when deciding on the value.
The larger the chunk and the more complicated the analysis, the more time it takes
for the chunk to go from source to sink. You may specify 0, in which case the timeout is
disabled, i.e., it is equivalent to an infinitely long timeout.</para></note>
<para>If the chunk doesn't arrive in the configured time window, the entire
document is presumed to be invalid and the CAS is dropped from further processing.
This action occurs regardless of any other error action specification. The
SequencedQueue invalidates the document, adding the offending document's
metadata to a local cache of invalid documents.</para>
<para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe.using_listeners"/>) by calling
<literal>entityProcessComplete()</literal>. As part of this call, the SequencedQueue passes null
instead of a CAS as the first argument, together with a special exception,
<literal>CPMChunkTimeoutException</literal>. Null is passed because the
timeout means that the expected chunk was not received within the
configured window, so there is no CAS available when the timeout event
occurs.</para>
<para>The CPMChunkTimeoutException object includes an API that allows the listener | |
to retrieve the offending document id as well as the other metadata attributes as | |
defined above. These attributes are part of each chunk's metadata and are added | |
by the Collection Reader.</para> | |
<para>Each chunk that SequencedQueue works on is tested to determine whether it
belongs to an invalid document. This test checks the chunk's metadata
against the data in the local cache. If there is a match, the chunk is dropped. This
check is performed only for chunks; complete documents are not subject to
it.</para>
<para>If there is an exception during the processing of a chunk, the CPM sends a | |
notification to all registered listeners. The notification includes the CAS and an | |
exception. When the listener notification is completed, the CPM also sends separate | |
notifications, containing the CAS, to the Artifact Producer and the | |
SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong | |
to an <quote>invalid</quote> document and also to deal with chunks that are | |
en-route, being processed by the processing threads.</para> | |
<para>In response to the notification, the Artifact Producer will drop and release | |
back to the CAS Pool all CASes that belong to an <quote>invalid</quote> document. | |
Currently, there is no support in the CollectionReader's API to tell it to stop | |
generating chunks. The CollectionReader keeps producing the chunks but the | |
Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is | |
released back to the CAS Pool, the Artifact Producer sends notification to all | |
registered listeners. This notification includes the CAS and an exception – | |
SkipCasException.</para> | |
<para>In response to the notification of an exception involving a chunk, the | |
SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of | |
<quote>invalid</quote> documents. All chunks de-queued from the OutputQueue and | |
belonging to <quote>invalid</quote> documents will be dropped and released back to | |
the CAS Pool. Before dropping the CAS, the CPM sends notification to all registered | |
listeners. The notification includes the CAS and SkipCasException.</para> | |
<para>The <literal><checkpoint></literal> element is an optional element. | |
It specifies a CPE checkpoint file, checkpoint frequency, and strategy for | |
checkpoints (time or count based). At checkpoint time, the CPM saves status | |
information and statistics to the checkpoint file. The checkpoint file is specified | |
in the <literal>file</literal> attribute, which has the same form as the | |
<literal>href</literal> attribute of the <literal><include></literal> | |
element described in <xref linkend="&tp;imports"/>. The | |
<literal>time</literal> attribute indicates that a checkpoint should be taken | |
every <literal>[Number]</literal> seconds, and the <literal>batch</literal> | |
attribute indicates that a checkpoint should be taken every | |
<literal>[Number]</literal> batches.</para> | |
<para>The <literal><timerImpl></literal> element is optional. It is used to | |
identify a custom timer plug-in class to generate time stamps during the CPM | |
execution. The value of the element is a Java class name.</para> | |
<para>The <literal><deployAs></literal> element indicates the type of CPM | |
deployment. Valid contents for this element include: | |
<variablelist> | |
<varlistentry> | |
<term>vinciService</term> | |
<listitem><para>Vinci service exposing APIs for stop, pause, resume, and | |
getStats</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>interactive</term> | |
<listitem><para>provide command line menus (start, stop, pause, | |
resume)</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>immediate</term> | |
<listitem><para>run the CPM without menus or a service API</para></listitem> | |
</varlistentry> | |
<varlistentry> | |
<term>single-threaded</term> | |
<listitem><para>run the CPM in a single threaded mode. In this mode, the | |
Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline | |
are all running in one thread without the work queue and the output | |
queue.</para></listitem> | |
</varlistentry> | |
</variablelist></para> | |
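<para>Putting these elements together, a <literal>&lt;cpeConfig&gt;</literal> section might
look like the following sketch. All values, including the checkpoint file name, are
hypothetical and serve only to illustrate the element order given above:
<programlisting><![CDATA[<cpeConfig>
  <!-- Passed to the Collection Reader as its startNumber parameter -->
  <startAt>1</startAt>
  <!-- A value of -1 means: process all entities -->
  <numToProcess>-1</numToProcess>
  <!-- Preserve chunk order; wait up to 100000 ms for the expected chunk -->
  <outputQueue dequeueTimeout="100000" queueClass=
    "org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/>
  <!-- Save status information and statistics to this (placeholder) file -->
  <checkpoint file="cpe_checkpoint.dat" time="3000"/>
  <!-- Run without menus or a service API -->
  <deployAs>immediate</deployAs>
</cpeConfig>]]></programlisting></para>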
</section> | |
<section id="&tp;descriptor.resource_manager_configuration"> | |
<title>Resource Manager Configuration</title> | |
<para>External resource bindings for the CPE may optionally be specified in an | |
element: | |
<programlisting><resourceManagerConfiguration href="..."/></programlisting></para> | |
<para>For an introduction to external resources, refer to <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para> | |
<para>In the <literal>resourceManagerConfiguration</literal> element, the value
of the <literal>href</literal> attribute refers to another file that contains definitions and bindings
for the external resources used by the CPE. The format of this file is the same as the XML
snippet described in <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
For example, in a CPE containing an aggregate analysis engine with two annotators,
and a CAS Consumer, the following resource manager configuration file would bind | |
external resource dependencies in all three components to the same physical | |
resource: | |
<programlisting><![CDATA[<resourceManagerConfiguration> | |
<!-- Declare Resource --> | |
<externalResources> | |
<externalResource> | |
<name>ExampleResource</name> | |
<fileResourceSpecifier> | |
<fileUrl>file:MyResourceFile.dat</fileUrl> | |
</fileResourceSpecifier> | |
</externalResource> | |
</externalResources> | |
<!-- Bind component resource dependencies to ExampleResource --> | |
<externalResourceBindings> | |
<externalResourceBinding> | |
<key>MyAE/annotator1/myResourceKey</key> | |
<resourceName>ExampleResource</resourceName> | |
</externalResourceBinding> | |
<externalResourceBinding> | |
<key>MyAE/annotator2/someResourceKey</key> | |
<resourceName>ExampleResource</resourceName> | |
</externalResourceBinding> | |
<externalResourceBinding> | |
<key>MyCasConsumer/otherResourceKey</key> | |
<resourceName>ExampleResource</resourceName> | |
</externalResourceBinding> | |
</externalResourceBindings> | |
</resourceManagerConfiguration>]]></programlisting></para> | |
<para>In this example, <literal>MyAE</literal> and | |
<literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS | |
Consumer, as specified by the name attributes of the CPE's | |
<literal><casProcessor></literal> elements. | |
<literal>annotator1</literal> and <literal>annotator2</literal> are the | |
annotator keys specified within the Aggregate AE Descriptor, and | |
<literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and | |
<literal>otherResourceKey</literal> are the keys of the resource dependencies | |
declared in the individual annotator and CAS Consumer descriptors.</para> | |
</section> | |
<section id="&tp;descriptor.example"> | |
<title>Example CPE Descriptor</title> | |
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?> | |
<cpeDescription> | |
<collectionReader> | |
<collectionIterator> | |
<descriptor> | |
<import location= | |
"../collection_reader/FileSystemCollectionReader.xml"/> | |
</descriptor> | |
</collectionIterator> | |
</collectionReader> | |
<casProcessors dropCasOnException="true" casPoolSize="1" | |
processingUnitThreadCount="1"> | |
<casProcessor deployment="integrated" | |
name="Aggregate TAE - Name Recognizer and Person Title Annotator"> | |
<descriptor> | |
<import location= | |
"../analysis_engine/NamesAndPersonTitles_TAE.xml"/> | |
</descriptor> | |
<deploymentParameters/> | |
<filter/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" value="100/1000"/> | |
<maxConsecutiveRestarts action="terminate" value="30"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="1"/> | |
</casProcessor> | |
<casProcessor deployment="integrated" name="Annotation Printer"> | |
<descriptor> | |
<import location="../cas_consumer/AnnotationPrinter.xml"/> | |
</descriptor> | |
<deploymentParameters/> | |
<filter/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" value="100/1000"/> | |
<maxConsecutiveRestarts action="terminate" value="30"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="1"/> | |
</casProcessor> | |
</casProcessors> | |
<cpeConfig> | |
<numToProcess>1</numToProcess> | |
<deployAs>immediate</deployAs> | |
<checkpoint file="" time="3000"/> | |
<timerImpl/> | |
</cpeConfig> | |
</cpeDescription>]]></programlisting> | |
</section> | |
</chapter> |