<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
<!ENTITY imgroot "../images/tutorials_and_users_guides/tug.cpe/">
<!ENTITY % uimaents SYSTEM "../entities.ent">
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tug.cpe">
<title>Collection Processing Engine Developer&apos;s Guide</title>
<titleabbrev>CPE Developer&apos;s Guide</titleabbrev>
<para>The UIMA Analysis Engine interface provides support for developing and integrating
algorithms that analyze unstructured data. Analysis Engines are designed to operate on a
per-document basis. Their interface handles one CAS at a time. UIMA provides additional
support for applying analysis engines to collections of unstructured data with its
<emphasis>Collection Processing Architecture</emphasis>. The Collection
Processing Architecture defines additional components for reading raw data formats
from data collections, preparing the data for processing by Analysis Engines, executing
the analysis, extracting analysis results, and deploying the overall flow in a variety of
local and distributed configurations.</para>
<para>The functionality defined in the Collection Processing Architecture is
implemented by a <emphasis>Collection Processing Engine</emphasis> (CPE). A CPE
includes an Analysis Engine and adds a <emphasis>Collection Reader</emphasis>, a
<emphasis>CAS Initializer</emphasis> (deprecated as of version 2), and <emphasis>CAS
Consumers</emphasis>. The part of the UIMA Framework that supports the execution of
CPEs is called the Collection Processing Manager, or CPM.</para>
<para>A Collection Reader provides the interface to the raw input data and knows how to
iterate over the data collection. Collection Readers are discussed in <xref
linkend="ugr.tug.cpe.collection_reader.developing"/>. The CAS Initializer
<footnote><para>CAS Initializers are deprecated in favor of a more general mechanism,
multiple subjects of analysis.</para></footnote> prepares an individual data item for
analysis and loads it into the CAS. CAS Initializers are discussed in <xref
linkend="ugr.tug.cpe.cas_initializer.developing"/>. A CAS Consumer extracts
analysis results from the CAS and may also perform <emphasis>collection level
processing</emphasis>, or analysis over a collection of CASes. CAS Consumers are
discussed in <xref linkend="ugr.tug.cpe.cas_consumer.developing"/>.</para>
<para>Analysis Engines and CAS Consumers are both instances of <emphasis>CAS
Processors</emphasis>. A Collection Processing Engine (CPE) may contain multiple CAS
Processors. An Analysis Engine contained in a CPE may itself be a Primitive or an Aggregate
(composed of other Analysis Engines). Aggregates may contain CAS Consumers. While
Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS
Processor may be deployed in a variety of local and distributed modes, providing a number
of options for scalability and robustness. The different deployment options are covered
in detail in <xref linkend="ugr.tug.cpe.deployment_alternatives"/>.</para>
<para>Each of the components in a CPE has an interface specified by the UIMA Collection
Processing Architecture and is described by a declarative XML descriptor file.
Similarly, the CPE itself has a well-defined component interface and is described by a
declarative XML descriptor file.</para>
<para>A user creates a CPE by assembling the components mentioned above. The UIMA SDK
provides a graphical tool, called the CPE Configurator, for assisting in the assembly of
CPEs. Use of this tool is summarized in <xref
linkend="ugr.tug.cpe.cpe_configurator"/>, and more details can be found in <olink
targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.
Alternatively, a CPE can be assembled by writing an XML CPE descriptor. Details on the CPE
descriptor, including its syntax and content, can be found in the <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. The individual
components have associated XML descriptors, each of which can be created and / or edited
using the <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde">
Component Description Editor</olink>.</para>
<para>A CPE is executed by a UIMA infrastructure component called the
<emphasis>Collection Processing Manager</emphasis> (CPM). The CPM provides a number
of services and deployment options that cover instantiation and execution of CPEs, error
recovery, and local and distributed deployment of the CPE components.</para>
<section id="ugr.tug.cpe.concepts">
<title>CPE Concepts</title>
<para> <xref linkend="ugr.tug.cpe.fig.cpe_components"/> illustrates the data flow
that occurs between the different types of components that make up a CPE.</para>
<figure id="ugr.tug.cpe.fig.cpe_components">
<title>CPE Components</title>
<mediaobject>
<imageobject>
<imagedata width="5.84in" format="PNG"
fileref="&imgroot;image002.png"/>
</imageobject>
<textobject><phrase>CPE Components and flow between them</phrase>
</textobject>
</mediaobject>
</figure>
<para>The components of a CPE are:</para>
<itemizedlist><listitem><para><emphasis>Collection Reader &ndash;</emphasis>
interfaces to a collection of data items (e.g., documents) to be analyzed. Collection
Readers return CASes that contain the documents to analyze, possibly along with
additional metadata.</para></listitem>
<listitem><para><emphasis>Analysis Engine &ndash;</emphasis> takes a CAS,
analyzes its contents, and produces an enriched CAS. Analysis Engines can be
recursively composed of other Analysis Engines (called an
<emphasis>Aggregate</emphasis> Analysis Engine). Aggregates may also contain
CAS Consumers.</para></listitem>
<listitem><para><emphasis>CAS Consumer &ndash;</emphasis> consumes the enriched
CAS that was produced by the sequence of Analysis Engines before it, and produces an
application-specific data structure, such as a search engine index or database.
</para></listitem></itemizedlist>
<para>A fourth type of component, the <emphasis>CAS Initializer,</emphasis> may be
used by a Collection Reader to populate a CAS from a document. However, as of UIMA
version 2, CAS Initializers are deprecated in favor of a more general mechanism,
multiple Subjects of Analysis.</para>
<para>The Collection Processing Manager orchestrates the data flow
within a CPE, monitors status, optionally manages the life-cycle of internal
components and collects statistics.</para>
<para>The framework does not persist CASes. To save them, you must write out each CAS
as it passes through the CPE, for example with a CAS Consumer that you write for this
purpose, in whatever format you like. The UIMA SDK supplies an example CAS
Consumer that saves CASes to XML files, either in the standard XMI format or in an older
format called XCAS. It also supplies an example CAS Consumer that extracts information from CASes and
stores the results in a relational database, using Java&apos;s JDBC APIs.</para>
</section>
<section id="ugr.tug.cpe.configurator_and_viewer">
<title>CPE Configurator and CAS viewer</title>
<section id="ugr.tug.cpe.cpe_configurator">
<title>Using the CPE Configurator</title>
<para>A CPE can be assembled by writing an XML CPE descriptor. Details on the CPE
descriptor, including its syntax and content, can be found in <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. Rather than
edit raw XML, you may develop a CPE Descriptor using the CPE Configurator tool. The CPE
Configurator tool is described briefly in this section, and in more detail in <olink
targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.</para>
<para>The CPE Configurator tool can be run from Eclipse (see <xref
linkend="ugr.tug.cpe.running_cpe_configurator_from_eclipse"/>), or by using
the <literal>cpeGui</literal> shell script (<literal>cpeGui.bat</literal> on
Windows, <literal>cpeGui.sh</literal> on Unix), which is located in the
<literal>bin</literal> directory of the UIMA SDK installation. Running this
script will display the window shown here:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.84in" format="JPG" fileref="&imgroot;image004.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of CPE GUI</phrase></textobject>
</mediaobject>
</screenshot>
</para>
<para>The window is divided into three sections, one each for the Collection Reader,
Analysis Engines, and CAS Consumers.<footnote><para>There is also a fourth pane,
for the CAS Initializer, but it is hidden by default. To enable it click the
<literal>View &rarr; CAS Initializer Panel</literal> menu item.</para></footnote>
In each section, you select the component(s) you want to include in the CPE by
browsing to their XML descriptors. The configuration parameters present in the XML
descriptors will then be displayed in the GUI; these can be modified to override
the values present in the descriptor. For example, the screenshot below shows the
CPE Configurator after the following components have been chosen:
<programlisting>Collection Reader:
%UIMA_HOME%/examples/descriptors/collection_reader/
FileSystemCollectionReader.xml
Analysis Engine:
%UIMA_HOME%/examples/descriptors/analysis_engine/
NamesAndPersonTitles_TAE.xml
CAS Consumer:
%UIMA_HOME%/examples/descriptors/cas_consumer/
XmiWriterCasConsumer.xml</programlisting></para>
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.84in" format="JPG" fileref="&imgroot;image006.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of CPE GUI after fields filled in</phrase></textobject>
</mediaobject>
</screenshot>
<para>For the File System Collection Reader, ensure that the Input Directory is set to
<literal>%UIMA_HOME%\examples\data</literal><footnote><para>Replace
<literal>%UIMA_HOME%</literal> with the path to where you installed UIMA.</para>
</footnote>. The other parameters may be left blank. For the XMI Writer CAS
Consumer, ensure that the Output Directory is set to
<literal>%UIMA_HOME%\examples\data\processed</literal>.</para>
<para>After selecting each of the components and providing configuration settings,
click the play (forward arrow) button at the bottom of the screen to begin processing.
A progress bar should be displayed in the lower left corner. (Note that the progress
bar will not begin to move until all components have completed their initialization,
which may take several seconds.) Once processing has begun, the pause and stop
buttons become enabled.</para>
<para>If an error occurs, you will be informed by an error dialog. If processing
completes successfully, you will be presented with a performance report.</para>
<para>Using the File menu, you can select <literal>Save CPE Descriptor</literal> to
create an .xml descriptor file that defines the CPE you have constructed. Later, you
can use <literal>Open CPE Descriptor</literal> to restore the CPE Configurator to
the saved state. Also, CPE descriptors can be used to run a CPE from a Java program
&ndash; see section <xref
linkend="ugr.tug.cpe.running_cpe_from_application"/>. CPE Descriptors
allow specifying operational parameters, such as error handling options, that are
not currently available for configuration through the CPE Configurator. For more
information on manually creating a CPE Descriptor, see the <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
<para>The CPE configured above runs a simple name and title annotator on the sample data
provided with the UIMA SDK and stores the results using the XMI Writer CAS Consumer. To
view the results, start the External CAS Annotation Viewer by running the
<literal>annotationViewer</literal> batch file
(<literal>annotationViewer.bat</literal> on Windows,
<literal>annotationViewer.sh</literal> on Unix), which is located in the
<literal>bin</literal> directory of the UIMA SDK installation. Executing this
batch file will display the window shown here:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.5in" format="JPG" fileref="&imgroot;image008.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of Annotation Viewer results</phrase></textobject>
</mediaobject>
</screenshot>
</para>
<para>Ensure that the Input Directory is the same as the Output Directory specified for
the XMI Writer CAS Consumer in the CPE configured above (e.g.,
<literal>%UIMA_HOME%\examples\data\processed</literal>) and that the TAE
Descriptor File is set to the Analysis Engine used in the CPE configured above (e.g.,
<literal>examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml</literal>).</para>
<para>Click the View button to display the Analyzed Documents window:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="3.5in" format="JPG" fileref="&imgroot;image010.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of CPE Configurator Analyzed Documents</phrase></textobject>
</mediaobject>
</screenshot>
</para>
<para>Double-click any document in the list to view the analyzed document.
Double-clicking the first document, <literal>IBM_LifeSciences.txt</literal>, will bring up the following
window:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.84in" format="JPG" fileref="&imgroot;image012.jpg"/>
</imageobject>
<textobject><phrase>Screenshot of Document and Annotation Viewer</phrase></textobject>
</mediaobject>
</screenshot>
</para>
<para>This window shows the analysis results for the document. Clicking on any
highlighted annotation causes the details for that annotation to be displayed in the
right-hand pane. Here the annotation spanning <quote>John M. Thompson</quote> has
been clicked.</para>
<para>Congratulations! You have successfully configured a CPE, saved its
descriptor, run the CPE, and viewed the analysis results.</para>
</section>
<section id="ugr.tug.cpe.running_cpe_configurator_from_eclipse">
<title>Running the CPE Configurator from Eclipse</title>
<para>If you have followed the instructions in <olink
targetdoc="&uima_docs_overview;"
targetptr="ugr.ovv.eclipse_setup"/> and imported the example Eclipse
project, then you should already have a Run configuration for the CPE Configurator
tool (called <literal>UIMA CPE GUI</literal>) configured to run in the example
project. Simply run that configuration to start the CPE Configurator.</para>
<para>If you haven&apos;t followed the Eclipse setup instructions and wish to run the
CPE Configurator tool from Eclipse, you will need to do the following. As installed,
this Eclipse launch configuration is associated with the
<quote>uimaj-examples</quote> project. If you&apos;ve not already done so, you
may wish to import that project into your Eclipse workspace. It&apos;s located in
%UIMA_HOME%/docs/examples. Doing this will supply the Eclipse launcher with all
the class files it needs to run the CPE configurator. If you don&apos;t do this, please
manually add the JAR files for UIMA to the launch configuration.</para>
<para>Also, you need to add any projects or JAR files for any UIMA components you will be
running to the launch class path.</para> <note><para>A simpler alternative may be
to change the CPE launch configuration to be based on your project. If you do that, it will
pick up all the files in your project&apos;s class path, which you should set up to
include all the UIMA framework files. An easy way to do this is to specify in your
project&apos;s properties&apos; build-path that the uimaj-examples project is on
the build path, because the uimaj-examples project is set up to include all the UIMA
framework classes in its classpath already. </para></note>
<para>Next, in the Eclipse menu select <literal>Run &rarr;
Run...</literal>, which brings up the Run configuration screen.</para>
<para>In the Main tab, set the main class to
<literal>org.apache.uima.tools.cpm.CpmFrame</literal></para>
<para>In the arguments tab, add the following to the VM arguments:
<programlisting>-Xms128M -Xmx256M
-Duima.home="C:\Program Files\Apache\uima"</programlisting>
(or wherever you installed the UIMA SDK)</para>
<para>Click the Run button to launch the CPE Configurator, and use it as previously
described in this section.</para>
</section>
</section>
<section id="ugr.tug.cpe.running_cpe_from_application">
<title>Running a CPE from Your Own Java Application</title>
<para>The simplest way to run a CPE from a Java application is to first create a CPE
descriptor as described in the previous section. Then the CPE can be instantiated and
run using the following code:
<programlisting>// parse the CPE descriptor in the file specified on the command line
CpeDescription cpeDesc = UIMAFramework.getXMLParser().
    parseCpeDescription(new XMLInputSource(args[0]));

// instantiate the CPE (mCPE is a field of type CollectionProcessingEngine)
mCPE = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

// create and register a Status Callback Listener
mCPE.addStatusCallbackListener(new StatusCallbackListenerImpl());

// start processing
mCPE.process();</programlisting></para>
<para>This will start the CPE running in a separate thread.</para>
<note><para>The <literal>process()</literal> method for a CPE can only be called once. If you
need to call it again, you have to instantiate a new CPE, and call that new CPE's process
method.</para></note>
<section id="ugr.tug.cpe.using_listeners">
<title>Using Listeners</title>
<para>Updates of the CPM&apos;s progress, including any errors that occur, are sent to
the callback handler that is registered by the call to
<literal>addStatusCallbackListener</literal>, above. The callback handler is a
class that implements the CPM&apos;s
<literal>StatusCallbackListener</literal> interface. It responds to events by
printing messages to the console. The source code is fairly straightforward and is
not included in this chapter &ndash; see the
<literal>org.apache.uima.examples.cpe.SimpleRunCPE.java</literal> in the
<literal>%UIMA_HOME%\examples\src</literal> directory for the complete
code.</para>
<para>If you need more control over the information in the CPE descriptor, you can
manually configure it via its API. See the Javadocs for package
<literal>org.apache.uima.collection</literal> for more details.</para>
</section>
</section>
<section id="ugr.tug.cpe.developing_collection_processing_components">
<title>Developing Collection Processing Components</title>
<para>This section is an introduction to the process of developing Collection Readers,
CAS Initializers, and CAS Consumers. The code snippets refer to the classes that can be
found in the <literal>%UIMA_HOME%\examples\src</literal> directory of the example project.</para>
<para>In the following sections, classes you write to represent components need to be
public and have public, 0-argument constructors, so that they can be instantiated by
the framework. (Although Java classes in which you do not define any constructor will,
by default, have a 0-argument constructor that doesn&apos;t do anything, a class in
which you have defined at least one constructor does not get a default 0-argument
constructor.)</para>
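<para>The constructor rule can be checked with plain Java reflection. The following is a
minimal, self-contained sketch rather than UIMA code: the class names
<literal>ReaderWithDefaultConstructor</literal> and
<literal>ReaderWithArgsOnly</literal> are hypothetical, and the reflective lookup only
approximates how a framework might instantiate a component from its class name.</para>
<programlisting><![CDATA[import java.lang.reflect.Constructor;

class ConstructorCheck {

  /** Hypothetical component with no declared constructor:
   *  the compiler supplies a public zero-argument one. */
  public static class ReaderWithDefaultConstructor {
  }

  /** Hypothetical component that declares only a one-argument
   *  constructor, so no default constructor is generated. */
  public static class ReaderWithArgsOnly {
    private final String inputDir;

    public ReaderWithArgsOnly(String inputDir) {
      this.inputDir = inputDir;
    }
  }

  /** Returns true if the class has a public zero-argument
   *  constructor that can be invoked successfully. */
  static boolean hasPublicNoArgConstructor(Class<?> type) {
    try {
      Constructor<?> ctor = type.getConstructor(); // public constructors only
      ctor.newInstance();
      return true;
    } catch (ReflectiveOperationException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(
        hasPublicNoArgConstructor(ReaderWithDefaultConstructor.class));
    System.out.println(
        hasPublicNoArgConstructor(ReaderWithArgsOnly.class));
  }
}]]></programlisting>
<para>Running the sketch prints <literal>true</literal> for the first class and
<literal>false</literal> for the second. Adding an explicit public zero-argument
constructor to the second class would make it pass the same check.</para>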
<section id="ugr.tug.cpe.collection_reader.developing">
<title>Developing Collection Readers</title>
<para>A Collection Reader is responsible for obtaining documents from the collection
and returning each document as a CAS. Like all UIMA components, a Collection Reader
consists of two parts &mdash; the code and an XML descriptor.</para>
<para>A simple example of a Collection Reader is the <quote>File System Collection
Reader,</quote> which simply reads documents from files in a specified directory.
The Java code is in the class
<literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal>
and the XML descriptor is
<literal>%UIMA_HOME%/examples/src/main/descriptors/collection_reader/
FileSystemCollectionReader.xml</literal>.</para>
<section id="ugr.tug.cpe.collection_reader.java_class">
<title>Java Class for the Collection Reader</title>
<para>The Java class for a Collection Reader must implement the
<literal>org.apache.uima.collection.CollectionReader</literal>
interface. You may build your Collection Reader from scratch and implement this
interface, or you may extend the convenience base class
<literal>org.apache.uima.collection.CollectionReader_ImplBase</literal>.</para>
<para>The convenience base class provides default implementations for many of the
methods defined in the <literal>CollectionReader</literal> interface, and
provides abstract definitions for those methods that you are required to
implement in your new Collection Reader. Note that if you extend this base class,
you do not need to declare that your new Collection Reader implements the
<literal>CollectionReader</literal> interface.</para> <tip><para>Eclipse
tip &ndash; if you are using Eclipse, you can quickly create the boilerplate code and
stubs for all of the required methods by clicking <literal>File</literal>
&rarr; <literal>New</literal> &rarr; <literal>Class</literal> to bring up the <quote>New Java Class</quote>
dialog, specifying
<literal>org.apache.uima.collection.CollectionReader_ImplBase</literal>
as the Superclass, and checking <quote>Inherited abstract methods</quote> in the
section <quote>Which method stubs would you like to create?</quote>, as in the
screenshot below:</para></tip>
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="4.4in" format="JPG" fileref="&imgroot;image014.jpg"/>
</imageobject>
<textobject><phrase>Screenshot showing Eclipse new class wizard</phrase></textobject>
</mediaobject>
</screenshot>
<para>For the rest of this section we will assume that your new Collection Reader
extends the <literal>CollectionReader_ImplBase</literal> class, and we will
show examples from the
<literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal>.
If you must inherit from a different superclass, you must ensure that your
Collection Reader implements the <literal>CollectionReader</literal>
interface &ndash; see the Javadocs for <literal>CollectionReader</literal>
for more details.</para>
</section>
<section id="ugr.tug.cpe.collection_reader.required_methods">
<title>Required Methods in the Collection Reader class</title>
<para>The following abstract methods must be implemented:</para>
<section id="ugr.tug.cpe.collection_reader.required_methods.initialize">
<title>initialize()</title>
<para>The <literal>initialize()</literal> method is called by the framework
when the Collection Reader is first created.
<literal>CollectionReader_ImplBase</literal> actually provides a default
implementation of this method (i.e., it is not abstract), so you are not strictly
required to implement this method. However, a typical Collection Reader will
implement this method to obtain parameter values and perform various
initialization steps.</para>
<para>In this method, the Collection Reader class can access the values of its
configuration parameters and perform other initialization logic. The example
File System Collection Reader reads its configuration parameters and then
builds a list of files in the specified input directory, as follows:</para>
<programlisting>public void initialize() throws ResourceInitializationException {
  File directory = new File(
          (String)getConfigParameterValue(PARAM_INPUTDIR));
  mEncoding = (String)getConfigParameterValue(PARAM_ENCODING);
  mDocumentTextXmlTagName = (String)getConfigParameterValue(PARAM_XMLTAG);
  mLanguage = (String)getConfigParameterValue(PARAM_LANGUAGE);
  mCurrentIndex = 0;

  //get list of files (not subdirectories) in the specified directory
  mFiles = new ArrayList();
  File[] files = directory.listFiles();
  for (int i = 0; i &lt; files.length; i++) {
    if (!files[i].isDirectory()) {
      mFiles.add(files[i]);
    }
  }
}</programlisting>
<note><para>This is the zero-argument version of the initialize method. There is
also a method on the Collection Reader interface called
<literal>initialize(ResourceSpecifier, Map)</literal> but it is not
recommended that you override this method in your code. That method performs
internal initialization steps and then calls the zero-argument
<literal>initialize()</literal>. </para></note>
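<para>One point to watch in the listing above is that <literal>File.listFiles()</literal>
returns <literal>null</literal> when the path is not a directory or cannot be read, in
which case the loop would fail with a <literal>NullPointerException</literal>. The
following is a minimal sketch of a more defensive file listing, using only the standard
library. The class and method names (<literal>InputDirectoryLister</literal>,
<literal>listPlainFiles</literal>) are hypothetical and not part of the UIMA examples,
and a real Collection Reader would more likely report the problem by throwing
<literal>ResourceInitializationException</literal> rather than
<literal>IllegalArgumentException</literal>.</para>
<programlisting><![CDATA[import java.io.File;
import java.util.ArrayList;
import java.util.List;

class InputDirectoryLister {

  /** Returns the plain files (not subdirectories) directly inside the directory. */
  static List<File> listPlainFiles(File directory) {
    // listFiles() yields null if the path is not a directory or on an I/O error
    File[] entries = directory.listFiles();
    if (entries == null) {
      throw new IllegalArgumentException(
          "Not a readable directory: " + directory);
    }
    List<File> files = new ArrayList<>();
    for (File entry : entries) {
      if (!entry.isDirectory()) {
        files.add(entry);
      }
    }
    return files;
  }

  public static void main(String[] args) {
    // Hypothetical usage: list the current working directory
    for (File file : listPlainFiles(new File("."))) {
      System.out.println(file.getName());
    }
  }
}]]></programlisting>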
</section>
<section id="ugr.tug.cpe.collection_reader.hasnext">
<title>hasNext()</title>
<para>The <literal>hasNext()</literal> method returns whether or not there are
any documents remaining to be read from the collection. The File System
Collection Reader&apos;s <literal>hasNext()</literal> method is very
simple. It just checks if there are any more files left to be read:
<programlisting>public boolean hasNext() {
  return mCurrentIndex &lt; mFiles.size();
}</programlisting>
</para>
</section>
<section id="ugr.tug.cpe.collection_reader.required_methods.getnext">
<title>getNext(CAS)</title>
<para>The <literal>getNext()</literal> method reads the next document from the
collection and populates a CAS. In the simple case, this amounts to reading the
file and calling the CAS&apos;s <literal>setDocumentText</literal> method.
The example File System Collection Reader is slightly more complex. It first
checks for a CAS Initializer. If the CPE includes a CAS Initializer, the CAS
Initializer is used to read the document and initialize the CAS. If the CPE does not include a CAS
Initializer, the File System Collection Reader reads the document and sets the
document text in the CAS.</para>
<para>The File System Collection Reader also stores additional metadata about
the document in the CAS. In particular, it sets the document&apos;s language in
the special built-in feature structure
<literal>uima.tcas.DocumentAnnotation</literal> (see <olink
targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.cas.document_annotation"/> for details about this
built-in type) and creates an instance of
<literal>org.apache.uima.examples.SourceDocumentInformation</literal>,
which stores information about the document&apos;s source location. This
information may be useful to downstream components such as CAS Consumers. Note
that the type system descriptor for this type can be found in
<literal>org.apache.uima.examples.SourceDocumentInformation.xml</literal>,
which is located in the <literal>examples/src</literal> directory.</para>
<para>The <literal>getNext()</literal> method for the File System Collection Reader looks like
this:</para>
<programlisting>public void getNext(CAS aCAS) throws IOException, CollectionException {
  JCas jcas;
  try {
    jcas = aCAS.getJCas();
  } catch (CASException e) {
    throw new CollectionException(e);
  }

  // open input stream to file
  File file = (File) mFiles.get(mCurrentIndex++);
  BufferedInputStream fis =
          new BufferedInputStream(new FileInputStream(file));
  try {
    byte[] contents = new byte[(int) file.length()];
    fis.read(contents);
    String text;
    if (mEncoding != null) {
      text = new String(contents, mEncoding);
    } else {
      text = new String(contents);
    }
    // put document in CAS
    jcas.setDocumentText(text);
  } finally {
    if (fis != null)
      fis.close();
  }

  // set language if it was explicitly specified
  // as a configuration parameter
  if (mLanguage != null) {
    ((DocumentAnnotation) jcas.getDocumentAnnotationFs()).
          setLanguage(mLanguage);
  }

  // Also store location of source document in CAS.
  // This information is critical if CAS Consumers will
  // need to know where the original document contents
  // are located.
  // For example, the Semantic Search CAS Indexer
  // writes this information into the search index that
  // it creates, which allows applications that use the
  // search index to locate the documents that satisfy
  // their semantic queries.
  SourceDocumentInformation srcDocInfo =
          new SourceDocumentInformation(jcas);
  srcDocInfo.setUri(
          file.getAbsoluteFile().toURL().toString());
  srcDocInfo.setOffsetInSource(0);
  srcDocInfo.setDocumentSize((int) file.length());
  srcDocInfo.setLastSegment(
          mCurrentIndex == mFiles.size());
  srcDocInfo.addToIndexes();
}</programlisting>
<para>The Collection Reader can create additional annotations in the CAS at this
point, in the same way that annotators create annotations.</para>
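<para>In the <literal>getNext()</literal> listing above, the file is read with a single call
to <literal>InputStream.read(byte[])</literal>, which is not guaranteed to fill the whole
buffer. The following is a minimal sketch of an alternative way to load the document
text, assuming Java 7 or later and using only the standard library. The class and method
names (<literal>DocumentTextReader</literal>, <literal>readDocument</literal>) are
hypothetical and not part of the UIMA example. The resulting string could then be passed
to <literal>setDocumentText</literal>.</para>
<programlisting><![CDATA[import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class DocumentTextReader {

  /**
   * Reads a whole file and decodes it with the named encoding,
   * or with the platform default when encodingName is null.
   */
  static String readDocument(Path file, String encodingName)
      throws IOException {
    byte[] contents = Files.readAllBytes(file); // reads until end of file
    Charset charset = (encodingName != null)
        ? Charset.forName(encodingName)
        : Charset.defaultCharset();
    return new String(contents, charset);
  }

  public static void main(String[] args) throws IOException {
    // Write a small UTF-8 file, then read it back with an explicit encoding
    Path tmp = Files.createTempFile("doc", ".txt");
    try {
      Files.write(tmp, "caf\u00e9".getBytes("UTF-8"));
      System.out.println(readDocument(tmp, "UTF-8"));
    } finally {
      Files.delete(tmp);
    }
  }
}]]></programlisting>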
</section>
<section id="ugr.tug.cpe.collection_reader.required_methods.getprogress">
<title>getProgress()</title>
<para>The Collection Reader is responsible for returning progress information;
that is, how much of the collection has been read thus far and how much remains to be
read. The framework defines progress very generally; the Collection Reader
simply returns an array of <literal>Progress</literal> objects, where each
object contains three fields &mdash; the amount already completed, the total
amount (if known), and a unit (e.g. entities (documents), bytes, or files). The
method returns an array so that the Collection Reader can report progress in
multiple different units, if that information is available. The File System
Collection Reader&apos;s <literal>getProgress()</literal> method looks
like this:
<programlisting>public Progress[] getProgress() {
  return new Progress[]{
     new ProgressImpl(mCurrentIndex,mFiles.size(),Progress.ENTITIES)};
}</programlisting></para>
<para>In this particular example, the total number of files in the collection is
known, but the total size of the collection is not known. As such, a
<literal>ProgressImpl</literal> object for
<literal>Progress.ENTITIES</literal> is returned, but a
<literal>ProgressImpl</literal> object for
<literal>Progress.BYTES</literal> is not.</para>
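<para>If a reader also wanted to report progress in bytes, it would first need the total
size of the collection. The following is a minimal sketch of that calculation for local
files, using only the standard library. The class and method names
(<literal>ByteProgressSketch</literal>, <literal>totalBytes</literal>) are hypothetical.
The result could supply the total for a second <literal>Progress</literal> object that
uses the <literal>Progress.BYTES</literal> unit.</para>
<programlisting><![CDATA[import java.io.File;
import java.util.ArrayList;
import java.util.List;

class ByteProgressSketch {

  /**
   * Sums the sizes of the given files in bytes.
   * File.length() yields 0 for a file that does not exist.
   */
  static long totalBytes(List<File> files) {
    long total = 0;
    for (File file : files) {
      total += file.length();
    }
    return total;
  }

  public static void main(String[] args) {
    // Hypothetical usage: sum the sizes of the files named on the command line
    List<File> files = new ArrayList<>();
    for (String arg : args) {
      files.add(new File(arg));
    }
    System.out.println(
        totalBytes(files) + " bytes in " + files.size() + " files");
  }
}]]></programlisting>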
</section>
<section id="ugr.tug.cpe.collection_reader.required_methods.close">
<title>close()</title>
<para>The <literal>close()</literal> method is called when the Collection Reader is no longer needed.
The Collection Reader should then release any resources it may be holding. The
FileSystemCollectionReader does not hold resources and so has an empty
implementation of this method:</para>
<programlisting>public void close() throws IOException { }</programlisting>
</section>
<section id="ugr.tug.cpe.collection_reader.optional_methods">
<title>Optional Methods</title>
<para>The following methods may be implemented:</para>
<section id="ugr.tug.cpe.collection_reader.optional_methods.reconfigure">
<title>reconfigure()</title>
<para>This method is called if the Collection Reader&apos;s configuration
parameters change.</para>
</section>
<section id="ugr.tug.cpe.collection_reader.optional_methods.typesysteminit">
<title>typeSystemInit()</title>
<para>If you are only setting the document text in the CAS, or if you are using the
JCas (recommended, as in the current example), you do not have to implement this
method. If you are directly using the CAS API, this method is used in the same way
as it is used for an annotator &ndash; see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae.contract_for_annotator_methods"/>
for more information.</para>
</section>
</section>
<section id="ugr.tug.cpe.collection_reader.threading">
<title>Threading considerations</title>
<para>Collection Readers do not have to be thread safe. Each instance is run by a single
thread, and each instance of the Collection Processing Manager (CPM) creates only one
Collection Reader instance.</para>
</section>
<section id="ugr.tug.cpe.collection_reader.descriptor">
<title>XML Descriptor for a Collection Reader</title>
<para>You can use the Component Description Editor to create and / or edit the File
System Collection Reader&apos;s descriptor. Here is its descriptor
(abbreviated somewhat), which is very similar to an Analysis
Engine descriptor:</para>
<programlisting><![CDATA[<collectionReaderDescription
    xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <implementationName>
    org.apache.uima.examples.cpe.FileSystemCollectionReader
  </implementationName>
  <processingResourceMetaData>
    <name>File System Collection Reader</name>
    <description>Reads files from the filesystem.</description>
    <version>1.0</version>
    <vendor>The Apache Software Foundation</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>InputDirectory</name>
        <description>Directory containing input files</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>Encoding</name>
        <description>Character encoding for the documents.</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>Language</name>
        <description>ISO language code for the documents</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>InputDirectory</name>
        <value>
          <string>C:/Program Files/apache/uima/examples/data</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>

    <!-- Type System of CASes returned by this Collection Reader -->
    <typeSystemDescription>
      <imports>
        <import name="org.apache.uima.examples.SourceDocumentInformation"/>
      </imports>
    </typeSystemDescription>

    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            org.apache.uima.examples.SourceDocumentInformation
          </type>
        </outputs>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>true</outputsNewCASes>
    </operationalProperties>
  </processingResourceMetaData>
</collectionReaderDescription>]]></programlisting>
</section>
</section>
</section>
<section id="ugr.tug.cpe.cas_initializer.developing"><title>Developing CAS
Initializers</title> <note><para>CAS Initializers are deprecated as of
version 2.1. For complex initialization, create additional Subjects of
Analysis instead (see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>).</para></note>
<para>In UIMA 1.x, the CAS Initializer component was intended to be used as a plug-in
to the Collection Reader for when the task of populating the CAS from a raw document is
complex and might be reusable with other data collections.</para>
<para>A CAS Initializer Java class must implement the interface
<literal>org.apache.uima.collection.CasInitializer</literal>, and will also
generally extend from the convenience base class
<literal>org.apache.uima.collection.CasInitializer_ImplBase</literal>. A
CAS Initializer also must have an XML descriptor, which has the exact same form as a
Collection Reader Descriptor except that the outer tag is
<literal>&lt;casInitializerDescription&gt;</literal>.</para>
<para>CAS Initializers have optional <literal>initialize()</literal>,
<literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal>
methods, which perform the same functions as they do for Collection Readers. The only
required method for a CAS Initializer is <literal>initializeCas(Object,
CAS)</literal>. This method takes the raw document (for example, an
<literal>InputStream</literal> object from which the document can be read) and a
CAS, and populates the CAS from the document.</para>
</section>
<section id="ugr.tug.cpe.cas_consumer.developing"><title>Developing CAS
Consumers</title>
<note><para>In version 2, there is no difference in capability
between CAS Consumers and ordinary Analysis Engines, except for the default setting of
the XML parameters for <literal>multipleDeploymentAllowed</literal> and
<literal>modifiesCas</literal>. We recommend for future work that users implement
and use Analysis Engine components instead of CAS Consumers.</para>
<para>The rest of this section is written using the version 1 style of CAS Consumer;
the methods described are also available for Analysis Engines. Note that the
CAS Consumer <literal>processCas</literal> method is equivalent to the Analysis Engine
<literal>process</literal> method.</para></note>
<para>A CAS Consumer receives each CAS after it has been analyzed by the Analysis
Engine. CAS Consumers typically do not update the CAS; they typically extract data
from the CAS and persist selected information to aggregate data structures such as
search engine indexes or databases.</para>
<para>A CAS Consumer Java class must implement the interface
<literal>org.apache.uima.collection.CasConsumer</literal>, and will also
generally extend from the convenience base class
<literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>. A CAS
Consumer also must have an XML descriptor, which has the exact same form as a
Collection Reader Descriptor except that the outer tag is
<literal>&lt;casConsumerDescription&gt;</literal>.</para>
<para>CAS Consumers have optional <literal>initialize()</literal>,
<literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal>
methods, which perform the same functions as they do for Collection Readers and CAS
Initializers. The only required method for a CAS Consumer is
<literal>processCas(CAS)</literal>, which is where the CAS Consumer does the bulk
of its work (i.e., consume the CAS).</para>
<para>The <literal>CasConsumer</literal> interface (as well as the version 2
Analysis Engine interface) additionally defines batch
and collection level processing methods. The CAS Consumer or Analysis Engine
can implement the
<literal>batchProcessComplete()</literal> method to perform processing that
should occur at the end of each batch of CASes. Similarly, the CAS Consumer
or Analysis Engine can
implement the <literal>collectionProcessComplete()</literal> method to
perform any collection level processing at the end of the collection.</para>
<para>A very simple example of a CAS Consumer, which writes an XML representation of the
CAS to a file, is the XMI Writer CAS Consumer. The Java code is in the class
<literal>org.apache.uima.examples.cpe.XmiWriterCasConsumer</literal> and
the descriptor is in
<literal>%UIMA_HOME%/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml</literal>.</para>
<section id="ugr.tug.cpe.cas_consumer.required_methods">
<title>Required Methods for a CAS Consumer</title>
<para>When extending the convenience class
<literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>, the
following abstract methods must be implemented:</para>
<section id="ugr.tug.cpe.cas_consumer.required_methods.initialize">
<title>initialize()</title>
<para>The <literal>initialize()</literal> method is called by the framework
when the CAS Consumer is first created.
<literal>CasConsumer_ImplBase</literal> actually provides a default
implementation of this method (i.e., it is not abstract), so you are not strictly
required to implement this method. However, a typical CAS Consumer will
implement this method to obtain parameter values and perform various
initialization steps.</para>
<para>In this method, the CAS Consumer can access the values of its configuration
parameters and perform other initialization logic. The example XMI Writer CAS
Consumer reads its configuration parameters and sets up the output directory:
<programlisting>public void initialize() throws ResourceInitializationException {
mDocNum = 0;
mOutputDir = new File((String) getConfigParameterValue(PARAM_OUTPUTDIR));
if (!mOutputDir.exists()) {
mOutputDir.mkdirs();
}
}</programlisting></para>
</section>
<section id="ugr.tug.cpe.cas_consumer.required_methods.processcas">
<title>processCas()</title>
<para>The <literal>processCas()</literal> method is where the CAS Consumer
does most of its work. In our example, the XMI Writer CAS Consumer obtains an
iterator over the document metadata in the CAS (in the
SourceDocumentInformation feature structure, which is created by the File
System Collection Reader) and extracts the URI for the current document. From
this the output filename is constructed in the output directory and a subroutine
(<literal>writeXmi</literal>) is called to generate the output file. The
<literal>writeXmi</literal> subroutine uses the
<literal>XmiCasSerializer</literal> class provided with the UIMA SDK to
serialize the CAS to the output file (see the example source code for
details).</para>
<programlisting>public void processCas(CAS aCAS) throws ResourceProcessException {
String modelFileName = null;
JCas jcas;
try {
jcas = aCAS.getJCas();
} catch (CASException e) {
throw new ResourceProcessException(e);
}
// retrieve the filename of the input file from the CAS
FSIterator it = jcas
.getAnnotationIndex(SourceDocumentInformation.type)
.iterator();
File outFile = null;
if (it.hasNext()) {
SourceDocumentInformation fileLoc =
(SourceDocumentInformation) it.next();
File inFile;
try {
inFile = new File(new URL(fileLoc.getUri()).getPath());
String outFileName = inFile.getName();
if (fileLoc.getOffsetInSource() > 0) {
outFileName += ("_" + fileLoc.getOffsetInSource());
}
outFileName += ".xmi";
outFile = new File(mOutputDir, outFileName);
modelFileName = mOutputDir.getAbsolutePath() +
"/" + inFile.getName() + ".ecore";
} catch (MalformedURLException e1) {
// invalid URL, use default processing below
}
}
if (outFile == null) {
outFile = new File(mOutputDir, "doc" + mDocNum++);
}
// serialize the CAS as XMI and write it to the output file
try {
writeXmi(jcas.getCas(), outFile, modelFileName);
} catch (IOException e) {
throw new ResourceProcessException(e);
} catch (SAXException e) {
throw new ResourceProcessException(e);
}
}</programlisting>
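<para>The file-naming rule in the listing above can be tried in isolation. The following
is a minimal sketch under stated assumptions, not code from the UIMA SDK: the class
<literal>XmiOutputNaming</literal> and its method <literal>outputName</literal> are
hypothetical names, and plain values stand in for the URI and offset that the example
reads from the <literal>SourceDocumentInformation</literal> feature structure.</para>
<programlisting><![CDATA[import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper that mirrors the naming rule of the example; not part of UIMA. */
final class XmiOutputNaming {

  /**
   * Returns the output file name for a source document, or null when the
   * URI is not a valid URL (the example then falls back to "doc" + counter).
   */
  static String outputName(String sourceUri, int offsetInSource) {
    try {
      // Keep only the last path segment of the URL path.
      String path = new URL(sourceUri).getPath();
      String name = path.substring(path.lastIndexOf('/') + 1);
      // A non-zero offset marks a document cut out of a larger source file.
      if (offsetInSource > 0) {
        name += "_" + offsetInSource;
      }
      return name + ".xmi";
    } catch (MalformedURLException e) {
      return null; // invalid URL: the caller applies its default naming
    }
  }
}]]></programlisting>
<para>A <literal>null</literal> result corresponds to the branch in the example where
<literal>outFile</literal> is still <literal>null</literal> and a name of the form
<literal>doc</literal> plus a running number is used instead.</para>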
</section>
<section id="ugr.tug.cpe.cas_consumer.optional_methods">
<title>Optional Methods</title>
<para>The following methods are optional in a CAS Consumer, though they are often
used.</para>
<section id="ugr.tug.cpe.cas_consumer.optional_methods.batchprocesscomplete">
<title>batchProcessComplete()</title>
<para>The framework calls the <literal>batchProcessComplete()</literal> method at the end of each
batch of CASes. This gives the CAS Consumer or Analysis Engine
an opportunity to perform any batch-level
processing. Our simple XMI Writer CAS Consumer does not perform any
batch-level processing, so this method is empty. Batch size is set in the
Collection Processing Engine descriptor.</para>
</section>
<section id="ugr.tug.cpe.cas_consumer.optional_methods.collectionprocesscomplete">
<title>collectionProcessComplete()</title>
<para>The framework calls the <literal>collectionProcessComplete()</literal> method at the end
of the collection (i.e., when all objects in the collection have been
processed). No CAS is passed to this method. It gives
the CAS Consumer or Analysis Engine an opportunity to perform processing over the
entire set of objects in the collection. Our simple XMI Writer CAS Consumer
does not perform any collection-level processing, so this method is
empty.</para>
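<para>To illustrate how the per-CAS, batch, and collection callbacks can divide the work,
here is a minimal sketch that buffers results and flushes them at the end of each batch.
It is an assumption-laden stand-in: the class <literal>BufferingConsumerSketch</literal>
is hypothetical, it only borrows the callback names, and it uses plain strings rather
than UIMA types so that it can run without the SDK.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a consumer; it does not implement any UIMA interface. */
final class BufferingConsumerSketch {
  private final List<String> pending = new ArrayList<>();   // seen in the current batch
  private final List<String> committed = new ArrayList<>(); // flushed so far
  private boolean closed = false;

  /** Per-document work: here we only remember an identifier. */
  void processCas(String documentId) {
    pending.add(documentId);
  }

  /** End of a batch: flush whatever accumulated since the previous batch. */
  void batchProcessComplete() {
    committed.addAll(pending);
    pending.clear();
  }

  /** End of the collection: flush the remainder and mark the sink as finished. */
  void collectionProcessComplete() {
    batchProcessComplete();
    closed = true;
  }

  int committedCount() {
    return committed.size();
  }

  boolean isClosed() {
    return closed;
  }
}]]></programlisting>
<para>A real component would replace the two lists with, for example, a database
transaction or an index writer. The design point is only that expensive commits happen
once per batch rather than once per CAS.</para>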
</section>
</section>
</section>
</section>
</section>
<section id="ugr.tug.cpe.deploying_a_cpe">
<title>Deploying a CPE</title>
<para>The CPM provides a number of service and deployment options that cover
instantiation and execution of CPEs, error recovery, and local and distributed
deployment of the CPE components. The behavior of the CPM (and correspondingly, the
CPE) is controlled by various options and parameters set in the CPE descriptor. The
current version of the CPE Configurator tool, however, supports only default error
handling and deployment options. To change these options, you must manually edit the
CPE descriptor.</para>
<para>Eventually the CPE Configurator tool will support configuring these options and a
detailed tutorial for these settings will be provided. In the meantime, we provide only
a high-level, conceptual overview of these advanced features in the rest of this
chapter, and refer the advanced user to <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.cpe_descriptor"/> for details on setting these options in the CPE
Descriptor.</para>
<para> <xref linkend="ugr.tug.cpe.fig.cpe_instantiation"/> shows a logical view of
how an application uses the UIMA framework to instantiate a CPE from a CPE descriptor.
The CPE descriptor identifies the CPE components (referencing their corresponding
descriptors) and specifies the various options for configuring the CPM and deploying
the CPE components.</para>
<figure id="ugr.tug.cpe.fig.cpe_instantiation">
<title>CPE Instantiation</title>
<mediaobject>
<imageobject>
<imagedata width="5.84in" format="PNG"
fileref="&imgroot;image018.png"/>
</imageobject>
<textobject><phrase>Picture of deployment of a CPE</phrase></textobject>
</mediaobject>
</figure>
<para id="ugr.tug.cpe.deployment_alternatives">There are three deployment modes
for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:</para>
<orderedlist><listitem><para><emphasis role="bold">Integrated</emphasis> (runs
in the same Java instance as the CPM)</para></listitem>
<listitem><para><emphasis role="bold">Managed</emphasis> (runs in a separate
process on the same machine), and</para></listitem>
<listitem><para><emphasis role="bold">Non-managed</emphasis> (runs in a
separate process, perhaps on a different machine). </para></listitem>
</orderedlist>
<para>An integrated CAS Processor runs in the same JVM as the CPE. A managed CAS Processor
runs in a separate process from the CPE, but still on the same computer. The CPE controls
startup, shutdown, and recovery of a managed CAS Processor. A non-managed CAS
Processor runs as a service and may be on the same computer as the CPE or on a remote
computer. A non-managed CAS Processor <emphasis role="bold-italic">
service</emphasis> is started and managed independently from the CPE.</para>
<para>For both managed and non-managed CAS Processors, the CAS must be transmitted
between separate processes and possibly between separate computers. This is
accomplished using <emphasis>Vinci</emphasis>, a communication protocol that the
CPM uses and that is provided as a part of Apache UIMA. Vinci handles service naming and
location and data transport (see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.how_to_deploy_a_vinci_service"/>&nbsp; for more
information). Service naming and location are provided by a <emphasis>Vinci Naming
Service</emphasis>, or <emphasis>VNS</emphasis>. For managed CAS Processors, the
CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be
running.</para> <note><para>The UIMA SDK also supports using non-managed remote
services via the web-standard SOAP communications protocol (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.how_to_deploy_as_soap"/>). This approach is
based on a proxy implementation, where the proxy essentially runs in integrated
mode. To use this approach with the CPM, use the integrated mode, with the component being
an Aggregate which, in turn, connects to a remote service.</para></note>
<para>The CPE Configurator tool currently only supports constructing CPEs that deploy
CAS Processors in integrated mode. To deploy CAS Processors in any other mode, the CPE
descriptor must be edited by hand (better tooling may be provided later). Details on the
CPE descriptor and the required settings for various CAS Processor deployment modes
can be found in <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.
In the following sections we merely summarize the various CAS Processor deployment
options.</para>
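<para>As a compact summary of the three modes, the following sketch maps the values of the
<literal>deployment</literal> attribute used in the descriptor fragments of this chapter
(<literal>integrated</literal>, <literal>local</literal>, and <literal>remote</literal>)
to the mode names used in the text. The enum <literal>DeploymentMode</literal> and its
members are hypothetical and exist only for illustration; the CPM does not expose such a
type.</para>
<programlisting><![CDATA[/** Hypothetical summary of the deployment modes described in this chapter. */
enum DeploymentMode {
  INTEGRATED("integrated", false), // same JVM as the CPM
  MANAGED("local", true),          // separate process, started and stopped by the CPE
  NON_MANAGED("remote", true);     // independent service, possibly on another machine

  /** Value of the deployment attribute on the casProcessor element. */
  final String attributeValue;
  /** True when CASes travel between processes via Vinci. */
  final boolean usesVinci;

  DeploymentMode(String attributeValue, boolean usesVinci) {
    this.attributeValue = attributeValue;
    this.usesVinci = usesVinci;
  }

  /** Looks up the mode for an attribute value; rejects anything else. */
  static DeploymentMode fromAttribute(String value) {
    for (DeploymentMode mode : values()) {
      if (mode.attributeValue.equals(value)) {
        return mode;
      }
    }
    throw new IllegalArgumentException("Unknown deployment value: " + value);
  }
}]]></programlisting>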
<section id="ugr.tug.cpe.managed_deployment">
<title>Deploying Managed CAS Processors</title>
<para>Managed CAS Processor deployment is shown in <xref
linkend="ugr.tug.cpe.fig.managed_deployment"/>. A managed CAS Processor is
deployed by the CPE as a Vinci service. The CPE manages the lifecycle of the CAS
Processor including service launch, restart on failures, and service shutdown. A
managed CAS Processor runs on the same machine as the CPE, but in a separate process.
This provides the necessary fault isolation for the CPE to protect it from non-robust
CAS Processors. A fatal failure of a managed CAS Processor does not threaten the
stability of the CPE.</para>
<figure id="ugr.tug.cpe.fig.managed_deployment">
<title>CPE with Managed CAS Processors</title>
<mediaobject>
<imageobject>
<imagedata width="3.6in" format="PNG"
fileref="&imgroot;image020.png"/>
</imageobject>
<textobject><phrase>Managed deployment showing separate JVMs and CASes
flowing between them</phrase></textobject>
</mediaobject>
</figure>
<para>The CPE communicates with managed CAS Processors using the Vinci communication
protocol. A CAS Processor is launched as a Vinci service and its
<literal>process()</literal> method is invoked remotely via a Vinci command. The
CPE uses its own internal VNS to support managed CAS processors. The VNS, by default,
listens on port 9005. If this port is not available, the VNS will increment its listen
port until it finds one that is available. All managed CAS Processors are internally
configured to <quote>talk</quote> to the CPE managed VNS. This internal VNS is
transparent to the end user launching the CPE.</para>
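<para>The port search described above (start at a default port and move upward until a
free one is found) can be sketched as follows. This is not the CPM's implementation,
which is not shown in this chapter; the class <literal>PortProbeSketch</literal>, its
methods, and the attempt limit are assumptions made for illustration.</para>
<programlisting><![CDATA[import java.io.IOException;
import java.net.ServerSocket;
import java.util.function.IntPredicate;

/** Hypothetical illustration of an incrementing port search; not UIMA code. */
final class PortProbeSketch {

  /** Returns true if a server socket can currently be bound to the port. */
  static boolean canBind(int port) {
    try (ServerSocket probe = new ServerSocket(port)) {
      return true;  // bind succeeded; the probe socket is closed on exit
    } catch (IOException inUse) {
      return false; // port is taken or binding is not permitted
    }
  }

  /**
   * Returns the first port at or above startPort that isFree accepts,
   * trying at most maxAttempts ports, or -1 if none is found.
   */
  static int firstAvailablePort(int startPort, int maxAttempts, IntPredicate isFree) {
    for (int i = 0; i < maxAttempts; i++) {
      int port = startPort + i;
      if (port > 65535) {
        break; // ran past the valid port range
      }
      if (isFree.test(port)) {
        return port;
      }
    }
    return -1;
  }
}]]></programlisting>
<para>A caller would typically write
<literal>PortProbeSketch.firstAvailablePort(9005, 100, PortProbeSketch::canBind)</literal>.
Passing the availability check as a parameter keeps the search testable without opening
sockets. Note that a probed port can be taken by another process before it is actually
used, so a robust server binds and keeps the socket instead of probing first.</para>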
<para>To deploy a managed CAS Processor, the CPE deployer must change the CPE
descriptor. The following is a section from the CPE descriptor that shows an example
configuration specifying a managed CAS Processor.</para>
<programlisting>&lt;casProcessor <emphasis role="bold-italic">deployment="local"</emphasis> name="Meeting Detector TAE"&gt;
&lt;descriptor&gt;
&lt;include href="deploy/vinci/Deploy_MeetingDetectorTAE.xml"/&gt;
&lt;/descriptor&gt;
&lt;runInSeparateProcess&gt;
&lt;exec dir="." executable="java"&gt;
&lt;env key="CLASSPATH"
value="src;
C:/Program Files/apache/uima/lib/uima-core.jar;
C:/Program Files/apache/uima/lib/uima-cpe.jar;
C:/Program Files/apache/uima/lib/uima-examples.jar;
C:/Program Files/apache/uima/lib/uima-adapter-vinci.jar;
C:/Program Files/apache/uima/lib/jVinci.jar"/>
&lt;arg&gt;-DLOG=C:/Temp/service.log&lt;/arg&gt;
&lt;arg&gt;org.apache.uima.reference_impl.collection.
service.vinci.VinciAnalysisEngineService_impl&lt;/arg&gt;
&lt;arg&gt;${descriptor}&lt;/arg&gt;
&lt;/exec&gt;
&lt;/runInSeparateProcess&gt;
&lt;deploymentParameters/&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;
&lt;errorRateThreshold action="terminate" value="1/100"/&gt;
&lt;maxConsecutiveRestarts action="terminate" value="3"/&gt;
&lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;
&lt;checkpoint batch="10000"/&gt;
&lt;/casProcessor&gt;</programlisting>
<para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
details and required settings.</para>
</section>
<section id="ugr.tug.cpe.deploying_nonmanaged_cas_processors">
<title>Deploying Non-managed CAS Processors</title>
<para>Non-managed CAS Processor deployment is shown in <xref
linkend="ugr.tug.cpe.fig.nonmanaged_cpe"/>. In non-managed mode, the CPE
supports connectivity to CAS Processors running on local or remote computers using
Vinci. Non-managed processors are different from managed processors in two
aspects:
<orderedlist><listitem><para>Non-managed processors are neither started nor
stopped by the CPE.</para></listitem>
<listitem><para>Non-managed processors use an independent VNS, also neither
started nor stopped by the CPE. </para></listitem></orderedlist></para>
<figure id="ugr.tug.cpe.fig.nonmanaged_cpe">
<title>CPE with non-managed CAS Processors</title>
<mediaobject>
<imageobject>
<imagedata width="4.8in" format="PNG"
fileref="&imgroot;image023.png"/>
</imageobject>
<textobject><phrase>Non-managed CPE deployment</phrase></textobject>
</mediaobject>
</figure>
<para>While non-managed CAS Processors provide the same level of fault isolation and
robustness as managed CAS Processors, error recovery support for non-managed CAS
Processors is much more limited. In particular, the CPE cannot restart a non-managed
CAS Processor after an error.</para>
<para>Non-managed CAS Processors also require a separate Vinci Naming Service
running on the network. This VNS must be manually started and monitored by the end user
or application. Instructions for running a VNS can be found in <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.vns.starting"/>.</para>
<para>To deploy a non-managed CAS Processor, the CPE deployer must change the CPE
descriptor. The following is a section from the CPE descriptor that shows an example
configuration for the non-managed CAS Processor.</para>
<programlisting>&lt;casProcessor <emphasis role="bold-italic">deployment="remote"</emphasis> name="Meeting Detector TAE"&gt;
&lt;descriptor&gt;
&lt;include href=
"descriptors/vinciService/MeetingDetectorVinciService.xml"/&gt;
&lt;/descriptor&gt;
&lt;deploymentParameters/&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;
&lt;errorRateThreshold action="terminate" value="1/100"/&gt;
&lt;maxConsecutiveRestarts action="terminate" value="3"/&gt;
&lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;
&lt;checkpoint batch="10000"/&gt;
&lt;/casProcessor&gt;</programlisting>
<para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
details and required settings.</para>
</section>
<section id="ugr.tug.cpe.integrated_deployment">
<title>Deploying Integrated CAS Processors</title>
<para>Integrated CAS Processors are shown in <xref
linkend="ugr.tug.cpe.fig.integrated_deployment"/>. Here the CAS Processors
run in the same JVM as the CPE, just like the Collection Reader and CAS Initializer.
This deployment method results in minimal CAS communication and transport overhead
as the CAS is shared in the same process space of the JVM. However, a CPE running with all
integrated CAS Processors is limited in scalability by the capability of the single
computer on which the CPE is running. There is also a stability risk associated with
integrated processors because a poorly written CAS Processor can cause the JVM, and
hence the entire CPE, to abort.</para>
<figure id="ugr.tug.cpe.fig.integrated_deployment">
<title>CPE with integrated CAS Processor</title>
<mediaobject>
<imageobject>
<imagedata width="3.2in" format="PNG"
fileref="&imgroot;image026.png"/>
</imageobject>
<textobject><phrase>CPE with integrated CAS Processor</phrase>
</textobject>
</mediaobject>
</figure>
<para>The following is a section from a CPE descriptor that shows an example
configuration for the integrated CAS Processor.</para>
<programlisting>&lt;casProcessor <emphasis role="bold-italic">deployment="integrated"</emphasis> name="Meeting Detector TAE"&gt;
&lt;descriptor&gt;
&lt;include href="descriptors/tutorial/ex4/MeetingDetectorTAE.xml"/&gt;
&lt;/descriptor&gt;
&lt;deploymentParameters/&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;
&lt;errorRateThreshold action="terminate" value="100/1000"/&gt;
&lt;maxConsecutiveRestarts action="terminate" value="30"/&gt;
&lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;
&lt;checkpoint batch="10000"/&gt;
&lt;/casProcessor&gt;</programlisting>
<para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
details and required settings.</para>
</section>
</section>
<section id="ugr.tug.cpe.collection_processing_examples">
<title>Collection Processing Examples</title>
<para>The UIMA SDK includes a set of examples illustrating the three modes of deployment:
integrated, managed, and non-managed. These are in the
<literal>/examples/descriptors/collection_processing_engine</literal>
directory. There are three CPE descriptors that run an example annotator (the Meeting
Finder) in these modes.</para>
<para>To run either the integrated or managed examples, use the
<literal>runCPE</literal> script in the /bin directory of the UIMA installation,
passing the appropriate CPE descriptor as an argument. Alternatively,
if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your
workspace, use the Eclipse menu Run &rarr; Run... and then pick the
launch configuration <quote>UIMA Run CPE</quote>.</para>
<note><para>The <literal>runCPE</literal> script <emphasis role="bold-italic"> must</emphasis>
be run from the <literal>%UIMA_HOME%\examples</literal> directory, because the example
CPE descriptors use relative path names that are resolved relative to this working directory.
For instance,
<literallayout>runCPE
descriptors\collection_processing_engine\MeetingFinderCPE_Integrated.xml</literallayout></para>
</note>
<!--
<para>If you installed the examples into Eclipse, you can run directly from Eclipse by
creating a run configuration. To do this, highlight the SimpleRunCPE.java source file
in the examples src/org/apache/uima/examples/cpe directory, and then</para>
<orderedlist><listitem><para>pick the menu Run &rarr; Run...</para></listitem>
<listitem><para>click <quote>Java Application</quote> and press
<quote>New</quote></para></listitem>
<listitem><para>click on the Arguments panel, and insert a path to the appropriate CPE
descriptor in the <quote>Program Arguments</quote> box by typing, for instance:
<literal>descriptors/collection_processing_engine/
MeetingFinderCPE_Integrated.xml</literal>
</para></listitem>
<listitem><para>Then press <quote>Run</quote> </para></listitem>
</orderedlist>
-->
<para>To run the non-managed example, there are some additional steps.
<orderedlist><listitem><para>Start a VNS service by running the
<literal>startVNS</literal> script in the <literal>/bin</literal>
directory, or using the Eclipse launcher <quote>UIMA Start VNS</quote>.</para></listitem>
<listitem><para>Deploy the Meeting Detector Analysis Engine as a Vinci service by
running the <literal>startVinciService</literal> script in the
<literal>/bin</literal> directory and passing it the location of the
descriptor to deploy, in this case
<literal>%UIMA_HOME%/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml</literal>.
Alternatively,
if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your
workspace, use the Eclipse menu Run &rarr; Run... and then pick the
launch configuration <quote>UIMA Start Vinci Service</quote>.
</para></listitem>
<listitem><para>Now, run the <literal>runCPE</literal> script (or, in Eclipse, run the
launch configuration <quote>UIMA Run CPE</quote>), passing it the CPE descriptor for the non-managed
version
(<literal>%UIMA_HOME%/examples/descriptors/collection_processing_engine/
MeetingFinderCPE_NonManaged.xml</literal>).</para></listitem></orderedlist></para>
<para>This assumes that the Vinci Naming Service, the runCPE application, and the
<literal>MeetingDetectorTAE</literal> service are all running on the same machine.
Most of the scripts that need information about VNS will look for values to use in
environment variables VNS_HOST and VNS_PORT; these default to
<quote>localhost</quote> and <quote>9000</quote>. You may set these to appropriate
values before running the scripts, as needed; you can also pass the name of the VNS host as
the second argument to the startVinciService script.</para>
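<para>An application that wants to follow the same convention as the scripts could resolve
the VNS location as sketched below. The class <literal>VnsSettingsSketch</literal> is
hypothetical and is not part of the UIMA SDK; only the variable names
<literal>VNS_HOST</literal> and <literal>VNS_PORT</literal> and the defaults
<quote>localhost</quote> and <quote>9000</quote> come from the description above.</para>
<programlisting><![CDATA[import java.util.Map;

/** Hypothetical helper: resolves VNS host and port with the documented defaults. */
final class VnsSettingsSketch {
  static final String DEFAULT_HOST = "localhost";
  static final int DEFAULT_PORT = 9000;

  final String host;
  final int port;

  private VnsSettingsSketch(String host, int port) {
    this.host = host;
    this.port = port;
  }

  /** Reads VNS_HOST and VNS_PORT from the given map; missing or blank values use defaults. */
  static VnsSettingsSketch fromEnvironment(Map<String, String> env) {
    String host = env.get("VNS_HOST");
    if (host == null || host.trim().isEmpty()) {
      host = DEFAULT_HOST;
    }
    int port = DEFAULT_PORT;
    String rawPort = env.get("VNS_PORT");
    if (rawPort != null && !rawPort.trim().isEmpty()) {
      // Throws NumberFormatException on a non-numeric value instead of guessing.
      port = Integer.parseInt(rawPort.trim());
    }
    return new VnsSettingsSketch(host.trim(), port);
  }
}]]></programlisting>
<para>In an application, the map would normally be <literal>System.getenv()</literal>.
Taking it as a parameter makes the defaults easy to check without changing the process
environment.</para>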
<para>Alternatively, you can edit the scripts and/or the XML files to specify
alternatives for the VNS_HOST and VNS_PORT. For instance, if the
<literal>runCPE</literal> application is running on a different machine from the
Vinci Naming Service, you can edit the
<literal>MeetingFinderCPE_NonManaged.xml</literal> and change the vnsHost
parameter:
<literal>&lt;parameter name="vnsHost" value="localhost" type="string"/&gt;</literal>
to specify the VNS host instead of <quote>localhost</quote>.</para>
</section>
</chapter>