<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/tutorials_and_users_guides/tug.cpe/"> | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent"> | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.tug.cpe"> | |
<title>Collection Processing Engine Developer's Guide</title> | |
<titleabbrev>CPE Developer's Guide</titleabbrev> | |
<note><para>The CPE (Collection Processing Engine) was an early
approach to supporting some scale-out use cases. It does not
support some of the newer CAS features,
such as multiple views and CAS Multipliers. It has been
supplanted by UIMA-AS, which fully supports these features.</para></note>
<para>The UIMA Analysis Engine interface provides support for developing and integrating | |
algorithms that analyze unstructured data. Analysis Engines are designed to operate on a | |
per-document basis. Their interface handles one CAS at a time. UIMA provides additional | |
support for applying analysis engines to collections of unstructured data with its | |
<emphasis>Collection Processing Architecture</emphasis>. The Collection | |
Processing Architecture defines additional components for reading raw data formats | |
from data collections, preparing the data for processing by Analysis Engines, executing | |
the analysis, extracting analysis results, and deploying the overall flow in a variety of | |
local and distributed configurations.</para> | |
<para>The functionality defined in the Collection Processing Architecture is | |
implemented by a <emphasis>Collection Processing Engine</emphasis> (CPE). A CPE | |
includes an Analysis Engine and adds a <emphasis>Collection Reader</emphasis>, a | |
<emphasis>CAS Initializer</emphasis> (deprecated as of version 2), and <emphasis>CAS | |
Consumers</emphasis>. The part of the UIMA Framework that supports the execution of | |
CPEs is called the Collection Processing Manager, or CPM.</para> | |
<para>A Collection Reader provides the interface to the raw input data and knows how to | |
iterate over the data collection. Collection Readers are discussed in <xref | |
linkend="ugr.tug.cpe.collection_reader.developing"/>. The CAS Initializer | |
<footnote><para>CAS Initializers are deprecated in favor of a more general mechanism, | |
multiple subjects of analysis.</para></footnote> prepares an individual data item for | |
analysis and loads it into the CAS. CAS Initializers are discussed in <xref | |
linkend="ugr.tug.cpe.cas_initializer.developing"/>. A CAS Consumer extracts
analysis results from the CAS and may also perform <emphasis>collection level | |
processing</emphasis>, or analysis over a collection of CASes. CAS Consumers are | |
discussed in <xref linkend="ugr.tug.cpe.cas_consumer.developing"/>.</para> | |
<para>Analysis Engines and CAS Consumers are both instances of <emphasis>CAS | |
Processors</emphasis>. A Collection Processing Engine (CPE) may contain multiple CAS | |
Processors. An Analysis Engine contained in a CPE may itself be a Primitive or an Aggregate | |
(composed of other Analysis Engines). Aggregates may contain CAS Consumers. While
Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS | |
Processor may be deployed in a variety of local and distributed modes, providing a number | |
of options for scalability and robustness. The different deployment options are covered | |
in detail in <xref linkend="ugr.tug.cpe.deployment_alternatives"/>.</para> | |
<para>Each of the components in a CPE has an interface specified by the UIMA Collection | |
Processing Architecture and is described by a declarative XML descriptor file. | |
Similarly, the CPE itself has a well defined component interface and is described by a | |
declarative XML descriptor file.</para> | |
<para>A user creates a CPE by assembling the components mentioned above. The UIMA SDK | |
provides a graphical tool, called the CPE Configurator, for assisting in the assembly of | |
CPEs. Use of this tool is summarized in <xref | |
linkend="ugr.tug.cpe.cpe_configurator"/>, and more details can be found in | |
<olink targetdoc="&uima_docs_tools;"/> | |
<olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>. | |
Alternatively, a CPE can be assembled by writing an XML CPE descriptor. Details on the CPE | |
descriptor, including its syntax and content, can be found in the | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. The individual | |
components have associated XML descriptors, each of which can be created and / or edited | |
using the <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"> | |
Component Description Editor</olink>.</para> | |
<para>A CPE is executed by a UIMA infrastructure component called the | |
<emphasis>Collection Processing Manager</emphasis> (CPM). The CPM provides a number | |
of services and deployment options that cover instantiation and execution of CPEs, error | |
recovery, and local and distributed deployment of the CPE components.</para> | |
<section id="ugr.tug.cpe.concepts"> | |
<title>CPE Concepts</title> | |
<para> <xref linkend="ugr.tug.cpe.fig.cpe_components"/> illustrates the data flow | |
that occurs between the different types of components that make up a CPE.</para> | |
<figure id="ugr.tug.cpe.fig.cpe_components"> | |
<title>CPE Components</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="PNG" | |
fileref="&imgroot;image002.png"/> | |
</imageobject> | |
<textobject><phrase>CPE Components and flow between them</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para>The components of a CPE are:</para> | |
<itemizedlist><listitem><para><emphasis>Collection Reader –</emphasis> | |
interfaces to a collection of data items (e.g., documents) to be analyzed. Collection | |
Readers return CASes that contain the documents to analyze, possibly along with | |
additional metadata.</para></listitem> | |
<listitem><para><emphasis>Analysis Engine –</emphasis> takes a CAS, | |
analyzes its contents, and produces an enriched CAS. Analysis Engines can be | |
recursively composed of other Analysis Engines (called an | |
<emphasis>Aggregate</emphasis> Analysis Engine). Aggregates may also contain | |
CAS Consumers.</para></listitem> | |
<listitem><para><emphasis>CAS Consumer –</emphasis> consumes the enriched
CAS that was produced by the sequence of Analysis Engines before it, and produces an
application-specific data structure, such as a search engine index or database. | |
</para></listitem></itemizedlist> | |
<para>A fourth type of component, the <emphasis>CAS Initializer</emphasis>, may be
used by a Collection Reader to populate a CAS from a document. However, as of UIMA
version 2, CAS Initializers are deprecated in favor of a more general mechanism,
multiple Subjects of Analysis.</para>
<para>The Collection Processing Manager orchestrates the data flow | |
within a CPE, monitors status, optionally manages the life-cycle of internal | |
components and collects statistics.</para> | |
<para>CASes are not saved in a persistent way by the framework. If you want to save CASes, | |
then you have to save each CAS as it comes through (for example) using a CAS Consumer you | |
write to do this, in whatever format you like. The UIMA SDK supplies an example CAS | |
Consumer to save CASes to XML files, either in the standard XMI format or in an older | |
format called XCAS. It also supplies an example CAS Consumer to extract information from CASes and | |
store the results into a relational Database, using Java's JDBC APIs.</para> | |
</section> | |
<section id="ugr.tug.cpe.configurator_and_viewer"> | |
<title>CPE Configurator and CAS viewer</title> | |
<section id="ugr.tug.cpe.cpe_configurator"> | |
<title>Using the CPE Configurator</title> | |
<para>A CPE can be assembled by writing an XML CPE descriptor. Details on the CPE | |
descriptor, including its syntax and content, can be found in | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. Rather than | |
edit raw XML, you may develop a CPE Descriptor using the CPE Configurator tool. The CPE | |
Configurator tool is described briefly in this section, and in more detail in | |
<olink targetdoc="&uima_docs_tools;"/> | |
<olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.</para> | |
<para>The CPE Configurator tool can be run from Eclipse (see <xref
linkend="ugr.tug.cpe.running_cpe_configurator_from_eclipse"/>), or using
the <literal>cpeGui</literal> shell script (<literal>cpeGui.bat</literal> on
Windows, <literal>cpeGui.sh</literal> on Unix), which is located in the
<literal>bin</literal> directory of the UIMA SDK installation. Executing this
script will display the window shown here:
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image004.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of CPE GUI</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>The window is divided into three sections, one each for the Collection Reader, | |
Analysis Engines, and CAS Consumers.<footnote><para>There is also a fourth pane, | |
for the CAS Initializer, but it is hidden by default. To enable it click the | |
<literal>View → CAS Initializer Panel</literal> menu item.</para></footnote> | |
In each section, you select the component(s) you want to include in the CPE by | |
browsing to their XML descriptors. The configuration parameters present in the XML | |
descriptors will then be displayed in the GUI; these can be modified to override | |
the values present in the descriptor. For example, the screen shot below shows the | |
CPE Configurator after the following components have been chosen: | |
<programlisting>Collection Reader: | |
%UIMA_HOME%/examples/descriptors/collection_reader/ | |
FileSystemCollectionReader.xml | |
Analysis Engine: | |
%UIMA_HOME%/examples/descriptors/analysis_engine/ | |
NamesAndPersonTitles_TAE.xml | |
CAS Consumer: | |
%UIMA_HOME%/examples/descriptors/cas_consumer/ | |
XmiWriterCasConsumer.xml</programlisting></para> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of CPE GUI after fields filled in</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>For the File System Collection Reader, ensure that the Input Directory is set to | |
<literal>%UIMA_HOME%\examples\data</literal><footnote><para>Replace | |
<literal>%UIMA_HOME%</literal> with the path to where you installed UIMA.</para> | |
</footnote>. The other parameters may be left blank. For the XMI Writer CAS
Consumer, ensure that the Output Directory is set to
<literal>%UIMA_HOME%\examples\data\processed</literal>.</para> | |
<para>After selecting each of the components and providing configuration settings, | |
click the play (forward arrow) button at the bottom of the screen to begin processing. | |
A progress bar should be displayed in the lower left corner. (Note that the progress | |
bar will not begin to move until all components have completed their initialization, | |
which may take several seconds.) Once processing has begun, the pause and stop | |
buttons become enabled.</para> | |
<para>If an error occurs, you will be informed by an error dialog. If processing | |
completes successfully, you will be presented with a performance report.</para> | |
<para>Using the File menu, you can select <literal>Save CPE Descriptor</literal> to
create an XML descriptor file that defines the CPE you have constructed. Later, you
can use <literal>Open CPE Descriptor</literal> to restore the CPE Configurator to | |
the saved state. Also, CPE descriptors can be used to run a CPE from a Java program | |
– see section <xref | |
linkend="ugr.tug.cpe.running_cpe_from_application"/>. CPE Descriptors | |
allow specifying operational parameters, such as error handling options, that are | |
not currently available for configuration through the CPE Configurator. For more | |
information on manually creating a CPE Descriptor, see the | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para> | |
<para>The CPE configured above runs a simple name and title annotator on the sample data | |
provided with the UIMA SDK and stores the results using the XMI Writer CAS Consumer. To | |
view the results, start the External CAS Annotation Viewer by running the | |
<literal>annotationViewer</literal> script
(<literal>annotationViewer.bat</literal> on Windows,
<literal>annotationViewer.sh</literal> on Unix), which is located in the
<literal>bin</literal> directory of the UIMA SDK installation. Executing this
script will display the window shown here:
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.5in" format="JPG" fileref="&imgroot;image008.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Annotation Viewer results</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>Ensure that the Input Directory is the same as the Output Directory specified for | |
the XMI Writer CAS Consumer in the CPE configured above (e.g., | |
<literal>%UIMA_HOME%\examples\data\processed</literal>) and that the TAE | |
Descriptor File is set to the Analysis Engine used in the CPE configured above (e.g., | |
<literal>examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml</literal> | |
).</para> | |
<para>Click the View button to display the Analyzed Documents window: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="3.5in" format="JPG" fileref="&imgroot;image010.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of CPE Configurator Analyzed Documents</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>Double click on any document in the list to view the analyzed document. Double | |
clicking the first document, IBM_LifeSciences.txt, will bring up the following | |
window: | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="JPG" fileref="&imgroot;image012.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot of Document and Annotation Viewer</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
</para> | |
<para>This window shows the analysis results for the document. Clicking on any | |
highlighted annotation causes the details for that annotation to be displayed in the | |
right-hand pane. Here the annotation spanning <quote>John M. Thompson</quote> has | |
been clicked.</para> | |
<para>Congratulations! You have successfully configured a CPE, saved its | |
descriptor, run the CPE, and viewed the analysis results.</para> | |
</section> | |
<section id="ugr.tug.cpe.running_cpe_configurator_from_eclipse"> | |
<title>Running the CPE Configurator from Eclipse</title> | |
<para>If you have followed the instructions in <olink targetdoc="&uima_docs_overview;"/> | |
<olink targetdoc="&uima_docs_overview;" | |
targetptr="ugr.ovv.eclipse_setup"/> and imported the example Eclipse | |
project, then you should already have a Run configuration for the CPE Configurator | |
tool (called <literal>UIMA CPE GUI</literal>) configured to run in the example | |
project. Simply run that configuration to start the CPE Configurator.</para> | |
<para>If you haven't followed the Eclipse setup instructions and wish to run the | |
CPE Configurator tool from Eclipse, you will need to do the following. As installed, | |
this Eclipse launch configuration is associated with the | |
<quote>uimaj-examples</quote> project. If you've not already done so, you | |
may wish to import that project into your Eclipse workspace. It's located in | |
%UIMA_HOME%/docs/examples. Doing this will supply the Eclipse launcher with all | |
the class files it needs to run the CPE configurator. If you don't do this, please | |
manually add the JAR files for UIMA to the launch configuration.</para> | |
<para>Also, you need to add any projects or JAR files for any UIMA components you will be | |
running to the launch class path.</para> <note><para>A simpler alternative may be | |
to change the CPE launch configuration to be based on your project. If you do that, it will | |
pick up all the files in your project's class path, which you should set up to | |
include all the UIMA framework files. An easy way to do this is to add the
uimaj-examples project to the build path in your project's properties,
because the uimaj-examples project is set up to include all the UIMA
framework classes in its classpath already. </para></note>
<para>Next, in the Eclipse menu select <literal>Run → | |
Run</literal>..., which brings up the Run configuration screen.</para> | |
<para>In the Main tab, set the main class to | |
<literal>org.apache.uima.tools.cpm.CpmFrame</literal></para> | |
<para>In the arguments tab, add the following to the VM arguments: | |
<programlisting>-Xms128M -Xmx256M | |
-Duima.home="C:\Program Files\Apache\uima"</programlisting> | |
(or wherever you installed the UIMA SDK)</para> | |
<para>Click the Run button to launch the CPE Configurator, and use it as previously | |
described in this section.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.cpe.running_cpe_from_application"> | |
<title>Running a CPE from Your Own Java Application</title> | |
<para>The simplest way to run a CPE from a Java application is to first create a CPE | |
descriptor as described in the previous section. Then the CPE can be instantiated and | |
run using the following code: | |
<programlisting>// parse CPE descriptor in file specified on command line
CpeDescription cpeDesc = UIMAFramework.getXMLParser().
        parseCpeDescription(new XMLInputSource(args[0]));

// instantiate CPE
mCPE = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

// create and register a Status Callback Listener
mCPE.addStatusCallbackListener(new StatusCallbackListenerImpl());

// start processing
mCPE.process();</programlisting></para>
<para>This will start the CPE running in a separate thread.</para> | |
<note><para>The <literal>process()</literal> method for a CPE can only be called once. If you | |
need to call it again, you have to instantiate a new CPE, and call that new CPE's process | |
method.</para></note> | |
<section id="ugr.tug.cpe.using_listeners"> | |
<title>Using Listeners</title> | |
<para>Updates of the CPM's progress, including any errors that occur, are sent to
the callback handler that is registered by the call to
<literal>addStatusCallbackListener</literal>, above. The callback handler is a
class that implements the CPM's
<literal>StatusCallbackListener</literal> interface. The example handler,
<literal>StatusCallbackListenerImpl</literal>, responds to events by
printing messages to the console. Its source code is fairly straightforward and is
not included in this chapter; see
<literal>org.apache.uima.examples.cpe.SimpleRunCPE.java</literal> in the
<literal>%UIMA_HOME%\examples\src</literal> directory for the complete
code.</para>
<para>If you need more control over the information in the CPE descriptor, you can | |
manually configure it via its API. See the Javadocs for package | |
<literal>org.apache.uima.collection</literal> for more details.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.cpe.developing_collection_processing_components"> | |
<title>Developing Collection Processing Components</title> | |
<para>This section is an introduction to the process of developing Collection Readers, | |
CAS Initializers, and CAS Consumers. The code snippets refer to the classes that can be | |
found in the <literal>%UIMA_HOME%\examples\src</literal> directory of the example project.</para>
<para>In the following sections, classes you write to represent components need to be | |
public and have public, 0-argument constructors, so that they can be instantiated by | |
the framework. (Although Java classes in which you do not define any constructor will, | |
by default, have a 0-argument constructor that doesn't do anything, a class in | |
which you have defined at least one constructor does not get a default 0-argument | |
constructor.)</para> | |
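<para>The constructor rule above can be checked with plain Java reflection. The sketch
below is only an illustration: the class names are hypothetical, and reflective
instantiation is used here as a stand-in for whatever mechanism the framework uses
internally.</para>
<programlisting><![CDATA[import java.lang.reflect.Constructor;

public class ZeroArgConstructorDemo {

    // No constructor declared: the compiler supplies a public 0-argument one.
    public static class ImplicitConstructorComponent { }

    // Only a constructor with arguments: no default constructor is generated.
    public static class ArgsOnlyComponent {
        private final String name;
        public ArgsOnlyComponent(String name) { this.name = name; }
    }

    // A constructor with arguments plus an explicit public 0-argument one.
    public static class BothConstructorsComponent {
        private final String name;
        public BothConstructorsComponent() { this("default"); }
        public BothConstructorsComponent(String name) { this.name = name; }
    }

    /** Returns true if the class can be created reflectively with no arguments. */
    public static boolean canInstantiate(Class<?> type) {
        try {
            // getConstructor() only finds a public 0-argument constructor
            Constructor<?> ctor = type.getConstructor();
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(ImplicitConstructorComponent.class)); // true
        System.out.println(canInstantiate(ArgsOnlyComponent.class));            // false
        System.out.println(canInstantiate(BothConstructorsComponent.class));    // true
    }
}]]></programlisting>
<para>If your component class needs a constructor that takes arguments, also declare a
public 0-argument constructor explicitly, as in the third class.</para>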
<section id="ugr.tug.cpe.collection_reader.developing"> | |
<title>Developing Collection Readers</title> | |
<para>A Collection Reader is responsible for obtaining documents from the collection | |
and returning each document as a CAS. Like all UIMA components, a Collection Reader | |
consists of two parts — the code and an XML descriptor.</para> | |
<para>A simple example of a Collection Reader is the <quote>File System Collection | |
Reader,</quote> which simply reads documents from files in a specified directory. | |
The Java code is in the class | |
<literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal> | |
and the XML descriptor is | |
<literal>%UIMA_HOME%/examples/src/main/descriptors/collection_reader/ | |
FileSystemCollectionReader.xml</literal>.</para> | |
<section id="ugr.tug.cpe.collection_reader.java_class"> | |
<title>Java Class for the Collection Reader</title> | |
<para>The Java class for a Collection Reader must implement the | |
<literal>org.apache.uima.collection.CollectionReader</literal> | |
interface. You may build your Collection Reader from scratch and implement this | |
interface, or you may extend the convenience base class | |
<literal>org.apache.uima.collection.CollectionReader_ImplBase</literal>.</para>
<para>The convenience base class provides default implementations for many of the | |
methods defined in the <literal>CollectionReader</literal> interface, and | |
provides abstract definitions for those methods that you are required to | |
implement in your new Collection Reader. Note that if you extend this base class, | |
you do not need to declare that your new Collection Reader implements the | |
<literal>CollectionReader</literal> interface.</para> <tip><para>Eclipse | |
tip – if you are using Eclipse, you can quickly create the boilerplate code and
stubs for all of the required methods by clicking <literal>File</literal>
→ <literal>New</literal> → <literal>Class</literal> to bring up the <quote>New Java Class</quote>
dialog, specifying
<literal>org.apache.uima.collection.CollectionReader_ImplBase</literal> | |
as the Superclass, and checking <quote>Inherited abstract methods</quote> in the | |
section <quote>Which method stubs would you like to create?</quote>, as in the | |
screenshot below:</para></tip> | |
<screenshot> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="4.4in" format="JPG" fileref="&imgroot;image014.jpg"/> | |
</imageobject> | |
<textobject><phrase>Screenshot showing Eclipse new class wizard</phrase></textobject> | |
</mediaobject> | |
</screenshot> | |
<para>For the rest of this section we will assume that your new Collection Reader | |
extends the <literal>CollectionReader_ImplBase</literal> class, and we will | |
show examples from the | |
<literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal>.
If you must inherit from a different superclass, you must ensure that your
Collection Reader implements the <literal>CollectionReader</literal> | |
interface – see the Javadocs for <literal>CollectionReader</literal> | |
for more details.</para> | |
</section> | |
<section id="ugr.tug.cpe.collection_reader.required_methods"> | |
<title>Required Methods in the Collection Reader class</title> | |
<para>The following abstract methods must be implemented:</para> | |
<section id="ugr.tug.cpe.collection_reader.required_methods.initialize"> | |
<title>initialize()</title> | |
<para>The <literal>initialize()</literal> method is called by the framework | |
when the Collection Reader is first created. | |
<literal>CollectionReader_ImplBase</literal> actually provides a default | |
implementation of this method (i.e., it is not abstract), so you are not strictly | |
required to implement this method. However, a typical Collection Reader will | |
implement this method to obtain parameter values and perform various | |
initialization steps.</para> | |
<para>In this method, the Collection Reader class can access the values of its | |
configuration parameters and perform other initialization logic. The example | |
File System Collection Reader reads its configuration parameters and then | |
builds a list of files in the specified input directory, as follows:</para> | |
<programlisting>public void initialize() throws ResourceInitializationException {
  File directory = new File(
          (String)getConfigParameterValue(PARAM_INPUTDIR));
  mEncoding = (String)getConfigParameterValue(PARAM_ENCODING);
  mDocumentTextXmlTagName = (String)getConfigParameterValue(PARAM_XMLTAG);
  mLanguage = (String)getConfigParameterValue(PARAM_LANGUAGE);
  mCurrentIndex = 0;

  // get list of files (not subdirectories) in the specified directory
  mFiles = new ArrayList();
  File[] files = directory.listFiles();
  for (int i = 0; i &lt; files.length; i++) {
    if (!files[i].isDirectory()) {
      mFiles.add(files[i]);
    }
  }
}</programlisting>
<note><para>This is the zero-argument version of the initialize method. There is | |
also a method on the Collection Reader interface called | |
<literal>initialize(ResourceSpecifier, Map)</literal> but it is not | |
recommended that you override this method in your code. That method performs | |
internal initialization steps and then calls the zero-argument | |
<literal>initialize()</literal>. </para></note> | |
</section> | |
<section id="ugr.tug.cpe.collection_reader.hasnext"> | |
<title>hasNext()</title> | |
<para>The <literal>hasNext()</literal> method returns whether or not there are | |
any documents remaining to be read from the collection. The File System | |
Collection Reader's <literal>hasNext()</literal> method is very | |
simple. It just checks if there are any more files left to be read: | |
<programlisting>public boolean hasNext() {
  return mCurrentIndex &lt; mFiles.size();
}</programlisting>
</para> | |
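<para>The index-based pattern behind <literal>hasNext()</literal> can be tried out in
isolation. The following is a minimal sketch, not UIMA code: <literal>InMemorySource</literal>
and <literal>drain</literal> are hypothetical names, and a list of strings stands in for
the list of files. It assumes the caller checks <literal>hasNext()</literal> before each
request for the next item.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexedSourceDemo {

    // Hypothetical stand-in for a reader; not a UIMA type.
    public static class InMemorySource {
        private final List<String> mDocs;
        private int mCurrentIndex = 0;

        public InMemorySource(List<String> docs) {
            mDocs = new ArrayList<>(docs);
        }

        // True while at least one item has not been handed out yet.
        public boolean hasNext() {
            return mCurrentIndex < mDocs.size();
        }

        // Returns the current item and advances the index by one.
        public String getNext() {
            return mDocs.get(mCurrentIndex++);
        }
    }

    // Drives the source: hasNext() is checked before every getNext().
    public static List<String> drain(InMemorySource source) {
        List<String> seen = new ArrayList<>();
        while (source.hasNext()) {
            seen.add(source.getNext());
        }
        return seen;
    }

    public static void main(String[] args) {
        InMemorySource source = new InMemorySource(Arrays.asList("a.txt", "b.txt"));
        System.out.println(drain(source));    // [a.txt, b.txt]
        System.out.println(source.hasNext()); // false
    }
}]]></programlisting>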
</section> | |
<section id="ugr.tug.cpe.collection_reader.required_methods.getnext"> | |
<title>getNext(CAS)</title> | |
<para>The <literal>getNext()</literal> method reads the next document from the | |
collection and populates a CAS. In the simple case, this amounts to reading the | |
file and calling the CAS's <literal>setDocumentText</literal> method. | |
The example File System Collection Reader is slightly more complex. It first | |
checks for a CAS Initializer. If the CPE includes a CAS Initializer, the CAS | |
Initializer is used to read the document and initialize the CAS. If the CPE does not include a CAS
Initializer, the File System Collection Reader reads the document and sets the | |
document text in the CAS.</para> | |
<para>The File System Collection Reader also stores additional metadata about | |
the document in the CAS. In particular, it sets the document's language in | |
the special built-in feature structure | |
<literal>uima.tcas.DocumentAnnotation</literal> (see
<olink targetdoc="&uima_docs_ref;"/>
<olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.cas.document_annotation"/> for details about this
built-in type) and creates an instance of
<literal>org.apache.uima.examples.SourceDocumentInformation</literal>,
which stores information about the document's source location. This
information may be useful to downstream components such as CAS Consumers. Note
that the type system descriptor for this type can be found in
<literal>org.apache.uima.examples.SourceDocumentInformation.xml</literal>,
which is located in the <literal>examples/src</literal> directory.</para>
<para>The <literal>getNext()</literal> method for the File System Collection Reader looks like
this:</para>
<programlisting>public void getNext(CAS aCAS) throws IOException, CollectionException {
  JCas jcas;
  try {
    jcas = aCAS.getJCas();
  } catch (CASException e) {
    throw new CollectionException(e);
  }

  // open input stream to file
  File file = (File) mFiles.get(mCurrentIndex++);
  BufferedInputStream fis =
          new BufferedInputStream(new FileInputStream(file));
  try {
    byte[] contents = new byte[(int) file.length()];
    fis.read(contents);
    String text;
    if (mEncoding != null) {
      text = new String(contents, mEncoding);
    } else {
      text = new String(contents);
    }
    // put document in CAS
    jcas.setDocumentText(text);
  } finally {
    if (fis != null)
      fis.close();
  }

  // set language if it was explicitly specified
  // as a configuration parameter
  if (mLanguage != null) {
    ((DocumentAnnotation) jcas.getDocumentAnnotationFs()).
          setLanguage(mLanguage);
  }

  // Also store location of source document in CAS.
  // This information is critical if CAS Consumers will
  // need to know where the original document contents
  // are located.
  // For example, the Semantic Search CAS Indexer
  // writes this information into the search index that
  // it creates, which allows applications that use the
  // search index to locate the documents that satisfy
  // their semantic queries.
  SourceDocumentInformation srcDocInfo =
          new SourceDocumentInformation(jcas);
  srcDocInfo.setUri(
          file.getAbsoluteFile().toURL().toString());
  srcDocInfo.setOffsetInSource(0);
  srcDocInfo.setDocumentSize((int) file.length());
  srcDocInfo.setLastSegment(
          mCurrentIndex == mFiles.size());
  srcDocInfo.addToIndexes();
}</programlisting>
<para>The Collection Reader can create additional annotations in the CAS at this | |
point, in the same way that annotators create annotations.</para> | |
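<para>The example listing fills its buffer with a single call to
<literal>read(byte[])</literal>. The <literal>InputStream</literal> contract allows that
call to return before the buffer is full, so a reader written for production use may
prefer a call that reads to the end of the file. The sketch below shows one option,
assuming Java 7 or later; <literal>ReadWholeFileDemo</literal> and
<literal>readText</literal> are hypothetical names and are not part of the UIMA
examples.</para>
<programlisting><![CDATA[import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWholeFileDemo {

    /**
     * Reads the entire file and decodes it. A null encoding falls back to the
     * platform default, like the mEncoding check in the example reader.
     */
    public static String readText(Path file, String encoding) throws IOException {
        byte[] contents = Files.readAllBytes(file); // reads until end of file
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : Charset.defaultCharset();
        return new String(contents, charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, read it back, then clean up.
        Path tmp = Files.createTempFile("doc", ".txt");
        try {
            Files.write(tmp, "sample text".getBytes(StandardCharsets.UTF_8));
            System.out.println(readText(tmp, "UTF-8"));
        } finally {
            Files.delete(tmp);
        }
    }
}]]></programlisting>
<para>The returned string could then be passed to <literal>setDocumentText</literal>
exactly as in the example.</para>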
</section> | |
<section id="ugr.tug.cpe.collection_reader.required_methods.getprogress"> | |
<title>getProgress()</title> | |
<para>The Collection Reader is responsible for returning progress information; | |
that is, how much of the collection has been read thus far and how much remains to be | |
read. The framework defines progress very generally; the Collection Reader | |
simply returns an array of <literal>Progress</literal> objects, where each | |
object contains three fields — the amount already completed, the total | |
amount (if known), and a unit (e.g. entities (documents), bytes, or files). The | |
method returns an array so that the Collection Reader can report progress in | |
multiple different units, if that information is available. The File System | |
Collection Reader's <literal>getProgress()</literal> method looks | |
like this: | |
<programlisting>public Progress[] getProgress() {
  return new Progress[]{
     new ProgressImpl(mCurrentIndex,mFiles.size(),Progress.ENTITIES)};
}</programlisting></para>
<para>In this particular example, the total number of files in the collection is | |
known, but the total size of the collection is not known. As such, a | |
<literal>ProgressImpl</literal> object for | |
<literal>Progress.ENTITIES</literal> is returned, but a | |
<literal>ProgressImpl</literal> object for | |
<literal>Progress.BYTES</literal> is not.</para> | |
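<para>A reader that also wanted to report progress in bytes would need the total
size of the collection and a running count of the bytes already returned. The
following plain-Java sketch shows only that bookkeeping, under the assumption that
the files are held in a list and indexed as in the example reader. The class
<literal>ByteProgressSketch</literal> is hypothetical; the two values it computes
are what one might supply for a second entry using
<literal>Progress.BYTES</literal>.</para>
<programlisting><![CDATA[import java.io.File;
import java.util.List;

// Hypothetical helper, not part of the UIMA SDK.
final class ByteProgressSketch {

  // Total size of the collection; could be computed once at initialization.
  static long totalBytes(List<File> files) {
    long total = 0;
    for (File f : files) {
      total += f.length();
    }
    return total;
  }

  // Size of the files already returned, i.e. indexes 0 .. currentIndex - 1.
  static long bytesCompleted(List<File> files, int currentIndex) {
    return totalBytes(files.subList(0, currentIndex));
  }
}]]></programlisting>
<para>Computing the total requires touching every file up front, which is one reason
a reader might choose to report only entities, as the example does.</para>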
</section> | |
<section id="ugr.tug.cpe.collection_reader.required_methods.close"> | |
<title>close()</title> | |
<para>The <literal>close()</literal> method is called when the Collection Reader is no longer needed.
The Collection Reader should then release any resources it may be holding. The | |
FileSystemCollectionReader does not hold resources and so has an empty | |
implementation of this method:</para> | |
<programlisting>public void close() throws IOException { }</programlisting> | |
</section> | |
<section id="ugr.tug.cpe.collection_reader.optional_methods"> | |
<title>Optional Methods</title> | |
<para>The following methods may be implemented:</para> | |
<section id="ugr.tug.cpe.collection_reader.optional_methods.reconfigure"> | |
<title>reconfigure()</title> | |
<para>This method is called if the Collection Reader's configuration | |
parameters change.</para> | |
</section> | |
<section id="ugr.tug.cpe.collection_reader.optional_methods.typesysteminit"> | |
<title>typeSystemInit()</title> | |
<para>If you are only setting the document text in the CAS, or if you are using the
JCas (recommended, as in the current example), you do not have to implement this
method. If you are using the CAS API directly, this method is used in the same way
as it is for an annotator; see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae.contract_for_annotator_methods"/>
for more information.</para>
</section> | |
</section> | |
<section id="ugr.tug.cpe.collection_reader.threading"> | |
<title>Threading considerations</title> | |
<para>Collection Readers do not have to be thread safe: each instance is run by a
single thread, and each instance of the Collection Processing Manager (CPM) creates
only one Collection Reader instance.</para>
</section> | |
<section id="ugr.tug.cpe.collection_reader.descriptor"> | |
<title>XML Descriptor for a Collection Reader</title> | |
<para>You can use the Component Description Editor to create and / or edit the File | |
System Collection Reader's descriptor. Here is its descriptor | |
(abbreviated somewhat), which is very similar to an Analysis | |
Engine descriptor:</para> | |
<programlisting><?db-font-size 80% ?><![CDATA[<collectionReaderDescription | |
xmlns="http://uima.apache.org/resourceSpecifier"> | |
<frameworkImplementation>org.apache.uima.java</frameworkImplementation> | |
<implementationName> | |
org.apache.uima.examples.cpe.FileSystemCollectionReader | |
</implementationName> | |
<processingResourceMetaData> | |
<name>File System Collection Reader</name> | |
<description>Reads files from the filesystem.</description> | |
<version>1.0</version> | |
<vendor>The Apache Software Foundation</vendor> | |
<configurationParameters> | |
<configurationParameter> | |
<name>InputDirectory</name> | |
<description>Directory containing input files</description> | |
<type>String</type> | |
<multiValued>false</multiValued> | |
<mandatory>true</mandatory> | |
</configurationParameter> | |
<configurationParameter> | |
<name>Encoding</name> | |
<description>Character encoding for the documents.</description> | |
<type>String</type> | |
<multiValued>false</multiValued> | |
<mandatory>false</mandatory> | |
</configurationParameter> | |
<configurationParameter> | |
<name>Language</name> | |
<description>ISO language code for the documents</description> | |
<type>String</type> | |
<multiValued>false</multiValued> | |
<mandatory>false</mandatory> | |
</configurationParameter> | |
</configurationParameters> | |
<configurationParameterSettings> | |
<nameValuePair> | |
<name>InputDirectory</name> | |
<value> | |
<string>C:/Program Files/apache/uima/examples/data</string> | |
</value> | |
</nameValuePair> | |
</configurationParameterSettings> | |
<!-- Type System of CASes returned by this Collection Reader --> | |
<typeSystemDescription> | |
<imports> | |
<import name="org.apache.uima.examples.SourceDocumentInformation"/> | |
</imports> | |
</typeSystemDescription> | |
<capabilities> | |
<capability> | |
<inputs/> | |
<outputs> | |
<type allAnnotatorFeatures="true"> | |
org.apache.uima.examples.SourceDocumentInformation | |
</type> | |
</outputs> | |
</capability> | |
</capabilities> | |
<operationalProperties> | |
<modifiesCas>true</modifiesCas> | |
<multipleDeploymentAllowed>false</multipleDeploymentAllowed> | |
<outputsNewCASes>true</outputsNewCASes> | |
</operationalProperties> | |
</processingResourceMetaData> | |
</collectionReaderDescription>]]></programlisting> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tug.cpe.cas_initializer.developing"><title>Developing CAS
Initializers</title> <note><para>CAS Initializers are deprecated as of
version 2.1. For complex initialization, create additional Subjects of Analysis
instead (see <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>).</para></note>
<para>In UIMA 1.x, the CAS Initializer component was intended to be used as a plug-in | |
to the Collection Reader for when the task of populating the CAS from a raw document is | |
complex and might be reusable with other data collections.</para> | |
<para>A CAS Initializer Java class must implement the interface | |
<literal>org.apache.uima.collection.CasInitializer</literal>, and will also | |
generally extend from the convenience base class | |
<literal>org.apache.uima.collection.CasInitializer_ImplBase</literal>. A | |
CAS Initializer also must have an XML descriptor, which has the exact same form as a | |
Collection Reader Descriptor except that the outer tag is | |
<literal><casInitializerDescription></literal>.</para> | |
<para>CAS Initializers have optional <literal>initialize()</literal>, | |
<literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal> | |
methods, which perform the same functions as they do for Collection Readers. The only | |
required method for a CAS Initializer is <literal>initializeCas(Object, | |
CAS)</literal>. This method takes the raw document (for example, an | |
<literal>InputStream</literal> object from which the document can be read) and a | |
CAS, and populates the CAS from the document.</para> | |
</section> | |
<section id="ugr.tug.cpe.cas_consumer.developing"><title>Developing CAS | |
Consumers</title> | |
<note><para>In version 2, there is no difference in capability | |
between CAS Consumers and ordinary Analysis Engines, except for the default setting of | |
the XML parameters for <literal>multipleDeploymentAllowed</literal> and | |
<literal>modifiesCas</literal>. We recommend for future work that users implement | |
and use Analysis Engine components instead of CAS Consumers.</para> | |
<para>The rest of this section is written using the version 1 style of CAS Consumer; | |
the methods described are also available for Analysis Engines. Note that the | |
CAS Consumer <literal>processCas</literal> method is equivalent to the Analysis Engine
<literal>process</literal> method.</para></note> | |
<para>A CAS Consumer receives each CAS after it has been analyzed by the Analysis | |
Engine. CAS Consumers typically do not update the CAS; they typically extract data | |
from the CAS and persist selected information to aggregate data structures such as | |
search engine indexes or databases.</para> | |
<para>A CAS Consumer Java class must implement the interface | |
<literal>org.apache.uima.collection.CasConsumer</literal>, and will also | |
generally extend from the convenience base class | |
<literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>. A CAS | |
Consumer also must have an XML descriptor, which has the exact same form as a | |
Collection Reader Descriptor except that the outer tag is | |
<literal><casConsumerDescription></literal>.</para> | |
<para>CAS Consumers have optional <literal>initialize()</literal>, | |
<literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal> | |
methods, which perform the same functions as they do for Collection Readers and CAS | |
Initializers. The only required method for a CAS Consumer is | |
<literal>processCas(CAS)</literal>, which is where the CAS Consumer does the bulk | |
of its work (i.e., consume the CAS).</para> | |
<para>The <literal>CasConsumer</literal> interface (as well as the version 2 | |
Analysis Engine interface) additionally defines batch | |
and collection level processing methods. The CAS Consumer or Analysis Engine | |
can implement the | |
<literal>batchProcessComplete()</literal> method to perform processing that | |
should occur at the end of each batch of CASes. Similarly, the CAS Consumer | |
or Analysis Engine can | |
implement the <literal>collectionProcessComplete()</literal> method to | |
perform any collection level processing at the end of the collection.</para> | |
<para>A very simple example of a CAS Consumer, which writes an XML representation of the | |
CAS to a file, is the XMI Writer CAS Consumer. The Java code is in the class | |
<literal>org.apache.uima.examples.cpe.XmiWriterCasConsumer</literal> and | |
the descriptor is in | |
<literal>%UIMA_HOME%/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml</literal> | |
.</para> | |
<section id="ugr.tug.cpe.cas_consumer.required_methods"> | |
<title>Required Methods for a CAS Consumer</title> | |
<para>When extending the convenience class
<literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>, you will
normally implement the following methods (only <literal>processCas()</literal>
must be implemented):</para>
<section id="ugr.tug.cpe.cas_consumer.required_methods.initialize"> | |
<title>initialize()</title> | |
<para>The <literal>initialize()</literal> method is called by the framework | |
when the CAS Consumer is first created. | |
<literal>CasConsumer_ImplBase</literal> actually provides a default | |
implementation of this method (i.e., it is not abstract), so you are not strictly | |
required to implement this method. However, a typical CAS Consumer will | |
implement this method to obtain parameter values and perform various | |
initialization steps.</para> | |
<para>In this method, the CAS Consumer can access the values of its configuration | |
parameters and perform other initialization logic. The example XMI Writer CAS | |
Consumer reads its configuration parameters and sets up the output directory: | |
<programlisting><?db-font-size 80% ?>public void initialize() throws ResourceInitializationException { | |
mDocNum = 0; | |
mOutputDir = new File((String) getConfigParameterValue(PARAM_OUTPUTDIR)); | |
if (!mOutputDir.exists()) { | |
mOutputDir.mkdirs(); | |
} | |
}</programlisting></para> | |
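<para>The listing above ignores the return value of <literal>mkdirs()</literal>, so
a failure to create the output directory would surface only later, when the first
file is written. One possible variation, sketched here in plain Java, fails early
instead. The class <literal>OutputDirSketch</literal> is a hypothetical helper; in
a real CAS Consumer the <literal>IOException</literal> would presumably be wrapped
in a <literal>ResourceInitializationException</literal>.</para>
<programlisting><![CDATA[import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not part of the UIMA SDK.
final class OutputDirSketch {

  // Returns the directory, creating it (and any missing parents) if needed.
  // Does nothing if the directory already exists; throws if it cannot be
  // created or if the path exists but is not a directory.
  static File prepare(String dir) throws IOException {
    File out = new File(dir);
    Files.createDirectories(out.toPath());
    return out;
  }
}]]></programlisting>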
</section> | |
<section id="ugr.tug.cpe.cas_consumer.required_methods.processcas"> | |
<title>processCas()</title> | |
<para>The <literal>processCas()</literal> method is where the CAS Consumer | |
does most of its work. In our example, the XMI Writer CAS Consumer obtains an | |
iterator over the document metadata in the CAS (in the | |
SourceDocumentInformation feature structure, which is created by the File | |
System Collection Reader) and extracts the URI for the current document. From | |
this the output filename is constructed in the output directory and a subroutine | |
(<literal>writeXmi</literal>) is called to generate the output file. The | |
<literal>writeXmi</literal> subroutine uses the | |
<literal>XmiCasSerializer</literal> class provided with the UIMA SDK to | |
serialize the CAS to the output file (see the example source code for | |
details).</para> | |
<programlisting>public void processCas(CAS aCAS) throws ResourceProcessException { | |
String modelFileName = null; | |
JCas jcas; | |
try { | |
jcas = aCAS.getJCas(); | |
} catch (CASException e) { | |
throw new ResourceProcessException(e); | |
} | |
// retrieve the filename of the input file from the CAS
FSIterator it = jcas | |
.getAnnotationIndex(SourceDocumentInformation.type) | |
.iterator(); | |
File outFile = null; | |
if (it.hasNext()) { | |
SourceDocumentInformation fileLoc = | |
(SourceDocumentInformation) it.next(); | |
File inFile; | |
try { | |
inFile = new File(new URL(fileLoc.getUri()).getPath()); | |
String outFileName = inFile.getName(); | |
if (fileLoc.getOffsetInSource() > 0) { | |
outFileName += ("_" + fileLoc.getOffsetInSource()); | |
} | |
outFileName += ".xmi"; | |
outFile = new File(mOutputDir, outFileName); | |
modelFileName = mOutputDir.getAbsolutePath() + | |
"/" + inFile.getName() + ".ecore"; | |
} catch (MalformedURLException e1) { | |
// invalid URL, use default processing below | |
} | |
} | |
if (outFile == null) { | |
outFile = new File(mOutputDir, "doc" + mDocNum++); | |
} | |
// serialize the CAS as XMI and write to output file
try { | |
writeXmi(jcas.getCas(), outFile, modelFileName); | |
} catch (IOException e) { | |
throw new ResourceProcessException(e); | |
} catch (SAXException e) { | |
throw new ResourceProcessException(e); | |
} | |
}</programlisting> | |
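<para>The file-naming rule in the listing above can be isolated as a small pure
function, which makes it easier to see and to test. The sketch below is plain Java;
<literal>XmiNameSketch</literal> and <literal>outputName</literal> are hypothetical
names. It goes through <literal>URI.toURL()</literal> rather than the
<literal>URL</literal> constructor used above, which recent JDKs deprecate.</para>
<programlisting><![CDATA[import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the UIMA SDK.
final class XmiNameSketch {

  // Naming rule: <input file name>[_<offset>].xmi
  // Returns null when the URI cannot be turned into a URL, in which case
  // the caller falls back to a generated name such as "doc" + counter.
  static String outputName(String uri, int offsetInSource) {
    try {
      String path = new URI(uri).toURL().getPath();
      String name = new File(path).getName();
      if (offsetInSource > 0) {
        name += "_" + offsetInSource;
      }
      return name + ".xmi";
    } catch (URISyntaxException | MalformedURLException
        | IllegalArgumentException e) {
      return null;
    }
  }
}]]></programlisting>
<para>The offset suffix keeps the output names distinct when one source file is
split into several CASes.</para>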
</section> | |
<section id="ugr.tug.cpe.cas_consumer.optional_methods"> | |
<title>Optional Methods</title> | |
<para>The following methods are optional in a CAS Consumer, though they are often | |
used.</para> | |
<section id="ugr.tug.cpe.cas_consumer.optional_methods.batchprocesscomplete"> | |
<title>batchProcessComplete()</title> | |
<para>The framework calls the <literal>batchProcessComplete()</literal> method at the end of each
batch of CASes. This gives the CAS Consumer or Analysis Engine | |
an opportunity to perform any batch | |
level processing. Our simple XMI Writer CAS Consumer does not perform any | |
batch level processing, so this method is empty. Batch size is set in the | |
Collection Processing Engine descriptor.</para> | |
</section> | |
<section id="ugr.tug.cpe.cas_consumer.optional_methods.collectionprocesscomplete"> | |
<title>collectionProcessComplete()</title> | |
<para>The framework calls the <literal>collectionProcessComplete()</literal> method at the end
of the collection (i.e., when all objects in the collection have been | |
processed). At this point in time, no CAS is passed in as a parameter. This gives | |
the CAS Consumer or Analysis Engine an opportunity to perform collection processing over the | |
entire set of objects in the collection. Our simple XMI Writer CAS Consumer | |
does not perform any collection level processing, so this method is | |
empty.</para> | |
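<para>To illustrate how the per-CAS, batch level, and collection level callbacks
can divide the work, here is a plain-Java stand-in. It only mimics the three
callbacks and does not extend any UIMA class; the class name, the
<literal>String</literal> argument, and the two lists are assumptions made for the
sketch, and the real methods take framework-supplied arguments that are omitted
here.</para>
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for illustration; not a UIMA CAS Consumer.
final class BatchingConsumerSketch {
  private final List<String> pending = new ArrayList<>();
  private final List<String> committed = new ArrayList<>();
  private boolean finished;

  // Per-CAS work: cheaply buffer one result.
  void processCas(String documentUri) {
    pending.add(documentUri);
  }

  // End of batch: commit everything buffered since the previous batch.
  void batchProcessComplete() {
    committed.addAll(pending);
    pending.clear();
  }

  // End of collection: commit any partial batch, then finalize.
  void collectionProcessComplete() {
    batchProcessComplete();
    finished = true;
  }

  int committedCount() {
    return committed.size();
  }

  boolean isFinished() {
    return finished;
  }
}]]></programlisting>
<para>A consumer that writes to a database or search index might use the batch
callback to commit a transaction and the collection callback to close or optimize
the index.</para>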
</section> | |
</section> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tug.cpe.deploying_a_cpe"> | |
<title>Deploying a CPE</title> | |
<para>The CPM provides a number of service and deployment options that cover | |
instantiation and execution of CPEs, error recovery, and local and distributed | |
deployment of the CPE components. The behavior of the CPM (and correspondingly, the | |
CPE) is controlled by various options and parameters set in the CPE descriptor. The | |
current version of the CPE Configurator tool, however, supports only default error | |
handling and deployment options. To change these options, you must manually edit the | |
CPE descriptor.</para> | |
<para>Eventually the CPE Configurator tool will support configuring these options and a | |
detailed tutorial for these settings will be provided. In the meantime, we provide only | |
a high-level, conceptual overview of these advanced features in the rest of this | |
chapter, and refer the advanced user to <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.cpe_descriptor"/> for details on setting these options in the CPE | |
Descriptor.</para> | |
<para> <xref linkend="ugr.tug.cpe.fig.cpe_instantiation"/> shows a logical view of | |
how an application uses the UIMA framework to instantiate a CPE from a CPE descriptor. | |
The CPE descriptor identifies the CPE components (referencing their corresponding | |
descriptors) and specifies the various options for configuring the CPM and deploying | |
the CPE components.</para> | |
<figure id="ugr.tug.cpe.fig.cpe_instantiation"> | |
<title>CPE Instantiation</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.7in" format="PNG" | |
fileref="&imgroot;image018.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of deployment of a CPE</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para id="ugr.tug.cpe.deployment_alternatives">There are three deployment modes | |
for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:</para> | |
<orderedlist><listitem><para><emphasis role="bold">Integrated</emphasis> (runs | |
in the same Java instance as the CPM)</para></listitem> | |
<listitem><para><emphasis role="bold">Managed</emphasis> (runs in a separate | |
process on the same machine), and</para></listitem> | |
<listitem><para><emphasis role="bold">Non-managed</emphasis> (runs in a | |
separate process, perhaps on a different machine). </para></listitem> | |
</orderedlist> | |
<para>An integrated CAS Processor runs in the same JVM as the CPE. A managed CAS Processor | |
runs in a separate process from the CPE, but still on the same computer. The CPE controls | |
startup, shutdown, and recovery of a managed CAS Processor. A non-managed CAS | |
Processor runs as a service and may be on the same computer as the CPE or on a remote | |
computer. A non-managed CAS Processor <emphasis role="bold-italic"> | |
service</emphasis> is started and managed independently from the CPE.</para> | |
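<para>The three modes can be summarized as a small lookup table. The sketch below
is plain Java and purely illustrative; the enum and its field names are
hypothetical. The <literal>deployment</literal> attribute values are the ones that
appear in the CPE descriptor excerpts later in this chapter.</para>
<programlisting><![CDATA[// Illustrative summary only; not part of the UIMA SDK.
enum DeploymentModeSketch {
  INTEGRATED("integrated", false, false),
  MANAGED("local", true, true),
  NON_MANAGED("remote", true, false);

  // Value of the deployment attribute on <casProcessor>.
  final String attributeValue;
  // Whether the CAS Processor runs outside the CPE's JVM.
  final boolean separateProcess;
  // Whether the CPE starts, restarts, and stops the CAS Processor.
  final boolean lifecycleControlledByCpe;

  DeploymentModeSketch(String attributeValue, boolean separateProcess,
      boolean lifecycleControlledByCpe) {
    this.attributeValue = attributeValue;
    this.separateProcess = separateProcess;
    this.lifecycleControlledByCpe = lifecycleControlledByCpe;
  }
}]]></programlisting>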
<para>For both managed and non-managed CAS Processors, the CAS must be transmitted | |
between separate processes and possibly between separate computers. This is | |
accomplished using <emphasis>Vinci</emphasis>, a communication protocol used by | |
the CPM and which is provided as a part of Apache UIMA. Vinci handles service naming and | |
location and data transport (see <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.how_to_deploy_a_vinci_service"/> for more | |
information). Service naming and location are provided by a <emphasis>Vinci Naming | |
Service</emphasis>, or <emphasis>VNS</emphasis>. For managed CAS Processors, the | |
CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be | |
running.</para> | |
<para>The CPE Configurator tool currently only supports constructing CPEs that deploy | |
CAS Processors in integrated mode. To deploy CAS Processors in any other mode, the CPE | |
descriptor must be edited by hand (better tooling may be provided later). Details on the | |
CPE descriptor and the required settings for various CAS Processor deployment modes | |
can be found in <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> | |
. In the following sections we merely summarize the various CAS Processor deployment | |
options.</para> | |
<section id="ugr.tug.cpe.managed_deployment"> | |
<title>Deploying Managed CAS Processors</title> | |
<para>Managed CAS Processor deployment is shown in <xref | |
linkend="ugr.tug.cpe.fig.managed_deployment"/>. A managed CAS Processor is | |
deployed by the CPE as a Vinci service. The CPE manages the lifecycle of the CAS | |
Processor including service launch, restart on failures, and service shutdown. A | |
managed CAS Processor runs on the same machine as the CPE, but in a separate process. | |
This provides the necessary fault isolation for the CPE to protect it from non-robust | |
CAS Processors. A fatal failure of a managed CAS Processor does not threaten the | |
stability of the CPE.</para> | |
<figure id="ugr.tug.cpe.fig.managed_deployment"> | |
<title>CPE with Managed CAS Processors</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="3.6in" format="PNG" | |
fileref="&imgroot;image020.png"/> | |
</imageobject> | |
<textobject><phrase>Managed deployment showing separate JVMs and CASes | |
flowing between them</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para>The CPE communicates with managed CAS Processors using the Vinci communication | |
protocol. A CAS Processor is launched as a Vinci service and its | |
<literal>process()</literal> method is invoked remotely via a Vinci command. The | |
CPE uses its own internal VNS to support managed CAS processors. The VNS, by default, | |
listens on port 9005. If this port is not available, the VNS will increment its listen | |
port until it finds one that is available. All managed CAS Processors are internally | |
configured to <quote>talk</quote> to the CPE managed VNS. This internal VNS is | |
transparent to the end user launching the CPE.</para> | |
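<para>How the internal VNS probes for a port is not shown in this guide. The
following is a generic plain-Java sketch of the increment-until-free strategy
described above, not the VNS implementation; the class and method names are
hypothetical.</para>
<programlisting><![CDATA[import java.io.IOException;
import java.net.ServerSocket;

// Generic illustration; not the VNS implementation.
final class PortProbeSketch {

  // Returns a socket bound to the first free port in
  // [startPort, startPort + maxTries), never going past port 65535.
  static ServerSocket bindFirstFree(int startPort, int maxTries)
      throws IOException {
    IOException last = null;
    for (int port = startPort;
        port < startPort + maxTries && port <= 65535; port++) {
      try {
        return new ServerSocket(port);
      } catch (IOException e) {
        last = e; // port in use or not permitted; try the next one
      }
    }
    throw new IOException("no free port found from " + startPort, last);
  }
}]]></programlisting>
<para>The caller learns which port was chosen from
<literal>getLocalPort()</literal> on the returned socket, for example after
calling <literal>bindFirstFree(9005, 100)</literal>.</para>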
<para>To deploy a managed CAS Processor, the CPE deployer must change the CPE | |
descriptor. The following is a section from the CPE descriptor that shows an example | |
configuration specifying a managed CAS Processor.</para> | |
<programlisting><casProcessor <emphasis role="bold-italic">deployment="local"</emphasis> name="Meeting Detector TAE"> | |
<descriptor> | |
<include href="deploy/vinci/Deploy_MeetingDetectorTAE.xml"/> | |
</descriptor> | |
<runInSeparateProcess> | |
<exec dir="." executable="java"> | |
<env key="CLASSPATH" | |
value="src; | |
C:/Program Files/apache/uima/lib/uima-core.jar; | |
C:/Program Files/apache/uima/lib/uima-cpe.jar; | |
C:/Program Files/apache/uima/lib/uima-examples.jar; | |
C:/Program Files/apache/uima/lib/uima-adapter-vinci.jar; | |
C:/Program Files/apache/uima/lib/jVinci.jar"/> | |
<arg>-DLOG=C:/Temp/service.log</arg> | |
<arg>org.apache.uima.reference_impl.collection.
service.vinci.VinciAnalysisEngineService_impl</arg>
<arg>${descriptor}</arg> | |
</exec> | |
</runInSeparateProcess> | |
<deploymentParameters/> | |
<filter/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" value="1/100"/> | |
<maxConsecutiveRestarts action="terminate" value="3"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="10000"/> | |
</casProcessor></programlisting> | |
<para>See <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for | |
details and required settings.</para> | |
</section> | |
<section id="ugr.tug.cpe.deploying_nonmanaged_cas_processors"> | |
<title>Deploying Non-managed CAS Processors</title> | |
<para>Non-managed CAS Processor deployment is shown in <xref | |
linkend="ugr.tug.cpe.fig.nonmanaged_cpe"/>. In non-managed mode, the CPE | |
supports connectivity to CAS Processors running on local or remote computers using | |
Vinci. Non-managed processors are different from managed processors in two | |
aspects: | |
<orderedlist><listitem><para>Non-managed processors are neither started nor | |
stopped by the CPE.</para></listitem> | |
<listitem><para>Non-managed processors use an independent VNS, also neither | |
started nor stopped by the CPE. </para></listitem></orderedlist></para> | |
<figure id="ugr.tug.cpe.fig.nonmanaged_cpe"> | |
<title>CPE with non-managed CAS Processors</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="4.8in" format="PNG" | |
fileref="&imgroot;image023.png"/> | |
</imageobject> | |
<textobject><phrase>Non-managed CPE deployment</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para>While non-managed CAS Processors provide the same level of fault isolation and | |
robustness as managed CAS Processors, error recovery support for non-managed CAS | |
Processors is much more limited. In particular, the CPE cannot restart a non-managed | |
CAS Processor after an error.</para> | |
<para>Non-managed CAS Processors also require a separate Vinci Naming Service | |
running on the network. This VNS must be manually started and monitored by the end user | |
or application. Instructions for running a VNS can be found in <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.vns.starting"/>.</para> | |
<para>To deploy a non-managed CAS Processor, the CPE deployer must change the CPE | |
descriptor. The following is a section from the CPE descriptor that shows an example | |
configuration for the non-managed CAS Processor.</para> | |
<programlisting><casProcessor <emphasis role="bold-italic">deployment="remote"</emphasis> name="Meeting Detector TAE"> | |
<descriptor> | |
<include href= | |
"descriptors/vinciService/MeetingDetectorVinciService.xml"/> | |
</descriptor> | |
<deploymentParameters/> | |
<filter/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" value="1/100"/> | |
<maxConsecutiveRestarts action="terminate" value="3"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="10000"/> | |
</casProcessor></programlisting> | |
<para>See <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for | |
details and required settings.</para> | |
</section> | |
<section id="ugr.tug.cpe.integrated_deployment"> | |
<title>Deploying Integrated CAS Processors</title> | |
<para>Integrated CAS Processors are shown in <xref | |
linkend="ugr.tug.cpe.fig.integrated_deployment"/>. Here the CAS Processors | |
run in the same JVM as the CPE, just like the Collection Reader and CAS Initializer. | |
This deployment method results in minimal CAS communication and transport overhead | |
as the CAS is shared in the same process space of the JVM. However, a CPE running with all | |
integrated CAS Processors is limited in scalability by the capability of the single | |
computer on which the CPE is running. There is also a stability risk associated with | |
integrated processors because a poorly written CAS Processor can cause the JVM, and | |
hence the entire CPE, to abort.</para> | |
<figure id="ugr.tug.cpe.fig.integrated_deployment"> | |
<title>CPE with integrated CAS Processor</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="3.2in" format="PNG" | |
fileref="&imgroot;image026.png"/> | |
</imageobject> | |
<textobject><phrase>CPE with integrated CAS Processor</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para>The following is a section from a CPE descriptor that shows an example | |
configuration for the integrated CAS Processor.</para> | |
<programlisting><casProcessor <emphasis role="bold-italic">deployment="integrated"</emphasis> name="Meeting Detector TAE">
<descriptor> | |
<include href="descriptors/tutorial/ex4/MeetingDetectorTAE.xml"/> | |
</descriptor> | |
<deploymentParameters/> | |
<filter/> | |
<errorHandling> | |
<errorRateThreshold action="terminate" value="100/1000"/> | |
<maxConsecutiveRestarts action="terminate" value="30"/> | |
<timeout max="100000"/> | |
</errorHandling> | |
<checkpoint batch="10000"/> | |
</casProcessor></programlisting> | |
<para>See <olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for | |
details and required settings.</para> | |
</section> | |
</section> | |
<section id="ugr.tug.cpe.collection_processing_examples"> | |
<title>Collection Processing Examples</title> | |
<para>The UIMA SDK includes a set of examples illustrating the three modes of deployment:
integrated, managed, and non-managed. These are in the
<literal>/examples/descriptors/collection_processing_engine</literal> | |
directory. There are three CPE descriptors that run an example annotator (the Meeting | |
Finder) in these modes.</para> | |
<para>To run either the integrated or managed examples, use the | |
<literal>runCPE</literal> script in the /bin directory of the UIMA installation, | |
passing the appropriate CPE descriptor as an argument, or | |
if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your | |
workspace, you can use the Eclipse Menu → Run → Run... → and then pick the | |
launch configuration <quote>UIMA Run CPE</quote>.</para> | |
<note><para>The <literal>runCPE</literal> script <emphasis role="bold-italic">must</emphasis>
be run from the <literal>%UIMA_HOME%\examples</literal> directory, because the example | |
CPE descriptors use relative path names that are resolved relative to this working directory. | |
For instance, | |
<literallayout>runCPE | |
descriptors\collection_processing_engine\MeetingFinderCPE_Integrated.xml</literallayout></para> | |
</note> | |
<!-- | |
<para>If you installed the examples into Eclipse, you can run directly from Eclipse by | |
creating a run configuration. To do this, highlight the SimpleRunCPE.java source file | |
in the examples src/org/apache/uima/examples/cpe directory, and then</para> | |
<orderedlist><listitem><para>pick the menu Run → Run...</para></listitem> | |
<listitem><para>click <quote>Java Application</quote> and press | |
<quote>New</quote></para></listitem> | |
<listitem><para>click on the Arguments panel, and insert a path to the appropriate CPE | |
descriptor in the <quote>Program Arguments</quote> box by typing, for instance: | |
<literal>descriptors/collection_processing_engine/ | |
MeetingFinderCPE_Integrated.xml</literal> | |
</para></listitem> | |
<listitem><para>Then press <quote>Run</quote> </para></listitem> | |
</orderedlist> | |
--> | |
<para>To run the non-managed example, there are some additional steps. | |
<orderedlist><listitem><para>Start a VNS service by running the | |
<literal>startVNS</literal> script in the <literal>/bin</literal> | |
directory, or using the Eclipse launcher <quote>UIMA Start VNS</quote>.</para></listitem> | |
<listitem><para>Deploy the Meeting Detector Analysis Engine as a Vinci service by
running the <literal>startVinciService</literal> script in the
<literal>/bin</literal> directory and passing it the location of the
descriptor to deploy, in this case
<literal>%UIMA_HOME%/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml</literal>.
Alternatively, if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your
workspace, you can use the Eclipse Menu &rarr; Run &rarr; Run... and then pick the
launch configuration <quote>UIMA Start Vinci Service</quote>.
</para></listitem>
<listitem><para>Now, run the <literal>runCPE</literal> script (or if in Eclipse, run the
launch configuration <quote>UIMA Run CPE</quote>), passing it the CPE descriptor for the non-managed
version
(<literal>%UIMA_HOME%/examples/descriptors/collection_processing_engine/MeetingFinderCPE_NonManaged.xml</literal>).
</para></listitem></orderedlist></para>
<para>This assumes that the Vinci Naming Service, the runCPE application, and the | |
<literal>MeetingDetectorTAE</literal> service are all running on the same machine. | |
Most of the scripts that need information about VNS will look for values to use in | |
environment variables VNS_HOST and VNS_PORT; these default to | |
<quote>localhost</quote> and <quote>9000</quote>. You may set these to appropriate | |
values before running the scripts, as needed; you can also pass the name of the VNS host as | |
the second argument to the startVinciService script.</para> | |
<para>Alternatively, you can edit the scripts and/or the XML files to specify | |
alternatives for the VNS_HOST and VNS_PORT. For instance, if the | |
<literal>runCPE</literal> application is running on a different machine from the | |
Vinci Naming Service, you can edit the | |
<literal>MeetingFinderCPE_NonManaged.xml</literal> and change the vnsHost | |
parameter: | |
<literal><parameter name="vnsHost" value="localhost" type="string"/></literal> | |
to specify the VNS host instead of <quote>localhost</quote>.</para> | |
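<para>The lookup-with-defaults behavior described above for
<literal>VNS_HOST</literal> and <literal>VNS_PORT</literal> can be sketched in
plain Java as follows. This is an illustration of the convention only, not code
taken from the UIMA scripts; the class and method names are hypothetical.</para>
<programlisting><![CDATA[import java.util.Map;

// Illustration of the environment-variable convention; not UIMA code.
final class VnsSettingsSketch {

  // VNS host name, defaulting to "localhost" when VNS_HOST is not set.
  static String host(Map<String, String> env) {
    return env.getOrDefault("VNS_HOST", "localhost");
  }

  // VNS port, defaulting to 9000 when VNS_PORT is not set.
  static int port(Map<String, String> env) {
    return Integer.parseInt(env.getOrDefault("VNS_PORT", "9000"));
  }
}]]></programlisting>
<para>Passing <literal>System.getenv()</literal> as the map reads the values from
the current process environment.</para>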
</section> | |
</chapter> | |