<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>UIMA References</title><link rel="stylesheet" type="text/css" href="css/stylesheet-html.css"><meta name="generator" content="DocBook XSL-NS Stylesheets V1.76.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" class="book" title="UIMA References" id="d5e1"><div xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><div><h1 class="title">UIMA References</h1></div><div><div class="authorgroup">
<h3 class="corpauthor">Written and maintained by the Apache UIMA&#8482; Development Community</h3>
</div></div><div><p class="releaseinfo">Version 3.1.1</p></div><div><p class="copyright">Copyright &copy; 2006, 2019 The Apache Software Foundation</p></div><div><p class="copyright">Copyright &copy; 2004, 2006 International Business Machines Corporation</p></div><div><div class="legalnotice" title="Legal Notice"><a name="d5e8"></a>
<p> </p>
<p title="License and Disclaimer">
<b>License and Disclaimer.&nbsp;</b>
The ASF licenses this documentation
to you under the Apache License, Version 2.0 (the
"License"); you may not use this documentation except in compliance
with the License. You may obtain a copy of the License at
</p><div class="blockquote"><blockquote class="blockquote">
<a class="ulink" href="http://www.apache.org/licenses/LICENSE-2.0" target="_top">http://www.apache.org/licenses/LICENSE-2.0</a>
</blockquote></div><p title="License and Disclaimer">
Unless required by applicable law or agreed to in writing,
this documentation and its contents are distributed under the License
on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
</p>
<p> </p>
<p> </p>
<p title="Trademarks">
<b>Trademarks.&nbsp;</b>
All terms mentioned in the text that are known to be trademarks or
service marks have been appropriately capitalized. Use of such terms
in this book should not be regarded as affecting the validity of the
trademark or service mark.
</p>
</div></div><div><p class="pubdate">November, 2019</p></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="chapter"><a href="#ugr.ref.javadocs">1. Javadocs</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.javadocs.libraries">1.1. Using named Eclipse User Libraries</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xml.component_descriptor">2. Component Descriptor Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.notation">2.1. Notation</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.imports">2.2. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system">2.3. Type System Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.imports">2.3.1. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.types">2.3.2. Types</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.features">2.3.3. Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.string_subtypes">2.3.4. String Subtypes</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes">2.4. Analysis Engine Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.primitive">2.4.1. Primitive Analysis Engine Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.aggregate">2.4.2. Aggregate Analysis Engine Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.configuration_parameters">2.4.3. Configuration Parameters</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.flow_controller">2.5. 
Flow Controller Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts">2.6. Collection Processing Component Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader">2.6.1. Collection Reader Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.cas_initializer">2.6.2. CAS Initializer Descriptors (deprecated)</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.cas_consumer">2.6.3. CAS Consumer Descriptors</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.service_client">2.7. Service Client Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers">2.8. Custom Resource Specifiers</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xml.cpe_descriptor">3. CPE Descriptor Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.overview">3.1. CPE Overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.notation">3.2. Notation</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.imports">3.3. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor">3.4. CPE Descriptor Overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.collection_reader">3.5. Collection Reader</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.collection_reader.error_handling">3.5.1. Error handling for Collection Readers</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors">3.6. 
CAS Processors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual">3.6.1. Specifying an Individual CAS Processor</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters">3.7. CPE Operational Parameters</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.resource_manager_configuration">3.8. Resource Manager Configuration</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.example">3.9. Example CPE Descriptor</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.cas">4. CAS Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.javadocs">4.1. Javadocs</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.overview">4.2. CAS Overview</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.type_system">4.2.1. The Type System</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.creating_accessing_manipulating_data">4.2.2. Creating/Accessing/Changing data</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.creating_using_indexes">4.2.3. Creating and using indexes</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.builtin_types">4.3. Built-in CAS Types</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.accessing_the_type_system">4.4. Accessing the type system</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.type_system.printer_example">4.4.1. TypeSystemPrinter example</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.cas_apis_create_modify_feature_structures">4.4.2. Using CAS APIs: Feature Structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.creating_feature_structures">4.5. 
Creating feature structures</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.updating_indexed_feature_structures">4.5.1. Updating indexed feature structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.accessing_modifying_features_of_feature_structures">4.6. Accessing or modifying Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.indexes_and_iterators">4.7. Indexes and Iterators</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.index.built_in_indexes">4.7.1. Built-in Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.adding_to_indexes">4.7.2. Adding Feature Structures to the Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.iterators">4.7.3. Iterators over UIMA Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.annotation_index">4.7.4. Special iterators for Annotation types</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.constraints_and_filtered_iterators">4.7.5. Constraints and Filtered iterators</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.guide_to_javadocs">4.8. CAS API's Javadocs</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.javadocs.cas_package">4.8.1. APIs in the CAS package</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.typemerging">4.9. Type Merging</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.limitedmultipleaccess">4.10. Limited multi-thread access to read-only CASs</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.jcas">5. JCas Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.name_spaces">5.1. Name Spaces</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.use_of_description">5.2. Use of XML Description</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.mapping_built_ins">5.3. 
Mapping built-in CAS types to Java types</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.augmenting_generated_code">5.4. Augmenting the generated Java Code</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.keeping_augmentations_when_regenerating">5.4.1. Keeping hand-coded augmentations when regenerating</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.additional_constructors">5.4.2. Additional Constructors</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.modifying_generated_items">5.4.3. Modifying generated items</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.merging_types_from_other_specs">5.5. Merging Types</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.merging_types.aggregates_and_cpes">5.5.1. Aggregate AEs and CPEs as sources of types</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.merging_types.jcasgen_support">5.5.2. JCasGen support for type merging</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.impact_of_type_merging_on_composability">5.5.3. Type Merging impacts on Composability</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.documentannotation_issues">5.5.4. Adding Features to DocumentAnnotation</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.using_within_an_annotator">5.6. Using JCas within an Annotator</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.new_instances">5.6.1. Creating new instances</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.getters_and_setters">5.6.2. Getters and Setters</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.obtaining_refs_to_indexes">5.6.3. Obtaining references to Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.adding_removing_instances_to_indexes">5.6.4. Updating Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.using_iterators">5.6.5. 
Using Iterators</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.class_loaders">5.6.6. Class Loaders in UIMA</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.accessing_jcas_objects_outside_uima_components">5.6.7. Issues accessing JCas objects outside of UIMA Engine Components</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.setting_up_classpath">5.7. Setting up Classpath for JCas</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.pear_support">5.8. PEAR isolation</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.pear">6. PEAR Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.packaging_a_component">6.1. Packaging a UIMA component</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.creating_pear_structure">6.1.1. Creating the PEAR structure</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.populating_pear_structure">6.1.2. Populating the PEAR structure</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.creating_installation_descriptor">6.1.3. Creating the installation descriptor</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.installation_descriptor">6.1.4. Installation Descriptor: template</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.packaging_into_1_file">6.1.5. Packaging the PEAR structure into one file</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.pear.installing">6.2. Installing a PEAR package</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.installing_pear_using_API">6.2.1. Installing a PEAR file using the PEAR APIs</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.pear.specifier">6.3. PEAR package descriptor</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xmi">7. XMI CAS Serialization Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xmi.xmi_tag">7.1. 
XMI Tag</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.feature_structures">7.2. Feature Structures</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.primitive_features">7.3. Primitive Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.reference_features">7.4. Reference Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features">7.5. Array and List Features</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features.as_multi_valued_properties">7.5.1. Arrays and Lists as Multi-Valued Properties</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features.as_1st_class_objects">7.5.2. Arrays and Lists as First-Class Objects</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.null_array_list_elements">7.5.3. Null Array/List Elements</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xmi.sofas_views">7.6. Subjects of Analysis (Sofas) and Views</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.linking_to_ecore_type_system">7.7. Linking XMI docs to Ecore Type System</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.delta">7.8. Delta CAS XMI Format</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.compress">8. Compressed Binary CASes</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.compress.overview">8.1. Binary CAS Compression overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.usage">8.2. Using Compressed Binary CASes</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.simple-deltas">8.3. Simple Delta CAS serialization</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.use-cases">8.4. Use Case cookbook</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.json">9. JSON support</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.json.overview">9.1. 
JSON serialization support overview</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas">9.2. JSON CAS Serialization</a></span></dt><dd><dl><dt><span class="section"><a href="#ug.ref.json.cas.bigpic">9.2.1. The Big Picture</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.context">9.2.2. The _context section</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.featurestructures">9.2.3. Serializing Feature Structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ug.ref.json.cas.featurestructures.organization">9.3. Organizing the Feature Structures</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.features">9.4. Additional JSON CAS Serialization features</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.json.delta">9.4.1. Delta CAS</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.json.usage">9.5. Using JSON CAS serialization</a></span></dt><dt><span class="section"><a href="#ugr.ref.json.descriptionserialization">9.6. JSON serialization for UIMA descriptors</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.config">10. Setup and Configuration</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.config.properties">10.1. UIMA JVM Configuration Properties</a></span></dt><dt><span class="section"><a href="#ugr.ref.config.protect-index">10.2. Configuring index protection</a></span></dt><dt><span class="section"><a href="#ugr.ref.config.property-table">10.3. Properties Table</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.resources">11. UIMA Resources</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.resources.overview">11.1. What is a UIMA Resource?</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.resources.resource-inner-implementations">11.1.1. 
Resource Inner Implementations</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.resources.sharing-across-pipelines">11.2. Sharing Resources</a></span></dt><dt><span class="section"><a href="#ugr.ref.resources.external-resource-multiple-parameterized-instances">11.3. External Resources support for multiple Parameterized Instances</a></span></dt></dl></dd></dl></div>
<div class="chapter" title="Chapter&nbsp;1.&nbsp;Javadocs" id="ugr.ref.javadocs"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;1.&nbsp;Javadocs</h2></div></div></div>
<p>The details of all the public APIs for UIMA are contained in the API Javadocs. These are located in the docs/api
directory; the top level to open in your browser is called <a class="ulink" href="api/index.html" target="_top">api/index.html</a>.</p>
<p>Eclipse supports the ability to attach the Javadocs to your project. The Javadoc should already be attached
to the <code class="literal">uimaj-examples</code> project, if you followed the setup instructions in <a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview &amp; SDK Setup</a> <a href="overview_and_setup.html#ugr.ovv.eclipse_setup.example_code" class="olink">Section&nbsp;3.2, &#8220;Setting up Eclipse to view Example Code&#8221;</a>. To attach
Javadocs to your own Eclipse project, use the following instructions.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>As an alternative, you can add the UIMA source to the UIMA binary distribution. If you
do this, the Javadocs become available automatically (so you can skip the following
setup), and you can also step through the UIMA framework code while debugging.
To add the source, follow the instructions as described in the setup chapter:
<a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview &amp; SDK Setup</a>
<a href="overview_and_setup.html#ugr.ovv.eclipse_setup.adding_source" class="olink">Section&nbsp;3.3, &#8220;Adding the UIMA source code to the jar files&#8221;</a>.</p></div>
<p>To add the Javadocs, open a project that refers to the UIMA APIs in its class path, and open the project properties. Then pick
Java Build Path. Pick the "Libraries" tab and select one of the UIMA library entries (if, for
instance, uima-core.jar is not in this list, it is unlikely that your code will compile). Each library entry has a small "&gt;"
sign on its left; click it to expand the view and show the Javadoc location. Highlight that entry and press Edit to
add a reference to the Javadocs in the following dialog:
</p><div class="screenshot">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.javadocs/image002.jpg" width="574" alt="Screenshot of attaching Javadoc to source in Eclipse"></td></tr></table></div>
</div>
<p>Once you do this, Eclipse can show you Javadocs for UIMA APIs as you work. To see the Javadoc for a UIMA API, you
can hover over the API class or method, or select it and press shift-F2, or use the menu Navigate <span class="symbol">&#8594;</span>
Open External Javadoc, or open the Javadoc view (Window <span class="symbol">&#8594;</span> Show View <span class="symbol">&#8594;</span> Other
<span class="symbol">&#8594;</span> Java <span class="symbol">&#8594;</span> Javadoc).</p>
<p>In a similar manner, you can attach the source for the UIMA framework, if you download the source
distribution. The source corresponding to particular
releases is available from the Apache UIMA web site (<a class="ulink" href="http://uima.apache.org" target="_top">http://uima.apache.org</a>) on the
downloads page.</p>
<div class="section" title="1.1.&nbsp;Using named Eclipse User Libraries"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.javadocs.libraries">1.1.&nbsp;Using named Eclipse User Libraries</h2></div></div></div>
<p>You can also create a named "user library" in Eclipse containing the UIMA Jars, and attach the Javadocs (or
optionally, the sources); this named library is saved in the Eclipse workspace. Once created, it can be
added to the classpath of newly created Eclipse projects.</p>
<p>Use the menu option Project <span class="symbol">&#8594;</span> Properties
<span class="symbol">&#8594;</span> Java Build Path, and then pick the Libraries tab, and click the Add Library button. Then select
User Libraries, click "Next", and pick the library you created for the UIMA Jars.</p>
<p>To create this library in the workspace,
use the same menu picks as above, but after you select User Libraries and click "Next", click the "New Library..."
button to define your new library. Use the "Add Jars" button and multi-select all the Jars in the lib directory
of the UIMA binary distribution. Then add the Javadoc attachment for each Jar. The path to use is
<code class="literal">file:/[path to your UIMA installation]/docs/api</code>. After you do this for the first Jar, you can
copy this string to the clipboard and paste it into the Javadoc location of each remaining Jar.</p>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;2.&nbsp;Component Descriptor Reference" id="ugr.ref.xml.component_descriptor"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;2.&nbsp;Component Descriptor Reference</h2></div></div></div>
<p>This chapter is the reference guide for the UIMA SDK's Component Descriptor XML
schema. A <span class="emphasis"><em>Component Descriptor</em></span> (also sometimes called a
<span class="emphasis"><em>Resource Specifier</em></span> in the code) is an XML file that either (a)
completely describes a component, including all information needed to construct the
component and interact with it, or (b) specifies how to connect to and interact with an
existing component that has been published as a remote service.
<span class="emphasis"><em>Component</em></span> (also called <span class="emphasis"><em>Resource</em></span>) is a
general term for modules produced by UIMA developers and used by UIMA applications. The
types of Components are: Analysis Engines, Collection Readers, CAS
Initializers<sup>[<a name="d5e71" href="#ftn.d5e71" class="footnote">1</a>]</sup>, CAS Consumers, and Collection Processing Engines.
However, Collection Processing Engine Descriptors are significantly different in
format and are covered in a separate chapter, <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter&nbsp;3, <i>Collection Processing Engine Descriptor Reference</i></a>.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.notation" title="2.1.&nbsp;Notation">Section&nbsp;2.1, &#8220;Notation&#8221;</a> describes the notation used in this
chapter.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a> describes the UIMA SDK's
<span class="emphasis"><em>import</em></span> syntax, which lets XML descriptors import
information from other XML files so that information can be shared among several XML
descriptors.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.aes" title="2.4.&nbsp;Analysis Engine Descriptors">Section&nbsp;2.4, &#8220;Analysis Engine Descriptors&#8221;</a> describes the XML format for <span class="emphasis"><em>Analysis Engine
Descriptors</em></span>. These are descriptors that completely describe Analysis
Engines, including all information needed to construct and interact with them.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.collection_processing_parts" title="2.6.&nbsp;Collection Processing Component Descriptors">Section&nbsp;2.6, &#8220;Collection Processing Component Descriptors&#8221;</a> describes the XML format for
<span class="emphasis"><em>Collection Processing Component Descriptors</em></span>. This includes
Collection Reader, CAS Initializer, and CAS Consumer Descriptors.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.service_client" title="2.7.&nbsp;Service Client Descriptors">Section&nbsp;2.7, &#8220;Service Client Descriptors&#8221;</a> describes the XML format for
<span class="emphasis"><em>Service Client Descriptors</em></span>, which specify how to connect to and
interact with resources deployed as remote services.</p>
<p><a class="xref" href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers" title="2.8.&nbsp;Custom Resource Specifiers">Section&nbsp;2.8, &#8220;Custom Resource Specifiers&#8221;</a> describes the XML format for
<span class="emphasis"><em>Custom Resource Specifiers</em></span>, which allow you to plug in your
own Java class as a UIMA Resource.</p>
<div class="section" title="2.1.&nbsp;Notation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.notation">2.1.&nbsp;Notation</h2></div></div></div>
<p>This chapter uses an informal notation to specify the syntax of Component
Descriptors. The formal syntax is defined by an XML schema definition, which is
contained in the file <code class="literal">resourceSpecifierSchema.xsd</code>,
located in the <code class="literal">uima-core.jar</code> file.</p>
<p>The notation used in this chapter is:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An ellipsis (...) inside an element body indicates
that the substructure of that element has been omitted (to be described in another
section of this chapter). An example of this would be:
</p><pre class="programlisting">&lt;analysisEngineMetaData&gt;
...
&lt;/analysisEngineMetaData&gt;</pre><p>
An ellipsis immediately after an element indicates that the element type may be
repeated arbitrarily many times. For example:
</p><pre class="programlisting">&lt;parameter&gt;[String]&lt;/parameter&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
...</pre><p>
indicates that there may be arbitrarily many parameter elements in this
context.</p></li><li class="listitem"><p>Bracketed expressions (e.g. <code class="literal">[String]</code>)
indicate the type of value that may be used at that location.</p></li><li class="listitem"><p>A vertical bar, as in <code class="literal">true|false</code>, indicates
alternatives. This can be applied to literal values, bracketed type names, and
elements.</p></li><li class="listitem"><p>Which elements are optional and which are required is specified in
prose, not in the syntax definition. </p></li></ul></div>
</div>
<div class="section" title="2.2.&nbsp;Imports"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.imports">2.2.&nbsp;Imports</h2></div></div></div>
<p>The UIMA SDK defines a particular syntax for XML descriptors to import information
from other XML files. When one of the following appears in an XML descriptor:
</p><pre class="programlisting">&lt;import location="[URL]" /&gt; or
&lt;import name="[Name]" /&gt;</pre><p>
it indicates that information from a separate XML file is being imported. Imports
are allowed only in certain places in the descriptor; the remainder of this
chapter indicates where.</p>
<p>If an import specifies a <code class="literal">location</code> attribute, the value of
that attribute specifies the URL at which the XML file to import will be found. This can be
a relative URL, which will be resolved relative to the descriptor containing the
<code class="literal">import</code> element, or an absolute URL. Relative URLs can be written
without a protocol/scheme (e.g., <span class="quote">&#8220;<span class="quote">file:</span>&#8221;</span>) and without a host machine
name. In this case, the relative URL might look something like
<code class="literal">org/apache/myproj/MyTypeSystem.xml</code>.</p>
<p>An absolute URL is written with one of the following prefixes, followed by a path
such as <code class="literal">org/apache/myproj/MyTypeSystem.xml</code>:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>file:/ <span class="symbol">&#8592;</span> has no network
address</p></li><li class="listitem"><p>file:/// <span class="symbol">&#8592;</span> has an empty network address</p></li><li class="listitem"><p>file://some.network.address/</p></li></ul></div>
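<p>As a small sketch of the two forms described above, the following fragment shows one import with a
relative URL and one with an absolute URL. The directory <code class="literal">/opt/myproj/desc</code> is a
hypothetical location chosen only for illustration; the relative path reuses the example above.
</p><pre class="programlisting">&lt;!-- relative URL: resolved against the descriptor that contains this import --&gt;
&lt;import location="org/apache/myproj/MyTypeSystem.xml" /&gt;

&lt;!-- absolute URL with an empty network address (hypothetical install directory) --&gt;
&lt;import location="file:///opt/myproj/desc/org/apache/myproj/MyTypeSystem.xml" /&gt;</pre>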
<p>For more information about URLs, see the Javadoc for the Java
class <code class="literal">java.net.URL</code>.</p>
<p>If an import specifies a <code class="literal">name</code> attribute, the value of that
attribute should take the form of a Java-style dotted name (e.g.
<code class="literal">org.apache.myproj.MyTypeSystem</code>). An .xml file with this name
will be searched for in the classpath or datapath (described below). As in Java, the dots
in the name will be converted to file path separators. So an import specifying the
example name in this paragraph will result in a search for
<code class="literal">org/apache/myproj/MyTypeSystem.xml</code> in the classpath or
datapath.</p>
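<p>For illustration, an import by name of the example above might be written as follows. This sketch assumes that a file
<code class="literal">org/apache/myproj/MyTypeSystem.xml</code> exists under some entry of the classpath or datapath.
</p><pre class="programlisting">&lt;!-- searched for as org/apache/myproj/MyTypeSystem.xml in the classpath or datapath --&gt;
&lt;import name="org.apache.myproj.MyTypeSystem" /&gt;</pre>
<p>Unlike a <code class="literal">location</code> import, this form does not depend on where the importing
descriptor itself is stored.</p>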
<p><a name="ugr.ref.xml.component_descriptor.datapath"></a>The datapath works similarly to the classpath but can be set programmatically
through the resource manager API. Application developers can specify a datapath
during initialization, using the following code:
</p><pre class="programlisting">
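// Create a resource manager and set the datapath that it will search
ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
resMgr.setDataPath(yourPathString);
// desc is the already parsed descriptor of the Analysis Engine to create
AnalysisEngine ae =
    UIMAFramework.produceAnalysisEngine(desc, resMgr, null);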
</pre>
<p>The default datapath for the entire JVM can be set via the
<code class="literal">uima.datapath</code> Java system property, but this feature should
only be used for standalone applications that don't need to run in the same JVM as
other code that may need a different datapath.</p>
<p>The value of a name or location attribute may be parameterized with references to external
override variables using the <code class="literal">${variable-name}</code> syntax.
</p><pre class="programlisting">&lt;import location="Annotator${with}ExternalOverrides.xml" /&gt;</pre><p>
If a variable is undefined, the value is left unmodified and a warning message identifies the missing
variable.</p>
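<p>As a rough sketch of this substitution rule, the following plain Java replaces each
<code class="literal">${variable-name}</code> reference from a map of settings. It is not the framework's
implementation: the class name <code class="literal">OverrideVariableSketch</code>, the
<code class="literal">resolve</code> method, and the <code class="literal">missing</code> list are
assumptions made for illustration, and the sketch takes "left unmodified" to mean that an undefined
reference stays in the value as written.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverrideVariableSketch {
    // Matches ${name}; group 1 captures the variable name.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its value from 'vars'. An undefined variable is
    // left as written, and its name is recorded in 'missing' so the caller can warn.
    // Hypothetical helper for illustration; not part of the UIMA API.
    static String resolve(String value, Map<String, String> vars, List<String> missing) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = vars.get(m.group(1));
            if (replacement == null) {
                missing.add(m.group(1));
                replacement = m.group(0); // keep the reference unmodified
            }
            // quoteReplacement stops '$' and '\' in the value from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> missing = new ArrayList<>();
        String location = resolve("Annotator${with}ExternalOverrides.xml",
                                  Map.of("with", "With"), missing);
        System.out.println(location + " (undefined variables: " + missing + ")");
    }
}
```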
<p>Previous versions of UIMA also supported XInclude. That support didn't work in
many situations, and it is no longer supported. To include other files, please use
&lt;import&gt;.</p>
</div>
<div class="section" title="2.3.&nbsp;Type System Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.type_system">2.3.&nbsp;Type System Descriptors</h2></div></div></div>
<p>A Type System Descriptor is used to define the types and features that can be
represented in the CAS. A Type System Descriptor can be imported into an Analysis Engine
or Collection Processing Component Descriptor.</p>
<p>The basic structure of a Type System Descriptor is as follows:
</p><pre class="programlisting">&lt;typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;name&gt; [String] &lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;imports&gt;
&lt;import ...&gt;
...
&lt;/imports&gt;
&lt;types&gt;
&lt;typeDescription&gt;
...
&lt;/typeDescription&gt;
...
&lt;/types&gt;
&lt;/typeSystemDescription&gt;</pre>
<p>All of the subelements are optional.</p>
<div class="section" title="2.3.1.&nbsp;Imports"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.imports">2.3.1.&nbsp;Imports</h3></div></div></div>
<p>The <code class="literal">imports</code> section allows this descriptor to import
types from other type system descriptors. The import syntax is described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a>. A type system may import any number of other type
systems and then define additional types which refer to imported types. Circular
imports are allowed.</p>
</div>
<div class="section" title="2.3.2.&nbsp;Types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.types">2.3.2.&nbsp;Types</h3></div></div></div>
<p>The <code class="literal">types</code> element contains zero or more
<code class="literal">typeDescription</code> elements. Each
<code class="literal">typeDescription</code> has the form:
</p><pre class="programlisting">&lt;typeDescription&gt;
&lt;name&gt;[TypeName]&lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;supertypeName&gt;[TypeName]&lt;/supertypeName&gt;
&lt;features&gt;
...
&lt;/features&gt;
&lt;/typeDescription&gt;</pre>
<p>The name element contains the name of the type. A
<code class="literal">[TypeName]</code> is a dot-separated list of names, where each name
consists of a letter followed by any number of letters, digits, or underscores.
<code class="literal">TypeNames</code> are case sensitive. Letter and digit are as defined
by Java; therefore, any Unicode letter or digit may be used (subject to the character
encoding defined by the descriptor file's XML header). The name following the
final dot is considered to be the <span class="quote">&#8220;<span class="quote">short name</span>&#8221;</span> of the type; the
preceding portion is the namespace (analogous to the package.class syntax used in
Java). Namespaces beginning with uima are reserved and should not be used. Examples
of valid type names are:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>test.TokenAnnotation</p>
</li><li class="listitem"><p>org.myorg.TokenAnnotation</p></li><li class="listitem"><p>com.my_company.proj123.TokenAnnotation </p></li></ul></div>
<p>These would all be considered distinct types since they have different
namespaces. Best practice here is to follow the normal Java naming conventions of
having namespaces be all lowercase, with the short type names having an initial
capital, but this is not mandated, so <code class="literal">ABC.mYtyPE</code> is an allowed
type name. Type names without namespaces (e.g.
<code class="literal">TokenAnnotation</code> alone) are allowed but discouraged, because
naming conflicts can then result when combining annotators that use different
type systems.</p>
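<p>The naming rules above can be checked with a short piece of plain Java. This is a sketch under
stated assumptions rather than the framework's own validation: the class
<code class="literal">TypeNameCheck</code> and its methods are hypothetical, and the check works on
individual <code class="literal">char</code> values, so it does not handle Unicode letters outside the
Basic Multilingual Plane.</p>

```java
class TypeNameCheck {
    // True if 'name' is a letter followed by any number of letters, digits,
    // or underscores (letter and digit as defined by java.lang.Character).
    static boolean isValidSegment(String name) {
        if (name.isEmpty() || !Character.isLetter(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false;
            }
        }
        return true;
    }

    // True if every dot-separated segment of 'typeName' is a valid name.
    static boolean isValidTypeName(String typeName) {
        // The -1 limit keeps trailing empty segments, so "a.b." is rejected.
        for (String segment : typeName.split("\\.", -1)) {
            if (!isValidSegment(segment)) {
                return false;
            }
        }
        return true;
    }

    // The part after the final dot (the whole name if there is no namespace).
    static String shortName(String typeName) {
        return typeName.substring(typeName.lastIndexOf('.') + 1);
    }

    // The part before the final dot, or "" if there is no namespace.
    static String namespace(String typeName) {
        int dot = typeName.lastIndexOf('.');
        return dot < 0 ? "" : typeName.substring(0, dot);
    }

    public static void main(String[] args) {
        String name = "com.my_company.proj123.TokenAnnotation";
        System.out.println(isValidTypeName(name) + " " + namespace(name) + " " + shortName(name));
    }
}
```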
<p>The <code class="literal">description</code> element contains a textual description
of the type. The <code class="literal">supertypeName</code> element contains the name of the
type from which it inherits (this can be set to the name of another user-defined type,
or it may be set to any built-in type which may be subclassed, such as
<code class="literal">uima.tcas.Annotation</code> for a new annotation
type or <code class="literal">uima.cas.TOP</code> for a new type that is not
an annotation). All three of these elements are required.</p>
</div>
<div class="section" title="2.3.3.&nbsp;Features"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.features">2.3.3.&nbsp;Features</h3></div></div></div>
<p>The <code class="literal">features</code> element of a
<code class="literal">typeDescription</code> is required only if the type we are specifying
introduces new features. If the <code class="literal">features</code> element is present,
it contains zero or more <code class="literal">featureDescription</code> elements, each of
which has the form:</p>
<pre class="programlisting">&lt;featureDescription&gt;
&lt;name&gt;[Name]&lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;rangeTypeName&gt;[Name]&lt;/rangeTypeName&gt;
&lt;elementType&gt;[Name]&lt;/elementType&gt;
&lt;multipleReferencesAllowed&gt;true|false&lt;/multipleReferencesAllowed&gt;
&lt;/featureDescription&gt;</pre>
<p>A feature's name follows the same rules as a type short name &#8211; a letter
followed by any number of letters, digits, or underscores. Feature names are case
sensitive.</p>
<p>The feature's <code class="literal">rangeTypeName</code> specifies the type of
value that the feature can take. This may be the name of any type defined in your type
system, or one of the predefined types. All of the predefined types have names that are
prefixed with <code class="literal">uima.cas</code> or <code class="literal">uima.tcas</code>,
for example:
</p><pre class="programlisting">uima.cas.TOP
uima.cas.String
uima.cas.Long
uima.cas.FSArray
uima.cas.StringList
uima.tcas.Annotation</pre><p>
For a complete list of predefined types, see the CAS API documentation.</p>
<p>The <code class="literal">elementType</code> of a feature is optional, and applies only
when the <code class="literal">rangeTypeName</code> is
<code class="literal">uima.cas.FSArray</code> or <code class="literal">uima.cas.FSList</code>.
The <code class="literal">elementType</code> specifies what type of value can be assigned as
an element of the array or list. This must be the name of a non-primitive type. If
omitted, it defaults to <code class="literal">uima.cas.TOP</code>, meaning that any
FeatureStructure can be assigned as an element of the array or list. Note: depending on
the CAS Interface that you use in your code, this constraint may or may not be
enforced.
Note: At run time, the elementType is available from a runtime Feature object
(using the <code class="literal">a_feature_object.getRange().getComponentType()</code> method)
only when specified for <code class="literal">uima.cas.FSArray</code> ranges; it isn't
available for <code class="literal">uima.cas.FSList</code> ranges.
</p>
<p>The <code class="literal">multipleReferencesAllowed</code> element is optional, and
applies only when the <code class="literal">rangeTypeName</code> is an array or list type (it
applies to arrays and lists of primitive as well as non-primitive types). Setting
this to false (the default) indicates that this feature has exclusive ownership of
the array or list, so changes to the array or list are localized. Setting this to true
indicates that the array or list may be shared, so changes to it may affect other
objects in the CAS. Note: there is currently no guarantee that the framework will
enforce this restriction. However, this setting may affect how the CAS is
serialized.</p>
</div>
<div class="section" title="2.3.4.&nbsp;String Subtypes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.string_subtypes">2.3.4.&nbsp;String Subtypes</h3></div></div></div>
<p>There is one other special type that you can declare &#8211; a subset of the String
type that specifies a restricted set of allowed values. This is useful for features
that can have only certain String values, such as parts of speech. Here is an example of
how to declare such a type:</p>
<pre class="programlisting">&lt;typeDescription&gt;
&lt;name&gt;PartOfSpeech&lt;/name&gt;
&lt;description&gt;A part of speech.&lt;/description&gt;
&lt;supertypeName&gt;uima.cas.String&lt;/supertypeName&gt;
&lt;allowedValues&gt;
&lt;value&gt;
&lt;string&gt;NN&lt;/string&gt;
&lt;description&gt;Noun, singular or mass.&lt;/description&gt;
&lt;/value&gt;
&lt;value&gt;
&lt;string&gt;NNS&lt;/string&gt;
&lt;description&gt;Noun, plural.&lt;/description&gt;
&lt;/value&gt;
&lt;value&gt;
&lt;string&gt;VB&lt;/string&gt;
&lt;description&gt;Verb, base form.&lt;/description&gt;
&lt;/value&gt;
...
&lt;/allowedValues&gt;
&lt;/typeDescription&gt;</pre>
</div>
</div>
<div class="section" title="2.4.&nbsp;Analysis Engine Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.aes">2.4.&nbsp;Analysis Engine Descriptors</h2></div></div></div>
<p>Analysis Engine (AE) descriptors completely describe Analysis Engines. There
are two basic types of Analysis Engines &#8211; <span class="emphasis"><em>Primitive</em></span> and
<span class="emphasis"><em>Aggregate</em></span>. A <span class="emphasis"><em>Primitive</em></span> Analysis
Engine is a container for a single <span class="emphasis"><em>annotator</em></span>, whereas an
<span class="emphasis"><em>Aggregate</em></span> Analysis Engine is composed of a collection of other
Analysis Engines. (For more information on this and other terminology, see <a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview &amp; SDK Setup</a> <a href="overview_and_setup.html#ugr.ovv.conceptual" class="olink">Chapter&nbsp;2, <i>UIMA Conceptual Overview</i></a>).</p>
<p>Both Primitive and Aggregate Analysis Engines have descriptors, and the two types
of descriptors have some similarities and some differences. <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive" title="2.4.1.&nbsp;Primitive Analysis Engine Descriptors">Section&nbsp;2.4.1, &#8220;Primitive Analysis Engine Descriptors&#8221;</a>
discusses Primitive Analysis Engine descriptors. <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate" title="2.4.2.&nbsp;Aggregate Analysis Engine Descriptors">Section&nbsp;2.4.2, &#8220;Aggregate Analysis Engine Descriptors&#8221;</a> then
describes how Aggregate Analysis Engine descriptors are different.</p>
<div class="section" title="2.4.1.&nbsp;Primitive Analysis Engine Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive">2.4.1.&nbsp;Primitive Analysis Engine Descriptors</h3></div></div></div>
<div class="section" title="2.4.1.1.&nbsp;Basic Structure"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.basic">2.4.1.1.&nbsp;Basic Structure</h4></div></div></div>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;analysisEngineDescription
xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
&lt;primitive&gt;true&lt;/primitive&gt;
&lt;annotatorImplementationName&gt; [String] &lt;/annotatorImplementationName&gt;
&lt;analysisEngineMetaData&gt;
...
&lt;/analysisEngineMetaData&gt;
&lt;externalResourceDependencies&gt;
...
&lt;/externalResourceDependencies&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/analysisEngineDescription&gt;</pre>
<p>The document begins with a standard XML header. The recommended root tag is
<code class="literal">&lt;analysisEngineDescription&gt;</code>, although
<code class="literal">&lt;taeDescription&gt;</code> is also allowed for backwards
compatibility.</p>
<p>Within the root element we declare that we are using the XML namespace
<code class="literal">http://uima.apache.org/resourceSpecifier</code>. It is
required that this namespace be used; otherwise, the descriptor will not be able to
be validated for errors.</p>
<p> The first subelement,
<code class="literal">&lt;frameworkImplementation&gt;,</code> currently must have
the value <code class="literal">org.apache.uima.java</code>, or
<code class="literal">org.apache.uima.cpp</code>. In future versions, there may be
other framework implementations, or perhaps implementations produced by other
vendors.</p>
<p>The second subelement, <code class="literal">&lt;primitive&gt;</code>, contains
the Boolean value <code class="literal">true</code>, indicating that this XML document
describes a <span class="emphasis"><em>Primitive</em></span> Analysis Engine.</p>
<p>The next subelement,<code class="literal">
&lt;annotatorImplementationName&gt;</code> is how the UIMA framework
determines which annotator class to use. This should contain a fully-qualified
Java class name for Java implementations, or the name of a .dll or .so file for C++
implementations.</p>
<p>The <code class="literal">&lt;analysisEngineMetaData&gt;</code> object contains
descriptive information about the analysis engine and what it does. It is
described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2.&nbsp;Analysis Engine MetaData">Section&nbsp;2.4.1.2, &#8220;Analysis Engine MetaData&#8221;</a>.</p>
<p>The <code class="literal">&lt;externalResourceDependencies&gt;</code> and
<code class="literal">&lt;resourceManagerConfiguration&gt;</code> elements declare
the external resource files that the analysis engine relies
upon. They are optional and are described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8.&nbsp;External Resource Dependencies">Section&nbsp;2.4.1.8, &#8220;External Resource Dependencies&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>.</p>
</div>
<div class="section" title="2.4.1.2.&nbsp;Analysis Engine MetaData"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.metadata">2.4.1.2.&nbsp;Analysis Engine MetaData</h4></div></div></div>
<pre class="programlisting">&lt;analysisEngineMetaData&gt;
&lt;name&gt; [String] &lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;configurationParameters&gt; ... &lt;/configurationParameters&gt;
&lt;configurationParameterSettings&gt;
...
&lt;/configurationParameterSettings&gt;
&lt;typeSystemDescription&gt; ... &lt;/typeSystemDescription&gt;
&lt;typePriorities&gt; ... &lt;/typePriorities&gt;
&lt;fsIndexCollection&gt; ... &lt;/fsIndexCollection&gt;
&lt;capabilities&gt; ... &lt;/capabilities&gt;
&lt;operationalProperties&gt; ... &lt;/operationalProperties&gt;
&lt;/analysisEngineMetaData&gt;</pre>
<p>The <code class="literal">analysisEngineMetaData</code> element contains four
simple string fields &#8211; <code class="literal">name</code>,
<code class="literal">description</code>, <code class="literal">version</code>, and
<code class="literal">vendor</code>. Only the <code class="literal">name</code> field is
required, but providing values for the other fields is recommended. The
<code class="literal">name</code> field is just a descriptive name meant to be read by
users; it does not need to be unique across all Analysis Engines.</p>
<p>Configuration parameters are described in
<a class="xref" href="#ugr.ref.xml.component_descriptor.aes.configuration_parameters" title="2.4.3.&nbsp;Configuration Parameters">Section&nbsp;2.4.3, &#8220;Configuration Parameters&#8221;</a>.</p>
<p>The other sub-elements &#8211;
<code class="literal">typeSystemDescription</code>,
<code class="literal">typePriorities</code>, <code class="literal">fsIndexCollection</code>,
<code class="literal">capabilities</code> and
<code class="literal">operationalProperties</code> are described in the following
sections. The only one of these that is required is
<code class="literal">capabilities</code>; the others are optional.</p>
</div>
<div class="section" title="2.4.1.3.&nbsp;Type System Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.type_system">2.4.1.3.&nbsp;Type System Definition</h4></div></div></div>
<pre class="programlisting">&lt;typeSystemDescription&gt;
&lt;name&gt; [String] &lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;imports&gt;
&lt;import ...&gt;
...
&lt;/imports&gt;
&lt;types&gt;
&lt;typeDescription&gt;
...
&lt;/typeDescription&gt;
...
&lt;/types&gt;
&lt;/typeSystemDescription&gt;</pre>
<p>A <code class="literal">typeSystemDescription</code> element defines a type
system for an Analysis Engine. The syntax for the element is described in <a class="xref" href="#ugr.ref.xml.component_descriptor.type_system" title="2.3.&nbsp;Type System Descriptors">Section&nbsp;2.3, &#8220;Type System Descriptors&#8221;</a>.</p>
<p>The recommended usage is to <code class="literal">import</code> an external type
system, using the import syntax described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a>
of this chapter. For example:
</p><pre class="programlisting">&lt;typeSystemDescription&gt;
&lt;imports&gt;
&lt;import location="MySharedTypeSystem.xml"&gt;
&lt;/imports&gt;
&lt;/typeSystemDescription&gt;</pre>
<p>This allows several AEs to share a single type system definition. The file
<code class="literal">MySharedTypeSystem.xml</code> would then contain the full
type system information, including the <code class="literal">name</code>,
<code class="literal">description</code>, <code class="literal">vendor</code>,
<code class="literal">version</code>, and <code class="literal">types</code>.</p>
</div>
<div class="section" title="2.4.1.4.&nbsp;Type Priority Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.type_priority">2.4.1.4.&nbsp;Type Priority Definition</h4></div></div></div>
<pre class="programlisting">&lt;typePriorities&gt;
&lt;name&gt; [String] &lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;imports&gt;
&lt;import ...&gt;
...
&lt;/imports&gt;
&lt;priorityLists&gt;
&lt;priorityList&gt;
&lt;type&gt;[TypeName]&lt;/type&gt;
&lt;type&gt;[TypeName]&lt;/type&gt;
...
&lt;/priorityList&gt;
...
&lt;/priorityLists&gt;
&lt;/typePriorities&gt;</pre>
<p>The <code class="literal">&lt;typePriorities&gt;</code> element contains
zero or more <code class="literal">&lt;priorityList&gt;</code> elements; each
<code class="literal">&lt;priorityList&gt;</code> contains zero or more types.
Like a type system, a type priorities definition may also declare a name,
description, version, and vendor, and may import other type priorities. See
<a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a> for the import syntax.</p>
<p>Type priority is used when iterating over feature structures in the CAS.
For example, if the CAS contains a <code class="literal">Sentence</code> annotation
and a <code class="literal">Paragraph</code> annotation with the same span of text
(i.e. a one-sentence paragraph), which annotation should be returned first
by an iterator? Probably the Paragraph, since it is conceptually
<span class="quote">&#8220;<span class="quote">bigger,</span>&#8221;</span> but the framework does not know that and must be
explicitly told that the Paragraph annotation has priority over the Sentence
annotation, like this:
</p><pre class="programlisting">&lt;typePriorities&gt;
&lt;priorityList&gt;
&lt;type&gt;org.myorg.Paragraph&lt;/type&gt;
&lt;type&gt;org.myorg.Sentence&lt;/type&gt;
&lt;/priorityList&gt;
&lt;/typePriorities&gt;</pre>
<p>All of the <code class="literal">&lt;priorityList&gt;</code> elements defined
in the descriptor (and in all component descriptors of an aggregate analysis
engine descriptor) are merged to produce a single priority list.</p>
<p>Subtypes of types specified here are also ordered, unless overridden by
another user-specified type ordering. For example, if you specify type A
comes before type B, then subtypes of A will come before subtypes of B, unless
there is an overriding specification which declares some subtype of B comes
before some subtype of A.</p>
<p>If there are inconsistencies between priority lists (type A declared
before type B in one priority list, and type B declared before type A in
another), the framework will throw an exception.</p>
<p>User-defined indexes may declare whether or not they use the type priority;
see the next section.</p>
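<p>To make the merging and consistency rule concrete, here is a sketch in plain Java that combines
several priority lists into one ordering and fails when two lists contradict each other. It
illustrates the idea with a topological sort; it is not the algorithm the framework actually uses,
and the class name <code class="literal">PriorityListMergeSketch</code>, the
<code class="literal">merge</code> method, and the choice of
<code class="literal">IllegalStateException</code> are assumptions made for this example.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PriorityListMergeSketch {
    // Merges several priority lists into one ordering that respects every list.
    // Throws IllegalStateException if the lists contradict each other.
    // Hypothetical helper for illustration; not part of the UIMA API.
    static List<String> merge(List<List<String>> priorityLists) {
        Map<String, Set<String>> successors = new LinkedHashMap<>();
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (List<String> list : priorityLists) {
            for (String type : list) {
                successors.putIfAbsent(type, new LinkedHashSet<>());
                inDegree.putIfAbsent(type, 0);
            }
            // Each adjacent pair (a, b) means "a has priority over b".
            for (int i = 0; i + 1 < list.size(); i++) {
                String before = list.get(i);
                String after = list.get(i + 1);
                if (successors.get(before).add(after)) {
                    inDegree.merge(after, 1, Integer::sum);
                }
            }
        }
        // Kahn's algorithm: repeatedly emit a type that nothing remaining precedes.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> merged = new ArrayList<>();
        while (!ready.isEmpty()) {
            String type = ready.remove();
            merged.add(type);
            for (String next : successors.get(type)) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        // Types left over are part of a cycle, i.e. contradictory declarations.
        if (merged.size() != inDegree.size()) {
            throw new IllegalStateException("inconsistent type priorities");
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(
            List.of("org.myorg.Paragraph", "org.myorg.Sentence"),
            List.of("org.myorg.Sentence", "org.myorg.Token"))));
    }
}
```

<p>In this sketch, declaring A before B in one list and B before A in another leaves both types
unordered, so the size check fails and the exception is thrown.</p>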
</div>
<div class="section" title="2.4.1.5.&nbsp;Index Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.index">2.4.1.5.&nbsp;Index Definition</h4></div></div></div>
<pre class="programlisting">&lt;fsIndexCollection&gt;
&lt;name&gt;[String]&lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;imports&gt;
&lt;import ...&gt;
...
&lt;/imports&gt;
&lt;fsIndexes&gt;
&lt;fsIndexDescription&gt;
...
&lt;/fsIndexDescription&gt;
&lt;fsIndexDescription&gt;
...
&lt;/fsIndexDescription&gt;
&lt;/fsIndexes&gt;
&lt;/fsIndexCollection&gt;</pre>
<p>The <code class="literal">fsIndexCollection</code> element declares <span class="emphasis"><em>Feature Structure
Indexes</em></span>, each of which defines an index that holds feature structures of a given type.
Information in the CAS is always accessed through an index. There is a built-in default annotation
index declared which can be used to access instances of type
<code class="literal">uima.tcas.Annotation</code> (or its subtypes), sorted based on their
<code class="literal">begin</code> and <code class="literal">end</code> features, and the type priority ordering (if specified).
For all other types, there is a
default, unsorted (bag) index. If there is a need for a specialized index it must be declared in this
element of the descriptor. See <a href="references.html#ugr.ref.cas.indexes_and_iterators" class="olink">Section&nbsp;4.7, &#8220;Indexes and Iterators&#8221;</a> for details on FS indexes.</p>
<p>Like type systems and type priorities, an
<code class="literal">fsIndexCollection</code> can declare a
<code class="literal">name</code>, <code class="literal">description</code>,
<code class="literal">vendor</code>, and <code class="literal">version</code>, and may
import other <code class="literal">fsIndexCollection</code>s. The import syntax is
described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a>.</p>
<p>An <code class="literal">fsIndexCollection</code> may also define zero or more
<code class="literal">fsIndexDescription</code> elements, each of which defines a
single index. Each <code class="literal">fsIndexDescription</code> has the form:
</p><pre class="programlisting">&lt;fsIndexDescription&gt;
&lt;label&gt;[String]&lt;/label&gt;
&lt;typeName&gt;[TypeName]&lt;/typeName&gt;
&lt;kind&gt;sorted|bag|set&lt;/kind&gt;
&lt;keys&gt;
&lt;fsIndexKey&gt;
&lt;featureName&gt;[Name]&lt;/featureName&gt;
&lt;comparator&gt;standard|reverse&lt;/comparator&gt;
&lt;/fsIndexKey&gt;
&lt;fsIndexKey&gt;
&lt;typePriority/&gt;
&lt;/fsIndexKey&gt;
...
&lt;/keys&gt;
&lt;/fsIndexDescription&gt;</pre>
<p>The <code class="literal">label</code> element defines the name by which
applications and annotators refer to this index. The
<code class="literal">typeName</code> element contains the name of the type that will
be contained in this index. This must match one of the type names defined in the
<code class="literal">&lt;typeSystemDescription&gt;</code>.</p>
<p>There are three possible values for the
<code class="literal">&lt;kind&gt;</code> of index. Sorted indexes enforce an
ordering of feature structures, based on defined keys. Bag indexes do
not enforce ordering, and have no defined keys. Set indexes do not
enforce ordering, but use defined keys to specify equivalence classes;
addToIndexes will not add a Feature Structure to a set index if its keys
match those of an entry of the same type already in the index.
If the <code class="literal">&lt;kind&gt;</code>element is omitted, it will default to
sorted, which is the most common type of index.</p>
<p>Prior to version 2.7.0, the bag and sorted indexes stored duplicate entries for the
same identical FS, if it was added to the indexes multiple times. As of version 2.7.0, this
is changed; a second or subsequent add to index operation has no effect. This has the
consequence that a remove operation now guarantees that the particular FS is removed
(as opposed to only being able to say that one (of perhaps many duplicate entries) is removed).
Since sending to remote annotators only adds entries to indexes at most once, this
behavior is consistent with that.</p>
<p>Note that even after this change, there is still a distinct difference in meaning for bag and set indexes.
The set index uses equal defined key values plus the type of the Feature Structure to determine equivalence classes for Feature Structures, and
will not add a Feature Structure if it has equal key values and the same type as an entry already in the index.</p>
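<p>The set-index rule can be modeled in plain Java as follows. This is a simplified sketch, not the
framework's index implementation: the class <code class="literal">SetIndexSketch</code> is
hypothetical, and a feature structure is reduced here to its type name plus the values of its key
features.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetIndexSketch {
    // Equivalence classes already present; each entry is [typeName, key1, key2, ...].
    private final Set<List<Object>> seen = new HashSet<>();

    // Returns true if the entry was added, or false if an entry of the same type
    // with equal key values is already in the index.
    // Hypothetical model for illustration; not part of the UIMA API.
    boolean add(String typeName, Object... keyValues) {
        List<Object> entry = new ArrayList<>();
        entry.add(typeName);
        entry.addAll(Arrays.asList(keyValues));
        return seen.add(entry);
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        SetIndexSketch index = new SetIndexSketch();
        System.out.println(index.add("org.myorg.Token", 0, 5));    // first entry
        System.out.println(index.add("org.myorg.Token", 0, 5));    // same type, same keys
        System.out.println(index.add("org.myorg.Sentence", 0, 5)); // same keys, other type
    }
}
```

<p>A bag index, by contrast, would accept all three entries as long as they are distinct feature
structures, because it compares by identity rather than by key values.</p>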
<p>It is possible, however, that users may be depending on having multiple instances of
the identical FeatureStructure in the indexes. Therefore, UIMA uses
a JVM defined property,
"uima.allow_duplicate_add_to_indexes", which (if defined when UIMA is loaded) will restore the previous behavior.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If duplicates are allowed, then the proper way to update an indexed Feature Structure is to
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>remove <span class="bold"><strong>*all*</strong></span> instances of the FS to be
updated </p></li><li class="listitem"><p>update the features</p></li><li class="listitem"><p>re-add the Feature Structure to the indexes (perhaps multiple times, depending on the
details of your logic).</p></li></ul></div></div>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>There is usually no need to explicitly declare a Bag index in your descriptor.
As of UIMA v2.1, if you do not declare any index for a type (or any of its
supertypes), a Bag index will be automatically created if an instance of that type is added to the indexes.</p></div>
<p>A Sorted or Set index may define zero or more <span class="emphasis"><em>keys</em></span>. These keys
determine the sort order of the feature structures within a sorted index, and
partially determine equality for set indexes (the equality measure always includes testing that the types are the same).
Bag indexes do not use keys, and
equality is determined by Feature Structure identity (that is, two elements
are considered equal if and only if they are exactly the same feature structure,
located in the same place in the CAS). Keys are
ordered by precedence &#8211; the first key is evaluated first, and
subsequent keys are evaluated only if necessary.</p>
<p>Each key is represented by an <code class="literal">fsIndexKey</code> element.
Most <code class="literal">fsIndexKeys</code> contain a
<code class="literal">featureName</code> and a <code class="literal">comparator</code>.
The <code class="literal">featureName</code> must match the name of one of the
features for the type specified in the
<code class="literal">&lt;typeName&gt;</code> element for this index. The
comparator defines how the features will be compared &#8211; a value of
<code class="literal">standard</code> means that features will be compared using the
standard comparison for their data type (e.g. for numerical types, smaller
values precede larger values, and for string types, Unicode string
comparison is performed). A value of <code class="literal">reverse</code> means that
features will be compared using the reverse of the standard comparison (e.g.
for numerical types, larger values precede smaller values, etc.). For Set
indexes, the comparator direction is ignored &#8211; the keys are only used
for the equality testing.</p>
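<p>Key precedence and the two comparator directions can be illustrated with an ordinary Java
<code class="literal">Comparator</code>. The sketch below assumes a sorted index with two keys, a
first key compared with <code class="literal">standard</code> and a second key compared with
<code class="literal">reverse</code>. The <code class="literal">Span</code> class and its
<code class="literal">begin</code> and <code class="literal">end</code> fields are hypothetical
stand-ins for a feature structure and its features; this is not how the framework implements its
indexes.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class IndexKeyOrderSketch {
    // Hypothetical stand-in for a feature structure with two integer features.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + "]";
        }
    }

    // First key: begin, standard (smaller values first).
    // Second key: end, reverse (larger values first); it is consulted only
    // when the first key compares equal.
    static final Comparator<Span> ORDER =
        Comparator.comparingInt((Span s) -> s.begin)
                  .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed());

    // Returns a sorted copy, leaving the input list untouched.
    static List<Span> sorted(List<Span> spans) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(List.of(new Span(5, 9), new Span(0, 3), new Span(0, 7))));
    }
}
```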
<p>Each key used in comparisons must refer to a feature whose range type is
Boolean, Byte, Short, Integer, Long, Float, Double, or String.
</p>
<p>There is a second type of a key, one which contains only the
<code class="literal">&lt;typePriority/&gt;</code>. When this key is used, it
indicates that Feature Structures will be compared using the type priorities
declared in the <code class="literal">&lt;typePriorities&gt;</code> section of the
descriptor.</p>
</div>
<div class="section" title="2.4.1.6.&nbsp;Capabilities"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.capabilities">2.4.1.6.&nbsp;Capabilities</h4></div></div></div>
<pre class="programlisting">&lt;capabilities&gt;
&lt;capability&gt;
&lt;inputs&gt;
&lt;type allAnnotatorFeatures="true|false"&gt;[TypeName]&lt;/type&gt;
...
&lt;feature&gt;[TypeName]:[Name]&lt;/feature&gt;
...
&lt;/inputs&gt;
&lt;outputs&gt;
&lt;type allAnnotatorFeatures="true|false"&gt;[TypeName]&lt;/type&gt;
...
&lt;feature&gt;[TypeName]:[Name]&lt;/feature&gt;
...
&lt;/outputs&gt;
&lt;inputSofas&gt;
&lt;sofaName&gt;[name]&lt;/sofaName&gt;
...
&lt;/inputSofas&gt;
&lt;outputSofas&gt;
&lt;sofaName&gt;[name]&lt;/sofaName&gt;
...
&lt;/outputSofas&gt;
&lt;languagesSupported&gt;
&lt;language&gt;[ISO Language ID]&lt;/language&gt;
...
&lt;/languagesSupported&gt;
&lt;/capability&gt;
&lt;capability&gt;
...
&lt;/capability&gt;
...
&lt;/capabilities&gt;</pre>
<p>The capabilities definition is used by the UIMA Framework in several
ways, including setting up the Results Specification for process calls,
routing control for aggregates based on language, and as part of the Sofa
mapping function.</p>
<p>The <code class="literal">capabilities</code> element contains one or more
<code class="literal">capability</code> elements. In Version 2 and onwards, only one
capability set should be used (multiple sets will continue to work for a while,
but they are not supported in a logically consistent way).
</p>
<p>Each <code class="literal">capability</code> contains
<code class="literal">inputs</code>, <code class="literal">outputs</code>,
<code class="literal">languagesSupported</code>, <code class="literal">inputSofas</code>, and <code class="literal">outputSofas</code>.
The <code class="literal">inputs</code> and <code class="literal">outputs</code> elements are required (though they may be empty);
<code class="literal">&lt;languagesSupported&gt;</code>, <code class="literal">&lt;inputSofas&gt;</code>,
and <code class="literal">&lt;outputSofas&gt;</code> are optional.</p>
<p>Both inputs and outputs may contain a mixture of type and feature
elements.</p>
<p><code class="literal">&lt;type...&gt;</code> elements contain the name of one
of the types defined in the type system or one of the built in types. Declaring a
type as an input means that this component expects instances of this type to be
in the CAS when it receives it to process. Declaring a type as an output means
that this component creates new instances of this type in the CAS.</p>
<p>There is an optional attribute
<code class="literal">allAnnotatorFeatures</code>, which defaults to false if
omitted. The Component Descriptor Editor tool defaults this to true when a new
type is added to the list of inputs and/or outputs. When this attribute is true,
it specifies that all of the type's features are also declared as input or
output. Otherwise, the features that are required as inputs or populated as
outputs must be explicitly specified in feature elements.</p>
<p><code class="literal">&lt;feature...&gt;</code> elements contain the
<span class="quote">&#8220;<span class="quote">fully-qualified</span>&#8221;</span> feature name, which is the type name
followed by a colon, followed by the feature name, e.g.
<code class="literal">org.myorg.TokenAnnotation:lemma</code>.
<code class="literal">&lt;feature...&gt;</code> elements in the
<code class="literal">&lt;inputs&gt;</code> section must also have a corresponding
type declared as an input. In output sections, this is not required. If the type
is not specified as an output, but a feature for that type is, this means that
existing instances of the type have the values of the specified features
updated. Any type mentioned in a <code class="literal">&lt;feature&gt;</code>
element must be either specified as an input or an output or both.</p>
<p><code class="literal">&lt;language&gt;</code> elements contain one of the ISO language
identifiers, such as <code class="literal">en</code> for English, or
<code class="literal">en-US</code> for the United States dialect of English.</p>
<p>The list of language codes can be found here: <a class="ulink" href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt" target="_top">http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt</a>
and the country codes here:
<a class="ulink" href="http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html" target="_top">http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html</a>
</p>
<p><code class="literal">&lt;inputSofas&gt;</code> and
<code class="literal">&lt;outputSofas&gt;</code> declare sofa names used by this
component. All Sofa names must be unique within a particular capability set. A
Sofa name must be an input or an output, and cannot be both. It is an error to have a
Sofa name declared as an input in one capability set, and also have it declared
as an output in another capability set.</p>
<p>A <code class="literal">&lt;sofaName&gt;</code> is written as a simple
Java-style identifier, without any periods in the name, except that it may be
written to end in <span class="quote">&#8220;<span class="quote"><code class="literal">.*</code></span>&#8221;</span>. If written in this
manner, it specifies a set of Sofa names, all of which start with the base name
(the part before the .*) followed by a period and then an arbitrary Java
identifier (without periods). This form is used to specify in the descriptor
that the component could generate an arbitrary number of Sofas, the exact
names and numbers of which are unknown before the component is run.</p>
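<p>As an illustration, a capability set that combines these elements might look like the
following sketch. The type, feature, and Sofa names used here (for example
<code class="literal">org.myorg.PersonAnnotation</code>, <code class="literal">SourceText</code>, and
<code class="literal">Translation</code>) are hypothetical and are not defined by UIMA.</p>
<pre class="programlisting">&lt;capabilities&gt;
  &lt;capability&gt;
    &lt;inputs&gt;
      &lt;!-- the type and all of its features are expected as input --&gt;
      &lt;type allAnnotatorFeatures="true"&gt;org.myorg.TokenAnnotation&lt;/type&gt;
    &lt;/inputs&gt;
    &lt;outputs&gt;
      &lt;!-- new instances of this type are created; only "name" is populated --&gt;
      &lt;type&gt;org.myorg.PersonAnnotation&lt;/type&gt;
      &lt;feature&gt;org.myorg.PersonAnnotation:name&lt;/feature&gt;
      &lt;!-- type is an input only, so existing instances have "lemma" updated --&gt;
      &lt;feature&gt;org.myorg.TokenAnnotation:lemma&lt;/feature&gt;
    &lt;/outputs&gt;
    &lt;inputSofas&gt;
      &lt;sofaName&gt;SourceText&lt;/sofaName&gt;
    &lt;/inputSofas&gt;
    &lt;outputSofas&gt;
      &lt;!-- a set of Sofas, e.g. Translation.German, not known in advance --&gt;
      &lt;sofaName&gt;Translation.*&lt;/sofaName&gt;
    &lt;/outputSofas&gt;
    &lt;languagesSupported&gt;
      &lt;language&gt;en&lt;/language&gt;
      &lt;language&gt;en-US&lt;/language&gt;
    &lt;/languagesSupported&gt;
  &lt;/capability&gt;
&lt;/capabilities&gt;</pre>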
</div>
<div class="section" title="2.4.1.7.&nbsp;OperationalProperties"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.operational_properties">2.4.1.7.&nbsp;OperationalProperties</h4></div></div></div>
<p>Components can declare operational properties that are
useful in deployment. The following are available:</p>
<pre class="programlisting">&lt;operationalProperties&gt;
&lt;modifiesCas&gt; true|false &lt;/modifiesCas&gt;
&lt;multipleDeploymentAllowed&gt; true|false &lt;/multipleDeploymentAllowed&gt;
&lt;outputsNewCASes&gt; true|false &lt;/outputsNewCASes&gt;
&lt;/operationalProperties&gt;</pre>
<p><code class="literal">modifiesCas</code>, if false, indicates that this
component does not modify the CAS. If it is not specified, the default value is
true except for CAS Consumer components.</p>
<p><code class="literal">multipleDeploymentAllowed</code>, if true, allows the
component to be deployed multiple times to increase performance through
scale-out techniques. If it is not specified, the default value is true,
except for CAS Consumer and Collection Reader components.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If you wrap one or more CAS Consumers inside an aggregate as the only
components, you must explicitly set the
<code class="literal">multipleDeploymentAllowed</code> property to false in the aggregate (assuming the CAS Consumer
components take the default here); otherwise the framework will report inconsistent
settings for this property.</p></div>
<p><code class="literal">outputsNewCASes</code>, if true, allows the component to
create new CASes during processing, for example to break a large artifact into
smaller pieces. See <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cm" class="olink">Chapter&nbsp;7, <i>CAS Multiplier Developer's Guide</i></a> for details.</p>
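<p>For instance, a component that splits large artifacts into smaller CASes might
declare the following. This is a sketch of one plausible combination of values, not a
recommended setting for every such component.</p>
<pre class="programlisting">&lt;operationalProperties&gt;
  &lt;!-- the component may change the CAS it receives --&gt;
  &lt;modifiesCas&gt;true&lt;/modifiesCas&gt;
  &lt;!-- several instances may be deployed for scale-out --&gt;
  &lt;multipleDeploymentAllowed&gt;true&lt;/multipleDeploymentAllowed&gt;
  &lt;!-- the component creates new CASes during processing --&gt;
  &lt;outputsNewCASes&gt;true&lt;/outputsNewCASes&gt;
&lt;/operationalProperties&gt;</pre>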
</div>
<div class="section" title="2.4.1.8.&nbsp;External Resource Dependencies"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies">2.4.1.8.&nbsp;External Resource Dependencies</h4></div></div></div>
<pre class="programlisting">&lt;externalResourceDependencies&gt;
&lt;externalResourceDependency&gt;
&lt;key&gt;[String]&lt;/key&gt;
&lt;description&gt;[String] &lt;/description&gt;
&lt;interfaceName&gt;[String]&lt;/interfaceName&gt;
&lt;optional&gt;true|false&lt;/optional&gt;
&lt;/externalResourceDependency&gt;
&lt;externalResourceDependency&gt;
...
&lt;/externalResourceDependency&gt;
...
&lt;/externalResourceDependencies&gt;</pre>
<p>A primitive annotator may declare zero or more
<code class="literal">&lt;externalResourceDependency&gt;</code> elements. Each
dependency has the following elements:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">key</code> &#8211; the
string by which the annotator code will attempt to access the resource. Must
be unique within this annotator.</p></li><li class="listitem"><p><code class="literal">description</code> &#8211; a textual
description of the dependency.</p></li><li class="listitem"><p><code class="literal">interfaceName</code> &#8211; the
fully-qualified name of the Java interface through which the annotator
will access the data. This is optional. If not specified, the annotator
can only get an InputStream to the data.</p></li><li class="listitem"><p><code class="literal">optional</code> &#8211; whether the
resource is optional. If false, an exception will be thrown if no resource
is assigned to satisfy this dependency. Defaults to false. </p>
</li></ul></div>
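<p>A single required dependency could be declared as in the sketch below. The key
<code class="literal">AcronymTable</code> and the interface
<code class="literal">org.myorg.StringMapResource</code> are hypothetical names chosen for
illustration.</p>
<pre class="programlisting">&lt;externalResourceDependencies&gt;
  &lt;externalResourceDependency&gt;
    &lt;!-- the annotator code looks the resource up under this key --&gt;
    &lt;key&gt;AcronymTable&lt;/key&gt;
    &lt;description&gt;Table of acronyms and their expanded forms.&lt;/description&gt;
    &lt;!-- optional; without it the annotator can only get an InputStream --&gt;
    &lt;interfaceName&gt;org.myorg.StringMapResource&lt;/interfaceName&gt;
    &lt;!-- false: an exception is thrown if no resource is bound --&gt;
    &lt;optional&gt;false&lt;/optional&gt;
  &lt;/externalResourceDependency&gt;
&lt;/externalResourceDependencies&gt;</pre>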
</div>
<div class="section" title="2.4.1.9.&nbsp;Resource Manager Configuration"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration">2.4.1.9.&nbsp;Resource Manager Configuration</h4></div></div></div>
<pre class="programlisting">&lt;resourceManagerConfiguration&gt;
&lt;name&gt;[String]&lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;imports&gt;
&lt;import ...&gt;
...
&lt;/imports&gt;
&lt;externalResources&gt;
&lt;externalResource&gt;
&lt;name&gt;[String]&lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;fileResourceSpecifier&gt;
&lt;fileUrl&gt;[URL]&lt;/fileUrl&gt;
&lt;/fileResourceSpecifier&gt;
&lt;implementationName&gt;[String]&lt;/implementationName&gt;
&lt;/externalResource&gt;
...
&lt;/externalResources&gt;
&lt;externalResourceBindings&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;[String]&lt;/key&gt;
&lt;resourceName&gt;[String]&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
...
&lt;/externalResourceBindings&gt;
&lt;/resourceManagerConfiguration&gt;</pre>
<p>This element declares external resources and binds them to
annotators' external resource dependencies.</p>
<p>The <code class="literal">resourceManagerConfiguration</code> element may
optionally contain an <code class="literal">import</code>, which allows resource
definitions to be stored in a separate (shareable) file. See <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a> for details.</p>
<p>The <code class="literal">externalResources</code> element contains zero or
more <code class="literal">externalResource</code> elements, each of which
consists of:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">name</code> &#8211; the
name of the resource. This name is referred to in the bindings (see below).
Resource names need to be unique within any Aggregate Analysis Engine or
Collection Processing Engine, so the Java-like
<code class="literal">org.myorg.mycomponent.MyResource</code> syntax is
recommended.</p></li><li class="listitem"><p><code class="literal">description</code> &#8211; English
description of the resource.</p></li><li class="listitem"><p>Resource Specifier &#8211;
Declares the location of the resource. There are different
possibilities for how this is done (see below).</p></li><li class="listitem"><p><code class="literal">implementationName</code> &#8211; The
fully-qualified name of the Java class that will be instantiated from the
resource data. This is optional; if not specified, the resource will be
accessible as an input stream to the raw data. If specified, the Java class
must implement the <code class="literal">interfaceName</code> that is
specified in the External Resource Dependency to which it is bound.
</p></li></ul></div>
<p>One possibility for the resource specifier is a
<code class="literal">&lt;fileResourceSpecifier&gt;</code>, as shown above. This
simply declares a URL to the resource data. This support is built on the Java
class URL and its method URL.openStream(); it supports the protocols
<span class="quote">&#8220;<span class="quote">file</span>&#8221;</span>, <span class="quote">&#8220;<span class="quote">http</span>&#8221;</span> and <span class="quote">&#8220;<span class="quote">jar</span>&#8221;</span> (for
referring to files in jars) by default, and you can plug in handlers for other
protocols. The URL has to start with file: (or some other protocol). It is
relative to either the classpath or the <span class="quote">&#8220;<span class="quote">data path</span>&#8221;</span>. The data
path works like the classpath but can be set programmatically via
<code class="literal">ResourceManager.setDataPath()</code>. Setting the Java
System property <code class="literal">uima.datapath</code> also works.</p>
<p><code class="literal">file:com/apache.d.txt</code> is a relative path;
relative paths for resources are resolved using the classpath and/or the
datapath. For the file protocol, URLs starting with file:/ or file:/// are
absolute. Note that <code class="literal">file://org/apache/d.txt</code> is NOT an
absolute path starting with <span class="quote">&#8220;<span class="quote">org</span>&#8221;</span>. The <span class="quote">&#8220;<span class="quote">//</span>&#8221;</span>
indicates that what follows is a host name. Therefore, if you try to use this URL,
you will get an error saying that it cannot connect to the host <span class="quote">&#8220;<span class="quote">org</span>&#8221;</span>.
</p>
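<p>The difference between the relative and absolute forms can be sketched with two
resource declarations. The resource names and file paths below are hypothetical.</p>
<pre class="programlisting">&lt;externalResources&gt;
  &lt;externalResource&gt;
    &lt;name&gt;org.myorg.mycomponent.RelativeDictionary&lt;/name&gt;
    &lt;fileResourceSpecifier&gt;
      &lt;!-- relative: resolved using the classpath and/or the datapath --&gt;
      &lt;fileUrl&gt;file:org/myorg/dict.txt&lt;/fileUrl&gt;
    &lt;/fileResourceSpecifier&gt;
  &lt;/externalResource&gt;
  &lt;externalResource&gt;
    &lt;name&gt;org.myorg.mycomponent.AbsoluteDictionary&lt;/name&gt;
    &lt;fileResourceSpecifier&gt;
      &lt;!-- absolute: a single slash (or three) after "file:" --&gt;
      &lt;fileUrl&gt;file:/opt/data/dict.txt&lt;/fileUrl&gt;
    &lt;/fileResourceSpecifier&gt;
  &lt;/externalResource&gt;
&lt;/externalResources&gt;</pre>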
<p>The URL value may contain references to external override variables using the
<code class="literal">${variable-name}</code> syntax,
e.g. <code class="literal">file:com/${dictUrl}.txt</code>.
If a variable is undefined the value is left unmodified and a warning message
identifies the missing variable.
</p>
<p>Another option is a
<code class="literal">&lt;fileLanguageResourceSpecifier&gt;</code>, which is
intended to support resources, such as dictionaries, that depend on the
language of the document being processed. Instead of a single URL, a prefix and
suffix are specified, like this:
</p><pre class="programlisting">&lt;fileLanguageResourceSpecifier&gt;
&lt;fileUrlPrefix&gt;file:FileLanguageResource_implTest_data_&lt;/fileUrlPrefix&gt;
&lt;fileUrlSuffix&gt;.dat&lt;/fileUrlSuffix&gt;
&lt;/fileLanguageResourceSpecifier&gt;</pre>
<p>The URL of the actual resource is then formed by concatenating the prefix,
the language of the document (as an ISO language code, e.g.
<code class="literal">en</code> or <code class="literal">en-US</code>
&#8211; see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.capabilities" title="2.4.1.6.&nbsp;Capabilities">Section&nbsp;2.4.1.6, &#8220;Capabilities&#8221;</a> for more
information), and the suffix.</p>
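<p>As a worked example, assume the prefix and suffix from the listing above and a
document whose language is <code class="literal">en</code>. The sketch below only shows the
concatenation; it makes no claim about how language codes without a matching file are handled.</p>
<pre class="programlisting">&lt;!-- prefix:   file:FileLanguageResource_implTest_data_        --&gt;
&lt;!-- language: en                                               --&gt;
&lt;!-- suffix:   .dat                                             --&gt;
&lt;!-- result:   file:FileLanguageResource_implTest_data_en.dat   --&gt;
&lt;fileLanguageResourceSpecifier&gt;
  &lt;fileUrlPrefix&gt;file:FileLanguageResource_implTest_data_&lt;/fileUrlPrefix&gt;
  &lt;fileUrlSuffix&gt;.dat&lt;/fileUrlSuffix&gt;
&lt;/fileLanguageResourceSpecifier&gt;</pre>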
<p>A third option is a <code class="literal">customResourceSpecifier</code>, which allows
you to plug in an arbitrary Java class. See <a class="xref" href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers" title="2.8.&nbsp;Custom Resource Specifiers">Section&nbsp;2.8, &#8220;Custom Resource Specifiers&#8221;</a>
for more information.</p>
<p>The <code class="literal">externalResourceBindings</code> element declares
which resources are bound to which dependencies. Each
<code class="literal">externalResourceBinding</code> consists of:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">key</code> &#8211;
identifies the dependency. For a binding declared in a primitive analysis
engine descriptor, this must match the value of the
<code class="literal">key</code> element of one of the
<code class="literal">externalResourceDependency</code> elements. Bindings
may also be specified in aggregate analysis engine descriptors, in which
case a compound key is used
&#8211; see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" title="2.4.2.4.&nbsp;External Resource Bindings">Section&nbsp;2.4.2.4, &#8220;External Resource Bindings&#8221;</a>
.</p></li><li class="listitem"><p><code class="literal">resourceName</code> &#8211; the name of
the resource satisfying the dependency. This must match the value of the
<code class="literal">name</code> element of one of the
<code class="literal">externalResource</code> declarations. </p>
</li></ul></div>
<p>A given resource dependency may only be bound to one external resource;
one external resource may be bound to many dependencies &#8211; to allow
resource sharing.</p>
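<p>Putting the pieces together for a primitive analysis engine, a resource declaration
and its binding might look like the following sketch. It assumes the annotator declares a
dependency with the hypothetical key <code class="literal">AcronymTable</code>; the resource
name, file URL, and implementation class are likewise hypothetical.</p>
<pre class="programlisting">&lt;resourceManagerConfiguration&gt;
  &lt;externalResources&gt;
    &lt;externalResource&gt;
      &lt;name&gt;org.myorg.mycomponent.AcronymTableFile&lt;/name&gt;
      &lt;description&gt;Acronym table loaded from a data file.&lt;/description&gt;
      &lt;fileResourceSpecifier&gt;
        &lt;fileUrl&gt;file:org/myorg/acronyms.dat&lt;/fileUrl&gt;
      &lt;/fileResourceSpecifier&gt;
      &lt;!-- must implement the interfaceName of the bound dependency --&gt;
      &lt;implementationName&gt;org.myorg.StringMapResource_impl&lt;/implementationName&gt;
    &lt;/externalResource&gt;
  &lt;/externalResources&gt;
  &lt;externalResourceBindings&gt;
    &lt;externalResourceBinding&gt;
      &lt;!-- key of the externalResourceDependency in this descriptor --&gt;
      &lt;key&gt;AcronymTable&lt;/key&gt;
      &lt;!-- name of the externalResource declared above --&gt;
      &lt;resourceName&gt;org.myorg.mycomponent.AcronymTableFile&lt;/resourceName&gt;
    &lt;/externalResourceBinding&gt;
  &lt;/externalResourceBindings&gt;
&lt;/resourceManagerConfiguration&gt;</pre>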
</div>
<div class="section" title="2.4.1.10.&nbsp;Environment Variable References"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.environment_variable_references">2.4.1.10.&nbsp;Environment Variable References</h4></div></div></div>
<p>In several places throughout the descriptor, it is possible to reference
environment variables. In Java, these are actually references to Java system
properties. To reference system environment variables from a Java analysis
engine, you must pass the environment variables into the Java virtual machine
by using the <code class="literal">-D</code> option on the <code class="literal">java</code>
command line.</p>
<p>The syntax for environment variable references is
<code class="literal">&lt;envVarRef&gt;[VariableName]&lt;/envVarRef&gt;</code>,
where [VariableName] is any valid Java system property name. Environment
variable references are valid in the following places:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>The value of a
configuration parameter (String-valued parameters only)</p>
</li><li class="listitem"><p>The
<code class="literal">&lt;annotatorImplementationName&gt;</code> element
of a primitive AE descriptor</p></li><li class="listitem"><p>The <code class="literal">&lt;name&gt;</code> element within
<code class="literal">&lt;analysisEngineMetaData&gt;</code></p>
</li><li class="listitem"><p>Within a
<code class="literal">&lt;fileResourceSpecifier&gt;</code> or
<code class="literal">&lt;fileLanguageResourceSpecifier&gt;</code>
</p></li></ul></div>
<p>For example, if the value of a configuration parameter were specified as
<code class="literal">&lt;string&gt;&lt;envVarRef&gt;TEMP_DIR&lt;/envVarRef&gt;/temp.dat&lt;/string&gt;</code>,
and the value of the <code class="literal">TEMP_DIR</code> Java System property were
<code class="literal">c:/temp</code>, then the configuration parameter's
value would evaluate to <code class="literal">c:/temp/temp.dat</code>.</p>
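<p>In a descriptor, such a reference might appear inside the configuration parameter
settings as sketched below. The parameter name <code class="literal">TempFile</code> is
hypothetical, and the comment shows one way the property could be supplied when the
JVM is started.</p>
<pre class="programlisting">&lt;!-- JVM started with, for example:  java -DTEMP_DIR=c:/temp ...  --&gt;
&lt;configurationParameterSettings&gt;
  &lt;nameValuePair&gt;
    &lt;name&gt;TempFile&lt;/name&gt;
    &lt;value&gt;
      &lt;!-- evaluates to c:/temp/temp.dat when TEMP_DIR is c:/temp --&gt;
      &lt;string&gt;&lt;envVarRef&gt;TEMP_DIR&lt;/envVarRef&gt;/temp.dat&lt;/string&gt;
    &lt;/value&gt;
  &lt;/nameValuePair&gt;
&lt;/configurationParameterSettings&gt;</pre>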
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The Component Descriptor Editor does not support
environment variable references. If you need to, however, you
can use the <code class="code">source</code> tab view in the CDE to manually
add this notation.
</p></div>
</div>
</div>
<div class="section" title="2.4.2.&nbsp;Aggregate Analysis Engine Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate">2.4.2.&nbsp;Aggregate Analysis Engine Descriptors</h3></div></div></div>
<p>Aggregate Analysis Engines do not contain an annotator, but instead
contain one or more component (also called <span class="emphasis"><em>delegate</em></span>)
analysis engines.</p>
<p>Aggregate Analysis Engine Descriptors maintain most of the same structure
as Primitive Analysis Engine Descriptors. The differences are:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An Aggregate Analysis Engine Descriptor
contains the element
<code class="literal">&lt;primitive&gt;false&lt;/primitive&gt;</code> rather
than <code class="literal">&lt;primitive&gt;true&lt;/primitive&gt;</code>.
</p></li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor must not include a
<code class="literal">&lt;annotatorImplementationName&gt;</code>
element.</p></li><li class="listitem"><p>In place of the
<code class="literal">&lt;annotatorImplementationName&gt;</code>, an Aggregate
Analysis Engine Descriptor must have a
<code class="literal">&lt;delegateAnalysisEngineSpecifiers&gt;</code>
element. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1.&nbsp;Delegate Analysis Engine Specifiers">Section&nbsp;2.4.2.1, &#8220;Delegate Analysis Engine Specifiers&#8221;</a>.</p>
</li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor may provide a
<code class="literal">&lt;flowController&gt;</code> element immediately
following the
<code class="literal">&lt;delegateAnalysisEngineSpecifiers&gt;</code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.flow_controller" title="2.4.2.2.&nbsp;FlowController">Section&nbsp;2.4.2.2, &#8220;FlowController&#8221;</a>.</p></li><li class="listitem"><p>Under the <code class="literal">analysisEngineMetaData</code> element, an Aggregate
Analysis Engine Descriptor may specify an additional element --
<code class="literal">&lt;flowConstraints&gt;</code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints" title="2.4.2.3.&nbsp;FlowConstraints">Section&nbsp;2.4.2.3, &#8220;FlowConstraints&#8221;</a>. Typically only one
of <code class="literal">&lt;flowController&gt;</code> and
<code class="literal">&lt;flowConstraints&gt;</code> is specified. If both are
specified, the <code class="literal">&lt;flowController&gt;</code> takes
precedence, and the flow controller implementation can use the information
specified in the <code class="literal">&lt;flowConstraints&gt;</code> as part of
its configuration input.</p></li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor must not contain a
<code class="literal">&lt;typeSystemDescription&gt;</code> element. The Type
System of the Aggregate Analysis Engine is derived by merging the Type Systems
of the Analysis Engines that the aggregate contains.</p></li><li class="listitem"><p>Within aggregate Analysis Engine Descriptors,
<code class="literal">&lt;configurationParameter&gt;</code> elements may define
<code class="literal">&lt;overrides&gt;</code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides" title="2.4.3.3.&nbsp;Configuration Parameter Overrides">Section&nbsp;2.4.3.3, &#8220;Configuration Parameter Overrides&#8221;</a>
.</p></li><li class="listitem"><p>External Resource Bindings can bind resources to
dependencies declared by any delegate AE within the aggregate. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" title="2.4.2.4.&nbsp;External Resource Bindings">Section&nbsp;2.4.2.4, &#8220;External Resource Bindings&#8221;</a>.</p>
</li><li class="listitem"><p>An additional optional element,
<code class="literal">&lt;sofaMappings&gt;</code>, may be included. </p>
</li></ul></div>
<div class="section" title="2.4.2.1.&nbsp;Delegate Analysis Engine Specifiers"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.delegates">2.4.2.1.&nbsp;Delegate Analysis Engine Specifiers</h4></div></div></div>
<pre class="programlisting">&lt;delegateAnalysisEngineSpecifiers&gt;
&lt;delegateAnalysisEngine key="[String]"&gt;
&lt;analysisEngineDescription&gt;...&lt;/analysisEngineDescription&gt; |
&lt;import .../&gt;
&lt;/delegateAnalysisEngine&gt;
&lt;delegateAnalysisEngine key="[String]"&gt;
...
&lt;/delegateAnalysisEngine&gt;
...
&lt;/delegateAnalysisEngineSpecifiers&gt;</pre>
<p>The <code class="literal">delegateAnalysisEngineSpecifiers</code> element
contains one or more <code class="literal">delegateAnalysisEngine</code>
elements. Each of these must have a unique key, and must contain
either:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>A complete
<code class="literal">analysisEngineDescription</code> element describing the
delegate analysis engine <span class="bold"><strong>OR</strong></span></p>
</li><li class="listitem"><p>An <code class="literal">import</code> element giving the name or
location of the XML descriptor for the delegate analysis engine (see <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a>).</p></li></ul></div>
<p>The latter is the much more common usage, and is the only form supported by
the Component Descriptor Editor tool.</p>
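<p>A sketch of the import form, using one import by name and one by location, is shown
below. The delegate keys and descriptor names are hypothetical.</p>
<pre class="programlisting">&lt;delegateAnalysisEngineSpecifiers&gt;
  &lt;delegateAnalysisEngine key="tokenizer"&gt;
    &lt;!-- import by name --&gt;
    &lt;import name="org.myorg.descriptors.Tokenizer"/&gt;
  &lt;/delegateAnalysisEngine&gt;
  &lt;delegateAnalysisEngine key="personAnnotator"&gt;
    &lt;!-- import by location --&gt;
    &lt;import location="PersonAnnotator.xml"/&gt;
  &lt;/delegateAnalysisEngine&gt;
&lt;/delegateAnalysisEngineSpecifiers&gt;</pre>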
</div>
<div class="section" title="2.4.2.2.&nbsp;FlowController"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_controller">2.4.2.2.&nbsp;FlowController</h4></div></div></div>
<pre class="programlisting">&lt;flowController key="[String]"&gt;
&lt;flowControllerDescription&gt;...&lt;/flowControllerDescription&gt; |
&lt;import .../&gt;
&lt;/flowController&gt;</pre>
<p>The optional <code class="literal">flowController</code> element identifies
the descriptor of the FlowController component that will be used to determine
the order in which delegate Analysis Engines are called.</p>
<p>The <code class="literal">key</code> attribute is optional, but recommended; it
assigns the FlowController an identifier that can be used for configuration
parameter overrides, Sofa mappings, or external resource bindings. The key
must not be the same as any of the delegate analysis engine keys.</p>
<p>As with the <code class="literal">delegateAnalysisEngine</code> element, the
<code class="literal">flowController</code> element may contain either a complete
<code class="literal">flowControllerDescription</code> or an
<code class="literal">import</code>, but the import is recommended. The Component
Descriptor Editor tool only supports imports here.</p>
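<p>A minimal sketch of the recommended import form follows. The key
<code class="literal">myFlowController</code> and the descriptor file name are hypothetical;
the key must differ from every delegate analysis engine key.</p>
<pre class="programlisting">&lt;flowController key="myFlowController"&gt;
  &lt;import location="MyFlowController.xml"/&gt;
&lt;/flowController&gt;</pre>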
</div>
<div class="section" title="2.4.2.3.&nbsp;FlowConstraints"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints">2.4.2.3.&nbsp;FlowConstraints</h4></div></div></div>
<p>If a <code class="literal">&lt;flowController&gt;</code> is not specified, the
order in which delegate Analysis Engines are called within the aggregate
Analysis Engine is specified using the
<code class="literal">&lt;flowConstraints&gt;</code> element, which must occur
immediately following the
<code class="literal">configurationParameterSettings</code> element. If a
<code class="literal">&lt;flowController&gt;</code> is specified, then the
<code class="literal">&lt;flowConstraints&gt;</code> are optional. They can be
used to pass an ordering of delegate keys to the
<code class="literal">&lt;flowController&gt;</code>.</p>
<p>There are two options for flow constraints --
<code class="literal">&lt;fixedFlow&gt;</code> or
<code class="literal">&lt;capabilityLanguageFlow&gt;</code>. Each is discussed
in a separate section below.</p>
<div class="section" title="Fixed Flow"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints.fixed_flow">Fixed Flow</h5></div></div></div>
<pre class="programlisting">&lt;flowConstraints&gt;
&lt;fixedFlow&gt;
&lt;node&gt;[String]&lt;/node&gt;
&lt;node&gt;[String]&lt;/node&gt;
...
&lt;/fixedFlow&gt;
&lt;/flowConstraints&gt;</pre>
<p>The <code class="literal">flowConstraints</code> element must be included
immediately following the
<code class="literal">configurationParameterSettings</code> element.</p>
<p>For a fixed flow, the <code class="literal">flowConstraints</code> element
contains a single <code class="literal">fixedFlow</code> element.</p>
<p>The <code class="literal">fixedFlow</code> element contains one or more
<code class="literal">node</code> elements, each of which contains an identifier
which must match the key of a delegate analysis engine specified in the
<code class="literal">delegateAnalysisEngineSpecifiers</code>
element.</p>
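<p>For example, assuming an aggregate whose delegates have the hypothetical keys
<code class="literal">tokenizer</code> and <code class="literal">personAnnotator</code>, a fixed
flow that runs them in that order could be sketched as:</p>
<pre class="programlisting">&lt;flowConstraints&gt;
  &lt;fixedFlow&gt;
    &lt;!-- each node names the key of a delegateAnalysisEngine --&gt;
    &lt;node&gt;tokenizer&lt;/node&gt;
    &lt;node&gt;personAnnotator&lt;/node&gt;
  &lt;/fixedFlow&gt;
&lt;/flowConstraints&gt;</pre>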
</div>
<div class="section" title="Capability Language Flow"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints.capability_language_flow">Capability Language Flow</h5></div></div></div>
<pre class="programlisting">&lt;flowConstraints&gt;
&lt;capabilityLanguageFlow&gt;
&lt;node&gt;[String]&lt;/node&gt;
&lt;node&gt;[String]&lt;/node&gt;
...
&lt;/capabilityLanguageFlow&gt;
&lt;/flowConstraints&gt;</pre>
<p>If you use <code class="literal">&lt;capabilityLanguageFlow&gt;</code>,
the delegate Analysis Engines named by the
<code class="literal">&lt;node&gt;</code> elements are called in the given order,
except that a delegate Analysis Engine is skipped if any of the following are
true (according to that Analysis Engine's declared output
capabilities):</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>It cannot produce any of the aggregate
Analysis Engine's output capabilities for the language of the
current document.</p></li><li class="listitem"><p>All of the output capabilities have already been
produced by an earlier Analysis Engine in the flow. </p></li></ul></div>
<p>For example, if two annotators produce
<code class="literal">org.myorg.TokenAnnotation</code> feature structures for
the same language, these feature structures will only be produced by the
first annotator in the list.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The flow analysis uses the specific types that are specified in the
output capabilities, without any expansion for subtypes. So, if you expect
a type TT and another type SubTT (which is a subtype of TT) in the output, you
must include both of them in the output capabilities.</p></div>
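<p>The token annotation example above could correspond to a flow like the following
sketch, where <code class="literal">tokenizerA</code> and <code class="literal">tokenizerB</code>
are hypothetical delegate keys and both delegates are assumed to declare
<code class="literal">org.myorg.TokenAnnotation</code> as an output for the same language.</p>
<pre class="programlisting">&lt;flowConstraints&gt;
  &lt;capabilityLanguageFlow&gt;
    &lt;!-- called first; produces org.myorg.TokenAnnotation --&gt;
    &lt;node&gt;tokenizerA&lt;/node&gt;
    &lt;!-- skipped if all of its outputs were already produced by tokenizerA --&gt;
    &lt;node&gt;tokenizerB&lt;/node&gt;
  &lt;/capabilityLanguageFlow&gt;
&lt;/flowConstraints&gt;</pre>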
</div>
</div>
<div class="section" title="2.4.2.4.&nbsp;External Resource Bindings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings">2.4.2.4.&nbsp;External Resource Bindings</h4></div></div></div>
<p>Aggregate analysis engine descriptors can declare resource bindings
that bind resources to dependencies declared in any of the delegate analysis
engines (or their subcomponents, recursively) within that aggregate. This
allows resource sharing. Any binding at this level overrides (supersedes)
any binding specified by a contained component or their subcomponents,
recursively.</p>
<p>For example, consider an aggregate Analysis Engine Descriptor that
contains delegate Analysis Engines with keys
<code class="literal">annotator1</code> and <code class="literal">annotator2</code> (as
declared in the <code class="literal">&lt;delegateAnalysisEngine&gt;</code>
element &#8211; see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1.&nbsp;Delegate Analysis Engine Specifiers">Section&nbsp;2.4.2.1, &#8220;Delegate Analysis Engine Specifiers&#8221;</a>),
where <code class="literal">annotator1</code> declares a resource dependency with
key <code class="literal">myResource</code> and <code class="literal">annotator2</code>
declares a resource dependency with key <code class="literal">someResource</code>.</p>
<p>Within that aggregate Analysis Engine Descriptor, the following
<code class="literal">resourceManagerConfiguration</code> would bind both of
those dependencies to a single external resource file.</p>
<pre class="programlisting">&lt;resourceManagerConfiguration&gt;
&lt;externalResources&gt;
&lt;externalResource&gt;
&lt;name&gt;ExampleResource&lt;/name&gt;
&lt;fileResourceSpecifier&gt;
&lt;fileUrl&gt;file:MyResourceFile.dat&lt;/fileUrl&gt;
&lt;/fileResourceSpecifier&gt;
&lt;/externalResource&gt;
&lt;/externalResources&gt;
&lt;externalResourceBindings&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;annotator1/myResource&lt;/key&gt;
&lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;annotator2/someResource&lt;/key&gt;
&lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
&lt;/externalResourceBindings&gt;
&lt;/resourceManagerConfiguration&gt;</pre>
<p>The syntax for the <code class="literal">externalResources</code> declaration
is exactly the same as described previously. In the resource bindings note the
use of the compound keys, e.g. <code class="literal">annotator1/myResource</code>.
This identifies the resource dependency key
<code class="literal">myResource</code> within the annotator with key
<code class="literal">annotator1</code>. Compound keys can be
multiple levels deep to handle nested aggregate analysis engines.</p>
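<p>For a nested aggregate, a binding declared in the outermost descriptor might be
sketched as follows. It assumes that <code class="literal">annotator1</code> is contained in a
delegate aggregate with the hypothetical key <code class="literal">innerAggregate</code>.</p>
<pre class="programlisting">&lt;externalResourceBinding&gt;
  &lt;!-- delegate key / nested delegate key / dependency key --&gt;
  &lt;key&gt;innerAggregate/annotator1/myResource&lt;/key&gt;
  &lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;</pre>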
</div>
<div class="section" title="2.4.2.5.&nbsp;Sofa Mappings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.sofa_mappings">2.4.2.5.&nbsp;Sofa Mappings</h4></div></div></div>
<p>Sofa mappings are specified between Sofa names declared in this
aggregate descriptor as part of the
<code class="literal">&lt;capability&gt;</code> section, and the Sofa names
declared in the delegate components. For purposes of the mapping, all the
declarations of Sofas in any of the capability sets contained within the
<code class="literal">&lt;capabilities&gt; </code>element are considered
together.</p>
<pre class="programlisting">&lt;sofaMappings&gt;
&lt;sofaMapping&gt;
&lt;componentKey&gt;[keyName]&lt;/componentKey&gt;
&lt;componentSofaName&gt;[sofaName]&lt;/componentSofaName&gt;
&lt;aggregateSofaName&gt;[sofaName]&lt;/aggregateSofaName&gt;
&lt;/sofaMapping&gt;
...
&lt;/sofaMappings&gt;</pre>
<p>The &lt;componentSofaName&gt; may be omitted in the case where the
component is not aware of Multiple Views or Sofas. In this case, the UIMA
framework will arrange for the specified &lt;aggregateSofaName&gt; to be
the one visible to the delegate component.</p>
<p>The &lt;componentKey&gt; is the key name for the component as specified
in the list of delegate components for this aggregate.</p>
<p>The sofaNames used must be declared as input or output sofas in some
capability set.</p>
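<p>As an illustration, a minimal sketch of a concrete mapping is shown below. The
delegate keys <code class="literal">annotator1</code> and
<code class="literal">annotator2</code> and all Sofa names are hypothetical, chosen
only for this example. The first mapping connects a Sofa named by a multi-view
delegate to a Sofa declared by the aggregate; the second omits
<code class="literal">&lt;componentSofaName&gt;</code> because that delegate is
assumed to be unaware of Sofas.</p>
<pre class="programlisting">&lt;sofaMappings&gt;
  &lt;!-- annotator1 refers to this Sofa as "inputText" (hypothetical name) --&gt;
  &lt;sofaMapping&gt;
    &lt;componentKey&gt;annotator1&lt;/componentKey&gt;
    &lt;componentSofaName&gt;inputText&lt;/componentSofaName&gt;
    &lt;aggregateSofaName&gt;translatedText&lt;/aggregateSofaName&gt;
  &lt;/sofaMapping&gt;
  &lt;!-- annotator2 is assumed to be single-view, so no componentSofaName is given --&gt;
  &lt;sofaMapping&gt;
    &lt;componentKey&gt;annotator2&lt;/componentKey&gt;
    &lt;aggregateSofaName&gt;translatedText&lt;/aggregateSofaName&gt;
  &lt;/sofaMapping&gt;
&lt;/sofaMappings&gt;</pre>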
</div>
</div>
<div class="section" title="2.4.3.&nbsp;Configuration Parameters"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameters">2.4.3.&nbsp;Configuration Parameters</h3></div></div></div>
<p>Configuration parameters may be declared and set in both Primitive and
Aggregate descriptors. Parameters set in an aggregate may override parameters set in one or
more of its delegates.
</p>
<div class="section" title="2.4.3.1.&nbsp;Configuration Parameter Declaration"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_declaration">2.4.3.1.&nbsp;Configuration Parameter Declaration</h4></div></div></div>
<p>Configuration Parameters are made available to annotator
implementations and applications by the following interfaces:
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem" style="list-style-type: circle"><p>
<code class="literal">AnnotatorContext</code> <sup>[<a name="d5e690" href="#ftn.d5e690" class="footnote">2</a>]</sup> (passed as an argument to the
initialize() method of a version 1 annotator)</p>
</li><li class="listitem" style="list-style-type: circle"><p>
<code class="literal">ConfigurableResource</code> (every Analysis Engine
implements this interface)</p>
</li><li class="listitem" style="list-style-type: circle"><p>
<code class="literal">UimaContext</code> (passed
as an argument to the initialize() method of a version 2 annotator; it can also be
obtained from any resource, including Analysis Engines, by calling
<code class="literal">getUimaContext()</code>)</p>
</li></ul></div>
<p>Use AnnotatorContext within version 1 annotators and UimaContext for
version 2 annotators and outside of annotators (for instance, in CasConsumers,
or the containing application) to access configuration parameters.</p>
<p>Configuration parameters are set from the corresponding elements in the
XML descriptor for the application. If you need to programmatically change
parameter settings within an application, you can use methods in
ConfigurableResource; if you do this, you need to call reconfigure()
afterwards to have the UIMA framework notify all the contained analysis
components that the parameter configuration has changed (the analysis
engine's reinitialize() methods will be called). Note that in the current
implementation, only integrated deployment components have configuration
parameters passed to them; remote components obtain their parameters from
their remote startup environment. This will likely change in the
future.</p>
<p>There are two ways to specify the
<code class="literal">&lt;configurationParameters&gt;</code> section &#8211; as a
list of configuration parameters or a list of groups. A list of parameters, which
are not part of any group, looks like this:
</p><pre class="programlisting">&lt;configurationParameters&gt;
&lt;configurationParameter&gt;
&lt;name&gt;[String]&lt;/name&gt;
&lt;externalOverrideName&gt;[String]&lt;/externalOverrideName&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;type&gt;String|Integer|Float|Boolean&lt;/type&gt;
&lt;multiValued&gt;true|false&lt;/multiValued&gt;
&lt;mandatory&gt;true|false&lt;/mandatory&gt;
&lt;overrides&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
...
&lt;/overrides&gt;
&lt;/configurationParameter&gt;
&lt;configurationParameter&gt;
...
&lt;/configurationParameter&gt;
...
&lt;/configurationParameters&gt;</pre>
<p>For each configuration parameter, the following are specified:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>name</strong></span>
&#8211; the name by which the annotator code refers to the parameter
(required). All parameters declared in an analysis engine descriptor must have
distinct names. The name is composed of normal Java identifier characters.</p>
</li><li class="listitem"><p><span class="bold"><strong>externalOverrideName</strong></span> &#8211; the
name of a property in an external settings file that, if defined, overrides
any value set in this descriptor or in its parent. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides" title="2.4.3.4.&nbsp;External Configuration Parameter Overrides">Section&nbsp;2.4.3.4, &#8220;External Configuration Parameter Overrides&#8221;</a>
for a discussion of external configuration parameter overrides.
(optional)</p></li><li class="listitem"><p><span class="bold"><strong>description</strong></span> &#8211; a
natural language description of the intent of the parameter
(optional)</p></li><li class="listitem"><p><span class="bold"><strong>type</strong></span> &#8211; the data
type of the parameter's value &#8211; must be one of
<code class="literal">String</code>, <code class="literal">Integer</code>,
<code class="literal">Float</code>, or <code class="literal">Boolean</code>
(required).</p></li><li class="listitem"><p><span class="bold"><strong>multiValued</strong></span> &#8211;
<code class="literal">true</code> if the parameter can take multiple values (an
array), <code class="literal">false</code> if the parameter takes only a single value
(optional, defaults to false).</p></li><li class="listitem"><p><span class="bold"><strong>mandatory</strong></span> &#8211;
<code class="literal">true</code> if a value must be provided for the parameter
(optional, defaults to false).</p></li><li class="listitem"><p><span class="bold"><strong>overrides</strong></span> &#8211; this
is used only in aggregate Analysis Engines, but is included here for
completeness. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides" title="2.4.3.3.&nbsp;Configuration Parameter Overrides">Section&nbsp;2.4.3.3, &#8220;Configuration Parameter Overrides&#8221;</a>
for a discussion of configuration parameter overriding in aggregate
Analysis Engines. (optional).</p></li></ul></div>
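<p>For instance, a declaration with two ungrouped parameters might look like the
following sketch. The parameter names, descriptions, and the external override
name are hypothetical and serve only to show the elements in context.</p>
<pre class="programlisting">&lt;configurationParameters&gt;
  &lt;!-- single-valued, mandatory integer parameter (hypothetical) --&gt;
  &lt;configurationParameter&gt;
    &lt;name&gt;MaxTokenLength&lt;/name&gt;
    &lt;description&gt;Longest token the annotator will emit&lt;/description&gt;
    &lt;type&gt;Integer&lt;/type&gt;
    &lt;multiValued&gt;false&lt;/multiValued&gt;
    &lt;mandatory&gt;true&lt;/mandatory&gt;
  &lt;/configurationParameter&gt;
  &lt;!-- multi-valued string parameter that a settings file may override --&gt;
  &lt;configurationParameter&gt;
    &lt;name&gt;StopWords&lt;/name&gt;
    &lt;externalOverrideName&gt;tokenizer.stopWords&lt;/externalOverrideName&gt;
    &lt;description&gt;Words the annotator should skip&lt;/description&gt;
    &lt;type&gt;String&lt;/type&gt;
    &lt;multiValued&gt;true&lt;/multiValued&gt;
    &lt;mandatory&gt;false&lt;/mandatory&gt;
  &lt;/configurationParameter&gt;
&lt;/configurationParameters&gt;</pre>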
<p>A list of groups looks like this:
</p><pre class="programlisting">&lt;configurationParameters defaultGroup="[String]"
searchStrategy="none|default_fallback|language_fallback" &gt;
&lt;commonParameters&gt;
[zero or more parameters]
&lt;/commonParameters&gt;
&lt;configurationGroup names="name1 name2 name3 ..."&gt;
[zero or more parameters]
&lt;/configurationGroup&gt;
&lt;configurationGroup names="name4 name5 ..."&gt;
[zero or more parameters]
&lt;/configurationGroup&gt;
...
&lt;/configurationParameters&gt;</pre>
<p>Both the <code class="literal">&lt;commonParameters&gt;</code> and
<code class="literal">&lt;configurationGroup&gt;</code> elements contain zero or
more <code class="literal">&lt;configurationParameter&gt;</code> elements, with
the same syntax described above.</p>
<p>The <code class="literal">&lt;commonParameters&gt;</code> element declares
parameters that exist in all groups. Each
<code class="literal">&lt;configurationGroup&gt;</code> element has a names
attribute, which contains a list of group names separated by whitespace (space
or tab characters). Names consist of any number of non-whitespace characters;
however the Component Descriptor Editor tool restricts this to be normal Java
identifiers, including the period (.) and the dash (-). One configuration group
will be created for each name, and all of the groups will contain the same set of
parameters.</p>
<p>The <code class="literal">defaultGroup</code> attribute specifies the name of the
group to be used in the case where an annotator does a lookup for a configuration
parameter without specifying a group name. It may also be used as a fallback if the
annotator specifies a group that does not exist &#8211; see below.</p>
<p>The <code class="literal">searchStrategy</code> attribute determines the action
to be taken when the context is queried for the value of a parameter belonging to a
particular configuration group, if that group does not exist or does not contain
a value for the requested parameter. There are currently three possible values:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>none</strong></span>
&#8211; there is no fallback; return null if there is no value in the exact group
specified by the user.</p></li><li class="listitem"><p><span class="bold"><strong>default_fallback</strong></span>
&#8211; if there is no value found in the specified group, look in the default
group (as defined by the <code class="literal">defaultGroup</code> attribute).</p>
</li><li class="listitem"><p><span class="bold"><strong>language_fallback</strong></span>
&#8211; this setting allows for a specific use of configuration parameter
groups where the group names correspond to ISO language and country codes
(for an example, see below). The fallback sequence is:
<code class="literal">&lt;lang&gt;_&lt;country&gt;_&lt;region&gt; <span class="symbol">&#8594;</span>
&lt;lang&gt;_&lt;country&gt; <span class="symbol">&#8594;</span> &lt;lang&gt; <span class="symbol">&#8594;</span>
&lt;default&gt;.</code> </p></li></ul></div><p>
</p>
<div class="section" title="Example"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_declaration.example">Example</h5></div></div></div>
<pre class="programlisting">&lt;configurationParameters defaultGroup="en"
searchStrategy="language_fallback"&gt;
&lt;commonParameters&gt;
&lt;configurationParameter&gt;
&lt;name&gt;DictionaryFile&lt;/name&gt;
&lt;description&gt;Location of dictionary for this
language&lt;/description&gt;
&lt;type&gt;String&lt;/type&gt;
&lt;multiValued&gt;false&lt;/multiValued&gt;
&lt;mandatory&gt;false&lt;/mandatory&gt;
&lt;/configurationParameter&gt;
&lt;/commonParameters&gt;
&lt;configurationGroup names="en de en-US"/&gt;
&lt;configurationGroup names="zh"&gt;
&lt;configurationParameter&gt;
&lt;name&gt;DBC_Strategy&lt;/name&gt;
&lt;description&gt;Strategy for dealing with double-byte
characters.&lt;/description&gt;
&lt;type&gt;String&lt;/type&gt;
&lt;multiValued&gt;false&lt;/multiValued&gt;
&lt;mandatory&gt;false&lt;/mandatory&gt;
&lt;/configurationParameter&gt;
&lt;/configurationGroup&gt;
&lt;/configurationParameters&gt;</pre>
<p>In this example, we are declaring a <code class="literal">DictionaryFile</code>
parameter that can have a different value for each of the languages that our AE
supports
&#8211; English (general), German, U.S. English, and Chinese. For Chinese
only, we also declare a <code class="literal">DBC_Strategy</code>
parameter.</p>
<p>We are using the <code class="literal">language_fallback</code> search
strategy, so if an annotator requests the dictionary file for the
<code class="literal">en-GB</code> (British English) group, we will fall back to the
more general <code class="literal">en</code> group.</p>
<p>Since we have defined <code class="literal">en</code> as the default group, this
value will be returned if the context is queried for the
<code class="literal">DictionaryFile</code> parameter without specifying any
group name, or if a nonexistent group name is specified.</p>
</div>
</div>
<div class="section" title="2.4.3.2.&nbsp;Configuration Parameter Settings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_settings">2.4.3.2.&nbsp;Configuration Parameter Settings</h4></div></div></div>
<p>For configuration parameters that are not part of any group, the
<code class="literal">&lt;configurationParameterSettings&gt;</code> element
looks like this:
</p><pre class="programlisting">&lt;configurationParameterSettings&gt;
&lt;nameValuePair&gt;
&lt;name&gt;[String]&lt;/name&gt;
&lt;value&gt;
&lt;string&gt;[String]&lt;/string&gt; |
&lt;integer&gt;[Integer]&lt;/integer&gt; |
&lt;float&gt;[Float]&lt;/float&gt; |
&lt;boolean&gt;true|false&lt;/boolean&gt; |
&lt;array&gt; ... &lt;/array&gt;
&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;nameValuePair&gt;
...
&lt;/nameValuePair&gt;
...
&lt;/configurationParameterSettings&gt;</pre>
<p>There are zero or more <code class="literal">nameValuePair</code> elements. Each
<code class="literal">nameValuePair</code> contains a name (which refers to one of the
configuration parameters) and a value for that parameter.</p>
<p>The <code class="literal">value</code> element contains an element that matches
the type of the parameter. For single-valued parameters, this is either
<code class="literal">&lt;string&gt;</code>, <code class="literal">&lt;integer&gt;</code>
, <code class="literal">&lt;float&gt;</code>, or
<code class="literal">&lt;boolean&gt;</code>. For multi-valued parameters, this is
an <code class="literal">&lt;array&gt;</code> element, which then contains zero or
more instances of the appropriate type of primitive value, e.g.:
</p><pre class="programlisting">&lt;array&gt;&lt;string&gt;One&lt;/string&gt;&lt;string&gt;Two&lt;/string&gt;&lt;/array&gt;</pre>
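<p>As a sketch, assuming a single-valued Integer parameter named
<code class="literal">MaxTokenLength</code> and a multi-valued String parameter
named <code class="literal">StopWords</code> have been declared (both names and
all values are hypothetical), their settings could be written as follows:</p>
<pre class="programlisting">&lt;configurationParameterSettings&gt;
  &lt;!-- single value: the child element matches the declared type --&gt;
  &lt;nameValuePair&gt;
    &lt;name&gt;MaxTokenLength&lt;/name&gt;
    &lt;value&gt;&lt;integer&gt;64&lt;/integer&gt;&lt;/value&gt;
  &lt;/nameValuePair&gt;
  &lt;!-- multiple values: an array of elements of the declared type --&gt;
  &lt;nameValuePair&gt;
    &lt;name&gt;StopWords&lt;/name&gt;
    &lt;value&gt;
      &lt;array&gt;
        &lt;string&gt;the&lt;/string&gt;
        &lt;string&gt;and&lt;/string&gt;
      &lt;/array&gt;
    &lt;/value&gt;
  &lt;/nameValuePair&gt;
&lt;/configurationParameterSettings&gt;</pre>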
<p>For parameters declared in configuration groups the
<code class="literal">&lt;configurationParameterSettings&gt;</code> element
looks like this:
</p><pre class="programlisting">&lt;configurationParameterSettings&gt;
&lt;settingsForGroup name="[String]"&gt;
[one or more &lt;nameValuePair&gt; elements]
&lt;/settingsForGroup&gt;
&lt;settingsForGroup name="[String]"&gt;
[one or more &lt;nameValuePair&gt; elements]
&lt;/settingsForGroup&gt;
...
&lt;/configurationParameterSettings&gt;</pre><p>
where each <code class="literal">&lt;settingsForGroup&gt;</code> element has a name
that matches one of the configuration groups declared under the
<code class="literal">&lt;configurationParameters&gt;</code> element and contains
the parameter settings for that group.</p>
<div class="section" title="Example"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_settings.example">Example</h5></div></div></div>
<p>Here are the settings that correspond to the parameter declarations in
the previous example:
</p><pre class="programlisting">&lt;configurationParameterSettings&gt;
&lt;settingsForGroup name="en"&gt;
&lt;nameValuePair&gt;
&lt;name&gt;DictionaryFile&lt;/name&gt;
&lt;value&gt;&lt;string&gt;resourcesEnglishdictionary.dat&lt;/string&gt;&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;/settingsForGroup&gt;
&lt;settingsForGroup name="en-US"&gt;
&lt;nameValuePair&gt;
&lt;name&gt;DictionaryFile&lt;/name&gt;
&lt;value&gt;&lt;string&gt;resourcesEnglish_USdictionary.dat&lt;/string&gt;&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;/settingsForGroup&gt;
&lt;settingsForGroup name="de"&gt;
&lt;nameValuePair&gt;
&lt;name&gt;DictionaryFile&lt;/name&gt;
&lt;value&gt;&lt;string&gt;resourcesDeutschdictionary.dat&lt;/string&gt;&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;/settingsForGroup&gt;
&lt;settingsForGroup name="zh"&gt;
&lt;nameValuePair&gt;
&lt;name&gt;DictionaryFile&lt;/name&gt;
&lt;value&gt;&lt;string&gt;resourcesChinesedictionary.dat&lt;/string&gt;&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;nameValuePair&gt;
&lt;name&gt;DBC_Strategy&lt;/name&gt;
&lt;value&gt;&lt;string&gt;default&lt;/string&gt;&lt;/value&gt;
&lt;/nameValuePair&gt;
&lt;/settingsForGroup&gt;
&lt;/configurationParameterSettings&gt;</pre>
</div>
</div>
<div class="section" title="2.4.3.3.&nbsp;Configuration Parameter Overrides"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides">2.4.3.3.&nbsp;Configuration Parameter Overrides</h4></div></div></div>
<p>In an aggregate Analysis Engine Descriptor, each
<code class="literal">&lt;configurationParameter&gt; </code>element should
contain an <code class="literal">&lt;overrides&gt;</code> element, with the
following syntax:</p>
<pre class="programlisting">&lt;overrides&gt;
&lt;parameter&gt;
[delegateAnalysisEngineKey]/[parameterName]
&lt;/parameter&gt;
&lt;parameter&gt;
[delegateAnalysisEngineKey]/[parameterName]
&lt;/parameter&gt;
...
&lt;/overrides&gt;</pre>
<p>Since aggregate Analysis Engines have no code associated with them, the
only way in which their configuration parameters can affect their processing
is by overriding the parameter values of one or more delegate analysis
engines. The <code class="literal">&lt;overrides&gt; </code>element determines
which parameters, in which delegate Analysis Engines, are overridden by this
configuration parameter.</p>
<p>For example, consider an aggregate Analysis Engine Descriptor that
contains delegate Analysis Engines with keys
<code class="literal">annotator1</code> and <code class="literal">annotator2</code> (as
declared in the &lt;delegateAnalysisEngine&gt; element &#8211; see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1.&nbsp;Delegate Analysis Engine Specifiers">Section&nbsp;2.4.2.1, &#8220;Delegate Analysis Engine Specifiers&#8221;</a>) and also declares a
configuration parameter as follows:
</p><pre class="programlisting">&lt;configurationParameter&gt;
&lt;name&gt;AggregateParam&lt;/name&gt;
&lt;type&gt;String&lt;/type&gt;
&lt;overrides&gt;
&lt;parameter&gt;annotator1/param1&lt;/parameter&gt;
&lt;parameter&gt;annotator2/param2&lt;/parameter&gt;
&lt;/overrides&gt;
&lt;/configurationParameter&gt;</pre>
<p>The value of the <code class="literal">AggregateParam</code> parameter
(whether assigned in the aggregate descriptor or at runtime by an
application) will override the value of parameter
<code class="literal">param1</code> in <code class="literal">annotator1</code> and also
override the value of parameter <code class="literal">param2</code> in
<code class="literal">annotator2</code>. No other parameters will be
affected. Note that <code class="literal">AggregateParam</code> may itself be overridden by a
parameter in an outer aggregate that has this aggregate as one of its delegates.
</p>
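<p>To make this concrete, a sketch of the corresponding setting in the aggregate
descriptor follows; the value <code class="literal">shared-value</code> is
hypothetical. With the declaration above, this single setting would be seen by
<code class="literal">annotator1</code> as the value of
<code class="literal">param1</code> and by <code class="literal">annotator2</code>
as the value of <code class="literal">param2</code>.</p>
<pre class="programlisting">&lt;configurationParameterSettings&gt;
  &lt;!-- set once in the aggregate; passed down to both overridden parameters --&gt;
  &lt;nameValuePair&gt;
    &lt;name&gt;AggregateParam&lt;/name&gt;
    &lt;value&gt;&lt;string&gt;shared-value&lt;/string&gt;&lt;/value&gt;
  &lt;/nameValuePair&gt;
&lt;/configurationParameterSettings&gt;</pre>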
<p>Prior to release 2.4.1, if an aggregate Analysis Engine descriptor
declared a configuration parameter with no explicit overrides, that
parameter would override any parameters having the same name within any
delegate analysis engine. Starting with release 2.4.1, support for this
usage has been dropped.</p>
</div>
<div class="section" title="2.4.3.4.&nbsp;External Configuration Parameter Overrides"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides">2.4.3.4.&nbsp;External Configuration Parameter Overrides</h4></div></div></div>
<p>
External parameter overrides are usually declared in primitive descriptors as a way to
easily modify the parameters in some or all of an application's annotators.
By using external settings files and shared parameter names the configuration
information can be specified without regard for a particular descriptor hierarchy.
</p>
<p>
Configuration parameter declarations in primitive and aggregate descriptors may
include an <code class="literal">&lt;externalOverrideName&gt;</code> element,
which specifies the name of a property that may be defined in an external settings file.
If this element is present, and if an entry can be found for its name in a settings
file, then this value overrides the value otherwise specified for this parameter.
</p>
<p>
The value overrides any value set in this descriptor or set by an override in a parent
aggregate. In primitive descriptors the value set by an external override is always
applied. In aggregate descriptors the value set by an external override applies to the
aggregate parameter, and is passed down to the overridden delegate parameters in the
usual way, i.e. only if the delegate's parameter has not been set by an external override.
</p>
<p>
In the absence of external overrides,
parameter evaluation can be viewed as proceeding from the primitive descriptor up through
any aggregates containing overrides, taking the last setting found. With external
overrides the search ends with the first external override found that has a value
assigned by a settings file.
</p>
<p>
The same external name may be used for multiple parameters;
the effect of this is that one setting will override multiple parameters.
</p>
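<p>
As a sketch of such sharing, the two declarations below are assumed to come from
the descriptors of two different annotators. The parameter names and the external
name <code class="literal">dictionary.location</code> are hypothetical. A single
entry for <code class="literal">dictionary.location</code> in a settings file
would then supply the value of both parameters.
</p>
<pre class="programlisting">&lt;!-- in the descriptor of a first annotator (hypothetical) --&gt;
&lt;configurationParameter&gt;
  &lt;name&gt;DictionaryFile&lt;/name&gt;
  &lt;externalOverrideName&gt;dictionary.location&lt;/externalOverrideName&gt;
  &lt;type&gt;String&lt;/type&gt;
&lt;/configurationParameter&gt;

&lt;!-- in the descriptor of a second annotator (hypothetical) --&gt;
&lt;configurationParameter&gt;
  &lt;name&gt;LexiconPath&lt;/name&gt;
  &lt;externalOverrideName&gt;dictionary.location&lt;/externalOverrideName&gt;
  &lt;type&gt;String&lt;/type&gt;
&lt;/configurationParameter&gt;</pre>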
<p>
The settings for all descriptors in a pipeline are usually loaded from one or more files
whose names are obtained from the Java system property <span class="emphasis"><em>UimaExternalOverrides</em></span>.
The value of the property must be a comma-separated list of resource names. If the name
has a prefix of "file:" or no prefix, the filesystem is searched. If the name has a
prefix of "path:" the rest must be a Java-style dotted name, similar to the name
attribute for descriptor imports. The dots are replaced by file separators and a suffix
of ".settings" is appended before searching the datapath and classpath.
For example: <code class="literal">-DUimaExternalOverrides=/data/file1.settings,file:relative/file2.settings,path:org.apache.uima.resources.file3</code>.
</p>
<p>
Override settings may also be specified when creating an analysis engine by putting a
<code class="literal">Settings</code> object in the additional parameters map for the
<code class="literal">produceAnalysisEngine</code> method. In this case the
Java system property <span class="emphasis"><em>UimaExternalOverrides</em></span> is ignored.
</p><pre class="programlisting">  // Construct an analysis engine that uses two settings files.
  // Files loaded first take precedence over files loaded later.
  Settings extSettings =
      UIMAFramework.getResourceSpecifierFactory().createSettings();
  for (String fname : new String[] { "externalOverride.settings",
                                     "default.settings" }) {
    // try-with-resources closes the stream even if load() throws
    try (FileInputStream fis = new FileInputStream(fname)) {
      extSettings.load(fis);
    }
  }
  Map&lt;String,Object&gt; aeParms = new HashMap&lt;String,Object&gt;();
  aeParms.put(Resource.PARAM_EXTERNAL_OVERRIDE_SETTINGS, extSettings);
  // desc is the ResourceSpecifier of the analysis engine to create
  AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc, aeParms);
</pre><p>
</p>
<p>
These external settings consist of key-value pairs stored in a
file using the UTF-8 character encoding, and written in a style similar to that
of Java properties files.
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem" style="list-style-type: circle"><p>
Leading whitespace is ignored.
</p></li><li class="listitem" style="list-style-type: circle"><p>
Comment lines start with '#' or '!'.
</p></li><li class="listitem" style="list-style-type: circle"><p>
The key and value are separated by whitespace, '=' or ':'.
</p></li><li class="listitem" style="list-style-type: circle"><p>
Keys must contain at least one character and only letters, digits, or the characters '. / - ~ _'.
</p></li><li class="listitem" style="list-style-type: circle"><p>
If a line ends with '\', it is continued on the following line (after removing any
leading whitespace from that line).
</p></li><li class="listitem" style="list-style-type: circle"><p>
Whitespace is trimmed from both keys and values.
</p></li><li class="listitem" style="list-style-type: circle"><p>
Duplicate keys are ignored &#8211; once a value is assigned to a key it cannot be changed.
</p></li><li class="listitem" style="list-style-type: circle"><p>
Values may reference other settings using the syntax '${key}'.
</p></li><li class="listitem" style="list-style-type: circle"><p>
Array values are represented as a list of strings separated by commas or line breaks,
and bracketed by the '[ ]' characters. The value must start with an '[' and is
terminated by the first unescaped ']' which must be at the end of a line.
The elements of an array (and hence the array size) may be indirectly specified using
the '${key}' syntax but the brackets '[ ]' must be explicitly specified.
</p></li><li class="listitem" style="list-style-type: circle"><p>
In values the special characters '$ { } [ , ] \' are treated as regular characters if
preceded by the escape character '\'.
</p></li></ul></div><p>
</p><pre class="programlisting">
key1 : value1
key2 = value 2
key3 element2, element3, element4
# Next assignment is ignored as key3 has already been set
key3 : value ignored
key4 = [ array element1, ${key3}, element5
element6 ]
key5 value with a reference ${key1} to key1
key6 : long value string \
continued from previous line (with leading whitespace stripped)
key7 = value without a reference \${not-a-key}
key8 \[ value that is not an array ]
key9 : [ array element1\, with embedded comma, element2 ]
</pre><p>
</p>
<p>
Multiple settings files are allowed; they are loaded in order, such that
early ones take precedence over later ones, following the first-assignment-wins rule.
So, if you have lots of settings,
you can put the defaults in one file, and then in an earlier file, override just the
ones you need to.
</p>
<p>
An external override name may be specified for a parameter declared in a group, but if
the parameter is in the common group or the group is declared with multiple names, the
external name is shared amongst all, i.e. these parameters cannot be given group-specific values.
</p>
</div>
<div class="section" title="2.4.3.5.&nbsp;Direct Access to External Configuration Parameters"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_access">2.4.3.5.&nbsp;Direct Access to External Configuration Parameters</h4></div></div></div>
<p>
Annotators and flow controllers can directly access these shared configuration
parameters from their UimaContext.
Direct access means an access where the key to select the shared parameter is the
parameter name as specified in the external configuration settings file.
</p><pre class="programlisting">
String value = aContext.getSharedSettingValue(paramName);
String[] values = aContext.getSharedSettingArray(arrayParamName);
String[] allNames = aContext.getSharedSettingNames();
</pre><p>
Java code called by an annotator or flow controller in the same thread or a child thread
can use the <code class="literal">UimaContextHolder</code> to get the annotator's UimaContext and
hence access the shared configuration parameters.
</p><pre class="programlisting">
UimaContext uimaContext = UimaContextHolder.getUimaContext();
if (uimaContext != null) {
value = uimaContext.getSharedSettingValue(paramName);
}
</pre><p>
The UIMA framework puts the context in an InheritableThreadLocal variable. The value
will be null if <code class="literal">getUimaContext</code> is not invoked by an annotator or flow
controller on the same thread or a child thread.
</p>
</div>
<div class="section" title="2.4.3.6.&nbsp;Other Uses for External Configuration Parameters"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.other_uses_for_external_configuration_parameters">2.4.3.6.&nbsp;Other Uses for External Configuration Parameters</h4></div></div></div>
<p>
Explicit references to shared configuration parameters can be specified as part of the
value of the name and location attributes of the <code class="literal">import</code> element
and in the value of the fileUrl for a <code class="literal">fileResourceSpecifier</code>
(see <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2.&nbsp;Imports">Section&nbsp;2.2, &#8220;Imports&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>).
</p>
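<p>
As a hedged sketch, assuming references here use the same
<code class="literal">${key}</code> syntax as values in settings files, and assuming
a hypothetical shared parameter named <code class="literal">models.dir</code>, such
references might look like the following. The file names are also hypothetical.
</p>
<pre class="programlisting">&lt;!-- import whose location depends on the shared parameter models.dir --&gt;
&lt;import location="${models.dir}/TypeSystem.xml"/&gt;

&lt;!-- file resource whose URL depends on the same shared parameter --&gt;
&lt;fileResourceSpecifier&gt;
  &lt;fileUrl&gt;file:${models.dir}/dictionary.dat&lt;/fileUrl&gt;
&lt;/fileResourceSpecifier&gt;</pre>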
</div>
</div>
</div>
<div class="section" title="2.5.&nbsp;Flow Controller Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.flow_controller">2.5.&nbsp;Flow Controller Descriptors</h2></div></div></div>
<p>The basic structure of a Flow Controller Descriptor is as follows:
</p><pre class="programlisting">&lt;?xml version="1.0" ?&gt;
&lt;flowControllerDescription
xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
&lt;implementationName&gt;[ClassName]&lt;/implementationName&gt;
&lt;processingResourceMetaData&gt;
...
&lt;/processingResourceMetaData&gt;
&lt;externalResourceDependencies&gt;
...
&lt;/externalResourceDependencies&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/flowControllerDescription&gt;</pre>
<p>The <code class="literal">frameworkImplementation</code> element must always be set to
the value <code class="literal">org.apache.uima.java</code>.</p>
<p>The <code class="literal">implementationName</code> element must contain the
fully-qualified class name of the Flow Controller implementation. This must name a
class that implements the <code class="literal">FlowController</code> interface.</p>
<p>The <code class="literal">processingResourceMetaData</code> element contains
essentially the same information as a Primitive Analysis Engine Descriptor's
<code class="literal">analysisEngineMetaData</code> element, described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2.&nbsp;Analysis Engine MetaData">Section&nbsp;2.4.1.2, &#8220;Analysis Engine MetaData&#8221;</a>.</p>
<p>The <code class="literal">externalResourceDependencies</code> and
<code class="literal">resourceManagerConfiguration</code> elements are exactly the same as
in Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8.&nbsp;External Resource Dependencies">Section&nbsp;2.4.1.8, &#8220;External Resource Dependencies&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>).</p>
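<p>Putting these pieces together, a filled-in descriptor might look like the
following sketch. The implementation class, the component name, and the
<code class="literal">SkipEmptyDocuments</code> parameter are hypothetical, and
elements not needed for the illustration are left out on the assumption that they
are optional.</p>
<pre class="programlisting">&lt;?xml version="1.0" ?&gt;
&lt;flowControllerDescription
    xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
  &lt;!-- hypothetical class that implements the FlowController interface --&gt;
  &lt;implementationName&gt;com.example.flow.ExampleFlowController&lt;/implementationName&gt;
  &lt;processingResourceMetaData&gt;
    &lt;name&gt;Example Flow Controller&lt;/name&gt;
    &lt;configurationParameters&gt;
      &lt;configurationParameter&gt;
        &lt;name&gt;SkipEmptyDocuments&lt;/name&gt;
        &lt;type&gt;Boolean&lt;/type&gt;
        &lt;multiValued&gt;false&lt;/multiValued&gt;
        &lt;mandatory&gt;false&lt;/mandatory&gt;
      &lt;/configurationParameter&gt;
    &lt;/configurationParameters&gt;
    &lt;configurationParameterSettings&gt;
      &lt;nameValuePair&gt;
        &lt;name&gt;SkipEmptyDocuments&lt;/name&gt;
        &lt;value&gt;&lt;boolean&gt;true&lt;/boolean&gt;&lt;/value&gt;
      &lt;/nameValuePair&gt;
    &lt;/configurationParameterSettings&gt;
  &lt;/processingResourceMetaData&gt;
&lt;/flowControllerDescription&gt;</pre>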
</div>
<div class="section" title="2.6.&nbsp;Collection Processing Component Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.collection_processing_parts">2.6.&nbsp;Collection Processing Component Descriptors</h2></div></div></div>
<p>There are three types of Collection Processing Components &#8211; Collection
Readers, CAS Initializers (deprecated as of UIMA Version 2), and CAS Consumers. Each
type of component has a corresponding descriptor. The structure of these descriptors
is very similar to that of primitive Analysis Engine Descriptors.</p>
<div class="section" title="2.6.1.&nbsp;Collection Reader Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader">2.6.1.&nbsp;Collection Reader Descriptors</h3></div></div></div>
<p>The basic structure of a Collection Reader descriptor is as follows:
</p><pre class="programlisting">&lt;?xml version="1.0" ?&gt;
&lt;collectionReaderDescription
xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
&lt;implementationName&gt;[ClassName]&lt;/implementationName&gt;
&lt;processingResourceMetaData&gt;
...
&lt;/processingResourceMetaData&gt;
&lt;externalResourceDependencies&gt;
...
&lt;/externalResourceDependencies&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/collectionReaderDescription&gt;</pre>
<p>The <code class="literal">frameworkImplementation</code> element must always be set
to the value <code class="literal">org.apache.uima.java</code>.</p>
<p>The <code class="literal">implementationName</code> element contains the
fully-qualified class name of the Collection Reader implementation. This must name
a class that implements the <code class="literal">CollectionReader</code>
interface.</p>
<p>The <code class="literal">processingResourceMetaData</code> element contains
essentially the same information as a Primitive Analysis Engine
Descriptor's <code class="literal">analysisEngineMetaData</code> element:
</p><pre class="programlisting">&lt;processingResourceMetaData&gt;
&lt;name&gt; [String] &lt;/name&gt;
&lt;description&gt;[String]&lt;/description&gt;
&lt;version&gt;[String]&lt;/version&gt;
&lt;vendor&gt;[String]&lt;/vendor&gt;
&lt;configurationParameters&gt;
...
&lt;/configurationParameters&gt;
&lt;configurationParameterSettings&gt;
...
&lt;/configurationParameterSettings&gt;
&lt;typeSystemDescription&gt;
...
&lt;/typeSystemDescription&gt;
&lt;typePriorities&gt;
...
&lt;/typePriorities&gt;
&lt;fsIndexes&gt;
...
&lt;/fsIndexes&gt;
&lt;capabilities&gt;
...
&lt;/capabilities&gt;
&lt;/processingResourceMetaData&gt;</pre>
<p>The contents of these elements are the same as those described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2.&nbsp;Analysis Engine MetaData">Section&nbsp;2.4.1.2, &#8220;Analysis Engine MetaData&#8221;</a>, with the exception that the capabilities
section should not declare any inputs (because the Collection Reader is always the
first component to receive the CAS).</p>
<p>The <code class="literal">externalResourceDependencies</code> and
<code class="literal">resourceManagerConfiguration</code> elements are exactly the same
as in the Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8.&nbsp;External Resource Dependencies">Section&nbsp;2.4.1.8, &#8220;External Resource Dependencies&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>).</p>
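<p>As an illustration, a filled-in Collection Reader descriptor might look like the
following minimal sketch. The class name, component name, parameter name, and values are
hypothetical placeholders for your own implementation. The
<code class="literal">externalResourceDependencies</code> and
<code class="literal">resourceManagerConfiguration</code> elements are left out for brevity,
and the capability declares no inputs.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;collectionReaderDescription
    xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
  &lt;!-- hypothetical class; it must implement CollectionReader --&gt;
  &lt;implementationName&gt;com.example.uima.ExampleCollectionReader&lt;/implementationName&gt;
  &lt;processingResourceMetaData&gt;
    &lt;name&gt;Example Collection Reader&lt;/name&gt;
    &lt;description&gt;Reads documents from a directory.&lt;/description&gt;
    &lt;version&gt;1.0&lt;/version&gt;
    &lt;vendor&gt;Example Org&lt;/vendor&gt;
    &lt;configurationParameters&gt;
      &lt;!-- hypothetical parameter read by the reader at initialization --&gt;
      &lt;configurationParameter&gt;
        &lt;name&gt;InputDirectory&lt;/name&gt;
        &lt;type&gt;String&lt;/type&gt;
        &lt;multiValued&gt;false&lt;/multiValued&gt;
        &lt;mandatory&gt;true&lt;/mandatory&gt;
      &lt;/configurationParameter&gt;
    &lt;/configurationParameters&gt;
    &lt;configurationParameterSettings&gt;
      &lt;nameValuePair&gt;
        &lt;name&gt;InputDirectory&lt;/name&gt;
        &lt;value&gt;&lt;string&gt;data/input&lt;/string&gt;&lt;/value&gt;
      &lt;/nameValuePair&gt;
    &lt;/configurationParameterSettings&gt;
    &lt;capabilities&gt;
      &lt;capability&gt;
        &lt;!-- no inputs: the reader is the first component to receive the CAS --&gt;
        &lt;inputs/&gt;
        &lt;outputs&gt;
          &lt;type allAnnotatorFeatures="true"&gt;uima.tcas.DocumentAnnotation&lt;/type&gt;
        &lt;/outputs&gt;
      &lt;/capability&gt;
    &lt;/capabilities&gt;
  &lt;/processingResourceMetaData&gt;
&lt;/collectionReaderDescription&gt;</pre>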
</div>
<div class="section" title="2.6.2.&nbsp;CAS Initializer Descriptors (deprecated)"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.cas_initializer">2.6.2.&nbsp;CAS Initializer Descriptors (deprecated)</h3></div></div></div>
<p>The basic structure of a CAS Initializer Descriptor is as follows:
</p><pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;casInitializerDescription
xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
&lt;implementationName&gt;[ClassName] &lt;/implementationName&gt;
&lt;processingResourceMetaData&gt;
...
&lt;/processingResourceMetaData&gt;
&lt;externalResourceDependencies&gt;
...
&lt;/externalResourceDependencies&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/casInitializerDescription&gt;</pre>
<p>The <code class="literal">frameworkImplementation</code> element must always be set
to the value <code class="literal">org.apache.uima.java</code>.</p>
<p>The <code class="literal">implementationName</code> element contains the
fully-qualified class name of the CAS Initializer implementation. This must name a
class that implements the <code class="literal">CasInitializer</code> interface.</p>
<p>The <code class="literal">processingResourceMetaData</code> element contains
essentially the same information as a Primitive Analysis Engine
Descriptor's <code class="literal">analysisEngineMetaData</code> element,
as described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2.&nbsp;Analysis Engine MetaData">Section&nbsp;2.4.1.2, &#8220;Analysis Engine MetaData&#8221;</a>, with the exception of some
changes to the capabilities section. A CAS Initializer's capabilities
element looks like this:
</p><pre class="programlisting">&lt;capabilities&gt;
&lt;capability&gt;
&lt;outputs&gt;
&lt;type allAnnotatorFeatures="true|false"&gt;[String]&lt;/type&gt;
&lt;type&gt;[TypeName]&lt;/type&gt;
...
&lt;feature&gt;[TypeName]:[Name]&lt;/feature&gt;
...
&lt;/outputs&gt;
&lt;outputSofas&gt;
&lt;sofaName&gt;[name]&lt;/sofaName&gt;
...
&lt;/outputSofas&gt;
&lt;mimeTypesSupported&gt;
&lt;mimeType&gt;[MIME Type]&lt;/mimeType&gt;
...
&lt;/mimeTypesSupported&gt;
&lt;/capability&gt;
&lt;capability&gt;
...
&lt;/capability&gt;
...
&lt;/capabilities&gt;</pre>
<p>The differences between a CAS Initializer's capabilities declaration
and an Analysis Engine's capabilities declaration are that the CAS Initializer does not
declare any input CAS types and features or input Sofas (because it is always the first
to operate on a CAS), it doesn't have a language specifier, and that the CAS
Initializer may declare a set of MIME types that it supports for its input documents.
Examples include: text/plain, text/html, and application/pdf. For a list of MIME
types see <a class="ulink" href="http://www.iana.org/assignments/media-types/" target="_top">http://www.iana.org/assignments/media-types/</a>. This
information is currently for users' reference only; the framework does not
use it for anything. This may change in future versions.</p>
<p>The <code class="literal">externalResourceDependencies</code> and
<code class="literal">resourceManagerConfiguration</code> elements are exactly the same
as in the Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8.&nbsp;External Resource Dependencies">Section&nbsp;2.4.1.8, &#8220;External Resource Dependencies&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>).</p>
</div>
<div class="section" title="2.6.3.&nbsp;CAS Consumer Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.cas_consumer">2.6.3.&nbsp;CAS Consumer Descriptors</h3></div></div></div>
<p>The basic structure of a CAS Consumer Descriptor is as follows:
</p><pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;casConsumerDescription
xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
&lt;implementationName&gt;[ClassName]&lt;/implementationName&gt;
&lt;processingResourceMetaData&gt;
...
&lt;/processingResourceMetaData&gt;
&lt;externalResourceDependencies&gt;
...
&lt;/externalResourceDependencies&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/casConsumerDescription&gt;</pre>
<p>The <code class="literal">frameworkImplementation</code> element currently must
have the value <code class="literal">org.apache.uima.java</code> or
<code class="literal">org.apache.uima.cpp</code>.</p>
<p>The <code class="literal">implementationName</code> element must contain the
fully-qualified class name of the CAS Consumer implementation, or the name
of a .dll or .so file for C++ implementations. For Java, the named class must
implement the <code class="literal">CasConsumer</code> interface.</p>
<p>The <code class="literal">processingResourceMetaData</code> element contains
essentially the same information as a Primitive Analysis Engine Descriptor's
<code class="literal">analysisEngineMetaData</code> element, described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2.&nbsp;Analysis Engine MetaData">Section&nbsp;2.4.1.2, &#8220;Analysis Engine MetaData&#8221;</a>, except that the CAS Consumer Descriptor's
<code class="literal">capabilities</code> element should not declare outputs or
outputSofas (since CAS Consumers do not modify the CAS).</p>
<p>The <code class="literal">externalResourceDependencies</code> and
<code class="literal">resourceManagerConfiguration</code> elements are exactly the same
as in Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8.&nbsp;External Resource Dependencies">Section&nbsp;2.4.1.8, &#8220;External Resource Dependencies&#8221;</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9.&nbsp;Resource Manager Configuration">Section&nbsp;2.4.1.9, &#8220;Resource Manager Configuration&#8221;</a>).</p>
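<p>For illustration, a small CAS Consumer descriptor might look like the sketch below.
The class name and metadata values are hypothetical, and the optional resource
elements are left out for brevity. The capability lists an input type and leaves the
outputs empty, in line with the rule that CAS Consumers do not modify the CAS.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;casConsumerDescription
    xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
  &lt;!-- hypothetical class; for Java it must implement CasConsumer --&gt;
  &lt;implementationName&gt;com.example.uima.ExampleCasConsumer&lt;/implementationName&gt;
  &lt;processingResourceMetaData&gt;
    &lt;name&gt;Example CAS Consumer&lt;/name&gt;
    &lt;description&gt;Writes analysis results to an external store.&lt;/description&gt;
    &lt;version&gt;1.0&lt;/version&gt;
    &lt;vendor&gt;Example Org&lt;/vendor&gt;
    &lt;capabilities&gt;
      &lt;capability&gt;
        &lt;inputs&gt;
          &lt;type allAnnotatorFeatures="true"&gt;uima.tcas.DocumentAnnotation&lt;/type&gt;
        &lt;/inputs&gt;
        &lt;!-- nothing is declared as output --&gt;
        &lt;outputs/&gt;
      &lt;/capability&gt;
    &lt;/capabilities&gt;
  &lt;/processingResourceMetaData&gt;
&lt;/casConsumerDescription&gt;</pre>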
</div>
</div>
<div class="section" title="2.7.&nbsp;Service Client Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.service_client">2.7.&nbsp;Service Client Descriptors</h2></div></div></div>
<p>Service Client Descriptors specify only a location of a remote service. They are
therefore much simpler in structure. In the UIMA SDK, a Service Client Descriptor that
refers to a valid Analysis Engine or CAS Consumer service can be used in place of the
actual Analysis Engine or CAS Consumer Descriptor. The UIMA SDK will handle the details
of calling the remote service. (For details on <span class="emphasis"><em>deploying</em></span> an
Analysis Engine or CAS Consumer as a service, see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a>.)</p>
<p>The UIMA SDK is extensible to support different types of remote services. In future
versions, there may be different variations of service client descriptors that cater
to different types of services. For now, the only type of service client descriptor is
the <code class="literal">uriSpecifier</code>, which supports the SOAP and Vinci
protocols.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;resourceType&gt;AnalysisEngine | CasConsumer &lt;/resourceType&gt;
&lt;uri&gt;[URI]&lt;/uri&gt;
&lt;protocol&gt;SOAP | SOAPwithAttachments | Vinci&lt;/protocol&gt;
&lt;timeout&gt;[Integer]&lt;/timeout&gt;
&lt;parameters&gt;
&lt;parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/&gt;
&lt;parameter name="VNS_PORT" value="9000"/&gt;
&lt;parameter name="GetMetaDataTimeout" value="[Integer]"/&gt;
&lt;/parameters&gt;
&lt;/uriSpecifier&gt;</pre>
<p>The <code class="literal">resourceType</code> element is required for new descriptors,
but is currently allowed to be omitted for backward compatibility. It specifies the
type of component (Analysis Engine or CAS Consumer) that is implemented by the service
endpoint described by this descriptor.</p>
<p>The <code class="literal">uri</code> element contains the URI for the web service. (Note
that in the case of Vinci, this will be the service name, which is looked up in the Vinci
Naming Service.)</p>
<p>The <code class="literal">protocol</code> element may be set to SOAP,
SOAPwithAttachments, or Vinci; other protocols may be added later. These specify the
particular data transport format that will be used.</p>
<p>The <code class="literal">timeout</code> element is optional. If present, it specifies
the number of milliseconds to wait for a request to be processed before an exception is
thrown. If the value is zero or less, the client waits indefinitely. If no timeout is
specified, a default value (currently 60 seconds) is used.</p>
<p>The parameters element is optional. If present, it can specify values for each
of the following:
</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">VNS_HOST</code>: host name for the Vinci naming service.
</p></li><li class="listitem"><p><code class="literal">VNS_PORT</code>: port number for the Vinci naming service.
</p></li><li class="listitem"><p><code class="literal">GetMetaDataTimeout</code>: timeout period (in milliseconds) for
the GetMetaData call. If not specified, the default is 60 seconds. This may need
to be set higher if there are a lot of clients competing for connections to the service.
</p></li></ul></div>
<p>If <code class="literal">VNS_HOST</code> and <code class="literal">VNS_PORT</code> are not specified
in the descriptor, their values are taken from
parameters passed on the Java command line using the
<code class="literal">-DVNS_HOST=&lt;host&gt;</code> and/or
<code class="literal">-DVNS_PORT=&lt;port&gt;</code> system arguments. If a value is present
neither in the descriptor nor as a system argument, it defaults to
<code class="literal">localhost</code> for <code class="literal">VNS_HOST</code> and
<code class="literal">9000</code> for <code class="literal">VNS_PORT</code>.</p>
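<p>As a concrete sketch, a Service Client Descriptor for an Analysis Engine service
reached over Vinci might look like the following. The service name in the
<code class="literal">uri</code> element and the timeout values are made up for
illustration; the <code class="literal">VNS_HOST</code> and
<code class="literal">VNS_PORT</code> values simply restate the defaults.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;resourceType&gt;AnalysisEngine&lt;/resourceType&gt;
  &lt;!-- for Vinci this is the service name looked up in the naming service
       (hypothetical name) --&gt;
  &lt;uri&gt;com.example.ExampleAnnotatorService&lt;/uri&gt;
  &lt;protocol&gt;Vinci&lt;/protocol&gt;
  &lt;!-- wait up to 30 seconds for each request --&gt;
  &lt;timeout&gt;30000&lt;/timeout&gt;
  &lt;parameters&gt;
    &lt;parameter name="VNS_HOST" value="localhost"/&gt;
    &lt;parameter name="VNS_PORT" value="9000"/&gt;
    &lt;!-- allow up to 2 minutes for the GetMetaData call --&gt;
    &lt;parameter name="GetMetaDataTimeout" value="120000"/&gt;
  &lt;/parameters&gt;
&lt;/uriSpecifier&gt;</pre>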
<p>For details on how to deploy and call Analysis Engine and CAS Consumer services, see
<a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a>.</p>
</div>
<div class="section" title="2.8.&nbsp;Custom Resource Specifiers"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.custom_resource_specifiers">2.8.&nbsp;Custom Resource Specifiers</h2></div></div></div>
<p>A Custom Resource Specifier allows you to plug in your own Java class as a UIMA Resource.
For example, you can support a new service protocol by plugging in a Java class that implements
the UIMA <code class="literal">AnalysisEngine</code> interface and communicates with the remote service.</p>
<p>A Custom Resource Specifier has the following format:</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;resourceClassName&gt;[Java Class Name]&lt;/resourceClassName&gt;
&lt;parameters&gt;
&lt;parameter name="[String]" value="[String]"/&gt;
&lt;parameter name="[String]" value="[String]"/&gt;
&lt;/parameters&gt;
&lt;/customResourceSpecifier&gt;</pre>
<p>The <code class="literal">resourceClassName</code> element must contain the fully-qualified name of a Java class
that can be found in the classpath (including the UIMA extension classpath, if you have specified one using
the <code class="literal">ResourceManager.setExtensionClassPath</code> method). This class must implement the
UIMA <code class="literal">Resource</code> interface.</p>
<p>When an application calls the <code class="literal">UIMAFramework.produceResource</code> method and passes a
<code class="literal">CustomResourceSpecifier</code>, the UIMA framework will load the named class and call its
<code class="literal">initialize(ResourceSpecifier,Map)</code> method, passing the <code class="literal">CustomResourceSpecifier</code>
as the first argument. Your class can override the <code class="literal">initialize</code> method and use the
<code class="literal">CustomResourceSpecifier</code> API to get access to the <code class="literal">parameter</code> names and values
specified in the XML.</p>
<p>If you are using a custom resource specifier to plug in a class that implements a new service protocol,
your class must also implement the <code class="literal">AnalysisEngine</code> interface. Generally it should also
extend <code class="literal">AnalysisEngineImplBase</code>. The key methods that should be implemented are
<code class="literal">getMetaData</code>, <code class="literal">processAndOutputNewCASes</code>,
<code class="literal">collectionProcessComplete</code>, and <code class="literal">destroy</code>.</p>
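<p>A filled-in Custom Resource Specifier might look like the sketch below. The class
name and both parameters are hypothetical; the framework does not interpret the
parameters, so their names and meanings are whatever your class chooses to read in its
<code class="literal">initialize</code> method.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;!-- hypothetical class; it must implement the UIMA Resource interface --&gt;
  &lt;resourceClassName&gt;com.example.uima.ExampleProtocolAnalysisEngine&lt;/resourceClassName&gt;
  &lt;parameters&gt;
    &lt;!-- hypothetical parameters, read by the class itself --&gt;
    &lt;parameter name="endpoint" value="http://localhost:8080/analyze"/&gt;
    &lt;parameter name="timeoutMillis" value="5000"/&gt;
  &lt;/parameters&gt;
&lt;/customResourceSpecifier&gt;</pre>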
</div>
<div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e71" href="#d5e71" class="para">1</a>] </sup>This component is deprecated and should not be used in new
development.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e690" href="#d5e690" class="para">2</a>] </sup>Deprecated; use
UimaContext instead.</p></div></div></div>
<div class="chapter" title="Chapter&nbsp;3.&nbsp;Collection Processing Engine Descriptor Reference" id="ugr.ref.xml.cpe_descriptor"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;3.&nbsp;Collection Processing Engine Descriptor Reference</h2></div></div></div>
<p>A UIMA <span class="emphasis"><em>Collection Processing Engine</em></span> (CPE) is a combination
of UIMA components assembled to analyze a collection of artifacts. A CPE is an
instantiation of the UIMA <span class="emphasis"><em>Collection Processing Architecture</em></span>,
which defines the collection processing components, interfaces, and APIs. A CPE is
executed by a UIMA framework component called the <span class="emphasis"><em>Collection Processing
Manager</em></span> (CPM), which provides a number of services for deploying CPEs,
running CPEs, and handling errors.</p>
<p>A CPE can be assembled programmatically within a Java application, or it can be
assembled declaratively via a CPE configuration specification, called a CPE
Descriptor. This chapter describes the format of the CPE Descriptor.</p>
<p>Details about the CPE, including its function, sub-components, APIs, and related
tools, can be found in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe" class="olink">Chapter&nbsp;2, <i>Collection Processing Engine Developer's Guide</i></a>. Here we briefly summarize the CPE to define terms and
provide context for the later sections that describe the CPE Descriptor.</p>
<div class="section" title="3.1.&nbsp;CPE Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.overview">3.1.&nbsp;CPE Overview</h2></div></div></div>
<div class="figure"><a name="ugr.ref.xml.cpe_descriptor.overview.fig.runtime"></a><div class="figure-contents">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.xml.cpe_descriptor/image002.png" width="574" alt="CPE Runtime Overview diagram"></td></tr></table></div>
</div><p class="title"><b>Figure&nbsp;3.1.&nbsp;CPE Runtime Overview</b></p></div><br class="figure-break">
<p>An illustration of the CPE runtime is shown in <a class="xref" href="#ugr.ref.xml.cpe_descriptor.overview.fig.runtime" title="Figure&nbsp;3.1.&nbsp;CPE Runtime Overview">Figure&nbsp;3.1, &#8220;CPE Runtime Overview&#8221;</a>. Some of the CPE components, such as the
<span class="emphasis"><em>queues</em></span> and <span class="emphasis"><em>processing pipelines</em></span>, are
internal to the CPE, but their behavior and deployment may be configured using the CPE
Descriptor. Other CPE components, such as the <span class="emphasis"><em>Collection
Reader</em></span> and <span class="emphasis"><em>CAS Processors</em></span>, are defined and
configured externally from the CPE and then plugged in to the CPE to create the overall
engine. The parts of a CPE are:
</p><div class="variablelist"><dl><dt><span class="term">Collection Reader</span></dt><dd><p>understands the native data collection format and iterates
over the collection producing subjects of analysis</p></dd><dt><span class="term">CAS Initializer<sup>[<a name="d5e1067" href="#ftn.d5e1067" class="footnote">3</a>]</sup>
</span></dt><dd><p>initializes a CAS with a subject of analysis</p>
</dd><dt><span class="term">Artifact Producer</span></dt><dd><p>asynchronously pulls CASes from the Collection Reader,
creates batches of CASes and puts them into the work queue</p></dd><dt><span class="term">Work Queue</span></dt><dd><p>shared queue containing batches of CASes queued by the Artifact
Producer for analysis by Analysis Engines</p>
</dd><dt><span class="term">B1-Bn</span></dt><dd><p>individual batches containing 1 or more CASes</p>
</dd><dt><span class="term">AE1-AEn</span></dt><dd><p>Analysis Engines arranged by a CPE descriptor</p>
</dd><dt><span class="term">Processing Pipelines</span></dt><dd><p>each pipeline runs in a separate thread and contains a
replicated set of the Analysis Engines running in the defined sequence</p>
</dd><dt><span class="term">Output Queue</span></dt><dd><p>holds batches of CASes with analysis results intended for CAS
Consumers</p></dd><dt><span class="term">CAS Consumers</span></dt><dd><p>perform collection level analysis over the CASes and extract
analysis results, e.g., creating indexes or databases</p></dd></dl></div><p>
</p>
</div>
<div class="section" title="3.2.&nbsp;Notation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.notation">3.2.&nbsp;Notation</h2></div></div></div>
<p>CPE Descriptors are XML files. This chapter uses an informal notation to specify
the syntax of CPE Descriptors.</p>
<p>The notation used in this chapter is:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An ellipsis (...) inside an element body indicates
that the substructure of that element has been omitted (to be described in another
section of this chapter). An example of this would be:
</p><pre class="programlisting">&lt;collectionReader&gt;
...
&lt;/collectionReader&gt;</pre>
</li><li class="listitem"><p>An ellipsis immediately after an element indicates that the
element type may be repeated arbitrarily many times. For example:
</p><pre class="programlisting">&lt;parameter&gt;[String]&lt;/parameter&gt;
&lt;parameter&gt;[String]&lt;/parameter&gt;
...</pre><p>
indicates that there may be arbitrarily many parameter elements in this
context.</p></li><li class="listitem"><p>An ellipsis inside an element means details of the attributes
associated with that element are defined later, e.g.:
</p><pre class="programlisting">&lt;casProcessor ...&gt;</pre>
</li><li class="listitem"><p>Bracketed expressions (e.g. <code class="literal">[String]</code>)
indicate the type of value that may be used at that location.</p></li><li class="listitem"><p>A vertical bar, as in <code class="literal">true|false</code>, indicates
alternatives. This can be applied to literal values, bracketed type names, and
elements. </p></li></ul></div>
<p>Which elements are optional and which are required is specified in prose, not in the
syntax definition.</p>
</div>
<div class="section" title="3.3.&nbsp;Imports"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.imports">3.3.&nbsp;Imports</h2></div></div></div>
<p>As of version 2.2, a CPE Descriptor can use the same <code class="literal">import</code> mechanism
as other component descriptors. This allows referring to component
descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
or the classpath/datapath. For details see <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter&nbsp;2, <i>Component Descriptor Reference</i></a>.</p>
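<p>As a brief sketch, the two forms of <code class="literal">import</code> might appear
inside a CPE Descriptor as shown below, assuming the <code class="literal">location</code>
and <code class="literal">name</code> attributes described in the Component Descriptor
Reference. The path and the dotted name are hypothetical.</p>
<pre class="programlisting">&lt;!-- by location: resolved relative to the CPE descriptor --&gt;
&lt;descriptor&gt;
  &lt;import location="desc/ExampleCollectionReader.xml"/&gt;
&lt;/descriptor&gt;

&lt;!-- by name: looked up on the classpath/datapath --&gt;
&lt;descriptor&gt;
  &lt;import name="com.example.desc.ExampleCollectionReader"/&gt;
&lt;/descriptor&gt;</pre>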
<p>The following older syntax is still supported, but <span class="emphasis"><em>not recommended</em></span>:
</p><pre class="programlisting">&lt;descriptor&gt;
&lt;include href="[URL or File]"/&gt;
&lt;/descriptor&gt;</pre>
<p>The <code class="literal">href</code> value, shown above as <code class="literal">[URL or File]</code>, is a URL or a filename for the descriptor of the
incorporated component. The framework first attempts to resolve the value as a URL.</p>
<p>
Relative paths in an <code class="literal">include</code> are resolved relative to the current working directory
(NOT the CPE descriptor location as is the case for <code class="literal">import</code>).
A filename relative to another directory can be specified using the <code class="literal">CPM_HOME</code>
variable, e.g.,
</p><pre class="programlisting">&lt;descriptor&gt;
&lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
&lt;/descriptor&gt;</pre><p>
In this case, the value for the <code class="literal">CPM_HOME</code> variable must be
provided to the CPE by specifying it on the Java command line, e.g.,
</p><pre class="programlisting">java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</pre><p>
</p>
</div>
<div class="section" title="3.4.&nbsp;CPE Descriptor Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor">3.4.&nbsp;CPE Descriptor Overview</h2></div></div></div>
<p>A CPE Descriptor consists of information describing the following four main
elements.</p>
<div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>The <span class="emphasis"><em>Collection Reader</em></span>, which
is responsible for gathering artifacts and initializing the Common Analysis
Structure (CAS) used to support processing in the UIMA collection processing
engine.</p></li><li class="listitem"><p>The <span class="emphasis"><em>CAS Processors</em></span>, responsible for
analyzing individual artifacts, analyzing across artifacts, and extracting
analysis results. CAS Processors include <span class="emphasis"><em>Analysis Engines</em></span>
and <span class="emphasis"><em>CAS Consumers</em></span>.</p></li><li class="listitem"><p>Operational parameters of the <span class="emphasis"><em>Collection Processing
Manager</em></span> (CPM), such as checkpoint frequency and deployment
mode.</p></li><li class="listitem"><p>Resource Manager Configuration (optional). </p></li></ol></div>
<p>The CPE Descriptor has the following high level skeleton:
</p><pre class="programlisting">&lt;?xml version="1.0"?&gt;
&lt;cpeDescription&gt;
&lt;collectionReader&gt;
...
&lt;/collectionReader&gt;
&lt;casProcessors&gt;
...
&lt;/casProcessors&gt;
&lt;cpeConfig&gt;
...
&lt;/cpeConfig&gt;
&lt;resourceManagerConfiguration&gt;
...
&lt;/resourceManagerConfiguration&gt;
&lt;/cpeDescription&gt;</pre>
<p>Details of each of the four main elements are described in the sections that
follow.</p>
</div>
<div class="section" title="3.5.&nbsp;Collection Reader"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.collection_reader">3.5.&nbsp;Collection Reader</h2></div></div></div>
<p>The <code class="literal">&lt;collectionReader&gt;</code> section identifies the
Collection Reader and optional CAS Initializer that are to be used in the CPE. The
Collection Reader is responsible for retrieval of artifacts from a collection
outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
is responsible for initializing the CAS with the artifact.</p>
<p>A Collection Reader may initialize the CAS itself, in which case it does not
require a CAS Initializer. This should be clearly specified in the documentation for
the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
does not make use of a CAS Initializer will not cause an error, but the specified CAS
Initializer will not be used.</p>
<p>The complete structure of the <code class="literal">&lt;collectionReader&gt;</code>
section is:
</p><pre class="programlisting">&lt;collectionReader&gt;
&lt;collectionIterator&gt;
&lt;descriptor&gt;
&lt;import ...&gt; | &lt;include .../&gt;
&lt;/descriptor&gt;
&lt;configurationParameterSettings&gt;...&lt;/configurationParameterSettings&gt;
&lt;sofaNameMappings&gt;...&lt;/sofaNameMappings&gt;
&lt;/collectionIterator&gt;
&lt;casInitializer&gt;
&lt;descriptor&gt;
&lt;import ...&gt; | &lt;include .../&gt;
&lt;/descriptor&gt;
&lt;configurationParameterSettings&gt;...&lt;/configurationParameterSettings&gt;
&lt;sofaNameMappings&gt;...&lt;/sofaNameMappings&gt;
&lt;/casInitializer&gt;
&lt;/collectionReader&gt;</pre>
<p>The <code class="literal">&lt;collectionIterator&gt;</code> identifies the
descriptor for the Collection Reader, and the
<code class="literal">&lt;casInitializer&gt;</code> identifies the descriptor for the CAS Initializer. The format and
details of the Collection Reader and CAS Initializer descriptors are described in
<a href="references.html#ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader" class="olink">Section&nbsp;2.6.1, &#8220;Collection Reader Descriptors&#8221;</a> and
<a href="references.html#ugr.ref.xml.component_descriptor.collection_processing_parts.cas_initializer" class="olink">Section&nbsp;2.6.2, &#8220;CAS Initializer Descriptors (deprecated)&#8221;</a>.
The <code class="literal">&lt;configurationParameterSettings&gt;</code> and the
<code class="literal">&lt;sofaNameMappings&gt;</code> elements are described in the next
section.</p>
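<p>For example, a <code class="literal">&lt;collectionReader&gt;</code> section for a
Collection Reader that initializes the CAS itself, and so needs no CAS Initializer,
might be as short as the following sketch. The descriptor path is hypothetical, and the
<code class="literal">location</code> attribute assumes the import syntax from the
Component Descriptor Reference.</p>
<pre class="programlisting">&lt;collectionReader&gt;
  &lt;collectionIterator&gt;
    &lt;descriptor&gt;
      &lt;!-- hypothetical path, resolved relative to the CPE descriptor --&gt;
      &lt;import location="desc/ExampleCollectionReader.xml"/&gt;
    &lt;/descriptor&gt;
  &lt;/collectionIterator&gt;
  &lt;!-- no casInitializer element: this reader populates the CAS itself --&gt;
&lt;/collectionReader&gt;</pre>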
<div class="section" title="3.5.1.&nbsp;Error handling for Collection Readers"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.collection_reader.error_handling">3.5.1.&nbsp;Error handling for Collection Readers</h3></div></div></div>
<p>The CPM will abort if the Collection Reader throws a large number of
consecutive exceptions (default = 100). This default can be changed by using the
Java initialization parameter <code class="literal">-DMaxCRErrorThreshold
xxx</code>.</p>
</div>
</div>
<div class="section" title="3.6.&nbsp;CAS Processors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors">3.6.&nbsp;CAS Processors</h2></div></div></div>
<p>The <code class="literal">&lt;casProcessors&gt;</code> section identifies the
components that perform the analysis on the input data, including CAS analysis
(Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
Consumers may also perform collection level analysis, where the analysis is
performed (or aggregated) over multiple CASes. The basic structure of the CAS
Processors section is:
</p><pre class="programlisting">&lt;casProcessors
dropCasOnException="true|false"
casPoolSize="[Number]"
processingUnitThreadCount="[Number]"&gt;
&lt;casProcessor ...&gt;
...
&lt;/casProcessor&gt;
&lt;casProcessor ...&gt;
...
&lt;/casProcessor&gt;
...
&lt;/casProcessors&gt;</pre>
<p>The <code class="literal">&lt;casProcessors&gt;</code> section has two mandatory
attributes and one optional attribute that configure the characteristics of the CAS
Processor flow in the CPE. The first mandatory attribute is
<code class="literal">casPoolSize</code>, which
defines the fixed number of CAS instances that the CPM will create and use during
processing. All CAS instances are maintained in a CAS Pool with check-in and
check-out access. Each CAS is checked out of the CAS Pool by the Collection Reader
and initialized with an initial subject of analysis. The CAS is checked back into the
CAS Pool when it is completely processed, at the end of the processing chain. A larger
CAS Pool size results in more memory being used by the CPM. CAS objects can be large,
so take care in choosing the size of the CAS Pool, weighing memory use
against performance.</p>
<p>The second mandatory <code class="literal">&lt;casProcessors&gt;</code> attribute
is <code class="literal">processingUnitThreadCount</code>, which specifies the number of
replicated <span class="emphasis"><em>Processing Pipelines</em></span>. Each Processing
Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
contains one or more Analysis Engines invoked in a given sequence. If more than one
Processing Pipeline is specified, the CPM replicates instances of each Analysis
Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
independently, consuming CASes from the work queue and depositing CASes with analysis
results onto the output queue. On multiprocessor machines, multiple Processing
Pipelines can run in parallel, improving overall throughput of the CPM.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The number of Processing Pipelines should be equal to or greater than CAS
Pool size. </p></div>
<p>Elements in the pipeline (each represented by a &lt;casProcessor&gt; element)
may indicate that they do not permit multiple deployment in their Analysis Engine
descriptor. If so, even though multiple pipelines are being used, all CASes passing
through the pipelines will be routed through one instance of these marked Engines.
</p>
<p>The final, optional, &lt;casProcessors&gt; attribute is
<code class="literal">dropCasOnException</code>. It defines a policy that determines what
happens with the CAS when an exception happens during processing. If the value of this
attribute is set to true and an exception happens, the CPM will notify all registered
listeners of the exception (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe.using_listeners" class="olink">Section&nbsp;2.3.1, &#8220;Using Listeners&#8221;</a>), clear the CAS and check the CAS
back into the CAS Pool so that it can be re-used. The presumption is that an exception
may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
to move through the processing chain. When this attribute is omitted the CPM's
default is the same as specifying
<code class="literal">dropCasOnException="false"</code>.</p>
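<p>As an illustration, a filled-in opening for this section might look like the
following sketch. The numeric values are placeholders chosen only for the example, not
recommendations; suitable values depend on CAS size, available memory, and the number of
processors.</p>
<pre class="programlisting">&lt;!-- Illustrative values only --&gt;
&lt;casProcessors dropCasOnException="true"
    casPoolSize="4"
    processingUnitThreadCount="4"&gt;
  &lt;casProcessor ...&gt;
    ...
  &lt;/casProcessor&gt;
&lt;/casProcessors&gt;</pre>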
<div class="section" title="3.6.1.&nbsp;Specifying an Individual CAS Processor"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual">3.6.1.&nbsp;Specifying an Individual CAS Processor</h3></div></div></div>
<p>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
pipeline are specified with the <code class="literal">&lt;casProcessor&gt;</code>
entity, which appears within the <code class="literal">&lt;casProcessors&gt;</code>
entity. It may appear multiple times, once for each CAS Processor specified for
this CPE.</p>
<p>The order of the <code class="literal">&lt;casProcessor&gt;</code> entities within
the <code class="literal">&lt;casProcessors&gt;</code> section specifies the order in
which the CAS Processors will run. Although CAS Consumers are usually put at the end
of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
Consumers.</p>
<p>The overall format of the <code class="literal">&lt;casProcessor&gt;</code> entity
is:
</p><pre class="programlisting">&lt;casProcessor deployment="local|remote|integrated" name="[String]" &gt;
&lt;descriptor&gt;
&lt;import ...&gt; | &lt;include .../&gt;
&lt;/descriptor&gt;
&lt;configurationParameterSettings&gt;...&lt;/configurationParameterSettings&gt;
&lt;sofaNameMappings&gt;...&lt;/sofaNameMappings&gt;
&lt;runInSeparateProcess&gt;...&lt;/runInSeparateProcess&gt;
&lt;deploymentParameters&gt;...&lt;/deploymentParameters&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;...&lt;/errorHandling&gt;
&lt;checkpoint batch="Number"/&gt;
&lt;/casProcessor&gt;</pre>
<p>The <code class="literal">&lt;casProcessor&gt;</code> element has two mandatory
attributes, <code class="literal">deployment</code> and <code class="literal">name</code>. The
mandatory <code class="literal">name</code> attribute specifies a unique string
identifying the CAS Processor.</p>
<p>The mandatory <code class="literal">deployment</code> attribute specifies the CAS
Processor deployment mode. Currently, three deployment options are supported:
</p><div class="variablelist"><dl><dt><span class="term">integrated</span></dt><dd><p>indicates <span class="emphasis"><em>integrated</em></span> deployment
of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
same process space as the CPM. This type of deployment is recommended to
increase the performance of the CPE. However, it is NOT recommended to
deploy annotators containing JNI this way. Such CAS Processors may cause a
fatal exception and force the JVM to exit without cleanup (bringing down the
CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
this way.</p>
<p>The descriptor for an integrated deployment can, in fact, be a remote
service descriptor. When used this way, however, the CPM error recovery
options (see below) operate in the integrated mode, which means that many
of the retry options are not available.</p></dd><dt><span class="term">remote</span></dt><dd><p>indicates <span class="emphasis"><em>non-managed</em></span>
deployment of the CAS Processor. The CAS Processor descriptor referenced
in the <code class="literal">&lt;descriptor&gt;</code> element must be a Vinci
<span class="emphasis"><em>Service Client Descriptor</em></span>, which identifies a
remotely deployed CAS Processor service (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a>). The CPM
assumes that the CAS Processor is already running as a remote service and
will connect to it using the URI provided in the client service descriptor.
The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
so appropriate infrastructure should be in place to start/restart such CAS
Processors when necessary. This deployment provides fault isolation and
is implementation (i.e., programming language) neutral.</p>
</dd><dt><span class="term">local</span></dt><dd><p>indicates <span class="emphasis"><em>managed</em></span> deployment of
the CAS Processor. The CAS Processor descriptor referenced in the
<code class="literal">&lt;descriptor&gt;</code> element must be a Vinci
<span class="emphasis"><em>Service Deployment Descriptor</em></span>, which configures
a CAS Processor for deployment as a Vinci service (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a>). The CPM
deploys the CAS Processor in a separate process and manages the life cycle
(start/stop) of the CAS Processor. Communication between the CPM and the
CAS Processor is done with Vinci. When the CPM completes processing, the
process containing the CAS Processor is terminated. This deployment mode
insulates the CPM from the CAS Processor, creating a more robust deployment
at the cost of a small communication overhead. On multiprocessor machines,
the separate processes may run concurrently and improve overall
throughput.</p></dd></dl></div>
<p>A number of elements may appear within the
<code class="literal">&lt;casProcessor&gt;</code> element.</p>
<div class="section" title="3.6.1.1.&nbsp;<descriptor&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.descriptor">3.6.1.1.&nbsp;&lt;descriptor&gt; Element</h4></div></div></div>
<p>The <code class="literal">&lt;descriptor&gt;</code> element is mandatory. It
identifies the descriptor for the referenced CAS Processor using the syntax
described in <a href="references.html#ugr.ref.xml.component_descriptor.aes" class="olink">Section&nbsp;2.4, &#8220;Analysis Engine Descriptors&#8221;</a>.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>For
<span class="emphasis"><em><code class="literal">remote</code></em></span> CAS Processors, the
referenced descriptor must be a Vinci <span class="emphasis"><em>Service Client
Descriptor</em></span>, which identifies a remotely deployed CAS Processor
service.</p></li><li class="listitem"><p>For <span class="emphasis"><em>local</em></span> CAS Processors, the
referenced descriptor must be a Vinci <span class="emphasis"><em>Service Deployment
Descriptor</em></span>.</p></li><li class="listitem"><p>For <span class="emphasis"><em>integrated</em></span> CAS Processors,
the referenced descriptor must be an Analysis Engine Descriptor
(primitive or aggregate). </p></li></ul></div><p> </p>
<p>See <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a> for more
information on creating these descriptors and deploying services.</p>
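<p>For instance, a descriptor reference might be written in either of the following
forms. This is a sketch: the file paths are hypothetical, and the
<code class="literal">href</code> and <code class="literal">location</code> attribute names
are assumed here to follow the include and import syntax used by other UIMA
descriptors.</p>
<pre class="programlisting">&lt;!-- Hypothetical path, referenced with include --&gt;
&lt;descriptor&gt;
  &lt;include href="descriptors/ExampleAnnotator.xml"/&gt;
&lt;/descriptor&gt;

&lt;!-- Hypothetical path, referenced with import --&gt;
&lt;descriptor&gt;
  &lt;import location="descriptors/ExampleAnnotator.xml"/&gt;
&lt;/descriptor&gt;</pre>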
</div>
<div class="section" title="3.6.1.2.&nbsp;<configurationParameterSettings&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.configuration_parameter_settings">3.6.1.2.&nbsp;&lt;configurationParameterSettings&gt; Element</h4></div></div></div>
<p>This element provides a way to override the contained Analysis
Engine's parameter settings. Any entry specified here must already be
defined; values specified replace the corresponding values for each
parameter. <span class="bold-italic">For CAS Processors, this mechanism
is only available when they are deployed in <span class="quote">&#8220;<span class="quote">integrated</span>&#8221;</span>
mode.</span> For Collection Readers and Initializers, it is always
available.</p>
<p>The content of this element is identical to the component descriptor for
specifying parameters (in the case where no parameter groups are
specified)<sup>[<a name="d5e1266" href="#ftn.d5e1266" class="footnote">4</a>]</sup>. Here is an example:
</p><pre class="programlisting">&lt;configurationParameterSettings&gt;
&lt;nameValuePair&gt;
&lt;name&gt;CivilianTitles&lt;/name&gt;
&lt;value&gt;
&lt;array&gt;
&lt;string&gt;Mr.&lt;/string&gt;
&lt;string&gt;Ms.&lt;/string&gt;
&lt;string&gt;Mrs.&lt;/string&gt;
&lt;string&gt;Dr.&lt;/string&gt;
&lt;/array&gt;
&lt;/value&gt;
&lt;/nameValuePair&gt;
...
&lt;/configurationParameterSettings&gt;</pre>
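<p>A single-valued parameter follows the same pattern, with the value given directly
rather than inside an array. In the sketch below, the parameter name
<code class="literal">MaxTitleLength</code> is hypothetical and would have to be declared
already by the component whose setting is being overridden.</p>
<pre class="programlisting">&lt;configurationParameterSettings&gt;
  &lt;nameValuePair&gt;
    &lt;!-- Hypothetical parameter name --&gt;
    &lt;name&gt;MaxTitleLength&lt;/name&gt;
    &lt;value&gt;
      &lt;integer&gt;10&lt;/integer&gt;
    &lt;/value&gt;
  &lt;/nameValuePair&gt;
&lt;/configurationParameterSettings&gt;</pre>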
</div>
<div class="section" title="3.6.1.3.&nbsp;<sofaNameMappings&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.sofa_name_mappings">3.6.1.3.&nbsp;&lt;sofaNameMappings&gt; Element</h4></div></div></div>
<p>This optional element maps the Sofa names defined in the component, or the
default Sofa name (if the component does not declare any Sofa names), to Sofa
names in the CPE. The form of this element is:
</p><pre class="programlisting">&lt;sofaNameMappings&gt;
&lt;sofaNameMapping cpeSofaName="a_CPE_name"
componentSofaName="a_component_Name"/&gt;
...
&lt;/sofaNameMappings&gt;</pre>
<p>There can be any number of <code class="literal">&lt;sofaNameMapping&gt;</code> elements contained in the
<code class="literal">&lt;sofaNameMappings&gt;</code> element. The
<code class="literal">componentSofaName</code> attribute is optional; leave it out to
specify a mapping for the <code class="literal">_InitialView</code> - that is, for
Single-View components.</p>
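<p>The two forms can be sketched as follows. All Sofa names here are hypothetical. The
first fragment maps a Sofa that the component declares under the name
<code class="literal">inputText</code>; the second omits
<code class="literal">componentSofaName</code>, so the mapping applies to the
<code class="literal">_InitialView</code> of a Single-View component.</p>
<pre class="programlisting">&lt;!-- Component that declares a Sofa named "inputText" (hypothetical) --&gt;
&lt;sofaNameMappings&gt;
  &lt;sofaNameMapping cpeSofaName="EnglishDocument"
      componentSofaName="inputText"/&gt;
&lt;/sofaNameMappings&gt;

&lt;!-- Single-View component: componentSofaName is left out --&gt;
&lt;sofaNameMappings&gt;
  &lt;sofaNameMapping cpeSofaName="EnglishDocument"/&gt;
&lt;/sofaNameMappings&gt;</pre>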
</div>
<div class="section" title="3.6.1.4.&nbsp;<runInSeparateProcess&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.run_in_separate_process">3.6.1.4.&nbsp;&lt;runInSeparateProcess&gt; Element</h4></div></div></div>
<p>The <code class="literal">&lt;runInSeparateProcess&gt;</code> element is
mandatory for <code class="literal">local</code> CAS Processors, but should not appear
for <code class="literal">remote</code> or <code class="literal">integrated</code> CAS
Processors. It enables the CPM to create external processes using the provided
runtime environment. Applications launched this way communicate with the CPM
using the Vinci protocol and connectivity is enabled by a local instance of the
VNS that the CPM manages. Since communication is based on Vinci, the application
need not be implemented in Java. Any language for which Vinci provides support
may be used to create an application, and the CPM will seamlessly communicate
with it. The overall structure of this element is:
</p><pre class="programlisting">&lt;runInSeparateProcess&gt;
&lt;exec dir="[String]" executable="[String]"&gt;
&lt;env key="[String]" value ="[String]"/&gt;
...
&lt;arg&gt;[String]&lt;/arg&gt;
...
&lt;/exec&gt;
&lt;/runInSeparateProcess&gt;</pre>
<p>The <code class="literal">&lt;exec&gt;</code> element provides information
about how to execute the referenced CAS Processor. Two attributes are defined
for the <code class="literal">&lt;exec&gt;</code> element. The
<code class="literal">dir</code> attribute is currently not used &#8211; it is reserved
for future functionality. The <code class="literal">executable</code> attribute
specifies the actual Vinci service executable that will be run by the CPM, e.g.,
<code class="literal">java</code>, a batch script, an application (.exe), etc. The
executable must be specified with a fully qualified path, or be found in the
<code class="literal">PATH</code> of the CPM.</p>
<p>The <code class="literal">&lt;exec&gt;</code> element has two elements within it
that define parameters used to construct the command line for executing the CAS
Processor. These elements must be listed in the order in which they should be
defined for the CAS Processor.</p>
<p>The optional <code class="literal">&lt;env&gt;</code> element is used to set an
environment variable. The variable <code class="literal">key</code> will be set to
<code class="literal">value</code>. For example,
</p><pre class="programlisting">&lt;env key="CLASSPATH" value="C:\Java\lib"/&gt;</pre><p>
will set the environment variable <code class="literal">CLASSPATH</code> to the value
<code class="literal">C:\Java\lib</code>. The <code class="literal">&lt;env&gt;</code>
element may be repeated to set multiple environment variables. All of the
key/value pairs will be added to the environment by the CPM prior to launching the
executable.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The CPM actually adds ALL system environment variables when it
launches the program. It queries the Operating System for its current system
variables and one by one adds them to the program's process
configuration.</p></div>
<p>The <code class="literal">&lt;arg&gt;</code> element is used to specify arbitrary
string arguments that will appear on the command line when the CPM runs the
command specified in the <code class="literal">executable</code> attribute.</p>
<p>For example, the following would be used to invoke the UIMA Java
implementation of the Vinci service wrapper on a Java CAS Processor:
</p><pre class="programlisting">&lt;runInSeparateProcess&gt;
&lt;exec executable="java"&gt;
&lt;arg&gt;-DVNS_HOST=localhost&lt;/arg&gt;
&lt;arg&gt;-DVNS_PORT=9099&lt;/arg&gt;
&lt;arg&gt;org.apache.uima.reference_impl.analysis_engine.service.
vinci.VinciAnalysisEngineService_impl&lt;/arg&gt;
&lt;arg&gt;C:\uima\desc\deployCasProcessor.xml&lt;/arg&gt;
&lt;/exec&gt;
&lt;/runInSeparateProcess&gt;</pre>
<p>This will cause the CPM to run the following command line when starting the
CAS Processor:
</p><pre class="programlisting">java -DVNS_HOST=localhost -DVNS_PORT=9099
org.apache.uima.reference_impl.analysis_engine.service.vinci.\
VinciAnalysisEngineService_impl
C:\uima\desc\deployCasProcessor.xml</pre>
<p>The first argument specifies that the Vinci Naming Service is running on the
<code class="literal">localhost</code>. The second argument specifies that the Vinci
Naming Service port number is <code class="literal">9099</code>. The third argument
(split over 2 lines in this documentation)
identifies the UIMA implementation of the Vinci service wrapper. This class
contains the <code class="literal">main</code> method that will execute. That main
method in turn takes a single argument &#8211; the filename for the CAS Processor
service deployment descriptor. Thus the last argument identifies the Vinci
service deployment descriptor file for the CAS Processor. Since this is the same
descriptor file specified earlier in the
<code class="literal">&lt;descriptor&gt;</code> element, the string
<code class="literal">${descriptor}</code> can be used to refer to the descriptor,
e.g.:
</p><pre class="programlisting">&lt;arg&gt;${descriptor}&lt;/arg&gt;</pre>
<p>The CPM will expand this out to the service deployment descriptor file
referenced in the <code class="literal">&lt;descriptor&gt;</code> element.</p>
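<p>Putting these pieces together, the launch configuration for a local CAS Processor
might be sketched as follows. The <code class="literal">CLASSPATH</code> value is a
hypothetical placeholder; the VNS settings and the service wrapper class are the ones
from the example above, with the class name written on a single line.</p>
<pre class="programlisting">&lt;runInSeparateProcess&gt;
  &lt;exec executable="java"&gt;
    &lt;!-- Hypothetical class path for the launched JVM --&gt;
    &lt;env key="CLASSPATH" value="C:\uima\lib\example.jar"/&gt;
    &lt;arg&gt;-DVNS_HOST=localhost&lt;/arg&gt;
    &lt;arg&gt;-DVNS_PORT=9099&lt;/arg&gt;
    &lt;arg&gt;org.apache.uima.reference_impl.analysis_engine.service.vinci.VinciAnalysisEngineService_impl&lt;/arg&gt;
    &lt;!-- Expanded by the CPM to the file named in &lt;descriptor&gt; --&gt;
    &lt;arg&gt;${descriptor}&lt;/arg&gt;
  &lt;/exec&gt;
&lt;/runInSeparateProcess&gt;</pre>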
</div>
<div class="section" title="3.6.1.5.&nbsp;<deploymentParameters&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.deployment_parameters">3.6.1.5.&nbsp;&lt;deploymentParameters&gt; Element</h4></div></div></div>
<p>The <code class="literal">&lt;deploymentParameters&gt;</code> element defines
a number of deployment parameters that control how the CPM will interact with the
CAS Processor. This element has the following overall form:
</p><pre class="programlisting">&lt;deploymentParameters&gt;
&lt;parameter name="[String]" value="..." type="string|integer" /&gt;
...
&lt;/deploymentParameters&gt;</pre>
<p>The <code class="literal">name</code> attribute identifies the parameter, the
<code class="literal">value</code> attribute specifies the value that will be assigned
to the parameter, and the <code class="literal">type</code> attribute indicates the
type of the parameter, either <code class="literal">string</code> or
<code class="literal">integer</code>. The available parameters include:
</p><div class="variablelist"><dl><dt><span class="term">service-access</span></dt><dd><p>string parameter whose value must be
<span class="quote">&#8220;<span class="quote">exclusive</span>&#8221;</span>, if present. This parameter is only
effective for remote deployments. It modifies the Vinci service
connections to be preallocated and dedicated, one service instance per
pipeline. It is only relevant for non-integrated deployment modes. If
there are fewer service instances available (and alive &#8211;
responding to a <span class="quote">&#8220;<span class="quote">ping</span>&#8221;</span> request) than there are pipelines,
the number of pipelines (the number of concurrent threads) is reduced to
match the number of available instances. If not specified, the VNS is
queried each time a service is needed, and a <span class="quote">&#8220;<span class="quote">random</span>&#8221;</span>
instance is assigned from the pool of available instances. If a service
dies during processing, the CPM will use its normal error handling
procedures to attempt to reconnect. The number of attempts is specified
in the CPE descriptor for each CAS Processor using the
<code class="literal">&lt;maxConsecutiveRestarts value="10"
action="kill-pipeline"
waitTimeBetweenRetries="50"/&gt;</code> xml element. The
<span class="quote">&#8220;<span class="quote">value</span>&#8221;</span> attribute is the number of reconnection tries;
the <span class="quote">&#8220;<span class="quote">action</span>&#8221;</span> says what to do if the retries exceed the
limit. The <span class="quote">&#8220;<span class="quote">kill-pipeline</span>&#8221;</span> action stops the pipeline
that was associated with the failing service (other pipelines will
continue to work). The CAS in process within a killed pipeline will be
dropped. These events are communicated to the application using the
normal event listener mechanism. The
<code class="literal">waitTimeBetweenRetries</code> attribute says how many
milliseconds to wait between attempts to reconnect.</p>
</dd><dt><span class="term">vnsHost</span></dt><dd><p>(Deprecated) string parameter specifying the VNS host,
e.g., <code class="literal">localhost</code> for local CAS Processors, host
name or IP address of VNS host for remote CAS Processors. This parameter is
deprecated; use the parameter specification instead inside the Vinci
<span class="emphasis"><em>Service Client Descriptor</em></span>, if needed. It is
ignored for integrated and local deployments. If present, for remote
deployments, it specifies the VNS Host to use, unless that is specified in
the Vinci <span class="emphasis"><em>Service Client Descriptor</em></span>.</p>
</dd><dt><span class="term">vnsPort</span></dt><dd><p>(Deprecated) integer parameter specifying the VNS port
number. This parameter is deprecated; use the parameter specification
instead inside the Vinci <span class="emphasis"><em>Service Client
Descriptor,</em></span> if needed. It is ignored for integrated and
local deployments. If present, for remote deployments, it specifies the
VNS Port number to use, unless that is specified in the Vinci
<span class="emphasis"><em>Service Client Descriptor.</em></span></p>
</dd></dl></div>
<p>For example, the following parameters might be used with a CAS Processor
deployed in local mode:
</p><pre class="programlisting">&lt;deploymentParameters&gt;
&lt;parameter name="service-access" value="exclusive" type="string"/&gt;
&lt;/deploymentParameters&gt;</pre>
</div>
<div class="section" title="3.6.1.6.&nbsp;<filter&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.filter">3.6.1.6.&nbsp;&lt;filter&gt; Element</h4></div></div></div>
<p>The &lt;filter&gt; element is a required element but currently should be
left empty. This element is reserved for future use.</p>
</div>
<div class="section" title="3.6.1.7.&nbsp;<errorHandling&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling">3.6.1.7.&nbsp;&lt;errorHandling&gt; Element</h4></div></div></div>
<p>The mandatory <code class="literal">&lt;errorHandling&gt;</code> element
defines error and restart policies for the CAS Processor. Each CAS Processor may
define different actions in the event of errors and restarts. The CPM monitors
and logs errant behaviors and attempts to recover the component based on the
policies specified in this element.</p>
<p>There are two kinds of faults:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>One kind only occurs with non-integrated CAS
Processors &#8211; this fault is either a timeout attempting to launch or
connect to the non-integrated component, or some other kind of connection
related exception (for instance, the network connection might timeout or get
reset).</p></li><li class="listitem"><p>The other kind happens when the CAS Processor component (an
Annotator, for example) throws any kind of exception. This kind may occur
with any kind of deployment, integrated or not. </p></li></ol></div>
<p>The &lt;errorHandling&gt; element has specifications for each of these kinds of
faults. The format of this element is:
</p><pre class="programlisting">&lt;errorHandling&gt;
&lt;maxConsecutiveRestarts action="continue|disable|terminate"
value="[Number]"/&gt;
&lt;errorRateThreshold action="continue|disable|terminate" value="[Rate]"/&gt;
&lt;timeout max="[Number]"/&gt;
&lt;/errorHandling&gt;</pre>
<p>The mandatory <code class="literal">&lt;maxConsecutiveRestarts&gt;</code>
element applies only to faults of the first kind, and therefore, only applies to
non-integrated deployments. If such a fault occurs, a retry is attempted, up to
<code class="literal">value="[Number]"</code> of times. This retry resets the
connection (if one was made) and attempts to reconnect and perhaps re-launch
(see below for details). The original CAS (not a partially updated one) is sent to
the CAS Processor as part of the retry, once the deployed component has been
successfully restarted or reconnected to.</p>
<p>The <code class="literal">action</code> attribute specifies the action to take
when the threshold specified by the <code class="literal">value="[Number]"</code> is
exceeded. The possible actions are:
</p><div class="variablelist"><dl><dt><span class="term">continue</span></dt><dd><p>skip any further processing for this CAS by this CAS
Processor, and pass the CAS to the next CAS Processor in the Pipeline.
</p>
<p>The <span class="quote">&#8220;<span class="quote">restart</span>&#8221;</span> action is done, because it is needed
for the next CAS.</p>
<p>If the <code class="literal">dropCasOnException="true"</code>, the CPM
will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
CPM will abort processing of this CAS, release the CAS back to the CAS
Pool and will process the next CAS in the queue.</p>
<p>The counter counting the restarts toward the threshold is only
reset after a CAS is successfully processed.</p></dd><dt><span class="term">disable</span></dt><dd><p>the current CAS is handled just as in the
<code class="literal">continue</code> case, but in addition, the CAS Processor
is marked so that its <span class="emphasis"><em>process()</em></span> method will not be
called again (i.e., it will be <span class="quote">&#8220;<span class="quote">skipped</span>&#8221;</span> for future
CASes)</p></dd><dt><span class="term">terminate</span></dt><dd><p>the CPM will terminate all processing and exit.</p>
</dd></dl></div>
<p>The definition of an error for the
<code class="literal">&lt;maxConsecutiveRestarts&gt;</code> element differs
slightly for each of the three CAS Processor deployment modes:
</p><div class="variablelist"><dl><dt><span class="term">local</span></dt><dd><p>Local CAS Processors experience two general error
types:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>launch errors &#8211; errors associated with
launching a process</p></li><li class="listitem"><p>processing errors &#8211; errors associated with
sending Vinci commands to the process</p></li></ul></div>
<p>A launch error is defined by a failure of the process to
successfully register with the local VNS within a default time window.
The current timeout is 15 minutes. Multiple local CAS Processors are
launched sequentially, with a subsequent processor launched
immediately after its previous processor successfully registers
with the VNS.</p>
<p>A processing error is detected if a connection to the CAS Processor
is lost or if the processing time exceeds a specified timeout
value.</p>
<p>For local CAS Processors, the
&lt;maxConsecutiveRestarts&gt; element specifies the number of
consecutive attempts made to launch the CAS Processor at CPM startup or
after the CPM has lost a connection to the CAS Processor.</p>
</dd><dt><span class="term">remote</span></dt><dd><p>For remote CAS Processors, the
&lt;maxConsecutiveRestarts&gt; element applies to errors from
sending Vinci commands. An error is detected if a connection to the CAS
Processor is lost, or if the processing time exceeds the timeout value
specified in the &lt;timeout&gt; element (see below).</p>
</dd><dt><span class="term">integrated</span></dt><dd><p>Although mandatory, the
&lt;maxConsecutiveRestarts&gt; element is NOT used for integrated CAS
Processors, because Integrated CAS Processors are not
re-instantiated/restarted on exceptions. This setting is ignored by
the CPM for Integrated CAS Processors but it is required. A future version
of the CPM will make this element mandatory for remote and local CAS
Processors only.</p></dd></dl></div>
<p>The mandatory <code class="literal">&lt;errorRateThreshold&gt;</code> element
is used for all faults &#8211; both those above, and exceptions thrown by the CAS
Processor itself. It specifies the number of retries for exceptions thrown by
the CAS Processor itself, a maximum error rate, and the corresponding action to
take when this rate is exceeded. The <code class="literal">value</code> attribute
specifies the error rate in terms of errors per sample size in the form
<span class="quote">&#8220;<span class="quote"><code class="literal">N/M</code></span>&#8221;</span>, where <code class="literal">N</code> is the
number of errors and <code class="literal">M</code> is the sample size, defined in terms
of the number of documents.</p>
<p>The first number is also used to indicate the maximum number of retries. If
this number is less than the <code class="literal">&lt;maxConsecutiveRestarts
value="[Number]"&gt;</code>, it takes precedence, reducing the number of
<span class="quote">&#8220;<span class="quote">restarts</span>&#8221;</span> attempted. A retry is done only if
<code class="literal">dropCasOnException</code> is false. If it is set to true, no retry
occurs, but the error is counted.</p>
<p>When the number of counted errors within a sample of <code class="literal">M</code> documents exceeds <code class="literal">N</code>, an action
specified by the <code class="literal">action</code> attribute is taken. The possible
actions and their meaning are the same as described above for the
<code class="literal">&lt;maxConsecutiveRestarts&gt;</code> element:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p><code class="literal">continue</code></p></li><li class="listitem"><p><code class="literal">disable</code></p></li><li class="listitem"><p><code class="literal">terminate</code></p></li></ul></div>
<p>The <code class="literal">dropCasOnException="true"</code> attribute of the
<code class="literal">&lt;casProcessors&gt;</code> element modifies the action
taken for continue and disable, in the same manner as above. For example:
</p><pre class="programlisting">&lt;errorRateThreshold value="3/1000" action="disable"/&gt;</pre><p>
specifies that each error thrown by the CAS Processor itself will be retried up to
3 times (if <code class="literal">dropCasOnException</code> is false) and the CAS
Processor will be disabled if the error rate exceeds 3 errors in 1000
documents.</p>
<p>If a document causes an error and the error rate threshold for the CAS
Processor is not exceeded, the CPM increments the CAS Processor's error
count and retries processing that document (if
<code class="literal">dropCasOnException</code> is false). The retry means that the
CPM calls the CAS Processor's process() method again, passing in as an
argument the same CAS that previously caused an exception.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The CPM does not attempt to rollback any partial changes that may have
been applied to the CAS in the previous process() call. </p></div>
<p>Errors are accumulated across documents. For example, assume the error
rate threshold is <code class="literal">3/1000</code>. The same document may fail three
times before finally succeeding on the fourth try, but the error count is now 3. If
one more error occurs within the current sample of 1000 documents, the error rate
threshold will be exceeded and the specified action will be taken. If no more
errors occur within the current sample, the error counter is reset to 0 for the
next sample of 1000 documents.</p>
<p>The <code class="literal">&lt;timeout&gt;</code> element is a mandatory element.
Although mandatory for all CAS Processors, this element is only relevant for
local and remote CAS Processors. For integrated CAS Processors, this element is
ignored. In the current CPM implementation the integrated CAS Processor
process() method is not subject to timeouts.</p>
<p>The <code class="literal">max</code> attribute specifies the maximum amount of
time in milliseconds the CPM will wait for a process() method to complete. When
exceeded, the CPM will generate an exception and will treat this as an error
subject to the threshold defined in the
<code class="literal">&lt;errorRateThreshold&gt;</code> element above, including
doing retries.</p>
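<p>For example, an error handling policy might be sketched as below. All values are
illustrative only: up to 30 consecutive restart attempts before the CPM terminates,
termination once more than 100 errors are counted in a sample of 1000 documents, and a
limit of 100000 milliseconds on each process() call.</p>
<pre class="programlisting">&lt;!-- Illustrative values only --&gt;
&lt;errorHandling&gt;
  &lt;maxConsecutiveRestarts action="terminate" value="30"/&gt;
  &lt;errorRateThreshold action="terminate" value="100/1000"/&gt;
  &lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;</pre>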
<div class="section" title="Retry action taken on a timeout"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling.timeout_retry_action">Retry action taken on a timeout</h5></div></div></div>
<p>The action taken depends on whether the CAS Processor is local (managed)
or remote (unmanaged). Local CAS Processors (which are services) are killed
and restarted, and a new connection to them is established. For remote CAS
Processors, the connection to them is dropped and a new connection is
established (which may actually connect to a different instance of the
remote service, if it has multiple instances).</p>
</div>
</div>
<div class="section" title="3.6.1.8.&nbsp;&lt;checkpoint&gt; Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.checkpoint">3.6.1.8.&nbsp;&lt;checkpoint&gt; Element</h4></div></div></div>
<p>The <code class="literal">&lt;checkpoint&gt;</code> element is an optional
element used to improve the performance of CAS Consumers. It has a single
attribute, <code class="literal">batch</code>, which specifies the number of CASes in a
batch, e.g.:
</p><pre class="programlisting">&lt;checkpoint batch="1000"/&gt;</pre>
<p>sets the batch size to 1000 CASes. The batch size is the interval used to mark a
point in processing requiring special handling. The CAS Processor's
<code class="literal">batchProcessComplete()</code> method will be called by the CPM
when this mark is reached so that the processor can take appropriate action. This
mark could be used as a mechanism to buffer up results in CAS Consumers and perform
time-consuming operations, such as check-pointing, that should not be done on a
per-document basis.</p>
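<p>The buffering pattern described above can be sketched as follows. This is a
self-contained toy, not a UIMA CAS Consumer: the class
<code class="literal">BufferingConsumer</code>, its <code class="literal">process(String)</code>
signature, and the in-memory <code class="literal">store</code> list are assumptions made for
illustration. Only the method name <code class="literal">batchProcessComplete()</code> is
taken from the text.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical consumer, not a UIMA class: buffers results and flushes once per batch. */
class BufferingConsumer {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> store = new ArrayList<>(); // stands in for a slow external store
    private int flushCount;

    /** Called once per CAS: cheap, in-memory work only. */
    void process(String result) {
        buffer.add(result);
    }

    /** Called at each batch boundary: the expensive work is done once for the whole batch. */
    void batchProcessComplete() {
        store.addAll(buffer);
        buffer.clear();
        flushCount++;
    }

    int buffered() { return buffer.size(); }
    int stored() { return store.size(); }
    int flushes() { return flushCount; }

    /** Drives the consumer with one batch callback every batchSize CASes. */
    static BufferingConsumer run(int totalCases, int batchSize) {
        BufferingConsumer consumer = new BufferingConsumer();
        for (int i = 1; i <= totalCases; i++) {
            consumer.process("result-" + i);
            if (i % batchSize == 0) {
                consumer.batchProcessComplete();
            }
        }
        return consumer;
    }

    public static void main(String[] args) {
        BufferingConsumer c = run(2500, 1000);
        System.out.println(c.flushes() + " flushes, " + c.stored()
                + " stored, " + c.buffered() + " still buffered");
    }
}
```

<p>In this sketch, results that arrive after the last full batch stay in the buffer, so
a real consumer would also need to flush when processing ends.</p>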
</div>
</div>
</div>
<div class="section" title="3.7.&nbsp;CPE Operational Parameters"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters">3.7.&nbsp;CPE Operational Parameters</h2></div></div></div>
<p>The parameters for configuring the overall CPE and CPM are specified in the
<code class="literal">&lt;cpeConfig&gt;</code> section. The overall format of this
section is:
</p><pre class="programlisting">&lt;cpeConfig&gt;
&lt;startAt&gt;[NumberOrID]&lt;/startAt&gt;
&lt;numToProcess&gt;[Number]&lt;/numToProcess&gt;
&lt;outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" /&gt;
&lt;checkpoint file="[File]" time="[Number]" batch="[Number]"/&gt;
&lt;timerImpl&gt;[ClassName]&lt;/timerImpl&gt;
&lt;deployAs&gt;vinciService|interactive|immediate|single-threaded
&lt;/deployAs&gt;
&lt;/cpeConfig&gt;</pre>
<p>This section of the CPE descriptor allows for defining the starting entity, the
number of entities to process, a checkpoint file and frequency, a pluggable timer, an
optional output queue implementation, and finally a mode of operation. The mode of
operation determines how the CPM interacts with users and other systems.</p>
<p>The <code class="literal">&lt;startAt&gt;</code> element is an optional argument. It
defines the starting entity in the collection at which the CPM should start
processing.</p>
<p>The implementation in the CPM passes this argument to the Collection Reader
as the value of the parameter <span class="quote">&#8220;<span class="quote"><code class="literal">startNumber</code></span>&#8221;</span>.
The CPM does not do anything else with this parameter; in particular, the CPM has no
ability to skip to a specific document - that function, if available, is only provided
by a particular Collection Reader implementation.</p>
<p>If the <code class="literal">&lt;startAt&gt;</code> element is used, the Collection
Reader descriptor must define a single-valued configuration parameter with the
name <code class="literal">startNumber</code>. It can declare this value to be of any type;
the value passed in this XML element must be convertible to that type.</p>
<p>A typical use is to declare this to be an integer type, and to pass the sequential
document number where processing should start. An alternative implementation
might take a specific document ID; the collection reader could search through its
collection until it reaches this ID and then start there.</p>
<p>This parameter will only make sense if the particular collection reader is
implemented to use the <code class="literal">startNumber</code> configuration
parameter.</p>
<p>The <code class="literal">&lt;numToProcess&gt;</code> element is an optional
element. It specifies the total number of entities to process. Use -1 to indicate ALL.
If not defined, the number of entities to process will be taken from the Collection
Reader configuration. If present, this value overrides the Collection Reader
configuration.</p>
<p>The <code class="literal">&lt;outputQueue&gt;</code> element is an optional element.
It enables plugging in a custom implementation for the Output Queue. When omitted,
the CPM will use a default output queue that is based on a First-in First-out (FIFO)
model.</p>
<p>The UIMA SDK provides a second implementation for the Output Queue that can be
plugged in to the CPM, named <span class="quote">&#8220;<span class="quote">
<code class="literal">org.apache.uima.collection.impl.cpm.engine.SequencedQueue</code>
</span>&#8221;</span>.</p>
<p>This implementation supports handling very large documents that are split into
<span class="quote">&#8220;<span class="quote">chunks</span>&#8221;</span>; it provides a delivery mechanism that ensures the
sequential order of the chunks using information carried in the CAS metadata. This
metadata, which is required for this implementation to work correctly, must be added
as an instance of a Feature Structure of type
<code class="literal">org.apache.es.tt.DocumentMetaData</code> and referred to by an
additional feature named <code class="literal">esDocumentMetaData</code> in the special
instance of <code class="literal">uima.tcas.DocumentAnnotation</code> that is
associated with the CAS. This is usually done by the Collection Reader; the instance
contains the following features:
</p><div class="variablelist"><dl><dt><span class="term">sequenceNumber</span></dt><dd><p>[Number] the sequential number of a chunk, starting at 1. If
not a chunk (i.e. complete document), the value should be 0.</p>
</dd><dt><span class="term">documentId</span></dt><dd><p>[Number] current document id. Chunks belonging to the same
document have identical document id.</p></dd><dt><span class="term">isCompleted</span></dt><dd><p>[Number] 1 if the chunk is the last in a sequence, 0
otherwise.</p></dd><dt><span class="term">url</span></dt><dd><p>[String] document url.</p></dd><dt><span class="term">throttleID</span></dt><dd><p>[String] special attribute currently used by
OmniFind.</p></dd></dl></div>
<p>This implementation of a sequenced queue supports proper sequencing of CASes in
CPM deployments that use document chunking. Chunking is a technique of splitting
large documents into pieces to reduce overall memory consumption. Chunking does not
depend on the number of CASes in the CAS Pool. It works equally well with one or more
CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
CAS is released back to the pool by the processing threads. A document may be split into
1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
document correctly, the CAS Consumer can depend on receiving the chunks in the same
sequential order that the chunks were <span class="quote">&#8220;<span class="quote">produced</span>&#8221;</span>, when this
sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
the following specification:
</p><pre class="programlisting">&lt;outputQueue dequeueTimeout="100000" queueClass=
"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/&gt;</pre><p>
where the mandatory <code class="literal">queueClass</code> attribute defines the name of
the class, and the second mandatory attribute, <code class="literal">dequeueTimeout</code>,
specifies the maximum number of milliseconds to wait for the expected chunk.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The value for this timeout must be carefully determined to avoid
excessive occurrences of timeouts. Typically, the size of a chunk and the type of
analysis being done are the most important factors when deciding on the value for the
timeout. The larger the chunk and the more complicated the analysis, the more time it takes
for the chunk to go from source to sink. You may specify 0, in which case, the timeout is
disabled - i.e., it is equivalent to an infinitely long timeout.</p></div>
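<p>The idea of delivering chunks in their original order can be illustrated with a
small resequencing buffer. This is a sketch under stated assumptions, not the
SequencedQueue implementation: the class <code class="literal">ChunkResequencer</code> is
hypothetical, it handles the chunks of a single document only, and it leaves out the
dequeue timeout entirely.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical resequencing buffer for the chunks of one document; not UIMA code. */
class ChunkResequencer {
    private final Map<Integer, String> pending = new HashMap<>(); // sequenceNumber -> chunk
    private int nextExpected = 1; // chunk sequence numbers start at 1

    /** Accepts a chunk in any order and returns the chunks that can now be delivered in order. */
    List<String> offer(int sequenceNumber, String chunk) {
        pending.put(sequenceNumber, chunk);
        List<String> ready = new ArrayList<>();
        // Release every chunk for which all predecessors have been delivered.
        while (pending.containsKey(nextExpected)) {
            ready.add(pending.remove(nextExpected));
            nextExpected++;
        }
        return ready;
    }

    public static void main(String[] args) {
        ChunkResequencer queue = new ChunkResequencer();
        System.out.println(queue.offer(2, "chunk-2")); // held back until chunk 1 arrives
        System.out.println(queue.offer(1, "chunk-1")); // releases chunks 1 and 2 together
        System.out.println(queue.offer(3, "chunk-3"));
    }
}
```

<p>A chunk that arrives early is simply held until the expected one shows up. The
timeout discussed in this section bounds how long such a wait may last.</p>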
<p>If the chunk doesn't arrive in the configured time window, the entire
document is presumed to be invalid and the CAS is dropped from further processing.
This action occurs regardless of any other error action specification. The
SequencedQueue invalidates the document, adding the offending document's
metadata to a local cache of invalid documents. </p>
<p>If a timeout occurs, the CPM notifies all registered listeners (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe.using_listeners" class="olink">Section&nbsp;2.3.1, &#8220;Using Listeners&#8221;</a>) by calling
entityProcessComplete(). As part of this call, the SequencedQueue passes null
instead of a CAS as the first argument, together with a special exception,
CPMChunkTimeoutException. Null is passed because the timeout means that the expected
chunk was not received within the configured window, so no CAS is available when the
timeout event occurs.</p>
<p>The CPMChunkTimeoutException object includes an API that allows the listener
to retrieve the offending document id as well as the other metadata attributes as
defined above. These attributes are part of each chunk's metadata and are added
by the Collection Reader.</p>
<p>Each chunk that SequencedQueue works on is subjected to a test to determine if the
chunk belongs to an invalid document. This test checks the chunk's metadata
against the data in the local cache. If there is a match, the chunk is dropped. Only
chunks are tested; complete documents are not subject to this
check.</p>
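<p>The test described above amounts to a lookup in a set of invalid document ids. The
following sketch shows that logic only. The class
<code class="literal">InvalidDocumentFilter</code> and its methods are hypothetical, and it
assumes that the document id alone identifies an invalid document, with
<code class="literal">sequenceNumber</code> 0 marking a complete document as listed in the
metadata features above.</p>

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the local cache of invalid documents; not UIMA code. */
class InvalidDocumentFilter {
    private final Set<Integer> invalidDocumentIds = new HashSet<>();

    /** Marks a document as invalid, for example after a chunk timeout. */
    void invalidate(int documentId) {
        invalidDocumentIds.add(documentId);
    }

    /** Returns true if the item is a chunk of an invalidated document. */
    boolean shouldDrop(int documentId, int sequenceNumber) {
        boolean isChunk = sequenceNumber > 0; // 0 means a complete document
        return isChunk && invalidDocumentIds.contains(documentId);
    }

    public static void main(String[] args) {
        InvalidDocumentFilter filter = new InvalidDocumentFilter();
        filter.invalidate(7);
        System.out.println("chunk 2 of document 7 dropped: " + filter.shouldDrop(7, 2));
        System.out.println("complete document 7 dropped: " + filter.shouldDrop(7, 0));
        System.out.println("chunk 1 of document 8 dropped: " + filter.shouldDrop(8, 1));
    }
}
```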
<p>If there is an exception during the processing of a chunk, the CPM sends a
notification to all registered listeners. The notification includes the CAS and an
exception. When the listener notification is completed, the CPM also sends separate
notifications, containing the CAS, to the Artifact Producer and the
SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong
to an <span class="quote">&#8220;<span class="quote">invalid</span>&#8221;</span> document and also to deal with chunks that are
en-route, being processed by the processing threads.</p>
<p>In response to the notification, the Artifact Producer will drop and release
back to the CAS Pool all CASes that belong to an <span class="quote">&#8220;<span class="quote">invalid</span>&#8221;</span> document.
Currently, there is no support in the CollectionReader's API to tell it to stop
generating chunks. The CollectionReader keeps producing the chunks but the
Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is
released back to the CAS Pool, the Artifact Producer sends notification to all
registered listeners. This notification includes the CAS and an exception &#8211;
SkipCasException.</p>
<p>In response to the notification of an exception involving a chunk, the
SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of
<span class="quote">&#8220;<span class="quote">invalid</span>&#8221;</span> documents. All chunks de-queued from the OutputQueue and
belonging to <span class="quote">&#8220;<span class="quote">invalid</span>&#8221;</span> documents will be dropped and released back to
the CAS Pool. Before dropping the CAS, the CPM sends notification to all registered
listeners. The notification includes the CAS and SkipCasException.</p>
<p>The <code class="literal">&lt;checkpoint&gt;</code> element is an optional element.
It specifies a CPE checkpoint file, checkpoint frequency, and strategy for
checkpoints (time or count based). At checkpoint time, the CPM saves status
information and statistics to the checkpoint file. The checkpoint file is specified
in the <code class="literal">file</code> attribute, which has the same form as the
<code class="literal">href</code> attribute of the <code class="literal">&lt;include&gt;</code>
element described in <a class="xref" href="#ugr.ref.xml.cpe_descriptor.imports" title="3.3.&nbsp;Imports">Section&nbsp;3.3, &#8220;Imports&#8221;</a>. The
<code class="literal">time</code> attribute indicates that a checkpoint should be taken
every <code class="literal">[Number]</code> seconds, and the <code class="literal">batch</code>
attribute indicates that a checkpoint should be taken every
<code class="literal">[Number]</code> batches.</p>
<p>The <code class="literal">&lt;timerImpl&gt;</code> element is optional. It is used to
identify a custom timer plug-in class to generate time stamps during the CPM
execution. The value of the element is a Java class name.</p>
<p>The <code class="literal">&lt;deployAs&gt;</code> element indicates the type of CPM
deployment. Valid contents for this element include:
</p><div class="variablelist"><dl><dt><span class="term">vinciService</span></dt><dd><p>Vinci service exposing APIs for stop, pause, resume, and
getStats</p></dd><dt><span class="term">interactive</span></dt><dd><p>provide command line menus (start, stop, pause,
resume)</p></dd><dt><span class="term">immediate</span></dt><dd><p>run the CPM without menus or a service API</p></dd><dt><span class="term">single-threaded</span></dt><dd><p>run the CPM in a single threaded mode. In this mode, the
Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
are all running in one thread without the work queue and the output
queue.</p></dd></dl></div>
</div>
<div class="section" title="3.8.&nbsp;Resource Manager Configuration"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.resource_manager_configuration">3.8.&nbsp;Resource Manager Configuration</h2></div></div></div>
<p>External resource bindings for the CPE may optionally be specified in an
element:
</p><pre class="programlisting">&lt;resourceManagerConfiguration href="..."/&gt;</pre>
<p>For an introduction to external resources, refer to <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae.accessing_external_resource_files" class="olink">Section&nbsp;1.5.4, &#8220;Accessing External Resources&#8221;</a>.</p>
<p>In the <code class="literal">resourceManagerConfiguration</code> element, the value
of the href attribute refers to another file that contains definitions and bindings
for the external resources used by the CPE. The format of this file is the same as the XML
snippet <a href="references.html#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" class="olink">Section&nbsp;2.4.2.4, &#8220;External Resource Bindings&#8221;</a>
. For example, in a CPE containing an aggregate analysis engine with two annotators,
and a CAS Consumer, the following resource manager configuration file would bind
external resource dependencies in all three components to the same physical
resource:
</p><pre class="programlisting">&lt;resourceManagerConfiguration&gt;
&lt;!-- Declare Resource --&gt;
&lt;externalResources&gt;
&lt;externalResource&gt;
&lt;name&gt;ExampleResource&lt;/name&gt;
&lt;fileResourceSpecifier&gt;
&lt;fileUrl&gt;file:MyResourceFile.dat&lt;/fileUrl&gt;
&lt;/fileResourceSpecifier&gt;
&lt;/externalResource&gt;
&lt;/externalResources&gt;
&lt;!-- Bind component resource dependencies to ExampleResource --&gt;
&lt;externalResourceBindings&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;MyAE/annotator1/myResourceKey&lt;/key&gt;
&lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;MyAE/annotator2/someResourceKey&lt;/key&gt;
&lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
&lt;externalResourceBinding&gt;
&lt;key&gt;MyCasConsumer/otherResourceKey&lt;/key&gt;
&lt;resourceName&gt;ExampleResource&lt;/resourceName&gt;
&lt;/externalResourceBinding&gt;
&lt;/externalResourceBindings&gt;
&lt;/resourceManagerConfiguration&gt;</pre>
<p>In this example, <code class="literal">MyAE</code> and
<code class="literal">MyCasConsumer</code> are the names of the Analysis Engine and CAS
Consumer, as specified by the name attributes of the CPE's
<code class="literal">&lt;casProcessor&gt;</code> elements.
<code class="literal">annotator1</code> and <code class="literal">annotator2</code> are the
annotator keys specified within the Aggregate AE Descriptor, and
<code class="literal">myResourceKey</code>, <code class="literal">someResourceKey</code>, and
<code class="literal">otherResourceKey</code> are the keys of the resource dependencies
declared in the individual annotator and CAS Consumer descriptors.</p>
</div>
<div class="section" title="3.9.&nbsp;Example CPE Descriptor"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.example">3.9.&nbsp;Example CPE Descriptor</h2></div></div></div>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;cpeDescription&gt;
&lt;collectionReader&gt;
&lt;collectionIterator&gt;
&lt;descriptor&gt;
&lt;import location=
"../collection_reader/FileSystemCollectionReader.xml"/&gt;
&lt;/descriptor&gt;
&lt;/collectionIterator&gt;
&lt;/collectionReader&gt;
&lt;casProcessors dropCasOnException="true" casPoolSize="1"
processingUnitThreadCount="1"&gt;
&lt;casProcessor deployment="integrated"
name="Aggregate TAE - Name Recognizer and Person Title Annotator"&gt;
&lt;descriptor&gt;
&lt;import location=
"../analysis_engine/NamesAndPersonTitles_TAE.xml"/&gt;
&lt;/descriptor&gt;
&lt;deploymentParameters/&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;
&lt;errorRateThreshold action="terminate" value="100/1000"/&gt;
&lt;maxConsecutiveRestarts action="terminate" value="30"/&gt;
&lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;
&lt;checkpoint batch="1"/&gt;
&lt;/casProcessor&gt;
&lt;casProcessor deployment="integrated" name="Annotation Printer"&gt;
&lt;descriptor&gt;
&lt;import location="../cas_consumer/AnnotationPrinter.xml"/&gt;
&lt;/descriptor&gt;
&lt;deploymentParameters/&gt;
&lt;filter/&gt;
&lt;errorHandling&gt;
&lt;errorRateThreshold action="terminate" value="100/1000"/&gt;
&lt;maxConsecutiveRestarts action="terminate" value="30"/&gt;
&lt;timeout max="100000"/&gt;
&lt;/errorHandling&gt;
&lt;checkpoint batch="1"/&gt;
&lt;/casProcessor&gt;
&lt;/casProcessors&gt;
&lt;cpeConfig&gt;
&lt;numToProcess&gt;1&lt;/numToProcess&gt;
&lt;deployAs&gt;immediate&lt;/deployAs&gt;
&lt;checkpoint file="" time="3000"/&gt;
&lt;timerImpl/&gt;
&lt;/cpeConfig&gt;
&lt;/cpeDescription&gt;</pre>
</div>
<div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e1067" href="#d5e1067" class="para">3</a>] </sup>Deprecated</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e1266" href="#d5e1266" class="para">4</a>] </sup>An earlier UIMA version required these to have a
suffix of <span class="quote">&#8220;<span class="quote">_p</span>&#8221;</span>, e.g., <span class="quote">&#8220;<span class="quote">string_p</span>&#8221;</span>. This is no
longer required, but this format is accepted, also, for backward
compatibility.</p></div></div></div>
<div class="chapter" title="Chapter&nbsp;4.&nbsp;CAS Reference" id="ugr.ref.cas"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;4.&nbsp;CAS Reference</h2></div></div></div>
<p>The CAS (Common Analysis System) is the part of the Unstructured Information
Management Architecture (UIMA) that is concerned with creating and handling the data
that annotators manipulate.</p>
<p>Java users typically use the JCas (Java interface to the CAS) when manipulating
objects in the CAS. This chapter describes an alternative interface to the CAS which
allows discovery and specification of types and features at run time. It is recommended
for use when the using code cannot know ahead of time the type system it will be dealing
with.</p>
<p>Use of the CAS as described here is also recommended (or necessary) when components add
to the definitions of types of other components. This UIMA feature allows users to add features
to a type that was already defined elsewhere. When this feature is used in conjunction with the
JCas, it can lead to problems with class loading. This is because different JCas representations
of a single type are generated by the different components, and only one of them is loaded
(unless you are using Pear descriptors). Note:
we do not recommend that you add features to pre-existing types. A type should be defined in one
place only, and then there is no problem with using the JCas. However, if you do use this feature,
do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's
UIMA application, and you're not sure that they won't add features to your types, do not use the
JCas for the same reasons.
</p>
<div class="section" title="4.1.&nbsp;Javadocs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.javadocs">4.1.&nbsp;Javadocs</h2></div></div></div>
<p>The subdirectory <code class="literal">docs/api</code> contains the documentation
details of all the classes, methods, and constants for the APIs discussed here. Please
refer to this for details on the methods, classes and constants, specifically in the
packages <code class="literal">org.apache.uima.cas.*</code>.</p>
</div>
<div class="section" title="4.2.&nbsp;CAS Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.overview">4.2.&nbsp;CAS Overview</h2></div></div></div>
<p>There are three<sup>[<a name="d5e1615" href="#ftn.d5e1615" class="footnote">5</a>]</sup> main parts to the CAS: the type system, data creation and
manipulation, and indexing. We will start with a brief
description of these components.</p>
<div class="section" title="4.2.1.&nbsp;The Type System"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.type_system">4.2.1.&nbsp;The Type System</h3></div></div></div>
<p>The type system specifies what kind of data you will be able to manipulate in your
annotators. The type system defines two kinds of entities, types and features. Types
are arranged in a single inheritance tree and define the kinds of entities (objects)
you can manipulate in the CAS. Features optionally specify slots or fields within a
type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS
Features to fields within the type. A critical difference is that CAS types have no
methods; they are just data structures with named slots (features). These features can
have as values primitive things like integers, floating point numbers, and strings,
and they also can hold references to other instances of objects in the CAS. We call
instances of the data structures declared by the type system <span class="quote">&#8220;<span class="quote">feature
structures</span>&#8221;</span> (not to be confused with <span class="quote">&#8220;<span class="quote">features</span>&#8221;</span>). Feature
structures are similar to the many variants of record structures found in computer
science.<sup>[<a name="d5e1624" href="#ftn.d5e1624" class="footnote">6</a>]</sup></p>
<p>Each CAS Type defines a supertype; it is a subtype of that supertype. This means
that any features that the supertype defines are features of the subtype; in other
words, it inherits its supertype's features. Only single inheritance is
supported; a type's feature set is the union of all of the features in its
supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top,
root node of the inheritance tree. It defines no features.</p>
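<p>Feature inheritance along a single-inheritance chain can be modelled in a few lines.
The sketch below is a toy, not the UIMA type system API: the class
<code class="literal">ToyType</code> is hypothetical, and the types
<code class="literal">example.Token</code> and its feature
<code class="literal">partOfSpeech</code> are invented for the example.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a single-inheritance type; not the UIMA Type interface. */
class ToyType {
    final String name;
    final ToyType supertype;          // null only for the root type
    final List<String> ownFeatures;   // features introduced by this type

    ToyType(String name, ToyType supertype, String... ownFeatures) {
        this.name = name;
        this.supertype = supertype;
        this.ownFeatures = Arrays.asList(ownFeatures);
    }

    /** All features of this type: inherited ones first, then the ones it introduces. */
    List<String> allFeatures() {
        List<String> all = new ArrayList<>();
        if (supertype != null) {
            all.addAll(supertype.allFeatures());
        }
        all.addAll(ownFeatures);
        return all;
    }

    public static void main(String[] args) {
        ToyType top = new ToyType("uima.cas.TOP", null); // the root defines no features
        ToyType annotationBase = new ToyType("uima.cas.AnnotationBase", top, "sofa");
        ToyType annotation = new ToyType("uima.tcas.Annotation", annotationBase, "begin", "end");
        ToyType token = new ToyType("example.Token", annotation, "partOfSpeech"); // invented type
        System.out.println(token.name + " has features " + token.allFeatures());
    }
}
```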
<p>The values that can be stored in features are either built-in primitive values or
references to other feature structures. The primitive values are
<code class="literal">boolean</code>, <code class="literal">byte</code>,
<code class="literal">short</code> (16 bit integers), <code class="literal">integer</code> (32
bit), <code class="literal">long</code> (64 bit), <code class="literal">float</code> (32 bit),
<code class="literal">double</code> (64 bit floats) and strings; the official names of these
are <code class="literal">uima.cas.Boolean</code>, <code class="literal">uima.cas.Byte</code>,
<code class="literal">uima.cas.Short</code>, <code class="literal">uima.cas.Integer</code>,
<code class="literal">uima.cas.Long</code>, <code class="literal">uima.cas.Float</code>,
<code class="literal">uima.cas.Double</code> and <code class="literal">uima.cas.String</code>.
The strings are Java strings, and characters are Java characters. Technically, this means
that characters are UTF-16 code units, which is not quite the same as a Unicode character.
This distinction should make no difference for almost all applications.
The CAS also defines other basic built-in types for arrays of these, plus arrays of
references to other objects, called <code class="literal">uima.cas.IntegerArray</code>,
<code class="literal">uima.cas.FloatArray</code>,
<code class="literal">uima.cas.StringArray</code>,
<code class="literal">uima.cas.FSArray</code>, etc.</p>
<p>The CAS also defines a built-in type called
<code class="literal">uima.tcas.Annotation</code> which inherits from
<code class="literal">uima.cas.AnnotationBase</code> which in turn inherits from
<code class="literal">uima.cas.TOP</code>. There are two features defined by this type,
called <code class="literal">begin</code> and <code class="literal">end</code>, both of which are
integer valued.</p>
</div>
<div class="section" title="4.2.2.&nbsp;Creating, accessing and manipulating data"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.creating_accessing_manipulating_data">4.2.2.&nbsp;Creating, accessing and manipulating data</h3></div></div></div>
<p>
Creating and accessing data in the CAS requires knowledge about the types and features
defined in the type system. The idea is similar to other data access APIs, such as the XML
DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the
CAS does not use the names of type system entities directly in the APIs. Rather, you use
the type system to access type and feature entities by name, then use these entities in the
data manipulation APIs. This can be compared to the Java reflection APIs: the type system
is comparable to the Java class loader, and the type and feature objects to the
<code class="literal">java.lang.Class</code> and <code class="literal">java.lang.reflect.Field</code> classes.
</p>
<p>
Why does it have to be this complicated? You wouldn't normally use reflection to create a
Java object, either. As mentioned earlier, the JCas provides the more straightforward
method to manipulate CAS data. The CAS access methods described here need only be used for
generic types of applications that need to be able to handle any kind of data (e.g., generic
tooling) or when the JCas may not be used for other reasons. The generic kinds of applications
are exactly the ones where you would use the reflection API in Java as well.
</p>
</div>
<div class="section" title="4.2.3.&nbsp;Creating and using indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.creating_using_indexes">4.2.3.&nbsp;Creating and using indexes</h3></div></div></div>
<p>Each view of a CAS provides a set of indexes for that view. Instances of Types (that is, Feature
Structures) can be added to a view's indexes. These indexes provide
a way for annotators to locate existing data in the CAS, using a specific index (or the
method <code class="literal">getAllIndexedFS</code> of the object <code class="literal">FSIndexRepository</code>) to
retrieve the Feature Structures that were previously created.
Newly created Feature Structures are not automatically added to the indexes; you choose which
Feature Structures to add and use one of several APIs to add them.
</p>
<p>Indexes are named and are associated with a CAS Type; they are used to index
instances of that CAS type (including instances of that type's subtypes). If
you are using multiple views (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.mvs" class="olink">Chapter&nbsp;6, <i>Multiple CAS Views of an Artifact</i></a>),
each view contains a separate instantiation of all of the indexes.
To access an index, you
minimally need to know its name. A CAS view provides an index repository which you can
query for indexes for that view. Once you have a handle to an index, you can get
information about the feature structures in the index, the size of the index, as well
as an iterator over the feature structures.</p>
<p>There are three kinds of indexes:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem">
<p>bag - no ordering</p>
</li><li class="listitem">
<p>set - uses a user-specified set of keys to define equality; holds one instance of the set of equal items.</p>
</li><li class="listitem">
<p>sorted - uses a user-specified set of keys to define ordering.</p>
</li></ul></div><p>
</p>
<p>For set indexes, the comparator keys are augmented with an implicit additional field - the type of the
feature structure. This means that an index over Annotations, having subtype Token, and a key of the "begin" value,
will behave as follows:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>If you make two Tokens (or two Annotations), both having a begin value of 17, and add both of them to the indexes,
only one of them will be in the index.</p>
</li><li class="listitem"><p>If you make 1 Token and 1 Annotation, both having a begin value of 17, and add both of them to the indexes,
both of them will be in the index (because the types are different).
</p></li></ul></div><p>
</p>
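<p>The two cases above can be reproduced with a toy set index whose equality key is the
declared key plus the type name. This is a sketch, not the UIMA index implementation:
the class <code class="literal">ToySetIndex</code> is hypothetical, and keeping the first of
two equal items is an assumption, since the text only says that one of them remains.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Toy set index keyed on "begin" plus the implicit type field; not UIMA code. */
class ToySetIndex {
    // Map key is "typeName|begin": the declared key augmented with the type.
    private final Map<String, String> entries = new HashMap<>();

    /** Returns true if the item was added, false if an equal item was already indexed. */
    boolean add(String typeName, int begin, String label) {
        return entries.putIfAbsent(typeName + "|" + begin, label) == null;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ToySetIndex index = new ToySetIndex();
        System.out.println("first Token at 17 added: " + index.add("Token", 17, "t1"));
        System.out.println("second Token at 17 added: " + index.add("Token", 17, "t2"));
        System.out.println("Annotation at 17 added: " + index.add("Annotation", 17, "a1"));
        System.out.println("index size: " + index.size());
    }
}
```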
<p>Indexes are defined in the XML descriptor metadata for the application. Each CAS
View has its own, separate instantiation of indexes based on these definitions,
kept in the view's index repository. When you obtain an index, it is always from a
particular CAS view's index repository.
When you index an item, it is always added to all indexes where it
belongs, within just the view's repository. You can specify different repositories
(associated with different CAS views) to use; a given Feature Structure instance
may be indexed in more than one CAS View (unless it is a subtype of AnnotationBase).</p>
<p>Indexes implement the Iterable interface, so you may use the Java enhanced for loop to iterate over them.</p>
<p>You can also get iterators from indexes;
iterators allow you to enumerate the feature structures in an index. There are two kinds of iterators supported:
the regular Java iterator API, and a specific FS iterator API
where the usual Java iterator APIs (<code class="literal">hasNext()</code> and <code class="literal">next()</code>)
are augmented by <code class="literal">isValid()</code>, <code class="literal">moveToNext() / moveToPrevious()</code> (which does
not return an element) and <code class="literal">get()</code>. Finally, there is a <code class="literal">moveTo(FeatureStructure)</code>
API, which, for sorted indexes, moves the iteration point to the left-most (among otherwise "equal") item
in the index which compares "equal" to the given FeatureStructure, using the index's defined comparator.
</p>
<p>
Which API style you use is up to you,
but we do not recommend mixing the styles as the results are sometimes unexpected. If you
just want to iterate over an index from start to finish, either style is equally appropriate.
If you also use <code class="literal">moveTo(FeatureStructure fs)</code> and
<code class="literal">moveToPrevious()</code>, it is better to use the special FS iterator style.
</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The reason not to mix these styles is that you might expect
next() followed by moveToPrevious() to always work. It does not, because
next() returns the "current" element and then advances to the next position, which might be
beyond the last element. At that point, the iterator becomes "invalid", and
moveToNext and moveToPrevious no longer move the iterator. To reset an invalid iterator,
call moveToFirst(), moveToLast(), or moveTo(FS) on it.</p></div>
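<p>The pitfall described in the note can be reproduced with a small stand-in class. This is a minimal sketch in plain Java, not the UIMA iterator implementation: <code class="literal">CursorSketch</code> and its fields are hypothetical, and only the method names are borrowed from the text above.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for an FS-style iterator over a list; not a UIMA class. */
class CursorSketch<T> {
    private final List<T> items;
    private int pos = 0; // the cursor is valid while 0 <= pos < items.size()

    CursorSketch(List<T> items) { this.items = items; }

    // FS-iterator style methods
    boolean isValid() { return pos >= 0 && pos < items.size(); }

    T get() {
        if (!isValid()) throw new NoSuchElementException();
        return items.get(pos);
    }

    void moveToNext() { if (isValid()) pos++; }      // no effect once invalid
    void moveToPrevious() { if (isValid()) pos--; }  // no effect once invalid
    void moveToFirst() { pos = 0; }
    void moveToLast() { pos = items.size() - 1; }

    // Java-iterator style methods
    boolean hasNext() { return isValid(); }

    T next() {
        T current = get(); // return the current element...
        pos++;             // ...then advance, possibly past the last element
        return current;
    }

    static CursorSketch<String> demo() {
        return new CursorSketch<>(Arrays.asList("a", "b"));
    }
}
```

<p>In this model, calling <code class="literal">next()</code> twice on a two-element list leaves the cursor invalid, so a following <code class="literal">moveToPrevious()</code> does nothing; <code class="literal">moveToLast()</code> makes it valid again.</p>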
<p>Indexes are created by specifying them in the annotator's or
aggregate's resource descriptor. An index specification includes its name,
the CAS type being indexed, the kind (bag, set or sorted) of index it is, and an (optional) set of keys.
The keys are used for set and sorted indexes, and specify what values are used for
ordering, or (for sets) what values are used to determine set equality.
When a CAS pipeline is created, all index
specifications are combined; duplicate definitions (having the same name) are
allowed only if their definitions are the same. </p>
<p>Feature structure instances need to be explicitly added to the index repository by a
method call. Feature structures that are not indexed will not be visible to other
annotators (unless they can be reached through a feature of
another feature structure that is indexed, or through a chain of such references).</p>
<p>The framework defines an unnamed bag index which indexes all types. The
only access provided for this index is the getAllIndexedFS(type) method on the
index repository, which returns an iterator over all indexed instances of the
specified type (including its subtypes) for that CAS View.
</p>
<p>The framework defines one standard, built-in annotation index, called
AnnotationIndex, which indexes the <code class="literal">uima.tcas.Annotation</code>
type: all feature structures of type <code class="literal">uima.tcas.Annotation</code> or
its subtypes are automatically indexed with this built-in index.</p>
<p>The ordering relation used by this index is to first order by the value of the
<span class="quote">&#8220;<span class="quote">begin</span>&#8221;</span> feature (in ascending order), then by the value of the
<span class="quote">&#8220;<span class="quote">end</span>&#8221;</span> feature (in descending order), and then, finally, by the
Type Priority. This ordering ensures that
longer annotations starting at the same spot come before shorter ones. For Subjects
of Analysis other than Text, this may not be an appropriate index.</p>
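<p>The ordering just described can be sketched as an ordinary Java comparator. This is an illustration only, not the framework's comparator: the <code class="literal">Span</code> class is hypothetical, and its <code class="literal">typeRank</code> field is an assumed stand-in for the type priority (a lower rank sorts first).</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SpanOrderSketch {
    /** Hypothetical stand-in for an annotation. */
    static final class Span {
        final int begin;
        final int end;
        final int typeRank; // assumed model of type priority: lower sorts first

        Span(int begin, int end, int typeRank) {
            this.begin = begin;
            this.end = end;
            this.typeRank = typeRank;
        }
    }

    /** begin ascending, then end descending, then type rank ascending. */
    static final Comparator<Span> ORDER =
        Comparator.<Span>comparingInt(s -> s.begin)
            .thenComparing(Comparator.<Span>comparingInt(s -> s.end).reversed())
            .thenComparingInt(s -> s.typeRank);

    /** Returns a new list holding the given spans in index order. */
    static List<Span> sorted(Span... spans) {
        List<Span> out = new ArrayList<>(Arrays.asList(spans));
        out.sort(ORDER);
        return out;
    }
}
```

<p>With this comparator, a span covering offsets 0 to 10 sorts before a span covering 0 to 3, which matches the statement that longer annotations starting at the same spot come first.</p>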
<p>In addition to normal iterators, there is a <code class="literal">select</code> API, documented
in the Version 3 Users guide, which provides additional capabilities for accessing
Feature Structures via the indexes.</p>
</div>
</div>
<div class="section" title="4.3.&nbsp;Built-in CAS Types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.builtin_types">4.3.&nbsp;Built-in CAS Types</h2></div></div></div>
<p>The CAS has two kinds of built-in types &#8211; primitive and non-primitive. The
primitive types are:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.Boolean</p></li><li class="listitem"><p>uima.cas.Byte</p></li><li class="listitem"><p>uima.cas.Short</p></li><li class="listitem"><p>uima.cas.Integer</p></li><li class="listitem"><p>uima.cas.Long</p></li><li class="listitem"><p>uima.cas.Float</p></li><li class="listitem"><p>uima.cas.Double</p></li><li class="listitem"><p>uima.cas.String</p></li></ul></div>
<p>The <code class="literal">Byte, Short, Integer, </code>and<code class="literal"> Long</code> are
all signed integer types, of length 8, 16, 32, and 64 bits, respectively. The
<code class="literal">Float</code> type is 32 bit floating point, and the
<code class="literal">Double</code> type is 64 bit floating point. The
<code class="literal">String</code> type can be subtyped to create sets of allowed values; see
<a href="references.html#ugr.ref.xml.component_descriptor.type_system.string_subtypes" class="olink">Section&nbsp;2.3.4, &#8220;String Subtypes&#8221;</a>.
String subtypes can be used to specify the range of a String-valued feature. They act like
Strings, but have additional checking to ensure that any value set into them
is one of the allowed values, or null (which is the value if it is not set).
Note that the other primitive types cannot be used
as a supertype for another type definition; only
<code class="literal">uima.cas.String</code> can be sub-typed.</p>
<p>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the
type <code class="literal">uima.cas.TOP</code>. All other non-primitive types inherit from
some supertype.</p>
<p>There are 9 built-in array types. These arrays have a size specified when they are
created; the size is fixed at creation time. They are named:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.BooleanArray</p></li><li class="listitem"><p>uima.cas.ByteArray</p></li><li class="listitem"><p>uima.cas.ShortArray</p></li><li class="listitem"><p>uima.cas.IntegerArray</p></li><li class="listitem"><p>uima.cas.LongArray</p></li><li class="listitem"><p>uima.cas.FloatArray</p></li><li class="listitem"><p>uima.cas.DoubleArray</p></li><li class="listitem"><p>uima.cas.StringArray</p></li><li class="listitem"><p>uima.cas.FSArray</p></li></ul></div>
<p>The <code class="literal">uima.cas.FSArray</code> type is an array whose elements are
arbitrary other feature structures (instances of non-primitive types).</p>
<p>The JCas cover classes for the array types support the Iterable API, so you may
write extended for loops over instances of these. For example:
</p><pre class="programlisting">FSArray&lt;MyType&gt; myArray = ...
for (MyType fs : myArray) {
  some_method(fs);
}</pre><p>
</p>
<p>There are 3 built-in types associated with the artifact being analyzed:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.AnnotationBase</p></li><li class="listitem"><p>uima.tcas.Annotation</p></li><li class="listitem"><p>uima.tcas.DocumentAnnotation</p></li></ul></div>
<p>The <code class="literal">AnnotationBase</code> type defines one system-used feature
which specifies for an annotation the subject of analysis (Sofa) to which it refers. The
Annotation type extends from this and defines 2 features, taking
<code class="literal">uima.cas.Integer</code> values, called <code class="literal">begin</code>
and <code class="literal">end</code>. The <code class="literal">begin</code> feature typically
identifies the start of a span of text the annotation covers; the
<code class="literal">end</code> feature identifies the end. The values refer to character
offsets; the starting index is 0. An annotation of the word <span class="quote">&#8220;<span class="quote">CAS</span>&#8221;</span> in a text
<span class="quote">&#8220;<span class="quote">CAS Reference</span>&#8221;</span> would have a start index of 0, and an end index of 3; the
difference between end and start is the length of the span the annotation refers
to.</p>
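<p>The offset convention can be checked with plain Java string operations. This is a minimal sketch, not a UIMA API: <code class="literal">OffsetSketch</code> and <code class="literal">coveredText</code> are hypothetical names, and the sofa text is assumed to be an ordinary Java <code class="literal">String</code>.</p>

```java
class OffsetSketch {
    /**
     * Returns the text covered by the character range that starts at
     * offset begin (0-based) and stops just before offset end.
     */
    static String coveredText(String sofaText, int begin, int end) {
        return sofaText.substring(begin, end);
    }

    /** The length of the span is the difference between end and begin. */
    static int spanLength(int begin, int end) {
        return end - begin;
    }
}
```

<p>For the text <span class="quote">&#8220;<span class="quote">CAS Reference</span>&#8221;</span>, <code class="literal">coveredText(text, 0, 3)</code> yields the three characters of the word <span class="quote">&#8220;<span class="quote">CAS</span>&#8221;</span>.</p>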
<p>Annotations are always with respect to some Sofa (Subject of Analysis &#8211; see
<a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a>
<a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter&nbsp;5, <i>Annotations, Artifacts, and Sofas</i></a>).</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Artifacts which are not text strings may have a different interpretation of
the meaning of begin and end, or may define their own kind of annotation, extending from
<code class="literal">AnnotationBase</code>. </p></div>
<p><a name="ugr.ref.cas.document_annotation"></a>The <code class="literal">DocumentAnnotation</code> type has one special instance. It is
a subtype of the Annotation type, and the built-in definition defines one feature,
<code class="literal">language</code>, which is a string indicating the language of the
document in the CAS. The value of this language feature is used by the system to control
flow among annotators when the <span class="quote">&#8220;<span class="quote">CapabilityLanguageFlow</span>&#8221;</span> mode is used,
allowing the flow to skip over annotators that don't process particular
languages. Users may extend this type by adding additional features to it, using the XML
Descriptor element for defining a type.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>
We do <span class="emphasis"><em>not</em></span> recommend extending the <code class="literal">DocumentAnnotation</code>
type. If you do, you must <span class="emphasis"><em>not</em></span> use the JCas, for the reasons stated
earlier.
</p></div>
<p>Each CAS view has a different associated instance of the
<code class="literal">DocumentAnnotation</code> type. On the CAS, use
<code class="literal">getDocumentAnnotation()</code> to access the
<code class="literal">DocumentAnnotation</code>.</p>
<p>There are also built-in types supporting linked lists, similar to the ones available in
Java and other programming languages. Their use is
constrained by the usual properties of linked lists: not very space efficient, no (efficient)
random access, but an easy choice if you don't know how long your list will be ahead of time. The
implementation is type specific; there are different list building objects for each of
the primitive types, plus one for general feature structures. Here are the type names:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.FloatList</p></li><li class="listitem"><p>uima.cas.IntegerList</p></li><li class="listitem"><p>uima.cas.StringList</p></li><li class="listitem"><p>uima.cas.FSList</p>
<p></p></li><li class="listitem"><p>uima.cas.EmptyFloatList</p></li><li class="listitem"><p>uima.cas.EmptyIntegerList</p></li><li class="listitem"><p>uima.cas.EmptyStringList</p></li><li class="listitem"><p>uima.cas.EmptyFSList</p>
<p></p></li><li class="listitem"><p>uima.cas.NonEmptyFloatList</p></li><li class="listitem"><p>uima.cas.NonEmptyIntegerList</p></li><li class="listitem"><p>uima.cas.NonEmptyStringList</p></li><li class="listitem"><p>uima.cas.NonEmptyFSList</p></li></ul></div>
<p>For each of the four element kinds, <code class="literal">Float</code>,
<code class="literal">Integer</code>, <code class="literal">String</code> and
<code class="literal">FeatureStructure</code>, there is a base type, for instance,
<code class="literal">uima.cas.FloatList</code>. Each base type has two subtypes:
one for a non-empty element, and one that serves as a marker for the end of the
list, or for an empty list. The non-empty types define two features &#8211;
<code class="literal">head</code> and <code class="literal">tail</code>. The head feature holds the
particular value for that part of the list. The tail refers to the next list object
(either a non-empty one or the empty version to indicate the end of the list).</p>
<p>For JCas users, the new operator for the NonEmptyXyzList classes includes a 3 argument version
where you may specify the head and tail values as part of the constructor. The JCas
cover classes for these implement
a <code class="code">push(item)</code> method which creates a new non-empty node, sets the <code class="code">head</code> value
to <code class="code">item</code>, and the tail to the node it is called on, and returns the new node.
These classes also implement Iterable, so you can use the enhanced Java <code class="code">for</code> operator.
The iterator stops when it gets to the end of the list, determined by either the tail being null or
the element being one of the EmptyXXXList elements.
Here's a StringList example:
</p><pre class="programlisting">StringList sl = jcas.emptyStringList();
sl = sl.push("2");
sl = sl.push("1");
for (String s : sl) {
  someMethod(s); // some sample use
}</pre><p>
</p>
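<p>The head/tail structure and the behavior of <code class="code">push</code> can be modeled in a few lines of plain Java. This is a sketch of the idea only, not the JCas cover classes: <code class="literal">StrListSketch</code> and its nested classes are hypothetical names.</p>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical model of a head/tail list; not the UIMA StringList class. */
abstract class StrListSketch implements Iterable<String> {
    /** Marker node that stands for an empty list or the end of a list. */
    static final StrListSketch EMPTY = new Empty();

    /** Creates a new non-empty node whose head is item and whose tail is this node. */
    StrListSketch push(String item) {
        return new NonEmpty(item, this);
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private StrListSketch node = StrListSketch.this;

            // Stops at the empty marker, and also if a tail is null.
            @Override
            public boolean hasNext() { return node instanceof NonEmpty; }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                NonEmpty n = (NonEmpty) node;
                node = n.tail;
                return n.head;
            }
        };
    }

    static final class Empty extends StrListSketch { }

    static final class NonEmpty extends StrListSketch {
        final String head;
        final StrListSketch tail;

        NonEmpty(String head, StrListSketch tail) {
            this.head = head;
            this.tail = tail;
        }
    }

    /** Copies the elements into a java.util.List, in iteration order. */
    static List<String> toList(StrListSketch list) {
        List<String> out = new ArrayList<>();
        for (String s : list) {
            out.add(s);
        }
        return out;
    }
}
```

<p>Because each <code class="code">push</code> puts the new node in front, pushing "2" and then "1" onto the empty list gives a list that iterates as "1" followed by "2".</p>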
<p>There are no other built-in types. Users are free to define their own type systems,
building upon these types.</p>
</div>
<div class="section" title="4.4.&nbsp;Accessing the type system"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.accessing_the_type_system">4.4.&nbsp;Accessing the type system</h2></div></div></div>
<p>
During annotator processing, or outside an annotator, access the type system by calling
<code class="literal">CAS.getTypeSystem()</code>.
</p>
<p>However, CAS annotators implement an additional method,
<code class="literal">typeSystemInit()</code>, which is called by the UIMA framework before the
annotator's process method. This method, implemented by the annotator writer,
is passed a reference to the CAS's type system metadata. The method typically uses
the type system APIs to obtain type and feature objects corresponding to all the types
and features the annotator will be using in its process method. This initialization
step should not be done during an annotator's initialize method since the type
system can change after the initialize method is called; it should not be done during the
process method, since this is presumably work that is identical for each incoming
document, and so should be performed only when the type system changes (which will be a
rare event). The UIMA framework guarantees it will call the <code class="literal">typeSystemInit
</code>method of an annotator whenever the type system changes, before calling the
annotator's <code class="literal">process()</code> method.</p>
<p>The initialization done by <code class="literal">typeSystemInit()</code> is done by the
UIMA framework when you use the JCas APIs; you only need to provide a
<code class="literal">typeSystemInit()</code> method, as described here, when you are not using
the JCas approach.</p>
<div class="section" title="4.4.1.&nbsp;TypeSystemPrinter example"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.type_system.printer_example">4.4.1.&nbsp;TypeSystemPrinter example</h3></div></div></div>
<p>Here is a code fragment that, given a CAS Type System, will print a list of all
types.</p>
<pre class="programlisting">// Get all type names from the type system
// and print them to stdout.
private void listTypes1(TypeSystem ts) {
  System.out.println("Types in the type system:");
  for (Type t : ts) {
    // print its name.
    System.out.println(t.getName());
  }
}</pre>
<p>This method is passed the type system as a parameter. The enhanced for loop
iterates over all the types in the type system.
If you run this against a CAS created with no additional
user-defined types, you should see something like this on the console:</p>
<pre class="programlisting">Types in the type system:
uima.cas.Boolean
uima.cas.Byte
uima.cas.Short
uima.cas.Integer
uima.cas.Long
uima.cas.ArrayBase
...
</pre>
<p>If the type system had user-defined types these would show up too. Note that some
of these types are not directly creatable &#8211; they are types used by the framework
in the type hierarchy (e.g. uima.cas.ArrayBase).</p>
<p>CAS type names include a name-space prefix. The components of a type name are
separated by the dot (.). A type name component must start with a Unicode letter,
followed by an arbitrary sequence of letters, digits and the underscore (_). By
convention, the last component of a type name starts with an uppercase letter, the
rest start with a lowercase letter.</p>
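<p>The naming rule above can be expressed as a regular expression. This is a sketch of the rule as stated in the text, not the check the framework itself performs; the class name is hypothetical, and reading <span class="quote">&#8220;<span class="quote">letters</span>&#8221;</span> and <span class="quote">&#8220;<span class="quote">digits</span>&#8221;</span> as the Unicode categories <code class="literal">\p{L}</code> and <code class="literal">\p{Nd}</code> is an assumption.</p>

```java
import java.util.regex.Pattern;

class TypeNameSketch {
    // One component: a Unicode letter, then letters, digits or underscores.
    private static final String COMPONENT = "\\p{L}[\\p{L}\\p{Nd}_]*";

    // A type name: one or more components separated by dots.
    private static final Pattern TYPE_NAME =
        Pattern.compile(COMPONENT + "(\\." + COMPONENT + ")*");

    /** Returns true if the whole string follows the naming rule sketched above. */
    static boolean isWellFormed(String typeName) {
        return TYPE_NAME.matcher(typeName).matches();
    }
}
```

<p>Under this reading, <code class="literal">com.xyz.proj.Entity</code> is well formed, while a name containing a space or a component that starts with a digit is not.</p>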
<p>Listing the type names is mildly useful, but it would be even better if we could see
the inheritance relation between the types. The following code prints the
inheritance tree in indented format.</p>
<pre class="programlisting">private static final int INDENT = 2;

private void listTypes2(TypeSystem ts) {
  // Get the root of the inheritance tree.
  Type top = ts.getTopType();
  // Recursively print the tree.
  printInheritanceTree(ts, top, 0);
}

private void printInheritanceTree(TypeSystem ts, Type type, int level) {
  indent(level); // Print indentation.
  System.out.println(type.getName());
  // Get a list of the immediate subtypes.
  List&lt;Type&gt; subTypes = ts.getDirectlySubsumedTypes(type);
  ++level; // Increase the indentation level.
  for (Type subType : subTypes) {
    // Print the subtypes.
    printInheritanceTree(ts, subType, level);
  }
}

// A simple, inefficient indenter
private void indent(int level) {
  int spaces = level * INDENT;
  for (int i = 0; i &lt; spaces; i++) {
    System.out.print(" ");
  }
}</pre>
<p> This example shows that you can traverse the type hierarchy by starting at the top
with TypeSystem.getTopType and by retrieving subtypes with
<code class="literal">TypeSystem.getDirectlySubsumedTypes()</code>.</p>
<p>The type system APIs (see the Javadocs) also allow you to access the features, as well as
the allowed value type for each feature. Here is sample code which prints out all the
features of all the types, together with the allowed value types (the feature
<span class="quote">&#8220;<span class="quote">range</span>&#8221;</span>). Each feature has a <span class="quote">&#8220;<span class="quote">domain</span>&#8221;</span> which is the type
where it is defined, as well as a <span class="quote">&#8220;<span class="quote">range</span>&#8221;</span>.
</p><pre class="programlisting">private void listFeatures2(TypeSystem ts) {
  Iterator&lt;Feature&gt; featureIterator = ts.getFeatures();
  System.out.println("Features in the type system:");
  while (featureIterator.hasNext()) {
    Feature f = featureIterator.next();
    System.out.println(
        f.getShortName() + ": " +
        f.getDomain() + " -&gt; " + f.getRange());
  }
  System.out.println();
}</pre>
<p>We can ask a feature object for its domain (the type it is defined on) and its range
(the type of the value of the feature). The terminology derives from the fact that
features can be viewed as functions on subspaces of the object space.</p>
</div>
<div class="section" title="4.4.2.&nbsp;Using the CAS APIs to create and modify feature structures"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.cas_apis_create_modify_feature_structures">4.4.2.&nbsp;Using the CAS APIs to create and modify feature structures</h3></div></div></div>
<p>Assume a type system declaration that defines two types: Entity and Person.
Entity has no features defined within it but inherits from uima.tcas.Annotation
&#8211; so it has the begin and end features. Person is, in turn, a subtype of Entity,
and adds firstName and lastName features. CAS type systems are declaratively
specified using XML; the format of this XML is described in <a href="references.html#ugr.ref.xml.component_descriptor.type_system" class="olink">Section&nbsp;2.3, &#8220;Type System Descriptors&#8221;</a>.
</p><pre class="programlisting">&lt;!-- Type System Definition --&gt;
&lt;typeSystemDescription&gt;
&lt;types&gt;
&lt;typeDescription&gt;
&lt;name&gt;com.xyz.proj.Entity&lt;/name&gt;
&lt;description /&gt;
&lt;supertypeName&gt;uima.tcas.Annotation&lt;/supertypeName&gt;
&lt;/typeDescription&gt;
&lt;typeDescription&gt;
&lt;name&gt;Person&lt;/name&gt;
&lt;description /&gt;
&lt;supertypeName&gt;com.xyz.proj.Entity &lt;/supertypeName&gt;
&lt;features&gt;
&lt;featureDescription&gt;
&lt;name&gt;firstName&lt;/name&gt;
&lt;description /&gt;
&lt;rangeTypeName&gt;uima.cas.String&lt;/rangeTypeName&gt;
&lt;/featureDescription&gt;
&lt;featureDescription&gt;
&lt;name&gt;lastName&lt;/name&gt;
&lt;description /&gt;
&lt;rangeTypeName&gt;uima.cas.String&lt;/rangeTypeName&gt;
&lt;/featureDescription&gt;
&lt;/features&gt;
&lt;/typeDescription&gt;
&lt;/types&gt;
&lt;/typeSystemDescription&gt;</pre>
<p>
To be able to access types and features, we need to know their names. The CAS interface defines
constants that hold the names of built-in types and features, for example
<code class="literal">CAS.TYPE_NAME_INTEGER</code>. It is good programming practice to create such
constants for the types and features you define, for your own use as well as for others who will
be using your annotators.
</p>
<pre class="programlisting">/** Entity type name constant. */
public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity";
/** Person type name constant. */
public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person";
/** First name feature name constant. */
public static final String FIRST_NAME_FEAT_NAME = "firstName";
/** Last name feature name constant. */
public static final String LAST_NAME_FEAT_NAME = "lastName";</pre>
<p>Next we define type and feature member variables; these will hold the values of the
type and feature objects needed by the CAS APIs, to be assigned during
<code class="literal">typeSystemInit()</code>.</p>
<pre class="programlisting">// Type system object variables
private TypeSystem typeSystem;
private Type entityType;
private Type personType;
private Feature firstNameFeature;
private Feature lastNameFeature;
private Type stringType;</pre>
<p>The type system does not throw an exception if we ask for something that is
not known, it simply returns null; therefore the code checks for this and throws a proper
exception. We require all these types and features to be defined for the annotator to
work. One might imagine situations where certain computations are predicated on some type
or feature being defined in the type system, but that is not the case here.</p>
<pre class="programlisting">// Get a type object corresponding to a name.
// If it doesn't exist, throw an exception.
private Type initType(String typeName)
    throws AnnotatorInitializationException {
  Type type = this.typeSystem.getType(typeName);
  if (type == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.TYPE_NOT_FOUND,
        new Object[] { this.getClass().getName(), typeName });
  }
  return type;
}

// We add similar code for retrieving feature objects.
// Get a feature object from a name and a type object.
// If it doesn't exist, throw an exception.
private Feature initFeature(String featName, Type type)
    throws AnnotatorInitializationException {
  Feature feat = type.getFeatureByBaseName(featName);
  if (feat == null) {
    throw new AnnotatorInitializationException(
        AnnotatorInitializationException.FEATURE_NOT_FOUND,
        new Object[] { this.getClass().getName(), featName });
  }
  return feat;
}</pre>
<p>Using these two functions, code for initializing the type system described
above would be:
</p><pre class="programlisting">public void typeSystemInit(TypeSystem aTypeSystem)
    throws AnalysisEngineProcessException {
  this.typeSystem = aTypeSystem;
  // Set type system member variables.
  this.entityType = initType(ENTITY_TYPE_NAME);
  this.personType = initType(PERSON_TYPE_NAME);
  this.firstNameFeature =
      initFeature(FIRST_NAME_FEAT_NAME, personType);
  this.lastNameFeature =
      initFeature(LAST_NAME_FEAT_NAME, personType);
  this.stringType = initType(CAS.TYPE_NAME_STRING);
}</pre>
<p>Note that we initialize the string type by using a type name constant from the
CAS.</p>
</div>
</div>
<div class="section" title="4.5.&nbsp;Creating feature structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.creating_feature_structures">4.5.&nbsp;Creating feature structures</h2></div></div></div>
<p>To create feature structures in JCas, we use the Java <span class="quote">&#8220;<span class="quote">new</span>&#8221;</span>
operator. In the CAS, we use one of several different API methods on the CAS object,
depending on which of the 10 basic kinds of feature structures we are creating (a plain
feature structure, or an instance of the built-in primitive type arrays or FSArray).
There is also a method to create an instance of a
<code class="literal">uima.tcas.Annotation</code>, setting the begin and end
values.</p>
<p>Once a feature structure is created, it needs to be added to the CAS indexes (unless
it will be accessed via some reference from another accessible feature structure).
Assuming aCAS holds a reference to a CAS, and token holds a
reference to a newly created feature structure, here's the code to add that
feature structure to all the relevant CAS indexes:</p>
<pre class="programlisting"> // Add the token to the index repository.
aCAS.addFsToIndexes(token);</pre>
<p>There is also a corresponding <code class="literal">removeFsFromIndexes(token)</code>
method on CAS objects.</p>
<p>As of version 2.4.1, there are two methods you can use on an index repository
to efficiently bulk-remove all
instances of particular types of feature structures from a particular view. One of these,
<code class="code">aCas.getIndexRepository().removeAllIncludingSubtypes(aType)</code>, removes all instances of a particular
type, including instances of its subtypes. The other,
<code class="code">aCas.getIndexRepository().removeAllExcludingSubtypes(aType)</code>, removes only the instances of
exactly that type. In both cases, the removal is done from the particular view of the CAS referenced
by aCas.</p>
<div class="section" title="4.5.1.&nbsp;Updating indexed feature structures"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.updating_indexed_feature_structures">4.5.1.&nbsp;Updating indexed feature structures</h3></div></div></div>
<p>Version 2.7.0 added protection for indexes when feature structure key
value features are updated. By default this protection is automatic, but
at some performance cost. Users may optimize this further.</p>
<p>Protection is needed because some of the indexes (the Sorted and Set types) use comparators defined
to use values of the particular features; if these values
need to be changed after the feature structure is added to the indexes,
the correct way to do this is to:
</p><div class="orderedlist"><ol class="orderedlist" type="1" compact><li class="listitem"><p>completely remove the item from all indexes where it is indexed, in all views
where it is indexed,</p>
</li><li class="listitem"><p>update the value of the features being used as keys,</p></li><li class="listitem"><p>add the item back to the indexes, in all views.</p></li></ol></div>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>It&#8217;s OK to change feature values which are not used in determining
sort ordering (or set membership), without removing and re-adding back to the index.
</p></div>
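<p>The need for the remove, update, add-back sequence can be seen with any sorted collection. The sketch below uses a plain <code class="literal">java.util.TreeSet</code> as a stand-in for a sorted index; it is not the UIMA index implementation, and <code class="literal">KeyUpdateSketch</code>, <code class="literal">Item</code> and the method names are hypothetical.</p>

```java
import java.util.Comparator;
import java.util.TreeSet;

class KeyUpdateSketch {
    /** Hypothetical indexed item; begin is used as a sort key. */
    static final class Item {
        int begin;
        final String label;

        Item(int begin, String label) {
            this.begin = begin;
            this.label = label;
        }
    }

    /** A sorted "index" ordered by begin, then by label. */
    static TreeSet<Item> newIndex() {
        return new TreeSet<>(
            Comparator.<Item>comparingInt(i -> i.begin).thenComparing(i -> i.label));
    }

    /**
     * Unsafe: changes the key while the item is still in the index, so the
     * item stays at its old position. Returns whether a lookup still finds it.
     */
    static boolean updateInPlace(TreeSet<Item> index, Item item, int newBegin) {
        item.begin = newBegin;
        return index.contains(item);
    }

    /**
     * Safe: remove, update the key, then add back.
     * Returns whether a lookup finds the item afterwards.
     */
    static boolean updateSafely(TreeSet<Item> index, Item item, int newBegin) {
        boolean wasIndexed = index.remove(item); // 1. remove using the old key
        item.begin = newBegin;                   // 2. update the key feature
        if (wasIndexed) {
            index.add(item);                     // 3. add back under the new key
        }
        return index.contains(item);
    }
}
```

<p>In this model, an in-place key change can leave the item where a lookup by its new key no longer reaches it, whereas the three-step sequence keeps the sorted structure consistent.</p>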
<p>The automatic protection checks for updates of
features being used as keys, and if it finds an update like this for a feature structure that
is in the indexes, it removes the feature structure from the indexes, does the update,
and adds it back. It will do this for every feature update. This is obviously not
efficient when multiple features are being updated; in that case it would be better to
remove the feature structure, do all the updates to all the features needing updates, and then
do a single add-back operation.</p>
<p>This is supported in user&#8217;s code by using the method <code class="code">protectIndexes</code>,
available in both the CAS and JCas interfaces.
Here are two ways
of using this, one with a try / finally and the other with a Runnable:
</p><pre class="programlisting">// an approach using try / finally
AutoCloseable ac = my_cas.protectIndexes(); // my_cas is a CAS or a JCas
try {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
} finally {
  ac.close();
}

// This can more compactly be written using the auto-close feature of try:
try (AutoCloseable ac = my_cas.protectIndexes()) {
  // ... arbitrary user code which updates features
  //     which may be "keys" in one or more indexes
}

// an approach using a Runnable, written in Java 8 lambda syntax
my_cas.protectIndexes(() -&gt; {
  // ... arbitrary user code updating "key" features,
  //     but no checked exceptions are permitted
});</pre>
<p>The <code class="code">protectIndexes</code> implementation only removes feature structures that
have features being updated which are used as keys in some index(es). At the end of the scope
of the protectIndexes, it adds all of these back. It also skips removing feature structures
from bag indexes, since these have no keys.</p>
<p>Within a <code class="code">protectIndexes</code> block, do not do any operations which depend on the
indexes being valid, such as creating and using an iterator. This is because the removed FSs
are only added back at the end of the protectIndexes block.</p>
<p>The JVM property <code class="code">-Duima.report_fs_update_corrupts_index</code> will generate a log entry
every time the framework finds (and automatically surrounds with a remove and add-back) an update to
a feature which could corrupt the index. The log entries can be identified by scanning for messages
starting with <code class="code">While FS was in the index, the feature</code>; the message goes on to identify
the feature in question. Users can use these reports to find the places in their code where
they can either change the design to avoid updating these values after the item is indexed, or
surround the updates with their own <code class="code">protectIndexes</code> blocks.</p>
<p>Out of the box, the UIMA framework runs with automatic (but somewhat inefficient)
protection. To improve upon this,
users would:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>Turn on reporting using a global JVM flag <code class="code">
-Duima.report_fs_update_corrupts_index</code>.
This will cause a message to be logged each time the automatic protection is being invoked,
and allows the user to find the spots to improve.</p>
</li><li class="listitem"><p>Improve each spot, perhaps by surrounding the update code with a protectIndexes
block, or by rearranging code to reduce updating feature values used as index keys.</p>
</li><li class="listitem"><p>Once the code is no longer generating any reports, you can turn off the
automatic protection for production runs using the JVM global property
<code class="code">-Duima.disable_auto_protect_indexes</code>, and rely on the protectIndexes blocks.
If protection is disabled, then the corruption detection is skipped, making the production
runs perhaps a bit faster, although this is not significant in most cases.</p></li><li class="listitem"><p>For automated build systems, there&#8217;s a JVM parameter,
<code class="code">-Duima.exception_when_fs_update_corrupts_index</code>, which will throw an
exception if any automatic recovery situation is encountered. You can use this
in build/test scenarios to ensure
(after adding all needed protectIndexes blocks) that the code remains safe for
turning off the checking in production runs.</p></li></ul></div><p>
</p>
</div>
</div>
<div class="section" title="4.6.&nbsp;Accessing or modifying features of feature structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.accessing_modifying_features_of_feature_structures">4.6.&nbsp;Accessing or modifying features of feature structures</h2></div></div></div>
<p>Values of individual features for a feature structure can be set or referenced,
using a set of methods that depend on the type of value that feature is declared to have.
There are methods on FeatureStructure for this: getBooleanValue, getByteValue,
getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue,
getStringValue, and getFeatureValue (which means to get a value which in turn is a
reference to a feature structure). There are corresponding <span class="quote">&#8220;<span class="quote">setter</span>&#8221;</span>
methods, as well. These methods on the feature structure object take as arguments the
feature object retrieved earlier in the typeSystemInit method.</p>
<p>Using the previous example, with the type system initialized with type personType
and feature lastNameFeature, here's a sample code fragment that gets and sets
that feature:</p>
<pre class="programlisting">// Assume aPerson is a variable holding an object of type Person
// get the lastNameFeature value from the feature structure
String lastName = aPerson.getStringValue(lastNameFeature);
// set the lastNameFeature value
aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</pre>
<p>The getters and setters for each of the primitive types are defined in the Javadocs
as methods of the FeatureStructure interface.</p>
</div>
<div class="section" title="4.7.&nbsp;Indexes and Iterators"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.indexes_and_iterators">4.7.&nbsp;Indexes and Iterators</h2></div></div></div>
<p>Each CAS can have many indexes associated with it; each CAS View contains
a complete set of instantiations of the indexes. Each index is represented by an
instance of the type org.apache.uima.cas.FSIndex. You use the object
org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to
retrieve instances of indexes. There are methods that let you select the index
by name, by type, or by both name and type. Since each index is already associated with a type,
passing both a name and a type is valid only if the type passed in is the same
type or a subtype of the one declared in the index specification for the named index. If you
pass in a subtype, the returned FSIndex object refers to an index that will return only
items belonging to that subtype (or subtypes of that subtype).</p>
<p>The returned FSIndex objects are used, in turn, to create iterators.
There is also a method on the Index Repository, <code class="literal">getAllIndexedFS</code>,
which will return an iterator over all indexed Feature Structures (for that CAS View),
in no particular order. The iterators
created can be used like common Java iterators, to sequentially retrieve items
indexed. If the index represents a sorted index, the items are returned in a sorted
order, where the sort order is specified in the XML index definition. This XML is part of
the Component Descriptor, see <a href="references.html#ugr.ref.xml.component_descriptor.aes.index" class="olink">Section&nbsp;2.4.1.5, &#8220;Index Definition&#8221;</a>.</p>
<p>In UIMA V3, Feature structures may be added to or removed from indexes while iterating
over them. If this happens, any iterators already created will continue to operate over the
before-modification version of the index, unless or until the iterator is re-synchronized with the current
value of the index via one of the following three iterator API calls:
moveToFirst, moveToLast, or moveTo(FeatureStructure).
ConcurrentModificationException is no longer thrown in UIMA v3.
</p>
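<p>The snapshot behavior described above can be pictured with a small plain-Java model.
This is only a sketch under stated assumptions: <code class="literal">SnapshotIndexModel</code> and its
<code class="literal">Cursor</code> are hypothetical names, not UIMA classes, and the copy-on-write array
is just one way to obtain a stable view; it is not a description of how UIMA implements its indexes.</p>
<pre class="programlisting">import java.util.Arrays;

// Plain-Java model of snapshot iteration; these are not UIMA classes.
class SnapshotIndexModel {
    // Copy-on-write storage: every add() replaces the array.
    private String[] items = new String[0];

    void add(String item) {
        String[] copy = Arrays.copyOf(items, items.length + 1);
        copy[items.length] = item;
        items = copy;
    }

    Cursor iterator() {
        return new Cursor();
    }

    // Hypothetical cursor, loosely mirroring the moveToFirst idea.
    class Cursor {
        private String[] snapshot = items; // view fixed at creation time
        private int pos = 0;

        boolean isValid() {
            return snapshot.length > pos;
        }

        void moveToNext() {
            pos++;
        }

        // Re-synchronize with the current contents of the index.
        void moveToFirst() {
            snapshot = items;
            pos = 0;
        }
    }

    // Counts the items the cursor can still reach.
    static int count(Cursor it) {
        int n = 0;
        while (it.isValid()) {
            n++;
            it.moveToNext();
        }
        return n;
    }

    public static void main(String[] args) {
        SnapshotIndexModel index = new SnapshotIndexModel();
        index.add("a");
        index.add("b");
        Cursor it = index.iterator();
        index.add("c");                // modification while iterating
        int before = count(it);        // still sees the old contents
        it.moveToFirst();              // re-synchronize
        int after = count(it);
        System.out.println(before + " then " + after); // prints 2 then 3
    }
}</pre>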
<p>Features that are used as the "keys" of an index may be updated on Feature Structures that are being iterated over.
If this is done, UIMA will protect the indexes (to prevent index corruption) by automatically removing the
Feature Structure from the indexes,
updating the feature, and adding the FS back to the indexes (possibly in a new position).
This automatic remove / add-back operation no longer makes the iterator throw a ConcurrentModificationException
(as it did in UIMA Version 2) if the iterator is incremented or decremented;
existing iterators will continue to operate as if no index modification occurred.
</p>
<div class="section" title="4.7.1.&nbsp;Built-in Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.built_in_indexes">4.7.1.&nbsp;Built-in Indexes</h3></div></div></div>
<p>An unnamed built-in bag index exists which holds all feature structures which are indexed.
The only access to this index is the method getAllIndexedFS(Type) which returns an iterator
over all indexed Feature Structures.</p>
<p>The CAS also contains a built-in index for the type <code class="literal">uima.tcas.Annotation</code>, which sorts
annotations in the order in which they appear in the document. Annotations are sorted first by increasing
<code class="literal">begin</code> position. Ties are then broken by <span class="emphasis"><em>decreasing</em></span>
<code class="literal">end</code> position (so that longer annotations come first). Annotations that match in both
their <code class="literal">begin</code> and <code class="literal">end</code> features are sorted using the Type Priority,
if any are defined
(see <a href="references.html#ugr.ref.xml.component_descriptor.aes.type_priority" class="olink">Section&nbsp;2.4.1.4, &#8220;Type Priority Definition&#8221;</a>).</p>
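<p>As a rough illustration of this ordering, here is a plain-Java sketch that sorts spans by
increasing begin and then decreasing end. It assumes each annotation is reduced to a
<code class="literal">{begin, end}</code> pair and leaves out the Type Priority tie-break;
<code class="literal">AnnotationOrderModel</code> is a hypothetical name, not a UIMA class.</p>
<pre class="programlisting">import java.util.Arrays;

// Plain-Java model of the ordering described above; not a UIMA class.
class AnnotationOrderModel {
    // Each span is {begin, end}.
    static int compare(int[] a, int[] b) {
        if (a[0] != b[0]) {
            return Integer.compare(a[0], b[0]); // increasing begin
        }
        return Integer.compare(b[1], a[1]);     // decreasing end
    }

    // Returns a sorted copy; the input array is left untouched.
    static int[][] sorted(int[][] spans) {
        int[][] copy = spans.clone();
        Arrays.sort(copy, AnnotationOrderModel::compare);
        return copy;
    }

    public static void main(String[] args) {
        int[][] spans = { {5, 9}, {0, 4}, {0, 12}, {5, 7} };
        System.out.println(Arrays.deepToString(sorted(spans)));
        // prints [[0, 12], [0, 4], [5, 9], [5, 7]]
    }
}</pre>
<p>Among spans that start at the same position, the longer one comes first, so an enclosing
annotation is visited before the annotations it covers.</p>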
</div>
<div class="section" title="4.7.2.&nbsp;Adding Feature Structures to the Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.adding_to_indexes">4.7.2.&nbsp;Adding Feature Structures to the Indexes</h3></div></div></div>
<p>Feature Structures are added to the indexes by various APIs. These add the Feature Structure to
<span class="emphasis"><em>all</em></span> indexes that are defined for the type of that FeatureStructure (or any of its
supertypes), in a particular view.
Note that you should not add a Feature Structure to the indexes until you have set values for all
of the features that may be used as sort keys in an index.</p>
<p>There are multiple APIs for adding FSs to the index.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>(preferred) myFeatureStructure.addToIndexes(). This adds the feature structure instance to the
view in which it was originally created.</p>
</li><li class="listitem"><p>(preferred) myFeatureStructure.addToIndexes(JCas or CAS). This adds the feature structure instance to the
view represented by the argument.</p>
</li><li class="listitem"><p>(older form) casView.addFsToIndexes(myFeatureStructure) or jcasView.addFsToIndexes(myFeatureStructure).
This adds the feature structure instance to the
view represented by the cas (or jcas).</p>
</li><li class="listitem"><p>(older form) fsIndexRepositoryView.addFsToIndexes(myFeatureStructure).
This adds the feature structure instance to the
view represented by the fsIndexRepository instance.</p>
</li></ul></div><p>
</p>
</div>
<div class="section" title="4.7.3.&nbsp;Iterators over UIMA Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.iterators">4.7.3.&nbsp;Iterators over UIMA Indexes</h3></div></div></div>
<p>Iterators are objects that implement the interface <code class="literal">org.apache.uima.cas.FSIterator</code>. This interface
extends <code class="literal">java.util.Iterator</code> and defines the normal Java iterator methods, plus
additional ones that allow moving both forwards and backwards.</p>
<p>UIMA Indexes implement <code class="literal">Iterable</code>, so you can use the index directly in a Java enhanced for loop.</p>
</div>
<div class="section" title="4.7.4.&nbsp;Special iterators for Annotation types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.annotation_index">4.7.4.&nbsp;Special iterators for Annotation types</h3></div></div></div>
<p>Note: we recommend using the UIMA V3 select framework, instead of the following.
It implements all of the following capabilities, and more, in a uniform manner.</p>
<p>The built-in index over the <code class="literal">uima.tcas.Annotation</code> type
named <span class="quote">&#8220;<span class="quote"><code class="literal">AnnotationIndex</code></span>&#8221;</span> has additional
capabilities. To use them, you first get a reference to this built-in index using
either the <code class="literal">getAnnotationIndex</code> method on a CAS View object, or
by asking the <code class="literal">FSIndexRepository</code> object for an index having the
particular name <span class="quote">&#8220;<span class="quote">AnnotationIndex</span>&#8221;</span>, for example:
</p><pre class="programlisting">AnnotationIndex idx = aCAS.getAnnotationIndex();
// or you can iterate over a specific subtype of Annotation:
AnnotationIndex idx = aCAS.getAnnotationIndex(aType); </pre>
<p>This object can be used to produce several additional kinds of iterators. It can
produce unambiguous iterators; these skip over annotations until they find one whose
start position is equal to or greater than the end position of
the previously returned annotation.</p>
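<p>The skipping rule can be sketched in plain Java as follows. This is a model only: it assumes the
spans are already in index order and are reduced to <code class="literal">{begin, end}</code> pairs, and
<code class="literal">UnambiguousModel</code> is a hypothetical name rather than part of the UIMA API.</p>
<pre class="programlisting">import java.util.Arrays;

// Plain-Java model of unambiguous iteration; not a UIMA class.
class UnambiguousModel {
    // spans must already be in index order; each span is {begin, end}.
    static int[][] unambiguous(int[][] spans) {
        int[][] kept = new int[spans.length][];
        int count = 0;
        int lastEnd = Integer.MIN_VALUE;
        for (int[] span : spans) {
            // Keep a span only if it starts at or after the end
            // of the previously returned span.
            if (span[0] >= lastEnd) {
                kept[count++] = span;
                lastEnd = span[1];
            }
        }
        return Arrays.copyOf(kept, count);
    }

    public static void main(String[] args) {
        int[][] spans = { {0, 12}, {0, 4}, {5, 9}, {5, 7}, {12, 15} };
        System.out.println(Arrays.deepToString(unambiguous(spans)));
        // prints [[0, 12], [12, 15]]
    }
}</pre>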
<p>It can also produce several kinds of subiterators; these are iterators whose
annotations fall within the span of another annotation. This kind of iterator can
also have the unambiguous property, if desired. It also can be
<span class="quote">&#8220;<span class="quote">strict</span>&#8221;</span> or not; strict means that the returned annotation lies
completely within the span of the controlling annotation. Non-strict only implies
that the beginning of the returned annotation falls within the span of the
controlling annotation.</p>
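<p>The difference between strict and non-strict can be sketched with a plain-Java filter. The sketch
makes some simplifying assumptions: spans are <code class="literal">{begin, end}</code> pairs,
<span class="quote">&#8220;<span class="quote">within the span</span>&#8221;</span> is read as begin inclusive and end exclusive for the
begin test, and other subiterator details (such as where iteration starts in the index) are ignored.
<code class="literal">SubiteratorModel</code> is a hypothetical name, not a UIMA class.</p>
<pre class="programlisting">import java.util.Arrays;

// Plain-Java model of subiterator filtering; not a UIMA class.
class SubiteratorModel {
    // Decides whether span is returned for the controlling span ctrl.
    static boolean accepts(int[] span, int[] ctrl, boolean strict) {
        if (ctrl[0] > span[0] || span[0] >= ctrl[1]) {
            return false; // begin lies outside the controlling span
        }
        // Strict additionally requires the end to stay inside.
        return !strict || ctrl[1] >= span[1];
    }

    static int[][] covered(int[][] spans, int[] ctrl, boolean strict) {
        int[][] kept = new int[spans.length][];
        int count = 0;
        for (int[] span : spans) {
            if (accepts(span, ctrl, strict)) {
                kept[count++] = span;
            }
        }
        return Arrays.copyOf(kept, count);
    }

    public static void main(String[] args) {
        int[] ctrl = {0, 10};
        int[][] spans = { {0, 4}, {5, 12}, {8, 10}, {10, 15} };
        System.out.println(Arrays.deepToString(covered(spans, ctrl, false)));
        // prints [[0, 4], [5, 12], [8, 10]]
        System.out.println(Arrays.deepToString(covered(spans, ctrl, true)));
        // prints [[0, 4], [8, 10]]
    }
}</pre>
<p>The span <code class="literal">{5, 12}</code> begins inside the controlling span but ends after it, so only
the non-strict variant returns it.</p>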
<p>There is also a method which produces an <code class="literal">AnnotationTree</code>
object, which contains nodes representing the results of doing a strict,
unambiguous subiterator over the span of some controlling annotation. For more
details, please refer to the Javadocs for the
<code class="literal">org.apache.uima.cas.text</code> package.</p>
</div>
<div class="section" title="4.7.5.&nbsp;Constraints and Filtered iterators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.constraints_and_filtered_iterators">4.7.5.&nbsp;Constraints and Filtered iterators</h3></div></div></div>
<p>Note: for new code, consider using the select framework plus Streams, instead of
the following.</p>
<p>There is a set of API calls that build constraint objects. These objects can be
used directly to test if a particular feature structure matches (satisfies) the
constraint, or they can be passed to the createFilteredIterator method to create an
iterator that skips over instances which fail to satisfy the constraint.</p>
<p>It is possible to specify a feature value located by following a chain of
references starting from the feature structure being tested. Here's a
scenario to explore this concept. Let's suppose you have the following type
system (namespaces are omitted for clarity):
</p><div class="blockquote"><blockquote class="blockquote">
<p><span class="bold"><strong>Token</strong></span>, having a feature PartOfSpeech
which holds a reference to another type (POS)</p>
<p><span class="bold"><strong>POS</strong></span> (a type with many subtypes, each
representing a different part of speech)</p>
<p><span class="bold"><strong>Noun</strong></span> (a subtype of POS)</p>
<p><span class="bold"><strong>ProperName</strong></span> (a subtype of Noun),
having a feature Class which holds an integer value encoding some information
about the proper noun.</p></blockquote></div>
<p>If you want to filter Token instances, such that only those tokens get through
which are proper names of class 3 (for example), you would need a test that started with
a Token instance, followed its PartOfSpeech reference to another instance (the
ProperName instance) and then tested the Class feature of that instance for a value
equal to 3.</p>
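<p>Before looking at the constraint API, it may help to see the same test written as an ordinary Java
predicate. The classes below are hypothetical plain-Java stand-ins for the types in the scenario
(they are not JCas classes), and <code class="literal">nameClass</code> stands in for the Class feature.</p>
<pre class="programlisting">// Plain-Java stand-ins for the scenario's types; not JCas classes.
class PathTestModel {
    static class Pos { }
    static class Noun extends Pos { }
    static class ProperName extends Noun {
        final int nameClass; // stands in for the Class feature
        ProperName(int nameClass) { this.nameClass = nameClass; }
    }
    static class Token {
        final Pos partOfSpeech; // stands in for the PartOfSpeech feature
        Token(Pos partOfSpeech) { this.partOfSpeech = partOfSpeech; }
    }

    static boolean isProperNameOfClass(Token token, int wanted) {
        Pos pos = token.partOfSpeech;               // follow the path
        if (!(pos instanceof ProperName)) {         // type test
            return false;
        }
        return ((ProperName) pos).nameClass == wanted; // value test
    }

    public static void main(String[] args) {
        Token[] tokens = {
            new Token(new ProperName(3)),
            new Token(new Noun()),
            new Token(new ProperName(1))
        };
        for (Token token : tokens) {
            System.out.println(isProperNameOfClass(token, 3));
        }
        // prints true, false, false (one per line)
    }
}</pre>
<p>The predicate has the same three parts as a filtered-iterator constraint: a path to follow, a type
test on the value reached, and a value test on one of its features.</p>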
<p>To support this, the filtering approach has components that specify tests, and
components that specify <span class="quote">&#8220;<span class="quote">paths</span>&#8221;</span>. The tests that can be done include
testing references to type instances to see if they are instances of some type or its
subtypes; this is done with a FSTypeConstraint constraint. Other tests check for
equality or, for numeric values, ranges.</p>
<p>Each test may be combined with a path &#8211; to get to the value to test. Tests that
start from a feature structure instance can be combined with and and or connectors.
The Javadocs for these are in the package org.apache.uima.cas in the classes that end
in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS.
Here's an example; assume the variable cas holds a reference to a CAS instance.
</p><pre class="programlisting">// Start by getting the constraint factory from the CAS.
ConstraintFactory cf = cas.getConstraintFactory();
// To specify a path to an item to test, you start by
// creating an empty path.
FeaturePath path = cas.createFeaturePath();
// Add POS feature to path, creating one-element path.
path.addFeature(posFeat);
// You can extend the chain arbitrarily by adding additional
// features.
// Create a new type constraint.
// Type constraints will check that structures
// they match against have a type at least as specific
// as the type specified in the constraint.
FSTypeConstraint nounConstraint = cf.createTypeConstraint();
// Set the type (by default it is TOP).
// This succeeds if the type being tested by this constraint
// is nounType or a subtype of nounType.
nounConstraint.add(nounType);
// Embed the noun constraint under the pos path.
// This means, associate the test with the path, so it tests the
// proper value.
// The result is a test which will
// match a feature structure that has a posFeat defined
// which has a value which is an instance of a nounType or
// one of its subtypes.
FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint);
// Create a type constraint for token (or a subtype of it)
FSTypeConstraint tokenConstraint = cf.createTypeConstraint();
// Set the type.
tokenConstraint.add(tokenType);
// Create the final constraint by conjoining the embedded noun
// constraint with the token constraint.
FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint);
// Create a filtered iterator from some annotation iterator.
FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</pre><p>
</p></div></div>
<div class="section" title="4.8.&nbsp;The CAS APIs &#8211; a guide to the Javadocs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.guide_to_javadocs">4.8.&nbsp;The CAS APIs &#8211; a guide to the Javadocs</h2></div></div></div>
<p>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most
of the APIs described here are in the cas package. The cas.impl package contains classes
used in serializing and deserializing (reading and writing external representations) the
CAS in various formats, for
transporting the CAS among local and remote annotators, or for storing the CAS in
permanent storage. The cas.text package contains the APIs that extend the CAS to support
artifact (including <span class="quote">&#8220;<span class="quote">text</span>&#8221;</span>) analysis.</p>
<div class="section" title="4.8.1.&nbsp;APIs in the CAS package"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.javadocs.cas_package">4.8.1.&nbsp;APIs in the CAS package</h3></div></div></div>
<p>The main objects implementing the APIs discussed here are shown in the diagram
below. The hierarchy represents that there is a way to get from an upper object to an
instance of the lower object, usually by using a method on the upper object; this is not
an inheritance hierarchy.
</p><div class="figure"><a name="ugr.ref.cas.fig.api_hierarchy"></a><div class="figure-contents">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.cas/image001.png" width="574" alt="CAS object hierarchy"></td></tr></table></div>
</div><p class="title"><b>Figure&nbsp;4.1.&nbsp;CAS Object hierarchy</b></p></div><p><br class="figure-break"> </p>
<p>The main Interface is the CAS interface. This has most of the functionality of the
CAS, except for the type system metadata access, and the indexing access. JCas and CAS
are alternative representations and API approaches to the CAS; each has a method to
get the other. You can mix JCas and CAS APIs in your application as needed. To use the
JCas APIs, you have to create the Java classes that correspond to the CAS types, and
include them in the Java class path of the application. If you have a CAS object, you can
get a JCas object by using the getJCas() method call on the CAS object; likewise, you
can get the CAS object from a JCas by using the getCas() method call on the JCas object.
There is also a low level CAS interface that is not part of the official API, and is
intended for internal use only &#8211; it is not documented here.</p>
<p>The type system metadata APIs are found in the TypeSystem interface. The objects
defining each type and feature are defined by the interfaces Type and Feature. The
Type interface has methods to see what types subsume other types, to iterate over the
types available, and to extract information about the types, including what
features it has. The Feature interface has methods that get what type it belongs to,
its name, and its range (the kind of values it can hold).</p>
<p>The FSIndexRepository gives you access to methods to get instances of indexes, and
also provides access to the iterator over all indexed feature structures:
<code class="literal">getAllIndexedFS(aType)</code>.
The FSIndex and AnnotationIndex objects give you methods to create instances of
iterators.</p>
<p>Iterators and the CAS methods that create new feature structures return
FeatureStructure objects. These objects can be used to set and get the values of
defined features within them.</p>
</div>
</div>
<div class="section" title="4.9.&nbsp;Type Merging"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.typemerging">4.9.&nbsp;Type Merging</h2></div></div></div>
<p>When annotators are combined in an aggregate, their defined type systems are merged.
This is designed to support independent development of annotator components. The merge
results in a single defined type system for CASes that flow through a particular set of
annotators.</p>
<p>The basic operation of a type system merge is to iterate through all the defined types,
and if two annotators define the same fully qualified type name,
to take the features defined for those types
and form a logical union of those features. This operation requires that same-named features
have the same range type names. The resulting type system has features comprising the union
of all features over all the various definitions for this type in different annotators.
</p>
<p>Feature merging checks that all features having the same name in a type have an
identical range type; otherwise an error is signaled.</p>
<p>Types are combined for merging when their fully qualified names are the same.
Two different definitions can be merged even if their supertype definitions do not match, if
one supertype subsumes the other supertype; otherwise an error is signaled. Likewise, two types
with the same name can be merged only if their features can be merged.
</p>
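<p>The feature-union rule for two definitions of the same type can be sketched in plain Java. This is
a simplified model, not the UIMA merge code: features are <code class="literal">{name, rangeTypeName}</code>
pairs, supertype checking is left out, and <code class="literal">TypeMergeModel</code> is a hypothetical name.</p>
<pre class="programlisting">import java.util.Arrays;

// Plain-Java model of feature merging; not the UIMA implementation.
class TypeMergeModel {
    // A feature is {name, rangeTypeName}. Returns the union of both
    // feature lists, or throws if a shared name has different ranges.
    static String[][] mergeFeatures(String[][] first, String[][] second) {
        String[][] merged = Arrays.copyOf(first, first.length + second.length);
        int count = first.length;
        for (String[] candidate : second) {
            boolean known = false;
            for (String[] existing : first) {
                if (existing[0].equals(candidate[0])) {
                    if (!existing[1].equals(candidate[1])) {
                        throw new IllegalStateException(
                            "Feature " + candidate[0] + " has conflicting ranges");
                    }
                    known = true;
                }
            }
            if (!known) {
                merged[count++] = candidate; // new feature joins the union
            }
        }
        return Arrays.copyOf(merged, count);
    }

    public static void main(String[] args) {
        String[][] first = { {"lastName", "uima.cas.String"} };
        String[][] second = {
            {"lastName", "uima.cas.String"},
            {"age", "uima.cas.Integer"}
        };
        System.out.println(Arrays.deepToString(mergeFeatures(first, second)));
        // prints [[lastName, uima.cas.String], [age, uima.cas.Integer]]
    }
}</pre>
<p>If the second definition declared <code class="literal">lastName</code> with a different range, the sketch
would throw instead of merging, mirroring the error case described above.</p>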
</div>
<div class="section" title="4.10.&nbsp;Limited multi-thread access to read-only CASs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.limitedmultipleaccess">4.10.&nbsp;Limited multi-thread access to read-only CASs</h2></div></div></div>
<p>Some applications may find it useful to scale up pipelines and run these in parallel.</p>
<p>
Generally, a CAS is not thread-safe, and only one thread at a time may operate on it. In many
scenarios, a CAS may be initialized and then filled with Feature Structures, and after some point,
no more updates to that particular CAS will be done.</p>
<p>
If a CAS is no longer going to be changed, it is possible to
access it on multiple threads in a read-only mode, simultaneously, with some limitations. Limitations
arise because some UIMA Framework activities may update internal CAS data structures.</p>
<p>Operational data is updated while running a pipeline when a PEAR is entered or exited,
because PEARs establish new class loaders and can potentially switch the JCas classes being used
(this happens because the class loaders might define different JCas cover classes
implementing the same UIMA type).
Because of this, you cannot have multiple pipelines accessing a CAS in read-only mode if one or more of those
pipelines contains a PEAR. There are other edge cases where this may happen as well; for example, if you are
running a pipeline with an Extension Class Loader,
and have a callback routine loaded under a different class loader, UIMA will switch the JCas classes when
calling the callback.
</p>
</div>
<div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e1615" href="#d5e1615" class="para">5</a>] </sup>A fourth part, the Subject of Analysis,
is discussed in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter&nbsp;5, <i>Annotations, Artifacts, and Sofas</i></a>.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e1624" href="#d5e1624" class="para">6</a>] </sup> The name <span class="quote">&#8220;<span class="quote">feature structure</span>&#8221;</span> comes from
terminology used in linguistics.</p></div></div></div>
<div class="chapter" title="Chapter&nbsp;5.&nbsp;JCas Reference" id="ugr.ref.jcas"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;5.&nbsp;JCas Reference</h2></div></div></div>
<p>The CAS is a system for sharing data among annotators, consisting of data structures
(definable at run time), sets of indexes over these data, metadata describing these, subjects of
analysis, and a high
performance serialization/deserialization mechanism. JCas provides a Java approach to
accessing CAS data, and is based on using generated, specific Java classes for each CAS
type.</p>
<p>Annotators process one CAS per call to their process method. During processing,
annotators can retrieve feature structures from the passed in CAS, add new ones, modify
existing ones, and use and update CAS indexes. Of course, an annotator can also use plain
Java Objects in addition; but the data in the CAS is what is shared among annotators within
an application.</p>
<p>All the facilities present in the APIs for the CAS are available when using the JCas
APIs; indeed, you can use the getCas() method to get the corresponding CAS object from a
JCas (and vice-versa). The JCas APIs often have helper methods that make using this
interface more convenient for Java developers.</p>
<p>The data in the CAS are typed objects having fields. JCas uses a set of generated Java
classes (each corresponding to a particular CAS type) with <span class="quote">&#8220;<span class="quote">getter</span>&#8221;</span> and
<span class="quote">&#8220;<span class="quote">setter</span>&#8221;</span> methods for the features, plus a constructor so new instances can
be made. The Java classes store the data in the class instance.</p>
<p>Users can modify the JCas generated
Java classes by adding fields to them; this allows arbitrary non-CAS data to also be
represented within the JCas objects; however, the non-CAS data stored in the JCas
object instances cannot be shared with annotators using the plain CAS, unless special
provision is made - see the chapter in the v3 user's guide on storing arbitrary
Java objects in the CAS.</p>
<p>The JCas class Java source files are generated from XML type system descriptions. The
JCasGen utility does the work of generating the corresponding Java Class Model for the CAS
types. There are a variety of ways JCasGen can be run; these are described later. You
include the generated classes with your UIMA component, and you can publish these classes
for others who might want to use your type system.</p>
<p>JCas classes are not required for all UIMA types. Those types which don't have
corresponding JCas classes use the nearest JCas class corresponding to a type in their superchain.</p>
<p>The specification of the type system in XML can be written using a conventional text
editor, an XML editor, or using the Eclipse plug-in that supports editing UIMA
descriptors.</p>
<p>Changes to the type system are done by changing the XML and regenerating the
corresponding Java Class Models. Of course, once you've published your type system
for others to use, you should be careful that any changes you make don't adversely
impact the users. Additional features can be added to existing types without breaking
other code.</p>
<p>A separate Java class is generated for each type; this class implements the CAS
FeatureStructure interface, as well as having the special getters and setters for the
included features. The generated Java classes have methods (getters and setters) for the
fields as defined in the XML type specification. Descriptor comments are reflected in the
generated Java code as Java-doc style comments.</p>
<div class="section" title="5.1.&nbsp;Name Spaces"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.name_spaces">5.1.&nbsp;Name Spaces</h2></div></div></div>
<p>Full Type names consist of a <span class="quote">&#8220;<span class="quote">namespace</span>&#8221;</span> prefix dotted with a simple
name. Namespaces are used like packages to avoid collisions between types that are
defined by different people at different times. The namespace is used as the Java
package name for generated Java files.</p>
<p>Type names used in the CAS correspond to the generated Java classes directly. If the
CAS name is com.myCompany.myProject.ExampleClass, the generated Java class is in the
package com.myCompany.myProject, and the class is ExampleClass.</p>
<p>
An exception to this rule applies to the built-in types
starting with <code class="literal">uima.cas</code> and <code class="literal">uima.tcas</code>;
these names are mapped to Java packages named
<code class="literal">org.apache.uima.jcas.cas</code> and
<code class="literal">org.apache.uima.jcas.tcas</code>.</p>
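<p>The naming rule can be summarized in a few lines of plain Java. This is a sketch of the mapping as
described here, not code taken from JCasGen; <code class="literal">JCasNameModel</code> and its methods are
hypothetical names.</p>
<pre class="programlisting">// Plain-Java sketch of the CAS-name to Java-class-name rule.
class JCasNameModel {
    // Maps a fully qualified CAS type name to the Java class name.
    static String javaClassName(String casTypeName) {
        if (casTypeName.startsWith("uima.cas.")
                || casTypeName.startsWith("uima.tcas.")) {
            // Built-in types live under org.apache.uima.jcas.
            return "org.apache.uima.jcas."
                    + casTypeName.substring("uima.".length());
        }
        return casTypeName; // namespace is used as the package name
    }

    // The package is everything before the last dot.
    static String packageOf(String className) {
        return className.substring(0, className.lastIndexOf('.'));
    }

    public static void main(String[] args) {
        System.out.println(javaClassName("com.myCompany.myProject.ExampleClass"));
        // prints com.myCompany.myProject.ExampleClass
        System.out.println(javaClassName("uima.tcas.Annotation"));
        // prints org.apache.uima.jcas.tcas.Annotation
        System.out.println(packageOf(javaClassName("uima.cas.TOP")));
        // prints org.apache.uima.jcas.cas
    }
}</pre>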
</div>
<div class="section" title="5.2.&nbsp;XML description element"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.use_of_description">5.2.&nbsp;XML description element</h2></div></div></div>
<p>Each XML type specification can have &lt;description ...
&gt; tags. The description for a type will be copied into the generated Java code, as a
Javadoc style comment for the class. When writing these descriptions in the XML type
specification file, you might want to use html tags, as allowed in Javadocs.</p>
<p>If you use the Component Descriptor Editor, you can write the html tags normally,
for instance, <span class="quote">&#8220;<span class="quote">&lt;h1&gt;My Title&lt;/h1&gt;</span>&#8221;</span>. The Component
Descriptor Editor will take care of converting the actual descriptor source so that it
has the leading <span class="quote">&#8220;<span class="quote">&lt;</span>&#8221;</span> character written as <span class="quote">&#8220;<span class="quote">&amp;lt;</span>&#8221;</span>,
to avoid confusing the XML type specification. For example, &lt;p&gt; would be written
in the source of the descriptor as &amp;lt;p&gt;. Any characters used in the Javadoc
comment must of course be from the character set allowed by the XML type specification.
These specifications often start with the line &lt;?xml version=<span class="quote">&#8220;<span class="quote">1.0</span>&#8221;</span>
encoding=<span class="quote">&#8220;<span class="quote">UTF-8</span>&#8221;</span> ?&gt;, which means you can use any of the UTF-8
characters.</p>
</div>
<div class="section" title="5.3.&nbsp;Mapping built-in CAS types to Java types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.mapping_built_ins">5.3.&nbsp;Mapping built-in CAS types to Java types</h2></div></div></div>
<p>The built-in primitive CAS types map to Java types as follows:</p>
<pre class="programlisting">uima.cas.Boolean <span class="symbol">&#8594;</span> boolean
uima.cas.Byte <span class="symbol">&#8594;</span> byte
uima.cas.Short <span class="symbol">&#8594;</span> short
uima.cas.Integer <span class="symbol">&#8594;</span> int
uima.cas.Long <span class="symbol">&#8594;</span> long
uima.cas.Float <span class="symbol">&#8594;</span> float
uima.cas.Double <span class="symbol">&#8594;</span> double
uima.cas.String <span class="symbol">&#8594;</span> String</pre>
</div>
<div class="section" title="5.4.&nbsp;Augmenting the generated Java Code"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.augmenting_generated_code">5.4.&nbsp;Augmenting the generated Java Code</h2></div></div></div>
<p>The Java Class Models generated for each type can be augmented by the user. Typical
augmentations include adding additional (non-CAS) fields and methods, and import
statements that might be needed to support these. Commonly added methods include
additional constructors (having different parameter signatures), and
implementations of toString().</p>
<p>To augment the code, just edit the generated Java source code for the class named the
same as the CAS type. Here's an example of an additional method you might add; the
various getter methods are retrieving values from the instance:</p>
<pre class="programlisting">public String toString() { // for debugging
return "XsgParse "
+ getslotName() + ": "
+ getheadWord().getCoveredText()
+ " seqNo: " + getseqNo()
+ ", cAddr: " + id
+ ", size left mods: " + getlMods().size()
+ ", size right mods: " + getrMods().size();
}</pre>
<div class="section" title="5.4.1.&nbsp;Keeping hand-coded augmentations when regenerating"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.keeping_augmentations_when_regenerating">5.4.1.&nbsp;Keeping hand-coded augmentations when regenerating</h3></div></div></div>
<p>If the type system specification changes, you have to re-run the JCasGen
generator. This will produce updated Java for the Class Models that capture the
changed specification. If you have previously augmented the source for these Java
Class Models, your changes must be merged with the newly (re)generated Java source
code for the Class Models. This can be done by hand, or you can run the version of JCasGen
that is integrated with Eclipse, and use automatic merging that is done using Eclipse's EMF
plug-in. You can obtain Eclipse and the needed EMF plug-in from <a class="ulink" href="http://www.eclipse.org/" target="_top">http://www.eclipse.org/</a>.</p>
<p>If you run the generator version that works without using Eclipse, it will not
merge Java source changes you may have previously made; if you want them retained,
you'll have to do the merging by hand.</p>
<p>The Java source merging will keep additional constructors, additional fields,
and any changes you may have made to the readObject method (see below). Merging will
<span class="emphasis"><em>not</em></span> delete classes in the target corresponding to deleted CAS types, which no longer
are in the source &#8211; you should delete these by hand.</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>The merging supports Java 1.4 syntactic constructs only.
JCasGen generates Java 1.4 code, so as long as any code you change here also sticks to
only Java 1.4 constructs, the merge will work. If you use Java 5 or later specific syntax or constructs, the merge
operation will likely fail to merge properly.</p></div>
</div>
<div class="section" title="5.4.2.&nbsp;Additional Constructors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.additional_constructors">5.4.2.&nbsp;Additional Constructors</h3></div></div></div>
<p>Any additional constructors that you add must include the JCas argument. The
first line of your constructor is required to be</p>
<pre class="programlisting">this(jcas); // run the standard constructor</pre>
<p>where jcas is the passed-in JCas reference. If the type you're defining
extends <code class="literal">uima.tcas.Annotation</code>, JCasGen will automatically
add a constructor which takes 2 additional parameters &#8211; the begin and end Java
int values &#8211; and sets the <code class="literal">uima.tcas.Annotation</code>
<code class="literal">begin</code> and <code class="literal">end</code> fields.</p>
<p>Here's an example: If you're defining a type MyType which has a
feature parent, you might make an additional constructor which has an additional
argument of parent:</p>
<pre class="programlisting">MyType(JCas jcas, MyType parent) {
this(jcas); // run the standard constructor
setParent(parent); // set the parent field from the parameter
}</pre>
<div class="section" title="5.4.2.1.&nbsp;Using readObject"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.jcas.using_readobject">5.4.2.1.&nbsp;Using readObject</h4></div></div></div>
<p>Fields defined by augmenting the Java Class Model to include additional
fields represent data that exist for this class in Java, in a local JVM (Java Virtual
Machine), but do not exist in the CAS when it is passed to other environments (for
example, passing to a remote annotator).</p>
<p>A problem can arise when new instances are created, perhaps by the underlying
system when it iterates over an index: how to ensure that any additional
non-CAS fields are properly initialized. To allow for arbitrary initialization
at instance creation time, the Java Class Model includes an initialization method
called readObject. The generated default for this method is to do nothing,
but it is one of the methods that you can modify &#8211; to do whatever
initialization might be needed. It is called with 0 parameters, during the
constructor for the object, after the basic object fields have been set up. It can
refer to fields in the CAS using the getters and setters, and other fields in the Java
object instance being initialized.</p>
<p>A pre-existing CAS feature structure could exist if a CAS was being passed to
this annotator; in this case the JCas system calls the readObject method when
creating the corresponding Java instance for the first time for the CAS feature
structure. This can happen at two points: when a new object is being returned from an
iterator over a CAS index, or a getter method is getting a field for the first time
whose value is a feature structure.</p>
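<p>As a rough sketch of the idea (not generated output), a hand-added field and a
modified readObject method might look like the following. The type name MyType and the field
<code class="literal">notes</code> are hypothetical, and the generated parts of the class are elided;
keep whatever readObject signature JCasGen generated for your class. Raw collection types are used
deliberately, to stay within the Java 1.4 constructs that the source merging supports.</p>
<pre class="programlisting">// inside the JCasGen-generated class for the hypothetical type MyType

// hand-added field: exists only in this JVM, never stored in the CAS
private java.util.List notes;

// the generated default body is empty; modified here to set up the extra field
private void readObject() {
  notes = new java.util.ArrayList();
}</pre>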
</div>
</div>
<div class="section" title="5.4.3.&nbsp;Modifying generated items"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.modifying_generated_items">5.4.3.&nbsp;Modifying generated items</h3></div></div></div>
<p>The following modifications, if made in generated items, will be preserved when
regenerating.</p>
<p>The public/private etc. flags associated with methods (getters and setters).
You can change the default (<span class="quote">&#8220;<span class="quote">public</span>&#8221;</span>) if needed.</p>
<p><span class="quote">&#8220;<span class="quote">final</span>&#8221;</span> or <span class="quote">&#8220;<span class="quote">abstract</span>&#8221;</span> can be added to the type
itself, with the usual semantics.</p>
</div>
</div>
<div class="section" title="5.5.&nbsp;Merging types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.merging_types_from_other_specs">5.5.&nbsp;Merging types</h2></div></div></div>
<p>Type definitions are merged by the framework from all the components being run together.</p>
<div class="section" title="5.5.1.&nbsp;Aggregate AEs and CPEs as sources of types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.merging_types.aggregates_and_cpes">5.5.1.&nbsp;Aggregate AEs and CPEs as sources of types</h3></div></div></div>
<p>When running aggregate AEs (Analysis Engines), or a set of AEs in a collection processing engine, the
UIMA framework will build a merged type system (Note: this <span class="quote">&#8220;<span class="quote">merge</span>&#8221;</span> is merging types, not to be
confused with merging Java source code, discussed above). This merged type system has all the types of every
component used in the application. In addition, application code can use UIMA Framework APIs to read and merge
type descriptions, manually.</p>
<p>In most cases, each type system can have its own Java Class Models generated individually, perhaps at an
earlier time, and the resulting class files (or .jar files containing these class files) can be put in the
class path to enable JCas.</p>
<p>However, it is possible that there may be multiple definitions of the same CAS type, each of which might
have different features defined. In this case, the UIMA framework will create a merged type by accumulating
all the defined features for a particular type into that type's type definition. However, the JCas
classes for these types are not automatically merged, which can create some issues for JCas users, as
discussed in the next section.</p>
</div>
<div class="section" title="5.5.2.&nbsp;JCasGen support for type merging"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.merging_types.jcasgen_support">5.5.2.&nbsp;JCasGen support for type merging</h3></div></div></div>
<p>When there are multiple definitions of the same CAS type with different features defined, then JCasGen
can be re-run on the merged type system, to create one set of JCas Class definitions for the merged types,
which can then be shared by all the components.
Directions for running JCasGen can be found in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.jcasgen" class="olink">Chapter&nbsp;8, <i>JCasGen User's Guide</i></a>. This is typically done by the person who
is assembling the Aggregate Analysis Engine or Collection Processing Engine. The resulting merged Java
Class Model will then contain get and set methods for the complete set of features. These Java classes must
then be made available in the class path, <span class="emphasis"><em>replacing</em></span> the pre-merge versions of the
classes.</p>
<p>If hand-modifications were done to the pre-merge versions of the classes, these must be applied to the
merged versions, as described in section <a class="xref" href="#ugr.ref.jcas.keeping_augmentations_when_regenerating" title="5.4.1.&nbsp;Keeping hand-coded augmentations when regenerating">Section&nbsp;5.4.1, &#8220;Keeping hand-coded augmentations when regenerating&#8221;</a>, above. If just one of the
pre-merge versions had hand-modifications, the source for this hand-modified version can be put into the
file system where the generated output will go, and the -merge option for JCasGen will automatically
merge the hand-modifications with the generated code. If
<span class="emphasis"><em>both</em></span> pre-merged versions had hand-modifications, then these modifications must
be manually merged.</p>
<p>An alternative to this is packaging the components as individual PEAR files, each with their own
version of the JCas generated Classes. The Framework (as of release 2.2) can run PEAR files using the
pear file descriptor, and supply each component with its particular version of the JCas generated class.</p>
</div>
<div class="section" title="5.5.3.&nbsp;Impact of Type Merging on Composability of Annotators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.impact_of_type_merging_on_composability">5.5.3.&nbsp;Impact of Type Merging on Composability of Annotators</h3></div></div></div>
<p>The recommended approach in UIMA is to build and maintain type systems as separate components, which are
imported by Annotators. Using this approach, Type Merging does not occur because the Type System and its JCas
classes are centrally managed and shared by the annotators.</p>
<p>If you do choose to create a JCas Annotator that relies on Type Merging (meaning that your annotator
redefines a Type that is already in use elsewhere, and adds its own features), this can negatively impact the
reusability of your annotator, unless your component is used as a PEAR file.</p>
<p>If you are not using the isolation capability of PEAR file packaging, whenever
anyone wants to combine your annotator with another annotator that uses a different version of
the same Type, they will need to be aware of all of the issues described in the previous section. They will need
to have the know-how to re-run JCasGen and appropriately set up their classpath to include the merged Java
classes and to not include the pre-merge classes. (To enable this, you should package these classes
separately from other .jar files for your annotator, so that they can be more easily excluded.) And, if you
have done hand-modifications to your JCas classes, the person assembling your annotator will need to
properly merge those changes. These issues significantly complicate the task of combining annotators, and
will cause your annotator not to be as easily reusable as other UIMA annotators. </p>
</div>
<div class="section" title="5.5.4.&nbsp;Adding Features to DocumentAnnotation"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.documentannotation_issues">5.5.4.&nbsp;Adding Features to DocumentAnnotation</h3></div></div></div>
<p>There is one built-in type, <code class="literal">uima.tcas.DocumentAnnotation</code>,
to which applications can add additional features. (All other built-in types
are "feature-final" and you cannot add additional features to them.) Frequently,
additional features are added to <code class="literal">uima.tcas.DocumentAnnotation</code>
to provide a place to store document-level metadata.</p>
<p>For the same reasons mentioned in the previous section, adding features to
DocumentAnnotation is not recommended if you are using JCas. Instead, it is recommended
that you define your own type for storing your document-level metadata. You can create
an instance of this type and add it to the indexes in the usual way. You can then
retrieve this instance using the iterator returned from the method <code class="literal">getAllIndexedFS(type)</code>
on an instance of a JFSIndexRepository object.
(As of UIMA v2.1, you do not have to declare a custom index in your descriptor to
get this to work).</p>
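<p>A minimal sketch of this approach follows. It assumes a hypothetical type
DocumentMetadata with a String feature sourceUri, declared in your own type system and
generated by JCasGen; neither name is part of UIMA, and the UIMA jars plus the generated class
are assumed to be on the class path.</p>
<pre class="programlisting">// in the component that knows the metadata
DocumentMetadata meta = new DocumentMetadata(jcas);
meta.setSourceUri("file:/data/doc1.txt"); // example value only
meta.addToIndexes();

// in a later component: find the instance again
FSIterator&lt;DocumentMetadata&gt; it =
    jcas.getJFSIndexRepository().getAllIndexedFS(DocumentMetadata.type);
if (it.hasNext()) {
  DocumentMetadata found = it.next();
  String uri = found.getSourceUri();
}</pre>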
<p>If you do choose to add features to DocumentAnnotation, there are additional issues to
be aware of. The UIMA SDK provides the JCas cover class for the built-in definition of
DocumentAnnotation, in the separate jar file <code class="literal">uima-document-annotation.jar</code>.
If you add additional features to DocumentAnnotation, you must remove this jar file
from your classpath, because you will not want to use the default JCas cover class.
You will need to re-run JCasGen as described in <a class="xref" href="#ugr.ref.jcas.merging_types.jcasgen_support" title="5.5.2.&nbsp;JCasGen support for type merging">Section&nbsp;5.5.2, &#8220;JCasGen support for type merging&#8221;</a>. JCasGen will generate a new cover
class for DocumentAnnotation, which you must place in your classpath in lieu of the version
in <code class="literal">uima-document-annotation.jar</code>.</p>
<p>Also, this is the reason why the method <code class="literal">JCas.getDocumentAnnotationFs()</code> returns
type <code class="literal">TOP</code>, rather than type <code class="literal">DocumentAnnotation</code>. Because the
<code class="literal">DocumentAnnotation</code> class can be replaced by users, it is not part of
<code class="literal">uima-core.jar</code> and so the core UIMA framework cannot have any references
to it. In your code, you may <span class="quote">&#8220;<span class="quote">cast</span>&#8221;</span> the result of <code class="literal">JCas.getDocumentAnnotationFs()</code>
to type <code class="literal">DocumentAnnotation</code>, which must be available on the classpath either via
<code class="literal">uima-document-annotation.jar</code> or by including a custom version that you have generated using JCasGen.</p>
</div>
</div>
<div class="section" title="5.6.&nbsp;Using JCas within an Annotator"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.using_within_an_annotator">5.6.&nbsp;Using JCas within an Annotator</h2></div></div></div>
<p>To use JCas within an annotator, you must include the generated Java classes output
from JCasGen in the class path.</p>
<p>An annotator written using JCas is built by defining a class for the annotator that
extends JCasAnnotator_ImplBase. The process method for this annotator is
written</p>
<pre class="programlisting">public void process(JCas jcas)
throws AnalysisEngineProcessException {
... // body of annotator goes here
}</pre>
<p>The process method is passed the JCas instance to use as a parameter.</p>
<p>The JCas reference is used throughout the annotator to refer to the particular JCas
instance being worked on. In pooled or multi-threaded implementations, there will be a
separate JCas for each thread being (simultaneously) worked on.</p>
<p>You can do several kinds of operations using the JCas APIs: create new feature
structures (instances of CAS types) (using the new operator), access existing feature
structures passed to your annotator in the JCas (for example, by using the next method of
an iterator over the feature structures), get and set the fields of a particular
instance of a feature structure, and add and remove feature structure instances from
the CAS indexes. To support iteration, there are also functions to get and use indexes
and iterators over the instances in a JCas.</p>
<div class="section" title="5.6.1.&nbsp;Creating new instances using the Java &#8220;new&#8221; operator"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.new_instances">5.6.1.&nbsp;Creating new instances using the Java <span class="quote">&#8220;<span class="quote">new</span>&#8221;</span> operator</h3></div></div></div>
<p>The new operator creates new instances of JCas types. It takes at least one
parameter, the JCas instance in which the type is to be created. For example, if there
was a type Meeting defined, you can create a new instance of it using:
</p><pre class="programlisting">Meeting m = new Meeting(jcas);</pre>
<p>Other variations of constructors can be added in custom code; the single
parameter version is the one automatically generated by JCasGen. For types that are
subtypes of Annotation, JCasGen also generates an additional constructor with
additional <span class="quote">&#8220;<span class="quote">begin</span>&#8221;</span> and <span class="quote">&#8220;<span class="quote">end</span>&#8221;</span> arguments.</p>
</div>
<div class="section" title="5.6.2.&nbsp;Getters and Setters"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.getters_and_setters">5.6.2.&nbsp;Getters and Setters</h3></div></div></div>
<p>If the CAS type Meeting had fields location and time, you could get or set these by
using getter or setter methods. These methods have names formed by splicing together
the word <span class="quote">&#8220;<span class="quote">get</span>&#8221;</span> or <span class="quote">&#8220;<span class="quote">set</span>&#8221;</span> followed by the field name, with
the first letter of the field name capitalized. For instance
</p><pre class="programlisting">getLocation()</pre>
<p>The getter forms take no parameters and return the value of the field; the setter
forms take one parameter, the value to set into the field, and return void.</p>
<p>There are built-in CAS types for arrays of integers, strings, floats, and
feature structures. For fields whose values are these types of arrays, there is an
alternate form of getters and setters that take an additional parameter, written as
the first parameter, which is the index in the array of an item to get or set.</p>
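<p>The following fragment sketches both forms. It assumes the Meeting type has a
String feature location and, purely for illustration, a hypothetical feature attendees whose
range is <code class="literal">uima.cas.StringArray</code>; the accessor names follow from those
feature names.</p>
<pre class="programlisting">Meeting m = new Meeting(jcas);

// plain feature: getter takes no parameters, setter takes the value
m.setLocation("Room 101");
String where = m.getLocation();

// array-valued feature: whole-array form
StringArray names = new StringArray(jcas, 2); // array of length 2
m.setAttendees(names);

// array-valued feature: indexed form, the index is the first parameter
m.setAttendees(0, "Ada");
String first = m.getAttendees(0);</pre>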
</div>
<div class="section" title="5.6.3.&nbsp;Obtaining references to Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.obtaining_refs_to_indexes">5.6.3.&nbsp;Obtaining references to Indexes</h3></div></div></div>
<p>The only way to access instances (not otherwise referenced from other
instances) passed in to your annotator in its JCas is to use an iterator over some
index. Indexes in the CAS are specified in the annotator descriptor. Indexes have a
name; text annotators have a built-in, standard index over all annotations.</p>
<p>To get an index, first get the JFSIndexRepository from the JCas using the method
jcas.getJFSIndexRepository(). Here are the calls to get indexes:</p>
<pre class="programlisting">JFSIndexRepository ir = jcas.getJFSIndexRepository();
ir.getIndex(name-of-index) // get the index by its name, a string
ir.getIndex(name-of-index, Foo.type) // filtered by specific type
ir.getAnnotationIndex() // get AnnotationIndex
jcas.getAnnotationIndex() // get directly from jcas
ir.getAnnotationIndex(Foo.type) // filtered by specific type
jcas.getAnnotationIndex(Foo.class) // better: filtered by the JCas class</pre>
<p>For convenience, the getAnnotationIndex method is available directly on the JCas object
instance; the implementation merely forwards to the associated index repository.</p>
<p>A filtering type has to be a subtype of the type specified for this index in its
index specification. It can be written as either Foo.type or, if you have an instance
of Foo, you can write</p>
<pre class="programlisting">fooInstance.getClass()</pre>
<p>Foo is (of course) an example of the name of the type.</p>
</div>
<div class="section" title="5.6.4.&nbsp;Adding (and removing) instances to (from) indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.adding_removing_instances_to_indexes">5.6.4.&nbsp;Adding (and removing) instances to (from) indexes</h3></div></div></div>
<p>CAS indexes are maintained automatically by the CAS. However, an index only finds
feature structure instances that have been added to the indexes; add an instance by using the
call:</p>
<pre class="programlisting">myInstance.addToIndexes();</pre>
<p>Do this after setting all features in the instance <span class="bold-italic">which could be used in indexing</span>,
for example, in determining the sorting order.
See <a class="xref" href="#ugr.ref.cas.updating_indexed_feature_structures" title="4.5.1.&nbsp;Updating indexed feature structures">Section&nbsp;4.5.1, &#8220;Updating indexed feature structures&#8221;</a> for details
on updating indexed feature structures.
</p>
<p>When writing a Multi-View component, you may need to index instances in multiple
CAS views. The methods above use the indexes associated with the current JCas object.
There is a variation of the <code class="literal">addToIndexes / removeFromIndexes</code> methods which
takes one argument: a reference to a JCas object holding the view in which you want to
index this instance.
</p><pre class="programlisting">myInstance.addToIndexes(anotherJCas)
myInstance.removeFromIndexes(anotherJCas)</pre><p>
</p>
<p>
You can also explicitly add instances to other views using the addFsToIndexes method on
other JCas (or CAS) objects. For instance, if you had 2 other CAS views (myView1 and
myView2), in which you wanted to index myInstance, you could write:</p>
<pre class="programlisting">myInstance.addToIndexes(); // index myInstance in the view it was created in (with the new operator)
myView1.addFsToIndexes(myInstance); // index myInstance in myView1
myView2.addFsToIndexes(myInstance); // index myInstance in myView2</pre>
<p>
The rules for determining which index to use with a particular JCas object are designed to
behave the way most would think they should; if you need specific behavior, you can always
explicitly designate which view the index adding and removing operations should work on.
</p>
<p>
The rules are applied in order; the first one that matches determines the view:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>If the instance is a subtype of AnnotationBase, then the view is the view associated with the
annotation as specified in the feature holding the view reference in AnnotationBase.</p>
</li><li class="listitem">
<p>Otherwise, if the instance was created using the "new" operator, then the view is the view passed to the
instance's constructor.</p>
</li><li class="listitem">
<p>Otherwise, if the instance was created by getting a feature value from some other instance, whose range
type is a feature structure, then the view is the same as the referring instance.</p>
</li><li class="listitem">
<p>Otherwise, if the instance was created by any of the Feature Structure Iterator operations over some index,
then it is the view associated with the index.</p>
</li></ul></div>
<p>As of release 2.4.1, there are two efficient bulk-remove methods to remove all instances of a given type,
or all instances of a given type and its subtypes.
These are invoked on an instance of an IndexRepository,
for a particular view. For example, to remove all instances of Token from a particular JCas instance:
</p>
<pre class="programlisting">// any one of the following:
jcas.removeAllIncludingSubtypes(Token.type);
jcas.removeAllIncludingSubtypes(aTokenInstance.getTypeIndexID());
jcas.getFSIndexRepository()
    .removeAllIncludingSubtypes(jcas.getCasType(Token.type));
</pre>
</div>
<div class="section" title="5.6.5.&nbsp;Using Iterators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.using_iterators">5.6.5.&nbsp;Using Iterators</h3></div></div></div>
<p>This section describes obtaining and using iterators. However, it is recommended that instead
you use the select framework, described in a chapter in the version 3 user's guide.</p>
<p>Once you have an index obtained from the JCas, you can get an iterator from the
index; here is an example:</p>
<pre class="programlisting">// Alternative 1: unfiltered index from the FSIndexRepository
FSIndexRepository ir = jcas.getFSIndexRepository();
FSIndex myIndex = ir.getIndex("myIndexName");
FSIterator myIterator = myIndex.iterator();

// Alternative 2: index filtered by type, from the JFSIndexRepository
JFSIndexRepository jir = jcas.getJFSIndexRepository();
FSIndex myFilteredIndex = jir.getIndex("myIndexName", Foo.type); // filtered
FSIterator myFilteredIterator = myFilteredIndex.iterator();</pre>
<p>Iterators work like normal Java iterators, but are augmented to support
additional capabilities. Iterators are described in the CAS Reference, <a href="references.html#ugr.ref.cas.indexes_and_iterators" class="olink">Section&nbsp;4.7, &#8220;Indexes and Iterators&#8221;</a>.</p>
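<p>For instance, a loop over the built-in annotation index could be sketched as follows.
It uses only the standard <code class="literal">Annotation</code> class
(<code class="literal">org.apache.uima.jcas.tcas.Annotation</code>), so no generated classes are assumed;
the variable jcas is the JCas passed to the annotator.</p>
<pre class="programlisting">FSIterator&lt;Annotation&gt; it = jcas.getAnnotationIndex().iterator();
while (it.hasNext()) {
  Annotation a = it.next();
  // print the span and the text it covers
  System.out.println(a.getBegin() + "-" + a.getEnd() + ": " + a.getCoveredText());
}</pre>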
</div>
<div class="section" title="5.6.6.&nbsp;Class Loaders in UIMA"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.class_loaders">5.6.6.&nbsp;Class Loaders in UIMA</h3></div></div></div>
<p>The basic concept of a UIMA application includes assembling engines into a flow.
The application made up of these engines is run within the UIMA Framework, either by
the Collection Processing Manager, or by using more basic UIMA Framework
APIs.</p>
<p>The UIMA Framework exists within a JVM (Java Virtual Machine). A JVM has the
capability to load multiple applications, in a way where each one is isolated from the
others, by using a separate class loader for each application. For instance, one set
of UIMA Framework Classes could be shared by multiple sets of application-specific
classes, even if these application-specific classes had the same names but were
different versions.</p>
<div class="section" title="5.6.6.1.&nbsp;Use of Class Loaders is optional"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.jcas.class_loaders.optional">5.6.6.1.&nbsp;Use of Class Loaders is optional</h4></div></div></div>
<p>The UIMA framework will use a specific ClassLoader, based on how
ResourceManager instances are used. Specific ClassLoaders are only created if
you specify an ExtensionClassPath as part of the ResourceManager. If you do not
need to support multiple applications within one UIMA framework within a JVM,
don't specify an ExtensionClassPath; in this case, the classloader used
will be the one used to load the UIMA framework - usually the overall application
class loader.</p>
<p>Of course, you should not run multiple UIMA applications together, in this
way, if they have different class definitions for the same class name. This
includes the JCas <span class="quote">&#8220;<span class="quote">cover</span>&#8221;</span> classes. This case might arise, for
instance, if both applications extended
<code class="literal">uima.tcas.DocumentAnnotation</code> in differing,
incompatible ways. Each application would need its own definition of this class,
but only one could be loaded (unless you specify ExtensionClassPath in the
ResourceManager which will cause the UIMA application to load its private
versions of its classes, from its classpath).</p>
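<p>As an illustration only, an application might give one engine its own class loader along
these lines. The paths and the class name IsolatedEngineExample are hypothetical, and the UIMA jars
are assumed to be on the class path; treat this as a sketch rather than a recommended setup.</p>
<pre class="programlisting">import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.resource.ResourceManager;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class IsolatedEngineExample {
  public static void main(String[] args) throws Exception {
    // a ResourceManager with an extension class path gets its own class loader
    ResourceManager rm = UIMAFramework.newDefaultResourceManager();
    rm.setExtensionClassPath("/opt/app1/classes", true); // hypothetical path

    // parse the descriptor (hypothetical location) and create the engine
    // using that ResourceManager
    ResourceSpecifier spec = UIMAFramework.getXMLParser()
        .parseResourceSpecifier(new XMLInputSource("/opt/app1/desc/App1Engine.xml"));
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, rm, null);

    // ... process documents with ae ...
    ae.destroy();
  }
}</pre>
<p>A second application would create its own ResourceManager with a different extension
class path, so that its versions of the same-named classes are loaded separately.</p>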
</div>
</div>
<div class="section" title="5.6.7.&nbsp;Issues accessing JCas objects outside of UIMA Engine Components"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.accessing_jcas_objects_outside_uima_components">5.6.7.&nbsp;Issues accessing JCas objects outside of UIMA Engine Components</h3></div></div></div>
<p>If you are using the ExtensionClassPaths, the JCas cover classes are loaded
under a class loader created by the ResourceManager part of the UIMA Framework.
If you reference the same JCas
classes outside of any UIMA component, for instance, in top level application code,
the JCas classes used by that top level application code also must be in the class path
for the application code.</p>
<p>Alternatively, you could do all the JCas processing inside a UIMA component (and do no
processing using JCas outside of the UIMA pipeline).</p>
</div>
</div>
<div class="section" title="5.7.&nbsp;Setting up Classpath for JCas"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.setting_up_classpath">5.7.&nbsp;Setting up Classpath for JCas</h2></div></div></div>
<p>The JCas Java classes generated by JCasGen are typically compiled and put into a JAR
file, which, in turn, is put into the application's class path.</p>
<p>This JAR file must be generated from the application's merged type system.
This is most conveniently done by opening the top level descriptor used by the
application in the Component Descriptor Editor tool, and pressing the Run-JCasGen
button on the Type System Definition page.</p>
</div>
<div class="section" title="5.8.&nbsp;PEAR isolation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.pear_support">5.8.&nbsp;PEAR isolation</h2></div></div></div>
<p>
As of version 2.2, the framework supports component descriptors which are PEAR descriptors.
These descriptors define components plus include information on the class path needed to
run them. The framework uses the class path information to set up a localized class path, just
for code running within the PEAR context. This allows PEAR files requiring different
versions of common code to work well together, even if classes in the different versions
have the same names.
</p>
<p>The mechanism used to switch the class loaders when entering a PEAR-packaged annotator in
a flow depends on the framework knowing if JCas is being used within that annotator code. The
framework will know this if the particular view being passed has had a previous call to
getJCas(), or if the particular annotator is marked as a JCas-using one (by having it extend the
class <code class="code">JCasAnnotator_ImplBase</code>).</p>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;6.&nbsp;PEAR Reference" id="ugr.ref.pear"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;6.&nbsp;PEAR Reference</h2></div></div></div>
<p>
A PEAR (Processing Engine ARchive) file is a standard package
for UIMA components. This chapter describes the PEAR 1.0 structure and
specification.
</p>
<p>
The PEAR package can be used for distribution and reuse by other
components or applications. It also allows applications and
tools to manage UIMA components automatically for verification,
deployment, invocation, testing, etc.
</p>
<p>
Currently, there is an Eclipse plugin and a command line tool
available to create PEAR packages for standard UIMA components.
Please refer to
<a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a>
<a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter&nbsp;9, <i>PEAR Packager User's Guide</i></a>
for more information about these tools.
</p>
<p>
PEARs distributed to new targets can be installed at those targets.
UIMA includes a tool for installing PEARs; see
<a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a>
<a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter&nbsp;11, <i>PEAR Installer User's Guide</i></a> for
more information about installing PEARs.
</p>
<p>
An installed PEAR can be used as a component within a UIMA pipeline,
by specifying the pear descriptor that is created when
installing the pear. See
<a href="references.html#ugr.ref.pear.specifier" class="olink">Section&nbsp;6.3, &#8220;PEAR package descriptor&#8221;</a>.
</p>
<div class="section" title="6.1.&nbsp;Packaging a UIMA component"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.packaging_a_component">6.1.&nbsp;Packaging a UIMA component</h2></div></div></div>
<p>
For the purpose of describing the process of creating a PEAR
file and its internal structure, this section describes the
steps used to package a UIMA component as a valid PEAR file.
The PEAR packaging process consists of the following steps:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>
<a class="xref" href="#ugr.ref.pear.creating_pear_structure" title="6.1.1.&nbsp;Creating the PEAR structure">Section&nbsp;6.1.1, &#8220;Creating the PEAR structure&#8221;</a>
</p>
</li><li class="listitem">
<p>
<a class="xref" href="#ugr.ref.pear.populating_pear_structure" title="6.1.2.&nbsp;Populating the PEAR structure">Section&nbsp;6.1.2, &#8220;Populating the PEAR structure&#8221;</a>
</p>
</li><li class="listitem">
<p>
<a class="xref" href="#ugr.ref.pear.creating_installation_descriptor" title="6.1.3.&nbsp;Creating the installation descriptor">Section&nbsp;6.1.3, &#8220;Creating the installation descriptor&#8221;</a>
</p>
</li><li class="listitem">
<p>
<a class="xref" href="#ugr.ref.pear.packaging_into_1_file" title="6.1.5.&nbsp;Packaging the PEAR structure into one file">Section&nbsp;6.1.5, &#8220;Packaging the PEAR structure into one file&#8221;</a>
</p>
</li></ul></div><p>
</p>
<div class="section" title="6.1.1.&nbsp;Creating the PEAR structure"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.creating_pear_structure">6.1.1.&nbsp;Creating the PEAR structure</h3></div></div></div>
<p>
The first step in the PEAR creation process is to create
a PEAR structure. The PEAR structure is a structured
tree of folders and files, including the following
elements:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>
Required Elements:
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem">
<p>
The
<span class="bold"><strong>
metadata
</strong></span>
folder which contains the PEAR
installation descriptor and
properties files.
</p>
</li><li class="listitem">
<p>
The installation descriptor (
<span class="bold"><strong>
metadata/install.xml
</strong></span>
)
</p>
</li><li class="listitem">
<p>
A UIMA analysis engine
descriptor and its required
code, delegates (if any), and
resources
</p>
</li></ul></div><p>
</p>
</li><li class="listitem">
<p>
Optional Elements:
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem">
<p>
The desc folder to contain
descriptor files of analysis
engines, delegates analysis
engines (all levels), and other
components (Collection Readers,
CAS Consumers, etc).
</p>
</li><li class="listitem">
<p>
The src folder to contain the
source code
</p>
</li><li class="listitem">
<p>
The bin folder to contain
executables, scripts, class
files, dlls, shared libraries,
etc.
</p>
</li><li class="listitem">
<p>
The lib folder to contain jar
files.
</p>
</li><li class="listitem">
<p>
The doc folder containing
documentation materials,
preferably accessible through an
index.html.
</p>
</li><li class="listitem">
<p>
The data folder to contain data
files (e.g. for testing).
</p>
</li><li class="listitem">
<p>
The conf folder to contain
configuration files.
</p>
</li><li class="listitem">
<p>
The resources folder to contain
other resources and
dependencies.
</p>
</li><li class="listitem">
<p>
Other user-defined folders or
files are allowed, but should be
avoided.
</p>
</li></ul></div><p>
</p>
</li></ul></div><p>
</p>
<div class="figure"><a name="ugr.ref.pear.fig.pear_structure"></a><div class="figure-contents">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="297"><tr><td><img src="images/references/ref.pear/image002.jpg" width="297" alt="diagram of the PEAR structure"></td></tr></table></div>
</div><p class="title"><b>Figure&nbsp;6.1.&nbsp;The PEAR Structure</b></p></div><br class="figure-break">
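<p>
The folder layout described above can be created by hand or by a few lines of code. The following is a minimal sketch that uses only the standard Java library; the class name <code class="literal">PearSkeleton</code> and its method are hypothetical and are not part of the UIMA API. It creates the required metadata folder and the optional folders, and returns the location where the installation descriptor is expected.
</p>

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Creates an empty PEAR folder skeleton (hypothetical helper, not part of UIMA). */
class PearSkeleton {

    // The required "metadata" folder plus the optional folders named in this section.
    static final String[] FOLDERS = {
        "metadata", "desc", "src", "bin", "lib", "doc", "data", "conf", "resources"
    };

    /** Creates the folders below pearRoot and returns the expected path of install.xml. */
    static Path create(Path pearRoot) {
        try {
            for (String folder : FOLDERS) {
                Files.createDirectories(pearRoot.resolve(folder));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The installation descriptor itself still has to be written separately.
        return pearRoot.resolve("metadata").resolve("install.xml");
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length == 0 ? "myAnnot" : args[0]);
        System.out.println("Installation descriptor belongs at: " + create(root));
    }
}
```

<p>
Only the metadata folder is required; the loop creates the optional folders as well so that the layout matches the figure above.
</p>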
</div>
<div class="section" title="6.1.2.&nbsp;Populating the PEAR structure"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.populating_pear_structure">6.1.2.&nbsp;Populating the PEAR structure</h3></div></div></div>
<p>
After creating the PEAR structure, the component's
descriptor files, code files, resources files, and any
other files and folders are copied into the
corresponding folders of the PEAR structure. The
developer should make sure that the code would work with
this layout of files and folders, and that there are no
broken links. Although it is strongly discouraged, the
optional elements of the PEAR structure can be replaced
by other user-defined files and folders, if required for
the component to work properly.
</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3>
<p>
The PEAR structure must be self-contained. For
example, this means that the component must run
properly independently from the PEAR root folder
location. If the developer needs to use an absolute
path in configuration or descriptor files, then
he/she should put these files in the
<span class="quote">&#8220;<span class="quote">conf</span>&#8221;</span>
or
<span class="quote">&#8220;<span class="quote">desc</span>&#8221;</span>
folder and replace the path of the PEAR root folder with
the string
<span class="quote">&#8220;<span class="quote">$main_root</span>&#8221;</span>
. The tools that deploy and use PEAR files should
localize the files in the
<span class="quote">&#8220;<span class="quote">conf</span>&#8221;</span>
and
<span class="quote">&#8220;<span class="quote">desc</span>&#8221;</span>
folders by replacing the string
<span class="quote">&#8220;<span class="quote">$main_root</span>&#8221;</span>
with the local absolute path of the PEAR root
folder. The
<span class="quote">&#8220;<span class="quote">$main_root</span>&#8221;</span>
macro can also be used in the installation
descriptor (install.xml).
</p>
</div>
<p>
Currently there are three types of component packages
depending on their deployment:
</p>
<div class="section" title="6.1.2.1.&nbsp;Standard Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.standard">6.1.2.1.&nbsp;Standard Type</h4></div></div></div>
<p>
A component package with the
<span class="bold"><strong>standard</strong></span>
type must be a valid Analysis Engine, and all the
required files to deploy it locally must be included
in the PEAR package.
</p>
</div>
<div class="section" title="6.1.2.2.&nbsp;Service Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.service">6.1.2.2.&nbsp;Service Type</h4></div></div></div>
<p>
A component package with the
<span class="bold"><strong>service</strong></span>
type must be deployable locally as a supported UIMA
service (e.g. Vinci). In this case, all the required
files to deploy it locally must be included in the
PEAR package.
</p>
</div>
<div class="section" title="6.1.2.3.&nbsp;Network Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.network">6.1.2.3.&nbsp;Network Type</h4></div></div></div>
<p>
A component package with the network type is not
deployed locally but rather in the
<span class="quote">&#8220;<span class="quote">remote</span>&#8221;</span>
environment. It's accessed as a network AE
(e.g. Vinci Service). The component owner has the
responsibility to start the service and make sure
it's up and running before it's used by
others (like a webmaster that makes sure the web
site is up and running). In this case, the PEAR
package does not have to contain files required for
deployment, but must contain the network AE
descriptor (see
<a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae.creating_xml_descriptor" class="olink">Section&nbsp;1.1.4, &#8220;Creating the XML Descriptor&#8221;</a>
) and the &lt;DESC&gt; tag in the installation
descriptor must point to the network AE descriptor.
For more information about Network Analysis Engines,
please refer to
<a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section&nbsp;3.6, &#8220;Working with Remote Services&#8221;</a>
.
</p>
</div>
</div>
<div class="section" title="6.1.3.&nbsp;Creating the installation descriptor"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.creating_installation_descriptor">6.1.3.&nbsp;Creating the installation descriptor</h3></div></div></div>
<p>
The installation descriptor is an XML file called
install.xml, located in the metadata folder of the PEAR
structure. It is also called the InsD. The InsD file
should be created using the UTF-8 file encoding. The InsD
should contain the following sections:
</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>
&lt;OS&gt;: This section is used to specify
supported operating systems
</p>
</li><li class="listitem">
<p>
&lt;TOOLKITS&gt;: This section is used to
specify toolkits, such as JDK, needed by the
component.
</p>
</li><li class="listitem">
<p>
&lt;SUBMITTED_COMPONENT&gt;: This is the most
important section in the Installation
Descriptor. It's used to specify required
information about the component. See
<a class="xref" href="#ugr.ref.pear.installation_descriptor" title="6.1.4.&nbsp; Documented template for the installation descriptor:">Section&nbsp;6.1.4, &#8220;Installation Descriptor: template&#8221;</a>
for detailed information about this section.
</p>
</li><li class="listitem">
<p>
&lt;INSTALLATION&gt;: This section is explained in
<a class="xref" href="#ugr.ref.pear.installing" title="6.2.&nbsp;Installing a PEAR package">Section&nbsp;6.2, &#8220;Installing a PEAR package&#8221;</a>
.
</p>
</li></ul></div>
</div>
<div class="section" title="6.1.4.&nbsp; Documented template for the installation descriptor:"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.installation_descriptor">6.1.4.&nbsp;
Documented template for the installation descriptor:
</h3></div></div></div>
<p>
The following is a sample
<span class="quote">&#8220;<span class="quote">documented template</span>&#8221;</span>
which describes content of the installation descriptor
install.xml:
</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- Installation Descriptor Template --&gt;
&lt;COMPONENT_INSTALLATION_DESCRIPTOR&gt;
&lt;!-- Specifications of OS names, including version, etc. --&gt;
&lt;OS&gt;
&lt;NAME&gt;OS_Name_1&lt;/NAME&gt;
&lt;NAME&gt;OS_Name_2&lt;/NAME&gt;
&lt;/OS&gt;
&lt;!-- Specifications of required standard toolkits --&gt;
&lt;TOOLKITS&gt;
&lt;JDK_VERSION&gt;JDK_Version&lt;/JDK_VERSION&gt;
&lt;/TOOLKITS&gt;
&lt;!-- There are 2 types of variables that are used in the InsD:
a) $main_root , which will be substituted with the real path to the
main component root directory after installing the
main (submitted) component
b) $component_id$root, which will be substituted with the real path
to the root directory of a given delegate component after
installing the given delegate component --&gt;
&lt;!-- Specification of submitted component (AE) --&gt;
&lt;!-- Note: submitted_component_id is assigned by developer; --&gt;
&lt;!-- XML descriptor file name is set by developer. --&gt;
&lt;!-- Important: ID element should be the first in the --&gt;
&lt;!-- SUBMITTED_COMPONENT section. --&gt;
&lt;!-- Submitted component may include optional specification --&gt;
&lt;!-- of Collection Reader that can be used for testing the --&gt;
&lt;!-- submitted component. --&gt;
&lt;!-- Submitted component may include optional specification --&gt;
&lt;!-- of CAS Consumer that can be used for testing the --&gt;
&lt;!-- submitted component. --&gt;
&lt;SUBMITTED_COMPONENT&gt;
&lt;ID&gt;submitted_component_id&lt;/ID&gt;
&lt;NAME&gt;Submitted component name&lt;/NAME&gt;
&lt;DESC&gt;$main_root/desc/ComponentDescriptor.xml&lt;/DESC&gt;
&lt;!-- deployment options: --&gt;
&lt;!-- a) "standard" is deploying AE locally --&gt;
&lt;!-- b) "service" is deploying AE locally as a service, --&gt;
&lt;!-- using specified command (script) --&gt;
&lt;!-- c) "network" is deploying a pure network AE, which --&gt;
&lt;!-- is running somewhere on the network --&gt;
&lt;DEPLOYMENT&gt;standard | service | network&lt;/DEPLOYMENT&gt;
&lt;!-- Specifications for "service" deployment option only --&gt;
&lt;SERVICE_COMMAND&gt;$main_root/bin/startService.bat&lt;/SERVICE_COMMAND&gt;
&lt;SERVICE_WORKING_DIR&gt;$main_root&lt;/SERVICE_WORKING_DIR&gt;
&lt;SERVICE_COMMAND_ARGS&gt;
&lt;ARGUMENT&gt;
&lt;VALUE&gt;1st_parameter_value&lt;/VALUE&gt;
&lt;COMMENTS&gt;1st parameter description&lt;/COMMENTS&gt;
&lt;/ARGUMENT&gt;
&lt;ARGUMENT&gt;
&lt;VALUE&gt;2nd_parameter_value&lt;/VALUE&gt;
&lt;COMMENTS&gt;2nd parameter description&lt;/COMMENTS&gt;
&lt;/ARGUMENT&gt;
&lt;/SERVICE_COMMAND_ARGS&gt;
&lt;!-- Specifications for "network" deployment option only --&gt;
&lt;NETWORK_PARAMETERS&gt;
&lt;VNS_SPECS VNS_HOST="vns_host_IP" VNS_PORT="vns_port_No" /&gt;
&lt;/NETWORK_PARAMETERS&gt;
&lt;!-- General specifications --&gt;
&lt;COMMENTS&gt;Main component description&lt;/COMMENTS&gt;
&lt;COLLECTION_READER&gt;
&lt;COLLECTION_ITERATOR_DESC&gt;
$main_root/desc/CollIterDescriptor.xml
&lt;/COLLECTION_ITERATOR_DESC&gt;
&lt;CAS_INITIALIZER_DESC&gt;
$main_root/desc/CASInitializerDescriptor.xml
&lt;/CAS_INITIALIZER_DESC&gt;
&lt;/COLLECTION_READER&gt;
&lt;CAS_CONSUMER&gt;
&lt;DESC&gt;$main_root/desc/CASConsumerDescriptor.xml&lt;/DESC&gt;
&lt;/CAS_CONSUMER&gt;
&lt;/SUBMITTED_COMPONENT&gt;
&lt;!-- Specifications of the component installation process --&gt;
&lt;INSTALLATION&gt;
&lt;!-- List of delegate components that should be installed together --&gt;
&lt;!-- with the main submitted component (for aggregate components) --&gt;
&lt;!-- Important: ID element should be the first in each --&gt;
&lt;!-- DELEGATE_COMPONENT section. --&gt;
&lt;DELEGATE_COMPONENT&gt;
&lt;ID&gt;first_delegate_component_id&lt;/ID&gt;
&lt;NAME&gt;Name of first required separate component&lt;/NAME&gt;
&lt;/DELEGATE_COMPONENT&gt;
&lt;DELEGATE_COMPONENT&gt;
&lt;ID&gt;second_delegate_component_id&lt;/ID&gt;
&lt;NAME&gt;Name of second required separate component&lt;/NAME&gt;
&lt;/DELEGATE_COMPONENT&gt;
&lt;!-- Specifications of local path names that should be replaced --&gt;
&lt;!-- with real path names after the main component as well as --&gt;
&lt;!-- all required delegate (library) components are installed. --&gt;
&lt;!-- &lt;FILE&gt; and &lt;REPLACE_WITH&gt; values may use the $main_root or --&gt;
&lt;!-- one of the $component_id$root variables. --&gt;
&lt;!-- Important: ACTION element should be the first in each --&gt;
&lt;!-- PROCESS section. --&gt;
&lt;PROCESS&gt;
&lt;ACTION&gt;find_and_replace_path&lt;/ACTION&gt;
&lt;PARAMETERS&gt;
&lt;FILE&gt;$main_root/desc/ComponentDescriptor.xml&lt;/FILE&gt;
&lt;FIND_STRING&gt;../resources/dict/&lt;/FIND_STRING&gt;
&lt;REPLACE_WITH&gt;$main_root/resources/dict/&lt;/REPLACE_WITH&gt;
&lt;COMMENTS&gt;Specify actual dictionary location in XML component
descriptor
&lt;/COMMENTS&gt;
&lt;/PARAMETERS&gt;
&lt;/PROCESS&gt;
&lt;PROCESS&gt;
&lt;ACTION&gt;find_and_replace_path&lt;/ACTION&gt;
&lt;PARAMETERS&gt;
&lt;FILE&gt;$main_root/desc/DelegateComponentDescriptor.xml&lt;/FILE&gt;
&lt;FIND_STRING&gt;
local_root_directory_for_1st_delegate_component/resources/dict/
&lt;/FIND_STRING&gt;
&lt;REPLACE_WITH&gt;
$first_delegate_component_id$root/resources/dict/
&lt;/REPLACE_WITH&gt;
&lt;COMMENTS&gt;
Specify actual dictionary location in the descriptor of the 1st
delegate component
&lt;/COMMENTS&gt;
&lt;/PARAMETERS&gt;
&lt;/PROCESS&gt;
&lt;!-- Specifications of environment variables that should be set prior
to running the main component and all other reused components.
&lt;VAR_VALUE&gt; values may use the $main_root or one of the
$component_id$root variables. --&gt;
&lt;PROCESS&gt;
&lt;ACTION&gt;set_env_variable&lt;/ACTION&gt;
&lt;PARAMETERS&gt;
&lt;VAR_NAME&gt;env_variable_name&lt;/VAR_NAME&gt;
&lt;VAR_VALUE&gt;env_variable_value&lt;/VAR_VALUE&gt;
&lt;COMMENTS&gt;Set environment variable value&lt;/COMMENTS&gt;
&lt;/PARAMETERS&gt;
&lt;/PROCESS&gt;
&lt;/INSTALLATION&gt;
&lt;/COMPONENT_INSTALLATION_DESCRIPTOR&gt;</pre>
<div class="section" title="6.1.4.1.&nbsp;The SUBMITTED_COMPONENT section"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.submitted_component">6.1.4.1.&nbsp;The SUBMITTED_COMPONENT section</h4></div></div></div>
<p>The SUBMITTED_COMPONENT section of the installation descriptor
(install.xml) is used to specify required information about the UIMA component.
Before explaining the details, let's clarify the concept of component ID and
<span class="quote">&#8220;<span class="quote">macros</span>&#8221;</span> used in the installation descriptor. The component ID
element should be the <span class="bold"><strong>first element </strong></span>in the
SUBMITTED_COMPONENT section.</p>
<p>The component ID is a string that uniquely identifies the component. It should
follow the Java package naming convention (e.g.
com.company_name.project_name.etc.mycomponent).</p>
<p>Macros are variables such as $main_root, used to represent a string such as the
full path of a certain directory.</p>
<p>The values of these macros are defined by the PEAR installation process, when the
PEAR is installed, and represent the values local to that particular installation.
The values are stored in the <code class="literal">metadata/PEAR.properties</code> file that is
generated during PEAR installation.
The tools and applications that use and deploy PEAR files replace these macros with
the corresponding values in the local environment as part of the deployment
process in the files included in the conf and desc folders.</p>
<p>Currently, there are two types of macros:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>$main_root, which represents the local absolute
path of the main component root directory after deployment. </p></li><li class="listitem"><p>$<span class="emphasis"><em>component_id</em></span>$root, which
represents the local absolute path to the root directory of the component which
has <span class="emphasis"><em>component_id </em></span> as component ID. This component could
be, for instance, a delegate component. </p></li></ul></div>
<p>For example, if some part of a descriptor needs to have a path to the data
subdirectory of the PEAR, you write <code class="literal">$main_root/data</code>. If
your PEAR refers to a delegate component having the ID
<span class="quote">&#8220;<span class="quote"><code class="literal">my.comp.Dictionary</code></span>&#8221;</span>, and you need to
specify a path to one of this component's subdirectories, e.g.
<code class="literal">resources/dict</code>, you write
<code class="literal">$my.comp.Dictionary$root/resources/dict</code>. </p>
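<p>The effect of macro replacement can be illustrated with plain string substitution. The sketch below is not the code used by the PEAR installer; the class <code class="literal">PearMacros</code>, its method, and the paths under <code class="literal">/opt/pears</code> are assumptions made for illustration only.</p>

```java
import java.util.Properties;

/** Expands PEAR path macros in a string (illustrative sketch, not the UIMA installer code). */
class PearMacros {

    /**
     * Replaces $main_root with mainRoot, and every $component_id$root macro
     * for which delegateRoots has an entry (key: component ID, value: root path).
     */
    static String expand(String text, String mainRoot, Properties delegateRoots) {
        String result = text.replace("$main_root", mainRoot);
        for (String id : delegateRoots.stringPropertyNames()) {
            result = result.replace("$" + id + "$root", delegateRoots.getProperty(id));
        }
        return result;
    }

    public static void main(String[] args) {
        Properties delegates = new Properties();
        delegates.setProperty("my.comp.Dictionary", "/opt/pears/my.comp.Dictionary");

        // Path to the data folder of the main component
        System.out.println(expand("$main_root/data", "/opt/pears/main", delegates));
        // Path to a folder of the delegate component
        System.out.println(
            expand("$my.comp.Dictionary$root/resources/dict", "/opt/pears/main", delegates));
    }
}
```

<p>Because the replacement is purely textual, a macro for a component ID that has no entry in <code class="literal">delegateRoots</code> is left unchanged in this sketch.</p>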
</div>
<div class="section" title="6.1.4.2.&nbsp;The ID, NAME, and DESC tags"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.id_name_desc">6.1.4.2.&nbsp;The ID, NAME, and DESC tags</h4></div></div></div>
<p>These tags specify the component ID, name, and descriptor path,
as follows:
</p><pre class="programlisting">&lt;SUBMITTED_COMPONENT&gt;
&lt;ID&gt;submitted_component_id&lt;/ID&gt;
&lt;NAME&gt;Submitted component name&lt;/NAME&gt;
&lt;DESC&gt;$main_root/desc/ComponentDescriptor.xml&lt;/DESC&gt;</pre>
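<p>An application that needs these three values can read them with the standard Java DOM parser. The following sketch assumes nothing about the UIMA API: the class <code class="literal">InsdReader</code> is hypothetical, and rejecting a descriptor whose ID element is not first is a deliberately strict reading of the rule stated above.</p>

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Reads ID, NAME and DESC from an installation descriptor (hypothetical helper). */
class InsdReader {

    /** Returns the first child of parent that is an element, or null if there is none. */
    static Element firstElementChild(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) return (Element) n;
        }
        return null;
    }

    /** Returns the trimmed text of the direct child element with the given tag, or null. */
    static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                if (tag.equals(n.getNodeName())) return n.getTextContent().trim();
            }
        }
        return null;
    }

    /** Returns {ID, NAME, DESC}; fails if ID is not the first element of the section. */
    static String[] readSubmittedComponent(InputStream installXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(installXml);
        } catch (Exception e) {
            throw new IllegalArgumentException("Descriptor is not well-formed XML", e);
        }
        Element comp = (Element) doc.getElementsByTagName("SUBMITTED_COMPONENT").item(0);
        if (comp == null) {
            throw new IllegalArgumentException("No SUBMITTED_COMPONENT section found");
        }
        Element first = firstElementChild(comp);
        if (first == null || !"ID".equals(first.getNodeName())) {
            throw new IllegalArgumentException("ID must be the first element");
        }
        // Only direct children are read, so the DESC inside CAS_CONSUMER is ignored.
        return new String[] {
            childText(comp, "ID"), childText(comp, "NAME"), childText(comp, "DESC")
        };
    }

    public static void main(String[] args) throws Exception {
        String path = args.length == 0 ? "metadata/install.xml" : args[0];
        try (InputStream in = new FileInputStream(path)) {
            String[] c = readSubmittedComponent(in);
            System.out.println("ID=" + c[0] + ", NAME=" + c[1] + ", DESC=" + c[2]);
        }
    }
}
```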
</div>
<div class="section" title="6.1.4.3.&nbsp;Tags related to deployment types"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type">6.1.4.3.&nbsp;Tags related to deployment types</h4></div></div></div>
<p>As mentioned before, there are currently three types of PEAR packages,
depending on the following deployment types:</p>
<div class="section" title="Standard Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.standard">Standard Type</h5></div></div></div>
<p>A component package with the <span class="bold"><strong>standard</strong></span>
type must be a valid UIMA Analysis Engine, and all the required files to deploy it
must be included in the PEAR package. This deployment type should be specified as
follows:
</p><pre class="programlisting">&lt;DEPLOYMENT&gt;standard&lt;/DEPLOYMENT&gt;</pre>
</div>
<div class="section" title="Service Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.service">Service Type</h5></div></div></div>
<p>A component package with the <span class="bold"><strong>service</strong></span>
type must be deployable locally as a supported UIMA service (e.g. Vinci). The
installation descriptor must include the path for the executable or script to
start the service including its arguments, and the working directory from where
to launch it, following this template:
</p><pre class="programlisting">&lt;DEPLOYMENT&gt;service&lt;/DEPLOYMENT&gt;
&lt;SERVICE_COMMAND&gt;$main_root/bin/startService.bat&lt;/SERVICE_COMMAND&gt;
&lt;SERVICE_WORKING_DIR&gt;$main_root&lt;/SERVICE_WORKING_DIR&gt;
&lt;SERVICE_COMMAND_ARGS&gt;
&lt;ARGUMENT&gt;
&lt;VALUE&gt;1st_parameter_value&lt;/VALUE&gt;
&lt;COMMENTS&gt;1st parameter description&lt;/COMMENTS&gt;
&lt;/ARGUMENT&gt;
&lt;ARGUMENT&gt;
&lt;VALUE&gt;2nd_parameter_value&lt;/VALUE&gt;
&lt;COMMENTS&gt;2nd parameter description&lt;/COMMENTS&gt;
&lt;/ARGUMENT&gt;
&lt;/SERVICE_COMMAND_ARGS&gt;</pre>
</div>
<div class="section" title="Network Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.network">Network Type</h5></div></div></div>
<p>A component package with the network type is not deployed locally, but
rather in a <span class="quote">&#8220;<span class="quote">remote</span>&#8221;</span> environment. It's accessed as a
network AE (e.g. Vinci Service). In this case, the PEAR package does not have to
contain files required for deployment, but must contain the network AE
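descriptor. The &lt;DESC&gt; tag in the installation descriptor (see
<a class="xref" href="#ugr.ref.pear.installation_descriptor.id_name_desc" title="6.1.4.2.&nbsp;The ID, NAME, and DESC tags">Section&nbsp;6.1.4.2, &#8220;The ID, NAME, and DESC tags&#8221;</a>) must point to the network AE descriptor. Here is a template in the case of
Vinci services: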
</p><pre class="programlisting">&lt;DEPLOYMENT&gt;network&lt;/DEPLOYMENT&gt;
&lt;NETWORK_PARAMETERS&gt;
&lt;VNS_SPECS VNS_HOST="vns_host_IP" VNS_PORT="vns_port_No" /&gt;
&lt;/NETWORK_PARAMETERS&gt;</pre>
</div>
</div>
<div class="section" title="6.1.4.4.&nbsp;The Collection Reader and CAS Consumer tags"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.collection_reader_cas_consumer">6.1.4.4.&nbsp;The Collection Reader and CAS Consumer tags</h4></div></div></div>
<p>These optional sections of the installation descriptor specify a
Collection Reader or CAS Consumer that can be used together with the packaged
analysis engine, for example to test it.</p>
</div>
<div class="section" title="6.1.4.5.&nbsp;The INSTALLATION section"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.installation">6.1.4.5.&nbsp;The INSTALLATION section</h4></div></div></div>
<p>The &lt;INSTALLATION&gt; section specifies the external dependencies of
the component and the operations that should be performed during the PEAR package
installation.</p>
<p>The component dependencies are specified in the
&lt;DELEGATE_COMPONENT&gt; sub-sections, as shown in the installation
descriptor template above.</p>
<p>Important: The ID element should be the first element in each
&lt;DELEGATE_COMPONENT&gt; sub-section.</p>
<p>The &lt;INSTALLATION&gt; section may specify the following operations:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>Setting environment variables that are
required to run the installed component.
</p>
<p>This is also how you specify additional classpath entries
for a Java component: by setting an environment variable
named CLASSPATH. The <code class="literal">buildComponentClassPath</code> method
of the PackageBrowser class builds a classpath string from what it finds in
the CLASSPATH specification here, and adds a classpath entry for every
Jar in the <code class="literal">lib</code> directory. Because of this, there is no need
to specify classpath entries for Jars in the lib directory when using
the Eclipse plugin PEAR packager or the Maven PEAR Packager.</p>
<div class="blockquote"><blockquote class="blockquote"><p>When specifying the value of the CLASSPATH environment
variable, use the semicolon ";" as the separator character, regardless of the
target Operating System conventions. This delimiter will be replaced with
the right one for the Operating System during PEAR installation.</p>
</blockquote></div>
<p>If your component needs to set the UIMA datapath you must specify the necessary
datapath setting using an environment variable with the key <code class="literal">uima.datapath</code>.
When such a key is specified the <code class="literal">getComponentDataPath</code> method of the
PackageBrowser class will return the specified datapath settings for your component.
</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>Do not put UIMA Framework Jars into the lib directory of your
PEAR; doing so will cause system failures due to class loading issues.</p></div>
</li><li class="listitem"><p>Note that you can use <span class="quote">&#8220;<span class="quote">macros</span>&#8221;</span>, like
$main_root or $component_id$root in the VAR_VALUE element of the
&lt;PARAMETERS&gt; sub-section.</p></li><li class="listitem"><p>Finding and replacing string expressions in files.</p>
</li><li class="listitem"><p>Note that you can use the <span class="quote">&#8220;<span class="quote">macros</span>&#8221;</span> in the FILE
and REPLACE_WITH elements of the &lt;PARAMETERS&gt; sub-section. </p>
</li></ul></div>
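<p>The rule that CLASSPATH values always use the semicolon as separator means that the value has to be converted for the target platform at some point. The sketch below shows one way such a conversion could look; the class <code class="literal">ClasspathSeparator</code> is hypothetical, and this is not the code the PEAR installer uses.</p>

```java
import java.io.File;

/** Converts a PEAR-style classpath to the platform convention (illustrative sketch). */
class ClasspathSeparator {

    /** Replaces the portable ';' separator with platformSeparator and drops empty entries. */
    static String localize(String pearClasspath, String platformSeparator) {
        StringBuilder out = new StringBuilder();
        for (String entry : pearClasspath.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) continue;
            if (out.length() != 0) out.append(platformSeparator);
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        System.out.println(
            localize("$main_root/bin;$main_root/lib/extra.jar;", File.pathSeparator));
    }
}
```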
<p>Important: the ACTION element should always be the first element in each
&lt;PROCESS&gt; sub-section.</p>
<p>By default, the PEAR Installer will try to process every file in the desc and
conf directories of the PEAR package in order to find the <span class="quote">&#8220;<span class="quote">macros</span>&#8221;</span>
and replace them with actual path expressions. In addition to this, the installer
will process the files specified in the
&lt;INSTALLATION&gt; section.</p>
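<p>The default processing of the desc and conf directories can be pictured as follows. This is a simplified sketch rather than the installer's implementation: the class <code class="literal">PearLocalizer</code> is hypothetical, only the <code class="literal">$main_root</code> macro is handled, and every file is assumed to be UTF-8 encoded.</p>

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Replaces $main_root in the desc and conf folders of a PEAR (illustrative sketch). */
class PearLocalizer {

    /** Processes pearRoot/desc and pearRoot/conf; returns the number of files changed. */
    static int localize(Path pearRoot) {
        String rootPath = pearRoot.toAbsolutePath().toString();
        int changed = 0;
        for (String folder : new String[] {"desc", "conf"}) {
            changed += localizeFolder(pearRoot.resolve(folder).toFile(), rootPath);
        }
        return changed;
    }

    private static int localizeFolder(File folder, String rootPath) {
        File[] children = folder.listFiles(); // null if the folder does not exist
        if (children == null) return 0;
        int changed = 0;
        for (File child : children) {
            changed += child.isDirectory()
                ? localizeFolder(child, rootPath)
                : localizeFile(child.toPath(), rootPath);
        }
        return changed;
    }

    private static int localizeFile(Path file, String rootPath) {
        try {
            String before = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            String after = before.replace("$main_root", rootPath);
            if (after.equals(before)) return 0; // nothing to replace, leave file untouched
            Files.write(file, after.getBytes(StandardCharsets.UTF_8));
            return 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length == 0 ? "." : args[0]);
        System.out.println(localize(root) + " file(s) localized");
    }
}
```

<p>Files outside desc and conf are not touched by this sketch, which mirrors why additional files have to be listed explicitly in the &lt;INSTALLATION&gt; section.</p>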
<p>Important: all XML files which are going to be processed should be created
using UTF-8 or UTF-16 file encoding. All other text files which are going to be
processed should be created using the ASCII file encoding.</p>
</div>
</div>
<div class="section" title="6.1.5.&nbsp;Packaging the PEAR structure into one file"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.packaging_into_1_file">6.1.5.&nbsp;Packaging the PEAR structure into one file</h3></div></div></div>
<p>The last step of the PEAR creation process is to <span class="bold"><strong>
zip</strong></span> the content of the PEAR root folder (<span class="bold"><strong>not
including the root folder itself</strong></span>) to a PEAR file with the extension <span class="quote">&#8220;<span class="quote">.pear</span>&#8221;</span>.</p>
<p>To do this you can either use the PEAR packaging tools that are described in <span class="quote">&#8220;<span class="quote"><a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter&nbsp;9, <i>PEAR Packager User's Guide</i></a></span>&#8221;</span> or you can use the PEAR packaging API that is shown below.</p>
<p>
To use the PEAR packaging API you first have to create the necessary information for the PEAR package:
</p><pre class="programlisting">// define PEAR data
String componentID = "AnnotComponentID";
String mainComponentDesc = "desc/mainComponentDescriptor.xml";
String classpath = "$main_root/bin;";
String datapath = "$main_root/resources;";
String mainComponentRoot = "/home/user/develop/myAnnot";
String targetDir = "/home/user/develop";
Properties annotatorProperties = new Properties();
annotatorProperties.setProperty("sysProperty1", "value1");</pre><p>
To create a complete PEAR package in one step call:
</p><pre class="programlisting">PackageCreator.generatePearPackage(
componentID, mainComponentDesc, classpath, datapath,
mainComponentRoot, targetDir, annotatorProperties);</pre><p>
The created PEAR package has the file name &lt;componentID&gt;.pear and is located in the &lt;targetDir&gt;.
</p>
<p>
To create just the PEAR installation descriptor in the main component root directory call:
</p><pre class="programlisting">PackageCreator.createInstallDescriptor(componentID, mainComponentDesc,
classpath, datapath, mainComponentRoot, annotatorProperties);</pre><p>
To package a PEAR file with an existing installation descriptor call:
</p><pre class="programlisting">PackageCreator.createPearPackage(componentID, mainComponentRoot,
targetDir);</pre><p>
The created PEAR package has the file name &lt;componentID&gt;.pear and is located in the &lt;targetDir&gt;.
</p>
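<p>
What the packaging step amounts to, zipping the content of the PEAR root folder without the root folder itself, can be sketched with the standard <code class="literal">java.util.zip</code> classes. The class <code class="literal">PearZipper</code> below is hypothetical and is not how <code class="literal">PackageCreator</code> is implemented; it assumes that the target directory lies outside the PEAR root folder.
</p>

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Zips the content of a PEAR root folder into a .pear file (illustrative sketch). */
class PearZipper {

    /** Writes targetDir/componentId.pear; entry names are relative to pearRoot. */
    static Path zip(Path pearRoot, Path targetDir, String componentId) {
        Path pearFile = targetDir.resolve(componentId + ".pear");
        try (OutputStream out = Files.newOutputStream(pearFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            // An empty prefix keeps the root folder itself out of the entry names.
            addFolder(pearRoot.toFile(), "", zip);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pearFile;
    }

    private static void addFolder(File folder, String prefix, ZipOutputStream zip)
            throws IOException {
        File[] children = folder.listFiles();
        if (children == null) return;
        for (File child : children) {
            if (child.isDirectory()) {
                addFolder(child, prefix + child.getName() + "/", zip);
            } else {
                zip.putNextEntry(new ZipEntry(prefix + child.getName()));
                Files.copy(child.toPath(), zip);
                zip.closeEntry();
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: PearZipper pearRoot targetDir componentId");
            return;
        }
        System.out.println("Created " + zip(Paths.get(args[0]), Paths.get(args[1]), args[2]));
    }
}
```

<p>
With this layout the archive contains entries such as <code class="literal">metadata/install.xml</code> at its top level. Empty folders are not added in this sketch.
</p>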
</div>
</div>
<div class="section" title="6.2.&nbsp;Installing a PEAR package"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.installing">6.2.&nbsp;Installing a PEAR package</h2></div></div></div>
<p>The installation of a PEAR package can be done using
the PEAR installer tool (see <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter&nbsp;11, <i>PEAR Installer User's Guide</i></a>), or by an application using
the PEAR APIs directly. </p>
<p>During the PEAR installation, the PEAR file is extracted to the installation directory and the PEAR macros
in the descriptors are replaced with the corresponding paths. At the end of the installation, the PEAR verification
is run to check whether the installed PEAR package can be started successfully. The PEAR verification uses the classpath,
datapath and system property settings of the PEAR package to verify the PEAR content. Necessary Java library
path settings for native libraries, PATH variable settings or system environment variables cannot be recognized
automatically; the user must take care of these manually.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>By default, PEAR packages are not installed directly into the specified installation directory. Instead, for each PEAR
a subdirectory named after the PEAR's ID is created, and the PEAR package is installed there. If this PEAR installation
directory already exists, its old content is automatically deleted before the new content is installed.</p></div>
<div class="section" title="6.2.1.&nbsp;Installing a PEAR file using the PEAR APIs"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.installing_pear_using_API">6.2.1.&nbsp;Installing a PEAR file using the PEAR APIs</h3></div></div></div>
<p>The example below shows how to use the PEAR APIs to install a
PEAR package and access the installed PEAR package data. For more details about the PackageBrowser API,
please refer to the Javadocs for the org.apache.uima.pear.tools package.
</p><pre class="programlisting">File installDir = new File("/home/user/uimaApp/installedPears");
File pearFile = new File("/home/user/uimaApp/testpear.pear");
boolean doVerification = true;
try {
  // install PEAR package
  PackageBrowser instPear = PackageInstaller.installPackage(
      installDir, pearFile, doVerification);

  // retrieve installed PEAR data
  // PEAR package classpath
  String classpath = instPear.buildComponentClassPath();
  // PEAR package datapath
  String datapath = instPear.getComponentDataPath();
  // PEAR package main component descriptor
  String mainComponentDescriptor = instPear
      .getInstallationDescriptor().getMainComponentDesc();
  // PEAR package component ID
  String mainComponentID = instPear
      .getInstallationDescriptor().getMainComponentId();
  // PEAR package pear descriptor
  String pearDescPath = instPear.getComponentPearDescPath();

  // print out settings
  System.out.println("PEAR package class path: " + classpath);
  System.out.println("PEAR package datapath: " + datapath);
  System.out.println("PEAR package mainComponentDescriptor: "
      + mainComponentDescriptor);
  System.out.println("PEAR package mainComponentID: "
      + mainComponentID);
  System.out.println("PEAR package specifier path: " + pearDescPath);
} catch (PackageInstallerException ex) {
  // catch PackageInstallerException - PEAR installation failed
  ex.printStackTrace();
  System.out.println("PEAR installation failed");
} catch (IOException ex) {
  ex.printStackTrace();
  System.out.println("Error retrieving installed PEAR settings");
}</pre>
<p>
To run a PEAR package after it has been installed using the PEAR API, see the example below. It uses the
PEAR specifier that was automatically generated during the PEAR installation.
For more details about the APIs please refer to the Javadocs.
</p><pre class="programlisting">File installDir = new File("/home/user/uimaApp/installedPears");
File pearFile = new File("/home/user/uimaApp/testpear.pear");
boolean doVerification = true;
try {
  // Install PEAR package
  PackageBrowser instPear = PackageInstaller.installPackage(
      installDir, pearFile, doVerification);

  // Create a default resource manager
  ResourceManager rsrcMgr = UIMAFramework.newDefaultResourceManager();

  // Create analysis engine from the installed PEAR package using
  // the created PEAR specifier
  XMLInputSource in =
      new XMLInputSource(instPear.getComponentPearDescPath());
  ResourceSpecifier specifier =
      UIMAFramework.getXMLParser().parseResourceSpecifier(in);
  AnalysisEngine ae =
      UIMAFramework.produceAnalysisEngine(specifier, rsrcMgr, null);

  // Create a CAS with a sample document text
  CAS cas = ae.newCAS();
  cas.setDocumentText("Sample text to process");
  cas.setDocumentLanguage("en");

  // Process the sample document
  ae.process(cas);
} catch (Exception ex) {
  ex.printStackTrace();
}</pre>
</div>
</div>
<div class="section" title="6.3.&nbsp;PEAR package descriptor"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.specifier">6.3.&nbsp;PEAR package descriptor</h2></div></div></div>
<p>
To run an installed PEAR package directly in the UIMA framework, the <code class="literal">pearSpecifier</code>
XML descriptor can be used. Typically, such a specifier is generated automatically during the PEAR installation
and contains all the information necessary to run the installed PEAR package. Settings for system environment
variables, system PATH settings or Java library path settings cannot be recognized
automatically and must be set manually when the JVM is started.
</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The PEAR may contain specifications for "environment variables" and their settings.
When such a PEAR is run
directly in the UIMA framework, those settings (except for Classpath and Data Path) are converted
to Java System properties, and set to the specified values. Java cannot set true environmental variables;
if such a setting is needed, the application would need to arrange to do this prior to invoking Java.</p>
<p>The Classpath and Data Path settings are used by UIMA to configure a special Resource Manager
that is used when code from this PEAR is being run.</p></div>
<p>
The generated PEAR descriptor
is located in the component root directory of the installed PEAR package and has a filename like
&lt;componentID&gt;_pear.xml.
</p>
<p>
The PEAR package descriptor looks like:
</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier"&gt;
&lt;pearPath&gt;/home/user/uimaApp/installedPears/testpear&lt;/pearPath&gt;
&lt;parameters&gt; &lt;!-- optional --&gt;
&lt;parameter&gt; &lt;!-- any number, repeated --&gt;
&lt;name&gt;name-of-the-parameter&lt;/name&gt;
&lt;value&gt;string-value&lt;/value&gt;
&lt;/parameter&gt;
&lt;/parameters&gt;
&lt;/pearSpecifier&gt;</pre>
<p>
The <code class="literal">pearPath</code> setting in the descriptor must point to the component root directory
of the installed PEAR package.
</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3>
<p>
It is not possible to share resources between PEAR Analysis Engines that are instantiated using the PEAR
descriptor. The PEAR runtime created for each PEAR descriptor has its own specific ResourceManager
(unless exactly the same Classpath and Data Path are being used).
</p>
</div>
<p>The optional <code class="literal">parameters</code> section, if used, specifies parameter values,
which are used to customize / override parameter values in the PEAR descriptor.
External Settings overrides continue to work for PEAR descriptors, and have precedence, if specified.
</p>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;7.&nbsp;XMI CAS Serialization Reference" id="ugr.ref.xmi"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;7.&nbsp;XMI CAS Serialization Reference</h2></div></div></div>
<p>This is the specification for the mapping of the UIMA CAS into the XMI (XML Metadata
Interchange<sup>[<a name="d5e2511" href="#ftn.d5e2511" class="footnote">7</a>]</sup>) format. XMI is an OMG standard for expressing object graphs in
XML. The UIMA SDK provides support for XMI through the classes
<code class="literal">org.apache.uima.cas.impl.XmiCasSerializer</code> and
<code class="literal">org.apache.uima.cas.impl.XmiCasDeserializer</code>.</p>
<div class="section" title="7.1.&nbsp;XMI Tag"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.xmi_tag">7.1.&nbsp;XMI Tag</h2></div></div></div>
<p>The outermost tag is &lt;XMI&gt; and must include a version number and XML
namespace attribute:
</p><pre class="programlisting">&lt;xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"&gt;
&lt;!-- CAS Contents here --&gt;
&lt;/xmi:XMI&gt;</pre>
<p>XML namespaces<sup>[<a name="d5e2521" href="#ftn.d5e2521" class="footnote">8</a>]</sup> are used throughout. The <span class="quote">&#8220;<span class="quote">xmi</span>&#8221;</span> namespace prefix is used to
identify elements and attributes that are defined by the XMI specification. The XMI
document will also define one namespace prefix for each CAS namespace, as described in
the next section.</p>
</div>
<div class="section" title="7.2.&nbsp;Feature Structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.feature_structures">7.2.&nbsp;Feature Structures</h2></div></div></div>
<p>UIMA Feature Structures are mapped to XML elements. The name of the element is
formed from the CAS type name, making use of XML namespaces as follows.</p>
<p>The CAS type namespace is converted to an XML namespace URI by the following rule:
replace all dots with slashes, prepend http:///, and append .ecore.</p>
<p>This mapping was chosen because it is the default mapping used by the Eclipse
Modeling Framework (EMF)<sup>[<a name="d5e2529" href="#ftn.d5e2529" class="footnote">9</a>]</sup> to create namespace URIs from Java package names. The use of
the http scheme is a common convention, and does not imply any HTTP communication. The
.ecore suffix is due to the fact that the recommended type system definition for a
namespace is an ECore model, see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.xmi_emf" class="olink">Chapter&nbsp;8, <i>XMI and EMF Interoperability</i></a>.</p>
<p>Consider the CAS type name <span class="quote">&#8220;<span class="quote">org.myproj.Foo</span>&#8221;</span>. The CAS namespace
(<span class="quote">&#8220;<span class="quote">org.myproj</span>&#8221;</span>) is converted to the XML namespace URI
http:///org/myproj.ecore.</p>
<p>The XML element name is then formed by concatenating the XML namespace prefix
(which is an arbitrary token, but typically we use the last component of the CAS
namespace) with the type name (excluding the namespace).</p>
<p>So the example <span class="quote">&#8220;<span class="quote">org.myproj.Foo</span>&#8221;</span> FeatureStructure is written to
XMI as:
</p><pre class="programlisting">&lt;xmi:XMI
xmi:version="2.0"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore"&gt;
...
&lt;myproj:Foo xmi:id="1"/&gt;
...
&lt;/xmi:XMI&gt;</pre>
<p>The xmi:id attribute is only required if this object will be referred to from
elsewhere in the XMI document. If provided, the xmi:id must be unique within the
XMI document.</p>
<p>All namespace prefixes (e.g. <span class="quote">&#8220;<span class="quote">myproj</span>&#8221;</span>) in this example must be
bound to URIs using the <span class="quote">&#8220;<span class="quote">xmlns...</span>&#8221;</span> attribute, as defined by the XML
namespaces specification.</p>
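<p>The naming rule described in this section can be sketched in plain Java. This is only an illustration, not
the code used by <code class="literal">XmiCasSerializer</code>: the class and method names are hypothetical, the prefix
is chosen by the "last component" convention mentioned above, and the sketch assumes the type name has a namespace.</p>

```java
/** Illustrative sketch of the CAS-type-name to XMI-name mapping (names are hypothetical). */
final class XmiNameSketch {

    /** Returns the CAS namespace of a type name, e.g. "org.myproj.Foo" gives "org.myproj". */
    static String casNamespace(String casTypeName) {
        int lastDot = casTypeName.lastIndexOf('.');
        return lastDot < 0 ? "" : casTypeName.substring(0, lastDot);
    }

    /** Replaces dots with slashes, prepends "http:///" and appends ".ecore". */
    static String namespaceUri(String casNamespace) {
        return "http:///" + casNamespace.replace('.', '/') + ".ecore";
    }

    /** Conventional (but arbitrary) prefix: the last component of the CAS namespace. */
    static String defaultPrefix(String casNamespace) {
        return casNamespace.substring(casNamespace.lastIndexOf('.') + 1);
    }

    /** Builds the qualified XML element name, e.g. "org.myproj.Foo" gives "myproj:Foo". */
    static String elementName(String casTypeName) {
        String namespace = casNamespace(casTypeName);
        String shortName = casTypeName.substring(casTypeName.lastIndexOf('.') + 1);
        return defaultPrefix(namespace) + ":" + shortName;
    }

    public static void main(String[] args) {
        System.out.println(namespaceUri(casNamespace("org.myproj.Foo")));
        System.out.println(elementName("org.myproj.Foo"));
    }
}
```

<p>Because the prefix is an arbitrary token, a reader of the XMI should compare namespace URIs, not prefixes.</p>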
</div>
<div class="section" title="7.3.&nbsp;Primitive Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.primitive_features">7.3.&nbsp;Primitive Features</h2></div></div></div>
<p>CAS features of primitive types (String, Boolean, Byte, Short, Integer, Long,
Float, or Double) can be mapped either to XML attributes or XML elements. For example, a
CAS FeatureStructure of type org.myproj.Foo, with features:
</p><pre class="programlisting">begin = 14
end = 19
myFeature = "bar"</pre><p>
could be mapped to:
</p><pre class="programlisting">&lt;xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore"&gt;
...
&lt;myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/&gt;
...
&lt;/xmi:XMI&gt;</pre><p>
or equivalently:
</p><pre class="programlisting">&lt;xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore"&gt;
...
&lt;myproj:Foo xmi:id="1"&gt;
&lt;begin&gt;14&lt;/begin&gt;
&lt;end&gt;19&lt;/end&gt;
&lt;myFeature&gt;bar&lt;/myFeature&gt;
&lt;/myproj:Foo&gt;
...
&lt;/xmi:XMI&gt;</pre>
<p>The attribute serialization is preferred for compactness, but either
representation is allowable. Mixing the two styles is allowed; some features can be
represented as attributes and others as elements.</p>
</div>
<div class="section" title="7.4.&nbsp;Reference Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.reference_features">7.4.&nbsp;Reference Features</h2></div></div></div>
<p>CAS features that are references to other feature structures (excluding arrays
and lists, which are handled separately) are serialized as ID references.</p>
<p>If we add to the previous CAS example a feature structure of type org.myproj.Baz,
with feature <span class="quote">&#8220;<span class="quote">myFoo</span>&#8221;</span> that is a reference to the Foo object, the
serialization would be:
</p><pre class="programlisting">&lt;xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore"&gt;
...
&lt;myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/&gt;
&lt;myproj:Baz xmi:id="2" myFoo="1"/&gt;
...
&lt;/xmi:XMI&gt;</pre>
<p>As with primitive-valued features, it is permitted to use an element rather than an
attribute. However, the syntax is slightly different:</p>
<pre class="programlisting">&lt;myproj:Baz xmi:id="2"&gt;
&lt;myFoo href="#1"/&gt;
&lt;/myproj:Baz&gt;</pre>
<p>Note that in the attribute representation, a reference feature is
indistinguishable from an integer-valued feature, so the meaning cannot be
determined without prior knowledge of the type system. The element representation is
unambiguous.</p>
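<p>The ambiguity of the attribute form can be illustrated with a small sketch. It assumes the reader already knows,
from the type system, which feature names are references; the class name, the method and the set of reference
features are hypothetical and are not part of the UIMA API.</p>

```java
import java.util.Set;

/** Illustrative sketch: an attribute value is only interpretable with type-system knowledge. */
final class XmiAttributeSketch {

    /**
     * Describes how an attribute value should be read.
     * referenceFeatures stands in for what the type system would tell us.
     */
    static String interpret(String feature, String value, Set<String> referenceFeatures) {
        if (referenceFeatures.contains(feature)) {
            // The value is the xmi:id of another feature structure.
            return "reference to xmi:id " + Integer.parseInt(value);
        }
        // Otherwise the value is taken literally as a primitive.
        return "primitive value " + value;
    }

    public static void main(String[] args) {
        Set<String> refs = Set.of("myFoo");
        System.out.println(interpret("myFoo", "1", refs));
        System.out.println(interpret("begin", "14", refs));
    }
}
```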
</div>
<div class="section" title="7.5.&nbsp;Array and List Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.array_and_list_features">7.5.&nbsp;Array and List Features</h2></div></div></div>
<p>For a CAS feature whose range type is one of the CAS array or list types, the XMI serialization depends on the
setting of the <span class="quote">&#8220;<span class="quote">multipleReferencesAllowed</span>&#8221;</span> attribute for that feature in the UIMA Type System
Description (see <a href="references.html#ugr.ref.xml.component_descriptor.type_system.features" class="olink">Section&nbsp;2.3.3, &#8220;Features&#8221;</a>).</p>
<p>An array or list with multipleReferencesAllowed = false (the default) is serialized as a
<span class="quote">&#8220;<span class="quote">multi-valued</span>&#8221;</span> property in XMI. An array or list with multipleReferencesAllowed = true is
serialized as a first-class object. Details are described below.</p>
<div class="section" title="7.5.1.&nbsp;Arrays and Lists as Multi-Valued Properties"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.array_and_list_features.as_multi_valued_properties">7.5.1.&nbsp;Arrays and Lists as Multi-Valued Properties</h3></div></div></div>
<p>In XMI, a multi-valued property is the most natural XMI representation for most cases. Consider the
example where the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the
integer array {2,4,6}. This can be mapped to:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3" myIntArray="2 4 6"/&gt;</pre><p> or
equivalently:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3"&gt;
&lt;myIntArray&gt;2&lt;/myIntArray&gt;
&lt;myIntArray&gt;4&lt;/myIntArray&gt;
&lt;myIntArray&gt;6&lt;/myIntArray&gt;
&lt;/myproj:Baz&gt;</pre><p>
</p>
<p>Note that String arrays whose elements contain embedded spaces MUST use the latter mapping.</p>
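<p>As a rough sketch of the attribute form, the following plain Java reads a space-separated value back into
an integer array and checks whether a String array may use the attribute form at all. The class and method names
are hypothetical; this is not the UIMA deserializer.</p>

```java
import java.util.Arrays;

/** Illustrative sketch of the space-separated multi-valued attribute form. */
final class MultiValuedSketch {

    /** Parses an attribute value such as "2 4 6" into an int array. */
    static int[] parseIntArray(String attributeValue) {
        String trimmed = attributeValue.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    /** A String array with embedded whitespace must use child elements instead. */
    static boolean canUseAttributeForm(String[] values) {
        for (String value : values) {
            if (value.chars().anyMatch(Character::isWhitespace)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseIntArray("2 4 6")));
        System.out.println(canUseAttributeForm(new String[] {"New York", "Paris"}));
    }
}
```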
<p>FSArray or FSList features are serialized in a similar way. For example an FSArray feature that contains
references to the elements with xmi:id's <span class="quote">&#8220;<span class="quote">13</span>&#8221;</span> and <span class="quote">&#8220;<span class="quote">42</span>&#8221;</span> could be
serialized as:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3" myFsArray="13 42"/&gt;</pre><p> or:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3"&gt;
&lt;myFsArray href="#13"/&gt;
&lt;myFsArray href="#42"/&gt;
&lt;/myproj:Baz&gt;</pre><p>
</p>
</div>
<div class="section" title="7.5.2.&nbsp;Arrays and Lists as First-Class Objects"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.array_and_list_features.as_1st_class_objects">7.5.2.&nbsp;Arrays and Lists as First-Class Objects</h3></div></div></div>
<p>The multi-valued-property representation described in the previous section does not allow multiple
references to an array or list object. Therefore, it cannot be used for features that are defined to allow
multiple references (i.e. features for which multipleReferencesAllowed = true in the Type System
Description).</p>
<p>When multipleReferencesAllowed is set to true, array and list features are serialized as references,
and the array or list objects are serialized as separate objects in the XMI. Consider again the example where
the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the integer array
{2,4,6}. If myIntArray is defined with multipleReferencesAllowed=true, the serialization will be as
follows:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3" myIntArray="4"/&gt;</pre><p> or:
</p><pre class="programlisting">&lt;myproj:Baz xmi:id="3"&gt;
&lt;myIntArray href="#4"/&gt;
&lt;/myproj:Baz&gt;</pre><p>
with the array object serialized as
</p><pre class="programlisting">&lt;cas:IntegerArray xmi:id="4" elements="2 4 6"/&gt;</pre><p> or:
</p><pre class="programlisting">&lt;cas:IntegerArray xmi:id="4"&gt;
&lt;elements&gt;2&lt;/elements&gt;
&lt;elements&gt;4&lt;/elements&gt;
&lt;elements&gt;6&lt;/elements&gt;
&lt;/cas:IntegerArray&gt;</pre>
<p>Note that in this case, the XML element name is formed from the CAS type name (e.g.
<span class="quote">&#8220;<span class="quote"><code class="literal">uima.cas.IntegerArray</code></span>&#8221;</span>) in the same way as for other
FeatureStructures. The elements of the array are serialized either as a space-separated attribute named
<span class="quote">&#8220;<span class="quote">elements</span>&#8221;</span> or as a series of child elements named <span class="quote">&#8220;<span class="quote">elements</span>&#8221;</span>.</p>
<p>List nodes are just standard FeatureStructures with <span class="quote">&#8220;<span class="quote">head</span>&#8221;</span> and <span class="quote">&#8220;<span class="quote">tail</span>&#8221;</span>
features, and are serialized using the normal FeatureStructure serialization. For example, an
IntegerList with the values 2, 4, and 6 would be serialized as the four objects:
</p><pre class="programlisting">&lt;cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/&gt;
&lt;cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/&gt;
&lt;cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/&gt;
&lt;cas:EmptyIntegerList xmi:id="13"/&gt;</pre>
<p>This representation of arrays allows multiple references to an array or list. It also allows a feature
with range type TOP to refer to an array or list. However, it is a very unnatural representation in XMI and does
not support interoperability with other XMI-based systems, so we instead recommend using the
multi-valued-property representation described in the previous section whenever it is possible.</p>
<p>When a feature is specified in the descriptor without a multipleReferencesAllowed attribute, or with the
attribute specified as <code class="code">false</code>, but the framework discovers multiple references during
serialization, it will issue a message to the log saying that it discovered this (look for the phrase
"serialized in duplicate"). The serialization will continue, but the multiply-referenced items will
be serialized in duplicate.</p>
</div>
<div class="section" title="7.5.3.&nbsp;Null Array/List Elements"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.null_array_list_elements">7.5.3.&nbsp;Null Array/List Elements</h3></div></div></div>
<p>In UIMA, an element of an FSArray or FSList may be null. In XMI, multi-valued properties do not permit null
values. As a workaround for this, we use a dummy instance of the special type cas:NULL, which has xmi:id 0.
In the following example, the <span class="quote">&#8220;<span class="quote">myFsArray</span>&#8221;</span> feature refers to an FSArray whose
second element is null:
</p><pre class="programlisting">&lt;cas:NULL xmi:id="0"/&gt;
&lt;myproj:Baz xmi:id="3"&gt;
&lt;myFsArray href="#13"/&gt;
&lt;myFsArray href="#0"/&gt;
&lt;myFsArray href="#42"/&gt;
&lt;/myproj:Baz&gt;</pre>
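<p>A reader of such a document has to map the reserved id 0 back to a null element. The sketch below shows the idea
in plain Java, assuming the referenced objects have already been collected in a map keyed by xmi:id; the class,
method and map are hypothetical and only illustrate the convention.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: resolving href values, treating xmi:id 0 (cas:NULL) as null. */
final class NullElementSketch {

    /** The xmi:id reserved for the dummy cas:NULL instance. */
    static final int NULL_ID = 0;

    /** Resolves href values such as "#13" against objects indexed by xmi:id. */
    static <T> List<T> resolve(List<String> hrefs, Map<Integer, T> byId) {
        List<T> result = new ArrayList<>();
        for (String href : hrefs) {
            int id = Integer.parseInt(href.substring(1)); // drop the leading '#'
            result.add(id == NULL_ID ? null : byId.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> byId = Map.of(13, "first", 42, "third");
        System.out.println(resolve(Arrays.asList("#13", "#0", "#42"), byId));
    }
}
```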
</div>
</div>
<div class="section" title="7.6.&nbsp;Subjects of Analysis (Sofas) and Views"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.sofas_views">7.6.&nbsp;Subjects of Analysis (Sofas) and Views</h2></div></div></div>
<p>A UIMA CAS contains one or more subjects of analysis (Sofas). These are serialized no
differently from any other feature structure. For example:
</p><pre class="programlisting">&lt;?xml version="1.0"?&gt;
&lt;xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:cas="http:///uima/cas.ecore"&gt;
&lt;cas:Sofa xmi:id="1" sofaNum="1"
text="the quick brown fox jumps over the lazy dog."/&gt;
&lt;/xmi:XMI&gt;</pre>
<p>Each Sofa defines a separate View. Feature Structures in the CAS can be members of
one or more views. (A Feature Structure that is a member of a view is indexed in its
IndexRepository, but that is an implementation detail.)</p>
<p>In the XMI serialization, views will be represented as first-class objects. Each
View has an (optional) <span class="quote">&#8220;<span class="quote">sofa</span>&#8221;</span> feature, which references a sofa, and
a multi-valued reference to the members of the View. For example:</p>
<pre class="programlisting">&lt;cas:View sofa="1" members="3 7 21 39 61"/&gt;</pre>
<p>Here the integers 3, 7, 21, 39, and 61 refer to the xmi:id fields of the objects that
are members of this view.</p>
</div>
<div class="section" title="7.7.&nbsp;Linking an XMI Document to its Ecore Type System"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.linking_to_ecore_type_system">7.7.&nbsp;Linking an XMI Document to its Ecore Type System</h2></div></div></div>
<p>If the CAS Type System has been saved to an Ecore file (as described in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.xmi_emf" class="olink">Chapter&nbsp;8, <i>XMI and EMF Interoperability</i></a>), it is possible to store a
link from an XMI document to that Ecore type system. This is done using an xsi:schemaLocation attribute
on the root XMI element.</p>
<p>The xsi:schemaLocation attribute is a space-separated list that represents a
mapping from namespace URI (e.g. http:///org/myproj.ecore) to the physical URI of the
.ecore file containing the type system for that namespace. For example:
</p><pre class="programlisting">xsi:schemaLocation=
"http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"</pre><p>
would indicate that the definition for the org.myproj CAS types is contained in the file
<code class="literal">c:/typesystems/myproj.ecore</code>. You can specify a different
mapping for each of your CAS namespaces, using a space separated list. For details see
Budinsky et al. <span class="emphasis"><em>Eclipse Modeling Framework</em></span>.</p>
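<p>Since the attribute value is a flat list of alternating namespace URIs and locations, reading it amounts to
pairing up the tokens. The following is a minimal sketch in plain Java; the class and method names are
hypothetical and it is not how EMF or UIMA necessarily process the attribute.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch: splitting an xsi:schemaLocation value into URI/location pairs. */
final class SchemaLocationSketch {

    /** Maps each namespace URI to the location token that follows it. */
    static Map<String, String> parse(String schemaLocation) {
        Map<String, String> mapping = new LinkedHashMap<>();
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return mapping;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("expected namespace/location pairs");
        }
        for (int i = 0; i < tokens.length; i += 2) {
            mapping.put(tokens[i], tokens[i + 1]);
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(parse(
            "http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"));
    }
}
```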
</div>
<div class="section" title="7.8.&nbsp;Delta CAS XMI Format"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.delta">7.8.&nbsp;Delta CAS XMI Format</h2></div></div></div>
<p>
The Delta CAS XMI serialization format is designed primarily to reduce the serialization overhead when calling annotators
configured as services. Only Feature Structures and Views that are new or modified by the service
are serialized and returned by the service.
</p>
<p>
The classes <code class="literal">org.apache.uima.cas.impl.XmiCasSerializer</code> and
<code class="literal">org.apache.uima.cas.impl.XmiCasDeserializer</code> support serialization of only the modifications to the CAS.
A caller is expected to set a marker to indicate the point from which changes to the CAS are to be tracked.
</p>
<p>
A Delta CAS XMI document contains only the Feature Structures and Views that have been added or modified.
The new and modified Feature Structures are represented in exactly the same format as in a complete CAS serialization.
The <code class="literal">cas:View</code> element has been extended with three additional attributes to represent modifications to
View membership. These new attributes are <code class="literal">added_members</code>, <code class="literal">deleted_members</code> and
<code class="literal">reindexed_members</code>. For example:
</p>
<pre class="programlisting">&lt;cas:View sofa="1" added_members="63 77"
deleted_members="7 61" reindexed_members="39" /&gt;</pre>
<p>
Here the integers 63 and 77 are the xmi:id fields of objects that have been newly added to this View,
7 and 61 are the xmi:id fields of objects that have been removed from this View, and 39 is the xmi:id of an object to be reindexed in this View.
</p>
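<p>The effect of these attributes on a View's membership can be modelled with simple set operations. The sketch
below is a conceptual illustration only, with hypothetical class and method names; it assumes that a reindexed
member stays a member, so only the added and deleted lists change the set.</p>

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch: applying Delta CAS view attributes to a set of member ids. */
final class DeltaViewSketch {

    /** Parses a space-separated list of xmi:ids such as "63 77". */
    static Set<Integer> ids(String attributeValue) {
        Set<Integer> result = new TreeSet<>();
        String trimmed = attributeValue == null ? "" : attributeValue.trim();
        if (!trimmed.isEmpty()) {
            for (String token : trimmed.split("\\s+")) {
                result.add(Integer.parseInt(token));
            }
        }
        return result;
    }

    /** Returns the membership after removing deleted ids and adding new ones. */
    static Set<Integer> applyDelta(Set<Integer> members, String added, String deleted) {
        Set<Integer> result = new TreeSet<>(members);
        result.removeAll(ids(deleted));
        result.addAll(ids(added));
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> before = ids("3 7 21 39 61");
        System.out.println(applyDelta(before, "63 77", "7 61"));
    }
}
```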
</div>
<div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e2511" href="#d5e2511" class="para">7</a>] </sup> For details on XMI see Grose et al. <span class="emphasis"><em>Mastering
XMI. Java Programming with XMI, XML, and UML. </em></span>John Wiley &amp; Sons, Inc.
2002.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e2521" href="#d5e2521" class="para">8</a>] </sup>http://www.w3.org/TR/xml-names11/</p>
</div><div class="footnote"><p><sup>[<a id="ftn.d5e2529" href="#d5e2529" class="para">9</a>] </sup> For details on EMF and Ecore see Budinsky et
al. <span class="emphasis"><em>Eclipse Modeling Framework 2.0</em></span>. Addison-Wesley.
2006.</p></div></div></div>
<div class="chapter" title="Chapter&nbsp;8.&nbsp;Compressed Binary CASes" id="ugr.ref.compress"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;8.&nbsp;Compressed Binary CASes</h2></div></div></div>
<div class="section" title="8.1.&nbsp;Binary CAS Compression overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.overview">8.1.&nbsp;Binary CAS Compression overview</h2></div></div></div>
<p>UIMA has a proprietary binary serialization format, used internally
for several things, including communicating with embedded C++ annotators using
UIMA-CPP. This binary format is also selectable for use with UIMA-AS. Its use
requires that the source and target systems implement the identical type system
(because the type system is not sent, and internal coding is used within the
format that is keyed to the particular type system).</p>
<p>Starting with version 2.4.1, two additional forms of binary serialization are added.
Both compress the data being serialized; typical size ratios can approach 50 : 1,
depending on the exact contents of the CAS, when compared with normal binary serialization.
</p>
<p>The two forms are called 4 and 6, for historical/internal reasons. The serialized format
of each form is fixed, but not currently standardized, and the form being used is encoded in the header so
that the appropriate deserializer can be chosen. Both forms include support for Delta CAS
being returned from a service.</p>
<p>Form 6 builds on form 4, and adds: serializing only those feature structures which
are reachable (that is, in some index, or referenced by other reachable feature structures),
and type filtering.</p>
<p>Type filtering takes a source type system and a target type system, and for serializing
(source to target), sends the binary representation of reachable feature structures in the target's type system.
For deserializing (reading a target into a source), the filtering takes the specification being read
as being encoded using the target's type system, and translates that into the source's type system.
In this process, types which exist in the source but not the target are skipped (when serializing);
types which exist in the target, but not the source are skipped when deserializing.
Features that exist in some
source type but not in the version of the same type in the target are skipped (when serializing)
or set to default values (i.e., 0 or null) when being deserialized.</p>
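<p>The feature-level part of this filtering can be pictured with ordinary maps. The following sketch is a conceptual
model only, not the form 6 implementation: feature values are held in a map keyed by feature name, the class and
method names are hypothetical, and null is used as the only default value for brevity.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch of feature filtering between a source and a target type system. */
final class TypeFilterSketch {

    /** Serializing: keep only features that the target's version of the type defines. */
    static Map<String, Object> filterForTarget(Map<String, Object> sourceValues,
                                               Set<String> targetFeatures) {
        Map<String, Object> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : sourceValues.entrySet()) {
            if (targetFeatures.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    /** Deserializing: source features missing from the received data get a default (null). */
    static Map<String, Object> fillDefaults(Map<String, Object> received,
                                            Set<String> sourceFeatures) {
        Map<String, Object> filled = new LinkedHashMap<>();
        for (String feature : sourceFeatures) {
            filled.put(feature, received.get(feature));
        }
        return filled;
    }

    public static void main(String[] args) {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("begin", 14);
        values.put("extra", "x");
        System.out.println(filterForTarget(values, Set.of("begin")));
    }
}
```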
<p>There are two main use cases for using compressed forms. The first one is for communicating with
UIMA-AS remote services (not yet implemented).
</p>
<p>The second use case is for saving compressed representations of CASes to other media, such as disk files,
where they can be deserialized later for use in other UIMA applications.</p>
</div>
<div class="section" title="8.2.&nbsp;Using Compressed Binary CASes"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.usage">8.2.&nbsp;Using Compressed Binary CASes</h2></div></div></div>
<p>The main user interface for serializing a CAS using compression is to use one of the
static methods named serializeWithCompression in Serialization. If you pass a Type System argument representing
a target type system, then form 6 compression is used; otherwise form 4 is used.
To get the benefit of only serializing reachable Feature Structure instances, without type mapping
(which is only in form 6), pass a type system argument which is null.
</p>
<p>To deserialize into a CAS without type mapping, use one of the deserialize methods in Serialization.
There are multiple forms of this method, depending on the arguments. The forms which take extra arguments,
including a ReuseInfo, may only be used with serialized forms created with form 6 compression.
The plain form of deserialize works with all forms of binary serialization, compressed and non-compressed, by examining a common
header which identifies the form of binary serialization used; however, form 6 requires
additional arguments, so the plain form will fail on form 6 data and you need to use the other deserialize form.</p>
<p>Form 6 has an additional object, ReuseInfo, which holds information which
is required for subsequent Delta CAS format serializations / deserializations.
It can speed up subsequent serializations of the same
CAS (before it is further updated), for instance, if an application is sending the CAS to multiple services in parallel.
The serializeWithCompression method returns this object when form 6 is being used.
</p>
<p>In addition, the CasIOUtils class offers static load and save methods, which can be used with the SerialFormat
enum to serialize and deserialize to URLs or streams; see the Javadocs for details.</p>
</div>
<div class="section" title="8.3.&nbsp;Simple Delta CAS serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.simple-deltas">8.3.&nbsp;Simple Delta CAS serialization</h2></div></div></div>
<p>Use form 4 for this. Form 6 also supports delta CAS, but it requires more bookkeeping.
When a CAS is deserialized on the receiver side and will later be delta serialized
back to the sender,
an instance of the ReuseInfo must be saved at deserialization time, and that
same instance then used for the delta serialization. Furthermore, the original serialization
(on the sender side)
must also save an instance of the ReuseInfo and use it when deserializing the delta CAS.
</p>
<p>Form 4 may not be as efficient as form 6, in that it filters the CASes
neither by type system nor by sending only reachable Feature Structure
instances. But, it doesn't require a ReuseInfo object when doing delta serialization or
deserialization,
so it may be more convenient to use when saving
delta CASes to files (as opposed to the other use case of
a remote service returning delta CASes to a remote client).</p>
</div>
<div class="section" title="8.4.&nbsp;Use Case cookbook"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.use-cases">8.4.&nbsp;Use Case cookbook</h2></div></div></div>
<p>
Here are some use cases, together with a suggested approach and example of how to use the APIs.
</p>
<p>
<span class="strong"><strong>Save a CAS to an output stream, using form 4 (no type system filtering):</strong></span>
</p>
<pre class="programlisting">// set up an output stream. In this example, an internal byte array.
ByteArrayOutputStream baos = new ByteArrayOutputStream(OUT_BFR_INIT_SZ);
Serialization.serializeWithCompression(casSrc, baos);
// or
CasIOUtils.save(casSrc, baos, SerialFormat.COMPRESSED);
</pre>
<p><span class="strong"><strong>Deserialize from a stream into an existing CAS:</strong></span></p>
<pre class="programlisting">// assume the stream is a byte array input stream
// For example, one could be created
// from the above ByteArrayOutputStream as follows:
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
// Deserialize into a cas having the identical type system
Serialization.deserializeCAS(cas, bais);
// or
CasIOUtils.load(bais, cas);
</pre>
<p>Note that the <code class="code">deserializeCAS(cas, inputStream)</code> method is a general way to
deserialize into a CAS from an inputStream for all forms of binary serialized data
(with exceptions as noted above).
The method reads a common header, and based on what it finds, selects the appropriate
deserialization routine.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The <code class="code">deserializeCAS</code> method with just 2 arguments doesn't support type filtering or
delta CAS deserialization for form 6. To do those, see the example below.
</p>
</div>
<p><span class="strong"><strong>Serialize to an output stream, filtering out some types and/or features:</strong></span>
</p>
<p>
To do this, an additional input specifying the Type System of the target must
be supplied; this Type System should be a subset of the source CAS's.
The <code class="code">out</code> parameter may be an OutputStream, a DataOutputStream, or a File.
</p>
<pre class="programlisting">// set up an output stream. In this example, an internal byte array.
ByteArrayOutputStream baos = new ByteArrayOutputStream(OUT_BFR_INIT_SZ);
// the stream is passed as the out parameter, followed by the target type system
Serialization.serializeWithCompression(cas, baos, tgtTypeSystem);
</pre>
<p><span class="strong"><strong>Deserialize with type filtering:</strong></span></p>
<p>There are 2 type systems involved here: one is that of the receiving CAS, and the other is the type system
used to decode the serialized form. The latter may optionally be stored with the serialized form:</p>
<pre class="programlisting">CasIOUtils.save(cas, out, SerialFormat.COMPRESSED_FILTERED_TS);
</pre>
<p>and/or it can be supplied at load time. Here are two examples of supplying this at load time:</p>
<pre class="programlisting">CasIOUtils.load(input, cas, typeSystem);
CasIOUtils.load(input, type_system_serialized_form_input, cas);
</pre>
<p>Alternatively, use the 4-argument <code class="code">deserializeCAS</code> method.
The reuseInfo must be null unless a delta CAS is being deserialized, in which case it must be
the reuse info captured when the original CAS was serialized out.
If the target type system is identical to the one in the CAS, you may pass null for it.
</p>
<pre class="programlisting">ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Serialization.deserializeCAS(cas, bais, tgtTypeSystem, reuseInfo);
</pre>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;9.&nbsp;JSON Serialization of CASs and UIMA Description objects" id="ugr.ref.json"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;9.&nbsp;JSON Serialization of CASs and UIMA Description objects</h2></div></div></div>
<div class="section" title="9.1.&nbsp;JSON serialization support overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.overview">9.1.&nbsp;JSON serialization support overview</h2></div></div></div>
<p>Applications are moving to the "cloud", and new applications are being rapidly developed that are hooking
things up using various mashup techniques. New standards and conventions are emerging to support this kind
of application development, such as REST services.
JSON is now a popular way for services to communicate;
its popularity is rising (in 2014) while XML is falling.</p>
<p>Starting with version 2.7.0, JSON style serialization (but not (yet) deserialization)
for CASs and UIMA descriptions is supported.
The exact format of the serialization is configurable in several aspects.
The implementation is built on top of the Jackson JSON generation library.
</p>
<p>The next section discusses serialization for CASes, while a later section describes serialization
of description objects, such as type system descriptions.</p>
</div>
<div class="section" title="9.2.&nbsp;JSON CAS Serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas">9.2.&nbsp;JSON CAS Serialization</h2></div></div></div>
<p>CASs primarily consist of collections of Feature Structures (FSs). Similar to XMI serialization, JSON
serialization skips serializing unreachable FSs, outputting only those FSs that are found in the indexes (these are called
<span class="emphasis"><em>roots</em></span>), plus all of
the FSs that are referenced via some chain of references, from the roots.
</p>
<p>To support the kinds of things users do with FSs,
the serialized form may be augmented to include additional information beyond the FSs.</p>
<p>For traditional UIMA implementations, the serialized formats mostly assumed that the receivers had access to
a type system description, which specified details of the types of each feature value. For JSON serialization,
some of this information can be included directly in the serialization.</p>
<p>This abbreviated type system information is one kind of additional information that can be included;
here's a summary list of the various kinds of additional information you can add to the serialization:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>having a way to identify which fields in a FS should be treated as references to other FSs, or
as representing serialized binary data from UIMA byte arrays.</p>
</li><li class="listitem">
<p>something like XML namespaces to allow the use of short type names in the serialization while handling name
collisions</p>
</li><li class="listitem">
<p>enough of the UIMA type hierarchy to allow the common operation of iterating over a type together
with all of its subtypes</p>
</li><li class="listitem"><p>A way to identify which FSs were "added-to-the-indexes" (separately, per CAS View)
and therefore serve as roots when
iterating over types.</p>
</li><li class="listitem"><p>An identification of the associated type system definition</p></li></ul></div>
<p>Simple JSON serialization does not have a convention for supporting these, but many extensions do.
We borrow some of the concepts in the JSON-LD (linked data) standard in providing this
additional information.</p>
<div class="section" title="9.2.1.&nbsp;The Big Picture"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.bigpic">9.2.1.&nbsp;The Big Picture</h3></div></div></div>
<p>CAS JSON serialization consists of several parts: an optional _context, the set of Feature Structures,
and (if doing a delta serialization) information about changes to what was indexed.</p>
<div class="figure"><a name="ug.ref.json.fig.bigpic"></a><div class="figure-contents">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="347"><tr><td><img src="images/references/ref.json/big_picture2.png" width="347" alt="The big picture showing the parts of serialization, with the _context optional."></td></tr></table></div>
</div><p class="title"><b>Figure&nbsp;9.1.&nbsp;The major sections of JSON serialization</b></p></div><br class="figure-break">
<p>The serializer can be configured to omit
the _context or parts of the _context for cases where that information isn't needed. The index changes
information is only included if Delta CAS serialization is specified. Note that Delta CAS support
is incomplete; so this information is just for planning purposes.</p>
</div>
<div class="section" title="9.2.2.&nbsp;The _context section"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.context">9.2.2.&nbsp;The _context section</h3></div></div></div>
<p>The _context section has entries for each used type as well as some special additional entries.
Each entry for a type has multiple sub-entries, identified
by a key-name. Each sub-entry can be selectively omitted if not needed.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>_type_system</strong></span> - a URI of the type system information</p></li><li class="listitem"><p><span class="bold"><strong>_types</strong></span> - information about each used type
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"><p><span class="bold"><strong>_id</strong></span> - the type's fully qualified UIMA type name</p></li><li class="listitem"><p><span class="bold"><strong>_feature_types</strong></span> - a map from features of this type to
information about the type of the value of the feature</p></li><li class="listitem"><p><span class="bold"><strong>_subtypes</strong></span> - an array of used subtype short-names</p></li></ul></div><p>
</p></li></ul></div><p>
</p>
<p>Here's an example:</p>
<div class="informalexample">
<pre class="programlisting">"_context" : {
  "_type_system" : "URI to the type system information",
  "_types" : {
    "A_Typical_User_or_built_in_Type" : {
      "_id" : "org.apache.uima.test.A_Typical_User_or_built_in_Type",
      "_feature_types" : {
        "sofa" : "_ref",
        "aFS" : "_ref",
        "an_array" : "_array",
        "a_byte_array" : "_byte_array" },
      "_subtypes" : [ "subtype1", "subtype2", ... ] },
    "Sofa" : {
      "_id" : "uima.cas.Sofa",
      "_feature_types" : {"sofaArray" : "_ref"} }
  }
}</pre></div>
<p>The <span class="bold"><strong>_type_system</strong></span> is an optional URI that references a UIMA type system description that
defines the types for the CAS being serialized.</p>
<p>In the <span class="bold"><strong>_types</strong></span> section, the key (e.g. "Sofa" or "A_Typical_User_or_built_in_Type") is the "short" name
for the type used in the serialization.
It is either just
the last segment of the full type name (e.g. for the type x.y.z.TypeName, it's TypeName), or,
if the last segment alone would collide with that of another type
(example: some.package.cname.Foo and some.other.package.cname.Foo), the key is made up of
the next-to-last segment (with a suffixed incrementing integer if that segment itself collides),
a colon (:), and then the last segment.</p>
<div class="blockquote"><blockquote class="blockquote"><p>In this example, since the next to last segment of both names is
"cname", one namespace name would be "cname", and the other would be "cname1". The keys in this case would be
cname:Foo and cname1:Foo.</p></blockquote></div>
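<p>The short-name rule described above can be sketched in plain Java. This is only an illustration and not the serializer's own code: the class and method names are hypothetical, and the order in which colliding types receive the unsuffixed namespace (here, the order of the input array) is an assumption.</p>
<div class="informalexample">
<pre class="programlisting">
```java
import java.util.Arrays;

// Hypothetical helper illustrating the short-name rule; not part of UIMA.
class ShortTypeNames {

    // Last dot-separated segment of a fully qualified type name.
    static String lastSegment(String fullName) {
        return fullName.substring(fullName.lastIndexOf('.') + 1);
    }

    // Next-to-last segment, or "" when the name has a single segment.
    static String nextToLastSegment(String fullName) {
        int end = fullName.lastIndexOf('.');
        return end == -1 ? "" : lastSegment(fullName.substring(0, end));
    }

    // True if candidate is among the first 'count' keys already assigned.
    static boolean alreadyUsed(String[] keys, int count, String candidate) {
        for (int i = 0; i != count; i++) {
            if (keys[i].equals(candidate)) {
                return true;
            }
        }
        return false;
    }

    // Computes one key per type name, in input order.
    static String[] shortNames(String[] fullNames) {
        String[] keys = new String[fullNames.length];
        for (int i = 0; i != fullNames.length; i++) {
            String last = lastSegment(fullNames[i]);
            boolean collides = false;
            for (int j = 0; j != fullNames.length; j++) {
                if (j != i) {
                    collides |= lastSegment(fullNames[j]).equals(last);
                }
            }
            if (!collides) {
                keys[i] = last;              // unique last segment: use it alone
                continue;
            }
            String namespace = nextToLastSegment(fullNames[i]);
            String key = namespace + ":" + last;
            // Assumption: add 1, 2, ... to the namespace until the key is unused.
            for (int suffix = 1; alreadyUsed(keys, i, key); suffix++) {
                key = namespace + suffix + ":" + last;
            }
            keys[i] = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] keys = shortNames(new String[] {
            "some.package.cname.Foo",
            "some.other.package.cname.Foo",
            "x.y.z.TypeName" });
        System.out.println(Arrays.toString(keys));
    }
}
```
</pre></div>
<p>For the two colliding names used in the text, this sketch yields cname:Foo and cname1:Foo, while x.y.z.TypeName keeps the plain key TypeName.</p>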
<p>The value of the _id is the fully qualified name of the type.</p>
<p>The <span class="bold"><strong>_feature_types</strong></span> values of _ref, _array, and _byte_array indicate that the corresponding values
of the named features need special handling
when deserialized.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>_ref</strong></span> - used when features are deserialized as numbers, but they are to be
interpreted as references to other FSs whose <code class="code">id</code> is the number. UIMA lists and arrays of
FSs are marked with _ref; if the value is a JSON array, the elements of the array will be either
numbers (to be interpreted as references), or embedded serializations of FSs.</p></li><li class="listitem"><p><span class="bold"><strong>_array</strong></span> - used when features are serialized as JSON
arrays containing embedded values,
unless the corresponding UIMA object has
multiple references, in which case it is serialized as a FS reference which looks like a single number.
If a feature is marked with _array, then a non-array, single number should be interpreted as the
<code class="code">id</code> of the feature structure that is the array or the first element of the list of items.
This designation is used for both UIMA arrays and lists.</p>
<p>This designation is for arrays and lists of primitive values, except for byte arrays.
In the case of FS arrays and lists, the _ref designation is used instead of this to indicate that the
resulting values in a JSON array that look like numbers should be interpreted as references.</p></li><li class="listitem"><p><span class="bold"><strong>_byte_array</strong></span> - _byte_array features are serialized as numbers (if they are a
reference to a separate object) or as strings (if embedded). The strings are to be decoded into
binary byte arrays using Base64 decoding (Base64 is the standard encoding used by Jackson to serialize binary data).</p></li></ul></div><p>
</p>
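<p>A receiver that finds a string value for a feature marked _byte_array could decode it as sketched below. This is a minimal illustration using only the JDK; the class and method names are hypothetical, and it assumes the string uses the standard Base64 alphabet with padding and no line breaks.</p>
<div class="informalexample">
<pre class="programlisting">
```java
import java.util.Base64;

// Hypothetical helper; not part of UIMA or Jackson.
class ByteArrayFeatureDecoding {

    // Decodes the JSON string form of an embedded _byte_array feature value.
    static byte[] decodeEmbeddedByteArray(String jsonStringValue) {
        return Base64.getDecoder().decode(jsonStringValue);
    }

    public static void main(String[] args) {
        byte[] original = { 1, 2, 3, (byte) 0xFF };
        // Stand-in for what a serializer would write for these four bytes.
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] decoded = decodeEmbeddedByteArray(encoded);
        System.out.println(encoded + " decodes to " + decoded.length + " bytes");
    }
}
```
</pre></div>
<p>A numeric value for the same feature is not decoded this way; it is the <code class="code">id</code> of a separately serialized byte array.</p>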
<p>
Note that single element arrays are <span class="emphasis"><em>not</em></span> unwrapped, as in some other JSON serializations, to enable distinguishing
references to arrays from embedded arrays.
</p>
<p><span class="bold"><strong>_subtypes</strong></span> are a list of the type's used subtypes. A type is <span class="emphasis"><em>used</em></span>
if it is the type of a Feature Structure
being serialized,
or if it is in the supertype chain of some Feature Structure which is serialized. If a type has no
used subtypes, this element is omitted.
The names are represented as the "short" name. Users typically use this information
to construct support for iterators over a type which includes all of its subtypes.</p>
<div class="section" title="9.2.2.1.&nbsp;Omitting parts of the _context section"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.context.omit">9.2.2.1.&nbsp;Omitting parts of the _context section</h4></div></div></div>
<p>It is possible to selectively omit some of the
_context sections (or the entire _context), via configuration.
Here's an example:</p>
<div class="informalexample">
<pre class="programlisting">// make a new instance to hold the serialization configuration
JsonCasSerializer jcs = new JsonCasSerializer();
// Omit the expanded type names information
jcs.setJsonContext(JsonContextFormat.omitExpandedTypeNames);</pre></div>
<p>See the Javadocs for <code class="code">JsonContextFormat</code> for how to specify the parts.</p>
</div>
</div>
<div class="section" title="9.2.3.&nbsp;Serializing Feature Structures"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.featurestructures">9.2.3.&nbsp;Serializing Feature Structures</h3></div></div></div>
<p>Feature Structures themselves are represented as JSON objects consisting of field - value pairs, where the
fields correspond to UIMA Features, and the values are the values of the features.
</p>
<p>The various kinds of values for a UIMA feature are represented by their natural JSON counterpart.
UIMA primitive boolean values are represented by JSON true and false literals. UIMA Strings are
represented as JSON strings. Numbers are represented by JSON numbers.
Byte Arrays are represented by the Jackson standard binary encoding (base64 encoding), written as JSON strings.
References to other Feature Structures are also represented as JSON integer numbers, the values of which are
interpreted as ids of the referred-to
FSs. These ids are treated in the same manner as the xmi:ids of XMI Serialization. Arrays and Lists when
embedded (see following section) are represented as JSON arrays using the [] notation.</p>
<p>Besides the feature values defined for a Feature Structure, an additional special feature
may be serialized: _type.
The _type is the type name, written using the short format. This is automatically included when the type cannot
easily be
inferred from other contextual information.
</p>
<p>Here's an example, with some comments which, since JSON doesn't support comments, are just here for explanation:</p>
<div class="informalexample">
<pre class="programlisting">{ "_type" : "Annotation", // _type may be omitted
  "feat1" : true,         // boolean value represented as true or false
  "feat2" : 123,          // could be a number or a reference to FS with id 123
  "feat3" : "b3axgh"      // could be a string or a base64 encoded byte array
}</pre></div>
<div class="section" title="9.2.3.1.&nbsp;Embedding normally referenced values"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.embedding">9.2.3.1.&nbsp;Embedding normally referenced values</h4></div></div></div>
<p>Consider a FS which has a feature that refers to another FS. This can be serialized in one of two ways:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>the value of the feature can be coded as an <code class="code">id</code> (a number), where the number is the <code class="code">id</code> of the
referred-to FS.</p></li><li class="listitem"><p>The value of the feature can be coded as the serialization of the referred-to FS.</p></li></ul></div>
<p>
This second way of encoding is often done by JSON style serializations, and is called "embedding". Referred-to
FSs may be embedded if there are no other references to the embedded FS. Multiple references may arise due to
having a FS referenced as a "root" in some CAS View, or being used as a value in a FS feature.</p>
<p>Following the XMI conventions, UIMA arrays and lists which are
identified as singly referenced by either the static or dynamic method (see below) are embedded
directly as the value of a feature. In this case, the JSON serialization writes out the value of the feature
as a JSON array. Otherwise, the value is written out as a FS reference, and a separate serialization occurs of
the list elements or the array.</p>
<p>In addition to arrays and lists, FSs which are identified as singly referenced from another FS are
serialized as the embedded value of the referring feature.
This is also done (when using the dynamic method) for singly referenced rooted instances.
</p>
<p>
If a FS is multiply referenced, the serialization in these
cases is just the numeric value of the <code class="code">id</code> of the FS.</p>
</div>
<div class="section" title="9.2.3.2.&nbsp;Dynamic vs Static multiple-references and embedding"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.dynamicstatic">9.2.3.2.&nbsp;Dynamic vs Static multiple-references and embedding</h4></div></div></div>
<p>There are two methods of determining if a particular FS or list or array can be embedded.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>dynamic</strong></span> - calculates at serialization time whether or not there
are multiple references to a given FS.</p></li><li class="listitem"><p><span class="bold"><strong>static</strong></span> - looks in the type system definition to see if
the feature is marked with &lt;multipleReferencesAllowed&gt;.
</p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem"><p><code class="code">multipleReferencesAllowed</code> false <span class="symbol">&#8594;</span> use the embedded style</p></li><li class="listitem"><p><code class="code">multipleReferencesAllowed</code> true <span class="symbol">&#8594;</span> use separate objects</p></li></ul></div><p>
Note that since this flag is not available for
references to FSs from View indexes, any FS that is indexed in any view is considered (if using static mode)
to be multipleReferencesAllowed.
</p></li></ul></div><p>
</p>
<p>Delta serialization only supports the static method; this mode is forced on if delta serialization
is specified.</p>
<p>Dynamic embedding is enabled by default for JSON, but may be disabled via configuration.</p>
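<p>The dynamic method amounts to counting references before writing anything out. The following sketch models that idea with plain arrays; it is not the serializer's implementation, the names are hypothetical, and using small integers as FS ids is a simplification made for the example.</p>
<div class="informalexample">
<pre class="programlisting">
```java
// Hypothetical model of the dynamic method; ids are small array indexes here.
class DynamicEmbedding {

    // featureRefs[i] holds the ids referenced from the features of FS i;
    // indexedIds holds one entry per (view, FS) pair found in the indexes.
    static int[] countReferences(int fsCount, int[][] featureRefs, int[] indexedIds) {
        int[] counts = new int[fsCount];
        for (int[] refs : featureRefs) {
            for (int target : refs) {
                counts[target]++;
            }
        }
        for (int id : indexedIds) {
            counts[id]++;
        }
        return counts;
    }

    // An FS can be embedded only where its single reference occurs.
    static boolean canEmbed(int[] counts, int id) {
        return counts[id] == 1;
    }

    public static void main(String[] args) {
        // FS 0 and FS 3 are indexed; FS 0 refers to 1 and 2, FS 3 refers to 2.
        int[][] featureRefs = { { 1, 2 }, {}, {}, { 2 } };
        int[] indexedIds = { 0, 3 };
        int[] counts = countReferences(4, featureRefs, indexedIds);
        for (int id = 0; id != counts.length; id++) {
            System.out.println("FS " + id + ": "
                + (canEmbed(counts, id) ? "embed" : "separate object"));
        }
    }
}
```
</pre></div>
<p>In this model only FS 2, which is referenced from two different FSs, has to be written as a separate object; the others can be embedded at their single point of reference.</p>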
</div>
<div class="section" title="9.2.3.3.&nbsp;Embedded Arrays and Lists"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.embeddedArraysLists">9.2.3.3.&nbsp;Embedded Arrays and Lists</h4></div></div></div>
<p>When static embedding is being used, a case can arise where some feature is marked to have only
singly referenced FS values, but that value may actually be multiply referenced. This is detected during
serialization, and a message is issued if an error handler has been specified to the serializer.
The serialization continues, however. In the case of an Array, the value of the array is embedded
in the serialization and the fact that these were referring to the same object is lost.
In the case of a list, if any element in the list
has multiple references (for example, if the list has back-references, loops, etc.),
the serialization of the list is truncated at the point where the multiple reference
occurs.</p>
<div class="blockquote"><blockquote class="blockquote"><p>Note that you can correctly serialize arbitrarily linked complex list structures created
using the built-in list types only if you use dynamic embedding, or
specify <code class="code">multipleReferencesAllowed</code> = true.</p></blockquote></div>
<p>Embedded list or array values are both serialized using the JSON array notation; as a result, these
alternative representations are not distinguished in the JSON serialization.</p>
</div>
<div class="section" title="9.2.3.4.&nbsp;Omitting null values"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.null">9.2.3.4.&nbsp;Omitting null values</h4></div></div></div>
<p>Following the conventions established in XMI serialization, features with <code class="code">null</code> values have their
key-value pairs omitted from the FS serialization when the type of the feature value is:
</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem">
<p>a Feature Structure Reference</p>
</li><li class="listitem">
<p>a String (whose value is <code class="code">null</code>, not "", a 0-length String)</p>
</li><li class="listitem">
<p>an embedded Array or List (where the entire array and/or list is <code class="code">null</code>)</p>
</li></ul></div>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Inside arrays or lists of FSs, references which are being serialized
as references have a <code class="code">null</code> reference coded as the number 0; references which are embedded are serialized as
<code class="code">null</code>.</p></div>
<p>Configuring the serializer with <code class="code">setOmit0Values(true)</code> causes
additional primitive features (byte/short/int/long/float/double) to be omitted, when their values are 0 or 0.0</p>
</div>
</div>
</div>
<div class="section" title="9.3.&nbsp;Organizing the Feature Structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas.featurestructures.organization">9.3.&nbsp;Organizing the Feature Structures</h2></div></div></div>
<p>The set of all FSs being serialized is divided into two parts. The first part represents
all FSs that are root FSs, in that they were in one or more indexes at the time of serialization. The second part
represents feature structures that are multiply referenced, or are referenced via a chain of references from the
root FSs. The same feature structure can appear in both parts. The elements in the second part are actual
serialized FSs, whereas the elements in the first part are either references to the corresponding FSs in the
second part, if they exist there, or the actual embedded serialized FSs. An embedded serialized FS
appears only once across the two parts.</p>
<div class="informalexample">
<pre class="programlisting">"_views" : {
  "_InitialView" : {
    "theFirstType" : [ { ... fs1 ...}, 123, 456, { ... fsn ...} ],
    "anotherType" : [ { ... fs1 ...}, ... { ... fsn ...} ],
    ... // more types which have roots in view "12"
  },
  "AnotherView" : {
    "theFirstType" : [ { ... fsv1 ...}, 123, { ... fsvn ...} ],
    "anotherType" : [ { ... fsv1 ...}, ... { ... fsvn ...} ],
    ... // more types which have roots in view "25"
  },
  ... // more views
},
"_referenced_fss" : {
  "12" : {"_type" : "Sofa", "sofaNum" : 1, "sofaID" : "_InitialView" },
  "25" : {"_type" : "Sofa", "sofaNum" : 2, "sofaID" : "AnotherView" },
  "123" : { ... fs-123 ... },
  "456" : { ... fs-456 ... },
  ...
}</pre></div>
<p>The first part map is made up of multiple maps, one for each separate CAS View.
The outer map is keyed by the <code class="code">id</code> of the corresponding SofaFS (or 0, if there is no corresponding SofaFS).
For each view, the value is a map whose key is a used Type, and the values are an array of instances
of FSs of that type which were found in some index; these are the "root" FSs. Only root instances
of a particular type are included in this array.
</p>
<p>The second part map has keys which are the <code class="code">id</code> value of the FSs, and values which are
a map of key-value pairs corresponding to the feature-values of that FS.
In this case, the _type extra feature is added to record the type.</p>
<p>The _views map, keyed by view and type name, has all the FSs (as a JSON array) for that type that were in
one or more indexes in any View. If a FS in this array is not multiply referenced (using dynamic mode),
then it is embedded here. Otherwise, only the reference (a simple number representing the <code class="code">id</code> of that FS) is serialized for that FS.</p>
</div>
<div class="section" title="9.4.&nbsp;Additional JSON CAS Serialization features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas.features">9.4.&nbsp;Additional JSON CAS Serialization features</h2></div></div></div>
<p>JSON serialization also supports several additional features, including:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>Type and feature filtering: only types and features that exist in a specified type system description
are serialized.</p>
</li><li class="listitem">
<p>An ErrorHandler; this will be called in various error situations, including when
serializing in static mode an array or list value for a feature marked <code class="code">multipleReferencesAllowed = false</code>
is found to have multiple references.</p>
</li><li class="listitem">
<p>A switch to control omitting of numeric features that have 0 values (default is to include these).
See the <code class="code">setOmit0Values(true_or_false)</code> method in JsonCasSerializer.</p>
</li><li class="listitem">
<p>a pretty printing flag (default is not to do pretty-printing)</p>
</li></ul></div>
<p>See the Javadocs for JsonCasSerializer for details.</p>
<div class="section" title="9.4.1.&nbsp;Delta CAS"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.json.delta">9.4.1.&nbsp;Delta CAS</h3></div></div></div>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Delta CAS support is incomplete, and is not supported as of release 2.7.0, but may
be supported in later releases. The information here is just for planning purposes.</p></div>
<p><span class="bold"><strong>_delta_cas</strong></span> is present only when a delta CAS serialization is being performed.
This serializes just the
changes in the CAS since a Mark was set; so for cases where a large CAS is deserialized into a service, which
then does a relatively small amount of additions and modifications, only those changes are serialized.
The values of the keys are arrays of the ids of FSs that were added to the indexes,
removed from the indexes, or reindexed.</p>
<p>This mode requires the static embeddability mode. When specified, a <code class="code">_delta_cas</code> key-value
is added to the serialization at the end,
which lists the FSs (by <code class="code">id</code>) that were added, removed, or reindexed, since the mark was set.
Additional extra information, created when the CAS was previously deserialized and the mark set,
must be passed to the serializer, in the form of an instance of <code class="code">XmiSerializationSharedData</code>,
or JsonSerializationSharedData (not yet defined as of release 2.7.0).</p>
<p>Here's what the last part of the serialization looks like, when Delta CAS is specified:
</p><div class="informalexample">
<pre class="programlisting">"_delta_cas" : {
"added_members" : [ 123, ... ],
"deleted_members" : [ 456, ... ],
"reindexed_members" : [] }</pre></div><p>
</p>
</div>
</div>
<div class="section" title="9.5.&nbsp;Using JSON CAS serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.usage">9.5.&nbsp;Using JSON CAS serialization</h2></div></div></div>
<p>The support is built on top of the Jackson JSON serialization
package. We follow Jackson conventions for configuring.</p>
<p>The serialization APIs are in the JsonCasSerializer class.</p>
<p>Although there are some static short-cut methods for common use cases, the basic operations needed
to serialize a CAS as JSON are:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>Make an instance of the <code class="code">JsonCasSerializer</code> class. This will serve to collect configuration information.</p>
</li><li class="listitem">
<p>Do any additional configuration needed. See the Javadocs for details.
The following objects can be configured:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem">
<p>The <code class="code">JsonCasSerializer</code> object: here you can specify the kind of JSON formatting, what to serialize,
whether or not delta serialization is wanted, prettyprinting, and more.</p>
</li><li class="listitem">
<p>The underlying <code class="code">JsonFactory</code> object from Jackson. Normally, you won't need to configure this.
If you do, you can create your own instance of this object and configure it and use it in the
serialization.</p>
</li><li class="listitem">
<p>The underlying <code class="code">JsonGenerator</code> from Jackson. Normally, you won't need to configure this.
If you do, you can get the instance the serializer will be using and configure that.</p>
</li></ul></div>
</li><li class="listitem">
<p>Once all the configuration is done, call serialize(...) on the configured instance;
this creates a one-time-use
inner class instance where the actual serialization is done. The serialize(...) method is thread-safe, in that the same
JsonCasSerializer instance (after it has been configured) can kick off multiple
(identically configured) serializations
on different threads at the same time.</p>
<p>The serialize call follows the Jackson conventions, taking one of 3 specifications of where to serialize to:
a Writer, an OutputStream, or a File.</p>
</li></ul></div>
<p>Here's an example:</p>
<div class="informalexample">
<pre class="programlisting">JsonCasSerializer jcs = new JsonCasSerializer();
jcs.setPrettyPrint(true); // do some configuration
StringWriter sw = new StringWriter();
jcs.serialize(cas, sw); // serialize into sw</pre></div>
<p>The JsonCasSerializer class also has some static convenience methods for JSON serialization, for the
most common configuration cases; please see the Javadocs for details. These are named jsonSerialize, to
distinguish them from the non-static serialize methods.</p>
<p>Many of the common configuration methods return the instance, so they can be chained together.
For example, if <code class="code">jcs</code> is an instance of the JsonCasSerializer, you can write
<code class="code">jcs.setPrettyPrint(true).setOmit0Values(true);</code> to configure both of these.</p>
</div>
<div class="section" title="9.6.&nbsp;JSON serialization for UIMA descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.descriptionserialization">9.6.&nbsp;JSON serialization for UIMA descriptors</h2></div></div></div>
<p>UIMA descriptors are things like analysis engine descriptors, type system descriptors, etc.
UIMA has an internal form for these, typically named UIMA <span class="emphasis"><em>description</em></span>s;
these can be serialized out as XML using a <code class="code">toXML</code> method.
JSON support adds the ability to serialize these as JSON objects, as well. It may be of use, for example,
to have the full type system description for a UIMA pipeline available in JSON notation.
</p>
<p>The class JsonMetaDataSerializer defines a set of static methods that serialize UIMA description objects
using a toJSON method that takes as arguments the description object to be serialized and one of the standard
serialization targets that Jackson supports (File, Writer, or OutputStream). There is also
an optional prettyprint flag (default is no prettyprinting).</p>
<p>The resulting JSON serialization is just a straight-forward serialization of the description object,
having the same fields as the XML serialization of it.</p>
<p>Here's what a small TypeSystem description looks like, serialized:</p>
<div class="informalexample">
<pre class="programlisting">{"typeSystemDescription" :
{"name" : "casTestCaseTypesystem",
"description" : "Type system description for CAS test cases.",
"version" : "1.0",
"vendor" : "Apache Software Foundation",
"types" : [
{"typeDescription" :
{"name" : "Token",
"description" : "",
"supertypeName" : "uima.tcas.Annotation",
"features" : [
{"featureDescription" :
{"name" : "type",
"description" : "",
"rangeTypeName" :
"TokenType" } },
{"featureDescription" :
{"name" : "tokenFloatFeat",
"description" : "",
"rangeTypeName" : "uima.cas.Float" } } ] } },
{"typeDescription" :
{"name" : "TokenType",
"description" : "",
"supertypeName" : "uima.cas.TOP" } } ] } }</pre></div>
<p>Here's a sample of code to serialize a UIMA description object held in the variable <code class="code">tsd</code>, with
and without pretty printing:</p>
<div class="informalexample">
<pre class="programlisting">StringWriter sw = new StringWriter();
JsonMetaDataSerializer.toJSON(tsd, sw); // no prettyprinting
sw = new StringWriter();
JsonMetaDataSerializer.toJSON(tsd, sw, true); // prettyprinting</pre></div>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;10.&nbsp;UIMA Setup and Configuration" id="ugr.ref.config"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;10.&nbsp;UIMA Setup and Configuration</h2></div></div></div>
<div class="section" title="10.1.&nbsp;UIMA JVM Configuration Properties"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.properties">10.1.&nbsp;UIMA JVM Configuration Properties</h2></div></div></div>
<p> Some updates change UIMA's behavior between released versions. For example, sometimes an error check
is enhanced, and this can cause something that was previously incorrect but not checked to now signal an error.
Often, users will want these kinds of things to be ignored, at least for a while, to give them time to
analyze and correct the issues.
</p>
<p>
To enable users to gradually address these issues, there are some global JVM properties
for UIMA that can restore earlier behaviors, in some cases.
These are detailed in the table below. Additionally, there are other JVM properties that can
be used in checking and optimizing some performance trade-offs, such as the automatic index protection.
For the most part, you don't need to assign any values to these properties,
just define them. For example, to disable the enhanced check that ensures you
don't add a subtype of AnnotationBase to the wrong View, you could
add the JVM argument <code class="code">-Duima.disable_enhanced_check_wrong_add_to_index</code>.
This would remove the enhanced
checking for this, added in version 2.7.0 (the previously existing partial checking is
still there, though).
</p>
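<p>Because these properties only need to be defined, a presence check is all that is required to read them. The sketch below shows one way such a flag can be detected in Java; it is not taken from the UIMA sources, and the class and method names are hypothetical. A property given on the command line as <code class="code">-Dname</code> with no value is defined with an empty string as its value.</p>
<div class="informalexample">
<pre class="programlisting">
```java
// Hypothetical helper showing presence-only ("flag") system properties.
class FlagProperty {

    // True when the property is defined at all, even with an empty value.
    static boolean isDefined(String name) {
        return System.getProperty(name) != null;
    }

    public static void main(String[] args) {
        String name = "uima.disable_enhanced_check_wrong_add_to_index";
        System.out.println(name + " defined: " + isDefined(name));
    }
}
```
</pre></div>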
</div>
<div class="section" title="10.2.&nbsp;Configuring index protection"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.protect-index">10.2.&nbsp;Configuring index protection</h2></div></div></div>
<p>Version 2.7.0 added optional checking for invalid feature updates
which could corrupt indexes. Because this checking can slightly slow down performance, there are
global JVM properties to control it. The suggested way to operate with these is as follows.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>At the beginning, run with automatic protection enabled (the default), but
turn on explicit reporting (<code class="code">-Duima.report_fs_update_corrupts_index</code>)</p></li><li class="listitem"><p>For all reported instances, examine your code to see if you can restructure to
do the updates before adding the FS to the indexes. Where you cannot, surround the code doing
these updates with a try / finally or block form of <code class="code">protectIndexes()</code>,
which is described in
<a class="xref" href="#ugr.ref.cas.updating_indexed_feature_structures" title="4.5.1.&nbsp;Updating indexed feature structures">Section&nbsp;4.5.1, &#8220;Updating indexed feature structures&#8221;</a> (and also is similarly available with JCas).
</p></li><li class="listitem"><p>Once there are no further reports, leave in place the protections
you may have installed in the above step and, for maximum performance, disable the reporting and runtime checking
by adding the JVM argument
<code class="code">-Duima.disable_auto_protect_indexes</code> and removing (if present)
<code class="code">-Duima.report_fs_update_corrupts_index</code>.</p></li></ul></div><p>
One additional JVM property, <code class="code">-Duima.exception_when_fs_update_corrupts_index</code>,
is intended to be used in automated build / testing configurations. It causes the framework to throw
a UIMARuntimeException if an update outside of a <code class="code">protectIndexes</code> block occurs
that could corrupt the indexes,
rather than automatically recovering from it.
</p>
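<p>Why an update to a key feature can corrupt a sorted index, and why the remove / update / add-back sequence avoids it,
can be seen in a toy model. The following sketch is not UIMA code and does not use the UIMA API:
<code class="code">Item</code>, <code class="code">ToyIndex</code> and the two update methods are hypothetical names, the
"index" is just an array kept ordered by one integer key, and keys are assumed to be distinct. It only illustrates the
general hazard that the index protection described above guards against.</p>
<pre class="programlisting">
class SortedIndexDemo {

  // Toy stand-in for a feature structure; "begin" plays the role of an index key.
  static final class Item {
    int begin;
    Item(int begin) { this.begin = begin; }
  }

  // Toy sorted index: an array kept ordered by "begin" (keys assumed distinct).
  static final class ToyIndex {
    private Item[] items = new Item[0];

    void add(Item item) {
      int pos = 0;
      while (pos != items.length) {
        if (items[pos].begin > item.begin) { break; }
        pos++;
      }
      Item[] grown = new Item[items.length + 1];
      System.arraycopy(items, 0, grown, 0, pos);
      grown[pos] = item;
      System.arraycopy(items, pos, grown, pos + 1, items.length - pos);
      items = grown;
    }

    void remove(Item item) {
      int pos = find(item);
      if (pos == -1) { return; }
      Item[] shrunk = new Item[items.length - 1];
      System.arraycopy(items, 0, shrunk, 0, pos);
      System.arraycopy(items, pos + 1, shrunk, pos, items.length - pos - 1);
      items = shrunk;
    }

    boolean contains(Item item) { return find(item) != -1; }

    // Binary search; only correct while the array is still ordered by key.
    private int find(Item item) {
      int lo = 0;
      int hi = items.length - 1;
      while (hi >= lo) {
        int mid = (lo + hi) / 2;
        if (items[mid] == item) { return mid; }
        if (item.begin > items[mid].begin) { lo = mid + 1; } else { hi = mid - 1; }
      }
      return -1;
    }
  }

  // Changes the key while the item is still indexed; the lookup then misses it.
  static boolean unsafeUpdate() {
    ToyIndex index = new ToyIndex();
    Item a = new Item(10);
    index.add(a);
    index.add(new Item(20));
    index.add(new Item(30));
    a.begin = 40;                 // array is no longer ordered by key
    return index.contains(a);
  }

  // Remove, update, add back: the array stays ordered and the lookup succeeds.
  static boolean safeUpdate() {
    ToyIndex index = new ToyIndex();
    Item a = new Item(10);
    index.add(a);
    index.add(new Item(20));
    index.add(new Item(30));
    index.remove(a);
    a.begin = 40;
    index.add(a);
    return index.contains(a);
  }

  public static void main(String[] args) {
    System.out.println("found after unsafe update: " + unsafeUpdate());
    System.out.println("found after safe update: " + safeUpdate());
  }
}
</pre>
<p>In this toy model the unsafe update leaves the item in the array but unreachable by the ordered lookup, while the
remove / update / add-back sequence keeps it findable.</p>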
</div>
<div class="section" title="10.3.&nbsp;Properties Table"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.property-table">10.3.&nbsp;Properties Table</h2></div></div></div>
<p>This table describes the various JVM defined properties; specify these on the Java command line
using -Dxxxxxx, where the xxxxxx is one of
the properties starting with <code class="code">uima.</code> from the table below.</p>
<div class="informaltable">
<table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="Title"><col class="Description"><col class="Version"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="bold"><strong>Title</strong></span></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="bold"><strong>Property Name &amp; Description</strong></span></td><td style="border-bottom: 0.5pt solid black; "><span class="bold"><strong>Since Version</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Use built-in Java Logger as default back-end</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.use_jul_as_default_uima_logger</code></p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-5381" target="_top">UIMA-5381</a>.
The standard UIMA logger uses an slf4j implementation, which, in turn hooks up to
a back end implementation based on what can be found in the class path (see slf4j documentation).
If no backend implementation is found, the slf4j default is to use a NOP logger back end
which discards all logging.</p>
<p>When this flag is specified, the behavior of the UIMA logger
is altered to use the built-in-to-Java logging implementation
as the back end for the UIMA logger.
</p></td><td style="border-bottom: 0.5pt solid black; "><p>3.0.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>XML: enable doctype declarations</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.xml.enable.doctype_decl</code> (default is false)</p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-6064" target="_top">UIMA-6064</a>.
Normally, this is turned off to avoid exposure to malicious XML; see
<a class="ulink" href="https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Processing" target="_top">
XML External Entity processing vulnerability</a>.
</p>
</td><td style="border-bottom: 0.5pt solid black; "><p>2.10.4, 3.1.0</p></td></tr><tr><td style="border-bottom: 0.5pt solid black; " colspan="3" align="center"><span class="bold"><strong>Index protection properties</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Report Illegal Index-key Feature Updates</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.report_fs_update_corrupts_index</code> (default is not to report)</p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4135" target="_top">UIMA-4135</a>.
Updating Features which are used in Set and Sorted
indexes as "keys" may corrupt the indexes, if the Feature Structure (FS)
has been added to the indexes. To update these, you must first
completely remove the FS from the indexes in all views, then do the updates, and then
add it back. UIMA now checks for this (unless specifically disabled, see below),
and if this property is defined, logs a WARN message for each occurrence that is not
inside an explicit <code class="code">protectIndexes</code> block (see the CAS JavaDocs for the CAS / JCas <code class="code">protectIndexes</code> methods).</p>
<p>To scan the logs for these reports, search for instances of lines having the string
<code class="code">While FS was in the index, the feature</code></p>
<p>Specifying this property overrides <code class="code">uima.disable_auto_protect_indexes</code>.</p>
<p>Users would run with this property defined, and then for high performance,
would use the report to manually change their code to avoid the problem or
to wrap the updates with a <code class="code">protectIndexes</code> kind of protection (see the
reference manual, in the CAS or JCas chapters, for examples of user code doing this),
and then run with the protection turned off (see below).
</p></td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Throw exception on illegal Index-key Feature Updates</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.exception_when_fs_update_corrupts_index</code> (default is false)</p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4150" target="_top">UIMA-4150</a>.
Throws a UIMARuntimeException if an indexed FS feature used as a key in one or more
indexes is updated outside of an explicit <code class="code">protectIndexes</code> block.
This is intended for use in automated build and test environments,
to provide a strong signal if this kind of mistake gets into the build.
If it is not set, then the other properties specify if corruption should be checked for,
recovered automatically, and / or reported.</p>
<p>Specifying this property also forces <code class="code">uima.report_fs_update_corrupts_index</code>
to true even if it was set to false.</p>
</td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Disable the index corruption checking</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.disable_auto_protect_indexes</code></p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4135" target="_top">UIMA-4135</a>.
After you have fixed all reported issues identified with the above report,
you may set this property to omit this check, which may slightly improve
performance.</p>
<p>Note that this property is ignored if either <code class="code">-Duima.exception_when_fs_update_corrupts_index</code>
or <code class="code">-Duima.report_fs_update_corrupts_index</code> is specified.</p>
</td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-bottom: 0.5pt solid black; " colspan="3" align="center"><span class="bold"><strong>Measurement / Tracing properties</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Trace Feature Structure Creation/Updating</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.trace_fs_creation_and_updating</code></p>
<p>This causes a trace file to be produced in the current working directory.
The file has one line for each Feature Structure that is created, and includes
information on the cas/cas-view, and the features that are set for the Feature Structure.
There is, additionally, one line for each Feature Structure update.
Updates that immediately follow trace information for the same Feature Structure are combined with it.
</p>
<p>This can generate a lot of output, and definitely slows down execution.</p>
</td><td style="border-bottom: 0.5pt solid black; "><p>2.10.1</p></td></tr><tr><td style="border-right: 0.5pt solid black; "><p>Measure index flattening optimization</p></td><td style="border-right: 0.5pt solid black; "><p><code class="code">uima.measure.flatten_index</code></p>
<p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4357" target="_top">UIMA-4357</a>.
This writes a short report to System.out when the JVM shuts down.
The report has some statistics about the automatic management of
flattened index creation and use.</p>
</td><td style=""><p>2.8.0</p></td></tr></tbody></table>
</div>
<p>Some additional global flags intended for helping v3 migration are documented in the V3 user's guide.</p>
</div>
</div>
<div class="chapter" title="Chapter&nbsp;11.&nbsp;UIMA Resources" id="ugr.ref.resources"><div class="titlepage"><div><div><h2 class="title">Chapter&nbsp;11.&nbsp;UIMA Resources</h2></div></div></div>
<div class="section" title="11.1.&nbsp;What is a UIMA Resource?"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.overview">11.1.&nbsp;What is a UIMA Resource?</h2></div></div></div>
<p>UIMA uses the term <code class="code">Resource</code> to describe all UIMA components
that can be acquired by an application or by other resources.</p>
<div class="figure"><a name="ref.resource.fig.kinds"></a><div class="figure-contents">
<div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="297"><tr><td><img src="images/references/ref.resources/res_resource_kinds.png" width="297" alt="Resource Kinds, a partial list"></td></tr></table></div>
</div><p class="title"><b>Figure&nbsp;11.1.&nbsp;Resource Kinds</b></p></div><br class="figure-break">
<p>There are many kinds of resources; here's a list of the main kinds:
</p><div class="variablelist"><dl><dt><span class="term"><span class="strong"><strong>Annotator</strong></span></span></dt><dd><p>a user written component, receives a CAS, does some processing, and returns the possibly
updated CAS. Variants include CollectionReaders, CAS Consumers, CAS Multipliers.</p></dd><dt><span class="term"><span class="strong"><strong>Flow Controller</strong></span></span></dt><dd><p>a user written component controlling the flow of CASes within an aggregate.</p></dd><dt><span class="term"><span class="strong"><strong>External Resource</strong></span></span></dt><dd><p>a user written component. Variants include:
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>Data - includes special lifecycle call to load data</p></li><li class="listitem"><p>Parameterized - allows multiple instantiations with simple string parameter variants;
example: a dictionary, that has variants in content for different languages</p></li><li class="listitem"><p>Configurable - supports configuration from the XML specifier</p></li></ul></div><p>
</p></dd></dl></div><p>
</p>
<div class="section" title="11.1.1.&nbsp;Resource Inner Implementations"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.resources.resource-inner-implementations">11.1.1.&nbsp;Resource Inner Implementations</h3></div></div></div>
<p>Many of the resource kinds include in their specification a (possibly optional) element, which is
the name of a Java class which implements the resource. We will call this class the "inner implementation".</p>
<p>The UIMA framework creates instances of Resource from resource specifiers, by calling
the framework's <code class="code">produceResource(specifier, additional_parameters)</code> method.
This call produces an instance of Resource. </p>
<div class="blockquote"><blockquote class="blockquote">
<p>
For example, calling produceResource on an AnalysisEngineDescription produces an instance of
AnalysisEngine. This, in turn, will have a reference to the user-written inner implementation class
specified by the <code class="code">annotatorImplementationName</code>.
</p>
<p>External resource descriptors may include an <code class="code">implementationName</code> element.
Calling produceResource on an ExternalResourceDescription produces an instance of Resource;
the resource obtained by subsequent calls to <code class="code">getResource(...)</code>
is dependent on the particular descriptor, and may be an instance of
the inner implementation class.
</p>
</blockquote></div>
<p>For external resources, each resource specifier kind handles the case where
the inner implementation is omitted. If it is supplied, the named class must implement
the interface specified in the bindings for this resource. In addition, the particular specifier kind may
further restrict the kinds of classes the user supplies as the implementationName.
</p>
<p>Some examples of this further restriction:
</p><div class="variablelist"><dl><dt><span class="term"><span class="strong"><strong>customResource</strong></span></span></dt><dd><p>the class must also implement the Resource interface</p></dd><dt><span class="term"><span class="strong"><strong>dataResource</strong></span></span></dt><dd><p>the class must also implement the SharedResourceObject interface</p></dd></dl></div><p>
</p>
</div>
</div>
<div class="section" title="11.2.&nbsp;Sharing Resources, even across pipelines"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.sharing-across-pipelines">11.2.&nbsp;Sharing Resources, even across pipelines</h2></div></div></div>
<p>UIMA applications run one or more UIMA Pipelines. Each pipeline has a top-level Analysis Engine, which
may be an aggregation of many other Analysis Engine components. The UIMA framework instantiates Annotator
resources as specified to configure the pipelines.</p>
<p>Sometimes, many identical pipelines are created (for example,
in order to exploit multi-core hardware by processing multiple CASes in parallel). In this case, the framework
would produce multiple instances of those Annotator resources; these are implemented as multiple instances
of the same Java class.</p>
<p>Sets of External Resources plus a CAS Pool and UIMA Extension ClassLoader are set up and kept,
per instance of a ResourceManager;
this instance serves to allow sharing of these items across one or more pipelines.
</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem">
<p>The UIMA Extension ClassLoader (if specified) is used to find the resources to be loaded
by the framework</p>
</li><li class="listitem">
<p>The <code class="code">External Resources</code> are specified by a pipeline's resource configuration.</p>
</li><li class="listitem">
<p>The CAS Pool is a pool of CASs all with identical type systems and index definitions, associated
with a pipeline.</p>
</li></ul></div><p> </p>
<p>When setting up a pipeline, the UIMA Framework's <code class="code">produceResource</code>
or one of its specialized variants is called, and a new
ResourceManager is created and used for that pipeline. However, in many cases, it may be advantageous to
share the same Resources across multiple pipelines; this is easily doable by passing a common instance of the
ResourceManager to the pipeline creation methods (using the additional parameters of the produceResource method).</p>
<p>
To handle additional use cases, the ResourceManager has a <code class="code">copy()</code> method which creates a copy of the
Resource Manager instance. The new instance is created with a null CAS Manager; if you want to share
the CAS Pool, you have to copy the CAS Manager: <code class="code">newRM.setCasManager(originalRM.getCasManager())</code>.
You also may set the Extension Class Loader in the new instance (PEAR wrappers use this to allow
PEARs to have their own classpath). See the Javadocs for details.
</p>
</div>
<div class="section" title="11.3.&nbsp;External Resources support for multiple Parameterized Instances"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.external-resource-multiple-parameterized-instances">11.3.&nbsp;External Resources support for multiple Parameterized Instances</h2></div></div></div>
<p>A typical external resource gets a single instantiation, shared with all users of a particular
ResourceManager.
Sometimes, multiple instantiations of the same resource may be useful. The framework supports this for
ParameterizedDataResources. There's one kind supplied with UIMA - the fileLanguageResourceSpecifier.
This works by having each call to getResource(name, extra_keys[]) use the extra keys to select a particular
instance. On the first call for a particular instance, the named resource uses the extra keys to
initialize a new instance by calling its <code class="code">load</code> method with a data resource derived from the
extra keys by the named resource.
</p>
<p>For example, the fileLanguageResourceSpecifier uses the language code and goes through
a process with lots of defaulting and fall back to find a resource to load, based on the language code.
</p>
</div>
</div>
</div></body></html>