| <html><head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title>UIMA References</title><link rel="stylesheet" type="text/css" href="css/stylesheet-html.css"><meta name="generator" content="DocBook XSL-NS Stylesheets V1.76.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" class="book" title="UIMA References" id="d5e1"><div xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><div><h1 class="title">UIMA References</h1></div><div><div class="authorgroup"> |
| <h3 class="corpauthor">Written and maintained by the Apache UIMA™ Development Community</h3> |
| </div></div><div><p class="releaseinfo">Version 3.1.1</p></div><div><p class="copyright">Copyright © 2006, 2019 The Apache Software Foundation</p></div><div><p class="copyright">Copyright © 2004, 2006 International Business Machines Corporation</p></div><div><div class="legalnotice" title="Legal Notice"><a name="d5e8"></a> |
| <p> </p> |
| <p title="License and Disclaimer"> |
| <b>License and Disclaimer. </b> |
| |
| The ASF licenses this documentation |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this documentation except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| </p><div class="blockquote"><blockquote class="blockquote"> |
| <a class="ulink" href="http://www.apache.org/licenses/LICENSE-2.0" target="_top">http://www.apache.org/licenses/LICENSE-2.0</a> |
| </blockquote></div><p title="License and Disclaimer"> |
| |
| Unless required by applicable law or agreed to in writing, |
| this documentation and its contents are distributed under the License |
| on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| </p> |
| <p> </p> |
| <p> </p> |
| <p title="Trademarks"> |
| <b>Trademarks. </b> |
| All terms mentioned in the text that are known to be trademarks or |
| service marks have been appropriately capitalized. Use of such terms |
| in this book should not be regarded as affecting the validity of the |
| trademark or service mark. |
| |
| </p> |
| </div></div><div><p class="pubdate">November, 2019</p></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="chapter"><a href="#ugr.ref.javadocs">1. Javadocs</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.javadocs.libraries">1.1. Using named Eclipse User Libraries</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xml.component_descriptor">2. Component Descriptor Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.notation">2.1. Notation</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.imports">2.2. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system">2.3. Type System Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.imports">2.3.1. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.types">2.3.2. Types</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.features">2.3.3. Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.type_system.string_subtypes">2.3.4. String Subtypes</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes">2.4. Analysis Engine Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.primitive">2.4.1. Primitive Analysis Engine Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.aggregate">2.4.2. Aggregate Analysis Engine Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.aes.configuration_parameters">2.4.3. Configuration Parameters</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.flow_controller">2.5. 
Flow Controller Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts">2.6. Collection Processing Component Descriptors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader">2.6.1. Collection Reader Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.cas_initializer">2.6.2. CAS Initializer Descriptors (deprecated)</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.collection_processing_parts.cas_consumer">2.6.3. CAS Consumer Descriptors</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.service_client">2.7. Service Client Descriptors</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers">2.8. Custom Resource Specifiers</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xml.cpe_descriptor">3. CPE Descriptor Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.overview">3.1. CPE Overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.notation">3.2. Notation</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.imports">3.3. Imports</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor">3.4. CPE Descriptor Overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.collection_reader">3.5. Collection Reader</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.collection_reader.error_handling">3.5.1. Error handling for Collection Readers</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors">3.6. 
CAS Processors</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual">3.6.1. Specifying an Individual CAS Processor</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters">3.7. CPE Operational Parameters</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.resource_manager_configuration">3.8. Resource Manager Configuration</a></span></dt><dt><span class="section"><a href="#ugr.ref.xml.cpe_descriptor.descriptor.example">3.9. Example CPE Descriptor</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.cas">4. CAS Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.javadocs">4.1. Javadocs</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.overview">4.2. CAS Overview</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.type_system">4.2.1. The Type System</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.creating_accessing_manipulating_data">4.2.2. Creating/Accessing/Changing data</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.creating_using_indexes">4.2.3. Creating and using indexes</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.builtin_types">4.3. Built-in CAS Types</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.accessing_the_type_system">4.4. Accessing the type system</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.type_system.printer_example">4.4.1. TypeSystemPrinter example</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.cas_apis_create_modify_feature_structures">4.4.2. Using CAS APIs: Feature Structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.creating_feature_structures">4.5. 
Creating feature structures</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.updating_indexed_feature_structures">4.5.1. Updating indexed feature structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.accessing_modifying_features_of_feature_structures">4.6. Accessing or modifying Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.indexes_and_iterators">4.7. Indexes and Iterators</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.index.built_in_indexes">4.7.1. Built-in Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.adding_to_indexes">4.7.2. Adding Feature Structures to the Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.iterators">4.7.3. Iterators over UIMA Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.annotation_index">4.7.4. Special iterators for Annotation types</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.index.constraints_and_filtered_iterators">4.7.5. Constraints and Filtered iterators</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.guide_to_javadocs">4.8. CAS API's Javadocs</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.cas.javadocs.cas_package">4.8.1. APIs in the CAS package</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.cas.typemerging">4.9. Type Merging</a></span></dt><dt><span class="section"><a href="#ugr.ref.cas.limitedmultipleaccess">4.10. Limited multi-thread access to read-only CASs</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.jcas">5. JCas Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.name_spaces">5.1. Name Spaces</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.use_of_description">5.2. Use of XML Description</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.mapping_built_ins">5.3. 
Mapping built-in CAS types to Java types</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.augmenting_generated_code">5.4. Augmenting the generated Java Code</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.keeping_augmentations_when_regenerating">5.4.1. Keeping hand-coded augmentations when regenerating</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.additional_constructors">5.4.2. Additional Constructors</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.modifying_generated_items">5.4.3. Modifying generated items</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.merging_types_from_other_specs">5.5. Merging Types</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.merging_types.aggregates_and_cpes">5.5.1. Aggregate AEs and CPEs as sources of types</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.merging_types.jcasgen_support">5.5.2. JCasGen support for type merging</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.impact_of_type_merging_on_composability">5.5.3. Type Merging impacts on Composability</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.documentannotation_issues">5.5.4. Adding Features to DocumentAnnotation</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.using_within_an_annotator">5.6. Using JCas within an Annotator</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.jcas.new_instances">5.6.1. Creating new instances</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.getters_and_setters">5.6.2. Getters and Setters</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.obtaining_refs_to_indexes">5.6.3. Obtaining references to Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.adding_removing_instances_to_indexes">5.6.4. Updating Indexes</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.using_iterators">5.6.5. 
Using Iterators</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.class_loaders">5.6.6. Class Loaders in UIMA</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.accessing_jcas_objects_outside_uima_components">5.6.7. Issues accessing JCas objects outside of UIMA Engine Components</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.jcas.setting_up_classpath">5.7. Setting up Classpath for JCas</a></span></dt><dt><span class="section"><a href="#ugr.ref.jcas.pear_support">5.8. PEAR isolation</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.pear">6. PEAR Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.packaging_a_component">6.1. Packaging a UIMA component</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.creating_pear_structure">6.1.1. Creating the PEAR structure</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.populating_pear_structure">6.1.2. Populating the PEAR structure</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.creating_installation_descriptor">6.1.3. Creating the installation descriptor</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.installation_descriptor">6.1.4. Installation Descriptor: template</a></span></dt><dt><span class="section"><a href="#ugr.ref.pear.packaging_into_1_file">6.1.5. Packaging the PEAR structure into one file</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.pear.installing">6.2. Installing a PEAR package</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.pear.installing_pear_using_API">6.2.1. Installing a PEAR file using the PEAR APIs</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.pear.specifier">6.3. PEAR package descriptor</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.xmi">7. XMI CAS Serialization Reference</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xmi.xmi_tag">7.1. 
XMI Tag</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.feature_structures">7.2. Feature Structures</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.primitive_features">7.3. Primitive Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.reference_features">7.4. Reference Features</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features">7.5. Array and List Features</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features.as_multi_valued_properties">7.5.1. Arrays and Lists as Multi-Valued Properties</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.array_and_list_features.as_1st_class_objects">7.5.2. Arrays and Lists as First-Class Objects</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.null_array_list_elements">7.5.3. Null Array/List Elements</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.xmi.sofas_views">7.6. Subjects of Analysis (Sofas) and Views</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.linking_to_ecore_type_system">7.7. Linking XMI docs to Ecore Type System</a></span></dt><dt><span class="section"><a href="#ugr.ref.xmi.delta">7.8. Delta CAS XMI Format</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.compress">8. Compressed Binary CASes</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.compress.overview">8.1. Binary CAS Compression overview</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.usage">8.2. Using Compressed Binary CASes</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.simple-deltas">8.3. Simple Delta CAS serialization</a></span></dt><dt><span class="section"><a href="#ugr.ref.compress.use-cases">8.4. Use Case cookbook</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.json">9. JSON support</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.json.overview">9.1. 
JSON serialization support overview</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas">9.2. JSON CAS Serialization</a></span></dt><dd><dl><dt><span class="section"><a href="#ug.ref.json.cas.bigpic">9.2.1. The Big Picture</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.context">9.2.2. The _context section</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.featurestructures">9.2.3. Serializing Feature Structures</a></span></dt></dl></dd><dt><span class="section"><a href="#ug.ref.json.cas.featurestructures.organization">9.3. Organizing the Feature Structures</a></span></dt><dt><span class="section"><a href="#ug.ref.json.cas.features">9.4. Additional JSON CAS Serialization features</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.json.delta">9.4.1. Delta CAS</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.json.usage">9.5. Using JSON CAS serialization</a></span></dt><dt><span class="section"><a href="#ugr.ref.json.descriptionserialization">9.6. JSON serialization for UIMA descriptors</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.config">10. Setup and Configuration</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.config.properties">10.1. UIMA JVM Configuration Properties</a></span></dt><dt><span class="section"><a href="#ugr.ref.config.protect-index">10.2. Configuring index protection</a></span></dt><dt><span class="section"><a href="#ugr.ref.config.property-table">10.3. Properties Table</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ref.resources">11. UIMA Resources</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.resources.overview">11.1. What is a UIMA Resource?</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ref.resources.resource-inner-implementations">11.1.1. 
Resource Inner Implementations</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ref.resources.sharing-across-pipelines">11.2. Sharing Resources</a></span></dt><dt><span class="section"><a href="#ugr.ref.resources.external-resource-multiple-parameterized-instances">11.3. External Resources support for multiple Parameterized Instances</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| |
| |
| <div class="chapter" title="Chapter 1. Javadocs" id="ugr.ref.javadocs"><div class="titlepage"><div><div><h2 class="title">Chapter 1. Javadocs</h2></div></div></div> |
| |
| |
| <p>The details of all the public APIs for UIMA are contained in the API Javadocs. These are located in the docs/api |
| directory; the top level to open in your browser is called <a class="ulink" href="api/index.html" target="_top">api/index.html</a>.</p> |
| |
| <p>Eclipse supports the ability to attach the Javadocs to your project. The Javadoc should already be attached |
| to the <code class="literal">uimaj-examples</code> project, if you followed the setup instructions in <a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview & SDK Setup</a> <a href="overview_and_setup.html#ugr.ovv.eclipse_setup.example_code" class="olink">Section 3.2, “Setting up Eclipse to view Example Code”</a>. To attach |
| Javadocs to your own Eclipse project, use the following instructions.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>As an alternative, you can add the UIMA source to the UIMA binary distribution. Doing so |
| makes the Javadocs available automatically (so you can skip the following |
| setup) and also lets you step through the UIMA framework code while debugging. |
| To add the source, follow the instructions as described in the setup chapter: |
| <a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview & SDK Setup</a> |
| <a href="overview_and_setup.html#ugr.ovv.eclipse_setup.adding_source" class="olink">Section 3.3, “Adding the UIMA source code to the jar files”</a>.</p></div> |
| |
| <p>To add the Javadocs, open a project that references the UIMA APIs in its class path, and open the project properties. Then pick |
| Java Build Path. Pick the "Libraries" tab and select one of the UIMA library entries (if, for |
| instance, uima-core.jar is not in this list, it's unlikely your code will compile). Each library entry has a small ">" |
| sign on its left; click it to expand the entry and show the Javadoc location. Highlight that line and press Edit to |
| add a reference to the Javadocs in the following dialog: |
| |
| |
| </p><div class="screenshot"> |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.javadocs/image002.jpg" width="574" alt="Screenshot of attaching Javadoc to source in Eclipse"></td></tr></table></div> |
| </div> |
| |
| <p>Once you do this, Eclipse can show you Javadocs for UIMA APIs as you work. To see the Javadoc for a UIMA API, you |
| can hover over the API class or method, or select it and press shift-F2, or use the menu Navigate <span class="symbol">→</span> |
| Open External Javadoc, or open the Javadoc view (Window <span class="symbol">→</span> Show View <span class="symbol">→</span> Other |
| <span class="symbol">→</span> Java <span class="symbol">→</span> Javadoc).</p> |
| |
| <p>In a similar manner, you can attach the source for the UIMA framework, if you download the source |
| distribution. The source corresponding to particular |
| releases is available from the Apache UIMA web site (<a class="ulink" href="http://uima.apache.org" target="_top">http://uima.apache.org</a>) on the |
| downloads page.</p> |
| |
| <div class="section" title="1.1. Using named Eclipse User Libraries"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.javadocs.libraries">1.1. Using named Eclipse User Libraries</h2></div></div></div> |
| |
| <p>You can also create a named "user library" in Eclipse containing the UIMA Jars, and attach the Javadocs (or |
| optionally, the sources); this named library is saved in the Eclipse workspace. Once created, it can be |
| added to the classpath of newly created Eclipse projects.</p> |
| |
| <p>Use the menu option Project <span class="symbol">→</span> Properties |
| <span class="symbol">→</span> Java Build Path, and then pick the Libraries tab, and click the Add Library button. Then select |
| User Libraries, click "Next", and pick the library you created for the UIMA Jars.</p> |
| |
| <p>To create this library in the workspace, |
| use the same menu picks as above, but after you select the User Libraries and click "Next", you can click the "New Library..." |
| button to define your new library. You use the "Add Jars" button and multi-select all the Jars in the lib directory |
| of the UIMA binary distribution. Then you add the Javadoc attachment for each Jar. The path to use is |
| file:/ -- insert the path to your install of UIMA -- /docs/api. After you do this for the first Jar, you can |
| copy this string to the clipboard and paste it into the rest of the Jars.</p> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 2. Component Descriptor Reference" id="ugr.ref.xml.component_descriptor"><div class="titlepage"><div><div><h2 class="title">Chapter 2. Component Descriptor Reference</h2></div></div></div> |
| |
| |
| <p>This chapter is the reference guide for the UIMA SDK's Component Descriptor XML |
| schema. A <span class="emphasis"><em>Component Descriptor</em></span> (also sometimes called a |
| <span class="emphasis"><em>Resource Specifier</em></span> in the code) is an XML file that either (a) |
| completely describes a component, including all information needed to construct the |
| component and interact with it, or (b) specifies how to connect to and interact with an |
| existing component that has been published as a remote service. |
| <span class="emphasis"><em>Component</em></span> (also called <span class="emphasis"><em>Resource</em></span>) is a |
| general term for modules produced by UIMA developers and used by UIMA applications. The |
| types of Components are: Analysis Engines, Collection Readers, CAS |
| Initializers<sup>[<a name="d5e71" href="#ftn.d5e71" class="footnote">1</a>]</sup>, CAS Consumers, and Collection Processing Engines. |
| However, Collection Processing Engine Descriptors are significantly different in |
| format and are covered in a separate chapter, <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter 3, <i>Collection Processing Engine Descriptor Reference</i></a>.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.notation" title="2.1. Notation">Section 2.1, “Notation”</a> describes the notation used in this |
| chapter.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a> describes the UIMA SDK's |
| <span class="emphasis"><em>import</em></span> syntax, used to allow XML descriptors to import |
| information from other XML files, to allow sharing of information between several XML |
| descriptors.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.aes" title="2.4. Analysis Engine Descriptors">Section 2.4, “Analysis Engine Descriptors”</a> describes the XML format for <span class="emphasis"><em>Analysis Engine |
| Descriptors</em></span>. These are descriptors that completely describe Analysis |
| Engines, including all information needed to construct and interact with them.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.collection_processing_parts" title="2.6. Collection Processing Component Descriptors">Section 2.6, “Collection Processing Component Descriptors”</a> describes the XML format for |
| <span class="emphasis"><em>Collection Processing Component Descriptors</em></span>. This includes |
| Collection Reader, CAS Initializer, and CAS Consumer Descriptors.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.service_client" title="2.7. Service Client Descriptors">Section 2.7, “Service Client Descriptors”</a> describes the XML format for |
| <span class="emphasis"><em>Service Client Descriptors</em></span>, which specify how to connect to and |
| interact with resources deployed as remote services.</p> |
| |
| <p><a class="xref" href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers" title="2.8. Custom Resource Specifiers">Section 2.8, “Custom Resource Specifiers”</a> describes the XML format for |
| <span class="emphasis"><em>Custom Resource Specifiers</em></span>, which allow you to plug in your |
| own Java class as a UIMA Resource.</p> |
| |
| <div class="section" title="2.1. Notation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.notation">2.1. Notation</h2></div></div></div> |
| |
| |
| <p>This chapter uses an informal notation to specify the syntax of Component |
| Descriptors. The formal syntax is defined by an XML schema definition, which is |
| contained in the file <code class="literal">resourceSpecifierSchema.xsd</code>, |
| located in the <code class="literal">uima-core.jar</code> file.</p> |
| |
| <p>The notation used in this chapter is:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An ellipsis (...) inside an element body indicates |
| that the substructure of that element has been omitted (to be described in another |
| section of this chapter). An example of this would be: |
| |
| |
| </p><pre class="programlisting"><analysisEngineMetaData> |
| ... |
| </analysisEngineMetaData></pre><p> |
| An ellipsis immediately after an element indicates that the element type may be |
| repeated arbitrarily many times. For example: |
| |
| |
| </p><pre class="programlisting"><parameter>[String]</parameter> |
| <parameter>[String]</parameter> |
| ...</pre><p> |
| indicates that there may be arbitrarily many parameter elements in this |
| context.</p></li><li class="listitem"><p>Bracketed expressions (e.g. <code class="literal">[String]</code>) |
| indicate the type of value that may be used at that location.</p></li><li class="listitem"><p>A vertical bar, as in <code class="literal">true|false</code>, indicates |
| alternatives. This can be applied to literal values, bracketed type names, and |
| elements.</p></li><li class="listitem"><p>Which elements are optional and which are required is specified in |
| prose, not in the syntax definition. </p></li></ul></div> |
| </div> |
| |
| <div class="section" title="2.2. Imports"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.imports">2.2. Imports</h2></div></div></div> |
| |
| |
| <p>The UIMA SDK defines a particular syntax for XML descriptors to import information |
| from other XML files. When one of the following appears in an XML descriptor: |
| |
| |
| </p><pre class="programlisting"><import location="[URL]" /> or |
| <import name="[Name]" /></pre><p> |
| it indicates that information from a separate XML file is being imported. Note that |
| imports are allowed only in certain places in the descriptor. In the remainder of this |
| chapter, it will be indicated at which points imports are allowed.</p> |
| |
| <p>If an import specifies a <code class="literal">location</code> attribute, the value of |
| that attribute specifies the URL at which the XML file to import will be found. This can be |
| a relative URL, which will be resolved relative to the descriptor containing the |
| <code class="literal">import</code> element, or an absolute URL. Relative URLs can be written |
| without a protocol/scheme (e.g., <span class="quote">“<span class="quote">file:</span>”</span>), and without a host machine |
| name. In this case the relative URL might look something like |
| <code class="literal">org/apache/myproj/MyTypeSystem.xml</code>.</p> |
| |
| <p>An absolute URL is written with one of the following prefixes, followed by a path |
| such as <code class="literal">org/apache/myproj/MyTypeSystem.xml</code>: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>file:/ <span class="symbol">←</span> has no network |
| address</p></li><li class="listitem"><p>file:/// <span class="symbol">←</span> has an empty network address</p></li><li class="listitem"><p>file://some.network.address/</p></li></ul></div> |
| |
| <p>For more information about URLs, see the Javadoc for the Java |
| class <span class="quote">“<span class="quote">URL</span>”</span>.</p> |
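| |
| <p>The relative and absolute cases described above can be illustrated with a small sketch. It uses only |
| the JDK class <code class="literal">java.net.URI</code> and is not taken from the UIMA code base; the class name |
| <code class="literal">ImportLocationSketch</code>, the method <code class="literal">resolve</code>, and the example paths are |
| hypothetical and exist only for illustration.</p> |

```java
import java.net.URI;

// Illustrative sketch only: shows how a relative location can be resolved
// against the URL of the descriptor that contains the import element.
// This is an assumption about the general mechanism, not UIMA's own code.
final class ImportLocationSketch {

    // descriptorUrl: URL of the descriptor containing the <import> (hypothetical value)
    // location: value of the import's location attribute
    static URI resolve(String descriptorUrl, String location) {
        return URI.create(descriptorUrl).resolve(location);
    }

    public static void main(String[] args) {
        String base = "file:/home/me/descriptors/MyAnnotator.xml";

        // A relative location is resolved against the containing descriptor.
        System.out.println(resolve(base, "org/apache/myproj/MyTypeSystem.xml"));

        // An absolute location is used as written.
        System.out.println(resolve(base, "file:/opt/types/MyTypeSystem.xml"));
    }
}
```

| <p>In this sketch the relative location replaces the last path segment of the containing descriptor's URL, |
| while an absolute location ignores the base entirely.</p> |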
| |
| <p>If an import specifies a <code class="literal">name</code> attribute, the value of that |
| attribute should take the form of a Java-style dotted name (e.g. |
| <code class="literal">org.apache.myproj.MyTypeSystem</code>). An .xml file with this name |
| will be searched for in the classpath or datapath (described below). As in Java, the dots |
| in the name will be converted to file path separators. So an import specifying the |
| example name in this paragraph will result in a search for |
| <code class="literal">org/apache/myproj/MyTypeSystem.xml</code> in the classpath or |
| datapath.</p> |
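| |
| <p>The name-to-path conversion described above can be sketched in a few lines of plain Java. The class |
| <code class="literal">ImportNameSketch</code> and its method are hypothetical names chosen for this illustration, and the |
| sketch searches only the classpath (through the system class loader), not the datapath.</p> |

```java
import java.net.URL;

// Illustrative sketch only: converts a dotted import name to the relative
// path that would be searched for, as the text describes. Not UIMA's own code.
final class ImportNameSketch {

    // Dots become path separators and the .xml extension is appended.
    static String toResourcePath(String name) {
        return name.replace('.', '/') + ".xml";
    }

    public static void main(String[] args) {
        String path = toResourcePath("org.apache.myproj.MyTypeSystem");
        System.out.println(path);

        // Classpath lookup; the result is null when no such resource exists.
        URL found = ClassLoader.getSystemResource(path);
        System.out.println(found == null ? "not on classpath" : found.toString());
    }
}
```
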
| |
| <p><a name="ugr.ref.xml.component_descriptor.datapath"></a>The datapath works similarly to the classpath but can be set programmatically |
| through the resource manager API. Application developers can specify a datapath |
| during initialization, using the following code: |
| |
| |
| </p><pre class="programlisting"> |
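| // desc and yourPathString are placeholders: the parsed descriptor and your datapath string. |
| ResourceManager resMgr = UIMAFramework.newDefaultResourceManager(); |
| resMgr.setDataPath(yourPathString); |
| // Pass the configured resource manager when producing the engine. |
| AnalysisEngine ae = |
| UIMAFramework.produceAnalysisEngine(desc, resMgr, null); |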
| </pre> |
| |
| <p>The default datapath for the entire JVM can be set via the |
| <code class="literal">uima.datapath</code> Java system property, but this feature should |
| only be used for standalone applications that don't need to run in the same JVM as |
| other code that may need a different datapath.</p> |
| |
| <p>The value of a name or location attribute may be parameterized with references to external |
| override variables using the <code class="literal">${variable-name}</code> syntax. |
| </p><pre class="programlisting"><import location="Annotator${with}ExternalOverrides.xml" /></pre><p> |
| If a variable is undefined the value is left unmodified and a warning message identifies the missing |
| variable.</p> |
| |
| <p>Previous versions of UIMA also supported XInclude. That support didn't work in |
| many situations, and it is no longer supported. To include other files, please use |
| <import>.</p> |
| |
| |
| |
| </div> |
| |
| <div class="section" title="2.3. Type System Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.type_system">2.3. Type System Descriptors</h2></div></div></div> |
| |
| |
| <p>A Type System Descriptor is used to define the types and features that can be |
| represented in the CAS. A Type System Descriptor can be imported into an Analysis Engine |
| or Collection Processing Component Descriptor.</p> |
| |
| <p>The basic structure of a Type System Descriptor is as follows: |
| |
| |
| </p><pre class="programlisting"><typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <name> [String] </name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <imports> |
| <import ...> |
| ... |
| </imports> |
| |
| <types> |
| <typeDescription> |
| ... |
| </typeDescription> |
| |
| ... |
| |
| </types> |
| |
| </typeSystemDescription></pre> |
| |
| <p>All of the subelements are optional.</p> |
| |
| <div class="section" title="2.3.1. Imports"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.imports">2.3.1. Imports</h3></div></div></div> |
| |
| |
| <p>The <code class="literal">imports</code> section allows this descriptor to import |
| types from other type system descriptors. The import syntax is described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a>. A type system may import any number of other type |
| systems and then define additional types which refer to imported types. Circular |
| imports are allowed.</p> |
| </div> |
| |
| <div class="section" title="2.3.2. Types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.types">2.3.2. Types</h3></div></div></div> |
| |
| |
| <p>The <code class="literal">types</code> element contains zero or more |
| <code class="literal">typeDescription</code> elements. Each |
| <code class="literal">typeDescription</code> has the form: |
| |
| |
| </p><pre class="programlisting"><typeDescription> |
| <name>[TypeName]</name> |
| <description>[String]</description> |
| <supertypeName>[TypeName]</supertypeName> |
| <features> |
| ... |
| </features> |
| </typeDescription></pre> |
| |
| <p>The name element contains the name of the type. A |
| <code class="literal">[TypeName]</code> is a dot-separated list of names, where each name |
| consists of a letter followed by any number of letters, digits, or underscores. |
| <code class="literal">TypeNames</code> are case sensitive. Letter and digit are as defined |
| by Java; therefore, any Unicode letter or digit may be used (subject to the character |
| encoding defined by the descriptor file's XML header). The name following the |
| final dot is considered to be the <span class="quote">“<span class="quote">short name</span>”</span> of the type; the |
| preceding portion is the namespace (analogous to the package.class syntax used in |
| Java). Namespaces beginning with uima are reserved and should not be used. Examples |
| of valid type names are:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>test.TokenAnnotation</p> |
| </li><li class="listitem"><p>org.myorg.TokenAnnotation</p></li><li class="listitem"><p>com.my_company.proj123.TokenAnnotation </p></li></ul></div> |
| |
| <p>These would all be considered distinct types since they have different |
| namespaces. Best practice here is to follow the normal Java naming conventions of |
| having namespaces be all lowercase, with the short type names having an initial |
| capital, but this is not mandated, so <code class="literal">ABC.mYtyPE</code> is an allowed |
| type name. Type names without namespaces (e.g. |
| <code class="literal">TokenAnnotation</code> alone) are allowed but discouraged, because |
| naming conflicts can then result when combining annotators that use different |
| type systems.</p> |
| |
| <p>The <code class="literal">description</code> element contains a textual description |
| of the type. The <code class="literal">supertypeName</code> element contains the name of the |
| type from which it inherits (this can be set to the name of another user-defined type, |
| or it may be set to any built-in type which may be subclassed, such as |
| <code class="literal">uima.tcas.Annotation</code> for a new annotation |
| type or <code class="literal">uima.cas.TOP</code> for a new type that is not |
| an annotation). All three of these elements are required.</p> |
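| <p>For illustration, a filled-in <code class="literal">typeDescription</code> with all three required |
| elements might look like the following. The type name <code class="literal">org.myorg.TokenAnnotation</code> |
| is one of the example names listed above, and the description text is made up for this sketch.</p> |

```xml
<typeDescription>
  <!-- Namespace "org.myorg", short name "TokenAnnotation". -->
  <name>org.myorg.TokenAnnotation</name>
  <description>A single token in the document text.</description>
  <!-- Built-in supertype for a new annotation type. -->
  <supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
```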
| |
| </div> |
| |
| <div class="section" title="2.3.3. Features"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.features">2.3.3. Features</h3></div></div></div> |
| |
| |
| <p>The <code class="literal">features</code> element of a |
| <code class="literal">typeDescription</code> is required only if the type we are specifying |
| introduces new features. If the <code class="literal">features</code> element is present, |
| it contains zero or more <code class="literal">featureDescription</code> elements, each of |
| which has the form:</p> |
| |
| |
| <pre class="programlisting"><featureDescription> |
| <name>[Name]</name> |
| <description>[String]</description> |
| <rangeTypeName>[Name]</rangeTypeName> |
| <elementType>[Name]</elementType> |
| <multipleReferencesAllowed>true|false</multipleReferencesAllowed> |
| </featureDescription></pre> |
| |
| <p>A feature's name follows the same rules as a type short name – a letter |
| followed by any number of letters, digits, or underscores. Feature names are case |
| sensitive.</p> |
| |
| <p>The feature's <code class="literal">rangeTypeName</code> specifies the type of |
| value that the feature can take. This may be the name of any type defined in your type |
| system, or one of the predefined types. All of the predefined types have names that are |
| prefixed with <code class="literal">uima.cas</code> or <code class="literal">uima.tcas</code>, |
| for example: |
| |
| |
| </p><pre class="programlisting">uima.cas.TOP |
| uima.cas.String |
| uima.cas.Long |
| uima.cas.FSArray |
| uima.cas.StringList |
| uima.tcas.Annotation</pre><p> |
| For a complete list of predefined types, see the CAS API documentation.</p> |
| |
| <p>The <code class="literal">elementType</code> of a feature is optional, and applies only |
| when the <code class="literal">rangeTypeName</code> is |
| <code class="literal">uima.cas.FSArray</code> or <code class="literal">uima.cas.FSList</code>. |
| The <code class="literal">elementType</code> specifies what type of value can be assigned as |
| an element of the array or list. This must be the name of a non-primitive type. If |
| omitted, it defaults to <code class="literal">uima.cas.TOP</code>, meaning that any |
| FeatureStructure can be assigned as an element of the array or list. Note: depending on |
| the CAS Interface that you use in your code, this constraint may or may not be |
| enforced. |
| Note: At run time, the elementType is available from a runtime Feature object |
| (using the <code class="literal">a_feature_object.getRange().getComponentType()</code> method) |
| only when specified for the <code class="literal">uima.cas.FSArray</code> ranges; it isn't |
| available for <code class="literal">uima.cas.FSList</code> ranges. |
| </p> |
| |
| |
| <p>The <code class="literal">multipleReferencesAllowed</code> element is optional, and |
| applies only when the <code class="literal">rangeTypeName</code> is an array or list type (it |
| applies to arrays and lists of primitive as well as non-primitive types). Setting |
| this to false (the default) indicates that this feature has exclusive ownership of |
| the array or list, so changes to the array or list are localized. Setting this to true |
| indicates that the array or list may be shared, so changes to it may affect other |
| objects in the CAS. Note: there is currently no guarantee that the framework will |
| enforce this restriction. However, this setting may affect how the CAS is |
| serialized.</p> |
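| <p>The feature elements described in this section can be combined as in the sketch below. The type |
| <code class="literal">org.myorg.Sentence</code>, its features <code class="literal">tokens</code> and |
| <code class="literal">confidence</code>, and the element type <code class="literal">org.myorg.TokenAnnotation</code> |
| are hypothetical names chosen for illustration.</p> |

```xml
<typeDescription>
  <name>org.myorg.Sentence</name>
  <description>A sentence together with its tokens.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>tokens</name>
      <description>The tokens contained in this sentence.</description>
      <rangeTypeName>uima.cas.FSArray</rangeTypeName>
      <!-- Applies because the range is an FSArray; must be a non-primitive type. -->
      <elementType>org.myorg.TokenAnnotation</elementType>
      <!-- false is the default: this feature owns the array exclusively. -->
      <multipleReferencesAllowed>false</multipleReferencesAllowed>
    </featureDescription>
    <featureDescription>
      <name>confidence</name>
      <description>A score assigned by the annotator.</description>
      <!-- A predefined primitive range type; elementType does not apply here. -->
      <rangeTypeName>uima.cas.Float</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```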
| |
| </div> |
| |
| <div class="section" title="2.3.4. String Subtypes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.type_system.string_subtypes">2.3.4. String Subtypes</h3></div></div></div> |
| |
| |
| <p>There is one other special type that you can declare – a subset of the String |
| type that specifies a restricted set of allowed values. This is useful for features |
| that can have only certain String values, such as parts of speech. Here is an example of |
| how to declare such a type:</p> |
| |
| |
| <pre class="programlisting"><typeDescription> |
| <name>PartOfSpeech</name> |
| <description>A part of speech.</description> |
| <supertypeName>uima.cas.String</supertypeName> |
| <allowedValues> |
| <value> |
| <string>NN</string> |
| <description>Noun, singular or mass.</description> |
| </value> |
| <value> |
| <string>NNS</string> |
| <description>Noun, plural.</description> |
| </value> |
| <value> |
| <string>VB</string> |
| <description>Verb, base form.</description> |
| </value> |
| ... |
| </allowedValues> |
| </typeDescription></pre> |
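| <p>A String subtype is then used like any other range type. The sketch below assumes a hypothetical |
| feature named <code class="literal">partOfSpeech</code> on some token type, and refers to the |
| <code class="literal">PartOfSpeech</code> type declared in the example above.</p> |

```xml
<featureDescription>
  <name>partOfSpeech</name>
  <description>The part-of-speech tag of this token.</description>
  <!-- Values are restricted to the allowedValues of PartOfSpeech (NN, NNS, VB, ...). -->
  <rangeTypeName>PartOfSpeech</rangeTypeName>
</featureDescription>
```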
| |
| </div> |
| </div> |
| |
| <div class="section" title="2.4. Analysis Engine Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.aes">2.4. Analysis Engine Descriptors</h2></div></div></div> |
| |
| |
| <p>Analysis Engine (AE) descriptors completely describe Analysis Engines. There |
| are two basic types of Analysis Engines – <span class="emphasis"><em>Primitive</em></span> and |
| <span class="emphasis"><em>Aggregate</em></span>. A <span class="emphasis"><em>Primitive</em></span> Analysis |
| Engine is a container for a single <span class="emphasis"><em>annotator</em></span>, whereas an |
| <span class="emphasis"><em>Aggregate</em></span> Analysis Engine is composed of a collection of other |
| Analysis Engines. (For more information on this and other terminology, see <a href="overview_and_setup.html#d4e1" class="olink">UIMA Overview & SDK Setup</a> <a href="overview_and_setup.html#ugr.ovv.conceptual" class="olink">Chapter 2, <i>UIMA Conceptual Overview</i></a>).</p> |
| |
| <p>Both Primitive and Aggregate Analysis Engines have descriptors, and the two types |
| of descriptors have some similarities and some differences. <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive" title="2.4.1. Primitive Analysis Engine Descriptors">Section 2.4.1, “Primitive Analysis Engine Descriptors”</a> |
| discusses Primitive Analysis Engine descriptors. <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate" title="2.4.2. Aggregate Analysis Engine Descriptors">Section 2.4.2, “Aggregate Analysis Engine Descriptors”</a> then |
| describes how Aggregate Analysis Engine descriptors are different.</p> |
| |
| <div class="section" title="2.4.1. Primitive Analysis Engine Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive">2.4.1. Primitive Analysis Engine Descriptors</h3></div></div></div> |
| |
| |
| <div class="section" title="2.4.1.1. Basic Structure"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.basic">2.4.1.1. Basic Structure</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <analysisEngineDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| |
| <primitive>true</primitive> |
| <annotatorImplementationName> [String] </annotatorImplementationName> |
| |
| <analysisEngineMetaData> |
| ... |
| </analysisEngineMetaData> |
| |
| <externalResourceDependencies> |
| ... |
| </externalResourceDependencies> |
| |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| |
| </analysisEngineDescription></pre> |
| |
| <p>The document begins with a standard XML header. The recommended root tag is |
| <code class="literal"><analysisEngineDescription></code>, although |
| <code class="literal"><taeDescription></code> is also allowed for backwards |
| compatibility.</p> |
| |
| <p>Within the root element we declare that we are using the XML namespace |
| <code class="literal">http://uima.apache.org/resourceSpecifier</code>. This |
| namespace is required; without it, the descriptor cannot be validated for |
| errors.</p> |
| |
| <p> The first subelement, |
| <code class="literal"><frameworkImplementation></code>, currently must have |
| the value <code class="literal">org.apache.uima.java</code>, or |
| <code class="literal">org.apache.uima.cpp</code>. In future versions, there may be |
| other framework implementations, or perhaps implementations produced by other |
| vendors.</p> |
| |
| <p>The second subelement, <code class="literal"><primitive></code>, contains |
| the Boolean value <code class="literal">true</code>, indicating that this XML document |
| describes a <span class="emphasis"><em>Primitive</em></span> Analysis Engine.</p> |
| |
| <p>The next subelement, |
| <code class="literal"><annotatorImplementationName></code>, is how the UIMA framework |
| determines which annotator class to use. This should contain a fully-qualified |
| Java class name for Java implementations, or the name of a .dll or .so file for C++ |
| implementations.</p> |
| |
| <p>The <code class="literal"><analysisEngineMetaData></code> object contains |
| descriptive information about the analysis engine and what it does. It is |
| described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2. Analysis Engine MetaData">Section 2.4.1.2, “Analysis Engine MetaData”</a>.</p> |
| |
| <p>The <code class="literal"><externalResourceDependencies></code> and |
| <code class="literal"><resourceManagerConfiguration></code> elements declare |
| the external resource files that the analysis engine relies |
| upon. They are optional and are described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8. External Resource Dependencies">Section 2.4.1.8, “External Resource Dependencies”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>.</p> |
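| <p>Putting the pieces together, a small primitive descriptor might look like the sketch below. The |
| annotator class <code class="literal">org.myorg.TokenAnnotator</code>, the imported type system |
| <code class="literal">org.myorg.TokenTypeSystem</code>, and the type |
| <code class="literal">org.myorg.TokenAnnotation</code> are hypothetical; the optional |
| <code class="literal"><externalResourceDependencies></code> and |
| <code class="literal"><resourceManagerConfiguration></code> elements are omitted.</p> |

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<analysisEngineDescription
    xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>

  <primitive>true</primitive>
  <!-- Fully-qualified Java class name of the annotator (hypothetical). -->
  <annotatorImplementationName>org.myorg.TokenAnnotator</annotatorImplementationName>

  <analysisEngineMetaData>
    <name>Token Annotator</name>

    <typeSystemDescription>
      <imports>
        <import name="org.myorg.TokenTypeSystem"/>
      </imports>
    </typeSystemDescription>

    <capabilities>
      <capability>
        <!-- inputs and outputs are required but may be empty. -->
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">org.myorg.TokenAnnotation</type>
        </outputs>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</analysisEngineDescription>
```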
| |
| </div> |
| |
| <div class="section" title="2.4.1.2. Analysis Engine MetaData"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.metadata">2.4.1.2. Analysis Engine MetaData</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><analysisEngineMetaData> |
| <name> [String] </name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <configurationParameters> ... </configurationParameters> |
| |
| <configurationParameterSettings> |
| ... |
| </configurationParameterSettings> |
| |
| <typeSystemDescription> ... </typeSystemDescription> |
| |
| <typePriorities> ... </typePriorities> |
| |
| <fsIndexCollection> ... </fsIndexCollection> |
| |
| <capabilities> ... </capabilities> |
| |
| <operationalProperties> ... </operationalProperties> |
| |
| </analysisEngineMetaData></pre> |
| |
| <p>The <code class="literal">analysisEngineMetaData</code> element contains four |
| simple string fields – <code class="literal">name</code>, |
| <code class="literal">description</code>, <code class="literal">version</code>, and |
| <code class="literal">vendor</code>. Only the <code class="literal">name</code> field is |
| required, but providing values for the other fields is recommended. The |
| <code class="literal">name</code> field is just a descriptive name meant to be read by |
| users; it does not need to be unique across all Analysis Engines.</p> |
| |
| <p>Configuration parameters are described in |
| <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.configuration_parameters" title="2.4.3. Configuration Parameters">Section 2.4.3, “Configuration Parameters”</a>.</p> |
| |
| <p>The other sub-elements – |
| <code class="literal">typeSystemDescription</code>, |
| <code class="literal">typePriorities</code>, <code class="literal">fsIndexCollection</code>, |
| <code class="literal">capabilities</code> and |
| <code class="literal">operationalProperties</code> – are described in the following |
| sections. The only one of these that is required is |
| <code class="literal">capabilities</code>; the others are optional.</p> |
| |
| </div> |
| |
| <div class="section" title="2.4.1.3. Type System Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.type_system">2.4.1.3. Type System Definition</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><typeSystemDescription> |
| |
| <name> [String] </name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <imports> |
| <import ...> |
| ... |
| </imports> |
| |
| <types> |
| <typeDescription> |
| ... |
| </typeDescription> |
| |
| ... |
| |
| </types> |
| |
| </typeSystemDescription></pre> |
| |
| <p>A <code class="literal">typeSystemDescription</code> element defines a type |
| system for an Analysis Engine. The syntax for the element is described in <a class="xref" href="#ugr.ref.xml.component_descriptor.type_system" title="2.3. Type System Descriptors">Section 2.3, “Type System Descriptors”</a>.</p> |
| |
| <p>The recommended usage is to <code class="literal">import</code> an external type |
| system, using the import syntax described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a> |
| of this chapter. For example: |
| |
| |
| </p><pre class="programlisting"><typeSystemDescription> |
| <imports> |
|     <import location="MySharedTypeSystem.xml"/> |
| </imports> |
| </typeSystemDescription></pre> |
| |
| <p>This allows several AEs to share a single type system definition. The file |
| <code class="literal">MySharedTypeSystem.xml</code> would then contain the full |
| type system information, including the <code class="literal">name</code>, |
| <code class="literal">description</code>, <code class="literal">vendor</code>, |
| <code class="literal">version</code>, and <code class="literal">types</code>.</p> |
| |
| </div> |
| <div class="section" title="2.4.1.4. Type Priority Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.type_priority">2.4.1.4. Type Priority Definition</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><typePriorities> |
| <name> [String] </name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <imports> |
| <import ...> |
| ... |
| </imports> |
| |
| <priorityLists> |
| <priorityList> |
| <type>[TypeName]</type> |
| <type>[TypeName]</type> |
| ... |
| </priorityList> |
| |
| ... |
| |
| </priorityLists> |
| </typePriorities></pre> |
| |
| <p>The <code class="literal"><typePriorities></code> element contains |
| zero or more <code class="literal"><priorityList></code> elements; each |
| <code class="literal"><priorityList></code> contains zero or more types. |
| Like a type system, a type priorities definition may also declare a name, |
| description, version, and vendor, and may import other type priorities. See |
| <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a> for the import syntax.</p> |
| |
| <p>Type priority is used when iterating over feature structures in the CAS. |
| For example, if the CAS contains a <code class="literal">Sentence</code> annotation |
| and a <code class="literal">Paragraph</code> annotation with the same span of text |
| (i.e. a one-sentence paragraph), which annotation should be returned first |
| by an iterator? Probably the Paragraph, since it is conceptually |
| <span class="quote">“<span class="quote">bigger,</span>”</span> but the framework does not know that and must be |
| explicitly told that the Paragraph annotation has priority over the Sentence |
| annotation, like this: |
| |
| |
| </p><pre class="programlisting"><typePriorities> |
| <priorityList> |
| <type>org.myorg.Paragraph</type> |
| <type>org.myorg.Sentence</type> |
| </priorityList> |
| </typePriorities></pre> |
| |
| <p>All of the <code class="literal"><priorityList></code> elements defined |
| in the descriptor (and in all component descriptors of an aggregate analysis |
| engine descriptor) are merged to produce a single priority list.</p> |
| |
| <p>Subtypes of types specified here are also ordered, unless overridden by |
| another user-specified type ordering. For example, if you specify type A |
| comes before type B, then subtypes of A will come before subtypes of B, unless |
| there is an overriding specification which declares some subtype of B comes |
| before some subtype of A.</p> |
| |
| <p>If there are inconsistencies between priority lists (type A declared |
| before type B in one priority list, and type B declared before type A in |
| another), the framework will throw an exception.</p> |
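| <p>As a sketch of the inconsistency just described, the two lists below order the same pair of types |
| in opposite directions, so they cannot be merged into a single priority list. The type names are the |
| hypothetical ones from the earlier example, and the layout follows that example.</p> |

```xml
<typePriorities>
  <priorityList>
    <!-- Paragraph before Sentence. -->
    <type>org.myorg.Paragraph</type>
    <type>org.myorg.Sentence</type>
  </priorityList>
  <priorityList>
    <!-- Sentence before Paragraph: contradicts the first list. -->
    <type>org.myorg.Sentence</type>
    <type>org.myorg.Paragraph</type>
  </priorityList>
</typePriorities>
```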
| |
| <p>User-defined indexes may declare whether or not they use the type priority; |
| see the next section.</p> |
| </div> |
| |
| <div class="section" title="2.4.1.5. Index Definition"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.index">2.4.1.5. Index Definition</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><fsIndexCollection> |
| |
| <name>[String]</name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <imports> |
| <import ...> |
| ... |
| </imports> |
| |
| <fsIndexes> |
| |
| <fsIndexDescription> |
| ... |
| </fsIndexDescription> |
| |
| <fsIndexDescription> |
| ... |
| </fsIndexDescription> |
| |
| </fsIndexes> |
| |
| </fsIndexCollection></pre> |
| |
| <p>The <code class="literal">fsIndexCollection</code> element declares <span class="emphasis"><em>Feature Structure |
| Indexes</em></span>, each of which defines an index that holds feature structures of a given type. |
| Information in the CAS is always accessed through an index. There is a built-in default annotation |
| index declared which can be used to access instances of type |
| <code class="literal">uima.tcas.Annotation</code> (or its subtypes), sorted based on their |
| <code class="literal">begin</code> and <code class="literal">end</code> features, and the type priority ordering (if specified). |
| For all other types, there is a |
| default, unsorted (bag) index. If there is a need for a specialized index it must be declared in this |
| element of the descriptor. See <a href="references.html#ugr.ref.cas.indexes_and_iterators" class="olink">Section 4.7, “Indexes and Iterators”</a> for details on FS indexes.</p> |
| |
| <p>Like type systems and type priorities, an |
| <code class="literal">fsIndexCollection</code> can declare a |
| <code class="literal">name</code>, <code class="literal">description</code>, |
| <code class="literal">vendor</code>, and <code class="literal">version</code>, and may |
| import other <code class="literal">fsIndexCollection</code>s. The import syntax is |
| described in <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a>.</p> |
| |
| <p>An <code class="literal">fsIndexCollection</code> may also define zero or more |
| <code class="literal">fsIndexDescription</code> elements, each of which defines a |
| single index. Each <code class="literal">fsIndexDescription</code> has the form: |
| |
| |
| </p><pre class="programlisting"><fsIndexDescription> |
| |
| <label>[String]</label> |
| <typeName>[TypeName]</typeName> |
| <kind>sorted|bag|set</kind> |
| |
| <keys> |
| |
| <fsIndexKey> |
| <featureName>[Name]</featureName> |
| <comparator>standard|reverse</comparator> |
| </fsIndexKey> |
| |
| <fsIndexKey> |
| <typePriority/> |
| </fsIndexKey> |
| |
| ... |
| |
| </keys> |
| </fsIndexDescription></pre> |
| |
| <p>The <code class="literal">label</code> element defines the name by which |
| applications and annotators refer to this index. The |
| <code class="literal">typeName</code> element contains the name of the type that will |
| be contained in this index. This must match one of the type names defined in the |
| <code class="literal"><typeSystemDescription></code>.</p> |
| |
| <p>There are three possible values for the |
| <code class="literal"><kind></code> of index. Sorted indexes enforce an |
| ordering of feature structures, based on defined keys. Bag indexes do |
| not enforce ordering, and have no defined keys. Set indexes do not |
| enforce ordering, but use defined keys to specify equivalence classes; |
| addToIndexes will not add a Feature Structure to a set index if its keys |
| match those of an entry of the same type already in the index. |
| If the <code class="literal"><kind></code> element is omitted, it will default to |
| sorted, which is the most common type of index.</p> |
| |
| <p>Prior to version 2.7.0, the bag and sorted indexes stored duplicate entries for the |
| same identical FS, if it was added to the indexes multiple times. As of version 2.7.0, this |
| is changed; a second or subsequent add to index operation has no effect. This has the |
| consequence that a remove operation now guarantees that the particular FS is removed |
| (as opposed to only being able to say that one (of perhaps many duplicate entries) is removed). |
| Since sending to remote annotators only adds entries to indexes at most once, this |
| behavior is consistent with that.</p> |
| |
| <p>Note that even after this change, there is still a distinct difference in meaning for bag and set indexes. |
| The set index uses equal defined key values plus the type of the Feature Structure to determine equivalence classes for Feature Structures, and |
| will not add a Feature Structure if it has equal key values and the same type to an entry already in there.</p> |
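| <p>To make the set-index behavior concrete, the sketch below declares a set index keyed on a single |
| String feature. The label, the type <code class="literal">org.myorg.Entity</code>, and the feature |
| <code class="literal">canonicalName</code> are hypothetical. With such an index, a second |
| <code class="literal">Entity</code> of the same type with an equal <code class="literal">canonicalName</code> |
| would not be added.</p> |

```xml
<fsIndexDescription>
  <label>EntitiesByCanonicalName</label>
  <typeName>org.myorg.Entity</typeName>
  <kind>set</kind>
  <keys>
    <fsIndexKey>
      <!-- Entries with the same type and an equal canonicalName are equivalent. -->
      <featureName>canonicalName</featureName>
      <!-- The direction is ignored for set indexes; keys only test equality. -->
      <comparator>standard</comparator>
    </fsIndexKey>
  </keys>
</fsIndexDescription>
```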
| |
| <p>It is possible, however, that users may be depending on having multiple instances of |
| the identical FeatureStructure in the indices. Therefore, UIMA uses |
| a JVM defined property, |
| "uima.allow_duplicate_add_to_indexes", which (if defined when UIMA is loaded) will restore the previous behavior.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If duplicates are allowed, then the proper way to update an indexed Feature Structure is to |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>remove <span class="bold"><strong>*all*</strong></span> instances of the FS to be |
| updated </p></li><li class="listitem"><p>update the features</p></li><li class="listitem"><p>re-add the Feature Structure to the indexes (perhaps multiple times, depending on the |
| details of your logic).</p></li></ul></div></div> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>There is usually no need to explicitly declare a Bag index in your descriptor. |
| As of UIMA v2.1, if you do not declare any index for a type (or any of its |
| supertypes), a Bag index will be automatically created if an instance of that type is added to the indexes.</p></div> |
| |
| <p>A Sorted or Set index may define zero or more <span class="emphasis"><em>keys</em></span>. These keys |
| determine the sort order of the feature structures within a sorted index, and |
| partially determine equality for set indexes (the equality measure always includes testing that the types are the same). |
| Bag indexes do not use keys, and |
| equality is determined by Feature Structure identity (that is, two elements |
| are considered equal if and only if they are exactly the same feature structure, |
| located in the same place in the CAS). Keys are |
| ordered by precedence – the first key is evaluated first, and |
| subsequent keys are evaluated only if necessary.</p> |
| |
| <p>Each key is represented by an <code class="literal">fsIndexKey</code> element. |
| Most <code class="literal">fsIndexKeys</code> contain a |
| <code class="literal">featureName</code> and a <code class="literal">comparator</code>. |
| The <code class="literal">featureName</code> must match the name of one of the |
| features for the type specified in the |
| <code class="literal"><typeName></code> element for this index. The |
| comparator defines how the features will be compared – a value of |
| <code class="literal">standard</code> means that features will be compared using the |
| standard comparison for their data type (e.g. for numerical types, smaller |
| values precede larger values, and for string types, Unicode string |
| comparison is performed). A value of <code class="literal">reverse</code> means that |
| features will be compared using the reverse of the standard comparison (e.g. |
| for numerical types, larger values precede smaller values, etc.). For Set |
| indexes, the comparator direction is ignored – the keys are only used |
| for the equality testing.</p> |
| |
| <p>Each key used in comparisons must refer to a feature whose range type is |
| Boolean, Byte, Short, Integer, Long, Float, Double, or String. |
| </p> |
| |
| <p>There is a second type of a key, one which contains only the |
| <code class="literal"><typePriority/></code>. When this key is used, it |
| indicates that Feature Structures will be compared using the type priorities |
| declared in the <code class="literal"><typePriorities></code> section of the |
| descriptor.</p> |
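| <p>The key rules above can be sketched with a sorted index over a hypothetical annotation type |
| <code class="literal">org.myorg.TokenAnnotation</code>. The label is made up; <code class="literal">begin</code> |
| and <code class="literal">end</code> are the features of <code class="literal">uima.tcas.Annotation</code> |
| mentioned earlier in this section.</p> |

```xml
<fsIndexCollection>
  <fsIndexes>
    <fsIndexDescription>
      <label>TokensInTextOrder</label>
      <typeName>org.myorg.TokenAnnotation</typeName>
      <kind>sorted</kind>
      <keys>
        <!-- First key: smaller begin offsets come first. -->
        <fsIndexKey>
          <featureName>begin</featureName>
          <comparator>standard</comparator>
        </fsIndexKey>
        <!-- Second key, used only on ties: larger end offsets come first. -->
        <fsIndexKey>
          <featureName>end</featureName>
          <comparator>reverse</comparator>
        </fsIndexKey>
        <!-- Last key: fall back to the declared type priorities. -->
        <fsIndexKey>
          <typePriority/>
        </fsIndexKey>
      </keys>
    </fsIndexDescription>
  </fsIndexes>
</fsIndexCollection>
```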
| |
| </div> |
| |
| <div class="section" title="2.4.1.6. Capabilities"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.capabilities">2.4.1.6. Capabilities</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><capabilities> |
| <capability> |
| |
| <inputs> |
|       <type allAnnotatorFeatures="true|false">[TypeName]</type> |
| ... |
| <feature>[TypeName]:[Name]</feature> |
| ... |
| </inputs> |
| |
| <outputs> |
|       <type allAnnotatorFeatures="true|false">[TypeName]</type> |
| ... |
| <feature>[TypeName]:[Name]</feature> |
| ... |
|     </outputs> |
| |
| <inputSofas> |
| <sofaName>[name]</sofaName> |
| ... |
| </inputSofas> |
| |
| <outputSofas> |
| <sofaName>[name]</sofaName> |
| ... |
| </outputSofas> |
| |
| <languagesSupported> |
| <language>[ISO Language ID]</language> |
| ... |
| </languagesSupported> |
| </capability> |
| |
| <capability> |
| ... |
| </capability> |
| |
| ... |
| |
| </capabilities></pre> |
| |
| <p>The capabilities definition is used by the UIMA Framework in several |
| ways, including setting up the Results Specification for process calls, |
| routing control for aggregates based on language, and as part of the Sofa |
| mapping function.</p> |
| |
| <p>The <code class="literal">capabilities</code> element contains one or more |
| <code class="literal">capability</code> elements. In Version 2 and onwards, only one |
| capability set should be used. Multiple sets continue to work for now, |
| but they are not supported in a logically consistent way. |
| </p> |
| |
| <p>Each <code class="literal">capability</code> contains |
| <code class="literal">inputs</code>, <code class="literal">outputs</code>, |
| <code class="literal">languagesSupported</code>, <code class="literal">inputSofas</code>, and <code class="literal">outputSofas</code>. |
| The <code class="literal"><inputs></code> and <code class="literal"><outputs></code> elements are required (though they may be empty); |
| <code class="literal"><languagesSupported></code>, <code class="literal"><inputSofas></code>, |
| and <code class="literal"><outputSofas></code> are optional.</p> |
| |
| <p>Both inputs and outputs may contain a mixture of type and feature |
| elements.</p> |
| |
| <p><code class="literal"><type...></code> elements contain the name of one |
| of the types defined in the type system or one of the built in types. Declaring a |
| type as an input means that this component expects instances of this type to be |
| in the CAS when it receives it to process. Declaring a type as an output means |
| that this component creates new instances of this type in the CAS.</p> |
| |
| <p>There is an optional attribute |
| <code class="literal">allAnnotatorFeatures</code>, which defaults to false if |
| omitted. The Component Descriptor Editor tool defaults this to true when a new |
| type is added to the list of inputs and/or outputs. When this attribute is true, |
| it specifies that all of the type's features are also declared as input or |
| output. Otherwise, the features that are required as inputs or populated as |
| outputs must be explicitly specified in feature elements.</p> |
| |
| <p><code class="literal"><feature...></code> elements contain the |
| <span class="quote">“<span class="quote">fully-qualified</span>”</span> feature name, which is the type name |
| followed by a colon, followed by the feature name, e.g. |
| <code class="literal">org.myorg.TokenAnnotation:lemma</code>. |
| <code class="literal"><feature...></code> elements in the |
| <code class="literal"><inputs></code> section must also have a corresponding |
| type declared as an input. In output sections, this is not required. If the type |
| is not specified as an output, but a feature for that type is, this means that |
| existing instances of the type have the values of the specified features |
| updated. Any type mentioned in a <code class="literal"><feature></code> |
| element must be either specified as an input or an output or both.</p> |
| |
| <p><code class="literal">language</code> elements contain one of the ISO language |
| identifiers, such as <code class="literal">en</code> for English, or |
| <code class="literal">en-US</code> for the United States dialect of English.</p> |
| |
| <p>The list of language codes can be found here: <a class="ulink" href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt" target="_top">http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt</a> |
| and the country codes here: |
| <a class="ulink" href="http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html" target="_top">http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html</a> |
| </p> |
| |
| <p><code class="literal"><inputSofas></code> and |
| <code class="literal"><outputSofas></code> declare sofa names used by this |
| component. All Sofa names must be unique within a particular capability set. A |
| Sofa name must be an input or an output, and cannot be both. It is an error to have a |
| Sofa name declared as an input in one capability set, and also have it declared |
| as an output in another capability set.</p> |
| |
| <p>A <code class="literal"><sofaName></code> is written as a simple |
| Java-style identifier, without any periods in the name, except that it may be |
| written to end in <span class="quote">“<span class="quote"><code class="literal">.*</code></span>”</span>. If written in this |
| manner, it specifies a set of Sofa names, all of which start with the base name |
| (the part before the .*) followed by a period and then an arbitrary Java |
| identifier (without periods). This form is used to specify in the descriptor |
| that the component could generate an arbitrary number of Sofas, the exact |
| names and numbers of which are unknown before the component is run.</p> |
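| |
| <p>The rules above can be illustrated with a small sketch. All type, feature, and Sofa names here are hypothetical, |
| chosen only to show how the pieces fit together:</p> |
| |
| <pre class="programlisting"><capabilities> |
|   <capability> |
|     <!-- all names below are hypothetical --> |
|     <inputs> |
|       <type>org.myorg.TokenAnnotation</type> |
|     </inputs> |
|     <outputs> |
|       <type allAnnotatorFeatures="true">org.myorg.SentenceAnnotation</type> |
|       <feature>org.myorg.TokenAnnotation:lemma</feature> |
|     </outputs> |
|     <inputSofas> |
|       <sofaName>sourceText</sofaName> |
|     </inputSofas> |
|     <outputSofas> |
|       <sofaName>segment.*</sofaName> |
|     </outputSofas> |
|     <languagesSupported> |
|       <language>en</language> |
|       <language>en-US</language> |
|     </languagesSupported> |
|   </capability> |
| </capabilities></pre> |
| |
| <p>In this sketch the component creates new <code class="literal">org.myorg.SentenceAnnotation</code> instances with all of |
| their features, and updates the <code class="literal">lemma</code> feature on existing |
| <code class="literal">org.myorg.TokenAnnotation</code> instances; that type is declared only as an input, which satisfies the |
| rule that a type named in a feature element must be an input, an output, or both. The output Sofa name |
| <code class="literal">segment.*</code> covers names such as <code class="literal">segment.part1</code>.</p> |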
| |
| </div> |
| |
| <div class="section" title="2.4.1.7. OperationalProperties"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.operational_properties">2.4.1.7. OperationalProperties</h4></div></div></div> |
| |
| |
| <p>Components can declare operational properties that can be |
| useful in deployment. The following are available:</p> |
| |
| |
| <pre class="programlisting"><operationalProperties> |
| <modifiesCas> true|false </modifiesCas> |
| <multipleDeploymentAllowed> true|false </multipleDeploymentAllowed> |
| <outputsNewCASes> true|false </outputsNewCASes> |
| </operationalProperties></pre> |
| |
| <p><code class="literal">modifiesCas</code>, if false, indicates that this |
| component does not modify the CAS. If it is not specified, the default value is |
| true except for CAS Consumer components.</p> |
| |
| <p><code class="literal">multipleDeploymentAllowed</code>, if true, allows the |
| component to be deployed multiple times to increase performance through |
| scale-out techniques. If it is not specified, the default value is true, |
| except for CAS Consumer and Collection Reader components.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If you wrap one or more CAS Consumers inside an aggregate as the only |
| components, you must explicitly specify in the aggregate the |
| <code class="literal">multipleDeploymentAllowed</code> property as false (assuming the CAS Consumer |
| components take the default here); otherwise the framework will complain about inconsistent |
| settings for these.</p></div> |
| |
| <p><code class="literal">outputsNewCASes</code>, if true, allows the component to |
| create new CASes during processing, for example to break a large artifact into |
| smaller pieces. See <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cm" class="olink">Chapter 7, <i>CAS Multiplier Developer's Guide</i></a> for details.</p> |
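| |
| <p>As a sketch of the case described in the note above, an aggregate whose only delegates are CAS Consumers that |
| take the defaults might declare the following. The values for <code class="literal">modifiesCas</code> and |
| <code class="literal">outputsNewCASes</code> are assumptions about such an aggregate and should be adjusted to match |
| what the delegates actually do:</p> |
| |
| <pre class="programlisting"><operationalProperties> |
|   <!-- modifiesCas and outputsNewCASes values are assumptions --> |
|   <modifiesCas>false</modifiesCas> |
|   <multipleDeploymentAllowed>false</multipleDeploymentAllowed> |
|   <outputsNewCASes>false</outputsNewCASes> |
| </operationalProperties></pre> |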
| </div> |
| |
| <div class="section" title="2.4.1.8. External Resource Dependencies"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies">2.4.1.8. External Resource Dependencies</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><externalResourceDependencies> |
| <externalResourceDependency> |
| <key>[String]</key> |
| <description>[String] </description> |
| <interfaceName>[String]</interfaceName> |
| <optional>true|false</optional> |
| </externalResourceDependency> |
| |
| <externalResourceDependency> |
| ... |
| </externalResourceDependency> |
| |
| ... |
| |
| </externalResourceDependencies></pre> |
| |
| <p>A primitive annotator may declare zero or more |
| <code class="literal"><externalResourceDependency></code> elements. Each |
| dependency has the following elements: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">key</code> – the |
| string by which the annotator code will attempt to access the resource. Must |
| be unique within this annotator.</p></li><li class="listitem"><p><code class="literal">description</code> – a textual |
| description of the dependency.</p></li><li class="listitem"><p><code class="literal">interfaceName</code> – the |
| fully-qualified name of the Java interface through which the annotator |
| will access the data. This is optional. If not specified, the annotator |
| can only get an InputStream to the data.</p></li><li class="listitem"><p><code class="literal">optional</code> – whether the |
| resource is optional. If false, an exception will be thrown if no resource |
| is assigned to satisfy this dependency. Defaults to false. </p> |
| </li></ul></div> |
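| |
| <p>For illustration, the following sketch declares one required dependency accessed through a Java interface and |
| one optional dependency read as a raw stream. The keys, descriptions, and the interface name |
| <code class="literal">org.myorg.MyDictionary</code> are hypothetical:</p> |
| |
| <pre class="programlisting"><externalResourceDependencies> |
|   <!-- keys and interface name are hypothetical --> |
|   <externalResourceDependency> |
|     <key>myResource</key> |
|     <description>Dictionary used to look up lemmas.</description> |
|     <interfaceName>org.myorg.MyDictionary</interfaceName> |
|     <optional>false</optional> |
|   </externalResourceDependency> |
|   <externalResourceDependency> |
|     <key>stopWords</key> |
|     <description>Optional stop word list, read as an InputStream.</description> |
|     <optional>true</optional> |
|   </externalResourceDependency> |
| </externalResourceDependencies></pre> |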
| |
| </div> |
| |
| <div class="section" title="2.4.1.9. Resource Manager Configuration"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration">2.4.1.9. Resource Manager Configuration</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><resourceManagerConfiguration> |
| |
| <name>[String]</name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <imports> |
| <import .../> |
| ... |
| </imports> |
| |
| <externalResources> |
| |
| <externalResource> |
| <name>[String]</name> |
| <description>[String]</description> |
| <fileResourceSpecifier> |
| <fileUrl>[URL]</fileUrl> |
| </fileResourceSpecifier> |
| <implementationName>[String]</implementationName> |
| </externalResource> |
| ... |
| </externalResources> |
| |
| <externalResourceBindings> |
| <externalResourceBinding> |
| <key>[String]</key> |
| <resourceName>[String]</resourceName> |
| </externalResourceBinding> |
| ... |
| </externalResourceBindings> |
| |
| </resourceManagerConfiguration></pre> |
| |
| <p>This element declares external resources and binds them to |
| annotators' external resource dependencies.</p> |
| |
| <p>The <code class="literal">resourceManagerConfiguration</code> element may |
| optionally contain an <code class="literal">import</code>, which allows resource |
| definitions to be stored in a separate (shareable) file. See <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a> for details.</p> |
| |
| <p>The <code class="literal">externalResources</code> element contains zero or |
| more <code class="literal">externalResource</code> elements, each of which |
| consists of: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">name</code> – the |
| name of the resource. This name is referred to in the bindings (see below). |
| Resource names need to be unique within any Aggregate Analysis Engine or |
| Collection Processing Engine, so the Java-like |
| <code class="literal">org.myorg.mycomponent.MyResource</code> syntax is |
| recommended.</p></li><li class="listitem"><p><code class="literal">description</code> – English |
| description of the resource.</p></li><li class="listitem"><p>Resource Specifier – |
| Declares the location of the resource. There are different |
| possibilities for how this is done (see below).</p></li><li class="listitem"><p><code class="literal">implementationName</code> – The |
| fully-qualified name of the Java class that will be instantiated from the |
| resource data. This is optional; if not specified, the resource will be |
| accessible as an input stream to the raw data. If specified, the Java class |
| must implement the <code class="literal">interfaceName</code> that is |
| specified in the External Resource Dependency to which it is bound. |
| </p></li></ul></div> |
| |
| <p>One possibility for the resource specifier is a |
| <code class="literal"><fileResourceSpecifier></code>, as shown above. This |
| simply declares a URL to the resource data. This support is built on the Java |
| class URL and its method URL.openStream(); it supports the protocols |
| <span class="quote">“<span class="quote">file</span>”</span>, <span class="quote">“<span class="quote">http</span>”</span> and <span class="quote">“<span class="quote">jar</span>”</span> (for |
| referring to files in jars) by default, and you can plug in handlers for other |
| protocols. The URL has to start with file: (or some other protocol). It is |
| relative to either the classpath or the <span class="quote">“<span class="quote">data path</span>”</span>. The data |
| path works like the classpath but can be set programmatically via |
| <code class="literal">ResourceManager.setDataPath()</code>. Setting the Java |
| System property <code class="literal">uima.datapath</code> also works.</p> |
| |
| <p><code class="literal">file:com/apache.d.txt</code> is a relative path; |
| relative paths for resources are resolved using the classpath and/or the |
| datapath. For the file protocol, URLs starting with file:/ or file:/// are |
| absolute. Note that <code class="literal">file://org/apache/d.txt</code> is NOT an |
| absolute path starting with <span class="quote">“<span class="quote">org</span>”</span>. The <span class="quote">“<span class="quote">//</span>”</span> |
| indicates that what follows is a host name. Therefore, if you try to use this URL, |
| you will get an error saying that it cannot connect to the host <span class="quote">“<span class="quote">org</span>”</span>. |
| </p> |
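| |
| <p>The distinction between these URL forms can be summarized in a short sketch. The paths are hypothetical; only the |
| prefix of each URL matters here:</p> |
| |
| <pre class="programlisting"><!-- Relative: resolved using the classpath and/or the data path --> |
| <fileUrl>file:org/myorg/dictionary.txt</fileUrl> |
| |
| <!-- Absolute: both forms name the same local file --> |
| <fileUrl>file:/opt/uima/data/dictionary.txt</fileUrl> |
| <fileUrl>file:///opt/uima/data/dictionary.txt</fileUrl> |
| |
| <!-- Not a local path: "org" is interpreted as a host name --> |
| <fileUrl>file://org/myorg/dictionary.txt</fileUrl></pre> |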
| |
| <p>The URL value may contain references to external override variables using the |
| <code class="literal">${variable-name}</code> syntax, |
| e.g. <code class="literal">file:com/${dictUrl}.txt</code>. |
| If a variable is undefined the value is left unmodified and a warning message |
| identifies the missing variable. |
| </p> |
| |
| <p>Another option is a |
| <code class="literal"><fileLanguageResourceSpecifier></code>, which is |
| intended to support resources, such as dictionaries, that depend on the |
| language of the document being processed. Instead of a single URL, a prefix and |
| suffix are specified, like this: |
| |
| |
| </p><pre class="programlisting"><fileLanguageResourceSpecifier> |
| <fileUrlPrefix>file:FileLanguageResource_implTest_data_</fileUrlPrefix> |
| <fileUrlSuffix>.dat</fileUrlSuffix> |
| </fileLanguageResourceSpecifier></pre> |
| |
| <p>The URL of the actual resource is then formed by concatenating the prefix, |
| the language of the document (as an ISO language code, e.g. |
| <code class="literal">en</code> or <code class="literal">en-US</code> |
| – see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.capabilities" title="2.4.1.6. Capabilities">Section 2.4.1.6, “Capabilities”</a> for more |
| information), and the suffix. With the prefix and suffix shown above, a document whose language is |
| <code class="literal">en</code> yields the URL |
| <code class="literal">file:FileLanguageResource_implTest_data_en.dat</code>.</p> |
| |
| <p>A third option is a <code class="literal">customResourceSpecifier</code>, which allows |
| you to plug in an arbitrary Java class. See <a class="xref" href="#ugr.ref.xml.component_descriptor.custom_resource_specifiers" title="2.8. Custom Resource Specifiers">Section 2.8, “Custom Resource Specifiers”</a> |
| for more information.</p> |
| |
| <p>The <code class="literal">externalResourceBindings</code> element declares |
| which resources are bound to which dependencies. Each |
| <code class="literal">externalResourceBinding</code> consists of: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">key</code> – |
| identifies the dependency. For a binding declared in a primitive analysis |
| engine descriptor, this must match the value of the |
| <code class="literal">key</code> element of one of the |
| <code class="literal">externalResourceDependency</code> elements. Bindings |
| may also be specified in aggregate analysis engine descriptors, in which |
| case a compound key is used |
| – see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" title="2.4.2.4. External Resource Bindings">Section 2.4.2.4, “External Resource Bindings”</a> |
| .</p></li><li class="listitem"><p><code class="literal">resourceName</code> – the name of |
| the resource satisfying the dependency. This must match the value of the |
| <code class="literal">name</code> element of one of the |
| <code class="literal">externalResource</code> declarations. </p> |
| </li></ul></div> |
| |
| <p>A given resource dependency may be bound to only one external resource, |
| but one external resource may be bound to many dependencies, which allows |
| resource sharing.</p> |
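| |
| <p>As a minimal sketch for a primitive analysis engine, the following configuration declares one resource and binds |
| it to a dependency. It assumes the annotator declares a dependency with key <code class="literal">myResource</code>; |
| the resource name, file URL, and implementation class are hypothetical:</p> |
| |
| <pre class="programlisting"><resourceManagerConfiguration> |
|   <!-- resource name, URL and class name are hypothetical --> |
|   <externalResources> |
|     <externalResource> |
|       <name>org.myorg.mycomponent.MyDictionary</name> |
|       <description>Dictionary shared by the annotator.</description> |
|       <fileResourceSpecifier> |
|         <fileUrl>file:org/myorg/dictionary.txt</fileUrl> |
|       </fileResourceSpecifier> |
|       <implementationName>org.myorg.MyDictionaryImpl</implementationName> |
|     </externalResource> |
|   </externalResources> |
|   <externalResourceBindings> |
|     <externalResourceBinding> |
|       <key>myResource</key> |
|       <resourceName>org.myorg.mycomponent.MyDictionary</resourceName> |
|     </externalResourceBinding> |
|   </externalResourceBindings> |
| </resourceManagerConfiguration></pre> |
| |
| <p>Because an <code class="literal">implementationName</code> is given, that class would have to implement the |
| <code class="literal">interfaceName</code> declared by the dependency it is bound to.</p> |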
| </div> |
| |
| <div class="section" title="2.4.1.10. Environment Variable References"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.environment_variable_references">2.4.1.10. Environment Variable References</h4></div></div></div> |
| |
| |
| <p>In several places throughout the descriptor, it is possible to reference |
| environment variables. In Java, these are actually references to Java system |
| properties. To reference system environment variables from a Java analysis |
| engine you must pass the environment variables into the Java virtual machine |
| by using the <code class="literal">-D</code> option on the <code class="literal">java</code> |
| command line.</p> |
| |
| <p>The syntax for environment variable references is |
| <code class="literal"><envVarRef>[VariableName]</envVarRef></code> |
| , where [VariableName] is any valid Java system property name. Environment |
| variable references are valid in the following places: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>The value of a |
| configuration parameter (String-valued parameters only)</p> |
| </li><li class="listitem"><p>The |
| <code class="literal"><annotatorImplementationName></code> element |
| of a primitive AE descriptor</p></li><li class="listitem"><p>The <code class="literal"><name></code> element within |
| <code class="literal"><analysisEngineMetaData></code></p> |
| </li><li class="listitem"><p>Within a |
| <code class="literal"><fileResourceSpecifier></code> or |
| <code class="literal"><fileLanguageResourceSpecifier></code> |
| </p></li></ul></div> |
| |
| <p>For example, if the value of a configuration parameter were specified as: |
| <code class="literal"><string><envVarRef>TEMP_DIR</envVarRef>/temp.dat</string></code> |
| , and the value of the <code class="literal">TEMP_DIR</code> Java System property were |
| <code class="literal">c:/temp</code>, then the configuration parameter's |
| value would evaluate to <code class="literal">c:/temp/temp.dat</code>.</p> |
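| |
| <p>Placed in context, the example above might look like the following sketch. The parameter name |
| <code class="literal">OutputFile</code> is hypothetical, and the surrounding |
| <code class="literal"><nameValuePair></code> structure is assumed to follow the configuration parameter settings |
| syntax described elsewhere in this chapter. The JVM would be started with |
| <code class="literal">-DTEMP_DIR=c:/temp</code> on the <code class="literal">java</code> command line:</p> |
| |
| <pre class="programlisting"><nameValuePair> |
|   <!-- OutputFile is a hypothetical parameter name --> |
|   <name>OutputFile</name> |
|   <value> |
|     <string><envVarRef>TEMP_DIR</envVarRef>/temp.dat</string> |
|   </value> |
| </nameValuePair></pre> |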
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The Component Descriptor Editor does not support |
| environment variable references. If you need to, however, you |
| can use the <code class="code">source</code> tab view in the CDE to manually |
| add this notation. |
| </p></div> |
| |
| </div> |
| </div> |
| <div class="section" title="2.4.2. Aggregate Analysis Engine Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate">2.4.2. Aggregate Analysis Engine Descriptors</h3></div></div></div> |
| |
| |
| <p>Aggregate Analysis Engines do not contain an annotator, but instead |
| contain one or more component (also called <span class="emphasis"><em>delegate</em></span>) |
| analysis engines.</p> |
| |
| <p>Aggregate Analysis Engine Descriptors maintain most of the same structure |
| as Primitive Analysis Engine Descriptors. The differences are:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An Aggregate Analysis Engine Descriptor |
| contains the element |
| <code class="literal"><primitive>false</primitive></code> rather |
| than <code class="literal"><primitive>true</primitive></code>. |
| </p></li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor must not include a |
| <code class="literal"><annotatorImplementationName></code> |
| element.</p></li><li class="listitem"><p>In place of the |
| <code class="literal"><annotatorImplementationName></code>, an Aggregate |
| Analysis Engine Descriptor must have a |
| <code class="literal"><delegateAnalysisEngineSpecifiers></code> |
| element. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1. Delegate Analysis Engine Specifiers">Section 2.4.2.1, “Delegate Analysis Engine Specifiers”</a>.</p> |
| </li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor may provide a |
| <code class="literal"><flowController></code> element immediately |
| following the |
| <code class="literal"><delegateAnalysisEngineSpecifiers></code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.flow_controller" title="2.4.2.2. FlowController">Section 2.4.2.2, “FlowController”</a>.</p></li><li class="listitem"><p>Under the <code class="literal">analysisEngineMetaData</code> element, an Aggregate |
| Analysis Engine Descriptor may specify an additional element, |
| <code class="literal"><flowConstraints></code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints" title="2.4.2.3. FlowConstraints">Section 2.4.2.3, “FlowConstraints”</a>. Typically only one |
| of <code class="literal"><flowController></code> and |
| <code class="literal"><flowConstraints></code> is specified. If both are |
| specified, the <code class="literal"><flowController></code> takes |
| precedence, and the flow controller implementation can use the information |
| specified in the <code class="literal"><flowConstraints></code> as part of |
| its configuration input.</p></li><li class="listitem"><p>An Aggregate Analysis Engine Descriptor must not contain a |
| <code class="literal"><typeSystemDescription></code> element. The Type |
| System of the Aggregate Analysis Engine is derived by merging the Type Systems |
| of the Analysis Engines that the aggregate contains.</p></li><li class="listitem"><p>Within aggregate Analysis Engine Descriptors, |
| <code class="literal"><configurationParameter></code> elements may define |
| <code class="literal"><overrides></code>. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides" title="2.4.3.3. Configuration Parameter Overrides">Section 2.4.3.3, “Configuration Parameter Overrides”</a> |
| .</p></li><li class="listitem"><p>External Resource Bindings can bind resources to |
| dependencies declared by any delegate AE within the aggregate. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" title="2.4.2.4. External Resource Bindings">Section 2.4.2.4, “External Resource Bindings”</a>.</p> |
| </li><li class="listitem"><p>An additional optional element, |
| <code class="literal"><sofaMappings></code>, may be included. </p> |
| </li></ul></div> |
| |
| <div class="section" title="2.4.2.1. Delegate Analysis Engine Specifiers"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.delegates">2.4.2.1. Delegate Analysis Engine Specifiers</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><delegateAnalysisEngineSpecifiers> |
| |
| <delegateAnalysisEngine key="[String]"> |
| <analysisEngineDescription>...</analysisEngineDescription> | |
| <import .../> |
| </delegateAnalysisEngine> |
| |
| <delegateAnalysisEngine key="[String]"> |
| ... |
| </delegateAnalysisEngine> |
| |
| ... |
| |
| </delegateAnalysisEngineSpecifiers></pre> |
| |
| <p>The <code class="literal">delegateAnalysisEngineSpecifiers</code> element |
| contains one or more <code class="literal">delegateAnalysisEngine</code> |
| elements. Each of these must have a unique key, and must contain |
| either:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>A complete |
| <code class="literal">analysisEngineDescription</code> element describing the |
| delegate analysis engine <span class="bold"><strong>OR</strong></span></p> |
| </li><li class="listitem"><p>An <code class="literal">import</code> element giving the name or |
| location of the XML descriptor for the delegate analysis engine (see <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a>).</p></li></ul></div> |
| |
| <p>The latter is the much more common usage, and is the only form supported by |
| the Component Descriptor Editor tool.</p> |
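| |
| <p>A minimal sketch using imports might look like this. The keys, file location, and descriptor name are |
| hypothetical; the first import is by location and the second by name:</p> |
| |
| <pre class="programlisting"><delegateAnalysisEngineSpecifiers> |
|   <!-- keys and import targets are hypothetical --> |
|   <delegateAnalysisEngine key="tokenizer"> |
|     <import location="Tokenizer.xml"/> |
|   </delegateAnalysisEngine> |
|   <delegateAnalysisEngine key="tagger"> |
|     <import name="org.myorg.Tagger"/> |
|   </delegateAnalysisEngine> |
| </delegateAnalysisEngineSpecifiers></pre> |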
| </div> |
| <div class="section" title="2.4.2.2. FlowController"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_controller">2.4.2.2. FlowController</h4></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><flowController key="[String]"> |
| <flowControllerDescription>...</flowControllerDescription> | |
| <import .../> |
| </flowController></pre> |
| |
| <p>The optional <code class="literal">flowController</code> element identifies |
| the descriptor of the FlowController component that will be used to determine |
| the order in which delegate Analysis Engines are called.</p> |
| |
| <p>The <code class="literal">key</code> attribute is optional, but recommended; it |
| assigns the FlowController an identifier that can be used for configuration |
| parameter overrides, Sofa mappings, or external resource bindings. The key |
| must not be the same as any of the delegate analysis engine keys.</p> |
| |
| <p>As with the <code class="literal">delegateAnalysisEngine</code> element, the |
| <code class="literal">flowController</code> element may contain either a complete |
| <code class="literal">flowControllerDescription</code> or an |
| <code class="literal">import</code>, but the import is recommended. The Component |
| Descriptor Editor tool only supports imports here.</p> |
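| |
| <p>For example, a flow controller could be declared by import as sketched here. The key and the descriptor location |
| are hypothetical; the key is chosen so that it does not clash with any delegate analysis engine key:</p> |
| |
| <pre class="programlisting"><!-- key and location are hypothetical --> |
| <flowController key="flowControl"> |
|   <import location="MyFlowController.xml"/> |
| </flowController></pre> |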
| |
| </div> |
| <div class="section" title="2.4.2.3. FlowConstraints"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints">2.4.2.3. FlowConstraints</h4></div></div></div> |
| |
| |
| <p>If a <code class="literal"><flowController></code> is not specified, the |
| order in which delegate Analysis Engines are called within the aggregate |
| Analysis Engine is specified using the |
| <code class="literal"><flowConstraints></code> element, which must occur |
| immediately following the |
| <code class="literal">configurationParameterSettings</code> element. If a |
| <code class="literal"><flowController></code> is specified, then the |
| <code class="literal"><flowConstraints></code> are optional. They can be |
| used to pass an ordering of delegate keys to the |
| <code class="literal"><flowController></code>.</p> |
| |
| <p>There are two options for flow constraints: |
| <code class="literal"><fixedFlow></code> or |
| <code class="literal"><capabilityLanguageFlow></code>. Each is discussed |
| in a separate section below.</p> |
| |
| <div class="section" title="Fixed Flow"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints.fixed_flow">Fixed Flow</h5></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><flowConstraints> |
| <fixedFlow> |
| <node>[String]</node> |
| <node>[String]</node> |
| ... |
| </fixedFlow> |
| </flowConstraints></pre> |
| |
| <p>The <code class="literal">flowConstraints</code> element must be included |
| immediately following the |
| <code class="literal">configurationParameterSettings</code> element.</p> |
| |
| <p>The <code class="literal">flowConstraints</code> element must |
| contain either a <code class="literal">fixedFlow</code> element, described here, or a |
| <code class="literal">capabilityLanguageFlow</code> element, described in the next section.</p> |
| |
| <p>The <code class="literal">fixedFlow</code> element contains one or more |
| <code class="literal">node</code> elements, each of which contains an identifier |
| which must match the key of a delegate analysis engine specified in the |
| <code class="literal">delegateAnalysisEngineSpecifiers</code> |
| element.</p> |
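| |
| <p>As a sketch, an aggregate whose delegates have the hypothetical keys <code class="literal">tokenizer</code> and |
| <code class="literal">tagger</code> could run them in that order with:</p> |
| |
| <pre class="programlisting"><flowConstraints> |
|   <fixedFlow> |
|     <!-- node values are hypothetical delegate keys --> |
|     <node>tokenizer</node> |
|     <node>tagger</node> |
|   </fixedFlow> |
| </flowConstraints></pre> |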
| |
| </div> |
| <div class="section" title="Capability Language Flow"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.flow_constraints.capability_language_flow">Capability Language Flow</h5></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><flowConstraints> |
| <capabilityLanguageFlow> |
| <node>[String]</node> |
| <node>[String]</node> |
| ... |
| </capabilityLanguageFlow> |
| </flowConstraints></pre> |
| |
| <p>If you use <code class="literal"><capabilityLanguageFlow></code>, |
| the delegate Analysis Engines named by the |
| <code class="literal"><node></code> elements are called in the given order, |
| except that a delegate Analysis Engine is skipped if any of the following are |
| true (according to that Analysis Engine's declared output |
| capabilities):</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>It cannot produce any of the aggregate |
| Analysis Engine's output capabilities for the language of the |
| current document.</p></li><li class="listitem"><p>All of the output capabilities have already been |
| produced by an earlier Analysis Engine in the flow. </p></li></ul></div> |
| |
| <p>For example, if two annotators produce |
| <code class="literal">org.myorg.TokenAnnotation</code> feature structures for |
| the same language, these feature structures will only be produced by the |
| first annotator in the list.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The flow analysis uses the specific types that are specified in the |
| output capabilities, without any expansion for subtypes. So, if you expect |
| a type TT and another type SubTT (which is a subtype of TT) in the output, you |
| must include both of them in the output capabilities.</p></div> |
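| |
| <p>For the situation in the note above, the output capabilities would name both types explicitly, as in this sketch. |
| The package prefix <code class="literal">org.myorg</code> is hypothetical:</p> |
| |
| <pre class="programlisting"><outputs> |
|   <!-- org.myorg.SubTT is assumed to be a subtype of org.myorg.TT --> |
|   <type>org.myorg.TT</type> |
|   <type>org.myorg.SubTT</type> |
| </outputs></pre> |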
| </div> |
| </div> |
| |
| <div class="section" title="2.4.2.4. External Resource Bindings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings">2.4.2.4. External Resource Bindings</h4></div></div></div> |
| |
| |
| <p>Aggregate analysis engine descriptors can declare resource bindings |
| that bind resources to dependencies declared in any of the delegate analysis |
| engines (or their subcomponents, recursively) within that aggregate. This |
| allows resource sharing. Any binding at this level overrides (supersedes) |
| any binding specified by a contained component or their subcomponents, |
| recursively.</p> |
| |
| <p>For example, consider an aggregate Analysis Engine Descriptor that |
| contains delegate Analysis Engines with keys |
| <code class="literal">annotator1</code> and <code class="literal">annotator2</code> (as |
| declared in the <code class="literal"><delegateAnalysisEngine></code> |
| element – see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1. Delegate Analysis Engine Specifiers">Section 2.4.2.1, “Delegate Analysis Engine Specifiers”</a>), |
| where <code class="literal">annotator1</code> declares a resource dependency with |
| key <code class="literal">myResource</code> and <code class="literal">annotator2</code> |
| declares a resource dependency with key <code class="literal">someResource</code> |
| .</p> |
| |
| <p>Within that aggregate Analysis Engine Descriptor, the following |
| <code class="literal">resourceManagerConfiguration</code> would bind both of |
| those dependencies to a single external resource file.</p> |
| |
| |
| <pre class="programlisting"><resourceManagerConfiguration> |
| |
| <externalResources> |
| <externalResource> |
| <name>ExampleResource</name> |
| <fileResourceSpecifier> |
| <fileUrl>file:MyResourceFile.dat</fileUrl> |
| </fileResourceSpecifier> |
| </externalResource> |
| </externalResources> |
| |
| <externalResourceBindings> |
| <externalResourceBinding> |
| <key>annotator1/myResource</key> |
| <resourceName>ExampleResource</resourceName> |
| </externalResourceBinding> |
| <externalResourceBinding> |
| <key>annotator2/someResource</key> |
| <resourceName>ExampleResource</resourceName> |
| </externalResourceBinding> |
| </externalResourceBindings> |
| |
| </resourceManagerConfiguration></pre> |
| |
| <p>The syntax for the <code class="literal">externalResources</code> declaration |
| is exactly the same as described previously. In the resource bindings, note the |
| use of the compound keys, e.g. <code class="literal">annotator1/myResource</code>. |
| This identifies the resource dependency key |
| <code class="literal">myResource</code> within the annotator with key |
| <code class="literal">annotator1</code>. Compound keys can be |
| multiple levels deep to handle nested aggregate analysis engines.</p> |
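| <p>To illustrate how a multi-level compound key decomposes, here is a minimal plain-Java sketch. <code class="literal">CompoundKey</code> and its members are hypothetical names chosen for this illustration; they are not part of the UIMA API, and the framework's own key handling may differ.</p> |

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not a UIMA class: splits a compound binding key. */
final class CompoundKey {
    final List<String> delegatePath; // delegate keys, outermost aggregate first
    final String dependencyKey;      // key declared by the innermost component

    private CompoundKey(List<String> delegatePath, String dependencyKey) {
        this.delegatePath = delegatePath;
        this.dependencyKey = dependencyKey;
    }

    static CompoundKey parse(String key) {
        // A limit of -1 keeps trailing empty segments so that "a/" is rejected below.
        String[] parts = key.split("/", -1);
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty segment in key: " + key);
            }
        }
        List<String> path = Arrays.asList(parts).subList(0, parts.length - 1);
        return new CompoundKey(Collections.unmodifiableList(path), parts[parts.length - 1]);
    }

    public static void main(String[] args) {
        CompoundKey k = parse("annotator1/myResource");
        System.out.println(k.delegatePath + " -> " + k.dependencyKey);
    }
}
```

| <p>In this sketch, a key such as <code class="literal">outer/annotator1/myResource</code> yields the delegate path <code class="literal">outer</code>, <code class="literal">annotator1</code> and the dependency key <code class="literal">myResource</code>.</p> |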
| </div> |
| |
| <div class="section" title="2.4.2.5. Sofa Mappings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.sofa_mappings">2.4.2.5. Sofa Mappings</h4></div></div></div> |
| |
| |
| <p>Sofa mappings are specified between Sofa names declared in this |
| aggregate descriptor as part of the |
| <code class="literal"><capability></code> section, and the Sofa names |
| declared in the delegate components. For purposes of the mapping, all the |
| declarations of Sofas in any of the capability sets contained within the |
| <code class="literal"><capabilities></code> element are considered |
| together.</p> |
| |
| |
| <pre class="programlisting"><sofaMappings> |
| <sofaMapping> |
| <componentKey>[keyName]</componentKey> |
| <componentSofaName>[sofaName]</componentSofaName> |
| <aggregateSofaName>[sofaName]</aggregateSofaName> |
| </sofaMapping> |
| ... |
| </sofaMappings></pre> |
| |
| <p>The <componentSofaName> may be omitted in the case where the |
| component is not aware of Multiple Views or Sofas. In this case, the UIMA |
| framework will arrange for the specified <aggregateSofaName> to be |
| the one visible to the delegate component.</p> |
| |
| <p>The <componentKey> is the key name for the component as specified |
| in the list of delegate components for this aggregate.</p> |
| |
| <p>The sofaNames used must be declared as input or output sofas in some |
| capability set.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="2.4.3. Configuration Parameters"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameters">2.4.3. Configuration Parameters</h3></div></div></div> |
| |
| <p>Configuration parameters may be declared and set in both Primitive and |
| Aggregate descriptors. Parameters set in an aggregate may override parameters set in one or |
| more of its delegates. |
| </p> |
| <div class="section" title="2.4.3.1. Configuration Parameter Declaration"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_declaration">2.4.3.1. Configuration Parameter Declaration</h4></div></div></div> |
| |
| |
| <p>Configuration Parameters are made available to annotator |
| implementations and applications by the following interfaces: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem" style="list-style-type: circle"><p> |
| <code class="literal">AnnotatorContext</code> <sup>[<a name="d5e690" href="#ftn.d5e690" class="footnote">2</a>]</sup> (passed as an argument to the |
| initialize() method of a version 1 annotator)</p> |
| </li><li class="listitem" style="list-style-type: circle"><p> |
| <code class="literal">ConfigurableResource</code> (every Analysis Engine |
| implements this interface)</p> |
| </li><li class="listitem" style="list-style-type: circle"><p> |
| <code class="literal">UimaContext</code> (passed |
| as an argument to the initialize() method of a version 2 annotator) (you can get |
| this from any resource, including Analysis Engines, using the method |
| <code class="literal">getUimaContext</code>()).</p> |
| </li></ul></div> |
| |
| <p>Use AnnotatorContext within version 1 annotators and UimaContext for |
| version 2 annotators and outside of annotators (for instance, in CasConsumers, |
| or the containing application) to access configuration parameters.</p> |
| |
| <p>Configuration parameters are set from the corresponding elements in the |
| XML descriptor for the application. If you need to programmatically change |
| parameter settings within an application, you can use methods in |
| ConfigurableResource; if you do this, you need to call reconfigure() |
| afterwards to have the UIMA framework notify all the contained analysis |
| components that the parameter configuration has changed (the analysis |
| engine's reinitialize() methods will be called). Note that in the current |
| implementation, only integrated deployment components have configuration |
| parameters passed to them; remote components obtain their parameters from |
| their remote startup environment. This will likely change in the |
| future.</p> |
| |
| <p>There are two ways to specify the |
| <code class="literal"><configurationParameters></code> section – as a |
| list of configuration parameters or a list of groups. A list of parameters, which |
| are not part of any group, looks like this: |
| |
| |
| </p><pre class="programlisting"><configurationParameters> |
| <configurationParameter> |
| <name>[String]</name> |
| <externalOverrideName>[String]</externalOverrideName> |
| <description>[String]</description> |
| <type>String|Integer|Float|Boolean</type> |
| <multiValued>true|false</multiValued> |
| <mandatory>true|false</mandatory> |
| <overrides> |
| <parameter>[String]</parameter> |
| <parameter>[String]</parameter> |
| ... |
| </overrides> |
| </configurationParameter> |
| <configurationParameter> |
| ... |
| </configurationParameter> |
| ... |
| </configurationParameters></pre> |
| |
| <p>For each configuration parameter, the following are specified:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>name</strong></span> |
| – the name by which the annotator code refers to the parameter (required). |
| All parameters declared in an analysis engine descriptor must have distinct |
| names. The name is composed of normal Java identifier characters.</p> |
| </li><li class="listitem"><p><span class="bold"><strong>externalOverrideName</strong></span> – the |
| name of a property in an external settings file that if defined overrides |
| any value set in this descriptor or in its parent. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides" title="2.4.3.4. External Configuration Parameter Overrides">Section 2.4.3.4, “External Configuration Parameter Overrides”</a> |
| for a discussion of external configuration parameter overrides. |
| (optional)</p></li><li class="listitem"><p><span class="bold"><strong>description</strong></span> – a |
| natural language description of the intent of the parameter |
| (optional)</p></li><li class="listitem"><p><span class="bold"><strong>type</strong></span> – the data |
| type of the parameter's value – must be one of |
| <code class="literal">String</code>, <code class="literal">Integer</code>, |
| <code class="literal">Float</code>, or <code class="literal">Boolean</code> |
| (required).</p></li><li class="listitem"><p><span class="bold"><strong>multiValued</strong></span> – |
| <code class="literal">true</code> if the parameter can take multiple values (an |
| array), <code class="literal">false</code> if the parameter takes only a single value |
| (optional, defaults to false).</p></li><li class="listitem"><p><span class="bold"><strong>mandatory</strong></span> – |
| <code class="literal">true</code> if a value must be provided for the parameter |
| (optional, defaults to false).</p></li><li class="listitem"><p><span class="bold"><strong>overrides</strong></span> – this |
| is used only in aggregate Analysis Engines, but is included here for |
| completeness. See <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides" title="2.4.3.3. Configuration Parameter Overrides">Section 2.4.3.3, “Configuration Parameter Overrides”</a> |
| for a discussion of configuration parameter overriding in aggregate |
| Analysis Engines. (optional).</p></li></ul></div> |
| |
| <p>A list of groups looks like this: |
| |
| |
| </p><pre class="programlisting"><configurationParameters defaultGroup="[String]" |
| searchStrategy="none|default_fallback|language_fallback" > |
| |
| <commonParameters> |
| [zero or more parameters] |
| </commonParameters> |
| |
| <configurationGroup names="name1 name2 name3 ..."> |
| [zero or more parameters] |
| </configurationGroup> |
| |
| <configurationGroup names="name4 name5 ..."> |
| [zero or more parameters] |
| </configurationGroup> |
| |
| ... |
| |
| </configurationParameters></pre> |
| |
| <p>Both the <code class="literal"><commonParameters></code> and |
| <code class="literal"><configurationGroup></code> elements contain zero or |
| more <code class="literal"><configurationParameter></code> elements, with |
| the same syntax described above.</p> |
| |
| <p>The <code class="literal"><commonParameters></code> element declares |
| parameters that exist in all groups. Each |
| <code class="literal"><configurationGroup></code> element has a names |
| attribute, which contains a list of group names separated by whitespace (space |
| or tab characters). Names consist of any number of non-whitespace characters; |
| however the Component Descriptor Editor tool restricts this to be normal Java |
| identifiers, including the period (.) and the dash (-). One configuration group |
| will be created for each name, and all of the groups will contain the same set of |
| parameters.</p> |
| |
| <p>The <code class="literal">defaultGroup</code> attribute specifies the name of the |
| group to be used in the case where an annotator does a lookup for a configuration |
| parameter without specifying a group name. It may also be used as a fallback if the |
| annotator specifies a group that does not exist – see below.</p> |
| |
| <p>The <code class="literal">searchStrategy</code> attribute determines the action |
| to be taken when the context is queried for the value of a parameter belonging to a |
| particular configuration group, if that group does not exist or does not contain |
| a value for the requested parameter. There are currently three possible values: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>none</strong></span> |
| – there is no fallback; return null if there is no value in the exact group |
| specified by the user.</p></li><li class="listitem"><p><span class="bold"><strong>default_fallback</strong></span> |
| – if there is no value found in the specified group, look in the default |
| group (as defined by the <code class="literal">defaultGroup</code> attribute)</p> |
| </li><li class="listitem"><p><span class="bold"><strong>language_fallback</strong></span> |
| – this setting allows for a specific use of configuration parameter |
| groups where the group names correspond to ISO language and country codes |
| (for an example, see below). The fallback sequence is: |
| <code class="literal"><lang>_<country>_<region> <span class="symbol">→</span> |
| <lang>_<country> <span class="symbol">→</span> <lang> <span class="symbol">→</span> |
| <default>.</code> </p></li></ul></div><p> |
| </p> |
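| <p>The <code class="literal">language_fallback</code> sequence can be sketched in plain Java as follows. This is an illustration of the documented lookup order only, not the framework's implementation: <code class="literal">GroupFallback</code>, <code class="literal">fallbackChain</code> and <code class="literal">lookup</code> are hypothetical names, and the sketch assumes underscore-separated group names as in the sequence above.</p> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the language_fallback order; not UIMA code. */
final class GroupFallback {

    /** Returns the groups to try, in order: the requested group, its prefixes, then the default. */
    static List<String> fallbackChain(String requested, String defaultGroup) {
        List<String> chain = new ArrayList<>();
        String group = requested;
        while (true) {
            chain.add(group);
            int i = group.lastIndexOf('_');
            if (i <= 0) {
                break; // no further suffix to strip
            }
            group = group.substring(0, i); // e.g. "pt_BR" becomes "pt"
        }
        if (defaultGroup != null && !chain.contains(defaultGroup)) {
            chain.add(defaultGroup);
        }
        return chain;
    }

    /** Returns the first value found along the chain, or null if no group defines the parameter. */
    static String lookup(Map<String, Map<String, String>> groups,
                         String requested, String defaultGroup, String param) {
        for (String group : fallbackChain(requested, defaultGroup)) {
            Map<String, String> settings = groups.get(group);
            if (settings != null && settings.containsKey(param)) {
                return settings.get(param);
            }
        }
        return null;
    }
}
```

| <p>With these assumptions, a request for group <code class="literal">pt_BR_SP</code> with default group <code class="literal">en</code> tries <code class="literal">pt_BR_SP</code>, <code class="literal">pt_BR</code>, <code class="literal">pt</code> and finally <code class="literal">en</code>.</p> |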
| |
| <div class="section" title="Example"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_declaration.example">Example</h5></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><configurationParameters defaultGroup="en" |
| searchStrategy="language_fallback"> |
| |
| <commonParameters> |
| <configurationParameter> |
| <name>DictionaryFile</name> |
| <description>Location of dictionary for this |
| language</description> |
| <type>String</type> |
| <multiValued>false</multiValued> |
| <mandatory>false</mandatory> |
| </configurationParameter> |
| </commonParameters> |
| |
| <configurationGroup names="en de en-US"/> |
| |
| <configurationGroup names="zh"> |
| <configurationParameter> |
| <name>DBC_Strategy</name> |
| <description>Strategy for dealing with double-byte |
| characters.</description> |
| <type>String</type> |
| <multiValued>false</multiValued> |
| <mandatory>false</mandatory> |
| </configurationParameter> |
| </configurationGroup> |
| |
| </configurationParameters></pre> |
| |
| <p>In this example, we are declaring a <code class="literal">DictionaryFile</code> |
| parameter that can have a different value for each of the languages that our AE |
| supports |
| – English (general), German, U.S. English, and Chinese. For Chinese |
| only, we also declare a <code class="literal">DBC_Strategy</code> |
| parameter.</p> |
| |
| <p>We are using the <code class="literal">language_fallback</code> search |
| strategy, so if an annotator requests the dictionary file for the |
| <code class="literal">en-GB</code> (British English) group, we will fall back to the |
| more general <code class="literal">en</code> group.</p> |
| |
| <p>Since we have defined <code class="literal">en</code> as the default group, this |
| value will be returned if the context is queried for the |
| <code class="literal">DictionaryFile</code> parameter without specifying any |
| group name, or if a nonexistent group name is specified.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="2.4.3.2. Configuration Parameter Settings"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_settings">2.4.3.2. Configuration Parameter Settings</h4></div></div></div> |
| |
| |
| <p>For configuration parameters that are not part of any group, the |
| <code class="literal"><configurationParameterSettings></code> element |
| looks like this: |
| |
| |
| </p><pre class="programlisting"><configurationParameterSettings> |
| <nameValuePair> |
| <name>[String]</name> |
| <value> |
| <string>[String]</string> | |
| <integer>[Integer]</integer> | |
| <float>[Float]</float> | |
| <boolean>true|false</boolean> | |
| <array> ... </array> |
| </value> |
| </nameValuePair> |
| |
| <nameValuePair> |
| ... |
| </nameValuePair> |
| ... |
| </configurationParameterSettings></pre> |
| |
| <p>There are zero or more <code class="literal">nameValuePair</code> elements. Each |
| <code class="literal">nameValuePair</code> contains a name (which refers to one of the |
| configuration parameters) and a value for that parameter.</p> |
| |
| <p>The <code class="literal">value</code> element contains an element that matches |
| the type of the parameter. For single-valued parameters, this is either |
| <code class="literal"><string></code>, <code class="literal"><integer></code>, |
| <code class="literal"><float></code>, or |
| <code class="literal"><boolean></code>. For multi-valued parameters, this is |
| an <code class="literal"><array></code> element, which then contains zero or |
| more instances of the appropriate type of primitive value, e.g.: |
| |
| |
| </p><pre class="programlisting"><array><string>One</string><string>Two</string></array></pre> |
| |
| <p>For parameters declared in configuration groups the |
| <code class="literal"><configurationParameterSettings></code> element |
| looks like this: |
| |
| |
| </p><pre class="programlisting"><configurationParameterSettings> |
| |
| <settingsForGroup name="[String]"> |
| [one or more <nameValuePair> elements] |
| </settingsForGroup> |
| |
| <settingsForGroup name="[String]"> |
| [one or more <nameValuePair> elements] |
| </settingsForGroup> |
| |
| ... |
| |
| </configurationParameterSettings></pre><p> |
| where each <code class="literal"><settingsForGroup></code> element has a name |
| that matches one of the configuration groups declared under the |
| <code class="literal"><configurationParameters></code> element and contains |
| the parameter settings for that group.</p> |
| |
| <div class="section" title="Example"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.component_descriptor.aes.configuration_parameter_settings.example">Example</h5></div></div></div> |
| |
| |
| <p>Here are the settings that correspond to the parameter declarations in |
| the previous example: |
| |
| |
| </p><pre class="programlisting"><configurationParameterSettings> |
| |
| <settingsForGroup name="en"> |
| <nameValuePair> |
| <name>DictionaryFile</name> |
| <value><string>resourcesEnglishdictionary.dat</string></value> |
| </nameValuePair> |
| </settingsForGroup> |
| |
| <settingsForGroup name="en-US"> |
| <nameValuePair> |
| <name>DictionaryFile</name> |
| <value><string>resourcesEnglish_USdictionary.dat</string></value> |
| </nameValuePair> |
| </settingsForGroup> |
| |
| <settingsForGroup name="de"> |
| <nameValuePair> |
| <name>DictionaryFile</name> |
| <value><string>resourcesDeutschdictionary.dat</string></value> |
| </nameValuePair> |
| </settingsForGroup> |
| |
| <settingsForGroup name="zh"> |
| <nameValuePair> |
| <name>DictionaryFile</name> |
| <value><string>resourcesChinesedictionary.dat</string></value> |
| </nameValuePair> |
| |
| <nameValuePair> |
| <name>DBC_Strategy</name> |
| <value><string>default</string></value> |
| </nameValuePair> |
| |
| </settingsForGroup> |
| |
| </configurationParameterSettings></pre> |
| </div> |
| </div> |
| |
| <div class="section" title="2.4.3.3. Configuration Parameter Overrides"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.aggregate.configuration_parameter_overrides">2.4.3.3. Configuration Parameter Overrides</h4></div></div></div> |
| |
| |
| <p>In an aggregate Analysis Engine Descriptor, each |
| <code class="literal"><configurationParameter></code> element should |
| contain an <code class="literal"><overrides></code> element, with the |
| following syntax:</p> |
| |
| |
| <pre class="programlisting"><overrides> |
| |
| <parameter> |
| [delegateAnalysisEngineKey]/[parameterName] |
| </parameter> |
| |
| <parameter> |
| [delegateAnalysisEngineKey]/[parameterName] |
| </parameter> |
| ... |
| |
| </overrides></pre> |
| |
| <p>Since aggregate Analysis Engines have no code associated with them, the |
| only way in which their configuration parameters can affect their processing |
| is by overriding the parameter values of one or more delegate analysis |
| engines. The <code class="literal"><overrides></code> element determines |
| which parameters, in which delegate Analysis Engines, are overridden by this |
| configuration parameter.</p> |
| |
| <p>For example, consider an aggregate Analysis Engine Descriptor that |
| contains delegate Analysis Engines with keys |
| <code class="literal">annotator1</code> and <code class="literal">annotator2</code> (as |
| declared in the <delegateAnalysisEngine> element – see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.aggregate.delegates" title="2.4.2.1. Delegate Analysis Engine Specifiers">Section 2.4.2.1, “Delegate Analysis Engine Specifiers”</a>) and also declares a |
| configuration parameter as follows: |
| |
| |
| </p><pre class="programlisting"><configurationParameter> |
| <name>AggregateParam</name> |
| <type>String</type> |
| <overrides> |
| <parameter>annotator1/param1</parameter> |
| <parameter>annotator2/param2</parameter> |
| </overrides> |
| </configurationParameter></pre> |
| |
| <p>The value of the <code class="literal">AggregateParam</code> parameter |
| (whether assigned in the aggregate descriptor or at runtime by an |
| application) will override the value of parameter |
| <code class="literal">param1</code> in <code class="literal">annotator1</code> and also |
| override the value of parameter <code class="literal">param2</code> in |
| <code class="literal">annotator2</code>. No other parameters will be |
| affected. Note that <code class="literal">AggregateParam</code> may itself be overridden by a |
| parameter in an outer aggregate that has this aggregate as one of its delegates. |
| </p> |
| |
| <p>Prior to release 2.4.1, if an aggregate Analysis Engine descriptor |
| declared a configuration parameter with no explicit overrides, that |
| parameter would override any parameters having the same name within any |
| delegate analysis engine. Starting with release 2.4.1, support for this |
| usage has been dropped.</p> |
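| <p>The effect of an <code class="literal"><overrides></code> list can be pictured with the following plain-Java sketch, which copies one aggregate value onto each listed delegate parameter. It is an illustration only: <code class="literal">OverridePropagation</code> and <code class="literal">apply</code> are hypothetical names, the nested maps stand in for the delegates' settings, and only single-level <code class="literal">delegateKey/parameterName</code> targets are handled.</p> |

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of override propagation; not UIMA code. */
final class OverridePropagation {

    /**
     * Writes the aggregate parameter's value to every "delegateKey/parameterName" target.
     * delegateSettings maps a delegate key to that delegate's (parameter name, value) pairs
     * and is modified in place; parameters that are not listed are left untouched.
     */
    static void apply(String aggregateValue, List<String> overrides,
                      Map<String, Map<String, String>> delegateSettings) {
        for (String target : overrides) {
            int slash = target.indexOf('/');
            if (slash <= 0 || slash == target.length() - 1) {
                throw new IllegalArgumentException("Expected delegateKey/parameterName: " + target);
            }
            String delegateKey = target.substring(0, slash);
            String parameterName = target.substring(slash + 1);
            delegateSettings
                .computeIfAbsent(delegateKey, k -> new HashMap<>())
                .put(parameterName, aggregateValue);
        }
    }
}
```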
| |
| </div> |
| |
| |
| <div class="section" title="2.4.3.4. External Configuration Parameter Overrides"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides">2.4.3.4. External Configuration Parameter Overrides</h4></div></div></div> |
| |
| |
| <p> |
| External parameter overrides are usually declared in primitive descriptors as a way to |
| easily modify the parameters in some or all of an application's annotators. |
| By using external settings files and shared parameter names the configuration |
| information can be specified without regard for a particular descriptor hierarchy. |
| </p> |
| |
| <p> |
| Configuration parameter declarations in primitive and aggregate descriptors may |
| include an <code class="literal"><externalOverrideName></code> element, |
| which specifies the name of a property that may be defined in an external settings file. |
| If this element is present, and if an entry can be found for its name in a settings |
| file, then this value overrides the value otherwise specified for this parameter. |
| </p> |
| |
| <p> |
| The value overrides any value set in this descriptor or set by an override in a parent |
| aggregate. In primitive descriptors the value set by an external override is always |
| applied. In aggregate descriptors the value set by an external override applies to the |
| aggregate parameter, and is passed down to the overridden delegate parameters in the |
| usual way, i.e. only if the delegate's parameter has not been set by an external override. |
| </p> |
| |
| <p> |
| In the absence of external overrides, |
| parameter evaluation can be viewed as proceeding from the primitive descriptor up through |
| any aggregates containing overrides, taking the last setting found. With external |
| overrides the search ends with the first external override found that has a value |
| assigned by a settings file. |
| </p> |
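| <p>The evaluation order described above can be modelled with a short plain-Java sketch. It is a simplified reading of the rule, not the framework's code: <code class="literal">ParameterResolution</code>, <code class="literal">Level</code> and <code class="literal">resolve</code> are hypothetical names, and each level is assumed to be either the primitive descriptor or an aggregate whose parameter overrides the one below it.</p> |

```java
import java.util.List;

/** Hypothetical model of parameter evaluation with external overrides; not UIMA code. */
final class ParameterResolution {

    /** One descriptor level; either value may be null when nothing is set there. */
    static final class Level {
        final String descriptorValue; // value set in the descriptor itself
        final String externalValue;   // value assigned by a settings file, if any

        Level(String descriptorValue, String externalValue) {
            this.descriptorValue = descriptorValue;
            this.externalValue = externalValue;
        }
    }

    /**
     * levels are ordered from the primitive descriptor outward to the outermost aggregate.
     * The first external value found ends the search; otherwise the last descriptor value wins.
     */
    static String resolve(List<Level> levels) {
        String result = null;
        for (Level level : levels) {
            if (level.externalValue != null) {
                return level.externalValue;
            }
            if (level.descriptorValue != null) {
                result = level.descriptorValue;
            }
        }
        return result;
    }
}
```

| <p>In this model, an external value at the primitive level is always returned, while an external value on an aggregate is used only when no level beneath it has one.</p> |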
| |
| <p> |
| The same external name may be used for multiple parameters; |
| the effect of this is that one setting will override multiple parameters. |
| </p> |
| |
| <p> |
| The settings for all descriptors in a pipeline are usually loaded from one or more files |
| whose names are obtained from the Java system property <span class="emphasis"><em>UimaExternalOverrides</em></span>. |
| The value of the property must be a comma-separated list of resource names. If the name |
| has a prefix of "file:" or no prefix, the filesystem is searched. If the name has a |
| prefix of "path:" the rest must be a Java-style dotted name, similar to the name |
| attribute for descriptor imports. The dots are replaced by file separators and a suffix |
| of ".settings" is appended before searching the datapath and classpath. |
| For example: <code class="literal">-DUimaExternalOverrides=/data/file1.settings,file:relative/file2.settings,path:org.apache.uima.resources.file3</code>. |
| </p> |
| |
| <p> |
| Override settings may also be specified when creating an analysis engine by putting a |
| <code class="literal">Settings</code> object in the additional parameters map for the |
| <code class="literal">produceAnalysisEngine</code> method. In this case the |
| Java system property <span class="emphasis"><em>UimaExternalOverrides</em></span> is ignored. |
| </p><pre class="programlisting"> // Construct an analysis engine that uses two settings files |
| Settings extSettings = |
| UIMAFramework.getResourceSpecifierFactory().createSettings(); |
| for (String fname : new String[] { "externalOverride.settings", |
| "default.settings" }) { |
| // try-with-resources closes the stream even if load() throws |
| try (FileInputStream fis = new FileInputStream(fname)) { |
| extSettings.load(fis); |
| } |
| } |
| Map<String,Object> aeParms = new HashMap<String,Object>(); |
| aeParms.put(Resource.PARAM_EXTERNAL_OVERRIDE_SETTINGS, extSettings); |
| AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc, aeParms); |
| </pre><p> |
| </p> |
| |
| <p> |
| These external settings consist of key-value pairs stored in a |
| file using the UTF-8 character encoding, and written in a style similar to that |
| of Java properties files. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem" style="list-style-type: circle"><p> |
| Leading whitespace is ignored. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Comment lines start with '#' or '!'. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| The key and value are separated by whitespace, '=' or ':'. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Keys must contain at least one character and only letters, digits, or the characters '. / - ~ _'. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| If a line ends with '\' it is extended with the following line (after removing any |
| leading whitespace). |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Whitespace is trimmed from both keys and values. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Duplicate key values are ignored – once a value is assigned to a key it cannot be changed. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Values may reference other settings using the syntax '${key}'. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| Array values are represented as a list of strings separated by commas or line breaks, |
| and bracketed by the '[ ]' characters. The value must start with an '[' and is |
| terminated by the first unescaped ']' which must be at the end of a line. |
| The elements of an array (and hence the array size) may be indirectly specified using |
| the '${key}' syntax but the brackets '[ ]' must be explicitly specified. |
| </p></li><li class="listitem" style="list-style-type: circle"><p> |
| In values the special characters '$ { } [ , ] \' are treated as regular characters if |
| preceded by the escape character '\'. |
| </p></li></ul></div><p> |
| </p><pre class="programlisting"> |
| key1 : value1 |
| key2 = value 2 |
| key3 element2, element3, element4 |
| # Next assignment is ignored as key3 has already been set |
| key3 : value ignored |
| key4 = [ array element1, ${key3}, element5 |
| element6 ] |
| key5 value with a reference ${key1} to key1 |
| key6 : long value string \ |
| continued from previous line (with leading whitespace stripped) |
| key7 = value without a reference \${not-a-key} |
| key8 \[ value that is not an array ] |
| key9 : [ array element1\, with embedded comma, element2 ] |
| </pre><p> |
| </p> |
| |
| <p> |
| Multiple settings files are allowed; they are loaded in order, such that |
| early ones take precedence over later ones, following the first-assignment-wins rule. |
| So, if you have lots of settings, |
| you can put the defaults in one file, and then in an earlier file, override just the |
| ones you need to. |
| </p> |
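| <p>The first-assignment-wins rule can be demonstrated with a deliberately simplified plain-Java reader. <code class="literal">SimpleSettings</code> is a hypothetical class written for this illustration, not the UIMA <code class="literal">Settings</code> implementation: it accepts ASCII keys only and omits line continuation, arrays, <code class="literal">${key}</code> references and escapes.</p> |

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified settings reader showing first-assignment-wins; not UIMA code. */
final class SimpleSettings {
    // Key: ASCII letters, digits or . / - ~ _ ; then whitespace, '=' or ':' ; then the value.
    private static final Pattern LINE =
        Pattern.compile("^([A-Za-z0-9./~_-]+)\\s*[=:]?\\s*(.*)$");

    private final Map<String, String> values = new LinkedHashMap<>();

    /** Loads one settings text; keys that already have a value keep it. */
    void load(String text) {
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("!")) {
                continue; // blank line or comment
            }
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                values.putIfAbsent(m.group(1), m.group(2).trim());
            }
        }
    }

    String get(String key) {
        return values.get(key);
    }
}
```

| <p>Loading the overriding text first and the defaults second leaves the overriding values in place, because later assignments to an existing key are ignored.</p> |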
| |
| <p> |
| An external override name may be specified for a parameter declared in a group, but if |
| the parameter is in the common group or the group is declared with multiple names, the |
| external name is shared amongst all, i.e. these parameters cannot be given group-specific values. |
| </p> |
| </div> |
| |
| <div class="section" title="2.4.3.5. Direct Access to External Configuration Parameters"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_access">2.4.3.5. Direct Access to External Configuration Parameters</h4></div></div></div> |
| |
| |
| <p> |
| Annotators and flow controllers can directly access these shared configuration |
| parameters from their UimaContext. |
| Direct access means an access where the key to select the shared parameter is the |
| parameter name as specified in the external configuration settings file. |
| </p><pre class="programlisting"> |
| String value = aContext.getSharedSettingValue(paramName); |
| String values[] = aContext.getSharedSettingArray(arrayParamName); |
| String allNames[] = aContext.getSharedSettingNames(); |
| </pre><p> |
| Java code called by an annotator or flow controller in the same thread or a child thread |
| can use the <code class="literal">UimaContextHolder</code> to get the annotator's UimaContext and |
| hence access the shared configuration parameters. |
| </p><pre class="programlisting"> |
| UimaContext uimaContext = UimaContextHolder.getUimaContext(); |
| if (uimaContext != null) { |
| value = uimaContext.getSharedSettingValue(paramName); |
| } |
| </pre><p> |
| The UIMA framework puts the context in an InheritableThreadLocal variable. The value |
| will be null if <code class="literal">getUimaContext</code> is not invoked by an annotator or flow |
| controller on the same thread or a child thread. |
| </p> |
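| <p>The thread behaviour described above follows from how <code class="literal">InheritableThreadLocal</code> works in Java. The self-contained sketch below shows that mechanism with a string in place of a real context; <code class="literal">ContextHolderDemo</code> is a hypothetical stand-in and is not the <code class="literal">UimaContextHolder</code> class.</p> |

```java
/** Hypothetical stand-in showing InheritableThreadLocal semantics; not UIMA code. */
final class ContextHolderDemo {
    private static final InheritableThreadLocal<String> CONTEXT = new InheritableThreadLocal<>();

    static void set(String context) {
        CONTEXT.set(context);
    }

    static String get() {
        return CONTEXT.get();
    }

    static void clear() {
        CONTEXT.remove();
    }

    /** Starts a child thread of the caller and returns the value that child observes. */
    static String readFromChildThread() {
        final String[] seen = new String[1];
        // The child copies the parent's inheritable values when the Thread object is created.
        Thread child = new Thread(() -> seen[0] = get());
        child.start();
        try {
            child.join(); // join() makes the child's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for child thread", e);
        }
        return seen[0];
    }
}
```

| <p>A thread that was not created by a thread holding a value sees <code class="literal">null</code>, which is why the null check in the preceding example is needed.</p> |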
| </div> |
| |
| <div class="section" title="2.4.3.6. Other Uses for External Configuration Parameters"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.component_descriptor.aes.other_uses_for_external_configuration_parameters">2.4.3.6. Other Uses for External Configuration Parameters</h4></div></div></div> |
| |
| <p> |
| Explicit references to shared configuration parameters can be specified as part of the |
| value of the name and location attributes of the <code class="literal">import</code> element |
| and in the value of the fileUrl for a <code class="literal">fileResourceSpecifier</code> |
| (see <a class="xref" href="#ugr.ref.xml.component_descriptor.imports" title="2.2. Imports">Section 2.2, “Imports”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>). |
| </p> |
| </div> |
| |
| </div> |
| </div> |
| |
| |
| <div class="section" title="2.5. Flow Controller Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.flow_controller">2.5. Flow Controller Descriptors</h2></div></div></div> |
| |
| |
| <p>The basic structure of a Flow Controller Descriptor is as follows: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0" ?> |
| <flowControllerDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| |
| <implementationName>[ClassName]</implementationName> |
| |
| <processingResourceMetaData> |
| ... |
| </processingResourceMetaData> |
| |
| <externalResourceDependencies> |
| ... |
| </externalResourceDependencies> |
| |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| |
| </flowControllerDescription></pre> |
| |
| <p>The <code class="literal">frameworkImplementation</code> element must always be set to |
| the value <code class="literal">org.apache.uima.java</code>.</p> |
| |
| <p>The <code class="literal">implementationName</code> element must contain the |
| fully-qualified class name of the Flow Controller implementation. This must name a |
| class that implements the <code class="literal">FlowController</code> interface.</p> |
| |
| <p>The <code class="literal">processingResourceMetaData</code> element contains |
| essentially the same information as a Primitive Analysis Engine Descriptor's |
| <code class="literal">analysisEngineMetaData</code> element, described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2. Analysis Engine MetaData">Section 2.4.1.2, “Analysis Engine MetaData”</a>.</p> |
| |
| <p>The <code class="literal">externalResourceDependencies</code> and |
| <code class="literal">resourceManagerConfiguration</code> elements are exactly the same as |
| in Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8. External Resource Dependencies">Section 2.4.1.8, “External Resource Dependencies”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>).</p> |
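| |
| <p>As an illustration, a filled-in Flow Controller Descriptor might look like the following sketch. |
| The class name <code class="literal">com.example.flow.FixedOrderFlowController</code> and all metadata |
| values are hypothetical. The two resource elements are left out on the assumption that, as in |
| Primitive Analysis Engine Descriptors, they may be omitted when no external resources are needed:</p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <flowControllerDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| |
| <!-- hypothetical class; it must implement the FlowController interface --> |
| <implementationName>com.example.flow.FixedOrderFlowController</implementationName> |
| |
| <processingResourceMetaData> |
| <!-- illustrative values only --> |
| <name>Example Fixed Order Flow Controller</name> |
| <description>Routes each CAS through the delegates in a fixed order.</description> |
| <version>1.0</version> |
| <vendor>Example Vendor</vendor> |
| </processingResourceMetaData> |
| |
| </flowControllerDescription></pre> |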
| |
| </div> |
| |
| <div class="section" title="2.6. Collection Processing Component Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.collection_processing_parts">2.6. Collection Processing Component Descriptors</h2></div></div></div> |
| |
| |
| <p>There are three types of Collection Processing Components – Collection |
| Readers, CAS Initializers (deprecated as of UIMA Version 2), and CAS Consumers. Each |
| type of component has a corresponding descriptor. The structure of these descriptors |
| is very similar to that of primitive Analysis Engine Descriptors.</p> |
| |
| <div class="section" title="2.6.1. Collection Reader Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader">2.6.1. Collection Reader Descriptors</h3></div></div></div> |
| |
| |
| <p>The basic structure of a Collection Reader descriptor is as follows: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0" ?> |
| <collectionReaderDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| <implementationName>[ClassName]</implementationName> |
| |
| <processingResourceMetaData> |
| ... |
| </processingResourceMetaData> |
| |
| <externalResourceDependencies> |
| ... |
| </externalResourceDependencies> |
| |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| |
| </collectionReaderDescription></pre> |
| |
| <p>The <code class="literal">frameworkImplementation</code> element must always be set |
| to the value <code class="literal">org.apache.uima.java</code>.</p> |
| |
| <p>The <code class="literal">implementationName</code> element contains the |
| fully-qualified class name of the Collection Reader implementation. This must name |
| a class that implements the <code class="literal">CollectionReader</code> |
| interface.</p> |
| |
| <p>The <code class="literal">processingResourceMetaData</code> element contains |
| essentially the same information as a Primitive Analysis Engine |
| Descriptor's <code class="literal">analysisEngineMetaData</code> element: |
| |
| |
| </p><pre class="programlisting"><processingResourceMetaData> |
| |
| <name>[String]</name> |
| <description>[String]</description> |
| <version>[String]</version> |
| <vendor>[String]</vendor> |
| |
| <configurationParameters> |
| ... |
| </configurationParameters> |
| |
| <configurationParameterSettings> |
| ... |
| </configurationParameterSettings> |
| |
| <typeSystemDescription> |
| ... |
| </typeSystemDescription> |
| |
| <typePriorities> |
| ... |
| </typePriorities> |
| |
| <fsIndexes> |
| ... |
| </fsIndexes> |
| |
| <capabilities> |
| ... |
| </capabilities> |
| |
| </processingResourceMetaData></pre> |
| |
| <p>The contents of these elements are the same as those described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2. Analysis Engine MetaData">Section 2.4.1.2, “Analysis Engine MetaData”</a>, with the exception that the capabilities |
| section should not declare any inputs (because the Collection Reader is always the |
| first component to receive the CAS).</p> |
| |
| <p>The <code class="literal">externalResourceDependencies</code> and |
| <code class="literal">resourceManagerConfiguration</code> elements are exactly the same |
| as in the Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8. External Resource Dependencies">Section 2.4.1.8, “External Resource Dependencies”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>).</p> |
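| |
| <p>To make the rule about capabilities concrete, the following sketch shows a |
| <code class="literal">processingResourceMetaData</code> element for a Collection Reader that declares |
| outputs but no inputs. The component name and the type name |
| <code class="literal">com.example.types.SourceDocument</code> are hypothetical, and the other optional |
| child elements are omitted for brevity:</p> |
| <pre class="programlisting"><processingResourceMetaData> |
| |
| <!-- illustrative name only --> |
| <name>Example File System Collection Reader</name> |
| |
| <capabilities> |
| <capability> |
| <!-- no inputs: the Collection Reader is the first component to receive the CAS --> |
| <inputs/> |
| <outputs> |
| <!-- hypothetical type name --> |
| <type allAnnotatorFeatures="true">com.example.types.SourceDocument</type> |
| </outputs> |
| </capability> |
| </capabilities> |
| |
| </processingResourceMetaData></pre> |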
| |
| </div> |
| <div class="section" title="2.6.2. CAS Initializer Descriptors (deprecated)"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.cas_initializer">2.6.2. CAS Initializer Descriptors (deprecated)</h3></div></div></div> |
| |
| |
| <p>The basic structure of a CAS Initializer Descriptor is as follows: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <casInitializerDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| <implementationName>[ClassName]</implementationName> |
| |
| <processingResourceMetaData> |
| ... |
| </processingResourceMetaData> |
| |
| <externalResourceDependencies> |
| ... |
| </externalResourceDependencies> |
| |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| |
| </casInitializerDescription></pre> |
| |
| <p>The <code class="literal">frameworkImplementation</code> element must always be set |
| to the value <code class="literal">org.apache.uima.java</code>.</p> |
| |
| <p>The <code class="literal">implementationName</code> element contains the |
| fully-qualified class name of the CAS Initializer implementation. This must name a |
| class that implements the <code class="literal">CasInitializer</code> interface.</p> |
| |
| <p>The <code class="literal">processingResourceMetaData</code> element contains |
| essentially the same information as a Primitive Analysis Engine |
| Descriptor's <code class="literal">analysisEngineMetaData</code> element, |
| as described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2. Analysis Engine MetaData">Section 2.4.1.2, “Analysis Engine MetaData”</a>, with the exception of some |
| changes to the capabilities section. A CAS Initializer's capabilities |
| element looks like this: |
| |
| |
| </p><pre class="programlisting"><capabilities> |
| <capability> |
| <outputs> |
| <type allAnnotatorFeatures="true|false">[String]</type> |
| <type>[TypeName]</type> |
| ... |
| <feature>[TypeName]:[Name]</feature> |
| ... |
| </outputs> |
| |
| <outputSofas> |
| <sofaName>[name]</sofaName> |
| ... |
| </outputSofas> |
| |
| <mimeTypesSupported> |
| <mimeType>[MIME Type]</mimeType> |
| ... |
| </mimeTypesSupported> |
| </capability> |
| |
| <capability> |
| ... |
| </capability> |
| ... |
| </capabilities></pre> |
| |
| <p>A CAS Initializer's capabilities declaration differs from an Analysis Engine's |
| capabilities declaration in three ways. The CAS Initializer does not |
| declare any input CAS types and features or input Sofas (because it is always the first |
| to operate on a CAS). It does not have a language specifier. Finally, the CAS |
| Initializer may declare a set of MIME types that it supports for its input documents. |
| Examples include text/plain, text/html, and application/pdf. For a list of MIME |
| types see <a class="ulink" href="http://www.iana.org/assignments/media-types/" target="_top">http://www.iana.org/assignments/media-types/</a>. This |
| information is currently provided only for users' reference; the framework does not |
| use it. This may change in future versions.</p> |
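| |
| <p>For example, a CAS Initializer that converts HTML and PDF documents might declare its |
| capabilities roughly as follows. This is only a sketch: the type name |
| <code class="literal">com.example.types.Paragraph</code> and the Sofa name |
| <code class="literal">plainText</code> are hypothetical.</p> |
| <pre class="programlisting"><capabilities> |
| <capability> |
| <outputs> |
| <!-- hypothetical type name --> |
| <type allAnnotatorFeatures="true">com.example.types.Paragraph</type> |
| </outputs> |
| |
| <outputSofas> |
| <!-- hypothetical Sofa name --> |
| <sofaName>plainText</sofaName> |
| </outputSofas> |
| |
| <mimeTypesSupported> |
| <mimeType>text/html</mimeType> |
| <mimeType>application/pdf</mimeType> |
| </mimeTypesSupported> |
| </capability> |
| </capabilities></pre> |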
| |
| <p>The <code class="literal">externalResourceDependencies</code> and |
| <code class="literal">resourceManagerConfiguration</code> elements are exactly the same |
| as in the Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8. External Resource Dependencies">Section 2.4.1.8, “External Resource Dependencies”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>).</p> |
| |
| </div> |
| <div class="section" title="2.6.3. CAS Consumer Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.component_descriptor.collection_processing_parts.cas_consumer">2.6.3. CAS Consumer Descriptors</h3></div></div></div> |
| |
| |
| <p>The basic structure of a CAS Consumer Descriptor is as follows: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <casConsumerDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| |
| <implementationName>[ClassName]</implementationName> |
| |
| <processingResourceMetaData> |
| ... |
| </processingResourceMetaData> |
| |
| <externalResourceDependencies> |
| ... |
| </externalResourceDependencies> |
| |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| </casConsumerDescription></pre> |
| |
| <p>The <code class="literal">frameworkImplementation</code> element must be set to |
| either <code class="literal">org.apache.uima.java</code> or |
| <code class="literal">org.apache.uima.cpp</code>.</p> |
| |
| <p>The <code class="literal">implementationName</code> element must contain the |
| fully-qualified class name of the CAS Consumer implementation, or the name |
| of a .dll or .so file for C++ implementations. For Java, the named class must |
| implement the <code class="literal">CasConsumer</code> interface.</p> |
| |
| <p>The <code class="literal">processingResourceMetaData</code> element contains |
| essentially the same information as a Primitive Analysis Engine Descriptor's |
| <code class="literal">analysisEngineMetaData</code> element, described in <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.metadata" title="2.4.1.2. Analysis Engine MetaData">Section 2.4.1.2, “Analysis Engine MetaData”</a>, except that the CAS Consumer Descriptor's |
| <code class="literal">capabilities</code> element should not declare outputs or |
| outputSofas (since CAS Consumers do not modify the CAS).</p> |
| |
| <p>The <code class="literal">externalResourceDependencies</code> and |
| <code class="literal">resourceManagerConfiguration</code> elements are exactly the same |
| as in Primitive Analysis Engine Descriptors (see <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.external_resource_dependencies" title="2.4.1.8. External Resource Dependencies">Section 2.4.1.8, “External Resource Dependencies”</a> and <a class="xref" href="#ugr.ref.xml.component_descriptor.aes.primitive.resource_manager_configuration" title="2.4.1.9. Resource Manager Configuration">Section 2.4.1.9, “Resource Manager Configuration”</a>).</p> |
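| |
| <p>A filled-in CAS Consumer Descriptor for a Java implementation might look like the sketch below. |
| The class name <code class="literal">com.example.consumer.PersonIndexWriter</code> and the type name |
| <code class="literal">com.example.types.Person</code> are hypothetical, and the optional resource |
| elements are assumed to be unnecessary here. Note that the capabilities declare inputs only:</p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <casConsumerDescription |
| xmlns="http://uima.apache.org/resourceSpecifier"> |
| |
| <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
| |
| <!-- hypothetical class; it must implement the CasConsumer interface --> |
| <implementationName>com.example.consumer.PersonIndexWriter</implementationName> |
| |
| <processingResourceMetaData> |
| <name>Example Person Index Writer</name> |
| <capabilities> |
| <capability> |
| <!-- inputs only: no outputs or outputSofas are declared --> |
| <inputs> |
| <!-- hypothetical type name --> |
| <type allAnnotatorFeatures="true">com.example.types.Person</type> |
| </inputs> |
| </capability> |
| </capabilities> |
| </processingResourceMetaData> |
| </casConsumerDescription></pre> |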
| |
| </div> |
| </div> |
| |
| <div class="section" title="2.7. Service Client Descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.service_client">2.7. Service Client Descriptors</h2></div></div></div> |
| |
| |
| <p>Service Client Descriptors specify only the location of a remote service, so they are |
| much simpler in structure than the other component descriptors. In the UIMA SDK, a Service Client Descriptor that |
| refers to a valid Analysis Engine or CAS Consumer service can be used in place of the |
| actual Analysis Engine or CAS Consumer Descriptor. The UIMA SDK will handle the details |
| of calling the remote service. (For details on <span class="emphasis"><em>deploying</em></span> an |
| Analysis Engine or CAS Consumer as a service, see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a>.)</p> |
| |
| <p>The UIMA SDK is extensible to support different types of remote services. In future |
| versions, there may be different variations of service client descriptors that cater |
| to different types of services. For now, the only type of service client descriptor is |
| the <code class="literal">uriSpecifier</code>, which supports the SOAP and Vinci |
| protocols.</p> |
| |
| |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> |
| <resourceType>AnalysisEngine | CasConsumer </resourceType> |
| <uri>[URI]</uri> |
| <protocol>SOAP | SOAPwithAttachments | Vinci</protocol> |
| <timeout>[Integer]</timeout> |
| <parameters> |
| <parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/> |
| <parameter name="VNS_PORT" value="9000"/> |
| <parameter name="GetMetaDataTimeout" value="[Integer]"/> |
| </parameters> |
| </uriSpecifier></pre> |
| |
| <p>The <code class="literal">resourceType</code> element is required for new descriptors, |
| but is currently allowed to be omitted for backward compatibility. It specifies the |
| type of component (Analysis Engine or CAS Consumer) that is implemented by the service |
| endpoint described by this descriptor.</p> |
| |
| <p>The <code class="literal">uri</code> element contains the URI for the web service. (Note |
| that in the case of Vinci, this will be the service name, which is looked up in the Vinci |
| Naming Service.)</p> |
| |
| <p>The <code class="literal">protocol</code> element may be set to SOAP, |
| SOAPwithAttachments, or Vinci; other protocols may be added later. These specify the |
| particular data transport format that will be used.</p> |
| |
| <p>The <code class="literal">timeout</code> element is optional. If present, it specifies |
| the number of milliseconds to wait for a request to be processed before an exception is |
| thrown. If the value is zero or negative, the client waits indefinitely. If no timeout is |
| specified, a default value (currently 60 seconds) will be used.</p> |
| |
| <p>The <code class="literal">parameters</code> element is optional. If present, it can specify values for any |
| of the following: |
| </p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="literal">VNS_HOST</code>: host name for the Vinci naming service. |
| </p></li><li class="listitem"><p><code class="literal">VNS_PORT</code>: port number for the Vinci naming service. |
| </p></li><li class="listitem"><p><code class="literal">GetMetaDataTimeout</code>: timeout period (in milliseconds) for |
| the GetMetaData call. If not specified, the default is 60 seconds. This may need |
| to be set higher if there are a lot of clients competing for connections to the service. |
| </p></li></ul></div> |
| |
| <p>If <code class="literal">VNS_HOST</code> and <code class="literal">VNS_PORT</code> are not specified |
| in the descriptor, their values are taken from |
| parameters passed on the Java command line using the |
| <code class="literal">-DVNS_HOST=<host></code> and/or |
| <code class="literal">-DVNS_PORT=<port></code> system arguments. If a value is present in |
| neither the descriptor nor a system argument, it defaults to |
| <code class="literal">localhost</code> for <code class="literal">VNS_HOST</code> and |
| <code class="literal">9000</code> for <code class="literal">VNS_PORT</code>.</p> |
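| |
| <p>Putting these elements together, a descriptor for an Analysis Engine deployed as a Vinci service |
| might look like the following sketch. The service name, host name, and timeout values are hypothetical. |
| Both timeouts are given in milliseconds, so 120000 corresponds to two minutes:</p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> |
| <resourceType>AnalysisEngine</resourceType> |
| <!-- for Vinci, the service name looked up in the Vinci Naming Service (hypothetical) --> |
| <uri>com.example.services.PersonAnnotator</uri> |
| <protocol>Vinci</protocol> |
| <!-- wait up to 120 seconds for each request --> |
| <timeout>120000</timeout> |
| <parameters> |
| <!-- hypothetical host; explicit values here take precedence over -DVNS_HOST and -DVNS_PORT --> |
| <parameter name="VNS_HOST" value="vns.example.org"/> |
| <parameter name="VNS_PORT" value="9000"/> |
| <parameter name="GetMetaDataTimeout" value="90000"/> |
| </parameters> |
| </uriSpecifier></pre> |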
| |
| <p>For details on how to deploy and call Analysis Engine and CAS Consumer services, see |
| <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a>.</p> |
| |
| </div> |
| |
| <div class="section" title="2.8. Custom Resource Specifiers"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.component_descriptor.custom_resource_specifiers">2.8. Custom Resource Specifiers</h2></div></div></div> |
| |
| <p>A Custom Resource Specifier allows you to plug in your own Java class as a UIMA Resource. |
| For example, you can support a new service protocol by plugging in a Java class that implements |
| the UIMA <code class="literal">AnalysisEngine</code> interface and communicates with the remote service.</p> |
| |
| <p>A Custom Resource Specifier has the following format:</p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> |
| <resourceClassName>[Java Class Name]</resourceClassName> |
| <parameters> |
| <parameter name="[String]" value="[String]"/> |
| <parameter name="[String]" value="[String]"/> |
| </parameters> |
| </customResourceSpecifier></pre> |
| |
| <p>The <code class="literal">resourceClassName</code> element must contain the fully-qualified name of a Java class |
| that can be found in the classpath (including the UIMA extension classpath, if you have specified one using |
| the <code class="literal">ResourceManager.setExtensionClassPath</code> method). This class must implement the |
| UIMA <code class="literal">Resource</code> interface.</p> |
| |
| <p>When an application calls the <code class="literal">UIMAFramework.produceResource</code> method and passes a |
| <code class="literal">CustomResourceSpecifier</code>, the UIMA framework will load the named class and call its |
| <code class="literal">initialize(ResourceSpecifier,Map)</code> method, passing the <code class="literal">CustomResourceSpecifier</code> |
| as the first argument. Your class can override the <code class="literal">initialize</code> method and use the |
| <code class="literal">CustomResourceSpecifier</code> API to get access to the <code class="literal">parameter</code> names and values |
| specified in the XML.</p> |
| |
| <p>If you are using a custom resource specifier to plug in a class that implements a new service protocol, |
| your class must also implement the <code class="literal">AnalysisEngine</code> interface. Generally it should also |
| extend <code class="literal">AnalysisEngineImplBase</code>. The key methods that should be implemented are |
| <code class="literal">getMetaData</code>, <code class="literal">processAndOutputNewCASes</code>, |
| <code class="literal">collectionProcessComplete</code>, and <code class="literal">destroy</code>.</p> |
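| <p>As a sketch of how this might be used, the specifier below plugs in a hypothetical class, |
| <code class="literal">com.example.remote.RestAnalysisEngine</code>, that talks to a remote service. |
| The parameter names <code class="literal">serviceUrl</code> and <code class="literal">timeoutMillis</code> |
| are invented for this illustration; the framework does not interpret them, and the class is expected to |
| read them in its <code class="literal">initialize</code> method:</p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8" ?> |
| <customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> |
| <!-- hypothetical class; it must implement Resource (and AnalysisEngine for a service protocol) --> |
| <resourceClassName>com.example.remote.RestAnalysisEngine</resourceClassName> |
| <parameters> |
| <!-- hypothetical parameter names, meaningful only to the class above --> |
| <parameter name="serviceUrl" value="http://services.example.org/analyze"/> |
| <parameter name="timeoutMillis" value="30000"/> |
| </parameters> |
| </customResourceSpecifier></pre> |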
| </div> |
| <div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e71" href="#d5e71" class="para">1</a>] </sup>This component is deprecated and should not be used in new |
| development.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e690" href="#d5e690" class="para">2</a>] </sup>Deprecated; use |
| UimaContext instead.</p></div></div></div> |
| <div class="chapter" title="Chapter 3. Collection Processing Engine Descriptor Reference" id="ugr.ref.xml.cpe_descriptor"><div class="titlepage"><div><div><h2 class="title">Chapter 3. Collection Processing Engine Descriptor Reference</h2></div></div></div> |
| |
| |
| |
| <p>A UIMA <span class="emphasis"><em>Collection Processing Engine</em></span> (CPE) is a combination |
| of UIMA components assembled to analyze a collection of artifacts. A CPE is an |
| instantiation of the UIMA <span class="emphasis"><em>Collection Processing Architecture</em></span>, |
| which defines the collection processing components, interfaces, and APIs. A CPE is |
| executed by a UIMA framework component called the <span class="emphasis"><em>Collection Processing |
| Manager</em></span> (CPM), which provides a number of services for deploying CPEs, |
| running CPEs, and handling errors.</p> |
| |
| <p>A CPE can be assembled programmatically within a Java application, or it can be |
| assembled declaratively via a CPE configuration specification, called a CPE |
| Descriptor. This chapter describes the format of the CPE Descriptor.</p> |
| |
| <p>Details about the CPE, including its function, sub-components, APIs, and related |
| tools, can be found in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Developer's Guide</i></a>. Here we briefly summarize the CPE to define terms and |
| provide context for the later sections that describe the CPE Descriptor.</p> |
| |
| <div class="section" title="3.1. CPE Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.overview">3.1. CPE Overview</h2></div></div></div> |
| |
| |
| <div class="figure"><a name="ugr.ref.xml.cpe_descriptor.overview.fig.runtime"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.xml.cpe_descriptor/image002.png" width="574" alt="CPE Runtime Overview diagram"></td></tr></table></div> |
| </div><p class="title"><b>Figure 3.1. CPE Runtime Overview</b></p></div><br class="figure-break"> |
| |
| <p>An illustration of the CPE runtime is shown in <a class="xref" href="#ugr.ref.xml.cpe_descriptor.overview.fig.runtime" title="Figure 3.1. CPE Runtime Overview">Figure 3.1, “CPE Runtime Overview”</a>. Some of the CPE components, such as the |
| <span class="emphasis"><em>queues</em></span> and <span class="emphasis"><em>processing pipelines</em></span>, are |
| internal to the CPE, but their behavior and deployment may be configured using the CPE |
| Descriptor. Other CPE components, such as the <span class="emphasis"><em>Collection |
| Reader</em></span> and <span class="emphasis"><em>CAS Processors</em></span>, are defined and |
| configured externally from the CPE and then plugged in to the CPE to create the overall |
| engine. The parts of a CPE are: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">Collection Reader</span></dt><dd><p>understands the native data collection format and iterates |
| over the collection producing subjects of analysis</p></dd><dt><span class="term">CAS Initializer<sup>[<a name="d5e1067" href="#ftn.d5e1067" class="footnote">3</a>]</sup> |
| </span></dt><dd><p>initializes a CAS with a subject of analysis</p> |
| </dd><dt><span class="term">Artifact Producer</span></dt><dd><p>asynchronously pulls CASes from the Collection Reader, |
| creates batches of CASes and puts them into the work queue</p></dd><dt><span class="term">Work Queue</span></dt><dd><p>shared queue containing batches of CASes queued by the Artifact |
| Producer for analysis by Analysis Engines</p> |
| </dd><dt><span class="term">B1-Bn</span></dt><dd><p>individual batches containing 1 or more CASes</p> |
| </dd><dt><span class="term">AE1-AEn</span></dt><dd><p>Analysis Engines arranged by a CPE descriptor</p> |
| </dd><dt><span class="term">Processing Pipelines</span></dt><dd><p>each pipeline runs in a separate thread and contains a |
| replicated set of the Analysis Engines running in the defined sequence</p> |
| </dd><dt><span class="term">Output Queue</span></dt><dd><p>holds batches of CASes with analysis results intended for CAS |
| Consumers</p></dd><dt><span class="term">CAS Consumers</span></dt><dd><p>perform collection level analysis over the CASes and extract |
| analysis results, e.g., creating indexes or databases</p></dd></dl></div><p> |
| </p> |
| </div> |
| |
| <div class="section" title="3.2. Notation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.notation">3.2. Notation</h2></div></div></div> |
| |
| |
| <p>CPE Descriptors are XML files. This chapter uses an informal notation to specify |
| the syntax of CPE Descriptors.</p> |
| |
| <p>The notation used in this chapter is: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>An ellipsis (...) inside an element body indicates |
| that the substructure of that element has been omitted (to be described in another |
| section of this chapter). An example of this would be: |
| |
| |
| </p><pre class="programlisting"><collectionReader> |
| ... |
| </collectionReader></pre> |
| </li><li class="listitem"><p>An ellipsis immediately after an element indicates that the |
| element type may be repeated arbitrarily many times. For example: |
| |
| |
| </p><pre class="programlisting"><parameter>[String]</parameter> |
| <parameter>[String]</parameter> |
| ...</pre><p> |
| indicates that there may be arbitrarily many parameter elements in this |
| context.</p></li><li class="listitem"><p>An ellipsis inside an element means details of the attributes |
| associated with that element are defined later, e.g.: |
| |
| </p><pre class="programlisting"><casProcessor ...></pre> |
| </li><li class="listitem"><p>Bracketed expressions (e.g. <code class="literal">[String]</code>) |
| indicate the type of value that may be used at that location.</p></li><li class="listitem"><p>A vertical bar, as in <code class="literal">true|false</code>, indicates |
| alternatives. This can be applied to literal values, bracketed type names, and |
| elements. </p></li></ul></div> |
| |
| <p>Which elements are optional and which are required is specified in prose, not in the |
| syntax definition.</p> |
| |
| </div> |
| |
| <div class="section" title="3.3. Imports"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.imports">3.3. Imports</h2></div></div></div> |
| |
| |
| <p>As of version 2.2, a CPE Descriptor can use the same <code class="literal">import</code> mechanism |
| as other component descriptors. This allows referring to component |
| descriptors using either relative paths (resolved relative to the location of the CPE descriptor) |
| or the classpath/datapath. For details see <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter 2, <i>Component Descriptor Reference</i></a>.</p> |
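| |
| <p>For illustration, the two forms of <code class="literal">import</code> might appear inside a |
| <code class="literal">descriptor</code> element as sketched below. The file path and the descriptor name |
| are hypothetical. The first form is resolved relative to the location of the CPE descriptor; the second |
| is looked up by name on the classpath/datapath:</p> |
| <pre class="programlisting"><!-- import by location (hypothetical relative path) --> |
| <descriptor> |
| <import location="descriptors/MyCollectionReader.xml"/> |
| </descriptor> |
| |
| <!-- import by name (hypothetical descriptor name) --> |
| <descriptor> |
| <import name="com.example.descriptors.MyCollectionReader"/> |
| </descriptor></pre> |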
| |
| <p>The following older syntax is still supported, but <span class="emphasis"><em>not recommended</em></span>: |
| |
| </p><pre class="programlisting"><descriptor> |
| <include href="[URL or File]"/> |
| </descriptor></pre> |
| |
| <p>The value of the <code class="literal">href</code> attribute (<code class="literal">[URL or File]</code>) is a URL or a filename for the descriptor of the |
| incorporated component. The framework first attempts to resolve the value as a URL.</p> |
| |
| <p> |
| Relative paths in an <code class="literal">include</code> are resolved relative to the current working directory |
| (NOT the CPE descriptor location as is the case for <code class="literal">import</code>). |
| A filename relative to another directory can be specified using the <code class="literal">CPM_HOME</code> |
| variable, e.g., |
| </p><pre class="programlisting"><descriptor> |
| <include href="${CPM_HOME}/desc_dir/descriptor.xml"/> |
| </descriptor></pre><p> |
| |
| In this case, the value for the <code class="literal">CPM_HOME</code> variable must be |
| provided to the CPE by specifying it on the Java command line, e.g., |
| |
| </p><pre class="programlisting">java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</pre><p> |
| |
| </p> |
| |
| </div> |
| |
| <div class="section" title="3.4. CPE Descriptor Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor">3.4. CPE Descriptor Overview</h2></div></div></div> |
| |
| |
| <p>A CPE Descriptor consists of information describing the following four main |
| elements.</p> |
| |
| <div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>The <span class="emphasis"><em>Collection Reader</em></span>, which |
| is responsible for gathering artifacts and initializing the Common Analysis |
| Structure (CAS) used to support processing in the UIMA collection processing |
| engine.</p></li><li class="listitem"><p>The <span class="emphasis"><em>CAS Processors</em></span>, responsible for |
| analyzing individual artifacts, analyzing across artifacts, and extracting |
| analysis results. CAS Processors include <span class="emphasis"><em>Analysis Engines</em></span> |
| and <span class="emphasis"><em>CAS Consumers</em></span>.</p></li><li class="listitem"><p>Operational parameters of the <span class="emphasis"><em>Collection Processing |
| Manager</em></span> (CPM), such as checkpoint frequency and deployment |
| mode.</p></li><li class="listitem"><p>Resource Manager Configuration (optional). </p></li></ol></div> |
| |
| <p>The CPE Descriptor has the following high level skeleton: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0"?> |
| <cpeDescription> |
| <collectionReader> |
| ... |
| </collectionReader> |
| <casProcessors> |
| ... |
| </casProcessors> |
| <cpeConfig> |
| ... |
| </cpeConfig> |
| <resourceManagerConfiguration> |
| ... |
| </resourceManagerConfiguration> |
| </cpeDescription></pre> |
| |
| <p>Details of each of the four main elements are described in the sections that |
| follow.</p> |
| </div> |
| <div class="section" title="3.5. Collection Reader"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.collection_reader">3.5. Collection Reader</h2></div></div></div> |
| |
| |
| <p>The <code class="literal"><collectionReader></code> section identifies the |
| Collection Reader and optional CAS Initializer that are to be used in the CPE. The |
| Collection Reader is responsible for retrieval of artifacts from a collection |
| outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2) |
| is responsible for initializing the CAS with the artifact.</p> |
| |
| <p>A Collection Reader may initialize the CAS itself, in which case it does not |
| require a CAS Initializer. This should be clearly specified in the documentation for |
| the Collection Reader. Specifying a CAS Initializer for a Collection Reader that |
| does not make use of a CAS Initializer will not cause an error, but the specified CAS |
| Initializer will not be used.</p> |
| |
| <p>The complete structure of the <code class="literal"><collectionReader></code> |
| section is: |
| |
| |
| </p><pre class="programlisting"><collectionReader> |
| <collectionIterator> |
| <descriptor> |
| <import ...> | <include .../> |
| </descriptor> |
| <configurationParameterSettings>...</configurationParameterSettings> |
| <sofaNameMappings>...</sofaNameMappings> |
| </collectionIterator> |
| <casInitializer> |
| <descriptor> |
| <import ...> | <include .../> |
| </descriptor> |
| <configurationParameterSettings>...</configurationParameterSettings> |
| <sofaNameMappings>...</sofaNameMappings> |
| </casInitializer> |
| </collectionReader></pre> |
| |
| <p>The <code class="literal"><collectionIterator></code> element identifies the |
| descriptor for the Collection Reader, and the <code class="literal"><casInitializer></code> |
| element identifies the descriptor for the CAS Initializer. The format and |
| details of the Collection Reader and CAS Initializer descriptors are described in |
| <a href="references.html#ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader" class="olink">Section 2.6.1, “Collection Reader Descriptors”</a>. |
| The <code class="literal"><configurationParameterSettings></code> and |
| <code class="literal"><sofaNameMappings></code> elements are described in the next |
| section.</p> |
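| |
| <p>As an illustrative sketch, a <code class="literal"><collectionReader></code> section for a |
| Collection Reader that initializes the CAS itself, and therefore omits the optional |
| <code class="literal"><casInitializer></code>, might look like the following. The descriptor |
| path is hypothetical, and the <code class="literal">location</code> attribute of |
| <code class="literal"><import></code> is an assumption that is not spelled out in the structure |
| shown above:</p> |
| |
| <pre class="programlisting"><collectionReader> |
|   <collectionIterator> |
|     <descriptor> |
|       <!-- hypothetical path; the location attribute is assumed --> |
|       <import location="descriptors/ExampleCollectionReader.xml"/> |
|     </descriptor> |
|   </collectionIterator> |
| </collectionReader></pre> |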
| |
| <div class="section" title="3.5.1. Error handling for Collection Readers"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.collection_reader.error_handling">3.5.1. Error handling for Collection Readers</h3></div></div></div> |
| |
| |
| <p>The CPM will abort if the Collection Reader throws a large number of |
| consecutive exceptions (default = 100). This default can be changed by using the |
| Java initialization parameter <code class="literal">-DMaxCRErrorThreshold=xxx</code>, |
| where <code class="literal">xxx</code> is the new threshold.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="3.6. CAS Processors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors">3.6. CAS Processors</h2></div></div></div> |
| |
| |
| <p>The <code class="literal"><casProcessors></code> section identifies the |
| components that perform the analysis on the input data, including CAS analysis |
| (Analysis Engines) and analysis results extraction (CAS Consumers). The CAS |
| Consumers may also perform collection level analysis, where the analysis is |
| performed (or aggregated) over multiple CASes. The basic structure of the CAS |
| Processors section is: |
| |
| |
| </p><pre class="programlisting"><casProcessors |
| dropCasOnException="true|false" |
| casPoolSize="[Number]" |
| processingUnitThreadCount="[Number]"> |
| |
| <casProcessor ...> |
| ... |
| </casProcessor> |
| |
| <casProcessor ...> |
| ... |
| </casProcessor> |
| ... |
| </casProcessors></pre> |
| |
| <p>The <code class="literal"><casProcessors></code> section has two mandatory |
| attributes and one optional attribute that configure the characteristics of the CAS |
| Processor flow in the CPE. The first mandatory attribute is <code class="literal">casPoolSize</code>, which |
| defines the fixed number of CAS instances that the CPM will create and use during |
| processing. All CAS instances are maintained in a CAS Pool with check-in and |
| check-out access. Each CAS is checked out from the CAS Pool by the Collection Reader |
| and initialized with an initial subject of analysis. The CAS is checked back into the |
| CAS Pool when it is completely processed, at the end of the processing chain. A larger |
| CAS Pool size will result in more memory being used by the CPM. CAS objects can be large |
| and care should be taken to determine the optimum size of the CAS Pool, weighing memory |
| tradeoffs with performance.</p> |
| |
| <p>The second mandatory <code class="literal"><casProcessors></code> attribute |
| is <code class="literal">processingUnitThreadCount</code>, which specifies the number of |
| replicated <span class="emphasis"><em>Processing Pipelines</em></span>. Each Processing |
| Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits |
| each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline |
| contains one or more Analysis Engines invoked in a given sequence. If more than one |
| Processing Pipeline is specified, the CPM replicates instances of each Analysis |
| Engine defined in the CPE descriptor. Each Processing Pipeline thread runs |
| independently, consuming CASes from the work queue and depositing CASes with analysis |
| results onto the output queue. On multiprocessor machines, multiple Processing |
| Pipelines can run in parallel, improving overall throughput of the CPM.</p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The number of Processing Pipelines should be equal to or greater than CAS |
| Pool size. </p></div> |
| |
| <p>Elements in the pipeline (each represented by a <casProcessor> element) |
| may indicate that they do not permit multiple deployment in their Analysis Engine |
| descriptor. If so, even though multiple pipelines are being used, all CASes passing |
| through the pipelines will be routed through one instance of these marked Engines. |
| </p> |
| |
| <p>The final, optional, <casProcessors> attribute is |
| <code class="literal">dropCasOnException</code>. It defines a policy that determines what |
| happens with the CAS when an exception happens during processing. If the value of this |
| attribute is set to true and an exception happens, the CPM will notify all registered |
| listeners of the exception (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe.using_listeners" class="olink">Section 2.3.1, “Using Listeners”</a>), clear the CAS and check the CAS |
| back into the CAS Pool so that it can be re-used. The presumption is that an exception |
| may leave the CAS in an inconsistent state and therefore that CAS should not be allowed |
| to move through the processing chain. When this attribute is omitted the CPM's |
| default is the same as specifying |
| <code class="literal">dropCasOnException="false"</code>.</p> |
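| |
| <p>For instance, the following sketch sets all three attributes. The values are |
| illustrative only, not recommendations: two Processing Pipelines share a pool of two CASes, |
| and any CAS whose processing raised an exception is dropped and returned to the pool:</p> |
| |
| <pre class="programlisting"><casProcessors casPoolSize="2" |
|                processingUnitThreadCount="2" |
|                dropCasOnException="true"> |
|   <!-- illustrative values; one <casProcessor> element per CAS Processor follows --> |
|   <casProcessor ...> |
|     ... |
|   </casProcessor> |
| </casProcessors></pre> |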
| |
| <div class="section" title="3.6.1. Specifying an Individual CAS Processor"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual">3.6.1. Specifying an Individual CAS Processor</h3></div></div></div> |
| |
| |
| <p>The CAS Processors that make up the Processing Pipeline and the CAS Consumer |
| pipeline are specified with the <code class="literal"><casProcessor></code> |
| entity, which appears within the <code class="literal"><casProcessors></code> |
| entity. It may appear multiple times, once for each CAS Processor specified for |
| this CPE.</p> |
| |
| <p>The order of the <code class="literal"><casProcessor></code> entities within |
| the <code class="literal"><casProcessors></code> section specifies the order in |
| which the CAS Processors will run. Although CAS Consumers are usually put at the end |
| of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS |
| Consumers.</p> |
| |
| <p>The overall format of the <code class="literal"><casProcessor></code> entity |
| is: |
| |
| |
| </p><pre class="programlisting"><casProcessor deployment="local|remote|integrated" name="[String]" > |
| <descriptor> |
| <import ...> | <include .../> |
| </descriptor> |
| <configurationParameterSettings>...</configurationParameterSettings> |
| <sofaNameMappings>...</sofaNameMappings> |
| <runInSeparateProcess>...</runInSeparateProcess> |
| <deploymentParameters>...</deploymentParameters> |
| <filter/> |
| <errorHandling>...</errorHandling> |
| <checkpoint batch="Number"/> |
| </casProcessor></pre> |
| |
| <p>The <code class="literal"><casProcessor></code> element has two mandatory |
| attributes, <code class="literal">deployment</code> and <code class="literal">name</code>. The |
| mandatory <code class="literal">name</code> attribute specifies a unique string |
| identifying the CAS Processor.</p> |
| |
| <p>The mandatory <code class="literal">deployment</code> attribute specifies the CAS |
| Processor deployment mode. Currently, three deployment options are supported: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">integrated</span></dt><dd><p>indicates <span class="emphasis"><em>integrated</em></span> deployment |
| of the CAS Processor. The CPM deploys and collocates the CAS Processor in the |
| same process space as the CPM. This type of deployment is recommended to |
| increase the performance of the CPE. However, it is NOT recommended to |
| deploy annotators containing JNI this way. Such CAS Processors may cause a |
| fatal exception and force the JVM to exit without cleanup (bringing down the |
| CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed |
| this way.</p> |
| <p>The descriptor for an integrated deployment can, in fact, be a remote |
| service descriptor. When used this way, however, the CPM error recovery |
| options (see below) operate in the integrated mode, which means that many |
| of the retry options are not available.</p></dd><dt><span class="term">remote</span></dt><dd><p>indicates <span class="emphasis"><em>non-managed</em></span> |
| deployment of the CAS Processor. The CAS Processor descriptor referenced |
| in the <code class="literal"><descriptor></code> element must be a Vinci |
| <span class="emphasis"><em>Service Client Descriptor</em></span>, which identifies a |
| remotely deployed CAS Processor service (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a>). The CPM |
| assumes that the CAS Processor is already running as a remote service and |
| will connect to it using the URI provided in the client service descriptor. |
| The lifecycle of a remotely deployed CAS Processor is not managed by the CPM, |
| so appropriate infrastructure should be in place to start/restart such CAS |
| Processors when necessary. This deployment provides fault isolation and |
| is implementation (i.e., programming language) neutral.</p> |
| </dd><dt><span class="term">local</span></dt><dd><p>indicates <span class="emphasis"><em>managed</em></span> deployment of |
| the CAS Processor. The CAS Processor descriptor referenced in the |
| <code class="literal"><descriptor></code> element must be a Vinci |
| <span class="emphasis"><em>Service Deployment Descriptor</em></span>, which configures |
| a CAS Processor for deployment as a Vinci service (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a>). The CPM |
| deploys the CAS Processor in a separate process and manages the life cycle |
| (start/stop) of the CAS Processor. Communication between the CPM and the |
| CAS Processor is done with Vinci. When the CPM completes processing, the |
| process containing the CAS Processor is terminated. This deployment mode |
| insulates the CPM from the CAS Processor, creating a more robust deployment |
| at the cost of a small communication overhead. On multiprocessor machines, |
| the separate processes may run concurrently and improve overall |
| throughput.</p></dd></dl></div> |
| |
| <p>A number of elements may appear within the |
| <code class="literal"><casProcessor></code> element.</p> |
| |
| <div class="section" title="3.6.1.1. <descriptor> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.descriptor">3.6.1.1. <descriptor> Element</h4></div></div></div> |
| |
| |
| <p>The <code class="literal"><descriptor></code> element is mandatory. It |
| identifies the descriptor for the referenced CAS Processor using the syntax |
| described in <a href="references.html#ugr.ref.xml.component_descriptor.aes" class="olink">Section 2.4, “Analysis Engine Descriptors”</a>. |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>For |
| <span class="emphasis"><em><code class="literal">remote</code></em></span> CAS Processors, the |
| referenced descriptor must be a Vinci <span class="emphasis"><em>Service Client |
| Descriptor</em></span>, which identifies a remotely deployed CAS Processor |
| service.</p></li><li class="listitem"><p>For <span class="emphasis"><em>local</em></span> CAS Processors, the |
| referenced descriptor must be a Vinci <span class="emphasis"><em>Service Deployment |
| Descriptor</em></span>.</p></li><li class="listitem"><p>For <span class="emphasis"><em>integrated</em></span> CAS Processors, |
| the referenced descriptor must be an Analysis Engine Descriptor |
| (primitive or aggregate). </p></li></ul></div><p> </p> |
| |
| <p>See <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a> for more |
| information on creating these descriptors and deploying services.</p> |
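| |
| <p>As a hedged illustration, an <span class="emphasis"><em>integrated</em></span> CAS Processor |
| could reference its Analysis Engine Descriptor as sketched below. The processor name and the |
| descriptor path are hypothetical, the <code class="literal">location</code> attribute of |
| <code class="literal"><import></code> is assumed, and the remaining child elements are elided:</p> |
| |
| <pre class="programlisting"><casProcessor deployment="integrated" name="Example Annotator"> |
|   <descriptor> |
|     <!-- hypothetical path to a primitive or aggregate Analysis Engine Descriptor --> |
|     <import location="descriptors/ExampleAnnotator.xml"/> |
|   </descriptor> |
|   ... |
| </casProcessor></pre> |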
| |
| </div> |
| |
| <div class="section" title="3.6.1.2. <configurationParameterSettings> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.configuration_parameter_settings">3.6.1.2. <configurationParameterSettings> Element</h4></div></div></div> |
| |
| |
| <p>This element provides a way to override the contained Analysis |
| Engine's parameter settings. Any entry specified here must already be |
| defined; values specified replace the corresponding values for each |
| parameter. <span class="bold-italic">For CAS Processors, this mechanism |
| is only available when they are deployed in <span class="quote">“<span class="quote">integrated</span>”</span> |
| mode.</span> For Collection Readers and CAS Initializers, it is always |
| available.</p> |
| |
| <p>The content of this element is identical to the component descriptor for |
| specifying parameters (in the case where no parameter groups are |
| specified)<sup>[<a name="d5e1266" href="#ftn.d5e1266" class="footnote">4</a>]</sup>. Here is an example: |
| |
| |
| </p><pre class="programlisting"><configurationParameterSettings> |
| <nameValuePair> |
| <name>CivilianTitles</name> |
| <value> |
| <array> |
| <string>Mr.</string> |
| <string>Ms.</string> |
| <string>Mrs.</string> |
| <string>Dr.</string> |
| </array> |
| </value> |
| </nameValuePair> |
| ... |
| </configurationParameterSettings></pre> |
| |
| </div> |
| |
| <div class="section" title="3.6.1.3. <sofaNameMappings> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.sofa_name_mappings">3.6.1.3. <sofaNameMappings> Element</h4></div></div></div> |
| |
| |
| <p>This optional element maps Sofa names used in the CPE to the Sofa names defined in the |
| component, or to the default Sofa name (if the component does not declare any Sofa |
| names). The form of this element is: |
| |
| |
| </p><pre class="programlisting"><sofaNameMappings> |
| <sofaNameMapping cpeSofaName="a_CPE_name" |
| componentSofaName="a_component_Name"/> |
| ... |
| </sofaNameMappings></pre> |
| |
| <p>There can be any number of |
| <code class="literal"><sofaNameMapping></code> elements contained in the |
| <code class="literal"><sofaNameMappings></code> element. The |
| <code class="literal">componentSofaName</code> attribute is optional; leave it out to |
| specify a mapping for the <code class="literal">_InitialView</code> - that is, for |
| Single-View components.</p> |
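| |
| <p>A minimal sketch of the Single-View case follows, assuming a hypothetical CPE Sofa named |
| <code class="literal">EnglishDocument</code>. Because <code class="literal">componentSofaName</code> |
| is omitted, the mapping applies to the component's <code class="literal">_InitialView</code>:</p> |
| |
| <pre class="programlisting"><sofaNameMappings> |
|   <!-- "EnglishDocument" is a hypothetical CPE Sofa name --> |
|   <sofaNameMapping cpeSofaName="EnglishDocument"/> |
| </sofaNameMappings></pre> |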
| |
| </div> |
| |
| <div class="section" title="3.6.1.4. <runInSeparateProcess> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.run_in_separate_process">3.6.1.4. <runInSeparateProcess> Element</h4></div></div></div> |
| |
| |
| <p>The <code class="literal"><runInSeparateProcess></code> element is |
| mandatory for <code class="literal">local</code> CAS Processors, but should not appear |
| for <code class="literal">remote</code> or <code class="literal">integrated</code> CAS |
| Processors. It enables the CPM to create external processes using the provided |
| runtime environment. Applications launched this way communicate with the CPM |
| using the Vinci protocol and connectivity is enabled by a local instance of the |
| VNS that the CPM manages. Since communication is based on Vinci, the application |
| need not be implemented in Java. Any language for which Vinci provides support |
| may be used to create an application, and the CPM will seamlessly communicate |
| with it. The overall structure of this element is: |
| |
| |
| </p><pre class="programlisting"><runInSeparateProcess> |
| <exec dir="[String]" executable="[String]"> |
| <env key="[String]" value ="[String]"/> |
| ... |
| <arg>[String]</arg> |
| ... |
| </exec> |
| </runInSeparateProcess></pre> |
| |
| <p>The <code class="literal"><exec></code> element provides information |
| about how to execute the referenced CAS Processor. Two attributes are defined |
| for the <code class="literal"><exec></code> element. The |
| <code class="literal">dir</code> attribute is currently not used – it is reserved |
| for future functionality. The <code class="literal">executable</code> attribute |
| specifies the actual Vinci service executable that will be run by the CPM, e.g., |
| <code class="literal">java</code>, a batch script, an application (.exe), etc. The |
| executable must be specified with a fully qualified path, or be found in the |
| <code class="literal">PATH</code> of the CPM.</p> |
| |
| <p>The <code class="literal"><exec></code> element has two elements within it |
| that define parameters used to construct the command line for executing the CAS |
| Processor. These elements must be listed in the order in which they should be |
| defined for the CAS Processor.</p> |
| |
| <p>The optional <code class="literal"><env></code> element is used to set an |
| environment variable. The variable <code class="literal">key</code> will be set to |
| <code class="literal">value</code>. For example, |
| |
| |
| </p><pre class="programlisting"><env key="CLASSPATH" value="C:Javalib"/></pre><p> |
| will set the environment variable <code class="literal">CLASSPATH</code> to the value |
| <code class="literal">C:Javalib</code>. The <code class="literal"><env></code> |
| element may be repeated to set multiple environment variables. All of the |
| key/value pairs will be added to the environment by the CPM prior to launching the |
| executable.</p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The CPM actually adds ALL system environment variables when it |
| launches the program. It queries the Operating System for its current system |
| variables and one by one adds them to the program's process |
| configuration.</p></div> |
| |
| <p>The <code class="literal"><arg></code> element is used to specify arbitrary |
| string arguments that will appear on the command line when the CPM runs the |
| command specified in the <code class="literal">executable</code> attribute.</p> |
| |
| <p>For example, the following would be used to invoke the UIMA Java |
| implementation of the Vinci service wrapper on a Java CAS Processor: |
| |
| |
| </p><pre class="programlisting"><runInSeparateProcess> |
| <exec executable="java"> |
| <arg>-DVNS_HOST=localhost</arg> |
| <arg>-DVNS_PORT=9099</arg> |
| <arg>org.apache.uima.reference_impl.analysis_engine.service. |
| vinci.VinciAnalysisEngineService_impl</arg> |
| <arg>C:uimadescdeployCasProcessor.xml</arg> |
| </exec> |
| </runInSeparateProcess></pre> |
| |
| <p>This will cause the CPM to run the following command line when starting the |
| CAS Processor: |
| |
| |
| </p><pre class="programlisting">java -DVNS_HOST=localhost -DVNS_PORT=9099 |
| org.apache.uima.reference_impl.analysis_engine.service.vinci.\\ |
| VinciAnalysisEngineService_impl |
| C:uimadescdeployCasProcessor.xml</pre> |
| |
| <p>The first argument specifies that the Vinci Naming Service is running on the |
| <code class="literal">localhost</code>. The second argument specifies that the Vinci |
| Naming Service port number is <code class="literal">9099</code>. The third argument |
| (split over 2 lines in this documentation) |
| identifies the UIMA implementation of the Vinci service wrapper. This class |
| contains the <code class="literal">main</code> method that will execute. That main |
| method in turn takes a single argument – the filename for the CAS Processor |
| service deployment descriptor. Thus the last argument identifies the Vinci |
| service deployment descriptor file for the CAS Processor. Since this is the same |
| descriptor file specified earlier in the |
| <code class="literal"><descriptor></code> element, the string |
| <code class="literal">${descriptor}</code> can be used to refer to the descriptor, |
| e.g.: |
| |
| |
| </p><pre class="programlisting"><arg>${descriptor}</arg></pre> |
| |
| <p>The CPM will expand this out to the service deployment descriptor file |
| referenced in the <code class="literal"><descriptor></code> element.</p> |
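| |
| <p>Putting these pieces together, the earlier example could be written with the placeholder |
| instead of a literal file name. This is only a sketch assembled from the fragments above, with |
| the class name shown unbroken on a single line:</p> |
| |
| <pre class="programlisting"><runInSeparateProcess> |
|   <exec executable="java"> |
|     <arg>-DVNS_HOST=localhost</arg> |
|     <arg>-DVNS_PORT=9099</arg> |
|     <arg>org.apache.uima.reference_impl.analysis_engine.service.vinci.VinciAnalysisEngineService_impl</arg> |
|     <!-- expanded by the CPM to the file referenced in the <descriptor> element --> |
|     <arg>${descriptor}</arg> |
|   </exec> |
| </runInSeparateProcess></pre> |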
| |
| </div> |
| |
| <div class="section" title="3.6.1.5. <deploymentParameters> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.deployment_parameters">3.6.1.5. <deploymentParameters> Element</h4></div></div></div> |
| |
| |
| <p>The <code class="literal"><deploymentParameters></code> element defines |
| a number of deployment parameters that control how the CPM will interact with the |
| CAS Processor. This element has the following overall form: |
| |
| |
| </p><pre class="programlisting"><deploymentParameters> |
| <parameter name="[String]" value="..." type="string|integer" /> |
| ... |
| </deploymentParameters></pre> |
| |
| <p>The <code class="literal">name</code> attribute identifies the parameter, the |
| <code class="literal">value</code> attribute specifies the value that will be assigned |
| to the parameter, and the <code class="literal">type</code> attribute indicates the |
| type of the parameter, either <code class="literal">string</code> or |
| <code class="literal">integer</code>. The available parameters include: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">service-access</span></dt><dd><p>string parameter whose value must be |
| <span class="quote">“<span class="quote">exclusive</span>”</span>, if present. This parameter is only |
| effective for remote deployments. It modifies the Vinci service |
| connections to be preallocated and dedicated, one service instance per |
| pipeline. It is only relevant for non-integrated deployment modes. If |
| there are fewer service instances available (and alive – |
| responding to a <span class="quote">“<span class="quote">ping</span>”</span> request) than there are pipelines, |
| the number of pipelines (the number of concurrent threads) is reduced to |
| match the number of available instances. If not specified, the VNS is |
| queried each time a service is needed, and a <span class="quote">“<span class="quote">random</span>”</span> |
| instance is assigned from the pool of available instances. If a service |
| dies during processing, the CPM will use its normal error handling |
| procedures to attempt to reconnect. The number of attempts is specified |
| in the CPE descriptor for each CAS Processor using the |
| <code class="literal"><maxConsecutiveRestarts value="10" |
| action="kill-pipeline" |
| waitTimeBetweenRetries="50"/></code> xml element. The |
| <span class="quote">“<span class="quote">value</span>”</span> attribute is the number of reconnection tries; |
| the <span class="quote">“<span class="quote">action</span>”</span> says what to do if the retries exceed the |
| limit. The <span class="quote">“<span class="quote">kill-pipeline</span>”</span> action stops the pipeline |
| that was associated with the failing service (other pipelines will |
| continue to work). The CAS in process within a killed pipeline will be |
| dropped. These events are communicated to the application using the |
| normal event listener mechanism. The |
| <code class="literal">waitTimeBetweenRetries</code> says how many |
| milliseconds to wait between attempts to reconnect.</p> |
| </dd><dt><span class="term">vnsHost</span></dt><dd><p>(Deprecated) string parameter specifying the VNS host, |
| e.g., <code class="literal">localhost</code> for local CAS Processors, host |
| name or IP address of VNS host for remote CAS Processors. This parameter is |
| deprecated; use the parameter specification instead inside the Vinci |
| <span class="emphasis"><em>Service Client Descriptor</em></span>, if needed. It is |
| ignored for integrated and local deployments. If present, for remote |
| deployments, it specifies the VNS Host to use, unless that is specified in |
| the Vinci <span class="emphasis"><em>Service Client Descriptor</em></span>.</p> |
| </dd><dt><span class="term">vnsPort</span></dt><dd><p>(Deprecated) integer parameter specifying the VNS port |
| number. This parameter is deprecated; use the parameter specification |
| instead inside the Vinci <span class="emphasis"><em>Service Client |
| Descriptor,</em></span> if needed. It is ignored for integrated and |
| local deployments. If present, for remote deployments, it specifies the |
| VNS Port number to use, unless that is specified in the Vinci |
| <span class="emphasis"><em>Service Client Descriptor.</em></span></p> |
| </dd></dl></div> |
| |
| <p>For example, the following parameters might be used with a CAS Processor |
| deployed in local mode: |
| |
| |
| </p><pre class="programlisting"><deploymentParameters> |
| <parameter name="service-access" value="exclusive" type="string"/> |
| </deploymentParameters></pre> |
| |
| </div> |
| |
| <div class="section" title="3.6.1.6. <filter> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.filter">3.6.1.6. <filter> Element</h4></div></div></div> |
| |
| |
| <p>The <filter> element is a required element but currently should be |
| left empty. This element is reserved for future use.</p> |
| |
| </div> |
| |
| <div class="section" title="3.6.1.7. <errorHandling> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling">3.6.1.7. <errorHandling> Element</h4></div></div></div> |
| |
| |
| <p>The mandatory <code class="literal"><errorHandling></code> element |
| defines error and restart policies for the CAS Processor. Each CAS Processor may |
| define different actions in the event of errors and restarts. The CPM monitors |
| and logs errant behaviors and attempts to recover the component based on the |
| policies specified in this element.</p> |
| |
| <p>There are two kinds of faults: |
| |
| </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>One kind only occurs with non-integrated CAS |
| Processors – this fault is either a timeout attempting to launch or |
| connect to the non-integrated component, or some other kind of connection |
| related exception (for instance, the network connection might time out or get |
| reset).</p></li><li class="listitem"><p>The other kind happens when the CAS Processor component (an |
| Annotator, for example) throws any kind of exception. This kind may occur |
| with any kind of deployment, integrated or not. </p></li></ol></div> |
| |
| <p>The <errorHandling> element has specifications for each of these kinds of |
| faults. The format of this element is: |
| |
| |
| </p><pre class="programlisting"><errorHandling> |
| <maxConsecutiveRestarts action="continue|disable|terminate" |
| value="[Number]"/> |
| <errorRateThreshold action="continue|disable|terminate" value="[Rate]"/> |
| <timeout max="[Number]"/> |
| </errorHandling></pre> |
| |
| <p>The mandatory <code class="literal"><maxConsecutiveRestarts></code> |
| element applies only to faults of the first kind, and therefore, only applies to |
| non-integrated deployments. If such a fault occurs, a retry is attempted, up to |
| <code class="literal">value="[Number]"</code> of times. This retry resets the |
| connection (if one was made) and attempts to reconnect and perhaps re-launch |
| (see below for details). The original CAS (not a partially updated one) is sent to |
| the CAS Processor as part of the retry, once the deployed component has been |
| successfully restarted or reconnected to.</p> |
| |
| <p>The <code class="literal">action</code> attribute specifies the action to take |
| when the threshold specified by the <code class="literal">value="[Number]"</code> is |
| exceeded. The possible actions are: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">continue</span></dt><dd><p>skip any further processing for this CAS by this CAS |
| Processor, and pass the CAS to the next CAS Processor in the Pipeline. |
| </p> |
| <p>The <span class="quote">“<span class="quote">restart</span>”</span> action is done, because it is needed |
| for the next CAS.</p> |
| |
| <p>If the <code class="literal">dropCasOnException="true"</code>, the CPM |
| will NOT pass the CAS to the next CAS Processor in the chain. Instead, the |
| CPM will abort processing of this CAS, release the CAS back to the CAS |
| Pool and will process the next CAS in the queue.</p> |
| |
| <p>The counter counting the restarts toward the threshold is only |
| reset after a CAS is successfully processed.</p></dd><dt><span class="term">disable</span></dt><dd><p>the current CAS is handled just as in the |
| <code class="literal">continue</code> case, but in addition, the CAS Processor |
| is marked so that its <span class="emphasis"><em>process()</em></span> method will not be |
| called again (i.e., it will be <span class="quote">“<span class="quote">skipped</span>”</span> for future |
| CASes)</p></dd><dt><span class="term">terminate</span></dt><dd><p>the CPM will terminate all processing and exit.</p> |
| </dd></dl></div> |
| |
| <p>The definition of an error for the |
| <code class="literal"><maxConsecutiveRestarts></code> element differs |
| slightly for each of the three CAS Processor deployment modes: |
| </p><div class="variablelist"><dl><dt><span class="term">local</span></dt><dd><p>Local CAS Processors experience two general error |
| types: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>launch errors – errors associated with |
| launching a process</p></li><li class="listitem"><p>processing errors – errors associated with |
| sending Vinci commands to the process</p></li></ul></div> |
| |
| <p>A launch error is defined by a failure of the process to |
| successfully register with the local VNS within a default time window. |
| The current timeout is 15 minutes. Multiple local CAS Processors are |
| launched sequentially, with a subsequent processor launched |
| immediately after its previous processor successfully registers |
| with the VNS.</p> |
| |
| <p>A processing error is detected if a connection to the CAS Processor |
| is lost or if the processing time exceeds a specified timeout |
| value.</p> |
| |
| <p>For local CAS Processors, the |
| <maxConsecutiveRestarts> element specifies the number of |
| consecutive attempts made to launch the CAS Processor at CPM startup or |
| after the CPM has lost a connection to the CAS Processor.</p> |
| </dd><dt><span class="term">remote</span></dt><dd><p>For remote CAS Processors, the |
| <maxConsecutiveRestarts> element applies to errors from |
| sending Vinci commands. An error is detected if a connection to the CAS |
| Processor is lost, or if the processing time exceeds the timeout value |
| specified in the <timeout> element (see below).</p> |
| </dd><dt><span class="term">integrated</span></dt><dd><p>Although mandatory, the |
| <maxConsecutiveRestarts> element is NOT used for integrated CAS |
| Processors, because Integrated CAS Processors are not |
| re-instantiated/restarted on exceptions. This setting is ignored by |
| the CPM for Integrated CAS Processors but it is required. A future version |
| of the CPM will make this element mandatory for remote and local CAS |
| Processors only.</p></dd></dl></div> |
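| |
| <p>As a hedged sketch for a remote CAS Processor, the element below allows up to three |
| consecutive reconnection attempts and terminates the CPM once that limit is exceeded. All |
| values are illustrative, including the <code class="literal"><timeout></code> value; the |
| <code class="literal"><errorRateThreshold></code> and <code class="literal"><timeout></code> |
| elements are described below:</p> |
| |
| <pre class="programlisting"><errorHandling> |
|   <!-- illustrative: terminate once 3 consecutive restarts have been exceeded --> |
|   <maxConsecutiveRestarts action="terminate" value="3"/> |
|   <errorRateThreshold action="disable" value="3/1000"/> |
|   <!-- illustrative timeout value --> |
|   <timeout max="60000"/> |
| </errorHandling></pre> |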
| |
| <p>The mandatory <code class="literal"><errorRateThreshold></code> element |
| is used for all faults – both those above, and exceptions thrown by the CAS |
| Processor itself. It specifies the number of retries for exceptions thrown by |
| the CAS Processor itself, a maximum error rate, and the corresponding action to |
| take when this rate is exceeded. The <code class="literal">value</code> attribute |
| specifies the error rate in terms of errors per sample size in the form |
| <span class="quote">“<span class="quote"><code class="literal">N/M</code></span>”</span>, where <code class="literal">N</code> is the |
| number of errors and <code class="literal">M</code> is the sample size, defined in terms |
| of the number of documents.</p> |
| |
| <p>The first number also indicates the maximum number of retries. If |
| this number is less than the value given in <code class="literal"><maxConsecutiveRestarts |
| value="[Number]"></code>, it takes precedence, reducing the number of |
| <span class="quote">“<span class="quote">restarts</span>”</span> attempted. A retry is done only if |
| <code class="literal">dropCasOnException</code> is false. If it is set to true, no retry |
| occurs, but the error is still counted.</p> |
| |
| <p>When the number of errors counted within the current sample exceeds <code class="literal">N</code>, the action |
| specified by the <code class="literal">action</code> attribute is taken. The possible |
| actions and their meaning are the same as described above for the |
| <code class="literal"><maxConsecutiveRestarts></code> element: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p><code class="literal">continue</code></p></li><li class="listitem"><p><code class="literal">disable</code></p></li><li class="listitem"><p><code class="literal">terminate</code></p></li></ul></div> |
| |
| <p>The <code class="literal">dropCasOnException="true"</code> attribute of the |
| <code class="literal"><casProcessors></code> element modifies the action |
| taken for continue and disable, in the same manner as above. For example: |
| |
| |
| </p><pre class="programlisting"><errorRateThreshold value="3/1000" action="disable"/></pre><p> |
| specifies that each error thrown by the CAS Processor itself will be retried up to |
| 3 times (if <code class="literal">dropCasOnException</code> is false) and the CAS |
| Processor will be disabled if the error rate exceeds 3 errors in 1000 |
| documents.</p> |
| |
| <p>If a document causes an error and the error rate threshold for the CAS |
| Processor is not exceeded, the CPM increments the CAS Processor's error |
| count and retries processing that document (if |
| <code class="literal">dropCasOnException</code> is false). The retry means that the |
| CPM calls the CAS Processor's process() method again, passing in as an |
| argument the same CAS that previously caused an exception.</p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The CPM does not attempt to roll back any partial changes that may have |
| been applied to the CAS in the previous process() call.</p></div> |
| |
| <p>Errors are accumulated across documents. For example, assume the error |
| rate threshold is <code class="literal">3/1000</code>. The same document may fail three |
| times before finally succeeding on the fourth try, but the error count is now 3. If |
| one more error occurs within the current sample of 1000 documents, the error rate |
| threshold will be exceeded and the specified action will be taken. If no more |
| errors occur within the current sample, the error counter is reset to 0 for the |
| next sample of 1000 documents.</p> |
| |
| <p>The <code class="literal"><timeout></code> element is a mandatory element. |
| Although mandatory for all CAS Processors, this element is only relevant for |
| local and remote CAS Processors. For integrated CAS Processors, this element is |
| ignored. In the current CPM implementation the integrated CAS Processor |
| process() method is not subject to timeouts.</p> |
| |
| <p>The <code class="literal">max</code> attribute specifies the maximum amount of |
| time, in milliseconds, that the CPM will wait for a process() method to complete. When |
| this time is exceeded, the CPM generates an exception and treats it as an error |
| subject to the threshold defined in the |
| <code class="literal"><errorRateThreshold></code> element above, including |
| doing retries.</p> |
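| |
| <p>As an illustrative sketch only, the error handling elements described above could be |
| combined for a remote CAS Processor as shown below. The numeric values are arbitrary |
| assumptions, not recommendations. With these settings, a process() call that runs longer |
| than 60 seconds is counted as an error, and the CAS Processor is disabled once the error |
| rate exceeds 5 errors in 1000 documents:</p> |
| <pre class="programlisting"><errorHandling> |
|   <!-- Assumed values, chosen for illustration only --> |
|   <errorRateThreshold action="disable" value="5/1000"/> |
|   <maxConsecutiveRestarts action="terminate" value="3"/> |
|   <!-- 60000 milliseconds = 60 seconds --> |
|   <timeout max="60000"/> |
| </errorHandling></pre> |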
| |
| <div class="section" title="Retry action taken on a timeout"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.error_handling.timeout_retry_action">Retry action taken on a timeout</h5></div></div></div> |
| |
| |
| <p>The action taken depends on whether the CAS Processor is local (managed) |
| or remote (unmanaged). Local CAS Processors (which are services) are killed |
| and restarted, and a new connection to them is established. For remote CAS |
| Processors, the connection to them is dropped, and a new connection is |
| established (which may actually connect to a different instance of the |
| remote service, if it has multiple instances).</p> |
| </div> |
| </div> |
| |
| <div class="section" title="3.6.1.8. <checkpoint> Element"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual.checkpoint">3.6.1.8. <checkpoint> Element</h4></div></div></div> |
| |
| |
| <p>The <code class="literal"><checkpoint></code> element is an optional |
| element used to improve the performance of CAS Consumers. It has a single |
| attribute, <code class="literal">batch</code>, which specifies the number of CASes in a |
| batch, e.g.: |
| |
| |
| </p><pre class="programlisting"><checkpoint batch="1000"/></pre> |
| |
| <p>sets the batch size to 1000 CASes. The batch size is the interval used to mark a |
| point in processing requiring special handling. The CAS Processor's |
| <code class="literal">batchProcessComplete()</code> method will be called by the CPM |
| when this mark is reached so that the processor can take appropriate action. This |
| mark could be used as a mechanism to buffer up results in CAS Consumers and perform |
| time-consuming operations, such as check-pointing, that should not be done on a |
| per-document basis.</p> |
| |
| </div> |
| </div> |
| </div> |
| |
| <div class="section" title="3.7. CPE Operational Parameters"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters">3.7. CPE Operational Parameters</h2></div></div></div> |
| |
| |
| <p>The parameters for configuring the overall CPE and CPM are specified in the |
| <code class="literal"><cpeConfig></code> section. The overall format of this |
| section is: |
| |
| |
| </p><pre class="programlisting"><cpeConfig> |
| <startAt>[NumberOrID]</startAt> |
| |
| <numToProcess>[Number]</numToProcess> |
| |
| <outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" /> |
| |
| <checkpoint file="[File]" time="[Number]" batch="[Number]"/> |
| |
| <timerImpl>[ClassName]</timerImpl> |
| |
| <deployAs>vinciService|interactive|immediate|single-threaded |
| </deployAs> |
| |
| </cpeConfig></pre> |
| |
| <p>This section of the CPE descriptor allows for defining the starting entity, the |
| number of entities to process, a checkpoint file and frequency, a pluggable timer, an |
| optional output queue implementation, and finally a mode of operation. The mode of |
| operation determines how the CPM interacts with users and other systems.</p> |
| |
| <p>The <code class="literal"><startAt></code> element is an optional argument. It |
| defines the starting entity in the collection at which the CPM should start |
| processing.</p> |
| |
| <p>The implementation in the CPM passes this argument to the Collection Reader |
| as the value of the parameter <span class="quote">“<span class="quote"><code class="literal">startNumber</code></span>”</span>. |
| The CPM does not do anything else with this parameter; in particular, the CPM has no |
| ability to skip to a specific document; that function, if available, is provided only |
| by a particular Collection Reader implementation.</p> |
| |
| <p>If the <code class="literal"><startAt></code> element is used, the Collection |
| Reader descriptor must define a single-valued configuration parameter with the |
| name <code class="literal">startNumber</code>. It can declare this value to be of any type; |
| the value passed in this XML element must be convertible to that type.</p> |
| |
| <p>A typical use is to declare this to be an integer type, and to pass the sequential |
| document number where processing should start. An alternative implementation |
| might take a specific document ID; the collection reader could search through its |
| collection until it reaches this ID and then start there.</p> |
| |
| <p>This parameter will only make sense if the particular collection reader is |
| implemented to use the <code class="literal">startNumber</code> configuration |
| parameter.</p> |
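| |
| <p>A minimal sketch of how the two descriptors might fit together, assuming a |
| Collection Reader that interprets <code class="literal">startNumber</code> as a sequential |
| document number. The <code class="literal">Integer</code> type, the optional |
| (non-mandatory) setting, and the value 100 are illustrative assumptions; a particular |
| Collection Reader may declare the parameter differently:</p> |
| <pre class="programlisting"><!-- In the Collection Reader descriptor (assumed declaration) --> |
| <configurationParameter> |
|   <name>startNumber</name> |
|   <type>Integer</type> |
|   <multiValued>false</multiValued> |
|   <mandatory>false</mandatory> |
| </configurationParameter> |
| |
| <!-- In the CPE descriptor; other cpeConfig elements omitted --> |
| <cpeConfig> |
|   <startAt>100</startAt> |
| </cpeConfig></pre> |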
| |
| <p>The <code class="literal"><numToProcess></code> element is an optional |
| element. It specifies the total number of entities to process. Use -1 to indicate ALL. |
| If not defined, the number of entities to process will be taken from the Collection |
| Reader configuration. If present, this value overrides the Collection Reader |
| configuration.</p> |
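| |
| <p>For instance, the following fragment (a sketch with an arbitrarily chosen limit) |
| would cap a run at 500 entities regardless of what the Collection Reader configuration |
| specifies; replacing the value with -1 would process all entities:</p> |
| <pre class="programlisting"><cpeConfig> |
|   <!-- 500 is an assumed value for illustration --> |
|   <numToProcess>500</numToProcess> |
|   <!-- other cpeConfig elements omitted --> |
| </cpeConfig></pre> |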
| |
| <p>The <code class="literal"><outputQueue></code> element is an optional element. |
| It enables plugging in a custom implementation for the Output Queue. When omitted, |
| the CPM uses a default output queue based on a first-in, first-out (FIFO) |
| model.</p> |
| |
| <p>The UIMA SDK provides a second implementation for the Output Queue that can be |
| plugged in to the CPM, named <span class="quote">“<span class="quote"> |
| <code class="literal">org.apache.uima.collection.impl.cpm.engine.SequencedQueue</code> |
| </span>”</span>.</p> |
| |
| <p>This implementation supports handling very large documents that are split into |
| <span class="quote">“<span class="quote">chunks</span>”</span>; it provides a delivery mechanism that ensures the |
| sequential order of the chunks using information carried in the CAS metadata. This |
| metadata, which is required for this implementation to work correctly, must be added |
| as an instance of a Feature Structure of type |
| <code class="literal">org.apache.es.tt.DocumentMetaData</code> and referred to by an |
| additional feature named <code class="literal">esDocumentMetaData</code> in the special |
| instance of <code class="literal">uima.tcas.DocumentAnnotation</code> that is |
| associated with the CAS. This is usually done by the Collection Reader; the instance |
| contains the following features: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">sequenceNumber</span></dt><dd><p>[Number] the sequential number of a chunk, starting at 1. If |
| not a chunk (i.e. complete document), the value should be 0.</p> |
| </dd><dt><span class="term">documentId</span></dt><dd><p>[Number] current document id. Chunks belonging to the same |
| document have identical document id.</p></dd><dt><span class="term">isCompleted</span></dt><dd><p>[Number] 1 if the chunk is the last in a sequence, 0 |
| otherwise.</p></dd><dt><span class="term">url</span></dt><dd><p>[String] document url.</p></dd><dt><span class="term">throttleID</span></dt><dd><p>[String] special attribute currently used by |
| OmniFind.</p></dd></dl></div> |
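| |
| <p>The metadata layout above could be expressed in a type system descriptor roughly as |
| follows. This is a sketch under stated assumptions, not the definition shipped with the |
| SDK: the supertype <code class="literal">uima.cas.TOP</code> and the range types |
| (<code class="literal">uima.cas.Integer</code> for the features marked [Number], |
| <code class="literal">uima.cas.String</code> for those marked [String]) are inferred from |
| the list above:</p> |
| <pre class="programlisting"><types> |
|   <!-- Assumed definition of the metadata type --> |
|   <typeDescription> |
|     <name>org.apache.es.tt.DocumentMetaData</name> |
|     <description/> |
|     <supertypeName>uima.cas.TOP</supertypeName> |
|     <features> |
|       <featureDescription> |
|         <name>sequenceNumber</name> |
|         <description/> |
|         <rangeTypeName>uima.cas.Integer</rangeTypeName> |
|       </featureDescription> |
|       <featureDescription> |
|         <name>documentId</name> |
|         <description/> |
|         <rangeTypeName>uima.cas.Integer</rangeTypeName> |
|       </featureDescription> |
|       <featureDescription> |
|         <name>isCompleted</name> |
|         <description/> |
|         <rangeTypeName>uima.cas.Integer</rangeTypeName> |
|       </featureDescription> |
|       <featureDescription> |
|         <name>url</name> |
|         <description/> |
|         <rangeTypeName>uima.cas.String</rangeTypeName> |
|       </featureDescription> |
|       <featureDescription> |
|         <name>throttleID</name> |
|         <description/> |
|         <rangeTypeName>uima.cas.String</rangeTypeName> |
|       </featureDescription> |
|     </features> |
|   </typeDescription> |
| |
|   <!-- Additional feature on the document annotation that refers to the metadata --> |
|   <typeDescription> |
|     <name>uima.tcas.DocumentAnnotation</name> |
|     <description/> |
|     <supertypeName>uima.tcas.Annotation</supertypeName> |
|     <features> |
|       <featureDescription> |
|         <name>esDocumentMetaData</name> |
|         <description/> |
|         <rangeTypeName>org.apache.es.tt.DocumentMetaData</rangeTypeName> |
|       </featureDescription> |
|     </features> |
|   </typeDescription> |
| </types></pre> |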
| |
| <p>This implementation of a sequenced queue supports proper sequencing of CASes in |
| CPM deployments that use document chunking. Chunking is a technique of splitting |
| large documents into pieces to reduce overall memory consumption. Chunking does not |
| depend on the number of CASes in the CAS Pool. It works equally well with one or more |
| CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work |
| Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a |
| CAS is released back to the pool by the processing threads. A document may be split into |
| 1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the |
| document correctly, the CAS Consumer can depend on receiving the chunks in the same |
| sequential order that the chunks were <span class="quote">“<span class="quote">produced</span>”</span>, when this |
| sequenced queue implementation is used. To plug in this sequenced queue to the CPM use |
| the following specification: |
| |
| |
| </p><pre class="programlisting"><outputQueue dequeueTimeout="100000" queueClass= |
| "org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/></pre><p> |
| |
| where the mandatory <code class="literal">queueClass</code> attribute defines the name of |
| the class, and the second mandatory attribute, <code class="literal">dequeueTimeout</code>, |
| specifies the maximum number of milliseconds to wait for the expected chunk.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The value for this timeout must be carefully determined to avoid |
| excessive occurrences of timeouts. Typically, the size of a chunk and the type of |
| analysis being done are the most important factors when deciding on the value for the |
| timeout. The larger the chunk and the more complicated the analysis, the more time it takes |
| for the chunk to go from source to sink. You may specify 0, in which case the timeout is |
| disabled; that is, it is equivalent to an infinitely long timeout.</p></div> |
| |
| <p>If the chunk doesn't arrive in the configured time window, the entire |
| document is presumed to be invalid and the CAS is dropped from further processing. |
| This action occurs regardless of any other error action specification. The |
| SequencedQueue invalidates the document, adding the offending document's |
| metadata to a local cache of invalid documents.</p> |
| |
| <p>If the timeout occurs, the CPM notifies all registered listeners (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe.using_listeners" class="olink">Section 2.3.1, “Using Listeners”</a>) by calling |
| entityProcessComplete(). As part of this call, the SequencedQueue passes null |
| instead of a CAS as the first argument, together with a special exception, |
| CPMChunkTimeoutException. Null is passed because the timeout means that the expected |
| chunk was not received within the configured window, so no CAS is available when |
| the timeout event occurs.</p> |
| |
| <p>The CPMChunkTimeoutException object includes an API that allows the listener |
| to retrieve the offending document id as well as the other metadata attributes as |
| defined above. These attributes are part of each chunk's metadata and are added |
| by the Collection Reader.</p> |
| |
| <p>Each chunk that SequencedQueue works on is subjected to a test to determine if the |
| chunk belongs to an invalid document. This test checks the chunk's metadata |
| against the data in the local cache. If there is a match, the chunk is dropped. This |
| check is performed only for chunks; complete documents are not subject to |
| it.</p> |
| |
| <p>If there is an exception during the processing of a chunk, the CPM sends a |
| notification to all registered listeners. The notification includes the CAS and an |
| exception. When the listener notification is completed, the CPM also sends separate |
| notifications, containing the CAS, to the Artifact Producer and the |
| SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong |
| to an <span class="quote">“<span class="quote">invalid</span>”</span> document and also to deal with chunks that are |
| en route, being processed by the processing threads.</p> |
| |
| <p>In response to the notification, the Artifact Producer will drop and release |
| back to the CAS Pool all CASes that belong to an <span class="quote">“<span class="quote">invalid</span>”</span> document. |
| Currently, there is no support in the CollectionReader's API to tell it to stop |
| generating chunks. The CollectionReader keeps producing the chunks but the |
| Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is |
| released back to the CAS Pool, the Artifact Producer sends notification to all |
| registered listeners. This notification includes the CAS and an exception – |
| SkipCasException.</p> |
| |
| <p>In response to the notification of an exception involving a chunk, the |
| SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of |
| <span class="quote">“<span class="quote">invalid</span>”</span> documents. All chunks de-queued from the OutputQueue and |
| belonging to <span class="quote">“<span class="quote">invalid</span>”</span> documents will be dropped and released back to |
| the CAS Pool. Before dropping the CAS, the CPM sends notification to all registered |
| listeners. The notification includes the CAS and SkipCasException.</p> |
| |
| <p>The <code class="literal"><checkpoint></code> element is an optional element. |
| It specifies a CPE checkpoint file, checkpoint frequency, and strategy for |
| checkpoints (time or count based). At checkpoint time, the CPM saves status |
| information and statistics to the checkpoint file. The checkpoint file is specified |
| in the <code class="literal">file</code> attribute, which has the same form as the |
| <code class="literal">href</code> attribute of the <code class="literal"><include></code> |
| element described in <a class="xref" href="#ugr.ref.xml.cpe_descriptor.imports" title="3.3. Imports">Section 3.3, “Imports”</a>. The |
| <code class="literal">time</code> attribute indicates that a checkpoint should be taken |
| every <code class="literal">[Number]</code> seconds, and the <code class="literal">batch</code> |
| attribute indicates that a checkpoint should be taken every |
| <code class="literal">[Number]</code> batches.</p> |
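| |
| <p>The two strategies might be configured as sketched below. The file path and the |
| numeric values are hypothetical, and the assumption that only one of |
| <code class="literal">time</code> or <code class="literal">batch</code> needs to be given |
| is based on the description above rather than on a confirmed schema:</p> |
| <pre class="programlisting"><!-- Time-based: take a checkpoint every 300 seconds (hypothetical path) --> |
| <checkpoint file="checkpoints/cpe_checkpoint.dat" time="300"/> |
| |
| <!-- Count-based alternative: take a checkpoint every 10 batches --> |
| <checkpoint file="checkpoints/cpe_checkpoint.dat" batch="10"/></pre> |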
| |
| <p>The <code class="literal"><timerImpl></code> element is optional. It is used to |
| identify a custom timer plug-in class to generate time stamps during the CPM |
| execution. The value of the element is a Java class name.</p> |
| |
| <p>The <code class="literal"><deployAs></code> element indicates the type of CPM |
| deployment. Valid contents for this element include: |
| |
| </p><div class="variablelist"><dl><dt><span class="term">vinciService</span></dt><dd><p>Vinci service exposing APIs for stop, pause, resume, and |
| getStats</p></dd><dt><span class="term">interactive</span></dt><dd><p>provide command line menus (start, stop, pause, |
| resume)</p></dd><dt><span class="term">immediate</span></dt><dd><p>run the CPM without menus or a service API</p></dd><dt><span class="term">single-threaded</span></dt><dd><p>run the CPM in single-threaded mode. In this mode, the |
| Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline |
| are all running in one thread without the work queue and the output |
| queue.</p></dd></dl></div> |
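| |
| <p>Putting several of these elements together, a <code class="literal"><cpeConfig></code> |
| section for a single-threaded run with a custom timer might look like the following sketch. |
| The class name <code class="literal">com.example.MyTimer</code> is hypothetical and stands |
| for any custom timer plug-in class available on the classpath:</p> |
| <pre class="programlisting"><cpeConfig> |
|   <!-- Process every entity the Collection Reader provides --> |
|   <numToProcess>-1</numToProcess> |
|   <!-- Hypothetical custom timer plug-in class --> |
|   <timerImpl>com.example.MyTimer</timerImpl> |
|   <!-- Reader, processing pipeline, and consumers share one thread --> |
|   <deployAs>single-threaded</deployAs> |
| </cpeConfig></pre> |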
| |
| </div> |
| |
| <div class="section" title="3.8. Resource Manager Configuration"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.resource_manager_configuration">3.8. Resource Manager Configuration</h2></div></div></div> |
| |
| |
| <p>External resource bindings for the CPE may optionally be specified in an |
| element: |
| |
| |
| </p><pre class="programlisting"><resourceManagerConfiguration href="..."/></pre> |
| |
| <p>For an introduction to external resources, refer to <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae.accessing_external_resource_files" class="olink">Section 1.5.4, “Accessing External Resources”</a>.</p> |
| |
| <p>In the <code class="literal">resourceManagerConfiguration</code> element, the value |
| of the href attribute refers to another file that contains definitions and bindings |
| for the external resources used by the CPE. The format of this file is the same as the XML |
| snippet in <a href="references.html#ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings" class="olink">Section 2.4.2.4, “External Resource Bindings”</a>. |
| For example, in a CPE containing an aggregate analysis engine with two annotators, |
| and a CAS Consumer, the following resource manager configuration file would bind |
| external resource dependencies in all three components to the same physical |
| resource: |
| |
| |
| </p><pre class="programlisting"><resourceManagerConfiguration> |
| |
| <!-- Declare Resource --> |
| |
| <externalResources> |
| <externalResource> |
| <name>ExampleResource</name> |
| <fileResourceSpecifier> |
| <fileUrl>file:MyResourceFile.dat</fileUrl> |
| </fileResourceSpecifier> |
| </externalResource> |
| </externalResources> |
| |
| <!-- Bind component resource dependencies to ExampleResource --> |
| |
| <externalResourceBindings> |
| <externalResourceBinding> |
| <key>MyAE/annotator1/myResourceKey</key> |
| <resourceName>ExampleResource</resourceName> |
| </externalResourceBinding> |
| |
| <externalResourceBinding> |
| <key>MyAE/annotator2/someResourceKey</key> |
| <resourceName>ExampleResource</resourceName> |
| </externalResourceBinding> |
| |
| <externalResourceBinding> |
| <key>MyCasConsumer/otherResourceKey</key> |
| <resourceName>ExampleResource</resourceName> |
| </externalResourceBinding> |
| |
| </externalResourceBindings> |
| |
| </resourceManagerConfiguration></pre> |
| |
| <p>In this example, <code class="literal">MyAE</code> and |
| <code class="literal">MyCasConsumer</code> are the names of the Analysis Engine and CAS |
| Consumer, as specified by the name attributes of the CPE's |
| <code class="literal"><casProcessor></code> elements. |
| <code class="literal">annotator1</code> and <code class="literal">annotator2</code> are the |
| annotator keys specified within the Aggregate AE Descriptor, and |
| <code class="literal">myResourceKey</code>, <code class="literal">someResourceKey</code>, and |
| <code class="literal">otherResourceKey</code> are the keys of the resource dependencies |
| declared in the individual annotator and CAS Consumer descriptors.</p> |
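| |
| <p>To make the connection between the binding keys and the CPE descriptor concrete, the |
| outline below sketches a CPE descriptor whose <code class="literal">name</code> attributes |
| match the first segment of the keys used above. The descriptor file names are |
| hypothetical, and the placement of the elements is assumed to follow the order in which |
| this chapter describes them:</p> |
| <pre class="programlisting"><cpeDescription> |
|   <!-- collectionReader omitted --> |
|   <casProcessors casPoolSize="1" processingUnitThreadCount="1"> |
|     <!-- "MyAE" is the first segment of MyAE/annotator1/myResourceKey --> |
|     <casProcessor deployment="integrated" name="MyAE"> |
|       <descriptor> |
|         <import location="MyAggregateAE.xml"/> |
|       </descriptor> |
|       <!-- remaining casProcessor elements omitted --> |
|     </casProcessor> |
|     <!-- "MyCasConsumer" is the first segment of MyCasConsumer/otherResourceKey --> |
|     <casProcessor deployment="integrated" name="MyCasConsumer"> |
|       <descriptor> |
|         <import location="MyCasConsumer.xml"/> |
|       </descriptor> |
|       <!-- remaining casProcessor elements omitted --> |
|     </casProcessor> |
|   </casProcessors> |
|   <!-- cpeConfig omitted --> |
|   <resourceManagerConfiguration href="MyResourceManagerConfig.xml"/> |
| </cpeDescription></pre> |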
| |
| </div> |
| |
| <div class="section" title="3.9. Example CPE Descriptor"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xml.cpe_descriptor.descriptor.example">3.9. Example CPE Descriptor</h2></div></div></div> |
| |
| |
| |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8"?> |
| <cpeDescription> |
| <collectionReader> |
| <collectionIterator> |
| <descriptor> |
| <import location= |
| "../collection_reader/FileSystemCollectionReader.xml"/> |
| </descriptor> |
| </collectionIterator> |
| </collectionReader> |
| <casProcessors dropCasOnException="true" casPoolSize="1" |
| processingUnitThreadCount="1"> |
| <casProcessor deployment="integrated" |
| name="Aggregate TAE - Name Recognizer and Person Title Annotator"> |
| <descriptor> |
| <import location= |
| "../analysis_engine/NamesAndPersonTitles_TAE.xml"/> |
| </descriptor> |
| <deploymentParameters/> |
| <filter/> |
| <errorHandling> |
| <errorRateThreshold action="terminate" value="100/1000"/> |
| <maxConsecutiveRestarts action="terminate" value="30"/> |
| <timeout max="100000"/> |
| </errorHandling> |
| <checkpoint batch="1"/> |
| </casProcessor> |
| <casProcessor deployment="integrated" name="Annotation Printer"> |
| <descriptor> |
| <import location="../cas_consumer/AnnotationPrinter.xml"/> |
| </descriptor> |
| <deploymentParameters/> |
| <filter/> |
| <errorHandling> |
| <errorRateThreshold action="terminate" value="100/1000"/> |
| <maxConsecutiveRestarts action="terminate" value="30"/> |
| <timeout max="100000"/> |
| </errorHandling> |
| <checkpoint batch="1"/> |
| </casProcessor> |
| </casProcessors> |
| <cpeConfig> |
| <numToProcess>1</numToProcess> |
| <deployAs>immediate</deployAs> |
| <checkpoint file="" time="3000"/> |
| <timerImpl/> |
| </cpeConfig> |
| </cpeDescription></pre> |
| </div> |
| |
| <div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e1067" href="#d5e1067" class="para">3</a>] </sup>Deprecated</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e1266" href="#d5e1266" class="para">4</a>] </sup>An earlier UIMA version required these to have a |
| suffix of <span class="quote">“<span class="quote">_p</span>”</span>, e.g., <span class="quote">“<span class="quote">string_p</span>”</span>. This is no |
| longer required, but this format is accepted, also, for backward |
| compatibility.</p></div></div></div> |
| <div class="chapter" title="Chapter 4. CAS Reference" id="ugr.ref.cas"><div class="titlepage"><div><div><h2 class="title">Chapter 4. CAS Reference</h2></div></div></div> |
| |
| |
| <p>The CAS (Common Analysis System) is the part of the Unstructured Information |
| Management Architecture (UIMA) that is concerned with creating and handling the data |
| that annotators manipulate.</p> |
| |
| <p>Java users typically use the JCas (Java interface to the CAS) when manipulating |
| objects in the CAS. This chapter describes an alternative interface to the CAS which |
| allows discovery and specification of types and features at run time. It is recommended |
| when the calling code cannot know ahead of time which type system it will be dealing |
| with.</p> |
| |
| <p>Use of the CAS as described here is also recommended (or necessary) when components add |
| to the definitions of types of other components. This UIMA feature allows users to add features |
| to a type that was already defined elsewhere. When this feature is used in conjunction with the |
| JCas, it can lead to problems with class loading. This is because different JCas representations |
| of a single type are generated by the different components, and only one of them is loaded |
| (unless you are using Pear descriptors). Note: |
| we do not recommend that you add features to pre-existing types. A type should be defined in one |
| place only, and then there is no problem with using the JCas. However, if you do use this feature, |
| do not use the JCas. Similarly, if you distribute your components for inclusion in somebody else's |
| UIMA application, and you're not sure that they won't add features to your types, do not use the |
| JCas for the same reasons. |
| </p> |
| |
| <div class="section" title="4.1. Javadocs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.javadocs">4.1. Javadocs</h2></div></div></div> |
| |
| |
| <p>The subdirectory <code class="literal">docs/api</code> contains the documentation |
| details of all the classes, methods, and constants for the APIs discussed here. Please |
| refer to this for details on the methods, classes and constants, specifically in the |
| packages <code class="literal">org.apache.uima.cas.*</code>.</p> |
| </div> |
| |
| <div class="section" title="4.2. CAS Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.overview">4.2. CAS Overview</h2></div></div></div> |
| |
| |
| <p>There are three<sup>[<a name="d5e1615" href="#ftn.d5e1615" class="footnote">5</a>]</sup> main parts to the CAS: the type system, data creation and |
| manipulation, and indexing. We will start with a brief |
| description of these components.</p> |
| <div class="section" title="4.2.1. The Type System"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.type_system">4.2.1. The Type System</h3></div></div></div> |
| |
| |
| <p>The type system specifies what kind of data you will be able to manipulate in your |
| annotators. The type system defines two kinds of entities, types and features. Types |
| are arranged in a single inheritance tree and define the kinds of entities (objects) |
| you can manipulate in the CAS. Features optionally specify slots or fields within a |
| type. The correspondence to Java is to equate a CAS Type to a Java Class, and the CAS |
| Features to fields within the type. A critical difference is that CAS types have no |
| methods; they are just data structures with named slots (features). These features can |
| have as values primitive things like integers, floating point numbers, and strings, |
| and they also can hold references to other instances of objects in the CAS. We call |
| instances of the data structures declared by the type system <span class="quote">“<span class="quote">feature |
| structures</span>”</span> (not to be confused with <span class="quote">“<span class="quote">features</span>”</span>). Feature |
| structures are similar to the many variants of record structures found in computer |
| science.<sup>[<a name="d5e1624" href="#ftn.d5e1624" class="footnote">6</a>]</sup></p> |
| |
| <p>Each CAS Type defines a supertype; it is a subtype of that supertype. This means |
| that any features that the supertype defines are features of the subtype; in other |
| words, it inherits its supertype's features. Only single inheritance is |
| supported; a type's feature set is the union of all of the features in its |
| supertype hierarchy. There is a built-in type called uima.cas.TOP; this is the top, |
| root node of the inheritance tree. It defines no features.</p> |
| |
| <p>The values that can be stored in features are either built-in primitive values or |
| references to other feature structures. The primitive values are |
| <code class="literal">boolean</code>, <code class="literal">byte</code>, |
| <code class="literal">short</code> (16 bit integers), <code class="literal">integer</code> (32 |
| bit), <code class="literal">long</code> (64 bit), <code class="literal">float</code> (32 bit), |
| <code class="literal">double</code> (64 bit floats) and strings; the official names of these |
| are <code class="literal">uima.cas.Boolean</code>, <code class="literal">uima.cas.Byte</code>, |
| <code class="literal">uima.cas.Short</code>, <code class="literal">uima.cas.Integer</code>, |
| <code class="literal">uima.cas.Long</code>, <code class="literal">uima.cas.Float</code>, |
| <code class="literal">uima.cas.Double</code> and <code class="literal">uima.cas.String</code>. |
| The strings are Java strings, and characters are Java characters. Technically, this means |
| that characters are UTF-16 code units, which is not quite the same as a Unicode character. |
| This distinction should make no difference for almost all applications. |
| The CAS also defines other basic built-in types for arrays of these, plus arrays of |
| references to other objects, called <code class="literal">uima.cas.IntegerArray</code>, |
| <code class="literal">uima.cas.FloatArray</code>, |
| <code class="literal">uima.cas.StringArray</code>, |
| <code class="literal">uima.cas.FSArray</code>, etc.</p> |
| |
| <p>The CAS also defines a built-in type called |
| <code class="literal">uima.tcas.Annotation</code> which inherits from |
| <code class="literal">uima.cas.AnnotationBase</code> which in turn inherits from |
| <code class="literal">uima.cas.TOP</code>. There are two features defined by this type, |
| called <code class="literal">begin</code> and <code class="literal">end</code>, both of which are |
| integer valued.</p> |
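| |
| <p>As a brief illustration of these concepts, the type system descriptor sketched below |
| declares two hypothetical types. <code class="literal">com.example.Person</code> inherits |
| <code class="literal">begin</code> and <code class="literal">end</code> from |
| <code class="literal">uima.tcas.Annotation</code> and adds a primitive-valued feature, an |
| array-valued feature, and a feature that refers to another feature structure. All type |
| and feature names in the <code class="literal">com.example</code> namespace are invented |
| for this example:</p> |
| <pre class="programlisting"><typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier"> |
|   <types> |
|     <typeDescription> |
|       <name>com.example.Organization</name> |
|       <description>Hypothetical type with no features of its own.</description> |
|       <supertypeName>uima.cas.TOP</supertypeName> |
|     </typeDescription> |
|     <typeDescription> |
|       <name>com.example.Person</name> |
|       <description>Hypothetical annotation type.</description> |
|       <supertypeName>uima.tcas.Annotation</supertypeName> |
|       <features> |
|         <!-- Primitive value --> |
|         <featureDescription> |
|           <name>age</name> |
|           <description/> |
|           <rangeTypeName>uima.cas.Integer</rangeTypeName> |
|         </featureDescription> |
|         <!-- Built-in array type --> |
|         <featureDescription> |
|           <name>aliases</name> |
|           <description/> |
|           <rangeTypeName>uima.cas.StringArray</rangeTypeName> |
|         </featureDescription> |
|         <!-- Reference to another feature structure --> |
|         <featureDescription> |
|           <name>employer</name> |
|           <description/> |
|           <rangeTypeName>com.example.Organization</rangeTypeName> |
|         </featureDescription> |
|       </features> |
|     </typeDescription> |
|   </types> |
| </typeSystemDescription></pre> |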
| |
| </div> |
| |
| <div class="section" title="4.2.2. Creating, accessing and manipulating data"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.creating_accessing_manipulating_data">4.2.2. Creating, accessing and manipulating data</h3></div></div></div> |
| |
| |
| |
| <p> |
| Creating and accessing data in the CAS requires knowledge about the types and features |
| defined in the type system. The idea is similar to other data access APIs, such as the XML |
| DOM or SAX APIs, or database access APIs such as JDBC. Contrary to those APIs, however, the |
| CAS does not use the names of type system entities directly in the APIs. Rather, you use |
| the type system to access type and feature entities by name, then use these entities in the |
| data manipulation APIs. This can be compared to the Java reflection APIs: the type system |
| is comparable to the Java class loader, and the type and feature objects to the |
| <code class="literal">java.lang.Class</code> and <code class="literal">java.lang.reflect.Field</code> classes. |
| </p> |
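| <p>The reflection analogy can be made concrete with a small plain-Java sketch. It uses no UIMA |
| classes; the class <code class="literal">ReflectionAnalogy</code>, its nested <code class="literal">Person</code> |
| class and the helper methods are hypothetical names chosen for illustration. A name is resolved once to a handle, |
| and the handle (not the name) is then used for data access, in the same way that type and feature objects are |
| obtained from the type system by name and then passed to the data manipulation APIs.</p> |

```java
import java.lang.reflect.Field;

// Plain-Java illustration of the analogy only; no UIMA classes are involved.
final class ReflectionAnalogy {

  // Hypothetical data class, standing in for a CAS type with one feature.
  static final class Person {
    public String firstName = "Ada";
  }

  // Step 1: resolve a name to a handle (compare: asking the type system
  // for a feature object by name).
  static Field lookupField(Class<?> type, String fieldName) {
    try {
      return type.getField(fieldName);
    } catch (NoSuchFieldException e) {
      throw new IllegalStateException("Unknown field: " + fieldName, e);
    }
  }

  // Step 2: use the handle to read the data (compare: passing a feature
  // object to a CAS accessor method).
  static Object read(Object instance, Field field) {
    try {
      return field.get(instance);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException("Field not accessible: " + field.getName(), e);
    }
  }

  public static void main(String[] args) {
    Field firstName = lookupField(Person.class, "firstName");
    System.out.println(read(new Person(), firstName));
  }
}
```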
| |
| <p> |
| Why does it have to be this complicated? You wouldn't normally use reflection to create a |
| Java object, either. As mentioned earlier, the JCas provides the more straightforward |
| method to manipulate CAS data. The CAS access methods described here need only be used for |
| generic types of applications that need to be able to handle any kind of data (e.g., generic |
| tooling) or when the JCas may not be used for other reasons. The generic kinds of applications |
| are exactly the ones where you would use the reflection API in Java as well. |
| </p> |
| |
| </div> |
| |
| <div class="section" title="4.2.3. Creating and using indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.creating_using_indexes">4.2.3. Creating and using indexes</h3></div></div></div> |
| |
| |
| <p>Each view of a CAS provides a set of indexes for that view. Instances of Types (that is, Feature |
| Structures) can be added to a view's indexes. These indexes provide |
| a way for annotators to locate existing data in the CAS, using a specific index (or the |
| method <code class="literal">getAllIndexedFS</code> of the object <code class="literal">FSIndexRepository</code>) to |
| retrieve the Feature Structures that were previously created. |
| Newly created Feature Structures are not automatically added to the indexes; you choose which |
| Feature Structures to add and use one of several APIs to add them. |
| </p> |
| |
| <p>Indexes are named and are associated with a CAS Type; they are used to index |
| instances of that CAS type (including instances of that type's subtypes). If |
| you are using multiple views (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.mvs" class="olink">Chapter 6, <i>Multiple CAS Views of an Artifact</i></a>), |
| each view contains a separate instantiation of all of the indexes. |
| To access an index, you |
| minimally need to know its name. A CAS view provides an index repository which you can |
| query for indexes for that view. Once you have a handle to an index, you can get |
| information about the feature structures in the index, the size of the index, as well |
| as an iterator over the feature structures.</p> |
| |
| <p>There are three kinds of indexes: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"> |
| <p>bag - no ordering</p> |
| </li><li class="listitem"> |
| <p>set - uses a user-specified set of keys to define equality; holds only one instance from each group of equal items.</p> |
| </li><li class="listitem"> |
| <p>sorted - uses a user-specified set of keys to define ordering.</p> |
| </li></ul></div><p> |
| </p> |
| |
| <p>For set indexes, the comparator keys are augmented with an implicit additional field - the type of the |
| feature structure. This means that a set index over Annotation (which has a subtype Token), with the "begin" value as its only key, |
| will behave as follows: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>If you make two Tokens (or two Annotations), both having a begin value of 17, and add both of them to the indexes, |
| only one of them will be in the index.</p> |
| </li><li class="listitem"><p>If you make 1 Token and 1 Annotation, both having a begin value of 17, and add both of them to the indexes, |
| both of them will be in the index (because the types are different). |
| </p></li></ul></div><p> |
| </p> |
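| <p>The two cases above can be reproduced with a plain-Java model of a set index. This is a sketch of the |
| rule only, not the UIMA implementation: <code class="literal">SetIndexModel</code>, <code class="literal">Fs</code> |
| and <code class="literal">indexedCount</code> are hypothetical names, and a <code class="literal">TreeSet</code> stands in |
| for the index. The comparator combines the declared key (begin) with the implicit type key.</p> |

```java
import java.util.Comparator;
import java.util.TreeSet;

// Plain-Java model of the set-index rule; not the UIMA implementation.
final class SetIndexModel {

  // Minimal stand-in for a feature structure: a type name and a begin value.
  static final class Fs {
    final String typeName;
    final int begin;

    Fs(String typeName, int begin) {
      this.typeName = typeName;
      this.begin = begin;
    }
  }

  // Equality for the set: the declared key (begin) plus the implicit type key.
  static final Comparator<Fs> KEYS =
      Comparator.<Fs>comparingInt(fs -> fs.begin).thenComparing(fs -> fs.typeName);

  // Adds all items to a fresh set "index" and reports how many were retained.
  static int indexedCount(Fs... items) {
    TreeSet<Fs> index = new TreeSet<>(KEYS);
    for (Fs fs : items) {
      index.add(fs); // an item equal to one already present is not added again
    }
    return index.size();
  }
}
```

| <p>In this model, two Token items with begin 17 leave one entry in the set, while a Token and an |
| Annotation with begin 17 leave two.</p> |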
| |
| <p>Indexes are defined in the XML descriptor metadata for the application. Each CAS |
| View has its own, separate instantiation of indexes based on these definitions, |
| kept in the view's index repository. When you obtain an index, it is always from a |
| particular CAS view's index repository. |
| When you index an item, it is always added to all indexes where it |
| belongs, within just the view's repository. You can specify different repositories |
| (associated with different CAS views) to use; a given Feature Structure instance |
| may be indexed in more than one CAS View (unless it is a subtype of AnnotationBase).</p> |
| |
| <p>Indexes implement the Iterable interface, so you may use the Java enhanced for loop to iterate over them.</p> |
| |
| <p>You can also get iterators from indexes; |
| iterators allow you to enumerate the feature structures in an index. There are two kinds of iterators supported: |
| the regular Java iterator API, and a specific FS iterator API |
| where the usual Java iterator APIs (<code class="literal">hasNext()</code> and <code class="literal">next()</code>) |
| are augmented by <code class="literal">isValid()</code>, <code class="literal">moveToNext() / moveToPrevious()</code> (which does |
| not return an element) and <code class="literal">get()</code>. Finally, there is a <code class="literal">moveTo(FeatureStructure)</code> |
| API, which, for sorted indexes, moves the iteration point to the left-most (among otherwise "equal") item |
| in the index which compares "equal" to the given FeatureStructure, using the index's defined comparator. |
| </p> |
| |
| <p> |
| Which API style you use is up to you, |
| but we do not recommend mixing the styles as the results are sometimes unexpected. If you |
| just want to iterate over an index from start to finish, either style is equally appropriate. |
| If you also use <code class="literal">moveTo(FeatureStructure fs)</code> and |
| <code class="literal">moveToPrevious()</code>, it is better to use the special FS iterator style. |
| </p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The reason to not mix these styles is that you might be thinking that |
| next() followed by moveToPrevious() would always work. This is not true, because |
| next() returns the "current" element, and advances to the next position, which might be |
| beyond the last element. At that point, the iterator becomes "invalid", and |
| moveToNext and moveToPrevious no longer move the iterator. To reset it, |
| call moveToFirst(), moveToLast(), or moveTo(FS) on the iterator.</p></div> |
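| <p>The behavior described in the note can be illustrated with a small plain-Java model of a cursor. |
| The class <code class="literal">CursorModel</code> is hypothetical and only mimics the method names mentioned above; |
| it is a sketch under the assumption that an invalid iterator ignores relative moves, as the note states, |
| and it is not the UIMA <code class="literal">FSIterator</code> implementation.</p> |

```java
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java model of a cursor that mixes both iterator styles.
final class CursorModel<T> {
  private final List<T> items;
  private int pos; // the cursor is valid when 0 <= pos < items.size()

  CursorModel(List<T> items) {
    this.items = items;
    this.pos = 0;
  }

  boolean isValid() {
    return pos >= 0 && pos < items.size();
  }

  T get() {
    if (!isValid()) {
      throw new NoSuchElementException("iterator is not valid");
    }
    return items.get(pos);
  }

  // Relative moves do nothing once the cursor is invalid.
  void moveToNext() { if (isValid()) pos++; }
  void moveToPrevious() { if (isValid()) pos--; }

  // Absolute moves reset the cursor (it stays invalid only if the list is empty).
  void moveToFirst() { pos = 0; }
  void moveToLast() { pos = items.size() - 1; }

  // Java-iterator style: next() returns the current element, then advances.
  boolean hasNext() { return isValid(); }

  T next() {
    T current = get();
    pos++;
    return current;
  }
}
```

| <p>With a two-element list, calling <code class="literal">next()</code> twice leaves this cursor beyond the last element; |
| a following <code class="literal">moveToPrevious()</code> has no effect, while <code class="literal">moveToLast()</code> |
| makes it valid again.</p> |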
| |
| <p>Indexes are created by specifying them in the annotator's or |
| aggregate's resource descriptor. An index specification includes its name, |
| the CAS type being indexed, the kind (bag, set or sorted) of index it is, and an (optional) set of keys. |
| The keys are used for set and sorted indexes, and specify what values are used for |
| ordering, or (for sets) what values are used to determine set equality. |
| When a CAS pipeline is created, all index |
| specifications are combined; duplicate definitions (having the same name) are |
| allowed only if their definitions are the same. </p> |
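| <p>The rule for combining index specifications can be sketched in plain Java. This is a model of the rule |
| only: <code class="literal">IndexDefinitionMerge</code> is a hypothetical class, each definition is reduced to a |
| descriptive string, and the choice of <code class="literal">IllegalArgumentException</code> for a conflict is an |
| assumption of this sketch rather than the framework's actual error handling.</p> |

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model: index name -> textual description of type, kind and keys.
final class IndexDefinitionMerge {

  // Combines two sets of definitions. The same name may appear in both
  // only if the definitions are identical; otherwise the merge is rejected.
  static Map<String, String> merge(Map<String, String> first, Map<String, String> second) {
    Map<String, String> merged = new LinkedHashMap<>(first);
    for (Map.Entry<String, String> entry : second.entrySet()) {
      String existing = merged.putIfAbsent(entry.getKey(), entry.getValue());
      if (existing != null && !existing.equals(entry.getValue())) {
        throw new IllegalArgumentException(
            "Conflicting definitions for index " + entry.getKey());
      }
    }
    return merged;
  }
}
```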
| |
| <p>Feature structure instances need to be explicitly added to the index repository by a |
| method call. Feature structures that are not indexed will not be visible to other |
| annotators, unless they can be reached through a reference held in a feature of some |
| indexed feature structure, either directly or through a chain of such references.</p> |
| |
| <p>The framework defines an unnamed bag index which indexes all types. The |
| only access provided for this index is the getAllIndexedFS(type) method on the |
| index repository, which returns an iterator over all indexed instances of the |
| specified type (including its subtypes) for that CAS View. |
| </p> |
| |
| <p>The framework defines one standard, built-in annotation index, called |
| AnnotationIndex, which indexes the <code class="literal">uima.tcas.Annotation</code> |
| type: all feature structures of type <code class="literal">uima.tcas.Annotation</code> or |
| its subtypes that are added to a view's indexes are automatically included in this built-in index.</p> |
| |
| <p>The ordering relation used by this index is to first order by the value of the |
| <span class="quote">“<span class="quote">begin</span>”</span> feature (in ascending order), then by the value of the |
| <span class="quote">“<span class="quote">end</span>”</span> feature (in descending order), and finally by the |
| Type Priority. This ordering ensures that |
| longer annotations starting at the same spot come before shorter ones. For Subjects |
| of Analysis other than Text, this may not be an appropriate index.</p> |
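| <p>The ordering relation can be written down as a comparator. The following plain-Java sketch models it; |
| <code class="literal">AnnotationOrderModel</code> and <code class="literal">Span</code> are hypothetical names, and type |
| priority is represented, as an assumption of this sketch, by a list of type names in which earlier entries sort |
| first (every type used is assumed to appear in the list).</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the annotation index ordering; not the UIMA implementation.
final class AnnotationOrderModel {

  // Minimal stand-in for an annotation.
  static final class Span {
    final String typeName;
    final int begin;
    final int end;

    Span(String typeName, int begin, int end) {
      this.typeName = typeName;
      this.begin = begin;
      this.end = end;
    }
  }

  // begin ascending, then end descending, then position in the priority list.
  static Comparator<Span> order(List<String> typePriority) {
    return Comparator.<Span>comparingInt(s -> s.begin)
        .thenComparing(Comparator.<Span>comparingInt(s -> s.end).reversed())
        .thenComparingInt(s -> typePriority.indexOf(s.typeName));
  }

  // Returns a new list holding the spans in index order.
  static List<Span> sorted(List<String> typePriority, Span... spans) {
    List<Span> result = new ArrayList<>(Arrays.asList(spans));
    result.sort(order(typePriority));
    return result;
  }
}
```

| <p>In this model, a span covering offsets 0 to 13 sorts before one covering 0 to 3, because the longer of two |
| spans with the same begin value has the larger end value.</p> |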
| |
| <p>In addition to normal iterators, there is a <code class="literal">select</code> API, documented |
| in the Version 3 Users guide, which provides additional capabilities for accessing |
| Feature Structures via the indexes.</p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="4.3. Built-in CAS Types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.builtin_types">4.3. Built-in CAS Types</h2></div></div></div> |
| |
| |
| <p>The CAS has two kinds of built-in types – primitive and non-primitive. The |
| primitive types are: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.Boolean</p></li><li class="listitem"><p>uima.cas.Byte</p></li><li class="listitem"><p>uima.cas.Short</p></li><li class="listitem"><p>uima.cas.Integer</p></li><li class="listitem"><p>uima.cas.Long</p></li><li class="listitem"><p>uima.cas.Float</p></li><li class="listitem"><p>uima.cas.Double</p></li><li class="listitem"><p>uima.cas.String</p></li></ul></div> |
| |
| <p>The <code class="literal">Byte</code>, <code class="literal">Short</code>, <code class="literal">Integer</code>, and <code class="literal">Long</code> types are |
| all signed integer types, of length 8, 16, 32, and 64 bits, respectively. The |
| <code class="literal">Float</code> type is 32-bit floating point and the |
| <code class="literal">Double</code> type is 64-bit floating point. The |
| <code class="literal">String</code> type can be subtyped to create sets of allowed values; see |
| <a href="references.html#ugr.ref.xml.component_descriptor.type_system.string_subtypes" class="olink">Section 2.3.4, “String Subtypes”</a>. |
| These subtypes can be used to specify the range of a String-valued feature. They act like |
| Strings, but have additional checking to ensure that any value set into them |
| is one of the allowed values, or null (which is the value if it is not set). |
| Note that the other primitive types cannot be used |
| as a supertype for another type definition; only |
| <code class="literal">uima.cas.String</code> can be sub-typed.</p> |
| |
| <p>The non-primitive types exist in a type hierarchy; the top of the hierarchy is the |
| type <code class="literal">uima.cas.TOP</code>. All other non-primitive types inherit from |
| some supertype.</p> |
| |
| <p>There are 9 built-in array types. These arrays have a size specified when they are |
| created; the size is fixed at creation time. They are named: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.BooleanArray</p></li><li class="listitem"><p>uima.cas.ByteArray</p></li><li class="listitem"><p>uima.cas.ShortArray</p></li><li class="listitem"><p>uima.cas.IntegerArray</p></li><li class="listitem"><p>uima.cas.LongArray</p></li><li class="listitem"><p>uima.cas.FloatArray</p></li><li class="listitem"><p>uima.cas.DoubleArray</p></li><li class="listitem"><p>uima.cas.StringArray</p></li><li class="listitem"><p>uima.cas.FSArray</p></li></ul></div> |
| |
| <p>The <code class="literal">uima.cas.FSArray</code> type is an array whose elements are |
| arbitrary other feature structures (instances of non-primitive types).</p> |
| |
| <p>The JCas cover classes for the array types support the Iterable API, so you may |
| write enhanced for loops over instances of these. For example: |
| </p><pre class="programlisting">FSArray<MyType> myArray = ... |
| for (MyType fs : myArray) { |
| some_method(fs); |
| }</pre><p> |
| </p> |
| |
| <p>There are 3 built-in types associated with the artifact being analyzed: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.AnnotationBase</p></li><li class="listitem"><p>uima.tcas.Annotation</p></li><li class="listitem"><p>uima.tcas.DocumentAnnotation</p></li></ul></div> |
| |
| <p>The <code class="literal">AnnotationBase</code> type defines one system-used feature |
| which specifies for an annotation the subject of analysis (Sofa) to which it refers. The |
| Annotation type extends from this and defines 2 features, taking |
| <code class="literal">uima.cas.Integer</code> values, called <code class="literal">begin</code> |
| and <code class="literal">end</code>. The <code class="literal">begin</code> feature typically |
| identifies the start of a span of text the annotation covers; the |
| <code class="literal">end</code> feature identifies the end. The values refer to character |
| offsets; the starting index is 0. An annotation of the word <span class="quote">“<span class="quote">CAS</span>”</span> in a text |
| <span class="quote">“<span class="quote">CAS Reference</span>”</span> would have a start index of 0, and an end index of 3; the |
| difference between end and start is the length of the span the annotation refers |
| to.</p> |
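| <p>The offsets in this example follow the same convention as Java's <code class="literal">String.substring</code>: |
| the begin offset is inclusive and the end offset is exclusive. The following plain-Java sketch checks the |
| numbers given above; <code class="literal">SpanOffsets</code> and its methods are hypothetical helper names, not |
| part of the UIMA API.</p> |

```java
// Plain-Java check of the begin/end arithmetic; no UIMA classes are involved.
final class SpanOffsets {

  // Returns the text covered by a span with the given character offsets.
  static String coveredText(String documentText, int begin, int end) {
    return documentText.substring(begin, end);
  }

  // The length of the span is the difference between end and begin.
  static int length(int begin, int end) {
    return end - begin;
  }

  public static void main(String[] args) {
    System.out.println(coveredText("CAS Reference", 0, 3));
    System.out.println(length(0, 3));
  }
}
```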
| |
| <p>Annotations are always with respect to some Sofa (Subject of Analysis – see |
| <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter 5, <i>Annotations, Artifacts, and Sofas</i></a>).</p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Artifacts which are not text strings may have a different interpretation of |
| the meaning of begin and end, or may define their own kind of annotation, extending from |
| <code class="literal">AnnotationBase</code>. </p></div> |
| |
| <p><a name="ugr.ref.cas.document_annotation"></a>The <code class="literal">DocumentAnnotation</code> type has one special instance. It is |
| a subtype of the Annotation type, and the built-in definition defines one feature, |
| <code class="literal">language</code>, which is a string indicating the language of the |
| document in the CAS. The value of this language feature is used by the system to control |
| flow among annotators when the <span class="quote">“<span class="quote">CapabilityLanguageFlow</span>”</span> mode is used, |
| allowing the flow to skip over annotators that don't process particular |
| languages. Users may extend this type by adding additional features to it, using the XML |
| Descriptor element for defining a type.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p> |
| We do <span class="emphasis"><em>not</em></span> recommend extending the <code class="literal">DocumentAnnotation</code> |
| type. If you do, you must <span class="emphasis"><em>not</em></span> use the JCas, for the reasons stated |
| earlier. |
| </p></div> |
| |
| <p>Each CAS view has a different associated instance of the |
| <code class="literal">DocumentAnnotation</code> type. On the CAS, use |
| <code class="literal">getDocumentAnnotation()</code> to access the |
| <code class="literal">DocumentAnnotation</code>.</p> |
| |
| <p>There are also built-in types supporting linked lists, similar to the ones available in |
| Java and other programming languages. Their use is |
| constrained by the usual properties of linked lists: not very space efficient, no (efficient) |
| random access, but an easy choice if you don't know how long your list will be ahead of time. The |
| implementation is type specific; there are different list building objects for each of |
| the primitive types, plus one for general feature structures. Here are the type names: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>uima.cas.FloatList</p></li><li class="listitem"><p>uima.cas.IntegerList</p></li><li class="listitem"><p>uima.cas.StringList</p></li><li class="listitem"><p>uima.cas.FSList</p> |
| <p></p></li><li class="listitem"><p>uima.cas.EmptyFloatList</p></li><li class="listitem"><p>uima.cas.EmptyIntegerList</p></li><li class="listitem"><p>uima.cas.EmptyStringList</p></li><li class="listitem"><p>uima.cas.EmptyFSList</p> |
| <p></p></li><li class="listitem"><p>uima.cas.NonEmptyFloatList</p></li><li class="listitem"><p>uima.cas.NonEmptyIntegerList</p></li><li class="listitem"><p>uima.cas.NonEmptyStringList</p></li><li class="listitem"><p>uima.cas.NonEmptyFSList</p></li></ul></div> |
| |
| <p>For each of the primitive types <code class="literal">Float</code>, |
| <code class="literal">Integer</code> and <code class="literal">String</code>, and for general |
| feature structures (<code class="literal">FS</code>), there is a base type, for instance, |
| <code class="literal">uima.cas.FloatList</code>. For each of these, there are two subtypes, |
| corresponding to a non-empty element, and a marker that serves to indicate the end of the |
| list, or an empty list. The non-empty types define two features – |
| <code class="literal">head</code> and <code class="literal">tail</code>. The head feature holds the |
| particular value for that part of the list. The tail refers to the next list object |
| (either a non-empty one or the empty version to indicate the end of the list).</p> |
| |
| <p>For JCas users, the new operator for the NonEmptyXyzList classes includes a 3 argument version |
| where you may specify the head and tail values as part of the constructor. The JCas |
| cover classes for these implement |
| a <code class="code">push(item)</code> method which creates a new non-empty node, sets the <code class="code">head</code> value |
| to <code class="code">item</code>, and the tail to the node it is called on, and returns the new node. |
| These classes also implement Iterable, so you can use the enhanced Java <code class="code">for</code> operator. |
| The iterator stops when it gets to the end of the list, determined by either the tail being null or |
| the element being one of the EmptyXyzList elements. |
| Here's a StringList example: |
| </p><pre class="programlisting">StringList sl = jcas.emptyStringList(); |
| sl = sl.push("2"); |
| sl = sl.push("1"); |
| |
| for (String s : sl) { |
| someMethod(s); // some sample use |
| }</pre><p> |
| |
| </p> |
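| <p>The head and tail structure behind such a list can be modelled in plain Java. The sketch below is not the |
| JCas implementation; <code class="literal">StringListModel</code> and its nested classes are hypothetical names |
| that mirror the base, empty and non-empty types described above, including a <code class="code">push</code> method |
| and a traversal that stops at an empty node or a null tail.</p> |

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the base / empty / non-empty list types.
final class StringListModel {

  // Base type: either the empty marker or a non-empty node.
  static class Node {
    // Creates a new non-empty node whose head is item and whose tail is this node.
    Node push(String item) {
      return new NonEmpty(item, this);
    }
  }

  // Marks the end of a list, or an empty list.
  static final class Empty extends Node { }

  // Holds one value (head) and a reference to the rest of the list (tail).
  static final class NonEmpty extends Node {
    final String head;
    final Node tail;

    NonEmpty(String head, Node tail) {
      this.head = head;
      this.tail = tail;
    }
  }

  // Walks the list, stopping at an Empty node or a null tail.
  static List<String> toJavaList(Node list) {
    List<String> out = new ArrayList<>();
    Node current = list;
    while (current instanceof NonEmpty) {
      NonEmpty node = (NonEmpty) current;
      out.add(node.head);
      current = node.tail;
    }
    return out;
  }
}
```

| <p>Because <code class="code">push</code> prepends, pushing "2" and then "1" onto an empty list yields the |
| elements in the order "1", "2" when the list is traversed.</p> |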
| |
| <p>There are no other built-in types. Users are free to define their own type systems, |
| building upon these types.</p> |
| |
| </div> |
| |
| <div class="section" title="4.4. Accessing the type system"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.accessing_the_type_system">4.4. Accessing the type system</h2></div></div></div> |
| |
| |
| <p> |
| During annotator processing, or outside an annotator, access the type system by calling |
| <code class="literal">CAS.getTypeSystem()</code>. |
| </p> |
| |
| <p>However, CAS annotators implement an additional method, |
| <code class="literal">typeSystemInit()</code>, which is called by the UIMA framework before the |
| annotator's process method. This method, implemented by the annotator writer, |
| is passed a reference to the CAS's type system metadata. The method typically uses |
| the type system APIs to obtain type and feature objects corresponding to all the types |
| and features the annotator will be using in its process method. This initialization |
| step should not be done during an annotator's initialize method since the type |
| system can change after the initialize method is called; it should not be done during the |
| process method, since this is presumably work that is identical for each incoming |
| document, and so should be performed only when the type system changes (which will be a |
| rare event). The UIMA framework guarantees it will call the <code class="literal">typeSystemInit |
| </code>method of an annotator whenever the type system changes, before calling the |
| annotator's <code class="literal">process()</code> method.</p> |
| |
| <p>The initialization done by <code class="literal">typeSystemInit()</code> is done by the |
| UIMA framework when you use the JCas APIs; you only need to provide a |
| <code class="literal">typeSystemInit()</code> method, as described here, when you are not using |
| the JCas approach.</p> |
| |
| <div class="section" title="4.4.1. TypeSystemPrinter example"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.type_system.printer_example">4.4.1. TypeSystemPrinter example</h3></div></div></div> |
| |
| |
| <p>Here is a code fragment that, given a CAS Type System, will print a list of all |
| types.</p> |
| |
| |
| <pre class="programlisting">// Get all type names from the type system |
| // and print them to stdout. |
| private void listTypes1(TypeSystem ts) { |
| System.out.println("Types in the type system:"); |
| for (Type t : ts) { |
| // print its name. |
| System.out.println(t.getName()); |
| } |
| }</pre> |
| |
| <p>This method is passed the type system as a parameter. The type system can be iterated |
| over directly, which visits every type it contains. If you run this against a CAS created with no additional |
| user-defined types, you should see something like this on the console:</p> |
| |
| <pre class="programlisting">Types in the type system: |
| uima.cas.Boolean |
| uima.cas.Byte |
| uima.cas.Short |
| uima.cas.Integer |
| uima.cas.Long |
| uima.cas.ArrayBase |
| ... |
| </pre> |
| |
| <p>If the type system had user-defined types these would show up too. Note that some |
| of these types are not directly creatable – they are types used by the framework |
| in the type hierarchy (e.g. uima.cas.ArrayBase).</p> |
| |
| <p>CAS type names include a name-space prefix. The components of a type name are |
| separated by the dot (.). A type name component must start with a Unicode letter, |
| followed by an arbitrary sequence of letters, digits and the underscore (_). By |
| convention, the last component of a type name starts with an uppercase letter, the |
| rest start with a lowercase letter.</p> |
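| <p>The naming rule and the naming convention can be expressed as two small checks. The plain-Java sketch below |
| is not part of the UIMA API; <code class="literal">TypeNameCheck</code> and its methods are hypothetical, and |
| interpreting "letters" and "digits" as the Unicode categories <code class="literal">\p{L}</code> and |
| <code class="literal">\p{Nd}</code> is an assumption of this sketch.</p> |

```java
import java.util.regex.Pattern;

// Plain-Java sketch of the type name rule and the case convention.
final class TypeNameCheck {

  // One component: a letter, then any sequence of letters, digits and underscores.
  private static final String COMPONENT = "\\p{L}[\\p{L}\\p{Nd}_]*";

  // A type name: one or more components separated by dots.
  private static final Pattern TYPE_NAME =
      Pattern.compile(COMPONENT + "(\\." + COMPONENT + ")*");

  // Checks the syntactic rule.
  static boolean isWellFormed(String typeName) {
    return TYPE_NAME.matcher(typeName).matches();
  }

  // Checks the convention: last component starts uppercase, the rest lowercase.
  static boolean followsCaseConvention(String typeName) {
    String[] parts = typeName.split("\\.");
    for (int i = 0; i < parts.length; i++) {
      if (parts[i].isEmpty()) {
        return false;
      }
      char first = parts[i].charAt(0);
      boolean last = (i == parts.length - 1);
      boolean ok = last ? Character.isUpperCase(first) : Character.isLowerCase(first);
      if (!ok) {
        return false;
      }
    }
    return true;
  }
}
```

| <p>A check of this kind catches, for example, a stray space inside a type name constant before the name is |
| ever looked up in the type system.</p> |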
| |
| <p>Listing the type names is mildly useful, but it would be even better if we could see |
| the inheritance relation between the types. The following code prints the |
| inheritance tree in indented format.</p> |
| |
| |
| <pre class="programlisting">private static final int INDENT = 2; |
| private void listTypes2(TypeSystem ts) { |
| // Get the root of the inheritance tree. |
| Type top = ts.getTopType(); |
| // Recursively print the tree. |
| printInheritanceTree(ts, top, 0); |
| } |
| |
| private void printInheritanceTree(TypeSystem ts, Type type, int level) { |
| indent(level); // Print indentation. |
| System.out.println(type.getName()); |
| // Get a vector of the immediate subtypes. |
| Vector<Type> subTypes = |
| ts.getDirectlySubsumedTypes(type); |
| ++level; // Increase the indentation level. |
| for (Type subType : subTypes) { |
| // Print the subtypes. |
| printInheritanceTree(ts, subType, level); |
| } |
| } |
| |
| // A simple, inefficient indenter |
| private void indent(int level) { |
| int spaces = level * INDENT; |
| for (int i = 0; i < spaces; i++) { |
| System.out.print(" "); |
| } |
| }</pre> |
| |
| <p> This example shows that you can traverse the type hierarchy by starting at the top |
| with TypeSystem.getTopType and by retrieving subtypes with |
| <code class="literal">TypeSystem.getDirectlySubsumedTypes()</code>.</p> |
| |
| <p>The type system API (see the Javadocs) also lets you access the features, as well as what |
| the allowed value type is for each feature. Here is sample code which prints out all the |
| features of all the types, together with the allowed value types (the feature |
| <span class="quote">“<span class="quote">range</span>”</span>). Each feature has a <span class="quote">“<span class="quote">domain</span>”</span> which is the type |
| where it is defined, as well as a <span class="quote">“<span class="quote">range</span>”</span>. |
| |
| |
| </p><pre class="programlisting">private void listFeatures2(TypeSystem ts) { |
| Iterator<Feature> featureIterator = ts.getFeatures(); |
| System.out.println("Features in the type system:"); |
| while (featureIterator.hasNext()) { |
| Feature f = featureIterator.next(); |
| // Print the feature name, then its domain and range type names. |
| System.out.println( |
| f.getShortName() + ": " + |
| f.getDomain().getName() + " -> " + f.getRange().getName()); |
| } |
| System.out.println(); |
| }</pre> |
| |
| <p>We can ask a feature object for its domain (the type it is defined on) and its range |
| (the type of the value of the feature). The terminology derives from the fact that |
| features can be viewed as functions on subspaces of the object space.</p> |
| |
| </div> |
| |
| <div class="section" title="4.4.2. Using the CAS APIs to create and modify feature structures"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.cas_apis_create_modify_feature_structures">4.4.2. Using the CAS APIs to create and modify feature structures</h3></div></div></div> |
| |
| |
| |
| <p>Assume a type system declaration that defines two types: Entity and Person. |
| Entity has no features defined within it but inherits from uima.tcas.Annotation |
| – so it has the begin and end features. Person is, in turn, a subtype of Entity, |
| and adds firstName and lastName features. CAS type systems are declaratively |
| specified using XML; the format of this XML is described in <a href="references.html#ugr.ref.xml.component_descriptor.type_system" class="olink">Section 2.3, “Type System Descriptors”</a>. |
| |
| |
| </p><pre class="programlisting"><!-- Type System Definition --> |
| <typeSystemDescription> |
| <types> |
| <typeDescription> |
| <name>com.xyz.proj.Entity</name> |
| <description /> |
| <supertypeName>uima.tcas.Annotation</supertypeName> |
| </typeDescription> |
| <typeDescription> |
| <name>com.xyz.proj.Person</name> |
| <description /> |
| <supertypeName>com.xyz.proj.Entity</supertypeName> |
| <features> |
| <featureDescription> |
| <name>firstName</name> |
| <description /> |
| <rangeTypeName>uima.cas.String</rangeTypeName> |
| </featureDescription> |
| <featureDescription> |
| <name>lastName</name> |
| <description /> |
| <rangeTypeName>uima.cas.String</rangeTypeName> |
| </featureDescription> |
| </features> |
| </typeDescription> |
| </types> |
| </typeSystemDescription></pre> |
| |
| <p> |
| To be able to access types and features, we need to know their names. The CAS interface defines |
| constants that hold the names of built-in types and features, such as |
| <code class="literal">CAS.TYPE_NAME_INTEGER</code>. It is good programming practice to create such |
| constants for the types and features you define, for your own use as well as for others who will |
| be using your annotators. |
| </p> |
| |
| |
| <pre class="programlisting">/** Entity type name constant. */ |
| public static final String ENTITY_TYPE_NAME = "com.xyz.proj.Entity"; |
| |
| /** Person type name constant. */ |
| public static final String PERSON_TYPE_NAME = "com.xyz.proj.Person"; |
| |
| /** First name feature name constant. */ |
| public static final String FIRST_NAME_FEAT_NAME = "firstName"; |
| |
| /** Last name feature name constant. */ |
| public static final String LAST_NAME_FEAT_NAME = "lastName";</pre> |
| |
| <p>Next we define type and feature member variables; these will hold the values of the |
| type and feature objects needed by the CAS APIs, to be assigned during |
| <code class="literal">typeSystemInit()</code>.</p> |
| |
| |
| <pre class="programlisting">// Type system object variables |
| private TypeSystem typeSystem; |
| private Type entityType; |
| private Type personType; |
| private Feature firstNameFeature; |
| private Feature lastNameFeature; |
| private Type stringType;</pre> |
| |
| <p>The type system does not throw an exception if we ask for something that is |
| not known; it simply returns null. The code below therefore checks for null and throws a proper |
| exception. We require all these types and features to be defined for the annotator to |
| work. One might imagine situations where certain computations are predicated on some type |
| or feature being defined in the type system, but that is not the case here.</p> |
| |
| |
| <pre class="programlisting">// Get a type object corresponding to a name. |
| // If it doesn't exist, throw an exception. |
| private Type initType(String typeName) |
| throws AnalysisEngineProcessException { |
| Type type = typeSystem.getType(typeName); |
| if (type == null) { |
| // Report the missing type using the exception that typeSystemInit() declares. |
| throw new AnalysisEngineProcessException( |
| new IllegalStateException("Type not found: " + typeName)); |
| } |
| return type; |
| } |
| |
| // We add similar code for retrieving feature objects. |
| // Get a feature object from a name and a type object. |
| // If it doesn't exist, throw an exception. |
| private Feature initFeature(String featName, Type type) |
| throws AnalysisEngineProcessException { |
| Feature feat = type.getFeatureByBaseName(featName); |
| if (feat == null) { |
| throw new AnalysisEngineProcessException( |
| new IllegalStateException("Feature not found: " + featName)); |
| } |
| return feat; |
| }</pre> |
| |
| <p>Using these two functions, code for initializing the type system described |
| above would be: |
| |
| |
| </p><pre class="programlisting">public void typeSystemInit(TypeSystem aTypeSystem) |
| throws AnalysisEngineProcessException { |
| this.typeSystem = aTypeSystem; |
| // Set type system member variables. |
| this.entityType = initType(ENTITY_TYPE_NAME); |
| this.personType = initType(PERSON_TYPE_NAME); |
| this.firstNameFeature = |
| initFeature(FIRST_NAME_FEAT_NAME, personType); |
| this.lastNameFeature = |
| initFeature(LAST_NAME_FEAT_NAME, personType); |
| this.stringType = initType(CAS.TYPE_NAME_STRING); |
| }</pre> |
| |
| <p>Note that we initialize the string type by using a type name constant from the |
| CAS.</p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="4.5. Creating feature structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.creating_feature_structures">4.5. Creating feature structures</h2></div></div></div> |
| |
| |
| <p>To create feature structures in JCas, we use the Java <span class="quote">“<span class="quote">new</span>”</span> |
| operator. In the CAS, we use one of several different API methods on the CAS object, |
| depending on which of the 10 basic kinds of feature structures we are creating (a plain |
| feature structure, or an instance of the built-in primitive type arrays or FSArray). |
| There is also a method to create an instance of a |
| <code class="literal">uima.tcas.Annotation</code>, setting the begin and end |
| values.</p> |
| |
| <p>Once a feature structure is created, it needs to be added to the CAS indexes (unless |
| it will be accessed via some reference from another accessible feature structure). The |
| CAS provides an API for this. Assuming aCAS holds a reference to a CAS, and token holds a |
| reference to a newly created feature structure, here's the code to add that |
| feature structure to all the relevant CAS indexes:</p> |
| |
| |
| <pre class="programlisting"> // Add the token to the index repository. |
| aCAS.addFsToIndexes(token);</pre> |
| |
| <p>There is also a corresponding <code class="literal">removeFsFromIndexes(token)</code> |
| method on CAS objects.</p> |
| |
| <p>As of version 2.4.1, there are two methods you can use on an index repository |
| to efficiently bulk-remove all |
| instances of particular types of feature structures from a particular view. One of these, |
| <code class="code">aCas.getIndexRepository().removeAllIncludingSubtypes(aType)</code>, removes all instances of a particular |
| type, including instances of subtypes of the specified type. The other, |
| <code class="code">aCas.getIndexRepository().removeAllExcludingSubtypes(aType)</code>, removes only instances of exactly the specified |
| type. In both cases, the removal is done from the particular view of the CAS referenced |
| by aCas.</p> |
| |
| <div class="section" title="4.5.1. Updating indexed feature structures"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.updating_indexed_feature_structures">4.5.1. Updating indexed feature structures</h3></div></div></div> |
| |
| <p>Version 2.7.0 added protection for indexes when features that are used as index keys |
| are updated on an indexed feature structure. By default this protection is automatic, but |
| it comes at some performance cost. Users may optimize this further.</p> |
| |
| <p>Protection is needed because some of the indexes (the Sorted and Set types) use comparators defined |
| to use values of the particular features; if these values |
| need to be changed after the feature structure is added to the indexes, |
| the correct way to do this is to: |
| </p><div class="orderedlist"><ol class="orderedlist" type="1" compact><li class="listitem"><p>completely remove the item from all indexes where it is indexed, in all views |
| where it is indexed,</p> |
| </li><li class="listitem"><p>update the value of the features being used as keys,</p></li><li class="listitem"><p>add the item back to the indexes, in all views.</p></li></ol></div> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>It’s OK to change feature values which are not used in determining |
| sort ordering (or set membership), without removing and re-adding back to the index. |
| </p></div> |
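| |
| <p>The need for this sequence is not specific to UIMA: any sorted collection whose comparator |
| reads a mutable field behaves the same way. The following self-contained sketch uses a plain |
| <code class="literal">java.util.TreeSet</code> as a stand-in for a single sorted index. The |
| <code class="literal">Item</code> class and its <code class="literal">key</code> field are |
| hypothetical and are not UIMA API.</p> |
| |
| <pre class="programlisting">import java.util.TreeSet; |
| |
| public class SortedKeyUpdateSketch { |
|   /** Hypothetical stand-in for a feature structure with one key feature. */ |
|   static final class Item implements Comparable&lt;Item&gt; { |
|     int key; |
|     Item(int key) { this.key = key; } |
|     @Override public int compareTo(Item other) { |
|       return Integer.compare(key, other.key); |
|     } |
|   } |
| |
|   /** The safe sequence: remove, update the key, add back. */ |
|   static void updateKey(TreeSet&lt;Item&gt; index, Item item, int newKey) { |
|     index.remove(item);   // 1. remove while the old key still locates the item |
|     item.key = newKey;    // 2. update the value used as the key |
|     index.add(item);      // 3. add back at the position for the new key |
|   } |
| |
|   public static void main(String[] args) { |
|     TreeSet&lt;Item&gt; index = new TreeSet&lt;Item&gt;(); |
|     Item a = new Item(10); |
|     Item b = new Item(20); |
|     index.add(a); |
|     index.add(b); |
|     updateKey(index, a, 30); |
|     System.out.println(index.first() == b);   // prints true |
|     System.out.println(index.last() == a);    // prints true |
|   } |
| }</pre> |
| |
| <p>Changing <code class="literal">key</code> while the item is still in the set would leave it |
| stored at the position of its old key, which is the kind of corruption the protection guards against.</p> |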
| |
| |
| |
| <p>The automatic protection checks for updates of |
| features being used as keys, and if it finds an update like this for a feature structure that |
| is in the indexes, it removes the feature structure from the indexes, does the update, |
| and adds it back. It will do this for every feature update. This is not |
| efficient when multiple features are being updated; in that case it would be better to |
| remove the feature structure, do all the updates to all the features needing updates, and then |
| do a single add-back operation.</p> |
| |
| <p>This is supported in user code by the method <code class="code">protectIndexes</code>, |
| available in both the CAS and JCas interfaces. |
| |
| Here are two ways |
| of using it, one with a try / finally and the other with a Runnable: |
| </p><pre class="programlisting">// an approach using try / finally |
| AutoCloseable ac = my_cas.protectIndexes(); // my_cas is a CAS or a JCas |
| try { |
|   // ... arbitrary user code which updates features |
|   //     which may be "keys" in one or more indexes |
| } finally { |
|   ac.close(); // AutoCloseable.close() is declared to throw Exception |
| } |
| |
| // This can more compactly be written using the auto-close feature of try: |
| |
| try (AutoCloseable ac = my_cas.protectIndexes()) { |
|   // ... arbitrary user code which updates features |
|   //     which may be "keys" in one or more indexes |
| } |
| |
| // an approach using a Runnable, written in Java 8 lambda syntax |
| my_cas.protectIndexes(() -> { |
|   // ... arbitrary user code updating "key" features, |
|   //     but no checked exceptions are permitted |
| });</pre> |
| |
| <p>The <code class="code">protectIndexes</code> implementation only removes feature structures that |
| have features being updated which are used as keys in some index(es). At the end of the scope |
| of the protectIndexes, it adds all of these back. It also skips removing feature structures |
| from bag indexes, since these have no keys.</p> |
| |
| <p>Within a <code class="code">protectIndexes</code> block, do not do any operations which depend on the |
| indexes being valid, such as creating and using an iterator. This is because the removed FSs |
| are only added back at the end of the protectIndexes block.</p> |
| |
| <p>The JVM property <code class="code">-Duima.report_fs_update_corrupts_index</code> will generate a log entry |
| every time the framework finds (and automatically surrounds with a remove and add-back) an update to |
| a feature which could corrupt the index. The log entries can be identified by scanning for messages |
| starting with <code class="code">While FS was in the index, the feature</code> - the message goes on to identify |
| the feature in question. Users can use these reports to find the places in their code where |
| they can either change the design to avoid updating these values after the item is indexed, or |
| surround the updates with their own <code class="code">protectIndexes</code> blocks.</p> |
| |
| <p>Out of the box, the UIMA framework runs with automatic (but somewhat inefficient) protection. |
| To improve upon this, |
| users can: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>Turn on reporting using a global JVM flag <code class="code"> |
| -Duima.report_fs_update_corrupts_index</code>. |
| This will cause a message to be logged each time the automatic protection is being invoked, |
| and allows the user to find the spots to improve.</p> |
| </li><li class="listitem"><p>Improve each spot, perhaps by surrounding the update code with a protectIndexes |
| block, or by rearranging code to reduce updating feature values used as index keys.</p> |
| </li><li class="listitem"><p>Once the code is no longer generating any reports, you can turn off the |
| automatic protection for production runs using the JVM global property |
| <code class="code">-Duima.disable_auto_protect_indexes</code>, and rely on the protectIndexes blocks. |
| If protection is disabled, then the corruption detection is skipped, making the production |
| runs perhaps a bit faster, although this is not significant in most cases.</p></li><li class="listitem"><p>For automated build systems, there’s a JVM parameter, |
| <code class="code">-Duima.exception_when_fs_update_corrupts_index</code>, which will throw an |
| exception if any automatic recovery situation is encountered. You can use this |
| in build/test scenarios to ensure |
| (after adding all needed protectIndexes blocks) that the code remains safe for |
| turning off the checking in production runs.</p></li></ul></div><p> |
| </p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="4.6. Accessing or modifying features of feature structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.accessing_modifying_features_of_feature_structures">4.6. Accessing or modifying features of feature structures</h2></div></div></div> |
| |
| |
| |
| <p>Values of individual features for a feature structure can be set or referenced, |
| using a set of methods that depend on the type of value that feature is declared to have. |
| There are methods on FeatureStructure for this: getBooleanValue, getByteValue, |
| getShortValue, getIntValue, getLongValue, getFloatValue, getDoubleValue, |
| getStringValue, and getFeatureValue (which means to get a value which in turn is a |
| reference to a feature structure). There are corresponding <span class="quote">“<span class="quote">setter</span>”</span> |
| methods, as well. These methods on the feature structure object take as arguments the |
| feature object retrieved earlier in the typeSystemInit method.</p> |
| |
| <p>Using the previous example, with the type system initialized with type personType |
| and feature lastNameFeature, here's a sample code fragment that gets and sets |
| that feature:</p> |
| |
| |
| <pre class="programlisting">// Assume aPerson is a variable holding an object of type Person |
| // get the lastNameFeature value from the feature structure |
| String lastName = aPerson.getStringValue(lastNameFeature); |
| // set the lastNameFeature value |
| aPerson.setStringValue(lastNameFeature, newStringValueForLastName);</pre> |
| |
| <p>The getters and setters for each of the primitive types are defined in the Javadocs |
| as methods of the FeatureStructure interface.</p> |
| |
| </div> |
| |
| <div class="section" title="4.7. Indexes and Iterators"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.indexes_and_iterators">4.7. Indexes and Iterators</h2></div></div></div> |
| |
| |
| <p>Each CAS can have many indexes associated with it; each CAS View contains |
| a complete set of instantiations of the indexes. Each index is represented by an |
| instance of the type org.apache.uima.cas.FSIndex. You use the object |
| org.apache.uima.cas.FSIndexRepository, accessible via a method on a CAS object, to |
| retrieve instances of indexes. There are methods that let you select the index |
| by name, by type, or by both name and type. Since each index is already associated with a type, |
| passing both a name and a type is valid only if the type passed in is the same |
| type or a subtype of the one declared in the index specification for the named index. If you |
| pass in a subtype, the returned FSIndex object refers to an index that will return only |
| items belonging to that subtype (or subtypes of that subtype).</p> |
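| |
| <p>A minimal sketch of the two lookups just described, by name and by name plus type. The class and |
| method names are illustrative assumptions; <code class="literal">indexName</code> is assumed to name an |
| index defined for this CAS, and <code class="literal">subType</code> is assumed to be the index's |
| declared type or one of its subtypes.</p> |
| |
| <pre class="programlisting">import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.FSIndex; |
| import org.apache.uima.cas.FSIndexRepository; |
| import org.apache.uima.cas.FeatureStructure; |
| import org.apache.uima.cas.Type; |
| |
| public class IndexLookupSketch { |
|   /** Returns how many indexed items of subType (or its subtypes) the named index holds. */ |
|   static int countIndexed(CAS aCAS, String indexName, Type subType) { |
|     FSIndexRepository repo = aCAS.getIndexRepository(); |
| |
|     // By name only: the index over the type declared in the index specification. |
|     FSIndex&lt;FeatureStructure&gt; whole = repo.getIndex(indexName); |
| |
|     // By name and type: the same index, narrowed to subType and its subtypes. |
|     FSIndex&lt;FeatureStructure&gt; narrowed = repo.getIndex(indexName, subType); |
| |
|     System.out.println(whole.size() + " total, " |
|         + narrowed.size() + " of type " + subType.getName()); |
|     return narrowed.size(); |
|   } |
| }</pre> |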
| |
| <p>The returned FSIndex objects are used, in turn, to create iterators. |
| There is also a method on the Index Repository, <code class="literal">getAllIndexedFS</code>, |
| which will return an iterator over all indexed Feature Structures (for that CAS View), |
| in no particular order. The iterators |
| created can be used like common Java iterators, to sequentially retrieve items |
| indexed. If the index represents a sorted index, the items are returned in a sorted |
| order, where the sort order is specified in the XML index definition. This XML is part of |
| the Component Descriptor, see <a href="references.html#ugr.ref.xml.component_descriptor.aes.index" class="olink">Section 2.4.1.5, “Index Definition”</a>.</p> |
| |
| <p>In UIMA V3, feature structures may be added to or removed from indexes while iterating |
| over them. If this happens, any iterators already created will continue to operate over the |
| before-modification version of the index, unless or until the iterator is re-synchronized with the current |
| value of the index via one of the following three iterator API calls: |
| moveToFirst, moveToLast, or moveTo(FeatureStructure). |
| ConcurrentModificationException is no longer thrown in UIMA V3. |
| </p> |
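| |
| <p>The re-synchronization step can be sketched as follows. This is an illustration under the behavior |
| described above, not a statement about the implementation; the class and method names are assumptions, |
| and <code class="literal">extra</code> is assumed to be an annotation that is not yet indexed.</p> |
| |
| <pre class="programlisting">import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.FSIterator; |
| import org.apache.uima.cas.text.AnnotationFS; |
| import org.apache.uima.cas.text.AnnotationIndex; |
| |
| public class IteratorResyncSketch { |
|   /** Counts the annotations seen after the iterator is re-synchronized. */ |
|   static int countAfterAdd(CAS aCAS, AnnotationFS extra) { |
|     AnnotationIndex&lt;AnnotationFS&gt; index = aCAS.getAnnotationIndex(); |
|     FSIterator&lt;AnnotationFS&gt; it = index.iterator(); |
| |
|     // Modify the index after the iterator was created. |
|     aCAS.addFsToIndexes(extra); |
| |
|     // Re-synchronize with the current state of the index. |
|     it.moveToFirst(); |
| |
|     int count = 0; |
|     while (it.isValid()) { |
|       count++; |
|       it.moveToNext(); |
|     } |
|     return count; |
|   } |
| }</pre> |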
| |
| <p>Features that are used as the "keys" of an index may be updated on feature structures that are being iterated over. |
| If this is done, UIMA will protect the indexes (to prevent index corruption) by automatically removing the |
| Feature Structure from the indexes, |
| updating the field, and adding the FS back to the index (possibly in a new position). |
| This automatic remove / add-back operation no longer makes the iterator throw a ConcurrentModificationException |
| (as it did in UIMA Version 2) if the iterator is incremented or decremented; |
| existing iterators will continue to operate as if no index modification occurred. |
| </p> |
| |
| |
| |
| <div class="section" title="4.7.1. Built-in Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.built_in_indexes">4.7.1. Built-in Indexes</h3></div></div></div> |
| |
| |
| <p>An unnamed built-in bag index exists which holds all feature structures which are indexed. |
| The only access to this index is the method getAllIndexedFS(Type) which returns an iterator |
| over all indexed Feature Structures.</p> |
| |
| <p>The CAS also contains a built-in index for the type <code class="literal">uima.tcas.Annotation</code>, which sorts |
| annotations in the order in which they appear in the document. Annotations are sorted first by increasing |
| <code class="literal">begin</code> position. Ties are then broken by <span class="emphasis"><em>decreasing</em></span> |
| <code class="literal">end</code> position (so that longer annotations come first). Annotations that match in both |
| their <code class="literal">begin</code> and <code class="literal">end</code> features are sorted using the Type Priority, |
| if any are defined |
| (see <a href="references.html#ugr.ref.xml.component_descriptor.aes.type_priority" class="olink">Section 2.4.1.4, “Type Priority Definition”</a> )</p> |
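| |
| <p>The first two sort keys can be sketched in plain Java as a comparator. This is an illustration of |
| the ordering only, not the framework's comparator: the <code class="literal">Span</code> class is a |
| hypothetical stand-in for an annotation, and type priorities are not modeled.</p> |
| |
| <pre class="programlisting">import java.util.Arrays; |
| import java.util.Comparator; |
| |
| public class AnnotationOrderSketch { |
|   /** Hypothetical stand-in for an annotation: only begin and end offsets. */ |
|   static final class Span { |
|     final int begin; |
|     final int end; |
|     Span(int begin, int end) { this.begin = begin; this.end = end; } |
|     @Override public String toString() { return "[" + begin + "," + end + ")"; } |
|   } |
| |
|   /** Increasing begin, then decreasing end (longer annotations first). */ |
|   static final Comparator&lt;Span&gt; DOCUMENT_ORDER = new Comparator&lt;Span&gt;() { |
|     @Override public int compare(Span a, Span b) { |
|       int byBegin = Integer.compare(a.begin, b.begin); |
|       if (byBegin != 0) { |
|         return byBegin; |
|       } |
|       return Integer.compare(b.end, a.end); |
|     } |
|   }; |
| |
|   static Span[] sorted(Span... spans) { |
|     Span[] copy = spans.clone(); |
|     Arrays.sort(copy, DOCUMENT_ORDER); |
|     return copy; |
|   } |
| |
|   public static void main(String[] args) { |
|     Span[] result = sorted(new Span(5, 8), new Span(0, 3), new Span(0, 10)); |
|     System.out.println(Arrays.toString(result)); // [[0,10), [0,3), [5,8)] |
|   } |
| }</pre> |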
| </div> |
| |
| |
| <div class="section" title="4.7.2. Adding Feature Structures to the Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.adding_to_indexes">4.7.2. Adding Feature Structures to the Indexes</h3></div></div></div> |
| |
| |
| <p>Feature Structures are added to the indexes by various APIs. These add the Feature Structure to |
| <span class="emphasis"><em>all</em></span> indexes that are defined for the type of that FeatureStructure (or any of its |
| supertypes), in a particular view. |
| Note that you should not add a Feature Structure to the indexes until you have set values for all |
| of the features that may be used as sort keys in an index.</p> |
| |
| <p>There are multiple APIs for adding FSs to the index. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>(preferred) myFeatureStructure.addToIndexes(). This adds the feature structure instance to the |
| view in which it was originally created.</p> |
| </li><li class="listitem"><p>(preferred) myFeatureStructure.addToIndexes(JCas or CAS). This adds the feature structure instance to the |
| view represented by the argument.</p> |
| </li><li class="listitem"><p>(older form) casView.addFsToIndexes(myFeatureStructure) or jcasView.addFsToIndexes(myFeatureStructure). |
| This adds the feature structure instance to the |
| view represented by the cas (or jcas).</p> |
| </li><li class="listitem"><p>(older form) fsIndexRepositoryView.addFsToIndexes(myFeatureStructure). |
| This adds the feature structure instance to the |
| view represented by the fsIndexRepository instance.</p> |
| </li></ul></div><p> |
| </p> |
| </div> |
| |
| <div class="section" title="4.7.3. Iterators over UIMA Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.iterators">4.7.3. Iterators over UIMA Indexes</h3></div></div></div> |
| |
| |
| |
| <p>Iterators are objects implementing the interface <code class="literal">org.apache.uima.cas.FSIterator</code>. This interface |
| extends <code class="literal">java.util.Iterator</code> and provides the normal Java iterator methods, plus |
| additional ones that allow moving both forwards and backwards.</p> |
| |
| <p>UIMA indexes implement <code class="literal">java.lang.Iterable</code>, so you can use an index directly in a Java enhanced for loop.</p> |
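| |
| <p>A short sketch of both styles: a for loop directly over the index, and an explicit |
| <code class="literal">FSIterator</code> walked backwards. The class and method names are illustrative |
| assumptions; the built-in annotation index is used only because it is always available.</p> |
| |
| <pre class="programlisting">import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.FSIterator; |
| import org.apache.uima.cas.text.AnnotationFS; |
| import org.apache.uima.cas.text.AnnotationIndex; |
| |
| public class IterationSketch { |
|   /** Prints all annotations forwards, then backwards; returns the backward count. */ |
|   static int walk(CAS aCAS) { |
|     AnnotationIndex&lt;AnnotationFS&gt; index = aCAS.getAnnotationIndex(); |
| |
|     // Forwards, using the index as an Iterable. |
|     for (AnnotationFS a : index) { |
|       System.out.println(a.getBegin() + "-" + a.getEnd()); |
|     } |
| |
|     // Backwards, using the additional FSIterator methods. |
|     int count = 0; |
|     FSIterator&lt;AnnotationFS&gt; it = index.iterator(); |
|     for (it.moveToLast(); it.isValid(); it.moveToPrevious()) { |
|       AnnotationFS a = it.get(); |
|       System.out.println(a.getBegin() + "-" + a.getEnd()); |
|       count++; |
|     } |
|     return count; |
|   } |
| }</pre> |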
| |
| </div> |
| |
| <div class="section" title="4.7.4. Special iterators for Annotation types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.annotation_index">4.7.4. Special iterators for Annotation types</h3></div></div></div> |
| |
| |
| <p>Note: we recommend using the UIMA V3 select framework, instead of the following. |
| It implements all of the following capabilities, and more, in a uniform manner.</p> |
| |
| <p>The built-in index over the <code class="literal">uima.tcas.Annotation</code> type |
| named <span class="quote">“<span class="quote"><code class="literal">AnnotationIndex</code></span>”</span> has additional |
| capabilities. To use them, you first get a reference to this built-in index using |
| either the <code class="literal">getAnnotationIndex</code> method on a CAS View object, or |
| by asking the <code class="literal">FSIndexRepository</code> object for an index having the |
| particular name <span class="quote">“<span class="quote">AnnotationIndex</span>”</span>, for example: |
| |
| </p><pre class="programlisting">AnnotationIndex idx = aCAS.getAnnotationIndex(); |
| // or you can iterate over a specific subtype of Annotation: |
| AnnotationIndex idx = aCAS.getAnnotationIndex(aType); </pre> |
| |
| <p>This object can be used to produce several additional kinds of iterators. It can |
| produce unambiguous iterators; these skip over elements until they find one whose |
| start position is equal to or greater than the end position of |
| the previously returned annotation.</p> |
| |
| <p>It can also produce several kinds of subiterators; these are iterators whose |
| annotations fall within the span of another annotation. This kind of iterator can |
| also have the unambiguous property, if desired. It also can be |
| <span class="quote">“<span class="quote">strict</span>”</span> or not; strict means that the returned annotation lies |
| completely within the span of the controlling annotation. Non-strict only implies |
| that the beginning of the returned annotation falls within the span of the |
| controlling annotation.</p> |
| |
| <p>There is also a method which produces an <code class="literal">AnnotationTree</code> |
| object, which contains nodes representing the results of doing a strict, |
| unambiguous subiterator over the span of some controlling annotation. For more |
| details, please refer to the Javadocs for the |
| <code class="literal">org.apache.uima.cas.text</code> package.</p> |
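| |
| <p>The calls involved can be sketched as follows. The class and method names are illustrative |
| assumptions, <code class="literal">tokenType</code> is assumed to be an annotation type, and |
| <code class="literal">sentence</code> is assumed to be an indexed annotation that serves as the |
| controlling annotation.</p> |
| |
| <pre class="programlisting">import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.FSIterator; |
| import org.apache.uima.cas.Type; |
| import org.apache.uima.cas.text.AnnotationFS; |
| import org.apache.uima.cas.text.AnnotationIndex; |
| import org.apache.uima.cas.text.AnnotationTree; |
| |
| public class AnnotationIteratorSketch { |
|   /** Counts the tokens returned by a strict, unambiguous subiterator. */ |
|   static int countTokensIn(CAS aCAS, Type tokenType, AnnotationFS sentence) { |
|     AnnotationIndex&lt;AnnotationFS&gt; tokens = aCAS.getAnnotationIndex(tokenType); |
| |
|     // Unambiguous iterator over the whole index (argument: ambiguous = false). |
|     FSIterator&lt;AnnotationFS&gt; nonOverlapping = tokens.iterator(false); |
| |
|     // Subiterator restricted to the span of the controlling annotation; |
|     // arguments: controlling annotation, ambiguous = false, strict = true. |
|     FSIterator&lt;AnnotationFS&gt; inSentence = tokens.subiterator(sentence, false, true); |
| |
|     // Tree built from strict, unambiguous subiterators under the same annotation. |
|     AnnotationTree&lt;AnnotationFS&gt; tree = tokens.tree(sentence); |
| |
|     int count = 0; |
|     for (inSentence.moveToFirst(); inSentence.isValid(); inSentence.moveToNext()) { |
|       count++; |
|     } |
|     return count; |
|   } |
| }</pre> |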
| |
| </div> |
| |
| <div class="section" title="4.7.5. Constraints and Filtered iterators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.index.constraints_and_filtered_iterators">4.7.5. Constraints and Filtered iterators</h3></div></div></div> |
| |
| |
| <p>Note: for new code, consider using the select framework plus Streams, instead of |
| the following.</p> |
| |
| <p>There is a set of API calls that build constraint objects. These objects can be |
| used directly to test if a particular feature structure matches (satisfies) the |
| constraint, or they can be passed to the createFilteredIterator method to create an |
| iterator that skips over instances which fail to satisfy the constraint.</p> |
| |
| <p>It is possible to specify a feature value located by following a chain of |
| references starting from the feature structure being tested. Here's a |
| scenario to explore this concept. Let's suppose you have the following type |
| system (namespaces are omitted for clarity): |
| |
| </p><div class="blockquote"><blockquote class="blockquote"> |
| <p><span class="bold"><strong>Token</strong></span>, having a feature PartOfSpeech |
| which holds a reference to another type (POS)</p> |
| |
| <p><span class="bold"><strong>POS</strong></span> (a type with many subtypes, each |
| representing a different part of speech)</p> |
| |
| <p><span class="bold"><strong>Noun</strong></span> (a subtype of POS)</p> |
| |
| <p><span class="bold"><strong>ProperName</strong></span> (a subtype of Noun), |
| having a feature Class which holds an integer value encoding some information |
| about the proper noun.</p></blockquote></div> |
| |
| <p>If you want to filter Token instances, such that only those tokens get through |
| which are proper names of class 3 (for example), you would need a test that started with |
| a Token instance, followed its PartOfSpeech reference to another instance (the |
| ProperName instance) and then tested the Class feature of that instance for a value |
| equal to 3.</p> |
| |
| <p>To support this, the filtering approach has components that specify tests, and |
| components that specify <span class="quote">“<span class="quote">paths</span>”</span>. The tests that can be done include |
| testing references to type instances to see if they are instances of some type or its |
| subtypes; this is done with a FSTypeConstraint constraint. Other tests check for |
| equality or, for numeric values, ranges.</p> |
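| |
| <p>For the proper-name scenario above, a path plus an integer test might be assembled as in the |
| following sketch. It is written under assumptions: the class and method names are illustrative, |
| Token is assumed to be an annotation type, and the two features are assumed to have been looked up |
| by the caller. How tokens whose PartOfSpeech value is not a ProperName are treated has not been |
| verified here; adding a type constraint on the path is one way to make that explicit.</p> |
| |
| <pre class="programlisting">import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.ConstraintFactory; |
| import org.apache.uima.cas.FSIntConstraint; |
| import org.apache.uima.cas.FSIterator; |
| import org.apache.uima.cas.FSMatchConstraint; |
| import org.apache.uima.cas.Feature; |
| import org.apache.uima.cas.FeaturePath; |
| import org.apache.uima.cas.Type; |
| import org.apache.uima.cas.text.AnnotationFS; |
| import org.apache.uima.cas.text.AnnotationIndex; |
| |
| public class ProperNameFilterSketch { |
|   /** Iterator over tokens whose PartOfSpeech value has Class equal to 3. */ |
|   static FSIterator&lt;AnnotationFS&gt; properNamesOfClass3( |
|       CAS cas, Type tokenType, Feature partOfSpeechFeat, Feature classFeat) { |
|     ConstraintFactory cf = cas.getConstraintFactory(); |
| |
|     // Path: Token -> PartOfSpeech -> Class |
|     FeaturePath path = cas.createFeaturePath(); |
|     path.addFeature(partOfSpeechFeat); |
|     path.addFeature(classFeat); |
| |
|     // Test: the integer at the end of the path equals 3. |
|     FSIntConstraint equalsThree = cf.createIntConstraint(); |
|     equalsThree.eq(3); |
| |
|     // Associate the test with the path. |
|     FSMatchConstraint classIsThree = cf.embedConstraint(path, equalsThree); |
| |
|     AnnotationIndex&lt;AnnotationFS&gt; tokens = cas.getAnnotationIndex(tokenType); |
|     return cas.createFilteredIterator(tokens.iterator(), classIsThree); |
|   } |
| }</pre> |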
| |
| <p>Each test may be combined with a path that leads to the value to test. Tests that |
| start from a feature structure instance can be combined with <code class="literal">and</code> and <code class="literal">or</code> connectors. |
| The Javadocs for these are in the package org.apache.uima.cas in the classes that end |
| in Constraint, plus the classes ConstraintFactory, FeaturePath and CAS. |
| Here's an example; assume the variable cas holds a reference to a CAS instance. |
| |
| |
| </p><pre class="programlisting">// Start by getting the constraint factory from the CAS. |
| ConstraintFactory cf = cas.getConstraintFactory(); |
| |
| // To specify a path to an item to test, you start by |
| // creating an empty path. |
| FeaturePath path = cas.createFeaturePath(); |
| |
| // Add POS feature to path, creating one-element path. |
| path.addFeature(posFeat); |
| |
| // You can extend the chain arbitrarily by adding additional |
| // features. |
| |
| // Create a new type constraint. |
| |
| // Type constraints will check that structures |
| // they match against have a type at least as specific |
| // as the type specified in the constraint. |
| FSTypeConstraint nounConstraint = cf.createTypeConstraint(); |
| |
| // Set the type (by default it is TOP). |
| // This succeeds if the type being tested by this constraint |
| // is nounType or a subtype of nounType. |
| nounConstraint.add(nounType); |
| |
| // Embed the noun constraint under the pos path. |
| // This means, associate the test with the path, so it tests the |
| // proper value. |
| |
| // The result is a test which will |
| // match a feature structure that has a posFeat defined |
| // which has a value which is an instance of a nounType or |
| // one of its subtypes. |
| FSMatchConstraint embeddedNoun = cf.embedConstraint(path, nounConstraint); |
| |
| // Create a type constraint for token (or a subtype of it) |
| FSTypeConstraint tokenConstraint = cf.createTypeConstraint(); |
| |
| // Set the type. |
| tokenConstraint.add(tokenType); |
| |
| // Create the final constraint by conjoining the two constraints. |
| FSMatchConstraint nounTokenCons = cf.and(embeddedNoun, tokenConstraint); |
| |
| // Create a filtered iterator from some annotation iterator. |
| FSIterator it = cas.createFilteredIterator(annotIt, nounTokenCons);</pre><p> |
| </p></div></div> |
| |
| <div class="section" title="4.8. The CAS APIs – a guide to the Javadocs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.guide_to_javadocs">4.8. The CAS APIs – a guide to the Javadocs</h2></div></div></div> |
| |
| |
| |
| <p>The CAS APIs are organized into 3 Java packages: cas, cas.impl, and cas.text. Most |
| of the APIs described here are in the cas package. The cas.impl package contains classes |
| used in serializing and deserializing (reading and writing external representations) the |
| CAS in various formats, for |
| transporting the CAS among local and remote annotators, or for storing the CAS in |
| permanent storage. The cas.text package contains the APIs that extend the CAS to support |
| artifact (including <span class="quote">“<span class="quote">text</span>”</span>) analysis.</p> |
| |
| <div class="section" title="4.8.1. APIs in the CAS package"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.cas.javadocs.cas_package">4.8.1. APIs in the CAS package</h3></div></div></div> |
| |
| |
| <p>The main objects implementing the APIs discussed here are shown in the diagram |
| below. The hierarchy represents that there is a way to get from an upper object to an |
| instance of the lower object, usually by using a method on the upper object; this is not |
| an inheritance hierarchy. |
| </p><div class="figure"><a name="ugr.ref.cas.fig.api_hierarchy"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="574"><tr><td><img src="images/references/ref.cas/image001.png" width="574" alt="CAS object hierarchy"></td></tr></table></div> |
| </div><p class="title"><b>Figure 4.1. CAS Object hierarchy</b></p></div><p><br class="figure-break"> </p> |
| |
| <p>The main Interface is the CAS interface. This has most of the functionality of the |
| CAS, except for the type system metadata access, and the indexing access. JCas and CAS |
| are alternative representations and API approaches to the CAS; each has a method to |
| get the other. You can mix JCas and CAS APIs in your application as needed. To use the |
| JCas APIs, you have to create the Java classes that correspond to the CAS types, and |
| include them in the Java class path of the application. If you have a CAS object, you can |
| get a JCas object by using the getJCas() method call on the CAS object; likewise, you |
| can get the CAS object from a JCas by using the getCas() method call on the JCas object. |
| There is also a low level CAS interface that is not part of the official API, and is |
| intended for internal use only – it is not documented here.</p> |
| |
| <p>The type system metadata APIs are found in the TypeSystem interface. The objects |
| defining each type and feature are defined by the interfaces Type and Feature. The |
| TypeSystem interface has methods to see what types subsume other types and to iterate over the |
| types available; the Type interface has methods to extract information about a type, including what |
| features it has. The Feature interface has methods that get what type it belongs to, |
| its name, and its range (the kind of values it can hold).</p> |
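| |
| <p>A brief sketch of these metadata calls follows. The class and method names are illustrative |
| assumptions; the two types are assumed to come from the type system of the CAS that is passed in.</p> |
| |
| <pre class="programlisting">import java.util.Iterator; |
| import org.apache.uima.cas.CAS; |
| import org.apache.uima.cas.Feature; |
| import org.apache.uima.cas.Type; |
| import org.apache.uima.cas.TypeSystem; |
| |
| public class TypeSystemWalkSketch { |
|   /** Prints metadata about subType and returns the number of types in the type system. */ |
|   static int describe(CAS aCAS, Type superType, Type subType) { |
|     TypeSystem ts = aCAS.getTypeSystem(); |
| |
|     // Subsumption is answered by the type system. |
|     System.out.println(superType.getName() + " subsumes " + subType.getName() |
|         + ": " + ts.subsumes(superType, subType)); |
| |
|     // Each feature knows its name, its range, and the type that introduces it. |
|     for (Feature f : subType.getFeatures()) { |
|       System.out.println(f.getShortName() + " : " + f.getRange().getName() |
|           + " (domain " + f.getDomain().getName() + ")"); |
|     } |
| |
|     // Iterate over all types available. |
|     int typeCount = 0; |
|     Iterator&lt;Type&gt; types = ts.getTypeIterator(); |
|     while (types.hasNext()) { |
|       types.next(); |
|       typeCount++; |
|     } |
|     return typeCount; |
|   } |
| }</pre> |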
| |
| <p>The FSIndexRepository gives you access to methods to get instances of indexes, and |
| also provides access to the iterator over all indexed feature structures: |
| <code class="literal">getAllIndexedFS(aType)</code>. |
| The FSIndex and AnnotationIndex objects give you methods to create instances of |
| iterators.</p> |
| |
| <p>Iterators and the CAS methods that create new feature structures return |
| FeatureStructure objects. These objects can be used to set and get the values of |
| defined features within them.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="4.9. Type Merging"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.typemerging">4.9. Type Merging</h2></div></div></div> |
| |
| |
| <p>When annotators are combined in an aggregate, their defined type systems are merged. |
| This is designed to support independent development of annotator components. The merge |
| results in a single defined type system for CASes that flow through a particular set of |
| annotators.</p> |
| |
| <p>The basic operation of a type system merge is to iterate through all the defined types, |
| and if two annotators define the same fully qualified type name, |
| to take the features defined for those types |
| and form a logical union of those features. This operation requires that same-named features |
| have the same range type names. The resulting type system has features comprising the union |
| of all features over all the various definitions for this type in different annotators. |
| </p> |
| |
| <p>Feature merging checks that all features having the same name in a type have |
| identical range types; otherwise an error is signaled.</p> |
| |
| <p>Types are combined for merging when their fully qualified names are the same. |
| Two different definitions can be merged even if their supertype definitions do not match, if |
| one supertype subsumes the other supertype; otherwise an error is signaled. Likewise, two types |
| with the same name can be merged only if their features can be merged. |
| </p> |
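| |
| <p>The feature part of this rule can be sketched in plain Java. This models the rule as stated above |
| and is not the framework's merge code: features are represented as a hypothetical map from feature name |
| to range type name, and supertype checking is left out.</p> |
| |
| <pre class="programlisting">import java.util.LinkedHashMap; |
| import java.util.Map; |
| |
| public class FeatureMergeSketch { |
|   /** Union of two feature maps (feature name to range type name) for one type. */ |
|   static Map&lt;String, String&gt; mergeFeatures( |
|       Map&lt;String, String&gt; first, Map&lt;String, String&gt; second) { |
|     Map&lt;String, String&gt; merged = new LinkedHashMap&lt;String, String&gt;(first); |
|     for (Map.Entry&lt;String, String&gt; e : second.entrySet()) { |
|       String existing = merged.get(e.getKey()); |
|       if (existing == null) { |
|         // Feature defined by only one of the two definitions: take it. |
|         merged.put(e.getKey(), e.getValue()); |
|       } else if (!existing.equals(e.getValue())) { |
|         // Same-named features must have the same range type name. |
|         throw new IllegalStateException("Feature " + e.getKey() |
|             + " has conflicting range types: " + existing + " and " + e.getValue()); |
|       } |
|     } |
|     return merged; |
|   } |
| }</pre> |
| |
| <p>Merging <code class="literal">{begin: uima.cas.Integer}</code> with |
| <code class="literal">{begin: uima.cas.Integer, end: uima.cas.Integer}</code> yields both features; |
| if the second definition instead gave <code class="literal">begin</code> the range |
| <code class="literal">uima.cas.String</code>, the sketch signals an error.</p> |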
| </div> |
| |
| <div class="section" title="4.10. Limited multi-thread access to read-only CASs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.cas.limitedmultipleaccess">4.10. Limited multi-thread access to read-only CASs</h2></div></div></div> |
| |
| |
| <p>Some applications may find it useful to scale up pipelines and run these in parallel.</p> |
| <p> |
| Generally, CASs are not thread-safe, and only one thread at a time may operate on a given CAS. In many |
| scenarios, a CAS may be initialized and then filled with Feature Structures, and after some point, |
| no more updates to that particular CAS will be done.</p> |
| |
| <p> |
| If a CAS is no longer going to be changed, it is possible to |
| access it on multiple threads in a read-only mode, simultaneously, with some limitations. Limitations |
| arise because some UIMA Framework activities may update internal CAS data structures.</p> |
| |
| <p>Operational data is updated while running a pipeline when a PEAR is entered or exited, |
| because PEARs establish new class loaders and can potentially switch the JCas classes being used |
| (this happens because the class loaders might define different JCas cover classes |
| implementing the same UIMA type). |
| Because of this, you cannot have multiple pipelines accessing a CAS in read-only mode if one or more of those |
| pipelines contains a PEAR. There are other edge cases where this may happen as well; for example, if you are |
| running a pipeline with an Extension Class Loader, |
| and have a callback routine loaded under a different class loader, UIMA will switch the JCas classes when |
| calling the callback. |
| </p> |
| </div> |
| <div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e1615" href="#d5e1615" class="para">5</a>] </sup>A fourth part, the Subject of Analysis, |
| is discussed in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter 5, <i>Annotations, Artifacts, and Sofas</i></a>.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e1624" href="#d5e1624" class="para">6</a>] </sup> The name <span class="quote">“<span class="quote">feature structure</span>”</span> comes from |
| terminology used in linguistics.</p></div></div></div> |
| <div class="chapter" title="Chapter 5. JCas Reference" id="ugr.ref.jcas"><div class="titlepage"><div><div><h2 class="title">Chapter 5. JCas Reference</h2></div></div></div> |
| |
| |
| <p>The CAS is a system for sharing data among annotators, consisting of data structures |
| (definable at run time), sets of indexes over these data, metadata describing these, subjects of |
| analysis, and a high |
| performance serialization/deserialization mechanism. JCas provides a Java approach to |
| accessing CAS data, and is based on using generated, specific Java classes for each CAS |
| type.</p> |
| |
| <p>Annotators process one CAS per call to their process method. During processing, |
| annotators can retrieve feature structures from the passed in CAS, add new ones, modify |
| existing ones, and use and update CAS indexes. Of course, an annotator can also use plain |
| Java Objects in addition; but the data in the CAS is what is shared among annotators within |
| an application.</p> |
| |
| <p>All the facilities present in the APIs for the CAS are available when using the JCas |
| APIs; indeed, you can use the getCas() method to get the corresponding CAS object from a |
| JCas (and vice-versa). The JCas APIs often have helper methods that make using this |
| interface more convenient for Java developers.</p> |
| |
| <p>The data in the CAS are typed objects having fields. JCas uses a set of generated Java |
| classes (each corresponding to a particular CAS type) with <span class="quote">“<span class="quote">getter</span>”</span> and |
| <span class="quote">“<span class="quote">setter</span>”</span> methods for the features, plus a constructor so new instances can |
| be made. The Java classes store the data in the class instance.</p> |
| |
| <p>Users can modify the JCas generated |
| Java classes by adding fields to them; this allows arbitrary non-CAS data to be |
| represented within the JCas objects as well. However, the non-CAS data stored in the JCas |
| object instances cannot be shared with annotators using the plain CAS, unless special |
| provision is made - see the chapter in the v3 user's guide on storing arbitrary |
| Java objects in the CAS.</p> |
| |
| <p>The JCas class Java source files are generated from XML type system descriptions. The |
| JCasGen utility does the work of generating the corresponding Java Class Model for the CAS |
| types. There are a variety of ways JCasGen can be run; these are described later. You |
| include the generated classes with your UIMA component, and you can publish these classes |
| for others who might want to use your type system.</p> |
| |
| <p>JCas classes are not required for all UIMA types. Those types which don't have |
| corresponding JCas classes use the nearest JCas class corresponding to a type in their superchain.</p> |
| |
| <p>The specification of the type system in XML can be written using a conventional text |
| editor, an XML editor, or using the Eclipse plug-in that supports editing UIMA |
| descriptors.</p> |
| |
| <p>Changes to the type system are done by changing the XML and regenerating the |
| corresponding Java Class Models. Of course, once you've published your type system |
| for others to use, you should be careful that any changes you make don't adversely |
| impact the users. Additional features can be added to existing types without breaking |
| other code.</p> |
| |
| <p>A separate Java class is generated for each type; this class implements the CAS |
| FeatureStructure interface, as well as having the special getters and setters for the |
| included features. The generated Java classes have methods (getters and setters) for the |
| fields as defined in the XML type specification. Descriptor comments are reflected in the |
| generated Java code as Java-doc style comments.</p> |
| |
| |
| <div class="section" title="5.1. Name Spaces"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.name_spaces">5.1. Name Spaces</h2></div></div></div> |
| |
| |
| <p>Full Type names consist of a <span class="quote">“<span class="quote">namespace</span>”</span> prefix dotted with a simple |
| name. Namespaces are used like packages to avoid collisions between types that are |
| defined by different people at different times. The namespace is used as the Java |
| package name for generated Java files.</p> |
| |
| <p>Type names used in the CAS correspond to the generated Java classes directly. If the |
| CAS name is com.myCompany.myProject.ExampleClass, the generated Java class is in the |
| package com.myCompany.myProject, and the class is ExampleClass.</p> |
| |
| <p> |
| An exception to this rule is the built-in types |
| starting with <code class="literal">uima.cas </code>and <code class="literal">uima.tcas</code>; |
| these names are mapped to Java packages named |
| <code class="literal">org.apache.uima.jcas.cas</code> and |
| <code class="literal">org.apache.uima.jcas.tcas</code>.</p> |
| |
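| <p>The naming rule above, including the exception for built-in types, can be sketched as a small |
| stand-alone helper. This is an illustration only: the class <code class="literal">JCasNameMapper</code> and its |
| methods are hypothetical names and are not part of the UIMA API. The sketch also ignores the primitive types, |
| which map to Java primitives rather than to classes.</p> |

```java
/** Illustrative sketch (not part of UIMA): maps a CAS type name to its Java class name. */
final class JCasNameMapper {
    private JCasNameMapper() {}

    /** Returns the fully qualified Java class name for a fully qualified CAS type name. */
    static String toJavaClassName(String casTypeName) {
        // Built-in types are the exception: their classes live under org.apache.uima.jcas.
        if (casTypeName.startsWith("uima.cas.")) {
            return "org.apache.uima.jcas.cas." + casTypeName.substring("uima.cas.".length());
        }
        if (casTypeName.startsWith("uima.tcas.")) {
            return "org.apache.uima.jcas.tcas." + casTypeName.substring("uima.tcas.".length());
        }
        // All other types: the namespace is the package, the simple name is the class.
        return casTypeName;
    }

    /** Returns the package part of a fully qualified Java class name. */
    static String packageOf(String javaClassName) {
        int lastDot = javaClassName.lastIndexOf('.');
        return lastDot < 0 ? "" : javaClassName.substring(0, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(toJavaClassName("com.myCompany.myProject.ExampleClass"));
        System.out.println(toJavaClassName("uima.tcas.Annotation"));
    }
}
```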
| </div> |
| |
| <div class="section" title="5.2. XML description element"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.use_of_description">5.2. XML description element</h2></div></div></div> |
| |
| |
| |
| <p>Each XML type specification can have <description ... |
| > tags. The description for a type will be copied into the generated Java code, as a |
| Javadoc style comment for the class. When writing these descriptions in the XML type |
| specification file, you might want to use html tags, as allowed in Javadocs.</p> |
| |
| <p>If you use the Component Descriptor Editor, you can write the html tags normally, |
| for instance, <span class="quote">“<span class="quote"><h1>My Title</h1></span>”</span>. The Component |
| Descriptor Editor will take care of converting the actual descriptor source so that it |
| has the leading <span class="quote">“<span class="quote"><</span>”</span> character written as <span class="quote">“<span class="quote">&lt;</span>”</span>, |
| to avoid confusing the XML type specification. For example, <p> would be written |
| in the source of the descriptor as &lt;p>. Any characters used in the Javadoc |
| comment must of course be from the character set allowed by the XML type specification. |
| These specifications often start with the line <?xml version=<span class="quote">“<span class="quote">1.0</span>”</span> |
| encoding=<span class="quote">“<span class="quote">UTF-8</span>”</span> ?>, which means you can use any of the UTF-8 |
| characters.</p> |
| |
| </div> |
| |
| <div class="section" title="5.3. Mapping built-in CAS types to Java types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.mapping_built_ins">5.3. Mapping built-in CAS types to Java types</h2></div></div></div> |
| |
| |
| <p>The built-in primitive CAS types map to Java types as follows:</p> |
| |
| |
| <pre class="programlisting">uima.cas.Boolean <span class="symbol">→</span> boolean |
| uima.cas.Byte <span class="symbol">→</span> byte |
| uima.cas.Short <span class="symbol">→</span> short |
| uima.cas.Integer <span class="symbol">→</span> int |
| uima.cas.Long <span class="symbol">→</span> long |
| uima.cas.Float <span class="symbol">→</span> float |
| uima.cas.Double <span class="symbol">→</span> double |
| uima.cas.String <span class="symbol">→</span> String</pre> |
| |
| </div> |
| |
| <div class="section" title="5.4. Augmenting the generated Java Code"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.augmenting_generated_code">5.4. Augmenting the generated Java Code</h2></div></div></div> |
| |
| |
| <p>The Java Class Models generated for each type can be augmented by the user. Typical |
| augmentations include adding additional (non-CAS) fields and methods, and import |
| statements that might be needed to support these. Commonly added methods include |
| additional constructors (having different parameter signatures), and |
| implementations of toString().</p> |
| |
| <p>To augment the code, just edit the generated Java source code for the class named the |
| same as the CAS type. Here's an example of an additional method you might add; the |
| various getter methods are retrieving values from the instance:</p> |
| |
| |
| <pre class="programlisting">public String toString() { // for debugging |
| return "XsgParse " |
| + getslotName() + ": " |
| + getheadWord().getCoveredText() |
| + " seqNo: " + getseqNo() |
| + ", cAddr: " + id |
| + ", size left mods: " + getlMods().size() |
| + ", size right mods: " + getrMods().size(); |
| }</pre> |
| |
| |
| |
| <div class="section" title="5.4.1. Keeping hand-coded augmentations when regenerating"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.keeping_augmentations_when_regenerating">5.4.1. Keeping hand-coded augmentations when regenerating</h3></div></div></div> |
| |
| |
| <p>If the type system specification changes, you have to re-run the JCasGen |
| generator. This will produce updated Java for the Class Models that capture the |
| changed specification. If you have previously augmented the source for these Java |
| Class Models, your changes must be merged with the newly (re)generated Java source |
| code for the Class Models. This can be done by hand, or you can run the version of JCasGen |
| that is integrated with Eclipse, and use automatic merging that is done using Eclipse's EMF |
| plug-in. You can obtain Eclipse and the needed EMF plug-in from <a class="ulink" href="http://www.eclipse.org/" target="_top">http://www.eclipse.org/</a>.</p> |
| |
| <p>If you run the generator version that works without using Eclipse, it will not |
| merge Java source changes you may have previously made; if you want them retained, |
| you'll have to do the merging by hand.</p> |
| |
| <p>The Java source merging will keep additional constructors, additional fields, |
| and any changes you may have made to the readObject method (see below). Merging will |
| <span class="emphasis"><em>not</em></span> delete classes in the target corresponding to deleted CAS types, which no longer |
| are in the source – you should delete these by hand.</p> |
| |
| <div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>The merging supports Java 1.4 syntactic constructs only. |
| JCasGen generates Java 1.4 code, so as long as any code you change here also sticks to |
| only Java 1.4 constructs, the merge will work. If you use Java 5 or later specific syntax or constructs, the merge |
| operation will likely fail to merge properly.</p></div> |
| </div> |
| |
| <div class="section" title="5.4.2. Additional Constructors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.additional_constructors">5.4.2. Additional Constructors</h3></div></div></div> |
| |
| |
| <p>Any additional constructors that you add must include the JCas argument. The |
| first line of your constructor is required to be</p> |
| |
| |
| <pre class="programlisting">this(jcas); // run the standard constructor</pre> |
| |
| <p>where jcas is the passed in JCas reference. If the type you're defining |
| extends <code class="literal">uima.tcas.Annotation</code>, JCasGen will automatically |
| add a constructor which takes 2 additional parameters – the begin and end Java |
| int values, and set the <code class="literal">uima.tcas.Annotation</code> |
| <code class="literal">begin</code> and <code class="literal">end</code> fields.</p> |
| |
| <p>Here's an example: If you're defining a type MyType which has a |
| feature parent, you might make an additional constructor which has an additional |
| argument of parent:</p> |
| |
| |
| <pre class="programlisting">MyType(JCas jcas, MyType parent) { |
| this(jcas); // run the standard constructor |
| setParent(parent); // set the parent field from the parameter |
| }</pre> |
| |
| <div class="section" title="5.4.2.1. Using readObject"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.jcas.using_readobject">5.4.2.1. Using readObject</h4></div></div></div> |
| |
| |
| <p>Fields defined by augmenting the Java Class Model to include additional |
| fields represent data that exist for this class in Java, in a local JVM (Java Virtual |
| Machine), but do not exist in the CAS when it is passed to other environments (for |
| example, passing to a remote annotator).</p> |
| |
| <p>A problem can arise when new instances are created, perhaps by the underlying |
| system when it iterates over an index: how to ensure that any additional |
| non-CAS fields are properly initialized. To allow for arbitrary initialization |
| at instance creation time, an initialization method in the Java Class Model, |
| called readObject, is used. The generated default for this method is to do nothing, |
| but it is one of the methods that you can modify – to do whatever |
| initialization might be needed. It is called with 0 parameters, during the |
| constructor for the object, after the basic object fields have been set up. It can |
| refer to fields in the CAS using the getters and setters, and other fields in the Java |
| object instance being initialized.</p> |
| |
| <p>A pre-existing CAS feature structure could exist if a CAS was being passed to |
| this annotator; in this case the JCas system calls the readObject method when |
| creating the corresponding Java instance for the first time for the CAS feature |
| structure. This can happen at two points: when a new object is being returned from an |
| iterator over a CAS index, or a getter method is getting a field for the first time |
| whose value is a feature structure.</p> |
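| <p>The calling sequence described above can be modeled in plain Java. The sketch below is a stand-in, not |
| generated JCas code: <code class="literal">MeetingModel</code>, its <code class="literal">location</code> field |
| (standing in for a CAS feature) and its <code class="literal">localNotes</code> field (standing in for an added |
| non-CAS field) are all hypothetical. It only shows the order of events: the basic fields are set, then the |
| parameterless readObject hook runs and initializes the extra field.</p> |

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a cover class with one added non-CAS field (not generated UIMA code). */
class MeetingModel {
    private final String location;    // stands in for a CAS feature
    private List<String> localNotes;  // additional non-CAS field, local to this JVM

    MeetingModel(String location) {
        this.location = location;     // basic object fields are set up first
        readObject();                 // then the initialization hook is called
    }

    /** Initialization hook: takes no parameters and may use the getters. */
    private void readObject() {
        localNotes = new ArrayList<>();
        localNotes.add("created for " + getLocation());
    }

    String getLocation() {
        return location;
    }

    List<String> getLocalNotes() {
        return localNotes;
    }
}
```

| <p>In this model the hook is private and is called from the constructor of the same class, so the non-CAS |
| field is never observed uninitialized.</p> |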
| |
| </div> |
| </div> |
| |
| <div class="section" title="5.4.3. Modifying generated items"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.modifying_generated_items">5.4.3. Modifying generated items</h3></div></div></div> |
| |
| |
| <p>The following modifications, if made in generated items, will be preserved when |
| regenerating.</p> |
| |
| <p>The public/private etc. flags associated with methods (getters and setters). |
| You can change the default (<span class="quote">“<span class="quote">public</span>”</span>) if needed.</p> |
| |
| <p><span class="quote">“<span class="quote">final</span>”</span> or <span class="quote">“<span class="quote">abstract</span>”</span> can be added to the type |
| itself, with the usual semantics.</p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="5.5. Merging types"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.merging_types_from_other_specs">5.5. Merging types</h2></div></div></div> |
| |
| |
| <p>Type definitions are merged by the framework from all the components being run together.</p> |
| |
| <div class="section" title="5.5.1. Aggregate AEs and CPEs as sources of types"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.merging_types.aggregates_and_cpes">5.5.1. Aggregate AEs and CPEs as sources of types</h3></div></div></div> |
| |
| |
| <p>When running aggregate AEs (Analysis Engines), or a set of AEs in a collection processing engine, the |
| UIMA framework will build a merged type system (Note: this <span class="quote">“<span class="quote">merge</span>”</span> is merging types, not to be |
| confused with merging Java source code, discussed above). This merged type system has all the types of every |
| component used in the application. In addition, application code can use UIMA Framework APIs to read and merge |
| type descriptions, manually.</p> |
| |
| <p>In most cases, each type system can have its own Java Class Models generated individually, perhaps at an |
| earlier time, and the resulting class files (or .jar files containing these class files) can be put in the |
| class path to enable JCas.</p> |
| |
| <p>However, it is possible that there may be multiple definitions of the same CAS type, each of which might |
| have different features defined. In this case, the UIMA framework will create a merged type by accumulating |
| all the defined features for a particular type into that type's type definition. However, the JCas |
| classes for these types are not automatically merged, which can create some issues for JCas users, as |
| discussed in the next section.</p> |
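| <p>The accumulation of features can be pictured with a small stand-alone model. This is a sketch under |
| simplifying assumptions, not the framework's implementation: a type system is reduced to a map from type name to |
| feature names, <code class="literal">TypeMergeModel</code> is a hypothetical class, and feature range types and |
| conflict checks are left out.</p> |

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of type merging: features of same-named types are accumulated. */
final class TypeMergeModel {
    private TypeMergeModel() {}

    /** Merges type systems, each given as a map from type name to its feature names. */
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> typeSystems) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, Set<String>> typeSystem : typeSystems) {
            for (Map.Entry<String, Set<String>> type : typeSystem.entrySet()) {
                // The first definition creates the entry; later ones only add features.
                merged.computeIfAbsent(type.getKey(), name -> new LinkedHashSet<>())
                      .addAll(type.getValue());
            }
        }
        return merged;
    }
}
```

| <p>A JCas class generated from only one of the inputs would lack accessors for the features contributed by the |
| others, which is the mismatch that the merged type system introduces.</p> |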
| |
| </div> |
| |
| <div class="section" title="5.5.2. JCasGen support for type merging"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.merging_types.jcasgen_support">5.5.2. JCasGen support for type merging</h3></div></div></div> |
| |
| |
| <p>When there are multiple definitions of the same CAS type with different features defined, then JCasGen |
| can be re-run on the merged type system, to create one set of JCas Class definitions for the merged types, |
| which can then be shared by all the components. |
| Directions for running JCasGen can be found in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.jcasgen" class="olink">Chapter 8, <i>JCasGen User's Guide</i></a>. This is typically done by the person who |
| is assembling the Aggregate Analysis Engine or Collection Processing Engine. The resulting merged Java |
| Class Model will then contain get and set methods for the complete set of features. These Java classes must |
| then be made available in the class path, <span class="emphasis"><em>replacing</em></span> the pre-merge versions of the |
| classes.</p> |
| |
| <p>If hand-modifications were done to the pre-merge versions of the classes, these must be applied to the |
| merged versions, as described in section <a class="xref" href="#ugr.ref.jcas.keeping_augmentations_when_regenerating" title="5.4.1. Keeping hand-coded augmentations when regenerating">Section 5.4.1, “Keeping hand-coded augmentations when regenerating”</a>, above. If just one of the |
| pre-merge versions had hand-modifications, the source for this hand-modified version can be put into the |
| file system where the generated output will go, and the -merge option for JCasGen will automatically |
| merge the hand-modifications with the generated code. If |
| <span class="emphasis"><em>both</em></span> pre-merged versions had hand-modifications, then these modifications must |
| be manually merged.</p> |
| |
| <p>An alternative to this is packaging the components as individual PEAR files, each with their own |
| version of the JCas generated Classes. The Framework (as of release 2.2) can run PEAR files using the |
| pear file descriptor, and supply each component with its particular version of the JCas generated class.</p> |
| |
| </div> |
| |
| <div class="section" title="5.5.3. Impact of Type Merging on Composability of Annotators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.impact_of_type_merging_on_composability">5.5.3. Impact of Type Merging on Composability of Annotators</h3></div></div></div> |
| |
| |
| |
| <p>The recommended approach in UIMA is to build and maintain type systems as separate components, which are |
| imported by Annotators. Using this approach, Type Merging does not occur because the Type System and its JCas |
| classes are centrally managed and shared by the annotators.</p> |
| |
| <p>If you do choose to create a JCas Annotator that relies on Type Merging (meaning that your annotator |
| redefines a Type that is already in use elsewhere, and adds its own features), this can negatively impact the |
| reusability of your annotator, unless your component is used as a PEAR file.</p> |
| |
| <p>If you are not using the isolation capability of PEAR file packaging, whenever |
| anyone wants to combine your annotator with another annotator that uses a different version of |
| the same Type, they will need to be aware of all of the issues described in the previous section. They will need |
| to have the know-how to re-run JCasGen and appropriately set up their classpath to include the merged Java |
| classes and to not include the pre-merge classes. (To enable this, you should package these classes |
| separately from other .jar files for your annotator, so that they can be more easily excluded.) And, if you |
| have done hand-modifications to your JCas classes, the person assembling your annotator will need to |
| properly merge those changes. These issues significantly complicate the task of combining annotators, and |
| will cause your annotator not to be as easily reusable as other UIMA annotators. </p> |
| |
| </div> |
| |
| <div class="section" title="5.5.4. Adding Features to DocumentAnnotation"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.documentannotation_issues">5.5.4. Adding Features to DocumentAnnotation</h3></div></div></div> |
| |
| |
| <p>There is one built-in type, <code class="literal">uima.tcas.DocumentAnnotation</code>, |
| to which applications can add additional features. (All other built-in types |
| are "feature-final" and you cannot add additional features to them.) Frequently, |
| additional features are added to <code class="literal">uima.tcas.DocumentAnnotation</code> |
| to provide a place to store document-level metadata.</p> |
| |
| <p>For the same reasons mentioned in the previous section, adding features to |
| DocumentAnnotation is not recommended if you are using JCas. Instead, it is recommended |
| that you define your own type for storing your document-level metadata. You can create |
| an instance of this type and add it to the indexes in the usual way. You can then |
| retrieve this instance using the iterator returned from the method <code class="literal">getAllIndexedFS(type)</code> |
| on an instance of a JFSIndexRepository object. |
| (As of UIMA v2.1, you do not have to declare a custom index in your descriptor to |
| get this to work).</p> |
| |
| <p>If you do choose to add features to DocumentAnnotation, there are additional issues to |
| be aware of. The UIMA SDK provides the JCas cover class for the built-in definition of |
| DocumentAnnotation, in the separate jar file <code class="literal">uima-document-annotation.jar</code>. |
| If you add additional features to DocumentAnnotation, you must remove this jar file |
| from your classpath, because you will not want to use the default JCas cover class. |
| You will need to re-run JCasGen as described in <a class="xref" href="#ugr.ref.jcas.merging_types.jcasgen_support" title="5.5.2. JCasGen support for type merging">Section 5.5.2, “JCasGen support for type merging”</a>. JCasGen will generate a new cover |
| class for DocumentAnnotation, which you must place in your classpath in lieu of the version |
| in <code class="literal">uima-document-annotation.jar</code>.</p> |
| |
| <p>Also, this is the reason why the method <code class="literal">JCas.getDocumentAnnotationFs()</code> returns |
| type <code class="literal">TOP</code>, rather than type <code class="literal">DocumentAnnotation</code>. Because the |
| <code class="literal">DocumentAnnotation</code> class can be replaced by users, it is not part of |
| <code class="literal">uima-core.jar</code> and so the core UIMA framework cannot have any references |
| to it. In your code, you may <span class="quote">“<span class="quote">cast</span>”</span> the result of <code class="literal">JCas.getDocumentAnnotationFs()</code> |
| to type <code class="literal">DocumentAnnotation</code>, which must be available on the classpath either via |
| <code class="literal">uima-document-annotation.jar</code> or by including a custom version that you have generated using JCasGen.</p> |
| </div> |
| |
| </div> |
| |
| <div class="section" title="5.6. Using JCas within an Annotator"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.using_within_an_annotator">5.6. Using JCas within an Annotator</h2></div></div></div> |
| |
| |
| <p>To use JCas within an annotator, you must include the generated Java classes output |
| from JCasGen in the class path.</p> |
| |
| <p>An annotator written using JCas is built by defining a class for the annotator that |
| extends JCasAnnotator_ImplBase. The process method for this annotator is |
| written as follows:</p> |
| |
| <pre class="programlisting">public void process(JCas jcas) |
| throws AnalysisEngineProcessException { |
| ... // body of annotator goes here |
| }</pre> |
| |
| <p>The process method is passed the JCas instance to use as a parameter.</p> |
| |
| <p>The JCas reference is used throughout the annotator to refer to the particular JCas |
| instance being worked on. In pooled or multi-threaded implementations, there will be a |
| separate JCas for each thread being (simultaneously) worked on.</p> |
| |
| <p>You can do several kinds of operations using the JCas APIs: create new feature |
| structures (instances of CAS types) (using the new operator), access existing feature |
| structures passed to your annotator in the JCas (for example, by using the next method of |
| an iterator over the feature structures), get and set the fields of a particular |
| instance of a feature structure, and add and remove feature structure instances from |
| the CAS indexes. To support iteration, there are also functions to get and use indexes |
| and iterators over the instances in a JCas.</p> |
| |
| <div class="section" title="5.6.1. Creating new instances using the Java “new” operator"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.new_instances">5.6.1. Creating new instances using the Java <span class="quote">“<span class="quote">new</span>”</span> operator</h3></div></div></div> |
| |
| |
| |
| <p>The new operator creates new instances of JCas types. It takes at least one |
| parameter, the JCas instance in which the type is to be created. For example, if there |
| was a type Meeting defined, you can create a new instance of it using: |
| |
| </p><pre class="programlisting">Meeting m = new Meeting(jcas);</pre> |
| |
| <p>Other variations of constructors can be added in custom code; the single |
| parameter version is the one automatically generated by JCasGen. For types that are |
| subtypes of Annotation, JCasGen also generates an additional constructor with |
| additional <span class="quote">“<span class="quote">begin</span>”</span> and <span class="quote">“<span class="quote">end</span>”</span> arguments.</p> |
| |
| </div> |
| <div class="section" title="5.6.2. Getters and Setters"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.getters_and_setters">5.6.2. Getters and Setters</h3></div></div></div> |
| |
| |
| <p>If the CAS type Meeting had fields location and time, you could get or set these by |
| using getter or setter methods. These methods have names formed by splicing together |
| the word <span class="quote">“<span class="quote">get</span>”</span> or <span class="quote">“<span class="quote">set</span>”</span> followed by the field name, with |
| the first letter of the field name capitalized. For instance |
| |
| </p><pre class="programlisting">getLocation()</pre> |
| |
| <p>The getter forms take no parameters and return the value of the field; the setter |
| forms take one parameter, the value to set into the field, and return void.</p> |
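| <p>The naming rule for accessors can be written down as a short helper. This is only a sketch of the rule as |
| stated above; <code class="literal">AccessorNames</code> is a hypothetical class and is not something JCasGen |
| exposes.</p> |

```java
/** Illustrative sketch of the accessor naming rule (not part of UIMA or JCasGen). */
final class AccessorNames {
    private AccessorNames() {}

    /** Returns "get" followed by the feature name with its first letter capitalized. */
    static String getterName(String featureName) {
        return "get" + capitalizeFirst(featureName);
    }

    /** Returns "set" followed by the feature name with its first letter capitalized. */
    static String setterName(String featureName) {
        return "set" + capitalizeFirst(featureName);
    }

    private static String capitalizeFirst(String name) {
        if (name.isEmpty()) {
            return name;
        }
        return Character.toUpperCase(name.charAt(0)) + name.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(getterName("location")); // prints getLocation
        System.out.println(setterName("time"));     // prints setTime
    }
}
```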
| |
| <p>There are built-in CAS types for arrays of integers, strings, floats, and |
| feature structures. For fields whose values are these types of arrays, there is an |
| alternate form of getters and setters that take an additional parameter, written as |
| the first parameter, which is the index in the array of an item to get or set.</p> |
| |
| </div> |
| |
| <div class="section" title="5.6.3. Obtaining references to Indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.obtaining_refs_to_indexes">5.6.3. Obtaining references to Indexes</h3></div></div></div> |
| |
| |
| <p>The only way to access instances (not otherwise referenced from other |
| instances) passed in to your annotator in its JCas is to use an iterator over some |
| index. Indexes in the CAS are specified in the annotator descriptor. Indexes have a |
| name; text annotators have a built-in, standard index over all annotations.</p> |
| |
| <p>To get an index, first get the JFSIndexRepository from the JCas using the method |
| jcas.getJFSIndexRepository(). Here are the calls to get indexes:</p> |
| |
| |
| <pre class="programlisting">JFSIndexRepository ir = jcas.getJFSIndexRepository(); |
| |
| ir.getIndex(name-of-index) // get the index by its name, a string |
| ir.getIndex(name-of-index, Foo.type) // filtered by specific type |
| |
| ir.getAnnotationIndex() // get AnnotationIndex |
| jcas.getAnnotationIndex() // get directly from jcas |
| ir.getAnnotationIndex(Foo.type)         // filtered by specific type |
| jcas.getAnnotationIndex(Foo.class)      // better</pre> |
| |
| <p>For convenience, the getAnnotationIndex method is available directly on the JCas object |
| instance; the implementation merely forwards to the associated index repository.</p> |
| |
| <p>A filtering type has to be a subtype of the type specified for this index in its |
| index specification. It can be written either as Foo.type or, if you have an instance |
| of Foo, as</p> |
| |
| <pre class="programlisting">fooInstance.getClass()</pre> |
| |
| <p>Foo is (of course) an example of the name of the type.</p> |
| |
| </div> |
| <div class="section" title="5.6.4. Adding (and removing) instances to (from) indexes"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.adding_removing_instances_to_indexes">5.6.4. Adding (and removing) instances to (from) indexes</h3></div></div></div> |
| |
| |
| |
| <p>CAS indexes are maintained automatically by the CAS. But you must add any |
| instances of feature structures you want the index to find, to the indexes by using the |
| call:</p> |
| |
| <pre class="programlisting">myInstance.addToIndexes();</pre> |
| |
| <p>Do this after setting all features in the instance <span class="bold-italic">which could be used in indexing</span>, |
| for example, in determining the sorting order. |
| See <a class="xref" href="#ugr.ref.cas.updating_indexed_feature_structures" title="4.5.1. Updating indexed feature structures">Section 4.5.1, “Updating indexed feature structures”</a> for details |
| on updating indexed feature structures. |
| </p> |
| |
| <p>When writing a Multi-View component, you may need to index instances in multiple |
| CAS views. The methods above use the indexes associated with the current JCas object. |
| There is a variation of the <code class="literal">addToIndexes / removeFromIndexes</code> methods which |
| takes one argument: a reference to a JCas object holding the view in which you want to |
| index this instance. |
| </p><pre class="programlisting">myInstance.addToIndexes(anotherJCas) |
| myInstance.removeFromIndexes(anotherJCas)</pre><p> |
| </p> |
| |
| <p> |
| You can also explicitly add instances to other views using the addFsToIndexes method on |
| other JCas (or CAS) objects. For instance, if you had 2 other CAS views (myView1 and |
| myView2), in which you wanted to index myInstance, you could write:</p> |
| |
| <pre class="programlisting">myInstance.addToIndexes(); //addToIndexes used with the new operator |
| myView1.addFsToIndexes(myInstance); // index myInstance in myView1 |
| myView2.addFsToIndexes(myInstance); // index myInstance in myView2</pre> |
| |
| <p> |
| The rules for determining which index to use with a particular JCas object are designed to |
| behave the way most would think they should; if you need specific behavior, you can always |
| explicitly designate which view the index adding and removing operations should work on. |
| </p> |
| |
| <p> |
| The rules are: |
| If the instance is a subtype of AnnotationBase, then the view is the view associated with the |
| annotation as specified in the feature holding the view reference in AnnotationBase. |
| Otherwise, if the instance was created using the "new" operator, then the view is the view passed to the |
| instance's constructor. |
| Otherwise, if the instance was created by getting a feature value from some other instance, whose range |
| type is a feature structure, then the view is the same as the referring instance. |
| Otherwise, if the instance was created by any of the Feature Structure Iterator operations over some index, |
| then it is the view associated with the index. |
| </p> |
| |
| <p>As of release 2.4.1, there are two efficient bulk-remove methods to remove all instances of a given type, |
| or all instances of a given type and its subtypes. |
| These are invoked on an instance of an IndexRepository, |
| for a particular view. For example, to remove all instances of Token from a particular JCas instance: |
| </p> |
| <pre class="programlisting">jcas.removeAllIncludingSubtypes(Token.type); |
| // or |
| jcas.removeAllIncludingSubtypes(aTokenInstance.getTypeIndexID()); |
| // or |
| jcas.getFSIndexRepository() |
|     .removeAllIncludingSubtypes(jcas.getCasType(Token.type)); |
| </pre> |
| |
| </div> |
| |
| <div class="section" title="5.6.5. Using Iterators"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.using_iterators">5.6.5. Using Iterators</h3></div></div></div> |
| |
| |
| <p>This section describes obtaining and using iterators. However, it is recommended that instead |
| you use the select framework, described in a chapter in the version 3 user's guide.</p> |
| |
| <p>Once you have an index obtained from the JCas, you can get an iterator from the |
| index; here is an example:</p> |
| |
| |
| <pre class="programlisting">// Unfiltered: get the index by its name |
| FSIndexRepository ir = jcas.getFSIndexRepository(); |
| FSIndex myIndex = ir.getIndex("myIndexName"); |
| FSIterator myIterator = myIndex.iterator(); |
| |
| // Alternatively, filtered by type, using the JFSIndexRepository |
| JFSIndexRepository ir = jcas.getJFSIndexRepository(); |
| FSIndex myIndex = ir.getIndex("myIndexName", Foo.type); // filtered |
| FSIterator myIterator = myIndex.iterator();</pre> |
| |
| <p>Iterators work like normal Java iterators, but are augmented to support |
| additional capabilities. Iterators are described in the CAS Reference, <a href="references.html#ugr.ref.cas.indexes_and_iterators" class="olink">Section 4.7, “Indexes and Iterators”</a>.</p> |
| |
| </div> |
| |
| <div class="section" title="5.6.6. Class Loaders in UIMA"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.class_loaders">5.6.6. Class Loaders in UIMA</h3></div></div></div> |
| |
| |
| <p>The basic concept of a UIMA application includes assembling engines into a flow. |
| The application made up of these engines is run within the UIMA Framework, either by |
| the Collection Processing Manager, or by using more basic UIMA Framework |
| APIs.</p> |
| |
| <p>The UIMA Framework exists within a JVM (Java Virtual Machine). A JVM has the |
| capability to load multiple applications, in a way where each one is isolated from the |
| others, by using a separate class loader for each application. For instance, one set |
| of UIMA Framework Classes could be shared by multiple sets of application-specific |
| classes, even if these application-specific classes had the same names but were |
| different versions.</p> |
| |
| <div class="section" title="5.6.6.1. Use of Class Loaders is optional"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.jcas.class_loaders.optional">5.6.6.1. Use of Class Loaders is optional</h4></div></div></div> |
| |
| |
| <p>The UIMA framework will use a specific ClassLoader, based on how |
| ResourceManager instances are used. Specific ClassLoaders are only created if |
| you specify an ExtensionClassPath as part of the ResourceManager. If you do not |
| need to support multiple applications within one UIMA framework within a JVM, |
| don't specify an ExtensionClassPath; in this case, the classloader used |
| will be the one used to load the UIMA framework - usually the overall application |
| class loader.</p> |
| |
| <p>Of course, you should not run multiple UIMA applications together, in this |
| way, if they have different class definitions for the same class name. This |
| includes the JCas <span class="quote">“<span class="quote">cover</span>”</span> classes. This case might arise, for |
| instance, if both applications extended |
| <code class="literal">uima.tcas.DocumentAnnotation</code> in differing, |
| incompatible ways. Each application would need its own definition of this class, |
| but only one could be loaded (unless you specify ExtensionClassPath in the |
| ResourceManager which will cause the UIMA application to load its private |
| versions of its classes, from its classpath).</p> |
| </div> |
| </div> |
| |
| <div class="section" title="5.6.7. Issues accessing JCas objects outside of UIMA Engine Components"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.jcas.accessing_jcas_objects_outside_uima_components">5.6.7. Issues accessing JCas objects outside of UIMA Engine Components</h3></div></div></div> |
| |
| |
| <p>If you are using the ExtensionClassPaths, the JCas cover classes are loaded |
| under a class loader created by the ResourceManager part of the UIMA Framework. |
| If you reference the same JCas |
| classes outside of any UIMA component, for instance, in top level application code, |
| the JCas classes used by that top level application code also must be in the class path |
| for the application code.</p> |
| |
| <p>Alternatively, you could do all the JCas processing inside a UIMA component (and do no |
| processing using JCas outside of the UIMA pipeline).</p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="5.7. Setting up Classpath for JCas"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.setting_up_classpath">5.7. Setting up Classpath for JCas</h2></div></div></div> |
| |
| |
| <p>The JCas Java classes generated by JCasGen are typically compiled and put into a JAR |
| file, which, in turn, is put into the application's class path.</p> |
| |
| <p>This JAR file must be generated from the application's merged type system. |
| This is most conveniently done by opening the top level descriptor used by the |
| application in the Component Descriptor Editor tool, and pressing the Run-JCasGen |
| button on the Type System Definition page.</p> |
| |
| </div> |
| |
| <div class="section" title="5.8. PEAR isolation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.jcas.pear_support">5.8. PEAR isolation</h2></div></div></div> |
| |
| <p> |
| As of version 2.2, the framework supports component descriptors which are PEAR descriptors. |
| These descriptors define components plus include information on the class path needed to |
| run them. The framework uses the class path information to set up a localized class path, just |
| for code running within the PEAR context. This allows PEAR files requiring different |
| versions of common code to work well together, even if classes in the different versions |
| have the same names. |
| </p> |
| |
| <p>The mechanism used to switch the class loaders when entering a PEAR-packaged annotator in |
| a flow depends on the framework knowing if JCas is being used within that annotator code. The |
| framework will know this if the particular view being passed has had a previous call to |
| getJCas(), or if the particular annotator is marked as a JCas-using one (by having it extend the |
| class <code class="code">JCasAnnotator_ImplBase</code>).</p> |
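| |
| <p>The general idea of scoping a class loader to a single component call can be sketched in plain Java. |
| This is a generic pattern (setting and restoring the thread context class loader), shown only as an |
| analogy; it is not the framework's implementation, and the names <code class="literal">ScopedClassLoader</code> |
| and <code class="literal">callWith</code> are hypothetical.</p> |

```java
import java.util.function.Supplier;

/** Hypothetical helper (not a UIMA API): runs code under a given class loader. */
final class ScopedClassLoader {

    /**
     * Installs loader as the thread context class loader, runs body,
     * and always restores the previous loader afterwards.
     */
    static <T> T callWith(ClassLoader loader, Supplier<T> body) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return body.get();
        } finally {
            // Restore even if body throws, so later components are unaffected.
            thread.setContextClassLoader(previous);
        }
    }
}
```

| <p>The <code class="literal">try</code>/<code class="literal">finally</code> block is the important part of the sketch: |
| the caller's class loader is restored whether or not the wrapped code fails.</p> |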
| |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 6. PEAR Reference" id="ugr.ref.pear"><div class="titlepage"><div><div><h2 class="title">Chapter 6. PEAR Reference</h2></div></div></div> |
| |
| |
| <p> |
| A PEAR (Processing Engine ARchive) file is a standard package |
| for UIMA components. This chapter describes the PEAR 1.0 structure and |
| specification. |
| </p> |
| |
| <p> |
| The PEAR package can be used for distribution and reuse by other |
| components or applications. It also allows applications and |
| tools to manage UIMA components automatically for verification, |
| deployment, invocation, testing, etc. |
| </p> |
| |
| <p> |
| Currently, there is an Eclipse plugin and a command line tool |
| available to create PEAR packages for standard UIMA components. |
| Please refer to |
| <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> |
| <a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter 9, <i>PEAR Packager User's Guide</i></a> |
| for more information about these tools. |
| </p> |
| |
| <p> |
| PEARs distributed to new targets can be installed at those targets. |
| UIMA includes a tool for installing PEARs; see |
| <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> |
| <a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter 11, <i>PEAR Installer User's Guide</i></a> for |
| more information about installing PEARs. |
| </p> |
| |
| <p> |
| An installed PEAR can be used as a component within a UIMA pipeline, |
| by specifying the pear descriptor that is created when |
| installing the pear. See |
| <a href="references.html#ugr.ref.pear.specifier" class="olink">Section 6.3, “PEAR package descriptor”</a>. |
| </p> |
| |
| <div class="section" title="6.1. Packaging a UIMA component"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.packaging_a_component">6.1. Packaging a UIMA component</h2></div></div></div> |
| |
| |
| <p> |
| For the purpose of describing the process of creating a PEAR |
| file and its internal structure, this section describes the |
| steps used to package a UIMA component as a valid PEAR file. |
| The PEAR packaging process consists of the following steps: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> |
| <a class="xref" href="#ugr.ref.pear.creating_pear_structure" title="6.1.1. Creating the PEAR structure">Section 6.1.1, “Creating the PEAR structure”</a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="xref" href="#ugr.ref.pear.populating_pear_structure" title="6.1.2. Populating the PEAR structure">Section 6.1.2, “Populating the PEAR structure”</a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="xref" href="#ugr.ref.pear.creating_installation_descriptor" title="6.1.3. Creating the installation descriptor">Section 6.1.3, “Creating the installation descriptor”</a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="xref" href="#ugr.ref.pear.packaging_into_1_file" title="6.1.5. Packaging the PEAR structure into one file">Section 6.1.5, “Packaging the PEAR structure into one file”</a> |
| </p> |
| </li></ul></div><p> |
| </p> |
| |
| <div class="section" title="6.1.1. Creating the PEAR structure"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.creating_pear_structure">6.1.1. Creating the PEAR structure</h3></div></div></div> |
| |
| |
| <p> |
| The first step in the PEAR creation process is to create |
| a PEAR structure. The PEAR structure is a structured |
| tree of folders and files, including the following |
| elements: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> |
| Required Elements: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"> |
| <p> |
| The |
| <span class="bold"><strong> |
| metadata |
| </strong></span> |
| folder which contains the PEAR |
| installation descriptor and |
| properties files. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The installation descriptor ( |
| <span class="bold"><strong> |
| metadata/install.xml |
| </strong></span> |
| ) |
| </p> |
| </li><li class="listitem"> |
| <p> |
| A UIMA analysis engine |
| descriptor and its required |
| code, delegates (if any), and |
| resources |
| </p> |
| </li></ul></div><p> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| Optional Elements: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"> |
| <p> |
| The desc folder to contain |
| descriptor files of analysis |
| engines, delegate analysis |
| engines (all levels), and other |
| components (Collection Readers, |
| CAS Consumers, etc). |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The src folder to contain the |
| source code |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The bin folder to contain |
| executables, scripts, class |
| files, dlls, shared libraries, |
| etc. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The lib folder to contain jar |
| files. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The doc folder containing |
| documentation materials, |
| preferably accessible through an |
| index.html. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The data folder to contain data |
| files (e.g. for testing). |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The conf folder to contain |
| configuration files. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| The resources folder to contain |
| other resources and |
| dependencies. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| Other user-defined folders or |
| files are allowed, but should be |
| avoided. |
| </p> |
| </li></ul></div><p> |
| </p> |
| </li></ul></div><p> |
| </p> |
| |
| <div class="figure"><a name="ugr.ref.pear.fig.pear_structure"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="297"><tr><td><img src="images/references/ref.pear/image002.jpg" width="297" alt="diagram of the PEAR structure"></td></tr></table></div> |
| </div><p class="title"><b>Figure 6.1. The PEAR Structure</b></p></div><br class="figure-break"> |
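| |
| <p>As a rough illustration of this step, the following Java sketch creates an empty PEAR folder tree |
| using only the folder names listed above. The class <code class="literal">PearSkeleton</code> and its |
| method <code class="literal">createSkeleton</code> are hypothetical and not part of UIMA; the installation |
| descriptor itself is not written here.</p> |

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper (not a UIMA API): creates an empty PEAR folder tree. */
final class PearSkeleton {

    /** The required folder, which will later hold metadata/install.xml. */
    static final String REQUIRED_FOLDER = "metadata";

    /** The optional folders named in the PEAR structure description. */
    static final List<String> OPTIONAL_FOLDERS =
            List.of("desc", "src", "bin", "lib", "doc", "data", "conf", "resources");

    /** Creates the folders under pearRoot and returns pearRoot. */
    static Path createSkeleton(Path pearRoot) {
        try {
            Files.createDirectories(pearRoot.resolve(REQUIRED_FOLDER));
            for (String folder : OPTIONAL_FOLDERS) {
                Files.createDirectories(pearRoot.resolve(folder));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pearRoot;
    }

    public static void main(String[] args) throws IOException {
        Path root = createSkeleton(Files.createTempDirectory("pear-skeleton"));
        System.out.println("Created PEAR skeleton at " + root);
    }
}
```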
| |
| </div> |
| <div class="section" title="6.1.2. Populating the PEAR structure"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.populating_pear_structure">6.1.2. Populating the PEAR structure</h3></div></div></div> |
| |
| |
| <p> |
| After creating the PEAR structure, the component's |
| descriptor files, code files, resources files, and any |
| other files and folders are copied into the |
| corresponding folders of the PEAR structure. The |
| developer should make sure that the code would work with |
| this layout of files and folders, and that there are no |
| broken links. Although it is strongly discouraged, the |
| optional elements of the PEAR structure can be replaced |
| by other user-defined files and folders, if required for |
| the component to work properly. |
| </p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3> |
| <p> |
| The PEAR structure must be self-contained. For |
| example, this means that the component must run |
| properly independently from the PEAR root folder |
| location. If the developer needs to use an absolute |
| path in configuration or descriptor files, then |
| they should put these files in the |
| <span class="quote">“<span class="quote">conf</span>”</span> |
| or |
| <span class="quote">“<span class="quote">desc</span>”</span> |
| folder and replace the path of the PEAR root folder with |
| the string |
| <span class="quote">“<span class="quote">$main_root</span>”</span> |
| . The tools that deploy and use PEAR files should |
| localize the files in the |
| <span class="quote">“<span class="quote">conf</span>”</span> |
| and |
| <span class="quote">“<span class="quote">desc</span>”</span> |
| folders by replacing the string |
| <span class="quote">“<span class="quote">$main_root</span>”</span> |
| with the local absolute path of the PEAR root |
| folder. The |
| <span class="quote">“<span class="quote">$main_root</span>”</span> |
| macro can also be used in the Installation |
| descriptor (install.xml). |
| </p> |
| </div> |
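| |
| <p>The localization step described in the note can be sketched as follows. This is a minimal |
| illustration, not the code used by the UIMA PEAR installer: the class |
| <code class="literal">MainRootLocalizer</code> and its methods are hypothetical, and the sketch assumes |
| that every file in the conf and desc folders is UTF-8 text and that forward slashes are acceptable |
| in the substituted path.</p> |

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical localization step; not the code used by the UIMA PEAR installer. */
final class MainRootLocalizer {

    static final String MAIN_ROOT = "$main_root";

    /** Reads a UTF-8 text file, wrapping I/O failures in an unchecked exception. */
    static String read(Path file) {
        try {
            return Files.readString(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a UTF-8 text file, creating parent folders as needed. */
    static void write(Path file, String text) {
        try {
            Files.createDirectories(file.getParent());
            Files.writeString(file, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Replaces $main_root in every regular file under conf/ and desc/.
     * Returns the number of files that were changed.
     */
    static int localize(Path pearRoot) {
        // Assumption: forward slashes are acceptable in the localized path.
        String root = pearRoot.toAbsolutePath().normalize().toString().replace('\\', '/');
        int changed = 0;
        for (String folder : List.of("conf", "desc")) {
            Path dir = pearRoot.resolve(folder);
            if (!Files.isDirectory(dir)) {
                continue; // both folders are optional in a PEAR
            }
            List<Path> files;
            try (Stream<Path> walk = Files.walk(dir)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            for (Path file : files) {
                String text = read(file);
                if (text.contains(MAIN_ROOT)) {
                    write(file, text.replace(MAIN_ROOT, root));
                    changed++;
                }
            }
        }
        return changed;
    }
}
```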
| |
| <p> |
| Currently there are three types of component packages |
| depending on their deployment: |
| </p> |
| |
| <div class="section" title="6.1.2.1. Standard Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.standard">6.1.2.1. Standard Type</h4></div></div></div> |
| |
| |
| <p> |
| A component package with the |
| <span class="bold"><strong>standard</strong></span> |
| type must be a valid Analysis Engine, and all the |
| required files to deploy it locally must be included |
| in the PEAR package. |
| </p> |
| |
| </div> |
| <div class="section" title="6.1.2.2. Service Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.service">6.1.2.2. Service Type</h4></div></div></div> |
| |
| |
| <p> |
| A component package with the |
| <span class="bold"><strong>service</strong></span> |
| type must be deployable locally as a supported UIMA |
| service (e.g. Vinci). In this case, all the required |
| files to deploy it locally must be included in the |
| PEAR package. |
| </p> |
| |
| </div> |
| <div class="section" title="6.1.2.3. Network Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.package_type.network">6.1.2.3. Network Type</h4></div></div></div> |
| |
| |
| <p> |
| A component package with the network type is not |
| deployed locally but rather in the |
| <span class="quote">“<span class="quote">remote</span>”</span> |
| environment. It's accessed as a network AE |
| (e.g. Vinci Service). The component owner has the |
| responsibility to start the service and make sure |
| it's up and running before it's used by |
| others (like a webmaster that makes sure the web |
| site is up and running). In this case, the PEAR |
| package does not have to contain files required for |
| deployment, but must contain the network AE |
| descriptor (see |
| <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae.creating_xml_descriptor" class="olink">Section 1.1.4, “Creating the XML Descriptor”</a> |
| ) and the <DESC> tag in the installation |
| descriptor must point to the network AE descriptor. |
| For more information about Network Analysis Engines, |
| please refer to |
| <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.remote_services" class="olink">Section 3.6, “Working with Remote Services”</a> |
| . |
| </p> |
| |
| </div> |
| </div> |
| |
| <div class="section" title="6.1.3. Creating the installation descriptor"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.creating_installation_descriptor">6.1.3. Creating the installation descriptor</h3></div></div></div> |
| |
| |
| <p> |
| The installation descriptor is an XML file called |
| install.xml under the metadata folder of the PEAR |
| structure. It's also called InsD. The InsD XML file |
| should be created in the UTF-8 file encoding. The InsD |
| should contain the following sections: |
| </p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> |
| <OS>: This section is used to specify |
| supported operating systems |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <TOOLKITS>: This section is used to |
| specify toolkits, such as JDK, needed by the |
| component. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <SUBMITTED_COMPONENT>: This is the most |
| important section in the Installation |
| Descriptor. It's used to specify required |
| information about the component. See |
| <a class="xref" href="#ugr.ref.pear.installation_descriptor" title="6.1.4. Documented template for the installation descriptor:">Section 6.1.4, “Installation Descriptor: template”</a> |
| for detailed information about this section. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <INSTALLATION>: This section is explained in |
| <a class="xref" href="#ugr.ref.pear.installing" title="6.2. Installing a PEAR package">Section 6.2, “Installing a PEAR package”</a> |
| . |
| </p> |
| </li></ul></div> |
| |
| </div> |
| |
| <div class="section" title="6.1.4. Documented template for the installation descriptor:"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.installation_descriptor">6.1.4. |
| Documented template for the installation descriptor: |
| </h3></div></div></div> |
| |
| |
| |
| <p> |
| The following is a sample |
| <span class="quote">“<span class="quote">documented template</span>”</span> |
| that describes the content of the installation descriptor |
| install.xml: |
| </p> |
| |
| |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8"?> |
| <!-- Installation Descriptor Template --> |
| <COMPONENT_INSTALLATION_DESCRIPTOR> |
| <!-- Specifications of OS names, including version, etc. --> |
| <OS> |
| <NAME>OS_Name_1</NAME> |
| <NAME>OS_Name_2</NAME> |
| </OS> |
| <!-- Specifications of required standard toolkits --> |
| <TOOLKITS> |
| <JDK_VERSION>JDK_Version</JDK_VERSION> |
| </TOOLKITS> |
| |
| <!-- There are 2 types of variables that are used in the InsD: |
| a) $main_root , which will be substituted with the real path to the |
| main component root directory after installing the |
| main (submitted) component |
| b) $component_id$root, which will be substituted with the real path |
| to the root directory of a given delegate component after |
| installing the given delegate component --> |
| |
| <!-- Specification of submitted component (AE) --> |
| <!-- Note: submitted_component_id is assigned by developer; --> |
| <!-- XML descriptor file name is set by developer. --> |
| <!-- Important: ID element should be the first in the --> |
| <!-- SUBMITTED_COMPONENT section. --> |
| <!-- Submitted component may include optional specification --> |
| <!-- of Collection Reader that can be used for testing the --> |
| <!-- submitted component. --> |
| <!-- Submitted component may include optional specification --> |
| <!-- of CAS Consumer that can be used for testing the --> |
| <!-- submitted component. --> |
| |
| <SUBMITTED_COMPONENT> |
| <ID>submitted_component_id</ID> |
| <NAME>Submitted component name</NAME> |
| <DESC>$main_root/desc/ComponentDescriptor.xml</DESC> |
| |
| <!-- deployment options: --> |
| <!-- a) "standard" is deploying AE locally --> |
| <!-- b) "service" is deploying AE locally as a service, --> |
| <!-- using specified command (script) --> |
| <!-- c) "network" is deploying a pure network AE, which --> |
| <!-- is running somewhere on the network --> |
| |
| <DEPLOYMENT>standard | service | network</DEPLOYMENT> |
| |
| <!-- Specifications for "service" deployment option only --> |
| <SERVICE_COMMAND>$main_root/bin/startService.bat</SERVICE_COMMAND> |
| <SERVICE_WORKING_DIR>$main_root</SERVICE_WORKING_DIR> |
| <SERVICE_COMMAND_ARGS> |
| |
| <ARGUMENT> |
| <VALUE>1st_parameter_value</VALUE> |
| <COMMENTS>1st parameter description</COMMENTS> |
| </ARGUMENT> |
| |
| <ARGUMENT> |
| <VALUE>2nd_parameter_value</VALUE> |
| <COMMENTS>2nd parameter description</COMMENTS> |
| </ARGUMENT> |
| |
| </SERVICE_COMMAND_ARGS> |
| |
| <!-- Specifications for "network" deployment option only --> |
| |
| <NETWORK_PARAMETERS> |
| <VNS_SPECS VNS_HOST="vns_host_IP" VNS_PORT="vns_port_No" /> |
| </NETWORK_PARAMETERS> |
| |
| <!-- General specifications --> |
| |
| <COMMENTS>Main component description</COMMENTS> |
| |
| <COLLECTION_READER> |
| <COLLECTION_ITERATOR_DESC> |
| $main_root/desc/CollIterDescriptor.xml |
| </COLLECTION_ITERATOR_DESC> |
| |
| <CAS_INITIALIZER_DESC> |
| $main_root/desc/CASInitializerDescriptor.xml |
| </CAS_INITIALIZER_DESC> |
| </COLLECTION_READER> |
| |
| <CAS_CONSUMER> |
| <DESC>$main_root/desc/CASConsumerDescriptor.xml</DESC> |
| </CAS_CONSUMER> |
| |
| </SUBMITTED_COMPONENT> |
| <!-- Specifications of the component installation process --> |
| <INSTALLATION> |
| <!-- List of delegate components that should be installed together --> |
| <!-- with the main submitted component (for aggregate components) --> |
| <!-- Important: ID element should be the first in each --> |
| <!-- DELEGATE_COMPONENT section. --> |
| <DELEGATE_COMPONENT> |
| <ID>first_delegate_component_id</ID> |
| <NAME>Name of first required separate component</NAME> |
| </DELEGATE_COMPONENT> |
| |
| <DELEGATE_COMPONENT> |
| <ID>second_delegate_component_id</ID> |
| <NAME>Name of second required separate component</NAME> |
| </DELEGATE_COMPONENT> |
| |
| <!-- Specifications of local path names that should be replaced --> |
| <!-- with real path names after the main component as well as --> |
| <!-- all required delegate (library) components are installed. --> |
| <!-- <FILE> and <REPLACE_WITH> values may use the $main_root or --> |
| <!-- one of the $component_id$root variables. --> |
| <!-- Important: ACTION element should be the first in each --> |
| <!-- PROCESS section. --> |
| |
| <PROCESS> |
| <ACTION>find_and_replace_path</ACTION> |
| <PARAMETERS> |
| <FILE>$main_root/desc/ComponentDescriptor.xml</FILE> |
| <FIND_STRING>../resources/dict/</FIND_STRING> |
| <REPLACE_WITH>$main_root/resources/dict/</REPLACE_WITH> |
| <COMMENTS>Specify actual dictionary location in XML component |
| descriptor |
| </COMMENTS> |
| </PARAMETERS> |
| </PROCESS> |
| |
| <PROCESS> |
| <ACTION>find_and_replace_path</ACTION> |
| <PARAMETERS> |
| <FILE>$main_root/desc/DelegateComponentDescriptor.xml</FILE> |
| <FIND_STRING> |
| local_root_directory_for_1st_delegate_component/resources/dict/ |
| </FIND_STRING> |
| <REPLACE_WITH> |
| $first_delegate_component_id$root/resources/dict/ |
| </REPLACE_WITH> |
| <COMMENTS> |
| Specify actual dictionary location in the descriptor of the 1st |
| delegate component |
| </COMMENTS> |
| </PARAMETERS> |
| </PROCESS> |
| |
| <!-- Specifications of environment variables that should be set prior |
| to running the main component and all other reused components. |
| <VAR_VALUE> values may use the $main_root or one of the |
| $component_id$root variables. --> |
| |
| <PROCESS> |
| <ACTION>set_env_variable</ACTION> |
| <PARAMETERS> |
| <VAR_NAME>env_variable_name</VAR_NAME> |
| <VAR_VALUE>env_variable_value</VAR_VALUE> |
| <COMMENTS>Set environment variable value</COMMENTS> |
| </PARAMETERS> |
| </PROCESS> |
| |
| </INSTALLATION> |
| </COMPONENT_INSTALLATION_DESCRIPTOR></pre> |
| |
| <div class="section" title="6.1.4.1. The SUBMITTED_COMPONENT section"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.submitted_component">6.1.4.1. The SUBMITTED_COMPONENT section</h4></div></div></div> |
| |
| |
| <p>The SUBMITTED_COMPONENT section of the installation descriptor |
| (install.xml) is used to specify required information about the UIMA component. |
| Before explaining the details, let's clarify the concept of component ID and |
| <span class="quote">“<span class="quote">macros</span>”</span> used in the installation descriptor. The component ID |
| element should be the <span class="bold"><strong>first element </strong></span>in the |
| SUBMITTED_COMPONENT section.</p> |
| |
| <p>The component id is a string that uniquely identifies the component. It should |
| use the Java package naming convention (e.g. |
| com.company_name.project_name.etc.mycomponent).</p> |
| |
| <p>Macros are variables such as $main_root, used to represent a string such as the |
| full path of a certain directory.</p> |
| |
| <p>The values of these macros are defined by the PEAR installation process, when the |
| PEAR is installed, and represent the values local to that particular installation. |
| The values are stored in the <code class="literal">metadata/PEAR.properties</code> file that is |
| generated during PEAR installation. |
| The tools and applications that use and deploy PEAR files replace these macros with |
| the corresponding values in the local environment as part of the deployment |
| process in the files included in the conf and desc folders.</p> |
| |
| <p>Currently, there are two types of macros:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>$main_root, which represents the local absolute |
| path of the main component root directory after deployment. </p></li><li class="listitem"><p>$<span class="emphasis"><em>component_id</em></span>$root, which |
| represents the local absolute path to the root directory of the component which |
| has <span class="emphasis"><em>component_id </em></span> as component ID. This component could |
| be, for instance, a delegate component. </p></li></ul></div> |
| |
| <p>For example, if some part of a descriptor needs to have a path to the data |
| subdirectory of the PEAR, you write <code class="literal">$main_root/data</code>. If |
| your PEAR refers to a delegate component having the ID |
| <span class="quote">“<span class="quote"><code class="literal">my.comp.Dictionary</code></span>”</span>, and you need to |
| specify a path to one of this component's subdirectories, e.g. |
| <code class="literal">resources/dict</code>, you write |
| <code class="literal">$my.comp.Dictionary$root/resources/dict</code>. </p> |
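| |
| <p>To make the two macro forms concrete, here is a minimal Java sketch of the substitution. |
| The class <code class="literal">PearMacros</code>, the method <code class="literal">expand</code>, and the |
| example paths are hypothetical; the sketch only mirrors the rules stated above and is not the |
| installer's own code.</p> |

```java
import java.util.Map;

/** Hypothetical macro expander (not a UIMA API). */
final class PearMacros {

    /**
     * Expands $main_root and $component_id$root macros in text.
     *
     * @param mainRoot      local absolute path of the main component root
     * @param delegateRoots component ID mapped to that component's root path
     */
    static String expand(String text, String mainRoot, Map<String, String> delegateRoots) {
        String result = text.replace("$main_root", mainRoot);
        for (Map.Entry<String, String> entry : delegateRoots.entrySet()) {
            // The macro for component ID "x" is written "$x$root".
            result = result.replace("$" + entry.getKey() + "$root", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> roots = Map.of("my.comp.Dictionary", "/opt/pears/dict");
        System.out.println(expand("$main_root/data", "/opt/pears/main", roots));
        System.out.println(expand("$my.comp.Dictionary$root/resources/dict", "/opt/pears/main", roots));
    }
}
```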
| |
| </div> |
| <div class="section" title="6.1.4.2. The ID, NAME, and DESC tags"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.id_name_desc">6.1.4.2. The ID, NAME, and DESC tags</h4></div></div></div> |
| |
| |
| <p>The ID, NAME, and DESC tags specify the component ID, name, and descriptor path, |
| as follows: |
| |
| |
| </p><pre class="programlisting"><SUBMITTED_COMPONENT> |
| <ID>submitted_component_id</ID> |
| <NAME>Submitted component name</NAME> |
| <DESC>$main_root/desc/ComponentDescriptor.xml</DESC></pre> |
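| |
| <p>Because the ID element is expected to come first in the SUBMITTED_COMPONENT section, a packaging |
| script might want to check this before building the PEAR. The following sketch does so with the JDK's |
| DOM parser; the class <code class="literal">SubmittedComponentCheck</code> and its methods are hypothetical |
| and are not part of the UIMA tooling.</p> |

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical validation helper (not a UIMA API). */
final class SubmittedComponentCheck {

    /** Returns the tag name of the first child element of SUBMITTED_COMPONENT, or null. */
    static String firstChildElementName(String installXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(installXml.getBytes(StandardCharsets.UTF_8)));
            NodeList sections = doc.getElementsByTagName("SUBMITTED_COMPONENT");
            if (sections.getLength() == 0) {
                return null;
            }
            // Skip whitespace text nodes and comments; only elements count.
            for (Node n = sections.item(0).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    return ((Element) n).getTagName();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable installation descriptor", e);
        }
    }

    /** True if the first element inside SUBMITTED_COMPONENT is ID. */
    static boolean idComesFirst(String installXml) {
        return "ID".equals(firstChildElementName(installXml));
    }
}
```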
| |
| </div> |
| <div class="section" title="6.1.4.3. Tags related to deployment types"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type">6.1.4.3. Tags related to deployment types</h4></div></div></div> |
| |
| |
| <p>As mentioned before, there are currently three types of PEAR packages, |
| depending on the following deployment types:</p> |
| <div class="section" title="Standard Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.standard">Standard Type</h5></div></div></div> |
| |
| |
| <p>A component package with the <span class="bold"><strong>standard</strong></span> |
| type must be a valid UIMA Analysis Engine, and all the required files to deploy it |
| must be included in the PEAR package. This deployment type should be specified as |
| follows: |
| |
| |
| </p><pre class="programlisting"><DEPLOYMENT>standard</DEPLOYMENT></pre> |
| </div> |
| <div class="section" title="Service Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.service">Service Type</h5></div></div></div> |
| |
| |
| <p>A component package with the <span class="bold"><strong>service</strong></span> |
| type must be deployable locally as a supported UIMA service (e.g. Vinci). The |
| installation descriptor must include the path for the executable or script to |
| start the service including its arguments, and the working directory from where |
| to launch it, following this template: |
| |
| |
| </p><pre class="programlisting"><DEPLOYMENT>service</DEPLOYMENT> |
| <SERVICE_COMMAND>$main_root/bin/startService.bat</SERVICE_COMMAND> |
| <SERVICE_WORKING_DIR>$main_root</SERVICE_WORKING_DIR> |
| <SERVICE_COMMAND_ARGS> |
| <ARGUMENT> |
| <VALUE>1st_parameter_value</VALUE> |
| <COMMENTS>1st parameter description</COMMENTS> |
| </ARGUMENT> |
| <ARGUMENT> |
| <VALUE>2nd_parameter_value</VALUE> |
| <COMMENTS>2nd parameter description</COMMENTS> |
| </ARGUMENT> |
| </SERVICE_COMMAND_ARGS></pre> |
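| |
| <p>One way a deployment tool could turn these three elements into a launchable command is sketched |
| below with <code class="literal">java.lang.ProcessBuilder</code>. The class |
| <code class="literal">ServiceLauncher</code> is hypothetical, and expanding $main_root inside the |
| argument values is an assumption of this sketch rather than documented behavior. The process is only |
| prepared, not started.</p> |

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher sketch (not a UIMA API). */
final class ServiceLauncher {

    /**
     * Builds a ProcessBuilder from the SERVICE_COMMAND, SERVICE_WORKING_DIR,
     * and ARGUMENT values, expanding $main_root to the installed PEAR root.
     */
    static ProcessBuilder prepare(String serviceCommand, String workingDir,
                                  List<String> argumentValues, String mainRoot) {
        List<String> commandLine = new ArrayList<>();
        commandLine.add(serviceCommand.replace("$main_root", mainRoot));
        for (String value : argumentValues) {
            // Assumption: argument values may also contain the macro.
            commandLine.add(value.replace("$main_root", mainRoot));
        }
        return new ProcessBuilder(commandLine)
                .directory(new File(workingDir.replace("$main_root", mainRoot)));
    }
}
```

| <p>Calling <code class="literal">start()</code> on the returned builder would launch the service script |
| from the configured working directory.</p> |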
| |
| </div> |
| <div class="section" title="Network Type"><div class="titlepage"><div><div><h5 class="title" id="ugr.ref.pear.installation_descriptor.deployment_type.network">Network Type</h5></div></div></div> |
| |
| |
| <p>A component package with the network type is not deployed locally, but |
| rather in a <span class="quote">“<span class="quote">remote</span>”</span> environment. It's accessed as a |
| network AE (e.g. Vinci Service). In this case, the PEAR package does not have to |
| contain files required for deployment, but must contain the network AE |
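| descriptor. The <DESC> tag in the installation descriptor (see |
| <a class="xref" href="#ugr.ref.pear.installation_descriptor.id_name_desc" title="6.1.4.2. The ID, NAME, and DESC tags">Section 6.1.4.2, “The ID, NAME, and DESC tags”</a>) |
| must point to the network AE descriptor. Here is a template in the case of |
| Vinci services: |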
| |
| |
| </p><pre class="programlisting"><DEPLOYMENT>network</DEPLOYMENT> |
| <NETWORK_PARAMETERS> |
| <VNS_SPECS VNS_HOST="vns_host_IP" VNS_PORT="vns_port_No" /> |
| </NETWORK_PARAMETERS></pre> |
| </div> |
| </div> |
| <div class="section" title="6.1.4.4. The Collection Reader and CAS Consumer tags"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.collection_reader_cas_consumer">6.1.4.4. The Collection Reader and CAS Consumer tags</h4></div></div></div> |
| |
| |
| <p>These sections of the installation descriptor specify any Collection Reader |
| or CAS Consumer that is intended to be used with the packaged analysis |
| engine.</p> |
| |
| </div> |
| <div class="section" title="6.1.4.5. The INSTALLATION section"><div class="titlepage"><div><div><h4 class="title" id="ugr.ref.pear.installation_descriptor.installation">6.1.4.5. The INSTALLATION section</h4></div></div></div> |
| |
| |
| <p>The <INSTALLATION> section specifies the external dependencies of |
| the component and the operations that should be performed during the PEAR package |
| installation.</p> |
| |
| <p>The component dependencies are specified in the |
| <DELEGATE_COMPONENT> sub-sections, as shown in the installation |
| descriptor template above.</p> |
| |
| <p>Important: The ID element should be the first element in each |
| <DELEGATE_COMPONENT> sub-section.</p> |
| |
| <p>The <INSTALLATION> section may specify the following operations: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>Setting environment variables that are |
| required to run the installed component. |
| </p> |
| <p>This is also how you specify additional classpath entries |
| for a Java component: set an environment variable |
| named CLASSPATH. The <code class="literal">buildComponentClassPath</code> method |
| of the PackageBrowser class builds a classpath string from |
| the CLASSPATH specification given here and adds a classpath entry for every |
| Jar in the <code class="literal">lib</code> directory. Because of this, there is no need |
| to specify classpath entries for Jars in the lib directory when using |
| the Eclipse plugin PEAR packager or the Maven PEAR Packager.</p> |
| |
| <div class="blockquote"><blockquote class="blockquote"><p>When specifying the value of the CLASSPATH environment |
| variable, use the semicolon ";" as the separator character, regardless of the |
| target Operating System conventions. This delimiter will be replaced with |
| the right one for the Operating System during PEAR installation.</p> |
| </blockquote></div> |
| |
| <p>If your component needs to set the UIMA datapath you must specify the necessary |
| datapath setting using an environment variable with the key <code class="literal">uima.datapath</code>. |
| When such a key is specified the <code class="literal">getComponentDataPath</code> method of the |
| PackageBrowser class will return the specified datapath settings for your component. |
| </p> |
| |
| <div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>Do not put UIMA Framework Jars into the lib directory of your |
| PEAR; doing so will cause system failures due to class loading issues.</p></div> |
| </li><li class="listitem"><p>Note that you can use <span class="quote">“<span class="quote">macros</span>”</span>, like |
| $main_root or $component_id$root in the VAR_VALUE element of the |
| <PARAMETERS> sub-section.</p></li><li class="listitem"><p>Finding and replacing string expressions in files.</p> |
| </li><li class="listitem"><p>Note that you can use the <span class="quote">“<span class="quote">macros</span>”</span> in the FILE |
| and REPLACE_WITH elements of the <PARAMETERS> sub-section. </p> |
| </li></ul></div> |
| |
| <p>Important: the ACTION element should always be the first element in each |
| <PROCESS> sub-section.</p> |
| |
| <p>By default, the PEAR Installer will try to process every file in the desc and |
| conf directories of the PEAR package in order to find the <span class="quote">“<span class="quote">macros</span>”</span> |
| and replace them with actual path expressions. In addition to this, the installer |
| will process the files specified in the |
| <INSTALLATION> section.</p> |
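| <p>As a rough illustration of the two substitutions described in this section (macro replacement and the ";" separator rule), the sketch below resolves a CLASSPATH value by hand. It is not the PEAR Installer's code: the class <code class="literal">PearPathSketch</code> and its <code class="literal">resolve</code> method are hypothetical names, and only the $main_root macro is handled.</p> |

```java
import java.io.File;

/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class PearPathSketch {

  /**
   * Replaces the $main_root macro with the given root directory and turns
   * the portable ';' separator into the given platform path separator.
   */
  static String resolve(String value, String mainRoot, String pathSeparator) {
    String withRoot = value.replace("$main_root", mainRoot);
    return withRoot.replace(";", pathSeparator);
  }

  public static void main(String[] args) {
    // Example value as it might appear in a CLASSPATH environment setting
    String declared = "$main_root/bin;$main_root/lib/extra.jar";
    System.out.println(resolve(declared, "/opt/pears/myAnnot", File.pathSeparator));
  }
}
```

| <p>Passing the separator as an argument keeps the sketch independent of the platform it runs on; a caller would normally pass <code class="literal">File.pathSeparator</code>.</p> |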
| |
| <p>Important: all XML files which are going to be processed should be created |
| using UTF-8 or UTF-16 file encoding. All other text files which are going to be |
| processed should be created using the ASCII file encoding.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="6.1.5. Packaging the PEAR structure into one file"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.packaging_into_1_file">6.1.5. Packaging the PEAR structure into one file</h3></div></div></div> |
| |
| |
| <p>The last step of the PEAR process is to simply <span class="bold"><strong> |
| zip</strong></span> the content of the PEAR root folder (<span class="bold"><strong>not |
| including the root folder itself</strong></span>) to a PEAR file with the extension <span class="quote">“<span class="quote">.pear</span>”</span>.</p> |
| |
| <p>To do this you can either use the PEAR packaging tools that are described in <span class="quote">“<span class="quote"><a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter 9, <i>PEAR Packager User's Guide</i></a></span>”</span> or you can use the PEAR packaging API that is shown below.</p> |
| |
| <p> |
| To use the PEAR packaging API you first have to create the necessary information for the PEAR package: |
| </p><pre class="programlisting"> //define PEAR data |
| String componentID = "AnnotComponentID"; |
| String mainComponentDesc = "desc/mainComponentDescriptor.xml"; |
| String classpath ="$main_root/bin;"; |
| String datapath ="$main_root/resources;"; |
| String mainComponentRoot = "/home/user/develop/myAnnot"; |
| String targetDir = "/home/user/develop"; |
| Properties annotatorProperties = new Properties(); |
| annotatorProperties.setProperty("sysProperty1", "value1");</pre><p> |
| |
| To create a complete PEAR package in one step call: |
| </p><pre class="programlisting">PackageCreator.generatePearPackage( |
| componentID, mainComponentDesc, classpath, datapath, |
| mainComponentRoot, targetDir, annotatorProperties);</pre><p> |
| The created PEAR package has the file name <componentID>.pear and is located in the <targetDir>. |
| </p> |
| <p> |
| To create just the PEAR installation descriptor in the main component root directory call: |
| </p><pre class="programlisting">PackageCreator.createInstallDescriptor(componentID, mainComponentDesc, |
| classpath, datapath, mainComponentRoot, annotatorProperties);</pre><p> |
| |
| To package a PEAR file with an existing installation descriptor call: |
| </p><pre class="programlisting">PackageCreator.createPearPackage(componentID, mainComponentRoot, |
| targetDir);</pre><p> |
| The created PEAR package has the file name <componentID>.pear and is located in the <targetDir>. |
| </p> |
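| <p>For comparison, the plain zipping step described at the start of this section can be sketched with the JDK alone. This is an illustrative alternative, not the implementation behind <code class="literal">PackageCreator</code>; the class <code class="literal">PearZipSketch</code> and its method are hypothetical names. Entry names are made relative to the PEAR root, so the root folder itself is not part of the archive. The sketch assumes the target file lies outside the PEAR root and does not store empty directories.</p> |

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class PearZipSketch {

  /** Zips all regular files below pearRoot into pearFile. */
  static void zipContents(Path pearRoot, Path pearFile) throws IOException {
    List<Path> files;
    try (Stream<Path> walk = Files.walk(pearRoot)) {
      files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    }
    try (OutputStream out = Files.newOutputStream(pearFile);
         ZipOutputStream zip = new ZipOutputStream(out)) {
      for (Path file : files) {
        // Name relative to the root, with '/' separators as used inside zip files
        String name = pearRoot.relativize(file).toString().replace('\\', '/');
        zip.putNextEntry(new ZipEntry(name));
        Files.copy(file, zip);
        zip.closeEntry();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: PearZipSketch <pearRootDir> <target.pear>");
      return;
    }
    zipContents(Paths.get(args[0]), Paths.get(args[1]));
  }
}
```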
| |
| </div> |
| </div> |
| <div class="section" title="6.2. Installing a PEAR package"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.installing">6.2. Installing a PEAR package</h2></div></div></div> |
| |
| |
| <p>The installation of a PEAR package can be done using |
| the PEAR installer tool (see <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter 11, <i>PEAR Installer User's Guide</i></a>), or by an application using |
| the PEAR APIs directly. </p> |
| |
| <p>During the PEAR installation the PEAR file is extracted to the installation directory and the PEAR macros |
| in the descriptors are updated with the corresponding path. At the end of the installation the PEAR verification |
| is called to check if the installed PEAR package can be started successfully. The PEAR verification uses the classpath, |
| datapath and the system property settings of the PEAR package to verify the PEAR content. Necessary Java library |
| path settings for native libraries, PATH variable settings or system environment variables cannot be recognized |
| automatically; the user must take care of them manually.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>By default the PEAR packages are not installed directly to the specified installation directory. For each PEAR |
| a subdirectory with the name of the PEAR's ID is created where the PEAR package is installed to. If the PEAR installation |
| directory already exists, the old content is automatically deleted before the new content is installed.</p></div> |
| |
| <div class="section" title="6.2.1. Installing a PEAR file using the PEAR APIs"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.pear.installing_pear_using_API">6.2.1. Installing a PEAR file using the PEAR APIs</h3></div></div></div> |
| |
| |
| <p>The example below shows how to use the PEAR APIs to install a |
| PEAR package and access the installed PEAR package data. For more details about the PackageBrowser API, |
| please refer to the Javadocs for the org.apache.uima.pear.tools package. |
| |
| </p><pre class="programlisting">File installDir = new File("/home/user/uimaApp/installedPears"); |
| File pearFile = new File("/home/user/uimaApp/testpear.pear"); |
| boolean doVerification = true; |
| |
| try { |
| // install PEAR package |
| PackageBrowser instPear = PackageInstaller.installPackage( |
| installDir, pearFile, doVerification); |
| |
| // retrieve installed PEAR data |
| // PEAR package classpath |
| String classpath = instPear.buildComponentClassPath(); |
| // PEAR package datapath |
| String datapath = instPear.getComponentDataPath(); |
| // PEAR package main component descriptor |
| String mainComponentDescriptor = instPear |
| .getInstallationDescriptor().getMainComponentDesc(); |
| // PEAR package component ID |
| String mainComponentID = instPear |
| .getInstallationDescriptor().getMainComponentId(); |
| // PEAR package pear descriptor |
| String pearDescPath = instPear.getComponentPearDescPath(); |
| |
| // print out settings |
| System.out.println("PEAR package class path: " + classpath); |
| System.out.println("PEAR package datapath: " + datapath); |
| System.out.println("PEAR package mainComponentDescriptor: " |
| + mainComponentDescriptor); |
| System.out.println("PEAR package mainComponentID: " |
| + mainComponentID); |
| System.out.println("PEAR package specifier path: " + pearDescPath); |
| |
| } catch (PackageInstallerException ex) { |
| // catch PackageInstallerException - PEAR installation failed |
| ex.printStackTrace(); |
| System.out.println("PEAR installation failed"); |
| } catch (IOException ex) { |
| ex.printStackTrace(); |
| System.out.println("Error retrieving installed PEAR settings"); |
| }</pre> |
| |
| <p> |
| To run a PEAR package after it has been installed using the PEAR API, see the example below. It uses the |
| PEAR specifier that was automatically generated during the PEAR installation. |
| For more details about the APIs please refer to the Javadocs. |
| |
| |
| </p><pre class="programlisting">File installDir = new File("/home/user/uimaApp/installedPears"); |
| File pearFile = new File("/home/user/uimaApp/testpear.pear"); |
| boolean doVerification = true; |
| |
| try { |
| |
| // Install PEAR package |
| PackageBrowser instPear = PackageInstaller.installPackage( |
| installDir, pearFile, doVerification); |
| |
| // Create a default resource manager |
| ResourceManager rsrcMgr = UIMAFramework.newDefaultResourceManager(); |
| |
| // Create analysis engine from the installed PEAR package using |
| // the created PEAR specifier |
| XMLInputSource in = |
| new XMLInputSource(instPear.getComponentPearDescPath()); |
| ResourceSpecifier specifier = |
| UIMAFramework.getXMLParser().parseResourceSpecifier(in); |
| AnalysisEngine ae = |
| UIMAFramework.produceAnalysisEngine(specifier, rsrcMgr, null); |
| |
| // Create a CAS with a sample document text |
| CAS cas = ae.newCAS(); |
| cas.setDocumentText("Sample text to process"); |
| cas.setDocumentLanguage("en"); |
| |
| // Process the sample document |
| ae.process(cas); |
| } catch (Exception ex) { |
| ex.printStackTrace(); |
| }</pre> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="6.3. PEAR package descriptor"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.pear.specifier">6.3. PEAR package descriptor</h2></div></div></div> |
| |
| |
| <p> |
| To run an installed PEAR package directly in the UIMA framework the <code class="literal">pearSpecifier</code> |
| XML descriptor can be used. Typically, such a specifier is generated automatically during the PEAR installation |
| and contains all the necessary information to run the installed PEAR package. Settings for system environment |
| variables, system PATH settings or Java library path settings cannot be recognized |
| automatically and must be set manually when the JVM is started. |
| </p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The PEAR may contain specifications for "environment variables" and their settings. |
| When such a PEAR is run |
| directly in the UIMA framework, those settings (except for Classpath and Data Path) are converted |
| to Java System properties, and set to the specified values. Java cannot set true environment variables; |
| if such a setting is needed, the application must arrange for it before invoking Java.</p> |
| |
| <p>The Classpath and Data Path settings are used by UIMA to configure a special Resource Manager |
| that is used when code from this PEAR is being run.</p></div> |
| |
| <p> |
| The generated PEAR descriptor |
| is located in the component root directory of the installed PEAR package and has a filename like |
| <componentID>_pear.xml. |
| </p> |
| <p> |
| The PEAR package descriptor looks like: |
| </p> |
| <pre class="programlisting"><?xml version="1.0" encoding="UTF-8"?> |
| <pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> |
| <pearPath>/home/user/uimaApp/installedPears/testpear</pearPath> |
| <parameters> <!-- optional --> |
| <parameter> <!-- any number, repeated --> |
| <name>name-of-the-parameter</name> |
| <value>string-value</value> |
| </parameter> |
| </parameters> |
| </pearSpecifier></pre> |
| <p> |
| The <code class="literal">pearPath</code> setting in the descriptor must point to the component root directory |
| of the installed PEAR package. |
| </p> |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3> |
| |
| <p> |
| It is not possible to share resources between PEAR Analysis Engines that are instantiated using the PEAR |
| descriptor. The PEAR runtime created for each PEAR descriptor has its own specific ResourceManager |
| (unless exactly the same Classpath and Data Path are being used). |
| </p> |
| </div> |
| |
| <p>The optional <code class="literal">parameters</code> section, if used, specifies parameter values, |
| which are used to customize / override parameter values in the PEAR descriptor. |
| External Settings overrides continue to work for PEAR descriptors, and have precedence, if specified. |
| </p> |
| |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 7. XMI CAS Serialization Reference" id="ugr.ref.xmi"><div class="titlepage"><div><div><h2 class="title">Chapter 7. XMI CAS Serialization Reference</h2></div></div></div> |
| |
| |
| <p>This is the specification for the mapping of the UIMA CAS into the XMI (XML Metadata |
| Interchange<sup>[<a name="d5e2511" href="#ftn.d5e2511" class="footnote">7</a>]</sup>) format. XMI is an OMG standard for expressing object graphs in |
| XML. The UIMA SDK provides support for XMI through the classes |
| <code class="literal">org.apache.uima.cas.impl.XmiCasSerializer</code> and |
| <code class="literal">org.apache.uima.cas.impl.XmiCasDeserializer</code>.</p> |
| |
| <div class="section" title="7.1. XMI Tag"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.xmi_tag">7.1. XMI Tag</h2></div></div></div> |
| |
| |
| <p>The outermost tag is <XMI> and must include a version number and XML |
| namespace attribute: |
| |
| |
| </p><pre class="programlisting"><xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"> |
| <!-- CAS Contents here --> |
| </xmi:XMI></pre> |
| |
| <p>XML namespaces<sup>[<a name="d5e2521" href="#ftn.d5e2521" class="footnote">8</a>]</sup> are used throughout. The <span class="quote">“<span class="quote">xmi</span>”</span> namespace prefix is used to |
| identify elements and attributes that are defined by the XMI specification. The XMI |
| document will also define one namespace prefix for each CAS namespace, as described in |
| the next section.</p> |
| |
| </div> |
| |
| <div class="section" title="7.2. Feature Structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.feature_structures">7.2. Feature Structures</h2></div></div></div> |
| |
| |
| <p>UIMA Feature Structures are mapped to XML elements. The name of the element is |
| formed from the CAS type name, making use of XML namespaces as follows.</p> |
| |
| <p>The CAS type namespace is converted to an XML namespace URI by the following rule: |
| replace all dots with slashes, prepend http:///, and append .ecore.</p> |
| |
| <p>This mapping was chosen because it is the default mapping used by the Eclipse |
| Modeling Framework (EMF)<sup>[<a name="d5e2529" href="#ftn.d5e2529" class="footnote">9</a>]</sup> to create namespace URIs from Java package names. The use of |
| the http scheme is a common convention, and does not imply any HTTP communication. The |
| .ecore suffix is due to the fact that the recommended type system definition for a |
| namespace is an ECore model, see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.xmi_emf" class="olink">Chapter 8, <i>XMI and EMF Interoperability</i></a>.</p> |
| |
| <p>Consider the CAS type name <span class="quote">“<span class="quote">org.myproj.Foo</span>”</span>. The CAS namespace |
| (<span class="quote">“<span class="quote">org.myproj.</span>”</span>) is converted to the XML namespace URI |
| http:///org/myproj.ecore.</p> |
| |
| <p>The XML element name is then formed by concatenating the XML namespace prefix |
| (which is an arbitrary token, but typically we use the last component of the CAS |
| namespace) with the type name (excluding the namespace).</p> |
| |
| <p>So the example <span class="quote">“<span class="quote">org.myproj.Foo</span>”</span> FeatureStructure is written to |
| XMI as: |
| |
| |
| </p><pre class="programlisting"><xmi:XMI |
| xmi:version="2.0" |
| xmlns:xmi="http://www.omg.org/XMI" |
| xmlns:myproj="http:///org/myproj.ecore"> |
| ... |
| <myproj:Foo xmi:id="1"/> |
| ... |
| </xmi:XMI></pre> |
| |
| <p>The xmi:id attribute is only required if this object will be referred to from |
| elsewhere in the XMI document. If provided, the xmi:id must be unique within the |
| XMI document.</p> |
| |
| <p>All namespace prefixes (e.g. <span class="quote">“<span class="quote">myproj</span>”</span>) in this example must be |
| bound to URIs using the <span class="quote">“<span class="quote">xmlns...</span>”</span> attribute, as defined by the XML |
| namespaces specification.</p> |
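| <p>The naming rule described in this section can be sketched in a few lines of Java. This is an illustration of the rule as stated above, not code taken from <code class="literal">XmiCasSerializer</code>; the class <code class="literal">XmiNameSketch</code> and its methods are hypothetical names, and type names without a namespace are simply rejected.</p> |

```java
/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class XmiNameSketch {

  /** Returns the CAS namespace, e.g. "org.myproj" for "org.myproj.Foo". */
  private static String namespaceOf(String casTypeName) {
    int lastDot = casTypeName.lastIndexOf('.');
    if (lastDot < 0) {
      throw new IllegalArgumentException("type name has no namespace: " + casTypeName);
    }
    return casTypeName.substring(0, lastDot);
  }

  /** Replaces dots with slashes, prepends "http:///" and appends ".ecore". */
  static String namespaceUri(String casTypeName) {
    return "http:///" + namespaceOf(casTypeName).replace('.', '/') + ".ecore";
  }

  /** Uses the last namespace component as the (arbitrary) prefix. */
  static String qualifiedElementName(String casTypeName) {
    String namespace = namespaceOf(casTypeName);
    String prefix = namespace.substring(namespace.lastIndexOf('.') + 1);
    String shortName = casTypeName.substring(namespace.length() + 1);
    return prefix + ":" + shortName;
  }

  public static void main(String[] args) {
    System.out.println(namespaceUri("org.myproj.Foo"));
    System.out.println(qualifiedElementName("org.myproj.Foo"));
  }
}
```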
| </div> |
| |
| <div class="section" title="7.3. Primitive Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.primitive_features">7.3. Primitive Features</h2></div></div></div> |
| |
| |
| <p>CAS features of primitive types (String, Boolean, Byte, Short, Integer, Long, |
| Float, or Double) can be mapped either to XML attributes or XML elements. For example, a |
| CAS FeatureStructure of type org.myproj.Foo, with features: |
| |
| |
| </p><pre class="programlisting">begin = 14 |
| end = 19 |
| myFeature = "bar"</pre><p> |
| could be mapped to: |
| |
| |
| </p><pre class="programlisting"><xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" |
| xmlns:myproj="http:///org/myproj.ecore"> |
| ... |
| <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/> |
| ... |
| </xmi:XMI></pre><p> |
| or equivalently: |
| |
| |
| </p><pre class="programlisting"><xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" |
| xmlns:myproj="http:///org/myproj.ecore"> |
| ... |
| <myproj:Foo xmi:id="1"> |
| <begin>14</begin> |
| <end>19</end> |
| <myFeature>bar</myFeature> |
| </myproj:Foo> |
| ... |
| </xmi:XMI></pre> |
| |
| <p>The attribute serialization is preferred for compactness, but either |
| representation is allowable. Mixing the two styles is allowed; some features can be |
| represented as attributes and others as elements.</p> |
| |
| </div> |
| |
| <div class="section" title="7.4. Reference Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.reference_features">7.4. Reference Features</h2></div></div></div> |
| |
| |
| <p>CAS features that are references to other feature structures (excluding arrays |
| and lists, which are handled separately) are serialized as ID references.</p> |
| |
| <p>If we add to the previous CAS example a feature structure of type org.myproj.Baz, |
| with feature <span class="quote">“<span class="quote">myFoo</span>”</span> that is a reference to the Foo object, the |
| serialization would be: |
| |
| |
| </p><pre class="programlisting"><xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" |
| xmlns:myproj="http:///org/myproj.ecore"> |
| ... |
| <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/> |
| <myproj:Baz xmi:id="2" myFoo="1"/> |
| ... |
| </xmi:XMI></pre> |
| |
| <p>As with primitive-valued features, it is permitted to use an element rather than an |
| attribute. However, the syntax is slightly different:</p> |
| |
| |
| <pre class="programlisting"><myproj:Baz xmi:id="2"> |
| <myFoo href="#1"/> |
| </myproj:Baz></pre> |
| |
| <p>Note that in the attribute representation, a reference feature is |
| indistinguishable from an integer-valued feature, so the meaning cannot be |
| determined without prior knowledge of the type system. The element representation is |
| unambiguous.</p> |
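| <p>To make the two reference forms concrete, the following sketch reads an XMI string with the JDK's DOM parser and follows a reference feature in either form. It is only an illustration under simplifying assumptions: the class <code class="literal">XmiReferenceSketch</code> is a hypothetical name, only top-level elements are indexed, error handling is omitted, and the caller is assumed to know from the type system that the feature is a reference.</p> |

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class XmiReferenceSketch {
  static final String XMI_NS = "http://www.omg.org/XMI";

  /** Returns the local element name of the object that a reference feature points to. */
  static String resolveTargetType(String xmi, String ownerId, String featureName)
      throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder()
        .parse(new ByteArrayInputStream(xmi.getBytes(StandardCharsets.UTF_8)));

    // Index every top-level element that carries an xmi:id
    Map<String, Element> byId = new HashMap<>();
    NodeList children = doc.getDocumentElement().getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node node = children.item(i);
      if (node instanceof Element) {
        Element element = (Element) node;
        String id = element.getAttributeNS(XMI_NS, "id");
        if (!id.isEmpty()) {
          byId.put(id, element);
        }
      }
    }

    Element owner = byId.get(ownerId);
    String ref = owner.getAttribute(featureName);      // attribute form: myFoo="1"
    if (ref.isEmpty()) {                               // element form: <myFoo href="#1"/>
      Element child = (Element) owner.getElementsByTagName(featureName).item(0);
      ref = child.getAttribute("href").substring(1);   // drop the leading '#'
    }
    return byId.get(ref).getLocalName();
  }

  public static void main(String[] args) throws Exception {
    String xmi =
        "<xmi:XMI xmi:version='2.0' xmlns:xmi='http://www.omg.org/XMI'"
        + " xmlns:myproj='http:///org/myproj.ecore'>"
        + "<myproj:Foo xmi:id='1' begin='14' end='19' myFeature='bar'/>"
        + "<myproj:Baz xmi:id='2'><myFoo href='#1'/></myproj:Baz>"
        + "</xmi:XMI>";
    System.out.println(resolveTargetType(xmi, "2", "myFoo"));
  }
}
```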
| |
| </div> |
| |
| <div class="section" title="7.5. Array and List Features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.array_and_list_features">7.5. Array and List Features</h2></div></div></div> |
| |
| |
| <p>For a CAS feature whose range type is one of the CAS array or list types, the XMI serialization depends on the |
| setting of the <span class="quote">“<span class="quote">multipleReferencesAllowed</span>”</span> attribute for that feature in the UIMA Type System |
| Description (see <a href="references.html#ugr.ref.xml.component_descriptor.type_system.features" class="olink">Section 2.3.3, “Features”</a>).</p> |
| |
| <p>An array or list with multipleReferencesAllowed = false (the default) is serialized as a |
| <span class="quote">“<span class="quote">multi-valued</span>”</span> property in XMI. An array or list with multipleReferencesAllowed = true is |
| serialized as a first-class object. Details are described below.</p> |
| |
| <div class="section" title="7.5.1. Arrays and Lists as Multi-Valued Properties"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.array_and_list_features.as_multi_valued_properties">7.5.1. Arrays and Lists as Multi-Valued Properties</h3></div></div></div> |
| |
| |
| <p>A multi-valued property is the most natural XMI representation for most cases. Consider the |
| example where the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the |
| integer array {2,4,6}. This can be mapped to: |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3" myIntArray="2 4 6"/></pre><p> or |
| equivalently: |
| |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3"> |
| <myIntArray>2</myIntArray> |
| <myIntArray>4</myIntArray> |
| <myIntArray>6</myIntArray> |
| </myproj:Baz></pre><p> |
| </p> |
| |
| <p>Note that String arrays whose elements contain embedded spaces MUST use the latter mapping.</p> |
| |
| <p>FSArray or FSList features are serialized in a similar way. For example an FSArray feature that contains |
| references to the elements with xmi:id's <span class="quote">“<span class="quote">13</span>”</span> and <span class="quote">“<span class="quote">42</span>”</span> could be |
| serialized as: |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3" myFsArray="13 42"/></pre><p> or: |
| |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3"> |
| <myFsArray href="#13"/> |
| <myFsArray href="#42"/> |
| </myproj:Baz></pre><p> |
| </p> |
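| <p>The choice between the two mappings for a String array can be sketched as follows. This is an illustration of the rule above, not the serializer's actual logic; the class <code class="literal">MultiValuedSketch</code> and the feature name <code class="literal">myStrings</code> used in the example are hypothetical. Treating empty strings like strings with embedded whitespace is an assumption of this sketch, made because an empty token would disappear from a space-separated list.</p> |

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class MultiValuedSketch {

  /** Minimal XML escaping for attribute values and element text. */
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
        .replace(">", "&gt;").replace("\"", "&quot;");
  }

  /** True if a space-separated attribute could not represent the values. */
  static boolean needsElementForm(List<String> values) {
    for (String value : values) {
      if (value.isEmpty() || value.chars().anyMatch(Character::isWhitespace)) {
        return true;
      }
    }
    return false;
  }

  /** Writes one feature structure with a single multi-valued feature. */
  static String serialize(String elementName, String id, String feature, List<String> values) {
    StringBuilder sb = new StringBuilder();
    sb.append('<').append(elementName).append(" xmi:id=\"").append(id).append('"');
    if (!needsElementForm(values)) {
      // Compact form: one space-separated attribute
      sb.append(' ').append(feature).append("=\"")
          .append(escape(String.join(" ", values))).append("\"/>");
      return sb.toString();
    }
    // Element form: one child element per value
    sb.append('>');
    for (String value : values) {
      sb.append('<').append(feature).append('>').append(escape(value))
          .append("</").append(feature).append('>');
    }
    sb.append("</").append(elementName).append('>');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(serialize("myproj:Baz", "3", "myIntArray", Arrays.asList("2", "4", "6")));
    System.out.println(serialize("myproj:Baz", "3", "myStrings", Arrays.asList("a b", "c")));
  }
}
```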
| </div> |
| |
| <div class="section" title="7.5.2. Arrays and Lists as First-Class Objects"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.array_and_list_features.as_1st_class_objects">7.5.2. Arrays and Lists as First-Class Objects</h3></div></div></div> |
| |
| |
| <p>The multi-valued-property representation described in the previous section does not allow multiple |
| references to an array or list object. Therefore, it cannot be used for features that are defined to allow |
| multiple references (i.e. features for which multipleReferencesAllowed = true in the Type System |
| Description).</p> |
| |
| <p>When multipleReferencesAllowed is set to true, array and list features are serialized as references, |
| and the array or list objects are serialized as separate objects in the XMI. Consider again the example where |
| the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the integer array |
| {2,4,6}. If myIntArray is defined with multipleReferencesAllowed=true, the serialization will be as |
| follows: |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3" myIntArray="4"/></pre><p> or: |
| |
| |
| </p><pre class="programlisting"><myproj:Baz xmi:id="3"> |
| <myIntArray href="#4"/> |
| </myproj:Baz></pre><p> |
| with the array object serialized as |
| |
| </p><pre class="programlisting"><cas:IntegerArray xmi:id="4" elements="2 4 6"/></pre><p> or: |
| |
| |
| </p><pre class="programlisting"><cas:IntegerArray xmi:id="4"> |
| <elements>2</elements> |
| <elements>4</elements> |
| <elements>6</elements> |
| </cas:IntegerArray></pre> |
| |
| <p>Note that in this case, the XML element name is formed from the CAS type name (e.g. |
| <span class="quote">“<span class="quote"><code class="literal">uima.cas.IntegerArray</code></span>”</span>) in the same way as for other |
| FeatureStructures. The elements of the array are serialized either as a space-separated attribute named |
| <span class="quote">“<span class="quote">elements</span>”</span> or as a series of child elements named <span class="quote">“<span class="quote">elements</span>”</span>.</p> |
| |
| <p>List nodes are just standard FeatureStructures with <span class="quote">“<span class="quote">head</span>”</span> and <span class="quote">“<span class="quote">tail</span>”</span> |
| features, and are serialized using the normal FeatureStructure serialization. For example, an |
| IntegerList with the values 2, 4, and 6 would be serialized as the four objects: |
| |
| |
| </p><pre class="programlisting"><cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/> |
| <cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/> |
| <cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/> |
| <cas:EmptyIntegerList xmi:id="13"/></pre> |
| |
| <p>This representation of arrays and lists allows multiple references to an array or list. It also allows a feature |
| with range type TOP to refer to an array or list. However, it is a very unnatural representation in XMI and does |
| not support interoperability with other XMI-based systems, so we instead recommend using the |
| multi-valued-property representation described in the previous section whenever it is possible.</p> |
| |
| <p>When a feature is specified in the descriptor without a multipleReferencesAllowed attribute, or with the |
| attribute specified as <code class="code">false</code>, but the framework discovers multiple references during |
| serialization, it will issue a message to the log saying that it discovered this (look for the phrase |
| "serialized in duplicate"). The serialization will continue, but the multiply-referenced items will |
| be serialized in duplicate.</p> |
| </div> |
| |
| <div class="section" title="7.5.3. Null Array/List Elements"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.xmi.null_array_list_elements">7.5.3. Null Array/List Elements</h3></div></div></div> |
| |
| |
| <p>In UIMA, an element of an FSArray or FSList may be null. In XMI, multi-valued properties do not permit null |
| values. As a workaround for this, we use a dummy instance of the special type cas:NULL, which has xmi:id 0. |
| For example, in the following example the <span class="quote">“<span class="quote">myFsArray</span>”</span> feature refers to an FSArray whose |
| second element is null: |
| |
| |
| </p><pre class="programlisting"><cas:NULL xmi:id="0"/> |
| <myproj:Baz xmi:id="3"> |
| <myFsArray href="#13"/> |
| <myFsArray href="#0"/> |
| <myFsArray href="#42"/> |
| </myproj:Baz></pre> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="7.6. Subjects of Analysis (Sofas) and Views"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.sofas_views">7.6. Subjects of Analysis (Sofas) and Views</h2></div></div></div> |
| |
| |
| <p>A UIMA CAS contains one or more subjects of analysis (Sofas). These are serialized no |
| differently from any other feature structure. For example: |
| |
| |
| </p><pre class="programlisting"><?xml version="1.0"?> |
| <xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" |
| xmlns:cas="http:///uima/cas.ecore"> |
| <cas:Sofa xmi:id="1" sofaNum="1" |
| text="the quick brown fox jumps over the lazy dog."/> |
| </xmi:XMI></pre> |
| |
| <p>Each Sofa defines a separate View. Feature Structures in the CAS can be members of |
| one or more views. (A Feature Structure that is a member of a view is indexed in its |
| IndexRepository, but that is an implementation detail.)</p> |
| |
| <p>In the XMI serialization, views are represented as first-class objects. Each |
| View has an (optional) <span class="quote">“<span class="quote">sofa</span>”</span> feature, which references a sofa, and |
| a multi-valued reference to the members of the View. For example:</p> |
| |
| |
| <pre class="programlisting"><cas:View sofa="1" members="3 7 21 39 61"/></pre> |
| |
| <p>Here the integers 3, 7, 21, 39, and 61 refer to the xmi:id fields of the objects that |
| are members of this view.</p> |
| </div> |
| |
| <div class="section" title="7.7. Linking an XMI Document to its Ecore Type System"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.linking_to_ecore_type_system">7.7. Linking an XMI Document to its Ecore Type System</h2></div></div></div> |
| |
| |
| |
| <p>If the CAS Type System has been saved to an Ecore file (as described in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.xmi_emf" class="olink">Chapter 8, <i>XMI and EMF Interoperability</i></a>), it is possible to store a |
| link from an XMI document to that Ecore type system. This is done using an xsi:schemaLocation attribute |
| on the root XMI element.</p> |
| |
| <p>The xsi:schemaLocation attribute is a space-separated list that represents a |
| mapping from namespace URI (e.g. http:///org/myproj.ecore) to the physical URI of the |
| .ecore file containing the type system for that namespace. For example: |
| |
| |
| </p><pre class="programlisting">xsi:schemaLocation= |
| "http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"</pre><p> |
| would indicate that the definition for the org.myproj CAS types is contained in the file |
| <code class="literal">c:/typesystems/myproj.ecore</code>. You can specify a different |
| mapping for each of your CAS namespaces, using a space separated list. For details see |
| Budinsky et al. <span class="emphasis"><em>Eclipse Modeling Framework</em></span>.</p> |
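| <p>Because the attribute value is a flat list of alternating namespace URIs and locations, reading it back amounts to pairing up the tokens. The sketch below shows this with plain Java; the class <code class="literal">SchemaLocationSketch</code> is a hypothetical name, and the second namespace in the example (<code class="literal">org.other</code>) is invented for illustration.</p> |

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the UIMA API. */
final class SchemaLocationSketch {

  /** Splits a schemaLocation value into namespace URI -> physical URI pairs. */
  static Map<String, String> parse(String schemaLocation) {
    String[] tokens = schemaLocation.trim().split("\\s+");
    if (tokens.length % 2 != 0) {
      throw new IllegalArgumentException("expected namespace/location pairs");
    }
    Map<String, String> mapping = new LinkedHashMap<>();
    for (int i = 0; i < tokens.length; i += 2) {
      mapping.put(tokens[i], tokens[i + 1]);
    }
    return mapping;
  }

  public static void main(String[] args) {
    String value = "http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore "
        + "http:///org/other.ecore file:/c:/typesystems/other.ecore";
    parse(value).forEach((ns, location) -> System.out.println(ns + " -> " + location));
  }
}
```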
| </div> |
| |
| <div class="section" title="7.8. Delta CAS XMI Format"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.xmi.delta">7.8. Delta CAS XMI Format</h2></div></div></div> |
| |
| |
| <p> |
| The Delta CAS XMI serialization format is designed primarily to reduce the serialization overhead when calling annotators |
| configured as services. Only Feature Structures and Views that are new or modified by the service |
| are serialized and returned by the service. |
| </p> |
| <p> |
| The classes <code class="literal">org.apache.uima.cas.impl.XmiCasSerializer</code> and |
| <code class="literal">org.apache.uima.cas.impl.XmiCasDeserializer</code> support serialization of only the modifications to the CAS. |
| A caller is expected to set a marker to indicate the point from which changes to the CAS are to be tracked. |
| </p> |
| <p> |
| A Delta CAS XMI document contains only the Feature Structures and Views that have been added or modified. |
|      The new and modified Feature Structures are represented in exactly the same format as in a complete CAS serialization.
| The <code class="literal"> cas:View </code> element has been extended with three additional attributes to represent modifications to |
| View membership. These new attributes are <code class="literal">added_members</code>, <code class="literal">deleted_members</code> and |
| <code class="literal">reindexed_members</code>. For example: |
| </p> |
|     <pre class="programlisting"><cas:View sofa="1" added_members="63 77" 
|             deleted_members="7 61" reindexed_members="39" /></pre>
| <p> |
|     Here the integers 63 and 77 are the xmi:id values of objects that have been newly added as members of this View,
|     7 and 61 are the xmi:id values of objects that have been removed from this View, and 39 is the xmi:id of an object to be reindexed in this View.
| </p> |
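|     <p>
|     A receiver could apply these membership attributes along the following lines. This Java sketch is only an
|     illustration under stated assumptions: <code class="literal">ViewDelta</code> and its methods are hypothetical names,
|     not UIMA classes, and the membership of a View is modeled as a plain set of xmi:id values. Reindexed members
|     remain members, so the sketch leaves them in the set.
|     </p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a UIMA class.
final class ViewDelta {

    /** Parses a space-separated list of xmi:id values such as "63 77". */
    static List<Integer> parseIds(String attributeValue) {
        List<Integer> ids = new ArrayList<>();
        if (attributeValue == null || attributeValue.trim().isEmpty()) {
            return ids;
        }
        for (String token : attributeValue.trim().split("\\s+")) {
            ids.add(Integer.parseInt(token));
        }
        return ids;
    }

    /** Returns the View membership after applying deleted_members and added_members. */
    static Set<Integer> apply(Set<Integer> members, String addedMembers, String deletedMembers) {
        Set<Integer> result = new LinkedHashSet<>(members);
        result.removeAll(parseIds(deletedMembers));
        result.addAll(parseIds(addedMembers));
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> before = new LinkedHashSet<>(parseIds("7 39 61"));
        System.out.println(apply(before, "63 77", "7 61"));
    }
}
```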
| </div> |
| <div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e2511" href="#d5e2511" class="para">7</a>] </sup> For details on XMI see Grose et al. <span class="emphasis"><em>Mastering |
| XMI. Java Programming with XMI, XML, and UML. </em></span>John Wiley & Sons, Inc. |
| 2002.</p></div><div class="footnote"><p><sup>[<a id="ftn.d5e2521" href="#d5e2521" class="para">8</a>] </sup>http://www.w3.org/TR/xml-names11/</p> |
| </div><div class="footnote"><p><sup>[<a id="ftn.d5e2529" href="#d5e2529" class="para">9</a>] </sup> For details on EMF and Ecore see Budinsky et |
| al. <span class="emphasis"><em>Eclipse Modeling Framework 2.0</em></span>. Addison-Wesley. |
| 2006.</p></div></div></div> |
| <div class="chapter" title="Chapter 8. Compressed Binary CASes" id="ugr.ref.compress"><div class="titlepage"><div><div><h2 class="title">Chapter 8. Compressed Binary CASes</h2></div></div></div> |
| |
| |
| <div class="section" title="8.1. Binary CAS Compression overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.overview">8.1. Binary CAS Compression overview</h2></div></div></div> |
| |
| |
| <p>UIMA has a proprietary binary serialization format, used internally |
| for several things, including communicating with embedded C++ annotators using |
| UIMA-CPP. This binary format is also selectable for use with UIMA-AS. Its use |
| requires that the source and target systems implement the identical type system |
| (because the type system is not sent, and internal coding is used within the |
| format that is keyed to the particular type system).</p> |
| |
| <p>Starting with version 2.4.1, two additional forms of binary serialization are added. |
| Both compress the data being serialized; typical size ratios can approach 50 : 1, |
| depending on the exact contents of the CAS, when compared with normal binary serialization. |
| </p> |
| |
|     <p>The two forms are called 4 and 6, for historical/internal reasons.  The serialized format
|       of each is fixed, but not currently standardized, and the form being used is encoded in the header so 
| that the appropriate deserializer can be chosen. Both forms include support for Delta CAS |
| being returned from a service.</p> |
| |
| <p>Form 6 builds on form 4, and adds: serializing only those feature structures which |
| are reachable (that is, in some index, or referenced by other reachable feature structures), |
| and type filtering.</p> |
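|     <p>The notion of reachability used here can be sketched as a graph traversal that starts at the indexed
|       feature structures and follows references. The Java below is an illustration only, not the UIMA implementation:
|       <code class="literal">Reachability</code> is a hypothetical class, feature structures are identified by integer ids,
|       and references are supplied as a map from an id to the ids it refers to.</p>

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a UIMA class.
final class Reachability {

    /**
     * @param indexed    ids of feature structures that are in some index
     * @param references for each id, the ids of the feature structures it refers to
     * @return the ids that are indexed or referenced, directly or indirectly, from an indexed id
     */
    static Set<Integer> reachable(Collection<Integer> indexed,
                                  Map<Integer, List<Integer>> references) {
        Set<Integer> seen = new LinkedHashSet<>(indexed);
        Deque<Integer> pending = new ArrayDeque<>(indexed);
        while (!pending.isEmpty()) {
            Integer id = pending.removeFirst();
            for (Integer target : references.getOrDefault(id, Collections.emptyList())) {
                // Set.add returns false for ids already visited, which also stops on reference loops.
                if (seen.add(target)) {
                    pending.addLast(target);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> refs = new java.util.HashMap<>();
        refs.put(1, java.util.Arrays.asList(2));
        refs.put(4, java.util.Arrays.asList(5));
        System.out.println(reachable(java.util.Arrays.asList(1), refs));
    }
}
```

|     <p>In this model, a feature structure that is neither indexed nor referenced from a reachable one never enters
|       the result, which corresponds to it being left out of a form 6 serialization.</p>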
| |
| <p>Type filtering takes a source type system and a target type system, and for serializing |
| (source to target), sends the binary representation of reachable feature structures in the target's type system. |
| For deserializing (reading a target into a source), the filtering takes the specification being read |
| as being encoded using the target's type system, and translates that into the source's type system. |
| In this process, types which exist in the source but not the target are skipped (when serializing); |
| types which exist in the target, but not the source are skipped when deserializing. |
| |
| Features that exist in some |
| source type but not in the version of the same type in the target are skipped (when serializing) |
| or set to default values (i.e., 0 or null) when being deserialized.</p> |
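|     <p>The skipping rules above can be sketched in Java as follows. This is not the UIMA implementation:
|       <code class="literal">TypeFilter</code>, <code class="literal">toTarget</code> and <code class="literal">toSource</code>
|       are hypothetical names, a type system is modeled as a map from type name to its features with their default
|       values (0 or null), and a feature structure is modeled as a map from feature name to value.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of type filtering, not a UIMA class.
final class TypeFilter {

    /** Serializing: returns null if the target lacks the type; otherwise keeps only features the target has. */
    static Map<String, Object> toTarget(String type, Map<String, Object> values,
                                        Map<String, Map<String, Object>> targetTypes) {
        Map<String, Object> targetFeatures = targetTypes.get(type);
        if (targetFeatures == null) {
            return null; // type exists only in the source: skipped
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (String feature : targetFeatures.keySet()) {
            if (values.containsKey(feature)) {
                out.put(feature, values.get(feature));
            }
        }
        return out;
    }

    /** Deserializing: returns null if the source lacks the type; source-only features get their defaults. */
    static Map<String, Object> toSource(String type, Map<String, Object> received,
                                        Map<String, Map<String, Object>> sourceTypes) {
        Map<String, Object> sourceFeatures = sourceTypes.get(type);
        if (sourceFeatures == null) {
            return null; // type exists only in the target: skipped
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : sourceFeatures.entrySet()) {
            String feature = entry.getKey();
            out.put(feature, received.containsKey(feature) ? received.get(feature) : entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> target = new LinkedHashMap<>();
        Map<String, Object> tokenFeatures = new LinkedHashMap<>();
        tokenFeatures.put("begin", 0);
        tokenFeatures.put("end", 0);
        target.put("Token", tokenFeatures);
        Map<String, Object> fs = new LinkedHashMap<>();
        fs.put("begin", 3);
        fs.put("end", 7);
        fs.put("pos", "NN");
        System.out.println(toTarget("Token", fs, target));
    }
}
```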
| |
| <p>There are two main use cases for using compressed forms. The first one is for communicating with |
| UIMA-AS remote services (not yet implemented). |
| |
| </p> |
| |
| <p>The second use case is for saving compressed representations of CASes to other media, such as disk files, |
| where they can be deserialized later for use in other UIMA applications.</p> |
| |
| </div> |
| |
| |
| <div class="section" title="8.2. Using Compressed Binary CASes"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.usage">8.2. Using Compressed Binary CASes</h2></div></div></div> |
| |
| |
| <p>The main user interface for serializing a CAS using compression is to use one of the |
| static methods named serializeWithCompression in Serialization. If you pass a Type System argument representing |
| a target type system, then form 6 compression is used; otherwise form 4 is used. |
|       To get the form 6 benefit of serializing only reachable Feature Structure instances, but without type mapping,
|       pass null as the type system argument.
| </p> |
| |
|     <p>To deserialize into a CAS without type mapping, use one of the deserialize methods in Serialization.
|       There are multiple forms of this method, depending on the arguments.  The forms which take extra arguments,
|       including a ReuseInfo, may only be used with serialized forms created with form 6 compression.
|       The plain form of deserialize works with all forms of binary serialization, compressed and non-compressed, by examining a common
|       header which identifies the form of binary serialization used; however, form 6 requires
|       additional arguments, so for it the plain form fails and you need to use one of the other deserialize forms.</p>
| |
| <p>Form 6 has an additional object, ReuseInfo, which holds information which |
| is required for subsequent Delta CAS format serializations / deserializations. |
| It can speed up subsequent serializations of the same |
| CAS (before it is further updated), for instance, if an application is sending the CAS to multiple services in parallel. |
| The serializeWithCompression method returns this object when form 6 is being used. |
| |
| </p> |
| <p>In addition, the CasIOUtils class offers static load and save methods, which can be used with the SerialFormat |
| enum to serialize and deserialize to URLs or streams; see the Javadocs for details.</p> |
| </div> |
| |
| <div class="section" title="8.3. Simple Delta CAS serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.simple-deltas">8.3. Simple Delta CAS serialization</h2></div></div></div> |
| |
|   <p>Use form 4 for this. Form 6 also supports delta CAS, but it requires more bookkeeping:
|     when the receiver deserializes a CAS which it will later delta serialize
|     back to the sender,
|     it must save an instance of the ReuseInfo and use that
|     same instance for the delta serialization; furthermore, the sender must
|     also save an instance of the ReuseInfo from the original serialization
|     and use it when deserializing the delta CAS.
| </p> |
| |
|   <p>Form 4 may not be as efficient as form 6 in that it does not filter the CASes
|   either by type system or by only sending reachable Feature Structure
| instances. But, it doesn't require a ReuseInfo object when doing delta serialization or |
| deserialization, |
| so it may be more convenient to use when saving |
| delta CASes to files (as opposed to the other use case of |
| a remote service returning delta CASes to a remote client).</p> |
| </div> |
| |
| <div class="section" title="8.4. Use Case cookbook"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.compress.use-cases">8.4. Use Case cookbook</h2></div></div></div> |
| |
| <p> |
| Here are some use cases, together with a suggested approach and example of how to use the APIs. |
| </p> |
| |
| <p> |
| <span class="strong"><strong>Save a CAS to an output stream, using form 4 (no type system filtering):</strong></span> |
| </p> |
| <pre class="programlisting">// set up an output stream. In this example, an internal byte array. |
| ByteArrayOutputStream baos = new ByteArrayOutputStream(OUT_BFR_INIT_SZ); |
| Serialization.serializeWithCompression(casSrc, baos); |
| // or |
|   CasIOUtils.save(casSrc, baos, SerialFormat.COMPRESSED);
| </pre> |
| |
| <p><span class="strong"><strong>Deserialize from a stream into an existing CAS:</strong></span></p> |
| <pre class="programlisting">// assume the stream is a byte array input stream |
| // For example, one could be created |
| // from the above ByteArrayOutputStream as follows: |
| ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray()); |
| // Deserialize into a cas having the identical type system |
| Serialization.deserializeCAS(cas, bais); |
| // or |
|   CasIOUtils.load(bais, cas);
| </pre> |
| |
| <p>Note that the <code class="code">deserializeCAS(cas, inputStream)</code> method is a general way to |
| deserialize into a CAS from an inputStream for all forms of binary serialized data |
| (with exceptions as noted above). |
| The method reads a common header, and based on what it finds, selects the appropriate |
| deserialization routine.</p> |
| |
|   <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>The <code class="code">deserializeCAS</code> method with just 2 arguments doesn't support type filtering, or
|     delta CAS deserialization for form 6.  To do those, see the example below.
| </p> |
| </div> |
| |
| <p><span class="strong"><strong>Serialize to an output stream, filtering out some types and/or features:</strong></span> |
| </p> |
| <p> |
| To do this, an additional input specifying the Type System of the target must |
| be supplied; this Type System should be a subset of the source CAS's. |
| The <code class="code">out</code> parameter may be an OutputStream, a DataOutputStream, or a File. |
| </p> |
| |
| <pre class="programlisting">// set up an output stream.  In this example, an internal byte array.
|   ByteArrayOutputStream baos = new ByteArrayOutputStream(OUT_BFR_INIT_SZ);
|   // baos serves as the out parameter
|   Serialization.serializeWithCompression(cas, baos, tgtTypeSystem);
| </pre> |
| |
| <p><span class="strong"><strong>Deserialize with type filtering:</strong></span></p> |
| <p>There are 2 type systems involved here: one is that of the receiving CAS, and the other is the type system
| used to decode the serialized form. This may optionally be stored with the serialized form:</p> |
| <pre class="programlisting">CasIOUtils.save(cas, out, SerialFormat.COMPRESSED_FILTERED_TS); |
| </pre> |
| <p>and/or it can be supplied at load time. Here are two examples of supplying this at load time:</p>
| <pre class="programlisting">CasIOUtils.load(input, cas, typeSystem); |
| CasIOUtils.load(input, type_system_serialized_form_input, cas); |
| </pre> |
| |
| <p>For the four-argument form of <code class="code">deserializeCAS</code>, the reuseInfo must be null unless
|   deserializing a delta CAS, in which case it must be the reuse info captured when 
|   the original CAS was serialized out. 
|   If the target type system is identical to the one in the CAS, you may pass null for it.
| </p> |
| <pre class="programlisting">ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray()); |
| Serialization.deserializeCAS(cas, bais, tgtTypeSystem, reuseInfo); |
| </pre> |
| </div> |
| |
| |
| </div> |
| <div class="chapter" title="Chapter 9. JSON Serialization of CASs and UIMA Description objects" id="ugr.ref.json"><div class="titlepage"><div><div><h2 class="title">Chapter 9. JSON Serialization of CASs and UIMA Description objects</h2></div></div></div> |
| |
| |
| |
| <div class="section" title="9.1. JSON serialization support overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.overview">9.1. JSON serialization support overview</h2></div></div></div> |
| |
| |
| <p>Applications are moving to the "cloud", and new applications are being rapidly developed that are hooking |
| things up using various mashup techniques. New standards and conventions are emerging to support this kind |
| of application development, such as REST services. |
| JSON is now a popular way for services to communicate; |
|     its popularity is rising (in 2014) while that of XML is falling.</p>
| |
| <p>Starting with version 2.7.0, JSON style serialization (but not (yet) deserialization) |
| for CASs and UIMA descriptions is supported. |
| The exact format of the serialization is configurable in several aspects. |
| The implementation is built on top of the Jackson JSON generation library. |
| </p> |
| |
| <p>The next section discusses serialization for CASes, while a later section describes serialization |
| of description objects, such as type system descriptions.</p> |
| </div> |
| |
| <div class="section" title="9.2. JSON CAS Serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas">9.2. JSON CAS Serialization</h2></div></div></div> |
| |
| |
| <p>CASs primarily consist of collections of Feature Structures (FSs). Similar to XMI serialization, JSON |
| serialization skips serializing unreachable FSs, outputting only those FSs that are found in the indexes (these are called |
| <span class="emphasis"><em>roots</em></span>), plus all of |
| the FSs that are referenced via some chain of references, from the roots. |
| </p> |
| |
| <p>To support the kinds of things users do with FSs, |
| the serialized form may be augmented to include additional information beyond the FSs.</p> |
| <p>For traditional UIMA implementations, the serialized formats mostly assumed that the receivers had access to |
| a type system description, which specified details of the types of each feature value. For JSON serialization, |
|     some of this information can be included directly in the serialization.</p>
| |
| <p>This abbreviated type system information is one kind of additional information that can be included; |
| here's a summary list of the various kinds of additional information you can add to the serialization:</p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>having a way to identify which fields in a FS should be treated as references to other FSs, or |
| as representing serialized binary data from UIMA byte arrays.</p> |
| </li><li class="listitem"> |
| <p>something like XML namespaces to allow the use of short type names in the serialization while handling name |
| collisions</p> |
| </li><li class="listitem"> |
| <p>enough of the UIMA type hierarchy to allow the common operation of iterating over a type together |
| with all of its subtypes</p> |
| </li><li class="listitem"><p>A way to identify which FSs were "added-to-the-indexes" (separately, per CAS View) |
| and therefore serve as roots when |
| iterating over types.</p> |
| </li><li class="listitem"><p>An identification of the associated type system definition</p></li></ul></div> |
| |
| <p>Simple JSON serialization does not have a convention for supporting these, but many extensions do. |
| We borrow some of the concepts in the JSON-LD (linked data) standard in providing this |
| additional information.</p> |
| |
| <div class="section" title="9.2.1. The Big Picture"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.bigpic">9.2.1. The Big Picture</h3></div></div></div> |
| |
| |
| <p>CAS JSON serialization consists of several parts: an optional _context, the set of Feature Structures, |
| and (if doing a delta serialization) information about changes to what was indexed.</p> |
| |
| <div class="figure"><a name="ug.ref.json.fig.bigpic"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="347"><tr><td><img src="images/references/ref.json/big_picture2.png" width="347" alt="The big picture showing the parts of serialization, with the _context optional."></td></tr></table></div> |
| </div><p class="title"><b>Figure 9.1. The major sections of JSON serialization</b></p></div><br class="figure-break"> |
| |
| <p>The serializer can be configured to omit |
| the _context or parts of the _context for cases where that information isn't needed. The index changes |
| information is only included if Delta CAS serialization is specified. Note that Delta CAS support |
| is incomplete; so this information is just for planning purposes.</p> |
| </div> |
| |
| <div class="section" title="9.2.2. The _context section"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.context">9.2.2. The _context section</h3></div></div></div> |
| |
| <p>The _context section has entries for each used type as well as some special additional entries. |
| Each entry for a type has multiple sub-entries, identified |
| by a key-name. Each sub-entry can be selectively omitted if not needed. |
| |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>_type_system</strong></span> - a URI of the type system information</p></li><li class="listitem"><p><span class="bold"><strong>_types</strong></span> - information about each used type |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"><p><span class="bold"><strong>_id</strong></span> - the type's fully qualified UIMA type name</p></li><li class="listitem"><p><span class="bold"><strong>_feature_types</strong></span> - a map from features of this type to |
| information about the type of the value of the feature</p></li><li class="listitem"><p><span class="bold"><strong>_subtypes</strong></span> - an array of used subtype short-names</p></li></ul></div><p> |
| </p></li></ul></div><p> |
| </p> |
| |
| |
| <p>Here's an example:</p> |
| <div class="informalexample"> |
| |
| <pre class="programlisting">"_context" : {
|     "_type_system" : "URI to the type system information",
|     "_types" : {
|         "A_Typical_User_or_built_in_Type" : {
|           "_id" : "org.apache.uima.test.A_Typical_User_or_built_in_Type", 
|           "_feature_types" : {
|               "sofa" : "_ref", 
|               "aFS" : "_ref", 
|               "an_array" : "_array", 
|               "a_byte_array" : "_byte_array"},
| "_subtypes" : [ "subtype1", "subtype2", ... ] }, |
| "Sofa" : { |
| "_id" : "uima.cas.Sofa", |
| "_feature_types" : {"sofaArray" : "_ref"} } |
| } |
| }</pre></div> |
| |
| <p>The <span class="bold"><strong>_type_system</strong></span> is an optional URI that references a UIMA type system description that |
| defines the types for the CAS being serialized.</p> |
| |
| <p>In the <span class="bold"><strong>_types</strong></span> section, the key (e.g. "Sofa" or "A_Typical_User_or_built_in_Type") is the "short" name |
| for the type used in the serialization. |
|     It is either just 
|     the last segment of the full type name (e.g. for the type x.y.z.TypeName, it's TypeName), or,
|     if that name would collide with another type name
|     (example: some.package.cname.Foo and some.other.package.cname.Foo), the key is made up of 
|     the next-to-last segment, with an optional suffixed incrementing integer in case of collisions on that name, 
|     a colon (:), and then the last segment.</p>
| |
| <div class="blockquote"><blockquote class="blockquote"><p>In this example, since the next to last segment of both names is |
| "cname", one namespace name would be "cname", and the other would be "cname1". The keys in this case would be |
| cname:Foo and cname1:Foo.</p></blockquote></div> |
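|     <p>The naming rule can be sketched in Java as shown below. This is an illustration of the rule as described here,
|       not the code UIMA uses: <code class="code">ShortTypeNames</code> and <code class="code">assign</code> are hypothetical
|       names, and the order in which colliding namespaces receive their integer suffixes (here, the order of the input list)
|       is an assumption.</p>

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a UIMA class.
final class ShortTypeNames {

    /** Maps each fully qualified type name to the key used for it in the _types section. */
    static Map<String, String> assign(List<String> fullNames) {
        // Count how often each last segment occurs, to detect collisions.
        Map<String, Integer> lastSegmentCounts = new HashMap<>();
        for (String full : fullNames) {
            lastSegmentCounts.merge(lastSegment(full), 1, Integer::sum);
        }
        Map<String, String> namespaceByPackage = new HashMap<>();
        Map<String, Integer> usesOfBaseName = new HashMap<>();
        Map<String, String> keys = new LinkedHashMap<>();
        for (String full : fullNames) {
            String last = lastSegment(full);
            if (lastSegmentCounts.get(last) == 1) {
                keys.put(full, last); // no collision: the last segment alone is the key
                continue;
            }
            String pkg = full.substring(0, Math.max(0, full.lastIndexOf('.')));
            String namespace = namespaceByPackage.get(pkg);
            if (namespace == null) {
                // Namespace is the next-to-last segment; later packages with the same one get 1, 2, ...
                String base = lastSegment(pkg);
                int alreadyUsed = usesOfBaseName.merge(base, 1, Integer::sum) - 1;
                namespace = alreadyUsed == 0 ? base : base + alreadyUsed;
                namespaceByPackage.put(pkg, namespace);
            }
            keys.put(full, namespace + ":" + last);
        }
        return keys;
    }

    private static String lastSegment(String dotted) {
        return dotted.substring(dotted.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(assign(java.util.Arrays.asList(
            "x.y.z.TypeName", "some.package.cname.Foo", "some.other.package.cname.Foo")));
    }
}
```

|     <p>With the three type names used in the text, the sketch yields the keys TypeName, cname:Foo and cname1:Foo.</p>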
| |
| <p>The value of the _id is the fully qualified name of the type.</p> |
| |
| <p>The <span class="bold"><strong>_feature_types</strong></span> values of _ref, _array, and _byte_array indicate the corresponding values |
| of the named features need special handling |
|       when deserialized.
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>_ref</strong></span> - used when features are deserialized as numbers, but they are to be |
| interpreted as references to other FSs whose <code class="code">id</code> is the number. UIMA lists and arrays of |
| FSs are marked with _ref; if the value is a JSON array, the elements of the array will be either |
| numbers (to be interpreted as references), or embedded serializations of FSs.</p></li><li class="listitem"><p><span class="bold"><strong>_array</strong></span> - used when features are serialized as JSON |
| arrays containing embedded values, |
| unless the corresponding UIMA object has |
| multiple references, in which case it is serialized as a FS reference which looks like a single number. |
| If a feature is marked with _array, then a non-array, single number should be interpreted as the |
| <code class="code">id</code> of the feature structure that is the array or the first element of the list of items. |
| This designation is used for both UIMA arrays and lists.</p> |
| |
| <p>This designation is for arrays and lists of primitive values, except for byte arrays. |
| In the case of FS arrays and lists, the _ref designation is used instead of this to indicate that the |
|           resulting values in a JSON array that look like numbers should be interpreted as references.</p></li><li class="listitem"><p><span class="bold"><strong>_byte_array</strong></span> - _byte_array features are serialized as numbers (if they are a 
|       reference to a separate object), or as strings (if embedded).  The strings are to be decoded into 
| binary byte arrays using the Base64 encoding (the standard one used by Jackson to serialize binary data).</p></li></ul></div><p> |
| </p> |
| |
| <p> |
| Note that single element arrays are <span class="emphasis"><em>not</em></span> unwrapped, as in some other JSON serializations, to enable distinguishing |
| references to arrays from embedded arrays. |
| </p> |
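|     <p>
|     A consumer of the JSON might use the designations roughly as in the following Java sketch. It is an illustration
|     under stated assumptions, not a UIMA deserializer: <code class="code">FeatureValueDecoder</code> and its methods are
|     hypothetical, JSON numbers and strings are assumed to have already been parsed into
|     <code class="code">Number</code> and <code class="code">String</code> objects, and embedded byte arrays are assumed to
|     use the standard Base64 alphabet with padding.
|     </p>

```java
import java.util.Base64;

// Hypothetical helper, not a UIMA class.
final class FeatureValueDecoder {

    /** Decodes an embedded _byte_array value (a JSON string) into bytes. */
    static byte[] decodeByteArray(String jsonString) {
        return Base64.getDecoder().decode(jsonString);
    }

    /**
     * Describes how a parsed JSON value should be read.
     *
     * @param designation the _feature_types entry for the feature, or null if there is none
     * @param jsonValue   the parsed value (a Number, a String, ...)
     */
    static String interpret(String designation, Object jsonValue) {
        boolean isNumber = jsonValue instanceof Number;
        if ("_ref".equals(designation) && isNumber) {
            return "reference to FS " + jsonValue;
        }
        if ("_array".equals(designation) && isNumber) {
            return "reference to array or list FS " + jsonValue;
        }
        if ("_byte_array".equals(designation)) {
            return isNumber
                ? "reference to byte array FS " + jsonValue
                : "embedded byte array of length " + decodeByteArray((String) jsonValue).length;
        }
        return "plain value " + jsonValue;
    }

    public static void main(String[] args) {
        System.out.println(interpret("_ref", 123));
        System.out.println(interpret(null, 123));
        System.out.println(interpret("_byte_array", "AQID"));
    }
}
```

|     <p>
|     The same number, 123 in the sketch, is read as a reference when the feature is marked _ref and as an ordinary
|     number when the feature has no designation, which is why the _feature_types information matters to a receiver.
|     </p>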
| |
| <p><span class="bold"><strong>_subtypes</strong></span> are a list of the type's used subtypes. A type is <span class="emphasis"><em>used</em></span> |
| if it is the type of a Feature Structure |
| being serialized, |
| or if it is in the supertype chain of some Feature Structure which is serialized. If a type has no |
| used subtypes, this element is omitted. |
| The names are represented as the "short" name. Users typically use this information |
| to construct support for iterators over a type which includes all of its subtypes.</p> |
| |
| |
| |
| |
| |
| <div class="section" title="9.2.2.1. Omitting parts of the _context section"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.context.omit">9.2.2.1. Omitting parts of the _context section</h4></div></div></div> |
| |
| <p>It is possible to selectively omit some of the |
| _context sections (or the entire _context), via configuration. |
| Here's an example:</p> |
| |
| <div class="informalexample"> |
| |
| <pre class="programlisting">// make a new instance to hold the serialization configuration |
| JsonCasSerializer jcs = new JsonCasSerializer(); |
| // Omit the expanded type names information |
| jcs.setJsonContext(JsonContextFormat.omitExpandedTypeNames);</pre></div> |
| |
| <p>See the Javadocs for <code class="code">JsonContextFormat</code> for how to specify the parts.</p> |
| </div> |
| |
| </div> |
| |
| <div class="section" title="9.2.3. Serializing Feature Structures"><div class="titlepage"><div><div><h3 class="title" id="ug.ref.json.cas.featurestructures">9.2.3. Serializing Feature Structures</h3></div></div></div> |
| |
| |
| <p>Feature Structures themselves are represented as JSON objects consisting of field - value pairs, where the |
| fields correspond to UIMA Features, and the values are the values of the features. |
| </p> |
| |
| <p>The various kinds of values for a UIMA feature are represented by their natural JSON counterpart. |
| UIMA primitive boolean values are represented by JSON true and false literals. UIMA Strings are |
| represented as JSON strings. Numbers are represented by JSON numbers. |
| Byte Arrays are represented by the Jackson standard binary encoding (base64 encoding), written as JSON strings. |
| References to other Feature Structures are also represented as JSON integer numbers, the values of which are |
| interpreted as ids of the referred-to |
| FSs. These ids are treated in the same manner as the xmi:ids of XMI Serialization. Arrays and Lists when |
| embedded (see following section) are represented as JSON arrays using the [] notation.</p> |
| |
| <p>Besides the feature values defined for a Feature Structure, an additional special feature |
| may be serialized: _type. |
| The _type is the type name, written using the short format. This is automatically included when the type cannot |
| easily be |
| inferred from other contextual information. |
| </p> |
| |
| <p>Here's an example, with some comments which, since JSON doesn't support comments, are just here for explanation:</p> |
| <div class="informalexample"> |
| <pre class="programlisting">{ "_type" : "Annotation", // _type may be omitted |
| "feat1" : true, // boolean value represented as true or false |
| "feat2" : 123, // could be a number or a reference to FS with id 123 |
| "feat3" : "b3axgh"//could be a string or a base64 encoded byte array |
| }</pre></div> |
| |
| |
| <div class="section" title="9.2.3.1. Embedding normally referenced values"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.embedding">9.2.3.1. Embedding normally referenced values</h4></div></div></div> |
| |
| |
| <p>Consider a FS which has a feature that refers to another FS. This can be serialized in one of two ways:</p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>the value of the feature can be coded as an <code class="code">id</code> (a number), where the number is the <code class="code">id</code> of the |
| referred-to FS.</p></li><li class="listitem"><p>The value of the feature can be coded as the serialization of the referred-to FS.</p></li></ul></div> |
| |
| <p> |
| This second way of encoding is often done by JSON style serializations, and is called "embedding". Referred-to |
| FSs may be embedded if there are no other references to the embedded FS. Multiple references may arise due to |
| having a FS referenced as a "root" in some CAS View, or being used as a value in a FS feature.</p> |
| |
| <p>Following the XMI conventions, UIMA arrays and lists which are |
| identified as singly referenced by either the static or dynamic method (see below) are embedded |
| directly as the value of a feature. In this case, the JSON serialization writes out the value of the feature |
| as a JSON array. Otherwise, the value is written out as a FS reference, and a separate serialization occurs of |
| the list elements or the array.</p> |
| |
|     <p>In addition to arrays and lists, FSs which are identified as singly referenced from another FS are 
| serialized as the embedded value of the referring feature. |
| This is also done (when using the dynamic method) for singly referenced rooted instances. |
| </p> |
| <p> |
| If a FS is multiply referenced, the serialization in these |
| cases is just the numeric value of the <code class="code">id</code> of the FS.</p> |
| </div> |
| |
| <div class="section" title="9.2.3.2. Dynamic vs Static multiple-references and embedding"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.dynamicstatic">9.2.3.2. Dynamic vs Static multiple-references and embedding</h4></div></div></div> |
| |
| |
| <p>There are two methods of determining if a particular FS or list or array can be embedded. |
| |
|       </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>dynamic</strong></span> - calculates at serialization time whether or not there
| are multiple references to a given FS.</p></li><li class="listitem"><p><span class="bold"><strong>static</strong></span> - looks in the type system definition to see if |
| the feature is marked with <multipleReferencesAllowed>. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle" compact><li class="listitem"><p><code class="code">multipleReferencesAllowed</code> false <span class="symbol">→</span> use the embedded style</p></li><li class="listitem"><p><code class="code">multipleReferencesAllowed</code> true <span class="symbol">→</span> use separate objects</p></li></ul></div><p> |
| Note that since this flag is not available for |
| references to FSs from View indexes, any FS that is indexed in any view is considered (if using static mode) |
| to be multipleReferencesAllowed. |
| </p></li></ul></div><p> |
| </p> |
| |
| <p>Delta serialization only supports the static method; this mode is forced on if delta serialization |
| is specified.</p> |
| |
| <p>Dynamic embedding is enabled by default for JSON, but may be disabled via configuration.</p> |
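|       <p>The dynamic method amounts to counting references at serialization time. The Java sketch below illustrates
|       the idea and is not the UIMA implementation: <code class="code">EmbeddingDecision</code> and its methods are
|       hypothetical, feature structures are identified by integer ids, and the list of indexed ids is assumed to contain
|       one entry for each View in which a feature structure is indexed.</p>

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a UIMA class.
final class EmbeddingDecision {

    /** Counts, for every id, how many View indexes and feature values refer to it. */
    static Map<Integer, Integer> countReferences(Collection<Integer> indexedIds,
                                                 Map<Integer, List<Integer>> references) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Integer id : indexedIds) {
            counts.merge(id, 1, Integer::sum); // being a root in a View counts as one reference
        }
        for (List<Integer> targets : references.values()) {
            for (Integer target : targets) {
                counts.merge(target, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** A feature structure can be embedded only if exactly one reference to it exists. */
    static boolean canEmbed(int id, Map<Integer, Integer> counts) {
        return counts.getOrDefault(id, 0) == 1;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> refs = new HashMap<>();
        refs.put(1, java.util.Arrays.asList(2, 3));
        refs.put(2, java.util.Arrays.asList(3));
        Map<Integer, Integer> counts = countReferences(java.util.Arrays.asList(1), refs);
        System.out.println("FS 2 embeddable: " + canEmbed(2, counts));
        System.out.println("FS 3 embeddable: " + canEmbed(3, counts));
    }
}
```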
| </div> |
| |
| <div class="section" title="9.2.3.3. Embedded Arrays and Lists"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.embeddedArraysLists">9.2.3.3. Embedded Arrays and Lists</h4></div></div></div> |
| |
| |
| <p>When static embedding is being used, a case can arise where some feature is marked to have only |
| singly referenced FS values, but that value may actually be multiply referenced. This is detected during |
|     serialization, and a message is issued if an error handler has been specified to the serializer.
| The serialization continues, however. In the case of an Array, the value of the array is embedded |
| in the serialization and the fact that these were referring to the same object is lost. |
| In the case of a list, if any element in the list |
| has multiple references (for example, if the list has back-references, loops, etc.), |
| the serialization of the list is truncated at the point where the multiple reference |
| occurs.</p> |
| |
| <div class="blockquote"><blockquote class="blockquote"><p>Note that you can correctly serialize arbitrarily linked complex list structures created |
| using the built-in list types only if you use dynamic embedding, or |
| specify <code class="code">multipleReferencesAllowed</code> = true.</p></blockquote></div> |
| |
| |
| <p>Embedded list or array values are both serialized using the JSON array notation; as a result, these |
|     alternative representations are not distinguished in the JSON serialization.</p>
| </div> |
| |
| <div class="section" title="9.2.3.4. Omitting null values"><div class="titlepage"><div><div><h4 class="title" id="ug.ref.json.cas.featurestructures.null">9.2.3.4. Omitting null values</h4></div></div></div> |
| |
| |
| <p>Following the conventions established in XMI serialization, features with <code class="code">null</code> values have their |
| key-value pairs omitted from the FS serialization when the type of the feature value is: |
| </p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"> |
| <p>a Feature Structure Reference</p> |
| </li><li class="listitem"> |
| <p>a String ( whose value is <code class="code">null</code>, not "" (a 0-length String))</p> |
| </li><li class="listitem"> |
| <p>an embedded Array or List (where the entire array and/or list is <code class="code">null</code>)</p> |
| </li></ul></div> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Inside arrays or lists of FSs, references which are being serialized |
| as references have a <code class="code">null</code> reference coded as the number 0; references which are embedded are serialized as |
| <code class="code">null</code>.</p></div> |
| |
|         <p>Configuring the serializer with <code class="code">setOmit0Values(true)</code> additionally causes |
|           numeric primitive features (byte/short/int/long/float/double) to be omitted when their values are 0 or 0.0.</p> |
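|         <p>The omission rules in this section can be condensed into a small decision function. The following is a minimal sketch in plain Java, not code taken from UIMA: the class name <code class="code">NullOmissionSketch</code> and its method are hypothetical, and a feature value is modeled as a boxed Java object, with <code class="code">null</code> standing for a null FS reference, String, or embedded array or list.</p> |

```java
/** Hypothetical helper; models the documented omission rules, not UIMA's serializer code. */
final class NullOmissionSketch {

    /**
     * Returns true when the key-value pair for a feature would be left out.
     *
     * @param value       the feature value; null models a null FS reference,
     *                    String, or embedded array/list
     * @param omit0Values models the serializer's setOmit0Values(...) setting
     */
    static boolean isOmitted(Object value, boolean omit0Values) {
        if (value == null) {
            return true;                       // null feature values are omitted
        }
        if (omit0Values) {
            if (value instanceof Number) {     // byte/short/int/long/float/double
                return ((Number) value).doubleValue() == 0.0;
            }
        }
        return false;                          // "" and non-zero values are kept
    }

    public static void main(String[] args) {
        System.out.println("empty String omitted: " + isOmitted("", false));
        System.out.println("zero int omitted on request: " + isOmitted(0, true));
    }
}
```

|         <p>The sketch covers feature values only; <code class="code">null</code> elements inside arrays or lists of FSs are still written, coded as 0 or <code class="code">null</code>.</p> |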
| |
| </div> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="9.3. Organizing the Feature Structures"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas.featurestructures.organization">9.3. Organizing the Feature Structures</h2></div></div></div> |
| |
| |
| <p>The set of all FSs being serialized is divided into two parts. The first part represents |
| all FSs that are root FSs, in that they were in one or more indexes at the time of serialization. The second part |
| represents feature structures that are multiply referenced, or are referenced via a chain of references from the |
|       root FSs. The same feature structure can appear in both parts. The elements in the second part are actual |
|       serialized FSs, whereas the elements in the first part are either references to the corresponding FSs in the |
|       second part, if they exist there, or the actual embedded serialized FSs. Each serialized FS body appears only |
|       once across the two parts.</p> |
| |
| <div class="informalexample"> |
|         <pre class="programlisting">"_views" : { |
|    "_InitialView" : { |
|      "theFirstType" : [ { ... fs1 ...}, 123, 456, { ... fsn ...} ], |
|      "anotherType"   : [ { ... fs1 ...}, ... { ... fsn ...} ], |
|        ...     // more types which have roots in view "12" |
|    }, |
|    "AnotherView" : { |
|      "theFirstType" : [ { ... fsv1 ...}, 123, { ... fsvn ...} ], |
|      "anotherType"   : [ { ... fsv1 ...}, ... { ... fsvn ...} ], |
| ... // more types which have roots in view "25" |
| }, |
| ... // more views |
| }, |
| |
| "_referenced_fss" : { |
| "12" : {"_type" : "Sofa", "sofaNum" : 1, "sofaID" : "_InitialView" }, |
| "25" : {"_type" : "Sofa", "sofaNum" : 2, "sofaID" : "AnotherView" }, |
| |
| "123" : { ... fs-123 ... }, |
| "456" : { ... fs-456 ... }, |
| ... |
| }</pre></div> |
| |
| <p>The first part map is made up of multiple maps, one for each separate CAS View. |
| The outer map is keyed by the <code class="code">id</code> of the corresponding SofaFS (or 0, if there is no corresponding SofaFS). |
| For each view, the value is a map whose key is a used Type, and the values are an array of instances |
| of FSs of that type which were found in some index; these are the "root" FSs. Only root instances |
| of a particular type are included in this array. |
| </p> |
| |
| |
| <p>The second part map has keys which are the <code class="code">id</code> value of the FSs, and values which are |
| a map of key-value pairs corresponding to the feature-values of that FS. |
| In this case, the _type extra feature is added to record the type.</p> |
| |
| |
| |
|       <p>The _views map, keyed by view and type name, has all the FSs (as a JSON array) for that type that were in |
| one or more indexes in any View. If a FS in this array is not multiply referenced (using dynamic mode), |
| then it is embedded here. Otherwise, only the reference (a simple number representing the <code class="code">id</code> of that FS) is serialized for that FS.</p> |
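|       <p>A consumer of this layout has to tell embedded FSs from references when walking a per-type array. The following is a minimal sketch, assuming (hypothetically) that the JSON has already been parsed so that a reference arrives as an <code class="code">Integer</code> and an embedded FS as its text. <code class="code">ViewRootResolver</code> and <code class="code">ReferencedFs</code> are illustrative names, not UIMA classes.</p> |

```java
/** Hypothetical model of the two-part layout; not part of the UIMA API. */
final class ViewRootResolver {

    /** One entry of "_referenced_fss": the FS id and its serialized body. */
    static final class ReferencedFs {
        final int id;
        final String body;

        ReferencedFs(int id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /**
     * Resolves one element of a per-type array in "_views".
     * An Integer is an id that points into "_referenced_fss";
     * anything else is treated as an FS embedded in place.
     */
    static String resolve(Object entry, ReferencedFs[] referenced) {
        if (!(entry instanceof Integer)) {
            return String.valueOf(entry);          // embedded FS: use as is
        }
        int id = (Integer) entry;
        for (ReferencedFs fs : referenced) {
            if (fs.id == id) {
                return fs.body;                    // follow the reference
            }
        }
        throw new IllegalArgumentException("no referenced FS with id " + id);
    }

    public static void main(String[] args) {
        ReferencedFs[] referenced = {
            new ReferencedFs(123, "{ ... fs-123 ... }"),
            new ReferencedFs(456, "{ ... fs-456 ... }")
        };
        Object[] roots = { "{ ... fs1 ... }", 123, 456 };
        for (Object root : roots) {
            System.out.println(resolve(root, referenced));
        }
    }
}
```

|       <p>A real consumer would use a JSON parser and a map keyed by id; the linear scan here only keeps the sketch short.</p> |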
| |
| |
| |
| </div> |
| |
| |
| |
| |
| |
| |
| <div class="section" title="9.4. Additional JSON CAS Serialization features"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ug.ref.json.cas.features">9.4. Additional JSON CAS Serialization features</h2></div></div></div> |
| |
| |
| <p>JSON serialization also supports several additional features, including:</p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>Type and feature filtering: only types and features that exist in a specified type system description |
| are serialized.</p> |
| </li><li class="listitem"> |
| <p>An ErrorHandler; this will be called in various error situations, including when |
| serializing in static mode an array or list value for a feature marked <code class="code">multipleReferencesAllowed = false</code> |
| is found to have multiple references.</p> |
| </li><li class="listitem"> |
| <p>A switch to control omitting of numeric features that have 0 values (default is to include these). |
| See the <code class="code">setOmit0Values(true_or_false)</code> method in JsonCasSerializer.</p> |
| </li><li class="listitem"> |
|              <p>A pretty-printing flag (default is no pretty-printing).</p> |
| </li></ul></div> |
| <p>See the Javadocs for JsonCasSerializer for details.</p> |
| |
| <div class="section" title="9.4.1. Delta CAS"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.json.delta">9.4.1. Delta CAS</h3></div></div></div> |
| |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Delta CAS support is incomplete, and is not supported as of release 2.7.0, but may |
| be supported in later releases. The information here is just for planning purposes.</p></div> |
| |
| <p><span class="bold"><strong>_delta_cas</strong></span> is present only when a delta CAS serialization is being performed. |
| This serializes just the |
| changes in the CAS since a Mark was set; so for cases where a large CAS is deserialized into a service, which |
| then does a relatively small amount of additions and modifications, only those changes are serialized. |
| The values of the keys are arrays of the ids of FSs that were added to the indexes, |
| removed from the indexes, or reindexed.</p> |
| |
| <p>This mode requires the static embeddability mode. When specified, a <code class="code">_delta_cas</code> key-value |
| is added to the serialization at the end, |
| which lists the FSs (by <code class="code">id</code>) that were added, removed, or reindexed, since the mark was set. |
|        Additional information, created when the CAS was previously deserialized and the mark set, |
| must be passed to the serializer, in the form of an instance of <code class="code">XmiSerializationSharedData</code>, |
| or JsonSerializationSharedData (not yet defined as of release 2.7.0).</p> |
| |
| <p>Here's what the last part of the serialization looks like, when Delta CAS is specified: |
| </p><div class="informalexample"> |
| <pre class="programlisting">"_delta_cas" : { |
| "added_members" : [ 123, ... ], |
| "deleted_members" : [ 456, ... ], |
| "reindexed_members" : [] }</pre></div><p> |
| </p> |
| |
| |
| </div> |
| </div> |
| |
| |
| <div class="section" title="9.5. Using JSON CAS serialization"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.usage">9.5. Using JSON CAS serialization</h2></div></div></div> |
| |
| |
|         <p>JSON CAS serialization is built on top of the Jackson JSON serialization |
|            package, and follows Jackson conventions for configuration.</p> |
| |
| <p>The serialization APIs are in the JsonCasSerializer class.</p> |
| |
| <p>Although there are some static short-cut methods for common use cases, the basic operations needed |
| to serialize a CAS as JSON are:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>Make an instance of the <code class="code">JsonCasSerializer</code> class. This will serve to collect configuration information.</p> |
| </li><li class="listitem"> |
| <p>Do any additional configuration needed. See the Javadocs for details. |
| The following objects can be configured:</p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"> |
| <p>The <code class="code">JsonCasSerializer</code> object: here you can specify the kind of JSON formatting, what to serialize, |
| whether or not delta serialization is wanted, prettyprinting, and more.</p> |
| </li><li class="listitem"> |
| <p>The underlying <code class="code">JsonFactory</code> object from Jackson. Normally, you won't need to configure this. |
| If you do, you can create your own instance of this object and configure it and use it in the |
| serialization.</p> |
| </li><li class="listitem"> |
| <p>The underlying <code class="code">JsonGenerator</code> from Jackson. Normally, you won't need to configure this. |
| If you do, you can get the instance the serializer will be using and configure that.</p> |
| </li></ul></div> |
| </li><li class="listitem"> |
|           <p>Once all the configuration is done, call <code class="code">serialize(...)</code> on the configured instance. |
|           Each call creates a one-time-use |
|           inner class instance where the actual serialization is done.  The serialize(...) method is thread-safe, in that the same |
| JsonCasSerializer instance (after it has been configured) can kick off multiple |
| (identically configured) serializations |
| on different threads at the same time.</p> |
| <p>The serialize call follows the Jackson conventions, taking one of 3 specifications of where to serialize to: |
| a Writer, an OutputStream, or a File.</p> |
| </li></ul></div> |
| |
| <p>Here's an example:</p> |
| <div class="informalexample"> |
| <pre class="programlisting">JsonCasSerializer jcs = new JsonCasSerializer(); |
| jcs.setPrettyPrint(true); // do some configuration |
| StringWriter sw = new StringWriter(); |
| jcs.serialize(cas, sw); // serialize into sw</pre></div> |
| |
| <p>The JsonCasSerializer class also has some static convenience methods for JSON serialization, for the |
| most common configuration cases; please see the Javadocs for details. These are named jsonSerialize, to |
| distinguish them from the non-static serialize methods.</p> |
| |
|       <p>Most of the common configuration methods return the instance, so they can be chained together. |
|       For example, if <code class="code">jcs</code> is an instance of the JsonCasSerializer, you can write |
|       <code class="code">jcs.setPrettyPrint(true).setOmit0Values(true);</code> to configure both of these.</p> |
| |
| |
| |
| </div> |
| |
| <div class="section" title="9.6. JSON serialization for UIMA descriptors"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.json.descriptionserialization">9.6. JSON serialization for UIMA descriptors</h2></div></div></div> |
| |
| |
| <p>UIMA descriptors are things like analysis engine descriptors, type system descriptors, etc. |
| UIMA has an internal form for these, typically named UIMA <span class="emphasis"><em>description</em></span>s; |
| these can be serialized out as XML using a <code class="code">toXML</code> method. |
|         JSON support adds the ability to serialize these as JSON objects, as well. It may be of use, for example, |
| to have the full type system description for a UIMA pipeline available in JSON notation. |
| </p> |
| |
|      <p>The class JsonMetaDataSerializer defines a set of static <code class="code">toJSON</code> methods that serialize UIMA description objects. |
|      Each takes as arguments the description object to be serialized and one of the standard |
|      serialization targets that Jackson supports (File, Writer, or OutputStream).  There is also |
|      an optional prettyprint flag (default is no prettyprinting).</p> |
| |
| <p>The resulting JSON serialization is just a straight-forward serialization of the description object, |
| having the same fields as the XML serialization of it.</p> |
| |
| <p>Here's what a small TypeSystem description looks like, serialized:</p> |
| |
| <div class="informalexample"> |
| <pre class="programlisting">{"typeSystemDescription" : |
| {"name" : "casTestCaseTypesystem", |
| "description" : "Type system description for CAS test cases.", |
| "version" : "1.0", |
| "vendor" : "Apache Software Foundation", |
| "types" : [ |
| {"typeDescription" : |
| {"name" : "Token", |
| "description" : "", |
| "supertypeName" : "uima.tcas.Annotation", |
| "features" : [ |
| {"featureDescription" : |
| {"name" : "type", |
| "description" : "", |
| "rangeTypeName" : |
| "TokenType" } }, |
| {"featureDescription" : |
| {"name" : "tokenFloatFeat", |
| "description" : "", |
| "rangeTypeName" : "uima.cas.Float" } } ] } }, |
| {"typeDescription" : |
| {"name" : "TokenType", |
| "description" : "", |
| "supertypeName" : "uima.cas.TOP" } } ] } }</pre></div> |
| |
| <p>Here's a sample of code to serialize a UIMA description object held in the variable <code class="code">tsd</code>, with |
| and without pretty printing:</p> |
| |
| |
| <div class="informalexample"> |
| <pre class="programlisting">StringWriter sw = new StringWriter(); |
| JsonMetaDataSerializer.toJSON(tsd, sw); // no prettyprinting |
| |
| sw = new StringWriter(); |
| JsonMetaDataSerializer.toJSON(tsd, sw, true); // prettyprinting</pre></div> |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 10. UIMA Setup and Configuration" id="ugr.ref.config"><div class="titlepage"><div><div><h2 class="title">Chapter 10. UIMA Setup and Configuration</h2></div></div></div> |
| |
| |
| |
| <div class="section" title="10.1. UIMA JVM Configuration Properties"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.properties">10.1. UIMA JVM Configuration Properties</h2></div></div></div> |
| |
| |
| <p> Some updates change UIMA's behavior between released versions. For example, sometimes an error check |
|       is enhanced, and this can cause something that was previously incorrect but unchecked to now signal an error. |
| Often, users will want these kinds of things to be ignored, at least for a while, to give them time to |
| analyze and correct the issues. |
| </p> |
| |
| <p> |
| To enable users to gradually address these issues, there are some global JVM properties |
| for UIMA that can restore earlier behaviors, in some cases. |
| These are detailed in the table below. Additionally, there are other JVM properties that can |
| be used in checking and optimizing some performance trade-offs, such as the automatic index protection. |
| For the most part, you don't need to assign any values to these properties, |
|       just define them.  For example, to disable the enhanced check that ensures you |
|       don't add a subtype of AnnotationBase to the wrong View, add |
|       the JVM argument <code class="code">-Duima.disable_enhanced_check_wrong_add_to_index</code>. |
|       This removes the enhanced |
|       checking added in version 2.7.0 (the previously existing partial checking is |
|       still there, though). |
| </p> |
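|     <p>Properties that are defined without a value are easy to misread in application code. The sketch below uses only <code class="code">java.lang.System</code> and does not claim to mirror UIMA's internal handling; it shows the difference between testing that a property is defined and parsing it as a boolean. <code class="code">DefinedOnlyFlag</code> is a hypothetical helper name.</p> |

```java
/** Hypothetical helper; UIMA's own property handling may differ in detail. */
final class DefinedOnlyFlag {

    /** True when the property is defined at all, with or without a value. */
    static boolean isDefined(String name) {
        return System.getProperty(name) != null;
    }

    public static void main(String[] args) {
        String name = "uima.disable_enhanced_check_wrong_add_to_index";
        // Run with: java -Duima.disable_enhanced_check_wrong_add_to_index DefinedOnlyFlag
        System.out.println(name + " defined: " + isDefined(name));
        // Boolean.getBoolean needs the literal value "true", so it reports
        // false for a property that was defined without a value.
        System.out.println(name + " parsed as boolean: " + Boolean.getBoolean(name));
    }
}
```

|     <p>When a property is passed as <code class="code">-Dname</code> with no value, the JVM stores an empty string, so a defined-only test succeeds where a boolean parse does not.</p> |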
| </div> |
| |
| <div class="section" title="10.2. Configuring index protection"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.protect-index">10.2. Configuring index protection</h2></div></div></div> |
| |
| |
|       <p>Version 2.7.0 added optional checking for invalid feature updates |
|       which could corrupt indexes.  Because this checking can slightly slow down performance, there are |
|       global JVM properties to control it.  The suggested way to operate with these is as follows. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>At the beginning, run with automatic protection enabled (the default), but |
| turn on explicit reporting (<code class="code">-Duima.report_fs_update_corrupts_index</code>)</p></li><li class="listitem"><p>For all reported instances, examine your code to see if you can restructure to |
| do the updates before adding the FS to the indexes. Where you cannot, surround the code doing |
| these updates with a try / finally or block form of <code class="code">protectIndexes()</code>, |
| which is described in |
| <a class="xref" href="#ugr.ref.cas.updating_indexed_feature_structures" title="4.5.1. Updating indexed feature structures">Section 4.5.1, “Updating indexed feature structures”</a> (and also is similarly available with JCas). |
| </p></li><li class="listitem"><p>After no further reports, for maximum performance, leave in the protections |
| you may have installed in the above step, and then disable the reporting and runtime checking, |
| using the JVM argument |
| <code class="code">-Duima.disable_auto_protect_indexes</code>, and removing (if present) |
| <code class="code">-Duima.report_fs_update_corrupts_index</code>.</p></li></ul></div><p> |
|       One additional JVM property, <code class="code">-Duima.exception_when_fs_update_corrupts_index</code>, |
|       is intended to be used in automated build / testing configurations.  It causes the framework to throw |
|       a UIMARuntimeException if an update outside of a <code class="code">protectIndexes</code> block occurs |
|       that could corrupt the indexes, |
|       rather than automatically recovering from it. |
| </p> |
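|     <p>To illustrate why an in-place update of an index key is a problem, here is an analogy in plain Java. It is not UIMA code: a sorted <code class="code">int</code> array stands in for a Sorted index, and the hypothetical class <code class="code">SortedIndexSketch</code> contrasts an unprotected key change with one that restores the index order in a <code class="code">finally</code> block, loosely mirroring the remove, update, add-back sequence that <code class="code">protectIndexes()</code> arranges.</p> |

```java
/** Hypothetical stand-in for a sorted index; an analogy, not UIMA code. */
final class SortedIndexSketch {
    private final int[] keys;      // kept sorted so binary search works

    SortedIndexSketch(int... initialKeys) {
        keys = initialKeys.clone();
        java.util.Arrays.sort(keys);
    }

    /** Binary search is only reliable while the keys stay sorted. */
    boolean contains(int key) {
        return java.util.Arrays.binarySearch(keys, key) >= 0;
    }

    /** Unprotected update: changes a key in place and leaves the order stale. */
    void updateInPlace(int position, int newKey) {
        keys[position] = newKey;
    }

    /** Protected update: change the key, then restore the order in finally. */
    void updateProtected(int position, int newKey) {
        try {
            keys[position] = newKey;
        } finally {
            java.util.Arrays.sort(keys);   // plays the role of "add it back"
        }
    }

    public static void main(String[] args) {
        SortedIndexSketch broken = new SortedIndexSketch(10, 20, 30);
        broken.updateInPlace(0, 25);
        System.out.println("unprotected lookup of 25: " + broken.contains(25));

        SortedIndexSketch safe = new SortedIndexSketch(10, 20, 30);
        safe.updateProtected(0, 25);
        System.out.println("protected lookup of 25: " + safe.contains(25));
    }
}
```

|     <p>In the unprotected case the element is still stored, but a lookup that relies on the ordering can no longer be trusted to find it, which is the kind of silent corruption the reporting and exception properties are meant to surface.</p> |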
| </div> |
| |
| <div class="section" title="10.3. Properties Table"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.config.property-table">10.3. Properties Table</h2></div></div></div> |
| |
| |
| <p>This table describes the various JVM defined properties; specify these on the Java command line |
| using -Dxxxxxx, where the xxxxxx is one of |
| the properties starting with <code class="code">uima.</code> from the table below.</p> |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="Title"><col class="Description"><col class="Version"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="bold"><strong>Title</strong></span></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="bold"><strong>Property Name & Description</strong></span></td><td style="border-bottom: 0.5pt solid black; "><span class="bold"><strong>Since Version</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Use built-in Java Logger as default back-end</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.use_jul_as_default_uima_logger</code></p> |
| |
| <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-5381" target="_top">UIMA-5381</a>. |
| The standard UIMA logger uses an slf4j implementation, which, in turn hooks up to |
| a back end implementation based on what can be found in the class path (see slf4j documentation). |
| If no backend implementation is found, the slf4j default is to use a NOP logger back end |
| which discards all logging.</p> |
| |
| <p>When this flag is specified, the behavior of the UIMA logger |
| is altered to use the built-in-to-Java logging implementation |
| as the back end for the UIMA logger. |
| </p></td><td style="border-bottom: 0.5pt solid black; "><p>3.0.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>XML: enable doctype declarations</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.xml.enable.doctype_decl</code> (default is false)</p> |
| |
|           <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-6064" target="_top">UIMA-6064</a>. |
| Normally, this is turned off to avoid exposure to malicious XML; see |
| <a class="ulink" href="https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Processing" target="_top"> |
| XML External Entity processing vulnerability</a>. |
| </p> |
| </td><td style="border-bottom: 0.5pt solid black; "><p>2.10.4, 3.1.0</p></td></tr><tr><td style="border-bottom: 0.5pt solid black; " colspan="3" align="center"><span class="bold"><strong>Index protection properties</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Report Illegal Index-key Feature Updates</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.report_fs_update_corrupts_index</code> (default is not to report)</p> |
| |
| <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4135" target="_top">UIMA-4135</a>. |
| Updating Features which are used in Set and Sorted |
| indexes as "keys" may corrupt the indexes, if the Feature Structure (FS) |
| has been added to the indexes. To update these, you must first |
| completely remove the FS from the indexes in all views, then do the updates, and then |
| add it back. UIMA now checks for this (unless specifically disabled, see below), |
|             and if this property is defined, logs a WARN message for each occurrence that is not |
|             inside an explicit <code class="code">protectIndexes</code> block (see the CAS Javadocs for the CAS / JCas <code class="code">protectIndexes</code> methods).</p> |
| <p>To scan the logs for these reports, search for instances of lines having the string |
| <code class="code">While FS was in the index, the feature</code></p> |
| |
| <p>Specifying this property overrides <code class="code">uima.disable_auto_protect_indexes</code>.</p> |
| |
| <p>Users would run with this property defined, and then for high performance, |
| would use the report to manually change their code to avoid the problem or |
| to wrap the updates with a <code class="code">protectIndexes</code> kind of protection (see the |
|               reference manual, in the CAS or JCas chapters, for examples of user code doing this), |
| and then run with the protection turned off (see below). |
| |
| </p></td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Throw exception on illegal Index-key Feature Updates</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.exception_when_fs_update_corrupts_index</code> (default is false)</p> |
| |
| <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4150" target="_top">UIMA-4150</a>. |
| Throws a UIMARuntimeException if an Indexed FS feature used as a key in one or more |
|             indexes is updated outside of an explicit <code class="code">protectIndexes</code> block. |
|             This is intended for use in automated build and test environments, |
|             to provide a strong signal if this kind of mistake gets into the build. |
|             If it is not set, then the other properties specify if corruption should be checked for, |
|             recovered automatically, and / or reported.</p> |
| |
| <p>Specifying this property also forces <code class="code">uima.report_fs_update_corrupts_index</code> |
| to true even if it was set to false.</p> |
| |
| </td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Disable the index corruption checking</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.disable_auto_protect_indexes</code></p> |
| |
| <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4135" target="_top">UIMA-4135</a>. |
| After you have fixed all reported issues identified with the above report, |
| you may set this property to omit this check, which may slightly improve |
| performance.</p> |
|            <p>Note that this property is ignored if <code class="code">-Duima.exception_when_fs_update_corrupts_index</code> |
|               or <code class="code">-Duima.report_fs_update_corrupts_index</code> is specified.</p> |
| </td><td style="border-bottom: 0.5pt solid black; "><p>2.7.0</p></td></tr><tr><td style="border-bottom: 0.5pt solid black; " colspan="3" align="center"><span class="bold"><strong>Measurement / Tracing properties</strong></span></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p>Trace Feature Structure Creation/Updating</p></td><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><p><code class="code">uima.trace_fs_creation_and_updating</code></p> |
| <p>This causes a trace file to be produced in the current working directory. |
|               The file has one line for each Feature Structure that is created, and includes |
|               information on the cas/cas-view, and the features that are set for the Feature Structure. |
|               There is, additionally, one line for each Feature Structure update. |
|               Updates that occur next to trace information for the same Feature Structure are combined. |
| </p> |
| |
| <p>This can generate a lot of output, and definitely slows down execution.</p> |
| </td><td style="border-bottom: 0.5pt solid black; "><p>2.10.1</p></td></tr><tr><td style="border-right: 0.5pt solid black; "><p>Measure index flattening optimization</p></td><td style="border-right: 0.5pt solid black; "><p><code class="code">uima.measure.flatten_index</code></p> |
| |
| <p>See <a class="ulink" href="https://issues.apache.org/jira/browse/UIMA-4357" target="_top">UIMA-4357</a>. |
|             This writes a short report to System.out when the JVM shuts down. |
| The report has some statistics about the automatic management of |
| flattened index creation and use.</p> |
| |
| </td><td style=""><p>2.8.0</p></td></tr></tbody></table> |
| </div> |
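|  <p>The interactions among the three index protection properties, as the table rows describe them, can be written down as a small truth function. This is a sketch of the documented precedence only, under the assumption that each property counts as set when it is defined; <code class="code">IndexProtectionFlags</code> is a hypothetical class, not UIMA's implementation.</p> |

```java
/** Hypothetical model of the documented flag interactions; not UIMA's actual code. */
final class IndexProtectionFlags {
    final boolean checking;    // corruption checks run at all
    final boolean reporting;   // WARN messages are logged
    final boolean throwing;    // an exception is thrown instead of recovering

    /**
     * @param reportDefined    uima.report_fs_update_corrupts_index is defined
     * @param exceptionDefined uima.exception_when_fs_update_corrupts_index is defined
     * @param disableDefined   uima.disable_auto_protect_indexes is defined
     */
    IndexProtectionFlags(boolean reportDefined, boolean exceptionDefined,
                         boolean disableDefined) {
        throwing = exceptionDefined;
        // The exception property forces reporting on.
        reporting = reportDefined || exceptionDefined;
        // The disable property is ignored when reporting or throwing is requested.
        checking = reporting || !disableDefined;
    }

    public static void main(String[] args) {
        IndexProtectionFlags f = new IndexProtectionFlags(true, false, true);
        System.out.println("checking=" + f.checking + " reporting=" + f.reporting
                + " throwing=" + f.throwing);
    }
}
```

|  <p>With no properties defined, the sketch yields checking without reporting, which matches the default of automatic protection being enabled.</p> |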
| <p>Some additional global flags intended for helping v3 migration are documented in the V3 user's guide.</p> |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 11. UIMA Resources" id="ugr.ref.resources"><div class="titlepage"><div><div><h2 class="title">Chapter 11. UIMA Resources</h2></div></div></div> |
| |
| |
| |
| <div class="section" title="11.1. What is a UIMA Resource?"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.overview">11.1. What is a UIMA Resource?</h2></div></div></div> |
| |
| <p>UIMA uses the term <code class="code">Resource</code> to describe all UIMA components |
| that can be acquired by an application or by other resources.</p> |
| |
| <div class="figure"><a name="ref.resource.fig.kinds"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="297"><tr><td><img src="images/references/ref.resources/res_resource_kinds.png" width="297" alt="Resource Kinds, a partial list"></td></tr></table></div> |
| </div><p class="title"><b>Figure 11.1. Resource Kinds</b></p></div><br class="figure-break"> |
| |
| <p>There are many kinds of resources; here's a list of the main kinds: |
| </p><div class="variablelist"><dl><dt><span class="term"><span class="strong"><strong>Annotator</strong></span></span></dt><dd><p>a user written component, receives a CAS, does some processing, and returns the possibly |
| updated CAS. Variants include CollectionReaders, CAS Consumers, CAS Multipliers.</p></dd><dt><span class="term"><span class="strong"><strong>Flow Controller</strong></span></span></dt><dd><p>a user written component controlling the flow of CASes within an aggregate.</p></dd><dt><span class="term"><span class="strong"><strong>External Resource</strong></span></span></dt><dd><p>a user written component. Variants include: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>Data - includes special lifecycle call to load data</p></li><li class="listitem"><p>Parameterized - allows multiple instantiations with simple string parameter variants; |
| example: a dictionary, that has variants in content for different languages</p></li><li class="listitem"><p>Configurable - supports configuration from the XML specifier</p></li></ul></div><p> |
| </p></dd></dl></div><p> |
| </p> |
| |
| <div class="section" title="11.1.1. Resource Inner Implementations"><div class="titlepage"><div><div><h3 class="title" id="ugr.ref.resources.resource-inner-implementations">11.1.1. Resource Inner Implementations</h3></div></div></div> |
| |
| |
| <p>Many of the resource kinds include in their specification a (possibly optional) element, which is |
| the name of a Java class which implements the resource. We will call this class the "inner implementation".</p> |
| |
| <p>The UIMA framework creates instances of Resource from resource specifiers, by calling |
| the framework's <code class="code">produceResource(specifier, additional_parameters)</code> method. |
|       This call produces an instance of Resource.  </p> |
| |
| <div class="blockquote"><blockquote class="blockquote"> |
| <p> |
| For example, calling produceResource on an AnalysisEngineDescription produces an instance of |
|         AnalysisEngine.  This, in turn, will have a reference to the user-written inner implementation class |
| specified by the <code class="code">annotatorImplementationName</code>. |
| </p> |
| <p>External resource descriptors may include an <code class="code">implementationName</code> element. |
|         Calling produceResource on an ExternalResourceDescription produces an instance of Resource; |
| the resource obtained by subsequent calls to <code class="code">getResource(...)</code> |
| is dependent on the particular descriptor, and may be an instance of |
| the inner implementation class. |
| </p> |
| </blockquote></div> |
| |
| <p>For external resources, each resource specifier kind handles the case where |
| the inner implementation is omitted. If it is supplied, the named class must implement |
| the interface specified in the bindings for this resource. In addition, the particular specifier kind may |
| further restrict the kinds of classes the user supplies as the implementationName. |
| </p> |
| |
| <p>Some examples of this further restriction: |
| </p><div class="variablelist"><dl><dt><span class="term"><span class="strong"><strong>customResource</strong></span></span></dt><dd><p>the class must also implement the Resource interface</p></dd><dt><span class="term"><span class="strong"><strong>dataResource</strong></span></span></dt><dd><p>the class must also implement the SharedResourceObject interface</p></dd></dl></div><p> |
| </p> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="11.2. Sharing Resources, even across pipelines"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.sharing-across-pipelines">11.2. Sharing Resources, even across pipelines</h2></div></div></div> |
| |
| |
| <p>UIMA applications run one or more UIMA Pipelines. Each pipeline has a top-level Analysis Engine, which |
| may be an aggregation of many other Analysis Engine components. The UIMA framework instantiates Annotator |
| resources as specified to configure the pipelines.</p> |
| |
| <p>Sometimes, many identical pipelines are created (for example, |
| in order to exploit multi-core hardware by processing multiple CASes in parallel). In this case, the framework |
|     would produce multiple instances of those Annotator resources; these are implemented as multiple instances |
| of the same Java class.</p> |
| |
| <p>Sets of External Resources plus a CAS Pool and UIMA Extension ClassLoader are set up and kept, |
| per instance of a ResourceManager; |
| this instance serves to allow sharing of these items across one or more pipelines. |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>The UIMA Extension ClassLoader (if specified) is used to find the resources to be loaded |
| by the framework</p> |
| </li><li class="listitem"> |
| <p>The <code class="code">External Resources</code> are specified by a pipeline's resource configuration.</p> |
| </li><li class="listitem"> |
| <p>The CAS Pool is a pool of CASs all with identical type systems and index definitions, associated |
| with a pipeline.</p> |
| </li></ul></div><p> </p> |
| |
| <p>When setting up a pipeline, the UIMA Framework's <code class="code">produceResource</code> |
| or one of its specialized variants is called, and a new |
|     ResourceManager is created and used for that pipeline.  However, in many cases, it may be advantageous to |
| share the same Resources across multiple pipelines; this is easily doable by passing a common instance of the |
| ResourceManager to the pipeline creation methods (using the additional parameters of the produceResource method).</p> |
| |
| <p> |
| To handle additional use cases, the ResourceManager has a <code class="code">copy()</code> method which creates a copy of the |
| Resource Manager instance. The new instance is created with a null CAS Manager; if you want to share the |
|     CAS Pool, you have to copy the CAS Manager: <code class="code">newRM.setCasManager(originalRM.getCasManager())</code>. |
| You also may set the Extension Class Loader in the new instance (PEAR wrappers use this to allow |
| PEARs to have their own classpath). See the Javadocs for details. |
| </p> |
| |
| </div> |
| |
| <div class="section" title="11.3. External Resources support for multiple Parameterized Instances"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ref.resources.external-resource-multiple-parameterized-instances">11.3. External Resources support for multiple Parameterized Instances</h2></div></div></div> |
| |
| <p>A typical external resource gets a single instantiation, shared with all users of a particular |
| ResourceManager. |
|   Sometimes, multiple instantiations of the same resource may be useful.  The framework supports this for |
| ParameterizedDataResources. There's one kind supplied with UIMA - the fileLanguageResourceSpecifier. |
| This works by having each call to getResource(name, extra_keys[]) use the extra keys to select a particular |
| instance. On the first call for a particular instance, the named resource uses the extra keys to |
| initialize a new instance by calling its <code class="code">load</code> method with a data resource derived from the |
| extra keys by the named resource. |
| </p> |
| |
|   <p>For example, the fileLanguageResourceSpecifier takes the language code and goes through |
|   a process with extensive defaulting and fallback to find a resource to load for that language. |
| </p> |
| |
| </div> |
| |
| </div> |
| </div></body></html> |