| <html><head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title>UIMA Overview & SDK Setup</title><link rel="stylesheet" type="text/css" href="css/stylesheet-html.css"><meta name="generator" content="DocBook XSL-NS Stylesheets V1.76.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" class="book" title="UIMA Overview & SDK Setup" id="d5e1"><div xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><div><h1 class="title">UIMA Overview & SDK Setup</h1></div><div><div class="authorgroup"> |
| <h3 class="corpauthor">Written and maintained by the Apache UIMA™ Development Community</h3> |
| </div></div><div><p class="releaseinfo">Version 3.0.2</p></div><div><p class="copyright">Copyright © 2006, 2019 The Apache Software Foundation</p></div><div><p class="copyright">Copyright © 2004, 2006 International Business Machines Corporation</p></div><div><div class="legalnotice" title="Legal Notice"><a name="d5e8"></a> |
| <p> </p> |
| <p title="License and Disclaimer"> |
| <b>License and Disclaimer. </b> |
| |
| The ASF licenses this documentation |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this documentation except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| </p><div class="blockquote"><blockquote class="blockquote"> |
| <a class="ulink" href="http://www.apache.org/licenses/LICENSE-2.0" target="_top">http://www.apache.org/licenses/LICENSE-2.0</a> |
| </blockquote></div><p title="License and Disclaimer"> |
| |
| Unless required by applicable law or agreed to in writing, |
| this documentation and its contents are distributed under the License |
| on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| </p> |
| <p> </p> |
| <p> </p> |
| <p title="Trademarks"> |
| <b>Trademarks. </b> |
| All terms mentioned in the text that are known to be trademarks or |
| service marks have been appropriately capitalized. Use of such terms |
| in this book should not be regarded as affecting the validity of the |
| trademark or service mark. |
| |
| </p> |
| </div></div><div><p class="pubdate">April, 2019</p></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="chapter"><a href="#ugr.project_overview">1. Overview</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.project_overview_doc_overview">1.1. Apache UIMA Project Documentation Overview</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.project_overview_overview">1.1.1. Overviews</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_setup">1.1.2. Eclipse Tooling Installation and Setup</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_tutorials_dev_guides">1.1.3. Tutorials and Developer's Guides</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_tool_guides">1.1.4. Tools Users' Guides</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_reference">1.1.5. References</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_v3">1.1.6. Version 3 User's guide</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.project_overview_doc_use">1.2. How to use the Documentation</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_changes_from_previous">1.3. Changes from UIMA Version 2</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_migrating_from_v2_to_v3">1.4. Migrating existing UIMA pipelines from Version 2 to Version 3</a></span></dt><dt><span class="section"><a href="#ugr.project_overview_summary">1.5. Apache UIMA Summary</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.summary.general">1.5.1. General</a></span></dt><dt><span class="section"><a href="#ugr.ovv.summary.programming_language_support">1.5.2. Programming Language Support</a></span></dt><dt><span class="section"><a href="#ugr.ovv.general.summary.multi_modal_support">1.5.3. 
Multi-Modal Support</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.project_overview_summary_sdk_capabilities">1.6. Summary of Apache UIMA Capabilities</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ovv.conceptual">2. UIMA Conceptual Overview</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.conceptual.uima_introduction">2.1. UIMA Introduction</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.architecture_framework_sdk">2.2. The Architecture, the Framework and the SDK</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.analysis_basics">2.3. Analysis Basics</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.conceptual.aes_annotators_and_analysis_results">2.3.1. Analysis Engines, Annotators & Results</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.representing_results_in_cas">2.3.2. Representing Analysis Results in the CAS</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.interacting_with_cas_and_external_resources">2.3.3. Using CASes and External Resources</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.component_descriptors">2.3.4. Component Descriptors</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ovv.conceptual.aggregate_analysis_engines">2.4. Aggregate Analysis Engines</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.applicaiton_building_and_collection_processing">2.5. Application Building and Collection Processing</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.conceptual.using_framework_from_an_application">2.5.1. Using the framework from an Application</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.graduating_to_collection_processing">2.5.2. Graduating to Collection Processing</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ovv.conceptual.exploiting_analysis_results">2.6. 
Exploiting Analysis Results</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.conceptual.semantic_search">2.6.1. Semantic Search</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.databases">2.6.2. Databases</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ovv.conceptual.multimodal_processing">2.7. Multimodal Processing in UIMA</a></span></dt><dt><span class="section"><a href="#ugr.ovv.conceptual.next_steps">2.8. Next Steps</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.ovv.eclipse_setup">3. Eclipse IDE setup for UIMA</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.installation">3.1. Installation</a></span></dt><dd><dl><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.install_eclipse">3.1.1. Install Eclipse</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.install_uima_eclipse_plugins">3.1.2. Installing the UIMA Eclipse Plugins</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.install_uima_sdk">3.1.3. Install the UIMA SDK</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.install_uima_eclipse_plugins_manually">3.1.4. Installing the UIMA Eclipse Plugins, manually</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.start_eclipse">3.1.5. Start Eclipse</a></span></dt></dl></dd><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.example_code">3.2. Setting up Eclipse to view Example Code</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.adding_source">3.3. Adding the UIMA source code to the jar files</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.linking_uima_javadocs">3.4. Attaching UIMA Javadocs</a></span></dt><dt><span class="section"><a href="#ugr.ovv.eclipse_setup.running_external_tools_from_eclipse">3.5. Running external tools from Eclipse</a></span></dt></dl></dd><dt><span class="chapter"><a href="#ugr.faqs">4. 
UIMA FAQ's</a></span></dt><dt><span class="chapter"><a href="#ugr.issues">5. Known Issues</a></span></dt><dt><span class="glossary"><a href="#ugr.glossary">Glossary</a></span></dt></dl></div> |
| |
| |
| |
| |
| |
| <div class="chapter" title="Chapter 1. UIMA Overview" id="ugr.project_overview"><div class="titlepage"><div><div><h2 class="title">Chapter 1. UIMA Overview</h2></div></div></div> |
| |
| |
| |
| <p>The Unstructured Information Management Architecture (UIMA) is an architecture and software framework |
| for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and |
| integrating them with search technologies. The architecture is undergoing a standardization effort, |
| referred to as the <span class="emphasis"><em>UIMA specification</em></span>, by a technical committee within |
| <a class="ulink" href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima" target="_top">OASIS</a>. |
| </p> |
| |
| <p>The <span class="emphasis"><em>Apache UIMA</em></span> framework is an Apache licensed, open source implementation of the |
| UIMA Architecture, and provides a run-time environment in which developers can plug in |
| and run their UIMA component implementations and with which they can build and deploy UIMA applications. The |
| framework itself is not specific to any IDE or platform.</p> |
| |
| <p>It includes an all-Java implementation of the |
| UIMA framework for the development, description, composition and deployment of UIMA components and |
| applications. It also provides the developer with an Eclipse-based (<a class="ulink" href="http://www.eclipse.org/" target="_top">http://www.eclipse.org/</a> |
| ) development environment that includes a set of tools and utilities for using UIMA. It also includes |
| a C++ version of the framework, and |
| enablements for Annotators built in Perl, Python, and TCL.</p> |
| |
| <p>This chapter is the intended starting point for readers who are new to the Apache UIMA Project. It includes |
| this introduction and the following sections:</p> |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> <a class="xref" href="#ugr.project_overview_doc_overview" title="1.1. Apache UIMA Project Documentation Overview">Section 1.1, “Apache UIMA Project Documentation Overview”</a> provides a list of the books and topics included in |
| the Apache UIMA documentation with a brief summary of each. </p> |
| </li><li class="listitem"> |
| <p> <a class="xref" href="#ugr.project_overview_doc_use" title="1.2. How to use the Documentation">Section 1.2, “How to use the Documentation”</a> describes a recommended path through the |
| documentation to help get the reader up and running with UIMA. </p> |
| </li></ul></div> |
| |
| <p>The main website for Apache UIMA is <a class="ulink" href="http://uima.apache.org" target="_top">http://uima.apache.org</a>. Here you |
| can find out many things, including: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>how to download (both the binary and source distributions)</p></li><li class="listitem"><p>how to participate in the development</p></li><li class="listitem"><p>mailing lists - including the user list used like a forum for questions and answers</p></li><li class="listitem"><p>a Wiki where you can find and contribute all kinds of information, including tips and best practices</p></li><li class="listitem"><p>a sandbox - a subproject for potential new additions to Apache UIMA or to subprojects of it. Things here 
| are works in progress, and may (or may not) be included in releases.</p></li><li class="listitem"><p>links to conferences</p></li></ul></div><p> |
| </p> |
| |
| <div class="section" title="1.1. Apache UIMA Project Documentation Overview"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_doc_overview">1.1. Apache UIMA Project Documentation Overview</h2></div></div></div> |
| |
| <p> The user documentation for UIMA is organized into several parts. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"> |
| <p> Overviews - this documentation </p> |
| </li><li class="listitem"> |
| <p> Eclipse Tooling Installation and Setup - also in this document </p> |
| </li><li class="listitem"> |
| <p> Tutorials and Developer's Guides </p> |
| </li><li class="listitem"> |
| <p> Tools Users' Guides </p> |
| </li><li class="listitem"> |
| <p> References </p> |
| </li><li class="listitem"> |
| <p>Version 3 users-guide</p> |
| </li></ul></div><p> </p> |
| |
| <p> |
| The first 2 parts make up this book; the last 4 have individual |
| books. The books are provided both as |
| (somewhat large) html files, viewable in browsers, and also as PDF files. |
| The documentation is fully hyperlinked, with tables of contents. The PDF versions are set up to |
| print nicely - they have page numbers included on the cross references within a book. </p> |
| |
| <p>If you view the PDF files inside |
| a browser that supports embedded viewing of PDF, the hyperlinks between different PDF books may work (not |
| all browsers have been tested...).</p> |
| |
| <p>The following set of tables gives a more detailed overview of the various parts of the |
| documentation. |
| </p> |
| |
| <div class="section" title="1.1.1. Overviews"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_overview">1.1.1. Overviews</h3></div></div></div> |
| |
| |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="col1"><col class="col2"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="emphasis"><em>Overview of the Documentation</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; "> |
| <p>What you are currently reading. Lists the documents provided in the Apache |
| UIMA documentation set and provides |
| a recommended path through the documentation for getting started using |
| UIMA. It includes release notes and provides a brief high-level description of |
| the different software modules included in the |
| Apache UIMA Project. See <a class="xref" href="#ugr.project_overview_doc_overview" title="1.1. Apache UIMA Project Documentation Overview">Section 1.1, “Apache UIMA Project Documentation Overview”</a>.</p> |
| </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="emphasis"><em>Conceptual Overview</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides a broad conceptual overview of the UIMA component architecture; includes |
| references to the other documents in the documentation set that provide more detail. |
| See <a class="xref" href="#ugr.ovv.conceptual" title="Chapter 2. UIMA Conceptual Overview">Chapter 2, <i>UIMA Conceptual Overview</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="emphasis"><em>UIMA FAQs</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Frequently Asked Questions about general UIMA concepts. (Not a programming |
| resource.) See <a class="xref" href="#ugr.faqs" title="Chapter 4. UIMA Frequently Asked Questions (FAQ's)">Chapter 4, <i>UIMA Frequently Asked Questions (FAQ's)</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><span class="emphasis"><em>Known Issues</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Known issues and problems with the UIMA SDK. See <a class="xref" href="#ugr.issues" title="Chapter 5. Known Issues">Chapter 5, <i>Known Issues</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; "><span class="emphasis"><em>Glossary</em></span> |
| </td><td style="">UIMA terms and concepts and their basic definitions. See <a class="xref" href="#ugr.glossary" title="Glossary: Key Terms & Concepts">Glossary</a>.</td></tr></tbody></table> |
| </div> |
| </div> |
| |
| <div class="section" title="1.1.2. Eclipse Tooling Installation and Setup"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_setup">1.1.2. Eclipse Tooling Installation and Setup</h3></div></div></div> |
| |
| <p>Provides step-by-step instructions for installing Apache UIMA in the Eclipse Integrated |
| Development Environment. See <a class="xref" href="#ugr.ovv.eclipse_setup" title="Chapter 3. Setting up the Eclipse IDE to work with UIMA">Chapter 3, <i>Setting up the Eclipse IDE to work with UIMA</i></a>.</p> |
| </div> |
| |
| <div class="section" title="1.1.3. Tutorials and Developer's Guides"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_tutorials_dev_guides">1.1.3. Tutorials and Developer's Guides</h3></div></div></div> |
| |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="col1"><col class="col2"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tutorial_annotator"></a><span class="emphasis"><em>Annotators and Analysis Engines</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Tutorial-style guide for building UIMA annotators and analysis engines. This chapter |
| introduces the developer to creating type systems and using UIMA's common data structure, |
| the CAS or Common Analysis Structure. It demonstrates how to use built in tools to specify and create |
| basic UIMA analysis components. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tutorial_cpe"></a><span class="emphasis"><em>Building UIMA Collection Processing Engines</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Tutorial-style guide for building UIMA collection processing engines. These |
| manage the |
| analysis of collections of documents from source to sink. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Developer's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tutorial_application_development"></a><span class="emphasis"><em>Developing Complete Applications</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Tutorial-style guide on using the UIMA APIs to create, run and manage UIMA components from |
| your application. Also describes APIs for saving and restoring the contents of a CAS using an XML |
| format called <span class="trademark"> XMI</span>®. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.application" class="olink">Chapter 3, <i>Application Developer's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_guide_flow_controller"></a><span class="emphasis"><em>Flow Controller</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">When multiple components are combined in an Aggregate, each CAS flows among the various |
| components. UIMA provides two built-in flows, and also allows custom flows to be |
| implemented. See <a href="tutorials_and_users_guides.html#ugr.tug.fc" class="olink">Chapter 4, <i>Flow Controller Developer's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_guide_multiple_sofas"></a><span class="emphasis"><em>Developing Applications using Multiple Subjects of Analysis</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">A single CAS may be associated with multiple subjects of analysis (Sofas). These are useful |
| for representing and analyzing different formats or translations of the same document. For |
| multi-modal analysis, Sofas are good for different modal representations of the same stream |
| (e.g., audio and closed captions). This chapter provides the developer details on how to use |
| multiple Sofas in an application. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter 5, <i>Annotations, Artifacts, and Sofas</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_guide_multiple_views"></a><span class="emphasis"><em>Multiple CAS Views of an Artifact</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">UIMA provides an extension to the basic model of the CAS which supports |
| analysis of multiple views of the same artifact, all contained within the CAS. This |
| chapter describes the concepts, terminology, and the API and XML extensions that |
| enable this. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.mvs" class="olink">Chapter 6, <i>Multiple CAS Views of an Artifact</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_guide_cas_multiplier"></a><span class="emphasis"><em>CAS Multiplier</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">A component may add additional CASes into the workflow. This may be useful to break up a large |
| artifact into smaller units, or to create a new CAS that collects information from multiple other |
| CASes. See <a href="tutorials_and_users_guides.html#ugr.tug.cm" class="olink">Chapter 7, <i>CAS Multiplier Developer's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; "><a name="ugr.project_overview_xmi_emf"></a><span class="emphasis"><em>XMI and EMF Interoperability</em></span> |
| </td><td style="">The UIMA Type system and the contents of the CAS itself can be externalized using the XMI |
| standard for XML MetaData. Eclipse Modeling Framework (EMF) tooling can be used to develop |
| applications that use this information. See |
| <a href="tutorials_and_users_guides.html#ugr.tug.xmi_emf" class="olink">Chapter 8, <i>XMI and EMF Interoperability</i></a>.</td></tr></tbody></table> |
| </div> |
| </div> |
| |
| <div class="section" title="1.1.4. Tools Users' Guides"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_tool_guides">1.1.4. Tools Users' Guides</h3></div></div></div> |
| |
| |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="col1"><col class="col2"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_component_descriptor_editor"></a><span class="emphasis"><em>Component Descriptor Editor</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes the features of the Component Descriptor Editor Tool. This tool provides a GUI for |
| specifying the details of UIMA component descriptors, including those for Analysis Engines |
| (primitive and aggregate), Collection Readers, CAS Consumers and Type Systems. See |
| <a href="tools.html#ugr.tools.cde" class="olink">Chapter 1, <i>Component Descriptor Editor User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_cpe_configurator"></a><span class="emphasis"><em>Collection Processing Engine Configurator</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes the User Interfaces and features of the CPE Configurator tool. This tool allows the |
| user to select and configure the components of a Collection Processing Engine and then to run the |
| engine. See |
| <a href="tools.html#ugr.tools.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Configurator User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_pear_packager"></a><span class="emphasis"><em>Pear Packager</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes how to use the PEAR Packager utility. This utility enables developers to produce an |
| archive file for an analysis engine that includes all required resources for installing that |
| analysis engine in another UIMA environment. See |
| <a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter 9, <i>PEAR Packager User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_pear_installer"></a><span class="emphasis"><em>Pear Installer</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes how to use the PEAR Installer utility. This utility installs and verifies an |
| analysis engine from an archive file (PEAR) with all its resources in the right place so it is ready to |
| run. See |
| <a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter 11, <i>PEAR Installer User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_pear_merger"></a><span class="emphasis"><em>Pear Merger</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes how to use the Pear Merger utility, which does a simple merge of multiple PEAR |
| packages into one. See |
| <a href="tools.html#ugr.tools.pear.merger" class="olink">Chapter 12, <i>PEAR Merger User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_document_analyzer"></a><span class="emphasis"><em>Document Analyzer</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes the features of a tool for applying a UIMA analysis engine to a set of documents and |
| viewing the results. See |
| <a href="tools.html#ugr.tools.doc_analyzer" class="olink">Chapter 3, <i>Document Analyzer User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_cas_visual_debugger"></a><span class="emphasis"><em>CAS Visual Debugger</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes the features of a tool for viewing the detailed structure and contents of a CAS. Good |
| for debugging. See |
| <a href="tools.html#ugr.tools.cvd" class="olink">Chapter 5, <i>CAS Visual Debugger</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_tools_jcasgen"></a><span class="emphasis"><em>JCasGen</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Describes how to run the JCasGen utility, which automatically builds Java classes that |
| correspond to a particular CAS Type System. See |
| <a href="tools.html#ugr.tools.jcasgen" class="olink">Chapter 8, <i>JCasGen User's Guide</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; "><a name="ugr.project_overview_tools_xml_cas_viewer"></a><span class="emphasis"><em>XML CAS Viewer</em></span> |
| </td><td style="">Describes how to run the supplied viewer to view externalized XML forms of CASes. This viewer |
| is used in the examples. See |
| <a href="tools.html#ugr.tools.annotation_viewer" class="olink">Chapter 4, <i>Annotation Viewer</i></a>.</td></tr></tbody></table> |
| </div> |
| </div> |
| |
| <div class="section" title="1.1.5. References"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_reference">1.1.5. References</h3></div></div></div> |
| |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="col1"><col class="col2"></colgroup><tbody><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_javadocs"></a><span class="emphasis"><em>Introduction to the UIMA API Javadocs</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Javadocs detailing the UIMA programming interfaces. See |
| <a href="references.html#ugr.ref.javadocs" class="olink">Chapter 1, <i>Javadocs</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_xml_ref_component_descriptor"></a><span class="emphasis"><em>XML: Component Descriptor</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides detailed XML format for all the UIMA component descriptors, except the CPE (see |
| next). See |
| <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter 2, <i>Component Descriptor Reference</i></a>.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_xml_ref_collection_processing_engine_descriptor"></a><span class="emphasis"><em>XML: Collection Processing Engine Descriptor</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides detailed XML format for the Collection Processing Engine descriptor. See |
| <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter 3, <i>Collection Processing Engine Descriptor Reference</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_cas"></a><span class="emphasis"><em>CAS</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides a detailed description of the principal CAS interface. See |
| <a href="references.html#ugr.ref.cas" class="olink">Chapter 4, <i>CAS Reference</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_jcas"></a><span class="emphasis"><em>JCas</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides details on the JCas, a native Java interface to the CAS. See |
| <a href="references.html#ugr.ref.jcas" class="olink">Chapter 5, <i>JCas Reference</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; "><a name="ugr.project_overview_ref_pear"></a><span class="emphasis"><em>PEAR Reference</em></span> |
| </td><td style="border-bottom: 0.5pt solid black; ">Provides a detailed description of the deployable archive format for UIMA |
| components. See |
| <a href="references.html#ugr.ref.pear" class="olink">Chapter 6, <i>PEAR Reference</i></a></td></tr><tr><td style="border-right: 0.5pt solid black; "><a name="ugr.project_overview_xmi_cas_serialization"></a><span class="emphasis"><em>XMI CAS Serialization Reference</em></span> |
| </td><td style="">Provides a detailed description of the XMI format used to serialize the contents of |
| a CAS. See |
| <a href="references.html#ugr.ref.xmi" class="olink">Chapter 7, <i>XMI CAS Serialization Reference</i></a></td></tr></tbody></table> |
| </div> |
| </div> |
| |
| <div class="section" title="1.1.6. Version 3 User's guide"><div class="titlepage"><div><div><h3 class="title" id="ugr.project_overview_v3">1.1.6. Version 3 User's guide</h3></div></div></div> |
| |
| <p>This book describes Version 3's features, capabilities, and differences from Version 2. |
| </p> |
| </div> |
| |
| </div> |
| |
| <div class="section" title="1.2. How to use the Documentation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_doc_use">1.2. How to use the Documentation</h2></div></div></div> |
| |
| |
| <div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"> |
| <p>Explore this chapter to get an overview of the different documents that are included with Apache UIMA.</p> |
| </li><li class="listitem"> |
| <p> Read <a href="overview_and_setup.html#ugr.ovv.conceptual" class="olink">Chapter 2, <i>UIMA Conceptual Overview</i></a> to get a broad |
| view of the basic UIMA concepts and philosophy with reference to the other documents included in the |
| documentation set which provide greater detail. </p> |
| </li><li class="listitem"> |
| <p> For more general information on the UIMA architecture and how it has been used, refer to the IBM Systems |
| Journal special issue on Unstructured Information Management, on-line at <a class="ulink" href="http://www.research.ibm.com/journal/sj43-3.html" target="_top">http://www.research.ibm.com/journal/sj43-3.html</a>, or to the section of the |
| Apache UIMA project website where other publications are listed. </p> |
| </li><li class="listitem"> |
| <p> Set up Apache UIMA in your Eclipse environment. To do this, follow the instructions in <a class="xref" href="#ugr.ovv.eclipse_setup" title="Chapter 3. Setting up the Eclipse IDE to work with UIMA">Chapter 3, <i>Setting up the Eclipse IDE to work with UIMA</i></a>. </p> |
| </li><li class="listitem"> |
| <p> Develop sample UIMA annotators, run them and explore the results. Read <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a> and follow it like a tutorial |
| to learn how to develop your first UIMA annotator and set up and run your first UIMA analysis engines. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> As part of this you will use a few tools including |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"> |
| <p> The UIMA Component Descriptor Editor, described in more detail in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.cde" class="olink">Chapter 1, <i>Component Descriptor Editor User's Guide</i></a> and </p> |
| </li><li class="listitem"> |
| <p> The Document Analyzer, described in more detail in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.doc_analyzer" class="olink">Chapter 3, <i>Document Analyzer User's Guide</i></a>. </p> |
| </li></ul></div><p> </p> |
| |
| </li><li class="listitem"> |
| <p>While following along in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a>, reference documents that may help are: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"> |
| <p> <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter 2, <i>Component Descriptor Reference</i></a> for understanding the analysis |
| engine descriptors </p> |
| </li><li class="listitem"> |
| <p> <a href="references.html#d5e1" class="olink">UIMA References</a> |
| <a href="references.html#ugr.ref.jcas" class="olink">Chapter 5, <i>JCas Reference</i></a> for |
| understanding the JCas </p> |
| </li></ul></div><p> </p> |
| </li></ul></div><p> </p> |
| </li><li class="listitem"> |
| <p> Learn how to create, run and manage a UIMA analysis engine as part of an application. |
| Connect your analysis engine to the provided semantic search engine to learn how a |
| complete analysis and search application may be built with Apache UIMA. <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application" class="olink">Chapter 3, <i>Application Developer's Guide</i></a> will guide you |
| through this process. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> As part of this you will use the Document Analyzer (described in more detail in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.doc_analyzer" class="olink">Chapter 3, <i>Document Analyzer User's Guide</i></a>) and semantic search |
| GUI tools (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <span class="olink">????</span>). </p> |
| </li></ul></div><p> </p> |
| </li><li class="listitem"> |
| <p> Pat yourself on the back. Congratulations! If you reached this step successfully, then you have an |
| appreciation for the UIMA analysis engine architecture. You will have built a few sample annotators, |
| deployed UIMA analysis engines to analyze a few documents, searched over the results using the built-in |
| semantic search engine and viewed the results through a built-in viewer |
| – all as part of a simple but complete application. </p> |
| </li><li class="listitem"> |
| <p> Develop and run a Collection Processing Engine (CPE) to analyze and gather the results of an entire |
| collection of documents. <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Developer's Guide</i></a> will guide you through this process. |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> As part of this you will use the CPE Configurator tool. For details see <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Configurator User's Guide</i></a>. </p> |
| </li><li class="listitem"> |
| <p> You will also learn about CPE Descriptors. The detailed format for these may be found in <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter 3, <i>Collection Processing Engine Descriptor Reference</i></a>. </p> |
| </li></ul></div><p> </p> |
| </li><li class="listitem"> |
| <p> Learn how to package up an analysis engine for easy installation into another UIMA environment. |
| <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> |
| <a href="tools.html#ugr.tools.pear.packager" class="olink">Chapter 9, <i>PEAR Packager User's Guide</i></a> and <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.pear.installer" class="olink">Chapter 11, <i>PEAR Installer User's Guide</i></a> will teach you how to |
| create UIMA analysis engine archives so that you can easily share your components with a broader |
| community. </p> |
| </li></ol></div> |
| </div> |
| |
| <div class="section" title="1.3. Changes from UIMA Version 2"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_changes_from_previous">1.3. Changes from UIMA Version 2</h2></div></div></div> |
| |
| <p>See the separate document Version 3 User's Guide.</p> |
| </div> |
| |
| <div class="section" title="1.4. Migrating existing UIMA pipelines from Version 2 to Version 3"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_migrating_from_v2_to_v3">1.4. Migrating existing UIMA pipelines from Version 2 to Version 3</h2></div></div></div> |
| |
| <p>The format of JCas classes changed when going from version 2 to version 3. |
| If you had JCas classes for user types, these need to be regenerated using the |
| version 3 JCasGen tooling or Maven plugin. Alternatively, these can be |
| migrated without regenerating; the migration preserves any customization |
| users may have added to the JCas classes.</p> |
| |
| <p>The Version 3 User's Guide has a chapter detailing the migration, including |
| a description of the migration tool to aid in this process.</p> |
| </div> |
| |
| <div class="section" title="1.5. Apache UIMA Summary"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_summary">1.5. Apache UIMA Summary</h2></div></div></div> |
| |
| <div class="section" title="1.5.1. General"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.summary.general">1.5.1. General</h3></div></div></div> |
| |
| <p>UIMA supports the development, discovery, composition and deployment of multi-modal |
| analytics for the analysis of unstructured information and its integration with search |
| technologies.</p> |
| |
| <p>Apache UIMA includes APIs and tools for creating analysis components. Examples of analysis components include |
| tokenizers, summarizers, categorizers, parsers, named-entity detectors etc. Tutorial examples are |
| provided with Apache UIMA; additional components are available from the community. </p> |
| </div> |
| <div class="section" title="1.5.2. Programming Language Support"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.summary.programming_language_support">1.5.2. Programming Language Support</h3></div></div></div> |
| |
| <p>UIMA supports the development and integration of analysis algorithms developed in different |
| programming languages. </p> |
| |
| <p>The Apache UIMA project is both a Java framework and a matching C++ |
| enablement layer, which allows annotators to be written in C++ and have access to a C++ version of the CAS. The |
| C++ enablement layer also enables annotators to be written in Perl, Python, and TCL, and to interoperate with |
| those written in other languages. |
| </p> |
| |
| </div> |
| <div class="section" title="1.5.3. Multi-Modal Support"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.general.summary.multi_modal_support">1.5.3. Multi-Modal Support</h3></div></div></div> |
| |
| <p>The UIMA architecture supports the development, discovery, composition and deployment of |
| multi-modal analytics, including text, audio and video. <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter 5, <i>Annotations, Artifacts, and Sofas</i></a> discusses this in more |
| detail.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="1.6. Summary of Apache UIMA Capabilities"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.project_overview_summary_sdk_capabilities">1.6. Summary of Apache UIMA Capabilities</h2></div></div></div> |
| |
| <div class="informaltable"> |
| <table style="border-collapse: collapse;border-top: 0.5pt solid black; border-bottom: 0.5pt solid black; border-left: 0.5pt solid black; border-right: 0.5pt solid black; "><colgroup><col class="col1"><col class="col2"></colgroup><tbody><tr><td class="tableSubhead" style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Module</td><td class="tableSubhead" style="border-bottom: 0.5pt solid black; ">Description</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">UIMA Framework Core</td><td style="border-bottom: 0.5pt solid black; "> |
| <p>A framework integrating core functions for creating, deploying, running and managing UIMA |
| components, including analysis engines and Collection Processing Engines in collocated and/or |
| distributed configurations. </p> |
| |
| <p>The framework includes an implementation of core components for transport layer adaptation, |
| CAS management, workflow management based on declarative specifications, resource management, |
| configuration management, logging, and other functions.</p> |
| </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">C++ and other programming language Interoperability</td><td style="border-bottom: 0.5pt solid black; "> |
| <p>Includes C++ CAS and supports the creation of UIMA compliant C++ components that can be |
| deployed in the UIMA run-time through a built-in JNI adapter. This includes high-speed binary |
| serialization.</p> |
| |
| <p>Includes support for creating service-based UIMA engines. This is ideal for |
| wrapping existing code written in different languages.</p> |
| </td></tr><tr><td class="tableSubhead" style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Framework Services and APIs</td><td class="tableSubhead" style="border-bottom: 0.5pt solid black; ">Note that the interfaces of these components are available to the developer, |
| but their implementations may differ between implementations of the UIMA |
| framework.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">CAS</td><td style="border-bottom: 0.5pt solid black; ">These classes provide the developer with typed access to the Common Analysis Structure (CAS), |
| including type system schema, elements, subjects of analysis and indices. The multiple subjects of |
| analysis (Sofas) mechanism supports the independent or simultaneous analysis of multiple views of |
| the same artifacts (e.g. documents), supporting multi-lingual and multi-modal analysis.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">JCas</td><td style="border-bottom: 0.5pt solid black; ">An alternative interface to the CAS, providing Java-based UIMA Analysis components with |
| native Java object access to CAS types and their attributes or features, using the |
| JavaBeans conventions of getters and setters.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Collection Processing Management (CPM)</td><td style="border-bottom: 0.5pt solid black; ">Core functions for running UIMA collection processing engines in collocated and/or |
| distributed configurations. The CPM provides scalability across parallel processing pipelines, |
| check-pointing, performance monitoring and recoverability.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Resource Manager</td><td style="border-bottom: 0.5pt solid black; ">Provides UIMA components with run-time access to external resources handling capabilities |
| such as resource naming, sharing, and caching. </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Configuration Manager</td><td style="border-bottom: 0.5pt solid black; ">Provides UIMA components with run-time access to their configuration parameter settings. |
| </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Logger</td><td style="border-bottom: 0.5pt solid black; ">Provides access to a common logging facility.</td></tr><tr><td class="tableSubhead" style="border-bottom: 0.5pt solid black; " colspan="2" align="center"> Tools and Utilities |
| </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">JCasGen</td><td style="border-bottom: 0.5pt solid black; ">Utility for generating a Java object model for CAS types from a UIMA XML type system |
| definition.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Saving and Restoring CAS contents</td><td style="border-bottom: 0.5pt solid black; ">APIs in the core framework support saving and restoring the contents of a CAS to streams |
| in multiple formats, including XMI, binary, and compressed forms. |
| These APIs are collected into the CasIOUtils class.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">PEAR Packager for Eclipse</td><td style="border-bottom: 0.5pt solid black; ">Tool for building a UIMA component archive to facilitate porting, registering, installing and |
| testing components.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">PEAR Installer</td><td style="border-bottom: 0.5pt solid black; ">Tool for installing and verifying a UIMA component archive in a UIMA installation.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">PEAR Merger</td><td style="border-bottom: 0.5pt solid black; ">Utility that combines multiple PEARs into one.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Component Descriptor Editor</td><td style="border-bottom: 0.5pt solid black; ">Eclipse Plug-in for specifying and configuring component descriptors for UIMA analysis |
| engines as well as other UIMA component types including Collection Readers and CAS |
| Consumers.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">CPE Configurator</td><td style="border-bottom: 0.5pt solid black; ">Graphical tool for configuring Collection Processing Engines and applying them to |
| collections of documents.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Java Annotation Viewer</td><td style="border-bottom: 0.5pt solid black; ">Viewer for exploring annotations and related CAS data.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">CAS Visual Debugger</td><td style="border-bottom: 0.5pt solid black; ">GUI Java application that provides developers with a detailed visual view of the contents of a |
| CAS.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Document Analyzer</td><td style="border-bottom: 0.5pt solid black; ">GUI Java application that applies analysis engines to sets of documents and shows results in a |
| viewer.</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">CAS Editor</td><td style="border-bottom: 0.5pt solid black; ">Eclipse plug-in that lets you edit the contents of a CAS</td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">UIMA Pipeline Eclipse Launcher</td><td style="border-bottom: 0.5pt solid black; ">Eclipse plug-in that lets you configure Eclipse launchers for UIMA pipelines</td></tr><tr><td class="tableSubhead" style="border-bottom: 0.5pt solid black; " colspan="2" align="center"> Example Analysis |
| Components </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Database Writer</td><td style="border-bottom: 0.5pt solid black; ">CAS Consumer that writes the content of selected CAS types into a relational database, using |
| JDBC. This code is in cpe/PersonTitleDBWriterCasConsumer. </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Annotators</td><td style="border-bottom: 0.5pt solid black; "> Set of simple annotators meant for pedagogical purposes. Includes: Date/time, Room-number, |
| Regular expression, Tokenizer, and Meeting-finder annotator. There are sample CAS Multipliers |
| as well. </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">Flow Controllers</td><td style="border-bottom: 0.5pt solid black; "> There is a sample flow-controller based on the whiteboard concept of sending the CAS to whatever |
| annotator hasn't yet processed it, when that annotator's inputs are available in the CAS. </td></tr><tr><td style="border-right: 0.5pt solid black; border-bottom: 0.5pt solid black; ">XMI Collection Reader, CAS Consumer</td><td style="border-bottom: 0.5pt solid black; ">Reads and writes the CAS in the XMI format</td></tr><tr><td style="border-right: 0.5pt solid black; ">File System Collection Reader</td><td style=""> Simple Collection Reader for pulling documents from the file system and initializing CASes. |
| </td></tr></tbody></table> |
| </div> |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 2. UIMA Conceptual Overview" id="ugr.ovv.conceptual"><div class="titlepage"><div><div><h2 class="title">Chapter 2. UIMA Conceptual Overview</h2></div></div></div> |
| |
| |
| <p>UIMA is an open, industrial-strength, scalable and extensible platform for |
| creating, integrating and deploying unstructured information management solutions |
| from powerful text or multi-modal analysis and search components. </p> |
| |
| <p>The Apache UIMA project is an implementation of the Java UIMA framework available |
| under the Apache License, providing a common foundation for industry and academia to |
| collaborate and accelerate the world-wide development of technologies critical for |
| discovering vital knowledge present in the fastest growing sources of information |
| today.</p> |
| |
| <p>This chapter presents an introduction to many essential UIMA concepts. It is meant to |
| provide a broad overview to give the reader a quick sense of UIMA's basic |
| architectural philosophy and the UIMA SDK's capabilities. </p> |
| |
| <p>This chapter provides a general orientation to UIMA and makes liberal reference to |
| the other chapters in the UIMA SDK documentation set, where the reader may find detailed |
| treatments of key concepts and development practices. It may be useful to refer to the <a href="overview_and_setup.html#ugr.glossary" class="olink">Glossary</a> to become familiar |
| with the terminology in this overview.</p> |
| |
| <div class="section" title="2.1. UIMA Introduction"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.uima_introduction">2.1. UIMA Introduction</h2></div></div></div> |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.fig.bridge"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="545"><tr><td><img src="images/overview-and-setup/conceptual_overview_files/image002.png" width="545" alt="Picture of a bridge between unstructured information artifacts and structured metadata about those artifacts"></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.1. UIMA helps you build the bridge between the unstructured and structured |
| worlds</b></p></div><br class="figure-break"> |
| |
| <p> Unstructured information represents the largest, most current and fastest |
| growing source of information available to businesses and governments. The web is just |
| the tip of the iceberg. Consider the mounds of information hosted in the enterprise and |
| around the world and across different media including text, voice and video. The |
| high-value content in these vast collections of unstructured information is, |
| unfortunately, buried in lots of noise. Searching for what you need or doing |
| sophisticated data mining over unstructured information sources presents new |
| challenges. </p> |
| |
| <p>An unstructured information management (UIM) application may be generally |
| characterized as a software system that analyzes large volumes of unstructured |
| information (text, audio, video, images, etc.) to discover, organize and deliver |
| relevant knowledge to the client or application end-user. An example is an application |
| that processes millions of medical abstracts to discover critical drug interactions. |
| Another example is an application that processes tens of millions of documents to |
| discover key evidence indicating probable competitive threats. </p> |
| |
| <p>First and foremost, the unstructured data must be analyzed to interpret, detect |
| and locate concepts of interest, for example, named entities like persons, |
| organizations, locations, facilities, products etc., that are not explicitly tagged |
| or annotated in the original artifact. More challenging analytics may detect things |
| like opinions, complaints, threats or facts. And then there are relations, for |
| example, located in, finances, supports, purchases, repairs etc. The list of concepts |
| important for applications to discover in unstructured content is large, varied and |
| often domain specific. |
| Many different component analytics may solve different parts of the overall analysis task. |
| These component analytics must interoperate and must be easily combined to facilitate |
| the development of UIM applications.</p> |
| |
| <p>The results of analysis are used to populate structured forms so that conventional |
| data processing and search technologies |
| like search engines, database engines or OLAP |
| (On-Line Analytical Processing, or Data Mining) engines |
| can efficiently deliver the newly discovered content in response to the client requests |
| or queries.</p> |
| |
| <p>In analyzing unstructured content, UIM applications make use of a variety of |
| analysis technologies including:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>Statistical and rule-based Natural Language Processing |
| (NLP)</p> |
| </li><li class="listitem"><p>Information Retrieval (IR)</p> |
| </li><li class="listitem"><p>Machine learning</p> |
| </li><li class="listitem"><p>Ontologies</p> |
| </li><li class="listitem"><p>Automated reasoning and</p> |
| </li><li class="listitem"><p>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</p> |
| </li></ul></div> |
| |
| <p>Specific analysis capabilities using these technologies are developed |
| independently using different techniques, interfaces and platforms. |
| </p> |
| |
| <p>The bridge from the unstructured world to the structured world is built through the |
| composition and deployment of these analysis capabilities. This integration is often |
| a costly challenge. </p> |
| |
| <p>The Unstructured Information Management Architecture (UIMA) is an architecture |
| and software framework that helps you build that bridge. It supports creating, |
| discovering, composing and deploying a broad range of analysis capabilities and |
| linking them to structured information services.</p> |
| |
| <p>UIMA allows development teams to match the right skills with the right parts of a |
| solution and helps enable rapid integration across technologies and platforms using a |
| variety of different deployment options. These range from tightly-coupled |
| deployments for high-performance, single-machine, embedded solutions to parallel |
| and fully distributed deployments for highly flexible and scalable |
| solutions.</p> |
| |
| </div> |
| |
| <div class="section" title="2.2. The Architecture, the Framework and the SDK"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.architecture_framework_sdk">2.2. The Architecture, the Framework and the SDK</h2></div></div></div> |
| |
| <p>UIMA is a software architecture which specifies component interfaces, data |
| representations, design patterns and development roles for creating, describing, |
| discovering, composing and deploying multi-modal analysis capabilities.</p> |
| |
| <p>The <span class="bold"><strong>UIMA framework</strong></span> provides a run-time |
| environment in which developers can plug in their UIMA component implementations and |
| with which they can build and deploy UIM applications. The framework is not specific to |
| any IDE or platform. Apache hosts a Java implementation of the UIMA framework and a matching C++ |
| enablement layer.</p> |
| |
| <p>The <span class="bold"><strong>UIMA Software Development Kit (SDK)</strong></span> |
| includes the UIMA framework, plus tools and utilities for using UIMA. Some of the |
| tooling supports an Eclipse-based ( <a class="ulink" href="http://www.eclipse.org/" target="_top">http://www.eclipse.org/</a>) |
| development environment. </p> |
| |
| </div> |
| |
| <div class="section" title="2.3. Analysis Basics"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.analysis_basics">2.3. Analysis Basics</h2></div></div></div> |
| |
| <div class="note" title="Key UIMA Concepts Introduced in this Section:" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Key UIMA Concepts Introduced in this Section:</h3><p>Analysis Engine, Document, Annotator, Annotator |
| Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA |
| Context.</p> |
| </div> |
| |
| <div class="section" title="2.3.1. Analysis Engines, Annotators & Results"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">2.3.1. Analysis Engines, Annotators & Results</h3></div></div></div> |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.metadata_in_cas"></a><div class="figure-contents"> |
| |
| <div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="594"><tr><td align="center"><img src="images/overview-and-setup/conceptual_overview_files/image004.png" align="middle" width="594" alt="Picture of some text, with a hierarchy of discovered metadata about words in the text, including some image of a person as metadata about that name."></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.2. Objects represented in the Common Analysis Structure (CAS)</b></p></div><br class="figure-break"> |
| |
| <p>UIMA is an architecture in which basic building blocks called Analysis Engines |
| (AEs) are composed to analyze a document and infer and record descriptive attributes |
| about the document as a whole, and/or about regions therein. This descriptive |
| information, produced by AEs, is generally referred to as <span class="bold"><strong> |
| analysis results</strong></span>. Analysis results typically represent meta-data |
| about the document content. One way to think about AEs is as software agents that |
| automatically discover and record meta-data about original content.</p> |
| |
| <p>UIMA supports the analysis of different modalities including text, audio and |
| video. The majority of examples we provide are for text. We use the term <span class="bold"><strong>document, </strong></span>therefore, to generally refer to any unit of |
| content that an AE may process, whether it is a text document or a segment of audio, for |
| example. See the <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.mvs" class="olink">Chapter 6, <i>Multiple CAS Views of an Artifact</i></a> for more information on multimodal processing |
| in UIMA.</p> |
| |
| <p>Analysis results include different statements about the content of a document. |
| For example, the following is an assertion about the topic of a document:</p> |
| |
| |
| <pre class="programlisting">(1) The Topic of document D102 is "CEOs and Golf".</pre> |
| |
| <p>Analysis results may include statements describing regions more granular than |
| the entire document. We use the term <span class="bold"><strong>span</strong></span> to |
| refer to a sequence of characters in a text document. Consider that a document with the |
| identifier D102 contains a span, <span class="quote">“<span class="quote">Fred Centers</span>”</span> starting at |
| character position 101. An AE that can detect persons in text may represent the |
| following statement as an analysis result:</p> |
| |
| |
| <pre class="programlisting">(2) The span from position 101 to 112 in document D102 denotes a Person</pre> |
| |
| <p>In both statements 1 and 2 above there is a special pre-defined term or what we call |
| in UIMA a <span class="bold"><strong>Type</strong></span>. They are |
| <span class="emphasis"><em>Topic</em></span> and <span class="emphasis"><em>Person</em></span> respectively. |
| UIMA types characterize the kinds of results that an AE may create – more on |
| types later.</p> |
| |
| <p>Other analysis results may relate two statements. For example, an AE might |
| record in its results that two spans are both referring to the same person:</p> |
| |
| |
| <pre class="programlisting">(3) The Person denoted by span 101 to 112 and |
| the Person denoted by span 141 to 143 in document D102 |
| refer to the same Entity.</pre> |
| |
| <p>The above statements are some examples of the kinds of results that AEs may record |
| to describe the content of the documents they analyze. These are not meant to indicate |
| the form or syntax with which these results are captured in UIMA – more on that |
| later in this overview.</p> |
| |
| <p>The UIMA framework treats analysis engines as pluggable, composable,
| discoverable, managed objects. At the heart of AEs are the analysis algorithms that |
| do all the work to analyze documents and record analysis results. </p> |
| |
| <p>UIMA provides a basic component type intended to house the core analysis |
| algorithms running inside AEs. Instances of this component are called <span class="bold"><strong>Annotators</strong></span>. The analysis algorithm developer's |
| primary concern therefore is the development of annotators. The UIMA framework |
| provides the necessary methods for taking annotators and creating analysis |
| engines.</p> |
| |
| <p>In UIMA the person who codes analysis algorithms takes on the role of the |
| <span class="bold"><strong>Annotator Developer</strong></span>. <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a> |
| in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> will take the reader |
| through the details involved in creating UIMA annotators and analysis |
| engines.</p> |
| |
| <p>At the most primitive level, an AE wraps an annotator, adding the necessary APIs and
| infrastructure for the composition and deployment of annotators within the UIMA |
| framework. The simplest AE contains exactly one annotator at its core. Complex AEs |
| may contain a collection of other AEs each potentially containing within them other |
| AEs. </p> |
| </div> |
| |
| <div class="section" title="2.3.2. Representing Analysis Results in the CAS"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.representing_results_in_cas">2.3.2. Representing Analysis Results in the CAS</h3></div></div></div> |
| |
| |
| <p>How annotators represent and share their results is an important part of the UIMA |
| architecture. UIMA defines a <span class="bold"><strong>Common Analysis Structure |
| (CAS)</strong></span> precisely for these purposes.</p> |
| |
| <p>The CAS is an object-based data structure that allows the representation of |
| objects, properties and values. Object types may be related to each other in a |
| single-inheritance hierarchy. The CAS logically (if not physically) contains the |
| document being analyzed. Analysis developers share and record their analysis |
| results in terms of an object model within the CAS. <sup>[<a name="d5e551" href="#ftn.d5e551" class="footnote">1</a>]</sup> </p> |
| |
| <p>The UIMA framework includes an implementation and interfaces to the CAS. For a |
| more detailed description of the CAS and its interfaces see <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.cas" class="olink">Chapter 4, <i>CAS Reference</i></a>.</p> |
| |
| <p>A CAS that logically contains statement 2 (repeated here for your |
| convenience)</p> |
| |
| |
| <pre class="programlisting">(2) The span from position 101 to 112 in document D102 denotes a Person</pre> |
| |
| <p>would include objects of the Person type. For each person found in the body of a |
| document, the AE would create a Person object in the CAS and link it to the span of text |
| where the person was mentioned in the document.</p> |
| |
| <p>While the CAS is a general purpose data structure, UIMA defines a |
| few basic types and affords the developer the ability to extend these to define an |
| arbitrarily rich <span class="bold"><strong>Type System</strong></span>. You can think of a |
| type system as an object schema for the CAS.</p> |
| |
| <p>A type system defines the various types of objects that may be discovered in |
| documents by AEs that subscribe to that type system.</p>
| |
| <p>As suggested above, Person may be defined as a type. Types have properties or |
| <span class="bold"><strong>features</strong></span>. So for example, |
| <span class="emphasis"><em>Age</em></span> and <span class="emphasis"><em>Occupation</em></span> may be defined as |
| features of the Person type.</p> |
| |
| <p>Other types might be <span class="emphasis"><em>Organization, Company, Bank, Facility, Money, |
| Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun |
| Phrase, Verb, Color, Parse Node, Feature Weight Array</em></span> etc.</p> |
| |
| <p>There are no limits to the different types that may be defined in a type system. A |
| type system is domain and application specific.</p> |
| |
| <p>Types in a UIMA type system may be organized into a taxonomy. For example, |
| <span class="emphasis"><em>Company</em></span> may be defined as a subtype of |
| <span class="emphasis"><em>Organization</em></span>. <span class="emphasis"><em>NounPhrase</em></span> may be a |
| subtype of a <span class="emphasis"><em>ParseNode</em></span>.</p> |
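| <p>The taxonomy described above can be pictured with a small plain-Java model. This is
| only a sketch: <code class="code">SketchType</code> and <code class="code">TaxonomyDemo</code> are
| hypothetical names invented for illustration and are not part of the UIMA API, and the
| sketch assumes that a root type simply has no supertype.</p>
| <pre class="programlisting">
```java
/** Hypothetical model of a single-inheritance type taxonomy; not the UIMA API. */
final class SketchType {
    final String name;
    final SketchType supertype; // null marks a root type in this sketch

    SketchType(String name, SketchType supertype) {
        this.name = name;
        this.supertype = supertype;
    }

    /** Returns true if this type is 'other' or inherits from it, directly or indirectly. */
    boolean isA(SketchType other) {
        // Walk up the single-inheritance chain until 'other' is found or the root is passed.
        for (SketchType t = this; t != null; t = t.supertype) {
            if (t == other) {
                return true;
            }
        }
        return false;
    }
}

final class TaxonomyDemo {
    // The example types named in the text above.
    static final SketchType ORGANIZATION = new SketchType("Organization", null);
    static final SketchType COMPANY = new SketchType("Company", ORGANIZATION);
    static final SketchType PARSE_NODE = new SketchType("ParseNode", null);
    static final SketchType NOUN_PHRASE = new SketchType("NounPhrase", PARSE_NODE);

    public static void main(String[] args) {
        System.out.println(COMPANY.isA(ORGANIZATION));     // true
        System.out.println(NOUN_PHRASE.isA(ORGANIZATION)); // false
    }
}
```
| </pre>
| <p>Because each type has at most one supertype, checking whether one type falls under
| another only requires walking a single chain of parents.</p>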
| |
| <div class="section" title="2.3.2.1. The Annotation Type"><div class="titlepage"><div><div><h4 class="title" id="ugr.ovv.conceptual.annotation_type">2.3.2.1. The Annotation Type</h4></div></div></div> |
| |
| |
| <p>A general and common type used in artifact analysis and from which additional |
| types are often derived is the <span class="bold"><strong>annotation</strong></span> |
| type. </p> |
| |
| <p>The annotation type is used to annotate or label regions of an artifact. Common |
| artifacts are text documents, but they can be other things, such as audio streams. |
| The annotation type for text includes two features, namely |
| <span class="emphasis"><em>begin</em></span> and <span class="emphasis"><em>end</em></span>. Values of these |
| features represent integer offsets in the artifact and delimit a span. Any |
| particular annotation object identifies the span it annotates with the |
| <span class="emphasis"><em>begin</em></span> and <span class="emphasis"><em>end</em></span> features.</p> |
| |
| <p>The key idea here is that the annotation type is used to identify and label or |
| <span class="quote">“<span class="quote">annotate</span>”</span> a specific region of an artifact.</p> |
| |
| <p>Consider that the Person type is defined as a subtype of annotation. An |
| annotator, for example, can create a Person annotation to record the discovery of a |
| mention of a person between position 141 and 143 in document D102. The annotator can |
| create another person annotation to record the detection of a mention of a person in |
| the span between positions 101 and 112. </p> |
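| <p>As a rough illustration of how <span class="emphasis"><em>begin</em></span> and
| <span class="emphasis"><em>end</em></span> delimit a span, the following plain-Java sketch
| rebuilds the two Person annotations from the example. <code class="code">SpanAnnotation</code>
| and <code class="code">SpanDemo</code> are hypothetical names, the document text is made up, and
| the sketch assumes that <span class="emphasis"><em>end</em></span> is the offset just past the
| last character of the span.</p>
| <pre class="programlisting">
```java
/** Hypothetical stand-in for an annotation object; not the UIMA API. */
final class SpanAnnotation {
    final String typeName;
    final int begin; // offset of the first character of the span
    final int end;   // assumed here: offset just past the last character

    SpanAnnotation(String typeName, int begin, int end) {
        this.typeName = typeName;
        this.begin = begin;
        this.end = end;
    }

    /** Returns the piece of the document that this annotation labels. */
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }
}

final class SpanDemo {
    /** Builds a made-up document with the two mentions at the offsets used in the text. */
    static String buildDocument() {
        StringBuilder sb = new StringBuilder();
        padTo(sb, 101);
        sb.append("Fred Center"); // 11 characters starting at offset 101
        padTo(sb, 141);
        sb.append("He");          // 2 characters starting at offset 141
        sb.append(" plays golf.");
        return sb.toString();
    }

    /** Appends filler characters until the buffer has the requested length. */
    private static void padTo(StringBuilder sb, int length) {
        while (sb.length() != length) {
            sb.append('.');
        }
    }

    public static void main(String[] args) {
        String document = buildDocument();
        SpanAnnotation p1 = new SpanAnnotation("Person", 101, 112);
        SpanAnnotation p2 = new SpanAnnotation("Person", 141, 143);
        System.out.println(p1.coveredText(document)); // Fred Center
        System.out.println(p2.coveredText(document)); // He
    }
}
```
| </pre>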
| </div> |
| <div class="section" title="2.3.2.2. Not Just Annotations"><div class="titlepage"><div><div><h4 class="title" id="ugr.ovv.conceptual.not_just_annotations">2.3.2.2. Not Just Annotations</h4></div></div></div> |
| |
| |
| <p>While the annotation type is a useful type for annotating regions of a |
| document, annotations are not the only kind of types in a CAS. A CAS is a general |
| representation scheme and may store arbitrary data structures to represent the |
| analysis of documents.</p> |
| |
| <p>As an example, consider statement 3 above (repeated here for your |
| convenience).</p> |
| |
| |
| <pre class="programlisting">(3) The Person denoted by span 101 to 112 and |
| the Person denoted by span 141 to 143 in document D102 |
| refer to the same Entity.</pre> |
| |
| <p>This statement mentions two person annotations in the CAS; the first, call it |
| P1 delimiting the span from 101 to 112 and the other, call it P2, delimiting the span |
| from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the |
| same entity. This means that while there are two expressions in the text |
| represented by the annotations P1 and P2, each refers to one and the same person. |
| </p> |
| |
| <p>The Entity type may be introduced into a type system to capture this kind of |
| information. The Entity type is not an annotation. It is intended to represent an |
| object in the domain which may be referred to by different expressions (or |
| mentions) occurring multiple times within a document (or across documents within |
| a collection of documents). The Entity type has a feature named |
| <span class="emphasis"><em>occurrences. </em></span>This feature is used to point to all the |
| annotations believed to label mentions of the same entity.</p> |
| |
| <p>Consider that the spans annotated by P1 and P2 were <span class="quote">“<span class="quote">Fred |
| Center</span>”</span> and <span class="quote">“<span class="quote">He</span>”</span> respectively. The annotator might create |
| a new Entity object called |
| <code class="code">FredCenter</code>. To represent the relationship in statement 3 above, |
| the annotator may link FredCenter to both P1 and P2 by making them values of its |
| <span class="emphasis"><em>occurrences</em></span> feature.</p> |
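| <p>The link between an entity and its mentions might be modeled along the following lines.
| This is a minimal plain-Java sketch, not the CAS API: <code class="code">Mention</code>,
| <code class="code">EntitySketch</code> and <code class="code">CoreferenceDemo</code> are
| hypothetical names, and only the <span class="emphasis"><em>occurrences</em></span> feature
| comes from the text above.</p>
| <pre class="programlisting">
```java
/** Hypothetical annotation-like object that labels a span; not the UIMA API. */
final class Mention {
    final int begin;
    final int end;

    Mention(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical non-annotation object: it has no offsets of its own. */
final class EntitySketch {
    final String label;
    final Mention[] occurrences; // all mentions believed to refer to this entity

    EntitySketch(String label, Mention... occurrences) {
        this.label = label;
        this.occurrences = occurrences.clone();
    }

    /** Returns true if the given mention is one of this entity's occurrences. */
    boolean isReferredToBy(Mention mention) {
        for (Mention occurrence : occurrences) {
            if (occurrence == mention) {
                return true;
            }
        }
        return false;
    }
}

final class CoreferenceDemo {
    static final Mention P1 = new Mention(101, 112); // the name
    static final Mention P2 = new Mention(141, 143); // the pronoun
    static final EntitySketch FRED_CENTER = new EntitySketch("FredCenter", P1, P2);

    public static void main(String[] args) {
        System.out.println(FRED_CENTER.occurrences.length); // 2
        System.out.println(FRED_CENTER.isReferredToBy(P2)); // true
    }
}
```
| </pre>
| <p>Statement 3 is captured by the single entity object pointing at both mentions, so
| neither annotation needs to know about the other.</p>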
| |
| <p> <a class="xref" href="#ugr.ovv.conceptual.metadata_in_cas" title="Figure 2.2. Objects represented in the Common Analysis Structure (CAS)">Figure 2.2, “Objects represented in the Common Analysis Structure (CAS)”</a> also |
| illustrates that an entity may be linked to annotations referring to regions of |
| image documents as well. To do this the annotation type would have to be extended |
| with the appropriate features to point to regions of an image.</p> |
| </div> |
| |
| <div class="section" title="2.3.2.3. Multiple Views within a CAS"><div class="titlepage"><div><div><h4 class="title" id="ugr.ovv.conceptual.multiple_views_within_a_cas">2.3.2.3. Multiple Views within a CAS</h4></div></div></div> |
| |
| |
| <p>UIMA supports the simultaneous analysis of multiple views of a document. This |
| support comes in handy for processing multiple forms of the artifact, for example, the audio |
| and the closed captioned views of a single speech stream, or the tagged and detagged |
| views of an HTML document.</p> |
| |
| <p>AEs analyze one or more views of a document. Each view contains a specific |
| <span class="bold"><strong>subject of analysis (Sofa)</strong></span>, plus a set of
| indexes holding metadata indexed by that view. The CAS, overall, holds one or more |
| CAS Views, plus the descriptive objects that represent the analysis results for |
| each. </p> |
| |
| <p>Another common example of using CAS Views is for different translations of a |
| document. Each translation may be represented with a different CAS View. Each |
| translation may be described by a different set of analysis results. For more |
| details on CAS Views and Sofas see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.mvs" class="olink">Chapter 6, <i>Multiple CAS Views of an Artifact</i></a> and <a href="tutorials_and_users_guides.html#ugr.tug.aas" class="olink">Chapter 5, <i>Annotations, Artifacts, and Sofas</i></a>. </p> |
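| <p>To make the relationship between a CAS, its views and their Sofas more concrete, here
| is a small plain-Java sketch of the translation example. All class names
| (<code class="code">ViewSketch</code>, <code class="code">MultiViewCasSketch</code>,
| <code class="code">MultiViewDemo</code>), the view names and the sample sentences are
| assumptions made for illustration; analysis results are reduced to plain strings.</p>
| <pre class="programlisting">
```java
import java.util.Arrays;

/** Hypothetical view: one subject of analysis plus the results recorded against it. */
final class ViewSketch {
    final String name;
    final String sofa; // the subject of analysis for this view
    private String[] indexed = new String[0];

    ViewSketch(String name, String sofa) {
        this.name = name;
        this.sofa = sofa;
    }

    /** Records one analysis result (simplified to a string) in this view. */
    void addResult(String result) {
        indexed = Arrays.copyOf(indexed, indexed.length + 1);
        indexed[indexed.length - 1] = result;
    }

    /** Returns a copy of the results recorded in this view. */
    String[] results() {
        return indexed.clone();
    }
}

/** Hypothetical CAS that holds one or more named views; not the UIMA API. */
final class MultiViewCasSketch {
    private final ViewSketch[] views;

    MultiViewCasSketch(ViewSketch... views) {
        this.views = views.clone();
    }

    /** Looks up a view by name and fails loudly if it does not exist. */
    ViewSketch view(String name) {
        for (ViewSketch v : views) {
            if (v.name.equals(name)) {
                return v;
            }
        }
        throw new IllegalArgumentException("no such view: " + name);
    }
}

final class MultiViewDemo {
    /** Builds a CAS with two translations of the same made-up document. */
    static MultiViewCasSketch buildCas() {
        ViewSketch english = new ViewSketch("en", "Fred plays golf.");
        ViewSketch german = new ViewSketch("de", "Fred spielt Golf.");
        english.addResult("Person[0,4]");
        english.addResult("Verb[5,10]");
        german.addResult("Person[0,4]");
        return new MultiViewCasSketch(english, german);
    }

    public static void main(String[] args) {
        MultiViewCasSketch cas = buildCas();
        System.out.println(cas.view("en").results().length); // 2
        System.out.println(cas.view("de").sofa);             // Fred spielt Golf.
    }
}
```
| </pre>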
| </div> |
| </div> |
| |
| <div class="section" title="2.3.3. Interacting with the CAS and External Resources"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">2.3.3. Interacting with the CAS and External Resources</h3></div></div></div> |
| |
| |
| |
| <p>The two main interfaces that a UIMA component developer interacts with are the |
| CAS and the UIMA Context.</p> |
| |
| <p>UIMA provides an efficient implementation of the CAS with multiple programming |
| interfaces. Through these interfaces, the annotator developer interacts with the |
| document and reads and writes analysis results. The CAS interfaces provide a suite of |
| access methods that allow the developer to obtain indexed iterators to the different |
| objects in the CAS. See <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.cas" class="olink">Chapter 4, <i>CAS Reference</i></a>. While many objects may exist in a CAS, the annotator |
| developer can obtain a specialized iterator to all Person objects associated with a |
| particular view, for example.</p> |
| |
| <p>For Java annotator developers, UIMA provides the JCas. This interface provides |
| the Java developer with a natural interface to CAS objects. Each type declared in the |
| type system appears as a Java Class; the UIMA framework renders the Person type as a |
| Person class in Java. As the analysis algorithm detects mentions of persons in the |
| documents, it can create Person objects in the CAS. For more details on how to interact |
| with the CAS using this interface, refer to <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.jcas" class="olink">Chapter 5, <i>JCas Reference</i></a>.</p> |
| |
| <p>The component developer, in addition to interacting with the CAS, can access |
| external resources through the framework's resource manager interface |
| called the <span class="bold"><strong>UIMA Context</strong></span>. This interface, among |
| other things, can ensure that different annotators working together in an aggregate |
| flow may share the same instance of an external file or remote resource accessed |
| via its URL, for example. For details on using |
| the UIMA Context see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a>.</p> |
| |
| </div> |
| <div class="section" title="2.3.4. Component Descriptors"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.component_descriptors">2.3.4. Component Descriptors</h3></div></div></div> |
| |
| <p>UIMA defines interfaces for a small set of core components that users of the |
| framework provide implementations for. Annotators and Analysis Engines are two of
| the basic building blocks specified by the architecture. Developers implement them |
| to build and compose analysis capabilities and ultimately applications.</p> |
| |
| <p>There are other components in addition to these, which we will learn about
| later, but for every component specified in UIMA there are two parts required for its |
| implementation:</p> |
| |
| <div class="orderedlist"><ol class="orderedlist" type="1" compact><li class="listitem"><p>the declarative part and</p></li><li class="listitem"><p>the code part.</p></li></ol></div> |
| |
| <p>The declarative part contains metadata describing the component, its |
| identity, structure and behavior and is called the <span class="bold"><strong> |
| Component Descriptor</strong></span>. Component descriptors are represented in XML. |
| The code part implements the algorithm. The code part may be a program in Java.</p> |
| |
| <p>As a developer using the UIMA SDK, to implement a UIMA component it is always the |
| case that you will provide two things: the code part and the Component Descriptor. |
| Note that when you are composing an engine, the code may be already provided in |
| reusable subcomponents. In these cases you may not be developing new code but rather |
| composing an aggregate engine by pointing to other components where the code has been |
| included.</p> |
| |
| <p>Component descriptors are represented in XML and aid in component discovery, |
| reuse, composition and development tooling. The UIMA SDK provides a tool, the
| Component Descriptor Editor, for creating and maintaining component descriptors without
| editing XML directly. This tool is described briefly in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a>, and more
| thoroughly in <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a>
| <a href="tools.html#ugr.tools.cde" class="olink">Chapter 1, <i>Component Descriptor Editor User's Guide</i></a>.</p>
| |
| <p>Component descriptors contain standard metadata including the |
| component's name, author, version, and a reference to the class that |
| implements the component.</p> |
| |
| <p>In addition to these standard fields, a component descriptor identifies the |
| type system the component uses and the types it requires in an input CAS and the types it |
| plans to produce in an output CAS.</p> |
| |
| <p>For example, an AE that detects person types may require as input a CAS that |
| includes a tokenization and deep parse of the document. The descriptor refers to a |
| type system to make the component's input requirements and output types |
| explicit. In effect, the descriptor includes a declarative description of the |
| component's behavior and can be used to aid in component discovery and |
| composition based on desired results. UIMA analysis engines provide an interface |
| for accessing the component metadata represented in their descriptors. For more |
| details on the structure of UIMA component descriptors refer to <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter 2, <i>Component Descriptor Reference</i></a>.</p> |
| |
| </div> |
| </div> |
| <div class="section" title="2.4. Aggregate Analysis Engines"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.aggregate_analysis_engines">2.4. Aggregate Analysis Engines</h2></div></div></div> |
| |
| |
| <div class="note" title="Key UIMA Concepts Introduced in this Section:" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Key UIMA Concepts Introduced in this Section:</h3><p>Aggregate Analysis Engine, Delegate Analysis Engine, |
| Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</p> |
| </div> |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.sample_aggregate"></a><div class="figure-contents"> |
| |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="588"><tr><td><img src="images/overview-and-setup/conceptual_overview_files/image006.png" width="588" alt="Picture of multiple parts (a language identifier, tokenizer, part of speech annotator, shallow parser, and named entity detector) strung together into a flow, and all of them wrapped as a single aggregate object, which produces as annotations the union of all the results of the individual annotator components ( tokens, parts of speech, names, organizations, places, persons, etc.)"></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.3. Sample Aggregate Analysis Engine</b></p></div><br class="figure-break"> |
| |
| <p>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs, |
| however, may be defined to contain other AEs organized in a workflow. These more complex |
| analysis engines are called <span class="bold"><strong>Aggregate Analysis |
| Engines.</strong></span> </p> |
| |
| <p>Annotators tend to perform fairly granular functions, for example language |
| detection, tokenization or part of speech detection. |
| These functions typically address just part of an overall analysis task. A workflow |
| of component engines may be orchestrated to perform more complex tasks.</p> |
| |
| <p>An AE that performs named entity detection, for example, may |
| include a pipeline of annotators starting with language detection feeding |
| tokenization, then part-of-speech detection, then deep grammatical parsing and then |
| finally named-entity detection. Each step in the pipeline is required by the |
| subsequent analysis. For example, the final named-entity annotator can only do its |
| analysis if the previous deep grammatical parse was recorded in the CAS.</p> |
| |
| <p>Aggregate AEs are built to encapsulate potentially complex internal structure |
| and insulate it from users of the AE. In our example, the aggregate analysis engine |
| developer acquires the internal components, defines the necessary flow |
| between them and publishes the resulting AE. Consider the simple example illustrated |
| in <a class="xref" href="#ugr.ovv.conceptual.sample_aggregate" title="Figure 2.3. Sample Aggregate Analysis Engine">Figure 2.3, “Sample Aggregate Analysis Engine”</a> where |
| <span class="quote">“<span class="quote">MyNamed-EntityDetector</span>”</span> is composed of a linear flow of more |
| primitive analysis engines.</p> |
| |
| <p>Users of this AE need not know how it is constructed internally but only need its name |
| and its published input requirements and output types. These must be declared in the |
| aggregate AE's descriptor. Aggregate AE descriptors declare the components
| they contain and a <span class="bold"><strong>flow specification</strong></span>. The flow |
| specification defines the order in which the internal component AEs should be run. The |
| internal AEs specified in an aggregate are also called the <span class="bold"><strong> |
| delegate analysis engines.</strong></span> The term "delegate" is used because aggregate AEs
| are thought to "delegate" functions to their internal AEs.</p> |
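| <p>The idea of an aggregate that runs its delegates in a declared order can be sketched in
| plain Java as follows. Every name here (<code class="code">EngineSketch</code>,
| <code class="code">SharedResults</code>, <code class="code">PrimitiveSketch</code>,
| <code class="code">AggregateSketch</code>, <code class="code">AggregateDemo</code>) is
| hypothetical, the CAS is reduced to a list of result type names, and each delegate is
| assumed to need exactly the result of the step before it.</p>
| <pre class="programlisting">
```java
import java.util.Arrays;

/** Hypothetical "CAS in, CAS out" contract shared by primitive and aggregate engines. */
interface EngineSketch {
    void process(SharedResults cas);
}

/** Hypothetical CAS reduced to the names of the result types recorded so far. */
final class SharedResults {
    private String[] types = new String[0];

    void add(String type) {
        types = Arrays.copyOf(types, types.length + 1);
        types[types.length - 1] = type;
    }

    boolean has(String type) {
        for (String t : types) {
            if (t.equals(type)) {
                return true;
            }
        }
        return false;
    }
}

/** Primitive engine: checks its input requirement, then records its output type. */
final class PrimitiveSketch implements EngineSketch {
    private final String requires; // null when the engine needs no prior results
    private final String produces;

    PrimitiveSketch(String requires, String produces) {
        this.requires = requires;
        this.produces = produces;
    }

    public void process(SharedResults cas) {
        if (requires != null) {
            if (!cas.has(requires)) {
                throw new IllegalStateException(produces + " needs " + requires);
            }
        }
        cas.add(produces);
    }
}

/** Aggregate engine: delegates to its internal engines in the declared order. */
final class AggregateSketch implements EngineSketch {
    private final EngineSketch[] delegates;

    AggregateSketch(EngineSketch... delegates) {
        this.delegates = delegates.clone();
    }

    public void process(SharedResults cas) {
        for (EngineSketch delegate : delegates) {
            delegate.process(cas);
        }
    }
}

final class AggregateDemo {
    /** Assembles the named-entity pipeline described in the text as one engine. */
    static EngineSketch namedEntityDetector() {
        return new AggregateSketch(
                new PrimitiveSketch(null, "Language"),
                new PrimitiveSketch("Language", "Token"),
                new PrimitiveSketch("Token", "PartOfSpeech"),
                new PrimitiveSketch("PartOfSpeech", "Parse"),
                new PrimitiveSketch("Parse", "NamedEntity"));
    }

    public static void main(String[] args) {
        SharedResults cas = new SharedResults();
        namedEntityDetector().process(cas);
        System.out.println(cas.has("NamedEntity")); // true
    }
}
```
| </pre>
| <p>Because the aggregate implements the same interface as its delegates, a caller cannot
| tell whether it is talking to one annotator or to a whole pipeline, and aggregates can be
| nested inside other aggregates.</p>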
| |
| <p> |
| Since UIMA 2.0, the developer can implement a "Flow Controller" and include it as part
| of an aggregate AE by referring to it in the aggregate AE's descriptor.
| The flow controller is responsible for computing the "flow", that is,
| for determining the order in which the delegate AEs will process the CAS.
| The Flow Controller has access to the CAS and any external resources it may require
| for determining the flow. It can do this dynamically at run-time, it can
| make multi-step decisions, and it can consider any sort of flow specification
| included in the aggregate AE's descriptor. See |
| <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.fc" class="olink">Chapter 4, <i>Flow Controller Developer's Guide</i></a> |
| for details on the UIMA Flow Controller interface. |
| </p> |
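| <p>The following plain-Java sketch illustrates what deciding the flow dynamically from CAS
| contents could look like. It does not use the real Flow Controller interface: the classes
| <code class="code">FlowCas</code>, <code class="code">FlowControllerSketch</code> and
| <code class="code">DynamicFlowDemo</code>, the delegate keys, and the toy language check are
| all assumptions made for illustration.</p>
| <pre class="programlisting">
```java
/** Hypothetical CAS holding the document text and one analysis result. */
final class FlowCas {
    final String text;
    String language; // filled in by the language identification step

    FlowCas(String text) {
        this.text = text;
    }
}

/** Hypothetical flow controller: picks the next delegate by inspecting the CAS. */
final class FlowControllerSketch {
    /** Returns the key of the next delegate to run, or null when the flow is finished. */
    String next(FlowCas cas, String previous) {
        if (previous == null) {
            return "languageId";
        }
        if (previous.equals("languageId")) {
            // Branch on a result that an earlier delegate recorded in the CAS.
            return "en".equals(cas.language) ? "englishTokenizer" : "genericTokenizer";
        }
        return null;
    }
}

final class DynamicFlowDemo {
    /** Runs the flow for one document and returns the delegates visited, in order. */
    static String run(String text) {
        FlowCas cas = new FlowCas(text);
        FlowControllerSketch controller = new FlowControllerSketch();
        StringBuilder trace = new StringBuilder();
        String step = controller.next(cas, null);
        while (step != null) {
            if (step.equals("languageId")) {
                // Toy heuristic standing in for a real language identifier.
                cas.language = text.contains(" the ") ? "en" : "unknown";
            }
            trace.append(step).append(' ');
            step = controller.next(cas, step);
        }
        return trace.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run("He read the report"));   // languageId englishTokenizer
        System.out.println(run("Er liest den Bericht")); // languageId genericTokenizer
    }
}
```
| </pre>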
| |
| <p>We refer to the development role associated with building an aggregate from |
| delegate AEs as the <span class="bold"><strong>Analysis Engine Assembler</strong></span>.</p>
| |
| <p>The UIMA framework, given an aggregate analysis engine descriptor, will run all |
| delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by |
| the flow controller. The UIMA framework is equipped to handle different |
| deployments where the delegate engines, for example, are <span class="bold"><strong> |
| tightly-coupled</strong></span> (running in the same process) or <span class="bold"><strong> |
| loosely-coupled</strong></span> (running in separate processes or even on different |
| machines). The framework supports a number of remote protocols for loose coupling |
| deployments of aggregate analysis engines, including SOAP (which stands for Simple |
| Object Access Protocol, a standard Web Services communications protocol).</p> |
| |
| <p>The UIMA framework facilitates the deployment of AEs as remote services by using an |
| adapter layer that automatically creates the necessary infrastructure in response to |
| a declaration in the component's descriptor. For more details on creating |
| aggregate analysis engines refer to <a href="references.html#d5e1" class="olink">UIMA References</a> <a href="references.html#ugr.ref.xml.component_descriptor" class="olink">Chapter 2, <i>Component Descriptor Reference</i></a>. The component descriptor editor tool
| assists in the specification of aggregate AEs from a repository of available engines. |
| For more details on this tool refer to <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.cde" class="olink">Chapter 1, <i>Component Descriptor Editor User's Guide</i></a>.</p> |
| |
| <p>The UIMA framework implementation has two built-in flow implementations: one |
| that supports a linear flow between components, and one with conditional branching
| based on the language of the document. It also supports user-provided flow |
| controllers, as described in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.fc" class="olink">Chapter 4, <i>Flow Controller Developer's Guide</i></a>. Furthermore, the application developer is |
| free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily |
| complex flows. For more details on this the reader may refer to <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application.using_aes" class="olink">Section 3.2, “Using Analysis Engines”</a>.</p> |
| |
| </div> |
| |
| <div class="section" title="2.5. Application Building and Collection Processing"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing">2.5. Application Building and Collection Processing</h2></div></div></div> |
| |
| |
| <div class="note" title="Key UIMA Concepts Introduced in this Section:" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Key UIMA Concepts Introduced in this Section:</h3><p>Process Method, Collection Processing Architecture, |
| Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine, |
| Collection Processing Manager.</p></div> |
| |
| <div class="section" title="2.5.1. Using the framework from an Application"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.using_framework_from_an_application">2.5.1. Using the framework from an Application</h3></div></div></div> |
| |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.application_factory_ae"></a><div class="figure-contents"> |
| |
| <div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="618"><tr><td align="center"><img src="images/overview-and-setup/conceptual_overview_files/image008.png" align="middle" width="618" alt="Picture of application interacting with UIMA's factory to produce an analysis engine, which acts as a container for annotators, and interfaces with the application via the process and getMetaData methods among others."></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.4. Using UIMA Framework to create and interact with an Analysis Engine</b></p></div><br class="figure-break"> |
| |
| <p>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS |
| out.</p> |
| |
| <p>The application is responsible for interacting with the UIMA framework to |
| instantiate an AE, create or acquire an input CAS, initialize the input CAS with a |
| document and then pass it to the AE through the <span class="bold"><strong>process |
| method</strong></span>. This interaction with the framework is illustrated in <a class="xref" href="#ugr.ovv.conceptual.application_factory_ae" title="Figure 2.4. Using UIMA Framework to create and interact with an Analysis Engine">Figure 2.4, “Using UIMA Framework to create and interact with an Analysis Engine”</a>. </p> |
| |
| <p>The UIMA AE Factory takes the declarative information from the Component |
| Descriptor and the class files implementing the annotator, and instantiates the AE |
| instance, setting up the CAS and the UIMA Context.</p> |
| |
| <p>The AE, possibly calling many delegate AEs internally, performs the overall |
| analysis and its process method returns the CAS containing new analysis results. |
| </p> |
| |
| <p>The application then decides what to do with the returned CAS. There are many |
| possibilities. For instance, the application could display the results, store the
| CAS to disk for post-processing, or extract and index analysis results as part of a search
| or database application.</p>
| |
| <p>The UIMA framework provides methods to support the application developer in |
| creating and managing CASes and instantiating, running and managing AEs. Details |
| may be found in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application" class="olink">Chapter 3, <i>Application Developer's Guide</i></a>.</p> |
| </div> |
| |
| <div class="section" title="2.5.2. Graduating to Collection Processing"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.graduating_to_collection_processing">2.5.2. Graduating to Collection Processing</h3></div></div></div> |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.fig.cpe"></a><div class="figure-contents"> |
| |
| <div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="578"><tr><td align="center"><img src="images/overview-and-setup/conceptual_overview_files/image010.png" align="middle" width="578" alt="High-Level UIMA Component Architecture from Source to Sink"></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.5. High-Level UIMA Component Architecture from Source to Sink</b></p></div><br class="figure-break"> |
| |
| <p>Many UIM applications analyze entire collections of documents. They connect to |
| different document sources and do different things with the results. But in the |
| typical case, the application must generally follow these logical steps: |
| |
| </p><div class="orderedlist"><ol class="orderedlist" type="1" compact><li class="listitem"><p>Connect to a physical source</p></li><li class="listitem"><p>Acquire a document from the source</p></li><li class="listitem"><p>Initialize a CAS with the document to be analyzed</p> |
| </li><li class="listitem"><p>Send the CAS to a selected analysis engine</p></li><li class="listitem"><p>Process the resulting CAS</p></li><li class="listitem"><p>Go back to 2 until the collection is processed</p> |
| </li><li class="listitem"><p>Do any final processing required after all the documents in the |
| collection have been analyzed</p></li></ol></div><p> </p> |
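| <p>The logical steps above can be sketched as a simple loop. This is a plain-Java
| illustration with made-up names (<code class="code">LoopCas</code>,
| <code class="code">CollectionLoopSketch</code>): the source is an in-memory array, and the
| analysis is a naive whitespace token count rather than a real analysis engine.</p>
| <pre class="programlisting">
```java
/** Hypothetical CAS holding one document and one analysis result. */
final class LoopCas {
    final String documentText;
    int tokenCount; // recorded by the analysis step

    LoopCas(String documentText) {
        this.documentText = documentText;
    }
}

final class CollectionLoopSketch {
    /** Walks the whole collection and returns a summary of what was processed. */
    static String run(String[] source) {
        int documents = 0;
        int tokens = 0;
        // Steps 1, 2 and 6: connect to the source and acquire documents one at a time
        // until the collection is exhausted.
        for (String document : source) {
            // Step 3: initialize a CAS with the document to be analyzed.
            LoopCas cas = new LoopCas(document);
            // Step 4: send the CAS to the selected analysis engine.
            analyze(cas);
            // Step 5: process the resulting CAS (here: accumulate totals).
            documents++;
            tokens += cas.tokenCount;
        }
        // Step 7: final processing once all documents have been analyzed.
        return documents + " documents, " + tokens + " tokens";
    }

    /** Stand-in for an analysis engine: counts whitespace-separated tokens. */
    static void analyze(LoopCas cas) {
        cas.tokenCount = cas.documentText.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String[] source = {"Fred Center plays golf", "He won"};
        System.out.println(run(source)); // 2 documents, 6 tokens
    }
}
```
| </pre>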
| |
| <p>UIMA supports UIM application development for this general type of processing |
| through its <span class="bold"><strong>Collection Processing |
| Architecture</strong></span>.</p> |
| |
| <p>As part of the collection processing architecture UIMA introduces two primary |
| components in addition to the annotator and analysis engine. These are the <span class="bold"><strong>Collection Reader</strong></span> and the <span class="bold"><strong>CAS |
| Consumer</strong></span>. The complete flow from source, through document analysis, |
| and to CAS Consumers supported by UIMA is illustrated in <a class="xref" href="#ugr.ovv.conceptual.fig.cpe" title="Figure 2.5. High-Level UIMA Component Architecture from Source to Sink">Figure 2.5, “High-Level UIMA Component Architecture from Source to Sink”</a>.</p> |
| |
| <p>The Collection Reader's job is to connect to and iterate through a source |
| collection, acquiring documents and initializing CASes for analysis. </p> |
| |
| |
| |
| <p>CAS Consumers, as the name suggests, function at the end of the flow. Their job is |
| to do the final CAS processing. A CAS Consumer may be implemented, for example, to |
| index CAS contents in a search engine, extract elements of interest and populate a |
| relational database or serialize and store analysis results to disk for subsequent |
| and further analysis. </p> |
| |
| <p>A UIMA <span class="bold"><strong>Collection Processing Engine</strong></span> (CPE) |
| is an aggregate component that specifies a <span class="quote">“<span class="quote">source to sink</span>”</span> flow from a |
| Collection Reader through a set of analysis engines and then to a set of CAS Consumers. |
| </p> |
| |
| <p>CPEs are specified by XML files called CPE Descriptors. These are declarative |
| specifications that point to their contained components (Collection Readers, |
| analysis engines and CAS Consumers) and indicate a flow among them. The flow |
| specification allows for filtering capabilities to, for example, skip over AEs |
| based on CAS contents. Details about the format of CPE Descriptors may be found in |
| <a href="references.html#d5e1" class="olink">UIMA References</a> |
| <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter 3, <i>Collection Processing Engine Descriptor Reference</i></a>. |
| </p> |
| |
| <div class="figure"><a name="ugr.ovv.conceptual.fig.cpm"></a><div class="figure-contents"> |
| |
| <div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="576"><tr><td align="center"><img src="images/overview-and-setup/conceptual_overview_files/image012.png" align="middle" width="576" alt="box and arrows picture of application using CPE factory to instantiate a Collection Processing Engine, and that engine interacting with the application."></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.6. Collection Processing Manager in UIMA Framework</b></p></div><br class="figure-break"> |
| |
| <p>The UIMA framework includes a <span class="bold"><strong>Collection Processing |
| Manager</strong></span> (CPM). The CPM is capable of reading a CPE descriptor, and |
| deploying and running the specified CPE. <a class="xref" href="#ugr.ovv.conceptual.fig.cpm" title="Figure 2.6. Collection Processing Manager in UIMA Framework">Figure 2.6, “Collection Processing Manager in UIMA Framework”</a> illustrates the role of the CPM |
| in the UIMA Framework.</p> |
| |
| <p>Key features of the CPM are failure recovery, CAS management and scale-out. |
| </p> |
| |
| <p>Collections may be large and take considerable time to analyze. A configurable |
| behavior of the CPM is to log faults on single document failures while continuing to |
| process the collection. This behavior is commonly used because analysis components |
| often tend to be the weakest link -- in practice they may choke on strangely formatted |
| content. </p> |
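| <p>The log-and-continue behavior described above can be illustrated with a small plain-Java sketch. It does not use the CPM or any UIMA class; <code class="code">FaultTolerantLoopSketch</code> and <code class="code">processAll</code> are hypothetical names, and the way a fault is recorded here is an assumption made for illustration only.</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class FaultTolerantLoopSketch {

    /**
     * Applies the analysis to every document. A failure on one document is
     * recorded in the returned fault log and processing continues with the next.
     */
    static List<String> processAll(List<String> documents, Consumer<String> analysis) {
        List<String> faults = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            try {
                analysis.accept(documents.get(i));
            } catch (RuntimeException e) {
                // Log the fault for this single document and keep going.
                faults.add("document " + i + ": " + e.getMessage());
            }
        }
        return faults;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("well formed", "", "also fine");
        List<String> faults = processAll(documents, text -> {
            // Stand-in for an analysis component that chokes on unexpected content.
            if (text.isEmpty()) {
                throw new IllegalArgumentException("empty document");
            }
        });
        System.out.println("faults: " + faults);
    }
}
```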
| |
| <p>This deployment option requires that the CPM run in a separate process or a |
| machine distinct from the CPE components. A CPE may be configured to run with a variety |
| of deployment options that control the features provided by the CPM. For details see |
| <a href="references.html#d5e1" class="olink">UIMA References</a> |
| <a href="references.html#ugr.ref.xml.cpe_descriptor" class="olink">Chapter 3, <i>Collection Processing Engine Descriptor Reference</i></a> |
| .</p> |
| |
| <p>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides |
| the developer with a user interface that simplifies the process of connecting up all |
| the components in a CPE and running the result. For details on using the CPE |
| Configurator see <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> <a href="tools.html#ugr.tools.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Configurator User's Guide</i></a>. This tool currently does not provide |
| access to the full set of CPE deployment options supported by the CPM; however, you can |
| configure other parts of the CPE descriptor by editing it directly. For details on how |
| to create and run CPEs refer to <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Developer's Guide</i></a>.</p> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="2.6. Exploiting Analysis Results"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.exploiting_analysis_results">2.6. Exploiting Analysis Results</h2></div></div></div> |
| |
| |
| <div class="note" title="Key UIMA Concepts Introduced in this Section:" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Key UIMA Concepts Introduced in this Section:</h3><p>Semantic Search, XML Fragment Queries.</p> |
| </div> |
| |
| <div class="section" title="2.6.1. Semantic Search"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.semantic_search">2.6.1. Semantic Search</h3></div></div></div> |
| |
| |
| <p>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads |
| documents from the file system and initializes CASs with their content. These are |
| then fed to an AE that annotates tokens and sentences. The CASs, now enriched with |
| token and sentence information, are passed to a CAS Consumer that populates a search |
| engine index. </p> |
| |
| <p>The search engine query processor can then use the token index to provide basic |
| key-word search. For example, given a query <span class="quote">“<span class="quote">center</span>”</span> the search |
| engine would return all the documents that contained the word |
| <span class="quote">“<span class="quote">center</span>”</span>.</p> |
| |
| <p><span class="bold"><strong>Semantic Search</strong></span> is a search paradigm that |
| can exploit the additional metadata generated by analytics like a UIMA CPE.</p> |
| |
| <p>Suppose we plug a named-entity recognizer into the CPE described |
| above. Assume this analysis engine is capable of detecting mentions of persons |
| and organizations in documents and annotating them in the CAS.</p> |
| |
| <p>To complement the named-entity recognizer, we add a CAS Consumer that extracts, in |
| addition to the token and sentence annotations, the person and organization annotations added to |
| the CASs by the named-entity recognizer. It then feeds these into the semantic search |
| engine's index.</p> |
| |
| <p>A semantic search engine can exploit |
| this additional information from the CAS to support more powerful queries. For |
| example, imagine a user is looking for documents that mention an organization with |
| <span class="quote">“<span class="quote">center</span>”</span> in its name but is not sure of the full or precise name of the |
| organization. A key-word search on <span class="quote">“<span class="quote">center</span>”</span> would likely return far |
| too many documents because <span class="quote">“<span class="quote">center</span>”</span> is a common and ambiguous term. |
| A semantic search engine might support a query language called |
| <span class="bold"><strong>XML Fragments</strong></span>. This query language is |
| designed to exploit the CAS annotations entered in its index. The XML Fragment query, |
| for example, |
| |
| |
| </p><pre class="programlisting"><organization> center </organization></pre><p> |
| will return only documents that contain <span class="quote">“<span class="quote">center</span>”</span> where it |
| appears as part of a mention annotated as an organization by the named-entity |
| recognizer. This will likely be a much shorter list of documents more precisely |
| matching the user's interest.</p> |
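| <p>The difference between the key-word query and the annotation-scoped query can be sketched in plain Java. This is not an XML Fragments implementation and not a UIMA or search-engine API: <code class="code">ScopedSearchSketch</code>, <code class="code">Span</code>, <code class="code">Doc</code> and both search methods are hypothetical, the sample documents are made up, and matching is a simple case-insensitive substring test chosen for brevity.</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class ScopedSearchSketch {

    /** Hypothetical annotation: a typed span over the document text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical indexed document: its text plus the spans produced by analysis. */
    static final class Doc {
        final String id;
        final String text;
        final List<Span> spans;

        Doc(String id, String text, List<Span> spans) {
            this.id = id;
            this.text = text;
            this.spans = spans;
        }
    }

    /** Builds a span of the given type covering the first occurrence of phrase in text. */
    static Span mention(String text, String type, String phrase) {
        int begin = text.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new Span(type, begin, begin + phrase.length());
    }

    private static boolean containsTerm(String text, String term) {
        return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    /** Key-word search: every document whose text contains the term. */
    static List<String> keywordSearch(List<Doc> docs, String term) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : docs) {
            if (containsTerm(doc.text, term)) {
                hits.add(doc.id);
            }
        }
        return hits;
    }

    /** Scoped search: only documents where the term occurs inside a span of the given type. */
    static List<String> scopedSearch(List<Doc> docs, String type, String term) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : docs) {
            for (Span span : doc.spans) {
                String covered = doc.text.substring(span.begin, span.end);
                if (span.type.equals(type) && containsTerm(covered, term)) {
                    hits.add(doc.id);
                    break; // one matching mention is enough for this document
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String plain = "Meet me at the center of town.";
        String org = "She joined the Center for Applied Research.";
        List<Doc> docs = Arrays.asList(
                new Doc("doc1", plain, Collections.<Span>emptyList()),
                new Doc("doc2", org, Arrays.asList(mention(org, "organization", "Center for Applied Research"))));
        System.out.println("key-word: " + keywordSearch(docs, "center"));
        System.out.println("scoped:   " + scopedSearch(docs, "organization", "center"));
    }
}
```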
| |
| <p>Consider taking this one step further. We add a relationship recognizer that |
| annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that |
| it sends these new relationship annotations to the semantic search index as well. |
| With these additional analysis results in the index we can submit queries like |
| |
| |
| </p><pre class="programlisting"><ceo_of> |
| <organization> center </organization> |
| </ceo_of></pre><p> |
| This query will precisely target documents that contain a mention of an organization |
| with <span class="quote">“<span class="quote">center</span>”</span> as part of its name where that organization is mentioned |
| as part of a |
| <code class="code">CEO-of</code> relationship annotated by the relationship |
| recognizer.</p> |
| |
| <p>For more details about using UIMA and Semantic Search see the section on |
| integrating text analysis and search in <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.application" class="olink">Chapter 3, <i>Application Developer's Guide</i></a>.</p> |
| </div> |
| |
| <div class="section" title="2.6.2. Databases"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.conceptual.databases">2.6.2. Databases</h3></div></div></div> |
| |
| |
| <p>Search engine indices are not the only place to deposit analysis results for use |
| by applications. Another classic example is populating databases. While many |
| approaches are possible, with varying degrees of flexibility and performance, all are |
| highly dependent on application specifics. We included a simple sample CAS Consumer |
| that provides the basics for getting your analysis result into a relational |
| database. It extracts annotations from a CAS and writes them to a relational |
| database, using the open source Apache Derby database.</p> |
| </div> |
| </div> |
| |
| <div class="section" title="2.7. Multimodal Processing in UIMA"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.multimodal_processing">2.7. Multimodal Processing in UIMA</h2></div></div></div> |
| |
| <p>In previous sections we've seen how the CAS is initialized with an initial |
| artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The |
| first Analysis engine may make some assertions about the artifact, for example, in the |
| form of annotations. Subsequent Analysis engines will make further assertions about |
| both the artifact and previous analysis results, and finally one or more CAS Consumers |
| will extract information from these CASs for structured information storage.</p> |
| <div class="figure"><a name="ugr.ovv.conceptual.fig.multiple_sofas"></a><div class="figure-contents"> |
| |
| <div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="576"><tr><td align="center"><img src="images/overview-and-setup/conceptual_overview_files/image014.png" align="middle" width="576" alt="Picture showing audio on the left broken into segments by a segmentation component, then sent to multiple analysis pipelines in parallel, some processing the raw audio, others processing the recognized speech as text."></td></tr></table></div> |
| </div><p class="title"><b>Figure 2.7. Multiple Sofas in support of multi-modal analysis of an audio Stream. Some |
| engines work on the audio <span class="quote">“<span class="quote">view</span>”</span>, some on the text |
| <span class="quote">“<span class="quote">view</span>”</span> and some on both.</b></p></div><br class="figure-break"> |
| <p>Consider a processing pipeline, illustrated in <a class="xref" href="#ugr.ovv.conceptual.fig.multiple_sofas" title="Figure 2.7. Multiple Sofas in support of multi-modal analysis of an audio Stream. Some engines work on the audio “view”, some on the text “view” and some on both.">Figure 2.7, “Multiple Sofas in support of multi-modal analysis of an audio Stream. Some |
| engines work on the audio <span class="quote">“<span class="quote">view</span>”</span>, some on the text |
| <span class="quote">“<span class="quote">view</span>”</span> and some on both.”</a>, that starts with an |
| audio recording of a conversation, transcribes the audio into text, and then extracts |
| information from the text transcript. Analysis Engines at the start of the pipeline are |
| analyzing an audio subject of analysis, and later analysis engines are analyzing a text |
| subject of analysis. The CAS Consumer will likely want to build a search index from |
| concepts found in the text to the original audio segment covered by the concept.</p> |
| |
| <p>What becomes clear from this relatively simple scenario is that the CAS must be |
| capable of simultaneously holding multiple subjects of analysis. Some analysis |
| engines will analyze only one subject of analysis, some will analyze one and create |
| another, and some will need to access multiple subjects of analysis at the same time. |
| </p> |
| |
| <p>The support in UIMA for multiple subjects of analysis is called <span class="bold"><strong>Sofa</strong></span> support; Sofa is an acronym which is derived from |
| <span class="underline">S</span>ubject <span class="underline"> |
| of</span> <span class="underline">A</span>nalysis, which is a physical |
| representation of an artifact (e.g., the detagged text of a web-page, the HTML |
| text of the same web-page, the audio segment of a video, the closed-caption text |
| of the same audio segment). A Sofa may |
| be associated with CAS Views. A particular CAS will have one or more views, each view |
| corresponding to a particular subject of analysis, together with a set of the defined |
| indexes that index the metadata (that is, Feature Structures) created in that view.</p> |
| |
| <p>Analysis results can be indexed in, or <span class="quote">“<span class="quote">belong</span>”</span> to, a specific view. |
| UIMA components may be written in <span class="quote">“<span class="quote">Multi-View</span>”</span> mode - able to create and |
| access multiple Sofas at the same time, or in <span class="quote">“<span class="quote">Single-View</span>”</span> mode, simply |
| receiving a particular view of the CAS corresponding to a particular single Sofa. For |
| single-view mode components, it is up to the person assembling the component to supply |
| the needed information to ensure a particular view is passed to the component at run |
| time. This is done using XML descriptors for Sofa mapping (see <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> <a href="tutorials_and_users_guides.html#ugr.tug.mvs.sofa_name_mapping" class="olink">Section 6.4, “Sofa Name Mapping”</a>).</p> |
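| <p>The idea of one CAS holding several views, with some components seeing all of them and others only one, can be sketched in plain Java. The classes below are hypothetical models written for this example and are not the UIMA CAS or Sofa API; the view names <code class="code">audio</code> and <code class="code">text</code> and the fixed transcript are assumptions.</p> |

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiViewSketch {

    /** Hypothetical view: one subject of analysis plus the results indexed in that view. */
    static final class View {
        final Object sofaData;
        final List<String> indexedResults = new ArrayList<>();

        View(Object sofaData) {
            this.sofaData = sofaData;
        }
    }

    /** Hypothetical CAS that holds several named views at once. Not the UIMA CAS API. */
    static final class MultiViewCas {
        private final Map<String, View> views = new LinkedHashMap<>();

        View createView(String name, Object sofaData) {
            if (views.containsKey(name)) {
                throw new IllegalArgumentException("view already exists: " + name);
            }
            View view = new View(sofaData);
            views.put(name, view);
            return view;
        }

        View getView(String name) {
            View view = views.get(name);
            if (view == null) {
                throw new IllegalArgumentException("no such view: " + name);
            }
            return view;
        }

        List<String> viewNames() {
            return new ArrayList<>(views.keySet());
        }
    }

    /** "Multi-view" component: reads the audio view and creates a text view from it. */
    static void transcribe(MultiViewCas cas, String audioViewName, String textViewName, String transcript) {
        View audio = cas.getView(audioViewName);
        byte[] samples = (byte[]) audio.sofaData;
        audio.indexedResults.add("segment[0," + samples.length + ")");
        cas.createView(textViewName, transcript); // a real transcriber would derive this from the audio
    }

    /** "Single-view" component: receives exactly one view and never sees the others. */
    static void tokenize(View textView) {
        String text = (String) textView.sofaData;
        for (String token : text.trim().split("\\s+")) {
            textView.indexedResults.add(token);
        }
    }

    public static void main(String[] args) {
        MultiViewCas cas = new MultiViewCas();
        cas.createView("audio", new byte[] {1, 2, 3, 4});
        transcribe(cas, "audio", "text", "hello world");
        tokenize(cas.getView("text")); // whoever assembles the pipeline picks the view to pass in
        System.out.println(cas.viewNames() + " " + cas.getView("text").indexedResults);
    }
}
```

| <p>Passing a single <code class="code">View</code> to <code class="code">tokenize</code> plays the role that Sofa mapping plays for single-view components: the component is written against one subject of analysis, and the assembler decides which one it receives.</p> |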
| |
| <p>Multi-View capability brings benefits to text-only processing as well. An input |
| document can be transformed from one format to another. Examples of this include |
| transforming text from HTML to plain text or from one natural language to another. |
| </p> |
| </div> |
| |
| <div class="section" title="2.8. Next Steps"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.conceptual.next_steps">2.8. Next Steps</h2></div></div></div> |
| |
| |
| <p>This chapter presented a high-level overview of UIMA concepts. Along the way, it |
| pointed to other documents in the UIMA SDK documentation set where the reader can find |
| details on how to apply the related concepts in building applications with the UIMA |
| SDK.</p> |
| |
| <p>At this point the reader may return to the documentation guide in <a href="overview_and_setup.html#ugr.project_overview_doc_use" class="olink">Section 1.2, “How to use the Documentation”</a> |
| to learn how they might proceed in getting started using UIMA.</p> |
| |
| <p>For a more detailed overview of the UIMA architecture, framework and development |
| roles we refer the reader to the following paper:</p> |
| |
| <p>D. Ferrucci and A. Lally, <span class="quote">“<span class="quote">Building an example application using the |
| Unstructured Information Management Architecture,</span>”</span> <span class="emphasis"><em>IBM Systems |
| Journal</em></span> <span class="bold"><strong>43</strong></span>, No. 3, 455-475 (2004). |
| </p> |
| |
| <p>This paper can be found online at <a class="ulink" href="http://www.research.ibm.com/journal/sj43-3.html" target="_top">http://www.research.ibm.com/journal/sj43-3.html</a></p> |
| </div> |
| |
| <div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a id="ftn.d5e551" href="#d5e551" class="para">1</a>] </sup> We have plans to |
| extend the representational capabilities of the CAS and align its semantics with the |
| semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the |
| semantics of the Eclipse Modeling Framework's ( <a class="ulink" href="http://www.eclipse.org/emf/" target="_top">http://www.eclipse.org/emf/</a>) Ecore semantics and XMI-based |
| representation.</p> </div></div></div> |
| <div class="chapter" title="Chapter 3. Setting up the Eclipse IDE to work with UIMA" id="ugr.ovv.eclipse_setup"><div class="titlepage"><div><div><h2 class="title">Chapter 3. Setting up the Eclipse IDE to work with UIMA</h2></div></div></div> |
| |
| |
| |
| <p>This chapter describes how to set up the UIMA SDK to work with Eclipse. Eclipse (<a class="ulink" href="http://www.eclipse.org" target="_top">http://www.eclipse.org</a>) is a popular open-source Integrated Development |
| Environment for many things, including Java. The UIMA SDK does not require that you use |
| Eclipse. However, we recommend that you do use Eclipse because some useful UIMA SDK tools |
| run as plug-ins to the Eclipse platform and because the UIMA SDK examples are provided in a |
| form that's easy to import into your Eclipse environment.</p> |
| |
| <p>If you are not planning on using the UIMA SDK with Eclipse, you may skip this chapter and |
| read <a href="tutorials_and_users_guides.html#d5e1" class="olink">UIMA Tutorial and Developers' Guides</a> |
| <a href="tutorials_and_users_guides.html#ugr.tug.aae" class="olink">Chapter 1, <i>Annotator and Analysis Engine Developer's Guide</i></a> |
| next.</p> |
| |
| <p>This chapter provides instructions for |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>installing Eclipse, </p> |
| </li><li class="listitem"><p>installing the UIMA SDK's Eclipse plugins into your Eclipse |
| environment, and </p></li><li class="listitem"><p>importing the example UIMA code into an Eclipse project. </p> |
| </li></ul></div> |
| |
| <p>The UIMA Eclipse plugins are designed to be used with Eclipse version 3.1 or |
| later. |
| </p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>You will need to run Eclipse using Java at the 1.8 level, in order |
| to use the UIMA Eclipse plugins.</p></div> |
| |
| <div class="section" title="3.1. Installation"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.eclipse_setup.installation">3.1. Installation</h2></div></div></div> |
| |
| <div class="section" title="3.1.1. Install Eclipse"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.eclipse_setup.install_eclipse">3.1.1. Install Eclipse</h3></div></div></div> |
| |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>Go to <a class="ulink" href="http://www.eclipse.org" target="_top">http://www.eclipse.org</a> and follow the instructions there to download Eclipse. |
| </p></li><li class="listitem"><p>We recommend using the latest release level. |
| Navigate to the Eclipse Release version you |
| want and download the archive for your platform.</p></li><li class="listitem"><p>Unzip the archive to install Eclipse somewhere, e.g., c:\</p> |
| </li><li class="listitem"><p>Eclipse has a bit of a learning curve. If you plan to make |
| significant use of Eclipse, check out the tutorial under the help menu. It is well |
| worth the effort. There are also books you can get that describe Eclipse and its |
| use.</p></li></ul></div> |
| |
| <p>The first time Eclipse starts up it will take a bit longer as it completes its |
| installation. A <span class="quote">“<span class="quote">welcome</span>”</span> page will come up. After you are through |
| reading the welcome information, click on the arrow to exit the welcome page and get to |
| the main Eclipse screens.</p> |
| </div> |
| |
| <div class="section" title="3.1.2. Installing the UIMA Eclipse Plugins"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.eclipse_setup.install_uima_eclipse_plugins">3.1.2. Installing the UIMA Eclipse Plugins</h3></div></div></div> |
| |
| |
| <p>The best way to do this is to use the Eclipse Install New Software mechanism, because that will |
| ensure that all needed prerequisites are also installed. See below for an alternative, |
| manual approach.</p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If your computer is on an internet connection which uses a proxy server, you can |
| configure Eclipse to know about that. Put your proxy settings into Eclipse using the |
| Eclipse preferences by accessing the menus: Window <span class="symbol">→</span> Preferences... <span class="symbol">→</span> |
| Install/Update, and Enable HTTP proxy connection under the Proxy Settings with the |
| information about your proxy. </p></div> |
| |
| |
| <p>To use the Eclipse Install New Software mechanism, start Eclipse, and then pick the menu |
| <span class="command"><strong>Help <span class="symbol">→</span> Install new software...</strong></span>. In the next page, enter |
| the following URL in the "Work with" box and press enter: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p></p><code class="code">https://www.apache.org/dist/uima/eclipse-update-site/</code> or</li><li class="listitem"><p></p><code class="code">https://www.apache.org/dist/uima/eclipse-update-site-uv3/</code>.</li></ul></div><p> |
| Choose the second if you are working with the core UIMA Java SDK at version 3 or later.</p> |
| |
| <p>Now select the plugin tools you wish to install, and click Next, and follow the |
| remaining panels to install the UIMA plugins. </p> |
| </div> |
| |
| |
| |
| <div class="section" title="3.1.3. Install the UIMA SDK"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.eclipse_setup.install_uima_sdk">3.1.3. Install the UIMA SDK</h3></div></div></div> |
| |
| <p>If you haven't already done so, please download and install the UIMA SDK from |
| <a class="ulink" href="http://incubator.apache.org/uima" target="_top">http://incubator.apache.org/uima</a>. Be sure to set the environment variable |
| UIMA_HOME to point to the root of the installed UIMA SDK and run the |
| <code class="literal">adjustExamplePaths.bat</code> or <code class="literal">adjustExamplePaths.sh</code> |
| script, as explained in the README.</p> |
| |
| <p>The environment variable UIMA_HOME is used by the command-line scripts in the |
| %UIMA_HOME%/bin directory as well as by Eclipse run configurations in the uimaj-examples |
| sample project.</p> |
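| <p>A quick way to confirm that UIMA_HOME is visible to Java programs is a small check like the following. It is a sketch only: <code class="code">UimaHomeSketch</code> and <code class="code">binDirectory</code> are hypothetical names, and the check does not verify that the directory actually contains a UIMA installation.</p> |

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class UimaHomeSketch {

    /** Resolves the bin directory under the given UIMA_HOME value, rejecting an unset or blank value. */
    static Path binDirectory(String uimaHome) {
        if (uimaHome == null || uimaHome.trim().isEmpty()) {
            throw new IllegalStateException("UIMA_HOME is not set");
        }
        return Paths.get(uimaHome, "bin");
    }

    public static void main(String[] args) {
        try {
            Path bin = binDirectory(System.getenv("UIMA_HOME"));
            System.out.println("Command-line scripts expected in: " + bin);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```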
| |
| </div> |
| |
| <div class="section" title="3.1.4. Installing the UIMA Eclipse Plugins, manually"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.eclipse_setup.install_uima_eclipse_plugins_manually">3.1.4. Installing the UIMA Eclipse Plugins, manually</h3></div></div></div> |
| |
| |
| <p>If you installed the UIMA plugins using the update mechanism above, please skip this section.</p> |
| |
| <p>If you are unable to use the Eclipse Update mechanism to install the UIMA plugins, you |
| can do this manually. In the directory %UIMA_HOME%/eclipsePlugins (where |
| %UIMA_HOME% is the directory in which you installed the UIMA SDK), you will see a set of folders. Copy these |
| to your %ECLIPSE_HOME%/dropins directory (%ECLIPSE_HOME% is where you |
| installed Eclipse).</p> |
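| <p>If you prefer to script the copy rather than do it by hand, the following plain-Java sketch copies a directory tree with the standard <code class="code">java.nio.file</code> API. <code class="code">PluginCopySketch</code> and <code class="code">copyTree</code> are hypothetical names, and reading ECLIPSE_HOME from an environment variable is an assumption of this example; in the text above %ECLIPSE_HOME% is only a placeholder for your Eclipse installation directory.</p> |

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

final class PluginCopySketch {

    /** Copies everything under sourceDir into targetDir and returns the number of files copied. */
    static int copyTree(Path sourceDir, Path targetDir) {
        try (Stream<Path> paths = Files.walk(sourceDir)) {
            int copied = 0;
            for (Path source : (Iterable<Path>) paths::iterator) {
                Path target = targetDir.resolve(sourceDir.relativize(source).toString());
                if (Files.isDirectory(source)) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String uimaHome = System.getenv("UIMA_HOME");
        String eclipseHome = System.getenv("ECLIPSE_HOME"); // assumption: set by you for this script
        if (uimaHome == null || eclipseHome == null) {
            System.out.println("Set UIMA_HOME and ECLIPSE_HOME before running this sketch.");
            return;
        }
        int copied = copyTree(Paths.get(uimaHome, "eclipsePlugins"), Paths.get(eclipseHome, "dropins"));
        System.out.println(copied + " files copied");
    }
}
```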
| |
| </div> |
| |
| <div class="section" title="3.1.5. Start Eclipse"><div class="titlepage"><div><div><h3 class="title" id="ugr.ovv.eclipse_setup.start_eclipse">3.1.5. Start Eclipse</h3></div></div></div> |
| |
| <p>If you have Eclipse running, restart it (shut it down, and start it again) using |
| the |
| <code class="code">-clean</code> option; you can do this by running the command |
| <span class="command"><strong>eclipse -clean</strong></span> (see explanation in the next section) in the |
| directory where you installed Eclipse. You may want to set up a desktop shortcut at |
| this point for Eclipse.</p> |
| |
| <div class="section" title="3.1.5.1. Special startup parameter for Eclipse: -clean"><div class="titlepage"><div><div><h4 class="title" id="ugr.ovv.eclipse_setup.special_startup_parameter_clean">3.1.5.1. Special startup parameter for Eclipse: -clean</h4></div></div></div> |
| |
| <p>If you have modified the plugin structure (by copying files directly in the |
| file system) after you started it for the first time, please include |
| the <span class="quote">“<span class="quote">-clean</span>”</span> parameter in the startup arguments to Eclipse, |
| <span class="emphasis"><em>one time</em></span> (after any plugin modifications were done). This |
| is needed because Eclipse may not notice the changes you made, otherwise. This |
| parameter forces Eclipse to reexamine all of its plugins at startup and recompute |
| any cached information about them.</p> |
| </div> |
| |
| </div> |
| </div> |
| <div class="section" title="3.2. Setting up Eclipse to view Example Code"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.eclipse_setup.example_code">3.2. Setting up Eclipse to view Example Code</h2></div></div></div> |
| |
| <p>Later chapters refer to example code. Here's how to create a special project in Eclipse to |
| hold the examples.</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>In Eclipse, if the Java |
| perspective is not already open, switch to it by going to Window <span class="symbol">→</span> Open Perspective |
| <span class="symbol">→</span> Java.</p></li><li class="listitem"><p>Set up a class path variable named UIMA_HOME, whose value is the |
| directory where you installed the UIMA SDK. This is done as follows: |
| |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="circle"><li class="listitem"><p>Go to Window <span class="symbol">→</span> Preferences <span class="symbol">→</span> Java |
| <span class="symbol">→</span> Build Path <span class="symbol">→</span> Classpath Variables.</p></li><li class="listitem"><p>Click <span class="quote">“<span class="quote">New</span>”</span></p></li><li class="listitem"><p>Enter UIMA_HOME (all capitals, exactly as written) in the |
| <span class="quote">“<span class="quote">Name</span>”</span> field.</p></li><li class="listitem"><p>Enter your installation directory (e.g. <code class="literal">C:/Program |
| Files/apache-uima</code>) in the <span class="quote">“<span class="quote">Path</span>”</span> field</p> |
| </li><li class="listitem"><p>Click <span class="quote">“<span class="quote">OK</span>”</span> in the <span class="quote">“<span class="quote">New Variable |
| Entry</span>”</span> dialog</p></li><li class="listitem"><p>Click <span class="quote">“<span class="quote">OK</span>”</span> in the <span class="quote">“<span class="quote">Preferences</span>”</span> |
| dialog</p></li><li class="listitem"><p>If it asks you if you want to do a full build, click |
| <span class="quote">“<span class="quote">Yes</span>”</span> </p></li></ul></div> |
| </li><li class="listitem"><p>Select the File <span class="symbol">→</span> Import menu option</p></li><li class="listitem"><p>Select <span class="quote">“<span class="quote">General/Existing Project into Workspace</span>”</span> and click |
| the <span class="quote">“<span class="quote">Next</span>”</span> button.</p></li><li class="listitem"><p>Click <span class="quote">“<span class="quote">Browse</span>”</span> and browse to the |
| %UIMA_HOME%/examples directory</p></li><li class="listitem"><p>Click <span class="quote">“<span class="quote">Finish.</span>”</span> This will create a new project called |
| <span class="quote">“<span class="quote">uimaj-examples</span>”</span> in your Eclipse workspace. There should be no |
| compilation errors. </p></li></ul></div> |
| |
| <p>To verify that you have set up the project correctly, check that there are no error |
| messages in the <span class="quote">“<span class="quote">Problems</span>”</span> view.</p> |
| |
| </div> |
| |
| <div class="section" title="3.3. Adding the UIMA source code to the jar files"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.eclipse_setup.adding_source">3.3. Adding the UIMA source code to the jar files</h2></div></div></div> |
| |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If you are running a current version of Eclipse, and have the m2e (Maven extensions for Eclipse) |
| plugin installed, Eclipse should be able to automatically download the source for the jars, so you may not need |
| to do anything special (it does take a few seconds, and you need an internet connection).</p></div> |
| <p>Otherwise, if you would like to be able to jump to the UIMA source code in Eclipse or to step |
| through it with the debugger, you can add the UIMA source code directly to the jar files. This is |
| done via a shell script that comes with the source distribution. To add the source code |
| to the jars, you need to: |
| </p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> |
| Download and unpack the UIMA source distribution. |
| </p> |
| </li><li class="listitem"> |
| <p> |
| Download and install the UIMA binary distribution (the UIMA_HOME environment variable needs |
| to be set to point to where you installed the UIMA binary distribution). |
| </p> |
| </li><li class="listitem"> |
| <p>"cd" to the root directory of the source distribution</p> |
| </li><li class="listitem"> |
| <p> |
| Execute the <span class="command"><strong>src\main\readme_src\addSourceToJars</strong></span> script in the root directory of the |
| source distribution. |
| </p> |
| </li></ul></div> |
| |
| <p> |
| This adds the source code to the jar files, and it will then be automatically available |
| from Eclipse. There is no further Eclipse setup required. |
| </p> |
| |
| </div> |
| |
| |
| <div class="section" title="3.4. Attaching UIMA Javadocs"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.eclipse_setup.linking_uima_javadocs">3.4. Attaching UIMA Javadocs</h2></div></div></div> |
| |
| |
| <p>The binary distribution also includes the UIMA Javadocs. They are |
| attached to the UIMA library Jar files in the uimaj-examples project described |
| above. You can attach the Javadocs to your own project as well. |
| </p> |
| |
| <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>If you attached the source as described in the previous section, you |
| don't need to attach the Javadocs because the source includes the Javadoc comments.</p></div> |
| |
| <p>Attaching the Javadocs enables Javadoc help for UIMA APIs. After they are |
| attached, if you hover your mouse |
| over a UIMA API element, the corresponding Javadoc will appear. |
| You can then press <span class="quote">“<span class="quote">F2</span>”</span> to make the hover "stick", or |
| <span class="quote">“<span class="quote">Shift-F2</span>”</span> to open the default |
| web-browser on your system to let you browse the entire Javadoc information |
| for that element. |
| </p> |
| <p>If this pop-up behavior is something you don't want, you can turn it off |
| in the Eclipse preferences, in the menu Window <span class="symbol">→</span> Preferences <span class="symbol">→</span> |
| Java <span class="symbol">→</span> Editor <span class="symbol">→</span> Hovers. |
| </p> |
| |
| <p>Eclipse also has a Javadoc "view" which you can show using Window <span class="symbol">→</span> |
| Show View <span class="symbol">→</span> Javadoc.</p> |
| |
| <p>See <a href="references.html#d5e1" class="olink">UIMA References</a> |
| <a href="references.html#ugr.ref.javadocs.libraries" class="olink">Section 1.1, “Using named Eclipse User Libraries”</a> |
| for information on how to set up a UIMA "library" with the Javadocs attached, which |
| can be reused for other projects in your Eclipse workspace.</p> |
| |
| <p>You can attach the Javadocs to each UIMA library jar you think you might be |
| interested in. It makes the most sense |
| for uima-core.jar, because you'll probably use the core APIs most of all. |
| </p> |
| |
| <p>Here's a screenshot of what you should see when you hover your mouse pointer over the |
| class name <span class="quote">“<span class="quote">CAS</span>”</span> in the source code. |
| </p> |
| |
| <div class="informalfigure"> |
| <div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="564"><tr><td><img src="images/overview-and-setup/eclipse_setup_files/image004.jpg" width="564" alt="Screenshot of mouse-over for UIMA APIs"></td></tr></table></div> |
| </div> |
| |
| </div> |
| |
| <div class="section" title="3.5. Running external tools from Eclipse"><div class="titlepage"><div><div><h2 class="title" style="clear: both" id="ugr.ovv.eclipse_setup.running_external_tools_from_eclipse">3.5. Running external tools from Eclipse</h2></div></div></div> |
| |
| |
| <p>You can run many tools without using Eclipse at all, by using the shell scripts in the |
| UIMA SDK's bin directory. In addition, many tools can be run from inside Eclipse; |
| examples are the Document Analyzer, CPE Configurator, CAS Visual Debugger, |
| and JCasGen. The uimaj-examples project provides Eclipse launch |
| configurations that make this easy to do.</p> |
| |
| <p>To run these tools from Eclipse:</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>If the Java perspective is not |
| already open, switch to it by going to Window <span class="symbol">→</span> Open Perspective <span class="symbol">→</span> |
| Java.</p></li><li class="listitem"><p>Go to Run <span class="symbol">→</span> Run... </p></li><li class="listitem"><p>In the window that appears, select <span class="quote">“<span class="quote">UIMA CPE GUI</span>”</span>, |
| <span class="quote">“<span class="quote">UIMA CAS Visual Debugger</span>”</span>, <span class="quote">“<span class="quote">UIMA JCasGen</span>”</span>, or |
| <span class="quote">“<span class="quote">UIMA Document Analyzer</span>”</span> |
| from the list of run configurations on the left. (If you don't see these, |
| select the uimaj-examples project and choose File |
| <span class="symbol">→</span> Refresh).</p></li><li class="listitem"><p>Press the <span class="quote">“<span class="quote">Run</span>”</span> button. The tools should start. Close |
| the tools by clicking the <span class="quote">“<span class="quote">X</span>”</span> in the upper right corner on the GUI. |
| </p></li></ul></div> |
| |
| <p>For instructions on using the Document Analyzer and CPE Configurator, |
| in the <a href="tools.html#d5e1" class="olink">UIMA Tools Guide and Reference</a> book see <a href="tools.html#ugr.tools.doc_analyzer" class="olink">Chapter 3, <i>Document Analyzer User's Guide</i></a>, and |
| <a href="tools.html#ugr.tools.cpe" class="olink">Chapter 2, <i>Collection Processing Engine Configurator User's Guide</i></a>. For |
| instructions on using the CAS Visual Debugger and JCasGen, see <a href="tools.html#ugr.tools.cvd" class="olink">Chapter 5, <i>CAS Visual Debugger</i></a> and |
| <a href="tools.html#ugr.tools.jcasgen" class="olink">Chapter 8, <i>JCasGen User's Guide</i></a>.</p> |
| |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 4. UIMA Frequently Asked Questions (FAQ's)" id="ugr.faqs"><div class="titlepage"><div><div><h2 class="title">Chapter 4. UIMA Frequently Asked Questions (FAQ's)</h2></div></div></div> |
| |
| |
| |
| <div class="variablelist"><dl><dt><a name="ugr.faqs.what_is_uima"></a><span class="term"><span class="bold"><strong>What is UIMA?</strong></span></span></dt><dd><p>UIMA stands for Unstructured Information Management |
| Architecture. It is a component software architecture for the development, |
| discovery, composition and deployment of multi-modal analytics for the analysis |
| of unstructured information.</p> |
| <p>UIMA processing occurs through a series of modules called |
| <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a>. The result of analysis is an assignment of semantics to the elements of |
| unstructured data, for example, the indication that the phrase |
| <span class="quote">“<span class="quote">Washington</span>”</span> refers to a person's name or that it refers to a |
| place.</p> |
| |
| <p>The output of an analysis engine can be saved in conventional structures, |
| for example, relational databases or search engine indices, where the content |
| of the original unstructured information may be efficiently accessed |
| according to its inferred semantics. </p> |
| |
| <p>UIMA supports developers in creating, |
| integrating, and deploying components across platforms and among dispersed |
| teams working to develop unstructured information management |
| applications.</p> |
| </dd><dt><a name="ugr.faqs.pronounce"></a><span class="term"><span class="bold"><strong>How do you pronounce UIMA?</strong></span></span></dt><dd><p>You – eee – muh. |
| </p></dd><dt><a name="ugr.faqs.difference_apache_uima"></a><span class="term"><span class="bold"><strong>What's the difference between UIMA and the Apache UIMA?</strong></span></span></dt><dd><p>UIMA is an architecture which specifies component interfaces, |
| design patterns, data representations and development roles.</p> |
| |
| <p>Apache UIMA is an open source, Apache-licensed software project. It includes run-time |
| frameworks in Java and C++, APIs and tools for implementing, composing, packaging |
| and deploying UIMA components.</p> |
| |
| <p>The UIMA run-time framework allows developers to plug in their components |
| and applications and run them on different platforms and according to different |
| deployment options that range from tightly-coupled (running in the same |
| process space) to loosely-coupled (distributed across different processes or |
| machines for greater scale, flexibility and recoverability).</p> |
| |
| <p>The UIMA project has several significant subprojects, including UIMA-AS (for flexibly |
| scaling out UIMA pipelines over clusters of machines), uimaFIT (a Java-centric set of friendlier |
| interfaces for using UIMA without the XML descriptors; it also provides |
| many convenience methods), UIMA-DUCC (for managing clusters of |
| machines running scaled-out UIMA "jobs" in a "fair" way), RUTA (Eclipse-based tooling and |
| a runtime framework for development of rule-based |
| Annotators), and Addons (where you can find many extensions).</p> |
| </dd><dt><a name="ugr.faqs.what_is_an_annotation"></a><span class="term"><span class="bold"><strong>What is an Annotation?</strong></span></span></dt><dd><p>An annotation is metadata that is associated with a region of a |
| document. It is often a label, typically represented as a string of characters. The |
| region may be the whole document. </p> |
| |
| <p>An example is the label <span class="quote">“<span class="quote">Person</span>”</span> associated with the span of |
| text <span class="quote">“<span class="quote">George Washington</span>”</span>. We say that <span class="quote">“<span class="quote">Person</span>”</span> |
| annotates <span class="quote">“<span class="quote">George Washington</span>”</span> in the sentence <span class="quote">“<span class="quote">George |
| Washington was the first president of the United States</span>”</span>. The |
| association of the label |
| <span class="quote">“<span class="quote">Person</span>”</span> with a particular span of text is an annotation. Another |
| example may have an annotation represent a topic, like <span class="quote">“<span class="quote">American |
| Presidents</span>”</span> and be used to label an entire document.</p> |
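| <p>The example above can be sketched in a few lines of plain Java. This is only an |
| illustration of the idea of an annotation as a label plus a span; the class and method |
| names (AnnotationSketch, Annotation, annotate) are hypothetical and are not part of the |
| UIMA API, which represents annotations as feature structures in the CAS.</p> |

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-off annotation model; not part of the UIMA API. */
final class AnnotationSketch {

    /** A label attached to the character span [begin, end) of a text. */
    static final class Annotation {
        final String label;
        final int begin;
        final int end;

        Annotation(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        /** Returns the region of the text that this annotation covers. */
        String coveredText(String text) {
            return text.substring(begin, end);
        }
    }

    /** Labels the first occurrence of phrase in text; fails if the phrase is absent. */
    static Annotation annotate(String text, String phrase, String label) {
        int begin = text.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new Annotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String text = "George Washington was the first president of the United States";
        List<Annotation> annotations = new ArrayList<>();
        // "Person" annotates the span covering "George Washington".
        annotations.add(annotate(text, "George Washington", "Person"));
        // A topic label can cover the whole document.
        annotations.add(new Annotation("American Presidents", 0, text.length()));
        for (Annotation a : annotations) {
            System.out.println(a.label + " [" + a.begin + "," + a.end + "): "
                    + a.coveredText(text));
        }
    }
}
```

| <p>Because the label is stored apart from the text and refers to it only by offsets, |
| the original text is never modified, and several annotations can cover the same region.</p> |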
| |
| <p>Annotations are not limited to regions of texts. An annotation may annotate |
| a region of an image or a segment of audio. The same concepts apply.</p> |
| </dd><dt><a name="ugr.faqs.what_is_the_cas"></a><span class="term"><span class="bold"><strong>What is the CAS?</strong></span></span></dt><dd><p>The CAS stands for Common Analysis Structure. It provides |
| cooperating UIMA components with a common representation and mechanism for |
| shared access to the artifact being analyzed (e.g., a document, audio file, video |
| stream etc.) and the current analysis results.</p></dd><dt><a name="ugr.faqs.what_does_the_cas_contain"></a><span class="term"><span class="bold"><strong>What does the CAS contain?</strong></span></span></dt><dd><p>The CAS is a data structure for which UIMA provides multiple |
| interfaces. It contains and provides the analysis algorithm or application |
| developer with access to</p> |
| |
| <div class="itemizedlist"><ul class="itemizedlist" type="disc" compact><li class="listitem"><p>the subject of analysis (the artifact being analyzed, like |
| the document),</p></li><li class="listitem"><p>the analysis results or metadata (e.g., annotations, parse |
| trees, relations, entities etc.),</p></li><li class="listitem"><p>indices to the analysis results, and</p></li><li class="listitem"><p>the type system (a schema for the analysis results).</p> |
| </li></ul></div> |
| |
| <p>A CAS can hold multiple versions of the artifact being analyzed (for |
| instance, a raw html document, and a detagged version, or an English version and a |
| corresponding German version, or an audio sample, and the text that |
| corresponds, etc.). For each version there is a separate instance of the results |
| indices.</p></dd><dt><a name="ugr.faqs.only_annotations"></a><span class="term"><span class="bold"><strong>Does the CAS only contain Annotations?</strong></span></span></dt><dd><p>No. The CAS contains the artifact being analyzed plus the analysis |
| results. Analysis results are those metadata recorded by <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a> in the |
| CAS. The most common form of analysis result is the addition of an annotation. But an |
| analysis engine may write any structure that conforms to the CAS's type |
| system into the CAS. These may not be annotations but may be other things, for |
| example links between annotations and properties of objects associated with |
| annotations.</p> |
| <p>The CAS may have multiple representations of the artifact being analyzed, each one |
| represented in the CAS as a particular Subject of Analysis, or <a class="link" href="#ugr.faqs.what_is_a_sofa">Sofa</a>.</p></dd><dt><a name="ugr.faqs.just_xml"></a><span class="term"><span class="bold"><strong>Is the CAS just XML?</strong></span></span></dt><dd><p>No, in fact there are many possible representations of the CAS. If all |
| of the <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a> are running in the same process, an efficient, in-memory |
| data object is used. If a CAS must be sent to an analysis engine on a remote machine, it |
| can be done via an XML or a binary serialization of the CAS. </p> |
| |
| <p>The UIMA framework provides multiple serialization and de-serialization methods |
| in various formats, including XML. See the Javadocs for the CasIOUtils class. |
| </p></dd><dt><a name="ugr.faqs.what_is_a_type_system"></a><span class="term"><span class="bold"><strong>What is a Type System?</strong></span></span></dt><dd><p>Think of a type system as a schema or class model for the <a class="link" href="#ugr.faqs.what_is_the_cas">CAS</a>. It defines |
| the types of objects and their properties (or features) that may be instantiated in |
| a CAS. A specific CAS conforms to a particular type system. UIMA components declare |
| their input and output with respect to a type system. </p> |
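| <p>As a rough illustration of the "schema or class model" idea, the following plain Java |
| sketch models types that have at most one supertype and features whose values are |
| restricted to a range type. All names here (TypeSystemSketch, Type, feature, rangeOf) are |
| hypothetical; a real UIMA type system is declared in a descriptor and accessed through the |
| framework, not built this way.</p> |

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of a type system; not part of the UIMA API. */
final class TypeSystemSketch {

    /** A type with at most one supertype and a set of features (name to range type). */
    static final class Type {
        final String name;
        final Type supertype; // null for the root type
        final Map<String, Type> features = new LinkedHashMap<>();

        Type(String name, Type supertype) {
            this.name = name;
            this.supertype = supertype;
        }

        /** Declares a feature whose values must be instances of rangeType. */
        Type feature(String featureName, Type rangeType) {
            features.put(featureName, rangeType);
            return this;
        }

        /** True if this type is other or inherits from it (single inheritance). */
        boolean isSubtypeOf(Type other) {
            for (Type t = this; t != null; t = t.supertype) {
                if (t == other) {
                    return true;
                }
            }
            return false;
        }

        /** Looks up a feature on this type or, failing that, on its ancestors. */
        Type rangeOf(String featureName) {
            for (Type t = this; t != null; t = t.supertype) {
                Type range = t.features.get(featureName);
                if (range != null) {
                    return range;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Type top = new Type("Top", null);
        Type integer = new Type("Integer", top);
        Type string = new Type("String", top);
        Type annotation = new Type("Annotation", top)
                .feature("begin", integer)
                .feature("end", integer);
        Type person = new Type("Person", annotation).feature("fullName", string);

        System.out.println(person.isSubtypeOf(annotation)); // true
        System.out.println(person.rangeOf("begin").name);   // Integer (inherited feature)
        System.out.println(annotation.rangeOf("fullName")); // null (declared on the subtype only)
    }
}
```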
| |
| <p>Type Systems include the definitions of types, their properties, range |
| types (these can restrict the value of properties to other types) and |
| a single-inheritance hierarchy of types.</p></dd><dt><a name="ugr.faqs.what_is_a_sofa"></a><span class="term"><span class="bold"><strong>What is a Sofa?</strong></span></span></dt><dd><p>Sofa stands for “Subject of Analysis”. A <a class="link" href="#ugr.faqs.what_is_the_cas">CAS</a> is |
| associated with a single artifact being analyzed by a collection of UIMA analysis |
| engines. But a single artifact may have multiple independent views, each of which |
| may be analyzed separately by a different set of <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a>. For example, |
| a document may have different translations, each of which is associated |
| with the original document but each potentially analyzed by different engines. A |
| CAS may have multiple Views, each containing a different Subject of Analysis |
| corresponding to some version of the original artifact. This feature is ideal for |
| multi-modal analysis where, for example, one view of a video stream may be the video |
| frames and the other the closed captions.</p></dd><dt><a name="ugr.faqs.annotator_versus_ae"></a><span class="term"><span class="bold"><strong>What's the difference between an Annotator and an Analysis |
| Engine?</strong></span></span></dt><dd><p>In the terminology of UIMA, an annotator is simply some code that |
| analyzes documents and outputs <a class="link" href="#ugr.faqs.what_is_an_annotation">annotations</a> on the content of the documents. The |
| UIMA framework takes the annotator, together with metadata describing such |
| things as the input requirements and output types of the annotator, and produces |
| an analysis engine. </p> |
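| <p>The division of labor can be pictured with a small plain Java sketch: user code |
| supplies the analysis logic, and a framework-side wrapper pairs it with metadata and |
| exposes the object that applications call. Everything below (EngineSketch, Annotator, |
| Metadata, AnalysisEngine, produce) is a hypothetical stand-in, assuming results are simple |
| strings; it is not the UIMA API, in which annotators implement framework-defined |
| interfaces and engines are produced from descriptors.</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the annotator / analysis engine split; not the UIMA API. */
final class EngineSketch {

    /** User-written analysis code: inspects the text and records labels. */
    interface Annotator {
        void process(String text, List<String> results);
    }

    /** Stand-in for descriptor metadata: the declared input and output types. */
    static final class Metadata {
        final List<String> inputTypes;
        final List<String> outputTypes;

        Metadata(List<String> inputTypes, List<String> outputTypes) {
            this.inputTypes = inputTypes;
            this.outputTypes = outputTypes;
        }
    }

    /** Framework-side wrapper; applications call this, not the annotator directly. */
    static final class AnalysisEngine {
        private final Annotator annotator;
        final Metadata metadata;

        AnalysisEngine(Annotator annotator, Metadata metadata) {
            this.annotator = annotator;
            this.metadata = metadata;
        }

        List<String> process(String text) {
            List<String> results = new ArrayList<>();
            annotator.process(text, results);
            return results;
        }
    }

    /** Stands in for the framework step that combines annotator and metadata. */
    static AnalysisEngine produce(Annotator annotator, Metadata metadata) {
        return new AnalysisEngine(annotator, metadata);
    }

    /** Example annotator: labels every token that starts with an upper-case letter. */
    static Annotator capitalizedTokens() {
        return (text, results) -> {
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    results.add("Capitalized:" + token);
                }
            }
        };
    }

    public static void main(String[] args) {
        Metadata metadata = new Metadata(
                Collections.<String>emptyList(), Arrays.asList("Capitalized"));
        AnalysisEngine engine = produce(capitalizedTokens(), metadata);
        System.out.println(engine.process("George Washington was president"));
    }
}
```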
| |
| <p>Analysis Engines contain the framework-provided infrastructure that |
| allows them to be easily combined with other analysis engines in different flows |
| and according to different deployment options (collocated or as web services, |
| for example). </p> |
| |
| <p>Analysis Engines are the framework-generated objects that an Application |
| interacts with. An Annotator is a user-written class that implements one of |
| the supported Annotator interfaces.</p></dd><dt><a name="ugr.faqs.web_services"></a><span class="term"><span class="bold"><strong>Are UIMA analysis engines web services?</strong></span></span></dt><dd><p>They can be deployed as such. Deploying an analysis engine as a web |
| service is one of the deployment options supported by the UIMA framework.</p> |
| </dd><dt><a name="ugr.faqs.stateless_aes"></a><span class="term"><span class="bold"><strong>Do Analysis Engines have to be |
| "stateless"?</strong></span></span></dt><dd><p>This is a user-specifiable option. The XML metadata for the |
| component includes an |
| <code class="code">operationalProperties</code> element which can specify if multiple |
| deployment is allowed. If true, then a particular instance of an Engine might not |
| see all the CASes being processed. If false, then that component will see all of the |
| CASes being processed. In this case, it can accumulate state information among all |
| the CASes. Typically, Analysis Engines in the main analysis pipeline are marked |
| multipleDeploymentAllowed = true. The CAS Consumer component, on the other hand, |
| defaults to having this property set to false, and is typically associated with |
| some resource like a database or search engine that aggregates analysis results |
| across an entire collection.</p> |
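| <p>The effect of this setting on accumulated state can be seen in a small plain Java |
| simulation. It assumes, purely for illustration, that CASes are handed to deployed |
| instances in round-robin order; the names (DeploymentSketch, CountingComponent, run) are |
| hypothetical, and the real framework's scheduling may differ.</p> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical simulation of multiple deployment; not the UIMA API. */
final class DeploymentSketch {

    /** A component that accumulates state: the number of CASes it has seen. */
    static final class CountingComponent {
        int casesSeen = 0;

        void process(String cas) {
            casesSeen++;
        }
    }

    /**
     * Processes all CASes and returns the deployed instances.
     * If multiple deployment is not allowed, exactly one instance is used,
     * so that instance sees every CAS. Round-robin dispatch is an assumption
     * of this sketch only.
     */
    static List<CountingComponent> run(List<String> cases,
                                       boolean multipleDeploymentAllowed,
                                       int requestedInstances) {
        int instances = multipleDeploymentAllowed ? Math.max(1, requestedInstances) : 1;
        List<CountingComponent> deployed = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            deployed.add(new CountingComponent());
        }
        for (int i = 0; i < cases.size(); i++) {
            deployed.get(i % instances).process(cases.get(i));
        }
        return deployed;
    }

    public static void main(String[] args) {
        List<String> cases = Arrays.asList("cas1", "cas2", "cas3", "cas4");
        for (CountingComponent c : run(cases, true, 2)) {
            System.out.println("scaled-out instance saw " + c.casesSeen + " CASes");
        }
        for (CountingComponent c : run(cases, false, 2)) {
            System.out.println("single instance saw " + c.casesSeen + " CASes");
        }
    }
}
```

| <p>With two instances, each counter ends at 2 and neither has a complete picture of the |
| collection, which is why components that aggregate across all CASes are deployed as a |
| single instance.</p> |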
| |
| <p>Analysis Engine developers are encouraged not to maintain state between |
| documents that would prevent their engine from working as advertised if |
| operated in a parallelized environment.</p></dd><dt><a name="ugr.faqs.uddi"></a><span class="term"><span class="bold"><strong>Is engine meta-data compatible with web services and |
| UDDI?</strong></span></span></dt><dd><p>All UIMA component implementations are associated with Component |
| Descriptors, which represent metadata describing various properties of the |
| component to support discovery, reuse, validation, automatic composition and |
| development tooling. In principle, UIMA component descriptors are compatible |
| with web services and UDDI. However, the UIMA framework currently uses its own XML |
| representation for component metadata. It would not be difficult to convert |
| between UIMA's XML representation and other standard representations.</p> |
| </dd><dt><a name="ugr.faqs.scaling"></a><span class="term"><span class="bold"><strong>How do you scale a UIMA application?</strong></span></span></dt><dd><p>The UIMA framework allows components such as |
| <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a> and |
| CAS Consumers to be easily deployed as services or in other containers and managed |
| by systems middleware designed to scale. UIMA applications tend to scale out |
| naturally across documents, allowing many documents to be analyzed in |
| parallel.</p> |
| <p>The UIMA-AS project has extensive capabilities to flexibly scale a UIMA |
| pipeline across multiple machines. The UIMA-DUCC project supports a |
| unified management of large clusters of machines running multiple "jobs" |
| each consisting of a pipeline with data sources and sinks.</p> |
| <p>Within the core UIMA framework, there is a component called the CPM (Collection Processing |
| Manager) which has features and configuration settings for scaling an |
| application to increase its throughput and recoverability; |
| the CPM was the earlier version of scaleout technology, and has been |
| superseded by the UIMA-AS effort (although it is still supported).</p></dd><dt><a name="ugr.faqs.embedding"></a><span class="term"><span class="bold"><strong>What does it mean to embed UIMA in systems middleware?</strong></span></span></dt><dd><p>An example of an embedding would be the deployment of a UIMA analysis |
| engine as an Enterprise Java Bean inside an application server such as IBM |
| WebSphere. Such an embedding allows the deployer to take advantage of the features |
| and tools provided by WebSphere for achieving scalability, service management, |
| recoverability etc. UIMA is independent of any particular systems middleware, so |
| <a class="link" href="#ugr.faqs.annotator_versus_ae">analysis engines</a> could be deployed on other application servers as well.</p> |
| </dd><dt><a name="ugr.faqs.cpm_versus_cpe"></a><span class="term"><span class="bold"><strong>How is the CPM different from a CPE?</strong></span></span></dt><dd><p>These name complementary aspects of collection processing. The CPM |
| (Collection Processing <span class="bold"><strong>Manager</strong></span>) is the part of |
| the UIMA framework that manages the execution of a workflow of UIMA |
| components orchestrated to analyze a large collection of documents. The UIMA |
| developer does not implement or describe a CPM. It is a piece of infrastructure code |
| that handles CAS transport, instance management, batching, check-pointing, |
| statistics collection and failure recovery in the execution of a collection |
| processing workflow.</p> |
| |
| <p>A Collection Processing Engine (CPE) is a component created by the framework |
| from a specific CPE descriptor. A CPE descriptor refers to a series of UIMA |
| components including a Collection Reader, CAS Initializer, Analysis |
| Engine(s) and CAS Consumers. These components are organized in a work flow and |
| define a collection analysis job or CPE. A CPE acquires documents from a source |
| collection, initializes CASs with document content, performs document |
| analysis and then produces collection level results (e.g., search engine |
| index, database etc). The CPM is the execution engine for a CPE.</p> |
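| <p>The workflow described above (read documents, initialize a CAS for each, run the |
| analysis, then build collection-level results) can be outlined in plain Java. This is a |
| conceptual sketch only: CpeSketch, Cas, and run are hypothetical names, engines are |
| modeled as simple callbacks, and the loop in run merely plays the role that the CPM plays |
| for a real CPE, without batching, check-pointing, or failure recovery.</p> |

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical outline of a collection processing workflow; not the UIMA API. */
final class CpeSketch {

    /** Minimal stand-in for a CAS: the document text plus recorded results. */
    static final class Cas {
        final String text;
        final List<String> results = new ArrayList<>();

        Cas(String text) {
            this.text = text;
        }
    }

    /**
     * Pulls each document from the reader, runs every engine on its CAS, and
     * lets the consumer step fold the results into a collection-level index
     * (here: how many times each result occurred across the collection).
     */
    static Map<String, Integer> run(Iterable<String> reader, List<Consumer<Cas>> engines) {
        Map<String, Integer> collectionIndex = new TreeMap<>();
        for (String document : reader) {
            Cas cas = new Cas(document);              // initialize CAS with content
            for (Consumer<Cas> engine : engines) {
                engine.accept(cas);                   // per-document analysis
            }
            for (String result : cas.results) {       // consumer: aggregate results
                collectionIndex.merge(result, 1, Integer::sum);
            }
        }
        return collectionIndex;
    }

    /** Example engine: records every space-separated token as a result. */
    static Consumer<Cas> tokenizer() {
        return cas -> cas.results.addAll(Arrays.asList(cas.text.split(" ")));
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList("UIMA analyzes text", "UIMA scales out");
        System.out.println(run(collection, Collections.singletonList(tokenizer())));
        // prints {UIMA=2, analyzes=1, out=1, scales=1, text=1}
    }
}
```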
| </dd><dt><a name="ugr.faqs.modalities_other_than_text"></a><span class="term"><span class="bold"><strong>Does UIMA support modalities other than text?</strong></span></span></dt><dd><p>The UIMA architecture supports the development, discovery, |
| composition and deployment of multi-modal analytics including text, audio and |
| video. Applications that process text, speech and video have been developed using |
| UIMA. This release of the SDK, however, does not include examples of these |
| multi-modal applications. </p> |
| |
| <p>It does however include documentation and programming examples for using |
| the key feature required for building multi-modal applications. UIMA supports |
| multiple subjects of analysis or <a class="link" href="#ugr.faqs.what_is_a_sofa">Sofas</a>. These allow multiple views of a single |
| artifact to be associated with a <a class="link" href="#ugr.faqs.what_is_the_cas">CAS</a>. For example, if an artifact is a video |
| stream, one Sofa could be associated with the video frames and another with the |
| closed-captions text. UIMA's multiple Sofa feature is included and |
| described in this release of the SDK.</p></dd><dt><a name="ugr.faqs.compare"></a><span class="term"><span class="bold"><strong>How does UIMA compare to other similar work?</strong></span></span></dt><dd><p>A number of different frameworks for NLP have preceded UIMA. Two of |
| them were developed at IBM Research and represent UIMA's early roots. For |
| details please refer to the UIMA article that appears in the IBM Systems Journal |
| Vol. 43, No. 3 (<a class="ulink" href="http://www.research.ibm.com/journal/sj/433/ferrucci.html" target="_top">http://www.research.ibm.com/journal/sj/433/ferrucci.html</a> |
| ).</p> |
| |
| <p>UIMA has advanced the state of the art along a number of dimensions |
| including: support for distributed deployments in different middleware |
| environments, easy framework embedding in different software product |
| platforms (key for commercial applications), broader architectural coverage |
| with its collection processing architecture, support for |
| multiple modalities, support for efficient integration across programming |
| languages, support for a modern software engineering discipline calling out |
| different roles in the use of UIMA to develop applications, the extensive use of |
| descriptive component metadata to support development tooling, component |
| discovery and composition. (Please note that not all of these features are |
| available in this release of the SDK.)</p></dd><dt><a name="ugr.faqs.open_source"></a><span class="term"><span class="bold"><strong>Is UIMA Open Source?</strong></span></span></dt><dd><p>Yes. As of version 2, UIMA development has moved to Apache and is being |
| developed within the Apache open source processes. It is licensed under the Apache |
| version 2 license. |
| </p> |
| </dd><dt><a name="ugr.faqs.levels_required"></a><span class="term"><span class="bold"><strong>What Java level and OS are required for the UIMA SDK?</strong></span></span></dt><dd><p>As of release 3.0.0, the UIMA SDK requires Java 1.8. |
| It has been tested mainly on Windows and Linux platforms, with some |
| testing on Mac OS X. Other |
| platforms and JDK implementations will likely work, but have |
| not been as significantly tested.</p></dd><dt><a name="ugr.faqs.building_apps_on_top_of_uima"></a><span class="term"><span class="bold"><strong>Can I build my UIM application on top of UIMA?</strong></span></span></dt><dd><p>Yes. Apache UIMA is licensed under the Apache version 2 license, |
| enabling you to build and distribute applications which include the framework. |
| </p></dd></dl></div> |
| </div> |
| <div class="chapter" title="Chapter 5. Known Issues" id="ugr.issues"><div class="titlepage"><div><div><h2 class="title">Chapter 5. Known Issues</h2></div></div></div> |
| |
| |
| |
| <div class="variablelist"><dl><dt><a name="ugr.issues.cr_to_xml"></a><span class="term"><span class="bold"><strong>Sun Java 1.4.2_12 doesn't serialize CR characters to XML</strong></span></span></dt><dd> |
| <p>(Note: Apache UIMA now requires Java 1.8, so this issue is moot.) The XML serialization support in Sun Java 1.4.2_12 doesn't serialize CR characters to |
| XML. As a result, if the document text contains CR characters, XCAS or XMI serialization |
| will cause them to be lost, resulting in incorrect annotation offsets. This is exposed in |
| the DocumentAnalyzer, with the highlighting being incorrect if the input document contains |
| CR characters. </p> |
| </dd><dt><a name="ugr.issues.jcasgen_java_1.4"></a><span class="term"><span class="bold"><strong>JCasGen merge facility only supports Java levels 1.4 or earlier</strong></span></span></dt><dd> |
| <p>JCasGen has a facility to merge in user (hand-coded) changes with the code generated |
| by JCasGen. This merging supports Java 1.4 constructs only. JCasGen generates Java 1.4 |
| compliant code, so as long as any code you change here also only uses Java 1.4 constructs, the |
| merge will work, even if you're using Java 5 or later. |
| If you use syntactic structures particular to Java 5 or later, the merge |
| operation will likely fail to merge properly.</p> |
| </dd><dt><a name="ugr.issues.libgcj.4.1.2"></a><span class="term"><span class="bold"><strong>Descriptor editor in Eclipse tooling does not work with libgcj 4.1.2</strong></span></span></dt><dd> |
| <p>The descriptor editor in the Eclipse tooling does not work with libgcj 4.1.2, and |
| possibly other versions of libgcj. This is apparently due to a bug in the implementation of |
| their XML library, which results in a class cast error. libgcj is used as the default |
| JVM for Eclipse in Ubuntu (and other Linux distributions?). The workaround is to use a |
| different JVM to start Eclipse.</p> |
| </dd></dl></div> |
| </div> |
| <div class="glossary" title="Glossary: Key Terms & Concepts" id="ugr.glossary"><div class="titlepage"><div><div><h2 class="title">Glossary: Key Terms & Concepts</h2></div></div></div><dl><dt><a name="ugr.glossary.aggregate"></a>Aggregate Analysis Engine</dt><dd><p>An <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a> |
| made up of multiple subcomponent |
| Analysis Engines arranged in a flow. The |
| flow can be one of the two built-in flows, or a custom flow provided by the user.</p></dd><dt><a name="ugr.glossary.analysis_engine"></a>Analysis Engine</dt><dd><p>A program that analyzes artifacts (e.g. documents) and infers information about |
| them, and which implements the UIMA Analysis Engine interface specification. It |
| does not matter how the program is built, with what framework or whether or not |
| it contains component (<span class="quote">“<span class="quote">sub</span>”</span>) Analysis Engines.</p></dd><dt><a name="ugr.glossary.annotation"></a>Annotation</dt><dd><p>The association of metadata, such as a label, with a region of text (or other |
| type of artifact). For example, the label <span class="quote">“<span class="quote">Person</span>”</span> associated with a |
| region of text <span class="quote">“<span class="quote">John Doe</span>”</span> constitutes an annotation. We say |
| <span class="quote">“<span class="quote">Person</span>”</span> annotates the span of text from X to Y containing exactly |
| <span class="quote">“<span class="quote">John Doe</span>”</span>. An annotation is represented as a special |
| <a class="glossterm" href="#ugr.glossary.type"><em class="glossterm">type</em></a> |
| |
| in a UIMA <a class="glossterm" href="#ugr.glossary.type_system"><em class="glossterm">type system</em></a>. |
| It is the type used to record |
| the labeling of regions of a <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a>. |
| Annotations are <a class="glossterm" href="#ugr.glossary.feature_structure"><em class="glossterm">Feature Structures</em></a> |
| whose <a class="glossterm" href="#ugr.glossary.type"><em class="glossterm">Type</em></a> is Annotation or a subtype |
| of that.</p></dd><dt><a name="ugr.glossary.annotator"></a>Annotator</dt><dd><p>A software |
| component that implements the UIMA annotator interface. Annotators are |
| implemented to produce and record annotations over regions of an artifact |
| (e.g., text document, audio, and video).</p></dd><dt><a name="ugr.glossary.application"></a>Application</dt><dd><p>An application is the outer containing code that invokes |
| the UIMA framework functions to instantiate an |
| <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a> or a |
| <a class="glossterm" href="#ugr.glossary.cpe"><em class="glossterm">Collection Processing Engine</em></a> from a particular |
| descriptor, and run it.</p></dd><dt><a name="ugr.glossary.apache_uima_java_framework"></a>Apache UIMA Java Framework</dt><dd><p>A Java-based implementation of the <a class="glossterm" href="#ugr.glossary.uima"><em class="glossterm">UIMA</em></a> |
| architecture. It provides a run-time environment in which developers can plug in and run their UIMA component |
| implementations and with which they can build and deploy UIM applications. The framework is the |
| core part of the <a class="glossterm" href="#ugr.glossary.apache_uima_sdk"><em class="glossterm">Apache UIMA SDK</em></a>.</p></dd><dt><a name="ugr.glossary.apache_uima_sdk"></a>Apache UIMA Software Development Kit (SDK)</dt><dd><p>The SDK for which you are now reading the documentation. The SDK includes the framework |
| plus additional components such as tooling and examples. Some of the tooling is Eclipse-based |
| (<a class="ulink" href="http://www.eclipse.org/" target="_top">http://www.eclipse.org/</a>).</p></dd><dt><a name="ugr.glossary.cas"></a>CAS</dt><dd><p>The UIMA Common Analysis Structure is |
| the primary data structure which UIMA analysis components use to represent and |
| share analysis results. It contains:</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>The artifact. This is the object |
| being analyzed such as a text document or audio or video stream. The CAS |
| projects one or more views of the artifact. Each view is referred to as a |
| <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a>.</p></li><li class="listitem"><p>A type system description – |
| indicating the types, subtypes, and their features. </p></li><li class="listitem"><p>Analysis metadata – <span class="quote">“<span class="quote">standoff</span>”</span> |
| annotations describing the artifact or a region of the artifact </p></li><li class="listitem"><p>An index repository to support |
| efficient access to and iteration over the results of analysis. |
| </p></li></ul></div><p>UIMA's primary interface to this structure is provided by |
| a class called the Common Analysis System. We use <span class="quote">“<span class="quote">CAS</span>”</span> to refer to |
| both the structure and system. Where the common analysis structure is used |
| through a different interface, the particular implementation of the structure |
| is indicated. For example, the <a class="glossterm" href="#ugr.glossary.jcas"><em class="glossterm">JCas</em></a> is a native Java object
| representation of the contents of the common analysis structure.</p><p>A CAS can have multiple views; each view has a unique |
| representation of the artifact, and has its own index repository, representing |
| results of analysis for that representation of the artifact.</p></dd><dt><a name="ugr.glossary.cas_consumer"></a>CAS Consumer</dt><dd><p>A component that |
| receives each CAS in the collection, usually after it has been processed by an |
| <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>. It is responsible for taking the results from |
| the CAS and using them for some purpose, perhaps storing selected results into |
| a database, for instance. The CAS |
| Consumer may also perform collection-level analysis, saving these results in an |
| application-specific, aggregate data structure.</p></dd><dt><a name="ugr.glossary.cas_initializer"></a>CAS Initializer (deprecated)</dt><dd><p>Prior to version 2, this was the component that took an |
| undefined input form and produced a particular <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a>. |
| In version 2, it has been replaced by any <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>
| which takes a particular <a class="glossterm" href="#ugr.glossary.cas_view"><em class="glossterm">CAS View</em></a> and creates a |
| new output Sofa. For example, if the document is HTML, an Analysis Engine might |
| create a Sofa which is a detagged version of an input CAS View, perhaps also |
| creating annotations derived from the tags. For example, &lt;p&gt; tags
| might be translated into Paragraph annotations in the CAS.</p></dd><dt><a name="ugr.glossary.cas_multiplier"></a>CAS Multiplier</dt><dd><p>A component, implemented by a UIMA developer, |
| that takes a CAS as input and produces 0 or more new CASes as output. Common use cases for a CAS Multiplier |
| include creating alternative versions of an input <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a> |
| (see <a class="glossterm" href="#ugr.glossary.cas_initializer"><em class="glossterm">CAS Initializer</em></a>), and breaking |
| a large input CAS into smaller pieces, each of which is emitted as a |
| separate output CAS. There are other |
| uses, however, such as aggregating input CASes into a single output CAS.</p></dd><dt><a name="ugr.glossary.cas_processor"></a>CAS Processor</dt><dd><p>A component of a Collection Processing Engine (CPE) that |
| takes a CAS as input and returns a CAS as output. There are two types of CAS |
| Processors: <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>s and |
| <a class="glossterm" href="#ugr.glossary.cas_consumer"><em class="glossterm">CAS Consumer</em></a>s.</p></dd><dt><a name="ugr.glossary.cas_view"></a>CAS View</dt><dd><p>A CAS Object which shares the base CAS and type system |
| definition and index specifications, but has a unique index repository and a |
| particular <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a>. Views are named, and applications and |
| annotators can dynamically create additional views whenever they are needed. |
| Annotations are made with respect to one view. Feature structures can have references to feature structures |
| indexed in other views, as needed.</p></dd><dt><a name="ugr.glossary.cde"></a>CDE</dt><dd><p>The Component Descriptor Editor. This |
| is the Eclipse tool that lets you conveniently edit the UIMA descriptors; |
| see <a href="tools.html#ugr.tools.cde" class="olink">Chapter 1, <i>Component Descriptor Editor User's Guide</i></a>.</p></dd><dt><a name="ugr.glossary.cpe"></a>Collection Processing Engine (CPE)</dt><dd><p>Performs Collection Processing |
| through the combination of a |
| <a class="glossterm" href="#ugr.glossary.collection_reader"><em class="glossterm">Collection Reader</em></a>, |
| zero or more <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>s,
| and zero or more <a class="glossterm" href="#ugr.glossary.cas_consumer"><em class="glossterm">CAS Consumer</em></a>s. |
| The Collection Processing Manager (CPM) manages the execution of the engine.</p><p>The CPE also refers to the XML specification of the Collection Processing |
| engine. The CPM reads a CPE specification and instantiates a CPE instance from it, |
| and runs it.</p></dd><dt><a name="ugr.glossary.cpm"></a>Collection Processing Manager (CPM)</dt><dd><p>The part of the framework that |
| manages the execution of collection processing, routing CASes from the
| <a class="glossterm" href="#ugr.glossary.collection_reader"><em class="glossterm">Collection Reader</em></a> |
| |
| to 0 or more <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>s |
| and then to 0 or more <a class="glossterm" href="#ugr.glossary.cas_consumer"><em class="glossterm">CAS Consumer</em></a>s. The CPM
| provides feedback such as performance statistics and error reporting and supports |
| other features such as parallelization and error handling.</p></dd><dt><a name="ugr.glossary.collection_reader"></a>Collection Reader</dt><dd><p>A component |
| that reads documents from some source, for example a file system or database. |
| The collection reader initializes a CAS with this document. |
| Each document is returned as a CAS that may then be processed by |
| an <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>. If the task of populating a CAS
| from the document is complex, you may use an arbitrarily complex chain of |
| <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>s and have the last one |
| create and initialize a new <a class="glossterm" href="#ugr.glossary.sofa"><em class="glossterm">Sofa</em></a>.</p></dd><dt><a name="ugr.glossary.feature_structure"></a>Feature Structure</dt><dd><p>An instance of a <a class="glossterm" href="#ugr.glossary.type"><em class="glossterm">Type</em></a>. |
| Feature Structures are kept in the <a class="glossterm" href="#ugr.glossary.cas"><em class="glossterm">CAS</em></a>, and may
| (optionally) be added to the defined <a class="glossterm" href="#ugr.glossary.index"><em class="glossterm">indexes</em></a>.
| Feature Structures may contain references to other Feature Structures.
| Feature Structures whose type is Annotation, or a subtype of it, are referred to as
| <a class="glossterm" href="#ugr.glossary.annotation"><em class="glossterm">annotations</em></a>.</p></dd><dt><a name="ugr.glossary.feature"></a>Feature</dt><dd><p>A data member or attribute of a type. Each feature itself has an
| associated range type, the type of the value that it can hold. In the |
| database analogy where types are tables, features are columns. |
| In the world of structured data types, each feature is a <span class="quote">“<span class="quote">field</span>”</span>, |
| or data member.</p></dd><dt><a name="ugr.glossary.flow_controller"></a>Flow Controller</dt><dd><p>A component which implements the interfaces needed |
| to specify a custom flow within an <a class="glossterm" href="#ugr.glossary.aggregate"><em class="glossterm">Aggregate Analysis Engine</em></a>.</p></dd><dt><a name="ugr.glossary.hybrid_analysis_engine"></a>Hybrid Analysis Engine</dt><dd><p>An <a class="glossterm" href="#ugr.glossary.aggregate"><em class="glossterm">Aggregate Analysis Engine</em></a> |
| where more than one of its component Analysis Engines are deployed in
| the same address space and one or more are deployed remotely (part tightly and |
| part loosely-coupled).</p></dd><dt><a name="ugr.glossary.index"></a>Index</dt><dd><p>Data in the CAS can only be retrieved using Indexes. |
| Indexes are analogous to the indexes that are |
| specified on tables of a database. Indexes belong to Index Repositories; |
| there is one Repository for each |
| view of the CAS. Indexes are specified |
| to retrieve instances of some CAS Type (including its subtypes), and can be |
| optionally sorted in a user-definable way. |
| For example, all types derived from the UIMA |
| built-in type <code class="literal">uima.tcas.Annotation</code> contain begin |
| and end features, which mark the begin and end offsets in the text where this |
| annotation occurs. There is a built-in index of Annotations that specifies that |
| annotations are retrieved sequentially by sorting first on the value of the begin |
| feature (ascending) and then by the value of the end feature (descending). |
| In this case, iterating over the annotations, one first obtains annotations that |
| come sequentially first in the text, while favoring longer annotations, in the case |
| where two annotations start at the same offset. Users can define their own indexes |
| as well.</p></dd><dt><a name="ugr.glossary.jcas"></a>JCas</dt><dd><p>A Java object interface to the contents of the CAS. |
| This interface uses additional generated Java classes, where each type in the CAS |
| is represented as a Java class with the same name, each feature is represented with |
| a getter and setter method, and each instance of a type is represented as a |
| Java object of the corresponding Java class.</p></dd><dt><a name="ugr.glossary.loosely_coupled_analysis_engine"></a>Loosely-Coupled Analysis Engine</dt><dd><p>An <a class="glossterm" href="#ugr.glossary.aggregate"><em class="glossterm">Aggregate Analysis Engine</em></a> |
| where no two of its component Analysis Engines run in the |
| same address space but where each is remote with respect to the others that |
| make up the aggregate. Loosely coupled engines are ideal for using |
| remote Analysis Engine services that are |
| not locally available, or for quickly assembling and testing functionality in |
| cross-language, cross-platform distributed environments. They also better enable |
| distributed scalable implementations where quick recoverability may have a
| greater impact on overall throughput than analysis speed.</p></dd><dt><a name="ugr.glossary.ontology"></a>Ontology</dt><dd><p>The part of a knowledge base that defines the semantics of the data
| axiomatically.</p></dd><dt><a name="ugr.glossary.pear"></a>PEAR</dt><dd><p>An archive file that packages up a UIMA component with its code, |
| descriptor files and other resources required to install and run it in another |
| environment. You can generate PEAR files using utilities that come with the |
| UIMA SDK.</p></dd><dt><a name="ugr.glossary.primitive_analysis_engine"></a>Primitive Analysis Engine</dt><dd><p>An <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a> |
| that is composed of a single |
| <a class="glossterm" href="#ugr.glossary.annotator"><em class="glossterm">Annotator</em></a>; one that has |
| no component (or <span class="quote">“<span class="quote">sub</span>”</span>) Analysis Engines inside of it; |
| contrast with |
| <a class="glossterm" href="#ugr.glossary.aggregate"><em class="glossterm">Aggregate Analysis Engine</em></a>.</p></dd><dt><a name="ugr.glossary.structured_information"></a>Structured Information</dt><dd><p>Items stored in structured resources such as |
| search engine indices, databases or knowledge bases. The canonical example of |
| structured information is the database table. Each element of information in |
| the database is associated with a precisely defined schema where each table |
| column heading indicates its precise semantics, defining exactly how the |
| information should be interpreted by a computer program or end-user.</p></dd><dt><a name="ugr.glossary.sofa"></a>Subject of Analysis (Sofa)</dt><dd><p>A piece of |
| data (e.g., text document, image, audio segment, or video segment), which is intended |
| for analysis by UIMA analysis components. It belongs to a |
| <a class="glossterm" href="#ugr.glossary.cas_view"><em class="glossterm">CAS View</em></a> which has the same name; there |
| is a one-to-one correspondence between these. There can be multiple Sofas contained within |
| one CAS, each one representing a different view of the original artifact – for example, |
| an audio file could be the original artifact, and also be one Sofa, and another |
| could be the output of a voice-recognition component, where the Sofa would be |
| the corresponding text document. Sofas may be analyzed independently or |
| simultaneously; they all co-exist within the CAS. </p></dd><dt><a name="ugr.glossary.tightly_coupled_analysis_engine"></a>Tightly-Coupled Analysis Engine</dt><dd><p>An <a class="glossterm" href="#ugr.glossary.aggregate"><em class="glossterm">Aggregate Analysis Engine</em></a> |
| where all of its component Analysis Engines run in the same address space.</p></dd><dt><a name="ugr.glossary.type"></a>Type</dt><dd><p>A specification of an object in the |
| <a class="glossterm" href="#ugr.glossary.cas"><em class="glossterm">CAS</em></a> used to store the results of |
| analysis. Types are defined using inheritance, so some types may be |
| defined purely for the sake of defining other types, and are in this sense <span class="quote">“<span class="quote">abstract |
| types.</span>”</span> Types usually contain |
| <a class="glossterm" href="#ugr.glossary.feature"><em class="glossterm">Feature</em></a>s, which are attributes, or |
| properties of the type. A type is roughly equivalent to a class in an |
| object oriented programming language, or a table in a database. Instances of types in the CAS |
| may be indexed for retrieval.</p></dd><dt><a name="ugr.glossary.type_system"></a>Type System</dt><dd><p>A collection of related <a class="glossterm" href="#ugr.glossary.type"><em class="glossterm">types</em></a>. |
| All components that can access the CAS, |
| including <a class="glossterm" href="#ugr.glossary.application"><em class="glossterm">Applications</em></a>, |
| <a class="glossterm" href="#ugr.glossary.analysis_engine"><em class="glossterm">Analysis Engine</em></a>s, |
| <a class="glossterm" href="#ugr.glossary.collection_reader"><em class="glossterm">Collection Readers</em></a>, |
| <a class="glossterm" href="#ugr.glossary.flow_controller"><em class="glossterm">Flow Controllers</em></a>, or |
| <a class="glossterm" href="#ugr.glossary.cas_consumer"><em class="glossterm">CAS Consumers</em></a> |
| declare the type system that they use. Type systems are shared across Analysis Engines, allowing the outputs |
| of one Analysis Engine to be read as input by another Analysis Engine. |
| A type system is roughly analogous to a set of related classes in object |
| oriented programming, or a set of related tables in a database. The type |
| system / type / feature terminology comes from computational linguistics.</p></dd><dt><a name="ugr.glossary.unstructured_information"></a>Unstructured Information</dt><dd><p>The canonical example of unstructured |
| information is the natural language text document. The intended meaning of a |
| document's content is only implicit and its precise interpretation by a |
| computer program requires some degree of analysis to explicate the document's |
| semantics. Other examples include audio, video and images. Contrast with |
| <a class="glossterm" href="#ugr.glossary.structured_information"><em class="glossterm">Structured Information</em></a>. |
| </p></dd><dt><a name="ugr.glossary.uima"></a>UIMA</dt><dd><p>UIMA is an acronym that stands for Unstructured Information Management Architecture; |
| it is a software architecture which specifies component interfaces, design patterns |
| and development roles for creating, describing, discovering, composing and |
| deploying multi-modal analysis capabilities. The UIMA specification is being developed by a |
| technical committee at <a class="ulink" href="http://www.oasis-open.org/committees/uima" target="_top">OASIS</a>.</p></dd><dt><a name="ugr.glossary.uima_java_framework"></a>UIMA Java Framework</dt><dd><p>See <a class="glossterm" href="#ugr.glossary.apache_uima_java_framework"><em class="glossterm">Apache UIMA Java Framework</em></a>.</p><p></p></dd><dt><a name="ugr.glossary.uima_sdk"></a>UIMA SDK</dt><dd><p>See <a class="glossterm" href="#ugr.glossary.apache_uima_sdk"><em class="glossterm">Apache UIMA SDK</em></a>.</p><p></p></dd><dt><a name="ugr.glossary.xcas"></a>XCAS</dt><dd><p>An XML representation of the CAS. The XCAS can be used for saving |
| and restoring CASes to and from streams. The UIMA SDK provides XCAS serialization and
| de-serialization methods for CASes. This is an older serialization format and |
| new UIMA code should use the standard <a class="glossterm" href="#ugr.glossary.xmi"><em class="glossterm">XMI</em></a> |
| format instead.</p></dd><dt><a name="ugr.glossary.xmi"></a>XML Metadata Interchange (XMI)</dt><dd><p>An OMG standard for representing |
| object graphs in XML, which UIMA uses to serialize analysis results from the |
| CAS to an XML representation. The UIMA SDK provides XMI serialization and |
| de-serialization methods for CASes.</p></dd></dl></div>
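<p>The ordering described in the Index entry (begin offset ascending, then end offset descending) can be illustrated with a minimal, self-contained sketch. It does not use the UIMA API: the class <code class="literal">AnnotationOrderSketch</code>, the method <code class="literal">sortLikeAnnotationIndex</code>, and the representation of each annotation as a two-element <code class="literal">{begin, end}</code> array are all hypothetical names chosen for illustration, and the sketch models only the two sort keys named in that entry.</p>

```java
import java.util.Arrays;

public class AnnotationOrderSketch {

    // Hypothetical helper, not part of UIMA: each span is {begin, end},
    // standing in for the begin and end features of an annotation.
    static void sortLikeAnnotationIndex(int[][] spans) {
        Arrays.sort(spans, (a, b) -> {
            // First key: begin offset, ascending.
            int byBegin = Integer.compare(a[0], b[0]);
            if (byBegin != 0) {
                return byBegin;
            }
            // Second key: end offset, descending, so the longer span comes first.
            return Integer.compare(b[1], a[1]);
        });
    }

    public static void main(String[] args) {
        int[][] spans = { {5, 8}, {0, 3}, {0, 10} };
        sortLikeAnnotationIndex(spans);
        for (int[] span : spans) {
            System.out.println(span[0] + "-" + span[1]);
        }
        // prints 0-10, then 0-3, then 5-8
    }
}
```

<p>Two spans that start at offset 0 are ordered with the longer one first, which matches the entry's remark about favoring longer annotations when two annotations start at the same offset.</p>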
| </div></body></html> |