<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
 <!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
 <!ENTITY imgroot "images/overview-and-setup/conceptual_overview_files/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <chapter id="ugr.ovv.conceptual">
   <title>UIMA Conceptual Overview</title>

   <para>UIMA is an open, industrial-strength, scalable and extensible platform for
     creating, integrating and deploying unstructured information management solutions
     from powerful text or multi-modal analysis and search components. </para>

   <para>The Apache UIMA project is an implementation of the Java UIMA framework available
     under the Apache License, providing a common foundation for industry and academia to
     collaborate and accelerate the world-wide development of technologies critical for
     discovering vital knowledge present in the fastest growing sources of information
     today.</para>

   <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
     provide a broad overview to give the reader a quick sense of UIMA&apos;s basic
     architectural philosophy and the UIMA SDK&apos;s capabilities. </para>

   <para>This chapter provides a general orientation to UIMA and makes liberal reference to
     the other chapters in the UIMA SDK documentation set, where the reader may find detailed
     treatments of key concepts and development practices. It may be useful to refer to <olink
       targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/> to become familiar
     with the terminology in this overview.</para>

   <section id="ugr.ovv.conceptual.uima_introduction">
     <title>UIMA Introduction</title>
     <figure id="ugr.ovv.conceptual.fig.bridge">
       <title>UIMA helps you build the bridge between the unstructured and structured
         worlds</title>
       <mediaobject>
         <imageobject>
           <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
         </imageobject>
         <textobject><phrase>Picture of a bridge between unstructured information
           artifacts and structured metadata about those artifacts</phrase>
         </textobject>
       </mediaobject>
     </figure>

     <para> Unstructured information represents the largest, most current and fastest
       growing source of information available to businesses and governments. The web is just
       the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
       around the world and across different media including text, voice and video. The
       high-value content in these vast collections of unstructured information is,
       unfortunately, buried in lots of noise. Searching for what you need or doing
       sophisticated data mining over unstructured information sources presents new
       challenges. </para>

     <para>An unstructured information management (UIM) application may be generally
       characterized as a software system that analyzes large volumes of unstructured
       information (text, audio, video, images, etc.) to discover, organize and deliver
       relevant knowledge to the client or application end-user. An example is an application
       that processes millions of medical abstracts to discover critical drug interactions.
       Another example is an application that processes tens of millions of documents to
       discover key evidence indicating probable competitive threats. </para>

     <para>First and foremost, the unstructured data must be analyzed to interpret, detect
       and locate concepts of interest, for example, named entities like persons,
       organizations, locations, facilities, products etc., that are not explicitly tagged
       or annotated in the original artifact. More challenging analytics may detect things
       like opinions, complaints, threats or facts. And then there are relations, for
       example, located in, finances, supports, purchases, repairs etc. The list of concepts
       important for applications to discover in unstructured content is large, varied and
       often domain specific.
       Many different component analytics may solve different parts of the overall analysis task.
       These component analytics must interoperate and must be easily combined to facilitate
       the development of UIM applications.</para>

     <para>The results of analysis are used to populate structured forms so that conventional
       data processing and search technologies
       like search engines, database engines or OLAP
       (On-Line Analytical Processing, or Data Mining) engines
       can efficiently deliver the newly discovered content in response to the client requests
       or queries.</para>

     <para>In analyzing unstructured content, UIM applications make use of a variety of
       analysis technologies including:</para>

     <itemizedlist spacing="compact">
       <listitem><para>Statistical and rule-based Natural Language Processing
         (NLP)</para>
       </listitem>
       <listitem><para>Information Retrieval (IR)</para>
       </listitem>
       <listitem><para>Machine learning</para>
       </listitem>
       <listitem><para>Ontologies</para>
       </listitem>
       <listitem><para>Automated reasoning and</para>
       </listitem>
       <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
       </listitem>

     </itemizedlist>

     <para>Specific analysis capabilities using these technologies are developed
       independently using different techniques, interfaces and platforms.
       </para>

     <para>The bridge from the unstructured world to the structured world is built through the
       composition and deployment of these analysis capabilities. This integration is often
       a costly challenge. </para>

     <para>The Unstructured Information Management Architecture (UIMA) is an architecture
       and software framework that helps you build that bridge. It supports creating,
       discovering, composing and deploying a broad range of analysis capabilities and
       linking them to structured information services.</para>

     <para>UIMA allows development teams to match the right skills with the right parts of a
       solution and helps enable rapid integration across technologies and platforms using a
       variety of different deployment options. These range from tightly-coupled
       deployments for high-performance, single-machine, embedded solutions to parallel
       and fully distributed deployments for highly flexible and scalable
       solutions.</para>

   </section>

   <section id="ugr.ovv.conceptual.architecture_framework_sdk">
     <title>The Architecture, the Framework and the SDK</title>
     <para>UIMA is a software architecture which specifies component interfaces, data
       representations, design patterns and development roles for creating, describing,
       discovering, composing and deploying multi-modal analysis capabilities.</para>

     <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
       environment in which developers can plug in their UIMA component implementations and
       with which they can build and deploy UIM applications. The framework is not specific to
       any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
       Framework.</para>

     <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
       includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
       tooling supports an Eclipse-based (<ulink url="http://www.eclipse.org/"/>)
       development environment. </para>

   </section>

   <section id="ugr.ovv.conceptual.analysis_basics">
     <title>Analysis Basics</title>
     <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
       Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
       Context.</para>
     </note>

     <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
       <title>Analysis Engines, Annotators &amp; Results</title>
       <figure id="ugr.ovv.conceptual.metadata_in_cas">
         <title>Objects represented in the Common Analysis Structure (CAS)</title>
         <mediaobject>
           <imageobject role="html">
             <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
           </imageobject>
           <imageobject role="fo">
             <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
           </imageobject>
           <textobject><phrase>Picture of some text, with a hierarchy of discovered
             metadata about words in the text, including some image of a person as metadata
             about that name.</phrase>
           </textobject>
         </mediaobject>
       </figure>

       <para>UIMA is an architecture in which basic building blocks called Analysis Engines
         (AEs) are composed to analyze a document and infer and record descriptive attributes
         about the document as a whole, and/or about regions therein. This descriptive
         information, produced by AEs, is referred to generally as <emphasis role="bold">analysis
         results</emphasis>. Analysis results typically represent meta-data
         about the document content. One way to think about AEs is as software agents that
         automatically discover and record meta-data about original content.</para>

       <para>UIMA supports the analysis of different modalities including text, audio and
         video. The majority of examples we provide are for text. We use the term <emphasis
           role="bold">document</emphasis>, therefore, to generally refer to any unit of
         content that an AE may process, whether it is a text document or a segment of audio, for
         example. See the <olink targetdoc="&uima_docs_tutorial_guides;"/>
         <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.mvs"/> for more information on multimodal processing
         in UIMA.</para>

       <para>Analysis results include different statements about the content of a document.
         For example, the following is an assertion about the topic of a document:</para>


       <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>

       <para>Analysis results may include statements describing regions more granular than
         the entire document. We use the term <emphasis role="bold">span</emphasis> to
         refer to a sequence of characters in a text document. Consider that a document with the
         identifier D102 contains a span, <quote>Fred Center</quote>, starting at
         character position 101. An AE that can detect persons in text may represent the
         following statement as an analysis result:</para>


       <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>

       <para>In both statements 1 and 2 above there is a special pre-defined term or what we call
         in UIMA a <emphasis role="bold">Type</emphasis>. They are
         <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
         UIMA types characterize the kinds of results that an AE may create &ndash; more on
         types later.</para>

       <para>Other analysis results may relate two statements. For example, an AE might
         record in its results that two spans are both referring to the same person:</para>


       <programlisting>(3) The Person denoted by span 101 to 112 and
   the Person denoted by span 141 to 143 in document D102
   refer to the same Entity.</programlisting>

       <para>The above statements are some examples of the kinds of results that AEs may record
         to describe the content of the documents they analyze. These are not meant to indicate
         the form or syntax with which these results are captured in UIMA &ndash; more on that
         later in this overview.</para>

       <para>The UIMA framework treats Analysis Engines as pluggable, composable,
         discoverable, managed objects. At the heart of AEs are the analysis algorithms that
         do all the work to analyze documents and record analysis results. </para>

       <para>UIMA provides a basic component type intended to house the core analysis
         algorithms running inside AEs. Instances of this component are called <emphasis
           role="bold">Annotators</emphasis>. The analysis algorithm developer&apos;s
         primary concern therefore is the development of annotators. The UIMA framework
         provides the necessary methods for taking annotators and creating analysis
         engines.</para>

       <para>In UIMA the person who codes analysis algorithms takes on the role of the
           <emphasis role="bold">Annotator Developer</emphasis>. <olink
           targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/>
           in <olink targetdoc="&uima_docs_tutorial_guides;"/> will take the reader
         through the details involved in creating UIMA annotators and analysis
         engines.</para>

       <para>At the most primitive level, an AE wraps an annotator, adding the necessary APIs and
         infrastructure for the composition and deployment of annotators within the UIMA
         framework. The simplest AE contains exactly one annotator at its core. Complex AEs
         may contain a collection of other AEs each potentially containing within them other
         AEs. </para>
     </section>

     <section id="ugr.ovv.conceptual.representing_results_in_cas">
       <title>Representing Analysis Results in the CAS</title>

       <para>How annotators represent and share their results is an important part of the UIMA
         architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
         (CAS)</emphasis> precisely for these purposes.</para>

       <para>The CAS is an object-based data structure that allows the representation of
         objects, properties and values. Object types may be related to each other in a
         single-inheritance hierarchy. The CAS logically (if not physically) contains the
         document being analyzed. Analysis developers share and record their analysis
         results in terms of an object model within the CAS. <footnote><para> We have plans to
         extend the representational capabilities of the CAS and align its semantics with the
         semantics of the OMG&apos;s Essential Meta-Object Facility (EMOF) and with the
         Eclipse Modeling Framework&apos;s (<ulink
           url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
         representation.</para> </footnote> </para>

       <para>The UIMA framework includes an implementation and interfaces to the CAS. For a
         more detailed description of the CAS and its interfaces see <olink
           targetdoc="&uima_docs_ref;"/> <olink
           targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>

       <para>A CAS that logically contains statement 2 (repeated here for your
         convenience)</para>


       <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>

       <para>would include objects of the Person type. For each person found in the body of a
         document, the AE would create a Person object in the CAS and link it to the span of text
         where the person was mentioned in the document.</para>

       <para>While the CAS is a general purpose data structure, UIMA defines a
         few basic types and affords the developer the ability to extend these to define an
         arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
         type system as an object schema for the CAS.</para>

       <para>A type system defines the various types of objects that may be discovered in
         documents by AEs that subscribe to that type system.</para>

       <para>As suggested above, Person may be defined as a type. Types have properties or
           <emphasis role="bold">features</emphasis>. So for example,
         <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
         features of the Person type.</para>

       <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
         Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
         Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>

       <para>There are no limits to the different types that may be defined in a type system. A
         type system is domain and application specific.</para>

       <para>Types in a UIMA type system may be organized into a taxonomy. For example,
         <emphasis>Company</emphasis> may be defined as a subtype of
         <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
         subtype of a <emphasis>ParseNode</emphasis>.</para>
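
       <para>As a minimal sketch of how such a taxonomy might be declared, the following
         type system descriptor defines <emphasis>Organization</emphasis> with one feature
         and <emphasis>Company</emphasis> as its subtype. The <code>example</code> namespace,
         the descriptor name and the <code>officialName</code> feature are hypothetical and
         chosen only for illustration; they are not part of UIMA.</para>

       <programlisting><![CDATA[<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <!-- Hypothetical descriptor name, for illustration only -->
  <name>ExampleTypeSystem</name>
  <description>Illustrative type system; all example.* names are assumptions.</description>
  <version>1.0</version>
  <types>
    <!-- Organization: a type with a single String-valued feature -->
    <typeDescription>
      <name>example.Organization</name>
      <description>An organization discovered in a document.</description>
      <supertypeName>uima.cas.TOP</supertypeName>
      <features>
        <featureDescription>
          <name>officialName</name>
          <description>Hypothetical feature holding the organization's name.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
    <!-- Company: a subtype that inherits the features of Organization -->
    <typeDescription>
      <name>example.Company</name>
      <description>A company, which is a kind of organization.</description>
      <supertypeName>example.Organization</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>]]></programlisting>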

       <section id="ugr.ovv.conceptual.annotation_type">
         <title>The Annotation Type</title>

         <para>A general and common type used in artifact analysis and from which additional
           types are often derived is the <emphasis role="bold">annotation</emphasis>
           type. </para>

         <para>The annotation type is used to annotate or label regions of an artifact. Common
           artifacts are text documents, but they can be other things, such as audio streams.
           The annotation type for text includes two features, namely
           <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
           features represent integer offsets in the artifact and delimit a span. Any
           particular annotation object identifies the span it annotates with the
           <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>

         <para>The key idea here is that the annotation type is used to identify and label or
           <quote>annotate</quote> a specific region of an artifact.</para>

         <para>Consider that the Person type is defined as a subtype of annotation. An
           annotator, for example, can create a Person annotation to record the discovery of a
           mention of a person between position 141 and 143 in document D102. The annotator can
           create another person annotation to record the detection of a mention of a person in
           the span between positions 101 and 112. </para>
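
         <para>A sketch of how the Person type from this example might be declared as a
           subtype of the built-in annotation type is shown below. The name
           <code>example.Person</code> and the lower-case feature names are assumptions made
           for illustration. The <emphasis>begin</emphasis> and <emphasis>end</emphasis>
           features are inherited from <code>uima.tcas.Annotation</code> and therefore are not
           declared again.</para>

         <programlisting><![CDATA[<typeDescription>
  <!-- Hypothetical type name; begin and end are inherited from the supertype -->
  <name>example.Person</name>
  <description>A mention of a person in the document text.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>age</name>
      <description>Age of the person, if it can be determined.</description>
      <rangeTypeName>uima.cas.Integer</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>occupation</name>
      <description>Occupation of the person, if it can be determined.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>]]></programlisting>

         <para>With such a declaration, the two mentions above would be recorded as two Person
           annotations, one with <emphasis>begin</emphasis> 101 and <emphasis>end</emphasis>
           112, the other with <emphasis>begin</emphasis> 141 and <emphasis>end</emphasis>
           143.</para>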
       </section>
       <section id="ugr.ovv.conceptual.not_just_annotations">
         <title>Not Just Annotations</title>

         <para>While the annotation type is a useful type for annotating regions of a
           document, annotations are not the only kind of types in a CAS. A CAS is a general
           representation scheme and may store arbitrary data structures to represent the
           analysis of documents.</para>

         <para>As an example, consider statement 3 above (repeated here for your
           convenience).</para>


         <programlisting>(3) The Person denoted by span 101 to 112 and
   the Person denoted by span 141 to 143 in document D102
   refer to the same Entity.</programlisting>

         <para>This statement mentions two person annotations in the CAS; the first, call it
           P1 delimiting the span from 101 to 112 and the other, call it P2, delimiting the span
           from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
           same entity. This means that while there are two expressions in the text
           represented by the annotations P1 and P2, each refers to one and the same person.
           </para>

         <para>The Entity type may be introduced into a type system to capture this kind of
           information. The Entity type is not an annotation. It is intended to represent an
           object in the domain which may be referred to by different expressions (or
           mentions) occurring multiple times within a document (or across documents within
           a collection of documents). The Entity type has a feature named
           <emphasis>occurrences. </emphasis>This feature is used to point to all the
           annotations believed to label mentions of the same entity.</para>

         <para>Consider that the spans annotated by P1 and P2 were <quote>Fred
           Center</quote> and <quote>He</quote> respectively. The annotator might create
           a new Entity object called
           <code>FredCenter</code>. To represent the relationship in statement 3 above,
           the annotator may link FredCenter to both P1 and P2 by making them values of its
           <emphasis>occurrences</emphasis> feature.</para>
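
         <para>One possible declaration of such an Entity type is sketched below. The type
           name <code>example.Entity</code> is hypothetical. The sketch assumes that the
           <emphasis>occurrences</emphasis> feature is modeled as an array of feature
           structures whose elements are annotations; other representations, such as a list,
           would serve equally well.</para>

         <programlisting><![CDATA[<typeDescription>
  <!-- Hypothetical type name; Entity is not an annotation, so it extends TOP -->
  <name>example.Entity</name>
  <description>A real-world object referred to by one or more mentions.</description>
  <supertypeName>uima.cas.TOP</supertypeName>
  <features>
    <featureDescription>
      <name>occurrences</name>
      <description>The annotations believed to mention this entity.</description>
      <!-- An array of references to other objects in the CAS -->
      <rangeTypeName>uima.cas.FSArray</rangeTypeName>
      <elementType>uima.tcas.Annotation</elementType>
    </featureDescription>
  </features>
</typeDescription>]]></programlisting>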

         <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
           illustrates that an entity may be linked to annotations referring to regions of
           image documents as well. To do this the annotation type would have to be extended
           with the appropriate features to point to regions of an image.</para>
       </section>

       <section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
         <title>Multiple Views within a CAS</title>

         <para>UIMA supports the simultaneous analysis of multiple views of a document. This
           support comes in handy for processing multiple forms of the artifact, for example, the audio
           and the closed captioned views of a single speech stream, or the tagged and detagged
           views of an HTML document.</para>

         <para>AEs analyze one or more views of a document. Each view contains a specific
             <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
           indexes holding metadata indexed by that view. The CAS, overall, holds one or more
           CAS Views, plus the descriptive objects that represent the analysis results for
           each. </para>

         <para>Another common example of using CAS Views is for different translations of a
           document. Each translation may be represented with a different CAS View. Each
           translation may be described by a different set of analysis results. For more
           details on CAS Views and Sofas see <olink
             targetdoc="&uima_docs_tutorial_guides;"/> <olink
             targetdoc="&uima_docs_tutorial_guides;"
             targetptr="ugr.tug.mvs"/> and <olink
             targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
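
         <para>For the translation example, a component that reads one view and creates
           another might declare the views it works with in the capabilities section of its
           descriptor, roughly as sketched below. The view names
           <code>EnglishDocument</code> and <code>GermanDocument</code> are assumptions chosen
           for illustration; the chapters referenced above describe the authoritative
           form.</para>

         <programlisting><![CDATA[<capabilities>
  <capability>
    <inputs/>
    <outputs/>
    <!-- Hypothetical name of the view this component reads -->
    <inputSofas>
      <sofaName>EnglishDocument</sofaName>
    </inputSofas>
    <!-- Hypothetical name of the view this component creates -->
    <outputSofas>
      <sofaName>GermanDocument</sofaName>
    </outputSofas>
  </capability>
</capabilities>]]></programlisting>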
       </section>
     </section>

     <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
       <title>Interacting with the CAS and External Resources</title>
       <titleabbrev>Using CASes and External Resources</titleabbrev>

       <para>The two main interfaces that a UIMA component developer interacts with are the
         CAS and the UIMA Context.</para>

       <para>UIMA provides an efficient implementation of the CAS with multiple programming
         interfaces. Through these interfaces, the annotator developer interacts with the
         document and reads and writes analysis results. The CAS interfaces provide a suite of
         access methods that allow the developer to obtain indexed iterators to the different
         objects in the CAS. See <olink targetdoc="&uima_docs_ref;"/> <olink targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
         developer can obtain a specialized iterator to all Person objects associated with a
         particular view, for example.</para>

       <para>For Java annotator developers, UIMA provides the JCas. This interface provides
         the Java developer with a natural interface to CAS objects. Each type declared in the
         type system appears as a Java Class; the UIMA framework renders the Person type as a
         Person class in Java. As the analysis algorithm detects mentions of persons in the
         documents, it can create Person objects in the CAS. For more details on how to interact
         with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
         /> <olink targetdoc="&uima_docs_ref;"
           targetptr="ugr.ref.jcas"/>.</para>

       <para>The component developer, in addition to interacting with the CAS, can access
         external resources through the framework&apos;s resource manager interface
         called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among
         other things, can ensure that different annotators working together in an aggregate
         flow may share the same instance of an external file or remote resource accessed
         via its URL, for example. For details on using
         the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
         /> <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.aae"/>.</para>

     </section>
     <section id="ugr.ovv.conceptual.component_descriptors">
       <title>Component Descriptors</title>
       <para>UIMA defines interfaces for a small set of core components that users of the
         framework provide implementations for. Annotators and Analysis Engines are two of
         the basic building blocks specified by the architecture. Developers implement them
         to build and compose analysis capabilities and ultimately applications.</para>

       <para>There are other components in addition to these, which we will learn about
         later, but for every component specified in UIMA there are two parts required for its
         implementation:</para>

       <orderedlist spacing="compact">
         <listitem><para>the declarative part and</para></listitem>
         <listitem><para>the code part.</para></listitem>
       </orderedlist>

       <para>The declarative part contains metadata describing the component, its
         identity, structure and behavior and is called the <emphasis role="bold">
         Component Descriptor</emphasis>. Component descriptors are represented in XML.
         The code part implements the algorithm. The code part may be a program in Java.</para>

       <para>As a developer using the UIMA SDK, to implement a UIMA component it is always the
         case that you will provide two things: the code part and the Component Descriptor.
         Note that when you are composing an engine, the code may be already provided in
         reusable subcomponents. In these cases you may not be developing new code but rather
         composing an aggregate engine by pointing to other components where the code has been
         included.</para>

       <para>Component descriptors are represented in XML and aid in component discovery,
         reuse, composition and development tooling. The UIMA SDK provides a tool for easily
         creating and maintaining component descriptors, which relieves the developer from
         editing XML directly. This tool is described briefly in <olink
           targetdoc="&uima_docs_tutorial_guides;"/> <olink
           targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.aae"/>, and more
         thoroughly in <olink targetdoc="&uima_docs_tools;"/>
         <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>

       <para>Component descriptors contain standard metadata including the
         component&apos;s name, author, version, and a reference to the class that
         implements the component.</para>

       <para>In addition to these standard fields, a component descriptor identifies the
         type system the component uses and the types it requires in an input CAS and the types it
         plans to produce in an output CAS.</para>

       <para>For example, an AE that detects person types may require as input a CAS that
         includes a tokenization and deep parse of the document. The descriptor refers to a
         type system to make the component&apos;s input requirements and output types
         explicit. In effect, the descriptor includes a declarative description of the
         component&apos;s behavior and can be used to aid in component discovery and
         composition based on desired results. UIMA analysis engines provide an interface
         for accessing the component metadata represented in their descriptors. For more
         details on the structure of UIMA component descriptors refer to <olink
           targetdoc="&uima_docs_ref;"/> <olink
           targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para>
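
       <para>To make this concrete, a descriptor for a primitive AE that detects persons
         might look roughly like the following sketch. The class name
         <code>example.PersonAnnotator</code>, the imported file
         <code>ExampleTypeSystem.xml</code>, the vendor and all <code>example.*</code> type
         names are hypothetical and stand in for whatever a real component would use.</para>

       <programlisting><![CDATA[<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <!-- The code part: hypothetical class implementing the annotator -->
  <annotatorImplementationName>example.PersonAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <!-- Standard metadata: name, description, version, vendor -->
    <name>Person Detector</name>
    <description>Detects mentions of persons in text.</description>
    <version>1.0</version>
    <vendor>Example Vendor</vendor>
    <!-- The type system the component uses (hypothetical file name) -->
    <typeSystemDescription>
      
    </typeSystemDescription>
    <capabilities>
      <capability>
        <!-- Types required in the input CAS -->
        <inputs>
          <type allAnnotatorFeatures="true">example.Token</type>
          <type allAnnotatorFeatures="true">example.ParseNode</type>
        </inputs>
        <!-- Types the component plans to produce -->
        <outputs>
          <type allAnnotatorFeatures="true">example.Person</type>
        </outputs>
        <languagesSupported>
          <language>en</language>
        </languagesSupported>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</analysisEngineDescription>]]></programlisting>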

     </section>
   </section>
   <section id="ugr.ovv.conceptual.aggregate_analysis_engines">
     <title>Aggregate Analysis Engines</title>

     <note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine,
       Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para>
     </note>

     <figure id="ugr.ovv.conceptual.sample_aggregate">
       <title>Sample Aggregate Analysis Engine</title>
       <mediaobject>
         <imageobject role="html">
           <imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/>
         </imageobject>
         <imageobject role="fo">
           <imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/>
         </imageobject>
         <textobject><phrase>Picture of multiple parts (a language identifier,
           tokenizer, part of speech annotator, shallow parser, and named entity detector)
           strung together into a flow, and all of them wrapped as a single aggregate object,
           which produces as annotations the union of all the results of the individual
           annotator components ( tokens, parts of speech, names, organizations, places,
           persons, etc.)</phrase>
         </textobject>
       </mediaobject>
     </figure>

     <para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs,
       however, may be defined to contain other AEs organized in a workflow. These more complex
       analysis engines are called <emphasis role="bold">Aggregate Analysis
       Engines.</emphasis> </para>

     <para>Annotators tend to perform fairly granular functions, for example language
       detection, tokenization or part of speech detection.
     These functions typically address just part of an overall analysis task. A workflow
       of component engines may be orchestrated to perform more complex tasks.</para>

     <para>An AE that performs named entity detection, for example, may
       include a pipeline of annotators starting with language detection feeding
       tokenization, then part-of-speech detection, then deep grammatical parsing and then
       finally named-entity detection. Each step in the pipeline is required by the
       subsequent analysis. For example, the final named-entity annotator can only do its
       analysis if the previous deep grammatical parse was recorded in the CAS.</para>

     <para>Aggregate AEs are built to encapsulate potentially complex internal structure
       and insulate it from users of the AE. In our example, the aggregate analysis engine
       developer acquires the internal components, defines the necessary flow
       between them and publishes the resulting AE. Consider the simple example illustrated
       in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where
       <quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more
       primitive analysis engines.</para>

     <para>Users of this AE need not know how it is constructed internally but only need its name
       and its published input requirements and output types. These must be declared in the
       aggregate AE&apos;s descriptor. Aggregate AE&apos;s descriptors declare the components
       they contain and a <emphasis role="bold">flow specification</emphasis>. The flow
       specification defines the order in which the internal component AEs should be run. The
       internal AEs specified in an aggregate are also called the <emphasis role="bold">
       delegate analysis engines.</emphasis> The term <quote>delegate</quote> is used because aggregate AEs
       are thought to <quote>delegate</quote> functions to their internal AEs.</para>
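
     <para>As a rough illustration of the declarations described above, the following fragment
       sketches how an aggregate descriptor for the <quote>MyNamed-EntityDetector</quote>
       example might list its delegates and a fixed, linear flow. The delegate keys and the
       imported descriptor file names are hypothetical, and other elements (such as the
       capabilities that publish input requirements and output types) are omitted; the
       component descriptor reference remains the authoritative source for the format.</para>

     <programlisting>&lt;analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier"&gt;
  &lt;frameworkImplementation&gt;org.apache.uima.java&lt;/frameworkImplementation&gt;
  &lt;primitive&gt;false&lt;/primitive&gt;

  &lt;!-- Delegate AEs, each identified by a key (keys and file names are hypothetical) --&gt;
  &lt;delegateAnalysisEngineSpecifiers&gt;
    &lt;delegateAnalysisEngine key="LanguageIdentifier"&gt;
      &lt;import location="LanguageIdentifier.xml"/&gt;
    &lt;/delegateAnalysisEngine&gt;
    &lt;delegateAnalysisEngine key="Tokenizer"&gt;
      &lt;import location="Tokenizer.xml"/&gt;
    &lt;/delegateAnalysisEngine&gt;
    &lt;delegateAnalysisEngine key="NamedEntityDetector"&gt;
      &lt;import location="NamedEntityDetector.xml"/&gt;
    &lt;/delegateAnalysisEngine&gt;
  &lt;/delegateAnalysisEngineSpecifiers&gt;

  &lt;analysisEngineMetaData&gt;
    &lt;name&gt;MyNamed-EntityDetector&lt;/name&gt;
    &lt;!-- Flow specification: run the delegates in the order listed --&gt;
    &lt;flowConstraints&gt;
      &lt;fixedFlow&gt;
        &lt;node&gt;LanguageIdentifier&lt;/node&gt;
        &lt;node&gt;Tokenizer&lt;/node&gt;
        &lt;node&gt;NamedEntityDetector&lt;/node&gt;
      &lt;/fixedFlow&gt;
    &lt;/flowConstraints&gt;
  &lt;/analysisEngineMetaData&gt;
&lt;/analysisEngineDescription&gt;</programlisting>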

     <para>
       In UIMA 2.0, the developer can implement a <quote>Flow Controller</quote> and include it as part
       of an aggregate AE by referring to it in the aggregate AE's descriptor.
       The flow controller is responsible for computing the <quote>flow</quote>, that is,
       for determining the order in which the delegate AEs will process the CAS.
       The Flow Controller has access to the CAS and any external resources it may require
       for determining the flow. It can do this dynamically at run-time, make
       multi-step decisions, and consider any sort of flow specification
       included in the aggregate AE's descriptor. See
       <olink targetdoc="&uima_docs_tutorial_guides;"/>
       <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/>
       for details on the UIMA Flow Controller interface.
     </para>
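
     <para>As a minimal sketch of such a reference, an aggregate descriptor might name its flow
       controller with a <code>flowController</code> element placed alongside the delegate
       declarations. The key <code>MyFlowController</code> and the imported file name are
       hypothetical, and the surrounding descriptor elements are omitted.</para>

     <programlisting>&lt;!-- Inside the aggregate's analysisEngineDescription element --&gt;
&lt;flowController key="MyFlowController"&gt;
  &lt;!-- Points at the flow controller's own descriptor (hypothetical file) --&gt;
  &lt;import location="MyFlowController.xml"/&gt;
&lt;/flowController&gt;</programlisting>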

     <para>We refer to the development role associated with building an aggregate from
       delegate AEs as the <emphasis role="bold">Analysis Engine Assembler</emphasis>.</para>

     <para>The UIMA framework, given an aggregate analysis engine descriptor, will run all
       delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by
       the flow controller. The UIMA framework is equipped to handle different
       deployments where the delegate engines, for example, are <emphasis role="bold">
       tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold">
       loosely-coupled</emphasis> (running in separate processes or even on different
       machines). The framework supports a number of remote protocols for loosely coupled
       deployments of aggregate analysis engines, including SOAP (Simple
       Object Access Protocol, a standard Web Services communications protocol).</para>

     <para>The UIMA framework facilitates the deployment of AEs as remote services by using an
       adapter layer that automatically creates the necessary infrastructure in response to
       a declaration in the component&apos;s descriptor. For more details on creating
       aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;"
         /> <olink targetdoc="&uima_docs_ref;"
         targetptr="ugr.ref.xml.component_descriptor"/>. The component descriptor editor tool
       assists in the specification of aggregate AEs from a repository of available engines.
       For more details on this tool refer to <olink targetdoc="&uima_docs_tools;"
         /> <olink targetdoc="&uima_docs_tools;"
         targetptr="ugr.tools.cde"/>.</para>

     <para>The UIMA framework implementation has two built-in flow implementations: one
       that supports a linear flow between components, and one with conditional branching
       based on the language of the document. It also supports user-provided flow
       controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;"
         /> <olink targetdoc="&uima_docs_tutorial_guides;"
         targetptr="ugr.tug.fc"/>. Furthermore, the application developer is
       free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily
       complex flows. For more details on this the reader may refer to <olink
         targetdoc="&uima_docs_tutorial_guides;"/> <olink
         targetdoc="&uima_docs_tutorial_guides;"
         targetptr="ugr.tug.application.using_aes"/>.</para>
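
     <para>For comparison with a fixed linear flow, the language-based built-in flow might be
       requested with a <code>capabilityLanguageFlow</code> element in the aggregate's flow
       constraints, as sketched below. The delegate keys are hypothetical; the intent is that
       a delegate is skipped when its declared capabilities do not cover the language of the
       document being processed.</para>

     <programlisting>&lt;flowConstraints&gt;
  &lt;capabilityLanguageFlow&gt;
    &lt;!-- Hypothetical delegate keys; each delegate declares the languages it supports --&gt;
    &lt;node&gt;Tokenizer&lt;/node&gt;
    &lt;node&gt;EnglishTagger&lt;/node&gt;
    &lt;node&gt;GermanTagger&lt;/node&gt;
  &lt;/capabilityLanguageFlow&gt;
&lt;/flowConstraints&gt;</programlisting>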

   </section>

   <section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing">
     <title>Application Building and Collection Processing</title>

     <note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture,
       Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine,
       Collection Processing Manager.</para></note>

     <section id="ugr.ovv.conceptual.using_framework_from_an_application">
       <title>Using the framework from an Application</title>

       <figure id="ugr.ovv.conceptual.application_factory_ae">
         <title>Using UIMA Framework to create and interact with an Analysis Engine</title>
         <mediaobject>
           <imageobject role="html">
             <imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/>
           </imageobject>
           <imageobject role="fo">
             <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/>
           </imageobject>
           <textobject><phrase>Picture of application interacting with UIMA&apos;s
             factory to produce an analysis engine, which acts as a container for annotators,
             and interfaces with the application via the process and getMetaData methods
             among others.</phrase>
           </textobject>
         </mediaobject>
       </figure>

       <para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS
         out.</para>

       <para>The application is responsible for interacting with the UIMA framework to
         instantiate an AE, create or acquire an input CAS, initialize the input CAS with a
         document and then pass it to the AE through the <emphasis role="bold">process
         method</emphasis>. This interaction with the framework is illustrated in <xref
           linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para>

       <para>The UIMA AE Factory takes the declarative information from the Component
         Descriptor and the class files implementing the annotator, and instantiates the AE
         instance, setting up the CAS and the UIMA Context.</para>

       <para>The AE, possibly calling many delegate AEs internally, performs the overall
         analysis and its process method returns the CAS containing new analysis results.
         </para>

       <para>The application then decides what to do with the returned CAS. There are many
         possibilities. For instance, the application could display the results, store the
         CAS to disk for post-processing, or extract and index analysis results as part of a
         search or database application.</para>

       <para>The UIMA framework provides methods to support the application developer in
         creating and managing CASes and instantiating, running and managing AEs. Details
         may be found in <olink targetdoc="&uima_docs_tutorial_guides;"
         /> <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.application"/>.</para>
     </section>

     <section id="ugr.ovv.conceptual.graduating_to_collection_processing">
       <title>Graduating to Collection Processing</title>
       <figure id="ugr.ovv.conceptual.fig.cpe">
         <title>High-Level UIMA Component Architecture from Source to Sink</title>
         <mediaobject>
           <imageobject role="html">
             <imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/>
           </imageobject>
           <imageobject role="fo">
             <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/>
           </imageobject>
         </mediaobject>
       </figure>

       <para>Many UIM applications analyze entire collections of documents. They connect to
         different document sources and do different things with the results, but in the
         typical case the application follows these logical steps:

         <orderedlist spacing="compact">
           <listitem><para>Connect to a physical source</para></listitem>
           <listitem><para>Acquire a document from the source</para></listitem>
           <listitem><para>Initialize a CAS with the document to be analyzed</para>
             </listitem>
           <listitem><para>Send the CAS to a selected analysis engine</para></listitem>
           <listitem><para>Process the resulting CAS</para></listitem>
           <listitem><para>Go back to 2 until the collection is processed</para>
             </listitem>
           <listitem><para>Do any final processing required after all the documents in the
             collection have been analyzed</para></listitem>
         </orderedlist> </para>

       <para>UIMA supports UIM application development for this general type of processing
         through its <emphasis role="bold">Collection Processing
         Architecture</emphasis>.</para>

       <para>As part of the collection processing architecture UIMA introduces two primary
         components in addition to the annotator and analysis engine. These are the <emphasis
           role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS
         Consumer</emphasis>. The complete flow from source, through document analysis,
         and to CAS Consumers supported by UIMA is illustrated in <xref
           linkend="ugr.ovv.conceptual.fig.cpe"/>.</para>

       <para>The Collection Reader&apos;s job is to connect to and iterate through a source
         collection, acquiring documents and initializing CASes for analysis. </para>

       <!--
       <para>Since the structure, access and iteration methods for
       physical document sources vary independently from the format of stored
       documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>.
       The CAS Initializer&apos;s job is specific to a
       document format and specialized logic for mapping that format to a CAS. In the
       simplest case a CAS Intializer may take the document provided by the containing
       Collection Reader and insert it as a subject of analysis (or Sofa) in the
       CAS.  A more advanced scenario is one
       where the CAS Intializer may be implemented to handle documents that conform to
       a certain XML schema and map some subset of the XML tags to CAS types and then
       insert the de-tagged document content as the subject of analysis.  Collection Readers may reuse plug-in CAS
       Initializers for different document formats.</para>
       -->

       <para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is
         to do the final CAS processing. A CAS Consumer may be implemented, for example, to
         index CAS contents in a search engine, extract elements of interest and populate a
         relational database, or serialize and store analysis results to disk for subsequent
         analysis.</para>

       <para>A Semantic Search engine that works with UIMA is available from <ulink
           url="http://www.alphaworks.ibm.com/tech/uima">IBM&apos;s alphaWorks
         site</ulink> which will allow the developer to experiment with indexing analysis
         results and querying for documents based on all the annotations in the CAS. See the
         section on integrating text analysis and search in <olink
           targetdoc="&uima_docs_tutorial_guides;"/> <olink
           targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.application"/>.</para>

       <para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE)
         is an aggregate component that specifies a <quote>source to sink</quote> flow from a
         Collection Reader through a set of analysis engines and then to a set of CAS Consumers.
         </para>

       <para>CPEs are specified by XML files called CPE Descriptors. These are declarative
         specifications that point to their contained components (Collection Readers,
         analysis engines and CAS Consumers) and indicate a flow among them. The flow
         specification allows for filtering capabilities to, for example, skip over AEs
         based on CAS contents. Details about the format of CPE Descriptors may be found in
         <olink targetdoc="&uima_docs_ref;"/>
           <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.
         </para>
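
       <para>To make the shape of such a descriptor concrete, the fragment below is a minimal
         sketch of a CPE descriptor with one Collection Reader, one analysis engine and one
         CAS Consumer. All component names and imported file names are hypothetical, the
         attribute values are illustrative only, and details of the format may vary between
         releases; the CPE descriptor reference is the authoritative source.</para>

       <programlisting>&lt;cpeDescription&gt;
  &lt;!-- Source: the Collection Reader that iterates over the collection --&gt;
  &lt;collectionReader&gt;
    &lt;collectionIterator&gt;
      &lt;descriptor&gt;
        &lt;import location="MyCollectionReader.xml"/&gt;
      &lt;/descriptor&gt;
    &lt;/collectionIterator&gt;
  &lt;/collectionReader&gt;

  &lt;!-- Analysis engines and CAS Consumers, in processing order --&gt;
  &lt;casProcessors casPoolSize="2" processingUnitThreadCount="1"&gt;
    &lt;casProcessor deployment="integrated" name="MyNamed-EntityDetector"&gt;
      &lt;descriptor&gt;
        &lt;import location="MyNamedEntityDetector.xml"/&gt;
      &lt;/descriptor&gt;
      &lt;errorHandling&gt;
        &lt;errorRateThreshold action="terminate" value="100/1000"/&gt;
        &lt;maxConsecutiveRestarts action="terminate" value="30"/&gt;
        &lt;timeout max="100000"/&gt;
      &lt;/errorHandling&gt;
      &lt;checkpoint batch="10000"/&gt;
    &lt;/casProcessor&gt;
    &lt;!-- Sink: a CAS Consumer is declared the same way (error handling omitted here) --&gt;
    &lt;casProcessor deployment="integrated" name="MySearchIndexConsumer"&gt;
      &lt;descriptor&gt;
        &lt;import location="MySearchIndexConsumer.xml"/&gt;
      &lt;/descriptor&gt;
    &lt;/casProcessor&gt;
  &lt;/casProcessors&gt;
&lt;/cpeDescription&gt;</programlisting>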

       <figure id="ugr.ovv.conceptual.fig.cpm">
         <title>Collection Processing Manager in UIMA Framework</title>
         <mediaobject>
           <imageobject role="html">
             <imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/>
           </imageobject>
           <imageobject role="fo">
             <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/>
           </imageobject>
           <textobject><phrase>box and arrows picture of application using CPE factory to
             instantiate a Collection Processing Engine, and that engine interacting with
             the application.</phrase></textobject>
         </mediaobject>
       </figure>

       <para>The UIMA framework includes a <emphasis role="bold">Collection Processing
         Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and
         deploying and running the specified CPE. <xref
           linkend="ugr.ovv.conceptual.fig.cpm"/> illustrates the role of the CPM
         in the UIMA Framework.</para>

       <para>Key features of the CPM are failure recovery, CAS management and scale-out.
         </para>

       <para>Collections may be large and take considerable time to analyze. A configurable
         behavior of the CPM is to log faults on single document failures while continuing to
         process the collection. This behavior is commonly used because analysis components
         often tend to be the weakest link -- in practice they may choke on strangely formatted
         content. </para>

       <para>This deployment option requires that the CPM run in a separate process or a
         machine distinct from the CPE components. A CPE may be configured to run with a variety
         of deployment options that control the features provided by the CPM. For details see
         <olink targetdoc="&uima_docs_ref;"/>
           <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>

       <para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides
         the developer with a user interface that simplifies the process of connecting up all
         the components in a CPE and running the result. For details on using the CPE
         Configurator see <olink targetdoc="&uima_docs_tools;"
         /> <olink targetdoc="&uima_docs_tools;"
           targetptr="ugr.tools.cpe"/>. This tool currently does not provide
         access to the full set of CPE deployment options supported by the CPM; however, you can
         configure other parts of the CPE descriptor by editing it directly. For details on how
         to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;"
         /> <olink targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.cpe"/>.</para>

     </section>

   </section>

   <section id="ugr.ovv.conceptual.exploiting_analysis_results">
     <title>Exploiting Analysis Results</title>

     <note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para>
     </note>

     <section id="ugr.ovv.conceptual.semantic_search">
       <title>Semantic Search</title>

       <para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads
         documents from the file system and initializes CASes with their content. These are
         then fed to an AE that annotates tokens and sentences. The CASes, now enriched with
         token and sentence information, are passed to a CAS Consumer that populates a search
         engine index. </para>

       <para>The search engine query processor can then use the token index to provide basic
         key-word search. For example, given a query <quote>center</quote> the search
         engine would return all the documents that contained the word
         <quote>center</quote>.</para>

       <para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that
         can exploit the additional metadata generated by analytics like a UIMA CPE.</para>

       <para>Suppose we plug a named-entity recognizer into the CPE described
         above. Assume this analysis engine is capable of detecting mentions of persons
         and organizations in documents and annotating them in the CAS.</para>

       <para>To complement the named-entity recognizer, we add a CAS Consumer that extracts,
         in addition to token and sentence annotations, the person and organization
         annotations added to the CASes by the named-entity recognizer. It then feeds these
         into the semantic search engine&apos;s index.</para>

       <para>The semantic search engine that comes with the UIMA SDK, for example, can exploit
         this additional information from the CAS to support more powerful queries. For
         example, imagine a user is looking for documents that mention an organization with
         <quote>center</quote> in its name but is not sure of the full or precise name of the
         organization. A key-word search on <quote>center</quote> would likely produce far
         too many documents because <quote>center</quote> is a common and ambiguous term.
         The semantic search engine that is available from <ulink
           url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language
         called <emphasis role="bold">XML Fragments</emphasis>. This query language is
         designed to exploit the CAS annotations entered in its index. The XML Fragment query,
         for example,


         <programlisting>&lt;organization&gt; center &lt;/organization&gt;</programlisting>
         will return only documents that contain <quote>center</quote> where it
         appears as part of a mention annotated as an organization by the named-entity
         recognizer. This will likely be a much shorter list of documents more precisely
         matching the user&apos;s interest.</para>

       <para>Consider taking this one step further. We add a relationship recognizer that
         annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that
         it sends these new relationship annotations to the semantic search index as well.
         With these additional analysis results in the index we can submit queries like


         <programlisting>&lt;ceo_of&gt;
     &lt;person&gt; center &lt;/person&gt;
     &lt;organization&gt; center &lt;/organization&gt;
 &lt;/ceo_of&gt;</programlisting>
         This query will precisely target documents that contain a mention of an organization
         with <quote>center</quote> as part of its name where that organization is mentioned
         as part of a
         <code>CEO-of</code> relationship annotated by the relationship
         recognizer.</para>

       <para>For more details about using UIMA and Semantic Search see the section on
         integrating text analysis and search in <olink
           targetdoc="&uima_docs_tutorial_guides;"/> <olink
           targetdoc="&uima_docs_tutorial_guides;"
           targetptr="ugr.tug.application"/>.</para>
     </section>

     <section id="ugr.ovv.conceptual.databases">
       <title>Databases</title>

       <para>Search engine indices are not the only place to deposit analysis results for use
         by applications. Another classic example is populating databases. While many
         approaches are possible with varying degrees of flexibility and performance, all are
         highly dependent on application specifics. We include a simple sample CAS Consumer
         that provides the basics for getting your analysis results into a relational
         database. It extracts annotations from a CAS and writes them to a relational
         database, using the open source Apache Derby database.</para>
     </section>
   </section>

   <section id="ugr.ovv.conceptual.multimodal_processing">
     <title>Multimodal Processing in UIMA</title>
     <para>In previous sections we&apos;ve seen how the CAS is initialized with an initial
       artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The
       first Analysis engine may make some assertions about the artifact, for example, in the
       form of annotations. Subsequent Analysis engines will make further assertions about
       both the artifact and previous analysis results, and finally one or more CAS Consumers
       will extract information from these CASes for structured information storage.</para>
     <figure id="ugr.ovv.conceptual.fig.multiple_sofas">
       <title>Multiple Sofas in support of multi-modal analysis of an audio stream. Some
         engines work on the audio <quote>view</quote>, some on the text
         <quote>view</quote> and some on both.</title>
       <mediaobject>
         <imageobject role="html">
           <imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/>
         </imageobject>
         <imageobject role="fo">
           <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/>
         </imageobject>
         <textobject><phrase>Picture showing audio on the left broken into segments by a
           segmentation component, then sent to multiple analysis pipelines in parallel,
           some processing the raw audio, others processing the recognized speech as
           text.</phrase></textobject>
       </mediaobject>
     </figure>
     <para>Consider a processing pipeline, illustrated in <xref
         linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an
       audio recording of a conversation, transcribes the audio into text, and then extracts
       information from the text transcript. Analysis Engines at the start of the pipeline are
       analyzing an audio subject of analysis, and later analysis engines are analyzing a text
       subject of analysis. The CAS Consumer will likely want to build a search index from
       concepts found in the text to the original audio segment covered by the concept.</para>

     <para>What becomes clear from this relatively simple scenario is that the CAS must be
       capable of simultaneously holding multiple subjects of analysis. Some analysis
       engines will analyze only one subject of analysis, some will analyze one and create
       another, and some will need to access multiple subjects of analysis at the same time.
       </para>

     <para>The support in UIMA for multiple subjects of analysis is called <emphasis
         role="bold">Sofa</emphasis> support; Sofa is an acronym which is derived from
         <emphasis role="underline">S</emphasis>ubject <emphasis role="underline">
       of</emphasis> <emphasis role="underline">A</emphasis>nalysis, which is a physical
       representation of an artifact (e.g., the detagged text of a web-page, the HTML
       text of the same web-page, the audio segment of a video, the closed-caption text
       of the same audio segment). A Sofa may
       be associated with CAS Views. A particular CAS will have one or more views, each view
       corresponding to a particular subject of analysis, together with a set of the defined
       indexes that index the metadata (that is, Feature Structures) created in that view.</para>

     <para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view.
       UIMA components may be written in <quote>Multi-View</quote> mode (able to create and
       access multiple Sofas at the same time) or in <quote>Single-View</quote> mode (simply
       receiving a particular view of the CAS corresponding to a particular single Sofa). For
       single-view mode components, it is up to the person assembling the component to supply
       the needed information to ensure a particular view is passed to the component at run
       time. This is done using XML descriptors for Sofa mapping (see <olink
         targetdoc="&uima_docs_tutorial_guides;"/> <olink
         targetdoc="&uima_docs_tutorial_guides;"
         targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para>
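
     <para>As a minimal sketch of such a mapping, an aggregate descriptor might direct a
       single-view delegate to the text view of the audio example with a
       <code>sofaMappings</code> entry like the one below. The delegate key
       <code>Tokenizer</code> and the Sofa name <code>TranscriptText</code> are hypothetical,
       and the sketch assumes the delegate expects its input in its default view.</para>

     <programlisting>&lt;sofaMappings&gt;
  &lt;sofaMapping&gt;
    &lt;!-- Key of the single-view delegate within the aggregate (hypothetical) --&gt;
    &lt;componentKey&gt;Tokenizer&lt;/componentKey&gt;
    &lt;!-- Sofa in the aggregate's CAS that this delegate should receive (hypothetical) --&gt;
    &lt;aggregateSofaName&gt;TranscriptText&lt;/aggregateSofaName&gt;
  &lt;/sofaMapping&gt;
&lt;/sofaMappings&gt;</programlisting>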

     <para>Multi-View capability brings benefits to text-only processing as well. An input
       document can be transformed from one format to another. Examples of this include
       transforming text from HTML to plain text or from one natural language to another.
       </para>
   </section>

   <section id="ugr.ovv.conceptual.next_steps">
     <title>Next Steps</title>

     <para>This chapter presented a high-level overview of UIMA concepts. Along the way, it
       pointed to other documents in the UIMA SDK documentation set where the reader can find
       details on how to apply the related concepts in building applications with the UIMA
       SDK.</para>

     <para>At this point the reader may return to the documentation guide in <olink
         targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/>
       to learn how they might proceed in getting started using UIMA.</para>

     <para>For a more detailed overview of the UIMA architecture, framework and development
       roles we refer the reader to the following paper:</para>

     <para>D. Ferrucci and A. Lally, <quote>Building an example application using the
       Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems
       Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004).
       </para>

     <para>This paper can be found online at <ulink
         url="http://www.research.ibm.com/journal/sj43-3.html"/></para>
   </section>

 </chapter>