blob: 5d9d9100bb934f5a2628b6530a34ca5b086e1ce2 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
<!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" >
<!ENTITY % uimaents SYSTEM "../entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.ovv.conceptual">
<title>UIMA Conceptual Overview</title>
<para>UIMA is an open, industrial-strength, scaleable and extensible platform for
creating, integrating and deploying unstructured information management solutions
from powerful text or multi-modal analysis and search components. </para>
<para>The Apache UIMA project is an implementation of the Java UIMA framework available
under the Apache License, providing a common foundation for industry and academia to
collaborate and accelerate the world-wide development of technologies critical for
discovering vital knowledge present in the fastest growing sources of information
today.</para>
<para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
provide a broad overview to give the reader a quick sense of UIMA&apos;s basic
architectural philosophy and the UIMA SDK&apos;s capabilities. </para>
<para>This chapter provides a general orientation to UIMA and makes liberal reference to
the other chapters in the UIMA SDK documentation set, where the reader may find detailed
treatments of key concepts and development practices. It may be useful to refer to <olink
targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar
with the terminology in this overview.</para>
<section id="ugr.ovv.conceptual.uima_introduction">
<title>UIMA Introduction</title>
<figure id="ugr.ovv.conceptual.fig.bridge">
<title>UIMA helps you build the bridge between the unstructured and structured
worlds</title>
<mediaobject>
<imageobject>
<imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
</imageobject>
<textobject><phrase>Picture of a bridge between unstructured information
artifacts and structured metadata about those artifacts</phrase>
</textobject>
</mediaobject>
</figure>
<para> Unstructured information represents the largest, most current and fastest
growing source of information available to businesses and governments. The web is just
the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
around the world and across different media including text, voice and video. The
high-value content in these vast collections of unstructured information is,
unfortunately, buried in lots of noise. Searching for what you need or doing
sophisticated data mining over unstructured information sources presents new
challenges. </para>
<para>An unstructured information management (UIM) application may be generally
characterized as a software system that analyzes large volumes of unstructured
information (text, audio, video, images, etc.) to discover, organize and deliver
relevant knowledge to the client or application end-user. An example is an application
that processes millions of medical abstracts to discover critical drug interactions.
Another example is an application that processes tens of millions of documents to
discover key evidence indicating probable competitive threats. </para>
<para>First and foremost, the unstructured data must be analyzed to interpret, detect
and locate concepts of interest, for example, named entities like persons,
organizations, locations, facilities, products etc., that are not explicitly tagged
or annotated in the original artifact. More challenging analytics may detect things
like opinions, complaints, threats or facts. And then there are relations, for
example, located in, finances, supports, purchases, repairs etc. The list of concepts
important for applications to discover in unstructured content is large, varied and
often domain specific.
Many different component analytics may solve different parts of the overall analysis task.
These component analytics must interoperate and must be easily combined to facilitate
the developed of UIM applications.</para>
<para>The result of analysis are used to populate structured forms so that conventional
data processing and search technologies
like search engines, database engines or OLAP
(On-Line Analytical Processing, or Data Mining) engines
can efficiently deliver the newly discovered content in response to the client requests
or queries.</para>
<para>In analyzing unstructured content, UIM applications make use of a variety of
analysis technologies including:</para>
<itemizedlist spacing="compact">
<listitem><para>Statistical and rule-based Natural Language Processing
(NLP)</para>
</listitem>
<listitem><para>Information Retrieval (IR)</para>
</listitem>
<listitem><para>Machine learning</para>
</listitem>
<listitem><para>Ontologies</para>
</listitem>
<listitem><para>Automated reasoning and</para>
</listitem>
<listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
</listitem>
</itemizedlist>
<para>Specific analysis capabilities using these technologies are developed
independently using different techniques, interfaces and platforms.
</para>
<para>The bridge from the unstructured world to the structured world is built through the
composition and deployment of these analysis capabilities. This integration is often
a costly challenge. </para>
<para>The Unstructured Information Management Architecture (UIMA) is an architecture
and software framework that helps you build that bridge. It supports creating,
discovering, composing and deploying a broad range of analysis capabilities and
linking them to structured information services.</para>
<para>UIMA allows development teams to match the right skills with the right parts of a
solution and helps enable rapid integration across technologies and platforms using a
variety of different deployment options. These ranging from tightly-coupled
deployments for high-performance, single-machine, embedded solutions to parallel
and fully distributed deployments for highly flexible and scaleable
solutions.</para>
</section>
<section id="ugr.ovv.conceptual.architecture_framework_sdk">
<title>The Architecture, the Framework and the SDK</title>
<para>UIMA is a software architecture which specifies component interfaces, data
representations, design patterns and development roles for creating, describing,
discovering, composing and deploying multi-modal analysis capabilities.</para>
<para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
environment in which developers can plug in their UIMA component implementations and
with which they can build and deploy UIM applications. The framework is not specific to
any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
Framework.</para>
<para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>)
development environment. </para>
</section>
<section id="ugr.ovv.conceptual.analysis_basics">
<title>Analysis Basics</title>
<note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
Context.</para>
</note>
<section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
<title>Analysis Engines, Annotators &amp; Results</title>
<figure id="ugr.ovv.conceptual.metadata_in_cas">
<title>Objects represented in the Common Analysis Structure (CAS)</title>
<mediaobject>
<imageobject role="html">
<imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
</imageobject>
<imageobject role="fo">
<imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
</imageobject>
<textobject><phrase>Picture of some text, with a hierarchy of discovered
metadata about words in the text, including some image of a person as metadata
about that name.</phrase>
</textobject>
</mediaobject>
</figure>
<para>UIMA is an architecture in which basic building blocks called Analysis Engines
(AEs) are composed to analyze a document and infer and record descriptive attributes
about the document as a whole, and/or about regions therein. This descriptive
information, produced by AEs is referred to generally as <emphasis role="bold">
analysis results</emphasis>. Analysis results typically represent meta-data
about the document content. One way to think about AEs is as software agents that
automatically discover and record meta-data about original content.</para>
<para>UIMA supports the analysis of different modalities including text, audio and
video. The majority of examples we provide are for text. We use the term <emphasis
role="bold">document, </emphasis>therefore, to generally refer to any unit of
content that an AE may process, whether it is a text document or a segment of audio, for
example. See the section <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.mvs"/> for more information on multimodal processing
in UIMA.</para>
<para>Analysis results include different statements about the content of a document.
For example, the following is an assertion about the topic of a document:</para>
<programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>
<para>Analysis results may include statements describing regions more granular than
the entire document. We use the term <emphasis role="bold">span</emphasis> to
refer to a sequence of characters in a text document. Consider that a document with the
identifier D102 contains a span, <quote>Fred Centers</quote> starting at
character position 101. An AE that can detect persons in text may represent the
following statement as an analysis result:</para>
<programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
<para>In both statements 1 and 2 above there is a special pre-defined term or what we call
in UIMA a <emphasis role="bold">Type</emphasis>. They are
<emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
UIMA types characterize the kinds of results that an AE may create &ndash; more on
types later.</para>
<para>Other analysis results may relate two statements. For example, an AE might
record in its results that two spans are both referring to the same person:</para>
<programlisting>(3) The Person denoted by span 101 to 112 and
the Person denoted by span 141 to 143 in document D102
refer to the same Entity.</programlisting>
<para>The above statements are some examples of the kinds of results that AEs may record
to describe the content of the documents they analyze. These are not meant to indicate
the form or syntax with which these results are captured in UIMA &ndash; more on that
later in this overview.</para>
<para>The UIMA framework treats Analysis engines as pluggable, composible,
discoverable, managed objects. At the heart of AEs are the analysis algorithms that
do all the work to analyze documents and record analysis results. </para>
<para>UIMA provides a basic component type intended to house the core analysis
algorithms running inside AEs. Instances of this component are called <emphasis
role="bold">Annotators</emphasis>. The analysis algorithm developer&apos;s
primary concern therefore is the development of annotators. The UIMA framework
provides the necessary methods for taking annotators and creating analysis
engines.</para>
<para>In UIMA the person who codes analysis algorithms takes on the role of the
<emphasis role="bold">Annotator Developer</emphasis>. <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae"/> will take the reader
through the details involved in creating UIMA annotators and analysis
engines.</para>
<para>At the most primitive level an AE wraps an annotator adding the necessary APIs and
infrastructure for the composition and deployment of annotators within the UIMA
framework. The simplest AE contains exactly one annotator at its core. Complex AEs
may contain a collection of other AEs each potentially containing within them other
AEs. </para>
</section>
<section id="ugr.ovv.conceptual.representing_results_in_cas">
<title>Representing Analysis Results in the CAS</title>
<para>How annotators represent and share their results is an important part of the UIMA
architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
(CAS)</emphasis> precisely for these purposes.</para>
<para>The CAS is an object-based data structure that allows the representation of
objects, properties and values. Object types may be related to each other in a
single-inheritance hierarchy. The CAS logically (if not physically) contains the
document being analyzed. Analysis developers share and record their analysis
results in terms of an object model within the CAS. <footnote><para> We have plans to
extend the representational capabilities of the CAS and align its semantics with the
semantics of the OMG&apos;s Essential Meta-Object Facility (EMOF) and with the
semantics of the Eclipse Modeling Framework&apos;s ( <ulink
url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
representation.</para> </footnote> </para>
<para>The UIMA framework includes an implementation and interfaces to the CAS. For a
more detailed description of the CAS and its interfaces see <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
<para>A CAS that logically contains statement 2 (repeated here for your
convenience)</para>
<programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
<para>would include objects of the Person type. For each person found in the body of a
document, the AE would create a Person object in the CAS and link it to the span of text
where the person was mentioned in the document.</para>
<para>While the CAS is a general purpose data structure, UIMA defines a
few basic types and affords the developer the ability to extend these to define an
arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
type system as an object schema for the CAS.</para>
<para>A type system defines the various types of objects that may be discovered in
documents by AE's that subscribe to that type system.</para>
<para>As suggested above, Person may be defined as a type. Types have properties or
<emphasis role="bold">features</emphasis>. So for example,
<emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
features of the Person type.</para>
<para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>
<para>There are no limits to the different types that may be defined in a type system. A
type system is domain and application specific.</para>
<para>Types in a UIMA type system may be organized into a taxonomy. For example,
<emphasis>Company</emphasis> may be defined as a subtype of
<emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
subtype of a <emphasis>ParseNode</emphasis>.</para>
<section id="ugr.ovv.conceptual.annotation_type">
<title>The Annotation Type</title>
<para>A general and common type used in artifact analysis and from which additional
types are often derived is the <emphasis role="bold">annotation</emphasis>
type. </para>
<para>The annotation type is used to annotate or label regions of an artifact. Common
artifacts are text documents, but they can be other things, such as audio streams.
The annotation type for text includes two features, namely
<emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
features represent integer offsets in the artifact and delimit a span. Any
particular annotation object identifies the span it annotates with the
<emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>
<para>The key idea here is that the annotation type is used to identify and label or
<quote>annotate</quote> a specific region of an artifact.</para>
<para>Consider that the Person type is defined as a subtype of annotation. An
annotator, for example, can create a Person annotation to record the discovery of a
mention of a person between position 141 and 143 in document D102. The annotator can
create another person annotation to record the detection of a mention of a person in
the span between positions 101 and 112. </para>
</section>
<section id="ugr.ovv.conceptual.not_just_annotations">
<title>Not Just Annotations</title>
<para>While the annotation type is a useful type for annotating regions of a
document, annotations are not the only kind of types in a CAS. A CAS is a general
representation scheme and may store arbitrary data structures to represent the
analysis of documents.</para>
<para>As an example, consider statement 3 above (repeated here for your
convenience).</para>
<programlisting>(3) The Person denoted by span 101 to 112 and
the Person denoted by span 141 to 143 in document D102
refer to the same Entity.</programlisting>
<para>This statement mentions two person annotations in the CAS; the first, call it
P1 delimiting the span from 101 to 112 and the other, call it P2, delimiting the span
from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
same entity. This means that while there are two expressions in the text
represented by the annotations P1 and P2, each refers to one and the same person.
</para>
<para>The Entity type may be introduced into a type system to capture this kind of
information. The Entity type is not an annotation. It is intended to represent an
object in the domain which may be referred to by different expressions (or
mentions) occurring multiple times within a document (or across documents within
a collection of documents). The Entity type has a feature named
<emphasis>occurrences. </emphasis>This feature is used to point to all the
annotations believed to label mentions of the same entity.</para>
<para>Consider that the spans annotated by P1 and P2 were <quote>Fred
Center</quote> and <quote>He</quote> respectively. The annotator might create
a new Entity object called
<code>FredCenter</code>. To represent the relationship in statement 3 above,
the annotator may link FredCenter to both P1 and P2 by making them values of its
<emphasis>occurrences</emphasis> feature.</para>
<para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
illustrates that an entity may be linked to annotations referring to regions of
image documents as well. To do this the annotation type would have to be extended
with the appropriate features to point to regions of an image.</para>
</section>
<section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
<title>Multiple Views within a CAS</title>
<para>UIMA supports the simultaneous analysis of multiple views of a document. This
support comes in handy for processing multiple forms of the artifact, for example, the audio
and the closed captioned views of a single speech stream, or the tagged and detagged
views of an HTML document.</para>
<para>AEs analyze one or more views of a document. Each view contains a specific
<emphasis role="bold">subject of analysis(Sofa)</emphasis>, plus a set of
indexes holding metadata indexed by that view. The CAS, overall, holds one or more
CAS Views, plus the descriptive objects that represent the analysis results for
each. </para>
<para>Another common example of using CAS Views is for different translations of a
document. Each translation may be represented with a different CAS View. Each
translation may be described by a different set of analysis results. For more
details on CAS Views and Sofas see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.mvs"/> and <olink
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
</section>
</section>
<section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
<title>Interacting with the CAS and External Resources</title>
<titleabbrev>Using CASes and External Resources</titleabbrev>
<para>The two main interfaces that a UIMA component developer interacts with are the
CAS and the UIMA Context.</para>
<para>UIMA provides an efficient implementation of the CAS with multiple programming
interfaces. Through these interfaces, the annotator developer interacts with the
document and reads and writes analysis results. The CAS interfaces provide a suite of
access methods that allow the developer to obtain indexed iterators to the different
objects in the CAS. See <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
developer can obtain a specialized iterator to all Person objects associated with a
particular view, for example.</para>
<para>For Java annotator developers, UIMA provides the JCas. This interface provides
the Java developer with a natural interface to CAS objects. Each type declared in the
type system appears as a Java Class; the UIMA framework renders the Person type as a
Person class in Java. As the analysis algorithm detects mentions of persons in the
documents, it can create Person objects in the CAS. For more details on how to interact
with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.jcas"/>.</para>
<para>The component developer, in addition to interacting with the CAS, can access
external resources through the framework&apos;s resource manager interface
called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among
other things, can ensure that different annotators working together in an aggregate
flow may share the same instance of an external file, for example. For details on using
the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae"/>.</para>
</section>
<section id="ugr.ovv.conceptual.component_descriptors">
<title>Component Descriptors</title>
<para>UIMA defines interfaces for a small set of core components that users of the
framework provide implmentations for. Annotators and Analysis Engines are two of
the basic building blocks specified by the architecture. Developers implement them
to build and compose analysis capabilities and ultimately applications.</para>
<para>There are others components in addition to these, which we will learn about
later, but for every component specified in UIMA there are two parts required for its
implementation:</para>
<orderedlist spacing="compact">
<listitem><para>the declarative part and</para></listitem>
<listitem><para>the code part.</para></listitem>
</orderedlist>
<para>The declarative part contains metadata describing the component, its
identity, structure and behavior and is called the <emphasis role="bold">
Component Descriptor</emphasis>. Component descriptors are represented in XML.
The code part implements the algorithm. The code part may be a program in Java.</para>
<para>As a developer using the UIMA SDK, to implement a UIMA component it is always the
case that you will provide two things: the code part and the Component Descriptor.
Note that when you are composing an engine, the code may be already provided in
reusable subcomponents. In these cases you may not be developing new code but rather
composing an aggregate engine by pointing to other components where the code has been
included.</para>
<para>Component descriptors are represented in XML and aid in component discovery,
reuse, composition and development tooling. The UIMA SDK provides tools for easily
creating and maintaining the component descriptors that relieve the developer from
editing XML directly. This tool is described briefly in <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.aae"/>, and more
thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>
.</para>
<para>Component descriptors contain standard metadata including the
component&apos;s name, author, version, and a reference to the class that
implements the component.</para>
<para>In addition to these standard fields, a component descriptor identifies the
type system the component uses and the types it requires in an input CAS and the types it
plans to produce in an output CAS.</para>
<para>For example, an AE that detects person types may require as input a CAS that
includes a tokenization and deep parse of the document. The descriptor refers to a
type system to make the component&apos;s input requirements and output types
explicit. In effect, the descriptor includes a declarative description of the
component&apos;s behavior and can be used to aid in component discovery and
composition based on desired results. UIMA analysis engines provide an interface
for accessing the component metadata represented in their descriptors. For more
details on the structure of UIMA component descriptors refer to <olink
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para>
</section>
</section>
<section id="ugr.ovv.conceptual.aggregate_analysis_engines">
<title>Aggregate Analysis Engines</title>
<note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine,
Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para>
</note>
<figure id="ugr.ovv.conceptual.sample_aggregate">
<title>Sample Aggregate Analysis Engine</title>
<mediaobject>
<imageobject role="html">
<imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/>
</imageobject>
<imageobject role="fo">
<imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/>
</imageobject>
<textobject><phrase>Picture of multiple parts (a language identifier,
tokenizer, part of speech annotator, shallow parser, and named entity detector)
strung together into a flow, and all of them wrapped as a single aggregate object,
which produces as annotations the union of all the results of the individual
annotator components ( tokens, parts of speech, names, organizations, places,
persons, etc.)</phrase>
</textobject>
</mediaobject>
</figure>
<para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs,
however, may be defined to contain other AEs organized in a workflow. These more complex
analysis engines are called <emphasis role="bold">Aggregate Analysis
Engines.</emphasis> </para>
<para>Annotators tend to perform fairly granular functions, for example language
detection, tokenization or part of speech detection.
These functions typically address just part of an overall analysis task. A workflow
of component engines may be orchestrated to perform more complex tasks.</para>
<para>An AE that performs named entity detection, for example, may
include a pipeline of annotators starting with language detection feeding
tokenization, then part-of-speech detection, then deep grammatical parsing and then
finally named-entity detection. Each step in the pipeline is required by the
subsequent analysis. For example, the final named-entity annotator can only do its
analysis if the previous deep grammatical parse was recorded in the CAS.</para>
<para>Aggregate AEs are built to encapsulate potentially complex internal structure
and insulate it from users of the AE. In our example, the aggregate analysis engine
developer acquires the internal components, defines the necessary flow
between them and publishes the resulting AE. Consider the simple example illustrated
in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where
<quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more
primitive analysis engines.</para>
<para>Users of this AE need not know how it is constructed internally but only need its name
and its published input requirements and output types. These must be declared in the
aggregate AE&apos;s descriptor. Aggregate AE&apos;s descriptors declare the components
they contain and a <emphasis role="bold">flow specification</emphasis>. The flow
specification defines the order in which the internal component AEs should be run. The
internal AEs specified in an aggregate are also called the <emphasis role="bold">
delegate analysis engines.</emphasis> The term "delegate" is used because aggregate AE's
are thought to "delegate" functions to their internal AEs.</para>
<para>
In UIMA 2.0, the developer can implement a "Flow Controller" and include it as part
of an aggregate AE by referring to it in the aggregate AE's descriptor.
The flow controller is responsible for computing the "flow", that is,
for determining the order in which of delegate AE's that will process the CAS.
The Flow Contoller has access to the CAS and any external resources it may require
for determining the flow. It can do this dynamically at run-time, it can
make multi-step decisions and it can consider any sort of flow specification
included in the aggregate AE's descriptor. See
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/>
for details on the UIMA Flow Controller interface.
</para>
<para>We refer to the development role associated with building an aggregate from
delegate AEs as the <emphasis role="bold">Analysis Engine Assembler</emphasis>
.</para>
<para>The UIMA framework, given an aggregate analysis engine descriptor, will run all
delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by
the flow controller. The UIMA framework is equipped to handle different
deployments where the delegate engines, for example, are <emphasis role="bold">
tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold">
loosely-coupled</emphasis> (running in separate processes or even on different
machines). The framework supports a number of remote protocols for loose coupling
deployments of aggregate analysis engines, including SOAP (which stands for Simple
Object Access Protocol, a standard Web Services communications protocol).</para>
<para>The UIMA framework facilitates the deployment of AEs as remote services by using an
adapter layer that automatically creates the necessary infrastructure in response to
a declaration in the component&apos;s descriptor. For more details on creating
aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;"
targetptr="ugr.ref.xml.component_descriptor"/> The component descriptor editor tool
assists in the specification of aggregate AEs from a repository of available engines.
For more details on this tool refer to <olink targetdoc="&uima_docs_tools;"
targetptr="ugr.tools.cde"/>.</para>
<para>The UIMA framework implementation has two built-in flow implementations: one
that support a linear flow between components, and one with conditional branching
based on the language of the document. It also supports user-provided flow
controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.fc"/>. Furthermore, the application developer is
free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily
complex flows. For more details on this the reader may refer to <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application.using_aes"/>.</para>
</section>
<section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing">
<title>Application Building and Collection Processing</title>
<note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture,
Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine,
Collection Processing Manager.</para></note>
<section id="ugr.ovv.conceptual.using_framework_from_an_application">
<title>Using the framework from an Application</title>
<figure id="ugr.ovv.conceptual.application_factory_ae">
<title>Using UIMA Framework to create and interact with an Analysis Engine</title>
<mediaobject>
<imageobject role="html">
<imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/>
</imageobject>
<imageobject role="fo">
<imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/>
</imageobject>
<textobject><phrase>Picture of application interacting with UIMA&apos;s
factory to produce an analysis engine, which acts as a container for annotators,
and interfaces with the application via the process and getMetaData methods
among others.</phrase>
</textobject>
</mediaobject>
</figure>
<para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS
out.</para>
<para>The application is responsible for interacting with the UIMA framework to
instantiate an AE, create or acquire an input CAS, initialize the input CAS with a
document and then pass it to the AE through the <emphasis role="bold">process
method</emphasis>. This interaction with the framework is illustrated in <xref
linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para>
<para>The UIMA AE Factory takes the declarative information from the Component
Descriptor and the class files implementing the annotator, and instantiates the AE
instance, setting up the CAS and the UIMA Context.</para>
<para>The AE, possibly calling many delegate AEs internally, performs the overall
analysis and its process method returns the CAS containing new analysis results.
</para>
<para>The application then decides what to do with the returned CAS. There are many
possibilities. For instance the application could: display the results, store the
CAS to disk for post processing, extract and index analysis results as part of a search
or database application etc.</para>
<para>The UIMA framework provides methods to support the application developer in
creating and managing CASes and instantiating, running and managing AEs. Details
may be found in <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application"/>.</para>
</section>
<section id="ugr.ovv.conceptual.graduating_to_collection_processing">
<title>Graduating to Collection Processing</title>
<figure id="ugr.ovv.conceptual.fig.cpe">
<title>High-Level UIMA Component Architecture from Source to Sink</title>
<mediaobject>
<imageobject role="html">
<imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/>
</imageobject>
<imageobject role="fo">
<imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Many UIM applications analyze entire collections of documents. They connect to
different document sources and do different things with the results. But in the
typical case, the application must generally follow these logical steps:
<orderedlist spacing="compact">
<listitem><para>Connect to a physical source</para></listitem>
<listitem><para>Acquire a document from the source</para></listitem>
<listitem><para>Initialize a CAS with the document to be analyzed</para>
</listitem>
<listitem><para>Send the CAS to a selected analysis engine</para></listitem>
<listitem><para>Process the resulting CAS</para></listitem>
<listitem><para>Go back to 2 until the collection is processed</para>
</listitem>
<listitem><para>Do any final processing required after all the documents in the
collection have been analyzed</para></listitem>
</orderedlist> </para>
<para>UIMA supports UIM application development for this general type of processing
through its <emphasis role="bold">Collection Processing
Architecture</emphasis>.</para>
<para>As part of the collection processing architecture UIMA introduces two primary
components in addition to the annotator and analysis engine. These are the <emphasis
role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS
Consumer</emphasis>. The complete flow from source, through document analysis,
and to CAS Consumers supported by UIMA is illustrated in <xref
linkend="ugr.ovv.conceptual.fig.cpe"/>.</para>
<para>The Collection Reader&apos;s job is to connect to and iterate through a source
collection, acquiring documents and initializing CASes for analysis. </para>
<!--
<para>Since the structure, access and iteration methods for
physical document sources vary independently from the format of stored
documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>.
The CAS Initializer&apos;s job is specific to a
document format and specialized logic for mapping that format to a CAS. In the
simplest case a CAS Intializer may take the document provided by the containing
Collection Reader and insert it as a subject of analysis (or Sofa) in the
CAS. A more advanced scenario is one
where the CAS Intializer may be implemented to handle documents that conform to
a certain XML schema and map some subset of the XML tags to CAS types and then
insert the de-tagged document content as the subject of analysis. Collection Readers may reuse plug-in CAS
Initializers for different document formats.</para>
-->
<para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is
to do the final CAS processing. A CAS Consumer may be implemented, for example, to
index CAS contents in a search engine, extract elements of interest and populate a
relational database or serialize and store analysis results to disk for subsequent
and further analysis. </para>
<para>A Semantic Search engine that works with UIMA is available from <ulink
url="http://www.alphaworks.ibm.com/tech/uima">IBM&apos;s alphaWorks
site</ulink> which will allow the developer to experiment with indexing analysis
results and querying for documents based on all the annotations in the CAS. See the
section on integrating text analysis and search in <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application"/>.</para>
<para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE)
is an aggregate component that specifies a <quote>source to sink</quote> flow from a
Collection Reader though a set of analysis engines and then to a set of CAS Consumers.
</para>
<para>CPEs are specified by XML files called CPE Descriptors. These are declarative
specifications that point to their contained components (Collection Readers,
analysis engines and CAS Consumers) and indicate a flow among them. The flow
specification allows for filtering capabilities to, for example, skip over AEs
based on CAS contents. Details about the format of CPE Descriptors may be found in
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.
</para>
<figure id="ugr.ovv.conceptual.fig.cpm">
<title>Collection Processing Manager in UIMA Framework</title>
<mediaobject>
<imageobject role="html">
<imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/>
</imageobject>
<imageobject role="fo">
<imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/>
</imageobject>
<textobject><phrase>box and arrows picture of application using CPE factory to
instantiate a Collection Processing Engine, and that engine interacting with
the application.</phrase></textobject>
</mediaobject>
</figure>
<para>The UIMA framework includes a <emphasis role="bold">Collection Processing
Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and
deploying and running the specified CPE. <xref
linkend="ugr.ovv.conceptual.fig.cpe"/> illustrates the role of the CPM
in the UIMA Framework.</para>
<para>Key features of the CPM are failure recovery, CAS management and scale-out.
</para>
<para>Collections may be large and take considerable time to analyze. A configurable
behavior of the CPM is to log faults on single document failures while continuing to
process the collection. This behavior is commonly used because analysis components
often tend to be the weakest link -- in practice they may choke on strangely formatted
content. </para>
<para>This deployment option requires that the CPM run in a separate process or a
machine distinct from the CPE components. A CPE may be configured to run with a variety
of deployment options that control the features provided by the CPM. For details see
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>
.</para>
<para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides
the developer with a user interface that simplifies the process of connecting up all
the components in a CPE and running the result. For details on using the CPE
Configurator see <olink targetdoc="&uima_docs_tools;"
targetptr="ugr.tools.cpe"/>. This tool currently does not provide
access to the full set of CPE deployment options supported by the CPM; however, you can
configure other parts of the CPE descriptor by editing it directly. For details on how
to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.cpe"/>.</para>
</section>
</section>
<section id="ugr.ovv.conceptual.exploiting_analysis_results">
<title>Exploiting Analysis Results</title>
<note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para>
</note>
<section id="ugr.ovv.conceptual.semantic_search">
<title>Semantic Search</title>
<para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads
documents from the file system and initializes CASs with their content. These are
then fed to an AE that annotates tokens and sentences, the CASs, now enriched with
token and sentence information, are passed to a CAS Consumer that populates a search
engine index. </para>
<para>The search engine query processor can then use the token index to provide basic
key-word search. For example, given a query <quote>center</quote> the search
engine would return all the documents that contained the word
<quote>center</quote>.</para>
<para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that
can exploit the additional metadata generated by analytics like a UIMA CPE.</para>
<para>Consider that we plugged a named-entity recognizer into the CPE described
above. Assume this analysis engine is capable of detecting in documents and
annotating in the CAS mentions of persons and organizations.</para>
<para>Complementing the name-entity recognizer we add a CAS Consumer that extracts in
addition to token and sentence annotations, the person and organizations added to
the CASs by the name-entity detector. It then feeds these into the semantic search
engine&apos;s index.</para>
<para>The semantic search engine that comes with the UIMA SDK, for example, can exploit
this addition information from the CAS to support more powerful queries. For
example, imagine a user is looking for documents that mention an organization with
<quote>center</quote> it is name but is not sure of the full or precise name of the
organization. A key-word search on <quote>center</quote> would likely produce way
too many documents because <quote>center</quote> is a common and ambiguous term.
The semantic search engine that is available from <ulink
url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language
called <emphasis role="bold">XML Fragments</emphasis>. This query language is
designed to exploit the CAS annotations entered in its index. The XML Fragment query,
for example,
<programlisting>&lt;organization&gt; center &lt;/organization&gt;</programlisting>
will produce first only documents that contain <quote>center</quote> where it
appears as part of a mention annotated as an organization by the name-entity
recognizer. This will likely be a much shorter list of documents more precisely
matching the user&apos;s interest.</para>
<para>Consider taking this one step further. We add a relationship recognizer that
annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that
it sends these new relationship annotations to the semantic search index as well.
With these additional analysis results in the index we can submit queries like
<programlisting>&lt;ceo_of&gt;
&lt;person&gt; center &lt;/person&gt;
&lt;organization&gt; center &lt;/organization&gt;
&lt;ceo_of&gt;</programlisting>
This query will precisely target documents that contain a mention of an organization
with <quote>center</quote> as part of its name where that organization is mentioned
as part of a
<code>CEO-of</code> relationship annotated by the relationship
recognizer.</para>
<para>For more details about using UIMA and Semantic Search see the section on
integrating text analysis and search in <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.application"/>.</para>
</section>
<section id="ugr.ovv.conceptual.databases">
<title>Databases</title>
<para>Search engine indices are not the only place to deposit analysis results for use
by applications. Another classic example is populating databases. While many
approaches are possible with varying degrees of flexibly and performance all are
highly dependent on application specifics. We included a simple sample CAS Consumer
that provides the basics for getting your analysis result into a relational
database. It extracts annotations from a CAS and writes them to a relational
database, using the open source Apache Derby database.</para>
</section>
</section>
<section id="ugr.ovv.conceptual.multimodal_processing">
<title>Multimodal Processing in UIMA</title>
<para>In previous sections we&apos;ve seen how the CAS is initialized with an initial
artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The
first Analysis engine may make some assertions about the artifact, for example, in the
form of annotations. Subsequent Analysis engines will make further assertions about
both the artifact and previous analysis results, and finally one or more CAS Consumers
will extract information from these CASs for structured information storage.</para>
<figure id="ugr.ovv.conceptual.fig.multiple_sofas">
<title>Multiple Sofas in support of multi-modal analysis of an audio Stream. Some
engines work on the audio <quote>view</quote>, some on the text
<quote>view</quote> and some on both.</title>
<mediaobject>
<imageobject role="html">
<imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/>
</imageobject>
<imageobject role="fo">
<imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/>
</imageobject>
<textobject><phrase>Picture showing audio on the left broken into segments by a
segmentation component, then sent to multiple analysis pipelines in parallel,
some processing the raw audio, others processing the recognized speech as
text.</phrase></textobject>
</mediaobject>
</figure>
<para>Consider a processing pipeline, illustrated in <xref
linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an
audio recording of a conversation, transcribes the audio into text, and then extracts
information from the text transcript. Analysis Engines at the start of the pipeline are
analyzing an audio subject of analysis, and later analysis engines are analyzing a text
subject of analysis. The CAS Consumer will likely want to build a search index from
concepts found in the text to the original audio segment covered by the concept.</para>
<para>What becomes clear from this relatively simple scenario is that the CAS must be
capable of simultaneously holding multiple subjects of analysis. Some analysis
engine will analyze only one subject of analysis, some will analyze one and create
another, and some will need to access multiple subjects of analysis at the same time.
</para>
<para>The support in UIMA for multiple subjects of analysis is called <emphasis
role="bold">Sofa</emphasis> support; Sofa is an acronym which is derived from
<emphasis role="underline">S</emphasis>ubject <emphasis role="underline">
of</emphasis> <emphasis role="underline">A</emphasis>nalysis, which is a physical
representation of an artifact (e.g., the detagged text of a web-page, the HTML
text of the same web-page, the audio segment of a video, the close-caption text
of the same audio segment). A Sofa may
be associated with CAS Views. A particular CAS will have one or more views, each view
corresponding to a particular subject of analysis, together with a set of the defined
indexes that index the metadata created in that view.</para>
<para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view.
UIMA components may be written in <quote>Multi-View</quote> mode - able to create and
access multiple Sofas at the same time, or in <quote>Single-View</quote> mode, simply
receiving a particular view of the CAS corresponding to a particular single Sofa. For
single-view mode components, it is up to the person assembling the component to supply
the needed information to insure a particular view is passed to the component at run
time. This is done using XML descriptors for Sofa mapping (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para>
<para>Multi-View capability brings benefits to text-only processing as well. An input
document can be transformed from one format to another. Examples of this include
transforming text from HTML to plain text or from one natural language to another.
</para>
</section>
<section id="ugr.ovv.conceptual.next_steps">
<title>Next Steps</title>
<para>This chapter presented a high-level overview of UIMA concepts. Along the way, it
pointed to other documents in the UIMA SDK documentation set where the reader can find
details on how to apply the related concepts in building applications with the UIMA
SDK.</para>
<para>At this point the reader may return to the documentation guide in <olink
targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/>
to learn how they might proceed in getting started using UIMA.</para>
<para>For a more detailed overview of the UIMA architecture, framework and development
roles we refer the reader to the following paper:</para>
<para>D. Ferrucci and A. Lally, <quote>Building an example application using the
Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems
Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004).
</para>
<para>This paper can be found on line at <ulink
url="http://www.research.ibm.com/journal/sj43-3.html"/></para>
</section>
</chapter>