<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [ | |
<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:"> | |
<!ENTITY imgroot "images/overview-and-setup/conceptual_overview_files/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.ovv.conceptual"> | |
<title>UIMA Conceptual Overview</title> | |
<para>UIMA is an open, industrial-strength, scalable and extensible platform for
creating, integrating and deploying unstructured information management solutions | |
from powerful text or multi-modal analysis and search components. </para> | |
<para>The Apache UIMA project is an implementation of the Java UIMA framework available | |
under the Apache License, providing a common foundation for industry and academia to | |
collaborate and accelerate the world-wide development of technologies critical for | |
discovering vital knowledge present in the fastest growing sources of information | |
today.</para> | |
<para>This chapter presents an introduction to many essential UIMA concepts. It is meant to | |
provide a broad overview to give the reader a quick sense of UIMA's basic | |
architectural philosophy and the UIMA SDK's capabilities. </para> | |
<para>This chapter provides a general orientation to UIMA and makes liberal reference to | |
the other chapters in the UIMA SDK documentation set, where the reader may find detailed | |
treatments of key concepts and development practices. It may be useful to refer to <olink | |
targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar | |
with the terminology in this overview.</para> | |
<section id="ugr.ovv.conceptual.uima_introduction"> | |
<title>UIMA Introduction</title> | |
<figure id="ugr.ovv.conceptual.fig.bridge"> | |
<title>UIMA helps you build the bridge between the unstructured and structured | |
worlds</title> | |
<mediaobject> | |
<imageobject> | |
<imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of a bridge between unstructured information | |
artifacts and structured metadata about those artifacts</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para> Unstructured information represents the largest, most current and fastest | |
growing source of information available to businesses and governments. The web is just | |
the tip of the iceberg. Consider the mounds of information hosted in the enterprise and | |
around the world and across different media including text, voice and video. The | |
high-value content in these vast collections of unstructured information is, | |
unfortunately, buried in lots of noise. Searching for what you need or doing | |
sophisticated data mining over unstructured information sources presents new | |
challenges. </para> | |
<para>An unstructured information management (UIM) application may be generally | |
characterized as a software system that analyzes large volumes of unstructured | |
information (text, audio, video, images, etc.) to discover, organize and deliver | |
relevant knowledge to the client or application end-user. An example is an application | |
that processes millions of medical abstracts to discover critical drug interactions. | |
Another example is an application that processes tens of millions of documents to | |
discover key evidence indicating probable competitive threats. </para> | |
<para>First and foremost, the unstructured data must be analyzed to interpret, detect | |
and locate concepts of interest, for example, named entities like persons, | |
organizations, locations, facilities, products etc., that are not explicitly tagged | |
or annotated in the original artifact. More challenging analytics may detect things | |
like opinions, complaints, threats or facts. And then there are relations, for | |
example, located in, finances, supports, purchases, repairs etc. The list of concepts | |
important for applications to discover in unstructured content is large, varied and | |
often domain specific. | |
Many different component analytics may solve different parts of the overall analysis task. | |
These component analytics must interoperate and must be easily combined to facilitate | |
the development of UIM applications.</para>
<para>The results of analysis are used to populate structured forms so that conventional
data processing and search technologies | |
like search engines, database engines or OLAP | |
(On-Line Analytical Processing, or Data Mining) engines | |
can efficiently deliver the newly discovered content in response to the client requests | |
or queries.</para> | |
<para>In analyzing unstructured content, UIM applications make use of a variety of | |
analysis technologies including:</para> | |
<itemizedlist spacing="compact"> | |
<listitem><para>Statistical and rule-based Natural Language Processing | |
(NLP)</para> | |
</listitem> | |
<listitem><para>Information Retrieval (IR)</para> | |
</listitem> | |
<listitem><para>Machine learning</para> | |
</listitem> | |
<listitem><para>Ontologies</para> | |
</listitem> | |
<listitem><para>Automated reasoning and</para> | |
</listitem> | |
<listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para> | |
</listitem> | |
</itemizedlist> | |
<para>Specific analysis capabilities using these technologies are developed | |
independently using different techniques, interfaces and platforms. | |
</para> | |
<para>The bridge from the unstructured world to the structured world is built through the | |
composition and deployment of these analysis capabilities. This integration is often | |
a costly challenge. </para> | |
<para>The Unstructured Information Management Architecture (UIMA) is an architecture | |
and software framework that helps you build that bridge. It supports creating, | |
discovering, composing and deploying a broad range of analysis capabilities and | |
linking them to structured information services.</para> | |
<para>UIMA allows development teams to match the right skills with the right parts of a | |
solution and helps enable rapid integration across technologies and platforms using a | |
variety of different deployment options. These range from tightly-coupled
deployments for high-performance, single-machine, embedded solutions to parallel
and fully distributed deployments for highly flexible and scalable
solutions.</para> | |
</section> | |
<section id="ugr.ovv.conceptual.architecture_framework_sdk"> | |
<title>The Architecture, the Framework and the SDK</title> | |
<para>UIMA is a software architecture which specifies component interfaces, data | |
representations, design patterns and development roles for creating, describing, | |
discovering, composing and deploying multi-modal analysis capabilities.</para> | |
<para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time | |
environment in which developers can plug in their UIMA component implementations and | |
with which they can build and deploy UIM applications. The framework is not specific to | |
any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA | |
Framework.</para> | |
<para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis> | |
includes the UIMA framework, plus tools and utilities for using UIMA. Some of the | |
tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>) | |
development environment. </para> | |
</section> | |
<section id="ugr.ovv.conceptual.analysis_basics"> | |
<title>Analysis Basics</title> | |
<note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator | |
Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA | |
Context.</para> | |
</note> | |
<section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results"> | |
<title>Analysis Engines, Annotators &amp; Results</title>
<figure id="ugr.ovv.conceptual.metadata_in_cas"> | |
<title>Objects represented in the Common Analysis Structure (CAS)</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of some text, with a hierarchy of discovered | |
metadata about words in the text, including some image of a person as metadata | |
about that name.</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para>UIMA is an architecture in which basic building blocks called Analysis Engines | |
(AEs) are composed to analyze a document and infer and record descriptive attributes | |
about the document as a whole, and/or about regions therein. This descriptive | |
information, produced by AEs, is referred to generally as <emphasis role="bold">
analysis results</emphasis>. Analysis results typically represent meta-data | |
about the document content. One way to think about AEs is as software agents that | |
automatically discover and record meta-data about original content.</para> | |
<para>UIMA supports the analysis of different modalities including text, audio and | |
video. The majority of examples we provide are for text. We use the term <emphasis
role="bold">document</emphasis>, therefore, to generally refer to any unit of
content that an AE may process, whether it is a text document or a segment of audio, for | |
example. See the <olink targetdoc="&uima_docs_tutorial_guides;"/> | |
<olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.mvs"/> for more information on multimodal processing | |
in UIMA.</para> | |
<para>Analysis results include different statements about the content of a document. | |
For example, the following is an assertion about the topic of a document:</para> | |
<programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting> | |
<para>Analysis results may include statements describing regions more granular than | |
the entire document. We use the term <emphasis role="bold">span</emphasis> to | |
refer to a sequence of characters in a text document. Consider that a document with the | |
identifier D102 contains a span, <quote>Fred Center</quote> starting at
character position 101. An AE that can detect persons in text may represent the | |
following statement as an analysis result:</para> | |
<programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting> | |
<para>Statements 1 and 2 above each contain a special pre-defined term, or what we call
in UIMA a <emphasis role="bold">Type</emphasis>. They are
<emphasis>Topic</emphasis> and <emphasis>Person</emphasis>, respectively.
UIMA types characterize the kinds of results that an AE may create – more on | |
types later.</para> | |
<para>Other analysis results may relate two statements. For example, an AE might | |
record in its results that two spans are both referring to the same person:</para> | |
<programlisting>(3) The Person denoted by span 101 to 112 and | |
the Person denoted by span 141 to 143 in document D102 | |
refer to the same Entity.</programlisting> | |
<para>The above statements are some examples of the kinds of results that AEs may record | |
to describe the content of the documents they analyze. These are not meant to indicate | |
the form or syntax with which these results are captured in UIMA – more on that | |
later in this overview.</para> | |
<para>The UIMA framework treats Analysis Engines as pluggable, composable,
discoverable, managed objects. At the heart of AEs are the analysis algorithms that | |
do all the work to analyze documents and record analysis results. </para> | |
<para>UIMA provides a basic component type intended to house the core analysis | |
algorithms running inside AEs. Instances of this component are called <emphasis | |
role="bold">Annotators</emphasis>. The analysis algorithm developer's | |
primary concern therefore is the development of annotators. The UIMA framework | |
provides the necessary methods for taking annotators and creating analysis | |
engines.</para> | |
<para>In UIMA the person who codes analysis algorithms takes on the role of the | |
<emphasis role="bold">Annotator Developer</emphasis>. <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/> | |
in <olink targetdoc="&uima_docs_tutorial_guides;"/> will take the reader | |
through the details involved in creating UIMA annotators and analysis | |
engines.</para> | |
<para>At the most primitive level, an AE wraps an annotator, adding the necessary APIs and
infrastructure for the composition and deployment of annotators within the UIMA | |
framework. The simplest AE contains exactly one annotator at its core. Complex AEs | |
may contain a collection of other AEs each potentially containing within them other | |
AEs. </para> | |
</section> | |
<section id="ugr.ovv.conceptual.representing_results_in_cas"> | |
<title>Representing Analysis Results in the CAS</title> | |
<para>How annotators represent and share their results is an important part of the UIMA | |
architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure | |
(CAS)</emphasis> precisely for these purposes.</para> | |
<para>The CAS is an object-based data structure that allows the representation of | |
objects, properties and values. Object types may be related to each other in a | |
single-inheritance hierarchy. The CAS logically (if not physically) contains the | |
document being analyzed. Analysis developers share and record their analysis | |
results in terms of an object model within the CAS. <footnote><para> We have plans to | |
extend the representational capabilities of the CAS and align its semantics with the | |
semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the | |
semantics of the Eclipse Modeling Framework's ( <ulink | |
url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based | |
representation.</para> </footnote> </para> | |
<para>The UIMA framework includes an implementation and interfaces to the CAS. For a | |
more detailed description of the CAS and its interfaces see <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para> | |
<para>A CAS that logically contains statement 2 (repeated here for your | |
convenience)</para> | |
<programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting> | |
<para>would include objects of the Person type. For each person found in the body of a | |
document, the AE would create a Person object in the CAS and link it to the span of text | |
where the person was mentioned in the document.</para> | |
<para>While the CAS is a general purpose data structure, UIMA defines a | |
few basic types and affords the developer the ability to extend these to define an | |
arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a | |
type system as an object schema for the CAS.</para> | |
<para>A type system defines the various types of objects that may be discovered in | |
documents by AEs that subscribe to that type system.</para>
<para>As suggested above, Person may be defined as a type. Types have properties or | |
<emphasis role="bold">features</emphasis>. So for example, | |
<emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as | |
features of the Person type.</para> | |
<para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money, | |
Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun | |
Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para> | |
<para>There are no limits to the different types that may be defined in a type system. A | |
type system is domain and application specific.</para> | |
<para>Types in a UIMA type system may be organized into a taxonomy. For example, | |
<emphasis>Company</emphasis> may be defined as a subtype of | |
<emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a | |
subtype of a <emphasis>ParseNode</emphasis>.</para> | |
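<para>The ideas of types, features and a single-inheritance taxonomy can be sketched in
plain Java. This is only an illustrative model and not the UIMA type system API: the class
<code>TypeSystemSketch</code>, the root type <code>Top</code>, and the features
<code>name</code> and <code>tickerSymbol</code> are assumptions made up for this example,
while the type names and the <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis>
features come from the discussion above.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a type system; this is not the UIMA API. */
class TypeSystemSketch {
    // Maps each type name to its single supertype (null for the root type).
    private final Map<String, String> supertypeOf = new HashMap<>();
    // Maps each type name to the features that the type introduces itself.
    private final Map<String, List<String>> featuresOf = new HashMap<>();

    void addType(String name, String supertype, String... features) {
        if (supertype != null && !supertypeOf.containsKey(supertype)) {
            throw new IllegalArgumentException("Unknown supertype: " + supertype);
        }
        supertypeOf.put(name, supertype);
        featuresOf.put(name, Arrays.asList(features));
    }

    /** True if type is the ancestor itself or a direct or indirect subtype of it. */
    boolean subsumes(String ancestor, String type) {
        for (String t = type; t != null; t = supertypeOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** All features of a type, with inherited features listed first. */
    List<String> allFeatures(String type) {
        if (!featuresOf.containsKey(type)) {
            throw new IllegalArgumentException("Unknown type: " + type);
        }
        List<String> result = new ArrayList<>();
        String parent = supertypeOf.get(type);
        if (parent != null) {
            result.addAll(allFeatures(parent));
        }
        result.addAll(featuresOf.get(type));
        return result;
    }

    /** Builds the small taxonomy discussed in the text. */
    static TypeSystemSketch example() {
        TypeSystemSketch ts = new TypeSystemSketch();
        ts.addType("Top", null);                              // hypothetical root type
        ts.addType("Person", "Top", "age", "occupation");
        ts.addType("Organization", "Top", "name");            // hypothetical feature
        ts.addType("Company", "Organization", "tickerSymbol"); // hypothetical feature
        ts.addType("ParseNode", "Top");
        ts.addType("NounPhrase", "ParseNode");
        return ts;
    }

    public static void main(String[] args) {
        TypeSystemSketch ts = example();
        System.out.println(ts.subsumes("Organization", "Company"));
        System.out.println(ts.allFeatures("Company"));
    }
}]]></programlisting>
<para>In this sketch a subtype inherits the features of its supertype, so a
<emphasis>Company</emphasis> carries whatever features are declared for
<emphasis>Organization</emphasis> in addition to its own.</para>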
<section id="ugr.ovv.conceptual.annotation_type"> | |
<title>The Annotation Type</title> | |
<para>A general and common type used in artifact analysis and from which additional | |
types are often derived is the <emphasis role="bold">annotation</emphasis> | |
type. </para> | |
<para>The annotation type is used to annotate or label regions of an artifact. Common | |
artifacts are text documents, but they can be other things, such as audio streams. | |
The annotation type for text includes two features, namely | |
<emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these | |
features represent integer offsets in the artifact and delimit a span. Any | |
particular annotation object identifies the span it annotates with the | |
<emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para> | |
<para>The key idea here is that the annotation type is used to identify and label or | |
<quote>annotate</quote> a specific region of an artifact.</para> | |
<para>Consider that the Person type is defined as a subtype of annotation. An | |
annotator, for example, can create a Person annotation to record the discovery of a | |
mention of a person between position 141 and 143 in document D102. The annotator can | |
create another person annotation to record the detection of a mention of a person in | |
the span between positions 101 and 112. </para> | |
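<para>As a rough illustration, the two Person annotations described above could be modeled
as follows. The classes here are hypothetical stand-ins written for this overview, not the
classes supplied by the UIMA framework. The sketch also assumes that
<emphasis>end</emphasis> is the offset one past the last character of the span, and it
fabricates a filler document so that the two mentions fall at the offsets used in the
running example.</para>
<programlisting language="java"><![CDATA[/** Hypothetical stand-ins for annotation types; these are not the UIMA classes. */
class AnnotationSketch {

    /** Labels the region of a text artifact between two character offsets. */
    static class Annotation {
        final int begin; // offset of the first character of the span
        final int end;   // offset one past the last character (an assumption)

        Annotation(int begin, int end) {
            if (begin < 0 || end < begin) {
                throw new IllegalArgumentException("Invalid span: " + begin + ".." + end);
            }
            this.begin = begin;
            this.end = end;
        }

        /** Returns the text of the span this annotation delimits. */
        String coveredText(String documentText) {
            return documentText.substring(begin, end);
        }
    }

    /** Person is defined as a subtype of Annotation. */
    static class Person extends Annotation {
        Person(int begin, int end) {
            super(begin, end);
        }
    }

    /** Builds filler text in which one mention starts at 101 and another at 141. */
    static String sampleDocument() {
        StringBuilder text = new StringBuilder();
        while (text.length() < 101) {
            text.append('.');
        }
        text.append("Fred Center");   // 11 characters, offsets 101 up to 112
        while (text.length() < 141) {
            text.append('.');
        }
        text.append("He");            // 2 characters, offsets 141 up to 143
        text.append(" plays golf.");
        return text.toString();
    }

    public static void main(String[] args) {
        String document = sampleDocument();
        Person first = new Person(101, 112);
        Person second = new Person(141, 143);
        System.out.println(first.coveredText(document));
        System.out.println(second.coveredText(document));
    }
}]]></programlisting>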
</section> | |
<section id="ugr.ovv.conceptual.not_just_annotations"> | |
<title>Not Just Annotations</title> | |
<para>While the annotation type is a useful type for annotating regions of a | |
document, annotations are not the only kind of types in a CAS. A CAS is a general | |
representation scheme and may store arbitrary data structures to represent the | |
analysis of documents.</para> | |
<para>As an example, consider statement 3 above (repeated here for your | |
convenience).</para> | |
<programlisting>(3) The Person denoted by span 101 to 112 and | |
the Person denoted by span 141 to 143 in document D102 | |
refer to the same Entity.</programlisting> | |
<para>This statement mentions two person annotations in the CAS: the first, call it
P1, delimiting the span from 101 to 112 and the other, call it P2, delimiting the span
from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the | |
same entity. This means that while there are two expressions in the text | |
represented by the annotations P1 and P2, each refers to one and the same person. | |
</para> | |
<para>The Entity type may be introduced into a type system to capture this kind of | |
information. The Entity type is not an annotation. It is intended to represent an | |
object in the domain which may be referred to by different expressions (or | |
mentions) occurring multiple times within a document (or across documents within | |
a collection of documents). The Entity type has a feature named | |
<emphasis>occurrences</emphasis>. This feature is used to point to all the
annotations believed to label mentions of the same entity.</para> | |
<para>Consider that the spans annotated by P1 and P2 were <quote>Fred | |
Center</quote> and <quote>He</quote> respectively. The annotator might create | |
a new Entity object called | |
<code>FredCenter</code>. To represent the relationship in statement 3 above, | |
the annotator may link FredCenter to both P1 and P2 by making them values of its | |
<emphasis>occurrences</emphasis> feature.</para> | |
<para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also | |
illustrates that an entity may be linked to annotations referring to regions of | |
image documents as well. To do this the annotation type would have to be extended | |
with the appropriate features to point to regions of an image.</para> | |
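<para>The relationship between an entity and its mentions might be sketched like this in
plain Java. The classes below are hypothetical and written only for illustration; they are
not the UIMA API. The names P1, P2 and FredCenter follow the example in the text.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a non-annotation type that points at annotations. */
class EntitySketch {

    /** A mention of a person, identified by the span it annotates. */
    static class Person {
        final int begin;
        final int end;

        Person(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** A domain object; it has no span of its own, only links to its mentions. */
    static class Entity {
        final String name;
        private final List<Person> occurrences = new ArrayList<>();

        Entity(String name) {
            this.name = name;
        }

        void addOccurrence(Person mention) {
            occurrences.add(mention);
        }

        List<Person> getOccurrences() {
            return Collections.unmodifiableList(occurrences);
        }
    }

    /** Records statement 3: P1 and P2 refer to the same entity. */
    static Entity fredCenter() {
        Person p1 = new Person(101, 112); // the mention "Fred Center"
        Person p2 = new Person(141, 143); // the mention "He"
        Entity entity = new Entity("FredCenter");
        entity.addOccurrence(p1);
        entity.addOccurrence(p2);
        return entity;
    }

    public static void main(String[] args) {
        Entity entity = fredCenter();
        for (Person mention : entity.getOccurrences()) {
            System.out.println(entity.name + " is mentioned at "
                    + mention.begin + ".." + mention.end);
        }
    }
}]]></programlisting>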
</section> | |
<section id="ugr.ovv.conceptual.multiple_views_within_a_cas"> | |
<title>Multiple Views within a CAS</title> | |
<para>UIMA supports the simultaneous analysis of multiple views of a document. This | |
support comes in handy for processing multiple forms of the artifact, for example, the audio | |
and the closed captioned views of a single speech stream, or the tagged and detagged | |
views of an HTML document.</para> | |
<para>AEs analyze one or more views of a document. Each view contains a specific | |
<emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
indexes holding metadata indexed by that view. The CAS, overall, holds one or more | |
CAS Views, plus the descriptive objects that represent the analysis results for | |
each. </para> | |
<para>Another common example of using CAS Views is for different translations of a | |
document. Each translation may be represented with a different CAS View. Each | |
translation may be described by a different set of analysis results. For more | |
details on CAS Views and Sofas see <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.mvs"/> and <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para> | |
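<para>To make the structure concrete, here is a minimal plain-Java sketch of a CAS holding
two views of an HTML document, each with its own Sofa and its own index of results. The
class names, the view names <code>tagged</code> and <code>detagged</code>, and the string
labels stored in the indexes are all assumptions made for this illustration; the real
interfaces are described in the chapters referenced above.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a CAS with several views; this is not the UIMA API. */
class CasViewsSketch {

    /** One view: a subject of analysis plus the results indexed for it. */
    static class View {
        final String name;
        final String sofa;
        final List<String> index = new ArrayList<>();

        View(String name, String sofa) {
            this.name = name;
            this.sofa = sofa;
        }
    }

    private final Map<String, View> views = new LinkedHashMap<>();

    View createView(String name, String sofa) {
        if (views.containsKey(name)) {
            throw new IllegalArgumentException("View already exists: " + name);
        }
        View view = new View(name, sofa);
        views.put(name, view);
        return view;
    }

    View getView(String name) {
        View view = views.get(name);
        if (view == null) {
            throw new IllegalArgumentException("No such view: " + name);
        }
        return view;
    }

    /** Tagged and detagged views of the same HTML document. */
    static CasViewsSketch example() {
        CasViewsSketch cas = new CasViewsSketch();
        View tagged = cas.createView("tagged", "<p>Fred Center plays golf.</p>");
        View detagged = cas.createView("detagged", "Fred Center plays golf.");
        // Offsets are relative to each view's own Sofa.
        tagged.index.add("Person[3,14]");
        detagged.index.add("Person[0,11]");
        return cas;
    }

    public static void main(String[] args) {
        CasViewsSketch cas = example();
        View detagged = cas.getView("detagged");
        System.out.println(detagged.sofa.substring(0, 11) + " " + detagged.index);
    }
}]]></programlisting>
<para>The same mention is recorded with different offsets in each view because each set of
results describes that view's Sofa, not the other one.</para>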
</section> | |
</section> | |
<section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources"> | |
<title>Interacting with the CAS and External Resources</title> | |
<titleabbrev>Using CASes and External Resources</titleabbrev> | |
<para>The two main interfaces that a UIMA component developer interacts with are the | |
CAS and the UIMA Context.</para> | |
<para>UIMA provides an efficient implementation of the CAS with multiple programming | |
interfaces. Through these interfaces, the annotator developer interacts with the | |
document and reads and writes analysis results. The CAS interfaces provide a suite of | |
access methods that allow the developer to obtain indexed iterators to the different | |
objects in the CAS. See <olink targetdoc="&uima_docs_ref;"/> <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator | |
developer can obtain a specialized iterator to all Person objects associated with a | |
particular view, for example.</para> | |
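<para>The idea of asking an index for an iterator over objects of one type can be sketched
as follows. This is a deliberately naive, hypothetical model (it filters a list on demand)
and does not reflect how the UIMA CAS implements its indexes; all class and method names
are assumptions for this example.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of type-specific iteration; this is not the UIMA API. */
class IndexSketch {

    static class FeatureStructure { }

    static class Annotation extends FeatureStructure {
        final int begin;
        final int end;

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    static class Person extends Annotation {
        Person(int begin, int end) {
            super(begin, end);
        }
    }

    static class Organization extends Annotation {
        Organization(int begin, int end) {
            super(begin, end);
        }
    }

    /** Holds the objects recorded for one view. */
    static class ViewIndex {
        private final List<FeatureStructure> all = new ArrayList<>();

        void add(FeatureStructure fs) {
            all.add(fs);
        }

        /** Iterates over objects of the given type, including its subtypes. */
        <T extends FeatureStructure> Iterator<T> iterator(Class<T> type) {
            List<T> matches = new ArrayList<>();
            for (FeatureStructure fs : all) {
                if (type.isInstance(fs)) {
                    matches.add(type.cast(fs));
                }
            }
            return matches.iterator();
        }

        /** Convenience helper: counts the objects of the given type. */
        <T extends FeatureStructure> int count(Class<T> type) {
            int n = 0;
            for (Iterator<T> it = iterator(type); it.hasNext(); it.next()) {
                n++;
            }
            return n;
        }
    }

    static ViewIndex example() {
        ViewIndex index = new ViewIndex();
        index.add(new Person(101, 112));
        index.add(new Organization(120, 130)); // hypothetical mention
        index.add(new Person(141, 143));
        return index;
    }

    public static void main(String[] args) {
        Iterator<Person> persons = example().iterator(Person.class);
        while (persons.hasNext()) {
            Person p = persons.next();
            System.out.println("Person at " + p.begin + ".." + p.end);
        }
    }
}]]></programlisting>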
<para>For Java annotator developers, UIMA provides the JCas. This interface provides | |
the Java developer with a natural interface to CAS objects. Each type declared in the | |
type system appears as a Java Class; the UIMA framework renders the Person type as a | |
Person class in Java. As the analysis algorithm detects mentions of persons in the | |
documents, it can create Person objects in the CAS. For more details on how to interact | |
with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;" | |
/> <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.jcas"/>.</para> | |
<para>The component developer, in addition to interacting with the CAS, can access | |
external resources through the framework's resource manager interface | |
called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among | |
other things, can ensure that different annotators working together in an aggregate | |
flow may share the same instance of an external file or remote resource accessed | |
via its URL, for example. For details on using | |
the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;" | |
/> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae"/>.</para> | |
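<para>The resource-sharing behavior described above can be pictured with a small
plain-Java sketch. It is an assumption-laden illustration, not the UIMA Context interface:
the class <code>SharedContextSketch</code>, the method <code>getResource</code> and the
resource key <code>file:names.txt</code> are invented for this example. The point is only
that two annotators asking for the same resource receive the same instance, which is
loaded once.</para>
<programlisting language="java"><![CDATA[import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of resource sharing; this is not the UIMA Context API. */
class SharedContextSketch {
    private final Map<String, Object> loaded = new HashMap<>();
    private int loadCount = 0;

    /** Returns the resource registered under key, loading it on first use only. */
    synchronized Object getResource(String key, Supplier<Object> loader) {
        Object resource = loaded.get(key);
        if (resource == null) {
            resource = loader.get();
            loaded.put(key, resource);
            loadCount++;
        }
        return resource;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }

    /** A toy annotator that obtains a shared name list through the context. */
    static class NameAnnotator {
        final Object names;

        NameAnnotator(SharedContextSketch context) {
            // The key and the list contents are made up for this example.
            names = context.getResource("file:names.txt",
                    () -> Arrays.asList("Fred Center"));
        }
    }

    public static void main(String[] args) {
        SharedContextSketch context = new SharedContextSketch();
        NameAnnotator first = new NameAnnotator(context);
        NameAnnotator second = new NameAnnotator(context);
        System.out.println(first.names == second.names);
        System.out.println(context.getLoadCount());
    }
}]]></programlisting>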
</section> | |
<section id="ugr.ovv.conceptual.component_descriptors"> | |
<title>Component Descriptors</title> | |
<para>UIMA defines interfaces for a small set of core components that users of the | |
framework provide implementations for. Annotators and Analysis Engines are two of
the basic building blocks specified by the architecture. Developers implement them | |
to build and compose analysis capabilities and ultimately applications.</para> | |
<para>There are other components in addition to these, which we will learn about
later, but for every component specified in UIMA there are two parts required for its | |
implementation:</para> | |
<orderedlist spacing="compact"> | |
<listitem><para>the declarative part and</para></listitem> | |
<listitem><para>the code part.</para></listitem> | |
</orderedlist> | |
<para>The declarative part contains metadata describing the component, its | |
identity, structure and behavior and is called the <emphasis role="bold"> | |
Component Descriptor</emphasis>. Component descriptors are represented in XML. | |
The code part implements the algorithm. The code part may be a program in Java.</para> | |
<para>As a developer using the UIMA SDK, to implement a UIMA component it is always the | |
case that you will provide two things: the code part and the Component Descriptor. | |
Note that when you are composing an engine, the code may be already provided in | |
reusable subcomponents. In these cases you may not be developing new code but rather | |
composing an aggregate engine by pointing to other components where the code has been | |
included.</para> | |
<para>Component descriptors are represented in XML and aid in component discovery, | |
reuse, composition and development tooling. The UIMA SDK provides tools for easily | |
creating and maintaining the component descriptors that relieve the developer from | |
editing XML directly. These tools are described briefly in <olink
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.aae"/>, and more | |
thoroughly in <olink targetdoc="&uima_docs_tools;"/> | |
<olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>
<para>Component descriptors contain standard metadata including the | |
component's name, author, version, and a reference to the class that | |
implements the component.</para> | |
<para>In addition to these standard fields, a component descriptor identifies the | |
type system the component uses and the types it requires in an input CAS and the types it | |
plans to produce in an output CAS.</para> | |
<para>For example, an AE that detects person types may require as input a CAS that | |
includes a tokenization and deep parse of the document. The descriptor refers to a | |
type system to make the component's input requirements and output types | |
explicit. In effect, the descriptor includes a declarative description of the | |
component's behavior and can be used to aid in component discovery and | |
composition based on desired results. UIMA analysis engines provide an interface | |
for accessing the component metadata represented in their descriptors. For more | |
details on the structure of UIMA component descriptors refer to <olink | |
targetdoc="&uima_docs_ref;"/> <olink | |
targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para> | |
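<para>To illustrate how declared inputs and outputs can support discovery and composition,
the following plain-Java sketch models a few descriptors as simple objects. It is not the
UIMA descriptor format or metadata API. The class <code>DescriptorSketch</code>, the
component names, the <code>org.example</code> class names and the type names
<code>Token</code> and <code>ParseNode</code> are assumptions chosen to mirror the example
in the text.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of descriptor metadata; not the UIMA descriptor format. */
class DescriptorSketch {
    final String name;
    final String implementationClass;
    final Set<String> inputTypes;   // types required in the input CAS
    final Set<String> outputTypes;  // types produced in the output CAS

    DescriptorSketch(String name, String implementationClass,
                     List<String> inputs, List<String> outputs) {
        this.name = name;
        this.implementationClass = implementationClass;
        this.inputTypes = new HashSet<>(inputs);
        this.outputTypes = new HashSet<>(outputs);
    }

    /** True if a CAS holding these types satisfies the input requirements. */
    boolean canRunOn(Set<String> typesInCas) {
        return typesInCas.containsAll(inputTypes);
    }

    /** Discovery by desired result: finds components that produce a type. */
    static List<DescriptorSketch> producersOf(String type,
                                              List<DescriptorSketch> registry) {
        List<DescriptorSketch> found = new ArrayList<>();
        for (DescriptorSketch descriptor : registry) {
            if (descriptor.outputTypes.contains(type)) {
                found.add(descriptor);
            }
        }
        return found;
    }

    /** A made-up registry of three components. */
    static List<DescriptorSketch> exampleRegistry() {
        return Arrays.asList(
            new DescriptorSketch("Tokenizer", "org.example.TokenizerAnnotator",
                Collections.<String>emptyList(), Arrays.asList("Token")),
            new DescriptorSketch("Parser", "org.example.ParserAnnotator",
                Arrays.asList("Token"), Arrays.asList("ParseNode")),
            new DescriptorSketch("PersonDetector", "org.example.PersonAnnotator",
                Arrays.asList("Token", "ParseNode"), Arrays.asList("Person")));
    }

    public static void main(String[] args) {
        List<DescriptorSketch> registry = exampleRegistry();
        DescriptorSketch detector = producersOf("Person", registry).get(0);
        Set<String> tokensOnly = new HashSet<>(Arrays.asList("Token"));
        System.out.println(detector.name + " can run on tokens only: "
                + detector.canRunOn(tokensOnly));
    }
}]]></programlisting>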
</section> | |
</section> | |
<section id="ugr.ovv.conceptual.aggregate_analysis_engines"> | |
<title>Aggregate Analysis Engines</title> | |
<note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine, | |
Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para> | |
</note> | |
<figure id="ugr.ovv.conceptual.sample_aggregate"> | |
<title>Sample Aggregate Analysis Engine</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of multiple parts (a language identifier, | |
tokenizer, part of speech annotator, shallow parser, and named entity detector) | |
strung together into a flow, and all of them wrapped as a single aggregate object, | |
which produces as annotations the union of all the results of the individual | |
annotator components ( tokens, parts of speech, names, organizations, places, | |
persons, etc.)</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs, | |
however, may be defined to contain other AEs organized in a workflow. These more complex | |
analysis engines are called <emphasis role="bold">Aggregate Analysis | |
Engines.</emphasis> </para> | |
<para>Annotators tend to perform fairly granular functions, for example language | |
detection, tokenization or part of speech detection. | |
These functions typically address just part of an overall analysis task. A workflow | |
of component engines may be orchestrated to perform more complex tasks.</para> | |
<para>An AE that performs named entity detection, for example, may | |
include a pipeline of annotators starting with language detection feeding | |
tokenization, then part-of-speech detection, then deep grammatical parsing and then | |
finally named-entity detection. Each step in the pipeline is required by the | |
subsequent analysis. For example, the final named-entity annotator can only do its | |
analysis if the previous deep grammatical parse was recorded in the CAS.</para> | |
<para>Aggregate AEs are built to encapsulate potentially complex internal structure | |
and insulate it from users of the AE. In our example, the aggregate analysis engine | |
developer acquires the internal components, defines the necessary flow | |
between them and publishes the resulting AE. Consider the simple example illustrated | |
in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where | |
<quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more | |
primitive analysis engines.</para> | |
<para>Users of this AE need not know how it is constructed internally but only need its name | |
and its published input requirements and output types. These must be declared in the | |
aggregate AE's descriptor. Aggregate AE descriptors declare the components
they contain and a <emphasis role="bold">flow specification</emphasis>. The flow | |
specification defines the order in which the internal component AEs should be run. The | |
internal AEs specified in an aggregate are also called the <emphasis role="bold"> | |
delegate analysis engines.</emphasis> The term "delegate" is used because aggregate AEs
are thought to "delegate" functions to their internal AEs.</para> | |
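<para>The relationship between an aggregate, its delegates and a fixed flow might be
sketched in plain Java as shown here. This is a hypothetical model rather than the UIMA
framework API: the CAS is reduced to a list of result labels, and the delegate keys and
labels are invented for this example. The named-entity step refuses to run unless a parse
has already been recorded, mirroring the dependency described above.</para>
<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an aggregate with a fixed flow; not the UIMA API. */
class AggregateSketch {

    /** The CAS is modeled as a plain list of result labels. */
    interface AnalysisEngine {
        void process(List<String> cas);
    }

    /** An AE that delegates its work to other AEs in a declared order. */
    static class Aggregate implements AnalysisEngine {
        private final Map<String, AnalysisEngine> delegates = new LinkedHashMap<>();
        private final List<String> fixedFlow = new ArrayList<>();

        Aggregate add(String key, AnalysisEngine delegate) {
            delegates.put(key, delegate);
            return this;
        }

        Aggregate flow(String... keys) {
            fixedFlow.addAll(Arrays.asList(keys));
            return this;
        }

        @Override
        public void process(List<String> cas) {
            for (String key : fixedFlow) {
                AnalysisEngine delegate = delegates.get(key);
                if (delegate == null) {
                    throw new IllegalStateException("No delegate named " + key);
                }
                delegate.process(cas); // every delegate sees the same CAS
            }
        }
    }

    /** Builds an aggregate resembling the named-entity example in the text. */
    static Aggregate example() {
        AnalysisEngine namedEntityDetector = cas -> {
            if (!cas.contains("Parse")) {
                throw new IllegalStateException("A parse must be in the CAS first");
            }
            cas.add("NamedEntity");
        };
        return new Aggregate()
            .add("LanguageDetector", cas -> cas.add("Language"))
            .add("Tokenizer", cas -> cas.add("Token"))
            .add("PosTagger", cas -> cas.add("PartOfSpeech"))
            .add("Parser", cas -> cas.add("Parse"))
            .add("NamedEntityDetector", namedEntityDetector)
            .flow("LanguageDetector", "Tokenizer", "PosTagger", "Parser",
                  "NamedEntityDetector");
    }

    public static void main(String[] args) {
        List<String> cas = new ArrayList<>();
        example().process(cas);
        System.out.println(cas);
    }
}]]></programlisting>
<para>Because <code>Aggregate</code> implements the same interface as its delegates in this
sketch, callers treat it like any other AE and never see its internal structure.</para>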
<para> | |
In UIMA 2.0, the developer can implement a "Flow Controller" and include it as part | |
of an aggregate AE by referring to it in the aggregate AE's descriptor. | |
The flow controller is responsible for computing the "flow", that is, | |
for determining the order in which the delegate AEs will process the CAS.
The Flow Controller has access to the CAS and any external resources it may require
for determining the flow. It can do this dynamically at run-time, make
multi-step decisions, and consider any sort of flow specification
included in the aggregate AE's descriptor. See
<olink targetdoc="&uima_docs_tutorial_guides;"/> | |
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/> | |
for details on the UIMA Flow Controller interface. | |
</para> | |
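<para>As a sketch of how such a reference might look, the fragment below declares a flow
controller inside an aggregate AE's descriptor, alongside the delegate specifiers. The key
<code>MyFlowController</code> and the imported file name are hypothetical; the imported
file would be the flow controller's own descriptor.</para>
<programlisting><![CDATA[<!-- Fragment of an aggregate AE descriptor; key and file name are assumptions -->
<flowController key="MyFlowController">
  <import location="MyFlowController.xml"/>
</flowController>]]></programlisting>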
<para>We refer to the development role associated with building an aggregate from | |
delegate AEs as the <emphasis role="bold">Analysis Engine Assembler</emphasis>.</para>
<para>The UIMA framework, given an aggregate analysis engine descriptor, will run all | |
delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by | |
the flow controller. The UIMA framework is equipped to handle different | |
deployments where the delegate engines, for example, are <emphasis role="bold"> | |
tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold"> | |
loosely-coupled</emphasis> (running in separate processes or even on different | |
machines). The framework supports a number of remote protocols for loose coupling | |
deployments of aggregate analysis engines, including SOAP (which stands for Simple | |
Object Access Protocol, a standard Web Services communications protocol).</para> | |
<para>The UIMA framework facilitates the deployment of AEs as remote services by using an | |
adapter layer that automatically creates the necessary infrastructure in response to | |
a declaration in the component's descriptor. For more details on creating | |
aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;" | |
/> <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor"/>. The component descriptor editor tool
assists in the specification of aggregate AEs from a repository of available engines. | |
For more details on this tool refer to <olink targetdoc="&uima_docs_tools;" | |
/> <olink targetdoc="&uima_docs_tools;" | |
targetptr="ugr.tools.cde"/>.</para> | |
<para>The UIMA framework implementation has two built-in flow implementations: one | |
that supports a linear flow between components, and one with conditional branching
based on the language of the document. It also supports user-provided flow | |
controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;" | |
/> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.fc"/>. Furthermore, the application developer is | |
free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily | |
complex flows. For more details on this the reader may refer to <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application.using_aes"/>.</para> | |
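<para>For illustration, the language-based built-in flow can be selected in the aggregate
descriptor's flow specification roughly as sketched below. The delegate keys are
hypothetical. In this flow the delegates are considered in the listed order, and the
framework uses the languages that each delegate declares in its capabilities to decide
whether to run it for a given document.</para>
<programlisting><![CDATA[<!-- Fragment of an aggregate AE descriptor; delegate keys are assumptions -->
<flowConstraints>
  <capabilityLanguageFlow>
    <node>Tokenizer</node>
    <node>EnglishNamedEntityDetector</node>
    <node>GermanNamedEntityDetector</node>
  </capabilityLanguageFlow>
</flowConstraints>]]></programlisting>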
</section> | |
<section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing"> | |
<title>Application Building and Collection Processing</title> | |
<note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture, | |
Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine, | |
Collection Processing Manager.</para></note> | |
<section id="ugr.ovv.conceptual.using_framework_from_an_application"> | |
<title>Using the framework from an Application</title> | |
<figure id="ugr.ovv.conceptual.application_factory_ae"> | |
<title>Using UIMA Framework to create and interact with an Analysis Engine</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/> | |
</imageobject> | |
<textobject><phrase>Picture of application interacting with UIMA's | |
factory to produce an analysis engine, which acts as a container for annotators, | |
and interfaces with the application via the process and getMetaData methods | |
among others.</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS | |
out.</para> | |
<para>The application is responsible for interacting with the UIMA framework to | |
instantiate an AE, create or acquire an input CAS, initialize the input CAS with a | |
document and then pass it to the AE through the <emphasis role="bold">process | |
method</emphasis>. This interaction with the framework is illustrated in <xref | |
linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para> | |
<para>The UIMA AE Factory takes the declarative information from the Component | |
Descriptor and the class files implementing the annotator, and instantiates the AE | |
instance, setting up the CAS and the UIMA Context.</para> | |
<para>The AE, possibly calling many delegate AEs internally, performs the overall | |
analysis and its process method returns the CAS containing new analysis results. | |
</para> | |
<para>The application then decides what to do with the returned CAS. There are many | |
possibilities. For instance, the application could display the results, store the
CAS to disk for post-processing, or extract and index analysis results as part of a search
or database application.</para>
<para>The UIMA framework provides methods to support the application developer in | |
creating and managing CASes and instantiating, running and managing AEs. Details | |
may be found in <olink targetdoc="&uima_docs_tutorial_guides;" | |
/> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application"/>.</para> | |
</section> | |
<section id="ugr.ovv.conceptual.graduating_to_collection_processing"> | |
<title>Graduating to Collection Processing</title> | |
<figure id="ugr.ovv.conceptual.fig.cpe"> | |
<title>High-Level UIMA Component Architecture from Source to Sink</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/> | |
</imageobject> | |
</mediaobject> | |
</figure> | |
<para>Many UIM applications analyze entire collections of documents. They connect to | |
different document sources and do different things with the results. But in the | |
typical case, the application must generally follow these logical steps: | |
<orderedlist spacing="compact"> | |
<listitem><para>Connect to a physical source</para></listitem> | |
<listitem><para>Acquire a document from the source</para></listitem> | |
<listitem><para>Initialize a CAS with the document to be analyzed</para> | |
</listitem> | |
<listitem><para>Send the CAS to a selected analysis engine</para></listitem> | |
<listitem><para>Process the resulting CAS</para></listitem> | |
<listitem><para>Go back to 2 until the collection is processed</para> | |
</listitem> | |
<listitem><para>Do any final processing required after all the documents in the | |
collection have been analyzed</para></listitem> | |
</orderedlist> </para> | |
<para>UIMA supports UIM application development for this general type of processing | |
through its <emphasis role="bold">Collection Processing | |
Architecture</emphasis>.</para> | |
<para>As part of the collection processing architecture UIMA introduces two primary | |
components in addition to the annotator and analysis engine. These are the <emphasis | |
role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS | |
Consumer</emphasis>. The complete flow from source, through document analysis, | |
and to CAS Consumers supported by UIMA is illustrated in <xref | |
linkend="ugr.ovv.conceptual.fig.cpe"/>.</para> | |
<para>The Collection Reader's job is to connect to and iterate through a source | |
collection, acquiring documents and initializing CASes for analysis. </para> | |
<!-- | |
<para>Since the structure, access and iteration methods for | |
physical document sources vary independently from the format of stored | |
documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>. | |
The CAS Initializer's job is specific to a | |
document format and specialized logic for mapping that format to a CAS. In the | |
simplest case a CAS Intializer may take the document provided by the containing | |
Collection Reader and insert it as a subject of analysis (or Sofa) in the | |
CAS. A more advanced scenario is one | |
where the CAS Intializer may be implemented to handle documents that conform to | |
a certain XML schema and map some subset of the XML tags to CAS types and then | |
insert the de-tagged document content as the subject of analysis. Collection Readers may reuse plug-in CAS | |
Initializers for different document formats.</para> | |
--> | |
<para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is | |
to do the final CAS processing. A CAS Consumer may be implemented, for example, to | |
index CAS contents in a search engine, extract elements of interest and populate a | |
relational database or serialize and store analysis results to disk for subsequent | |
and further analysis. </para> | |
<para>A Semantic Search engine that works with UIMA is available from <ulink | |
url="http://www.alphaworks.ibm.com/tech/uima">IBM's alphaWorks | |
site</ulink> which will allow the developer to experiment with indexing analysis | |
results and querying for documents based on all the annotations in the CAS. See the | |
section on integrating text analysis and search in <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application"/>.</para> | |
<para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE) | |
is an aggregate component that specifies a <quote>source to sink</quote> flow from a | |
Collection Reader through a set of analysis engines and then to a set of CAS Consumers.
</para> | |
<para>CPEs are specified by XML files called CPE Descriptors. These are declarative | |
specifications that point to their contained components (Collection Readers, | |
analysis engines and CAS Consumers) and indicate a flow among them. The flow | |
specification allows for filtering capabilities to, for example, skip over AEs | |
based on CAS contents. Details about the format of CPE Descriptors may be found in | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. | |
</para> | |
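<para>A minimal CPE descriptor might look like the following sketch. The component names
and file locations are hypothetical, and the error-handling and checkpoint values are
placeholders rather than recommendations; the CPE descriptor reference is the
authoritative source for the available elements and their defaults.</para>
<programlisting><![CDATA[<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <!-- Source: the Collection Reader (file name is an assumption) -->
  <collectionReader>
    <collectionIterator>
      <descriptor>
        <import location="FileSystemCollectionReader.xml"/>
      </descriptor>
    </collectionIterator>
  </collectionReader>
  <!-- Analysis engines and CAS Consumers, listed in the order they run -->
  <casProcessors casPoolSize="3" processingUnitThreadCount="1">
    <casProcessor deployment="integrated" name="MyNamed-EntityDetector">
      <descriptor>
        <import location="MyNamedEntityDetector.xml"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <!-- What to do once too many documents fail in this component -->
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000"/>
      </errorHandling>
      <checkpoint batch="10000"/>
    </casProcessor>
    <casProcessor deployment="integrated" name="SearchIndexConsumer">
      <descriptor>
        <import location="SearchIndexConsumer.xml"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000"/>
      </errorHandling>
      <checkpoint batch="10000"/>
    </casProcessor>
  </casProcessors>
  <cpeConfig>
    <!-- -1 means process the whole collection -->
    <numToProcess>-1</numToProcess>
    <deployAs>immediate</deployAs>
    <checkpoint file="" time="300000"/>
    <timerImpl/>
  </cpeConfig>
</cpeDescription>]]></programlisting>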
<figure id="ugr.ovv.conceptual.fig.cpm"> | |
<title>Collection Processing Manager in UIMA Framework</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/> | |
</imageobject> | |
<textobject><phrase>box and arrows picture of application using CPE factory to | |
instantiate a Collection Processing Engine, and that engine interacting with | |
the application.</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para>The UIMA framework includes a <emphasis role="bold">Collection Processing | |
Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and | |
deploying and running the specified CPE. <xref | |
linkend="ugr.ovv.conceptual.fig.cpm"/> illustrates the role of the CPM
in the UIMA Framework.</para> | |
<para>Key features of the CPM are failure recovery, CAS management and scale-out. | |
</para> | |
<para>Collections may be large and take considerable time to analyze. A configurable | |
behavior of the CPM is to log faults on single document failures while continuing to | |
process the collection. This behavior is commonly used because analysis components | |
often tend to be the weakest link -- in practice they may choke on strangely formatted | |
content. </para> | |
<para>This deployment option requires that the CPM run in a separate process or a | |
machine distinct from the CPE components. A CPE may be configured to run with a variety | |
of deployment options that control the features provided by the CPM. For details see | |
<olink targetdoc="&uima_docs_ref;"/> | |
<olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
<para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides | |
the developer with a user interface that simplifies the process of connecting up all | |
the components in a CPE and running the result. For details on using the CPE | |
Configurator see <olink targetdoc="&uima_docs_tools;" | |
/> <olink targetdoc="&uima_docs_tools;" | |
targetptr="ugr.tools.cpe"/>. This tool currently does not provide | |
access to the full set of CPE deployment options supported by the CPM; however, you can | |
configure other parts of the CPE descriptor by editing it directly. For details on how | |
to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;" | |
/> <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.cpe"/>.</para> | |
</section> | |
</section> | |
<section id="ugr.ovv.conceptual.exploiting_analysis_results"> | |
<title>Exploiting Analysis Results</title> | |
<note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para> | |
</note> | |
<section id="ugr.ovv.conceptual.semantic_search"> | |
<title>Semantic Search</title> | |
<para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads | |
documents from the file system and initializes CASes with their content. These are
then fed to an AE that annotates tokens and sentences. The CASes, now enriched with
token and sentence information, are passed to a CAS Consumer that populates a search
engine index. </para> | |
<para>The search engine query processor can then use the token index to provide basic | |
key-word search. For example, given a query <quote>center</quote> the search | |
engine would return all the documents that contained the word | |
<quote>center</quote>.</para> | |
<para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that | |
can exploit the additional metadata generated by analytics like a UIMA CPE.</para> | |
<para>Consider that we plugged a named-entity recognizer into the CPE described | |
above. Assume this analysis engine is capable of detecting in documents and | |
annotating in the CAS mentions of persons and organizations.</para> | |
<para>Complementing the named-entity recognizer, we add a CAS Consumer that extracts, in
addition to token and sentence annotations, the person and organization annotations added to
the CASes by the named-entity recognizer. It then feeds these into the semantic search
engine's index.</para>
<para>The semantic search engine that comes with the UIMA SDK, for example, can exploit
this additional information from the CAS to support more powerful queries. For
example, imagine a user is looking for documents that mention an organization with
<quote>center</quote> in its name but is not sure of the full or precise name of the
organization. A key-word search on <quote>center</quote> would likely produce far
too many documents because <quote>center</quote> is a common and ambiguous term.
The semantic search engine that is available from <ulink | |
url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language | |
called <emphasis role="bold">XML Fragments</emphasis>. This query language is | |
designed to exploit the CAS annotations entered in its index. The XML Fragment query, | |
for example, | |
<programlisting><organization> center </organization></programlisting> | |
will first return only documents that contain <quote>center</quote> where it
appears as part of a mention annotated as an organization by the named-entity
recognizer. This will likely be a much shorter list of documents more precisely
matching the user's interest.</para>
<para>Consider taking this one step further. We add a relationship recognizer that | |
annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that | |
it sends these new relationship annotations to the semantic search index as well. | |
With these additional analysis results in the index we can submit queries like
<programlisting><ceo_of>
  <person> center </person>
  <organization> center </organization>
</ceo_of></programlisting>
This query will precisely target documents that contain a mention of an organization | |
with <quote>center</quote> as part of its name where that organization is mentioned | |
as part of a | |
<code>CEO-of</code> relationship annotated by the relationship | |
recognizer.</para> | |
<para>For more details about using UIMA and Semantic Search see the section on | |
integrating text analysis and search in <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.application"/>.</para> | |
</section> | |
<section id="ugr.ovv.conceptual.databases"> | |
<title>Databases</title> | |
<para>Search engine indices are not the only place to deposit analysis results for use | |
by applications. Another classic example is populating databases. While many | |
approaches are possible with varying degrees of flexibility and performance, all are
highly dependent on application specifics. We included a simple sample CAS Consumer | |
that provides the basics for getting your analysis result into a relational | |
database. It extracts annotations from a CAS and writes them to a relational | |
database, using the open source Apache Derby database.</para> | |
</section> | |
</section> | |
<section id="ugr.ovv.conceptual.multimodal_processing"> | |
<title>Multimodal Processing in UIMA</title> | |
<para>In previous sections we've seen how the CAS is initialized with an initial | |
artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The | |
first Analysis engine may make some assertions about the artifact, for example, in the | |
form of annotations. Subsequent Analysis engines will make further assertions about | |
both the artifact and previous analysis results, and finally one or more CAS Consumers | |
will extract information from these CASs for structured information storage.</para> | |
<figure id="ugr.ovv.conceptual.fig.multiple_sofas"> | |
<title>Multiple Sofas in support of multi-modal analysis of an audio Stream. Some | |
engines work on the audio <quote>view</quote>, some on the text | |
<quote>view</quote> and some on both.</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/> | |
</imageobject> | |
<textobject><phrase>Picture showing audio on the left broken into segments by a | |
segmentation component, then sent to multiple analysis pipelines in parallel, | |
some processing the raw audio, others processing the recognized speech as | |
text.</phrase></textobject> | |
</mediaobject> | |
</figure> | |
<para>Consider a processing pipeline, illustrated in <xref | |
linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an | |
audio recording of a conversation, transcribes the audio into text, and then extracts | |
information from the text transcript. Analysis Engines at the start of the pipeline are | |
analyzing an audio subject of analysis, and later analysis engines are analyzing a text | |
subject of analysis. The CAS Consumer will likely want to build a search index from | |
concepts found in the text to the original audio segment covered by the concept.</para> | |
<para>What becomes clear from this relatively simple scenario is that the CAS must be | |
capable of simultaneously holding multiple subjects of analysis. Some analysis
engines will analyze only one subject of analysis, some will analyze one and create
another, and some will need to access multiple subjects of analysis at the same time. | |
</para> | |
<para>The support in UIMA for multiple subjects of analysis is called <emphasis | |
role="bold">Sofa</emphasis> support; Sofa is an acronym which is derived from | |
<emphasis role="underline">S</emphasis>ubject <emphasis role="underline"> | |
of</emphasis> <emphasis role="underline">A</emphasis>nalysis, which is a physical | |
representation of an artifact (e.g., the detagged text of a web-page, the HTML | |
text of the same web-page, the audio segment of a video, the closed-caption text
of the same audio segment). A Sofa may | |
be associated with CAS Views. A particular CAS will have one or more views, each view | |
corresponding to a particular subject of analysis, together with a set of the defined | |
indexes that index the metadata (that is, Feature Structures) created in that view.</para> | |
<para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view. | |
UIMA components may be written in <quote>Multi-View</quote> mode - able to create and | |
access multiple Sofas at the same time, or in <quote>Single-View</quote> mode, simply | |
receiving a particular view of the CAS corresponding to a particular single Sofa. For | |
single-view mode components, it is up to the person assembling the component to supply | |
the needed information to ensure a particular view is passed to the component at run
time. This is done using XML descriptors for Sofa mapping (see <olink | |
targetdoc="&uima_docs_tutorial_guides;"/> <olink | |
targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para> | |
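<para>As a sketch, a Sofa mapping in an aggregate descriptor might look like the fragment
below. The delegate key and the Sofa name are hypothetical. The
<code>componentSofaName</code> element is left out here on the assumption that the
delegate is a single-view component, so that the mapping applies to the view it receives
by default.</para>
<programlisting><![CDATA[<!-- Fragment of an aggregate AE descriptor; key and Sofa name are assumptions -->
<sofaMappings>
  <sofaMapping>
    <!-- The delegate that should receive the mapped view -->
    <componentKey>NamedEntityDetector</componentKey>
    <!-- The Sofa, as named in the aggregate, that the delegate will see -->
    <aggregateSofaName>TranscribedText</aggregateSofaName>
  </sofaMapping>
</sofaMappings>]]></programlisting>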
<para>Multi-View capability brings benefits to text-only processing as well. An input | |
document can be transformed from one format to another. Examples of this include | |
transforming text from HTML to plain text or from one natural language to another. | |
</para> | |
</section> | |
<section id="ugr.ovv.conceptual.next_steps"> | |
<title>Next Steps</title> | |
<para>This chapter presented a high-level overview of UIMA concepts. Along the way, it | |
pointed to other documents in the UIMA SDK documentation set where the reader can find | |
details on how to apply the related concepts in building applications with the UIMA | |
SDK.</para> | |
<para>At this point the reader may return to the documentation guide in <olink | |
targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/> | |
to learn how they might proceed in getting started using UIMA.</para> | |
<para>For a more detailed overview of the UIMA architecture, framework and development | |
roles we refer the reader to the following paper:</para> | |
<para>D. Ferrucci and A. Lally, <quote>Building an example application using the | |
Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems | |
Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004). | |
</para> | |
<para>This paper can be found online at <ulink
url="http://www.research.ibm.com/journal/sj43-3.html"/></para> | |
</section> | |
</chapter> |