| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" |
| "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [ |
| <!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:"> |
| <!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" > |
| <!ENTITY % uimaents SYSTEM "../entities.ent" > |
| %uimaents; |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <chapter id="ugr.ovv.conceptual"> |
| <title>UIMA Conceptual Overview</title> |
| |
| <para>UIMA is an open, industrial-strength, scalable and extensible platform for |
| creating, integrating and deploying unstructured information management solutions |
| from powerful text or multi-modal analysis and search components. </para> |
| |
| <para>The Apache UIMA project is an implementation of the Java UIMA framework available |
| under the Apache License, providing a common foundation for industry and academia to |
| collaborate and accelerate the world-wide development of technologies critical for |
| discovering vital knowledge present in the fastest growing sources of information |
| today.</para> |
| |
| <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to |
| provide a broad overview to give the reader a quick sense of UIMA's basic |
| architectural philosophy and the UIMA SDK's capabilities. </para> |
| |
| <para>This chapter provides a general orientation to UIMA and makes liberal reference to |
| the other chapters in the UIMA SDK documentation set, where the reader may find detailed |
| treatments of key concepts and development practices. It may be useful to refer to <olink |
| targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar |
| with the terminology in this overview.</para> |
| |
| <section id="ugr.ovv.conceptual.uima_introduction"> |
| <title>UIMA Introduction</title> |
| <figure id="ugr.ovv.conceptual.fig.bridge"> |
| <title>UIMA helps you build the bridge between the unstructured and structured |
| worlds</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/> |
| </imageobject> |
| <textobject><phrase>Picture of a bridge between unstructured information |
| artifacts and structured metadata about those artifacts</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> Unstructured information represents the largest, most current and fastest |
| growing source of information available to businesses and governments. The web is just |
| the tip of the iceberg. Consider the mounds of information hosted in the enterprise and |
| around the world and across different media including text, voice and video. The |
| high-value content in these vast collections of unstructured information is, |
| unfortunately, buried in lots of noise. Searching for what you need or doing |
| sophisticated data mining over unstructured information sources presents new |
| challenges. </para> |
| |
| <para>An unstructured information management (UIM) application may be generally |
| characterized as a software system that analyzes large volumes of unstructured |
| information (text, audio, video, images, etc.) to discover, organize and deliver |
| relevant knowledge to the client or application end-user. An example is an application |
| that processes millions of medical abstracts to discover critical drug interactions. |
| Another example is an application that processes tens of millions of documents to |
| discover key evidence indicating probable competitive threats. </para> |
| |
| <para>First and foremost, the unstructured data must be analyzed to interpret, detect |
| and locate concepts of interest, for example, named entities like persons, |
| organizations, locations, facilities, products etc., that are not explicitly tagged |
| or annotated in the original artifact. More challenging analytics may detect things |
| like opinions, complaints, threats or facts. And then there are relations, for |
| example, located in, finances, supports, purchases, repairs etc. The list of concepts |
| important for applications to discover in unstructured content is large, varied and |
| often domain specific. |
| Many different component analytics may solve different parts of the overall analysis task. |
| These component analytics must interoperate and must be easily combined to facilitate |
| the development of UIM applications.</para> |
| |
| <para>The results of analysis are used to populate structured forms so that conventional |
| data processing and search technologies |
| like search engines, database engines or OLAP |
| (On-Line Analytical Processing, or Data Mining) engines |
| can efficiently deliver the newly discovered content in response to the client requests |
| or queries.</para> |
| |
| <para>In analyzing unstructured content, UIM applications make use of a variety of |
| analysis technologies including:</para> |
| |
| <itemizedlist spacing="compact"> |
| <listitem><para>Statistical and rule-based Natural Language Processing |
| (NLP)</para> |
| </listitem> |
| <listitem><para>Information Retrieval (IR)</para> |
| </listitem> |
| <listitem><para>Machine learning</para> |
| </listitem> |
| <listitem><para>Ontologies</para> |
| </listitem> |
| <listitem><para>Automated reasoning and</para> |
| </listitem> |
| <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para> |
| </listitem> |
| |
| </itemizedlist> |
| |
| <para>Specific analysis capabilities using these technologies are developed |
| independently using different techniques, interfaces and platforms. |
| </para> |
| |
| <para>The bridge from the unstructured world to the structured world is built through the |
| composition and deployment of these analysis capabilities. This integration is often |
| a costly challenge. </para> |
| |
| <para>The Unstructured Information Management Architecture (UIMA) is an architecture |
| and software framework that helps you build that bridge. It supports creating, |
| discovering, composing and deploying a broad range of analysis capabilities and |
| linking them to structured information services.</para> |
| |
| <para>UIMA allows development teams to match the right skills with the right parts of a |
| solution and helps enable rapid integration across technologies and platforms using a |
| variety of different deployment options. These range from tightly-coupled |
| deployments for high-performance, single-machine, embedded solutions to parallel |
| and fully distributed deployments for highly flexible and scalable |
| solutions.</para> |
| |
| </section> |
| |
| <section id="ugr.ovv.conceptual.architecture_framework_sdk"> |
| <title>The Architecture, the Framework and the SDK</title> |
| <para>UIMA is a software architecture which specifies component interfaces, data |
| representations, design patterns and development roles for creating, describing, |
| discovering, composing and deploying multi-modal analysis capabilities.</para> |
| |
| <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time |
| environment in which developers can plug in their UIMA component implementations and |
| with which they can build and deploy UIM applications. The framework is not specific to |
| any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA |
| Framework.</para> |
| |
| <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis> |
| includes the UIMA framework, plus tools and utilities for using UIMA. Some of the |
| tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>) |
| development environment. </para> |
| |
| </section> |
| |
| <section id="ugr.ovv.conceptual.analysis_basics"> |
| <title>Analysis Basics</title> |
| <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator |
| Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA |
| Context.</para> |
| </note> |
| |
| <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results"> |
| <title>Analysis Engines, Annotators &amp; Results</title> |
| <figure id="ugr.ovv.conceptual.metadata_in_cas"> |
| <title>Objects represented in the Common Analysis Structure (CAS)</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/> |
| </imageobject> |
| <textobject><phrase>Picture of some text, with a hierarchy of discovered |
| metadata about words in the text, including some image of a person as metadata |
| about that name.</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>UIMA is an architecture in which basic building blocks called Analysis Engines |
| (AEs) are composed to analyze a document and infer and record descriptive attributes |
| about the document as a whole, and/or about regions therein. This descriptive |
| information, produced by AEs is referred to generally as <emphasis role="bold"> |
| analysis results</emphasis>. Analysis results typically represent meta-data |
| about the document content. One way to think about AEs is as software agents that |
| automatically discover and record meta-data about original content.</para> |
| |
| <para>UIMA supports the analysis of different modalities including text, audio and |
| video. The majority of examples we provide are for text. We use the term <emphasis |
| role="bold">document, </emphasis>therefore, to generally refer to any unit of |
| content that an AE may process, whether it is a text document or a segment of audio, for |
| example. See the section <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.mvs"/> for more information on multimodal processing |
| in UIMA.</para> |
| |
| <para>Analysis results include different statements about the content of a document. |
| For example, the following is an assertion about the topic of a document:</para> |
| |
| |
| <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting> |
| |
| <para>Analysis results may include statements describing regions more granular than |
| the entire document. We use the term <emphasis role="bold">span</emphasis> to |
| refer to a sequence of characters in a text document. Consider that a document with the |
| identifier D102 contains a span, <quote>Fred Center</quote> starting at |
| character position 101. An AE that can detect persons in text may represent the |
| following statement as an analysis result:</para> |
| |
| |
| <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting> |
| |
| <para>In both statements 1 and 2 above there is a special pre-defined term or what we call |
| in UIMA a <emphasis role="bold">Type</emphasis>. They are |
| <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively. |
| UIMA types characterize the kinds of results that an AE may create – more on |
| types later.</para> |
| |
| <para>Other analysis results may relate two statements. For example, an AE might |
| record in its results that two spans are both referring to the same person:</para> |
| |
| |
| <programlisting>(3) The Person denoted by span 101 to 112 and |
| the Person denoted by span 141 to 143 in document D102 |
| refer to the same Entity.</programlisting> |
| |
| <para>The above statements are some examples of the kinds of results that AEs may record |
| to describe the content of the documents they analyze. These are not meant to indicate |
| the form or syntax with which these results are captured in UIMA – more on that |
| later in this overview.</para> |
| |
| <para>The UIMA framework treats analysis engines as pluggable, composable, |
| discoverable, managed objects. At the heart of AEs are the analysis algorithms that |
| do all the work to analyze documents and record analysis results. </para> |
| |
| <para>UIMA provides a basic component type intended to house the core analysis |
| algorithms running inside AEs. Instances of this component are called <emphasis |
| role="bold">Annotators</emphasis>. The analysis algorithm developer's |
| primary concern therefore is the development of annotators. The UIMA framework |
| provides the necessary methods for taking annotators and creating analysis |
| engines.</para> |
| |
| <para>In UIMA the person who codes analysis algorithms takes on the role of the |
| <emphasis role="bold">Annotator Developer</emphasis>. <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.aae"/> will take the reader |
| through the details involved in creating UIMA annotators and analysis |
| engines.</para> |
| |
| <para>At the most primitive level an AE wraps an annotator adding the necessary APIs and |
| infrastructure for the composition and deployment of annotators within the UIMA |
| framework. The simplest AE contains exactly one annotator at its core. Complex AEs |
| may contain a collection of other AEs each potentially containing within them other |
| AEs. </para> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.representing_results_in_cas"> |
| <title>Representing Analysis Results in the CAS</title> |
| |
| <para>How annotators represent and share their results is an important part of the UIMA |
| architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure |
| (CAS)</emphasis> precisely for these purposes.</para> |
| |
| <para>The CAS is an object-based data structure that allows the representation of |
| objects, properties and values. Object types may be related to each other in a |
| single-inheritance hierarchy. The CAS logically (if not physically) contains the |
| document being analyzed. Analysis developers share and record their analysis |
| results in terms of an object model within the CAS. <footnote><para> We have plans to |
| extend the representational capabilities of the CAS and align its semantics with the |
| semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the |
| semantics of the Eclipse Modeling Framework's ( <ulink |
| url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based |
| representation.</para> </footnote> </para> |
| |
| <para>The UIMA framework includes an implementation and interfaces to the CAS. For a |
| more detailed description of the CAS and its interfaces see <olink |
| targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para> |
| |
| <para>A CAS that logically contains statement 2 (repeated here for your |
| convenience)</para> |
| |
| |
| <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting> |
| |
| <para>would include objects of the Person type. For each person found in the body of a |
| document, the AE would create a Person object in the CAS and link it to the span of text |
| where the person was mentioned in the document.</para> |
| |
| <para>While the CAS is a general purpose data structure, UIMA defines a |
| few basic types and affords the developer the ability to extend these to define an |
| arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a |
| type system as an object schema for the CAS.</para> |
| |
| <para>A type system defines the various types of objects that may be discovered in |
| documents by AEs that subscribe to that type system.</para> |
| |
| <para>As suggested above, Person may be defined as a type. Types have properties or |
| <emphasis role="bold">features</emphasis>. So for example, |
| <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as |
| features of the Person type.</para> |
| |
| <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money, |
| Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun |
| Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para> |
| |
| <para>There are no limits to the different types that may be defined in a type system. A |
| type system is domain and application specific.</para> |
| |
| <para>Types in a UIMA type system may be organized into a taxonomy. For example, |
| <emphasis>Company</emphasis> may be defined as a subtype of |
| <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a |
| subtype of a <emphasis>ParseNode</emphasis>.</para> |
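| |
| <para>As a rough illustration, the types, features and subtype relationships discussed |
| above could be declared in a type system descriptor along the following lines. This is a |
| minimal sketch rather than a definitive descriptor: the <code>example</code> package, the |
| type system name, the description text, and the choice of <code>uima.cas.Integer</code> |
| and <code>uima.cas.String</code> as feature ranges are assumptions made purely for |
| illustration.</para> |
| |
| <programlisting><![CDATA[<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier"> |
|   <name>ExampleTypeSystem</name>  <!-- hypothetical name --> |
|   <types> |
|     <!-- Person, with the Age and Occupation features mentioned in the text --> |
|     <typeDescription> |
|       <name>example.Person</name> |
|       <description>A mention of a person.</description> |
|       <!-- built-in annotation type; it is introduced in the next section --> |
|       <supertypeName>uima.tcas.Annotation</supertypeName> |
|       <features> |
|         <featureDescription> |
|           <name>age</name> |
|           <description>Assumed to be an integer for this sketch.</description> |
|           <rangeTypeName>uima.cas.Integer</rangeTypeName> |
|         </featureDescription> |
|         <featureDescription> |
|           <name>occupation</name> |
|           <description>Assumed to be free text for this sketch.</description> |
|           <rangeTypeName>uima.cas.String</rangeTypeName> |
|         </featureDescription> |
|       </features> |
|     </typeDescription> |
| |
|     <!-- A two-level taxonomy: Company is a subtype of Organization --> |
|     <typeDescription> |
|       <name>example.Organization</name> |
|       <description>Any kind of organization.</description> |
|       <supertypeName>uima.cas.TOP</supertypeName> |
|     </typeDescription> |
|     <typeDescription> |
|       <name>example.Company</name> |
|       <description>A commercial organization.</description> |
|       <supertypeName>example.Organization</supertypeName> |
|     </typeDescription> |
|   </types> |
| </typeSystemDescription>]]></programlisting> |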
| |
| <section id="ugr.ovv.conceptual.annotation_type"> |
| <title>The Annotation Type</title> |
| |
| <para>A general and common type used in artifact analysis and from which additional |
| types are often derived is the <emphasis role="bold">annotation</emphasis> |
| type. </para> |
| |
| <para>The annotation type is used to annotate or label regions of an artifact. Common |
| artifacts are text documents, but they can be other things, such as audio streams. |
| The annotation type for text includes two features, namely |
| <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these |
| features represent integer offsets in the artifact and delimit a span. Any |
| particular annotation object identifies the span it annotates with the |
| <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para> |
| |
| <para>The key idea here is that the annotation type is used to identify and label or |
| <quote>annotate</quote> a specific region of an artifact.</para> |
| |
| <para>Consider that the Person type is defined as a subtype of annotation. An |
| annotator, for example, can create a Person annotation to record the discovery of a |
| mention of a person between position 141 and 143 in document D102. The annotator can |
| create another person annotation to record the detection of a mention of a person in |
| the span between positions 101 and 112. </para> |
| </section> |
| <section id="ugr.ovv.conceptual.not_just_annotations"> |
| <title>Not Just Annotations</title> |
| |
| <para>While the annotation type is a useful type for annotating regions of a |
| document, annotations are not the only kind of types in a CAS. A CAS is a general |
| representation scheme and may store arbitrary data structures to represent the |
| analysis of documents.</para> |
| |
| <para>As an example, consider statement 3 above (repeated here for your |
| convenience).</para> |
| |
| |
| <programlisting>(3) The Person denoted by span 101 to 112 and |
| the Person denoted by span 141 to 143 in document D102 |
| refer to the same Entity.</programlisting> |
| |
| <para>This statement mentions two person annotations in the CAS; the first, call it |
| P1 delimiting the span from 101 to 112 and the other, call it P2, delimiting the span |
| from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the |
| same entity. This means that while there are two expressions in the text |
| represented by the annotations P1 and P2, each refers to one and the same person. |
| </para> |
| |
| <para>The Entity type may be introduced into a type system to capture this kind of |
| information. The Entity type is not an annotation. It is intended to represent an |
| object in the domain which may be referred to by different expressions (or |
| mentions) occurring multiple times within a document (or across documents within |
| a collection of documents). The Entity type has a feature named |
| <emphasis>occurrences. </emphasis>This feature is used to point to all the |
| annotations believed to label mentions of the same entity.</para> |
| |
| <para>Consider that the spans annotated by P1 and P2 were <quote>Fred |
| Center</quote> and <quote>He</quote> respectively. The annotator might create |
| a new Entity object called |
| <code>FredCenter</code>. To represent the relationship in statement 3 above, |
| the annotator may link FredCenter to both P1 and P2 by making them values of its |
| <emphasis>occurrences</emphasis> feature.</para> |
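| |
| <para>One possible way to declare such an Entity type in a type system descriptor is |
| sketched below. The <code>example</code> package and the description text are |
| hypothetical, and modelling <emphasis>occurrences</emphasis> as an array of annotations |
| is an assumption of this sketch; other representations, such as a list, would serve equally |
| well.</para> |
| |
| <programlisting><![CDATA[<typeDescription> |
|   <name>example.Entity</name> |
|   <description>A domain object that one or more mentions refer to.</description> |
|   <!-- Not a subtype of annotation: an Entity has no begin/end offsets of its own --> |
|   <supertypeName>uima.cas.TOP</supertypeName> |
|   <features> |
|     <featureDescription> |
|       <name>occurrences</name> |
|       <description>Annotations believed to label mentions of this entity.</description> |
|       <rangeTypeName>uima.cas.FSArray</rangeTypeName> |
|       <elementType>uima.tcas.Annotation</elementType> |
|     </featureDescription> |
|   </features> |
| </typeDescription>]]></programlisting> |
| |
| <para>With a declaration of this kind, the FredCenter entity would hold an array whose two |
| elements are the annotations P1 and P2.</para> |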
| |
| <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also |
| illustrates that an entity may be linked to annotations referring to regions of |
| image documents as well. To do this the annotation type would have to be extended |
| with the appropriate features to point to regions of an image.</para> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.multiple_views_within_a_cas"> |
| <title>Multiple Views within a CAS</title> |
| |
| <para>UIMA supports the simultaneous analysis of multiple views of a document. This |
| support comes in handy for processing multiple forms of the artifact, for example, the audio |
| and the closed captioned views of a single speech stream, or the tagged and detagged |
| views of an HTML document.</para> |
| |
| <para>AEs analyze one or more views of a document. Each view contains a specific |
| <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of |
| indexes holding metadata indexed by that view. The CAS, overall, holds one or more |
| CAS Views, plus the descriptive objects that represent the analysis results for |
| each. </para> |
| |
| <para>Another common example of using CAS Views is for different translations of a |
| document. Each translation may be represented with a different CAS View. Each |
| translation may be described by a different set of analysis results. For more |
| details on CAS Views and Sofas see <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.mvs"/> and <olink |
| targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para> |
| </section> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources"> |
| <title>Interacting with the CAS and External Resources</title> |
| <titleabbrev>Using CASes and External Resources</titleabbrev> |
| |
| <para>The two main interfaces that a UIMA component developer interacts with are the |
| CAS and the UIMA Context.</para> |
| |
| <para>UIMA provides an efficient implementation of the CAS with multiple programming |
| interfaces. Through these interfaces, the annotator developer interacts with the |
| document and reads and writes analysis results. The CAS interfaces provide a suite of |
| access methods that allow the developer to obtain indexed iterators to the different |
| objects in the CAS. See <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator |
| developer can obtain a specialized iterator to all Person objects associated with a |
| particular view, for example.</para> |
| |
| <para>For Java annotator developers, UIMA provides the JCas. This interface provides |
| the Java developer with a natural interface to CAS objects. Each type declared in the |
| type system appears as a Java Class; the UIMA framework renders the Person type as a |
| Person class in Java. As the analysis algorithm detects mentions of persons in the |
| documents, it can create Person objects in the CAS. For more details on how to interact |
| with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.jcas"/>.</para> |
| |
| <para>The component developer, in addition to interacting with the CAS, can access |
| external resources through the framework's resource manager interface |
| called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among |
| other things, can ensure that different annotators working together in an aggregate |
| flow may share the same instance of an external file, for example. For details on using |
| the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.aae"/>.</para> |
| |
| </section> |
| <section id="ugr.ovv.conceptual.component_descriptors"> |
| <title>Component Descriptors</title> |
| <para>UIMA defines interfaces for a small set of core components that users of the |
| framework provide implementations for. Annotators and Analysis Engines are two of |
| the basic building blocks specified by the architecture. Developers implement them |
| to build and compose analysis capabilities and ultimately applications.</para> |
| |
| <para>There are other components in addition to these, which we will learn about |
| later, but for every component specified in UIMA there are two parts required for its |
| implementation:</para> |
| |
| <orderedlist spacing="compact"> |
| <listitem><para>the declarative part and</para></listitem> |
| <listitem><para>the code part.</para></listitem> |
| </orderedlist> |
| |
| <para>The declarative part contains metadata describing the component, its |
| identity, structure and behavior and is called the <emphasis role="bold"> |
| Component Descriptor</emphasis>. Component descriptors are represented in XML. |
| The code part implements the algorithm. The code part may be a program in Java.</para> |
| |
| <para>As a developer using the UIMA SDK, to implement a UIMA component it is always the |
| case that you will provide two things: the code part and the Component Descriptor. |
| Note that when you are composing an engine, the code may be already provided in |
| reusable subcomponents. In these cases you may not be developing new code but rather |
| composing an aggregate engine by pointing to other components where the code has been |
| included.</para> |
| |
| <para>Component descriptors are represented in XML and aid in component discovery, |
| reuse, composition and development tooling. The UIMA SDK provides tools for easily |
| creating and maintaining the component descriptors that relieve the developer from |
| editing XML directly. This tool is described briefly in <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.aae"/>, and more |
| thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para> |
| |
| <para>Component descriptors contain standard metadata including the |
| component's name, author, version, and a reference to the class that |
| implements the component.</para> |
| |
| <para>In addition to these standard fields, a component descriptor identifies the |
| type system the component uses and the types it requires in an input CAS and the types it |
| plans to produce in an output CAS.</para> |
| |
| <para>For example, an AE that detects person types may require as input a CAS that |
| includes a tokenization and deep parse of the document. The descriptor refers to a |
| type system to make the component's input requirements and output types |
| explicit. In effect, the descriptor includes a declarative description of the |
| component's behavior and can be used to aid in component discovery and |
| composition based on desired results. UIMA analysis engines provide an interface |
| for accessing the component metadata represented in their descriptors. For more |
| details on the structure of UIMA component descriptors refer to <olink |
| targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para> |
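| |
| <para>To make the preceding description more concrete, a descriptor for a primitive AE |
| that detects persons might look roughly like the sketch below. The class name |
| <code>example.PersonAnnotator</code>, the file <code>ExampleTypeSystem.xml</code>, the |
| type names and the vendor string are all hypothetical; the referenced descriptor reference |
| remains the authority on the exact structure.</para> |
| |
| <programlisting><![CDATA[<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier"> |
|   <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
|   <primitive>true</primitive> |
|   <!-- the code part: a reference to the class that implements the component --> |
|   <annotatorImplementationName>example.PersonAnnotator</annotatorImplementationName> |
|   <analysisEngineMetaData> |
|     <!-- standard metadata --> |
|     <name>Person Detector</name> |
|     <description>Detects mentions of persons.</description> |
|     <version>1.0</version> |
|     <vendor>Example Author</vendor> |
|     <!-- the type system the component uses --> |
|     <typeSystemDescription> |
|        |
|     </typeSystemDescription> |
|     <!-- types required in the input CAS and produced in the output CAS --> |
|     <capabilities> |
|       <capability> |
|         <inputs> |
|           <type>example.Token</type> |
|           <type>example.ParseNode</type> |
|         </inputs> |
|         <outputs> |
|           <type>example.Person</type> |
|         </outputs> |
|       </capability> |
|     </capabilities> |
|   </analysisEngineMetaData> |
| </analysisEngineDescription>]]></programlisting> |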
| |
| </section> |
| </section> |
| <section id="ugr.ovv.conceptual.aggregate_analysis_engines"> |
| <title>Aggregate Analysis Engines</title> |
| |
| <note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine, |
| Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para> |
| </note> |
| |
| <figure id="ugr.ovv.conceptual.sample_aggregate"> |
| <title>Sample Aggregate Analysis Engine</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/> |
| </imageobject> |
| <textobject><phrase>Picture of multiple parts (a language identifier, |
| tokenizer, part of speech annotator, shallow parser, and named entity detector) |
| strung together into a flow, and all of them wrapped as a single aggregate object, |
| which produces as annotations the union of all the results of the individual |
| annotator components ( tokens, parts of speech, names, organizations, places, |
| persons, etc.)</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs, |
| however, may be defined to contain other AEs organized in a workflow. These more complex |
| analysis engines are called <emphasis role="bold">Aggregate Analysis |
| Engines.</emphasis> </para> |
| |
| <para>Annotators tend to perform fairly granular functions, for example language |
| detection, tokenization or part of speech detection. |
| These functions typically address just part of an overall analysis task. A workflow |
| of component engines may be orchestrated to perform more complex tasks.</para> |
| |
| <para>An AE that performs named entity detection, for example, may |
| include a pipeline of annotators starting with language detection feeding |
| tokenization, then part-of-speech detection, then deep grammatical parsing and then |
| finally named-entity detection. Each step in the pipeline is required by the |
| subsequent analysis. For example, the final named-entity annotator can only do its |
| analysis if the previous deep grammatical parse was recorded in the CAS.</para> |
| |
| <para>Aggregate AEs are built to encapsulate potentially complex internal structure |
| and insulate it from users of the AE. In our example, the aggregate analysis engine |
| developer acquires the internal components, defines the necessary flow |
| between them and publishes the resulting AE. Consider the simple example illustrated |
| in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where |
| <quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more |
| primitive analysis engines.</para> |
| |
| <para>Users of this AE need not know how it is constructed internally but only need its name |
| and its published input requirements and output types. These must be declared in the |
| aggregate AE's descriptor. Aggregate AE descriptors declare the components |
| they contain and a <emphasis role="bold">flow specification</emphasis>. The flow |
| specification defines the order in which the internal component AEs should be run. The |
| internal AEs specified in an aggregate are also called the <emphasis role="bold"> |
| delegate analysis engines.</emphasis> The term "delegate" is used because aggregate AEs |
| are thought to "delegate" functions to their internal AEs.</para> |
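| |
| <para>A hedged sketch of what such an aggregate descriptor could look like is shown |
| below, using the delegates pictured in |
| <xref linkend="ugr.ovv.conceptual.sample_aggregate"/>. The delegate keys and the |
| descriptor file names are hypothetical, and the sketch assumes the simplest kind of flow |
| specification, a fixed linear order.</para> |
| |
| <programlisting><![CDATA[<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier"> |
|   <frameworkImplementation>org.apache.uima.java</frameworkImplementation> |
|   <primitive>false</primitive> |
|   <!-- the delegate analysis engines, each identified by a key --> |
|   <delegateAnalysisEngineSpecifiers> |
|     <delegateAnalysisEngine key="LanguageIdentifier"> |
|       <import location="LanguageIdentifier.xml"/> |
|     </delegateAnalysisEngine> |
|     <delegateAnalysisEngine key="Tokenizer"> |
|       <import location="Tokenizer.xml"/> |
|     </delegateAnalysisEngine> |
|     <delegateAnalysisEngine key="PartOfSpeechAnnotator"> |
|       <import location="PartOfSpeechAnnotator.xml"/> |
|     </delegateAnalysisEngine> |
|     <delegateAnalysisEngine key="ShallowParser"> |
|       <import location="ShallowParser.xml"/> |
|     </delegateAnalysisEngine> |
|     <delegateAnalysisEngine key="NamedEntityDetector"> |
|       <import location="NamedEntityDetector.xml"/> |
|     </delegateAnalysisEngine> |
|   </delegateAnalysisEngineSpecifiers> |
|   <analysisEngineMetaData> |
|     <name>MyNamed-EntityDetector</name> |
|     <!-- the flow specification: the order in which the delegates run --> |
|     <flowConstraints> |
|       <fixedFlow> |
|         <node>LanguageIdentifier</node> |
|         <node>Tokenizer</node> |
|         <node>PartOfSpeechAnnotator</node> |
|         <node>ShallowParser</node> |
|         <node>NamedEntityDetector</node> |
|       </fixedFlow> |
|     </flowConstraints> |
|   </analysisEngineMetaData> |
| </analysisEngineDescription>]]></programlisting> |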
| |
| <para> |
| In UIMA 2.0, the developer can implement a "Flow Controller" and include it as part |
| of an aggregate AE by referring to it in the aggregate AE's descriptor. |
| The flow controller is responsible for computing the "flow", that is, |
| for determining the order in which the delegate AEs will process the CAS. |
| The Flow Controller has access to the CAS and any external resources it may require |
| for determining the flow. It can do this dynamically at run-time, |
| make multi-step decisions, and consider any sort of flow specification |
| included in the aggregate AE's descriptor. See |
| <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/> |
| for details on the UIMA Flow Controller interface. |
| </para> |
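| |
| <para>A minimal sketch of a dynamic flow decision, assuming a toy model rather than the |
| UIMA Flow Controller interface: <code>ToyFlowController</code>, <code>LanguageFlow</code>, |
| <code>FlowCas</code> and the delegate keys are hypothetical, and the language check is a |
| deliberately naive placeholder. The controller is asked for the next delegate after each |
| step and bases its answer on what has been recorded in the CAS so far.</para> |

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a flow controller; this is not the UIMA interface.
class FlowCas {
    final String text;
    String language = "unknown";                    // set by the language-identification step
    final List<String> trace = new ArrayList<>();   // keys of the delegates that ran

    FlowCas(String text) { this.text = text; }
}

interface FlowDelegate {
    void process(FlowCas cas);
}

// Decides, one step at a time, which delegate sees the CAS next.
interface ToyFlowController {
    // Returns the key of the next delegate, or null when the flow is finished.
    String next(FlowCas cas, String previousKey);
}

class LanguageFlow implements ToyFlowController {
    @Override
    public String next(FlowCas cas, String previousKey) {
        if (previousKey == null) {
            return "languageId";                    // always start by identifying the language
        }
        if (previousKey.equals("languageId")) {
            // Dynamic decision based on what is now recorded in the CAS.
            return cas.language.equals("en") ? "englishTokenizer" : "genericTokenizer";
        }
        return null;                                // a tokenizer has run, so the flow ends
    }
}

class FlowRunner {
    static FlowCas run(String text) {
        Map<String, FlowDelegate> delegates = new HashMap<>();
        // Naive placeholder for language identification.
        delegates.put("languageId",
            c -> c.language = c.text.contains(" the ") ? "en" : "unknown");
        delegates.put("englishTokenizer", c -> { });
        delegates.put("genericTokenizer", c -> { });

        ToyFlowController flow = new LanguageFlow();
        FlowCas cas = new FlowCas(text);
        String key = flow.next(cas, null);
        while (key != null) {
            delegates.get(key).process(cas);
            cas.trace.add(key);
            key = flow.next(cas, key);
        }
        return cas;
    }

    public static void main(String[] args) {
        System.out.println(run("see the cat").trace);
        System.out.println(run("hola mundo").trace);
    }
}
```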
| |
| <para>We refer to the development role associated with building an aggregate from |
| delegate AEs as the <emphasis role="bold">Analysis Engine Assembler</emphasis>.</para> |
| |
| <para>The UIMA framework, given an aggregate analysis engine descriptor, will run all |
| delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by |
| the flow controller. The UIMA framework is equipped to handle different |
| deployments where the delegate engines, for example, are <emphasis role="bold"> |
| tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold"> |
| loosely-coupled</emphasis> (running in separate processes or even on different |
| machines). The framework supports a number of remote protocols for loosely-coupled |
| deployments of aggregate analysis engines, including SOAP (which stands for Simple |
| Object Access Protocol, a standard Web Services communications protocol).</para> |
| |
| <para>The UIMA framework facilitates the deployment of AEs as remote services by using an |
| adapter layer that automatically creates the necessary infrastructure in response to |
| a declaration in the component's descriptor. For more details on creating |
| aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;" |
| targetptr="ugr.ref.xml.component_descriptor"/>. The component descriptor editor tool |
| assists in the specification of aggregate AEs from a repository of available engines. |
| For more details on this tool refer to <olink targetdoc="&uima_docs_tools;" |
| targetptr="ugr.tools.cde"/>.</para> |
| |
| <para>The UIMA framework implementation has two built-in flow implementations: one |
| that supports a linear flow between components, and one with conditional branching |
| based on the language of the document. It also supports user-provided flow |
| controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.fc"/>. Furthermore, the application developer is |
| free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily |
| complex flows. For more details on this the reader may refer to <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.application.using_aes"/>.</para> |
| |
| </section> |
| |
| <section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing"> |
| <title>Application Building and Collection Processing</title> |
| |
| <note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture, |
| Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine, |
| Collection Processing Manager.</para></note> |
| |
| <section id="ugr.ovv.conceptual.using_framework_from_an_application"> |
| <title>Using the framework from an Application</title> |
| |
| <figure id="ugr.ovv.conceptual.application_factory_ae"> |
| <title>Using UIMA Framework to create and interact with an Analysis Engine</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/> |
| </imageobject> |
| <textobject><phrase>Picture of application interacting with UIMA's |
| factory to produce an analysis engine, which acts as a container for annotators, |
| and interfaces with the application via the process and getMetaData methods |
| among others.</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS |
| out.</para> |
| |
| <para>The application is responsible for interacting with the UIMA framework to |
| instantiate an AE, create or acquire an input CAS, initialize the input CAS with a |
| document and then pass it to the AE through the <emphasis role="bold">process |
| method</emphasis>. This interaction with the framework is illustrated in <xref |
| linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para> |
| |
| <para>The UIMA AE Factory takes the declarative information from the Component |
| Descriptor and the class files implementing the annotator, and instantiates the AE |
| instance, setting up the CAS and the UIMA Context.</para> |
| |
| <para>The AE, possibly calling many delegate AEs internally, performs the overall |
| analysis and its process method returns the CAS containing new analysis results. |
| </para> |
| |
| <para>The application then decides what to do with the returned CAS. There are many |
| possibilities. For instance, the application could display the results, store the |
| CAS to disk for post-processing, or extract and index analysis results as part of a search |
| or database application.</para> |
| |
| <para>The UIMA framework provides methods to support the application developer in |
| creating and managing CASes and instantiating, running and managing AEs. Details |
| may be found in <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.application"/>.</para> |
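| |
| <para>The interaction described in this section can be sketched as follows. This is a |
| toy model under stated assumptions, not the framework's API: <code>AppCas</code>, |
| <code>AppEngine</code>, <code>AppEngineFactory</code> and <code>AppAnnotator</code> are |
| hypothetical, and the "descriptor" is reduced to a supplier of an annotator instance. |
| The actual classes and method signatures are described in the guide referenced |
| above.</para> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical toy types; the real framework classes and methods differ.
class AppCas {
    private String documentText;
    private final List<String> annotations = new ArrayList<>();

    void setDocumentText(String text) { documentText = text; }
    String getDocumentText() { return documentText; }
    void addAnnotation(String annotation) { annotations.add(annotation); }
    List<String> getAnnotations() { return annotations; }

    // Clears the CAS so the application can reuse it for the next document.
    void reset() {
        documentText = null;
        annotations.clear();
    }
}

interface AppAnnotator {
    void process(AppCas cas);
}

// Container for an annotator; the application talks only to the engine.
class AppEngine {
    private final AppAnnotator annotator;

    AppEngine(AppAnnotator annotator) { this.annotator = annotator; }

    AppCas newCas() { return new AppCas(); }

    void process(AppCas cas) { annotator.process(cas); }
}

// Stand-in for the factory: turns declarative information into an engine.
class AppEngineFactory {
    static AppEngine produce(Supplier<AppAnnotator> descriptor) {
        return new AppEngine(descriptor.get());
    }
}

class AppDemo {
    static List<String> analyze(String document) {
        // 1. Instantiate the engine through the factory.
        AppEngine engine = AppEngineFactory.produce(
            () -> c -> {
                // Toy annotator: record each whitespace-separated token.
                for (String token : c.getDocumentText().trim().split("\\s+")) {
                    c.addAnnotation("Token:" + token);
                }
            });
        // 2. Acquire a CAS and initialize it with the document.
        AppCas cas = engine.newCas();
        cas.setDocumentText(document);
        // 3. Pass the CAS to the engine through its process method.
        engine.process(cas);
        // 4. Decide what to do with the results; here they are copied out.
        List<String> results = new ArrayList<>(cas.getAnnotations());
        cas.reset();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(analyze("UIMA analyzes text"));
    }
}
```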
| </section> |
| |
| <section id="ugr.ovv.conceptual.graduating_to_collection_processing"> |
| <title>Graduating to Collection Processing</title> |
| <figure id="ugr.ovv.conceptual.fig.cpe"> |
| <title>High-Level UIMA Component Architecture from Source to Sink</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
| <para>Many UIM applications analyze entire collections of documents. They connect to |
| different document sources and do different things with the results. In the |
| typical case, however, the application follows these logical steps: |
| |
| <orderedlist spacing="compact"> |
| <listitem><para>Connect to a physical source</para></listitem> |
| <listitem><para>Acquire a document from the source</para></listitem> |
| <listitem><para>Initialize a CAS with the document to be analyzed</para> |
| </listitem> |
| <listitem><para>Send the CAS to a selected analysis engine</para></listitem> |
| <listitem><para>Process the resulting CAS</para></listitem> |
| <listitem><para>Go back to step 2 until the collection is processed</para> |
| </listitem> |
| <listitem><para>Do any final processing required after all the documents in the |
| collection have been analyzed</para></listitem> |
| </orderedlist> </para> |
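| |
| <para>The steps above can be sketched as a single loop. This is a minimal illustration |
| under stated assumptions: <code>LoopCas</code> and <code>CollectionLoop</code> are |
| hypothetical names, the "physical source" is an in-memory list, and the analysis is a |
| trivial token count. The numbered comments refer to the steps in the list.</para> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the seven steps; none of these types are UIMA classes.
class LoopCas {
    String document;
    int tokenCount;
}

class CollectionLoop {
    // Returns one summary line per document plus a final line for the collection.
    static List<String> run(List<String> source) {
        List<String> sink = new ArrayList<>();
        int totalTokens = 0;

        Iterator<String> documents = source.iterator();   // 1. connect to the source
        LoopCas cas = new LoopCas();                       // one CAS, reused per document
        while (documents.hasNext()) {                      // 6. repeat until exhausted
            cas.document = documents.next();               // 2. acquire a document
            cas.tokenCount = 0;                            // 3. initialize the CAS
            analyze(cas);                                  // 4. send it to the engine
            // 5. process the resulting CAS
            sink.add(cas.document + " -> " + cas.tokenCount + " tokens");
            totalTokens += cas.tokenCount;
        }
        sink.add("total tokens: " + totalTokens);          // 7. final processing
        return sink;
    }

    // Toy analysis engine: counts whitespace-separated tokens.
    static void analyze(LoopCas cas) {
        String trimmed = cas.document.trim();
        cas.tokenCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList("first document", "a second short document"))) {
            System.out.println(line);
        }
    }
}
```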
| |
| <para>UIMA supports UIM application development for this general type of processing |
| through its <emphasis role="bold">Collection Processing |
| Architecture</emphasis>.</para> |
| |
| <para>As part of the collection processing architecture UIMA introduces two primary |
| components in addition to the annotator and analysis engine. These are the <emphasis |
| role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS |
| Consumer</emphasis>. The complete flow from source, through document analysis, |
| and to CAS Consumers supported by UIMA is illustrated in <xref |
| linkend="ugr.ovv.conceptual.fig.cpe"/>.</para> |
| |
| <para>The Collection Reader's job is to connect to and iterate through a source |
| collection, acquiring documents and initializing CASes for analysis. </para> |
| |
| <!-- |
| <para>Since the structure, access and iteration methods for |
| physical document sources vary independently from the format of stored |
| documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>. |
| The CAS Initializer's job is specific to a |
| document format and specialized logic for mapping that format to a CAS. In the |
| simplest case a CAS Intializer may take the document provided by the containing |
| Collection Reader and insert it as a subject of analysis (or Sofa) in the |
| CAS. A more advanced scenario is one |
| where the CAS Intializer may be implemented to handle documents that conform to |
| a certain XML schema and map some subset of the XML tags to CAS types and then |
| insert the de-tagged document content as the subject of analysis. Collection Readers may reuse plug-in CAS |
| Initializers for different document formats.</para> |
| --> |
| |
| <para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is |
| to do the final CAS processing. A CAS Consumer may be implemented, for example, to |
| index CAS contents in a search engine, extract elements of interest and populate a |
| relational database or serialize and store analysis results to disk for subsequent |
| and further analysis. </para> |
| |
| <para>A Semantic Search engine that works with UIMA is available from <ulink |
| url="http://www.alphaworks.ibm.com/tech/uima">IBM's alphaWorks |
| site</ulink> which will allow the developer to experiment with indexing analysis |
| results and querying for documents based on all the annotations in the CAS. See the |
| section on integrating text analysis and search in <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.application"/>.</para> |
| |
| <para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE) |
| is an aggregate component that specifies a <quote>source to sink</quote> flow from a |
| Collection Reader through a set of analysis engines and then to a set of CAS Consumers. |
| </para> |
| |
| <para>CPEs are specified by XML files called CPE Descriptors. These are declarative |
| specifications that point to their contained components (Collection Readers, |
| analysis engines and CAS Consumers) and indicate a flow among them. The flow |
| specification allows for filtering capabilities to, for example, skip over AEs |
| based on CAS contents. Details about the format of CPE Descriptors may be found in |
| <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. |
| </para> |
| |
| <figure id="ugr.ovv.conceptual.fig.cpm"> |
| <title>Collection Processing Manager in UIMA Framework</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/> |
| </imageobject> |
| <textobject><phrase>box and arrows picture of application using CPE factory to |
| instantiate a Collection Processing Engine, and that engine interacting with |
| the application.</phrase></textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>The UIMA framework includes a <emphasis role="bold">Collection Processing |
| Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and |
| deploying and running the specified CPE. <xref |
| linkend="ugr.ovv.conceptual.fig.cpm"/> illustrates the role of the CPM |
| in the UIMA Framework.</para> |
| |
| <para>Key features of the CPM are failure recovery, CAS management and scale-out. |
| </para> |
| |
| <para>Collections may be large and take considerable time to analyze. A configurable |
| behavior of the CPM is to log faults on single document failures while continuing to |
| process the collection. This behavior is commonly used because analysis components |
| tend to be the weakest link: in practice they may choke on strangely formatted |
| content. </para> |
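| |
| <para>The "log the fault and continue" behavior can be sketched as follows. This is not |
| the CPM implementation: <code>FaultTolerantRun</code> is a hypothetical class, the |
| <code>maxFaults</code> threshold is an assumed configuration knob introduced for |
| illustration, and the failing condition (a NUL character in the document) merely stands |
| in for strangely formatted content.</para> |

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of per-document failure recovery; not the CPM implementation.
class FaultTolerantRun {
    final List<String> processed = new ArrayList<>();
    final List<String> faults = new ArrayList<>();

    // Toy analysis step that rejects content it cannot handle.
    static int analyze(String document) {
        if (document.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("unexpected NUL character");
        }
        return document.length();
    }

    // maxFaults is an assumed knob: stop once more than this many documents fail.
    static FaultTolerantRun run(List<String> collection, int maxFaults) {
        FaultTolerantRun result = new FaultTolerantRun();
        for (int i = 0; i < collection.size(); i++) {
            try {
                analyze(collection.get(i));
                result.processed.add("doc" + i);
            } catch (RuntimeException e) {
                // Record the failure for this document only, then move on.
                result.faults.add("doc" + i + ": " + e.getMessage());
                if (result.faults.size() > maxFaults) {
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FaultTolerantRun r = run(Arrays.asList("ok", "bad\0doc", "fine"), 10);
        System.out.println(r.processed);
        System.out.println(r.faults);
    }
}
```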
| |
| <para>This deployment option requires that the CPM run in a separate process or a |
| machine distinct from the CPE components. A CPE may be configured to run with a variety |
| of deployment options that control the features provided by the CPM. For details see |
| <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> |
| .</para> |
| |
| <para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides |
| the developer with a user interface that simplifies the process of connecting up all |
| the components in a CPE and running the result. For details on using the CPE |
| Configurator see <olink targetdoc="&uima_docs_tools;" |
| targetptr="ugr.tools.cpe"/>. This tool currently does not provide |
| access to the full set of CPE deployment options supported by the CPM; however, you can |
| configure other parts of the CPE descriptor by editing it directly. For details on how |
| to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.cpe"/>.</para> |
| |
| </section> |
| |
| </section> |
| |
| <section id="ugr.ovv.conceptual.exploiting_analysis_results"> |
| <title>Exploiting Analysis Results</title> |
| |
| <note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para> |
| </note> |
| |
| <section id="ugr.ovv.conceptual.semantic_search"> |
| <title>Semantic Search</title> |
| |
| <para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads |
| documents from the file system and initializes CASes with their content. These are |
| then fed to an AE that annotates tokens and sentences. The CASes, now enriched with |
| token and sentence information, are passed to a CAS Consumer that populates a search |
| engine index. </para> |
| |
| <para>The search engine query processor can then use the token index to provide basic |
| key-word search. For example, given a query <quote>center</quote> the search |
| engine would return all the documents that contained the word |
| <quote>center</quote>.</para> |
| |
| <para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that |
| can exploit the additional metadata generated by analytics like a UIMA CPE.</para> |
| |
| <para>Suppose we plug a named-entity recognizer into the CPE described |
| above. Assume this analysis engine is capable of detecting mentions of persons and |
| organizations in documents and annotating them in the CAS.</para> |
| |
| <para>Complementing the named-entity recognizer, we add a CAS Consumer that extracts, in |
| addition to token and sentence annotations, the person and organization annotations added to |
| the CASes by the named-entity recognizer. It then feeds these into the semantic search |
| engine's index.</para> |
| |
| <para>The semantic search engine that comes with the UIMA SDK, for example, can exploit |
| this additional information from the CAS to support more powerful queries. For |
| example, imagine a user is looking for documents that mention an organization with |
| <quote>center</quote> in its name but is not sure of the full or precise name of the |
| organization. A key-word search on <quote>center</quote> would likely produce far |
| too many documents because <quote>center</quote> is a common and ambiguous term. |
| The semantic search engine that is available from <ulink |
| url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language |
| called <emphasis role="bold">XML Fragments</emphasis>. This query language is |
| designed to exploit the CAS annotations entered in its index. The XML Fragment query, |
| for example, |
| |
| |
| <programlisting><organization> center </organization></programlisting> |
| will first produce only documents that contain <quote>center</quote> where it |
| appears as part of a mention annotated as an organization by the named-entity |
| recognizer. This will likely be a much shorter list of documents more precisely |
| matching the user's interest.</para> |
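| |
| <para>The difference between the two kinds of query can be illustrated with a toy |
| in-memory index. This sketch does not implement the XML Fragments query language or any |
| real search engine: <code>ToySemanticIndex</code>, <code>IndexedDoc</code>, |
| <code>Mention</code> and the sample documents are hypothetical, and matching is a simple |
| case-insensitive substring test.</para> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy index; it mimics the idea of the query, not a real engine.
class Mention {
    final String type;      // for example "organization" or "person"
    final String covered;   // text covered by the annotation

    Mention(String type, String covered) {
        this.type = type;
        this.covered = covered;
    }
}

class IndexedDoc {
    final String id;
    final String text;
    final List<Mention> mentions = new ArrayList<>();

    IndexedDoc(String id, String text) {
        this.id = id;
        this.text = text;
    }

    IndexedDoc with(String type, String covered) {
        mentions.add(new Mention(type, covered));
        return this;
    }
}

class ToySemanticIndex {
    private final List<IndexedDoc> docs = new ArrayList<>();

    void add(IndexedDoc doc) { docs.add(doc); }

    private static boolean containsIgnoreCase(String haystack, String needle) {
        return haystack.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
    }

    // Plain key-word search: any document whose text contains the word.
    List<String> keyword(String word) {
        List<String> hits = new ArrayList<>();
        for (IndexedDoc d : docs) {
            if (containsIgnoreCase(d.text, word)) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    // Typed search: the word must occur inside a mention of the given type.
    List<String> within(String type, String word) {
        List<String> hits = new ArrayList<>();
        for (IndexedDoc d : docs) {
            for (Mention m : d.mentions) {
                if (m.type.equals(type) && containsIgnoreCase(m.covered, word)) {
                    hits.add(d.id);
                    break;  // one matching mention is enough for this document
                }
            }
        }
        return hits;
    }

    // Three made-up documents used only for this illustration.
    static ToySemanticIndex sample() {
        ToySemanticIndex index = new ToySemanticIndex();
        index.add(new IndexedDoc("d1", "The shopping center opens at nine."));
        index.add(new IndexedDoc("d2", "She joined the Center for Applied Research.")
            .with("organization", "Center for Applied Research"));
        index.add(new IndexedDoc("d3", "Acme Corp hired a new analyst.")
            .with("organization", "Acme Corp"));
        return index;
    }
}
```

| <para>In this sketch, <code>keyword("center")</code> matches both <code>d1</code> and |
| <code>d2</code>, while <code>within("organization", "center")</code> matches only |
| <code>d2</code>.</para> |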
| |
| <para>Consider taking this one step further. We add a relationship recognizer that |
| annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that |
| it sends these new relationship annotations to the semantic search index as well. |
| With these additional analysis results in the index we can submit queries like |
| |
| |
| <programlisting><ceo_of> |
| <person> center </person> |
| <organization> center </organization> |
| </ceo_of></programlisting> |
| This query will precisely target documents that contain a mention of an organization |
| with <quote>center</quote> as part of its name where that organization is mentioned |
| as part of a |
| <code>CEO-of</code> relationship annotated by the relationship |
| recognizer.</para> |
| |
| <para>For more details about using UIMA and Semantic Search see the section on |
| integrating text analysis and search in <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.application"/>.</para> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.databases"> |
| <title>Databases</title> |
| |
| <para>Search engine indices are not the only place to deposit analysis results for use |
| by applications. Another classic example is populating databases. While many |
| approaches are possible with varying degrees of flexibility and performance, all are |
| highly dependent on application specifics. We included a simple sample CAS Consumer |
| that provides the basics for getting your analysis results into a relational |
| database. It extracts annotations from a CAS and writes them to a relational |
| database, using the open source Apache Derby database.</para> |
| </section> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.multimodal_processing"> |
| <title>Multimodal Processing in UIMA</title> |
| <para>In previous sections we've seen how the CAS is initialized with an initial |
| artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The |
| first Analysis engine may make some assertions about the artifact, for example, in the |
| form of annotations. Subsequent Analysis engines will make further assertions about |
| both the artifact and previous analysis results, and finally one or more CAS Consumers |
| will extract information from these CASs for structured information storage.</para> |
| <figure id="ugr.ovv.conceptual.fig.multiple_sofas"> |
| <title>Multiple Sofas in support of multi-modal analysis of an audio Stream. Some |
| engines work on the audio <quote>view</quote>, some on the text |
| <quote>view</quote> and some on both.</title> |
| <mediaobject> |
| <imageobject role="html"> |
| <imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/> |
| </imageobject> |
| <imageobject role="fo"> |
| <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/> |
| </imageobject> |
| <textobject><phrase>Picture showing audio on the left broken into segments by a |
| segmentation component, then sent to multiple analysis pipelines in parallel, |
| some processing the raw audio, others processing the recognized speech as |
| text.</phrase></textobject> |
| </mediaobject> |
| </figure> |
| <para>Consider a processing pipeline, illustrated in <xref |
| linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an |
| audio recording of a conversation, transcribes the audio into text, and then extracts |
| information from the text transcript. Analysis Engines at the start of the pipeline are |
| analyzing an audio subject of analysis, and later analysis engines are analyzing a text |
| subject of analysis. The CAS Consumer will likely want to build a search index from |
| concepts found in the text to the original audio segment covered by the concept.</para> |
| |
| <para>What becomes clear from this relatively simple scenario is that the CAS must be |
| capable of simultaneously holding multiple subjects of analysis. Some analysis |
| engines will analyze only one subject of analysis, some will analyze one and create |
| another, and some will need to access multiple subjects of analysis at the same time. |
| </para> |
| |
| <para>The support in UIMA for multiple subjects of analysis is called <emphasis |
| role="bold">Sofa</emphasis> support. Sofa is an acronym derived from |
| <emphasis role="underline">S</emphasis>ubject <emphasis role="underline">of</emphasis> |
| <emphasis role="underline">A</emphasis>nalysis. A Sofa is a physical |
| representation of an artifact (e.g., the detagged text of a web page, the HTML |
| text of the same web page, the audio segment of a video, the closed-caption text |
| of the same audio segment). A Sofa may |
| be associated with CAS Views. A particular CAS will have one or more views, each view |
| corresponding to a particular subject of analysis, together with a set of the defined |
| indexes that index the metadata created in that view.</para> |
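| |
| <para>A minimal sketch of a CAS holding several views, assuming a toy model rather than |
| the UIMA CAS API: <code>MultiViewCas</code>, <code>ToyView</code>, the view names and the |
| fixed transcript are hypothetical. Each view pairs one subject of analysis with the |
| results indexed for it, following the audio-to-text scenario described above.</para> |

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a CAS with named views; not the UIMA CAS API.
class ToyView {
    final String name;
    final Object sofa;                              // the subject of analysis
    final List<String> index = new ArrayList<>();   // results indexed in this view

    ToyView(String name, Object sofa) {
        this.name = name;
        this.sofa = sofa;
    }
}

class MultiViewCas {
    private final Map<String, ToyView> views = new LinkedHashMap<>();

    ToyView createView(String name, Object sofa) {
        if (views.containsKey(name)) {
            throw new IllegalStateException("view already exists: " + name);
        }
        ToyView view = new ToyView(name, sofa);
        views.put(name, view);
        return view;
    }

    ToyView getView(String name) {
        ToyView view = views.get(name);
        if (view == null) {
            throw new IllegalArgumentException("no such view: " + name);
        }
        return view;
    }
}

class MultiViewDemo {
    static MultiViewCas transcribeAndAnnotate(byte[] audio) {
        MultiViewCas cas = new MultiViewCas();

        // The audio view holds the original recording.
        ToyView audioView = cas.createView("audio", audio);
        audioView.index.add("segment[0," + audio.length + ")");

        // A multi-view step reads one Sofa and creates another.
        String transcript = "hello world";  // placeholder for a real speech recognizer
        ToyView textView = cas.createView("transcript", transcript);

        // A single-view step is handed only the view it should analyze.
        annotateTokens(textView);
        return cas;
    }

    // Single-view component: sees one text Sofa and indexes tokens in that view.
    static void annotateTokens(ToyView view) {
        for (String token : ((String) view.sofa).split("\\s+")) {
            view.index.add("Token:" + token);
        }
    }
}
```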
| |
| <para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view. |
| UIMA components may be written in <quote>Multi-View</quote> mode, able to create and |
| access multiple Sofas at the same time, or in <quote>Single-View</quote> mode, simply |
| receiving a particular view of the CAS corresponding to a particular single Sofa. For |
| single-view mode components, it is up to the person assembling the component to supply |
| the needed information to ensure a particular view is passed to the component at run |
| time. This is done using XML descriptors for Sofa mapping (see <olink |
| targetdoc="&uima_docs_tutorial_guides;" |
| targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para> |
| |
| <para>Multi-View capability brings benefits to text-only processing as well. An input |
| document can be transformed from one format to another. Examples of this include |
| transforming text from HTML to plain text or from one natural language to another. |
| </para> |
| </section> |
| |
| <section id="ugr.ovv.conceptual.next_steps"> |
| <title>Next Steps</title> |
| |
| <para>This chapter presented a high-level overview of UIMA concepts. Along the way, it |
| pointed to other documents in the UIMA SDK documentation set where the reader can find |
| details on how to apply the related concepts in building applications with the UIMA |
| SDK.</para> |
| |
| <para>At this point the reader may return to the documentation guide in <olink |
| targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/> |
| to learn how they might proceed in getting started using UIMA.</para> |
| |
| <para>For a more detailed overview of the UIMA architecture, framework and development |
| roles we refer the reader to the following paper:</para> |
| |
| <para>D. Ferrucci and A. Lally, <quote>Building an example application using the |
| Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems |
| Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004). |
| </para> |
| |
| <para>This paper can be found online at <ulink |
| url="http://www.research.ibm.com/journal/sj43-3.html"/>.</para> |
| </section> |
| |
| </chapter> |