blob: 2c00e1326af5e751f63750b300e21814d5dbcc31 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.faqs">
<title>UIMA Frequently Asked Questions (FAQ&apos;s)</title>
<titleabbrev>UIMA FAQ&apos;s</titleabbrev>
<variablelist>
<varlistentry id="ugr.faqs.what_is_uima">
<term><emphasis role="bold">What is UIMA?</emphasis></term>
<listitem><para>UIMA stands for Unstructured Information Management
Architecture. It is component software architecture for the development,
discovery, composition and deployment of multi-modal analytics for the analysis
of unstructured information.</para>
<para>UIMA processing occurs through a series of modules called
<link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link>. The result of analysis is an assignment of semantics to the elements of
unstructured data, for example, the indication that the phrase
<quote>Washington</quote> refers to a person&apos;s name or that it refers to a
place.</para>
<para>Analysis Engine&apos;s output can be saved in conventional structures,
for example, relational databases or search engine indices, where the content
of the original unstructured information may be efficiently accessed
according to its inferred semantics. </para>
<para>UIMA supports developers in creating,
integrating, and deploying components across platforms and among dispersed
teams working to develop unstructured information management
applications.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.pronounce">
<term><emphasis role="bold">How do you pronounce UIMA?</emphasis></term>
<listitem><para>You &ndash; eee &ndash; muh.
<!-- Or, in IPA notation, /juːiːmə/ (which does not
display correctly in our PDF documentation, so it's commented out). --></para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.difference_apache_uima">
<term><emphasis role="bold">What&apos;s the difference between UIMA and the Apache UIMA?</emphasis></term>
<listitem><para>UIMA is an architecture which specifies component interfaces,
design patterns, data representations and development roles.</para>
<para>Apache UIMA is an open source, Apache-licensed software project. It includes run-time
frameworks in Java and C++, APIs and tools for implementing, composing, packaging
and deploying UIMA components.</para>
<para>The UIMA run-time framework allows developers to plug-in their components
and applications and run them on different platforms and according to different
deployment options that range from tightly-coupled (running in the same
process space) to loosely-coupled (distributed across different processes or
machines for greater scale, flexibility and recoverability).</para>
<para>The UIMA project has several significant subprojects, including UIMA-AS (for flexibly
scaling out UIMA pipelines over clusters of machines), uimaFIT (for a way of using UIMA without the xml descriptors; also provides
many convenience methods), UIMA-DUCC (for managing clusters of
machines running scaled-out UIMA "jobs" in a "fair" way), RUTA (Eclipse-based tooling and \
a runtime framework for development of rule-based
Annotators), Addons (where you can find many extensions), and uimaFIT supplying a Java centric
set of friendlier interfaces and avoiding XML.</para>
</listitem>
</varlistentry>
<!--
<varlistentry id="ugr.faqs.include_semantic_search">
<term><emphasis role="bold">
Does UIMA include a semantic search engine?
</emphasis></term>
<listitem><para>
The Apache UIMA project does not itself include a semantic search engine.
It can interface with the semantic search engine
component (available from <ulink
url="http://www.alphaworks.ibm.com/tech/uima"/> for indexing and querying over
the results of analysis. Over time, we expect that additional search engines will
add support for semantic searching.
</para>
</listitem>
</varlistentry>
-->
<varlistentry id="ugr.faqs.what_is_an_annotation">
<term><emphasis role="bold">What is an Annotation?</emphasis></term>
<listitem><para>An annotation is metadata that is associated with a region of a
document. It often is a label, typically represented as string of characters. The
region may be the whole document. </para>
<para>An example is the label <quote>Person</quote> associated with the span of
text <quote>George Washington</quote>. We say that <quote>Person</quote>
annotates <quote>George Washington</quote> in the sentence <quote>George
Washington was the first president of the United States</quote>. The
association of the label
<quote>Person</quote> with a particular span of text is an annotation. Another
example may have an annotation represent a topic, like <quote>American
Presidents</quote> and be used to label an entire document.</para>
<para>Annotations are not limited to regions of texts. An annotation may annotate
a region of an image or a segment of audio. The same concepts apply.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.what_is_the_cas">
<term><emphasis role="bold">What is the CAS?</emphasis></term>
<listitem><para>The CAS stands for Common Analysis Structure. It provides
cooperating UIMA components with a common representation and mechanism for
shared access to the artifact being analyzed (e.g., a document, audio file, video
stream etc.) and the current analysis results.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.what_does_the_cas_contain">
<term><emphasis role="bold">What does the CAS contain?</emphasis></term>
<listitem><para>The CAS is a data structure for which UIMA provides multiple
interfaces. It contains and provides the analysis algorithm or application
developer with access to</para>
<itemizedlist spacing="compact">
<listitem><para>the subject of analysis (the artifact being analyzed, like
the document),</para></listitem>
<listitem><para>the analysis results or metadata(e.g., annotations, parse
trees, relations, entities etc.),</para></listitem>
<listitem><para>indices to the analysis results, and</para></listitem>
<listitem><para>the type system (a schema for the analysis results).</para>
</listitem>
</itemizedlist>
<para>A CAS can hold multiple versions of the artifact being analyzed (for
instance, a raw html document, and a detagged version, or an English version and a
corresponding German version, or an audio sample, and the text that
corresponds, etc.). For each version there is a separate instance of the results
indices.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.only_annotations">
<term><emphasis role="bold">Does the CAS only contain Annotations?</emphasis></term>
<listitem><para>No. The CAS contains the artifact being analyzed plus the analysis
results. Analysis results are those metadata recorded by <link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link> in the
CAS. The most common form of analysis result is the addition of an annotation. But an
analysis engine may write any structure that conforms to the CAS&apos;s type
system into the CAS. These may not be annotations but may be other things, for
example links between annotations and properties of objects associated with
annotations.</para>
<para>The CAS may have multiple representations of the artifact being analyzed, each one
represented in the CAS as a particular Subject of Analysis. or <link linkend="ugr.faqs.what_is_a_sofa">Sofa</link></para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.just_xml">
<term><emphasis role="bold">Is the CAS just XML?</emphasis></term>
<listitem><para>No, in fact there are many possible representations of the CAS. If all
of the <link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link> are running in the same process, an efficient, in-memory
data object is used. If a CAS must be sent to an analysis engine on a remote machine, it
can be done via an XML or a binary serialization of the CAS. </para>
<para>The UIMA framework provides multiple serialization and de-serialization methods
in various formats, including XML. See the Javadocs for the CasIOUtils class.
</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.what_is_a_type_system">
<term><emphasis role="bold">What is a Type System?</emphasis></term>
<listitem><para>Think of a type system as a schema or class model for the <link linkend="ugr.faqs.what_is_the_cas">CAS</link>. It defines
the types of objects and their properties (or features) that may be instantiated in
a CAS. A specific CAS conforms to a particular type system. UIMA components declare
their input and output with respect to a type system. </para>
<para>Type Systems include the definitions of types, their properties, range
types (these can restrict the value of properties to other types) and
single-inheritance hierarchy of types.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.what_is_a_sofa">
<term><emphasis role="bold">What is a Sofa?</emphasis></term>
<listitem><para>Sofa stands for &ldquo;Subject of Analysis&quot;. A <link linkend="ugr.faqs.what_is_the_cas">CAS</link> is
associated with a single artifact being analysed by a collection of UIMA analysis
engines. But a single artifact may have multiple independent views, each of which
may be analyzed separately by a different set of <link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link>. For example,
given a document it may have different translations, each of which are associated
with the original document but each potentially analyzed by different engines. A
CAS may have multiple Views, each containing a different Subject of Analysis
corresponding to some version of the original artifact. This feature is ideal for
multi-modal analysis, where for example, one view of a video stream may be the video
frames and the other the close-captions.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.annotator_versus_ae">
<term><emphasis role="bold">What's the difference between an Annotator and an Analysis
Engine?</emphasis></term>
<listitem><para>In the terminology of UIMA, an annotator is simply some code that
analyzes documents and outputs <link linkend="ugr.faqs.what_is_an_annotation">annotations</link> on the content of the documents. The
UIMA framework takes the annotator, together with metadata describing such
things as the input requirements and outputs types of the annotator, and produces
an analysis engine. </para>
<para>Analysis Engines contain the framework-provided infrastructure that
allows them to be easily combined with other analysis engines in different flows
and according to different deployment options (collocated or as web services,
for example). </para>
<para>Analysis Engines are the framework-generated objects that an Application
interacts with. An Annotator is a user-written class that implements the one of
the supported Annotator interfaces.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.web_services">
<term><emphasis role="bold">Are UIMA analysis engines web services?</emphasis></term>
<listitem><para>They can be deployed as such. Deploying an analysis engine as a web
service is one of the deployment options supported by the UIMA framework.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.stateless_aes">
<term><emphasis role="bold">Do Analysis Engines have to be
&quot;stateless&quot;?</emphasis></term>
<listitem><para>This is a user-specifyable option. The XML metadata for the
component includes an
<code>operationalProperties</code> element which can specify if multiple
deployment is allowed. If true, then a particular instance of an Engine might not
see all the CASes being processed. If false, then that component will see all of the
CASes being processed. In this case, it can accumulate state information among all
the CASes. Typically, Analysis Engines in the main analysis pipeline are marked
multipleDeploymentAllowed = true. The CAS Consumer component, on the other hand,
defaults to having this property set to false, and is typically associated with
some resource like a database or search engine that aggregates analysis results
across an entire collection.</para>
<para>Analysis Engines developers are encouraged not to maintain state between
documents that would prevent their engine from working as advertised if
operated in a parallelized environment.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.uddi">
<term><emphasis role="bold">Is engine meta-data compatible with web services and
UDDI?</emphasis></term>
<listitem><para>All UIMA component implementations are associated with Component
Descriptors which represents metadata describing various properties about the
component to support discovery, reuse, validation, automatic composition and
development tooling. In principle, UIMA component descriptors are compatible
with web services and UDDI. However, the UIMA framework currently uses its own XML
representation for component metadata. It would not be difficult to convert
between UIMA&apos;s XML representation and other standard representations.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.scaling">
<term><emphasis role="bold">How do you scale a UIMA application?</emphasis></term>
<listitem><para>The UIMA framework allows components such as
<link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link> and
CAS Consumers to be easily deployed as services or in other containers and managed
by systems middleware designed to scale. UIMA applications tend to naturally
scale-out across documents allowing many documents to be analyzed in
parallel.</para>
<para>The UIMA-AS project has extensive capabilities to flexibly scale a UIMA
pipeline across multiple machines. The UIMA-DUCC project supports a
unified management of large clusters of machines running multiple "jobs"
each consisting of a pipeline with data sources and sinks.</para>
<para>Within the core UIMA framework, there is a component called the CPM (Collection Processing
Manager) which has features and configuration settings for scaling an
application to increase its throughput and recoverability;
the CPM was the earlier version of scaleout technology, and has been
superceded by the UIMA-AS effort (although it is still supported).</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.embedding">
<term><emphasis role="bold">What does it mean to embed UIMA in systems middleware?</emphasis></term>
<listitem><para>An example of an embedding would be the deployment of a UIMA analysis
engine as an Enterprise Java Bean inside an application server such as IBM
WebSphere. Such an embedding allows the deployer to take advantage of the features
and tools provided by WebSphere for achieving scalability, service management,
recoverability etc. UIMA is independent of any particular systems middleware, so
<link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link> could be deployed on other application servers as well.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.cpm_versus_cpe">
<term><emphasis role="bold">How is the CPM different from a CPE?</emphasis></term>
<listitem><para>These name complimentary aspects of collection processing. The CPM
(Collection Processing <emphasis role="bold">Manager</emphasis> is the part of
the UIMA framework that manages the execution of a workflow of UIMA
components orchestrated to analyze a large collection of documents. The UIMA
developer does not implement or describe a CPM. It is a piece of infrastructure code
that handles CAS transport, instance management, batching, check-pointing,
statistics collection and failure recovery in the execution of a collection
processing workflow.</para>
<para>A Collection Processing Engine (CPE) is component created by the framework
from a specific CPE descriptor. A CPE descriptor refers to a series of UIMA
components including a Collection Reader, CAS Initializer, Analysis
Engine(s) and CAS Consumers. These components are organized in a work flow and
define a collection analysis job or CPE. A CPE acquires documents from a source
collection, initializes CASs with document content, performs document
analysis and then produces collection level results (e.g., search engine
index, database etc). The CPM is the execution engine for a CPE.</para>
</listitem>
</varlistentry>
<!--
<varlistentry id="ugr.faqs.semantic_search">
<term><emphasis role="bold">What is Semantic Search and what is its relationship to
UIMA?</emphasis></term>
<listitem><para>Semantic Search refers to a document search paradigm that allows
users to search based not just on the keywords contained in the documents, but also
on the semantics associated with the text by <link linkend="ugr.faqs.annotator_versus_ae">analysis engines</link>. UIMA applications
perform analysis on text documents and generate semantics in the form of
<link linkend="ugr.faqs.what_is_an_annotation">annotations</link> on regions of text. For example, a UIMA analysis engine may discover
the text <quote>First Financial Bank</quote> to refer to an organization and
annotated it as such. With traditional keyword search, the query
<command>first</command> will return all documents that contain that word.
<command>First</command> is a frequent and ambiguous term &ndash; it occurs a lot
and can mean different things in different places. If the user is looking for
organizations that contain that word <command>first</command> in their names,
s/he will likely have to sift through lots of documents containing the word
<quote>first</quote> used in different ways. Semantic Search exploits the
results of analysis to allow more precise queries. For example, the semantic
search query <emphasis>&lt;organization&gt; first
&lt;/organization&gt;</emphasis> will rank first documents that contain the
word <quote>first</quote> as part of the name of an organization. The UIMA SDK
documentation demonstrates how UIMA applications can be built using semantic
search. It provides details about the XML Fragment Query language. This is the
particular query language used by the semantic search engine that comes with the
SDK.</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.xml_fragment_not_xml">
<term><emphasis role="bold">Is an XML Fragment Query valid XML?</emphasis></term>
<listitem><para>Not necessarily. The XML Fragment Query syntax is used to formulate
queries interpreted by the semantic search engine that ships with the UIMA SDK.
This query language relies on basic XML syntax as an intuitive way to describe
hierarchical patterns of annotations that may occur in a <link linkend="ugr.faqs.what_is_the_cas">CAS</link>. The language
deviates from valid XML in order to support queries over
<quote>overlapping</quote> or <quote>cross-over</quote> annotations and
other features that affect the interpretation of the query by the query processor.
For example, it admits notations in the query to indicate whether a keyword or an
annotation is optional or required to match a document.</para></listitem>
</varlistentry>
-->
<varlistentry id="ugr.faqs.modalities_other_than_text">
<term><emphasis role="bold">Does UIMA support modalities other than text?</emphasis></term>
<listitem><para>The UIMA architecture supports the development, discovery,
composition and deployment of multi-modal analytics including text, audio and
video. Applications that process text, speech and video have been developed using
UIMA. This release of the SDK, however, does not include examples of these
multi-modal applications. </para>
<para>It does however include documentation and programming examples for using
the key feature required for building multi-modal applications. UIMA supports
multiple subjects of analysis or <link linkend="ugr.faqs.what_is_a_sofa">Sofas</link>. These allow multiple views of a single
artifact to be associated with a <link linkend="ugr.faqs.what_is_the_cas">CAS</link>. For example, if an artifact is a video
stream, one Sofa could be associated with the video frames and another with the
closed-captions text. UIMA&apos;s multiple Sofa feature is included and
described in this release of the SDK.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.compare">
<term><emphasis role="bold">How does UIMA compare to other similar work?</emphasis></term>
<listitem><para>A number of different frameworks for NLP have preceded UIMA. Two of
them were developed at IBM Research and represent UIMA&apos;s early roots. For
details please refer to the UIMA article that appears in the IBM Systems Journal
Vol. 43, No. 3 (<ulink
url="http://www.research.ibm.com/journal/sj/433/ferrucci.html"/>
).</para>
<para>UIMA has advanced that state of the art along a number of dimensions
including: support for distributed deployments in different middleware
environments, easy framework embedding in different software product
platforms (key for commercial applications), broader architectural converge
with its collection processing architecture, support for
multiple-modalities, support for efficient integration across programming
languages, support for a modern software engineering discipline calling out
different roles in the use of UIMA to develop applications, the extensive use of
descriptive component metadata to support development tooling, component
discovery and composition. (Please note that not all of these features are
available in this release of the SDK.)</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.open_source">
<term><emphasis role="bold">Is UIMA Open Source?</emphasis></term>
<listitem><para>Yes. As of version 2, UIMA development has moved to Apache and is being
developed within the Apache open source processes. It is licensed under the Apache
version 2 license. Previous versions are available on the IBM alphaWorks site (
<ulink url="http://www.alphaworks.ibm.com/tech/uima"/>) and the source
code for previous version of the UIMA framework is available on SourceForge (
<ulink url="http://uima-framework.sourceforge.net/"/>).</para>
</listitem>
</varlistentry>
<varlistentry id="ugr.faqs.levels_required">
<term><emphasis role="bold">What Java level and OS are required for the UIMA SDK?</emphasis></term>
<listitem><para>As of release 3.0.0, the UIMA SDK requires Java 1.8.
It has been tested on mainly on Windows and Linux platforms, with some
testing on the MacOSX. Other
platforms and JDK implementations will likely work, but have
not been as significantly tested.</para></listitem>
</varlistentry>
<varlistentry id="ugr.faqs.building_apps_on_top_of_uima">
<term><emphasis role="bold">Can I build my UIM application on top of UIMA?</emphasis></term>
<listitem><para>Yes. Apache UIMA is licensed under the Apache version 2 license,
enabling you to build and distribute applications which include the framework.
</para></listitem>
</varlistentry>
<!--
<varlistentry id="ugr.faqs.commercial_products">
<term><emphasis role="bold">Do any commercial products support the UIMA framework or include
it as part of their product?</emphasis></term>
<listitem><para>Yes. IBM's WebSphere Information Integration Omnifind Edition
product (<ulink
url="http://www.ibm.com/developerworks/db2/zones/db2ii"/> or <ulink
url="http://www-306.ibm.com/software/data/integration/db2ii/editions_womnifind.html"/>
) has UIMA <quote>inside</quote> and supports adding UIMA annotators to the
processing pipeline. We are actively seeking other product embeddings. </para>
</listitem>
</varlistentry>
-->
<!--
<varlistentry>
<term><emphasis role="bold"></emphasis></term>
<listitem><para></para></listitem>
</varlistentry>
-->
</variablelist>
</chapter>