<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
<!ENTITY % uimaents SYSTEM "../entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tug.aas">
<title>Annotations, Artifacts, and Sofas</title>
<titleabbrev>Annotations, Artifacts &amp; Sofas</titleabbrev>
<para>Up to this point, the documentation has focused on analyzing strings of Unicode text
and producing subtypes of Annotations that reference offsets in those strings. This
chapter generalizes that concept and shows how other kinds of artifacts can be handled,
including non-text artifacts such as audio and images, and how you can define your own kinds of
<quote>annotations</quote> for them.</para>
<section id="ugr.tug.aas.terminology">
<title>Terminology</title>
<section id="ugr.tug.aas.artifact">
<title>Artifact</title>
<para>The Artifact is the unstructured thing being analyzed by an annotator. It could
be an HTML web page, an image, a video stream, a recorded audio conversation, an MPEG-4
stream, etc. Artifacts are often restructured in the course of processing to
facilitate particular kinds of analysis. For instance, an HTML page may be converted
into a <quote>de-tagged</quote> version. Annotators at different places in the
pipeline may be analyzing different versions of the artifact.</para>
</section>
<section id="ugr.tug.aas.sofa">
<title>Subject of Analysis &mdash; Sofa</title>
<para>Each representation of an Artifact is called a Subject of Analysis, abbreviated
<quote>Sofa</quote>, which stands for <emphasis
role="underline">S</emphasis>ubject <emphasis role="underline">OF</emphasis>
<emphasis role="underline">A</emphasis>nalysis. Annotation
metadata that explicitly designates the sub-regions of the artifact to which
it applies is always associated with a particular Sofa. For instance, an
annotation over text specifies two features, begin and end, which represent
character offsets into the text string Sofa being analyzed.</para>
<para>Examples of representations of Artifacts that could be Sofas include
an HTML web page, a detagged web page, the translated text of that document, an audio or
video stream, and closed-caption text from a video stream.</para>
<para>Often, there is only one Sofa being analyzed in a CAS. The next chapter shows how
UIMA facilitates working with multiple representations of an artifact at the same
time, in the same CAS.</para>
</section>
</section>
<section id="ugr.tug.aas.sofa_data_formats">
<title>Formats of Sofa Data</title>
<para>Sofa data can be a Java Unicode String, a Feature Structure array of a primitive
type, or a URI that references remote data available via a network
connection.</para>
<para>The arrays of primitive types can be, for example, byte arrays or float arrays, and are
intended for artifacts such as audio data or image data.</para>
<para>The URI form holds a URI specification String.</para>
</section>
<section id="ugr.tug.aas.setting_accessing_sofa_data">
<title>Setting and Accessing Sofa Data</title>
<section id="ugr.tug.aas.setting_sofa_data">
<title>Setting Sofa Data</title>
<para>After a CAS is created, you can set its Sofa data only once; this restriction
ensures that metadata describing regions of the Sofa remains valid. As a consequence,
the data for a given Sofa can be set only once, using one of the methods
below.</para>
<para>The following methods on the CAS set the Sofa data to one of the three formats. Assume
that the variable <quote>aCas</quote> holds a reference to a CAS:</para>
<programlisting><?db-font-size 80% ?>aCas.<emphasis role="bold">setSofaDataString</emphasis>(document_text_string, mime_type_string);
aCas.<emphasis role="bold">setSofaDataArray</emphasis>(feature_structure_primitive_array, mime_type_string);
aCas.<emphasis role="bold">setSofaDataURI</emphasis>(uri_string, mime_type_string);</programlisting>
<para>In addition, the method
<literal>aCas.setDocumentText(document_text_string)</literal> may still be
used, and is equivalent to <literal>setSofaDataString(string,
"text")</literal>. The mime type is currently not used by the UIMA framework, but may
be set and retrieved by user code.</para>
<para>Feature Structure primitive arrays are all the UIMA Array types except arrays of
Feature Structures, Strings, and Booleans. Typically, these are arrays of bytes,
but they can be other types, such as floats or longs.</para>
<para>The URI string should conform to the standard URI format.</para>
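<para>The calls above can be sketched as follows. This is a minimal illustration, not a
definitive recipe: it assumes the UIMA core library is on the classpath and creates a CAS with
an empty type system through <literal>CasCreationUtils.createCas</literal>. The class name,
helper names, and sample text are hypothetical. The second attempt to set the data is expected
to be rejected, in line with the set-once rule described above.</para>
<programlisting>import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.CASRuntimeException;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

class SetSofaDataSketch {

  /** Creates a CAS with an empty type system; only the built-in types are available. */
  static CAS newCas() {
    try {
      TypeSystemDescription tsd =
          UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
      return CasCreationUtils.createCas(tsd, null, null);
    } catch (ResourceInitializationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Returns true if another attempt to set the Sofa data is rejected. */
  static boolean secondSetRejected(CAS aCas) {
    try {
      aCas.setSofaDataString("Different text.", "text/plain");
      return false;
    } catch (CASRuntimeException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    CAS aCas = newCas();
    // First (and only permitted) assignment of the Sofa data.
    aCas.setSofaDataString("Some text to analyze.", "text/plain");
    System.out.println(aCas.getDocumentText());
    System.out.println(aCas.getSofaMimeType());
    System.out.println("second set rejected: " + secondSetRejected(aCas));
  }
}</programlisting>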
</section>
<section id="ugr.tug.aas.accessing_sofa_data">
<title>Accessing Sofa Data</title>
<para>Analysis algorithms typically work with the Sofa data. The following
methods on the CAS access the Sofa data:</para>
<programlisting>String aCas.getDocumentText();
String aCas.getSofaDataString();
FeatureStructure aCas.getSofaDataArray();
String aCas.getSofaDataURI();
String aCas.getSofaMimeType();</programlisting>
<para>The <literal>getDocumentText</literal> and
<literal>getSofaDataString</literal> methods return the same text string. The
<literal>getSofaDataURI</literal> method returns the URI itself, not the data the URI
points to. You can use standard Java I/O capabilities to get the data associated
with the URI, or use the UIMA Framework streaming method described next.</para>
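<para>As a rough sketch of the standard Java I/O route, the following uses only the JDK
(Java 9 or later, for <literal>readAllBytes</literal>). A temporary file stands in for the
remote resource, and its <literal>file:</literal> URI string stands in for the value that
<literal>aCas.getSofaDataURI()</literal> would return. The class and method names are
hypothetical, and the sketch assumes the data is UTF-8 text.</para>
<programlisting>import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SofaUriReadSketch {

  /** Opens the resource named by a Sofa URI string and decodes its bytes as UTF-8 text. */
  static String readText(String sofaUri) {
    try (InputStream in = URI.create(sofaUri).toURL().openStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes a temporary file, reads it back through its URI, and removes it again. */
  static String demo() {
    try {
      Path tmp = Files.createTempFile("sofa", ".txt");
      try {
        Files.write(tmp, "remote document text".getBytes(StandardCharsets.UTF_8));
        // In an annotator this string would come from aCas.getSofaDataURI().
        return readText(tmp.toUri().toString());
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}</programlisting>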
</section>
<section id="ugr.tug.aas.accessing_sofa_data_using_java_stream">
<title>Accessing Sofa Data using a Java Stream</title>
<para>The framework provides a consistent method for accessing the Sofa data,
regardless of whether it is stored locally or accessed remotely through a URI. Get a Java
InputStream instance from the Sofa data using:</para>
<programlisting>InputStream inputStream = aCas.getSofaDataStream();</programlisting>
<itemizedlist spacing="compact"><listitem><para>If the data is local, this method
returns a ByteArrayInputStream, which provides the data as bytes.
<itemizedlist><listitem><para>If the Sofa data was set using setDocumentText or
setSofaDataString, the String is converted to bytes using the UTF-8
encoding.</para></listitem>
<listitem><para>If the Sofa data was set as a DataArray, the elements of the data array
are serialized to bytes, high-byte first.</para></listitem></itemizedlist>
</para></listitem>
<listitem><para>If the Sofa data was specified as a URI, this method returns the
stream obtained from url.openStream(). Java offers built-in support for several URI
schemes, including <quote>FILE:</quote>, <quote>HTTP:</quote>, and
<quote>FTP:</quote>, and has an extensible mechanism,
<literal>URLStreamHandlerFactory</literal>, for customizing access to an
arbitrary URI. For more details, see <ulink
url="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLStreamHandlerFactory.html"/>.</para></listitem></itemizedlist>
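<para>To make the byte layout for local data concrete, the following JDK-only sketch
reproduces the two conversions described above: a String encoded as UTF-8, and a float array
written high-byte first (big-endian). It imitates the described behavior and is not the
framework's own implementation; the class and method names are hypothetical.</para>
<programlisting>import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SofaStreamBytesSketch {

  /** Bytes a consumer would see for text Sofa data: the String encoded as UTF-8. */
  static byte[] stringBytes(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  /** Bytes for a float array: each element written high-byte first. */
  static byte[] floatBytes(float[] values) {
    // ByteBuffer uses big-endian (high-byte first) order by default.
    ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES);
    for (float value : values) {
      buffer.putFloat(value);
    }
    return buffer.array();
  }

  public static void main(String[] args) {
    // Four characters, but five bytes: the accented letter needs two bytes in UTF-8.
    InputStream textStream = new ByteArrayInputStream(stringBytes("caf\u00e9"));
    // One float occupies four bytes; 1.0f is 0x3F800000, so the first byte is 0x3F.
    InputStream floatStream = new ByteArrayInputStream(floatBytes(new float[] {1.0f}));
    System.out.println(textStream.available());
    System.out.println(floatStream.available());
  }
}</programlisting>
<para>A consumer of the stream therefore has to know which form the Sofa data took in order
to decode the bytes; the mime type, when user code has set it, is one place to record
that.</para>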
</section>
</section>
<section id="ugr.tug.aas.sofa_fs">
<title>The Sofa Feature Structure</title>
<para>Information about a Sofa is contained in a special built-in Feature Structure of
type <literal>uima.cas.Sofa</literal>. This feature structure is created and
managed by the UIMA Framework; users must not create it directly. Although these Sofa
type instances are implemented as standard feature structures, <emphasis>generic
CAS APIs cannot be used to create Sofas or set their features</emphasis>. Instead,
Sofas are created implicitly by the creation of new CAS views. Similarly, Sofa features
are set by CAS methods such as <literal>cas.setDocumentText()</literal>.</para>
<para>Features of the Sofa type include:</para>
<itemizedlist><listitem><para>SofaID: Every Sofa in a CAS has a unique SofaID. SofaIDs
are the primary handle for access. This ID is often the same as the name string given to the
Sofa by the developer, but it can be mapped to a different name (see <olink
targetdoc="&uima_docs_tutorial_guides;"
targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para></listitem>
<listitem><para>Mime type: This string feature can be used to describe the type of the
data represented by a Sofa. It is not used by the framework; the framework provides
APIs to set and get its value.</para></listitem>
<listitem><para>Sofa Data: The Sofa data itself. This data can be resident in the CAS, or
it can be a reference to data outside the CAS.</para></listitem></itemizedlist>
</section>
<section id="ugr.tug.aas.annotations">
<title>Annotations</title>
<para>Annotators add metadata about a Sofa to the CAS. It is often useful to have this
metadata denote a region of the Sofa to which it applies. For instance, assuming the Sofa
is a String, the metadata might describe a particular substring as the name of a person.
The built-in UIMA type, <literal>uima.tcas.Annotation</literal>, has two extra features that
enable this, begin and end, which denote character offsets into the text
string being analyzed.</para>
<para>The concept of <quote>annotations</quote> can be generalized to non-string
kinds of Sofas. For instance, an audio stream might have an audio annotation that
describes sound regions in terms of floating-point time offsets in the Sofa. An image
annotation might use two pairs of x, y coordinates to define the region the annotation
applies to.</para>
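<para>The three kinds of region mentioned above can be pictured with plain Java, independent
of any UIMA types. The sketch below is illustrative only: the class and method names are
hypothetical, time offsets are taken to be seconds, and the end offset of a text region is
assumed to be exclusive, as with <literal>String.substring</literal>.</para>
<programlisting>class RegionSketch {

  /** Text Sofa: begin and end are character offsets into the string. */
  static String coveredText(String sofaText, int begin, int end) {
    return sofaText.substring(begin, end);
  }

  /** Audio Sofa: a region given by floating-point time offsets. */
  static double durationSeconds(double beginTime, double endTime) {
    return endTime - beginTime;
  }

  /** Image Sofa: a rectangle given by two pairs of x, y coordinates. */
  static int areaInPixels(int x1, int y1, int x2, int y2) {
    return Math.abs(x2 - x1) * Math.abs(y2 - y1);
  }

  public static void main(String[] args) {
    // Offsets 4 and 9 select the person name in this sentence.
    System.out.println(coveredText("Mr. Smith went home.", 4, 9));
    System.out.println(durationSeconds(1.5, 4.0));
    System.out.println(areaInPixels(10, 20, 110, 70));
  }
}</programlisting>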
<section id="ugr.tug.aas.built_in_annotation_types">
<title>Built-in Annotation types</title>
<para>The built-in CAS type, <literal>uima.tcas.Annotation</literal>, is just one
kind of definition of an Annotation. It was designed for annotating text strings, and
has begin and end features that identify the substring of the Sofa being
annotated.</para>
<para>For applications that have other kinds of Sofas, UIMA developers can design
their own Annotation types, as needed to describe an annotation, by
declaring new types that are subtypes of
<literal>uima.cas.AnnotationBase</literal>. For instance, for images, you
might have the concept of a rectangular region to which the annotation applies. In
this case, you might describe the region with two pairs of x, y coordinates.</para>
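<para>One way to declare such a type is programmatically, through the
<literal>TypeSystemDescription</literal> API; an XML type system descriptor is the more
common route. The sketch below is a minimal illustration under several assumptions: the UIMA
core library is on the classpath, the type name <literal>org.example.ImageRegion</literal> and
its features <literal>x1</literal>, <literal>y1</literal>, <literal>x2</literal>, and
<literal>y2</literal> are hypothetical, and the instance is created through the generic
<literal>createFS</literal> call on a CAS view.</para>
<programlisting>import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.metadata.TypeDescription;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

class ImageRegionSketch {

  // Hypothetical type name, chosen for this example only.
  static final String TYPE_NAME = "org.example.ImageRegion";

  /** Creates a CAS whose type system adds a rectangular-region subtype of AnnotationBase. */
  static CAS newCasWithImageRegionType() {
    try {
      TypeSystemDescription tsd =
          UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
      TypeDescription region =
          tsd.addType(TYPE_NAME, "Rectangular image region", CAS.TYPE_NAME_ANNOTATION_BASE);
      for (String name : new String[] {"x1", "y1", "x2", "y2"}) {
        region.addFeature(name, "Corner coordinate in pixels", CAS.TYPE_NAME_INTEGER);
      }
      return CasCreationUtils.createCas(tsd, null, null);
    } catch (ResourceInitializationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Creates one region instance and fills in its two corner points. */
  static FeatureStructure createRegion(CAS cas, int x1, int y1, int x2, int y2) {
    Type type = cas.getTypeSystem().getType(TYPE_NAME);
    FeatureStructure region = cas.createFS(type);
    region.setIntValue(type.getFeatureByBaseName("x1"), x1);
    region.setIntValue(type.getFeatureByBaseName("y1"), y1);
    region.setIntValue(type.getFeatureByBaseName("x2"), x2);
    region.setIntValue(type.getFeatureByBaseName("y2"), y2);
    return region;
  }

  public static void main(String[] args) {
    CAS cas = newCasWithImageRegionType();
    FeatureStructure region = createRegion(cas, 10, 20, 110, 70);
    System.out.println(region.getIntValue(region.getType().getFeatureByBaseName("x2")));
  }
}</programlisting>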
</section>
<section id="ugr.tug.aas.annotations_associated_sofa">
<title>Annotations have an associated Sofa</title>
<para>Annotations are always associated with a particular Sofa. In subsequent
chapters, you will learn how there can be multiple Sofas associated with an artifact;
which Sofa an annotation refers to is recorded in the Annotation feature structure
itself.</para>
<para>All annotation types extend the built-in type <literal>uima.cas.AnnotationBase</literal>.
This type has one feature, a reference to the Sofa associated with the annotation.
The Framework currently uses this value to support the getCoveredText() method
on the annotation instance, which returns the portion of a text Sofa that the
annotation spans. It is also used to ensure that the Annotation is indexed only in the
CAS View associated with this Sofa.</para>
</section>
</section>
<section id="ugr.tug.aas.annotationbase">
<title>AnnotationBase</title>
<para>A built-in type, <literal>uima.cas.AnnotationBase</literal>, is provided by
UIMA to allow users to extend the Annotation capabilities to different kinds of
Annotations. The <literal>AnnotationBase</literal> type has one feature, named
<literal>sofa</literal>, which holds a reference to the
<literal>Sofa</literal> feature structure with which this annotation is associated.
The <literal>sofa</literal> feature is set automatically when an annotation is created
(that is, an instance of any type derived from the built-in
<literal>uima.cas.AnnotationBase</literal> type); it should not be set by the user.</para>
<para>There is one method, <literal>getView</literal>(), provided by
<literal>AnnotationBase</literal> that returns the CAS View for the Sofa the
annotation is pointing at. Note that this method always returns a CAS, even when applied
to JCas annotation instances.</para>
<para>The built-in type <literal>uima.tcas.Annotation</literal> extends
<literal>uima.cas.AnnotationBase</literal> and adds two features, a begin and an
end feature, which are suitable for identifying a span in a text string that the
annotation applies to. Users may define other extensions to
<literal>AnnotationBase</literal> with alternative specifications that can
denote a particular region within the subject of analysis, as appropriate to their
application.</para>
</section>
</chapter>