uimaj-documentation/src/docs/asciidoc/tug/tug.aas.adoc - uima-uimaj - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements. See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership. The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License. You may obtain a copy of the License at
 //
 // http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied. See the License for the
 // specific language governing permissions and limitations
 // under the License.

 [[ugr.tug.aas]]
 = Annotations, Artifacts, and Sofas
 // <titleabbrev>Annotations, Artifacts &amp; Sofas</titleabbrev>

 Up to this point, the documentation has focused on analyzing strings of Unicode text, producing subtypes of Annotations which reference offsets in those strings.
 This chapter generalizes this concept and shows how other kinds of artifacts can be handled, including non-text things like audio and images, and how you can define your own kinds of "`annotations`" for these.

 [[ugr.tug.aas.terminology]]
 == Terminology

 [[ugr.tug.aas.artifact]]
 === Artifact

 The Artifact is the unstructured thing being analyzed by an annotator.
 It could be an HTML web page, an image, a video stream, a recorded audio conversation, an MPEG-4 stream, etc.
 Artifacts are often restructured in the course of processing to facilitate particular kinds of analysis.
 For instance, an HTML page may be converted into a "`de-tagged`" version.
 Annotators at different places in the pipeline may be analyzing different versions of the artifact.

 [[ugr.tug.aas.sofa]]
 === Subject of Analysis — Sofa

 Each representation of an Artifact is called a Subject of Analysis, abbreviated using the acronym "`Sofa`" which stands for __S__ubject _
         OF___A__nalysis.
 Annotation metadata, which have explicit designations of sub-regions of the artifact to which they apply, are always associated with a particular Sofa.
 For instance, an annotation over text specifies two features, the begin and end, which represent the character offsets into the text string Sofa being analyzed.

 Other examples of representations of Artifacts, which could be Sofas include: An HTML web page, a detagged web page, the translated text of that document, an audio or video stream, closed-caption text from a video stream, etc.

 Often, there is one Sofa being analyzed in a CAS.
 The next chapter will show how UIMA facilitates working with multiple representations of an artifact at the same time, in the same CAS.

 [[ugr.tug.aas.sofa_data_formats]]
 == Formats of Sofa Data

 Sofa data can be Java Unicode Strings, Feature Structure arrays of primitive types, or a URI which references remote data available via a network connection.

 The arrays of primitive types can be things like byte arrays or float arrays, and are intended to be used for artifacts like audio data, image data, etc.

 The URI form holds a URI specification String.

 [NOTE]
 ====
 Sofa data can be "serialized" using an XML format; when it is, the String data  being serialized must not include invalid XML characters.
 See <<ugr.tug.xmi_emf.xml_character_issues>>.
 ====

 [[ugr.tug.aas.setting_accessing_sofa_data]]
 == Setting and Accessing Sofa Data

 [[ugr.tug.aas.setting_sofa_data]]
 === Setting Sofa Data

 When a CAS is created, you can set its Sofa Data, just one time; this property insures that metadata describing regions of the Sofa remain valid.
 As a consequence, the following methods that set data for a given Sofa can only be called once for a given Sofa.

 The following methods on the CAS set the Sofa Data to one of the 3 formats.
 Assume that the variable "`aCas`" holds a reference to a CAS:

 [source]
 ----
 aCas.setSofaDataString(document_text_string, mime_type_string);
 aCas.setSofaDataArray(feature_structure_primitive_array, mime_type_string);
 aCas.setSofaDataURI(uri_string, mime_type_string);
 ----

 In addition, the method `aCas.setDocumentText(document_text_string)` may still be used, and is equivalent to ``setSofaDataString(string,
         "text")``.
 The mime type is currently not used by the UIMA framework, but may be set and retrieved by user code.

 Feature Structure primitive arrays are all the UIMA Array types except arrays of Feature Structures, Strings, and Booleans.
 Typically, these are arrays of bytes, but can be other types, such as floats, longs, etc.

 The URI string should conform to the standard URI format.

 [[ugr.tug.aas.accessing_sofa_data]]
 === Accessing Sofa Data

 The analysis algorithms typically work with the Sofa data.
 The following methods on the CAS access the Sofa Data:

 [source]
 ----
 String           aCas.getDocumentText();
 String           aCas.getSofaDataString();
 FeatureStructure aCas.getSofaDataArray();
 String           aCas.getSofaDataURI();
 String           aCas.getSofaMimeType();
 ----

 The `getDocumentText` and `getSofaDataString` return the same text string.
 The `getSofaDataURI` returns the URI itself, not the data the URI is pointing to.
 You can use standard Java I/O capabilities to get the data associated with the URI, or use the UIMA Framework Streaming method described next.

 [[ugr.tug.aas.accessing_sofa_data_using_java_stream]]
 === Accessing Sofa Data using a Java Stream

 The framework provides a consistent method for accessing the Sofa data, independent of it being stored locally, or accessed remotely using the URI.
 Get a Java InputStream instance from the Sofa data using:

 [source]
 ----
 InputStream inputStream = aCas.getSofaDataStream();
 ----

 * If the data is local, this method returns a ByteArrayInputStream. This stream provides bytes.
 +
 ** If the Sofa data was set using setDocumentText or setSofaDataString, the String is converted to bytes by using the UTF-8 encoding.
 ** If the Sofa data was set as a DataArray, the bytes in the data array are serialized, high-byte first.
 * If the Sofa data was specified as a URI, this method returns the handle from url.openStream(). Java offers built-in support for several URI schemes including "`FILE:`", "`HTTP:`", "`FTP:`" and has an extensible mechanism, ``URLStreamHandlerFactory``, for customizing access to an arbitrary URI. See more details at http://java.sun.com/j2se/1.5.0/docs/api/java/net/URLStreamHandlerFactory.html .


 [[ugr.tug.aas.sofa_fs]]
 == The Sofa Feature Structure

 Information about a Sofa is contained in a special built-in Feature Structure of type ``uima.cas.Sofa``.
 This feature structure is created and managed by the UIMA Framework; users must not create it directly.
 Although these Sofa type instances are implemented as standard feature structures, __generic CAS APIs can not be used to create Sofas or set their features__.
 Instead, Sofas are created implicitly by the creation of new CAS views.
 Similarly, Sofa features are set by CAS methods such as ``cas.setDocumentText()``.

 Features of the Sofa type include

 * SofaID: Every Sofa in a CAS has a unique SofaID. SofaIDs are the primary handle for access. This ID is often the same as the name string given to the Sofa by the developer, but it can be see xref:tug.adoc#ugr.tug.mvs.sofa_name_mapping[mapped to a different name].
 * Mime type: This string feature can be used to describe the type of the data represented by a Sofa. It is not used by the framework; the framework provides APIs to set and get its value.
 * Sofa Data: The Sofa data itself. This data can be resident in the CAS or it can be a reference to data outside the CAS.


 [[ugr.tug.aas.annotations]]
 == Annotations

 Annotators add meta data about a Sofa to the CAS.
 It is often useful to have this metadata denote a region of the Sofa to which it applies.
 For instance, assuming the Sofa is a String, the metadata might describe a particular substring as the name of a person.
 The built-in UIMA type, uima.tcas.Annotation, has two extra features that enable this - the begin and end features - which denote a character position offset into the text string being analyzed.

 The concept of "`annotations`" can be generalized for non-string kinds of Sofas.
 For instance, an audio stream might have an audio annotation which describes sounds regions in terms of floating point time offsets in the Sofa.
 An image annotation might use two pairs of x,y coordinates to define the region the annotation applies to.

 [[ugr.tug.aas.built_in_annotation_types]]
 === Built-in Annotation types

 The built-in CAS type, ``uima.tcas.Annotation``, is just one kind of definition of an Annotation.
 It was designed for annotating text strings, and has begin and end features which describe which substring of the Sofa being annotated.

 For applications which have other kinds of Sofas, the UIMA developer will design their own kinds of Annotation types, as needed to describe an annotation, by declaring new types which are subtypes of ``uima.cas.AnnotationBase``.
 For instance, for images, you might have the concept of a rectangular region to which the annotation applies.
 In this case, you might describe the region with 2 pairs of x, y coordinates.

 [[ugr.tug.aas.annotations_associated_sofa]]
 === Annotations have an associated Sofa

 Annotations are always associated with a particular Sofa.
 In subsequent chapters, you will learn how there can be multiple Sofas associated with an artifact; which Sofa an annotation refers to is described by the Annotation feature structure itself.

 All annotation types extend from the built-in type uima.cas.AnnotationBase.
 This type has one feature, a reference to the Sofa associated with the annotation.
 This value is currently used by the Framework to support the getCoveredText() method on the annotation instance - this returns the portion of a text Sofa that the annotation spans.
 It also is used to insure that the Annotation is indexed only in the CAS View associated with this Sofa.

 [[ugr.tug.aas.annotationbase]]
 == AnnotationBase

 A built-in type, ``uima.cas.AnnotationBase``, is provided by UIMA to allow users to extend the Annotation capabilities to different kinds of Annotations.
 The `AnnotationBase` type has one feature, named ``sofa``, which holds a reference to the `Sofa` feature structure with which this annotation is associated.
 The `sofa` feature is automatically set when creating an annotation  (meaning — any type derived from the built-in `uima.cas.AnnotationBase` type); it should not be set by the user.

 There is one method, ``getView``(), provided by `AnnotationBase` that returns the CAS View for the Sofa the annotation is pointing at.
 Note that this method always returns a CAS, even when applied to JCas annotation instances.

 The built-in type `uima.tcas.Annotation` extends `uima.cas.AnnotationBase` and adds two features, a begin and an end feature, which are suitable for identifying a span in a text string that the annotation applies to.
 Users may define other extensions to `AnnotationBase` with alternative specifications that can denote a particular region within the subject of analysis, as appropriate to their application.
	// Licensed to the Apache Software Foundation (ASF) under one
	// or more contributor license agreements. See the NOTICE file
	// distributed with this work for additional information
	// regarding copyright ownership. The ASF licenses this file
	// to you under the Apache License, Version 2.0 (the
	// "License"); you may not use this file except in compliance
	// with the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing,
	// software distributed under the License is distributed on an
	// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	// KIND, either express or implied. See the License for the
	// specific language governing permissions and limitations
	// under the License.

	[[ugr.tug.aas]]
	= Annotations, Artifacts, and Sofas
	// <titleabbrev>Annotations, Artifacts & Sofas</titleabbrev>

	Up to this point, the documentation has focused on analyzing strings of Unicode text, producing subtypes of Annotations which reference offsets in those strings.
	This chapter generalizes this concept and shows how other kinds of artifacts can be handled, including non-text things like audio and images, and how you can define your own kinds of "`annotations`" for these.

	[[ugr.tug.aas.terminology]]
	== Terminology

	[[ugr.tug.aas.artifact]]
	=== Artifact

	The Artifact is the unstructured thing being analyzed by an annotator.
	It could be an HTML web page, an image, a video stream, a recorded audio conversation, an MPEG-4 stream, etc.
	Artifacts are often restructured in the course of processing to facilitate particular kinds of analysis.
	For instance, an HTML page may be converted into a "`de-tagged`" version.
	Annotators at different places in the pipeline may be analyzing different versions of the artifact.

	[[ugr.tug.aas.sofa]]
	=== Subject of Analysis — Sofa

	Each representation of an Artifact is called a Subject of Analysis, abbreviated using the acronym "`Sofa`" which stands for __S__ubject _
	OF___A__nalysis.
	Annotation metadata, which have explicit designations of sub-regions of the artifact to which they apply, are always associated with a particular Sofa.
	For instance, an annotation over text specifies two features, the begin and end, which represent the character offsets into the text string Sofa being analyzed.

	Other examples of representations of Artifacts, which could be Sofas include: An HTML web page, a detagged web page, the translated text of that document, an audio or video stream, closed-caption text from a video stream, etc.

	Often, there is one Sofa being analyzed in a CAS.
	The next chapter will show how UIMA facilitates working with multiple representations of an artifact at the same time, in the same CAS.

	[[ugr.tug.aas.sofa_data_formats]]
	== Formats of Sofa Data

	Sofa data can be Java Unicode Strings, Feature Structure arrays of primitive types, or a URI which references remote data available via a network connection.

	The arrays of primitive types can be things like byte arrays or float arrays, and are intended to be used for artifacts like audio data, image data, etc.

	The URI form holds a URI specification String.

	[NOTE]
	====
	Sofa data can be "serialized" using an XML format; when it is, the String data being serialized must not include invalid XML characters.
	See <<ugr.tug.xmi_emf.xml_character_issues>>.
	====

	[[ugr.tug.aas.setting_accessing_sofa_data]]
	== Setting and Accessing Sofa Data

	[[ugr.tug.aas.setting_sofa_data]]
	=== Setting Sofa Data

	When a CAS is created, you can set its Sofa Data, just one time; this property insures that metadata describing regions of the Sofa remain valid.
	As a consequence, the following methods that set data for a given Sofa can only be called once for a given Sofa.

	The following methods on the CAS set the Sofa Data to one of the 3 formats.
	Assume that the variable "`aCas`" holds a reference to a CAS:

	[source]
	----
	aCas.setSofaDataString(document_text_string, mime_type_string);
	aCas.setSofaDataArray(feature_structure_primitive_array, mime_type_string);
	aCas.setSofaDataURI(uri_string, mime_type_string);
	----

	In addition, the method `aCas.setDocumentText(document_text_string)` may still be used, and is equivalent to ``setSofaDataString(string,
	"text")``.
	The mime type is currently not used by the UIMA framework, but may be set and retrieved by user code.

	Feature Structure primitive arrays are all the UIMA Array types except arrays of Feature Structures, Strings, and Booleans.
	Typically, these are arrays of bytes, but can be other types, such as floats, longs, etc.

	The URI string should conform to the standard URI format.

	[[ugr.tug.aas.accessing_sofa_data]]
	=== Accessing Sofa Data

	The analysis algorithms typically work with the Sofa data.
	The following methods on the CAS access the Sofa Data:

	[source]
	----
	String aCas.getDocumentText();
	String aCas.getSofaDataString();
	FeatureStructure aCas.getSofaDataArray();
	String aCas.getSofaDataURI();
	String aCas.getSofaMimeType();
	----

	The `getDocumentText` and `getSofaDataString` return the same text string.
	The `getSofaDataURI` returns the URI itself, not the data the URI is pointing to.
	You can use standard Java I/O capabilities to get the data associated with the URI, or use the UIMA Framework Streaming method described next.

	[[ugr.tug.aas.accessing_sofa_data_using_java_stream]]
	=== Accessing Sofa Data using a Java Stream

	The framework provides a consistent method for accessing the Sofa data, independent of it being stored locally, or accessed remotely using the URI.
	Get a Java InputStream instance from the Sofa data using:

	[source]
	----
	InputStream inputStream = aCas.getSofaDataStream();
	----

	* If the data is local, this method returns a ByteArrayInputStream. This stream provides bytes.
	+
	** If the Sofa data was set using setDocumentText or setSofaDataString, the String is converted to bytes by using the UTF-8 encoding.
	** If the Sofa data was set as a DataArray, the bytes in the data array are serialized, high-byte first.
	* If the Sofa data was specified as a URI, this method returns the handle from url.openStream(). Java offers built-in support for several URI schemes including "`FILE:`", "`HTTP:`", "`FTP:`" and has an extensible mechanism, ``URLStreamHandlerFactory``, for customizing access to an arbitrary URI. See more details at http://java.sun.com/j2se/1.5.0/docs/api/java/net/URLStreamHandlerFactory.html .


	[[ugr.tug.aas.sofa_fs]]
	== The Sofa Feature Structure

	Information about a Sofa is contained in a special built-in Feature Structure of type ``uima.cas.Sofa``.
	This feature structure is created and managed by the UIMA Framework; users must not create it directly.
	Although these Sofa type instances are implemented as standard feature structures, __generic CAS APIs can not be used to create Sofas or set their features__.
	Instead, Sofas are created implicitly by the creation of new CAS views.
	Similarly, Sofa features are set by CAS methods such as ``cas.setDocumentText()``.

	Features of the Sofa type include

	* SofaID: Every Sofa in a CAS has a unique SofaID. SofaIDs are the primary handle for access. This ID is often the same as the name string given to the Sofa by the developer, but it can be see xref:tug.adoc#ugr.tug.mvs.sofa_name_mapping[mapped to a different name].
	* Mime type: This string feature can be used to describe the type of the data represented by a Sofa. It is not used by the framework; the framework provides APIs to set and get its value.
	* Sofa Data: The Sofa data itself. This data can be resident in the CAS or it can be a reference to data outside the CAS.


	[[ugr.tug.aas.annotations]]
	== Annotations

	Annotators add meta data about a Sofa to the CAS.
	It is often useful to have this metadata denote a region of the Sofa to which it applies.
	For instance, assuming the Sofa is a String, the metadata might describe a particular substring as the name of a person.
	The built-in UIMA type, uima.tcas.Annotation, has two extra features that enable this - the begin and end features - which denote a character position offset into the text string being analyzed.

	The concept of "`annotations`" can be generalized for non-string kinds of Sofas.
	For instance, an audio stream might have an audio annotation which describes sounds regions in terms of floating point time offsets in the Sofa.
	An image annotation might use two pairs of x,y coordinates to define the region the annotation applies to.

	[[ugr.tug.aas.built_in_annotation_types]]
	=== Built-in Annotation types

	The built-in CAS type, ``uima.tcas.Annotation``, is just one kind of definition of an Annotation.
	It was designed for annotating text strings, and has begin and end features which describe which substring of the Sofa being annotated.

	For applications which have other kinds of Sofas, the UIMA developer will design their own kinds of Annotation types, as needed to describe an annotation, by declaring new types which are subtypes of ``uima.cas.AnnotationBase``.
	For instance, for images, you might have the concept of a rectangular region to which the annotation applies.
	In this case, you might describe the region with 2 pairs of x, y coordinates.

	[[ugr.tug.aas.annotations_associated_sofa]]
	=== Annotations have an associated Sofa

	Annotations are always associated with a particular Sofa.
	In subsequent chapters, you will learn how there can be multiple Sofas associated with an artifact; which Sofa an annotation refers to is described by the Annotation feature structure itself.

	All annotation types extend from the built-in type uima.cas.AnnotationBase.
	This type has one feature, a reference to the Sofa associated with the annotation.
	This value is currently used by the Framework to support the getCoveredText() method on the annotation instance - this returns the portion of a text Sofa that the annotation spans.
	It also is used to insure that the Annotation is indexed only in the CAS View associated with this Sofa.

	[[ugr.tug.aas.annotationbase]]
	== AnnotationBase

	A built-in type, ``uima.cas.AnnotationBase``, is provided by UIMA to allow users to extend the Annotation capabilities to different kinds of Annotations.
	The `AnnotationBase` type has one feature, named ``sofa``, which holds a reference to the `Sofa` feature structure with which this annotation is associated.
	The `sofa` feature is automatically set when creating an annotation (meaning — any type derived from the built-in `uima.cas.AnnotationBase` type); it should not be set by the user.

	There is one method, ``getView``(), provided by `AnnotationBase` that returns the CAS View for the Sofa the annotation is pointing at.
	Note that this method always returns a CAS, even when applied to JCas annotation instances.

	The built-in type `uima.tcas.Annotation` extends `uima.cas.AnnotationBase` and adds two features, a begin and an end feature, which are suitable for identifying a span in a text string that the annotation applies to.
	Users may define other extensions to `AnnotationBase` with alternative specifications that can denote a particular region within the subject of analysis, as appropriate to their application.