<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.ref.xmi"> | |
<title>XMI CAS Serialization Reference</title> | |
<para>This is the specification for the mapping of the UIMA CAS into the XMI (XML Metadata | |
Interchange<footnote><para> For details on XMI see Grose et al. <emphasis>Mastering | |
XMI. Java Programming with XMI, XML, and UML. </emphasis>John Wiley &amp; Sons, Inc.
2002.</para></footnote>) format. XMI is an OMG standard for expressing object graphs in | |
XML. The UIMA SDK provides support for XMI through the classes | |
<literal>org.apache.uima.cas.impl.XmiCasSerializer</literal> and | |
<literal>org.apache.uima.cas.impl.XmiCasDeserializer</literal>.</para> | |
<section id="ugr.ref.xmi.xmi_tag"> | |
<title>XMI Tag</title> | |
<para>The outermost tag is &lt;XMI&gt; and must include a version number and XML
namespace attribute:
<programlisting><![CDATA[<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI">
<!-- CAS Contents here -->
</xmi:XMI>]]></programlisting></para>
<para>XML namespaces<footnote><para>http://www.w3.org/TR/xml-names11/</para> | |
</footnote> are used throughout. The <quote>xmi</quote> namespace prefix is used to | |
identify elements and attributes that are defined by the XMI specification. The XMI | |
document will also define one namespace prefix for each CAS namespace, as described in | |
the next section.</para> | |
</section> | |
<section id="ugr.ref.xmi.feature_structures"> | |
<title>Feature Structures</title> | |
<para>UIMA Feature Structures are mapped to XML elements. The name of the element is | |
formed from the CAS type name, making use of XML namespaces as follows.</para> | |
<para>The CAS type namespace is converted to an XML namespace URI by the following rule: | |
replace all dots with slashes, prepend http:///, and append .ecore.</para> | |
<para>This mapping was chosen because it is the default mapping used by the Eclipse | |
Modeling Framework (EMF)<footnote><para> For details on EMF and Ecore see Budinsky et | |
al. <emphasis>Eclipse Modeling Framework 2.0</emphasis>. Addison-Wesley. | |
2006.</para></footnote> to create namespace URIs from Java package names. The use of | |
the http scheme is a common convention, and does not imply any HTTP communication. The | |
.ecore suffix is due to the fact that the recommended type system definition for a | |
namespace is an ECore model, see <olink targetdoc="&uima_docs_tutorial_guides;" | |
targetptr="ugr.tug.xmi_emf"/>.</para> | |
<para>Consider the CAS type name <quote>org.myproj.Foo</quote>. The CAS namespace
(<quote>org.myproj</quote>) is converted to the XML namespace URI
http:///org/myproj.ecore.</para>
<para>The XML element name is then formed by concatenating the XML namespace prefix
(an arbitrary token, though by convention the last component of the CAS
namespace), a colon, and the type name (excluding the namespace).</para>
<para>So the example <quote>org.myproj.Foo</quote> FeatureStructure is written to
XMI as:
<programlisting><![CDATA[<xmi:XMI
xmi:version="2.0"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1"/>
...
</xmi:XMI>]]></programlisting></para>
<para>The xmi:id attribute is only required if this object will be referred to from
elsewhere in the XMI document. If provided, the xmi:id must be unique within the
XMI document.</para>
<para>All namespace prefixes (e.g. <quote>myproj</quote>) in this example must be | |
bound to URIs using the <quote>xmlns...</quote> attribute, as defined by the XML | |
namespaces specification.</para> | |
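<para>The naming rules above can be sketched in a few lines of Python. This is only an
illustration and is not part of the UIMA SDK: the function names
<literal>cas_namespace_to_uri</literal> and <literal>xml_element_name</literal> are
hypothetical, the prefix is chosen by the convention mentioned above (last component of the
CAS namespace), and type names without a namespace are not handled.</para>
<programlisting language="python"><![CDATA[def cas_namespace_to_uri(cas_type_name):
    """Map the namespace of a fully qualified CAS type name to an XML namespace URI."""
    # Split "org.myproj.Foo" into namespace "org.myproj" and short name "Foo".
    namespace, _, _short_name = cas_type_name.rpartition(".")
    # Rule from the text: dots become slashes, prepend http:///, append .ecore.
    return "http:///" + namespace.replace(".", "/") + ".ecore"


def xml_element_name(cas_type_name):
    """Build the prefixed XML element name, e.g. "myproj:Foo"."""
    namespace, _, short_name = cas_type_name.rpartition(".")
    # Assumption: use the last namespace component as the prefix (any token would do).
    prefix = namespace.rpartition(".")[2]
    return prefix + ":" + short_name


print(cas_namespace_to_uri("org.myproj.Foo"))
print(xml_element_name("org.myproj.Foo"))]]></programlisting>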
</section> | |
<section id="ugr.ref.xmi.primitive_features"> | |
<title>Primitive Features</title> | |
<para>CAS features of primitive types (String, Boolean, Byte, Short, Integer, Long,
Float, or Double) can be mapped either to XML attributes or XML elements. For example, a | |
CAS FeatureStructure of type org.myproj.Foo, with features: | |
<programlisting>begin = 14 | |
end = 19 | |
myFeature = "bar"</programlisting> | |
could be mapped to: | |
<programlisting><![CDATA[<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
...
</xmi:XMI>]]></programlisting>
or equivalently: | |
<programlisting><![CDATA[<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" | |
xmlns:myproj="http:///org/myproj.ecore"> | |
... | |
<myproj:Foo xmi:id="1"> | |
<begin>14</begin> | |
<end>19</end> | |
<myFeature>bar</myFeature> | |
</myproj:Foo> | |
... | |
</xmi:XMI>]]></programlisting></para> | |
<para>The attribute serialization is preferred for compactness, but either
representation is valid. The two styles can also be mixed: some features can be
represented as attributes and others as elements.</para>
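<para>A reader that accepts both styles might look like the following Python sketch, which
uses only the standard <literal>xml.etree.ElementTree</literal> module. It is an
illustration rather than the UIMA deserializer: <literal>read_features</literal> and
<literal>first_feature_structure</literal> are hypothetical helpers, and all values are
returned as strings because converting them requires the type system.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

ATTRIBUTE_FORM = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
</xmi:XMI>"""

MIXED_FORM = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1" begin="14">
    <end>19</end>
    <myFeature>bar</myFeature>
  </myproj:Foo>
</xmi:XMI>"""


def read_features(fs_element):
    """Collect primitive features from attributes and from child elements."""
    features = {}
    for name, value in fs_element.attrib.items():
        # ElementTree writes namespaced attributes such as xmi:id as "{uri}id"; skip them.
        if not name.startswith("{"):
            features[name] = value
    for child in fs_element:
        features[child.tag] = child.text
    return features


def first_feature_structure(document):
    """Return the first child of the xmi:XMI root element."""
    return ET.fromstring(document)[0]


print(read_features(first_feature_structure(ATTRIBUTE_FORM)))
print(read_features(first_feature_structure(MIXED_FORM)))]]></programlisting>
<para>Both calls yield the same feature names and values, which is the sense in which the
two serializations are equivalent.</para>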
</section> | |
<section id="ugr.ref.xmi.reference_features"> | |
<title>Reference Features</title> | |
<para>CAS features that are references to other feature structures (excluding arrays | |
and lists, which are handled separately) are serialized as ID references.</para> | |
<para>If we add to the previous CAS example a feature structure of type org.myproj.Baz, | |
with feature <quote>myFoo</quote> that is a reference to the Foo object, the | |
serialization would be: | |
<programlisting><![CDATA[<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" | |
xmlns:myproj="http:///org/myproj.ecore"> | |
... | |
<myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/> | |
<myproj:Baz xmi:id="2" myFoo="1"/> | |
... | |
</xmi:XMI>]]></programlisting></para> | |
<para>As with primitive-valued features, it is permitted to use an element rather than an | |
attribute. However, the syntax is slightly different:</para> | |
<programlisting><![CDATA[<myproj:Baz xmi:id="2">
<myFoo href="#1"/>
</myproj:Baz>]]></programlisting>
<para>Note that in the attribute representation, a reference feature is | |
indistinguishable from an integer-valued feature, so the meaning cannot be | |
determined without prior knowledge of the type system. The element representation is | |
unambiguous.</para> | |
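<para>The following Python sketch shows one way a reader could resolve a reference feature
in either form. It is an assumption-laden illustration, not UIMA code: the helper
<literal>resolve_reference</literal> is hypothetical, and the caller must already know from
the type system that <quote>myFoo</quote> is a reference, as noted above.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"

DOC = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
  <myproj:Baz xmi:id="2" myFoo="1"/>
  <myproj:Baz xmi:id="3">
    <myFoo href="#1"/>
  </myproj:Baz>
</xmi:XMI>"""


def resolve_reference(root, owner, feature):
    """Return the element that the reference feature of `owner` points to."""
    by_id = {el.get(XMI_ID): el for el in root if el.get(XMI_ID) is not None}
    child = owner.find(feature)
    if child is not None:
        # Element form: href="#1" carries the target id after the '#'.
        target_id = child.get("href").lstrip("#")
    else:
        # Attribute form: the attribute value is the target id itself.
        target_id = owner.get(feature)
    return by_id[target_id]


root = ET.fromstring(DOC)
for baz in (root[1], root[2]):
    print(resolve_reference(root, baz, "myFoo").get("myFeature"))]]></programlisting>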
</section> | |
<section id="ugr.ref.xmi.array_and_list_features"> | |
<title>Array and List Features</title> | |
<para>For a CAS feature whose range type is one of the CAS array or list types, the XMI serialization depends on the | |
setting of the <quote>multipleReferencesAllowed</quote> attribute for that feature in the UIMA Type System | |
Description (see <olink targetdoc="&uima_docs_ref;" | |
targetptr="ugr.ref.xml.component_descriptor.type_system.features"/>).</para> | |
<para>An array or list with multipleReferencesAllowed = false (the default) is serialized as a | |
<quote>multi-valued</quote> property in XMI. An array or list with multipleReferencesAllowed = true is | |
serialized as a first-class object. Details are described below.</para> | |
<section id="ugr.ref.xmi.array_and_list_features.as_multi_valued_properties"> | |
<title>Arrays and Lists as Multi-Valued Properties</title> | |
<para>For most cases, a multi-valued property is the most natural XMI representation. Consider the
example where the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the
integer array {2,4,6}. This can be mapped to:
<programlisting><![CDATA[<myproj:Baz xmi:id="3" myIntArray="2 4 6"/>]]></programlisting> or
equivalently:
<programlisting><![CDATA[<myproj:Baz xmi:id="3">
<myIntArray>2</myIntArray>
<myIntArray>4</myIntArray>
<myIntArray>6</myIntArray>
</myproj:Baz>]]></programlisting>
</para>
<para>Note that String arrays whose elements contain embedded spaces MUST use the child-element mapping,
because the attribute mapping uses spaces to separate the values.</para>
<para>FSArray or FSList features are serialized in a similar way. For example, an FSArray feature that contains
references to the elements with xmi:id values <quote>13</quote> and <quote>42</quote> could be
serialized as:
<programlisting><![CDATA[<myproj:Baz xmi:id="3" myFsArray="13 42"/>]]></programlisting> or:
<programlisting><![CDATA[<myproj:Baz xmi:id="3">
<myFsArray href="#13"/>
<myFsArray href="#42"/>
</myproj:Baz>]]></programlisting>
</para>
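<para>As a rough illustration of the two multi-valued forms, the Python sketch below reads a
feature from either a space-separated attribute or repeated child elements. The helper
<literal>read_multi_valued</literal> and the feature name <literal>myStrings</literal> are
hypothetical and not part of UIMA; values are returned as strings.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

NS = 'xmlns:xmi="http://www.omg.org/XMI" xmlns:myproj="http:///org/myproj.ecore"'

AS_ATTRIBUTE = ET.fromstring('<myproj:Baz %s xmi:id="3" myIntArray="2 4 6"/>' % NS)
AS_ELEMENTS = ET.fromstring(
    '<myproj:Baz %s xmi:id="3">'
    "<myIntArray>2</myIntArray><myIntArray>4</myIntArray><myIntArray>6</myIntArray>"
    "</myproj:Baz>" % NS
)
# A String array with an embedded space can only be written with child elements.
STRINGS = ET.fromstring(
    '<myproj:Baz %s xmi:id="3">'
    "<myStrings>hello world</myStrings><myStrings>bye</myStrings>"
    "</myproj:Baz>" % NS
)


def read_multi_valued(fs_element, feature):
    """Return the values of a multi-valued feature as a list of strings."""
    children = fs_element.findall(feature)
    if children:
        values = []
        for child in children:
            href = child.get("href")
            # Reference elements carry href="#id"; primitive elements carry text.
            values.append(href.lstrip("#") if href is not None else child.text)
        return values
    value = fs_element.get(feature)
    return value.split() if value is not None else []


print(read_multi_valued(AS_ATTRIBUTE, "myIntArray"))
print(read_multi_valued(AS_ELEMENTS, "myIntArray"))
print(read_multi_valued(STRINGS, "myStrings"))]]></programlisting>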
</section> | |
<section id="ugr.ref.xmi.array_and_list_features.as_1st_class_objects"> | |
<title>Arrays and Lists as First-Class Objects</title> | |
<para>The multi-valued-property representation described in the previous section does not allow multiple | |
references to an array or list object. Therefore, it cannot be used for features that are defined to allow | |
multiple references (i.e. features for which multipleReferencesAllowed = true in the Type System | |
Description).</para> | |
<para>When multipleReferencesAllowed is set to true, array and list features are serialized as references, | |
and the array or list objects are serialized as separate objects in the XMI. Consider again the example where | |
the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the integer array | |
{2,4,6}. If myIntArray is defined with multipleReferencesAllowed=true, the serialization will be as | |
follows: | |
<programlisting><![CDATA[<myproj:Baz xmi:id="3" myIntArray="4"/>]]></programlisting> or:
<programlisting><![CDATA[<myproj:Baz xmi:id="3">
<myIntArray href="#4"/>
</myproj:Baz>]]></programlisting>
with the array object serialized as
<programlisting><![CDATA[<cas:IntegerArray xmi:id="4" elements="2 4 6"/>]]></programlisting> or:
<programlisting><![CDATA[<cas:IntegerArray xmi:id="4">
<elements>2</elements>
<elements>4</elements>
<elements>6</elements>
</cas:IntegerArray>]]></programlisting></para>
<para>Note that in this case, the XML element name is formed from the CAS type name (e.g. | |
<quote><literal>uima.cas.IntegerArray</literal></quote>) in the same way as for other | |
FeatureStructures. The elements of the array are serialized either as a space-separated attribute named | |
<quote>elements</quote> or as a series of child elements named <quote>elements</quote>.</para> | |
<para>List nodes are just standard FeatureStructures with <quote>head</quote> and <quote>tail</quote> | |
features, and are serialized using the normal FeatureStructure serialization. For example, an | |
IntegerList with the values 2, 4, and 6 would be serialized as the four objects: | |
<programlisting><![CDATA[<cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/>
<cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/>
<cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/>
<cas:EmptyIntegerList xmi:id="13"/>]]></programlisting></para>
<para>This representation of arrays and lists allows multiple references to an array or list. It also allows a feature
with range type TOP to refer to an array or list. However, it is a very unnatural representation in XMI and does
not support interoperability with other XMI-based systems, so we recommend using the
multi-valued-property representation described in the previous section whenever possible.</para>
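<para>To illustrate how list nodes chain together, the following Python sketch follows the
<quote>tail</quote> references of an IntegerList serialized as first-class objects. It is
a minimal sketch that assumes the list is acyclic and uses the attribute form; the helper
<literal>list_values</literal> is hypothetical.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"

DOC = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore">
  <cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/>
  <cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/>
  <cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/>
  <cas:EmptyIntegerList xmi:id="13"/>
</xmi:XMI>"""


def list_values(root, first_id):
    """Walk head/tail nodes starting at `first_id` and collect the integer heads."""
    by_id = {el.get(XMI_ID): el for el in root}
    values = []
    node = by_id[first_id]
    # Stop at the EmptyIntegerList node, which has no head or tail.
    while node.tag.endswith("}NonEmptyIntegerList"):
        values.append(int(node.get("head")))
        node = by_id[node.get("tail")]
    return values


print(list_values(ET.fromstring(DOC), "10"))]]></programlisting>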
</section> | |
<section id="ugr.ref.xmi.null_array_list_elements"> | |
<title>Null Array/List Elements</title> | |
<para>In UIMA, an element of an FSArray or FSList may be null. In XMI, multi-valued properties do not permit null | |
values. As a workaround for this, we use a dummy instance of the special type cas:NULL, which has xmi:id 0. | |
In the following example, the <quote>myFsArray</quote> feature refers to an FSArray whose
second element is null:
<programlisting><![CDATA[<cas:NULL xmi:id="0"/>
<myproj:Baz xmi:id="3">
<myFsArray href="#13"/>
<myFsArray href="#0"/>
<myFsArray href="#42"/>
</myproj:Baz>]]></programlisting></para>
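<para>A reader can map references to the dummy object back to null. The Python sketch below
shows the idea under the convention stated above (xmi:id 0 is reserved for
<literal>cas:NULL</literal>); the helper <literal>fs_array_ids</literal> is hypothetical and
not part of UIMA.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

DOC = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore" xmlns:myproj="http:///org/myproj.ecore">
  <cas:NULL xmi:id="0"/>
  <myproj:Baz xmi:id="3">
    <myFsArray href="#13"/>
    <myFsArray href="#0"/>
    <myFsArray href="#42"/>
  </myproj:Baz>
</xmi:XMI>"""

NULL_ID = "0"  # xmi:id of the cas:NULL dummy instance


def fs_array_ids(fs_element, feature):
    """Return the referenced xmi:id values, with None for null elements."""
    ids = []
    for child in fs_element.findall(feature):
        target = child.get("href").lstrip("#")
        ids.append(None if target == NULL_ID else target)
    return ids


baz = ET.fromstring(DOC)[1]
print(fs_array_ids(baz, "myFsArray"))]]></programlisting>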
</section> | |
</section> | |
<section id="ugr.ref.xmi.sofas_views"> | |
<title>Subjects of Analysis (Sofas) and Views</title> | |
<para>A UIMA CAS contains one or more subjects of analysis (Sofas). These are serialized no
differently from any other feature structure. For example:
<programlisting><![CDATA[<?xml version="1.0"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:cas="http:///uima/cas.ecore">
<cas:Sofa xmi:id="1" sofaNum="1"
text="the quick brown fox jumps over the lazy dog."/>
</xmi:XMI>]]></programlisting></para>
<para>Each Sofa defines a separate View. Feature Structures in the CAS can be members of | |
one or more views. (A Feature Structure that is a member of a view is indexed in its | |
IndexRepository, but that is an implementation detail.)</para> | |
<para>In the XMI serialization, views are represented as first-class objects. Each
View has an (optional) <quote>sofa</quote> feature, which references a sofa, and
a multi-valued reference to the members of the View. For example:</para>
<programlisting><![CDATA[<cas:View sofa="1" members="3 7 21 39 61"/>]]></programlisting>
<para>Here the integers 3, 7, 21, 39, and 61 refer to the xmi:id fields of the objects that | |
are members of this view.</para> | |
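<para>The view membership can be read back with a few lines of Python, sketched below. This
is an illustration only: <literal>view_members</literal> is a hypothetical helper, and a View
without a <quote>sofa</quote> feature is keyed by <literal>None</literal>.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

CAS_VIEW = "{http:///uima/cas.ecore}View"

DOC = """<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore">
  <cas:Sofa xmi:id="1" sofaNum="1"
      text="the quick brown fox jumps over the lazy dog."/>
  <cas:View sofa="1" members="3 7 21 39 61"/>
</xmi:XMI>"""


def view_members(root):
    """Map each view's sofa id to the list of member xmi:id values."""
    return {
        view.get("sofa"): view.get("members", "").split()
        for view in root.iter(CAS_VIEW)
    }


print(view_members(ET.fromstring(DOC)))]]></programlisting>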
</section> | |
<section id="ugr.ref.xmi.linking_to_ecore_type_system"> | |
<title>Linking an XMI Document to its Ecore Type System</title> | |
<titleabbrev>Linking XMI docs to Ecore Type System</titleabbrev> | |
<para>If the CAS Type System has been saved to an Ecore file (as described in <olink | |
targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.xmi_emf"/>), it is possible to store a | |
link from an XMI document to that Ecore type system. This is done using an xsi:schemaLocation attribute | |
on the root XMI element.</para> | |
<para>The xsi:schemaLocation attribute is a space-separated list that represents a | |
mapping from namespace URI (e.g. http:///org/myproj.ecore) to the physical URI of the | |
.ecore file containing the type system for that namespace. For example: | |
<programlisting>xsi:schemaLocation= | |
"http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"</programlisting> | |
would indicate that the definition for the org.myproj CAS types is contained in the file | |
<literal>c:/typesystems/myproj.ecore</literal>. You can specify a different | |
mapping for each of your CAS namespaces, using a space-separated list. For details see
Budinsky et al. <emphasis>Eclipse Modeling Framework</emphasis>.</para> | |
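<para>Because the attribute value is a flat list of alternating namespace URIs and file
locations, it can be split into pairs as in the following Python sketch. The function
<literal>parse_schema_location</literal> is hypothetical, and the second pair in the usage
example is invented purely to show multiple mappings.</para>
<programlisting language="python"><![CDATA[def parse_schema_location(value):
    """Split an xsi:schemaLocation value into {namespace URI: .ecore location}."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("schemaLocation must contain namespace/location pairs")
    # Even positions are namespace URIs, odd positions are the matching locations.
    return dict(zip(tokens[0::2], tokens[1::2]))


mapping = parse_schema_location(
    "http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore "
    "http:///org/other.ecore file:/c:/typesystems/other.ecore"  # invented second pair
)
print(mapping["http:///org/myproj.ecore"])]]></programlisting>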
</section> | |
<section id="ugr.ref.xmi.delta"> | |
<title>Delta CAS XMI Format</title> | |
<titleabbrev>Delta CAS XMI Format</titleabbrev> | |
<para> | |
The Delta CAS XMI serialization format is designed primarily to reduce the serialization overhead when calling annotators
configured as services. Only Feature Structures and Views that are new or modified by the service | |
are serialized and returned by the service. | |
</para> | |
<para> | |
The classes <literal>org.apache.uima.cas.impl.XmiCasSerializer</literal> and | |
<literal>org.apache.uima.cas.impl.XmiCasDeserializer</literal> support serialization of only the modifications to the CAS. | |
A caller is expected to set a marker to indicate the point from which changes to the CAS are to be tracked. | |
</para> | |
<para> | |
A Delta CAS XMI document contains only the Feature Structures and Views that have been added or modified. | |
The new and modified Feature Structures are represented in exactly the same format as in a complete CAS serialization.
The <literal>cas:View</literal> element has been extended with three additional attributes to represent modifications to
View membership. These new attributes are <literal>added_members</literal>, <literal>deleted_members</literal> and
<literal>reindexed_members</literal>. For example:
</para>
<programlisting><![CDATA[<cas:View sofa="1" added_members="63 77"
deleted_members="7 61" reindexed_members="39" />]]></programlisting>
<para> | |
Here the integers 63 and 77 are the xmi:id values of objects that have been newly added to this View,
7 and 61 are the xmi:id values of objects that have been removed from this View, and 39 is the xmi:id of an object to be reindexed in this View.
</para> | |
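<para>One way a receiver might interpret these attributes is sketched below in Python. This
is not how the UIMA deserializer is implemented; <literal>apply_view_delta</literal> is a
hypothetical helper that updates a plain list of member ids and reports which members need
reindexing.</para>
<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

DELTA_VIEW = ET.fromstring(
    '<cas:View xmlns:cas="http:///uima/cas.ecore" sofa="1" '
    'added_members="63 77" deleted_members="7 61" reindexed_members="39"/>'
)


def apply_view_delta(members, view_delta):
    """Return (updated member ids, ids to reindex) after applying a delta View."""
    added = view_delta.get("added_members", "").split()
    deleted = set(view_delta.get("deleted_members", "").split())
    reindexed = view_delta.get("reindexed_members", "").split()
    # Drop removed members, keep the rest in order, then append the new ones.
    updated = [m for m in members if m not in deleted] + added
    return updated, reindexed


before = ["3", "7", "21", "39", "61"]
print(apply_view_delta(before, DELTA_VIEW))]]></programlisting>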
</section> | |
</chapter> |