blob: 0374123189191a94814e70234c0ff898b297a81c [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
[[ugr.ref.xmi]]
= XMI CAS Serialization Reference
This is the specification for the mapping of the UIMA CAS into the XMI (XML Metadata Interchangefootnote:[For details on XMI see Grose et al. Mastering
XMI. Java Programming with XMI, XML, and UML. John Wiley & Sons, Inc.
2002.]) format.
XMI is an OMG standard for expressing object graphs in XML.
The UIMA SDK provides support for XMI through the classes `org.apache.uima.cas.impl.XmiCasSerializer` and ``org.apache.uima.cas.impl.XmiCasDeserializer``.
[[ugr.ref.xmi.xmi_tag]]
== XMI Tag
The outermost tag is <XMI> and must include a version number and XML namespace attribute:
[source]
----
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI">
<!-- CAS Contents here -->
</xmi:XMI>
----
XML namespacesfootnote:[http://www.w3.org/TR/xml-names11/] are used throughout.
The "`xmi`" namespace prefix is used to identify elements and attributes that are defined by the XMI specification.
The XMI document will also define one namespace prefix for each CAS namespace, as described in the next section.
[[ugr.ref.xmi.feature_structures]]
== Feature Structures
UIMA Feature Structures are mapped to XML elements.
The name of the element is formed from the CAS type name, making use of XML namespaces as follows.
The CAS type namespace is converted to an XML namespace URI by the following rule: replace all dots with slashes, prepend http:///, and append .ecore.
This mapping was chosen because it is the default mapping used by the Eclipse Modeling Framework (EMF)footnote:[For details on EMF and Ecore see Budinsky et al. Eclipse Modeling Framework 2.0. Addison-Wesley. 2006.] to create namespace URIs from Java package names.
The use of the http scheme is a common convention, and does not imply any HTTP communication.
The `.ecore` suffix is due to the fact that the recommended type system definition for a namespace is an xref:tug.adoc#ugr.tug.xmi_emf[ECore model].
Consider the CAS type name `org.myproj.Foo`.
The CAS namespace (`org.myorg.`) is converted to the XML namespace URI is `http:///org/myproj.ecore`.
The XML element name is then formed by concatenating the XML namespace prefix (which is an arbitrary token, but typically we use the last component of the CAS namespace) with the type name (excluding the namespace).
So the example `org.myproj.Foo` Feature Structure is written to XMI as:
[source]
----
<xmi:XMI
xmi:version="2.0"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1"/>
...
</xmi:XMI>
----
The `xmi:id` attribute is only required if this object will be referred to from elsewhere in the XMI document.
If provided, the xmi:id must be unique for each feature.
All namespace prefixes (e.g. `myproj`) in this example must be bound to URIs using the `xmlns:...` attribute, as defined by the XML namespaces specification.
[[ugr.ref.xmi.primitive_features]]
== Primitive Features
CAS features of primitive types (String, Boolean, Byte, Short, Integer, Long , Float, or Double) can be mapped either to XML attributes or XML elements.
For example, a CAS FeatureStructure of type org.myproj.Foo, with features:
[source]
----
begin = 14
end = 19
myFeature = "bar"
----
could be mapped to:
[source]
----
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
...
</xmi:XMI>
----
or equivalently:
[source]
----
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1">
<begin>14</begin>
<end>19</end>
<myFeature>bar</myFeature>
</myproj:Foo>
...
</xmi:XMI>
----
The attribute serialization is preferred for compactness, but either representation is allowable.
Mixing the two styles is allowed; some features can be represented as attributes and others as elements.
[[ugr.ref.xmi.reference_features]]
== Reference Features
CAS features that are references to other feature structures (excluding arrays and lists, which are handled separately) are serialized as ID references.
If we add to the previous CAS example a feature structure of type org.myproj.Baz, with feature "`myFoo`" that is a reference to the Foo object, the serialization would be:
[source]
----
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:myproj="http:///org/myproj.ecore">
...
<myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
<myproj:Baz xmi:id="2" myFoo="1"/>
...
</xmi:XMI>
----
As with primitive-valued features, it is permitted to use an element rather than an attribute.
However, the syntax is slightly different:
[source]
----
<myproj:Baz xmi:id="2">
<myFoo href="#1"/>
<myproj.Baz>
----
Note that in the attribute representation, a reference feature is indistinguishable from an integer-valued feature, so the meaning cannot be determined without prior knowledge of the type system.
The element representation is unambiguous.
[[ugr.ref.xmi.array_and_list_features]]
== Array and List Features
For a CAS feature whose range type is one of the CAS array or list types, the XMI serialization depends on the setting of the "`multipleReferencesAllowed`" attribute for that feature in the xref:ref.adoc#ugr.ref.xml.component_descriptor.type_system.features[UIMA Type System Description].
An array or list with `multipleReferencesAllowed = false` (the default) is serialized as a __multi-valued__ property in XMI.
An array or list with `multipleReferencesAllowed = true` is serialized as a first-class object.
Details are described below.
[[ugr.ref.xmi.array_and_list_features.as_multi_valued_properties]]
=== Arrays and Lists as Multi-Valued Properties
In XMI, a multi-valued property is the most natural XMI representation for most cases.
Consider the example where the FeatureStructure of type `org.myproj.Baz` has a feature `myIntArray` whose value is the integer array `{2,4,6}`.
This can be mapped to:
[source]
----
<myproj:Baz xmi:id="3" myIntArray="2 4 6"/>
----
or equivalently:
[source]
----
<myproj:Baz xmi:id="3">
<myIntArray>2</myIntArray>
<myIntArray>4</myIntArray>
<myIntArray>6</myIntArray>
</myproj:Baz>
----
Note that String arrays whose elements contain embedded spaces MUST use the latter mapping.
`FSArray` or `FSList` features are serialized in a similar way.
For example an `FSArray` feature that contains references to the elements with `xmi:id`'s `13` and `42` could be serialized as:
[source]
----
<myproj:Baz xmi:id="3" myFsArray="13 42"/>
----
or:
[source]
----
<myproj:Baz xmi:id="3">
<myFsArray href="#13"/>
<myFsArray href="#42"/>
</myproj:Baz>
----
[[ugr.ref.xmi.array_and_list_features.as_1st_class_objects]]
=== Arrays and Lists as First-Class Objects
The multi-valued-property representation described in the previous section does not allow multiple references to an array or list object.
Therefore, it cannot be used for features that are defined to allow multiple references (i.e. features for which multipleReferencesAllowed = true in the Type System Description).
When `multipleReferencesAllowed` is set to true, array and list features are serialized as references, and the array or list objects are serialized as separate objects in the XMI.
Consider again the example where the Feature Structure of type `org.myproj.Baz` has a feature `myIntArray` whose value is the integer array `{2,4,6}`. If `myIntArray` is defined with multipleReferencesAllowed=true, the serialization will be as follows:
[source]
----
<myproj:Baz xmi:id="3" myIntArray="4"/>
----
or:
[source]
----
<myproj:Baz xmi:id="3">
<myIntArray href="#4"/>
</myproj:Baz>
----
with the array object serialized as
[source]
----
<cas:IntegerArray xmi:id="4" elements="2 4 6"/>
----
or:
[source]
----
<cas:IntegerArray xmi:id="4">
<elements>2</elements>
<elements>4</elements>
<elements>6</elements>
</cas:IntegerArray>
----
Note that in this case, the XML element name is formed from the CAS type name (e.g. `uima.cas.IntegerArray`) in the same way as for other Feature Structures.
The elements of the array are serialized either as a space-separated attribute named `elements` or as a series of child elements named `elements`.
List nodes are just standard FeatureStructures with `head` and `tail` features, and are serialized using the normal Feature Structure serialization.
For example, an `IntegerList` with the values `2`, `4`, and `6` would be serialized as the four objects:
[source]
----
<cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/>
<cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/>
<cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/>
<cas:EmptyIntegerList xmi:id"13"/>
----
This representation of arrays allows multiple references to an array of list.
It also allows a feature with range type TOP to refer to an array or list.
However, it is a very unnatural representation in XMI and does not support interoperability with other XMI-based systems, so we instead recommend using the multi-valued-property representation described in the previous section whenever it is possible.
When a feature is specified in the descriptor without a multipleReferencesAllowed attribute, or with the attribute specified as `false`, but the framework discovers multiple references during serialization, it will issue a message to the log say that it discovered this (look for the phrase __serialized in duplicate__).
The serialization will continue, but the multiply-referenced items will be serialized in duplicate.
[[ugr.ref.xmi.null_array_list_elements]]
=== Null Array/List Elements
In UIMA, an element of an FSArray or FSList may be null.
In XMI, multi-valued properties do not permit null values.
As a workaround for this, we use a dummy instance of the special type `cas:NULL`, which has `xmi:id="0"`.
For example, in the following example the "`myFsArray`" feature refers to an FSArray whose second element is null:
[source]
----
<cas:NULL xmi:id="0"/>
<myproj:Baz xmi:id="3">
<myFsArray href="#13"/>
<myFsArray href="#0"/>
<myFsArray href="#42"/>
</myproj:Baz>
----
[[ugr.ref.xmi.sofas_views]]
== Subjects of Analysis (Sofas) and Views
A UIMA CAS contain one or more subjects of analysis (Sofas). These are serialized no differently from any other feature structure.
For example:
[source]
----
<?xml version="1.0"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi=http://www.omg.org/XMI
xmlns:cas="http:///uima/cas.ecore">
<cas:Sofa xmi:id="1" sofaNum="1"
text="the quick brown fox jumps over the lazy dog."/>
</xmi:XMI>
----
Each Sofa defines a separate View.
Feature Structures in the CAS can be members of one or more views.
(A Feature Structure that is a member of a view is indexed in its IndexRepository, but that is an implementation detail.)
In the XMI serialization, views will be represented as first-class objects.
Each View has an (optional) "`sofa`" feature, which references a sofa, and multi-valued reference to the members of the View.
For example:
[source]
----
<cas:View sofa="1" members="3 7 21 39 61"/>
----
Here the integers 3, 7, 21, 39, and 61 refer to the xmi:id fields of the objects that are members of this view.
[[ugr.ref.xmi.linking_to_ecore_type_system]]
== Linking an XMI Document to its Ecore Type System
// <titleabbrev>Linking XMI docs to Ecore Type System</titleabbrev>
If the CAS Type System has been saved to an xref:tug.adoc#ugr.tug.xmi_emf[Ecore file], it is possible to store a link from an XMI document to that Ecore type system.
This is done using an `xsi:schemaLocation` attribute on the root XMI element.
The `xsi:schemaLocation` attribute is a space-separated list that represents a mapping from namespace URI (e.g.
`http:///org/myproj.ecore`) to the physical URI of the `.ecore` file containing the type system for that namespace.
For example:
[source]
----
xsi:schemaLocation=
"http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"
----
would indicate that the definition for the org.myproj CAS types is contained in the file `c:/typesystems/myproj.ecore`.
You can specify a different mapping for each of your CAS namespaces, using a space separated list.
For details see Budinsky et al. __Eclipse Modeling Framework__.
[[ugr.ref.xmi.delta]]
== Delta CAS XMI Format
The Delta CAS XMI serialization format is designed primarily to reduce the overhead serialization when calling annotators configured as services.
Only Feature Structures and Views that are new or modified by the service are serialized and returned by the service.
The classes `org.apache.uima.cas.impl.XmiCasSerializer` and `org.apache.uima.cas.impl.XmiCasDeserializer` support serialization of only the modifications to the CAS.
A caller is expected to set a marker to indicate the point from which changes to the CAS are to be tracked.
A Delta CAS XMI document contains only the Feature Structures and Views that have been added or modified.
The new and modified Feature Structures are represented in exactly the format as in a complete CAS serialization.
The ` cas:View ` element has been extended with three additional attributes to represent modifications to View membership.
These new attributes are ``added_members``, `deleted_members` and ``reindexed_members``.
For example:
[source]
----
<cas:View sofa="1" added_members="63 77"
deleted_member="7 61" reindexed_members="39" />
----
Here the integers 63, 77 represent xmi:id fields of the objects that have been newly added members to this View, 7 and 61 are xmi:id fields of the objects that have been removed from this view and 39 is the xmi:id of an object to be reindexed in this view.