blob: b925a58ce9fa9a77550c730667ec61ffc96419ff [file] [log] [blame]
<!--
***************************************************************
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
***************************************************************
-->
<html>
<head>
<title>Apache UIMA v2.7.0 Release Notes</title>
</head>
<body>
<h1>Apache UIMA (Unstructured Information Management Architecture) v2.7.0 Release Notes</h1>
<h2>Contents</h2>
<p>
<a href="#what.is.uima">What is UIMA?</a><br/>
<a href="#major.changes">Major Changes in this Release</a><br/>
<a href="#get.involved">How to Get Involved</a><br/>
<a href="#report.issues">How to Report Issues</a><br/>
<a href="#list.issues">List of JIRA Issues Fixed in this Release</a><br/>
</p>
<h2><a id="what.is.uima">1. What is UIMA?</a></h2>
<p>
Unstructured Information Management applications are
software systems that analyze large volumes of
unstructured information in order to discover knowledge
that is relevant to an end user. UIMA is a framework and
SDK for developing such applications. An example UIM
application might ingest plain text and identify
entities, such as persons, places, organizations; or
relations, such as works-for or located-at. UIMA enables
such an application to be decomposed into components,
for example "language identification" -&gt; "language
specific segmentation" -&gt; "sentence boundary
detection" -&gt; "entity detection (person/place names
etc.)". Each component must implement interfaces defined
by the framework and must provide self-describing
metadata via XML descriptor files. The framework manages
these components and the data flow between them.
Components are written in Java or C++; the data that
flows between components is designed for efficient
mapping between these languages. UIMA additionally
provides capabilities to wrap components as network
services, and can scale to very large volumes by
replicating processing pipelines over a cluster of
networked nodes.
</p>
<p>
Apache UIMA is an Apache-licensed open source
implementation of the UIMA specification (that
specification is, in turn, being developed concurrently
by a technical committee within
<a href="http://www.oasis-open.org">OASIS</a>,
a standards organization). We invite and encourage you
to participate in both the implementation and
specification efforts.
</p>
<p>
UIMA is a component framework for analysing unstructured
content such as text, audio and video. It comprises an
SDK and tooling for composing and running analytic
components written in Java and C++, with some support
for Perl, Python and TCL.
</p>
<h2><a id="major.changes">Major Changes in this Release</a></h2>
<h3>Java 7 minimum level</h3>
<p>Java 7 is now the minimum level of Java required.</p>
<h3>Several JVM properties support backwards compatibility</h3>
<p>See the README and <a target="_blank" href="http://uima.apache.org/d/uimaj-2.7.0/references.html#ugr.ref.config">
the Reference chapter</a> for a description of these.
<h3>JSON serialization support</h3>
<p>
JSON serialization support is added for Type System Descriptions, and for CASs.
Several formats for JSON CAS serialization are provided, please see the chapter in the
UIMA reference documentation for details.
</p>
<h3>Sorted and Bag indexes no longer store multiple instances of identical FSs</h3>
<p>
The meaning of "bag" and "sorted" index has been made consistent with how these are handled
when sending CASes to remote services. This means that adding the same identical FS to the
indexes, multiple times, will no longer add duplicate index entries. And, removing a FS from an index
is guaranteed that that particular FS will no longer be in any index. (Before, if you had
added a particular FS to a sorted or bag index, multiple times, the remove behavior would
remove just one of the instances). Deserialization of CASes sent to remote services has never
added Feature Structures to the index multiple times, so this change makes that behavior
consistent. For more details, see <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-3399">
Jira issue UIMA-3399</a>.
</p>
<p>Because some users may need the previous behavior that permitted duplicates of identical Feature Structures in
the sorted and bag indexes. this change can be disabled, by running the JVM with the defined property
"uima.allow_duplicate_add_to_indexes".</p>
<h3>Index corruption avoidance</h3>
<p>To prevent potential index corruption, UIMA now recovers (unless disabled by
<code>"-Duima.disable_auto_protect_indexes"</code>)
from illegal modifications of features.</p>
<p>These are modifications to features used as index keys, done while the Feature Structure
being modified is currently in one or more indexes (see
<a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4135">Jira issue UIMA-4135</a>).</p>
<p>Corruption is prevented by first removing the feature structure being updated,
then doing the update, and then adding the feature structure back to the indexes.
Because these actions can affect performance, it is recommended that you run with JVM property
"-Duima.report_fs_update_corrupts_index" in order to see if any user code has this problem, and fix these via
redesign, or by wrapping the affected areas with a form of <code>protectIndexes()</code>, which does the
needed removes and add-backs under your control, so you can do several feature updates at once,
before adding the feature structure back. <code>protectIndexes</code> is described in the
<a target="_blank" href=
"http://uima.apache.org/d/uimaj-2.7.0/references.html#ugr.ref.cas.updating_indexed_feature_structures">
CAS Reference Chapter</a> and the CAS Javadocs; you can also use it with the JCas.</p>
<p>Because this protection is automatic and hidden, if you are iterating over sorted or set indexes, the
automatic recovery may cause unexpected ConcurrentModificationExceptions to be thrown by the iterator when
advancing. To work around this, either stop modifying features which are used as keys in the index being iterated over,
or use Snapshot iterators (see following).</p>
<h3>New Snapshot iterators won't throw ConcurrentModificationException</h3>
<p>A long-standing difficulty with Feature Structure iterators, namely, that adding to / removing from the underlying
index being iterated over is not allowed while iterating (unless you use a moveToXXX kind of operation to "reset" the
iterator), is addressed with a new class of "snapshot" iterators.
These take a snapshot of the state of the index when the iterator is created; subsequent modifications to the index
are then permitted, while the iterator continues to iterate over the snapshot it created; these iterators do not
throw ConcurrentModificationException. The implementation of this feature is via a new method on FSIndex,
<a target="_blank" href=
"http://uima.apache.org/d/uimaj-2.7.0/apidocs/org/apache/uima/cas/FSIndex.html#withSnapshotIterators--">
withSnapshotIterators()</a>, which
creates a light-weight copy of the the FSIndex instance whose iterator method iterators gets the Snapshot kind.
This approach allows using the new index in Java's "extended for" statement.</p>
<p>The current implementation of the snapshot iterators makes a snapshot of the index being iterated over, at
creation time, which has a cost in space and time.</p>
<h3>Other changes</h3>
<p>Some of the other major changes are listed here; for the complete list, see <a href="issuesFixed/jira-report.html">
the Issues Fixed report</a>.</p>
<ul><li> making the JCasGen Eclipse plugin work with more varieties of specifications for class paths.
Jira issues: UIMA-<a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4080">4080</a>/
<a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4081">4081</a></li>
<li>moveTo(a_Feature_Structure) or creating a new iterator to start at a feature sometimes went to the wrong place.
Jira issues: <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4094">UIMA-4094</a>
and <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4105">UIMA-4105</a>.</li>
<li>deserialization of deltaCAS when modifying existing indexed Feature Structures could corrupt the indexes.
Jira issue: <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4100">UIMA-4100</a>.</li>
<li>default bag indexes will now be created if there are only Set indexes.
Jira issue: <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4111">UIMA-4111</a>.</li>
<li><p>Xmi CAS Serialization now checks to see that list and array feature values marked as multipleReferencesAllowed=false
(or not marked at all, which defaults to multipleReferencesAllowed=false) are not multiply-referenced. If they are,
they continue to be serialized as if they are independent objects (as was previously done), but now a new
warning message is issued. Because there can be a huge number of these messages, they are automatically
throttled down, to prevent running out of room in the error logs.</p>
<p>The message strings for these look like "Feature [some-feature-name] is marked multipleReferencesAllowed=false,
but it has multiple references. These will be serialized in duplicate."</p></li>
<li>The CasCopier now checks to insure that the range type of the target feature has the same name as the
range type of the source feature, which catches errors when two different type systems are used for the source and
target. For example, this now prevents a feature with range type "uima.cas.Float" from being copied
into one with range type "uima.cas.String".
</li>
</ul>
<h3>Performance change highlights</h3>
<ul>
<li>Iterators obtained for Bag indexes and from the method getAllIndexedFS( ... type ... ) are
by definition, unordered. The implementation for these iterators is now
much faster, taking advantage of the unordered aspect of these things and returning items in a more
efficient sequence.
Jira issue: <a target="_blank" href="https://issues.apache.org/jira/browse/UIMA-4166">UIMA-4166</a>.</li>
<li>CasCopier has been reimplemented and is approximately 5-10 times faster.</li>
</ul>
<p>The complete list of fixes is <a href="issuesFixed/jira-report.html">here</a>.
<h2><a id="get.involved">How to Get Involved</a></h2>
<p>
The Apache UIMA project really needs and appreciates any contributions,
including documentation help, source code and feedback. If you are interested
in contributing, please visit
<a href="http://uima.apache.org/get-involved.html">
http://uima.apache.org/get-involved.html</a>.
</p>
<h2><a id="report.issues">How to Report Issues</a></h2>
<p>
The Apache UIMA project uses JIRA for issue tracking. Please report any
issues you find at
<a href="http://issues.apache.org/jira/browse/uima">http://issues.apache.org/jira/browse/uima</a>
</p>
<h2><a id="list.issues">List of JIRA Issues Fixed in this Release</a></h2>
Click <a href="issuesFixed/jira-report.html">issuesFixed/jira-report.hmtl</a> for the list of
issues fixed in this release.
</body>
</html>