<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "https://www.w3.org/TR/html4/loose.dtd">
<!-- ====================================================================== -->
<!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- ====================================================================== -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<style type="text/css">@import "stylesheets/base.css";</style>
<meta name="author" value="
Apache UIMA Team
">
<meta name="email" value="dev@uima.apache.org">
<title>Apache UIMA - Cookbook: addressing some typical use-cases</title>
<!-- Begin Cookie Consent plugin by Silktide - https://silktide.com/cookieconsent -->
<!-- Commented out because implied consent is not compatible with GDPR -->
<!--
<script type="text/javascript">
window.cookieconsent_options = {"message":"This website uses cookies to ensure you get the best experience on our website","dismiss":"Got it!","learnMore":"More info","link":"https://uima.apache.org/privacy-policy.html","theme":"dark-bottom"};
</script>
<script type="text/javascript" src="/cookieconsent2/cookieconsent.min.js"></script>
-->
<!-- End Cookie Consent plugin -->
<!-- Begin Google Analytics -->
<!-- Commented out because GA requires consent according to GDPR -->
<!--
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-70846351-1', 'auto');
ga('set', 'anonymizeIp', true);
ga('send', 'pageview');
</script>
-->
<!-- End Google Analytics -->
</head>
<body>
<div class="topLogos">
<table border="0" width="100%" cellspacing="0">
<!-- TOP IMAGE -->
<tr>
<td align='LEFT'>
<a href="index.html">
<img style="border: 1px solid black;" src="./images/UIMA_banner2tlpTm.png" alt="UIMA project logo" border="0"/>
</a>
</td>
<td align='CENTER'>
<div class="pageBanner">Cookbook: addressing some typical use-cases</div>
</td>
<td align='RIGHT'>
<a href="https://www.apache.org">
<img src="./images/asf-logo-on-white-smallTm.png" alt="Apache UIMA" border="0"/>
</a>
</td>
</tr>
</table>
<hr noshade="" size="1"/>
</div>
<table border="0" width="100%" cellspacing="4">
<tr>
<td align='RIGHT' colspan="2">
<form method="get" action="https://www.google.com/search">
Search the site
<input type="text" name="q" size="25" maxlength="255" value="" />
<input type="hidden" name="sitesearch" value="https://uima.apache.org/" />
<input name="Search" value="Search Site" type="submit"/>
</form>
</td>
</tr>
<tr> <!-- LEFT SIDE NAVIGATION -->
<td width="20%" valign="top">
<!-- regular menu -->
<div class="navBar">
<br/>
<div class="navBarItem"> <div class="navPartHeading">General</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./index.html">Home</a>
</div>
<div class="navBarItem"> <a href="./downloads.cgi">Downloads</a>
</div>
<div class="navBarItem"> <a href="./documentation.html">Documentation</a>
</div>
<div class="navBarItem"> <a href="./news.html">News</a>
</div>
<div class="navBarItem"> <a href="./publications.html">Publications</a>
</div>
<br style="line-height: .5em"/>
<div class="navBarItem"> <a href="https://issues.apache.org/jira/browse/uima" target="_blank" rel="noopener">Issue tracker <img src="images/offsitelink.png"/></a>
</div>
<div class="navBarItem"> <a href="https://cwiki.apache.org/confluence/display/UIMA/" target="_blank" rel="noopener">Wiki <img src="images/offsitelink.png"/></a>
</div>
<br style="line-height: .5em"/>
<div class="navBarItem"> <a href="https://cwiki.apache.org/confluence/display/UIMA/Powered+by+Apache+UIMA" target="_blank" rel="noopener">Powered By UIMA <img src="images/offsitelink.png"/></a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">Community</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./get-involved.html">Get Involved</a>
</div>
<div class="navBarItem"> <a href="./mail-lists.html">Mailing Lists</a>
</div>
<div class="navBarItem"> <a href="./contribution-policy.html">Contribution Policies</a>
</div>
<div class="navBarItem"> <a href="./faq.html">FAQ</a>
</div>
<div class="navBarItem"> <a href="./project-guidelines.html">Project Guidelines</a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">Scaleout Frameworks</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./doc-uimaas-what.html">UIMA-AS</a>
</div>
<div class="navBarItem"> <a href="./doc-uimaducc-whatitam.html">UIMA-DUCC</a>
</div>
<div class="navBarItem"> <a href="./doc-uimaducc-demo.html">..Demo Page</a>
</div>
<div class="navBarItem"> <a href="http://uima-ducc-demo.apache.org:42133" target="_blank" rel="noopener">..Demo Live <img src="images/offsitelink.png"/></a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">Components &amp; Tools</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./sandbox.html#uima-addons-annotators">Annotators</a>
</div>
<div class="navBarItem"> <a href="./toolsServers.html">Tools &amp; Servers</a>
</div>
<div class="navBarItem"> <a href="./sandbox.html">Addons and Sandbox</a>
</div>
<div class="navBarItem"> <a href="./ruta.html">UIMA Ruta</a>
</div>
<div class="navBarItem"> <a href="./uimafit.html">uimaFIT</a>
</div>
<div class="navBarItem"> <a href="./external-resources.html">External Resources</a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">Development</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./dev-quick.html">Quick Start: building</a>
</div>
<div class="navBarItem"> <a href="./building-uima.html">Building from Source</a>
</div>
<div class="navBarItem"> <a href="./one-time-setup.html">One-time setups</a>
</div>
<div class="navBarItem"> <a href="./svn.html">Source Code</a>
</div>
<div class="navBarItem"> <a href="./release.html">Doing a UIMA release</a>
</div>
<div class="navBarItem"> <a href="https://www.apache.org/security/committers.html" target="_blank" rel="noopener">Doing a CVE (Apache) <img src="images/offsitelink.png"/></a>
</div>
<div class="navBarItem"> <a href="./eclipse-update-site.html">Eclipse Update Sites</a>
</div>
<div class="navBarItem"> <a href="./git.html">GIT</a>
</div>
<div class="navBarItem"> <a href="./codeConventions.html">Code Conventions</a>
</div>
<div class="navBarItem"> <a href="./uima-specification.html">UIMA Specification (OASIS)</a>
</div>
<div class="navBarItem"> <a href="./team-list.html">Project Team</a>
</div>
<div class="navBarItem"> <a href="./maven-design.html">Maven Use</a>
</div>
<div class="navBarItem"> <a href="./updating-website.html">Updating this Website</a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">Events and Conferences</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="./coling14.html">COLING 2014</a>
</div>
<div class="navBarItem"> <a href="./gscl13.html">GSCL 2013</a>
</div>
<div class="navBarItem"> <a href="./iks09.html">IKS 2009</a>
</div>
<div class="navBarItem"> <a href="./gscl09.html">GSCL 2009</a>
</div>
<div class="navBarItem"> <a href="./lsm09.html">LSM 2009</a>
</div>
<div class="navBarItem"> <a href="./lrec08.html">LREC 2008</a>
</div>
<div class="navBarItem"> <a href="./gldv07.html">GLDV 2007</a>
</div>
</div>
<br/>
<div class="navBarItem"> <div class="navPartHeading">ASF</div>
</div>
<div class="navBar">
<div class="navBarItem"> <a href="https://www.apache.org/licenses/" target="_blank" rel="noopener">License <img src="images/offsitelink.png"/></a>
</div>
<div class="navBarItem"> <a href="https://www.apache.org/foundation/thanks.html" target="_blank" rel="noopener">ASF Sponsors <img src="images/offsitelink.png"/></a>
</div>
<div class="navBarItem"> <a href="https://www.apache.org/foundation/sponsorship.html" target="_blank" rel="noopener">ASF Sponsorship <img src="images/offsitelink.png"/></a>
</div>
<div class="navBarItem"> <a href="./security_report">Security</a>
</div>
</div>
</div>
</td>
<td width="80%" align="left" valign="top">
<div class="sectionTable">
<table class="sectionTable">
<tr><td>
<a name="Working with Feature Structures"><h1><img src="images/UIMA_4sq50tightCropSolid.png"/>&nbsp;Working with Feature Structures</h1></a>
</td></tr>
<tr><td>
<blockquote class="sectionBody">
<ul>
<li><a href='#Remove all Feature Structures of a particular type'>
Remove all Feature Structures of a particular type
</a></li>
<li><a href='#General suggestions: working with iterators'>
General suggestions: working with iterators
</a></li>
</ul>
<p>These techniques work with all kinds of Feature Structures, both Annotations and non-Annotations.</p>
<table class="subsectionTable">
<tr><td>
<a name="Remove all Feature Structures of a particular type">
<h2>Remove all Feature Structures of a particular type
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<p>There are built-in methods to do this, over all indexes in a particular view. There are 2 variations:
<ul><li>remove all including the subtypes of the type
<pre>myJCasView.removeAllIncludingSubtypes(Foo.type)</pre>
</li>
<li>remove all excluding the subtypes of the type
<pre>myJCasView.removeAllExcludingSubtypes(Foo.type)</pre></li></ul>
</p>
<p>Both of these are much faster than iterating over the Feature Structures; they directly clear the associated indexes.</p>
</blockquote>
</td></tr>
</table>
<table class="subsectionTable">
<tr><td>
<a name="General suggestions: working with iterators">
<h2>General suggestions: working with iterators
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<p>Code often iterates over all instances of a type but only does something with a subset.
Frequently, the iteration can be cut short by starting near the spot of interest and stopping as soon
as it can be determined that no further iteration will find interesting Annotations.</p>
<p>Example: Let's say you have a "token" annotation, and want to find the "sentence" that contains it.
You could write an iterator over all sentences.
</p>
<h3>Stop early</h3>
<p>
When you find the first sentence that overlaps the token, you can use any extra knowledge you have
(for example, that each token belongs to only one sentence) to conclude that no further iteration is needed,
and stop the iteration.
</p>
<p>Furthermore, if the token appears outside of any sentence, you can similarly stop the iteration, and return
an "empty" result, as soon as the sentence being tested begins after the token's begin.
Because the iterator returns sentences in sorted order, no later sentence can
start at or before the token's begin.
</p>
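<p>The two stopping rules above can be sketched as follows. This is a minimal illustration, not UIMA API:
the nested <code>Span</code> class is a hypothetical stand-in for an Annotation, a plain sorted list stands in
for the sentence iterator, and the <code>examined</code> counter exists only to show how few items are visited.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class StopEarly {
  /** Hypothetical stand-in for a UIMA Annotation: only begin/end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
    @Override public String toString() { return "[" + begin + "," + end + "]"; }
  }

  /** Number of sentences the last call looked at (for illustration only). */
  static int examined;

  /**
   * Finds the sentence covering the token.
   * Assumes the sentences are sorted by begin (ascending) and that a token
   * lies in at most one sentence.
   */
  static Optional<Span> findCoveringSentence(List<Span> sentences, Span token) {
    examined = 0;
    for (Span s : sentences) {
      examined++;
      if (s.begin > token.begin) {
        break; // sorted order: every later sentence also begins after the token
      }
      if (s.end >= token.end) {
        return Optional.of(s); // at most one sentence per token, so stop here
      }
    }
    return Optional.empty(); // the token is outside of any sentence
  }

  public static void main(String[] args) {
    List<Span> sentences = Arrays.asList(
        new Span(0, 10), new Span(12, 20), new Span(25, 40), new Span(41, 60));
    System.out.println(findCoveringSentence(sentences, new Span(14, 18)));
    System.out.println(findCoveringSentence(sentences, new Span(21, 23)));
  }
}
```

<p>In this sketch, both lookups stop before reaching the last sentence in the list.</p>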
<h3>Begin closer to the right spot, maybe iterate backwards</h3>
<p>But you can do better.</p>
<p>Instead of starting at the beginning, you can start the iteration at the position of the token and iterate backwards.
Iterators have a moveTo() method which takes a feature structure argument, so you can moveTo(the-token),
and then, perhaps with some edge adjustment for equality, start iterating backwards, looking for the sentence at that
position that covers the token.
</p>
<p>If you are iterating backwards, and looking for a "covering" annotation, and know the largest span for that
covering type, then you can stop iterating as soon as the start position you reach, + the largest span, is less than
the start of the annotation you're trying to cover.</p>
<p style="margin-left:1rem">This is used internally in version 3's
<a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select.annot.subselect">select framework</a>
to speed up
the <code>covering</code> kind of iteration.</p>
<p>There are many other examples, but the principle is the same: start the iteration "close to" the right spot,
perhaps moving backwards instead of forwards, and end the iteration as soon as you can logically say that
no more suitable feature structures would be found. </p>
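<p>A rough sketch of the backwards search with the "largest span" cutoff described above. It makes several
assumptions: the nested <code>Span</code> class is a hypothetical stand-in for an Annotation, a binary search over
a sorted list stands in for the iterator's moveTo(), and <code>maxLen</code> is a bound the caller is assumed to know.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class BackwardSearch {
  /** Hypothetical stand-in for a UIMA Annotation: only begin/end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
    @Override public String toString() { return "[" + begin + "," + end + "]"; }
  }

  /**
   * Finds a span covering the target by walking backwards from the target's position.
   * Assumes 'sorted' is ordered by begin (ascending) and that no span in it is
   * longer than maxLen.
   */
  static Optional<Span> findCovering(List<Span> sorted, Span target, int maxLen) {
    // Stand-in for moveTo(): find the last index whose begin <= target.begin.
    int lo = 0;
    int hi = sorted.size() - 1;
    int start = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted.get(mid).begin <= target.begin) {
        start = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // Iterate backwards from that position.
    for (int i = start; i >= 0; i--) {
      Span s = sorted.get(i);
      if (s.begin + maxLen < target.begin) {
        break; // this span, and every earlier one, must end before the target starts
      }
      if (s.end >= target.end) {
        return Optional.of(s); // begin <= target.begin is guaranteed by the start position
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<Span> spans = Arrays.asList(
        new Span(0, 8), new Span(10, 30), new Span(12, 14), new Span(15, 17), new Span(40, 50));
    System.out.println(findCovering(spans, new Span(20, 22), 20));
    System.out.println(findCovering(spans, new Span(36, 38), 20));
  }
}
```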
<h3>Use UIMA Version 3's select framework</h3>
<p>The <a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select">select framework</a>
packages many common iteration use cases into a Java-friendly API that
automatically uses optimized iterators and can also produce Java Streams.</p>
</blockquote>
</td></tr>
</table>
</blockquote>
</p>
</td></tr>
</table>
<div class="sectionTable">
<table class="sectionTable">
<tr><td>
<a name="Working with Annotations"><h1><img src="images/UIMA_4sq50tightCropSolid.png"/>&nbsp;Working with Annotations</h1></a>
</td></tr>
<tr><td>
<blockquote class="sectionBody">
<ul>
<li><a href='#Watch out for type-priorites'>
Watch out for type-priorities
</a></li>
<li><a href='#Annotation containment'>
Annotation containment
</a></li>
<li><a href="#Adjusting an existing annotation's begin and end">
Adjusting an existing annotation's begin and end
</a></li>
<li><a href='#Avoid where possible, copying sets of Feature Structures'>
Avoid where possible, copying sets of Feature Structures
</a></li>
</ul>
<p>
The CAS holds Feature Structures (FSs). There is special support for FSs which are a subtype of Annotation;
these have an associated Subject of Analysis (Sofa) and <code>begin</code> and <code>end</code> offsets.
</p>
<h3>Annotations are not required in all cases</h3>
<p>If your application deals with a different kind of unstructured data, for instance images, then
Annotation may not be the appropriate supertype for your types, because it is designed for
things that have meaningful linear begin / end boundaries.</p>
<p>You can have your feature structures inherit from TOP, or from some other appropriate supertype, other
than Annotation.
<ul>
<li><p>For example, if you want to define a new kind of annotation (e.g. a rectangular
region if your subject of analysis is an image),
you should write a new type which inherits from AnnotationBase. Types which
inherit from AnnotationBase are bound to a particular subject of analysis (aka view).</p>
</li>
<li><p>On the other hand, if you have information which is not directly related to a subject of analysis
(e.g. a Date type with day/month/year fields which would be used as a value rather
than as an annotation) then consider inheriting from TOP instead.</p>
</li>
<li><p>It is also
not necessary to add all feature structures or annotations to the indexes. For example, if the
Date type just described is used as a feature value, it may well be sufficient to be
able to reach it through the feature.</p></li>
</ul>
</p>
<h3>Making use of the built-in Annotation index</h3>
<p>Annotations are special in UIMA in that there is a "built-in" index, the AnnotationIndex, which can be used
to rapidly access these in a sorted order. The ordering is by <code>begin</code> (ascending), then by
<code>end</code> (descending), and then by type-priorities.</p>
<p style="margin-left:1rem"><i>This is really a set of indexes, one for each subtype of Annotation.</i></p>
<p style="margin-left:1rem"><i>Although the index has type-priorities, in UIMA v3, the <code>select-framework</code>
by default ignores these; this behavior can be overridden on an as-needed basis.</i></p>
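<p>To make the ordering concrete, here is a small sketch of an equivalent comparator. It is not the framework's
implementation: the nested <code>Ann</code> class is a hypothetical stand-in for an Annotation, and type priorities
are modeled, as an assumption, by a simple list of type names in which earlier entries sort first.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class AnnotationOrder {
  /** Hypothetical stand-in for a UIMA Annotation: a type name plus offsets. */
  static final class Ann {
    final String type;
    final int begin;
    final int end;
    Ann(String type, int begin, int end) { this.type = type; this.begin = begin; this.end = end; }
    @Override public String toString() { return type + "[" + begin + "," + end + "]"; }
  }

  /**
   * Orders by begin (ascending), then end (descending), then position in the
   * typePriority list. Types missing from the list get index -1 and sort first.
   */
  static Comparator<Ann> indexOrder(List<String> typePriority) {
    return Comparator.<Ann>comparingInt(a -> a.begin)
        .thenComparing(Comparator.<Ann>comparingInt(a -> a.end).reversed())
        .thenComparingInt(a -> typePriority.indexOf(a.type));
  }

  public static void main(String[] args) {
    List<Ann> anns = new ArrayList<>(Arrays.asList(
        new Ann("Token", 6, 10), new Ann("Foo", 0, 20),
        new Ann("Token", 0, 5), new Ann("Sentence", 0, 20)));
    anns.sort(indexOrder(Arrays.asList("Sentence", "Foo", "Token")));
    System.out.println(anns);
  }
}
```

<p>With begin ascending and end descending, a longer annotation sorts before a shorter one that starts at the same
position, so enclosing annotations tend to be visited before the ones they contain.</p>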
<table class="subsectionTable">
<tr><td>
<a name="Watch out for type-priorites">
<h2>Watch out for type-priorities
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<p>When 2 annotations have the same begin and end, but different types, then type priorities determine which one
comes first. This lets you declare, for example, that when a Sentence annotation and a
Foo annotation both cover the same span, the Sentence logically contains the Foo, and not the
other way around.</p>
<p>To make this work, you need to specify the type priorities. This is a global setting for your application.
See
<a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.aes.primitive">
type priorities</a> (scroll down to find it) for how to specify this.</p>
<h3>Avoiding type priorities</h3>
<p>Often, the use of type priorities gets in the way. With UIMA Version 3, the
<a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select">select framework</a>
by default ignores type priorities when doing its operations; but this can be overridden as needed.</p>
</blockquote>
</td></tr>
</table>
<table class="subsectionTable">
<tr><td>
<a name="Annotation containment">
<h2>Annotation containment
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<h3>a contains b</h3>
<ul><li>Ignoring type priorities:</li></ul>
<pre>a != null &amp;&amp; b != null &amp;&amp; // null check
a.getBegin() &lt;= b.getBegin() &amp;&amp; // a starts before (or equal to) b
a.getEnd() &gt;= b.getEnd() // a ends after (or equal to) b</pre>
<h3>a and b overlap (have at least one char position in common)</h3>
<pre>
// (check for non-null omitted)
if (a.getBegin() &lt;= b.getBegin()) { // if a starts before (or equal to) b
  return a.getEnd() &gt; b.getBegin(); // then it overlaps if a's end is after b's begin
} else { // otherwise, b's begin is before a's begin
  return b.getEnd() &gt; a.getBegin(); // so it overlaps if b's end is after a's begin.
}
</pre>
<p>
An alternative, where overlap includes the edge case when the annotations just touch each other, but have no char position in common:
</p>
<pre>
// (check for non-null omitted)
if (a.getBegin() &lt;= b.getBegin()) { // if a starts before (or equal to) b
  return a.getEnd() &gt;= b.getBegin(); // then it overlaps or abuts if a's end is after or equal to b's begin
} else { // otherwise, b's begin is before a's begin
  return b.getEnd() &gt;= a.getBegin(); // so it overlaps or abuts if b's end is after or equal to a's begin.
}
</pre>
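<p>For experimenting with these tests outside of UIMA, the three checks can be wrapped as methods and tried on a few
sample spans. This is only a sketch: the nested <code>Span</code> class and the method names are hypothetical
stand-ins, and the logic simply mirrors the fragments above.</p>

```java
final class SpanRelations {
  /** Hypothetical stand-in for a UIMA Annotation: only begin/end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
  }

  /** a contains b, ignoring type priorities. */
  static boolean contains(Span a, Span b) {
    return a != null && b != null && a.begin <= b.begin && a.end >= b.end;
  }

  /** a and b have at least one char position in common. */
  static boolean overlaps(Span a, Span b) {
    if (a == null || b == null) {
      return false;
    }
    return a.begin <= b.begin ? a.end > b.begin : b.end > a.begin;
  }

  /** Like overlaps, but also true when the spans merely touch. */
  static boolean overlapsOrAbuts(Span a, Span b) {
    if (a == null || b == null) {
      return false;
    }
    return a.begin <= b.begin ? a.end >= b.begin : b.end >= a.begin;
  }

  public static void main(String[] args) {
    Span a = new Span(0, 10);
    Span touching = new Span(10, 15);
    System.out.println("overlaps: " + overlaps(a, touching));
    System.out.println("overlapsOrAbuts: " + overlapsOrAbuts(a, touching));
  }
}
```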
</blockquote>
</td></tr>
</table>
<table class="subsectionTable">
<tr><td>
<a name="Adjusting an existing annotation's begin and end">
<h2>Adjusting an existing annotation's begin and end
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<p>Sometimes, your code may want to adjust an annotation's begin and end values.
If the annotation is not indexed, there's no issue - just change the value.
But if it is indexed, it's in index(es) in a position determined by its begin and end position, so if you
change these, the item needs to be reindexed (in all the indexes holding it). Typically, only one index
(the Annotation Index for a particular CAS View) is involved, but in general, there could be multiple
indexes involved.</p>
<p>If you are using UIMA version 2.7.0 or later, the UIMA
<a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.cas.updating_indexed_feature_structures">framework</a>
detects updates that would need this re-indexing, and
automatically removes the Feature Structure from all involved index(es), updates the Feature, and then adds the Feature Structure back to the index(es).
</p>
<p>If you are updating, say, both the begin and end values of an annotation, you can improve efficiency by
doing these steps yourself, in your code:
<ul><li>Remove the item from the index(es)</li>
<li>Do both updates</li>
<li>Add the item back into the index(es)</li></ul>
More details <a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.cas.updating_indexed_feature_structures">here</a>.
</p>
<p>Example: if you know a particular annotation is only indexed in one view,
then you can update its begin and end features using
<pre>a.<b>removeFsFromIndexes</b>();
a.setBegin(new_value_begin);
a.setEnd(new_value_end);
a.<b>addToIndexes</b>();</pre>
This is the most efficient way to do this.
</p>
<p>There are a couple of special forms you can use to protect indexes while you're updating features used as keys.
These are useful when you're not sure which feature values might be used as keys in some index.
<pre>
try (AutoCloseable ac = my_cas.<b>protectIndexes</b>()) {
// ... arbitrary user code which updates features
// which may be "keys" in one or more indexes, e.g.
a.setBegin(new_value_begin);
a.setEnd(new_value_end);
}</pre>
or
<pre>
my_cas.<b>protectIndexes</b>(() -&gt; {
// ... arbitrary user code updating "key" features,
// but no checked exceptions are permitted
// (because inside a lambda)
a.setBegin(new_value_begin);
a.setEnd(new_value_end);
});</pre>
These use the framework's automatic detection mechanism, which removes Feature Structures from all involved indexes
if needed, but delays adding them back until the end of the protected section.
</p>
</blockquote>
</td></tr>
</table>
<table class="subsectionTable">
<tr><td>
<a name="Avoid where possible, copying sets of Feature Structures">
<h2>Avoid where possible, copying sets of Feature Structures
</h2>
</a>
</td></tr>
<tr><td>
<blockquote class="subsectionBody">
<p>Operations which iterate over Feature Structures, and put them into a Collection or List, and then
iterate over that list to do some other operations, can often be done directly on the Feature Structures in the CAS,
omitting the first copying of them into a list.
</p>
<p>A frequent speedup can happen when the particular logic can detect when no further items in a (sorted) index
are needed, and the iteration can be stopped early.</p>
<p>For example, you might have code which iterates over all feature structures of a particular type and puts these into a list,
then goes through the list, picks out certain ones, and puts those into another list, which is then returned.
</p>
<p>The first copying can be omitted, by moving the logic of what to include into the first iteration, and producing the second
list directly.</p>
<p>In UIMA Version 3, you can make use of the <a target="_blank" rel="noopener" href="https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select">select framework</a>.
It already accounts for many of the use cases where you might want to start an iteration at a particular position or exit it early.
You can also use its ability to produce streams, combined with Java's takeWhile method (available on streams since Java 9), to exit a stream early.
</p>
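<p>The following sketch shows the single-pass idea with an early exit. It assumes Java 9 or later for
<code>takeWhile</code>; the nested <code>Span</code> class is a hypothetical stand-in for an Annotation, and a
plain sorted list stands in for a stream produced from the CAS.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class EarlyExitStream {
  /** Hypothetical stand-in for a UIMA Annotation: only begin/end offsets. */
  static final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
    @Override public String toString() { return "[" + begin + "," + end + "]"; }
  }

  /**
   * Returns the spans that begin before 'limit' and are at least 'minLength' long.
   * Assumes the input is sorted by begin (ascending), so the stream can stop at
   * the first span that begins at or after 'limit'. No intermediate list is built.
   */
  static List<Span> longSpansBefore(List<Span> sortedByBegin, int limit, int minLength) {
    return sortedByBegin.stream()
        .takeWhile(s -> s.begin < limit)           // early exit: nothing later can qualify
        .filter(s -> s.end - s.begin >= minLength) // selection logic applied in the same pass
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Span> spans = Arrays.asList(
        new Span(0, 3), new Span(4, 12), new Span(13, 14), new Span(15, 30), new Span(31, 50));
    System.out.println(longSpansBefore(spans, 20, 5));
  }
}
```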
</blockquote>
</td></tr>
</table>
</blockquote>
</p>
</td></tr>
</table>
</td>
</tr>
<!-- FOOTER -->
<tr><td colspan="2">
<hr noshade="" size="1"/>
</td></tr>
<tr><td colspan="2">
<table class="pageFooter">
<tr>
<td><a href="index.html">Home</a></td>
<td><a href="privacy-policy.html">Privacy Policy</a></td>
<td style="font-size:75%">
Copyright &#169; 2006-2013, The Apache Software Foundation.<br/>
Apache UIMA, UIMA, the Apache UIMA logo and the Apache Feather logo are trademarks of The Apache Software Foundation.<br/>
All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</td>
<td><a href="mailto:dev@uima.apache.org">Contact us</a></td>
</tr>
</table>
</td></tr>
</table>
</body>
</html>