<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
<!ENTITY imgroot "images/tools/ruta/language/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at Unless required
by applicable law or agreed to in writing, software distributed under the
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<section id="">
<title>Internal indexing and reindexing</title>
UIMA Ruta, or to be more precise the main analysis engine RutaEngine, creates,
stores and updates additional indexing information directly in the CAS.
This indexing is not related to the annotation indexes of UIMA itself.
The internal indexing provides additional information that is utilized only
by the Ruta rules. This section gives an overview of why and how it is included in
UIMA Ruta, and of how Ruta can be configured to optimize its performance.
<section id="">
<title>Why additional indexing?</title>
The internal indexing is utilized for many different parts of functionality within Ruta.
The need for the indexing is motivated by two main features.
Ruta provides different language elements, for example conditions, which are fulfilled
depending on some investigation of the CAS annotation indexes. There are several
conditions like PARTOF which require many index operations in the worst case. Here, potentially
the complete index needs to be iterated in order to validate whether a specific annotation
is part of another annotation of a specific type. This check needs to be performed
for each considered annotation, for each rule match, and for each rule where a PARTOF
condition is used. Without additional internal indexing, Ruta would be too slow to
actually be useful. With the internal indexing, the check is just a fast lookup. The same applies to many other
language elements and conditions like STARTSWITH and ENDSWITH.
A second necessity is the coverage-based visibility concept of Ruta.
Annotations and any text spans are invisible if their begin or end is covered by some
invisible annotation, i.e., an annotation of a type that is configured to be invisible.
This is a powerful feature that enables many different engineering approaches and also makes
rules more maintainable. For a (reasonably fast) implementation of this feature,
it is necessary to know for each position whether it is covered by annotations of specific types.
The internal indexing comes, however, with some costs: it requires time and memory.
The information needs to be collected and/or updated for every Ruta script (RutaEngine)
in a pipeline. This may require many operations if many annotations are available.
In addition, storing this information at potentially all text positions
requires a lot of memory. Nevertheless, the advantages outweigh the disadvantages considerably.
<section id="">
<title>How is it stored, created and updated?</title>
The internal indexing refers to three kinds of information that are additionally stored:
<orderedlist numeration="arabic">
All annotations of all relevant types that begin at a position.
All annotations of all relevant types that end at a position.
All types of annotations that cover a position.
The information is stored in additional annotations of the type RutaBasic,
which provides additional fields for these three kinds of information by
implementation, and not by features. The RutaBasic annotations provide a complete
disjoint partitioning of the document. They begin and end at every position where an
annotation starts or ends. This also includes, for example, one RutaBasic for each
SPACE annotation, registering which annotations start and end at these offsets.
They are automatically created and also extended if new smaller annotations are added.
Their initial creation is called <quote>indexing</quote>. Updating them when
RutaBasics are already available, but other Java analysis engines have potentially added or
removed annotations, is called <quote>reindexing</quote>.
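The partitioning described above can be sketched in plain Java. This is a minimal illustration that does not use the UIMA or Ruta classes: Span and Basic are hypothetical stand-ins for an annotation and a RutaBasic, and the method names are assumptions made for this example only. It shows how the three kinds of information turn a PARTOF-like check into a simple lookup.

```java
import java.util.*;

class BasicPartitionSketch {

    // Hypothetical stand-in for an annotation: a type name plus offsets.
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Hypothetical stand-in for a RutaBasic: one cell of the partitioning.
    static final class Basic {
        final int begin;
        final int end;
        final List<Span> beginning = new ArrayList<>(); // annotations beginning here
        final List<Span> ending = new ArrayList<>();    // annotations ending here
        final Set<String> covering = new TreeSet<>();   // types covering this cell
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // "Indexing": cut the text at every begin and end offset and register
    // the three kinds of information in each resulting cell.
    static List<Basic> index(List<Span> spans) {
        TreeSet<Integer> cuts = new TreeSet<>();
        for (Span s : spans) {
            cuts.add(s.begin);
            cuts.add(s.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer previous = null;
        for (Integer cut : cuts) {
            if (previous != null) {
                basics.add(new Basic(previous, cut));
            }
            previous = cut;
        }
        for (Basic b : basics) {
            for (Span s : spans) {
                if (s.begin == b.begin) b.beginning.add(s);
                if (s.end == b.end) b.ending.add(s);
                if (s.begin <= b.begin && b.end <= s.end) b.covering.add(s.type);
            }
        }
        return basics;
    }

    // A PARTOF-like check becomes a set lookup instead of an index scan.
    static boolean partOf(Basic basic, String type) {
        return basic.covering.contains(type);
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
            new Span("Sentence", 0, 11), new Span("W", 0, 5),
            new Span("SPACE", 5, 6), new Span("W", 6, 11));
        for (Basic b : index(spans)) {
            System.out.println(b.begin + "-" + b.end + " covered by " + b.covering);
        }
    }
}
```

The sketch builds the cells with a nested loop for brevity; it only illustrates what is stored, not how the RutaEngine computes it efficiently.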
There are several configuration parameters (see the parameters with INDEX and REINDEX in
their name) that influence which types and annotations are indexed and reindexed.
In the default configuration, all annotations are indexed, but only new annotations
are reindexed (ReindexUpdateMode ADDITIVE). This means that if an analysis engine between
two RutaEngines removes some annotations, the internal indexing of the second RutaEngine will not be up to date.
A rule that relies on the internal indexing will match differently for these annotations,
e.g., a PARTOF condition is still fulfilled although the annotation is not present in the
UIMA indexes anymore. If necessary, this problem can be avoided either by switching to the more costly
ReindexUpdateMode COMPLETE, or by updating the internal indexing directly in the Java analysis
engine using the class RutaBasicUtils.
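The difference between the update modes can be illustrated with a small toy model in plain Java. It is only a sketch under simplifying assumptions: annotations are represented as strings, the internal indexing as a set, and the enum Mode merely mirrors the ReindexUpdateMode values named in this section. None of this is the actual RutaEngine implementation.

```java
import java.util.*;

class ReindexModeSketch {

    // Hypothetical enum mirroring the update modes named in the text.
    enum Mode { COMPLETE, ADDITIVE, NONE }

    // Returns the internal indexing after a reindexing step, given the
    // annotations that are currently present in the UIMA indexes.
    static Set<String> reindex(Set<String> internal, Set<String> casIndex, Mode mode) {
        Set<String> updated = new TreeSet<>(internal);
        switch (mode) {
            case COMPLETE:
                // Rebuild from scratch: removals and additions are both reflected.
                updated.clear();
                updated.addAll(casIndex);
                break;
            case ADDITIVE:
                // Only new annotations are added: removed ones stay registered.
                updated.addAll(casIndex);
                break;
            case NONE:
                // Reindexing is skipped entirely.
                break;
        }
        return updated;
    }

    public static void main(String[] args) {
        Set<String> internal = new TreeSet<>(Arrays.asList("Person[0,5]", "City[10,16]"));
        // Another engine removed City[10,16] and added Date[20,30].
        Set<String> cas = new TreeSet<>(Arrays.asList("Person[0,5]", "Date[20,30]"));
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + reindex(internal, cas, mode));
        }
    }
}
```

In this model, the removed annotation stays registered under ADDITIVE, which corresponds to the PARTOF condition that is still fulfilled although the annotation is gone.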
<section id="">
<title>How to optimize the performance?</title>
There are many different options and possibilities to optimize the runtime performance and
memory footprint of a Ruta script by configuring the RutaEngine. The most useful configuration,
however, depends on the actual situation: How much information is available about the pipeline
and the types of annotations and their update operations? In the following, a selection
of optimizations is discussed.
If there is a RutaEngine in a pipeline, and either the previous analysis engine was also
a RutaEngine or it is known that the analysis engines before (until the last RutaEngine) did not
modify any (relevant) annotations, then the ReindexUpdateMode NONE can be applied, which simply
skips the internal reindexing. This can improve the runtime performance.
The configuration parameter indexOnly can be used to restrict the indexing to relevant types.
The parameter indexSkipTypes can be utilized to specify types of annotations that are not relevant.
These types can include more technical annotations for metadata, logging or debug information.
Thus, the set of types that need to be considered for internal indexing can be restricted, which
makes the indexing faster and requires less memory.
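The combined effect of an allow list and a skip list can be sketched as a simple predicate. This is an illustration only: it assumes that an empty indexOnly list means that all types are allowed, and it ignores type inheritance, which the real RutaEngine may handle differently. The class and method names are hypothetical.

```java
import java.util.*;

class IndexTypeFilterSketch {

    // A type is indexed if it passes the allow list (assumed to allow
    // everything when empty) and is not named in the skip list.
    static boolean isIndexed(String type, Set<String> indexOnly, Set<String> indexSkipTypes) {
        boolean allowed = indexOnly.isEmpty() || indexOnly.contains(type);
        return allowed && !indexSkipTypes.contains(type);
    }

    public static void main(String[] args) {
        Set<String> indexOnly = Collections.emptySet();
        Set<String> skip = new HashSet<>(Arrays.asList("DebugInfo", "DocumentMetaData"));
        for (String type : Arrays.asList("Token", "Person", "DebugInfo")) {
            System.out.println(type + " indexed: " + isIndexed(type, indexOnly, skip));
        }
    }
}
```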
For a reindexing/updating step the corresponding reindex parameters need to be considered.
Even relevant annotations do not need to be reindexed/updated all the time.
The updating can, for example, be restricted to
types that have been potentially modified by previous Java analysis engines according to their capabilities.
Additionally, some types are rather final considering their offsets. They are only created once
and are not modified by later analysis engines. These types commonly include
Tokens and similar annotations. They do not need to be reindexed, which can be configured using the
reindexSkipTypes parameter.
An extension to this is the pair of parameters indexOnlyMentionedTypes/reindexOnlyMentionedTypes.
Here, the relevant types are collected from the
actual script: only the types that are actually used in the rules need an up-to-date
internal indexing. This mainly increases the indexing speed. The feature is illustrated with an example:
Consider a larger pipeline with many annotations of different types, and with many
modifications since the last RutaEngine. A script with one rule does not require much reindexing,
only for the types that are used in this rule.
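The example can be made concrete with a rough sketch in plain Java. It assumes a naive way of collecting the mentioned types, namely scanning the script text for known type names, which is not how the RutaEngine analyzes a script. The type names, the annotation counts and the rule are invented for illustration.

```java
import java.util.*;

class MentionedTypesSketch {

    // Naive collection of mentioned types: split the script into
    // identifier-like tokens and keep those that are known type names.
    static Set<String> mentionedTypes(String script, Set<String> knownTypes) {
        Set<String> mentioned =
            new TreeSet<>(Arrays.asList(script.split("[^A-Za-z0-9_.]+")));
        mentioned.retainAll(knownTypes);
        return mentioned;
    }

    // Number of annotations that need to be reindexed for the given types.
    static int annotationsToReindex(Map<String, Integer> countsByType, Set<String> types) {
        int total = 0;
        for (String type : types) {
            total += countsByType.getOrDefault(type, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("Person", 10);
        counts.put("Sentence", 5);
        counts.put("Found", 0);
        counts.put("Token", 1000);
        counts.put("DebugInfo", 200);

        String script = "Person{PARTOF(Sentence) -> MARK(Found)};";
        Set<String> mentioned = mentionedTypes(script, counts.keySet());
        System.out.println("mentioned types: " + mentioned);
        System.out.println("to reindex: " + annotationsToReindex(counts, mentioned)
            + " of " + annotationsToReindex(counts, counts.keySet()));
    }
}
```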