<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/ruta/language/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
%uimaents;
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
  license agreements. See the NOTICE file distributed with this work for additional 
  information regarding copyright ownership. The ASF licenses this file to 
  you under the Apache License, Version 2.0 (the "License"); you may not use 
  this file except in compliance with the License. You may obtain a copy of 
  the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
  by applicable law or agreed to in writing, software distributed under the 
  License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
  OF ANY KIND, either express or implied. See the License for the specific 
  language governing permissions and limitations under the License. -->

<section id="ugr.tools.ruta.language.internal_indxexing">
  <title>Internal indexing and reindexing</title>
  <para>
   UIMA Ruta, or to be more precise the main analysis engine RutaEngine, creates, 
   stores and updates additional indexing information directly in the CAS. 
   This indexing is not related to the annotation indexes of UIMA itself. 
   The internal indexing provides additional information, which is only utilized 
   by the Ruta rules. This section provides an overview on why and how it is integrated in
   UIMA Ruta, and how Ruta can be configured in order to optimize its performance.
  </para>
  <section id="ugr.tools.ruta.language.internal_indxexing.why">
    <title>Why additional indexing?</title>
    <para>
	  The internal indexing plays a an essential role in different parts of functionality within Ruta.
	  The need for the indexing is motivated by two main and important features. 
	</para>
	<para>
	  Ruta provides different language elements like conditions, which are fulfilled 
	  depending on some investigation of the CAS annotation indexes. There are several 
	  conditions like PARTOF, which require many index operations in the worst case. Here, potentially
	  the complete index needs to be iterated in order to validate if a specific annotation 
	  is part of another annotation of a specific type. This check needs to be performed 
	  for each considered annotation, for each rule match and for each rule where a PARTOF 
	  condition is used. Without additional internal indexing, Ruta would be too slow to 
	  actually be useful. With this feature, the process is just a fast lookup. This situation applies also for many other language elements and 
	  conditions like STARTSWITH and ENDSWITH.
    </para>
    <para>
      A second necessity is the coverage-based visibility concept of Ruta.
      Annotations and any text spans are invisible if their begin or end is covered by some 
      invisible annotation, i.e., an annotation of a type that is configured to be invisible.
      This is a powerful feature that enables many different engineering approaches and makes the
      rules more maintainable as well. For a (reasonably fast) implementation of this feature, 
      it is necessary to know for each position, if it is covered by annotations of specific types.
    </para>
    <para>
      The internal indexing comes, however, at some costs. The indexing requires time and memory.
      The information needs to be collected and/or updated for every Ruta script (RutaEngine) 
      in a pipeline. This may be expensive operation-wise, if the scripts consist of many annotations to be checked.
      Straightforward, the storage of this information at potentially all text positions 
      requires a lot memory. Nevertheless, the advantages outweigh the disadvantages considerably. 
    </para>
  </section>
  <section id="ugr.tools.ruta.language.internal_indxexing.how">
    <title>How is it stored, created and updated?</title>
    <para>
      The internal indexing refers to three types of information that is additionally stored:
    </para>
    <orderedlist numeration="arabic">
    <listitem>
    <para>
      All annotations of all relevant types that begin at a position.
    </para>
    </listitem>
    <listitem>
    <para>
      All annotations of all relevant types that end at a position.
    </para>
    </listitem>
    <listitem>
    <para>
      All types of annotations that cover a position.
    </para>
    </listitem>
    </orderedlist>
    <para>
      The information is stored in additional annotations of the type RutaBasic, 
      which provides by implementation, and not by features, additional fields for
      these three kinds of information. RutaBasic types provide a complete disjunct 
      partitioning of the document. They begin and end at every position where an 
      annotation starts and ends. This also includes, for examples, one RutaBasic for each 
      SPACE annotation, registering which annotation start and end at these offsets.
      They are automatically created and also extended if new smaller annotations are added.
      Their initial creation is called <quote>indexing</quote> and their updating,
      if RutaBasics are available, while other Java analysis engines potentially added or
      removed annotations, is called <quote>reindexing</quote>. 
    </para>
    <para> 
      There are several configuration 
      parameters (see parameters with INDEX and REINDEX in their name) that can influence what types and annotations are indexed and reindexed.
      In the default configuration, all annotations are indexed, but only new annotations 
      are reindexed (ReindexUpdateMode ADDITIVE). This means that if an analysis engine in between
      two RutaEngine removes some annotations, the second RutaEngine will not be up to date.
      A rule which relies on the internal indexing will match differently for these annotations,
      e.g., a PARTOF condition is still fulfilled although the annotation is not present in the 
      UIMA indexes anymore. This problem can be avoided (if necessary) either by switching to a more costly
      ReindexUpdateMode COMPLETE, or by updating the internal indexing directly in the Java analysis
      engine by using the class RutaBasicUtils.
    </para>
  </section>
  <section id="ugr.tools.ruta.language.internal_indxexing.optimize">
    <title>How to optimize the performance?</title>
    <para>
      The are many different options and possibilities to optimize the runtime performance and
      memory footprint of a Ruta script, by configuring the RutaEngine. The most useful configuration, 
      however, depends on the actual situation: How much information is available about the pipeline
      and the types of annotations and their update operations? In the following, a selection 
      of optimizations are discussed.
    </para>
    <para>
      If there is a RutaEngine in a pipeline, and either the previous analysis engine was also 
      a RutaEngine or it is known that the analysis engines before (until the last RutaEngine) did not
      modify any (relevant) annotations, then the ReindexUpdateMode NONE can be applied, which simply 
      skips the internal reindexing. This can improve the runtime performance.
    </para>
    <para>
      The configuration parameters indexOnly can be restricted to relevant types.
      The parameter indexSkipTypes can be utilized to specify types of annotations that are not relevant. 
      These types can include more technical annotations for metadata, logging or debug information.
      Thus, the set of types that need to be considered for internal indexing can be restricted, which
      makes the indexing faster and requires less memory.
    </para>
      For a reindexing/updating step, the corresponding reindex parameters need to be considered.
      Even relevant annotations do not need to be reindexed/updated all the time.
      The updating can, for example, be restricted to
      types that have been potentially modified by previous Java analysis engines according to their capabilities.
      Additionally, some types are rather final considering their offsets. They are only create once 
      and are not modified by later analysis engines. These types commonly include 
      Tokens and similar annotations. They do not need to be reindexed, which can be configured using the
      reindexSkipTypes parameter.
    <para>
      An extension to this is the parameter indexOnlyMentionedTypes/reindexOnlyMentionedTypes. 
      Here, the relevant types are collected using the
      actual script:  the types that are actually used in the rules and thus their internal indexing needs
      to be up to date. This can increase the indexing speed. This feature is highlighted in the following example:
      Considering a larger pipeline with many annotations of different types, and also with many 
      modifications since the last RutaEngine, a script with one rule does not require much reindexing,
      except the exclusive types used in this rule.
    </para>
  </section>
</section>