 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 <!ENTITY imgroot "images/tools/ruta/language/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
   license agreements. See the NOTICE file distributed with this work for additional
   information regarding copyright ownership. The ASF licenses this file to
   you under the Apache License, Version 2.0 (the "License"); you may not use
   this file except in compliance with the License. You may obtain a copy of
   the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
   by applicable law or agreed to in writing, software distributed under the
   License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
   OF ANY KIND, either express or implied. See the License for the specific
   language governing permissions and limitations under the License. -->

 <section id="ugr.tools.ruta.language.internal_indxexing">
   <title>Internal indexing and reindexing</title>
   <para>
    UIMA Ruta, or more precisely its main analysis engine RutaEngine, creates,
    stores and updates additional indexing information directly in the CAS.
    This indexing is not related to the annotation indexes of UIMA itself.
    The internal indexing provides additional information that is used only
    by the Ruta rules. This section explains why the internal indexing exists, how it
    works, and how Ruta can be configured to optimize its performance.
   </para>
   <section id="ugr.tools.ruta.language.internal_indxexing.why">
     <title>Why additional indexing?</title>
     <para>
       The internal indexing is used by many different parts of Ruta.
       It is motivated mainly by two important features.
 	</para>
 	<para>
       Ruta provides different language elements, for example conditions, which are fulfilled
       depending on some investigation of the CAS annotation indexes. Several
       conditions, such as PARTOF, require many index operations in the worst case. Here, potentially
       the complete index needs to be iterated in order to determine whether a specific annotation
       is part of another annotation of a specific type. This check needs to be performed
       for each considered annotation, for each rule match, and for each rule in which a PARTOF
       condition is used. Without additional internal indexing, Ruta would be too slow to
       actually be useful. With the internal indexing, the check is just a fast lookup. The same
       applies to many other language elements and conditions, such as STARTSWITH and ENDSWITH.
     </para>
     <para>
       A second necessity is the coverage-based visibility concept of Ruta.
       Annotations and any text spans are invisible if their begin or end is covered by some
       invisible annotation, i.e., an annotation of a type that is configured to be invisible.
       This is a powerful feature that enables many different engineering approaches and also makes
       rules more maintainable. For a reasonably fast implementation of this feature,
       it is necessary to know for each position whether it is covered by annotations of specific types.
     </para>
     <para>
       The internal indexing comes, however, at a cost: it requires time and memory.
       The information needs to be collected and/or updated for every Ruta script (RutaEngine)
       in a pipeline. This may require many operations if many annotations are available.
       In addition, storing this information at potentially all text positions
       requires a lot of memory. Nevertheless, the advantages outweigh the disadvantages considerably.
     </para>
   </section>
   <section id="ugr.tools.ruta.language.internal_indxexing.how">
     <title>How is it stored, created and updated?</title>
     <para>
       The internal indexing consists of three kinds of information that are stored additionally:
     </para>
     <orderedlist numeration="arabic">
     <listitem>
     <para>
       All annotations of all relevant types that begin at a position.
     </para>
     </listitem>
     <listitem>
     <para>
       All annotations of all relevant types that end at a position.
     </para>
     </listitem>
     <listitem>
     <para>
       All types of annotations that cover a position.
     </para>
     </listitem>
     </orderedlist>
     <para>
       The information is stored in additional annotations of the type RutaBasic,
       which provides additional fields for these three kinds of information in its
       implementation rather than in features. The RutaBasic annotations provide a complete disjoint
       partitioning of the document. They begin and end at every position where an
       annotation starts or ends. This also includes, for example, one RutaBasic for each
       SPACE annotation, registering which annotations start and end at these offsets.
       They are created automatically and are also extended when new, smaller annotations are added.
       Their initial creation is called <quote>indexing</quote>. Updating them
       when RutaBasics are already available, but other Java analysis engines have potentially added or
       removed annotations, is called <quote>reindexing</quote>.
     </para>
     <para>
       Several configuration parameters (see the parameters with INDEX and REINDEX in their name)
       influence which types and annotations are indexed and reindexed.
       In the default configuration, all annotations are indexed, but only new annotations
       are reindexed (ReindexUpdateMode ADDITIVE). This means that if an analysis engine between
       two RutaEngines removes some annotations, the internal indexing of the second RutaEngine will not be up to date.
       A rule that relies on the internal indexing will then match differently for these annotations,
       e.g., a PARTOF condition is still fulfilled although the annotation is no longer present in the
       UIMA indexes. If necessary, this problem can be avoided either by switching to the more costly
       ReindexUpdateMode COMPLETE, or by updating the internal indexing directly in the Java analysis
       engine using the class RutaBasicUtils.
     </para>
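     <para>
       As a minimal sketch, and assuming that the update mode is exposed as a string parameter
       named reindexUpdateMode in the analysis engine descriptor of the second RutaEngine,
       switching to the COMPLETE mode could look like the following fragment. The parameter
       name is an assumption and should be checked against the parameter reference.
     </para>
     <programlisting><![CDATA[<configurationParameterSettings>
  <!-- Assumed parameter name; the value is one of the modes named in this section. -->
  <nameValuePair>
    <name>reindexUpdateMode</name>
    <value>
      <string>COMPLETE</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>
     <para>
       The same fragment with the value ADDITIVE would correspond to the default behavior described above.
     </para>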
   </section>
   <section id="ugr.tools.ruta.language.internal_indxexing.optimize">
     <title>How to optimize the performance?</title>
     <para>
       There are many options for optimizing the runtime performance and
       memory footprint of Ruta scripts by configuring the RutaEngine. The most useful configuration,
       however, depends on the actual situation: How much information is available about the pipeline,
       the types of annotations, and their update operations? In the following, a selection
       of optimizations is discussed.
     </para>
     <para>
       If there is a RutaEngine in a pipeline, and either the previous analysis engine was also
       a RutaEngine or it is known that the analysis engines since the last RutaEngine did not
       modify any (relevant) annotations, then the ReindexUpdateMode NONE can be applied, which simply
       skips the internal reindexing. This can improve the runtime performance.
     </para>
     <para>
       The configuration parameter indexOnly can be used to restrict the indexing to relevant types.
       The parameter indexSkipTypes can be used to specify types of annotations that are not relevant.
       These types can include more technical annotations for metadata, logging or debug information.
       Thus, the set of types that need to be considered for internal indexing can be restricted, which
       makes the indexing faster and reduces its memory consumption.
     </para>
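     <para>
       The following descriptor fragment sketches how irrelevant types might be excluded from
       the internal indexing. It assumes that indexSkipTypes is a multi-valued string parameter;
       the type names org.example.LogEntry and org.example.DebugInfo are hypothetical and only
       serve as placeholders for technical annotation types in a concrete pipeline.
     </para>
     <programlisting><![CDATA[<configurationParameterSettings>
  <nameValuePair>
    <name>indexSkipTypes</name>
    <value>
      <array>
        <!-- Hypothetical type names, replace with the technical types of the pipeline. -->
        <string>org.example.LogEntry</string>
        <string>org.example.DebugInfo</string>
      </array>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>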
     <para>
       For a reindexing/updating step, the corresponding reindex parameters need to be considered.
       Even relevant annotations do not need to be reindexed/updated all the time.
       The updating can, for example, be restricted to
       types that have potentially been modified by previous Java analysis engines according to their capabilities.
       Additionally, some types are rather final with respect to their offsets. They are created only once
       and are not modified by later analysis engines. These types commonly include
       Tokens and similar annotations. They do not need to be reindexed, which can be configured using the
       reindexSkipTypes parameter.
     </para>
     <para>
       An extension to this are the parameters indexOnlyMentionedTypes and reindexOnlyMentionedTypes.
       Here, the relevant types are collected from the
       actual script: only the types that are actually used in the rules need an up-to-date
       internal indexing. This mainly increases the indexing speed. An example illustrates this feature:
       In a larger pipeline with many annotations of different types, and with many
       modifications since the last RutaEngine, a script with one rule does not require much reindexing,
       only for the types that are used in this rule.
     </para>
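     <para>
       The reindex parameters discussed above might be combined as in the following sketch.
       It assumes that reindexOnlyMentionedTypes is a boolean parameter and that reindexSkipTypes
       is a multi-valued string parameter; the type name org.example.Token is hypothetical and
       stands for a type whose offsets are not changed after its creation.
     </para>
     <programlisting><![CDATA[<configurationParameterSettings>
  <!-- Reindex only the types that are mentioned in the rules of the script. -->
  <nameValuePair>
    <name>reindexOnlyMentionedTypes</name>
    <value>
      <boolean>true</boolean>
    </value>
  </nameValuePair>
  <!-- Skip types that are created once and not modified afterwards. -->
  <nameValuePair>
    <name>reindexSkipTypes</name>
    <value>
      <array>
        <!-- Hypothetical type name. -->
        <string>org.example.Token</string>
      </array>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>
     <para>
       Whether such a restriction is safe depends on the pipeline: a type that is skipped here
       but modified by an intermediate analysis engine would leave the internal indexing out of date.
     </para>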
   </section>
 </section>