<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tools.ruta/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.ruta.overview">
<title>Apache UIMA Ruta Overview</title>
<para>
</para>
<section id="ugr.tools.ruta.overview.intro">
<title>What is Apache UIMA Ruta?</title>
<para>
Apache UIMA Ruta&#8482; is a rule-based script language supported by Eclipse-based tooling.
The language is designed to enable rapid development of text processing applications within Apache UIMA&#8482;.
A special focus lies on the intuitive and flexible domain specific language for defining
patterns of annotations. Writing rules for information extraction or other text processing
applications is a tedious process. The Eclipse-based tooling for UIMA Ruta, called the Apache UIMA Ruta Workbench,
was created to support the user and to facilitate every step when writing UIMA Ruta rules. Both the
Ruta rule language and the UIMA Ruta Workbench integrate smoothly with Apache UIMA.
</para>
</section>
<section id="ugr.tools.ruta.overview.gettingstarted">
<title>Getting started</title>
<para>
This section gives a short roadmap for reading the documentation and some recommendations on how to
start developing UIMA Ruta-based applications. This documentation assumes that the reader knows about
the core concepts of Apache UIMA. Knowledge of the meaning and usage of the terms <quote>CAS</quote>,
<quote>Feature Structure</quote>, <quote>Annotation</quote>, <quote>Type</quote>, <quote>Type System</quote>
and <quote>Analysis Engine</quote> is required. Please refer to the documentation of Apache UIMA for an introduction.
</para>
<para>
Inexperienced users who want to learn about UIMA Ruta can start with the next two sections:
<xref linkend="ugr.tools.ruta.overview.coreconcepts"/>
gives a short overview of the core ideas and features of the UIMA Ruta language and Workbench.
This section introduces the main concepts of the UIMA Ruta language. It explains how UIMA Ruta rules
are composed and applied, and discusses the advantages of the UIMA Ruta system.
The following <xref linkend="ugr.tools.ruta.overview.examples"/> approaches the UIMA Ruta language using a different
perspective. Here, the language is introduced by examples. The first example starts by explaining what a simple rule
looks like, and each following example extends the syntax or semantics of the UIMA Ruta language.
After reading these two sections, the reader is expected to have gained enough
knowledge to start writing a first UIMA Ruta-based application.
</para>
<para>
The UIMA Ruta Workbench was created to support the user and to facilitate the development process. It is strongly recommended to
use this Eclipse-based IDE since it, for example, automatically configures the component descriptors and provides editing support like
syntax checking. <xref linkend="section.ugr.tools.ruta.workbench.install"/> describes how the UIMA Ruta Workbench is installed.
UIMA Ruta rules can also be applied to a CAS without using the UIMA Ruta Workbench.
<xref linkend="ugr.tools.ruta.ae.basic.apply"/> contains examples of how to execute UIMA Ruta rules in plain Java.
A good way to get started with UIMA Ruta is to experiment with an example UIMA Ruta project, e.g.,
<quote>ExampleProject</quote> in the example-projects of the UIMA Ruta source release.
This UIMA Ruta project contains some simple rules for processing citation metadata.
</para>
<para>
<xref linkend="ugr.tools.ruta.language.language"/> and <xref linkend="ugr.tools.ruta.workbench"/> provide
more detailed descriptions and can be referred to in order to gain knowledge of specific parts
of the UIMA Ruta language or the UIMA Ruta Workbench.
</para>
</section>
<section id="ugr.tools.ruta.overview.coreconcepts">
<title>Core Concepts</title>
<para>
The UIMA Ruta language is an imperative rule language extended with scripting elements. A UIMA Ruta rule defines a
pattern of annotations with additional conditions. If this pattern applies, then the actions of the rule are performed
on the matched annotations. A rule is composed of a sequence of rule elements, and a rule element essentially consists of four parts:
a matching condition, an optional quantifier, a list of conditions and a list of actions.
The matching condition is typically an annotation type, by which the rule element matches on the covered text of one of those annotations.
The quantifier specifies whether the rule element must match successfully and how often it may match.
The list of conditions specifies additional constraints that the matched text or annotations need to fulfill. The list of actions defines
the consequences of the rule and often creates new annotations or modifies existing annotations.
The actions are only applied if all rule elements of the rule have successfully matched. Examples of UIMA Ruta rules can be found in
<xref linkend="ugr.tools.ruta.overview.examples"/>.
</para>
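<para>
As a minimal sketch of these four parts, the following rule is annotated with comments. The type name
<quote>CapitalizedSequence</quote> is hypothetical and only serves as an illustration; <quote>CW</quote> is
assumed to be available through the default seeding, which is explained in the examples later on.
</para>
<programlisting><![CDATA[DECLARE CapitalizedSequence;
// matching condition: CW (capitalized word)
// quantifier:         +  (greedy, at least one match)
// condition:          -PARTOF(CapitalizedSequence)
// action:             MARK(CapitalizedSequence)
CW+{-PARTOF(CapitalizedSequence) -> MARK(CapitalizedSequence)};]]></programlisting>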
<para>
When UIMA Ruta rules are applied to a document, or more precisely to a CAS, they are always grouped in a script file. However, a UIMA Ruta
script file does not only contain rules, but also other statements. First of all, each script file starts with a package declaration followed by
a list of optional imports. Then, common statements like rules, type declarations or blocks build the body and functionality of a script.
<xref linkend="ugr.tools.ruta.ae.basic.apply"/> gives an example of how UIMA Ruta scripts can be applied in plain Java.
UIMA Ruta script files are naturally organized in UIMA Ruta projects, which is a concept of the UIMA Ruta Workbench.
The structure of a UIMA Ruta project is described in <xref linkend="section.ugr.tools.ruta.workbench.projects"/>.
</para>
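<para>
The layout of a script file described above can be sketched as follows. The package name and the imported script are
borrowed from the examples in this chapter and act as placeholders; the rule in the body is only an illustration.
</para>
<programlisting><![CDATA[// package declaration
PACKAGE uima.ruta.example;

// optional imports, here another script file
SCRIPT uima.ruta.example.SecondaryScript;

// body: type declarations, rules and other statements
DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};]]></programlisting>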
<para>
The inference of UIMA Ruta rules, that is, the way in which the rules are applied, can be described as imperative depth-first matching.
In contrast to similar rule-based systems, UIMA Ruta rules are applied in the order they are defined in the script.
The imperative execution of the matching rules may have disadvantages, but it also has many advantages, such as faster development or
easier explanation. The second main property of the UIMA Ruta inference is the depth-first matching. When a rule matches on a pattern of annotations,
an alternative is always tracked until it has matched or failed before the next alternative is considered. The behavior of a rule may change if
it has already matched on an earlier alternative and thus has performed an action that influences some constraints of the rule.
Examples of how UIMA Ruta rules are applied are given in <xref linkend="ugr.tools.ruta.overview.examples"/>.
</para>
<para>
The UIMA Ruta language provides the possibility to approach an annotation problem in different ways. Let us distinguish
some approaches as an example.
It is common in the UIMA Ruta language to create many annotations of different types. These annotations are probably not the targeted annotations of the domain,
but can be helpful to incrementally approximate the annotation of interest. This enables the user to work <quote>bottom-up</quote> and <quote>top-down</quote>.
In the former approach, the rules incrementally add more complex annotations using simple ones until the target annotation can be created.
In the latter approach, the rules get more specific while partitioning the document into smaller segments, which eventually results in the targeted annotation.
By using many <quote>helper</quote> annotations, the engineering task becomes easier and more comprehensible.
The UIMA Ruta language provides distinctive language elements for different tasks. There are, for example, actions
that are able to create new annotations, actions that are able to remove annotations and actions that are able to modify the
offsets of annotations. This enables, amongst other things, a transformation-based approach. The user starts by creating general rules that are able to
annotate most of the text fragments of interest. Then, instead of making these rules more complex by adding more conditions for situations where they fail,
additional rules are defined that correct the mistakes of the general rules, e.g., by deleting false positive annotations.
<xref linkend="ugr.tools.ruta.overview.examples"/> provides some examples of how UIMA Ruta rules can be engineered.
</para>
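<para>
A transformation-based approach of this kind might look like the following sketch. The type <quote>Animal</quote>
and both regular expressions are made up for illustration: a deliberately general rule annotates every word starting
with <quote>cat</quote>, and a second rule removes the false positives again instead of complicating the first rule.
</para>
<programlisting><![CDATA[DECLARE Animal;
// general rule: annotate every word that starts with "cat"
W{REGEXP("cat.*") -> MARK(Animal)};
// correction rule: remove false positives such as "catalog"
Animal{REGEXP("catalog.*") -> UNMARK(Animal)};]]></programlisting>
<para>
Since rules are applied in the order in which they are defined, the correction rule only sees the annotations that the
general rule has already created.
</para>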
<para>
Writing rules manually is a tedious and error-prone process. The <link linkend="ugr.tools.ruta.workbench">UIMA Ruta Workbench</link>
was developed to facilitate writing rules by providing as much tooling support as possible. This includes, for example, syntax checking and auto completion, which
make the development less error-prone. The user can annotate documents and use these documents as unit tests for test-driven development or
quality maintenance. Sometimes, it is necessary to debug the rules because they do not match as expected. In this case, the explanation perspective provides views
that explain every detail of the matching process. Finally, the UIMA Ruta language can also be used by the tooling, for example, by the <quote>Query</quote> view.
Here, UIMA Ruta rules can be used as query statements in order to investigate annotated documents.
</para>
<para>
UIMA Ruta smoothly integrates with Apache UIMA. First of all, the UIMA Ruta rules are applied using a generic Analysis Engine and thus UIMA Ruta scripts can
easily be added to Apache UIMA pipelines. UIMA Ruta also provides the functionality to import and use other UIMA components like Analysis Engines and Type Systems.
UIMA Ruta rules can refer to every type defined in an imported type system, and the UIMA Ruta Workbench generates a type system descriptor file containing all
types that were defined in a script file. Any Analysis Engine can be executed by rules as long as its implementation is available in the classpath. Therefore,
functionality outsourced in an arbitrary Analysis Engine can be added and used within UIMA Ruta.
</para>
</section>
<section id="ugr.tools.ruta.overview.examples">
<title>Learning by Example</title>
<para>
This section gives an introduction to the UIMA Ruta language by explaining the rule syntax
and inference with some simplified examples. It is recommended to use the UIMA Ruta Workbench to write UIMA Ruta rules
in order to gain advantages like syntax checking. A short description of how to install the UIMA Ruta Workbench
is given <link linkend="section.ugr.tools.ruta.workbench.install">here</link>. The following examples make use of the
annotations added by the default seeding of the UIMA Ruta Analysis Engine. Their meaning is explained along with the examples.
</para>
<para>
The first example consists of a declaration of a type followed by a simple rule. Type declarations always start with the keyword
<quote>DECLARE</quote> followed by the short name of the new type. The namespace of the type is equal to the package declaration of the script file.
If there is no package declaration, then the types declared in the script file have no namespace.
There is also the possibility to create more complex types with features or specific parent types, but this will be neglected for now.
In the example, a simple annotation type with the short name <quote>Animal</quote> is defined.
After the declaration of the type, a rule with one rule element is given.
UIMA Ruta rules in general can consist of a sequence of rule elements. Simple rule elements themselves consist of four parts: A matching condition,
an optional quantifier, an optional list of conditions and an optional list of actions. The rule element in the
following example has a matching condition <quote>W</quote>, an annotation type standing for normal words.
Statements like declarations and rules always end with a semicolon.
</para>
<programlisting><![CDATA[DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};]]></programlisting>
<para>
The rule element also contains one condition and one action, both surrounded by curly brackets. In order to distinguish conditions from actions
they are separated by <quote>-></quote>. The condition <quote>REGEXP("dog")</quote> indicates that the matched
word must match the regular expression <quote>dog</quote>. If the matching condition and the additional regular expression are fulfilled, then the action
is executed, which creates a new annotation of the type <quote>Animal</quote> with the same offsets as the matched token.
The default seeder does not actually add annotations of the type <quote>W</quote>, but annotations of the types <quote>SW</quote> and
<quote>CW</quote> for lowercase words and capitalized words, which both have the parent type <quote>W</quote>.
</para>
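<para>
Because <quote>SW</quote> and <quote>CW</quote> are subtypes of <quote>W</quote>, a rule can also be restricted to one of them.
The following sketch assumes the default seeding and reuses the type <quote>Animal</quote> from above; it treats
lowercase and capitalized occurrences separately.
</para>
<programlisting><![CDATA[DECLARE Animal;
// only lowercase words are considered
SW{REGEXP("dog") -> MARK(Animal)};
// only capitalized words are considered, e.g., at the start of a sentence
CW{REGEXP("Dog") -> MARK(Animal)};]]></programlisting>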
<para>
There is also the possibility to add implicit actions and conditions, which have no explicit name, but consist only of an expression.
In the part of the conditions, boolean expressions and feature match expressions can be applied, and in the part of the actions,
type expressions and feature assignment expressions can be added. The following example contains one implicit condition and one implicit action.
The additional condition is a boolean expression (boolean variable), which is set to <quote>true</quote> and therefore always fulfills the condition.
The <quote>MARK</quote> action was replaced by a type expression, which refers to the type <quote>Animal</quote>. The following rule shows, therefore,
the same behavior as the rule in the last example.
</para>
<programlisting><![CDATA[DECLARE Animal;
BOOLEAN active = true;
W{REGEXP("dog"), active -> Animal};]]></programlisting>
<para>
There is also a special kind of rule, which follows a different syntax and semantics and enables a simplified creation of annotations based on regular expressions.
The following rule, for example, creates an <quote>Animal</quote> annotation for each occurrence of <quote>dog</quote> or <quote>cat</quote>.
</para>
<programlisting><![CDATA[DECLARE Animal;
"dog|cat" -> Animal;]]></programlisting>
<para>
Since it is tedious to create Animal annotations by matching on different regular expressions, we apply an external dictionary in the next example.
The first line defines a word list named <quote>AnimalsList</quote>, which is located in the resource folder (the file <quote>Animals.txt</quote>
contains one animal name in each line). After the declaration of the type, a rule uses this word list to find all occurrences of animals
in the complete document.
</para>
<programlisting><![CDATA[WORDLIST AnimalsList = 'Animals.txt';
DECLARE Animal;
Document{-> MARKFAST(Animal, AnimalsList)};
]]></programlisting>
<para>
The matching condition of the rule element refers to the complete document, or more specifically to the annotation of the type
<quote>DocumentAnnotation</quote>, which covers the whole document.
The action <quote>MARKFAST</quote> of this rule element creates an annotation of the type <quote>Animal</quote> for each found
entry of the dictionary <quote>AnimalsList</quote>.
</para>
<para>
The next example introduces rules with more than one rule element, whereby one of them is a composed rule element. The following rule tries to
annotate occurrences of animals separated by commas, e.g., <quote>dog, cat, bird</quote>.
</para>
<programlisting><![CDATA[DECLARE AnimalEnum;
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
The rule consists of two rule elements, with <quote>(Animal COMMA)+{-> MARK(AnimalEnum,1,2)}</quote> being the first rule element and
<quote>Animal</quote> the second one. Let us take a closer look at the first rule element. This rule element is actually composed of two normal rule elements,
namely <quote>Animal</quote> and <quote>COMMA</quote>, and contains a greedy quantifier and one action. This rule element, therefore, matches on
one Animal annotation and a following comma. This is repeated until one of the inner rule elements does not match anymore. Then, there has to be
another Animal annotation afterwards, specified by the second rule element of the rule. In this case, the rule matches and its action is executed:
The MARK action creates a new annotation of the type <quote>AnimalEnum</quote>. However, in contrast to the previous examples, this action also
contains two numbers. These numbers refer to the rule elements that should be used to calculate the span of the created annotation. The numbers
<quote>1, 2</quote> state that the new annotation should start with the first rule element, the composed one, and should end with the second rule element.
</para>
<para>
Let us make the composed rule element more complex. The following rule also matches on lists of animals that are
separated by semicolons. A disjunctive rule element is therefore added, indicated by the symbol <quote>|</quote>, which matches on
annotations of the type <quote>COMMA</quote> or <quote>SEMICOLON</quote>.
</para>
<programlisting><![CDATA[(Animal (COMMA | SEMICOLON))+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
There are two more special symbols that can be used to link rule elements. If the symbol <quote>|</quote> is replaced by the
symbol <quote><![CDATA[&]]></quote> in the last example, then the token after the animal needs to be a comma and a semicolon at the same time, which is of course not possible.
Another symbol with a special meaning is <quote>%</quote>, which can be used not only within a composed rule element (parentheses).
This symbol can be interpreted as a global <quote>and</quote>: It links several rules, which only fire if all rules have successfully matched.
In the following example, an annotation of the type <quote>FoundIt</quote> is created, if the document contains two periods in a row and two commas in a row:
</para>
<programlisting><![CDATA[PERIOD PERIOD % COMMA COMMA{-> FoundIt};]]></programlisting>
<para>
There is a <quote>wild card</quote> (<quote>#</quote>) rule element, which can be used to skip some text or annotations until the next rule element is able to match.
</para>
<programlisting><![CDATA[DECLARE Sentence;
PERIOD #{-> MARK(Sentence)} PERIOD;]]></programlisting>
<para>
This rule annotates everything between two <quote>PERIOD</quote> annotations with the type <quote>Sentence</quote>. Please note that the resulting
annotation is automatically trimmed using the current filtering settings. Conditions at wild card rule elements should be avoided and only be used
by advanced users.
</para>
<para>
Another special rule element is called <quote>optional</quote> (<quote>_</quote>). Sometimes, an annotation should be created on a
text position if it is not followed by an annotation with a specific property. In contrast to normal rule elements with an optional quantifier,
the optional rule element does not need to match at all.
</para>
<programlisting><![CDATA[W ANY{-PARTOF(NUM)};
W _{-PARTOF(NUM)};]]></programlisting>
<para>
The two rules in this example specify the same pattern: A word that is not followed by a number. The difference between the rules
shows itself at the border of the matching window, e.g., at the end of the document. If the document contains only a single word,
the first rule will not match successfully because the second rule element already fails at its matching condition. The second rule, however,
will successfully match due to the optional rule element.
</para>
<para>
Rule elements can contain more than one condition. The rule in the next example tries to identify headlines, which are bold,
underlined and end with a colon.
</para>
<programlisting><![CDATA[DECLARE Headline;
Paragraph{CONTAINS(Bold, 90, 100, true),
CONTAINS(Underlined, 90, 100, true), ENDSWITH(COLON)
-> MARK(Headline)};]]></programlisting>
<para>
The matching condition of this rule element is given with the type <quote>Paragraph</quote>, thus the rule takes a look at all Paragraph annotations.
The rule matches only if the three conditions, separated by commas, are fulfilled. The first condition <quote>CONTAINS(Bold, 90, 100, true)</quote> states that
90%-100% of the matched paragraph annotation should also be annotated with annotations of the type <quote>Bold</quote>. The boolean parameter <quote>true</quote>
indicates that the amount of Bold annotations should be calculated relative to the matched annotation. The two numbers <quote>90,100</quote> are, therefore, interpreted as
percentages. The exact calculation of the coverage depends on the tokenization of the document and is skipped for now. The second condition
<quote>CONTAINS(Underlined, 90, 100, true)</quote> consequently states that at least 90% of the paragraph should also be covered by annotations of the type <quote>Underlined</quote>.
The third condition <quote>ENDSWITH(COLON)</quote> finally forces the Paragraph annotation to end with a colon. It is only fulfilled, if there is an annotation of the type
<quote>COLON</quote>, which has an end offset equal to the end offset of the matched Paragraph annotation.
</para>
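<para>
For comparison, the following sketch leaves out the boolean parameter, so that the two numbers are expected to be read as
absolute counts rather than percentages. The types <quote>Keyword</quote> and <quote>KeywordParagraph</quote> are
hypothetical, and <quote>Keyword</quote> annotations are assumed to exist already.
</para>
<programlisting><![CDATA[DECLARE KeywordParagraph;
// fulfilled if the paragraph contains between two and four Keyword annotations
Paragraph{CONTAINS(Keyword, 2, 4) -> MARK(KeywordParagraph)};]]></programlisting>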
<para>
The readability and maintainability of rules do not improve if more conditions are added.
One of the strengths of the UIMA Ruta language is that it provides different approaches to solve an annotation task. The next two examples
introduce actions for transformation-based rules.
</para>
<programlisting><![CDATA[Headline{-CONTAINS(W) -> UNMARK(Headline)};]]></programlisting>
<para>
This rule consists of one condition and one action. The condition <quote>-CONTAINS(W)</quote> is negated (indicated by the character <quote>-</quote>),
and is therefore only fulfilled if there are no annotations of the type <quote>W</quote> within the bounds of the matched Headline annotation.
The action <quote>UNMARK(Headline)</quote> removes the matched Headline annotation. Put into simple words, headlines that contain no words at all are not headlines.
</para>
<para>
The next rule does not remove an annotation, but changes its offsets dependent on the context.
</para>
<programlisting><![CDATA[Headline{-> SHIFT(Headline, 1, 2)} COLON;]]></programlisting>
<para>
Here, the action <quote>SHIFT(Headline, 1, 2)</quote> expands the matched Headline annotation to the next colon, if that Headline annotation
is followed by a COLON annotation.
</para>
<para>
UIMA Ruta rules can contain arbitrary conditions and actions, which is illustrated by the next example.
</para>
<programlisting><![CDATA[DECLARE Month, Year, Date;
ANY{INLIST(MonthsList) -> MARK(Month), MARK(Date,1,3)}
PERIOD? NUM{REGEXP(".{2,4}") -> MARK(Year)};]]></programlisting>
<para>
This rule consists of three rule elements. The first one matches on every token whose covered text occurs in a word list named <quote>MonthsList</quote>.
The second rule element is optional and does not need to be fulfilled, which is indicated by the quantifier <quote>?</quote>. The last rule element matches
on numbers that fulfill the regular expression <quote>REGEXP(".{2,4}")</quote> and are therefore between two and four characters long.
If this rule successfully matches on a text passage, then its three actions are executed: An annotation of the type <quote>Month</quote> is created for the first rule element,
an annotation of the type <quote>Year</quote> is created for the last rule element and an annotation of the type <quote>Date</quote>
is created for the span of all three rule elements. If the word list contains the correct entries, then this rule matches on strings like
<quote>Dec. 2004</quote>, <quote>July 85</quote> or <quote>11.2008</quote> and creates the corresponding annotations.
</para>
<para>
After introducing the composition of rule elements, the default matching strategy is examined. The two rules in the next example create an annotation
for a sequence of arbitrary tokens with the only difference of one condition.
</para>
<programlisting><![CDATA[DECLARE Text1, Text2;
ANY+{ -> MARK(Text1)};
ANY+{-PARTOF(Text2) -> MARK(Text2)};]]></programlisting>
<para>
The first rule matches on each occurrence of an arbitrary token and continues this until the end of the document is reached.
This is caused by the greedy quantifier <quote>+</quote>. Note that this rule considers each occurrence of a token and is therefore
executed for each token, resulting in many overlapping annotations. This behavior is illustrated with an example:
When applied on the document <quote>Peter works for Frank</quote>, the rule creates four annotations with the covered texts
<quote>Peter works for Frank</quote>, <quote>works for Frank</quote>, <quote>for Frank</quote> and <quote>Frank</quote>.
The rule first tries to match on the token <quote>Peter</quote> and continues its matching. Then, it tries to match on the token <quote>works</quote> and
continues its matching, and so on.
</para>
<para>
In this example, the second rule only returns one annotation, which covers the complete document. This is caused by the additional
condition <quote>-PARTOF(Text2)</quote>. The PARTOF condition is fulfilled, if the matched annotation is located within an annotation of the given type, or
put in simple words, if the matched annotation is part of an annotation of the type <quote>Text2</quote>. When applied on the
document <quote>Peter works for Frank</quote>, the rule matches on the first token <quote>Peter</quote>, continues its match and
creates an annotation of the type <quote>Text2</quote> for the complete document. Then it tries to match on the second token <quote>works</quote>, but fails,
because this token is already part of a Text2 annotation.
</para>
<para>
UIMA Ruta rules can not only be used to create or modify annotations, but also to create features for annotations. The next example defines
and assigns a relation of employment, by storing the given annotations as feature values.
</para>
<programlisting><![CDATA[DECLARE Annotation EmplRelation
(Employee employeeRef, Employer employerRef);
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation,
"employeeRef" = Employee, "employerRef" = Employer)};]]></programlisting>
<para>
The first statement of this example is a declaration that defines a new type of annotation named <quote>EmplRelation</quote>.
This annotation has two features:
One feature with the name <quote>employeeRef</quote> of the type <quote>Employee</quote> and
one feature with the name <quote>employerRef</quote> of the type <quote>Employer</quote>.
If the parent type is Annotation, then it can be omitted resulting in the following declaration:
<programlisting><![CDATA[DECLARE EmplRelation (Employee employeeRef, Employer employerRef);]]></programlisting>
The second statement of the example, which is a simple rule, creates one annotation of the type <quote>EmplRelation</quote> for
each Sentence annotation that contains at least one annotation of the type <quote>EmploymentIndicator</quote>. In addition to creating an annotation,
the CREATE action also assigns an annotation of the type <quote>Employee</quote>, which needs to be located within the span of the matched sentence,
to the feature <quote>employeeRef</quote> and an Employer annotation to the feature <quote>employerRef</quote>. The annotations mentioned in this
example need to be present in advance.
</para>
<para>
In order to refer to annotations and, for example, assign them to some features,
special kinds of local and global variables can be utilized. Local variables for annotations
do not need to be declared, but are specified by a label at a rule element. This label can be used
to refer to the matched annotation of this rule element within the current rule match only.
The following example illustrates a simple use case using local variables:
</para>
<programlisting><![CDATA[DECLARE Annotation EmplRelation
(Employee employeeRef, Employer employerRef);
(e1:Employer # EmploymentIndicator # e2:Employee)
{-> EmplRelation, EmplRelation.employeeRef=e2,
EmplRelation.employerRef=e1};]]></programlisting>
<para>
Global variables for annotations are declared like other variables and are able to store annotations
across rules as illustrated by the next example:
</para>
<programlisting><![CDATA[DECLARE MentionedAfter(Annotation first);
ANNOTATION firstPerson;
# p:Person{-> firstPerson = p};
Entity{-> MentionedAfter, MentionedAfter.first = firstPerson};]]></programlisting>
<para>
The first line declares a new type that is used afterwards. The second line defines a variable
named <code>firstPerson</code>, which can store one annotation. A variable able to hold several annotations
is defined with ANNOTATIONLIST. The next line assigns the first occurrence of a Person annotation to the
annotation variable <code>firstPerson</code>. The last line creates an annotation of the type MentionedAfter and assigns the value
of the variable <code>firstPerson</code> to the feature <code>first</code> of the created annotation.
</para>
<para>
Expressions for annotations can be extended by a feature match and also by conditions. This also applies to type expressions
that represent annotations. This functionality is illustrated with a simple example:
</para>
<programlisting><![CDATA[Sentence{-> CREATE(EmplRelation, "employeeRef" =
Employee.ct=="Peter"{ENDSWITH(Sentence)})};]]></programlisting>
<para>
Here, an annotation of the type <code>EmplRelation</code> is created for each sentence.
The feature <code>employeeRef</code> is filled with one <code>Employee</code> annotation.
This annotation is specified by its type <code>Employee</code>. The first annotation
of this type within the matched sentence, which covers the text <quote>Peter</quote> and also
ends with a <code>Sentence</code> annotation, is selected.
</para>
<para>
Sometimes, an annotation which was just created by an action should be assigned to a feature.
This can be achieved by referring to the annotation by its type, as shown in the
first example with <quote>EmplRelation</quote>. However, this can cause problems in some situations, e.g., where
several annotations of a type are present at a specific span. Local variables using labels can also be used directly at actions
that create or modify annotations. The action assigns the new annotation to the label variable,
which can then be utilized by following actions as shown in the following example:
</para>
<programlisting><![CDATA[W.ct=="Peter"{-> e:Employee, CREATE(EmplRelation, "employeeRef" = e)};]]></programlisting>
<para>
In the last examples, the values of features were defined as annotation types. However, primitive
types can also be used, as shown in the next example, together with a short introduction to variables.
</para>
<programlisting><![CDATA[DECLARE Annotation MoneyAmount(STRING currency, INT amount);
INT moneyAmount;
STRING moneyCurrency;
NUM{PARSE(moneyAmount)} SPECIAL{REGEXP("€") -> MATCHEDTEXT(moneyCurrency),
CREATE(MoneyAmount, 1, 2, "amount" = moneyAmount,
"currency" = moneyCurrency)};]]></programlisting>
<para>
First, a new annotation type with the name <quote>MoneyAmount</quote> and two features is defined, one string feature and one integer feature.
Then, two UIMA Ruta variables are declared, one integer variable and one string variable. The rule matches on a number, whose value is stored
in the variable <quote>moneyAmount</quote>, followed by a special token that needs to be equal to the string <quote>€</quote>. Then,
the covered text of the special annotation is stored in the string variable <quote>moneyCurrency</quote> and an annotation of the
type <quote>MoneyAmount</quote> spanning both rule elements is created. Additionally, the variables are assigned as feature values.
</para>
<para>
Using feature expressions in conditions and actions can reduce the complexity of a rule. The first rule in the following example sets the value of the feature
<quote>currency</quote> of the annotation of the type <quote>MoneyAmount</quote> to <quote>Euro</quote>, if it was <quote>€</quote> before.
The second rule creates an annotation of the type <quote>LessThan</quote> for all annotations of the type <quote>MoneyAmount</quote>,
if their amount is less than or equal to 100 and the currency is <quote>Euro</quote>.
</para>
<programlisting><![CDATA[DECLARE LessThan;
MoneyAmount.currency=="€"{-> MoneyAmount.currency="Euro"};
MoneyAmount{(MoneyAmount.amount<=100),
MoneyAmount.currency=="Euro" -> LessThan};]]></programlisting>
<para>
UIMA Ruta script files with many rules can quickly become confusing. The UIMA Ruta language, therefore, allows other script files to be imported in order to increase
the modularity of a project or to create rule libraries. The next example imports the rules together with all known types of another script file
and executes that script file.
</para>
<programlisting><![CDATA[SCRIPT uima.ruta.example.SecondaryScript;
Document{-> CALL(SecondaryScript)};]]></programlisting>
<para>
The script file with the name <quote>SecondaryScript.ruta</quote>, which is located in the package <quote>uima/ruta/example</quote>, is imported and executed
by the CALL action on the complete document. The script needs to be located in the folder specified by the parameter
<link linkend="ugr.tools.ruta.ae.basic.parameter.scriptPaths">scriptPaths</link>, or in a corresponding package in the classpath. It is also possible to import script files of other UIMA Ruta projects, e.g.,
by adapting the configuration parameters of the UIMA Ruta Analysis Engine or
by setting a project reference in the project properties of a UIMA Ruta project.
</para>
<para>
For simple rules that match on the complete document and only specify actions, a simplified syntax exists that omits the matching parts:
</para>
<programlisting><![CDATA[SCRIPT uima.ruta.example.SecondaryScript;
CALL(SecondaryScript);]]></programlisting>
<para>
The types of important annotations of the application are often defined in a separate type system. The next example shows how to import those types.
</para>
<programlisting><![CDATA[TYPESYSTEM my.package.NamedEntityTypeSystem;
Person{PARTOF(Organization) -> UNMARK(Person)};
]]></programlisting>
<para>
The type system descriptor file with the name <quote>NamedEntityTypeSystem.xml</quote> located in the package <quote>my/package</quote> is imported.
The descriptor needs to be located in a folder specified by the parameter
<link linkend="ugr.tools.ruta.ae.basic.parameter.descriptorPaths">descriptorPaths</link>.
</para>
<para>
It is sometimes easier to express functionality with control structures known by programming languages rather than to engineer all functionality
only with matching rules. The UIMA Ruta language provides the BLOCK element for some of these use cases.
The UIMA Ruta BLOCK element starts with the keyword <quote>BLOCK</quote> followed by its name in parentheses. The name of a block has two purposes:
On the one hand, blocks are easier to distinguish if they have different names, e.g., in the
<link linkend="section.ugr.tools.ruta.workbench.explain_perspective">explain perspective</link> of the UIMA Ruta Workbench. On the other hand,
the name can be used to execute this block using the CALL action. This makes it possible to access only specific sets of rules of other script files,
or to implement a recursive call of rules. After the name of the block, a single rule element is given, which requires curly brackets
even if no conditions or actions are specified. Then, the body of the block is framed by curly brackets.
</para>
<programlisting><![CDATA[BLOCK(English) Document{FEATURE("language", "en")} {
// rules for english documents
}
BLOCK(German) Document{FEATURE("language", "de")} {
// rules for german documents
}]]></programlisting>
<para>
This example contains two simple BLOCK statements. The rules defined within a block are only executed if the condition in the head of the block is fulfilled.
The rules of the first block are only considered if the feature <quote>language</quote> of the document annotation has the value <quote>en</quote>.
Accordingly, the rules of the second block are only considered for German documents.
</para>
<para>
The rule element of the block definition can also refer to annotation types other than <quote>Document</quote>. While the last example implemented something similar
to an if-statement, the next example provides a showcase for something similar to a for-each-statement.
</para>
<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}
]]></programlisting>
<para>
Here, the rule in the block statement is performed for each occurrence of an annotation of the type <quote>Sentence</quote>.
The rule within the block matches on the complete document, which is the current sentence in the context of the block statement.
As a consequence, this example creates an annotation of the type <quote>SentenceWithNoLeadingNP</quote> for each sentence
that does not start with an NP annotation.
</para>
<para>
There are two more language constructs (<quote><![CDATA[->]]></quote> and <quote><![CDATA[<-]]></quote>) that allow rules to be applied within a certain context. These rules are added to an arbitrary rule element
and are called inlined rules. The first example interprets the inlined rules as actions. They are executed if the surrounding rule was able to match,
which makes this construct very similar to the block statement.
</para>
<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
Sentence{}->{
Document{-STARTSWITH(NP) -> SentenceWithNoLeadingNP};
};
]]></programlisting>
<para>
The second one (<quote><![CDATA[<-]]></quote>) interprets the inlined rules as conditions. The surrounding rule can only match if at least one inlined rule was successfully applied.
In the following example, a sentence is annotated with the type SentenceWithNPNP, if there are two successive NP annotations within this sentence.
</para>
<programlisting><![CDATA[DECLARE SentenceWithNPNP;
Sentence{-> SentenceWithNPNP}<-{
NP NP;
};]]></programlisting>
<para>
A rule element may be extended with several inlined rule blocks as conditions or actions. If there is more than one inlined rule block as condition,
each needs to contain at least one rule that was successfully applied. In the following example, the rule will only match if the sentence contains
a number followed by another number and a period followed by a comma, independently of their location within the sentence:
</para>
<programlisting><![CDATA[Sentence<-{NUM NUM;}<-{PERIOD COMMA;};]]></programlisting>
<para>
Let us take a closer look at what exactly the UIMA Ruta rules match. The following rule matches on a word followed by another word:
</para>
<programlisting><![CDATA[W W;]]></programlisting>
<para>
To be more precise, this rule matches on all documents like <quote>Apache UIMA</quote>, <quote>Apache UIMA</quote>, <quote>ApacheUIMA</quote>,
<quote><![CDATA[Apache <b>UIMA</b>]]></quote>. There are two main reasons for this: First of all, it depends on how the available annotations are defined. The default seeder
for the initial annotations creates an annotation for all characters until an upper case character occurs. Thus, the string <quote>ApacheUIMA</quote> consists of
two tokens.
More importantly, however, the UIMA Ruta language provides a concept of visibility of the annotations. By default, all annotations of the types
<quote>SPACE</quote>, <quote>NBSP</quote>, <quote>BREAK</quote> and <quote>MARKUP</quote> (whitespace and XML elements) are filtered and not visible. This also holds for
their covered text. The rule elements skip all positions of the
document where those annotations occur. The rule in the last example therefore matches on all of the example documents. Without the default filtering settings,
i.e., with all annotations set to visible, the rule matches only on the document <quote>ApacheUIMA</quote> since it is the only one that contains two word annotations without
any whitespace between them.
</para>
<para>
The filtering setting can also be modified by the UIMA Ruta rules themselves. The next example provides rules that extend and limit
the amount of visible text of the document.
</para>
<programlisting><![CDATA[Sentence;
Document{-> RETAINTYPE(SPACE)};
Sentence;
Document{-> FILTERTYPE(CW)};
Sentence;
Document{-> RETAINTYPE, FILTERTYPE};]]></programlisting>
<para>
The first rule matches on sentences that do not start with any filtered type. Sentences that start with whitespace or markup,
for example, are not considered.
The next rule retains all text that is covered by annotations of the type <quote>SPACE</quote>, meaning
that the rule elements are now sensitive to whitespace. The following rule will, therefore, match on sentences that start with whitespace.
The rule with the <quote>FILTERTYPE</quote> action then filters the type <quote>CW</quote> with the consequence that all capitalized words are invisible.
If the following rule now tries to match on sentences, then this is only possible for Sentence annotations that do not start with a capitalized word.
The last rule finally resets the filtering setting to the default configuration in the UIMA Ruta Analysis Engine.
</para>
<para>
The next example gives a showcase for importing external Analysis Engines and for modifying the documents by creating a new view called <quote>modified</quote>.
Additional Analysis Engines can be imported with the keyword <quote>ENGINE</quote> followed by the name of the descriptor. These imported Analysis Engines can be
executed with the actions <quote>CALL</quote> or <quote>EXEC</quote>. If the executed Analysis Engine adds, removes or modifies annotations, then their types need
to be mentioned when calling the descriptor, or else these annotations will not be correctly processed by the following UIMA Ruta rules.
</para>
<programlisting><![CDATA[ENGINE utils.Modifier;
Date{-> DEL};
MoneyAmount{-> REPLACE("<MoneyAmount/>")};
Document{-> COLOR(Headline, "green")};
Document{-> EXEC(Modifier)};
]]></programlisting>
<para>
In this example, we first import an Analysis Engine defined by the descriptor <quote>Modifier.xml</quote> located in the folder <quote>utils</quote>.
The descriptor needs to be located in the folder specified by the parameter <link linkend="ugr.tools.ruta.ae.basic.parameter.descriptorPaths">descriptorPaths</link>.
The first rule deletes all text covered by annotations of the type <quote>Date</quote>. The second rule replaces the text of all annotations of the type <quote>MoneyAmount</quote>
with the string <quote><![CDATA[<MoneyAmount/>]]></quote>. The third rule records that the background color of text in Headline annotations should be set to green. The last rule
finally performs all of these changes in an additional view called <quote>modified</quote>, which is specified in the configuration parameters of the analysis engine.
<xref linkend="ugr.tools.ruta.ae.modifier"/> and <xref linkend="ugr.tools.ruta.language.modification"/> provide a more detailed description.
</para>
<para>
In the last example, a descriptor file was loaded in order to import and apply an external analysis engine. Analysis engines can also be loaded using uimaFIT,
in which case the class with the given name has to be present in the classpath. In the UIMA Ruta Workbench, you can add a dependency on a Java project, which contains the
implementation, to the UIMA Ruta project.
The following example loads an analysis engine without a descriptor and applies it to the document. The additional list of types states that
the annotations of those types created by the analysis engine should be available to the following Ruta rules.
</para>
<programlisting><![CDATA[UIMAFIT my.package.impl.MyAnalysisEngine;
Document{-> EXEC(MyAnalysisEngine, {MyType1, MyType2})};
]]></programlisting>
</section>
<section id="ugr.tools.ruta.ae">
<title>UIMA Analysis Engines</title>
<para>This section gives an overview of the UIMA Analysis Engines shipped with UIMA Ruta. The most
important one is <quote>RutaEngine</quote>, a generic analysis engine, which is able to interpret
and execute script files. The other analysis engines provide support for some additional functionality or
add certain types of annotations.
</para>
<section id="ugr.tools.ruta.ae.basic">
<title>Ruta Engine</title>
<para>
This generic Analysis Engine is the most important one for the UIMA Ruta language since it is
responsible for applying the UIMA Ruta rules on a CAS. Its functionality is configured by the configuration parameters,
which, for example, specify the rule file that should be executed. In the UIMA Ruta Workbench, a basic template named <quote>BasicEngine.xml</quote>
is given in the descriptor folder of a UIMA Ruta project and correctly configured descriptors typically named <quote>MyScriptEngine.xml</quote>
are generated in the descriptor folder corresponding to the package namespace of the script file.
The available configuration parameters of the UIMA Ruta Analysis Engine are described in the following.
</para>
<section id="ugr.tools.ruta.ae.basic.parameter">
<title>Configuration Parameters</title>
<para>
The configuration parameters of the UIMA Ruta Analysis Engine can be subdivided into three
different groups: parameters for the setup of the environment (<link linkend='ugr.tools.ruta.ae.basic.parameter.mainScript'>mainScript</link>
to <link linkend='ugr.tools.ruta.ae.basic.parameter.additionalExtensions'>additionalExtensions</link>),
parameters that change the behavior of the analysis engine (<link linkend='ugr.tools.ruta.ae.basic.parameter.reloadScript'>reloadScript</link>
to <link linkend='ugr.tools.ruta.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>)
and parameters for creating additional information about how the rules were executed
(<link linkend='ugr.tools.ruta.ae.basic.parameter.debug'>debug</link>
to <link linkend='ugr.tools.ruta.ae.basic.parameter.createdBy'>createdBy</link>). First, a short overview of the configuration parameters is given in
<xref linkend='table.ugr.tools.ruta.ae.parameter' />. Afterwards, all parameters are described in detail with examples.
</para>
<para>
To change the value of any configuration parameter within a UIMA Ruta script, the CONFIGURE action (see <xref linkend='ugr.tools.ruta.language.actions.configure' />)
can be used. To change the behavior of <link linkend='ugr.tools.ruta.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link>, the DYNAMICANCHORING action
(see <xref linkend='ugr.tools.ruta.language.actions.dynamicanchoring' />) is recommended.
</para>
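<para>
As a minimal sketch of the second recommendation, the following rule switches dynamic anchoring on from within a script.
It assumes that the DYNAMICANCHORING action accepts a boolean expression as its first argument; the section on the action referenced above
remains authoritative for the exact signature.
</para>
<programlisting><![CDATA[// assumption: the first argument is a boolean expression that enables the option
Document{-> DYNAMICANCHORING(true)};]]></programlisting>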
<para>
<table id="table.ugr.tools.ruta.ae.parameter" frame="all">
<title>Configuration parameters of the UIMA Ruta Analysis Engine </title>
<tgroup cols="3" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1.2*" />
<colspec colname="c2" colwidth="2*" />
<colspec colname="c3" colwidth="0.8*" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Short description</entry>
<entry align="center">Type</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.mainScript'>mainScript</link>
</entry>
<entry>Name with complete namespace of the script which will be interpreted and
executed by the analysis engine.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.rules'>rules</link>
</entry>
<entry>Script (list of rules) to be applied.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.rulesScriptName'>rulesScriptName</link>
</entry>
<entry>This parameter specifies the name of the non-existing script if the parameter 'rules' is used.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.scriptEncoding'>scriptEncoding</link>
</entry>
<entry>Encoding of all UIMA Ruta script files.</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.scriptPaths'>scriptPaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary script files like
the main script.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.descriptorPaths'>descriptorPaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary descriptor files
like type systems.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.resourcePaths'>resourcePaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary resource files like
word lists.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.additionalScripts'>additionalScripts</link>
</entry>
<entry>Optional list of names with complete namespace of additional scripts, which can be
referred to.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.additionalEngines'>additionalEngines</link>
</entry>
<entry>Optional list of names with complete namespace of additional analysis engines, which
can be called by UIMA Ruta rules.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.additionalUimafitEngines'>additionalUimafitEngines</link>
</entry>
<entry>Optional list of class names with complete namespace of additional uimaFIT analysis engines, which
can be called by UIMA Ruta rules.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.additionalExtensions'>additionalExtensions</link>
</entry>
<entry>List of factory classes for additional extensions of the UIMA Ruta language
like proprietary conditions.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.reloadScript'>reloadScript</link>
</entry>
<entry>Option to initialize the rule script each time the analysis engine processes
a CAS.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.seeders'>seeders</link>
</entry>
<entry>List of class names that provide additional annotations before the rules are
executed.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.defaultFilteredTypes'>defaultFilteredTypes</link>
</entry>
<entry>List of complete type names of annotations that are invisible by default.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.removeBasics'>removeBasics</link>
</entry>
<entry>Option to remove all inference annotations after execution of the rule script.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexOnly'>indexOnly</link>
</entry>
<entry>Option to select annotation types that should be indexed internally in Ruta.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexSkipTypes'>indexSkipTypes</link>
</entry>
<entry>Option to skip annotation types in the internal indexing.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexOnlyMentionedTypes'>indexOnlyMentionedTypes</link>
</entry>
<entry>Option to index only mentioned types internally in Ruta.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexAdditionally'>indexAdditionally</link>
</entry>
<entry>Option to index types in addition to the mentioned ones internally in Ruta.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.reindexOnly'>reindexOnly</link>
</entry>
<entry>Option to select annotation types that should be reindexed internally in Ruta.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.reindexSkipTypes'>reindexSkipTypes</link>
</entry>
<entry>Option to skip annotation types in the internal reindexing.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.reindexOnlyMentionedTypes'>reindexOnlyMentionedTypes</link>
</entry>
<entry>Option to reindex only mentioned types internally in Ruta.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.reindexAdditionally'>reindexAdditionally</link>
</entry>
<entry>Option to reindex types in addition to the mentioned ones internally in Ruta.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexUpdateMode'>indexUpdateMode</link>
</entry>
<entry>Mode that determines how internal indexing is applied.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.indexUpdateMode'>validateInternalIndexing</link>
</entry>
<entry>Option to validate the internal indexing.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.emptyIsInvisible'>emptyIsInvisible</link>
</entry>
<entry>Option to define empty text positions as invisible.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.modifyDataPath'>modifyDataPath</link>
</entry>
<entry>Option to extend the datapath by the descriptorPaths.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.strictImports'>strictImports</link>
</entry>
<entry>Option to restrict short type names resolution to those in the declared typesystems.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link>
</entry>
<entry>Option to allow rule matches to start at any rule element.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.lowMemoryProfile'>lowMemoryProfile</link>
</entry>
<entry>Option to decrease the memory consumption when processing a large CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>
</entry>
<entry>Option to activate a different inferencer for composed rule elements.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.debug'>debug</link>
</entry>
<entry>Option to add debug information to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.debugWithMatches'>debugWithMatches</link>
</entry>
<entry>Option to add information about the rule matches to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.debugOnlyFor'>debugOnlyFor</link>
</entry>
<entry>List of rule ids. If provided, then debug information is only created for
those rules.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.profile'>profile</link>
</entry>
<entry>Option to add profile information to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.statistics'>statistics</link>
</entry>
<entry>Option to add statistics of conditions and actions to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.createdBy'>createdBy</link>
</entry>
<entry>Option to add information about which rule created an annotation.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.varNames'>varNames</link>
</entry>
<entry>String array with names of variables. Is used in combination with varValues.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.varValues'>varValues</link>
</entry>
<entry>String array with values of variables. Is used in combination with varNames.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.dictRemoveWS'>dictRemoveWS</link>
</entry>
<entry>Remove whitespaces when loading dictionaries.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.csvSeparator'>csvSeparator</link>
</entry>
<entry>String/token to be used to split columns in CSV tables.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.inferenceVisitors'>inferenceVisitors</link>
</entry>
<entry>List of factory classes for additional inference visitors.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.maxRuleMatches'>maxRuleMatches</link>
</entry>
<entry>Maximum number of allowed matches of a single rule.
</entry>
<entry>Single Integer</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.ruta.ae.basic.parameter.maxRuleMatches'>maxRuleElementMatches</link>
</entry>
<entry>Maximum number of allowed matches of a single rule element.
</entry>
<entry>Single Integer</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<section id="ugr.tools.ruta.ae.basic.parameter.mainScript">
<title>mainScript</title>
<para>
This parameter specifies the rule file that will be executed by the analysis engine and is,
therefore, one of the most important ones. The exact name of the script is given by the complete namespace of the file, which corresponds to its location
relative to the given parameter <link linkend='ugr.tools.ruta.ae.basic.parameter.scriptPaths'>scriptPaths</link>.
The individual package (or folder) names are separated by periods. An exemplary value for this parameter could be "org.apache.uima.Main",
where "Main" specifies the file containing the rules and "org.apache.uima" its package.
In this case, the analysis engine loads the script file "Main.ruta", which is located in the folder structure "org/apache/uima/".
This parameter has no default value and has to be provided, although it is not specified as mandatory.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.rules">
<title>rules</title>
<para>
A String parameter containing the rules that should be applied by the analysis engine.
If set, it replaces the content of the file specified by the <link linkend='ugr.tools.ruta.ae.basic.parameter.mainScript'>mainScript</link> parameter.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.rulesScriptName">
<title>rulesScriptName</title>
<para>
This parameter specifies the name of the non-existing script if the <link linkend='ugr.tools.ruta.ae.basic.parameter.rules'>rules</link> parameter is used.
The default value is 'Anonymous'.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.scriptEncoding">
<title>scriptEncoding</title>
<para>
This parameter specifies the encoding of the rule files. Its default value is "UTF-8".
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.scriptPaths">
<title>scriptPaths</title>
<para>
The parameter scriptPaths refers to a list of String values, which specify the possible locations of script files.
The given locations are absolute paths. A typical value for this parameter is, for example, "C:/Ruta/MyProject/script/".
If the parameter <link linkend='ugr.tools.ruta.ae.basic.parameter.mainScript'>mainScript</link> is set to org.apache.uima.Main,
then the absolute path of the script file has to be "C:/Ruta/MyProject/script/org/apache/uima/Main.ruta".
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter specifies the possible locations for descriptors like analysis engines or type systems, similar to the parameter
<link linkend='ugr.tools.ruta.ae.basic.parameter.scriptPaths'>scriptPaths</link> for the script files. A typical value for this parameter
is for example "C:/Ruta/MyProject/descriptor/".
The relative values of the parameter <link linkend='ugr.tools.ruta.ae.basic.parameter.additionalEngines'>additionalEngines</link> are
resolved to these absolute locations.
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.resourcePaths">
<title>resourcePaths</title>
<para>
This parameter specifies the possible locations of additional resources like word lists or CSV tables. The string values have to contain absolute
locations, for example, "C:/Ruta/MyProject/resources/".
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.additionalScripts">
<title>additionalScripts</title>
<para>
The optional parameter additionalScripts is defined as a list of string values and contains script files, which are additionally loaded by the analysis engine. These script files are specified by their
complete namespace, exactly like the value of the parameter <link linkend='ugr.tools.ruta.ae.basic.parameter.mainScript'>mainScript</link>
and can be referred to by language elements, e.g., by executing the contained rules. An exemplary value of this parameter is "org.apache.uima.SecondaryScript". In this example, the main script could import
this script file by the declaration "SCRIPT org.apache.uima.SecondaryScript;" and then could execute it with the rule
"Document{-> CALL(SecondaryScript)};". This optional list can be used as a replacement of global imports in the script file.
</para>
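<para>
Written out as a script, the import and call described above look as follows. The script name is only the exemplary value
used in this section and does not refer to a file shipped with UIMA Ruta.
</para>
<programlisting><![CDATA[// import the additional script by its complete namespace
SCRIPT org.apache.uima.SecondaryScript;
// execute the rules of the imported script on the complete document
Document{-> CALL(SecondaryScript)};]]></programlisting>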
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.additionalEngines">
<title>additionalEngines</title>
<para>
This optional parameter contains a list of additional analysis engines, which can be executed by the UIMA Ruta rules. The single values
are given by the name of the analysis engine with their complete namespace and have to be located relative to one value of the parameter
<link linkend='ugr.tools.ruta.ae.basic.parameter.descriptorPaths'>descriptorPaths</link>, the location where the analysis engine searches for the descriptor file.
An example for one value of the parameter is "utils.HtmlAnnotator", which points to the descriptor "HtmlAnnotator.xml" in the folder "utils".
This optional list can be used as a replacement of global imports in the script file.
</para>
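<para>
For comparison, the corresponding import in the script file itself can be sketched as follows. The sketch assumes that the descriptor
"utils/HtmlAnnotator.xml" from the exemplary value is available in one of the descriptorPaths. It uses the ENGINE keyword and the EXEC action
in the same way as the overview examples of this chapter.
</para>
<programlisting><![CDATA[// import the descriptor "HtmlAnnotator.xml" located in the folder "utils"
ENGINE utils.HtmlAnnotator;
// execute the imported analysis engine on the complete document
Document{-> EXEC(HtmlAnnotator)};]]></programlisting>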
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.additionalUimafitEngines">
<title>additionalUimafitEngines</title>
<para>
This optional parameter contains a list of additional analysis engines, which can be executed by the UIMA Ruta rules. The single values
are given by the name of the implementation with the complete namespace and have to be present in the classpath of the application.
An example for one value of the parameter is "org.apache.uima.ruta.engine.HtmlAnnotator", which points to the "HtmlAnnotator" class.
This optional list can be used as a replacement of global imports in the script file.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.additionalExtensions">
<title>additionalExtensions</title>
<para>
This parameter specifies optional extensions of the UIMA Ruta language. The elements of the string list have to implement the interface
"org.apache.uima.ruta.extensions.IRutaExtension". With these extensions, application-specific conditions and actions can be
added to the set of provided ones.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.reloadScript">
<title>reloadScript</title>
<para>
This boolean parameter indicates whether the script or resource files should be reloaded when processing a CAS. The default value is set to false.
In this case, the script files are loaded when the analysis engine is initialized. If script files or resource files are extended while a collection of documents is processed, e.g., a dictionary is filled,
then the parameter needs to be set to true in order to include the changes.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.seeders">
<title>seeders</title>
<para>
This list of string values refers to implementations of the interface "org.apache.uima.ruta.seed.RutaAnnotationSeeder",
which can be used to automatically add annotations to the CAS. The default value of the parameter is a single seeder, namely "org.apache.uima.ruta.seed.DefaultSeeder"
that adds annotations for token classes like CW, MARKUP or SEMICOLON. Remember that additional annotations can also be added with
an additional engine that is executed by a UIMA Ruta rule.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.defaultFilteredTypes">
<title>defaultFilteredTypes</title>
<para>
This parameter specifies a list of types, which are filtered by default when executing a script file. Using the default values of this parameter,
whitespaces, line breaks and markup elements are not visible to Ruta rules. The visibility of annotations and, therefore, the covered text can be changed
using the actions <link linkend='ugr.tools.ruta.language.actions.filtertype'>FILTERTYPE</link> and
<link linkend='ugr.tools.ruta.language.actions.retaintype'>RETAINTYPE</link>.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.removeBasics">
<title>removeBasics</title>
<para>
This parameter specifies whether the inference annotations created by the analysis engine should be removed after processing the CAS.
The default value is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.indexOnly">
<title>indexOnly</title>
<para>
This parameter specifies the annotation types which should be indexed for Ruta's internal
annotations. All annotation types that are relevant need to be listed here. The value of this
parameter only needs to be adapted for performance and memory optimization in pipelines that
contain several Ruta analysis engines. The default value is uima.tcas.Annotation.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.indexSkipTypes">
<title>indexSkipTypes</title>
<para>
This parameter specifies annotation types that should not be indexed at all. These types
normally include annotations that provide no meaningful semantics for text processing, e.g.,
types concerning Ruta debug information.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.indexOnlyMentionedTypes">
<title>indexOnlyMentionedTypes</title>
<para>
If this parameter is activated, then only annotations of types that are mentioned within the
rules are internally indexed. This optimization of the internal indexing can improve the speed
and reduce the memory footprint. However, several features of the rule matching require the
indexing of types that are not mentioned in the rules, e.g., literal rule matches, wildcards
and actions like MARKFAST, MARKTABLE, TRIE. The default value is false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.indexAdditionally">
<title>indexAdditionally</title>
<para>
This parameter specifies annotation types that should
be indexed in addition to the types mentioned in the rules. This parameter is only used if the
parameter 'indexOnlyMentionedTypes' is activated.
</para>
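<para>
The interplay of 'indexOnlyMentionedTypes' and 'indexAdditionally' can be sketched with the following descriptor fragment, which belongs inside the <code>configurationParameterSettings</code> element. The type name <code>org.example.Sentence</code> is hypothetical, and the parameter types (boolean and multi-valued string) are assumptions.
</para>
<programlisting><![CDATA[<!-- Index only the types mentioned in the rules ... -->
<nameValuePair>
  <name>indexOnlyMentionedTypes</name>
  <value>
    <boolean>true</boolean>
  </value>
</nameValuePair>
<!-- ... plus one further type (hypothetical name) that no rule mentions. -->
<nameValuePair>
  <name>indexAdditionally</name>
  <value>
    <array>
      <string>org.example.Sentence</string>
    </array>
  </value>
</nameValuePair>]]></programlisting>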
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.reindexOnly">
<title>reindexOnly</title>
<para>
This parameter specifies the annotation types which should be reindexed for Ruta's internal annotations.
All annotation types that changed since the last call of a Ruta script need to be listed here.
The value of this parameter only needs to be adapted for performance optimization in pipelines that
contain several Ruta analysis engines.
The default value is uima.tcas.Annotation.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.reindexSkipTypes">
<title>reindexSkipTypes</title>
<para>
This parameter specifies annotation types that should not be reindexed. These types normally
include annotations that are added once and are not changed in the following pipeline, e.g.,
Tokens or TokenSeed (like CW).
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.reindexOnlyMentionedTypes">
<title>reindexOnlyMentionedTypes</title>
<para>
If this parameter is activated, then only annotations of types that are mentioned within the
rules are internally reindexed at the beginning. This parameter overrides the values of the parameter
'reindexOnly' with the types that are mentioned in the rules. The default value is false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.reindexAdditionally">
<title>reindexAdditionally</title>
<para>
This parameter specifies annotation types that should be reindexed in addition to the types
mentioned in the rules. This parameter is only used if the parameter
'reindexOnlyMentionedTypes' is activated.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.indexUpdateMode">
<title>indexUpdateMode</title>
<para>
This parameter specifies the mode for updating the internal indexing in RutaBasic annotations.
This is a technical parameter for optimizing the runtime performance/speed of RutaEngines.
Available modes are: COMPLETE, ADDITIVE, SAFE_ADDITIVE, NONE.
Default value is ADDITIVE.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.validateInternalIndexing">
<title>validateInternalIndexing</title>
<para>
This option validates the internal indexing in RutaBasic against the current CAS after the indexing
and reindexing is performed. Annotations that are not correctly indexed in RutaBasic annotations cause
exceptions. Annotations of types listed in the parameters 'indexSkipTypes' and 'reindexSkipTypes'
are ignored. The default value is false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.emptyIsInvisible">
<title>emptyIsInvisible</title>
<para>
If this parameter is activated, positions are treated as invisible if the internal indexing of the corresponding
RutaBasic annotation is empty. The default value is true.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.modifyDataPath">
<title>modifyDataPath</title>
<para>
This parameter specifies whether the datapath of the ResourceManager is extended by the values of the configuration parameter <code>descriptorPaths</code>.
The default value is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.strictImports">
<title>strictImports</title>
<para>
This parameter specifies whether short type names should be resolved against the typesystems declared in the script (true) or at runtime in the CAS typesystem (false).
The default value is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.dynamicAnchoring">
<title>dynamicAnchoring</title>
<para>
If this parameter is set to true, then the Ruta rules are not forced to start matching with the first rule element.
Instead, the rule element referring to the rarest type is chosen. This option can be utilized to optimize the performance.
Note that the matching result can vary in some cases when greedy rule elements are applied.
The default value is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.lowMemoryProfile">
<title>lowMemoryProfile</title>
<para>
This parameter specifies whether the memory consumption should be reduced. This parameter should be set to true for
very large CAS documents (e.g., > 500k tokens), but it also reduces the performance. The default value is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.simpleGreedyForComposed">
<title>simpleGreedyForComposed</title>
<para>
This parameter specifies whether a different inference strategy for composed rule elements should be applied. This option is only necessary
when the composed rule element is expected to match very often, e.g., a rule element like (ANY ANY)+.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.debug">
<title>debug</title>
<para>
If this parameter is set to true, then additional information about the execution of a rule script is added to the CAS.
The actual information is specified by the following parameters.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.debugWithMatches">
<title>debugWithMatches</title>
<para>
This parameter specifies whether the match information (covered text) of the rules should be stored in the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.debugOnlyFor">
<title>debugOnlyFor</title>
<para>
This parameter specifies a list of rule ids that identify the rules for which debug information should be created.
No specific ids are given by default.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.profile">
<title>profile</title>
<para>
If this parameter is set to true, then additional information about the runtime of applied rules is added to the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.statistics">
<title>statistics</title>
<para>
If this parameter is set to true, then additional information about the runtime of UIMA Ruta language elements like conditions and actions
is added to the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.createdBy">
<title>createdBy</title>
<para>
If this parameter is set to true, then additional information about what annotation was created by which rule is added to the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.varNames">
<title>varNames</title>
<para>
This parameter specifies the names of variables and is used in combination with the parameter
varValues, which contains the values of the corresponding variables. The n-th entry of this
string array specifies the variable of the n-th entry of the string array of the parameter
varValues. If the variable is defined in the root of a script, then the name of the variable
suffices. If the variable is defined in a BLOCK or imported script, then the name must
contain the namespaces of the blocks as a prefix, e.g., InnerBlock.varName or OtherScript.SomeBlock.varName.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.varValues">
<title>varValues</title>
<para>
This parameter specifies the values of variables as string values in a string array. It is
used in combination with the parameter varNames, which contains the names of the corresponding
variables. The n-th entry of this string array specifies the value of the n-th entry of the
string array of the parameter varNames. The values for list variables are separated by the character
<quote>,</quote>. Thus, the individual values must not contain commas if the variable is a list.
</para>
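<para>
The pairing of varNames and varValues can be sketched with the following descriptor fragment, which belongs inside the <code>configurationParameterSettings</code> element. The variable names <code>threshold</code> and <code>InnerBlock.keywords</code> and their values are hypothetical, and both parameters are assumed to be multi-valued string parameters. The second variable is assumed to be a list defined in a block, so its name carries the block prefix and its elements are separated by commas.
</para>
<programlisting><![CDATA[<nameValuePair>
  <name>varNames</name>
  <value>
    <array>
      <!-- first entry: variable defined in the root of the script -->
      <string>threshold</string>
      <!-- second entry: list variable defined in the block InnerBlock -->
      <string>InnerBlock.keywords</string>
    </array>
  </value>
</nameValuePair>
<nameValuePair>
  <name>varValues</name>
  <value>
    <array>
      <!-- value of the first variable -->
      <string>5</string>
      <!-- elements of the list variable, separated by commas -->
      <string>alpha,beta,gamma</string>
    </array>
  </value>
</nameValuePair>]]></programlisting>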
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.dictRemoveWS">
<title>dictRemoveWS</title>
<para>
If this parameter is set to true, then whitespaces are removed when dictionaries are loaded.
The default is set to "true".
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.csvSeparator">
<title>csvSeparator</title>
<para>
If this parameter is set to a string value, then this string is used to split columns in CSV tables.
The default is set to ';'.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.inferenceVisitors">
<title>inferenceVisitors</title>
<para>
This parameter specifies optional class names implementing the interface
<code>org.apache.uima.ruta.visitor.RutaInferenceVisitor</code>, which will be notified while
the rules are applied.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.maxRuleMatches">
<title>maxRuleMatches</title>
<para>
Maximum number of allowed matches of a single rule.
</para>
</section>
<section id="ugr.tools.ruta.ae.basic.parameter.maxRuleElementMatches">
<title>maxRuleElementMatches</title>
<para>
Maximum number of allowed matches of a single rule element.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.annotationwriter">
<title>Annotation Writer</title>
<para>
This Analysis Engine can be utilized to write the covered text of annotations to a text file, where each covered text is written on its own line.
If the Analysis Engine, for example, is configured for the type <quote>uima.example.Person</quote>, then the covered texts of all Person annotations are stored
in a text file, one person per line.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.annotationwriter.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.annotationwriter.parameter.output">
<title>Output</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.txt</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. If this functionality is activated in the preferences,
then the UIMA Ruta Workbench adds
the SourceDocumentInformation annotation when the user launches a script file.
The default value of this parameter is <quote>/../output/</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.annotationwriter.parameter.encoding">
<title>Encoding</title>
<para>
This string parameter specifies the encoding of the resulting file. The default value of this parameter is <quote>UTF-8</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.annotationwriter.parameter.type">
<title>Type</title>
<para>
Only the covered texts of annotations of the type specified with this parameter are stored in the resulting file.
The default value of this parameter is <quote>uima.tcas.DocumentAnnotation</quote>, which will store the complete document in a new file.
</para>
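<para>
A configuration for the Person example above could look like the following sketch, which belongs inside the <code>configurationParameterSettings</code> element of the Annotation Writer descriptor. It assumes that the parameter names are spelled exactly like the section titles above and that all three parameters are single-valued string parameters.
</para>
<programlisting><![CDATA[<!-- Default output location, interpreted relative to the source document
     if a SourceDocumentInformation annotation is present. -->
<nameValuePair>
  <name>Output</name>
  <value>
    <string>/../output/</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>Encoding</name>
  <value>
    <string>UTF-8</string>
  </value>
</nameValuePair>
<!-- Write one line per Person annotation. -->
<nameValuePair>
  <name>Type</name>
  <value>
    <string>uima.example.Person</string>
  </value>
</nameValuePair>]]></programlisting>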
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.plaintext">
<title>Plain Text Annotator</title>
<para>
This Analysis Engine adds annotations for lines and paragraphs.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project. There are no configuration parameters.
</para>
</section>
<section id="ugr.tools.ruta.ae.modifier">
<title>Modifier</title>
<para>
The Modifier Analysis Engine can be used to create an additional view, which contains all textual modifications and HTML highlightings that
were specified by the executed rules. This Analysis Engine can be applied, e.g.,
for anonymization where all annotations of persons are replaced by the string <quote>Person</quote>.
Furthermore, the content of the new view can optionally be stored in a new HTML file.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.modifier.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.modifier.parameter.styleMap">
<title>styleMap</title>
<para>
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for
additional highlightings in the modified view.
</para>
</section>
<section id="ugr.tools.ruta.ae.modifier.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter can contain multiple string values and specifies the absolute paths where the style map file can be found.
</para>
</section>
<section id="ugr.tools.ruta.ae.modifier.parameter.outputLocation">
<title>outputLocation</title>
<para>
This optional string parameter specifies the absolute path of the resulting file named <quote>output.modified.html</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. If this functionality is activated in the preferences,
then the UIMA Ruta Workbench adds
the SourceDocumentInformation annotation when the user launches a script file.
The default value of this parameter is empty.
In this case no additional html file will be created.
</para>
</section>
<section id="ugr.tools.ruta.ae.modifier.parameter.outputView">
<title>outputView</title>
<para>
This string parameter specifies the name of the view, which will contain the modified document. A view of this name must not yet exist.
The default value of this parameter is <quote>modified</quote>.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.html">
<title>HTML Annotator</title>
<para>
This Analysis Engine provides support for HTML files by adding annotations for the HTML elements. Using the default values, the HTML Annotator creates an annotation
for each HTML element spanning the content of the element, where the most common elements are represented by their own types.
The document <quote><![CDATA[This text is <b>bold</b>.]]></quote>, for example, would be annotated with an annotation of the type
<quote>org.apache.uima.ruta.type.html.B</quote> for the word <quote>bold</quote>. The HTML annotator can be configured
in order to include the start and end elements in the created annotations.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.html.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.html.parameter.onlyContent">
<title>onlyContent</title>
<para>
This parameter specifies whether created annotations should cover only the content of the HTML elements or also their start and end elements.
The default value is <quote>true</quote>.
</para>
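<para>
As a sketch, the following descriptor fragment would let the created annotations also cover the start and end elements. It belongs inside the <code>configurationParameterSettings</code> element of the HTML Annotator descriptor and assumes that the parameter is declared as a single-valued boolean parameter.
</para>
<programlisting><![CDATA[<!-- false: annotations span the start and end elements as well as the content -->
<nameValuePair>
  <name>onlyContent</name>
  <value>
    <boolean>false</boolean>
  </value>
</nameValuePair>]]></programlisting>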
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter">
<title>HTML Converter</title>
<para>
This Analysis Engine is able to convert HTML content from a source view into a plain string representation stored in an output view.
In particular, the Analysis Engine transfers annotations, taking into account the changed document text and annotation offsets in the new view.
The copy process also sets features; however, features of type annotation are currently not supported.
Note that if an annotation would have the same start and end positions in the new view, i.e.,
if it would be mapped to an annotation of length 0, it is not moved to the new view.
The HTML Converter also supports heuristic and explicit conversion patterns which default to html4 decoding,
e.g., "<![CDATA[&nbsp;]]>", "<![CDATA[&lt;]]>", etc. Concepts like tables or lists are not supported.
Note that in general it is suggested to run an html cleaner before any further processing to avoid problems with malformed html.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.outputView">
<title>outputView</title>
<para>
This string parameter specifies the name of the new view.
The default value is <quote>plaintext</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.inputView">
<title>inputView</title>
<para>
This string parameter can optionally be set to specify the name of the input view.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.newlineInducingTags">
<title>newlineInducingTags</title>
<para>
This string array parameter sets the names of the html tags that create linebreaks in the output view.
The default is <quote>br, p, div, ul, ol, dl, li, h1, ..., h6, blockquote</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.replaceLinebreaks">
<title>replaceLinebreaks</title>
<para>
This boolean parameter determines if linebreaks inside the text nodes are kept or removed.
The default behavior is <quote>true</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.linebreakReplacement">
<title>linebreakReplacement</title>
<para>
This string parameter determines the character sequence that replaces a linebreak.
The default behavior is the empty string.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.conversionPolicy">
<title>conversionPolicy</title>
<para>
This string parameter determines the conversion policy used, either "heuristic", "explicit", or "none".
When the value is "explicit", the parameters <quote>conversionPatterns</quote> and optionally <quote>conversionReplacements</quote> are considered.
The "heuristic" conversion policy uses simple regular expressions to decode html4 entities such as "<![CDATA[&nbsp;]]>".
The default behavior is "heuristic".
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.conversionPatterns">
<title>conversionPatterns</title>
<para>
This string array parameter can be used to apply custom conversions.
It defaults to a list of commonly used codes, e.g., <![CDATA[&nbsp;]]>,
which are converted using html 4 entity unescaping.
However, explicit conversion strings can also be passed via the parameter <quote>conversionReplacements</quote>.
Remember to enable explicit conversion via <quote>conversionPolicy</quote> first.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.conversionReplacements">
<title>conversionReplacements</title>
<para>
This string array parameter corresponds to <quote>conversionPatterns</quote>
such that <quote>conversionPatterns[i]</quote> will be replaced by <quote>conversionReplacements[i]</quote>;
replacements should be shorter than the source pattern.
By default, the replacement strings are computed using HTML 4 decoding.
Remember to enable explicit conversion via <quote>conversionPolicy</quote> first.
</para>
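<para>
The three conversion parameters could be combined as in the following sketch, which belongs inside the <code>configurationParameterSettings</code> element of the HTML Converter descriptor. The chosen pattern and replacement are only illustrative, and the parameter types are assumptions. The ampersand of the pattern is escaped because the descriptor itself is an XML file.
</para>
<programlisting><![CDATA[<!-- Enable explicit conversion so that the two arrays below are considered. -->
<nameValuePair>
  <name>conversionPolicy</name>
  <value>
    <string>explicit</string>
  </value>
</nameValuePair>
<!-- conversionPatterns[0] is replaced by conversionReplacements[0]. -->
<nameValuePair>
  <name>conversionPatterns</name>
  <value>
    <array>
      <string>&amp;euro;</string>
    </array>
  </value>
</nameValuePair>
<!-- The replacement is shorter than the source pattern. -->
<nameValuePair>
  <name>conversionReplacements</name>
  <value>
    <array>
      <string>EUR</string>
    </array>
  </value>
</nameValuePair>]]></programlisting>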
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.skipWhitespaces">
<title>skipWhitespaces</title>
<para>
This boolean parameter determines if the converter should skip whitespaces.
HTML documents often contain whitespaces for indentation and formatting,
which should not be reproduced in the converted plain text document.
If the parameter is set to false, then the whitespaces are not removed.
This behavior is useful if XML files are converted instead of HTML documents.
The default value is true.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.processAll">
<title>processAll</title>
<para>
If this boolean parameter is set to true, then the tags of the complete document are processed
and not only those within the body tag.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.newlineInducingTagRegExp">
<title>newlineInducingTagRegExp</title>
<para>
This string parameter contains a regular expression for HTML/XML elements. If the pattern
matches, then the element will introduce a new line break similar to the elements listed in the
parameter <quote>newlineInducingTags</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.gapInducingTags">
<title>gapInducingTags</title>
<para>
This string array parameter sets the names of the html tags that create additional text in the
output view. The actual string of the gap is defined by the parameter <quote>gapText</quote>.
</para>
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.gapText">
<title>gapText</title>
<para>
This string parameter determines the character sequence that is introduced by the html tags
specified in the <quote>gapInducingTags</quote>.
</para>
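<para>
As a sketch, the two gap parameters could be set together as follows. The fragment belongs inside the <code>configurationParameterSettings</code> element of the HTML Converter descriptor. The tag names and the gap string are hypothetical values, and the parameter types are assumptions.
</para>
<programlisting><![CDATA[<!-- Tags that introduce additional text in the output view (hypothetical choice). -->
<nameValuePair>
  <name>gapInducingTags</name>
  <value>
    <array>
      <string>td</string>
      <string>th</string>
    </array>
  </value>
</nameValuePair>
<!-- The character sequence introduced by each of these tags (hypothetical choice). -->
<nameValuePair>
  <name>gapText</name>
  <value>
    <string>|</string>
  </value>
</nameValuePair>]]></programlisting>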
</section>
<section id="ugr.tools.ruta.ae.htmlconverter.parameter.useSpaceGap">
<title>useSpaceGap</title>
<para>
This boolean parameter sets the value of the parameter <quote>gapText</quote> to a single space.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.stylemap">
<title>Style Map Creator</title>
<para>
This Analysis Engine can be utilized to create style map information, which is needed by the Modifier Analysis Engine in order to create
highlighting for some annotations.
Style map information can be created using the <link linkend='ugr.tools.ruta.language.actions.color'>COLOR</link> action.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.stylemap.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.stylemap.parameter.styleMap">
<title>styleMap</title>
<para>
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for
additional highlightings in the modified view.
</para>
</section>
<section id="ugr.tools.ruta.ae.stylemap.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter can contain multiple string values and specifies the absolute paths where the style map can be found.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.cutter">
<title>Cutter</title>
<para>
This Analysis Engine is able to cut the document of the CAS. Only the text covered by annotations of the specified type will be retained and all other parts of the document will be removed.
The offsets of annotations in the index will be updated, but not feature structures nested as feature values.
</para>
<section id="ugr.tools.ruta.ae.cutter.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.cutter.parameter.keep">
<title>keep</title>
<para>
This string parameter specifies the complete name of a type. Only the text covered by annotations of this type will be retained and all other parts of the document will be removed.
</para>
</section>
<section id="ugr.tools.ruta.ae.cutter.parameter.inputView">
<title>inputView</title>
<para>
The name of the view that should be processed.
</para>
</section>
<section id="ugr.tools.ruta.ae.cutter.parameter.outputView">
<title>outputView</title>
<para>
The name of the view, which will contain the modified CAS.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.view">
<title>View Writer</title>
<para>
This Analysis Engine is able to serialize the processed CAS to an XMI file, where the source and destination view can be specified.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.view.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.view.parameter.output">
<title>output</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.xmi</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. If this functionality is activated in the preferences,
then the UIMA Ruta Workbench adds
the SourceDocumentInformation annotation when the user launches a script file.
</para>
</section>
<section id="ugr.tools.ruta.ae.view.parameter.inputView">
<title>inputView</title>
<para>
The name of the view that should be stored in a file.
</para>
</section>
<section id="ugr.tools.ruta.ae.view.parameter.outputView">
<title>outputView</title>
<para>
The name that should be used to store the view in the file.
</para>
</section>
</section>
</section>
<section id="ugr.tools.ruta.ae.xmi">
<title>XMI Writer</title>
<para>
This Analysis Engine is able to serialize the processed CAS to an XMI file. One use case for the XMI Writer is, for example, a rule-based sort,
which stores the processed XMI files in different folders, depending on the execution of the rules, e.g., whether a pattern of annotations occurs or not.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a UIMA Ruta project.
</para>
<section id="ugr.tools.ruta.ae.xmi.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.ruta.ae.xmi.parameter.output">
<title>Output</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.xmi</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. If this functionality is activated in the preferences,
then the UIMA Ruta Workbench adds
the SourceDocumentInformation annotation when the user launches a script file.
The default value is <quote>/../output/</quote>.
</para>
</section>
</section>
</section>
</section>
</chapter>