<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/tools/ruta/language/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor | |
license agreements. See the NOTICE file distributed with this work for additional | |
information regarding copyright ownership. The ASF licenses this file to | |
you under the Apache License, Version 2.0 (the "License"); you may not use | |
this file except in compliance with the License. You may obtain a copy of | |
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required | |
by applicable law or agreed to in writing, software distributed under the | |
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS | |
OF ANY KIND, either express or implied. See the License for the specific | |
language governing permissions and limitations under the License. --> | |
<section id="ugr.tools.ruta.language.anchoring"> | |
<title>Rule elements and their matching order</title> | |
<para> | |
If not specified otherwise, then the UIMA Ruta rules normally start the matching | |
process with their first rule element. The first rule element searches for possible positions for its matching | |
condition and then will advise the next rule element to continue the matching process. | |
For that reason, writing rules that contain a first rule element with an optional quantifier is discouraged | |
and will result in ignoring the optional attribute of the quantifier. | |
</para> | |
<para> | |
The starting rule element can also be manually specified by adding <quote>@</quote> directly in front of the matching condition. | |
In the following example, the rule first searches for capitalized words (CW) and then checks whether | |
there is a period in front of the matched word. | |
<programlisting><![CDATA[PERIOD @CW;]]></programlisting> | |
This functionality can also be used for rules that start with an optional rule element by manually specifying a later | |
rule element to start the matching process. | |
</para> | |
<para> | |
The choice of the starting rule element can greatly influence the performance speed of the rule execution. | |
This circumstance is illustrated with the following example that contains two rules, whereas already an annotation | |
of the type <quote>LastToken</quote> was added to the last token of the document: | |
<programlisting><![CDATA[ANY LastToken; | |
ANY @LastToken;]]></programlisting> | |
The first rule matches on each token of the document and checks whether the next annotation is the last token of the document. | |
This will result in many index operations because all tokens of the document are considered. | |
The second rule, however, matches on the last token and then checks if there is any token in front of it. This | |
rule, therefore, considers only one token. | |
</para> | |
<para> | |
The UIMA Ruta language provides also a concept for automatically selecting the starting rule element called dynamic anchoring. | |
Here, a simple heuristic concerning the position of the rule element and the involved types is applied in order to identify | |
the favorable rule element. This functionality can be activated in the <link linkend="ugr.tools.ruta.ae.basic.parameter">configuration parameters</link> of the analysis engine or | |
directly in the script file with the <link linkend="ugr.tools.ruta.language.actions.dynamicanchoring">DYNAMICANCHORING</link> action. | |
</para> | |
<para> | |
A list of rule elements normally specifies a sequential pattern. The rule is able to match if the first rule element successfully matches | |
and then the following rule element at the position after the match of the first rule element, and so on. There are three language constructs that break up that | |
sequential matching: <quote><![CDATA[&]]></quote>, <quote>|</quote> and <quote>%</quote>. A composed rule element where all inner rule elements are linked by the symbol <quote><![CDATA[&]]></quote> | |
matches only if all inner rule elements successfully match at the given position. A composed rule element with inner rule elements linked by the | |
symbol <quote>|</quote> matches if one of the inner rule element successfully matches. These composed rule elements therefore specify a conjunction (<quote>and</quote>) | |
and a disjunction (<quote>or</quote>) of its rule element at the given position. The symbol <quote>%</quote> specifies a different use case. | |
Here, rules themselves are linked and they are only able to fire if each one of the linked rules successfully matched. In contrast to <quote><![CDATA[&]]></quote>, | |
this linkage of rule elements does not introduce constraints for the matched positions. In the following, a few examples of these three language constructs are given. | |
</para> | |
<programlisting><![CDATA[(Token.posTag=="DET" & Lemma.value=="the");]]></programlisting> | |
<para> | |
This rule is fulfilled, if there is a token whose feature <quote>posTag</quote> has the value <quote>DET</quote> and an annotation of the type <quote>Lemma</quote> whose feature <quote>value</quote> | |
has the value <quote>the</quote>. Both rule elements need to be fulfilled at the same position. | |
</para> | |
<programlisting><![CDATA[NUM (W{REGEXP("Peter") -> Name} & (ANY CW{PARTOF(Name)}));]]></programlisting> | |
<para> | |
This rule matches on a number and then validates if the next word is <quote>Peter</quote> and if next but one token is capitalized and part of an annotation of the type <quote>Name</quote>. | |
If all rule elements successfully matched, then a new annotation of the type <quote>Name</quote> will be created covering the largest match of the linked rule elements. In this example, | |
the new annotation covers also the token after the word <quote>Peter</quote> even if the actions was specified at the rule element with the smaller match. | |
</para> | |
<programlisting><![CDATA[((W{REGEXP("Peter")} CW) | ("Mr" PERIOD CW)){-> Name};]]></programlisting> | |
<para> | |
In this example, an annotation of the type <quote>Name</quote> will be created for the token <quote>Peter</quote> followed by a | |
capitalized word or the word <quote>Mr</quote> followed by a period and a capitalized word. | |
</para> | |
<programlisting><![CDATA[(Animal ((COMMA | "and") Animal)+){-> AnimalEnum};]]></programlisting> | |
<para> | |
This rule annotates enumerations of animal annotations whereas each animal annotation is separated by either a comma or the word <quote>and</quote>. | |
</para> | |
<programlisting><![CDATA[BLOCK(forEach) Sentence{}{ | |
CW NUM % SW NUM{-> MARK(Found, 1, 2)}; | |
}]]></programlisting> | |
<para> | |
Here, annotations of the type <quote>Found</quote> are created if a sentence contains a capitalized word followed by a number and a small written word followed by a number | |
regardless of where these annotations occur in the sentence. | |
</para> | |
</section> |