<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/tools/ruta/workbench/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<section id="section.tools.ruta.workbench.textruler"> | |
<title>TextRuler</title> | |
<para> | |
Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the UIMA Ruta Workbench. | |
It provides several configurable algorithms, which are able to learn new rules based on given labeled data. | |
The framework was created in order to support the user by suggesting new rules for the given task. | |
The user selects a suitable learning algorithm and adapts its configuration parameters. Furthermore, | |
the user engineers a set of annotation-based features, which enable the algorithms to form efficient, effective and comprehensive rules. | |
The rule learning algorithms present their suggested rules in a new view, in which the user can either copy | |
the complete script or single rules to a new script file, where the rules can be further refined. | |
</para> | |
<para> | |
This section gives a short introduction about the included features and learners, and how to use the framework to learn UIMA Ruta rules. First, the | |
available rule learning algorithms are introduced in <xref linkend="section.tools.ruta.workbench.textruler.learner"/>. Then, | |
the user interface and the usage is explained in <xref linkend="section.tools.ruta.workbench.textruler.ui"/> and | |
<xref linkend="section.tools.ruta.workbench.textruler.example"/> illustrates the usage with an exemplary UIMA Ruta project. | |
</para> | |
<section id="section.tools.ruta.workbench.textruler.learner"> | |
<title>Included rule learning algorithms</title> | |
<para> | |
This section gives a short description of the rule learning algorithms, | |
which are provided in the UIMA Ruta TextRuler framework. | |
</para> | |
<section id="section.tools.ruta.workbench.textruler.lp2"> | |
<title>LP2</title> | |
<note> | |
<para> | |
This rule learner is an experimental implementation of the ideas and algorithms published in: | |
F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic | |
Constraints. Technical Report CS-03-07, Department of Computer Science, University of | |
Sheffield, Sheffield, 2003. | |
</para> | |
</note> | |
<para>This algorithms learns separate rules for | |
the beginning and the end of a single slot, which are later combined | |
in order to identify the targeted annotation. The learning strategy is a bottom-up covering | |
algorithm. It starts by creating a specific seed instance with a window of w tokens to the | |
left and right of the target boundary and searches for the best generalization. Additional context rules are | |
induced in order to identify missing boundaries. The current implementation does not support correction rules. | |
The TextRuler framework provides two versions of this algorithm: LP2 (naive) is a straightforward implementation | |
with limited expressiveness concerning the resulting Ruta rules. LP2 (optimized) is an improved | |
version with a dynamic programming approach and is providing better results in general. | |
The following parameters are available. For a more detailed description of the parameters, | |
please refer to the implementation and the publication. | |
</para> | |
<para> | |
<itemizedlist> | |
<listitem> | |
<para>Context Window Size (to the left and right)</para> | |
</listitem> | |
<listitem> | |
<para>Best Rules List Size</para> | |
</listitem> | |
<listitem> | |
<para>Minimum Covered Positives per Rule</para> | |
</listitem> | |
<listitem> | |
<para>Maximum Error Threshold</para> | |
</listitem> | |
<listitem> | |
<para>Contextual Rules List Size</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
<section id="section.tools.ruta.workbench.textruler.whisk"> | |
<title>WHISK</title> | |
<note> | |
<para> | |
This rule learner is an experimental implementation of the ideas and algorithms published in: | |
Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information | |
Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34, | |
pages 233-272, 1999. | |
</para> | |
</note> | |
<para>WHISK is a multi-slot method that operates on all three kinds of documents and learns | |
single- or multi-slot rules looking similar to regular expressions. However, the current implementation only support single slot rules. | |
The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms | |
until the rule does not make errors anymore on the training set. The TextRuler framework provides two versions of this algorithm: | |
WHISK (token) is a naive token-based implementation. WHISK (generic) is an optimized and improved implementation, | |
which is able to refer to arbitrary annotations and also supports primitive features. The following parameters are available. For a more detailed description of the parameters, | |
please refer to the implementation and the publication. | |
</para> | |
<para> | |
<itemizedlist> | |
<listitem> | |
<para>Parameters Window Size</para> | |
</listitem> | |
<listitem> | |
<para>Maximum Error Threshold</para> | |
</listitem> | |
<listitem> | |
<para>PosTag Root Type</para> | |
</listitem> | |
<listitem> | |
<para>Considered Features (comma-separated) - only WHISK (generic)</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
<section id="section.tools.ruta.workbench.textruler.trabal"> | |
<title>TraBaL</title> | |
<note> | |
<para> | |
This rule learner is an implementation of the ideas and algorithms published in: | |
Benjamin Eckstein, Peter Kluegl, and Frank Puppe. Towards Learning Error-Driven | |
Transformations for Information Extraction. Workshop Notes of the LWA 2011 - | |
Learning, Knowledge, Adaptation, 2011. | |
</para> | |
</note> | |
<para> | |
The TraBal rule learner induces rules that try to correct annotations error and relies on two set of documents. A set of | |
documents with gold standard annotation and an additional set of annotated documents with the same text that possibly contain erroneous | |
annotations, for which correction rules should be learnt. First, the algorithm compares the two sets of documents and | |
identifies the present errors. Then, rules for each error are induced and extended. This process can be iterated in order | |
to incrementally remove all errors. The following parameters are available. For a more detailed description of the parameters, | |
please refer to the implementation and the publication. | |
</para> | |
<para> | |
<itemizedlist> | |
<listitem> | |
<para>Number of times, the algorithm iterates.</para> | |
</listitem> | |
<listitem> | |
<para>Number of basic rules to be created for one example.</para> | |
</listitem> | |
<listitem> | |
<para>Number of optimized rules to be created for one example.</para> | |
</listitem> | |
<listitem> | |
<para>Maximum number of iterations, when optimizing rules.</para> | |
</listitem> | |
<listitem> | |
<para>Maximum allowed error rate.</para> | |
</listitem> | |
<listitem> | |
<para>Correct features in rules and conditions. (not yet available)</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
<section id="section.tools.ruta.workbench.textruler.kep"> | |
<title>KEP</title> | |
<!-- | |
<note> | |
<para> | |
</para> | |
</note> | |
--> | |
<para> | |
The name of the rule learner KEP (knowledge engineering patterns) is derived from the idea that humans use different engineering patterns | |
to write annotation rules. This algorithms implements simple rule induction methods for some patterns, such as boundary detection | |
or annotation-based restriction of the window. The results are then combined in order to take advantage of the combination of | |
the different kinds of induced rules. Since the single rules are constructed according to how humans engineer the annotations rules, | |
the resulting rule set should resemble more a handcrafted rule set. Furthermore, by exploiting the synergy of the patterns, solutions for | |
some annotation are much simpler. The following parameters are available. For a more detailed description of the parameters, | |
please refer to the implementation. | |
</para> | |
<para> | |
<itemizedlist> | |
<listitem> | |
<para>Maximum number of <quote>Expand Rules</quote></para> | |
</listitem> | |
<listitem> | |
<para>Maximum number of <quote>Infiller Rules</quote></para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
</section> | |
</section> | |
<section id="section.tools.ruta.workbench.textruler.ui"> | |
<title>The TextRuler view</title> | |
<para> | |
The TextRuler view is normally located in the lower center of the UIMA Ruta perspective and is the main | |
user interface to configure and start the rule learning algorithms. The view consists of four parts (cf. <xref linkend="figure.tools.ruta.workbench.textruler.main"/>): | |
The toolbar contains buttons for starting (green button) and stopping (red button) the learning process, | |
and one button that opening the preference page (blue gears) for configuring the rule induction algorithms cf. <xref linkend="figure.tools.ruta.workbench.textruler.pref"/>. | |
The upper part of the view contains text fields for defining the set of utilized documents. <quote>Training Data</quote> | |
points to the absolute location of the folder containing the gold standard documents. <quote>Additional Data</quote> points | |
to the absolute location of documents that can be additionally used by the algorithms. These documents are currently only needed | |
by the TraBal algorithm, which tries to learn correction rules for the error in those documents. <quote>Test Data</quote> is not yet available. | |
Finally, <quote>Preprocess Script</quote> points to the absolute location of a UIMA Ruta script, which contains all necessary types and can be applied | |
on the documents before the algorithms start in order to add additional annotations as learning features. The preprocessing can be skipped. | |
All text fields support drag and drop: the user can drag a file in the script explorer and drop it in the respective text field. | |
In the center of the view, the target types, for which rule should be induced, can be specified in the <quote>Information Types</quote> list. | |
The list <quote>Featured Feature Types</quote> specify the filtering settings, but it is discourage to change these settings. The user is able to drop | |
a simple text file, which contains a type with complete namespace in each line, to the <quote>Information Types</quote> list in order to add all those types. | |
The lower part of the view contains the list of available algorithms. All checked algorithms will be started, if the start button in the toolbar of the view is pressed. | |
When the algorithms are started, they display their current action after their name, and a result view with the currently induced rules is displayed | |
in the right part of the perspective. | |
</para> | |
<figure id="figure.tools.ruta.workbench.textruler.main"> | |
<title>The UIMA Ruta TextRuler framework | |
</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="776px" format="PNG" align="center" | |
fileref="&imgroot;textruler/textruler.png" /> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="5.4in" format="PNG" align="center" | |
fileref="&imgroot;textruler/textruler.png" /> | |
</imageobject> | |
<textobject> | |
<phrase>UIMA Ruta TextRuler framework</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
<figure id="figure.tools.ruta.workbench.textruler.pref"> | |
<title>The UIMA Ruta TextRuler Preferences | |
</title> | |
<mediaobject> | |
<imageobject role="html"> | |
<imagedata width="576px" format="PNG" align="center" | |
fileref="&imgroot;textruler/textruler_pref.png" /> | |
</imageobject> | |
<imageobject role="fo"> | |
<imagedata width="3.3in" format="PNG" align="center" | |
fileref="&imgroot;textruler/textruler_pref.png" /> | |
</imageobject> | |
<textobject> | |
<phrase>UIMA Ruta TextRuler Preferences</phrase> | |
</textobject> | |
</mediaobject> | |
</figure> | |
</section> | |
<section id="section.tools.ruta.workbench.textruler.example"> | |
<title>Example</title> | |
<para> | |
This section gives a short example how the TextRuler framework is applied in order to induce annotation rules. We refer to the screenshot in <xref linkend="figure.tools.ruta.workbench.textruler.main"/> | |
for the configuration and are using the exemplary UIMA Ruta project <quote>TextRulerExample</quote>, which is part of the source release of UIMA Ruta. | |
</para> | |
<para> | |
In this example, we are using the <quote>KEP</quote> algorithm for learning annotation rules for identifying Bibtex entries in the reference section of scientific publications: | |
<orderedlist> | |
<listitem> | |
<para>Select the folder <quote>single</quote> and drag and drop it to the <quote>Training Data</quote> text field. This folder contains one file with | |
correct annotations and serves as gold standard data in our example.</para> | |
</listitem> | |
<listitem> | |
<para>Select the file <quote>Feature.ruta</quote> and drag and drop it to the <quote>Preprocess Script</quote> text field. This UIMA Ruta script knows all necessary types, especially the types | |
of the annotations we try the learn rules for, and additionally it contains rules that create useful annotations, which can be used by the algorithm in order to learn better rules.</para> | |
</listitem> | |
<listitem> | |
<para>Select the file <quote>InfoTypes.txt</quote> and drag and drop it to the <quote>Information Types</quote> list. This specifies the goal of the learning process, | |
which types of annotations should be annotated by the induced rules, respectively.</para> | |
</listitem> | |
<listitem> | |
<para>Check the checkbox of the <quote>KEP</quote> algorithm and press the start button in the toolbar fo the view.</para> | |
</listitem> | |
<listitem> | |
<para>The algorithm now tries to induce rules for the targeted types. The current result is displayed in the view <quote>KEP Results</quote> in the right part of the perspective.</para> | |
</listitem> | |
<listitem> | |
<para>After the algorithms finished the learning process, create a new UIMA Ruta file in the <quote>uima.ruta.example</quote> package and copy the content of the result view | |
to the new file. Now, the induced rules can be applied as a normal UIMA Ruta script file.</para> | |
</listitem> | |
</orderedlist> | |
</para> | |
</section> | |
</section> |