<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/ruta/workbench/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<section id="section.tools.ruta.workbench.textruler">
<title>TextRuler</title>
<para>
Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the UIMA Ruta Workbench.
It provides several configurable algorithms, which are able to learn new rules based on given labeled data.
The framework was created in order to support the user by suggesting new rules for the given task.
The user selects a suitable learning algorithm and adapts its configuration parameters. Furthermore,
the user engineers a set of annotation-based features, which enable the algorithms to form efficient, effective and comprehensible rules.
The rule learning algorithms present their suggested rules in a new view, in which the user can copy either
the complete script or single rules to a new script file, where the rules can be further refined.
</para>
<para>
This section gives a short introduction to the included features and learners, and to how the framework is used to learn UIMA Ruta rules. First, the
available rule learning algorithms are introduced in <xref linkend="section.tools.ruta.workbench.textruler.learner"/>. Then,
the user interface and the usage are explained in <xref linkend="section.tools.ruta.workbench.textruler.ui"/>, and
<xref linkend="section.tools.ruta.workbench.textruler.example"/> illustrates the usage with an exemplary UIMA Ruta project.
<section id="section.tools.ruta.workbench.textruler.learner">
<title>Included rule learning algorithms</title>
<para>
This section gives a short description of the rule learning algorithms,
which are provided in the UIMA Ruta TextRuler framework.
</para>
<section id="section.tools.ruta.workbench.textruler.lp2">
<title>LP2</title>
<note>
<para>
This rule learner is an experimental implementation of the ideas and algorithms published in:
F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
Constraints. Technical Report CS-03-07, Department of Computer Science, University of
Sheffield, Sheffield, 2003.
</para>
</note>
<para>This algorithm learns separate rules for
the beginning and the end of a single slot, which are later combined
in order to identify the targeted annotation. The learning strategy is a bottom-up covering
algorithm. It starts by creating a specific seed instance with a window of w tokens to the
left and right of the target boundary and searches for the best generalization. Additional context rules are
induced in order to identify missing boundaries. The current implementation does not support correction rules.
The TextRuler framework provides two versions of this algorithm: LP2 (naive) is a straightforward implementation
with limited expressiveness concerning the resulting Ruta rules. LP2 (optimized) is an improved
version with a dynamic programming approach and provides better results in general.
The following parameters are available. For a more detailed description of the parameters,
please refer to the implementation and the publication.
</para>
<para>
<itemizedlist>
<listitem>
<para>Context Window Size (to the left and right)</para>
</listitem>
<listitem>
<para>Best Rules List Size</para>
</listitem>
<listitem>
<para>Minimum Covered Positives per Rule</para>
</listitem>
<listitem>
<para>Maximum Error Threshold</para>
</listitem>
<listitem>
<para>Contextual Rules List Size</para>
</listitem>
</itemizedlist>
</para>
</section>
<section id="section.tools.ruta.workbench.textruler.whisk">
<title>WHISK</title>
<note>
<para>
This rule learner is an experimental implementation of the ideas and algorithms published in:
Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
pages 233-272, 1999.
</para>
</note>
<para>WHISK is a multi-slot method that operates on all three kinds of documents (structured, semi-structured and free text) and learns
single- or multi-slot rules resembling regular expressions. However, the current implementation only supports single-slot rules.
The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms
until the rule no longer makes errors on the training set. The TextRuler framework provides two versions of this algorithm:
WHISK (token) is a naive token-based implementation. WHISK (generic) is an optimized and improved implementation,
which is able to refer to arbitrary annotations and also supports primitive features. The following parameters are available. For a more detailed description of the parameters,
please refer to the implementation and the publication.
</para>
<para>
<itemizedlist>
<listitem>
<para>Window Size</para>
</listitem>
<listitem>
<para>Maximum Error Threshold</para>
</listitem>
<listitem>
<para>PosTag Root Type</para>
</listitem>
<listitem>
<para>Considered Features (comma-separated) - only WHISK (generic)</para>
</listitem>
</itemizedlist>
</para>
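<para>
As a rough illustration, a WHISK-like single-slot rule could be expressed in UIMA Ruta as follows. The type <quote>Year</quote> and the concrete pattern are invented for this example:
</para>
<programlisting>DECLARE Year;
// hypothetical single-slot rule resembling a WHISK pattern:
// skip arbitrary tokens (#) after a comma and extract a four-digit number
COMMA # NUM{REGEXP("(19|20)..") -> MARK(Year)} PERIOD;</programlisting>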
</section>
<section id="section.tools.ruta.workbench.textruler.trabal">
<title>TraBaL</title>
<note>
<para>
This rule learner is an implementation of the ideas and algorithms published in:
Benjamin Eckstein, Peter Kluegl, and Frank Puppe. Towards Learning Error-Driven
Transformations for Information Extraction. Workshop Notes of the LWA 2011 -
Learning, Knowledge, Adaptation, 2011.
</para>
</note>
<para>
The TraBal rule learner induces rules that try to correct annotation errors and relies on two sets of documents: a set of
documents with gold standard annotations, and an additional set of annotated documents with the same text that possibly contain erroneous
annotations, for which correction rules should be learned. First, the algorithm compares the two sets of documents and
identifies the present errors. Then, rules for each error are induced and extended. This process can be iterated in order
to incrementally remove all errors. The following parameters are available. For a more detailed description of the parameters,
please refer to the implementation and the publication.
</para>
<para>
<itemizedlist>
<listitem>
<para>Number of times the algorithm iterates.</para>
</listitem>
<listitem>
<para>Number of basic rules to be created for one example.</para>
</listitem>
<listitem>
<para>Number of optimized rules to be created for one example.</para>
</listitem>
<listitem>
<para>Maximum number of iterations when optimizing rules.</para>
</listitem>
<listitem>
<para>Maximum allowed error rate.</para>
</listitem>
<listitem>
<para>Correct features in rules and conditions. (not yet available)</para>
</listitem>
</itemizedlist>
</para>
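<para>
The following sketch illustrates the kind of transformation rules the TraBal learner aims at. The types <quote>Author</quote> and <quote>Title</quote> and the concrete corrections are invented for this example:
</para>
<programlisting>DECLARE Author, Title;
// hypothetical correction rule: extend an Author annotation
// that ends one token too early
Author{-> SHIFT(Author, 1, 2)} CW;
// hypothetical correction rule: remove an Author annotation
// that was erroneously created within a title
Author{PARTOF(Title) -> UNMARK(Author)};</programlisting>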
</section>
<section id="section.tools.ruta.workbench.textruler.kep">
<title>KEP</title>
<para>
The name of the rule learner KEP (knowledge engineering patterns) is derived from the idea that humans use different engineering patterns
to write annotation rules. This algorithm implements simple rule induction methods for some of these patterns, such as boundary detection
or annotation-based restriction of the window. The results are then combined in order to take advantage of the combination of
the different kinds of induced rules. Since the single rules are constructed according to how humans engineer annotation rules,
the resulting rule set should more closely resemble a handcrafted rule set. Furthermore, by exploiting the synergy of the patterns, solutions for
some annotation problems are much simpler. The following parameters are available. For a more detailed description of the parameters,
please refer to the implementation.
</para>
<para>
<itemizedlist>
<listitem>
<para>Maximum number of <quote>Expand Rules</quote></para>
</listitem>
<listitem>
<para>Maximum number of <quote>Infiller Rules</quote></para>
</listitem>
</itemizedlist>
</para>
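<para>
The following sketch illustrates two of these engineering patterns, window restriction and boundary-based detection, with invented type names (<quote>Reference</quote>, <quote>Year</quote>); the actually induced rules depend on the training data:
</para>
<programlisting>DECLARE Reference, Year;
// window restriction: process each Reference annotation separately
BLOCK(forEach) Reference{} {
    // detection within the restricted window
    NUM{REGEXP("(19|20)..") -> MARK(Year)};
}</programlisting>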
</section>
</section>
<section id="section.tools.ruta.workbench.textruler.ui">
<title>The TextRuler view</title>
<para>
The TextRuler view is normally located in the lower center of the UIMA Ruta perspective and is the main
user interface for configuring and starting the rule learning algorithms. The view consists of four parts (cf. <xref linkend="figure.tools.ruta.workbench.textruler.main"/>):
The toolbar contains buttons for starting (green button) and stopping (red button) the learning process,
and one button that opens the preference page (blue gears) for configuring the rule induction algorithms (cf. <xref linkend="figure.tools.ruta.workbench.textruler.pref"/>).
The upper part of the view contains text fields for defining the set of utilized documents. <quote>Training Data</quote>
points to the absolute location of the folder containing the gold standard documents. <quote>Additional Data</quote> points
to the absolute location of documents that can additionally be used by the algorithms. These documents are currently only needed
by the TraBal algorithm, which tries to learn correction rules for the errors in those documents. <quote>Test Data</quote> is not yet available.
Finally, <quote>Preprocess Script</quote> points to the absolute location of a UIMA Ruta script, which contains all necessary types and can be applied
to the documents before the algorithms start in order to add additional annotations as learning features. The preprocessing can be skipped.
All text fields support drag and drop: the user can drag a file in the script explorer and drop it on the respective text field.
In the center of the view, the target types, for which rules should be induced, can be specified in the <quote>Information Types</quote> list.
The <quote>Featured Feature Types</quote> list specifies the filtering settings, but changing these settings is discouraged. The user can drop
a simple text file, which contains one type with its complete namespace per line, on the <quote>Information Types</quote> list in order to add all those types at once.
The lower part of the view contains the list of available algorithms. All checked algorithms are started when the start button in the toolbar of the view is pressed.
While the algorithms are running, they display their current action after their name, and a result view with the currently induced rules is displayed
in the right part of the perspective.
</para>
<figure id="figure.tools.ruta.workbench.textruler.main">
<title>The UIMA Ruta TextRuler framework
</title>
<mediaobject>
<imageobject role="html">
<imagedata width="776px" format="PNG" align="center"
fileref="&imgroot;textruler/textruler.png" />
</imageobject>
<imageobject role="fo">
<imagedata width="5.4in" format="PNG" align="center"
fileref="&imgroot;textruler/textruler.png" />
</imageobject>
<textobject>
<phrase>UIMA Ruta TextRuler framework</phrase>
</textobject>
</mediaobject>
</figure>
<figure id="figure.tools.ruta.workbench.textruler.pref">
<title>The UIMA Ruta TextRuler Preferences
</title>
<mediaobject>
<imageobject role="html">
<imagedata width="576px" format="PNG" align="center"
fileref="&imgroot;textruler/textruler_pref.png" />
</imageobject>
<imageobject role="fo">
<imagedata width="3.3in" format="PNG" align="center"
fileref="&imgroot;textruler/textruler_pref.png" />
</imageobject>
<textobject>
<phrase>UIMA Ruta TextRuler Preferences</phrase>
</textobject>
</mediaobject>
</figure>
</section>
<section id="section.tools.ruta.workbench.textruler.example">
<title>Example</title>
<para>
This section gives a short example of how the TextRuler framework is applied in order to induce annotation rules. We refer to the screenshot in <xref linkend="figure.tools.ruta.workbench.textruler.main"/>
for the configuration and use the exemplary UIMA Ruta project <quote>TextRulerExample</quote>, which is part of the source release of UIMA Ruta.
</para>
<para>
In this example, we are using the <quote>KEP</quote> algorithm for learning annotation rules for identifying Bibtex entries in the reference section of scientific publications:
<orderedlist>
<listitem>
<para>Select the folder <quote>single</quote> and drag and drop it to the <quote>Training Data</quote> text field. This folder contains one file with
correct annotations and serves as gold standard data in our example.</para>
</listitem>
<listitem>
<para>Select the file <quote>Feature.ruta</quote> and drag and drop it to the <quote>Preprocess Script</quote> text field. This UIMA Ruta script contains all necessary types, especially the types
of the annotations we try to learn rules for, and additionally it contains rules that create useful annotations, which can be used by the algorithm in order to learn better rules.</para>
</listitem>
<listitem>
<para>Select the file <quote>InfoTypes.txt</quote> and drag and drop it to the <quote>Information Types</quote> list. This specifies the goal of the learning process, that is, the types of annotations
that should be created by the induced rules.</para>
</listitem>
<listitem>
<para>Check the checkbox of the <quote>KEP</quote> algorithm and press the start button in the toolbar of the view.</para>
</listitem>
<listitem>
<para>The algorithm now tries to induce rules for the targeted types. The current result is displayed in the view <quote>KEP Results</quote> in the right part of the perspective.</para>
</listitem>
<listitem>
<para>After the algorithm has finished the learning process, create a new UIMA Ruta file in the <quote>uima.ruta.example</quote> package and copy the content of the result view
to the new file. Now, the induced rules can be applied like a normal UIMA Ruta script file.</para>
</listitem>
</orderedlist>
</para>
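<para>
The content copied from the result view is a regular UIMA Ruta script. Purely for illustration, an excerpt could look like the following; the type <quote>Year</quote> and the rule itself are invented here, as the actually induced rules depend on the training documents:
</para>
<programlisting>PACKAGE uima.ruta.example;
// hypothetical induced rule: a four-digit number
// preceded by a comma is annotated as Year
COMMA NUM{REGEXP("(19|20)..") -> MARK(Year)};</programlisting>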
</section>
</section>