<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tm/workbench/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<section id="section.ugr.tools.tm.workbench.textruler">
<title>TextRuler</title>
<para> Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted
rules, often supported by a gold standard, in order to create a domain-dependent information
extraction application. When the engineering process starts with the acquisition of extraction
knowledge for a possibly new slot or, more generally, for new concepts, machine learning methods
are often able to offer support in an iterative engineering process. This section gives a
conceptual overview of the process model for the semi-automatic development of rule-based
information extraction applications.
</para>
<para> First, a suitable set of documents that contain the text fragments with interesting
patterns needs to be selected and annotated with the target concepts. Then, the knowledge
engineer chooses and configures the methods for automatic rule acquisition to the best of their
knowledge for the learning task at hand: lambda expressions based on tokens and linguistic
features, for example, differ in their application domain from wrappers that process generated
HTML pages.
</para>
<para> Furthermore, parameters like the window size, which defines the relevant features, need to
be set to appropriate values. Before the annotated training documents form the input of the
learning task, they are enriched with features generated by the partial rule set of the
application under development. The results of the methods, that is, the learned rules, are then
proposed to the knowledge engineer for the extraction of the target concept.
</para>
<para> The knowledge engineer has different options to proceed: If the quality, amount or
generality of the presented rules is not sufficient, then additional training documents need to
be annotated, or additional rules have to be handcrafted, in order to provide more, or more
appropriate, features. Rules or rule sets of high quality can be modified, combined or
generalized and transferred to the rule set of the application in order to support the
extraction task of the target concept. In case the methods did not learn reasonable rules at
all, the knowledge engineer proceeds with writing handcrafted rules.
</para>
<para> Having gathered enough extraction knowledge for the current concept, the semi-automatic
process is iterated and the focus is moved to the next concept until the development of the
application is completed.
</para>
<section id="ugr.tools.tm.textruler.learner">
<title>Available Learners</title>
<para>
The available learners are based on the following publications:
<orderedlist numeration="arabic">
<!--
<listitem>
<para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI,
pages 577-583, 2000.</para>
</listitem>
-->
<listitem>
<para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
Constraints. Technical Report CS-03-07, Department of Computer Science, University of
Sheffield, Sheffield, 2003.</para>
</listitem>
<listitem>
<para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern
Matching Rules for Information Extraction. Journal of Machine Learning Research,
4:177-210, 2003.</para>
</listitem>
<listitem>
<para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
pages 233-272, 1999.</para>
</listitem>
<listitem>
<para> N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information
Extraction. In Proceedings of the International Joint Conference on Artificial
Intelligence (IJCAI), 1997.</para>
</listitem>
</orderedlist>
</para>
<para>
Each available learner is characterized by several features, whose meanings are explained here:
<itemizedlist>
<listitem>
<para> Strategy: The learning strategy used by the method; the methods listed here commonly
apply covering algorithms.</para>
</listitem>
<listitem>
<para>
Document: The type of document the method operates on. It may be
<quote>free</quote>
(free text, as in newspaper articles),
<quote>semi</quote>
(semi-structured text) or
<quote>struct</quote>
(structured text, as in generated HTML pages).
</para>
</listitem>
<listitem>
<para> Slots: The slots refer to a single annotation that represents the goal of the
learning task. Some rules are able to create several annotations at once in the same
context (multi-slot). However, only single slots are supported by the current
implementations.</para>
</listitem>
<listitem>
<para> Status: The current status of the implementation in the TextRuler framework.</para>
</listitem>
</itemizedlist>
</para>
<para>
The following table gives an overview:
<table id="table.ugr.tools.tm.workbench.textruler.available_learners" frame="all">
<title>Overview of available learners</title>
<tgroup cols="6" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1*" />
<colspec colname="c2" colwidth="1*" />
<colspec colname="c3" colwidth="1*" />
<colspec colname="c4" colwidth="1*" />
<colspec colname="c5" colwidth="1*" />
<colspec colname="c6" colwidth="1*" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Strategy</entry>
<entry align="center">Document</entry>
<entry align="center">Slots</entry>
<entry align="center">Status</entry>
<entry align="center">Publication</entry>
</row>
</thead>
<tbody>
<!--
<row>
<entry>BWI</entry>
<entry>Boosting, Top Down</entry>
<entry>Struct, Semi</entry>
<entry>Single, Boundary</entry>
<entry>Planning</entry>
<entry>1</entry>
</row>
-->
<row>
<entry>LP2</entry>
<entry>Bottom Up Cover</entry>
<entry>All</entry>
<entry>Single, Boundary</entry>
<entry>Prototype</entry>
<entry>2</entry>
</row>
<row>
<entry>RAPIER</entry>
<entry>Top Down/Bottom Up Compr.</entry>
<entry>Semi</entry>
<entry>Single</entry>
<entry>Experimental</entry>
<entry>3</entry>
</row>
<row>
<entry>WHISK</entry>
<entry>Top Down Cover</entry>
<entry>All</entry>
<entry>Multi</entry>
<entry>Prototype</entry>
<entry>4</entry>
</row>
<row>
<entry>WIEN</entry>
<entry>CSP</entry>
<entry>Struct</entry>
<entry>Multi, Rows</entry>
<entry>Prototype</entry>
<entry>5</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<!--
<section id="section.ugr.tools.tm.workbench.textruler.bwi">
<title>BWI (Boosted Wrapper Induction)</title>
<para> BWI uses boosting techniques to improve the performance of simple pattern matching
single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the
"fore" and the "aft" detectors. Weighted by their confidences and combined with a slot
length histogram derived from the training data they can classify a given pair of boundaries
within a document. BWI can be used for structured, semi-structured and free text. The
patterns are token-based with special wildcards for more general rules. </para>
<para> Implementations No implementations are yet available. </para>
<para> Parameters No parameters are yet available. </para>
</section>
-->
<section id="section.ugr.tools.tm.workbench.textruler.lp2">
<title>LP2</title>
<para>This method operates on all three kinds of documents. It learns separate rules for
the beginning and the end of a single slot. So-called tagging rules insert boundary SGML
tags, and additionally induced correction rules shift misplaced tags to their correct
positions in order to improve precision. The learning strategy is a bottom-up covering
algorithm: it starts by creating a specific seed instance with a window of w tokens to the
left and right of the target boundary and searches for the best generalization. Additional
linguistic NLP features can be used in order to generalize over the flat word sequence.
</para>
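<para> The covering strategy described above can be illustrated with a minimal, hypothetical
Python sketch. The token-window representation, the wildcard generalization and all function
names are simplifications for illustration only; they are not part of the TextRuler
implementation:
<programlisting>
# Hypothetical sketch of an LP2-style bottom-up covering loop.
# A boundary example is (tokens, boundary_index); a rule is a tuple of
# tokens in which '*' is a wildcard. Illustrative only.

def seed_rule(tokens, boundary, w):
    """Most specific rule: the w tokens left and right of the boundary."""
    return tuple(tokens[max(0, boundary - w):boundary + w])

def generalizations(rule):
    """Generalize by replacing one token with the wildcard '*'."""
    for i in range(len(rule)):
        yield rule[:i] + ('*',) + rule[i + 1:]

def covers(rule, window):
    return len(rule) == len(window) and all(
        r == '*' or r == t for r, t in zip(rule, window))

def learn(examples, w=1):
    """Covering loop: keep the best generalization until all examples are covered."""
    rules, uncovered = [], list(examples)
    while uncovered:
        tokens, boundary = uncovered[0]
        rule = seed_rule(tokens, boundary, w)
        # score a candidate by how many uncovered examples it covers
        def coverage(r):
            return sum(covers(r, seed_rule(t, b, w)) for t, b in uncovered)
        best = max(generalizations(rule), key=coverage)
        rules.append(best)
        uncovered = [(t, b) for t, b in uncovered
                     if not covers(best, seed_rule(t, b, w))]
    return rules
</programlisting>
The real method additionally induces correction rules and exploits linguistic features; the
sketch only shows why a covering loop terminates (each learned rule covers at least its own
seed, so the set of uncovered examples shrinks in every iteration).
</para>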
<para> Parameters:
<itemizedlist>
<listitem><para>Context Window Size (to the left and right)</para></listitem>
<listitem><para>Best Rules List Size</para></listitem>
<listitem><para>Minimum Covered Positives per Rule</para></listitem>
<listitem><para>Maximum Error Threshold</para></listitem>
<listitem><para>Contextual Rules List Size</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.rapier">
<title>RAPIER</title>
<para>RAPIER induces single-slot extraction rules for semi-structured documents. The rules
consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each pattern can
hold several constraints on tokens and the corresponding POS tag and semantic information. The
algorithm uses a bottom-up compression strategy, starting with a most specific seed rule for
each training instance. This initial rule base is compressed by randomly selecting rule
pairs and searching for their best generalization. Considering two rules, the least general
generalization (LGG) of the slot fillers is created and then specialized by adding rule items
to the pre- and post-filler patterns until the new rule operates well on the training set. The
best of the k rules (k-beam search) is added to the rule base and all empirically subsumed
rules are removed. </para>
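<para> The central LGG step can be sketched as follows in Python. The pattern representation
(a tuple of tokens with '*' as a wildcard) and the single deterministic compression pass are
strong simplifications and not part of the TextRuler implementation:
<programlisting>
# Hypothetical sketch of RAPIER-style rule compression via least general
# generalization (LGG). Illustrative only.

def lgg(filler_a, filler_b):
    """LGG of two equal-length filler patterns: identical tokens are kept,
    differing tokens are replaced by the wildcard '*'."""
    assert len(filler_a) == len(filler_b)
    return tuple(a if a == b else '*' for a, b in zip(filler_a, filler_b))

def compress(rules):
    """One compression pass: repeatedly replace a pair of equal-length rules
    by its LGG, which by construction subsumes both originals."""
    rules = list(rules)
    compressed = []
    while rules:
        rule = rules.pop(0)
        partner = next((r for r in rules if len(r) == len(rule)), None)
        if partner is None:
            compressed.append(rule)
        else:
            rules.remove(partner)
            rules.append(lgg(rule, partner))
    return compressed
</programlisting>
The actual algorithm pairs rules randomly, specializes the LGG with pre- and post-filler
items, and keeps only the best of k candidates; the sketch shows only how two specific
fillers collapse into one more general pattern.
</para>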
<para> Parameters:
<itemizedlist>
<listitem><para>Maximum Compression Fail Count</para></listitem>
<listitem><para>Internal Rules List Size</para></listitem>
<listitem><para>Rule Pairs for Generalizing</para></listitem>
<listitem><para>Maximum 'No improvement' Count</para></listitem>
<listitem><para>Maximum Noise Threshold</para></listitem>
<listitem><para>Minimum Covered Positives per Rule</para></listitem>
<listitem><para>PosTag Root Type</para></listitem>
<listitem><para>Use All 3 GenSets at Specialization</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.whisk">
<title>WHISK</title>
<para> WHISK is a multi-slot method that operates on all three kinds of documents and learns
single- or multi-slot rules that resemble regular expressions. The top-down covering
algorithm begins with the most general rule and specializes it by adding single rule terms
until the rule makes no errors on the training set. Domain-specific classes or linguistic
information obtained by a syntactic analyzer can be used as additional features. The exact
definition of a rule term (e.g., a token) and of a problem instance (e.g., a whole document or
a single sentence) depends on the operating domain and document type. </para>
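<para> The top-down specialization loop can be sketched in Python as follows. Here a rule is
simply a list of required terms and an instance is a labeled token list; these
representations, like all names below, are hypothetical simplifications and not the TextRuler
implementation:
<programlisting>
# Hypothetical sketch of a WHISK-style top-down specialization loop:
# start from the empty (most general) rule and greedily add the term
# that removes the most errors on the training set. Illustrative only.

def errors(rule_terms, instances):
    """False positives: negative instances matched by all rule terms."""
    return sum(1 for tokens, label in instances
               if not label and all(t in tokens for t in rule_terms))

def specialize(instances, candidate_terms):
    rule = []
    while errors(rule, instances) > 0:
        # add the single term that leaves the fewest remaining errors
        best = min(candidate_terms,
                   key=lambda t: errors(rule + [t], instances))
        if errors(rule + [best], instances) == errors(rule, instances):
            break  # no candidate term improves the rule any further
        rule.append(best)
    return rule
</programlisting>
The empty rule matches everything; each added term restricts it, which is the essence of a
top-down strategy (the real method also requires the positive instances to stay covered).
</para>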
<para> Parameters:
<itemizedlist>
<listitem><para>Window Size</para></listitem>
<listitem><para>Maximum Error Threshold</para></listitem>
<listitem><para>PosTag Root Type</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.wien">
<title>WIEN</title>
<para> WIEN is the only method listed here that operates on highly structured texts only. It
induces so-called wrappers that anchor the slots by the structured context around them.
The HLRT (head left right tail) wrapper class, for example, can determine and extract several
multi-slot templates by first separating the important information block from the unimportant
head and tail portions, and then extracting multiple data rows from table-like data
structures in the remaining document. Inducing a wrapper is done by solving a CSP for all
possible pattern combinations from the training data. </para>
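<para> Applying an already induced wrapper of this kind can be sketched in a few lines of
Python. The head and the left/right delimiters are assumed to be given (in WIEN they are the
result of the CSP search, which is not shown here), and all names are illustrative:
<programlisting>
# Hypothetical sketch of applying an HLRT-style wrapper: extract every
# value enclosed by the learned left/right delimiters after the head
# marker. The delimiters are assumed, not induced. Illustrative only.

def extract_hlrt(page, head, left, right):
    """Extract each string between 'left' and 'right' following 'head'."""
    values = []
    pos = page.find(head) + len(head)   # head is assumed to be present
    while True:
        start = page.find(left, pos)
        if start == -1:
            return values
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            return values
        values.append(page[start:end])
        pos = end + len(right)
</programlisting>
For instance, with the hypothetical delimiters <quote>[b]</quote> and <quote>[/b]</quote>,
the page <quote>HEADER [b]Alice[/b] [b]Bob[/b] footer</quote> yields the two row values
Alice and Bob. A full HLRT wrapper additionally uses the tail marker to stop extraction
before trailing page content.
</para>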
<para> Parameters: No parameters are available. </para>
</section>
</section>
</section>