<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/tools/tm/workbench/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<section id="section.ugr.tools.tm.workbench.textruler"> | |
<title>TextRuler</title> | |
<para> In the knowledge engineering approach, a knowledge engineer normally writes handcrafted
rules to create a domain-dependent information extraction application, often supported by a gold
standard. When extraction knowledge has to be acquired for a new slot or, more generally, for
new concepts, machine learning methods can often support an iterative engineering process.
This section gives a conceptual overview of the process model for the semi-automatic
development of rule-based information extraction applications.
</para> | |
<para> First, a suitable set of documents that contain text fragments with interesting
patterns needs to be selected and annotated with the target concepts. Then, the knowledge
engineer chooses and configures the methods for automatic rule acquisition that best fit the
learning task: Lambda expressions based on tokens and linguistic features, for example, differ
in their application domain from wrappers that process generated HTML pages.
</para> | |
<para> Furthermore, parameters such as the window size that defines the relevant features need
to be set to an appropriate level. Before the annotated training documents form the input of
the learning task, they are enriched with features generated by the partial rule set of the
application under development. The result of the methods, that is, the set of learned rules,
is proposed to the knowledge engineer for the extraction of the target concept.
</para> | |
<para> The knowledge engineer has several options to proceed: If the quality, amount or
generality of the presented rules is not sufficient, then additional training documents need to
be annotated, or additional rules have to be handcrafted to provide more features or more
appropriate features. Rules or rule sets of high quality can be modified, combined or
generalized and transferred to the rule set of the application in order to support the
extraction task of the target concept. If the methods did not learn reasonable rules at all,
the knowledge engineer proceeds with writing handcrafted rules.
</para> | |
<para> Once enough extraction knowledge has been gathered for the current concept, the
semi-automatic process is repeated with the focus on the next concept, until the development
of the application is completed.
</para> | |
<section id="ugr.tools.tm.textruler.learner"> | |
<title>Available Learners</title> | |
<para> | |
The available learners are based on the following publications: | |
<orderedlist numeration="arabic"> | |
<!-- | |
<listitem> | |
<para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI, | |
pages 577-583, 2000.</para> | |
</listitem> | |
--> | |
<listitem> | |
<para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic | |
Constraints. Technical Report CS-03-07, Department of Computer Science, University of | |
Sheffield, Sheffield, 2003.</para> | |
</listitem> | |
<listitem> | |
<para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern | |
Matching Rules for Information Extraction. Journal of Machine Learning Research, | |
4:177-210, 2003.</para> | |
</listitem> | |
<listitem> | |
<para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information | |
Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34, | |
pages 233-272, 1999.</para> | |
</listitem> | |
<listitem> | |
<para> N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper Induction for Information
Extraction. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 1997.</para>
</listitem> | |
</orderedlist> | |
</para> | |
<para> | |
Each available learner has several features. Their meaning is explained here: | |
<itemizedlist> | |
<listitem> | |
<para> Strategy: The learning strategy of the method. Most of the learning methods are
covering algorithms.</para>
</listitem> | |
<listitem> | |
<para> | |
Document: The type of the document may be | |
<quote>free</quote> | |
like in newspapers, | |
<quote>semi</quote> | |
or | |
<quote>struct</quote> | |
like in HTML pages. | |
</para> | |
</listitem> | |
<listitem> | |
<para> Slots: A slot refers to a single annotation that represents the goal of the
learning task. Some rules are able to create several annotations at once in the same
context (multi-slot). However, only single slots are supported by the current
implementations.</para>
</listitem> | |
<listitem> | |
<para> Status: The current status of the implementation in the TextRuler framework.</para> | |
</listitem> | |
</itemizedlist> | |
</para> | |
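<para> The covering strategy mentioned above can be illustrated with a minimal sketch. It is not
the TextRuler implementation: the function names, the representation of a rule as a predicate
and the toy prefix-based rule learner are assumptions made only for this illustration. The loop
learns one rule at a time and removes the positive examples that the rule covers until none
remain.
</para>
<programlisting><![CDATA[
```python
from typing import Callable, List, Set

# Hypothetical representation: a rule is any predicate over an example string.
Rule = Callable[[str], bool]


def sequential_covering(positives: Set[str],
                        negatives: Set[str],
                        learn_one_rule: Callable[[Set[str], Set[str]], Rule]) -> List[Rule]:
    """Learn rules until every positive example is covered by some rule."""
    uncovered = set(positives)
    rules: List[Rule] = []
    while uncovered:
        rule = learn_one_rule(uncovered, negatives)
        newly_covered = {ex for ex in uncovered if rule(ex)}
        if not newly_covered:
            break  # no progress: stop instead of looping forever
        rules.append(rule)
        uncovered -= newly_covered
    return rules


def learn_prefix_rule(uncovered: Set[str], negatives: Set[str]) -> Rule:
    """Toy learner: shortest prefix of one uncovered example that matches no negative."""
    seed = sorted(uncovered)[0]
    for n in range(1, len(seed) + 1):
        prefix = seed[:n]
        if not any(neg.startswith(prefix) for neg in negatives):
            return lambda ex, p=prefix: ex.startswith(p)
    # Fall back to an exact match if every prefix also matches a negative.
    return lambda ex, s=seed: ex == s


rules = sequential_covering({"Dr. Smith", "Dr. Jones", "Prof. Lee"},
                            {"Mr. Brown", "Drive"},
                            learn_prefix_rule)
```
]]></programlisting>
<para> In this toy run, two rules are enough to cover the three positive examples without
matching a negative one. The learners described below differ mainly in how the single rule is
found in each iteration.
</para>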
<para> | |
The following table gives an overview: | |
<table id="table.ugr.tools.tm.workbench.textruler.available_learners" frame="all"> | |
<title>Overview of available learners</title> | |
<tgroup cols="6" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="1*" /> | |
<colspec colname="c2" colwidth="1*" /> | |
<colspec colname="c3" colwidth="1*" /> | |
<colspec colname="c4" colwidth="1*" /> | |
<colspec colname="c5" colwidth="1*" /> | |
<colspec colname="c6" colwidth="1*" /> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Strategy</entry> | |
<entry align="center">Document</entry> | |
<entry align="center">Slots</entry> | |
<entry align="center">Status</entry> | |
<entry align="center">Publication</entry> | |
</row> | |
</thead> | |
<tbody> | |
<!-- | |
<row> | |
<entry>BWI</entry> | |
<entry>Boosting, Top Down</entry> | |
<entry>Struct, Semi</entry> | |
<entry>Single, Boundary</entry> | |
<entry>Planning</entry> | |
<entry>1</entry> | |
</row> | |
--> | |
<row>
<entry>LP2</entry>
<entry>Bottom Up Cover</entry>
<entry>All</entry>
<entry>Single, Boundary</entry>
<entry>Prototype</entry>
<entry>1</entry>
</row>
<row>
<entry>RAPIER</entry>
<entry>Top Down/Bottom Up Compr.</entry>
<entry>Semi</entry>
<entry>Single</entry>
<entry>Experimental</entry>
<entry>2</entry>
</row>
<row>
<entry>WHISK</entry>
<entry>Top Down Cover</entry>
<entry>All</entry>
<entry>Multi</entry>
<entry>Prototype</entry>
<entry>3</entry>
</row>
<row>
<entry>WIEN</entry>
<entry>CSP</entry>
<entry>Struct</entry>
<entry>Multi, Rows</entry>
<entry>Prototype</entry>
<entry>4</entry>
</row>
</tbody> | |
</tgroup> | |
</table> | |
</para> | |
<!-- | |
<section id="section.ugr.tools.tm.workbench.textruler.bwi"> | |
<title>BWI (Boosted Wrapper Induction)</title> | |
<para> BWI uses boosting techniques to improve the performance of simple pattern matching | |
single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the | |
"fore" and the "aft" detectors. Weighted by their confidences and combined with a slot | |
length histogram derived from the training data they can classify a given pair of boundaries | |
within a document. BWI can be used for structured, semi-structured and free text. The | |
patterns are token-based with special wildcards for more general rules. </para> | |
<para> Implementations No implementations are yet available. </para> | |
<para> Parameters No parameters are yet available. </para> | |
</section> | |
--> | |
<section id="section.ugr.tools.tm.workbench.textruler.lp2"> | |
<title>LP2</title> | |
<para> LP2 operates on all three kinds of documents. It learns separate rules for
the beginning and the end of a single slot. So-called tagging rules insert boundary SGML
tags, and additionally induced correction rules shift misplaced tags to their correct
positions in order to improve precision. The learning strategy is a bottom-up covering
algorithm. It starts by creating a specific seed instance with a window of w tokens to the
left and right of the target boundary and searches for the best generalization. Additional
linguistic (NLP) features can be used in order to generalize over the flat word sequence.
</para> | |
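<para>
Parameters:
<itemizedlist>
<listitem>
<para>Context Window Size (to the left and right)</para>
</listitem>
<listitem>
<para>Best Rules List Size</para>
</listitem>
<listitem>
<para>Minimum Covered Positives per Rule</para>
</listitem>
<listitem>
<para>Maximum Error Threshold</para>
</listitem>
<listitem>
<para>Contextual Rules List Size</para>
</listitem>
</itemizedlist>
</para>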
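<para> The bottom-up step of LP2 can be sketched as follows. This is a strongly simplified
illustration, not the TextRuler code: the names <code>window</code>, <code>matches</code>,
<code>generalizations</code> and <code>best_generalization</code> are hypothetical, a
generalization only replaces token constraints by wildcards, and linguistic features, the best
rules list and correction rules are left out. The arguments <code>w</code> and
<code>max_error</code> loosely correspond to the context window size and the maximum error
threshold. The exhaustive enumeration is only feasible for small windows.
</para>
<programlisting><![CDATA[
```python
from itertools import combinations
from typing import Iterator, List, Optional, Set, Tuple

Pattern = Tuple[Optional[str], ...]  # None acts as a wildcard


def window(tokens: List[str], boundary: int, w: int) -> Pattern:
    """The w tokens before and the w tokens after a boundary (None outside the text)."""
    return tuple(tokens[p] if 0 <= p < len(tokens) else None
                 for p in range(boundary - w, boundary + w))


def matches(pattern: Pattern, tokens: List[str], boundary: int) -> bool:
    w = len(pattern) // 2
    return all(c is None or c == t
               for c, t in zip(pattern, window(tokens, boundary, w)))


def generalizations(seed: Pattern) -> Iterator[Pattern]:
    """Yield every pattern obtained by replacing some seed constraints with wildcards."""
    constrained = [i for i, c in enumerate(seed) if c is not None]
    for k in range(len(constrained) + 1):
        for drop in combinations(constrained, k):
            yield tuple(None if i in drop else c for i, c in enumerate(seed))


def best_generalization(seed: Pattern, tokens: List[str],
                        true_boundaries: Set[int],
                        max_error: float = 0.0) -> Tuple[Pattern, int]:
    """Pick the generalization covering most true boundaries within the error limit."""
    best, best_covered = seed, 0
    for pattern in generalizations(seed):
        hits = [b for b in range(len(tokens) + 1) if matches(pattern, tokens, b)]
        positives = sum(1 for b in hits if b in true_boundaries)
        error = (len(hits) - positives) / len(hits) if hits else 1.0
        if error <= max_error and positives > best_covered:
            best, best_covered = pattern, positives
    return best, best_covered


tokens = "Speaker : Dr. Smith Time : 3 pm Speaker : Prof. Lee".split()
starts = {2, 10}                 # slot start boundaries before "Dr." and "Prof."
seed = window(tokens, 2, 2)      # ("Speaker", ":", "Dr.", "Smith")
best, covered = best_generalization(seed, tokens, starts)
```
]]></programlisting>
<para> For this toy text, the seed rule covers only the first boundary, whereas the selected
generalization keeps the two tokens to the left of the boundary and covers both annotated
boundaries without matching any other position.
</para>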
</section> | |
<section id="section.ugr.tools.tm.workbench.textruler.rapier"> | |
<title>RAPIER</title> | |
<para> RAPIER induces single-slot extraction rules for semi-structured documents. The rules
consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each can hold
several constraints on tokens and their corresponding POS tags and semantic information. The
algorithm uses a bottom-up compression strategy, starting with a most specific seed rule for
each training instance. This initial rule base is compressed by randomly selecting rule
pairs and searching for their best generalization. For two rules, the least general
generalization (LGG) of the slot fillers is created and then specialized by adding rule items
to the pre- and post-filler patterns until the new rules perform well on the training set. The
best of the k rules (k-beam search) is added to the rule base and all empirically subsumed
rules are removed. </para>
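<para>
Parameters:
<itemizedlist>
<listitem>
<para>Maximum Compression Fail Count</para>
</listitem>
<listitem>
<para>Internal Rules List Size</para>
</listitem>
<listitem>
<para>Rule Pairs for Generalizing</para>
</listitem>
<listitem>
<para>Maximum 'No improvement' Count</para>
</listitem>
<listitem>
<para>Maximum Noise Threshold</para>
</listitem>
<listitem>
<para>Minimum Covered Positives Per Rule</para>
</listitem>
<listitem>
<para>PosTag Root Type</para>
</listitem>
<listitem>
<para>Use All 3 GenSets at Specialization</para>
</listitem>
</itemizedlist>
</para>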
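<para> The least general generalization of two filler patterns can be sketched as follows. The
constraint language is a simplifying assumption for this illustration: an item holds a set of
allowed words and at most one POS tag, both patterns have the same length, and semantic classes
are omitted. The names <code>Item</code>, <code>lgg_item</code> and <code>lgg_pattern</code>
are hypothetical.
</para>
<programlisting><![CDATA[
```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Item:
    words: Optional[FrozenSet[str]]  # None = any word
    pos: Optional[str]               # None = any POS tag

    def matches(self, word: str, pos: str) -> bool:
        return ((self.words is None or word in self.words)
                and (self.pos is None or pos == self.pos))


def lgg_item(a: Item, b: Item) -> Item:
    """Smallest constraint in this language that accepts whatever a or b accepts."""
    words = None if a.words is None or b.words is None else a.words | b.words
    pos = a.pos if a.pos == b.pos else None
    return Item(words, pos)


def lgg_pattern(p: List[Item], q: List[Item]) -> List[Item]:
    if len(p) != len(q):
        raise ValueError("this sketch only handles patterns of equal length")
    return [lgg_item(a, b) for a, b in zip(p, q)]


filler_a = [Item(frozenset({"Dr."}), "NNP"), Item(frozenset({"Smith"}), "NNP")]
filler_b = [Item(frozenset({"Prof."}), "NNP"), Item(frozenset({"lecturer"}), "NN")]
general = lgg_pattern(filler_a, filler_b)
```
]]></programlisting>
<para> In the example, the first item of the generalized pattern accepts both titles and keeps
the shared POS tag, while the second item keeps the two words but drops the POS constraint
because the tags differ.
</para>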
</section> | |
<section id="section.ugr.tools.tm.workbench.textruler.whisk"> | |
<title>WHISK</title> | |
<para> WHISK is a multi-slot method that operates on all three kinds of documents and learns
single- or multi-slot rules that look similar to regular expressions. The top-down covering
algorithm begins with the most general rule and specializes it by adding single rule terms
until the rule makes no errors on the training set. Domain-specific classes or linguistic
information obtained by a syntactic analyzer can be used as additional features. The exact
definition of a rule term (e.g. a token) and of a problem instance (e.g. a whole document or
a single sentence) depends on the operating domain and document type. </para>
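<para>
Parameters:
<itemizedlist>
<listitem>
<para>Window Size</para>
</listitem>
<listitem>
<para>Maximum Error Threshold</para>
</listitem>
<listitem>
<para>PosTag Root Type</para>
</listitem>
</itemizedlist>
</para>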
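<para> The top-down specialization can be sketched for the single-slot case. The rule format is
an assumption made for this illustration and is much simpler than real WHISK rules: a rule is a
sequence of terms that must occur in this order, and the token that follows the last term is
extracted. The names <code>apply_rule</code>, <code>count_errors</code> and
<code>specialize</code> are hypothetical.
</para>
<programlisting><![CDATA[
```python
from typing import List, Optional, Tuple

Rule = Tuple[str, ...]
Instance = Tuple[List[str], str]  # (tokens, expected slot filler)


def apply_rule(rule: Rule, tokens: List[str]) -> Optional[str]:
    """Find the terms in order and return the token after the last one, if any."""
    pos = 0
    for term in rule:
        try:
            pos = tokens.index(term, pos) + 1
        except ValueError:
            return None
    return tokens[pos] if pos < len(tokens) else None


def count_errors(rule: Rule, data: List[Instance]) -> int:
    """Number of instances on which the rule extracts a wrong filler."""
    return sum(1 for tokens, target in data
               if apply_rule(rule, tokens) not in (None, target))


def specialize(seed: Instance, data: List[Instance]) -> Rule:
    """Start with the empty (most general) rule and add terms while errors decrease."""
    seed_tokens, seed_target = seed
    context = seed_tokens[:seed_tokens.index(seed_target)]
    rule: Rule = ()
    errors = count_errors(rule, data)
    while errors > 0:
        candidates = [rule[:i] + (term,) + rule[i:]
                      for term in context
                      for i in range(len(rule) + 1)]
        # Keep only specializations that still extract the seed correctly.
        candidates = [c for c in candidates
                      if apply_rule(c, seed_tokens) == seed_target]
        if not candidates:
            break
        best = min(candidates, key=lambda c: count_errors(c, data))
        best_errors = count_errors(best, data)
        if best_errors >= errors:
            break  # no improvement: give up on this seed
        rule, errors = best, best_errors
    return rule


data = [("Speaker : Smith Room : 101".split(), "Smith"),
        ("Room : 202 Speaker : Lee".split(), "Lee")]
rule = specialize(data[0], data)
```
]]></programlisting>
<para> For the two toy instances, the empty rule extracts the first token of each instance and
is wrong twice. Adding the colon removes one error, and adding the term in front of it removes
the other, so the specialization stops with a rule that makes no errors on the training data.
</para>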
</section> | |
<section id="section.ugr.tools.tm.workbench.textruler.wien"> | |
<title>WIEN</title>
<para> WIEN is the only method listed here that operates on highly structured texts only. It
induces so-called wrappers that anchor the slots by the structured context around them.
The HLRT (head left right tail) wrapper class, for example, can determine and extract several
multi-slot templates by first separating the important information block from unimportant
head and tail portions and then extracting multiple data rows from table-like data
structures in the remaining document. Inducing a wrapper is done by solving a constraint
satisfaction problem (CSP) for all possible pattern combinations from the training data. </para>
<para> Parameters: No parameters are available. </para>
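<para> The idea of anchoring a slot by its context can be sketched with a left-right wrapper
for a single attribute. This is a deliberately reduced illustration and not the HLRT class or
the TextRuler implementation: head and tail delimiters, multiple attributes per row and the
validity constraints that turn the induction into a CSP are omitted. The names
<code>induce_lr</code> and <code>extract</code> and the example page are assumptions.
</para>
<programlisting><![CDATA[
```python
import os
import re
from typing import List, Tuple


def common_prefix(strings: List[str]) -> str:
    return os.path.commonprefix(strings)


def common_suffix(strings: List[str]) -> str:
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_lr(page: str, spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Return the (left, right) delimiters shared by all labelled value spans."""
    left = common_suffix([page[:start] for start, _ in spans])
    right = common_prefix([page[end:] for _, end in spans])
    return left, right


def extract(page: str, left: str, right: str) -> List[str]:
    """Extract every value enclosed by the left and right delimiters."""
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.S)


page = ("<html><b>Congo</b> <i>242</i><br>"
        "<b>Egypt</b> <i>20</i><br>"
        "<b>Spain</b> <i>34</i><br></html>")
# Only two of the three rows are labelled as training examples.
labelled = [(page.index(name), page.index(name) + len(name))
            for name in ("Congo", "Spain")]
left, right = induce_lr(page, labelled)
countries = extract(page, left, right)
```
]]></programlisting>
<para> The delimiters induced from the two labelled rows also extract the unlabelled row. If
the labelled examples share too much context by coincidence, the delimiters become too
specific, which is one reason why the full method checks candidate delimiters against
additional constraints.
</para>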
</section> | |
</section> | |
</section> |