<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tm/workbench/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<section id="section.ugr.tools.tm.workbench.textruler">
<title>TextRuler</title>
<para> Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted
rules, often supported by a gold standard, in order to create a domain-dependent information
extraction application. When the engineering process starts with the acquisition of extraction
knowledge for a possibly new slot or, more generally, for new concepts, machine learning methods
are often able to offer support in an iterative engineering process. This section gives a
conceptual overview of the process model for the semi-automatic development of rule-based
information extraction applications.
</para>
<para> First, a suitable set of documents that contain the text fragments with interesting
patterns needs to be selected and annotated with the target concepts. Then, the knowledge
engineer chooses and configures the methods for automatic rule acquisition to the best of their
knowledge for the learning task at hand: lambda expressions based on tokens and linguistic
features, for example, differ in their application domain from wrappers that process generated
HTML pages.
</para>
<para> Furthermore, parameters like the window size, which defines the relevant features, need to
be set to appropriate values. Before the annotated training documents form the input of the
learning task, they are enriched with features generated by the partial rule set of the
application under development. The results of the methods, that is, the learned rules, are then
proposed to the knowledge engineer for the extraction of the target concept.
</para>
<para> The knowledge engineer has different options to proceed: If the quality, amount or
generality of the presented rules is not sufficient, then additional training documents need to
be annotated, or additional rules have to be handcrafted, in order to provide more, or more
appropriate, features. Rules or rule sets of high quality can be modified, combined or
generalized and transferred to the rule set of the application in order to support the
extraction task of the target concept. In case the methods did not learn reasonable rules at
all, the knowledge engineer proceeds with writing handcrafted rules.
</para>
<para> Having gathered enough extraction knowledge for the current concept, the semi-automatic
process is iterated and the focus is moved to the next concept until the development of the
application is completed.
</para>
<section id="ugr.tools.tm.textruler.learner">
<title>Available Learners</title>
<para>
The available learners are based on the following publications:
<orderedlist numeration="arabic">
<!--
<listitem>
<para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI,
pages 577-583, 2000.</para>
</listitem>
-->
<listitem>
<para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
Constraints. Technical Report CS-03-07, Department of Computer Science, University of
Sheffield, Sheffield, 2003.</para>
</listitem>
<listitem>
<para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern
Matching Rules for Information Extraction. Journal of Machine Learning Research,
4:177-210, 2003.</para>
</listitem>
<listitem>
<para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
pages 233-272, 1999.</para>
</listitem>
<listitem>
<para> N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information
Extraction. In Proceedings of the International Joint Conference on Artificial
Intelligence (IJCAI), 1997.</para>
</listitem>
</orderedlist>
</para>
<para>
Each available learner is characterized by several features, whose meanings are explained here:
<itemizedlist>
<listitem>
<para> Strategy: The learning strategy used by the method; the methods listed here commonly
apply covering algorithms.</para>
</listitem>
<listitem>
<para>
Document: The type of document the method operates on. It may be
<quote>free</quote>
(free text, as in newspaper articles),
<quote>semi</quote>
(semi-structured text) or
<quote>struct</quote>
(structured text, as in generated HTML pages).
</para>
</listitem>
<listitem>
<para> Slots: The slots refer to a single annotation that represents the goal of the
learning task. Some rules are able to create several annotations at once in the same
context (multi-slot). However, only single slots are supported by the current
implementations.</para>
</listitem>
<listitem>
<para> Status: The current status of the implementation in the TextRuler framework.</para>
</listitem>
</itemizedlist>
</para>
<para>
The following table gives an overview:
<table id="table.ugr.tools.tm.workbench.textruler.available_learners" frame="all">
<title>Overview of available learners</title>
<tgroup cols="6" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1*" />
<colspec colname="c2" colwidth="1*" />
<colspec colname="c3" colwidth="1*" />
<colspec colname="c4" colwidth="1*" />
<colspec colname="c5" colwidth="1*" />
<colspec colname="c6" colwidth="1*" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Strategy</entry>
<entry align="center">Document</entry>
<entry align="center">Slots</entry>
<entry align="center">Status</entry>
<entry align="center">Publication</entry>
</row>
</thead>
<tbody>
<!--
<row>
<entry>BWI</entry>
<entry>Boosting, Top Down</entry>
<entry>Struct, Semi</entry>
<entry>Single, Boundary</entry>
<entry>Planning</entry>
<entry>1</entry>
</row>
-->
<row>
<entry>LP2</entry>
<entry>Bottom Up Cover</entry>
<entry>All</entry>
<entry>Single, Boundary</entry>
<entry>Prototype</entry>
<entry>2</entry>
</row>
<row>
<entry>RAPIER</entry>
<entry>Top Down/Bottom Up Compr.</entry>
<entry>Semi</entry>
<entry>Single</entry>
<entry>Experimental</entry>
<entry>3</entry>
</row>
<row>
<entry>WHISK</entry>
<entry>Top Down Cover</entry>
<entry>All</entry>
<entry>Multi</entry>
<entry>Prototype</entry>
<entry>4</entry>
</row>
<row>
<entry>WIEN</entry>
<entry>CSP</entry>
<entry>Struct</entry>
<entry>Multi, Rows</entry>
<entry>Prototype</entry>
<entry>5</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<!--
<section id="section.ugr.tools.tm.workbench.textruler.bwi">
<title>BWI (Boosted Wrapper Induction)</title>
<para> BWI uses boosting techniques to improve the performance of simple pattern matching
single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the
"fore" and the "aft" detectors. Weighted by their confidences and combined with a slot
length histogram derived from the training data they can classify a given pair of boundaries
within a document. BWI can be used for structured, semi-structured and free text. The
patterns are token-based with special wildcards for more general rules. </para>
<para> Implementations No implementations are yet available. </para>
<para> Parameters No parameters are yet available. </para>
</section>
-->
<section id="section.ugr.tools.tm.workbench.textruler.lp2">
<title>LP2</title>
<para>This method operates on all three kinds of documents. It learns separate rules for
the beginning and the end of a single slot. So-called tagging rules insert boundary SGML
tags, and additionally induced correction rules shift misplaced tags to their correct
positions in order to improve precision. The learning strategy is a bottom-up covering
algorithm: it starts by creating a specific seed instance with a window of w tokens to the
left and right of the target boundary and searches for the best generalization. Additional
linguistic NLP features can be used in order to generalize over the flat word sequence.
</para>
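<para> The covering strategy described above can be illustrated with a minimal, hypothetical
Python sketch. The token-window representation, the wildcard generalization and all function
names are simplifications for illustration only; they are not part of the TextRuler
implementation:
<programlisting>
# Hypothetical sketch of an LP2-style bottom-up covering loop.
# A boundary example is (tokens, boundary_index); a rule is a tuple of
# tokens in which '*' is a wildcard. Illustrative only.

def seed_rule(tokens, boundary, w):
    """Most specific rule: the w tokens left and right of the boundary."""
    return tuple(tokens[max(0, boundary - w):boundary + w])

def generalizations(rule):
    """Generalize by replacing one token with the wildcard '*'."""
    for i in range(len(rule)):
        yield rule[:i] + ('*',) + rule[i + 1:]

def covers(rule, window):
    return len(rule) == len(window) and all(
        r == '*' or r == t for r, t in zip(rule, window))

def learn(examples, w=1):
    """Covering loop: keep the best generalization until all examples are covered."""
    rules, uncovered = [], list(examples)
    while uncovered:
        tokens, boundary = uncovered[0]
        rule = seed_rule(tokens, boundary, w)
        # score a candidate by how many uncovered examples it covers
        def coverage(r):
            return sum(covers(r, seed_rule(t, b, w)) for t, b in uncovered)
        best = max(generalizations(rule), key=coverage)
        rules.append(best)
        uncovered = [(t, b) for t, b in uncovered
                     if not covers(best, seed_rule(t, b, w))]
    return rules
</programlisting>
The real method additionally induces correction rules and exploits linguistic features; the
sketch only shows why a covering loop terminates (each learned rule covers at least its own
seed, so the set of uncovered examples shrinks in every iteration).
</para>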
<para> Parameters:
<itemizedlist>
<listitem><para>Context Window Size (to the left and right)</para></listitem>
<listitem><para>Best Rules List Size</para></listitem>
<listitem><para>Minimum Covered Positives per Rule</para></listitem>
<listitem><para>Maximum Error Threshold</para></listitem>
<listitem><para>Contextual Rules List Size</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.rapier">
<title>RAPIER</title>
<para>RAPIER induces single-slot extraction rules for semi-structured documents. The rules
consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each pattern can
hold several constraints on tokens and the corresponding POS tag and semantic information. The
algorithm uses a bottom-up compression strategy, starting with a most specific seed rule for
each training instance. This initial rule base is compressed by randomly selecting rule
pairs and searching for their best generalization. Considering two rules, the least general
generalization (LGG) of the slot fillers is created and then specialized by adding rule items
to the pre- and post-filler patterns until the new rule operates well on the training set. The
best of the k rules (k-beam search) is added to the rule base and all empirically subsumed
rules are removed. </para>
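<para> The central LGG step can be sketched as follows in Python. The pattern representation
(a tuple of tokens with '*' as a wildcard) and the single deterministic compression pass are
strong simplifications and not part of the TextRuler implementation:
<programlisting>
# Hypothetical sketch of RAPIER-style rule compression via least general
# generalization (LGG). Illustrative only.

def lgg(filler_a, filler_b):
    """LGG of two equal-length filler patterns: identical tokens are kept,
    differing tokens are replaced by the wildcard '*'."""
    assert len(filler_a) == len(filler_b)
    return tuple(a if a == b else '*' for a, b in zip(filler_a, filler_b))

def compress(rules):
    """One compression pass: repeatedly replace a pair of equal-length rules
    by its LGG, which by construction subsumes both originals."""
    rules = list(rules)
    compressed = []
    while rules:
        rule = rules.pop(0)
        partner = next((r for r in rules if len(r) == len(rule)), None)
        if partner is None:
            compressed.append(rule)
        else:
            rules.remove(partner)
            rules.append(lgg(rule, partner))
    return compressed
</programlisting>
The actual algorithm pairs rules randomly, specializes the LGG with pre- and post-filler
items, and keeps only the best of k candidates; the sketch shows only how two specific
fillers collapse into one more general pattern.
</para>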
<para> Parameters:
<itemizedlist>
<listitem><para>Maximum Compression Fail Count</para></listitem>
<listitem><para>Internal Rules List Size</para></listitem>
<listitem><para>Rule Pairs for Generalizing</para></listitem>
<listitem><para>Maximum 'No improvement' Count</para></listitem>
<listitem><para>Maximum Noise Threshold</para></listitem>
<listitem><para>Minimum Covered Positives per Rule</para></listitem>
<listitem><para>PosTag Root Type</para></listitem>
<listitem><para>Use All 3 GenSets at Specialization</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.whisk">
<title>WHISK</title>
<para> WHISK is a multi-slot method that operates on all three kinds of documents and learns
single- or multi-slot rules that resemble regular expressions. The top-down covering
algorithm begins with the most general rule and specializes it by adding single rule terms
until the rule makes no errors on the training set. Domain-specific classes or linguistic
information obtained by a syntactic analyzer can be used as additional features. The exact
definition of a rule term (e.g., a token) and of a problem instance (e.g., a whole document or
a single sentence) depends on the operating domain and document type. </para>
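<para> The top-down specialization loop can be sketched in Python as follows. Here a rule is
simply a list of required terms and an instance is a labeled token list; these
representations, like all names below, are hypothetical simplifications and not the TextRuler
implementation:
<programlisting>
# Hypothetical sketch of a WHISK-style top-down specialization loop:
# start from the empty (most general) rule and greedily add the term
# that removes the most errors on the training set. Illustrative only.

def errors(rule_terms, instances):
    """False positives: negative instances matched by all rule terms."""
    return sum(1 for tokens, label in instances
               if not label and all(t in tokens for t in rule_terms))

def specialize(instances, candidate_terms):
    rule = []
    while errors(rule, instances) > 0:
        # add the single term that leaves the fewest remaining errors
        best = min(candidate_terms,
                   key=lambda t: errors(rule + [t], instances))
        if errors(rule + [best], instances) == errors(rule, instances):
            break  # no candidate term improves the rule any further
        rule.append(best)
    return rule
</programlisting>
The empty rule matches everything; each added term restricts it, which is the essence of a
top-down strategy (the real method also requires the positive instances to stay covered).
</para>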
<para> Parameters:
<itemizedlist>
<listitem><para>Window Size</para></listitem>
<listitem><para>Maximum Error Threshold</para></listitem>
<listitem><para>PosTag Root Type</para></listitem>
</itemizedlist>
</para>
</section>
<section id="section.ugr.tools.tm.workbench.textruler.wien">
<title>WIEN</title>
<para> WIEN is the only method listed here that operates on highly structured texts only. It
induces so-called wrappers that anchor the slots by the structured context around them.
The HLRT (head left right tail) wrapper class, for example, can determine and extract several
multi-slot templates by first separating the important information block from the unimportant
head and tail portions, and then extracting multiple data rows from table-like data
structures in the remaining document. Inducing a wrapper is done by solving a CSP for all
possible pattern combinations from the training data. </para>
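<para> Applying an already induced wrapper of this kind can be sketched in a few lines of
Python. The head and the left/right delimiters are assumed to be given (in WIEN they are the
result of the CSP search, which is not shown here), and all names are illustrative:
<programlisting>
# Hypothetical sketch of applying an HLRT-style wrapper: extract every
# value enclosed by the learned left/right delimiters after the head
# marker. The delimiters are assumed, not induced. Illustrative only.

def extract_hlrt(page, head, left, right):
    """Extract each string between 'left' and 'right' following 'head'."""
    values = []
    pos = page.find(head) + len(head)   # head is assumed to be present
    while True:
        start = page.find(left, pos)
        if start == -1:
            return values
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            return values
        values.append(page[start:end])
        pos = end + len(right)
</programlisting>
For instance, with the hypothetical delimiters <quote>[b]</quote> and <quote>[/b]</quote>,
the page <quote>HEADER [b]Alice[/b] [b]Bob[/b] footer</quote> yields the two row values
Alice and Bob. A full HLRT wrapper additionally uses the tail marker to stop extraction
before trailing page content.
</para>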
<para> Parameters: No parameters are available. </para>
</section>
</section>
</section>