TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml - uima-sandbox - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 <!ENTITY imgroot "images/tools/tm/language/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 <chapter id="ugr.tools.tm.language.language">
   <title>TextMarker Language</title>
   <para>
     This chapter provides a complete description of the TextMarker
     language.
   </para>

   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.syntax.xml" />

   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.basic_annotations.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.quantifier.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.declarations.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.expressions.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.conditions.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
     href="tools.textmarker.language.actions.xml" />


   <section id="ugr.tools.tm.language.filtering">
     <title>Robust extraction using filtering</title>
     <para>
       Rule based or pattern based information extraction systems often
       suffer from unimportant fill words, additional whitespace and
       unexpected markup. The TextMarker System enables the knowledge
       engineer to filter and to hide all possible combinations of
       predefined and new types of annotations. The
       visibility of tokens and
       annotations is modified by the actions of
       rule elements and can be
       conditioned using the complete
       expressiveness of the language.
       Therefore the TextMarker system
       supports a robust approach to
       information extraction and simplifies
       the creation of new rules since
       the knowledge engineer can focus on
       important textual features. If no
       rule action changed the
       configuration of the filtering settings, then
       the default filtering
       configuration ignores whitespaces and markup.
       Look at the following rule:
       <programlisting><![CDATA["Dr" PERIOD CW CW
 ]]></programlisting>
       Using the default
       setting, this rule matches on all four lines
       of this
       input document:
       <programlisting><![CDATA[Dr. Joachim Baumeister
 Dr . Joachim      Baumeister
 Dr. <b><i>Joachim</i> Baumeister</b>
 Dr.JoachimBaumeister
 ]]></programlisting>
     </para>
     <para>
       To change the default setting use the
       <quote>FILTERTYPE</quote>
       or
       <quote>RETAINTYPE</quote>
       action. For example if markups should no longer be ignored, try
       the following example on the above input document:
       <programlisting><![CDATA[Document{->RETAINTYPE(MARKUP)};
 "Dr" PERIOD CW CW
 ]]></programlisting>
       You will see that the third line of the previous input example
       will  no longer be matched.
     </para>
     <para>
       To filter types try the following on the input document:
       <programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
 "Dr" CW CW
 ]]></programlisting>
       Since periods are ignored now, the rule will match on all four
       lines of the example.
     </para>
     <para>
       Notice that using a filtered annotation type within a
       rule, prevents this rule from being executed. Try the following:
       <programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
 "Dr" PERIOD CW CW
 ]]></programlisting>
       You will see that this matches on no line of the input document
       since the second rule uses the filtered type PERIOD and is therefore not
       executed.
     </para>
   </section>
   <section id="ugr.tools.tm.language.blocks">
     <title>Blocks</title>
     <para>
       Blocks combine some more complex control structures in the
       TextMarker
       language:
       <orderedlist numeration="arabic">
         <listitem>
           <para>
             Conditioned statements
           </para>
         </listitem>
         <listitem>
           <para>
             <quote>Foreach</quote>
             -Loops
           </para>
         </listitem>
         <listitem>
           <para>
             Procedures
           </para>
         </listitem>
       </orderedlist>
     </para>
     <para>
       Declaration of a block:
       <programlisting><![CDATA[BlockDeclaration   -> "BLOCK" "(" Identifier ")" RuleElementWithCA
                                             "{" Statements "}"
 RuleElementWithCA      ->  TypeExpression QuantifierPart?
                                             "{" Conditions?  Actions? "}"]]></programlisting>
       A block declaration always starts with the keyword
       <quote>BLOCK</quote>
       , followed by the identifier of the block within brackets. The
       <quote>RuleElementType</quote>
       -element
       is a TextMarker rule that consists of exactly one rule
       element. The
       rule element has to be a declared annotation type.
       <note>
         <para>
           The
           rule element in the definition of a block has to define
           a
           condition/action part, even if that part is empty (LCURLY and
           RCULRY).
         </para>
       </note>
     </para>
     <para>
       Through the rule element a new local document is defined, whose
       scope
       is the related block. So if you use
       <literal>Document</literal>
       within a block, this always refers to the locally limited
       document.
       <programlisting><![CDATA[BLOCK(ForEach) Paragraph{} {
     Document{COUNT(CW)}; // Here "Document" is limited to a Paragraph;
                // therefore the rule only counts the CW annotations
                // within the Paragraph
 }
 ]]></programlisting>
     </para>
     <para>
       A block is always executed when the TextMarker interpreter
       reaches its
       declaration. But a block may also be called from another
       position of
       the script. See
       <xref linkend='ugr.tools.tm.language.blocks.procedure' />
     </para>
     <section id="ugr.tools.tm.language.blocks.condition">
       <title>Conditioned statements</title>
       <para>
         A block can use common TextMarker conditions to condition the
         execution of its containing rules.
       </para>
       <para>
         Examples:
         <programlisting><![CDATA[DECLARE Month;

 BLOCK(EnglishDates) Document{FEATURE("language", "en")} {
     Document{->MARKFAST(Month,'englishMonthNames.txt')};
     //...
 }

 BLOCK(GermanDates) Document{FEATURE("language", "de")} {
     Document{->MARKFAST(Month,'germanMonthNames.txt')};
     //...
 }
 ]]></programlisting>
         The example is explained in detail in
         <xref linkend='ugr.tools.tm.overview.examples' />
         .
       </para>
     </section>
     <section id="ugr.tools.tm.language.blocks.foreach">
       <title>
         <quote>Foreach</quote>
         -Loops
       </title>
       <para>
         A block can be used to execute the containing rules on a
         sequence of
         similar text passages, therefore representing a
         <quote>foreach</quote>
         like loop.
       </para>
       <para>
         Examples:
         <programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
 BLOCK(ForEach) Sentence{} {
     Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
 }
 ]]></programlisting>
         The example is explained in detail in
         <xref linkend='ugr.tools.tm.overview.examples' />
         .
       </para>
       <para>
         This construction is especially useful, if you have a set of
         rules
         which has to be executed continously on the same part of an input
         document. Lets assume you have already annotated your document
         with
         Paragraph annotations. Now you want to count the number of words
         within each paragraph and if the number of words is bigger than 500
         annotate it as BigParagraph. Therefore you wrote the following
         rules:
         <programlisting><![CDATA[DECLARE BigParagraph;
 INT numberOfWords;
 Paragraph{COUNT(W,numberOfWords)};
 Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
 ]]></programlisting>
         This will not work. The reason is that the rule which counts the
         number of words within a Paragraph is executed on all Paragraphs
         before the last rule which marks the Paragraph as BigParagraph
         is
         even executed once. Therefore when reaching the last rule in this
         example, the variable
         <literal>numberOfWords</literal>
         holds the
         number of words of the last Paragraph in the input
         document,
         thus annotating all Paragraphs either as BigParagraph or
         not.
       </para>
       <para>
         To solve this, use a block to tie the
         execution of this rules
         together for each Paragraph:
         <programlisting><![CDATA[DECLARE BigParagraph;
 INT numberOfWords;
 BLOCK(IsBig) Paragraph{} {
   Document{COUNT(W,numberOfWords)};
   Document{IF(numberOfWords > 500) -> MARK(BigParagraph)};
 }
 ]]></programlisting>
         Since the scope of the Document is limited to a Paragraph within
         the
         block, the rule which counts the words is only executed once
         before
         the second rule decides if the Paragraph is a BigParagraph.
         Of course
         this is done for every Paragraph in the whole document.
       </para>
     </section>
     <section id="ugr.tools.tm.language.blocks.procedure">
       <title>Procedures</title>
       <para>
         Blocks can be used to introduce procedures into TextMarker
         language.
         To do this declare a block as before. Lets assume you want to
         simulate a procedure
         <programlisting><![CDATA[public int countAmountOfTypesInDocument(Type type){
     int amount = 0;
     for(Token token : Document) {
       if(token.isType(type)){
         amount++;
       }
     }
     return amount;
 }

 public static void main() {
   int amount = countAmountOfTypesInDocument(Paragraph));
 }
 ]]></programlisting>
         which counts the number of the passed type wihtin the document
         and
         gives back the counted number. This can be done in the following
         way:
         <programlisting><![CDATA[BOOLEAN executeProcedure = false;
 TYPE type;
 INT amount;

 BLOCK(countNumberOfTypesInDocument) Document{IF(executeProcedure)} {
     Document{COUNT(type, amount)};
 }

 Document{->ASSIGN(executeProcedure, true)};
 Document{->ASSIGN(type, Paragraph)};
 Document{->CALL(MyScript.countNumberOfTypesInDocument)};
 ]]></programlisting>
         The boolean variable
         <literal>executeProcedure</literal>
         is used to prohibit the execution of the block when the
         interpreter
         first reaches the block since this is no procedure call. The block
         can be called
         by referring to it with its name, preceded by the name
         of the script
         the
         block is defined in. In this exmaple, the script is
         called MyScript.tm.
       </para>
     </section>

   </section>
   <section id="ugr.tools.tm.language.score">
     <title>Heuristic extraction using scoring rules</title>
     <para>
       Diagnostic scores are a well known and successfully applied
       knowledge
       formalization pattern for diagnostic problems. Single known
       findings
       valuate a possible solution by adding or subtracting points
       on an
       account of that solution. If the sum exceeds a given threshold,
       then
       the solution is derived. One of the advantages of this pattern
       is the
       robustness against missing or false findings, since a high
       number of
       findings is used to derive a solution.

       The TextMarker system tries to
       transfer this diagnostic problem
       solution strategy to the
       information
       extraction problem. In addition to a
       normal creation of a new
       annotation, a MARKSCORE action can add positive
       or negative scoring
       points to the text fragments matched by the rule
       elements. The current
       value of heuristic points of an annotation can
       be evaluated by the
       SCORE condition, which can be used in an
       additional rule to create
       another annotation.
       In the following, the heuristic extraction using
       scoring rules is demonstrated by a short example:

       <programlisting><![CDATA[Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
 Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
 Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
 Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
 Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
 Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
 Headline{SCORE(10)->MARK(Realhl)};
 Headline{SCORE(5,10)->LOG("Maybe a headline")};]]></programlisting>


       In the first part of this rule set, annotations of the type
       paragraph
       receive scoring points for a headline annotation, if they
       fulfill
       certain CONTAINS conditions. The first condition, for
       example,
       evaluates to true, if the paragraph contains one word up to
       five
       words, whereas the fourth conditions is fulfilled, if the
       paragraph
       contains thirty up to eighty percent of emph annotations.
       The last two
       rules finally execute their actions, if the score of a
       headline
       annotation exceeds ten points, or lies in the interval of
       five and ten
       points, respectively.
     </para>
   </section>
   <section id="ugr.tools.tm.language.modification">
     <title>Modification</title>
     <para>
       There are different actions that can modify the input document,
       like DEL,
       COLOR and REPLACE. But the input document itself can not be
       modified
       directly. A separate engine, the Modifier.xml, has to be
       called in
       order to create another cas view with the name "modified".
       In that
       document all modifications are executed.
     </para>
     <para>
       The following example shows how to import and call the
       Modifier.xml
       engine.
       The example is explained in detail in
       <xref linkend='ugr.tools.tm.overview.examples' />
       .
     </para>
     <programlisting><![CDATA[ENGINE utils.Modifier;
 Date{-> DEL};
 MoneyAmount{-> REPLACE("<MoneyAmount/>")};
 Document{-> COLOR(Headline, "green")};
 Document{-> EXEC(Modifier)};
 ]]></programlisting>

     <para>
       To get to the modified view of an input document
       <quote>file1.txt</quote>
       open the output document
       <quote>file1.txt.xmi</quote>
       .
       In editor do right-click and choose
       <quote>CAS Views &rarr;
         modified
       </quote>
       .
     </para>
   </section>

   <section id="ugr.tools.tm.language.external_resources">
     <title>External resources</title>
     <para>
       Imagine you have a set of documents containing lots of different
       first names. (As example we use a short list, containing the first
       names
       <quote>Frank</quote>
       ,
       <quote>Peter</quote>
       ,
       <quote>Jochen</quote>
       and
       <quote>Martin</quote>
       .)
       If you would like to annotate all of them with a
       <quote>FirstName</quote>
       annotation you could write a script using the rule
       <literal>("Frank" | "Peter" | "Jochen" |
         "Martin"){->MARK(FirstName)};</literal>.
       This does exactly what you want. But in fact it is not very handy.
       If you like to add new first names to the list of recognized first
       names you have to change the rule itself every time. Moreover writing
       rules with possibly hundreds of first names
       is not really practically realizable and definitely not efficient if you have
       the list of first names already as a simple text file. Using this text file directly
       would much reduce the effort.
     </para>
     <para>
       Therefore TextMarker provides two kinds of external resources to
       solve such tasks more easily: WORDLISTs and WORDTABLEs.
     </para>
     <section>
       <title>WORDLISTs</title>
       <para>
         A WORDLIST is simply a list of text items. There are three
         different possibilities of how to provide a WORDLIST to the TextMarker system.
       </para>
       <para>
         The first possibility is the use of simple text files, which
         contain exactly one list item per line. For example, a list "FirstNames.txt"
         of first names could look like this:
         <programlisting><![CDATA[Frank
 Peter
 Jochen
 Martin
 ]]></programlisting>
         First names within a document containing any number of these
         listed
         names, could be annotated
         by using
         <literal>Document{->MARKFAST(FistName, "FirstNames.txt")};</literal>, assuming
         an already declared type FirstName. To make this rule
         recognizing more first names just add
         them to the external list.
         You could also use a WORLIST variable to do the same thing like this:
         <programlisting><![CDATA[WORDLIST FirstNameList = "FirstNames.txt";
 DECLARE FirstName;
 Document{->MARKFAST(FistName, FirstNameList)};
 ]]></programlisting>
       </para>
       <para>
         Another possibility to provide WORDLISTs is the use of compiled
         <quote>tree word list</quote>
         s. The file ending for this is <quote>.twl</quote>
         A tree word list is similar to a trie. It is a XML-file that contains
         a tree-like structure with a node for each character. The nodes
         themselves refer to child nodes that represent all characters that
         succeed the character of the parent node. For single word entries the
         resulting complexity is O(m*log(n)) instead of O(m*n) for simple text
         files. Here m is the amount of basic annotations in the document and
         n is the amount of entries in the dictionary. To generate a tree word
         list, see <xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />.
         A tree word list is used in the same way as simple word lists,
         for example <literal>Document{->MARKFAST(FistName, "FirstNames.twl")};</literal>.
       </para>
       <para>
         A third kind of usable WORDLISTs are <quote>multi tree word list</quote>s.
         The file ending for this is <quote>.mtwl</quote>. It is generated from
         several ordinary WORDLISTs given as simple text files. It contains special
         nodes that provide additional information about the original file. These
         kind of WORDLIST is useful, if several fifferent WORDLISTs are used within
         a TextMarker script. Using five different lists results in five rules using
         the MARKFAST action. The documents to annotate are thus searched five
         times resulting in a complexity of 5*O(m*log(n)) With a multi tree
         word list this can be reduced to about O(m*log(5*n)). To
         generate a multi tree word list, see
         <xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />
         To use a multi tree word list TextMarker provides the action
         TRIE. If for example two word lists
         <quote>FirstNames.txt</quote>
         and
         <quote>LastNames.txt</quote>
         have been merged in the multi tree word list
         <quote>Names.mtwl</quote>, then the following rule annotates all
         first names and last names in the whole document:
         <programlisting><![CDATA[WORDLIST Names = "Names.mtwl";
 Declare FirstName, LastName;
 Document{->TRIE("FirstNames.txt" = FistName, "LastNames.txt" = LastName,
                                           Names, false, 0, false, 0, "")};]]></programlisting>
       </para>
     </section>
     <section>
       <title>WORDTABLEs</title>
       <para>
         WORDLISTS have been used to annotate all occurrences of any list
         item in a document with a certain type. Imagine now that each annotation
         has features that should be filled with values dependent on the list item
         that matched. This can be achieved with WORDTABLEs. For example, lets
         assume we want to annotate all US presidents within a document.
         Moreover each annotation should contain the party of the president as well as the
         year of his inauguration. Therefore we use an annotation type
         <literal>DECLARE Annotation PresidentOfUSA(STRING party, INT
           yearOfInauguration)</literal>. To achieve this, it is recommended to use WORDTABLEs.
       </para>
       <para>
         A WORDTABLE is simply a comma-separated file (.csv). For our
         example such a file, named
         <quote>presidentsOfUSA.csv</quote>
         could look like this:
         <programlisting><![CDATA[Bill Clinton; democrats; 1993
 George W. Bush; republicans; 2001
 Barack Obama; democrats; 2009
 ]]></programlisting>
         To annotate our documents we could use the following set of
         rules:
         <programlisting><![CDATA[WORDTABLE presidentsOfUSA = "presidentsOfUSA.csv";
 DECLARE Annotation PresidentOfUSA(STRING party, INT yearOfInauguration);
 Document{->MARKTABLE(PresidentOfUSA, 1, "party" = 2,
 											 "yearOfInauguration" = 3)};]]></programlisting>
       </para>
     </section>
   </section>
 </chapter>
	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
	"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
	<!ENTITY imgroot "images/tools/tm/language/" >
	<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
	%uimaents;
	]>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<chapter id="ugr.tools.tm.language.language">
	<title>TextMarker Language</title>
	<para>
	This chapter provides a complete description of the TextMarker
	language.
	</para>

	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.syntax.xml" />

	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.basic_annotations.xml" />
	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.quantifier.xml" />
	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.declarations.xml" />
	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.expressions.xml" />
	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.conditions.xml" />
	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="tools.textmarker.language.actions.xml" />


	<section id="ugr.tools.tm.language.filtering">
	<title>Robust extraction using filtering</title>
	<para>
	Rule based or pattern based information extraction systems often
	suffer from unimportant fill words, additional whitespace and
	unexpected markup. The TextMarker System enables the knowledge
	engineer to filter and to hide all possible combinations of
	predefined and new types of annotations. The
	visibility of tokens and
	annotations is modified by the actions of
	rule elements and can be
	conditioned using the complete
	expressiveness of the language.
	Therefore the TextMarker system
	supports a robust approach to
	information extraction and simplifies
	the creation of new rules since
	the knowledge engineer can focus on
	important textual features. If no
	rule action changed the
	configuration of the filtering settings, then
	the default filtering
	configuration ignores whitespaces and markup.
	Look at the following rule:
	<programlisting><![CDATA["Dr" PERIOD CW CW
	]]></programlisting>
	Using the default
	setting, this rule matches on all four lines
	of this
	input document:
	<programlisting><![CDATA[Dr. Joachim Baumeister
	Dr . Joachim Baumeister
	Dr. <b><i>Joachim</i> Baumeister</b>
	Dr.JoachimBaumeister
	]]></programlisting>
	</para>
	<para>
	To change the default setting use the
	<quote>FILTERTYPE</quote>
	or
	<quote>RETAINTYPE</quote>
	action. For example if markups should no longer be ignored, try
	the following example on the above input document:
	<programlisting><![CDATA[Document{->RETAINTYPE(MARKUP)};
	"Dr" PERIOD CW CW
	]]></programlisting>
	You will see that the third line of the previous input example
	will no longer be matched.
	</para>
	<para>
	To filter types try the following on the input document:
	<programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
	"Dr" CW CW
	]]></programlisting>
	Since periods are ignored now, the rule will match on all four
	lines of the example.
	</para>
	<para>
	Notice that using a filtered annotation type within a
	rule, prevents this rule from being executed. Try the following:
	<programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
	"Dr" PERIOD CW CW
	]]></programlisting>
	You will see that this matches on no line of the input document
	since the second rule uses the filtered type PERIOD and is therefore not
	executed.
	</para>
	</section>
	<section id="ugr.tools.tm.language.blocks">
	<title>Blocks</title>
	<para>
	Blocks combine some more complex control structures in the
	TextMarker
	language:
	<orderedlist numeration="arabic">
	<listitem>
	<para>
	Conditioned statements
	</para>
	</listitem>
	<listitem>
	<para>
	<quote>Foreach</quote>
	-Loops
	</para>
	</listitem>
	<listitem>
	<para>
	Procedures
	</para>
	</listitem>
	</orderedlist>
	</para>
	<para>
	Declaration of a block:
	<programlisting><![CDATA[BlockDeclaration -> "BLOCK" "(" Identifier ")" RuleElementWithCA
	"{" Statements "}"
	RuleElementWithCA -> TypeExpression QuantifierPart?
	"{" Conditions? Actions? "}"]]></programlisting>
	A block declaration always starts with the keyword
	<quote>BLOCK</quote>
	, followed by the identifier of the block within brackets. The
	<quote>RuleElementType</quote>
	-element
	is a TextMarker rule that consists of exactly one rule
	element. The
	rule element has to be a declared annotation type.
	<note>
	<para>
	The
	rule element in the definition of a block has to define
	a
	condition/action part, even if that part is empty (LCURLY and
	RCULRY).
	</para>
	</note>
	</para>
	<para>
	Through the rule element a new local document is defined, whose
	scope
	is the related block. So if you use
	<literal>Document</literal>
	within a block, this always refers to the locally limited
	document.
	<programlisting><![CDATA[BLOCK(ForEach) Paragraph{} {
	Document{COUNT(CW)}; // Here "Document" is limited to a Paragraph;
	// therefore the rule only counts the CW annotations
	// within the Paragraph
	}
	]]></programlisting>
	</para>
	<para>
	A block is always executed when the TextMarker interpreter
	reaches its
	declaration. But a block may also be called from another
	position of
	the script. See
	<xref linkend='ugr.tools.tm.language.blocks.procedure' />
	</para>
	<section id="ugr.tools.tm.language.blocks.condition">
	<title>Conditioned statements</title>
	<para>
	A block can use common TextMarker conditions to condition the
	execution of its containing rules.
	</para>
	<para>
	Examples:
	<programlisting><![CDATA[DECLARE Month;

	BLOCK(EnglishDates) Document{FEATURE("language", "en")} {
	Document{->MARKFAST(Month,'englishMonthNames.txt')};
	//...
	}

	BLOCK(GermanDates) Document{FEATURE("language", "de")} {
	Document{->MARKFAST(Month,'germanMonthNames.txt')};
	//...
	}
	]]></programlisting>
	The example is explained in detail in
	<xref linkend='ugr.tools.tm.overview.examples' />
	.
	</para>
	</section>
	<section id="ugr.tools.tm.language.blocks.foreach">
	<title>
	<quote>Foreach</quote>
	-Loops
	</title>
	<para>
	A block can be used to execute the containing rules on a
	sequence of
	similar text passages, therefore representing a
	<quote>foreach</quote>
	like loop.
	</para>
	<para>
	Examples:
	<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
	BLOCK(ForEach) Sentence{} {
	Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
	}
	]]></programlisting>
	The example is explained in detail in
	<xref linkend='ugr.tools.tm.overview.examples' />
	.
	</para>
	<para>
	This construction is especially useful, if you have a set of
	rules
	which has to be executed continously on the same part of an input
	document. Lets assume you have already annotated your document
	with
	Paragraph annotations. Now you want to count the number of words
	within each paragraph and if the number of words is bigger than 500
	annotate it as BigParagraph. Therefore you wrote the following
	rules:
	<programlisting><![CDATA[DECLARE BigParagraph;
	INT numberOfWords;
	Paragraph{COUNT(W,numberOfWords)};
	Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
	]]></programlisting>
	This will not work. The reason is that the rule which counts the
	number of words within a Paragraph is executed on all Paragraphs
	before the last rule which marks the Paragraph as BigParagraph
	is
	even executed once. Therefore when reaching the last rule in this
	example, the variable
	<literal>numberOfWords</literal>
	holds the
	number of words of the last Paragraph in the input
	document,
	thus annotating all Paragraphs either as BigParagraph or
	not.
	</para>
	<para>
	To solve this, use a block to tie the
	execution of this rules
	together for each Paragraph:
	<programlisting><![CDATA[DECLARE BigParagraph;
	INT numberOfWords;
	BLOCK(IsBig) Paragraph{} {
	Document{COUNT(W,numberOfWords)};
	Document{IF(numberOfWords > 500) -> MARK(BigParagraph)};
	}
	]]></programlisting>
	Since the scope of the Document is limited to a Paragraph within
	the
	block, the rule which counts the words is only executed once
	before
	the second rule decides if the Paragraph is a BigParagraph.
	Of course
	this is done for every Paragraph in the whole document.
	</para>
	</section>
	<section id="ugr.tools.tm.language.blocks.procedure">
	<title>Procedures</title>
	<para>
	Blocks can be used to introduce procedures into TextMarker
	language.
	To do this declare a block as before. Lets assume you want to
	simulate a procedure
	<programlisting><![CDATA[public int countAmountOfTypesInDocument(Type type){
	int amount = 0;
	for(Token token : Document) {
	if(token.isType(type)){
	amount++;
	}
	}
	return amount;
	}

	public static void main() {
	int amount = countAmountOfTypesInDocument(Paragraph));
	}
	]]></programlisting>
	which counts the number of the passed type wihtin the document
	and
	gives back the counted number. This can be done in the following
	way:
	<programlisting><![CDATA[BOOLEAN executeProcedure = false;
	TYPE type;
	INT amount;

	BLOCK(countNumberOfTypesInDocument) Document{IF(executeProcedure)} {
	Document{COUNT(type, amount)};
	}

	Document{->ASSIGN(executeProcedure, true)};
	Document{->ASSIGN(type, Paragraph)};
	Document{->CALL(MyScript.countNumberOfTypesInDocument)};
	]]></programlisting>
	The boolean variable
	<literal>executeProcedure</literal>
	is used to prohibit the execution of the block when the
	interpreter
	first reaches the block since this is no procedure call. The block
	can be called
	by referring to it with its name, preceded by the name
	of the script
	the
	block is defined in. In this exmaple, the script is
	called MyScript.tm.
	</para>
	</section>

	</section>
	<section id="ugr.tools.tm.language.score">
	<title>Heuristic extraction using scoring rules</title>
	<para>
	Diagnostic scores are a well known and successfully applied
	knowledge
	formalization pattern for diagnostic problems. Single known
	findings
	valuate a possible solution by adding or subtracting points
	on an
	account of that solution. If the sum exceeds a given threshold,
	then
	the solution is derived. One of the advantages of this pattern
	is the
	robustness against missing or false findings, since a high
	number of
	findings is used to derive a solution.

	The TextMarker system tries to
	transfer this diagnostic problem
	solution strategy to the
	information
	extraction problem. In addition to a
	normal creation of a new
	annotation, a MARKSCORE action can add positive
	or negative scoring
	points to the text fragments matched by the rule
	elements. The current
	value of heuristic points of an annotation can
	be evaluated by the
	SCORE condition, which can be used in an
	additional rule to create
	another annotation.
	In the following, the heuristic extraction using
	scoring rules is demonstrated by a short example:

	<programlisting><![CDATA[Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
	Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
	Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
	Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
	Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
	Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
	Headline{SCORE(10)->MARK(Realhl)};
	Headline{SCORE(5,10)->LOG("Maybe a headline")};]]></programlisting>


	In the first part of this rule set, annotations of the type
	paragraph
	receive scoring points for a headline annotation, if they
	fulfill
	certain CONTAINS conditions. The first condition, for
	example,
	evaluates to true, if the paragraph contains one word up to
	five
	words, whereas the fourth conditions is fulfilled, if the
	paragraph
	contains thirty up to eighty percent of emph annotations.
	The last two
	rules finally execute their actions, if the score of a
	headline
	annotation exceeds ten points, or lies in the interval of
	five and ten
	points, respectively.
	</para>
	</section>
	<section id="ugr.tools.tm.language.modification">
	<title>Modification</title>
	<para>
	There are different actions that can modify the input document,
	like DEL,
	COLOR and REPLACE. But the input document itself can not be
	modified
	directly. A separate engine, the Modifier.xml, has to be
	called in
	order to create another cas view with the name "modified".
	In that
	document all modifications are executed.
	</para>
	<para>
	The following example shows how to import and call the
	Modifier.xml
	engine.
	The example is explained in detail in
	<xref linkend='ugr.tools.tm.overview.examples' />
	.
	</para>
	<programlisting><![CDATA[ENGINE utils.Modifier;
	Date{-> DEL};
	MoneyAmount{-> REPLACE("<MoneyAmount/>")};
	Document{-> COLOR(Headline, "green")};
	Document{-> EXEC(Modifier)};
	]]></programlisting>

	<para>
	To get to the modified view of an input document
	<quote>file1.txt</quote>
	open the output document
	<quote>file1.txt.xmi</quote>
	.
	In editor do right-click and choose
	<quote>CAS Views →
	modified
	</quote>
	.
	</para>
	</section>

	<section id="ugr.tools.tm.language.external_resources">
	<title>External resources</title>
	<para>
	Imagine you have a set of documents containing lots of different
	first names. (As example we use a short list, containing the first
	names
	<quote>Frank</quote>
	,
	<quote>Peter</quote>
	,
	<quote>Jochen</quote>
	and
	<quote>Martin</quote>
	.)
	If you would like to annotate all of them with a
	<quote>FirstName</quote>
	annotation you could write a script using the rule
	<literal>("Frank" \| "Peter" \| "Jochen" \|
	"Martin"){->MARK(FirstName)};</literal>.
	This does exactly what you want. But in fact it is not very handy.
	If you like to add new first names to the list of recognized first
	names you have to change the rule itself every time. Moreover writing
	rules with possibly hundreds of first names
	is not really practically realizable and definitely not efficient if you have
	the list of first names already as a simple text file. Using this text file directly
	would much reduce the effort.
	</para>
	<para>
	Therefore TextMarker provides two kinds of external resources to
	solve such tasks more easily: WORDLISTs and WORDTABLEs.
	</para>
	<section>
	<title>WORDLISTs</title>
	<para>
	A WORDLIST is simply a list of text items. There are three
	different possibilities of how to provide a WORDLIST to the TextMarker system.
	</para>
	<para>
	The first possibility is the use of simple text files, which
	contain exactly one list item per line. For example, a list "FirstNames.txt"
	of first names could look like this:
	<programlisting><![CDATA[Frank
	Peter
	Jochen
	Martin
	]]></programlisting>
	First names within a document containing any number of these
	listed
	names, could be annotated
	by using
	<literal>Document{->MARKFAST(FistName, "FirstNames.txt")};</literal>, assuming
	an already declared type FirstName. To make this rule
	recognizing more first names just add
	them to the external list.
	You could also use a WORLIST variable to do the same thing like this:
	<programlisting><![CDATA[WORDLIST FirstNameList = "FirstNames.txt";
	DECLARE FirstName;
	Document{->MARKFAST(FistName, FirstNameList)};
	]]></programlisting>
	</para>
	<para>
	Another possibility to provide WORDLISTs is the use of compiled
	<quote>tree word list</quote>
	s. The file ending for this is <quote>.twl</quote>
	A tree word list is similar to a trie. It is a XML-file that contains
	a tree-like structure with a node for each character. The nodes
	themselves refer to child nodes that represent all characters that
	succeed the character of the parent node. For single word entries the
	resulting complexity is O(mlog(n)) instead of O(mn) for simple text
	files. Here m is the amount of basic annotations in the document and
	n is the amount of entries in the dictionary. To generate a tree word
	list, see <xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />.
	A tree word list is used in the same way as simple word lists,
	for example <literal>Document{->MARKFAST(FistName, "FirstNames.twl")};</literal>.
	</para>
	<para>
	A third kind of usable WORDLISTs are <quote>multi tree word list</quote>s.
	The file ending for this is <quote>.mtwl</quote>. It is generated from
	several ordinary WORDLISTs given as simple text files. It contains special
	nodes that provide additional information about the original file. These
	kind of WORDLIST is useful, if several fifferent WORDLISTs are used within
	a TextMarker script. Using five different lists results in five rules using
	the MARKFAST action. The documents to annotate are thus searched five
	times resulting in a complexity of 5O(mlog(n)) With a multi tree
	word list this can be reduced to about O(mlog(5n)). To
	generate a multi tree word list, see
	<xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />
	To use a multi tree word list TextMarker provides the action
	TRIE. If for example two word lists
	<quote>FirstNames.txt</quote>
	and
	<quote>LastNames.txt</quote>
	have been merged in the multi tree word list
	<quote>Names.mtwl</quote>, then the following rule annotates all
	first names and last names in the whole document:
	<programlisting><![CDATA[WORDLIST Names = "Names.mtwl";
	Declare FirstName, LastName;
	Document{->TRIE("FirstNames.txt" = FistName, "LastNames.txt" = LastName,
	Names, false, 0, false, 0, "")};]]></programlisting>
	</para>
	</section>
	<section>
	<title>WORDTABLEs</title>
	<para>
	WORDLISTS have been used to annotate all occurrences of any list
	item in a document with a certain type. Imagine now that each annotation
	has features that should be filled with values dependent on the list item
	that matched. This can be achieved with WORDTABLEs. For example, lets
	assume we want to annotate all US presidents within a document.
	Moreover each annotation should contain the party of the president as well as the
	year of his inauguration. Therefore we use an annotation type
	<literal>DECLARE Annotation PresidentOfUSA(STRING party, INT
	yearOfInauguration)</literal>. To achieve this, it is recommended to use WORDTABLEs.
	</para>
	<para>
	A WORDTABLE is simply a comma-separated file (.csv). For our
	example such a file, named
	<quote>presidentsOfUSA.csv</quote>
	could look like this:
	<programlisting><![CDATA[Bill Clinton; democrats; 1993
	George W. Bush; republicans; 2001
	Barack Obama; democrats; 2009
	]]></programlisting>
	To annotate our documents we could use the following set of
	rules:
	<programlisting><![CDATA[WORDTABLE presidentsOfUSA = "presidentsOfUSA.csv";
	DECLARE Annotation PresidentOfUSA(STRING party, INT yearOfInauguration);
	Document{->MARKTABLE(PresidentOfUSA, 1, "party" = 2,
	"yearOfInauguration" = 3)};]]></programlisting>
	</para>
	</section>
	</section>
	</chapter>