<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tm/language/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.tm.language.language">
<title>TextMarker Language</title>
<para>
This chapter provides a complete description of the TextMarker
language.
</para>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.syntax.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.basic_annotations.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.quantifier.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.declarations.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.expressions.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.conditions.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="tools.textmarker.language.actions.xml" />
<section id="ugr.tools.tm.language.filtering">
<title>Robust extraction using filtering</title>
<para>
Rule-based or pattern-based information extraction systems often
suffer from unimportant fill words, additional whitespace and
unexpected markup. The TextMarker system enables the knowledge
engineer to filter and to hide all possible combinations of
predefined and new types of annotations. The visibility of tokens and
annotations is modified by the actions of rule elements and can be
conditioned using the complete expressiveness of the language.
The TextMarker system therefore supports a robust approach to
information extraction and simplifies the creation of new rules, since
the knowledge engineer can focus on important textual features. If no
rule action has changed the filtering settings, the default filtering
configuration ignores whitespace and markup.
Consider the following rule:
<programlisting><![CDATA["Dr" PERIOD CW CW;
]]></programlisting>
Using the default
setting, this rule matches on all four lines
of this
input document:
<programlisting><![CDATA[Dr. Joachim Baumeister
Dr . Joachim Baumeister
Dr. <b><i>Joachim</i> Baumeister</b>
Dr.JoachimBaumeister
]]></programlisting>
</para>
<para>
To change the default setting, use the
<quote>FILTERTYPE</quote>
or
<quote>RETAINTYPE</quote>
action. For example, if markup should no longer be ignored, try
the following example on the above input document:
<programlisting><![CDATA[Document{->RETAINTYPE(MARKUP)};
"Dr" PERIOD CW CW;
]]></programlisting>
You will see that the third line of the previous input example
will no longer be matched.
</para>
<para>
To filter types, try the following on the input document:
<programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
"Dr" CW CW;
]]></programlisting>
Since periods are ignored now, the rule will match on all four
lines of the example.
</para>
<para>
Notice that using a filtered annotation type within a
rule prevents this rule from being executed. Try the following:
<programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
"Dr" PERIOD CW CW;
]]></programlisting>
You will see that this matches on no line of the input document
since the second rule uses the filtered type PERIOD and is therefore not
executed.
</para>
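<para>
The same mechanism can be applied to whitespace. The following is only a
hedged sketch: it assumes that single spaces are represented by a basic
annotation type named <literal>SPACE</literal>, which is not stated in the
text above and should be checked against the list of basic annotations.
Once spaces are retained, they are no longer skipped and have to be
matched explicitly:
<programlisting><![CDATA[// Assumption: SPACE is the basic annotation type for single spaces.
Document{->RETAINTYPE(SPACE)};
// Spaces are visible now, so the rule has to name them explicitly.
"Dr" PERIOD SPACE CW SPACE CW;
]]></programlisting>
Under this assumption, a line such as
<quote>Dr.JoachimBaumeister</quote>
should no longer be matched, because the spaces required by the rule are missing.
</para>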
</section>
<section id="ugr.tools.tm.language.blocks">
<title>Blocks</title>
<para>
Blocks combine some more complex control structures in the
TextMarker
language:
<orderedlist numeration="arabic">
<listitem>
<para>
Conditioned statements
</para>
</listitem>
<listitem>
<para>
<quote>Foreach</quote>-Loops
</para>
</listitem>
<listitem>
<para>
Procedures
</para>
</listitem>
</orderedlist>
</para>
<para>
Declaration of a block:
<programlisting><![CDATA[BlockDeclaration -> "BLOCK" "(" Identifier ")" RuleElementWithCA
"{" Statements "}"
RuleElementWithCA -> TypeExpression QuantifierPart?
"{" Conditions? Actions? "}"]]></programlisting>
A block declaration always starts with the keyword
<quote>BLOCK</quote>, followed by the identifier of the block in parentheses. The
<quote>RuleElementWithCA</quote>
element
is a TextMarker rule that consists of exactly one rule
element. The
rule element has to be a declared annotation type.
<note>
<para>
The rule element in the definition of a block has to define a
condition/action part, even if that part is empty (LCURLY directly
followed by RCURLY, that is <literal>{}</literal>).
</para>
</note>
</para>
<para>
Through the rule element a new local document is defined, whose
scope
is the related block. So if you use
<literal>Document</literal>
within a block, this always refers to the locally limited
document.
<programlisting><![CDATA[BLOCK(ForEach) Paragraph{} {
Document{COUNT(CW)}; // Here "Document" is limited to a Paragraph;
// therefore the rule only counts the CW annotations
// within the Paragraph
}
]]></programlisting>
</para>
<para>
A block is always executed when the TextMarker interpreter
reaches its
declaration. But a block may also be called from another
position of
the script. See
<xref linkend='ugr.tools.tm.language.blocks.procedure' />.
</para>
<section id="ugr.tools.tm.language.blocks.condition">
<title>Conditioned statements</title>
<para>
A block can use common TextMarker conditions to restrict the
execution of the rules it contains.
</para>
<para>
Examples:
<programlisting><![CDATA[DECLARE Month;
BLOCK(EnglishDates) Document{FEATURE("language", "en")} {
Document{->MARKFAST(Month,'englishMonthNames.txt')};
//...
}
BLOCK(GermanDates) Document{FEATURE("language", "de")} {
Document{->MARKFAST(Month,'germanMonthNames.txt')};
//...
}
]]></programlisting>
The example is explained in detail in
<xref linkend='ugr.tools.tm.overview.examples' />.
</para>
</section>
<section id="ugr.tools.tm.language.blocks.foreach">
<title><quote>Foreach</quote>-Loops</title>
<para>
A block can be used to execute the rules it contains on a
sequence of similar text passages, and thus acts as a
<quote>foreach</quote>-like loop.
</para>
<para>
Examples:
<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}
]]></programlisting>
The example is explained in detail in
<xref linkend='ugr.tools.tm.overview.examples' />.
</para>
<para>
This construction is especially useful if you have a set of
rules that has to be executed consecutively on the same part of an input
document. Let's assume you have already annotated your document
with
Paragraph annotations. Now you want to count the number of words
within each paragraph and, if the number of words is greater than 500,
annotate it as BigParagraph. You might therefore write the following
rules:
<programlisting><![CDATA[DECLARE BigParagraph;
INT numberOfWords;
Paragraph{COUNT(W,numberOfWords)};
Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
]]></programlisting>
This will not work. The reason is that the rule which counts the
number of words within a Paragraph is executed on all Paragraphs
before the last rule, which marks the Paragraph as BigParagraph,
is executed even once. Therefore, when the last rule in this
example is reached, the variable
<literal>numberOfWords</literal>
holds the number of words of the last Paragraph in the input
document, so that either all Paragraphs are annotated as BigParagraph or
none of them is.
</para>
<para>
To solve this, use a block to tie the
execution of these rules
together for each Paragraph:
<programlisting><![CDATA[DECLARE BigParagraph;
INT numberOfWords;
BLOCK(IsBig) Paragraph{} {
Document{COUNT(W,numberOfWords)};
Document{IF(numberOfWords > 500) -> MARK(BigParagraph)};
}
]]></programlisting>
Since the scope of the Document is limited to a Paragraph within
the
block, the rule which counts the words is only executed once
before
the second rule decides if the Paragraph is a BigParagraph.
Of course
this is done for every Paragraph in the whole document.
</para>
</section>
<section id="ugr.tools.tm.language.blocks.procedure">
<title>Procedures</title>
<para>
Blocks can be used to introduce procedures into the TextMarker
language.
To do this, declare a block as before. Let's assume you want to
simulate a procedure
<programlisting><![CDATA[public int countAmountOfTypesInDocument(Type type){
int amount = 0;
for(Token token : Document) {
if(token.isType(type)){
amount++;
}
}
return amount;
}
public static void main() {
int amount = countAmountOfTypesInDocument(Paragraph);
}
]]></programlisting>
which counts the number of annotations of the passed type within the document
and
returns the counted number. This can be done in the following
way:
<programlisting><![CDATA[BOOLEAN executeProcedure = false;
TYPE type;
INT amount;
BLOCK(countNumberOfTypesInDocument) Document{IF(executeProcedure)} {
Document{COUNT(type, amount)};
}
Document{->ASSIGN(executeProcedure, true)};
Document{->ASSIGN(type, Paragraph)};
Document{->CALL(MyScript.countNumberOfTypesInDocument)};
]]></programlisting>
The boolean variable
<literal>executeProcedure</literal>
is used to prevent the execution of the block when the
interpreter
first reaches the block, since this is not a procedure call. The block
can be called
by referring to it by its name, preceded by the name
of the script
in which the
block is defined. In this example, the script is
called MyScript.tm.
</para>
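<para>
The procedure can be reused for another type by reassigning the
parameter variable before the next call. The following is only a sketch:
it assumes that a <literal>Sentence</literal> type has been declared
elsewhere, and resetting <literal>executeProcedure</literal> after the call
is a design choice of this sketch that the example above does not prescribe:
<programlisting><![CDATA[// Assumption: Sentence is an annotation type declared elsewhere.
Document{->ASSIGN(type, Sentence)};
Document{->ASSIGN(executeProcedure, true)};
Document{->CALL(MyScript.countNumberOfTypesInDocument)};
// The variable amount is filled by the COUNT condition inside the block.
// Restore the guard variable to its initial value after the call.
Document{->ASSIGN(executeProcedure, false)};
]]></programlisting>
</para>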
</section>
</section>
<section id="ugr.tools.tm.language.score">
<title>Heuristic extraction using scoring rules</title>
<para>
Diagnostic scores are a well-known and successfully applied
knowledge formalization pattern for diagnostic problems. Individual
findings rate a possible solution by adding points to or subtracting points
from an account of that solution. If the sum exceeds a given threshold,
the solution is derived. One of the advantages of this pattern
is its robustness against missing or false findings, since a high
number of findings is used to derive a solution.
The TextMarker system tries to
transfer this diagnostic problem
solution strategy to the
information
extraction problem. In addition to a
normal creation of a new
annotation, a MARKSCORE action can add positive
or negative scoring
points to the text fragments matched by the rule
elements. The current
value of heuristic points of an annotation can
be evaluated by the
SCORE condition, which can be used in an
additional rule to create
another annotation.
In the following, the heuristic extraction using
scoring rules is demonstrated by a short example:
<programlisting><![CDATA[Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
Headline{SCORE(10)->MARK(Realhl)};
Headline{SCORE(5,10)->LOG("Maybe a headline")};]]></programlisting>
In the first part of this rule set, annotations of the type
Paragraph receive scoring points for a Headline annotation if they
fulfill certain CONTAINS conditions. The first condition, for
example, evaluates to true if the paragraph contains one to
five words, whereas the fourth condition is fulfilled if the
paragraph contains thirty to eighty percent Emph annotations.
The last two rules finally execute their actions if the score of a
Headline annotation exceeds ten points, or lies in the interval between
five and ten points, respectively.
</para>
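<para>
To make the arithmetic concrete, the following reduced rule set is a
minimal sketch that uses only two of the scoring rules above. The type name
<literal>RealHeadline</literal> is chosen for illustration only, and the
type Paragraph is assumed to be available as in the example above:
<programlisting><![CDATA[// RealHeadline is a hypothetical type introduced for this sketch.
DECLARE Headline, RealHeadline;
// 5 points for paragraphs that contain one to five words
Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
// 7 more points if the share of capitalized words is 50 to 100 percent
Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
// 5 + 7 = 12 points are above the threshold of 10
Headline{SCORE(10)->MARK(RealHeadline)};
]]></programlisting>
A paragraph that fulfills both CONTAINS conditions collects 5 + 7 = 12
points and is therefore marked as RealHeadline, whereas a paragraph that
fulfills only one of them stays below the threshold with 5 or 7 points.
</para>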
</section>
<section id="ugr.tools.tm.language.modification">
<title>Modification</title>
<para>
There are different actions that can modify the input document,
such as DEL, COLOR and REPLACE. However, the input document itself cannot be
modified directly. A separate engine, Modifier.xml, has to be
called in order to create another CAS view with the name "modified".
All modifications are executed in that view.
</para>
<para>
The following example shows how to import and call the
Modifier.xml
engine.
The example is explained in detail in
<xref linkend='ugr.tools.tm.overview.examples' />.
</para>
<programlisting><![CDATA[ENGINE utils.Modifier;
Date{-> DEL};
MoneyAmount{-> REPLACE("<MoneyAmount/>")};
Document{-> COLOR(Headline, "green")};
Document{-> EXEC(Modifier)};
]]></programlisting>
<para>
To get to the modified view of an input document
<quote>file1.txt</quote>,
open the output document
<quote>file1.txt.xmi</quote>.
In the editor, right-click and choose
<quote>CAS Views &rarr; modified</quote>.
</para>
</section>
<section id="ugr.tools.tm.language.external_resources">
<title>External resources</title>
<para>
Imagine you have a set of documents containing many different
first names. (As an example, we use a short list containing the first
names
<quote>Frank</quote>,
<quote>Peter</quote>,
<quote>Jochen</quote>
and
<quote>Martin</quote>.)
If you would like to annotate all of them with a
<quote>FirstName</quote>
annotation, you could write a script using the rule
<literal>("Frank" | "Peter" | "Jochen" |
"Martin"){->MARK(FirstName)};</literal>.
This does exactly what you want, but it is not very handy.
To add new first names to the list of recognized first
names, you have to change the rule itself every time. Moreover, writing
rules with possibly hundreds of first names
is hardly practical and certainly not efficient if you already have
the list of first names as a simple text file. Using this text file directly
would considerably reduce the effort.
</para>
<para>
Therefore TextMarker provides two kinds of external resources to
solve such tasks more easily: WORDLISTs and WORDTABLEs.
</para>
<section>
<title>WORDLISTs</title>
<para>
A WORDLIST is simply a list of text items. There are three
different ways to provide a WORDLIST to the TextMarker system.
</para>
<para>
The first possibility is the use of simple text files, which
contain exactly one list item per line. For example, a list "FirstNames.txt"
of first names could look like this:
<programlisting><![CDATA[Frank
Peter
Jochen
Martin
]]></programlisting>
First names within a document containing any number of these
listed names could be annotated by using
<literal>Document{->MARKFAST(FirstName, "FirstNames.txt")};</literal>, assuming
an already declared type FirstName. To make this rule
recognize more first names, just add
them to the external list.
You could also use a WORDLIST variable to do the same thing like this:
<programlisting><![CDATA[WORDLIST FirstNameList = "FirstNames.txt";
DECLARE FirstName;
Document{->MARKFAST(FirstName, FirstNameList)};
]]></programlisting>
</para>
<para>
Another possibility to provide WORDLISTs is the use of compiled
<quote>tree word lists</quote>. The file ending for this is <quote>.twl</quote>.
A tree word list is similar to a trie. It is an XML file that contains
a tree-like structure with a node for each character. The nodes
themselves refer to child nodes that represent all characters that
succeed the character of the parent node. For single word entries, the
resulting complexity is O(m*log(n)) instead of O(m*n) for simple text
files. Here, m is the number of basic annotations in the document and
n is the number of entries in the dictionary. To generate a tree word
list, see <xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />.
A tree word list is used in the same way as a simple word list,
for example <literal>Document{->MARKFAST(FirstName, "FirstNames.twl")};</literal>.
</para>
<para>
A third kind of usable WORDLIST is the <quote>multi tree word list</quote>.
The file ending for this is <quote>.mtwl</quote>. It is generated from
several ordinary WORDLISTs given as simple text files. It contains special
nodes that provide additional information about the original file. This
kind of WORDLIST is useful if several different WORDLISTs are used within
a TextMarker script. Using five different lists results in five rules using
the MARKFAST action. The documents to annotate are thus searched five
times, resulting in a complexity of 5*O(m*log(n)). With a multi tree
word list, this can be reduced to about O(m*log(5*n)). To
generate a multi tree word list, see
<xref linkend='section.ugr.tools.tm.workbench.create_dictionaries' />.
To use a multi tree word list, TextMarker provides the action
TRIE. If, for example, two word lists
<quote>FirstNames.txt</quote>
and
<quote>LastNames.txt</quote>
have been merged in the multi tree word list
<quote>Names.mtwl</quote>, then the following rule annotates all
first names and last names in the whole document:
<programlisting><![CDATA[WORDLIST Names = "Names.mtwl";
DECLARE FirstName, LastName;
Document{->TRIE("FirstNames.txt" = FirstName, "LastNames.txt" = LastName,
Names, false, 0, false, 0, "")};]]></programlisting>
</para>
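<para>
Annotations created from word lists can be used like any other annotation
in subsequent rules. As a hedged sketch, assuming that FirstName and
LastName have already been annotated as described above, the hypothetical
type <literal>PersonName</literal> marks a first name that is directly
followed by a last name. Passing rule element indexes to MARK is an
assumption about that action and should be checked against the description
of the actions:
<programlisting><![CDATA[// PersonName is a hypothetical type introduced for this sketch.
DECLARE PersonName;
// Assumption: the indexes 1 and 2 refer to the first and second rule element,
// so the new annotation is meant to span both names.
FirstName LastName{->MARK(PersonName, 1, 2)};
]]></programlisting>
</para>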
</section>
<section>
<title>WORDTABLEs</title>
<para>
WORDLISTs have been used to annotate all occurrences of any list
item in a document with a certain type. Imagine now that each annotation
has features that should be filled with values dependent on the list item
that matched. This can be achieved with WORDTABLEs. For example, let's
assume we want to annotate all US presidents within a document.
Moreover, each annotation should contain the party of the president as well as the
year of his inauguration. For this purpose, we use an annotation type declared as
<literal>DECLARE Annotation PresidentOfUSA(STRING party, INT
yearOfInauguration)</literal> and fill its features from a WORDTABLE.
</para>
<para>
A WORDTABLE is simply a comma-separated file (.csv), which in this
example uses semicolons to separate the entries. For our
example, such a file, named
<quote>presidentsOfUSA.csv</quote>,
could look like this:
<programlisting><![CDATA[Bill Clinton; democrats; 1993
George W. Bush; republicans; 2001
Barack Obama; democrats; 2009
]]></programlisting>
To annotate our documents we could use the following set of
rules:
<programlisting><![CDATA[WORDTABLE presidentsOfUSA = "presidentsOfUSA.csv";
DECLARE Annotation PresidentOfUSA(STRING party, INT yearOfInauguration);
Document{->MARKTABLE(PresidentOfUSA, 1, presidentsOfUSA, "party" = 2,
    "yearOfInauguration" = 3)};]]></programlisting>
</para>
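<para>
The feature values filled from the table can then be evaluated in further
rules, for example with the FEATURE condition. The following is only a
sketch: the type <literal>DemocraticPresident</literal> is hypothetical, and
it is an assumption that the party value is stored without the leading
space that appears after the semicolon in the file:
<programlisting><![CDATA[// DemocraticPresident is a hypothetical type introduced for this sketch.
DECLARE DemocraticPresident;
// Assumption: the feature value is "democrats" without surrounding whitespace.
PresidentOfUSA{FEATURE("party", "democrats")->MARK(DemocraticPresident)};
]]></programlisting>
</para>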
</section>
</section>
</chapter>