<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY imgroot "./images/" >
<!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod">
%xinclude;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<book lang="en">
<title>Tagger Annotator Documentation</title>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="../../../SandboxDocs/src/docbook/book_info.xml" />
<preface id="sandbox.tagger.introduction">
<title>Introduction</title>
<para>
Tagger Annotator is an Apache UIMA statistical analysis
engine that annotates tokens with corresponding grammatical
types (parts of speech, or just POS). The tagger is a
standard hidden Markov model (HMM) tagger.
</para>
</preface>
<chapter id="sandbox.tagger.prerequisites">
<title>Prerequisites</title>
<para>
The UIMA HMM Tagger annotator assumes that sentences and
tokens have already been annotated in the CAS with Sentence
and Token annotations respectively (see e.g.
<code>Whitespace Tokenizer Annotator</code>
).
Further, the tagger requires a parameter file which
specifies a number of necessary parameters for the tagging
procedure (see
<xref
linkend="sandbox.tagger.annotatorDescriptor.configParam" />
).
Two trained models for English and German are included in
the package (in the
<code>resources</code>
folder). Other models can be trained outside of the UIMA
framework (see
<xref linkend="sandbox.tagger.training" />
).
</para>
</chapter>
<chapter id="sandbox.tagger.processingOverview">
<title>Processing Overview</title>
<para>
The algorithm iterates over sentences and tokens in turn to
accumulate a list of words. These are then sent to the
processing engine of the HMM tagger. For each
<code>Token</code>
, the
<code>posTag</code>
field is updated with the corresponding part of speech (e.g.
<code>posTag = "NN"</code>
where
<code>NN</code>
stands for
<emphasis>common noun</emphasis>
).
</para>
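<para>
The following is a minimal sketch (not the actual annotator source) of what such a
per-sentence loop might look like in a JCas-based annotator. The
<code>Sentence</code> and <code>Token</code> JCas classes and the
<code>viterbi.process(...)</code> call are assumptions made for illustration only:
<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public void process(JCas jcas) {
  // iterate over sentences; for each, collect the covered tokens and their text
  AnnotationIndex<Annotation> sentences = jcas.getAnnotationIndex(Sentence.type);
  for (Annotation sentence : sentences) {
    List<Token> tokens = new ArrayList<Token>();
    List<String> words = new ArrayList<String>();
    FSIterator<Annotation> it =
        jcas.getAnnotationIndex(Token.type).subiterator(sentence);
    while (it.hasNext()) {
      Token token = (Token) it.next();
      tokens.add(token);
      words.add(token.getCoveredText());
    }
    // hand the word list to the HMM engine and write the tags back;
    // "viterbi" is a hypothetical object holding the loaded model
    List<String> tags = viterbi.process(words);
    for (int i = 0; i < tokens.size(); i++) {
      tokens.get(i).setPosTag(tags.get(i));
    }
  }
}]]></programlisting>
</para>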
</chapter>
<chapter id="sandbox.tagger.annotatorDescriptor">
<title>Annotator Descriptor</title>
<para>
Two descriptors are employed to configure the tagger's
functionality:
<itemizedlist>
<listitem>
<para>
<code>HmmTagger.xml</code>
- is a primitive analysis engine descriptor,
which defines the tagger's basic functionality and
can be combined in an aggregate analysis engine
with an arbitrary tokenizer. This descriptor
cannot be used on its own, as the tagger alone
does not perform tokenization.
</para>
</listitem>
<listitem>
<para>
<code>HmmTaggerTAE.xml</code>
- is an aggregate analysis engine descriptor whose only
function is to combine the UIMA
<code>Whitespace Tokenizer Annotator</code>
with the
<code>HMM Tagger Annotator</code>
, and is therefore a "ready to use" tagging
descriptor (a minimal driver sketch follows this list).
</para>
</listitem>
</itemizedlist>
</para>
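<para>
As an illustration of the second descriptor, the following sketch shows how an application
might instantiate <code>HmmTaggerTAE.xml</code> and tag a short text using the standard UIMA
framework API. The descriptor path is an assumption and depends on where the package is
installed:
<programlisting><![CDATA[import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class TagText {
  public static void main(String[] args) throws Exception {
    // parse the aggregate descriptor (tokenizer + tagger) and create the engine
    XMLInputSource in = new XMLInputSource("desc/HmmTaggerTAE.xml"); // path is an assumption
    ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

    // process a short document; afterwards each Token carries a posTag feature
    JCas jcas = ae.newJCas();
    jcas.setDocumentText("Jerry loves Wansley .");
    ae.process(jcas);

    ae.destroy();
  }
}]]></programlisting>
</para>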
<section id="sandbox.tagger.annotatorDescriptor.configParam">
<title>Configuration Parameters</title>
<para>
The HMM tagger annotator (
<code>HmmTagger.xml</code>
) requires the following configuration parameters:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>NGRAM_SIZE</code>
- this parameter is an Integer defining
whether a bigram or a trigram model should be
used for tagging (the default is N=3).
<programlisting><emphasis><![CDATA[ <configurationParameters>
<configurationParameter>
<name>NGRAM_SIZE</name>
<type>Integer</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
</configurationParameters>
<configurationParameterSettings>
<nameValuePair>
<name>NGRAM_SIZE</name>
<value>
<integer>3</integer>
</value>
</nameValuePair>
</configurationParameterSettings>]]></emphasis></programlisting>
</para>
</listitem>
<listitem>
<para>
<code>ModelFile</code>
- the binary file containing the statistical model to be used for tagging; it is defined as an external resource:
<programlisting><emphasis><![CDATA[
<externalResources>
<externalResource>
<name>ModelFile</name>
<description>HMM Tagger model file</description>
<fileResourceSpecifier>
<fileUrl>file:german/TuebaModel.dat</fileUrl>
</fileResourceSpecifier>
<implementationName>org.apache.uima.examples.tagger.ModelResource</implementationName>
</externalResource>
</externalResources>]]></emphasis></programlisting>
Thus, one can easily use a different model by changing the <code>fileUrl</code> line:
<code>file:german/TuebaModel.dat</code>.
(NB. <emphasis>New models must be located in the <code>resources</code> folder</emphasis>.)
After these two parameters have been set, the tagger is ready to use
(a sketch of how an annotator might read these settings follows this list).
</para>
</listitem>
</itemizedlist>
</para>
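<para>
The sketch below shows, under stated assumptions, how an annotator might read these two
settings in its <code>initialize()</code> method using the standard UIMA context API; the
class name is hypothetical, and the exact handling in the shipped tagger may differ:
<programlisting><![CDATA[import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

public class MyHmmTagger extends JCasAnnotator_ImplBase {

  private int ngramSize;

  @Override
  public void initialize(UimaContext context) throws ResourceInitializationException {
    super.initialize(context);
    // read the NGRAM_SIZE configuration parameter (2 = bigram, 3 = trigram)
    ngramSize = (Integer) context.getConfigParameterValue("NGRAM_SIZE");
    try {
      // look up the external resource bound to the key "ModelFile"
      Object model = context.getResourceObject("ModelFile");
      // ... cast to the model resource interface and load the statistics ...
    } catch (ResourceAccessException e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public void process(JCas jcas) {
    // tagging logic omitted in this sketch
  }
}]]></programlisting>
</para>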
</section>
<section id="sandbox.tagger.annotatorDescriptor.capabilities">
<title>Capabilities</title>
<para>
As the tagger reads the tokenization results from the CAS,
<code>uima.SentenceAnnotation</code> and <code>uima.TokenAnnotation</code>, with their
<code>begin</code> and <code>end</code> features, have to be defined as
input capabilities in the HMM Tagger annotator descriptor. The <code>Token</code> type
also receives an additional <code>posTag</code> feature as an output capability.
</para>
<para>
<programlisting><emphasis><![CDATA[<capabilities>
<capability>
<inputs>
<type>org.apache.uima.TokenAnnotation</type>
<type allAnnotatorFeatures="true">org.apache.uima.SentenceAnnotation</type>
<feature>org.apache.uima.TokenAnnotation:end</feature>
<feature>org.apache.uima.TokenAnnotation:begin</feature>
</inputs>
<outputs>
<type>org.apache.uima.TokenAnnotation</type>
<feature>org.apache.uima.TokenAnnotation:posTag</feature>
<feature>org.apache.uima.TokenAnnotation:end</feature>
<feature>org.apache.uima.TokenAnnotation:begin</feature>
</outputs>
</capability>
</capabilities>]]></emphasis></programlisting>
</para>
</section>
</chapter>
<chapter id="sandbox.tagger.unittest">
<title>Functionality Test</title>
<para>
<code>TaggerTest</code> is a JUnit test (available in the <code>test</code> folder)
which makes it possible to test the provided models for English and German,
as well as the basic functionality of the tagger. To check whether
the tagger's configuration is correct, run this file as a JUnit test; you should get the following output:
<programlisting><![CDATA[Tesing German Model...
The used model is:resources/german/TuebaModel.dat
61646 distinct words in the model
Number of part-of-speech tags used: 54
These are: [$(, $,, $., ADJA, ADJD, ADV, APPO, APPR, APPRART, APZR, ART, CARD, ... ]
Testing German trigram tagger..
[Jerry, liebt, Wansley, .]
expected: [NE, VVFIN, NE, $.]
tagger output: [NE, VVFIN, NE, $.]
Very Good!
==========================================================
Tesing English Model...
The used model is:resources/english/BrownModel.dat
56012 distinct words in the model
Number of part-of-speech tags used: 473
These are: [', '', (, ), *, ,, --, ., :, ``, abl, abn, abx, ap, ap$, at, be, bed, ...]
Testing English trigram tagger...
[Jerry, loves, Wansley, .]
expected: [np, vbz, np, .]
tagger output: [np, vbz, np, .]
Very Good!]]></programlisting>
</para>
</chapter>
<chapter id="sandbox.tagger.tagger">
<title>Overview of the Tagger package</title>
<para>
The package <code>org.apache.uima.examples.tagger</code> contains:
<itemizedlist>
<listitem>
<para>
two interfaces:
<orderedlist>
<listitem>
<para>
<code>IModelResource</code>
- model resource interface
</para>
</listitem>
<listitem>
<para>
<code>Tagger</code>
- the general tagger interface, provided in case one wants to integrate further tagger types.
</para>
</listitem>
</orderedlist>
</para>
</listitem>
<listitem>
<para> three classes:
<orderedlist>
<listitem>
<para>
<code>HMMTagger</code>
- the hidden Markov model tagger for UIMA, which uses the Viterbi algorithm to compute the most
probable part-of-speech sequence for a given list of tokens.
</para>
</listitem>
<listitem>
<para>
<code>Viterbi</code>
- implementation of the Viterbi Algorithm. This class makes up the core of the tagger.
</para>
</listitem>
<listitem>
<para>
<code>ModelResource.java</code>
- the implementation of the <code>IModelResource</code> interface.
</para>
</listitem>
</orderedlist>
</para>
</listitem>
</itemizedlist>
</para>
</chapter>
<chapter id="sandbox.tagger.training">
<title>Training Own Models</title>
<para>
Though we decided not to include training directly in the UIMA framework, one can easily
train other models on different pre-annotated corpora outside of UIMA using the <code>ModelGeneration</code> class,
available in the subpackage <code>org.apache.uima.examples.tagger.trainAndTest</code>.
This subpackage includes some further files needed for training your own models:
<itemizedlist>
<listitem>
<para>
<code>MappingInterface</code>
- defines a mapping for a tagset. For example, one may wish to map a more detailed tagset
to a less distinctive one (i.e. tell the program to tag all verbs as just <code>VERB</code>
instead of differentiating between <code>verb infinitive</code>, <code>verb imperative</code>, etc.);
a tiny sketch of this idea follows this list.
Two sample implementations of <code>MappingInterface</code> are included,
namely <code>TagMappingBrown</code> (reducing the Brown corpus tagset from more than 400 tags to 93) and
<code>GrobMappingTueba</code> (mapping the German STTS tagset from 54 tags to 11 basic categories plus special symbols and punctuation).
</para>
</listitem>
<listitem>
<para>
<code>ModelGeneration</code>
- trains an N-gram model for the tagger, iterating over a List of <code>Token</code>s,
and writes the resulting model to a binary file. At the moment,
only bi- and trigram models are supported; further N-grams can be easily integrated.
<code>ModelGeneration</code> is not concerned with
whether the training corpus is given as a single file or as a directory containing a number of files,
as this is a <code>CORPUS_READER</code> implementation issue. The two supplied readers handle a corpus
given as a single file (<code>TT_FormatReader</code>) or as a directory (<code>BrownReader</code>).
</para>
</listitem>
<listitem>
<para>
Interface <code>CorpusReader</code>
- should be used to implement corpus readers for your own corpora; the objective
of the reader is to take charge of the preprocessing and transform tokenized units
(usually <emphasis>words</emphasis>) into a List of <code>Token</code> objects.
Two sample implementations of <code>CorpusReader</code> are included:
<orderedlist>
<listitem>
<para>
<code>BrownReader</code>
- for the Brown corpus from the nltk distribution (nltk.sourceforge.net)
</para>
</listitem>
<listitem>
<para>
<code>TT_FormatReader</code>
- for corpora in TreeTagger format, i.e. one word per line
with the tag separated from the word by a tab.
</para>
</listitem>
</orderedlist>
</para>
</listitem>
</itemizedlist>
</para>
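<para>
The following tiny sketch illustrates the idea behind such a tag mapping as a plain lookup
table; it is an illustration only and does not reproduce the actual
<code>MappingInterface</code> contract, whose method names are not shown here:
<programlisting><![CDATA[import java.util.HashMap;
import java.util.Map;

// Hypothetical coarse mapping of a few STTS verb tags onto a single VERB category.
public class VerbMappingExample {
  public static void main(String[] args) {
    Map<String, String> map = new HashMap<String, String>();
    map.put("VVFIN", "VERB"); // finite verb   -> VERB
    map.put("VVINF", "VERB"); // infinitive    -> VERB
    map.put("VVIMP", "VERB"); // imperative    -> VERB
    String tag = "VVFIN";
    // unmapped tags pass through unchanged
    System.out.println(map.getOrDefault(tag, tag));
  }
}]]></programlisting>
</para>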
<para>
To train a new model, one should adjust a number of parameters in the <code>tagger.properties</code> file,
which is in Java properties file format (see <xref linkend="properties.file"/>). After the parameters are set, you just need to run
<code>ModelGeneration.java</code>. A filled-in example configuration is shown after the default file below.
<programlisting id="properties.file" xreflabel="tagger.properties file"><emphasis><![CDATA[######## This is the default tagger.properties file
######## This file is used for training and testing only,
######## The configuration for tagging is directly tuned in the descriptor "HmmTagger.xml"
########################## BOTH FOR TRAINING AND EVALUATION ################################
######## THESE ARE THE DEFAULT MODEL FILES FOR GERMAN AND ENGLISH
######## You can either uncomment one of them, if you want to replace given models with your own one,
#MODEL_FILE = resources/german/TuebaModel.dat
#MODEL_FILE = resources/english/BrownModel.dat
######## or specify a completely different name
MODEL_FILE =
######## If mapping of tags is desired, uncomment the following
#DO_MAPPING = true
####### EXAMPLES OF MAPPING CLASSES
## Basic mapping for the Brown corpus (nltk distribution) tagset: to get 93 tags out of 473
#MAPPING = org.apache.uima.examples.tagger.TagMappingBrown
## Basic mapping for STTS tagset: from 54 tags onto the basic ca. 15 classes plus punctuation
#MAPPING = org.apache.uima.examples.tagger.GrobMappingTueba
## If you implement your own mapping, you should specify here in the same manner as above a java-path to the class
MAPPING =
####### FILE CONTAINING TRAINING CORPUS:
####### can be in specified either as an absolute or as a relative path
####### e.g. FILE = ../../tueba_tigerFormat.txt or FILE = C:/Data/tueba.txt
FILE =
######## If corpus is in a different format and cannot be read with the provided READERS,
######## you should specify here a java-path to the class (s. examples below)
#CORPUS_READER = org.apache.uima.examples.tagger.trainAndTest.TT_FormatReader
#CORPUS_READER = org.apache.uima.examples.tagger.trainAndTest.BrownReader
CORPUS_READER =
################# ONLY FOR EVALUATION ###############################
######### GOLD STANDARD CORPUS FILE:
######### can be specified as an absolute or as a relative path
## e.g. GOLD_STANDARD = ../../tueba_tigerFormat.txt or GOLD_STANDARD = C:/Data/tueba.txt
GOLD_STANDARD =
######### Here we specify whether one intends to test a bi- or a trigram model (default is a trigram model)
N=3
]]></emphasis>
</programlisting>
</para>
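<para>
For instance, a minimal configuration for training a trigram model on a TreeTagger-format
corpus might look as follows; the model and corpus file names are placeholders, not files
shipped with the package:
<programlisting><![CDATA[## train a new English trigram model from a TreeTagger-format corpus
MODEL_FILE = resources/english/MyModel.dat
## no tag mapping
#DO_MAPPING = true
MAPPING =
## training corpus, one word per line, tag separated by a tab
FILE = C:/Data/my_training_corpus.txt
CORPUS_READER = org.apache.uima.examples.tagger.trainAndTest.TT_FormatReader
## no evaluation against a gold standard in this run
GOLD_STANDARD =
## trigram model
N=3]]></programlisting>
</para>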
</chapter>
<chapter id="sandbox.tagger.evaluation">
<title>Evaluation</title>
<para>
If a "gold standard" corpus is available, the tagger's performance can be evaluated using the following provided file:
<itemizedlist>
<listitem>
<para>
<code>TaggerEvaluation.java</code>
- can be used to evaluate the tagger and/or new models on a manually annotated corpus.
</para>
</listitem>
</itemizedlist>
</para>
<para>
<code>HMMTagger</code> was evaluated for English and German. For English, it was trained on 80% of the Brown corpus
(180,000 tokens) and tested on the remaining, unseen 20%. The achieved accuracy was about 96%; the test corpus contained 4.5% unknown tokens.
</para>
<para>
For German, it achieves between 95% and 96% accuracy when trained and tested on the same type of corpus, i.e. with 80% of the corpus used for training and 20% for testing.
The accuracy drops somewhat when tagging a different type of corpus than the one used for training, mostly due to the growing number of unknown words.
</para>
</chapter>
<appendix id="sandbox.tagger.theory">
<title>Theory Behind</title>
<para>
This appendix is just a sketch of the statistical model
underlying the tagger.
Hidden Markov Models (HMMs) are a mainstay of
applications employing statistical modeling in any form,
such as speech recognition and production systems, signal
processing, and part-of-speech tagging.
A Hidden Markov Model is a probabilistic function of a
Markov process. A Markov process is a process that fulfills
the Markov assumptions, which are:
<itemizedlist>
<listitem>
<para>
<code>limited horizon</code>
- Markov processes have no memory beyond
the condition of the current state.
Though we usually consider sequences of
variables that are not independent of each
other, it often suffices to know the value of
the current situation without going deep into
past happenings. As [
<biblioref linkend="schuetze" />
] put it, we do not really need to know how
many books were in the library last week or last
year in order to predict how many books there
will be tomorrow. It is often enough to know the
current situation. Thus, future states in the
Markov process are independent of the past; they
depend only on the present. Let
<inlineequation>
<mathphrase>
X = (X
<subscript>1</subscript>
, ..., X
<subscript>T</subscript>
)
</mathphrase>
</inlineequation>
be a sequence of random variables taking the
values from the finite state space
<inlineequation>
<mathphrase>
S = (s
<subscript>1</subscript>
, ..., s
<subscript>N</subscript>
)
</mathphrase>
</inlineequation>
, then a limited horizon property could be
formalized by:
<informalequation>
<mathphrase>
P(X
<subscript>t+1</subscript>
= s
<subscript>k</subscript>
|X
<subscript>1</subscript>
, ..., X
<subscript>t</subscript>
) = P(X
<subscript>t+1</subscript>
= s
<subscript>k</subscript>
|X
<subscript>t</subscript>
)
</mathphrase>
</informalequation>
</para>
</listitem>
<listitem>
<para>
<code>time invariance</code>
</para>
<para>
The probabilities do not change over time, i.e.
if we know that the probability of observing a
rainbow after the rain is 90%, we know
that this holds today as well as
tomorrow.
</para>
</listitem>
</itemizedlist>
</para>
<para>
If
<code>X</code>
conforms to these two properties, then it is said to be a
Markov chain.
One can describe a Markov chain by a transition matrix:
<informalequation>
<mathphrase>
A = a
<subscript>i,j</subscript>
= P(X
<subscript>t+1</subscript>
= s
<subscript>j</subscript>
|X
<subscript>t</subscript>
=s
<subscript>i</subscript>
)
</mathphrase>
</informalequation>
<informalequation>
<mathphrase>
where a
<subscript>i,j</subscript>
>= 0 (for all
<emphasis>i,j</emphasis>
) and the sum over
<emphasis>j</emphasis>
of the transition probabilities a
<subscript>i,j</subscript>
from state
<emphasis>i</emphasis>
equals 1 (for all
<emphasis>i</emphasis>
)
</mathphrase>
</informalequation>
</para>
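<para>
For a small two-state example (not taken from the tagger itself), consider states
s<subscript>1</subscript> and s<subscript>2</subscript> with transition probabilities
a<subscript>1,1</subscript> = 0.7, a<subscript>1,2</subscript> = 0.3,
a<subscript>2,1</subscript> = 0.4 and a<subscript>2,2</subscript> = 0.6: each row of the
matrix sums to 1, and the probability of the state sequence
s<subscript>1</subscript>, s<subscript>2</subscript>, s<subscript>2</subscript>, starting
in s<subscript>1</subscript>, is 0.3 * 0.6 = 0.18.
</para>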
<para>
Markov models can be used whenever one needs to model the
probability of a linear sequence of variables.
One distinguishes between Visible Markov Models (VMMs) and Hidden
Markov Models. The difference is that when we work with
"visible" events, we can directly estimate the corresponding
probabilities (which is the case when a training corpus is
available for training one's own models for HMM taggers).
Finding a sequence of part-of-speech tags (i.e. the Viterbi part
of the tagger), in contrast, involves a hidden Markov model, as the
states (tags) are not directly observable.
</para>
<para>
<emphasis>The goal of an HMM-based tagger</emphasis>
is to find the part-of-speech tags ( = hidden states) that
generate a sequence of words ( = observable states). Most
known implementations of POS taggers view text as
being produced by a hidden Markov model, so that tagging can
be regarded as a Markov process: deciding which states the
system went through to generate a given text.
</para>
<para>
<emphasis>General Form of HMM</emphasis>
</para>
<para>
An HMM is a five-tuple:
<inlineequation>
<mathphrase>(S, K, &pgr;, A, B)</mathphrase>
</inlineequation>
<informalexample>
<para>where:</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>S</code>
- the set of states (here: parts of
speech)
</para>
</listitem>
<listitem>
<para>
<code>K</code>
- the set of observations (here: words)
</para>
</listitem>
<listitem>
<para>
<code>&pgr;</code>
- initial state probabilities
</para>
</listitem>
<listitem>
<para>
<code>A</code>
- state transitions probabilities
</para>
</listitem>
<listitem>
<para>
<code>B</code>
- symbol emissions probabilities
</para>
</listitem>
</itemizedlist>
</para>
</informalexample>
</para>
<para>
Further,
<code>
X
<subscript>t</subscript>
</code>
(state sequence) and
<code>
O
<subscript>t</subscript>
</code>
(output sequence) are given.
The tagging procedure is then the following:
<informalexample>
<orderedlist>
<listitem>
<para>
<code>t := 1</code>
</para>
</listitem>
<listitem>
<para>
<code>
Start in state s
<subscript>i</subscript>
with probability &pgr;
<subscript>i</subscript>
(i.e., X
<subscript>1</subscript>
= i)
</code>
</para>
</listitem>
<listitem>
<para>
<code>forever do:</code>
</para>
<itemizedlist>
<listitem>
<para>
<code>
Move from s
<subscript>i</subscript>
to s
<subscript>j</subscript>
with probability a
<subscript>i,j</subscript>
(i.e. X
<subscript>t+1</subscript>
= j)
</code>
</para>
</listitem>
<listitem>
<para>
<code>
Emit observation symbol o
<subscript>t</subscript>
= k with probability b
<subscript>i,j,k</subscript>
</code>
</para>
</listitem>
<listitem>
<para>
<code>t := t+1</code>
</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para>
<code>end</code>
</para>
</listitem>
</orderedlist>
</informalexample>
</para>
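<para>
To make the decoding step concrete, the following is a minimal bigram Viterbi sketch in Java.
It is an illustration only and does not reproduce the <code>Viterbi</code> class of this
package; the probability maps and the smoothing constant are assumptions:
<programlisting><![CDATA[import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class BigramViterbi {

  private static final double UNSEEN = 1e-10; // crude floor for unseen events

  /**
   * words   - the observed sentence
   * tags    - the tag inventory
   * initial - P(tag) for the first position
   * trans   - trans.get(prev).get(cur) = P(cur tag | prev tag)
   * emit    - emit.get(tag).get(word)  = P(word | tag)
   */
  public static List<String> tag(List<String> words, List<String> tags,
      Map<String, Double> initial,
      Map<String, Map<String, Double>> trans,
      Map<String, Map<String, Double>> emit) {

    int T = words.size(), N = tags.size();
    double[][] delta = new double[T][N]; // best path probability ending in tag j at time t
    int[][] back = new int[T][N];        // back-pointer to the best previous tag

    for (int j = 0; j < N; j++) {
      delta[0][j] = initial.getOrDefault(tags.get(j), UNSEEN)
          * emit.get(tags.get(j)).getOrDefault(words.get(0), UNSEEN);
    }
    for (int t = 1; t < T; t++) {
      for (int j = 0; j < N; j++) {
        double best = -1.0;
        int bestPrev = 0;
        for (int i = 0; i < N; i++) {
          double p = delta[t - 1][i]
              * trans.get(tags.get(i)).getOrDefault(tags.get(j), UNSEEN)
              * emit.get(tags.get(j)).getOrDefault(words.get(t), UNSEEN);
          if (p > best) { best = p; bestPrev = i; }
        }
        delta[t][j] = best;
        back[t][j] = bestPrev;
      }
    }
    // pick the best final tag and follow the back-pointers
    int bestLast = 0;
    for (int j = 1; j < N; j++) {
      if (delta[T - 1][j] > delta[T - 1][bestLast]) bestLast = j;
    }
    String[] result = new String[T];
    result[T - 1] = tags.get(bestLast);
    for (int t = T - 1; t > 0; t--) {
      bestLast = back[t][bestLast];
      result[t - 1] = tags.get(bestLast);
    }
    return Arrays.asList(result);
  }
}]]></programlisting>
In practice one works with log probabilities to avoid numerical underflow on long sentences,
and the trigram case conditions the transition probability on the two preceding tags.
</para>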
<para>
Despite their limitations, HMMs are one of the most
successful techniques in natural language processing and are
widely used, especially in sequence tagging applications.
The best statistical taggers all perform at about the same
level of accuracy.
</para>
</appendix>
<!-- ... -->
<glossary>
<title>Glossary</title>
<glossdiv>
<title>HMM</title>
<glossentry id="hmm">
<glossterm>Hidden Markov Model</glossterm>
<acronym>HMM</acronym>
<glossdef>
<para>A statistical model in which the modeled system is assumed to be a Markov process with hidden (unobservable) states; here the hidden states are part-of-speech tags and the observations are words.</para>
</glossdef>
</glossentry>
</glossdiv>
<glossdiv>
<title>POS</title>
<glossentry id="pos">
<glossterm>Part of Speech</glossterm>
<acronym>POS</acronym>
<glossdef>
<para>The grammatical category of a word (e.g. noun, verb, adjective); the tag that this annotator writes to the <code>posTag</code> feature of each <code>Token</code>.</para>
</glossdef>
</glossentry>
</glossdiv>
</glossary>
<bibliography>
<biblioentry xreflabel="ManningSchuetze99" id="schuetze">
<authorgroup>
<author>
<firstname>Christopher</firstname>
<surname>Manning</surname>
</author>
<author>
<firstname>Hinrich</firstname>
<surname>Schuetze</surname>
</author>
</authorgroup>
<title>
Foundations of Statistical Natural Language Processing
</title>
<copyright>
<year>1999</year>
</copyright>
<publisher>
<publishername>MIT Press</publishername>
</publisher>
</biblioentry>
</bibliography>
</book>