| <?xml version="1.0" encoding="UTF-8"?> |
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <book lang="en"> |
| |
| <title>Tagger Annotator Documentation</title> |
| |
| |
| <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" |
| href="../../target/docbook-shared/common_book_info.xml" /> |
| |
| <preface id="sandbox.tagger.introduction"> |
| <title>Introduction</title> |
| <para> |
| Tagger Annotator is an Apache UIMA statistical analysis |
| engine that annotates tokens with corresponding grammatical |
| types (parts of speech, or just POS). The tagger is a |
| standard hidden Markov model (HMM) tagger. |
| </para> |
| </preface> |
| |
| <chapter id="sandbox.tagger.prerequisites"> |
| <title>Prerequisites</title> |
| <para> |
| The UIMA HMM Tagger annotator assumes that sentences and |
| tokens have already been annotated in the CAS with Sentence |
| and Token annotations respectively (see e.g. |
| <code>Whitespace Tokenizer Annotator</code> |
| ). |
| |
			Further, the tagger requires a parameter file which
			specifies a number of settings necessary for the
			tagging procedure (see
| <xref |
| linkend="sandbox.tagger.annotatorDescriptor.configParam" /> |
| ). |
| |
| Two trained models for English and German are included in |
| the package (in the |
| <code>resources</code> |
| folder). Other models can be trained outside of the UIMA |
| framework (see |
| <xref linkend="sandbox.tagger.training" /> |
| ). |
| </para> |
| </chapter> |
| |
| <chapter id="sandbox.tagger.processingOverview"> |
| <title>Processing Overview</title> |
| <para> |
			The algorithm iterates over sentences and tokens in turn
			to accumulate a list of words. This list is then sent to
			the processing engine of the HMM tagger. For each
| <code>Token</code> |
| , the |
| <code>posTag</code> |
| field is updated with the corresponding part of speech (e.g. |
| <code>posTag = "NN"</code> |
| where |
| <code>NN</code> |
| stands for |
| <emphasis>common noun</emphasis> |
| ). |
| </para> |
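		<para>
			The following is a minimal sketch of this loop, assuming JCas
			cover classes generated for the Sentence and Token types; the
			engine call <code>tagger.tag(...)</code> is hypothetical, and
			this is not the shipped <code>HMMTagger</code> source:
			<programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Illustrative sketch only -- not the shipped HMMTagger.process() source.
public void process(JCas jcas) throws AnalysisEngineProcessException {
  FSIterator<Annotation> sentences =
      jcas.getAnnotationIndex(SentenceAnnotation.type).iterator();
  while (sentences.hasNext()) {
    Annotation sentence = sentences.next();
    // collect the tokens (and their surface strings) of this sentence
    List<TokenAnnotation> tokens = new ArrayList<TokenAnnotation>();
    FSIterator<Annotation> it =
        jcas.getAnnotationIndex(TokenAnnotation.type).subiterator(sentence);
    while (it.hasNext()) {
      tokens.add((TokenAnnotation) it.next());
    }
    List<String> words = new ArrayList<String>();
    for (TokenAnnotation t : tokens) {
      words.add(t.getCoveredText());
    }
    // hand the word list to the HMM engine and write the tags back
    List<String> tags = tagger.tag(words); // hypothetical engine call
    for (int i = 0; i < tokens.size(); i++) {
      tokens.get(i).setPosTag(tags.get(i));
    }
  }
}]]></programlisting>
		</para>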
| </chapter> |
| |
| <chapter id="sandbox.tagger.annotatorDescriptor"> |
| <title>Annotator Descriptor</title> |
| <para> |
			Two descriptors are employed to configure the tagger's
			functionality:
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>HmmTagger.xml</code> |
					- a primitive analysis engine descriptor,
					which defines the tagger's basic functionality
					and can be combined in an aggregate analysis
					engine with an arbitrary tokenizer. This
					descriptor cannot be used by itself, as the
					tagger alone does not perform tokenization.
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>HmmTaggerTAE.xml</code> |
					- an aggregate analysis engine descriptor whose
					only function is to combine the UIMA
					<code>Whitespace Tokenizer Annotator</code>
					with the
					<code>HMM Tagger Annotator</code>
					and is thereby a "ready to use" tagging
					descriptor.
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <section id="sandbox.tagger.annotatorDescriptor.configParam"> |
| <title>Configuration Parameters</title> |
| <para> |
| The HMM tagger annotator ( |
| <code>HmmTagger.xml</code> |
| ) requires the following configuration parameters: |
| </para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>NGRAM_SIZE</code> |
						- an Integer parameter defining whether a
						bigram or trigram model should be used for
						tagging (the default is N=3).
| <programlisting><emphasis><![CDATA[ <configurationParameters> |
| <configurationParameter> |
| <name>NGRAM_SIZE</name> |
| <type>Integer</type> |
| <multiValued>false</multiValued> |
| <mandatory>true</mandatory> |
| </configurationParameter> |
| </configurationParameters> |
| <configurationParameterSettings> |
| <nameValuePair> |
| <name>NGRAM_SIZE</name> |
| <value> |
| <integer>3</integer> |
| </value> |
| </nameValuePair> |
| </configurationParameterSettings>]]></emphasis></programlisting> |
| </para> |
| </listitem> |
| |
| |
| <listitem> |
| <para> |
| <code>ModelFile</code> |
						- a binary file containing the statistical model to be used for tagging; it is defined as an external resource:
| <programlisting><emphasis><![CDATA[ |
| <externalResources> |
| <externalResource> |
| <name>ModelFile</name> |
| <description>HMM Tagger model file</description> |
| <fileResourceSpecifier> |
| <fileUrl>file:german/TuebaModel.dat</fileUrl> |
| </fileResourceSpecifier> |
| <implementationName> |
| org.apache.uima.examples.tagger.ModelResource |
| </implementationName> |
| </externalResource> |
| </externalResources>]]></emphasis></programlisting> |
| |
						Thus, one can easily use a different model by changing the <code>fileUrl</code> element, which currently reads
						<code>file:german/TuebaModel.dat</code>.
						(NB: <emphasis>new models must be located in the <code>resources</code> folder</emphasis>.)
						After these two parameters have been set, the tagger is ready to use.
| |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
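			<para>
				For orientation, the following is a hedged sketch of how an
				annotator typically consumes these two settings in its
				<code>initialize()</code> method; the variable names are
				illustrative, not copied from the <code>HMMTagger</code> source:
				<programlisting><![CDATA[import org.apache.uima.UimaContext;
import org.apache.uima.resource.ResourceAccessException;
import org.apache.uima.resource.ResourceInitializationException;

// Sketch: read NGRAM_SIZE and look up the ModelFile external resource.
public void initialize(UimaContext context)
    throws ResourceInitializationException {
  super.initialize(context);
  Integer ngramSize = (Integer) context.getConfigParameterValue("NGRAM_SIZE");
  try {
    IModelResource model =
        (IModelResource) context.getResourceObject("ModelFile");
    // ... hand ngramSize and model to the tagging engine ...
  } catch (ResourceAccessException e) {
    throw new ResourceInitializationException(e);
  }
}]]></programlisting>
			</para>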
| |
| </section> |
| <section id="sandbox.tagger.annotatorDescriptor.capabilities"> |
| <title>Capabilities</title> |
| <para> |
			As the tagger inherits tokenization indexes from the CAS,
			<code>uima.SentenceAnnotation</code> and <code>uima.TokenAnnotation</code> with their
			<code>begin</code> and <code>end</code> features respectively have to be defined as
			input capabilities in the HMM Tagger annotator descriptor. <code>Token</code> also
			receives an additional <code>posTag</code> feature as an output capability.
| </para> |
| <para> |
| <programlisting><emphasis><![CDATA[<capabilities> |
| <capability> |
| <inputs> |
| <type>org.apache.uima.TokenAnnotation</type> |
| <type allAnnotatorFeatures="true"> |
| org.apache.uima.SentenceAnnotation |
| </type> |
| <feature>org.apache.uima.TokenAnnotation:end</feature> |
| <feature>org.apache.uima.TokenAnnotation:begin</feature> |
| </inputs> |
| <outputs> |
| <type>org.apache.uima.TokenAnnotation</type> |
| <feature>org.apache.uima.TokenAnnotation:posTag</feature> |
| <feature>org.apache.uima.TokenAnnotation:end</feature> |
| <feature>org.apache.uima.TokenAnnotation:begin</feature> |
| </outputs> |
| </capability> |
| </capabilities>]]></emphasis></programlisting> |
| </para> |
| </section> |
| </chapter> |
| |
| <chapter id="sandbox.tagger.unittest"> |
| <title>Functionality Test</title> |
| <para> |
			<code>TaggerTest</code> is a JUnit test file (available in the <code>test</code> folder)
			which provides an opportunity to test the provided models for English and German,
			as well as the basic functionality of the tagger. In order to check whether
			the tagger's configuration is correct, just run this file as a JUnit test; you should get the following output:
| |
| <programlisting><![CDATA[Tesing German Model... |
| The used model is:resources/german/TuebaModel.dat |
| 61646 distinct words in the model |
| Number of part-of-speech tags used: 54 |
| These are: [$(, $,, $., ADJA, ADJD, ADV, APPO, |
| APPR, APPRART, APZR, ART, CARD, ... ] |
| Testing German trigram tagger.. |
| [Jerry, liebt, Wansley, .] |
| expected: [NE, VVFIN, NE, $.] |
| tagger output: [NE, VVFIN, NE, $.] |
| Very Good! |
| ========================================================== |
| Tesing English Model... |
| The used model is:resources/english/BrownModel.dat |
| 56012 distinct words in the model |
| Number of part-of-speech tags used: 473 |
| These are: [', '', (, ), *, ,, --, ., :, ``, abl, |
| abn, abx, ap, ap$, at, be, bed, ...] |
| Testing English trigram tagger... |
| [Jerry, loves, Wansley, .] |
| expected: [np, vbz, np, .] |
| tagger output: [np, vbz, np, .] |
| Very Good!]]></programlisting> |
| |
| </para> |
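		<para>
			The same check can also be performed programmatically with the
			standard UIMA API; a minimal sketch (the descriptor path is an
			assumption about the local layout):
			<programlisting><![CDATA[import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class TagSample {
  public static void main(String[] args) throws Exception {
    // Run the "ready to use" aggregate descriptor on a sample sentence.
    XMLInputSource in = new XMLInputSource("desc/HmmTaggerTAE.xml"); // assumed path
    ResourceSpecifier spec =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);
    CAS cas = ae.newCAS();
    cas.setDocumentText("Jerry loves Wansley .");
    ae.process(cas);
    // afterwards, inspect the posTag feature of each TokenAnnotation in the CAS
  }
}]]></programlisting>
		</para>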
| </chapter> |
| |
| <chapter id="sandbox.tagger.tagger"> |
| <title>Overview of the Tagger package</title> |
| <para> |
| The package <code>org.apache.uima.examples.tagger</code> contains: |
| <itemizedlist> |
| <listitem> |
| <para> |
| two interfaces: |
| <orderedlist> |
| <listitem> |
| <para> |
| <code>IModelResource</code> |
| - model resource interface |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
								<code>Tagger</code>
								- a general tagger interface, provided in case one wants to integrate further tagger types.
| </para> |
| </listitem> |
| </orderedlist> |
| |
| </para> |
| </listitem> |
| <listitem> |
| <para> three classes: |
| <orderedlist> |
| <listitem> |
| <para> |
| <code>HMMTagger</code> |
								- a hidden Markov model tagger for UIMA that uses the Viterbi algorithm to compute the most
								probable part-of-speech sequence for a given list of tokens.
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <code>Viterbi</code> |
								- implementation of the Viterbi algorithm. This class makes up the core of the tagger.
| </para> |
| </listitem> |
| <listitem> |
| <para> |
								<code>ModelResource</code>
								- the implementation of the <code>IModelResource</code> interface.
| </para> |
| </listitem> |
| </orderedlist> |
| |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </chapter> |
| |
| |
| <chapter id="sandbox.tagger.training"> |
		<title>Training Your Own Models</title>
| <para> |
			Though we decided not to include training directly in the UIMA framework, one can easily
			train other models on different pre-annotated corpora outside of UIMA using the <code>ModelGeneration</code> class,
			available in the subpackage <code>org.apache.uima.examples.tagger.trainAndTest</code>.
			This subpackage includes some further files needed for training your own models:
| |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
						<code>MappingInterface</code>
						- defines a mapping for a tagset. For example, one may wish to map a more detailed tagset
						to a less distinctive one (i.e. tell the program to tag all verbs as just <code>VERB</code>
						instead of differentiating between <code>verb infinitive</code>, <code>verb imperative</code>, etc.).

						Two sample implementations of <code>MappingInterface</code> are included,
						namely <code>TagMappingBrown</code> (reducing the Brown corpus tagset from more than 400 tags to 93) and
						<code>GrobMappingTueba</code> (mapping the German STTS tagset from 54 tags onto 11 basic categories plus special symbols and punctuation).
					</para>
| </listitem> |
| |
| <listitem> |
| <para> |
						<code>ModelGeneration</code>
						- trains an N-gram model for the tagger by iterating over a List of <code>Token</code>s
						and writes the resulting model to a binary file. At the moment,
						only bi- and trigram models are supported; further N-grams can be easily integrated.
						<code>ModelGeneration</code> is not concerned with
						whether the training corpus is given as a single file or as a directory containing a number of files,
						as this is a <code>CORPUS_READER</code> implementation issue. The two supplied readers handle a corpus
						given as a single file (<code>TT_FormatReader</code>) or as a directory (<code>BrownReader</code>).
| </para> |
| </listitem> |
| <listitem> |
| |
| <para> |
						Interface <code>CorpusReader</code>
						- should be used to implement corpus readers for your own corpora; the objective
						of the reader is to take charge of the preprocessing and transform tokenized units
						(usually <emphasis>words</emphasis>) into a List of <code>Token</code> objects
						(a sketch of such a reader follows this list).
						Two sample implementations of <code>CorpusReader</code> are included:
| |
| <orderedlist> |
| |
| <listitem> |
| <para> |
| <code>BrownReader</code> |
| - for the Brown corpus from the nltk distribution (nltk.sourceforge.net) |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <code>TT_FormatReader</code> |
| - for the corpora in TreeTagger format, i.e. one word per line |
| with tags separated from the words by tabs. |
| </para> |
| </listitem> |
| |
| |
| </orderedlist> |
| |
| </para> |
| </listitem> |
| |
| </itemizedlist> |
| </para> |
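		<para>
			As referenced in the list above, a hypothetical
			<code>CorpusReader</code> implementation for the TreeTagger format
			might look as follows; the method signature and the
			<code>Token</code> constructor are assumptions, so consult the
			shipped <code>TT_FormatReader</code> for the real interface:
			<programlisting><![CDATA[import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for TreeTagger-format corpora: one word per line,
// the tag separated from the word by a tab.
public class MyTreeTaggerReader implements CorpusReader {
  public List<Token> read(String fileName) throws IOException {
    List<Token> tokens = new ArrayList<Token>();
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    String line;
    while ((line = br.readLine()) != null) {
      String[] parts = line.split("\t");
      if (parts.length == 2) {
        tokens.add(new Token(parts[0], parts[1])); // word, tag -- assumed ctor
      }
    }
    br.close();
    return tokens;
  }
}]]></programlisting>
		</para>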
| <para> |
		To train a new model, one should adjust a number of parameters in the <code>tagger.properties</code> file,
		which is in Java properties file format (see <xref linkend="properties.file"/>). After the parameters are set, you just need to run
		<code>ModelGeneration.java</code>:
| <programlisting id="properties.file" xreflabel="tagger.properties file"><emphasis><![CDATA[######## This is the default tagger.properties file |
| ######## This file is used for training and testing only, |
| ######## The configuration for tagging is directly |
| ######## tuned in the descriptor "HmmTagger.xml" |
| |
| |
| ########################## BOTH FOR TRAINING AND EVALUATION ######## |
| |
| ######## THESE ARE THE DEFAULT MODEL FILES FOR GERMAN AND ENGLISH |
| ######## You can either uncomment one of them, if you want to replace |
| ######## given models with your own one, |
| |
| #MODEL_FILE = resources/german/TuebaModel.dat |
| #MODEL_FILE = resources/english/BrownModel.dat |
| |
| ######## or specify a completely different name |
| MODEL_FILE = |
| |
| ######## If mapping of tags is desired, uncomment the following |
| #DO_MAPPING = true |
| |
| |
| ####### EXAMPLES OF MAPPING CLASSES |
| |
| ## Basic mapping for the Brown corpus (nltk distribution) tagset: |
| ## to get 93 tags out of 473 |
| #MAPPING = org.apache.uima.examples.tagger.TagMappingBrown |
| ## Basic mapping for STTS tagset: from 54 tags onto the basic |
| ## ca. 15 classes plus punctuation |
| #MAPPING = org.apache.uima.examples.tagger.GrobMappingTueba |
| |
| ## If you implement your own mapping, you should specify here in |
| ## the same manner as above a java-path to the class |
| MAPPING = |
| |
| ####### FILE CONTAINING TRAINING CORPUS: |
####### can be specified either as an absolute or as a relative path
| ####### e.g. FILE = ../../tueba_tigerFormat.txt or FILE = C:/Data/tueba.txt |
| FILE = |
| |
| ######## If corpus is in a different format and |
| ######## cannot be read with the provided READERS, |
| ######## you should specify here a java-path to the |
######## class (see examples below)
| |
| #CORPUS_READER=org.apache.uima.examples.tagger.trainAndTest.TT_FormatReader |
| #CORPUS_READER=org.apache.uima.examples.tagger.trainAndTest.BrownReader |
| CORPUS_READER = |
| |
| ################# ONLY FOR EVALUATION ###################### |
| |
| ######### GOLD STANDARD CORPUS FILE: |
| ######### can be specified as an absolute or as a relative path |
| ## e.g. GOLD_STANDARD = ../../tueba_tigerFormat.txt or |
| ## GOLD_STANDARD = C:/Data/tueba.txt |
| GOLD_STANDARD = |
| |
| ######### Here we specify whether one intends to test a bi- or a |
| ######### trigram model (default is a trigram model) |
| N=3 |
| ]]></emphasis> |
| </programlisting> |
| |
| </para> |
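		<para>
			Since <code>tagger.properties</code> is an ordinary Java properties
			file, its values can be loaded and checked like any other; a
			minimal sketch:
			<programlisting><![CDATA[import java.io.FileInputStream;
import java.util.Properties;

public class ShowTrainingConfig {
  public static void main(String[] args) throws Exception {
    // Load the training configuration and print the values used above.
    Properties props = new Properties();
    props.load(new FileInputStream("tagger.properties"));
    System.out.println("model  : " + props.getProperty("MODEL_FILE"));
    System.out.println("corpus : " + props.getProperty("FILE"));
    System.out.println("ngram  : " + props.getProperty("N", "3"));
  }
}]]></programlisting>
		</para>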
| |
| </chapter> |
| |
| <chapter id="sandbox.tagger.evaluation"> |
| <title>Evaluation</title> |
| <para> |
			To evaluate the tagger's performance when a "gold standard" corpus is available, one can use the following provided file:
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>TaggerEvaluation.java</code> |
| - can be used to evaluate the tagger and/or new models on a manually annotated corpus. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
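		<para>
			At its core, such an evaluation is a token-by-token comparison of
			the tagger's output against the gold standard; a sketch of the
			accuracy computation (not the shipped
			<code>TaggerEvaluation</code> source):
			<programlisting><![CDATA[import java.util.List;

// Sketch: compare predicted tags against gold tags, token by token.
static double accuracy(List<String> goldTags, List<String> predictedTags) {
  int correct = 0;
  for (int i = 0; i < goldTags.size(); i++) {
    if (goldTags.get(i).equals(predictedTags.get(i))) {
      correct++;
    }
  }
  return (double) correct / goldTags.size();
}]]></programlisting>
		</para>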
| <para> |
			<code>HMMTagger</code> was evaluated for English and German. For English, it was trained on 80% of the Brown corpus
			(180,000 tokens) and tested on the remaining, unseen 20%. The achieved accuracy was about 96%; the test corpus contained 4.5% unknown tokens.
| </para> |
| <para> |
			For German, it achieves between 95% and 96% accuracy when trained and tested on the same type of corpus, i.e. with 80% of the corpus used for training and 20% for testing.
			Accuracy drops somewhat when tagging a different type of corpus than the one used for training, mostly due to the growing number of unknown words.
| </para> |
| </chapter> |
| |
| |
| <appendix id="sandbox.tagger.theory"> |
		<title>Theory Behind the Tagger</title>
| |
| <para> |
			This appendix is just a sketch of the statistical model
			underlying the tagger.

			Hidden Markov Models (HMMs) are a mainstay of
			applications employing statistical modeling, such as
			speech recognition and production systems, signal
			processing, and part-of-speech tagging.

			A Hidden Markov Model is a probabilistic function of a
			Markov process. A Markov process is a process that
			fulfills the Markov assumptions.


			The Markov assumptions are:
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>limited horizon</code> |
					- a Markov process has no memory beyond the
					condition of its current state. Though we
					usually consider sequences of variables that
					are not independent of each other, it often
					suffices to know the current situation without
					going deep into past happenings. As [
					<biblioref linkend="schuetze" />
					] put it, we do not really need to know how
					many books were in the library last week or
					last year in order to predict how many books
					there will be tomorrow; it is often enough to
					know the current situation. Thus, future states
					in a Markov process are independent of the
					past; they depend only on the present. Let
| <inlineequation> |
| <mathphrase> |
| X = (X |
| <subscript>1</subscript> |
| , ..., X |
| <subscript>T</subscript> |
| ) |
| </mathphrase> |
| </inlineequation> |
| be a sequence of random variables taking the |
| values from the finite state space |
| <inlineequation> |
| <mathphrase> |
| S = (s |
| <subscript>1</subscript> |
| , ..., s |
| <subscript>N</subscript> |
| ) |
| </mathphrase> |
| </inlineequation> |
					, then the limited horizon property can be
					formalized as:
| <informalequation> |
| <mathphrase> |
| P(X |
| <subscript>t+1</subscript> |
| = s |
| <subscript>k</subscript> |
| |X |
| <subscript>1</subscript> |
| , ..., X |
| <subscript>t</subscript> |
| ) = P(X |
| <subscript>t+1</subscript> |
| = s |
| <subscript>k</subscript> |
| |X |
| <subscript>t</subscript> |
| ) |
| </mathphrase> |
| </informalequation> |
| |
| |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>time invariance</code> |
| </para> |
| |
| <para> |
					The probabilities do not change over time: if
					we know that the probability of observing a
					rainbow after rain is 90%, we know that this
					holds for today as well as for tomorrow.
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| |
| If |
| <code>X</code> |
| conforms to these two properties, then it is said to be a |
| Markov chain. |
| |
| One can describe a Markov chain by a transition matrix: |
| <informalequation> |
| <mathphrase> |
| A = a |
| <subscript>i,j</subscript> |
| = P(X |
| <subscript>t+1</subscript> |
| = s |
| <subscript>j</subscript> |
| |X |
| <subscript>t</subscript> |
| =s |
| <subscript>i</subscript> |
| ) |
| </mathphrase> |
| </informalequation> |
| |
| |
| <informalequation> |
| <mathphrase> |
					with a
					<subscript>i,j</subscript>
					>= 0 for all
					<emphasis>i,j</emphasis>
					, and the transition probabilities out of each
					state
					<emphasis>i</emphasis>
					summing to 1
| </mathphrase> |
| </informalequation> |
| |
| </para> |
| |
| <para> |
| |
| Markov models can be used whenever one needs to model the |
| probability of a linear sequence of variables. |
| |
			One distinguishes Visible Markov Models (VMMs) from
			Hidden Markov Models. The difference is that when we
			work with "visible" events, we can directly estimate the
			corresponding probabilities (which is the case if a
			training corpus is available to train one's own models
			for HMM taggers).

			Finding a sequence of part-of-speech tags (i.e. the
			Viterbi part of the tagger) is, in contrast, a hidden
			Markov model problem, as the states (tags) are not
			directly observable.
| </para> |
| |
| <para> |
			<emphasis>The goal of an HMM-based tagger</emphasis>
			is to find the part-of-speech tags ( = hidden states)
			that generate a sequence of words ( = observable
			states). Most known implementations of POS taggers view
			text as being produced by a hidden Markov model, so that
			tagging amounts to deciding which states the system went
			through to generate a given text.
| |
| </para> |
| <para> |
| <emphasis>General Form of HMM</emphasis> |
| </para> |
| <para> |
			An HMM is a five-tuple:
			<inlineequation>
				<mathphrase>(S, K, &pgr;, A, B)</mathphrase>
| </inlineequation> |
| <informalexample> |
| <para>where:</para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>S</code> |
| - the set of states (here: parts of |
| speech) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>K</code> |
| - the set of observations (here: words) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>&pgr;</code> |
| - initial state probabilities |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>A</code> |
							- state transition probabilities
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>B</code> |
							- symbol emission probabilities
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| </informalexample> |
| </para> |
| |
| <para> |
| Further, |
| <code> |
| X |
| <subscript>t</subscript> |
| </code> |
| (state sequence) and |
| <code> |
| O |
| <subscript>t</subscript> |
| </code> |
| (output sequence) are given. |
| |
			The tagging procedure is then the following:
| <informalexample> |
| <orderedlist> |
| <listitem> |
| <para> |
| <code>t := 1</code> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code> |
| Start in state s |
| <subscript>i</subscript> |
| with probability &pgr; |
| <subscript>i</subscript> |
| (i.e., X |
| <subscript>1</subscript> |
| = i) |
| </code> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <code>forever do:</code> |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code> |
| Move from s |
| <subscript>i</subscript> |
| to s |
| <subscript>j</subscript> |
| with probability a |
| <subscript>i,j</subscript> |
| (i.e. X |
| <subscript>t+1</subscript> |
| = j) |
| </code> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <code> |
| Emit observation symbol o |
| <subscript>t</subscript> |
| = k with probability b |
| <subscript>i,j,k</subscript> |
| </code> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
							<code>t := t+1</code>
| </para> |
| |
| </listitem> |
| </itemizedlist> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <code>end</code> |
| </para> |
| </listitem> |
| |
| </orderedlist> |
| </informalexample> |
| </para> |
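		<para>
			For concreteness, the following is an illustrative bigram Viterbi
			decoder over these quantities; for simplicity it uses state
			emissions b<subscript>i</subscript>(k) rather than the arc
			emissions b<subscript>i,j,k</subscript> above. The shipped
			<code>Viterbi</code> class works on the same principle, but this
			is not its source:
			<programlisting><![CDATA[// Illustrative bigram Viterbi decoder (dynamic programming).
// pi[i]   : initial state probabilities
// a[i][j] : transition probability from state i to state j
// b[i][k] : probability that state i emits observation symbol k
static int[] viterbi(int[] obs, double[] pi, double[][] a, double[][] b) {
  int numStates = pi.length, T = obs.length;
  double[][] delta = new double[T][numStates]; // best path probabilities
  int[][] psi = new int[T][numStates];         // backpointers
  for (int i = 0; i < numStates; i++) {
    delta[0][i] = pi[i] * b[i][obs[0]];
  }
  for (int t = 1; t < T; t++) {
    for (int j = 0; j < numStates; j++) {
      double best = -1.0;
      int arg = 0;
      for (int i = 0; i < numStates; i++) {
        double p = delta[t - 1][i] * a[i][j];
        if (p > best) { best = p; arg = i; }
      }
      delta[t][j] = best * b[j][obs[t]];
      psi[t][j] = arg;
    }
  }
  // pick the most probable final state, then backtrack
  int[] path = new int[T];
  double best = -1.0;
  for (int i = 0; i < numStates; i++) {
    if (delta[T - 1][i] > best) { best = delta[T - 1][i]; path[T - 1] = i; }
  }
  for (int t = T - 1; t > 0; t--) {
    path[t - 1] = psi[t][path[t]];
  }
  return path;
}]]></programlisting>
		</para>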
| |
| <para> |
			Despite their limitations, HMMs are one of the most
			successful techniques in natural language processing and
			are widely used, especially in sequence tagging
			applications.
| The best statistical taggers all perform at about the same |
| level of accuracy. |
| </para> |
| </appendix> |
| |
| <!-- ... --> |
| <glossary> |
| <title>Glossary</title> |
| |
| <glossdiv> |
| <title>HMM</title> |
| |
| <glossentry id="hmm"> |
| <glossterm>Hidden Markov Model</glossterm> |
| <acronym>HMM</acronym> |
| <glossdef> |
					<para>A statistical model of a sequence in which the system is assumed to be a Markov process with unobserved (hidden) states that emit the observable symbols.</para>
| </glossdef> |
| </glossentry> |
| </glossdiv> |
| |
| <glossdiv> |
| <title>POS</title> |
| <glossentry id="pos"> |
| <glossterm>Part of Speech</glossterm> |
| <acronym>POS</acronym> |
| <glossdef> |
					<para>The grammatical category of a word, such as noun, verb, or adjective.</para>
| </glossdef> |
| </glossentry> |
| </glossdiv> |
| |
| </glossary> |
| |
| <bibliography> |
| <biblioentry xreflabel="ManningSchuetze99" id="schuetze"> |
| <authorgroup> |
| <author> |
| <firstname>Christopher</firstname> |
| <surname>Manning</surname> |
| </author> |
| <author> |
| <firstname>Hinrich</firstname> |
| <surname>Schuetze</surname> |
| </author> |
| </authorgroup> |
| |
| <title> |
| Foundations of Statistical Natural Language Processing |
| </title> |
| <copyright> |
| <year>1999</year> |
| </copyright> |
| <publisher> |
| <publishername>MIT Press</publishername> |
| </publisher> |
| </biblioentry> |
| </bibliography> |
| |
| </book> |