opennlp-docs/src/docbkx/corpora.xml - opennlp - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 <chapter id="tools.corpora">

 	<title>Corpora</title>
 	<para>
 	    OpenNLP has built-in support to convert into the native training format or directly use
         various corpora	needed by the different	trainable components.
 	</para>
 	<section id="tools.corpora.conll">
 		<title>CONLL</title>
 		<para>
 		CoNLL stands for the Conference on Computational Natural Language Learning and is not
 		a single project but a consortium of developers attempting to broaden the computing
 		environment. More information about the entire conference series can be obtained here
 		for CoNLL.
 		</para>
 		<section id="tools.corpora.conll.2000">
 		<title>CONLL 2000</title>
 		<para>
 		The shared task of CoNLL-2000 is Chunking.
 		</para>
 		<section id="tools.corpora.conll.2000.getting">
 		<title>Getting the data</title>
 		<para>
 		CoNLL-2000 made available training and test data for the Chunk task in English.
 		The data consists of the same partitions of the Wall Street Journal corpus (WSJ)
 		as the widely used data for noun phrase chunking: sections 15-18 as training data
 		(211727 tokens) and section 20 as test data (47377 tokens). The annotation of the
 		data has been derived from the WSJ corpus by a program written by Sabine Buchholz
 		from Tilburg University, The Netherlands. Both training and test data can be
 		obtained from <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">http://www.cnts.ua.ac.be/conll2000/chunking</ulink>.
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2000.converting">
 		<title>Converting the data</title>
 		<para>
 		The data don't need to be transformed because Apache OpenNLP Chunker follows
 		the CONLL 2000 format for training. Check <link linkend="tools.chunker.training">Chunker Training</link> section to learn more.
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2000.training">
 		<title>Training</title>
 		<para>
 		 We can train the model for the Chunker using the train.txt available at CONLL 2000:
 		 <screen>
 			<![CDATA[
 $ opennlp ChunkerTrainerME -model en-chunker.bin -iterations 500 \
                            -lang en -data train.txt -encoding UTF-8]]>
 		</screen>
 		<screen>
 			<![CDATA[
 Indexing events using cutoff of 5

 	Computing event counts...  done. 211727 events
 	Indexing...  done.
 Sorting and merging events... done. Reduced 211727 events to 197252.
 Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 197252
 	    Number of Outcomes: 22
 	  Number of Predicates: 107838
 ...done.
 Computing model parameters...
 Performing 500 iterations.
   1:  .. loglikelihood=-654457.1455212828	0.2601510435608118
   2:  .. loglikelihood=-239513.5583724216	0.9260037690044255
   3:  .. loglikelihood=-141313.1386347238	0.9443387003074715
   4:  .. loglikelihood=-101083.50853437989	0.954375209585929
 ... cut lots of iterations ...
 498:  .. loglikelihood=-1710.8874647317095	0.9995040783650645
 499:  .. loglikelihood=-1708.0908900815848	0.9995040783650645
 500:  .. loglikelihood=-1705.3045902366732	0.9995040783650645
 Writing chunker model ... done (4.019s)

 Wrote chunker model to path: .\en-chunker.bin]]>
 		</screen>
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2000.evaluation">
 		<title>Evaluating</title>
 		<para>
 		We evaluate the model using the file test.txt  available at CONLL 2000:
 		<screen>
 			<![CDATA[
 $ opennlp ChunkerEvaluator -model en-chunker.bin -lang en -encoding utf8 -data test.txt]]>
 		</screen>
 		<screen>
 			<![CDATA[
 Loading Chunker model ... done (0,665s)
 current: 85,8 sent/s avg: 85,8 sent/s total: 86 sent
 current: 88,1 sent/s avg: 87,0 sent/s total: 174 sent
 current: 156,2 sent/s avg: 110,0 sent/s total: 330 sent
 current: 192,2 sent/s avg: 130,5 sent/s total: 522 sent
 current: 167,2 sent/s avg: 137,8 sent/s total: 689 sent
 current: 179,2 sent/s avg: 144,6 sent/s total: 868 sent
 current: 183,2 sent/s avg: 150,3 sent/s total: 1052 sent
 current: 183,2 sent/s avg: 154,4 sent/s total: 1235 sent
 current: 169,2 sent/s avg: 156,0 sent/s total: 1404 sent
 current: 178,2 sent/s avg: 158,2 sent/s total: 1582 sent
 current: 172,2 sent/s avg: 159,4 sent/s total: 1754 sent
 current: 177,2 sent/s avg: 160,9 sent/s total: 1931 sent


 Average: 161,6 sent/s
 Total: 2013 sent
 Runtime: 12.457s

 Precision: 0.9244354736974896
 Recall: 0.9216837162502096
 F-Measure: 0.9230575441395671]]>
 		</screen>
 		</para>
 		</section>
 	</section>
 		<section id="tools.corpora.conll.2002">
 		<title>CONLL 2002</title>
 		<para>
 		The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch.
 		</para>
 		<section id="tools.corpora.conll.2002.getting">
 		<title>Getting the data</title>
 		<para>The data consists of three files per language: one training file and two test files testa and testb.
 		The first test file will be used in the development phase for finding good parameters for the learning system.
 		The second test file will be used for the final evaluation. Currently there are data files available for two languages:
 		Spanish and Dutch.
 		</para>
 		<para>
 		The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are
 		from May 2000. The annotation was carried out by the <ulink url="http://www.talp.cat/">TALP Research Center</ulink> of the Technical University of Catalonia (UPC)
 		and the <ulink url="http://clic.ub.edu/">Center of Language and Computation (CLiC)</ulink>of the University of Barcelona (UB), and funded by the European Commission
 		through the NAMIC project (IST-1999-12392).
 		</para>
 		<para>
 		The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1).
 		The data was annotated as a part of the <ulink url="http://atranos.esat.kuleuven.ac.be/">Atranos</ulink> project at the University of Antwerp.
 		</para>
 		<para>
 		You can find the Spanish files here:
 		<ulink url="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
 		You must download esp.train.gz, unzip it and you will see the file esp.train.
 		</para>
 		<para>
 		You can find the Dutch files here:
 		<ulink url="http://www.cnts.ua.ac.be/conll2002/ner.tgz">http://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
 		You must unzip it and go to /ner/data/ned.train.gz, so you unzip it too, and you will see the file ned.train.
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2002.converting">
 		<title>Converting the data</title>
 		<para>
 		I will use Spanish data as reference, but it would be the same operations to Dutch. You just must remember change “-lang es” to “-lang nl” and use
 		the correct training files. So to convert the information to the OpenNLP format:
 		<screen>
 			<![CDATA[
 $ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt]]>
 		</screen>
 		Optionally, you can convert the training test samples as well.
 		<screen>
 			<![CDATA[
 $ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt
 $ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt]]>
 		</screen>
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2002.training.spanish">
 		<title>Training with Spanish data</title>
 		<para>
 		To train the model for the name finder:
 		<screen>
 			<![CDATA[
 \bin\opennlp TokenNameFinderTrainer -lang es -encoding u
 tf8 -iterations 500 -data es_corpus_train_persons.txt -model es_ner_person.bin


 Indexing events using cutoff of 5

         Computing event counts...  done. 264715 events
         Indexing...  done.
 Sorting and merging events... done. Reduced 264715 events to 222660.
 Done indexing.
 Incorporating indexed data for training...
 done.
         Number of Event Tokens: 222660
            Number of Outcomes: 3
           Number of Predicates: 71514
 ...done.
 Computing model parameters ...
 Performing 500 iterations.
   1:  ... loglikelihood=-290819.1519958615      0.9689326256540053
   2:  ... loglikelihood=-37097.17676455632      0.9689326256540053
   3:  ... loglikelihood=-22910.372489660916     0.9706476776911017
   4:  ... loglikelihood=-17091.547325669497     0.9777874317662392
   5:  ... loglikelihood=-13797.620926769372     0.9833821279489262
   6:  ... loglikelihood=-11715.806710780415     0.9867140131839903
   7:  ... loglikelihood=-10289.222078246517     0.9886859452618855
   8:  ... loglikelihood=-9249.208318314624      0.9902310031543358
   9:  ... loglikelihood=-8454.169590899777      0.9913227433277298
  10:  ... loglikelihood=-7823.742997451327      0.9921953799369133
  11:  ... loglikelihood=-7309.375882641964      0.9928224694482746
  12:  ... loglikelihood=-6880.131972149693      0.9932946754056249
  13:  ... loglikelihood=-6515.3828767792365     0.993638441342576
  14:  ... loglikelihood=-6200.82723154046       0.9939595413935742
  15:  ... loglikelihood=-5926.213730444915      0.994269308501596
  16:  ... loglikelihood=-5683.9821840753275     0.9945299661900534
  17:  ... loglikelihood=-5468.4211798176075     0.9948246227074401
  18:  ... loglikelihood=-5275.127017232056      0.9950286156810154

 ... cut lots of iterations ...

 491:  ... loglikelihood=-1174.8485558758211     0.998983812779782
 492:  ... loglikelihood=-1173.9971776942477     0.998983812779782
 493:  ... loglikelihood=-1173.1482915871768     0.998983812779782
 494:  ... loglikelihood=-1172.3018855781158     0.998983812779782
 495:  ... loglikelihood=-1171.457947774544      0.998983812779782
 496:  ... loglikelihood=-1170.6164663670502     0.998983812779782
 497:  ... loglikelihood=-1169.7774296286693     0.998983812779782
 498:  ... loglikelihood=-1168.94082591387       0.998983812779782
 499:  ... loglikelihood=-1168.1066436580463     0.9989875904274408
 500:  ... loglikelihood=-1167.2748713765225     0.9989875904274408
 Writing name finder model ... done (2,168s)

 Wrote name finder model to
 path: .\es_ner_person.bin]]>
 		</screen>
 		</para>
 		</section>
 		</section>

 		<section id="tools.corpora.conll.2003">
 		<title>CONLL 2003</title>
 		<para>
 		The shared task of CoNLL-2003 is language independent named entity recognition
 		for English and German.
 		</para>
 		<section id="tools.corpora.conll.2003.getting">
 		<title>Getting the data</title>
 		<para>
 		The English data is the Reuters Corpus, which is a collection of news wire articles.
 		The Reuters Corpus can be obtained free of charges from the NIST for research
 		purposes: <ulink url="http://trec.nist.gov/data/reuters/reuters.html">http://trec.nist.gov/data/reuters/reuters.html</ulink>
 		</para>
 		<para>
 		The German data is a collection of articles from the German newspaper Frankfurter
 		Rundschau. The articles are part of the ECI Multilingual Text Corpus which
 		can be obtained for 75$ (2010) from the Linguistic Data Consortium:
 <ulink url="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5</ulink>		</para>
 		<para>After one of the corpora is available the data must be
 		transformed as explained in the README file to the CONLL format.
 		The transformed data can be read by the OpenNLP CONLL03 converter.
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2003.converting">
 		<title>Converting the data (optional)</title>
 		<para>
 		To convert the information to the OpenNLP format:
 		<screen>
 			<![CDATA[
 $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.train > corpus_train.txt]]>
 		</screen>
 		Optionally, you can convert the training test samples as well.
 		<screen>
 			<![CDATA[
 $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testa > corpus_testa.txt
 $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testb > corpus_testb.txt]]>
 		</screen>
 		</para>
 		</section>
 		<section id="tools.corpora.conll.2003.training.english">
 		<title>Training with English data</title>
             <para>
                 You can train the model for the name finder this way:
                 <screen>
                 <![CDATA[
 $ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 500 \
                                  -lang en -types per -data eng.train -encoding utf8]]>
                 </screen>
             </para>
 		    <para>
                 If you have converted the data, then you can train the model for the name finder this way:
                 <screen>
                 <![CDATA[
 $ opennlp TokenNameFinderTrainer -model en_ner_person.bin -iterations 500 \
                                  -lang en -data corpus_train.txt -encoding utf8]]>
 		        </screen>
             </para>
             <para>
                 Either way you should see the following output during the training process:
 		        <screen>
 			    <![CDATA[
 Indexing events using cutoff of 5

 	Computing event counts...  done. 203621 events
 	Indexing...  done.
 Sorting and merging events... done. Reduced 203621 events to 179409.
 Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 179409
 	    Number of Outcomes: 3
 	  Number of Predicates: 58814
 ...done.
 Computing model parameters...
 Performing 500 iterations.
   1:  .. loglikelihood=-223700.5328318588	0.9453494482396216
   2:  .. loglikelihood=-40525.939777363084	0.9467933071736215
   3:  .. loglikelihood=-24893.98837874921	0.9598518816821447
   4:  .. loglikelihood=-18420.3379471033	0.9712996203731442
 ... cut lots of iterations ...
 498:  .. loglikelihood=-952.8501399442295	0.9988950059178572
 499:  .. loglikelihood=-952.0600155746948	0.9988950059178572
 500:  .. loglikelihood=-951.2722802086295	0.9988950059178572
 Writing name finder model ... done (1.638s)

 Wrote name finder model to
 path: .\en_ner_person.bin]]>
 		        </screen>
 		    </para>
 		</section>
 		<section id="tools.corpora.conll.2003.evaluation.english">
 		<title>Evaluating with English data</title>
             <para>
                 You can evaluate the model for the name finder this way:
                 <screen>
                 <![CDATA[
 $ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \
                                    -lang en -types per -data eng.testa -encoding utf8]]>
                 </screen>
             </para>
 		    <para>
 		        If you converted the test A and B files above, you can use them to evaluate the
                 model.
 		        <screen>
 			<![CDATA[
 $ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -lang en -data corpus_testa.txt \
                                    -encoding utf8]]>
 		        </screen>
             </para>
             <para>
                 Either way you should see the following output:
 		        <screen>
 			<![CDATA[
 Loading Token Name Finder model ... done (0.359s)
 current: 190.2 sent/s avg: 190.2 sent/s total: 199 sent
 current: 648.3 sent/s avg: 415.9 sent/s total: 850 sent
 current: 530.1 sent/s avg: 453.6 sent/s total: 1380 sent
 current: 793.8 sent/s avg: 539.0 sent/s total: 2178 sent
 current: 705.4 sent/s avg: 571.9 sent/s total: 2882 sent


 Average: 569.4 sent/s
 Total: 3251 sent
 Runtime: 5.71s

 Precision: 0.9366247297154147
 Recall: 0.739956568946797
 F-Measure: 0.8267557582133971]]>
 		</screen>
 		</para>
 		</section>
 	</section>
 	</section>
 	<section id="tools.corpora.arvores-deitadas">
 		<title>Arvores Deitadas</title>
 		<para>
 		The Portuguese corpora available at <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from AD format to native format.
 		</para>
 		<section id="tools.corpora.arvores-deitadas.getting">
 			<title>Getting the data</title>
 			<para>
 			The Corpus can be downloaded from here: <ulink url="http://www.linguateca.pt/floresta/corpus.html">http://www.linguateca.pt/floresta/corpus.html</ulink>
 			</para>
 			<para>
 			The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
 			The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
 			</para>
 		</section>

 		<section id="tools.corpora.arvores-deitadas.converting">
 			<title>Converting the data (optional)</title>
 			    <para>
 				    To extract NameFinder training data from Amazonia corpus:
 			        <screen>
 			        <![CDATA[
 $ opennlp TokenNameFinderConverter ad -lang pt -encoding ISO-8859-1 -data amazonia.ad > corpus.txt]]>
 			        </screen>
 			</para>
 			<para>
 				To extract Chunker training data from Bosque_CF_8.0.ad corpus:
 			    <screen>
 			    <![CDATA[
 $ opennlp ChunkerConverter ad -lang pt -data Bosque_CF_8.0.ad.txt -encoding ISO-8859-1 > bosque-chunk]]>
     			</screen>
 			</para>
 		</section>
 		<section id="tools.corpora.arvores-deitadas.evaluation">
 			<title>Training and Evaluation</title>
 			    <para>
 			        To perform the evaluation the corpus was split into a training and a test part.
 			        <screen>
 			        <![CDATA[
 $ sed '1,55172d' corpus.txt > corpus_train.txt
 $ sed '55172,100000000d' corpus.txt > corpus_test.txt]]>
         			</screen>
         			<screen>
         			<![CDATA[
 $ opennlp TokenNameFinderTrainer -model pt-ner.bin -cutoff 20 -lang PT -data corpus_train.txt -encoding UTF-8
 ...
 $ opennlp TokenNameFinderEvaluator -model pt-ner.bin -lang PT -data corpus_train.txt -encoding UTF-8

 Precision: 0.8005071889818507
 Recall: 0.7450581122145297
 F-Measure: 0.7717879983140168]]>
 			</screen>
 			</para>
 		</section>
 	</section>

 	<section id="tools.corpora.ontonotes">
 		<title>OntoNotes Release 4.0</title>
 	<para>
 		"OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number
 		LDC2011T03 and isbn 1-58563-574-X, was developed as part of the
 		OntoNotes project, a collaborative effort between BBN Technologies,
 		the University of Colorado, the University of Pennsylvania and the
 		University of Southern Californias Information Sciences Institute. The
 		goal of the project is to annotate a large corpus comprising various
 		genres of text (news, conversational telephone speech, weblogs, usenet
 		newsgroups, broadcast, talk shows) in three languages (English,
 		Chinese, and Arabic) with structural information (syntax and predicate
 		argument structure) and shallow semantics (word sense linked to an
 		ontology and coreference). OntoNotes Release 4.0 is supported by the
 		Defense Advance Research Project Agency, GALE Program Contract No.
 		HR0011-06-C-0022.
 	</para>
 	<para>
 		OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes
 		Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes
 		Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast
 		conversation and web data in English and Chinese and newswire data in
 		Arabic. This cumulative publication consists of 2.4 million words as
 		follows: 300k words of Arabic newswire 250k words of Chinese newswire,
 		250k words of Chinese broadcast news, 150k words of Chinese broadcast
 		conversation and 150k words of Chinese web text and 600k words of
 		English newswire, 200k word of English broadcast news, 200k words of
 		English broadcast conversation and 300k words of English web text.
 	</para>
 	<para>
 		The OntoNotes project builds on two time-tested resources, following the
 		Penn Treebank for syntax and the Penn PropBank for predicate-argument
 		structure. Its semantic representation will include word sense
 		disambiguation for nouns and verbs, with each word sense connected to
 		an ontology, and coreference. The current goals call for annotation of
 		over a million words each of English and Chinese, and half a million
 		words of Arabic over five years." (http://catalog.ldc.upenn.edu/LDC2011T03)
 	</para>
 		<section id="tools.corpora.ontonotes.namefinder">
 		<title>Name Finder Training</title>
 	<para>
 		The OntoNotes corpus can be used to train the Name Finder. The corpus
 		contains many different name types
 		to train a model for a specific type only the built-in type filter
 		option should be used.
 	</para>
 		<para>
 		The sample shows how to train a model to detect person names.
 			<programlisting>
 			<![CDATA[
 $ bin/opennlp TokenNameFinderTrainer.ontonotes -lang en -model en-ontonotes.bin \
 			 -nameTypes person -ontoNotesDir ontonotes-release-4.0/data/files/data/english/

 Indexing events using cutoff of 5

 	Computing event counts...  done. 1953446 events
 	Indexing...  done.
 Sorting and merging events... done. Reduced 1953446 events to 1822037.
 Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 1822037
 	    Number of Outcomes: 3
 	  Number of Predicates: 298263
 ...done.
 Computing model parameters ...
 Performing 100 iterations.
   1:  ... loglikelihood=-2146079.7808976253	0.976677625078963
   2:  ... loglikelihood=-195016.59754190338	0.976677625078963
 ... cut lots of iterations ...
  99:  ... loglikelihood=-10269.902459614596	0.9987299367374374
 100:  ... loglikelihood=-10227.160010853702	0.9987314724850341
 Writing name finder model ... done (2.315s)

 Wrote name finder model to
 path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
 	</programlisting>
 		</para>
 		</section>
 	</section>

 	<section id="tools.corpora.brat">
 		<title>Brat Format Support</title>
 		<para>
 			The brat annotation tool is an online environment for collaborative text annotation and
 			supports labeling documents with named entities. The best performance of a name finder
 			can only be achieved if it was trained on documents similar to the the documents it will
 			process. For that reason it is often necessary to manually label a large number of documents and
 			build a custom corpus. This is where brat comes in handy.

 			<imagedata fileref="images/brat.png" width="6.5in" depth="4in" scalefit="1"/>

 			OpenNLP can directly be trained and evaluated on labeled data in the brat format.
 			Instructions on how to use, download and install brat can be found on the project website:

 			<ulink url="http://brat.nlplab.org">http://brat.nlplab.org</ulink>

 			Configuration of brat, including setting up the different entities and relations can be found at:

 			<ulink url="http://brat.nlplab.org/configuration.html">http://brat.nlplab.org/configuration.html</ulink>

 		</para>


 		<section id="tools.corpora.brat.webtool">
 			<title>Sentences and Tokens</title>
 			<para>
 				The brat annotation tool only adds named entity spans to the data and doesn't provide information
 				about tokens and sentences. To train the name finder this information is required. By default it
 				is assumed that each line is a sentence and that tokens are whitespace separated. This can be
 				adjusted by providing a custom sentence detector and optional also a tokenizer.

 				The opennlp brat command supports the following arguments for providing custom sentence detector
 				and tokenizer.

 				<simplelist type='horiz' columns='1'>
 					<member><para>-sentenceDetectorModel - your sentence model</para></member>
 					<member><para>-tokenizerModel - your tokenizer model</para></member>
 					<member><para>-ruleBasedTokenizer - simple | whitespace</para></member>
 				</simplelist>

 			</para>
 		</section>

 		<section id="tools.corpora.brat.training">
 			<title>Training</title>
 			<para>
 				To train your namefinder model using your brat annotated files you can either use the opennlp command
 				line tool or call opennlp.tools.cmdline.CLI main class from your preferred IDE.

 				Calling opennlp TokenNameFinder.brat without arguments gives you a list of all the arguments you can use.
 				Obviously some combinations are not valid. E.g. you should not provide a token model and also define
 				a rule based tokenizer.

 				<screen>
 					<![CDATA[
 $ opennlp TokenNameFinderTrainer.brat
 Usage: opennlp TokenNameFinderTrainer.brat [-factory factoryName] [-resources resourcesDir] [-type modelType]
 [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language
 -model modelFile [-tokenizerModel modelFile] [-ruleBasedTokenizer name] -annotationConfig annConfFile
 -bratDataDir bratDataDir [-recursive value] [-sentenceDetectorModel modelFile]

 Arguments description:
 	-factory factoryName
 		A sub-class of TokenNameFinderFactory
 	-resources resourcesDir
 		The resources directory
 	-type modelType
 		The type of the token name finder model
 	-featuregen featuregenFile
 		The feature generator descriptor file
 	-nameTypes types
 		name types to use for training
 	-sequenceCodec codec
 		sequence codec used to code name spans
 	-params paramsFile
 		training parameters file.
 	-lang language
 		language which is being processed.
 	-model modelFile
 		output model file.
 	-tokenizerModel modelFile
 	-ruleBasedTokenizer name
 	-annotationConfig annConfFile
 	-bratDataDir bratDataDir
 		location of brat data dir
 	-recursive value
 	-sentenceDetectorModel modelFile
 ]]>
 				</screen>

 				The following command will train a danish organization name finder model.

 				<screen>
 					<![CDATA[
 $ opennlp TokenNameFinderTrainer.brat -resources conf/resources \
 -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \
 -params conf/resources/TrainerParams.txt -lang da \
 -model models/da-org.bin -ruleBasedTokenizer simple \
 -annotationConfig data/annotation.conf -bratDataDir data/gold/da/train \
 -recursive true -sentenceDetectorModel models/da-sent.bin

 Indexing events using cutoff of 0

 Computing event counts...
 done. 620738 events
 Indexing...  done.
 Collecting events... Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 620738
 	    Number of Outcomes: 3
 	  Number of Predicates: 1403655
 Computing model parameters...
 Performing 100 iterations.
   1:  . (614536/620738) 0.9900086671027067
   2:  . (617590/620738) 0.9949286172265915
   3:  . (618615/620738) 0.9965798775006525
   4:  . (619263/620738) 0.9976237961909856
   5:  . (619509/620738) 0.9980200986567602
   6:  . (619830/620738) 0.9985372250450271
   7:  . (619968/620738) 0.9987595410624128
   8:  . (620110/620738) 0.9989883010223315
   9:  . (620200/620738) 0.9991332897293222
  10:  . (620266/620738) 0.9992396147811153
  20:  . (620538/620738) 0.999677802873354
  30:  . (620641/620738) 0.9998437343935767
  40:  . (620653/620738) 0.9998630662211755
 Stopping: change in training set accuracy less than 1.0E-5
 Stats: (620594/620738) 0.9997680180688149
 ...done.

 Writing name finder model ... Training data summary:
 #Sentences: 26133
 #Tokens: 620738
 #Organization entities: 13053

 Compressed 1403655 parameters to 116378
 4 outcome patterns
 done (11.099s)

 Wrote name finder model to
 path: models/da-org.bin
 ]]>
 				</screen>
 			</para>
 		</section>

 		<section id="tools.corpora.brat.evaluation">
 			<title>Evaluation</title>
 			<para>
 				To evaluate you name finder model opennlp provides an evaluator that works with your brat
 				annotated data. Normally you would partition your data in a training set and a test set e.g. 70%
 				training and 30% test.
 				The training set is of cause only used for training the model and should never be used for
 				evaluation. The test set is only used for evaluation. In order to avoid overfitting, it is preferable if the training set and
 				test set is somewhat balanced so that both sets represents a broad variety of the entities
 				it should be able to identify. Shuffling the data before splitting is most likely sufficient in many cases.

 				<screen>
 					<![CDATA[
 $ opennlp TokenNameFinderEvaluator.brat -model models/da-org.bin \
 -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \
 -bratDataDir data/gold/da/test -recursive true \
 -sentenceDetectorModel models/da-sent.bin

 Loading Token Name Finder model ... done (12.395s)

 Average: 610.7 sent/s
 Total: 6133 sent
 Runtime: 10.043s

 Precision: 0.7321974661424203
 Recall: 0.25176505933603727
 F-Measure: 0.3746926000447127

 ]]>
 				</screen>
 			</para>
 		</section>

 		<section id="tools.corpora.brat.cross-validation">
 			<title>Cross Validation</title>
 			<para>
 				You can also use the cross validation to evaluate you model. This can come in handy when you do
 				not have enough data to divide it into a proper training and test set.
 				Running cross validation with the misclassified attribute set to true can also be helpful because it
 				will identify missed annotations as they will pop up as false positives in the text output.
 				<screen>
 					<![CDATA[
 $ opennlp TokenNameFinderCrossValidator.brat -resources conf/resources \
 -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \
 -params conf/resources/TrainerParams.txt -lang da -misclassified true \
 -folds 10 -detailedF true -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \
 -bratDataDir data/gold/da -recursive true -sentenceDetectorModel models/da-sent.bin

 Indexing events using cutoff of 0

 Computing event counts...
 done. 555858 events
     Indexing...  done.
 Collecting events... Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 555858
 	    Number of Outcomes: 3
 	  Number of Predicates: 1302740
 Computing model parameters...
 Performing 100 iterations.
   1:  . (550095/555858) 0.9896322442062541
   2:  . (552971/555858) 0.9948062274897546
 ...
 ...
 ... (training and evaluationg x 10)
 ...
 done

 Evaluated 26133 samples with 13053 entities; found: 12174 entities; correct: 10361.
        TOTAL: precision:   85.11%;  recall:   79.38%; F1:   82.14%.
 Organization: precision:   85.11%;  recall:   79.38%; F1:   82.14%. [target: 13053; tp: 10361; fp: 1813]


         ]]>
 				</screen>
 			</para>
 		</section>
 	</section>
 </chapter>