blob: aa97dc2e0d09272ec4729ba4b68fb84e81c1d438 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="tools.corpora">
<title>Corpora</title>
<para>
OpenNLP has built-in support to convert into the native training format or directly use
various corpora needed by the different trainable components.
</para>
<section id="tools.corpora.conll">
<title>CONLL</title>
<para>
CoNLL stands for the Conference on Computational Natural Language Learning and is not
a single project but a consortium of developers attempting to broaden the computing
environment. More information about the entire conference series can be obtained here
for CoNLL.
</para>
<section id="tools.corpora.conll.2000">
<title>CONLL 2000</title>
<para>
The shared task of CoNLL-2000 is Chunking.
</para>
<section id="tools.corpora.conll.2000.getting">
<title>Getting the data</title>
<para>
CoNLL-2000 made available training and test data for the Chunk task in English.
The data consists of the same partitions of the Wall Street Journal corpus (WSJ)
as the widely used data for noun phrase chunking: sections 15-18 as training data
(211727 tokens) and section 20 as test data (47377 tokens). The annotation of the
data has been derived from the WSJ corpus by a program written by Sabine Buchholz
from Tilburg University, The Netherlands. Both training and test data can be
obtained from <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">http://www.cnts.ua.ac.be/conll2000/chunking</ulink>.
</para>
</section>
<section id="tools.corpora.conll.2000.converting">
<title>Converting the data</title>
<para>
The data don't need to be transformed because Apache OpenNLP Chunker follows
the CONLL 2000 format for training. Check <link linkend="tools.chunker.training">Chunker Training</link> section to learn more.
</para>
</section>
<section id="tools.corpora.conll.2000.training">
<title>Training</title>
<para>
We can train the model for the Chunker using the train.txt available at CONLL 2000:
<screen>
<![CDATA[
$ opennlp ChunkerTrainerME -model en-chunker.bin -iterations 500 \
-lang en -data train.txt -encoding UTF-8]]>
</screen>
<screen>
<![CDATA[
Indexing events using cutoff of 5
Computing event counts... done. 211727 events
Indexing... done.
Sorting and merging events... done. Reduced 211727 events to 197252.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 197252
Number of Outcomes: 22
Number of Predicates: 107838
...done.
Computing model parameters...
Performing 500 iterations.
1: .. loglikelihood=-654457.1455212828 0.2601510435608118
2: .. loglikelihood=-239513.5583724216 0.9260037690044255
3: .. loglikelihood=-141313.1386347238 0.9443387003074715
4: .. loglikelihood=-101083.50853437989 0.954375209585929
... cut lots of iterations ...
498: .. loglikelihood=-1710.8874647317095 0.9995040783650645
499: .. loglikelihood=-1708.0908900815848 0.9995040783650645
500: .. loglikelihood=-1705.3045902366732 0.9995040783650645
Writing chunker model ... done (4.019s)
Wrote chunker model to path: .\en-chunker.bin]]>
</screen>
</para>
</section>
<section id="tools.corpora.conll.2000.evaluation">
<title>Evaluating</title>
<para>
We evaluate the model using the file test.txt available at CONLL 2000:
<screen>
<![CDATA[
$ opennlp ChunkerEvaluator -model en-chunker.bin -lang en -encoding utf8 -data test.txt]]>
</screen>
<screen>
<![CDATA[
Loading Chunker model ... done (0,665s)
current: 85,8 sent/s avg: 85,8 sent/s total: 86 sent
current: 88,1 sent/s avg: 87,0 sent/s total: 174 sent
current: 156,2 sent/s avg: 110,0 sent/s total: 330 sent
current: 192,2 sent/s avg: 130,5 sent/s total: 522 sent
current: 167,2 sent/s avg: 137,8 sent/s total: 689 sent
current: 179,2 sent/s avg: 144,6 sent/s total: 868 sent
current: 183,2 sent/s avg: 150,3 sent/s total: 1052 sent
current: 183,2 sent/s avg: 154,4 sent/s total: 1235 sent
current: 169,2 sent/s avg: 156,0 sent/s total: 1404 sent
current: 178,2 sent/s avg: 158,2 sent/s total: 1582 sent
current: 172,2 sent/s avg: 159,4 sent/s total: 1754 sent
current: 177,2 sent/s avg: 160,9 sent/s total: 1931 sent
Average: 161,6 sent/s
Total: 2013 sent
Runtime: 12.457s
Precision: 0.9244354736974896
Recall: 0.9216837162502096
F-Measure: 0.9230575441395671]]>
</screen>
</para>
</section>
</section>
<section id="tools.corpora.conll.2002">
<title>CONLL 2002</title>
<para>
The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch.
</para>
<section id="tools.corpora.conll.2002.getting">
<title>Getting the data</title>
<para>The data consists of three files per language: one training file and two test files testa and testb.
The first test file will be used in the development phase for finding good parameters for the learning system.
The second test file will be used for the final evaluation. Currently there are data files available for two languages:
Spanish and Dutch.
</para>
<para>
The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are
from May 2000. The annotation was carried out by the <ulink url="http://www.talp.cat/">TALP Research Center</ulink> of the Technical University of Catalonia (UPC)
and the <ulink url="http://clic.ub.edu/">Center of Language and Computation (CLiC)</ulink>of the University of Barcelona (UB), and funded by the European Commission
through the NAMIC project (IST-1999-12392).
</para>
<para>
The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1).
The data was annotated as a part of the <ulink url="http://atranos.esat.kuleuven.ac.be/">Atranos</ulink> project at the University of Antwerp.
</para>
<para>
You can find the Spanish files here:
<ulink url="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
You must download esp.train.gz, unzip it and you will see the file esp.train.
</para>
<para>
You can find the Dutch files here:
<ulink url="http://www.cnts.ua.ac.be/conll2002/ner.tgz">http://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
You must unzip it and go to /ner/data/ned.train.gz, so you unzip it too, and you will see the file ned.train.
</para>
</section>
<section id="tools.corpora.conll.2002.converting">
<title>Converting the data</title>
<para>
I will use Spanish data as reference, but it would be the same operations to Dutch. You just must remember change “-lang es” to “-lang nl” and use
the correct training files. So to convert the information to the OpenNLP format:
<screen>
<![CDATA[
$ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt]]>
</screen>
Optionally, you can convert the training test samples as well.
<screen>
<![CDATA[
$ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt]]>
</screen>
</para>
</section>
<section id="tools.corpora.conll.2002.training.spanish">
<title>Training with Spanish data</title>
<para>
To train the model for the name finder:
<screen>
<![CDATA[
\bin\opennlp TokenNameFinderTrainer -lang es -encoding u
tf8 -iterations 500 -data es_corpus_train_persons.txt -model es_ner_person.bin
Indexing events using cutoff of 5
Computing event counts... done. 264715 events
Indexing... done.
Sorting and merging events... done. Reduced 264715 events to 222660.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 222660
Number of Outcomes: 3
Number of Predicates: 71514
...done.
Computing model parameters ...
Performing 500 iterations.
1: ... loglikelihood=-290819.1519958615 0.9689326256540053
2: ... loglikelihood=-37097.17676455632 0.9689326256540053
3: ... loglikelihood=-22910.372489660916 0.9706476776911017
4: ... loglikelihood=-17091.547325669497 0.9777874317662392
5: ... loglikelihood=-13797.620926769372 0.9833821279489262
6: ... loglikelihood=-11715.806710780415 0.9867140131839903
7: ... loglikelihood=-10289.222078246517 0.9886859452618855
8: ... loglikelihood=-9249.208318314624 0.9902310031543358
9: ... loglikelihood=-8454.169590899777 0.9913227433277298
10: ... loglikelihood=-7823.742997451327 0.9921953799369133
11: ... loglikelihood=-7309.375882641964 0.9928224694482746
12: ... loglikelihood=-6880.131972149693 0.9932946754056249
13: ... loglikelihood=-6515.3828767792365 0.993638441342576
14: ... loglikelihood=-6200.82723154046 0.9939595413935742
15: ... loglikelihood=-5926.213730444915 0.994269308501596
16: ... loglikelihood=-5683.9821840753275 0.9945299661900534
17: ... loglikelihood=-5468.4211798176075 0.9948246227074401
18: ... loglikelihood=-5275.127017232056 0.9950286156810154
... cut lots of iterations ...
491: ... loglikelihood=-1174.8485558758211 0.998983812779782
492: ... loglikelihood=-1173.9971776942477 0.998983812779782
493: ... loglikelihood=-1173.1482915871768 0.998983812779782
494: ... loglikelihood=-1172.3018855781158 0.998983812779782
495: ... loglikelihood=-1171.457947774544 0.998983812779782
496: ... loglikelihood=-1170.6164663670502 0.998983812779782
497: ... loglikelihood=-1169.7774296286693 0.998983812779782
498: ... loglikelihood=-1168.94082591387 0.998983812779782
499: ... loglikelihood=-1168.1066436580463 0.9989875904274408
500: ... loglikelihood=-1167.2748713765225 0.9989875904274408
Writing name finder model ... done (2,168s)
Wrote name finder model to
path: .\es_ner_person.bin]]>
</screen>
</para>
</section>
</section>
<section id="tools.corpora.conll.2003">
<title>CONLL 2003</title>
<para>
The shared task of CoNLL-2003 is language independent named entity recognition
for English and German.
</para>
<section id="tools.corpora.conll.2003.getting">
<title>Getting the data</title>
<para>
The English data is the Reuters Corpus, which is a collection of news wire articles.
The Reuters Corpus can be obtained free of charges from the NIST for research
purposes: <ulink url="http://trec.nist.gov/data/reuters/reuters.html">http://trec.nist.gov/data/reuters/reuters.html</ulink>
</para>
<para>
The German data is a collection of articles from the German newspaper Frankfurter
Rundschau. The articles are part of the ECI Multilingual Text Corpus which
can be obtained for 75$ (2010) from the Linguistic Data Consortium:
<ulink url="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5</ulink> </para>
<para>After one of the corpora is available the data must be
transformed as explained in the README file to the CONLL format.
The transformed data can be read by the OpenNLP CONLL03 converter.
</para>
</section>
<section id="tools.corpora.conll.2003.converting">
<title>Converting the data (optional)</title>
<para>
To convert the information to the OpenNLP format:
<screen>
<![CDATA[
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.train > corpus_train.txt]]>
</screen>
Optionally, you can convert the training test samples as well.
<screen>
<![CDATA[
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testa > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testb > corpus_testb.txt]]>
</screen>
</para>
</section>
<section id="tools.corpora.conll.2003.training.english">
<title>Training with English data</title>
<para>
You can train the model for the name finder this way:
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 500 \
-lang en -types per -data eng.train -encoding utf8]]>
</screen>
</para>
<para>
If you have converted the data, then you can train the model for the name finder this way:
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer -model en_ner_person.bin -iterations 500 \
-lang en -data corpus_train.txt -encoding utf8]]>
</screen>
</para>
<para>
Either way you should see the following output during the training process:
<screen>
<![CDATA[
Indexing events using cutoff of 5
Computing event counts... done. 203621 events
Indexing... done.
Sorting and merging events... done. Reduced 203621 events to 179409.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 179409
Number of Outcomes: 3
Number of Predicates: 58814
...done.
Computing model parameters...
Performing 500 iterations.
1: .. loglikelihood=-223700.5328318588 0.9453494482396216
2: .. loglikelihood=-40525.939777363084 0.9467933071736215
3: .. loglikelihood=-24893.98837874921 0.9598518816821447
4: .. loglikelihood=-18420.3379471033 0.9712996203731442
... cut lots of iterations ...
498: .. loglikelihood=-952.8501399442295 0.9988950059178572
499: .. loglikelihood=-952.0600155746948 0.9988950059178572
500: .. loglikelihood=-951.2722802086295 0.9988950059178572
Writing name finder model ... done (1.638s)
Wrote name finder model to
path: .\en_ner_person.bin]]>
</screen>
</para>
</section>
<section id="tools.corpora.conll.2003.evaluation.english">
<title>Evaluating with English data</title>
<para>
You can evaluate the model for the name finder this way:
<screen>
<![CDATA[
$ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \
-lang en -types per -data eng.testa -encoding utf8]]>
</screen>
</para>
<para>
If you converted the test A and B files above, you can use them to evaluate the
model.
<screen>
<![CDATA[
$ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -lang en -data corpus_testa.txt \
-encoding utf8]]>
</screen>
</para>
<para>
Either way you should see the following output:
<screen>
<![CDATA[
Loading Token Name Finder model ... done (0.359s)
current: 190.2 sent/s avg: 190.2 sent/s total: 199 sent
current: 648.3 sent/s avg: 415.9 sent/s total: 850 sent
current: 530.1 sent/s avg: 453.6 sent/s total: 1380 sent
current: 793.8 sent/s avg: 539.0 sent/s total: 2178 sent
current: 705.4 sent/s avg: 571.9 sent/s total: 2882 sent
Average: 569.4 sent/s
Total: 3251 sent
Runtime: 5.71s
Precision: 0.9366247297154147
Recall: 0.739956568946797
F-Measure: 0.8267557582133971]]>
</screen>
</para>
</section>
</section>
</section>
<section id="tools.corpora.arvores-deitadas">
<title>Arvores Deitadas</title>
<para>
The Portuguese corpora available at <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from AD format to native format.
</para>
<section id="tools.corpora.arvores-deitadas.getting">
<title>Getting the data</title>
<para>
The Corpus can be downloaded from here: <ulink url="http://www.linguateca.pt/floresta/corpus.html">http://www.linguateca.pt/floresta/corpus.html</ulink>
</para>
<para>
The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
</para>
</section>
<section id="tools.corpora.arvores-deitadas.converting">
<title>Converting the data (optional)</title>
<para>
To extract NameFinder training data from Amazonia corpus:
<screen>
<![CDATA[
$ opennlp TokenNameFinderConverter ad -lang pt -encoding ISO-8859-1 -data amazonia.ad > corpus.txt]]>
</screen>
</para>
<para>
To extract Chunker training data from Bosque_CF_8.0.ad corpus:
<screen>
<![CDATA[
$ opennlp ChunkerConverter ad -lang pt -data Bosque_CF_8.0.ad.txt -encoding ISO-8859-1 > bosque-chunk]]>
</screen>
</para>
</section>
<section id="tools.corpora.arvores-deitadas.evaluation">
<title>Training and Evaluation</title>
<para>
To perform the evaluation the corpus was split into a training and a test part.
<screen>
<![CDATA[
$ sed '1,55172d' corpus.txt > corpus_train.txt
$ sed '55172,100000000d' corpus.txt > corpus_test.txt]]>
</screen>
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer -model pt-ner.bin -cutoff 20 -lang PT -data corpus_train.txt -encoding UTF-8
...
$ opennlp TokenNameFinderEvaluator -model pt-ner.bin -lang PT -data corpus_train.txt -encoding UTF-8
Precision: 0.8005071889818507
Recall: 0.7450581122145297
F-Measure: 0.7717879983140168]]>
</screen>
</para>
</section>
</section>
<section id="tools.corpora.ontonotes">
<title>OntoNotes Release 4.0</title>
<para>
"OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number
LDC2011T03 and isbn 1-58563-574-X, was developed as part of the
OntoNotes project, a collaborative effort between BBN Technologies,
the University of Colorado, the University of Pennsylvania and the
University of Southern Californias Information Sciences Institute. The
goal of the project is to annotate a large corpus comprising various
genres of text (news, conversational telephone speech, weblogs, usenet
newsgroups, broadcast, talk shows) in three languages (English,
Chinese, and Arabic) with structural information (syntax and predicate
argument structure) and shallow semantics (word sense linked to an
ontology and coreference). OntoNotes Release 4.0 is supported by the
Defense Advance Research Project Agency, GALE Program Contract No.
HR0011-06-C-0022.
</para>
<para>
OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes
Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes
Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast
conversation and web data in English and Chinese and newswire data in
Arabic. This cumulative publication consists of 2.4 million words as
follows: 300k words of Arabic newswire 250k words of Chinese newswire,
250k words of Chinese broadcast news, 150k words of Chinese broadcast
conversation and 150k words of Chinese web text and 600k words of
English newswire, 200k word of English broadcast news, 200k words of
English broadcast conversation and 300k words of English web text.
</para>
<para>
The OntoNotes project builds on two time-tested resources, following the
Penn Treebank for syntax and the Penn PropBank for predicate-argument
structure. Its semantic representation will include word sense
disambiguation for nouns and verbs, with each word sense connected to
an ontology, and coreference. The current goals call for annotation of
over a million words each of English and Chinese, and half a million
words of Arabic over five years." (http://catalog.ldc.upenn.edu/LDC2011T03)
</para>
<section id="tools.corpora.ontonotes.namefinder">
<title>Name Finder Training</title>
<para>
The OntoNotes corpus can be used to train the Name Finder. The corpus
contains many different name types
to train a model for a specific type only the built-in type filter
option should be used.
</para>
<para>
The sample shows how to train a model to detect person names.
<programlisting>
<![CDATA[
$ bin/opennlp TokenNameFinderTrainer.ontonotes -lang en -model en-ontonotes.bin \
-nameTypes person -ontoNotesDir ontonotes-release-4.0/data/files/data/english/
Indexing events using cutoff of 5
Computing event counts... done. 1953446 events
Indexing... done.
Sorting and merging events... done. Reduced 1953446 events to 1822037.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1822037
Number of Outcomes: 3
Number of Predicates: 298263
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-2146079.7808976253 0.976677625078963
2: ... loglikelihood=-195016.59754190338 0.976677625078963
... cut lots of iterations ...
99: ... loglikelihood=-10269.902459614596 0.9987299367374374
100: ... loglikelihood=-10227.160010853702 0.9987314724850341
Writing name finder model ... done (2.315s)
Wrote name finder model to
path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
</programlisting>
</para>
</section>
</section>
<section id="tools.corpora.brat">
<title>Brat Format Support</title>
<para>
The brat annotation tool is an online environment for collaborative text annotation and
supports labeling documents with named entities. The best performance of a name finder
can only be achieved if it was trained on documents similar to the the documents it will
process. For that reason it is often necessary to manually label a large number of documents and
build a custom corpus. This is where brat comes in handy.
<imagedata fileref="images/brat.png" width="6.5in" depth="4in" scalefit="1"/>
OpenNLP can directly be trained and evaluated on labeled data in the brat format.
Instructions on how to use, download and install brat can be found on the project website:
<ulink url="http://brat.nlplab.org">http://brat.nlplab.org</ulink>
Configuration of brat, including setting up the different entities and relations can be found at:
<ulink url="http://brat.nlplab.org/configuration.html">http://brat.nlplab.org/configuration.html</ulink>
</para>
<section id="tools.corpora.brat.webtool">
<title>Sentences and Tokens</title>
<para>
The brat annotation tool only adds named entity spans to the data and doesn't provide information
about tokens and sentences. To train the name finder this information is required. By default it
is assumed that each line is a sentence and that tokens are whitespace separated. This can be
adjusted by providing a custom sentence detector and optional also a tokenizer.
The opennlp brat command supports the following arguments for providing custom sentence detector
and tokenizer.
<simplelist type='horiz' columns='1'>
<member><para>-sentenceDetectorModel - your sentence model</para></member>
<member><para>-tokenizerModel - your tokenizer model</para></member>
<member><para>-ruleBasedTokenizer - simple | whitespace</para></member>
</simplelist>
</para>
</section>
<section id="tools.corpora.brat.training">
<title>Training</title>
<para>
To train your namefinder model using your brat annotated files you can either use the opennlp command
line tool or call opennlp.tools.cmdline.CLI main class from your preferred IDE.
Calling opennlp TokenNameFinder.brat without arguments gives you a list of all the arguments you can use.
Obviously some combinations are not valid. E.g. you should not provide a token model and also define
a rule based tokenizer.
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer.brat
Usage: opennlp TokenNameFinderTrainer.brat [-factory factoryName] [-resources resourcesDir] [-type modelType]
[-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language
-model modelFile [-tokenizerModel modelFile] [-ruleBasedTokenizer name] -annotationConfig annConfFile
-bratDataDir bratDataDir [-recursive value] [-sentenceDetectorModel modelFile]
Arguments description:
-factory factoryName
A sub-class of TokenNameFinderFactory
-resources resourcesDir
The resources directory
-type modelType
The type of the token name finder model
-featuregen featuregenFile
The feature generator descriptor file
-nameTypes types
name types to use for training
-sequenceCodec codec
sequence codec used to code name spans
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-tokenizerModel modelFile
-ruleBasedTokenizer name
-annotationConfig annConfFile
-bratDataDir bratDataDir
location of brat data dir
-recursive value
-sentenceDetectorModel modelFile
]]>
</screen>
The following command will train a danish organization name finder model.
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer.brat -resources conf/resources \
-featuregen conf/resources/fg-da-org.xml -nameTypes Organization \
-params conf/resources/TrainerParams.txt -lang da \
-model models/da-org.bin -ruleBasedTokenizer simple \
-annotationConfig data/annotation.conf -bratDataDir data/gold/da/train \
-recursive true -sentenceDetectorModel models/da-sent.bin
Indexing events using cutoff of 0
Computing event counts...
done. 620738 events
Indexing... done.
Collecting events... Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 620738
Number of Outcomes: 3
Number of Predicates: 1403655
Computing model parameters...
Performing 100 iterations.
1: . (614536/620738) 0.9900086671027067
2: . (617590/620738) 0.9949286172265915
3: . (618615/620738) 0.9965798775006525
4: . (619263/620738) 0.9976237961909856
5: . (619509/620738) 0.9980200986567602
6: . (619830/620738) 0.9985372250450271
7: . (619968/620738) 0.9987595410624128
8: . (620110/620738) 0.9989883010223315
9: . (620200/620738) 0.9991332897293222
10: . (620266/620738) 0.9992396147811153
20: . (620538/620738) 0.999677802873354
30: . (620641/620738) 0.9998437343935767
40: . (620653/620738) 0.9998630662211755
Stopping: change in training set accuracy less than 1.0E-5
Stats: (620594/620738) 0.9997680180688149
...done.
Writing name finder model ... Training data summary:
#Sentences: 26133
#Tokens: 620738
#Organization entities: 13053
Compressed 1403655 parameters to 116378
4 outcome patterns
done (11.099s)
Wrote name finder model to
path: models/da-org.bin
]]>
</screen>
</para>
</section>
<section id="tools.corpora.brat.evaluation">
<title>Evaluation</title>
<para>
To evaluate you name finder model opennlp provides an evaluator that works with your brat
annotated data. Normally you would partition your data in a training set and a test set e.g. 70%
training and 30% test.
The training set is of cause only used for training the model and should never be used for
evaluation. The test set is only used for evaluation. In order to avoid overfitting, it is preferable if the training set and
test set is somewhat balanced so that both sets represents a broad variety of the entities
it should be able to identify. Shuffling the data before splitting is most likely sufficient in many cases.
<screen>
<![CDATA[
$ opennlp TokenNameFinderEvaluator.brat -model models/da-org.bin \
-ruleBasedTokenizer simple -annotationConfig data/annotation.conf \
-bratDataDir data/gold/da/test -recursive true \
-sentenceDetectorModel models/da-sent.bin
Loading Token Name Finder model ... done (12.395s)
Average: 610.7 sent/s
Total: 6133 sent
Runtime: 10.043s
Precision: 0.7321974661424203
Recall: 0.25176505933603727
F-Measure: 0.3746926000447127
]]>
</screen>
</para>
</section>
<section id="tools.corpora.brat.cross-validation">
<title>Cross Validation</title>
<para>
You can also use the cross validation to evaluate you model. This can come in handy when you do
not have enough data to divide it into a proper training and test set.
Running cross validation with the misclassified attribute set to true can also be helpful because it
will identify missed annotations as they will pop up as false positives in the text output.
<screen>
<![CDATA[
$ opennlp TokenNameFinderCrossValidator.brat -resources conf/resources \
-featuregen conf/resources/fg-da-org.xml -nameTypes Organization \
-params conf/resources/TrainerParams.txt -lang da -misclassified true \
-folds 10 -detailedF true -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \
-bratDataDir data/gold/da -recursive true -sentenceDetectorModel models/da-sent.bin
Indexing events using cutoff of 0
Computing event counts...
done. 555858 events
Indexing... done.
Collecting events... Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 555858
Number of Outcomes: 3
Number of Predicates: 1302740
Computing model parameters...
Performing 100 iterations.
1: . (550095/555858) 0.9896322442062541
2: . (552971/555858) 0.9948062274897546
...
...
... (training and evaluationg x 10)
...
done
Evaluated 26133 samples with 13053 entities; found: 12174 entities; correct: 10361.
TOTAL: precision: 85.11%; recall: 79.38%; F1: 82.14%.
Organization: precision: 85.11%; recall: 79.38%; F1: 82.14%. [target: 13053; tp: 10361; fp: 1813]
]]>
</screen>
</para>
</section>
</section>
</chapter>