| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <chapter id="tools.corpora"> |
| |
| <title>Corpora</title> |
| <para> |
| OpenNLP has built-in support to convert into the native training format or directly use |
| various corpora needed by the different trainable components. |
| </para> |
| <section id="tools.corpora.conll"> |
| <title>CONLL</title> |
| <para> |
| CoNLL stands for the Conference on Computational Natural Language Learning. It is not |
| a single project but a yearly conference whose shared tasks have produced several of the |
| corpora described in this chapter. More information about the entire conference series |
| is available on the CoNLL website. |
| </para> |
| <section id="tools.corpora.conll.2000"> |
| <title>CONLL 2000</title> |
| <para> |
| The shared task of CoNLL-2000 is Chunking. |
| </para> |
| <section id="tools.corpora.conll.2000.getting"> |
| <title>Getting the data</title> |
| <para> |
| CoNLL-2000 made available training and test data for the Chunk task in English. |
| The data consists of the same partitions of the Wall Street Journal corpus (WSJ) |
| as the widely used data for noun phrase chunking: sections 15-18 as training data |
| (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the |
| data has been derived from the WSJ corpus by a program written by Sabine Buchholz |
| from Tilburg University, The Netherlands. Both training and test data can be |
| obtained from <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">http://www.cnts.ua.ac.be/conll2000/chunking</ulink>. |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2000.converting"> |
| <title>Converting the data</title> |
| <para> |
| The data does not need to be transformed because the Apache OpenNLP Chunker follows |
| the CoNLL 2000 format for training. See the <link linkend="tools.chunker.training">Chunker Training</link> section to learn more. |
| </para> |
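| <para> |
| As an illustration of the layout the Chunker consumes, the following minimal Python sketch reads |
| CoNLL-2000 style lines: one token per line with word, part-of-speech tag and chunk tag separated |
| by spaces, and an empty line between sentences. The function name and the sample sentences are |
| assumptions made for this example; they are not part of OpenNLP. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def read_conll2000(lines):
    """Group 'word POS chunk' lines into sentences of (word, pos, chunk) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk = line.split()
        current.append((word, pos, chunk))
    if current:
        # Flush the last sentence when the input does not end with an empty line.
        sentences.append(current)
    return sentences


# Hypothetical sample in the same layout as train.txt.
sample = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
deficit NN I-NP
. . O

Rockwell NNP B-NP
said VBD B-VP
""".splitlines()

sentences = read_conll2000(sample)
print(len(sentences), "sentences,", sum(len(s) for s in sentences), "tokens")
```
| ]]> |
| </programlisting> |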
| </section> |
| <section id="tools.corpora.conll.2000.training"> |
| <title>Training</title> |
| <para> |
| We can train the model for the Chunker using the train.txt available at CONLL 2000: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerTrainerME -model en-chunker.bin -iterations 500 \ |
| -lang en -data train.txt -encoding UTF-8]]> |
| </screen> |
| <screen> |
| <![CDATA[ |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 211727 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 211727 events to 197252. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 197252 |
| Number of Outcomes: 22 |
| Number of Predicates: 107838 |
| ...done. |
| Computing model parameters... |
| Performing 500 iterations. |
| 1: .. loglikelihood=-654457.1455212828 0.2601510435608118 |
| 2: .. loglikelihood=-239513.5583724216 0.9260037690044255 |
| 3: .. loglikelihood=-141313.1386347238 0.9443387003074715 |
| 4: .. loglikelihood=-101083.50853437989 0.954375209585929 |
| ... cut lots of iterations ... |
| 498: .. loglikelihood=-1710.8874647317095 0.9995040783650645 |
| 499: .. loglikelihood=-1708.0908900815848 0.9995040783650645 |
| 500: .. loglikelihood=-1705.3045902366732 0.9995040783650645 |
| Writing chunker model ... done (4.019s) |
| |
| Wrote chunker model to path: .\en-chunker.bin]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2000.evaluation"> |
| <title>Evaluating</title> |
| <para> |
| We evaluate the model using the file test.txt available at CONLL 2000: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerEvaluator -model en-chunker.bin -lang en -encoding utf8 -data test.txt]]> |
| </screen> |
| <screen> |
| <![CDATA[ |
| Loading Chunker model ... done (0,665s) |
| current: 85,8 sent/s avg: 85,8 sent/s total: 86 sent |
| current: 88,1 sent/s avg: 87,0 sent/s total: 174 sent |
| current: 156,2 sent/s avg: 110,0 sent/s total: 330 sent |
| current: 192,2 sent/s avg: 130,5 sent/s total: 522 sent |
| current: 167,2 sent/s avg: 137,8 sent/s total: 689 sent |
| current: 179,2 sent/s avg: 144,6 sent/s total: 868 sent |
| current: 183,2 sent/s avg: 150,3 sent/s total: 1052 sent |
| current: 183,2 sent/s avg: 154,4 sent/s total: 1235 sent |
| current: 169,2 sent/s avg: 156,0 sent/s total: 1404 sent |
| current: 178,2 sent/s avg: 158,2 sent/s total: 1582 sent |
| current: 172,2 sent/s avg: 159,4 sent/s total: 1754 sent |
| current: 177,2 sent/s avg: 160,9 sent/s total: 1931 sent |
| |
| |
| Average: 161,6 sent/s |
| Total: 2013 sent |
| Runtime: 12.457s |
| |
| Precision: 0.9244354736974896 |
| Recall: 0.9216837162502096 |
| F-Measure: 0.9230575441395671]]> |
| </screen> |
| </para> |
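| <para> |
| The reported F-Measure is consistent with the harmonic mean of precision and recall. The short |
| Python sketch below recomputes it from the two values printed above; the function name is an |
| assumption for this example and the code is not taken from OpenNLP. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluator output above.
precision = 0.9244354736974896
recall = 0.9216837162502096
f = f_measure(precision, recall)
print(f)
```
| ]]> |
| </programlisting> |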
| </section> |
| </section> |
| <section id="tools.corpora.conll.2002"> |
| <title>CONLL 2002</title> |
| <para> |
| The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch. |
| </para> |
| <section id="tools.corpora.conll.2002.getting"> |
| <title>Getting the data</title> |
| <para>The data consists of three files per language: one training file and two test files, testa and testb. |
| The first test file is intended for the development phase, to find good parameters for the learning system. |
| The second test file is intended for the final evaluation. Data files are available for two languages: |
| Spanish and Dutch. |
| </para> |
| <para> |
| The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are |
| from May 2000. The annotation was carried out by the <ulink url="http://www.talp.cat/">TALP Research Center</ulink> of the Technical University of Catalonia (UPC) |
| and the <ulink url="http://clic.ub.edu/">Center of Language and Computation (CLiC)</ulink> of the University of Barcelona (UB), and funded by the European Commission |
| through the NAMIC project (IST-1999-12392). |
| </para> |
| <para> |
| The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). |
| The data was annotated as a part of the <ulink url="http://atranos.esat.kuleuven.ac.be/">Atranos</ulink> project at the University of Antwerp. |
| </para> |
| <para> |
| You can find the Spanish files here: |
| <ulink url="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>. |
| Download esp.train.gz and unzip it to obtain the file esp.train. |
| </para> |
| <para> |
| You can find the Dutch files here: |
| <ulink url="http://www.cnts.ua.ac.be/conll2002/ner.tgz">http://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>. |
| Unpack the archive, then unzip ner/data/ned.train.gz to obtain the file ned.train. |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2002.converting"> |
| <title>Converting the data</title> |
| <para> |
| The examples below use the Spanish data as a reference; the same operations apply to the Dutch data. Just remember to change “-lang es” to “-lang nl” and to use |
| the correct training files. To convert the data to the OpenNLP format: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt]]> |
| </screen> |
| Optionally, you can convert the test samples as well. |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt |
| $ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt]]> |
| </screen> |
| </para> |
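| <para> |
| To make the effect of the conversion concrete, the following Python sketch rewrites token-per-line |
| data with IOB tags into one sentence per line with START and END markers around person names, which |
| is the markup the name finder trains on. It is only an illustration and not the converter's |
| implementation; the function name, the "person" type label and the sample sentences are assumptions. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def conll_to_name_samples(lines, wanted="PER", type_name="person"):
    """Render 'token ... TAG' lines as sentences with <START:type> ... <END> markup."""
    sentences, tokens = [], []
    in_span = False

    for line in list(lines) + [""]:      # the trailing "" flushes the last sentence
        fields = line.split()
        if not fields:
            # Empty line: close an open name and emit the sentence.
            if in_span:
                tokens.append("<END>")
                in_span = False
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
            continue
        word, tag = fields[0], fields[-1]
        prefix, _, entity = tag.partition("-")
        continues = in_span and prefix == "I" and entity == wanted
        if in_span and not continues:
            tokens.append("<END>")
            in_span = False
        if prefix == "B" and entity == wanted:
            tokens.append("<START:%s>" % type_name)
            in_span = True
        tokens.append(word)
    return sentences


# Hypothetical two-column sample (token, tag); other entity types are dropped.
sample = """El O
presidente O
Juan B-PER
Carlos I-PER
habla O
en O
Madrid B-LOC
. O

Llega O
Ana B-PER
""".splitlines()

converted = conll_to_name_samples(sample)
for sentence in converted:
    print(sentence)
```
| ]]> |
| </programlisting> |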
| </section> |
| <section id="tools.corpora.conll.2002.training.spanish"> |
| <title>Training with Spanish data</title> |
| <para> |
| To train the model for the name finder: |
| <screen> |
| <![CDATA[ |
| \bin\opennlp TokenNameFinderTrainer -lang es -encoding utf8 -iterations 500 -data es_corpus_train_persons.txt -model es_ner_person.bin |
| |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 264715 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 264715 events to 222660. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 222660 |
| Number of Outcomes: 3 |
| Number of Predicates: 71514 |
| ...done. |
| Computing model parameters ... |
| Performing 500 iterations. |
| 1: ... loglikelihood=-290819.1519958615 0.9689326256540053 |
| 2: ... loglikelihood=-37097.17676455632 0.9689326256540053 |
| 3: ... loglikelihood=-22910.372489660916 0.9706476776911017 |
| 4: ... loglikelihood=-17091.547325669497 0.9777874317662392 |
| 5: ... loglikelihood=-13797.620926769372 0.9833821279489262 |
| 6: ... loglikelihood=-11715.806710780415 0.9867140131839903 |
| 7: ... loglikelihood=-10289.222078246517 0.9886859452618855 |
| 8: ... loglikelihood=-9249.208318314624 0.9902310031543358 |
| 9: ... loglikelihood=-8454.169590899777 0.9913227433277298 |
| 10: ... loglikelihood=-7823.742997451327 0.9921953799369133 |
| 11: ... loglikelihood=-7309.375882641964 0.9928224694482746 |
| 12: ... loglikelihood=-6880.131972149693 0.9932946754056249 |
| 13: ... loglikelihood=-6515.3828767792365 0.993638441342576 |
| 14: ... loglikelihood=-6200.82723154046 0.9939595413935742 |
| 15: ... loglikelihood=-5926.213730444915 0.994269308501596 |
| 16: ... loglikelihood=-5683.9821840753275 0.9945299661900534 |
| 17: ... loglikelihood=-5468.4211798176075 0.9948246227074401 |
| 18: ... loglikelihood=-5275.127017232056 0.9950286156810154 |
| |
| ... cut lots of iterations ... |
| |
| 491: ... loglikelihood=-1174.8485558758211 0.998983812779782 |
| 492: ... loglikelihood=-1173.9971776942477 0.998983812779782 |
| 493: ... loglikelihood=-1173.1482915871768 0.998983812779782 |
| 494: ... loglikelihood=-1172.3018855781158 0.998983812779782 |
| 495: ... loglikelihood=-1171.457947774544 0.998983812779782 |
| 496: ... loglikelihood=-1170.6164663670502 0.998983812779782 |
| 497: ... loglikelihood=-1169.7774296286693 0.998983812779782 |
| 498: ... loglikelihood=-1168.94082591387 0.998983812779782 |
| 499: ... loglikelihood=-1168.1066436580463 0.9989875904274408 |
| 500: ... loglikelihood=-1167.2748713765225 0.9989875904274408 |
| Writing name finder model ... done (2,168s) |
| |
| Wrote name finder model to |
| path: .\es_ner_person.bin]]> |
| </screen> |
| </para> |
| </section> |
| </section> |
| |
| <section id="tools.corpora.conll.2003"> |
| <title>CONLL 2003</title> |
| <para> |
| The shared task of CoNLL-2003 is language independent named entity recognition |
| for English and German. |
| </para> |
| <section id="tools.corpora.conll.2003.getting"> |
| <title>Getting the data</title> |
| <para> |
| The English data is the Reuters Corpus, which is a collection of news wire articles. |
| The Reuters Corpus can be obtained free of charge from NIST for research |
| purposes: <ulink url="http://trec.nist.gov/data/reuters/reuters.html">http://trec.nist.gov/data/reuters/reuters.html</ulink> |
| </para> |
| <para> |
| The German data is a collection of articles from the German newspaper Frankfurter |
| Rundschau. The articles are part of the ECI Multilingual Text Corpus which |
| can be obtained for $75 (as of 2010) from the Linguistic Data Consortium: |
| <ulink url="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5</ulink> </para> |
| <para>Once one of the corpora is available, the data must be |
| transformed to the CoNLL format as explained in the README file. |
| The transformed data can be read by the OpenNLP CONLL03 converter. |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2003.converting"> |
| <title>Converting the data (optional)</title> |
| <para> |
| To convert the information to the OpenNLP format: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.train > corpus_train.txt]]> |
| </screen> |
| Optionally, you can convert the test samples as well. |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testa > corpus_testa.txt |
| $ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testb > corpus_testb.txt]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2003.training.english"> |
| <title>Training with English data</title> |
| <para> |
| You can train the model for the name finder this way: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 500 \ |
| -lang en -types per -data eng.train -encoding utf8]]> |
| </screen> |
| </para> |
| <para> |
| If you have converted the data, then you can train the model for the name finder this way: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer -model en_ner_person.bin -iterations 500 \ |
| -lang en -data corpus_train.txt -encoding utf8]]> |
| </screen> |
| </para> |
| <para> |
| Either way you should see the following output during the training process: |
| <screen> |
| <![CDATA[ |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 203621 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 203621 events to 179409. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 179409 |
| Number of Outcomes: 3 |
| Number of Predicates: 58814 |
| ...done. |
| Computing model parameters... |
| Performing 500 iterations. |
| 1: .. loglikelihood=-223700.5328318588 0.9453494482396216 |
| 2: .. loglikelihood=-40525.939777363084 0.9467933071736215 |
| 3: .. loglikelihood=-24893.98837874921 0.9598518816821447 |
| 4: .. loglikelihood=-18420.3379471033 0.9712996203731442 |
| ... cut lots of iterations ... |
| 498: .. loglikelihood=-952.8501399442295 0.9988950059178572 |
| 499: .. loglikelihood=-952.0600155746948 0.9988950059178572 |
| 500: .. loglikelihood=-951.2722802086295 0.9988950059178572 |
| Writing name finder model ... done (1.638s) |
| |
| Wrote name finder model to |
| path: .\en_ner_person.bin]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.corpora.conll.2003.evaluation.english"> |
| <title>Evaluating with English data</title> |
| <para> |
| You can evaluate the model for the name finder this way: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \ |
| -lang en -types per -data eng.testa -encoding utf8]]> |
| </screen> |
| </para> |
| <para> |
| If you converted the test A and B files above, you can use them to evaluate the |
| model. |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -lang en -data corpus_testa.txt \ |
| -encoding utf8]]> |
| </screen> |
| </para> |
| <para> |
| Either way you should see the following output: |
| <screen> |
| <![CDATA[ |
| Loading Token Name Finder model ... done (0.359s) |
| current: 190.2 sent/s avg: 190.2 sent/s total: 199 sent |
| current: 648.3 sent/s avg: 415.9 sent/s total: 850 sent |
| current: 530.1 sent/s avg: 453.6 sent/s total: 1380 sent |
| current: 793.8 sent/s avg: 539.0 sent/s total: 2178 sent |
| current: 705.4 sent/s avg: 571.9 sent/s total: 2882 sent |
| |
| |
| Average: 569.4 sent/s |
| Total: 3251 sent |
| Runtime: 5.71s |
| |
| Precision: 0.9366247297154147 |
| Recall: 0.739956568946797 |
| F-Measure: 0.8267557582133971]]> |
| </screen> |
| </para> |
| </section> |
| </section> |
| </section> |
| <section id="tools.corpora.arvores-deitadas"> |
| <title>Arvores Deitadas</title> |
| <para> |
| The Portuguese corpora available from the <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to the native format. |
| </para> |
| <section id="tools.corpora.arvores-deitadas.getting"> |
| <title>Getting the data</title> |
| <para> |
| The Corpus can be downloaded from here: <ulink url="http://www.linguateca.pt/floresta/corpus.html">http://www.linguateca.pt/floresta/corpus.html</ulink> |
| </para> |
| <para> |
| The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>. |
| The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>. |
| </para> |
| </section> |
| |
| <section id="tools.corpora.arvores-deitadas.converting"> |
| <title>Converting the data (optional)</title> |
| <para> |
| To extract Name Finder training data from the Amazonia corpus: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter ad -lang pt -encoding ISO-8859-1 -data amazonia.ad > corpus.txt]]> |
| </screen> |
| </para> |
| <para> |
| To extract Chunker training data from the Bosque_CF_8.0.ad corpus: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerConverter ad -lang pt -data Bosque_CF_8.0.ad.txt -encoding ISO-8859-1 > bosque-chunk]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.corpora.arvores-deitadas.evaluation"> |
| <title>Training and Evaluation</title> |
| <para> |
| To perform the evaluation the corpus was split into a training and a test part. |
| <screen> |
| <![CDATA[ |
| $ sed '1,55172d' corpus.txt > corpus_train.txt |
| $ sed '55172,100000000d' corpus.txt > corpus_test.txt]]> |
| </screen> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer -model pt-ner.bin -cutoff 20 -lang PT -data corpus_train.txt -encoding UTF-8 |
| ... |
| $ opennlp TokenNameFinderEvaluator -model pt-ner.bin -lang PT -data corpus_test.txt -encoding UTF-8 |
| |
| Precision: 0.8005071889818507 |
| Recall: 0.7450581122145297 |
| F-Measure: 0.7717879983140168]]> |
| </screen> |
| </para> |
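| <para> |
| The two sed calls above delete complementary line ranges. The Python sketch below mirrors them on |
| a small list to show which lines end up in which file; the function name and the ten-line sample |
| are assumptions for this example. Note that, as written, the boundary line N is deleted by both |
| commands and therefore ends up in neither file. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def split_like_sed(lines, n):
    """Mirror sed '1,Nd' (training part) and sed 'N,$d' (test part)."""
    train = lines[n:]        # deleting lines 1..N keeps lines N+1..end
    test = lines[:n - 1]     # deleting lines N..end keeps lines 1..N-1
    return train, test


# Hypothetical corpus of ten lines, split at line 4.
lines = ["line %d" % i for i in range(1, 11)]
train, test = split_like_sed(lines, 4)
print(len(train), "training lines,", len(test), "test lines")
```
| ]]> |
| </programlisting> |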
| </section> |
| </section> |
| |
| <section id="tools.corpora.ontonotes"> |
| <title>OntoNotes Release 4.0</title> |
| <para> |
| "OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number |
| LDC2011T03 and isbn 1-58563-574-X, was developed as part of the |
| OntoNotes project, a collaborative effort between BBN Technologies, |
| the University of Colorado, the University of Pennsylvania and the |
| University of Southern California's Information Sciences Institute. The |
| goal of the project is to annotate a large corpus comprising various |
| genres of text (news, conversational telephone speech, weblogs, usenet |
| newsgroups, broadcast, talk shows) in three languages (English, |
| Chinese, and Arabic) with structural information (syntax and predicate |
| argument structure) and shallow semantics (word sense linked to an |
| ontology and coreference). OntoNotes Release 4.0 is supported by the |
| Defense Advance Research Project Agency, GALE Program Contract No. |
| HR0011-06-C-0022. |
| </para> |
| <para> |
| OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes |
| Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes |
| Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast |
| conversation and web data in English and Chinese and newswire data in |
| Arabic. This cumulative publication consists of 2.4 million words as |
| follows: 300k words of Arabic newswire 250k words of Chinese newswire, |
| 250k words of Chinese broadcast news, 150k words of Chinese broadcast |
| conversation and 150k words of Chinese web text and 600k words of |
| English newswire, 200k word of English broadcast news, 200k words of |
| English broadcast conversation and 300k words of English web text. |
| </para> |
| <para> |
| The OntoNotes project builds on two time-tested resources, following the |
| Penn Treebank for syntax and the Penn PropBank for predicate-argument |
| structure. Its semantic representation will include word sense |
| disambiguation for nouns and verbs, with each word sense connected to |
| an ontology, and coreference. The current goals call for annotation of |
| over a million words each of English and Chinese, and half a million |
| words of Arabic over five years." (http://catalog.ldc.upenn.edu/LDC2011T03) |
| </para> |
| <section id="tools.corpora.ontonotes.namefinder"> |
| <title>Name Finder Training</title> |
| <para> |
| The OntoNotes corpus can be used to train the Name Finder. The corpus |
| contains many different name types. To train a model for a specific type only, |
| use the built-in type filter option (-nameTypes in the sample below). |
| </para> |
| <para> |
| The sample shows how to train a model to detect person names. |
| <programlisting> |
| <![CDATA[ |
| $ bin/opennlp TokenNameFinderTrainer.ontonotes -lang en -model en-ontonotes.bin \ |
| -nameTypes person -ontoNotesDir ontonotes-release-4.0/data/files/data/english/ |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 1953446 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 1953446 events to 1822037. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 1822037 |
| Number of Outcomes: 3 |
| Number of Predicates: 298263 |
| ...done. |
| Computing model parameters ... |
| Performing 100 iterations. |
| 1: ... loglikelihood=-2146079.7808976253 0.976677625078963 |
| 2: ... loglikelihood=-195016.59754190338 0.976677625078963 |
| ... cut lots of iterations ... |
| 99: ... loglikelihood=-10269.902459614596 0.9987299367374374 |
| 100: ... loglikelihood=-10227.160010853702 0.9987314724850341 |
| Writing name finder model ... done (2.315s) |
| |
| Wrote name finder model to |
| path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]> |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| |
| <section id="tools.corpora.brat"> |
| <title>Brat Format Support</title> |
| <para> |
| The brat annotation tool is an online environment for collaborative text annotation and |
| supports labeling documents with named entities. A name finder performs best |
| when it is trained on documents similar to the ones it will |
| process. For that reason it is often necessary to manually label a large number of documents and |
| build a custom corpus. This is where brat comes in handy. |
| |
| <imagedata fileref="images/brat.png" width="6.5in" depth="4in" scalefit="1"/> |
| |
| OpenNLP can directly be trained and evaluated on labeled data in the brat format. |
| Instructions on how to use, download and install brat can be found on the project website: |
| |
| <ulink url="http://brat.nlplab.org">http://brat.nlplab.org</ulink> |
| |
| Instructions for configuring brat, including setting up the different entities and relations, can be found at: |
| |
| <ulink url="http://brat.nlplab.org/configuration.html">http://brat.nlplab.org/configuration.html</ulink> |
| |
| </para> |
| |
| |
| <section id="tools.corpora.brat.webtool"> |
| <title>Sentences and Tokens</title> |
| <para> |
| The brat annotation tool only adds named entity spans to the data and doesn't provide information |
| about tokens and sentences. To train the name finder this information is required. By default it |
| is assumed that each line is a sentence and that tokens are whitespace separated. This can be |
| adjusted by providing a custom sentence detector and, optionally, a tokenizer. |
| |
| The opennlp brat commands support the following arguments for providing a custom sentence detector |
| and tokenizer. |
| |
| <simplelist type='horiz' columns='1'> |
| <member><para>-sentenceDetectorModel - your sentence model</para></member> |
| <member><para>-tokenizerModel - your tokenizer model</para></member> |
| <member><para>-ruleBasedTokenizer - simple | whitespace</para></member> |
| </simplelist> |
| |
| </para> |
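| <para> |
| The default assumption described above (one sentence per line, whitespace separated tokens) can be |
| sketched in Python as follows. The sketch also maps the character offsets of a brat text-bound |
| annotation onto token indices. It is an illustration only and not OpenNLP's implementation; the |
| function names and the sample document are assumptions, and discontinuous brat spans are not handled. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def whitespace_tokens(sentence, offset=0):
    """Split on whitespace and keep (start, end, text) character spans."""
    spans, pos = [], 0
    for text in sentence.split():
        start = sentence.index(text, pos)
        pos = start + len(text)
        spans.append((offset + start, offset + pos, text))
    return spans


def parse_entity(ann_line):
    """Parse a text-bound annotation line: ID, TAB, 'Type start end', TAB, covered text."""
    _ident, type_and_span, _covered = ann_line.split("\t")
    entity_type, start, end = type_and_span.split(" ")
    return entity_type, int(start), int(end)


def token_range(tokens, start, end):
    """Return (first, last + 1) indices of the tokens fully covered by a character span."""
    inside = [i for i, (s, e, _) in enumerate(tokens) if s >= start and e <= end]
    return (inside[0], inside[-1] + 1) if inside else None


# Hypothetical document: each line is treated as one sentence.
document = "Novo Nordisk hires staff\nIt is based in Denmark"
sentences, offset = [], 0
for line in document.split("\n"):
    sentences.append(whitespace_tokens(line, offset))
    offset += len(line) + 1          # +1 for the newline character

entity_type, start, end = parse_entity("T1\tOrganization 0 12\tNovo Nordisk")
print(entity_type, token_range(sentences[0], start, end))
```
| ]]> |
| </programlisting> |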
| </section> |
| |
| <section id="tools.corpora.brat.training"> |
| <title>Training</title> |
| <para> |
| To train your name finder model using your brat annotated files you can either use the opennlp command |
| line tool or call the opennlp.tools.cmdline.CLI main class from your preferred IDE. |
| |
| Calling opennlp TokenNameFinderTrainer.brat without arguments prints a list of all the arguments you can use. |
| Some combinations are not valid. For example, you should not provide a tokenizer model and also define |
| a rule based tokenizer. |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer.brat |
| Usage: opennlp TokenNameFinderTrainer.brat [-factory factoryName] [-resources resourcesDir] [-type modelType] |
| [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language |
| -model modelFile [-tokenizerModel modelFile] [-ruleBasedTokenizer name] -annotationConfig annConfFile |
| -bratDataDir bratDataDir [-recursive value] [-sentenceDetectorModel modelFile] |
| |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type modelType |
| The type of the token name finder model |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -tokenizerModel modelFile |
| -ruleBasedTokenizer name |
| -annotationConfig annConfFile |
| -bratDataDir bratDataDir |
| location of brat data dir |
| -recursive value |
| -sentenceDetectorModel modelFile |
| ]]> |
| </screen> |
| |
| The following command will train a Danish organization name finder model. |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer.brat -resources conf/resources \ |
| -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \ |
| -params conf/resources/TrainerParams.txt -lang da \ |
| -model models/da-org.bin -ruleBasedTokenizer simple \ |
| -annotationConfig data/annotation.conf -bratDataDir data/gold/da/train \ |
| -recursive true -sentenceDetectorModel models/da-sent.bin |
| |
| Indexing events using cutoff of 0 |
| |
| Computing event counts... |
| done. 620738 events |
| Indexing... done. |
| Collecting events... Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 620738 |
| Number of Outcomes: 3 |
| Number of Predicates: 1403655 |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: . (614536/620738) 0.9900086671027067 |
| 2: . (617590/620738) 0.9949286172265915 |
| 3: . (618615/620738) 0.9965798775006525 |
| 4: . (619263/620738) 0.9976237961909856 |
| 5: . (619509/620738) 0.9980200986567602 |
| 6: . (619830/620738) 0.9985372250450271 |
| 7: . (619968/620738) 0.9987595410624128 |
| 8: . (620110/620738) 0.9989883010223315 |
| 9: . (620200/620738) 0.9991332897293222 |
| 10: . (620266/620738) 0.9992396147811153 |
| 20: . (620538/620738) 0.999677802873354 |
| 30: . (620641/620738) 0.9998437343935767 |
| 40: . (620653/620738) 0.9998630662211755 |
| Stopping: change in training set accuracy less than 1.0E-5 |
| Stats: (620594/620738) 0.9997680180688149 |
| ...done. |
| |
| Writing name finder model ... Training data summary: |
| #Sentences: 26133 |
| #Tokens: 620738 |
| #Organization entities: 13053 |
| |
| Compressed 1403655 parameters to 116378 |
| 4 outcome patterns |
| done (11.099s) |
| |
| Wrote name finder model to |
| path: models/da-org.bin |
| ]]> |
| </screen> |
| </para> |
| </section> |
| |
| <section id="tools.corpora.brat.evaluation"> |
| <title>Evaluation</title> |
| <para> |
| To evaluate your name finder model, opennlp provides an evaluator that works with your brat |
| annotated data. Normally you would partition your data into a training set and a test set, e.g. 70% |
| training and 30% test. |
| The training set is of course only used for training the model and should never be used for |
| evaluation. The test set is only used for evaluation. In order to avoid overfitting, it is preferable that the training set and |
| test set are reasonably balanced so that both sets represent a broad variety of the entities |
| the model should be able to identify. Shuffling the data before splitting is likely sufficient in many cases. |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderEvaluator.brat -model models/da-org.bin \ |
| -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \ |
| -bratDataDir data/gold/da/test -recursive true \ |
| -sentenceDetectorModel models/da-sent.bin |
| |
| Loading Token Name Finder model ... done (12.395s) |
| |
| Average: 610.7 sent/s |
| Total: 6133 sent |
| Runtime: 10.043s |
| |
| Precision: 0.7321974661424203 |
| Recall: 0.25176505933603727 |
| F-Measure: 0.3746926000447127 |
| |
| ]]> |
| </screen> |
| </para> |
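| <para> |
| A shuffled 70/30 partition as described above could be prepared along the following lines. This |
| is a minimal sketch that assumes each brat document is a .txt/.ann pair sharing one base name, so |
| splitting by base name keeps the pairs together. The function name, the fixed seed and the document |
| names are assumptions for this example; copying the files into separate directories is left out. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
import random


def split_documents(doc_ids, train_fraction=0.7, seed=42):
    """Shuffle document base names reproducibly and split them into train and test."""
    docs = sorted(doc_ids)                 # sort first so the result depends only on the seed
    random.Random(seed).shuffle(docs)
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


# Hypothetical base names; each stands for a pair such as doc00.txt and doc00.ann.
doc_ids = ["doc%02d" % i for i in range(10)]
train, test = split_documents(doc_ids)
print(len(train), "training documents,", len(test), "test documents")
```
| ]]> |
| </programlisting> |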
| </section> |
| |
| <section id="tools.corpora.brat.cross-validation"> |
| <title>Cross Validation</title> |
| <para> |
| You can also use cross validation to evaluate your model. This can come in handy when you do |
| not have enough data to divide it into a proper training and test set. |
| Running cross validation with the misclassified attribute set to true can also be helpful because |
| annotations that were missed during labeling show up as false positives in the text output. |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderCrossValidator.brat -resources conf/resources \ |
| -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \ |
| -params conf/resources/TrainerParams.txt -lang da -misclassified true \ |
| -folds 10 -detailedF true -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \ |
| -bratDataDir data/gold/da -recursive true -sentenceDetectorModel models/da-sent.bin |
| |
| Indexing events using cutoff of 0 |
| |
| Computing event counts... |
| done. 555858 events |
| Indexing... done. |
| Collecting events... Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 555858 |
| Number of Outcomes: 3 |
| Number of Predicates: 1302740 |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: . (550095/555858) 0.9896322442062541 |
| 2: . (552971/555858) 0.9948062274897546 |
| ... |
| ... |
| ... (training and evaluating x 10) |
| ... |
| done |
| |
| Evaluated 26133 samples with 13053 entities; found: 12174 entities; correct: 10361. |
| TOTAL: precision: 85.11%; recall: 79.38%; F1: 82.14%. |
| Organization: precision: 85.11%; recall: 79.38%; F1: 82.14%. [target: 13053; tp: 10361; fp: 1813] |
| |
| |
| ]]> |
| </screen> |
| </para> |
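| <para> |
| The percentages in the summary follow from the counts printed in brackets. The Python sketch |
| below recomputes them from the target, tp and fp values shown above; the function name is an |
| assumption for this example and the code is not taken from OpenNLP. |
| </para> |
| <programlisting language="python"> |
| <![CDATA[ |
```python
def scores_from_counts(tp, fp, target):
    """Derive precision, recall and F1 from true positives, false positives and gold entities."""
    found = tp + fp                        # entities the model predicted
    precision = tp / found
    recall = tp / target
    f1 = 2 * tp / (found + target)         # same value as the harmonic mean of the two
    return precision, recall, f1


# Counts copied from the cross validation output above.
p, r, f1 = scores_from_counts(tp=10361, fp=1813, target=13053)
print("precision: %.2f%%; recall: %.2f%%; F1: %.2f%%" % (p * 100, r * 100, f1 * 100))
```
| ]]> |
| </programlisting> |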
| </section> |
| </section> |
| </chapter> |