| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <chapter id="tools.chunker"> |
| |
| <title>Chunker</title> |
| |
| <section id="tools.parser.chunking"> |
| <title>Chunking</title> |
| <para> |
| Text chunking divides a text into syntactically correlated groups of words, |
| such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence. |
| </para> |
| |
| <section id="tools.parser.chunking.cmdline"> |
| <title>Chunker Tool</title> |
| <para> |
| The easiest way to try out the Chunker is the command line tool. The tool is only intended |
| for demonstration and testing. |
| </para> |
| <para> |
| Download the English maxent chunker model from the website and start the Chunker Tool with this command: |
| </para> |
| <para> |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerME en-chunker.bin]]> |
| </screen> |
| The Chunker now reads one POS-tagged sentence per line from stdin. |
| Copy these two sentences to the console: |
| <screen> |
| <![CDATA[ |
| Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD |
| a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP |
| to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._. |
| Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD |
| additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]> |
| </screen> |
| The Chunker will now echo the sentences to the console with their tokens grouped into chunks: |
| <screen> |
| <![CDATA[ |
| [NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] |
| [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] |
| [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] |
| [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._. |
| [NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ] |
| [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ] |
| [PP for_IN ] [NP the_DT planes_NNS ] ._.]]> |
| </screen> |
| The tag set used by the English POS model is the <ulink url="http://www.cis.upenn.edu/~treebank/">Penn Treebank tag set</ulink>. |
| </para> |
| </section> |
| <section id="tools.parser.chunking.api"> |
| <title>Chunking API</title> |
| <para> |
| The Chunker can be embedded into an application via its API. |
| First the chunker model must be loaded into memory from disk or another source. |
| In the sample below it is loaded from disk. |
| <programlisting language="java"> |
| <![CDATA[ |
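| // Requires java.io.FileInputStream, java.io.InputStream and opennlp.tools.chunker.ChunkerModel. |
| ChunkerModel model; |
| |
| // try-with-resources closes the stream once the model has been read. |
| try (InputStream modelIn = new FileInputStream("en-chunker.bin")) { |
|   model = new ChunkerModel(modelIn); |
| }]]> |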
| </programlisting> |
| After the model is loaded a Chunker can be instantiated. |
| <programlisting language="java"> |
| <![CDATA[ |
| ChunkerME chunker = new ChunkerME(model);]]> |
| </programlisting> |
| The Chunker instance is now ready to chunk data. It expects a tokenized sentence |
| as input, represented as a String array in which each String is one token, |
| together with a second array that holds the POS tag of each token. |
| </para> |
| <para> |
| The following code shows how to determine the most likely chunk tag sequence for a sentence. |
| <programlisting language="java"> |
| <![CDATA[ |
| String sent[] = new String[] { "Rockwell", "International", "Corp.", "'s", |
| "Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement", |
| "extending", "its", "contract", "with", "Boeing", "Co.", "to", |
| "provide", "structural", "parts", "for", "Boeing", "'s", "747", |
| "jetliners", "." }; |
| |
| String pos[] = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN", |
| "VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN", |
| "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS", |
| "." }; |
| |
| String tag[] = chunker.chunk(sent, pos);]]> |
| </programlisting> |
| The tag array contains one chunk tag for each token in the input array. Each chunk tag |
| is found at the same index as its token in the input array. |
| The confidence scores for the returned tags can be retrieved from |
| a ChunkerME with the following method call: |
| <programlisting language="java"> |
| <![CDATA[ |
| double probs[] = chunker.probs();]]> |
| </programlisting> |
| The call to probs is stateful and always returns the probabilities of the last |
| chunked sentence. The probs method should only be called after the chunk method |
| has been called; otherwise the behavior is undefined. |
| </para> |
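<para>
The chunk tag array can be turned back into bracketed groups similar to the output of the
command line tool. The following is a minimal sketch, assuming the tags follow the
B-TYPE, I-TYPE and O scheme described in the training section. The ChunkGrouper class and
its group method are hypothetical helpers for illustration and are not part of the OpenNLP API.
<programlisting language="java">
<![CDATA[
public class ChunkGrouper {

  /**
   * Renders tokens, POS tags and chunk tags in a bracketed form.
   * Assumption: tags are "B-TYPE", "I-TYPE" or "O". An I- tag always continues
   * the currently open chunk; its type is not compared with the open chunk's type.
   */
  public static String group(String[] tokens, String[] pos, String[] tags) {
    StringBuilder out = new StringBuilder();
    boolean open = false; // true while a chunk bracket is open

    for (int i = 0; i < tokens.length; i++) {
      String tag = tags[i];

      // A B- tag or an O tag closes the chunk that is currently open.
      if (open && !tag.startsWith("I-")) {
        out.append("] ");
        open = false;
      }

      // Any non-O tag seen while no chunk is open starts a new chunk.
      if (!open && !tag.equals("O")) {
        out.append('[').append(tag.substring(2)).append(' ');
        open = true;
      }

      out.append(tokens[i]).append('_').append(pos[i]).append(' ');
    }

    if (open) {
      out.append(']');
    }
    return out.toString().trim();
  }

  public static void main(String[] args) {
    String[] tokens = { "Rockwell", "said", "the", "agreement", "." };
    String[] pos = { "NNP", "VBD", "DT", "NN", "." };
    String[] tags = { "B-NP", "B-VP", "B-NP", "I-NP", "O" };
    System.out.println(group(tokens, pos, tags));
  }
}]]>
</programlisting>
In an application, the three arrays would be the sent, pos and tag arrays from the sample above.
</para>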
| <para> |
| Some applications need to retrieve the n-best chunk tag sequences and not |
| only the best sequence. |
| The topKSequences method is capable of returning the top sequences. |
| It can be called in a similar way as chunk. |
| <programlisting language="java"> |
| <![CDATA[ |
| Sequence topSequences[] = chunker.topKSequences(sent, pos);]]> |
| </programlisting> |
| Each Sequence object contains one sequence. The chunk tags of the sequence can be retrieved |
| via Sequence.getOutcomes(), and Sequence.getProbs() returns |
| the probability array for this sequence. |
| </para> |
| </section> |
| </section> |
| <section id="tools.chunker.training"> |
| <title>Chunker Training</title> |
| <para> |
| Pre-trained models might not be available for the desired language, |
| might fail to detect important chunks, or might not perform well enough outside the news domain. |
| </para> |
| <para> |
| These are the typical reasons to custom-train the chunker on a new |
| corpus or on a corpus extended with private training data taken from the data to be analyzed. |
| </para> |
| <para> |
| The training data can be converted to the OpenNLP chunker training format, |
| which is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>. |
| Other formats may also be available. |
| The training data consists of three columns separated by a single space. Each word is on a |
| separate line and there is an empty line after each sentence. The first column contains |
| the current word, the second its part-of-speech tag and the third its chunk tag. |
| The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words |
| and I-VP for verb phrase words. Most chunk types have two kinds of chunk tags: |
| B-CHUNK for the first word of the chunk and I-CHUNK for every other word in the chunk. |
| </para> |
| <para> |
| Sample sentence of the training data: |
| <screen> |
| <![CDATA[ |
| He PRP B-NP |
| reckons VBZ B-VP |
| the DT B-NP |
| current JJ I-NP |
| account NN I-NP |
| deficit NN I-NP |
| will MD B-VP |
| narrow VB I-VP |
| to TO B-PP |
| only RB B-NP |
| # # I-NP |
| 1.8 CD I-NP |
| billion CD I-NP |
| in IN B-PP |
| September NNP B-NP |
| . . O]]> |
| </screen> |
| Note that for improved visualization the example above uses tabs instead of a single space as column separator. |
| </para> |
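<para>
To make the column layout concrete, the following is a minimal sketch of a reader for this
format, assuming three whitespace-separated columns per line and an empty line between sentences.
The ChunkTrainingDataReader and Row classes are hypothetical and only illustrate the format;
they are not part of the OpenNLP API.
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkTrainingDataReader {

  /** One line of training data: word, POS tag and chunk tag. */
  public static final class Row {
    public final String word;
    public final String pos;
    public final String chunkTag;

    Row(String word, String pos, String chunkTag) {
      this.word = word;
      this.pos = pos;
      this.chunkTag = chunkTag;
    }
  }

  /** Groups the lines into sentences; an empty line ends the current sentence. */
  public static List<List<Row>> parse(List<String> lines) {
    List<List<Row>> sentences = new ArrayList<>();
    List<Row> current = new ArrayList<>();

    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      // Splitting on any whitespace accepts a single space as well as tabs.
      String[] cols = trimmed.split("\\s+");
      if (cols.length != 3) {
        throw new IllegalArgumentException("Expected 3 columns but got: " + line);
      }
      current.add(new Row(cols[0], cols[1], cols[2]));
    }

    // The last sentence may not be followed by an empty line.
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "He PRP B-NP", "reckons VBZ B-VP", "", "September NNP B-NP", ". . O");
    List<List<Row>> sentences = parse(lines);
    System.out.println(sentences.size() + " sentences, first word: "
        + sentences.get(0).get(0).word);
  }
}]]>
</programlisting>
</para>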
| <section id="tools.chunker.training.tool"> |
| <title>Training Tool</title> |
| <para> |
| OpenNLP has a command line tool which is used to train the models available from the |
| model download page on various corpora. |
| </para> |
| <para> |
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerTrainerME |
| Usage: opennlp ChunkerTrainerME[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \ |
| -model modelFile -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
| Assume that the English chunker model should be trained from a file called |
| en-chunker.train, which is encoded as UTF-8. The following command will train the |
| chunker and write the model to en-chunker.bin: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8]]> |
| </screen> |
| Additionally, it is possible to specify the number of iterations and the cutoff |
| with the -iterations and -cutoff options. |
| </para> |
| </section> |
| <section id="tools.chunker.training.api"> |
| <title>Training API</title> |
| <para> |
| The Chunker offers an API to train a new chunker model. The following sample code |
| illustrates how to do it: |
| <programlisting language="java"> |
| <![CDATA[ |
| ObjectStream<String> lineStream = |
| new PlainTextByLineStream(new FileInputStream("en-chunker.train"), StandardCharsets.UTF_8); |
| |
| ChunkerModel model; |
| |
| try(ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream)) { |
| model = ChunkerME.train("en", sampleStream, |
| new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams()); |
| } |
| |
| try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| }]]> |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| |
| <section id="tools.chunker.evaluation"> |
| <title>Chunker Evaluation</title> |
| <para> |
| The built-in evaluation can measure the chunker performance. The performance is either |
| measured on a test dataset or via cross validation. |
| </para> |
| <section id="tools.chunker.evaluation.tool"> |
| <title>Chunker Evaluation Tool</title> |
| <para> |
| The following command shows how the tool can be run: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerEvaluator |
| Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] \ |
| [-detailedF true|false] -lang language -data sampleData [-encoding charsetName]]]> |
| </screen> |
| For example, given a data sample named en-chunker.eval |
| and a trained model called en-chunker.bin, the command is: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerEvaluator -model en-chunker.bin -data en-chunker.eval -encoding UTF-8]]> |
| </screen> |
| The tool prints output like the following: |
| <screen> |
| <![CDATA[ |
| Precision: 0.9255923572240226 |
| Recall: 0.9220610430991112 |
| F-Measure: 0.9238233255623465]]> |
| </screen> |
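The three figures are related by the usual definitions. The following is a minimal sketch,
assuming the standard formulas: precision is the share of predicted chunks that are correct,
recall is the share of reference chunks that were found, and the F-measure is their harmonic mean.
The FMeasureSketch class and the counts in its main method are hypothetical and are not taken
from the OpenNLP evaluator.
<programlisting language="java">
<![CDATA[
public class FMeasureSketch {

  /** Share of predicted chunks that are correct. */
  public static double precision(int truePositives, int predicted) {
    return predicted == 0 ? 0.0 : (double) truePositives / predicted;
  }

  /** Share of reference chunks that were found. */
  public static double recall(int truePositives, int reference) {
    return reference == 0 ? 0.0 : (double) truePositives / reference;
  }

  /** Harmonic mean of precision and recall. */
  public static double fMeasure(double precision, double recall) {
    double sum = precision + recall;
    return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
  }

  public static void main(String[] args) {
    // Hypothetical counts: 90 correct chunks, 100 predicted, 120 in the reference data.
    double p = precision(90, 100);
    double r = recall(90, 120);
    System.out.println("Precision: " + p);
    System.out.println("Recall: " + r);
    System.out.println("F-Measure: " + fMeasure(p, r));
  }
}]]>
</programlisting>
Applying fMeasure to the precision and recall of the sample output gives a value that agrees
with the reported F-Measure up to rounding.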
| You can also use the tool to perform 10-fold cross validation of the Chunker. |
| The following command shows how the tool can be run: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerCrossValidator |
| Usage: opennlp ChunkerCrossValidator[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \ |
| [-misclassified true|false] [-folds num] [-detailedF true|false] \ |
| -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true will print detailed FMeasure results. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
| It is not necessary to pass a model. The tool will automatically split the data to train and evaluate: |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerCrossValidator -lang en -data en-chunker.cross -encoding UTF-8]]> |
| </screen> |
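The idea behind the split can be sketched as follows. This is only an illustration of k-fold
partitioning, assuming samples are assigned to folds by their index; the FoldSplitSketch class
is hypothetical and the partitioning used inside OpenNLP may differ.
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.List;

public class FoldSplitSketch {

  /** Returns the samples held out for evaluation in the given fold. */
  public static <T> List<T> testPart(List<T> samples, int folds, int fold) {
    List<T> out = new ArrayList<>();
    for (int i = 0; i < samples.size(); i++) {
      if (i % folds == fold) {
        out.add(samples.get(i));
      }
    }
    return out;
  }

  /** Returns the samples used for training in the given fold. */
  public static <T> List<T> trainPart(List<T> samples, int folds, int fold) {
    List<T> out = new ArrayList<>();
    for (int i = 0; i < samples.size(); i++) {
      if (i % folds != fold) {
        out.add(samples.get(i));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> samples = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      samples.add(i);
    }
    int folds = 5;
    // Each round trains on four fifths of the data and evaluates on the rest.
    for (int fold = 0; fold < folds; fold++) {
      System.out.println("fold " + fold
          + " train=" + trainPart(samples, folds, fold)
          + " test=" + testPart(samples, folds, fold));
    }
  }
}]]>
</programlisting>
Every sample is used for evaluation exactly once, and the scores of the individual folds are
then combined into one overall result.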
| </para> |
| </section> |
| </section> |
| </chapter> |