| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE file distributed with this work for additional |
| information regarding copyright ownership. The ASF licenses this file to |
| you under the Apache License, Version 2.0 (the "License"); you may not use |
| this file except in compliance with the License. You may obtain a copy of |
| the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required |
| by applicable law or agreed to in writing, software distributed under the |
| License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| OF ANY KIND, either express or implied. See the License for the specific |
| language governing permissions and limitations under the License. --> |
| |
| <chapter id="tools.lemmatizer"> |
| <title>Lemmatizer</title> |
| <para> |
| The lemmatizer returns, for a given word form (token) and Part of Speech |
| tag, |
| the dictionary form of a word, which is usually referred to as its |
| lemma. A token could |
| ambiguously be derived from several basic forms or dictionary words which is why |
| the |
| postag of the word is required to find the lemma. For example, the form |
| `show' may refer |
| to either the verb "to show" or to the noun "show". |
| Currently OpenNLP implement statistical and dictionary-based lemmatizers. |
| </para> |
| <section id="tools.lemmatizer.tagging.cmdline"> |
| <title>Lemmatizer Tool</title> |
| <para> |
| The easiest way to try out the Lemmatizer is the command line tool, |
| which provides access to the statistical |
| lemmatizer. Note that the tool is only intended for demonstration and testing. |
| </para> |
| <para> |
| Once you have trained a lemmatizer model (see below for instructions), |
| you can start the Lemmatizer Tool with this command: |
| </para> |
| <para> |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerME en-lemmatizer.bin < sentences]]> |
| </screen> |
| The Lemmatizer now reads a pos tagged sentence(s) per line from |
| standard input. For example, you can copy this sentence to the |
| console: |
| <screen> |
| <![CDATA[ |
| Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP |
| signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN |
| Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS |
| 747_CD jetliners_NNS ._.]]> |
| </screen> |
| The Lemmatizer will now echo the lemmas for each word postag pair to |
| the console: |
| <screen> |
| <![CDATA[ |
| Rockwell NNP rockwell |
| International NNP international |
| Corp. NNP corp. |
| 's POS 's |
| Tulsa NNP tulsa |
| unit NN unit |
| said VBD say |
| it PRP it |
| signed VBD sign |
| ... |
| ]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.lemmatizer.tagging.api"> |
| <title>Lemmatizer API</title> |
| <para> |
| The Lemmatizer can be embedded into an application via its API. |
| Currently a statistical |
| and DictionaryLemmatizer are available. Note that these two methods are |
| complementary and |
| the DictionaryLemmatizer can also be used as a way of post-processing |
| the output of the statistical |
| lemmatizer. |
| </para> |
| <para> |
| The statistical lemmatizer requires that a trained model is loaded |
| into memory from disk or from another source. |
| In the example below it is loaded from disk: |
| <programlisting language="java"> |
| <![CDATA[ |
| LemmatizerModel model = null; |
| try (InputStream modelIn = new FileInputStream("en-lemmatizer.bin"))) { |
| model = new LemmatizerModel(modelIn); |
| } |
| ]]> |
| </programlisting> |
| After the model is loaded a LemmatizerME can be instantiated. |
| <programlisting language="java"> |
| <![CDATA[ |
| LemmatizerME lemmatizer = new LemmatizerME(model);]]> |
| </programlisting> |
| The Lemmatizer instance is now ready to lemmatize data. It expects a |
| tokenized sentence |
| as input, which is represented as a String array, each String object |
| in the array |
| is one token, and the POS tags associated with each token. |
| </para> |
| <para> |
| The following code shows how to determine the most likely lemma for |
| a sentence. |
| <programlisting language="java"> |
| <![CDATA[ |
| String[] tokens = new String[] { "Rockwell", "International", "Corp.", "'s", |
| "Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement", |
| "extending", "its", "contract", "with", "Boeing", "Co.", "to", |
| "provide", "structural", "parts", "for", "Boeing", "'s", "747", |
| "jetliners", "." }; |
| |
| String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN", |
| "VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN", |
| "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS", |
| "." }; |
| |
| String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]> |
| </programlisting> |
| The lemmas array contains one lemma for each token in the |
| input array. The corresponding |
| tag and lemma can be found at the same index as the token has in the |
| input array. |
| </para> |
| |
| <para> |
| The DictionaryLemmatizer is constructed |
| by passing the InputStream of a lemmatizer dictionary. Such dictionary |
| consists of a text file containing, for each row, a word, its postag and the |
| corresponding lemma, each column separated by a tab character. |
| <screen> |
| <![CDATA[ |
| show NN show |
| showcase NN showcase |
| showcases NNS showcase |
| showdown NN showdown |
| showdowns NNS showdown |
| shower NN shower |
| showers NNS shower |
| showman NN showman |
| showmanship NN showmanship |
| showmen NNS showman |
| showroom NN showroom |
| showrooms NNS showroom |
| shows NNS show |
| shrapnel NN shrapnel |
| ]]> |
| </screen> |
| Alternatively, if a (word,postag) pair can output multiple lemmas, the |
| the lemmatizer dictionary would consists of a text file containing, for |
| each row, a word, its postag and the corresponding lemmas separated by "#": |
| <screen> |
| <![CDATA[ |
| muestras NN muestra |
| cantaba V cantar |
| fue V ir#ser |
| entramos V entrar |
| ]]> |
| </screen> |
| First the dictionary must be loaded into memory from disk or another |
| source. |
| In the sample below it is loaded from disk. |
| <programlisting language="java"> |
| <![CDATA[ |
| InputStream dictLemmatizer = null; |
| |
| try (dictLemmatizer = new FileInputStream("english-lemmatizer.txt")) { |
| |
| } |
| ]]> |
| </programlisting> |
| After the dictionary is loaded the DictionaryLemmatizer can be |
| instantiated. |
| <programlisting language="java"> |
| <![CDATA[ |
| DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]> |
| </programlisting> |
| The DictionaryLemmatizer instance is now ready. It expects two |
| String arrays as input, |
| a containing the tokens and another one their respective postags. |
| </para> |
| <para> |
| The following code shows how to find a lemma using a |
| DictionaryLemmatizer. |
| <programlisting language="java"> |
| <![CDATA[ |
| String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had", |
| "morning", "and", "afternoon", "newspapers", "."}; |
| String[] tags = tagger.tag(sent); |
| String[] lemmas = lemmatizer.lemmatize(tokens, postags); |
| ]]> |
| </programlisting> |
| The tags array contains one part-of-speech tag for each token in the |
| input array. The corresponding |
| tag and lemmas can be found at the same index as the token has in the |
| input array. |
| </para> |
| </section> |
| <section id="tools.lemmatizer.training"> |
| <title>Lemmatizer Training</title> |
| <para> |
| The training data consist of three columns separated by spaces. Each |
| word has been put on a |
| separate line and there is an empty line after each sentence. The first |
| column contains |
| the current word, the second its part-of-speech tag and the third its |
| lemma. |
| Here is an example of the file format: |
| </para> |
| <para> |
| Sample sentence of the training data: |
| <screen> |
| <![CDATA[ |
| He PRP he |
| reckons VBZ reckon |
| the DT the |
| current JJ current |
| accounts NNS account |
| deficit NN deficit |
| will MD will |
| narrow VB narrow |
| to TO to |
| only RB only |
| # # # |
| 1.8 CD 1.8 |
| millions CD million |
| in IN in |
| September NNP september |
| . . O]]> |
| </screen> |
| The Universal Dependencies Treebank and the CoNLL 2009 datasets |
| distribute training data for many languages. |
| </para> |
| <section id="tools.lemmatizer.training.tool"> |
| <title>Training Tool</title> |
| <para> |
| OpenNLP has a command line tool which is used to train the models on |
| various corpora. |
| </para> |
| <para> |
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerTrainerME |
| Usage: opennlp LemmatizerTrainerME [-factory factoryName] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -factory factoryName |
| A sub-class of LemmatizerFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| ]]> |
| </screen> |
| Its now assumed that the english lemmatizer model should be trained |
| from a file called |
| en-lemmatizer.train which is encoded as UTF-8. The following command will train the |
| lemmatizer and write the model to en-lemmatizer.bin: |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.lemmatizer.training.api"> |
| <title>Training API</title> |
| <para> |
| The Lemmatizer offers an API to train a new lemmatizer model. First |
| a training parameters |
| file needs to be instantiated: |
| <programlisting language="java"> |
| <![CDATA[ |
| TrainingParameters mlParams = CmdLineUtil.loadTrainingParameters(params.getParams(), false); |
| if (mlParams == null) { |
| mlParams = ModelUtil.createDefaultTrainingParameters(); |
| }]]> |
| </programlisting> |
| Then we read the training data: |
| <programlisting language="java"> |
| <![CDATA[ |
| InputStreamFactory inputStreamFactory = null; |
| try { |
| inputStreamFactory = new MarkableFileInputStreamFactory( |
| new File(en-lemmatizer.train)); |
| } catch (FileNotFoundException e) { |
| e.printStackTrace(); |
| } |
| ObjectStream<String> lineStream = null; |
| LemmaSampleStream lemmaStream = null; |
| try { |
| lineStream = new PlainTextByLineStream( |
| (inputStreamFactory), "UTF-8"); |
| lemmaStream = new LemmaSampleStream(lineStream); |
| } catch (IOException e) { |
| CmdLineUtil.handleCreateObjectStreamError(e); |
| } |
| ]]> |
| </programlisting> |
| The following step proceeds to train the model: |
| <programlisting> |
| LemmatizerModel model; |
| try { |
| LemmatizerFactory lemmatizerFactory = LemmatizerFactory |
| .create(params.getFactory()); |
| model = LemmatizerME.train(params.getLang(), lemmaStream, mlParams, |
| lemmatizerFactory); |
| } catch (IOException e) { |
| throw new TerminateToolException(-1, |
| "IO error while reading training data or indexing data: " |
| + e.getMessage(), |
| e); |
| } finally { |
| try { |
| sampleStream.close(); |
| } catch (IOException e) { |
| } |
| } |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| <section id="tools.lemmatizer.evaluation"> |
| <title>Lemmatizer Evaluation</title> |
| <para> |
| The built in evaluation can measure the accuracy of the statistical |
| lemmatizer. |
| The accuracy can be measured on a test data set. |
| </para> |
| <para> |
| There is a command line tool to evaluate a given model on a test |
| data set. |
| The following command shows how the tool can be run: |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerEvaluator -model en-lemmatizer.bin -data en-lemmatizer.test -encoding utf-8]]> |
| </screen> |
| This will display the resulting accuracy score, e.g.: |
| <screen> |
| <![CDATA[ |
| Loading model ... done |
| Evaluating ... done |
| |
| Accuracy: 0.9659110277825124]]> |
| </screen> |
| </para> |
| </section> |
| </chapter> |