| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE file distributed with this work for additional |
| information regarding copyright ownership. The ASF licenses this file to |
| you under the Apache License, Version 2.0 (the "License"); you may not use |
| this file except in compliance with the License. You may obtain a copy of |
| the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required |
| by applicable law or agreed to in writing, software distributed under the |
| License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| OF ANY KIND, either express or implied. See the License for the specific |
| language governing permissions and limitations under the License. --> |
| |
| |
| <chapter id="tools.morfologik-addon"> |
| <title>Morfologik Addon</title> |
| <para> |
| <ulink url="https://github.com/morfologik/morfologik-stemming"><citetitle>Morfologik</citetitle></ulink> |
| provides tools for finite state automata (FSA) construction and dictionary-based morphological dictionaries. |
| </para> |
| <para> |
| The Morfologik Addon implements OpenNLP interfaces and extensions to allow the use of FSA Morfologik dictionary tools. |
| </para> |
| <section id="tools.morfologik-addon.api"> |
| <title>Morfologik Integration</title> |
| <para> |
| To allow for an easy integration with OpenNLP, the following implementations are provided: |
| <itemizedlist mark='opencircle'> |
| <listitem> |
| <para> |
| The <code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code>, which helps creating a POSTagger model with an embedded FSA TagDictionary. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| The <code>MorfologikTagDictionary</code> implements a FSA based <code>TagDictionary</code>, allowing for much smaller files than the default XML based with improved memory consumption. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| The <code>MorfologikLemmatizer</code> implements a FSA based <code>Lemmatizer</code> dictionaries. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| The first two implementations can be used directly from command line, as in the example bellow. Having a FSA Morfologik dictionary (see next section how to build one), you can train a POS Tagger |
| model with an embedded FSA dictionary. |
| </para> |
| <para> |
| The example trains a POSTagger with a CONLL corpus named <code>portuguese_bosque_train.conll</code> and a FSA dictionary named |
| <code>pt-morfologik.dict</code>. It will output a model named <code>pos-pt_fsadic.model</code>. |
| |
| <screen> |
| <![CDATA[ |
| $ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \ |
| -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict]]> |
| </screen> |
| |
| </para> |
| <para> |
| Another example follows. It shows how to use the <code>MorfologikLemmatizer</code>. You will need a lemma dictionary and info file, in this example, we will use a very small Portuguese dictionary. |
| Its syntax is <code>lemma,lexeme,postag</code>. |
| </para> |
| <para> |
| File <code>lemmaDictionary.txt:</code> |
| <screen> |
| <![CDATA[ |
| casa,casa,NOUN |
| casar,casa,V |
| casar,casar,V-INF |
| Casa,Casa,PROP |
| casa,casinha,NOUN |
| casa,casona,NOUN |
| menino,menina,NOUN |
| menino,menino,NOUN |
| menino,meninĂ£o,NOUN |
| menino,menininho,NOUN |
| carro,carro,NOUN]]> |
| </screen> |
| </para> |
| <para> |
| Mandatory metadata file, which must have the same name but .info extension <code>lemmaDictionary.info:</code> |
| <screen> |
| <![CDATA[ |
| # |
| # REQUIRED PROPERTIES |
| # |
| |
| # Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding. |
| fsa.dict.separator=, |
| |
| # The charset in which the input is encoded. UTF-8 is strongly recommended. |
| fsa.dict.encoding=UTF-8 |
| |
| # The type of lemma-inflected form encoding compression that precedes automaton |
| # construction. Allowed values: [suffix, infix, prefix, none]. |
| # Details are in Daciuk's paper and in the code. |
| # Leave at 'prefix' if not sure. |
| fsa.dict.encoder=prefix |
| ]]> |
| </screen> |
| </para> |
| <para> |
| The following code creates a binary FSA Morfologik dictionary, loads it in MorfologikLemmatizer and uses it to |
| find the lemma the word "casa" noun and verb. |
| |
| <programlisting language="java"> |
| <![CDATA[ |
| // Part 1: compile a FSA lemma dictionary |
| |
| // we need the tabular dictionary. It is mandatory to have info |
| // file with same name, but .info extension |
| Path textLemmaDictionary = Paths.get("dictionaryWithLemma.txt"); |
| |
| // this will build a binary dictionary located in compiledLemmaDictionary |
| Path compiledLemmaDictionary = new MorfologikDictionayBuilder() |
| .build(textLemmaDictionary); |
| |
| // Part 2: load a MorfologikLemmatizer and use it |
| MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(compiledLemmaDictionary); |
| |
| String[] toks = {"casa", "casa"}; |
| String[] tags = {"NOUN", "V"}; |
| |
| String[] lemmas = lemmatizer.lemmatize(toks, tags); |
| System.out.println(Arrays.toString(lemmas)); // outputs [casa, casar] |
| ]]> |
| </programlisting> |
| |
| </para> |
| </section> |
| <section id="tools.morfologik-addon.cmdline"> |
| <title>Morfologik CLI Tools</title> |
| <para> |
| The Morfologik addon provides a command line tool. <code>XMLDictionaryToTable</code> makes easy to convert from an OpenNLP XML based dictionary |
| to a tabular format. <code>MorfologikDictionaryBuilder</code> can take a tabular dictionary and output a binary Morfologik FSA dictionary. |
| </para> |
| <screen> |
| <![CDATA[ |
| $ sh bin/morfologik-addon |
| OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL |
| where TOOL is one of: |
| MorfologikDictionaryBuilder builds a binary POS Dictionary using Morfologik |
| XMLDictionaryToTable reads an OpenNLP XML tag dictionary and outputs it in a tabular file |
| All tools print help when invoked with help parameter |
| Example: opennlp-morfologik-addon POSDictionaryBuilder help |
| ]]> |
| </screen> |
| </section> |
| </chapter> |