<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<chapter id="tools.morfologik-addon">
<title>Morfologik Addon</title>
<para>
<ulink url="https://github.com/morfologik/morfologik-stemming"><citetitle>Morfologik</citetitle></ulink>
provides tools for constructing finite state automata (FSA) and for building FSA-based morphological dictionaries.
</para>
<para>
The Morfologik Addon implements OpenNLP interfaces and extensions so that Morfologik FSA dictionaries can be used from OpenNLP.
</para>
<section id="tools.morfologik-addon.api">
<title>Morfologik Integration</title>
<para>
To allow for an easy integration with OpenNLP, the following implementations are provided:
<itemizedlist mark='opencircle'>
<listitem>
<para>
The <code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code> and helps create a POSTagger model with an embedded FSA TagDictionary.
</para>
</listitem>
<listitem>
<para>
The <code>MorfologikTagDictionary</code> implements an FSA-based <code>TagDictionary</code>, which allows for much smaller files and lower memory consumption than the default XML-based dictionary.
</para>
</listitem>
<listitem>
<para>
The <code>MorfologikLemmatizer</code> implements a <code>Lemmatizer</code> backed by an FSA-based dictionary.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The first two implementations can be used directly from the command line, as in the example below. Given a Morfologik FSA dictionary (the next section shows how to build one), you can train a POS tagger
model with an embedded FSA dictionary.
</para>
<para>
The example trains a POS tagger on a CoNLL corpus named <code>portuguese_bosque_train.conll</code>, using an FSA dictionary named
<code>pt-morfologik.dict</code>. It outputs a model named <code>pos-pt_fsadic.model</code>.
<screen>
<![CDATA[
$ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \
-encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict]]>
</screen>
</para>
<para>
The next example shows how to use the <code>MorfologikLemmatizer</code>. You will need a lemma dictionary and its info file. This example uses a very small Portuguese dictionary.
Each line has the form <code>lemma,lexeme,postag</code>, where <code>lexeme</code> is the inflected form.
</para>
<para>
File <code>lemmaDictionary.txt</code>:
<screen>
<![CDATA[
casa,casa,NOUN
casar,casa,V
casar,casar,V-INF
Casa,Casa,PROP
casa,casinha,NOUN
casa,casona,NOUN
menino,menina,NOUN
menino,menino,NOUN
menino,meninão,NOUN
menino,menininho,NOUN
carro,carro,NOUN]]>
</screen>
</para>
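<para>
To make the column layout concrete, the following plain-Java sketch indexes rows in the <code>lemma,lexeme,postag</code> format by inflected form and tag, then looks up a lemma.
It only illustrates what the table means. It is not how Morfologik stores or queries the data, and the class name <code>LemmaTableSketch</code> and its <code>index</code> method are hypothetical.
<programlisting language="java">
<![CDATA[
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP or Morfologik.
class LemmaTableSketch {

  // Builds a (lexeme + TAB + postag) -> lemma map from "lemma,lexeme,postag" rows.
  static Map<String, String> index(List<String> rows, String separator) {
    Map<String, String> lemmas = new HashMap<>();
    for (String row : rows) {
      String[] cols = row.split(Pattern.quote(separator));
      if (cols.length != 3) {
        continue; // skip rows that do not have exactly three columns
      }
      lemmas.put(cols[1] + "\t" + cols[2], cols[0]);
    }
    return lemmas;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList(
        "casa,casa,NOUN",
        "casar,casa,V",
        "casar,casar,V-INF",
        "casa,casinha,NOUN");
    Map<String, String> lemmas = index(rows, ",");
    System.out.println(lemmas.get("casa\tNOUN")); // prints casa
    System.out.println(lemmas.get("casa\tV"));    // prints casar
  }
}
]]>
</programlisting>
The same inflected form can map to different lemmas depending on its tag, which is why the lemmatizer takes both tokens and tags as input.
</para>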
<para>
The mandatory metadata file must have the same name as the dictionary, but with the <code>.info</code> extension. Here it is <code>lemmaDictionary.info</code>:
<screen>
<![CDATA[
#
# REQUIRED PROPERTIES
#
# Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding.
fsa.dict.separator=,
# The charset in which the input is encoded. UTF-8 is strongly recommended.
fsa.dict.encoding=UTF-8
# The type of lemma-inflected form encoding compression that precedes automaton
# construction. Allowed values: [suffix, infix, prefix, none].
# Details are in Daciuk's paper and in the code.
# Leave at 'prefix' if not sure.
fsa.dict.encoder=prefix
]]>
</screen>
</para>
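<para>
As a quick sanity check before building, the three required properties can be read with <code>java.util.Properties</code> and tested against the constraints stated in the comments above.
This is a minimal sketch under those assumptions. <code>DictionaryInfoCheck</code>, <code>parse</code> and <code>validate</code> are hypothetical names, and Morfologik applies its own validation when it loads the file.
<programlisting language="java">
<![CDATA[
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of OpenNLP or Morfologik.
class DictionaryInfoCheck {

  // Encoder names listed as allowed in the .info file comments.
  static final List<String> ENCODERS = Arrays.asList("suffix", "infix", "prefix", "none");

  // Parses .info content given as text.
  static Properties parse(String text) throws IOException {
    Properties info = new Properties();
    info.load(new StringReader(text));
    return info;
  }

  // Returns a problem description, or null when the required properties look valid.
  static String validate(Properties info) {
    String separator = info.getProperty("fsa.dict.separator");
    String encoding = info.getProperty("fsa.dict.encoding");
    String encoder = info.getProperty("fsa.dict.encoder");
    if (separator == null || encoding == null || encoder == null) {
      return "missing required property";
    }
    Charset charset;
    try {
      charset = Charset.forName(encoding);
    } catch (IllegalArgumentException e) { // illegal or unsupported charset name
      return "unsupported encoding: " + encoding;
    }
    if (separator.getBytes(charset).length != 1) {
      return "separator must be a single byte in " + encoding;
    }
    if (!ENCODERS.contains(encoder)) {
      return "unknown encoder: " + encoder;
    }
    return null;
  }

  public static void main(String[] args) throws IOException {
    Properties info = parse(
        "fsa.dict.separator=,\nfsa.dict.encoding=UTF-8\nfsa.dict.encoder=prefix\n");
    String problem = validate(info);
    System.out.println(problem == null ? "info file looks valid" : problem);
  }
}
]]>
</programlisting>
</para>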
<para>
The following code creates a binary FSA Morfologik dictionary, loads it into <code>MorfologikLemmatizer</code>, and uses it to
find the lemma of the word "casa" both as a noun and as a verb.
<programlisting language="java">
<![CDATA[
// JDK imports; the Morfologik addon classes used below must be imported as well
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Part 1: compile an FSA lemma dictionary.
// The tabular dictionary is the input. An info file with the same
// name but the .info extension must sit next to it.
Path textLemmaDictionary = Paths.get("lemmaDictionary.txt");
// build the binary dictionary; the returned path points to the compiled file
Path compiledLemmaDictionary = new MorfologikDictionayBuilder()
.build(textLemmaDictionary);
// Part 2: load a MorfologikLemmatizer and use it
MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(compiledLemmaDictionary);
// tokens and tags are parallel arrays: "casa" as a noun, then "casa" as a verb
String[] toks = {"casa", "casa"};
String[] tags = {"NOUN", "V"};
String[] lemmas = lemmatizer.lemmatize(toks, tags);
System.out.println(Arrays.toString(lemmas)); // outputs [casa, casar]
]]>
</programlisting>
</para>
</section>
<section id="tools.morfologik-addon.cmdline">
<title>Morfologik CLI Tools</title>
<para>
The Morfologik addon provides two command line tools. <code>XMLDictionaryToTable</code> makes it easy to convert an OpenNLP XML-based dictionary
to a tabular format. <code>MorfologikDictionaryBuilder</code> takes a tabular dictionary and outputs a binary Morfologik FSA dictionary.
</para>
<screen>
<![CDATA[
$ sh bin/morfologik-addon
OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL
where TOOL is one of:
MorfologikDictionaryBuilder builds a binary POS Dictionary using Morfologik
XMLDictionaryToTable reads an OpenNLP XML tag dictionary and outputs it in a tabular file
All tools print help when invoked with help parameter
Example: opennlp-morfologik-addon POSDictionaryBuilder help
]]>
</screen>
</section>
</chapter>