blob: 6b18f4f7adc7b8980d5626e7700259cd46657e61 [file] [log] [blame]
<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Analysis.Stempel
| Apache Lucene.NET 4.8.0-beta00010 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Analysis.Stempel
| Apache Lucene.NET 4.8.0-beta00010 Documentation ">
<meta name="generator" content="docfx 2.56.0.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="analysis-stempel/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Stempel">
<h1 id="Lucene_Net_Analysis_Stempel" data-uid="Lucene.Net.Analysis.Stempel" class="text-break">Namespace Lucene.Net.Analysis.Stempel
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<h1 id="stempel---algorithmic-stemmer-for-polish-language"><em>Stempel</em> - Algorithmic Stemmer for Polish Language</h1>
<h2 id="introduction">Introduction</h2>
<p>A method for conflation of different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system&#39;s recall and can significantly reduce the index size. This is especially true for highly-inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc).</p>
<p>This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables. The stemmer code is taken virtually unchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
<p>The software distribution includes stemmer tables prepared using an extensive corpus of Polish language (see details below).</p>
<p>This work is available under Apache-style Open Source license - the stemmer code is covered by Egothor License, the tables and other additions are covered by Apache License 2.0. Both licenses allow to use the code in Open Source as well as commercial (closed source) projects.</p>
<h3 id="terminology">Terminology</h3>
<p>A short explanation is in order about the terminology used in this text.</p>
<p>In the following sections I make a distinction between <strong>stem</strong> and <strong>lemma</strong>.</p>
<p>Lemma is a base grammatical form (dictionary form, headword) of a word. Lemma is an existing, grammatically correct word in some human language.</p>
<p>Stem on the other hand is just a unique token, not necessarily making any sense in any human language, but which can serve as a unique label instead of lemma for the same set of inflected forms. Quite often stem is referred to as a &quot;root&quot; of the word - which is incorrect and misleading (stems sometimes have very little to do with the linguistic root of a word, i.e. a pattern found in a word which is common to all inflected forms or within a family of languages).</p>
<p>For an IR system stems are usually sufficient, for a morphological analysis system obviously lemmas are a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for some languages, which use suffix-based inflection rules many stemmers based on suffix-stripping will produce a large percentage of stems equivalent to lemmas. This is however not the case for languages with complex, irregular inflection rules (such as Slavic languages) - here simplistic suffix-stripping stemmers produce very poor results.</p>
<h3 id="background">Background</h3>
<p>Lemmatization is a process of finding the base, non-inflected form of a word. The result of lemmatization is a correct existing word, often in nominative case for nouns and infinitive form for verbs. A given inflected form may correspond to several lemmas (e.g. &quot;found&quot; -&gt; find, found) - the correct choice depends on the context. </p>
<p> Stemming is concerned mostly with finding a unique &quot;root&quot; of a word, which not necessarily results in any existing word or lemma. The quality of stemming is measured by the rate of collisions (overstemming - which causes words with different lemmas to be incorrectly conflated into one &quot;root&quot;), and the rate of superfluous word &quot;roots&quot; (understemming - which assigns several &quot;roots&quot; to words with the same lemma). </p>
<p> Both stemmer and lemmatizer can be implemented in various ways. The two most common approaches are: </p>
<ul>
<li><p>dictionary-based: where the stemmer uses an extensive dictionary
of morphological forms in order to find the corresponding stem or lemma</p>
</li>
<li><p>algorithmic: where the stemmer uses an algorithm, based on
general morphological properties of a given language plus a set of
heuristic rules </p>
</li>
</ul>
<p>There are many existing and well-known implementations of stemmers for
English (Porter, Lovins, Krovetz) and other European languages
(<a href="http://snowball.tartarus.org">Snowball</a>). There are also
good quality commercial lemmatizers for Polish. However, there is only
one
freely available Polish stemmer, implemented by
<a href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
Weiss</a>, based on the &quot;ispell&quot; dictionary and Jan Daciuk&#39;s <a href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
stemmer is dictionary-based. This means that even
though it can achieve
perfect accuracy for previously known word forms found in its
dictionary, it
completely fails in case of all other word forms. This deficiency is
somewhat mitigated by the comprehensive dictionary distributed with
this stemmer (so there is a high probability that most of the words in
the input text will be found in the dictionary), however the problem
still remains (please see the page above for more detailed description). </p>
<p>The implementation described here uses an algorithmic method. This
method
and particular algorithm implementation are described in detail in
[1][2].
The main advantage of algorithmic stemmers is their ability to process
previously
unseen word forms with high accuracy. This particular algorithm uses a
set
of
transformation rules (patch commands), which describe how a word with a
given pattern should be transformed to its stem. These rules are first
learned from a training corpus. They don&#39;t
cover
all possible cases, so there is always some loss of precision/recall
(which
means that even the words from the training corpus are sometimes
incorrectly stemmed). </p>
<h2 id="algorithm-and-implementationspan-stylefont-style-italicspan">Algorithm and implementation<span style="font-style: italic;"></span></h2>
<p>The algorithm and its Java implementation is described in detail in the
publications cited below. Here&#39;s just a short excerpt from [2]: </p>
<center>
<div style="width: 80%;" align="justify">&quot;The aim is separation of the
stemmer execution code from the data
structures [...]. In other words, a static algorithm configurable by
data must be developed. The word transformations that happen in the
stemmer must be then encoded to the data tables.<br>
The tacit input of our method is a sample set (a so-called dictionary)
of words (as keys) and their stems. Each record can be equivalently
stored as a key and the record of key&#39;s transformation to its
respective stem. The transformation record is termed a patch command
(P-command). It must be ensured that P-commands are universal, and that
P-commands can transform any word to its stem. Our solution[6,8] is
based on the Levenstein metric [10], which produces P-command as the
minimum cost path in a directed graph.<br>
One can imagine the P-command as an algorithm for an operator (editor)
that rewrites a string to another string. The operator can use these
instructions (PP-command&#39;s): <span style="font-weight: bold;">removal </span>-
deletes a sequence of characters starting at the current cursor
position and moves the cursor to the next character. The length of this
sequence is the parameter; <span style="font-weight: bold;">insertion </span>-
inserts a character ch, without moving the cursor. The character ch is
a parameter; <span style="font-weight: bold;">substitution </span>
- rewrites a character at the current cursor position to the character
ch and moves the cursor to the next character. The character ch is a
parameter; <span style="font-weight: bold;">no operation</span> (NOOP)
- skip a sequence of characters starting at the current cursor
position. The length of this sequence is the parameter.<br>
The P-commands are applied from the end of a word (right to left). This
assumption can reduce the set of P-command&#39;s, because the last NOOP,
moving the cursor to the end of a string without any changes, need not
be stored.&quot;</div>
</center>
<p>Data structure used to keep the dictionary (words and their P-commands)
is a trie. Several optimization steps are applied in turn to reduce and
optimize the initial trie, by eliminating useless information and
shortening the paths in the trie. </p>
<p>Finally, in order to obtain a stem from the input word, the word is
passed once through a matching path in the trie (applying at each node
the P-commands stored there). The result is a word stem. </p>
<h2 id="corpus">Corpus</h2>
<p><em>(to be completed...)</em></p>
<p>The following Polish corpora have been used:</p>
<ul>
<li><p><a href="http://sourceforge.net/project/showfiles.php?group_id=49316&amp;package_id=65354">Polish
dictionary
from ispell distribution</a></p>
</li>
<li><p><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
słownika frekwencyjnego</a></p>
</li>
<li><p>The
Bible (so called &quot;TysiÄ…clecia&quot;) - unauthorized electronic version</p>
</li>
<li><p><a href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
morfologiczny SAM v. 3.4</a> - this was used to recover lemmas
missing from other texts</p>
</li>
</ul>
<p>This step was the most time-consuming - and it would probably be even more tedious and difficult if not for the help of <a href="http://www.python.org/">Python</a>. The source texts had to be brought to a common encoding (UTF-8) - some of them used quite ancient encodings like Mazovia or DHN - and then scripts were written to collect all lemmas and inflected forms from the source texts. In cases when the source text was not tagged, I used the SAM analyzer to produce lemmas. In cases of ambiguous lemmatization I decided to put references to inflected forms from all base forms. </p>
<p>All grammatical categories were allowed to appear in the corpus, i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting corpus consisted of roughly 87,000+ inflection sets, i.e. each set consisted of one base form (lemma) and many inflected forms. However, because of the nature of the training method I restricted these sets to include only those where there were at least 4 inflected forms. Sets with 3 or less inflected forms were removed, so that the final corpus consisted of ~69,000 unique sets, which in turn contained ~1.5 mln inflected forms. </p>
<h2 id="testing">Testing</h2>
<p>I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup. </p>
<h3 id="testing-procedure">Testing procedure</h3>
<p>The testing procedure was as follows: </p>
<ul>
<li><p>the whole corpus of ~69,000 unique sets was shuffled, so that the
input sets were in random order.</p>
</li>
<li><p>the corpus was split into two parts - one with 30,000 sets (Part
1), the other with ~39,000 sets (Part 2).</p>
</li>
<li><p>Training samples were drawn in sequential order from the Part 1.
Since the sets were already randomized, the training samples were also
randomized, but this procedure ensured that each larger training sample
contained all smaller samples.</p>
</li>
<li><p>Part 2 was used for testing. Note: this means that the testing
run used <em>only</em> words previously unseen during the training
phase. This is the worst scenario, because it means that stemmer must
extrapolate the learned rules to unknown cases. This also means that in
a real-life case (where the input is a mix between known and unknown
words) the F-measure of the stemmer will be even higher than in the
table below.</p>
</li>
</ul>
<h3 id="test-results">Test results</h3>
<p>The following table summarizes test results for varying sizes of training samples. The meaning of the table columns is described below: </p>
<ul>
<li><p><strong>training sets:</strong> the number of training sets. One set
consists of one lemma and at least 4 and up to ~80 inflected forms
(including pre- and suffixed forms).</p>
</li>
<li><p><strong>testing forms:</strong> the number of testing forms. Only inflected
forms were used in testing.</p>
</li>
<li><p><strong>stem OK:</strong> the number of cases when produced output was a
correct (unique) stem. Note: quite often correct stems were also
correct lemmas.</p>
</li>
<li><p><strong>lemma OK:</strong> the number of cases when produced output was a
correct lemma.</p>
</li>
<li><p><strong>missing:</strong> the number of cases when stemmer was unable to
provide any output.</p>
</li>
<li><p><strong>stem bad:</strong> the number of cases when produced output was a
stem, but already in use identifying a different set.</p>
</li>
<li><p><strong>lemma bad:</strong> the number of cases when produced output was an
incorrect lemma. Note: quite often in such case the output was a
correct stem.</p>
</li>
<li><p><strong>table size:</strong> the size in bytes of the stemmer table.</p>
</li>
</ul>
<div align="center">
<table border="1" cellpadding="2" cellspacing="0">
<tbody>
<tr bgcolor="#a0b0c0">
<th>Training sets</th>
<th>Testing forms</th>
<th>Stem OK</th>
<th>Lemma OK</th>
<th>Missing</th>
<th>Stem Bad</th>
<th>Lemma Bad</th>
<th>Table size [B]</th>
</tr>
<tr align="right">
<td>100</td>
<td>1022985</td>
<td>842209</td>
<td>593632</td>
<td>172711</td>
<td>22331</td>
<td>256642</td>
<td>28438</td>
</tr>
<tr align="right">
<td>200</td>
<td>1022985</td>
<td>862789</td>
<td>646488</td>
<td>153288</td>
<td>16306</td>
<td>223209</td>
<td>48660</td>
</tr>
<tr align="right">
<td>500</td>
<td>1022985</td>
<td>885786</td>
<td>685009</td>
<td>130772</td>
<td>14856</td>
<td>207204</td>
<td>108798</td>
</tr>
<tr align="right">
<td>700</td>
<td>1022985</td>
<td>909031</td>
<td>704609</td>
<td>107084</td>
<td>15442</td>
<td>211292</td>
<td>139291</td>
</tr>
<tr align="right">
<td>1000</td>
<td>1022985</td>
<td>926079</td>
<td>725720</td>
<td>90117</td>
<td>14941</td>
<td>207148</td>
<td>183677</td>
</tr>
<tr align="right">
<td>2000</td>
<td>1022985</td>
<td>942886</td>
<td>746641</td>
<td>73429</td>
<td>14903</td>
<td>202915</td>
<td>313516</td>
</tr>
<tr align="right">
<td>5000</td>
<td>1022985</td>
<td>954721</td>
<td>759930</td>
<td>61476</td>
<td>14817</td>
<td>201579</td>
<td>640969</td>
</tr>
<tr align="right">
<td>7000</td>
<td>1022985</td>
<td>956165</td>
<td>764033</td>
<td>60364</td>
<td>14620</td>
<td>198588</td>
<td>839347</td>
</tr>
<tr align="right">
<td>10000</td>
<td>1022985</td>
<td>965427</td>
<td>775507</td>
<td>50797</td>
<td>14662</td>
<td>196681</td>
<td>1144537</td>
</tr>
<tr align="right">
<td>12000</td>
<td>1022985</td>
<td>967664</td>
<td>782143</td>
<td>48722</td>
<td>14284</td>
<td>192120</td>
<td>1313508</td>
</tr>
<tr align="right">
<td>15000</td>
<td>1022985</td>
<td>973188</td>
<td>788867</td>
<td>43247</td>
<td>14349</td>
<td>190871</td>
<td>1567902</td>
</tr>
<tr align="right">
<td>17000</td>
<td>1022985</td>
<td>974203</td>
<td>791804</td>
<td>42319</td>
<td>14333</td>
<td>188862</td>
<td>1733957</td>
</tr>
<tr align="right">
<td>20000</td>
<td>1022985</td>
<td>976234</td>
<td>791554</td>
<td>40058</td>
<td>14601</td>
<td>191373</td>
<td>1977615</td>
</tr>
</tbody>
</table>
</div>
<p>I also measured the time to produce a stem (which involves traversing a trie, retrieving a patch command and applying the patch command to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells, the time to produce a single stem varies between 5-10 microseconds. </p>
<p>This means that the stemmer can process up to <span style="font-weight: bold;">200,000 words per second</span>, an outstanding result when compared to other stemmers (Morfeusz - ~2,000 w/s, FormAN (MS Word analyzer) - ~1,000 w/s). </p>
<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>, which you can use to produce reports like the one below: </p>
<p>--------- Stemmer benchmark report: -----------<br>Stemmer table: /res/tables/stemmer_2000.out<br>Input file: ../test3.txt<br>Number of runs: 3 </p>
<p> RUN NUMBER: 1 2 3<br> Total input words 1378176 1378176 1378176<br> Missed output words 112 112 112<br> Time elapsed [ms] 6989 6940 6640<br> Hit rate percent 99.99% 99.99% 99.99%<br> Miss rate percent 00.01% 00.01% 00.01%<br> Words per second 197192 198584 207557<br> Time per word [us] 5.07 5.04 4.82 </p>
<h2 id="summary">Summary</h2>
<p>The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better than average lemmatizer.</p>
<p>Both the author of the implementation (Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author of this compilation (Andrzej Bialecki <ab at="" getopt="" dot="" org="">) would appreciate any feedback and suggestions for further improvements.<p>
<h2 id="bibliography">Bibliography</h2>
<ol>
<li><p>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
Thesis,
Faculty of Mathematics and Physics, Charles University in Prague, in
press.</p>
</li>
<li><p>Galambos, L.: Semi-automatic Stemmer Evaluation. International
Intelligent Information Processing and Web Mining Conference, 2004,
Zakopane, Poland.</p>
</li>
<li><p>Galambos, L.: Lemmatizer for Document Information Retrieval
Systems in JAVA.<span style="text-decoration: underline;"> </span><a href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
SOFSEM 2001, Piestany, Slovakia. </p>
</li>
</ol>
</ab></div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Stempel.StempelFilter.html">StempelFilter</a></h4>
<section><p>Transforms the token stream as per the stemming algorithm.
<p>
Note: the input to the stemming filter must already be in lower case, so you
will need to use <span class="xref">Lucene.Net.Analysis.Core.LowerCaseFilter</span> or <span class="xref">Lucene.Net.Analysis.Core.LowerCaseTokenizer</span> farther down the
<a class="xref" href="http://localhost:8080/api/core/Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a> chain in order for this to work properly!
</p></p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Stempel.StempelPolishStemFilterFactory.html">StempelPolishStemFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Stempel.StempelFilter.html">StempelFilter</a> using a Polish stemming table.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Stempel.StempelStemmer.html">StempelStemmer</a></h4>
<section><p>Stemmer class is a convenient facade for other stemmer-related classes. The
core stemming algorithm and its implementation is taken verbatim from the
Egothor project ( <a href="http://www.egothor.org">www.egothor.org </a>).
<p>
Even though the stemmer tables supplied in the distribution package are built
for Polish language, there is nothing language-specific here.
</p></p>
</section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<div class="contribution">
<ul class="nav">
<li>
<a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00010/src/Lucene.Net.Analysis.Stempel/overview.md/#L2" class="contribution-link">Improve this Doc</a>
</li>
</ul>
</div>
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 Licensed to the Apache Software Foundation (ASF)
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>