<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<title>Stempel - Algorithmic Stemmer for the Polish Language</title>
<meta content="Andrzej Bialecki" name="author">
<meta name="keywords"
content="stemming, stemmer, algorithmic stemmer, Polish stemmer">
<meta
content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables."
name="description">
</head>
<body style="font-family: Arial,SansSerif;">
<h1><i>Stempel</i> - Algorithmic Stemmer for the Polish Language</h1>
<h2>Introduction</h2>
<p>Conflating different inflected word forms is an
important component of many Information Retrieval systems. It helps to
improve the system's recall and can significantly reduce the index
size. This is especially true for highly inflectional languages like
those from the Slavic language family (Czech, Slovak, Polish, Russian,
Bulgarian, etc.).</p>
<p>This page describes a software package consisting of high-quality
stemming tables for Polish, and a universal algorithmic stemmer, which
operates using these tables. The stemmer code is taken virtually
unchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
<p>The software distribution includes stemmer
tables prepared using an extensive corpus of Polish language (see
details below).</p>
<p>This work is available under an Apache-style Open Source license - the
stemmer code is covered by the Egothor License, while the tables and other
additions are covered by the Apache License 2.0. Both licenses allow the
code to be used in Open Source as well as commercial (closed-source) projects.</p>
<h3>Terminology</h3>
<p>A short explanation of the terminology used in this text is in
order.</p>
<p>In the following sections I make a distinction between <b>stem</b>
and <b>lemma</b>.</p>
<p>A lemma is the base grammatical form (dictionary form, headword) of a
word. A lemma is an existing, grammatically correct word in some human
language.</p>
<p>A stem, on the other hand, is just a unique token, not necessarily a
meaningful word in any human language, which can serve as a unique
label in place of the lemma for the same set of inflected forms. Quite
often a stem is referred to as the "root" of a word - which is incorrect and
misleading (stems sometimes have very little to do with the linguistic
root of a word, i.e. the pattern found in a word that is common to all its
inflected forms or shared within a family of languages).</p>
<p>For an IR system stems are usually sufficient; for a morphological
analysis system lemmas are obviously a must. In practice, various
stemmers produce a mix of stems and lemmas, as is the case with the
stemmer described here. Additionally, for languages that use
suffix-based inflection rules, many stemmers based on suffix-stripping
will produce a large percentage of stems identical to lemmas. This is,
however, not the case for languages with complex, irregular inflection
rules (such as the Slavic languages), where simplistic suffix-stripping
stemmers produce very poor results.</p>
<h3>Background</h3>
<p>Lemmatization is the process of finding the base, non-inflected form
of a word. The result of lemmatization is a correct existing word,
often the nominative case for nouns and the infinitive form for verbs. A
given inflected form may correspond to several lemmas (e.g. "found"
-&gt; find, found) - the correct choice depends on the context.<br>
<br>
Stemming is concerned mostly with finding a unique "root" of a word,
which does not necessarily result in any existing word or lemma. The
quality of stemming is measured by the rate of collisions (overstemming,
which causes words with different lemmas to be incorrectly conflated
into one "root") and the rate of superfluous word "roots"
(understemming, which assigns several "roots" to words with the same
lemma). <br>
<br>
Both stemmers and lemmatizers can be implemented in various ways. The two
most common approaches are:<br>
</p>
<ul>
<li>dictionary-based: where the stemmer uses an extensive dictionary
of morphological forms in order to find the corresponding stem or lemma</li>
<li>algorithmic: where the stemmer uses an algorithm, based on
general morphological properties of a given language plus a set of
heuristic rules<br>
</li>
</ul>
There are many existing and well-known stemmer implementations for
English (Porter, Lovins, Krovetz) and other European languages
(<a href="http://snowball.tartarus.org">Snowball</a>). There are also
good-quality commercial lemmatizers for Polish. However, there is only
one freely available Polish stemmer, implemented by
<a
 href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
Weiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a
 href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
stemmer is dictionary-based: even though it can achieve
perfect accuracy for word forms found in its
dictionary, it completely fails for all other word forms. This deficiency is
somewhat mitigated by the comprehensive dictionary distributed with
the stemmer (so there is a high probability that most of the words in
the input text will be found in the dictionary), but the problem
still remains (please see the page linked above for a more detailed description).<br>
<br>
The implementation described here uses an algorithmic method. This
method and the particular algorithm implementation are described in
detail in [1][2].
The main advantage of algorithmic stemmers is their ability to process
previously unseen word forms with high accuracy. This particular
algorithm uses a set of
transformation rules (patch commands), which describe how a word with a
given pattern should be transformed into its stem. These rules are first
learned from a training corpus. They don't cover
all possible cases, so there is always some loss of precision/recall
(which means that even words from the training corpus are sometimes
incorrectly stemmed).<br>
<h2>Algorithm and implementation</h2>
The algorithm and its Java implementation are described in detail in the
publications cited below. Here is a short excerpt from [2]:<br>
<br>
<center>
<div style="width: 80%;" align="justify">"The aim is separation of the
stemmer execution code from the data
structures [...]. In other words, a static algorithm configurable by
data must be developed. The word transformations that happen in the
stemmer must be then encoded to the data tables.<br>
<br>
The tacit input of our method is a sample set (a so-called dictionary)
of words (as keys) and their stems. Each record can be equivalently
stored as a key and the record of key's transformation to its
respective stem. The transformation record is termed a patch command
(P-command). It must be ensured that P-commands are universal, and that
P-commands can transform any word to its stem. Our solution[6,8] is
based on the Levenstein metric [10], which produces P-command as the
minimum cost path in a directed graph.<br>
<br>
One can imagine the P-command as an algorithm for an operator (editor)
that rewrites a string to another string. The operator can use these
instructions (P-commands): <span style="font-weight: bold;">removal </span>-
deletes a sequence of characters starting at the current cursor
position and moves the cursor to the next character. The length of this
sequence is the parameter; <span style="font-weight: bold;">insertion </span>-
inserts a character ch, without moving the cursor. The character ch is
a parameter; <span style="font-weight: bold;">substitution&nbsp;</span>
- rewrites a character at the current cursor position to the character
ch and moves the cursor to the next character. The character ch is a
parameter; <span style="font-weight: bold;">no operation</span> (NOOP)
- skip a sequence of characters starting at the current cursor
position. The length of this sequence is the parameter.<br>
<br>
The P-commands are applied from the end of a word (right to left). This
assumption can reduce the set of P-commands, because the last NOOP,
moving the cursor to the end of a string without any changes, need not
be stored."</div>
</center>
<br>
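<p>To make the excerpt concrete, here is a minimal, self-contained Java
sketch of such an editor. It only illustrates the four instruction types
applied right to left - the opcode/parameter encoding is invented for this
example and is not the actual Egothor table format, and the exact cursor
semantics of insertion is my reading of the description above.</p>
<pre>// A toy editor for P-commands. The encoding below ('D' = removal,
// 'I' = insertion, 'R' = substitution, '-' = NOOP) is invented for this
// example; the real tables use a different, compact string encoding.
public class PCommandDemo {

  /** Applies a P-command to a word, right to left, and returns the result. */
  static String apply(String word, char[][] commands) {
    StringBuilder buf = new StringBuilder(word);
    int cursor = buf.length() - 1;        // start at the last character
    for (char[] cmd : commands) {
      switch (cmd[0]) {
        case '-':                         // NOOP: skip cmd[1] characters
          cursor -= cmd[1];
          break;
        case 'D': {                       // removal: delete cmd[1] characters
          int n = cmd[1];
          buf.delete(cursor - n + 1, cursor + 1);
          cursor -= n;
          break;
        }
        case 'R':                         // substitution: rewrite to cmd[1]
          buf.setCharAt(cursor, cmd[1]);
          cursor--;
          break;
        case 'I':                         // insertion: insert cmd[1] just right
          buf.insert(cursor + 1, cmd[1]); // of the cursor; cursor does not move
          break;
      }
    }
    return buf.toString();
  }

  public static void main(String[] args) {
    // "książkami" (instrumental plural of "książka") -&gt; "książka":
    // delete the 3-character ending "ami", then insert the final 'a'.
    char[][] cmd = { {'D', 3}, {'I', 'a'} };
    System.out.println(apply("książkami", cmd));  // prints "książka"
  }
}</pre>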
The data structure used to keep the dictionary (words and their P-commands)
is a trie. Several optimization steps are applied in turn to reduce and
optimize the initial trie, by eliminating useless information and
shortening the paths in the trie.<br>
<br>
Finally, in order to obtain the stem of an input word, the word is
passed once through a matching path in the trie, applying at each node
the P-commands stored there. The result is the word's stem.<br>
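<p>The following simplified Java sketch illustrates this lookup step. The
names and structure are hypothetical - Stempel's real trie is stored in an
optimized, compactly encoded form - but the idea is the same: walk the trie
with the characters of the word (from its end) and remember the deepest
P-command found:</p>
<pre>import java.util.HashMap;
import java.util.Map;

// A deliberately simplified trie, for illustration only. Every node here
// just holds an optional P-command in the toy encoding used by
// PCommandDemo above.
class TrieNode {
  Map&lt;Character, TrieNode&gt; children = new HashMap&lt;&gt;();
  char[][] command;                 // P-command stored at this node, if any
}

class SimpleTrie {
  final TrieNode root = new TrieNode();

  /** Store a pattern, read right to left, as the stemmer matches word endings. */
  void put(String pattern, char[][] command) {
    TrieNode node = root;
    for (int i = pattern.length() - 1; i &gt;= 0; i--) {
      node = node.children.computeIfAbsent(pattern.charAt(i), c -&gt; new TrieNode());
    }
    node.command = command;
  }

  /** Walk the matching path for a word and return the deepest command found. */
  char[][] lookup(String word) {
    TrieNode node = root;
    char[][] best = null;
    for (int i = word.length() - 1; i &gt;= 0; i--) {
      node = node.children.get(word.charAt(i));
      if (node == null) break;
      if (node.command != null) best = node.command;
    }
    return best;
  }
}</pre>
<p>A word is then stemmed by calling <code>lookup()</code> and feeding the
returned command to an <code>apply()</code> routine like the one in the
previous sketch.</p>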
<h2>Corpus</h2>
<p><i>(to be completed...)</i></p>
<p>The following Polish corpora have been used:</p>
<ul>
<li><a
href="http://sourceforge.net/project/showfiles.php?group_id=49316&amp;package_id=65354">Polish
dictionary
from ispell distribution</a></li>
<li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
słownika frekwencyjnego</a></li>
<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>-->
<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The
Bible (the so-called "Biblia Tysiąclecia") - an unauthorized electronic version</li>
<li><a
 href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
morfologiczny SAM v. 3.4</a> (SAM morphological analyzer v. 3.4) - this
was used to recover lemmas missing from other texts</li>
</ul>
<p>This step was the most time-consuming - and it would probably have
been even more tedious and difficult if not for the help of
<a href="http://www.python.org/">Python</a>. The source texts had to be
brought to a common encoding (UTF-8) - some of them used quite ancient
encodings like Mazovia or DHN - and then scripts were written to
collect all lemmas and inflected forms from the source texts. When the
source text was not tagged, I used the SAM analyzer to produce lemmas. In
cases of ambiguous lemmatization I decided to link the inflected forms to
all of their possible base forms.<br>
</p>
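<p>For illustration, such a re-encoding step could look as follows (the
original scripts were written in Python; this is a Java equivalent for
consistency with the rest of this page). Note that Mazovia and DHN are
not shipped with the JDK, so the standard code page Cp852 stands in here
for the legacy encoding:</p>
<pre>import java.io.*;
import java.nio.charset.Charset;

// Re-encode a source text from a legacy code page to UTF-8.
// "Cp852" is only a stand-in: Mazovia and DHN would need a custom
// Charset implementation or an explicit character mapping table.
public class Recode {
  public static void main(String[] args) throws IOException {
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new FileInputStream(args[0]), Charset.forName("Cp852")));
         Writer out = new OutputStreamWriter(
             new FileOutputStream(args[1]), "UTF-8")) {
      for (String line; (line = in.readLine()) != null; ) {
        out.write(line);
        out.write('\n');
      }
    }
  }
}</pre>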
<p>All grammatical categories were allowed to appear in the corpus,
i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting
corpus consisted of over 87,000 inflection sets, where each set
consisted of one base form (lemma) and many inflected forms. However,
because of the nature of the training method, I restricted these sets to
those with at least 4 inflected forms. Sets
with 3 or fewer inflected forms were removed, so the final corpus
consisted of ~69,000 unique sets, which in turn contained ~1.5 million
inflected forms. <br>
</p>
<h2>Testing</h2>
<p>I tested the stemmer tables produced using the implementation
described above. The following sections give some details about
the testing setup.
</p>
<h3>Testing procedure</h3>
<p>The testing procedure was as follows:
</p>
<ul>
<li>the whole corpus of ~69,000 unique sets was shuffled, so that the
input sets were in random order.</li>
<li>the corpus was split into two parts - one with 30,000 sets (Part
1), the other with ~39,000 sets (Part 2); a minimal sketch of this
shuffle-and-split step appears after this list.</li>
<li>Training samples were drawn in sequential order from Part 1.
Since the sets were already randomized, the training samples were also
randomized, but this procedure ensured that each larger training sample
contained all smaller samples.</li>
<li>Part 2 was used for testing. Note: this means that the testing
run used <em>only</em> words previously unseen during the training
phase. This is the worst-case scenario, because it means that the stemmer
must extrapolate the learned rules to unknown cases. It also means that in
a real-life setting (where the input is a mix of known and unknown
words) the F-measure of the stemmer will be even higher than in the
table below.</li>
</ul>
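<p>A minimal sketch of the shuffle-and-split step, with a stand-in corpus
of set identifiers (the real inflection sets were of course read from the
prepared corpus files):</p>
<pre>import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitCorpus {
  public static void main(String[] args) {
    // Stand-in for the real corpus: one identifier per inflection set.
    List&lt;String&gt; sets = new ArrayList&lt;&gt;();
    for (int i = 0; i &lt; 69_000; i++) sets.add("set-" + i);

    Collections.shuffle(sets, new Random(42)); // fixed seed: repeatable order
    List&lt;String&gt; part1 = sets.subList(0, 30_000);           // training pool
    List&lt;String&gt; part2 = sets.subList(30_000, sets.size()); // test pool

    // Training samples are prefixes of Part 1, so they are random, and
    // each larger sample contains all smaller ones.
    List&lt;String&gt; sample = part1.subList(0, 2_000);
    System.out.println(sample.size() + " training sets, "
        + part2.size() + " testing sets");
  }
}</pre>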
<h3>Test results</h3>
<p>The following table summarizes test results for varying sizes
of training samples. The meaning of the table columns is
described below:
</p>
<ul>
<li><b>training sets:</b> the number of training sets. One set
consists of one lemma and at least 4 and up to ~80 inflected forms
(including pre- and suffixed forms).</li>
<li><b>testing forms:</b> the number of testing forms. Only inflected
forms were used in testing.</li>
<li><b>stem OK:</b> the number of cases when the produced output was a
correct (unique) stem. Note: quite often correct stems were also
correct lemmas.</li>
<li><b>lemma OK:</b> the number of cases when the produced output was a
correct lemma.</li>
<li><b>missing:</b> the number of cases when the stemmer was unable to
provide any output.</li>
<li><b>stem bad:</b> the number of cases when the produced output was a
stem, but one already in use identifying a different set.</li>
<li><b>lemma bad:</b> the number of cases when the produced output was an
incorrect lemma. Note: quite often in such cases the output was a
correct stem.</li>
<li><b>table size:</b> the size in bytes of the stemmer table.</li>
</ul>
<div align="center">
<table border="1" cellpadding="2" cellspacing="0">
<tbody>
<tr bgcolor="#a0b0c0">
<th>Training sets</th>
<th>Testing forms</th>
<th>Stem OK</th>
<th>Lemma OK</th>
<th>Missing</th>
<th>Stem Bad</th>
<th>Lemma Bad</th>
<th>Table size [B]</th>
</tr>
<tr align="right">
<td>100</td>
<td>1022985</td>
<td>842209</td>
<td>593632</td>
<td>172711</td>
<td>22331</td>
<td>256642</td>
<td>28438</td>
</tr>
<tr align="right">
<td>200</td>
<td>1022985</td>
<td>862789</td>
<td>646488</td>
<td>153288</td>
<td>16306</td>
<td>223209</td>
<td>48660</td>
</tr>
<tr align="right">
<td>500</td>
<td>1022985</td>
<td>885786</td>
<td>685009</td>
<td>130772</td>
<td>14856</td>
<td>207204</td>
<td>108798</td>
</tr>
<tr align="right">
<td>700</td>
<td>1022985</td>
<td>909031</td>
<td>704609</td>
<td>107084</td>
<td>15442</td>
<td>211292</td>
<td>139291</td>
</tr>
<tr align="right">
<td>1000</td>
<td>1022985</td>
<td>926079</td>
<td>725720</td>
<td>90117</td>
<td>14941</td>
<td>207148</td>
<td>183677</td>
</tr>
<tr align="right">
<td>2000</td>
<td>1022985</td>
<td>942886</td>
<td>746641</td>
<td>73429</td>
<td>14903</td>
<td>202915</td>
<td>313516</td>
</tr>
<tr align="right">
<td>5000</td>
<td>1022985</td>
<td>954721</td>
<td>759930</td>
<td>61476</td>
<td>14817</td>
<td>201579</td>
<td>640969</td>
</tr>
<tr align="right">
<td>7000</td>
<td>1022985</td>
<td>956165</td>
<td>764033</td>
<td>60364</td>
<td>14620</td>
<td>198588</td>
<td>839347</td>
</tr>
<tr align="right">
<td>10000</td>
<td>1022985</td>
<td>965427</td>
<td>775507</td>
<td>50797</td>
<td>14662</td>
<td>196681</td>
<td>1144537</td>
</tr>
<tr align="right">
<td>12000</td>
<td>1022985</td>
<td>967664</td>
<td>782143</td>
<td>48722</td>
<td>14284</td>
<td>192120</td>
<td>1313508</td>
</tr>
<tr align="right">
<td>15000</td>
<td>1022985</td>
<td>973188</td>
<td>788867</td>
<td>43247</td>
<td>14349</td>
<td>190871</td>
<td>1567902</td>
</tr>
<tr align="right">
<td>17000</td>
<td>1022985</td>
<td>974203</td>
<td>791804</td>
<td>42319</td>
<td>14333</td>
<td>188862</td>
<td>1733957</td>
</tr>
<tr align="right">
<td>20000</td>
<td>1022985</td>
<td>976234</td>
<td>791554</td>
<td>40058</td>
<td>14601</td>
<td>191373</td>
<td>1977615</td>
</tr>
</tbody>
</table>
</div>
<p>I also measured the time needed to produce a stem (which involves
traversing a trie,
retrieving a patch command and applying the patch command to the input
string).
On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
HotSpot),
for tables ranging in size from 1,000 to 20,000 cells, the time needed
to produce a
single stem varies between 5 and 10 microseconds.<br>
</p>
<p>This means that the stemmer can process up to <span
 style="font-weight: bold;">200,000 words per second</span>, an
outstanding result when compared to other stemmers (Morfeusz: ~2,000
w/s, FormAN (the MS Word analyzer): ~1,000 w/s).<br>
</p>
<p>The package contains the class <code>org.getopt.stempel.Benchmark</code>,
which you can use to produce reports
like the one below:<br>
</p>
<pre>--------- Stemmer benchmark report: -----------
Stemmer table:   /res/tables/stemmer_2000.out
Input file:      ../test3.txt
Number of runs:  3

 RUN NUMBER:              1        2        3
 Total input words  1378176  1378176  1378176
 Missed output words    112      112      112
 Time elapsed [ms]     6989     6940     6640
 Hit rate percent     99.99%   99.99%   99.99%
 Miss rate percent    00.01%   00.01%   00.01%
 Words per second    197192   198584   207557
 Time per word [us]     5.07     5.04     4.82
</pre>
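<p>The exact command-line arguments of this class are not documented on
this page; judging from the report header, an invocation along these lines
(hypothetical - check the class itself for the actual usage) would produce
such a report:</p>
<pre>java -cp stempel.jar org.getopt.stempel.Benchmark \
    /res/tables/stemmer_2000.out ../test3.txt 3</pre>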
<h2>Summary</h2>
<p>The results of these tests are very encouraging. With the largest
table (20,000 training sets) the stemmer produced a correct stem for
~95.4% and a correct lemma for ~77.4% of the previously unseen test
forms. It seems that using the training corpus and the stemming algorithm
described above results in a high-quality stemmer useful for most
applications. Moreover, it can also be used as a better-than-average
lemmatizer.</p>
<p>Both the author of the implementation
(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
of this
compilation (Andrzej Bialecki &lt;ab AT getopt DOT org&gt;) would
appreciate any
feedback and suggestions for further improvements.</p>
<h2>Bibliography</h2>
<ol>
<li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
Thesis,
Faculty of Mathematics and Physics, Charles University in Prague, in
press.</li>
<li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
Intelligent Information Processing and Web Mining Conference, 2004,
Zakopane, Poland.</li>
<li>Galambos, L.: Lemmatizer for Document Information Retrieval
Systems in JAVA. <a
 class="moz-txt-link-rfc2396E"
 href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
SOFSEM 2001, Piestany, Slovakia.<br>
</li>
</ol>
</body>
</html>