| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Search.Similarities |
| | Apache Lucene.NET 4.8.0-beta00009 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Search.Similarities |
| | Apache Lucene.NET 4.8.0-beta00009 Documentation "> |
| <meta name="generator" content="docfx 2.56.0.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="core/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Search.Similarities"> |
| |
| <h1 id="Lucene_Net_Search_Similarities" data-uid="Lucene.Net.Search.Similarities" class="text-break">Namespace Lucene.Net.Search.Similarities |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>This package contains the various ranking models that can be used in Lucene. The |
| abstract class <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> serves |
| as the base for ranking functions. For searching, users can employ the models |
| already implemented or create their own by extending one of the classes in this |
| package.</p> |
| <h2 id="table-of-contents">Table Of Contents</h2> |
| <ol> |
| <li><a href="#summary-of-the-ranking-methods">Summary of the Ranking Methods</a></li> |
| <li><a href="#changing-similarity">Changing Similarity</a></li> |
| </ol> |
| <h2 id="summary-of-the-ranking-methods">Summary of the Ranking Methods</h2> |
| <p><a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> is the original Lucene scoring function. It is based on a highly optimized <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more information, see <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a>.</p> |
| <p><a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a> is an optimized implementation of the successful Okapi BM25 model.</p> |
| <p><a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>:</p> |
| <ul> |
| <li>Amati and Rijsbergen's <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFR</a> framework;</li> |
| <li>Clinchant and Gaussier's <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">Information-based models</a> for IR;</li> |
| <li>The implementation of two <a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.html">language models</a> from Zhai and Lafferty's paper.</li> |
| </ul> |
| <p>Since <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> is not optimized to the same extent as <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> and <a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a>, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see <a href="#changing-similarity">below</a>.</p> |
| <h2 id="changing-similarity">Changing Similarity</h2> |
| <p>Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p> |
| <p>To change <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a>, one must do so for both indexing and searching, and the change must happen before either of these actions takes place. Although in theory nothing stops you from changing mid-stream, the resulting behavior is not well-defined.</p> |
| <p>To make this change, implement your own <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> (you will likely want to subclass an existing method, be it <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> or a descendant of <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>), and then register the new class by setting <a class="xref" href="Lucene.Net.Index.IndexWriterConfig.html">IndexWriterConfig.Similarity</a> before indexing and <a class="xref" href="Lucene.Net.Search.IndexSearcher.html">IndexSearcher.Similarity</a> before searching.</p> |
| <h3 id="extending-linkplain-orgapachelucenesearchsimilaritiessimilaritybase">Extending SimilarityBase</h3> |
| <p>The easiest way to quickly implement a new ranking method is to extend <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>, which provides basic implementations of the low-level parts of the Similarity contract. Subclasses are only required to implement the <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#methods">Score(BasicStats, Single, Single)</a> and <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">ToString()</a> methods.</p> |
| <p>Another option is to extend one of the <a href="#framework">frameworks</a> based on <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>. These Similarities are implemented modularly, e.g. <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a> delegates computation of the three parts of its formula to the classes <a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>, <a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a> and <a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a> to use it.</p> |
| <h3 id="changing-linkplain-orgapachelucenesearchsimilaritiesdefaultsimilarity">Changing DefaultSimilarity</h3> |
| <p>If you are interested in use cases for changing your similarity, see the Lucene users' mailing list thread <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">Overriding Similarity</a>. In summary, here are a few use cases:</p> |
| <ol> |
| <li><p>The <code>SweetSpotSimilarity</code> in <code>Lucene.Net.Misc</code> gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.</p></li> |
| <li><p>Overriding tf: in some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the <code>Tf()</code> method.</p></li> |
| <li><p>Changing length normalization: by overriding <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#methods">ComputeNorm(FieldInvertState)</a>, it is possible to discount how the length of a field contributes to a score. In <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a>, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li> |
| </ol> |
| <p>In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>):</p> |
| <blockquote><p>[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it <em>might</em> make sense to override your Similarity method.</p> |
| </blockquote> |
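<p>The tf and length-normalization variants listed above can be compared numerically. The following is a minimal, language-neutral sketch in Python, not Lucene.NET code: every function name is hypothetical, and the square-root tf factor is an assumption about the default behavior rather than something stated on this page.</p>

```python
import math

# Hypothetical helper names; they mirror the formulas quoted above
# and are not Lucene.NET methods.

def tf_sqrt(freq: float) -> float:
    """Assumed default term-frequency factor: sqrt(freq)."""
    return math.sqrt(freq)

def tf_constant(freq: float) -> float:
    """Use case 2: any matching term contributes the same factor."""
    return 1.0 if freq > 0 else 0.0

def length_norm_sqrt(num_terms: int) -> float:
    """lengthNorm = 1 / (numTerms in field)^0.5."""
    return 1.0 / math.sqrt(num_terms)

def length_norm_linear(num_terms: int) -> float:
    """Use case 3: lengthNorm = 1 / (numTerms in field)."""
    return 1.0 / num_terms

if __name__ == "__main__":
    # Print both norms side by side for a few field lengths.
    for n in (4, 100, 10000):
        print(n, length_norm_sqrt(n), length_norm_linear(n))
```

<p>For a 100-term field the square-root form gives 0.1 and the linear form gives 0.01, so the choice of exponent directly controls how strongly field length is discounted.</p>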
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a></h4> |
| <section><p>This class acts as the base class for the implementations of the <em>first |
| normalization of the informative content</em> in the DFR framework. This |
| component is also called the <em>after effect</em> and is defined by the |
| formula <em>Inf<sub>2</sub> = 1 - Prob<sub>2</sub></em>, where |
| <em>Prob<sub>2</sub></em> measures the <em>information gain</em>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.NoAfterEffect.html">AfterEffect.NoAfterEffect</a></h4> |
| <section><p>Implementation used when there is no aftereffect. </p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectB.html">AfterEffectB</a></h4> |
| <section><p>Model of the information gain based on the ratio of two Bernoulli processes. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectL.html">AfterEffectL</a></h4> |
| <section><p>Model of the information gain based on Laplace's law of succession. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a></h4> |
| <section><p>This class acts as the base class for the specific <em>basic model</em> |
| implementations in the DFR framework. Basic models compute the |
| <em>informative content Inf<sub>1</sub> = -log<sub>2</sub>Prob<sub>1</sub> |
| </em>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelBE.html">BasicModelBE</a></h4> |
| <section><p>Limiting form of the Bose-Einstein model. The formula used in Lucene differs |
| slightly from the one in the original paper: <code>F</code> is increased by <code>tfn+1</code> |
| and <code>N</code> is increased by <code>F</code>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p><p> |
| NOTE: in some corner cases this model may give poor performance with Normalizations that |
| return large values for <code>tfn</code> such as <a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a>. Consider using the |
| geometric approximation (<a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a>) instead, which provides the same relevance |
| but with fewer practical problems.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelD.html">BasicModelD</a></h4> |
| <section><p>Implements the approximation of the binomial model with the divergence |
| for DFR. The formula used in Lucene differs slightly from the one in the |
| original paper: to avoid underflow for small values of <code>N</code> and |
| <code>F</code>, <code>N</code> is increased by <code>1</code> and |
| <code>F</code> is always increased by <code>tfn+1</code>. |
| <p> |
| WARNING: for terms that do not meet the expected random distribution |
| (e.g. stopwords), this model may give poor performance, such as |
| abnormally high scores for low tf values. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a></h4> |
| <section><p>Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs |
| slightly from the one in the original paper: <code>F</code> is increased by <code>1</code> |
| and <code>N</code> is increased by <code>F</code>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIF.html">BasicModelIF</a></h4> |
| <section><p>An approximation of the <em>I(n<sub>e</sub>)</em> model. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIn.html">BasicModelIn</a></h4> |
| <section><p>The basic tf-idf model of randomness. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIne.html">BasicModelIne</a></h4> |
| <section><p>Tf-idf model of randomness, based on a mixture of Poisson and inverse |
| document frequency. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelP.html">BasicModelP</a></h4> |
| <section><p>Implements the Poisson approximation for the binomial model for DFR. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p><p> |
| WARNING: for terms that do not meet the expected random distribution |
| (e.g. stopwords), this model may give poor performance, such as |
| abnormally high scores for low tf values.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicStats.html">BasicStats</a></h4> |
| <section><p>Stores all statistics commonly used by ranking methods. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a></h4> |
| <section><p>BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, |
| Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. |
| In Proceedings of the Third <strong>T</strong>ext <strong>RE</strong>trieval <strong>C</strong>onference (TREC 1994). |
| Gaithersburg, USA, November 1994. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a></h4> |
| <section><p>Expert: Default scoring implementation which encodes (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_EncodeNormValue_System_Single_">EncodeNormValue(Single)</a>) |
| norm values as a single byte before being stored. At search time, |
| the norm byte value is read from the index |
| <a class="xref" href="Lucene.Net.Store.Directory.html">Directory</a> and |
| decoded (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_DecodeNormValue_System_Int64_">DecodeNormValue(Int64)</a>) back to a float <em>norm</em> value. |
| This encoding/decoding, while reducing index size, comes with the price of |
| precision loss - it is not guaranteed that <em>Decode(Encode(x)) = x</em>. For |
| instance, <em>Decode(Encode(0.89)) = 0.75</em>. |
| <p> |
| Compression of norm values to a single byte saves memory at search time, |
| because once a field is referenced at search time, its norms - for all |
| documents - are maintained in memory. |
| <p> |
| The rationale supporting such lossy compression of norm values is that, given |
| how difficult (and inexact) it is for users to express their true information |
| need in a query, only big differences matter. |
| <p> |
| Last, note that search time is too late to modify this <em>norm</em> part of |
| scoring, e.g. by using a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for search.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a></h4> |
| <section><p>Implements the <em>divergence from randomness (DFR)</em> framework |
| introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. |
| Probabilistic models of information retrieval based on measuring the |
| divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), |
| 357-389. |
| <p>The DFR scoring formula is composed of three separate components: the |
| <em>basic model</em>, the <em>aftereffect</em> and an additional |
| <em>normalization</em> component, represented by the classes |
| <a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>, <a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a> and <a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>, |
| respectively. The names of these classes were chosen to match the names of |
| their counterparts in the Terrier IR engine.</p> |
| <p>To construct a <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>, you must specify the implementations for |
| all three components of DFR: |
| <table><thead><tr><th>Component</th><th>Implementations</th></tr></thead><tbody><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>: Basic model of information content</td><td> |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelBE.html">BasicModelBE</a>: Limiting form of Bose-Einstein</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a>: Geometric approximation of Bose-Einstein</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelP.html">BasicModelP</a>: Poisson approximation of the Binomial</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelD.html">BasicModelD</a>: Divergence approximation of the Binomial</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIn.html">BasicModelIn</a>: Inverse document frequency</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIne.html">BasicModelIne</a>: Inverse expected document frequency [mixture of Poisson and IDF]</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIF.html">BasicModelIF</a>: Inverse term frequency [approximation of I(ne)]</li></ul> |
| </td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a>: First normalization of information gain</td><td> |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectL.html">AfterEffectL</a>: Laplace's law of succession</li><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectB.html">AfterEffectB</a>: Ratio of two Bernoulli processes</li><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.NoAfterEffect.html">AfterEffect.NoAfterEffect</a>: no first normalization</li></ul> |
| </td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>: Second (length) normalization</td><td> |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH1.html">NormalizationH1</a>: Uniform distribution of term frequency</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH2.html">NormalizationH2</a>: term frequency density inversely related to length</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a>: term frequency normalization provided by Dirichlet prior</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationZ.html">NormalizationZ</a>: term frequency normalization provided by a Zipfian relation</li><li><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.NoNormalization.html">Normalization.NoNormalization</a>: no second normalization</li></ul> |
| </td></tr></tbody></table></p> |
| <p> |
| <p>Note that <em>qtf</em>, the multiplicity of term-occurrence in the query, |
| is not handled by this implementation. |
| </p> </p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
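<p>The way the DFR components combine can be illustrated with a small numeric sketch. This is Python, not Lucene.NET code, and it rests on assumptions: the function names are hypothetical, <em>Prob<sub>1</sub></em> is passed in directly instead of being computed by a basic model, the after effect uses the Laplace form <em>Prob<sub>2</sub> = tfn / (tfn + 1)</em>, and the term weight is taken to be the product <em>Inf<sub>1</sub> · Inf<sub>2</sub></em> as in the DFR paper.</p>

```python
import math

def informative_content(prob1: float) -> float:
    """Basic-model component: Inf1 = -log2(Prob1)."""
    return -math.log2(prob1)

def after_effect_laplace(tfn: float) -> float:
    """Inf2 = 1 - Prob2 with the assumed Laplace form Prob2 = tfn / (tfn + 1)."""
    return 1.0 / (tfn + 1.0)

def dfr_weight(prob1: float, tfn: float) -> float:
    """Assumed term weight: the product Inf1 * Inf2.

    tfn stands for the term frequency after the second (length)
    normalization has already been applied.
    """
    return informative_content(prob1) * after_effect_laplace(tfn)

if __name__ == "__main__":
    # Inf1 = 2 bits, Inf2 = 0.5, so the weight is their product.
    print(dfr_weight(0.25, 1.0))
```

<p>Keeping the three parts as separate functions mirrors the modular design described above: swapping the after effect or the normalization does not touch the basic model.</p>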
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Distribution.html">Distribution</a></h4> |
| <section><p>The probabilistic distribution used to model term occurrence |
| in information-based models. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.DistributionLL.html">DistributionLL</a></h4> |
| <section><p>Log-logistic distribution. |
| <p>Unlike for DFR, the natural logarithm is used, as |
| it is faster to compute and the original paper does not express any |
| preference to a specific base.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.DistributionSPL.html">DistributionSPL</a></h4> |
| <section><p>The smoothed power-law (SPL) distribution for the information-based framework |
| that is described in the original paper. |
| <p>Unlike for DFR, the natural logarithm is used, as |
| it is faster to compute and the original paper does not express any |
| preference to a specific base.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a></h4> |
| <section><p>Provides a framework for the family of information-based models, as described |
| in Stéphane Clinchant and Eric Gaussier. 2010. Information-based |
| models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR |
| conference on Research and development in information retrieval (SIGIR '10). |
| ACM, New York, NY, USA, 234-241. |
| <p>The retrieval function is of the form <em>RSV(q, d) = ∑ |
| -x<sup>q</sup><sub>w</sub> log Prob(X<sub>w</sub> &gt;= |
| t<sup>d</sup><sub>w</sub> | λ<sub>w</sub>)</em>, where |
| <ul><li><em>x<sup>q</sup><sub>w</sub></em> is the query boost;</li><li><em>X<sub>w</sub></em> is a random variable that counts the occurrences |
| of word <em>w</em>;</li><li><em>t<sup>d</sup><sub>w</sub></em> is the normalized term frequency;</li><li><em>λ<sub>w</sub></em> is a parameter.</li></ul> |
| </p> |
| <p>The framework described in the paper has many similarities to the DFR |
| framework (see <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>). It is possible that the two |
| Similarities will be merged at one point.</p> |
| <p>To construct an <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a>, you must specify the implementations for |
| all three components of the Information-Based model. |
| <table><thead><tr><th>Component</th><th>Implementations</th></tr></thead><tbody><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Distribution">Distribution</a>: Probabilistic distribution used to |
| model term occurrence</td><td> |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.DistributionLL.html">DistributionLL</a>: Log-logistic</li><li><a class="xref" href="Lucene.Net.Search.Similarities.DistributionSPL.html">DistributionSPL</a>: Smoothed power-law</li></ul> |
| </td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Lambda">Lambda</a>: λ<sub>w</sub> parameter of the |
| probability distribution</td><td> |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.LambdaDF.html">LambdaDF</a>: <code>N<sub>w</sub>/N</code> or average |
| number of documents where w occurs</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LambdaTTF.html">LambdaTTF</a>: <code>F<sub>w</sub>/N</code> or |
| average number of occurrences of w in the collection</li></ul> |
| </td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Normalization">Normalization</a>: Term frequency normalization</td><td>Any supported DFR normalization (listed in |
| <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>) |
| </td></tr></tbody></table> |
| </p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
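<p>The retrieval function above can be sketched for the log-logistic case. This is an illustrative Python sketch, not Lucene.NET code: the function names are hypothetical, and it assumes that the log-logistic distribution gives <em>Prob(X<sub>w</sub> &gt;= t | λ) = λ / (t + λ)</em>. The natural logarithm is used, as noted for the distribution classes.</p>

```python
import math

def log_logistic_information(tfn: float, lam: float) -> float:
    """-log Prob(X >= tfn | lam), assuming Prob = lam / (tfn + lam)."""
    return -math.log(lam / (tfn + lam))

def rsv(query_terms) -> float:
    """Sum boost * information over (boost, tfn, lam) triples.

    boost plays the role of the query boost x^q_w, tfn the normalized
    term frequency t^d_w, and lam the lambda_w parameter.
    """
    return sum(boost * log_logistic_information(tfn, lam)
               for boost, tfn, lam in query_terms)

if __name__ == "__main__":
    # One query term with boost 1, tfn 3 and lambda 1.
    print(rsv([(1.0, 3.0, 1.0)]))
```

<p>A smaller λ (a rarer term) makes the same normalized term frequency more surprising and therefore yields a larger contribution.</p>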
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Lambda.html">Lambda</a></h4> |
| <section><p>The <em>lambda (λ<sub>w</sub>)</em> parameter in information-based |
| models. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LambdaDF.html">LambdaDF</a></h4> |
| <section><p>Computes lambda as <code>(docFreq+1) / (numberOfDocuments+1)</code>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LambdaTTF.html">LambdaTTF</a></h4> |
| <section><p>Computes lambda as <code>(totalTermFreq+1) / (numberOfDocuments+1)</code>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
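<p>The two lambda definitions differ only in which count goes into the numerator. A minimal Python sketch of both, with hypothetical function names and the reading that the <code>+1</code> applies to numerator and denominator separately:</p>

```python
def lambda_df(doc_freq: int, number_of_documents: int) -> float:
    """(docFreq + 1) / (numberOfDocuments + 1)."""
    return (doc_freq + 1.0) / (number_of_documents + 1.0)

def lambda_ttf(total_term_freq: int, number_of_documents: int) -> float:
    """(totalTermFreq + 1) / (numberOfDocuments + 1)."""
    return (total_term_freq + 1.0) / (number_of_documents + 1.0)

if __name__ == "__main__":
    # A term found in 9 of 99 documents, with 49 occurrences overall.
    print(lambda_df(9, 99), lambda_ttf(49, 99))
```

<p>Because a term can occur several times in one document, the total-term-frequency variant is never smaller than the document-frequency variant for the same term.</p>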
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMDirichletSimilarity.html">LMDirichletSimilarity</a></h4> |
| <section><p>Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John |
| Lafferty. 2001. A study of smoothing methods for language models applied to |
| Ad Hoc information retrieval. In Proceedings of the 24th annual international |
| ACM SIGIR conference on Research and development in information retrieval |
| (SIGIR '01). ACM, New York, NY, USA, 334-342. |
| <p> |
| The formula as defined in the paper assigns a negative score to documents that |
| contain the term, but with fewer occurrences than predicted by the collection |
| language model. The Lucene implementation returns <code>0</code> for such |
| documents. |
| </p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
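<p>The clamping behavior described for Dirichlet smoothing can be sketched as follows. This Python sketch is an assumption-laden illustration, not the Lucene.NET implementation: the function name is hypothetical, the score form <code>log(1 + freq / (mu * p)) + log(mu / (docLen + mu))</code> is the usual rank-equivalent Dirichlet formula, and the default <code>mu = 2000</code> is assumed.</p>

```python
import math

def dirichlet_score(freq: float, doc_len: float,
                    collection_prob: float, mu: float = 2000.0) -> float:
    """Dirichlet-smoothed language-model score, clamped at zero.

    collection_prob is p(w|C), the probability that the collection
    language model generates the term.
    """
    raw = (math.log(1.0 + freq / (mu * collection_prob))
           + math.log(mu / (doc_len + mu)))
    # Documents with fewer occurrences than the collection model
    # predicts would score below zero; return 0 for them instead.
    return raw if raw > 0.0 else 0.0

if __name__ == "__main__":
    print(dirichlet_score(1, 1000, 0.01))   # common term, long document
    print(dirichlet_score(5, 100, 0.0001))  # rare term, short document
```

<p>In the first call the term occurs once in a long document although the collection model predicts it often, so the raw value is negative and the sketch returns 0.</p>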
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMJelinekMercerSimilarity.html">LMJelinekMercerSimilarity</a></h4> |
| <section><p>Language model based on the Jelinek-Mercer smoothing method. From Chengxiang |
| Zhai and John Lafferty. 2001. A study of smoothing methods for language |
| models applied to Ad Hoc information retrieval. In Proceedings of the 24th |
| annual international ACM SIGIR conference on Research and development in |
| information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342. |
| <p>The model has a single parameter, λ. According to said paper, the |
| optimal value depends on both the collection and the query. The optimal value |
| is around <code>0.1</code> for title queries and <code>0.7</code> for long queries.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
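<p>The effect of λ can be seen in a small numeric sketch. This is Python rather than Lucene.NET code, the function name is hypothetical, and the score form <code>log(1 + ((1 - λ) * freq / docLen) / (λ * p))</code> is assumed to be the rank-equivalent Jelinek-Mercer formula.</p>

```python
import math

def jelinek_mercer_score(freq: float, doc_len: float,
                         collection_prob: float, lam: float) -> float:
    """Jelinek-Mercer smoothed score for one term.

    lam weights the collection model p(w|C); (1 - lam) weights the
    document model freq / doc_len.
    """
    document_part = (1.0 - lam) * freq / doc_len
    collection_part = lam * collection_prob
    return math.log(1.0 + document_part / collection_part)

if __name__ == "__main__":
    # Same term statistics under the two lambda values quoted above.
    for lam in (0.1, 0.7):
        print(lam, jelinek_mercer_score(2, 100, 0.001, lam))
```

<p>A small λ trusts the document model more, so the same term occurrence scores higher with <code>0.1</code> than with <code>0.7</code>.</p>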
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.html">LMSimilarity</a></h4> |
| <section><p>Abstract superclass for language modeling Similarities. The following inner |
| types are introduced: |
| <ul><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.LMStats.html">LMSimilarity.LMStats</a>, which defines a new statistic, the probability that |
| the collection language model generates the current term;</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.ICollectionModel.html">LMSimilarity.ICollectionModel</a>, which is a strategy interface for objects that |
| compute the collection language model <code>p(w|C)</code>;</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.DefaultCollectionModel.html">LMSimilarity.DefaultCollectionModel</a>, an implementation of the former, that |
| computes the term probability as the number of occurrences of the term in the |
| collection, divided by the total number of tokens.</li></ul> |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.DefaultCollectionModel.html">LMSimilarity.DefaultCollectionModel</a></h4> |
| <section><p>Models <code>p(w|C)</code> as the number of occurrences of the term in the |
| collection, divided by the total number of tokens <code>+ 1</code>.</p> |
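| <p>The sentence above can be read in more than one way. The Python sketch below takes the literal reading, occurrences divided by (total tokens + 1); this is an assumption, and the actual implementation may smooth the numerator as well. All names are hypothetical.</p> |

```python
def collection_probability(term_occurrences, total_tokens):
    """p(w|C) under the literal reading 'occurrences / (total tokens + 1)'."""
    if term_occurrences < 0 or total_tokens < 0:
        raise ValueError("counts must be non-negative")
    # The + 1 keeps the denominator non-zero for an empty collection.
    return term_occurrences / (total_tokens + 1)

# A term seen 50 times in a collection of 9,999 tokens.
p = collection_probability(term_occurrences=50, total_tokens=9_999)
```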
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.LMStats.html">LMSimilarity.LMStats</a></h4> |
| <section><p>Stores the collection distribution of the current term. </p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.MultiSimilarity.html">MultiSimilarity</a></h4> |
| <section><p>Implements the CombSUM method for combining evidence from multiple |
| similarity values described in: Joseph A. Shaw, Edward A. Fox. |
| In Text REtrieval Conference (1993), pp. 243-252 |
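| <p>CombSUM itself is a very small idea: the combined score of a document is the sum of the scores assigned by each underlying similarity. A minimal Python sketch, with hypothetical names and made-up scores:</p> |

```python
def comb_sum(scores):
    """CombSUM: the combined score is the sum of the individual scores."""
    return sum(scores)

# Hypothetical scores for one document from three different similarities.
combined = comb_sum([1.5, 0.25, 2.0])
```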
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a></h4> |
| <section><p>This class acts as the base class for the implementations of the term |
| frequency normalization methods in the DFR framework. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.NoNormalization.html">Normalization.NoNormalization</a></h4> |
| <section><p>Implementation used when there is no normalization. </p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH1.html">NormalizationH1</a></h4> |
| <section><p>Normalization model that assumes a uniform distribution of the term frequency. |
| <p>While this model is parameterless in the |
| <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.742"> |
| original article</a>, <a href="http://dl.acm.org/citation.cfm?id=1835490"> |
| information-based models</a> (see <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a>) introduced a |
| multiplying factor. |
| The default value for the <code>c</code> parameter is <code>1</code>.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH2.html">NormalizationH2</a></h4> |
| <section><p>Normalization model in which the term frequency is inversely related to the |
| length. |
| <p>While this model is parameterless in the |
| <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.742"> |
| original article</a>, the <a href="http://theses.gla.ac.uk/1570/">thesis</a> |
| introduces the parameterized variant. |
| The default value for the <code>c</code> parameter is <code>1</code>.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a></h4> |
| <section><p>Dirichlet Priors normalization |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationZ.html">NormalizationZ</a></h4> |
| <section><p>Pareto-Zipf Normalization |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html">PerFieldSimilarityWrapper</a></h4> |
| <section><p>Provides the ability to use a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for different fields. |
| <p> |
| Subclasses should implement <a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html#Lucene_Net_Search_Similarities_PerFieldSimilarityWrapper_Get_System_String_">Get(String)</a> to return an appropriate |
| <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> (for example, using field-specific parameter values) for the field. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a></h4> |
| <section><p>Similarity defines the components of Lucene scoring. |
| <p> |
| Expert: Scoring API. |
| <p> |
| This is a low-level API; you should only extend it if you want to implement |
| an information retrieval <em>model</em>. If you are instead looking for a convenient way |
| to alter Lucene's scoring, consider extending a higher-level implementation |
| such as <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a>, which implements the vector space model with this API, or |
| just tweaking the default implementation: <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a>. |
| <p> |
| Similarity determines how Lucene weights terms, and Lucene interacts with |
| this class at both <a href="#indextime">index-time</a> and |
| <a href="#querytime">query-time</a>. |
| <p> |
| <a name="indextime"></a> |
| At indexing time, the indexer calls <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_ComputeNorm_Lucene_Net_Index_FieldInvertState_">ComputeNorm(FieldInvertState)</a>, allowing |
| the <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> implementation to set a per-document value for the field that will |
| be later accessible via <a class="xref" href="Lucene.Net.Index.AtomicReader.html#Lucene_Net_Index_AtomicReader_GetNormValues_System_String_">GetNormValues(String)</a>. Lucene makes no assumption |
| about what is in this norm, but it is most useful for encoding length normalization |
| information. |
| <p> |
| Implementations should carefully consider how the normalization is encoded: while |
| Lucene's classical <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a> encodes a combination of index-time boost |
| and length normalization information with <a class="xref" href="Lucene.Net.Util.SmallSingle.html">SmallSingle</a> into a single byte, this |
| might not be suitable for all purposes. |
| <p> |
| Many formulas require the use of average document length, which can be computed via a |
| combination of <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_SumTotalTermFreq">SumTotalTermFreq</a> and |
| <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_MaxDoc">MaxDoc</a> or <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_DocCount">DocCount</a>, |
| depending upon whether the average should reflect field sparsity. |
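| <p>The choice of denominator can be sketched as follows. This Python example is illustrative only; the function and its <code>sparse_aware</code> flag are hypothetical and simply mirror the three statistics named above.</p> |

```python
def average_field_length(sum_total_term_freq, max_doc, doc_count, sparse_aware):
    """Average number of tokens per document for one field (illustrative)."""
    # doc_count counts only documents that have the field; max_doc counts all.
    denominator = doc_count if sparse_aware else max_doc
    if denominator <= 0:
        return 0.0
    return sum_total_term_freq / denominator

# 1,000 documents in the index, 250 of which contain the field, 5,000 tokens in total.
over_all_docs = average_field_length(5_000, max_doc=1_000, doc_count=250, sparse_aware=False)
over_field_docs = average_field_length(5_000, max_doc=1_000, doc_count=250, sparse_aware=True)
```

| <p>Dividing by the document count of the field gives a larger average when the field is sparse, which is what "reflecting field sparsity" amounts to here.</p> |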
| <p> |
| Additional scoring factors can be stored in named |
| <a class="xref" href="Lucene.Net.Documents.NumericDocValuesField.html">NumericDocValuesField</a>s and accessed |
| at query-time with <a class="xref" href="Lucene.Net.Index.AtomicReader.html#Lucene_Net_Index_AtomicReader_GetNumericDocValues_System_String_">GetNumericDocValues(String)</a>. |
| <p> |
| Finally, using index-time boosts (either via folding into the normalization byte or |
| via <a class="xref" href="Lucene.Net.Index.DocValues.html">DocValues</a>) is an inefficient way to boost the scores of different fields if the |
| boost will be the same for every document. Instead, the Similarity can simply take a constant |
| boost parameter <em>C</em>, and <a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html">PerFieldSimilarityWrapper</a> can return different |
| instances with different boosts depending upon field name. |
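| <p>The design described above, a constant boost held by the similarity and selected by field name, might look like this in outline. Both classes are hypothetical Python stand-ins written for this example; they are not the Lucene.NET types.</p> |

```python
class ConstantBoostSimilarity:
    """Hypothetical similarity that multiplies a raw score by a constant boost C."""

    def __init__(self, boost):
        self.boost = boost

    def score(self, raw_score):
        return self.boost * raw_score


class PerFieldBoosts:
    """Hypothetical per-field wrapper: picks a similarity by field name."""

    def __init__(self, by_field, default):
        self._by_field = dict(by_field)
        self._default = default

    def get(self, field):
        # Unknown fields fall back to the default similarity.
        return self._by_field.get(field, self._default)


wrapper = PerFieldBoosts({"title": ConstantBoostSimilarity(2.0)}, ConstantBoostSimilarity(1.0))
title_score = wrapper.get("title").score(0.5)   # boosted field
body_score = wrapper.get("body").score(0.5)     # falls back to the default
```

| <p>Because the boost lives in the similarity rather than in the index, changing it does not require reindexing in this sketch.</p> |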
| <p> |
| <a name="querytime"></a> |
| At query-time, Queries interact with the Similarity via these steps: |
| <ol><li>The <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_ComputeWeight_System_Single_Lucene_Net_Search_CollectionStatistics_Lucene_Net_Search_TermStatistics___">ComputeWeight(Single, CollectionStatistics, TermStatistics[])</a> method is called a single time, |
| allowing the implementation to compute any statistics (such as IDF, average document length, etc.) |
| across <em>the entire collection</em>. The <a class="xref" href="Lucene.Net.Search.TermStatistics.html">TermStatistics</a> and <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html">CollectionStatistics</a> passed in |
| already contain all of the raw statistics involved, so a <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> can freely use any combination |
| of statistics without causing any additional I/O. Lucene makes no assumption about what is |
| stored in the returned <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a> object.</li><li>The query normalization process occurs a single time: <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html#Lucene_Net_Search_Similarities_Similarity_SimWeight_GetValueForNormalization">GetValueForNormalization()</a> |
| is called for each query leaf node, <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_QueryNorm_System_Single_">QueryNorm(Single)</a> is called for the top-level |
| query, and finally <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html#Lucene_Net_Search_Similarities_Similarity_SimWeight_Normalize_System_Single_System_Single_">Normalize(Single, Single)</a> passes down the normalization value |
| and any top-level boosts (e.g. from enclosing <a class="xref" href="Lucene.Net.Search.BooleanQuery.html">BooleanQuery</a>s).</li><li>For each segment in the index, the <a class="xref" href="Lucene.Net.Search.Query.html">Query</a> creates a scorer via <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_GetSimScorer_Lucene_Net_Search_Similarities_Similarity_SimWeight_Lucene_Net_Index_AtomicReaderContext_">GetSimScorer(Similarity.SimWeight, AtomicReaderContext)</a>. |
| The GetScore() method is called for each matching document.</li></ol> |
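| <p>The three steps above can be pictured with a toy call sequence. The Python sketch below is only a schematic: <code>ToyWeight</code>, <code>compute_weight</code> and <code>run_query</code> are hypothetical names, and the TF-IDF style arithmetic is an assumption chosen to keep the example short.</p> |

```python
import math

class ToyWeight:
    """Hypothetical stand-in for a SimWeight: holds collection-wide statistics."""

    def __init__(self, idf, boost):
        self.idf = idf
        self.boost = boost
        self.value = idf * boost

    def value_for_normalization(self):
        return self.value ** 2

    def normalize(self, query_norm, top_level_boost):
        self.value = self.idf * self.boost * query_norm * top_level_boost


def compute_weight(boost, num_docs, doc_freq):
    # Step 1: called once per query leaf, using only collection statistics.
    return ToyWeight(1.0 + math.log(num_docs / (doc_freq + 1)), boost)


def run_query(leaves, segments):
    weights = [compute_weight(*leaf) for leaf in leaves]
    # Step 2: normalization happens once for the whole query.
    query_norm = 1.0 / math.sqrt(sum(w.value_for_normalization() for w in weights))
    for w in weights:
        w.normalize(query_norm, top_level_boost=1.0)
    # Step 3: one pass per segment, one score computation per matching document.
    scores = {}
    for segment in segments:
        for doc_id, freqs in segment.items():
            scores[doc_id] = sum(w.value * math.sqrt(f) for w, f in zip(weights, freqs))
    return scores


# Two query leaves as (boost, num_docs, doc_freq); two segments mapping
# document id to the within-document frequency of each leaf's term.
scores = run_query([(1.0, 100, 9), (1.0, 100, 49)], [{0: [4, 0]}, {7: [1, 1]}])
```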
| <p> |
| <a name="explaintime"></a> |
| When <a class="xref" href="Lucene.Net.Search.IndexSearcher.html#Lucene_Net_Search_IndexSearcher_Explain_Lucene_Net_Search_Query_System_Int32_">Explain(Query, Int32)</a> is called, queries consult the Similarity's DocScorer for an |
| explanation of how it computed its score. The query passes in the document id and an explanation of how the frequency |
| was computed. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimScorer.html">Similarity.SimScorer</a></h4> |
| <section><p>API for scoring "sloppy" queries such as <a class="xref" href="Lucene.Net.Search.TermQuery.html">TermQuery</a>, |
| <a class="xref" href="Lucene.Net.Search.Spans.SpanQuery.html">SpanQuery</a>, and <a class="xref" href="Lucene.Net.Search.PhraseQuery.html">PhraseQuery</a>. |
| <p> |
| Frequencies are floating-point values: an approximate |
| within-document frequency adjusted for "sloppiness" by |
| <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimScorer.html#Lucene_Net_Search_Similarities_Similarity_SimScorer_ComputeSlopFactor_System_Int32_">ComputeSlopFactor(Int32)</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a></h4> |
| <section><p>Stores the weight for a query across the indexed collection. This abstract |
| implementation is empty; descendants of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> should |
| subclass <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a> and define the statistics they require in the |
| subclass. Examples include idf, average field length, etc.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a></h4> |
| <section><p>A subclass of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> that provides a simplified API for its |
| descendants. Subclasses are only required to implement the <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_Score_Lucene_Net_Search_Similarities_BasicStats_System_Single_System_Single_">Score(BasicStats, Single, Single)</a> |
| and <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_ToString">ToString()</a> methods. Implementing |
| <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_Explain_Lucene_Net_Search_Explanation_Lucene_Net_Search_Similarities_BasicStats_System_Int32_System_Single_System_Single_">Explain(Explanation, BasicStats, Int32, Single, Single)</a> is optional, |
| inasmuch as <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> already provides a basic explanation of the score |
| and the term frequency. However, implementers of a subclass are encouraged to |
| include as much detail about the scoring method as possible. |
| <p> |
| Note: multi-word queries such as phrase queries are scored differently from |
| Lucene's default ranking algorithm: whereas the default algorithm "fakes" an IDF value for |
| the phrase as a whole (since it does not know it), this class instead scores |
| phrases as a summation of the individual term scores. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a></h4> |
| <section><p>Implementation of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> with the Vector Space Model. |
| <p> |
| Expert: Scoring API. |
| <p>TFIDFSimilarity defines the components of Lucene scoring. |
| Overriding computation of these components is a convenient |
| way to alter Lucene scoring.</p> |
| <p>Suggested reading: |
| <a href="http://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html"> |
| Introduction To Information Retrieval, Chapter 6</a>. |
| |
| <p>The following describes how Lucene scoring evolves from |
| underlying information retrieval models to (efficient) implementation. |
| We first briefly introduce the <em>VSM Score</em>, |
| then derive from it <em>Lucene's Conceptual Scoring Formula</em>, |
| from which, finally, evolves <em>Lucene's Practical Scoring Function</em> |
| (the latter is connected directly with Lucene classes and methods). |
| |
| <p>Lucene combines |
| <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model"> |
| Boolean model (BM) of Information Retrieval</a> |
| with |
| <a href="http://en.wikipedia.org/wiki/Vector_Space_Model"> |
| Vector Space Model (VSM) of Information Retrieval</a> - |
| documents "approved" by BM are scored by VSM. |
| |
| <p>In VSM, documents and queries are represented as |
| weighted vectors in a multi-dimensional space, |
| where each distinct index term is a dimension, |
| and weights are |
| <a href="http://en.wikipedia.org/wiki/Tfidf">Tf-idf</a> values. |
| |
| <p>VSM does not require weights to be <em>Tf-idf</em> values, |
| but <em>Tf-idf</em> values are believed to produce search results of high quality, |
| and so Lucene uses <em>Tf-idf</em>. |
| <em>Tf</em> and <em>Idf</em> are described in more detail below, |
| but for now, for completeness, let's just say that |
| for a given term <em>t</em> and document (or query) <em>x</em>, |
| <em>Tf(t,x)</em> varies with the number of occurrences of term <em>t</em> in <em>x</em> |
| (when one increases so does the other) and |
| <em>idf(t)</em> similarly varies with the inverse of the |
| number of index documents containing term <em>t</em>. |
| |
| <p><em>VSM score</em> of document <em>d</em> for query <em>q</em> is the |
| <a href="http://en.wikipedia.org/wiki/Cosine_similarity"> |
| Cosine Similarity</a> |
| of the weighted query vectors <em>V(q)</em> and <em>V(d)</em>: |
| <p> |
| <table><tbody><tr><td> |
| <table><tbody><tr><td>cosine-similarity(q,d) =<br><table> |
| <tbody><tr><td><small>V(q) · V(d)</small></td><td></td></tr><tr><td>–––––––––</td><td></td></tr><tr><td><small>|V(q)| |V(d)|</small></td><td></td></tr></tbody> |
| </table> |
| </td><td></td></tr></tbody></table> |
| </td><td></td></tr><tr><td>VSM Score</td><td></td></tr></tbody></table> |
| <p> |
| |
| |
| <p>Where <em>V(q)</em> · <em>V(d)</em> is the |
| <a href="http://en.wikipedia.org/wiki/Dot_product">dot product</a> |
| of the weighted vectors, |
| and <em>|V(q)|</em> and <em>|V(d)|</em> are their |
| <a href="http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm">Euclidean norms</a>.</p> |
| <p>Note: the above equation can be viewed as the dot product of |
| the normalized weighted vectors, in the sense that dividing |
| <em>V(q)</em> by its Euclidean norm is normalizing it to a unit vector. |
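| <p>The equivalence noted above can be checked numerically. The following Python sketch computes the cosine similarity directly and as the dot product of the two unit vectors; the vectors hold made-up weights and all function names are chosen for this example.</p> |

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine_similarity(q, d):
    """V(q) . V(d) / (|V(q)| |V(d)|); zero vectors score 0 by convention here."""
    denom = norm(q) * norm(d)
    return dot(q, d) / denom if denom else 0.0

def unit(v):
    n = norm(v)
    return [a / n for a in v]

q = [1.0, 2.0, 0.0]   # hypothetical tf-idf weights for the query
d = [2.0, 4.0, 4.0]   # hypothetical tf-idf weights for the document
direct = cosine_similarity(q, d)
via_unit_vectors = dot(unit(q), unit(d))
```

| <p>Both values agree up to floating-point rounding, which is the sense in which the formula is a dot product of normalized vectors.</p> |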
| |
| <p>Lucene refines <em>VSM score</em> for both search quality and usability: |
| <ul><li>Normalizing <em>V(d)</em> to the unit vector is known to be problematic in that |
| it removes all document length information. |
| For some documents removing this info is probably ok, |
| e.g. a document made by duplicating a certain paragraph <em>10</em> times, |
| especially if that paragraph is made of distinct terms. |
| But for a document which contains no duplicated paragraphs, |
| this might be wrong. |
| To avoid this problem, a different document length normalization |
| factor is used, which normalizes to a vector equal to or larger |
| than the unit vector: <em>doc-len-norm(d)</em>. |
| </li><li>At indexing, users can specify that certain documents are more |
| important than others, by assigning a document boost. |
| For this, the score of each document is also multiplied by its boost value |
| <em>doc-boost(d)</em>. |
| </li><li>Lucene is field based, hence each query term applies to a single |
| field, document length normalization is by the length of that field, |
| and in addition to document boost there are also document field boosts. |
| </li><li>The same field can be added to a document during indexing several times, |
| and so the boost of that field is the multiplication of the boosts of |
| the separate additions (or parts) of that field within the document. |
| </li><li>At search time users can specify boosts to each query, sub-query, and |
| each query term, hence the contribution of a query term to the score of |
| a document is multiplied by the boost of that query term <em>query-boost(q)</em>. |
| </li><li>A document may match a multi term query without containing all |
| the terms of that query (this holds for some query types), |
| and users can further reward documents matching more query terms |
| through a coordination factor, which is usually larger when |
| more terms are matched: <em>coord-factor(q,d)</em>. |
| </li></ul> |
| |
| <p>Under the simplifying assumption of a single field in the index, |
| we get <em>Lucene's Conceptual scoring formula</em>: |
| |
| <p> |
| <table><tbody><tr><td> |
| <table><tbody><tr><td> |
| score(q,d) =<br><font color="#FF9933">coord-factor(q,d)</font> ·<br><font color="#CCCC00">query-boost(q)</font> ·<br> |
| <table><tbody><tr><td><small><font color="#993399">V(q) · V(d)</font></small></td><td></td></tr><tr><td>–––––––––</td><td></td></tr><tr><td><small><font color="#FF33CC">|V(q)|</font></small></td><td></td></tr></tbody></table> |
| |
| · <font color="#3399FF">doc-len-norm(d)</font> |
| · <font color="#3399FF">doc-boost(d)</font> |
| </td><td></td></tr></tbody></table> |
| </td><td></td></tr><tr><td>Lucene Conceptual Scoring Formula</td><td></td></tr></tbody></table> |
| <p> |
| |
| |
| <p>The conceptual formula is a simplification in the sense that (1) terms and documents |
| are fielded and (2) boosts are usually per query term rather than per query. |
| |
| <p>We now describe how Lucene implements this conceptual scoring formula, and |
| derive from it <em>Lucene's Practical Scoring Function</em>. |
| |
| <p>For efficient score computation some scoring components |
| are computed and aggregated in advance: |
| |
| <ul><li><em>Query-boost</em> for the query (actually for each query term) |
| is known when search starts. |
| </li><li>Query Euclidean norm <em>|V(q)|</em> can be computed when search starts, |
| as it is independent of the document being scored. |
| From a search optimization perspective, it is fair to ask |
| why the query should be normalized at all, because all |
| scored documents will be scaled by the same <em>|V(q)|</em>, |
| and hence document ranks (their order by score) will not |
| be affected by this normalization. |
| There are two good reasons to keep this normalization: |
| <ul><li>Recall that |
| <a href="http://en.wikipedia.org/wiki/Cosine_similarity"> |
| Cosine Similarity</a> can be used to find how similar |
| two documents are. One can use Lucene for e.g. |
| clustering, and use a document as a query to compute |
| its similarity to other documents. |
| In this use case it is important that the score of document <em>d3</em> |
| for query <em>d1</em> is comparable to the score of document <em>d3</em> |
| for query <em>d2</em>. In other words, scores of a document for two |
| distinct queries should be comparable. |
| There are other applications that may require this. |
| And this is exactly what normalizing the query vector <em>V(q)</em> |
| provides: comparability (to a certain extent) of two or more queries. |
| </li><li>Applying query normalization on the scores helps to keep the |
| scores around the unit vector, hence preventing loss of score data |
| because of floating point precision limitations. |
| </li></ul> |
| </li><li>Document length norm <em>doc-len-norm(d)</em> and document |
| boost <em>doc-boost(d)</em> are known at indexing time. |
| They are computed in advance and their multiplication |
| is saved as a single value in the index: <em>norm(d)</em>. |
| (In the equations below, <em>norm(t,d)</em> means <em>norm(field(t) in doc d)</em> |
| where <em>field(t)</em> is the field associated with term <em>t</em>.) |
| </li></ul> |
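| <p>The claim that query normalization leaves the ranking untouched is easy to verify on a toy example. In the Python sketch below the scores and the normalization constant are invented for illustration.</p> |

```python
def rank(scores):
    """Document ids ordered by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

raw = {"d1": 4.0, "d2": 9.0, "d3": 1.0}   # hypothetical unnormalized scores
query_norm = 0.25                          # any positive constant
normalized = {doc: s * query_norm for doc, s in raw.items()}
```

| <p>Multiplying every score by the same positive factor changes the values but not their order, so <code>rank(raw)</code> and <code>rank(normalized)</code> return the same list.</p> |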
| |
| <p><em>Lucene's Practical Scoring Function</em> is derived from the above. |
| The color codes demonstrate how it relates |
| to those of the <em>conceptual</em> formula: |
| |
| <p> |
| <table><tbody><tr><td> |
| <table><tbody><tr><td> |
| score(q,d) =<br><a href="#formula_coord"><font color="#FF9933">coord(q,d)</font></a> ·<br><a href="#formula_queryNorm"><font color="#FF33CC">queryNorm(q)</font></a> ·<br><big><big><big>∑</big></big></big> |
| <big><big>(</big></big> |
| <a href="#formula_tf"><font color="#993399">tf(t in d)</font></a> ·<br><a href="#formula_idf"><font color="#993399">idf(t)</font></a><sup>2</sup> ·<br><a href="#formula_termBoost"><font color="#CCCC00">t.Boost</font></a> ·<br><a href="#formula_norm"><font color="#3399FF">norm(t,d)</font></a> |
| <big><big>)</big></big> |
| </td><td></td></tr><tr><td><small>t in q</small></td><td></td></tr></tbody></table> |
| </td><td></td></tr><tr><td>Lucene Practical Scoring Function</td><td></td></tr></tbody></table> |
| |
| <p> where |
| <ol><li> |
| <a name="formula_tf"></a> |
| <strong><em>tf(t in d)</em></strong> |
| correlates to the term's <em>frequency</em>, |
| defined as the number of times term <em>t</em> appears in the currently scored document <em>d</em>. |
| Documents that have more occurrences of a given term receive a higher score. |
| Note that <em>tf(t in q)</em> is assumed to be <em>1</em> and therefore it does not appear in this equation. |
| However, if a query contains the same term twice, there will be |
| two term-queries with that same term and hence the computation would still be correct (although |
| not very efficient). |
| The default computation for <em>tf(t in d)</em> in |
| DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_Tf_System_Single_">Tf(Single)</a>) is: |
| |
| <p> |
| <table><tbody><tr><td> |
| tf(t in d) =<br> |
| frequency<sup><big>½</big></sup> |
| </td><td></td></tr></tbody></table> |
| <p> |
| |
| <p></li><li> |
| <a name="formula_idf"></a> |
| <strong><em>idf(t)</em></strong> stands for Inverse Document Frequency. This value |
| correlates to the inverse of <em>DocFreq</em> |
| (the number of documents in which the term <em>t</em> appears). |
| This means rarer terms give a higher contribution to the total score. |
| <em>idf(t)</em> appears for <em>t</em> in both the query and the document, |
| hence it is squared in the equation. |
| The default computation for <em>idf(t)</em> in |
| DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_Idf_System_Int64_System_Int64_">Idf(Int64, Int64)</a>) is:<p> |
| <p> |
| <table><tbody><tr><td>idf(t) = 1 + log <big>(</big> |
| <table><tbody><tr><td><small>NumDocs</small></td><td></td></tr><tr><td>–––––––––</td><td></td></tr><tr><td><small>DocFreq+1</small></td><td></td></tr></tbody></table> |
| <big>)</big></td><td></td></tr></tbody></table> |
| <p> |
| |
| <p></li><li> |
| <a name="formula_coord"></a> |
| <strong><em>coord(q,d)</em></strong> |
| is a score factor based on how many of the query terms are found in the specified document. |
| Typically, a document that contains more of the query's terms will receive a higher score |
| than another document with fewer query terms. |
| This is a search time factor computed in |
| coord(q,d) (<a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html#Lucene_Net_Search_Similarities_TFIDFSimilarity_Coord_System_Int32_System_Int32_">Coord(Int32, Int32)</a>) |
| by the Similarity in effect at search time. |
| <p> |
| </li><li><strong> |
| <a name="formula_queryNorm"></a> |
| <em>queryNorm(q)</em> |
| </strong> |
| is a normalizing factor used to make scores between queries comparable. |
| This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), |
| but rather just attempts to make scores from different queries (or even different indexes) comparable. |
| This is a search time factor computed by the Similarity in effect at search time.<p> |
| <p>The default computation in |
| DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_QueryNorm_System_Single_">QueryNorm(Single)</a>) |
| produces a <a href="http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm">Euclidean norm</a>:</p> |
| <p> |
| <table><tbody><tr><td> |
| queryNorm(q) =<br> queryNorm(sumOfSquaredWeights) |
| =<br> |
| <table><tbody><tr><td><big>1</big></td><td></td></tr><tr><td><big>––––––––––––––</big></td><td></td></tr><tr><td>sumOfSquaredWeights<sup><big>½</big></sup></td><td></td></tr></tbody></table> |
| </td><td></td></tr></tbody></table> |
| <p> |
| |
| <p>The sum of squared weights (of the query terms) is |
| computed by the query <a class="xref" href="Lucene.Net.Search.Weight.html">Weight</a> object. |
| For example, a <a class="xref" href="Lucene.Net.Search.BooleanQuery.html">BooleanQuery</a> |
| computes this value as:</p> |
| <p><p> |
| <table><tbody><tr><td> |
| sumOfSquaredWeights =<br> q.Boost <sup><big>2</big></sup> |
| · |
| <big><big><big>∑</big></big></big> |
| <big><big>(</big></big> |
| <a href="#formula_idf">idf(t)</a> · |
| <a href="#formula_termBoost">t.Boost</a> |
| <big><big>) <sup>2</sup> </big></big> |
| </td><td></td></tr><tr><td><small>t in q</small></td><td></td></tr></tbody></table> |
| where sumOfSquaredWeights is <a class="xref" href="Lucene.Net.Search.Weight.html#Lucene_Net_Search_Weight_GetValueForNormalization">GetValueForNormalization()</a> and |
| q.Boost is <a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a> |
| <p> |
| </li><li> |
| <a name="formula_termBoost"></a> |
| <strong><em>t.Boost</em></strong> |
| is a search time boost of term <em>t</em> in the query <em>q</em> as |
| specified in the query text |
| (see the "Boosting a Term" section of the classic query parser syntax documentation), |
| or as set by application calls to |
| <a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a>. |
| Notice that there is really no direct API for accessing the boost of one term in a multi-term query; |
| rather, multiple terms are represented in a query as multiple |
| <a class="xref" href="Lucene.Net.Search.TermQuery.html">TermQuery</a> objects, |
| and so the boost of a term in the query is accessible by calling the sub-query |
| <a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a>. |
| <p> |
| </li><li> |
| <a name="formula_norm"></a> |
| <strong><em>norm(t,d)</em></strong> encapsulates a few (indexing time) boost and length factors:<p> |
| <p><ul><li><strong>Field boost</strong> - set |
| <a class="xref" href="Lucene.Net.Documents.Field.html#Lucene_Net_Documents_Field_Boost">Boost</a> |
| before adding the field to a document. |
| </li><li><strong>lengthNorm</strong> - computed |
| when the document is added to the index in accordance with the number of tokens |
| of this field in the document, so that shorter fields contribute more to the score. |
| LengthNorm is computed by the <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> class in effect at indexing. |
| </li></ul> |
| The <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html#Lucene_Net_Search_Similarities_TFIDFSimilarity_ComputeNorm_Lucene_Net_Index_FieldInvertState_">ComputeNorm(FieldInvertState)</a> method is responsible for |
| combining all of these factors into a single <span class="xref">System.Single</span>.</p> |
| <p><p> |
| When a document is added to the index, all the above factors are multiplied. |
| If the document has multiple fields with the same name, all their boosts are multiplied together:</p> |
| <p><p> |
| <table><tbody><tr><td> |
| norm(t,d) =<br> lengthNorm |
| · |
| <big><big><big>∏</big></big></big><a class="xref" href="Lucene.Net.Index.IIndexableField.html#Lucene_Net_Index_IIndexableField_Boost">Boost</a></td><td></td></tr><tr><td><small>field <em><strong>f</strong></em> in <em>d</em> named as <em><strong>t</strong></em></small></td><td></td></tr></tbody></table> |
| Note that search time is too late to modify this <em>norm</em> part of scoring, |
| e.g. by using a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for search. |
| </li></ol></p> |
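| <p>Putting the factors together, the practical scoring function can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a single field, the default <em>tf</em>, <em>idf</em> and <em>queryNorm</em> formulas given above, a <em>coord</em> factor taken to be the fraction of query terms matched (the text does not specify it), and <em>norm(t,d)</em> supplied as a plain number. The function names and data layout are hypothetical.</p> |

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def query_norm(sum_of_squared_weights):
    return 1.0 / math.sqrt(sum_of_squared_weights)

def coord(overlap, max_overlap):
    # Assumption: fraction of query terms matched by the document.
    return overlap / max_overlap

def score(query_terms, doc, num_docs, query_boost=1.0):
    """query_terms: {term: (doc_freq, boost)}; doc: {term: (freq, norm)}."""
    weights = {t: idf(df, num_docs) * b for t, (df, b) in query_terms.items()}
    ssw = query_boost ** 2 * sum(w ** 2 for w in weights.values())
    matched = [t for t in query_terms if t in doc]
    total = 0.0
    for t in matched:
        freq, norm = doc[t]
        df, boost = query_terms[t]
        # tf(t in d) * idf(t)^2 * t.Boost * norm(t,d)
        total += tf(freq) * idf(df, num_docs) ** 2 * boost * norm
    return coord(len(matched), len(query_terms)) * query_norm(ssw) * total

query = {"lucene": (9, 1.0), "search": (99, 1.0)}
one_term_doc = {"lucene": (4, 0.5)}
both_terms_doc = {"lucene": (4, 0.5), "search": (4, 0.5)}
s_one = score(query, one_term_doc, num_docs=1000)
s_both = score(query, both_terms_doc, num_docs=1000)
```

| <p>With identical frequencies and norms, the document matching both terms scores higher, partly through the larger sum and partly through the larger <em>coord</em> factor.</p> |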
| </section> |
| <h3 id="interfaces">Interfaces |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.ICollectionModel.html">LMSimilarity.ICollectionModel</a></h4> |
| <section><p>A strategy for computing the collection language model. </p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <div class="contribution"> |
| <ul class="nav"> |
| <li> |
| <a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00009/src/Lucene.Net/Search/Similarities/package.md/#L2" class="contribution-link">Improve this Doc</a> |
| </li> |
| </ul> |
| </div> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |