<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Search.Similarities
| Apache Lucene.NET 4.8.0-beta00013 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Search.Similarities
| Apache Lucene.NET 4.8.0-beta00013 Documentation ">
<meta name="generator" content="docfx 2.56.2.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="core/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<span id="forkongithub"><a href="https://github.com/apache/lucenenet" target="_blank">Fork me on GitHub</a></span>
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Search.Similarities">
<h1 id="Lucene_Net_Search_Similarities" data-uid="Lucene.Net.Search.Similarities" class="text-break">Namespace Lucene.Net.Search.Similarities
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>This package contains the various ranking models that can be used in Lucene. The
abstract class <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> serves
as the base for ranking functions. For searching, users can employ the models
already implemented or create their own by extending one of the classes in this
package.</p>
<h2 id="table-of-contents">Table Of Contents</h2>
<ol>
<li><a href="#summary-of-the-ranking-methods">Summary of the Ranking Methods</a></li>
<li><a href="#changing-similarity">Changing Similarity</a></li>
</ol>
<h2 id="summary-of-the-ranking-methods">Summary of the Ranking Methods</h2>
<p><a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> is the original Lucene scoring function. It is based on a highly optimized <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more information, see <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a>.</p>
<p><a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a> is an optimized implementation of the successful Okapi BM25 model.</p>
<p><a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>:</p>
<ul id="framework">
<li>Amati and Rijsbergen&#39;s <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFR</a> framework;</li>
<li>Clinchant and Gaussier&#39;s <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">Information-based models</a> for IR;</li>
<li>The implementation of two <a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.html">language models</a> from Zhai and Lafferty&#39;s paper.</li>
</ul>
<p>Since <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> is not optimized to the same extent as <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> and <a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a>, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see <a href="#changing-similarity">below</a>.</p>
<h2 id="changing-similarity">Changing Similarity</h2>
<p>The available Similarities are likely to be sufficient for most searching needs. However, in some applications it may be necessary to customize your <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a &quot;fair&quot; similarity</a>).</p>
<p>To change <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a>, one must do so for both indexing and searching, and the change must happen before either of these actions takes place. Although nothing stops you from changing it mid-stream, the resulting behavior is not well-defined. </p>
<p>To make this change, implement your own <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> (you will likely want to subclass an existing method, be it <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a> or a descendant of <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>), and then register the new class by setting the <a class="xref" href="Lucene.Net.Index.IndexWriterConfig.html">IndexWriterConfig.Similarity</a> property before indexing and the <a class="xref" href="Lucene.Net.Search.IndexSearcher.html">IndexSearcher.Similarity</a> property before searching. </p>
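<p>The registration step can be sketched as follows. This is a minimal illustration, not a recommended setup: it assumes the <code>Similarity</code> properties on <code>IndexWriterConfig</code> and <code>IndexSearcher</code> as found in Lucene.NET 4.8, uses <code>StandardAnalyzer</code> from the separate Lucene.Net.Analysis.Common package, and the class name, field name and sample text are made up for the example.</p>
<pre><code class="lang-csharp">using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Similarities;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical example class; only the two Similarity assignments matter here.
public static class SimilarityRegistrationExample
{
    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Use the same Similarity for indexing and for searching.
        Similarity similarity = new BM25Similarity();

        using (var directory = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(version))
        {
            var config = new IndexWriterConfig(version, analyzer);
            config.Similarity = similarity; // set before any document is indexed

            using (var writer = new IndexWriter(directory, config))
            {
                var doc = new Document();
                doc.Add(new TextField("body", "the quick brown fox", Field.Store.YES));
                writer.AddDocument(doc);
            } // disposing the writer commits the pending document

            using (var reader = DirectoryReader.Open(directory))
            {
                var searcher = new IndexSearcher(reader);
                searcher.Similarity = similarity; // set before the first search

                TopDocs hits = searcher.Search(new TermQuery(new Term("body", "fox")), 10);
                Console.WriteLine(hits.TotalHits);
            }
        }
    }
}
</code></pre>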
<h3 id="extending-linkplain-orgapachelucenesearchsimilaritiessimilaritybase">Extending SimilarityBase</h3>
<p> The easiest way to quickly implement a new ranking method is to extend <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>, which provides basic implementations of the low-level Similarity methods. Subclasses are only required to implement the <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#methods">Score(BasicStats, Single, Single)</a> and <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">ToString()</a> methods.</p>
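<p>A subclass along these lines might look like the sketch below. The class name and the scoring formula (a log-dampened tf-idf) are invented for illustration and are not a ranking function shipped with Lucene; the sketch also assumes the public <code>Score(BasicStats, float, float)</code> override and the <code>NumberOfDocuments</code>, <code>DocFreq</code> and <code>TotalBoost</code> members of <code>BasicStats</code> as exposed by Lucene.NET 4.8.</p>
<pre><code class="lang-csharp">using System;
using Lucene.Net.Search.Similarities;

// Hypothetical ranking function: log-dampened tf multiplied by a smoothed idf.
public class LogTfIdfSimilarity : SimilarityBase
{
    public override float Score(BasicStats stats, float freq, float docLen)
    {
        // Smoothed inverse document frequency; always positive.
        double idf = Math.Log(1.0 + (stats.NumberOfDocuments + 1.0) / (stats.DocFreq + 1.0));

        // Dampened term frequency. docLen is deliberately ignored in this sketch,
        // so no length normalization is applied.
        double tf = Math.Log(1.0 + freq);

        // TotalBoost carries the query-time boost.
        return stats.TotalBoost * (float)(tf * idf);
    }

    public override string ToString()
    {
        return "LogTfIdf (illustrative)";
    }
}
</code></pre>
<p>An instance of such a class is registered like any other Similarity, for both indexing and searching.</p>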
<p>Another option is to extend one of the <a href="#framework">frameworks</a> based on <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a>. These Similarities are implemented modularly, e.g. <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a> delegates computation of the three parts of its formula to the classes <a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>, <a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a> and <a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a> to use it.</p>
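<p>As a sketch of this modular design, a <code>DFRSimilarity</code> can be assembled from existing components without subclassing anything. The particular combination below is an arbitrary choice for illustration, and the wrapper class name is hypothetical.</p>
<pre><code class="lang-csharp">using Lucene.Net.Search.Similarities;

// Hypothetical factory class showing how the three DFR parts are plugged together.
public static class DfrExample
{
    public static Similarity Create()
    {
        return new DFRSimilarity(
            new BasicModelG(),      // basic model: geometric approximation of Bose-Einstein
            new AfterEffectL(),     // after effect: Laplace's law of succession
            new NormalizationH2()); // length normalization with the default c parameter
    }
}
</code></pre>
<p>Swapping in a custom basic model would only change the first constructor argument.</p>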
<h3 id="changing-linkplain-orgapachelucenesearchsimilaritiesdefaultsimilarity">Changing DefaultSimilarity</h3>
<p> If you are interested in use cases for changing your similarity, see the Lucene users&#39; mailing list at <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">Overriding Similarity</a>. In summary, here are a few use cases:</p>
<ol>
<li><p>The <code>SweetSpotSimilarity</code> in <code>Lucene.Net.Misc</code> gives small increases as the frequency increases a small amount and then greater increases when you hit the &quot;sweet spot&quot;, i.e. where you think the frequency of terms is more significant.</p></li>
<li><p>Overriding tf: In some applications, it doesn&#39;t matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the <code>Tf()</code> method.</p></li>
<li><p>Changing Length Normalization: By overriding <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#methods">ComputeNorm(FieldInvertState)</a>, it is possible to discount how the length of a field contributes to a score. In <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a>, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">&quot;fairly&quot;</a>.</p></li>
</ol>
<p>In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users&#39; mailing list</a>): </p>
<blockquote><p>[One would override the Similarity in] ... any situation where you know more about your data than just that it&#39;s &quot;text&quot; is a situation where it <em>might</em> make sense to override your Similarity method.</p>
</blockquote>
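<p>The tf and length-normalization use cases above could be sketched as two small <code>DefaultSimilarity</code> subclasses. Both class names are hypothetical. The sketch assumes that, for <code>DefaultSimilarity</code>, the hook for length normalization is the <code>LengthNorm(FieldInvertState)</code> override, and that <code>FieldInvertState</code> exposes <code>Length</code>, <code>NumOverlap</code> and <code>Boost</code> as in Lucene.NET 4.8.</p>
<pre><code class="lang-csharp">using Lucene.Net.Index;
using Lucene.Net.Search.Similarities;

// Hypothetical class: every matching term contributes the same tf factor,
// no matter how often it occurs in the document.
public class BinaryTfSimilarity : DefaultSimilarity
{
    public override float Tf(float freq)
    {
        return freq &gt; 0 ? 1.0f : 0.0f;
    }
}

// Hypothetical class: lengthNorm = boost / numTerms
// instead of the default boost / sqrt(numTerms).
public class LinearLengthNormSimilarity : DefaultSimilarity
{
    public override float LengthNorm(FieldInvertState state)
    {
        int numTerms = DiscountOverlaps ? state.Length - state.NumOverlap : state.Length;

        // Guard against empty fields to avoid dividing by zero.
        return numTerms &gt; 0 ? state.Boost / numTerms : state.Boost;
    }
}
</code></pre>
<p>Because norms are computed and stored at index time, a change to length normalization only takes effect for documents indexed with the new Similarity.</p>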
</div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a></h4>
<section><p>This class acts as the base class for the implementations of the <em>first
normalization of the informative content</em> in the DFR framework. This
component is also called the <em>after effect</em> and is defined by the
formula <em>Inf<sub>2</sub> = 1 - Prob<sub>2</sub></em>, where
<em>Prob<sub>2</sub></em> measures the <em>information gain</em>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.NoAfterEffect.html">AfterEffect.NoAfterEffect</a></h4>
<section><p>Implementation used when there is no aftereffect. </p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectB.html">AfterEffectB</a></h4>
<section><p>Model of the information gain based on the ratio of two Bernoulli processes.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectL.html">AfterEffectL</a></h4>
<section><p>Model of the information gain based on Laplace&apos;s law of succession.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a></h4>
<section><p>This class acts as the base class for the specific <em>basic model</em>
implementations in the DFR framework. Basic models compute the
<em>informative content Inf<sub>1</sub> = -log<sub>2</sub>Prob<sub>1</sub>
</em>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelBE.html">BasicModelBE</a></h4>
<section><p>Limiting form of the Bose-Einstein model. The formula used in Lucene differs
slightly from the one in the original paper: <code>F</code> is increased by <code>tfn+1</code>
and <code>N</code> is increased by <code>F</code>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p><p>
NOTE: in some corner cases this model may give poor performance with Normalizations that
return large values for <code>tfn</code> such as <a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a>. Consider using the
geometric approximation (<a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a>) instead, which provides the same relevance
but with fewer practical problems.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelD.html">BasicModelD</a></h4>
<section><p>Implements the approximation of the binomial model with the divergence
for DFR. The formula used in Lucene differs slightly from the one in the
original paper: to avoid underflow for small values of <code>N</code> and
<code>F</code>, <code>N</code> is increased by <code>1</code> and
<code>F</code> is always increased by <code>tfn+1</code>.
<p>
WARNING: for terms that do not meet the expected random distribution
(e.g. stopwords), this model may give poor performance, such as
abnormally high scores for low tf values.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a></h4>
<section><p>Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs
slightly from the one in the original paper: <code>F</code> is increased by <code>1</code>
and <code>N</code> is increased by <code>F</code>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIF.html">BasicModelIF</a></h4>
<section><p>An approximation of the <em>I(n<sub>e</sub>)</em> model.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIn.html">BasicModelIn</a></h4>
<section><p>The basic tf-idf model of randomness.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIne.html">BasicModelIne</a></h4>
<section><p>Tf-idf model of randomness, based on a mixture of Poisson and inverse
document frequency.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelP.html">BasicModelP</a></h4>
<section><p>Implements the Poisson approximation for the binomial model for DFR.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p><p>
WARNING: for terms that do not meet the expected random distribution
(e.g. stopwords), this model may give poor performance, such as
abnormally high scores for low tf values.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BasicStats.html">BasicStats</a></h4>
<section><p>Stores all statistics commonly used by ranking methods.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.BM25Similarity.html">BM25Similarity</a></h4>
<section><p>BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker,
Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3.
In Proceedings of the Third <strong>T</strong>ext <strong>RE</strong>trieval <strong>C</strong>onference (TREC 1994).
Gaithersburg, USA, November 1994.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a></h4>
<section><p>Expert: Default scoring implementation which encodes (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_EncodeNormValue_System_Single_">EncodeNormValue(Single)</a>)
norm values as a single byte before being stored. At search time,
the norm byte value is read from the index
<a class="xref" href="Lucene.Net.Store.Directory.html">Directory</a> and
decoded (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_DecodeNormValue_System_Int64_">DecodeNormValue(Int64)</a>) back to a float <em>norm</em> value.
This encoding/decoding, while reducing index size, comes at the price of
precision loss: it is not guaranteed that <em>Decode(Encode(x)) = x</em>. For
instance, <em>Decode(Encode(0.89)) = 0.75</em>.
<p>
Compression of norm values to a single byte saves memory at search time,
because once a field is referenced at search time, its norms - for all
documents - are maintained in memory.
<p>
The rationale supporting such lossy compression of norm values is that, because
users struggle to express their true information need precisely in a query,
only big differences matter.
<p>
Last, note that search time is too late to modify this <em>norm</em> part of
scoring, e.g. by using a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for search.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a></h4>
<section><p>Implements the <em>divergence from randomness (DFR)</em> framework
introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002.
Probabilistic models of information retrieval based on measuring the
divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002),
357-389.
<p>The DFR scoring formula is composed of three separate components: the
<em>basic model</em>, the <em>aftereffect</em> and an additional
<em>normalization</em> component, represented by the classes
<a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>, <a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a> and <a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>,
respectively. The names of these classes were chosen to match the names of
their counterparts in the Terrier IR engine.</p>
<p>To construct a <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>, you must specify the implementations for
all three components of DFR:
<table><thead><tr><th>Component</th><th>Implementations</th></tr></thead><tbody><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.BasicModel.html">BasicModel</a>: Basic model of information content</td><td>
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelBE.html">BasicModelBE</a>: Limiting form of Bose-Einstein</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelG.html">BasicModelG</a>: Geometric approximation of Bose-Einstein</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelP.html">BasicModelP</a>: Poisson approximation of the Binomial</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelD.html">BasicModelD</a>: Divergence approximation of the Binomial</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIn.html">BasicModelIn</a>: Inverse document frequency</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIne.html">BasicModelIne</a>: Inverse expected document frequency [mixture of Poisson and IDF]</li><li><a class="xref" href="Lucene.Net.Search.Similarities.BasicModelIF.html">BasicModelIF</a>: Inverse term frequency [approximation of I(ne)]</li></ul>
</td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.html">AfterEffect</a>: First normalization of information gain</td><td>
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectL.html">AfterEffectL</a>: Laplace&apos;s law of succession</li><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffectB.html">AfterEffectB</a>: Ratio of two Bernoulli processes</li><li><a class="xref" href="Lucene.Net.Search.Similarities.AfterEffect.NoAfterEffect.html">AfterEffect.NoAfterEffect</a>: no first normalization</li></ul>
</td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a>: Second (length) normalization</td><td>
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH1.html">NormalizationH1</a>: Uniform distribution of term frequency</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH2.html">NormalizationH2</a>: term frequency density inversely related to length</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a>: term frequency normalization provided by Dirichlet prior</li><li><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationZ.html">NormalizationZ</a>: term frequency normalization provided by a Zipfian relation</li><li><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.NoNormalization.html">Normalization.NoNormalization</a>: no second normalization</li></ul>
</td></tr></tbody></table></p>
<p>
<p>Note that <em>qtf</em>, the multiplicity of term-occurrence in the query,
is not handled by this implementation.
</p> </p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Distribution.html">Distribution</a></h4>
<section><p>The probabilistic distribution used to model term occurrence
in information-based models.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.DistributionLL.html">DistributionLL</a></h4>
<section><p>Log-logistic distribution.
<p>Unlike for DFR, the natural logarithm is used, as
it is faster to compute and the original paper does not express any
preference for a specific base.</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.DistributionSPL.html">DistributionSPL</a></h4>
<section><p>The smoothed power-law (SPL) distribution for the information-based framework
that is described in the original paper.
<p>Unlike for DFR, the natural logarithm is used, as
it is faster to compute and the original paper does not express any
preference for a specific base.</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a></h4>
<section><p>Provides a framework for the family of information-based models, as described
in Stéphane Clinchant and Eric Gaussier. 2010. Information-based
models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR
conference on Research and development in information retrieval (SIGIR &apos;10).
ACM, New York, NY, USA, 234-241.
<p>The retrieval function is of the form <em>RSV(q, d) = ∑
-x<sup>q</sup><sub>w</sub> log Prob(X<sub>w</sub> &gt;=
t<sup>d</sup><sub>w</sub> | λ<sub>w</sub>)</em>, where
<ul><li><em>x<sup>q</sup><sub>w</sub></em> is the query boost;</li><li><em>X<sub>w</sub></em> is a random variable that counts the occurrences
of word <em>w</em>;</li><li><em>t<sup>d</sup><sub>w</sub></em> is the normalized term frequency;</li><li><em>λ<sub>w</sub></em> is a parameter.</li></ul>
</p>
<p>The framework described in the paper has many similarities to the DFR
framework (see <a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>). It is possible that the two
Similarities will be merged at some point.</p>
<p>To construct an <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a>, you must specify the implementations for
all three components of the Information-Based model.
<table><thead><tr><th>Component</th><th>Implementations</th></tr></thead><tbody><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Distribution">Distribution</a>: Probabilistic distribution used to
model term occurrence</td><td>
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.DistributionLL.html">DistributionLL</a>: Log-logistic</li><li><a class="xref" href="Lucene.Net.Search.Similarities.DistributionSPL.html">DistributionSPL</a>: Smoothed power-law</li></ul>
</td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Lambda">Lambda</a>: λ<sub>w</sub> parameter of the
probability distribution</td><td>
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.LambdaDF.html">LambdaDF</a>: <code>N<sub>w</sub>/N</code> or average
number of documents where w occurs</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LambdaTTF.html">LambdaTTF</a>: <code>F<sub>w</sub>/N</code> or
average number of occurrences of w in the collection</li></ul>
</td></tr><tr><td><a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html#Lucene_Net_Search_Similarities_IBSimilarity_Normalization">Normalization</a>: Term frequency normalization</td><td>Any supported DFR normalization (listed in
<a class="xref" href="Lucene.Net.Search.Similarities.DFRSimilarity.html">DFRSimilarity</a>)
</td></tr></tbody></table>
</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
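<p>Constructing an <code>IBSimilarity</code> from the three components listed in the table might look like the sketch below. The chosen combination is arbitrary and the wrapper class name is hypothetical.</p>
<pre><code class="lang-csharp">using Lucene.Net.Search.Similarities;

// Hypothetical factory class combining one option from each row of the table.
public static class IbExample
{
    public static Similarity Create()
    {
        return new IBSimilarity(
            new DistributionSPL(),  // distribution: smoothed power-law
            new LambdaTTF(),        // lambda: based on total term frequency
            new NormalizationH2()); // any supported DFR normalization
    }
}
</code></pre>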
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Lambda.html">Lambda</a></h4>
<section><p>The <em>lambda (λ<sub>w</sub>)</em> parameter in information-based
models.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LambdaDF.html">LambdaDF</a></h4>
<section><p>Computes lambda as <code>(docFreq+1) / (numberOfDocuments+1)</code>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LambdaTTF.html">LambdaTTF</a></h4>
<section><p>Computes lambda as <code>(totalTermFreq+1) / (numberOfDocuments+1)</code>.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMDirichletSimilarity.html">LMDirichletSimilarity</a></h4>
<section><p>Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John
Lafferty. 2001. A study of smoothing methods for language models applied to
Ad Hoc information retrieval. In Proceedings of the 24th annual international
ACM SIGIR conference on Research and development in information retrieval
(SIGIR &apos;01). ACM, New York, NY, USA, 334-342.
<p>
The formula as defined in the paper assigns a negative score to documents that
contain the term, but with fewer occurrences than predicted by the collection
language model. The Lucene implementation returns <code>0</code> for such
documents.
</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMJelinekMercerSimilarity.html">LMJelinekMercerSimilarity</a></h4>
<section><p>Language model based on the Jelinek-Mercer smoothing method. From Chengxiang
Zhai and John Lafferty. 2001. A study of smoothing methods for language
models applied to Ad Hoc information retrieval. In Proceedings of the 24th
annual international ACM SIGIR conference on Research and development in
information retrieval (SIGIR &apos;01). ACM, New York, NY, USA, 334-342.
<p>The model has a single parameter, λ. According to said paper, the
optimal value depends on both the collection and the query. The optimal value
is around <code>0.1</code> for title queries and <code>0.7</code> for long queries.</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
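<p>The two λ values mentioned above can be passed straight to the constructor. The sketch below assumes the <code>LMJelinekMercerSimilarity(float)</code> constructor; the wrapper class and method names are hypothetical, and the values are the paper's rough guidance rather than tuned settings.</p>
<pre><code class="lang-csharp">using Lucene.Net.Search.Similarities;

// Hypothetical helper that picks lambda by query type.
public static class JelinekMercerExample
{
    // Short, title-like queries: lambda around 0.1.
    public static Similarity ForTitleQueries()
    {
        return new LMJelinekMercerSimilarity(0.1f);
    }

    // Long, verbose queries: lambda around 0.7.
    public static Similarity ForLongQueries()
    {
        return new LMJelinekMercerSimilarity(0.7f);
    }
}
</code></pre>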
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.html">LMSimilarity</a></h4>
<section><p>Abstract superclass for language modeling Similarities. The following inner
types are introduced:
<ul><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.LMStats.html">LMSimilarity.LMStats</a>, which defines a new statistic, the probability that
the collection language model generates the current term;</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.ICollectionModel.html">LMSimilarity.ICollectionModel</a>, which is a strategy interface for objects that
compute the collection language model <code>p(w|C)</code>;</li><li><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.DefaultCollectionModel.html">LMSimilarity.DefaultCollectionModel</a>, an implementation of the former, that
computes the term probability as the number of occurrences of the term in the
collection, divided by the total number of tokens.</li></ul>
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.DefaultCollectionModel.html">LMSimilarity.DefaultCollectionModel</a></h4>
<section><p>Models <code>p(w|C)</code> as the number of occurrences of the term in the
collection, divided by the total number of tokens <code>+ 1</code>.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.LMStats.html">LMSimilarity.LMStats</a></h4>
<section><p>Stores the collection distribution of the current term. </p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.MultiSimilarity.html">MultiSimilarity</a></h4>
<section><p>Implements the CombSUM method for combining evidence from multiple
similarity values described in: Joseph A. Shaw, Edward A. Fox.
In Text REtrieval Conference (1993), pp. 243-252
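<p>CombSUM simply adds up the scores that several scoring functions assign to the same document. The following is a hedged sketch of that idea with plain callables standing in for similarities; <code>comb_sum</code> and its parameters are hypothetical names, not part of the API.</p>

```python
from typing import Callable, Sequence


def comb_sum(scorers: Sequence[Callable[[int], float]], doc_id: int) -> float:
    """Combine evidence by summing each scorer's value for one document."""
    return sum(scorer(doc_id) for scorer in scorers)
```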
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.html">Normalization</a></h4>
<section><p>This class acts as the base class for the implementations of the term
frequency normalization methods in the DFR framework.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Normalization.NoNormalization.html">Normalization.NoNormalization</a></h4>
<section><p>Implementation used when there is no normalization. </p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH1.html">NormalizationH1</a></h4>
<section><p>Normalization model that assumes a uniform distribution of the term frequency.
<p>While this model is parameterless in the
<a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.742">
original article</a>, <a href="http://dl.acm.org/citation.cfm?id=1835490">
information-based models</a> (see <a class="xref" href="Lucene.Net.Search.Similarities.IBSimilarity.html">IBSimilarity</a>) introduced a
multiplying factor.
The default value for the <code>c</code> parameter is <code>1</code>.</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH2.html">NormalizationH2</a></h4>
<section><p>Normalization model in which the term frequency is inversely related to the
length.
<p>While this model is parameterless in the
<a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.742">
original article</a>, the <a href="http://theses.gla.ac.uk/1570/">thesis</a>
introduces the parameterized variant.
The default value for the <code>c</code> parameter is <code>1</code>.</p></p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationH3.html">NormalizationH3</a></h4>
<section><p>Dirichlet Priors normalization
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.NormalizationZ.html">NormalizationZ</a></h4>
<section><p>Pareto-Zipf Normalization
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html">PerFieldSimilarityWrapper</a></h4>
<section><p>Provides the ability to use a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for different fields.
<p>
Subclasses should implement <a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html#Lucene_Net_Search_Similarities_PerFieldSimilarityWrapper_Get_System_String_">Get(String)</a> to return an appropriate
<a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> (for example, using field-specific parameter values) for the field.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a></h4>
<section><p>Similarity defines the components of Lucene scoring.
<p>
Expert: Scoring API.
<p>
This is a low-level API; you should only extend this API if you want to implement
an information retrieval <em>model</em>. If you are instead looking for a convenient way
to alter Lucene&apos;s scoring, consider extending a higher-level implementation
such as <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a>, which implements the vector space model with this API, or
just tweaking the default implementation: <a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html">DefaultSimilarity</a>.
<p>
Similarity determines how Lucene weights terms, and Lucene interacts with
this class at both <a href="#indextime">index-time</a> and
<a href="#querytime">query-time</a>.
<p>
<a name="indextime"></a>
At indexing time, the indexer calls <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_ComputeNorm_Lucene_Net_Index_FieldInvertState_">ComputeNorm(FieldInvertState)</a>, allowing
the <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> implementation to set a per-document value for the field that will
be later accessible via <a class="xref" href="Lucene.Net.Index.AtomicReader.html#Lucene_Net_Index_AtomicReader_GetNormValues_System_String_">GetNormValues(String)</a>. Lucene makes no assumption
about what is in this norm, but it is most useful for encoding length normalization
information.
<p>
Implementations should carefully consider how the normalization is encoded: while
Lucene&apos;s classical <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a> encodes a combination of index-time boost
and length normalization information with <a class="xref" href="Lucene.Net.Util.SmallSingle.html">SmallSingle</a> into a single byte, this
might not be suitable for all purposes.
<p>
Many formulas require the use of average document length, which can be computed via a
combination of <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_SumTotalTermFreq">SumTotalTermFreq</a> and
<a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_MaxDoc">MaxDoc</a> or <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html#Lucene_Net_Search_CollectionStatistics_DocCount">DocCount</a>,
depending upon whether the average should reflect field sparsity.
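<p>The average document length described above can be sketched as follows. This is an illustration only: the function name and the <code>reflect_sparsity</code> flag are hypothetical, and the three counts are assumed to have been read from the collection statistics beforehand.</p>

```python
def average_field_length(sum_total_term_freq: int, max_doc: int,
                         doc_count: int, reflect_sparsity: bool) -> float:
    """Average number of tokens per document for one field.

    Dividing by doc_count averages only over documents that contain the
    field; dividing by max_doc averages over every document in the index.
    """
    divisor = doc_count if reflect_sparsity else max_doc
    if divisor == 0:
        raise ValueError("no documents to average over")
    return sum_total_term_freq / divisor
```

<p>With 100 tokens, 10 documents in total, and 5 documents that have the field, the result is 10.0 when averaging over all documents and 20.0 when the average reflects field sparsity.</p>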
<p>
Additional scoring factors can be stored in named
<a class="xref" href="Lucene.Net.Documents.NumericDocValuesField.html">NumericDocValuesField</a>s and accessed
at query-time with <a class="xref" href="Lucene.Net.Index.AtomicReader.html#Lucene_Net_Index_AtomicReader_GetNumericDocValues_System_String_">GetNumericDocValues(String)</a>.
<p>
Finally, using index-time boosts (either by folding them into the normalization byte or
via <a class="xref" href="Lucene.Net.Index.DocValues.html">DocValues</a>) is an inefficient way to boost the scores of different fields if the
boost will be the same for every document. Instead, the Similarity can simply take a constant
boost parameter <em>C</em>, and <a class="xref" href="Lucene.Net.Search.Similarities.PerFieldSimilarityWrapper.html">PerFieldSimilarityWrapper</a> can return different
instances with different boosts depending upon field name.
<p>
<a name="querytime"></a>
At query-time, Queries interact with the Similarity via these steps:
<ol><li>The <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_ComputeWeight_System_Single_Lucene_Net_Search_CollectionStatistics_Lucene_Net_Search_TermStatistics___">ComputeWeight(Single, CollectionStatistics, TermStatistics[])</a> method is called a single time,
allowing the implementation to compute any statistics (such as IDF, average document length, etc)
across <em>the entire collection</em>. The <a class="xref" href="Lucene.Net.Search.TermStatistics.html">TermStatistics</a> and <a class="xref" href="Lucene.Net.Search.CollectionStatistics.html">CollectionStatistics</a> passed in
already contain all of the raw statistics involved, so a <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> can freely use any combination
of statistics without causing any additional I/O. Lucene makes no assumption about what is
stored in the returned <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a> object.</li><li>The query normalization process occurs a single time: <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html#Lucene_Net_Search_Similarities_Similarity_SimWeight_GetValueForNormalization">GetValueForNormalization()</a>
is called for each query leaf node, <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_QueryNorm_System_Single_">QueryNorm(Single)</a> is called for the top-level
query, and finally <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html#Lucene_Net_Search_Similarities_Similarity_SimWeight_Normalize_System_Single_System_Single_">Normalize(Single, Single)</a> passes down the normalization value
and any top-level boosts (e.g. from enclosing <a class="xref" href="Lucene.Net.Search.BooleanQuery.html">BooleanQuery</a>s).</li><li>For each segment in the index, the <a class="xref" href="Lucene.Net.Search.Query.html">Query</a> creates a scorer via <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html#Lucene_Net_Search_Similarities_Similarity_GetSimScorer_Lucene_Net_Search_Similarities_Similarity_SimWeight_Lucene_Net_Index_AtomicReaderContext_">GetSimScorer(Similarity.SimWeight, AtomicReaderContext)</a>.
The GetScore() method is then called for each matching document.</li></ol>
<p>
<a name="explaintime"></a>
When <a class="xref" href="Lucene.Net.Search.IndexSearcher.html#Lucene_Net_Search_IndexSearcher_Explain_Lucene_Net_Search_Query_System_Int32_">Explain(Query, Int32)</a> is called, queries consult the Similarity&apos;s DocScorer for an
explanation of how it computed its score. The query passes in the document id and an explanation of how the frequency
was computed.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimScorer.html">Similarity.SimScorer</a></h4>
<section><p>API for scoring &quot;sloppy&quot; queries such as <a class="xref" href="Lucene.Net.Search.TermQuery.html">TermQuery</a>,
<a class="xref" href="Lucene.Net.Search.Spans.SpanQuery.html">SpanQuery</a>, and <a class="xref" href="Lucene.Net.Search.PhraseQuery.html">PhraseQuery</a>.
<p>
Frequencies are floating-point values: an approximate
within-document frequency adjusted for &quot;sloppiness&quot; by
<a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimScorer.html#Lucene_Net_Search_Similarities_Similarity_SimScorer_ComputeSlopFactor_System_Int32_">ComputeSlopFactor(Int32)</a>.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a></h4>
<section><p>Stores the weight for a query across the indexed collection. This abstract
implementation is empty; descendants of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> should
subclass <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.SimWeight.html">Similarity.SimWeight</a> and define the statistics they require in the
subclass. Examples include idf, average field length, etc.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a></h4>
<section><p>A subclass of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> that provides a simplified API for its
descendants. Subclasses are only required to implement the <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_Score_Lucene_Net_Search_Similarities_BasicStats_System_Single_System_Single_">Score(BasicStats, Single, Single)</a>
and <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_ToString">ToString()</a> methods. Implementing
<a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html#Lucene_Net_Search_Similarities_SimilarityBase_Explain_Lucene_Net_Search_Explanation_Lucene_Net_Search_Similarities_BasicStats_System_Int32_System_Single_System_Single_">Explain(Explanation, BasicStats, Int32, Single, Single)</a> is optional,
inasmuch as <a class="xref" href="Lucene.Net.Search.Similarities.SimilarityBase.html">SimilarityBase</a> already provides a basic explanation of the score
and the term frequency. However, implementers of a subclass are encouraged to
include as much detail about the scoring method as possible.
<p>
Note: multi-word queries such as phrase queries are scored differently
than in Lucene&apos;s default ranking algorithm: whereas the default algorithm &quot;fakes&quot; an IDF value for
the phrase as a whole (since it does not know it), this class instead scores
phrases as a summation of the individual term scores.
<p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html">TFIDFSimilarity</a></h4>
<section><p>Implementation of <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> with the Vector Space Model.
<p>
Expert: Scoring API.
<p>TFIDFSimilarity defines the components of Lucene scoring.
Overriding computation of these components is a convenient
way to alter Lucene scoring.</p>
<p>Suggested reading:
<a href="http://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html">
Introduction To Information Retrieval, Chapter 6</a>.
<p>The following describes how Lucene scoring evolves from
underlying information retrieval models to (efficient) implementation.
We first briefly describe the <em>VSM Score</em>,
then derive from it <em>Lucene&apos;s Conceptual Scoring Formula</em>,
from which, finally, evolves <em>Lucene&apos;s Practical Scoring Function</em>
(the latter is connected directly with Lucene classes and methods).
<p>Lucene combines
<a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">
Boolean model (BM) of Information Retrieval</a>
with
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
Vector Space Model (VSM) of Information Retrieval</a> -
documents &quot;approved&quot; by BM are scored by VSM.
<p>In VSM, documents and queries are represented as
weighted vectors in a multi-dimensional space,
where each distinct index term is a dimension,
and weights are
<a href="http://en.wikipedia.org/wiki/Tfidf">Tf-idf</a> values.
<p>VSM does not require weights to be <em>Tf-idf</em> values,
but <em>Tf-idf</em> values are believed to produce search results of high quality,
and so Lucene uses <em>Tf-idf</em>.
<em>Tf</em> and <em>Idf</em> are described in more detail below,
but for now, for completeness, let&apos;s just say that
for a given term <em>t</em> and document (or query) <em>x</em>,
<em>Tf(t,x)</em> varies with the number of occurrences of term <em>t</em> in <em>x</em>
(when one increases so does the other) and
<em>idf(t)</em> similarly varies with the inverse of the
number of index documents containing term <em>t</em>.
<p><em>VSM score</em> of document <em>d</em> for query <em>q</em> is the
<a href="http://en.wikipedia.org/wiki/Cosine_similarity">
Cosine Similarity</a>
of the weighted query vectors <em>V(q)</em> and <em>V(d)</em>:
<p>
<table><tbody><tr><td>
<table><tbody><tr><td>cosine-similarity(q,d) =<br><table><tbody>
<tr><td><small>V(q) · V(d)</small></td></tr>
<tr><td>–––––––––</td></tr>
<tr><td><small>|V(q)| |V(d)|</small></td></tr>
</tbody></table>
</td><td></td></tr></tbody></table>
</td><td></td></tr><tr><td>VSM Score</td><td></td></tr></tbody></table>
<p>
<p>Where <em>V(q)</em> · <em>V(d)</em> is the
<a href="http://en.wikipedia.org/wiki/Dot_product">dot product</a>
of the weighted vectors,
and <em>|V(q)|</em> and <em>|V(d)|</em> are their
<a href="http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm">Euclidean norms</a>.</p>
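<p>The cosine similarity above can be sketched for sparse weighted vectors as follows. The representation (a dictionary from term to weight) and the function name are assumptions made for illustration; returning 0.0 for a zero-length vector is likewise a choice of this sketch.</p>

```python
import math
from typing import Dict


def cosine_similarity(v_q: Dict[str, float], v_d: Dict[str, float]) -> float:
    """Dot product of two weighted term vectors divided by their Euclidean norms."""
    dot = sum(weight * v_d.get(term, 0.0) for term, weight in v_q.items())
    norm_q = math.sqrt(sum(weight * weight for weight in v_q.values()))
    norm_d = math.sqrt(sum(weight * weight for weight in v_d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # undefined for an empty vector; treated as no similarity here
    return dot / (norm_q * norm_d)
```

<p>Vectors that point in the same direction score 1.0 regardless of their lengths, and vectors with no terms in common score 0.0.</p>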
<p>Note: the above equation can be viewed as the dot product of
the normalized weighted vectors, in the sense that dividing
<em>V(q)</em> by its Euclidean norm is normalizing it to a unit vector.
<p>Lucene refines <em>VSM score</em> for both search quality and usability:
<ul><li>Normalizing <em>V(d)</em> to the unit vector is known to be problematic in that
it removes all document length information.
For some documents removing this info is probably ok,
e.g. a document made by duplicating a certain paragraph <em>10</em> times,
especially if that paragraph is made of distinct terms.
But for a document which contains no duplicated paragraphs,
this might be wrong.
To avoid this problem, a different document length normalization
factor is used, which normalizes to a vector equal to or larger
than the unit vector: <em>doc-len-norm(d)</em>.
</li><li>At indexing, users can specify that certain documents are more
important than others, by assigning a document boost.
For this, the score of each document is also multiplied by its boost value
<em>doc-boost(d)</em>.
</li><li>Lucene is field based: each query term applies to a single
field, document length normalization is by the length of that field,
and in addition to the document boost there are also per-field boosts.
</li><li>The same field can be added to a document during indexing several times,
and so the boost of that field is the multiplication of the boosts of
the separate additions (or parts) of that field within the document.
</li><li>At search time users can specify boosts to each query, sub-query, and
each query term, hence the contribution of a query term to the score of
a document is multiplied by the boost of that query term <em>query-boost(q)</em>.
</li><li>A document may match a multi-term query without containing all
the terms of that query (this is valid for some query types),
and users can further reward documents matching more query terms
through a coordination factor, which is usually larger when
more terms are matched: <em>coord-factor(q,d)</em>.
</li></ul>
<p>Under the simplifying assumption of a single field in the index,
we get <em>Lucene&apos;s Conceptual scoring formula</em>:
<p>
<table><tbody><tr><td>
<table><tbody><tr><td>
score(q,d) =<br><font color="#FF9933">coord-factor(q,d)</font> ·<br><font color="#CCCC00">query-boost(q)</font> ·<br>
<table><tbody><tr><td><small><font color="#993399">V(q) · V(d)</font></small></td><td></td></tr><tr><td>–––––––––</td><td></td></tr><tr><td><small><font color="#FF33CC">|V(q)|</font></small></td><td></td></tr></tbody></table>
· <font color="#3399FF">doc-len-norm(d)</font>
· <font color="#3399FF">doc-boost(d)</font>
</td><td></td></tr></tbody></table>
</td><td></td></tr><tr><td>Lucene Conceptual Scoring Formula</td><td></td></tr></tbody></table>
<p>
<p>The conceptual formula is a simplification in the sense that (1) terms and documents
are fielded and (2) boosts are usually per query term rather than per query.
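<p>Under the same single-field assumption, the conceptual formula can be sketched as below. The vectors are dictionaries from term to weight, and every name is hypothetical; the boosts and the coordination factor are taken as given inputs rather than computed.</p>

```python
import math
from typing import Dict


def conceptual_score(v_q: Dict[str, float], v_d: Dict[str, float],
                     coord_factor: float, query_boost: float,
                     doc_len_norm: float, doc_boost: float) -> float:
    """coord-factor * query-boost * (V(q) . V(d) / |V(q)|) * doc-len-norm * doc-boost."""
    dot = sum(weight * v_d.get(term, 0.0) for term, weight in v_q.items())
    norm_q = math.sqrt(sum(weight * weight for weight in v_q.values()))
    if norm_q == 0.0:
        return 0.0  # an empty query matches nothing in this sketch
    return coord_factor * query_boost * (dot / norm_q) * doc_len_norm * doc_boost
```

<p>Note that only the query vector is normalized to unit length; the document side uses <em>doc-len-norm(d)</em> instead of <em>|V(d)|</em>.</p>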
<p>We now describe how Lucene implements this conceptual scoring formula, and
derive from it <em>Lucene&apos;s Practical Scoring Function</em>.
<p>For efficient score computation some scoring components
are computed and aggregated in advance:
<ul><li><em>Query-boost</em> for the query (actually for each query term)
is known when search starts.
</li><li>Query Euclidean norm <em>|V(q)|</em> can be computed when search starts,
as it is independent of the document being scored.
From a search optimization perspective, it is valid to ask
why the query should be normalized at all, because all
scored documents will be divided by the same <em>|V(q)|</em>,
and hence document ranks (their order by score) will not
be affected by this normalization.
There are two good reasons to keep this normalization:
<ul><li>Recall that
<a href="http://en.wikipedia.org/wiki/Cosine_similarity">
Cosine Similarity</a> can be used to find how similar
two documents are. One can use Lucene for e.g.
clustering, and use a document as a query to compute
its similarity to other documents.
In this use case it is important that the score of document <em>d3</em>
for query <em>d1</em> is comparable to the score of document <em>d3</em>
for query <em>d2</em>. In other words, scores of a document for two
distinct queries should be comparable.
There are other applications that may require this.
And this is exactly what normalizing the query vector <em>V(q)</em>
provides: comparability (to a certain extent) of two or more queries.
</li><li>Applying query normalization on the scores helps to keep the
scores around the unit vector, hence preventing loss of score data
because of floating point precision limitations.
</li></ul>
</li><li>Document length norm <em>doc-len-norm(d)</em> and document
boost <em>doc-boost(d)</em> are known at indexing time.
They are computed in advance and their multiplication
is saved as a single value in the index: <em>norm(d)</em>.
(In the equations below, <em>norm(t in d)</em> means <em>norm(field(t) in doc d)</em>
where <em>field(t)</em> is the field associated with term <em>t</em>.)
</li></ul>
<p><em>Lucene&apos;s Practical Scoring Function</em> is derived from the above.
The color codes demonstrate how it relates
to those of the <em>conceptual</em> formula:
<p>
<table><tbody><tr><td>
<table><tbody><tr><td>
score(q,d) =<br><a href="#formula_coord"><font color="#FF9933">coord(q,d)</font></a> ·<br><a href="#formula_queryNorm"><font color="#FF33CC">queryNorm(q)</font></a> ·<br><big><big><big>∑</big></big></big>
<big><big>(</big></big>
<a href="#formula_tf"><font color="#993399">tf(t in d)</font></a> ·<br><a href="#formula_idf"><font color="#993399">idf(t)</font></a><sup>2</sup> ·<br><a href="#formula_termBoost"><font color="#CCCC00">t.Boost</font></a> ·<br><a href="#formula_norm"><font color="#3399FF">norm(t,d)</font></a>
<big><big>)</big></big>
</td><td></td></tr><tr><td><small>t in q</small></td><td></td></tr></tbody></table>
</td><td></td></tr><tr><td>Lucene Practical Scoring Function</td><td></td></tr></tbody></table>
<p> where
<ol><li>
<a name="formula_tf"></a>
<strong><em>tf(t in d)</em></strong>
correlates to the term&apos;s <em>frequency</em>,
defined as the number of times term <em>t</em> appears in the currently scored document <em>d</em>.
Documents that have more occurrences of a given term receive a higher score.
Note that <em>tf(t in q)</em> is assumed to be <em>1</em> and therefore it does not appear in this equation.
However, if a query contains the same term twice, there will be
two term-queries with that same term and hence the computation would still be correct (although
not very efficient).
The default computation for <em>tf(t in d)</em> in
DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_Tf_System_Single_">Tf(Single)</a>) is:
<p>
<table><tbody><tr><td>
tf(t in d) =<br>
frequency<sup><big>½</big></sup>
</td><td></td></tr></tbody></table>
<p>
<p></li><li>
<a name="formula_idf"></a>
<strong><em>idf(t)</em></strong> stands for Inverse Document Frequency. This value
correlates to the inverse of <em>DocFreq</em>
(the number of documents in which the term <em>t</em> appears).
This means rarer terms give a higher contribution to the total score.
<em>idf(t)</em> appears for <em>t</em> in both the query and the document,
hence it is squared in the equation.
The default computation for <em>idf(t)</em> in
DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_Idf_System_Int64_System_Int64_">Idf(Int64, Int64)</a>) is:<p>
<p>
<table><tbody><tr><td>idf(t) = 1 + log <big>(</big>
<table><tbody><tr><td><small>NumDocs</small></td><td></td></tr><tr><td>–––––––––</td><td></td></tr><tr><td><small>DocFreq+1</small></td><td></td></tr></tbody></table>
<big>)</big></td><td></td></tr></tbody></table>
<p>
<p></li><li>
<a name="formula_coord"></a>
<strong><em>coord(q,d)</em></strong>
is a score factor based on how many of the query terms are found in the specified document.
Typically, a document that contains more of the query&apos;s terms will receive a higher score
than another document with fewer query terms.
This is a search-time factor computed in
coord(q,d) (<a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html#Lucene_Net_Search_Similarities_TFIDFSimilarity_Coord_System_Int32_System_Int32_">Coord(Int32, Int32)</a>)
by the Similarity in effect at search time.
<p>
</li><li><strong>
<a name="formula_queryNorm"></a>
<em>queryNorm(q)</em>
</strong>
is a normalizing factor used to make scores between queries comparable.
This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
but rather just attempts to make scores from different queries (or even different indexes) comparable.
This is a search-time factor computed by the Similarity in effect at search time.<p>
<p>The default computation in
DefaultSimilarity (<a class="xref" href="Lucene.Net.Search.Similarities.DefaultSimilarity.html#Lucene_Net_Search_Similarities_DefaultSimilarity_QueryNorm_System_Single_">QueryNorm(Single)</a>)
produces a <a href="http://en.wikipedia.org/wiki/Euclidean_norm#Euclidean_norm">Euclidean norm</a>:</p>
<p>
<table><tbody><tr><td>
queryNorm(q) =<br> queryNorm(sumOfSquaredWeights)
=<br>
<table><tbody><tr><td><big>1</big></td><td></td></tr><tr><td><big>––––––––––––––</big></td><td></td></tr><tr><td>sumOfSquaredWeights<sup><big>½</big></sup></td><td></td></tr></tbody></table>
</td><td></td></tr></tbody></table>
<p>
<p>The sum of squared weights (of the query terms) is
computed by the query <a class="xref" href="Lucene.Net.Search.Weight.html">Weight</a> object.
For example, a <a class="xref" href="Lucene.Net.Search.BooleanQuery.html">BooleanQuery</a>
computes this value as:</p>
<p><p>
<table><tbody><tr><td>
sumOfSquaredWeights =<br> q.Boost <sup><big>2</big></sup>
·
<big><big><big>∑</big></big></big>
<big><big>(</big></big>
<a href="#formula_idf">idf(t)</a> ·
<a href="#formula_termBoost">t.Boost</a>
<big><big>) <sup>2</sup> </big></big>
</td><td></td></tr><tr><td><small>t in q</small></td><td></td></tr></tbody></table>
where sumOfSquaredWeights is <a class="xref" href="Lucene.Net.Search.Weight.html#Lucene_Net_Search_Weight_GetValueForNormalization">GetValueForNormalization()</a> and
q.Boost is <a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a>
<p>
</li><li>
<a name="formula_termBoost"></a>
<strong><em>t.Boost</em></strong>
is a search time boost of term <em>t</em> in the query <em>q</em> as
specified in the query text
(see the classic query parser&apos;s query syntax, &quot;Boosting a Term&quot;),
or as set by application calls to
<a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a>.
Notice that there is no direct API for accessing the boost of a single term in a multi-term query.
Instead, multiple terms are represented in a query as multiple
<a class="xref" href="Lucene.Net.Search.TermQuery.html">TermQuery</a> objects,
and so the boost of a term in the query is accessible by calling the sub-query&apos;s
<a class="xref" href="Lucene.Net.Search.Query.html#Lucene_Net_Search_Query_Boost">Boost</a>.
<p>
</li><li>
<a name="formula_norm"></a>
<strong><em>norm(t,d)</em></strong> encapsulates a few (indexing time) boost and length factors:<p>
<p><ul><li><strong>Field boost</strong> - set
<a class="xref" href="Lucene.Net.Documents.Field.html#Lucene_Net_Documents_Field_Boost">Boost</a>
before adding the field to a document.
</li><li><strong>lengthNorm</strong> - computed
when the document is added to the index in accordance with the number of tokens
of this field in the document, so that shorter fields contribute more to the score.
LengthNorm is computed by the <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> class in effect at indexing.
</li></ul>
The <a class="xref" href="Lucene.Net.Search.Similarities.TFIDFSimilarity.html#Lucene_Net_Search_Similarities_TFIDFSimilarity_ComputeNorm_Lucene_Net_Index_FieldInvertState_">ComputeNorm(FieldInvertState)</a> method is responsible for
combining all of these factors into a single <span class="xref">System.Single</span>.</p>
<p><p>
When a document is added to the index, all the above factors are multiplied.
If the document has multiple fields with the same name, all their boosts are multiplied together:</p>
<p><p>
<table><tbody><tr><td>
norm(t,d) =<br> lengthNorm
·
<big><big><big>∏</big></big></big><a class="xref" href="Lucene.Net.Index.IIndexableField.html#Lucene_Net_Index_IIndexableField_Boost">Boost</a></td><td></td></tr><tr><td><small>field <em><strong>f</strong></em> in <em>d</em> named as <em><strong>t</strong></em></small></td><td></td></tr></tbody></table>
Note that search time is too late to modify this <em>norm</em> part of scoring,
e.g. by using a different <a class="xref" href="Lucene.Net.Search.Similarities.Similarity.html">Similarity</a> for search.
</li></ol></p>
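<p>The pieces of the practical scoring function listed above can be put together in a short sketch. It follows the default formulas as written on this page and assumes that <em>log</em> is the natural logarithm, that <em>coord(q,d)</em> and the decoded <em>norm(t,d)</em> are supplied as inputs, and that a query term absent from the document has frequency 0. All class and function names are hypothetical and do not correspond to Lucene.NET members.</p>

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TermInput:
    freq: float    # occurrences of term t in the scored document d
    doc_freq: int  # DocFreq: number of documents containing t
    boost: float   # t.Boost, the search-time boost of the term
    norm: float    # norm(t,d), assumed already decoded from the index


def tf(freq: float) -> float:
    """Default tf: square root of the within-document frequency."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Default idf: 1 + log(NumDocs / (DocFreq + 1)), natural log assumed."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def sum_of_squared_weights(terms: List[TermInput], num_docs: int,
                           query_boost: float) -> float:
    """q.Boost squared times the sum over terms of (idf(t) * t.Boost) squared."""
    return query_boost ** 2 * sum(
        (idf(t.doc_freq, num_docs) * t.boost) ** 2 for t in terms)


def query_norm(sum_sq_weights: float) -> float:
    """Default queryNorm: 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum_sq_weights)


def practical_score(terms: List[TermInput], num_docs: int, coord: float,
                    query_boost: float = 1.0) -> float:
    """coord * queryNorm * sum over t in q of tf * idf^2 * t.Boost * norm."""
    q_norm = query_norm(sum_of_squared_weights(terms, num_docs, query_boost))
    total = sum(
        tf(t.freq) * idf(t.doc_freq, num_docs) ** 2 * t.boost * t.norm
        for t in terms)
    return coord * q_norm * total
```

<p>For a single-term query whose term occurs 4 times in the document and appears in 9 of 10 documents, with all boosts, the norm, and coord equal to 1, idf is 1, tf is 2, queryNorm is 1, and the score is 2.</p>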
</section>
<h3 id="interfaces">Interfaces
</h3>
<h4><a class="xref" href="Lucene.Net.Search.Similarities.LMSimilarity.ICollectionModel.html">LMSimilarity.ICollectionModel</a></h4>
<section><p>A strategy for computing the collection language model. </p>
</section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
Copyright © 2020 The Apache Software Foundation, Licensed under the <a href='http://www.apache.org/licenses/LICENSE-2.0' target='_blank'>Apache License, Version 2.0</a><br> <small>Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation. <br>All other marks mentioned may be trademarks or registered trademarks of their respective owners.</small>
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>