blob: 23b63fa432db1c54cc41bbe9321074dbd3036bd6 [file] [log] [blame]
<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Queries.Mlt
| Apache Lucene.NET 4.8.0-beta00010 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Queries.Mlt
| Apache Lucene.NET 4.8.0-beta00010 Documentation ">
<meta name="generator" content="docfx 2.56.0.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="queries/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Queries.Mlt">
<h1 id="Lucene_Net_Queries_Mlt" data-uid="Lucene.Net.Queries.Mlt" class="text-break">Namespace Lucene.Net.Queries.Mlt
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Document similarity query generators.</p>
</div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a></h4>
<section><p>Generate &quot;more like this&quot; similarity queries.
Based on this mail:</p>
<pre><code>Lucene does let you access the document frequency of terms, with <a class="xref" href="http://localhost:8080/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a>.
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the <a class="xref" href="http://localhost:8080/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a> of every term in the document is
probably too slow.
<p>
You can use some heuristics to prune the set of terms, to avoid calling <a class="xref" href="http://localhost:8080/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a> too much,
or at all. Since you&apos;re trying to maximize a tf*idf score, you&apos;re probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
<p>
It all depends on what you&apos;re trying to do. If you&apos;re trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you&apos;re trying to
provide a &quot;more like this&quot; button on a search results page that does a decent job and has
good performance, such techniques might be useful.
<p>
An efficient, effective &quot;more-like-this&quot; query generator would be a great contribution, if
anyone&apos;s interested. I&apos;d imagine that it would take a Reader or a String (the document&apos;s
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
<p>
Doug</code></pre>
<p><p>
<p>
<p>
<strong>Initial Usage</strong>
<p>
This class has lots of options to try to make it efficient and flexible.
The simplest possible usage is as follows. The bold
fragment is specific to this class.
<p>
<pre><code>IndexReader ir = ...
IndexSearcher is = ...
MoreLikeThis mlt = new MoreLikeThis(ir);
TextReader target = ... // orig source of doc you want to find similarities to
Query query = mlt.Like(target);
Hits hits = is.Search(query);
// now the usual iteration thru &apos;hits&apos; - the only thing to watch for is to make sure
//you ignore the doc if it matches your &apos;target&apos; document, as it should be similar to itself</code></pre>
<p><p>
Thus you:
<ul><li>do your normal, Lucene setup for searching,</li><li>create a MoreLikeThis,</li><li>get the text of the doc you want to find similarities to</li><li>then call one of the <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_Like_System_IO_TextReader_System_String_">Like(TextReader, String)</a> calls to generate a similarity query</li><li>call the searcher to find the similar docs</li></ul>
<p>
<strong>More Advanced Usage</strong>
<p>
You may want to use the setter for <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_FieldNames">FieldNames</a> so you can examine
multiple fields (e.g. body and title) for similarity.
<p>
<p>
Depending on the size of your index and the size and makeup of your documents you
may want to call the other set methods to control how the similarity queries are
generated:
<ul><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinTermFreq">MinTermFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinDocFreq">MinDocFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxDocFreq">MaxDocFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_SetMaxDocFreqPct_System_Int32_">SetMaxDocFreqPct(Int32)</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinWordLen">MinWordLen</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxWordLen">MaxWordLen</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxQueryTerms">MaxQueryTerms</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxNumTokensParsed">MaxNumTokensParsed</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_StopWords">StopWords</a></li></ul></p>
</section>
<h4><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThisQuery.html">MoreLikeThisQuery</a></h4>
<section><p>A simple wrapper for <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a> for use in scenarios where a <span class="xref">Lucene.Net.Search.Query</span> object is required eg
in custom QueryParser extensions. At query.Rewrite() time the reader is used to construct the
actual <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a> object and obtain the real <span class="xref">Lucene.Net.Search.Query</span> object.</p>
</section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<div class="contribution">
<ul class="nav">
<li>
<a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00010/src/Lucene.Net.Queries/Mlt/package.md/#L2" class="contribution-link">Improve this Doc</a>
</li>
</ul>
</div>
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 Licensed to the Apache Software Foundation (ASF)
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>