| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Queries.Mlt |
| | Apache Lucene.NET 4.8.0-beta00009 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Queries.Mlt |
| | Apache Lucene.NET 4.8.0-beta00009 Documentation "> |
| <meta name="generator" content="docfx 2.56.0.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="queries/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="/"> |
| <img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search"> |
| <ul class="level0 breadcrumb"> |
| <li> |
| <a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a> |
| <span id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </span> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Queries.Mlt"> |
| |
| <h1 id="Lucene_Net_Queries_Mlt" data-uid="Lucene.Net.Queries.Mlt" class="text-break">Namespace Lucene.Net.Queries.Mlt |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Document similarity query generators.</p> |
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a></h4> |
| <section><p>Generate "more like this" similarity queries. |
| Based on this mail:</p> |
| <pre><code>Lucene does let you access the document frequency of terms, with <a class="xref" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a>. |
| Term frequencies can be computed by re-tokenizing the text, which, for a single document, |
| is usually fast enough. But looking up the <a class="xref" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a> of every term in the document is |
| probably too slow. |
| <p> |
| You can use some heuristics to prune the set of terms, to avoid calling <a class="xref" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/api/core/Lucene.Net.Index.IndexReader.html#Lucene_Net_Index_IndexReader_DocFreq_Lucene_Net_Index_Term_">DocFreq(Term)</a> too much, |
| or at all. Since you're trying to maximize a tf*idf score, you're probably most interested |
| in terms with a high tf. Choosing a tf threshold even as low as two or three will radically |
| reduce the number of terms under consideration. Another heuristic is that terms with a |
| high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the |
| number of characters, not selecting anything less than, e.g., six or seven characters. |
| With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms |
| that do a pretty good job of characterizing a document. |
| <p> |
| It all depends on what you're trying to do. If you're trying to eek out that last percent |
| of precision and recall regardless of computational difficulty so that you can win a TREC |
| competition, then the techniques I mention above are useless. But if you're trying to |
| provide a "more like this" button on a search results page that does a decent job and has |
| good performance, such techniques might be useful. |
| <p> |
| An efficient, effective "more-like-this" query generator would be a great contribution, if |
| anyone's interested. I'd imagine that it would take a Reader or a String (the document's |
| text), analyzer Analyzer, and return a set of representative terms using heuristics like those |
| above. The frequency and length thresholds could be parameters, etc. |
| <p> |
| Doug</code></pre> |
| <p><p> |
| <p> |
| <p> |
| <strong>Initial Usage</strong> |
| <p> |
| This class has lots of options to try to make it efficient and flexible. |
| The simplest possible usage is as follows. The bold |
| fragment is specific to this class. |
| <p> |
| <pre><code>IndexReader ir = ... |
| IndexSearcher is = ... |
| |
| MoreLikeThis mlt = new MoreLikeThis(ir); |
| TextReader target = ... // orig source of doc you want to find similarities to |
| Query query = mlt.Like(target); |
| |
| Hits hits = is.Search(query); |
| // now the usual iteration thru 'hits' - the only thing to watch for is to make sure |
| //you ignore the doc if it matches your 'target' document, as it should be similar to itself</code></pre> |
| <p><p> |
| Thus you: |
| <ul><li>do your normal, Lucene setup for searching,</li><li>create a MoreLikeThis,</li><li>get the text of the doc you want to find similarities to</li><li>then call one of the <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_Like_System_IO_TextReader_System_String_">Like(TextReader, String)</a> calls to generate a similarity query</li><li>call the searcher to find the similar docs</li></ul> |
| <p> |
| <strong>More Advanced Usage</strong> |
| <p> |
| You may want to use the setter for <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_FieldNames">FieldNames</a> so you can examine |
| multiple fields (e.g. body and title) for similarity. |
| <p> |
| <p> |
| Depending on the size of your index and the size and makeup of your documents you |
| may want to call the other set methods to control how the similarity queries are |
| generated: |
| <ul><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinTermFreq">MinTermFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinDocFreq">MinDocFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxDocFreq">MaxDocFreq</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_SetMaxDocFreqPct_System_Int32_">SetMaxDocFreqPct(Int32)</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MinWordLen">MinWordLen</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxWordLen">MaxWordLen</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxQueryTerms">MaxQueryTerms</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_MaxNumTokensParsed">MaxNumTokensParsed</a></li><li><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html#Lucene_Net_Queries_Mlt_MoreLikeThis_StopWords">StopWords</a></li></ul></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThisQuery.html">MoreLikeThisQuery</a></h4> |
| <section><p>A simple wrapper for <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a> for use in scenarios where a <span class="xref">Lucene.Net.Search.Query</span> object is required eg |
| in custom QueryParser extensions. At query.Rewrite() time the reader is used to construct the |
| actual <a class="xref" href="Lucene.Net.Queries.Mlt.MoreLikeThis.html">MoreLikeThis</a> object and obtain the real <span class="xref">Lucene.Net.Search.Query</span> object.</p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <div class="contribution"> |
| <ul class="nav"> |
| <li> |
| <a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00009/src/Lucene.Net.Queries/Mlt/package.md/#L2" class="contribution-link">Improve this Doc</a> |
| </li> |
| </ul> |
| </div> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |