<!doctype html public "-//w3c//dtd html 4.0 transitional//en"> | |
<html> | |
<head> | |
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> | |
<meta name="Author" content="Doug Cutting"> | |
<meta content="Grant Ingersoll" name="Author"> | |
</head> | |
<body> | |
Code to search indices. | |
<h2>Table Of Contents</h2> | |
<p> | |
<ol> | |
<li><a href = "#search">Search Basics</a></li> | |
<li><a href = "#query">The Query Classes</a></li> | |
<li><a href = "#scoring">Changing the Scoring</a></li> | |
</ol> | |
</p> | |
<a name = "search"></a> | |
<h2>Search</h2> | |
<p> | |
Search over indices. | |
Applications usually call {@link | |
org.apache.lucene.search.Searcher#search(Query)} or {@link | |
org.apache.lucene.search.Searcher#search(Query,Filter)}. | |
<!-- FILL IN MORE HERE --> | |
</p> | |
<a name = "query"></a> | |
<h2>Query Classes</h2> | |
<h4> | |
<a href = "TermQuery.html">TermQuery</a> | |
</h4> | |
<p>Of the various implementations of | |
<a href = "Query.html">Query</a>, the | |
<a href = "TermQuery.html">TermQuery</a> | |
is the easiest to understand and the most often used in applications. A <a | |
href="TermQuery.html">TermQuery</a> matches all the documents that contain the | |
specified | |
<a href = "index/Term.html">Term</a>, | |
which is a word that occurs in a certain | |
<a href = "document/Field.html">Field</a>. | |
Thus, a <a href = "TermQuery.html">TermQuery</a> identifies and scores all | |
<a href = "document/Document.html">Document</a>s that have a <a | |
href="../document/Field.html">Field</a> with the specified string in it. | |
Constructing a <a | |
href="TermQuery.html">TermQuery</a> | |
is as simple as: | |
<pre> | |
TermQuery tq = new TermQuery(new Term("fieldName", "term")); | |
</pre>In this example, the <a href = "Query.html">Query</a> identifies all <a | |
href="../document/Document.html">Document</a>s that have the <a | |
href="../document/Field.html">Field</a> named <tt>"fieldName"</tt> | |
containing the word <tt>"term"</tt>. | |
</p> | |
<h4> | |
<a href = "BooleanQuery.html">BooleanQuery</a> | |
</h4> | |
<p>Things start to get interesting when one combines multiple | |
<a href = "TermQuery.html">TermQuery</a> instances into a <a | |
href="BooleanQuery.html">BooleanQuery</a>. | |
A <a href = "BooleanQuery.html">BooleanQuery</a> contains multiple | |
<a href = "BooleanClause.html">BooleanClause</a>s, | |
where each clause contains a sub-query (<a href = "Query.html">Query</a> | |
instance) and an operator (from <a | |
href="BooleanClause.Occur.html">BooleanClause.Occur</a>) | |
describing how that sub-query is combined with the other clauses: | |
<ol> | |
<li><p>SHOULD — Use this operator when a clause can occur in the result set, but is not required. | |
If a query is made up of all SHOULD clauses, then every document in the result | |
set matches at least one of these clauses.</p></li> | |
<li><p>MUST — Use this operator when a clause is required to occur in the result set. Every | |
document in the result set will match | |
all such clauses.</p></li> | |
<li><p>MUST NOT — Use this operator when a | |
clause must not occur in the result set. No | |
document in the result set will match | |
any such clauses.</p></li> | |
</ol> | |
Boolean queries are constructed by adding two or more | |
<a href = "BooleanClause.html">BooleanClause</a> | |
instances. If too many clauses are added, a <a href = "BooleanQuery.TooManyClauses.html">TooManyClauses</a> | |
exception will be thrown during searching. This most often occurs | |
when a <a href = "Query.html">Query</a> | |
is rewritten into a <a href = "BooleanQuery.html">BooleanQuery</a> with many | |
<a href = "TermQuery.html">TermQuery</a> clauses, | |
for example by <a href = "WildcardQuery.html">WildcardQuery</a>. | |
The default setting for the maximum number | |
of clauses 1024, but this can be changed via the | |
static method <a href = "BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a> | |
in <a href = "BooleanQuery.html">BooleanQuery</a>. | |
</p> | |
<h4>Phrases</h4> | |
<p>Another common search is to find documents containing certain phrases. This | |
is handled two different ways: | |
<ol> | |
<li> | |
<p><a href = "PhraseQuery.html">PhraseQuery</a> | |
— Matches a sequence of | |
<a href = "index/Term.html">Terms</a>. | |
<a href = "PhraseQuery.html">PhraseQuery</a> uses a slop factor to determine | |
how many positions may occur between any two terms in the phrase and still be considered a match.</p> | |
</li> | |
<li> | |
<p><a href = "spans/SpanNearQuery.html">SpanNearQuery</a> | |
— Matches a sequence of other | |
<a href = "spans/SpanQuery.html">SpanQuery</a> | |
instances. <a href = "spans/SpanNearQuery.html">SpanNearQuery</a> allows for | |
much more | |
complicated phrase queries since it is constructed from other <a | |
href="spans/SpanQuery.html">SpanQuery</a> | |
instances, instead of only <a href = "TermQuery.html">TermQuery</a> | |
instances.</p> | |
</li> | |
</ol> | |
</p> | |
<h4> | |
<a href = "RangeQuery.html">RangeQuery</a> | |
</h4> | |
<p>The | |
<a href = "RangeQuery.html">RangeQuery</a> | |
matches all documents that occur in the | |
exclusive range of a lower | |
<a href = "index/Term.html">Term</a> | |
and an upper | |
<a href = "index/Term.html">Term</a>. | |
For example, one could find all documents | |
that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a | |
href="Query.html">Query</a> is frequently used to | |
find | |
documents that occur in a specific date range. | |
</p> | |
<h4> | |
<a href = "PrefixQuery.html">PrefixQuery</a>, | |
<a href = "WildcardQuery.html">WildcardQuery</a> | |
</h4> | |
<p>While the | |
<a href = "PrefixQuery.html">PrefixQuery</a> | |
has a different implementation, it is essentially a special case of the | |
<a href = "WildcardQuery.html">WildcardQuery</a>. | |
The <a href = "PrefixQuery.html">PrefixQuery</a> allows an application | |
to identify all documents with terms that begin with a certain string. The <a | |
href="WildcardQuery.html">WildcardQuery</a> generalizes this by allowing | |
for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards. | |
Note that the <a href = "WildcardQuery.html">WildcardQuery</a> can be quite slow. Also | |
note that | |
<a href = "WildcardQuery.html">WildcardQuery</a> should | |
not start with <tt>*</tt> and <tt>?</tt>, as these are extremely slow. | |
To remove this protection and allow a wildcard at the beginning of a term, see method | |
<a href = "queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)">setAllowLeadingWildcard</a> in | |
<a href = "queryParser/QueryParser.html">QueryParser</a>. | |
</p> | |
<h4> | |
<a href = "FuzzyQuery.html">FuzzyQuery</a> | |
</h4> | |
<p>A | |
<a href = "FuzzyQuery.html">FuzzyQuery</a> | |
matches documents that contain terms similar to the specified term. Similarity is | |
determined using | |
<a href = "http://en.wikipedia.org//wiki/Levenshtein">Levenshtein (edit) distance</a>. | |
This type of query can be useful when accounting for spelling variations in the collection. | |
</p> | |
<a name = "changingSimilarity"></a> | |
<h2>Changing Similarity</h2> | |
<p>Chances are <a href = "DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all | |
your searching needs. | |
However, in some applications it may be necessary to customize your <a | |
href="Similarity.html">Similarity</a> implementation. For instance, some | |
applications do not need to | |
distinguish between shorter and longer documents (see <a | |
href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p> | |
<p>To change <a href = "Similarity.html">Similarity</a>, one must do so for both indexing and | |
searching, and the changes must happen before | |
either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it | |
just isn't well-defined what is going to happen. | |
</p> | |
<p>To make this change, implement your own <a href = "Similarity.html">Similarity</a> (likely | |
you'll want to simply subclass | |
<a href = "DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new | |
class by calling | |
<a href = "index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a> | |
before indexing and | |
<a href = "Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a> | |
before searching. | |
</p> | |
<p> | |
If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at <a | |
href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>. | |
In summary, here are a few use cases: | |
<ol> | |
<li><p><a href = "api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> — <a | |
href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases | |
as the frequency increases a small amount | |
and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is | |
more significant.</p></li> | |
<li><p>Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a | |
matching term occurs. In these | |
cases people have overridden Similarity to return 1 from the tf() method.</p></li> | |
<li><p>Changing Length Normalization — By overriding <a | |
href="Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>, | |
it is possible to discount how the length of a field contributes | |
to a score. In <a href = "DefaultSimilarity.html">DefaultSimilarity</a>, | |
lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be | |
1 / (numTerms in field), all fields will be treated | |
<a href = "http://www.gossamer-threads.com//lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li> | |
</ol> | |
In general, Chris Hostetter sums it up best in saying (from <a | |
href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users's mailing list</a>): | |
<blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just | |
that | |
it's "text" is a situation where it *might* make sense to to override your | |
Similarity method.</blockquote> | |
</p> | |
<a name = "scoring"></a> | |
<h2>Changing Scoring — Expert Level</h2> | |
<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if | |
you want help. | |
</p> | |
<p>With the warning out of the way, it is possible to change a lot more than just the Similarity | |
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by | |
<span >three main classes</span>: | |
<ol> | |
<li> | |
<a href = "Query.html">Query</a> — The abstract object representation of the | |
user's information need.</li> | |
<li> | |
<a href = "Weight.html">Weight</a> — The internal interface representation of | |
the user's Query, so that Query objects may be reused.</li> | |
<li> | |
<a href = "Scorer.html">Scorer</a> — An abstract class containing common | |
functionality for scoring. Provides both scoring and explanation capabilities.</li> | |
</ol> | |
Details on each of these classes, and their children, can be found in the subsections below. | |
</p> | |
<h4>The Query Class</h4> | |
<p>In some sense, the | |
<a href = "Query.html">Query</a> | |
class is where it all begins. Without a Query, there would be | |
nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it | |
is often responsible | |
for creating them or coordinating the functionality between them. The | |
<a href = "Query.html">Query</a> class has several methods that are important for | |
derived classes: | |
<ol> | |
<li>createWeight(Searcher searcher) — A | |
<a href = "Weight.html">Weight</a> is the internal representation of the | |
Query, so each Query implementation must | |
provide an implementation of Weight. See the subsection on <a | |
href="#The Weight Interface">The Weight Interface</a> below for details on implementing the Weight | |
interface.</li> | |
<li>rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: | |
<a href = "TermQuery.html">TermQuery</a>, | |
<a href = "BooleanQuery.html">BooleanQuery</a>, <span | |
>OTHERS????</span></li> | |
</ol> | |
</p> | |
<h4>The Weight Interface</h4> | |
<p>The | |
<a href = "Weight.html">Weight</a> | |
interface provides an internal representation of the Query so that it can be reused. Any | |
<a href = "Searcher.html">Searcher</a> | |
dependent state should be stored in the Weight implementation, | |
not in the Query class. The interface defines six methods that must be implemented: | |
<ol> | |
<li> | |
<a href = "Weight.html#getQuery()">Weight#getQuery()</a> — Pointer to the | |
Query that this Weight represents.</li> | |
<li> | |
<a href = "Weight.html#getValue()">Weight#getValue()</a> — The weight for | |
this Query. For example, the TermQuery.TermWeight value is | |
equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li> | |
<li> | |
<a href = "Weight.html#sumOfSquaredWeights()"> | |
Weight#sumOfSquaredWeights()</a> — The sum of squared weights. For TermQuery, this is (idf * | |
boost)^2</li> | |
<li> | |
<a href = "Weight.html#normalize(float)"> | |
Weight#normalize(float)</a> — Determine the query normalization factor. The query normalization may | |
allow for comparing scores between queries.</li> | |
<li> | |
<a href = "Weight.html#scorer(IndexReader)"> | |
Weight#scorer(IndexReader)</a> — Construct a new | |
<a href = "Scorer.html">Scorer</a> | |
for this Weight. See | |
<a href = "#The Scorer Class">The Scorer Class</a> | |
below for help defining a Scorer. As the name implies, the | |
Scorer is responsible for doing the actual scoring of documents given the Query. | |
</li> | |
<li> | |
<a href = "Weight.html#explain(IndexReader, int)"> | |
Weight#explain(IndexReader, int)</a> — Provide a means for explaining why a given document was | |
scored | |
the way it was.</li> | |
</ol> | |
</p> | |
<h4>The Scorer Class</h4> | |
<p>The | |
<a href = "Scorer.html">Scorer</a> | |
abstract class provides common scoring functionality for all Scorer implementations and | |
is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which | |
must be implemented: | |
<ol> | |
<li> | |
<a href = "Scorer.html#next()">Scorer#next()</a> — Advances to the next | |
document that matches this Query, returning true if and only | |
if there is another document that matches.</li> | |
<li> | |
<a href = "Scorer.html#doc()">Scorer#doc()</a> — Returns the id of the | |
<a href = "document/Document.html">Document</a> | |
that contains the match. It is not valid until next() has been called at least once. | |
</li> | |
<li> | |
<a href = "Scorer.html#score()">Scorer#score()</a> — Return the score of the | |
current document. This value can be determined in any | |
appropriate way for an application. For instance, the | |
<a href = "http://svn.apache.org//viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/TermScorer.java?view=log">TermScorer</a> | |
returns the tf * Weight.getValue() * fieldNorm. | |
</li> | |
<li> | |
<a href = "Scorer.html#skipTo(int)">Scorer#skipTo(int)</a> — Skip ahead in | |
the document matches to the document whose id is greater than | |
or equal to the passed in value. In many instances, skipTo can be | |
implemented more efficiently than simply looping through all the matching documents until | |
the target document is identified.</li> | |
<li> | |
<a href = "Scorer.html#explain(int)">Scorer#explain(int)</a> — Provides | |
details on why the score came about.</li> | |
</ol> | |
</p> | |
<h4>Why would I want to add my own Query?</h4> | |
<p>In a nutshell, you want to add your own custom Query implementation when you think that Lucene's | |
aren't appropriate for the | |
task that you want to do. You might be doing some cutting edge research or you need more information | |
back | |
out of Lucene (similar to Doug adding SpanQuery functionality).</p> | |
<h4>Examples</h4> | |
<p >FILL IN HERE</p> | |
</body> | |
</html> |