<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<body>
This is another highlighter implementation.
<h2>Features</h2>
<ul>
<li>fast for large documents</li>
<li>supports N-gram fields</li>
<li>supports phrase-unit highlighting with slops</li>
<li>requires Java 1.5</li>
<li>highlighted fields must be stored with term vector positions and offsets</li>
<li>takes query boost into account when scoring fragments</li>
<li>supports colored highlight tags</li>
<li>pluggable FragListBuilder</li>
<li>pluggable FragmentsBuilder</li>
</ul>
<h2>Algorithm</h2>
<p>To explain the algorithm, let's use the following sample text
(to be highlighted) and user query:</p>
<table border=1>
<tr>
<td><b>Sample Text</b></td>
<td>Lucene is a search engine library.</td>
</tr>
<tr>
<td><b>User Query</b></td>
<td>Lucene^2 OR "search library"~1</td>
</tr>
</table>
<p>The user query is a BooleanQuery that consists of TermQuery("Lucene")
with a boost of 2 and PhraseQuery("search library") with a slop of 1.</p>
<p>For your convenience, here are the offsets and positions of the
sample text.</p>
<pre>
+--------+-----------------------------------+
|        |          1111111111222222222233333|
|  offset|01234567890123456789012345678901234|
+--------+-----------------------------------+
|document|Lucene is a search engine library. |
+--------*-----------------------------------+
|position|0      1  2 3      4      5        |
+--------*-----------------------------------+
</pre>
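<p>The table above can be reproduced with a few lines of plain Java. This is only a
sketch: <code>OffsetTableSketch</code> and its <code>Token</code> class are hypothetical names,
and it assumes that a token is simply a run of ASCII letters. The highlighter itself reads
these values from the stored term vector rather than re-tokenizing the text.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: lists offsets and positions
// for each word of a text, assuming a token is a run of ASCII letters.
class OffsetTableSketch {

  // One token: text, start offset (inclusive), end offset (exclusive), position.
  static final class Token {
    final String text;
    final int start;
    final int end;
    final int position;

    Token(String text, int start, int end, int position) {
      this.text = text;
      this.start = start;
      this.end = end;
      this.position = position;
    }

    @Override
    public String toString() {
      return "\"" + text + "\"(" + start + "," + end + "," + position + ")";
    }
  }

  // Scans the text once and records where each token starts and ends.
  static List<Token> tokenize(String text) {
    List<Token> tokens = new ArrayList<Token>();
    Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
    int position = 0;
    while (m.find()) {
      tokens.add(new Token(m.group(), m.start(), m.end(), position++));
    }
    return tokens;
  }

  public static void main(String[] args) {
    for (Token t : tokenize("Lucene is a search engine library.")) {
      System.out.println(t);
    }
  }
}
```

<p>Each printed line uses the same <code>"termText"(startOffset,endOffset,position)</code>
notation that Step 2 uses.</p>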
<h3>Step 1.</h3>
<p>In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query.
<code>QueryPhraseMap</code> consists of the following members:</p>
<pre class="prettyprint">
public class QueryPhraseMap {
boolean terminal;
int slop; // valid if terminal == true and phraseHighlight == true
float boost; // valid if terminal == true
Map&lt;String, QueryPhraseMap&gt; subMap;
}
</pre>
<p>Each <code>QueryPhraseMap</code> holds a <code>subMap</code>. The key of the <code>subMap</code> is a term
text in the user query and the value is the subsequent <code>QueryPhraseMap</code>.
If the query is a single term (not a phrase), the subsequent <code>QueryPhraseMap</code>
is marked as terminal. If the query is a phrase, the subsequent <code>QueryPhraseMap</code>
is not terminal and its own <code>subMap</code> holds the next term text in the phrase.</p>
<p>From the sample user query, the following <code>QueryPhraseMap</code>
will be generated:</p>
<pre>
QueryPhraseMap
+--------+-+  +-------+-+
|"Lucene"|o+->|boost=2|*|  * : terminal
+--------+-+  +-------+-+
+--------+-+  +---------+-+  +-------+------+-+
|"search"|o+->|"library"|o+->|boost=1|slop=1|*|
+--------+-+  +---------+-+  +-------+------+-+
</pre>
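<p>The structure in the diagram behaves like a trie keyed by term text. The following
minimal sketch models it in plain Java. <code>PhraseMapSketch</code> and its
<code>add</code>, <code>lookup</code> and <code>sampleQuery</code> methods are hypothetical and only
mirror the four members listed above; they are not the methods of the real
<code>QueryPhraseMap</code>.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for QueryPhraseMap; add/lookup/sampleQuery are hypothetical.
class PhraseMapSketch {
  boolean terminal;
  int slop;     // meaningful only on terminal nodes
  float boost;  // meaningful only on terminal nodes
  final Map<String, PhraseMapSketch> subMap = new HashMap<String, PhraseMapSketch>();

  // Walks (and creates) one node per term, then marks the last node as terminal.
  void add(List<String> terms, int slop, float boost) {
    PhraseMapSketch node = this;
    for (String term : terms) {
      PhraseMapSketch next = node.subMap.get(term);
      if (next == null) {
        next = new PhraseMapSketch();
        node.subMap.put(term, next);
      }
      node = next;
    }
    node.terminal = true;
    node.slop = slop;
    node.boost = boost;
  }

  // Returns the node reached by following the terms, or null if there is none.
  PhraseMapSketch lookup(List<String> terms) {
    PhraseMapSketch node = this;
    for (String term : terms) {
      node = node.subMap.get(term);
      if (node == null) {
        return null;
      }
    }
    return node;
  }

  // Builds the map for the sample query: Lucene^2 OR "search library"~1
  static PhraseMapSketch sampleQuery() {
    PhraseMapSketch root = new PhraseMapSketch();
    root.add(Arrays.asList("Lucene"), 0, 2.0f);
    root.add(Arrays.asList("search", "library"), 1, 1.0f);
    return root;
  }

  public static void main(String[] args) {
    PhraseMapSketch root = sampleQuery();
    System.out.println(root.lookup(Arrays.asList("Lucene")).boost);
    System.out.println(root.lookup(Arrays.asList("search")).terminal);
    System.out.println(root.lookup(Arrays.asList("search", "library")).slop);
  }
}
```

<p>Note that the node reached through "search" alone is not terminal, so "search" on its
own is not treated as a match in this sketch.</p>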
<h3>Step 2.</h3>
<p>In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack}. Fast Vector Highlighter uses {@link org.apache.lucene.index.TermFreqVector} data
(must be stored with term vector positions and offsets)
to generate it. <code>FieldTermStack</code> keeps only those terms of the field that also appear in the user query.
Therefore, in this sample case, Fast Vector Highlighter generates the following <code>FieldTermStack</code>:</p>
<pre>
FieldTermStack
+------------------+
|"Lucene"(0,6,0)   |
+------------------+
|"search"(12,18,3) |
+------------------+
|"library"(26,33,5)|
+------------------+
where : "termText"(startOffset,endOffset,position)
</pre>
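<p>As a rough illustration of this step, the sketch below keeps only the tokens whose text
occurs in the query. <code>TermStackSketch</code> is a hypothetical class, and it re-tokenizes
the text with a letter-run pattern purely to stay self-contained; the real
<code>FieldTermStack</code> takes offsets and positions from the term vector instead.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: builds a term stack by filtering tokens against the query terms.
class TermStackSketch {

  // Returns entries formatted as "termText"(startOffset,endOffset,position).
  static List<String> build(String text, Set<String> queryTerms) {
    List<String> stack = new ArrayList<String>();
    Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
    int position = 0;
    while (m.find()) {
      // Positions count every token, but only query terms are kept.
      if (queryTerms.contains(m.group())) {
        stack.add("\"" + m.group() + "\"(" + m.start() + "," + m.end() + "," + position + ")");
      }
      position++;
    }
    return stack;
  }

  public static void main(String[] args) {
    Set<String> queryTerms = new HashSet<String>(Arrays.asList("Lucene", "search", "library"));
    for (String entry : build("Lucene is a search engine library.", queryTerms)) {
      System.out.println(entry);
    }
  }
}
```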
<h3>Step 3.</h3>
<p>In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList}
by reference to <code>QueryPhraseMap</code> and <code>FieldTermStack</code>.</p>
<pre>
FieldPhraseList
+----------------+-----------------+---+
|"Lucene"        |[(0,6)]          |w=2|
+----------------+-----------------+---+
|"search library"|[(12,18),(26,33)]|w=1|
+----------------+-----------------+---+
</pre>
<p>The type of each entry is <code>WeightedPhraseInfo</code>, which consists of
an array of term offsets and a weight. The weight (Fast Vector Highlighter uses the query boost to
calculate it) is taken into account when Fast Vector Highlighter creates
{@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p>
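<p>The matching in this step can be sketched as follows. Everything here is an assumption
made for illustration: <code>PhraseListSketch</code>, <code>TermInfo</code> and <code>Phrase</code>
are hypothetical types, and the slop rule used (the position gap between consecutive
phrase terms, minus one, must not exceed the slop) is a simplification rather than a
statement of how <code>FieldPhraseList</code> is implemented.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of Step 3: match query phrases against the term stack.
class PhraseListSketch {

  static final class TermInfo {
    final String text;
    final int start;
    final int end;
    final int position;

    TermInfo(String text, int start, int end, int position) {
      this.text = text;
      this.start = start;
      this.end = end;
      this.position = position;
    }
  }

  static final class Phrase {
    final float boost;
    final int slop;
    final String[] terms;

    Phrase(float boost, int slop, String... terms) {
      this.boost = boost;
      this.slop = slop;
      this.terms = terms;
    }
  }

  // Tries to match the phrase at stack index 'from'; returns a formatted entry or null.
  static String matchAt(List<TermInfo> stack, int from, Phrase phrase) {
    if (from + phrase.terms.length > stack.size()) {
      return null;
    }
    StringBuilder name = new StringBuilder();
    StringBuilder offsets = new StringBuilder();
    for (int k = 0; k < phrase.terms.length; k++) {
      TermInfo ti = stack.get(from + k);
      if (!ti.text.equals(phrase.terms[k])) {
        return null;
      }
      if (k > 0) {
        // Assumed slop rule: terms skipped between two phrase terms must fit in the slop.
        int gap = ti.position - stack.get(from + k - 1).position - 1;
        if (Math.abs(gap) > phrase.slop) {
          return null;
        }
        name.append(' ');
        offsets.append(',');
      }
      name.append(ti.text);
      offsets.append('(').append(ti.start).append(',').append(ti.end).append(')');
    }
    return "\"" + name + "\"[" + offsets + "] w=" + phrase.boost;
  }

  // Scans the stack left to right and emits one entry per matched term or phrase.
  static List<String> build(List<TermInfo> stack, List<Phrase> phrases) {
    List<String> result = new ArrayList<String>();
    int i = 0;
    while (i < stack.size()) {
      int consumed = 1;
      for (Phrase p : phrases) {
        String entry = matchAt(stack, i, p);
        if (entry != null) {
          result.add(entry);
          consumed = p.terms.length;
          break;
        }
      }
      i += consumed;
    }
    return result;
  }

  // The term stack from Step 2 of the sample.
  static List<TermInfo> sampleStack() {
    return Arrays.asList(
        new TermInfo("Lucene", 0, 6, 0),
        new TermInfo("search", 12, 18, 3),
        new TermInfo("library", 26, 33, 5));
  }

  public static void main(String[] args) {
    List<Phrase> phrases = Arrays.asList(
        new Phrase(2.0f, 0, "Lucene"),
        new Phrase(1.0f, 1, "search", "library"));
    for (String entry : build(sampleStack(), phrases)) {
      System.out.println(entry);
    }
  }
}
```

<p>With a slop of 0 instead of 1, the phrase "search library" no longer matches in this
sketch, because "engine" sits between the two terms.</p>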
<h3>Step 4.</h3>
<p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to
<code>FieldPhraseList</code>. In this sample case, the following
<code>FieldFragList</code> will be generated:</p>
<pre>
FieldFragList
+---------------------------------+
|"Lucene"[(0,6)]                  |
|"search library"[(12,18),(26,33)]|
|totalBoost=3                     |
+---------------------------------+
</pre>
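<p>The <code>totalBoost</code> value is the sum of the weights of the phrases that fall into one
fragment (2 + 1 = 3 in the sample). The sketch below shows that arithmetic with a very
simple grouping rule: a phrase that ends beyond the current window opens a new window.
<code>FragListSketch</code>, <code>Frag</code> and the <code>fragCharSize</code> parameter are
hypothetical here, and the pluggable FragListBuilder may group phrases differently.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Step 4: group phrases into fragments and sum their weights.
class FragListSketch {

  static final class Frag {
    final int start;
    final int end;
    float totalBoost;

    Frag(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  // phrases: {startOffset, endOffset} per phrase, sorted by start offset.
  // weights: weight of the phrase at the same index.
  static List<Frag> build(int[][] phrases, float[] weights, int fragCharSize) {
    List<Frag> frags = new ArrayList<Frag>();
    Frag current = null;
    for (int i = 0; i < phrases.length; i++) {
      // Open a new window when there is none yet or the phrase does not fit.
      if (current == null || phrases[i][1] > current.end) {
        current = new Frag(phrases[i][0], phrases[i][0] + fragCharSize);
        frags.add(current);
      }
      current.totalBoost += weights[i];
    }
    return frags;
  }

  public static void main(String[] args) {
    int[][] phrases = { {0, 6}, {12, 33} };  // "Lucene", "search library"
    float[] weights = { 2.0f, 1.0f };
    for (Frag f : build(phrases, weights, 100)) {
      System.out.println("[" + f.start + "," + f.end + ") totalBoost=" + f.totalBoost);
    }
  }
}
```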
<h3>Step 5.</h3>
<p>In Step 5, by using <code>FieldFragList</code> and the stored field data,
Fast Vector Highlighter creates highlighted snippets!</p>
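<p>Conceptually, this last step wraps the stored text in tags at the recorded offsets.
The sketch below shows the idea for the sample text. <code>SnippetSketch</code> and its
<code>highlight</code> method are hypothetical, the offsets are assumed to be sorted and
non-overlapping, and the pluggable FragmentsBuilder does more than this (for example,
selecting fragments and choosing tags).</p>

```java
// Hypothetical sketch of Step 5: insert tags around the matched offsets.
class SnippetSketch {

  // offsets: {start, end} pairs, sorted by start and non-overlapping.
  static String highlight(String text, int[][] offsets, String preTag, String postTag) {
    StringBuilder sb = new StringBuilder();
    int last = 0;
    for (int[] o : offsets) {
      sb.append(text, last, o[0]);                                 // text before the match
      sb.append(preTag).append(text, o[0], o[1]).append(postTag);  // the tagged match
      last = o[1];
    }
    sb.append(text, last, text.length());                          // remaining tail
    return sb.toString();
  }

  public static void main(String[] args) {
    int[][] offsets = { {0, 6}, {12, 18}, {26, 33} };
    System.out.println(
        highlight("Lucene is a search engine library.", offsets, "<b>", "</b>"));
  }
}
```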
</body>
</html>