| <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <body> |
| This is an another highlighter implementation. |
| |
| <h2>Features</h2> |
| <ul> |
| <li>fast for large docs</li> |
| <li>support N-gram fields</li> |
| <li>support phrase-unit highlighting with slops</li> |
| <li>need Java 1.5</li> |
| <li>highlight fields need to be stored with term vector positions and offsets</li> |
| <li>take into account query boost to score fragments</li> |
| <li>support colored highlight tags</li> |
| <li>pluggable FragListBuilder</li> |
| <li>pluggable FragmentsBuilder</li> |
| </ul> |
| |
| <h2>Algorithm</h2> |
| <p>To explain the algorithm, let's use the following sample text |
| (to be highlighted) and user query:</p> |
| |
| <table border=1> |
| <tr> |
| <td><b>Sample Text</b></td> |
| <td>Lucene is a search engine library.</td> |
| </tr> |
| <tr> |
| <td><b>User Query</b></td> |
| <td>Lucene^2 OR "search library"~1</td> |
| </tr> |
| </table> |
| |
| <p>The user query is a BooleanQuery that consists of TermQuery("Lucene") |
| with boost of 2 and PhraseQuery("search library") with slop of 1.</p> |
| <p>For your convenience, here is the offsets and positions info of the |
| sample text.</p> |
| |
| <pre> |
| +--------+-----------------------------------+ |
| | | 1111111111222222222233333| |
| | offset|01234567890123456789012345678901234| |
| +--------+-----------------------------------+ |
| |document|Lucene is a search engine library. | |
| +--------*-----------------------------------+ |
| |position|0 1 2 3 4 5 | |
| +--------*-----------------------------------+ |
| </pre> |
| |
| <h3>Step 1.</h3> |
| <p>In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query. |
| <code>QueryPhraseMap</code> consists of the following members:</p> |
| <pre class="prettyprint"> |
| public class QueryPhraseMap { |
| boolean terminal; |
| int slop; // valid if terminal == true and phraseHighlight == true |
| float boost; // valid if terminal == true |
| Map<String, QueryPhraseMap> subMap; |
| } |
| </pre> |
| <p><code>QueryPhraseMap</code> has subMap. The key of the subMap is a term |
| text in the user query and the value is a subsequent <code>QueryPhraseMap</code>. |
| If the query is a term (not phrase), then the subsequent <code>QueryPhraseMap</code> |
| is marked as terminal. If the query is a phrase, then the subsequent <code>QueryPhraseMap</code> |
| is not a terminal and it has the next term text in the phrase.</p> |
| |
| <p>From the sample user query, the following <code>QueryPhraseMap</code> |
| will be generated:</p> |
| <pre> |
| QueryPhraseMap |
| +--------+-+ +-------+-+ |
| |"Lucene"|o+->|boost=2|*| * : terminal |
| +--------+-+ +-------+-+ |
| |
| +--------+-+ +---------+-+ +-------+------+-+ |
| |"search"|o+->|"library"|o+->|boost=1|slop=1|*| |
| +--------+-+ +---------+-+ +-------+------+-+ |
| </pre> |
| |
| <h3>Step 2.</h3> |
| <p>In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack}. Fast Vector Highlighter uses {@link org.apache.lucene.index.TermFreqVector} data |
| (must be stored with term vector positions and offsets) |
| to generate it. <code>FieldTermStack</code> keeps the terms in the user query. |
| Therefore, in this sample case, Fast Vector Highlighter generates the following <code>FieldTermStack</code>:</p> |
| <pre> |
| FieldTermStack |
| +------------------+ |
| |"Lucene"(0,6,0) | |
| +------------------+ |
| |"search"(12,18,3) | |
| +------------------+ |
| |"library"(26,33,5)| |
| +------------------+ |
| where : "termText"(startOffset,endOffset,position) |
| </pre> |
| <h3>Step 3.</h3> |
| <p>In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList} |
| by reference to <code>QueryPhraseMap</code> and <code>FieldTermStack</code>.</p> |
| <pre> |
| FieldPhraseList |
| +----------------+-----------------+---+ |
| |"Lucene" |[(0,6)] |w=2| |
| +----------------+-----------------+---+ |
| |"search library"|[(12,18),(26,33)]|w=1| |
| +----------------+-----------------+---+ |
| </pre> |
| <p>The type of each entry is <code>WeightedPhraseInfo</code> that consists of |
| an array of terms offsets and weight. The weight (Fast Vector Highlighter uses query boost to |
| calculate the weight) will be taken into account when Fast Vector Highlighter creates |
| {@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p> |
| <h3>Step 4.</h3> |
| <p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to |
| <code>FieldPhraseList</code>. In this sample case, the following |
| <code>FieldFragList</code> will be generated:</p> |
| <pre> |
| FieldFragList |
| +---------------------------------+ |
| |"Lucene"[(0,6)] | |
| |"search library"[(12,18),(26,33)]| |
| |totalBoost=3 | |
| +---------------------------------+ |
| </pre> |
| <h3>Step 5.</h3> |
| <p>In Step 5, by using <code>FieldFragList</code> and the field stored data, |
| Fast Vector Highlighter creates highlighted snippets!</p> |
| </body> |
| </html> |