lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html - lucene-solr - Git at Google

 <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 <html>
 <body>
 This is an another highlighter implementation.

 <h2>Features</h2>
 <ul>
 <li>fast for large docs</li>
 <li>support N-gram fields</li>
 <li>support phrase-unit highlighting with slops</li>
 <li>need Java 1.5</li>
 <li>highlight fields need to be stored with term vector positions and offsets</li>
 <li>take into account query boost to score fragments</li>
 <li>support colored highlight tags</li>
 <li>pluggable FragListBuilder</li>
 <li>pluggable FragmentsBuilder</li>
 </ul>

 <h2>Algorithm</h2>
 <p>To explain the algorithm, let's use the following sample text
  (to be highlighted) and user query:</p>

 <table border=1>
 <tr>
 <td><b>Sample Text</b></td>
 <td>Lucene is a search engine library.</td>
 </tr>
 <tr>
 <td><b>User Query</b></td>
 <td>Lucene^2 OR "search library"~1</td>
 </tr>
 </table>

 <p>The user query is a BooleanQuery that consists of TermQuery("Lucene")
 with boost of 2 and PhraseQuery("search library") with slop of 1.</p>
 <p>For your convenience, here is the offsets and positions info of the
 sample text.</p>

 <pre>
 +--------+-----------------------------------+
 |        |          1111111111222222222233333|
 |  offset|01234567890123456789012345678901234|
 +--------+-----------------------------------+
 |document|Lucene is a search engine library. |
 +--------*-----------------------------------+
 |position|0      1  2 3      4      5        |
 +--------*-----------------------------------+
 </pre>

 <h3>Step 1.</h3>
 <p>In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query.
 <code>QueryPhraseMap</code> consists of the following members:</p>
 <pre class="prettyprint">
 public class QueryPhraseMap {
   boolean terminal;
   int slop;   // valid if terminal == true and phraseHighlight == true
   float boost;  // valid if terminal == true
   Map&lt;String, QueryPhraseMap&gt; subMap;
 }
 </pre>
 <p><code>QueryPhraseMap</code> has subMap. The key of the subMap is a term
 text in the user query and the value is a subsequent <code>QueryPhraseMap</code>.
 If the query is a term (not phrase), then the subsequent <code>QueryPhraseMap</code>
 is marked as terminal. If the query is a phrase, then the subsequent <code>QueryPhraseMap</code>
 is not a terminal and it has the next term text in the phrase.</p>

 <p>From the sample user query, the following <code>QueryPhraseMap</code>
 will be generated:</p>
 <pre>
    QueryPhraseMap
 +--------+-+  +-------+-+
 |"Lucene"|o+->|boost=2|*|  * : terminal
 +--------+-+  +-------+-+

 +--------+-+  +---------+-+  +-------+------+-+
 |"search"|o+->|"library"|o+->|boost=1|slop=1|*|
 +--------+-+  +---------+-+  +-------+------+-+
 </pre>

 <h3>Step 2.</h3>
 <p>In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack}. Fast Vector Highlighter uses {@link org.apache.lucene.index.TermFreqVector} data
 (must be stored with term vector positions and offsets)
 to generate it. <code>FieldTermStack</code> keeps the terms in the user query.
 Therefore, in this sample case, Fast Vector Highlighter generates the following <code>FieldTermStack</code>:</p>
 <pre>
    FieldTermStack
 +------------------+
 |"Lucene"(0,6,0)   |
 +------------------+
 |"search"(12,18,3) |
 +------------------+
 |"library"(26,33,5)|
 +------------------+
 where : "termText"(startOffset,endOffset,position)
 </pre>
 <h3>Step 3.</h3>
 <p>In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList}
 by reference to <code>QueryPhraseMap</code> and <code>FieldTermStack</code>.</p>
 <pre>
    FieldPhraseList
 +----------------+-----------------+---+
 |"Lucene"        |[(0,6)]          |w=2|
 +----------------+-----------------+---+
 |"search library"|[(12,18),(26,33)]|w=1|
 +----------------+-----------------+---+
 </pre>
 <p>The type of each entry is <code>WeightedPhraseInfo</code> that consists of
 an array of terms offsets and weight. The weight (Fast Vector Highlighter uses query boost to
 calculate the weight) will be taken into account when Fast Vector Highlighter creates
 {@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p>
 <h3>Step 4.</h3>
 <p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to
 <code>FieldPhraseList</code>. In this sample case, the following
 <code>FieldFragList</code> will be generated:</p>
 <pre>
    FieldFragList
 +---------------------------------+
 |"Lucene"[(0,6)]                  |
 |"search library"[(12,18),(26,33)]|
 |totalBoost=3                     |
 +---------------------------------+
 </pre>
 <h3>Step 5.</h3>
 <p>In Step 5, by using <code>FieldFragList</code> and the field stored data,
 Fast Vector Highlighter creates highlighted snippets!</p>
 </body>
 </html>
	<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	<html>
	<body>
	This is an another highlighter implementation.

	<h2>Features</h2>
	<ul>
	<li>fast for large docs</li>
	<li>support N-gram fields</li>
	<li>support phrase-unit highlighting with slops</li>
	<li>need Java 1.5</li>
	<li>highlight fields need to be stored with term vector positions and offsets</li>
	<li>take into account query boost to score fragments</li>
	<li>support colored highlight tags</li>
	<li>pluggable FragListBuilder</li>
	<li>pluggable FragmentsBuilder</li>
	</ul>

	<h2>Algorithm</h2>
	<p>To explain the algorithm, let's use the following sample text
	(to be highlighted) and user query:</p>

	<table border=1>
	<tr>
	<td><b>Sample Text</b></td>
	<td>Lucene is a search engine library.</td>
	</tr>
	<tr>
	<td><b>User Query</b></td>
	<td>Lucene^2 OR "search library"~1</td>
	</tr>
	</table>

	<p>The user query is a BooleanQuery that consists of TermQuery("Lucene")
	with boost of 2 and PhraseQuery("search library") with slop of 1.</p>
	<p>For your convenience, here is the offsets and positions info of the
	sample text.</p>

	<pre>
	+--------+-----------------------------------+
	\| \| 1111111111222222222233333\|
	\| offset\|01234567890123456789012345678901234\|
	+--------+-----------------------------------+
	\|document\|Lucene is a search engine library. \|
	+--------*-----------------------------------+
	\|position\|0 1 2 3 4 5 \|
	+--------*-----------------------------------+
	</pre>

	<h3>Step 1.</h3>
	<p>In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query.
	<code>QueryPhraseMap</code> consists of the following members:</p>
	<pre class="prettyprint">
	public class QueryPhraseMap {
	boolean terminal;
	int slop; // valid if terminal == true and phraseHighlight == true
	float boost; // valid if terminal == true
	Map<String, QueryPhraseMap> subMap;
	}
	</pre>
	<p><code>QueryPhraseMap</code> has subMap. The key of the subMap is a term
	text in the user query and the value is a subsequent <code>QueryPhraseMap</code>.
	If the query is a term (not phrase), then the subsequent <code>QueryPhraseMap</code>
	is marked as terminal. If the query is a phrase, then the subsequent <code>QueryPhraseMap</code>
	is not a terminal and it has the next term text in the phrase.</p>

	<p>From the sample user query, the following <code>QueryPhraseMap</code>
	will be generated:</p>
	<pre>
	QueryPhraseMap
	+--------+-+ +-------+-+
	\|"Lucene"\|o+->\|boost=2\|\| : terminal
	+--------+-+ +-------+-+

	+--------+-+ +---------+-+ +-------+------+-+
	\|"search"\|o+->\|"library"\|o+->\|boost=1\|slop=1\|*\|
	+--------+-+ +---------+-+ +-------+------+-+
	</pre>

	<h3>Step 2.</h3>
	<p>In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack}. Fast Vector Highlighter uses {@link org.apache.lucene.index.TermFreqVector} data
	(must be stored with term vector positions and offsets)
	to generate it. <code>FieldTermStack</code> keeps the terms in the user query.
	Therefore, in this sample case, Fast Vector Highlighter generates the following <code>FieldTermStack</code>:</p>
	<pre>
	FieldTermStack
	+------------------+
	\|"Lucene"(0,6,0) \|
	+------------------+
	\|"search"(12,18,3) \|
	+------------------+
	\|"library"(26,33,5)\|
	+------------------+
	where : "termText"(startOffset,endOffset,position)
	</pre>
	<h3>Step 3.</h3>
	<p>In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList}
	by reference to <code>QueryPhraseMap</code> and <code>FieldTermStack</code>.</p>
	<pre>
	FieldPhraseList
	+----------------+-----------------+---+
	\|"Lucene" \|[(0,6)] \|w=2\|
	+----------------+-----------------+---+
	\|"search library"\|[(12,18),(26,33)]\|w=1\|
	+----------------+-----------------+---+
	</pre>
	<p>The type of each entry is <code>WeightedPhraseInfo</code> that consists of
	an array of terms offsets and weight. The weight (Fast Vector Highlighter uses query boost to
	calculate the weight) will be taken into account when Fast Vector Highlighter creates
	{@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p>
	<h3>Step 4.</h3>
	<p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to
	<code>FieldPhraseList</code>. In this sample case, the following
	<code>FieldFragList</code> will be generated:</p>
	<pre>
	FieldFragList
	+---------------------------------+
	\|"Lucene"[(0,6)] \|
	\|"search library"[(12,18),(26,33)]\|
	\|totalBoost=3 \|
	+---------------------------------+
	</pre>
	<h3>Step 5.</h3>
	<p>In Step 5, by using <code>FieldFragList</code> and the field stored data,
	Fast Vector Highlighter creates highlighted snippets!</p>
	</body>
	</html>