| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Search.VectorHighlight |
| | Apache Lucene.NET 4.8.0-beta00008 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Search.VectorHighlight |
| | Apache Lucene.NET 4.8.0-beta00008 Documentation "> |
| <meta name="generator" content="docfx 2.50.0.0"> |
| |
| <link rel="shortcut icon" href="../../logo/favicon.ico"> |
| <link rel="stylesheet" href="../../styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="../../styles/docfx.css"> |
| <link rel="stylesheet" href="../../styles/main.css"> |
| <meta property="docfx:navrel" content="../../toc.html"> |
| <meta property="docfx:tocrel" content="../toc.html"> |
| |
| <meta property="docfx:rel" content="../../"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="../../index.html"> |
| <img id="logo" class="svg" src="../../logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search" id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Search.VectorHighlight"> |
| |
| <h1 id="Lucene_Net_Search_VectorHighlight" data-uid="Lucene.Net.Search.VectorHighlight" class="text-break">Namespace Lucene.Net.Search.VectorHighlight |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>This is another highlighter implementation: the Fast Vector Highlighter.</p> |
| <h2 id="features">Features</h2> |
| <ul> |
| <li><p>fast for large documents</p> |
| </li> |
| <li><p>supports N-gram fields</p> |
| </li> |
| <li><p>supports phrase-unit highlighting with slop</p> |
| </li> |
| <li><p>supports multi-term (wildcard, range, regexp, etc.) queries</p> |
| </li> |
| <li><p>requires Java 1.5 (a requirement of the upstream Java Lucene implementation)</p> |
| </li> |
| <li><p>fields to be highlighted must be indexed with term vectors that store positions and offsets</p> |
| </li> |
| <li><p>takes query boost and/or IDF weight into account when scoring fragments</p> |
| </li> |
| <li><p>supports colored highlight tags</p> |
| </li> |
| <li><p>pluggable FragListBuilder / FieldFragList</p> |
| </li> |
| <li><p>pluggable FragmentsBuilder</p> |
| </li> |
| </ul> |
| <h2 id="algorithm">Algorithm</h2> |
| <p>To explain the algorithm, let's use the following sample text (to be highlighted) and user query:</p> |
| <table border="1"> |
| <tr> |
| <td><strong>Sample Text</strong></td> |
| <td>Lucene is a search engine library.</td> |
| </tr> |
| <tr> |
| <td><strong>User Query</strong></td> |
| <td>Lucene^2 OR "search library"~1</td> |
| </tr> |
| </table> |
| |
| <p>The user query is a BooleanQuery that consists of TermQuery("Lucene") with a boost of 2 and PhraseQuery("search library") with a slop of 1.</p> |
| <p>For your convenience, here are the offsets and positions of the sample text.</p> |
| <pre><code>+--------+-----------------------------------+ |
| |        |          1111111111222222222233333| |
| |  offset|01234567890123456789012345678901234| |
| +--------+-----------------------------------+ |
| |document|Lucene is a search engine library. | |
| +--------*-----------------------------------+ |
| |position|0      1  2 3      4      5        | |
| +--------*-----------------------------------+ |
| </code></pre><h3 id="step-1">Step 1.</h3> |
| <p>In Step 1, Fast Vector Highlighter generates <a class="xref" href="../Lucene.Net.Highlighter/Lucene.Net.Search.VectorHighlight.FieldQuery.QueryPhraseMap.html">FieldQuery.QueryPhraseMap</a> from the user query. <code>QueryPhraseMap</code> consists of the following members:</p> |
| <pre><code>public class QueryPhraseMap { |
|   boolean terminal; |
|   int slop;   // valid if terminal == true and phraseHighlight == true |
|   float boost;  // valid if terminal == true |
|   Map&lt;String, QueryPhraseMap&gt; subMap; |
| } |
| </code></pre><p><code>QueryPhraseMap</code> has a <code>subMap</code>. Each key of the <code>subMap</code> is the text of a term in the user query, and the value is the subsequent <code>QueryPhraseMap</code>. If the query is a single term (not a phrase), the subsequent <code>QueryPhraseMap</code> is marked as terminal. If the query is a phrase, the subsequent <code>QueryPhraseMap</code> is not terminal and holds the next term of the phrase in its own <code>subMap</code>; only the <code>QueryPhraseMap</code> reached after the last term of the phrase is terminal.</p> |
| <p>From the sample user query, the following <code>QueryPhraseMap</code> will be generated:</p> |
| <pre><code>           QueryPhraseMap |
| +--------+-+  +-------+-+ |
| |"Lucene"|o+->|boost=2|*|  * : terminal |
| +--------+-+  +-------+-+ |
| |
| +--------+-+  +---------+-+  +-------+------+-+ |
| |"search"|o+->|"library"|o+->|boost=1|slop=1|*| |
| +--------+-+  +---------+-+  +-------+------+-+ |
| </code></pre> |
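<p>The structure described in Step 1 can be sketched as a small trie keyed by term text. This is only an illustration under stated assumptions: <code>Node</code>, <code>add</code> and <code>buildSample</code> are hypothetical names invented for this sketch, not the actual <code>FieldQuery.QueryPhraseMap</code> API, and the sample query is hard-coded rather than parsed.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryPhraseMapSketch {
    /** Hypothetical, simplified stand-in for QueryPhraseMap (illustration only). */
    static class Node {
        boolean terminal;
        int slop;     // meaningful only when terminal is true
        float boost;  // meaningful only when terminal is true
        final Map<String, Node> subMap = new HashMap<>();

        /** Adds a single term (one-element list) or a phrase; only the last node becomes terminal. */
        void add(List<String> terms, int slop, float boost) {
            Node current = this;
            for (String term : terms) {
                current = current.subMap.computeIfAbsent(term, t -> new Node());
            }
            current.terminal = true;
            current.slop = slop;
            current.boost = boost;
        }
    }

    /** Builds the map for the sample query: Lucene^2 OR "search library"~1. */
    static Node buildSample() {
        Node root = new Node();
        root.add(Arrays.asList("Lucene"), 0, 2.0f);
        root.add(Arrays.asList("search", "library"), 1, 1.0f);
        return root;
    }

    public static void main(String[] args) {
        Node root = buildSample();
        Node lucene = root.subMap.get("Lucene");
        Node search = root.subMap.get("search");
        Node library = search.subMap.get("library");
        System.out.println("Lucene: terminal=" + lucene.terminal + ", boost=" + lucene.boost);
        System.out.println("search: terminal=" + search.terminal);
        System.out.println("library: terminal=" + library.terminal + ", slop=" + library.slop);
    }
}
```

<p>In this sketch the node reached through <code>"search"</code> is not terminal, so a lone occurrence of "search" in the text would not count as a match for the phrase.</p>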
| <h3 id="step-2">Step 2.</h3> |
| <p>In Step 2, Fast Vector Highlighter generates <a class="xref" href="../Lucene.Net.Highlighter/Lucene.Net.Search.VectorHighlight.FieldTermStack.html">FieldTermStack</a>. To generate it, Fast Vector Highlighter uses term vector data, which must be stored with offsets and positions (see <a class="xref" href="../Lucene.Net/Lucene.Net.Documents.FieldType.html">#setStoreTermVectorOffsets(boolean)</a> and <a class="xref" href="../Lucene.Net/Lucene.Net.Documents.FieldType.html">#setStoreTermVectorPositions(boolean)</a>). <code>FieldTermStack</code> keeps only the terms of the text that also appear in the user query. Therefore, in this sample case, Fast Vector Highlighter generates the following <code>FieldTermStack</code>:</p> |
| <pre><code>   FieldTermStack |
| +------------------+ |
| |"Lucene"(0,6,0)   | |
| +------------------+ |
| |"search"(12,18,3) | |
| +------------------+ |
| |"library"(26,33,5)| |
| +------------------+ |
| where : "termText"(startOffset,endOffset,position) |
| </code></pre><h3 id="step-3">Step 3.</h3> |
| <p>In Step 3, Fast Vector Highlighter generates <a class="xref" href="../Lucene.Net.Highlighter/Lucene.Net.Search.VectorHighlight.FieldPhraseList.html">FieldPhraseList</a> from <code>QueryPhraseMap</code> and <code>FieldTermStack</code>.</p> |
| <pre><code>       FieldPhraseList |
| +----------------+-----------------+---+ |
| |"Lucene"        |[(0,6)]          |w=2| |
| +----------------+-----------------+---+ |
| |"search library"|[(12,18),(26,33)]|w=1| |
| +----------------+-----------------+---+ |
| </code></pre><p>Each entry is a <code>WeightedPhraseInfo</code>, which consists of an array of term offsets and a weight.</p> |
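<p>Steps 2 and 3 can be sketched together as follows. The term stack is hard-coded from the Step 2 diagram rather than read from term vectors, and the matching loop is a simplified approximation: it walks the phrase map term by term and accepts a phrase only if each gap between consecutive positions stays within the slop. All names (<code>Node</code>, <code>TermInfo</code>, <code>withinSlop</code>, <code>buildPhraseList</code>) are hypothetical, and the exact slop rule is an assumption of this sketch, not taken from the text above.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FieldPhraseListSketch {
    /** Hypothetical stand-in for QueryPhraseMap. */
    static class Node {
        boolean terminal; int slop; float boost;
        final Map<String, Node> subMap = new HashMap<>();
    }

    /** One FieldTermStack entry: "termText"(startOffset,endOffset,position). */
    static class TermInfo {
        final String text; final int start, end, position;
        TermInfo(String text, int start, int end, int position) {
            this.text = text; this.start = start; this.end = end; this.position = position;
        }
    }

    /** Phrase map for the sample query: Lucene^2 OR "search library"~1. */
    static Node sampleQueryMap() {
        Node root = new Node();
        Node lucene = new Node(); lucene.terminal = true; lucene.boost = 2.0f;
        Node search = new Node();
        Node library = new Node(); library.terminal = true; library.slop = 1; library.boost = 1.0f;
        root.subMap.put("Lucene", lucene);
        root.subMap.put("search", search);
        search.subMap.put("library", library);
        return root;
    }

    /** Term stack copied from the Step 2 diagram. */
    static List<TermInfo> sampleTermStack() {
        return Arrays.asList(new TermInfo("Lucene", 0, 6, 0),
                             new TermInfo("search", 12, 18, 3),
                             new TermInfo("library", 26, 33, 5));
    }

    /** Assumed slop rule: each gap between consecutive term positions may exceed 1 by at most slop. */
    static boolean withinSlop(List<TermInfo> terms, int slop) {
        for (int i = 1; i < terms.size(); i++) {
            int gap = terms.get(i).position - terms.get(i - 1).position - 1;
            if (Math.abs(gap) > slop) return false;
        }
        return true;
    }

    /** Formats one entry as: "text" [(start,end),...] w=weight */
    static String describe(List<TermInfo> matched, float boost) {
        StringBuilder text = new StringBuilder();
        StringBuilder offsets = new StringBuilder();
        for (TermInfo t : matched) {
            if (text.length() > 0) { text.append(' '); offsets.append(','); }
            text.append(t.text);
            offsets.append('(').append(t.start).append(',').append(t.end).append(')');
        }
        return "\"" + text + "\" [" + offsets + "] w=" + boost;
    }

    /** Walks the term stack and follows the phrase map as far as consecutive terms allow. */
    static List<String> buildPhraseList(Node root, List<TermInfo> stack) {
        List<String> phrases = new ArrayList<>();
        int i = 0;
        while (i < stack.size()) {
            Node node = root.subMap.get(stack.get(i).text);
            if (node == null) { i++; continue; }
            List<TermInfo> matched = new ArrayList<>();
            matched.add(stack.get(i));
            int j = i + 1;
            while (j < stack.size() && node.subMap.containsKey(stack.get(j).text)) {
                node = node.subMap.get(stack.get(j).text);
                matched.add(stack.get(j));
                j++;
            }
            if (node.terminal && withinSlop(matched, node.slop)) {
                phrases.add(describe(matched, node.boost));
                i = j;  // the matched terms are consumed
            } else {
                i++;    // no valid term or phrase starts here
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        for (String phrase : buildPhraseList(sampleQueryMap(), sampleTermStack())) {
            System.out.println(phrase);
        }
    }
}
```

<p>For the sample data this yields two entries, one for "Lucene" with weight 2 and one for "search library" with weight 1, matching the <code>FieldPhraseList</code> diagram.</p>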
| <h3 id="step-4">Step 4.</h3> |
| <p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> from <code>FieldPhraseList</code>. In this sample case, the following <code>FieldFragList</code> will be generated:</p> |
| <pre><code>       FieldFragList |
| +---------------------------------+ |
| |"Lucene"[(0,6)]                  | |
| |"search library"[(12,18),(26,33)]| |
| |totalBoost=3                     | |
| +---------------------------------+ |
| </code></pre><p>The calculation for each <code>FieldFragList.WeightedFragInfo.totalBoost</code> (weight)<br>depends on the implementation of <code>FieldFragList.add( ... )</code>:</p> |
| <pre><code>  public void add( int startOffset, int endOffset, List&lt;WeightedPhraseInfo&gt; phraseInfoList ) { |
|     float totalBoost = 0; |
|     List&lt;SubInfo&gt; subInfos = new ArrayList&lt;SubInfo&gt;(); |
|     for( WeightedPhraseInfo phraseInfo : phraseInfoList ){ |
|       subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) ); |
|       totalBoost += phraseInfo.getBoost(); |
|     } |
|     getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) ); |
|   } |
| </code></pre><p>The <code>FieldFragList</code> implementation that is used is chosen in <code>BaseFragListBuilder.createFieldFragList( ... )</code>:</p> |
| <pre><code>  public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){ |
|     return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize ); |
|   } |
| </code></pre><p>Currently, two approaches are available:</p> |
| <ul> |
| <li><p><code>SimpleFragListBuilder</code> using <code>SimpleFieldFragList</code>: <em>sum-of-boosts</em> approach. The totalBoost is calculated by summing the query boosts of the terms. By default, each term has a boost of 1.0.</p> |
| </li> |
| <li><p><code>WeightedFragListBuilder</code> using <code>WeightedFieldFragList</code>: <em>sum-of-distinct-weights</em> approach. The totalBoost is calculated by summing the IDF weights of the distinct terms.</p> |
| </li> |
| </ul> |
| <p>Comparison of the two approaches:</p> |
| <table border="1"> |
| <caption> |
| query = das alte testament (The Old Testament) |
| </caption> |
| <tr><th>Terms in fragment</th><th>sum-of-distinct-weights</th><th>sum-of-boosts</th></tr> |
| <tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr> |
| <tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr> |
| <tr><td>das testament alte</td><td>5.339621</td><td>3.0</td></tr> |
| <tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr> |
| <tr><td>das testament</td><td>2.9455688</td><td>2.0</td></tr> |
| <tr><td>das alte</td><td>2.4759595</td><td>2.0</td></tr> |
| <tr><td>das das das das</td><td>1.5015357</td><td>4.0</td></tr> |
| <tr><td>das das das</td><td>1.3003681</td><td>3.0</td></tr> |
| <tr><td>das das</td><td>1.061746</td><td>2.0</td></tr> |
| <tr><td>alte</td><td>1.0</td><td>1.0</td></tr> |
| <tr><td>alte</td><td>1.0</td><td>1.0</td></tr> |
| <tr><td>das</td><td>0.7507678</td><td>1.0</td></tr> |
| <tr><td>das</td><td>0.7507678</td><td>1.0</td></tr> |
| <tr><td>das</td><td>0.7507678</td><td>1.0</td></tr> |
| <tr><td>das</td><td>0.7507678</td><td>1.0</td></tr> |
| <tr><td>das</td><td>0.7507678</td><td>1.0</td></tr> |
| </table> |
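<p>The two scoring approaches can be compared with a small sketch. The per-term weights for "das" and "alte" are the values the table shows for single-term fragments. The square-root length factor in <code>sumOfDistinctWeights</code> is an assumption inferred from the table rows (for example "das das" and "das alte"); it is not stated in the text. The class and method names are hypothetical.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragmentScoreSketch {
    // Per-term weights read from the single-term rows of the comparison table.
    static final Map<String, Double> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("das", 0.7507678);
        WEIGHTS.put("alte", 1.0);
    }

    /** sum-of-boosts: every term occurrence contributes its query boost (1.0 by default). */
    static double sumOfBoosts(List<String> terms) {
        return terms.size() * 1.0;
    }

    /**
     * sum-of-distinct-weights: each distinct term contributes its weight once.
     * ASSUMPTION: the sum is scaled by sqrt(number of term occurrences),
     * a factor inferred from the table values rather than given in the text.
     */
    static double sumOfDistinctWeights(List<String> terms) {
        Set<String> distinct = new HashSet<>(terms);
        double sum = 0.0;
        for (String term : distinct) {
            sum += WEIGHTS.get(term);
        }
        return sum * Math.sqrt(terms.size());
    }

    public static void main(String[] args) {
        List<List<String>> fragments = Arrays.asList(
                Arrays.asList("das", "alte"),
                Arrays.asList("das", "das"),
                Arrays.asList("das"));
        for (List<String> fragment : fragments) {
            System.out.println(fragment + " distinct-weights=" + sumOfDistinctWeights(fragment)
                    + " boosts=" + sumOfBoosts(fragment));
        }
    }
}
```

<p>Under these assumptions, repeating a common term raises the sum-of-boosts score linearly, while the sum-of-distinct-weights score grows much more slowly, which is why "das das das das" ranks below "das alte" in the weighted column.</p>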
| |
| <h3 id="step-5">Step 5.</h3> |
| <p>In Step 5, by using <code>FieldFragList</code> and the field stored data, Fast Vector Highlighter creates highlighted snippets!</p> |
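<p>A minimal sketch of this last step, assuming the fragment covers the whole stored text and the term offsets are sorted and do not overlap: the stored text is copied and each offset range is wrapped in a pair of tags. The method name <code>highlight</code> and the <code>&lt;b&gt;</code> tags are choices made for this illustration only.</p>

```java
public class SnippetSketch {
    /**
     * Wraps each [start, end) offset range of the text in the given tags.
     * Offsets are assumed to be sorted and non-overlapping.
     */
    static String highlight(String text, int[][] offsets, String preTag, String postTag) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] offset : offsets) {
            sb.append(text, last, offset[0]);   // unhighlighted text before the term
            sb.append(preTag);
            sb.append(text, offset[0], offset[1]);
            sb.append(postTag);
            last = offset[1];
        }
        sb.append(text, last, text.length());   // remaining tail of the text
        return sb.toString();
    }

    public static void main(String[] args) {
        String stored = "Lucene is a search engine library.";
        // Offsets taken from the FieldFragList of the sample case.
        int[][] offsets = { {0, 6}, {12, 18}, {26, 33} };
        System.out.println(highlight(stored, offsets, "<b>", "</b>"));
        // prints: <b>Lucene</b> is a <b>search</b> engine <b>library</b>.
    }
}
```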
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Search.VectorHighlight.BreakIteratorBoundaryScanner.html">BreakIteratorBoundaryScanner</a></h4> |
| <section><p>An <a class="xref" href="../Lucene.Net.Highlighter/Lucene.Net.Search.VectorHighlight.IBoundaryScanner.html">IBoundaryScanner</a> implementation that uses <span class="xref">ICU4N.Text.BreakIterator</span> to find |
| boundaries in the text.</p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="../../styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="../../styles/docfx.js"></script> |
| <script type="text/javascript" src="../../styles/main.js"></script> |
| </body> |
| </html> |