| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Codecs.Lucene41 |
| | Apache Lucene.NET 4.8.0 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Codecs.Lucene41 |
| | Apache Lucene.NET 4.8.0 Documentation "> |
| <meta name="generator" content="docfx 2.47.0.0"> |
| |
| <link rel="shortcut icon" href="../../logo/favicon.ico"> |
| <link rel="stylesheet" href="../../styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="../../styles/docfx.css"> |
| <link rel="stylesheet" href="../../styles/main.css"> |
| <meta property="docfx:navrel" content="../../toc.html"> |
| <meta property="docfx:tocrel" content="../toc.html"> |
| |
| <meta property="docfx:rel" content="../../"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="../../index.html"> |
| <img id="logo" class="svg" src="../../logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search" id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Codecs.Lucene41"> |
| |
| <h1 id="Lucene_Net_Codecs_Lucene41" data-uid="Lucene.Net.Codecs.Lucene41" class="text-break">Namespace Lucene.Net.Codecs.Lucene41 |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Support for testing <a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41Codec.html">Lucene41Codec</a>.</p> |
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41Codec.html">Lucene41Codec</a></h4> |
| <section><p>Implements the Lucene 4.1 index format, with configurable per-field postings formats. |
| <p> |
| If you want to reuse functionality of this codec in another codec, extend |
| <a class="xref" href="Lucene.Net.Codecs.FilterCodec.html">FilterCodec</a>. |
| <p> |
| See <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Codecs.Lucene41.html">Lucene.Net.Codecs.Lucene41</a> package documentation for file format details. |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41PostingsBaseFormat.html">Lucene41PostingsBaseFormat</a></h4> |
| <section><p>Provides a <a class="xref" href="Lucene.Net.Codecs.PostingsReaderBase.html">PostingsReaderBase</a> and |
| <a class="xref" href="Lucene.Net.Codecs.PostingsWriterBase.html">PostingsWriterBase</a>. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41PostingsFormat.html">Lucene41PostingsFormat</a></h4> |
| <section><p>Lucene 4.1 postings format, which encodes postings in packed integer blocks |
| for fast decode.</p> |
| <p><strong>NOTE</strong>: this format is still experimental and |
| subject to change without backwards compatibility. |
| |
| <p> |
| Basic idea: |
| <ul><li> |
| <strong>Packed Blocks and VInt Blocks</strong>: |
| <p>In packed blocks, integers are encoded with the same bit width in packed format (<a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html">PackedInt32s</a>): |
| the block size (i.e. the number of integers inside a block) is fixed (currently 128). Additionally, blocks |
| whose values are all the same are encoded in an optimized way.</p> |
| <p>In VInt blocks, integers are encoded as VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>): |
| the block size is variable.</p> |
| </li><li> |
| <strong>Block structure</strong>: |
| <p>When the postings are long enough, Lucene41PostingsFormat will try to encode most integer data |
| as a packed block.</p> |
| <p>Take a term with 259 documents as an example: the first 256 document ids are encoded as two packed |
| blocks, while the remaining 3 are encoded as one VInt block. </p> |
| <p>Different kinds of data are always encoded separately into different packed blocks, but may |
| be interleaved into the same VInt block. </p> |
| <p>This strategy is applied to the following tuples: |
| <document number, frequency>, |
| <position, payload length>, |
| <position, offset start, offset length>, and |
| <position, payload length, offset start, offset length>.</p> |
| </li><li> |
| <strong>Skipdata settings</strong>: |
| <p>The structure of the skip table is quite similar to that of previous versions of Lucene. The skip interval is the |
| same as the block size, and each skip entry points to the beginning of a block. However, for |
| the first block, skip data is omitted.</p> |
| </li><li> |
| <strong>Positions, Payloads, and Offsets</strong>: |
| <p>A position is an integer indicating where the term occurs within one document. |
| A payload is a blob of metadata associated with the current position. |
| An offset is a pair of integers indicating the tokenized start/end offsets for the given term |
| at the current position: it is essentially a specialized payload. </p> |
| <p>When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a |
| null payload contributes one count). As mentioned in block structure, it is possible to encode |
| these three either combined or separately.</p> |
| <p>In all cases, payloads and offsets are stored together. When encoded as a packed block, |
| position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload |
| metadata will also be stored directly in .pay). When encoded as VInt blocks, all three are |
| stored interleaved in the .pos file (as is payload metadata).</p> |
| <p>With this strategy, the majority of payload and offset data will be outside the .pos file. |
| So for queries that require only position data, running on a full index with payloads and offsets, |
| this reduces disk pre-fetches.</p> |
| </li></ul> |
| </p> |
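<p>The packed/VInt split described above can be sketched as follows. This is a minimal illustration, not Lucene.NET code: the helper name <code>split_postings</code> is hypothetical, Python is used only for brevity, and the block size of 128 is the value stated above.</p>

```python
PACKED_BLOCK_SIZE = 128  # fixed packed block size stated in the text


def split_postings(count, block_size=PACKED_BLOCK_SIZE):
    """Return (number of full packed blocks, number of values left for the VInt block)."""
    packed_blocks = count // block_size
    vint_values = count - packed_blocks * block_size
    return packed_blocks, vint_values


# The example from the text: 259 documents give two packed blocks plus 3 VInt values.
print(split_postings(259))  # (2, 3)
```

<p>A term with fewer than 128 postings yields no packed block at all, so everything lands in the VInt block.</p>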
| |
| <p> |
| Files and detailed format: |
| <ul><li><code>.tim</code>: <a href="#Termdictionary">Term Dictionary</a></li><li><code>.tip</code>: <a href="#Termindex">Term Index</a></li><li><code>.doc</code>: <a href="#Frequencies">Frequencies and Skip Data</a></li><li><code>.pos</code>: <a href="#Positions">Positions</a></li><li><code>.pay</code>: <a href="#Payloads">Payloads and Offsets</a></li></ul> |
| </p> |
| |
| <p><a name="Termdictionary" id="Termdictionary"></a> |
| <dl> |
| <dd> |
| <strong>Term Dictionary</strong><p> |
| <p>The .tim file contains the list of terms in each |
| field along with per-term statistics (such as docfreq) |
| and pointers to the frequencies, positions, payload and |
| skip data in the .doc, .pos, and .pay files. |
| See <a class="xref" href="Lucene.Net.Codecs.BlockTreeTermsWriter.html">BlockTreeTermsWriter</a> for more details on the format. |
| </p> |
| |
| <p>NOTE: The term dictionary can plug into different postings implementations: |
| the postings writer/reader are actually responsible for encoding |
| and decoding the PostingsHeader and TermMetadata sections described here:</p> |
| |
| <p><ul><li>PostingsHeader --> Header, PackedBlockSize</li><li>TermMetadata --> (DocFPDelta|SingletonDocID), PosFPDelta?, PosVIntBlockFPDelta?, PayFPDelta?, |
| SkipFPDelta?</li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) </li><li>PackedBlockSize, SingletonDocID --> VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>DocFPDelta, PosFPDelta, PayFPDelta, PosVIntBlockFPDelta, SkipFPDelta --> VLong (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt64_System_Int64_">WriteVInt64(Int64)</a>) </li><li>Footer --> CodecFooter (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteFooter_Lucene_Net_Store_IndexOutput_">WriteFooter(IndexOutput)</a>) </li></ul> |
| <p>Notes:</p> |
| <ul><li>Header is a CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) storing the version information |
| for the postings.</li><li>PackedBlockSize is the fixed block size for packed blocks. In a packed block, the bit width is |
| determined by the largest integer. A smaller block size results in smaller variance among the widths |
| of integers, hence smaller indexes. A larger block size results in more efficient bulk I/O, hence |
| better acceleration. This value should always be a multiple of 64; it is currently fixed at 128 as |
| a tradeoff. It is also the skip interval used to accelerate <a class="xref" href="Lucene.Net.Search.DocIdSetIterator.html#Lucene_Net_Search_DocIdSetIterator_Advance_System_Int32_">Advance(Int32)</a>.</li><li>DocFPDelta determines the position of this term's TermFreqs within the .doc file. |
| In particular, it is the difference in file offset between this term's |
| data and the previous term's data (or zero, for the first term in the block). On disk it is |
| stored as the difference from the previous value in the sequence. </li><li>PosFPDelta determines the position of this term's TermPositions within the .pos file, |
| while PayFPDelta determines the position of this term's <TermPayloads, TermOffsets?> within |
| the .pay file. Similar to DocFPDelta, each is the difference between two file positions (or is |
| omitted, for fields that omit payloads and offsets).</li><li>PosVIntBlockFPDelta determines the position of this term's last TermPosition in the last pos packed |
| block within the .pos file. It is a synonym for PayVIntBlockFPDelta or OffsetVIntBlockFPDelta. |
| This is actually used to indicate whether it is necessary to load the following |
| payloads and offsets from .pos instead of .pay. Every time a new block of positions is to be |
| loaded, the PostingsReader will use this value to check whether the current block is in packed format |
| or VInt format. For packed format, payloads and offsets are fetched from .pay, otherwise from .pos. |
| (This value is omitted when the total number of positions, i.e. totalTermFreq, is less than or equal |
| to PackedBlockSize.)</li><li>SkipFPDelta determines the position of this term's SkipData within the .doc |
| file. In particular, it is the length of the TermFreqs data. |
| SkipFPDelta is only stored if DocFreq is not smaller than SkipMinimum |
| (i.e. 128 in Lucene41PostingsFormat).</li><li>SingletonDocID is an optimization when a term only appears in one document. In this case, instead |
| of writing a file pointer to the .doc file (DocFPDelta), and then a VIntBlock at that location, the |
| single document ID is written to the term dictionary.</li></ul> |
| </dd> |
| </dl></p> |
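<p>Several of the fields above (PackedBlockSize, SingletonDocID and the file-pointer deltas) are stored as VInt or VLong values. The sketch below illustrates the general variable-length layout: seven data bits per byte, low-order group first, with the high bit set when more bytes follow. It is an illustration only and does not call the Lucene.NET API; the function names <code>encode_vint</code> and <code>decode_vint</code> are hypothetical, and only non-negative values are handled.</p>

```python
def encode_vint(value):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    assert value >= 0, "this sketch handles non-negative values only"
    out = bytearray()
    while value > 127:
        out.append(value % 128 + 128)  # low 7 bits, continuation bit set
        value //= 128
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def decode_vint(data):
    """Decode one value produced by encode_vint from the start of data."""
    value = 0
    scale = 1
    for byte in data:
        value += (byte % 128) * scale
        scale *= 128
        if byte // 128 == 0:  # continuation bit clear: last byte of this value
            break
    return value


print(encode_vint(300).hex())          # ac02
print(decode_vint(encode_vint(300)))   # 300
```

<p>Small values such as a delta between neighbouring file pointers fit in a single byte, which is one reason the format stores differences rather than absolute offsets.</p>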
| <p><a name="Termindex" id="Termindex"></a> |
| <dl> |
| <dd> |
| <strong>Term Index</strong> |
| <p>The .tip file contains an index into the term dictionary, so that it can be |
| accessed randomly. See <a class="xref" href="Lucene.Net.Codecs.BlockTreeTermsWriter.html">BlockTreeTermsWriter</a> for more details on the format.</p> |
| </dd> |
| </dl></p> |
| <p><a name="Frequencies" id="Frequencies"></a> |
| <dl> |
| <dd> |
| <strong>Frequencies and Skip Data</strong><p> |
| <p>The .doc file contains the lists of documents which contain each term, along |
| with the frequency of the term in that document (except when frequencies are |
| omitted: <a class="xref" href="Lucene.Net.Index.IndexOptions.html#Lucene_Net_Index_IndexOptions_DOCS_ONLY">DOCS_ONLY</a>). It also saves skip data to the beginning of |
| each packed or VInt block, when the length of the document list is larger than the packed block size.</p> |
| |
| <p><ul><li>docFile(.doc) --> Header, <TermFreqs, SkipData?><sup>TermCount</sup>, Footer</li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>)</li><li>TermFreqs --> <PackedBlock> <sup>PackedDocBlockNum</sup>, |
| VIntBlock? </li><li>PackedBlock --> PackedDocDeltaBlock, PackedFreqBlock?</li><li>VIntBlock --> <DocDelta[, Freq?]><sup>DocFreq-PackedBlockSize*PackedDocBlockNum</sup></li><li>SkipData --> <<SkipLevelLength, SkipLevel> |
| <sup>NumSkipLevels-1</sup>, SkipLevel>, SkipDatum?</li><li>SkipLevel --> <SkipDatum> <sup>TrimmedDocFreq/(PackedBlockSize^(Level + 1))</sup></li><li>SkipDatum --> DocSkip, DocFPSkip, <PosFPSkip, PosBlockOffset, PayLength?, |
| PayFPSkip?>?, SkipChildLevelPointer?</li><li>PackedDocDeltaBlock, PackedFreqBlock --> PackedInts (<a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html">PackedInt32s</a>) </li><li>DocDelta, Freq, DocSkip, DocFPSkip, PosFPSkip, PosBlockOffset, PayByteUpto, PayFPSkip |
| --> |
| VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>SkipChildLevelPointer --> VLong (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt64_System_Int64_">WriteVInt64(Int64)</a>) </li><li>Footer --> CodecFooter (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteFooter_Lucene_Net_Store_IndexOutput_">WriteFooter(IndexOutput)</a>) </li></ul> |
| <p>Notes:</p> |
| <ul><li>PackedDocDeltaBlock is theoretically generated in two steps: |
| <ol><li>Calculate the difference between each document number and the previous one, |
| and get a d-gaps list (for the first document, use the absolute value); </li><li>For the d-gaps from the first one to the (PackedDocBlockNum*PackedBlockSize)<sup>th</sup>, |
| encode them separately as packed blocks.</li></ol> |
| If frequencies are not omitted, PackedFreqBlock will be generated without the d-gap step. |
| </li><li>VIntBlock stores remaining d-gaps (along with frequencies when possible) with a format |
| that encodes DocDelta and Freq: |
| <p>DocDelta: if frequencies are indexed, this determines both the document |
| number and the frequency. In particular, DocDelta/2 is the difference between |
| this document number and the previous document number (or zero when this is the |
| first document in a TermFreqs). When DocDelta is odd, the frequency is one. |
| When DocDelta is even, the frequency is read as another VInt. If frequencies |
| are omitted, DocDelta contains the gap (not multiplied by 2) between document |
| numbers and no frequency information is stored.</p> |
| <p>For example, the TermFreqs for a term which occurs once in document seven |
| and three times in document eleven, with frequencies indexed, would be the |
| following sequence of VInts:</p> |
| <p>15, 8, 3</p> |
| <p>If frequencies were omitted (<a class="xref" href="Lucene.Net.Index.IndexOptions.html#Lucene_Net_Index_IndexOptions_DOCS_ONLY">DOCS_ONLY</a>) it would be this |
| sequence of VInts instead:</p> |
| <p>7,4</p> |
| </li><li>PackedDocBlockNum is the number of packed blocks for the current term's docids or frequencies. |
| In particular, PackedDocBlockNum = floor(DocFreq/PackedBlockSize) </li><li>TrimmedDocFreq = DocFreq % PackedBlockSize == 0 ? DocFreq - 1 : DocFreq. |
| We use this trick since the definition of a skip entry is a little different from the base interface. |
| In <a class="xref" href="Lucene.Net.Codecs.MultiLevelSkipListWriter.html">MultiLevelSkipListWriter</a>, skip data is assumed to be saved for the |
| skipInterval<sup>th</sup>, 2*skipInterval<sup>th</sup> ... posting in the list. However, |
| in Lucene41PostingsFormat, the skip data is saved for the skipInterval+1<sup>th</sup>, |
| 2*skipInterval+1<sup>th</sup> ... posting (skipInterval==PackedBlockSize in this case). |
| When DocFreq is a multiple of PackedBlockSize, MultiLevelSkipListWriter will expect one |
| more skip entry than Lucene41SkipWriter. </li><li>SkipDatum is the metadata of one skip entry. |
| For the first block (whether packed or VInt), it is omitted.</li><li>DocSkip records the document number of every PackedBlockSize<sup>th</sup> document in |
| the postings (i.e. last document number in each packed block). On disk it is stored as the |
| difference from the previous value in the sequence. </li><li>DocFPSkip records the file offset of each block (excluding the first), i.e. of the posting at |
| PackedBlockSize+1<sup>th</sup>, 2*PackedBlockSize+1<sup>th</sup> ... , in DocFile. |
| The file offsets are relative to the start of current term's TermFreqs. |
| On disk it is also stored as the difference from the previous SkipDatum in the sequence.</li><li>Since positions and payloads are also block encoded, the skip should first move to the related block, |
| then fetch the values according to the in-block offset. PosFPSkip and PayFPSkip record the file |
| offsets of the related block in .pos and .pay, respectively, while PosBlockOffset indicates |
| which value to fetch inside the related block (PayBlockOffset is unnecessary since it is always |
| equal to PosBlockOffset). As with DocFPSkip, the file offsets are relative to the start of |
| the current term's TermFreqs, and stored as a difference sequence.</li><li>PayByteUpto indicates the start offset of the current payload. It is equivalent to |
| the sum of the payload lengths in the current block up to PosBlockOffset.</li></ul> |
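<p>The DocDelta/Freq rule for VIntBlock in the notes above can be sketched as follows. This is a minimal illustration of the described encoding, not the Lucene.NET writer: the function name <code>encode_vint_block</code> and its input shape (a list of <code>(doc_id, freq)</code> pairs in increasing document order) are assumptions made for the example, and the result is the sequence of integers that would each be written as a VInt.</p>

```python
def encode_vint_block(postings, freqs_indexed=True):
    """Return the integers to be written as VInts for a list of (doc_id, freq) pairs."""
    out = []
    previous_doc = 0
    for doc_id, freq in postings:
        gap = doc_id - previous_doc
        previous_doc = doc_id
        if not freqs_indexed:
            out.append(gap)          # DOCS_ONLY: plain gap, no frequency
        elif freq == 1:
            out.append(gap * 2 + 1)  # odd DocDelta: frequency is one
        else:
            out.append(gap * 2)      # even DocDelta: frequency follows
            out.append(freq)
    return out


# The example from the text: once in document 7, three times in document 11.
print(encode_vint_block([(7, 1), (11, 3)]))                       # [15, 8, 3]
print(encode_vint_block([(7, 1), (11, 3)], freqs_indexed=False))  # [7, 4]
```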
| </dd> |
| </dl></p> |
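<p>The PackedDocBlockNum, TrimmedDocFreq and SkipLevel formulas from the Frequencies and Skip Data notes can be evaluated directly. The sketch below only transcribes those formulas; the function names are hypothetical and the default block size of 128 is the value stated in this document.</p>

```python
def packed_doc_block_num(doc_freq, block_size=128):
    """PackedDocBlockNum = floor(DocFreq / PackedBlockSize)."""
    return doc_freq // block_size


def trimmed_doc_freq(doc_freq, block_size=128):
    """TrimmedDocFreq = DocFreq % PackedBlockSize == 0 ? DocFreq - 1 : DocFreq."""
    return doc_freq - 1 if doc_freq % block_size == 0 else doc_freq


def skip_entries_at_level(doc_freq, level, block_size=128):
    """Number of SkipDatum entries at a level: TrimmedDocFreq / PackedBlockSize^(level + 1)."""
    return trimmed_doc_freq(doc_freq, block_size) // (block_size ** (level + 1))


for doc_freq in (128, 256, 259):
    print(doc_freq, packed_doc_block_num(doc_freq), skip_entries_at_level(doc_freq, 0))
# 128 1 0
# 256 2 1
# 259 2 2
```

<p>With 256 documents there are two packed blocks but only one level-0 skip entry, which matches the statement that skip data is omitted for the first block.</p>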
| <p><a name="Positions" id="Positions"></a> |
| <dl> |
| <dd> |
| <strong>Positions</strong> |
| <p>The .pos file contains the lists of positions that each term occurs at within documents. It also |
| sometimes stores part of payloads and offsets for speedup.</p> |
| <ul><li>PosFile(.pos) --> Header, <TermPositions> <sup>TermCount</sup>, Footer</li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) </li><li>TermPositions --> <PackedPosDeltaBlock> <sup>PackedPosBlockNum</sup>, |
| VIntBlock? </li><li>VIntBlock --> <PositionDelta[, PayloadLength?], PayloadData?, |
| OffsetDelta?, OffsetLength?><sup>PosVIntCount</sup></li><li>PackedPosDeltaBlock --> PackedInts (<a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html">PackedInt32s</a>)</li><li>PositionDelta, OffsetDelta, OffsetLength --> |
| VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>PayloadData --> byte (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteByte_System_Byte_">WriteByte(Byte)</a>)<sup>PayLength</sup></li><li>Footer --> CodecFooter (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteFooter_Lucene_Net_Store_IndexOutput_">WriteFooter(IndexOutput)</a>) </li></ul> |
| <p>Notes:</p> |
| <ul><li>TermPositions are ordered by term (terms are implicit, from the term dictionary), and position |
| values for each term document pair are incremental, and ordered by document number.</li><li>PackedPosBlockNum is the number of packed blocks for the current term's positions, payloads or offsets. |
| In particular, PackedPosBlockNum = floor(totalTermFreq/PackedBlockSize) </li><li>PosVIntCount is the number of positions encoded in VInt format. In particular, |
| PosVIntCount = totalTermFreq - PackedPosBlockNum*PackedBlockSize</li><li>PackedPosDeltaBlock is generated by the same procedure as PackedDocDeltaBlock |
| in the section <a href="#Frequencies">Frequencies and Skip Data</a>.</li><li>PositionDelta is, if payloads are disabled for the term's field, the |
| difference between the position of the current occurrence in the document and |
| the previous occurrence (or zero, if this is the first occurrence in this |
| document). If payloads are enabled for the term's field, then PositionDelta/2 |
| is the difference between the current and the previous position. If payloads |
| are enabled and PositionDelta is odd, then PayloadLength is stored, indicating |
| the length of the payload at the current term position.</li><li>For example, the TermPositions for a term which occurs as the fourth term in |
| one document, and as the fifth and ninth term in a subsequent document, would |
| be the following sequence of VInts (payloads disabled): |
| <p>4, 5, 4</p></li><li>PayloadData is metadata associated with the current term position. If |
| PayloadLength is stored at the current position, then it indicates the length |
| of this payload. If PayloadLength is not stored, then this payload has the same |
| length as the payload at the previous position.</li><li>OffsetDelta/2 is the difference between this position's startOffset and that of the |
| previous occurrence (or zero, if this is the first occurrence in this document). |
| If OffsetDelta is odd, then the length (endOffset-startOffset) differs from the |
| previous occurrence and an OffsetLength follows. Offset data is only written for |
| <a class="xref" href="Lucene.Net.Index.IndexOptions.html#Lucene_Net_Index_IndexOptions_DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS">DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS</a>.</li></ul> |
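<p>For the payloads-disabled case, the PositionDelta rule above can be sketched as follows. This is an illustration of the described delta scheme only; the function name <code>encode_position_deltas</code> and the input shape (one list of positions per document, in document order) are assumptions made for the example.</p>

```python
def encode_position_deltas(docs_positions):
    """Return PositionDelta values for a term, restarting the delta in every document."""
    out = []
    for positions in docs_positions:
        previous = 0  # the first occurrence in a document is relative to zero
        for position in positions:
            out.append(position - previous)
            previous = position
    return out


# The example from the text: position 4 in one document, positions 5 and 9 in the next.
print(encode_position_deltas([[4], [5, 9]]))  # [4, 5, 4]
```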
| </dd> |
| </dl></p> |
| <p><a name="Payloads" id="Payloads"></a> |
| <dl> |
| <dd> |
| <strong>Payloads and Offsets</strong> |
| <p>The .pay file will store payloads and offsets associated with certain term-document positions. |
| Some payloads and offsets will be separated out into .pos file, for performance reasons.</p> |
| <ul><li>PayFile(.pay) --> Header, <TermPayloads, TermOffsets?> <sup>TermCount</sup>, Footer</li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) </li><li>TermPayloads --> <PackedPayLengthBlock, SumPayLength, PayData> <sup>PackedPayBlockNum</sup></li><li>TermOffsets --> <PackedOffsetStartDeltaBlock, PackedOffsetLengthBlock> <sup>PackedPayBlockNum</sup></li><li>PackedPayLengthBlock, PackedOffsetStartDeltaBlock, PackedOffsetLengthBlock --> PackedInts (<a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html">PackedInt32s</a>) </li><li>SumPayLength --> VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>PayData --> byte (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteByte_System_Byte_">WriteByte(Byte)</a>) <sup>SumPayLength</sup></li><li>Footer --> CodecFooter (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteFooter_Lucene_Net_Store_IndexOutput_">WriteFooter(IndexOutput)</a>) </li></ul> |
| <p>Notes:</p> |
| <ul><li>The order of TermPayloads/TermOffsets is the same as that of TermPositions. Note that part of the |
| payloads/offsets is stored in .pos.</li><li>PackedPayLengthBlock and PackedOffsetLengthBlock are generated by the |
| same procedure as PackedFreqBlock in the section <a href="#Frequencies">Frequencies and Skip Data</a>, |
| while PackedOffsetStartDeltaBlock follows the same procedure as PackedDocDeltaBlock.</li><li>PackedPayBlockNum is always equal to PackedPosBlockNum, for the same term. It is also a synonym |
| for PackedOffsetBlockNum.</li><li>SumPayLength is the total length of payloads written within one block; it should be the sum |
| of PayLengths in one packed block.</li><li>PayLength in PackedPayLengthBlock is the length of each payload associated with the current |
| position.</li></ul> |
| </dd> |
| </dl> |
| </p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41PostingsReader.html">Lucene41PostingsReader</a></h4> |
| <section><p>Concrete class that reads the docId (and possibly frq, pos, offset, payloads) list |
| with the postings format. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41PostingsWriter.html">Lucene41PostingsWriter</a></h4> |
| <section><p>Concrete class that writes the docId (and possibly frq, pos, offset, payloads) list |
| with the postings format. |
| <p> |
| The postings list for each term will be stored separately. |
| <p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41PostingsWriter.Int32BlockTermState.html">Lucene41PostingsWriter.Int32BlockTermState</a></h4> |
| <section><p>NOTE: This was IntBlockTermState in Lucene</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Codecs.Lucene41.Lucene41StoredFieldsFormat.html">Lucene41StoredFieldsFormat</a></h4> |
| <section><p>Lucene 4.1 stored fields format.</p> |
| <p><p><strong>Principle</strong></p> |
| <p>This <a class="xref" href="Lucene.Net.Codecs.StoredFieldsFormat.html">StoredFieldsFormat</a> compresses blocks of 16KB of documents in |
| order to improve the compression ratio compared to document-level |
| compression. It uses the <a href="http://code.google.com/p/lz4/">LZ4</a> |
| compression algorithm, which is fast to compress and very fast to decompress |
| data. Although the compression method that is used focuses more on speed |
| than on compression ratio, it should provide interesting compression ratios |
| for redundant inputs (such as log files, HTML or plain text).</p> |
| <p><strong>File formats</strong></p> |
| <p>Stored fields are represented by two files:</p> |
| <ol><li><a name="field_data" id="field_data"></a> |
| <p>A fields data file (extension <code>.fdt</code>). This file stores a compact |
| representation of documents in compressed blocks of 16KB or more. When |
| writing a segment, documents are appended to an in-memory <code>byte[]</code> |
| buffer. When its size reaches 16KB or more, some metadata about the documents |
| is flushed to disk, immediately followed by a compressed representation of |
| the buffer using the |
| <a href="http://code.google.com/p/lz4/">LZ4</a> |
| <a href="http://fastcompression.blogspot.fr/2011/05/lz4-explained.html">compression format</a>.</p> |
| <p>Here is a more detailed description of the field data file format:</p> |
| <ul><li>FieldData (.fdt) --> <Header>, PackedIntsVersion, <Chunk><sup>ChunkCount</sup></li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) </li><li>PackedIntsVersion --> <a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html#Lucene_Net_Util_Packed_PackedInt32s_VERSION_CURRENT">VERSION_CURRENT</a> as a VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>ChunkCount is not known in advance and is the number of chunks necessary to store all documents of the segment</li><li>Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs></li><li>DocBase --> the ID of the first document of the chunk as a VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>ChunkDocs --> the number of documents in the chunk as a VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>DocFieldCounts --> the number of stored fields of every document in the chunk, encoded as follows: |
| <ul><li>if chunkDocs=1, the unique value is encoded as a VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>else read a VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) (let's call it <code>bitsRequired</code>) |
| <ul><li>if <code>bitsRequired</code> is <code>0</code> then all values are equal, and the common value is the following VInt (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt32_System_Int32_">WriteVInt32(Int32)</a>) </li><li>else <code>bitsRequired</code> is the number of bits required to store any value, and values are stored in a packed (<a class="xref" href="Lucene.Net.Util.Packed.PackedInt32s.html">PackedInt32s</a>) array where every value is stored on exactly <code>bitsRequired</code> bits</li></ul> |
| </li></ul> |
| </li><li>DocLengths --> the lengths of all documents in the chunk, encoded with the same method as DocFieldCounts</li><li>CompressedDocs --> a compressed representation of <Docs> using the LZ4 compression format</li><li>Docs --> <Doc><sup>ChunkDocs</sup></li><li>Doc --> <FieldNumAndType, Value><sup>DocFieldCount</sup></li><li>FieldNumAndType --> a VLong (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteVInt64_System_Int64_">WriteVInt64(Int64)</a>), whose 3 last bits are Type and other bits are FieldNum</li><li>Type --> |
| <ul><li>0: Value is String</li><li>1: Value is BinaryValue</li><li>2: Value is Int</li><li>3: Value is Float</li><li>4: Value is Long</li><li>5: Value is Double</li><li>6, 7: unused</li></ul> |
| </li><li>FieldNum --> an ID of the field</li><li>Value --> String (<a class="xref" href="Lucene.Net.Store.DataOutput.html#Lucene_Net_Store_DataOutput_WriteString_System_String_">WriteString(String)</a>) | BinaryValue | Int | Float | Long | Double depending on Type</li><li>BinaryValue --> ValueLength <Byte><sup>ValueLength</sup></li></ul> |
| <p>Notes</p> |
| <ul><li>If documents are larger than 16KB then chunks will likely contain only |
| one document. However, documents can never spread across several chunks (all |
| fields of a single document are in the same chunk).</li><li>When at least one document in a chunk is large enough so that the chunk |
| is larger than 32KB, the chunk will actually be compressed in several LZ4 |
| blocks of 16KB. This allows <a class="xref" href="Lucene.Net.Index.StoredFieldVisitor.html">StoredFieldVisitor</a>s which are only |
| interested in the first fields of a document to not have to decompress 10MB |
| of data if the document is 10MB, but only 16KB.</li><li>Given that the original lengths are written in the metadata of the chunk, |
| the decompressor can leverage this information to stop decoding as soon as |
| enough data has been decompressed.</li><li>In case documents are incompressible, CompressedDocs will be less than |
| 0.5% larger than Docs.</li></ul> |
| </li><li><a name="field_index" id="field_index"></a> |
| <p>A fields index file (extension <code>.fdx</code>).</p> |
| <ul><li>FieldsIndex (.fdx) --> <Header>, <ChunkIndex></li><li>Header --> CodecHeader (<a class="xref" href="Lucene.Net.Codecs.CodecUtil.html#Lucene_Net_Codecs_CodecUtil_WriteHeader_Lucene_Net_Store_DataOutput_System_String_System_Int32_">WriteHeader(DataOutput, String, Int32)</a>) </li><li>ChunkIndex: See <a class="xref" href="Lucene.Net.Codecs.Compressing.CompressingStoredFieldsIndexWriter.html">CompressingStoredFieldsIndexWriter</a></li></ul> |
| </li></ol> |
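<p>Two of the encodings in the field data description above can be sketched briefly: the FieldNumAndType value (Type in the last 3 bits, FieldNum in the remaining bits) and the three-way choice used for DocFieldCounts and DocLengths. This is an illustration under assumptions, not the Lucene.NET implementation: all function names are hypothetical, and <code>encode_counts</code> returns a description of what would be written rather than actual packed bytes.</p>

```python
def pack_field_num_and_type(field_num, type_code):
    """Combine FieldNum and Type into one value: Type occupies the last 3 bits."""
    assert type_code in range(6), "types 6 and 7 are unused"
    return field_num * 8 + type_code


def unpack_field_num_and_type(value):
    """Split a FieldNumAndType value back into (field_num, type_code)."""
    return divmod(value, 8)


def encode_counts(values):
    """Describe how a list of per-document counts or lengths would be written."""
    if len(values) == 1:
        return ("single", values[0])          # chunkDocs=1: one VInt
    if all(v == values[0] for v in values):
        return ("all_equal", 0, values[0])    # bitsRequired 0, then the common value
    bits_required = max(values).bit_length()  # bits needed for the largest value
    return ("packed", bits_required, list(values))


print(pack_field_num_and_type(5, 2))   # 42
print(unpack_field_num_and_type(42))   # (5, 2)
print(encode_counts([1, 5, 3]))        # ('packed', 3, [1, 5, 3])
```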
| <p><strong>Known limitations</strong></p> |
| <p>This <a class="xref" href="Lucene.Net.Codecs.StoredFieldsFormat.html">StoredFieldsFormat</a> does not support individual documents |
| larger than (<code>2<sup>31</sup> - 2<sup>14</sup></code>) bytes. In case this |
| is a problem, you should use another format, such as |
| <a class="xref" href="Lucene.Net.Codecs.Lucene40.Lucene40StoredFieldsFormat.html">Lucene40StoredFieldsFormat</a>.</p></p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="../../styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="../../styles/docfx.js"></script> |
| <script type="text/javascript" src="../../styles/main.js"></script> |
| </body> |
| </html> |