| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta content="Apache Forrest" name="Generator"> |
| <meta name="Forrest-version" content="0.8"> |
| <meta name="Forrest-skin-name" content="lucene"> |
| <title> |
| Apache Lucene - Index File Formats |
| </title> |
| <link type="text/css" href="skin/basic.css" rel="stylesheet"> |
| <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> |
| <link media="print" type="text/css" href="skin/print.css" rel="stylesheet"> |
| <link type="text/css" href="skin/profile.css" rel="stylesheet"> |
| <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script> |
| <link rel="shortcut icon" href="images/favicon.ico"> |
| </head> |
| <body onload="init()"> |
| <script type="text/javascript">ndeSetTextSize();</script> |
| <div id="main"> |
| <!--+ |
| |start content |
| +--> |
| <div id="content"> |
| <h1> |
| Apache Lucene - Index File Formats |
| </h1> |
| <div id="minitoc-area"> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Index File Formats">Index File Formats</a> |
| </li> |
| <li> |
| <a href="#Definitions">Definitions</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Inverted Indexing">Inverted Indexing</a> |
| </li> |
| <li> |
| <a href="#Types of Fields">Types of Fields</a> |
| </li> |
| <li> |
| <a href="#Segments">Segments</a> |
| </li> |
| <li> |
| <a href="#Document Numbers">Document Numbers</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Overview">Overview</a> |
| </li> |
| <li> |
| <a href="#File Naming">File Naming</a> |
| </li> |
| <li> |
| <a href="#file-names">Summary of File Extensions</a> |
| </li> |
| <li> |
| <a href="#Primitive Types">Primitive Types</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Byte">Byte</a> |
| </li> |
| <li> |
| <a href="#UInt32">UInt32</a> |
| </li> |
| <li> |
| <a href="#Uint64">UInt64</a> |
| </li> |
| <li> |
| <a href="#VInt">VInt</a> |
| </li> |
| <li> |
| <a href="#Chars">Chars</a> |
| </li> |
| <li> |
| <a href="#String">String</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Compound Types">Compound Types</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#MapStringString">Map&lt;String,String&gt;</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Per-Index Files">Per-Index Files</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Segments File">Segments File</a> |
| </li> |
| <li> |
| <a href="#Lock File">Lock File</a> |
| </li> |
| <li> |
| <a href="#Deletable File">Deletable File</a> |
| </li> |
| <li> |
| <a href="#Compound Files">Compound Files</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Per-Segment Files">Per-Segment Files</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Fields">Fields</a> |
| </li> |
| <li> |
| <a href="#Term Dictionary">Term Dictionary</a> |
| </li> |
| <li> |
| <a href="#Frequencies">Frequencies</a> |
| </li> |
| <li> |
| <a href="#Positions">Positions</a> |
| </li> |
| <li> |
| <a href="#Normalization Factors">Normalization Factors</a> |
| </li> |
| <li> |
| <a href="#Term Vectors">Term Vectors</a> |
| </li> |
| <li> |
| <a href="#Deleted Documents">Deleted Documents</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Limitations">Limitations</a> |
| </li> |
| </ul> |
| </div> |
| |
| <a name="N1000C"></a><a name="Index File Formats"></a> |
| <h2 class="boxed">Index File Formats</h2> |
| <div class="section"> |
| <p> |
| This document defines the index file formats used |
| in Lucene version 2.9. If you are using a different |
| version of Lucene, please consult the copy of |
| <span class="codefrag">docs/fileformats.html</span> |
| that was distributed |
| with the version you are using. |
| </p> |
| <p> |
| Apache Lucene is written in Java, but several |
| efforts are underway to write |
| <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions |
| of Lucene in other programming |
| languages</a>. If these versions are to remain compatible with Apache |
| Lucene, then a language-independent definition of the Lucene index |
| format is required. This document thus attempts to provide a |
| complete and independent definition of the Apache Lucene 2.9 file |
| formats. |
| </p> |
| <p> |
| As Lucene evolves, this document should evolve. |
| Versions of Lucene in different programming languages should endeavor |
| to agree on file formats, and generate new versions of this document. |
| </p> |
| <p> |
| Compatibility notes are provided in this document, |
| describing how file formats have changed from prior versions. |
| </p> |
| <p> |
| In version 2.1, the file format was changed to allow |
| lock-less commits (i.e., the commit lock is no longer used). The |
| change is fully backwards compatible: you can open a |
| pre-2.1 index for searching or for adding and deleting |
| documents. When the new segments file is saved |
| (committed), it will be written in the new file format |
| (meaning no specific "upgrade" process is needed). |
| But note that once a commit has occurred, pre-2.1 |
| Lucene will not be able to read the index. |
| </p> |
| <p> |
| In version 2.3, the file format was changed to allow |
| segments to share a single set of doc store (vectors &amp; |
| stored fields) files. This allows for faster indexing |
| in certain cases. The change is fully backwards |
| compatible (in the same way as the lock-less commits |
| change in 2.1). |
| </p> |
| </div> |
| |
| |
| <a name="N1002B"></a><a name="Definitions"></a> |
| <h2 class="boxed">Definitions</h2> |
| <div class="section"> |
| <p> |
| The fundamental concepts in Lucene are index, |
| document, field and term. |
| </p> |
| <p> |
| An index contains a sequence of documents. |
| </p> |
| <ul> |
| |
| <li> |
| |
| <p> |
| A document is a sequence of fields. |
| </p> |
| |
| </li> |
| |
| |
| <li> |
| |
| <p> |
| A field is a named sequence of terms. |
| </p> |
| |
| </li> |
| |
| |
| <li> |
| A term is a string. |
| </li> |
| |
| </ul> |
| <p> |
| The same string in two different fields is |
| considered a different term. Thus terms are represented as a pair of |
| strings, the first naming the field, and the second naming text |
| within the field. |
| </p> |
| <a name="N1004B"></a><a name="Inverted Indexing"></a> |
| <h3 class="boxed">Inverted Indexing</h3> |
| <p> |
| The index stores statistics about terms in order |
| to make term-based search more efficient. Lucene's |
| index belongs to the family of indexes known as <i>inverted |
| indexes</i>. This is because it can list, for a term, the documents that contain |
| it. This is the inverse of the natural relationship, in which |
| documents list terms. |
| </p> |
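| <p>The idea can be illustrated with a minimal in-memory sketch. This is not how Lucene stores its index on disk; the function name <span class="codefrag">build_inverted_index</span>, the whitespace tokenizer and the sample documents are assumptions made only for illustration. Terms are keyed by a (field name, text) pair, as described in the definitions above.</p> |

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each (field name, term text) pair to the sorted document numbers containing it.

    `documents` is a list of dicts (field name -> field text); a document's
    number is its position in the list. Hypothetical helper, not a Lucene API.
    """
    postings = defaultdict(set)
    for doc_number, fields in enumerate(documents):
        for field_name, text in fields.items():
            # Naive whitespace tokenizer: an assumption for this sketch.
            for token in text.lower().split():
                postings[(field_name, token)].add(doc_number)
    return {term: sorted(doc_numbers) for term, doc_numbers in postings.items()}


documents = [
    {"title": "lucene index", "body": "an index of terms"},
    {"title": "file formats", "body": "lucene file formats"},
]
index = build_inverted_index(documents)
```

| <p>In this sketch the string "lucene" in the title field and the same string in the body field are distinct terms, each with its own list of documents.</p> |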
| <a name="N10057"></a><a name="Types of Fields"></a> |
| <h3 class="boxed">Types of Fields</h3> |
| <p> |
| In Lucene, fields may be <i>stored</i>, in which |
| case their text is stored in the index literally, in a non-inverted |
| manner. Fields that are inverted are called <i>indexed</i>. A field |
| may be both stored and indexed.</p> |
| <p>The text of a field may be <i>tokenized</i> into terms to be |
| indexed, or the text of a field may be used literally as a term to be indexed. |
| Most fields are |
| tokenized, but sometimes it is useful for certain identifier fields |
| to be indexed literally. |
| </p> |
| <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on fields.</p> |
| <a name="N10074"></a><a name="Segments"></a> |
| <h3 class="boxed">Segments</h3> |
| <p> |
| Lucene indexes may be composed of multiple sub-indexes, or |
| <i>segments</i>. Each segment is a fully independent index, which could be searched |
| separately. Indexes evolve by: |
| </p> |
| <ol> |
| |
| <li> |
| |
| <p>Creating new segments for newly added documents.</p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Merging existing segments.</p> |
| |
| </li> |
| |
| </ol> |
| <p> |
| Searches may involve multiple segments and/or multiple indexes, each |
| index potentially composed of a set of segments. |
| </p> |
| <a name="N10092"></a><a name="Document Numbers"></a> |
| <h3 class="boxed">Document Numbers</h3> |
| <p> |
| Internally, Lucene refers to documents by an integer <i>document |
| number</i>. The first document added to an index is numbered zero, and each |
| subsequent document added gets a number one greater than the previous. |
| </p> |
| <p> |
| Note that a document's number may change, so caution should be taken |
| when storing these numbers outside of Lucene. In particular, numbers may |
| change in the following situations: |
| </p> |
| <ul> |
| |
| <li> |
| |
| <p> |
| The |
| numbers stored in each segment are unique only within the segment, |
| and must be converted before they can be used in a larger context. |
| The standard technique is to allocate each segment a range of |
| values, based on the range of numbers used in that segment. To |
| convert a document number from a segment to an external value, the |
| segment's <i>base</i> document |
| number is added. To convert an external value back to a |
| segment-specific value, the segment is identified by the range that |
| the external value is in, and the segment's base value is |
| subtracted. For example, two five-document segments might be |
| combined, so that the first segment has a base value of zero, and |
| the second of five. Document three from the second segment would |
| have an external value of eight. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p> |
| When documents are deleted, gaps are created |
| in the numbering. These are eventually removed as the index evolves |
| through merging. Deleted documents are dropped when segments are |
| merged. A freshly-merged segment thus has no gaps in its numbering. |
| </p> |
| |
| </li> |
| |
| </ul> |
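| <p>The base-value technique described above can be sketched as follows. The function names are hypothetical and the sketch assumes each segment's range is simply its document count; it is meant to illustrate the arithmetic, not Lucene's actual implementation.</p> |

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base document number of each segment, given the segment sizes."""
    bases = []
    next_base = 0
    for size in segment_sizes:
        bases.append(next_base)
        next_base += size
    return bases


def to_external(bases, segment_index, local_doc):
    """Convert a segment-local document number to an external value by adding the base."""
    return bases[segment_index] + local_doc


def to_segment(bases, external_doc):
    """Find the segment whose range holds the external value, then subtract its base."""
    segment_index = bisect_right(bases, external_doc) - 1
    return segment_index, external_doc - bases[segment_index]


# Two five-document segments: bases are zero and five.
bases = segment_bases([5, 5])
```

| <p>With these bases, document three of the second segment maps to the external value eight, and converting eight back yields the second segment and local number three. The sketch does not check that an external value is below the total document count.</p> |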
| </div> |
| |
| |
| <a name="N100B9"></a><a name="Overview"></a> |
| <h2 class="boxed">Overview</h2> |
| <div class="section"> |
| <p> |
| Each segment index maintains the following: |
| </p> |
| <ul> |
| |
| <li> |
| |
| <p>Field names. This |
| contains the set of field names used in the index. |
| |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Stored Field |
| values. This contains, for each document, a list of attribute-value |
| pairs, where the attributes are field names. These are used to |
| store auxiliary information about the document, such as its title, |
| URL, or an identifier to access a |
| database. The set of stored fields is what is returned for each hit |
| when searching. This is keyed by document number. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Term dictionary. |
| A dictionary containing all of the terms used in all of the indexed |
| fields of all of the documents. The dictionary also contains the |
| number of documents which contain the term, and pointers to the |
| term's frequency and proximity data. |
| </p> |
| |
| </li> |
| |
| |
| <li> |
| |
| <p>Term Frequency |
| data. For each term in the dictionary, the numbers of all the |
| documents that contain that term, and the frequency of the term in |
| that document if omitTf is false. |
| </p> |
| |
| </li> |
| |
| |
| <li> |
| |
| <p>Term Proximity |
| data. For each term in the dictionary, the positions at which the term |
| occurs in each document. Note that this will |
| not exist if all fields in all documents set |
| omitTf to true. |
| </p> |
| |
| </li> |
| |
| |
| <li> |
| |
| <p>Normalization |
| factors. For each field in each document, a value is stored that is |
| multiplied into the score for hits on that field. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Term Vectors. For each field in each document, the term vector |
| (sometimes called document vector) may be stored. A term vector consists |
| of term text and term frequency. To add Term Vectors to your index see the |
| <a href="api/core/org/apache/lucene/document/Field.html">Field</a> |
| constructors. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Deleted documents. |
| An optional file indicating which documents are deleted. |
| </p> |
| |
| </li> |
| |
| </ul> |
| <p>Details on each of these are provided in subsequent sections. |
| </p> |
| </div> |
| |
| |
| <a name="N100FC"></a><a name="File Naming"></a> |
| <h2 class="boxed">File Naming</h2> |
| <div class="section"> |
| <p> |
| All files belonging to a segment have the same name with varying |
| extensions. The extensions correspond to the different file formats |
| described below. When using the Compound File format (default in 1.4 and greater) these files are |
| collapsed into a single .cfs file (see below for details). |
| </p> |
| <p> |
| Typically, all segments |
| in an index are stored in a single directory, although this is not |
| required. |
| </p> |
| <p> |
| As of version 2.1 (lock-less commits), file names are |
| never re-used (there is one exception, "segments.gen", |
| see below). That is, when any file is saved to the |
| Directory it is given a never-before-used file name. |
| This is achieved using a simple generations approach. |
| For example, the first segments file is segments_1, |
| then segments_2, etc. The generation is a sequential |
| long integer represented in alphanumeric (base 36) |
| form. |
| </p> |
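| <p>As a rough sketch of the naming scheme, the generation can be rendered in base 36 and appended to the <span class="codefrag">segments_</span> prefix. The helper names below are hypothetical, and the use of lowercase letters for digits above nine is an assumption; the text above only states that the form is alphanumeric base 36.</p> |

```python
# Digit alphabet for base 36; lowercase letters are an assumption of this sketch.
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(value):
    """Render a non-negative integer in base 36."""
    if value < 0:
        raise ValueError("generation must be non-negative")
    if value == 0:
        return "0"
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


def segments_file_name(generation):
    """Build the segments file name for a given generation (hypothetical helper)."""
    return "segments_" + to_base36(generation)
```

| <p>Under these assumptions, generations 1 and 2 give segments_1 and segments_2, generation 10 gives segments_a, and generation 36 gives segments_10.</p> |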
| </div> |
| |
| <a name="N1010B"></a><a name="file-names"></a> |
| <h2 class="boxed">Summary of File Extensions</h2> |
| <div class="section"> |
| <p>The following table summarizes the names and extensions of the files in Lucene: |
| <table class="ForrestTable" cellspacing="1" cellpadding="4"> |
| |
| <tr> |
| |
| <th>Name</th> |
| <th>Extension</th> |
| <th>Brief Description</th> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Segments File">Segments File</a></td> |
| <td>segments.gen, segments_N</td> |
| <td>Stores information about segments</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Lock File">Lock File</a></td> |
| <td>write.lock</td> |
| <td>The Write lock prevents multiple IndexWriters from writing to the same file.</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Compound Files">Compound File</a></td> |
| <td>.cfs</td> |
| <td>An optional "virtual" file consisting of all the other index files for systems |
| that frequently run out of file handles.</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Fields">Fields</a></td> |
| <td>.fnm</td> |
| <td>Stores information about the fields</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#field_index">Field Index</a></td> |
| <td>.fdx</td> |
| <td>Contains pointers to field data</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#field_data">Field Data</a></td> |
| <td>.fdt</td> |
| <td>The stored fields for documents</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#tis">Term Infos</a></td> |
| <td>.tis</td> |
| <td>Part of the term dictionary, stores term info</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#tii">Term Info Index</a></td> |
| <td>.tii</td> |
| <td>The index into the Term Infos file</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Frequencies">Frequencies</a></td> |
| <td>.frq</td> |
| <td>Contains the list of docs which contain each term along with frequency</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Positions">Positions</a></td> |
| <td>.prx</td> |
| <td>Stores position information about where a term occurs in the index</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Normalization Factors">Norms</a></td> |
| <td>.nrm</td> |
| <td>Encodes length and boost factors for docs and fields</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#tvx">Term Vector Index</a></td> |
| <td>.tvx</td> |
| <td>Stores offset into the document data file</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#tvd">Term Vector Documents</a></td> |
| <td>.tvd</td> |
| <td>Contains information about each document that has term vectors</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#tvf">Term Vector Fields</a></td> |
| <td>.tvf</td> |
| <td>The field level info about term vectors</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td><a href="#Deleted Documents">Deleted Documents</a></td> |
| <td>.del</td> |
| <td>Info about which documents are deleted</td> |
| |
| </tr> |
| |
| </table> |
| |
| |
| </p> |
| </div> |
| |
| |
| <a name="N101F5"></a><a name="Primitive Types"></a> |
| <h2 class="boxed">Primitive Types</h2> |
| <div class="section"> |
| <a name="N101FA"></a><a name="Byte"></a> |
| <h3 class="boxed">Byte</h3> |
| <p> |
| The most primitive type |
| is an eight-bit byte. Files are accessed as sequences of bytes. All |
| other data types are defined as sequences |
| of bytes, so file formats are byte-order independent. |
| </p> |
| <a name="N10203"></a><a name="UInt32"></a> |
| <h3 class="boxed">UInt32</h3> |
| <p> |
| 32-bit unsigned integers are written as four |
| bytes, high-order bytes first. |
| </p> |
| <p> |
| UInt32 --&gt; &lt;Byte&gt;<sup>4</sup> |
| |
| </p> |
| <a name="N10212"></a><a name="Uint64"></a> |
| <h3 class="boxed">UInt64</h3> |
| <p> |
| 64-bit unsigned integers are written as eight |
| bytes, high-order bytes first. |
| </p> |
| <p>UInt64 --&gt; &lt;Byte&gt;<sup>8</sup> |
| |
| </p> |
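| <p>A minimal sketch of the two fixed-width types, using Python's standard <span class="codefrag">struct</span> module with big-endian format codes so that high-order bytes come first. The function names are illustrative assumptions, not part of any Lucene API.</p> |

```python
import struct


def write_uint32(value):
    """Encode a 32-bit unsigned integer as four bytes, high-order byte first."""
    return struct.pack(">I", value)


def read_uint32(data, offset=0):
    """Decode four bytes starting at `offset` as a 32-bit unsigned integer."""
    return struct.unpack_from(">I", data, offset)[0]


def write_uint64(value):
    """Encode a 64-bit unsigned integer as eight bytes, high-order byte first."""
    return struct.pack(">Q", value)


def read_uint64(data, offset=0):
    """Decode eight bytes starting at `offset` as a 64-bit unsigned integer."""
    return struct.unpack_from(">Q", data, offset)[0]
```

| <p>For example, the value 0x01020304 is written as the bytes 01 02 03 04 regardless of the byte order of the machine doing the writing.</p> |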
| <a name="N10221"></a><a name="VInt"></a> |
| <h3 class="boxed">VInt</h3> |
| <p> |
| A variable-length format for non-negative integers is |
| defined where the high-order bit of each byte indicates whether more |
| bytes remain to be read. The low-order seven bits are appended as |
| increasingly more significant bits in the resulting integer value. |
| Thus values from zero to 127 may be stored in a single byte, values |
| from 128 to 16,383 may be stored in two bytes, and so on. |
| </p> |
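| <p>The encoding described above can be sketched as a pair of functions. The names <span class="codefrag">write_vint</span> and <span class="codefrag">read_vint</span> are assumptions for this illustration; the logic follows the description: seven payload bits per byte, least significant group first, with the high-order bit set on every byte except the last.</p> |

```python
def write_vint(value):
    """Encode a non-negative integer in the variable-length format."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        # Emit the low seven bits with the continuation bit set.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    # Final byte: high-order bit clear, so no more bytes follow.
    out.append(value)
    return bytes(out)


def read_vint(data, offset=0):
    """Decode one VInt from `data` at `offset`; return (value, next offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        # Each byte contributes seven increasingly significant bits.
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, offset
        shift += 7
```

| <p>In this sketch 127 encodes to the single byte 01111111, 128 to 10000000 00000001, and 16,384 to 10000000 10000000 00000001, matching the example table that follows.</p> |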
| <p> |
| |
| <b>VInt Encoding Example</b> |
| |
| </p> |
| <table class="ForrestTable" cellspacing="0" cellpadding="4" border="0"> |
| |
| <col width="64*"> |
| |
| <col width="64*"> |
| |
| <col width="64*"> |
| |
| <col width="64*"> |
| |
| <tr valign="TOP"> |
| |
| <td width="25%"> |
| |
| <p align="RIGHT"> |
| |
| <b>Value</b> |
| |
| </p> |
| |
| </td> |
| <td width="25%"> |
| |
| <p align="RIGHT"> |
| |
| <b>First byte</b> |
| |
| </p> |
| |
| </td> |
| <td width="25%"> |
| |
| <p align="RIGHT"> |
| |
| <b>Second byte</b> |
| |
| </p> |
| |
| </td> |
| <td width="25%"> |
| |
| <p align="RIGHT"> |
| |
| <b>Third byte</b> |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="0" width="25%"> |
| |
| <p align="RIGHT">0 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="0" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000000 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="1" width="25%"> |
| |
| <p align="RIGHT">1 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="2" width="25%"> |
| |
| <p align="RIGHT">2 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000010 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td valign="TOP" width="25%"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="127" width="25%"> |
| |
| <p align="RIGHT">127 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1111111" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="128" width="25%"> |
| |
| <p align="RIGHT">128 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000000" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="129" width="25%"> |
| |
| <p align="RIGHT">129 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000001" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="130" width="25%"> |
| |
| <p align="RIGHT">130 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000010" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000010 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td valign="TOP" width="25%"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="16383" width="25%"> |
| |
| <p align="RIGHT">16,383 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="11111111" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 11111111 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1111111" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" width="25%"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="16384" width="25%"> |
| |
| <p align="RIGHT">16,384 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000000" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000000" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr valign="BOTTOM"> |
| |
| <td sdnum="1033;0;#,##0" sdval="16385" width="25%"> |
| |
| <p align="RIGHT">16,385 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000001" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="10000000" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" sdval="1" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td valign="TOP" width="25%"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| <td sdnum="1033;0;00000000" valign="BOTTOM" width="25%"> |
| |
| <p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| |
| <br> |
| |
| |
| </p> |
| |
| </td> |
| |
| </tr> |
| |
| </table> |
| <p> |
| This provides compression while still being |
| efficient to decode. |
| </p> |
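| <p> |
| The rows of the table above can be reproduced with a short sketch. This is an |
| illustration only, not Lucene's own code; the function names |
| encode_vint and decode_vint are hypothetical. Each byte carries seven data bits, |
| low-order group first, and the high bit marks that more bytes follow. |
| </p> |

```python
def encode_vint(value):
    """Encode a non-negative integer as VInt bytes (low-order 7-bit group first)."""
    if value < 0:
        raise ValueError("VInt cannot encode negative values")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)  # last byte has the high bit clear
    return bytes(out)


def decode_vint(data, pos=0):
    """Decode one VInt starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return value, pos
        shift += 7
```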
| <a name="N10506"></a><a name="Chars"></a> |
| <h3 class="boxed">Chars</h3> |
| <p> |
| Lucene writes Unicode |
| character sequences as UTF-8 encoded bytes. |
| </p> |
| <a name="N1050F"></a><a name="String"></a> |
| <h3 class="boxed">String</h3> |
| <p> |
| Lucene writes strings as UTF-8 encoded bytes. |
| First the length, in bytes, is written as a VInt, |
| followed by the bytes. |
| </p> |
| <p> |
| String --> VInt, Chars |
| </p> |
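| <p> |
| A minimal sketch of the String layout described above: a VInt byte length followed by |
| the UTF-8 bytes. The helper names are hypothetical and the code is illustrative rather |
| than taken from Lucene. |
| </p> |

```python
def encode_vint(value):
    """VInt: 7 data bits per byte, low-order group first, high bit = continue."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string(text):
    """String --> VInt (length in bytes), then the UTF-8 encoded bytes."""
    raw = text.encode("utf-8")
    return encode_vint(len(raw)) + raw
```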
| </div> |
| |
| |
| <a name="N1051C"></a><a name="Compound Types"></a> |
| <h2 class="boxed">Compound Types</h2> |
| <div class="section"> |
| <a name="N10521"></a><a name="MapStringString"></a> |
| <h3 class="boxed">Map<String,String></h3> |
| <p> |
| In a couple of places Lucene stores a Map |
| String->String. |
| </p> |
| <p> |
| Map<String,String> --> Count<String,String><sup>Count</sup> |
| |
| </p> |
| </div> |
| |
| |
| <a name="N10531"></a><a name="Per-Index Files"></a> |
| <h2 class="boxed">Per-Index Files</h2> |
| <div class="section"> |
| <p> |
| The files in this section exist one-per-index. |
| </p> |
| <a name="N10539"></a><a name="Segments File"></a> |
| <h3 class="boxed">Segments File</h3> |
| <p> |
| The active segments in the index are stored in the |
| segment info file, |
| <tt>segments_N</tt>. |
| There may |
| be one or more |
| <tt>segments_N</tt> |
| files in the |
| index; however, the one with the largest |
| generation is the active one. Older |
| <tt>segments_N</tt> files are present only when they |
| temporarily cannot be deleted, when a writer is in |
| the process of committing, or when a custom |
| <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a> |
| is in use. This file lists each |
| segment by name, has details about the separate |
| norms and deletion files, and also contains the |
| size of each segment. |
| </p> |
| <p> |
| As of 2.1, there is also a file |
| <tt>segments.gen</tt>. |
| This file contains the |
| current generation (the |
| <tt>_N</tt> |
| in |
| <tt>segments_N</tt>) |
| of the index. This is |
| used only as a fallback in case the current |
| generation cannot be accurately determined by |
| directory listing alone (as is the case for some |
| NFS clients with time-based directory cache |
| expiration). This file simply contains an Int32 |
| version header (SegmentInfos.FORMAT_LOCKLESS = |
| -2), followed by the generation recorded as Int64, |
| written twice. |
| </p> |
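| <p> |
| The segments.gen layout described above is small enough to sketch directly. |
| The function names are hypothetical, and the big-endian byte order is an assumption |
| based on the high-order-first convention this document uses for its primitive types. |
| Writing the generation twice lets a reader detect a partially written file. |
| </p> |

```python
import struct

FORMAT_LOCKLESS = -2  # version header value given in the text


def write_segments_gen(generation):
    """Int32 version header, then the generation as Int64, written twice."""
    return struct.pack(">iqq", FORMAT_LOCKLESS, generation, generation)


def read_segments_gen(data):
    """Return the generation, or None if the header or the two copies disagree."""
    if len(data) < 20:
        return None
    version, gen0, gen1 = struct.unpack(">iqq", data[:20])
    if version != FORMAT_LOCKLESS or gen0 != gen1:
        return None
    return gen0
```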
| <p> |
| |
| <b>2.9</b> |
| Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, |
| NormGen<sup>NumField</sup>, |
| IsCompoundFile, DeletionCount, HasProx, Diagnostics><sup>SegCount</sup>, CommitUserData, Checksum |
| </p> |
| <p> |
| Format, NameCounter, SegCount, SegSize, NumField, |
| DocStoreOffset, DeletionCount --> Int32 |
| </p> |
| <p> |
| Version, DelGen, NormGen, Checksum --> Int64 |
| </p> |
| <p> |
| SegName, DocStoreSegment --> String |
| </p> |
| <p> |
| Diagnostics --> Map<String,String> |
| </p> |
| <p> |
| IsCompoundFile, HasSingleNormFile, |
| DocStoreIsCompoundFile, HasProx --> Int8 |
| </p> |
| <p> |
| CommitUserData --> Map<String,String> |
| </p> |
| <p> |
| Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS). |
| </p> |
| <p> |
| Version counts how often the index has been |
| changed by adding or deleting documents. |
| </p> |
| <p> |
| NameCounter is used to generate names for new segment files. |
| </p> |
| <p> |
| SegName is the name of the segment, and is used as the file name prefix |
| for all of the files that compose the segment's index. |
| </p> |
| <p> |
| SegSize is the number of documents contained in the segment index. |
| </p> |
| <p> |
| DelGen is the generation count of the separate |
| deletes file. If this is -1, there are no |
| separate deletes. If it is 0, this is a pre-2.1 |
| segment and you must check the filesystem for the |
| existence of _X.del. Any value above zero means |
| there are separate deletes (_X_N.del). |
| </p> |
| <p> |
| NumField is the size of the array for NormGen, or |
| -1 if there are no NormGens stored. |
| </p> |
| <p> |
| NormGen records the generation of the separate |
| norms files. If NumField is -1, no NormGens are |
| stored: they are all assumed to be 0 when the |
| segments file was written pre-2.1, and all |
| assumed to be -1 when the segments file is 2.1 or |
| above. The generation then has the same meaning |
| as DelGen (above). |
| </p> |
| <p> |
| IsCompoundFile records whether the segment is |
| written as a compound file or not. If this is -1, |
| the segment is not a compound file. If it is 1, |
| the segment is a compound file. Otherwise it is 0, |
| which means the filesystem must be checked to see if _X.cfs |
| exists. |
| </p> |
| <p> |
| If HasSingleNormFile is 1, then the field norms are |
| written as a single joined file (with extension |
| <tt>.nrm</tt>); if it is 0 then each field's norms |
| are stored as separate <tt>.fN</tt> files. See |
| "Normalization Factors" below for details. |
| </p> |
| <p> |
| DocStoreOffset, DocStoreSegment, |
| DocStoreIsCompoundFile: If DocStoreOffset is -1, |
| this segment has its own doc store (stored fields |
| values and term vectors) files and DocStoreSegment |
| and DocStoreIsCompoundFile are not stored. In |
| this case all files for stored field values |
| (<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term |
| vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and |
| <tt>*.tvx</tt>) will be stored with this segment. |
| Otherwise, DocStoreSegment is the name of the |
| segment that has the shared doc store files; |
| DocStoreIsCompoundFile is 1 if that segment is |
| stored in compound file format (as a <tt>.cfx</tt> |
| file); and DocStoreOffset is the starting document |
| in the shared doc store files where this segment's |
| documents begin. In this case, this segment does |
| not store its own doc store files but instead |
| shares a single set of these files with other |
| segments. |
| </p> |
| <p> |
| Checksum contains the CRC32 checksum of all bytes |
| in the segments_N file up until the checksum. |
| This is used to verify integrity of the file on |
| opening the index. |
| </p> |
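| <p> |
| The integrity check described above can be sketched as follows. This assumes the |
| checksum is the final Int64 of the file, stored high-order byte first, and that |
| Python's zlib.crc32 computes the same CRC32 as the writer; the function names are |
| hypothetical. |
| </p> |

```python
import struct
import zlib


def append_checksum(body):
    """Append the CRC32 of all preceding bytes as a big-endian Int64."""
    return body + struct.pack(">q", zlib.crc32(body) & 0xFFFFFFFF)


def verify_checksum(data):
    """Recompute the CRC32 over everything before the trailing Int64 and compare."""
    if len(data) < 8:
        return False
    body = data[:-8]
    (stored,) = struct.unpack(">q", data[-8:])
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored
```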
| <p> |
| DeletionCount records the number of deleted |
| documents in this segment. |
| </p> |
| <p> |
| HasProx is 1 if any fields in this segment have |
| omitTf set to false; else, it's 0. |
| </p> |
| <p> |
| CommitUserData stores an optional user-supplied |
| opaque Map<String,String> that was passed to |
| IndexWriter's commit or prepareCommit, or |
| IndexReader's flush methods. |
| </p> |
| <p> |
| The Diagnostics Map is privately written by |
| IndexWriter, as a debugging aid, for each segment |
| it creates. It includes metadata like the current |
| Lucene version, OS, Java version, why the segment |
| was created (merge, flush, addIndexes), etc. |
| </p> |
| <a name="N105BE"></a><a name="Lock File"></a> |
| <h3 class="boxed">Lock File</h3> |
| <p> |
| The write lock, which is stored in the index |
| directory by default, is named "write.lock". If |
| the lock directory is different from the index |
| directory then the write lock will be named |
| "XXXX-write.lock" where XXXX is a unique prefix |
| derived from the full path to the index directory. |
| When this file is present, a writer is currently |
| modifying the index (adding or removing |
| documents). This lock file ensures that only one |
| writer is modifying the index at a time. |
| </p> |
| <a name="N105C7"></a><a name="Deletable File"></a> |
| <h3 class="boxed">Deletable File</h3> |
| <p> |
| No deletable file is written: a writer dynamically |
| computes which files are deletable. |
| </p> |
| <a name="N105D0"></a><a name="Compound Files"></a> |
| <h3 class="boxed">Compound Files</h3> |
| <p>Starting with Lucene 1.4 the compound file format became the default. This |
| is simply a container for all files described in the next section |
| (except for the .del file).</p> |
| <p>Compound (.cfs) --> FileCount, <DataOffset, FileName> |
| <sup>FileCount</sup> |
| , |
| FileData |
| <sup>FileCount</sup> |
| |
| </p> |
| <p>FileCount --> VInt</p> |
| <p>DataOffset --> Long</p> |
| <p>FileName --> String</p> |
| <p>FileData --> raw file data</p> |
| <p>The raw file data is the data from the individual files named above.</p> |
| <p>Starting with Lucene 2.3, doc store files (stored |
| field values and term vectors) can be shared in a |
| single set of files for more than one segment. When |
| the compound file format is enabled, these shared files will be |
| added into a single compound file (same format as |
| above) but with the extension <tt>.cfx</tt>. |
| </p> |
| </div> |
| |
| |
| <a name="N105F8"></a><a name="Per-Segment Files"></a> |
| <h2 class="boxed">Per-Segment Files</h2> |
| <div class="section"> |
| <p> |
| The remaining files are all per-segment, and are |
| thus defined by suffix. |
| </p> |
| <a name="N10600"></a><a name="Fields"></a> |
| <h3 class="boxed">Fields</h3> |
| <p> |
| |
| <br> |
| |
| <b>Field Info</b> |
| |
| <br> |
| |
| </p> |
| <p> |
| Field names are |
| stored in the field info file, with suffix .fnm. |
| </p> |
| <p> |
| FieldInfos |
| (.fnm) --> FNMVersion,FieldsCount, <FieldName, |
| FieldBits> |
| <sup>FieldsCount</sup> |
| |
| </p> |
| <p> |
| FNMVersion, FieldsCount --> VInt |
| </p> |
| <p> |
| FieldName --> String |
| </p> |
| <p> |
| FieldBits --> Byte |
| </p> |
| <p> |
| |
| <ul> |
| |
| <li> |
| The low-order bit is one for |
| indexed fields, and zero for non-indexed fields. |
| </li> |
| |
| <li> |
| The second lowest-order |
| bit is one for fields that have term vectors stored, and zero for fields |
| without term vectors. |
| </li> |
| |
| <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li> |
| |
| <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li> |
| |
| <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li> |
| |
| <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li> |
| |
| </ul> |
| |
| </p> |
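| <p> |
| The FieldBits flags listed above can be unpacked with simple bit masks. The sketch |
| below is illustrative; the flag names and the function decode_field_bits are |
| hypothetical labels, while the mask values come from the list above. |
| </p> |

```python
# Mask values follow the FieldBits list; the names are illustrative labels.
FIELD_FLAGS = {
    0x01: "indexed",
    0x02: "term_vectors",
    0x04: "term_vector_positions",
    0x08: "term_vector_offsets",
    0x10: "omit_norms",
    0x20: "payloads",
}


def decode_field_bits(bits):
    """Return the set of flag names whose bit is set in the FieldBits byte."""
    return {name for mask, name in FIELD_FLAGS.items() if bits & mask}
```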
| <p> |
| FNMVersion (added in 2.9) is always -2. |
| </p> |
| <p> |
| Fields are numbered by their order in this file. Thus field zero is |
| the |
| first field in the file, field one the next, and so on. Note that, |
| like document numbers, field numbers are segment relative. |
| </p> |
| <p> |
| |
| <br> |
| |
| <b>Stored Fields</b> |
| |
| <br> |
| |
| </p> |
| <p> |
| Stored fields are represented by two files: |
| </p> |
| <ol> |
| |
| <li> |
| <a name="field_index"></a> |
| |
| <p> |
| The field index, or .fdx file. |
| </p> |
| |
| |
| <p> |
| This contains, for each document, a pointer to |
| its field data, as follows: |
| </p> |
| |
| |
| <p> |
| FieldIndex |
| (.fdx) --> |
| <FieldValuesPosition> |
| <sup>SegSize</sup> |
| |
| </p> |
| |
| <p>FieldValuesPosition |
| --> Uint64 |
| </p> |
| |
| <p>This |
| is used to find the location within the field data file of the |
| fields of a particular document. Because it contains fixed-length |
| data, this file may be easily randomly accessed. The position of |
| document |
| <i>n</i>'s |
| field data is the Uint64 at |
| <i>n*8</i> |
| in |
| this file. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p> |
| <a name="field_data"></a> |
| The field data, or .fdt file. |
| |
| </p> |
| |
| |
| <p> |
| This contains the stored fields of each document, |
| as follows: |
| </p> |
| |
| |
| <p> |
| FieldData (.fdt) --> |
| <DocFieldData> |
| <sup>SegSize</sup> |
| |
| </p> |
| |
| <p>DocFieldData --> |
| FieldCount, <FieldNum, Bits, Value> |
| <sup>FieldCount</sup> |
| |
| </p> |
| |
| <p>FieldCount --> |
| VInt |
| </p> |
| |
| <p>FieldNum --> |
| VInt |
| </p> |
| |
| <p>Bits --> |
| Byte |
| </p> |
| |
| <p> |
| |
| <ul> |
| |
| <li>low order bit is one for tokenized fields</li> |
| |
| <li>second bit is one for fields containing binary data</li> |
| |
| <li>third bit is one for fields with compression option enabled |
| (if compression is enabled, the algorithm used is ZLIB)</li> |
| |
| </ul> |
| |
| </p> |
| |
| <p>Value --> |
| String | BinaryValue (depending on Bits) |
| </p> |
| |
| <p>BinaryValue --> |
| ValueSize, <Byte><sup>ValueSize</sup> |
| </p> |
| |
| <p>ValueSize --> |
| VInt |
| </p> |
| |
| |
| </li> |
| |
| </ol> |
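| <p> |
| Because the field index holds one fixed-width entry per document, a lookup is a |
| single seek. The sketch below follows the layout exactly as described above (a |
| Uint64 at offset n*8) and assumes high-order-first byte order; the function name is |
| hypothetical. |
| </p> |

```python
import struct


def field_data_position(fdx, doc):
    """Return the .fdt offset for document `doc`, read from the Uint64 at doc * 8."""
    (position,) = struct.unpack_from(">Q", fdx, doc * 8)
    return position
```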
| <a name="N106A7"></a><a name="Term Dictionary"></a> |
| <h3 class="boxed">Term Dictionary</h3> |
| <p> |
| The term dictionary is represented as two files: |
| </p> |
| <ol> |
| |
| <li> |
| <a name="tis"></a> |
| |
| <p> |
| The term infos, or .tis file. |
| </p> |
| |
| |
| <p> |
| TermInfoFile (.tis)--> |
| TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos |
| </p> |
| |
| <p>TIVersion --> |
| UInt32 |
| </p> |
| |
| <p>TermCount --> |
| UInt64 |
| </p> |
| |
| <p>IndexInterval --> |
| UInt32 |
| </p> |
| |
| <p>SkipInterval --> |
| UInt32 |
| </p> |
| |
| <p>MaxSkipLevels --> |
| UInt32 |
| </p> |
| |
| <p>TermInfos --> |
| <TermInfo> |
| <sup>TermCount</sup> |
| |
| </p> |
| |
| <p>TermInfo --> |
| <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> |
| </p> |
| |
| <p>Term --> |
| <PrefixLength, Suffix, FieldNum> |
| </p> |
| |
| <p>Suffix --> |
| String |
| </p> |
| |
| <p>PrefixLength, |
| DocFreq, FreqDelta, ProxDelta, SkipDelta |
| <br> |
| --> VInt |
| </p> |
| |
| <p> |
| This file is sorted by Term. Terms are |
| ordered first lexicographically (by UTF-16 |
| character code) by the term's field name, |
| and within that lexicographically (by |
| UTF-16 character code) by the term's text. |
| </p> |
| |
| <p>TIVersion names the version of the format |
| of this file and is equal to TermInfosWriter.FORMAT_CURRENT. |
| </p> |
| |
| <p>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </p> |
| |
| <p>FieldNum |
| determines the term's field, whose name is stored in the .fnm file. |
| </p> |
| |
| <p>DocFreq |
| is the count of documents which contain the term. |
| </p> |
| |
| <p>FreqDelta |
| determines the position of this term's TermFreqs within the .frq |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). |
| </p> |
| |
| <p>ProxDelta |
| determines the position of this term's TermPositions within the .prx |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). For fields |
| with omitTf true, this will be 0 since |
| prox information is not stored. |
| </p> |
| |
| <p>SkipDelta determines the position of this |
| term's SkipData within the .frq file. In |
| particular, it is the number of bytes |
| after TermFreqs that the SkipData starts. |
| In other words, it is the length of the |
| TermFreqs data. SkipDelta is only stored |
| if DocFreq is not smaller than SkipInterval. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p> |
| <a name="tii"></a> |
| The term info index, or .tii file. |
| </p> |
| |
| |
| <p> |
| This contains every IndexInterval |
| <sup>th</sup> |
| entry from the .tis |
| file, along with its location in the .tis file. This is |
| designed to be read entirely into memory and used to provide random |
| access to the .tis file. |
| </p> |
| |
| |
| <p> |
| The structure of this file is very similar to the |
| .tis file, with the addition of one item per record, the IndexDelta. |
| </p> |
| |
| |
| <p> |
| TermInfoIndex (.tii)--> |
| TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices |
| </p> |
| |
| <p>TIVersion --> |
| UInt32 |
| </p> |
| |
| <p>IndexTermCount --> |
| UInt64 |
| </p> |
| |
| <p>IndexInterval --> |
| UInt32 |
| </p> |
| |
| <p>SkipInterval --> |
| UInt32 |
| </p> |
| |
| <p>TermIndices --> |
| <TermInfo, IndexDelta> |
| <sup>IndexTermCount</sup> |
| |
| </p> |
| |
| <p>IndexDelta --> |
| VLong |
| </p> |
| |
| <p>IndexDelta |
| determines the position of this term's TermInfo within the .tis file. In |
| particular, it is the difference between the position of this term's |
| entry in that file and the position of the previous term's entry. |
| </p> |
| |
| <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). |
| Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while |
| smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more |
| accelerable cases.</p> |
| |
| <p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in |
| smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. |
| See format of .frq file for more information about skip levels.</p> |
| |
| </li> |
| |
| </ol> |
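| <p> |
| The shared-prefix scheme used for term text in the term dictionary can be sketched as |
| below. This is an illustration that ignores FieldNum and the other TermInfo fields; |
| the function names are hypothetical. |
| </p> |

```python
def shared_prefix_length(previous, current):
    """Count the leading characters that two terms have in common."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


def prefix_encode(sorted_terms):
    """Turn sorted term texts into (PrefixLength, Suffix) pairs."""
    entries = []
    previous = ""
    for term in sorted_terms:
        n = shared_prefix_length(previous, term)
        entries.append((n, term[n:]))
        previous = term
    return entries


def prefix_decode(entries):
    """Rebuild the term texts from (PrefixLength, Suffix) pairs."""
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        previous = previous[:prefix_length] + suffix
        terms.append(previous)
    return terms
```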
| <a name="N1072B"></a><a name="Frequencies"></a> |
| <h3 class="boxed">Frequencies</h3> |
| <p> |
| The .frq file contains the lists of documents |
| which contain each term, along with the frequency of the term in that |
| document (if omitTf is false). |
| </p> |
| <p>FreqFile (.frq) --> |
| <TermFreqs, SkipData> |
| <sup>TermCount</sup> |
| |
| </p> |
| <p>TermFreqs --> |
| <TermFreq> |
| <sup>DocFreq</sup> |
| |
| </p> |
| <p>TermFreq --> |
| DocDelta[, Freq?] |
| </p> |
| <p>SkipData --> |
| <<SkipLevelLength, SkipLevel> |
| <sup>NumSkipLevels-1</sup>, SkipLevel> |
| <SkipDatum> |
| </p> |
| <p>SkipLevel --> |
| <SkipDatum> |
| <sup>DocFreq/(SkipInterval^(Level + 1))</sup> |
| |
| </p> |
| <p>SkipDatum --> |
| DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer? |
| </p> |
| <p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> |
| VInt |
| </p> |
| <p>SkipChildLevelPointer --> |
| VLong |
| </p> |
| <p>TermFreqs |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>TermFreq |
| entries are ordered by increasing document number. |
| </p> |
| <p>DocDelta: if omitTf is false, this determines both |
| the document number and the frequency. In |
| particular, DocDelta/2 is the difference between |
| this document number and the previous document |
| number (or zero when this is the first document in |
| a TermFreqs). When DocDelta is odd, the frequency |
| is one. When DocDelta is even, the frequency is |
| read as another VInt. If omitTf is true, DocDelta |
| contains the gap (not multiplied by 2) between |
| document numbers and no frequency information is |
| stored. |
| </p> |
| <p>For example, the TermFreqs for a term which occurs |
| once in document seven and three times in document |
| eleven, with omitTf false, would be the following |
| sequence of VInts: |
| </p> |
| <p>15, 8, 3 |
| </p> |
| <p> If omitTf were true it would be this sequence |
| of VInts instead: |
| </p> |
| <p> |
| 7,4 |
| </p> |
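| <p> |
| The DocDelta scheme and both example sequences above can be reproduced with a short |
| sketch. The function names are hypothetical, and the values returned are the integers |
| that would each then be written as a VInt. |
| </p> |

```python
def encode_term_freqs(postings, omit_tf=False):
    """postings: (doc_number, freq) pairs in increasing document order."""
    values = []
    previous_doc = 0
    for doc, freq in postings:
        gap = doc - previous_doc
        if omit_tf:
            values.append(gap)  # plain gap, no frequency stored
        elif freq == 1:
            values.append(gap * 2 + 1)  # odd DocDelta: frequency is one
        else:
            values.append(gap * 2)  # even DocDelta: frequency follows
            values.append(freq)
        previous_doc = doc
    return values


def decode_term_freqs(values, omit_tf=False):
    """Inverse of encode_term_freqs; freq is None when omit_tf is true."""
    postings = []
    doc = 0
    i = 0
    while i < len(values):
        value = values[i]
        i += 1
        if omit_tf:
            doc += value
            postings.append((doc, None))
            continue
        doc += value >> 1
        if value & 1:
            freq = 1
        else:
            freq = values[i]
            i += 1
        postings.append((doc, freq))
    return postings
```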
| <p>DocSkip records the document number before every |
| SkipInterval |
| <sup>th</sup> |
| document in TermFreqs. |
| If payloads are disabled for the term's field, |
| then DocSkip represents the difference from the |
| previous value in the sequence. |
| If payloads are enabled for the term's field, |
| then DocSkip/2 represents the difference from the |
| previous value in the sequence. If payloads are enabled |
| and DocSkip is odd, |
| then PayloadLength is stored indicating the length |
| of the last payload before the SkipInterval<sup>th</sup> |
| document in TermPositions. |
| FreqSkip and ProxSkip record the position of every |
| SkipInterval |
| <sup>th</sup> |
| entry in FreqFile and |
| ProxFile, respectively. File positions are |
| relative to the start of TermFreqs and Positions for the first |
| SkipDatum, and to the previous SkipDatum in the sequence thereafter. |
| </p> |
| <p>For example, if DocFreq=35 and SkipInterval=16, |
| then there are two SkipData entries, containing |
| the 15 |
| <sup>th</sup> |
| and 31 |
| <sup>st</sup> |
| document |
| numbers in TermFreqs. The first FreqSkip names |
| the number of bytes after the beginning of |
| TermFreqs that the 16 |
| <sup>th</sup> |
| SkipDatum |
| starts, and the second the number of bytes after |
| that that the 32 |
| <sup>nd</sup> |
| starts. The first |
| ProxSkip names the number of bytes after the |
| beginning of Positions that the 16 |
| <sup>th</sup> |
| SkipDatum starts, and the second the number of |
| bytes after that that the 32 |
| <sup>nd</sup> |
| starts. |
| </p> |
| <p>Each term can have multiple skip levels. |
| The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). |
| The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip |
| level is Level=0. <br> |
| Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, |
| containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>, |
| 27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the |
| 15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br> |
| The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData |
| entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer |
| to entry 31 on level 0. |
| </p> |
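| <p> |
| The skip-level arithmetic can be checked with a small sketch. It takes the level count |
| as the floor of the base-SkipInterval logarithm of DocFreq, capped at MaxSkipLevels, |
| which is the reading that matches the worked example; it uses integer division |
| instead of floating-point logarithms. The function names are hypothetical. |
| </p> |

```python
def num_skip_levels(doc_freq, skip_interval, max_skip_levels):
    """min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))) in integer math."""
    levels = 0
    remaining = doc_freq
    while remaining >= skip_interval:
        remaining //= skip_interval
        levels += 1
    return min(max_skip_levels, levels)


def skip_entries_per_level(doc_freq, skip_interval, max_skip_levels):
    """Entry count for each level, lowest level (Level=0) first."""
    levels = num_skip_levels(doc_freq, skip_interval, max_skip_levels)
    return [doc_freq // skip_interval ** (level + 1) for level in range(levels)]
```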
| <a name="N107B3"></a><a name="Positions"></a> |
| <h3 class="boxed">Positions</h3> |
| <p> |
| The .prx file contains the lists of positions that |
| each term occurs at within documents. Note that |
| fields with omitTf true do not store |
| anything into this file, and if all fields in the |
| index have omitTf true then the .prx file will not |
| exist. |
| </p> |
| <p>ProxFile (.prx) --> |
| <TermPositions> |
| <sup>TermCount</sup> |
| |
| </p> |
| <p>TermPositions --> |
| <Positions> |
| <sup>DocFreq</sup> |
| |
| </p> |
| <p>Positions --> |
| <PositionDelta,Payload?> |
| <sup>Freq</sup> |
| |
| </p> |
| <p>Payload --> |
| <PayloadLength?,PayloadData> |
| </p> |
| <p>PositionDelta --> |
| VInt |
| </p> |
| <p>PayloadLength --> |
| VInt |
| </p> |
| <p>PayloadData --> |
| byte<sup>PayloadLength</sup> |
| |
| </p> |
| <p>TermPositions |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>Positions |
| entries are ordered by increasing document number (the document |
| number is implicit from the .frq file). |
| </p> |
| <p>PositionDelta |
| is, if payloads are disabled for the term's field, the difference |
| between the position of the current occurrence in |
| the document and the previous occurrence (or zero, if this is the |
| first occurrence in this document). |
| If payloads are enabled for the term's field, then PositionDelta/2 |
| is the difference between the current and the previous position. If |
| payloads are enabled and PositionDelta is odd, then PayloadLength is |
| stored, indicating the length of the payload at the current term position. |
| </p> |
| <p> |
| For example, the TermPositions for a |
| term which occurs as the fourth term in one document, and as the |
| fifth and ninth term in a subsequent document, would be the following |
| sequence of VInts (payloads disabled): |
| </p> |
| <p>4, |
| 5, 4 |
| </p> |
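| <p> |
| With payloads disabled, the position encoding reduces to per-document deltas, as the |
| example above shows. A minimal sketch, with a hypothetical function name, where each |
| returned integer would then be written as a VInt: |
| </p> |

```python
def encode_positions(positions_per_doc):
    """positions_per_doc: one increasing list of positions per document."""
    values = []
    for positions in positions_per_doc:
        previous = 0  # deltas restart at zero for each document
        for position in positions:
            values.append(position - previous)
            previous = position
    return values
```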
| <p>PayloadData |
| is metadata associated with the current term position. If PayloadLength |
| is stored at the current position, then it indicates the length of this |
| Payload. If PayloadLength is not stored, then this Payload has the same |
| length as the Payload at the previous position. |
| </p> |
| <a name="N107EF"></a><a name="Normalization Factors"></a> |
| <h3 class="boxed">Normalization Factors</h3> |
| <p>There's a single .nrm file containing all norms: |
| </p> |
| <p>AllNorms |
| (.nrm) --> NormsHeader,<Norms> |
| <sup>NumFieldsWithNorms</sup> |
| |
| </p> |
| <p>Norms |
| --> <Byte> |
| <sup>SegSize</sup> |
| |
| </p> |
| <p>NormsHeader |
| --> 'N','R','M',Version |
| </p> |
| <p>Version |
| --> Byte |
| </p> |
| <p>NormsHeader |
| has 4 bytes, last of which is the format version for this file, currently -1. |
| </p> |
| <p>Each |
| byte encodes a floating point value. Bits 0-2 contain the 3-bit |
| mantissa, and bits 3-7 contain the 5-bit exponent. |
| </p> |
| <p>These |
| are converted to an IEEE single float value as follows: |
| </p> |
| <ol> |
| |
| <li> |
| |
| <p>If |
| the byte is zero, use a zero float. |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>Otherwise, |
| set the sign bit of the float to zero; |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>add |
| 48 to the exponent and use this as the float's exponent; |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>map |
| the mantissa to the high-order 3 bits of the float's mantissa; and |
| |
| </p> |
| |
| </li> |
| |
| <li> |
| |
| <p>set |
| the low-order 21 bits of the float's mantissa to zero. |
| </p> |
| |
| </li> |
| |
| </ol> |
| <p>A separate norm file is created when the norm values of an existing segment are modified. |
| When field <em>N</em> is modified, a separate norm file <em>.sN</em> |
| is created, to maintain the norm values for that field. |
| </p> |
| <p>Separate norm files are created (when adequate) for both compound and non compound segments. |
| </p> |
| <a name="N10840"></a><a name="Term Vectors"></a> |
| <h3 class="boxed">Term Vectors</h3> |
| <p> |
| Term Vector support is optional on a field by |
| field basis. It consists of 3 files. |
| </p> |
| <ol> |
| |
| <li> |
| <a name="tvx"></a> |
| |
| <p>The Document Index or .tvx file.</p> |
| |
| <p>For each document, this stores the offset |
| into the document data (.tvd) and field |
| data (.tvf) files. |
| </p> |
| |
| <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition,FieldPosition> |
| <sup>NumDocs</sup> |
| |
| </p> |
| |
| <p>TVXVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p> |
| |
| <p>DocumentPosition --> UInt64 (offset in |
| the .tvd file)</p> |
| |
| <p>FieldPosition --> UInt64 (offset in the |
| .tvf file)</p> |
| |
| </li> |
| |
| <li> |
| <a name="tvd"></a> |
| |
| <p>The Document or .tvd file.</p> |
| |
| <p>This contains, for each document, the number of fields, a list of the fields with |
| term vector info and finally a list of pointers to the field information in the .tvf |
| (Term Vector Fields) file.</p> |
| |
| <p> |
| Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions> |
| <sup>NumDocs</sup> |
| |
| </p> |
| |
| <p>TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p> |
| |
| <p>NumFields --> VInt</p> |
| |
| <p>FieldNums --> <FieldNumDelta> |
| <sup>NumFields</sup> |
| |
| </p> |
| |
| <p>FieldNumDelta --> VInt</p> |
| |
| <p>FieldPositions --> <FieldPositionDelta> |
| <sup>NumFields-1</sup> |
| |
| </p> |
| |
| <p>FieldPositionDelta --> VLong</p> |
| |
| <p>The .tvd file is used to map out the fields that have term vectors stored and |
| where the field information is in the .tvf file.</p> |
| |
| </li> |
| |
| <li> |
| <a name="tvf"></a> |
| |
| <p>The Field or .tvf file.</p> |
| |
| <p>This file contains, for each field that has a term vector stored, a list of |
| the terms, their frequencies and, optionally, position and offset information.</p> |
| |
| <p>Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs> |
| <sup>NumFields</sup> |
| |
| </p> |
| |
| <p>TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p> |
| |
| <p>NumTerms --> VInt</p> |
| |
| <p>Position/Offset --> Byte</p> |
| |
| <p>TermFreqs --> <TermText, TermFreq, Positions?, Offsets?> |
| <sup>NumTerms</sup> |
| |
| </p> |
| |
| <p>TermText --> <PrefixLength, Suffix></p> |
| |
| <p>PrefixLength --> VInt</p> |
| |
| <p>Suffix --> String</p> |
| |
| <p>TermFreq --> VInt</p> |
| |
| <p>Positions --> <VInt><sup>TermFreq</sup> |
| </p> |
| |
| <p>Offsets --> <VInt, VInt><sup>TermFreq</sup> |
| </p> |
| |
| <br> |
| |
| <p>Notes:</p> |
| |
| <ul> |
| |
| <li>Position/Offset byte stores whether this term vector has position or offset information stored.</li> |
| |
| <li>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </li> |
| |
| <li>Positions are stored as delta encoded VInts. This means we only store the difference of the current position from the last position.</li> |
| |
| <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li> |
| |
| </ul> |
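| <p>The prefix sharing and delta encoding described in the notes above can be sketched as follows. |
| This is a minimal illustration, not Lucene's reader code: the function names are hypothetical, the input is |
| assumed to be already parsed into Python values rather than read from a .tvf file, and the first position |
| delta is assumed to be relative to zero.</p> |

```python
def decode_terms(entries):
    """Rebuild full term texts from (prefix_length, suffix) pairs.

    Each term reuses the first prefix_length characters of the
    previous term and appends its own suffix.
    """
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms


def decode_positions(deltas):
    """Turn delta-encoded positions back into absolute positions.

    Assumption: the first delta is measured from position 0.
    """
    positions = []
    last = 0
    for delta in deltas:
        last += delta
        positions.append(last)
    return positions


# "bone" followed by "boy": the second term shares a prefix of length two.
print(decode_terms([(0, "bone"), (2, "y")]))
print(decode_positions([3, 4, 10]))
```

| <p>Storing only suffixes and differences keeps most values small, which suits the variable-length VInt encoding.</p> |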
| |
| |
| |
| </li> |
| |
| </ol> |
| <a name="N108DC"></a><a name="Deleted Documents"></a> |
| <h3 class="boxed">Deleted Documents</h3> |
| <p>The .del file is |
| optional, and only exists when a segment contains deletions. |
| </p> |
| <p>Although per-segment, this file is maintained outside compound segment files. |
| </p> |
| <p> |
| Deletions |
| (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format) |
| </p> |
| <p>Format,ByteCount,BitCount --> |
| UInt32 |
| </p> |
| <p>Bits --> |
| <Byte> |
| <sup>ByteCount</sup> |
| |
| </p> |
| <p>DGaps --> |
| <DGap,NonzeroByte> |
| <sup>NonzeroBytesCount</sup> |
| |
| </p> |
| <p>DGap --> |
| VInt |
| </p> |
| <p>NonzeroByte --> |
| Byte |
| </p> |
| <p>Format |
| is optional. A value of -1 indicates that DGaps follows. A non-negative first value indicates Bits; in that case Format is omitted and the value read is ByteCount. |
| </p> |
| <p>ByteCount |
| indicates the number of bytes in Bits. It is typically |
| (SegSize/8)+1. |
| </p> |
| <p> |
| BitCount |
| indicates the number of bits that are currently set in Bits. |
| </p> |
| <p>Bits |
| contains one bit for each document indexed. When the bit |
| corresponding to a document number is set, that document is marked as |
| deleted. Bit ordering is from least to most significant. Thus, if |
| Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as |
| deleted. |
| </p> |
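| <p>The bit ordering described above can be checked with a short sketch. The helper name |
| <span class="codefrag">is_deleted</span> is hypothetical, and the Bits bytes are assumed to be already loaded into memory.</p> |

```python
def is_deleted(bits, doc):
    """Return True if the bit for document number doc is set.

    Bits are numbered from the least significant bit of each byte,
    so document 8 * i + j maps to bit j of byte i.
    """
    byte_index = doc >> 3      # which byte holds the bit
    bit_index = doc & 7        # position inside that byte, LSB first
    return bool((bits[byte_index] >> bit_index) & 1)


# The two-byte example from the text: 0x00 followed by 0x02.
bits = bytes([0x00, 0x02])
deleted = [doc for doc in range(len(bits) * 8) if is_deleted(bits, doc)]
print(deleted)
```

| <p>With the two bytes from the text, only document 9 is reported as deleted.</p> |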
| <p>DGaps |
| represents sparse bit-vectors more efficiently than Bits. |
| It consists of pairs: the gap (DGap) between the indexes of successive nonzero bytes in Bits, |
| followed by the nonzero byte itself. The number of nonzero bytes |
| in Bits (NonzeroBytesCount) is not stored. |
| </p> |
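| <p>For example, |
| if there are 8000 bits and only bits 10, 12 and 32 are set, |
| DGaps would be used. Bits 10 and 12 fall in byte 1 (value 0x14 = 20) |
| and bit 32 falls in byte 4 (value 0x01 = 1), giving: |
| </p> |
| <p> |
| (VInt) 1 , (Byte) 20 , (VInt) 3 , (Byte) 1 |
| </p> |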
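| <p>The example above can be reproduced with a small sketch. It is an illustration under stated assumptions, |
| not Lucene's own code: the function names are hypothetical, the pairs are kept as Python tuples rather than |
| written as VInt and Byte values, and the first gap is assumed to be measured from byte index 0, which matches the example.</p> |

```python
def set_bits(num_bits, docs):
    """Build a Bits array with the given document bits set (LSB first)."""
    bits = bytearray((num_bits // 8) + 1)   # typical ByteCount: (SegSize/8)+1
    for doc in docs:
        bits[doc >> 3] |= 1 << (doc & 7)
    return bytes(bits)


def encode_dgaps(bits):
    """Return (gap, nonzero_byte) pairs for every nonzero byte in bits."""
    pairs = []
    last = 0                                # assumed starting index for the first gap
    for index, value in enumerate(bits):
        if value != 0:
            pairs.append((index - last, value))
            last = index
    return pairs


def decode_dgaps(pairs, byte_count):
    """Rebuild the Bits array from (gap, nonzero_byte) pairs."""
    bits = bytearray(byte_count)
    index = 0
    for gap, value in pairs:
        index += gap
        bits[index] = value
    return bytes(bits)


bits = set_bits(8000, [10, 12, 32])
pairs = encode_dgaps(bits)
print(pairs)
```

| <p>Here 1001 bytes of Bits reduce to two pairs, which is why the sparse form is preferred when few documents are deleted.</p> |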
| </div> |
| |
| |
| <a name="N10916"></a><a name="Limitations"></a> |
| <h2 class="boxed">Limitations</h2> |
| <div class="section"> |
| <p> |
| When referring to term numbers, Lucene's current |
| implementation uses a Java <span class="codefrag">int</span> to hold the |
| term index, which means the maximum number of unique |
| terms in any single index segment is ~2.1 billion times |
| the term index interval (default 128) = ~274 billion. |
| This is technically not a limitation of the index file |
| format, just of Lucene's current implementation. |
| </p> |
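| <p>The arithmetic behind this figure can be checked directly. The constant names below are hypothetical; |
| the values are the maximum Java <span class="codefrag">int</span> and the default term index interval stated above.</p> |

```python
MAX_JAVA_INT = 2**31 - 1        # largest value a Java int can hold
TERM_INDEX_INTERVAL = 128       # default interval stated in the text

# One int-addressed term index entry covers TERM_INDEX_INTERVAL terms.
max_terms = MAX_JAVA_INT * TERM_INDEX_INTERVAL
print(max_terms)
```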
| <p> |
| Similarly, Lucene uses a Java <span class="codefrag">int</span> to refer |
| to document numbers, and the index file format uses an |
| <span class="codefrag">Int32</span> on-disk to store document numbers. |
| This is a limitation of both the index file format and |
| the current implementation. Eventually these should be |
| replaced with either <span class="codefrag">UInt64</span> values, or |
| better yet, <span class="codefrag">VInt</span> values which have no |
| limit. |
| </p> |
| </div> |
| |
| |
| </div> |
| <!--+ |
| |end content |
| +--> |
| <div class="clearboth"> </div> |
| </div> |
| <div id="footer"> |
| <!--+ |
| |start bottomstrip |
| +--> |
| <div class="lastmodified"> |
| <script type="text/javascript"><!-- |
| document.write("Last Published: " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="copyright"> |
| Copyright © |
| 2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a> |
| </div> |
| <!--+ |
| |end bottomstrip |
| +--> |
| </div> |
| </body> |
| </html> |