| <?xml version="1.0"?> |
| |
| <document> |
| |
| <properties> |
| <title>Index File Formats</title> |
| <authors> |
| <person email="cutting@apache.org" name="Doug Cutting"/> |
| </authors> |
| </properties> |
| |
| <body> |
| <section name="Index File Formats"> |
| |
| <p> |
| This document defines the index file formats used |
| in Lucene version 1.4. |
| </p> |
| |
| <p> |
| Jakarta Lucene is written in Java, but several |
| efforts are underway to write versions of Lucene in other programming |
| languages. If these versions are to remain compatible with Jakarta |
| Lucene, then a language-independent definition of the Lucene index |
| format is required. This document thus attempts to provide a |
| complete and independent definition of the Jakarta Lucene 1.4 file |
| formats. |
| </p> |
| |
| <p> |
| As Lucene evolves, this document should evolve. |
| Versions of Lucene in different programming languages should endeavor |
| to agree on file formats, and generate new versions of this document. |
| </p> |
| |
| <p> |
| Compatibility notes are provided in this document, |
| describing how file formats have changed from prior versions. |
| </p> |
| |
| </section> |
| |
| <section name="Definitions"> |
| |
| <p> |
| The fundamental concepts in Lucene are index, |
| document, field and term. |
| </p> |
| |
| |
| <p> |
| An index contains a sequence of documents. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| A document is a sequence of fields. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| A field is a named sequence of terms. |
| </p> |
| </li> |
| |
| <li> |
| A term is a string. |
| </li> |
| </ul> |
| |
| <p> |
| The same string in two different fields is |
| considered a different term. Thus terms are represented as a pair of |
| strings, the first naming the field, and the second naming text |
| within the field. |
| </p> |
| |
| <subsection name="Inverted Indexing"> |
| |
| <p> |
| The index stores statistics about terms in order |
| to make term-based search more efficient. Lucene's |
| index belongs to the family of indexes known as <i>inverted |
| indexes</i>. This is because it can list, for a term, the documents that contain |
| it. This is the inverse of the natural relationship, in which |
| documents list terms. |
| </p> |
| </subsection> |
| <subsection name="Types of Fields"> |
| |
| <p> |
| In Lucene, fields may be <i>stored</i>, in which |
| case their text is stored in the index literally, in a non-inverted |
| manner. Fields that are inverted are called <i>indexed</i>. A field |
| may be both stored and indexed.</p> |
| |
| <p>The text of a field may be <i>tokenized</i> into terms to be |
| indexed, or the text of a field may be used literally as a term to be indexed. |
| Most fields are |
| tokenized, but sometimes it is useful for certain identifier fields |
| to be indexed literally. |
| </p> |
| |
| </subsection> |
| |
| <subsection name="Segments"> |
| |
| <p> |
| Lucene indexes may be composed of multiple sub-indexes, or |
| <i>segments</i>. Each segment is a fully independent index, which could be searched |
| separately. Indexes evolve by: |
| </p> |
| |
| <ol> |
| <li><p>Creating new segments for newly added documents.</p> |
| </li> |
| <li><p>Merging existing segments.</p> |
| </li> |
| </ol> |
| |
| <p> |
| Searches may involve multiple segments and/or multiple indexes, each |
| index potentially composed of a set of segments. |
| </p> |
| </subsection> |
| |
| <subsection name="Document Numbers"> |
| |
| <p> |
| Internally, Lucene refers to documents by an integer <i>document |
| number</i>. The first document added to an index is numbered zero, and each |
| subsequent document added gets a number one greater than the previous. |
| </p> |
| |
| |
| <p> |
| Note that a document's number may change, so caution should be taken |
| when storing these numbers outside of Lucene. In particular, numbers may |
| change in the following situations: |
| </p> |
| |
| |
| <ul> |
| <li> |
| <p> |
| The |
| numbers stored in each segment are unique only within the segment, |
| and must be converted before they can be used in a larger context. |
| The standard technique is to allocate each segment a range of |
| values, based on the range of numbers used in that segment. To |
| convert a document number from a segment to an external value, the |
| segment's <i>base</i> document |
| number is added. To convert an external value back to a |
| segment-specific value, the segment is identified by the range that |
| the external value is in, and the segment's base value is |
| subtracted. For example, two five-document segments might be |
| combined, so that the first segment has a base value of zero, and |
| the second of five. Document three from the second segment would |
| have an external value of eight. |
| </p> |
| </li> |
| <li> |
| <p> |
| When documents are deleted, gaps are created |
| in the numbering. These are eventually removed as the index evolves |
| through merging. Deleted documents are dropped when segments are |
| merged. A freshly-merged segment thus has no gaps in its numbering. |
| </p> |
| </li> |
| </ul> |
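<p>
The base-value conversion described above can be sketched in Java as follows. This is
only an illustration, not Lucene's own code: the class DocNumberMapping and its method
names are hypothetical, and the sketch assumes that segments are assigned consecutive
ranges in order, with no gaps between them.
</p>

```java
final class DocNumberMapping {
    /** Computes each segment's base: the external number of its document zero. */
    static int[] bases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        return bases;
    }

    /** Segment-relative number to external value: add the segment's base. */
    static int toExternal(int[] bases, int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    /** Finds the segment whose range contains the external value. */
    static int segmentOf(int[] bases, int externalDoc) {
        int segment = 0;
        while (segment + 1 < bases.length && bases[segment + 1] <= externalDoc) {
            segment++;
        }
        return segment;
    }

    /** External value back to a segment-relative number: subtract the base. */
    static int toLocal(int[] bases, int externalDoc) {
        return externalDoc - bases[segmentOf(bases, externalDoc)];
    }
}
```

<p>
With two five-document segments, the bases are zero and five, so document three of the
second segment maps to the external value eight, and eight maps back to document three
of segment one (counting segments from zero).
</p>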
| |
| </subsection> |
| |
| </section> |
| |
| <section name="Overview"> |
| |
| <p> |
| Each segment index maintains the following: |
| </p> |
| <ul> |
| <li><p>Field names. This |
| contains the set of field names used in the index. |
| |
| </p> |
| </li> |
| <li><p>Stored Field |
| values. This contains, for each document, a list of attribute-value |
| pairs, where the attributes are field names. These are used to |
| store auxiliary information about the document, such as its title, |
| URL, or an identifier to access a |
| database. The set of stored fields is what is returned for each hit |
| when searching. This is keyed by document number. |
| </p> |
| </li> |
| <li><p>Term dictionary. |
| A dictionary containing all of the terms used in all of the indexed |
| fields of all of the documents. The dictionary also contains the |
| number of documents which contain the term, and pointers to the |
| term's frequency and proximity data. |
| </p> |
| </li> |
| |
| <li><p>Term Frequency |
| data. For each term in the dictionary, the numbers of all the |
| documents that contain that term, and the frequency of the term in |
| each of those documents. |
| </p> |
| </li> |
| |
| <li><p>Term Proximity |
| data. For each term in the dictionary, the positions at which the term |
| occurs in each document. |
| </p> |
| </li> |
| |
| <li><p>Normalization |
| factors. For each field in each document, a value is stored that is |
| multiplied into the score for hits on that field. |
| </p> |
| </li> |
| <li><p>Term Vectors. For each field in each document, the term vector |
| (sometimes called document vector) is stored. A term vector consists |
| of term text and term frequency. |
| </p> |
| </li> |
| <li><p>Deleted documents. |
| An optional file indicating which documents are deleted. |
| </p> |
| </li> |
| </ul> |
| |
| <p>Details on each of these are provided in subsequent sections. |
| </p> |
| </section> |
| |
| <section name="File Naming"> |
| |
| <p> |
| All files belonging to a segment have the same name with varying |
| extensions. The extensions correspond to the different file formats |
| described below. |
| </p> |
| |
| <p> |
| Typically, all segments |
| in an index are stored in a single directory, although this is not |
| required. |
| </p> |
| |
| </section> |
| |
| <section name="Primitive Types"> |
| |
| <subsection name="Byte"> |
| |
| <p> |
| The most primitive type |
| is an eight-bit byte. Files are accessed as sequences of bytes. All |
| other data types are defined as sequences |
| of bytes, so file formats are byte-order independent. |
| </p> |
| |
| </subsection> |
| |
| <subsection name="UInt32"> |
| |
| <p> |
| 32-bit unsigned integers are written as four |
| bytes, high-order bytes first. |
| </p> |
| <p> |
| UInt32 --> <Byte><sup>4</sup> |
| </p> |
| |
| </subsection> |
| |
| <subsection name="Uint64"> |
| |
| <p> |
| 64-bit unsigned integers are written as eight |
| bytes, high-order bytes first. |
| </p> |
| |
| <p>UInt64 --> <Byte><sup>8</sup> |
| </p> |
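<p>
As an illustration, both fixed-width types can be read from a byte array with the same
high-order-first loop. The class and method names below are hypothetical. Java has no
unsigned integer types, so this sketch returns a long in both cases; a UInt64 larger
than 2^63-1 would therefore appear negative.
</p>

```java
final class BigEndianInts {
    /** Reads four bytes, high-order byte first, as an unsigned 32-bit value. */
    static long readUInt32(byte[] buf, int offset) {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (buf[offset + i] & 0xFF);
        }
        return value;
    }

    /** Reads eight bytes, high-order byte first, into a (signed) Java long. */
    static long readUInt64(byte[] buf, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buf[offset + i] & 0xFF);
        }
        return value;
    }
}
```

<p>
Casting the result of readUInt32 to int recovers negative values such as a format
marker of -1, which is stored as four 0xFF bytes.
</p>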
| |
| </subsection> |
| |
| <subsection name="VInt"> |
| |
| <p> |
| A variable-length format for non-negative integers is |
| defined where the high-order bit of each byte indicates whether more |
| bytes remain to be read. The low-order seven bits are appended as |
| increasingly more significant bits in the resulting integer value. |
| Thus values from zero to 127 may be stored in a single byte, values |
| from 128 to 16,383 may be stored in two bytes, and so on. |
| </p> |
| |
| <p><b>VInt Encoding Example</b></p> |
| |
| <table width="100%" border="0" cellpadding="4" cellspacing="0"> |
| <col width="64*" /> |
| <col width="64*" /> |
| <col width="64*" /> |
| <col width="64*" /> |
| <tr valign="TOP"> |
| <td width="25%"> |
| <p align="RIGHT"><b>Value</b> |
| </p> |
| </td> |
| <td width="25%"> |
| <p align="RIGHT"><b>First byte</b> |
| </p> |
| </td> |
| <td width="25%"> |
| <p align="RIGHT"><b>Second byte</b> |
| </p> |
| </td> |
| <td width="25%"> |
| <p align="RIGHT"><b>Third byte</b> |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="0" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">0 |
| </p> |
| </td> |
| <td width="25%" sdval="0" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 00000000 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="1" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">1 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="2" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">2 |
| </p> |
| </td> |
| <td width="25%" sdval="10" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 00000010 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td width="25%" valign="TOP"> |
| <p align="RIGHT">... |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="127" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">127 |
| </p> |
| </td> |
| <td width="25%" sdval="1111111" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="128" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">128 |
| </p> |
| </td> |
| <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="129" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">129 |
| </p> |
| </td> |
| <td width="25%" sdval="10000001" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="130" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">130 |
| </p> |
| </td> |
| <td width="25%" sdval="10000010" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 10000010 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td width="25%" valign="TOP"> |
| <p align="RIGHT">... |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="16383" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">16,383 |
| </p> |
| </td> |
| <td width="25%" sdval="11111111" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 11111111 |
| </p> |
| </td> |
| <td width="25%" sdval="1111111" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| </td> |
| <td width="25%" sdnum="1033;0;00000000"> |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: |
| 0.01cm"><br/> |
| |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="16384" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">16,384 |
| </p> |
| </td> |
| <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| </td> |
| <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| </tr> |
| <tr valign="BOTTOM"> |
| <td width="25%" sdval="16385" sdnum="1033;0;#,##0"> |
| <p align="RIGHT">16,385 |
| </p> |
| </td> |
| <td width="25%" sdval="10000001" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| </td> |
| <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| </td> |
| <td width="25%" sdval="1" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; |
| margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| </td> |
| </tr> |
| <tr> |
| <td width="25%" valign="TOP"> |
| <p align="RIGHT">... |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; |
| margin-right: 0.01cm"> |
| <br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; |
| margin-right: 0.01cm"> |
| <br/> |
| |
| </p> |
| </td> |
| <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; |
| margin-right: 0.01cm"> |
| <br/> |
| |
| </p> |
| </td> |
| </tr> |
| </table> |
| |
| <p> |
| This provides compression while still being |
| efficient to decode. |
| </p> |
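<p>
The encoding in the table above can be sketched as follows. The class VIntCodec and
its methods are hypothetical names used for illustration; the sketch works on byte
arrays rather than files and assumes the value fits in a Java int.
</p>

```java
import java.io.ByteArrayOutputStream;

final class VIntCodec {
    /** Writes seven bits per byte, low-order group first; the high bit means "more follows". */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Reads bytes starting at offset until one has its high-order bit clear. */
    static int decode(byte[] buf, int offset) {
        int b = buf[offset++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

<p>
For instance, encoding 16,384 with this sketch yields the three bytes 10000000,
10000000, 00000001, matching the corresponding row of the table.
</p>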
| |
| </subsection> |
| |
| <subsection name="Chars"> |
| |
| <p> |
| Lucene writes Unicode |
| character sequences using the standard UTF-8 encoding. |
| </p> |
| |
| |
| </subsection> |
| |
| <subsection name="String"> |
| |
| <p> |
| Lucene writes strings as a VInt representing the length, followed by |
| the character data. |
| </p> |
| |
| <p> |
| String --> VInt, Chars |
| </p> |
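<p>
A reader for this layout might look like the sketch below. It rests on two assumptions
that the text above does not settle: that the length VInt counts characters rather than
bytes, and that only one-, two- and three-byte UTF-8 sequences occur. The class and
method names are hypothetical.
</p>

```java
final class StringSketch {
    /** Reads a VInt length followed by that many UTF-8 encoded characters. */
    static String readString(byte[] buf, int offset) {
        int pos = offset;

        // Decode the VInt length (assumed here to be a character count).
        int b = buf[pos++];
        int length = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            length |= (b & 0x7F) << shift;
        }

        StringBuilder text = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int first = buf[pos++] & 0xFF;
            if (first < 0x80) {
                // One byte: 0xxxxxxx
                text.append((char) first);
            } else if ((first & 0xE0) == 0xC0) {
                // Two bytes: 110xxxxx 10xxxxxx
                int second = buf[pos++] & 0x3F;
                text.append((char) (((first & 0x1F) << 6) | second));
            } else {
                // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                int second = buf[pos++] & 0x3F;
                int third = buf[pos++] & 0x3F;
                text.append((char) (((first & 0x0F) << 12) | (second << 6) | third));
            }
        }
        return text.toString();
    }
}
```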
| |
| </subsection> |
| |
| </section> |
| |
| <section name="Per-Index Files"> |
| |
| <p> |
| The files in this section exist one-per-index. |
| </p> |
| |
| <subsection name="Segments File"> |
| |
| <p> |
| The active segments in the index are stored in the |
| segment info file. An index only has |
| a single file in this format, and it is named "segments". |
| This lists each segment by name, and also contains the size of each |
| segment. |
| </p> |
| |
| <p> |
| Segments --> Format, Version, SegCount, <SegName, SegSize><sup>SegCount</sup> |
| </p> |
| |
| <p> |
| Format, SegCount, SegSize --> UInt32 |
| </p> |
| |
| <p> |
| Version --> UInt64 |
| </p> |
| |
| <p> |
| SegName --> String |
| </p> |
| |
| <p> |
| Format is -1 in Lucene 1.4. |
| </p> |
| |
| <p> |
| Version counts how often the index has been |
| changed by adding or deleting documents. |
| </p> |
| |
| <p> |
| SegName is the name of the segment, and is used as the file name prefix |
| for all of the files that compose the segment's index. |
| </p> |
| |
| <p> |
| SegSize is the number of documents contained in the segment index. |
| </p> |
| |
| |
| </subsection> |
| |
| <subsection name="Lock Files"> |
| |
| <p> |
| Several files are used to indicate that another |
| process is using an index. Note that these files are not |
| stored in the index directory itself, but rather in the |
| system's temporary directory, as indicated in the Java |
| system property "java.io.tmpdir". |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| When a file named "commit.lock" |
| is present, a process is currently re-writing the "segments" |
| file and deleting outdated segment index files, or a process is |
| reading the "segments" |
| file and opening the files of the segments it names. This lock file |
| prevents files from being deleted by another process after a process |
| has read the "segments" |
| file but before it has managed to open all of the files of the |
| segments named therein. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When a file named "write.lock" |
| is present, a process is currently adding documents to an index, or |
| removing files from that index. This lock file prevents several |
| processes from attempting to modify an index at the same time. |
| </p> |
| </li> |
| </ul> |
| </subsection> |
| |
| <subsection name="Deletable File"> |
| |
| <p> |
| A file named "deletable" |
| contains the names of files that are no longer used by the index, but |
| which could not be deleted. This is only used on Win32, where a |
| file may not be deleted while it is still open. On other platforms |
| the file contains only null bytes. |
| </p> |
| |
| <p> |
| Deletable --> DeletableCount, |
| <DeletableName><sup>DeletableCount</sup> |
| </p> |
| |
| <p>DeletableCount --> UInt32 |
| </p> |
| <p>DeletableName --> |
| String |
| </p> |
| </subsection> |
| </section> |
| |
| <section name="Per-Segment Files"> |
| |
| <p> |
| The remaining files are all per-segment, and are |
| thus defined by suffix. |
| </p> |
| <subsection name="Fields"> |
| <p><br/><b>Field Info</b><br/></p> |
| |
| <p> |
| Field names are |
| stored in the field info file, with suffix .fnm. |
| </p> |
| <p> |
| FieldInfos |
| (.fnm) --> FieldsCount, <FieldName, |
| FieldBits><sup>FieldsCount</sup> |
| </p> |
| |
| <p> |
| FieldsCount --> VInt |
| </p> |
| |
| <p> |
| FieldName --> String |
| </p> |
| |
| <p> |
| FieldBits --> Byte |
| </p> |
| |
| <p> |
| The low-order bit is one for |
| indexed fields, and zero for non-indexed fields. The second lowest-order |
| bit is one for fields that have term vectors stored, and zero for fields |
| without term vectors. |
| </p> |
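<p>
Testing the two FieldBits flags described above amounts to two bit masks. The sketch
below uses hypothetical class and method names.
</p>

```java
final class FieldBitsSketch {
    /** The low-order bit is one for indexed fields. */
    static boolean isIndexed(byte fieldBits) {
        return (fieldBits & 0x01) != 0;
    }

    /** The second lowest-order bit is one for fields with term vectors stored. */
    static boolean storesTermVector(byte fieldBits) {
        return (fieldBits & 0x02) != 0;
    }
}
```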
| |
| <p> |
| Fields are numbered by their order in this file. Thus field zero is |
| the |
| first field in the file, field one the next, and so on. Note that, |
| like document numbers, field numbers are segment relative. |
| </p> |
| |
| <p><br/><b>Stored Fields</b><br/></p> |
| |
| <p> |
| Stored fields are represented by two files: |
| </p> |
| |
| <ol> |
| <li> |
| <p> |
| The field index, or .fdx file. |
| </p> |
| |
| <p> |
| This contains, for each document, a pointer to |
| its field data, as follows: |
| </p> |
| |
| <p> |
| FieldIndex |
| (.fdx) --> |
| <FieldValuesPosition><sup>SegSize</sup> |
| </p> |
| <p>FieldValuesPosition |
| --> UInt64 |
| </p> |
| <p>This |
| is used to find the location within the field data file of the |
| fields of a particular document. Because it contains fixed-length |
| data, this file may be easily randomly accessed. The position of |
| document <i>n</i>'s field data is the UInt64 at byte offset <i>n*8</i> in |
| this file. |
| </p> |
| </li> |
| <li> |
| <p> |
| The field data, or .fdt file. |
| |
| </p> |
| |
| <p> |
| This contains the stored fields of each document, |
| as follows: |
| </p> |
| |
| <p> |
| FieldData (.fdt) --> |
| <DocFieldData><sup>SegSize</sup> |
| </p> |
| <p>DocFieldData --> |
| FieldCount, <FieldNum, Bits, Value><sup>FieldCount</sup> |
| </p> |
| <p>FieldCount --> |
| VInt |
| </p> |
| <p>FieldNum --> |
| VInt |
| </p> |
| <p>Bits --> |
| Byte |
| </p> |
| <p>Value --> |
| String |
| </p> |
| <p>Currently |
| only the low-order bit of Bits is used. It is one for |
| tokenized fields, and zero for non-tokenized fields. |
| </p> |
| </li> |
| </ol> |
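<p>
Because every entry of the field index is eight bytes wide, the lookup of a document's
field data position is a single fixed-offset read. The sketch below assumes the whole
.fdx file has been loaded into a byte array; the class and method names are
hypothetical.
</p>

```java
import java.nio.ByteBuffer;

final class FieldIndexSketch {
    /** Returns the .fdt position stored for the given segment-relative document number. */
    static long fieldDataPosition(byte[] fdx, int docNumber) {
        // ByteBuffer reads high-order bytes first by default, as UInt64 requires.
        return ByteBuffer.wrap(fdx).getLong(docNumber * 8);
    }
}
```

<p>
The returned value is the offset in the .fdt file at which that document's
DocFieldData begins.
</p>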
| |
| </subsection> |
| <subsection name="Term Dictionary"> |
| |
| <p> |
| The term dictionary is represented as two files: |
| </p> |
| <ol> |
| <li> |
| <p> |
| The term infos, or .tis file. |
| </p> |
| |
| <p> |
| TermInfoFile (.tis)--> |
| TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos |
| </p> |
| <p>TIVersion --> |
| UInt32 |
| </p> |
| <p>TermCount --> |
| UInt64 |
| </p> |
| <p>IndexInterval --> |
| UInt32 |
| </p> |
| <p>SkipInterval --> |
| UInt32 |
| </p> |
| <p>TermInfos --> |
| <TermInfo><sup>TermCount</sup> |
| </p> |
| <p>TermInfo --> |
| <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> |
| </p> |
| <p>Term --> |
| <PrefixLength, Suffix, FieldNum> |
| </p> |
| <p>Suffix --> |
| String |
| </p> |
| <p>PrefixLength, |
| DocFreq, FreqDelta, ProxDelta, SkipDelta<br/> --> VInt |
| </p> |
| <p>This |
| file is sorted by Term. Terms are ordered first lexicographically |
| by the term's field name, and within that lexicographically by the |
| term's text. |
| </p> |
| <p>TIVersion names the version of the format |
| of this file and is -2 in Lucene 1.4. |
| </p> |
| <p>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </p> |
| <p>FieldNum |
| determines the term's field, whose name is stored in the .fnm file. |
| </p> |
| <p>DocFreq |
| is the count of documents which contain the term. |
| </p> |
| <p>FreqDelta |
| determines the position of this term's TermFreqs within the .frq |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). |
| </p> |
| <p>ProxDelta |
| determines the position of this term's TermPositions within the .prx |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). |
| </p> |
| <p>SkipDelta determines the position of this |
| term's SkipData within the .frq file. In |
| particular, it is the number of bytes |
| after TermFreqs that the SkipData starts. |
| In other words, it is the length of the |
| TermFreq data. |
| </p> |
| </li> |
| <li> |
| <p> |
| The term info index, or .tii file. |
| </p> |
| |
| <p> |
| This contains every IndexInterval<sup>th</sup> entry from the .tis |
| file, along with its location in the "tis" file. This is |
| designed to be read entirely into memory and used to provide random |
| access to the "tis" file. |
| </p> |
| |
| <p> |
| The structure of this file is very similar to the |
| .tis file, with the addition of one item per record, the IndexDelta. |
| </p> |
| |
| <p> |
| TermInfoIndex (.tii)--> |
| IndexTermCount, TermIndices |
| </p> |
| <p>IndexTermCount --> |
| UInt32 |
| </p> |
| <p>TermIndices --> |
| <TermInfo, IndexDelta><sup>IndexTermCount</sup> |
| </p> |
| <p>IndexDelta --> |
| VInt |
| </p> |
| <p>IndexDelta |
| determines the position of this term's TermInfo within the .tis file. In |
| particular, it is the difference between the position of this term's |
| entry in that file and the position of the previous term's entry (or |
| zero for the first term in the file). |
| </p> |
| </li> |
| </ol> |
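<p>
The prefix sharing used for term text can be illustrated with two small helpers: one
that rebuilds a term from the previous term, a PrefixLength and a Suffix, and one that
computes the PrefixLength a writer would store. The class and method names are
hypothetical.
</p>

```java
final class TermPrefixSketch {
    /** Rebuilds a term's text from the previous term's text and the stored suffix. */
    static String termText(String previousText, int prefixLength, String suffix) {
        return previousText.substring(0, prefixLength) + suffix;
    }

    /** Counts the initial characters that two consecutive terms have in common. */
    static int sharedPrefixLength(String previousText, String currentText) {
        int limit = Math.min(previousText.length(), currentText.length());
        int n = 0;
        while (n < limit && previousText.charAt(n) == currentText.charAt(n)) {
            n++;
        }
        return n;
    }
}
```

<p>
For the terms "bone" and "boy", the shared prefix length is two, and combining the
first two characters of "bone" with the suffix "y" gives back "boy".
</p>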
| </subsection> |
| |
| <subsection name="Frequencies"> |
| |
| <p> |
| The .frq file contains the lists of documents |
| which contain each term, along with the frequency of the term in that |
| document. |
| </p> |
| <p>FreqFile (.frq) --> |
| <TermFreqs, SkipData><sup>TermCount</sup> |
| </p> |
| <p>TermFreqs --> |
| <TermFreq><sup>DocFreq</sup> |
| </p> |
| <p>TermFreq --> |
| DocDelta, Freq? |
| </p> |
| <p>SkipData --> |
| <SkipDatum><sup>DocFreq/SkipInterval</sup> |
| </p> |
| <p>SkipDatum --> |
| DocSkip,FreqSkip,ProxSkip |
| </p> |
| <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --> |
| VInt |
| </p> |
| <p>TermFreqs |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>TermFreq |
| entries are ordered by increasing document number. |
| </p> |
| <p>DocDelta |
| determines both the document number and the frequency. In |
| particular, DocDelta/2 is the difference between this document number |
| and the previous document number (or zero when this is the first |
| document in a TermFreqs). When DocDelta is odd, the frequency is |
| one. When DocDelta is even, the frequency is read as another VInt. |
| </p> |
| <p>For |
| example, the TermFreqs for a term which occurs once in document seven |
| and three times in document eleven would be the following sequence of |
| VInts: |
| </p> |
| <p> 15, |
| 8, 3 |
| </p> |
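<p>
The DocDelta rule can be sketched as an encoder that turns parallel arrays of document
numbers and frequencies into the sequence of VInt values to be written. This is an
illustration under the definition given above, with hypothetical names; it produces the
integer values only and leaves the VInt byte encoding aside.
</p>

```java
import java.util.ArrayList;
import java.util.List;

final class TermFreqsSketch {
    /** docs must be in increasing order; freqs[i] is the frequency in docs[i]. */
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> vints = new ArrayList<>();
        int previousDoc = 0; // the first document is measured from zero
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - previousDoc;
            if (freqs[i] == 1) {
                // Odd DocDelta: the frequency is one and no Freq follows.
                vints.add(delta * 2 + 1);
            } else {
                // Even DocDelta: the frequency follows as its own VInt.
                vints.add(delta * 2);
                vints.add(freqs[i]);
            }
            previousDoc = docs[i];
        }
        return vints;
    }
}
```

<p>
For a term that occurs once in document seven and three times in document eleven, the
sketch yields 15 (7*2+1), then 8 ((11-7)*2), then the frequency 3.
</p>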
| <p>DocSkip records the document number before every |
| SkipInterval<sup>th</sup> document in TermFreqs. |
| Document numbers are represented as differences |
| from the previous value in the sequence. FreqSkip |
| and ProxSkip record the position of every |
| SkipInterval<sup>th</sup> entry in FreqFile and |
| ProxFile, respectively. For the first SkipDatum, file positions are |
| relative to the start of TermFreqs and Positions; for each later one, they are relative |
| to the previous SkipDatum in the sequence. |
| </p> |
| <p>For example, if DocFreq=35 and SkipInterval=16, |
| then there are two SkipDatum entries, containing |
| the 15<sup>th</sup> and 31<sup>st</sup> document |
| numbers in TermFreqs. The first FreqSkip names |
| the number of bytes after the beginning of |
| TermFreqs at which the 16<sup>th</sup> entry |
| starts, and the second the number of bytes after |
| that at which the 32<sup>nd</sup> starts. The first |
| ProxSkip names the number of bytes after the |
| beginning of Positions at which the 16<sup>th</sup> |
| entry starts, and the second the number of |
| bytes after that at which the 32<sup>nd</sup> starts. |
| </p> |
| |
| </subsection> |
| <subsection name="Positions"> |
| |
| <p> |
| The .prx file contains the lists of positions that |
| each term occurs at within documents. |
| </p> |
| <p>ProxFile (.prx) --> |
| <TermPositions><sup>TermCount</sup> |
| </p> |
| <p>TermPositions --> |
| <Positions><sup>DocFreq</sup> |
| </p> |
| <p>Positions --> |
| <PositionDelta><sup>Freq</sup> |
| </p> |
| <p>PositionDelta --> |
| VInt |
| </p> |
| <p>TermPositions |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>Positions |
| entries are ordered by increasing document number (the document |
| number is implicit from the .frq file). |
| </p> |
| <p>PositionDelta |
| is the difference between the position of the current occurrence in |
| the document and the previous occurrence (or zero, if this is the |
| first occurrence in this document). |
| </p> |
| <p> |
| For example, the TermPositions for a |
| term which occurs as the fourth term in one document, and as the |
| fifth and ninth term in a subsequent document, would be the following |
| sequence of VInts: |
| </p> |
| <p> 4, |
| 5, 4 |
| </p> |
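<p>
The position deltas can be sketched in the same way. The encoder below takes, for one
term, the list of positions in each document and emits the PositionDelta values. The
names are hypothetical, and position values are taken as given in the example above
(the fourth term is treated as position 4).
</p>

```java
import java.util.ArrayList;
import java.util.List;

final class PositionsSketch {
    /** positionsPerDoc[d] holds the increasing positions of the term in the d-th document. */
    static List<Integer> encode(int[][] positionsPerDoc) {
        List<Integer> vints = new ArrayList<>();
        for (int[] positions : positionsPerDoc) {
            int previous = 0; // deltas restart from zero in every document
            for (int position : positions) {
                vints.add(position - previous);
                previous = position;
            }
        }
        return vints;
    }
}
```

<p>
For one document with position 4 and a following document with positions 5 and 9, the
result is 4, 5, 4.
</p>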
| </subsection> |
| <subsection name="Normalization Factors"> |
| <p>There's a norm file for each indexed field with a byte for |
| each document. The .f[0-9]* file contains, |
| for each document, a byte that encodes a value that is multiplied |
| into the score for hits on that field: |
| </p> |
| <p>Norms |
| (.f[0-9]*) --> <Byte><sup>SegSize</sup> |
| </p> |
| <p>Each |
| byte encodes a floating point value. Bits 0-2 contain the 3-bit |
| mantissa, and bits 3-7 contain the 5-bit exponent. |
| </p> |
| <p>These |
| are converted to an IEEE single float value as follows: |
| </p> |
| <ol> |
| <li><p>If |
| the byte is zero, use a zero float. |
| </p> |
| </li> |
| <li><p>Otherwise, |
| set the sign bit of the float to zero; |
| </p> |
| </li> |
| <li><p>add |
| 48 to the exponent and use this as the float's exponent; |
| </p> |
| </li> |
| <li><p>map |
| the mantissa to the high-order 3 bits of the float's mantissa; and |
| |
| </p> |
| </li> |
| <li><p>set |
| the low-order 21 bits of the float's mantissa to zero. |
| </p> |
| </li> |
| </ol> |
| |
| </subsection> |
| <subsection name="Term Vectors"> |
| <p>Term Vector support is optional on a field-by-field basis. It consists of three |
| files.</p> |
| <ol> |
| <li> |
| <p>The Document Index or .tvx file.</p> |
| <p>This contains, for each document, a pointer to the document data in the Document |
| (.tvd) file. |
| </p> |
| <p>DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition><sup>NumDocs</sup></p> |
| <p>TVXVersion --> Int</p> |
| <p>DocumentPosition --> UInt64</p> |
| <p>This is used to find the position of the Document in the .tvd file.</p> |
| </li> |
| <li> |
| <p>The Document or .tvd file.</p> |
| <p>This contains, for each document, the number of fields, a list of the fields with |
| term vector info and finally a list of pointers to the field information in the .tvf |
| (Term Vector Fields) file.</p> |
| <p> |
| Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions><sup>NumDocs</sup> |
| </p> |
| <p>TVDVersion --> Int</p> |
| <p>NumFields --> VInt</p> |
| <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p> |
| <p>FieldNumDelta --> VInt</p> |
| <p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p> |
| <p>FieldPosition --> VLong</p> |
| <p>The .tvd file is used to map out the fields that have term vectors stored and |
| where the field information is in the .tvf file.</p> |
| </li> |
| <li> |
| <p>The Field or .tvf file.</p> |
| <p>This file contains, for each field that has a term vector stored, a list of |
| the terms and their frequencies.</p> |
| <p>Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> |
| <p>TVFVersion --> Int</p> |
| <p>NumTerms --> VInt</p> |
| <p>NumDistinct --> VInt -- Future Use</p> |
| <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p> |
| <p>TermText --> <PrefixLength, Suffix></p> |
| <p>PrefixLength --> VInt</p> |
| <p>Suffix --> String</p> |
| <p>TermFreq --> VInt</p> |
| <p>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </p> |
| </li> |
| </ol> |
| </subsection> |
| |
| <subsection name="Deleted Documents"> |
| |
| <p>The .del file is |
| optional, and only exists when a segment contains deletions: |
| </p> |
| |
| <p>Deletions |
| (.del) --> ByteCount,BitCount,Bits |
| </p> |
| |
| <p>ByteCount,BitCount --> |
| UInt32 |
| </p> |
| |
| <p>Bits --> |
| <Byte><sup>ByteCount</sup> |
| </p> |
| |
| <p>ByteCount |
| indicates the number of bytes in Bits. It is typically |
| (SegSize/8)+1. |
| </p> |
| |
| <p> |
| BitCount |
| indicates the number of bits that are currently set in Bits. |
| </p> |
| |
| <p>Bits |
| contains one bit for each document indexed. When the bit |
| corresponding to a document number is set, that document is marked as |
| deleted. Bit ordering is from least to most significant. Thus, if |
| Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as |
| deleted. |
| </p> |
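<p>
Checking whether a document is marked as deleted is a matter of selecting the right
byte and bit. The sketch below assumes the Bits array has already been read into
memory; the class and method names are hypothetical.
</p>

```java
final class DeletedDocsSketch {
    /** Bit ordering is from least to most significant within each byte. */
    static boolean isDeleted(byte[] bits, int docNumber) {
        int byteIndex = docNumber >> 3;   // eight documents per byte
        int mask = 1 << (docNumber & 7);  // bit within that byte
        return (bits[byteIndex] & mask) != 0;
    }
}
```

<p>
With the two bytes 0x00 and 0x02, document 9 falls in the second byte (index one) at
bit one, which is set, so it is reported as deleted.
</p>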
| </subsection> |
| </section> |
| |
| <section name="Limitations"> |
| |
| <p>There |
| are a few places where these file formats limit the maximum number of |
| terms and documents to a 32-bit quantity, or to approximately 4 |
| billion. This is not a problem today but probably will be in the |
| long term. These fields should therefore be replaced with either |
| UInt64 values or, better yet, VInt values, which have no limit. |
| </p> |
| |
| </section> |
| |
| </body> |
| |
| </document> |