| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| |
| <!-- |
| Copyright 1999-2004 The Apache Software Foundation |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| |
| <!-- Content Stylesheet for Site --> |
| |
| |
| <!-- start the processing --> |
| <!-- ====================================================================== --> |
| <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! --> |
| <!-- Main Page Section --> |
| <!-- ====================================================================== --> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/> |
| |
| |
| |
| |
| |
| <title>Jakarta Lucene - Index File Formats</title> |
| </head> |
| |
| <body bgcolor="#ffffff" text="#000000" link="#525D76"> |
| <table border="0" width="100%" cellspacing="0"> |
| <!-- TOP IMAGE --> |
| <tr> |
| <td align="left"> |
| <a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a> |
| </td> |
| <td align="right"> |
| <a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a> |
| </td> |
| </tr> |
| </table> |
| <table border="0" width="100%" cellspacing="4"> |
| <tr><td colspan="2"> |
| <hr noshade="" size="1"/> |
| </td></tr> |
| |
| <tr> |
| <!-- LEFT SIDE NAVIGATION --> |
| <td width="20%" valign="top" nowrap="true"> |
| |
| <!-- ============================================================ --> |
| |
| <p><strong>About</strong></p> |
| <ul> |
| <li> <a href="./index.html">Overview</a> |
| </li> |
| <li> <a href="http://wiki.apache.org/jakarta-lucene/PoweredBy">Powered by Lucene</a> |
| </li> |
| <li> <a href="./whoweare.html">Who We Are</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a> |
| </li> |
| </ul> |
| <p><strong>Resources</strong></p> |
| <ul> |
| <li> <a href="http://wiki.apache.org/jakarta-lucene">Wiki</a> |
| </li> |
| <li> <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a> |
| </li> |
| <li> <a href="http://www.jguru.com/faq/Lucene">jGuru FAQ</a> |
| </li> |
| <li> <a href="./gettingstarted.html">Getting Started</a> |
| </li> |
| <li> <a href="./queryparsersyntax.html">Query Syntax</a> |
| </li> |
| <li> <a href="./systemproperties.html">System Properties</a> |
| </li> |
| <li> <a href="./fileformats.html">File Formats</a> |
| </li> |
| <li> <a href="./api/index.html">Javadoc</a> |
| </li> |
| <li> <a href="./contributions.html">Contributions</a> |
| </li> |
| <li> <a href="./resources.html">Articles, etc.</a> |
| </li> |
| <li> <a href="./benchmarks.html">Benchmarks</a> |
| </li> |
| <li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Patches</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a> |
| </li> |
| <li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a> |
| </li> |
| <li> <a href="http://issues.apache.org/eyebrowse/SummarizeList?listId=30">Lucene-user</a> |
| </li> |
| <li> <a href="http://issues.apache.org/eyebrowse/SummarizeList?listId=29">Lucene-dev</a> |
| </li> |
| <li> <a href="./lucene-sandbox/">Lucene Sandbox</a> |
| </li> |
| </ul> |
| <p><strong>Download</strong></p> |
| <ul> |
| <li> <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a> |
| </li> |
| </ul> |
| <p><strong>Jakarta</strong></p> |
| <ul> |
| <li> <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/contact.html">Contact</a> |
| </li> |
| <li> <a href="http://jakarta.apache.org/site/legal.html">Legal</a> |
| </li> |
| </ul> |
| </td> |
| <td width="80%" align="left" valign="top"> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Index File Formats"><strong>Index File Formats</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| This document defines the index file formats used |
| in Lucene version 1.4. |
| </p> |
| <p> |
| Jakarta Lucene is written in Java, but several |
| efforts are underway to write versions of Lucene in other programming |
| languages. If these versions are to remain compatible with Jakarta |
| Lucene, then a language-independent definition of the Lucene index |
| format is required. This document thus attempts to provide a |
| complete and independent definition of the Jakarta Lucene 1.4 file |
| formats. |
| </p> |
| <p> |
| As Lucene evolves, this document should evolve. |
| Versions of Lucene in different programming languages should endeavor |
| to agree on file formats, and generate new versions of this document. |
| </p> |
| <p> |
| Compatibility notes are provided in this document, |
| describing how file formats have changed from prior versions. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Definitions"><strong>Definitions</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The fundamental concepts in Lucene are index, |
| document, field and term. |
| </p> |
| <p> |
| An index contains a sequence of documents. |
| </p> |
| <ul> |
| <li> |
| <p> |
| A document is a sequence of fields. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| A field is a named sequence of terms. |
| </p> |
| </li> |
| |
| <li> |
| A term is a string. |
| </li> |
| </ul> |
| <p> |
| The same string in two different fields is |
| considered a different term. Thus terms are represented as a pair of |
| strings, the first naming the field, and the second naming text |
| within the field. |
| </p> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Inverted Indexing"><strong>Inverted Indexing</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The index stores statistics about terms in order |
| to make term-based search more efficient. Lucene's |
| index belongs to the family of indexes known as <i>inverted |
| indexes</i>, because it can list, for a term, the documents that contain |
| it. This is the inverse of the natural relationship, in which |
| documents list terms. |
| </p> |
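| <p> |
| The inversion described above can be sketched in a few lines of Java. This is only |
| an illustration, not Lucene's own data structure: the class name |
| <code>InvertedIndexSketch</code>, the method <code>invert</code> and the in-memory |
| map are assumptions, and each document is simplified to an array of terms. |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names and layout are illustrative, not Lucene's API.
class InvertedIndexSketch {

    // Maps each term to the ascending list of document numbers that contain it.
    // The position of a document in the input array is its document number.
    static Map<String, List<Integer>> invert(String[][] docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docNum = 0; docNum < docs.length; docNum++) {
            for (String term : docs[docNum]) {
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document once per term, even if the term repeats.
                if (list.isEmpty() || list.get(list.size() - 1) != docNum) {
                    list.add(docNum);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"apache", "lucene"},
            {"lucene", "index"},
            {"index", "index"}
        };
        // Documents list terms; the result lists documents per term.
        System.out.println(invert(docs));
    }
}
```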
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Types of Fields"><strong>Types of Fields</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| In Lucene, fields may be <i>stored</i>, in which |
| case their text is stored in the index literally, in a non-inverted |
| manner. Fields that are inverted are called <i>indexed</i>. A field |
| may be both stored and indexed.</p> |
| <p>The text of a field may be <i>tokenized</i> into terms to be |
| indexed, or the text of a field may be used literally as a term to be indexed. |
| Most fields are |
| tokenized, but sometimes it is useful for certain identifier fields |
| to be indexed literally. |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Segments"><strong>Segments</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Lucene indexes may be composed of multiple sub-indexes, or |
| <i>segments</i>. Each segment is a fully independent index that can be searched |
| separately. Indexes evolve by: |
| </p> |
| <ol> |
| <li><p>Creating new segments for newly added documents.</p> |
| </li> |
| <li><p>Merging existing segments.</p> |
| </li> |
| </ol> |
| <p> |
| Searches may involve multiple segments and/or multiple indexes, each |
| index potentially composed of a set of segments. |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Document Numbers"><strong>Document Numbers</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Internally, Lucene refers to documents by an integer <i>document |
| number</i>. The first document added to an index is numbered zero, and each |
| subsequent document added gets a number one greater than the previous. |
| </p> |
| <p> |
| Note that a document's number may change, so caution should be taken |
| when storing these numbers outside of Lucene. In particular, numbers may |
| change in the following situations: |
| </p> |
| <ul> |
| <li> |
| <p> |
| The |
| numbers stored in each segment are unique only within the segment, |
| and must be converted before they can be used in a larger context. |
| The standard technique is to allocate each segment a range of |
| values, based on the range of numbers used in that segment. To |
| convert a document number from a segment to an external value, the |
| segment's <i>base</i> document |
| number is added. To convert an external value back to a |
| segment-specific value, the segment is identified by the range that |
| the external value is in, and the segment's base value is |
| subtracted. For example, two five-document segments might be |
| combined, so that the first segment has a base value of zero and |
| the second a base value of five. Document three from the second |
| segment would then have an external value of eight. |
| </p> |
| </li> |
| <li> |
| <p> |
| When documents are deleted, gaps are created |
| in the numbering. These are eventually removed as the index evolves |
| through merging. Deleted documents are dropped when segments are |
| merged. A freshly-merged segment thus has no gaps in its numbering. |
| </p> |
| </li> |
| </ul> |
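| <p> |
| The base-value conversion described in the first item above can be sketched as |
| follows. The class <code>DocNumberSketch</code> and its method names are hypothetical |
| and chosen only for illustration; the sketch assumes the segment bases are kept in |
| an ascending array. |
| </p> |

```java
// Hypothetical helper; the names are illustrative, not Lucene's API.
class DocNumberSketch {

    // bases[i] is the base document number of segment i, in ascending order.
    static int toExternal(int[] bases, int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    // Finds the segment whose range contains the external document number.
    static int segmentOf(int[] bases, int externalDoc) {
        int segment = 0;
        while (segment + 1 < bases.length && bases[segment + 1] <= externalDoc) {
            segment++;
        }
        return segment;
    }

    // Converts an external number back to a segment-specific number.
    static int toLocal(int[] bases, int externalDoc) {
        return externalDoc - bases[segmentOf(bases, externalDoc)];
    }

    public static void main(String[] args) {
        int[] bases = {0, 5}; // two five-document segments
        System.out.println(toExternal(bases, 1, 3)); // 8, matching the example in the text
        System.out.println(segmentOf(bases, 8));     // 1
        System.out.println(toLocal(bases, 8));       // 3
    }
}
```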
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Overview"><strong>Overview</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Each segment index maintains the following: |
| </p> |
| <ul> |
| <li><p>Field names. This |
| contains the set of field names used in the index. |
| |
| </p> |
| </li> |
| <li><p>Stored Field |
| values. This contains, for each document, a list of attribute-value |
| pairs, where the attributes are field names. These are used to |
| store auxiliary information about the document, such as its title, |
| URL, or an identifier to access a |
| database. The set of stored fields is what is returned for each hit |
| when searching. This is keyed by document number. |
| </p> |
| </li> |
| <li><p>Term dictionary. |
| A dictionary containing all of the terms used in all of the indexed |
| fields of all of the documents. The dictionary also contains the |
| number of documents which contain the term, and pointers to the |
| term's frequency and proximity data. |
| </p> |
| </li> |
| |
| <li><p>Term Frequency |
| data. For each term in the dictionary, the numbers of all the |
| documents that contain that term, and the frequency of the term in |
| each of those documents. |
| </p> |
| </li> |
| |
| <li><p>Term Proximity |
| data. For each term in the dictionary, the positions at which the term |
| occurs in each document. |
| </p> |
| </li> |
| |
| <li><p>Normalization |
| factors. For each field in each document, a value is stored that is |
| multiplied into the score for hits on that field. |
| </p> |
| </li> |
| <li><p>Term Vectors. For each field in each document, the term vector |
| (sometimes called document vector) is stored. A term vector consists |
| of term text and term frequency. |
| </p> |
| </li> |
| <li><p>Deleted documents. |
| An optional file indicating which documents are deleted. |
| </p> |
| </li> |
| </ul> |
| <p>Details on each of these are provided in subsequent sections. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="File Naming"><strong>File Naming</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| All files belonging to a segment have the same name with varying |
| extensions. The extensions correspond to the different file formats |
| described below. |
| </p> |
| <p> |
| Typically, all segments |
| in an index are stored in a single directory, although this is not |
| required. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Primitive Types"><strong>Primitive Types</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Byte"><strong>Byte</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The most primitive type |
| is an eight-bit byte. Files are accessed as sequences of bytes. All |
| other data types are defined as sequences |
| of bytes, so file formats are byte-order independent. |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="UInt32"><strong>UInt32</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| 32-bit unsigned integers are written as four |
| bytes, high-order bytes first. |
| </p> |
| <p> |
| UInt32 --&gt; &lt;Byte&gt;<sup>4</sup> |
| </p> |
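| <p> |
| A minimal Java sketch of this layout follows. The class name |
| <code>UInt32Sketch</code> is an assumption made for illustration; because Java has no |
| unsigned 32-bit type, the value is carried in a <code>long</code>. |
| </p> |

```java
// Hypothetical helper, not part of Lucene.
class UInt32Sketch {

    // Writes the low 32 bits of value as four bytes, high-order byte first.
    static byte[] write(long value) {
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    // Reads four bytes, high-order byte first, into an unsigned value held in a long.
    static long read(byte[] b) {
        return ((b[0] & 0xFFL) << 24)
             | ((b[1] & 0xFFL) << 16)
             | ((b[2] & 0xFFL) << 8)
             |  (b[3] & 0xFFL);
    }

    public static void main(String[] args) {
        byte[] encoded = write(258L); // 0x00000102
        System.out.println(java.util.Arrays.toString(encoded));
        System.out.println(read(encoded));
    }
}
```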
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Uint64"><strong>UInt64</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| 64-bit unsigned integers are written as eight |
| bytes, high-order bytes first. |
| </p> |
| <p>UInt64 --&gt; &lt;Byte&gt;<sup>8</sup> |
| </p> |
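| <p> |
| The eight-byte layout can be sketched with <code>java.nio.ByteBuffer</code>, whose |
| default byte order is big-endian. The class name <code>UInt64Sketch</code> is |
| hypothetical. Java's <code>long</code> is signed, so the sketch stores the bit |
| pattern and relies on <code>Long.toUnsignedString</code> when the unsigned value |
| has to be printed. |
| </p> |

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene.
class UInt64Sketch {

    // Writes the 64-bit pattern as eight bytes, high-order byte first
    // (ByteBuffer defaults to big-endian order).
    static byte[] write(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }

    // Reads eight bytes, high-order byte first, back into a long bit pattern.
    static long read(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        long bits = read(write(-1L)); // all 64 bits set
        System.out.println(Long.toUnsignedString(bits));
    }
}
```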
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="VInt"><strong>VInt</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| A variable-length format for non-negative integers is |
| defined where the high-order bit of each byte indicates whether more |
| bytes remain to be read. The low-order seven bits are appended as |
| increasingly more significant bits in the resulting integer value. |
| Thus values from zero to 127 may be stored in a single byte, values |
| from 128 to 16,383 may be stored in two bytes, and so on. |
| </p> |
| <p><b>VInt Encoding Example</b></p> |
| <table> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT"><b>Value</b> |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT"><b>First byte</b> |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT"><b>Second byte</b> |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT"><b>Third byte</b> |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">0 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000000 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">1 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">2 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 00000010 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">127 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">128 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">129 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">130 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000010 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">16,383 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 11111111 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 01111111 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"><br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">16,384 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">16,385 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| 10000001 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| 10000000 |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| 00000001 |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| <tr> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p align="RIGHT">... |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm"> |
| <br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm"> |
| <br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| <td bgcolor="#a0ddf0" colspan="" rowspan="" valign="top" align="left"> |
| <font color="#000000" size="-1" face="arial,helvetica,sanserif"> |
| |
| <p class="western" align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm"> |
| <br /> |
| |
| </p> |
| |
| </font> |
| </td> |
| </tr> |
| </table> |
| <p> |
| This provides compression while still being |
| efficient to decode. |
| </p> |
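| <p> |
| The following is a minimal sketch of VInt encoding and decoding, assuming |
| seven data bits per byte with the low-order group written first and the high |
| bit of each byte marking that more bytes follow, as in the 16,385 row of the |
| table. The function names are illustrative and are not part of Lucene. |
| </p> |

```python
def encode_vint(value):
    """Encode a non-negative integer as VInt bytes (low-order 7-bit group first)."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)  # last byte has the high bit clear
    return bytes(out)


def decode_vint(data, pos=0):
    """Return (value, next_pos) for the VInt that starts at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7
```

| <p> |
| Returning the next read position alongside the value lets a caller decode |
| several consecutive VInts from one buffer. |
| </p> |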
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Chars"><strong>Chars</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Lucene writes Unicode |
| character sequences using the standard UTF-8 encoding. |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="String"><strong>String</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Lucene writes strings as a VInt representing the length, followed by |
| the character data. |
| </p> |
| <p> |
| String --> VInt, Chars |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Per-Index Files"><strong>Per-Index Files</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The files in this section exist one-per-index. |
| </p> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Segments File"><strong>Segments File</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The active segments in the index are stored in the |
| segment info file. An index has only |
| one file in this format, and it is named "segments". |
| It lists each segment by name, and also contains the size of each |
| segment. |
| </p> |
| <p> |
| Segments --> Format, Version, SegCount, <SegName, SegSize><sup>SegCount</sup> |
| </p> |
| <p> |
| Format, SegCount, SegSize --> UInt32 |
| </p> |
| <p> |
| Version --> UInt64 |
| </p> |
| <p> |
| SegName --> String |
| </p> |
| <p> |
| Format is -1 in Lucene 1.4. |
| </p> |
| <p> |
| Version counts how often the index has been |
| changed by adding or deleting documents. |
| </p> |
| <p> |
| SegName is the name of the segment, and is used as the file name prefix |
| for all of the files that compose the segment's index. |
| </p> |
| <p> |
| SegSize is the number of documents contained in the segment index. |
| </p> |
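| <p> |
| As a rough illustration of the grammar above, the sketch below parses a |
| "segments" file held in memory. It makes several assumptions that go beyond |
| this section: fixed-width integers are taken to be big-endian, segment names |
| are taken to be ASCII so that the String length equals the byte count, and |
| Format is read as a signed value so that it compares equal to -1. All function |
| names are hypothetical. |
| </p> |

```python
import struct


def read_vint(data, pos):
    """Return (value, next_pos) for the VInt starting at data[pos]."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_string(data, pos):
    """Read a String: a VInt length followed by the character data.

    Assumption: the name is ASCII, so one character is one byte.
    """
    length, pos = read_vint(data, pos)
    return data[pos:pos + length].decode("ascii"), pos + length


def parse_segments(data):
    """Parse Format, Version, SegCount and the (SegName, SegSize) pairs."""
    # ">iQI": big-endian signed 32-bit, unsigned 64-bit, unsigned 32-bit.
    fmt, version, seg_count = struct.unpack_from(">iQI", data, 0)
    pos = 16
    segments = []
    for _ in range(seg_count):
        name, pos = read_string(data, pos)
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        segments.append((name, size))
    return {"format": fmt, "version": version, "segments": segments}
```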
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Lock Files"><strong>Lock Files</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| Several files are used to indicate that another |
| process is using an index. Note that these files are not |
| stored in the index directory itself, but rather in the |
| system's temporary directory, as indicated in the Java |
| system property "java.io.tmpdir". |
| </p> |
| <ul> |
| <li> |
| <p> |
| When a file named "commit.lock" |
| is present, a process is currently re-writing the "segments" |
| file and deleting outdated segment index files, or a process is |
| reading the "segments" |
| file and opening the files of the segments it names. This lock file |
| prevents files from being deleted by another process after a process |
| has read the "segments" |
| file but before it has managed to open all of the files of the |
| segments named therein. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When a file named "write.lock" |
| is present, a process is currently adding documents to an index, or |
| removing files from that index. This lock file prevents several |
| processes from attempting to modify an index at the same time. |
| </p> |
| </li> |
| </ul> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Deletable File"><strong>Deletable File</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| A file named "deletable" |
| contains the names of files that are no longer used by the index, but |
| which could not be deleted. This is only used on Win32, where a |
| file may not be deleted while it is still open. On other platforms |
| the file contains only null bytes. |
| </p> |
| <p> |
| Deletable --> DeletableCount, |
| <DeletableName><sup>DeletableCount</sup> |
| </p> |
| <p>DeletableCount --> UInt32 |
| </p> |
| <p>DeletableName --> |
| String |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Per-Segment Files"><strong>Per-Segment Files</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The remaining files are all per-segment, and are |
| thus defined by suffix. |
| </p> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Fields"><strong>Fields</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p><br /><b>Field Info</b><br /></p> |
| <p> |
| Field names are |
| stored in the field info file, with suffix .fnm. |
| </p> |
| <p> |
| FieldInfos |
| (.fnm) --> FieldsCount, <FieldName, |
| FieldBits><sup>FieldsCount</sup> |
| </p> |
| <p> |
| FieldsCount --> VInt |
| </p> |
| <p> |
| FieldName --> String |
| </p> |
| <p> |
| FieldBits --> Byte |
| </p> |
| <p> |
| The low-order bit is one for |
| indexed fields, and zero for non-indexed fields. The second lowest-order |
| bit is one for fields that have term vectors stored, and zero for fields |
| without term vectors. |
| </p> |
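| <p> |
| The two flags in FieldBits can be tested with bit masks. This is a small |
| sketch; the constant and function names are chosen for illustration only. |
| </p> |

```python
INDEXED = 0x01      # low-order bit: field is indexed
TERM_VECTOR = 0x02  # second lowest-order bit: term vectors are stored


def describe_field_bits(bits):
    """Return the two documented flags of a FieldBits byte as booleans."""
    return {
        "indexed": bool(bits & INDEXED),
        "term_vectors": bool(bits & TERM_VECTOR),
    }
```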
| <p> |
| Fields are numbered by their order in this file. Thus field zero is |
| the |
| first field in the file, field one the next, and so on. Note that, |
| like document numbers, field numbers are segment relative. |
| </p> |
| <p><br /><b>Stored Fields</b><br /></p> |
| <p> |
| Stored fields are represented by two files: |
| </p> |
| <ol> |
| <li> |
| <p> |
| The field index, or .fdx file. |
| </p> |
| |
| <p> |
| This contains, for each document, a pointer to |
| its field data, as follows: |
| </p> |
| |
| <p> |
| FieldIndex |
| (.fdx) --> |
| <FieldValuesPosition><sup>SegSize</sup> |
| </p> |
| <p>FieldValuesPosition |
| --> UInt64 |
| </p> |
| <p>This |
| is used to find the location within the field data file of the |
| fields of a particular document. Because it contains fixed-length |
| data, this file may be easily randomly accessed. The position of |
| document <i>n</i>'s field data is the UInt64 at byte offset <i>n*8</i> in |
| this file. |
| </p> |
| </li> |
| <li> |
| <p> |
| The field data, or .fdt file. |
| |
| </p> |
| |
| <p> |
| This contains the stored fields of each document, |
| as follows: |
| </p> |
| |
| <p> |
| FieldData (.fdt) --> |
| <DocFieldData><sup>SegSize</sup> |
| </p> |
| <p>DocFieldData --> |
| FieldCount, <FieldNum, Bits, Value><sup>FieldCount</sup> |
| </p> |
| <p>FieldCount --> |
| VInt |
| </p> |
| <p>FieldNum --> |
| VInt |
| </p> |
| <p>Bits --> |
| Byte |
| </p> |
| <p>Value --> |
| String |
| </p> |
| <p>Currently |
| only the low-order bit of Bits is used. It is one for |
| tokenized fields, and zero for non-tokenized fields. |
| </p> |
| </li> |
| </ol> |
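| <p> |
| Because every entry in the .fdx file is eight bytes wide, finding a |
| document's field data is a single fixed-offset read. The sketch below assumes |
| the file contents are already in memory and that UInt64 values are stored |
| big-endian; the function name is hypothetical. |
| </p> |

```python
import struct


def field_values_position(fdx_bytes, doc_num):
    """Return the .fdt position stored for document doc_num in the .fdx data."""
    offset = doc_num * 8  # one 8-byte UInt64 per document
    (position,) = struct.unpack_from(">Q", fdx_bytes, offset)
    return position
```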
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Term Dictionary"><strong>Term Dictionary</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The term dictionary is represented as two files: |
| </p> |
| <ol> |
| <li> |
| <p> |
| The term infos, or .tis file. |
| </p> |
| |
| <p> |
| TermInfoFile (.tis)--> |
| TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos |
| </p> |
| <p>TIVersion --> |
| UInt32 |
| </p> |
| <p>TermCount --> |
| UInt64 |
| </p> |
| <p>IndexInterval --> |
| UInt32 |
| </p> |
| <p>SkipInterval --> |
| UInt32 |
| </p> |
| <p>TermInfos --> |
| <TermInfo><sup>TermCount</sup> |
| </p> |
| <p>TermInfo --> |
| <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> |
| </p> |
| <p>Term --> |
| <PrefixLength, Suffix, FieldNum> |
| </p> |
| <p>Suffix --> |
| String |
| </p> |
| <p>PrefixLength, |
| DocFreq, FreqDelta, ProxDelta, SkipDelta<br /> --> VInt |
| </p> |
| <p>This |
| file is sorted by Term. Terms are ordered first lexicographically |
| by the term's field name, and within that lexicographically by the |
| term's text. |
| </p> |
| <p>TIVersion names the version of the format |
| of this file and is -2 in Lucene 1.4. |
| </p> |
| <p>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </p> |
| <p>FieldNum |
| determines the term's field, whose name is stored in the .fnm file. |
| </p> |
| <p>DocFreq |
| is the count of documents which contain the term. |
| </p> |
| <p>FreqDelta |
| determines the position of this term's TermFreqs within the .frq |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). |
| </p> |
| <p>ProxDelta |
| determines the position of this term's TermPositions within the .prx |
| file. In particular, it is the difference between the position of |
| this term's data in that file and the position of the previous |
| term's data (or zero, for the first term in the file). |
| </p> |
| <p>SkipDelta determines the position of this |
| term's SkipData within the .frq file. In |
| particular, it is the number of bytes |
| after the start of TermFreqs at which the SkipData starts. |
| In other words, it is the length of the |
| TermFreqs data. |
| </p> |
| </li> |
| <li> |
| <p> |
| The term info index, or .tii file. |
| </p> |
| |
| <p> |
| This contains every IndexInterval<sup>th</sup> entry from the .tis |
| file, along with its location in the .tis file. This is |
| designed to be read entirely into memory and used to provide random |
| access to the .tis file. |
| </p> |
| |
| <p> |
| The structure of this file is very similar to the |
| .tis file, with the addition of one item per record, the IndexDelta. |
| </p> |
| |
| <p> |
| TermInfoIndex (.tii)--> |
| IndexTermCount, TermIndices |
| </p> |
| <p>IndexTermCount --> |
| UInt32 |
| </p> |
| <p>TermIndices --> |
| <TermInfo, IndexDelta><sup>IndexTermCount</sup> |
| </p> |
| <p>IndexDelta --> |
| VInt |
| </p> |
| <p>IndexDelta |
| determines the position of this term's TermInfo within the .tis file. In |
| particular, it is the difference between the position of this term's |
| entry in that file and the position of the previous term's entry (or |
| zero for the first term in the file). |
| </p> |
| </li> |
| </ol> |
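| <p> |
| The shared-prefix encoding of term text used in the .tis file can be |
| sketched as follows. This illustration handles term text only and ignores |
| FieldNum; the function names are hypothetical. |
| </p> |

```python
def compress_terms(terms):
    """Turn a sorted list of term texts into (PrefixLength, Suffix) pairs."""
    entries = []
    previous = ""
    for term in terms:
        n = 0
        limit = min(len(previous), len(term))
        while n < limit and previous[n] == term[n]:
            n += 1  # count characters shared with the previous term
        entries.append((n, term[n:]))
        previous = term
    return entries


def expand_terms(entries):
    """Rebuild term texts from (PrefixLength, Suffix) pairs."""
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms
```

| <p> |
| For the terms "bone" and "boy", the second pair is a PrefixLength of two |
| and the suffix "y". |
| </p> |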
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Frequencies"><strong>Frequencies</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The .frq file contains the lists of documents |
| which contain each term, along with the frequency of the term in that |
| document. |
| </p> |
| <p>FreqFile (.frq) --> |
| <TermFreqs, SkipData><sup>TermCount</sup> |
| </p> |
| <p>TermFreqs --> |
| <TermFreq><sup>DocFreq</sup> |
| </p> |
| <p>TermFreq --> |
| DocDelta, Freq? |
| </p> |
| <p>SkipData --> |
| <SkipDatum><sup>DocFreq/SkipInterval</sup> |
| </p> |
| <p>SkipDatum --> |
| DocSkip,FreqSkip,ProxSkip |
| </p> |
| <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --> |
| VInt |
| </p> |
| <p>TermFreqs |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>TermFreq |
| entries are ordered by increasing document number. |
| </p> |
| <p>DocDelta |
| determines both the document number and the frequency. In |
| particular, DocDelta/2 is the difference between this document number |
| and the previous document number (or zero when this is the first |
| document in a TermFreqs). When DocDelta is odd, the frequency is |
| one. When DocDelta is even, the frequency is read as another VInt. |
| </p> |
| <p>For |
| example, the TermFreqs for a term which occurs once in document seven |
| and three times in document eleven would be the following sequence of |
| VInts: |
| </p> |
| <p> 15, |
| 8, 3 |
| </p> |
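| <p> |
| The DocDelta rule can be sketched as a pair of functions working on lists |
| of already-decoded VInt values. The names are illustrative, and the first |
| document's delta is taken against zero as described above. |
| </p> |

```python
def encode_term_freqs(postings):
    """Encode (doc_number, freq) pairs, sorted by document, as VInt values."""
    vints = []
    previous_doc = 0
    for doc, freq in postings:
        delta = doc - previous_doc
        if freq == 1:
            vints.append(delta * 2 + 1)  # odd DocDelta: frequency is one
        else:
            vints.append(delta * 2)      # even DocDelta: Freq follows
            vints.append(freq)
        previous_doc = doc
    return vints


def decode_term_freqs(vints, doc_freq):
    """Decode doc_freq TermFreq entries back into (doc_number, freq) pairs."""
    postings = []
    doc = 0
    i = 0
    for _ in range(doc_freq):
        doc_delta = vints[i]
        i += 1
        doc += doc_delta >> 1
        if doc_delta & 1:
            freq = 1
        else:
            freq = vints[i]
            i += 1
        postings.append((doc, freq))
    return postings
```

| <p> |
| A term that occurs once in document seven and three times in document |
| eleven has deltas of 7 and 4, so the first value is 7*2+1 and the second is |
| 4*2 followed by the frequency. |
| </p> |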
| <p>DocSkip records the document number before every |
| SkipInterval<sup>th</sup> document in TermFreqs. |
| Document numbers are represented as differences |
| from the previous value in the sequence. FreqSkip |
| and ProxSkip record the position of every |
| SkipInterval<sup>th</sup> entry in FreqFile and |
| ProxFile, respectively. For the first SkipDatum, file positions are |
| relative to the start of TermFreqs and Positions; for each later one, |
| they are relative to the previous SkipDatum in the sequence. |
| </p> |
| <p>For example, if DocFreq=35 and SkipInterval=16, |
| then there are two SkipData entries, containing |
| the 15<sup>th</sup> and 31<sup>st</sup> document |
| numbers in TermFreqs. The first FreqSkip names |
| the number of bytes after the beginning of |
| TermFreqs at which the 16<sup>th</sup> entry |
| starts, and the second the number of bytes after |
| that at which the 32<sup>nd</sup> starts. The first |
| ProxSkip names the number of bytes after the |
| beginning of Positions at which the 16<sup>th</sup> |
| entry starts, and the second the number of |
| bytes after that at which the 32<sup>nd</sup> starts. |
| </p> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Positions"><strong>Positions</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p> |
| The .prx file contains the lists of positions that |
| each term occurs at within documents. |
| </p> |
| <p>ProxFile (.prx) --> |
| <TermPositions><sup>TermCount</sup> |
| </p> |
| <p>TermPositions --> |
| <Positions><sup>DocFreq</sup> |
| </p> |
| <p>Positions --> |
| <PositionDelta><sup>Freq</sup> |
| </p> |
| <p>PositionDelta --> |
| VInt |
| </p> |
| <p>TermPositions |
| are ordered by term (the term is implicit, from the .tis file). |
| </p> |
| <p>Positions |
| entries are ordered by increasing document number (the document |
| number is implicit from the .frq file). |
| </p> |
| <p>PositionDelta |
| is the difference between the position of the current occurrence in |
| the document and the previous occurrence (or zero, if this is the |
| first occurrence in this document). |
| </p> |
| <p> |
| For example, the TermPositions for a |
| term which occurs as the fourth term in one document, and as the |
| fifth and ninth term in a subsequent document, would be the following |
| sequence of VInts: |
| </p> |
| <p> 4, |
| 5, 4 |
| </p> |
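| <p> |
| Decoding PositionDelta values needs the per-document Freq from the .frq |
| file to know where one document's positions end and the next begin. A minimal |
| sketch, with hypothetical names, working on already-decoded VInt values: |
| </p> |

```python
def decode_positions(deltas, freqs):
    """Split a flat list of PositionDelta values into per-document positions.

    deltas: PositionDelta values for one term, in file order.
    freqs:  the term's frequency in each document, in the same order.
    """
    result = []
    i = 0
    for freq in freqs:
        position = 0  # deltas restart from zero in every document
        doc_positions = []
        for _ in range(freq):
            position += deltas[i]
            i += 1
            doc_positions.append(position)
        result.append(doc_positions)
    return result
```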
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Normalization Factors"><strong>Normalization Factors</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p>There's a norm file for each indexed field with a byte for |
| each document. The .f[0-9]* file contains, |
| for each document, a byte that encodes a value that is multiplied |
| into the score for hits on that field: |
| </p> |
| <p>Norms |
| (.f[0-9]*) --> <Byte><sup>SegSize</sup> |
| </p> |
| <p>Each |
| byte encodes a floating point value. Bits 0-2 contain the 3-bit |
| mantissa, and bits 3-7 contain the 5-bit exponent. |
| </p> |
| <p>These |
| are converted to an IEEE single float value as follows: |
| </p> |
| <ol> |
| <li><p>If |
| the byte is zero, use a zero float. |
| </p> |
| </li> |
| <li><p>Otherwise, |
| set the sign bit of the float to zero; |
| </p> |
| </li> |
| <li><p>add |
| 48 to the exponent and use this as the float's exponent; |
| </p> |
| </li> |
| <li><p>map |
| the mantissa to the high-order 3 bits of the float's mantissa; and |
| |
| </p> |
| </li> |
| <li><p>set |
| the low-order 21 bits of the float's mantissa to zero. |
| </p> |
| </li> |
| </ol> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Term Vectors"><strong>Term Vectors</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <ol> |
| <li> |
| <p>The Document Index or .tvx file.</p> |
| <p>This contains, for each document, a pointer to the document data in the Document |
| (.tvd) file. |
| </p> |
| <p>DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition><sup>NumDocs</sup></p> |
| <p>TVXVersion --> Int</p> |
| <p>DocumentPosition --> UInt64</p> |
| <p>This is used to find the position of the Document in the .tvd file.</p> |
| </li> |
| <li> |
| <p>The Document or .tvd file.</p> |
| <p>This contains, for each document, the number of fields, a list of the fields with |
| term vector info and finally a list of pointers to the field information in the .tvf |
| (Term Vector Fields) file.</p> |
| <p> |
| Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions><sup>NumDocs</sup> |
| </p> |
| <p>TVDVersion --> Int</p> |
| <p>NumFields --> VInt</p> |
| <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p> |
| <p>FieldNumDelta --> VInt</p> |
| <p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p> |
| <p>FieldPosition --> VLong</p> |
| <p>The .tvd file is used to map out the fields that have term vectors stored and |
| where the field information is in the .tvf file.</p> |
| </li> |
| <li> |
| <p>The Field or .tvf file.</p> |
| <p>This file contains, for each field that has a term vector stored, a list of |
| the terms and their frequencies.</p> |
| <p>Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p> |
| <p>TVFVersion --> Int</p> |
| <p>NumTerms --> VInt</p> |
| <p>NumDistinct --> VInt -- Future Use</p> |
| <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p> |
| <p>TermText --> <PrefixLength, Suffix></p> |
| <p>PrefixLength --> VInt</p> |
| <p>Suffix --> String</p> |
| <p>TermFreq --> VInt</p> |
| <p>Term |
| text prefixes are shared. The PrefixLength is the number of initial |
| characters from the previous term which must be pre-pended to a |
| term's suffix in order to form the term's text. Thus, if the |
| previous term's text was "bone" and the term is "boy", |
| the PrefixLength is two and the suffix is "y". |
| </p> |
| </li> |
| </ol> |
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#828DA6"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Deleted Documents"><strong>Deleted Documents</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p>The .del file is |
| optional, and only exists when a segment contains deletions: |
| </p> |
| <p>Deletions |
| (.del) --> ByteCount,BitCount,Bits |
| </p> |
| <p>ByteCount,BitCount --> |
| UInt32 |
| </p> |
| <p>Bits --> |
| <Byte><sup>ByteCount</sup> |
| </p> |
| <p>ByteCount |
| indicates the number of bytes in Bits. It is typically |
| (SegSize/8)+1. |
| </p> |
| <p> |
| BitCount |
| indicates the number of bits that are currently set in Bits. |
| </p> |
| <p>Bits |
| contains one bit for each document indexed. When the bit |
| corresponding to a document number is set, that document is marked as |
| deleted. Bit ordering is from least to most significant. Thus, if |
| Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as |
| deleted. |
| </p> |
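| <p> |
| With bits ordered from least to most significant inside each byte, testing |
| whether a document is deleted comes down to one shift and one mask. The sketch |
| below operates on the Bits array only; the function names are illustrative. |
| </p> |

```python
def is_deleted(bits, doc_num):
    """Return True if the bit for doc_num is set in the Bits byte array."""
    byte = bits[doc_num >> 3]                # eight documents per byte
    return (byte >> (doc_num & 7)) & 1 == 1  # bit 0 is the lowest document


def deleted_docs(bits):
    """List every document number whose bit is set."""
    return [n for n in range(len(bits) * 8) if is_deleted(bits, n)]
```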
| </blockquote> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| <table border="0" cellspacing="0" cellpadding="2" width="100%"> |
| <tr><td bgcolor="#525D76"> |
| <font color="#ffffff" face="arial,helvetica,sanserif"> |
| <a name="Limitations"><strong>Limitations</strong></a> |
| </font> |
| </td></tr> |
| <tr><td> |
| <blockquote> |
| <p>There |
| are a few places where these file formats limit the maximum number of |
| terms and documents to a 32-bit quantity, or to approximately 4 |
| billion. This is not a problem today, but in the long term it |
| probably will be. These should therefore be replaced with either |
| UInt64 values, or better yet, with VInt values which have no limit. |
| </p> |
| </blockquote> |
| </p> |
| </td></tr> |
| <tr><td><br/></td></tr> |
| </table> |
| </td> |
| </tr> |
| |
| <!-- FOOTER --> |
| <tr><td colspan="2"> |
| <hr noshade="" size="1"/> |
| </td></tr> |
| <tr><td colspan="2"> |
| <div align="center"><font color="#525D76" size="-1"><em> |
| Copyright © 1999-2004, The Apache Software Foundation |
| </em></font></div> |
| </td></tr> |
| </table> |
| </body> |
| </html> |
| <!-- end the processing --> |