blob: 414637a32a50e831bd40aa8e9b6cb2cbe3227d24 [file] [log] [blame]
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
Code to maintain and access indices.
<!-- TODO: add IndexWriter, IndexWriterConfig, DocValues, etc etc -->
<h2>Table Of Contents</h2>
<p>
<ol>
<li><a href="#postings">Postings APIs</a>
<ul>
<li><a href="#fields">Fields</a></li>
<li><a href="#terms">Terms</a></li>
<li><a href="#documents">Documents</a></li>
<li><a href="#positions">Positions</a></li>
</ul>
</li>
<li><a href="#stats">Index Statistics</a>
<ul>
<li><a href="#termstats">Term-level</a></li>
<li><a href="#fieldstats">Field-level</a></li>
<li><a href="#segmentstats">Segment-level</a></li>
<li><a href="#documentstats">Document-level</a></li>
</ul>
</li>
</ol>
</p>
<a name="postings"></a>
<h2>Postings APIs</h2>
<a name="fields"></a>
<h4>
Fields
</h4>
<p>
{@link org.apache.lucene.index.Fields} is the initial entry point into the
postings APIs, this can be obtained in several ways:
<pre class="prettyprint">
// access indexed fields for an index segment
Fields fields = reader.fields();
// access term vector fields for a specified document
Fields fields = reader.getTermVectors(docid);
</pre>
Fields implements Java's Iterable interface, so its easy to enumerate the
list of fields:
<pre class="prettyprint">
// enumerate list of fields
for (String field : fields) {
// access the terms for this field
Terms terms = fields.terms(field);
}
</pre>
</p>
<a name="terms"></a>
<h4>
Terms
</h4>
<p>
{@link org.apache.lucene.index.Terms} represents the collection of terms
within a field, exposes some metadata and <a href="#fieldstats">statistics</a>,
and an API for enumeration.
<pre class="prettyprint">
// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
doSomethingWith(termsEnum.term());
}
</pre>
{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list
of terms within a field, some <a href="#termstats">statistics</a> about the term,
and methods to access the term's <a href="#documents">documents</a> and
<a href="#positions">positions</a>.
<pre class="prettyprint">
// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
// get the document frequency
System.out.println(termsEnum.docFreq());
// enumerate through documents
DocsEnum docs = termsEnum.docs(null, null);
// enumerate through documents and positions
DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
}
</pre>
</p>
<a name="documents"></a>
<h4>
Documents
</h4>
<p>
{@link org.apache.lucene.index.DocsEnum} is an extension of
{@link org.apache.lucene.search.DocIdSetIterator}that iterates over the list of
documents for a term, along with the term frequency within that document.
<pre class="prettyprint">
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
System.out.println(docid);
System.out.println(docsEnum.freq());
}
</pre>
</p>
<a name="positions"></a>
<h4>
Positions
</h4>
<p>
{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of
{@link org.apache.lucene.index.DocsEnum} that additionally allows iteration
of the positions a term occurred within the document, and any additional
per-position information (offsets and payload)
<pre class="prettyprint">
int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
System.out.println(docid);
int freq = docsAndPositionsEnum.freq();
for (int i = 0; i < freq; i++) {
System.out.println(docsAndPositionsEnum.nextPosition());
System.out.println(docsAndPositionsEnum.startOffset());
System.out.println(docsAndPositionsEnum.endOffset());
System.out.println(docsAndPositionsEnum.getPayload());
}
}
</pre>
</p>
<a name="stats"></a>
<h2>Index Statistics</h2>
<a name="termstats"></a>
<h4>
Term statistics
</h4>
<p>
<ul>
<li>{@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of
documents that contain at least one occurrence of the term. This statistic
is always available for an indexed term. Note that it will also count
deleted documents, when segments are merged the statistic is updated as
those deleted documents are merged away.
<li>{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number
of occurrences of this term across all documents. Note that this statistic
is unavailable (returns <code>-1</code>) if term frequencies were omitted
from the index
({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY})
for the field. Like docFreq(), it will also count occurrences that appear in
deleted documents.
</ul>
</p>
<a name="fieldstats"></a>
<h4>
Field statistics
</h4>
<p>
<ul>
<li>{@link org.apache.lucene.index.Terms#size}: Returns the number of
unique terms in the field. This statistic may be unavailable
(returns <code>-1</code>) for some Terms implementations such as
{@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently
computed. Note that this count also includes terms that appear only
in deleted documents: when segments are merged such terms are also merged
away and the statistic is then updated.
<li>{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of
documents that contain at least one occurrence of any term for this field.
This can be thought of as a Field-level docFreq(). Like docFreq() it will
also count deleted documents.
<li>{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of
postings (term-document mappings in the inverted index) for the field. This
can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq}
across all terms in the field, and like docFreq() it will also count postings
that appear in deleted documents.
<li>{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number
of tokens for the field. This can be thought of as the sum of
{@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the
field, and like totalTermFreq() it will also count occurrences that appear in
deleted documents, and will be unavailable (returns <code>-1</code>) if term
frequencies were omitted from the index
({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY})
for the field.
</ul>
</p>
<a name="segmentstats"></a>
<h4>
Segment statistics
</h4>
<p>
<ul>
<li>{@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of
documents (including deleted documents) in the index.
<li>{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number
of live documents (excluding deleted documents) in the index.
<li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the
number of deleted documents in the index.
<li>{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed
fields.
</ul>
</p>
<a name="documentstats"></a>
<h4>
Document statistics
</h4>
<p>
Document statistics are available during the indexing process for an indexed field: typically
a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some
of these values (possibly in a lossy way), into the normalization value for the document in
its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
</p>
<p>
<ul>
<li>{@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of
tokens for this field in the document. Note that this is just the number
of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned
true, and is unrelated to the values in
{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
<li>{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number
of tokens for this field in the document that had a position increment of zero. This
can be used to compute a document length that discounts artificial tokens
such as synonyms.
<li>{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated
position value for this field in the document: computed from the values of
{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including
{@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued
fields.
<li>{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total
character offset value for this field in the document: computed from the values of
{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by
{@link org.apache.lucene.analysis.TokenStream#end}, and including
{@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued
fields.
<li>{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number
of unique terms encountered for this field in the document.
<li>{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum
frequency across all unique terms encountered for this field in the document.
</ul>
</p>
<p>
Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via {@link org.apache.lucene.index.LeafReader#getNumericDocValues}.
</p>
<p>
</body>
</html>