| <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> |
| </head> |
| <body> |
| Code to maintain and access indices. |
| <!-- TODO: add IndexWriter, IndexWriterConfig, DocValues, etc etc --> |
| <h2>Table Of Contents</h2> |
| <p> |
| <ol> |
| <li><a href="#postings">Postings APIs</a> |
| <ul> |
| <li><a href="#fields">Fields</a></li> |
| <li><a href="#terms">Terms</a></li> |
| <li><a href="#documents">Documents</a></li> |
| <li><a href="#positions">Positions</a></li> |
| </ul> |
| </li> |
| <li><a href="#stats">Index Statistics</a> |
| <ul> |
| <li><a href="#termstats">Term-level</a></li> |
| <li><a href="#fieldstats">Field-level</a></li> |
| <li><a href="#segmentstats">Segment-level</a></li> |
| <li><a href="#documentstats">Document-level</a></li> |
| </ul> |
| </li> |
| </ol> |
| </p> |
| <a name="postings"></a> |
| <h2>Postings APIs</h2> |
| <a name="fields"></a> |
| <h4> |
| Fields |
| </h4> |
| <p> |
| {@link org.apache.lucene.index.Fields} is the initial entry point into the |
| postings APIs. It can be obtained in several ways: |
| <pre class="prettyprint"> |
| // access indexed fields for an index segment |
| Fields fields = reader.fields(); |
| // access term vector fields for a specified document |
| Fields termVectorFields = reader.getTermVectors(docid); |
| </pre> |
| Fields implements Java's Iterable interface, so it's easy to enumerate the |
| list of fields: |
| <pre class="prettyprint"> |
| // enumerate list of fields |
| for (String field : fields) { |
| // access the terms for this field |
| Terms terms = fields.terms(field); |
| } |
| </pre> |
| </p> |
| <a name="terms"></a> |
| <h4> |
| Terms |
| </h4> |
| <p> |
| {@link org.apache.lucene.index.Terms} represents the collection of terms |
| within a field. It exposes some metadata and <a href="#fieldstats">statistics</a>, |
| along with an API for enumeration. |
| <pre class="prettyprint"> |
| // metadata about the field |
| System.out.println("positions? " + terms.hasPositions()); |
| System.out.println("offsets? " + terms.hasOffsets()); |
| System.out.println("payloads? " + terms.hasPayloads()); |
| // iterate through terms |
| TermsEnum termsEnum = terms.iterator(null); |
| BytesRef term = null; |
| while ((term = termsEnum.next()) != null) { |
| doSomethingWith(termsEnum.term()); |
| } |
| </pre> |
| {@link org.apache.lucene.index.TermsEnum} provides an iterator over the list |
| of terms within a field, some <a href="#termstats">statistics</a> about the term, |
| and methods to access the term's <a href="#documents">documents</a> and |
| <a href="#positions">positions</a>. |
| <pre class="prettyprint"> |
| // seek to a specific term |
| boolean found = termsEnum.seekExact(new BytesRef("foobar")); |
| if (found) { |
| // get the document frequency |
| System.out.println(termsEnum.docFreq()); |
| // enumerate through documents |
| DocsEnum docs = termsEnum.docs(null, null); |
| // enumerate through documents and positions |
| DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null); |
| } |
| </pre> |
| </p> |
| <a name="documents"></a> |
| <h4> |
| Documents |
| </h4> |
| <p> |
| {@link org.apache.lucene.index.DocsEnum} is an extension of |
| {@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of |
| documents for a term, along with the term frequency within each document. |
| <pre class="prettyprint"> |
| int docid; |
| while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { |
| System.out.println(docid); |
| System.out.println(docsEnum.freq()); |
| } |
| </pre> |
| </p> |
| <a name="positions"></a> |
| <h4> |
| Positions |
| </h4> |
| <p> |
| {@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of |
| {@link org.apache.lucene.index.DocsEnum} that additionally allows iteration |
| over the positions at which a term occurred within the document, and any |
| additional per-position information (offsets and payload). |
| <pre class="prettyprint"> |
| int docid; |
| while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { |
| System.out.println(docid); |
| int freq = docsAndPositionsEnum.freq(); |
| for (int i = 0; i < freq; i++) { |
| System.out.println(docsAndPositionsEnum.nextPosition()); |
| System.out.println(docsAndPositionsEnum.startOffset()); |
| System.out.println(docsAndPositionsEnum.endOffset()); |
| System.out.println(docsAndPositionsEnum.getPayload()); |
| } |
| } |
| </pre> |
| </p> |
| <a name="stats"></a> |
| <h2>Index Statistics</h2> |
| <a name="termstats"></a> |
| <h4> |
| Term statistics |
| </h4> |
| <p> |
| <ul> |
| <li>{@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of |
| documents that contain at least one occurrence of the term. This statistic |
| is always available for an indexed term. Note that it also counts |
| deleted documents; when segments are merged, the statistic is updated as |
| those deleted documents are merged away. |
| <li>{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number |
| of occurrences of this term across all documents. Note that this statistic |
| is unavailable (returns <code>-1</code>) if term frequencies were omitted |
| from the index |
| ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) |
| for the field. Like docFreq(), it will also count occurrences that appear in |
| deleted documents. |
| </ul> |
| </p> |
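| <p> |
| The relationship between the two term-level statistics can be sketched with a toy |
| model. This is not Lucene code: the class <code>TermStatsSketch</code>, its methods, |
| and the sample postings are all hypothetical, and the map from docid to frequency |
| only stands in for a real postings list. It illustrates why both values still |
| include a deleted document until a merge drops it. |
| </p> |

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model (not Lucene code): one term's postings as docid -> within-document frequency. */
class TermStatsSketch {

  /** Analogue of docFreq: one posting per document that contains the term. */
  static int docFreq(Map<Integer, Integer> postings) {
    return postings.size();
  }

  /** Analogue of totalTermFreq: the sum of the per-document frequencies. */
  static long totalTermFreq(Map<Integer, Integer> postings) {
    long sum = 0;
    for (int freq : postings.values()) {
      sum += freq;
    }
    return sum;
  }

  /** Simplified merge: drops postings of deleted documents (docid renumbering is ignored). */
  static Map<Integer, Integer> mergeAway(Map<Integer, Integer> postings, Set<Integer> deleted) {
    Map<Integer, Integer> merged = new LinkedHashMap<>(postings);
    merged.keySet().removeAll(deleted);
    return merged;
  }

  /** Hypothetical sample data: the term occurs in documents 0, 3 and 7. */
  static Map<Integer, Integer> samplePostings() {
    Map<Integer, Integer> postings = new LinkedHashMap<>();
    postings.put(0, 2);
    postings.put(3, 1);
    postings.put(7, 4);
    return postings;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> postings = samplePostings();
    Set<Integer> deleted = Collections.singleton(3);

    // Deleting document 3 only marks it; the postings are untouched,
    // so both statistics still count it.
    System.out.println("before merge: docFreq=" + docFreq(postings)
        + " totalTermFreq=" + totalTermFreq(postings));

    // After the (simplified) merge the deleted document is gone from both.
    Map<Integer, Integer> merged = mergeAway(postings, deleted);
    System.out.println("after merge:  docFreq=" + docFreq(merged)
        + " totalTermFreq=" + totalTermFreq(merged));
  }
}
```

| <p> |
| In this sketch the statistics drop from 3 and 7 to 2 and 6 only once the merge has run. |
| </p> |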
| <a name="fieldstats"></a> |
| <h4> |
| Field statistics |
| </h4> |
| <p> |
| <ul> |
| <li>{@link org.apache.lucene.index.Terms#size}: Returns the number of |
| unique terms in the field. This statistic may be unavailable |
| (returns <code>-1</code>) for some Terms implementations such as |
| {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently |
| computed. Note that this count also includes terms that appear only |
| in deleted documents: when segments are merged such terms are also merged |
| away and the statistic is then updated. |
| <li>{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of |
| documents that contain at least one occurrence of any term for this field. |
| This can be thought of as a Field-level docFreq(). Like docFreq() it will |
| also count deleted documents. |
| <li>{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of |
| postings (term-document mappings in the inverted index) for the field. This |
| can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq} |
| across all terms in the field, and like docFreq() it will also count postings |
| that appear in deleted documents. |
| <li>{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number |
| of tokens for the field. This can be thought of as the sum of |
| {@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the |
| field, and like totalTermFreq() it will also count occurrences that appear in |
| deleted documents, and will be unavailable (returns <code>-1</code>) if term |
| frequencies were omitted from the index |
| ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) |
| for the field. |
| </ul> |
| </p> |
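| <p> |
| As a rough illustration of how the four field-level statistics relate to each other, |
| the following toy model represents a field as a map from term to postings |
| (docid to frequency). It is not Lucene code: <code>FieldStatsSketch</code>, its |
| helper methods, and the sample terms are hypothetical names chosen for this sketch. |
| </p> |

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model (not Lucene code): a field as term -> (docid -> within-document frequency). */
class FieldStatsSketch {

  /** Analogue of size: the number of unique terms in the field. */
  static long size(Map<String, Map<Integer, Integer>> field) {
    return field.size();
  }

  /** Analogue of getDocCount: documents that have at least one term in the field. */
  static int docCount(Map<String, Map<Integer, Integer>> field) {
    Set<Integer> docs = new HashSet<>();
    for (Map<Integer, Integer> postings : field.values()) {
      docs.addAll(postings.keySet());
    }
    return docs.size();
  }

  /** Analogue of getSumDocFreq: the number of postings, i.e. the sum of docFreq over all terms. */
  static long sumDocFreq(Map<String, Map<Integer, Integer>> field) {
    long sum = 0;
    for (Map<Integer, Integer> postings : field.values()) {
      sum += postings.size();
    }
    return sum;
  }

  /** Analogue of getSumTotalTermFreq: the number of tokens, i.e. the sum of all frequencies. */
  static long sumTotalTermFreq(Map<String, Map<Integer, Integer>> field) {
    long sum = 0;
    for (Map<Integer, Integer> postings : field.values()) {
      for (int freq : postings.values()) {
        sum += freq;
      }
    }
    return sum;
  }

  /** Builds postings from alternating (docid, freq) values. */
  static Map<Integer, Integer> postings(int... docFreqPairs) {
    Map<Integer, Integer> postings = new TreeMap<>();
    for (int i = 0; i < docFreqPairs.length; i += 2) {
      postings.put(docFreqPairs[i], docFreqPairs[i + 1]);
    }
    return postings;
  }

  /** Hypothetical sample field with three terms spread over documents 0, 1 and 2. */
  static Map<String, Map<Integer, Integer>> sampleField() {
    Map<String, Map<Integer, Integer>> field = new TreeMap<>();
    field.put("apache", postings(0, 1, 1, 2));
    field.put("index", postings(1, 1, 2, 1));
    field.put("lucene", postings(0, 3));
    return field;
  }

  public static void main(String[] args) {
    Map<String, Map<Integer, Integer>> field = sampleField();
    System.out.println("size=" + size(field));
    System.out.println("docCount=" + docCount(field));
    System.out.println("sumDocFreq=" + sumDocFreq(field));
    System.out.println("sumTotalTermFreq=" + sumTotalTermFreq(field));
  }
}
```

| <p> |
| For the sample data, the field has 3 terms, 3 documents, 5 postings and 8 tokens. |
| </p> |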
| <a name="segmentstats"></a> |
| <h4> |
| Segment statistics |
| </h4> |
| <p> |
| <ul> |
| <li>{@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of |
| documents (including deleted documents) in the index. |
| <li>{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number |
| of live documents (excluding deleted documents) in the index. |
| <li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the |
| number of deleted documents in the index. |
| <li>{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed |
| fields. |
| </ul> |
| </p> |
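| <p> |
| The three document counts above are tied together by a simple identity: |
| <code>maxDoc = numDocs + numDeletedDocs</code>. A minimal sketch of that bookkeeping |
| follows. It is not Lucene code: <code>SegmentStatsSketch</code> is a hypothetical |
| class, and a plain <code>java.util.BitSet</code> stands in for the set of live documents. |
| </p> |

```java
import java.util.BitSet;

/** Toy model (not Lucene code): document counts for one segment. */
class SegmentStatsSketch {
  private final int maxDoc;
  private final BitSet liveDocs;

  /** Creates a segment with maxDoc documents, all of them live. */
  SegmentStatsSketch(int maxDoc) {
    this.maxDoc = maxDoc;
    this.liveDocs = new BitSet(maxDoc);
    this.liveDocs.set(0, maxDoc);
  }

  /** Marks a document as deleted; maxDoc is unchanged. */
  void delete(int docid) {
    liveDocs.clear(docid);
  }

  /** All documents, including deleted ones. */
  int maxDoc() {
    return maxDoc;
  }

  /** Live documents only. */
  int numDocs() {
    return liveDocs.cardinality();
  }

  /** Deleted documents: whatever is not live. */
  int numDeletedDocs() {
    return maxDoc - numDocs();
  }

  public static void main(String[] args) {
    SegmentStatsSketch segment = new SegmentStatsSketch(5);
    segment.delete(2);
    segment.delete(4);
    System.out.println("maxDoc=" + segment.maxDoc()
        + " numDocs=" + segment.numDocs()
        + " numDeletedDocs=" + segment.numDeletedDocs());
  }
}
```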
| <a name="documentstats"></a> |
| <h4> |
| Document statistics |
| </h4> |
| <p> |
| Document statistics are available during the indexing process for an indexed field. Typically, |
| a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some |
| of these values (possibly in a lossy way) in the normalization value for the document in |
| its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method. |
| </p> |
| <p> |
| <ul> |
| <li>{@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of |
| tokens for this field in the document. Note that this is just the number |
| of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned |
| true, and is unrelated to the values in |
| {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}. |
| <li>{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number |
| of tokens for this field in the document that had a position increment of zero. This |
| can be used to compute a document length that discounts artificial tokens |
| such as synonyms. |
| <li>{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated |
| position value for this field in the document: computed from the values of |
| {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including |
| {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued |
| fields. |
| <li>{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total |
| character offset value for this field in the document: computed from the values of |
| {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by |
| {@link org.apache.lucene.analysis.TokenStream#end}, and including |
| {@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued |
| fields. |
| <li>{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number |
| of unique terms encountered for this field in the document. |
| <li>{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum |
| frequency across all unique terms encountered for this field in the document. |
| </ul> |
| </p> |
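| <p> |
| To make the difference between these per-document values concrete, the following toy |
| accumulator processes a short token stream that contains one synonym injected with a |
| position increment of zero. It is not Lucene code: <code>InvertStateSketch</code> and |
| the sample tokens are hypothetical, gaps between multiple field values are left out, |
| and starting the position counter at -1 (so that the first token lands on position 0) |
| is an assumption of this sketch. |
| </p> |

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Lucene code): per-document, per-field statistics gathered while indexing. */
class InvertStateSketch {
  int length;            // number of tokens seen
  int numOverlap;        // tokens with a position increment of zero
  int position = -1;     // accumulated position (assumed to start at -1)
  int maxTermFrequency;  // highest frequency of any single term
  final Map<String, Integer> termFreqs = new HashMap<>();

  /** Records one token together with its position increment. */
  void addToken(String term, int positionIncrement) {
    length++;
    if (positionIncrement == 0) {
      numOverlap++;
    }
    position += positionIncrement;
    int freq = termFreqs.merge(term, 1, Integer::sum);
    maxTermFrequency = Math.max(maxTermFrequency, freq);
  }

  /** Number of distinct terms seen. */
  int uniqueTermCount() {
    return termFreqs.size();
  }

  /** Document length that discounts overlapping tokens such as synonyms. */
  int discountedLength() {
    return length - numOverlap;
  }

  /** Hypothetical token stream: "fast" is a synonym stacked on "quick". */
  static InvertStateSketch sample() {
    InvertStateSketch state = new InvertStateSketch();
    state.addToken("quick", 1);
    state.addToken("fast", 0);
    state.addToken("brown", 1);
    state.addToken("fox", 1);
    state.addToken("fox", 1);
    return state;
  }

  public static void main(String[] args) {
    InvertStateSketch state = sample();
    System.out.println("length=" + state.length);
    System.out.println("numOverlap=" + state.numOverlap);
    System.out.println("position=" + state.position);
    System.out.println("uniqueTermCount=" + state.uniqueTermCount());
    System.out.println("maxTermFrequency=" + state.maxTermFrequency);
    System.out.println("discountedLength=" + state.discountedLength());
  }
}
```

| <p> |
| Here the length (5) counts every token, while the final position (3) and the |
| discounted length (4) are unaffected by the stacked synonym. |
| </p> |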
| <p> |
| Additional user-supplied statistics can be added to the document as DocValues fields and |
| accessed via {@link org.apache.lucene.index.LeafReader#getNumericDocValues}. |
| </p> |
| </body> |
| </html> |