
LUCENE-2380: FieldCache.getStrings/getStringIndex --> FieldCache.getTerms/getTermsIndex

* The field values returned when sorting by SortField.STRING are now
  BytesRef. You can call value.utf8ToString() to convert back to a
  String, if necessary.

* In FieldCache, getStrings (returning String[]) has been replaced
  with getTerms (returning a FieldCache.DocTerms instance).
  DocTerms provides a getTerm method that takes a docID and a
  BytesRef to fill (which must not be null), and fills it in with a
  reference to the bytes for that term.

  If you had code like this before:

    String[] values = FieldCache.DEFAULT.getStrings(reader, field);
    ...
    String aValue = values[docID];

  you can do this instead:

    DocTerms values = FieldCache.DEFAULT.getTerms(reader, field);
    ...
    BytesRef term = new BytesRef();
    String aValue = values.getTerm(docID, term).utf8ToString();

  Note however that it can be costly to convert to String, so it's
  better to work directly with the BytesRef.

* Similarly, in FieldCache, getStringIndex (returning a StringIndex
  instance, with direct arrays int[] order and String[] lookup) has
  been replaced with getTermsIndex (returning a
  FieldCache.DocTermsIndex instance). DocTermsIndex provides the
  getOrd(int docID) method to look up the int order for a document,
  lookup(int ord, BytesRef reuse) to look up the term for a given
  order, and the sugar method getTerm(int docID, BytesRef reuse),
  which internally calls getOrd and then lookup.

  If you had code like this before:

    StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, field);
    ...
    int ord = idx.order[docID];
    String aValue = idx.lookup[ord];

  you can do this instead:

    DocTermsIndex idx = FieldCache.DEFAULT.getTermsIndex(reader, field);
    ...
    int ord = idx.getOrd(docID);
    BytesRef term = new BytesRef();
    String aValue = idx.lookup(ord, term).utf8ToString();

  Note however that it can be costly to convert to String, so it's
  better to work directly with the BytesRef.

  DocTermsIndex also has a getTermsEnum() method, which returns an
  iterator (TermsEnum) over the term values in the index (ie, it
  iterates ord = 0..numOrd()-1).
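
  For example, here is a minimal sketch of stepping through every
  unique term via that enum (this assumes the returned enum supports
  ord(), which the ord-based index implies):

    TermsEnum te = idx.getTermsEnum();
    BytesRef termText;
    while((termText = te.next()) != null) {
      // ord is this term's position in the index's sorted term list
      System.out.println("ord=" + te.ord() + " term=" + termText.utf8ToString());
    }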

* StringComparatorLocale is now even more CPU costly than before,
  since it converts BytesRef -> String on the fly (it was already
  very CPU costly because it does not compare using indexed
  collation keys; use CollationKeyFilter for better performance).
  Also, the field values returned when sorting by SortField.STRING
  are now BytesRef.

* FieldComparator.StringOrdValComparator has been renamed to
  TermOrdValComparator, and now uses BytesRef for its values.
  Likewise for StringValComparator, renamed to TermValComparator.
  This means that when sorting by SortField.STRING or
  SortField.STRING_VAL (or directly invoking these comparators), the
  values returned in the FieldDoc.fields array will be BytesRef, not
  String. You can call the .utf8ToString() method on the BytesRef
  instances, if necessary.
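
  For example, a sketch of reading such a sort value back (topDocs
  is assumed to come from a search sorted by SortField.STRING; the
  value may be null for documents missing the field):

    FieldDoc fd = (FieldDoc) topDocs.scoreDocs[0];
    BytesRef sortValue = (BytesRef) fd.fields[0];
    String s = sortValue == null ? null : sortValue.utf8ToString();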



LUCENE-1458, LUCENE-2111: Flexible Indexing

Flexible indexing changed the low-level fields/terms/docs/positions
enumeration APIs. Here are the major changes:

* Terms are now binary in nature (arbitrary byte[]), represented
  by the BytesRef class (which provides an offset + length "slice"
  into an existing byte[]).
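
  For example (a minimal sketch; the String-taking constructor is a
  convenience that allocates a new UTF-8 byte[]):

    BytesRef whole = new BytesRef("hello world");     // new UTF-8 byte[] copy
    BytesRef slice = new BytesRef(whole.bytes, 6, 5); // a view onto "world"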

* Fields are separately enumerated (FieldsEnum) from the terms
  within each field (TermsEnum). So instead of this:

    TermEnum termsEnum = ...;
    while(termsEnum.next()) {
      Term t = termsEnum.term();
      System.out.println("field=" + t.field() + "; text=" + t.text());
    }

  do this:

    FieldsEnum fieldsEnum = ...;
    String field;
    while((field = fieldsEnum.next()) != null) {
      TermsEnum termsEnum = fieldsEnum.terms();
      BytesRef text;
      while((text = termsEnum.next()) != null) {
        System.out.println("field=" + field + "; text=" + text.utf8ToString());
      }
    }

* TermDocs is renamed to DocsEnum. Instead of this:

    while(td.next()) {
      int doc = td.doc();
      ...
    }

  do this:

    int doc;
    while((doc = td.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
      ...
    }

  Instead of this:

    if (td.skipTo(target)) {
      int doc = td.doc();
      ...
    }

  do this:

    if ((doc = td.advance(target)) != DocsEnum.NO_MORE_DOCS) {
      ...
    }

  The bulk read API has also changed. Instead of this:

    int[] docs = new int[256];
    int[] freqs = new int[256];

    while(true) {
      int count = td.read(docs, freqs);
      if (count == 0) {
        break;
      }
      // use docs[i], freqs[i]
    }

  do this:

    DocsEnum.BulkReadResult bulk = td.getBulkResult();
    while(true) {
      int count = td.read();
      if (count == 0) {
        break;
      }
      // use bulk.docs.ints[i] and bulk.freqs.ints[i]
    }

* TermPositions is renamed to DocsAndPositionsEnum, and no longer
  extends the docs-only enumerator (DocsEnum).

* Deleted docs are no longer implicitly filtered from
  docs/positions enums. Instead, you pass a Bits
  skipDocs (set bits are skipped) when obtaining the enums. Also,
  you can now ask a reader for its deleted docs.

* The docs/positions enums cannot seek to a term. Instead,
  TermsEnum is able to seek, and then you request the
  docs/positions enum from that TermsEnum.
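
  For example, a sketch (skipDocs, obtained as shown further below,
  may be null when the reader has no deletions):

    if (termsEnum.seek(new BytesRef("lucene")) == TermsEnum.SeekStatus.FOUND) {
      DocsEnum docsEnum = termsEnum.docs(skipDocs, null);
      ...
    }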

* TermsEnum's seek method returns more information. So instead of
  this:

    Term t;
    TermEnum termEnum = reader.terms(t);
    if (t.equals(termEnum.term())) {
      ...
    }

  do this:

    TermsEnum termsEnum = ...;
    BytesRef text;
    if (termsEnum.seek(text) == TermsEnum.SeekStatus.FOUND) {
      ...
    }

  SeekStatus also contains END (enumerator is done) and NOT_FOUND
  (term was not found but the enumerator is now positioned to the
  next term).

* TermsEnum has an ord() method, returning the long numeric
  ordinal (ie, first term is 0, next is 1, and so on) for the term
  it's currently positioned to. There is also a corresponding
  seek(long ord) method. Note that these methods are optional; in
  particular the MultiFields TermsEnum does not implement them.
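
  For example, a sketch (this assumes the underlying implementation
  supports ords; the optional methods may throw
  UnsupportedOperationException otherwise):

    termsEnum.seek(42L);              // jump straight to the 43rd term
    long ord = termsEnum.ord();       // 42
    BytesRef term = termsEnum.term(); // the term's bytes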


How you obtain the enums has changed. The primary entry point is
the Fields class. If you know your reader is a single segment
reader, do this:

    Fields fields = reader.fields();
    if (fields != null) {
      ...
    }

If the reader might be multi-segment, you must do this:

    Fields fields = MultiFields.getFields(reader);
    if (fields != null) {
      ...
    }

The fields may be null (eg if the reader has no fields).

Note that the MultiFields approach entails a performance hit on
MultiReaders, as it must merge terms/docs/positions on the fly. It's
generally better to instead get the sequential readers (use
oal.util.ReaderUtil) and then step through those readers yourself,
if you can (this is how Lucene drives searches).

If you pass a SegmentReader to MultiFields.getFields it will simply
return reader.fields(), so there is no performance hit in that
case.

Once you have a non-null Fields you can do this:

    Terms terms = fields.terms("field");
    if (terms != null) {
      ...
    }

The terms may be null (eg if the field does not exist).

Once you have a non-null Terms you can get an enum like this:

    TermsEnum termsEnum = terms.iterator();

The returned TermsEnum will not be null.

You can then .next() through the TermsEnum, or seek. If you want a
DocsEnum, do this:

    Bits skipDocs = MultiFields.getDeletedDocs(reader);
    DocsEnum docsEnum = null;

    docsEnum = termsEnum.docs(skipDocs, docsEnum);

You can pass in a prior DocsEnum and it will be reused if possible.

Likewise for DocsAndPositionsEnum.
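
For example, a sketch for positions (this assumes the field was
indexed with positions, and that DocsAndPositionsEnum iterates docs
with the same nextDoc()/freq() style as DocsEnum):

    DocsAndPositionsEnum postings = termsEnum.docsAndPositions(skipDocs, null);
    if (postings != null) {
      while(postings.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        for(int i = 0; i < postings.freq(); i++) {
          int position = postings.nextPosition();
          ...
        }
      }
    }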

IndexReader has several sugar methods (which just go through the
above steps, under the hood). Instead of:

    Term t;
    TermDocs termDocs = reader.termDocs();
    termDocs.seek(t);

do this:

    String field;
    BytesRef text;
    DocsEnum docsEnum = reader.termDocsEnum(reader.getDeletedDocs(), field, text);

Likewise for DocsAndPositionsEnum.

* LUCENE-2600: remove IndexReader.isDeleted

  Instead of IndexReader.isDeleted, do this:

    import org.apache.lucene.util.Bits;
    import org.apache.lucene.index.MultiFields;

    Bits delDocs = MultiFields.getDeletedDocs(indexReader);
    // delDocs is null when the reader has no deletions:
    if (delDocs != null && delDocs.get(docID)) {
      // document is deleted...
    }

* LUCENE-2674: A new idfExplain method was added to Similarity that
  accepts an incoming docFreq. If you subclass Similarity, make sure
  you also override this method on upgrade; otherwise your
  customizations won't run for certain MultiTermQuerys.

* LUCENE-2413: Lucene's core and contrib analyzers, along with Solr's analyzers,
  were consolidated into modules/analysis. During the refactoring some
  package names have changed:
  - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
  - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
  - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
  - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
  - o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
  - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
  - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
  - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
  - o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
  - o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
  - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
  - o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
  - o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
  - o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
  - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
  - o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
  - o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
  - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
  - o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
  - o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
  - o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
  - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
  - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
  - o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
  - o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
  - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
  - o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
  - o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils

* LUCENE-2514: The option to use a Collator's order (instead of binary order) for
  sorting and range queries has been moved to contrib/queries.

  The collated TermRangeQuery/Filter has been moved to SlowCollatedTermRangeQuery/Filter,
  and the collated sorting has been moved to SlowCollatedStringComparator.

  Note: this functionality isn't very scalable; if you are using it, consider
  indexing collation keys with the collation support in the analysis module instead.

  To perform collated range queries, use a suitable collating analyzer (CollationKeyAnalyzer
  or ICUCollationKeyAnalyzer), and set qp.setAnalyzeRangeTerms(true).

  TermRangeQuery and TermRangeFilter now work purely on bytes. Both have helper factory methods
  (newStringRange), similar to the NumericRange API, to easily perform range queries on Strings.
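
  For example, a sketch of the new factory method (binary,
  non-collated order):

    TermRangeQuery q = TermRangeQuery.newStringRange(
        "field", "apple", "banana", true, true);  // inclusive bounds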

* LUCENE-2691: The near-real-time API has moved from IndexWriter to
  IndexReader. Instead of IndexWriter.getReader(), call
  IndexReader.open(IndexWriter) or IndexReader.reopen(IndexWriter).
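
  For example, a sketch of the usual open/reopen cycle (standard
  reopen semantics assumed: a new reader is returned only if
  something changed, and the old one must then be closed):

    IndexReader nrtReader = IndexReader.open(writer);
    // ... index more documents with writer ...
    IndexReader newReader = nrtReader.reopen(writer);
    if (newReader != nrtReader) {
      nrtReader.close();
      nrtReader = newReader;
    }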

* LUCENE-2690: MultiTermQuery boolean rewrites per segment.
  Also, MultiTermQuery.getTermsEnum() now takes an AttributeSource. FuzzyTermsEnum
  is both a consumer and a producer of attributes: MTQ.BoostAttribute is
  added to the FuzzyTermsEnum, and MTQ's rewrite mode consumes it.
  In the other direction, MTQ.TopTermsBooleanQueryRewrite supplies a
  global AttributeSource to each segment's TermsEnum. The TermsEnum is the
  consumer and gets the current minimum competitive boost
  (MTQ.MaxNonCompetitiveBoostAttribute).

* LUCENE-2374: The backwards layer in AttributeImpl was removed. To support correct
  reflection of AttributeImpl instances, where the reflection was previously done by
  parsing the deprecated toString() output, you now have to override reflectWith()
  to customize the output. toString() is no longer implemented by AttributeImpl, so
  if you have overridden toString(), port your customization over to reflectWith().
  reflectAsString() then returns what toString() did before.
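
  For example, a sketch of a custom attribute implementation (the
  attribute interface and field names here are illustrative):

    public final class MyAttributeImpl extends AttributeImpl implements MyAttribute {
      private int value;
      @Override
      public void reflectWith(AttributeReflector reflector) {
        // replaces the old toString()-based reflection
        reflector.reflect(MyAttribute.class, "value", value);
      }
      ...
    }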

* LUCENE-2236, LUCENE-2912: DefaultSimilarity can no longer be set statically
  (and dangerously) for the entire JVM.
  Instead, IndexWriterConfig and IndexSearcher now take a SimilarityProvider.
  Similarity can now be configured on a per-field basis.
  Similarity retains only the field-specific relevance methods such as tf() and idf().
  Previously some (but not all) of these methods, such as computeNorm and scorePayload,
  took the field as a parameter; this is removed because the entire Similarity (all
  methods) can now be configured per-field.
  Methods that apply to the entire query, such as coord() and queryNorm(), exist in
  SimilarityProvider.
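
  For example, a sketch of a per-field provider (this assumes
  SimilarityProvider exposes coord()/queryNorm() plus a per-field
  lookup, whose name is assumed here to be get(String)):

    SimilarityProvider provider = new SimilarityProvider() {
      private final Similarity defaultSim = new DefaultSimilarity();
      public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
      }
      public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
      }
      public Similarity get(String field) {
        return defaultSim; // switch on the field name for per-field behavior
      }
    };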

* LUCENE-1076: TieredMergePolicy is now the default merge policy.
  It's able to merge non-contiguous segments; this may cause problems
  for applications that rely on Lucene's internal document ID
  assignment. If so, you should instead use LogByteSize/DocMergePolicy
  during indexing.
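
  For example, a sketch of keeping the old contiguous-merge behavior
  (the Version constant is illustrative):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    iwc.setMergePolicy(new LogByteSizeMergePolicy());
    IndexWriter writer = new IndexWriter(dir, iwc);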

* LUCENE-2883: Lucene's o.a.l.search.function ValueSource based functionality was consolidated
  into module/queries along with Solr's similar functionality. The following classes were moved:
  - o.a.l.search.function.CustomScoreQuery -> o.a.l.queries.CustomScoreQuery
  - o.a.l.search.function.CustomScoreProvider -> o.a.l.queries.CustomScoreProvider
  - o.a.l.search.function.NumericIndexDocValueSource -> o.a.l.queries.function.valuesource.NumericIndexDocValueSource
  The following lists the replacement classes for those removed:
  - o.a.l.search.function.ByteFieldSource -> o.a.l.queries.function.valuesource.ByteFieldSource
  - o.a.l.search.function.DocValues -> o.a.l.queries.function.DocValues
  - o.a.l.search.function.FieldCacheSource -> o.a.l.queries.function.valuesource.FieldCacheSource
  - o.a.l.search.function.FieldScoreQuery -> o.a.l.queries.function.FunctionQuery
  - o.a.l.search.function.FloatFieldSource -> o.a.l.queries.function.valuesource.FloatFieldSource
  - o.a.l.search.function.IntFieldSource -> o.a.l.queries.function.valuesource.IntFieldSource
  - o.a.l.search.function.OrdFieldSource -> o.a.l.queries.function.valuesource.OrdFieldSource
  - o.a.l.search.function.ReverseOrdFieldSource -> o.a.l.queries.function.valuesource.ReverseOrdFieldSource
  - o.a.l.search.function.ShortFieldSource -> o.a.l.queries.function.valuesource.ShortFieldSource
  - o.a.l.search.function.ValueSource -> o.a.l.queries.function.ValueSource
  - o.a.l.search.function.ValueSourceQuery -> o.a.l.queries.function.FunctionQuery

* LUCENE-2392: Enable flexible scoring:

  The existing "Similarity" API is now TFIDFSimilarity; if you were extending
  Similarity before, you should likely extend this instead.

  Weight.normalize no longer takes a norm value that incorporates the top-level
  boost from outer queries such as BooleanQuery; instead, it takes two parameters,
  the outer boost (topLevelBoost) and the norm. Weight.sumOfSquaredWeights has
  been renamed to Weight.getValueForNormalization().

  The scorePayload method now takes a BytesRef. It is never null.
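
  For example, a sketch of a custom TF function (DefaultSimilarity
  extends TFIDFSimilarity, so small tweaks can start from there):

    public class MySimilarity extends DefaultSimilarity {
      @Override
      public float tf(float freq) {
        // sublinear tf instead of the default sqrt(freq)
        return freq > 0 ? 1 + (float) Math.log(freq) : 0;
      }
    }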

* LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser,
  where other QueryParsers from the codebase will also be placed. The following classes were moved:
  - o.a.l.queryParser.CharStream -> o.a.l.queryparser.classic.CharStream
  - o.a.l.queryParser.FastCharStream -> o.a.l.queryparser.classic.FastCharStream
  - o.a.l.queryParser.MultiFieldQueryParser -> o.a.l.queryparser.classic.MultiFieldQueryParser
  - o.a.l.queryParser.ParseException -> o.a.l.queryparser.classic.ParseException
  - o.a.l.queryParser.QueryParser -> o.a.l.queryparser.classic.QueryParser
  - o.a.l.queryParser.QueryParserBase -> o.a.l.queryparser.classic.QueryParserBase
  - o.a.l.queryParser.QueryParserConstants -> o.a.l.queryparser.classic.QueryParserConstants
  - o.a.l.queryParser.QueryParserTokenManager -> o.a.l.queryparser.classic.QueryParserTokenManager
  - o.a.l.queryParser.QueryParserToken -> o.a.l.queryparser.classic.Token
  - o.a.l.queryParser.QueryParserTokenMgrError -> o.a.l.queryparser.classic.TokenMgrError



* LUCENE-2308: Separate IndexableFieldType from Field instances

  With this change, the indexing details (indexed, tokenized, norms,
  indexOptions, stored, etc.) are moved into a separate FieldType
  instance (rather than being stored directly on the Field).

  This means you can create the IndexableFieldType instance once, up front,
  for a given field, and then re-use that instance whenever you instantiate
  the Field.

  Certain field types are pre-defined since they are common cases:

  * StringField: indexes a String value as a single token (ie, does
    not tokenize). This field turns off norms and indexes only doc
    IDs (it does not index term frequency or positions). This field
    does not store its value, but exposes TYPE_STORED as well.

  * BinaryField: a byte[] value that's only stored.

  * TextField: indexes and tokenizes a String, Reader or TokenStream
    value, without term vectors. This field does not store its value,
    but exposes TYPE_STORED as well.

  If your usage fits one of those common cases you can simply
  instantiate the above class. To use the TYPE_STORED variant, do this
  instead:

    Field f = new Field("field", StringField.TYPE_STORED, "value");

  Alternatively, if an existing type is close to what you want but you
  need to make a few changes, you can copy that type and make changes:

    FieldType bodyType = new FieldType(TextField.TYPE_STORED);
    bodyType.setStoreTermVectors(true);

  You can of course also create your own FieldType from scratch:

    FieldType t = new FieldType();
    t.setIndexed(true);
    t.setStored(true);
    t.setOmitNorms(true);
    t.setIndexOptions(IndexOptions.DOCS_AND_FREQS);

  FieldType has a freeze() method to prevent further changes.
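
  For example, a sketch (the exception thrown on post-freeze changes
  is assumed to be a runtime exception):

    FieldType bodyType = new FieldType(TextField.TYPE_STORED);
    bodyType.setStoreTermVectors(true);
    bodyType.freeze();
    // bodyType.setStored(false); // would now throw at runtime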

  When migrating from the 3.x API, if you did this before:

    new Field("field", "value", Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS)

  you can now do this:

    new StringField("field", "value")

  (though note that StringField indexes DOCS_ONLY).

  If instead the value was stored:

    new Field("field", "value", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS)

  you can now do this:

    new Field("field", StringField.TYPE_STORED, "value")

  If you didn't omit norms:

    new Field("field", "value", Field.Store.YES, Field.Index.NOT_ANALYZED)

  you can now do this:

    FieldType ft = new FieldType(StringField.TYPE_STORED);
    ft.setOmitNorms(false);
    new Field("field", ft, "value")

  If you did this before (value can be String or Reader):

    new Field("field", value, Field.Store.NO, Field.Index.ANALYZED)

  you can now do this:

    new TextField("field", value)

  If instead the value was stored:

    new Field("field", value, Field.Store.YES, Field.Index.ANALYZED)

  you can now do this:

    new Field("field", TextField.TYPE_STORED, value)

  If in addition you omit norms:

    new Field("field", value, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS)

  you can now do this:

    FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setOmitNorms(true);
    new Field("field", ft, value)

  If you did this before (bytes is a byte[]):

    new Field("field", bytes)

  you can now do this:

    new BinaryField("field", bytes)