# Apache Lucene Migration Guide
## TermsEnum is now fully abstract (LUCENE-8292) ##
TermsEnum has been changed to be fully abstract, so non-abstract subclasses must implement all of its methods.
Non-performance-critical TermsEnums can use BaseTermsEnum as a base class instead. The change was motivated
by several performance issues with FilterTermsEnum that caused significant slowdowns and massive memory consumption due
to not delegating all methods from TermsEnum. See LUCENE-8292 and LUCENE-8662.
## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##
SpanQuery and PhraseQuery now always compute their slop factor as (1.0 / (1.0 +
distance)). Payload factor calculation is performed by PayloadDecoder in the
queries module.
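
The fixed slop weighting can be illustrated with a few lines of plain Java. This is a minimal sketch of the formula quoted above, not Lucene's own code; the class and method names are made up for the example.

```java
// Hypothetical helper, not part of Lucene: evaluates the fixed slop
// factor 1 / (1 + distance) that sloppy matches are now weighted by.
class SlopFactorSketch {

    // distance is the edit distance of a sloppy match (0 for an exact match).
    public static float slopFactor(int distance) {
        return 1.0f / (1.0f + distance);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println("distance=" + distance
                + " factor=" + slopFactor(distance));
        }
    }
}
```

An exact match keeps its full weight, and the factor decays towards zero as the match gets sloppier.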
## Scorer must produce positive scores (LUCENE-7996) ##
Scorers are no longer allowed to produce negative scores. If you have custom
query implementations, make sure that their score formulas can never produce
negative scores.
As a side-effect of this change, negative boosts are now rejected and
FunctionScoreQuery maps negative values to 0.
## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##
Instead use FunctionScoreQuery and a DoubleValuesSource implementation. BoostedQuery
and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
FunctionScoreQuery.boostByQuery(). To replace more complex calculations in
CustomScoreQuery, use the lucene-expressions module:

    SimpleBindings bindings = new SimpleBindings();
    bindings.add("score", DoubleValuesSource.SCORES);
    bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
    bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
    Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
    FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));

## Index options can no longer be changed dynamically (LUCENE-8134) ##
Changing index options on the fly now results in an
IllegalArgumentException. If a field is indexed
(FieldType.indexOptions() != IndexOptions.NONE) then all documents must have
the same index options for that field.
## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##
Instead use IndexSearcher.createWeight(), rewriting the query first, and using
a boost of 1f.
## Memory codecs removed (LUCENE-8267) ##
Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).
## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##
Caching everything is discouraged as it disables the ability to skip non-interesting documents.
ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.
## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE-7444) ##
To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
to the constructor.
## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##
English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
analysis-common module.
## TopDocs.maxScore removed ##
TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
an option to compute the maximum score when sorting by field. If you need to
know the maximum score for a query, the recommended approach is to run a
separate query:

    TopDocs topHits = searcher.search(query, 1);
    float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;

Thanks to other optimizations that were added to Lucene 8, this query will be
able to efficiently select the top-scoring document without having to visit
all matches.
## TopFieldCollector always assumes fillFields=true ##
Because filling sort values doesn't have a significant overhead, the fillFields
option has been removed from TopFieldCollector factory methods. Everything
behaves as if it was set to true.
## TopFieldCollector no longer takes a trackDocScores option ##
Computing scores at collection time is less efficient than running a second
request in order to only compute scores for documents that made it to the top
hits. As a consequence, the trackDocScores option has been removed and can be
replaced with the new TopFieldCollector#populateScores helper method.
## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##
Lucene 8 received optimizations for collection of top-k matches by not visiting
all matches. However, these optimizations won't help if all matches still need
to be visited in order to compute the total number of hits. As a consequence,
IndexSearcher's search and searchAfter methods were changed to only count hits
accurately up to 1,000, and TopDocs.totalHits was changed from a long to an
object that says whether the hit count is accurate or a lower bound of the
actual hit count.
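
Code that used to print or compare the raw long now has to check whether the count is exact. In Lucene 8 the object is a TotalHits with a `value` and a `relation` field; the sketch below uses local stand-in types (`HitCount`, `Relation`, `count`, `describe` are all hypothetical names) so that it runs without Lucene on the classpath. It is a toy model of the behaviour described above, not Lucene's implementation, and the exact lower bound Lucene reports may differ.

```java
// Toy model, not Lucene code: a hit count that is either exact or a lower bound.
class TotalHitsSketch {

    // Stand-in for the relation carried by Lucene's hit count object.
    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    // Stand-in for the object that replaced the long hit count.
    static final class HitCount {
        final long value;
        final Relation relation;

        HitCount(long value, Relation relation) {
            this.value = value;
            this.relation = relation;
        }
    }

    // Counts accurately up to the threshold, then only reports a lower bound.
    public static HitCount count(long actualMatches, long threshold) {
        if (actualMatches <= threshold) {
            return new HitCount(actualMatches, Relation.EQUAL_TO);
        }
        return new HitCount(threshold, Relation.GREATER_THAN_OR_EQUAL_TO);
    }

    // How calling code might render the count for display.
    public static String describe(HitCount hits) {
        return hits.relation == Relation.EQUAL_TO
            ? hits.value + " hits"
            : "at least " + hits.value + " hits";
    }

    public static void main(String[] args) {
        System.out.println(describe(count(42, 1000)));
        System.out.println(describe(count(5000, 1000)));
    }
}
```

Callers that need an exact count for every query have to request it explicitly rather than rely on the default search methods.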
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##
This RAM-based directory implementation is an old piece of code that uses inefficient
thread synchronization primitives and can be mistaken for being "faster" than the NIO-based
MMapDirectory. It is deprecated and scheduled for removal in future versions of
Lucene. (LUCENE-8467, LUCENE-8438)
## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##
Scorer has a number of methods that should never be called from Collectors, for example
those that advance the underlying iterators. To hide these, LeafCollector.setScorer()
now takes a Scorable, an abstract class that Scorers can extend, with methods
docID() and score() (LUCENE-6228).
## Scorers must have non-null Weights ##
If a custom Scorer implementation does not have an associated Weight, it can probably
be replaced with a Scorable instead.
## Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time ##
Most code only needs to be recompiled, possibly with a few added casts.
## TokenStreamComponents is now final ##
Instead of overriding TokenStreamComponents#setReader() to customise analyzer
initialisation, you should now pass a Consumer<Reader> instance to the
TokenStreamComponents constructor.
## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##
LowerCaseTokenizer combined tokenization and filtering in a way that broke token
normalization, so they have been removed. Instead, use a LetterTokenizer followed by
a LowerCaseFilter.
## CharTokenizer no longer takes a normalizer function ##
CharTokenizer now only performs tokenization. To perform any type of filtering
use a TokenFilter chain as you would with any other Tokenizer.
## Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery ##
Both Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or FieldQuery respectively
in order to support ToParent/ToChildBlockJoinQuery.
## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##
Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
TokenFilterFactory#normalize() returning a TokenFilter.
## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##
Scores computed by the BM25 similarity are lower than previously as the k1+1
constant factor was removed from the numerator of the scoring formula.
Ordering of results is preserved unless scores are computed from multiple
fields using different similarities. The previous behaviour is now exposed
by the LegacyBM25Similarity class which can be found in the lucene-misc jar.
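
To see why ordering is preserved within a single similarity, it helps to compare the two formulas side by side. The sketch below uses the textbook BM25 formula with the default parameters k1 = 1.2 and b = 0.75; it is a standalone illustration rather than Lucene's implementation (which, among other things, encodes document lengths lossily), and all class and method names are hypothetical.

```java
// Textbook BM25 with and without the (k1 + 1) factor in the numerator.
// Illustration only: this is not Lucene's BM25Similarity code.
class Bm25FactorSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    // Inverse document frequency for a term found in docFreq of docCount documents.
    public static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Length-dependent part of the denominator.
    static double lengthNorm(double docLen, double avgDocLen) {
        return K1 * (1 - B + B * docLen / avgDocLen);
    }

    // Previous formula: (k1 + 1) appears in the numerator.
    public static double legacyScore(double idf, double tf, double docLen, double avgDocLen) {
        return idf * (K1 + 1) * tf / (tf + lengthNorm(docLen, avgDocLen));
    }

    // Current formula: the constant factor has been dropped.
    public static double currentScore(double idf, double tf, double docLen, double avgDocLen) {
        return idf * tf / (tf + lengthNorm(docLen, avgDocLen));
    }

    public static void main(String[] args) {
        double idf = idf(1000, 10);
        double legacy = legacyScore(idf, 3, 120, 100);
        double current = currentScore(idf, 3, 120, 100);
        System.out.println("legacy=" + legacy + " current=" + current
            + " ratio=" + (legacy / current));
    }
}
```

Because every score shrinks by the same constant factor k1 + 1, ranking within one similarity is unchanged. Relative weights only shift when BM25 scores are combined with scores from a different similarity.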
## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##
Use IndexWriter#getDocStats() instead of #maxDoc() / #numDocs(); it offers a consistent
view of the document stats. Previously, calling two methods in order to get point-in-time stats was subject
to concurrent changes.