lucene/MIGRATE.txt - lucene-solr - Git at Google

 # Apache Lucene Migration Guide

 ## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933) ##

 User dictionary now strictly validates if the (concatenated) segment is the same as the surface form. This change avoids
 unexpected runtime exceptions or behaviours.
 For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

 # concatenated "日本経済新聞" does not match the surface form "日経新聞"
 日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

 # concatenated "日経新聞" does not match the surface form "日本経済新聞"
 日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞

 ## Analysis factories now have customizable symbolic names (LUCENE-8778) ##

 The SPI names for concrete subclasses of TokenizerFactory, TokenFilterFactory, and CharfilterFactory are no longer
 derived from their class name. Instead, each factory must have a static "NAME" field like this:

   /** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
   public static final String NAME = "standard";

 A factory can be resolved/instantiated with its NAME by using methods such as TokenizerFactory#lookupClass(String)
 or TokenizerFactory#forName(String, Map<String,String>).

 If there are any user-defined factory classes that don't have proper NAME field, an exception will be thrown
 when (re)loading factories. e.g., when calling TokenizerFactory#reloadTokenizers(ClassLoader).


 ## TermsEnum is now fully abstract (LUCENE-8292) ##

 TermsEnum has been changed to be fully abstract, so non-abstract subclass must implement all it's methods.
 Non-Performance critical TermsEnums can use BaseTermsEnum as a base class instead. The change was motivated
 by several performance issues with FilterTermsEnum that caused significant slowdowns and massive memory consumption due
 to not delegating all method from TermsEnum. See LUCENE-8292 and LUCENE-8662

 ## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed ##

 RAM-based directory implementation have been removed. (LUCENE-8474).
 ByteBuffersDirectory can be used as a RAM-resident replacement, although it
 is discouraged in favor of the default memory-mapped directory.


 ## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##

 SpanQuery and PhraseQuery now always calculate their slops as (1.0 / (1.0 +
 distance)).  Payload factor calculation is performed by PayloadDecoder in the
 queries module


 ## Scorer must produce positive scores (LUCENE-7996) ##

 Scorers are no longer allowed to produce negative scores. If you have custom
 query implementations, you should make sure their score formula may never produce
 negative scores.

 As a side-effect of this change, negative boosts are now rejected and
 FunctionScoreQuery maps negative values to 0.


 ## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##

 Instead use FunctionScoreQuery and a DoubleValuesSource implementation.  BoostedQuery
 and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
 FunctionScoreQuery.boostByQuery().  To replace more complex calculations in
 CustomScoreQuery, use the lucene-expressions module:

 SimpleBindings bindings = new SimpleBindings();
 bindings.add("score", DoubleValuesSource.SCORES);
 bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
 bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
 Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
 FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));

 ## Index options can no longer be changed dynamically (LUCENE-8134) ##

 Changing index options on the fly is now going to result into an
 IllegalArgumentException. If a field is indexed
 (FieldType.indexOptions() != IndexOptions.NONE) then all documents must have
 the same index options for that field.


 ## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##

 Instead use IndexSearcher.createWeight(), rewriting the query first, and using
 a boost of 1f.

 ## Memory codecs removed (LUCENE-8267) ##

 Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).

 ## Direct doc-value format removed (LUCENE-8917) ##

 The "Direct" doc-value format has been removed from the codebase.

 ## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##

 Caching everything is discouraged as it disables the ability to skip non-interesting documents.
 ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.

 ## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444) ##

 To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
 to the constructor

 ## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##

 English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
 analysis-common module

 ## TopDocs.maxScore removed ##

 TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
 an option to compute the maximum score when sorting by field. If you need to
 know the maximum score for a query, the recommended approach is to run a
 separate query:

   TopDocs topHits = searcher.search(query, 1);
   float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;

 Thanks to other optimizations that were added to Lucene 8, this query will be
 able to efficiently select the top-scoring document without having to visit
 all matches.

 ## TopFieldCollector always assumes fillFields=true ##

 Because filling sort values doesn't have a significant overhead, the fillFields
 option has been removed from TopFieldCollector factory methods. Everything
 behaves as if it was set to true.

 ## TopFieldCollector no longer takes a trackDocScores option ##

 Computing scores at collection time is less efficient than running a second
 request in order to only compute scores for documents that made it to the top
 hits. As a consequence, the trackDocScores option has been removed and can be
 replaced with the new TopFieldCollector#populateScores helper method.

 ## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##

 Lucene 8 received optimizations for collection of top-k matches by not visiting
 all matches. However these optimizations won't help if all matches still need
 to be visited in order to compute the total number of hits. As a consequence,
 IndexSearcher's search and searchAfter methods were changed to only count hits
 accurately up to 1,000, and Topdocs.totalHits was changed from a long to an
 object that says whether the hit count is accurate or a lower bound of the
 actual hit count.

 ## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##

 This RAM-based directory implementation is an old piece of code that uses inefficient
 thread synchronization primitives and can be confused as "faster" than the NIO-based
 MMapDirectory. It is deprecated and scheduled for removal in future versions of
 Lucene. (LUCENE-8467, LUCENE-8438)

 ## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##

 Scorer has a number of methods that should never be called from Collectors, for example
 those that advance the underlying iterators.  To hide these, LeafCollector.setScorer()
 now takes a Scorable, an abstract class that Scorers can extend, with methods
 docId() and score() (LUCENE-6228)

 ## Scorers must have non-null Weights ##

 If a custom Scorer implementation does not have an associated Weight, it can probably
 be replaced with a Scorable instead.

 ## Suggesters now return Long instead of long for weight() during indexing, and double
 instead of long at suggest time ##

 Most code should just require recompilation, though possibly requiring some added casts.

 ## TokenStreamComponents is now final ##

 Instead of overriding TokenStreamComponents#setReader() to customise analyzer
 initialisation, you should now pass a Consumer&lt;Reader> instance to the
 TokenStreamComponents constructor.

 ## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##

 LowerCaseTokenizer combined tokenization and filtering in a way that broke token
 normalization, so they have been removed. Instead, use a LetterTokenizer followed by
 a LowerCaseFilter

 ## CharTokenizer no longer takes a normalizer function ##

 CharTokenizer now only performs tokenization. To perform any type of filtering
 use a TokenFilter chain as you would with any other Tokenizer.

 ## Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery

 Both Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or FieldQuery respectively
 in order to support ToParent/ToChildBlockJoinQuery.

 ## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##

 Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
 TokenFilterFactory#normalize() returning a TokenFilter.

 ## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##

 Scores computed by the BM25 similarity are lower than previously as the k1+1
 constant factor was removed from the numerator of the scoring formula.
 Ordering of results is preserved unless scores are computed from multiple
 fields using different similarities. The previous behaviour is now exposed
 by the LegacyBM25Similarity class which can be found in the lucene-misc jar.

 ## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##

 IndexWriter#getDocStats() should be used instead of #maxDoc() / #numDocs() which offers a consistent
 view on document stats. Previously calling two methods in order ot get point in time stats was subject
 to concurrent changes.

 ## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811) ##
 IndexSearcher now performs max clause count checks on all types of queries (including BooleanQueries).
 This led to a logical move of the clauses count from BooleanQuery to IndexSearcher.

 ## TopDocs.merge shall no longer allow setting of shard indices ##

 TopDocs.merge's API has been changed to stop allowing passing in a parameter to indicate if it should
 set shard indices for hits as they are seen during the merge process. This is done to simplify the API
 to be more dynamic in terms of passing in custom tie breakers.
 If shard indices are to be used for tie breaking docs with equal scores during TopDocs.merge, then it is
 mandatory that the input ScoreDocs have their shard indices set to valid values prior to calling TopDocs.merge

 ## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments ##

 TopDocsCollector shall no longer return an empty TopDocs for malformed arguments.
 Rather, an IllegalArgumentException shall be thrown. This is introduced for better
 defence and to ensure that there is no bubbling up of errors when Lucene is
 used in multi level applications
	# Apache Lucene Migration Guide

	## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933) ##

	User dictionary now strictly validates if the (concatenated) segment is the same as the surface form. This change avoids
	unexpected runtime exceptions or behaviours.
	For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

	# concatenated "日本経済新聞" does not match the surface form "日経新聞"
	日経新聞,日本経済新聞,ニホンケイザイシンブン,カスタム名詞

	# concatenated "日経新聞" does not match the surface form "日本経済新聞"
	日本経済新聞,日経新聞,ニッケイシンブン,カスタム名詞

	## Analysis factories now have customizable symbolic names (LUCENE-8778) ##

	The SPI names for concrete subclasses of TokenizerFactory, TokenFilterFactory, and CharfilterFactory are no longer
	derived from their class name. Instead, each factory must have a static "NAME" field like this:

	/** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
	public static final String NAME = "standard";

	A factory can be resolved/instantiated with its NAME by using methods such as TokenizerFactory#lookupClass(String)
	or TokenizerFactory#forName(String, Map<String,String>).

	If there are any user-defined factory classes that don't have proper NAME field, an exception will be thrown
	when (re)loading factories. e.g., when calling TokenizerFactory#reloadTokenizers(ClassLoader).


	## TermsEnum is now fully abstract (LUCENE-8292) ##

	TermsEnum has been changed to be fully abstract, so non-abstract subclass must implement all it's methods.
	Non-Performance critical TermsEnums can use BaseTermsEnum as a base class instead. The change was motivated
	by several performance issues with FilterTermsEnum that caused significant slowdowns and massive memory consumption due
	to not delegating all method from TermsEnum. See LUCENE-8292 and LUCENE-8662

	## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed ##

	RAM-based directory implementation have been removed. (LUCENE-8474).
	ByteBuffersDirectory can be used as a RAM-resident replacement, although it
	is discouraged in favor of the default memory-mapped directory.


	## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##

	SpanQuery and PhraseQuery now always calculate their slops as (1.0 / (1.0 +
	distance)). Payload factor calculation is performed by PayloadDecoder in the
	queries module


	## Scorer must produce positive scores (LUCENE-7996) ##

	Scorers are no longer allowed to produce negative scores. If you have custom
	query implementations, you should make sure their score formula may never produce
	negative scores.

	As a side-effect of this change, negative boosts are now rejected and
	FunctionScoreQuery maps negative values to 0.


	## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##

	Instead use FunctionScoreQuery and a DoubleValuesSource implementation. BoostedQuery
	and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
	FunctionScoreQuery.boostByQuery(). To replace more complex calculations in
	CustomScoreQuery, use the lucene-expressions module:

	SimpleBindings bindings = new SimpleBindings();
	bindings.add("score", DoubleValuesSource.SCORES);
	bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
	bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
	Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
	FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));

	## Index options can no longer be changed dynamically (LUCENE-8134) ##

	Changing index options on the fly is now going to result into an
	IllegalArgumentException. If a field is indexed
	(FieldType.indexOptions() != IndexOptions.NONE) then all documents must have
	the same index options for that field.


	## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##

	Instead use IndexSearcher.createWeight(), rewriting the query first, and using
	a boost of 1f.

	## Memory codecs removed (LUCENE-8267) ##

	Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).

	## Direct doc-value format removed (LUCENE-8917) ##

	The "Direct" doc-value format has been removed from the codebase.

	## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##

	Caching everything is discouraged as it disables the ability to skip non-interesting documents.
	ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.

	## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444) ##

	To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
	to the constructor

	## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##

	English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
	analysis-common module

	## TopDocs.maxScore removed ##

	TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
	an option to compute the maximum score when sorting by field. If you need to
	know the maximum score for a query, the recommended approach is to run a
	separate query:

	TopDocs topHits = searcher.search(query, 1);
	float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;

	Thanks to other optimizations that were added to Lucene 8, this query will be
	able to efficiently select the top-scoring document without having to visit
	all matches.

	## TopFieldCollector always assumes fillFields=true ##

	Because filling sort values doesn't have a significant overhead, the fillFields
	option has been removed from TopFieldCollector factory methods. Everything
	behaves as if it was set to true.

	## TopFieldCollector no longer takes a trackDocScores option ##

	Computing scores at collection time is less efficient than running a second
	request in order to only compute scores for documents that made it to the top
	hits. As a consequence, the trackDocScores option has been removed and can be
	replaced with the new TopFieldCollector#populateScores helper method.

	## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##

	Lucene 8 received optimizations for collection of top-k matches by not visiting
	all matches. However these optimizations won't help if all matches still need
	to be visited in order to compute the total number of hits. As a consequence,
	IndexSearcher's search and searchAfter methods were changed to only count hits
	accurately up to 1,000, and Topdocs.totalHits was changed from a long to an
	object that says whether the hit count is accurate or a lower bound of the
	actual hit count.

	## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##

	This RAM-based directory implementation is an old piece of code that uses inefficient
	thread synchronization primitives and can be confused as "faster" than the NIO-based
	MMapDirectory. It is deprecated and scheduled for removal in future versions of
	Lucene. (LUCENE-8467, LUCENE-8438)

	## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##

	Scorer has a number of methods that should never be called from Collectors, for example
	those that advance the underlying iterators. To hide these, LeafCollector.setScorer()
	now takes a Scorable, an abstract class that Scorers can extend, with methods
	docId() and score() (LUCENE-6228)

	## Scorers must have non-null Weights ##

	If a custom Scorer implementation does not have an associated Weight, it can probably
	be replaced with a Scorable instead.

	## Suggesters now return Long instead of long for weight() during indexing, and double
	instead of long at suggest time ##

	Most code should just require recompilation, though possibly requiring some added casts.

	## TokenStreamComponents is now final ##

	Instead of overriding TokenStreamComponents#setReader() to customise analyzer
	initialisation, you should now pass a Consumer<Reader> instance to the
	TokenStreamComponents constructor.

	## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##

	LowerCaseTokenizer combined tokenization and filtering in a way that broke token
	normalization, so they have been removed. Instead, use a LetterTokenizer followed by
	a LowerCaseFilter

	## CharTokenizer no longer takes a normalizer function ##

	CharTokenizer now only performs tokenization. To perform any type of filtering
	use a TokenFilter chain as you would with any other Tokenizer.

	## Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery

	Both Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or FieldQuery respectively
	in order to support ToParent/ToChildBlockJoinQuery.

	## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##

	Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
	TokenFilterFactory#normalize() returning a TokenFilter.

	## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##

	Scores computed by the BM25 similarity are lower than previously as the k1+1
	constant factor was removed from the numerator of the scoring formula.
	Ordering of results is preserved unless scores are computed from multiple
	fields using different similarities. The previous behaviour is now exposed
	by the LegacyBM25Similarity class which can be found in the lucene-misc jar.

	## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##

	IndexWriter#getDocStats() should be used instead of #maxDoc() / #numDocs() which offers a consistent
	view on document stats. Previously calling two methods in order ot get point in time stats was subject
	to concurrent changes.

	## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811) ##
	IndexSearcher now performs max clause count checks on all types of queries (including BooleanQueries).
	This led to a logical move of the clauses count from BooleanQuery to IndexSearcher.

	## TopDocs.merge shall no longer allow setting of shard indices ##

	TopDocs.merge's API has been changed to stop allowing passing in a parameter to indicate if it should
	set shard indices for hits as they are seen during the merge process. This is done to simplify the API
	to be more dynamic in terms of passing in custom tie breakers.
	If shard indices are to be used for tie breaking docs with equal scores during TopDocs.merge, then it is
	mandatory that the input ScoreDocs have their shard indices set to valid values prior to calling TopDocs.merge

	## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments ##

	TopDocsCollector shall no longer return an empty TopDocs for malformed arguments.
	Rather, an IllegalArgumentException shall be thrown. This is introduced for better
	defence and to ensure that there is no bubbling up of errors when Lucene is
	used in multi level applications