| Lucene contrib change Log |
| |
| For more information on past and future Lucene versions, please see: |
| http://s.apache.org/luceneversions |
| |
| ======================= Lucene 3.5.0 ================ |
| |
| Changes in backwards compatibility policy |
| |
| * LUCENE-3446: Removed BooleanFilter.finalResult() due to change to |
| FixedBitSet. (Uwe Schindler) |
| |
| New Features |
| |
| * LUCENE-1824: Add BoundaryScanner interface and its implementation classes, |
| SimpleBoundaryScanner and BreakIteratorBoundaryScanner, so that FVH's FragmentsBuilder |
| can find "natural" boundary to make snippets. (Robert Muir, Koji Sekiguchi) |
| |
| * LUCENE-1889: Add MultiTermQuery support for FVH. (Mike Sokolov via Koji Sekiguchi) |
| |
| * LUCENE-3458: Change BooleanFilter to have only a single clauses ArrayList |
| (so toString() works in order). It now behaves more like BooleanQuery, |
| implements Iterable<FilterClause>, and allows adding Filters without |
| creating FilterClause. (Uwe Schindler) |
| |
| * LUCENE-3414: Added HunspellStemFilter which uses a provided pure Java implementation of the |
| Hunspell algorithm. (Chris Male) |
| |
| * LUCENE-3445: Added SearcherManager, to manage sharing and reopening |
| IndexSearchers across multiple search threads. IndexReader's |
| refCount is used to safely close the reader only once all threads are done |
| using it. (Michael McCandless) |
| |
| API Changes |
| |
| * LUCENE-3431: Deprecated QueryAutoStopWordAnalyzer.addStopWords* since they |
| prevent reuse. Stopwords are now to be computed when the Analyzer is instantiated. |
| If new stopwords are needed, a new Analyzer instance should be created. (Chris Male) |
| |
| * LUCENE-3434: Deprecated ShingleAnalyzerWrapper.set* since they prevent reuse. The |
| Analyzer should be configured at instantiation. Deprecated PerFieldAnalyzerWrapper.addAnalyzer |
| since it also prevents reuse. Analyzers per field should be configured at instantiation. |
| (Chris Male) |
| |
| Bug Fixes |
| |
| * LUCENE-3417: DictionaryCompoundWordFilter did not properly add tokens from the |
| end compound word. (Njal Karevoll via Robert Muir) |
| |
| * LUCENE-3019: Fix unexpected color tags for FastVectorHighlighter. (Koji Sekiguchi) |
| |
| * LUCENE-3446: Fix NPE in BooleanFilter when DocIdSet/DocIdSetIterator is null. |
| Converted code to FixedBitSet and simplified. (Uwe Schindler, Shuji Umino) |
| |
| API Changes |
| |
| * LUCENE-3436: Add SuggestMode to the spellchecker, so you can specify the strategy |
| for suggesting related terms. (James Dyer via Robert Muir) |
| |
| ======================= Lucene 3.4.0 ================ |
| |
| New Features |
| |
| * LUCENE-3234: provide a limit on phrase analysis in FastVectorHighlighter for |
| highlighting speed up. Use FastVectorHighlighter.setPhraseLimit() to set limit |
| (e.g. 5000). (Mike Sokolov via Koji Sekiguchi) |
| |
| * LUCENE-3079: a new facet module which provides faceted indexing & search |
| capabilities. It allows managing a taxonomy of categories, and index them |
| with documents. It also provides search API for aggregating (e.g. count) |
| the weights of the categories that are relevant to the search results. |
| (Shai Erera) |
| |
| * LUCENE-3171: Added BlockJoinQuery and BlockJoinCollector, under the |
| new contrib/join module, to enable searches that require joining |
| between parent and child documents. Joined (children + parent) |
| documents must be indexed as a document block, using |
| IndexWriter.add/UpdateDocuments (Mark Harwood, Mike McCandless) |
| |
| * LUCENE-3233, LUCENE-3375: Added SynonymFilter for applying multi-word synonyms |
| during indexing or querying (with parsers for wordnet and solr formats). |
| Removed contrib/wordnet. (Simon Rosenthal, Robert Muir, Mike McCandless) |
| |
| * LUCENE-1768: added support for numeric ranges in contrib query parser; |
| added support for simple numeric queries, such as <age:4>, in contrib |
| query parser (Vinicius Barros via Uwe Schindler) |
| |
| Changes in runtime behavior: |
| |
| * LUCENE-1768: StandardQueryConfigHandler now uses NumericFieldConfigListener |
| to set a NumericConfig to its corresponding FieldConfig; |
| StandardQueryTreeBuilder now uses DummyQueryNodeBuilder for |
| NumericQueryNodes and uses NumericRangeQueryNodeBuilder for |
| NumericRangeQueryNodes; StandardQueryNodeProcessorPipeline now executes |
| NumericQueryNodeProcessor followed by NumericRangeQueryNodeProcessor |
| right after LowercaseExpandedTermsQueryNodeProcessor |
| (Vinicius Barros via Uwe Schindler) |
| |
| API Changes |
| |
| * LUCENE-3296: PKIndexSplitter & MultiPassIndexSplitter now have version |
| constructors. PKIndexSplitter accepts a IndexWriterConfig for each of |
| the target indexes. (Simon Willnauer, Jason Rutherglen) |
| |
| * LUCENE-2979: queryparser configuration API located under |
| org.apache.lucene.queryParser.core.config has been simplified and |
| Attribute objects no longer should be used to configure query parsers. Now |
| any configuration should be done through AbstractQueryConfig's set and get |
| methods. The old API, which uses Attributes objects, is still in place, however |
| it has been deprecated and will be removed soon. |
| (Phillipe Ramalho via Adriano Crestani) |
| |
| * LUCENE-3400: Deprecated DutchAnalyzer.setStemDictionary since it prevents |
| TokenStream reuse (Chris Male) |
| |
| * LUCENE-1768: setNumericConfigMap and getNumericConfigMap were added |
| to StandardQueryParser; ParametricRangeQueryNode and |
| oal.queryParser.standard.nodes.RangeQueryNode now implement |
| oal.queryParser.core.nodes.RangeQueryNode; |
| oal.queryParser.core.nodes.RangeQueryNode was deprecated and now extends |
| TermRangeQueryNode, which extends AbstractRangeQueryNode; |
| ParametricQueryNode was deprecated; FieldQueryNode now implements the |
| new FieldValueQueryNode<CharSequence>, which this last one implements |
| FieldableQueryNode and thew new ValueQueryNode |
| (Vinicius Barros via Uwe Schindler) |
| |
| Optimizations |
| |
| * LUCENE-3306: Disabled indexing of positions for spellchecker n-gram |
| fields: they are not needed because the spellchecker does not |
| use positional queries. (Robert Muir) |
| |
| Bug Fixes |
| |
| * LUCENE-3326: Fixed bug if you used MoreLikeThis.like(Reader), it would |
| try to re-analyze the same Reader multiple times, passing different |
| field names to the analyzer. Additionally MoreLikeThisQuery would take |
| your string and encode/decode it using the default charset, it now uses |
| a StringReader. Finally, MoreLikeThis's methods that take File, URL, InputStream, |
| are deprecated, please create the Reader yourself. (Trejkaz, Robert Muir) |
| |
| * LUCENE-3347: XML query parser did not always incorporate boosts from |
| UserQuery elements. (Moogie, Uwe Schindler) |
| |
| * LUCENE-3382: Fixed a bug where NRTCachingDirectory's listAll() would wrongly |
| throw NoSuchDirectoryException when all files written so far have been |
| cached to RAM and the directory still has not yet been created on the |
| filesystem. (Robert Muir) |
| |
| ======================= Lucene 3.3.0 ======================= |
| |
| New Features |
| |
| * LUCENE-152: Add KStem (light stemmer for English). |
| (Yonik Seeley via Robert Muir) |
| |
| * LUCENE-3135: Add suggesters (autocomplete) to contrib/spellchecker, |
| with three implementations: Jaspell, Ternary Trie, and Finite State. |
| (Andrzej Bialecki, Dawid Weiss, Mike Mccandless, Robert Muir) |
| |
| * LUCENE-3129: Added BlockGroupingCollector, a single pass |
| grouping collector which is faster than the two-pass approach, and |
| also computes the total group count, but requires that every |
| document sharing the same group was indexed as a doc block |
| (IndexWriter.add/updateDocuments). (Mike McCandless) |
| |
| * LUCENE-2955: Added NRTManager and NRTManagerReopenThread, to |
| simplify handling NRT reopen with multiple search threads, and to |
| allow an app to control which indexing changes must be visible to |
| which search requests. (Mike McCandless) |
| |
| * LUCENE-3191: Added SearchGroup.merge and TopGroups.merge, to |
| facilitate doing grouping in a distributed environment (Uwe |
| Schindler, Mike McCandless) |
| |
| * LUCENE-2919: Added PKIndexSplitter, that splits an index according |
| to a middle term in a specified field. (Jason Rutherglen via Mike |
| McCandless, Uwe Schindler) |
| |
| API Changes |
| |
| * LUCENE-3141: add getter method to access fragInfos in FieldFragList. |
| (Sujit Pal via Koji Sekiguchi) |
| |
| * LUCENE-3099: Allow subclasses to determine the group value for |
| First/SecondPassGroupingCollector. (Martijn van Groningen, Mike |
| McCandless) |
| |
| Bug Fixes |
| |
| * LUCENE-3185: Fix bug in NRTCachingDirectory.deleteFile that would |
| always throw exception and sometimes fail to actually delete the |
| file. (Mike McCandless) |
| |
| * LUCENE-3188: contrib/misc IndexSplitter creates indexes with incorrect |
| SegmentInfos.counter; added CheckIndex check & fix for this problem. |
| (Ivan Dimitrov Vasilev via Steve Rowe) |
| |
| Build |
| |
| * LUCENE-3149: Upgrade contrib/icu's ICU jar file to ICU 4.8. |
| (Robert Muir) |
| |
| ======================= Lucene 3.2.0 ======================= |
| |
| Changes in backwards compatibility policy |
| |
| * LUCENE-2981: Removed the following contribs: ant, db, lucli, swing. (Robert Muir) |
| |
| Changes in runtime behavior |
| |
| * LUCENE-3086: ItalianAnalyzer now uses ElisionFilter with a set of Italian |
| contractions by default. (Robert Muir) |
| |
| Bug Fixes |
| |
| * LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was |
| not lowercasing the key before checking for the tag (Adriano Crestani) |
| |
| * LUCENE-3026: SmartChineseAnalyzer's WordTokenFilter threw NullPointerException |
| on sentences longer than 32,767 characters. (wangzhenghang via Robert Muir) |
| |
| * LUCENE-2939: Highlighter should try and use maxDocCharsToAnalyze in |
| WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as |
| when using CachingTokenStream. This can be a significant performance bug for |
| large documents. (Mark Miller) |
| |
| * LUCENE-3043: GermanStemmer threw IndexOutOfBoundsException if it encountered |
| a zero-length token. (Robert Muir) |
| |
| * LUCENE-3044: ThaiWordFilter didn't reset its cached state correctly, this only |
| caused a problem if you consumed a tokenstream, then reused it, added different |
| attributes to it, and consumed it again. (Robert Muir, Uwe Schindler) |
| |
| * LUCENE-3113: Fixed some minor analysis bugs: double-reset() in ReusableAnalyzerBase |
| and ShingleAnalyzerWrapper, missing end() implementations in PrefixAwareTokenFilter |
| and PrefixAndSuffixAwareTokenFilter, invocations of incrementToken() after it |
| already returned false in CommonGramsQueryFilter, HyphenatedWordsFilter, |
| ShingleFilter, and SynonymsFilter. (Robert Muir, Steven Rowe, Uwe Schindler) |
| |
| New Features |
| |
| * LUCENE-3016: Add analyzer for Latvian. (Robert Muir) |
| |
| * LUCENE-1421: create new grouping contrib module, enabling search |
| results to be grouped by a single-valued indexed field. This |
| module was factored out of Solr's grouping implementation, but |
| it cannot group by function queries nor arbitrary queries. (Mike |
| McCandless) |
| |
| * LUCENE-3098: add AllGroupsCollector, to collect all unique groups |
| (but in unspecified order). (Martijn van Groningen via Mike |
| McCandless) |
| |
| * LUCENE-3092: Added NRTCachingDirectory in contrib/misc, which |
| caches small segments in RAM. This is useful, in the near-real-time |
| case where the indexing rate is lowish but the reopen rate is |
| highish, to take load off the IO system. (Mike McCandless) |
| |
| Optimizations |
| |
| * LUCENE-3040: Switch all analysis consumers (highlighter, morelikethis, memory, ...) |
| over to reusableTokenStream(). (Robert Muir) |
| |
| ======================= Lucene 3.1.0 ======================= |
| |
| Changes in backwards compatibility policy |
| |
| * LUCENE-2100: All Analyzers in Lucene-contrib have been marked as final. |
| Analyzers should be only act as a composition of TokenStreams, users should |
| compose their own analyzers instead of subclassing existing ones. |
| (Simon Willnauer) |
| |
| * LUCENE-2194, LUCENE-2201: Snowball APIs were upgraded to snowball revision |
| 502 (with some local modifications for improved performance). |
| Index backwards compatibility and binary backwards compatibility is |
| preserved, but some protected/public member variables changed type. This |
| does NOT affect java code/class files produced by the snowball compiler, |
| but technically is a backwards compatibility break. (Robert Muir) |
| |
| * LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. |
| Be sure to remove any old obselete lucene-snowball jar files from your |
| classpath! (Robert Muir) |
| |
| * LUCENE-2323: Moved contrib/wikipedia functionality into contrib/analyzers. |
| Additionally the package was changed from org.apache.lucene.wikipedia.analysis |
| to org.apache.lucene.analysis.wikipedia. (Robert Muir) |
| |
| * LUCENE-2581: Added new methods to FragmentsBuilder interface. These methods |
| are used to set pre/post tags and Encoder. (Koji Sekiguchi) |
| |
| * LUCENE-2391: Improved spellchecker (re)build time/ram usage by omitting |
| frequencies/positions/norms for single-valued fields, modifying the default |
| ramBufferMBSize to match IndexWriterConfig (16MB), making index optimization |
| an optional boolean parameter, and modifying the incremental update logic |
| to work well with unoptimized spellcheck indexes. The indexDictionary() methods |
| were made final to ensure a hard backwards break in case you were subclassing |
| Spellchecker. In general, subclassing Spellchecker is not recommended. (Robert Muir) |
| |
| Changes in runtime behavior |
| |
| * LUCENE-2117: SnowballAnalyzer uses TurkishLowerCaseFilter instead of |
| LowercaseFilter to correctly handle the unique Turkish casing behavior if |
| used with Version > 3.0 and the TurkishStemmer. |
| (Robert Muir via Simon Willnauer) |
| |
| * LUCENE-2055: GermanAnalyzer now uses the Snowball German2 algorithm and |
| stopwords list by default for Version > 3.0. |
| (Robert Muir, Uwe Schindler, Simon Willnauer) |
| |
| Bug fixes |
| |
| * LUCENE-2855: contrib queryparser was using CharSequence as key in some internal |
| Map instances, which was leading to incorrect behavior, since some CharSequence |
| implementors do not override hashcode and equals methods. Now the internal Maps |
| are using String instead. (Adriano Crestani) |
| |
| * LUCENE-2068: Fixed ReverseStringFilter which was not aware of supplementary |
| characters. During reverse the filter created unpaired surrogates, which |
| will be replaced by U+FFFD by the indexer, but not at query time. The filter |
| now reverses supplementary characters correctly if used with Version > 3.0. |
| (Simon Willnauer, Robert Muir) |
| |
| * LUCENE-2035: TokenSources.getTokenStream() does not assign positionIncrement. |
| (Christopher Morris via Mark Miller) |
| |
| * LUCENE-2055: Deprecated RussianTokenizer, RussianStemmer, RussianStemFilter, |
| FrenchStemmer, FrenchStemFilter, DutchStemmer, and DutchStemFilter. For |
| these Analyzers, SnowballFilter is used instead (for Version > 3.0), as |
| the previous code did not always implement the Snowball algorithm correctly. |
| Additionally, for Version > 3.0, the Snowball stopword lists are used by |
| default. (Robert Muir, Uwe Schindler, Simon Willnauer) |
| |
| * LUCENE-2184: Fixed bug with handling best fit value when the proper best fit value is |
| not an indexed field. Note, this change affects the APIs. (Grant Ingersoll) |
| |
| * LUCENE-2359: Fix bug in CartesianPolyFilterBuilder related to handling of behavior around |
| the 180th meridian (Grant Ingersoll) |
| |
| * LUCENE-2404: Fix bugs with position increment and empty tokens in ThaiWordFilter. |
| For matchVersion >= 3.1 the filter also no longer lowercases. ThaiAnalyzer |
| will use a separate LowerCaseFilter instead. (Uwe Schindler, Robert Muir) |
| |
| * LUCENE-2615: Fix DirectIOLinuxDirectory to not assign bogus |
| permissions to newly created files, and to not silently hardwire |
| buffer size to 1 MB. (Mark Miller, Robert Muir, Mike McCandless) |
| |
| * LUCENE-2629: Fix gennorm2 task for generating ICUFoldingFilter's .nrm file. This allows |
| you to customize its normalization/folding, by editing the source data files in src/data |
| and regenerating a new .nrm with 'ant gennorm2'. (David Bowen via Robert Muir) |
| |
| * LUCENE-2653: ThaiWordFilter depends on the JRE having a Thai dictionary, which is not |
| always the case. If the dictionary is unavailable, the filter will now throw |
| UnsupportedOperationException in the constructor. (Robert Muir) |
| |
| * LUCENE-589: Fix contrib/demo for international documents. |
| (Curtis d'Entremont via Robert Muir) |
| |
| * LUCENE-2246: Fix contrib/demo for Turkish html documents. |
| (Selim Nadi via Robert Muir) |
| |
| * LUCENE-590: Demo HTML parser gives incorrect summaries when title is repeated as a heading |
| (Curtis d'Entremont via Robert Muir) |
| |
| * LUCENE-591: The demo indexer now indexes meta keywords. |
| (Curtis d'Entremont via Robert Muir) |
| |
| * LUCENE-2874: Highlighting overlapping tokens outputted doubled words. |
| (Pierre Gossé via Robert Muir) |
| |
| * LUCENE-2943: Fix thread-safety issues with ICUCollationKeyFilter. |
| (Robert Muir) |
| |
| * LUCENE-3087: Highlighter: fix case that was preventing highlighting |
| of exact phrase when tokens overlap. (Pierre Gossé via Mike |
| McCandless) |
| |
| API Changes |
| |
| * LUCENE-2867: Some contrib queryparser methods that receives CharSequence as |
| identifier, such as QueryNode#unsetTag(CharSequence), were deprecated and |
| will be removed on version 4. (Adriano Crestani) |
| |
| * LUCENE-2147: Spatial GeoHashUtils now always decode GeoHash strings |
| with full precision. GeoHash#decode_exactly(String) was merged into |
| GeoHash#decode(String). (Chris Male, Simon Willnauer) |
| |
| * LUCENE-2204: Change some package private classes/members to publicly accessible to implement |
| custom FragmentsBuilders. (Koji Sekiguchi) |
| |
| * LUCENE-2055: Integrate snowball into contrib/analyzers. SnowballAnalyzer is |
| now deprecated in favor of language-specific analyzers which contain things |
| such as stopword lists and any language-specific processing in addition to |
| stemming. Add Turkish and Romanian stopwords lists to support this. |
| (Robert Muir, Uwe Schindler, Simon Willnauer) |
| |
| * LUCENE-2603: Add setMultiValuedSeparator(char) method to set an arbitrary |
| char that is used when concatenating multiValued data. Default is a space |
| (' '). It is applied on ANALYZED field only. (Koji Sekiguchi) |
| |
| * LUCENE-2626: FastVectorHighlighter: enable FragListBuilder and FragmentsBuilder |
| to be set per-field override. (Koji Sekiguchi) |
| |
| * LUCENE-2712: FieldBoostMapAttribute in contrib/queryparser was changed from |
| a Map<CharSequence,Float> to a Map<String,Float>. Per the CharSequence javadoc, |
| CharSequence is inappropriate as a map key. (Robert Muir) |
| |
| * LUCENE-1937: Add more methods to manipulate QueryNodeProcessorPipeline elements. |
| QueryNodeProcessorPipeline now implements the List interface, this is useful |
| if you want to extend or modify an existing pipeline. (Adriano Crestani via Robert Muir) |
| |
| * LUCENE-2754, LUCENE-2757: Deprecated SpanRegexQuery. Use |
| new SpanMultiTermQueryWrapper<RegexQuery>(new RegexQuery()) instead. |
| (Robert Muir, Uwe Schindler) |
| |
| * LUCENE-2747: Deprecated ArabicLetterTokenizer. StandardTokenizer now tokenizes |
| most languages correctly including Arabic. (Steven Rowe, Robert Muir) |
| |
| * LUCENE-2830: Use StringBuilder instead of StringBuffer across Benchmark, and |
| remove the StringBuffer HtmlParser.parse() variant. (Shai Erera) |
| |
| * LUCENE-2920: Deprecated ShingleMatrixFilter as it is unmaintained and does |
| not work with custom Attributes or custom payload encoders. (Uwe Schindler) |
| |
| New features |
| |
| * LUCENE-2500: Added DirectIOLinuxDirectory, a Linux-specific |
| Directory impl that uses the O_DIRECT flag to bypass the buffer |
| cache. This is useful to prevent segment merging from evicting |
| pages from the buffer cache, since fadvise/madvise do not seem. |
| (Michael McCandless) |
| |
| * LUCENE-2306: Add NumericRangeFilter and NumericRangeQuery support to XMLQueryParser. |
| (Jingkei Ly, via Mark Harwood) |
| |
| * LUCENE-2102: Add a Turkish LowerCase Filter. TurkishLowerCaseFilter handles |
| Turkish and Azeri unique casing behavior correctly. |
| (Ahmet Arslan, Robert Muir via Simon Willnauer) |
| |
| * LUCENE-2039: Add a extensible query parser to contrib/misc. |
| ExtendableQueryParser enables arbitrary parser extensions based on a |
| customizable field naming scheme. |
| (Simon Willnauer) |
| |
| * LUCENE-2067: Add a Czech light stemmer. CzechAnalyzer will now stem words |
| when Version is set to 3.1 or higher. (Robert Muir) |
| |
| * LUCENE-2062: Add a Bulgarian analyzer. (Robert Muir, Simon Willnauer) |
| |
| * LUCENE-2206: Add Snowball's stopword lists for Danish, Dutch, English, |
| Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish, |
| and Swedish. These can be loaded with WordListLoader.getSnowballWordSet. |
| (Robert Muir, Simon Willnauer) |
| |
| * LUCENE-2243: Add DisjunctionMaxQuery support for FastVectorHighlighter. |
| (Koji Sekiguchi) |
| |
| * LUCENE-2218: ShingleFilter supports minimum shingle size, and the separator |
| character is now configurable. Its also up to 20% faster. |
| (Steven Rowe via Robert Muir) |
| |
| * LUCENE-2234: Add a Hindi analyzer. (Robert Muir) |
| |
| * LUCENE-2055: Add analyzers/misc/StemmerOverrideFilter. This filter provides |
| the ability to override any stemmer with a custom dictionary map. |
| (Robert Muir, Uwe Schindler, Simon Willnauer) |
| |
| * LUCENE-2399: Add ICUNormalizer2Filter, which normalizes tokens with ICU's |
| Normalizer2. This allows for efficient combinations of normalization and custom |
| mappings in addition to standard normalization, and normalization combined |
| with unicode case folding. (Robert Muir) |
| |
| * LUCENE-1343: Add ICUFoldingFilter, a replacement for ASCIIFoldingFilter that |
| does a more thorough job of normalizing unicode text for search. |
| (Robert Haschart, Robert Muir) |
| |
| * LUCENE-2409: Add ICUTransformFilter, which transforms text in a context |
| sensitive way, either from ICU built-in rules (such as Traditional-Simplified), |
| or from rules you write yourself. (Robert Muir) |
| |
| * LUCENE-2414: Add ICUTokenizer, a tailorable tokenizer that implements Unicode |
| Text Segmentation. This tokenizer is useful for documents or collections with |
| multiple languages. The default configuration includes special support for |
| Thai, Lao, Myanmar, and Khmer. (Robert Muir, Uwe Schindler) |
| |
| * LUCENE-2298: Add analyzers/stempel, an algorithmic stemmer with support for |
| the Polish language. (Andrzej Bialecki via Robert Muir) |
| |
| * LUCENE-2400: ShingleFilter was changed to don't output all-filler shingles and |
| unigrams, and uses a more performant algorithm to build grams using a linked list |
| of AttributeSource.cloneAttributes() instances and the new copyTo() method. |
| (Steven Rowe via Uwe Schindler) |
| |
| * LUCENE-2437: Add an Analyzer for Indonesian. (Robert Muir) |
| |
| * LUCENE-2393: The HighFreqTerms tool (in misc) can now optionally |
| also include the total termFreq. (Tom Burton-West via Mike McCandless) |
| |
| * LUCENE-2463: Add a Greek inflectional stemmer. GreekAnalyzer will now stem words |
| when Version is set to 3.1 or higher. (Robert Muir) |
| |
| * LUCENE-1287: Allow usage of HyphenationCompoundWordTokenFilter without dictionary. |
| (Thomas Peuss via Robert Muir) |
| |
| * LUCENE-2464: FastVectorHighlighter: add SingleFragListBuilder to return |
| entire field contents. (Koji Sekiguchi) |
| |
| * LUCENE-2503: Added lighter stemming alternatives for European languages. |
| (Robert Muir) |
| |
| * LUCENE-2581: FastVectorHighlighter: add Encoder to FragmentsBuilder. |
| (Koji Sekiguchi) |
| |
| * LUCENE-2624: Add Analyzers for Armenian, Basque, and Catalan, from snowball. |
| (Robert Muir) |
| |
| * LUCENE-1938: PrecedenceQueryParser is now implemented with the flexible QP framework. |
| This means that you can also add this functionality to your own QP pipeline by using |
| BooleanModifiersQueryNodeProcessor, for example instead of GroupQueryNodeProcessor. |
| (Adriano Crestani via Robert Muir) |
| |
| * LUCENE-2791: Added WindowsDirectory, a Windows-specific Directory impl |
| that doesn't synchronize on the file handle. This can be useful to |
| avoid the performance problems of SimpleFSDirectory and NIOFSDirectory. |
| (Robert Muir, Simon Willnauer, Uwe Schindler, Michael McCandless) |
| |
| * LUCENE-2842: Add analyzer for Galician. Also adds the RSLP (Orengo) stemmer |
| for Portuguese. (Robert Muir) |
| |
| * SOLR-1057: Add PathHierarchyTokenizer that represents file path hierarchies as synonyms of |
| /something, /something/something, /something/something/else. (Ryan McKinley, Koji Sekiguchi) |
| |
| Build |
| |
| * LUCENE-2124: Moved the JDK-based collation support from contrib/collation |
| into core, and moved the ICU-based collation support into contrib/icu. |
| (Steven Rowe, Robert Muir) |
| |
| * LUCENE-2323: Moved contrib/regex into contrib/queries. Moved the |
| queryparsers under contrib/misc and contrib/surround into contrib/queryparser. |
| Moved contrib/fast-vector-highlighter into contrib/highlighter. |
| Moved ChainedFilter from contrib/misc to contrib/queries. contrib/spatial now |
| depends on contrib/queries instead of contrib/misc. (Robert Muir) |
| |
| * LUCENE-2333: Fix failures during contrib builds, when classes in |
| core were changed without ant clean. This fix also optimizes the |
| dependency management between contribs by a new ANT macro. |
| (Uwe Schindler, Shai Erera) |
| |
| * LUCENE-2797: Upgrade contrib/icu's ICU jar file to ICU 4.6 |
| (Robert Muir) |
| |
| * LUCENE-2833: Upgrade contrib/ant's jtidy jar file to r938 (Robert Muir) |
| |
| * LUCENE-2413: Moved the demo out of lucene core and into contrib/demo. |
| (Robert Muir) |
| |
| Optimizations |
| |
| * LUCENE-2157: DelimitedPayloadTokenFilter no longer copies the buffer |
| over itsself. Instead it sets only the length. This patch also optimizes |
| the logic of the filter and uses NIO for IdentityEncoder. (Uwe Schindler) |
| |
| * LUCENE-2084: Change IndexableBinaryStringTools to work on byte[] and char[] |
| directly, instead of Byte/CharBuffers, and modify ICUCollationKeyFilter to |
| take advantage of this for faster performance. |
| (Steven Rowe, Uwe Schindler, Robert Muir) |
| |
| * LUCENE-2194, LUCENE-2201, LUCENE-2288: Snowball stemmers in contrib/analyzers |
| have been optimized to work on char[] and remove unnecessary object creation. |
| (Shai Erera, Robert Muir) |
| |
| * LUCENE-2404: Improve performance of ThaiWordFilter by using a char[]-backed |
| CharacterIterator (currently from javax.swing). (Uwe Schindler, Robert Muir) |
| |
| Test Cases |
| |
| * LUCENE-2115: Cutover contrib tests to use Java5 generics. (Kay Kay |
| via Mike McCandless) |
| |
| Other |
| |
| * LUCENE-1845: Updated bdb-je jar from version 3.3.69 to 3.3.93. |
| (Simon Willnauer via Mike McCandless) |
| |
| * LUCENE-2415: Use reflection instead of a shim class to access Jakarta |
| Regex prefix. (Uwe Schindler) |
| |
| ================== Release 2.9.4 / 3.0.3 ==================== |
| |
| Bug Fixes |
| |
| * LUCENE-2277: QueryNodeImpl threw ConcurrentModificationException on |
| add(List<QueryNode>). (Frank Wesemann via Robert Muir) |
| |
| * LUCENE-2284: MatchAllDocsQueryNode toString() created an invalid XML tag. |
| (Frank Wesemann via Robert Muir) |
| |
| * LUCENE-2278: FastVectorHighlighter: Highlighted term is out of alignment |
| in multi-valued NOT_ANALYZED field. (Koji Sekiguchi) |
| |
| * LUCENE-2524: FastVectorHighlighter: use mod for getting colored tag. |
| (Koji Sekiguchi) |
| |
| * LUCENE-2616: FastVectorHighlighter: out of alignment when the first value is |
| empty in multiValued field (Koji Sekiguchi) |
| |
| * LUCENE-2731, LUCENE-2732: Fix (charset) problems in XML loading in |
| HyphenationCompoundWordTokenFilter (partial bugfix-only in 2.9 and 3.0, |
| full fix will be in later 3.1). |
| (Uwe Schinder) |
| |
| Documentation |
| |
| * LUCENE-2055: Add documentation noting that the Dutch and French stemmers |
| in contrib/analyzers do not implement the Snowball algorithm correctly, |
| and recommend to use the equivalents in contrib/snowball if possible. |
| (Robert Muir, Uwe Schindler, Simon Willnauer) |
| |
| * LUCENE-2653: Add documentation noting that ThaiWordFilter will not work |
| as expected on all JRE's. For example, on an IBM JRE, it does nothing. |
| (Robert Muir) |
| |
| ================== Release 2.9.3 / 3.0.2 ==================== |
| |
| No changes. |
| |
| ================== Release 2.9.2 / 3.0.1 ==================== |
| |
| New features |
| |
| * LUCENE-2108: Spellchecker now safely supports concurrent modifications to |
| the spell-index. Threads can safely obtain term suggestions while the spell- |
| index is rebuild, cleared or reset. Internal IndexSearcher instances remain |
| open until the last thread accessing them releases the reference. |
| (Simon Willnauer) |
| |
| Bug Fixes |
| |
| * LUCENE-2144: Fix InstantiatedIndex to handle termDocs(null) |
| correctly (enumerate all non-deleted docs). (Karl Wettin via Mike |
| McCandless) |
| |
| * LUCENE-2199: ShingleFilter skipped over tri-gram shingles if outputUnigram |
| was set to false. (Simon Willnauer) |
| |
| * LUCENE-2211: Fix missing clearAttributes() calls: |
| ShingleMatrix, PrefixAware, compounds, NGramTokenFilter, |
| EdgeNGramTokenFilter, Highlighter, and MemoryIndex. |
| (Uwe Schindler, Robert Muir) |
| |
| * LUCENE-2207, LUCENE-2219: Fix incorrect offset calculations in end() for |
| CJKTokenizer, ChineseTokenizer, SmartChinese SentenceTokenizer, |
| and WikipediaTokenizer. (Koji Sekiguchi, Robert Muir) |
| |
| * LUCENE-2266: Fixed offset calculations in NGramTokenFilter and |
| EdgeNGramTokenFilter. (Joe Calderon, Robert Muir via Uwe Schindler) |
| |
| API Changes |
| |
| * LUCENE-2108: Add SpellChecker.close, to close the underlying |
| reader. (Eirik Bjørsnøs via Mike McCandless) |
| |
| * LUCENE-2165: Add a constructor to SnowballAnalyzer that takes a Set of |
| stopwords, and deprecate the String[] one. (Nick Burch via Robert Muir) |
| |
| ======================= Release 3.0.0 ======================= |
| |
| Changes in backwards compatibility policy |
| |
| * LUCENE-1257: Change some occurences of StringBuffer in public/protected |
| APIs of contrib/surround to StringBuilder. |
| (Paul Elschot via Uwe Schindler) |
| |
| Changes in runtime behavior |
| |
| * LUCENE-1966: Modified and cleaned the default Arabic stopwords list used |
| by ArabicAnalyzer. You'll need to fully re-index any previously created |
| indexes. (Basem Narmok via Robert Muir) |
| |
| API Changes |
| |
| * LUCENE-1936: Deprecated RussianLowerCaseFilter, because it transforms |
| text exactly the same as LowerCaseFilter. Please use LowerCaseFilter |
| instead, which has the same functionality. (Robert Muir) |
| |
| * LUCENE-2051: Contrib Analyzer setters were deprecated and replaced |
| with ctor arguments / Version number. Also stop word lists |
| were unified. (Simon Willnauer) |
| |
| Bug fixes |
| |
| * LUCENE-1781: Fixed various issues with the lat/lng bounding box |
| distance filter created for radius search in contrib/spatial. |
| (Bill Bell via Mike McCandless) |
| |
| * LUCENE-1939: IndexOutOfBoundsException at ShingleMatrixFilter's |
| Iterator#hasNext method on exhausted streams. |
| (Patrick Jungermann via Karl Wettin) |
| |
| * LUCENE-1359: French analyzer did not support null field names. |
| (Andrew Lynch via Robert Muir) |
| |
| New features |
| |
| * LUCENE-1924: Added BalancedSegmentMergePolicy to contrib/misc, |
| which is a merge policy that tries to avoid doing very large |
| segment merges to give better search performance in a mixed |
| indexing/searching environment. (John Wang via Mike McCandless) |
| |
| * LUCENE-1959: Add index splitting tools. The IndexSplitter tool works |
| on multi-segment (non optimized) indexes and it can copy specific |
| segments out of the index into a new index. It can also list the |
| segments in the index, and delete specified segments. (Jason Rutherglen via |
| Mike McCandless). MultiPassIndexSplitter can split any index into |
| any number of output parts, at the cost of doing multiple passes over |
| the input index. (Andrzej Bialecki) |
| |
| * LUCENE-1993: Add maxDocFreq setting to MoreLikeThis, to exclude |
| from consideration terms that match more than the specified number |
| of documents. (Christian Steinert via Mike McCandless) |
| |
| Optimizations |
| |
| * LUCENE-1965, LUCENE-1962: Arabic-, Persian- and SmartChineseAnalyzer |
| loads default stopwords only once if accessed for the first time. |
| Previous versions were loading the stopword files each time a new |
| instance was created. This might improve performance for applications |
| creating lots of instances of these Analyzers. (Simon Willnauer) |
| |
| Documentation |
| |
| * LUCENE-1916: Translated documentation in the smartcn hhmm package. |
| (Patricia Peng via Robert Muir) |
| |
| Build |
| |
| * LUCENE-1904: Moved wordnet-based synonym support from contrib/memory |
| into contrib/wordnet. (Robert Muir) |
| |
| * LUCENE-2031: Moved PatternAnalyzer from contrib/memory into |
| contrib/analyzers/common, under miscellaneous. (Robert Muir) |
| |
| ======================= Release 2.9.1 ======================= |
| |
| Changes in backwards compatibility policy |
| |
| * LUCENE-2002: Add required Version matchVersion argument when |
| constructing ComplexPhraseQueryParser and default (as of 2.9) |
| enablePositionIncrements to true to match StandardAnalyzer's |
| default. Also added required matchVersion to most of the analyzers |
| (Uwe Schindler, Mike McCandless) |
| |
| Changes in runtime behavior |
| |
| * LUCENE-1963: ArabicAnalyzer now lowercases before checking the stopword |
| list. This has no effect on Arabic text, but if you are using a custom |
| stopword list that contains some non-Arabic words, you'll need to fully |
| reindex. (DM Smith via Robert Muir) |
| |
| Bug fixes |
| |
| * LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause |
| StringIndexOutOfBoundsException. (Koji Sekiguchi) |
| |
| * LUCENE-1929: Highlighter throws exception on NumericRangeQuery and does not |
| support deprecated RangeQuery. (Mark Miller) |
| |
| * LUCENE-2001: Wordnet Syns2Index incorrectly parses synonyms that |
| contain a single quote. (Parag H. Dave via Robert Muir) |
| |
| * LUCENE-2003: Highlighter doesn't respect position increments other than 1 with |
| PhraseQuerys. (Uwe Schindler, Mark Miller) |
| |
| * LUCENE-1954: InstantiatedIndexWriter: Fixed ClassCastException with |
| NumericField because of incorrect unchecked cast: Document.getFields() |
| returns List<Fieldable>. (Bernd Fondermann via Uwe Schindler) |
| |
| * LUCENE-2014: SmartChineseAnalyzer did not properly clear attributes |
| in WordTokenFilter. If enablePositionIncrements is set for StopFilter, |
| then this could create invalid position increments, causing IndexWriter |
| to crash. (Robert Muir, Uwe Schindler) |
| |
| * LUCENE-2013: SpanRegexQuery does not work with QueryScorer. |
| (Benjamin Keil via Mark Miller) |
| |
| ======================= Release 2.9.0 ======================= |
| |
| Changes in runtime behavior |
| |
| * LUCENE-1505: Local lucene now uses org.apache.lucene.util.NumericUtils for all |
| number conversion. You'll need to fully re-index any previously created indexes. |
| This isn't a break in back-compatibility because local Lucene has not yet |
| been released. (Mike McCandless) |
| |
| * LUCENE-1758: ArabicAnalyzer now uses the light10 algorithm, has a refined |
| default stopword list, and lowercases non-Arabic text. |
| You'll need to fully re-index any previously created indexes. This isn't a |
| break in back-compatibility because ArabicAnalyzer has not yet been |
| released. (Robert Muir) |
| |
| |
| API Changes |
| |
| * LUCENE-1695: Update the Highlighter to use the new TokenStream API. This issue breaks backwards |
| compatibility with some public classes. If you have implemented custom Fragmenters or Scorers, |
| you will need to adjust them to work with the new TokenStream API. Rather than getting passed a |
| Token at a time, you will be given a TokenStream to init your impl with - store the Attributes |
| you are interested in locally and access them on each call to the method that used to pass a new |
| Token. Look at the included updated impls for examples. (Mark Miller) |
| |
| * LUCENE-1460: Change contrib TokenStreams/Filters to use the new |
| TokenStream API. (Robert Muir, Michael Busch) |
| |
| * LUCENE-1775, LUCENE-1903: Change remaining TokenFilters (shingle, prefix-suffix) |
| to use the new TokenStream API. ShingleFilter is much more efficient now, |
| it clones much less often and computes the tokens mostly on the fly now. |
| Also added more tests. (Robert Muir, Michael Busch, Uwe Schindler, Chris Harris) |
| |
| * LUCENE-1685: The position aware SpanScorer has become the default scorer |
| for Highlighting. The SpanScorer implementation has replaced QueryScorer |
| and the old term highlighting QueryScorer has been renamed to |
| QueryTermScorer. Multi-term queries are also now expanded by default. If |
| you were previously rewriting the query for multi-term query highlighting, |
| you should no longer do that (unless you switch to using QueryTermScorer). |
| The SpanScorer API (now QueryScorer) has also been improved to more closely |
| match the API of the previous QueryScorer implementation. (Mark Miller) |
| |
| * LUCENE-1793: Deprecate the custom encoding support in the Greek and Russian |
| Analyzers. If you need to index text in these encodings, please use Java's |
| character set conversion facilities (InputStreamReader, etc) during I/O, |
| so that Lucene can analyze this text as Unicode instead. (Robert Muir) |
| |
| Bug fixes |
| |
| * LUCENE-1423: InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBounds on empty index. |
| (Karl Wettin) |
| |
| * LUCENE-1462: InstantiatedIndexWriter did not reset pre analyzed TokenStreams the |
| same way IndexWriter does. Parts of InstantiatedIndex was not Serializable. |
| (Karl Wettin) |
| |
| * LUCENE-1510: InstantiatedIndexReader#norms methods throws NullPointerException on empty index. |
| (Karl Wettin, Robert Newson) |
| |
| * LUCENE-1514: ShingleMatrixFilter#next(Token) easily throws a StackOverflowException |
| due to recursive invocation. (Karl Wettin) |
| |
| * LUCENE-1548: Fix distance normalization in LevenshteinDistance to |
| not produce negative distances (Thomas Morton via Mike McCandless) |
| |
| * LUCENE-1490: Fix latin1 conversion of HALFWIDTH_AND_FULLWIDTH_FORMS |
| characters to only apply to the correct subset (Daniel Cheng via |
| Mike McCandless) |
| |
| * LUCENE-1576: Fix BrazilianAnalyzer to downcase tokens after |
| StandardTokenizer so that stop words with mixed case are filtered |
| out. (Rafael Cunha de Almeida, Douglas Campos via Mike McCandless) |
| |
| * LUCENE-1491: EdgeNGramTokenFilter no longer stops on tokens shorter than minimum n-gram size. |
| (Todd Teak via Otis Gospodnetic) |
| |
| * LUCENE-1683: Fixed JavaUtilRegexCapabilities (an impl used by |
| RegexQuery) to use Matcher.matches() not Matcher.lookingAt() so |
| that the regexp must match the entire string, not just a prefix. |
| (Trejkaz via Mike McCandless) |
| |
| * LUCENE-1792: Fix new query parser to set rewrite method for |
| multi-term queries. (Luis Alves, Mike McCandless via Michael Busch) |
| |
| * LUCENE-1828: Fix memory index to call TokenStream.reset() and |
| TokenStream.end(). (Tim Smith via Michael Busch) |
| |
| * LUCENE-1912: Fix fast-vector-highlighter issue when two or more |
| terms are concatenated (Koji Sekiguchi via Mike McCandless) |
| |
| New features |
| |
| * LUCENE-1531: Added support for BoostingTermQuery to XML query parser. (Karl Wettin) |
| |
| * LUCENE-1435: Added contrib/collation, a CollationKeyFilter |
| allowing you to convert tokens into CollationKeys encoded using |
| IndexableBinaryStringTools. This allows for faster RangeQuery when |
| a field needs to use a custom Collator. (Steven Rowe via Mike |
| McCandless) |
| |
| * LUCENE-1591: EnWikiDocMaker, LineDocMaker, WriteLineDoc can now |
| read/write bz2 using Apache commons compress library. This means |
| you can download the .bz2 export from http://wikipedia.org and |
| immediately index it. (Shai Erera via Mike McCandless) |
| |
| * LUCENE-1629: Add SmartChineseAnalyzer to contrib/analyzers. It |
| improves on CJKAnalyzer and ChineseAnalyzer by handling Chinese |
| sentences properly. SmartChineseAnalyzer uses a Hidden Markov |
| Model to tokenize Chinese words in a more intelligent way. |
| (Xiaoping Gao via Mike McCandless) |
| |
| * LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll) |
| |
| * LUCENE-1578: Support for loading unoptimized readers to the |
| constructor of InstantiatedIndex. (Karl Wettin) |
| |
| * LUCENE-1704: Allow specifying the Tidy configuration file when |
| parsing HTML docs with contrib/ant. (Keith Sprochi via Mike |
| McCandless) |
| |
| * LUCENE-1522: Added contrib/fast-vector-highlighter, a new alternative |
| highlighter. (Koji Sekiguchi via Mike McCandless) |
| |
| * LUCENE-1740: Added "analyzer" command to Lucli, enabling changing |
| the analyzer from the default StandardAnalyzer. (Bernd Fondermann |
| via Mike McCandless) |
| |
| * LUCENE-1272: Add get/setBoost to MoreLikeThis. (Jonathan |
| Leibiusky via Mike McCandless) |
| |
| * LUCENE-1745: Added constructors to JakartaRegexpCapabilities and |
| JavaUtilRegexCapabilities as well as static flags to support |
| configuring a RegexCapabilities implementation with the |
| implementation-specific modifier flags. Allows for callers to |
| customize the RegexQuery using the implementation-specific options |
| and fine tune how regular expressions are compiled and |
| matched. (Marc Zampetti zampettim@aim.com via Mike McCandless) |
| |
| * LUCENE-1567: Added a new QueryParser framework, that allows |
| implementing a new query syntax in a flexible and efficient way. |
| This new QueryParser will be moved to Lucene's core in release |
| 3.0 and will then replace the current core QueryParser, which |
| has been deprecated with this patch. |
| (Luis Alves and Adriano Campos via Michael Busch) |
| |
| * LUCENE-1486: Added ComplexPhraseQueryParser, an extension of QueryParser |
| that allows a subset of the Lucene query language to be embedded in |
| PhraseQuerys. Wildcard, Range, and Fuzzy queries, as well as limited |
| boolean logic, can be used within quote operators with this parser, ie: |
| "(jo* -john) smyth~". (Mark Harwood via Mark Miller) |
| |
| * Added web-based demo of functionality in contrib's XML Query Parser |
| packaged as War file (Mark Harwood) |
| |
| * LUCENE-1406: Added Arabic analyzer. (Robert Muir via Grant Ingersoll) |
| |
| * LUCENE-1628: Added Persian analyzer. (Robert Muir) |
| |
| * LUCENE-1813: Add option to ReverseStringFilter to mark reversed tokens. |
| (Andrzej Bialecki via Robert Muir) |
| |
| Optimizations |
| |
| * LUCENE-1643: Re-use the collation key (RawCollationKey) for |
| better performance, in ICUCollationKeyFilter. (Robert Muir via |
| Mike McCandless) |
| |
| * LUCENE-1794: Implement TokenStream reuse for contrib Analyzers, |
| and implement reset() for TokenStreams to support reuse. (Robert Muir) |
| |
| Documentation |
| |
| * LUCENE-1876: added missing package level documentation for numerous |
| contrib packages. |
| (Steven Rowe & Robert Muir) |
| |
| Build |
| |
| * LUCENE-1728: Split contrib/analyzers into common and smartcn modules. |
| Contrib/analyzers now builds an additional lucene-smartcn Jar file. All |
| smartcn classes are not included in the lucene-analyzers JAR file. |
| (Robert Muir via Simon Willnauer) |
| |
| * LUCENE-1829: Fix contrib query parser to properly create javacc files. |
| (Jan-Pascal and Luis Alves via Michael Busch) |
| |
| Test Cases |
| |
| |
| ======================= Release 2.4.0 ======================= |
| |
| Changes in runtime behavior |
| |
| (None) |
| |
| API Changes |
| |
| 1. |
| |
| (None) |
| |
| Bug fixes |
| |
| 1. LUCENE-1312: Added full support for InstantiatedIndexReader#getFieldNames() |
| and tests that assert that deleted documents behaves as they should (they did). |
| (Jason Rutherglen, Karl Wettin) |
| |
| 2. LUCENE-1318: InstantiatedIndexReader.norms(String, b[], int) didn't treat |
| the array offset right. (Jason Rutherglen via Karl Wettin) |
| |
| New features |
| |
| 1. LUCENE-1320: ShingleMatrixFilter, multidimensional shingle token filter. (Karl Wettin) |
| |
| 2. LUCENE-1142: Updated Snowball package, org.tartarus distribution revision 500. |
| Introducing Hungarian, Turkish and Romanian support, updated older stemmers |
| and optimized (reflectionless) SnowballFilter. |
| IMPORTANT NOTICE ON BACKWARDS COMPATIBILITY: an index created using the 2.3.2 (or older) |
| might not be compatible with these updated classes as some algorithms have changed. |
| (Karl Wettin) |
| |
| 3. LUCENE-1016: TermVectorAccessor, transparent vector space access via stored vectors |
| or by resolving the inverted index. (Karl Wettin) |
| |
| Documentation |
| |
| (None) |
| |
| Build |
| |
| (None) |
| |
| Test Cases |
| |
| (None) |