Lucene Change Log | |
$Id: CHANGES.txt 355181 2005-12-08 19:53:06Z cutting $ | |
1.9 RC1 | |
Note that this realease is mostly but not 100% source compatible with the | |
latest release of Lucene (1.4.3). In other words, you should make sure | |
your application compiles with this version of Lucene before you replace | |
the old Lucene JAR with the new one. | |
Requirements | |
1. To compile and use Lucene you now need Java 1.4 or later. | |
Changes in runtime behavior | |
1. FuzzyQuery can no longer throw a TooManyClauses exception. If a | |
FuzzyQuery expands to more than BooleanQuery.maxClauseCount | |
terms only the BooleanQuery.maxClauseCount most similar terms | |
go into the rewritten query and thus the exception is avoided. | |
(Christoph) | |
2. Changed system property from "org.apache.lucene.lockdir" to | |
"org.apache.lucene.lockDir", so that its casing follows the existing | |
pattern used in other Lucene system properties. (Bernhard) | |
3. The terms of RangeQueries and FuzzyQueries are now converted to | |
lowercase by default (as it has been the case for PrefixQueries | |
and WildcardQueries before). Use setLowercaseExpandedTerms(false) | |
to disable that behavior but note that this also affects | |
PrefixQueries and WildcardQueries. (Daniel Naber) | |
4. Document frequency that is computed when MultiSearcher is used is now | |
computed correctly and "globally" across subsearchers and indices, while | |
before it used to be computed locally to each index, which caused | |
ranking across multiple indices not to be equivalent. | |
(Chuck Williams, Wolf Siberski via Otis, bug #31841) | |
5. When opening an IndexWriter with create=true, Lucene now only deletes | |
its own files from the index directory (looking at the file name suffixes | |
to decide if a file belongs to Lucene). The old behavior was to delete | |
all files. (Daniel Naber and Bernhard Messer, bug #34695) | |
6. The version of an IndexReader, as returned by getCurrentVersion() | |
and getVersion() doesn't start at 0 anymore for new indexes. Instead, it | |
is now initialized by the system time in milliseconds. | |
(Bernhard Messer via Daniel Naber) | |
7. Several default values cannot be set via system properties anymore, as | |
this has been considered inappropriate for a library like Lucene. For | |
most properties there are set/get methods available in IndexWriter which | |
you should use instead. This affects the following properties: | |
See IndexWriter for getter/setter methods: | |
org.apache.lucene.writeLockTimeout, org.apache.lucene.commitLockTimeout, | |
org.apache.lucene.minMergeDocs, org.apache.lucene.maxMergeDocs, | |
org.apache.lucene.maxFieldLength, org.apache.lucene.termIndexInterval, | |
org.apache.lucene.mergeFactor, | |
See BooleanQuery for getter/setter methods: | |
org.apache.lucene.maxClauseCount | |
See FSDirectory for getter/setter methods: | |
disableLuceneLocks | |
(Daniel Naber) | |
8. Fixed FieldCacheImpl to use user-provided IntParser and FloatParser, | |
instead of using Integer and Float classes for parsing. | |
(Yonik Seeley via Otis Gospodnetic) | |
9. Expert level search routines returning TopDocs and TopFieldDocs | |
no longer normalize scores. This also fixes bugs related to | |
MultiSearchers and score sorting/normalization. | |
(Luc Vanlerberghe via Yonik Seeley, LUCENE-469) | |
New features | |
1. Added support for stored compressed fields (patch #31149) | |
(Bernhard Messer via Christoph) | |
2. Added support for binary stored fields (patch #29370) | |
(Drew Farris and Bernhard Messer via Christoph) | |
3. Added support for position and offset information in term vectors | |
(patch #18927). (Grant Ingersoll & Christoph) | |
4. A new class DateTools has been added. It allows you to format dates | |
in a readable format adequate for indexing. Unlike the existing | |
DateField class DateTools can cope with dates before 1970 and it | |
forces you to specify the desired date resolution (e.g. month, day, | |
second, ...) which can make RangeQuerys on those fields more efficient. | |
(Daniel Naber) | |
5. QueryParser now correctly works with Analyzers that can return more | |
than one token per position. For example, a query "+fast +car" | |
would be parsed as "+fast +(car automobile)" if the Analyzer | |
returns "car" and "automobile" at the same position whenever it | |
finds "car" (Patch #23307). | |
(Pierrick Brihaye, Daniel Naber) | |
6. Permit unbuffered Directory implementations (e.g., using mmap). | |
InputStream is replaced by the new classes IndexInput and | |
BufferedIndexInput. OutputStream is replaced by the new classes | |
IndexOutput and BufferedIndexOutput. InputStream and OutputStream | |
are now deprecated and FSDirectory is now subclassable. (cutting) | |
7. Add native Directory and TermDocs implementations that work under | |
GCJ. These require GCC 3.4.0 or later and have only been tested | |
on Linux. Use 'ant gcj' to build demo applications. (cutting) | |
8. Add MMapDirectory, which uses nio to mmap input files. This is | |
still somewhat slower than FSDirectory. However it uses less | |
memory per query term, since a new buffer is not allocated per | |
term, which may help applications which use, e.g., wildcard | |
queries. It may also someday be faster. (cutting & Paul Elschot) | |
9. Added javadocs-internal to build.xml - bug #30360 | |
(Paul Elschot via Otis) | |
10. Added RangeFilter, a more generically useful filter than DateFilter. | |
(Chris M Hostetter via Erik) | |
11. Added NumberTools, a utility class indexing numeric fields. | |
(adapted from code contributed by Matt Quail; committed by Erik) | |
12. Added public static IndexReader.main(String[] args) method. | |
IndexReader can now be used directly at command line level | |
to list and optionally extract the individual files from an existing | |
compound index file. | |
(adapted from code contributed by Garrett Rooney; committed by Bernhard) | |
13. Add IndexWriter.setTermIndexInterval() method. See javadocs. | |
(Doug Cutting) | |
14. Added LucenePackage, whose static get() method returns java.util.Package, | |
which lets the caller get the Lucene version information specified in | |
the Lucene Jar. | |
(Doug Cutting via Otis) | |
15. Added Hits.iterator() method and corresponding HitIterator and Hit objects. | |
This provides standard java.util.Iterator iteration over Hits. | |
Each call to the iterator's next() method returns a Hit object. | |
(Jeremy Rayner via Erik) | |
16. Add ParallelReader, an IndexReader that combines separate indexes | |
over different fields into a single virtual index. (Doug Cutting) | |
17. Add IntParser and FloatParser interfaces to FieldCache, so that | |
fields in arbitrarily formats can be cached as ints and floats. | |
(Doug Cutting) | |
18. Added class org.apache.lucene.index.IndexModifier which combines | |
IndexWriter and IndexReader, so you can add and delete documents without | |
worrying about synchronisation/locking issues. | |
(Daniel Naber) | |
19. Lucene can now be used inside an unsigned applet, as Lucene's access | |
to system properties will not cause a SecurityException anymore. | |
(Jon Schuster via Daniel Naber, bug #34359) | |
20. Added a new class MatchAllDocsQuery that matches all documents. | |
(John Wang via Daniel Naber, bug #34946) | |
21. Added ability to omit norms on a per field basis to decrease | |
index size and memory consumption when there are many indexed fields. | |
See Field.setOmitNorms() | |
(Yonik Seeley, LUCENE-448) | |
22. Added NullFragmenter to contrib/highlighter, which is useful for | |
highlighting entire documents or fields. | |
(Erik Hatcher) | |
23. Added regular expression queries, RegexQuery and SpanRegexQuery. | |
Note the same term enumeration caveats apply with these queries as | |
apply to WildcardQuery and other term expanding queries. | |
These two new queries are not currently supported via QueryParser. | |
(Erik Hatcher) | |
24. Added ConstantScoreQuery which wraps a filter and produces a score | |
equal to the query boost for every matching document. | |
(Yonik Seeley, LUCENE-383) | |
25. Added ConstantScoreRangeQuery which produces a constant score for | |
every document in the range. One advantage over a normal RangeQuery | |
is that it doesn't expand to a BooleanQuery and thus doesn't have a maximum | |
number of terms the range can cover. Both endpoints may also be open. | |
(Yonik Seeley, LUCENE-383) | |
26. Added ability to specify a minimum number of optional clauses that | |
must match in a BooleanQuery. See BooleanQuery.setMinimumNumberShouldMatch(). | |
(Paul Elschot, Chris Hostetter via Yonik Seeley, LUCENE-395) | |
27. Added DisjunctionMaxQuery which provides the maximum score across it's clauses. | |
It's very useful for searching across multiple fields. | |
(Chuck Williams via Yonik Seeley, LUCENE-323) | |
28. New class ISOLatin1AccentFilter that replaces accented characters in the ISO | |
Latin 1 character set by their unaccented equivalent. | |
(Sven Duzont via Erik Hatcher) | |
29. New class KeywordAnalyzer. "Tokenizes" the entire stream as a single token. | |
This is useful for data like zip codes, ids, and some product names. | |
(Erik Hatcher) | |
30. Copied LengthFilter from contrib area to core. Removes words that are too | |
long and too short from the stream. | |
(David Spencer via Otis and Daniel) | |
31. Added getPositionIncrementGap(String fieldName) to Analyzer. This allows | |
custom analyzers to put gaps between Field instances with the same field | |
name, preventing phrase or span queries crossing these boundaries. The | |
default implementation issues a gap of 0, allowing the default token | |
position increment of 1 to put the next field's first token into a | |
successive position. | |
(Erik Hatcher, with advice from Yonik) | |
32. StopFilter can now ignore case when checking for stop words. | |
(Grant Ingersoll via Yonik, LUCENE-248) | |
API Changes | |
1. Several methods and fields have been deprecated. The API documentation | |
contains information about the recommended replacements. It is planned | |
that most of the deprecated methods and fields will be removed in | |
Lucene 2.0. (Daniel Naber) | |
2. The Russian and the German analyzers have been moved to contrib/analyzers. | |
Also, the WordlistLoader class has been moved one level up in the | |
hierarchy and is now org.apache.lucene.analysis.WordlistLoader | |
(Daniel Naber) | |
3. The API contained methods that declared to throw an IOException | |
but that never did this. These declarations have been removed. If | |
your code tries to catch these exceptions you might need to remove | |
those catch clauses to avoid compile errors. (Daniel Naber) | |
4. Add a serializable Parameter Class to standardize parameter enum | |
classes in BooleanClause and Field. (Christoph) | |
5. Added rewrite methods to all SpanQuery subclasses that nest other SpanQuerys. | |
This allows custom SpanQuery subclasses that rewrite (for term expansion, for | |
example) to nest within the built-in SpanQuery classes successfully. | |
Bug fixes | |
1. The JSP demo page (src/jsp/results.jsp) now properly closes the | |
IndexSearcher it opens. (Daniel Naber) | |
2. Fixed a bug in IndexWriter.addIndexes(IndexReader[] readers) that | |
prevented deletion of obsolete segments. (Christoph Goller) | |
3. Fix in FieldInfos to avoid the return of an extra blank field in | |
IndexReader.getFieldNames() (Patch #19058). (Mark Harwood via Bernhard) | |
4. Some combinations of BooleanQuery and MultiPhraseQuery (formerly | |
PhrasePrefixQuery) could provoke UnsupportedOperationException | |
(bug #33161). (Rhett Sutphin via Daniel Naber) | |
5. Small bug in skipTo of ConjunctionScorer that caused NullPointerException | |
if skipTo() was called without prior call to next() fixed. (Christoph) | |
6. Disable Similiarty.coord() in the scoring of most automatically | |
generated boolean queries. The coord() score factor is | |
appropriate when clauses are independently specified by a user, | |
but is usually not appropriate when clauses are generated | |
automatically, e.g., by a fuzzy, wildcard or range query. Matches | |
on such automatically generated queries are no longer penalized | |
for not matching all terms. (Doug Cutting, Patch #33472) | |
7. Getting a lock file with Lock.obtain(long) was supposed to wait for | |
a given amount of milliseconds, but this didn't work. | |
(John Wang via Daniel Naber, Bug #33799) | |
8. Fix FSDirectory.createOutput() to always create new files. | |
Previously, existing files were overwritten, and an index could be | |
corrupted when the old version of a file was longer than the new. | |
Now any existing file is first removed. (Doug Cutting) | |
9. Fix BooleanQuery containing nested SpanTermQuery's, which previously | |
could return an incorrect number of hits. | |
(Reece Wilton via Erik Hatcher, Bug #35157) | |
10. Fix NullPointerException that could occur with a MultiPhraseQuery | |
inside a BooleanQuery. | |
(Hans Hjelm and Scotty Allen via Daniel Naber, Bug #35626) | |
11. Fixed SnowballFilter to pass through the position increment from | |
the original token. | |
(Yonik Seeley via Erik Hatcher, LUCENE-437) | |
12. Added Unicode range of Korean characters to StandardTokenizer, | |
grouping contiguous characters into a token rather than one token | |
per character. This change also changes the token type to "<CJ>" | |
for Chinese and Japanese character tokens (previously it was "<CJK>"). | |
(Cheolgoo Kang via Otis and Erik, LUCENE-444 and LUCENE-461) | |
13. FieldsReader now looks at FieldInfo.storeOffsetWithTermVector and | |
FieldInfo.storePositionWithTermVector and creates the Field with | |
correct TermVector parameter. | |
(Frank Steinmann via Bernhard, LUCENE-455) | |
14. Fixed WildcardQuery to prevent "cat" matching "ca??". | |
(Xiaozheng Ma via Bernhard, LUCENE-306) | |
15. Fixed a bug where MultiSearcher and ParallelMultiSearcher could | |
change the sort order when sorting by string for documents without | |
a value for the sort field. | |
(Luc Vanlerberghe via Yonik, LUCENE-453) | |
16. Fixed a sorting problem with MultiSearchers that can lead to | |
missing or duplicate docs due to equal docs sorting in an arbitrary order. | |
(Yonik Seeley, LUCENE-456) | |
17. A single hit using the expert level sorted search methods | |
resulted in the score not being normalized. | |
(Yonik Seeley, LUCENE-462) | |
18. Fixed inefficient memory usage when loading an index into RAMDirectory. | |
(Volodymyr Bychkoviak via Bernhard, LUCENE-475) | |
19. Corrected term offsets returned by ChineseTokenizer. | |
(Ray Tsang via Erik Hatcher, LUCENE-324) | |
20. Fixed MultiReader.undeleteAll() to correctly update numDocs. | |
(Robert Kirchgessner via Doug Cutting, LUCENE-479) | |
Optimizations | |
1. Disk usage (peak requirements during indexing and optimization) | |
in case of compound file format has been improved. | |
(Bernhard, Dmitry, and Christoph) | |
2. Optimize the performance of certain uses of BooleanScorer, | |
TermScorer and IndexSearcher. In particular, a BooleanQuery | |
composed of TermQuery, with not all terms required, that returns a | |
TopDocs (e.g., through a Hits with no Sort specified) runs much | |
faster. (cutting) | |
3. Removed synchronization from reading of term vectors with an | |
IndexReader (Patch #30736). (Bernhard Messer via Christoph) | |
4. Optimize term-dictionary lookup to allocate far fewer terms when | |
scanning for the matching term. This speeds searches involving | |
low-frequency terms, where the cost of dictionary lookup can be | |
significant. (cutting) | |
5. Optimize fuzzy queries so the standard fuzzy queries with a prefix | |
of 0 now run 20-50% faster (Patch #31882). | |
(Jonathan Hager via Daniel Naber) | |
6. A Version of BooleanScorer (BooleanScorer2) added that delivers | |
documents in increasing order and implements skipTo. For queries | |
with required or forbidden clauses it may be faster than the old | |
BooleanScorer, for BooleanQueries consisting only of optional | |
clauses it is probably slower. The new BooleanScorer is now the | |
default. (Patch 31785 by Paul Elschot via Christoph) | |
7. Use uncached access to norms when merging to reduce RAM usage. | |
(Bug #32847). (Doug Cutting) | |
8. Don't read term index when random-access is not required. This | |
reduces time to open IndexReaders and they use less memory when | |
random access is not required, e.g., when merging segments. The | |
term index is now read into memory lazily at the first | |
random-access. (Doug Cutting) | |
9. Optimize IndexWriter.addIndexes(Directory[]) when the number of | |
added indexes is larger than mergeFactor. Previously this could | |
result in quadratic performance. Now performance is n log(n). | |
(Doug Cutting) | |
10. Speed up the creation of TermEnum for indicies with multiple | |
segments and deleted documents, and thus speed up PrefixQuery, | |
RangeQuery, WildcardQuery, FuzzyQuery, RangeFilter, DateFilter, | |
and sorting the first time on a field. | |
(Yonik Seeley, LUCENE-454) | |
11. Optimized and generalized 32 bit floating point to byte | |
(custom 8 bit floating point) conversions. Increased the speed of | |
Similarity.encodeNorm() anywhere from 10% to 250%, depending on the JVM. | |
(Yonik Seeley, LUCENE-467) | |
Infrastructure | |
1. Lucene's source code repository has converted from CVS to | |
Subversion. The new repository is at | |
http://svn.apache.org/repos/asf/lucene/java/trunk | |
2. Lucene's issue tracker has migrated from Bugzilla to JIRA. | |
Lucene's JIRA is at http://issues.apache.org/jira/browse/LUCENE | |
The old issues are still available at | |
http://issues.apache.org/bugzilla/show_bug.cgi?id=xxxx | |
(use the bug number instead of xxxx) | |
1.4.3 | |
1. The JSP demo page (src/jsp/results.jsp) now properly escapes error | |
messages which might contain user input (e.g. error messages about | |
query parsing). If you used that page as a starting point for your | |
own code please make sure your code also properly escapes HTML | |
characters from user input in order to avoid so-called cross site | |
scripting attacks. (Daniel Naber) | |
2. QueryParser changes in 1.4.2 broke the QueryParser API. Now the old | |
API is supported again. (Christoph) | |
1.4.2 | |
1. Fixed bug #31241: Sorting could lead to incorrect results (documents | |
missing, others duplicated) if the sort keys were not unique and there | |
were more than 100 matches. (Daniel Naber) | |
2. Memory leak in Sort code (bug #31240) eliminated. | |
(Rafal Krzewski via Christoph and Daniel) | |
3. FuzzyQuery now takes an additional parameter that specifies the | |
minimum similarity that is required for a term to match the query. | |
The QueryParser syntax for this is term~x, where x is a floating | |
point number >= 0 and < 1 (a bigger number means that a higher | |
similarity is required). Furthermore, a prefix can be specified | |
for FuzzyQuerys so that only those terms are considered similar that | |
start with this prefix. This can speed up FuzzyQuery greatly. | |
(Daniel Naber, Christoph Goller) | |
4. PhraseQuery and PhrasePrefixQuery now allow the explicit specification | |
of relative positions. (Christoph Goller) | |
5. QueryParser changes: Fix for ArrayIndexOutOfBoundsExceptions | |
(patch #9110); some unused method parameters removed; The ability | |
to specify a minimum similarity for FuzzyQuery has been added. | |
(Christoph Goller) | |
6. IndexSearcher optimization: a new ScoreDoc is no longer allocated | |
for every non-zero-scoring hit. This makes 'OR' queries that | |
contain common terms substantially faster. (cutting) | |
1.4.1 | |
1. Fixed a performance bug in hit sorting code, where values were not | |
correctly cached. (Aviran via cutting) | |
2. Fixed errors in file format documentation. (Daniel Naber) | |
1.4 final | |
1. Added "an" to the list of stop words in StopAnalyzer, to complement | |
the existing "a" there. Fix for bug 28960 | |
(http://issues.apache.org/bugzilla/show_bug.cgi?id=28960). (Otis) | |
2. Added new class FieldCache to manage in-memory caches of field term | |
values. (Tim Jones) | |
3. Added overloaded getFieldQuery method to QueryParser which | |
accepts the slop factor specified for the phrase (or the default | |
phrase slop for the QueryParser instance). This allows overriding | |
methods to replace a PhraseQuery with a SpanNearQuery instead, | |
keeping the proper slop factor. (Erik Hatcher) | |
4. Changed the encoding of GermanAnalyzer.java and GermanStemmer.java to | |
UTF-8 and changed the build encoding to UTF-8, to make changed files | |
compile. (Otis Gospodnetic) | |
5. Removed synchronization from term lookup under IndexReader methods | |
termFreq(), termDocs() or termPositions() to improve | |
multi-threaded performance. (cutting) | |
6. Fix a bug where obsolete segment files were not deleted on Win32. | |
1.4 RC3 | |
1. Fixed several search bugs introduced by the skipTo() changes in | |
release 1.4RC1. The index file format was changed a bit, so | |
collections must be re-indexed to take advantage of the skipTo() | |
optimizations. (Christoph Goller) | |
2. Added new Document methods, removeField() and removeFields(). | |
(Christoph Goller) | |
3. Fixed inconsistencies with index closing. Indexes and directories | |
are now only closed automatically by Lucene when Lucene opened | |
them automatically. (Christoph Goller) | |
4. Added new class: FilteredQuery. (Tim Jones) | |
5. Added a new SortField type for custom comparators. (Tim Jones) | |
6. Lock obtain timed out message now displays the full path to the lock | |
file. (Daniel Naber via Erik) | |
7. Fixed a bug in SpanNearQuery when ordered. (Paul Elschot via cutting) | |
8. Fixed so that FSDirectory's locks still work when the | |
java.io.tmpdir system property is null. (cutting) | |
9. Changed FilteredTermEnum's constructor to take no parameters, | |
as the parameters were ignored anyway (bug #28858) | |
1.4 RC2 | |
1. GermanAnalyzer now throws an exception if the stopword file | |
cannot be found (bug #27987). It now uses LowerCaseFilter | |
(bug #18410) (Daniel Naber via Otis, Erik) | |
2. Fixed a few bugs in the file format documentation. (cutting) | |
1.4 RC1 | |
1. Changed the format of the .tis file, so that: | |
- it has a format version number, which makes it easier to | |
back-compatibly change file formats in the future. | |
- the term count is now stored as a long. This was the one aspect | |
of the Lucene's file formats which limited index size. | |
- a few internal index parameters are now stored in the index, so | |
that they can (in theory) now be changed from index to index, | |
although there is not yet an API to do so. | |
These changes are back compatible. The new code can read old | |
indexes. But old code will not be able read new indexes. (cutting) | |
2. Added an optimized implementation of TermDocs.skipTo(). A skip | |
table is now stored for each term in the .frq file. This only | |
adds a percent or two to overall index size, but can substantially | |
speedup many searches. (cutting) | |
3. Restructured the Scorer API and all Scorer implementations to take | |
advantage of an optimized TermDocs.skipTo() implementation. In | |
particular, PhraseQuerys and conjunctive BooleanQuerys are | |
faster when one clause has substantially fewer matches than the | |
others. (A conjunctive BooleanQuery is a BooleanQuery where all | |
clauses are required.) (cutting) | |
4. Added new class ParallelMultiSearcher. Combined with | |
RemoteSearchable this makes it easy to implement distributed | |
search systems. (Jean-Francois Halleux via cutting) | |
5. Added support for hit sorting. Results may now be sorted by any | |
indexed field. For details see the javadoc for | |
Searcher#search(Query, Sort). (Tim Jones via Cutting) | |
6. Changed FSDirectory to auto-create a full directory tree that it | |
needs by using mkdirs() instead of mkdir(). (Mladen Turk via Otis) | |
7. Added a new span-based query API. This implements, among other | |
things, nested phrases. See javadocs for details. (Doug Cutting) | |
8. Added new method Query.getSimilarity(Searcher), and changed | |
scorers to use it. This permits one to subclass a Query class so | |
that it can specify it's own Similarity implementation, perhaps | |
one that delegates through that of the Searcher. (Julien Nioche | |
via Cutting) | |
9. Added MultiReader, an IndexReader that combines multiple other | |
IndexReaders. (Cutting) | |
10. Added support for term vectors. See Field#isTermVectorStored(). | |
(Grant Ingersoll, Cutting & Dmitry) | |
11. Fixed the old bug with escaping of special characters in query | |
strings: http://issues.apache.org/bugzilla/show_bug.cgi?id=24665 | |
(Jean-Francois Halleux via Otis) | |
12. Added support for overriding default values for the following, | |
using system properties: | |
- default commit lock timeout | |
- default maxFieldLength | |
- default maxMergeDocs | |
- default mergeFactor | |
- default minMergeDocs | |
- default write lock timeout | |
(Otis) | |
13. Changed QueryParser.jj to allow '-' and '+' within tokens: | |
http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 | |
(Morus Walter via Otis) | |
14. Changed so that the compound index format is used by default. | |
This makes indexing a bit slower, but vastly reduces the chances | |
of file handle problems. (Cutting) | |
1.3 final | |
1. Added catch of BooleanQuery$TooManyClauses in QueryParser to | |
throw ParseException instead. (Erik Hatcher) | |
2. Fixed a NullPointerException in Query.explain(). (Doug Cutting) | |
3. Added a new method IndexReader.setNorm(), that permits one to | |
alter the boosting of fields after an index is created. | |
4. Distinguish between the final position and length when indexing a | |
field. The length is now defined as the total number of tokens, | |
instead of the final position, as it was previously. Length is | |
used for score normalization (Similarity.lengthNorm()) and for | |
controlling memory usage (IndexWriter.maxFieldLength). In both of | |
these cases, the total number of tokens is a better value to use | |
than the final token position. Position is used in phrase | |
searching (see PhraseQuery and Token.setPositionIncrement()). | |
5. Fix StandardTokenizer's handling of CJK characters (Chinese, | |
Japanese and Korean ideograms). Previously contiguous sequences | |
were combined in a single token, which is not very useful. Now | |
each ideogram generates a separate token, which is more useful. | |
1.3 RC3 | |
1. Added minMergeDocs in IndexWriter. This can be raised to speed | |
indexing without altering the number of files, but only using more | |
memory. (Julien Nioche via Otis) | |
2. Fix bug #24786, in query rewriting. (bschneeman via Cutting) | |
3. Fix bug #16952, in demo HTML parser, skip comments in | |
javascript. (Christoph Goller) | |
4. Fix bug #19253, in demo HTML parser, add whitespace as needed to | |
output (Daniel Naber via Christoph Goller) | |
5. Fix bug #24301, in demo HTML parser, long titles no longer | |
hang things. (Christoph Goller) | |
6. Fix bug #23534, Replace use of file timestamp of segments file | |
with an index version number stored in the segments file. This | |
resolves problems when running on file systems with low-resolution | |
timestamps, e.g., HFS under MacOS X. (Christoph Goller) | |
7. Fix QueryParser so that TokenMgrError is not thrown, only | |
ParseException. (Erik Hatcher) | |
8. Fix some bugs introduced by change 11 of RC2. (Christoph Goller) | |
9. Fixed a problem compiling TestRussianStem. (Christoph Goller) | |
10. Cleaned up some build stuff. (Erik Hatcher) | |
1.3 RC2 | |
1. Added getFieldNames(boolean) to IndexReader, SegmentReader, and | |
SegmentsReader. (Julien Nioche via otis) | |
2. Changed file locking to place lock files in | |
System.getProperty("java.io.tmpdir"), where all users are | |
permitted to write files. This way folks can open and correctly | |
lock indexes which are read-only to them. | |
3. IndexWriter: added a new method, addDocument(Document, Analyzer), | |
permitting one to easily use different analyzers for different | |
documents in the same index. | |
4. Minor enhancements to FuzzyTermEnum. | |
(Christoph Goller via Otis) | |
5. PriorityQueue: added insert(Object) method and adjusted IndexSearcher | |
and MultiIndexSearcher to use it. | |
(Christoph Goller via Otis) | |
6. Fixed a bug in IndexWriter that returned incorrect docCount(). | |
(Christoph Goller via Otis) | |
7. Fixed SegmentsReader to eliminate the confusing and slightly different | |
behaviour of TermEnum when dealing with an enumeration of all terms, | |
versus an enumeration starting from a specific term. | |
This patch also fixes incorrect term document frequences when the same term | |
is present in multiple segments. | |
(Christoph Goller via Otis) | |
8. Added CachingWrapperFilter and PerFieldAnalyzerWrapper. (Erik Hatcher) | |
9. Added support for the new "compound file" index format (Dmitry | |
Serebrennikov) | |
10. Added Locale setting to QueryParser, for use by date range parsing. | |
11. Changed IndexReader so that it can be subclassed by classes | |
outside of its package. Previously it had package-private | |
abstract methods. Also modified the index merging code so that it | |
can work on an arbitrary IndexReader implementation, and added a | |
new method, IndexWriter.addIndexes(IndexReader[]), to take | |
advantage of this. (cutting) | |
12. Added a limit to the number of clauses which may be added to a | |
BooleanQuery. The default limit is 1024 clauses. This should | |
stop most OutOfMemoryExceptions by prefix, wildcard and fuzzy | |
queries which run amok. (cutting) | |
13. Add new method: IndexReader.undeleteAll(). This undeletes all | |
deleted documents which still remain in the index. (cutting) | |
1.3 RC1 | |
1. Fixed PriorityQueue's clear() method. | |
Fix for bug 9454, http://nagoya.apache.org/bugzilla/show_bug.cgi?id=9454 | |
(Matthijs Bomhoff via otis) | |
2. Changed StandardTokenizer.jj grammar for EMAIL tokens. | |
Fix for bug 9015, http://nagoya.apache.org/bugzilla/show_bug.cgi?id=9015 | |
(Dale Anson via otis) | |
3. Added the ability to disable lock creation by using disableLuceneLocks | |
system property. This is useful for read-only media, such as CD-ROMs. | |
(otis) | |
4. Added id method to Hits to be able to access the index global id. | |
Required for sorting options. | |
(carlson) | |
5. Added support for new range query syntax to QueryParser.jj. | |
(briangoetz) | |
6. Added the ability to retrieve HTML documents' META tag values to | |
HTMLParser.jj. | |
(Mark Harwood via otis) | |
7. Modified QueryParser to make it possible to programmatically specify the | |
default Boolean operator (OR or AND). | |
(Péter Halácsy via otis) | |
8. Made many search methods and classes non-final, per requests. | |
This includes IndexWriter and IndexSearcher, among others. | |
(cutting) | |
9. Added class RemoteSearchable, providing support for remote | |
searching via RMI. The test class RemoteSearchableTest.java | |
provides an example of how this can be used. (cutting) | |
10. Added PhrasePrefixQuery (and supporting MultipleTermPositions). The | |
test class TestPhrasePrefixQuery provides the usage example. | |
(Anders Nielsen via otis) | |
11. Changed the German stemming algorithm to ignore case while | |
stripping. The new algorithm is faster and produces more equal | |
stems from nouns and verbs derived from the same word. | |
(gschwarz) | |
12. Added support for boosting the score of documents and fields via | |
the new methods Document.setBoost(float) and Field.setBoost(float). | |
Note: This changes the encoding of an indexed value. Indexes | |
should be re-created from scratch in order for search scores to | |
be correct. With the new code and an old index, searches will | |
yield very large scores for shorter fields, and very small scores | |
for longer fields. Once the index is re-created, scores will be | |
as before. (cutting) | |
13. Added new method Token.setPositionIncrement(). | |
This permits, for the purpose of phrase searching, placing | |
multiple terms in a single position. This is useful with | |
stemmers that produce multiple possible stems for a word. | |
This also permits the introduction of gaps between terms, so that | |
terms which are adjacent in a token stream will not be matched by | |
and exact phrase query. This makes it possible, e.g., to build | |
an analyzer where phrases are not matched over stop words which | |
have been removed. | |
Finally, repeating a token with an increment of zero can also be | |
used to boost scores of matches on that token. (cutting) | |
14. Added new Filter class, QueryFilter. This constrains search | |
results to only match those which also match a provided query. | |
Results are cached, so that searches after the first on the same | |
index using this filter are very fast. | |
This could be used, for example, with a RangeQuery on a formatted | |
date field to implement date filtering. One could re-use a | |
single QueryFilter that matches, e.g., only documents modified | |
within the last week. The QueryFilter and RangeQuery would only | |
need to be reconstructed once per day. (cutting) | |
15. Added a new IndexWriter method, getAnalyzer(). This returns the | |
analyzer used when adding documents to this index. (cutting) | |
16. Fixed a bug with IndexReader.lastModified(). Before, document | |
deletion did not update this. Now it does. (cutting) | |
17. Added Russian Analyzer. | |
(Boris Okner via otis) | |
18. Added a public, extensible scoring API. For details, see the | |
javadoc for org.apache.lucene.search.Similarity. | |
19. Fixed return of Hits.id() from float to int. (Terry Steichen via Peter). | |
20. Added getFieldNames() to IndexReader and Segment(s)Reader classes. | |
(Peter Mularien via otis) | |
21. Added getFields(String) and getValues(String) methods. | |
Contributed by Rasik Pandey on 2002-10-09 | |
(Rasik Pandey via otis) | |
22. Revised internal search APIs. Changes include: | |
a. Queries are no longer modified during a search. This makes | |
it possible, e.g., to reuse the same query instance with | |
multiple indexes from multiple threads. | |
b. Term-expanding queries (e.g. PrefixQuery, WildcardQuery, | |
etc.) now work correctly with MultiSearcher, fixing bugs 12619 | |
and 12667. | |
c. Boosting BooleanQuery's now works, and is supported by the | |
query parser (problem reported by Lee Mallabone). Thus a query | |
like "(+foo +bar)^2 +baz" is now supported and equivalent to | |
"(+foo^2 +bar^2) +baz". | |
d. New method: Query.rewrite(IndexReader). This permits a | |
query to re-write itself as an alternate, more primitive query. | |
Most of the term-expanding query classes (PrefixQuery, | |
WildcardQuery, etc.) are now implemented using this method. | |
e. New method: Searchable.explain(Query q, int doc). This | |
returns an Explanation instance that describes how a particular | |
document is scored against a query. An explanation can be | |
displayed as either plain text, with the toString() method, or | |
as HTML, with the toHtml() method. Note that computing an | |
explanation is as expensive as executing the query over the | |
entire index. This is intended to be used in developing | |
Similarity implementations, and, for good performance, should | |
not be displayed with every hit. | |
f. Scorer and Weight are public, not package protected. It now | |
possible for someone to write a Scorer implementation that is | |
not in the org.apache.lucene.search package. This is still | |
fairly advanced programming, and I don't expect anyone to do | |
this anytime soon, but at least now it is possible. | |
g. Added public accessors to the primitive query classes | |
(TermQuery, PhraseQuery and BooleanQuery), permitting access to | |
their terms and clauses. | |
Caution: These are extensive changes and they have not yet been | |
tested extensively. Bug reports are appreciated. | |
(cutting) | |
23. Added convenience RAMDirectory constructors taking File and String | |
arguments, for easy FSDirectory to RAMDirectory conversion. | |
(otis) | |
24. Added code for manual renaming of files in FSDirectory, since it | |
has been reported that java.io.File's renameTo(File) method sometimes | |
fails on Windows JVMs. | |
(Matt Tucker via otis) | |
25. Refactored QueryParser to make it easier for people to extend it. | |
Added the ability to automatically lower-case Wildcard terms in | |
the QueryParser. | |
(Tatu Saloranta via otis) | |
1.2 RC6 | |
1. Changed QueryParser.jj to have "?" be a special character which | |
allowed it to be used as a wildcard term. Updated TestWildcard | |
unit test also. (Ralf Hettesheimer via carlson) | |
1.2 RC5 | |
1. Renamed build.properties to default.properties and updated | |
the BUILD.txt document to describe how to override the | |
default.property settings without having to edit the file. This | |
brings the build process closer to Scarab's build process. | |
(jon) | |
2. Added MultiFieldQueryParser class. (Kelvin Tan, via otis) | |
3. Updated "powered by" links. (otis) | |
4. Fixed instruction for setting up JavaCC - Bug #7017 (otis) | |
5. Added throwing exception if FSDirectory could not create diectory | |
- Bug #6914 (Eugene Gluzberg via otis) | |
6. Update MultiSearcher, MultiFieldParse, Constants, DateFilter, | |
LowerCaseTokenizer javadoc (otis) | |
7. Added fix to avoid NullPointerException in results.jsp | |
(Mark Hayes via otis) | |
8. Changed Wildcard search to find 0 or more char instead of 1 or more | |
(Lee Mallobone, via otis) | |
9. Fixed error in offset issue in GermanStemFilter - Bug #7412 | |
(Rodrigo Reyes, via otis) | |
10. Added unit tests for wildcard search and DateFilter (otis) | |
11. Allow co-existence of indexed and non-indexed fields with the same name | |
(cutting/casper, via otis) | |
12. Add escape character to query parser. | |
(briangoetz) | |
13. Applied a patch that ensures that searches that use DateFilter | |
don't throw an exception when no matches are found. (David Smiley, via | |
otis) | |
14. Fixed bugs in DateFilter and wildcardquery unit tests. (cutting, otis, carlson) | |
1.2 RC4 | |
1. Updated contributions section of website. | |
Add XML Document #3 implementation to Document Section. | |
Also added Term Highlighting to Misc Section. (carlson) | |
2. Fixed NullPointerException for phrase searches containing | |
unindexed terms, introduced in 1.2RC3. (cutting) | |
3. Changed document deletion code to obtain the index write lock, | |
enforcing the fact that document addition and deletion cannot be | |
performed concurrently. (cutting) | |
4. Various documentation cleanups. (otis, acoliver) | |
5. Updated "powered by" links. (cutting, jon) | |
6. Fixed a bug in the GermanStemmer. (Bernhard Messer, via otis) | |
7. Changed Term and Query to implement Serializable. (scottganyo) | |
8. Fixed to never delete indexes added with IndexWriter.addIndexes(). | |
(cutting) | |
9. Upgraded to JUnit 3.7. (otis) | |
1.2 RC3 | |
1. IndexWriter: fixed a bug where adding an optimized index to an | |
empty index failed. This was encountered using addIndexes to copy | |
a RAMDirectory index to an FSDirectory. | |
2. RAMDirectory: fixed a bug where RAMInputStream could not read | |
across more than across a single buffer boundary. | |
3. Fix query parser so it accepts queries with unicode characters. | |
(briangoetz) | |
4. Fix query parser so that PrefixQuery is used in preference to | |
WildcardQuery when there's only an asterisk at the end of the | |
term. Previously PrefixQuery would never be used. | |
5. Fix tests so they compile; fix ant file so it compiles tests | |
properly. Added test cases for Analyzers and PriorityQueue. | |
6. Updated demos, added Getting Started documentation. (acoliver) | |
7. Added 'contributions' section to website & docs. (carlson) | |
8. Removed JavaCC from source distribution for copyright reasons. | |
Folks must now download this separately from metamata in order to | |
compile Lucene. (cutting) | |
9. Substantially improved the performance of DateFilter by adding the | |
ability to reuse TermDocs objects. (cutting) | |
10. Added IndexReader methods: | |
public static boolean indexExists(String directory); | |
public static boolean indexExists(File directory); | |
public static boolean indexExists(Directory directory); | |
public static boolean isLocked(Directory directory); | |
public static void unlock(Directory directory); | |
(cutting, otis) | |
11. Fixed bugs in GermanAnalyzer (gschwarz) | |
1.2 RC2, 19 October 2001: | |
- added sources to distribution | |
- removed broken build scripts and libraries from distribution | |
- SegmentsReader: fixed potential race condition | |
- FSDirectory: fixed so that getDirectory(xxx,true) correctly | |
erases the directory contents, even when the directory | |
has already been accessed in this JVM. | |
- RangeQuery: Fix issue where an inclusive range query would | |
include the nearest term in the index above a non-existant | |
specified upper term. | |
- SegmentTermEnum: Fix NullPointerException in clone() method | |
when the Term is null. | |
- JDK 1.1 compatibility fix: disabled lock files for JDK 1.1, | |
since they rely on a feature added in JDK 1.2. | |
1.2 RC1 (first Apache release), 2 October 2001: | |
- packages renamed from com.lucene to org.apache.lucene | |
- license switched from LGPL to Apache | |
- ant-only build -- no more makefiles | |
- addition of lock files--now fully thread & process safe | |
- addition of German stemmer | |
- MultiSearcher now supports low-level search API | |
- added RangeQuery, for term-range searching | |
- Analyzers can choose tokenizer based on field name | |
- misc bug fixes. | |
1.01b (last Sourceforge release), 2 July 2001 | |
. a few bug fixes | |
. new Query Parser | |
. new prefix query (search for "foo*" matches "food") | |
1.0, 2000-10-04 | |
This release fixes a few serious bugs and also includes some | |
performance optimizations, a stemmer, and a few other minor | |
enhancements. | |
0.04 2000-04-19 | |
Lucene now includes a grammar-based tokenizer, StandardTokenizer. | |
The only tokenizer included in the previous release (LetterTokenizer) | |
identified terms consisting entirely of alphabetic characters. The | |
new tokenizer uses a regular-expression grammar to identify more | |
complex classes of terms, including numbers, acronyms, email | |
addresses, etc. | |
StandardTokenizer serves two purposes: | |
1. It is a much better, general purpose tokenizer for use by | |
applications as is. | |
The easiest way for applications to start using | |
StandardTokenizer is to use StandardAnalyzer. | |
2. It provides a good example of grammar-based tokenization. | |
If an application has special tokenization requirements, it can | |
implement a custom tokenizer by copying the directory containing | |
the new tokenizer into the application and modifying it | |
accordingly. | |
0.01, 2000-03-30 | |
First open source release. | |
The code has been re-organized into a new package and directory | |
structure for this release. It builds OK, but has not been tested | |
beyond that since the re-organization. |