| Lucene Benchmark Contrib Change Log |
| |
| The Benchmark contrib package contains code for benchmarking Lucene in a variety of ways. |
| |
| For more information on past and future Lucene versions, please see: |
| http://s.apache.org/luceneversions |
| |
| 05/25/2011 |
| LUCENE-3137: ExtractReuters supports out-dir param suffixed by a slash. (Doron Cohen) |
| |
| 03/31/2011 |
| Updated ReadTask to the new method for obtaining a top-level deleted docs |
| bitset. Also checking the bitset for null, when there are no deleted docs. |
| (Steve Rowe, Mike McCandless) |
| |
| Updated NewAnalyzerTask and NewShingleAnalyzerTask to handle analyzers |
| in the new org.apache.lucene.analysis.core package (KeywordAnalyzer, |
| SimpleAnalyzer, etc.) (Steve Rowe, Robert Muir) |
| |
| Updated ReadTokensTask to convert tokens to their indexed forms |
| (char[]->byte[]), just as the indexer does. This allows measurement |
| of the conversion process, which is important for analysis components |
| that customize it, e.g. (ICU)CollationKeyFilter. As a result, |
| benchmarks that incorporate this task will no longer be directly |
| comparable between 3.X and 4.0. (Robert Muir, Steve Rowe) |
| |
| 03/24/2011 |
| LUCENE-2977: WriteLineDocTask now automatically detects how to write - |
| GZip or BZip2 or Plain-text - according to the output file extension. |
| The bzip.compression property of WriteLineDocTask was removed. (Doron Cohen) |
| |
| 03/23/2011 |
| LUCENE-2980: Benchmark's ContentSource no longer requires lower case file suffixes |
| for detecting file type (gzip/bzip2/text). As part of this fix, worked around an |
| issue with gzip input streams that remained open (see COMPRESS-127). |
| (Doron Cohen) |
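| |
| The case-insensitive suffix check described above can be sketched roughly as |
| follows. This is only an illustration, not the ContentSource code: the class |
| SuffixFormat and its detect method are hypothetical names, and the ".gz" and |
| ".bz2" suffixes are assumptions, since this entry does not list the exact |
| suffixes that are recognized. |

```java
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: picks a stream format from a file name. */
final class SuffixFormat {

  enum Format { GZIP, BZIP2, PLAIN }

  private SuffixFormat() {}

  /**
   * Compares the suffix case-insensitively, so "docs.GZ" and "docs.gz" match.
   * The ".gz" and ".bz2" suffixes are assumptions made for this sketch.
   */
  static Format detect(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".gz")) {
      return Format.GZIP;
    }
    if (lower.endsWith(".bz2")) {
      return Format.BZIP2;
    }
    return Format.PLAIN; // anything else is treated as plain text
  }
}
```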
| |
| 03/22/2011 |
| LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1 as |
| the move of gzip decompression in LUCENE-1540 from Java's GZipInputStream |
| to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down |
| is observed. (Doron Cohen) |
| |
| 03/21/2011 |
| LUCENE-2958: WriteLineDocTask improvements - allow emitting line docs for empty |
| docs as well, and be flexible about which fields are added to the line file. For this, a header |
| line was added to the line file. That header is examined by LineDocSource. Old line |
| files which have no header line are handled as before, imposing the default header. |
| (Doron Cohen, Shai Erera, Mike McCandless) |
| |
| 03/21/2011 |
| LUCENE-2964: Allow benchmark tasks from alternative packages, |
| specified through a new property "alt.tasks.packages". |
| (Doron Cohen, Shai Erera) |
| |
| 03/20/2011 |
| LUCENE-2963: Easier way to run benchmark, by calling Benchmark.exec(alg-file). |
| (Doron Cohen) |
| |
| 03/10/2011 |
| LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the |
| JAXP 1.3 interface classes it provides. |
| |
| 02/05/2011 |
| LUCENE-1540: Improvements to contrib.benchmark for TREC collections. |
| ContentSource can now process plain text files, gzip files, and bzip2 files. |
| TREC doc parsing now handles the TREC gov2 collection and TREC disks 4&5-CR |
| collection (both used by many TREC tasks). (Shai Erera, Doron Cohen) |
| |
| 01/31/2011 |
| LUCENE-1591: Roll back to xerces-2.9.1-patched-XERCESJ-1257.jar to work around |
| XERCESJ-1257, which we hit on current Wikipedia XML export |
| (ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar. (Mike McCandless) |
| |
| 01/26/2011 |
| LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames. That |
| way, if a previous extract attempt failed, "ant extract-reuters" will still |
| extract the files. (Shai Erera, Doron Cohen, Grant Ingersoll) |
| |
| 01/24/2011 |
| LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()). |
| (Mike McCandless) |
| |
| 10/10/2010 |
| The locally built patched version of the Xerces-J jar introduced |
| as part of LUCENE-1591 is no longer required, because Xerces |
| 2.10.0, which contains a fix for XERCESJ-1257 (see |
| http://svn.apache.org/viewvc?view=revision&revision=554069), |
| was released earlier this year. Upgraded |
| xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar |
| to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe) |
| |
| 8/2/2010 |
| LUCENE-2582: You can now specify the default codec to use for |
| writing new segments by adding default.codec = Pulsing (for |
| example), in the alg file. (Mike McCandless) |
| |
| 4/27/2010: WriteLineDocTask now supports multi-threading. Also, |
| StringBufferReader was renamed to StringBuilderReader and works on |
| StringBuilder now. In addition, LongToEnglishContentSource starts from 0 |
| (instead of Long.MIN_VAL+10) and wraps around to MIN_VAL (if you ever hit |
| Long.MAX_VAL). (Shai Erera) |
| |
| 4/07/2010 |
| LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by |
| CreateIndexTask. (Shai Erera) |
| |
| 3/28/2010 |
| LUCENE-2353: Fixed bug in Config where Windows absolute path property values |
| were incorrectly handled (Shai Erera) |
| |
| 3/24/2010 |
| LUCENE-2343: Added support for benchmarking collectors. (Grant Ingersoll, Shai Erera) |
| |
| 2/21/2010 |
| LUCENE-2254: Add support to the quality package for running |
| experiments with any combination of Title, Description, and Narrative. |
| (Robert Muir) |
| |
| 1/28/2010 |
| LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any |
| analyzer with ShingleAnalyzerWrapper and specify shingle parameters |
| with the NewShingleAnalyzer task. (Steven Rowe via Robert Muir) |
| |
| 1/14/2010 |
| LUCENE-2210: TrecTopicsReader now properly reads descriptions and |
| narratives from trec topics files. (Robert Muir) |
| |
| 1/11/2010 |
| LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask, |
| which sets a Locale in the run data for collation to use, and can be |
| used in the future for benchmarking localized range queries and sorts. |
| Also add NewCollationAnalyzerTask, which works with both JDK and ICU |
| Collator implementations. Fix ReadTokensTask to not tokenize fields |
| unless they should be tokenized according to DocMaker config. The |
| easiest way to run the benchmark is to run 'ant collation' |
| (Steven Rowe via Robert Muir) |
| |
| 12/22/2009 |
| LUCENE-2178: Allow multiple locations to add to the class path with |
| -Dbenchmark.ext.classpath=... when running "ant run-task" (Steven |
| Rowe via Mike McCandless) |
| |
| 12/17/2009 |
| LUCENE-2168: Allow negative relative thread priority for BG tasks |
| (Mike McCandless) |
| |
| 12/07/2009 |
| LUCENE-2106: ReadTask does not close its Reader when |
| OpenReader/CloseReader are not used. (Mark Miller) |
| |
| 11/17/2009 |
| LUCENE-2079: Allow specifying delta thread priority after the "&"; |
| added log.time.step.msec to print per-time-period counts; fixed |
| NearRealTimeTask to print reopen times (in msec) of each reopen, at |
| the end. (Mike McCandless) |
| |
| 11/13/2009 |
| LUCENE-2050: Added ability to run tasks within a serial sequence in |
| the background, by appending "&". The tasks are stopped & joined at |
| the end of the sequence. Also added Wait and RollbackIndex tasks. |
| Genericized NearRealTimeReaderTask to only reopen the reader |
| (previously it spawned its own thread, and also did searching). |
| Also changed the API of PerfRunData.getIndexReader: it now returns a |
| reference, and it's your job to decRef the reader when you're done |
| using it. (Mike McCandless) |
| |
| 11/12/2009 |
| LUCENE-2059: allow TrecContentSource not to change the docname. |
| Previously, it would always append the iteration # to the docname. |
| With the new option content.source.excludeIteration, you can disable this. |
| The resulting index can then be used with the quality package to measure |
| relevance. (Robert Muir) |
| |
| 11/12/2009 |
| LUCENE-2058: specify trec_eval submission output from the command line. |
| Previously, 4 arguments were required, but the third was unused. The |
| third argument is now the desired location of submission.txt (Robert Muir) |
| |
| 11/08/2009 |
| LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance |
| used by DeleteByPercentTask. (Mike McCandless) |
| |
| 11/07/2009 |
| LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader |
| changes (Mike McCandless) |
| |
| 11/07/2009 |
| LUCENE-2042: Added print.hits.field, to print each hit from the |
| Search* tasks. (Mike McCandless) |
| |
| 11/04/2009 |
| LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each |
| falls back to the non-body variant as its default. (Mike McCandless) |
| |
| 10/28/2009 |
| LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker |
| when doc.reuse.fields is false. Also made docs.reuse.fields=true |
| thread safe. (Mark Miller, Shai Erera, Mike McCandless) |
| |
| 8/4/2009 |
| LUCENE-1770: Add EnwikiQueryMaker (Mark Miller) |
| |
| 8/04/2009 |
| LUCENE-1773: Add FastVectorHighlighter tasks. This change is a |
| non-backwards compatible change in how subclasses of ReadTask define |
| a highlighter. The methods doHighlight, isMergeContiguousFragments, |
| maxNumFragments and getHighlighter are no longer used and have been |
| marked deprecated and package-private, so there is a compile-time |
| error. Instead, the new getBenchmarkHighlighter method should |
| return an appropriate highlighter for the task. The configuration of |
| the highlighter tasks (maxFrags, mergeContiguous, etc.) is now |
| accepted as params to the task. (Koji Sekiguchi via Mike McCandless) |
| |
| 8/03/2009 |
| LUCENE-1778: Add support for log.step setting per task type. Previously, if |
| you included a log.step line in the .alg file, it was applied to all |
| tasks. Now, you can include a log.step.AddDoc, or log.step.DeleteDoc (for |
| example) to control logging for just these tasks. If you want to omit logging |
| for any other task, include log.step=-1. The syntax is "log.step." together |
| with the Task's 'short' name (i.e., without the 'Task' part). |
| (Shai Erera via Mark Miller) |
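| |
| The lookup order implied above (per-task setting first, then the global |
| log.step) can be sketched as follows. This is not the actual PerfTask |
| implementation: LogStepLookup, shortName, effectiveLogStep and the |
| defaultStep argument are hypothetical names introduced for illustration. |

```java
import java.util.Properties;

/** Hypothetical helper, not the actual PerfTask code. */
final class LogStepLookup {

  private LogStepLookup() {}

  /** Strips a trailing "Task" to get the short name, e.g. AddDocTask -> AddDoc. */
  static String shortName(String taskClassName) {
    return taskClassName.endsWith("Task")
        ? taskClassName.substring(0, taskClassName.length() - "Task".length())
        : taskClassName;
  }

  /**
   * Prefers log.step.<ShortName>, then falls back to log.step, then to
   * defaultStep (a value chosen by the caller for this sketch).
   */
  static int effectiveLogStep(Properties config, String taskClassName, int defaultStep) {
    String value = config.getProperty("log.step." + shortName(taskClassName));
    if (value == null) {
      value = config.getProperty("log.step");
    }
    return value == null ? defaultStep : Integer.parseInt(value.trim());
  }
}
```

| With log.step=-1 and log.step.AddDoc=1000 set, this sketch resolves AddDocTask |
| to 1000 and every other task to -1. |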
| |
| 7/24/2009 |
| LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of |
| using DocMaker directly, with content.source = LineDocSource or |
| EnwikiContentSource. NOTE: with this change, the "id" field from |
| the Wikipedia XML export is now indexed as the "docname" field |
| (previously it was indexed as "docid"). Additionally, the |
| SearchWithSort task now accepts all types that SortField can accept |
| and no longer falls back to SortField.AUTO, which has been |
| deprecated. (Mike McCandless) |
| |
| 7/20/2009 |
| LUCENE-1755: Fix WriteLineDocTask to output a document if it contains either |
| a title or body (or both). (Shai Erera via Mark Miller) |
| |
| 7/14/2009 |
| LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and no longer works |
| with Benchmark. Benchmark will now throw an exception if you specify sort fields without |
| a type. The example sort algorithm is now typed. (Mark Miller) |
| |
| 7/6/2009 |
| LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files, |
| unless a different encoding is specified. Additionally, ContentSource now supports |
| a content.source.encoding parameter in the configuration file. |
| (Shai Erera via Mark Miller) |
| |
| 6/26/2009 |
| LUCENE-1716: Added the following support: |
| doc.tokenized.norms: specifies whether to store norms |
| doc.body.tokenized.norms: special attribute for the body field |
| doc.index.props: specifies whether DocMaker should index the properties set on |
| DocData |
| writer.info.stream: specifies the info stream to set on IndexWriter (supported |
| values are: SystemOut, SystemErr and a file name). (Shai Erera via Mike McCandless) |
| |
| 6/23/09 |
| LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing only |
| occurrences of "\t" with a space. It now replaces "\r\n" in addition to that, |
| so that LineDocMaker won't fail. (Shai Erera via Michael McCandless) |
| |
| 6/17/09 |
| LUCENE-1595: This issue breaks previous external algorithms. DocMaker has been |
| replaced with a concrete class which accepts a ContentSource for iterating over |
| a content source's documents. Most of the old DocMakers were changed to a |
| ContentSource implementation, and DocMaker is now a default document creation impl |
| that provides an easy way for reusing fields. When [doc.maker] is not defined in |
| an algorithm, the new DocMaker is the default. If you have .alg files which |
| specify a DocMaker (like ReutersDocMaker), you should change the [doc.maker] line to: |
| [content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource] |
| |
| i.e. |
| doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker |
| becomes |
| content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource |
| |
| doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker |
| becomes |
| content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource |
| |
| Also, PerfTask now logs a message in tearDown() rather than each Task doing its |
| own logging. A new setting called [log.step] is consulted to determine how often |
| to log. [doc.add.log.step] is no longer a valid setting. For easy migration of |
| current .alg files, rename [doc.add.log.step] to [log.step] and [doc.delete.log.step] |
| to [delete.log.step]. |
| |
| Additionally, [doc.maker.forever] should be changed to [content.source.forever]. |
| (Shai Erera via Mark Miller) |
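| |
| For bulk migration of old .alg files, the renames listed in this entry could |
| be applied line by line with a small helper along these lines. AlgMigration |
| and migrateLine are hypothetical names, not part of the benchmark package, |
| and only the renames named in this entry are covered; other DocMaker |
| subclasses would need their own mappings. |

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical migration helper for pre-LUCENE-1595 .alg property lines. */
final class AlgMigration {

  private static final String FEEDS = "org.apache.lucene.benchmark.byTask.feeds.";

  // Old property key -> new property key, as listed in this entry.
  private static final Map<String, String> KEYS = new LinkedHashMap<>();
  // Old doc maker class -> replacement content source class.
  private static final Map<String, String> SOURCES = new LinkedHashMap<>();

  static {
    KEYS.put("doc.add.log.step", "log.step");
    KEYS.put("doc.delete.log.step", "delete.log.step");
    KEYS.put("doc.maker.forever", "content.source.forever");
    SOURCES.put(FEEDS + "ReutersDocMaker", FEEDS + "ReutersContentSource");
    SOURCES.put(FEEDS + "SimpleDocMaker", FEEDS + "SingleDocSource");
  }

  private AlgMigration() {}

  /** Rewrites one "key=value" line; lines it does not recognize are returned unchanged. */
  static String migrateLine(String line) {
    int eq = line.indexOf('=');
    if (eq < 0) {
      return line; // not a property line (blank line or algorithm text)
    }
    String key = line.substring(0, eq).trim();
    String value = line.substring(eq + 1).trim();
    if (key.equals("doc.maker") && SOURCES.containsKey(value)) {
      return "content.source=" + SOURCES.get(value);
    }
    if (KEYS.containsKey(key)) {
      return KEYS.get(key) + "=" + value;
    }
    return line;
  }
}
```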
| |
| 6/12/09 |
| LUCENE-1539: Added DeleteByPercentTask which enables deleting a |
| percentage of documents and searching on them. Changed CommitIndex |
| to optionally accept a label (recorded as userData=<label> in the |
| commit point). Added FlushReaderTask, and modified OpenReaderTask |
| to also optionally take a label referencing a commit point to open. |
| Also changed default autoCommit (when IndexWriter is opened) to |
| true. (Jason Rutherglen via Mike McCandless) |
| |
| 12/20/08 |
| LUCENE-1495: Allow a task sequence to run for a specified number of seconds by adding ": 2.7s" (for example). |
| |
| 12/16/08 |
| LUCENE-1493: Stop using deprecated Hits API for searching; add new |
| param search.num.hits to set top N docs to collect. |
| |
| 12/16/08 |
| LUCENE-1492: Added optional readOnly param (default true) to OpenReader task. |
| |
| 9/9/08 |
| LUCENE-1243: Added new sorting benchmark capabilities. Also Reopen and commit tasks. (Mark Miller via Grant Ingersoll) |
| |
| 5/10/08 |
| LUCENE-1090: remove relative paths assumptions from benchmark code. |
| Only build.xml was modified: work-dir definition must remain so |
| benchmark tests can run from both trunk-home and benchmark-home. |
| |
| 3/9/08 |
| LUCENE-1209: Fixed DocMaker settings by round. Prior to this fix, DocMaker settings of |
| first round were used in all rounds. (E.g. term vectors.) |
| (Mark Miller via Doron Cohen) |
| |
| 1/30/08 |
| LUCENE-1156: Fixed redirect problem in EnwikiDocMaker. Refactored ExtractWikipedia to use EnwikiDocMaker. Added property to EnwikiDocMaker to allow |
| for skipping image only documents. |
| |
| 1/24/2008 |
| LUCENE-1136: add ability to not count sub-task doLogic increment |
| |
| 1/23/2008 |
| LUCENE-1129: ReadTask properly uses the traversalSize value |
| LUCENE-1128: Added support for benchmarking the highlighter |
| |
| 01/20/08 |
| LUCENE-1139: various fixes |
| - add merge.scheduler, merge.policy config properties |
| - refactor Open/CreateIndexTask to share setting config on IndexWriter |
| - added doc.reuse.fields=true|false for LineDocMaker |
| - OptimizeTask now takes int param to call optimize(int maxNumSegments) |
| - CloseIndexTask now takes bool param to call close(false) (abort running merges) |
| |
| |
| 01/03/08 |
| LUCENE-1116: quality package improvements: |
| - add MRR computation; |
| - allow control of max #queries to run; |
| - verify log & report are flushed. |
| - add TREC query reader for the 1MQ track. |
| |
| 12/31/07 |
| LUCENE-1102: EnwikiDocMaker now indexes the docid field, so results might not be comparable with results prior to this change, although |
| it is doubted that this one small field makes much difference. |
| |
| 12/13/07 |
| LUCENE-1086: DocMakers setup for the "docs.dir" property |
| fixed to properly handle absolute paths. (Shai Erera via Doron Cohen) |
| |
| 9/18/07 |
| LUCENE-941: infinite loop for alg: {[AddDoc(4000)]: 4} : * |
| ResetInputsTask fixed to work also after exhaustion. |
| All Reset Tasks now subclass ResetInputsTask. |
| |
| 8/9/07 |
| LUCENE-971: Change enwiki tasks to a doc maker (extending |
| LineDocMaker) that directly processes the Wikipedia XML and produces |
| documents. Intermediate files (one per document) are no longer |
| created. |
| |
| 8/1/07 |
| LUCENE-967: Add "ReadTokensTask" to allow for benchmarking just tokenization. |
| |
| 7/27/07 |
| LUCENE-836: Add support for search quality benchmarking, running |
| a set of queries against a searcher, and, optionally produce a submission |
| report, and, if query judgements are available, compute quality measures: |
| recall, precision_at_N, average_precision, MAP. TREC specific Judge (based |
| on TREC QRels) and TREC Topics reader are included in o.a.l.benchmark.quality.trec |
| but any other format of queries and judgements can be implemented and used. |
| |
| 7/24/07 |
| LUCENE-947: Add support for creating and indexing "one document per |
| line" from a large text file, which reduces per-document overhead of |
| opening a single file for each document. |
| |
| 6/30/07 |
| LUCENE-848: Added support for Wikipedia benchmarking. |
| |
| 6/25/07 |
| - LUCENE-940: Multi-threaded issues fixed: SimpleDateFormat; logging for addDoc/deleteDoc tasks. |
| - LUCENE-945: tests fail to find data dirs. Added sys-prop benchmark.work.dir and cfg-prop work.dir. |
| (Doron Cohen) |
| |
| 4/17/07 |
| - LUCENE-863: Deprecated StandardBenchmarker in favour of byTask code. |
| (Otis Gospodnetic) |
| |
| 4/13/07 |
| |
| Better error handling and javadocs around "exhaustive" doc making. |
| |
| 3/25/07 |
| |
| LUCENE-849: |
| 1. which HTML Parser is used is configurable with html.parser property. |
| 2. External classes added to classpath with -Dbenchmark.ext.classpath=path. |
| 3. '*' as repeating number now means "exhaust doc maker - no repetitions". |
| |
| 3/22/07 |
| |
| -Moved withRetrieve() call out of the loop in ReadTask |
| -Added SearchTravRetLoadFieldSelectorTask to help benchmark some of the FieldSelector capabilities |
| -Added options to store content bytes on the Reuters Doc (and others, but Reuters is the only one w/ it enabled) |
| |
| 3/21/07 |
| |
| Tests (for benchmarking code correctness) were added - LUCENE-840. |
| To be invoked by "ant test" from contrib/benchmark. (Doron Cohen) |
| |
| 3/19/07 |
| |
| 1. Introduced an AbstractQueryMaker to hold common QueryMaker code. (GSI) |
| 2. Added traversalSize parameter to SearchTravRetTask and SearchTravTask. Changed SearchTravRetTask to extend SearchTravTask. (GSI) |
| 3. Added FileBasedQueryMaker to run queries from a File or resource. (GSI) |
| 4. Modified query-maker generation for read related tasks to make further read tasks addition simpler and safer. (DC) |
| 5. Changed Tasks' setParams() to throw UnsupportedOperationException if that task does not support command line param. (DC) |
| 6. Improved javadoc to specify all properties and command line params currently supported. (DC) |
| 7. Refactored ReportTasks so that it is easy/possible now to create new report tasks. (DC) |
| |
| 01/09/07 |
| |
| 1. Committed Doron Cohen's benchmarking contribution, which provides an easily expandable task based approach to benchmarking. See the javadocs for information. (Doron Cohen via Grant Ingersoll) |
| |
| 2. Added this file. |
| |
| 3. 2/11/07: LUCENE-790 and 788: Fixed Locale issue with date formatter. Fixed some minor issues with benchmarking by task. Added a dependency |
| on the Lucene demo to the build classpath. (Doron Cohen, Grant Ingersoll) |
| |
| 4. 2/13/07: LUCENE-801: build.xml now builds Lucene core and Demo first and has classpath dependencies on the output of that build. (Doron Cohen, Grant Ingersoll) |