Indexing Benchmarks
The purpose of this experiment is to test raw indexing speed, using
Reuters-21578, Distribution 1.0 as a test corpus. As of this writing,
Reuters-21578 is available at:
http://www.daviddlewis.com/resources/testcollections/reuters21578
The corpus comes packaged in SGML, so we preprocess it to keep differences
between SGML parsers from skewing the results. A simple Perl script,
"./extract_reuters.plx", is supplied; it expands the Reuters articles out into
the file system, one article per file, with the title as the first line of
text. It takes one command-line argument: the location of the un-tarred
Reuters collection.
    ./extract_reuters.plx /path/to/reuters_collection
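For readers writing their own indexer against the extracted files, the layout
described above (one article per file, title on the first line) can be read
back with a few lines of code. The Python sketch below is illustrative only:
the name split_article is hypothetical and not part of the benchmark suite,
and it assumes that everything after the first line is body text.

```python
def split_article(text):
    """Split one extracted article into (title, body).

    Assumption: the title is the first line of the file and
    everything after it is the main text.
    """
    title, _, body = text.partition("\n")
    return title.strip(), body.strip()
```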
Each of the indexing apps takes four optional command line arguments:
* The number of documents to index.
* The number of times to repeat the indexing process.
* The increment, or number of docs to add during each index writer instance.
* Whether or not the main text should be stored and highlightable.
    $ perl -Mblib=../../perl indexers/lucy_indexer.plx \
    > --docs=1000 --reps=6 --increment=10 --store=1
    $ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer \
    > -docs 1000 -reps 6 -increment 10 -store 1
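To make the increment setting concrete: with 1000 docs and an increment of 10,
each repetition would go through 100 index writer instances, adding 10
documents per instance. The Python sketch below only computes the batch
boundaries. writer_batches is a hypothetical helper, not code from either
indexer, and the handling of a final short batch is an assumption.

```python
def writer_batches(num_docs, increment):
    """Yield (start, end) document ranges, one per index writer instance.

    Assumption: if num_docs is not a multiple of increment, the last
    batch is simply shorter.
    """
    if increment < 1:
        raise ValueError("increment must be at least 1")
    for start in range(0, num_docs, increment):
        yield start, min(start + increment, num_docs)


# 1000 docs in increments of 10 gives 100 writer instances per repetition.
batches = list(writer_batches(1000, 10))
print(len(batches))
```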
If no command-line arguments are supplied, the apps will index the entire
19043-article collection once, using a single index writer, and will neither
store nor vectorize the main text.
Upon finishing, each app will produce a "truncated mean" report: the slowest
25% and fastest 25% of reps will be discarded, and the rest will be averaged.
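As a rough illustration of the truncated mean, the Python sketch below sorts
the per-rep timings, drops a quarter from each end, and averages what is left.
The rounding rule is an assumption: the text does not say how many reps are
discarded when the count is not divisible by four, so this sketch drops
len(times) // 4 from each end. The function name is hypothetical.

```python
def truncated_mean(times):
    """Average the middle timings after discarding both extremes.

    Assumption: the number of reps dropped from each end is
    len(times) // 4 (integer division); the real apps may round
    differently.
    """
    if not times:
        raise ValueError("need at least one timing")
    ordered = sorted(times)
    drop = len(ordered) // 4
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)
```

Under this rounding rule, a run with 6 reps would discard the single fastest
and single slowest rep and average the remaining four.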