lucene/benchmark/README.enwiki - lucene-solr - Git at Google

 Support exists for downloading, parsing, and loading the English
 version of wikipedia (enwiki).

 The build file can automatically try to download the most current
 enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
 http://download.wikimedia.org/enwiki/latest/. However, this file
 doesn't always exist, depending on where wikipedia is in the dump
 process and whether prior dumps have succeeded. If this file doesn't
 exist, you can sometimes find an older or in progress version by
 looking in the dated directories under
 http://download.wikimedia.org/enwiki/. For example, as of this
 writing, there is a page file in
 http://download.wikimedia.org/enwiki/20070402/. You can download this
 file manually and put it in temp. Note that the file you download will
 probably have the date in the name, e.g.,
 http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.

 If you use the EnwikiContentSource then the data will be decompressed on the fly
 during the benchmark.  If you want to benchmark indexing, you should probably decompress
 it beforehand using the "enwiki" Ant target which will produce a work/enwiki.txt, after
 which you can use LineDocSource in your benchmark.

 After that, ant enwiki should process the data set and run a load
 test. Ant target enwiki will download, decompress, and extract (to individual files
 in work/enwiki) the dataset, respectively.
	Support exists for downloading, parsing, and loading the English
	version of wikipedia (enwiki).

	The build file can automatically try to download the most current
	enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
	http://download.wikimedia.org/enwiki/latest/. However, this file
	doesn't always exist, depending on where wikipedia is in the dump
	process and whether prior dumps have succeeded. If this file doesn't
	exist, you can sometimes find an older or in progress version by
	looking in the dated directories under
	http://download.wikimedia.org/enwiki/. For example, as of this
	writing, there is a page file in
	http://download.wikimedia.org/enwiki/20070402/. You can download this
	file manually and put it in temp. Note that the file you download will
	probably have the date in the name, e.g.,
	http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.

	If you use the EnwikiContentSource then the data will be decompressed on the fly
	during the benchmark. If you want to benchmark indexing, you should probably decompress
	it beforehand using the "enwiki" Ant target which will produce a work/enwiki.txt, after
	which you can use LineDocSource in your benchmark.

	After that, ant enwiki should process the data set and run a load
	test. Ant target enwiki will download, decompress, and extract (to individual files
	in work/enwiki) the dataset, respectively.