 Support exists for downloading, parsing, and loading the English
 version of Wikipedia (enwiki).

 The build file can try to download the most current enwiki dataset
 (pages-articles.xml.bz2) automatically from the "latest" directory,
 http://download.wikimedia.org/enwiki/latest/. However, this file
 doesn't always exist, depending on where Wikipedia is in the dump
 process and whether prior dumps have succeeded. If the file doesn't
 exist, you can sometimes find an older or in-progress version by
 looking in the dated directories under
 http://download.wikimedia.org/enwiki/. For example, as of this
 writing, there is a pages-articles file in
 http://download.wikimedia.org/enwiki/20070402/. You can download this
 file manually and put it in the temp directory. Note that the file you
 download will probably have the date in its name, e.g.,
 http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
 When you put it in temp, rename it to
 enwiki-latest-pages-articles.xml.bz2.
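
 The manual staging step can be sketched in Java as follows. This is
 only an illustration and not part of the benchmark build: the class
 and method names (StageEnwikiDump, stage) are hypothetical, and the
 sketch assumes the dated dump has already been downloaded and that
 the temp directory is given relative to where you run it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StageEnwikiDump {
    // File name this README says the dump must have once it is in temp.
    static final String EXPECTED_NAME = "enwiki-latest-pages-articles.xml.bz2";

    /**
     * Moves a manually downloaded, dated dump into tempDir under the
     * expected name, creating tempDir if needed (hypothetical helper).
     */
    static Path stage(Path datedDump, Path tempDir) throws IOException {
        Files.createDirectories(tempDir);
        Path target = tempDir.resolve(EXPECTED_NAME);
        return Files.move(datedDump, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage:
        //   java StageEnwikiDump enwiki-20070402-pages-articles.xml.bz2 temp
        if (args.length != 2) {
            System.err.println("usage: StageEnwikiDump <dated-dump-file> <temp-dir>");
            return;
        }
        Path staged = stage(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Staged as " + staged);
    }
}
```

 Moving rather than copying avoids keeping two copies of a large
 archive; use Files.copy instead if you want to keep the dated file.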

 After that, ant enwiki should process the dataset and run a load
 test. The Ant targets get-enwiki, expand-enwiki, and extract-enwiki
 can also be used individually to download, decompress, and extract
 the dataset (to individual files in work/enwiki), respectively.

 NOTE: This bug in Xerces:

   https://issues.apache.org/jira/browse/XERCESJ-1257

 which is still present as of 2.9.1, causes an exception like this when
 processing Wikipedia's XML:

 Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
 	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
 	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
 	... 1 more

 The original poster in the Xerces bug provided this patch:

 --- UTF8Reader.java	2006-11-23 00:36:53.000000000 +0100
 +++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java	2008-04-04 00:40:58.000000000 +0200
 @@ -534,6 +534,16 @@
                      invalidByte(4, 4, b2);
                  }

 +                // check if output buffer is large enough to hold 2 surrogate chars
 +                if( out + 1 >= offset + length ){
 +                    fBuffer[0] = (byte)b0;
 +                    fBuffer[1] = (byte)b1;
 +                    fBuffer[2] = (byte)b2;
 +                    fBuffer[3] = (byte)b3;
 +                    fOffset = 4;
 +                    return out - offset;
 +		}
 +
                  // decode bytes into surrogate characters
                  int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
                  if (uuuuu > 0x10) {

 I've applied this patch to the Xerces 2.9.1 sources and committed the
 resulting jar as lib/xerces-2.9.1-patched-XERCESJ-1257.jar. Once
 XERCESJ-1257 is fixed, we can upgrade to a standard Xerces release.
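
 To see why the patch checks for room before decoding, note that a
 4-byte UTF-8 sequence decodes to two Java chars (a surrogate pair).
 The sketch below uses only the standard JDK, not Xerces. The class
 and method names (SurrogatePairDemo, lacksRoomForPair) are
 hypothetical, and it assumes that out is the next index to write and
 that offset + length is the exclusive end of the caller's buffer.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    /** Number of Java chars produced by decoding the given UTF-8 bytes. */
    static int decodedLength(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).length();
    }

    /**
     * Same condition as in the patch: true when fewer than two free
     * slots remain in the output buffer, so a surrogate pair cannot
     * be written in this call.
     */
    static boolean lacksRoomForPair(int out, int offset, int length) {
        return out + 1 >= offset + length;
    }

    public static void main(String[] args) {
        // U+10000, the first code point outside the BMP, as UTF-8 bytes.
        byte[] fourByte = {(byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80};
        String s = new String(fourByte, StandardCharsets.UTF_8);

        System.out.println("chars produced: " + decodedLength(fourByte));
        System.out.println("high surrogate first: " + Character.isHighSurrogate(s.charAt(0)));
        System.out.println("low surrogate second: " + Character.isLowSurrogate(s.charAt(1)));

        // A 10-char buffer with exactly one free slot left cannot take the pair.
        System.out.println("one slot left, must defer: " + lacksRoomForPair(9, 0, 10));
        System.out.println("two slots left, must defer: " + lacksRoomForPair(8, 0, 10));
    }
}
```

 When the condition holds, the patch saves the four bytes in fBuffer
 and returns the count of chars decoded so far, so the sequence is
 decoded on a later read.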