| Support exists for downloading, parsing, and loading the English |
| version of Wikipedia (enwiki). |
| |
| The build file can automatically try to download the most current |
| enwiki dataset (pages-articles.xml.bz2) from the "latest" directory, |
| http://download.wikimedia.org/enwiki/latest/. However, this file |
| doesn't always exist, depending on where Wikipedia is in the dump |
| process and whether prior dumps have succeeded. If this file doesn't |
| exist, you can sometimes find an older or in-progress version by |
| looking in the dated directories under |
| http://download.wikimedia.org/enwiki/. For example, as of this |
| writing, there is a page file in |
| http://download.wikimedia.org/enwiki/20070402/. You can download this |
| file manually and put it in the temp directory. Note that the file you |
| download will probably have the date in its name, e.g., |
| http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2. |
| When you put it in temp, rename it to enwiki-latest-pages-articles.xml.bz2. |
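| |
| The manual rename step can be sketched in Java as follows. This is only |
| an illustration and not part of the build: the class |
| EnwikiDumpInstaller and its methods are hypothetical names, and the |
| sketch assumes the dated file name follows the enwiki-YYYYMMDD-... |
| pattern shown above and that temp is relative to the working directory. |
| |
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of the benchmark build.
class EnwikiDumpInstaller {

  // Maps a dated dump name such as enwiki-20070402-pages-articles.xml.bz2
  // to the name the build expects: enwiki-latest-pages-articles.xml.bz2.
  // Names that do not match the assumed pattern are returned unchanged.
  static String latestName(String datedFileName) {
    return datedFileName.replaceFirst("^enwiki-\\d{8}-", "enwiki-latest-");
  }

  // Moves the downloaded file into tempDir under the "latest" name,
  // creating tempDir if needed and replacing any existing file.
  static Path install(Path downloaded, Path tempDir) throws IOException {
    Files.createDirectories(tempDir);
    String name = latestName(downloaded.getFileName().toString());
    return Files.move(downloaded, tempDir.resolve(name),
        StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: EnwikiDumpInstaller <downloaded-dump-file>");
      return;
    }
    // Assumption: the process is started from the directory that contains temp.
    Path installed = install(Paths.get(args[0]), Paths.get("temp"));
    System.out.println("Installed " + installed);
  }
}
```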
| |
| After that, running ant enwiki should process the dataset and run a |
| load test. The Ant targets get-enwiki, expand-enwiki, and |
| extract-enwiki can also be used to download, decompress, and extract |
| (to individual files in work/enwiki) the dataset, respectively. |
| |
| NOTE: This bug in Xerces: |
| |
| https://issues.apache.org/jira/browse/XERCESJ-1257 |
| |
| which is still present as of 2.9.1, causes an exception like this when |
| processing Wikipedia's XML: |
| |
| Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence. |
| at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source) |
| at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source) |
| at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) |
| at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) |
| at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) |
| at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) |
| at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) |
| at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) |
| at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) |
| at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) |
| at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77) |
| ... 1 more |
| |
| The original poster in the Xerces bug provided this patch: |
| |
| --- UTF8Reader.java 2006-11-23 00:36:53.000000000 +0100 |
| +++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java 2008-04-04 00:40:58.000000000 +0200 |
| @@ -534,6 +534,16 @@ |
| invalidByte(4, 4, b2); |
| } |
| |
| + // check if output buffer is large enough to hold 2 surrogate chars |
| + if( out + 1 >= offset + length ){ |
| + fBuffer[0] = (byte)b0; |
| + fBuffer[1] = (byte)b1; |
| + fBuffer[2] = (byte)b2; |
| + fBuffer[3] = (byte)b3; |
| + fOffset = 4; |
| + return out - offset; |
| + } |
| + |
| // decode bytes into surrogate characters |
| int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003); |
| if (uuuuu > 0x10) { |
| |
| which I've applied to the Xerces 2.9.1 sources and committed as |
| lib/xerces-2.9.1-patched-XERCESJ-1257.jar. Once XERCESJ-1257 is fixed, |
| we can upgrade to a standard Xerces release. |
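| |
| As background on why the patch checks for room for two chars: a 4-byte |
| UTF-8 sequence encodes a supplementary code point, which Java stores as |
| a surrogate pair. The sketch below is not the Xerces code. It uses |
| Character.toChars from the JDK, its class and method names are |
| hypothetical, and it assumes the four bytes form a valid sequence. It |
| mirrors only the bounds check from the patch and simply returns 0 when |
| deferring, whereas the patch also saves the four bytes in fBuffer so |
| they can be decoded on the next read. |
| |
```java
// Hypothetical illustration of the bounds check; not the Xerces implementation.
class SurrogateRoomCheck {

  // True if two chars starting at index out fit before index end (exclusive).
  // The patch defers decoding when out + 1 >= offset + length.
  static boolean hasRoomForPair(int out, int end) {
    return out + 1 < end;
  }

  // Decodes one 4-byte UTF-8 sequence into ch starting at index out.
  // Returns the number of chars written (2), or 0 when there is no room
  // for the surrogate pair and decoding has to be deferred.
  // Assumption: b0..b3 are unsigned byte values of a valid 4-byte sequence.
  static int decode4(int b0, int b1, int b2, int b3, char[] ch, int out, int end) {
    if (!hasRoomForPair(out, end)) {
      return 0;
    }
    int codePoint = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
        | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    return Character.toChars(codePoint, ch, out);
  }

  public static void main(String[] args) {
    char[] buf = new char[2];
    // U+1F600 is encoded in UTF-8 as F0 9F 98 80.
    int written = decode4(0xF0, 0x9F, 0x98, 0x80, buf, 0, buf.length);
    System.out.println("written with room for two chars: " + written);
    // Only one slot left: decoding is deferred instead of overrunning.
    int deferred = decode4(0xF0, 0x9F, 0x98, 0x80, buf, 1, buf.length);
    System.out.println("written with room for one char: " + deferred);
  }
}
```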