Support exists for downloading, parsing, and loading the English
version of Wikipedia (enwiki).

The build file can automatically try to download the most current
enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
http://download.wikimedia.org/enwiki/latest/. However, this file
doesn't always exist, depending on where Wikipedia is in the dump
process and whether prior dumps have succeeded. If this file doesn't
exist, you can sometimes find an older or in-progress version by
looking in the dated directories under
http://download.wikimedia.org/enwiki/. For example, as of this
writing, there is a page file in
http://download.wikimedia.org/enwiki/20070402/. You can download this
file manually and put it in temp. Note that the file you download
will probably have the date in its name, e.g.,
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
When you put it in temp, rename it to
enwiki-latest-pages-articles.xml.bz2.
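
For example, the manual download and rename might look like the
following sketch, assuming a Unix shell with wget available and that
you run it from the directory containing temp:

  # Download the dated dump file (URL from the example above).
  wget http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2

  # Rename it to the filename the build expects.
  mv enwiki-20070402-pages-articles.xml.bz2 \
     temp/enwiki-latest-pages-articles.xml.bz2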

After that, ant enwiki should process the dataset and run a load
test. The Ant targets get-enwiki, expand-enwiki, and extract-enwiki
can also be used individually to download, decompress, and extract
the dataset (to individual files in work/enwiki), respectively.