Apache Accumulo Wikipedia Search Example

This project contains a sample application for ingesting and querying Wikipedia data.

Ingest
------

Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
   You will want to grab the files whose link names end in pages-articles.xml.bz2 (see the example commands
   after this list).
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:

   $ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
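
A minimal sketch of prerequisites 2 and 3. The HDFS directory /wikipedia is only an illustrative path;
substitute the location and file names of the dumps you actually downloaded. The second command shows
loading a dump without decompressing it, which works but makes ingest slower.

   # Create the target directory in HDFS (illustrative path)
   $ hadoop fs -mkdir /wikipedia
   # Alternatively to the decompression pipeline above, put a dump into HDFS as-is
   $ hadoop fs -put enwiki-*-pages-articles.xml.bz2 /wikipedia/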

INSTRUCTIONS
------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit it to specify your Accumulo
   connection information.
2. Copy ingest/lib/wikisearch-*.jar and ingest/lib/protobuf*.jar to $ACCUMULO_HOME/lib/ext.
3. Run ingest/bin/ingest.sh with one argument, the HDFS directory where the wikipedia XML files reside;
   this kicks off a MapReduce job that ingests the data into Accumulo (see the example commands after this list).
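
A minimal sketch of steps 2 and 3, assuming the dumps were placed in the illustrative /wikipedia directory
from the prerequisite example above and that the commands are run from the top of the wikisearch distribution:

   # Make the ingest and protobuf jars visible to Accumulo (step 2)
   $ cp ingest/lib/wikisearch-*.jar ingest/lib/protobuf*.jar $ACCUMULO_HOME/lib/ext
   # Kick off the MapReduce ingest job over the HDFS directory holding the dumps (step 3)
   $ ingest/bin/ingest.sh /wikipedia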

Query
-----

Prerequisites
-------------
1. The query software was tested using JBoss AS 6. Install that version unless you are willing to adapt the
   installation yourself.
   NOTE: We ran into a bug (https://issues.jboss.org/browse/RESTEASY-531) that did not allow an EJB 3.1 WAR file.
   The workaround is to separate the RESTEasy servlet from the EJBs by creating an EJB jar and a WAR file.

INSTRUCTIONS
------------
1. Copy the query/src/main/resources/META-INF/ejb-jar.xml.example file to
   query/src/main/resources/META-INF/ejb-jar.xml. Modify the file to contain the same
   information that you put into the wikipedia.xml file from the Ingest step above.
2. Re-build the query distribution by running 'mvn package assembly:single' in the query module's directory.
3. Untar the resulting file in the $JBOSS_HOME/server/default directory.

   $ cd $JBOSS_HOME/server/default
   $ tar -xzf /some/path/to/wikisearch/query/target/wikisearch-query*.tar.gz

   This will place the dependent jars in the lib directory and the EJB jar into the deploy directory.
4. Next, copy the wikisearch*.war file from the query-war/target directory to $JBOSS_HOME/server/default/deploy.
5. Start JBoss ($JBOSS_HOME/bin/run.sh).
6. Use the Accumulo shell to grant the user authorizations for the wikis that you loaded, for example:

   setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
7. Copy the following jars from the $JBOSS_HOME/server/default/lib directory to the $ACCUMULO_HOME/lib/ext directory:

   kryo*.jar
   minlog*.jar
   commons-jexl*.jar

8. Copy $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar to $ACCUMULO_HOME/lib/ext.
9. At this point you should be able to open a browser and view the page http://localhost:8080/accumulo-wikisearch/ui/ui.jsp.
   You can issue queries using this user interface or via the following REST URLs: <host>/accumulo-wikisearch/rest/Query/xml,
   <host>/accumulo-wikisearch/rest/Query/html, <host>/accumulo-wikisearch/rest/Query/yaml, or <host>/accumulo-wikisearch/rest/Query/json.
   There are two parameters to the REST service, query and auths. The query parameter is the same string that you would type
   into the search box at ui.jsp, and the auths parameter is a comma-separated list of wikis that you want to search
   (e.g. enwiki,frwiki,dewiki, or all). See the example commands after this list.
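
A sketch of steps 7 through 9 using the default paths above. The query value in the curl request is only a
placeholder for whatever you would type into the search box at ui.jsp, and the auths value should list the
wikis you actually loaded.

   # Copy the query-time dependencies from JBoss into Accumulo's external lib directory (steps 7 and 8)
   $ cp $JBOSS_HOME/server/default/lib/kryo*.jar \
        $JBOSS_HOME/server/default/lib/minlog*.jar \
        $JBOSS_HOME/server/default/lib/commons-jexl*.jar \
        $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar \
        $ACCUMULO_HOME/lib/ext
   # Issue a search against the JSON REST endpoint (step 9); query and auths values are illustrative
   $ curl -G http://localhost:8080/accumulo-wikisearch/rest/Query/json \
       --data-urlencode 'query=<your search string>' \
       --data-urlencode 'auths=enwiki'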