blob: 399f0f3eb4c2ef7fc6cf688b6844b0fad685d1bb [file] [log] [blame]
Apache Accumulo Wikipedia Search Example (parallel version)
This project contains a sample application for ingesting and querying wikipedia data.
Ingest
------
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
You will want to grab the files with the link name of pages-articles.xml.bz2
INSTRUCTIONS
------------
1. Copy the ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml and change it to specify Accumulo information.
2. Copy the ingest/lib/wikisearch-*.jar and ingest/lib/protobuf*.jar to $ACCUMULO_HOME/lib/ext
3. Then run ingest/bin/ingest_parallel.sh with one argument (the name of the directory in HDFS where the wikipedia XML
files reside) and this will kick off a MapReduce job to ingest the data into Accumulo.
Query
-----
Prerequisites
-------------
1. The query software was tested using JBoss AS 6. Install this unless you feel like messing with the installation.
NOTE: Ran into a bug (https://issues.jboss.org/browse/RESTEASY-531) that did not allow an EJB3.1 war file. The
workaround is to separate the RESTEasy servlet from the EJBs by creating an EJB jar and a WAR file.
INSTRUCTIONS
-------------
1. Copy the query/src/main/resources/META-INF/ejb-jar.xml.example file to
query/src/main/resources/META-INF/ejb-jar.xml. Modify to the file to contain the same
information that you put into the wikipedia.xml file from the Ingest step above.
2. Re-build the query distribution by running 'mvn package assembly:single' in the top-level directory.
3. Untar the resulting file in the $JBOSS_HOME/server/default directory.
$ cd $JBOSS_HOME/server/default
$ tar -xzf $ACCUMULO_HOME/src/examples/wikisearch/query/target/wikisearch-query*.tar.gz
This will place the dependent jars in the lib directory and the EJB jar into the deploy directory.
4. Next, copy the wikisearch*.war file in the query-war/target directory to $JBOSS_HOME/server/default/deploy.
5. Start JBoss ($JBOSS_HOME/bin/run.sh)
6. Use the Accumulo shell and give the user permissions for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
7. Copy the following jars to the $ACCUMULO_HOME/lib/ext directory from the $JBOSS_HOME/server/default/lib directory:
commons-lang*.jar
kryo*.jar
minlog*.jar
commons-jexl*.jar
guava*.jar
8. Copy the $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar to $ACCUMULO_HOME/lib/ext.
9. At this point you should be able to open a browser and view the page: http://localhost:8080/accumulo-wikisearch/ui/ui.jsp.
You can issue the queries using this user interface or via the following REST urls: <host>/accumulo-wikisearch/rest/Query/xml,
<host>/accumulo-wikisearch/rest/Query/html, <host>/accumulo-wikisearch/rest/Query/yaml, or <host>/accumulo-wikisearch/rest/Query/json.
There are two parameters to the REST service, query and auths. The query parameter is the same string that you would type
into the search box at ui.jsp, and the auths parameter is a comma-separated list of wikis that you want to search (i.e.
enwiki,frwiki,dewiki, etc. Or you can use all)