Wikisearch Installation

Instructions for installing and running the Accumulo Wikisearch example.

Ingest

Prerequisites

  1. Accumulo, Hadoop, and ZooKeeper must be installed and running

  2. Download one or more wikipedia dump files and put them in an HDFS directory. You will want to grab the files with the link name of pages-articles.xml.bz2. Though not strictly required, the ingest will go more quickly if the files are decompressed:

     $ bunzip2 enwiki-*-pages-articles.xml.bz2
     $ hadoop fs -put enwiki-*-pages-articles.xml /wikipedia/enwiki-pages-articles.xml
    

Instructions

  1. Create a wikipedia.xml file (or wikipedia_parallel.xml if running parallel version) from wikipedia.xml.example or wikipedia_parallel.xml.example and modify for your Accumulo installation.

     $ cd ingest/conf
     $ cp wikipedia.xml.example wikipedia.xml
     $ vim wikipedia.xml
    
  2. Copy ingest/lib/wikisearch-*.jar to $ACCUMULO_HOME/lib/ext

  3. Run ingest/bin/ingest.sh (or ingest_parallel.sh if running parallel version) with one argument (the name of the directory in HDFS where the wikipedia XML files reside) and this will kick off a MapReduce job to ingest the data into Accumulo.
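As an illustration of step 1, wikipedia.xml holds Hadoop-style configuration properties for your Accumulo connection; the property names and values below are placeholders, so defer to the actual wikipedia.xml.example in your checkout:

     <property>
       <name>wikipedia.accumulo.zookeepers</name>
       <value>localhost:2181</value>
     </property>
     <property>
       <name>wikipedia.accumulo.instance_name</name>
       <value>accumulo</value>
     </property>

For step 3, assuming the dump files were placed in /wikipedia as in the prerequisites above, the invocation would look like:

     $ ./ingest/bin/ingest.sh /wikipedia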

Query

Prerequisites

  1. The query software was tested using JBoss AS 6. Install the JBoss distribution and follow the instructions below to build the EJB jar and WAR file required.

     • To stop the JBoss warnings about WSDescriptorDeployer and JMSDescriptorDeployer, these deployers can be removed from $JBOSS_HOME/server/default/deployers/jbossws.deployer/META-INF/stack-agnostic-jboss-beans.xml.

  2. Ensure that you have successfully run mvn clean install at the Wikisearch top level to install the jars into your local Maven repository before building the query package.
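The top-level build in the last prerequisite amounts to the following; the checkout path here is hypothetical:

     $ cd /path/to/wikisearch
     $ mvn clean install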

Instructions

  1. Create a ejb-jar.xml from ejb-jar.xml.example and modify it to contain the same information that you put into wikipedia.xml in the ingest steps above:

     cd query/src/main/resources/META-INF/
     cp ejb-jar.xml.example ejb-jar.xml
     vim ejb-jar.xml
    
  2. Re-build the query distribution by running mvn package assembly:single in the query module's directory.

  3. Untar the resulting file in the $JBOSS_HOME/server/default directory.

     $ cd $JBOSS_HOME/server/default
     $ tar -xzf /some/path/to/wikisearch/query/target/wikisearch-query*.tar.gz
    

    This will place the dependent jars in the lib directory and the EJB jar into the deploy directory.

  4. Next, copy the wikisearch*.war file from the query-war/target directory to $JBOSS_HOME/server/default/deploy.

  5. Start JBoss ($JBOSS_HOME/bin/run.sh)

  6. Use the Accumulo shell and give the user permissions for the wikis that you loaded:

     > setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
    
  7. Copy the following jars to the $ACCUMULO_HOME/lib/ext directory from the $JBOSS_HOME/server/default/lib directory:

     kryo*.jar
     minlog*.jar
     commons-jexl*.jar
    
  8. Copy $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar to $ACCUMULO_HOME/lib/ext.

  9. At this point you should be able to open a browser and view the page:

     http://localhost:8080/accumulo-wikisearch/ui.html
    
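Steps 7 and 8 can be carried out with a couple of copy commands; this transcript assumes the default server profile used throughout these instructions:

     $ cd $JBOSS_HOME/server/default
     $ cp lib/kryo*.jar lib/minlog*.jar lib/commons-jexl*.jar $ACCUMULO_HOME/lib/ext/
     $ cp deploy/wikisearch-query*.jar $ACCUMULO_HOME/lib/ext/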

You can issue queries using this user interface or via the following REST URLs:

    <host>/accumulo-wikisearch/rest/Query/xml
    <host>/accumulo-wikisearch/rest/Query/html
    <host>/accumulo-wikisearch/rest/Query/yaml
    <host>/accumulo-wikisearch/rest/Query/json

There are two parameters to the REST service: query and auths. The query parameter is the same string that you would type into the search box at ui.jsp, and the auths parameter is a comma-separated list of wikis that you want to search (e.g. enwiki,frwiki,dewiki), or you can use all.
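For example, a query against the JSON endpoint can be issued with curl; the host, query string, and auths below are placeholders (--data-urlencode handles the escaping and requires curl 7.18 or later):

     $ curl -G 'http://localhost:8080/accumulo-wikisearch/rest/Query/json' \
         --data-urlencode 'query=TEXT == "some term"' \
         --data-urlencode 'auths=enwiki,frwiki'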

  • NOTE: A bug was encountered that did not allow an EJB 3.1 WAR file to deploy. The workaround is to separate the RESTEasy servlet from the EJBs by creating an EJB jar and a WAR file.