Apache Accumulo Wikipedia Search Example

This project contains a sample application for ingesting and querying Wikipedia data.

Ingest
------

Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
   You will want to grab the files whose link names end in pages-articles.xml.bz2 (see the example commands
   after this list).
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:

   $ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
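
A minimal sketch of prerequisites 2 and 3. The HDFS directory /wikipedia is only an illustrative path;
substitute the location and file names of the dumps you actually downloaded. The second command shows
loading a dump without decompressing it, which works but makes ingest slower.

   # Create the target directory in HDFS (illustrative path)
   $ hadoop fs -mkdir /wikipedia
   # Alternatively to the decompression pipeline above, put a dump into HDFS as-is
   $ hadoop fs -put enwiki-*-pages-articles.xml.bz2 /wikipedia/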

INSTRUCTIONS
------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit it to specify your Accumulo
   connection information.
2. Copy ingest/lib/wikisearch-*.jar and ingest/lib/protobuf*.jar to $ACCUMULO_HOME/lib/ext.
3. Run ingest/bin/ingest.sh with one argument, the HDFS directory where the wikipedia XML files reside;
   this kicks off a MapReduce job that ingests the data into Accumulo (see the example commands after this list).
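
A minimal sketch of steps 2 and 3, assuming the dumps were placed in the illustrative /wikipedia directory
from the prerequisite example above and that the commands are run from the top of the wikisearch distribution:

   # Make the ingest and protobuf jars visible to Accumulo (step 2)
   $ cp ingest/lib/wikisearch-*.jar ingest/lib/protobuf*.jar $ACCUMULO_HOME/lib/ext
   # Kick off the MapReduce ingest job over the HDFS directory holding the dumps (step 3)
   $ ingest/bin/ingest.sh /wikipedia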

Query
-----

Prerequisites
-------------
1. The query software was tested using JBoss AS 6. Install that version unless you are willing to adapt the
   installation yourself.
   NOTE: We ran into a bug (https://issues.jboss.org/browse/RESTEASY-531) that did not allow an EJB 3.1 WAR file.
   The workaround is to separate the RESTEasy servlet from the EJBs by creating an EJB jar and a WAR file.

INSTRUCTIONS
------------
1. Copy the query/src/main/resources/META-INF/ejb-jar.xml.example file to
   query/src/main/resources/META-INF/ejb-jar.xml. Modify the file to contain the same
   information that you put into the wikipedia.xml file from the Ingest step above.
2. Re-build the query distribution by running 'mvn package assembly:single' in the query module's directory.
3. Untar the resulting file in the $JBOSS_HOME/server/default directory.

   $ cd $JBOSS_HOME/server/default
   $ tar -xzf /some/path/to/wikisearch/query/target/wikisearch-query*.tar.gz

   This will place the dependent jars in the lib directory and the EJB jar into the deploy directory.
4. Next, copy the wikisearch*.war file from the query-war/target directory to $JBOSS_HOME/server/default/deploy.
5. Start JBoss ($JBOSS_HOME/bin/run.sh).
6. Use the Accumulo shell to grant the user authorizations for the wikis that you loaded, for example:

   setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
7. Copy the following jars from the $JBOSS_HOME/server/default/lib directory to the $ACCUMULO_HOME/lib/ext directory:

   kryo*.jar
   minlog*.jar
   commons-jexl*.jar

8. Copy $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar to $ACCUMULO_HOME/lib/ext.
9. At this point you should be able to open a browser and view the page http://localhost:8080/accumulo-wikisearch/ui/ui.jsp.
   You can issue queries using this user interface or via the following REST URLs: <host>/accumulo-wikisearch/rest/Query/xml,
   <host>/accumulo-wikisearch/rest/Query/html, <host>/accumulo-wikisearch/rest/Query/yaml, or <host>/accumulo-wikisearch/rest/Query/json.
   There are two parameters to the REST service, query and auths. The query parameter is the same string that you would type
   into the search box at ui.jsp, and the auths parameter is a comma-separated list of wikis that you want to search
   (e.g. enwiki,frwiki,dewiki, or all). See the example commands after this list.
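
A sketch of steps 7 through 9 using the default paths above. The query value in the curl request is only a
placeholder for whatever you would type into the search box at ui.jsp, and the auths value should list the
wikis you actually loaded.

   # Copy the query-time dependencies from JBoss into Accumulo's external lib directory (steps 7 and 8)
   $ cp $JBOSS_HOME/server/default/lib/kryo*.jar \
        $JBOSS_HOME/server/default/lib/minlog*.jar \
        $JBOSS_HOME/server/default/lib/commons-jexl*.jar \
        $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar \
        $ACCUMULO_HOME/lib/ext
   # Issue a search against the JSON REST endpoint (step 9); query and auths values are illustrative
   $ curl -G http://localhost:8080/accumulo-wikisearch/rest/Query/json \
       --data-urlencode 'query=<your search string>' \
       --data-urlencode 'auths=enwiki'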