commit dd28a78f4967bb5706cfc2e2b4b8a2cee15dd351
author:    Mike Walch <mwalch@gmail.com>  Wed Sep 16 11:12:06 2015 -0400
committer: Mike Walch <mwalch@gmail.com>  Wed Sep 16 11:12:06 2015 -0400
tree:      2f12bfe5578e546b768d63caeefd9f135bb14dfa
parent:    9fa46c413a928b337fb019dbad8abf1c9fbb7b1c
parent:    8245084544f9e6b94eb760a1e1b3c974a73838e0

Merge pull request #5 from keith-turner/fix-export-queue

Fixed export queue and generalized Fluo setup
An example Fluo application that creates a web index using CommonCrawl data.
In order to run this application, you need the following installed and running on your machine:

* Hadoop (HDFS)
* Accumulo
* Fluo
Consider using fluo-dev to run these requirements.
First, you must create `data.yml` and `dropwizard.yml` files and edit them for your environment:

```sh
cd conf
cp data.yml.example data.yml
cp dropwizard.yml.example dropwizard.yml
```
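The example files carry the real schema; purely for orientation, here is a hypothetical sketch of the kind of settings `data.yml` might hold. Every key name below is an assumption for illustration, not the project's actual configuration:

```yaml
# Hypothetical example only -- these key names are assumptions,
# not the project's actual configuration schema.
hdfsDataDir: /cc/data                     # assumed HDFS directory for downloaded CommonCrawl files
fluoPropsPath: /path/to/fluo.properties   # assumed pointer to your Fluo configuration
accumuloTable: webindex                   # assumed Accumulo table backing the index
```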
Next, run the following command to download CommonCrawl data files. The data files can have a `fileType` of `wat`, `wet`, or `warc`. The command downloads a file containing the URL paths of thousands of files, and `numFiles` specifies how many of those files will be downloaded from AWS and loaded into your HDFS instance.
```sh
# Command structure
./bin/download.sh <fileType> <numFiles>

# Use command below for this example
./bin/download.sh wat 1
```
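For orientation, here is a minimal sketch of what this download step amounts to, assuming a CommonCrawl paths file on S3 and an HDFS destination directory. The URL, crawl name, and HDFS path below are assumptions for illustration, not the script's actual values:

```sh
#!/usr/bin/env bash
# Rough sketch of the download flow; URLs and paths are assumptions.
FILE_TYPE=$1    # wat, wet, or warc
NUM_FILES=$2    # how many CommonCrawl files to fetch

# 1. Fetch the paths file listing thousands of data files (URL/crawl assumed)
PATHS_URL="https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-35/${FILE_TYPE}.paths.gz"

# 2. Take the first NUM_FILES entries, download each from AWS, load into HDFS
curl -s "$PATHS_URL" | gunzip | head -n "$NUM_FILES" | while read -r p; do
  curl -s -O "https://commoncrawl.s3.amazonaws.com/$p"
  hdfs dfs -put -f "$(basename "$p")" /cc/data/   # assumed HDFS destination
done
```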
Next, run the following command to start Fluo and initialize it and Accumulo with data:

```sh
./bin/init.sh
```
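For orientation, a hedged outline of what an init step like this typically does for a Fluo application; the exact commands and the app name are assumptions about `init.sh`, not its actual contents:

```sh
# Assumed outline of init.sh -- not the script's actual contents.
# Initialize the application's Fluo state in ZooKeeper/Accumulo,
# start its oracle and workers, then load the downloaded data.
fluo init webindex    # initialize the app (command form and app name assumed)
fluo start webindex   # launch oracle and workers (assumed)
# ...followed by a load job that pushes the HDFS files into Fluo (assumed).
```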
Finally, run the following command to start the web app:

```sh
./bin/webapp.sh
```
Open your browser to http://localhost:8080/
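A quick way to confirm the app is serving (the port comes from the URL above):

```sh
# Fetch the index page to confirm the web app responds
curl -s http://localhost:8080/ | head
```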