An example Fluo application that creates a web index using CommonCrawl data.
To run this application, you need its dependencies installed and running on your machine. The steps below use at least Fluo, Accumulo, and HDFS.
Consider using fluo-dev to run these requirements.
First, you must create the data.yml and dropwizard.yml files and edit them for your environment:
cd conf
cp data.yml.example data.yml
cp dropwizard.yml.example dropwizard.yml
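If you rerun the setup, a plain cp overwrites any edits you already made. The sketch below is one way to make the copy step repeatable; copy_example_configs is a hypothetical helper, not a script shipped with this project, and it assumes the example files all end in .example.

```shell
# Hypothetical helper (not part of this project): for every *.example file
# in the given directory, create the real config file unless it already exists.
copy_example_configs() {
  conf_dir="$1"
  for example in "$conf_dir"/*.example; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$example" ] || continue
    target="${example%.example}"   # strip the .example suffix
    if [ -e "$target" ]; then
      echo "keeping existing $target"
    else
      cp "$example" "$target"
      echo "created $target"
    fi
  done
}
```

Run from the project root, `copy_example_configs conf` would create conf/data.yml and conf/dropwizard.yml on the first run and leave your edited copies alone afterwards.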
Next, run the following command to download CommonCrawl data files. The data files can have a fileType of wat, wet, or warc. The command first downloads a file containing the URL paths of thousands of data files; numFiles specifies how many of those files will be downloaded from AWS and loaded into your HDFS instance.
# Command structure
./bin/download.sh <fileType> <numFiles>
# Use command below for this example
./bin/download.sh wat 1
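To illustrate how the two arguments interact, here is a minimal sketch of the selection step. It is not the project's download.sh: select_paths is a hypothetical function, and it assumes the downloaded paths file is plain text with one path per line.

```shell
# Hypothetical sketch (not the project's download.sh): validate the arguments
# and print the first <numFiles> entries of a paths file.
select_paths() {
  file_type="$1"
  num_files="$2"
  paths_file="$3"   # assumed: plain text, one path per line

  case "$file_type" in
    wat|wet|warc) ;;
    *) echo "fileType must be wat, wet, or warc" >&2; return 1 ;;
  esac

  case "$num_files" in
    ''|0|*[!0-9]*) echo "numFiles must be a positive integer" >&2; return 1 ;;
  esac

  # Only the first num_files paths are kept; the rest are ignored.
  head -n "$num_files" "$paths_file"
}
```

With `wat 1`, as in the example command, only the first path in the list would be selected, which keeps the download small for a first run.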
Next, run the following command to start Fluo and initialize both it and Accumulo with data:
./bin/init.sh
Finally, run the following command to start the web app:
./bin/webapp.sh
Open your browser to http://localhost:8080/
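
The web app may take a moment to start. As a rough sketch, you could poll the URL from a second terminal before opening the browser; wait_for_url is a hypothetical helper that assumes curl is installed and that the root page returns a successful HTTP status once the app is up.

```shell
# Hypothetical helper: poll a URL until it answers with a successful HTTP
# status, giving up after a fixed number of attempts.
wait_for_url() {
  url="$1"
  attempts="$2"   # maximum number of tries
  delay="$3"      # seconds to sleep between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
    if curl -sf -o /dev/null "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "no response from $url" >&2
  return 1
}
```

For example, `wait_for_url http://localhost:8080/ 30 2` would try for up to about a minute before giving up.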