Fixes #88 - Refactored Accumulo export code

* Updates due to changes in apache/incubator-fluo-recipes#102
* All objects placed on export queue now implement IndexUpdate
  interface and only represent data that is being exported
* Moved code that creates Accumulo mutations from export objects
  to IndexClient
25 files changed

Webindex


WebIndex is an example Apache Fluo application that uses Common Crawl web crawl data to index links to web pages in multiple ways. It has a simple UI for viewing the resulting indexes. If you are new to Fluo, you may want to start with the phrasecount application, as the WebIndex application is more complicated. For more information on how the WebIndex application works, view the tables and code documentation.

Running WebIndex

If you are new to WebIndex, the simplest way to run the application is to run the development server. First, clone the WebIndex repo:

git clone https://github.com/astralway/webindex.git

Next, on a machine where Java and Maven are installed, run the development server using the webindex command:

cd webindex/
./bin/webindex dev

This will build and start the development server, which logs to the console. To terminate the server, press Ctrl-C.

The development server starts a MiniAccumuloCluster and runs MiniFluo on top of it. It parses a Common Crawl data file and creates a file at data/1K-pages.txt containing 1000 pages, which are loaded into MiniFluo. Fluo processes the pages and exports indexes to Accumulo. A web application that queries these indexes is started at http://localhost:4567.
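
Since the development server has to build and load data before the web application responds, it may help to poll the address from a second terminal. The helper below is only a sketch and is not part of WebIndex: the function name `wait_for_url` and its retry defaults are hypothetical, and it assumes `curl` is installed.

```shell
# Hypothetical helper (not shipped with WebIndex): poll a URL until it
# responds successfully or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  url=$1
  attempts=${2:-30}   # assumed default: 30 tries
  delay=${3:-2}       # assumed default: 2 seconds between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl return non-zero on HTTP error responses
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (run while ./bin/webindex dev is starting in another terminal):
#   wait_for_url http://localhost:4567 && echo "WebIndex UI is up"
```

The function returns 0 as soon as the URL answers and 1 if every attempt fails, so it can be chained with `&&` or used in an `if` statement.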

If you would like to run WebIndex on a cluster, follow the install instructions.