Add 'webindex/' from commit '91dc7cb6fc72c79a53c6b7d0a6c0599cd8eacb9b'
git-subtree-dir: webindex
git-subtree-mainline: f762da6d8f93dec655741632dd534d1287d1a6ec
git-subtree-split: 91dc7cb6fc72c79a53c6b7d0a6c0599cd8eacb9b
diff --git a/LICENSE b/LICENSE
index 8f71f43..d645695 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,3 +1,4 @@
+
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
@@ -178,7 +179,7 @@
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "{}"
+ boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
@@ -186,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
- Copyright {yyyy} {name of copyright owner}
+ Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -199,4 +200,3 @@
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-
diff --git a/README.md b/README.md
index f869d20..6a9cfba 100644
--- a/README.md
+++ b/README.md
@@ -1,76 +1 @@
-![Webindex][logo]
----
-[![Build Status][ti]][tl] [![Apache License][li]][ll]
-
-Webindex is an example [Apache Fluo][fluo] application that incrementally indexes links to web pages
-in multiple ways. If you are new to Fluo, you may want to start with the [Fluo tour][tour], as the
-WebIndex application is more complicated. For more information on how the WebIndex application
-works, view the [tables](docs/tables.md) and [code](docs/code-guide.md) documentation.
-
-Webindex utilizes multiple projects. [Common Crawl][cc] web crawl data is used as the input.
-[Apache Spark][spark] is used to initialize Fluo and incrementally load data into Fluo. [Apache
-Accumulo][accumulo] is used to hold the indexes and Fluo's data. Fluo is used to continuously
-combine new and historical information about web pages and update an external index when changes
-occur. Webindex has a simple UI built using [Spark Java][sparkjava] that allows querying the indexes.
-
-Below is a video showing repeatedly querying stackoverflow.com while Webindex was running for three
-days on EC2. The video was made by querying the Webindex instance periodically and taking a
-screenshot. More details about this video are available in this [blog post][bp].
-
-[![Querying stackoverflow.com](http://img.youtube.com/vi/mJJNJbPN2EI/0.jpg)](http://www.youtube.com/watch?v=mJJNJbPN2EI)
-
-## Running WebIndex
-
-If you are new to WebIndex, the simplest way to run the application is to run the development
-server. First, clone the WebIndex repo:
-
- git clone https://github.com/astralway/webindex.git
-
-Next, on a machine where Java and Maven are installed, run the development server using the
-`webindex` command:
-
- cd webindex/
- ./bin/webindex dev
-
-This will build and start the development server, which logs to the console. The 'dev' command
-has several command-line options, which can be viewed by running it with `-h`. When you want to
-terminate the server, press `CTRL-c`.
-
-The development server starts a MiniAccumuloCluster and runs MiniFluo on top of it. It parses a
-CommonCrawl data file and creates a file at `data/1000-pages.txt` with 1000 pages that are loaded
-into MiniFluo. The number of pages loaded can be changed to 5000 by using the command below:
-
- ./bin/webindex dev --pages 5000
-
-The pages are processed by Fluo which exports indexes to Accumulo. The development server also
-starts a web application at [http://localhost:4567](http://localhost:4567) that queries indexes in
-Accumulo.
-
-If you would like to run WebIndex on a cluster, follow the [install] instructions.
-
-### Viewing metrics
-
-Metrics can be sent from the development server to InfluxDB and viewed in Grafana. You can either
-set up InfluxDB+Grafana on your own or use the [Uno] command `uno setup metrics`. After a metrics
-server is started, start the development server with the `--metrics` option to begin sending metrics:
-
- ./bin/webindex dev --metrics
-
-Fluo metrics can be viewed in Grafana. To view application-specific metrics for Webindex, import
-the WebIndex Grafana dashboard located at `contrib/webindex-dashboard.json`.
-
-[tour]: https://fluo.apache.org/tour/
-[sparkjava]: http://sparkjava.com/
-[spark]: https://spark.apache.org/
-[accumulo]: https://accumulo.apache.org/
-[fluo]: https://fluo.apache.org/
-[pc]: https://github.com/astralway/phrasecount
-[Uno]: https://github.com/astralway/uno
-[cc]: https://commoncrawl.org/
-[install]: docs/install.md
-[ti]: https://travis-ci.org/astralway/webindex.svg?branch=master
-[tl]: https://travis-ci.org/astralway/webindex
-[li]: http://img.shields.io/badge/license-ASL-blue.svg
-[ll]: https://github.com/astralway/webindex/blob/master/LICENSE
-[logo]: contrib/webindex.png
-[bp]: https://fluo.apache.org/blog/2016/01/11/webindex-long-run/#videos-from-run
+Examples for Apache Fluo
diff --git a/phrasecount/.gitignore b/phrasecount/.gitignore
new file mode 100644
index 0000000..93eea5d
--- /dev/null
+++ b/phrasecount/.gitignore
@@ -0,0 +1,6 @@
+.classpath
+.project
+.settings
+target
+.idea
+*.iml
diff --git a/phrasecount/.travis.yml b/phrasecount/.travis.yml
new file mode 100644
index 0000000..e36964e
--- /dev/null
+++ b/phrasecount/.travis.yml
@@ -0,0 +1,12 @@
+language: java
+jdk:
+ - oraclejdk8
+script: mvn verify
+notifications:
+ irc:
+ channels:
+ - "chat.freenode.net#fluo"
+ on_success: always
+ on_failure: always
+ use_notice: true
+ skip_join: true
diff --git a/phrasecount/LICENSE b/phrasecount/LICENSE
new file mode 100644
index 0000000..e06d208
--- /dev/null
+++ b/phrasecount/LICENSE
@@ -0,0 +1,202 @@
+Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "{}"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright {yyyy} {name of copyright owner}
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
diff --git a/phrasecount/README.md b/phrasecount/README.md
new file mode 100644
index 0000000..74a6509
--- /dev/null
+++ b/phrasecount/README.md
@@ -0,0 +1,164 @@
+# Phrase Count
+
+[![Build Status](https://travis-ci.org/astralway/phrasecount.svg?branch=master)](https://travis-ci.org/astralway/phrasecount)
+
+An example application that computes phrase counts for unique documents using Apache Fluo. Each
+unique document that is added causes phrase counts to be incremented. Unique documents have
+reference counts based on the number of locations that point to them. When a unique document is no
+longer referenced by any location, its phrase counts are decremented appropriately.
+
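The reference counting just described can be sketched in plain Java. This is an illustration only: the class and method names here are invented, and the real application performs these steps inside Fluo transactions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: tracks how many URIs reference each document hash.
public class RefCountSketch {
  private final Map<String, Integer> refCounts = new HashMap<>();

  // A URI now points at this document hash.
  void addReference(String hash) {
    refCounts.merge(hash, 1, Integer::sum);
  }

  // A URI no longer points at this document hash. Returns true when the
  // document became unreferenced, i.e. its phrase counts should be decremented.
  boolean removeReference(String hash) {
    int count = refCounts.merge(hash, -1, Integer::sum);
    if (count <= 0) {
      refCounts.remove(hash);
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    RefCountSketch rc = new RefCountSketch();
    rc.addReference("abc123");
    rc.addReference("abc123");
    System.out.println(rc.removeReference("abc123")); // false: still referenced
    System.out.println(rc.removeReference("abc123")); // true: now unreferenced
  }
}
```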
+After phrase counts are incremented, export transactions send phrase counts to an Accumulo table.
+The purpose of exporting data is to make it available for query. Fluo (like Percolator, on which
+it is based) is not designed to support queries, because its transactions are optimized for
+throughput rather than responsiveness.
+
+This example uses the Collision Free Map and Export Queue from [Apache Fluo Recipes][11]. A
+Collision Free Map is used to calculate phrase counts. An Export Queue is used to update the
+external Accumulo table in a fault-tolerant manner. Before using Fluo Recipes, this example was
+quite complex; switching to them simplified it dramatically.
+
+## Schema
+
+### Fluo Table Schema
+
+This example uses the following schema for the table used by Apache Fluo.
+
+Row | Column | Value | Purpose
+-------------|---------------|-------------------|---------------------------------------------------------------------
+uri:\<uri\> | doc:hash | \<hash\> | Contains the hash of the document found at the URI
+doc:\<hash\> | doc:content | \<document\> | The contents of the document
+doc:\<hash\> | doc:refCount | \<int\> | The number of URIs that reference this document
+doc:\<hash\> | index:check | empty | Setting this column triggers the observer that indexes the document
+doc:\<hash\> | index:status | INDEXED or empty | Tracks whether this document has been indexed
+
+Additionally, the two recipes used by the example store their data in the table
+under two row prefixes. Nothing else should be stored within these prefixes.
+The collision free map used to compute phrasecounts stores data within the row
+prefix `pcm:`. The export queue stores data within the row prefix `aeq:`.
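
As a sketch only (these helper methods are not part of the codebase), the row layout above composes as follows:

```java
// Illustrative helpers showing how the Fluo table's row keys compose.
// Document rows are keyed by content hash, so identical pages fetched from
// different URIs share a single doc:<hash> row.
public class SchemaSketch {
  static String uriRow(String uri) { return "uri:" + uri; }
  static String docRow(String hash) { return "doc:" + hash; }
  static String pcmRow(String key) { return "pcm:" + key; } // collision free map data
  static String aeqRow(String key) { return "aeq:" + key; } // export queue data

  public static void main(String[] args) {
    System.out.println(uriRow("http://example.com")); // uri:http://example.com
    System.out.println(docRow("1a2b3c")); // doc:1a2b3c
  }
}
```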
+
+### External Table Schema
+
+This example uses the following schema for the external Accumulo table.
+
+Row | Column | Value | Purpose
+-----------|-----------------|------------|---------------------------------------------------------------------
+\<phrase\> | stat:totalCount | \<count\> | For a given phrase, the value is the total number of times that phrase occurred in all documents.
+\<phrase\> | stat:docCount | \<count\> | For a given phrase, the value is the number of documents in which that phrase occurred.
+
+[PhraseCountTable][14] encapsulates all of the code for interacting with this
+external table.
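
For illustration, here is a hypothetical sketch (names invented, plain strings rather than Accumulo mutations) of the entries an export would write for a single phrase under this schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: formats the external table entries for one phrase as
// "<row> <family>:<qualifier>" -> value, following the schema above.
public class ExportSketch {
  static Map<String, String> entriesFor(String phrase, long totalCount, long docCount) {
    Map<String, String> entries = new LinkedHashMap<>();
    entries.put(phrase + " stat:totalCount", Long.toString(totalCount));
    entries.put(phrase + " stat:docCount", Long.toString(docCount));
    return entries;
  }

  public static void main(String[] args) {
    System.out.println(entriesFor("apache fluo", 5, 2));
  }
}
```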
+
+## Code Overview
+
+Documents are loaded into the Fluo table by [DocumentLoader][1] which is
+executed by [Load][2]. [DocumentLoader][1] handles reference counting of
+unique documents and may set a notification for [DocumentObserver][3].
+[DocumentObserver][3] increments or decrements global phrase counts by
+inserting `+1` or `-1` into a collision free map for each phrase in a document.
+[PhraseMap][4] contains the code called by the collision free map recipe. The
+code in [PhraseMap][4] does two things. First it computes the phrase counts by
+summing the updates. Second it places the newly computed phrase count on an
+export queue. [PhraseExporter][5] is called by the export queue recipe to
+generate mutations to update the external Accumulo table.
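
The combining step above can be illustrated standalone: updates of `+1` and `-1` for a phrase are summed into the current count. The real combiner is implemented against the Fluo Recipes CollisionFreeMap API; this sketch shows only the arithmetic.

```java
import java.util.List;

// Illustrative only: sums +1/-1 phrase count updates, as the collision free
// map combiner does when computing a new phrase count.
public class CombineSketch {
  static long combine(long current, List<Integer> updates) {
    long sum = current;
    for (int u : updates) {
      sum += u;
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two documents add a phrase, then one document is dereferenced.
    System.out.println(combine(0, List.of(1, 1, -1))); // 1
  }
}
```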
+
+All observers and recipes are configured by code in [Application][10]. All
+observers are run by the Fluo worker processes when notifications trigger them.
+
+## Building
+
+After cloning this repository, build with following command.
+
+```
+mvn package
+```
+
+## Running via Maven
+
+If you do not have Accumulo, Hadoop, Zookeeper, and Fluo set up, then you can
+start a MiniFluo instance with the [mini.sh](bin/mini.sh) script. This script
+will run [Mini.java][12] using Maven. The command will create a
+`fluo.properties` file that can be used by the other commands in this section.
+
+```bash
+./bin/mini.sh /tmp/mac fluo.properties
+```
+
+Once the mini command prints `Wrote : fluo.properties`, it's ready to use. Run
+`tail -f mini.log` and watch for that message.
+
+This command will automatically configure [PhraseExporter][5] to export phrases
+to an Accumulo table named `pcExport`.
+
+The `-Dexec.classpathScope=test` option is set because it adds the test
+[log4j.properties][7] file to the classpath.
+
+### Adding documents
+
+The [load.sh](bin/load.sh) script runs [Load.java][2], which recursively scans
+the directory `$TXT_DIR` for `.txt` files to add.
+
+```bash
+./bin/load.sh fluo.properties $TXT_DIR
+```
+
+### Printing phrases
+
+After documents are added, [print.sh](bin/print.sh) runs [Print.java][13],
+which prints out phrase counts. Try modifying a document you added and running
+the load command again; you should eventually see the phrase counts change.
+
+```bash
+./bin/print.sh fluo.properties pcExport
+```
+
+The command will print out the number of unique documents and the number
+of processed documents. If the number of processed documents is less than the
+number of unique documents, then there is still work to do. After the load
+command runs, the documents will have been added or updated. However, the
+phrase counts will not update until the Observer runs in the background.
+
+### Killing mini
+
+Make sure to kill mini when finished testing. The following command will kill it.
+
+```bash
+pkill -f phrasecount.cmd.Mini
+```
+
+## Deploying example
+
+The following script runs this example on a cluster using the Fluo
+distribution and serves as executable documentation for deployment. The
+previous Maven commands using the exec plugin are convenient for a development
+environment; the following script shows how things would work in a production
+environment.
+
+ * [run.sh](bin/run.sh) : Runs this example with YARN using the Fluo tar
+ distribution. Running in this way requires setting up Hadoop, Zookeeper,
+ and Accumulo instances separately. The [Uno][8] and [Muchos][9]
+ projects were created to ease setting up these external dependencies.
+
+## Generating data
+
+Need some data? Use `elinks` to generate text files from web pages.
+
+```
+mkdir data
+elinks -dump 1 -no-numbering -no-references http://accumulo.apache.org > data/accumulo.txt
+elinks -dump 1 -no-numbering -no-references http://hadoop.apache.org > data/hadoop.txt
+elinks -dump 1 -no-numbering -no-references http://zookeeper.apache.org > data/zookeeper.txt
+```
+
+[1]: src/main/java/phrasecount/DocumentLoader.java
+[2]: src/main/java/phrasecount/cmd/Load.java
+[3]: src/main/java/phrasecount/DocumentObserver.java
+[4]: src/main/java/phrasecount/PhraseMap.java
+[5]: src/main/java/phrasecount/PhraseExporter.java
+[7]: src/test/resources/log4j.properties
+[8]: https://github.com/astralway/uno
+[9]: https://github.com/astralway/muchos
+[10]: src/main/java/phrasecount/Application.java
+[11]: https://github.com/apache/fluo-recipes
+[12]: src/main/java/phrasecount/cmd/Mini.java
+[13]: src/main/java/phrasecount/cmd/Print.java
+[14]: src/main/java/phrasecount/query/PhraseCountTable.java
diff --git a/phrasecount/bin/copy-jars.sh b/phrasecount/bin/copy-jars.sh
new file mode 100755
index 0000000..a92ac5f
--- /dev/null
+++ b/phrasecount/bin/copy-jars.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+#This script will copy the phrase count jar and its dependencies to the Fluo
+#application lib dir
+
+
+if [ "$#" -ne 2 ]; then
+ echo "Usage : $0 <FLUO_HOME> <PHRASECOUNT_HOME>"
+ exit 1
+fi
+
+FLUO_HOME=$1
+PC_HOME=$2
+
+PC_JAR=$PC_HOME/target/phrasecount-0.0.1-SNAPSHOT.jar
+
+#build and copy phrasecount jar
+(cd $PC_HOME; mvn package -DskipTests)
+
+FLUO_APP_LIB=$FLUO_HOME/apps/phrasecount/lib/
+
+cp $PC_JAR $FLUO_APP_LIB
+(cd $PC_HOME; mvn dependency:copy-dependencies -DoutputDirectory=$FLUO_APP_LIB)
+
diff --git a/phrasecount/bin/load.sh b/phrasecount/bin/load.sh
new file mode 100755
index 0000000..4c9a904
--- /dev/null
+++ b/phrasecount/bin/load.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+mvn exec:java -Dexec.mainClass=phrasecount.cmd.Load -Dexec.args="${*:1}" -Dexec.classpathScope=test
diff --git a/phrasecount/bin/mini.sh b/phrasecount/bin/mini.sh
new file mode 100755
index 0000000..b8b60a4
--- /dev/null
+++ b/phrasecount/bin/mini.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+mvn exec:java -Dexec.mainClass=phrasecount.cmd.Mini -Dexec.args="${*:1}" -Dexec.classpathScope=test &>mini.log &
+echo "Started Mini in background. Writing output to mini.log."
diff --git a/phrasecount/bin/print.sh b/phrasecount/bin/print.sh
new file mode 100755
index 0000000..198fad9
--- /dev/null
+++ b/phrasecount/bin/print.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+mvn exec:java -Dexec.mainClass=phrasecount.cmd.Print -Dexec.args="${*:1}" -Dexec.classpathScope=test
+
diff --git a/phrasecount/bin/run.sh b/phrasecount/bin/run.sh
new file mode 100755
index 0000000..8f6e46a
--- /dev/null
+++ b/phrasecount/bin/run.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+PC_HOME=$( cd "$( dirname "$BIN_DIR" )" && pwd )
+
+# stop if any command fails
+set -e
+
+if [ "$#" -ne 1 ]; then
+ echo "Usage : $0 <TXT FILES DIR>"
+ exit 1
+fi
+
+#set the following to a directory containing text files
+TXT_DIR=$1
+if [ ! -d $TXT_DIR ]; then
+ echo "Document directory $TXT_DIR does not exist"
+ exit 1
+fi
+
+#ensure $FLUO_HOME is set
+if [ -z "$FLUO_HOME" ]; then
+ echo '$FLUO_HOME must be set!'
+ exit 1
+fi
+
+#Set application name. $FLUO_APP_NAME is set by fluo-dev and zetten
+APP=${FLUO_APP_NAME:-phrasecount}
+
+#derived variables
+APP_PROPS=$FLUO_HOME/apps/$APP/conf/fluo.properties
+
+if [ ! -f $FLUO_HOME/conf/fluo.properties ]; then
+ echo "Fluo is not configured, exiting."
+ exit 1
+fi
+
+#remove application if it exists
+if [ -d $FLUO_HOME/apps/$APP ]; then
+ echo "Restarting '$APP' application. Errors may be printed if it's not running..."
+ $FLUO_HOME/bin/fluo kill $APP || true
+ rm -rf $FLUO_HOME/apps/$APP
+fi
+
+#create new application dir
+$FLUO_HOME/bin/fluo new $APP
+
+#copy phrasecount jars to Fluo application lib dir
+$PC_HOME/bin/copy-jars.sh $FLUO_HOME $PC_HOME
+
+#Create export table and output Fluo configuration
+$FLUO_HOME/bin/fluo exec $APP phrasecount.cmd.Setup $APP_PROPS pcExport >> $APP_PROPS
+
+$FLUO_HOME/bin/fluo init $APP -f
+$FLUO_HOME/bin/fluo exec $APP org.apache.fluo.recipes.accumulo.cmds.OptimizeTable
+$FLUO_HOME/bin/fluo start $APP
+$FLUO_HOME/bin/fluo info $APP
+
+#Load data
+$FLUO_HOME/bin/fluo exec $APP phrasecount.cmd.Load $APP_PROPS $TXT_DIR
+
+#wait for all notifications to be processed.
+$FLUO_HOME/bin/fluo wait $APP
+
+#print phrase counts
+$FLUO_HOME/bin/fluo exec $APP phrasecount.cmd.Print $APP_PROPS pcExport
+
+$FLUO_HOME/bin/fluo stop $APP
+
diff --git a/phrasecount/pom.xml b/phrasecount/pom.xml
new file mode 100644
index 0000000..bb9afde
--- /dev/null
+++ b/phrasecount/pom.xml
@@ -0,0 +1,98 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <groupId>io.github.astralway</groupId>
+ <artifactId>phrasecount</artifactId>
+ <version>0.0.1-SNAPSHOT</version>
+ <packaging>jar</packaging>
+
+ <name>phrasecount</name>
+ <url>https://github.com/astralway/phrasecount</url>
+
+ <properties>
+ <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+ <accumulo.version>1.7.2</accumulo.version>
+ <fluo.version>1.0.0-incubating</fluo.version>
+ <fluo-recipes.version>1.0.0-incubating</fluo-recipes.version>
+ </properties>
+
+ <build>
+ <plugins>
+ <plugin>
+ <artifactId>maven-compiler-plugin</artifactId>
+ <version>3.1</version>
+ <configuration>
+ <source>1.8</source>
+ <target>1.8</target>
+ <optimize>true</optimize>
+ <encoding>UTF-8</encoding>
+ </configuration>
+ </plugin>
+ <plugin>
+ <artifactId>maven-dependency-plugin</artifactId>
+ <version>2.10</version>
+ <configuration>
+ <!--define the specific dependencies to copy into the Fluo application dir-->
+ <includeArtifactIds>fluo-recipes-core,fluo-recipes-accumulo,fluo-recipes-kryo,kryo,minlog,reflectasm,objenesis</includeArtifactIds>
+ </configuration>
+ </plugin>
+ </plugins>
+ </build>
+
+ <dependencies>
+ <dependency>
+ <groupId>junit</groupId>
+ <artifactId>junit</artifactId>
+ <version>4.11</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>com.beust</groupId>
+ <artifactId>jcommander</artifactId>
+ <version>1.32</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-api</artifactId>
+ <version>${fluo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-core</artifactId>
+ <version>${fluo.version}</version>
+ <scope>runtime</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-core</artifactId>
+ <version>${fluo-recipes.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-accumulo</artifactId>
+ <version>${fluo-recipes.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-kryo</artifactId>
+ <version>${fluo-recipes.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-core</artifactId>
+ <version>${accumulo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-mini</artifactId>
+ <version>${fluo.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-minicluster</artifactId>
+ <version>${accumulo.version}</version>
+ </dependency>
+ </dependencies>
+</project>
diff --git a/phrasecount/src/main/java/phrasecount/Application.java b/phrasecount/src/main/java/phrasecount/Application.java
new file mode 100644
index 0000000..30d7c3a
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/Application.java
@@ -0,0 +1,71 @@
+package phrasecount;
+
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.config.ObserverSpecification;
+import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
+import org.apache.fluo.recipes.core.export.ExportQueue;
+import org.apache.fluo.recipes.core.map.CollisionFreeMap;
+import org.apache.fluo.recipes.kryo.KryoSimplerSerializer;
+import phrasecount.pojos.Counts;
+import phrasecount.pojos.PcKryoFactory;
+
+import static phrasecount.Constants.EXPORT_QUEUE_ID;
+import static phrasecount.Constants.PCM_ID;
+
+public class Application {
+
+ public static class Options {
+ public Options(int pcmBuckets, int eqBuckets, String instance, String zooKeepers, String user,
+ String password, String eTable) {
+ this.phraseCountMapBuckets = pcmBuckets;
+ this.exportQueueBuckets = eqBuckets;
+ this.instance = instance;
+ this.zookeepers = zooKeepers;
+ this.user = user;
+ this.password = password;
+ this.exportTable = eTable;
+
+ }
+
+ public int phraseCountMapBuckets;
+ public int exportQueueBuckets;
+
+ public String instance;
+ public String zookeepers;
+ public String user;
+ public String password;
+ public String exportTable;
+ }
+
+ /**
+ * Sets Fluo configuration needed to run the phrase count application
+ *
+ * @param fluoConfig FluoConfiguration
+ * @param opts Options
+ */
+ public static void configure(FluoConfiguration fluoConfig, Options opts) {
+ // set up an observer that watches the reference counts of documents. When a document is
+ // referenced or dereferenced, it will add or subtract phrase counts from a collision free map.
+ fluoConfig.addObserver(new ObserverSpecification(DocumentObserver.class.getName()));
+
+ // configure which KryoFactory the recipes should use
+ KryoSimplerSerializer.setKryoFactory(fluoConfig, PcKryoFactory.class);
+
+ // set up a collision free map to combine phrase counts
+ CollisionFreeMap.configure(fluoConfig,
+ new CollisionFreeMap.Options(PCM_ID, PhraseMap.PcmCombiner.class,
+ PhraseMap.PcmUpdateObserver.class, String.class, Counts.class,
+ opts.phraseCountMapBuckets));
+
+ AccumuloExporter.Configuration accumuloConfig =
+ new AccumuloExporter.Configuration(opts.instance, opts.zookeepers, opts.user, opts.password,
+ opts.exportTable);
+
+ // set up an Accumulo export queue to send phrase count updates to an Accumulo table
+ ExportQueue.Options exportQueueOpts =
+ new ExportQueue.Options(EXPORT_QUEUE_ID, PhraseExporter.class.getName(),
+ String.class.getName(), Counts.class.getName(),
+ opts.exportQueueBuckets).setExporterConfiguration(accumuloConfig);
+ ExportQueue.configure(fluoConfig, exportQueueOpts);
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/Constants.java b/phrasecount/src/main/java/phrasecount/Constants.java
new file mode 100644
index 0000000..1f73bee
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/Constants.java
@@ -0,0 +1,21 @@
+package phrasecount;
+
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.recipes.core.types.StringEncoder;
+import org.apache.fluo.recipes.core.types.TypeLayer;
+
+public class Constants {
+
+ // set the encoder to use in one place
+ public static final TypeLayer TYPEL = new TypeLayer(new StringEncoder());
+
+ public static final Column INDEX_CHECK_COL = TYPEL.bc().fam("index").qual("check").vis();
+ public static final Column INDEX_STATUS_COL = TYPEL.bc().fam("index").qual("status").vis();
+ public static final Column DOC_CONTENT_COL = TYPEL.bc().fam("doc").qual("content").vis();
+ public static final Column DOC_HASH_COL = TYPEL.bc().fam("doc").qual("hash").vis();
+ public static final Column DOC_REF_COUNT_COL = TYPEL.bc().fam("doc").qual("refCount").vis();
+
+ public static final String EXPORT_QUEUE_ID = "aeq";
+ // phrase count map id
+ public static final String PCM_ID = "pcm";
+}
diff --git a/phrasecount/src/main/java/phrasecount/DocumentLoader.java b/phrasecount/src/main/java/phrasecount/DocumentLoader.java
new file mode 100644
index 0000000..8384b35
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/DocumentLoader.java
@@ -0,0 +1,73 @@
+package phrasecount;
+
+import org.apache.fluo.api.client.Loader;
+import org.apache.fluo.api.client.TransactionBase;
+import org.apache.fluo.recipes.core.types.TypedTransactionBase;
+import phrasecount.pojos.Document;
+
+import static phrasecount.Constants.DOC_CONTENT_COL;
+import static phrasecount.Constants.DOC_HASH_COL;
+import static phrasecount.Constants.DOC_REF_COUNT_COL;
+import static phrasecount.Constants.INDEX_CHECK_COL;
+import static phrasecount.Constants.TYPEL;
+
+/**
+ * Executes document load transactions which dedupe and reference count documents. If needed, the
+ * observer that updates phrase counts is triggered.
+ */
+public class DocumentLoader implements Loader {
+
+ private Document document;
+
+ public DocumentLoader(Document doc) {
+ this.document = doc;
+ }
+
+ @Override
+ public void load(TransactionBase tx, Context context) throws Exception {
+
+ // TODO Need a strategy for dealing w/ large documents. If a worker processes many large
+ // documents concurrently, it could cause memory exhaustion. Could break up large documents
+ // into pieces; however, not sure if the example should be complicated with this.
+
+ TypedTransactionBase ttx = TYPEL.wrap(tx);
+ String storedHash = ttx.get().row("uri:" + document.getURI()).col(DOC_HASH_COL).toString();
+
+ if (storedHash == null || !storedHash.equals(document.getHash())) {
+
+ ttx.mutate().row("uri:" + document.getURI()).col(DOC_HASH_COL).set(document.getHash());
+
+ Integer refCount =
+ ttx.get().row("doc:" + document.getHash()).col(DOC_REF_COUNT_COL).toInteger();
+ if (refCount == null) {
+ // this document was never seen before
+ addNewDocument(ttx, document);
+ } else {
+ setRefCount(ttx, document.getHash(), refCount, refCount + 1);
+ }
+
+ if (storedHash != null) {
+ decrementRefCount(ttx, refCount, storedHash);
+ }
+ }
+ }
+
+ private void setRefCount(TypedTransactionBase tx, String hash, Integer prevRc, int rc) {
+ tx.mutate().row("doc:" + hash).col(DOC_REF_COUNT_COL).set(rc);
+
+ if (rc == 0 || (rc == 1 && (prevRc == null || prevRc == 0))) {
+ // setting this triggers DocumentObserver
+ tx.mutate().row("doc:" + hash).col(INDEX_CHECK_COL).set();
+ }
+ }
+
+ private void decrementRefCount(TypedTransactionBase tx, Integer prevRc, String hash) {
+ int rc = tx.get().row("doc:" + hash).col(DOC_REF_COUNT_COL).toInteger();
+ setRefCount(tx, hash, prevRc, rc - 1);
+ }
+
+ private void addNewDocument(TypedTransactionBase tx, Document doc) {
+ setRefCount(tx, doc.getHash(), null, 1);
+ tx.mutate().row("doc:" + doc.getHash()).col(DOC_CONTENT_COL).set(doc.getContent());
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/DocumentObserver.java b/phrasecount/src/main/java/phrasecount/DocumentObserver.java
new file mode 100644
index 0000000..1c50bfc
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/DocumentObserver.java
@@ -0,0 +1,102 @@
+package phrasecount;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+
+import org.apache.fluo.api.client.TransactionBase;
+import org.apache.fluo.api.data.Bytes;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.api.observer.AbstractObserver;
+import org.apache.fluo.recipes.core.map.CollisionFreeMap;
+import org.apache.fluo.recipes.core.types.TypedTransactionBase;
+import phrasecount.pojos.Counts;
+import phrasecount.pojos.Document;
+
+import static phrasecount.Constants.DOC_CONTENT_COL;
+import static phrasecount.Constants.DOC_REF_COUNT_COL;
+import static phrasecount.Constants.INDEX_CHECK_COL;
+import static phrasecount.Constants.INDEX_STATUS_COL;
+import static phrasecount.Constants.PCM_ID;
+import static phrasecount.Constants.TYPEL;
+
+/**
+ * An Observer that updates phrase counts when a document is added or removed.
+ */
+public class DocumentObserver extends AbstractObserver {
+
+ private CollisionFreeMap<String, Counts> pcMap;
+
+ private enum IndexStatus {
+ INDEXED, UNINDEXED
+ }
+
+ @Override
+ public void init(Context context) throws Exception {
+ pcMap = CollisionFreeMap.getInstance(PCM_ID, context.getAppConfiguration());
+ }
+
+ @Override
+ public void process(TransactionBase tx, Bytes row, Column col) throws Exception {
+
+ TypedTransactionBase ttx = TYPEL.wrap(tx);
+
+ IndexStatus status = getStatus(ttx, row);
+ int refCount = ttx.get().row(row).col(DOC_REF_COUNT_COL).toInteger(0);
+
+ if (status == IndexStatus.UNINDEXED && refCount > 0) {
+ updatePhraseCounts(ttx, row, 1);
+ ttx.mutate().row(row).col(INDEX_STATUS_COL).set(IndexStatus.INDEXED.name());
+ } else if (status == IndexStatus.INDEXED && refCount == 0) {
+ updatePhraseCounts(ttx, row, -1);
+ deleteDocument(ttx, row);
+ }
+
+ // TODO modifying the trigger is currently broken; enable more than one observer to commit for a
+ // notification
+ // tx.delete(row, col);
+
+ }
+
+ @Override
+ public ObservedColumn getObservedColumn() {
+ return new ObservedColumn(INDEX_CHECK_COL, NotificationType.STRONG);
+ }
+
+ private void deleteDocument(TypedTransactionBase tx, Bytes row) {
+ // TODO it would probably be useful to have a deleteRow method on Transaction... this method
+ // could start off w/ a simple implementation and later be
+ // optimized... or could have a delete range option
+
+ // TODO this is brittle, this code assumes it knows all possible columns
+ tx.delete(row, DOC_CONTENT_COL);
+ tx.delete(row, DOC_REF_COUNT_COL);
+ tx.delete(row, INDEX_STATUS_COL);
+ }
+
+ private void updatePhraseCounts(TypedTransactionBase ttx, Bytes row, int multiplier) {
+ String content = ttx.get().row(row).col(Constants.DOC_CONTENT_COL).toString();
+
+ // this makes the assumption that the implementation of getPhrases is invariant. This is
+ // probably a bad assumption. A possible way to make this more robust
+ // is to store the output of getPhrases when indexing and use the stored output when unindexing.
+ // Alternatively, could store the version of Document used for
+ // indexing.
+ Map<String, Integer> phrases = new Document(null, content).getPhrases();
+ Map<String, Counts> updates = new HashMap<>(phrases.size());
+ for (Entry<String, Integer> entry : phrases.entrySet()) {
+ updates.put(entry.getKey(), new Counts(multiplier, entry.getValue() * multiplier));
+ }
+
+ pcMap.update(ttx, updates);
+ }
+
+ private IndexStatus getStatus(TypedTransactionBase tx, Bytes row) {
+ String status = tx.get().row(row).col(INDEX_STATUS_COL).toString();
+
+ if (status == null)
+ return IndexStatus.UNINDEXED;
+
+ return IndexStatus.valueOf(status);
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/PhraseExporter.java b/phrasecount/src/main/java/phrasecount/PhraseExporter.java
new file mode 100644
index 0000000..5aec44a
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/PhraseExporter.java
@@ -0,0 +1,24 @@
+package phrasecount;
+
+import java.util.function.Consumer;
+
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
+import org.apache.fluo.recipes.core.export.SequencedExport;
+import phrasecount.pojos.Counts;
+import phrasecount.query.PhraseCountTable;
+
+/**
+ * Export code that converts {@link Counts} objects from the export queue to Mutations that are
+ * written to Accumulo.
+ */
+public class PhraseExporter extends AccumuloExporter<String, Counts> {
+
+ @Override
+ protected void translate(SequencedExport<String, Counts> export, Consumer<Mutation> consumer) {
+ String phrase = export.getKey();
+ long seq = export.getSequence();
+ Counts counts = export.getValue();
+ consumer.accept(PhraseCountTable.createMutation(phrase, seq, counts));
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/PhraseMap.java b/phrasecount/src/main/java/phrasecount/PhraseMap.java
new file mode 100644
index 0000000..01c3bfb
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/PhraseMap.java
@@ -0,0 +1,63 @@
+package phrasecount;
+
+import java.util.Iterator;
+import java.util.Optional;
+
+import com.google.common.collect.Iterators;
+import org.apache.fluo.api.client.TransactionBase;
+import org.apache.fluo.api.observer.Observer.Context;
+import org.apache.fluo.recipes.core.export.Export;
+import org.apache.fluo.recipes.core.export.ExportQueue;
+import org.apache.fluo.recipes.core.map.CollisionFreeMap;
+import org.apache.fluo.recipes.core.map.Combiner;
+import org.apache.fluo.recipes.core.map.Update;
+import org.apache.fluo.recipes.core.map.UpdateObserver;
+import phrasecount.pojos.Counts;
+
+import static phrasecount.Constants.EXPORT_QUEUE_ID;
+
+/**
+ * This class contains all of the code related to the {@link CollisionFreeMap} that keeps track of
+ * phrase counts.
+ */
+public class PhraseMap {
+
+ /**
+ * A combiner for the {@link CollisionFreeMap} that stores phrase counts. The
+ * {@link CollisionFreeMap} calls this combiner when it lazily updates the counts for a phrase.
+ */
+ public static class PcmCombiner implements Combiner<String, Counts> {
+
+ @Override
+ public Optional<Counts> combine(String key, Iterator<Counts> updates) {
+ Counts sum = new Counts(0, 0);
+ while (updates.hasNext()) {
+ sum = sum.add(updates.next());
+ }
+ return Optional.of(sum);
+ }
+ }
+
+ /**
+ * This class is notified when the {@link CollisionFreeMap} used to store phrase counts updates a
+ * phrase count. Updates are placed on an Accumulo export queue to be exported to the table storing
+ * phrase counts for query.
+ */
+ public static class PcmUpdateObserver extends UpdateObserver<String, Counts> {
+
+ private ExportQueue<String, Counts> pcEq;
+
+ @Override
+ public void init(String mapId, Context observerContext) throws Exception {
+ pcEq = ExportQueue.getInstance(EXPORT_QUEUE_ID, observerContext.getAppConfiguration());
+ }
+
+ @Override
+ public void updatingValues(TransactionBase tx, Iterator<Update<String, Counts>> updates) {
+ Iterator<Export<String, Counts>> exports =
+ Iterators.transform(updates, u -> new Export<>(u.getKey(), u.getNewValue().get()));
+ pcEq.addAll(tx, exports);
+ }
+ }
+
+}
diff --git a/phrasecount/src/main/java/phrasecount/cmd/Load.java b/phrasecount/src/main/java/phrasecount/cmd/Load.java
new file mode 100644
index 0000000..82e4e75
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/cmd/Load.java
@@ -0,0 +1,51 @@
+package phrasecount.cmd;
+
+import java.io.File;
+
+import com.google.common.base.Charsets;
+import com.google.common.io.Files;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.LoaderExecutor;
+import org.apache.fluo.api.config.FluoConfiguration;
+import phrasecount.DocumentLoader;
+import phrasecount.pojos.Document;
+
+public class Load {
+
+ public static void main(String[] args) throws Exception {
+
+ if (args.length != 2) {
+ System.err.println("Usage : " + Load.class.getName() + " <fluo props file> <txt file dir>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration config = new FluoConfiguration(new File(args[0]));
+ config.setLoaderThreads(20);
+ config.setLoaderQueueSize(40);
+
+ try (FluoClient fluoClient = FluoFactory.newClient(config);
+ LoaderExecutor le = fluoClient.newLoaderExecutor()) {
+ File[] files = new File(args[1]).listFiles();
+
+ if (files == null) {
+ System.out.println("Text file dir does not exist: " + args[1]);
+ } else {
+ for (File txtFile : files) {
+ if (txtFile.getName().endsWith(".txt")) {
+ String uri = txtFile.toURI().toString();
+ String content = Files.toString(txtFile, Charsets.UTF_8);
+
+ System.out.println("Processing : " + txtFile.toURI());
+ le.execute(new DocumentLoader(new Document(uri, content)));
+ } else {
+ System.out.println("Ignoring : " + txtFile.toURI());
+ }
+ }
+ }
+ }
+
+ // TODO figure out what threads are hanging around
+ System.exit(0);
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/cmd/Mini.java b/phrasecount/src/main/java/phrasecount/cmd/Mini.java
new file mode 100644
index 0000000..e43c1f5
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/cmd/Mini.java
@@ -0,0 +1,97 @@
+package phrasecount.cmd;
+
+import java.io.File;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.ParameterException;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.minicluster.MemoryUnit;
+import org.apache.accumulo.minicluster.MiniAccumuloCluster;
+import org.apache.accumulo.minicluster.MiniAccumuloConfig;
+import org.apache.accumulo.minicluster.ServerType;
+import org.apache.fluo.api.client.FluoAdmin.InitializationOptions;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.mini.MiniFluo;
+import phrasecount.Application;
+
+public class Mini {
+
+ static class Parameters {
+ @Parameter(names = {"-m", "--moreMemory"}, description = "Use more memory")
+ boolean moreMemory = false;
+
+ @Parameter(names = {"-w", "--workerThreads"}, description = "Number of worker threads")
+ int workerThreads = 5;
+
+ @Parameter(names = {"-t", "--tabletServers"}, description = "Number of tablet servers")
+ int tabletServers = 2;
+
+ @Parameter(names = {"-z", "--zookeeperPort"}, description = "Port to use for zookeeper")
+ int zookeeperPort = 0;
+
+ @Parameter(description = "<MAC dir> <output props file>")
+ List<String> args;
+ }
+
+ public static void main(String[] args) throws Exception {
+
+ Parameters params = new Parameters();
+ JCommander jc = new JCommander(params);
+
+ try {
+ jc.parse(args);
+ if (params.args == null || params.args.size() != 2)
+ throw new ParameterException("Expected two arguments");
+ } catch (ParameterException pe) {
+ System.out.println(pe.getMessage());
+ jc.setProgramName(Mini.class.getSimpleName());
+ jc.usage();
+ System.exit(-1);
+ }
+
+ MiniAccumuloConfig cfg = new MiniAccumuloConfig(new File(params.args.get(0)), "secret");
+ cfg.setZooKeeperPort(params.zookeeperPort);
+ cfg.setNumTservers(params.tabletServers);
+ if (params.moreMemory) {
+ cfg.setMemory(ServerType.TABLET_SERVER, 2, MemoryUnit.GIGABYTE);
+ Map<String, String> site = new HashMap<>();
+ site.put(Property.TSERV_DATACACHE_SIZE.getKey(), "768M");
+ site.put(Property.TSERV_INDEXCACHE_SIZE.getKey(), "256M");
+ cfg.setSiteConfig(site);
+ }
+
+ MiniAccumuloCluster cluster = new MiniAccumuloCluster(cfg);
+ cluster.start();
+
+ FluoConfiguration fluoConfig = new FluoConfiguration();
+
+ fluoConfig.setMiniStartAccumulo(false);
+ fluoConfig.setAccumuloInstance(cluster.getInstanceName());
+ fluoConfig.setAccumuloUser("root");
+ fluoConfig.setAccumuloPassword("secret");
+ fluoConfig.setAccumuloZookeepers(cluster.getZooKeepers());
+ fluoConfig.setInstanceZookeepers(cluster.getZooKeepers() + "/fluo");
+
+ fluoConfig.setAccumuloTable("data");
+ fluoConfig.setWorkerThreads(params.workerThreads);
+
+ fluoConfig.setApplicationName("phrasecount");
+
+ Application.configure(fluoConfig, new Application.Options(17, 17, cluster.getInstanceName(),
+ cluster.getZooKeepers(), "root", "secret", "pcExport"));
+
+ FluoFactory.newAdmin(fluoConfig).initialize(new InitializationOptions());
+
+ MiniFluo miniFluo = FluoFactory.newMiniFluo(fluoConfig);
+
+ miniFluo.getClientConfiguration().save(new File(params.args.get(1)));
+
+ System.out.println();
+ System.out.println("Wrote : " + params.args.get(1));
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/cmd/Print.java b/phrasecount/src/main/java/phrasecount/cmd/Print.java
new file mode 100644
index 0000000..79819b2
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/cmd/Print.java
@@ -0,0 +1,55 @@
+package phrasecount.cmd;
+
+import java.io.File;
+
+import com.google.common.collect.Iterables;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.Snapshot;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.api.data.Span;
+import phrasecount.Constants;
+import phrasecount.pojos.PhraseAndCounts;
+import phrasecount.query.PhraseCountTable;
+
+public class Print {
+
+ public static void main(String[] args) throws Exception {
+ if (args.length != 2) {
+ System.err
+ .println("Usage : " + Print.class.getName() + " <fluo props file> <export table name>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration fluoConfig = new FluoConfiguration(new File(args[0]));
+
+ PhraseCountTable pcTable = new PhraseCountTable(fluoConfig, args[1]);
+ for (PhraseAndCounts phraseCount : pcTable) {
+ System.out.printf("%7d %7d '%s'\n", phraseCount.docPhraseCount, phraseCount.totalPhraseCount,
+ phraseCount.phrase);
+ }
+
+ try (FluoClient fluoClient = FluoFactory.newClient(fluoConfig);
+ Snapshot snap = fluoClient.newSnapshot()) {
+
+ // TODO could precompute this using observers
+ int uriCount = count(snap, "uri:", Constants.DOC_HASH_COL);
+ int documentCount = count(snap, "doc:", Constants.DOC_REF_COUNT_COL);
+ int numIndexedDocs = count(snap, "doc:", Constants.INDEX_STATUS_COL);
+
+ System.out.println();
+ System.out.printf("# uris : %,d\n", uriCount);
+ System.out.printf("# unique documents : %,d\n", documentCount);
+ System.out.printf("# processed documents : %,d\n", numIndexedDocs);
+ System.out.println();
+ }
+
+ // TODO figure out what threads are hanging around
+ System.exit(0);
+ }
+
+ private static int count(Snapshot snap, String prefix, Column col) {
+ return Iterables.size(snap.scanner().over(Span.prefix(prefix)).fetch(col).byRow().build());
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/cmd/Setup.java b/phrasecount/src/main/java/phrasecount/cmd/Setup.java
new file mode 100644
index 0000000..9d27917
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/cmd/Setup.java
@@ -0,0 +1,38 @@
+package phrasecount.cmd;
+
+import java.io.File;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.client.ZooKeeperInstance;
+import org.apache.accumulo.core.client.security.tokens.PasswordToken;
+import org.apache.fluo.api.config.FluoConfiguration;
+import phrasecount.Application;
+import phrasecount.Application.Options;
+
+public class Setup {
+
+ public static void main(String[] args) throws Exception {
+ FluoConfiguration config = new FluoConfiguration(new File(args[0]));
+
+ String exportTable = args[1];
+
+ Connector conn =
+ new ZooKeeperInstance(config.getAccumuloInstance(), config.getAccumuloZookeepers())
+ .getConnector("root", new PasswordToken("secret"));
+ try {
+ conn.tableOperations().delete(exportTable);
+ } catch (TableNotFoundException e) {
+ // ignore if table not found
+ }
+
+ conn.tableOperations().create(exportTable);
+
+ Options opts = new Options(103, 103, config.getAccumuloInstance(), config.getAccumuloZookeepers(),
+ config.getAccumuloUser(), config.getAccumuloPassword(), exportTable);
+
+ FluoConfiguration observerConfig = new FluoConfiguration();
+ Application.configure(observerConfig, opts);
+ observerConfig.save(System.out);
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/cmd/Split.java b/phrasecount/src/main/java/phrasecount/cmd/Split.java
new file mode 100644
index 0000000..cc9d145
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/cmd/Split.java
@@ -0,0 +1,40 @@
+package phrasecount.cmd;
+
+import java.io.File;
+import java.util.SortedSet;
+import java.util.TreeSet;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.ZooKeeperInstance;
+import org.apache.accumulo.core.client.security.tokens.PasswordToken;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.hadoop.io.Text;
+
+/**
+ * Utility to add splits to the Accumulo table used by Fluo.
+ */
+public class Split {
+ public static void main(String[] args) throws Exception {
+ if (args.length != 2) {
+ System.err.println("Usage : " + Split.class.getName() + " <fluo props file> <table name>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration fluoConfig = new FluoConfiguration(new File(args[0]));
+ ZooKeeperInstance zki =
+ new ZooKeeperInstance(fluoConfig.getAccumuloInstance(), fluoConfig.getAccumuloZookeepers());
+ Connector conn = zki.getConnector(fluoConfig.getAccumuloUser(),
+ new PasswordToken(fluoConfig.getAccumuloPassword()));
+
+ SortedSet<Text> splits = new TreeSet<>();
+
+ for (char c = 'b'; c < 'z'; c++) {
+ splits.add(new Text("phrase:" + c));
+ }
+
+ conn.tableOperations().addSplits(args[1], splits);
+
+ // TODO figure out what threads are hanging around
+ System.exit(0);
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/pojos/Counts.java b/phrasecount/src/main/java/phrasecount/pojos/Counts.java
new file mode 100644
index 0000000..d8e0829
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/pojos/Counts.java
@@ -0,0 +1,44 @@
+package phrasecount.pojos;
+
+import com.google.common.base.Objects;
+
+public class Counts {
+ // number of documents a phrase was seen in
+ public final long docPhraseCount;
+ // total times a phrase was seen in all documents
+ public final long totalPhraseCount;
+
+ public Counts() {
+ docPhraseCount = 0;
+ totalPhraseCount = 0;
+ }
+
+ public Counts(long docPhraseCount, long totalPhraseCount) {
+ this.docPhraseCount = docPhraseCount;
+ this.totalPhraseCount = totalPhraseCount;
+ }
+
+ public Counts add(Counts other) {
+ return new Counts(this.docPhraseCount + other.docPhraseCount, this.totalPhraseCount + other.totalPhraseCount);
+ }
+
+ @Override
+ public boolean equals(Object o) {
+ if (o instanceof Counts) {
+ Counts opc = (Counts) o;
+ return opc.docPhraseCount == docPhraseCount && opc.totalPhraseCount == totalPhraseCount;
+ }
+
+ return false;
+ }
+
+ @Override
+ public int hashCode() {
+ return (int) (993 * totalPhraseCount + 17 * docPhraseCount);
+ }
+
+ @Override
+ public String toString() {
+ return Objects.toStringHelper(this).add("documents", docPhraseCount).add("total", totalPhraseCount).toString();
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/pojos/Document.java b/phrasecount/src/main/java/phrasecount/pojos/Document.java
new file mode 100644
index 0000000..5fc0e70
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/pojos/Document.java
@@ -0,0 +1,59 @@
+package phrasecount.pojos;
+
+import java.util.HashMap;
+import java.util.Map;
+
+import com.google.common.hash.Hasher;
+import com.google.common.hash.Hashing;
+
+public class Document {
+ // the location where the document came from. This is needed in order to detect when a document
+ // changes.
+ private String uri;
+
+ // the text of a document.
+ private String content;
+
+ private String hash = null;
+
+ public Document(String uri, String content) {
+ this.content = content;
+ this.uri = uri;
+ }
+
+ public String getURI() {
+ return uri;
+ }
+
+ public String getHash() {
+ if (hash != null)
+ return hash;
+
+ Hasher hasher = Hashing.sha1().newHasher();
+ String[] tokens = content.toLowerCase().split("[^\\p{Alnum}]+");
+
+ for (String token : tokens) {
+ hasher.putString(token);
+ }
+
+ return hash = hasher.hash().toString();
+ }
+
+ public Map<String, Integer> getPhrases() {
+ String[] tokens = content.toLowerCase().split("[^\\p{Alnum}]+");
+
+ Map<String, Integer> phrases = new HashMap<>();
+ for (int i = 3; i < tokens.length; i++) {
+ String phrase = tokens[i - 3] + " " + tokens[i - 2] + " " + tokens[i - 1] + " " + tokens[i];
+ Integer old = phrases.put(phrase, 1);
+ if (old != null)
+ phrases.put(phrase, 1 + old);
+ }
+
+ return phrases;
+ }
+
+ public String getContent() {
+ return content;
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/pojos/PcKryoFactory.java b/phrasecount/src/main/java/phrasecount/pojos/PcKryoFactory.java
new file mode 100644
index 0000000..3158f00
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/pojos/PcKryoFactory.java
@@ -0,0 +1,13 @@
+package phrasecount.pojos;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.pool.KryoFactory;
+
+public class PcKryoFactory implements KryoFactory {
+ @Override
+ public Kryo create() {
+ Kryo kryo = new Kryo();
+ kryo.register(Counts.class, 9);
+ return kryo;
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/pojos/PhraseAndCounts.java b/phrasecount/src/main/java/phrasecount/pojos/PhraseAndCounts.java
new file mode 100644
index 0000000..d6ddc33
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/pojos/PhraseAndCounts.java
@@ -0,0 +1,24 @@
+package phrasecount.pojos;
+
+public class PhraseAndCounts extends Counts {
+ public String phrase;
+
+ public PhraseAndCounts(String phrase, int docPhraseCount, int totalPhraseCount) {
+ super(docPhraseCount, totalPhraseCount);
+ this.phrase = phrase;
+ }
+
+ @Override
+ public boolean equals(Object o) {
+ if (o instanceof PhraseAndCounts) {
+ PhraseAndCounts op = (PhraseAndCounts) o;
+ return phrase.equals(op.phrase) && super.equals(op);
+ }
+ return false;
+ }
+
+ @Override
+ public int hashCode() {
+ return super.hashCode() + 31 * phrase.hashCode();
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/query/PhraseCountTable.java b/phrasecount/src/main/java/phrasecount/query/PhraseCountTable.java
new file mode 100644
index 0000000..f5f670a
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/query/PhraseCountTable.java
@@ -0,0 +1,107 @@
+package phrasecount.query;
+
+import java.util.Iterator;
+import java.util.Map.Entry;
+
+import com.google.common.collect.Iterators;
+import org.apache.accumulo.core.client.ClientConfiguration;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.RowIterator;
+import org.apache.accumulo.core.client.Scanner;
+import org.apache.accumulo.core.client.ZooKeeperInstance;
+import org.apache.accumulo.core.client.security.tokens.PasswordToken;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.security.Authorizations;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.hadoop.io.Text;
+import phrasecount.pojos.Counts;
+import phrasecount.pojos.PhraseAndCounts;
+
+/**
+ * All of the code for dealing with the Accumulo table that Fluo is exporting to
+ */
+public class PhraseCountTable implements Iterable<PhraseAndCounts> {
+
+ static final String STAT_CF = "stat";
+
+ //name of column qualifier used to store phrase count across all documents
+ static final String TOTAL_PC_CQ = "totalCount";
+
+ //name of column qualifier used to store number of documents containing a phrase
+ static final String DOC_PC_CQ = "docCount";
+
+ public static Mutation createMutation(String phrase, long seq, Counts pc) {
+ Mutation mutation = new Mutation(phrase);
+
+ // use the sequence number for the Accumulo timestamp; this will cause older updates to fall
+ // behind newer ones
+ if (pc.totalPhraseCount == 0)
+ mutation.putDelete(STAT_CF, TOTAL_PC_CQ, seq);
+ else
+ mutation.put(STAT_CF, TOTAL_PC_CQ, seq, pc.totalPhraseCount + "");
+
+ if (pc.docPhraseCount == 0)
+ mutation.putDelete(STAT_CF, DOC_PC_CQ, seq);
+ else
+ mutation.put(STAT_CF, DOC_PC_CQ, seq, pc.docPhraseCount + "");
+
+ return mutation;
+ }
+
+ private Connector conn;
+ private String table;
+
+ public PhraseCountTable(FluoConfiguration fluoConfig, String table) throws Exception {
+ ZooKeeperInstance zki = new ZooKeeperInstance(
+ new ClientConfiguration().withZkHosts(fluoConfig.getAccumuloZookeepers())
+ .withInstance(fluoConfig.getAccumuloInstance()));
+ this.conn = zki.getConnector(fluoConfig.getAccumuloUser(),
+ new PasswordToken(fluoConfig.getAccumuloPassword()));
+ this.table = table;
+ }
+
+ public PhraseCountTable(Connector conn, String table) {
+ this.conn = conn;
+ this.table = table;
+ }
+
+
+ public Counts getPhraseCounts(String phrase) throws Exception {
+ Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
+ scanner.setRange(new Range(phrase));
+
+ int sum = 0;
+ int docCount = 0;
+
+ for (Entry<Key, Value> entry : scanner) {
+ String cq = entry.getKey().getColumnQualifierData().toString();
+ if (cq.equals(TOTAL_PC_CQ)) {
+ sum = Integer.valueOf(entry.getValue().toString());
+ }
+
+ if (cq.equals(DOC_PC_CQ)) {
+ docCount = Integer.valueOf(entry.getValue().toString());
+ }
+ }
+
+ return new Counts(docCount, sum);
+ }
+
+ @Override
+ public Iterator<PhraseAndCounts> iterator() {
+ try {
+ Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
+ scanner.fetchColumn(new Text(STAT_CF), new Text(TOTAL_PC_CQ));
+ scanner.fetchColumn(new Text(STAT_CF), new Text(DOC_PC_CQ));
+
+ return Iterators.transform(new RowIterator(scanner), new RowTransform());
+ } catch (RuntimeException e) {
+ throw e;
+ } catch (Exception e) {
+ throw new RuntimeException(e);
+ }
+ }
+}
diff --git a/phrasecount/src/main/java/phrasecount/query/RowTransform.java b/phrasecount/src/main/java/phrasecount/query/RowTransform.java
new file mode 100644
index 0000000..e86439c
--- /dev/null
+++ b/phrasecount/src/main/java/phrasecount/query/RowTransform.java
@@ -0,0 +1,34 @@
+package phrasecount.query;
+
+import java.util.Iterator;
+import java.util.Map.Entry;
+
+import com.google.common.base.Function;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Value;
+import phrasecount.pojos.PhraseAndCounts;
+
+public class RowTransform implements Function<Iterator<Entry<Key, Value>>, PhraseAndCounts> {
+ @Override
+ public PhraseAndCounts apply(Iterator<Entry<Key, Value>> input) {
+ String phrase = null;
+
+ int totalPhraseCount = 0;
+ int docPhraseCount = 0;
+
+ while (input.hasNext()) {
+ Entry<Key, Value> colEntry = input.next();
+ String cq = colEntry.getKey().getColumnQualifierData().toString();
+
+ if (cq.equals(PhraseCountTable.TOTAL_PC_CQ))
+ totalPhraseCount = Integer.parseInt(colEntry.getValue().toString());
+ else
+ docPhraseCount = Integer.parseInt(colEntry.getValue().toString());
+
+ if (phrase == null)
+ phrase = colEntry.getKey().getRowData().toString();
+ }
+
+ return new PhraseAndCounts(phrase, docPhraseCount, totalPhraseCount);
+ }
+}
diff --git a/phrasecount/src/test/java/phrasecount/PhraseCounterTest.java b/phrasecount/src/test/java/phrasecount/PhraseCounterTest.java
new file mode 100644
index 0000000..5815883
--- /dev/null
+++ b/phrasecount/src/test/java/phrasecount/PhraseCounterTest.java
@@ -0,0 +1,215 @@
+package phrasecount;
+
+import java.util.Random;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.security.tokens.PasswordToken;
+import org.apache.accumulo.minicluster.MiniAccumuloCluster;
+import org.apache.accumulo.minicluster.MiniAccumuloConfig;
+import org.apache.fluo.api.client.FluoAdmin.InitializationOptions;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.LoaderExecutor;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.mini.MiniFluo;
+import org.apache.fluo.recipes.core.types.TypedSnapshot;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+import phrasecount.pojos.Counts;
+import phrasecount.pojos.Document;
+import phrasecount.query.PhraseCountTable;
+
+import static phrasecount.Constants.DOC_CONTENT_COL;
+import static phrasecount.Constants.DOC_REF_COUNT_COL;
+import static phrasecount.Constants.TYPEL;
+
+// TODO make this an integration test
+
+public class PhraseCounterTest {
+ public static TemporaryFolder folder = new TemporaryFolder();
+ public static MiniAccumuloCluster cluster;
+ private static FluoConfiguration props;
+ private static MiniFluo miniFluo;
+ private static final PasswordToken password = new PasswordToken("secret");
+ private static AtomicInteger tableCounter = new AtomicInteger(1);
+ private PhraseCountTable pcTable;
+
+ @BeforeClass
+ public static void setUpBeforeClass() throws Exception {
+ folder.create();
+ MiniAccumuloConfig cfg = new MiniAccumuloConfig(folder.newFolder("miniAccumulo"),
+ new String(password.getPassword()));
+ cluster = new MiniAccumuloCluster(cfg);
+ cluster.start();
+ }
+
+ @AfterClass
+ public static void tearDownAfterClass() throws Exception {
+ cluster.stop();
+ folder.delete();
+ }
+
+ @Before
+ public void setUpFluo() throws Exception {
+
+ // Configure Fluo to use the mini instance. We could avoid all of this code and let MiniFluo
+ // create a MiniAccumulo instance. However, we need access to the MiniAccumulo instance in
+ // order to create the export/query table.
+ props = new FluoConfiguration();
+ props.setMiniStartAccumulo(false);
+ props.setApplicationName("phrasecount");
+ props.setAccumuloInstance(cluster.getInstanceName());
+ props.setAccumuloUser("root");
+ props.setAccumuloPassword("secret");
+ props.setInstanceZookeepers(cluster.getZooKeepers() + "/fluo");
+ props.setAccumuloZookeepers(cluster.getZooKeepers());
+ props.setAccumuloTable("data" + tableCounter.getAndIncrement());
+ props.setWorkerThreads(5);
+
+ // create the export/query table
+ String queryTable = "pcq" + tableCounter.getAndIncrement();
+ Connector conn = cluster.getConnector("root", "secret");
+ conn.tableOperations().create(queryTable);
+ pcTable = new PhraseCountTable(conn, queryTable);
+
+ // configure phrase count observers
+ Application.configure(props, new Application.Options(13, 13, cluster.getInstanceName(),
+ cluster.getZooKeepers(), "root", "secret", queryTable));
+
+ FluoFactory.newAdmin(props)
+ .initialize(new InitializationOptions().setClearTable(true).setClearZookeeper(true));
+
+ miniFluo = FluoFactory.newMiniFluo(props);
+ }
+
+ @After
+ public void tearDownFluo() throws Exception {
+ miniFluo.close();
+ }
+
+ private void loadDocument(FluoClient fluoClient, String uri, String content) {
+ try (LoaderExecutor le = fluoClient.newLoaderExecutor()) {
+ Document doc = new Document(uri, content);
+ le.execute(new DocumentLoader(doc));
+ }
+ miniFluo.waitForObservers();
+ }
+
+ @Test
+ public void test1() throws Exception {
+ try (FluoClient fluoClient = FluoFactory.newClient(props)) {
+
+ loadDocument(fluoClient, "/foo1", "This is only a test. Do not panic. This is only a test.");
+
+ Assert.assertEquals(new Counts(1, 2), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("test do not panic"));
+
+ // add a new document w/ different content and an overlapping phrase; should change some counts
+ loadDocument(fluoClient, "/foo2", "This is only a test");
+
+ Assert.assertEquals(new Counts(2, 3), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("test do not panic"));
+
+ // add new document w/ same content, should not change any counts
+ loadDocument(fluoClient, "/foo3", "This is only a test");
+
+ Assert.assertEquals(new Counts(2, 3), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("test do not panic"));
+
+ // change the content of /foo1, should change counts
+ loadDocument(fluoClient, "/foo1", "The test is over, for now.");
+
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("the test is over"));
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("test do not panic"));
+
+ // change content of foo2, should not change anything
+ loadDocument(fluoClient, "/foo2", "The test is over, for now.");
+
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("the test is over"));
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("test do not panic"));
+
+ String oldHash = new Document("/foo3", "This is only a test").getHash();
+ try (TypedSnapshot tsnap = TYPEL.wrap(fluoClient.newSnapshot())) {
+ Assert.assertNotNull(tsnap.get().row("doc:" + oldHash).col(DOC_CONTENT_COL).toString());
+ Assert.assertEquals(1, tsnap.get().row("doc:" + oldHash).col(DOC_REF_COUNT_COL).toInteger(0));
+ }
+ // dereference document that foo3 was referencing
+ loadDocument(fluoClient, "/foo3", "The test is over, for now.");
+
+ Assert.assertEquals(new Counts(1, 1), pcTable.getPhraseCounts("the test is over"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("is only a test"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("test do not panic"));
+
+ try (TypedSnapshot tsnap = TYPEL.wrap(fluoClient.newSnapshot())) {
+ Assert.assertNull(tsnap.get().row("doc:" + oldHash).col(DOC_CONTENT_COL).toString());
+ Assert.assertNull(tsnap.get().row("doc:" + oldHash).col(DOC_REF_COUNT_COL).toInteger());
+ }
+ }
+
+ }
+
+ @Test
+ public void testHighCardinality() throws Exception {
+ try (FluoClient fluoClient = FluoFactory.newClient(props)) {
+
+ Random rand = new Random();
+
+ loadDocsWithRandomWords(fluoClient, rand, "This is only a test", 0, 100);
+
+ Assert.assertEquals(new Counts(100, 100), pcTable.getPhraseCounts("this is only a"));
+ Assert.assertEquals(new Counts(100, 100), pcTable.getPhraseCounts("is only a test"));
+
+ loadDocsWithRandomWords(fluoClient, rand, "This is not a test", 0, 2);
+
+ Assert.assertEquals(new Counts(2, 2), pcTable.getPhraseCounts("this is not a"));
+ Assert.assertEquals(new Counts(2, 2), pcTable.getPhraseCounts("is not a test"));
+ Assert.assertEquals(new Counts(98, 98), pcTable.getPhraseCounts("this is only a"));
+ Assert.assertEquals(new Counts(98, 98), pcTable.getPhraseCounts("is only a test"));
+
+ loadDocsWithRandomWords(fluoClient, rand, "This is not a test", 2, 100);
+
+ Assert.assertEquals(new Counts(100, 100), pcTable.getPhraseCounts("this is not a"));
+ Assert.assertEquals(new Counts(100, 100), pcTable.getPhraseCounts("is not a test"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("this is only a"));
+ Assert.assertEquals(new Counts(0, 0), pcTable.getPhraseCounts("is only a test"));
+
+ loadDocsWithRandomWords(fluoClient, rand, "This is only a test", 0, 50);
+
+ Assert.assertEquals(new Counts(50, 50), pcTable.getPhraseCounts("this is not a"));
+ Assert.assertEquals(new Counts(50, 50), pcTable.getPhraseCounts("is not a test"));
+ Assert.assertEquals(new Counts(50, 50), pcTable.getPhraseCounts("this is only a"));
+ Assert.assertEquals(new Counts(50, 50), pcTable.getPhraseCounts("is only a test"));
+
+ }
+ }
+
+ void loadDocsWithRandomWords(FluoClient fluoClient, Random rand, String phrase, int start,
+ int end) {
+
+ try (LoaderExecutor le = fluoClient.newLoaderExecutor()) {
+ // load many documents that share the same phrase
+ for (int i = start; i < end; i++) {
+ String uri = "/foo" + i;
+ StringBuilder content = new StringBuilder(phrase);
+ // add a bunch of random words
+ for (int j = 0; j < 20; j++) {
+ content.append(' ');
+ content.append(Integer.toString(rand.nextInt(10000), 36));
+ }
+
+ Document doc = new Document(uri, content.toString());
+ le.execute(new DocumentLoader(doc));
+ }
+ }
+ miniFluo.waitForObservers();
+ }
+}
+
diff --git a/phrasecount/src/test/resources/log4j.properties b/phrasecount/src/test/resources/log4j.properties
new file mode 100644
index 0000000..1ed12ff
--- /dev/null
+++ b/phrasecount/src/test/resources/log4j.properties
@@ -0,0 +1,29 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+log4j.rootLogger=INFO, CA
+log4j.appender.CA=org.apache.log4j.ConsoleAppender
+log4j.appender.CA.layout=org.apache.log4j.PatternLayout
+log4j.appender.CA.layout.ConversionPattern=%d{ISO8601} [%c{2}] %-5p: %m%n
+
+#Uncomment to see debugging output for Fluo.
+#log4j.logger.org.apache.fluo=DEBUG
+
+#uncomment the following to see all transaction activity
+#log4j.logger.fluo.tx=TRACE
+
+log4j.logger.org.apache.zookeeper.ClientCnxn=FATAL
+log4j.logger.org.apache.zookeeper.ZooKeeper=WARN
+log4j.logger.org.apache.curator=WARN
diff --git a/stresso/.gitignore b/stresso/.gitignore
new file mode 100644
index 0000000..9233d7a
--- /dev/null
+++ b/stresso/.gitignore
@@ -0,0 +1,9 @@
+.classpath
+.project
+.settings
+target
+.DS_Store
+.idea
+*.iml
+git/
+logs/
diff --git a/stresso/.travis.yml b/stresso/.travis.yml
new file mode 100644
index 0000000..551c724
--- /dev/null
+++ b/stresso/.travis.yml
@@ -0,0 +1,25 @@
+# Copyright 2015 Stresso authors (see AUTHORS)
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+language: java
+jdk:
+ - oraclejdk8
+script: mvn verify
+notifications:
+ irc:
+ channels:
+ - "chat.freenode.net#fluo"
+ on_success: always
+ on_failure: always
+ use_notice: true
+ skip_join: true
diff --git a/stresso/AUTHORS b/stresso/AUTHORS
new file mode 100644
index 0000000..d413329
--- /dev/null
+++ b/stresso/AUTHORS
@@ -0,0 +1,5 @@
+AUTHORS
+-------
+
+Keith Turner - Peterson Technologies
+Mike Walch - Peterson Technologies
diff --git a/stresso/LICENSE b/stresso/LICENSE
new file mode 100644
index 0000000..37ec93a
--- /dev/null
+++ b/stresso/LICENSE
@@ -0,0 +1,191 @@
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction, and
+distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by the copyright
+owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all other entities
+that control, are controlled by, or are under common control with that entity.
+For the purposes of this definition, "control" means (i) the power, direct or
+indirect, to cause the direction or management of such entity, whether by
+contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
+outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity exercising
+permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications, including
+but not limited to software source code, documentation source, and configuration
+files.
+
+"Object" form shall mean any form resulting from mechanical transformation or
+translation of a Source form, including but not limited to compiled object code,
+generated documentation, and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or Object form, made
+available under the License, as indicated by a copyright notice that is included
+in or attached to the work (an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object form, that
+is based on (or derived from) the Work and for which the editorial revisions,
+annotations, elaborations, or other modifications represent, as a whole, an
+original work of authorship. For the purposes of this License, Derivative Works
+shall not include works that remain separable from, or merely link (or bind by
+name) to the interfaces of, the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including the original version
+of the Work and any modifications or additions to that Work or Derivative Works
+thereof, that is intentionally submitted to Licensor for inclusion in the Work
+by the copyright owner or by an individual or Legal Entity authorized to submit
+on behalf of the copyright owner. For the purposes of this definition,
+"submitted" means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems, and
+issue tracking systems that are managed by, or on behalf of, the Licensor for
+the purpose of discussing and improving the Work, but excluding communication
+that is conspicuously marked or otherwise designated in writing by the copyright
+owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity on behalf
+of whom a Contribution has been received by Licensor and subsequently
+incorporated within the Work.
+
+2. Grant of Copyright License.
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the Work and such
+Derivative Works in Source or Object form.
+
+3. Grant of Patent License.
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable (except as stated in this section) patent license to make, have
+made, use, offer to sell, sell, import, and otherwise transfer the Work, where
+such license applies only to those patent claims licensable by such Contributor
+that are necessarily infringed by their Contribution(s) alone or by combination
+of their Contribution(s) with the Work to which such Contribution(s) was
+submitted. If You institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work or a
+Contribution incorporated within the Work constitutes direct or contributory
+patent infringement, then any patent licenses granted to You under this License
+for that Work shall terminate as of the date such litigation is filed.
+
+4. Redistribution.
+
+You may reproduce and distribute copies of the Work or Derivative Works thereof
+in any medium, with or without modifications, and in Source or Object form,
+provided that You meet the following conditions:
+
+You must give any other recipients of the Work or Derivative Works a copy of
+this License; and
+You must cause any modified files to carry prominent notices stating that You
+changed the files; and
+You must retain, in the Source form of any Derivative Works that You distribute,
+all copyright, patent, trademark, and attribution notices from the Source form
+of the Work, excluding those notices that do not pertain to any part of the
+Derivative Works; and
+If the Work includes a "NOTICE" text file as part of its distribution, then any
+Derivative Works that You distribute must include a readable copy of the
+attribution notices contained within such NOTICE file, excluding those notices
+that do not pertain to any part of the Derivative Works, in at least one of the
+following places: within a NOTICE text file distributed as part of the
+Derivative Works; within the Source form or documentation, if provided along
+with the Derivative Works; or, within a display generated by the Derivative
+Works, if and wherever such third-party notices normally appear. The contents of
+the NOTICE file are for informational purposes only and do not modify the
+License. You may add Your own attribution notices within Derivative Works that
+You distribute, alongside or as an addendum to the NOTICE text from the Work,
+provided that such additional attribution notices cannot be construed as
+modifying the License.
+You may add Your own copyright statement to Your modifications and may provide
+additional or different license terms and conditions for use, reproduction, or
+distribution of Your modifications, or for any such Derivative Works as a whole,
+provided Your use, reproduction, and distribution of the Work otherwise complies
+with the conditions stated in this License.
+
+5. Submission of Contributions.
+
+Unless You explicitly state otherwise, any Contribution intentionally submitted
+for inclusion in the Work by You to the Licensor shall be under the terms and
+conditions of this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify the terms of
+any separate license agreement you may have executed with Licensor regarding
+such Contributions.
+
+6. Trademarks.
+
+This License does not grant permission to use the trade names, trademarks,
+service marks, or product names of the Licensor, except as required for
+reasonable and customary use in describing the origin of the Work and
+reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty.
+
+Unless required by applicable law or agreed to in writing, Licensor provides the
+Work (and each Contributor provides its Contributions) on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
+including, without limitation, any warranties or conditions of TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
+solely responsible for determining the appropriateness of using or
+redistributing the Work and assume any risks associated with Your exercise of
+permissions under this License.
+
+8. Limitation of Liability.
+
+In no event and under no legal theory, whether in tort (including negligence),
+contract, or otherwise, unless required by applicable law (such as deliberate
+and grossly negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special, incidental,
+or consequential damages of any character arising as a result of this License or
+out of the use or inability to use the Work (including but not limited to
+damages for loss of goodwill, work stoppage, computer failure or malfunction, or
+any and all other commercial damages or losses), even if such Contributor has
+been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability.
+
+While redistributing the Work or Derivative Works thereof, You may choose to
+offer, and charge a fee for, acceptance of support, warranty, indemnity, or
+other liability obligations and/or rights consistent with this License. However,
+in accepting such obligations, You may act only on Your own behalf and on Your
+sole responsibility, not on behalf of any other Contributor, and only if You
+agree to indemnify, defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason of your
+accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+APPENDIX: How to apply the Apache License to your work
+
+To apply the Apache License to your work, attach the following boilerplate
+notice, with the fields enclosed by brackets "[]" replaced with your own
+identifying information. (Don't include the brackets!) The text should be
+enclosed in the appropriate comment syntax for the file format. We also
+recommend that a file or class name and description of purpose be included on
+the same "printed page" as the copyright notice for easier identification within
+third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/stresso/README.md b/stresso/README.md
new file mode 100644
index 0000000..d3c2577
--- /dev/null
+++ b/stresso/README.md
@@ -0,0 +1,192 @@
+
+# Stresso
+
+[![Build Status](https://travis-ci.org/astralway/stresso.svg?branch=master)](https://travis-ci.org/astralway/stresso)
+
+An example application designed to stress Apache Fluo. This Fluo application computes the
+number of unique integers through the process of building a bitwise trie. New numbers
+are added to the trie as leaf nodes. Observers watch all nodes in the trie to create
+parents and percolate counts up to the root nodes such that each node in the trie keeps
+track of the number of leaf nodes below it. The count at the root nodes should equal
+the total number of leaf nodes. This makes it easy to verify if the test ran correctly.
+The test stresses Apache Fluo in that multiple transactions can operate on the same data
+as counts are percolated up the trie.
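
As a sketch of the percolation described above: each leaf is a 64-bit number, and a node's parent is obtained by chopping `nodeSize` bits off the end. The snippet below is an illustration only; the `parent` helper and the shifted-prefix representation of nodes are assumptions made for clarity, not the application's actual encoding.

```java
public class TrieSketch {
    // A node's parent: chop nodeSize bits off the end of the node's prefix.
    static long parent(long node, int nodeSize) {
        return node >>> nodeSize;
    }

    public static void main(String[] args) {
        int nodeSize = 8;               // bits chopped per level
        int deepest = 64 / nodeSize;    // leaves live at level 8
        int stopLevel = 5;              // percolate up to level 5
        long node = 1_000_000_000_000L; // a leaf value (10^12)
        for (int level = deepest; level > stopLevel; level--) {
            node = parent(node, nodeSize);
        }
        // prefix of the root node this leaf percolates into
        System.out.printf("0x%X%n", node); // prints 0xE8D4
    }
}
```

Here 10<sup>12</sup> is `0xE8D4A51000`; chopping three 8-bit levels leaves the prefix `0xE8D4`, a level-5 root node.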
+
+## Concepts and definitions
+
+This test has the following set of configurable parameters.
+
+ * **nodeSize** : The number of bits chopped off the end each time a number is
+   percolated up. Choose a nodeSize such that `64 % nodeSize == 0`.
+ * **stopLevel** : The number of levels in the tree is a function of the
+   nodeSize. The deepest possible level is `64 / nodeSize`. Levels are
+   decremented going up the tree. The stop level determines how far up
+   counts are percolated. The deeper the stop level, the more root nodes
+   there are. More root nodes mean fewer collisions, but all roots must be
+   scanned to get the count of unique numbers. Having ~64k root nodes is a
+   good choice.
+ * **max** : Random numbers are generated modulo the max.
+
+Setting the stop level such that you have ~64k root nodes depends on the max
+and nodeSize. For example, assume we choose a max of 10<sup>12</sup> and a
+node size of 8. The following table shows information about each level in the
+tree using this configuration. For a max of 10<sup>12</sup>, choosing a stop
+level of 5 would result in 59,604 root nodes. With this many root nodes there
+would not be many collisions, and scanning 59,604 nodes to compute the number
+of unique integers is a quick operation.
+
+|Level|Max Node |Number of possible Nodes|
+|:---:|---------------------|-----------------------:|
+| 0 |`0xXXXXXXXXXXXXXXXX` | 1 |
+| 1 |`0x00XXXXXXXXXXXXXX` | 1 |
+| 2 |`0x0000XXXXXXXXXXXX` | 1 |
+| 3 |`0x000000XXXXXXXXXX` | 1 |
+| 4 |`0x000000E8XXXXXXXX` | 232 |
+| 5 |`0x000000E8D4XXXXXX` | 59,604 |
+| 6 |`0x000000E8D4A5XXXX` | 15,258,789 |
+| 7 |`0x000000E8D4A510XX` | 3,906,250,000 |
+| 8 |`0x000000E8D4A51000` | 1,000,000,000,000 |
+
+In the table above, X indicates nibbles that are always zeroed out for every
+node at that level. You can easily view nodes at a level using a row prefix
+with the fluo scan command. For example `fluo scan -p 05` shows all nodes at
+level 5.
+
+For a small-scale test, a max of 10<sup>9</sup> and a stop level of 6 is a good
+choice.
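
The node counts in the table above can be reproduced with a few bit shifts. This is a minimal sketch (the `max >>> bits` approximation matches the table's values; the class and variable names are illustrative):

```java
public class LevelNodeCounts {
    public static void main(String[] args) {
        long max = 1_000_000_000_000L; // 10^12
        int nodeSize = 8;
        int deepest = 64 / nodeSize;   // level 8
        for (int level = 4; level <= deepest; level++) {
            // number of trailing bits zeroed out at this level
            int bits = (deepest - level) * nodeSize;
            System.out.printf("level %d : %,d possible nodes%n", level, max >>> bits);
        }
    }
}
```

Running this prints 232 possible nodes for level 4 and 59,604 for level 5, in agreement with the table.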
+
+## Building Stresso
+
+```
+mvn package
+```
+
+This will create a jar and shaded jar in target:
+
+```
+$ ls target/stresso-*
+target/stresso-0.0.1-SNAPSHOT.jar target/stresso-0.0.1-SNAPSHOT-shaded.jar
+```
+
+## Run Stresso using MiniFluo
+
+There are several integration tests that run Stresso on a MiniFluo instance.
+These tests can be run using `mvn verify`.
+
+## Run Stresso on cluster
+
+The [bin directory](/bin) contains a set of scripts to help run this test on a
+cluster. These scripts make the following assumptions:
+
+ * `FLUO_HOME` environment variable is set. If not set, then set it in `conf/env.sh`.
+ * Hadoop `yarn` command is on path.
+ * Hadoop `hadoop` command is on path.
+ * Accumulo `accumulo` command is on path.
+
+Before running any of the scripts, copy [conf/env.sh.example](/conf/env.sh.example)
+to `conf/env.sh`, then inspect and modify the file.
+
+Next, execute the [run-test.sh](/bin/run-test.sh) script. This script will create a
+new Apache Fluo app called `stresso` (which can be changed by `FLUO_APP_NAME` in your env.sh).
+It will modify the application's fluo.properties, copy the stresso jar to the `lib/`
+directory of the app and set the following in fluo.properties:
+
+```
+fluo.observer.0=stresso.trie.NodeObserver
+fluo.app.trie.nodeSize=X
+fluo.app.trie.stopLevel=Y
+```
+
+The `run-test.sh` script will then initialize and start the Stresso application.
+It will load a lot of data directly into Accumulo without transactions and then
+incrementally load smaller amounts of data using transactions. After incrementally
+loading some data, it computes the expected number of unique integers using map reduce.
+It then prints the number of unique integers computed by Apache Fluo.
+
+## Additional Scripts
+
+The script [generate.sh](/bin/generate.sh) starts a map reduce job to generate
+random integers.
+
+```
+generate.sh <num files> <num per file> <max> <out dir>
+
+where:
+
+num files = Number of files to generate (and the number of map tasks)
+num per file = Number of random numbers to generate per file
+max = Generate random numbers between 0 and max
+out dir = Output directory
+```
+
+The script [split.sh](/bin/split.sh) pre-splits the Accumulo table used by Apache
+Fluo. Consider running this command before loading data.
+
+```
+split.sh <num tablets> <max>
+
+where:
+
+num tablets = Number of tablets to create for the lowest level of the tree. Fewer tablets will be created for higher levels based on the max.
+```
+After generating random numbers, load them into Apache Fluo with one of the following
+commands. The script [init.sh](/bin/init.sh) initializes an empty table using
+map reduce. This simulates the case where a user has a lot of initial data to
+load into Fluo. This command should only be run when the table is empty
+because it writes directly to the Fluo table without using transactions.
+
+```
+init.sh <input dir> <tmp dir> <num reducers>
+
+where:
+
+input dir = A directory with files created by stresso.trie.Generate
+tmp dir = This command runs two map reduce jobs and needs an intermediate directory to store data.
+num reducers = Number of reduce tasks the map reduce job should run
+```
+
+Run the [load.sh](/bin/load.sh) script on a table with existing data. It starts
+a map reduce job that executes load transactions. Loading the same directory
+multiple times should not result in incorrect counts.
+
+```
+load.sh <input dir>
+```
+
+After loading data, run the [print.sh](/bin/print.sh) script to check the
+status of the computation of the number of unique integers within Apache Fluo. This
+command will print two numbers: the sum of the counts at the root nodes and the
+number of root nodes. If there are outstanding notifications to process, this
+count may not be accurate.
+
+```
+print.sh
+```
+
+In order to know how many unique numbers are expected, run the [unique.sh](/bin/unique.sh)
+script. This script runs a map reduce job that calculates the number of
+unique integers. It can take a list of directories created by
+multiple runs of [generate.sh](/bin/generate.sh).
+
+```
+unique.sh <num reducers> <input dir>{ <input dir>}
+```
+
+As transactions execute they leave a trail of history behind. The nodes in the
+lower levels of the tree are updated by many transactions and therefore have a
+long history trail. A long transactional history can slow down transactions.
+Forcing a compaction in Accumulo will clean up this history. However,
+compacting the entire table is expensive. To avoid this expense, compact only the
+lower levels of the tree. The following command will compact levels of the
+tree with a maximum number of nodes less than the specified cutoff.
+
+```
+compact-ll.sh <max> <cutoff>
+```
+
+where:
+
+```
+cutoff = Any level of the tree with a maximum number of nodes that is less than this cutoff will be compacted.
+```
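
As an illustration of which levels fall under a given cutoff, using the same `max >>> bits` approximation as the level table earlier in this README (a sketch only; the actual `stresso.trie.CompactLL` logic may differ):

```java
public class CutoffSketch {
    public static void main(String[] args) {
        long max = 1_000_000_000_000L; // 10^12
        long cutoff = 1_000_000;       // compact levels with < 1M possible nodes
        int nodeSize = 8;
        int deepest = 64 / nodeSize;
        for (int level = 1; level <= deepest; level++) {
            // approximate number of possible nodes at this level (at least 1)
            long nodes = Math.max(1, max >>> ((deepest - level) * nodeSize));
            if (nodes < cutoff) {
                System.out.println("compact level " + level + " (~" + nodes + " nodes)");
            }
        }
    }
}
```

With these numbers, levels 1 through 5 fall under the cutoff, so only a small fraction of the table is compacted.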
diff --git a/stresso/bin/compact-ll.sh b/stresso/bin/compact-ll.sh
new file mode 100755
index 0000000..5a98277
--- /dev/null
+++ b/stresso/bin/compact-ll.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+$FLUO_CMD exec $FLUO_APP_NAME stresso.trie.CompactLL $FLUO_PROPS $@
diff --git a/stresso/bin/diff.sh b/stresso/bin/diff.sh
new file mode 100755
index 0000000..5e36d95
--- /dev/null
+++ b/stresso/bin/diff.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+$FLUO_CMD exec $FLUO_APP_NAME stresso.trie.Diff $FLUO_PROPS $@
diff --git a/stresso/bin/generate.sh b/stresso/bin/generate.sh
new file mode 100755
index 0000000..622be8a
--- /dev/null
+++ b/stresso/bin/generate.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+yarn jar $STRESSO_JAR stresso.trie.Generate $@
diff --git a/stresso/bin/init.sh b/stresso/bin/init.sh
new file mode 100755
index 0000000..133ad10
--- /dev/null
+++ b/stresso/bin/init.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+if [ "$#" -ne 3 ]; then
+ echo "Usage : $0 <input dir> <work dir> <num reducers>"
+ exit 1
+fi
+
+yarn jar $STRESSO_SHADED_JAR stresso.trie.Init -Dmapreduce.job.reduces=$3 $FLUO_PROPS $1 $2
diff --git a/stresso/bin/load-env.sh b/stresso/bin/load-env.sh
new file mode 100644
index 0000000..5400fc2
--- /dev/null
+++ b/stresso/bin/load-env.sh
@@ -0,0 +1,44 @@
+if [ ! -f $BIN_DIR/../conf/env.sh ]
+then
+ . $BIN_DIR/../conf/env.sh.example
+else
+ . $BIN_DIR/../conf/env.sh
+fi
+
+# verify fluo configuration
+if [ ! -d "$FLUO_HOME" ]; then
+ echo "Problem with FLUO_HOME : $FLUO_HOME"
+ exit 1
+fi
+FLUO_CMD=$FLUO_HOME/bin/fluo
+if [ -z "$FLUO_APP_NAME" ]; then
+ echo "FLUO_APP_NAME is not set!"
+ exit 1
+fi
+FLUO_APP_LIB=$FLUO_HOME/apps/$FLUO_APP_NAME/lib
+FLUO_PROPS=$FLUO_HOME/apps/$FLUO_APP_NAME/conf/fluo.properties
+if [ ! -f "$FLUO_PROPS" ] && [ -z "$SKIP_FLUO_PROPS_CHECK" ]; then
+ echo "Fluo properties file not found : $FLUO_PROPS"
+ exit 1
+fi
+
+STRESSO_VERSION=0.0.1-SNAPSHOT
+STRESSO_JAR=$BIN_DIR/../target/stresso-$STRESSO_VERSION.jar
+STRESSO_SHADED_JAR=$BIN_DIR/../target/stresso-$STRESSO_VERSION-shaded.jar
+if [ ! -f "$STRESSO_JAR" ] && [ -z "$SKIP_JAR_CHECKS" ]; then
+ echo "Stresso jar not found : $STRESSO_JAR"
+ exit 1;
+fi
+if [ ! -f "$STRESSO_SHADED_JAR" ] && [ -z "$SKIP_JAR_CHECKS" ]; then
+ echo "Stresso shaded jar not found : $STRESSO_SHADED_JAR"
+ exit 1;
+fi
+
+command -v yarn >/dev/null 2>&1 || { echo >&2 "I require yarn but it's not installed. Aborting."; exit 1; }
+command -v hadoop >/dev/null 2>&1 || { echo >&2 "I require hadoop but it's not installed. Aborting."; exit 1; }
+
+if [[ "$OSTYPE" == "darwin"* ]]; then
+ export SED="sed -i .bak"
+else
+ export SED="sed -i"
+fi
diff --git a/stresso/bin/load.sh b/stresso/bin/load.sh
new file mode 100755
index 0000000..8cf2ac5
--- /dev/null
+++ b/stresso/bin/load.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+yarn jar $STRESSO_SHADED_JAR stresso.trie.Load $FLUO_PROPS $@
diff --git a/stresso/bin/print.sh b/stresso/bin/print.sh
new file mode 100755
index 0000000..2554c4c
--- /dev/null
+++ b/stresso/bin/print.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+$FLUO_CMD exec $FLUO_APP_NAME stresso.trie.Print $FLUO_PROPS $@
diff --git a/stresso/bin/run-test.sh b/stresso/bin/run-test.sh
new file mode 100755
index 0000000..a58dd6f
--- /dev/null
+++ b/stresso/bin/run-test.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+SKIP_JAR_CHECKS="true"
+SKIP_FLUO_PROPS_CHECK="true"
+
+. $BIN_DIR/load-env.sh
+
+unset SKIP_JAR_CHECKS
+unset SKIP_FLUO_PROPS_CHECK
+
+# stop if any command fails
+set -e
+
+if [ ! -d $FLUO_HOME/apps/$FLUO_APP_NAME ]; then
+ $FLUO_CMD new $FLUO_APP_NAME
+else
+ echo "Restarting '$FLUO_APP_NAME' application. Errors may be printed if it's not running..."
+ $FLUO_CMD stop $FLUO_APP_NAME || true
+ rm -rf $FLUO_HOME/apps/$FLUO_APP_NAME
+ $FLUO_CMD new $FLUO_APP_NAME
+fi
+
+# build stresso
+(cd $BIN_DIR/..;mvn package -Dfluo.version=$FLUO_VERSION -Daccumulo.version=$ACCUMULO_VERSION -DskipTests)
+
+if [[ $(accumulo version) == *1.6* ]]; then
+ # build stress balancer
+ (cd $BIN_DIR/..; mkdir -p git; cd git;git clone https://github.com/keith-turner/stress-balancer.git; cd stress-balancer; ./config-fluo.sh $FLUO_PROPS)
+fi
+
+if [ ! -f "$STRESSO_JAR" ]; then
+ echo "Stresso jar not found : $STRESSO_JAR"
+ exit 1
+fi
+if [ ! -d $FLUO_APP_LIB ]; then
+ echo "Fluo app lib $FLUO_APP_LIB does not exist"
+ exit 1
+fi
+cp $STRESSO_JAR $FLUO_APP_LIB
+mvn dependency:copy-dependencies -DincludeArtifactIds=fluo-recipes-core -DoutputDirectory=$FLUO_APP_LIB
+
+# determine a good stop level
+if (("$MAX" <= $((10**9)))); then
+ STOP=6
+elif (("$MAX" <= $((10**12)))); then
+ STOP=5
+else
+ STOP=4
+fi
+
+# delete existing config in fluo.properties if it exists
+$SED '/fluo.observer/d' $FLUO_PROPS
+$SED '/fluo.app.trie/d' $FLUO_PROPS
+
+# append stresso specific config
+echo "fluo.observer.0=stresso.trie.NodeObserver" >> $FLUO_PROPS
+echo "fluo.app.trie.nodeSize=8" >> $FLUO_PROPS
+echo "fluo.app.trie.stopLevel=$STOP" >> $FLUO_PROPS
+
+$FLUO_CMD init $FLUO_APP_NAME -f
+$FLUO_CMD start $FLUO_APP_NAME
+
+echo "Removing any previous logs in $LOG_DIR"
+mkdir -p $LOG_DIR
+rm -f $LOG_DIR/*
+
+# configure balancer for fluo table
+if [[ $(accumulo version) == *1.6* ]]; then
+ (cd $BIN_DIR/../git/stress-balancer; ./config-accumulo.sh $FLUO_PROPS)
+fi # TODO setup RegexGroupBalancer built into Accumulo 1.7.0... may be easier to do from java
+
+hadoop fs -rm -r -f /stresso/
+
+set -e
+
+# add splits to Fluo table
+echo "*****Presplitting table*****"
+$BIN_DIR/split.sh $SPLITS >$LOG_DIR/split.out 2>$LOG_DIR/split.err
+
+if (( GEN_INIT > 0 )); then
+ # generate and load initial data using MapReduce, writing directly to the table
+ echo "*****Generating and loading initial data set*****"
+ $BIN_DIR/generate.sh $MAPS $((GEN_INIT / MAPS)) $MAX /stresso/init >$LOG_DIR/generate_0.out 2>$LOG_DIR/generate_0.err
+ $BIN_DIR/init.sh /stresso/init /stresso/initTmp $REDUCES >$LOG_DIR/init.out 2>$LOG_DIR/init.err
+ hadoop fs -rm -r /stresso/initTmp
+fi
+
+# load data incrementally
+for i in $(seq 1 $ITERATIONS); do
+ echo "*****Generating and loading incremental data set $i*****"
+ $BIN_DIR/generate.sh $MAPS $((GEN_INCR / MAPS)) $MAX /stresso/$i >$LOG_DIR/generate_$i.out 2>$LOG_DIR/generate_$i.err
+ $BIN_DIR/load.sh /stresso/$i >$LOG_DIR/load_$i.out 2>$LOG_DIR/load_$i.err
+ # TODO could reload the same dataset sometimes, maybe when i%5 == 0 or something
+ $BIN_DIR/compact-ll.sh $MAX $COMPACT_CUTOFF >$LOG_DIR/compact-ll_$i.out 2>$LOG_DIR/compact-ll_$i.err
+ if ! ((i % WAIT_PERIOD)); then
+ $FLUO_CMD wait $FLUO_APP_NAME >$LOG_DIR/wait_$i.out 2>$LOG_DIR/wait_$i.err
+ else
+ sleep $SLEEP
+ fi
+done
+
+# print unique counts
+echo "*****Calculating # of unique integers using MapReduce*****"
+$BIN_DIR/unique.sh $REDUCES /stresso/* >$LOG_DIR/unique.out 2>$LOG_DIR/unique.err
+grep UNIQUE $LOG_DIR/unique.err
+
+echo "*****Wait for Fluo to finish processing*****"
+$FLUO_CMD wait $FLUO_APP_NAME
+
+echo "*****Printing # of unique integers calculated by Fluo*****"
+$BIN_DIR/print.sh >$LOG_DIR/print.out 2>$LOG_DIR/print.err
+cat $LOG_DIR/print.out
+
+echo "*****Verifying Fluo & MapReduce results match*****"
+MAPR_TOTAL=$(grep UNIQUE $LOG_DIR/unique.err | cut -d = -f 2)
+FLUO_TOTAL=$(grep "Total at root" $LOG_DIR/print.out | cut -d ' ' -f 5)
+if [ $MAPR_TOTAL -eq $FLUO_TOTAL ]; then
+ echo "Success! Fluo & MapReduce both calculated $FLUO_TOTAL unique integers"
+ exit 0
+else
+ echo "ERROR - Results do not match. Fluo calculated $FLUO_TOTAL unique integers while MapReduce calculated $MAPR_TOTAL integers"
+ exit 1
+fi
diff --git a/stresso/bin/split.sh b/stresso/bin/split.sh
new file mode 100755
index 0000000..225bef5
--- /dev/null
+++ b/stresso/bin/split.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+$FLUO_CMD exec $FLUO_APP_NAME stresso.trie.Split $FLUO_PROPS "$TABLE_PROPS" $@
diff --git a/stresso/bin/unique.sh b/stresso/bin/unique.sh
new file mode 100755
index 0000000..68c2a58
--- /dev/null
+++ b/stresso/bin/unique.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+. $BIN_DIR/load-env.sh
+
+if [ "$#" -lt 2 ]; then
+ echo "Usage : $0 <num reducers> <input dir> [<input dir> ...]"
+ exit 1
+fi
+
+yarn jar $STRESSO_JAR stresso.trie.Unique -Dmapreduce.job.reduces=$1 ${@:2}
diff --git a/stresso/conf/.gitignore b/stresso/conf/.gitignore
new file mode 100644
index 0000000..137e678
--- /dev/null
+++ b/stresso/conf/.gitignore
@@ -0,0 +1 @@
+env.sh
diff --git a/stresso/conf/env.sh.example b/stresso/conf/env.sh.example
new file mode 100644
index 0000000..77f9171
--- /dev/null
+++ b/stresso/conf/env.sh.example
@@ -0,0 +1,48 @@
+###############################
+# configuration for all scripts
+###############################
+# Fluo Home
+test -z "$FLUO_HOME" && FLUO_HOME=/path/to/fluo
+# Fluo application name
+FLUO_APP_NAME=stresso
+
+###############################
+# configuration for run-test.sh
+###############################
+# Place where logs from test are placed
+LOG_DIR=$BIN_DIR/../logs
+# Maximum number to generate
+MAX=$((10**9))
+# Number of splits to create in the table
+SPLITS=17
+# Number of mappers to run for data generation, which determines how many files
+# generation outputs. The number of files determines how many mappers loading
+# data will run.
+MAPS=17
+# Number of reduce tasks
+REDUCES=17
+# Number of random numbers to generate initially
+GEN_INIT=$((10**6))
+# Number of random numbers to generate for each incremental step.
+GEN_INCR=$((10**3))
+# Number of incremental steps.
+ITERATIONS=3
+# Seconds to sleep between incremental steps.
+SLEEP=30
+# Compact levels with less than the following possible nodes after loads
+COMPACT_CUTOFF=$((256**3 + 1))
+# The fluo wait command is executed after this many incremental load steps.
+WAIT_PERIOD=10
+# To run map reduce jobs, a shaded jar is built. The following properties
+# determine what versions of Fluo and Accumulo client libs end up in the shaded
+# jar.
+FLUO_VERSION=$($FLUO_HOME/bin/fluo version)
+ACCUMULO_VERSION=$(accumulo version)
+
+# The following Accumulo table properties will be set
+read -r -d '' TABLE_PROPS << EOM
+table.compaction.major.ratio=1.5
+table.file.compress.blocksize=8K
+table.file.compress.blocksize.index=32K
+table.file.compress.type=snappy
+EOM
diff --git a/stresso/conf/log4j.xml b/stresso/conf/log4j.xml
new file mode 100644
index 0000000..bd82a3a
--- /dev/null
+++ b/stresso/conf/log4j.xml
@@ -0,0 +1,39 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Copyright 2015 Stresso authors (see AUTHORS)
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
+<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
+
+ <appender name="console" class="org.apache.log4j.ConsoleAppender">
+ <param name="Target" value="System.out"/>
+ <layout class="org.apache.log4j.PatternLayout">
+ <param name="ConversionPattern" value="%d{ISO8601} [%-8c{2}] %-5p: %m%n" />
+ </layout>
+ </appender>
+
+ <logger name="org.apache.zookeeper">
+ <level value="ERROR" />
+ </logger>
+
+ <logger name="org.apache.curator">
+ <level value="ERROR" />
+ </logger>
+
+ <root>
+ <level value="INFO" />
+ <appender-ref ref="console" />
+ </root>
+</log4j:configuration>
diff --git a/stresso/pom.xml b/stresso/pom.xml
new file mode 100644
index 0000000..9b514da
--- /dev/null
+++ b/stresso/pom.xml
@@ -0,0 +1,234 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Copyright 2014 Stresso authors (see AUTHORS)
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <groupId>io.github.astralway</groupId>
+ <artifactId>stresso</artifactId>
+ <version>0.0.1-SNAPSHOT</version>
+ <packaging>jar</packaging>
+
+ <name>Stresso</name>
+ <description>This repo contains an example application designed to stress Apache Fluo</description>
+ <url>https://github.com/astralway/stresso</url>
+
+ <properties>
+ <accumulo.version>1.7.2</accumulo.version>
+ <hadoop.version>2.6.3</hadoop.version>
+ <fluo.version>1.0.0-incubating</fluo.version>
+ <fluo-recipes.version>1.0.0-incubating</fluo-recipes.version>
+ <slf4j.version>1.7.12</slf4j.version>
+ </properties>
+
+ <profiles>
+ <profile>
+ <id>mini-accumulo</id>
+ <activation>
+ <property>
+ <name>!skipTests</name>
+ </property>
+ </activation>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-maven-plugin</artifactId>
+ <version>${accumulo.version}</version>
+ <configuration>
+ <instanceName>it-instance-maven</instanceName>
+ <rootPassword>ITSecret</rootPassword>
+ </configuration>
+ <executions>
+ <execution>
+ <id>run-plugin</id>
+ <goals>
+ <goal>start</goal>
+ <goal>stop</goal>
+ </goals>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ </profile>
+ </profiles>
+
+ <build>
+ <plugins>
+ <plugin>
+ <artifactId>maven-compiler-plugin</artifactId>
+ <version>3.1</version>
+ <configuration>
+ <source>1.8</source>
+ <target>1.8</target>
+ <optimize>true</optimize>
+ <encoding>UTF-8</encoding>
+ </configuration>
+ </plugin>
+
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-failsafe-plugin</artifactId>
+ <configuration>
+ <systemPropertyVariables>
+ <fluo.it.instance.name>it-instance-maven</fluo.it.instance.name>
+ <fluo.it.instance.clear>false</fluo.it.instance.clear>
+ </systemPropertyVariables>
+ </configuration>
+ <executions>
+ <execution>
+ <id>run-integration-tests</id>
+ <goals>
+ <goal>integration-test</goal>
+ <goal>verify</goal>
+ </goals>
+ </execution>
+ </executions>
+ </plugin>
+
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-shade-plugin</artifactId>
+ <executions>
+ <execution>
+ <goals>
+ <goal>shade</goal>
+ </goals>
+ <phase>package</phase>
+ <configuration>
+ <shadedArtifactAttached>true</shadedArtifactAttached>
+ <shadedClassifierName>shaded</shadedClassifierName>
+ <filters>
+ <filter>
+ <artifact>*:*</artifact>
+ <excludes>
+ <exclude>META-INF/*.SF</exclude>
+ <exclude>META-INF/*.DSA</exclude>
+ <exclude>META-INF/*.RSA</exclude>
+ </excludes>
+ </filter>
+ </filters>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+
+ </plugins>
+ </build>
+
+ <!--
+ The provided scope is used for dependencies that should not end up in
+ the shaded jar. The shaded jar is used to run MapReduce jobs via the yarn
+ command. The yarn command provides the hadoop dependencies, so they are not
+ needed in the shaded jar.
+ -->
+
+ <dependencies>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-api</artifactId>
+ <version>${fluo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-core</artifactId>
+ <version>${fluo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-mapreduce</artifactId>
+ <version>${fluo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-core</artifactId>
+ <version>${fluo-recipes.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-core</artifactId>
+ <version>${accumulo.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-client</artifactId>
+ <version>${hadoop.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ <version>${slf4j.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-log4j12</artifactId>
+ <version>${slf4j.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>com.google.guava</groupId>
+ <artifactId>guava</artifactId>
+ <version>13.0.1</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>commons-configuration</groupId>
+ <artifactId>commons-configuration</artifactId>
+ <version>1.10</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>commons-codec</groupId>
+ <artifactId>commons-codec</artifactId>
+ <version>1.10</version>
+ <scope>provided</scope>
+ </dependency>
+
+ <!-- Test Dependencies -->
+ <dependency>
+ <groupId>junit</groupId>
+ <artifactId>junit</artifactId>
+ <version>4.11</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-minicluster</artifactId>
+ <version>${accumulo.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-mini</artifactId>
+ <version>${fluo.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-test</artifactId>
+ <version>${fluo-recipes.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ <version>2.4</version>
+ <scope>test</scope>
+ </dependency>
+ </dependencies>
+</project>
diff --git a/stresso/src/main/java/stresso/trie/CompactLL.java b/stresso/src/main/java/stresso/trie/CompactLL.java
new file mode 100644
index 0000000..1e0e421
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/CompactLL.java
@@ -0,0 +1,61 @@
+package stresso.trie;
+
+import java.io.File;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.core.util.AccumuloUtil;
+import org.apache.hadoop.io.Text;
+
+/**
+ * Compact the lower levels of the tree. The lower levels of the tree contain a small number of
+ * nodes that are frequently updated. Compacting these lower levels is a quick operation that
+ * causes the Fluo GC iterator to clean up past transactions.
+ */
+
+public class CompactLL {
+ public static void main(String[] args) throws Exception {
+
+ if (args.length != 3) {
+ System.err
+ .println("Usage: " + CompactLL.class.getSimpleName() + " <fluo props> <max> <cutoff>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration config = new FluoConfiguration(new File(args[0]));
+
+ long max = Long.parseLong(args[1]);
+
+ // compact levels that can contain fewer nodes than this
+ int cutoff = Integer.parseInt(args[2]);
+
+ int nodeSize;
+ int stopLevel;
+ try (FluoClient client = FluoFactory.newClient(config)) {
+ nodeSize = client.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+ stopLevel = client.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ }
+
+ int level = 64 / nodeSize;
+
+ while (level >= stopLevel) {
+ if (max < cutoff) {
+ break;
+ }
+
+ max = max >> 8;
+ level--;
+ }
+
+ String start = String.format("%02d", stopLevel);
+ String end = String.format("%02d:~", (level));
+
+ System.out.println("Compacting "+start+" to "+end);
+ Connector conn = AccumuloUtil.getConnector(config);
+ conn.tableOperations().compact(config.getAccumuloTable(), new Text(start), new Text(end), true, false);
+
+ System.exit(0);
+ }
+}
diff --git a/stresso/src/main/java/stresso/trie/Constants.java b/stresso/src/main/java/stresso/trie/Constants.java
new file mode 100644
index 0000000..7c8cf6c
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Constants.java
@@ -0,0 +1,32 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package stresso.trie;
+
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.recipes.core.types.StringEncoder;
+import org.apache.fluo.recipes.core.types.TypeLayer;
+
+/**
+ *
+ */
+public class Constants {
+
+ public static final TypeLayer TYPEL = new TypeLayer(new StringEncoder());
+
+ public static final Column COUNT_SEEN_COL = new Column("count", "seen");
+ public static final Column COUNT_WAIT_COL = new Column("count", "wait");
+
+ public static final String NODE_SIZE_PROP = "trie.nodeSize";
+ public static final String STOP_LEVEL_PROP = "trie.stopLevel";
+}
diff --git a/stresso/src/main/java/stresso/trie/Diff.java b/stresso/src/main/java/stresso/trie/Diff.java
new file mode 100644
index 0000000..f74521d
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Diff.java
@@ -0,0 +1,104 @@
+package stresso.trie;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.Snapshot;
+import org.apache.fluo.api.client.scanner.ColumnScanner;
+import org.apache.fluo.api.client.scanner.RowScanner;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.data.ColumnValue;
+import org.apache.fluo.api.data.Span;
+
+public class Diff {
+ public static Map<String, Long> getRootCount(FluoClient client, Snapshot snap, int level,
+ int stopLevel, int nodeSize) throws Exception {
+
+ HashMap<String, Long> counts = new HashMap<>();
+
+ RowScanner rows = snap.scanner().over(Span.prefix(String.format("%02d:", level)))
+ .fetch(Constants.COUNT_SEEN_COL, Constants.COUNT_WAIT_COL).byRow().build();
+
+ for (ColumnScanner columns : rows) {
+ String row = columns.getsRow();
+ Node node = new Node(row);
+
+ while (node.getLevel() > stopLevel) {
+ node = node.getParent();
+ }
+
+ String stopRow = node.getRowId();
+ long count = counts.getOrDefault(stopRow, 0L);
+
+ if (node.getNodeSize() == nodeSize) {
+ for (ColumnValue colVal : columns) {
+ count += Long.parseLong(colVal.getsValue());
+ }
+ } else {
+ throw new RuntimeException("TODO");
+ }
+
+ counts.put(stopRow, count);
+ }
+
+ return counts;
+ }
+
+ public static void main(String[] args) throws Exception {
+
+ if (args.length != 1) {
+ System.err.println("Usage: " + Diff.class.getSimpleName() + " <fluo props>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration config = new FluoConfiguration(new File(args[0]));
+
+ try (FluoClient client = FluoFactory.newClient(config); Snapshot snap = client.newSnapshot()) {
+
+ int stopLevel = client.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ int nodeSize = client.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+
+ Map<String, Long> rootCounts = getRootCount(client, snap, stopLevel, stopLevel, nodeSize);
+ ArrayList<String> rootRows = new ArrayList<>(rootCounts.keySet());
+ Collections.sort(rootRows);
+
+ // TODO 8
+ for (int level = stopLevel + 1; level <= 8; level++) {
+ System.out.printf("Level %d:\n", level);
+
+ Map<String, Long> counts = getRootCount(client, snap, level, stopLevel, nodeSize);
+
+ long sum = 0;
+
+ for (String row : rootRows) {
+ long c1 = rootCounts.get(row);
+ long c2 = counts.getOrDefault(row, -1L);
+
+ if (c1 != c2) {
+ System.out.printf("\tdiff: %s %d %d\n", row, c1, c2);
+ }
+
+ if (c2 > 0) {
+ sum += c2;
+ }
+ }
+
+ HashSet<String> extras = new HashSet<>(counts.keySet());
+ extras.removeAll(rootCounts.keySet());
+
+ for (String row : extras) {
+ long c = counts.get(row);
+ System.out.printf("\textra: %s %d\n", row, c);
+ }
+
+ System.out.println("\tsum " + sum);
+ }
+ }
+ }
+}
diff --git a/stresso/src/main/java/stresso/trie/Generate.java b/stresso/src/main/java/stresso/trie/Generate.java
new file mode 100644
index 0000000..55f9beb
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Generate.java
@@ -0,0 +1,176 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package stresso.trie;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Random;
+
+import com.google.common.base.Preconditions;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.InputFormat;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobClient;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.RunningJob;
+import org.apache.hadoop.mapred.SequenceFileOutputFormat;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class Generate extends Configured implements Tool {
+
+ private static final Logger log = LoggerFactory.getLogger(Generate.class);
+
+ public static final String TRIE_GEN_NUM_PER_MAPPER_PROP = "stresso.trie.numPerMapper";
+ public static final String TRIE_GEN_NUM_MAPPERS_PROP = "stresso.trie.numMappers";
+ public static final String TRIE_GEN_MAX_PROP = "stresso.trie.max";
+
+ public static class RandomSplit implements InputSplit {
+
+ @Override
+ public void write(DataOutput out) throws IOException {}
+
+ @Override
+ public void readFields(DataInput in) throws IOException {}
+
+ @Override
+ public long getLength() throws IOException {
+ return 0;
+ }
+
+ @Override
+ public String[] getLocations() throws IOException {
+ return new String[0];
+ }
+
+ }
+
+ public static class RandomLongInputFormat implements InputFormat<LongWritable,NullWritable> {
+
+ @Override
+ public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
+ InputSplit[] splits = new InputSplit[job.getInt(TRIE_GEN_NUM_MAPPERS_PROP, 1)];
+ for (int i = 0; i < splits.length; i++) {
+ splits[i] = new RandomSplit();
+ }
+ return splits;
+ }
+
+ @Override
+ public RecordReader<LongWritable,NullWritable> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
+
+ final int numToGen = job.getInt(TRIE_GEN_NUM_PER_MAPPER_PROP, 1);
+ final long max = job.getLong(TRIE_GEN_MAX_PROP, Long.MAX_VALUE);
+
+ return new RecordReader<LongWritable,NullWritable>() {
+
+ private Random random = new Random();
+ private int count = 0;
+
+ @Override
+ public boolean next(LongWritable key, NullWritable value) throws IOException {
+
+ if (count == numToGen)
+ return false;
+
+ // mask the sign bit so the result is non-negative, then bound it by max
+ key.set((random.nextLong() & 0x7fffffffffffffffL) % max);
+ count++;
+ return true;
+ }
+
+ @Override
+ public LongWritable createKey() {
+ return new LongWritable();
+ }
+
+ @Override
+ public NullWritable createValue() {
+ return NullWritable.get();
+ }
+
+ @Override
+ public long getPos() throws IOException {
+ return count;
+ }
+
+ @Override
+ public void close() throws IOException {}
+
+ @Override
+ public float getProgress() throws IOException {
+ return (float) count / numToGen;
+ }
+ };
+ }
+ }
+
+ @Override
+ public int run(String[] args) throws Exception {
+
+ if (args.length != 4) {
+ log.error("Usage: " + this.getClass().getSimpleName() + " <numMappers> <numbersPerMapper> <max> <output dir>");
+ System.exit(-1);
+ }
+
+ int numMappers = Integer.parseInt(args[0]);
+ int numPerMapper = Integer.parseInt(args[1]);
+ long max = Long.parseLong(args[2]);
+ Path out = new Path(args[3]);
+
+ Preconditions.checkArgument(numMappers > 0, "numMappers <= 0");
+ Preconditions.checkArgument(numPerMapper > 0, "numPerMapper <= 0");
+ Preconditions.checkArgument(max > 0, "max <= 0");
+
+ JobConf job = new JobConf(getConf());
+
+ job.setJobName(this.getClass().getName());
+
+ job.setJarByClass(Generate.class);
+
+ job.setInt(TRIE_GEN_NUM_PER_MAPPER_PROP, numPerMapper);
+ job.setInt(TRIE_GEN_NUM_MAPPERS_PROP, numMappers);
+ job.setLong(TRIE_GEN_MAX_PROP, max);
+
+ job.setInputFormat(RandomLongInputFormat.class);
+
+ job.setNumReduceTasks(0);
+
+ job.setOutputKeyClass(LongWritable.class);
+ job.setOutputValueClass(NullWritable.class);
+
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+ SequenceFileOutputFormat.setOutputPath(job, out);
+
+ RunningJob runningJob = JobClient.runJob(job);
+ runningJob.waitForCompletion();
+ return runningJob.isSuccessful() ? 0 : -1;
+ }
+
+ public static void main(String[] args) throws Exception {
+ int ret = ToolRunner.run(new Generate(), args);
+ System.exit(ret);
+ }
+
+}
diff --git a/stresso/src/main/java/stresso/trie/Init.java b/stresso/src/main/java/stresso/trie/Init.java
new file mode 100644
index 0000000..d0f847f
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Init.java
@@ -0,0 +1,241 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package stresso.trie;
+
+import java.io.BufferedOutputStream;
+import java.io.File;
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Collection;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
+import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Value;
+import org.apache.commons.codec.binary.Base64;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.core.util.AccumuloUtil;
+import org.apache.fluo.mapreduce.FluoKeyValue;
+import org.apache.fluo.mapreduce.FluoKeyValueGenerator;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+public class Init extends Configured implements Tool {
+
+ public static final String TRIE_STOP_LEVEL_PROP = FluoConfiguration.FLUO_PREFIX + ".stress.trie.stopLevel";
+ public static final String TRIE_NODE_SIZE_PROP = FluoConfiguration.FLUO_PREFIX + ".stress.trie.node.size";
+
+ public static class UniqueReducer extends Reducer<LongWritable,NullWritable,LongWritable,NullWritable> {
+ @Override
+ protected void reduce(LongWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
+ context.write(key, NullWritable.get());
+ }
+ }
+
+ public static class InitMapper extends Mapper<LongWritable,NullWritable,Text,LongWritable> {
+
+ private int stopLevel;
+ private int nodeSize;
+ private static final LongWritable ONE = new LongWritable(1);
+
+ private Text outputKey = new Text();
+
+ @Override
+ protected void setup(Context context) throws IOException, InterruptedException {
+ nodeSize = context.getConfiguration().getInt(TRIE_NODE_SIZE_PROP, 0);
+ stopLevel = context.getConfiguration().getInt(TRIE_STOP_LEVEL_PROP, 0);
+ }
+
+ @Override
+ protected void map(LongWritable key, NullWritable val, Context context) throws IOException, InterruptedException {
+ Node node = new Node(key.get(), 64 / nodeSize, nodeSize);
+ while (node != null) {
+ outputKey.set(node.getRowId());
+ context.write(outputKey, ONE);
+ if (node.getLevel() <= stopLevel)
+ node = null;
+ else
+ node = node.getParent();
+ }
+ }
+ }
+
+ public static class InitCombiner extends Reducer<Text,LongWritable,Text,LongWritable> {
+
+ private LongWritable outputVal = new LongWritable();
+
+ @Override
+ protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
+ long sum = 0;
+ for (LongWritable l : values) {
+ sum += l.get();
+ }
+
+ outputVal.set(sum);
+ context.write(key, outputVal);
+ }
+ }
+
+ public static class InitReducer extends Reducer<Text,LongWritable,Key,Value> {
+ private FluoKeyValueGenerator fkvg = new FluoKeyValueGenerator();
+
+ @Override
+ protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
+ long sum = 0;
+ for (LongWritable l : values) {
+ sum += l.get();
+ }
+
+ fkvg.setRow(key).setColumn(Constants.COUNT_SEEN_COL).setValue(sum + "");
+
+ FluoKeyValue[] kvs = fkvg.getKeyValues();
+ for (FluoKeyValue kv : kvs) {
+ context.write(kv.getKey(), kv.getValue());
+ }
+ }
+ }
+
+ @Override
+ public int run(String[] args) throws Exception {
+ if (args.length != 3) {
+ System.err.println("Usage: " + this.getClass().getSimpleName() + " <fluoProps> <input dir> <tmp dir>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration props = new FluoConfiguration(new File(args[0]));
+ Path input = new Path(args[1]);
+ Path tmp = new Path(args[2]);
+
+ int stopLevel;
+ int nodeSize;
+ try (FluoClient client = FluoFactory.newClient(props)) {
+ nodeSize = client.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+ stopLevel = client.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ }
+
+ int ret = unique(input, new Path(tmp, "nums"));
+ if (ret != 0)
+ return ret;
+
+ return buildTree(nodeSize, props, tmp, stopLevel);
+ }
+
+ private int unique(Path input, Path tmp) throws Exception {
+ Job job = Job.getInstance(getConf());
+ job.setJarByClass(Init.class);
+
+ job.setJobName(Init.class.getName() + "_unique");
+
+ job.setInputFormatClass(SequenceFileInputFormat.class);
+ SequenceFileInputFormat.addInputPath(job, input);
+
+ job.setReducerClass(UniqueReducer.class);
+
+ job.setOutputKeyClass(LongWritable.class);
+ job.setOutputValueClass(NullWritable.class);
+
+ job.setOutputFormatClass(SequenceFileOutputFormat.class);
+ SequenceFileOutputFormat.setOutputPath(job, tmp);
+
+ boolean success = job.waitForCompletion(true);
+ return success ? 0 : 1;
+
+ }
+
+ private int buildTree(int nodeSize, FluoConfiguration props, Path tmp, int stopLevel) throws Exception {
+ Job job = Job.getInstance(getConf());
+
+ job.setJarByClass(Init.class);
+
+ job.setJobName(Init.class.getName() + "_load");
+
+ job.setMapOutputKeyClass(Text.class);
+ job.setMapOutputValueClass(LongWritable.class);
+
+ job.getConfiguration().setInt(TRIE_NODE_SIZE_PROP, nodeSize);
+ job.getConfiguration().setInt(TRIE_STOP_LEVEL_PROP, stopLevel);
+
+ job.setInputFormatClass(SequenceFileInputFormat.class);
+ SequenceFileInputFormat.addInputPath(job, new Path(tmp, "nums"));
+
+ job.setMapperClass(InitMapper.class);
+ job.setCombinerClass(InitCombiner.class);
+ job.setReducerClass(InitReducer.class);
+
+ job.setOutputFormatClass(AccumuloFileOutputFormat.class);
+
+ job.setPartitionerClass(RangePartitioner.class);
+
+ FileSystem fs = FileSystem.get(job.getConfiguration());
+ Connector conn = AccumuloUtil.getConnector(props);
+
+ Path splitsPath = new Path(tmp, "splits.txt");
+
+ Collection<Text> splits1 = writeSplits(props, fs, conn, splitsPath);
+
+ RangePartitioner.setSplitFile(job, splitsPath.toString());
+ job.setNumReduceTasks(splits1.size() + 1);
+
+ Path outPath = new Path(tmp, "out");
+ AccumuloFileOutputFormat.setOutputPath(job, outPath);
+
+ boolean success = job.waitForCompletion(true);
+
+ if (success) {
+ Path failPath = new Path(tmp, "failures");
+ fs.mkdirs(failPath);
+ conn.tableOperations().importDirectory(props.getAccumuloTable(), outPath.toString(), failPath.toString(), false);
+
+      // Compacting makes the files local to each tablet and rewrites them using the table's settings.
+ conn.tableOperations().compact(props.getAccumuloTable(), new CompactionConfig().setWait(true));
+ }
+ return success ? 0 : 1;
+ }
+
+ private Collection<Text> writeSplits(FluoConfiguration props, FileSystem fs, Connector conn, Path splitsPath) throws Exception {
+ Collection<Text> splits1 = conn.tableOperations().listSplits(props.getAccumuloTable());
+ OutputStream out = new BufferedOutputStream(fs.create(splitsPath));
+ for (Text split : splits1) {
+ out.write(Base64.encodeBase64(split.copyBytes()));
+ out.write('\n');
+ }
+
+ out.close();
+ return splits1;
+ }
+
+ public static void main(String[] args) throws Exception {
+ int ret = ToolRunner.run(new Init(), args);
+ System.exit(ret);
+ }
+
+}
diff --git a/stresso/src/main/java/stresso/trie/Load.java b/stresso/src/main/java/stresso/trie/Load.java
new file mode 100644
index 0000000..8e1ebfb
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Load.java
@@ -0,0 +1,86 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package stresso.trie;
+
+import java.io.File;
+import java.io.IOException;
+
+import org.apache.fluo.api.client.Loader;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.mapreduce.FluoOutputFormat;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class Load extends Configured implements Tool {
+
+ private static final Logger log = LoggerFactory.getLogger(Load.class);
+
+ public static class LoadMapper extends Mapper<LongWritable, NullWritable, Loader, NullWritable> {
+
+ @Override
+ protected void map(LongWritable key, NullWritable val, Context context)
+ throws IOException, InterruptedException {
+ context.write(new NumberLoader(key.get()), val);
+ }
+ }
+
+ @Override
+ public int run(String[] args) throws Exception {
+
+ if (args.length != 2) {
+      log.error("Usage: " + this.getClass().getSimpleName() + " <fluoProps> <input dir>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration props = new FluoConfiguration(new File(args[0]));
+ Path input = new Path(args[1]);
+
+ Job job = Job.getInstance(getConf());
+
+ job.setJobName(Load.class.getName());
+
+ job.setJarByClass(Load.class);
+
+ job.setInputFormatClass(SequenceFileInputFormat.class);
+ SequenceFileInputFormat.addInputPath(job, input);
+
+ job.setMapperClass(LoadMapper.class);
+
+ job.setNumReduceTasks(0);
+
+ job.setOutputFormatClass(FluoOutputFormat.class);
+ FluoOutputFormat.configure(job, props);
+
+ job.getConfiguration().setBoolean("mapreduce.map.speculative", false);
+
+ boolean success = job.waitForCompletion(true);
+ return success ? 0 : 1;
+ }
+
+ public static void main(String[] args) throws Exception {
+ int ret = ToolRunner.run(new Load(), args);
+ System.exit(ret);
+ }
+
+}
diff --git a/stresso/src/main/java/stresso/trie/Node.java b/stresso/src/main/java/stresso/trie/Node.java
new file mode 100644
index 0000000..fb45a60
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Node.java
@@ -0,0 +1,117 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package stresso.trie;
+
+import com.google.common.base.Strings;
+import com.google.common.hash.Hashing;
+
+import static com.google.common.base.Preconditions.checkArgument;
+
+/** Utility class that represents a trie node. */
+public class Node {
+
+ private final Number number;
+ private final int level;
+ private final int nodeSize;
+
+  static final int HASH_LEN = 4;
+
+ public Node(Number number, int level, int nodeSize) {
+ this.number = number;
+ this.level = level;
+ this.nodeSize = nodeSize;
+ }
+
+ public Node(String rowId) {
+ String[] rowArgs = rowId.split(":");
+    checkArgument(validRowId(rowArgs), "Invalid row id - " + rowId);
+ this.level = Integer.parseInt(rowArgs[0]);
+ this.nodeSize = Integer.parseInt(rowArgs[2]);
+ this.number = parseNumber(rowArgs[3]);
+ }
+
+ public Number getNumber() {
+ return number;
+ }
+
+ public int getLevel() {
+ return level;
+ }
+
+ public boolean isRoot() {
+ return level == 0;
+ }
+
+ public int getNodeSize() {
+ return nodeSize;
+ }
+
+ private Number parseNumber(String numStr) {
+ if (numStr.equals("root")) {
+ return null;
+ } else if (numStr.length() == 16) {
+ return Long.parseLong(numStr, 16);
+ } else {
+ return Integer.parseInt(numStr, 16);
+ }
+ }
+
+ private String genHash(){
+    long num = (number == null) ? 0L : number.longValue();
+ int hash = Hashing.murmur3_32().newHasher().putInt(level).putInt(nodeSize).putLong(num).hash().asInt();
+ hash = hash & 0x7fffffff;
+    // base 36 gives a lot more bins in 4 bytes than hex, but it is still human-readable, which is nice for debugging.
+ String hashString = Strings.padStart(Integer.toString(hash, Character.MAX_RADIX), HASH_LEN, '0');
+ return hashString.substring(hashString.length() - HASH_LEN);
+ }
+
+ public String getRowId() {
+ if (level == 0) {
+ return String.format("00:%s:%02d:root", genHash(), nodeSize);
+ } else {
+ if (number instanceof Integer) {
+ return String.format("%02d:%s:%02d:%08x", level, genHash(), nodeSize, number);
+ } else {
+ return String.format("%02d:%s:%02d:%016x", level, genHash(), nodeSize, number);
+ }
+ }
+ }
+
+ public Node getParent() {
+ if (level == 1) {
+ return new Node(null, 0, nodeSize);
+ } else {
+ if (number instanceof Long) {
+ int shift = (((64 / nodeSize) - level) * nodeSize) + nodeSize;
+ Long parent = (number.longValue() >> shift) << shift;
+ return new Node(parent, level-1, nodeSize);
+ } else {
+ int shift = (((32 / nodeSize) - level) * nodeSize) + nodeSize;
+ Integer parent = (number.intValue() >> shift) << shift;
+ return new Node(parent, level-1, nodeSize);
+ }
+ }
+ }
+
+ private boolean validRowId(String[] rowArgs) {
+ return ((rowArgs.length == 4) && (rowArgs[0] != null) && (rowArgs[1] != null) && (rowArgs[2] != null) && (rowArgs[3] != null));
+ }
+
+ public static String generateRootId(int nodeSize) {
+ return (new Node(null, 0, nodeSize)).getRowId();
+ }
+}
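The bit-shift arithmetic in `getParent` above is easy to misread, so here is a standalone sketch of the 64-bit (`Long`) branch. `ParentShiftDemo` is a hypothetical helper written for illustration only; it assumes `nodeSize = 8` and mirrors the shift formula from `Node.getParent`:

```java
public class ParentShiftDemo {

    // Mirrors the Long branch of Node.getParent: clear the low-order
    // bits that belong to levels below the parent's level.
    static long parentOf(long num, int level, int nodeSize) {
        int shift = (((64 / nodeSize) - level) * nodeSize) + nodeSize;
        return (num >> shift) << shift;
    }

    public static void main(String[] args) {
        int nodeSize = 8;
        int leafLevel = 64 / nodeSize; // 8 levels for 8-bit chunks
        long num = 0x1122334455667788L;

        // Parent of a leaf-level node clears the lowest nodeSize bits.
        assert parentOf(num, leafLevel, nodeSize) == 0x1122334455667700L;

        // One level higher, two nodeSize-bit chunks are cleared.
        assert parentOf(num, leafLevel - 1, nodeSize) == 0x1122334455660000L;

        System.out.println("ok");
    }
}
```

Each call on a node at `level` yields the number prefix of its `level - 1` ancestor, which is why repeatedly calling `getParent` walks a number up toward the root.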
diff --git a/stresso/src/main/java/stresso/trie/NodeObserver.java b/stresso/src/main/java/stresso/trie/NodeObserver.java
new file mode 100644
index 0000000..d7356c5
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/NodeObserver.java
@@ -0,0 +1,78 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package stresso.trie;
+
+import java.util.Map;
+
+import org.apache.fluo.api.client.TransactionBase;
+import org.apache.fluo.api.data.Bytes;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.api.observer.AbstractObserver;
+import org.apache.fluo.recipes.core.types.TypedSnapshotBase.Value;
+import org.apache.fluo.recipes.core.types.TypedTransactionBase;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Observer that watches the count:wait column of trie nodes. When a waiting count is found, it is
+ * folded into count:seen and added to count:wait of the parent node in the trie.
+ */
+public class NodeObserver extends AbstractObserver {
+
+ private static final Logger log = LoggerFactory.getLogger(NodeObserver.class);
+
+ private int stopLevel = 0;
+
+ @Override
+ public void process(TransactionBase tx, Bytes row, Column col) throws Exception {
+
+ final TypedTransactionBase ttx = Constants.TYPEL.wrap(tx);
+
+ Map<Column, Value> colVals =
+ ttx.get().row(row).columns(Constants.COUNT_SEEN_COL, Constants.COUNT_WAIT_COL);
+
+ final Integer childWait = colVals.get(Constants.COUNT_WAIT_COL).toInteger(0);
+
+ if (childWait > 0) {
+ Integer childSeen = colVals.get(Constants.COUNT_SEEN_COL).toInteger(0);
+
+ ttx.mutate().row(row).col(Constants.COUNT_SEEN_COL).set(childSeen + childWait);
+ ttx.mutate().row(row).col(Constants.COUNT_WAIT_COL).delete();
+
+ try {
+ Node node = new Node(row.toString());
+ if (node.getLevel() > stopLevel) {
+ Node parent = node.getParent();
+ Integer parentWait =
+ ttx.get().row(parent.getRowId()).col(Constants.COUNT_WAIT_COL).toInteger(0);
+ ttx.mutate().row(parent.getRowId()).col(Constants.COUNT_WAIT_COL)
+ .set(parentWait + childWait);
+ }
+ } catch (IllegalArgumentException e) {
+        // A malformed row id should not stop the observer; log it and move on.
+        log.error("Invalid node row id: " + row, e);
+ }
+ }
+ }
+
+ @Override
+ public void init(Context context) throws Exception {
+ stopLevel = context.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ }
+
+ @Override
+ public ObservedColumn getObservedColumn() {
+ return new ObservedColumn(Constants.COUNT_WAIT_COL, NotificationType.STRONG);
+ }
+}
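As a rough illustration of the rollup that NodeObserver performs, here is a toy in-memory model. `RollupDemo` is hypothetical and greatly simplified: integer levels stand in for full row ids, and there are no transactions, notifications, or collisions between siblings:

```java
import java.util.HashMap;
import java.util.Map;

public class RollupDemo {

    // level -> {seen, wait}; a stand-in for the count:seen / count:wait columns
    static Map<Integer, long[]> table = new HashMap<>();
    static final int STOP_LEVEL = 0;

    static long[] cell(int level) {
        return table.computeIfAbsent(level, l -> new long[2]);
    }

    // Loader side: each new unique number adds 1 to wait at the leaf level.
    static void load(int leafLevel) {
        cell(leafLevel)[1] += 1;
    }

    // Observer side: fold wait into seen, then push the delta to the parent.
    static void process(int level) {
        long[] c = cell(level);
        long wait = c[1];
        if (wait > 0) {
            c[0] += wait;
            c[1] = 0;
            if (level > STOP_LEVEL) {
                cell(level - 1)[1] += wait;
            }
        }
    }

    public static void main(String[] args) {
        int leaf = 8;
        load(leaf);
        load(leaf);
        // Drain notifications from the leaves down to the root.
        for (int lvl = leaf; lvl >= STOP_LEVEL; lvl--) {
            process(lvl);
        }
        long[] root = cell(STOP_LEVEL);
        // Root seen + wait equals the number of loaded values.
        System.out.println("root seen=" + root[0] + " wait=" + root[1]);
    }
}
```

The invariant the stress test later checks with Print is visible here: once all pending waits are processed, the root's seen + wait totals match the count of unique numbers loaded at the leaves.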
diff --git a/stresso/src/main/java/stresso/trie/NumberLoader.java b/stresso/src/main/java/stresso/trie/NumberLoader.java
new file mode 100644
index 0000000..961893a
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/NumberLoader.java
@@ -0,0 +1,70 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package stresso.trie;
+
+import java.util.Map;
+
+import org.apache.fluo.api.client.Loader;
+import org.apache.fluo.api.client.TransactionBase;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.recipes.core.types.TypedSnapshotBase.Value;
+import org.apache.fluo.recipes.core.types.TypedTransactionBase;
+
+import static com.google.common.base.Preconditions.checkArgument;
+
+/**
+ * Executes load transactions that insert numbers into the trie at the leaf node level
+ */
+public class NumberLoader implements Loader {
+
+ private final Number number;
+ private Integer nodeSize = null;
+
+ public NumberLoader(Integer num, int nodeSize) {
+ checkArgument(num >= 0, "Only positive numbers accepted");
+    checkArgument((nodeSize <= 32) && ((32 % nodeSize) == 0), "nodeSize must be a divisor of 32");
+ this.number = num;
+ this.nodeSize = nodeSize;
+ }
+
+ public NumberLoader(Long num) {
+ checkArgument(num >= 0, "Only positive numbers accepted");
+ this.number = num;
+ }
+
+ @Override
+ public void load(TransactionBase tx, Context context) throws Exception {
+
+ if (nodeSize == null) {
+ nodeSize = context.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+      checkArgument((nodeSize <= 64) && ((64 % nodeSize) == 0), "nodeSize must be a divisor of 64");
+ }
+ int level = 64 / nodeSize;
+
+ TypedTransactionBase ttx = Constants.TYPEL.wrap(tx);
+
+ String rowId = new Node(number, level, nodeSize).getRowId();
+
+ Map<Column, Value> colVals =
+ ttx.get().row(rowId).columns(Constants.COUNT_SEEN_COL, Constants.COUNT_WAIT_COL);
+
+ Integer seen = colVals.get(Constants.COUNT_SEEN_COL).toInteger(0);
+ if (seen == 0) {
+ Integer wait = colVals.get(Constants.COUNT_WAIT_COL).toInteger(0);
+ if (wait == 0) {
+ ttx.mutate().row(rowId).col(Constants.COUNT_WAIT_COL).set(1);
+ }
+ }
+ }
+}
diff --git a/stresso/src/main/java/stresso/trie/Print.java b/stresso/src/main/java/stresso/trie/Print.java
new file mode 100644
index 0000000..62c39a8
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Print.java
@@ -0,0 +1,125 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package stresso.trie;
+
+import java.io.File;
+
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.Snapshot;
+import org.apache.fluo.api.client.scanner.ColumnScanner;
+import org.apache.fluo.api.client.scanner.RowScanner;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.config.SimpleConfiguration;
+import org.apache.fluo.api.data.ColumnValue;
+import org.apache.fluo.api.data.Span;
+
+public class Print {
+
+ public static class Stats {
+ public long totalWait = 0;
+ public long totalSeen = 0;
+ public long nodes;
+ public boolean sawOtherNodes = false;
+
+ public Stats() {
+
+ }
+
+ public Stats(long tw, long ts, boolean son) {
+ this.totalWait = tw;
+ this.totalSeen = ts;
+ this.sawOtherNodes = son;
+ }
+
+ public Stats(long tw, long ts, long nodes, boolean son) {
+ this.totalWait = tw;
+ this.totalSeen = ts;
+ this.nodes = nodes;
+ this.sawOtherNodes = son;
+ }
+
+ @Override
+ public boolean equals(Object o) {
+ if (o instanceof Stats) {
+ Stats os = (Stats) o;
+
+ return totalWait == os.totalWait && totalSeen == os.totalSeen
+ && sawOtherNodes == os.sawOtherNodes;
+ }
+
+ return false;
+ }
+ }
+
+ public static Stats getStats(SimpleConfiguration config) throws Exception {
+
+ try (FluoClient client = FluoFactory.newClient(config); Snapshot snap = client.newSnapshot()) {
+
+ int level = client.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ int nodeSize = client.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+
+ RowScanner rows = snap.scanner().over(Span.prefix(String.format("%02d:", level)))
+ .fetch(Constants.COUNT_SEEN_COL, Constants.COUNT_WAIT_COL).byRow().build();
+
+
+ long totalSeen = 0;
+ long totalWait = 0;
+
+ int otherNodeSizes = 0;
+
+ long nodes = 0;
+
+ for (ColumnScanner columns : rows) {
+ String row = columns.getsRow();
+ Node node = new Node(row);
+
+ if (node.getNodeSize() == nodeSize) {
+ for (ColumnValue cv : columns) {
+ if (cv.getColumn().equals(Constants.COUNT_SEEN_COL)) {
+ totalSeen += Long.parseLong(cv.getsValue());
+ } else {
+ totalWait += Long.parseLong(cv.getsValue());
+ }
+ }
+
+ nodes++;
+ } else {
+ otherNodeSizes++;
+ }
+ }
+
+ return new Stats(totalWait, totalSeen, nodes, otherNodeSizes != 0);
+ }
+
+ }
+
+ public static void main(String[] args) throws Exception {
+
+ if (args.length != 1) {
+ System.err.println("Usage: " + Print.class.getSimpleName() + " <fluo props>");
+ System.exit(-1);
+ }
+
+ Stats stats = getStats(new FluoConfiguration(new File(args[0])));
+
+ System.out.println("Total at root : " + (stats.totalSeen + stats.totalWait));
+ System.out.println("Nodes Scanned : " + stats.nodes);
+
+ if (stats.sawOtherNodes) {
+ System.err.println("WARN : Other node sizes were seen and ignored.");
+ }
+ }
+}
diff --git a/stresso/src/main/java/stresso/trie/Split.java b/stresso/src/main/java/stresso/trie/Split.java
new file mode 100644
index 0000000..df8e28f
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Split.java
@@ -0,0 +1,141 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package stresso.trie;
+
+import java.io.ByteArrayInputStream;
+import java.io.File;
+import java.nio.charset.StandardCharsets;
+import java.util.Map.Entry;
+import java.util.Properties;
+import java.util.Set;
+import java.util.TreeSet;
+
+import com.google.common.base.Strings;
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.core.util.AccumuloUtil;
+import org.apache.hadoop.io.Text;
+
+public class Split {
+
+ private static final String RGB_CLASS =
+ "org.apache.accumulo.server.master.balancer.RegexGroupBalancer";
+ private static final String RGB_PATTERN_PROP = "table.custom.balancer.group.regex.pattern";
+ private static final String RGB_DEFAULT_PROP = "table.custom.balancer.group.regex.default";
+ private static final String TABLE_BALANCER_PROP = "table.balancer";
+
+ public static void main(String[] args) throws Exception {
+ if (args.length != 3) {
+ System.err.println("Usage: " + Split.class.getSimpleName()
+ + " <fluo props> <table props> <tablets per level>");
+ System.exit(-1);
+ }
+
+ FluoConfiguration config = new FluoConfiguration(new File(args[0]));
+
+ int maxTablets = Integer.parseInt(args[2]);
+
+ int nodeSize;
+ int stopLevel;
+ try (FluoClient client = FluoFactory.newClient(config)) {
+ nodeSize = client.getAppConfiguration().getInt(Constants.NODE_SIZE_PROP);
+ stopLevel = client.getAppConfiguration().getInt(Constants.STOP_LEVEL_PROP);
+ }
+
+ setupBalancer(config);
+
+ int level = 64 / nodeSize;
+
+ while (level >= stopLevel) {
+ int numTablets = maxTablets;
+ if (numTablets == 0)
+ break;
+
+ TreeSet<Text> splits = genSplits(level, numTablets);
+ addSplits(config, splits);
+ System.out.printf("Added %d tablets for level %d\n", numTablets, level);
+
+ level--;
+ }
+
+ optimizeAccumulo(config, args[1]);
+ }
+
+ private static void optimizeAccumulo(FluoConfiguration config, String tableProps)
+ throws Exception {
+ Connector conn = AccumuloUtil.getConnector(config);
+
+ Properties tprops = new Properties();
+ tprops.load(new ByteArrayInputStream(tableProps.getBytes(StandardCharsets.UTF_8)));
+
+ Set<Entry<Object, Object>> es = tprops.entrySet();
+ for (Entry<Object, Object> e : es) {
+ conn.tableOperations().setProperty(config.getAccumuloTable(), e.getKey().toString(),
+ e.getValue().toString());
+ }
+ try {
+ conn.instanceOperations().setProperty("table.durability", "flush");
+ conn.tableOperations().removeProperty("accumulo.metadata", "table.durability");
+ conn.tableOperations().removeProperty("accumulo.root", "table.durability");
+ } catch (AccumuloException e) {
+ System.err.println(
+ "Unable to set durability settings (error expected in Accumulo 1.6) : " + e.getMessage());
+ }
+ }
+
+ private static void setupBalancer(FluoConfiguration config) throws AccumuloSecurityException {
+ Connector conn = AccumuloUtil.getConnector(config);
+
+ try {
+ // setting this prop first intentionally because it should fail in 1.6
+ conn.tableOperations().setProperty(config.getAccumuloTable(), RGB_PATTERN_PROP, "(\\d\\d).*");
+ conn.tableOperations().setProperty(config.getAccumuloTable(), RGB_DEFAULT_PROP, "none");
+ conn.tableOperations().setProperty(config.getAccumuloTable(), TABLE_BALANCER_PROP, RGB_CLASS);
+ System.out.println("Setup tablet group balancer");
+ } catch (AccumuloException e) {
+ System.err.println(
+ "Unable to setup tablet balancer (error expected in Accumulo 1.6) : " + e.getMessage());
+ }
+ }
+
+ private static TreeSet<Text> genSplits(int level, int numTablets) {
+
+ TreeSet<Text> splits = new TreeSet<>();
+
+ String ls = String.format("%02d:", level);
+
+ int numSplits = numTablets - 1;
+ int distance = (((int) Math.pow(Character.MAX_RADIX, Node.HASH_LEN) - 1) / numTablets) + 1;
+ int split = distance;
+ for (int i = 0; i < numSplits; i++) {
+ splits.add(new Text(
+ ls + Strings.padStart(Integer.toString(split, Character.MAX_RADIX), Node.HASH_LEN, '0')));
+ split += distance;
+ }
+
+ splits.add(new Text(ls + "~"));
+
+ return splits;
+ }
+
+ private static void addSplits(FluoConfiguration config, TreeSet<Text> splits) throws Exception {
+ Connector conn = AccumuloUtil.getConnector(config);
+ conn.tableOperations().addSplits(config.getAccumuloTable(), splits);
+ }
+}
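The cut-point math in `genSplits` above can be sketched on its own. `SplitDemo` is a hypothetical class written for illustration; it mirrors the evenly spaced base-36 split calculation, assuming `HASH_LEN = 4` as in Node:

```java
import java.util.TreeSet;

public class SplitDemo {

    static final int HASH_LEN = 4; // matches Node.HASH_LEN

    // Mirrors Split.genSplits: evenly spaced base-36 cut points for one level.
    static TreeSet<String> genSplits(int level, int numTablets) {
        TreeSet<String> splits = new TreeSet<>();
        String ls = String.format("%02d:", level);

        // Divide the base-36 hash space (36^HASH_LEN values) into numTablets bins.
        int distance = (((int) Math.pow(Character.MAX_RADIX, HASH_LEN) - 1) / numTablets) + 1;
        int split = distance;
        for (int i = 0; i < numTablets - 1; i++) {
            String s = Integer.toString(split, Character.MAX_RADIX);
            while (s.length() < HASH_LEN) {
                s = "0" + s; // left-pad to HASH_LEN, like Strings.padStart
            }
            splits.add(ls + s);
            split += distance;
        }

        // '~' sorts after all base-36 digits, closing off the level's range.
        splits.add(ls + "~");
        return splits;
    }

    public static void main(String[] args) {
        // Four tablets for level 3 → three interior cut points plus the '~' cap.
        System.out.println(genSplits(3, 4));
        // → [03:9000, 03:i000, 03:r000, 03:~]
    }
}
```

Because row ids start with the zero-padded level followed by the base-36 hash, these splits keep each trie level in its own group of tablets, which is what the regex group balancer configured in `setupBalancer` relies on.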
diff --git a/stresso/src/main/java/stresso/trie/Unique.java b/stresso/src/main/java/stresso/trie/Unique.java
new file mode 100644
index 0000000..ef0d1cc
--- /dev/null
+++ b/stresso/src/main/java/stresso/trie/Unique.java
@@ -0,0 +1,104 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package stresso.trie;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.JobClient;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.MapReduceBase;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reducer;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.RunningJob;
+import org.apache.hadoop.mapred.SequenceFileInputFormat;
+import org.apache.hadoop.mapred.lib.NullOutputFormat;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class Unique extends Configured implements Tool {
+
+ private static final Logger log = LoggerFactory.getLogger(Unique.class);
+
+  public enum Stats {
+    UNIQUE
+  }
+
+ public static class UniqueReducer extends MapReduceBase implements Reducer<LongWritable,NullWritable,LongWritable,NullWritable> {
+ @Override
+ public void reduce(LongWritable key, Iterator<NullWritable> values, OutputCollector<LongWritable,NullWritable> output, Reporter reporter) throws IOException {
+ reporter.getCounter(Stats.UNIQUE).increment(1);
+ }
+ }
+
+
+ private static int numUnique = 0;
+
+ @VisibleForTesting
+ public static int getNumUnique() {
+ return numUnique;
+ }
+
+ @Override
+ public int run(String[] args) throws Exception {
+
+ if (args.length < 1) {
+      log.error("Usage: " + this.getClass().getSimpleName() + " <input dir> [<input dir> ...]");
+ System.exit(-1);
+ }
+
+ JobConf job = new JobConf(getConf());
+
+ job.setJobName(Unique.class.getName());
+ job.setJarByClass(Unique.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ for (String arg : args) {
+ SequenceFileInputFormat.addInputPath(job, new Path(arg));
+ }
+
+ job.setMapOutputKeyClass(LongWritable.class);
+ job.setMapOutputValueClass(NullWritable.class);
+
+ job.setReducerClass(UniqueReducer.class);
+
+ job.setOutputFormat(NullOutputFormat.class);
+
+ RunningJob runningJob = JobClient.runJob(job);
+ runningJob.waitForCompletion();
+ numUnique = (int) runningJob.getCounters().getCounter(Stats.UNIQUE);
+
+    log.debug("numUnique : " + numUnique);
+
+ return runningJob.isSuccessful() ? 0 : -1;
+
+ }
+
+ public static void main(String[] args) throws Exception {
+ int ret = ToolRunner.run(new Unique(), args);
+ System.exit(ret);
+ }
+
+}
diff --git a/stresso/src/test/java/stresso/ITBase.java b/stresso/src/test/java/stresso/ITBase.java
new file mode 100644
index 0000000..99a0b02
--- /dev/null
+++ b/stresso/src/test/java/stresso/ITBase.java
@@ -0,0 +1,135 @@
+/*
+ * Copyright 2017 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+
+package stresso;
+
+import java.io.File;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.Instance;
+import org.apache.accumulo.core.client.security.tokens.PasswordToken;
+import org.apache.accumulo.minicluster.MiniAccumuloCluster;
+import org.apache.accumulo.minicluster.MiniAccumuloConfig;
+import org.apache.accumulo.minicluster.MiniAccumuloInstance;
+import org.apache.commons.io.FileUtils;
+import org.apache.fluo.api.client.FluoAdmin;
+import org.apache.fluo.api.client.FluoAdmin.InitializationOptions;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.mini.MiniFluo;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+
+public class ITBase {
+
+ protected final static String USER = "root";
+ protected final static String PASSWORD = "ITSecret";
+ protected final static String TABLE_BASE = "table";
+ protected final static String IT_INSTANCE_NAME_PROP = FluoConfiguration.FLUO_PREFIX
+ + ".it.instance.name";
+ protected final static String IT_INSTANCE_CLEAR_PROP = FluoConfiguration.FLUO_PREFIX
+ + ".it.instance.clear";
+
+ protected static String instanceName;
+ protected static Connector conn;
+ protected static Instance miniAccumulo;
+ private static MiniAccumuloCluster cluster;
+ private static boolean startedCluster = false;
+
+ private static AtomicInteger tableCounter = new AtomicInteger(1);
+ protected static AtomicInteger testCounter = new AtomicInteger();
+
+ protected FluoConfiguration config;
+ protected FluoClient client;
+ protected MiniFluo miniFluo;
+
+ @BeforeClass
+ public static void setUpAccumulo() throws Exception {
+ instanceName = System.getProperty(IT_INSTANCE_NAME_PROP, "it-instance-default");
+ File instanceDir = new File("target/accumulo-maven-plugin/" + instanceName);
+ boolean instanceClear =
+ System.getProperty(IT_INSTANCE_CLEAR_PROP, "true").equalsIgnoreCase("true");
+ if (instanceDir.exists() && instanceClear) {
+ FileUtils.deleteDirectory(instanceDir);
+ }
+ if (!instanceDir.exists()) {
+ MiniAccumuloConfig cfg = new MiniAccumuloConfig(instanceDir, PASSWORD);
+ cfg.setInstanceName(instanceName);
+ cluster = new MiniAccumuloCluster(cfg);
+ cluster.start();
+ startedCluster = true;
+ }
+ miniAccumulo = new MiniAccumuloInstance(instanceName, instanceDir);
+ conn = miniAccumulo.getConnector(USER, new PasswordToken(PASSWORD));
+ }
+
+
+ @AfterClass
+ public static void tearDownAccumulo() throws Exception {
+ if (startedCluster) {
+ cluster.stop();
+ }
+ }
+
+ protected void preInit(FluoConfiguration config){}
+
+ public String getCurTableName() {
+ return TABLE_BASE + tableCounter.get();
+ }
+
+ public String getNextTableName() {
+ return TABLE_BASE + tableCounter.incrementAndGet();
+ }
+
+ @Before
+ public void setUpFluo() throws Exception {
+
+ config = new FluoConfiguration();
+ config.setApplicationName("mini-test" + testCounter.getAndIncrement());
+ config.setAccumuloInstance(miniAccumulo.getInstanceName());
+ config.setAccumuloUser(USER);
+ config.setAccumuloPassword(PASSWORD);
+ config.setAccumuloZookeepers(miniAccumulo.getZooKeepers());
+ config.setInstanceZookeepers(miniAccumulo.getZooKeepers() + "/fluo");
+ config.setMiniStartAccumulo(false);
+ config.setAccumuloTable(getNextTableName());
+ config.setWorkerThreads(5);
+ preInit(config);
+
+ config.setTransactionRollbackTime(1, TimeUnit.SECONDS);
+
+ try (FluoAdmin admin = FluoFactory.newAdmin(config)) {
+ InitializationOptions opts =
+ new InitializationOptions().setClearZookeeper(true).setClearTable(true);
+ admin.initialize(opts);
+ }
+
+ config.getAppConfiguration().clear();
+
+ client = FluoFactory.newClient(config);
+ miniFluo = FluoFactory.newMiniFluo(config);
+ }
+
+ @After
+ public void tearDownFluo() throws Exception {
+ miniFluo.close();
+ client.close();
+ }
+}
diff --git a/stresso/src/test/java/stresso/TrieBasicIT.java b/stresso/src/test/java/stresso/TrieBasicIT.java
new file mode 100644
index 0000000..eba5d3e
--- /dev/null
+++ b/stresso/src/test/java/stresso/TrieBasicIT.java
@@ -0,0 +1,120 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package stresso;
+
+import java.util.HashSet;
+import java.util.Random;
+import java.util.Set;
+
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.LoaderExecutor;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.config.ObserverSpecification;
+import org.apache.fluo.recipes.core.types.TypedSnapshot;
+import org.apache.fluo.recipes.test.FluoITHelper;
+import org.junit.Assert;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import stresso.trie.Constants;
+import stresso.trie.Node;
+import stresso.trie.NodeObserver;
+import stresso.trie.NumberLoader;
+
+import static stresso.trie.Constants.COUNT_SEEN_COL;
+import static stresso.trie.Constants.TYPEL;
+
+/**
+ * Tests Trie Stress Test using Basic Loader
+ */
+public class TrieBasicIT extends ITBase {
+
+ private static final Logger log = LoggerFactory.getLogger(TrieBasicIT.class);
+
+ @Override
+ protected void preInit(FluoConfiguration conf) {
+ conf.addObserver(new ObserverSpecification(NodeObserver.class.getName()));
+ conf.getAppConfiguration().setProperty(Constants.STOP_LEVEL_PROP, 0);
+ }
+
+ @Test
+ public void testBit32() throws Exception {
+ runTrieTest(20, Integer.MAX_VALUE, 32);
+ }
+
+ @Test
+ public void testBit8() throws Exception {
+ runTrieTest(25, Integer.MAX_VALUE, 8);
+ }
+
+ @Test
+ public void testBit4() throws Exception {
+ runTrieTest(10, Integer.MAX_VALUE, 4);
+ }
+
+ @Test
+ public void testBit() throws Exception {
+ runTrieTest(5, Integer.MAX_VALUE, 1);
+ }
+
+ @Test
+ public void testDuplicates() throws Exception {
+ runTrieTest(20, 10, 4);
+ }
+
+ private void runTrieTest(int ingestNum, int maxValue, int nodeSize) throws Exception {
+
+ log.info("Ingesting " + ingestNum + " unique numbers with a nodeSize of " + nodeSize + " bits");
+
+ config.setLoaderThreads(0);
+ config.setLoaderQueueSize(0);
+
+ try (FluoClient fluoClient = FluoFactory.newClient(config)) {
+
+ int uniqueNum;
+
+ try (LoaderExecutor le = client.newLoaderExecutor()) {
+ Random random = new Random();
+ Set<Integer> ingested = new HashSet<>();
+ for (int i = 0; i < ingestNum; i++) {
+ int num = Math.abs(random.nextInt(maxValue));
+ le.execute(new NumberLoader(num, nodeSize));
+ ingested.add(num);
+ }
+
+ uniqueNum = ingested.size();
+ log.info(
+ "Ingested " + uniqueNum + " unique numbers with a nodeSize of " + nodeSize + " bits");
+ }
+
+ miniFluo.waitForObservers();
+
+ try (TypedSnapshot tsnap = TYPEL.wrap(client.newSnapshot())) {
+ Integer result =
+ tsnap.get().row(Node.generateRootId(nodeSize)).col(COUNT_SEEN_COL).toInteger();
+        if (result == null) {
+          log.error("Could not find root node");
+          FluoITHelper.printFluoTable(client);
+          Assert.fail("Could not find root node");
+        } else if (!result.equals(uniqueNum)) {
+ log.error(
+ "Count (" + result + ") at root node does not match expected (" + uniqueNum + "):");
+ FluoITHelper.printFluoTable(client);
+ }
+ Assert.assertEquals(uniqueNum, result.intValue());
+ }
+ }
+ }
+}
diff --git a/stresso/src/test/java/stresso/TrieMapRedIT.java b/stresso/src/test/java/stresso/TrieMapRedIT.java
new file mode 100644
index 0000000..62afea1
--- /dev/null
+++ b/stresso/src/test/java/stresso/TrieMapRedIT.java
@@ -0,0 +1,136 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package stresso;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.Arrays;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.config.ObserverSpecification;
+import org.apache.fluo.api.config.SimpleConfiguration;
+import org.apache.fluo.api.mini.MiniFluo;
+import org.apache.hadoop.util.ToolRunner;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import stresso.trie.Constants;
+import stresso.trie.Generate;
+import stresso.trie.Init;
+import stresso.trie.Load;
+import stresso.trie.NodeObserver;
+import stresso.trie.Print;
+import stresso.trie.Unique;
+
+/**
+ * Tests Trie Stress Test using MapReduce Ingest
+ */
+public class TrieMapRedIT extends ITBase {
+
+ @Override
+ protected void preInit(FluoConfiguration conf) {
+ conf.addObserver(new ObserverSpecification(NodeObserver.class.getName()));
+
+ SimpleConfiguration appCfg = conf.getAppConfiguration();
+ appCfg.setProperty(Constants.STOP_LEVEL_PROP, 0);
+ appCfg.setProperty(Constants.NODE_SIZE_PROP, 8);
+ }
+
+ static void generate(int numMappers, int numPerMapper, int max, File out1) throws Exception {
+ int ret = ToolRunner.run(new Generate(),
+ new String[] {"-D", "mapred.job.tracker=local", "-D", "fs.defaultFS=file:///",
+ "" + numMappers, numPerMapper + "", max + "", out1.toURI().toString()});
+ Assert.assertEquals(0, ret);
+ }
+
+ static void load(int nodeSize, File fluoPropsFile, File input) throws Exception {
+ int ret = ToolRunner.run(new Load(), new String[] {"-D", "mapred.job.tracker=local", "-D",
+ "fs.defaultFS=file:///", fluoPropsFile.getAbsolutePath(), input.toURI().toString()});
+ Assert.assertEquals(0, ret);
+ }
+
+ static void init(int nodeSize, File fluoPropsFile, File input, File tmp) throws Exception {
+ int ret = ToolRunner.run(new Init(),
+ new String[] {"-D", "mapred.job.tracker=local", "-D", "fs.defaultFS=file:///",
+ fluoPropsFile.getAbsolutePath(), input.toURI().toString(), tmp.toURI().toString()});
+ Assert.assertEquals(0, ret);
+ }
+
+ static int unique(File... dirs) throws Exception {
+
+ ArrayList<String> args = new ArrayList<>(
+ Arrays.asList("-D", "mapred.job.tracker=local", "-D", "fs.defaultFS=file:///"));
+ for (File dir : dirs) {
+ args.add(dir.toURI().toString());
+ }
+
+ int ret = ToolRunner.run(new Unique(), args.toArray(new String[args.size()]));
+ Assert.assertEquals(0, ret);
+ return Unique.getNumUnique();
+ }
+
+ @Test
+ public void testEndToEnd() throws Exception {
+ File testDir = new File("target/MRIT");
+ FileUtils.deleteQuietly(testDir);
+ testDir.mkdirs();
+ File fluoPropsFile = new File(testDir, "fluo.props");
+
+ config.save(fluoPropsFile);
+
+ File out1 = new File(testDir, "nums-1");
+
+ generate(2, 100, 500, out1);
+ init(8, fluoPropsFile, out1, new File(testDir, "initTmp"));
+ int ucount = unique(out1);
+
+ Assert.assertTrue(ucount > 0);
+
+ miniFluo.waitForObservers();
+
+ Assert.assertEquals(new Print.Stats(0, ucount, false), Print.getStats(config));
+
+ // reload same data
+ load(8, fluoPropsFile, out1);
+
+ miniFluo.waitForObservers();
+
+ Assert.assertEquals(new Print.Stats(0, ucount, false), Print.getStats(config));
+
+ // load some new data
+ File out2 = new File(testDir, "nums-2");
+ generate(2, 100, 500, out2);
+ load(8, fluoPropsFile, out2);
+ int ucount2 = unique(out1, out2);
+ Assert.assertTrue(ucount2 > ucount); // used > because the probability that no new numbers are
+ // chosen is exceedingly small
+
+ miniFluo.waitForObservers();
+
+ Assert.assertEquals(new Print.Stats(0, ucount2, false), Print.getStats(config));
+
+ File out3 = new File(testDir, "nums-3");
+ generate(2, 100, 500, out3);
+ load(8, fluoPropsFile, out3);
+ int ucount3 = unique(out1, out2, out3);
+ Assert.assertTrue(ucount3 > ucount2); // used > because the probability that no new numbers are
+ // chosen is exceedingly small
+
+ miniFluo.waitForObservers();
+
+ Assert.assertEquals(new Print.Stats(0, ucount3, false), Print.getStats(config));
+ }
+}
diff --git a/stresso/src/test/java/stresso/TrieStopLevelIT.java b/stresso/src/test/java/stresso/TrieStopLevelIT.java
new file mode 100644
index 0000000..9c85a27
--- /dev/null
+++ b/stresso/src/test/java/stresso/TrieStopLevelIT.java
@@ -0,0 +1,48 @@
+/*
+ * Copyright 2014 Stresso authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package stresso;
+
+import org.apache.fluo.api.client.Snapshot;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.config.ObserverSpecification;
+import org.apache.fluo.api.config.SimpleConfiguration;
+import org.apache.fluo.api.data.Bytes;
+import org.junit.Assert;
+import org.junit.Test;
+import stresso.trie.Constants;
+import stresso.trie.Node;
+import stresso.trie.NodeObserver;
+
+public class TrieStopLevelIT extends TrieMapRedIT {
+
+ @Override
+ protected void preInit(FluoConfiguration conf) {
+ conf.addObserver(new ObserverSpecification(NodeObserver.class.getName()));
+
+ SimpleConfiguration appCfg = conf.getAppConfiguration();
+ appCfg.setProperty(Constants.STOP_LEVEL_PROP, 7);
+ appCfg.setProperty(Constants.NODE_SIZE_PROP, 8);
+ }
+
+ @Test
+ public void testEndToEnd() throws Exception {
+ super.testEndToEnd();
+ try (Snapshot snap = client.newSnapshot()) {
+ Bytes row = Bytes.of(Node.generateRootId(8));
+ Assert.assertNull(snap.get(row, Constants.COUNT_SEEN_COL));
+ Assert.assertNull(snap.get(row, Constants.COUNT_WAIT_COL));
+ }
+ }
+}
diff --git a/stresso/src/test/resources/log4j.properties b/stresso/src/test/resources/log4j.properties
new file mode 100644
index 0000000..adf8e2e
--- /dev/null
+++ b/stresso/src/test/resources/log4j.properties
@@ -0,0 +1,31 @@
+# Copyright 2014 Stresso authors (see AUTHORS)
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+log4j.rootLogger=INFO, CA
+log4j.appender.CA=org.apache.log4j.ConsoleAppender
+log4j.appender.CA.layout=org.apache.log4j.PatternLayout
+log4j.appender.CA.layout.ConversionPattern=%d{ISO8601} [%c{2}] %-5p: %m%n
+
+log4j.logger.org.apache.curator=ERROR
+log4j.logger.org.apache.accumulo=WARN
+log4j.logger.org.apache.commons.vfs2.impl.DefaultFileSystemManager=WARN
+log4j.logger.org.apache.fluo=WARN
+log4j.logger.org.apache.hadoop=WARN
+log4j.logger.org.apache.hadoop.conf=ERROR
+log4j.logger.org.apache.hadoop.mapred=ERROR
+log4j.logger.org.apache.hadoop.mapreduce=ERROR
+log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
+log4j.logger.org.apache.zookeeper.ClientCnxn=FATAL
+log4j.logger.org.apache.zookeeper.ZooKeeper=WARN
+log4j.logger.stresso=WARN
diff --git a/.gitignore b/webindex/.gitignore
similarity index 100%
rename from .gitignore
rename to webindex/.gitignore
diff --git a/.travis.yml b/webindex/.travis.yml
similarity index 100%
rename from .travis.yml
rename to webindex/.travis.yml
diff --git a/AUTHORS b/webindex/AUTHORS
similarity index 100%
rename from AUTHORS
rename to webindex/AUTHORS
diff --git a/webindex/LICENSE b/webindex/LICENSE
new file mode 100644
index 0000000..8f71f43
--- /dev/null
+++ b/webindex/LICENSE
@@ -0,0 +1,202 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
diff --git a/webindex/README.md b/webindex/README.md
new file mode 100644
index 0000000..f869d20
--- /dev/null
+++ b/webindex/README.md
@@ -0,0 +1,76 @@
+![Webindex][logo]
+---
+[![Build Status][ti]][tl] [![Apache License][li]][ll]
+
+Webindex is an example [Apache Fluo][fluo] application that incrementally indexes links to web pages
+in multiple ways. If you are new to Fluo, you may want to start with the [Fluo tour][tour] as the
+WebIndex application is more complicated. For more information on how the WebIndex application
+works, view the [tables](docs/tables.md) and [code](docs/code-guide.md) documentation.
+
+Webindex utilizes multiple projects. [Common Crawl][cc] web crawl data is used as the input.
+[Apache Spark][spark] is used to initialize Fluo and incrementally load data into Fluo. [Apache
+Accumulo][accumulo] is used to hold the indexes and Fluo's data. Fluo is used to continuously
+combine new and historical information about web pages and update an external index when changes
+occur. Webindex has a simple UI built using [Spark Java][sparkjava] that allows querying the indexes.
+
+Below is a video showing repeated queries of stackoverflow.com while Webindex ran for three days
+on EC2. The video was made by querying the Webindex instance periodically and taking a
+screenshot. More details about this video are available in this [blog post][bp].
+
+[![Querying stackoverflow.com](http://img.youtube.com/vi/mJJNJbPN2EI/0.jpg)](http://www.youtube.com/watch?v=mJJNJbPN2EI)
+
+## Running WebIndex
+
+If you are new to WebIndex, the simplest way to run the application is to run the development
+server. First, clone the WebIndex repo:
+
+ git clone https://github.com/astralway/webindex.git
+
+Next, on a machine where Java and Maven are installed, run the development server using the
+`webindex` command:
+
+ cd webindex/
+ ./bin/webindex dev
+
+This will build and start the development server, which logs to the console. This 'dev' command
+has several command line options that can be viewed by running it with `-h`. When you want to
+terminate the server, press `CTRL-c`.
+
+The development server starts a MiniAccumuloCluster and runs MiniFluo on top of it. It parses a
+CommonCrawl data file and creates a file at `data/1000-pages.txt` with 1000 pages that are loaded
+into MiniFluo. The number of pages loaded can be changed to 5000 by using the command below:
+
+ ./bin/webindex dev --pages 5000
+
+The pages are processed by Fluo which exports indexes to Accumulo. The development server also
+starts a web application at [http://localhost:4567](http://localhost:4567) that queries indexes in
+Accumulo.
+
+If you would like to run WebIndex on a cluster, follow the [install] instructions.
+
+### Viewing metrics
+
+Metrics can be sent from the development server to InfluxDB and viewed in Grafana. You can either
+set up InfluxDB+Grafana on your own or use the [Uno] command `uno setup metrics`. After a metrics
+server is started, start the development server with the `--metrics` option to begin sending metrics:
+
+ ./bin/webindex dev --metrics
+
+Fluo metrics can be viewed in Grafana. To view application-specific metrics for Webindex, import
+the WebIndex Grafana dashboard located at `contrib/webindex-dashboard.json`.
+
+[tour]: https://fluo.apache.org/tour/
+[sparkjava]: http://sparkjava.com/
+[spark]: https://spark.apache.org/
+[accumulo]: https://accumulo.apache.org/
+[fluo]: https://fluo.apache.org/
+[pc]: https://github.com/astralway/phrasecount
+[Uno]: https://github.com/astralway/uno
+[cc]: https://commoncrawl.org/
+[install]: docs/install.md
+[ti]: https://travis-ci.org/astralway/webindex.svg?branch=master
+[tl]: https://travis-ci.org/astralway/webindex
+[li]: http://img.shields.io/badge/license-ASL-blue.svg
+[ll]: https://github.com/astralway/webindex/blob/master/LICENSE
+[logo]: contrib/webindex.png
+[bp]: https://fluo.apache.org/blog/2016/01/11/webindex-long-run/#videos-from-run
diff --git a/bin/impl/base.sh b/webindex/bin/impl/base.sh
similarity index 100%
rename from bin/impl/base.sh
rename to webindex/bin/impl/base.sh
diff --git a/bin/impl/init.sh b/webindex/bin/impl/init.sh
similarity index 100%
rename from bin/impl/init.sh
rename to webindex/bin/impl/init.sh
diff --git a/bin/webindex b/webindex/bin/webindex
similarity index 100%
rename from bin/webindex
rename to webindex/bin/webindex
diff --git a/conf/.gitignore b/webindex/conf/.gitignore
similarity index 100%
rename from conf/.gitignore
rename to webindex/conf/.gitignore
diff --git a/conf/examples/log4j.properties b/webindex/conf/examples/log4j.properties
similarity index 100%
rename from conf/examples/log4j.properties
rename to webindex/conf/examples/log4j.properties
diff --git a/conf/examples/webindex-env.sh b/webindex/conf/examples/webindex-env.sh
similarity index 100%
rename from conf/examples/webindex-env.sh
rename to webindex/conf/examples/webindex-env.sh
diff --git a/conf/examples/webindex.yml b/webindex/conf/examples/webindex.yml
similarity index 100%
rename from conf/examples/webindex.yml
rename to webindex/conf/examples/webindex.yml
diff --git a/contrib/webindex-dashboard.json b/webindex/contrib/webindex-dashboard.json
similarity index 100%
rename from contrib/webindex-dashboard.json
rename to webindex/contrib/webindex-dashboard.json
diff --git a/contrib/webindex.png b/webindex/contrib/webindex.png
similarity index 100%
rename from contrib/webindex.png
rename to webindex/contrib/webindex.png
Binary files differ
diff --git a/contrib/webindex.svg b/webindex/contrib/webindex.svg
similarity index 100%
rename from contrib/webindex.svg
rename to webindex/contrib/webindex.svg
diff --git a/docs/code-guide.md b/webindex/docs/code-guide.md
similarity index 100%
rename from docs/code-guide.md
rename to webindex/docs/code-guide.md
diff --git a/docs/install.md b/webindex/docs/install.md
similarity index 100%
rename from docs/install.md
rename to webindex/docs/install.md
diff --git a/docs/tables.md b/webindex/docs/tables.md
similarity index 100%
rename from docs/tables.md
rename to webindex/docs/tables.md
diff --git a/docs/webindex_graphic.png b/webindex/docs/webindex_graphic.png
similarity index 100%
rename from docs/webindex_graphic.png
rename to webindex/docs/webindex_graphic.png
Binary files differ
diff --git a/modules/core/pom.xml b/webindex/modules/core/pom.xml
similarity index 100%
rename from modules/core/pom.xml
rename to webindex/modules/core/pom.xml
diff --git a/modules/core/src/main/java/webindex/core/Constants.java b/webindex/modules/core/src/main/java/webindex/core/Constants.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/Constants.java
rename to webindex/modules/core/src/main/java/webindex/core/Constants.java
diff --git a/modules/core/src/main/java/webindex/core/IndexClient.java b/webindex/modules/core/src/main/java/webindex/core/IndexClient.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/IndexClient.java
rename to webindex/modules/core/src/main/java/webindex/core/IndexClient.java
diff --git a/modules/core/src/main/java/webindex/core/WebIndexConfig.java b/webindex/modules/core/src/main/java/webindex/core/WebIndexConfig.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/WebIndexConfig.java
rename to webindex/modules/core/src/main/java/webindex/core/WebIndexConfig.java
diff --git a/modules/core/src/main/java/webindex/core/models/DomainStats.java b/webindex/modules/core/src/main/java/webindex/core/models/DomainStats.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/DomainStats.java
rename to webindex/modules/core/src/main/java/webindex/core/models/DomainStats.java
diff --git a/modules/core/src/main/java/webindex/core/models/Link.java b/webindex/modules/core/src/main/java/webindex/core/models/Link.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/Link.java
rename to webindex/modules/core/src/main/java/webindex/core/models/Link.java
diff --git a/modules/core/src/main/java/webindex/core/models/Links.java b/webindex/modules/core/src/main/java/webindex/core/models/Links.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/Links.java
rename to webindex/modules/core/src/main/java/webindex/core/models/Links.java
diff --git a/modules/core/src/main/java/webindex/core/models/Page.java b/webindex/modules/core/src/main/java/webindex/core/models/Page.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/Page.java
rename to webindex/modules/core/src/main/java/webindex/core/models/Page.java
diff --git a/modules/core/src/main/java/webindex/core/models/Pages.java b/webindex/modules/core/src/main/java/webindex/core/models/Pages.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/Pages.java
rename to webindex/modules/core/src/main/java/webindex/core/models/Pages.java
diff --git a/modules/core/src/main/java/webindex/core/models/TopResults.java b/webindex/modules/core/src/main/java/webindex/core/models/TopResults.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/TopResults.java
rename to webindex/modules/core/src/main/java/webindex/core/models/TopResults.java
diff --git a/modules/core/src/main/java/webindex/core/models/URL.java b/webindex/modules/core/src/main/java/webindex/core/models/URL.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/URL.java
rename to webindex/modules/core/src/main/java/webindex/core/models/URL.java
diff --git a/modules/core/src/main/java/webindex/core/models/UriInfo.java b/webindex/modules/core/src/main/java/webindex/core/models/UriInfo.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/UriInfo.java
rename to webindex/modules/core/src/main/java/webindex/core/models/UriInfo.java
diff --git a/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java b/webindex/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
rename to webindex/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
diff --git a/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java b/webindex/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
rename to webindex/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
diff --git a/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java b/webindex/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
rename to webindex/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
diff --git a/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java b/webindex/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
rename to webindex/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
diff --git a/modules/core/src/main/java/webindex/core/util/Pager.java b/webindex/modules/core/src/main/java/webindex/core/util/Pager.java
similarity index 100%
rename from modules/core/src/main/java/webindex/core/util/Pager.java
rename to webindex/modules/core/src/main/java/webindex/core/util/Pager.java
diff --git a/modules/core/src/test/java/webindex/core/WebIndexConfigTest.java b/webindex/modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
similarity index 100%
rename from modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
rename to webindex/modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
diff --git a/modules/core/src/test/java/webindex/core/models/LinkTest.java b/webindex/modules/core/src/test/java/webindex/core/models/LinkTest.java
similarity index 100%
rename from modules/core/src/test/java/webindex/core/models/LinkTest.java
rename to webindex/modules/core/src/test/java/webindex/core/models/LinkTest.java
diff --git a/modules/core/src/test/java/webindex/core/models/PageTest.java b/webindex/modules/core/src/test/java/webindex/core/models/PageTest.java
similarity index 100%
rename from modules/core/src/test/java/webindex/core/models/PageTest.java
rename to webindex/modules/core/src/test/java/webindex/core/models/PageTest.java
diff --git a/modules/core/src/test/java/webindex/core/models/URLTest.java b/webindex/modules/core/src/test/java/webindex/core/models/URLTest.java
similarity index 100%
rename from modules/core/src/test/java/webindex/core/models/URLTest.java
rename to webindex/modules/core/src/test/java/webindex/core/models/URLTest.java
diff --git a/modules/core/src/test/resources/log4j.properties b/webindex/modules/core/src/test/resources/log4j.properties
similarity index 100%
rename from modules/core/src/test/resources/log4j.properties
rename to webindex/modules/core/src/test/resources/log4j.properties
diff --git a/modules/data/pom.xml b/webindex/modules/data/pom.xml
similarity index 100%
rename from modules/data/pom.xml
rename to webindex/modules/data/pom.xml
diff --git a/modules/data/src/main/java/webindex/data/CalcSplits.java b/webindex/modules/data/src/main/java/webindex/data/CalcSplits.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/CalcSplits.java
rename to webindex/modules/data/src/main/java/webindex/data/CalcSplits.java
diff --git a/modules/data/src/main/java/webindex/data/Configure.java b/webindex/modules/data/src/main/java/webindex/data/Configure.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/Configure.java
rename to webindex/modules/data/src/main/java/webindex/data/Configure.java
diff --git a/modules/data/src/main/java/webindex/data/Copy.java b/webindex/modules/data/src/main/java/webindex/data/Copy.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/Copy.java
rename to webindex/modules/data/src/main/java/webindex/data/Copy.java
diff --git a/modules/data/src/main/java/webindex/data/FluoApp.java b/webindex/modules/data/src/main/java/webindex/data/FluoApp.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/FluoApp.java
rename to webindex/modules/data/src/main/java/webindex/data/FluoApp.java
diff --git a/modules/data/src/main/java/webindex/data/Init.java b/webindex/modules/data/src/main/java/webindex/data/Init.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/Init.java
rename to webindex/modules/data/src/main/java/webindex/data/Init.java
diff --git a/modules/data/src/main/java/webindex/data/LoadHdfs.java b/webindex/modules/data/src/main/java/webindex/data/LoadHdfs.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/LoadHdfs.java
rename to webindex/modules/data/src/main/java/webindex/data/LoadHdfs.java
diff --git a/modules/data/src/main/java/webindex/data/LoadS3.java b/webindex/modules/data/src/main/java/webindex/data/LoadS3.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/LoadS3.java
rename to webindex/modules/data/src/main/java/webindex/data/LoadS3.java
diff --git a/modules/data/src/main/java/webindex/data/TestParser.java b/webindex/modules/data/src/main/java/webindex/data/TestParser.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/TestParser.java
rename to webindex/modules/data/src/main/java/webindex/data/TestParser.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/DomainCombineQ.java b/webindex/modules/data/src/main/java/webindex/data/fluo/DomainCombineQ.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/DomainCombineQ.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/DomainCombineQ.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/IndexUpdateTranslator.java b/webindex/modules/data/src/main/java/webindex/data/fluo/IndexUpdateTranslator.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/IndexUpdateTranslator.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/IndexUpdateTranslator.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/PageLoader.java b/webindex/modules/data/src/main/java/webindex/data/fluo/PageLoader.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/PageLoader.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/PageLoader.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/PageObserver.java b/webindex/modules/data/src/main/java/webindex/data/fluo/PageObserver.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/PageObserver.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/PageObserver.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/UriCombineQ.java b/webindex/modules/data/src/main/java/webindex/data/fluo/UriCombineQ.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/UriCombineQ.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/UriCombineQ.java
diff --git a/modules/data/src/main/java/webindex/data/fluo/WebindexObservers.java b/webindex/modules/data/src/main/java/webindex/data/fluo/WebindexObservers.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/fluo/WebindexObservers.java
rename to webindex/modules/data/src/main/java/webindex/data/fluo/WebindexObservers.java
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexEnv.java b/webindex/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/spark/IndexEnv.java
rename to webindex/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexStats.java b/webindex/modules/data/src/main/java/webindex/data/spark/IndexStats.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/spark/IndexStats.java
rename to webindex/modules/data/src/main/java/webindex/data/spark/IndexStats.java
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexUtil.java b/webindex/modules/data/src/main/java/webindex/data/spark/IndexUtil.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/spark/IndexUtil.java
rename to webindex/modules/data/src/main/java/webindex/data/spark/IndexUtil.java
diff --git a/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java b/webindex/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
rename to webindex/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
diff --git a/modules/data/src/main/java/webindex/data/util/WARCFileInputFormat.java b/webindex/modules/data/src/main/java/webindex/data/util/WARCFileInputFormat.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/util/WARCFileInputFormat.java
rename to webindex/modules/data/src/main/java/webindex/data/util/WARCFileInputFormat.java
diff --git a/modules/data/src/main/java/webindex/data/util/WARCFileRecordReader.java b/webindex/modules/data/src/main/java/webindex/data/util/WARCFileRecordReader.java
similarity index 100%
rename from modules/data/src/main/java/webindex/data/util/WARCFileRecordReader.java
rename to webindex/modules/data/src/main/java/webindex/data/util/WARCFileRecordReader.java
diff --git a/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java b/webindex/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
similarity index 100%
rename from modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
rename to webindex/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
diff --git a/modules/data/src/main/resources/splits/accumulo-default.txt b/webindex/modules/data/src/main/resources/splits/accumulo-default.txt
similarity index 100%
rename from modules/data/src/main/resources/splits/accumulo-default.txt
rename to webindex/modules/data/src/main/resources/splits/accumulo-default.txt
diff --git a/modules/data/src/test/java/webindex/data/SparkTestUtil.java b/webindex/modules/data/src/test/java/webindex/data/SparkTestUtil.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/SparkTestUtil.java
rename to webindex/modules/data/src/test/java/webindex/data/SparkTestUtil.java
diff --git a/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java b/webindex/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
rename to webindex/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
diff --git a/modules/data/src/test/java/webindex/data/spark/Hex.java b/webindex/modules/data/src/test/java/webindex/data/spark/Hex.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/spark/Hex.java
rename to webindex/modules/data/src/test/java/webindex/data/spark/Hex.java
diff --git a/modules/data/src/test/java/webindex/data/spark/IndexEnvTest.java b/webindex/modules/data/src/test/java/webindex/data/spark/IndexEnvTest.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/spark/IndexEnvTest.java
rename to webindex/modules/data/src/test/java/webindex/data/spark/IndexEnvTest.java
diff --git a/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java b/webindex/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
rename to webindex/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
diff --git a/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java b/webindex/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
similarity index 100%
rename from modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
rename to webindex/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
diff --git a/modules/data/src/test/resources/data/set1/accumulo-data.txt b/webindex/modules/data/src/test/resources/data/set1/accumulo-data.txt
similarity index 100%
rename from modules/data/src/test/resources/data/set1/accumulo-data.txt
rename to webindex/modules/data/src/test/resources/data/set1/accumulo-data.txt
diff --git a/modules/data/src/test/resources/data/set1/fluo-data.txt b/webindex/modules/data/src/test/resources/data/set1/fluo-data.txt
similarity index 100%
rename from modules/data/src/test/resources/data/set1/fluo-data.txt
rename to webindex/modules/data/src/test/resources/data/set1/fluo-data.txt
diff --git a/modules/data/src/test/resources/log4j.properties b/webindex/modules/data/src/test/resources/log4j.properties
similarity index 100%
rename from modules/data/src/test/resources/log4j.properties
rename to webindex/modules/data/src/test/resources/log4j.properties
diff --git a/modules/data/src/test/resources/wat-18.warc b/webindex/modules/data/src/test/resources/wat-18.warc
similarity index 100%
rename from modules/data/src/test/resources/wat-18.warc
rename to webindex/modules/data/src/test/resources/wat-18.warc
diff --git a/modules/data/src/test/resources/wat.warc b/webindex/modules/data/src/test/resources/wat.warc
similarity index 100%
rename from modules/data/src/test/resources/wat.warc
rename to webindex/modules/data/src/test/resources/wat.warc
diff --git a/modules/integration/pom.xml b/webindex/modules/integration/pom.xml
similarity index 100%
rename from modules/integration/pom.xml
rename to webindex/modules/integration/pom.xml
diff --git a/modules/integration/src/main/java/webindex/integration/DevServer.java b/webindex/modules/integration/src/main/java/webindex/integration/DevServer.java
similarity index 100%
rename from modules/integration/src/main/java/webindex/integration/DevServer.java
rename to webindex/modules/integration/src/main/java/webindex/integration/DevServer.java
diff --git a/modules/integration/src/main/java/webindex/integration/DevServerOpts.java b/webindex/modules/integration/src/main/java/webindex/integration/DevServerOpts.java
similarity index 100%
rename from modules/integration/src/main/java/webindex/integration/DevServerOpts.java
rename to webindex/modules/integration/src/main/java/webindex/integration/DevServerOpts.java
diff --git a/modules/integration/src/main/java/webindex/integration/SampleData.java b/webindex/modules/integration/src/main/java/webindex/integration/SampleData.java
similarity index 100%
rename from modules/integration/src/main/java/webindex/integration/SampleData.java
rename to webindex/modules/integration/src/main/java/webindex/integration/SampleData.java
diff --git a/modules/integration/src/test/java/webindex/integration/DevServerIT.java b/webindex/modules/integration/src/test/java/webindex/integration/DevServerIT.java
similarity index 100%
rename from modules/integration/src/test/java/webindex/integration/DevServerIT.java
rename to webindex/modules/integration/src/test/java/webindex/integration/DevServerIT.java
diff --git a/modules/integration/src/test/resources/5-pages.txt b/webindex/modules/integration/src/test/resources/5-pages.txt
similarity index 100%
rename from modules/integration/src/test/resources/5-pages.txt
rename to webindex/modules/integration/src/test/resources/5-pages.txt
diff --git a/modules/integration/src/test/resources/log4j.properties b/webindex/modules/integration/src/test/resources/log4j.properties
similarity index 100%
rename from modules/integration/src/test/resources/log4j.properties
rename to webindex/modules/integration/src/test/resources/log4j.properties
diff --git a/modules/ui/.gitignore b/webindex/modules/ui/.gitignore
similarity index 100%
rename from modules/ui/.gitignore
rename to webindex/modules/ui/.gitignore
diff --git a/modules/ui/pom.xml b/webindex/modules/ui/pom.xml
similarity index 100%
rename from modules/ui/pom.xml
rename to webindex/modules/ui/pom.xml
diff --git a/modules/ui/src/main/java/webindex/ui/WebServer.java b/webindex/modules/ui/src/main/java/webindex/ui/WebServer.java
similarity index 100%
rename from modules/ui/src/main/java/webindex/ui/WebServer.java
rename to webindex/modules/ui/src/main/java/webindex/ui/WebServer.java
diff --git a/modules/ui/src/main/resources/assets/img/webindex.png b/webindex/modules/ui/src/main/resources/assets/img/webindex.png
similarity index 100%
rename from modules/ui/src/main/resources/assets/img/webindex.png
rename to webindex/modules/ui/src/main/resources/assets/img/webindex.png
Binary files differ
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/404.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/404.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/404.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/404.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/home.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/home.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/home.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/home.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/links.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/links.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/links.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/links.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/page.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/page.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/page.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/page.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/pages.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/pages.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/pages.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/pages.ftl
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/top.ftl b/webindex/modules/ui/src/main/resources/spark/template/freemarker/top.ftl
similarity index 100%
rename from modules/ui/src/main/resources/spark/template/freemarker/top.ftl
rename to webindex/modules/ui/src/main/resources/spark/template/freemarker/top.ftl
diff --git a/pom.xml b/webindex/pom.xml
similarity index 100%
rename from pom.xml
rename to webindex/pom.xml