Merge pull request #84 from mikewalch/spark-web
Refactored UI to replace Dropwizard with Spark Web
diff --git a/.gitignore b/.gitignore
index b2ae586..54549dd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,4 +6,4 @@
.settings
target/
/logs/
-/paths/
+/data/
diff --git a/README.md b/README.md
index ef102b5..5de5686 100644
--- a/README.md
+++ b/README.md
@@ -2,141 +2,39 @@
---
[![Build Status][ti]][tl] [![Apache License][li]][ll]
-Webindex is an example [Apache Fluo][fluo] application that uses [Common Crawl][cc] web crawl
-data to index links to web pages in multiple ways. It has a simple UI to view the resulting
-indexes. If you are new to Fluo, you may want start with the [quickstart][qs] or
-[phrasecount][pc] applications as the webindex application is more complicated. For more
-information on how the webindex application works, view the [tables](docs/tables.md) and
-[code](docs/code-guide.md) documentation.
+WebIndex is an example [Apache Fluo][fluo] application that uses [Common Crawl][cc] web crawl data
+to index links to web pages in multiple ways. It has a simple UI to view the resulting indexes. If
+you are new to Fluo, you may want to start with the [phrasecount][pc] application as the WebIndex
+application is more complicated. For more information on how the WebIndex application works, view
+the [tables](docs/tables.md) and [code](docs/code-guide.md) documentation.
-### Requirements
+## Running WebIndex
-In order run this application you need the following installed and running on your
-machine:
+If you are new to WebIndex, the simplest way to try the application is to run the development
+server. First, clone the WebIndex repo:
-* Hadoop (HDFS & YARN)
-* Accumulo
-* Fluo
+ git clone https://github.com/astralway/webindex.git
-Consider using [Uno] to run these requirements
+Next, on a machine where Java and Maven are installed, run the development server using the
+`webindex` command:
-### Configure your environment
+ cd webindex/
+ ./bin/webindex dev
-First, you must create the configuration file `data.yml` in the `conf/` directory and edit it
-for your environment.
+This builds and starts the development server, which logs to the console. To terminate the
+server, press `ctrl-c`.
- cp conf/data.yml.example conf/data.yml
+The development server starts a MiniAccumuloCluster and runs MiniFluo on top of it. It parses a
+Common Crawl data file and creates a file at `data/1K-pages.txt` with 1000 pages that are loaded
+into MiniFluo. The pages are processed by Fluo, which exports indexes to Accumulo. A web
+application is started at [http://localhost:4567](http://localhost:4567) that queries these indexes.
-There are a few environment variables that need to be set to run these scripts (see
-`conf/webindex-env.sh.example` for a list). If you don't want to set them in your `~/.bashrc`,
-create `webindex-env.sh` in `conf/` and set them.
-
- cp conf/webindex-env.sh.example conf/webindex-env.sh
-
-### Download the paths file for a crawl
-
-For each crawl of the web, Common Crawl produces a file containing a list of paths to the data
-files produced by that crawl. The webindex `copy` and `load-s3` commands use this file to
-retrieve Common Crawl data stored in S3. The `getpaths` command below downloads this paths
-file for the April 2015 crawl (identified by `2015-18`) to the `paths/` directory as it will
-be necessary for future commands. If you would like to use a different crawl, the
-[Common Crawl website][cdata] has a list of possible crawls which are identified by the
-`YEAR-WEEK` (i.e. `2015-18`) of the time the crawl occurred.
-
- ./bin/webindex getpaths 2015-18
-
-Take a look at the paths file that was just retrieved.
-
- $ less paths/2015-18.wat.paths
-
-Each line in the paths file contains a path to a different common crawl data file. In later
-commands, you will select paths by specifying a range (in the format of `START-END`). Ranges
-can start at index 0 and their start/end points are inclusive. Therefore, a range of `4-6`
-would select 3 paths from line 4, 5, and 6 of the file. Using the command below, you can
-find the max endpoint for ranges in a paths file.
-
- $ wc -l paths/2015-18.wat.paths
- 38609 paths/2015-18.wat.paths
-
-The 2015-18 paths file has 38609 different paths. A range of `0-38608` would select all
-paths in the file.
-
-### Copy Common Crawl data from AWS into HDFS
-
-After retrieving a paths file, the command below runs a Spark job that copies data files from S3
-to HDFS. The command below will copy 3 files in the file range of `4-6` of the `2015-18` paths
-file into the HDFS directory `/cc/data/a`. Common Crawl data files are large (~330 MB each) so
-be mindful of how many you copy.
-
- ./bin/webindex copy 2015-18 4-6 /cc/data/a
-
-To create multiple data sets, run the command with different range and HDFS directory.
-
- ./bin/webindex copy 2015-18 7-8 /cc/data/b
-
-### Initialize and start the webindex Fluo application
-
-After copying data into HDFS, run the following to initialize and start the webindex
-Fluo application.
-
- ./bin/webindex init
-
-Optionally, add a HDFS directory (with previously copied data) to the end of the command.
-When a directory is specified, `init` will run a Spark job that initializes the webindex
-Fluo application with data before starting it.
-
- ./bin/webindex init /cc/data/a
-
-### Load data into the webindex Fluo application
-
-The `init` command should only be run on an empty cluster. To add more data, run the
-`load-hdfs` or `load-s3` commands. Both start a Spark job that parses Common Crawl data
-and inserts this data into the Fluo table of the webindex application. The webindex Fluo
-observers will incrementally process this data and export indexes to Accumulo.
-
-The `load-hdfs` command below loads data stored in the HDFS directory `/cc/data/b` into
-Fluo.
-
- ./bin/webindex load-hdfs /cc/data/b
-
-The `load-s3` command below loads data hosted on S3 into Fluo. It select files in the
-`9-10` range of the `2015-18` paths file.
-
- ./bin/webindex load-s3 2015-18 9-10
-
-### Compact Transient Ranges
-
-For long runs, this example has [transient ranges][transient] that need to be
-periodically compacted. This can be accomplished with the following command.
-
-```bash
-nohup fluo exec webindex org.apache.fluo.recipes.accumulo.cmds.CompactTransient 600 &> your_log_file.log &
-```
-
-As long as this command is running, it will initiate a compaction of all transient
-ranges every 10 minutes.
-
-### Run the webindex UI
-
-Run the following command to run the webindex UI which can be viewed at
-[http://localhost:8080/](http://localhost:8080/).
-
- ./bin/webindex ui
-
-The UI queries indexes stored in Accumulo that were exported by Fluo. The UI is
-implemented using [dropwizard]. Optionally, you can modify the default dropwizard
-configuration by creating a `dropwizard.yml` in `conf/`.
-
- cp conf/dropwizard.yml.example conf/dropwizard.yml
+If you would like to run WebIndex on a cluster, follow the [install] instructions.
[fluo]: https://fluo.apache.org/
-[qs]: https://github.com/astralway/quickstart
[pc]: https://github.com/astralway/phrasecount
-[Uno]: https://github.com/astralway/uno
-[dropwizard]: http://dropwizard.io/
[cc]: https://commoncrawl.org/
-[cdata]: https://commoncrawl.org/the-data/get-started/
-[transient]: https://github.com/apache/fluo-recipes/blob/master/docs/transient.md
+[install]: docs/install.md
[ti]: https://travis-ci.org/astralway/webindex.svg?branch=master
[tl]: https://travis-ci.org/astralway/webindex
[li]: http://img.shields.io/badge/license-ASL-blue.svg
diff --git a/bin/impl/base.sh b/bin/impl/base.sh
index fd5885e..5117ca5 100755
--- a/bin/impl/base.sh
+++ b/bin/impl/base.sh
@@ -15,13 +15,28 @@
# limitations under the License.
: ${WI_HOME?"WI_HOME must be set"}
-: ${DATA_CONFIG?"DATA_CONFIG must be set"}
+: ${WI_CONFIG?"WI_CONFIG must be set"}
: ${SPARK_HOME?"SPARK_HOME must be set"}
function get_prop {
- echo "`grep $1 $DATA_CONFIG | cut -d ' ' -f 2`"
+ echo "`grep $1 $WI_CONFIG | cut -d ' ' -f 2`"
}
+: ${HADOOP_CONF_DIR?"HADOOP_CONF_DIR must be set in bash env or conf/webindex-env.sh"}
+if [ ! -d $HADOOP_CONF_DIR ]; then
+ echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR does not exist"
+ exit 1
+fi
+: ${FLUO_HOME?"FLUO_HOME must be set in bash env or conf/webindex-env.sh"}
+if [ ! -d $FLUO_HOME ]; then
+ echo "FLUO_HOME=$FLUO_HOME does not exist"
+ exit 1
+fi
+
+: ${WI_EXECUTOR_INSTANCES?"WI_EXECUTOR_INSTANCES must be set in bash env or conf/webindex-env.sh"}
+: ${WI_EXECUTOR_MEMORY?"WI_EXECUTOR_MEMORY must be set in bash env or conf/webindex-env.sh"}
+export COMMON_SPARK_OPTS="--master yarn-client --num-executors $WI_EXECUTOR_INSTANCES --executor-memory $WI_EXECUTOR_MEMORY"
+
export SPARK_SUBMIT=$SPARK_HOME/bin/spark-submit
if [ ! -f $SPARK_SUBMIT ]; then
echo "The spark-submit command cannot be found in SPARK_HOME=$SPARK_HOME. Please set SPARK_HOME in conf/webindex-env.sh"
@@ -33,13 +48,13 @@
# Stop if any command after this fails
set -e
-export WI_DATA_JAR=$WI_HOME/modules/data/target/webindex-data-0.0.1-SNAPSHOT.jar
-export WI_DATA_DEP_JAR=$WI_HOME/modules/data/target/webindex-data-0.0.1-SNAPSHOT-shaded.jar
+export WI_DATA_JAR=$WI_HOME/modules/data/target/webindex-data-$WI_VERSION.jar
+export WI_DATA_DEP_JAR=$WI_HOME/modules/data/target/webindex-data-$WI_VERSION-shaded.jar
if [ ! -f $WI_DATA_DEP_JAR ]; then
echo "Building $WI_DATA_DEP_JAR"
cd $WI_HOME
: ${SPARK_VERSION?"SPARK_VERSION must be set in bash env or conf/webindex-env.sh"}
: ${HADOOP_VERSION?"HADOOP_VERSION must be set in bash env or conf/webindex-env.sh"}
- mvn clean package -DskipTests -Dspark.version=$SPARK_VERSION -Dhadoop.version=$HADOOP_VERSION
+ mvn clean package -Pcreate-shade-jar -DskipTests -Dspark.version=$SPARK_VERSION -Dhadoop.version=$HADOOP_VERSION
fi
diff --git a/bin/impl/init.sh b/bin/impl/init.sh
index d1c463c..e9a854b 100755
--- a/bin/impl/init.sh
+++ b/bin/impl/init.sh
@@ -54,10 +54,10 @@
mvn dependency:get -Dartifact=com.esotericsoftware:reflectasm:1.10.1:jar -Ddest=$FLUO_APP_LIB
mvn dependency:get -Dartifact=org.objenesis:objenesis:2.1:jar -Ddest=$FLUO_APP_LIB
# Add webindex core and its dependencies
-cp $WI_HOME/modules/core/target/webindex-core-0.0.1-SNAPSHOT.jar $FLUO_APP_LIB
+cp $WI_HOME/modules/core/target/webindex-core-$WI_VERSION.jar $FLUO_APP_LIB
mvn dependency:get -Dartifact=commons-validator:commons-validator:1.4.1:jar -Ddest=$FLUO_APP_LIB
-java -cp $WI_DATA_DEP_JAR webindex.data.Configure $DATA_CONFIG
+java -cp $WI_DATA_DEP_JAR webindex.data.Configure $WI_CONFIG
$FLUO_CMD init $FLUO_APP --force
diff --git a/bin/webindex b/bin/webindex
index 5bc6748..5c298c6 100755
--- a/bin/webindex
+++ b/bin/webindex
@@ -16,6 +16,7 @@
BIN_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
export WI_HOME=$( cd "$( dirname "$BIN_DIR" )" && pwd )
+export WI_VERSION=0.0.1-SNAPSHOT
if [ -f $WI_HOME/conf/webindex-env.sh ]; then
. $WI_HOME/conf/webindex-env.sh
@@ -23,57 +24,57 @@
. $WI_HOME/conf/webindex-env.sh.example
fi
-: ${HADOOP_CONF_DIR?"HADOOP_CONF_DIR must be set in bash env or conf/webindex-env.sh"}
-if [ ! -d $HADOOP_CONF_DIR ]; then
- echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR does not exist"
- exit 1
-fi
-: ${FLUO_HOME?"FLUO_HOME must be set in bash env or conf/webindex-env.sh"}
-if [ ! -d $FLUO_HOME ]; then
- echo "FLUO_HOME=$FLUO_HOME does not exist"
- exit 1
-fi
-
mkdir -p $WI_HOME/logs
-export DATA_CONFIG=$WI_HOME/conf/data.yml
-if [ ! -f $DATA_CONFIG ]; then
- export DATA_CONFIG=$WI_HOME/conf/data.yml.example
- if [ ! -f $DATA_CONFIG ]; then
- echo "Could not find data.yml or data.yml.example in $WI_HOME/conf"
+export WI_CONFIG=$WI_HOME/conf/webindex.yml
+if [ ! -f $WI_CONFIG ]; then
+ export WI_CONFIG=$WI_HOME/conf/webindex.yml.example
+ if [ ! -f $WI_CONFIG ]; then
+ echo "Could not find webindex.yml or webindex.yml.example in $WI_HOME/conf"
exit 1
fi
- echo "Using default config at $DATA_CONFIG"
+fi
+
+log4j_config=$WI_HOME/conf/log4j.properties
+if [ ! -f $log4j_config ]; then
+ log4j_config=$WI_HOME/conf/log4j.properties.example
+ if [ ! -f $log4j_config ]; then
+    echo "Could not find log4j.properties or log4j.properties.example in $WI_HOME/conf"
+ exit 1
+ fi
fi
function get_prop {
- echo "`grep $1 $DATA_CONFIG | cut -d ' ' -f 2`"
+ echo "`grep $1 $WI_CONFIG | cut -d ' ' -f 2`"
}
-: ${WI_EXECUTOR_INSTANCES?"WI_EXECUTOR_INSTANCES must be set in bash env or conf/webindex-env.sh"}
-: ${WI_EXECUTOR_MEMORY?"WI_EXECUTOR_MEMORY must be set in bash env or conf/webindex-env.sh"}
-export COMMON_SPARK_OPTS="--master yarn-client --num-executors $WI_EXECUTOR_INSTANCES --executor-memory $WI_EXECUTOR_MEMORY"
COMMAND_LOGFILE=$WI_HOME/logs/$1_`date +%s`.log
+DATA_DIR=$WI_HOME/data
+mkdir -p $DATA_DIR
case "$1" in
+dev)
+ pkill -9 -f webindex-dev-server
+ cd $WI_HOME
+ mvn -q compile -P webindex-dev-server -Dlog4j.configuration=file:$log4j_config
+ ;;
getpaths)
- PATHS_DIR=$WI_HOME/paths
- mkdir -p $PATHS_DIR
+ mkdir -p $DATA_DIR
PATHS_FILE="$2".wat.paths
- if [ ! -f $PATHS_DIR/$PATHS_FILE ]; then
- rm -f $PATHS_DIR/wat.paths.gz
+ if [ ! -f $DATA_DIR/$PATHS_FILE ]; then
+ rm -f $DATA_DIR/wat.paths.gz
PATHS_URL=https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-$2/wat.paths.gz
if [[ `wget -S --spider $PATHS_URL 2>&1 | grep 'HTTP/1.1 200 OK'` ]]; then
- wget -P $PATHS_DIR $PATHS_URL
- gzip -d $PATHS_DIR/wat.paths.gz
- mv $PATHS_DIR/wat.paths $PATHS_DIR/$PATHS_FILE
- echo "Downloaded paths file to $PATHS_DIR/$PATHS_FILE"
+ wget -P $DATA_DIR $PATHS_URL
+ gzip -d $DATA_DIR/wat.paths.gz
+ mv $DATA_DIR/wat.paths $DATA_DIR/$PATHS_FILE
+ echo "Downloaded paths file to $DATA_DIR/$PATHS_FILE"
else
echo "Crawl paths file for date $2 does not exist at $PATHS_URL"
exit 1
fi
else
- echo "Crawl paths file already exists at $PATHS_DIR/$PATHS_FILE"
+ echo "Crawl paths file already exists at $DATA_DIR/$PATHS_FILE"
fi
;;
copy)
@@ -83,7 +84,7 @@
fi
. $BIN_DIR/impl/base.sh
COMMAND="$SPARK_SUBMIT --class webindex.data.Copy $COMMON_SPARK_OPTS \
- $WI_DATA_DEP_JAR $WI_HOME/paths/"$2".wat.paths $3 $4"
+ $WI_DATA_DEP_JAR $DATA_DIR/"$2".wat.paths $3 $4"
if [ "$5" != "-fg" ]; then
nohup ${COMMAND} &> $COMMAND_LOGFILE &
echo "Started copy. Logs are being output to $COMMAND_LOGFILE"
@@ -140,7 +141,7 @@
exit 1
fi
COMMAND="$SPARK_SUBMIT --class webindex.data.LoadS3 $COMMON_SPARK_OPTS \
- --files $FLUO_PROPS $WI_DATA_DEP_JAR $WI_HOME/paths/"$2".wat.paths $3"
+ --files $FLUO_PROPS $WI_DATA_DEP_JAR $DATA_DIR/"$2".wat.paths $3"
if [ "$4" != "-fg" ]; then
nohup ${COMMAND} &> $COMMAND_LOGFILE &
echo "Started load-s3. Logs are being output to $COMMAND_LOGFILE"
@@ -155,7 +156,7 @@
fi
. $BIN_DIR/impl/base.sh
COMMAND="$SPARK_SUBMIT --class webindex.data.TestParser $COMMON_SPARK_OPTS \
- $WI_DATA_DEP_JAR $WI_HOME/paths/"$2".wat.paths $3"
+ $WI_DATA_DEP_JAR $DATA_DIR/"$2".wat.paths $3"
if [ "$4" != "-fg" ]; then
nohup ${COMMAND} &> $COMMAND_LOGFILE &
echo "Started data-verify. Logs are being output to $COMMAND_LOGFILE"
@@ -164,18 +165,9 @@
fi
;;
ui)
- pkill -9 -f webindex-ui
- WI_UI_JAR=$WI_HOME/modules/ui/target/webindex-ui-0.0.1-SNAPSHOT.jar
- if [ ! -f $WI_UI_JAR ]; then
- cd $WI_HOME/modules/ui
- mvn clean install -DskipTests
- fi
- DROPWIZARD_CONFIG=""
- if [ -f $WI_HOME/conf/dropwizard.yml ]; then
- DROPWIZARD_CONFIG=$WI_HOME/conf/dropwizard.yml
- echo "Running with dropwizard config at $DROPWIZARD_CONFIG"
- fi
- COMMAND="java -jar $WI_UI_JAR server $DROPWIZARD_CONFIG"
+ pkill -9 -f webindex-web-server
+ cd $WI_HOME
+ COMMAND="mvn -q compile -P webindex-web-server -Dlog4j.configuration=file:$log4j_config"
if [ "$2" != "-fg" ]; then
nohup ${COMMAND} &> $COMMAND_LOGFILE &
echo "Started UI. Logs are being output to $COMMAND_LOGFILE"
@@ -242,7 +234,7 @@
exit 1
fi
echo "Killing the webindex UI web server..."
- pkill -9 -f webindex-ui
+ pkill -9 -f webindex-web-server
echo "Stopping the $FLUO_APP Fluo application (if running)..."
$FLUO_CMD stop $FLUO_APP
@@ -254,7 +246,8 @@
*)
echo -e "Usage: webindex <command> (<argument>)\n"
echo -e "Possible commands:\n"
- echo " getpaths <DATE> Retrieves paths file for given crawl <DATE> (i.e 2015-18) and stores file in the 'paths/' directory"
+ echo " dev Runs WebIndex development server"
+  echo "  getpaths <DATE>             Retrieves paths file for given crawl <DATE> (i.e. 2015-18) and stores the file in the 'data/' directory"
echo " See https://commoncrawl.org/the-data/get-started/ for possible crawl dates"
echo " copy <DATE> <RANGE> <DEST> Copies CommonCrawl data files from S3 given a <DATE> and <RANGE> (i.e 0-8) into HDFS <DEST> directory"
echo " init [<SRC>] Initializes and starts the WebIndex application. Optionally, a <SRC> HDFS directory can be added to"
diff --git a/conf/.gitignore b/conf/.gitignore
index 45f109f..40c8d7c 100644
--- a/conf/.gitignore
+++ b/conf/.gitignore
@@ -1,3 +1,3 @@
-data.yml
-dropwizard.yml
+webindex.yml
webindex-env.sh
+log4j.properties
diff --git a/conf/dropwizard.yml.example b/conf/dropwizard.yml.example
deleted file mode 100644
index 77f06a8..0000000
--- a/conf/dropwizard.yml.example
+++ /dev/null
@@ -1,21 +0,0 @@
-# Copyright 2015 Webindex authors (see AUTHORS)
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# This optional file configures the dropwizard settings in the UI.
-# See the URL below for possible configuration:
-# http://www.dropwizard.io/0.8.2/docs/manual/configuration.html
-
-# This file is optional. If you don't need to change the default dropwizard
-# configuration, don't create this file as dropwizard will complain if it is
-# created but empty.
diff --git a/conf/log4j.properties.example b/conf/log4j.properties.example
new file mode 100644
index 0000000..694c884
--- /dev/null
+++ b/conf/log4j.properties.example
@@ -0,0 +1,29 @@
+# Copyright 2016 Webindex authors (see AUTHORS)
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+log4j.rootLogger=INFO, CA
+log4j.appender.CA=org.apache.log4j.ConsoleAppender
+log4j.appender.CA.layout=org.apache.log4j.PatternLayout
+log4j.appender.CA.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n
+
+log4j.logger.org.apache.accumulo=WARN
+log4j.logger.org.apache.curator=ERROR
+log4j.logger.org.apache.fluo=WARN
+log4j.logger.org.apache.hadoop=WARN
+log4j.logger.org.apache.hadoop.mapreduce=ERROR
+log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
+log4j.logger.org.apache.zookeeper=ERROR
+log4j.logger.org.eclipse.jetty=WARN
+log4j.logger.org.spark-project=WARN
+log4j.logger.webindex=INFO
diff --git a/conf/data.yml.example b/conf/webindex.yml.example
similarity index 100%
rename from conf/data.yml.example
rename to conf/webindex.yml.example
diff --git a/docs/install.md b/docs/install.md
new file mode 100644
index 0000000..6f4fab5
--- /dev/null
+++ b/docs/install.md
@@ -0,0 +1,131 @@
+# WebIndex Install
+
+Below are instructions for installing WebIndex on a cluster.
+
+## Requirements
+
+To run WebIndex, you need the following installed on your cluster:
+
+* Java
+* Hadoop (HDFS & YARN)
+* Accumulo
+* Fluo
+* Maven
+
+Hadoop & Accumulo should be running before starting these instructions. Fluo and Maven only need to
+be installed on the machine where you run the `webindex` command. Consider using [Uno] to set up
+Hadoop, Accumulo & Fluo if you are running on a single node.
+
+## Configure your environment
+
+First, clone the WebIndex repo:
+
+ git clone https://github.com/astralway/webindex.git
+
+Next, create the configuration file `webindex.yml` in the `conf/` directory and edit it
+for your environment.
+
+ cd webindex/
+ cp conf/webindex.yml.example conf/webindex.yml
+
+There are a few environment variables that need to be set to run these scripts (see
+`conf/webindex-env.sh.example` for a list). If you don't want to set them in your `~/.bashrc`,
+create `webindex-env.sh` in `conf/` and set them.
+
+ cp conf/webindex-env.sh.example conf/webindex-env.sh
+
+## Download the paths file for a crawl
+
+For each crawl of the web, Common Crawl produces a file containing a list of paths to the data
+files produced by that crawl. The WebIndex `copy` and `load-s3` commands use this file to
+retrieve Common Crawl data stored in S3. The `getpaths` command below downloads the paths
+file for the April 2015 crawl (identified by `2015-18`) to the `data/` directory, as it will
+be needed by later commands. If you would like to use a different crawl, the
+[Common Crawl website][cdata] has a list of possible crawls, which are identified by the
+`YEAR-WEEK` (e.g. `2015-18`) of when the crawl occurred.
+
+ ./bin/webindex getpaths 2015-18
+
+Take a look at the paths file that was just retrieved.
+
+    $ less data/2015-18.wat.paths
+
+Each line in the paths file contains a path to a different Common Crawl data file. In later
+commands, you will select paths by specifying a range (in the format `START-END`). Ranges
+start at index 0 and their start/end points are inclusive. Therefore, a range of `4-6`
+selects the 3 paths at indexes 4, 5, and 6 of the file. Using the command below, you can
+find the maximum endpoint for ranges in a paths file.
+
+    $ wc -l data/2015-18.wat.paths
+    38609 data/2015-18.wat.paths
+
+The 2015-18 paths file has 38609 different paths. A range of `0-38608` would select all
+paths in the file.
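The inclusive, zero-indexed range arithmetic above can be sketched with standard shell tools; `example.paths` here is a hypothetical stand-in for a real paths file:

```shell
# Build a small stand-in paths file: one path per line, indexed from 0.
printf 'path0\npath1\npath2\npath3\npath4\npath5\npath6\n' > example.paths

# A webindex range START-END is inclusive on both ends, so 4-6 selects 3 paths.
# sed numbers lines from 1, so add 1 to each endpoint.
sed -n '5,7p' example.paths   # prints path4, path5, path6

# The maximum usable endpoint is the line count minus one.
echo $(( $(wc -l < example.paths) - 1 ))   # prints 6
```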
+
+## Copy Common Crawl data from AWS into HDFS
+
+After retrieving a paths file, the command below runs a Spark job that copies data files from S3
+to HDFS. It copies the 3 files in the `4-6` range of the `2015-18` paths file into the HDFS
+directory `/cc/data/a`. Common Crawl data files are large (~330 MB each), so be mindful of how
+many you copy.
+
+ ./bin/webindex copy 2015-18 4-6 /cc/data/a
+
+To create multiple data sets, run the command with a different range and HDFS directory.
+
+ ./bin/webindex copy 2015-18 7-8 /cc/data/b
+
+## Initialize and start the WebIndex Fluo application
+
+After copying data into HDFS, run the following command to initialize and start the WebIndex
+Fluo application.
+
+ ./bin/webindex init
+
+Optionally, add an HDFS directory (with previously copied data) to the end of the command.
+When a directory is specified, `init` will run a Spark job that initializes the WebIndex
+Fluo application with data before starting it.
+
+ ./bin/webindex init /cc/data/a
+
+## Load data into the WebIndex Fluo application
+
+The `init` command should only be run on an empty cluster. To add more data, run the
+`load-hdfs` or `load-s3` commands. Both start a Spark job that parses Common Crawl data
+and inserts it into the Fluo table of the WebIndex application. The WebIndex Fluo
+observers will incrementally process this data and export indexes to Accumulo.
+
+The `load-hdfs` command below loads data stored in the HDFS directory `/cc/data/b` into
+Fluo.
+
+ ./bin/webindex load-hdfs /cc/data/b
+
+The `load-s3` command below loads data hosted on S3 into Fluo. It selects files in the
+`9-10` range of the `2015-18` paths file.
+
+ ./bin/webindex load-s3 2015-18 9-10
+
+## Compact Transient Ranges
+
+For long runs, this example has [transient ranges][transient] that need to be
+periodically compacted. This can be accomplished with the following command.
+
+```bash
+nohup fluo exec webindex org.apache.fluo.recipes.accumulo.cmds.CompactTransient 600 &> your_log_file.log &
+```
+
+As long as this command is running, it will initiate a compaction of all transient
+ranges every 10 minutes.
+
+## Run the WebIndex UI
+
+Run the following command to start the WebIndex UI, which can be viewed at
+[http://localhost:4567/](http://localhost:4567/).
+
+ ./bin/webindex ui
+
+The UI queries indexes stored in Accumulo that were exported by Fluo.
+
+[Uno]: https://github.com/astralway/uno
+[transient]: https://github.com/apache/fluo-recipes/blob/master/docs/transient.md
+[cdata]: https://commoncrawl.org/the-data/get-started/
diff --git a/modules/core/pom.xml b/modules/core/pom.xml
index 86221cc..f402de6 100644
--- a/modules/core/pom.xml
+++ b/modules/core/pom.xml
@@ -34,6 +34,10 @@
<artifactId>gson</artifactId>
</dependency>
<dependency>
+ <groupId>com.google.guava</groupId>
+ <artifactId>guava</artifactId>
+ </dependency>
+ <dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
</dependency>
@@ -43,6 +47,10 @@
<version>1.4.1</version>
</dependency>
<dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-core</artifactId>
+ </dependency>
+ <dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
diff --git a/modules/core/src/main/java/webindex/core/IndexClient.java b/modules/core/src/main/java/webindex/core/IndexClient.java
new file mode 100644
index 0000000..e28d7f8
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/IndexClient.java
@@ -0,0 +1,234 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core;
+
+import java.util.Iterator;
+import java.util.Map;
+
+import com.google.gson.Gson;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.Scanner;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.security.Authorizations;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import webindex.core.models.DomainStats;
+import webindex.core.models.Link;
+import webindex.core.models.Links;
+import webindex.core.models.Page;
+import webindex.core.models.Pages;
+import webindex.core.models.TopResults;
+import webindex.core.models.URL;
+import webindex.core.util.Pager;
+
+public class IndexClient {
+
+ private static final Logger log = LoggerFactory.getLogger(IndexClient.class);
+ private static final int PAGE_SIZE = 25;
+
+ private Connector conn;
+ private String accumuloIndexTable;
+ private Gson gson = new Gson();
+
+ public IndexClient(String accumuloIndexTable, Connector conn) {
+ this.accumuloIndexTable = accumuloIndexTable;
+ this.conn = conn;
+ }
+
+ public TopResults getTopResults(String next, int pageNum) {
+
+ TopResults results = new TopResults();
+
+ results.setPageNum(pageNum);
+ try {
+ Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
+ Pager pager = Pager.build(scanner, Range.prefix("t:"), PAGE_SIZE, entry -> {
+ String row = entry.getKey().getRow().toString();
+ if (entry.isNext()) {
+ results.setNext(row);
+ } else {
+ String url = URL.fromPageID(row.split(":", 3)[2]).toString();
+ Long num = Long.parseLong(entry.getValue().toString());
+ results.addResult(url, num);
+ }
+ });
+ if (next.isEmpty()) {
+ pager.read(pageNum);
+ } else {
+ pager.read(new Key(next));
+ }
+ } catch (TableNotFoundException e) {
+ log.error("Table {} not found", accumuloIndexTable);
+ }
+ return results;
+ }
+
+ private static Long getLongValue(Map.Entry<Key, Value> entry) {
+ return Long.parseLong(entry.getValue().toString());
+ }
+
+ public Page getPage(String rawUrl) {
+ Page page = null;
+ Long incount = (long) 0;
+ URL url;
+ try {
+ url = URL.from(rawUrl);
+ } catch (Exception e) {
+ log.error("Failed to parse URL {}", rawUrl);
+ return null;
+ }
+
+ try {
+ Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
+ scanner.setRange(Range.exact("p:" + url.toPageID(), Constants.PAGE));
+ for (Map.Entry<Key, Value> entry : scanner) {
+ switch (entry.getKey().getColumnQualifier().toString()) {
+ case Constants.INCOUNT:
+ incount = getLongValue(entry);
+ break;
+ case Constants.CUR:
+ page = gson.fromJson(entry.getValue().toString(), Page.class);
+ break;
+ default:
+ log.error("Unknown page stat {}", entry.getKey().getColumnQualifier());
+ }
+ }
+ } catch (TableNotFoundException e) {
+ e.printStackTrace();
+ }
+
+ if (page == null) {
+ page = new Page(url.toPageID());
+ }
+ page.setNumInbound(incount);
+ return page;
+ }
+
+ public DomainStats getDomainStats(String domain) {
+ DomainStats stats = new DomainStats(domain);
+ Scanner scanner;
+ try {
+ scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
+ scanner.setRange(Range.exact("d:" + URL.reverseHost(domain), Constants.DOMAIN));
+ for (Map.Entry<Key, Value> entry : scanner) {
+ switch (entry.getKey().getColumnQualifier().toString()) {
+ case Constants.PAGECOUNT:
+ stats.setTotal(getLongValue(entry));
+ break;
+ default:
+ log.error("Unknown page domain {}", entry.getKey().getColumnQualifier());
+ }
+ }
+ } catch (TableNotFoundException e) {
+ e.printStackTrace();
+ }
+ return stats;
+ }
+
+ public Pages getPages(String domain, String next, int pageNum) {
+ DomainStats stats = getDomainStats(domain);
+ Pages pages = new Pages(domain, pageNum);
+ pages.setTotal(stats.getTotal());
+ String row = "d:" + URL.reverseHost(domain);
+ String cf = Constants.RANK;
+ try {
+ Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
+ Pager pager =
+ Pager.build(scanner, Range.prefix(row + ":"), PAGE_SIZE, entry -> {
+ if (entry.isNext()) {
+ pages.setNext(entry.getKey().getRowData().toString().split(":", 3)[2]);
+ } else {
+ String url =
+ URL.fromPageID(entry.getKey().getRowData().toString().split(":", 4)[3])
+ .toString();
+ Long count = Long.parseLong(entry.getValue().toString());
+ pages.addPage(url, count);
+ }
+ });
+ if (next.isEmpty()) {
+ pager.read(pageNum);
+ } else {
+ pager.read(new Key(row + ":" + next, cf, ""));
+
+ }
+ } catch (TableNotFoundException e) {
+ log.error("Table {} not found", accumuloIndexTable);
+ }
+ return pages;
+ }
+
+ public Links getLinks(String rawUrl, String linkType, String next, int pageNum) {
+
+ Links links = new Links(rawUrl, linkType, pageNum);
+
+ URL url;
+ try {
+ url = URL.from(rawUrl);
+ } catch (Exception e) {
+ log.error("Failed to parse URL: " + rawUrl);
+ return links;
+ }
+
+ try {
+ Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
+ String row = "p:" + url.toPageID();
+ if (linkType.equals("in")) {
+ Page page = getPage(rawUrl);
+ String cf = Constants.INLINKS;
+ links.setTotal(page.getNumInbound());
+ Pager pager = Pager.build(scanner, Range.exact(row, cf), PAGE_SIZE, entry -> {
+ String pageID = entry.getKey().getColumnQualifier().toString();
+ if (entry.isNext()) {
+ links.setNext(pageID);
+ } else {
+ String anchorText = entry.getValue().toString();
+ links.addLink(Link.of(pageID, anchorText));
+ }
+ });
+ if (next.isEmpty()) {
+ pager.read(pageNum);
+ } else {
+ pager.read(new Key(row, cf, next));
+ }
+ } else {
+ scanner.setRange(Range.exact(row, Constants.PAGE, Constants.CUR));
+ Iterator<Map.Entry<Key, Value>> iter = scanner.iterator();
+ if (iter.hasNext()) {
+ Page curPage = gson.fromJson(iter.next().getValue().toString(), Page.class);
+ links.setTotal(curPage.getNumOutbound());
+ int skip = 0;
+ int add = 0;
+ for (Link l : curPage.getOutboundLinks()) {
+ if (skip < (pageNum * PAGE_SIZE)) {
+ skip++;
+ } else if (add < PAGE_SIZE) {
+ links.addLink(l);
+ add++;
+ } else {
+ links.setNext(l.getPageID());
+ break;
+ }
+ }
+ }
+ }
+ } catch (TableNotFoundException e) {
+ log.error("Table {} not found", accumuloIndexTable);
+ }
+ return links;
+ }
+}
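The outbound-links branch above pages an in-memory list by skipping `pageNum * PAGE_SIZE` entries, collecting up to `PAGE_SIZE`, and recording the first leftover entry as a continuation token. A standard-library-only sketch of that same loop (the `PageResult` holder and `readPage` name are hypothetical, and plain strings stand in for `Link` objects):

```java
import java.util.ArrayList;
import java.util.List;

public class InMemoryPager {

  /** Result of one page read: the page's items plus an optional continuation token. */
  public static class PageResult {
    public final List<String> items = new ArrayList<>();
    public String next; // null when there is no further page
  }

  /**
   * Pages a pre-sorted list the way the outbound-links branch does: skip
   * pageNum * pageSize entries, take up to pageSize, and report the entry
   * after the page (if any) as the "next" token.
   */
  public static PageResult readPage(List<String> all, int pageNum, int pageSize) {
    PageResult result = new PageResult();
    int skip = 0;
    int added = 0;
    for (String item : all) {
      if (skip < pageNum * pageSize) {
        skip++;
      } else if (added < pageSize) {
        result.items.add(item);
        added++;
      } else {
        result.next = item; // first entry of the following page
        break;
      }
    }
    return result;
  }
}
```

A caller that receives a non-null `next` can render a "next page" link carrying that token, which is how the UI resumes the scan without re-reading earlier pages.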
diff --git a/modules/core/src/main/java/webindex/core/DataConfig.java b/modules/core/src/main/java/webindex/core/WebIndexConfig.java
similarity index 84%
rename from modules/core/src/main/java/webindex/core/DataConfig.java
rename to modules/core/src/main/java/webindex/core/WebIndexConfig.java
index 36a3bf9..dadc429 100644
--- a/modules/core/src/main/java/webindex/core/DataConfig.java
+++ b/modules/core/src/main/java/webindex/core/WebIndexConfig.java
@@ -21,9 +21,9 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-public class DataConfig {
+public class WebIndexConfig {
- private static final Logger log = LoggerFactory.getLogger(DataConfig.class);
+ private static final Logger log = LoggerFactory.getLogger(WebIndexConfig.class);
public static String CC_URL_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/";
public static final String WI_EXECUTOR_INSTANCES = "WI_EXECUTOR_INSTANCES";
@@ -71,27 +71,27 @@
return path;
}
- public static DataConfig load() {
+ public static WebIndexConfig load() {
final String homePath = getEnvPath("WI_HOME");
- final String userPath = homePath + "/conf/data.yml";
- final String defaultPath = homePath + "/conf/data.yml.example";
+ final String userPath = homePath + "/conf/webindex.yml";
+ final String defaultPath = homePath + "/conf/webindex.yml.example";
if ((new File(userPath).exists())) {
log.info("Using user config at {}", userPath);
return load(userPath);
} else {
- log.info("Using default config at {}" + defaultPath);
+ log.info("Using default config at {}", defaultPath);
return load(defaultPath);
}
}
- public static DataConfig load(String configPath) {
+ public static WebIndexConfig load(String configPath) {
return load(configPath, true);
}
- protected static DataConfig load(String configPath, boolean useEnv) {
+ protected static WebIndexConfig load(String configPath, boolean useEnv) {
try {
YamlReader reader = new YamlReader(new FileReader(configPath));
- DataConfig config = reader.read(DataConfig.class);
+ WebIndexConfig config = reader.read(WebIndexConfig.class);
if (useEnv) {
config.hadoopConfDir = getEnvPath("HADOOP_CONF_DIR");
config.fluoHome = getEnvPath("FLUO_HOME");
diff --git a/modules/core/src/main/java/webindex/core/models/URL.java b/modules/core/src/main/java/webindex/core/models/URL.java
index 04fb3f9..c090083 100644
--- a/modules/core/src/main/java/webindex/core/models/URL.java
+++ b/modules/core/src/main/java/webindex/core/models/URL.java
@@ -18,6 +18,8 @@
import java.util.Objects;
import java.util.function.Function;
+import com.google.common.net.HostSpecifier;
+import com.google.common.net.InternetDomainName;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.validator.routines.InetAddressValidator;
import org.slf4j.Logger;
@@ -63,6 +65,19 @@
throw new IllegalArgumentException(msg);
}
+ public static String domainFromHost(String host) {
+ return InternetDomainName.from(host).topPrivateDomain().name();
+ }
+
+ public static boolean isValidHost(String host) {
+ return HostSpecifier.isValid(host) && InternetDomainName.isValid(host)
+ && InternetDomainName.from(host).isUnderPublicSuffix();
+ }
+
+ public static URL from(String rawUrl) {
+ return URL.from(rawUrl, URL::domainFromHost, URL::isValidHost);
+ }
+
public static URL from(String rawUrl, Function<String, String> domainFromHost,
Function<String, Boolean> isValidHost) {
@@ -131,6 +146,10 @@
return new URL(domain, host, path, port, secure, ipHost);
}
+ public static boolean isValid(String rawUrl) {
+ return URL.isValid(rawUrl, URL::domainFromHost, URL::isValidHost);
+ }
+
public static boolean isValid(String rawUrl, Function<String, String> domainFromHost,
Function<String, Boolean> isValidHost) {
try {
diff --git a/modules/core/src/main/java/webindex/core/util/Pager.java b/modules/core/src/main/java/webindex/core/util/Pager.java
new file mode 100644
index 0000000..27b4cf8
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/util/Pager.java
@@ -0,0 +1,104 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.util;
+
+import java.util.Iterator;
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Consumer;
+
+import org.apache.accumulo.core.client.Scanner;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+
+public class Pager {
+
+ private Scanner scanner;
+ private int pageSize;
+ private Range pageRange;
+ private Consumer<PageEntry> entryHandler;
+ private AtomicBoolean pageRead = new AtomicBoolean(false);
+
+ public class PageEntry {
+
+ private Key key;
+ private Value value;
+ private boolean isNext;
+
+ public PageEntry(Key key, Value value, boolean isNext) {
+ this.key = key;
+ this.value = value;
+ this.isNext = isNext;
+ }
+
+ public Key getKey() {
+ return key;
+ }
+
+ public Value getValue() {
+ return value;
+ }
+
+ public boolean isNext() {
+ return isNext;
+ }
+ }
+
+ private Pager(Scanner scanner, Range pageRange, int pageSize, Consumer<PageEntry> entryHandler) {
+ this.scanner = scanner;
+ this.pageRange = pageRange;
+ this.pageSize = pageSize;
+ this.entryHandler = entryHandler;
+ }
+
+ public void read(Key startKey) {
+ if (!pageRead.compareAndSet(false, true)) {
+ throw new IllegalStateException("Pager.read() cannot be called twice");
+ }
+ scanner.setRange(new Range(startKey, pageRange.getEndKey()));
+ handleStart(scanner.iterator());
+ }
+
+ public void read(int pageNum) {
+ if (!pageRead.compareAndSet(false, true)) {
+ throw new IllegalStateException("Pager.read() cannot be called twice");
+ }
+ scanner.setRange(pageRange);
+ Iterator<Map.Entry<Key, Value>> iterator = scanner.iterator();
+ if (pageNum > 0) {
+ long skip = 0;
+ while (skip < ((long) pageNum * pageSize) && iterator.hasNext()) {
+ iterator.next();
+ skip++;
+ }
+ }
+ handleStart(iterator);
+ }
+
+ private void handleStart(Iterator<Map.Entry<Key, Value>> iterator) {
+ long num = 0;
+ while (iterator.hasNext() && (num < (pageSize + 1))) {
+ Map.Entry<Key, Value> entry = iterator.next();
+ entryHandler.accept(new PageEntry(entry.getKey(), entry.getValue(), num == pageSize));
+ num++;
+ }
+ }
+
+ public static Pager build(Scanner scanner, Range pageRange, int pageSize,
+ Consumer<PageEntry> entryHandler) {
+ return new Pager(scanner, pageRange, pageSize, entryHandler);
+ }
+}
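The `handleStart` method above reads up to `pageSize + 1` entries: the first `pageSize` form the page, while the extra entry is never part of the page and instead marks where the next page begins (`isNext == true`). The same sentinel pattern over a plain iterator, without the Accumulo types (the `SentinelReader` class and `readPage` signature are hypothetical simplifications):

```java
import java.util.Iterator;
import java.util.List;

public class SentinelReader {

  /**
   * Consumes up to pageSize + 1 keys. The first pageSize keys are appended
   * to pageOut; a (pageSize + 1)-th key, if present, is returned as the
   * continuation key, mirroring Pager.handleStart().
   */
  public static String readPage(Iterator<String> keys, int pageSize, List<String> pageOut) {
    String nextKey = null;
    int num = 0;
    while (keys.hasNext() && num < pageSize + 1) {
      String key = keys.next();
      if (num == pageSize) {
        nextKey = key; // sentinel: start of the following page
      } else {
        pageOut.add(key);
      }
      num++;
    }
    return nextKey;
  }
}
```

Reading one extra entry is what lets the caller distinguish "this is the last page" (no sentinel) from "more pages follow" without issuing a second scan.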
diff --git a/modules/core/src/test/java/webindex/core/DataConfigTest.java b/modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
similarity index 87%
rename from modules/core/src/test/java/webindex/core/DataConfigTest.java
rename to modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
index a92bcbd..2bb125e 100644
--- a/modules/core/src/test/java/webindex/core/DataConfigTest.java
+++ b/modules/core/src/test/java/webindex/core/WebIndexConfigTest.java
@@ -17,11 +17,11 @@
import org.junit.Assert;
import org.junit.Test;
-public class DataConfigTest {
+public class WebIndexConfigTest {
@Test
public void testBasic() throws Exception {
- DataConfig config = DataConfig.load("../../conf/data.yml.example", false);
+ WebIndexConfig config = WebIndexConfig.load("../../conf/webindex.yml.example", false);
Assert.assertEquals("webindex_search", config.accumuloIndexTable);
Assert.assertEquals("webindex", config.fluoApp);
Assert.assertEquals("/cc/temp", config.hdfsTempDir);
diff --git a/modules/core/src/test/java/webindex/core/models/URLTest.java b/modules/core/src/test/java/webindex/core/models/URLTest.java
index 89f5b61..e0eb45d 100644
--- a/modules/core/src/test/java/webindex/core/models/URLTest.java
+++ b/modules/core/src/test/java/webindex/core/models/URLTest.java
@@ -22,31 +22,28 @@
public class URLTest {
public static URL from(String rawUrl) {
- return URL.from(rawUrl, host -> host, host -> true);
+ return URL.from(rawUrl);
}
public static String toID(String rawUrl) {
return from(rawUrl).toPageID();
}
- public static boolean isValid(String rawUrl) {
- return URL.isValid(rawUrl, host -> host, host -> true);
- }
public static URL url80(String host, String path) {
- return new URL(host, host, path, 80, false, URL.isValidIP(host));
+ return new URL(URL.domainFromHost(host), host, path, 80, false, URL.isValidIP(host));
}
public static URL url443(String host, String path) {
- return new URL(host, host, path, 443, true, URL.isValidIP(host));
+ return new URL(URL.domainFromHost(host), host, path, 443, true, URL.isValidIP(host));
}
public static URL urlOpen(String host, String path, int port) {
- return new URL(host, host, path, port, false, URL.isValidIP(host));
+ return new URL(URL.domainFromHost(host), host, path, port, false, URL.isValidIP(host));
}
public static URL urlSecure(String host, String path, int port) {
- return new URL(host, host, path, port, true, URL.isValidIP(host));
+ return new URL(URL.domainFromHost(host), host, path, port, true, URL.isValidIP(host));
}
@Test
@@ -57,7 +54,7 @@
"http://ab.com#1/2/3", "https://ab.com/", "https://h.d.ab.com/1/2/3"};
for (String rawUrl : validUrls) {
- Assert.assertTrue(isValid(rawUrl));
+ Assert.assertTrue(URL.isValid(rawUrl));
Assert.assertEquals(rawUrl, from(rawUrl).toString());
}
@@ -67,7 +64,7 @@
"http://a.com:"};
for (String rawUrl : failureUrls) {
- Assert.assertFalse(isValid(rawUrl));
+ Assert.assertFalse(URL.isValid(rawUrl));
}
}
@@ -100,8 +97,8 @@
URL u = from("http://a.b.c.d.com/1/2/3");
Assert.assertEquals("a.b.c.d.com", u.getHost());
Assert.assertEquals("com.d.c.b.a", u.getReverseHost());
- Assert.assertEquals("a.b.c.d.com", u.getDomain());
- Assert.assertEquals("com.d.c.b.a", u.getReverseDomain());
+ Assert.assertEquals("d.com", u.getDomain());
+ Assert.assertEquals("com.d", u.getReverseDomain());
}
@Test
@@ -110,8 +107,6 @@
Assert.assertEquals(urlOpen("example.com", "#a&b", 83), from("http://example.com:83#a&b"));
Assert.assertEquals(url80("a.b.example.com", "/page?1&2"),
from("http://a.b.example.com/page?1&2"));
- Assert.assertEquals(url443("1.2.3.4", "/page?1&2"), from("https://1.2.3.4/page?1&2"));
- Assert.assertEquals(url80("1.2.3.4", "/page?1&2"), from("http://1.2.3.4/page?1&2"));
Assert.assertEquals(url80("example.com", "/1/2/3?c&d&e"),
from("http://example.com/1/2/3?c&d&e"));
Assert.assertEquals(url80("a.b.example.com", "/"), from("http://a.b.example.com"));
@@ -123,7 +118,6 @@
Assert.assertEquals(url443("example.com", "/"), from("https://example.com/"));
Assert.assertEquals(url80("example.com", "/b?1#2&3#4"), from("http://example.com/b?1#2&3#4"));
Assert.assertEquals(urlOpen("example.com", "/b", 8080), from("http://example.com:8080/b"));
- Assert.assertEquals(url80("1.2.3.4", "////c"), from("http://1.2.3.4////c"));
}
@Test
@@ -131,7 +125,7 @@
URL u1 = urlSecure("a.b.c.com", "/", 8329);
URL u2 = from("https://a.b.C.com:8329");
String r1 = u2.toPageID();
- Assert.assertEquals("com.c.b.a>>s8329>/", r1);
+ Assert.assertEquals("com.c>.b.a>s8329>/", r1);
URL u3 = URL.fromPageID(r1);
Assert.assertEquals(u1, u2);
Assert.assertEquals(u1, u3);
@@ -147,6 +141,75 @@
Assert.assertEquals("1.2.3.4>>o>/a/b/c", id5);
Assert.assertEquals(u5, URL.fromPageID(id5));
- Assert.assertEquals("com.b.a>>s80>/", from("https://a.b.com:80").toPageID());
+ Assert.assertEquals("com.b>.a>s80>/", from("https://a.b.com:80").toPageID());
+ }
+
+ @Test
+ public void testMore() throws Exception {
+
+ // valid urls
+ Assert.assertTrue(URL.isValid(" \thttp://example.com/ \t\n\r\n"));
+ Assert.assertTrue(URL.isValid("http://1.2.3.4:80/test?a=b&c=d"));
+ Assert.assertTrue(URL.isValid("http://1.2.3.4/"));
+ Assert.assertTrue(URL.isValid("http://a.b.c.d.com/1/2/3/4/5"));
+ Assert.assertTrue(URL.isValid("http://a.b.com:281/1/2"));
+ Assert.assertTrue(URL.isValid("http://A.B.Com:281/a/b"));
+ Assert.assertTrue(URL.isValid("http://A.b.Com:281/A/b"));
+ Assert.assertTrue(URL.isValid("http://a.B.Com?A/b/C"));
+ Assert.assertTrue(URL.isValid("http://A.Be.COM"));
+ Assert.assertTrue(URL.isValid("http://1.2.3.4:281/1/2"));
+
+ // invalid urls
+ Assert.assertFalse(URL.isValid("http://a.com:/test"));
+ Assert.assertFalse(URL.isValid("http://z.com:"));
+ Assert.assertFalse(URL.isValid("http://1.2.3:80/test?a=b&c=d"));
+ Assert.assertFalse(URL.isValid("http://1.2.3/"));
+ Assert.assertFalse(URL.isValid("http://com/"));
+ Assert.assertFalse(URL.isValid("http://a.b.c.com/bad>et"));
+ Assert.assertFalse(URL.isValid("http://test"));
+ Assert.assertFalse(URL.isValid("http://co.uk"));
+ Assert.assertFalse(URL.isValid("http:///example.com/"));
+ Assert.assertFalse(URL.isValid("http:://example.com/"));
+ Assert.assertFalse(URL.isValid("example.com"));
+ Assert.assertFalse(URL.isValid("127.0.0.1"));
+ Assert.assertFalse(URL.isValid("http://ab@example.com"));
+ Assert.assertFalse(URL.isValid("ftp://example.com"));
+
+ Assert.assertEquals("example.com", from("http://example.com:281/1/2").getHost());
+ Assert.assertEquals("a.b.example.com", from("http://a.b.example.com/1/2").getHost());
+ Assert.assertEquals("a.b.example.com", from("http://A.B.Example.Com/1/2").getHost());
+ Assert.assertEquals("1.2.3.4", from("http://1.2.3.4:89/1/2").getHost());
+
+ Assert.assertEquals("/A/b/C", from("http://A.B.Example.Com/A/b/C").getPath());
+ Assert.assertEquals("?D/E/f", from("http://A.B.Example.Com?D/E/f").getPath());
+
+ URL u = from("http://a.b.c.d.com/1/2/3");
+ Assert.assertEquals("a.b.c.d.com", u.getHost());
+ Assert.assertEquals("com.d.c.b.a", u.getReverseHost());
+ Assert.assertEquals("d.com", u.getDomain());
+ Assert.assertEquals("com.d", u.getReverseDomain());
+
+ Assert.assertEquals("com.example", from("http://example.com:281/1").getReverseHost());
+ Assert.assertEquals("com.example.b.a", from("http://a.b.example.com/1/2").getReverseHost());
+ Assert.assertEquals("1.2.3.4", from("http://1.2.3.4:89/1/2").getReverseHost());
+
+ Assert.assertTrue(from("http://a.com/a.jpg").isImage());
+ Assert.assertTrue(from("http://a.com/a.JPEG").isImage());
+ Assert.assertTrue(from("http://a.com/c/b/a.png").isImage());
+
+ Assert.assertEquals("c.com", from("http://a.b.c.com").getDomain());
+ Assert.assertEquals("com.c", from("http://a.b.c.com").getReverseDomain());
+ Assert.assertEquals("c.co.uk", from("http://a.b.c.co.uk").getDomain());
+ Assert.assertEquals("uk.co.c", from("http://a.b.c.co.uk").getReverseDomain());
+ Assert.assertEquals("d.com.au", from("http://www.d.com.au").getDomain());
+ Assert.assertEquals("au.com.d", from("http://www.d.com.au").getReverseDomain());
+
+ u = from("https://www.d.com.au:9443/a/bc");
+ Assert.assertEquals("au.com.d>.www>s9443>/a/bc", u.toPageID());
+ Assert.assertEquals("https://www.d.com.au:9443/a/bc", u.toString());
+ URL u2 = URL.fromPageID(u.toPageID());
+ Assert.assertEquals("https://www.d.com.au:9443/a/bc", u2.toString());
+ Assert.assertEquals("d.com.au", u2.getDomain());
+ Assert.assertEquals("www.d.com.au", u2.getHost());
}
}
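The page-ID assertions above follow a consistent `>`-delimited layout: reverse domain, then the remainder of the reverse host, then a scheme marker (`s` for https, `o` otherwise) with the port appended only when it differs from the scheme default, then the path. A sketch of an encoder matching those assertions (the class name and the rule that an IP host is its own "domain" are assumptions inferred from the tests, not taken from the real `URL.toPageID()` source):

```java
public class PageIdSketch {

  /**
   * Approximates the page-ID layout seen in the assertions:
   * reverseDomain > rest-of-reverse-host > (s|o)[port] > path,
   * with the port omitted when it is the scheme default (443/80).
   */
  public static String encode(String reverseHost, String reverseDomain, boolean secure,
      int port, String path) {
    String hostRest = reverseHost.substring(reverseDomain.length());
    String proto = secure ? "s" : "o";
    boolean defaultPort = port == (secure ? 443 : 80);
    String portPart = defaultPort ? "" : Integer.toString(port);
    return reverseDomain + ">" + hostRest + ">" + proto + portPart + ">" + path;
  }
}
```

Leading with the reverse domain (rather than the full reverse host, as before this change) is what makes `https://a.b.com:80` encode as `com.b>.a>s80>/` instead of `com.b.a>>s80>/`, so all hosts of one domain share a row prefix.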
diff --git a/modules/core/src/test/resources/log4j.properties b/modules/core/src/test/resources/log4j.properties
index ee43eea..7add931 100644
--- a/modules/core/src/test/resources/log4j.properties
+++ b/modules/core/src/test/resources/log4j.properties
@@ -18,4 +18,3 @@
log4j.appender.CA.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n
log4j.logger.webindex=WARN
-log4j.logger.Remoting=WARN
diff --git a/modules/data/pom.xml b/modules/data/pom.xml
index 7dd70d2..df01114 100644
--- a/modules/data/pom.xml
+++ b/modules/data/pom.xml
@@ -32,7 +32,6 @@
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
- <version>14.0.1</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
@@ -67,10 +66,6 @@
</dependency>
<dependency>
<groupId>org.apache.fluo</groupId>
- <artifactId>fluo-mapreduce</artifactId>
- </dependency>
- <dependency>
- <groupId>org.apache.fluo</groupId>
<artifactId>fluo-recipes-accumulo</artifactId>
</dependency>
<dependency>
@@ -148,41 +143,47 @@
<scope>test</scope>
</dependency>
</dependencies>
- <build>
- <plugins>
- <plugin>
- <groupId>org.apache.maven.plugins</groupId>
- <artifactId>maven-shade-plugin</artifactId>
- <executions>
- <execution>
- <goals>
- <goal>shade</goal>
- </goals>
- <phase>package</phase>
- <configuration>
- <shadedArtifactAttached>true</shadedArtifactAttached>
- <shadedClassifierName>shaded</shadedClassifierName>
- <!-- Relocate Thrift because Accumulo 1.8 uses Thrift 0.9.3 and Spark uses 0.9.1. -->
- <relocations>
- <relocation>
- <pattern>org.apache.thrift</pattern>
- <shadedPattern>webindex.org.apache.thrift</shadedPattern>
- </relocation>
- </relocations>
- <filters>
- <filter>
- <artifact>*:*</artifact>
- <excludes>
- <exclude>META-INF/*.SF</exclude>
- <exclude>META-INF/*.DSA</exclude>
- <exclude>META-INF/*.RSA</exclude>
- </excludes>
- </filter>
- </filters>
- </configuration>
- </execution>
- </executions>
- </plugin>
- </plugins>
- </build>
+ <profiles>
+ <profile>
+ <id>create-shade-jar</id>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-shade-plugin</artifactId>
+ <executions>
+ <execution>
+ <id>spark-shade-jar</id>
+ <goals>
+ <goal>shade</goal>
+ </goals>
+ <phase>package</phase>
+ <configuration>
+ <shadedArtifactAttached>true</shadedArtifactAttached>
+ <shadedClassifierName>shaded</shadedClassifierName>
+ <!-- Relocate Thrift because Accumulo 1.8 uses Thrift 0.9.3 and Spark uses 0.9.1. -->
+ <relocations>
+ <relocation>
+ <pattern>org.apache.thrift</pattern>
+ <shadedPattern>webindex.org.apache.thrift</shadedPattern>
+ </relocation>
+ </relocations>
+ <filters>
+ <filter>
+ <artifact>*:*</artifact>
+ <excludes>
+ <exclude>META-INF/*.SF</exclude>
+ <exclude>META-INF/*.DSA</exclude>
+ <exclude>META-INF/*.RSA</exclude>
+ </excludes>
+ </filter>
+ </filters>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ </profile>
+ </profiles>
</project>
diff --git a/modules/data/src/main/java/webindex/data/Configure.java b/modules/data/src/main/java/webindex/data/Configure.java
index 0dd0789..11b0f5b 100644
--- a/modules/data/src/main/java/webindex/data/Configure.java
+++ b/modules/data/src/main/java/webindex/data/Configure.java
@@ -22,7 +22,7 @@
import org.apache.fluo.api.config.FluoConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.data.spark.IndexEnv;
public class Configure {
@@ -35,9 +35,9 @@
log.error("Usage: Configure");
System.exit(1);
}
- DataConfig dataConfig = DataConfig.load();
+ WebIndexConfig webIndexConfig = WebIndexConfig.load();
- IndexEnv env = new IndexEnv(dataConfig);
+ IndexEnv env = new IndexEnv(webIndexConfig);
env.initAccumuloIndexTable();
FluoConfiguration appConfig = new FluoConfiguration();
@@ -45,7 +45,7 @@
Iterator<String> iter = appConfig.getKeys();
try (PrintWriter out =
- new PrintWriter(new BufferedWriter(new FileWriter(dataConfig.getFluoPropsPath(), true)))) {
+ new PrintWriter(new BufferedWriter(new FileWriter(webIndexConfig.getFluoPropsPath(), true)))) {
while (iter.hasNext()) {
String key = iter.next();
out.println(key + " = " + appConfig.getRawString(key));
diff --git a/modules/data/src/main/java/webindex/data/Copy.java b/modules/data/src/main/java/webindex/data/Copy.java
index 7976feb..4e71908 100644
--- a/modules/data/src/main/java/webindex/data/Copy.java
+++ b/modules/data/src/main/java/webindex/data/Copy.java
@@ -28,7 +28,7 @@
import org.apache.spark.api.java.JavaSparkContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.data.spark.IndexEnv;
public class Copy {
@@ -56,7 +56,7 @@
System.exit(1);
}
- DataConfig dataConfig = DataConfig.load();
+ WebIndexConfig webIndexConfig = WebIndexConfig.load();
SparkConf sparkConf = new SparkConf().setAppName("webindex-copy");
try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {
@@ -70,9 +70,9 @@
log.info("Copying {} files (Range {} of paths file {}) from AWS to HDFS {}", copyList.size(),
args[1], args[0], destPath.toString());
- JavaRDD<String> copyRDD = ctx.parallelize(copyList, dataConfig.getNumExecutorInstances());
+ JavaRDD<String> copyRDD = ctx.parallelize(copyList, webIndexConfig.getNumExecutorInstances());
- final String prefix = DataConfig.CC_URL_PREFIX;
+ final String prefix = WebIndexConfig.CC_URL_PREFIX;
final String destDir = destPath.toString();
copyRDD
diff --git a/modules/data/src/main/java/webindex/data/Init.java b/modules/data/src/main/java/webindex/data/Init.java
index 1caa164..a3c7f4c 100644
--- a/modules/data/src/main/java/webindex/data/Init.java
+++ b/modules/data/src/main/java/webindex/data/Init.java
@@ -23,7 +23,7 @@
import org.archive.io.ArchiveReader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.core.models.Page;
import webindex.data.spark.IndexEnv;
import webindex.data.spark.IndexStats;
@@ -40,9 +40,9 @@
log.error("Usage: Init [<dataDir>]");
System.exit(1);
}
- DataConfig dataConfig = DataConfig.load();
+ WebIndexConfig webIndexConfig = WebIndexConfig.load();
- IndexEnv env = new IndexEnv(dataConfig);
+ IndexEnv env = new IndexEnv(webIndexConfig);
env.setFluoTableSplits();
log.info("Initialized Fluo table splits");
diff --git a/modules/data/src/main/java/webindex/data/LoadHdfs.java b/modules/data/src/main/java/webindex/data/LoadHdfs.java
index 87d590b..91eb17f 100644
--- a/modules/data/src/main/java/webindex/data/LoadHdfs.java
+++ b/modules/data/src/main/java/webindex/data/LoadHdfs.java
@@ -37,7 +37,7 @@
import org.archive.io.warc.WARCReaderFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.core.models.Page;
import webindex.data.fluo.PageLoader;
import webindex.data.spark.IndexEnv;
@@ -57,7 +57,7 @@
IndexEnv.validateDataDir(dataDir);
final String hadoopConfDir = IndexEnv.getHadoopConfDir();
- final int rateLimit = DataConfig.load().getLoadRateLimit();
+ final int rateLimit = WebIndexConfig.load().getLoadRateLimit();
List<String> loadPaths = new ArrayList<>();
FileSystem hdfs = IndexEnv.getHDFS();
diff --git a/modules/data/src/main/java/webindex/data/LoadS3.java b/modules/data/src/main/java/webindex/data/LoadS3.java
index e16c69e..267e160 100644
--- a/modules/data/src/main/java/webindex/data/LoadS3.java
+++ b/modules/data/src/main/java/webindex/data/LoadS3.java
@@ -31,7 +31,7 @@
import org.archive.io.warc.WARCReaderFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.core.models.Page;
import webindex.data.fluo.PageLoader;
import webindex.data.spark.IndexEnv;
@@ -53,7 +53,7 @@
System.exit(1);
}
- final int rateLimit = DataConfig.load().getLoadRateLimit();
+ final int rateLimit = WebIndexConfig.load().getLoadRateLimit();
SparkConf sparkConf = new SparkConf().setAppName("webindex-load-s3");
try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {
@@ -63,7 +63,7 @@
JavaRDD<String> loadRDD = ctx.parallelize(loadList, loadList.size());
- final String prefix = DataConfig.CC_URL_PREFIX;
+ final String prefix = WebIndexConfig.CC_URL_PREFIX;
loadRDD.foreachPartition(iter -> {
final FluoConfiguration fluoConfig = new FluoConfiguration(new File("fluo.properties"));
diff --git a/modules/data/src/main/java/webindex/data/TestParser.java b/modules/data/src/main/java/webindex/data/TestParser.java
index 318db31..30fb024 100644
--- a/modules/data/src/main/java/webindex/data/TestParser.java
+++ b/modules/data/src/main/java/webindex/data/TestParser.java
@@ -25,7 +25,7 @@
import org.archive.io.warc.WARCReaderFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.data.spark.IndexEnv;
import webindex.data.util.ArchiveUtil;
@@ -45,7 +45,7 @@
System.exit(1);
}
- DataConfig.load();
+ WebIndexConfig.load();
SparkConf sparkConf = new SparkConf().setAppName("webindex-test-parser");
try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {
@@ -55,7 +55,7 @@
JavaRDD<String> loadRDD = ctx.parallelize(loadList, loadList.size());
- final String prefix = DataConfig.CC_URL_PREFIX;
+ final String prefix = WebIndexConfig.CC_URL_PREFIX;
loadRDD.foreachPartition(iter -> iter.forEachRemaining(path -> {
String urlToCopy = prefix + path;
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexEnv.java b/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
index 4ee9277..ec1dda8 100644
--- a/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
+++ b/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
@@ -54,7 +54,7 @@
import org.apache.spark.api.java.JavaSparkContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import webindex.core.DataConfig;
+import webindex.core.WebIndexConfig;
import webindex.core.models.Page;
import webindex.data.FluoApp;
import webindex.data.fluo.PageObserver;
@@ -72,9 +72,9 @@
private int numTablets;
private int numBuckets;
- public IndexEnv(DataConfig dataConfig) {
- this(getFluoConfig(dataConfig), dataConfig.accumuloIndexTable, dataConfig.hdfsTempDir,
- dataConfig.numBuckets, dataConfig.numTablets);
+ public IndexEnv(WebIndexConfig webIndexConfig) {
+ this(getFluoConfig(webIndexConfig), webIndexConfig.accumuloIndexTable,
+ webIndexConfig.hdfsTempDir, webIndexConfig.numBuckets, webIndexConfig.numTablets);
}
public IndexEnv(FluoConfiguration fluoConfig, String accumuloTable, String hdfsTempDir,
@@ -101,10 +101,10 @@
return hadoopConfDir;
}
- private static FluoConfiguration getFluoConfig(DataConfig dataConfig) {
- Preconditions.checkArgument(new File(dataConfig.getFluoPropsPath()).exists(),
- "fluoPropsPath must be set in data.yml and exist");
- return new FluoConfiguration(new File(dataConfig.getFluoPropsPath()));
+ private static FluoConfiguration getFluoConfig(WebIndexConfig webIndexConfig) {
+ Preconditions.checkArgument(new File(webIndexConfig.getFluoPropsPath()).exists(),
+ "fluoPropsPath must be set in webindex.yml and exist");
+ return new FluoConfiguration(new File(webIndexConfig.getFluoPropsPath()));
}
public FluoConfiguration getFluoConfig() {
diff --git a/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java b/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
index 212103a..6ce5118 100644
--- a/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
+++ b/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
@@ -51,7 +51,7 @@
String rawPageUrl = archiveRecord.getHeader().getUrl();
URL pageUrl;
try {
- pageUrl = DataUrl.from(rawPageUrl);
+ pageUrl = URL.from(rawPageUrl);
} catch (IllegalArgumentException e) {
return Page.EMPTY;
} catch (Exception e) {
@@ -80,7 +80,7 @@
String rawLinkUrl = link.getString("url");
URL linkUrl;
try {
- linkUrl = DataUrl.from(rawLinkUrl);
+ linkUrl = URL.from(rawLinkUrl);
if (!page.getDomain().equals(linkUrl.getDomain())) {
page.addOutbound(Link.of(linkUrl, anchorText));
}
diff --git a/modules/data/src/main/java/webindex/data/util/DataUrl.java b/modules/data/src/main/java/webindex/data/util/DataUrl.java
deleted file mode 100644
index 695b16f..0000000
--- a/modules/data/src/main/java/webindex/data/util/DataUrl.java
+++ /dev/null
@@ -1,39 +0,0 @@
-/*
- * Copyright 2016 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.util;
-
-import com.google.common.net.HostSpecifier;
-import com.google.common.net.InternetDomainName;
-import webindex.core.models.URL;
-
-public class DataUrl {
-
- public static String domainFromHost(String host) {
- return InternetDomainName.from(host).topPrivateDomain().name();
- }
-
- public static boolean isValidHost(String host) {
- return HostSpecifier.isValid(host) && InternetDomainName.isValid(host)
- && InternetDomainName.from(host).isUnderPublicSuffix();
- }
-
- public static URL from(String rawUrl) {
- return URL.from(rawUrl, DataUrl::domainFromHost, DataUrl::isValidHost);
- }
-
- public static boolean isValid(String rawUrl) {
- return URL.isValid(rawUrl, DataUrl::domainFromHost, DataUrl::isValidHost);
- }
-}
diff --git a/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java b/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
index 7c64004..6b29fec 100644
--- a/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
+++ b/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
@@ -54,7 +54,6 @@
import webindex.data.spark.IndexStats;
import webindex.data.spark.IndexUtil;
import webindex.data.util.ArchiveUtil;
-import webindex.data.util.DataUrl;
public class IndexIT extends AccumuloExportITBase {
@@ -136,11 +135,11 @@
}
public static Link newLink(String url) {
- return Link.of(DataUrl.from(url));
+ return Link.of(URL.from(url));
}
public static Link newLink(String url, String anchorText) {
- return Link.of(DataUrl.from(url), anchorText);
+ return Link.of(URL.from(url), anchorText);
}
@Test
@@ -160,8 +159,8 @@
getMiniFluo().waitForObservers();
assertOutput(pages.values());
- URL deleteUrl = DataUrl.from("http://1000games.me/games/gametion/");
- log.info("Deleting page {}", deleteUrl);
+ URL deleteUrl = URL.from("http://1000games.me/games/gametion/");
+ log.debug("Deleting page {}", deleteUrl);
try (LoaderExecutor le = client.newLoaderExecutor()) {
le.execute(PageLoader.deletePage(deleteUrl));
}
@@ -172,7 +171,7 @@
Assert.assertEquals(numPages - 1, pages.size());
assertOutput(pages.values());
- URL updateUrl = DataUrl.from("http://100zone.blogspot.com/2013/03/please-memp3-4shared.html");
+ URL updateUrl = URL.from("http://100zone.blogspot.com/2013/03/please-memp3-4shared.html");
Page updatePage = pages.get(updateUrl);
long numLinks = updatePage.getNumOutbound();
Assert.assertTrue(updatePage.addOutbound(newLink("http://example.com", "Example")));
@@ -186,7 +185,7 @@
getMiniFluo().waitForObservers();
// create a URL that has an inlink count of 2
- URL updateUrl2 = DataUrl.from("http://00assclown.newgrounds.com/");
+ URL updateUrl2 = URL.from("http://00assclown.newgrounds.com/");
Page updatePage2 = pages.get(updateUrl2);
long numLinks2 = updatePage2.getNumOutbound();
Assert.assertTrue(updatePage2.addOutbound(newLink("http://example.com", "Example")));
@@ -237,8 +236,7 @@
try (FluoClient client = FluoFactory.newClient(getMiniFluo().getClientConfiguration());
LoaderExecutor le = client.newLoaderExecutor()) {
for (Page page : pages.subList(2, pages.size())) {
- log.info("Loading page {} with {} links {}", page.getUrl(), page.getOutboundLinks().size(),
- page.getOutboundLinks());
+ log.debug("Loading page {} with {} links", page.getUrl(), page.getOutboundLinks().size());
le.execute(PageLoader.updatePage(page));
}
}
diff --git a/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java b/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
index fc72bb6..75ea240 100644
--- a/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
+++ b/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
@@ -33,9 +33,9 @@
import scala.Tuple2;
import webindex.core.models.Link;
import webindex.core.models.Page;
+import webindex.core.models.URL;
import webindex.data.SparkTestUtil;
import webindex.data.fluo.UriMap.UriInfo;
-import webindex.data.util.DataUrl;
public class IndexUtilTest {
@@ -106,14 +106,14 @@
private List<Page> getPagesSet1() {
List<Page> pages = new ArrayList<>();
- Page pageA = new Page(DataUrl.from("http://a.com/1").toPageID());
- pageA.addOutbound(Link.of(DataUrl.from("http://b.com/1"), "b1"));
- pageA.addOutbound(Link.of(DataUrl.from("http://b.com/3"), "b3"));
- pageA.addOutbound(Link.of(DataUrl.from("http://c.com/1"), "c1"));
- Page pageB = new Page(DataUrl.from("http://b.com").toPageID());
- pageB.addOutbound(Link.of(DataUrl.from("http://c.com/1"), "c1"));
- pageB.addOutbound(Link.of(DataUrl.from("http://b.com/2"), "b2"));
- pageB.addOutbound(Link.of(DataUrl.from("http://b.com/3"), "b3"));
+ Page pageA = new Page(URL.from("http://a.com/1").toPageID());
+ pageA.addOutbound(Link.of(URL.from("http://b.com/1"), "b1"));
+ pageA.addOutbound(Link.of(URL.from("http://b.com/3"), "b3"));
+ pageA.addOutbound(Link.of(URL.from("http://c.com/1"), "c1"));
+ Page pageB = new Page(URL.from("http://b.com").toPageID());
+ pageB.addOutbound(Link.of(URL.from("http://c.com/1"), "c1"));
+ pageB.addOutbound(Link.of(URL.from("http://b.com/2"), "b2"));
+ pageB.addOutbound(Link.of(URL.from("http://b.com/3"), "b3"));
pages.add(pageA);
pages.add(pageB);
return pages;
diff --git a/modules/data/src/test/java/webindex/data/util/DataUrlTest.java b/modules/data/src/test/java/webindex/data/util/DataUrlTest.java
deleted file mode 100644
index d312b42..0000000
--- a/modules/data/src/test/java/webindex/data/util/DataUrlTest.java
+++ /dev/null
@@ -1,99 +0,0 @@
-/*
- * Copyright 2016 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.util;
-
-import org.junit.Assert;
-import org.junit.Test;
-import webindex.core.models.URL;
-
-public class DataUrlTest {
-
- public static URL build(String rawUrl) {
- return DataUrl.from(rawUrl);
- }
-
- public static boolean isValid(String rawUrl) {
- return DataUrl.isValid(rawUrl);
- }
-
- @Test
- public void testBasic() throws Exception {
-
- // valid urls
- Assert.assertTrue(isValid(" \thttp://example.com/ \t\n\r\n"));
- Assert.assertTrue(isValid("http://1.2.3.4:80/test?a=b&c=d"));
- Assert.assertTrue(isValid("http://1.2.3.4/"));
- Assert.assertTrue(isValid("http://a.b.c.d.com/1/2/3/4/5"));
- Assert.assertTrue(isValid("http://a.b.com:281/1/2"));
- Assert.assertTrue(isValid("http://A.B.Com:281/a/b"));
- Assert.assertTrue(isValid("http://A.b.Com:281/A/b"));
- Assert.assertTrue(isValid("http://a.B.Com?A/b/C"));
- Assert.assertTrue(isValid("http://A.Be.COM"));
- Assert.assertTrue(isValid("http://1.2.3.4:281/1/2"));
-
- // invalid urls
- Assert.assertFalse(isValid("http://a.com:/test"));
- Assert.assertFalse(isValid("http://z.com:"));
- Assert.assertFalse(isValid("http://1.2.3:80/test?a=b&c=d"));
- Assert.assertFalse(isValid("http://1.2.3/"));
- Assert.assertFalse(isValid("http://com/"));
- Assert.assertFalse(isValid("http://a.b.c.com/bad>et"));
- Assert.assertFalse(isValid("http://test"));
- Assert.assertFalse(isValid("http://co.uk"));
- Assert.assertFalse(isValid("http:///example.com/"));
- Assert.assertFalse(isValid("http:://example.com/"));
- Assert.assertFalse(isValid("example.com"));
- Assert.assertFalse(isValid("127.0.0.1"));
- Assert.assertFalse(isValid("http://ab@example.com"));
- Assert.assertFalse(isValid("ftp://example.com"));
-
- Assert.assertEquals("example.com", build("http://example.com:281/1/2").getHost());
- Assert.assertEquals("a.b.example.com", build("http://a.b.example.com/1/2").getHost());
- Assert.assertEquals("a.b.example.com", build("http://A.B.Example.Com/1/2").getHost());
- Assert.assertEquals("1.2.3.4", build("http://1.2.3.4:89/1/2").getHost());
-
- Assert.assertEquals("/A/b/C", build("http://A.B.Example.Com/A/b/C").getPath());
- Assert.assertEquals("?D/E/f", build("http://A.B.Example.Com?D/E/f").getPath());
-
- URL u = build("http://a.b.c.d.com/1/2/3");
- Assert.assertEquals("a.b.c.d.com", u.getHost());
- Assert.assertEquals("com.d.c.b.a", u.getReverseHost());
- Assert.assertEquals("d.com", u.getDomain());
- Assert.assertEquals("com.d", u.getReverseDomain());
-
- Assert.assertEquals("com.example", build("http://example.com:281/1").getReverseHost());
- Assert.assertEquals("com.example.b.a", build("http://a.b.example.com/1/2").getReverseHost());
- Assert.assertEquals("1.2.3.4", build("http://1.2.3.4:89/1/2").getReverseHost());
-
- Assert.assertTrue(build("http://a.com/a.jpg").isImage());
- Assert.assertTrue(build("http://a.com/a.JPEG").isImage());
- Assert.assertTrue(build("http://a.com/c/b/a.png").isImage());
-
- Assert.assertEquals("c.com", build("http://a.b.c.com").getDomain());
- Assert.assertEquals("com.c", build("http://a.b.c.com").getReverseDomain());
- Assert.assertEquals("c.co.uk", build("http://a.b.c.co.uk").getDomain());
- Assert.assertEquals("uk.co.c", build("http://a.b.c.co.uk").getReverseDomain());
- Assert.assertEquals("d.com.au", build("http://www.d.com.au").getDomain());
- Assert.assertEquals("au.com.d", build("http://www.d.com.au").getReverseDomain());
-
- u = build("https://www.d.com.au:9443/a/bc");
- Assert.assertEquals("au.com.d>.www>s9443>/a/bc", u.toPageID());
- Assert.assertEquals("https://www.d.com.au:9443/a/bc", u.toString());
- URL u2 = URL.fromPageID(u.toPageID());
- Assert.assertEquals("https://www.d.com.au:9443/a/bc", u2.toString());
- Assert.assertEquals("d.com.au", u2.getDomain());
- Assert.assertEquals("www.d.com.au", u2.getHost());
- }
-}
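The deleted `DataUrlTest` above is the only place the dotted-host reversal behavior (`"a.b.c.d.com"` → `"com.d.c.b.a"`, IP addresses left unchanged) was spelled out; that behavior now lives in `webindex.core.models.URL`. As a self-contained sketch of the same idea (an illustrative reimplementation, not the project's actual `URL` code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReverseHostSketch {

    // Reverse dotted host labels: "a.b.c.d.com" -> "com.d.c.b.a".
    // IPv4 addresses are returned unchanged, matching the deleted test's
    // expectation that "1.2.3.4" "reverses" to "1.2.3.4".
    static String reverseHost(String host) {
        if (host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return host; // IPv4 address: no reversal
        }
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    public static void main(String[] args) {
        System.out.println(reverseHost("a.b.c.d.com")); // com.d.c.b.a
        System.out.println(reverseHost("1.2.3.4"));     // 1.2.3.4
    }
}
```

Reversing host labels this way keeps all pages of a domain adjacent when page IDs are sorted lexicographically, which is why the index keys start with the reversed domain.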
diff --git a/modules/data/src/test/resources/log4j.properties b/modules/data/src/test/resources/log4j.properties
index 9a981d1..c80a759 100644
--- a/modules/data/src/test/resources/log4j.properties
+++ b/modules/data/src/test/resources/log4j.properties
@@ -26,6 +26,7 @@
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.zookeeper=WARN
+log4j.logger.org.apache.zookeeper.ClientCnxn=ERROR
log4j.logger.org.spark-project=WARN
log4j.logger.webindex=WARN
log4j.logger.Remoting=WARN
diff --git a/modules/integration/pom.xml b/modules/integration/pom.xml
new file mode 100644
index 0000000..890489f
--- /dev/null
+++ b/modules/integration/pom.xml
@@ -0,0 +1,138 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Copyright 2015 Webindex authors (see AUTHORS)
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+ <parent>
+ <groupId>io.github.astralway</groupId>
+ <artifactId>webindex-parent</artifactId>
+ <version>0.0.1-SNAPSHOT</version>
+ <relativePath>../../pom.xml</relativePath>
+ </parent>
+ <artifactId>webindex-integration</artifactId>
+ <name>WebIndex Integration</name>
+ <dependencies>
+ <dependency>
+ <groupId>com.google.code.gson</groupId>
+ <artifactId>gson</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>com.sparkjava</groupId>
+ <artifactId>spark-core</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>io.github.astralway</groupId>
+ <artifactId>webindex-core</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>io.github.astralway</groupId>
+ <artifactId>webindex-data</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>asm</groupId>
+ <artifactId>asm</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
+ <dependency>
+ <groupId>io.github.astralway</groupId>
+ <artifactId>webindex-ui</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.accumulo</groupId>
+ <artifactId>accumulo-minicluster</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>org.eclipse.jetty</groupId>
+ <artifactId>*</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-api</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-core</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-mini</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-test</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.netpreserve.commons</groupId>
+ <artifactId>webarchive-commons</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>ch.qos.logback</groupId>
+ <artifactId>*</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-log4j12</artifactId>
+ </dependency>
+ <!-- Test dependencies -->
+ <dependency>
+ <groupId>junit</groupId>
+ <artifactId>junit</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.jsoup</groupId>
+ <artifactId>jsoup</artifactId>
+ <scope>test</scope>
+ </dependency>
+ </dependencies>
+ <profiles>
+ <profile>
+ <id>webindex-dev-server</id>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.codehaus.mojo</groupId>
+ <artifactId>exec-maven-plugin</artifactId>
+ <executions>
+ <execution>
+ <goals>
+ <goal>java</goal>
+ </goals>
+ <phase>compile</phase>
+ <configuration>
+ <mainClass>webindex.integration.DevServer</mainClass>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ </profile>
+ </profiles>
+</project>
diff --git a/modules/integration/src/main/java/webindex/integration/DevServer.java b/modules/integration/src/main/java/webindex/integration/DevServer.java
new file mode 100644
index 0000000..4599e6f
--- /dev/null
+++ b/modules/integration/src/main/java/webindex/integration/DevServer.java
@@ -0,0 +1,159 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.integration;
+
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+import com.google.gson.Gson;
+import org.apache.accumulo.minicluster.MiniAccumuloCluster;
+import org.apache.accumulo.minicluster.MiniAccumuloConfig;
+import org.apache.fluo.api.client.FluoAdmin;
+import org.apache.fluo.api.client.FluoClient;
+import org.apache.fluo.api.client.FluoFactory;
+import org.apache.fluo.api.client.LoaderExecutor;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.api.mini.MiniFluo;
+import org.apache.fluo.recipes.test.AccumuloExportITBase;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import webindex.core.IndexClient;
+import webindex.core.models.Page;
+import webindex.data.fluo.PageLoader;
+import webindex.data.spark.IndexEnv;
+import webindex.ui.WebServer;
+
+public class DevServer {
+
+ private static final Logger log = LoggerFactory.getLogger(DevServer.class);
+ private static final int TEST_SPLITS = 119;
+
+ private Path dataPath;
+ private int webPort;
+ private Path templatePath;
+ private MiniAccumuloCluster cluster;
+ private WebServer webServer;
+ private IndexClient client;
+ private AtomicBoolean started = new AtomicBoolean(false);
+ private Path baseDir;
+
+ public DevServer(Path dataPath, int webPort, Path templatePath, Path baseDir) {
+ this.dataPath = dataPath;
+ this.webPort = webPort;
+ this.templatePath = templatePath;
+ this.baseDir = baseDir;
+ this.webServer = new WebServer();
+ }
+
+ public IndexClient getIndexClient() {
+ if (!started.get()) {
+ throw new IllegalStateException("DevServer must be started before retrieving index client");
+ }
+ return client;
+ }
+
+ public void start() throws Exception {
+ log.info("Starting WebIndex development server...");
+
+ log.info("Starting MiniAccumuloCluster at {}", baseDir);
+
+ MiniAccumuloConfig cfg = new MiniAccumuloConfig(baseDir.toFile(), "secret");
+ cluster = new MiniAccumuloCluster(cfg);
+ cluster.start();
+
+ FluoConfiguration config = new FluoConfiguration();
+ AccumuloExportITBase.configureFromMAC(config, cluster);
+ config.setApplicationName("webindex-dev");
+ config.setAccumuloTable("webindex");
+
+ String exportTable = "webindex_search";
+
+ log.info("Initializing Accumulo & Fluo");
+ IndexEnv env = new IndexEnv(config, exportTable, "/tmp", TEST_SPLITS, TEST_SPLITS);
+ env.initAccumuloIndexTable();
+ env.configureApplication(config);
+
+ FluoFactory.newAdmin(config).initialize(
+ new FluoAdmin.InitializationOptions().setClearTable(true).setClearZookeeper(true));
+
+ env.setFluoTableSplits();
+
+ log.info("Starting web server");
+ client = new IndexClient(exportTable, cluster.getConnector("root", "secret"));
+ webServer.start(client, webPort, templatePath);
+
+ log.info("Loading data from {}", dataPath);
+ Gson gson = new Gson();
+ try (MiniFluo miniFluo = FluoFactory.newMiniFluo(config);
+ FluoClient client = FluoFactory.newClient(miniFluo.getClientConfiguration())) {
+
+ try (LoaderExecutor le = client.newLoaderExecutor()) {
+
+ Files
+ .lines(dataPath)
+ .map(json -> Page.fromJson(gson, json))
+ .forEach(
+ page -> {
+ log.debug("Loading page {} with {} links", page.getUrl(), page.getOutboundLinks()
+ .size());
+ le.execute(PageLoader.updatePage(page));
+ });
+ }
+
+ log.info("Finished loading data. Waiting for observers to finish...");
+ miniFluo.waitForObservers();
+ log.info("Observers finished");
+ }
+
+ started.set(true);
+ }
+
+ public void stop() {
+ webServer.stop();
+ try {
+ cluster.stop();
+ } catch (Exception e) {
+ throw new IllegalStateException(e);
+ }
+ }
+
+ public static void main(String[] args) throws Exception {
+ String dataLocation = "data/1K-pages.txt";
+ String templateLocation = "modules/ui/src/main/resources/spark/template/freemarker";
+ if (args.length == 2) {
+ dataLocation = args[0];
+ templateLocation = args[1];
+ }
+ log.info("Looking for data at {}", dataLocation);
+
+ Path dataPath = Paths.get(dataLocation);
+ if (Files.notExists(dataPath)) {
+ log.info("Generating sample data at {} for dev server", dataPath);
+ SampleData.generate(dataPath, 1000);
+ }
+
+ Path templatePath = Paths.get(templateLocation);
+ if (Files.notExists(templatePath)) {
+ log.info("Template location {} does not exist", templateLocation);
+ throw new IllegalArgumentException("Template location does not exist");
+ }
+
+ DevServer devServer =
+ new DevServer(dataPath, 4567, templatePath, Files.createTempDirectory("webindex-dev-"));
+ devServer.start();
+ }
+}
diff --git a/modules/integration/src/main/java/webindex/integration/SampleData.java b/modules/integration/src/main/java/webindex/integration/SampleData.java
new file mode 100644
index 0000000..03972ff
--- /dev/null
+++ b/modules/integration/src/main/java/webindex/integration/SampleData.java
@@ -0,0 +1,63 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.integration;
+
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.List;
+
+import com.google.gson.Gson;
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveRecord;
+import org.archive.io.warc.WARCReaderFactory;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import webindex.core.models.Page;
+import webindex.data.util.ArchiveUtil;
+
+public class SampleData {
+
+ private static final Logger log = LoggerFactory.getLogger(SampleData.class);
+
+ private static final String sourceURL = "https://commoncrawl.s3.amazonaws.com/crawl-data/"
+ + "CC-MAIN-2015-32/segments/1438042981460.12/wat/"
+ + "CC-MAIN-20150728002301-00043-ip-10-236-191-2.ec2.internal.warc.wat.gz";
+
+ public static void generate(Path path, int numPages) throws Exception {
+
+ Gson gson = new Gson();
+ long count = 0;
+ List<String> pages = new ArrayList<>();
+ ArchiveReader ar = WARCReaderFactory.get(new URL(sourceURL), 0);
+ for (ArchiveRecord r : ar) {
+ Page p = ArchiveUtil.buildPage(r);
+ if (p.isEmpty() || p.getOutboundLinks().isEmpty()) {
+ log.debug("Skipping {}", p.getUrl());
+ continue;
+ }
+ log.debug("Found {} {}", p.getUrl(), p.getNumOutbound());
+ String json = gson.toJson(p);
+ pages.add(json);
+ count++;
+ if (count == numPages) {
+ break;
+ }
+ }
+ Files.write(path, pages);
+ log.info("Wrote {} pages to {}", count, path);
+ }
+}
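`SampleData` serializes each `Page` to one JSON string and writes the whole batch in a single `Files.write(path, pages)` call, which `DevServer` later reads back line by line with `Files.lines`. A minimal stdlib-only sketch of that line-per-record round trip (record contents and names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class JsonLinesSketch {

    // Write one JSON record per line, then read the lines back.
    // Returns the number of records recovered.
    static int roundTrip(List<String> records) throws IOException {
        Path path = Files.createTempFile("pages-", ".txt");
        try {
            Files.write(path, records);      // each list element becomes one line
            return Files.readAllLines(path).size();
        } finally {
            Files.delete(path);
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> records = List.of(
            "{\"url\":\"http://a.com/1\"}",
            "{\"url\":\"http://b.com/2\"}");
        System.out.println(roundTrip(records)); // 2
    }
}
```

The one-record-per-line layout is what lets `DevServer` stream-load pages with `Files.lines(dataPath).map(json -> Page.fromJson(gson, json))` without parsing a containing JSON array.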
diff --git a/modules/integration/src/test/java/webindex/integration/DevServerIT.java b/modules/integration/src/test/java/webindex/integration/DevServerIT.java
new file mode 100644
index 0000000..81fc767
--- /dev/null
+++ b/modules/integration/src/test/java/webindex/integration/DevServerIT.java
@@ -0,0 +1,68 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.integration;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import org.apache.commons.io.FileUtils;
+import org.jsoup.Jsoup;
+import org.jsoup.nodes.Document;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import webindex.core.IndexClient;
+import webindex.core.models.Pages;
+
+public class DevServerIT {
+
+ private static final Logger log = LoggerFactory.getLogger(DevServerIT.class);
+
+ static DevServer devServer;
+ static Path tempPath;
+
+ @BeforeClass
+ public static void init() throws Exception {
+ tempPath = Files.createTempDirectory(Paths.get("target/"), "webindex-dev-");
+ devServer = new DevServer(Paths.get("src/test/resources/5-pages.txt"), 24567, null, tempPath);
+ devServer.start();
+ }
+
+ @Test
+ public void basic() throws Exception {
+ Document doc = Jsoup.connect("http://localhost:24567/").get();
+ Assert.assertTrue(doc.text().contains("Enter a domain to view known webpages in that domain"));
+
+ IndexClient client = devServer.getIndexClient();
+ Pages pages = client.getPages("stackoverflow.com", "", 0);
+ Assert.assertEquals(4, pages.getTotal().intValue());
+
+ Pages.PageScore pageScore = pages.getPages().get(0);
+ Assert.assertEquals("http://blog.stackoverflow.com/2009/06/attribution-required/",
+ pageScore.getUrl());
+ Assert.assertEquals(4, pageScore.getScore().intValue());
+ }
+
+ @AfterClass
+ public static void destroy() throws IOException {
+ devServer.stop();
+ FileUtils.deleteDirectory(tempPath.toFile());
+ }
+}
diff --git a/modules/integration/src/test/resources/5-pages.txt b/modules/integration/src/test/resources/5-pages.txt
new file mode 100644
index 0000000..da94d8a
--- /dev/null
+++ b/modules/integration/src/test/resources/5-pages.txt
@@ -0,0 +1,5 @@
+{"url":"http://app.cheezburger.com/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","pageID":"com.cheezburger\u003e.app\u003eo\u003e/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","numOutbound":19,"crawlDate":"2015-07-28T03:06:17Z","title":"Rokas08\u0026#39;s Profile - Trophy Details - Cheezburger.com","outboundLinks":[{"url":"https://www.facebook.com/Cheezburger","pageID":"com.facebook\u003e.www\u003es\u003e/Cheezburger","anchorText":"Facebook"},{"url":"https://plus.google.com/105247221600709734681","pageID":"com.google\u003e.plus\u003es\u003e/105247221600709734681","anchorText":"Google+"},{"url":"http://knowyourmeme.com/forums","pageID":"com.knowyourmeme\u003e\u003eo\u003e/forums","anchorText":"Forums"},{"url":"http://knowyourmeme.com/memes/popular","pageID":"com.knowyourmeme\u003e\u003eo\u003e/memes/popular","anchorText":"Popular Memes"},{"url":"http://knowyourmeme.com/photos/most-viewed","pageID":"com.knowyourmeme\u003e\u003eo\u003e/photos/most-viewed","anchorText":"All Images"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","anchorText":"New Events"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","anchorText":"New People"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","anchorText":"New Sites"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","anchorText":"New 
Subcultures"},{"url":"http://knowyourmeme.com/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","anchorText":"All Memes"},{"url":"http://knowyourmeme.com/videos/most-viewed","pageID":"com.knowyourmeme\u003e\u003eo\u003e/videos/most-viewed","anchorText":"All Videos"},{"url":"http://knowyourmeme.com?ref\u003dnavbar","pageID":"com.knowyourmeme\u003e\u003eo\u003e?ref\u003dnavbar","anchorText":"KYM Wiki"},{"url":"https://twitter.com/Cheezburger","pageID":"com.twitter\u003e\u003es\u003e/Cheezburger","anchorText":"Follow"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dnavbar","anchorText":"Videos Find all our FAIL videos here!"},{"url":"http://chzb.gr/1riGhru?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGhru?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dnavbar","anchorText":"Videos See all our Geek videos here!"},{"url":"http://chzb.gr/1riGzi6?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGzi6?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dnavbar","anchorText":"Videos Watch and learn from all of our trolling videos here!"}]}
+{"url":"http://apple.stackexchange.com/help/badges/9/autobiographer?userid\u003d796","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/help/badges/9/autobiographer?userid\u003d796","numOutbound":4,"crawlDate":"2015-07-28T01:32:26Z","server":"cloudflare-nginx","title":"Autobiographer - Badge - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","numOutbound":6,"crawlDate":"2015-07-28T01:58:50Z","server":"cloudflare-nginx","title":"Spotlight sometimes can\u0026#39;t find a file. (that actually exists) - Ask Different","outboundLinks":[{"url":"http://askubuntu.com/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","pageID":"com.askubuntu\u003e\u003eo\u003e/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","anchorText":"Using sed, how could we cut a specific string from a line of text?"},{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","pageID":"com.stackoverflow\u003e\u003eo\u003e/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","anchorText":"Is it ever justified to have an object which has itself as a field?"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/users/208/john-allers","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/users/208/john-allers","numOutbound":8,"crawlDate":"2015-07-28T01:40:51Z","server":"cloudflare-nginx","title":"User John Allers - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/2870/","pageID":"com.serverfault\u003e\u003eo\u003e/users/2870/","anchorText":"Server Fault 111 111 3"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/73986/","pageID":"com.stackoverflow\u003e\u003eo\u003e/users/73986/","anchorText":"Stack Overflow 2.2k 2.2k 11828"},{"url":"http://superuser.com/users/3552/","pageID":"com.superuser\u003e\u003eo\u003e/users/3552/","anchorText":"Super User 231 231 26"},{"url":"http://www.zooplet.com/","pageID":"com.zooplet\u003e.www\u003eo\u003e/","anchorText":"zooplet.com"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/users/3126/mjb?tab\u003dsummary","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/users/3126/mjb?tab\u003dsummary","numOutbound":7,"crawlDate":"2015-07-28T01:53:49Z","server":"cloudflare-nginx","title":"User mjb - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/117061/","pageID":"com.serverfault\u003e\u003eo\u003e/users/117061/","anchorText":"Server Fault"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/581665/","pageID":"com.stackoverflow\u003e\u003eo\u003e/users/581665/","anchorText":"Stack Overflow"},{"url":"http://superuser.com/users/63808/","pageID":"com.superuser\u003e\u003eo\u003e/users/63808/","anchorText":"Super User"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
diff --git a/modules/integration/src/test/resources/log4j.properties b/modules/integration/src/test/resources/log4j.properties
new file mode 100644
index 0000000..c18c21a
--- /dev/null
+++ b/modules/integration/src/test/resources/log4j.properties
@@ -0,0 +1,31 @@
+# Copyright 2014 Webindex authors (see AUTHORS)
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+log4j.rootLogger=INFO, CA
+log4j.appender.CA=org.apache.log4j.ConsoleAppender
+log4j.appender.CA.layout=org.apache.log4j.PatternLayout
+log4j.appender.CA.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n
+
+log4j.logger.org.apache.accumulo=WARN
+log4j.logger.org.apache.curator=ERROR
+log4j.logger.org.apache.fluo=WARN
+log4j.logger.org.apache.hadoop=WARN
+log4j.logger.org.apache.hadoop.mapreduce=ERROR
+log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
+log4j.logger.org.apache.spark=WARN
+log4j.logger.org.apache.zookeeper=ERROR
+log4j.logger.org.eclipse.jetty=WARN
+log4j.logger.org.spark-project=WARN
+log4j.logger.webindex=WARN
+log4j.logger.spark=WARN
diff --git a/modules/ui/pom.xml b/modules/ui/pom.xml
index 167871c..bf9e1e3 100644
--- a/modules/ui/pom.xml
+++ b/modules/ui/pom.xml
@@ -26,126 +26,60 @@
<name>WebIndex UI</name>
<dependencies>
<dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-assets</artifactId>
+ <groupId>com.sparkjava</groupId>
+ <artifactId>spark-core</artifactId>
</dependency>
<dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-core</artifactId>
- </dependency>
- <dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-views-freemarker</artifactId>
+ <groupId>com.sparkjava</groupId>
+ <artifactId>spark-template-freemarker</artifactId>
</dependency>
<dependency>
<groupId>io.github.astralway</groupId>
<artifactId>webindex-core</artifactId>
- <exclusions>
- <exclusion>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- </exclusion>
- </exclusions>
</dependency>
<dependency>
<groupId>org.apache.accumulo</groupId>
<artifactId>accumulo-core</artifactId>
- <exclusions>
- <exclusion>
- <groupId>log4j</groupId>
- <artifactId>log4j</artifactId>
- </exclusion>
- <exclusion>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- </exclusion>
- <exclusion>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-log4j12</artifactId>
- </exclusion>
- <exclusion>
- <groupId>com.sun.jersey</groupId>
- <artifactId>*</artifactId>
- </exclusion>
- </exclusions>
</dependency>
<dependency>
<groupId>org.apache.fluo</groupId>
<artifactId>fluo-api</artifactId>
- <exclusions>
- <exclusion>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- </exclusion>
- </exclusions>
</dependency>
<dependency>
<groupId>org.apache.fluo</groupId>
<artifactId>fluo-core</artifactId>
- <exclusions>
- <exclusion>
- <groupId>log4j</groupId>
- <artifactId>log4j</artifactId>
- </exclusion>
- <exclusion>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-log4j12</artifactId>
- </exclusion>
- <exclusion>
- <groupId>javax.xml.bind</groupId>
- <artifactId>jaxb-api</artifactId>
- </exclusion>
- <exclusion>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- </exclusion>
- <exclusion>
- <groupId>com.sun.jersey</groupId>
- <artifactId>*</artifactId>
- </exclusion>
- </exclusions>
</dependency>
<dependency>
- <groupId>junit</groupId>
- <artifactId>junit</artifactId>
- <scope>test</scope>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-log4j12</artifactId>
</dependency>
</dependencies>
- <build>
- <plugins>
- <plugin>
- <groupId>org.apache.maven.plugins</groupId>
- <artifactId>maven-shade-plugin</artifactId>
- <configuration>
- <createDependencyReducedPom>true</createDependencyReducedPom>
- <filters>
- <filter>
- <artifact>*:*</artifact>
- <excludes>
- <exclude>META-INF/*.SF</exclude>
- <exclude>META-INF/*.DSA</exclude>
- <exclude>META-INF/*.RSA</exclude>
- </excludes>
- </filter>
- </filters>
- </configuration>
- <executions>
- <execution>
- <goals>
- <goal>shade</goal>
- </goals>
- <phase>package</phase>
- <configuration>
- <transformers>
- <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
- <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
- <mainClass>webindex.ui.WebIndexApp</mainClass>
- </transformer>
- </transformers>
- </configuration>
- </execution>
- </executions>
- </plugin>
- </plugins>
- </build>
+ <profiles>
+ <profile>
+ <id>webindex-web-server</id>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.codehaus.mojo</groupId>
+ <artifactId>exec-maven-plugin</artifactId>
+ <executions>
+ <execution>
+ <goals>
+ <goal>java</goal>
+ </goals>
+ <phase>compile</phase>
+ <configuration>
+ <mainClass>webindex.ui.WebServer</mainClass>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ </profile>
+ </profiles>
</project>
diff --git a/modules/ui/src/main/java/webindex/ui/FluoHealthCheck.java b/modules/ui/src/main/java/webindex/ui/FluoHealthCheck.java
deleted file mode 100644
index 2b63a44..0000000
--- a/modules/ui/src/main/java/webindex/ui/FluoHealthCheck.java
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui;
-
-import com.codahale.metrics.health.HealthCheck;
-
-public class FluoHealthCheck extends HealthCheck {
-
- @Override
- protected Result check() throws Exception {
- return Result.healthy();
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/WebIndexApp.java b/modules/ui/src/main/java/webindex/ui/WebIndexApp.java
deleted file mode 100644
index 85d32e4..0000000
--- a/modules/ui/src/main/java/webindex/ui/WebIndexApp.java
+++ /dev/null
@@ -1,58 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui;
-
-import java.io.File;
-
-import io.dropwizard.Application;
-import io.dropwizard.assets.AssetsBundle;
-import io.dropwizard.setup.Bootstrap;
-import io.dropwizard.setup.Environment;
-import io.dropwizard.views.ViewBundle;
-import org.apache.accumulo.core.client.Connector;
-import org.apache.fluo.api.config.FluoConfiguration;
-import org.apache.fluo.core.util.AccumuloUtil;
-import webindex.core.DataConfig;
-
-public class WebIndexApp extends Application<WebIndexConfig> {
-
- public static void main(String[] args) throws Exception {
- new WebIndexApp().run(args);
- }
-
- @Override
- public String getName() {
- return "webindex-app";
- }
-
- @Override
- public void initialize(Bootstrap<WebIndexConfig> bootstrap) {
- bootstrap.addBundle(new ViewBundle<>());
- bootstrap.addBundle(new AssetsBundle());
- }
-
- @Override
- public void run(WebIndexConfig config, Environment environment) {
-
- DataConfig dataConfig = WebIndexConfig.getDataConfig();
- File fluoConfigFile = new File(dataConfig.getFluoPropsPath());
- FluoConfiguration fluoConfig = new FluoConfiguration(fluoConfigFile);
-
- Connector conn = AccumuloUtil.getConnector(fluoConfig);
- final WebIndexResources resource = new WebIndexResources(conn, dataConfig);
- environment.healthChecks().register("fluo", new FluoHealthCheck());
- environment.jersey().register(resource);
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/WebIndexConfig.java b/modules/ui/src/main/java/webindex/ui/WebIndexConfig.java
deleted file mode 100644
index 1328530..0000000
--- a/modules/ui/src/main/java/webindex/ui/WebIndexConfig.java
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui;
-
-import io.dropwizard.Configuration;
-import webindex.core.DataConfig;
-
-public class WebIndexConfig extends Configuration {
-
- public static DataConfig getDataConfig() {
- return DataConfig.load();
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/WebIndexResources.java b/modules/ui/src/main/java/webindex/ui/WebIndexResources.java
deleted file mode 100644
index 879d883..0000000
--- a/modules/ui/src/main/java/webindex/ui/WebIndexResources.java
+++ /dev/null
@@ -1,297 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui;
-
-import java.util.Iterator;
-import java.util.Map;
-
-import javax.validation.constraints.NotNull;
-import javax.ws.rs.DefaultValue;
-import javax.ws.rs.GET;
-import javax.ws.rs.Path;
-import javax.ws.rs.Produces;
-import javax.ws.rs.QueryParam;
-import javax.ws.rs.core.MediaType;
-
-import com.google.gson.Gson;
-import org.apache.accumulo.core.client.Connector;
-import org.apache.accumulo.core.client.Scanner;
-import org.apache.accumulo.core.client.TableNotFoundException;
-import org.apache.accumulo.core.data.Key;
-import org.apache.accumulo.core.data.Range;
-import org.apache.accumulo.core.data.Value;
-import org.apache.accumulo.core.security.Authorizations;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-import webindex.core.Constants;
-import webindex.core.DataConfig;
-import webindex.core.models.DomainStats;
-import webindex.core.models.Link;
-import webindex.core.models.Links;
-import webindex.core.models.Page;
-import webindex.core.models.Pages;
-import webindex.core.models.TopResults;
-import webindex.core.models.URL;
-import webindex.ui.util.Pager;
-import webindex.ui.util.WebUrl;
-import webindex.ui.views.HomeView;
-import webindex.ui.views.LinksView;
-import webindex.ui.views.PageView;
-import webindex.ui.views.PagesView;
-import webindex.ui.views.TopView;
-
-@Path("/")
-public class WebIndexResources {
-
- private static final Logger log = LoggerFactory.getLogger(WebIndexResources.class);
- private static final int PAGE_SIZE = 25;
-
- private DataConfig dataConfig;
- private Connector conn;
- private Gson gson = new Gson();
-
- public WebIndexResources(Connector conn, DataConfig dataConfig) {
- this.conn = conn;
- this.dataConfig = dataConfig;
- }
-
- private static Long getLongValue(Map.Entry<Key, Value> entry) {
- return Long.parseLong(entry.getValue().toString());
- }
-
- @GET
- @Produces(MediaType.TEXT_HTML)
- public HomeView getHome() {
- return new HomeView();
- }
-
- @GET
- @Path("pages")
- @Produces({MediaType.TEXT_HTML, MediaType.APPLICATION_JSON})
- public PagesView getPages(@NotNull @QueryParam("domain") String domain,
- @DefaultValue("") @QueryParam("next") String next,
- @DefaultValue("0") @QueryParam("pageNum") Integer pageNum) {
- DomainStats stats = getDomainStats(domain);
- Pages pages = new Pages(domain, pageNum);
- log.info("Setting total to {}", stats.getTotal());
- pages.setTotal(stats.getTotal());
- String row = "d:" + URL.reverseHost(domain);
- String cf = Constants.RANK;
- try {
- Scanner scanner = conn.createScanner(dataConfig.accumuloIndexTable, Authorizations.EMPTY);
- Pager pager = new Pager(scanner, Range.prefix(row + ":"), PAGE_SIZE) {
- @Override
- public void foundPageEntry(Map.Entry<Key, Value> entry) {
-
- String url =
- URL.fromPageID(entry.getKey().getRowData().toString().split(":", 4)[3]).toString();
- Long count = Long.parseLong(entry.getValue().toString());
- pages.addPage(url, count);
- }
-
- @Override
- public void foundNextEntry(Map.Entry<Key, Value> entry) {
- pages.setNext(entry.getKey().getRowData().toString().split(":", 3)[2]);
- }
- };
- if (next.isEmpty()) {
- pager.getPage(pageNum);
- } else {
- pager.getPage(new Key(row + ":" + next, cf, ""));
-
- }
- } catch (TableNotFoundException e) {
- log.error("Table {} not found", dataConfig.accumuloIndexTable);
- }
- return new PagesView(pages);
- }
-
- @GET
- @Path("page")
- @Produces({MediaType.TEXT_HTML, MediaType.APPLICATION_JSON})
- public PageView getPageView(@NotNull @QueryParam("url") String url) {
- return new PageView(getPage(url));
- }
-
- private Page getPage(String rawUrl) {
- Page page = null;
- Long incount = (long) 0;
- URL url;
- try {
- url = WebUrl.from(rawUrl);
- } catch (Exception e) {
- log.error("Failed to parse URL {}", rawUrl);
- return null;
- }
-
- try {
- Scanner scanner = conn.createScanner(dataConfig.accumuloIndexTable, Authorizations.EMPTY);
- scanner.setRange(Range.exact("p:" + url.toPageID(), Constants.PAGE));
- Iterator<Map.Entry<Key, Value>> iterator = scanner.iterator();
- while (iterator.hasNext()) {
- Map.Entry<Key, Value> entry = iterator.next();
- switch (entry.getKey().getColumnQualifier().toString()) {
- case Constants.INCOUNT:
- incount = getLongValue(entry);
- break;
- case Constants.CUR:
- page = gson.fromJson(entry.getValue().toString(), Page.class);
- break;
- default:
- log.error("Unknown page stat {}", entry.getKey().getColumnQualifier());
- }
- }
- } catch (TableNotFoundException e) {
- e.printStackTrace();
- }
-
- if (page == null) {
- page = new Page(url.toPageID());
- }
- page.setNumInbound(incount);
- return page;
- }
-
- private DomainStats getDomainStats(String domain) {
- DomainStats stats = new DomainStats(domain);
- Scanner scanner;
- try {
- scanner = conn.createScanner(dataConfig.accumuloIndexTable, Authorizations.EMPTY);
- scanner.setRange(Range.exact("d:" + URL.reverseHost(domain), Constants.DOMAIN));
- Iterator<Map.Entry<Key, Value>> iterator = scanner.iterator();
- while (iterator.hasNext()) {
- Map.Entry<Key, Value> entry = iterator.next();
- switch (entry.getKey().getColumnQualifier().toString()) {
- case Constants.PAGECOUNT:
- stats.setTotal(getLongValue(entry));
- break;
- default:
- log.error("Unknown page domain {}", entry.getKey().getColumnQualifier());
- }
- }
- } catch (TableNotFoundException e) {
- e.printStackTrace();
- }
- return stats;
- }
-
- @GET
- @Path("links")
- @Produces({MediaType.TEXT_HTML, MediaType.APPLICATION_JSON})
- public LinksView getLinks(@NotNull @QueryParam("url") String rawUrl,
- @NotNull @QueryParam("linkType") String linkType,
- @DefaultValue("") @QueryParam("next") String next,
- @DefaultValue("0") @QueryParam("pageNum") Integer pageNum) {
-
- Links links = new Links(rawUrl, linkType, pageNum);
-
- URL url;
- try {
- url = WebUrl.from(rawUrl);
- } catch (Exception e) {
- log.error("Failed to parse URL: " + rawUrl);
- return new LinksView(links);
- }
-
- try {
- Scanner scanner = conn.createScanner(dataConfig.accumuloIndexTable, Authorizations.EMPTY);
- String row = "p:" + url.toPageID();
- if (linkType.equals("in")) {
- Page page = getPage(rawUrl);
- String cf = Constants.INLINKS;
- links.setTotal(page.getNumInbound());
- Pager pager = new Pager(scanner, Range.exact(row, cf), PAGE_SIZE) {
-
- @Override
- public void foundPageEntry(Map.Entry<Key, Value> entry) {
- String pageID = entry.getKey().getColumnQualifier().toString();
- String anchorText = entry.getValue().toString();
- links.addLink(Link.of(pageID, anchorText));
- }
-
- @Override
- public void foundNextEntry(Map.Entry<Key, Value> entry) {
- links.setNext(entry.getKey().getColumnQualifier().toString());
- }
- };
- if (next.isEmpty()) {
- pager.getPage(pageNum);
- } else {
- pager.getPage(new Key(row, cf, next));
- }
- } else {
- scanner.setRange(Range.exact(row, Constants.PAGE, Constants.CUR));
- Iterator<Map.Entry<Key, Value>> iter = scanner.iterator();
- if (iter.hasNext()) {
- Page curPage = gson.fromJson(iter.next().getValue().toString(), Page.class);
- links.setTotal(curPage.getNumOutbound());
- int skip = 0;
- int add = 0;
- for (Link l : curPage.getOutboundLinks()) {
- if (skip < (pageNum * PAGE_SIZE)) {
- skip++;
- } else if (add < PAGE_SIZE) {
- links.addLink(l);
- add++;
- } else {
- links.setNext(l.getPageID());
- break;
- }
- }
- }
- }
- } catch (TableNotFoundException e) {
- log.error("Table {} not found", dataConfig.accumuloIndexTable);
- }
- return new LinksView(links);
- }
-
- @GET
- @Path("top")
- @Produces({MediaType.TEXT_HTML, MediaType.APPLICATION_JSON})
- public TopView getTop(@DefaultValue("") @QueryParam("next") String next,
- @DefaultValue("0") @QueryParam("pageNum") Integer pageNum) {
-
- TopResults results = new TopResults();
-
- results.setPageNum(pageNum);
- try {
- Scanner scanner = conn.createScanner(dataConfig.accumuloIndexTable, Authorizations.EMPTY);
- Pager pager = new Pager(scanner, Range.prefix("t:"), PAGE_SIZE) {
-
- @Override
- public void foundPageEntry(Map.Entry<Key, Value> entry) {
- String row = entry.getKey().getRow().toString();
- String url = URL.fromPageID(row.split(":", 3)[2]).toString();
- Long num = Long.parseLong(entry.getValue().toString());
- results.addResult(url, num);
- }
-
- @Override
- public void foundNextEntry(Map.Entry<Key, Value> entry) {
- results.setNext(entry.getKey().getRow().toString());
- }
- };
- if (next.isEmpty()) {
- pager.getPage(pageNum);
- } else {
- pager.getPage(new Key(next));
- }
- } catch (TableNotFoundException e) {
- log.error("Table {} not found", dataConfig.accumuloIndexTable);
- }
- return new TopView(results);
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/WebServer.java b/modules/ui/src/main/java/webindex/ui/WebServer.java
new file mode 100644
index 0000000..3737514
--- /dev/null
+++ b/modules/ui/src/main/java/webindex/ui/WebServer.java
@@ -0,0 +1,131 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.ui;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Collections;
+import java.util.Optional;
+
+import freemarker.template.Configuration;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.fluo.api.config.FluoConfiguration;
+import org.apache.fluo.core.util.AccumuloUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import spark.ModelAndView;
+import spark.Spark;
+import spark.template.freemarker.FreeMarkerEngine;
+import webindex.core.IndexClient;
+import webindex.core.WebIndexConfig;
+import webindex.core.models.Links;
+import webindex.core.models.Page;
+import webindex.core.models.Pages;
+import webindex.core.models.TopResults;
+
+import static spark.Spark.get;
+import static spark.Spark.staticFiles;
+
+public class WebServer {
+
+ private static final Logger log = LoggerFactory.getLogger(WebServer.class);
+
+ private static final ModelAndView VIEW_404 = new ModelAndView(null, "404.ftl");
+
+ public WebServer() {}
+
+ public void start(IndexClient client, int port, Path templatePath) {
+
+ Spark.port(port);
+
+ staticFiles.location("/assets");
+
+ FreeMarkerEngine freeMarkerEngine = new FreeMarkerEngine();
+
+ if (templatePath != null && Files.exists(templatePath)) {
+ log.info("Serving freemarker templates from {}", templatePath.toAbsolutePath());
+ Configuration freeMarkerConfig = new Configuration();
+ try {
+ freeMarkerConfig.setDirectoryForTemplateLoading(templatePath.toFile());
+ } catch (IOException e) {
+ throw new IllegalStateException(e);
+ }
+ freeMarkerEngine.setConfiguration(freeMarkerConfig);
+ }
+
+ get("/", (req, res) -> new ModelAndView(null, "home.ftl"), freeMarkerEngine);
+
+ get("/top",
+ (req, res) -> {
+ String next = Optional.ofNullable(req.queryParams("next")).orElse("");
+ Integer pageNum =
+ Integer.parseInt(Optional.ofNullable(req.queryParams("pageNum")).orElse("0"));
+ TopResults results = client.getTopResults(next, pageNum);
+ return new ModelAndView(Collections.singletonMap("top", results), "top.ftl");
+ }, freeMarkerEngine);
+
+ get("/page", (req, res) -> {
+ if (req.queryParams("url") == null) {
+ return VIEW_404;
+ }
+ String url = req.queryParams("url");
+ Page page = client.getPage(url);
+ return new ModelAndView(Collections.singletonMap("page", page), "page.ftl");
+ }, freeMarkerEngine);
+
+ get("/pages",
+ (req, res) -> {
+ if (req.queryParams("domain") == null) {
+ return VIEW_404;
+ }
+ String domain = req.queryParams("domain");
+ String next = Optional.ofNullable(req.queryParams("next")).orElse("");
+ Integer pageNum =
+ Integer.parseInt(Optional.ofNullable(req.queryParams("pageNum")).orElse("0"));
+ Pages pages = client.getPages(domain, next, pageNum);
+ return new ModelAndView(Collections.singletonMap("pages", pages), "pages.ftl");
+ }, freeMarkerEngine);
+
+ get("/links",
+ (req, res) -> {
+ if (req.queryParams("url") == null || req.queryParams("linkType") == null) {
+ return VIEW_404;
+ }
+ String rawUrl = req.queryParams("url");
+ String linkType = req.queryParams("linkType");
+ String next = Optional.ofNullable(req.queryParams("next")).orElse("");
+ Integer pageNum =
+ Integer.parseInt(Optional.ofNullable(req.queryParams("pageNum")).orElse("0"));
+ Links links = client.getLinks(rawUrl, linkType, next, pageNum);
+ return new ModelAndView(Collections.singletonMap("links", links), "links.ftl");
+ }, freeMarkerEngine);
+ }
+
+ public void stop() {
+ Spark.stop();
+ }
+
+ public static void main(String[] args) throws Exception {
+ WebIndexConfig webIndexConfig = WebIndexConfig.load();
+ File fluoConfigFile = new File(webIndexConfig.getFluoPropsPath());
+ FluoConfiguration fluoConfig = new FluoConfiguration(fluoConfigFile);
+ Connector conn = AccumuloUtil.getConnector(fluoConfig);
+ IndexClient client = new IndexClient(webIndexConfig.accumuloIndexTable, conn);
+ WebServer webServer = new WebServer();
+ webServer.start(client, 4567, null);
+ }
+}
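Each route added in `WebServer.java` above defaults a missing query parameter with `Optional.ofNullable(...).orElse(...)` before parsing it. A dependency-free sketch of that defaulting pattern (the class and helper names here are illustrative, not part of the codebase):

```java
import java.util.Map;
import java.util.Optional;

public class ParamDefault {

    // Mirrors the defaulting used in the routes above: an absent "pageNum"
    // query parameter is treated as page 0 rather than throwing an NPE.
    static int pageNum(Map<String, String> queryParams) {
        return Integer.parseInt(Optional.ofNullable(queryParams.get("pageNum")).orElse("0"));
    }

    public static void main(String[] args) {
        System.out.println(pageNum(Map.of()));               // 0
        System.out.println(pageNum(Map.of("pageNum", "4"))); // 4
    }
}
```

This keeps the route lambdas short: Spark's `Request.queryParams(String)` returns `null` for missing parameters, so wrapping in `Optional` avoids a separate null check per parameter.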
diff --git a/modules/ui/src/main/java/webindex/ui/util/Pager.java b/modules/ui/src/main/java/webindex/ui/util/Pager.java
deleted file mode 100644
index d69d95e..0000000
--- a/modules/ui/src/main/java/webindex/ui/util/Pager.java
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.util;
-
-import java.util.Iterator;
-import java.util.Map;
-
-import org.apache.accumulo.core.client.Scanner;
-import org.apache.accumulo.core.data.Key;
-import org.apache.accumulo.core.data.Range;
-import org.apache.accumulo.core.data.Value;
-
-public abstract class Pager {
-
- private Scanner scanner;
- private int pageSize;
- private Range pageRange;
-
- public Pager(Scanner scanner, Range pageRange, int pageSize) {
- this.scanner = scanner;
- this.pageRange = pageRange;
- this.pageSize = pageSize;
- }
-
- public void getPage(Key nextKey) {
- scanner.setRange(new Range(nextKey, pageRange.getEndKey()));
- foundStart(scanner.iterator());
- }
-
- public void getPage(int pageNum) {
- scanner.setRange(pageRange);
- Iterator<Map.Entry<Key, Value>> iterator = scanner.iterator();
- if (pageNum > 0) {
- long skip = 0;
- while (skip < (pageNum * pageSize)) {
- iterator.next();
- skip++;
- }
- }
- foundStart(iterator);
- }
-
- private void foundStart(Iterator<Map.Entry<Key, Value>> iterator) {
- long num = 0;
- while (iterator.hasNext() && (num < (pageSize + 1))) {
- Map.Entry<Key, Value> entry = iterator.next();
- if (num == pageSize) {
- foundNextEntry(entry);
- } else {
- foundPageEntry(entry);
- }
- num++;
- }
- }
-
- public abstract void foundPageEntry(Map.Entry<Key, Value> entry);
-
- public abstract void foundNextEntry(Map.Entry<Key, Value> entry);
-
-}
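The `Pager` deleted here reads one extra entry per page: the first `pageSize` entries go to `foundPageEntry`, and the `(pageSize + 1)`-th, if present, becomes the "next" key the UI passes back to resume scanning. A minimal, dependency-free sketch of that idiom over an in-memory list (names are illustrative; the real code pages over an Accumulo `Scanner`, and equivalent logic presumably now lives behind `IndexClient` in `webindex-core`):

```java
import java.util.List;

public class PageSketch {

    // Returns up to pageSize + 1 entries for the requested page; if the
    // result has pageSize + 1 elements, the last one is the "next" marker
    // rather than a page entry, matching Pager.foundStart's behavior.
    static List<String> page(List<String> entries, int pageNum, int pageSize) {
        int from = Math.min(pageNum * pageSize, entries.size());
        int to = Math.min(from + pageSize + 1, entries.size()); // one extra = "next"
        return entries.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d", "e");
        // Two page entries ("a", "b") plus the "next" marker ("c")
        System.out.println(page(rows, 0, 2)); // [a, b, c]
    }
}
```

Reading one entry past the page boundary is what lets the UI render a "next" link without a second scan or a total count.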
diff --git a/modules/ui/src/main/java/webindex/ui/util/WebUrl.java b/modules/ui/src/main/java/webindex/ui/util/WebUrl.java
deleted file mode 100644
index 5f2e863..0000000
--- a/modules/ui/src/main/java/webindex/ui/util/WebUrl.java
+++ /dev/null
@@ -1,35 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.util;
-
-import com.google.common.net.HostSpecifier;
-import com.google.common.net.InternetDomainName;
-import webindex.core.models.URL;
-
-public class WebUrl {
-
- public static String domainFromHost(String host) {
- return InternetDomainName.from(host).topPrivateDomain().toString();
- }
-
- public static boolean isValidHost(String host) {
- return HostSpecifier.isValid(host) && InternetDomainName.isValid(host)
- && InternetDomainName.from(host).isUnderPublicSuffix();
- }
-
- public static URL from(String rawUrl) {
- return URL.from(rawUrl, WebUrl::domainFromHost, WebUrl::isValidHost);
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/views/HomeView.java b/modules/ui/src/main/java/webindex/ui/views/HomeView.java
deleted file mode 100644
index dc2f1d4..0000000
--- a/modules/ui/src/main/java/webindex/ui/views/HomeView.java
+++ /dev/null
@@ -1,24 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.views;
-
-import io.dropwizard.views.View;
-
-public class HomeView extends View {
-
- public HomeView() {
- super("home.ftl");
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/views/LinksView.java b/modules/ui/src/main/java/webindex/ui/views/LinksView.java
deleted file mode 100644
index b168a32..0000000
--- a/modules/ui/src/main/java/webindex/ui/views/LinksView.java
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.views;
-
-import io.dropwizard.views.View;
-import webindex.core.models.Links;
-
-public class LinksView extends View {
-
- private final Links links;
-
- public LinksView(Links links) {
- super("links.ftl");
- this.links = links;
- }
-
- public Links getLinks() {
- return links;
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/views/PageView.java b/modules/ui/src/main/java/webindex/ui/views/PageView.java
deleted file mode 100644
index 7b1a9fe..0000000
--- a/modules/ui/src/main/java/webindex/ui/views/PageView.java
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.views;
-
-import io.dropwizard.views.View;
-import webindex.core.models.Page;
-
-public class PageView extends View {
-
- private final Page page;
-
- public PageView(Page page) {
- super("page.ftl");
- this.page = page;
- }
-
- public Page getPage() {
- return page;
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/views/PagesView.java b/modules/ui/src/main/java/webindex/ui/views/PagesView.java
deleted file mode 100644
index a856c63..0000000
--- a/modules/ui/src/main/java/webindex/ui/views/PagesView.java
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.views;
-
-import io.dropwizard.views.View;
-import webindex.core.models.Pages;
-
-public class PagesView extends View {
-
- private final Pages pages;
-
- public PagesView(Pages pages) {
- super("pages.ftl");
- this.pages = pages;
- }
-
- public Pages getPages() {
- return pages;
- }
-}
diff --git a/modules/ui/src/main/java/webindex/ui/views/TopView.java b/modules/ui/src/main/java/webindex/ui/views/TopView.java
deleted file mode 100644
index b1927bf..0000000
--- a/modules/ui/src/main/java/webindex/ui/views/TopView.java
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.ui.views;
-
-import io.dropwizard.views.View;
-import webindex.core.models.TopResults;
-
-public class TopView extends View {
-
- private TopResults top;
-
- public TopView(TopResults results) {
- super("top.ftl");
- this.top = results;
- }
-
- public TopResults getTop() {
- return top;
- }
-}
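
The three deleted Dropwizard `View` subclasses above are made unnecessary by Spark Web, where a route hands a model map straight to a template engine instead of wrapping it in a view class. A hypothetical sketch of the replacement pattern (class name, port, and the `lookupPage` helper are illustrative, not from this PR); note that Spark's `FreeMarkerEngine` loads templates from `spark/template/freemarker` on the classpath by default, which is why the `.ftl` files are moved under that path below:

```java
import java.util.HashMap;
import java.util.Map;

import spark.ModelAndView;
import spark.Spark;
import spark.template.freemarker.FreeMarkerEngine;

public class PageRouteSketch {

  public static void main(String[] args) {
    // Templates resolve against src/main/resources/spark/template/freemarker
    FreeMarkerEngine freeMarker = new FreeMarkerEngine();

    Spark.port(8080); // illustrative port

    // Replaces PageView: build the model map inline and name the template
    Spark.get("/page", (req, res) -> {
      Map<String, Object> model = new HashMap<>();
      model.put("page", lookupPage(req.queryParams("url"))); // hypothetical lookup
      return new ModelAndView(model, "page.ftl");
    }, freeMarker);
  }

  // Placeholder standing in for the real index lookup in webindex-core
  private static Object lookupPage(String url) {
    return url;
  }
}
```

The same pattern covers `PagesView` and `TopView`: each former view class collapses into one `Spark.get(...)` call whose model keys (`page`, `pages`, `top`) match what the unchanged `.ftl` templates already expect.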
diff --git a/modules/ui/src/main/resources/spark/template/freemarker/404.ftl b/modules/ui/src/main/resources/spark/template/freemarker/404.ftl
new file mode 100644
index 0000000..24753a1
--- /dev/null
+++ b/modules/ui/src/main/resources/spark/template/freemarker/404.ftl
@@ -0,0 +1,10 @@
+<html>
+<#include "common/head.ftl">
+<body>
+<div class="container" style="margin-top: 20px">
+<div class="row">
+ <div class="col-md-6 col-md-offset-3" style="margin-top: 200px">
+ <h2>404: Page not found</h2>
+ </div>
+</div>
+<#include "common/footer.ftl">
diff --git a/modules/ui/src/main/resources/webindex/ui/views/common/footer.ftl b/modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/common/footer.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/common/footer.ftl
diff --git a/modules/ui/src/main/resources/webindex/ui/views/common/head.ftl b/modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/common/head.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/common/head.ftl
diff --git a/modules/ui/src/main/resources/webindex/ui/views/common/header.ftl b/modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
similarity index 68%
rename from modules/ui/src/main/resources/webindex/ui/views/common/header.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
index 0a9ac86..0163231 100644
--- a/modules/ui/src/main/resources/webindex/ui/views/common/header.ftl
+++ b/modules/ui/src/main/resources/spark/template/freemarker/common/header.ftl
@@ -5,6 +5,6 @@
<div class="container" style="margin-top: 20px">
<div class="row" style="margin-bottom: 10px">
<div class="col-md-6">
- <a href="/"><img src="/assets/img/webindex.png" alt="WebIndex Home" style="height:30px;"></a>
+ <a href="/"><img src="/img/webindex.png" alt="WebIndex Home" style="height:30px;"></a>
</div>
</div>
diff --git a/modules/ui/src/main/resources/webindex/ui/views/home.ftl b/modules/ui/src/main/resources/spark/template/freemarker/home.ftl
similarity index 93%
rename from modules/ui/src/main/resources/webindex/ui/views/home.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/home.ftl
index 76c7111..c3d9f20 100644
--- a/modules/ui/src/main/resources/webindex/ui/views/home.ftl
+++ b/modules/ui/src/main/resources/spark/template/freemarker/home.ftl
@@ -4,7 +4,7 @@
<div class="container" style="margin-top: 20px">
<div class="row">
<div class="col-md-6 col-md-offset-3" style="margin-top: 200px">
- <img src="/assets/img/webindex.png" alt="WebIndex">
+ <img src="/img/webindex.png" alt="WebIndex">
<div style="margin-top: 25px;">
<h4>Enter a domain to view known webpages in that domain:</h4>
</div>
diff --git a/modules/ui/src/main/resources/webindex/ui/views/links.ftl b/modules/ui/src/main/resources/spark/template/freemarker/links.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/links.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/links.ftl
diff --git a/modules/ui/src/main/resources/webindex/ui/views/page.ftl b/modules/ui/src/main/resources/spark/template/freemarker/page.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/page.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/page.ftl
diff --git a/modules/ui/src/main/resources/webindex/ui/views/pages.ftl b/modules/ui/src/main/resources/spark/template/freemarker/pages.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/pages.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/pages.ftl
diff --git a/modules/ui/src/main/resources/webindex/ui/views/top.ftl b/modules/ui/src/main/resources/spark/template/freemarker/top.ftl
similarity index 100%
rename from modules/ui/src/main/resources/webindex/ui/views/top.ftl
rename to modules/ui/src/main/resources/spark/template/freemarker/top.ftl
diff --git a/pom.xml b/pom.xml
index 79ee4e1..e07f153 100644
--- a/pom.xml
+++ b/pom.xml
@@ -32,10 +32,10 @@
<module>modules/core</module>
<module>modules/data</module>
<module>modules/ui</module>
+ <module>modules/integration</module>
</modules>
<properties>
<accumulo.version>1.7.1</accumulo.version>
- <dropwizard.version>0.8.2</dropwizard.version>
<fluo-recipes.version>1.0.0-incubating-SNAPSHOT</fluo-recipes.version>
<fluo.version>1.0.0-incubating-SNAPSHOT</fluo.version>
<hadoop.version>2.6.3</hadoop.version>
@@ -57,26 +57,31 @@
<version>2.3.1</version>
</dependency>
<dependency>
+ <groupId>com.google.guava</groupId>
+ <artifactId>guava</artifactId>
+ <version>14.0.1</version>
+ </dependency>
+ <dependency>
+ <groupId>com.sparkjava</groupId>
+ <artifactId>spark-core</artifactId>
+ <version>2.5</version>
+ </dependency>
+ <dependency>
+ <groupId>com.sparkjava</groupId>
+ <artifactId>spark-template-freemarker</artifactId>
+ <version>2.3</version>
+ </dependency>
+ <dependency>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ <version>2.4</version>
+ </dependency>
+ <dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
<dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-assets</artifactId>
- <version>${dropwizard.version}</version>
- </dependency>
- <dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-core</artifactId>
- <version>${dropwizard.version}</version>
- </dependency>
- <dependency>
- <groupId>io.dropwizard</groupId>
- <artifactId>dropwizard-views-freemarker</artifactId>
- <version>${dropwizard.version}</version>
- </dependency>
- <dependency>
<groupId>io.github.astralway</groupId>
<artifactId>webindex-core</artifactId>
<version>${project.version}</version>
@@ -183,6 +188,11 @@
<version>3.4.6</version>
</dependency>
<dependency>
+ <groupId>org.jsoup</groupId>
+ <artifactId>jsoup</artifactId>
+ <version>1.9.2</version>
+ </dependency>
+ <dependency>
<groupId>org.netpreserve.commons</groupId>
<artifactId>webarchive-commons</artifactId>
<version>1.1.2</version>
@@ -226,13 +236,14 @@
<exclude>README.md</exclude>
<exclude>docs/**.md</exclude>
<exclude>conf/webindex-tests.txt</exclude>
+ <exclude>src/test/resources/5-pages.txt</exclude>
<exclude>src/test/resources/*.warc</exclude>
<exclude>src/test/resources/data/set1/*.txt</exclude>
<exclude>src/main/resources/splits/*.txt</exclude>
- <exclude>src/main/resources/webindex/ui/views/*.ftl</exclude>
- <exclude>src/main/resources/webindex/ui/views/common/*.ftl</exclude>
+ <exclude>src/main/resources/spark/template/freemarker/*.ftl</exclude>
+ <exclude>src/main/resources/spark/template/freemarker/common/*.ftl</exclude>
<exclude>logs/*</exclude>
- <exclude>paths/*</exclude>
+ <exclude>data/*</exclude>
<exclude>dependency-reduced-pom.xml</exclude>
</excludes>
</configuration>
@@ -247,6 +258,11 @@
</systemPropertyVariables>
</configuration>
</plugin>
+ <plugin>
+ <groupId>org.codehaus.mojo</groupId>
+ <artifactId>exec-maven-plugin</artifactId>
+ <version>1.5.0</version>
+ </plugin>
</plugins>
</pluginManagement>
</build>