commit	8a5cce04389432a664d5c7d33a0c4836437f079c	[log] [tgz]
author	Chris Mattmann <mattmann@apache.org>	Wed Jul 09 11:09:49 2014 -0700
committer	Chris Mattmann <mattmann@apache.org>	Wed Jul 09 11:09:49 2014 -0700
tree	68bffaef74c2476af68cf3dafcce3d940834dcd7
parent	b40d9121900b5ba4931a0b0816527c71d02dd6a6 [diff]

tree: 68bffaef74c2476af68cf3dafcce3d940834dcd7

README.md

Distributed Release Audit Tool (DRAT)

A distributed, parallelized (Map Reduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large repositories of code, since it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size and provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize and workflow the following components:

Apache™ SOLR based exploration of a CM repository (e.g., Git, SVN, etc.) and classification of that repository based on MIME type using Apache™ Tika.
A MIME partitioner that uses Apache™ Tika to automatically deduce and classify by file type and then partition Apache™ RAT jobs based on sets of 100 files per type (configurable) -- the M/R “partitioner”
A throttle wrapper for RAT to MIME targeted Apache™ RAT. -- the M/R “mapper”
A reducer to “combine” the produced RAT logs together into a global RAT report that can be used for stats generation. -- the M/R “reducer”

How to Build

You can build DRAT in a few steps:

mkdir -p /usr/local/drat/deploy
mkdir -p /usr/local/drat/src
cd /usr/local/drat/src
git clone https://github.com/chrismattmann/drat.git .
mvn install
cp -R distribution/target/dms-distribution-0.1-bin.tar.gz ../deploy/
cd ../deploy/
tar xvzf dms-distribution-0.1-bin.tar.gz
rm *.tar.gz

Running with Vagrant

Prerequisites:

Install Vagrant from here.

Install VirtualBox from here.

git clone https://github.com/chrismattmann/drat.git
cd drat
vagrant up
vagrant ssh

Skip to automated method or manual method. Note that the /vagrant directory is a shared folder to your host system and is a great way to interact with codebases you're looking to audit with drat.

How to Run

Here are the basic commands to run DRAT. Imagine you had a code repo, your-repo, that lives in $HOME/your-repo.

Set your $DRAT_HOME environment variable, e.g., to /usr/local/drat/deploy. Note the tomcat that we ship with DRAT won‘t start correctly unless you define the $JAVA_HOME environment variable, so make sure that’s set too.
Start Apache™ OODT:
$DRAT_HOME/bin/oodt start

Automated method:

Go!
$DRAT_HOME/bin/drat go $HOME/your-repo
This will crawl the repo, index it, map it and then wait for the reduce to complete.

Manual method:

If you would rather run the individual commands yourself, use the manual method:

Crawl the repository of interest, e.g., $HOME/your-repo:
$DRAT_HOME/bin/drat crawl $HOME/your-repo
Index the crawled repo in Apache™ SOLR:
$DRAT_HOME/bin/drat index $HOME/your-repo
Fire off the partitioner and mappers
$DRAT_HOME/bin/drat map
Fire off the reducer
$DRAT_HOME/bin/drat reduce

Please see $DRAT_HOME/bin/drat for the specifics of each command. To shut down OODT, run $DRAT_HOME/bin/oodt stop.

Interacting with DRAT

DRAT UIs are accessible at:

http://localhost:8080/opsui/ - main cockpit, Apache OODT OPSUI
http://localhost:8080/solr/ - Solr4 catalog

DRAT publishes its analyzed aggregated RAT logs to:

$DRAT_HOME/data/archive/rataggregate/*.csv

These look like e.g.

cat *.csv
Notes,Binaries,Archives,Standards,Apache,Generated,Unknown
0,2,0,530,497,0,33

So, these are the counts of each of the source code files and what licenses they are:

Binaries - it's a binary file, no license
Notes - it's a notes file
Archives - it's a tar/zip/etc archive, no license
Standards - it's one of the OSI approved licenses that isn't ALv2, so e.g., BSD, MIT, LGPL, etc.
Generated - these are generated files (either source or binary)
Apache - apache licensed files
Unknown - non discernible license

Re-Running DRAT

If you run DRAT on your source code and want to run it again the easiest way to do so is to:

Grab the aliases for fmquery and fmdel from https://issues.apache.org/jira/browse/OODT-306 and add them to your bash or tcsh profile:
Run fmquery "ProductType:RatLog" | fmdel
Run fmquery "ProductType:RatAggregateLog" | fmdel

You should be good to go to re-run the analysis at that point.

##If you want to analyze an entirely new code base $DRAT_HOME/bin/oodt stop
$DRAT_HOME/bin/drat reset
$DRAT_HOME/bin/oodt start

You shouldn't need to run these, but the manual version of reset is:

Blow away the following dirs:
rm -rf $DRAT_HOME/data/workflow
rm -rf $DRAT_HOME/filemgr/catalog
rm -rf $DRAT_HOME/solr/drat/data
Blow away files in following dirs:
rm -rf $DRAT_HOME/data/archive/*

Useful Environment Variables

The following useful environment variables are set by RADIX but can be overwritten on a per DRAT install basis. Here's the default config, feel free to change/override in your own environment.

setenv DRAT_HOME /usr/local/drat/deploy
setenv FILEMGR_URL http://localhost:9000
setenv WORKFLOW_URL http://localhost:9001
setenv RESMGR_URL http://localhost:9002
setenv WORKFLOW_HOME $DRAT_HOME/workflow
setenv FILEMGR_HOME $DRAT_HOME/filemgr
setenv PGE_ROOT $DRAT_HOME/pge
setenv PCS_HOME $DRAT_HOME/pcs
setenv GANGLIA_URL http://zipper.jpl.nasa.gov/ganglia/

DRAT Tutorials/Videos

There is now a Youtube video on DRAT explaining DRAT's motivation, and results of running it on DARPA XDATA and on the Computational Infrastructure for Geodynamics as part of my NSF project. The video was made for the 2014 Summer Earth Science Information Partners Federation Meeting.