tag	4319a5d2b2ae9b802f42636ecdeb2b802c0171ed
tagger	Chris Mattmann <mattmann@apache.org>	Thu Jun 26 23:41:35 2014 -0400
object	f2e619a91226edc63087b8c2abd971de92c327e9

commit	f2e619a91226edc63087b8c2abd971de92c327e9
author	Chris Mattmann <chris.mattmann@gmail.com>	Thu Jun 26 23:38:20 2014 -0400
committer	Chris Mattmann <chris.mattmann@gmail.com>	Thu Jun 26 23:38:20 2014 -0400
tree	2cd3a4a06236941bfb07d9a952dc0a7832f4e5b6
parent	ca064978610f64f9718400a3fccbd9096d84e98e
parent	af2b477f0c437ab25f6039ae63a3be43aa00a515

README.md

Distributed Release Audit Tool (DRAT)

A distributed, parallelized (Map Reduce) wrapper around Apache™ RAT that lets it complete on large code repositories of multiple file types, where Apache™ RAT on its own hangs forever.

The tool uses Apache™ OODT to parallelize the following components and chain them into a workflow:

Apache™ SOLR based exploration of a CM repository (e.g., Git, SVN, etc.) and classification of that repository based on MIME type using Apache™ Tika.
A MIME partitioner that uses Apache™ Tika to automatically deduce and classify by file type and then partition Apache™ RAT jobs based on sets of 100 files per type (configurable) -- the M/R “partitioner”
A throttle wrapper that runs Apache™ RAT on each MIME-targeted partition of files -- the M/R “mapper”
A reducer to “combine” the produced RAT logs together into a global RAT report that can be used for stats generation. -- the M/R “reducer”
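
The batching idea behind the partitioner can be illustrated with a small shell sketch. This is not DRAT's implementation (DRAT does this through Apache™ Tika and Apache™ OODT). It assumes the MIME type of each file is already known and supplied as tab-separated `mime<TAB>path` lines, and `partition_by_mime` is a hypothetical helper name.

```shell
# Illustration only: assign each file to a numbered batch within its MIME type.
# Input  (stdin):  <mime-type><TAB><path>
# Output (stdout): <mime-type><TAB><batch-number><TAB><path>
# The batch size defaults to 100, matching the default mentioned above.
# Assumption: paths do not contain tab characters.
partition_by_mime() {
  awk -F'\t' -v size="${1:-100}" '
    {
      idx = count[$1]++                      # files seen so far for this type
      printf "%s\t%d\t%s\n", $1, int(idx / size) + 1, $2
    }
  '
}
```

Each distinct (MIME type, batch number) pair would then correspond to one RAT job handed to a mapper.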

How to Build

You can build DRAT in a few steps:

mkdir -p /usr/local/drat/deploy
mkdir -p /usr/local/drat/src
cd /usr/local/drat/src
git clone https://github.com/chrismattmann/drat.git .
mvn install
cp -R distribution/target/dms-distribution-0.1-bin.tar.gz ../deploy/
cd ../deploy/
tar xvzf dms-distribution-0.1-bin.tar.gz
rm *.tar.gz
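
Before starting the build, it can help to confirm that the tools used by the steps above are on your PATH. The following is a minimal sketch; `require_tools` is a hypothetical helper, and the tool list in the usage note is inferred from the commands above rather than taken from an official prerequisites list.

```shell
# Report every named tool that is not on PATH.
# Returns 0 if all tools are found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_tools git mvn tar && echo "ready to build"` prints the message only if all three tools are available.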

How to Run

Here are the basic commands to run DRAT. Suppose you have a code repo, your-repo, that lives in $HOME/your-repo.

Set your $DRAT_HOME environment variable, e.g., to /usr/local/drat/deploy
Start Apache™ OODT:
$DRAT_HOME/bin/oodt start

Automated method:

Go!
cd $DRAT_HOME/bin
./drat go $HOME/your-repo
This will crawl the repo, index it, map it, and reduce it.

Manual method:

If you would rather run the individual commands yourself, use the manual method:

Crawl the repository of interest, e.g., $HOME/your-repo:
$DRAT_HOME/bin/drat crawl $HOME/your-repo
Index the crawled repo in Apache™ SOLR:
$DRAT_HOME/bin/drat index $HOME/your-repo
Fire off the partitioner and mappers:
$DRAT_HOME/bin/drat map
Fire off the reducer:
$DRAT_HOME/bin/drat reduce

Please see $DRAT_HOME/bin/drat for the specifics of each command.
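
The manual steps can also be chained in a small wrapper that stops at the first failing command. This is a sketch, not part of DRAT: `run_drat_manual` is a hypothetical name, and it assumes each `drat` subcommand returns a non-zero exit status on failure. It also assumes the mappers have finished by the time `drat map` returns. If they run asynchronously in your install, check the OPSUI before running the reducer.

```shell
# Run the four manual DRAT steps in order, stopping at the first failure.
# Usage: run_drat_manual <path-to-repo>
run_drat_manual() {
  repo="${1:-}"
  if [ -z "${DRAT_HOME:-}" ] || [ -z "$repo" ]; then
    echo "DRAT_HOME must be set; usage: run_drat_manual <path-to-repo>" >&2
    return 1
  fi
  drat="$DRAT_HOME/bin/drat"
  "$drat" crawl "$repo" || return 1   # crawl the repository
  "$drat" index "$repo" || return 1   # index it in Solr
  "$drat" map           || return 1   # partitioner and mappers
  "$drat" reduce        || return 1   # combine the RAT logs
}
```

With Apache™ OODT already started, `run_drat_manual $HOME/your-repo` runs the same sequence as the list above.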

Interacting with DRAT

DRAT UIs are accessible at:

http://localhost:8080/opsui/ - main cockpit, Apache OODT OPSUI
http://localhost:8080/solr/ - Solr4 catalog

DRAT publishes its analyzed aggregated RAT logs to:

$DRAT_HOME/data/archive/rataggregate/*.csv

These look like, e.g.:

cat *.csv
Notes,Binaries,Archives,Standards,Apache,Generated,Unknown
0,2,0,530,497,0,33

Each column is the number of files in the repository that fall into the corresponding license category:

Binaries - it's a binary file, no license
Notes - it's a notes file
Archives - it's a tar/zip/etc archive, no license
Standards - it's one of the OSI approved licenses that isn't ALv2, so e.g., BSD, MIT, LGPL, etc.
Generated - these are generated files (either source or binary)
Apache - Apache licensed (ALv2) files
Unknown - no discernible license
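
To read one of these files without lining up the columns by eye, a short awk-based sketch can pair each header with its count and add a total. `summarize_rat_csv` is a hypothetical helper; it assumes a single CSV file with one header row followed by one or more data rows, as in the example above.

```shell
# Print "<Category>=<count>" for each column, followed by the row total.
# Usage: summarize_rat_csv <file.csv>
summarize_rat_csv() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
    NF > 0  {
      total = 0
      for (i = 1; i <= n; i++) {
        printf "%s=%d\n", name[i], $i
        total += $i
      }
      printf "Total=%d\n", total
    }
  ' "$1"
}
```

For the example row above, the counts add up to a total of 1062 files.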

Re-Running DRAT

If you have run DRAT on your source code and want to run it again, the easiest way is to:

Grab the aliases for fmquery and fmdel from https://issues.apache.org/jira/browse/OODT-306 and add them to your bash or tcsh profile.
Run fmquery "ProductType:RatLog" | fmdel
Run fmquery "ProductType:RatAggregateLog" | fmdel

You should be good to go to re-run the analysis at that point.

If you want to analyze an entirely new code base

$DRAT_HOME/bin/oodt stop
$DRAT_HOME/bin/drat reset
$DRAT_HOME/bin/oodt start

You shouldn't need to run these, but the manual version of reset is:

Blow away the following dirs:
rm -rf $DRAT_HOME/data/workflow
rm -rf $DRAT_HOME/filemgr/catalog
rm -rf $DRAT_HOME/solr/drat/data
Blow away files in following dirs:
rm -rf $DRAT_HOME/data/archive/*
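
Because these commands expand $DRAT_HOME inside `rm -rf`, running them with the variable unset would target paths such as /data/archive/* instead. A guarded version of the same steps might look like the sketch below. `drat_manual_reset` is a hypothetical name and is not the `drat reset` command itself.

```shell
# Manual reset with a guard against an unset or invalid DRAT_HOME.
drat_manual_reset() {
  if [ -z "${DRAT_HOME:-}" ] || [ ! -d "$DRAT_HOME" ]; then
    echo "DRAT_HOME is unset or not a directory; refusing to delete anything" >&2
    return 1
  fi
  # Remove these directories entirely.
  rm -rf "$DRAT_HOME/data/workflow" \
         "$DRAT_HOME/filemgr/catalog" \
         "$DRAT_HOME/solr/drat/data"
  # Empty the archive but keep the directory itself.
  rm -rf "$DRAT_HOME"/data/archive/*
}
```

As with the commands above, stop Apache™ OODT before running it and start it again afterwards.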

Useful Environment Variables

The following environment variables are set by RADIX but can be overridden on a per-install basis. The default configuration is shown below; feel free to change or override it in your own environment.

setenv DRAT_HOME /usr/local/drat/deploy
setenv FILEMGR_URL http://localhost:9000
setenv WORKFLOW_URL http://localhost:9001
setenv RESMGR_URL http://localhost:9002
setenv WORKFLOW_HOME $DRAT_HOME/workflow
setenv FILEMGR_HOME $DRAT_HOME/filemgr
setenv PGE_ROOT $DRAT_HOME/pge
setenv PCS_HOME $DRAT_HOME/pcs
setenv GANGLIA_URL http://zipper.jpl.nasa.gov/ganglia/

Note that the Tomcat we ship with DRAT won't start correctly unless you define the $JAVA_HOME environment variable, so make sure that's set too.
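
The defaults above use tcsh's setenv syntax. For bash or another POSIX-style shell, an equivalent sketch is shown below with the same values as the tcsh listing. The JAVA_HOME check at the end is an added convenience, not something DRAT ships.

```shell
# bash/sh equivalent of the tcsh defaults listed above.
export DRAT_HOME=/usr/local/drat/deploy
export FILEMGR_URL=http://localhost:9000
export WORKFLOW_URL=http://localhost:9001
export RESMGR_URL=http://localhost:9002
export WORKFLOW_HOME=$DRAT_HOME/workflow
export FILEMGR_HOME=$DRAT_HOME/filemgr
export PGE_ROOT=$DRAT_HOME/pge
export PCS_HOME=$DRAT_HOME/pcs
export GANGLIA_URL=http://zipper.jpl.nasa.gov/ganglia/

# Warn early if JAVA_HOME is missing, since the bundled Tomcat needs it.
if [ -z "${JAVA_HOME:-}" ]; then
  echo "JAVA_HOME is not set; the bundled Tomcat will not start correctly" >&2
fi
```

Placing these lines in your shell profile makes them available to every DRAT command you run.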