commit | b40d9121900b5ba4931a0b0816527c71d02dd6a6 | [log] [tgz] |
---|---|---|
author | Chris Mattmann <mattmann@apache.org> | Wed Jul 09 11:08:28 2014 -0700 |
committer | Chris Mattmann <mattmann@apache.org> | Wed Jul 09 11:08:28 2014 -0700 |
tree | 3bba2ae75c1e1081d8a27435f018f5a485d3367f | |
parent | 73c3e3e5b1797c7e6de2b0484d8e5b8200cac4d4 [diff] |
- add presentation for ESIP Summer 2014 meeting on DRAT.
A distributed, parallelized (Map Reduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large repositories of code, since it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size and provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize and workflow the following components:
You can build DRAT in a few steps:
mkdir -p /usr/local/drat/deploy
mkdir -p /usr/local/drat/src
cd /usr/local/drat/src
git clone https://github.com/chrismattmann/drat.git .
mvn install
cp -R distribution/target/dms-distribution-0.1-bin.tar.gz ../deploy/
cd ../deploy/
tar xvzf dms-distribution-0.1-bin.tar.gz
rm *.tar.gz
Prerequisites:
Install Vagrant from here.
Install VirtualBox from here.
git clone https://github.com/chrismattmann/drat.git
cd drat
vagrant up
vagrant ssh
Skip to automated method or manual method. Note that the /vagrant directory is a shared folder to your host system and is a great way to interact with codebases you're looking to audit with drat.
Here are the basic commands to run DRAT. Suppose you have a code repo, your-repo, that lives in $HOME/your-repo.
Set your $DRAT_HOME environment variable, e.g., to /usr/local/drat/deploy. Note that the Tomcat we ship with DRAT won't start correctly unless you define the $JAVA_HOME environment variable, so make sure that's set too.
Start Apache™ OODT:
$DRAT_HOME/bin/oodt start
Run the automated method:
$DRAT_HOME/bin/drat go $HOME/your-repo
If you would rather run the individual commands yourself, use the manual method:
Crawl the repository of interest, e.g., $HOME/your-repo:
$DRAT_HOME/bin/drat crawl $HOME/your-repo
Index the crawled repo in Apache™ SOLR:
$DRAT_HOME/bin/drat index $HOME/your-repo
Fire off the partitioner and mappers:
$DRAT_HOME/bin/drat map
Fire off the reducer:
$DRAT_HOME/bin/drat reduce
Please see $DRAT_HOME/bin/drat for the specifics of each command. To shut down OODT, run $DRAT_HOME/bin/oodt stop.
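The four manual steps can also be chained so that a failure in one step stops the rest. The following is only a sketch: the function name drat_manual_run is hypothetical and not part of DRAT, and it assumes $DRAT_HOME is set and OODT has already been started.

```shell
# Hypothetical helper, not shipped with DRAT: runs the four manual steps
# (crawl, index, map, reduce) in order and stops at the first failure.
# Assumes $DRAT_HOME is set and Apache OODT is already running.
drat_manual_run() {
    drat_repo="$1"
    if [ -z "$drat_repo" ]; then
        echo "usage: drat_manual_run /path/to/repo" >&2
        return 2
    fi
    if [ -z "${DRAT_HOME:-}" ]; then
        echo "DRAT_HOME is not set" >&2
        return 1
    fi
    "$DRAT_HOME/bin/drat" crawl "$drat_repo" &&
    "$DRAT_HOME/bin/drat" index "$drat_repo" &&
    "$DRAT_HOME/bin/drat" map &&
    "$DRAT_HOME/bin/drat" reduce
}
```

Usage would then be drat_manual_run $HOME/your-repo.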
DRAT UIs are accessible at:
http://localhost:8080/opsui/ - main cockpit, Apache OODT OPSUI
http://localhost:8080/solr/ - Solr4 catalog
DRAT publishes its analyzed aggregated RAT logs to:
$DRAT_HOME/data/archive/rataggregate/*.csv
These look like, e.g.:
cat *.csv
Notes,Binaries,Archives,Standards,Apache,Generated,Unknown
0,2,0,530,497,0,33
These are the counts of source code files in each license category:
Binaries - it's a binary file, no license
Notes - it's a notes file
Archives - it's a tar/zip/etc archive, no license
Standards - it's one of the OSI approved licenses that isn't ALv2, so e.g., BSD, MIT, LGPL, etc.
Generated - these are generated files (either source or binary)
Apache - Apache licensed files
Unknown - non discernible license
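To read an aggregate file at a glance, the header and counts can be paired up with awk. This is a minimal sketch, assuming each CSV holds one header row followed by one row of counts, as in the sample above; the function name summarize_rat_csv is hypothetical and not part of DRAT.

```shell
# Hypothetical helper: prints one "label: count" line per column of a
# rataggregate CSV, followed by the total number of files counted.
# Assumes line 1 is the header and line 2 holds the counts.
summarize_rat_csv() {
    awk -F',' '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) label[i] = $i; next }
        NR == 2 {
            for (i = 1; i <= n; i++) {
                printf "%s: %d\n", label[i], $i
                total += $i
            }
            printf "Total: %d\n", total
        }
    ' "$1"
}
```

For the sample above, this reports 530 Standards files, 497 Apache files, and 33 Unknown files out of a total of 1062.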
If you have run DRAT on your source code and want to run it again, the easiest way to do so is to:
Grab the aliases for fmquery and fmdel from https://issues.apache.org/jira/browse/OODT-306 and add them to your bash or tcsh profile:
Run fmquery "ProductType:RatLog" | fmdel
Run fmquery "ProductType:RatAggregateLog" | fmdel
You should be good to go to re-run the analysis at that point.
## If you want to analyze an entirely new code base
$DRAT_HOME/bin/oodt stop
$DRAT_HOME/bin/drat reset
$DRAT_HOME/bin/oodt start
You shouldn't need to run these, but the manual version of reset is:
Blow away the following dirs:
rm -rf $DRAT_HOME/data/workflow
rm -rf $DRAT_HOME/filemgr/catalog
rm -rf $DRAT_HOME/solr/drat/data
Blow away files in the following dirs:
rm -rf $DRAT_HOME/data/archive/*
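Because these commands use rm -rf on paths built from $DRAT_HOME, an unset variable would make them point at /data/workflow and similar top-level paths. The sketch below wraps the same four removals in a guard; the function name drat_manual_reset is hypothetical and not part of DRAT, and it assumes OODT has been stopped first.

```shell
# Hypothetical helper mirroring the manual reset steps listed above.
# Refuses to run when DRAT_HOME is empty or missing, so "rm -rf" can
# never expand to a top-level path such as /data/workflow.
drat_manual_reset() {
    if [ -z "${DRAT_HOME:-}" ] || [ ! -d "$DRAT_HOME" ]; then
        echo "DRAT_HOME is not set or is not a directory" >&2
        return 1
    fi
    rm -rf "$DRAT_HOME/data/workflow" \
           "$DRAT_HOME/filemgr/catalog" \
           "$DRAT_HOME/solr/drat/data"
    # Empty the archive directory but keep the directory itself.
    rm -rf "$DRAT_HOME"/data/archive/*
}
```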
The following useful environment variables are set by RADIX but can be overridden on a per-install basis. Here is the default config; feel free to change or override it in your own environment.
setenv DRAT_HOME /usr/local/drat/deploy
setenv FILEMGR_URL http://localhost:9000
setenv WORKFLOW_URL http://localhost:9001
setenv RESMGR_URL http://localhost:9002
setenv WORKFLOW_HOME $DRAT_HOME/workflow
setenv FILEMGR_HOME $DRAT_HOME/filemgr
setenv PGE_ROOT $DRAT_HOME/pge
setenv PCS_HOME $DRAT_HOME/pcs
setenv GANGLIA_URL http://zipper.jpl.nasa.gov/ganglia/
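The listing above uses csh/tcsh syntax. For bash users, the same defaults can be written with export, for example in a bash profile. The values below are copied from the listing; only the syntax is translated.

```shell
# bash equivalent of the csh/tcsh defaults above (same values, export syntax).
export DRAT_HOME=/usr/local/drat/deploy
export FILEMGR_URL=http://localhost:9000
export WORKFLOW_URL=http://localhost:9001
export RESMGR_URL=http://localhost:9002
export WORKFLOW_HOME="$DRAT_HOME/workflow"
export FILEMGR_HOME="$DRAT_HOME/filemgr"
export PGE_ROOT="$DRAT_HOME/pge"
export PCS_HOME="$DRAT_HOME/pcs"
export GANGLIA_URL=http://zipper.jpl.nasa.gov/ganglia/
```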
There is now a YouTube video on DRAT explaining its motivation and the results of running it on DARPA XDATA and on the Computational Infrastructure for Geodynamics as part of my NSF project. The video was made for the 2014 Summer Earth Science Information Partners Federation Meeting.