TESTING WITH BAYES
------------------
Dan said: "I think we need guidelines on how to train and mass-check Bayes
using our spam and non-spam corpuses. Maybe you could check something in?
*nudge*". OK then!
If you're testing Bayes, or collating results on a change to the algorithms,
please try to stick to these guidelines:
- Train with at least 1000 spam and 1000 ham messages.
- Try to use at least as many ham mails as spam mails.
- Use mail from your own mail feed rather than public corpora, if possible.
  Many of the important signs are taken from headers and are specific to you
  and your systems.
- Try to train with older messages, and test with newer ones, if possible.
- As with the conventional "mass-check" runs, it's a good idea to avoid spam
  over 6 months old, since older spam uses old techniques that are no longer
  seen in the wild.
- DO NOT test with any of the messages you trained with. This will produce
over-inflated success rates.
These are just guidelines (well, apart from the last one), so they can be
bent slightly if needs be ;)
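For the "train with older, test with newer" guideline, one way to split a
one-file-per-message directory by age is sketched below.  This is illustrative
only: "split_by_age", "train" and "test" are names invented here, file mtime is
used as a stand-in for message date, and filenames are assumed to contain no
whitespace.

```shell
# Illustrative sketch: split a one-file-per-message directory into an older
# "train" set and a newer "test" set, using file mtime as a proxy for age.
# Assumes filenames contain no whitespace.  Not part of the SpamAssassin tools.
split_by_age() {
  msgdir=$1
  mkdir -p train test
  total=$(ls "$msgdir" | wc -l | tr -d ' ')
  ntrain=$((total * 7 / 10))          # oldest 70% for training
  i=0
  for f in $(ls -tr "$msgdir"); do    # -tr: oldest first
    i=$((i + 1))
    if [ "$i" -le "$ntrain" ]; then
      cp "$msgdir/$f" train/
    else
      cp "$msgdir/$f" test/
    fi
  done
}
# usage: split_by_age ~/Mail/deld
```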
A SAMPLE LOG OF A BAYES 10FCV RUN
---------------------------------
First, I made the corpus to test with.
mkdir ch ; cp ~/Mail/deld/10* ch
mkdir cs ; cp ....spam... cs
This is simply one-file-per-message, RFC-2822 format, as usual.
Now, set the SADIR env var to where your SpamAssassin source tree
can be found:
export SADIR=/home/jm/ftp/spamassassin
Then split the test corpus into folds:
mkdir -p cor/ham cor/spam
$SADIR/tools/split_corpora -n 10 -p cor/ham/bucket ch
$SADIR/tools/split_corpora -n 10 -p cor/spam/bucket cs
That reads from "ch" and "cs" and generates mbox files, each containing a 10%
fold, as "cor/ham/bucket{1,2,3,4,5,6,7,8,9,10}" (and likewise for "cor/spam").
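The fold-splitting idea itself is simple; here is a minimal sketch of a
round-robin N-fold split, dealing one-file-per-message input into N mbox
files.  This is not the actual split_corpora implementation (whose mbox
framing and ordering may differ), and it assumes filenames contain no
whitespace.

```shell
# Illustrative N-fold round-robin split: deal one-file-per-message input into
# N mbox files named "PREFIX1" .. "PREFIXN".  Not the real split_corpora.
split_folds() {
  n=$1; prefix=$2; srcdir=$3
  i=0
  for f in "$srcdir"/*; do
    bucket=$((i % n + 1))
    # minimal mbox framing: a "From " separator line before each message
    echo "From fold-split  $(date)" >> "$prefix$bucket"
    cat "$f" >> "$prefix$bucket"
    echo "" >> "$prefix$bucket"
    i=$((i + 1))
  done
}
# usage: split_folds 10 cor/ham/bucket ch
```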
I then created a set of items I wanted to test:
mkdir testdir
mkdir testdir/{base,bug3118} [...etc.]
cp ~/ftp/spamassassin/lib/Mail/SpamAssassin/Bayes.pm testdir/base/Bayes.pm
cp ~/ftp/spamassassin/lib/Mail/SpamAssassin/Bayes.pm testdir/bug3118/Bayes.pm
In other words, I created a directory for each test and copied Bayes.pm into
each one.
I then edited the "Bayes.pm" files in the testdirs to enable whatever tweaks
I wanted to test. "base" remains the same as current SVN, however, so
it acts as a baseline.
Finally, I ran the driver script:
sh -x $SADIR/masses/bayes-testing/run-multiple testdir/*
That takes a long time, running through the dirs doing a 10-fold CV for
each one.
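Conceptually, the 10-fold CV works like this: each bucket in turn is held out
for testing, while the other nine are used for training.  The "cv_plan"
function below is a hypothetical name invented here just to print that plan;
the actual training and scoring done by run-multiple is left out.

```shell
# Sketch of the k-fold cross-validation loop: hold out one bucket at a time,
# train on the rest.  Only prints the plan; training/scoring are placeholders.
cv_plan() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    train=""
    j=1
    while [ "$j" -le "$n" ]; do
      [ "$j" -ne "$i" ] && train="$train bucket$j"
      j=$((j + 1))
    done
    echo "fold $i: test=bucket$i train=$train"
    i=$((i + 1))
  done
}
# usage: cv_plan 10
```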
The results are written to a new directory, "results", inside each test-dir,
and look like this:
: jm 1204...; ls -l base/results/
total 7028
drwxrwxr-x 2 jm jm 4096 Mar 12 02:41 bucket1
drwxrwxr-x 2 jm jm 4096 Mar 12 03:21 bucket10
drwxrwxr-x 2 jm jm 4096 Mar 12 02:46 bucket2
drwxrwxr-x 2 jm jm 4096 Mar 12 02:50 bucket3
drwxrwxr-x 2 jm jm 4096 Mar 12 02:54 bucket4
drwxrwxr-x 2 jm jm 4096 Mar 12 02:59 bucket5
drwxrwxr-x 2 jm jm 4096 Mar 12 03:03 bucket6
drwxrwxr-x 2 jm jm 4096 Mar 12 03:08 bucket7
drwxrwxr-x 2 jm jm 4096 Mar 12 03:12 bucket8
drwxrwxr-x 2 jm jm 4096 Mar 12 03:17 bucket9
drwxrwxr-x 4 jm jm 4096 Mar 12 03:17 config
-rw-rw-r-- 1 jm jm 1401 Mar 12 03:21 hist_all
-rw-rw-r-- 1 jm jm 4424927 Mar 12 03:21 nonspam_all.log
-rw-rw-r-- 1 jm jm 2596942 Mar 12 03:21 spam_all.log
-rw-rw-r-- 1 jm jm 86338 Mar 12 03:21 test.log
-rw-rw-r-- 1 jm jm 1322 Mar 12 12:03 thresholds.static
-rw-rw-r-- 1 jm jm 3192 Mar 12 03:21 thresholds_all
The important items are:
- thresholds.static: FP/FN/Unsure counts of the Bayes score distribution
across all messages. See "THRESHOLDS SCRIPT" below.
- hist_all: An ASCII-art histogram of the Bayes score distribution across all
  messages.  Good for viewing differences at a glance; however, nowadays our
  tweaks all have much less effect than "big ones" like hapax use or
  case-sensitivity did, so it's not as useful as it once was.  See "THE
  HISTOGRAM" below.
- thresholds_all: a version of the thresholds output that is optimized for the
  lowest "cost" figure; basically, it searches the entire score distribution
  for optimal thresholds.  Nowadays we have chosen static thresholds that work
  OK, so this isn't much use any more.
- The "bucket*" dirs, and "nonspam_all.log" and "spam_all.log", can be ignored
  unless you need to dig into the details of why a run didn't work the way you
  expected; they are there for debugging, basically.
"thresholds.static" is by far the most important, containing the
FP/FN figures for various points on the score distribution. That's
what needs to be used to compare different Bayes tweaks.
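A quick way to line up those figures is to pull the cost line out of each
test's thresholds.static.  A sketch, assuming lines of the form "... cost=$NNN.NN"
as printed by the thresholds script ("summarise_costs" is a name invented
here, and the testdir layout is the one used above):

```shell
# Sketch: print the cost figure from each test's thresholds.static, assuming
# "... cost=$804.50"-style lines as the thresholds script prints them.
summarise_costs() {
  for d in "$1"/*/results; do
    cost=$(sed -n 's/.*cost=\$\([0-9.]*\).*/\1/p' "$d/thresholds.static" | head -1)
    echo "$(dirname "$d"): cost=\$$cost"
  done
}
# usage: summarise_costs testdir
```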
THRESHOLDS SCRIPT
-----------------
The "thresholds" script is an emulation of the spambayes testing
methodology: it computes ham/spam hits across a corpus for each
algorithm, then, by dividing those hits into FPs, FNs, and "unsure"s, and
attaching a "cost" to each of those, it computes optimum spam and ham
cutoff points. (It also outputs TCRs.)
Sample output:
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$804.50
Total ham:spam: 39987:23337
FP: 3 0.008% FN: 360 1.543%
Unsure: 4145 6.546% (ham: 193 0.483% spam: 3952 16.934%)
TCRs: l=1 5.408 l=5 5.393 l=9 5.378
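The arithmetic behind those figures can be reproduced as below.  Note that
the cost weightings (FP = $10, FN = $1, unsure = $0.10) and the TCR formula
(spam count divided by lambda*FP plus all missed spam, unsure spam included)
are inferred from this sample output rather than taken from the script's
documentation, so treat them as assumptions:

```shell
# Reproduce the cost and TCR figures from the sample output above.
# Weightings and the TCR formula are inferred from the sample numbers.
awk 'BEGIN {
  nspam = 23337; fp = 3; fn = 360; unsure = 4145; unsure_spam = 3952
  printf "cost=$%.2f\n", 10*fp + 1*fn + 0.10*unsure
  for (lambda = 1; lambda <= 9; lambda += 4)
    printf "TCR l=%d %.3f\n", lambda, nspam / (lambda*fp + fn + unsure_spam)
}'
```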
BTW, the idea of cutoffs is a spambayes one; the range
0.0 .......... ham_cutoff ........ spam_cutoff ......... 1.0
maps to
MAIL IS HAM UNSURE MAIL IS SPAM
SpamAssassin is more sophisticated, turning the Bayes value into scores
across a range of [ -4.0, 4.0 ].  However, the insight the "unsure" value
provides is still useful for visualising the shape of the distribution, even
though we don't use the same scoring system.
But the important thing for our tests is that the threshold results,
together with the histograms, give a good picture of how the algorithm
scatters the results across the table. Ideally, we want
- all ham clustered around 0.0
- all spam clustered around 1.0
- as little ham and spam as possible in the "unsure" middle-ground
So the best algorithms are the ones that are closest to this ideal;
in terms of the results below that means this is the pecking order
for good results, strong indicators first...
- a low cost figure
- low FPs
- low FNs
- low unsures
- a large difference between thresholds
We can then tweak the threshold-to-SpamAssassin-score mapping so that we
maximise the output of the bayes rules in SpamAssassin score terms, by
matching our score ranges to the ham_cutoff and spam_cutoff points.
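One simple such mapping, purely as an illustration (this is not
SpamAssassin's real score assignment): pin the unsure band to 0.0 and stretch
the ham and spam bands linearly to the ends of [ -4.0, 4.0 ].

```shell
# Illustrative piecewise-linear mapping from a Bayes probability p to a score
# in [ -4.0, 4.0 ], given ham/spam cutoffs.  Not SpamAssassin's real mapping.
bayes_to_score() {
  awk -v p="$1" -v h="$2" -v s="$3" 'BEGIN {
    if (p < h)      printf "%.2f\n", -4.0 * (1 - p / h)      # ham band
    else if (p > s) printf "%.2f\n",  4.0 * (p - s) / (1 - s)  # spam band
    else            printf "%.2f\n",  0.0                    # unsure band
  }'
}
# e.g.: bayes_to_score 0.95 0.30 0.70
```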
THE HISTOGRAM
-------------
A histogram from 'draw-bayes-histogram' looks like this:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (99.047%) ..........|.......................................................
0.000 ( 0.977%) ##########|#
0.040 ( 0.145%) .. |
0.040 ( 0.141%) ## |
0.080 ( 0.113%) . |
0.080 ( 0.056%) # |
0.120 ( 0.065%) . |
0.120 ( 0.069%) # |
0.160 ( 0.060%) . |
0.160 ( 0.086%) # |
0.200 ( 0.040%) |
0.200 ( 0.111%) ## |
0.240 ( 0.043%) |
0.240 ( 0.103%) ## |
0.280 ( 0.030%) |
0.280 ( 0.090%) # |
0.320 ( 0.050%) . |
0.320 ( 0.167%) ### |
0.360 ( 0.055%) . |
0.360 ( 0.184%) ### |
0.400 ( 0.048%) . |
0.400 ( 0.184%) ### |
0.440 ( 0.085%) . |
0.440 ( 0.548%) ######## |
0.480 ( 0.195%) .. |
0.480 ( 9.860%) ##########|#######
0.520 ( 0.010%) |
0.520 ( 2.031%) ##########|##
0.560 ( 0.005%) |
0.560 ( 1.268%) ##########|#
0.600 ( 0.003%) |
0.600 ( 1.157%) ##########|#
0.640 ( 0.990%) ##########|#
0.680 ( 0.005%) |
0.680 ( 1.011%) ##########|#
0.720 ( 0.947%) ##########|#
0.760 ( 1.033%) ##########|#
0.800 ( 1.123%) ##########|#
0.840 ( 1.307%) ##########|#
0.880 ( 1.607%) ##########|#
0.920 ( 2.554%) ##########|##
0.960 ( 0.003%) |
0.960 (72.396%) ##########|#######################################################
The format is:
GROUP (PCT%) ZOOM | FULL
"GROUP" is the part of the [ 0.0, 1.0 ] range that the mails fall into.
"PCT%" is the percentage of the corpus that fell into that range.  "FULL" is
the scaled histogram of the number of messages, so you can see at a glance
what the proportions look like; and "ZOOM" is a view of the very bottom of
the histogram, zoomed in by a factor of 10, for closer inspection.
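A row of that format can be rendered with scaling along these lines.  The
column widths and scale factors below are guesses for illustration, not
draw-bayes-histogram's actual constants, and "hist_row" is a name invented
here.

```shell
# Illustrative rendering of one histogram row: a 10-char "zoom" column (about
# 1 char per 0.1%, capped), a "|" divider, then a full-scale bar.
hist_row() {
  awk -v pct="$1" -v ch="$2" 'BEGIN {
    zoom = int(pct * 10 + 0.5); if (zoom > 10) zoom = 10   # zoomed-in column
    full = int(pct * 63 / 100 + 0.5)                       # full-scale column
    for (i = 0; i < zoom; i++) printf "%s", ch
    for (i = zoom; i < 10; i++) printf " "
    printf "|"
    for (i = 0; i < full; i++) printf "%s", ch
    printf "\n"
  }'
}
# e.g.: hist_row 99.047 "." ; hist_row 0.977 "#"
```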