| |
| TESTING WITH BAYES |
| ------------------ |
| |
| Dan said: "I think we need guidelines on how to train and mass-check Bayes |
| using our spam and non-spam corpuses. Maybe you could check something in? |
| *nudge*". OK then! |
| |
| If you're testing Bayes, or collating results on a change to the algorithms, |
| please try to stick to these guidelines: |
| |
| - train with at least 1000 spam and 1000 ham messages |
| |
| - try to use at least as many ham as spam mails. |
| |
| - use mail from your own mail feed rather than public corpora, if possible. |
| Many of the important signs are taken from headers and are specific to you |
| and your systems. |
| |
| - Try to train with older messages, and test with newer, if possible. |
| |
| - As with the conventional "mass-check" runs, avoiding spam over 6 months old |
| is a good idea, as older spam uses old techniques that are no longer seen |
| in the wild. |
| |
| - DO NOT test with any of the messages you trained with. This will produce |
| over-inflated success rates. |
| |
| These are just guidelines (well, apart from the last one), so they can be |
| bent slightly if needs be ;) |
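| |
| The "train with older, test with newer, never test on what you trained |
| with" rules can be sketched in a few lines. This is a minimal Python |
| illustration, not part of the SpamAssassin tools; the function name, the |
| (path, timestamp) input format and the 50/50 split are all assumptions. |

```python
from typing import List, Tuple


def split_by_age(messages: List[Tuple[str, float]],
                 train_fraction: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split (path, timestamp) pairs into older training and newer test sets.

    Hypothetical helper: the timestamp could be the message's received date
    as seconds since the epoch.
    """
    ordered = sorted(messages, key=lambda m: m[1])  # oldest first
    cut = int(len(ordered) * train_fraction)
    train = [path for path, _ in ordered[:cut]]
    test = [path for path, _ in ordered[cut:]]
    # The one rule that must not be bent: no message in both sets.
    assert not set(train) & set(test)
    return train, test


messages = [("ham/3", 300.0), ("ham/1", 100.0), ("ham/4", 400.0), ("ham/2", 200.0)]
train, test = split_by_age(messages)
print(train)  # ['ham/1', 'ham/2']
print(test)   # ['ham/3', 'ham/4']
```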
| |
| |
| |
| A SAMPLE LOG OF A BAYES 10FCV RUN |
| --------------------------------- |
| |
| |
| First, I made the corpus to test with. ("10FCV" is short for 10-fold |
| cross-validation.) |
| |
| mkdir ch ; cp ~/Mail/deld/10* ch |
| mkdir cs ; cp ....spam... cs |
| |
| This is simply one-file-per-message, RFC-2822 format, as usual. |
| |
| Now, set the SADIR env var to where your SpamAssassin source tree |
| can be found: |
| |
| export SADIR=/home/jm/ftp/spamassassin |
| |
| Then split the test corpus into folds: |
| |
| mkdir -p cor/ham cor/spam |
| $SADIR/tools/split_corpora -n 10 -p cor/ham/bucket ch |
| $SADIR/tools/split_corpora -n 10 -p cor/spam/bucket cs |
| |
| That takes the messages from "ch" and "cs" and generates mboxes, each |
| containing a 10% fold, as "cor/ham/bucket{1,2,3,4,5,6,7,8,9,10}" and |
| "cor/spam/bucket{1,2,3,4,5,6,7,8,9,10}". |
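| |
| To show what "10% folds" means, here is a minimal Python sketch that deals |
| messages into 10 buckets round-robin. The round-robin assignment is an |
| assumption for illustration; it is not necessarily how split_corpora |
| assigns messages to buckets. |

```python
def split_into_folds(items, n_folds=10):
    """Deal items into n_folds lists, round-robin, so sizes differ by at most 1."""
    folds = [[] for _ in range(n_folds)]
    for index, item in enumerate(items):
        folds[index % n_folds].append(item)
    return folds


folds = split_into_folds(["msg%d" % i for i in range(25)])
print([len(fold) for fold in folds])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```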
| |
| I then created a set of items I wanted to test: |
| |
| mkdir testdir |
| mkdir testdir/{base,bug3118} [...etc.] |
| cp $SADIR/lib/Mail/SpamAssassin/Bayes.pm testdir/base/Bayes.pm |
| cp $SADIR/lib/Mail/SpamAssassin/Bayes.pm testdir/bug3118/Bayes.pm |
| |
| In other words, I created a directory for each test and copied Bayes.pm |
| into each one. |
| |
| I then edited the "Bayes.pm" files in the testdirs to enable whatever tweaks |
| I wanted to test. "base" remains the same as current SVN, however, so |
| it acts as a baseline. |
| |
| Finally, I ran the driver script: |
| |
| sh -x $SADIR/masses/bayes-testing/run-multiple testdir/* |
| |
| That takes a long time, running through the dirs doing a 10-fold CV for |
| each one. |
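| |
| For readers new to the term, a 10-fold CV in its textbook form holds out |
| each fold in turn, trains on the other nine, and tests on the held-out one. |
| The Python sketch below shows only that rotation. It assumes the textbook |
| scheme and is not taken from the run-multiple script, so check the script |
| itself for what it actually trains and tests on. |

```python
def cross_validation_rounds(folds):
    """Yield (held_out_index, train, test) for each fold in turn."""
    for held_out in range(len(folds)):
        # Train on every fold except the one being held out...
        train = [msg for i, fold in enumerate(folds) if i != held_out
                 for msg in fold]
        # ...and test only on the held-out fold.
        test = list(folds[held_out])
        yield held_out, train, test


# Ten toy folds of three made-up message names each.
folds = [["m%d_%d" % (f, i) for i in range(3)] for f in range(10)]
for held_out, train, test in cross_validation_rounds(folds):
    assert not set(train) & set(test)  # never test on trained messages
    print(held_out, len(train), len(test))
```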
| |
| The results are written to each test-dir in a new directory "results", and |
| look like this: |
| |
| : jm 1204...; ls -l base/results/ |
| total 7028 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 02:41 bucket1 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 03:21 bucket10 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 02:46 bucket2 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 02:50 bucket3 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 02:54 bucket4 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 02:59 bucket5 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 03:03 bucket6 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 03:08 bucket7 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 03:12 bucket8 |
| drwxrwxr-x 2 jm jm 4096 Mar 12 03:17 bucket9 |
| drwxrwxr-x 4 jm jm 4096 Mar 12 03:17 config |
| -rw-rw-r-- 1 jm jm 1401 Mar 12 03:21 hist_all |
| -rw-rw-r-- 1 jm jm 4424927 Mar 12 03:21 nonspam_all.log |
| -rw-rw-r-- 1 jm jm 2596942 Mar 12 03:21 spam_all.log |
| -rw-rw-r-- 1 jm jm 86338 Mar 12 03:21 test.log |
| -rw-rw-r-- 1 jm jm 1322 Mar 12 12:03 thresholds.static |
| -rw-rw-r-- 1 jm jm 3192 Mar 12 03:21 thresholds_all |
| |
| The important items are: |
| |
| - thresholds.static: FP/FN/Unsure counts of the Bayes score distribution |
| across all messages. See "THRESHOLDS SCRIPT" below. |
| |
| - hist_all: An ASCII-art histogram of the Bayes score distribution across all |
| messages. Good for viewing differences at a glance. However, our tweaks |
| nowadays have much less effect than the "big ones" like hapax use or |
| case-sensitivity did, so it is less useful than it used to be. See "THE |
| HISTOGRAM" below. |
| |
| - thresholds_all: a version of the thresholds output that is optimized for |
| the lowest "cost" figure, found by searching the entire score distribution |
| for optimal thresholds. Nowadays we have chosen some static thresholds and |
| they work OK, so this isn't much use any more. |
| |
| - The "bucket*" dirs, "nonspam_all.log", and "spam_all.log" can be ignored |
| unless you need to look in more detail at why a run didn't work the way |
| you expected it would. They are there for debugging, basically. |
| |
| "thresholds.static" is by far the most important, containing the |
| FP/FN figures for various points on the score distribution. That's |
| what needs to be used to compare different Bayes tweaks. |
| |
| |
| THRESHOLDS SCRIPT |
| ----------------- |
| |
| The "thresholds" script is an emulation of the spambayes testing |
| methodology: it computes ham/spam hits across a corpus for each |
| algorithm. Then, by dividing those hits into FPs, FNs, and "unsure"s, and |
| attaching a "cost" to each of those, it computes optimum spam and ham |
| cutoff points. (It also outputs TCRs, total cost ratios.) |
| |
| Sample output: |
| |
| Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$804.50 |
| Total ham:spam: 39987:23337 |
| FP: 3 0.008% FN: 360 1.543% |
| Unsure: 4145 6.546% (ham: 193 0.483% spam: 3952 16.934%) |
| TCRs: l=1 5.408 l=5 5.393 l=9 5.378 |
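| |
| The figures in the sample output above can be reproduced with simple |
| arithmetic. The Python sketch below is not the script's own code. The cost |
| weights (10 per FP, 1 per FN, 0.1 per unsure) and the TCR formula (unsure |
| spam counted as missed spam, "l" as the FP weight) are assumptions inferred |
| from the sample numbers, so check them against the script before relying |
| on them. |

```python
def summarize(n_ham, n_spam, fp, fn, unsure_ham, unsure_spam,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.1):
    """Recompute the sample report's figures from raw counts.

    The default weights are assumptions inferred from the sample output.
    """
    unsure = unsure_ham + unsure_spam

    def tcr(lam):
        # Assumed formula: spam count over weighted errors, with unsure
        # spam treated as not caught.
        errors = lam * fp + fn + unsure_spam
        return n_spam / errors if errors else float("inf")

    return {
        "cost": fp * fp_cost + fn * fn_cost + unsure * unsure_cost,
        "fp_pct": 100.0 * fp / n_ham,
        "fn_pct": 100.0 * fn / n_spam,
        "unsure_pct": 100.0 * unsure / (n_ham + n_spam),
        "tcr": {lam: tcr(lam) for lam in (1, 5, 9)},
    }


s = summarize(n_ham=39987, n_spam=23337, fp=3, fn=360,
              unsure_ham=193, unsure_spam=3952)
print("cost=$%.2f" % s["cost"])                    # cost=$804.50
print("FP %.3f%%  FN %.3f%%" % (s["fp_pct"], s["fn_pct"]))  # FP 0.008%  FN 1.543%
print("Unsure %.3f%%" % s["unsure_pct"])           # Unsure 6.546%
print(" ".join("l=%d %.3f" % (lam, v) for lam, v in s["tcr"].items()))
# l=1 5.408 l=5 5.393 l=9 5.378
```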
| |
| |
| BTW, the idea of cutoffs is a spambayes one; the range |
| |
| 0.0 .......... ham_cutoff ........ spam_cutoff ......... 1.0 |
| |
| maps to |
| |
| MAIL IS HAM UNSURE MAIL IS SPAM |
| |
| SpamAssassin can be more sophisticated, turning the Bayes value into scores |
| across a range of [ -4.0, 4.0 ]. However, the "unsure" count is still a good |
| way to visualise the shape of the graph, even if we don't use the same |
| scoring system. |
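| |
| The cutoff mapping above, and the way hits divide into FPs, FNs and |
| "unsure"s, can be sketched in Python as follows. How a score exactly equal |
| to a cutoff is treated is an assumption here (below ham_cutoff is ham, at |
| or above spam_cutoff is spam), and the function names are made up for the |
| example. |

```python
def classify(score, ham_cutoff=0.30, spam_cutoff=0.70):
    """Map a Bayes value in [0.0, 1.0] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def tally(scored, ham_cutoff=0.30, spam_cutoff=0.70):
    """Count FPs, FNs and unsures from (is_spam, score) pairs."""
    counts = {"fp": 0, "fn": 0, "unsure_ham": 0, "unsure_spam": 0}
    for is_spam, score in scored:
        verdict = classify(score, ham_cutoff, spam_cutoff)
        if verdict == "unsure":
            counts["unsure_spam" if is_spam else "unsure_ham"] += 1
        elif verdict == "spam" and not is_spam:
            counts["fp"] += 1   # ham flagged as spam
        elif verdict == "ham" and is_spam:
            counts["fn"] += 1   # spam passed as ham
    return counts


scored = [(False, 0.01), (False, 0.50), (False, 0.90),
          (True, 0.99), (True, 0.40), (True, 0.10)]
print(tally(scored))
# {'fp': 1, 'fn': 1, 'unsure_ham': 1, 'unsure_spam': 1}
```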
| |
| But the important thing for our tests is that the threshold results, |
| together with the histograms, give a good picture of how the algorithm |
| scatters the results across the table. Ideally, we want |
| |
| - all ham clustered around 0.0 |
| - all spam clustered around 1.0 |
| - as little ham and spam as possible in the "unsure" middle-ground |
| |
| So the best algorithms are the ones that are closest to this ideal. In |
| terms of the thresholds output, the pecking order for good results is as |
| follows, strongest indicators first: |
| |
| - a low cost figure |
| - low FPs |
| - low FNs |
| - low unsures |
| - a large difference between thresholds |
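| |
| One way to apply that pecking order mechanically is a tuple sort key, as in |
| this Python sketch. Reading "strongest first" as a strict lexicographic |
| order is an assumption, and the "tweak" figures are invented for the |
| example; only the "base" figures come from the sample output earlier. |

```python
def ranking_key(run):
    """Lower sorts first: cost, then FPs, FNs, unsures, then a wider gap."""
    gap = run["spam_cutoff"] - run["ham_cutoff"]
    return (run["cost"], run["fp"], run["fn"], run["unsure"], -gap)


runs = {
    # Figures from the sample thresholds output.
    "base": {"cost": 804.5, "fp": 3, "fn": 360, "unsure": 4145,
             "ham_cutoff": 0.30, "spam_cutoff": 0.70},
    # Made-up figures, purely for illustration.
    "tweak": {"cost": 790.0, "fp": 2, "fn": 370, "unsure": 4000,
              "ham_cutoff": 0.30, "spam_cutoff": 0.70},
}
best = min(runs, key=lambda name: ranking_key(runs[name]))
print(best)  # tweak
```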
| |
| We can then tweak the threshold-to-SpamAssassin-score mapping so that we |
| maximise the output of the bayes rules in SpamAssassin score terms, by |
| matching our score ranges to the ham_cutoff and spam_cutoff points. |
| |
| |
| |
| THE HISTOGRAM |
| ------------- |
| |
| |
| A histogram from 'draw-bayes-histogram' looks like this: |
| |
| SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam) |
| 0.000 (99.047%) ..........|....................................................... |
| 0.000 ( 0.977%) ##########|# |
| 0.040 ( 0.145%) .. | |
| 0.040 ( 0.141%) ## | |
| 0.080 ( 0.113%) . | |
| 0.080 ( 0.056%) # | |
| 0.120 ( 0.065%) . | |
| 0.120 ( 0.069%) # | |
| 0.160 ( 0.060%) . | |
| 0.160 ( 0.086%) # | |
| 0.200 ( 0.040%) | |
| 0.200 ( 0.111%) ## | |
| 0.240 ( 0.043%) | |
| 0.240 ( 0.103%) ## | |
| 0.280 ( 0.030%) | |
| 0.280 ( 0.090%) # | |
| 0.320 ( 0.050%) . | |
| 0.320 ( 0.167%) ### | |
| 0.360 ( 0.055%) . | |
| 0.360 ( 0.184%) ### | |
| 0.400 ( 0.048%) . | |
| 0.400 ( 0.184%) ### | |
| 0.440 ( 0.085%) . | |
| 0.440 ( 0.548%) ######## | |
| 0.480 ( 0.195%) .. | |
| 0.480 ( 9.860%) ##########|####### |
| 0.520 ( 0.010%) | |
| 0.520 ( 2.031%) ##########|## |
| 0.560 ( 0.005%) | |
| 0.560 ( 1.268%) ##########|# |
| 0.600 ( 0.003%) | |
| 0.600 ( 1.157%) ##########|# |
| 0.640 ( 0.990%) ##########|# |
| 0.680 ( 0.005%) | |
| 0.680 ( 1.011%) ##########|# |
| 0.720 ( 0.947%) ##########|# |
| 0.760 ( 1.033%) ##########|# |
| 0.800 ( 1.123%) ##########|# |
| 0.840 ( 1.307%) ##########|# |
| 0.880 ( 1.607%) ##########|# |
| 0.920 ( 2.554%) ##########|## |
| 0.960 ( 0.003%) | |
| 0.960 (72.396%) ##########|####################################################### |
| |
| The format is: |
| |
| GROUP (PCT%) ZOOM | FULL |
| |
| "GROUP" is the part of the [ 0.0, 1.0 ] range that the mails are |
| falling into. "PCT%" is the percentage of the corpus that fell into |
| that range. "FULL" is the scaled histogram of the number of messages, |
| so you can see at a glance what the proportions look like. "ZOOM" |
| is a "zoomed-in" view of the very bottom of the histogram, magnified |
| by a factor of 10, for closer inspection. |
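| |
| To make the row format concrete, here is a minimal Python sketch that bins |
| scores into 0.04-wide groups and prints rows in the same layout for one |
| class of mail. It is not draw-bayes-histogram: the bar width of 55, the |
| scaling against the tallest bar and the rounding are all assumptions, so |
| the bar lengths will not necessarily match the real script's output. |

```python
def histogram_rows(scores, mark, bin_milli=40, full_width=55, zoom=10):
    """Return 'GROUP (PCT%) ZOOM|FULL' rows for one class of mail."""
    if not scores:
        return []
    n_bins = 1000 // bin_milli            # 25 groups: 0.000 .. 0.960
    counts = [0] * n_bins
    for score in scores:
        # Work in thousandths to avoid float trouble such as 0.12 / 0.04;
        # a score of exactly 1.0 lands in the last group.
        group = min(int(round(score * 1000)) // bin_milli, n_bins - 1)
        counts[group] += 1
    peak = max(counts)
    rows = []
    for group, count in enumerate(counts):
        if count == 0:
            continue
        full = round(full_width * count / peak)
        # ZOOM is the same bar magnified 10x and clipped to 10 characters.
        zoomed = min(round(zoom * full_width * count / peak), 10)
        rows.append("%.3f (%6.3f%%) %-10s|%s" % (
            group * bin_milli / 1000.0, 100.0 * count / len(scores),
            mark * zoomed, mark * full))
    return rows


for row in histogram_rows([0.0] * 98 + [0.05, 0.5], "."):
    print(row)
```

| The real output interleaves a ham row and a spam row for each group; |
| calling the sketch once with "." and once with "#" gives the two halves. |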
| |