| |
| Ten-pass cross-validation scripts. |
| |
| tenpass/split-log-into-buckets: Split a mass-check logfile into n |
| equally sized buckets, drawing messages evenly from all checked corpora and |
| preserving comments. It keeps the split even by assigning lines to the |
| buckets in round-robin order as each line is read. Output files are named |
| 'out-N.log'. |
| |
| usage: tenpass/split-log-into-buckets 10 < mass-check.log |
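The round-robin assignment described above can be sketched roughly as follows. This is an illustration in Python, not the script itself: the function name split_into_buckets is hypothetical, and copying each '#' comment line into every bucket is an assumption about what "preserving comments" means.

```python
def split_into_buckets(lines, n):
    """Distribute log lines over n buckets in round-robin order."""
    buckets = [[] for _ in range(n)]
    next_bucket = 0
    for line in lines:
        if line.startswith("#"):
            # Assumption: comments are preserved by copying them to every bucket.
            for bucket in buckets:
                bucket.append(line)
            continue
        # Result lines go to the buckets in turn: 0, 1, ..., n-1, 0, 1, ...
        buckets[next_bucket].append(line)
        next_bucket = (next_bucket + 1) % n
    return buckets


def write_buckets(buckets):
    """Write each bucket to 'out-N.log' (numbering from 1 is an assumption)."""
    for number, bucket in enumerate(buckets, start=1):
        with open(f"out-{number}.log", "w") as handle:
            handle.writelines(bucket)
```

Because consecutive lines land in different buckets, a logfile that lists one corpus after another still spreads each corpus across all buckets, and bucket sizes differ by at most one result line.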
| |
| 10pass-run: the workhorse. Generate a corpus, then run this from the |
| 'masses' directory and leave it overnight. Before running it, you will |
| need to change NSBASE and SPBASE at the top of the script so that they |
| point to the basename and path of the split logfiles. |
| |
| usage: tenpass/10pass-run |
| |
| 10pass-compute-tcr: compute TCR (total cost ratio), SpamRecall and |
| SpamPrecision from the results data produced by 10pass-run. |
| |
| usage: tenpass/10pass-compute-tcr |
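For reference, the three figures can be computed from raw counts as in the sketch below. It uses the definitions commonly given in the spam-filtering literature and is not taken from the script: the function name, the argument names and the weight lam (the cost of one false positive relative to one missed spam) are all hypothetical, and the weight the script actually uses is not stated here.

```python
def spam_metrics(spam_caught, spam_missed, ham_flagged, lam):
    """Return (recall, precision, tcr) from raw classification counts.

    spam_caught -- spam messages correctly flagged as spam
    spam_missed -- spam messages that got through (false negatives)
    ham_flagged -- nonspam messages wrongly flagged (false positives)
    lam         -- assumed cost weight of one false positive
    """
    total_spam = spam_caught + spam_missed
    flagged = spam_caught + ham_flagged

    # SpamRecall: share of all spam that was caught.
    recall = spam_caught / total_spam if total_spam else 0.0
    # SpamPrecision: share of flagged messages that really were spam.
    precision = spam_caught / flagged if flagged else 0.0

    # TCR: cost of using no filter (every spam gets through) divided by
    # the weighted cost of the filter's mistakes. Higher is better.
    cost = lam * ham_flagged + spam_missed
    tcr = total_spam / cost if cost else float("inf")
    return recall, precision, tcr
```

Under these definitions a TCR below 1 means the filter's errors cost more than running no filter at all.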
| |