| |
| RESCORING SURVEY: HOW TO TAKE PART |
| ---------------------------------- |
| |
| The tools in this directory are used to optimise the scoring system used for |
| incoming mails, using the perceptron learner described below to search for |
| optimal values. |
| |
| Since this works best with a very large dataset, it would be *great* if you |
| (as a user) could run this and submit the results. |
| |
| The analysis script will not include text from the mails themselves, so |
| it will not give away private details from your mail spool. The only |
| details you'll give away will be your email address (and I promise *NEVER* |
| to give that out or use it for spammy stuff) -- and how many mails you |
| have sitting around in folders! |
| |
| |
| CONDITIONS |
| ---------- |
| |
| 1. First of all, you must be running it on a UNIX system; it's not portable to |
| other OSes yet. Also, it currently reads only UNIX mailbox format files or MH |
| spool directories. |
| |
| 2. This will not work unless you have separated the mail messages you'll be |
| analysing into separate "spam" and "non-spam" piles. It doesn't matter how |
| many mailboxes contain spam, or how many mailboxes contain non-spam; you just |
| need to be sure you know which set is which! |
| |
| That last point is the most important. If you have occasional spams scattered |
| through |
| your mailboxes, or occasional non-spam messages in your trapped spam folder, |
| the analysis will be useless. |
| |
| See the CORPUS_POLICY file for more details. |
| |
| |
| |
| HOW TO SUBMIT RESULTS BACK TO US |
| -------------------------------- |
| |
| See the file CORPUS_SUBMIT in this directory. |
| |
| |
| HOW IT WORKS |
| ------------ |
| |
| If you're interested, here's a quick description of the other tools in this |
| directory and what each one does: |
| |
| mass-check : |
| |
| This script is used to perform "mass checks" of a set of mailboxes, Cyrus |
| folders, and/or MH mail spools. It generates summary lines like this: |
| |
| Y 7 /home/jm/Mail/Sapm/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS |
| |
| or for mailboxes, |
| |
| . 1 /path/to/mbox:<5.1.0.14.2.20011004073932.05f4fd28@localhost> TRACKER_ID,BALANCE_FOR_LONG |
| |
| listing the path to the message or its message ID, its score, and the tests |
| that triggered on that mail. |
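| |
| To make the column layout concrete, here is a minimal Python sketch that |
| splits one summary line into its parts. It is an illustration only, not part |
| of mass-check. The names SummaryLine and parse_summary_line are hypothetical, |
| and the sketch assumes whitespace-separated columns with no spaces in the |
| path or message ID. The first column is kept as-is without interpretation. |
| |
```python
from typing import List, NamedTuple


class SummaryLine(NamedTuple):
    flag: str         # first column ("Y" or "." in the samples); kept as-is
    score: float      # second column
    location: str     # message path, or mbox path plus message ID
    rules: List[str]  # names of the tests that triggered


def parse_summary_line(line: str) -> SummaryLine:
    """Split one summary line into its four columns."""
    # Assumption: columns are separated by whitespace and the location
    # column contains no spaces.
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError("not a summary line: %r" % line)
    flag, score, location = parts[:3]
    # Assumption: a message that triggered no tests has no fourth column.
    rules = parts[3].strip().split(",") if len(parts) == 4 else []
    return SummaryLine(flag, float(score), location, rules)


sample = "Y 7 /home/jm/Mail/Sapm/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS"
parsed = parse_summary_line(sample)
print(parsed.score, parsed.rules)
```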
| |
| Using this info, and a score optimization tool, I can figure out which tests |
| get good hits with few false positives, etc., and re-score the tests to |
| optimise that ratio. |
| |
| This script relies on the spamassassin distribution directory living in "..". |
| |
| |
| logs-to-c : |
| |
| Takes the "spam.log" and "nonspam.log" files and converts them into C |
| source files and simplified data files for use by the C score optimization |
| algorithm. (Called by "make" when you build the perceptron, so generally |
| you won't need to run it yourself.) |
| |
| |
| hit-frequencies : |
| |
| Analyses the log files and computes how often each test hits: overall, |
| for spam mails, and for non-spam mails. |
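| |
| The idea can be sketched in a few lines of Python. This is not the |
| hit-frequencies script itself, and its real output format is not shown here. |
| The function names are hypothetical, and the sketch assumes log lines in the |
| mass-check summary format (tests as a comma-separated fourth column). |
| |
```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rule_names(line: str) -> List[str]:
    """Return the test names from the fourth column, if there is one."""
    parts = line.split(None, 3)
    return parts[3].strip().split(",") if len(parts) == 4 else []


def percent(count: int, total: int) -> float:
    """Percentage of count in total; 0.0 when total is zero."""
    return 100.0 * count / total if total else 0.0


def hit_frequencies(
    spam_lines: Iterable[str], nonspam_lines: Iterable[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Map each test to (overall %, spam %, non-spam %) of messages hit."""
    spam = [line for line in spam_lines if line.strip()]
    nonspam = [line for line in nonspam_lines if line.strip()]

    spam_hits: Counter = Counter()
    nonspam_hits: Counter = Counter()
    for line in spam:
        spam_hits.update(set(rule_names(line)))     # count each test once per mail
    for line in nonspam:
        nonspam_hits.update(set(rule_names(line)))

    total = len(spam) + len(nonspam)
    result = {}
    for name in sorted(set(spam_hits) | set(nonspam_hits)):
        s, n = spam_hits[name], nonspam_hits[name]
        result[name] = (
            percent(s + n, total),
            percent(s, len(spam)),
            percent(n, len(nonspam)),
        )
    return result


# Made-up log lines, for illustration only.
spam_log = ["Y 7 /a SUBJ_ALL_CAPS,SUPERLONG_LINE", "Y 5 /b SUBJ_ALL_CAPS"]
nonspam_log = [". 1 /c SUPERLONG_LINE", ". 0 /d"]
for name, (overall, in_spam, in_nonspam) in hit_frequencies(spam_log, nonspam_log).items():
    print(name, overall, in_spam, in_nonspam)
```
| |
| A test that hits a large share of spam and almost no non-spam is a good |
| candidate for a higher score. |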
| |
| |
| mk-baseline-results : |
| |
| Computes results for the baseline scores (read from ../rules/*). If you |
| provide the name of a config directory as the first argument, it'll use that |
| instead. |
| |
| It will output statistics on the current ruleset to ../rules/STATISTICS.txt, |
| suitable for a release build of SpamAssassin. |
| |
| |
| perceptron.c : |
| |
| Perceptron learner by Henry Stern. See "README.perceptron" for details. |
| |
| |
| -- EOF |