| #!/usr/bin/perl -w -T |
| |
| use strict; |
| |
| use File::Spec; |
| |
| my $PREFIX = '@@PREFIX@@'; # substituted at 'make' time |
| my $DEF_RULES_DIR = '@@DEF_RULES_DIR@@'; # substituted at 'make' time |
| my $LOCAL_RULES_DIR = '@@LOCAL_RULES_DIR@@'; # substituted at 'make' time |
| |
| use lib '@@INSTALLSITELIB@@'; # substituted at 'make' time |
| |
| BEGIN { |
| # Locate locally installed SA libraries *without* using FindBin, which generates |
| # warnings and causes more trouble than it's worth. We don't need to be too |
| # smart about this BTW. |
| my @bin = File::Spec->splitpath($0); |
| my $bin = ($bin[0] ? File::Spec->catpath(@bin[0..1]) : $bin[1]) # /home/jm/foo -> /home/jm |
| || File::Spec->curdir; # foo -> . |
| |
| # check to make sure it wasn't just installed in the normal way. |
| # note that ./lib/Mail/SpamAssassin.pm takes precedence, for |
| # building SpamAssassin on a machine where an old version is installed. |
| if (-e $bin.'/lib/Mail/SpamAssassin.pm' |
| || !-e '@@INSTALLSITELIB@@/Mail/SpamAssassin.pm') |
| { |
| # These are common paths where the SA libs might be found. |
| foreach (qw(lib ../lib/site_perl |
| ../lib/spamassassin ../share/spamassassin/lib)) |
| { |
| my $dir = File::Spec->catdir($bin, split('/', $_)); |
| if(-f File::Spec->catfile($dir, "Mail", "SpamAssassin.pm")) { |
| unshift(@INC, $dir); last; |
| } |
| } |
| } |
| } |
| |
| require Mail::SpamAssassin::CmdLearn; |
| exit Mail::SpamAssassin::CmdLearn::cmdline_run ({ isspam => 0 }); |
| |
| # --------------------------------------------------------------------------- |
| |
| =head1 NAME |
| |
| sa-learn-nonspam - train spamassassin with nonspam data |
| |
| =head1 SYNOPSIS |
| |
| B<sa-learn-nonspam> [options] --file I<message> |
| |
| B<sa-learn-nonspam> [options] --mbox I<mailbox> |
| |
| B<sa-learn-nonspam> [options] --dir I<directory> |
| |
| B<sa-learn-nonspam> [options] --single < I<message> |
| |
| Options: |
| |
| -f file, --folders=file Read list of files/directories from file |
| --dir Learn a directory of RFC 822 files |
| --file Learn a file in RFC 822 format |
| --mbox Learn a file in mbox format |
| --showdots Show progress using dots |
| --no-rebuild Skip building databases after scan |
| -C file, --config-file=file Path to standard configuration dir |
| -p prefs, --prefs-file=file Set user preferences file |
| -a, --auto-whitelist Use auto-whitelists |
| --whitelist-factory Select whitelist factory |
| -D, --debug-level Print debugging messages |
| -V, --version Print version |
| -h, --help Print usage message |
| |
| =head1 DESCRIPTION |
| |
| Given a typical selection of your incoming mail classified as spam or non-spam, |
| this tool will feed each mail to SpamAssassin, allowing it to 'learn' what |
| signs are likely to mean spam, and which are likely to mean non-spam. |
| |
| Simply run this command once for each of your mail folders, and it will |
| 'learn' from the mail therein. If you make a mistake and scan a mail as |
| non-spam when it is spam or vice versa, simply rescan it correctly and the |
| mistake will be corrected. |
| |
| SpamAssassin remembers which mail messages it has already learnt, and will |
| not learn those messages again unless you use the B<sa-forget> command. |
| |
| (Note: this command is not present in a released version of SpamAssassin yet.) |
| |
| =head1 INTRODUCTION TO BAYESIAN FILTERING |
| |
| (Thanks to Michael Bell for this section!) |
| |
| For a lengthier description of how this works, go to |
| http://www.paulgraham.com and see "A Plan for Spam". It's reasonably |
| readable, even if statistics make me break out in hives. |
| |
| The short semi-inaccurate version: Given training, a spam heuristics engine |
| can take the most "spammy" and "nonspammy" words and apply probabilistic |
| analysis. Furthermore, once given a basis for the analysis, the engine can |
| continue to learn iteratively by applying both its non-Bayesian and Bayesian |
| rulesets together to create evolving "intelligence". |
| |
| SpamAssassin 2.50 supports Bayesian spam analysis, in the form of the BAYES |
| rules. This is a new feature, quite powerful, and disabled by default. |
| |
| The pros of Bayesian spam analysis: |
| |
| =over 4 |
| |
| =item Can greatly reduce false positives and false negatives. |
| |
| It learns from your mail, so it's tailored to your unique e-mail flow. |
| |
| =item Once it starts learning, it can continue to learn from SpamAssassin |
| and improve over time. |
| |
| A planned feature is to add 'auto-learning' to SpamAssassin, so that |
| it can reliably self-train. It's not present yet, though. |
| |
| =back |
| |
| And the cons: |
| |
| =over 4 |
| |
| =item It requires a fair amount of work to get running. |
| |
| A lot of hand-sorting of mail is crucial. |
| |
| =item It's hard to explain why a message is or isn't marked as spam. |
| |
| By which I mean: a straightforward rule that matches, say, "VIAGRA" is |
| easy to understand. If it generates a false positive or false negative, |
| it's fairly easy to understand why. |
| |
| With Bayesian analysis, it's all probabilities - "because the past says |
| it's likely as this falls into a probabilistic distribution common to past |
| spam in your systems". Tell that to your users! Tell that to the client |
| when he asks "what can I do to change this". (By the way, the answer in |
| this case is "use whitelisting".) |
| |
| =item It will take disk space and memory. |
| |
| The databases it maintains take quite a lot of resources to store and use. |
| |
| =back |
| |
| =head1 GETTING STARTED |
| |
| Still interested? OK, here are the guidelines for getting this working. |
| |
| First a high-level overview: |
| |
| =over 4 |
| |
| =item Build a significant sample of both spam and non-spam. |
| |
| I suggest several thousand of each, placed in SPAM and NONSPAM directories or |
| mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much |
| better than SpamAssassin on its own. Verify the spamminess/nonspamminess of |
| EVERY message. I urge you to avoid using a publicly available corpus (sample) - |
| this must be taken from YOUR mail server, if it's to be statistically useful. |
| Otherwise, the results may be pretty skewed. |
| |
| =item Use the SpamAssassin tools (more below) B<sa-learn-spam> and |
| B<sa-learn-nonspam> to teach SpamAssassin about these samples, like |
| so: |
| |
| sa-learn-spam /path/to/spam/folder |
| sa-learn-nonspam /path/to/nonspam/folder |
| ... |
| |
| Let SpamAssassin proceed, learning stuff. When it finds spam and non-spam |
| it will add the "interesting tokens" to the database. |
| |
| =item Once you've learnt everything you want in this session, run |
| B<sa-learn-rebuild> to put those changes into the production database. |
| |
| sa-learn-rebuild |
| |
| This will probably take 30 seconds to several minutes to complete, depending on |
| how much mail you trained with, and how fast your machine is. |
| |
| =item If you need SpamAssassin to forget about specific messages, use |
| B<sa-forget>. |
| |
| This can be applied to either spam or non-spam that has been run through the |
| B<sa-learn> processes. It's a bit of a hammer, really, lowering the |
| weighting of the specific tokens in that message (only if that message has |
| been processed before). |
| |
| =item Learning from single messages uses a command like this: |
| |
| cat mailmessage | sa-learn-nonspam --single |
| |
| This is handy for binding to a key in your mail user agent. It's very fast, as |
| all the time-consuming stuff is deferred until you run the C<sa-learn-rebuild> |
| command. |
| |
| =back |
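The steps above can be sketched as a small wrapper script. This is only an illustration, not part of SpamAssassin: the C<run> helper, the C<train_session> function and the C<DRY_RUN>, C<SPAM_DIR> and C<NONSPAM_DIR> variables are hypothetical names, the two paths are the placeholders used earlier in this section, and it assumes B<sa-learn-spam> accepts the same C<--dir> and C<--no-rebuild> switches documented here for B<sa-learn-nonspam>. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of one batch training session (hypothetical wrapper script).
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to really invoke the tools.

# Placeholder paths; adjust them to your own hand-sorted folders.
SPAM_DIR=${SPAM_DIR:-/path/to/spam/folder}
NONSPAM_DIR=${NONSPAM_DIR:-/path/to/nonspam/folder}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

train_session() {
    # Learn both kinds of mail, deferring the slow rebuild step.
    # Assumption: sa-learn-spam takes the same switches as sa-learn-nonspam.
    run sa-learn-spam --no-rebuild --dir "$SPAM_DIR" || return 1
    run sa-learn-nonspam --no-rebuild --dir "$NONSPAM_DIR" || return 1
    # Rebuild once, after all folders have been scanned.
    run sa-learn-rebuild
}

train_session
```

Once the printed commands look right for your setup, running the same script with C<DRY_RUN=0> in the environment executes them instead of printing them.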
| |
| =head1 EFFECTIVE TRAINING |
| |
| Learning filters require training to be effective. If you don't train |
| them, they won't work. In addition, you need to train them with new |
| messages regularly to keep them up-to-date, or their data will become |
| stale and accuracy will suffer. |
| |
| You need to train with both spam I<and> non-spam mails. One type of mail |
| alone will not have any effect. |
| |
| Note that if your mail folders contain things like forwarded spam, |
| discussions of spam-catching rules, etc., this will cause trouble. You |
| should avoid scanning those messages if possible. (An easy way to do this |
| is to move them aside, into a folder which is not scanned.) |
| |
| Also be aware that you should typically aim to train with at least 1000 |
| spam messages and 1000 non-spam messages, if possible. More is better, |
| but anything over about 5000 messages does not improve accuracy |
| significantly in our tests. |
| |
| It's also worth noting that training with a very small quantity of |
| non-spam will produce atrocious results. You should aim to train with at |
| least as much non-spam data as spam (or more if possible!). |
| |
| On an on-going basis, it's best to keep training the filter to make |
| sure it has fresh data to work from. There are various ways to do |
| this: |
| |
| =over 4 |
| |
| =item 1. Supervised learning |
| |
| This means keeping a copy of all or most of your mail, separated into spam |
| and nonspam piles, and periodically re-training using those. It produces |
| the best results, but requires more work from you, the user. |
| |
| (An easy way to do this, by the way, is to create a new folder for |
| 'deleted' messages, and rather than deleting them from other folders, |
| simply move them in there. Then keep all spam in a separate |
| folder and never delete it. As long as you remember to move misclassified |
| mails into the correct folder set, it's easy enough to keep up to date.) |
| |
| =item 2. Unsupervised learning from Bayesian classification |
| |
| Another way to train is to chain the results of the Bayesian classifier |
| back into the training, so it reinforces its own decisions. This is only |
| safe if you then retrain it based on any errors you discover. |
| |
| SpamAssassin does not support this method, due to experimental results |
| which strongly indicate that it does not work well, and since Bayes is |
| only one part of the resulting score presented to the user (while Bayes |
| may have made the wrong decision about a mail, it may have been overridden |
| by another system). |
| |
| =item 3. Unsupervised learning from SpamAssassin rules |
| |
| Also called 'auto-learning' in SpamAssassin. Based on statistical |
| analysis of the SpamAssassin success rates, we can automatically train the |
| Bayesian database with a certain degree of confidence that our training |
| data is accurate. |
| |
| It should be supplemented with some supervised training in addition, if |
| possible. |
| |
| This will be turned on by setting the SpamAssassin configuration parameter |
| C<auto_learn> to 1; note, however, that it is not supported in the current |
| version of SpamAssassin. |
| |
| =back |
| |
| =head1 OPTIONS |
| |
| =over 4 |
| |
| =item B<-a>, B<--auto-whitelist> |
| |
| Use auto-whitelists. While learning, add addresses to the auto-whitelist |
| as appropriate. |
| |
| =item B<-h>, B<--help> |
| |
| Print help message and exit. |
| |
| =item B<-C> I<config>, B<--config-file>=I<config> |
| |
| Read configuration from I<config>. |
| |
| =item B<-p> I<prefs>, B<--prefs-file>=I<prefs> |
| |
| Read user score preferences from I<prefs>. |
| |
| =item B<-D>, B<--debug-level> |
| |
| Produce diagnostic output. |
| |
| =item B<-M> I<factory>, B<--whitelist-factory>=I<factory> |
| |
| Select alternative whitelist factory. |
| |
| =item B<--no-rebuild> |
| |
| Skip the slow rebuilding step which normally takes place after changing |
| database entries. If you plan to scan many folders in a batch, it is faster to |
| use this switch and run B<sa-learn-rebuild> once all the folders have been |
| scanned. |
| |
| =back |
| |
| =head1 INSTALLATION |
| |
| The B<sa-learn-nonspam> command is part of the B<Mail::SpamAssassin> Perl |
| module. Install this as a normal Perl module, using C<perl -MCPAN -e shell>, |
| or by hand. |
| |
| =head1 ENVIRONMENT |
| |
| No environment variables, aside from those used by perl, are required to |
| be set. |
| |
| =head1 SEE ALSO |
| |
| Mail::SpamAssassin(3) |
| spamassassin(1) |
| sa-learn-spam(1) |
| sa-learn-nonspam(1) |
| sa-learn-rebuild(1) |
| sa-forget(1) |
| http://www.paulgraham.com, "A Plan for Spam" |
| |
| http://www.bgl.nu/~glouis/bogofilter/test6000.html, discussion of various Bayes |
| training regimes, including 'train on error' and unsupervised training |
| |
| =head1 AUTHOR |
| |
| Justin Mason E<lt>jm /at/ jmason.orgE<gt> |
| |
| =cut |
| |