| #!/usr/bin/perl -w -T |
| |
| use strict; |
| |
| use File::Spec; |
| |
| # Do 'use vars' instead of my() since CmdLearn looks for these and my() |
| # makes them non-exportable. Doh. |
| use vars qw/ $PREFIX $DEF_RULES_DIR $LOCAL_RULES_DIR /; |
| |
| $PREFIX = '@@PREFIX@@'; # substituted at 'make' time |
| $DEF_RULES_DIR = '@@DEF_RULES_DIR@@'; # substituted at 'make' time |
| $LOCAL_RULES_DIR = '@@LOCAL_RULES_DIR@@'; # substituted at 'make' time |
| |
| use lib '@@INSTALLSITELIB@@'; # substituted at 'make' time |
| |
| BEGIN { |
| # Locate locally installed SA libraries *without* using FindBin, which generates |
| # warnings and causes more trouble than its worth. We don't need to be too |
| # smart about this BTW. |
| my @bin = File::Spec->splitpath($0); |
| my $bin = ($bin[0] ? File::Spec->catpath(@bin[0..1]) : $bin[1]) # /home/jm/foo -> /home/jm |
| || File::Spec->curdir; # foo -> . |
| |
| # check to make sure it wasn't just installed in the normal way. |
| # note that ./lib/Mail/SpamAssassin.pm takes precedence, for |
| # building SpamAssassin on a machine where an old version is installed. |
| if (-e $bin.'/lib/Mail/SpamAssassin.pm' |
| || !-e '@@INSTALLSITELIB@@/Mail/SpamAssassin.pm') |
| { |
| # These are common paths where the SA libs might be found. |
| foreach (qw(lib ../lib/site_perl |
| ../lib/spamassassin ../share/spamassassin/lib)) |
| { |
| my $dir = File::Spec->catdir($bin, split('/', $_)); |
| if(-f File::Spec->catfile($dir, "Mail", "SpamAssassin.pm")) { |
| unshift(@INC, $dir); last; |
| } |
| } |
| } |
| } |
| |
| require Mail::SpamAssassin::CmdLearn; |
| exit Mail::SpamAssassin::CmdLearn::cmdline_run (); |
| |
| # --------------------------------------------------------------------------- |
| |
| =head1 NAME |
| |
| sa-learn - train SpamAssassin's Bayesian classifier |
| |
| =head1 SYNOPSIS |
| |
| B<sa-learn> [options] --file I<message> |
| |
| B<sa-learn> [options] --mbox I<mailbox> |
| |
| B<sa-learn> [options] --dir I<directory> |
| |
| B<sa-learn> [options] --single < I<message> |
| |
| Options: |
| |
| --ham Learn messages as ham (non-spam) |
| --spam Learn messages as spam |
| --forget Forget a message |
| --rebuild Rebuild the database if needed |
| --force-expire Force an expiry run, rebuild every time |
| -f file, --folders=file Read list of files/directories from file |
| --dir Learn a directory of RFC 822 files |
| --file Learn a file in RFC 822 format |
| --mbox Learn a file in mbox format |
| --showdots Show progress using dots |
| --no-rebuild Skip building databases after scan |
| -L, --local Operate locally, no network accesses |
| -C file, --config-file=file Path to standard configuration dir |
| -p prefs, --prefs-file=file Set user preferences file |
| -D, --debug-level Print debugging messages |
| -V, --version Print version |
| -h, --help Print usage message |
| |
| =head1 DESCRIPTION |
| |
| Given a typical selection of your incoming mail classified as spam or ham |
| (non-spam), this tool will feed each mail to SpamAssassin, allowing it |
| to 'learn' what signs are likely to mean spam, and which are likely to |
| mean ham. |
| |
| Simply run this command once for each of your mail folders, and it will |
| ''learn'' from the mail therein. |
| |
| Note that I<globbing> in the mail folder names is supported; in other words, |
| listing a folder name as C<*> will scan every folder that matches. |
| |
| SpamAssassin remembers which mail messages it's learnt already, and will not |
| re-learn those messages again, unless you use the B<--forget> option. |
| |
| If you make a mistake and scan a mail as ham when it is spam, or vice |
| versa, simply rerun this command with the correct classification, and the |
| mistake will be corrected. SpamAssassin will automatically 'forget' the |
| previous indications. |
| |
| =head1 INTRODUCTION TO BAYESIAN FILTERING |
| |
| (Thanks to Michael Bell for this section!) |
| |
| For a more lengthy description of how this works, go to |
| http://www.paulgraham.com and see "A Plan for Spam". It's reasonably |
| readable, even if statistics make me break out in hives. |
| |
| The short semi-inaccurate version: Given training, a spam heuristics engine |
| can take the most "spammy" and "hammy" words and apply probablistic |
| analysis. Furthermore, once given a basis for the analysis, the engine can |
| continue to learn iteratively by applying both it's non-Bayesian and Bayesian |
| ruleset together to create evolving "intelligence". |
| |
| SpamAssassin 2.50 supports Bayesian spam analysis, in the form of the |
| BAYES rules. This is a new feature, quite powerful, and is disabled |
| until enough messages have been learnt. |
| |
| The pros of Bayesian spam analysis: |
| |
| =over 4 |
| |
| =item Can greatly reduce false positives and false negatives. |
| |
| It learns from your mail, so it's tailored to your unique e-mail flow. |
| |
| =item Once it starts learning, it can continue to learn from SpamAssassin |
| and improve over time. |
| |
| =back |
| |
| And the cons: |
| |
| =over 4 |
| |
| =item A decent number of messages are required before results are useful |
| for ham/spam determination. |
| |
| =item It's hard to explain why a message is or isn't marked as spam. |
| |
| i.e.: a straightforward rule, that matches, say, "VIAGRA" is |
| easy to understand. If it generates a false positive or false negative, |
| it's fairly easy to understand why. |
| |
| With Bayesian analysis, it's all probabilities - "because the past says |
| it's likely as this falls into a probablistic distribution common to past |
| spam in your systems". Tell that to your users! Tell that to the client |
| when he asks "what can I do to change this". (By the way, the answer in |
| this case is "use whitelisting".) |
| |
| =item It will take disk space and memory. |
| |
| The databases it maintains take quite a lot of resources to store and use. |
| |
| =back |
| |
| =head1 GETTING STARTED |
| |
| Still interested? Ok, here's the guidelines for getting this working. |
| |
| First a high-level overview: |
| |
| =over 4 |
| |
| =item Build a significant sample of both ham and spam. |
| |
| I suggest several thousand of each, placed in SPAM and HAM directories or |
| mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much |
| better than SpamAssassin on its own. Verify the spamminess/haminess of |
| EVERY message. You're urged to avoid using a publicly available corpus (sample) - |
| this must be taken from YOUR mail server, if it's to be statistically useful. |
| Otherwise, the results may be pretty skewed. |
| |
| =item Use this tool to teach SpamAssassin about these samples, like so: |
| |
| sa-learn --spam /path/to/spam/folder |
| sa-learn --ham /path/to/ham/folder |
| ... |
| |
| Let SpamAssassin proceed, learning stuff. When it finds ham and spam |
| it will add the "interesting tokens" to the database. |
| |
| =item If you need SpamAssassin to forget about specific messages, use |
| the B<--forget> option. |
| |
| This can be applied to either ham or spam that has run through the |
| B<sa-learn> processes. It's a bit of a hammer, really, lowering the |
| weighting of the specific tokens in that message (only if that message has |
| been processed before). |
| |
| =item Learning from single messages uses a command like this: |
| |
| cat mailmessage | sa-learn --ham --no-rebuild --single |
| |
| This is handy for binding to a key in your mail user agent. It's very fast, as |
| all the time-consuming stuff is deferred until you run with the C<--rebuild> |
| option. |
| |
| =item Autolearning is enabled by default |
| |
| If you don't have a corpus of mail saved to learn, you can let |
| SpamAssassin automatically learn the mail that you receive. If you are |
| autolearning from scratch, the amount of mail you receive will determine |
| how long until the BAYES_* rules are activated. |
| |
| =back |
| |
| =head1 EFFECTIVE TRAINING |
| |
| Learning filters require training to be effective. If you don't train |
| them, they won't work. In addition, you need to train them with new |
| messages regularly to keep them up-to-date, or their data will become |
| stale and impact accuracy. |
| |
| You need to train with both spam I<and> ham mails. One type of mail |
| alone will not have any effect. |
| |
| Note that if your mail folders contain things like forwarded spam, |
| discussions of spam-catching rules, etc., this will cause trouble. You |
| should avoid scanning those messages if possible. (An easy way to do this |
| is to move them aside, into a folder which is not scanned.) |
| |
| Another thing to be aware of, is that typically you should aim to train |
| with at least 1000 messages of spam, and 1000 ham messages, if |
| possible. More is better, but anything over about 5000 messages does not |
| improve accuracy significantly in our tests. |
| |
| Be careful that you train from the same source -- for example, if you train |
| on old spam, but new ham mail, then the classifier will think that |
| a mail with an old date stamp is likely to be spam. |
| |
| It's also worth noting that training with a very small quantity of |
| ham, will produce atrocious results. You should aim to train with at |
| least the same amount (or more if possible!) of ham data than spam. |
| |
| On an on-going basis, it's best to keep training the filter to make |
| sure it has fresh data to work from. There are various ways to do |
| this: |
| |
| =over 4 |
| |
| =item 1. Supervised learning |
| |
| This means keeping a copy of all or most of your mail, separated into spam |
| and ham piles, and periodically re-training using those. It produces |
| the best results, but requires more work from you, the user. |
| |
| (An easy way to do this, by the way, is to create a new folder for |
| 'deleted' messages, and instead of deleting them from other folders, |
| simply move them in there instead. Then keep all spam in a separate |
| folder and never delete it. As long as you remember to move misclassified |
| mails into the correct folder set, it's easy enough to keep up to date.) |
| |
| =item 2. Unsupervised learning from Bayesian classification |
| |
| Another way to train is to chain the results of the Bayesian classifier |
| back into the training, so it reinforces its own decisions. This is only |
| safe if you then retrain it based on any errors you discover. |
| |
| SpamAssassin does not support this method, due to experimental results |
| which strongly indicate that it does not work well, and since Bayes is |
| only one part of the resulting score presented to the user (while Bayes |
| may have made the wrong decision about a mail, it may have been overridden |
| by another system). |
| |
| =item 3. Unsupervised learning from SpamAssassin rules |
| |
| Also called 'auto-learning' in SpamAssassin. Based on statistical |
| analysis of the SpamAssassin success rates, we can automatically train the |
| Bayesian database with a certain degree of confidence that our training |
| data is accurate. |
| |
| It should be supplemented with some supervised training in addition, if |
| possible. |
| |
| This is the default, but can be turned off by setting the SpamAssassin |
| configuration parameter C<auto_learn> to 0. |
| |
| =item 4. Mistake-based training |
| |
| This means training on a small number of mails, then only training on |
| messages that SpamAssassin classifies incorrectly. This works, but it |
| takes longer to get it right than a full training session would. |
| |
| =back |
| |
| =head1 OPTIONS |
| |
| =over 4 |
| |
| =item B<--ham> |
| |
| Learn the input message(s) as ham. If you have previously learnt |
| any of the messages as spam, SpamAssassin will forget them first, then |
| re-learn them as ham. Alternatively, if you have previously learnt |
| them as ham, it'll skip them this time around. |
| |
| =item B<--spam> |
| |
| Learn the input message(s) as spam. If you have previously learnt |
| any of the messages as ham, SpamAssassin will forget them first, then |
| re-learn them as spam. Alternatively, if you have previously learnt |
| them as spam, it'll skip them this time around. |
| |
| =item B<--rebuild> |
| |
| Rebuild the databases, typically done after learning with B<--no-rebuild>, |
| or if you wish to periodically clean the Bayes databases once a day on |
| a busy server. |
| |
| =item B<--force-expire> |
| |
| Forces an expiry run, regardless of whether it may be necessary or not. |
| |
| =item B<--forget> |
| |
| Forget a given message previously learnt. |
| |
| =item B<-h>, B<--help> |
| |
| Print help message and exit. |
| |
| =item B<-C> I<config>, B<--config-file>=I<config> |
| |
| Read configuration from I<config>. |
| |
| =item B<-p> I<prefs>, B<--prefs-file>=I<prefs> |
| |
| Read user score preferences from I<prefs>. |
| |
| =item B<-D>, B<--debug-level> |
| |
| Produce diagnostic output. |
| |
| =item B<--no-rebuild> |
| |
| Skip the slow rebuilding step which normally takes place after changing |
| database entries. If you plan to scan many folders in a batch, it is faster to |
| use this switch and run C<sa-learn --rebuild> once all the folders have been |
| scanned. |
| |
| =item B<-L>, B<--local> |
| |
| Do not perform any network accesses while learning details about the mail |
| messages. This will speed up the learning process, but may result in a |
| slightly lower accuracy. |
| |
| Note that this is currently ignored, as current versions of SpamAssassin will |
| not perform network access while learning; but future versions may. |
| |
| =back |
| |
| =head1 FILES |
| |
| B<sa-learn> and the other parts of SpamAssassin's Bayesian learner, |
| use a set of persistent database files to store the learnt tokens, as follows. |
| |
| =over 4 |
| |
| =item bayes_toks |
| |
| The database of tokens, containing the tokens learnt, their count of |
| occurrences in ham and spam, and the message count of the last message |
| they were seen in. |
| |
| This database also contains some 'magic' tokens, as follows: the number of ham |
| and spam messages learnt, the number of tokens in the database, the |
| message-count of the last expiry run, the message-count of the oldest token in |
| the database, and the message-count of the current message (to the nearest |
| 5000). |
| |
| This is a database file, using the first one of the following database modules |
| that SpamAssassin can find in your perl installation: C<DB_File>, C<GDBM_File>, |
| C<NDBM_File>, or C<SDBM_File>. |
| |
| =item bayes_seen |
| |
| A map of message-ID to what that message was learnt as. This is used |
| so that SpamAssassin can avoid re-learning a message it's already seen, |
| and so it can reverse the training if you later decide that message |
| was previously learnt incorrectly. |
| |
| This is a database file, using the first one of the following database modules |
| that SpamAssassin can find in your perl installation: C<DB_File>, C<GDBM_File>, |
| C<NDBM_File>, or C<SDBM_File>. |
| |
| =item bayes_journal |
| |
| While SpamAssassin is scanning mails, it needs to track which tokens it uses in |
| its calculations. So that many processes can read the databases |
| simultaneously, but only one can write at a time, this uses a 'journal' file. |
| |
| When you run C<sa-learn --rebuild>, the journal is read, and the tokens that |
| were accessed during the journal's lifetime will have their last-access time |
| updated in the C<bayes_toks> database. |
| |
| =item bayes_msgcount |
| |
| Every time SpamAssassin accesses a mail message for scanning, or every time |
| the C<sa-learn> command is run, the 'message count' is increased by one. |
| This is used to control expiration of old tokens. |
| |
| Since many processes may be running simultaneously, SpamAssassin does not |
| use a locked database file for this operation; instead, it uses the size |
| of this file as a counter, appending one byte for each message. Once it |
| hits 5000 bytes, the C<bayes_toks> database is locked, and the message |
| counter entry in that database is increased accordingly. |
| |
| =back |
| |
| =head1 EXPIRATION |
| |
| Since SpamAssassin auto-learns, the Bayes database files could increase |
| perpetually until they fill your disk or you run out of memory. To control |
| this, SpamAssassin performs expiration. |
| |
| Every C<bayes_expiry_scan_count> / 2 messages, or when C<sa-learn --rebuild |
| --force-expire> is run, SpamAssassin will attempt an expiry run, as follows. |
| |
| SpamAssassin runs through every token in the database. If that token has not |
| been used during the scanning of the last C<bayes_expiry_scan_count> messages, |
| it is marked for deletion. |
| |
| Next, if that operation would bring the number of tokens below the |
| C<bayes_expiry_min_db_size> threshold, it removes tokens from the for-deletion |
| list until the resulting database would contain C<bayes_expiry_min_db_size> |
| token entries. |
| |
| It then removes the listed tokens and updates the 'last expiry' setting. |
| |
| The SpamAssassin configuration settings which control this operation are: |
| |
| =over 4 |
| |
| =item C<bayes_expiry_min_db_size> is part of the SpamAssassin configuration. |
| The default value is 100000, which is roughly equivalent to a 5Mb database file |
| if you're using DB_File. |
| |
| =item C<bayes_expiry_scan_count> is also part of the SpamAssassin |
| configuration. The default value is 5000. |
| |
| =back |
| |
| =head1 INSTALLATION |
| |
| The B<sa-learn> command is part of the B<Mail::SpamAssassin> Perl module. |
| Install this as a normal Perl module, using C<perl -MCPAN -e shell>, |
| or by hand. |
| |
| =head1 ENVIRONMENT |
| |
| No environment variables, aside from those used by perl, are required to |
| be set. |
| |
| =head1 SEE ALSO |
| |
| Mail::SpamAssassin(3) |
| spamassassin(1) |
| |
| http://www.paulgraham.com/ , Paul Graham's "A Plan For Spam" paper |
| |
| http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html , Gary |
| Robinson's f(x) and combining algorithms, as used in SpamAssassin |
| |
| http://www.bgl.nu/~glouis/bogofilter/test6000.html , discussion of various |
| Bayes training regimes, including 'train on error' and unsupervised training |
| |
| =head1 AUTHOR |
| |
| Justin Mason E<lt>jm /at/ jmason.orgE<gt> |
| |
| =cut |
| |