| SpamAssassin Corpus Policy |
| -------------------------- |
| |
| SpamAssassin relies on corpus data to generate optimal scores. This is |
| the policy used by all corpora accepted by the SpamAssassin project. |
| |
| 1. All mail must be hand-verified into "spam" and "ham" (non-spam) |
| collections. It may not be solely classified using automated |
| spam-classification algorithms such as SpamAssassin and other spam filters. |
| |
| 2. It should not contain old mail. Older spam uses different techniques and |
| legitimate email changes over time as well. Specifically, please try to |
| avoid including spam older than 6 months and ham older than 18 months (12 |
| months is better). |
| |
| 3. It must contain a representative mix of ham. That includes commercial ham |
| messages, legitimate business discussions, and verified opt-in mail |
| newsletters. This is a very important point! |
| |
| 4. It must not contain certain types of mail to limit corpus bias: |
| |
| a. viruses (please check all messages with ClamAV or another anti-virus |
| program to exclude these) |
| |
| b. anti-spam or anti-virus mailing lists, especially SpamAssassin, that |
| frequently include spam and virus elements, even though they are |
| technically ham, these often appear to be spam and will skew the |
| results, rewriting the tests to avoid triggering on these messages is |
| not realistic at this time. |
| |
| c. bounces of viruses or spam sent back to forged or faked from addresses, |
| (so-called blowback or joe-job bounces), these typically have an |
| envelope sender of <> or <MAILER-DAEMON.*>, but please include all valid |
| bounces. |
| |
| d. mailing list moderation administrative messages that contain spam |
| subject lines or excerpts |
| |
| 5. Finally, you should sign an Apache Contributor License Agreement. |
| |
| Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT" |
| for details of how to verify that the top scorers are not accidental spam that |
| got through. |
| |
| lastmod: 2004-09-06 quinlan |