 SpamAssassin Corpus Policy
 --------------------------

 SpamAssassin relies on corpus data to generate optimal scores.  This is
 the policy used by all corpora accepted by the SpamAssassin project.

 1. All mail must be hand-verified into "spam" and "ham" (non-spam)
    collections.  Mail may not be classified solely by automated
    spam-classification software such as SpamAssassin or other spam filters.

 2. The corpus should not contain old mail.  Older spam uses different
    techniques, and legitimate email changes over time as well.  Specifically,
    please try to avoid including spam older than 6 months and ham older than
    18 months (12 months is better).

 3. It must contain a representative mix of ham.  That includes commercial ham
    messages, legitimate business discussions, and verified opt-in mail
    newsletters.  This is a very important point!

 4. To limit corpus bias, it must not contain certain types of mail:

    a. viruses (please check all messages with ClamAV or another anti-virus
       program to exclude these)

    b. anti-spam or anti-virus mailing lists, especially SpamAssassin, that
       frequently include spam and virus elements.  Even though these
       messages are technically ham, they often appear to be spam and will
       skew the results.  Rewriting the tests to avoid triggering on these
       messages is not realistic at this time.

    c. bounces of viruses or spam sent back to forged or faked from addresses
       (so-called blowback or joe-job bounces).  These typically have an
       envelope sender of <> or <MAILER-DAEMON.*>.  Please do include all
       valid bounces.

    d. mailing list moderation administrative messages that contain spam
       subject lines or excerpts

 5. Finally, you should sign an Apache Contributor License Agreement.
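
 The age limits in item 2 and the envelope-sender pattern in item 4c can be
 pre-checked mechanically before hand verification.  The following is only a
 sketch in Python, not part of the SpamAssassin tooling.  The function names
 are hypothetical, the limits are approximated as 183 and 548 days, the Date
 header is assumed to be trustworthy (it is often forged in spam), and the
 Return-Path header is assumed to record the envelope sender.  A match on the
 sender pattern only flags a message for review, since valid bounces belong
 in the corpus.

```python
import re
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Item 2 limits (6 and 18 months), approximated in days; an assumption.
MAX_AGE = {"spam": timedelta(days=183), "ham": timedelta(days=548)}

# Item 4c pattern: envelope sender of <> or <MAILER-DAEMON.*>.
BOUNCE_SENDER = re.compile(r"<(mailer-daemon[^>]*)?>", re.IGNORECASE)


def message_date(raw):
    """Return the Date header as an aware UTC datetime, or None."""
    header = message_from_string(raw)["Date"]
    if header is None:
        return None
    try:
        parsed = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return None  # unparseable date: leave the decision to a human
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def is_too_old(raw, kind, now):
    """True if the message exceeds the age limit for its collection."""
    date = message_date(raw)
    return date is not None and now - date > MAX_AGE[kind]


def has_bounce_sender(raw):
    """True if Return-Path looks like <> or <MAILER-DAEMON...>."""
    return_path = message_from_string(raw)["Return-Path"]
    if return_path is None:
        return False
    return BOUNCE_SENDER.fullmatch(return_path.strip()) is not None


if __name__ == "__main__":
    sample = "Return-Path: <>\nDate: Mon, 01 Sep 2003 10:00:00 +0000\n\nbody\n"
    now = datetime(2004, 9, 6, tzinfo=timezone.utc)
    print(is_too_old(sample, "spam", now), has_bounce_sender(sample))
```

 Messages without a usable Date header are not reported as too old, so they
 still reach a human reviewer rather than being dropped silently.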

 Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
 for details of how to verify that the top scorers are not accidental spam that
 got through.

 lastmod: 2004-09-06 quinlan