Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca

January 8, 2004

1. WHAT IS IT?

This program computes scores for SpamAssassin rules.  It makes use of
data files generated by the suite of scripts in spamassassin/masses.
The program writes the generated scores to a file named
'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is speed: while
the GA requires several hours on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+
system.

This makes incremental updates and score personalization practical for
the end-user and gives developers a better idea of just how useful a
new rule is.

2. OPTIONS

There are four options that can be passed to the perceptron program; a
sketch of how they fit together follows the list.

-p ham_preference

    This increases the number of non-spam messages in the training
    set.  It does this by adding 1 + (number of tests hit) *
    ham_preference instances of each non-spam message to the training
    set.  This is intended to reduce false positives by encouraging
    the training program to look at the harder-to-classify ham
    messages more often.  By default, this parameter is 2.0.

-e num_epochs

    This parameter sets how many passes the perceptron will make
    through the training set before terminating.  On each pass, the
    training set is shuffled and then iterated through.  By default,
    it will make 15 passes.

-l learning_rate

    This parameter sets the learning rate of the perceptron.  The
    error gradient is computed for each instance.  This program uses
    a logsig activation function, y = 1 / (1 + exp(-x)), so the error
    gradient is computed as E(x) = y * (1 - y) * (is_spam - y).  For
    each instance and rule hit in the training set, the scores are
    modified by adding E(x) / (number of tests hit + 1) *
    learning_rate.  The default value is 2.0, but it can be whatever
    you want.

-w weight_decay

    To prevent the scores from getting too high (or even to force
    them down if you want), the scores and network bias are
    multiplied by the weight decay before each epoch.  This is off by
    default (1.0), but it can be useful.
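
As a rough illustration of how these four options fit together, here
is a sketch in C.  The names below (struct options, ham_copies, decay)
are invented for this README and are not the actual data structures
used by the program:

    /* Illustrative sketch of the four tunable parameters and their
     * defaults, as described above. */
    struct options {
        double ham_preference;  /* -p: extra copies of hard ham (2.0) */
        int    num_epochs;      /* -e: passes over the training set (15) */
        double learning_rate;   /* -l: step size for weight updates (2.0) */
        double weight_decay;    /* -w: per-epoch multiplier, 1.0 = off */
    };

    /* Number of copies of a non-spam message placed in the training
     * set: 1 + (number of tests hit) * ham_preference. */
    static int ham_copies(const struct options *opt, int tests_hit)
    {
        return 1 + (int)(tests_hit * opt->ham_preference);
    }

    /* Applied to every weight and to the network bias before each
     * epoch; with the default weight_decay of 1.0 this is a no-op. */
    static double decay(const struct options *opt, double w)
    {
        return w * opt->weight_decay;
    }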

3. HOW DOES IT WORK?

This program implements the "stochastic gradient descent" method of
training a neural network.  It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network.  It consists of
a transfer function and an activation function.  Together, they
simulate the average firing rate of a biological neuron over time.

The transfer function is the sum of the products of the weights and
inputs.  It simulates the membrane potential of a neuron.  When the
accumulated charge on the membrane exceeds a certain threshold, the
neuron fires, sending an electrical impulse down the axon.  This
implementation uses a linear transfer function:

    [1] f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin scoring system works.
If you set the bias to 0, w_i to the score for rule i and x_i to
whether or not rule i is activated by a given message, the transfer
function returns the score of the message.
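
For example, the transfer function [1] might be written in C as
follows (an illustrative sketch, not the code used by this program):

    /* Linear transfer function [1]: the bias plus the weighted sum
     * of the inputs.  With bias = 0, w[i] = the score of rule i and
     * x[i] = 1 if rule i hit, this returns the message's score. */
    static double transfer(double bias, const double w[],
                           const int x[], int n)
    {
        double sum = bias;
        for (int i = 0; i < n; i++)
            sum += w[i] * x[i];
        return sum;
    }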

The activation function simulates the electrical spike travelling down
the axon.  It takes the output of the transfer function as input and
applies some sort of transformation to it.  This implementation uses a
logsig activation function:

    [2] y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer
function to between 0 and 1.  When plotted, it is S-shaped, with
horizontal asymptotes at y=0 as x approaches -infinity and at y=1 as
x approaches infinity.  The slope of the function is greatest at x=0
and tapers off as it approaches the asymptotes.
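
In C, the activation function [2] is a one-liner (again, an
illustrative sketch):

    #include <math.h>

    /* Logsig activation [2]: squashes any real input into (0, 1). */
    static double logsig(double fx)
    {
        return 1.0 / (1.0 + exp(-fx));
    }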

Lastly, the performance of the perceptron needs to be measured using
an error function.  Two error functions are commonly used: mean
squared error and entropic error.  By default, this implementation
uses mean squared error, but entropic error may be substituted by
adding a compiler directive.
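
Both error functions are simple to state in C.  The
USE_ENTROPIC_ERROR macro name below is invented for this sketch; it
stands in for whatever compiler directive the real build uses:

    #include <math.h>

    /* Per-instance error.  y is the network output and t is the
     * target: 1 for spam, 0 for non-spam. */
    #ifdef USE_ENTROPIC_ERROR
    static double error(double y, double t)
    {
        /* Entropic (cross-entropy) error. */
        return -(t * log(y) + (1.0 - t) * log(1.0 - y));
    }
    #else
    static double error(double y, double t)
    {
        /* Squared error; averaging over the training set gives the
         * mean squared error. */
        double d = t - y;
        return d * d;
    }
    #endif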

The most common method of training neural networks is called gradient
descent.  It involves iteratively tuning the parameters of the network
so that the mean error rate always decreases.  This is done by moving
the parameters in the direction of steepest descent, i.e. along the
negative of the "error gradient," reducing the value of the error
function for the next iteration of the algorithm.

If the transfer function and activation function are both
differentiable, the error gradient of a neural network can be
calculated with respect to the weights and bias.  Without getting into
calculus, the error gradient for a perceptron with a linear transfer
function, logsig activation function and mean squared error function
is:

    [3] E(x) = y(x) * (1 - y(x)) * (y_expected - y(x))

The weights are updated using the function:

    [4] w_i = w_i + E(x) * x_i * learning_rate
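
Put together, a single training update following [3] and [4] might
look like this (an illustrative sketch; the function name and
signature are invented, and the bias update assumes the usual fixed
input of 1):

    /* One gradient-descent step for a single training instance.
     * y is the logsig output for the instance, target is 1 for spam
     * and 0 for non-spam, and x[i] is 1 when rule i hit. */
    static void update_weights(double w[], double *bias,
                               const int x[], int n, double y,
                               double target, double learning_rate)
    {
        double e = y * (1.0 - y) * (target - y);    /* [3] */
        for (int i = 0; i < n; i++)
            w[i] += e * x[i] * learning_rate;       /* [4] */
        *bias += e * learning_rate;                 /* bias input is 1 */
    }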

Since the SpamAssassin rule hits are sparse, the basic gradient
descent algorithm is impractical.  This implementation uses a
variation called "stochastic gradient descent": instead of doing one
batch update per epoch, the training set is walked through in random
order, doing an incremental update for each instance.  In addition, in
this implementation the learning rate is divided by one plus the
number of rule hits for a given training instance.  Together, these
allow good weights to be computed for infrequently-occurring rules.
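
The random walk can be done by shuffling an array of instance indices
before each epoch, e.g. with a Fisher-Yates shuffle (an illustrative
sketch):

    #include <stdlib.h>

    /* Visit the training set in a fresh random order each epoch;
     * this per-instance, random-order updating is what makes the
     * descent "stochastic". */
    static void shuffle(int order[], int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }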

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format.  Weights are
converted to scores using this function:

    [5] score(weight) = -threshold * weight / bias

This scaling maps the network's decision boundary, where the weighted
sum of the inputs equals -bias, onto the spam threshold: a message
sitting exactly on the boundary receives a score exactly equal to the
threshold.
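
For example (an illustrative sketch; SpamAssassin's default spam
threshold of 5.0 is assumed):

    /* Convert a learned weight into a SpamAssassin score using [5].
     * threshold is the spam cutoff (5.0 by default) and bias is the
     * trained network bias. */
    static double weight_to_score(double weight, double bias,
                                  double threshold)
    {
        return -threshold * weight / bias;
    }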

4. ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not
getting mad at me while I worked on this.  I'd also like to thank
Thomas Trappenberg for his invaluable assistance while I was tweaking
the performance of the learning algorithm.  I would like to thank
Daniel Quinlan, Justin Mason and Theo Van Dinter for their valuable
input and constructive criticism.

--
hs
8/1/2004


Updates
2007-07-01: http://old.nabble.com/Re%3A-Spam-research-tp11386525p11386525.html
2010-04-24: http://old.nabble.com/hardware-on-ruleqa-tp28352058p28353585.html
See also http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376