Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca

January 8, 2004

1. WHAT IS IT?

This program computes scores for SpamAssassin rules.  It makes use of
data files generated by the suite of scripts in spamassassin/masses.
The program writes the generated scores to a file named
'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is speed: while
the GA requires several hours on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+
system.

This makes incremental updates and score personalization practical for
the end-user and gives developers a better idea of just how useful a
new rule is.

2. OPTIONS

There are four options that can be passed to the perceptron program; a
sketch of how they fit together follows the list.

-p ham_preference

    This increases the number of non-spam messages in the training
    set.  It does this by adding 1 + (number of tests hit) *
    ham_preference instances of each non-spam message to the training
    set.  This is intended to reduce false positives by encouraging
    the training program to look at the harder-to-classify ham
    messages more often.  By default, this parameter is 2.0.

-e num_epochs

    This parameter sets how many passes the perceptron will make
    through the training set before terminating.  On each pass, the
    training set is shuffled and then iterated through.  By default,
    it will make 15 passes.

-l learning_rate

    This parameter sets the learning rate of the perceptron.  The
    error gradient is computed for each instance.  This program uses
    a logsig activation function, y = 1 / (1 + exp(-x)), so the error
    gradient is computed as E(x) = y * (1 - y) * (is_spam - y).  For
    each instance and rule hit in the training set, the scores are
    modified by adding E(x) / (number of tests hit + 1) *
    learning_rate.  The default value is 2.0, but it can be whatever
    you want.

-w weight_decay

    To prevent the scores from getting too high (or even to force
    them down if you want), the scores and network bias are
    multiplied by the weight decay before each epoch.  This is off by
    default (1.0), but it can be useful.
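
As a rough illustration of how these four options fit together, here
is a sketch in C.  The names below (struct options, ham_copies, decay)
are invented for this README and are not the actual data structures
used by the program:

    /* Illustrative sketch of the four tunable parameters and their
     * defaults, as described above. */
    struct options {
        double ham_preference;  /* -p: extra copies of hard ham (2.0) */
        int    num_epochs;      /* -e: passes over the training set (15) */
        double learning_rate;   /* -l: step size for weight updates (2.0) */
        double weight_decay;    /* -w: per-epoch multiplier, 1.0 = off */
    };

    /* Number of copies of a non-spam message placed in the training
     * set: 1 + (number of tests hit) * ham_preference. */
    static int ham_copies(const struct options *opt, int tests_hit)
    {
        return 1 + (int)(tests_hit * opt->ham_preference);
    }

    /* Applied to every weight and to the network bias before each
     * epoch; with the default weight_decay of 1.0 this is a no-op. */
    static double decay(const struct options *opt, double w)
    {
        return w * opt->weight_decay;
    }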

3. HOW DOES IT WORK?

This program implements the "stochastic gradient descent" method of
training a neural network.  It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network.  It consists of
a transfer function and an activation function.  Together, they
simulate the average firing rate of a biological neuron over time.

The transfer function is the sum of the products of the weights and
inputs.  It simulates the membrane potential of a neuron.  When the
accumulated charge on the membrane exceeds a certain threshold, the
neuron fires, sending an electrical impulse down the axon.  This
implementation uses a linear transfer function:

    [1] f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin scoring system works.
If you set the bias to 0, w_i to the score for rule i and x_i to
whether or not rule i is activated by a given message, the transfer
function returns the score of the message.
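
For example, the transfer function [1] might be written in C as
follows (an illustrative sketch, not the code used by this program):

    /* Linear transfer function [1]: the bias plus the weighted sum
     * of the inputs.  With bias = 0, w[i] = the score of rule i and
     * x[i] = 1 if rule i hit, this returns the message's score. */
    static double transfer(double bias, const double w[],
                           const int x[], int n)
    {
        double sum = bias;
        for (int i = 0; i < n; i++)
            sum += w[i] * x[i];
        return sum;
    }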

The activation function simulates the electrical spike travelling down
the axon.  It takes the output of the transfer function as input and
applies some sort of transformation to it.  This implementation uses a
logsig activation function:

    [2] y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer
function to between 0 and 1.  When plotted, it is S-shaped, with
horizontal asymptotes at y=0 as x approaches -infinity and at y=1 as
x approaches infinity.  The slope of the function is greatest at x=0
and tapers off as it approaches the asymptotes.
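
In C, the activation function [2] is a one-liner (again, an
illustrative sketch):

    #include <math.h>

    /* Logsig activation [2]: squashes any real input into (0, 1). */
    static double logsig(double fx)
    {
        return 1.0 / (1.0 + exp(-fx));
    }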

Lastly, the performance of the perceptron needs to be measured using
an error function.  Two error functions are commonly used: mean
squared error and entropic error.  By default, this implementation
uses mean squared error, but entropic error may be substituted by
adding a compiler directive.
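
Both error functions are simple to state in C.  The
USE_ENTROPIC_ERROR macro name below is invented for this sketch; it
stands in for whatever compiler directive the real build uses:

    #include <math.h>

    /* Per-instance error.  y is the network output and t is the
     * target: 1 for spam, 0 for non-spam. */
    #ifdef USE_ENTROPIC_ERROR
    static double error(double y, double t)
    {
        /* Entropic (cross-entropy) error. */
        return -(t * log(y) + (1.0 - t) * log(1.0 - y));
    }
    #else
    static double error(double y, double t)
    {
        /* Squared error; averaging over the training set gives the
         * mean squared error. */
        double d = t - y;
        return d * d;
    }
    #endif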

The most common method of training neural networks is called gradient
descent.  It involves iteratively tuning the parameters of the network
so that the mean error rate always decreases.  This is done by moving
the parameters in the direction of steepest descent, i.e. along the
negative of the "error gradient," reducing the value of the error
function for the next iteration of the algorithm.

If the transfer function and activation function are both
differentiable, the error gradient of a neural network can be
calculated with respect to the weights and bias.  Without getting into
calculus, the error gradient for a perceptron with a linear transfer
function, logsig activation function and mean squared error function
is:

    [3] E(x) = y(x) * (1 - y(x)) * (y_expected - y(x))

The weights are updated using the function:

    [4] w_i = w_i + E(x) * x_i * learning_rate
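
Put together, a single training update following [3] and [4] might
look like this (an illustrative sketch; the function name and
signature are invented, and the bias update assumes the usual fixed
input of 1):

    /* One gradient-descent step for a single training instance.
     * y is the logsig output for the instance, target is 1 for spam
     * and 0 for non-spam, and x[i] is 1 when rule i hit. */
    static void update_weights(double w[], double *bias,
                               const int x[], int n, double y,
                               double target, double learning_rate)
    {
        double e = y * (1.0 - y) * (target - y);    /* [3] */
        for (int i = 0; i < n; i++)
            w[i] += e * x[i] * learning_rate;       /* [4] */
        *bias += e * learning_rate;                 /* bias input is 1 */
    }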

Since the SpamAssassin rule hits are sparse, the basic gradient
descent algorithm is impractical.  This implementation uses a
variation called "stochastic gradient descent": instead of doing one
batch update per epoch, the training set is walked through in random
order, doing an incremental update for each instance.  In addition, in
this implementation the learning rate is divided by one plus the
number of rule hits for a given training instance.  Together, these
allow good weights to be computed for infrequently-occurring rules.
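
The random walk can be done by shuffling an array of instance indices
before each epoch, e.g. with a Fisher-Yates shuffle (an illustrative
sketch):

    #include <stdlib.h>

    /* Visit the training set in a fresh random order each epoch;
     * this per-instance, random-order updating is what makes the
     * descent "stochastic". */
    static void shuffle(int order[], int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }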

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format.  Weights are
converted to scores using this function:

    [5] score(weight) = -threshold * weight / bias

This scaling maps the network's decision boundary, where the weighted
sum of the inputs equals -bias, onto the spam threshold: a message
sitting exactly on the boundary receives a score exactly equal to the
threshold.
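
For example (an illustrative sketch; SpamAssassin's default spam
threshold of 5.0 is assumed):

    /* Convert a learned weight into a SpamAssassin score using [5].
     * threshold is the spam cutoff (5.0 by default) and bias is the
     * trained network bias. */
    static double weight_to_score(double weight, double bias,
                                  double threshold)
    {
        return -threshold * weight / bias;
    }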

4. ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not
getting mad at me while I worked on this.  I'd also like to thank
Thomas Trappenberg for his invaluable assistance while I was tweaking
the performance of the learning algorithm.  I would like to thank
Daniel Quinlan, Justin Mason and Theo Van Dinter for their valuable
input and constructive criticism.

--
hs
8/1/2004


Updates
2007-07-01: http://old.nabble.com/Re%3A-Spam-research-tp11386525p11386525.html
2010-04-24: http://old.nabble.com/hardware-on-ruleqa-tp28352058p28353585.html
See also http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376