blob: 815adff440d5c1ae6f1808ae74b59a7fadee695b [file] [log] [blame]
Genetic Algorithm Optimizer for Meta Rules
Henry Stern
Anti-Spam Engineering
McAfee International
Alton House
Gatehouse Way
Aylesbury, Bucks
HP19 8YD
April 9, 2005
1. WHAT IS IT?
This program is used to optimize phrase-based meta rules such as
ADVANCE_FEE and NIGERIAN for performance by selecting a subset of the
candidate body rules.
It selects rule sets based on combined hit rates, false positive rates and
a desired number of rules.
This program requres the GAUL: Genetic Algorithm Utility Library which can be
obtained from http://gaul.sourceforge.net/. It is licensed under the GPL.
2. OPTIONS
Config parameters:
-h hits_file
Path to the compressed matrix containing the rule hits.
Default: hits.dat
-r rules_fule
Path to the file containing the rule names corresponding to columns in the compressed matrix.
Default: rules.dat
Fitness function parameters:
-m maximum_relevant_hits
Stop counting hits after seeing this many. The rule analogue of this
is:
ADVANCE_FEE_1, ADVANCE_FEE_2, ..., ADVANCE_FEE_m
Default: 4
-t target_num_rules
How many sub-rules should be used by the meta rule.
Default: 50
-l target_flex_rules
Solutions with target_num_rules +/- target_flex_rules are half as
fit as solutions with target_num_rules.
Default: 5
-e hits_exponent
Parameter to the fitness function, how the importance of high numbers of
hits is.
Default: 3.0
-p penalty_exponent
If rules hit ham, this exponential penalty is applied based on the
number of hits.
Default: 9.0
GA parameters:
-s population_size
How many individuals should be used in the simulation.
Default: 100
-g max_generations
How many geenrations that the simulation should run for.
Default: 10000
-x crossover_prob
The probability of an allele-mixing cross-over.
Default: 1.0
-u mutation_prob
The probability of a one-allele mutation.
Default: 0.1
3. HOW DOES IT WORK?
For every generation, "Parents" are selected based on their fitness. Parents
are "mated" to produce two children. The alleles on each chromosome are
randomly selected, which isn't very biologically plausible, but it works.
Fitness of an individual is evaluated based on hit rates, false positive rates
and how close the individual is to the target number of rules.
--
hs
9/5/2005