| |
| Genetic Algorithm Optimizer for Meta Rules |
| |
| Henry Stern |
| Anti-Spam Engineering |
| McAfee International |
| Alton House |
| Gatehouse Way |
| Aylesbury, Bucks |
| HP19 8YD |
| |
| April 9, 2005 |
| |
| 1. WHAT IS IT? |
| |
| This program is used to optimize phrase-based meta rules such as |
| ADVANCE_FEE and NIGERIAN for performance by selecting a subset of the |
| candidate body rules. |
| |
| It selects rule sets based on combined hit rates, false positive rates and |
| a desired number of rules. |
| |
| This program requires GAUL, the Genetic Algorithm Utility Library, which can be
| obtained from http://gaul.sourceforge.net/. It is licensed under the GPL. |
| |
| 2. OPTIONS |
| |
| Config parameters: |
| -h hits_file |
| Path to the compressed matrix containing the rule hits. |
| Default: hits.dat |
| |
| -r rules_file
| Path to the file containing the rule names corresponding to columns in the compressed matrix. |
| Default: rules.dat |
| |
| Fitness function parameters: |
| -m maximum_relevant_hits |
| Stop counting hits after seeing this many. The rule analogue of this |
| is: |
| ADVANCE_FEE_1, ADVANCE_FEE_2, ..., ADVANCE_FEE_m |
| Default: 4 |
| |
| -t target_num_rules |
| How many sub-rules should be used by the meta rule. |
| Default: 50 |
| |
| -l target_flex_rules |
| Solutions with target_num_rules +/- target_flex_rules are half as |
| fit as solutions with target_num_rules. |
| Default: 5 |
| |
| -e hits_exponent |
| Exponent in the fitness function that controls how much importance is
| given to high numbers of hits.
| Default: 3.0 |
| |
| -p penalty_exponent |
| If rules hit ham, this exponential penalty is applied based on the |
| number of hits. |
| Default: 9.0 |
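| 
| As an illustration of -t and -l, here is a minimal sketch of a size factor
| that is 1.0 at target_num_rules and 0.5 at target_num_rules +/-
| target_flex_rules. The shape of the curve (0.5 raised to the squared,
| scaled deviation) and the name size_factor are assumptions made for this
| example; this document only states the half-fitness property, not the
| formula the program uses.

```python
def size_factor(num_rules, target_num_rules=50, target_flex_rules=5):
    """Return 1.0 at the target size and 0.5 at target +/- flex.

    Assumed curve for illustration only; any curve with the same
    half-fitness points would match the description of -t and -l.
    """
    # Deviation from the target, measured in units of target_flex_rules.
    deviation = (num_rules - target_num_rules) / target_flex_rules
    # One flex unit away halves the factor; further away it drops faster.
    return 0.5 ** (deviation ** 2)


if __name__ == "__main__":
    for n in (40, 45, 50, 55, 60):
        print(n, size_factor(n))
```

| With the defaults, rule sets of 45 or 55 rules get a factor of 0.5, and
| sets further from 50 are discounted more steeply.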
| |
| GA parameters: |
| -s population_size |
| How many individuals should be used in the simulation. |
| Default: 100 |
| |
| -g max_generations |
| How many generations the simulation should run for.
| Default: 10000 |
| |
| -x crossover_prob |
| The probability of an allele-mixing cross-over. |
| Default: 1.0 |
| |
| -u mutation_prob |
| The probability of a one-allele mutation. |
| Default: 0.1 |
| |
| 3. HOW DOES IT WORK? |
| |
| In every generation, "parents" are selected based on their fitness. Each pair
| of parents is "mated" to produce two children. The alleles on each child's
| chromosome are picked at random from the two parents, which isn't very
| biologically plausible, but it works.
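| 
| The mating step can be sketched roughly as follows. This is not the
| program's code: it assumes a chromosome is a list of 0/1 alleles, one per
| candidate rule, and the function names uniform_crossover and
| mutate_one_allele are made up for this example.

```python
import random


def uniform_crossover(parent_a, parent_b, rng=random):
    """Mate two equal-length chromosomes and return two children.

    Each allele position is handed to one child or the other at random,
    so the children are complementary mixes of the two parents.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of alleles")
    child_a, child_b = [], []
    for allele_a, allele_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            child_a.append(allele_a)
            child_b.append(allele_b)
        else:
            child_a.append(allele_b)
            child_b.append(allele_a)
    return child_a, child_b


def mutate_one_allele(chromosome, rng=random):
    """Return a copy of the chromosome with one randomly chosen bit flipped."""
    mutated = list(chromosome)
    position = rng.randrange(len(mutated))
    mutated[position] = 1 - mutated[position]
    return mutated


if __name__ == "__main__":
    rng = random.Random(2005)
    mother = [1, 1, 1, 1, 0, 0, 0, 0]
    father = [0, 0, 1, 1, 1, 1, 0, 0]
    print(uniform_crossover(mother, father, rng))
    print(mutate_one_allele(mother, rng))
```

| In a full run, crossover_prob and mutation_prob would decide how often each
| of these operators is applied.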
| |
| Fitness of an individual is evaluated based on hit rates, false positive rates |
| and how close the individual is to the target number of rules. |
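| 
| One way the three ingredients could be combined is sketched below. The exact
| formula is not given in this document, so everything here is an assumption
| for illustration: the hit matrices are plain lists of 0/1 rows (one row per
| message, one column per candidate rule), spam hits are rewarded as capped
| hits raised to hits_exponent, ham hits are penalized as capped hits raised
| to penalty_exponent, and the size term halves at target_num_rules +/-
| target_flex_rules.

```python
def capped_hits(message_hits, selected, max_relevant_hits=4):
    """Count hits from the selected rules on one message, up to the cap."""
    hits = sum(1 for hit, use in zip(message_hits, selected) if hit and use)
    return min(hits, max_relevant_hits)


def fitness(selected, spam, ham, max_relevant_hits=4, hits_exponent=3.0,
            penalty_exponent=9.0, target_num_rules=50, target_flex_rules=5):
    """Score a rule subset; higher is better (assumed formula, not the program's)."""
    # Reward: messages hit by several selected rules count much more.
    reward = sum(capped_hits(row, selected, max_relevant_hits) ** hits_exponent
                 for row in spam)
    # Penalty: any hit on ham is punished with a steeper exponent.
    penalty = sum(capped_hits(row, selected, max_relevant_hits) ** penalty_exponent
                  for row in ham)
    # Closeness to the desired rule count: 1.0 on target, 0.5 at +/- flex.
    deviation = (sum(selected) - target_num_rules) / target_flex_rules
    closeness = 0.5 ** (deviation ** 2)
    return reward * closeness / (1.0 + penalty)


if __name__ == "__main__":
    spam = [[1, 1, 0], [0, 1, 1]]
    ham = [[0, 0, 1]]
    for subset in ([1, 1, 0], [1, 1, 1]):
        print(subset, fitness(subset, spam, ham,
                              target_num_rules=2, target_flex_rules=1))
```

| In this toy data the third rule hits ham, so the subset that leaves it out
| scores higher even though it hits the second spam message only once.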
| |
| -- |
| hs |
| 9/5/2005 |