It's a very sound theory, but the other half of it is that you should score each technique according to its precision. Off the top of my head I'd qualify that further: techniques with low false positive rates should have high scores even if they have high false negative rates. (At least this is true of "boolean" techniques like "is this IP address in blacklist X".)
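To make the precision point concrete, here's a rough sketch of what I mean. It is not how SA actually assigns scores; the function names, the 5.0 ceiling, and the assumption of a roughly balanced spam/ham test corpus (so that 50% precision means "no better than a coin flip") are all mine, purely for illustration.

```python
def precision(spam_hits, ham_hits):
    """Fraction of a rule's hits that were actually spam."""
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0


def rule_score(spam_hits, ham_hits, max_score=5.0):
    """Hypothetical mapping from precision to a score.

    Assumes a balanced corpus: precision 0.5 maps to 0 and
    precision 1.0 maps to max_score. False negatives (spam the
    rule missed) do not enter into it at all.
    """
    p = precision(spam_hits, ham_hits)
    return max(0.0, (p - 0.5) * 2 * max_score)


# A blacklist that catches only 200 spam messages but never hits ham
# still earns the top score, because every hit it makes is right.
blacklist_score = rule_score(spam_hits=200, ham_hits=0)

# A pattern that fires far more often but is wrong 10% of the time
# ends up with a lower score despite catching much more spam.
pattern_score = rule_score(spam_hits=9000, ham_hits=1000)

print(blacklist_score, pattern_score)
```

The thing to notice is that the number of spam messages a rule misses never appears in the calculation, which is the "high false negative rate is OK" part.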
I haven't read the theory behind SA as a whole, but really, the entire scoring system would benefit from automatic rescoring, not just the text pattern matching. It's not clear to me whether SA under Zimbra does this, but when I have time I'll look into it more closely.
Basically, you want each technique to have a score that represents how much independent confirmation it offers. E.g. if 100% of email with the string "buy c1ali$" also contained links in the URIBL, and all of the email with links in the URIBL was spam, then you'd give a low score to "buy c1ali$" and a high score to URIBL: URIBL already has all the information you need, and the string match adds nothing on top of it.
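Here's a toy version of the URIBL example, just to show what "independent confirmation" would look like if you counted it. The corpus, the rule names `URIBL` and `PATTERN`, and the helper function are made up for illustration; a real rescoring run would work on a large hand-classified corpus and consider all the rules together, not one pair at a time.

```python
def independent_hits(messages, rule, other):
    """Count (spam, ham) messages hit by `rule` but NOT by `other`.

    These are the messages where `rule` tells you something that
    `other` did not already tell you.
    """
    spam = ham = 0
    for hits, is_spam in messages:
        if rule in hits and other not in hits:
            if is_spam:
                spam += 1
            else:
                ham += 1
    return spam, ham


# Hypothetical corpus: each entry is (set of rules that fired, is_spam).
# Every message matching PATTERN ("buy c1ali$") also has a URIBL link,
# and every message with a URIBL link is spam.
messages = (
    [(frozenset({"URIBL", "PATTERN"}), True)] * 3
    + [(frozenset({"URIBL"}), True)] * 2
    + [(frozenset(), False)] * 5
)

# PATTERN never fires on its own, so it confirms nothing new.
pattern_alone = independent_hits(messages, "PATTERN", "URIBL")

# URIBL catches spam that PATTERN misses, so it carries the information.
uribl_alone = independent_hits(messages, "URIBL", "PATTERN")

print(pattern_alone, uribl_alone)
```

On this corpus the pattern has no hits outside what URIBL already caught, which is the case where you'd want its score to be low.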