  #1 (permalink)  
Old 04-20-2009, 05:05 PM
Moderator
 
Posts: 1,405
spam filtering/training methodology

This arises out of some bugzilla comments; instead of cluttering up bugzilla, I thought it'd be better to turn it into a forum thread for further discussion, if any.

Basically, in Bug 9532 ("IMAP/Outlook move to junk doesn't train anti-spam") and Bug 37164 ("mail filed into Junk by Filters is not used to train anti-spam"), I raised the question of whether Zimbra ought to train SpamAssassin on messages which have been auto-filed into Junk by a user's Filter.

Currently Zimbra doesn't do this directly, but the capability exists in a sort of roundabout way. E.g., if you have Outlook or an IMAP mail client, you can have it move messages to Junk based on keywords, or on the client's own rules/heuristics, and Zimbra will end up training on that basis. On the face of it this seems to be a valid approach.

But consider this case: a user tells their client to filter based on X-Spam-Level. Dan Martin commented,
Quote:
it seems to me if a user is trying to do filter-based training on X-Spam scores specifically, something is funky with the overall setup. X-Spam is, after all, supposed to be the automated scoring system. If a user is getting a lot of false negatives (essentially what a "too low" X-Spam score is), then it seems to me you should analyze the overall scoring of those suspect emails and identify, and if necessary refine, the filters accordingly.
There are a bunch of ways to go about this, sure. You could re-weight the scores. You could lower the threshold from the default 6.6 to 5.1 or whatever. You could add more detection methods such as DCC, Pyzor, Razor, or image spam detection.

But there may still be some information left in the unique combination of recipient + spam score. I mean simply that, for certain users, any email above a fairly low score is all but guaranteed to be spam, and it may be valid to feed that information back into the system. In theory this resembles some spam training systems that self-train not only on user-sorted false positives/negatives (the way that Zimbra does) but also on their existing corpus. "The spam gets spammier and the ham gets hammier," as it were. If you look at this apache.org page on SA, this is basically item 2 in the section on Effective Training. (ASSP also feeds automatically-identified spam back into its training system, see here.)
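
To make the recipient + score idea concrete, here is a minimal Python sketch. It is not how Zimbra or SpamAssassin does anything: the function name, the per-user cutoff table, the addresses and the cutoff values are all hypothetical. Only the 6.6 default comes from the discussion above.

```python
# Hypothetical sketch: decide whether an already-scored message should be
# fed back into spam training, using a per-recipient cutoff.
# Nothing here is a Zimbra or SpamAssassin API.

GLOBAL_THRESHOLD = 6.6  # the default threshold mentioned above

# Assumed per-user cutoffs: for these recipients, anything at or above the
# cutoff has (by assumption) always turned out to be spam.
USER_CUTOFFS = {
    "alice@example.com": 3.0,
    "bob@example.com": 4.5,
}


def should_self_train(recipient: str, score: float) -> bool:
    """Return True if the message should be added to the spam training set."""
    cutoff = USER_CUTOFFS.get(recipient, GLOBAL_THRESHOLD)
    return score >= cutoff


if __name__ == "__main__":
    # Same score, three recipients: only the user with the low cutoff trains.
    for rcpt in ("alice@example.com", "bob@example.com", "carol@example.com"):
        print(rcpt, should_self_train(rcpt, 3.8))
```

The point is only that the same score can mean different things for different recipients; whether feeding it back is safe is a separate question.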

The danger here is that there could be "drift" due to unsupervised feedback. Effectively, certain "tokens of legitimacy" could be "poisoned" by being statistically associated with spam, until they become spurious primary indicators of spam. I'm not aware of any real-world exploits of this concept, but I did find an article discussing it: Does Bayesian Poisoning Exist? (PDF).
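
A toy Python sketch of what I mean by drift. This uses a simplified per-token estimate (spam rate over spam rate plus ham rate), which is an assumption for illustration and not SpamAssassin's actual Bayes math; the starting counts and the function names are made up.

```python
# Toy illustration of drift from unsupervised feedback.
# The token estimate below is a simplification, not SpamAssassin's formula.

def token_spam_prob(spam_with_token, ham_with_token, n_spam, n_ham):
    """Estimate how 'spammy' a token looks from corpus counts."""
    spam_rate = spam_with_token / n_spam
    ham_rate = ham_with_token / n_ham
    return spam_rate / (spam_rate + ham_rate)


def simulate_drift(rounds, auto_learned_per_round=10):
    """Track one mostly-legitimate token while auto-filed mail is learned as spam."""
    spam_with_token, ham_with_token = 2, 40   # made-up starting counts
    n_spam, n_ham = 100, 100
    history = [token_spam_prob(spam_with_token, ham_with_token, n_spam, n_ham)]
    for _ in range(rounds):
        # Assumption: every auto-learned message happens to contain the token,
        # and nobody corrects the mistake.
        spam_with_token += auto_learned_per_round
        n_spam += auto_learned_per_round
        history.append(
            token_spam_prob(spam_with_token, ham_with_token, n_spam, n_ham)
        )
    return history


if __name__ == "__main__":
    for step, prob in enumerate(simulate_drift(3)):
        print(step, round(prob, 3))
```

With no user correction in the loop, the token's estimate only moves one way, which is the feedback problem in miniature.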
__________________
Elliot Wilen
Berkeley, CA

Don't forget to enter your Zimbra version in your forum profile.
  #2 (permalink)  
Old 04-20-2009, 06:06 PM
Moderator
 
Posts: 1,027

Hey Elliot,

Berkeley, eh? I'm in San Jose. . .methinks this ought to be discussed over a beer. . .

Anyhow, it's at least in part the Bayesian poisoning that worries me, so you're definitely onto me. But the bigger issue I was digging at is that, at least according to what I have observed in production, an awful lot of people have their point threshold set too high. Therefore, too much gets into the non-junk classification (by my standards, anyhow), and therefore requires Bayesian filtering to correct it.

I have addressed this in two ways. One, I have lowered the overall threshold (in my system a positive score of only 3 gets you in the Junk folder), and two, I have given very high scores (+/- 5) for the extreme high (BAYES_99) and extreme low (BAYES_00) ends of the Bayes score. The combination of the two has allowed the user preference to have a stronger weight, but casts the burden of newly-discovered mail on the side of not looking like spam. . .or perhaps I should say "if it looks kinda like junk, my presumption is guilty."
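
Roughly, the arithmetic works out like the Python sketch below. It is illustrative only: the "other rules" totals are made up, the names are hypothetical, and the stock figures are the 6.6 default threshold and the BAYES_99 / BAYES_00 values from the stock 50_scores.cf. It assumes a message is junk when its total score reaches the threshold.

```python
# Illustrative arithmetic only; not Zimbra or SpamAssassin code.

STOCK = {"threshold": 6.6, "BAYES_99": 3.5, "BAYES_00": -2.599}
TUNED = {"threshold": 3.0, "BAYES_99": 5.0, "BAYES_00": -5.0}


def is_junk(other_rules_total, bayes_rule, cfg):
    """Add the Bayes rule's score to the other hits and compare to the threshold."""
    total = other_rules_total + cfg[bayes_rule]
    return total >= cfg["threshold"]


if __name__ == "__main__":
    # A message Bayes is sure is spam, with only 1.2 points from other rules.
    print("stock:", is_junk(1.2, "BAYES_99", STOCK))
    print("tuned:", is_junk(1.2, "BAYES_99", TUNED))
    # A message Bayes is sure is ham, with 4.0 points from other rules.
    print("stock:", is_junk(4.0, "BAYES_00", STOCK))
    print("tuned:", is_junk(4.0, "BAYES_00", TUNED))
```

In the first case only the tuned setup files the message as junk; in the second, the -5 for BAYES_00 keeps a message out of Junk even though its other hits alone would clear my threshold of 3.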

This has not resulted in many false positives--extremely few in fact--and those that do happen are easily remedied with the Bayes filter "not junk" in most cases. There is one source I had to manually whitelist because of my hostility to the commercial whitelists, but that's a complication of my own making. . .other than that it's really quite smooth for us at least.
__________________
Cheers,

Dan
  #3 (permalink)  
Old 04-23-2009, 11:03 AM
Moderator
 
Posts: 1,405

Sure, look me up when you're going to be in my neck of the woods, Dan.

When/if we go to production with Zimbra, I definitely plan on tweaking the scores. Threshold, not so much, but I hope to add some more score inputs such as those I mentioned in my first post. (E.g. uceprotect level 2 generates some false positives when I use it to block at the MTA level, but I can use it to score.)

One thing I wonder...according to HowScoresAreAssigned - Spamassassin Wiki, it seems that SA is supposed to adjust the scores for Bayes values on its own, but it looks to me like they're fixed (e.g. BAYES_99 is set at 3.5) and will have to be manually adjusted as you mention.
__________________
Elliot Wilen
Berkeley, CA

  #4 (permalink)  
Old 04-23-2009, 11:23 AM
Moderator
 
Posts: 1,027

I don't claim to be an authority on SpamAssassin--particularly in its native form which I have not used--but I can confirm without a doubt that the scores remain fixed in the Zimbra implementation. I agree that it appears dynamic in the wiki to which you linked. I don't know if that's a version difference or a question of implementation, however.
__________________
Cheers,

Dan
  #5 (permalink)  
Old 04-24-2009, 12:23 AM
Moderator
 
Posts: 7,911

Have a look at
Code:
/opt/zimbra/conf/spamassassin/50_scores.cf
  #6 (permalink)  
Old 04-24-2009, 01:33 AM
Moderator
 
Posts: 1,405

What I see starting at line 841 is
Code:
# make the Bayes scores unmutable (as discussed in bug 4505)
ifplugin Mail::SpamAssassin::Plugin::Bayes
score BAYES_00 0 0 -2.312 -2.599
score BAYES_05 0 0 -1.110 -1.110
score BAYES_20 0 0 -0.740 -0.740
score BAYES_40 0 0 -0.185 -0.185
score BAYES_50 0 0 0.001 0.001
score BAYES_60 0 0 1.0 1.0
score BAYES_80 0 0 2.0 2.0
score BAYES_95 0 0 3.0 3.0
score BAYES_99 0 0 3.5 3.5
endif
So, that refers to https://issues.apache.org/SpamAssass...ug.cgi?id=4505. (Discussion starts about comment #34.) I guess they need to update their wiki. Thanks!
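
For anyone wondering about the four numbers on each line: SpamAssassin score lines carry one value per scoreset (no Bayes/no network tests, no Bayes/network, Bayes/no network, Bayes + network), which is why the first two columns are 0 here. A small Python sketch that reads the lines above; the parser and its names are my own, not part of SpamAssassin.

```python
# Hypothetical helper: parse four-column "score" lines into a dict.

SCORE_LINES = """\
score BAYES_00 0 0 -2.312 -2.599
score BAYES_05 0 0 -1.110 -1.110
score BAYES_20 0 0 -0.740 -0.740
score BAYES_40 0 0 -0.185 -0.185
score BAYES_50 0 0 0.001 0.001
score BAYES_60 0 0 1.0 1.0
score BAYES_80 0 0 2.0 2.0
score BAYES_95 0 0 3.0 3.0
score BAYES_99 0 0 3.5 3.5
"""

# Order of the four columns on each score line.
SCORESETS = ("no Bayes, no net", "no Bayes, net", "Bayes, no net", "Bayes + net")


def parse_scores(text):
    """Map each rule name to its tuple of four per-scoreset scores."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "score":
            scores[parts[1]] = tuple(float(p) for p in parts[2:])
    return scores


if __name__ == "__main__":
    for rule, values in parse_scores(SCORE_LINES).items():
        print(rule, dict(zip(SCORESETS, values)))
```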
__________________
Elliot Wilen
Berkeley, CA

  #7 (permalink)  
Old 04-24-2009, 09:26 AM
Moderator
 
Posts: 1,027

And besides their grammatical error (the word is "immutable," not "unmutable"), it is these scores that I have chosen to override with my own immutable scoring, in local.cf.
__________________
Cheers,

Dan