  #1 (permalink)  
Old 11-09-2011, 08:08 AM
Active Member
 
Posts: 32
Dspam_stats

I have noticed that my OCA (Overall Accuracy) has dropped bit by bit every day since we hit 0 TL (Training Left). While it was training, the accuracy mostly worked its way up. I have been keeping track of the /opt/zimbra/dspam/bin/dspam_stats -H output for some time now.

Today:
TP True Positives: 24241
TN True Negatives: 5470
FP False Positives: 765
FN False Negatives: 1395
SC Spam Corpusfed: 0
NC Nonspam Corpusfed: 0
TL Training Left: 0
SHR Spam Hit Rate 94.56%
HSR Ham Strike Rate: 12.27%
PPV Positive predictive value: 96.94%
OCA Overall Accuracy: 93.22%

9/7/11
TP True Positives: 13365
TN True Negatives: 1920
FP False Positives: 163
FN False Negatives: 836
SC Spam Corpusfed: 0
NC Nonspam Corpusfed: 0
TL Training Left: 417
SHR Spam Hit Rate 94.11%
HSR Ham Strike Rate: 7.83%
PPV Positive predictive value: 98.80%
OCA Overall Accuracy: 93.87%
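
For reference, the percentages in both snapshots can be reproduced from the four raw counts. The sketch below assumes the usual confusion-matrix definitions, which agree with the figures above to two decimals. The function name `dspam_rates` is made up for illustration and is not part of DSPAM, and returning `None` for an empty denominator is a choice made here, not necessarily what the tool prints.

```python
def dspam_rates(tp, tn, fp, fn):
    """Recompute the dspam_stats -H percentages from the raw counts."""
    def pct(num, den):
        # Return None instead of dividing by zero (assumption of this sketch).
        return round(100.0 * num / den, 2) if den else None

    return {
        "SHR": pct(tp, tp + fn),                 # spam hit rate
        "HSR": pct(fp, fp + tn),                 # ham strike rate
        "PPV": pct(tp, tp + fp),                 # positive predictive value
        "OCA": pct(tp + tn, tp + tn + fp + fn),  # overall accuracy
    }


today = dspam_rates(tp=24241, tn=5470, fp=765, fn=1395)
earlier = dspam_rates(tp=13365, tn=1920, fp=163, fn=836)
print(today)
print(earlier)
```

Note that HSR is the share of ham that was wrongly flagged, so lower is better for that one.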

I expected the accuracy to go up, not slowly go down every day. I still have learn spam/ham enabled globally. With almost 250 mailboxes on the server, we still get very little spam.

Why does the accuracy keep dropping even though the actual results have improved drastically?

Thanks
  #2 (permalink)  
Old 11-10-2011, 12:10 PM
Moderator
 
Posts: 1,432

Look in /opt/zimbra/log/spamtrain.log or do /opt/zimbra/dspam/bin/dspam_stats -s -H to see the result of the last spam training run.

Note that in the last two months you've had a fair number of false positives (602) and nearly as many false negatives (559). These have been bringing your accuracy down.
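
The 602 and 559 figures come from subtracting the 9/7/11 snapshot from the current one. A small sketch of that arithmetic, which also works out the accuracy over just the interval between the two snapshots (the variable names are illustrative):

```python
# Cumulative counters from the two dspam_stats -H snapshots in post #1.
earlier = {"TP": 13365, "TN": 1920, "FP": 163, "FN": 836}   # 9/7/11
today = {"TP": 24241, "TN": 5470, "FP": 765, "FN": 1395}    # 11/9/11

# Messages counted between the two snapshots.
delta = {key: today[key] - earlier[key] for key in today}

# Accuracy over the interval only, using the same formula as OCA.
interval_total = sum(delta.values())
interval_accuracy = 100.0 * (delta["TP"] + delta["TN"]) / interval_total

print(delta)
print(f"{interval_accuracy:.2f}%")
```

The interval accuracy comes out at roughly 92.55%, below the 93.87% cumulative figure from 9/7/11, which is why the cumulative OCA drifts downward.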

Also note that there's an oddity to how retraining works with DSPAM in Zimbra. The only mail submitted to DSPAM for training is mail that's incorrectly classified by the entire antispam system (SpamAssassin + DSPAM). Some of this may have been correctly scored by DSPAM but still got through because DSPAM's contribution to the overall score (default is -1/+10 for ham/spam) may be counterbalanced by other scores from SA. Conversely, some of DSPAM's mistakes might not be submitted for retraining for the same reason.

Put this all together and it means that DSPAM retraining is being carried out on a somewhat unrepresentative collection of spam/ham for your site. At the same time, the overall accuracy of your antispam measures may be quite a bit better than the DSPAM accuracy.

If you'd like to modify the impact of DSPAM scoring, see Using DSPAM for Spam Filtering - Zimbra :: Wiki and also consider the tag/kill percentages you've set overall.

According to the designer's intention, DSPAM is really meant to be used by itself, not as part of another scoring system. I can't say how well it would work that way, but you could probably hack the amavis configuration file to turn off SA scoring and use only DSPAM should you wish to do so.

Another consideration is that the overall antispam system (including DSPAM if you have it turned on) is retrained system-wide. Bug 3870 - per user Spam Assassin score would change this, but it hasn't been implemented yet. With per-user scoring, you'd probably see better accuracy, as right now different users are "fighting" over what should be considered spam and what constitutes legitimate email.
__________________
Elliot Wilen
Berkeley, CA

Don't forget to enter your Zimbra version in your forum profile.
  #3 (permalink)  
Old 11-11-2011, 04:24 AM
Active Member
 
Posts: 32

My kill score is set at 15
(which really reduced the spam that made it in the junk folder)

My spam score is 6.6 to be marked as spam
DSPAM adds 8.7 to spam and -1 to ham

Modified my spamassassin scores:
score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
score BAYES_20 0.0001 0.0001 -0.740 -0.740
score BAYES_40 0.0001 0.0001 -0.185 -0.185
score BAYES_50 0.0001 0.0001 0.001 0.001
score BAYES_60 0.0001 0.0001 1.0 1.0
score BAYES_80 0.0001 0.0001 2.5 2.5
score BAYES_95 0.0001 0.0001 5.5 5.5
score BAYES_99 0.0001 0.0001 6.5 6.5

I also added many custom rules specific to our needs and tailored to some of the spam we receive. After a few months of tweaking, I have found this to be the best formula.

/opt/zimbra/dspam/bin/dspam_stats -s -H
TP True Positives: 3
TN True Negatives: 0
FP False Positives: 0
FN False Negatives: 2
SC Spam Corpusfed: 0
NC Nonspam Corpusfed: 0
TL Training Left: 0
SHR Spam Hit Rate 60.00%
HSR Ham Strike Rate: 100.00%
PPV Positive predictive value: 100.00%
OCA Overall Accuracy: 60.00%

Why only 3 true positives when there are thousands of e-mails daily?? Seems to me like the good e-mails are not being counted.

Thanks
  #4 (permalink)  
Old 11-11-2011, 11:23 AM
Moderator
 
Posts: 1,432

There are only three true positives because the only mail which is being trained on is mail that ended up in the spam/ham accounts.

In short, based on the data you posted, there were 5 emails that got acted on by your users yesterday. All of them were marked as spam. 3 of those, when tested by DSPAM, came up as spam, so DSPAM recorded them as "True Positives". 2 of them, when tested by DSPAM, came up as ham, so DSPAM recorded them as "False Negatives".

All the rest of the mail yesterday was classified correctly by the antispam system. This mail doesn't contribute to DSPAM's accuracy stats because it didn't get submitted by your users for training.
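
A toy model of that bookkeeping, assuming each submitted message is simply compared against DSPAM's own verdict. This only illustrates the explanation above and is not DSPAM's actual code; `tally` and its input format are made up for the example.

```python
from collections import Counter


def tally(reports):
    """Count outcomes for (user_label, dspam_verdict) pairs.

    Each element of a pair is either 'spam' or 'ham'. Only mail that
    users actually submitted for training appears in `reports`.
    """
    outcome = {
        ("spam", "spam"): "TP",  # user says spam, DSPAM agreed
        ("spam", "ham"): "FN",   # user says spam, DSPAM said ham
        ("ham", "ham"): "TN",    # user says ham, DSPAM agreed
        ("ham", "spam"): "FP",   # user says ham, DSPAM said spam
    }
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for user_label, verdict in reports:
        counts[outcome[(user_label, verdict)]] += 1
    return dict(counts)


# Yesterday: five messages reported as spam, DSPAM agreed on three of them.
yesterday = [("spam", "spam")] * 3 + [("spam", "ham")] * 2
print(tally(yesterday))
```

Mail that nobody reports never enters `reports`, so it cannot raise the true-negative or true-positive counts no matter how well it was classified.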
  #5 (permalink)  
Old 11-11-2011, 11:32 AM
Active Member
 
Posts: 32

OK, I see now.

The way I have it set, DSPAM has such a large impact on the score that it would make sense that most of the time it would be wrong when mail is classified incorrectly. Thanks for the info!!
  #6 (permalink)  
Old 11-11-2011, 11:59 AM
Moderator
 
Posts: 1,432

Actually, most of the time DSPAM is getting it right even when the mail is classified incorrectly.

E.g. yesterday DSPAM scored 60% of the submitted spam correctly. However, even though DSPAM presumably added 8.7 to the scores of those emails when they came in, spamassassin must have subtracted more than 2.1 due to Bayes and other elements (such as DNSWL).

If you increased the DSPAM score from 8.7, reduced the spam score from 6.6, or reduced/eliminated some of the spamassassin scores, then those three true positives would have gone straight into people's Junk folders. They wouldn't have been submitted by your users. The false negatives might still have been submitted, which would make DSPAM's accuracy look worse, even though the overall accuracy of your antispam system would be better.
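
To make the interaction concrete, here is a rough sketch of how the pieces combine, using the values from post #3 (tag level 6.6, DSPAM +8.7 for spam and -1 for ham). The function, the `>=` comparison, and the -2.5 example score are assumptions for illustration; the real scoring happens inside amavis and SpamAssassin.

```python
TAG_THRESHOLD = 6.6  # score at which mail is marked as spam (post #3)
DSPAM_SPAM = 8.7     # points added when DSPAM says spam (post #3)
DSPAM_HAM = -1.0     # points added when DSPAM says ham (post #3)


def is_tagged(sa_other_score, dspam_says_spam,
              dspam_spam=DSPAM_SPAM, threshold=TAG_THRESHOLD):
    """Return True if the combined score reaches the tag threshold.

    sa_other_score is the sum of all non-DSPAM SpamAssassin rules
    (Bayes, DNSWL, custom rules, ...). Treating the threshold as
    inclusive is an assumption of this sketch.
    """
    dspam_part = dspam_spam if dspam_says_spam else DSPAM_HAM
    return sa_other_score + dspam_part >= threshold


# DSPAM is right (spam), but the other rules subtract 2.5 points:
print(is_tagged(-2.5, True))                   # 6.2 < 6.6, lands in the inbox
print(is_tagged(-2.5, True, dspam_spam=10.0))  # 7.5 >= 6.6, goes to Junk
print(is_tagged(-2.5, True, threshold=6.0))    # 6.2 >= 6.0, goes to Junk
```

In the first case the message reaches the inbox, the user reports it, and DSPAM records a true positive even though the system as a whole got it wrong.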

However, all of this ignores the impact of retraining over time. Regardless of whether you use SA + DSPAM or just DSPAM, it might be a little strange that retraining is only happening on email that gets misclassified. It seems to me that proper retraining should be done using representative samples of actual ham/spam, not just the subset that gets sorted incorrectly. It strikes me as especially distorting with respect to the very small and unrepresentative proportion of ham that actually gets used for retraining.

I am not an expert in this area but it seems to me that it would be more valid to retrain using a combination of reported spam/ham, and weighted samples of the spam/ham which were (presumably) classified correctly, and therefore weren't reported.

To overcome the potential distortion inherent in the current scheme, you could periodically collect your own corpora of representative ham/spam at your site and then use them to train SA and DSPAM.

However, in practice, the overall accuracy of the antispam system is so high that I don't worry about it too much, and if I were to try to tweak it further, I'd probably look into using DCC, Pyzor, and/or Razor as described in Improving Anti-spam system - Zimbra :: Wiki
  #7 (permalink)  
Old 11-11-2011, 12:54 PM
Active Member
 
Posts: 32

The big problem with increasing the DSPAM score above 8.7 is that too many legit e-mails would be marked as spam. I had it much higher before, and after checking dozens of logs (show original) I came to the conclusion that 8.7 would be a nice middle point without messing with the 6.6 number.

At one point DSPAM marked most e-mails as spam, and if it weren't for SA taking away enough points by classifying them as ham, they would have ended up in the junk folder. I have users use unsubscribe and block/allow sender, and only use the spam button for true spam. This method has proven to drastically decrease spam over time.

Now that you explained why my DSPAM stats are slowly going down, I'm happy.

Thanks