
Notes about the Bayesian statistical filtering

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3121
Printed Date: 27 December 2024 at 4:53am


Topic: Notes about the Bayesian statistical filtering
Posted By: LogSat
Subject: Notes about the Bayesian statistical filtering
Date Posted: 08 March 2004 at 10:49pm

The following describes how SpamFilter uses its self-learning Bayesian statistical filter to detect spam.

Bayesian filtering performs a statistical analysis of the words (tokens) in incoming emails by comparing them with a database (corpus) of tokens. This database assigns a probability to each token, indicating how likely it is for that token to appear in a spam message. By applying statistical formulas to the "good" and "bad" tokens present in a full message, together with their respective probabilities from the corpus database, we can determine the probability that the message is spam.
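As an illustration of the idea above, here is a minimal sketch in Python. It assumes the commonly used naive Bayes combination (tokens treated as independent); the post does not state the exact formulas SpamFilter ISP uses, and the function names, the 0.01/0.99 clamp and the neutral 0.5 for unseen tokens are all assumptions made for the example.

```python
import math

def token_spam_probability(spam_count, good_count, n_spam, n_good):
    """Estimate how likely a token is to indicate spam.

    spam_count / good_count: emails of each kind containing the token.
    n_spam / n_good: total emails of each kind in the corpus.
    """
    spam_freq = spam_count / n_spam if n_spam else 0.0
    good_freq = good_count / n_good if n_good else 0.0
    if spam_freq + good_freq == 0:
        return 0.5  # assumption: a token never seen before is neutral
    p = spam_freq / (spam_freq + good_freq)
    # assumption: clamp so a single token can never be absolute proof
    return min(0.99, max(0.01, p))

def message_spam_probability(token_probs):
    """Combine per-token probabilities into one score for the message."""
    # Work with logarithms so that long messages do not underflow.
    log_spam = sum(math.log(p) for p in token_probs)
    log_good = sum(math.log(1.0 - p) for p in token_probs)
    diff = log_good - log_spam
    if diff > 700:  # exp() would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

A message whose tokens are mostly seen in spam ends up with a score near 1, one whose tokens are mostly seen in good mail ends up near 0.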

Several applications that perform statistical filtering are based on fixed databases distributed by their vendors. The problem is that these databases are language dependent. They contain lists of "good" tokens and lists of "bad" tokens, so an English database won't work for a Spanish company, for example. Furthermore, the type of good emails that company A (a military base, for example) receives is usually different from the type of good emails that company B (a church, for example) receives. Since the "good" emails are different, accurate statistical analysis requires that the corpus database accurately reflect the kind of traffic that site receives.

This is where SpamFilter ISP is different. SpamFilter ISP does not come with a predefined corpus database. Instead, it "learns" about your actual traffic by examining incoming emails and seeing how they are classified by your other filters. With time, it "learns" what good email and spam look like in your particular domain. This allows it to be very accurate, and it can be deployed in any country, using any language.
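To make the learning step more concrete, here is a rough sketch in Python of a corpus that starts empty and is fed the verdicts of the other filters. The Corpus class, its attribute names and the simple word-based tokenizer are hypothetical; they are not taken from SpamFilter ISP's actual implementation.

```python
import re
from collections import Counter

class Corpus:
    """Hypothetical self-learning token database; starts out empty."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam emails containing it
        self.good_tokens = Counter()  # token -> number of good emails containing it
        self.spam_emails = 0
        self.good_emails = 0

    def learn(self, text, is_spam):
        """Record one email, using the verdict reached by the other filters."""
        # \w+ matches word characters in any language, so no
        # language-specific word list is needed.
        tokens = set(re.findall(r"\w+", text.lower()))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_emails += 1
        else:
            self.good_tokens.update(tokens)
            self.good_emails += 1

corpus = Corpus()
corpus.learn("Buy cheap pills now", is_spam=True)
corpus.learn("Reunión de presupuesto mañana", is_spam=False)
```

Because the counts come only from the mail a site actually receives, the same code works whether that mail is in English, Spanish or anything else.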

In order to be accurate, the corpus database must be quite large. We've configured the filter to begin filtering email after receiving 5000 "good" emails and 5000 "bad" emails. This is usually sufficient to obtain good accuracy.

Some sites with very low traffic may take several days to reach that email quota. The 5000 minimum email threshold can be lowered if needed, but please note that this may result in lower accuracy. To make the change, please stop SpamFilter, then in the SpamFilter.ini file look for the line:

MinEmailsForBayesKickIn=5000

After you set the new lower limit, start SpamFilter again.
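The effect of that setting can be pictured with the small Python sketch below. The function name is hypothetical, and it assumes the threshold applies to the "good" and "bad" counts separately, as the 5000 "good" and 5000 "bad" figures above suggest.

```python
def bayes_filter_active(good_emails, spam_emails, min_emails=5000):
    """Return True once the corpus is large enough to start filtering.

    min_emails plays the role of MinEmailsForBayesKickIn; the assumption
    here is that both counts must reach it.
    """
    return good_emails >= min_emails and spam_emails >= min_emails

# A low-traffic site that lowered the threshold to 1000 (example value)
print(bayes_filter_active(1200, 1500, min_emails=1000))
```

With the default of 5000 the same counts would not yet activate the filter.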

It should also be noted that emails are usually assigned very distinct scores. They are either very close to 0% (low chance of being spam) or very close to 100% (likely to be spam). In-between values are statistically very rare.
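One way to see why in-between scores are rare is the quick Python experiment below. It assumes the usual naive Bayes product formula, which may differ from what SpamFilter ISP does internally, and the token probabilities are made-up example values.

```python
def combine(token_probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = 1.0
    good = 1.0
    for p in token_probs:
        spam *= p
        good *= 1.0 - p
    return spam / (spam + good)

# Fifteen tokens that each lean only mildly one way
mildly_spammy = combine([0.7] * 15)
mildly_clean = combine([0.3] * 15)
print(mildly_spammy, mildly_clean)
```

Even though no single token is stronger than 70/30, the evidence multiplies, so the first score lands well above 99% and the second well below 1%. A score near the middle needs the "good" and "bad" evidence to cancel out almost exactly.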

Please also note that the Bayesian filter is applied only *after* all other filters have failed to trigger a hit on an incoming email. Thus, if you already have very effective filtering in place, only the few emails that the other filters miss will be caught by the statistical filter. You could in theory "relax" your other filters and let the statistical Bayes filter perform all the work after you have reached the initial kick-in corpus size.
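The ordering described above can be sketched in Python as follows. The filter names, the classify function and the 0.5 cut-off are hypothetical and only illustrate the order of evaluation.

```python
def classify(email_text, other_filters, bayes_score, threshold=0.5):
    """Run the regular filters first; consult Bayes only if none of them hits.

    other_filters: list of (name, function) pairs; each function returns
    True when it flags the email as spam.
    bayes_score: function returning a spam probability between 0 and 1.
    """
    for name, matches in other_filters:
        if matches(email_text):
            return ("spam", name)  # the Bayesian filter never sees this email
    if bayes_score(email_text) >= threshold:
        return ("spam", "bayes")
    return ("ok", None)

# Toy stand-ins for real filters
filters = [("keyword", lambda text: "viagra" in text.lower())]
score = lambda text: 0.98 if "lottery" in text.lower() else 0.02

print(classify("Cheap VIAGRA here", filters, score))
print(classify("You won the lottery", filters, score))
print(classify("Lunch on Friday?", filters, score))
```

The first email is stopped by the keyword filter, the second slips past it and is caught by the statistical score, and the third is delivered.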

Roberto F.
LogSat Software



