Bayesian Filter Test Not working
Eric
Guest Group
Posted: 27 November 2004 at 3:49pm
I am trying to run a test of the Bayesian filter. I paste in the contents of a known spam message, and under the Corpus DB it says the DB is locked and that the message passes with 0%. I have a feeling this is why I get a ton of spam, and it only seems to get worse rather than better. Also, the learning status is "Inactive" even though the "Learn new incoming emails" box is checked. Thanks! Eric

LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104
Eric,
There may be a problem with the statistical corpus database. Can you please try to stop SpamFilter, delete the SpamFilter\corpus directory, then restart SpamFilter? Please note that this will reset your statistical database, and SpamFilter will again need to receive the initial 5,000 good and 5,000 spam emails to "prime" the database.
Roberto F.
LogSat Software

Clutcher
Guest Group
I have also tried the method you suggested (deleting the corpus) and read almost all of the forum, but I still can't see any email with a spam probability other than 0%, and I receive a lot of spam. The program added a lot of words to the corpus, but they all seem to be "good" ones. In fact, I can't understand how to give SpamFilter a spam example to trigger Bayes. By the way, the program is great indeed, and with some quick and simple settings it still blocks thousands of viruses and other things. TIA, Marco

LogSat
Admin Group
Marco,
In the corpus you'll find entries for both "good" and "bad" words, it is the score that is assigned to them which determines how they are used to check for spam. Each token in the corpus has statistics attached to it indicating how many times that token has appeared in spam emails and how many times it has appeared in good emails. Using statistical math, a score is given to each token, and an analysis of the higher scores in an email determines whether the email is good or not.

If you have received more emails than the threshold of 5,000 good and 5,000 spam, can you confirm that not a single email has triggered the Bayesian filter? You can check that using the new statistical pie chart available in the latest version that shows the emails blocked by the various filters.
Roberto F.
LogSat Software
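
The scoring described in this post can be illustrated with a short sketch. The thread does not give SpamFilter's actual formula, so the code below uses the scheme from Paul Graham's essay "A Plan for Spam" as a stand-in. The function names, the corpus totals and the sample counts are all assumptions made up for illustration.

```python
from math import prod


def token_spam_probability(good, spam, n_good, n_spam):
    """Per-token spam probability in the style of Graham's "A Plan for Spam".

    good / spam: number of good / spam emails in which the token appeared.
    n_good / n_spam: total number of good / spam emails seen so far.
    Illustrative only; this is NOT confirmed to be SpamFilter's formula.
    """
    g = 2 * good                      # good hits count double to limit false positives
    b = spam
    if g + b < 5:                     # too little evidence: fall back to a neutral default
        return 0.4
    g_ratio = min(1.0, g / n_good)
    b_ratio = min(1.0, b / n_spam)
    p = b_ratio / (g_ratio + b_ratio)
    return max(0.01, min(0.99, p))    # clamp so no token is ever treated as certain


def message_spam_probability(token_probs, top_n=15):
    """Combine the most 'interesting' token probabilities (those farthest from 0.5)."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_side = prod(interesting)
    good_side = prod(1.0 - p for p in interesting)
    return spam_side / (spam_side + good_side)
```

Under this stand-in scheme, a token seen in only one good and one spam email stays at the neutral 0.4, and the result for busier tokens depends heavily on the corpus totals, which the thread does not state. Whether SpamFilter behaves the same way is not confirmed here.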

Clutcher
Guest Group
>In the corpus you'll find entries for both "good" and "bad" words, it is the score that is

In fact, all the words in the corpus have the same "good" value, and therefore, as I stated before, the spam probability is always 0%.

>If you have received more emails than the threshold of 5,000 good and 5,000 spam, can

Yes. I think the problem is that nothing or no one told SpamFilter what is SPAM so it can't assign bad values to bad words. And I don't know how to let it learn.

>You can check that using the new statistical pie chart available in the latest version that

27431 emails blocked, none of them as SPAM

TIA
Ciao
Marco

LogSat
Admin Group
Marco,
It's very unusual that *all* the words in the corpus have the same value of good. To verify, can you please go to the "Settings - Bayesian Filter - Corpus Database" tab in SpamFilter, then click on the "Dump Corpus" button. That will generate a listing of all entries in the corpus database, along with the number of times each token appeared in a good email, the number of times it appeared in a spam email, the probability that an email containing that token is spam, and the last time an email arrived with that token.

In that list, look for the following, and please paste the results in the forum so we can take a look at your values.
*sex,
*SEX,
*Subject*viagra,
*unsubscribe,

As a comparison, these are the ones right now in our own corpus database.
*Token ,Good, Spam, ProbSpam, ModDate
*sex,449,13120,0.473827302455902,12/20/2004
*SEX,71,515,0.182700991630554,12/20/2004
*Subject*viagra,16,2053,0.79815948009491,12/20/2004
*unsubscribe,4828,54253,0.257227122783661,12/20/2004

<< Yes. I think the problem is that nothing or no one told SpamFilter what is SPAM so it can't assign bad values to bad words. And I don't know how to let it learn. >>

Actually, every time any of SpamFilter's filters finds a spam email, it updates the statistical corpus with the email's tokens and assigns them "spam scores". Every single email that SpamFilter receives goes through this process.

Please go into more detail on your statement "27431 emails blocked, none of them as SPAM". Do you mean that you do not have a single email in the quarantine database, or showing on the statistical pie chart, and yet SpamFilter shows 27431 emails as blocked? If so, this means that SpamFilter blocked 27431 attempts by spammers to "relay" using your SMTP server, but that *all* "incoming" email addressed to your users was not blocked at all. This would tend to indicate a misconfiguration of SpamFilter, as with its default filters SpamFilter will indeed block a huge amount of emails addressed to your domains.
Roberto F.
LogSat Software

Clutcher
Guest Group
>It's very unusual that *all* the words in the corpus have the same value of good. To verify,

In fact, that's what I did.

>In that list, look for the following, and please paste the results in the forum so we can take

*sex,1,1,0,400000005960464,16/12/2004

I could continue, but all the words have the same values, apart from the date.

>Actually every time any of SpamFilter's filters finds a spam email

Which filters? How can they say it's spam?

>Do you mean that you do not have a single email in the quarantine database, or showing

I'm full of blocked emails but none because SPAM
14000 Exceed RCPT

>This would tend to indicate a misconfiguration of SpamFilter, as with its default filters

Yes, but, again, how could it block a message as spam if none of the words are classified as spam? I'm sorry, but I still can't understand how. If I paste a message that passed:

REPLICA WATCH MODELS

I obtain:

12/21/04 14.01.40.109 -- ()
Token Good Spam Prob is Spam

To be clear: how could I, or could it, increase the value of "Patek"?

Ciao
Marco

LogSat
Admin Group
Marco,
I'll continue in Italian (translated here into English) so that things will perhaps be clearer. The first two numbers in the corpus database indicate how many emails classified as "good" and how many emails marked as "spam" have arrived containing that word. In your two examples:

*sex,1,1,0,400000005960464,16/12/2004

Now, it is very strange that from 16 December until today you have received only two emails containing the word "sex". It is a word used very often in spam, and for this reason I think there is a misconfiguration in your SpamFilter setup.

When you ask <<Which filters? How can they say it's spam?>>, SpamFilter uses several different techniques to catch spam. When you say:

I'm full of blocked emails but none because SPAM
14000 Exceed RCPT

in reality, all the emails you mention were blocked because they were SPAM. The filters used to decide whether a given email was spam are listed in bold above. I refer you to the readme.htm file, which you will find in the SpamFilter directory, to see what these filters are and how they work. The Bayesian statistical filter we have been discussing throughout this thread is just one of the many filters that SpamFilter uses to find spam.

I repeat that I think there is a problem with your configuration, not because you have no emails blocked by the statistical filter (this happens often, given that some of the other techniques used by SpamFilter are much better than the Bayesian filter), but because the numbers you mention do not seem correct to me. To simplify things, if you can send us the following in a zip file:

the SpamFilter.ini file

we will try to understand what is wrong.

As for the question about "Patek", you cannot intervene in the score that SpamFilter assigns. It is automatically updated by SpamFilter every time an email containing that word arrives, depending on how the email is classified by the other filters.
Roberto F.

Paul D
Guest Group
I am running v395 and have yet to see anything being blocked by the Bayesian filter; everything reports 0% SPAM. I just made a copy of my corpus folder, then stopped SpamFilter and deleted the folder so it can start from scratch. I find it hard to believe that out of all these emails less than 1% is being blocked. Any help would be appreciated. Thanks

LogSat
Admin Group
Paul,
It is indeed strange, and it's possible that SpamFilter is not configured properly and/or that mail is being routed to SpamFilter in such a way as to mask the original source IP of the sender. Without knowing the source IP, SpamFilter will not be able to use many of its filters. Can you please zip and email us a copy of your SpamFilter.ini file, your blacklist and whitelist files, and one of SpamFilter's latest logfiles so we can try to see what is happening?
Roberto F.
LogSat Software

Paul D
Guest Group
sent attachment to support@logsat.com ?

LogSat
Admin Group
Paul,
We received your files, and your settings appear to work fine. The activity logfile you sent shows that on the 27th SpamFilter blocked about 41% of your total incoming emails. The average for the previous days was a bit lower, showing that around 30% of your total emails are spam.

From your post it seems that you state SpamFilter is blocking only 1% of your emails. As that was not the case, we may have misunderstood. Did you mean that you find it hard to believe that the Bayesian filter only stopped 1% of spam? If so, then that is actually absolutely normal. The Bayesian filter is used as a last resort to check for spam, after all the other filters have had a chance to do so. Only if they all fail is the Bayesian filter used. As such, it will indeed have mostly pre-screened emails to check, and will only tag a very small percentage of them.

As an example, we provided a snapshot of our filter stats for 3 days' worth of emails on the forum as follows:

94,828 IP found in MAPS search

According to the above, the Bayesian statistical filter on our own server only blocked 0.2% of the spam found by the other filters. However, that is still 354 spam emails that were successfully blocked.
Roberto F.
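
The "last resort" ordering described in this post can be sketched as a simple filter chain. This is only an illustration of the idea, not SpamFilter's code: the filter names, the sample IP address, the keyword and the 0.9 threshold are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A filter inspects a message (here a plain dict) and returns True if it looks like spam.
Filter = Callable[[dict], bool]


def classify(message: dict,
             primary_filters: List[Tuple[str, Filter]],
             bayesian_filter: Filter) -> Optional[str]:
    """Return the name of the filter that blocked the message, or None if it passed."""
    for name, is_spam in primary_filters:
        if is_spam(message):
            return name               # blocked early; the Bayesian filter never sees it
    if bayesian_filter(message):
        return "bayesian"             # only reached when every other filter let it pass
    return None


# Hypothetical stand-ins for the real filters.
EXAMPLE_FILTERS = [
    ("maps", lambda m: m["ip"] in {"203.0.113.7"}),
    ("keywords", lambda m: "replica watch" in m["body"].lower()),
]


def example_bayes(m: dict) -> bool:
    return m.get("bayes_score", 0.0) >= 0.9
```

Because the Bayesian check only ever sees messages that the earlier filters passed, its share of the blocked total is expected to be small even when it is working correctly.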

omaits
Newbie Joined: 25 February 2005 Location: United States Status: Offline Points: 5
I have a question about this old topic. I am going to restart our database as you mentioned above, because I set up the system wrong and screwed up my database. I tried your technique, but I can't figure out how to stop the service! The button that says STOP SERVICE is grayed out in the application. I also tried stopping it with Ctrl+Alt+Del, and it told me I wasn't allowed. How can I stop it? Sorry if the question is stupid. I'm a rookie.

LogSat
Admin Group
Either type "net stop spamfilter" at a DOS prompt, or follow these instructions from symantec.com:
How to start or stop a service in Windows
Situation:
To start the service: To stop the service: Windows 2000:
Windows XP:
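
The reset procedure discussed in this thread (stop the service, delete the corpus directory, restart) could be scripted roughly as follows. This is a sketch under assumptions: the install directory passed in is hypothetical, the service name is taken from the "net stop spamfilter" command above, and the function defaults to a dry run that only lists the steps. Stopping a Windows service normally requires an administrator prompt.

```python
import shutil
import subprocess
from pathlib import Path


def reset_corpus(install_dir, service="spamfilter", dry_run=True):
    """Stop the service, delete <install_dir>/corpus, then start the service again.

    Returns the list of planned steps. With dry_run=True nothing is run or deleted.
    install_dir is an assumption; use the directory SpamFilter is installed in.
    """
    corpus = Path(install_dir) / "corpus"
    steps = [
        ["net", "stop", service],
        ["delete", str(corpus)],
        ["net", "start", service],
    ]
    if dry_run:
        return steps
    subprocess.run(["net", "stop", service], check=True)   # Windows only
    shutil.rmtree(corpus)                                   # resets the statistical database
    subprocess.run(["net", "start", service], check=True)
    return steps
```

Calling `reset_corpus("C:/SpamFilter")` only returns the planned steps; pass `dry_run=False` to actually perform them.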

omaits
Newbie
Thanks Roberto... stupid question, I know. I apologize. Anyway, I deleted the directory and my Bayesian filter is busy learning what is/isn't spam. Thanks again.

Sundance
Newbie Joined: 18 July 2006 Location: Hungary Status: Offline Points: 10
Hi Guys!
I think I HAD the same problem as a few others on the forum, and I know the solution. I live in Hungary and the server settings are Hungarian. The decimal separator, therefore, is a comma (,). So the corpus looks like:

*sex,1,1,0,400000005960464,16/12/2004

So poor SpamFilter cannot understand its own corpus: instead of the value 0.400000005960464, it reads ZERO because of the comma, and it cannot calculate and store real probability values. In my corpus there were commas, and EVERY probability value was the same.

So what did I do? I went to Settings/International settings and changed the "decimal separator" from a comma (,) to a dot (.). Then I terminated SpamFilter (stopping it is not enough), deleted the entire corpus directory, restarted SpamFilter, and waited for a few emails. Then I dumped the corpus:

sex,2,1,0.21334443560464,16/12/2004

Yes, there are REAL probability values this time!!!! When testing, everything is still 0% spam :( But I can only hope that when I reach the 5000/5000 count and the Bayes filter kicks in, everything will be OK.
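
The ambiguity described in this post can be shown with a small parser for dump lines in the "Token,Good,Spam,ProbSpam,ModDate" layout quoted earlier in the thread. It is a sketch that assumes the token itself never contains a comma; the function name and the returned dictionary are invented for illustration.

```python
def parse_dump_line(line):
    """Parse one 'Token,Good,Spam,ProbSpam,ModDate' line from a corpus dump.

    Assumes the token contains no comma. A sixth field is taken as a sign that
    the probability was written with a decimal comma (e.g. '0,4').
    """
    fields = line.strip().split(",")
    if len(fields) == 5:
        token, good, spam, prob, date = fields
        decimal_comma = False
    elif len(fields) == 6:
        # The probability was split in two by its own decimal comma; rejoin it.
        token, good, spam, int_part, frac_part, date = fields
        prob = f"{int_part}.{frac_part}"
        decimal_comma = True
    else:
        raise ValueError(f"unexpected field count in: {line!r}")
    return {
        "token": token,
        "good": int(good),
        "spam": int(spam),
        "prob": float(prob),
        "date": date,
        "decimal_comma": decimal_comma,
    }
```

A reader that naively took the fourth field as the probability would see 0 for the first sample line above, which matches the symptom described in the post.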

WebGuyz
Senior Member Joined: 09 May 2005 Location: United States Status: Offline Points: 348
If this is true then it should be a quick fix for Roberto.
http://www.webguyz.net

LogSat
Admin Group
Sundance,
Your report was excellent, and it had us scrambling all afternoon to double-check the code behind the Bayesian calculations. It does seem, however, that the "bug" is limited to the dump of the Bayesian corpus to screen. SpamFilter's internal probability data is stored in binary format in the db.dat.prb file. As it's stored in native binary format, there is no issue with comma/dot international headaches when reading/writing the file. So far, all internal calculations also appear to be using the binary format. The only time we convert the binary probability to text (thus falling victim to the dot/comma problem) is when we output the data on screen. We'll be going over the code one more time to be more certain, but so far it does seem that the Bayesian filter itself does not have a problem with the decimal separator.
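
The difference between binary storage and text output can be illustrated as follows. The layout of db.dat.prb is not described in the thread, so the 4-byte little-endian float used here is purely an assumption, and the `decimal_sep` parameter merely imitates a regional setting.

```python
import struct


def to_binary(prob):
    """Pack a probability as a 4-byte little-endian IEEE 754 float (assumed layout)."""
    return struct.pack("<f", prob)


def from_binary(raw):
    """Unpack a probability written by to_binary; no locale is involved."""
    return struct.unpack("<f", raw)[0]


def to_text(prob, decimal_sep="."):
    """Format a probability for display; decimal_sep imitates the regional setting."""
    return repr(prob).replace(".", decimal_sep)


raw = to_binary(0.4)
restored = from_binary(raw)                 # the binary round trip ignores the locale
shown = to_text(restored, decimal_sep=",")  # only the on-screen text gains a comma

try:
    float(shown)                            # a dot-only parser rejects the comma form
    reparsed_ok = True
except ValueError:
    reparsed_ok = False
```

As a side observation, 0.4 stored as a 32-bit float reads back as roughly 0.40000000596, which is consistent with the 0.400000005960464 values quoted in the dumps, though it does not prove how SpamFilter stores them.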

Sundance
Newbie
Thank you for your answer! Hmmmmm, then I've got a more serious problem. My Bayesian filter shows 0% spam for all messages (even before I deleted the folder, when it was way over 5000/5000 messages). When I tried the Bayesian filter on test messages in the Settings/Bayesian window, the filter also reported 0% spam for everything. If it is not the corpus, then what could cause this bug?

LogSat
Admin Group
Sundance,
Have you checked our explanation, earlier in this thread, of why the Bayesian filter will have a lower "hit" ratio compared to the other filters? http://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=4647#4885 Once you reach the 5000/5000 emails, can you please check the Statistics tab in SpamFilter and post the results of how many emails are stopped by the various filters (this only works if you have enabled the quarantine database)?

Sundance
Newbie
Roberto,
Yes, I have read the forums, and your explanation too. But my Bayes doesn't have a 'lower' hit ratio; it does not catch _anything_. All mails are 0% spam. 5000/5000 passed and everything was just like before. Nothing changed. It is as if the Bayes filter didn't even start. There is no sign of the Bayes filter. In Stats, the MAPS, SURBL and keyword filters are at about 60%, 30% and 10%. Bayes is not mentioned there. :(

LogSat
Admin Group
Can you zip and email us your corpus directory, and one of SpamFilter's activity files, once you reach the 5000/5000 count?

Sundance
Newbie
Okay. Just a few days till 5000/5000 and I'll send them.

Sundance
Newbie

LogSat
Admin Group
Yes, we've been analyzing it. The contents are rather strange: while there are 42,738 entries, only FIVE tokens have a spam probability of .9 or higher. There are about 370 with a spam probability of .1 or less. The other 42,000 all have a probability of .4.
This is rather unusual. However, from your corpus.ini file I see that you received about the same number of good emails as spam emails. This is also unusual, as normally the amount of spam is much higher than the amount of clean email. This may be causing the Bayesian filter some problems, as the numbers are too similar. Can you please also zip up one of SpamFilter's latest activity logfiles so we can take another look?
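
A breakdown like the one quoted in this post (.9 or higher, .1 or less, everything else) can be produced from a list of token probabilities with a few lines of code. The function name, the bucket labels and the sample values below are illustrative assumptions; the thresholds are the ones mentioned in the post.

```python
def summarize_probabilities(probs, low=0.1, high=0.9):
    """Count tokens that look strongly spammy, strongly good, or in between."""
    counts = {"spam_like": 0, "good_like": 0, "in_between": 0}
    for p in probs:
        if p >= high:
            counts["spam_like"] += 1      # strong spam indicators
        elif p <= low:
            counts["good_like"] += 1      # strong good-mail indicators
        else:
            counts["in_between"] += 1     # contributes little to the decision
    return counts


# Made-up sample: one spammy token, one good token, two tokens at 0.4.
sample_summary = summarize_probabilities([0.95, 0.05, 0.4, 0.4])
```

A corpus in which almost every token falls into the middle bucket gives the filter very little to work with, which fits the behaviour reported in this thread.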

Sundance
Newbie
Roberto,
Sadly enough, logging was not enabled :( Until your post.... Since then, it is. What should I do now?
A. Send you the logfiles generated since Aug 24?
B. Send the logfiles and the corpus (which contains data from before Aug 24 as well)?
C. Erase the log and corpus NOW, and post both, say, a week later, so that they contain data about the same period of time?
Regards,
Sundance

LogSat
Admin Group
Let's start with the simplest, which is D: just zip and email us one day's worth of logs (today's or yesterday's, for example) so we can see if there are any major issues immediately visible.

Sundance
Newbie
Okay. I've sent it. Thank you for your help!!!!

Sundance
Newbie
Hello.... Any ideas? Did you get my email????
Sundance

LogSat
Admin Group
Actually, on the 29th I sent you an email to say that we had not received the email with the corpus file, and were waiting to hear back from you... sorry. Can you please re-send it?