Editing Bayesian Corpus Database?
Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=4902
Printed Date: 26 December 2024 at 3:54pm
Topic: Editing Bayesian Corpus Database?
Posted By: JeffHildebrand
Date Posted: 29 December 2004 at 4:40pm
Is there any way to update or delete tokens from the Bayesian database? It looks like we had one email that bounced between two servers about 300 times because of a forwarding loop. Unfortunately, this appears to have flagged a lot of legitimate tokens as spam, and the filter started blocking well over half of our legitimate email. The only safe way I could see to stop the blocking was to install a fresh copy of the corpus database and start from scratch.
Below are some of the tokens that were generated. As you can see by the relatively high spam score, the filter started reporting 100% spam detection for many legitimate emails.
*Token,Good,Spam,ProbSpam,ModDate
*6*1,0,306,0.99989998341,12/29/04
*Eudora,0,306,0.99989998341,12/29/04
*FULL,0,306,0.99989998341,12/28/04
*Follows,0,306,0.99989998341,12/27/04
*MAILBOX,0,306,0.99989998341,12/27/04
*Mime,0,306,0.99989998341,12/29/04
*Precedence,0,306,0.99989998341,12/27/04
*QUALCOMM,0,306,0.99989998341,12/29/04
*RCPT,0,306,0.99989998341,12/27/04
*Received*AOL,0,306,0.99989998341,12/27/04
*Received*omr,0,306,0.99989998341,12/27/04
*Received*rly,0,306,0.99989998341,12/27/04
*Received*v98*19,0,306,0.99989998341,12/27/04
*SCOLL,0,306,0.99989998341,12/27/04
*SCORE,0,306,0.99989998341,12/29/04
*Subject*unavailable,0,306,0.99989998341,12/27/04
*URL_COUNT,0,306,0.99989998341,12/27/04
*labeled,0,306,0.99989998341,12/29/04
*unavailable,0,306,0.99989998341,12/27/04
*undeliverable,0,306,0.99989998341,12/29/04
*v3*5,0,306,0.99989998341,12/29/04
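For context, here is a rough sketch of why rows like these are so damaging. It assumes the filter uses something close to the per-token estimate and combining rule from Paul Graham's "A Plan for Spam"; the actual SpamFilter ISP formula is not stated in this thread. The 0.0001/0.9999 clamp is a guess based on the dump, where ProbSpam looks like 0.9999 stored as a single-precision float. The corpus sizes and function names are made up for illustration.

```python
from math import prod

# Assumed clamp values, inferred from the ProbSpam column in the dump.
FLOOR, CEILING = 0.0001, 0.9999


def token_spam_probability(good, spam, n_good, n_spam):
    """Simplified Graham-style estimate for one token.

    good/spam are the token's hit counts; n_good/n_spam are the number of
    good and spam messages in the corpus (hypothetical values here).
    """
    g = min(1.0, 2 * good / n_good)  # Graham doubles the good count
    s = min(1.0, spam / n_spam)
    if g + s == 0:
        return 0.5  # token never seen: treat as neutral
    return max(FLOOR, min(CEILING, s / (g + s)))


def combined_probability(probs):
    """Graham's combining rule for a list of per-token probabilities."""
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


# With Good = 0 the ratio is 1.0 regardless of the spam count, so a token
# with 306 spam hits and a token with 13 spam hits both hit the ceiling.
p_loop = token_spam_probability(good=0, spam=306, n_good=1000, n_spam=1000)
p_few = token_spam_probability(good=0, spam=13, n_good=1000, n_spam=1000)
print(p_loop, p_few)

# Two such tokens in one message are already enough to pass a 99.995% cutoff.
print(combined_probability([p_loop, p_few]) > 0.99995)
```

Under these assumptions, any token that has never been seen in good mail is pinned at the ceiling, and a message needs only a couple of them to score as near-certain spam. That would explain why common words such as "Mime" or "Precedence" were enough to block legitimate mail.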
Thanks,
-Jeff
Replies:
Posted By: LogSat
Date Posted: 29 December 2004 at 10:52pm
Sorry Jeff, that is currently not possible.
Roberto F.
LogSat Software
Posted By: JeffHildebrand
Date Posted: 10 January 2005 at 12:08pm
Could you make this a feature request then? It could be as simple as an "import keywords" function that imports from a .txt file in the same format as the corpus dump. It could then overwrite existing entries or add keywords of your choice. At this point the Bayesian filter has become unusable for us: after resetting it less than two weeks ago, it is blocking a very high number of legitimate emails even at a 99.9950% setting, mainly because of keywords like these:
*Token , Good , Spam , ProbSpam , ModDate
*Pagnet , 0 , 13 , 0.999899983 , 01/08/05
*org , 0 , 13 , 0.999899983 , 01/10/05
*pagnet , 0 , 13 , 0.999899983 , 01/10/05
*From*org , 0 , 17 , 0.999899983 , 01/10/05
*From*pagnet , 0 , 16 , 0.999899983 , 01/10/05
*http , 0 , 15 , 0.999899983 , 01/10/05
*href , 0 , 14 , 0.999899983 , 01/10/05
*attached , 0 , 13 , 0.999899983 , 01/10/05
*file , 0 , 13 , 0.999899983 , 01/10/05
*Back , 0 , 15 , 0.999899983 , 01/10/05
*Green , 0 , 15 , 0.999899983 , 01/10/05
*before , 0 , 15 , 0.999899983 , 01/10/05
*dollars , 0 , 15 , 0.999899983 , 01/10/05
*original , 0 , 15 , 0.999899983 , 01/10/05
*second , 0 , 15 , 0.999899983 , 01/10/05
*since , 0 , 15 , 0.999899983 , 01/10/05
*difference , 0 , 14 , 0.999899983 , 01/10/05
*highly , 0 , 14 , 0.999899983 , 01/10/05
*GIF , 0 , 13 , 0.999899983 , 01/07/05
*The , 0 , 13 , 0.999899983 , 01/10/05
*big , 0 , 13 , 0.999899983 , 01/10/05
*ebay , 0 , 13 , 0.999899983 , 01/08/05
*tag , 0 , 13 , 0.999899983 , 01/08/05
*details , 0 , 12 , 0.999899983 , 01/10/05
*mail , 0 , 12 , 0.999899983 , 01/10/05
It is probably just a learning curve for the Bayesian filter, but some way to speed up and fine-tune that learning process before legitimate emails are blocked would be a tremendous help.
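To make the request concrete, here is a minimal sketch of what such an import could do. It works only on text in the dump format shown in this thread (Token, Good, Spam, ProbSpam, ModDate). It does not touch the real corpus database, which per the earlier reply cannot currently be edited. The function names, the corrections data, and the delete list are all hypothetical.

```python
import csv
import io


def parse_dump(text):
    """Parse 'Token,Good,Spam,ProbSpam,ModDate' rows into a dict keyed by token."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    corpus = {}
    for record in reader:
        if not record:
            continue  # ignore blank lines
        token, good, spam, prob, mod_date = (field.strip() for field in record)
        corpus[token] = {
            "good": int(good),
            "spam": int(spam),
            "prob": float(prob),
            "mod_date": mod_date,
        }
    return corpus


def apply_corrections(corpus, corrections, delete=()):
    """Return a copy of corpus with corrections overwritten or added, then deletions applied."""
    merged = dict(corpus)
    merged.update(corrections)
    for token in delete:
        merged.pop(token, None)
    return merged


# A few rows taken from the dump above.
dump = """*Token , Good , Spam , ProbSpam , ModDate
*The , 0 , 13 , 0.999899983 , 01/10/05
*mail , 0 , 12 , 0.999899983 , 01/10/05
*ebay , 0 , 13 , 0.999899983 , 01/08/05
"""

# Hypothetical corrections file in the same format: reset "*mail" to a sane score.
corrections = """*Token , Good , Spam , ProbSpam , ModDate
*mail , 50 , 12 , 0.2 , 01/10/05
"""

merged = apply_corrections(parse_dump(dump), parse_dump(corrections), delete=["*The"])
for token, row in sorted(merged.items()):
    print(token, row["good"], row["spam"], row["prob"])
```

Overwriting by token keeps the import idempotent: running the same corrections file twice gives the same result, which seems safer than adding counts together.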
Regards, Jeff