To throw an idea into this discussion...
I've always thought that some sort of character filtering capability would be useful here. Something that would allow rules-based replacement of certain characters -- ONLY in a cached COPY of the message used for testing, not in the actual message itself -- prior to keyword filtering. In other words, the following filters could be set up...
- Replace "0" (the number zero) with "o" (the letter O).
- Replace "#" with "h"
- Replace "!" with "i" (the letter I)
- Replace all other punctuation symbols with "" (nothing).
... and any other such rules that prove to be effective. Please pardon the language here (this is a "clinical" discussion, not an attempt to offend; besides, if you deal with spam filtering, you see this and worse every single day)... just by using this limited set of rules, the following subject line:
Subj: SH!T, THAT B-I-T-C-# IS A W#0RE
... would be cleaned up in the testing copy of the message: the exclamation point would be replaced with i, the #'s with h's, and the number zero with the letter o, and all other punctuation would be deleted, resulting in a subject line:
Subj: SHiT THAT BITCh IS A WhoRE
... which would then be able to trigger a case-insensitive keyword filter, if so desired.
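To make the idea concrete, here is a rough sketch in Python. It is only an illustration of the rules listed above, not how SpamFilter works or would have to implement it; the names `SUBSTITUTIONS`, `normalize` and `matches_keyword` are made up for this example, and "all other punctuation" is assumed to mean the ASCII set in Python's `string.punctuation`.

```python
import string

# Hypothetical rule table: look-alike characters and their replacements.
SUBSTITUTIONS = {"0": "o", "#": "h", "!": "i"}


def build_table(substitutions):
    """Build a translation table: delete punctuation, then apply explicit rules."""
    # Start by mapping every ASCII punctuation character to None (delete it).
    table = {ord(ch): None for ch in string.punctuation}
    # Explicit substitutions override the delete rule for "#" and "!".
    table.update({ord(src): dst for src, dst in substitutions.items()})
    return table


TABLE = build_table(SUBSTITUTIONS)


def normalize(text):
    """Return a cleaned COPY of text; the original string is left untouched."""
    return text.translate(TABLE)


def matches_keyword(text, keywords):
    """Case-insensitive keyword test against the cleaned copy only."""
    cleaned = normalize(text).lower()
    return any(keyword.lower() in cleaned for keyword in keywords)


subject = "SH!T, THAT B-I-T-C-# IS A W#0RE"
print(normalize(subject))                   # SHiT THAT BITCh IS A WhoRE
print(matches_keyword(subject, ["whore"]))  # True
```

Because the rules live in a plain table, adding another substitution that proves effective is a one-line change, and the message that actually gets delivered is never modified.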
This would filter SOME spam, but would still not do anything about senders who insert space characters into hot keywords, e.g. "wh or e". Filtering out the spaces and looking for embedded keywords just wouldn't work... "who referred you" becomes "whoreferredyou", which contains "whore", an instant false positive if you're filtering on that word.
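The false positive is easy to reproduce. The snippet below is just a small Python demonstration of the problem, with a made-up helper name (`contains_after_stripping_spaces`); it is not a proposed filter.

```python
def contains_after_stripping_spaces(text, keyword):
    """Remove all whitespace, then do a case-insensitive substring test."""
    collapsed = "".join(text.lower().split())
    return keyword.lower() in collapsed


# The obfuscated spam word is caught...
print(contains_after_stripping_spaces("wh or e", "whore"))           # True
# ...but so is a perfectly innocent phrase.
print(contains_after_stripping_spaces("who referred you", "whore"))  # True
```

Once the spaces are gone there are no word boundaries left to check against, so the substring test cannot tell the two cases apart.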
Roberto, I think this probably goes well above and beyond your intent for SpamFilter in the here-and-now. Just something to think about a few months from now, when you've finished the database logging functionality and everything else you're currently working on, and you're sitting around the office with nothing to do saying "what else could we do to enhance SpamFilter?" <grin>
Jim