Suggestions for filtering
Carl Giljam (Guest Group)
Posted: 14 October 2003 at 3:41am
It is apparent that spammers use delimiters, visible or invisible, to break up their message in order to fool simple spam filters. Here's a typical example:

Subject: I b*ec,ame tw_ic-e t`he m:an I use t'o be zkdmtpdyyck

The other typical example is when the message is "hacked" into very small pieces by HTML tags, invisible to the human eye, so that it still reads "Viagra" etc.

Suggestion: filter in several passes.

Pass 1 - filter the message as it is, to catch all sorts of HTML tricks etc.
Pass 2 - remove all HTML tags, then pass it through the filter again.
Pass 3 - remove all "delimiters", then pass it through the filter again (delimiters would be all sorts of *-_': etc., but also digits and possibly blanks).

Maybe it could be a configuration option in SpamFilter: how many passes to run and what to remove during each pass (as defined in a regex for each pass). I realise it will slow things down, though; it would have to be a trade-off.

Another - not as efficient - method would be to filter on the amount of "delimiters" used, but I feel SpamFilter would first need to be able to apply different filters to Subject, Body, etc. to do this.
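A rough sketch of what those passes could look like, assuming Python's standard `re` module; the keyword list and the per-pass substitutions here are illustrative placeholders, not anything SpamFilter actually ships with:

```python
# Multi-pass filtering sketch: each pass applies one more clean-up
# substitution, then re-runs the keyword scan on the result.
import re

KEYWORDS = [re.compile(r"viagra", re.IGNORECASE)]  # hypothetical keyword filters

PASSES = [
    ("pass 1: raw message",      lambda t: t),
    ("pass 2: strip HTML tags",  lambda t: re.sub(r"<[^>]*>", "", t)),
    ("pass 3: strip delimiters", lambda t: re.sub(r"[*\-_'`:.,\d\s]", "", t)),
]

def scan(message: str) -> bool:
    text = message
    for name, transform in PASSES:
        text = transform(text)
        if any(kw.search(text) for kw in KEYWORDS):
            print("matched on", name)
            return True
    return False

scan("Cheap V-i.a,g:r*a 4 you")  # only matches once pass 3 removes the delimiters
```

Each pass costs one extra substitution plus one extra keyword scan per message, which is where the trade-off mentioned above comes from.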
Desperado (Senior Member) | Joined: 27 January 2005 | Location: United States
Carl,

At 100,000 messages a day, this would get horrible... don't you think? I find so many variations of "sliced up" messages that I get very frustrated also. I would like to have a "Max HTML Comment" feature but cannot seem to get a RegEx to do this correctly.

On your idea of filtering, then rendering, and then filtering again... one problem I see is that when the word "Penis" suddenly shows up, are you going to block that? You will come up with BUNCHES of false positives in that case. I would still base my filtering on the message attempting to hide the content. So... if the rendered pass suddenly shows words that were NOT present in the first pass, then the message should be blocked, because it was definitely obscured, and the only reason for obfuscation is to fool spam filters. This way you are not filtering words (which, BTW, is called censoring), but instead you are filtering on the INTENT to hide those words. Make sense?

Dan S.
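A minimal sketch of that intent-based check, assuming Python's `re` module; `strip_html` is a crude stand-in for real rendering, and the four-letter cutoff is an arbitrary placeholder:

```python
# Block on obfuscation intent: if "rendering" the HTML reveals words that were
# not visible in the raw source, the sender was hiding them on purpose.
import re

def strip_html(text: str) -> str:
    """Crude 'rendering': drop comments and tags, keep the visible text."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", "", text)

def words(text: str) -> set:
    return set(re.findall(r"[a-z]{4,}", text.lower()))  # ignore very short tokens

def hidden_words(message: str) -> set:
    """Words visible only after rendering, i.e. deliberately obscured."""
    return words(strip_html(message)) - words(message)

msg = "Buy Via<!-- zkdmtpdyyck -->gra<b></b> now"
print(hidden_words(msg))  # {'viagra'} -> content was obscured, so block the message
```

The nice property of this approach: an ordinary HTML newsletter renders to the same words it already contains, so nothing shows up as "hidden" and no word list is needed.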
LogSat (Admin Group) | Joined: 25 January 2005 | Location: United States
Carl,

Performing all those filtering translations can give a server quite a hit. Furthermore, they are still limited, as spammers can and will always find a way around them. V1agra or viegra or v!agra or viagre or uiagra or viiagra will often find a hole in any keyword filtering.

The best approach is, in our opinion, completely different. We released a beta for the next major version of SpamFilter ISP which performs statistical DNA fingerprinting on incoming emails. This allows SpamFilter to adapt rather quickly and learn all variations of the word "Viagra", for example, almost in real time as spammers invent new ones. We'll be focusing more on that aspect of filtering rather than on simple keyword lists in the near future.

Roberto F.
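For illustration only, here is a generic statistical (naive-Bayes-style) token scorer; it is not LogSat's DNA-fingerprinting code, just a sketch of how a learning filter picks up new spellings after seeing a few samples:

```python
# Toy statistical filter: token probabilities are learned from examples,
# so new misspellings start scoring as spam once a few samples are seen.
import math
import re
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
spam_msgs = 0
ham_msgs = 0

def tokens(text: str) -> list:
    return re.findall(r"[a-z0-9!$]{3,}", text.lower())

def train(text: str, is_spam: bool) -> None:
    global spam_msgs, ham_msgs
    if is_spam:
        spam_counts.update(tokens(text))
        spam_msgs += 1
    else:
        ham_counts.update(tokens(text))
        ham_msgs += 1

def spam_score(text: str) -> float:
    """Log-odds that the message is spam; positive leans spam."""
    score = math.log((spam_msgs + 1) / (ham_msgs + 1))
    for tok in tokens(text):
        p_spam = (spam_counts[tok] + 1) / (spam_msgs + 2)
        p_ham = (ham_counts[tok] + 1) / (ham_msgs + 2)
        score += math.log(p_spam / p_ham)
    return score

train("cheap v1agra pills here", True)
train("meeting agenda for monday", False)
print(spam_score("v1agra discount today"))  # positive -> leans spam
```

The point is the adaptation: once a new misspelling shows up in a few caught messages, it starts scoring as spam without anyone writing a new keyword.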
Carl Giljam (Guest Group)
I agree that some sort of automatic adaptation is necessary, Bayesian or whatever, because spammers change patterns very quickly. But at least in the case of HTML tricks used to obscure the message, I am still in favour of removing all the HTML overhead before applying any other filters.

We use the IMail 8 email server (version 8 includes spam filters as well). I don't use IMail's spam handling, though, because I liked the regexes in SpamFilter better, so I put SpamFilter in front of the email server. IMail uses a Bayesian filter (which I didn't like because it was user-unfriendly to update and teach), and it removes all HTML first. That seems to me the logical way to handle things; otherwise, won't you have to teach your filter that Via<all sorts of html>gra is actually Viagra?

Carl
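A small illustration of that point, assuming a simple regex-based stripper rather than IMail's or SpamFilter's real parsers: without stripping, the tokens a learning filter sees never include the whole word, so it would have to be taught every markup variation separately.

```python
# Tokens seen by a learning filter, with and without HTML stripping first.
import re

def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())

raw = "Via<i></i>gra at low pri<span>ce</span>s"

print(tokenize(raw))
# ['via', 'i', 'i', 'gra', 'at', 'low', 'pri', 'span', 'ce', 'span', 's']

print(tokenize(re.sub(r"<[^>]+>", "", raw)))
# ['viagra', 'at', 'low', 'prices']  -> one token to learn, not many fragments
```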
LogSat (Admin Group) | Joined: 25 January 2005 | Location: United States
Carl,

The Bayesian filtering we use strips out all HTML tags except for a few, which we keep since they are good indicators of spam vs. legitimate email. For now, however, we will keep the keyword searches limited to the unparsed email. We feel the original HTML source, with all the garbage comments included, gives RegEx a better chance to work than plain text does. In plain text, Viagra can be written in a myriad of ways that a regular keyword scan will miss:

V i a g r a
V.,i,.a,.g,.r,.a
V:I:A:G:R:A

and so on. There will always be a new variation that no filter catches yet. But searching through the HTML tags can be very useful and very simple: just a couple of RegEx expressions can block the same amount of spam that would otherwise require hundreds of separate keywords. We are counting on the Bayesian filter to do most of the hard work, though. We tried to keep it as simple to administer as possible, and hopefully we succeeded.

Roberto F.
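Here is the kind of single expression being described, written in Python for the demo; it is a hedged sketch, not a pattern shipped with SpamFilter:

```python
# One pattern that matches "viagra" even when the letters are separated by
# HTML comments, tags, or delimiter characters in the raw source.
import re

JUNK = r"(?:<!--.*?-->|<[^>]*>|[\s.,:;*'`_\-])*"      # comments, tags, or delimiters
VIAGRA = re.compile(JUNK.join("viagra"), re.IGNORECASE | re.DOTALL)

samples = [
    "V i a g r a",
    "V.,i,.a,.g,.r,.a",
    "V:I:A:G:R:A",
    "Via<!-- random garbage -->gra",
    "vi<b>ag</b>ra",
]
for s in samples:
    print(bool(VIAGRA.search(s)), s)   # True for every sample
```

In practice the whitespace in the delimiter class would be narrowed or dropped, otherwise perfectly innocent text such as "Olivia grabbed" also matches; that is the usual trade-off with catch-all patterns like this.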
eric (Guest Group)
I score a huge amount by checking the end of messages lately. Sure, the GIF-only HTML emails annoyed me the most, but these keywords did the biggest job for me:

>rem<!--,-->ove

Play with some variants and see some big results. Most junk has the "no more please", "remove" etc. options, so that's a good target for LogSat.

-eric-