Suggestions for filtering
Carl Giljam (Guest Group)
Posted: 14 October 2003 at 3:41am
It is apparent that spammers use delimiters, visible or invisible, to break up their message in order to fool simple spam filters. Here's a typical example:

Subject: I b*ec,ame tw_ic-e t`he m:an I use t'o be zkdmtpdyyck

The other typical example is when the message is "hacked" into very small pieces by HTML tags, invisible to the human eye, so that it still reads "Viagra" etc.

Suggestion: filter in several passes.

Pass 1 - filter the message as it is, to catch all sorts of HTML tricks etc.
Pass 2 - remove all HTML tags, then pass it through the filter again.
Pass 3 - remove all "delimiters", then pass it through the filter again (delimiters would be all sorts of *-_': etc., but also digits and possibly blanks).

Maybe it could be a configuration option in SpamFilter: how many passes to run and what to remove during each pass (as defined in a regex for each pass). I realise it will slow things down, though; it would have to be a trade-off.

Another - not as efficient - method would be to filter on the amount of "delimiters" used, but I feel SpamFilter would first need to be able to apply different filters to Subject, Body, etc. to do this.
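A rough sketch of what those passes could look like, assuming Python's standard `re` module; the keyword list and the per-pass substitutions here are illustrative placeholders, not anything SpamFilter actually ships with:

```python
# Multi-pass filtering sketch: each pass applies one more clean-up
# substitution, then re-runs the keyword scan on the result.
import re

KEYWORDS = [re.compile(r"viagra", re.IGNORECASE)]  # hypothetical keyword filters

PASSES = [
    ("pass 1: raw message",      lambda t: t),
    ("pass 2: strip HTML tags",  lambda t: re.sub(r"<[^>]*>", "", t)),
    ("pass 3: strip delimiters", lambda t: re.sub(r"[*\-_'`:.,\d\s]", "", t)),
]

def scan(message: str) -> bool:
    text = message
    for name, transform in PASSES:
        text = transform(text)
        if any(kw.search(text) for kw in KEYWORDS):
            print("matched on", name)
            return True
    return False

scan("Cheap V-i.a,g:r*a 4 you")  # only matches once pass 3 removes the delimiters
```

Each pass costs one extra substitution plus one extra keyword scan per message, which is where the trade-off mentioned above comes from.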
Desperado (Senior Member) | Joined: 27 January 2005 | Location: United States
Carl,

At 100,000 messages a day, this would get horrible... don't you think? I find so many variations of "sliced up" messages that I get very frustrated also. I would like to have a "Max HTML Comment" feature but cannot seem to get a RegEx to do this correctly.

On your idea of filtering, then rendering, and then filtering again... one problem I see is that when the word "Penis" suddenly shows up, are you going to block that? You will come up with BUNCHES of false positives in that case. I would still base my filtering on the message attempting to hide the content. So... if the rendered pass suddenly shows words that were NOT present in the first pass, then the message should be blocked, because it was definitely obscured, and the only reason for obfuscation is to fool spam filters. This way you are not filtering words (which, BTW, is called censoring), but instead you are filtering on the INTENT to hide those words. Make sense?

Dan S.
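A minimal sketch of that intent-based check, assuming Python's `re` module; `strip_html` is a crude stand-in for real rendering, and the four-letter cutoff is an arbitrary placeholder:

```python
# Block on obfuscation intent: if "rendering" the HTML reveals words that were
# not visible in the raw source, the sender was hiding them on purpose.
import re

def strip_html(text: str) -> str:
    """Crude 'rendering': drop comments and tags, keep the visible text."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", "", text)

def words(text: str) -> set:
    return set(re.findall(r"[a-z]{4,}", text.lower()))  # ignore very short tokens

def hidden_words(message: str) -> set:
    """Words visible only after rendering, i.e. deliberately obscured."""
    return words(strip_html(message)) - words(message)

msg = "Buy Via<!-- zkdmtpdyyck -->gra<b></b> now"
print(hidden_words(msg))  # {'viagra'} -> content was obscured, so block the message
```

The nice property of this approach: an ordinary HTML newsletter renders to the same words it already contains, so nothing shows up as "hidden" and no word list is needed.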
LogSat (Admin Group) | Joined: 25 January 2005 | Location: United States
Carl,

Performing all those filtering translations can give a server quite a hit. Furthermore, they are still limited, as spammers can and will always find a way around them. V1agra or viegra or v!agra or viagre or uiagra or viiagra will often find a hole in any keyword filtering.

The best approach is, in our opinion, completely different. We released a beta for the next major version of SpamFilter ISP which performs statistical DNA fingerprinting on incoming emails. This allows SpamFilter to adapt rather quickly and learn all variations of the word "Viagra", for example, almost in real time as spammers invent new ones. We'll be focusing more on that aspect of filtering rather than on simple keyword lists in the near future.

Roberto F.
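For illustration only, here is a generic statistical (naive-Bayes-style) token scorer; it is not LogSat's DNA-fingerprinting code, just a sketch of how a learning filter picks up new spellings after seeing a few samples:

```python
# Toy statistical filter: token probabilities are learned from examples,
# so new misspellings start scoring as spam once a few samples are seen.
import math
import re
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
spam_msgs = 0
ham_msgs = 0

def tokens(text: str) -> list:
    return re.findall(r"[a-z0-9!$]{3,}", text.lower())

def train(text: str, is_spam: bool) -> None:
    global spam_msgs, ham_msgs
    if is_spam:
        spam_counts.update(tokens(text))
        spam_msgs += 1
    else:
        ham_counts.update(tokens(text))
        ham_msgs += 1

def spam_score(text: str) -> float:
    """Log-odds that the message is spam; positive leans spam."""
    score = math.log((spam_msgs + 1) / (ham_msgs + 1))
    for tok in tokens(text):
        p_spam = (spam_counts[tok] + 1) / (spam_msgs + 2)
        p_ham = (ham_counts[tok] + 1) / (ham_msgs + 2)
        score += math.log(p_spam / p_ham)
    return score

train("cheap v1agra pills here", True)
train("meeting agenda for monday", False)
print(spam_score("v1agra discount today"))  # positive -> leans spam
```

The point is the adaptation: once a new misspelling shows up in a few caught messages, it starts scoring as spam without anyone writing a new keyword.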
Carl Giljam (Guest Group)
I agree that some sort of automatic adaptation is necessary, Bayesian or whatever, because spammers change patterns very quickly. But at least in the case of HTML tricks used to obscure the message, I am still in favour of removing all the HTML overhead before applying any other filters.

We use the IMail 8 email server (version 8 includes spam filters as well). I don't use IMail's spam handling, though, because I liked the regexes in SpamFilter better, so I put SpamFilter in front of the email server. IMail uses a Bayesian filter (which I didn't like because it was user-unfriendly to update and teach), and it removes all HTML first. That seems to me the logical way to handle things; otherwise, won't you have to teach your filter that Via<all sorts of html>gra is actually Viagra?

Carl
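A small illustration of that point, assuming a simple regex-based stripper rather than IMail's or SpamFilter's real parsers: without stripping, the tokens a learning filter sees never include the whole word, so it would have to be taught every markup variation separately.

```python
# Tokens seen by a learning filter, with and without HTML stripping first.
import re

def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())

raw = "Via<i></i>gra at low pri<span>ce</span>s"

print(tokenize(raw))
# ['via', 'i', 'i', 'gra', 'at', 'low', 'pri', 'span', 'ce', 'span', 's']

print(tokenize(re.sub(r"<[^>]+>", "", raw)))
# ['viagra', 'at', 'low', 'prices']  -> one token to learn, not many fragments
```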
LogSat (Admin Group) | Joined: 25 January 2005 | Location: United States
Carl,

The Bayesian filtering we use strips out all HTML tags except for a few, which we keep since they are good indicators of spam vs. legitimate email. For now, however, we will keep the keyword searches limited to the unparsed email. We feel the original HTML source, with all the garbage comments included, gives RegEx a better chance to work than plain text does. In plain text, Viagra can be written in a myriad of ways that a regular keyword scan will miss:

V i a g r a
V.,i,.a,.g,.r,.a
V:I:A:G:R:A

and so on. There will always be a new variation that no filter catches yet. But searching through the HTML tags can be very useful and very simple: just a couple of RegEx expressions can block the same amount of spam that would otherwise require hundreds of separate keywords. We are counting on the Bayesian filter to do most of the hard work, though. We tried to keep it as simple to administer as possible, and hopefully we succeeded.

Roberto F.
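Here is the kind of single expression being described, written in Python for the demo; it is a hedged sketch, not a pattern shipped with SpamFilter:

```python
# One pattern that matches "viagra" even when the letters are separated by
# HTML comments, tags, or delimiter characters in the raw source.
import re

JUNK = r"(?:<!--.*?-->|<[^>]*>|[\s.,:;*'`_\-])*"      # comments, tags, or delimiters
VIAGRA = re.compile(JUNK.join("viagra"), re.IGNORECASE | re.DOTALL)

samples = [
    "V i a g r a",
    "V.,i,.a,.g,.r,.a",
    "V:I:A:G:R:A",
    "Via<!-- random garbage -->gra",
    "vi<b>ag</b>ra",
]
for s in samples:
    print(bool(VIAGRA.search(s)), s)   # True for every sample
```

In practice the whitespace in the delimiter class would be narrowed or dropped, otherwise perfectly innocent text such as "Olivia grabbed" also matches; that is the usual trade-off with catch-all patterns like this.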
eric (Guest Group)
I score a huge amount by checking the end of messages lately. Sure, the GIF-only HTML emails annoyed me the most, but these keywords did the biggest job for me:

>rem<!--,-->ove

Play with some variants and see some big results. Most junk has the "no more please", "remove" etc. options, so that's a good target for LogSat.

-eric-