Spam Filter ISP Support Forum


Suggestions for filtering

Carl Giljam
Guest Group
    Posted: 14 October 2003 at 3:41am

It is apparent that spammers use delimiters, visible or invisible, to break up their message in order to fool simple spam filters.

Here's a typical example:

Subject: I b*ec,ame tw_ic-e t`he m:an I use t'o be    zkdmtpdyyck

The other typical example is when the message is "hacked" into very small pieces by HTML tags that are invisible to the human eye, which will still read "Viagra" etc.

Suggestion: Filter in several passes. Pass 1 - filter as it is, to catch all sorts of html tricks etc. Pass 2 - remove all html tags, then pass it through the filter again. Pass 3 - remove all "delimiters", then pass it through the filter again (delimiters would be all sorts of *-_': etc but also digits and possibly blanks).

Maybe it could be a configuration in SpamFilter how many passes to run and what to remove during each pass (as defined in a regex for each pass). I realise it will slow things down, though; it would have to be a trade-off.
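
A minimal sketch of the multi-pass idea in Python, for illustration only. It is not how SpamFilter is configured: the names `PASSES`, `KEYWORDS` and `first_matching_pass` are made up here, the tag-stripping regex is deliberately crude, and pass 3 removes everything that is not an ASCII letter (which covers delimiters, digits and blanks).

```python
import re

# Hypothetical keyword list; a real setup would load its own.
KEYWORDS = [re.compile(r"viagra", re.IGNORECASE)]

# Each pass is (name, regex describing what to remove before matching).
# None means "filter the message as it is".
PASSES = [
    ("raw", None),
    ("no_html", re.compile(r"<[^>]*>")),
    ("no_delimiters", re.compile(r"<[^>]*>|[^A-Za-z]")),
]

def first_matching_pass(message, passes=PASSES, keywords=KEYWORDS):
    """Return the name of the first pass in which a keyword matches, else None."""
    for name, strip_re in passes:
        # Every pass starts again from the original message.
        text = strip_re.sub("", message) if strip_re else message
        if any(k.search(text) for k in keywords):
            return name
    return None
```

Because pass 3 also removes blanks, a keyword can appear by accident across a word boundary, so that pass is the one most likely to need tuning. The cost is one extra substitution and one extra keyword scan per pass.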

Another - not as efficient - method would be to filter on the number of "delimiters" used, but I feel that SpamFilter will soon need to be able to apply different filters to Subject, Body, etc to do this.
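
As a rough illustration of counting delimiters in the Subject only, here is a small Python sketch. The function names and the threshold of 4 are assumptions for the example, and "delimiter" is taken to mean any character that is not a letter, digit or whitespace.

```python
import re

# Anything that is not an ASCII letter, a digit or whitespace counts as a delimiter.
DELIMITER = re.compile(r"[^A-Za-z0-9\s]")

def delimiter_count(subject):
    """Number of delimiter characters in a Subject line."""
    return len(DELIMITER.findall(subject))

def looks_chopped(subject, max_delimiters=4):
    """True if the Subject has more delimiters than the (hypothetical) limit."""
    return delimiter_count(subject) > max_delimiters
```

An ordinary subject such as "Re: meeting notes, Tuesday" has only two delimiters, while the chopped-up example above has seven, so even a low limit separates the two. Subjects full of punctuation for legitimate reasons would still need a whitelist or a higher limit.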

Desperado
Senior Member

Joined: 27 January 2005
Location: United States
Status: Offline
Points: 1143
Posted: 14 October 2003 at 12:18pm

Carl,

At 100,000 messages a day, this would get horrible ... don't you think? I find so many variations of "sliced up" messages that I get very frustrated also.

I would like to have a "Max HTML Comment" feature but cannot seem to get a RegEx to do this correctly.
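
One way such a limit could be sketched outside SpamFilter, in Python: count the comments with a non-greedy regex and compare the count with a threshold. This is only an illustration; "Max HTML Comment" is not an existing setting, and the function names and the default limit of 5 are assumptions.

```python
import re

# Non-greedy so that each comment is counted on its own; DOTALL lets a
# comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def count_html_comments(source):
    """Number of complete HTML comments in the raw message source."""
    return len(HTML_COMMENT.findall(source))

def too_many_comments(source, max_comments=5):
    """True if the message contains more comments than the (hypothetical) limit."""
    return count_html_comments(source) > max_comments
```

The non-greedy `.*?` is the important part: a greedy `.*` would swallow everything from the first `<!--` to the last `-->` and count it as a single comment.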

On your idea of filtering, then rendering, and then filtering again ... one problem I see is that when the word "Penis" suddenly shows up, are you going to block that?  You will come up with BUNCHES of false positives in that case.  I would still base my filtering on the message attempting to hide the content.  So ... if the message, on the rendered pass, suddenly shows up with words that were NOT present in the first pass, then it should be blocked, because it was definitely obscured and the only reason for obfuscation is to "fool" spam filters.  This way, you are not filtering words (which, BTW, is called censoring), but instead you are filtering on the INTENT to hide those words.  Make sense?
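
A minimal Python sketch of that comparison, assuming that "rendering" can be approximated by simply stripping tags (a real renderer does much more). The function names are made up for the example.

```python
import re

TAG = re.compile(r"<[^>]*>")   # crude: treats anything between < and > as a tag
WORD = re.compile(r"[a-z]+")

def words(text):
    """Set of lowercase alphabetic words found in the text."""
    return set(WORD.findall(text.lower()))

def revealed_words(source):
    """Words visible once tags are stripped that were not in the raw source."""
    return words(TAG.sub("", source)) - words(source)
```

A non-empty result means some word only exists after the tags are removed, which is the "intent to hide" signal, regardless of what the word is. Legitimate mail that puts a tag inside a word (for example `un<i>believable</i>`) would also trigger it, so this would need a tolerance or a score rather than an outright block.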

Dan S.

LogSat
Admin Group

Joined: 25 January 2005
Location: United States
Status: Offline
Points: 4104
Posted: 15 October 2003 at 12:35am

Carl,

Performing all those filtering translations can give a server quite a hit. Furthermore, they are still limited, as spammers can and will always find a way around them. V1agra or viegra or v!agra or viagre or uiagra or viiagra will often find a hole in any keyword filtering.

The best approach is, in our opinion, completely different. We released a beta of the next major version of SpamFilter ISP, which performs statistical DNA fingerprinting on incoming emails. This allows SpamFilter to adapt rather quickly and learn all variations of the word Viagra, for example, almost in real time as spammers invent new ones. We'll be more focused on that aspect of filtering rather than on simple keyword lists in the near future.

Roberto F.
LogSat Software 

Carl Giljam
Guest Group
Posted: 16 October 2003 at 5:15pm

I agree that some sort of automatic adaption is necessary, Bayes or whatever - because spammers change pattern very quickly. But at least in the case of using html tricks to obscure the message I am still in favour of removing all html overhead before applying any other filters.

We use the IMail 8 email server (version 8 includes spam filters as well). I don't use the spam handling of IMail, though, because I liked the regexes in SpamFilter better, so I put SpamFilter in front of the email server. IMail uses a Bayesian filter (which I didn't like because it was user-unfriendly to update and teach), and it removes all html first. That seems to me to be the logical way to handle things; otherwise, won't you have to teach your filter that Via<all sorts of html>gra is actually Viagra?

Carl

LogSat
Admin Group

Joined: 25 January 2005
Location: United States
Status: Offline
Points: 4104
Posted: 17 October 2003 at 12:53am

Carl,

The Bayesian filtering we use strips out all html tags except for a few which we keep since they are good indicators of spam/good.

However, for now, we will keep the keyword searches limited to the unparsed email. We feel the original html source, with all garbage comments included, gives RegEx a better chance to work than plain text does. In plain text, Viagra can be written in a myriad of ways that a regular keyword scan will miss: V   i a   g  a    V.,i,.a,.g,.r,.a   V:I:A:G:R:A and so on. There will always be a new way that no filter covers yet. But searching through html tags can be very useful and very simple, with just a couple of RegEx expressions blocking the same amount of spam that would otherwise require hundreds of separate keywords.
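
To illustrate why the raw html source can be easier to match than plain text, here is a hedged Python sketch of one such expression. It is not one of LogSat's actual filters; the pattern and the function name are assumptions. It flags a comment placed directly between two letters, whatever the word being split happens to be.

```python
import re

# A letter, then an HTML comment, then another letter: a comment
# sitting in the middle of a word.
COMMENT_INSIDE_WORD = re.compile(r"[A-Za-z]<!--.*?-->[A-Za-z]", re.DOTALL)

def has_split_word(source):
    """True if the raw source contains a comment embedded inside a word."""
    return COMMENT_INSIDE_WORD.search(source) is not None
```

A single pattern like this does not need to know the hidden keyword at all, which is the advantage over maintaining a list of every spelling. Comments that sit between whole elements, with whitespace or tags on either side, are left alone.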

We are counting on the Bayesian filter to do most of the hard work though. We tried to keep it as simple to admin as possible, and hopefully we succeeded.

Roberto F.
LogSat Software

eric
Guest Group
Posted: 21 October 2003 at 2:37pm

I score a huge amount by checking the end of messages lately,

sure the gif-only html email annoyed me the most,

these keywords did the biggest job for me,

>rem<!--,-->ove
re: 0!~

play with some variants and see some big results,

most junk has the "no more please", "remove" etc options, which LogSat can target.

-eric-
