Spam Filter ISP Support Forum


Feature request: un-munge text before keyword check

Alec (Guest Group)
Posted: 02 October 2003 at 10:06pm

Hi,

We see loads of email munged with html tags like this:

V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!

I have some RegExes working well to block these, but I worry about false positives because they don't necessarily have to match the whole keyword:

((?i)(s|>)v(iagra|<!.*>|i<!.*>|ia<!.*>|iag<!.*>|iagr<!.*>))

I would like to be able to remove the tags and _then_ look for keywords. 

Is there a better way to do this without adding a prefilter?

I wrote a RegEx that counted the number of tags per line and blocked if it was more than a certain number, but then I was getting false positives on some XML emails.
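
A minimal Python sketch of the tag-counting approach, written outside SpamFilter's own keyword syntax. The pattern, the function names, and the threshold of 2 are assumptions for illustration, not the RegEx actually used. It also shows why XML-style mail trips the same rule.

```python
import re

# Assumed definition of a "comment tag": "<!" followed by anything up to the next ">".
COMMENT_TAG = re.compile(r"<![^>]*>")

def max_comment_tags_per_line(body):
    """Return the largest number of comment tags found on any single line."""
    return max((len(COMMENT_TAG.findall(line)) for line in body.splitlines()), default=0)

def looks_munged(body, threshold=2):
    """Flag the message if any line has more than `threshold` tags (hypothetical cutoff)."""
    return max_comment_tags_per_line(body) > threshold

spam = "V<!a>i<!b>ag<!c>ra"
xml_mail = "<!-- header --><!-- item --><!-- footer -->"
print(looks_munged(spam))      # True
print(looks_munged(xml_mail))  # True: a legitimate message blocked by the same rule
```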

It would be ideal to be able to chain filters of two types--Modify and Match.  Then I could specify a Modify rule to remove the tags (something like "(~r)<.*>" where "(~r)" means remove) and then pass the result to a Match rule (the existing RegEx stuff).  It could all be done on one line in the keyword list like this:

((~r)<.*>) --> ((?i)viagra)
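
The proposed Modify-then-Match chain can be sketched in Python as follows. The "(~r)" and "-->" syntax is only a suggestion in this thread, so the function name and patterns here are hypothetical. One caveat: a greedy "<.*>" swallows everything between the first "<" and the last ">" on a line, so this sketch removes tags with "<[^>]*>" instead.

```python
import re

TAG = re.compile(r"<[^>]*>")                      # Modify rule: matches one tag at a time
KEYWORD = re.compile(r"viagra", re.IGNORECASE)    # Match rule: the existing keyword check

def modify_then_match(text, modify=TAG, match=KEYWORD):
    """Remove everything `modify` matches, then test what is left against `match`."""
    cleaned = modify.sub("", text)
    return match.search(cleaned) is not None

munged = "V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!"
print(modify_then_match(munged))        # True

# With the greedy pattern, the "ia" between the two tags is removed as well.
greedy = re.compile(r"<.*>")
print(greedy.sub("", munged))           # Vgra!
```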

Sorry if I elaborated too much; I'm just thinking out loud.

Thanks.

Desperado (Senior Member)
Posted: 03 October 2003 at 9:44pm

Alec,

I have to agree ... and disagree.  I honestly feel that searching for the obfuscation (munge) is the best way to avoid false positives.  I also wish I could get my attempts to count repeated patterns to work properly, but I just do not seem to be able to "outsmart" the spam software.

Having said all that, LogSat is working on an "intelligent fingerprinting" filter, and I had the opportunity to do some testing on the first go-around.  Even though it eventually caused issues that LogSat needed to rethink, I was encouraged enough that I have stopped where I am with my RegExes and am waiting for the next "generation".  I think that many of the issues we are facing may be nicely solved by the new filters.

Dan S.

Alec (Guest Group)
Posted: 07 October 2003 at 5:32pm

I agree that looking for the munge should be a more effective tactic than un-munging and looking for keywords.  Unfortunately certain types of legitimate email (like those containing XML content) also have a whole lot of comment tags embedded in them, so false positives become a problem.

To solve this, a generic prefiltering mechanism could be applied in a different way than I described before.  I suggested doing this:

((~r)<.*>) --> ((?i)viagra)

where "(~r)" was a modifier to remove text matching the following RegEx, "<.*>", and "-->" was an operator to pipe the output into the next filter RegEx, "((?i)viagra)".  We could have other operators like "!->", meaning only pipe the output to the next filter if the first filter returned false.  Then we could construct something like this:

(.*XML.*) !-> (([^>]<!.*>.*[^ $]){3,})

where the first RegEx looks for an XML identifier and only if it doesn't find one the message text is passed to the second filter, which looks for 3 comment tags on a single line.  (That filter is a bit buggy regarding newline characters, BTW).
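
A rough Python sketch of the proposed "!->" operator: run the second filter only when the first one returns false. The helper names are hypothetical, and the second filter counts comment tags per line rather than reproducing the RegEx above, which did not survive posting intact.

```python
import re

XML_MARKER = re.compile(r"XML")            # stands in for the first filter, (.*XML.*)
COMMENT_TAG = re.compile(r"<![^>]*>")      # assumed definition of a comment tag

def has_xml_marker(text):
    return XML_MARKER.search(text) is not None

def has_three_tags_on_a_line(text):
    return any(len(COMMENT_TAG.findall(line)) >= 3 for line in text.splitlines())

def pipe_if_false(first, second):
    """Build a filter that consults `second` only when `first` returns False."""
    def combined(text):
        if first(text):
            return False        # first filter matched: stop, do not block
        return second(text)
    return combined

block = pipe_if_false(has_xml_marker, has_three_tags_on_a_line)

print(block("V<!a>i<!b>ag<!c>ra"))                                   # True
print(block("XML order export\n<!-- a --><!-- b --><!-- c -->"))     # False
print(block("XML V<!a>i<!b>ag<!c>ra"))                               # False
```

The last call shows the weakness of this rule: adding an XML identifier to munged spam is enough to get it through.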

It would take the spammers a little while to figure out that they just need to stick an XML identifier into their munged spam to get it through...

Alec (Guest Group)
Posted: 07 October 2003 at 5:40pm
The last RegEx in my previous post got mangled somewhere along the line; backslash-n and backslash-r are apparently not displayed in the forum.

LogSat (Admin Group)
Posted: 09 October 2003 at 11:04pm

Alec,

Applying DNA fingerprinting to detect spam is proving very effective in detecting all tricks used by spammers. Our statistical engine is able to cope with junk inserted à la XML content and/or HTML comments. All meaningless words and random strings of characters are simply ignored by the filter. Almost all messages have some DNA structure in them that will allow SpamFilter to classify them as spam or not.

We have released a beta version with this new anti-spam engine if you wish to test-drive it.

Roberto F.
LogSat Software

Alec (Guest Group)
Posted: 15 October 2003 at 9:30pm

Will you please explain what "DNA fingerprinting" is and how it works?

Thanks.

LogSat (Admin Group)
Posted: 16 October 2003 at 12:42am

From our beta download page at http://www.logsat.com/sfi-beta.asp:

Roberto F.
LogSat Software

======================

The new v2.x release of SpamFilter will feature a statistical DNA fingerprinting of incoming emails. The statistical analysis is performed using Bayesian rules. Tokens within incoming emails are scanned and categorized in a corpus file. The content of all new incoming email is fingerprinted and checked against the historical data. If there is a high statistical probability that the email is spam, it is rejected.

The statistical engine currently kicks in after 500 non-spam and 500 spam emails have been received. This is done to build a valid statistical base to use before emails are rejected. During this period of time, it is critical to avoid false positives. If a good email is quarantined, forcing its redelivery either through the web interface or the SpamFilter GUI will "teach" SpamFilter that the fingerprint in that email is a "good" one, and the statistical DNA database will adapt itself to it. It is very important initially to check the quarantine often to force delivery of legitimate email that has been blocked by the "regular" filtering rules.
==============================
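
For readers unfamiliar with Bayesian token filtering, here is a generic Python sketch of the idea described in the excerpt above. It is not LogSat's implementation: the class, its tokenizer, and the smoothing are assumptions, and only the 500/500 activation threshold comes from the quoted text.

```python
import math
import re
from collections import Counter

class TinyBayes:
    """Toy token-based spam scorer; stays inactive until enough mail of each kind is seen."""

    def __init__(self, min_per_class=500):
        self.min_per_class = min_per_class
        self.tokens = {"spam": Counter(), "ham": Counter()}   # messages containing each token
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        self.messages[label] += 1
        self.tokens[label].update(self.tokenize(text))

    def is_active(self):
        return min(self.messages.values()) >= self.min_per_class

    def spam_probability(self, text):
        """Return a probability in [0, 1], or None while still in the learning period."""
        if not self.is_active():
            return None
        log_odds = math.log(self.messages["spam"] / self.messages["ham"])
        for tok in self.tokenize(text):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))   # keep math.exp in range
        return 1.0 / (1.0 + math.exp(-log_odds))

f = TinyBayes(min_per_class=2)          # tiny threshold for demonstration only
f.train("cheap pills now", "spam")
f.train("cheap pills today", "spam")
f.train("meeting notes attached", "ham")
print(f.spam_probability("cheap pills"))    # None: not enough non-spam seen yet
f.train("lunch meeting tomorrow", "ham")
print(f.spam_probability("cheap pills") > 0.5)   # True
```

In this sketch, forcing redelivery of a quarantined message would correspond to calling `train(text, "ham")` on it.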

aaron (Guest Group)
Posted: 18 October 2003 at 9:06am

Will this beta version become an eval version?  I'm a small user, operating a spam-filter for the tenants in my apartment building (lots of time, since I am unemployed). We just don't get enough mail through our e-mail server to trip the Bayesian filter (1000 e-mails total) in the month allocated for the beta-test project.  I don't think we even get more than about 500 mails per month, if even that.

Thanks!
