Feature request: un-munge text before keyword check |
Alec, Guest Group |
Posted: 02 October 2003 at 10:06pm |
Hi,

We see loads of email munged with HTML tags like this: V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!

I have some RegExes working well to block these, but I worry about false positives because they don't necessarily have to match the whole keyword: ((?i)(s|>)v(iagra|<!.*>|i<!.*>|ia<!.*>|iag<!.*>|iagr<!.*>))

I would like to be able to remove the tags and _then_ look for keywords. Is there a better way to do this without adding a prefilter? I wrote a RegEx that counted the number of tags per line and blocked the message if there were more than a certain number, but then I was getting false positives on some XML emails.

It would be ideal to be able to chain filters of two types, Modify and Match. Then I could specify a Modify rule to remove the tags (something like "(~r)<.*>", where "(~r)" means remove) and then pass the result to a Match rule (the existing RegEx stuff). It could all be done on one line in the keyword list like this: ((~r)<.*>) --> ((?i)viagra)

Sorry if I elaborated too much; I'm just thinking out loud. Thanks.
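For illustration, the "remove the tags, then match" idea can be sketched outside SpamFilter in a few lines of Python. This is only a sketch under assumptions: the function names are made up, and it uses the tag pattern `<[^>]*>` rather than the "<.*>" from the post, because a greedy "<.*>" would delete everything from the first "<" to the last ">" on a line and leave only "Vgra!" for the sample above.

```python
import re

# Hypothetical "Modify" step: strip anything that looks like a tag.
# [^>]* stops at the first ">", unlike the greedy .* in "<.*>".
TAG = re.compile(r"<[^>]*>")

# Hypothetical "Match" step: the plain keyword, case-insensitive.
KEYWORD = re.compile(r"viagra", re.IGNORECASE)


def unmunge(text: str) -> str:
    """Return the text with all tag-like spans removed."""
    return TAG.sub("", text)


def is_blocked(text: str) -> bool:
    """Un-munge first, then look for the whole keyword."""
    return KEYWORD.search(unmunge(text)) is not None


sample = "V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!"
print(unmunge(sample))     # Viagra!
print(is_blocked(sample))  # True
```

Because the keyword is matched in full after the tags are gone, the Match rule no longer needs the partial alternatives (i<!.*>, ia<!.*>, and so on) that raise the false-positive concern.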
Desperado, Senior Member | Joined: 27 January 2005 | Location: United States | Status: Offline | Points: 1143 |
Alec, I have to agree ... and disagree. I honestly feel that searching for the obfuscation (munge) is the best way to avoid false positives. I also wish I could get my attempts to count repeated patterns to work properly, but I just do not seem to be able to "outsmart" the spam software. Having said all that, LogSat is working on an "intelligent fingerprinting" filter, and I had the opportunity to do some testing on its first go-around. Even though it eventually caused issues that LogSat needed to rethink, I was encouraged enough that I have stopped where I am with my RegExes and am waiting for the next "generation". I think that many of the issues we are facing may be nicely solved by the new filters. Dan S. |
Alec, Guest Group |
I agree that looking for the munge should be a more effective tactic than un-munging and looking for keywords. Unfortunately, certain types of legitimate email (like those containing XML content) also have a whole lot of comment tags embedded in them, so false positives become a problem.

To solve this, a generic prefiltering mechanism could be applied in a different way than I described before. I suggested doing this: ((~r)<.*>) --> ((?i)viagra) where "(~r)" was a modifier to remove text matching the following RegEx, "<.*>", and "-->" was an operator to pipe the output into the next filter RegEx, "((?i)viagra)".

We could have other operators like "!->", meaning only pipe the output to the next filter if the first filter returned false. Then we could construct something like this: (.*XML.*) !-> (([^>]<!.*>.*[^ $]){3,}) where the first RegEx looks for an XML identifier, and only if it doesn't find one is the message text passed to the second filter, which looks for 3 comment tags on a single line. (That filter is a bit buggy regarding newline characters, BTW.)

It would take the spammers a little while to figure out that they just need to stick an XML identifier into their munged spam to get it through...
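The two proposed operators can be modelled as a small chain of steps. The sketch below is hypothetical and is not a SpamFilter feature: `remove`, `unless`, and `chain_matches` are made-up names, the XML check is reduced to a case-insensitive search for "xml", and the three-comment-tag pattern is a simplified stand-in for the RegEx in the post, which was mangled by the forum.

```python
import re
from typing import Callable, List, Optional

# A step receives the current text and returns the text to hand to the
# next step, or None to stop the chain without a match.
Step = Callable[[str], Optional[str]]


def remove(pattern: str) -> Step:
    """Model of "(~r)pattern -->": delete matches and pass the rest on."""
    rx = re.compile(pattern)
    return lambda text: rx.sub("", text)


def unless(pattern: str) -> Step:
    """Model of "pattern !->": continue only if the pattern is NOT found."""
    rx = re.compile(pattern)
    return lambda text: None if rx.search(text) else text


def chain_matches(steps: List[Step], final_pattern: str, text: str) -> bool:
    """Run the steps in order, then apply the final Match RegEx."""
    for step in steps:
        result = step(text)
        if result is None:
            return False
        text = result
    return re.search(final_pattern, text) is not None


# Stand-in for "3 comment tags on a single line" (assumption, see above).
THREE_TAGS = r"(?:<![^>]*>.*?){3,}"

# ((~r)<...>) --> ((?i)viagra)
print(chain_matches([remove(r"<[^>]*>")], r"(?i)viagra", "V<!x>ia<!y>gra"))

# (XML) !-> (three comment tags)
print(chain_matches([unless(r"(?i)xml")], THREE_TAGS, "V<!a>i<!b>a<!c>gra"))
print(chain_matches([unless(r"(?i)xml")], THREE_TAGS,
                    '<?xml version="1.0"?> <!a> <!b> <!c>'))
```

The last call shows the weakness already noted in the post: any message that contains an XML identifier skips the tag-count filter entirely.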
Alec, Guest Group |
The last RegEx in my previous post got mangled somewhere along the line; backslash-n and backslash-r are apparently not displayed in the forum.
LogSat, Admin Group | Joined: 25 January 2005 | Location: United States | Status: Offline | Points: 4104 |
Alec, applying DNA fingerprinting to detect spam is proving very effective at catching the tricks used by spammers. Our statistical engine is able to cope with junk inserted à la XML content and/or HTML comments. All meaningless words and random strings of characters are simply ignored by the filter. Almost all messages do have some DNA structure in them that will allow SpamFilter to classify them as spam or not. We have released a beta version with this new anti-spam engine if you wish to test-drive it. Roberto F. |
Alec, Guest Group |
Will you please explain what "DNA fingerprinting" is and how it works? Thanks.
LogSat, Admin Group | Joined: 25 January 2005 | Location: United States | Status: Offline | Points: 4104 |
From our beta download page at http://www.logsat.com/sfi-beta.asp:

Roberto F.

======================

The new v2.x release of SpamFilter will feature statistical DNA fingerprinting of incoming emails. The statistical analysis is performed using Bayesian rules. Tokens within incoming emails are scanned and categorized in a corpus file. The content of all new incoming email is fingerprinted and checked against the historical data. If there is a high statistical probability that the email is spam, it is rejected.

The statistical engine currently kicks in after 500 non-spam and 500 spam emails have been received. This is done to build a valid statistical base before emails are rejected. During this period, it is critical to avoid false positives. If a good email is quarantined, forcing its redelivery through either the web interface or the SpamFilter GUI will "teach" SpamFilter that the fingerprint in that email is a "good" one, and the statistical DNA database will adapt itself to it. It is very important initially to check the quarantine often and force delivery of legitimate email that has been blocked by the "regular" filtering rules. |
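As a rough illustration of what token-based Bayesian scoring with a warm-up period can look like, here is a minimal naive Bayes sketch. It is not LogSat's implementation: the announcement only states that tokens are recorded in a corpus and that the engine starts after 500 spam and 500 non-spam emails. The class name, the tokenizer, the smoothing, and the 0.9 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Hypothetical naive Bayes token filter with a warm-up period."""

    def __init__(self, min_each: int = 500, threshold: float = 0.9):
        self.min_each = min_each    # messages needed per class before rejecting
        self.threshold = threshold  # assumed cut-off, not from the source
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> set:
        # Each distinct token counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text: str, label: str) -> None:
        """Record a message as "spam" or "ham" (e.g. a forced redelivery)."""
        self.messages[label] += 1
        self.tokens[label].update(self.tokenize(text))

    def is_active(self) -> bool:
        return min(self.messages.values()) >= self.min_each

    def spam_probability(self, text: str) -> float:
        # Sum log-odds per token, with add-one smoothing for unseen tokens.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))

    def should_reject(self, text: str) -> bool:
        # Never reject during the warm-up period.
        return self.is_active() and self.spam_probability(text) >= self.threshold
```

In this sketch, forcing redelivery of a quarantined message would correspond to calling `train(text, "ham")`, which shifts the token counts toward "good" for the words in that message.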
aaron, Guest Group |
Will this beta version become an eval version? I'm a small user, operating a spam filter for the tenants in my apartment building (lots of time, since I am unemployed). We just don't get enough mail through our e-mail server to trip the Bayesian filter (1000 e-mails total) in the month allocated for the beta-test project. I don't think we even get more than about 500 mails per month, if that. Thanks! |