LogSat Software


Author	Message
Alec Guest Group	Topic: Feature request: un-munge text before keyword check Posted: 02 October 2003 at 10:06pm
	Hi,
	We see loads of email munged with HTML tags like this:
	V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!
	I have some RegExes working well to block these, but I worry about false positives because they don't necessarily have to match the whole keyword:
	((?i)(s|>)v(iagra|<!.*>|i<!.*>|ia<!.*>|iag<!.*>|iagr<!.*>))
	I would like to be able to remove the tags and _then_ look for keywords.
	Is there a better way to do this without adding a prefilter?
	I wrote a RegEx that counted the number of tags per line and blocked if it was more than a certain number, but then I was getting false positives on some XML emails.
	It would be ideal to be able to chain filters of two types--Modify and Match. Then I could specify a Modify rule to remove the tags (something like "(~r)<.*>" where "(~r)" means remove) and then pass the result to a Match rule (the existing RegEx stuff). It could all be done on one line in the keyword list like this:
	((~r)<.*>) --> ((?i)viagra)
	Sorry if I elaborated too much; I'm just thinking out loud.
	Thanks.
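	The Modify-then-Match chain proposed above can be sketched outside SpamFilter in a few lines of Python. This is only an illustration of the idea, not SpamFilter syntax: the names `strip_tags` and `is_blocked` are made up for the example, and the tag pattern is an assumption. Note that it uses `<[^>]*>` rather than the greedy `<.*>` from the post, because on a single line `<.*>` would match from the first `<` to the last `>` and turn the sample above into "Vgra!" instead of "Viagra!".

```python
import re

# Hypothetical "Modify" rule: remove anything that looks like a tag or comment.
# [^>]* stops at the first closing bracket, so text between two tags survives.
TAG = re.compile(r"<[^>]*>")

# Hypothetical "Match" rule: the plain keyword check, case-insensitive.
KEYWORD = re.compile(r"viagra", re.IGNORECASE)


def strip_tags(text: str) -> str:
    """Modify step: return the text with all tag-like spans removed."""
    return TAG.sub("", text)


def is_blocked(text: str) -> bool:
    """Chain: pipe the output of the Modify step into the Match step."""
    return KEYWORD.search(strip_tags(text)) is not None


print(strip_tags("V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!"))  # Viagra!
print(is_blocked("V<!fjdskfdshjk>ia<!jfdkjfdskfdfjjjjjj>gra!"))  # True
print(is_blocked("Meeting <b>agenda</b> attached"))              # False
```

	Because the keyword is matched only after the tags are gone, the Match rule has to see the whole word, which is the false-positive concern raised in the post.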

Desperado Senior Member Joined: 27 January 2005 Location: United States Points: 1143	Posted: 03 October 2003 at 9:44pm
	Alec, I have to agree ... and disagree. I honestly feel that searching for the obfuscation (munge) is the best way to avoid false positives. I also wish I could get my attempts to count repeated patterns to work properly, but I just do not seem to be able to "outsmart" the spam software. Having said all that, LogSat is working on an "intelligent fingerprinting" filter, and I had the opportunity to do some testing on the first go-around. Even though it eventually caused issues that LogSat needed to rethink, I was encouraged enough that I have stopped where I am with my RegExes and am waiting for the next "generation". I think that many of the issues we are facing may be nicely solved by the new filters. Dan S.

Alec Guest Group	Posted: 07 October 2003 at 5:32pm
	I agree that looking for the munge should be a more effective tactic than un-munging and looking for keywords. Unfortunately, certain types of legitimate email (like those containing XML content) also have a whole lot of comment tags embedded in them, so false positives become a problem. To solve this, a generic prefiltering mechanism could be applied in a different way than I described before. I suggested doing this:
	((~r)<.*>) --> ((?i)viagra)
	where "(~r)" was a modifier to remove text matching the following RegEx, "<.*>", and "-->" was an operator to pipe the output into the next filter RegEx, "((?i)viagra)". We could have other operators like "!->", meaning only pipe the output to the next filter if the first filter returned false. Then we could construct something like this:
	(.XML.) !-> (([^>]<!.>.[^ $]){3,})
	where the first RegEx looks for an XML identifier, and only if it doesn't find one is the message text passed to the second filter, which looks for 3 comment tags on a single line. (That filter is a bit buggy regarding newline characters, BTW.) It would take the spammers a little while to figure out that they just need to stick an XML identifier into their munged spam to get it through...
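	The "!->" operator described above amounts to "run the second filter only if the first one did not match". A rough Python sketch of that logic follows. Everything here is an assumption made for illustration: the `<?xml` marker stands in for "an XML identifier", the comment-tag pattern stands in for the mangled RegEx in the post, and the function names are hypothetical. Splitting the text into lines first sidesteps the newline problem mentioned in the post.

```python
import re

# Assumed stand-in for "an XML identifier"; the original pattern was mangled.
XML_MARKER = re.compile(r"<\?xml", re.IGNORECASE)

# Assumed pattern for an HTML-style comment tag such as <!fjdskfdshjk>.
COMMENT_TAG = re.compile(r"<![^>]*>")

# Threshold taken from the post: 3 comment tags on a single line.
MIN_TAGS_PER_LINE = 3


def looks_like_xml(text: str) -> bool:
    """First filter: does the message carry an XML identifier?"""
    return XML_MARKER.search(text) is not None


def has_munged_line(text: str) -> bool:
    """Second filter: does any single line hold 3 or more comment tags?"""
    return any(
        len(COMMENT_TAG.findall(line)) >= MIN_TAGS_PER_LINE
        for line in text.splitlines()
    )


def is_blocked(text: str) -> bool:
    """'A !-> B': evaluate B only when A returned false."""
    if looks_like_xml(text):
        return False
    return has_munged_line(text)


print(is_blocked("V<!a>i<!b>a<!c>gra"))                        # True
print(is_blocked('<?xml version="1.0"?>\n<!a><!b><!c> data'))  # False
print(is_blocked("hello <!one> world"))                        # False
```

	The second call also shows the weakness noted at the end of the post: any message that includes the XML marker skips the tag count entirely.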

Alec Guest Group	Posted: 07 October 2003 at 5:40pm
	The last RegEx in my previous post got mangled somewhere along the line; backslash-n and backslash-r are apparently not displayed in the forum.

LogSat Admin Group Joined: 25 January 2005 Location: United States Points: 4105	Posted: 09 October 2003 at 11:04pm
	Alec, Applying DNA fingerprinting to detect spam is proving very effective in detecting all tricks used by spammers. Our statistical engine is able to cope with junk inserted a la XML content and/or HTML comments. All meaningless words and random strings of characters are simply ignored by the filter. Almost all messages do have some DNA structure in them that will allow SpamFilter to consider them spam or not. We have released a beta version with this new anti-spam engine if you wish to test-drive it. Roberto F. LogSat Software

Alec Guest Group	Posted: 15 October 2003 at 9:30pm
	Will you please explain what "DNA fingerprinting" is and how it works? Thanks.

LogSat Admin Group Joined: 25 January 2005 Location: United States Points: 4105	Posted: 16 October 2003 at 12:42am
	From our beta download page at http://www.logsat.com/sfi-beta.asp:
	Roberto F. LogSat Software
	======================
	The new v2.x release of SpamFilter will feature a statistical DNA fingerprinting of incoming emails. The statistical analysis is performed using Bayesian rules. Tokens within incoming emails are scanned and categorized in a corpus file. The content of all new incoming email is fingerprinted and checked against the historical data. If there is a high statistical probability that the email is spam, it is rejected.
	The statistical engine currently kicks in after 500 non-spam and 500 spam emails have been received. This is done to build a valid statistical base to use before emails are rejected. During this period of time, it is critical to avoid false positives. If a good email is quarantined, forcing its redelivery either through the web interface or the SpamFilter GUI will "teach" SpamFilter that the fingerprint in that email is a "good" one, and the statistical DNA database will adapt itself to it. It is very important initially to check the quarantine often to force delivery of legitimate email that has been blocked by the "regular" filtering rules.
	==============================
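	For readers unfamiliar with Bayesian token filtering, here is a generic, simplified sketch of the mechanism the quoted text describes: tokens are counted per class in a corpus, no verdict is given until enough spam and non-spam mail has been seen, and forcing redelivery of a quarantined message corresponds to training it as "good". This is not LogSat's implementation; the class name, tokenizer, smoothing and scoring formula are all assumptions. Only the 500/500 threshold comes from the quoted page.

```python
import math
import re
from collections import Counter

# From the quoted beta notes: the engine kicks in after 500 spam and 500 non-spam.
MIN_TRAINING = 500


class TokenCorpus:
    """Hypothetical per-token corpus with a naive Bayes style score."""

    def __init__(self, min_training: int = MIN_TRAINING):
        self.min_training = min_training
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text: str) -> set:
        # Assumed tokenizer: lowercase runs of letters, digits and a few symbols.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text: str, label: str) -> None:
        """Record one message; label is 'spam' or 'ham' (good mail)."""
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def ready(self) -> bool:
        """True once both classes have reached the training threshold."""
        return min(self.messages.values()) >= self.min_training

    def spam_probability(self, text: str):
        """Return P(spam) for the text, or None while still in the training period."""
        if not self.ready():
            return None
        log_odds = math.log(self.messages["spam"] / self.messages["ham"])
        for tok in self.tokens(text):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


# Tiny demo with the threshold lowered to 2 so it fits on screen.
corpus = TokenCorpus(min_training=2)
corpus.train("cheap viagra now", "spam")
corpus.train("buy viagra cheap", "spam")
corpus.train("meeting agenda attached", "ham")
corpus.train("lunch meeting tomorrow", "ham")  # e.g. a forced redelivery
print(corpus.spam_probability("viagra cheap") > 0.5)      # True
print(corpus.spam_probability("meeting tomorrow") < 0.5)  # True
```

	With the default threshold, `spam_probability` returns `None` until 500 messages of each kind have been trained, which mirrors the training period described in the quoted text.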

aaron Guest Group	Posted: 18 October 2003 at 9:06am
	Will this beta version become an eval version? I'm a small user, operating a spam filter for the tenants in my apartment building (lots of time, since I am unemployed). We just don't get enough mail through our e-mail server to trip the Bayesian filter (1000 e-mails total) in the month allocated for the beta-test project. I don't think we even get more than about 500 mails per month, if that. Thanks!

Spam Filter ISP Support Forum

Feature request: un-munge text before keyword check