Catching the comments
Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=6823
Printed Date: 22 November 2024 at 9:02am
Topic: Catching the comments
Posted By: yapadu
Date Posted: 14 April 2010 at 9:24am
I am trying to catch messages with a large amount of HTML comments in them, using this regex:
((?-g)<!--.{500,}?-->)
I have tried a few different variations:
(<!--.{500,}?-->)
((?-g)<!--.{500,}-->)
My full keyword does include a non-regex keyword first:
content,((?-g)<!--.{500,}?-->)
Basically I am trying to say: non-greedy search, with at least 500 characters between the comment delimiters. Every regex tool I have tested says it should work, including the regex test tab in SpamFilter.
But on real messages, nothing is caught. Does anyone see what I am doing wrong?
The following is an example of some text that should be caught:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>hi</title> <style type="text/css"> <!-- body { background-color: #FFF; } --> </style></head>
<body link="#cc054b" alink="#cc054b" vlink="#cc054b">
<center>
<!-- Au cognitive should dddd innovative resorts qqqq hi ttt pounds uuuu thank shops desert area circle reverse trade avenue partners vandana presumed weet started change strengths automatic predicted shrimp missing programmed concerns literature appliances led bill kkkk CTS Australian image similarly drugs farmer sss hesitate mat rrrr gggg haven profile confidential dissolved gid rdonlyres align bots history confidential smalltext foe? du head ttt uuuun aspx change avw wave live shops autumn refer giveaway custom uuuu mypage profile exceeds makeup contrary avenue plastics buildings wrote ddd Neues ppp allies sss kkk stripe gear image reverse citation camels actions audio notify WWW pulse bread jjjj tripolis invasion institutions jam wave printing log go hhh toolbar mid calls wwww color fragrance OK went aaaa zzz iiiii estimado subscriber digital Desktops asp officers Je comes tomorrow mmmm engineer iii numerous crushable heads preferences paddy attachments smalltext cccc besuchen deaths representing makeup ddd margin portion llll increases team ooo circle subscriber ccc farmers received www Neues utritious eee width imre zzzz eee fff whatever don mileage led Debbie utm qqq bbbb oooo jjjj cookie windows century charged mid book mmm nn de toolbar exactly regulators yyy laws pero vspace employees hhhh Dan on eeee plastics inter at printing finances partners powered llll Au Aug bbb New correspond replacement experiencing Het -->
<table width="475" cellspacing="0" cellpadding="5" border="0">
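For readers who want to try the pattern outside SpamFilter, here is a minimal Python sketch of the same idea. It is only an illustration: Python has no `(?-g)` switch, so the lazy quantifier `{500,}?` stands in for it, the `re.DOTALL` flag is an assumption (it lets `.` cross line breaks), and the names `LONG_COMMENT` and `has_long_comment` are made up for this example.

```python
import re

# Python has no (?-g) flag; the lazy quantifier {500,}? plays that role here.
# re.DOTALL is an assumption so that "." can cross line breaks inside a comment.
LONG_COMMENT = re.compile(r"<!--.{500,}?-->", re.DOTALL)


def has_long_comment(html: str) -> bool:
    """Return True if a '<!--' is followed by 500+ characters and then '-->'."""
    return LONG_COMMENT.search(html) is not None


short_comment = "<style><!-- body { background-color: #FFF; } --></style>"
long_comment = "<center><!-- " + "filler " * 100 + "--></center>"

print(has_long_comment(short_comment))  # False
print(has_long_comment(long_comment))   # True
```

Note that nothing in the pattern forbids a `-->` inside the 500 characters, so a run of many short comments placed back to back would also match.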
|
Replies:
Posted By: LogSat
Date Posted: 18 April 2010 at 8:31pm
Sorry this took a while, but we were ourselves a bit stumped as to why it wasn't working. It turns out that this is due to a bug you just uncovered in SpamFilter. With the above RegEx, you are trying to take advantage of this new feature in SpamFilter with your keyword content,((?-g)<!--.{500,}?-->):
// New to VersionNumber = '4.1.2.810'; {TODO -cNew : Added support to specify multiple RegEx expressions separated by commas, just as regular keywords can be separated by commas - has the effect of specifying "AND" rules for RegEx. Note that a "Standard" non-RegEx keyword must be specified first for SpamFilter to recognize this syntax. For example: X-SF,([a-z]), ([0-9]) }
The bug, however, is causing SpamFilter to barf at the comma after the 500 above. SpamFilter is not recognizing the comma as part of the RegEx, and thinks it's the keyword separator character. This is a bit tricky to fix; we'll have to rework the keyword engine a bit, so I can't promise the fix will be ready in just a few days.
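The failure mode described above can be illustrated with a small Python sketch. This is not SpamFilter's actual code; `naive_split` is a hypothetical parser that simply treats every comma as a keyword separator, which is enough to show how the `{500,}` quantifier gets cut in two.

```python
from typing import List


def naive_split(keyword_line: str) -> List[str]:
    """Hypothetical parser: treat every comma as a keyword separator."""
    return [part.strip() for part in keyword_line.split(",")]


line = "content,((?-g)<!--.{500,}?-->)"
print(naive_split(line))
# ['content', '((?-g)<!--.{500', '}?-->)']
```

Neither of the last two fragments is a valid regular expression on its own, which would be consistent with the keyword never matching real messages.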
------------- Roberto Franceschetti
http://www.logsat.com - LogSat Software
http://www.logsat.com/sfi-spam-filter.asp - Spam Filter ISP
|
Posted By: yapadu
Date Posted: 18 April 2010 at 11:27pm
I'm still not sure my regex is right, but I am fairly sure the final product will require a comma in it.
Looking forward to the fix, as I think it will work well to stop a lot of spam.
|
Posted By: yapadu
Date Posted: 14 August 2010 at 10:52pm
Was a fix ever produced for this? It is tough building some regexes without the use of a comma.
-------------
I am a user of SF, not an employee. Use any advice offered at your own risk.
|
Posted By: LogSat
Date Posted: 17 August 2010 at 9:49pm
As mentioned previously, this is actually a bigger issue than it may seem, and a fix will require a major change in the RegEx engine. We've postponed that change because we did not want to risk causing unwanted side effects in the new release that was made available this last week. I'm unable to say right now when this will be fixed.
------------- Roberto Franceschetti
http://www.logsat.com - LogSat Software
http://www.logsat.com/sfi-spam-filter.asp - Spam Filter ISP
|
Posted By: yapadu
Date Posted: 17 August 2010 at 9:58pm
Thanks for the update.
Is the problem with the regex engine, or with the keyword separator also being a comma?
Currently we have to wrap a regex in () when it is used in the keywords. Maybe we could escape a comma that appears inside the () so the list of keywords can be parsed correctly.
If the regex were something like (home, grown), maybe we could write it as (home,, grown) or (home\, grown).
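The backslash variant of this idea can be sketched in Python as follows. This is only an illustration of the proposal, not how SpamFilter parses keywords; `split_keywords` and the `\,` escape convention are assumptions made for the example.

```python
from typing import List


def split_keywords(line: str) -> List[str]:
    """Split on commas, except those written as '\\,' (hypothetical escape)."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line) and line[i + 1] == ",":
            current.append(",")  # escaped comma: keep it as a literal comma
            i += 2
        elif line[i] == ",":
            parts.append("".join(current))  # bare comma: end of this keyword
            current = []
            i += 1
        else:
            current.append(line[i])
            i += 1
    parts.append("".join(current))
    return parts


print(split_keywords(r"content,((?-g)<!--.{500\,}?-->)"))
# ['content', '((?-g)<!--.{500,}?-->)']
print(split_keywords(r"X-SF,(home\, grown)"))
# ['X-SF', '(home, grown)']
```

Only a backslash directly before a comma is consumed, so other regex escapes such as `\d` pass through unchanged. A real implementation would still have to decide how to treat a literal backslash that happens to sit right before a separator comma.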
-------------
I am a user of SF, not an employee. Use any advice offered at your own risk.