Catching the comments
yapadu
Senior Member Joined: 12 May 2005 Status: Offline Points: 297 |
Posted: 14 April 2010 at 9:24am |
I am trying to catch messages with a large amount of HTML comments in them, using this regex:

((?-g)<!--.{500,}?-->)

I have tried a few different variations:

(<!--.{500,}?-->)
((?-g)<!--.{500,}-->)

My keyword in full does include a non-regex keyword first:

content,((?-g)<!--.{500,}?-->)

Basically I am trying to say: non-greedy search, at least 500 characters between the comment delimiters. Every regex tool I have tested says it should work, including the regex test tab in SpamFilter. But on real messages, nothing is caught... Can anyone see what I am doing wrong?

The following is an example of some text that should be caught:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>hi</title>
<style type="text/css">
<!-- body { background-color: #FFF; } -->
</style></head>
<body link="#cc054b" alink="#cc054b" vlink="#cc054b">
<center>
<!-- Au cognitive should dddd innovative resorts qqqq hi ttt pounds uuuu thank shops desert area circle reverse trade avenue partners vandana presumed weet started change strengths automatic predicted shrimp missing programmed concerns literature appliances led bill kkkk CTS Australian image similarly drugs farmer sss hesitate mat rrrr gggg haven profile confidential dissolved gid rdonlyres align bots history confidential smalltext foe?
du head ttt uuuun aspx change avw wave live shops autumn refer giveaway custom uuuu mypage profile exceeds makeup contrary avenue plastics buildings wrote ddd Neues ppp allies sss kkk stripe gear image reverse citation camels actions audio notify WWW pulse bread jjjj tripolis invasion institutions jam wave printing log go hhh toolbar mid calls wwww color fragrance OK went aaaa zzz iiiii estimado subscriber digital Desktops asp officers Je comes tomorrow mmmm engineer iii numerous crushable heads preferences paddy attachments smalltext cccc besuchen deaths representing makeup ddd margin portion llll increases team ooo circle subscriber ccc farmers received www Neues utritious eee width imre zzzz eee fff whatever don mileage led Debbie utm qqq bbbb oooo jjjj cookie windows century charged mid book mmm nn de toolbar exactly regulators yyy laws pero vspace employees hhhh Dan on eeee plastics inter at printing finances partners powered llll Au Aug bbb New correspond replacement experiencing Het -->
<table width="475" cellspacing="0" cellpadding="5" border="0">

Edited by yapadu - 14 April 2010 at 9:32am
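For anyone wanting to check the pattern outside SpamFilter, here is a minimal Python sketch of what the lazy quantifier does. It is only an illustration under assumptions: Python's re module has no (?-g) modifier, so the sketch uses the plain lazy form {500,}?, and the test strings (short_comment, long_comment) are made up for the example. It may also be worth noting that a short comment followed by a long one can be matched as a single span, because .{500,}? is allowed to run past the first "-->" until it has consumed 500 characters.

```python
import re

# Lazy form of the pattern from the post; (?-g) is SpamFilter-specific and
# is left out here because Python's re module does not support it.
PATTERN = re.compile(r"<!--.{500,}?-->")

# Made-up test data: one short comment and one comment with 600+ characters.
short_comment = "<!-- body { background-color: #FFF; } -->"
long_comment = "<!-- " + "word " * 120 + "-->"


def first_match(text):
    """Return the first matched text, or None if the pattern does not match."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# A short comment on its own is not caught.
only_short = first_match(short_comment)

# A comment with at least 500 characters inside it is caught in full.
only_long = first_match(long_comment)

# A short comment followed by a long one: the match starts at the FIRST
# "<!--" and runs through to the end of the long comment.
combined_text = short_comment + "<p>hi</p>" + long_comment
combined = first_match(combined_text)

print(only_short is None)          # True
print(only_long == long_comment)   # True
print(combined == combined_text)   # True
```

So in an ordinary regex engine the pattern does fire on a message like the sample, although the span it reports may begin at an earlier, shorter comment such as the one inside the style block.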
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
Sorry this took a while, but we were ourselves a bit stumped as to why it wasn't working. It turns out that this is due to a bug you just uncovered in SpamFilter. With the above RegEx, you are trying to take advantage of this new feature in SpamFilter with your keyword
content,((?-g)<!--.{500,}?-->):

// New to VersionNumber = '4.1.2.810';
{TODO -cNew : Added support to specify multiple RegEx expressions separated by commas, just as regular keywords can be separated by commas - has the effect of specifying "AND" rules for RegEx. Note that a "Standard" non-RegEx keyword must be specified first for SpamFilter to recognize this syntax. For example: X-SF,([a-z]), ([0-9]) }

The bug, however, is causing SpamFilter to barf at the comma after the 500 above. SpamFilter is not recognizing the comma as part of the RegEx, and thinks it's the keyword separator character. This is a bit tricky to fix, as we'll have to rework the keyword engine a bit, so I can't promise the fix will be ready in just a few days.
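To make the failure mode concrete, here is a small Python sketch of what happens when a keyword line is split on every comma. This is not SpamFilter's actual code; the naive split is an assumption used only to illustrate the behaviour described above.

```python
# The full keyword from the first post.
keyword = "content,((?-g)<!--.{500,}?-->)"

# Hypothetical parser: treat every comma as the keyword separator.
parts = keyword.split(",")

print(parts)
# ['content', '((?-g)<!--.{500', '}?-->)']
```

The comma inside the {500,} quantifier cuts the regex into two fragments, neither of which is a usable expression on its own, so nothing can match.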
|
yapadu
Senior Member Joined: 12 May 2005 Status: Offline Points: 297 |
I'm still not sure my regex is right, but I am fairly sure the final product will require a comma in it.
Looking forward to the fix, as I think it will work well to stop a lot of spam.
Edited by yapadu - 18 April 2010 at 11:31pm
|
yapadu
Senior Member Joined: 12 May 2005 Status: Offline Points: 297 |
Was a fix ever produced for this? It is tough building some regexes without the use of a comma.
|
|
--------------------------------------------------------------
I am a user of SF, not an employee. Use any advice offered at your own risk. |
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
As mentioned previously, this is actually a bigger issue than it may seem, and a fix will require a major change in the RegEx engine. We've postponed that change because we did not want to risk causing unwanted side effects in the new release that was made available this last week. I'm unable to say right now when this will be fixed.
|
|
yapadu
Senior Member Joined: 12 May 2005 Status: Offline Points: 297 |
Thanks for the update.
Is the problem with the regex engine, or with the keyword separator also being a comma? Currently we have to enclose a regex in () when it is used in the keywords. Maybe we could escape a comma that appears inside the () so the list of keywords can be parsed correctly. For example, the regex (home, grown) could be written as (home,, grown) or (home\, grown).
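As a rough sketch of the backslash variant of this idea, the Python below splits a keyword line on unescaped commas and turns "\," back into a plain comma. The function name split_keywords is hypothetical and this is not how SpamFilter parses keywords; it only shows that the escaping scheme is workable in principle.

```python
def split_keywords(line):
    """Split on unescaped commas; a backslash-comma becomes a literal comma."""
    parts, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            # "\," is unescaped to ","; any other escape pair (such as "\d")
            # is copied through unchanged so the regex keeps its meaning.
            current.append("," if nxt == "," else ch + nxt)
            i += 2
        elif ch == ",":
            # An unescaped comma ends the current keyword.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


print(split_keywords(r"content,((?-g)<!--.{500\,}?-->)"))
# ['content', '((?-g)<!--.{500,}?-->)']
print(split_keywords(r"X-SF,(home\, grown)"))
# ['X-SF', '(home, grown)']
```

With this scheme, existing keyword lists that contain no backslash-comma pairs would split exactly as they do with a plain comma split.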
|
|