Spam Filter ISP Support Forum


Catching the comments

yapadu (Senior Member, joined 12 May 2005)
Posted: 14 April 2010 at 9:24am
I am trying to catch messages that contain a large amount of text inside HTML comments, using this regex:

((?-g)<!--.{500,}?-->)

I have tried a few different variations:

(<!--.{500,}?-->)
((?-g)<!--.{500,}-->)

My keyword in full does include a non-regex keyword first:

content,((?-g)<!--.{500,}?-->)

Basically I am trying to say: non-greedy search, with at least 500 characters between the comment delimiters.  Every regex tool I have tested says it should work, including the regex test tab in SpamFilter.

But on real messages, nothing is caught... anyone see what I am doing wrong?
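The intent described above can be sketched outside SpamFilter. This is a rough Python illustration, not SpamFilter's own engine: Python's `re` has no `(?-g)` switch, so the lazy quantifier stands in for it, and `re.DOTALL` is assumed so that `.` can cross line breaks inside a comment. The names `LOOSE`, `STRICT`, and `has_long_comment` are made up for this example.

```python
import re

# Assumptions: Python's re has no (?-g) switch, so the lazy quantifier
# {500,}? stands in for it, and re.DOTALL lets "." cross line breaks.
LOOSE = re.compile(r"<!--.{500,}?-->", re.DOTALL)

# Hypothetical stricter variant: the body may not contain "-->", so a match
# cannot start in one comment and end in a later one.
STRICT = re.compile(r"<!--(?:(?!-->).){500,}-->", re.DOTALL)

def has_long_comment(html: str, pattern: re.Pattern = LOOSE) -> bool:
    """Return True if the pattern matches anywhere in the given HTML text."""
    return pattern.search(html) is not None

short = "<!-- body { background-color: #FFF; } -->"
padded = "<!--" + "x" * 600 + "-->"
two_short = "<!--a-->" + "y" * 600 + "<!--b-->"

print(has_long_comment(short))              # False
print(has_long_comment(padded))             # True
print(has_long_comment(two_short))          # True: spans both comments
print(has_long_comment(two_short, STRICT))  # False
```

One thing the sketch shows: because `.` can also match `-->`, the original pattern may start at a short comment and run through to the end of a later one, so it measures the distance between an opening and some later closing delimiter rather than the length of a single comment.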

The following is an example of some text that should be caught:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>hi</title>
<style type="text/css">
<!--
body {
    background-color: #FFF;
}
-->
</style></head>

<body link="#cc054b" alink="#cc054b" vlink="#cc054b">

<center>

<!--
Au cognitive should dddd innovative resorts qqqq hi ttt pounds uuuu thank shops desert area circle reverse trade avenue partners vandana presumed weet started change strengths automatic predicted shrimp missing programmed concerns literature appliances led bill kkkk CTS Australian image similarly drugs farmer sss hesitate mat rrrr gggg haven profile confidential dissolved gid rdonlyres align bots history confidential smalltext foe? du head ttt uuuun aspx change avw wave live shops autumn refer giveaway custom uuuu mypage profile exceeds makeup contrary avenue plastics buildings wrote ddd Neues ppp allies sss kkk stripe gear image reverse citation camels actions audio notify WWW pulse bread jjjj tripolis invasion institutions jam wave printing log go hhh toolbar mid calls wwww color fragrance OK went aaaa zzz iiiii estimado subscriber digital Desktops asp officers Je comes tomorrow mmmm engineer iii numerous crushable heads preferences paddy attachments smalltext cccc besuchen deaths representing makeup ddd margin portion llll increases team ooo circle subscriber ccc farmers received www Neues utritious eee width imre zzzz eee fff whatever don mileage led Debbie utm qqq bbbb oooo jjjj cookie windows century charged mid book mmm nn de toolbar exactly regulators yyy laws pero vspace employees hhhh Dan on eeee plastics inter at printing finances partners powered llll Au Aug bbb New correspond replacement experiencing Het
-->

<table width="475" cellspacing="0" cellpadding="5" border="0">



Edited by yapadu - 14 April 2010 at 9:32am

LogSat (Admin Group, joined 25 January 2005, Location: United States)
Posted: 18 April 2010 at 8:31pm
Sorry this took a while, but we were ourselves a bit stumped as to why it wasn't working. It turns out that this is due to a bug you just uncovered in SpamFilter. With the above RegEx, you are trying to take advantage of this new feature in SpamFilter with your keyword
content,((?-g)<!--.{500,}?-->):

// New to VersionNumber = '4.1.2.810';
{TODO -cNew : Added support to specify multiple RegEx expressions separated by commas, just as regular keywords can be separated by commas - has the effect of specifying "AND" rules for RegEx. Note that a standard non-RegEx keyword must be specified first for SpamFilter to recognize this syntax. For example: X-SF,([a-z]), ([0-9])}

The bug, however, is causing SpamFilter to barf at the comma after the 500 above. SpamFilter is not recognizing the comma as part of the RegEx, and thinks it's the keyword separator character. This is a bit tricky to fix, as we'll have to rework the keyword engine a bit, so I can't promise the fix will be ready in just a few days.
Roberto Franceschetti

LogSat Software

Spam Filter ISP
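
The failure described in the post above can be illustrated with a small Python sketch. SpamFilter's actual keyword parser is not shown in this thread, so `naive_split` is a hypothetical stand-in that simply treats every comma as a separator:

```python
from typing import List

def naive_split(keyword_line: str) -> List[str]:
    # Hypothetical stand-in: every comma is taken as a keyword separator,
    # including the one inside the {500,} quantifier.
    return [part.strip() for part in keyword_line.split(",")]

line = "content,((?-g)<!--.{500,}?-->)"
print(naive_split(line))
# ['content', '((?-g)<!--.{500', '}?-->)']
```

Neither of the last two fragments is a complete regex on its own (each has unbalanced parentheses), which would be consistent with nothing being caught on real messages.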

yapadu
Posted: 18 April 2010 at 11:27pm
I'm still not sure my regex is right, but I am fairly sure the final product will require a comma in it.

Looking forward to the fix, as I think it will work well to stop a lot of spam.


Edited by yapadu - 18 April 2010 at 11:31pm

yapadu
Posted: 14 August 2010 at 10:52pm
Was a fix ever produced for this? It is tough building some regexes without the use of a comma.
--------------------------------------------------------------
I am a user of SF, not an employee. Use any advice offered at your own risk.

LogSat
Posted: 17 August 2010 at 9:49pm
As mentioned previously, this is actually a bigger issue than it may seem, and a fix will require a major change in the RegEx engine. We've postponed that change because we did not want to risk causing unwanted side effects in the new release that was made available this last week. I'm unable to say right now when this will be fixed.

yapadu
Posted: 17 August 2010 at 9:58pm
Thanks for the update.

Is the problem with the regex engine, or with the separator of keywords also being a ,?

Currently we have to enclose a regex in () when it is used in the keywords.  Maybe we could escape a comma that appears inside the () so the list of keywords can be parsed correctly.

If we did something like (home, grown) as the regex maybe we could replace it with (home,, grown) or (home\, grown).
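
The `\,` idea suggested above could look roughly like the following Python sketch. It is only an illustration of the proposal, not anything SpamFilter implements; `split_keywords` is a hypothetical name, and the sketch assumes the parser removes the backslash before handing each piece to the regex engine.

```python
from typing import List

def split_keywords(line: str) -> List[str]:
    """Split on commas, treating a backslash-comma pair as a literal comma."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] == ",":
            current.append(",")  # unescape: keep the comma, drop the backslash
            i += 2
        elif ch == ",":
            parts.append("".join(current).strip())  # unescaped comma ends a keyword
            current = []
            i += 1
        else:
            current.append(ch)  # other characters, including other backslashes, pass through
            i += 1
    parts.append("".join(current).strip())
    return parts

print(split_keywords(r"content,((?-g)<!--.{500\,}?-->)"))
# ['content', '((?-g)<!--.{500,}?-->)']
print(split_keywords(r"X-SF,([a-z]), ([0-9])"))
# ['X-SF', '([a-z])', '([0-9])']
```

Unescaping in the parser matters for the quantifier case: if `{500\,}` reached the regex engine with the backslash still in place, it would likely no longer be read as a repeat count.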