RegEx keywords to eliminate junk email with invalid html tags

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=967
Printed Date: 05 February 2025 at 1:58am


Topic: RegEx keywords to eliminate junk email with invalid html tags
Posted By: LogSat
Subject: RegEx keywords to eliminate junk email with invalid html tags
Date Posted: 15 June 2003 at 11:02pm

Starting with build v1.2.0.151 SpamFilter is able to scan the whole email content + subject header for RegEx (Regular Expression) keywords.

This allows very powerful keyword searches. Many spammers send html emails containing invalid (thus invisible) html tags or html comments in between letters to avoid normal keyword detection.

For example, the following html source:

<!--fxkbu8116c72f6-->SP<mynqhy2d9bswg-->AM 
    <!--ei2rq7erjldy3y-->MER<!--ywf1ph1zmgcik9-->

will actually display SPAMMER in an email client.

We've been using the following RegEx search string, which has so far blocked a lot of this spam:

(<[!--]*[a-zA-Z0-9]{11,})

This is what the above expression looks for (remember that SpamFilter requires a RegEx expression to be surrounded by parentheses () in order to distinguish it from regular keywords):

  • <   look for an open tag start character, immediately followed by...
  • [!--]*   zero or more ! or - characters, such as the  !--  that opens an html comment, immediately followed by...
  • [a-zA-Z0-9]  any letter or digit....
  • {11,}  repeated at least 11 times. The run may contain only letters and digits; any space, tab, single quote, double quote, etc. breaks the sequence.

For example, <a href="aaaa.htm"> will not trigger a match, since a space immediately follows the a, before href.
We chose a minimum repetition of 11 since <blockquote> is a valid tag whose name is 10 characters long.
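
To try the keyword outside SpamFilter, here is a minimal sketch using Python's re module. This is an assumption: SpamFilter's own RegEx engine may behave differently. Python reads [!--] as a range from ! to -, so the sketch spells the class as [!-], which matches only the two characters the description above intends. The name is_junk is made up for this example.

```python
import re

# The posted keyword, with the class written as [!-] (see note above).
# The outer parentheses are SpamFilter's RegEx marker; in Python they
# are simply a capturing group.
INVALID_TAG = re.compile(r"(<[!-]*[a-zA-Z0-9]{11,})")

def is_junk(text):
    """Return True if the text contains a suspicious tag or comment."""
    return INVALID_TAG.search(text) is not None

samples = [
    "<!--fxkbu8116c72f6-->SP<mynqhy2d9bswg-->AM",  # hidden junk comment
    '<a href="aaaa.htm">',                         # space right after "a"
    "<blockquote>",                                # tag name is only 10 letters
]

for text in samples:
    print("blocked" if is_junk(text) else "passed ", text)
```

Only the first sample is blocked; the two valid tags pass because the run of letters and digits after < is shorter than 11.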

If anyone has comments, problems, or improvements with this "apparently magic" keyword search, please let us know!

Roberto Franceschetti
LogSat Software




Replies:
Posted By: Guests
Date Posted: 16 June 2003 at 11:25am

When will any of these new features appear in the official RELEASE version?

I do not wish to experiment with the beta release, but feel that I am missing out on all the new features by continuing to use the most current official release (1.1.2.124) when the beta keeps getting all the new features.



Posted By: LogSat
Date Posted: 16 June 2003 at 12:49pm

Alan,

We had to make drastic changes in the code to support the new quarantine database and the web functionality. The code was not as stable as we would have liked, so we made our beta test versions public so that more users could test the application and report problems. Had we published it as an official release, it would have shipped with several bugs, and many users would have been complaining of crashes. We really did not want that.

After two weeks of testing we finally seem to have a much more stable product. Unless any major problems arise in the next few days, we are thinking of making this beta official by the end of this week.

Roberto Franceschetti
LogSat Software



Posted By: Guests
Date Posted: 16 June 2003 at 1:59pm

Roberto,
Thanks for posting the RegEx code for the html comments used to bypass keyword filtering. It works great so far. I was able to eliminate all of the different comment tags I was using and reduce the keyword list to a smaller size.

Great Job,

g



Posted By: Guests
Date Posted: 16 June 2003 at 4:14pm
So this will only work with the current beta version?  (not the most recent official release?)


Posted By: LogSat
Date Posted: 17 June 2003 at 3:29pm

Currently yes, these features are only available in the beta. But we anticipate releasing it officially within the next few days, so the wait will be very short!

Roberto Franceschetti
LogSat Software



Posted By: Guests
Date Posted: 18 June 2003 at 8:08am
Almost all return receipts are being caught by this regexp... so...


Posted By: LogSat
Date Posted: 18 June 2003 at 8:24am

Can you post the source of such an email so we can try to find a way around it?

Roberto Franceschetti
LogSat Software



Posted By: Guests
Date Posted: 19 June 2003 at 1:21pm

 

Can RegEx work with a dictionary list?



Posted By: LogSat
Date Posted: 19 June 2003 at 3:30pm

What do you mean exactly by "work with a dictionary list"?

Roberto Franceschetti
LogSat Software



Posted By: JimMeredith
Date Posted: 20 June 2003 at 8:11pm

The "invalid html tags" RegEx keyword has been working *almost* perfectly, but there have been a few situations where legitimate emails are being bounced by this rule.

Here's what appears to be happening.  If a message contains a forwarded message within its text, this forwarded message text is likely to include the original from/to email addresses.  In many cases, these email addresses are enclosed in <>'s.  For example:

To: <longusername@earthlink.net>

This matches the RegEx keyword criteria of [a-zA-Z0-9]{11,} so... bounce!

To get it working, I've changed the [!--]* portion of the RegEx keyword (zero or more occurrences of !--) to instead read [!--]+ (one or more occurrences).  This is still very effective, and has eliminated the bounces of legit messages... but is obviously not perfect as it doesn't offer protection from invalid html tags that are not comments.
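
The trade-off between the two variants can be sketched in Python's re module (an assumption; SpamFilter's engine may differ). The class is written as [!-] because Python would read [!--] as a range from ! to -. The names ORIGINAL, STRICTER and verdicts are made up for this illustration.

```python
import re

ORIGINAL = re.compile(r"(<[!-]*[a-zA-Z0-9]{11,})")  # zero or more ! or -
STRICTER = re.compile(r"(<[!-]+[a-zA-Z0-9]{11,})")  # one or more ! or -

def verdicts(text):
    """Return (original matches, stricter matches) for one piece of text."""
    return (ORIGINAL.search(text) is not None,
            STRICTER.search(text) is not None)

cases = [
    "To: <longusername@earthlink.net>",  # address quoted in a forwarded mail
    "<!--fxkbu8116c72f6-->SP",           # junk html comment
    "<mynqhy2d9bswg-->AM",               # junk tag that is not a comment
]

for text in cases:
    print(verdicts(text), text)
```

The + variant stops matching the quoted address and still catches the junk comment, but it also lets the non-comment junk tag through.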

RegEx is new to me, but I might try working with it later and coming up with some sort of logical NOT based on the occurrence of a @ within the string.  If someone more familiar with RegEx could just fire this out and post it here, that would be even better. :)

Jim
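
One possible shape for the "logical NOT on @" idea is a negative lookahead, sketched here in Python's re module. Whether SpamFilter's RegEx engine supports lookahead is not stated in this thread, so treat this as an untested assumption. The pattern and the name is_junk are hypothetical, and the class is written as [!-] because Python would read [!--] as a range from ! to -.

```python
import re

# Same keyword with the * kept, plus a negative lookahead: reject the match
# if an "@" appears before the next space or angle bracket.
NOT_AN_ADDRESS = re.compile(r"(<[!-]*[a-zA-Z0-9]{11,}(?![^<>\s]*@))")

def is_junk(text):
    """Return True if the text has a suspicious tag that is not an address."""
    return NOT_AN_ADDRESS.search(text) is not None

for text in [
    "To: <longusername@earthlink.net>",  # quoted address: should pass
    "<!--fxkbu8116c72f6-->SP",           # junk comment: should be blocked
    "<mynqhy2d9bswg-->AM",               # junk non-comment tag: blocked
]:
    print("blocked" if is_junk(text) else "passed ", text)
```

The drawback is that a spammer could slip an @ into the junk tag to avoid the match, so this trades one gap for another.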



