Spam Filter ISP Support Forum


Dan S Keyword RegEx Update

Trinidad (Guest Group)
Posted: 26 September 2003 at 9:11am
Your regex keywords are the best. Do you have any updates since your last posting?

Desperado (Senior Member)
Joined: 27 January 2005
Location: United States
Posted: 29 September 2003 at 3:52am

Trinidad,

Hmmmm, good question. I have added some, removed some, changed some, and added some temporary stuff to stop a couple of floods we got. Below is a list of my "mainstream" keywords; by that I mean that I use these as a baseline for all the instances I run. Some of these word wrap, so be careful.

((http|3dhttp)://.{0,26}(((%.+%))|@|:)[(\d|\w)])
((<[!--]+[\x20]{0,1}[a-zA-Z0-9]{10,}[\x20]{0,1}[!--](.+)){2,})
(http://+[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3})
(<[!--]+[a-zA-Z0-9]{2}(-->))
((http://http:/\w)|(<(\w){3,10}(\x20/>)|(\*http://w)))
((limited time (special|offer)))
(((arge your p)|(1-4 inches)|(3 - 5 inches\!)|(generic viagra)|(123respmarket)|(herbalpillsonline)|(herbaltrials\.com)|(naturalherbal)|(pillsavings)|(gsc\-100)|(go771world)))
((your privacy is extremely important to us)|(this is not spam))
(((www\.)|(http://))(\w){1,20}(4u)\.(biz|com|net)|(medsonsale\.biz)|(freeandgetsave)|(opportunit12)|(thirdw\.com)|(teflondoninc)|(epromotionad))
((lsgone\.php)|(Isgone\.php)|(exit\.asp)|(mc4\.idetermination)|(medsusa\.biz)|(getitwhileucan)|(\&\#105)|(4improvement\.biz)|(best\-ratez\.biz)|(genoveseinc\.biz))
((remove\.php)|(hit\.php))
(<(!\-\- )+[a-zA-Z0-9]{1}(\x20[a-zA-Z0-9]{3,20}){3,5}(-->))
((text\-decoration: blink)|(click here to start))

I hope these don't "break" anything for you. I check these for false positives often and, as I stated in an earlier post, I do have to allow quite a few listservers in my whitelist because they do "bad" things in their content.
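
For anyone who wants to check a few of these filters for false positives before loading them, here is a minimal sketch in Python. It assumes the filters are applied case-insensitively and that Python's `re` syntax is close enough to the spam filter's regex dialect for these three patterns; neither is confirmed in this thread. The sample messages are made up for illustration.

```python
import re

# Three keyword filters copied from the list above. Compiling them
# case-insensitively is an assumption about how the spam filter applies them.
KEYWORD_FILTERS = [
    r"(http://+[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3})",
    r"((limited time (special|offer)))",
    r"((remove\.php)|(hit\.php))",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in KEYWORD_FILTERS]


def matching_filters(text):
    """Return the patterns from KEYWORD_FILTERS that match anywhere in text."""
    return [rx.pattern for rx in COMPILED if rx.search(text)]


# Hypothetical sample messages, not taken from real mail.
spam_sample = "Limited time offer! Visit http://10.1.2.3/hit.php today"
ham_sample = "Minutes from Tuesday's meeting are at http://www.example.com/notes"

print(matching_filters(spam_sample))  # lists every filter that fired
print(matching_filters(ham_sample))   # an empty list means no false positive
```

Running a folder of known-good mail through a check like this is one way to spot a filter that is too broad before it goes live.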

Here is my "Excluded From Addresses" List

*@listproc.pcworld.com
*@industryweek.com
*@gpsadvantage.com
*@gwbakeries.com
*@peoples.com
*@*.lga2.nytimes.com
*@*.*.nytimes.com
*@softshare.com
*@regulusgroup.com
*@e-news.fsonline.com
*@lists.n-email.net
*@lists.techtarget.com
*@lyris.stockupticks.com
*@multexinvestornetwork.com
*@newsletter.online.com
*@insightmedia.info
*@nhfairfield.com
*@newhorizons.com
*@rootsweb.com
*@*.rootsweb.com
*@returns.groups.yahoo.com
*@cygnuspub.com
*@*.classmates.com
*@listserv.usairways.com
*@jkp.com
*@laurin.com
*.*@dell.com

I also have this single entry in my keyword whitelist to resolve an issue with paypal. 

https://www.paypal.com

I used to have *@paypal.com in my from whitelist, but there are a boatload of spoofed PayPal addresses, so that opened up a big hole. The keyword whitelist solved that.
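
The difference between the two whitelist styles can be sketched roughly as below. This is only an illustration of the mechanics: the function names are hypothetical, wildcard matching is approximated with Python's `fnmatch`, and how the spam filter actually evaluates its whitelists is not stated in this thread.

```python
from fnmatch import fnmatchcase

FROM_WHITELIST = ["*@paypal.com"]               # the entry that was removed
KEYWORD_WHITELIST = ["https://www.paypal.com"]  # the entry that replaced it


def sender_whitelisted(from_addr):
    """True if the (easily forged) From address matches a wildcard entry."""
    addr = from_addr.lower()
    return any(fnmatchcase(addr, pattern) for pattern in FROM_WHITELIST)


def keyword_whitelisted(body):
    """True if the message body contains a whitelisted keyword."""
    text = body.lower()
    return any(keyword in text for keyword in KEYWORD_WHITELIST)


# Hypothetical spoofed message: forged sender, link pointing somewhere else.
spoofed = {
    "from": "service@paypal.com",
    "body": "Verify your account at http://10.1.2.3/login",
}

print(sender_whitelisted(spoofed["from"]))   # the forged sender gets through
print(keyword_whitelisted(spoofed["body"]))  # the body check does not match
```

A From-address entry trusts whatever the sender claims to be, while the keyword entry only fires when the message content itself carries the expected URL.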

Here are my "Blocked From Addresses"

(\b[\d+]+([\-a-za-z0-9_\.\+])+(@hotmail|@juno)\.com)
(\b[\d]+@(aol\.com|msn\.com|bellsouth\.net|brandeis\.edu))
(\w{17,}@(canada|aol|hotbot|msn)\.com)
((@hello\.com|@veriopt\.com|ha@sexyfun\.net|@himailer.com|clubhotlist@aol.com))
((\*@)(\w){1,30}(\.(com|net|org)){1})
(([\x20]{7,})|([\x09]{1,}))
(@(.){1,22}(\x20)(.){1,22})
(test[\d]{0,5}\.com)
(dsl\-verizon\.net)
anyone@*
noone@*
friend@*
someone@*
*@fcc-network.com
*@topprodsource.com
*@myobdeals.com
offers@*
senders@*
*@loyus.com:null
*@163.net:null
*@21cn.com:null
*@24horas.com
*@263.net:null
*@263.net.cn:null
*.*@bounce.e-i1.com
*@amazingoffersdirect.net
*@godomains.com.au
w-mstenson@mindspring.com:null
*@mail.play4keeps.com:null
directmail@badmail.worldnow.com
admin@internet.com:null
admin@mags.net:null
postmaster@vienybe.lt:null
*@selectgroupmedia.com
domainreg@paypal.com
*@biginkspot.com
*@shaw.ca
*@the-dot-com-ink.com

Hope this all helps.

Regards,

Dan S.


Trinidad (Guest Group)
Posted: 29 September 2003 at 3:37pm

You're the man when it comes to RegEx

Thanks much


Desperado (Senior Member)
Posted: 30 September 2003 at 12:49am

Trinidad,

Minor correction to 2 filters:

(<[!--]+[a-zA-Z0-9]{2}(-\->))

(<(!\-\- )+[a-zA-Z0-9]{1}(\x20[a-zA-Z0-9]{3,20}){3,5}(-\->))

Dan S.


Carl Giljam (Guest Group)
Posted: 30 September 2003 at 3:40am
The messages that slip through the filter now tend to be where the text has been broken up by a lot of comments, example:

mppa iofyydehhz pn z ojdjgnugps hk i-->er - uer - uer - up to 36 ho

Carl Giljam (Guest Group)
Posted: 30 September 2003 at 3:44am
The messages that slip through the filter now tend to be where the text has been broken up by a lot of comments, example:

(I'm leaving out the example this time, the forum truncated my previous message so trying to see if this works better - you will have to imagine a lot of nonsense comments)

..etcetera. This doesn't seem to be caught by your current RegEx, although to the human eye it is very obvious junk (no matter what language you speak!). I'm not good at writing RegExes but have a few ideas - comments, anyone?

1) Filter out on "nonsense" words in comments. Nonsense = 4 or more consecutive consonants.

2) Maybe change the above to 4 or more consecutive non-vowels (to filter out things like glk4zm2pq)

3) Same as 1) and 2) but for 3 or more consecutive vowels.

4) Same idea as all the above but applied only to the Subject of the email. This often contains nonsense words and I can see no good reason at all to allow them.

5) In all the above, maybe necessary to allow for more than 4 consonants (or 3 vowels) to get things like KPMG (for accountants) etc through but there should be some reasonable number, say 5-6 where the filter could kick in.

6) Filter out if number of chars in comments > number of non-comment chars, either over one single line or in the message as a whole.

7) Filter out if number of chars in comments are > X (in the message as a whole).

8) Filter out if any comments at all...

Well - can any of this be done in RegEx and would it create any problems?
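
A rough sketch of ideas 1, 2, 5, 6 and 7 in Python, to make the trade-offs concrete. The threshold of 5 and the function names are assumptions for illustration, and the counting ideas (6 and 7) appear to need code around the regex rather than a single pattern, so this may not translate directly into a keyword filter.

```python
import re

# Ideas 1, 2 and 5: a run of 5 or more letters/digits that are not vowels.
# The threshold of 5 is an assumed value chosen so that KPMG still passes.
NON_VOWEL_RUN = re.compile(r"[b-df-hj-np-tv-z0-9]{5,}", re.IGNORECASE)

# HTML comments, matched non-greedily and across line breaks.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def has_nonsense_word(text):
    """True if text contains a long run of consonants and/or digits."""
    return NON_VOWEL_RUN.search(text) is not None


def comment_chars(html):
    """Idea 7 building block: total characters inside HTML comments."""
    return sum(len(comment) for comment in HTML_COMMENT.findall(html))


def mostly_comments(html):
    """Idea 6: more characters inside comments than outside them."""
    inside = comment_chars(html)
    return inside > len(html) - inside


# Hypothetical sample in the style described above.
sample = "Che<!-- qzxkv plmw -->ap me<!-- zzkqj wwpt -->ds"

print(mostly_comments(sample))         # comment text outweighs visible text
print(has_nonsense_word("glk4zm2pq"))  # caught by the non-vowel run
print(has_nonsense_word("KPMG"))       # only 4 consonants, so it passes
print(has_nonsense_word("strengths"))  # a real word that still triggers
```

The last line shows the likely problem: ordinary words such as "strengths" or "rhythm", and any number of five or more digits, also contain long non-vowel runs, so the threshold would need tuning against real mail.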

Desperado (Senior Member)
Posted: 30 September 2003 at 5:36am

Well ... I like your ideas, but one catch is that, at the moment, we can't specify filters for the "Subject" only. I will take a very serious look at your ideas and see if I can come up with a "clean" RegEx or two.

Dan S.
