Spam Filter ISP Support Forum

Keyword filter

Allan (Guest Group)
Posted: 09 September 2003 at 8:13am

I know this subject has been discussed several times and that the program has been altered on users' behalf. But...

I still get mails through that contain words listed in the keyword filter.

Allan

LogSat (Admin Group)
Posted: 09 September 2003 at 5:53pm

Allan, can you post the source of one such email? Please note that we need the actual source; a simple copy and paste from your email client will not work, as that will remove any tags (HTML or otherwise) from the content. We'll also need to know which keyword is being skipped.

Roberto F.
LogSat Software

Allan (Guest Group)
Posted: 10 September 2003 at 5:25am

Here's a header:

Received: from IAPOTEK ([62.242.39.50]) by dakmail.dak.pharmakon.dk with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59)
 id SH47FH67; Wed, 10 Sep 2003 11:15:33 +0200
Received: from 129.237.93.167 by 62.242.39.50 (LogSat Software SMTP Server) Wed, 10 Sep 2003 11:16:19 +0200
MIME-Version: 1.0
Subject: =?ISO-8859-1?B?VmlhZ3JhIFJpZ2h0IE9ubGluZSBOb3ch?=
To: ap@pharmakon.dk
Date: Wed, 10 Sep 2003 17:15:15 +0000
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1158
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Message-ID: <d42501c377bf$4c82e26a$0b272def@ci5egl2>
From: "Napoleon Crowe" <crowezy@cc.jyu.fi>
Content-Type: text/html
Content-Transfer-Encoding: 8bit
X-Server: LogSat Software SMTP Server
X-SF-RX-Return-Path: <crowezy@cc.jyu.fi>

When I view the message in Outlook, the subject line displays: Viagra Right Online Now!

"Viagra" is in my lists

 

John (Guest Group)
Posted: 10 September 2003 at 10:57pm

Allan,

If the email is HTML and contains something like Via<!--some stuff here-->gra, then the keyword Viagra will not match it. See if you can do a "view source" or something similar to see the actual text that is being sent.

 

Desperado (Senior Member)
Posted: 10 September 2003 at 11:19pm

Allan,

If you are using MS Outlook, you will not see the actual source, because Outlook renders the message into viewable content. If, however, you can set up Outlook Express, at least for a test, it will let you view the actual, unrendered source, and you will most likely see that the literal word "viagra" does not exist in the source. Spammers use many methods of obfuscation to mask the literal words from filters, which is why using regular expressions as your keyword filters is so valuable.

Dan S.

LogSat (Admin Group)
Posted: 10 September 2003 at 11:46pm

Allan,

With the message headers you posted, we were indeed able to replicate the scenario.

The source "Subject" header is encoded:

Subject: =?ISO-8859-1?B?VmlhZ3JhIFJpZ2h0IE9ubGluZSBOb3ch?=

Decoding it reveals "V-i-a-g-r-a Right Online Now!".

SpamFilter currently decodes only the message body when performing the keyword scan. It does not decode the Subject header; it simply scans it "as is", so in this case it does not find the word "V-i-a-g-r-a".

We'll try to have it decode the subject as well before the next build is released.
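
For illustration, the effect described above can be reproduced with Python's standard `email.header` module. This is only a sketch of the general technique (decoding an RFC 2047 encoded header before the keyword scan), not SpamFilter's actual code; the function name `decode_subject` and the lowercase substring check are assumptions made for the example.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject: str) -> str:
    """Turn an RFC 2047 encoded-word header value into readable text."""
    return str(make_header(decode_header(raw_subject)))


# The Subject value from the headers posted earlier in this thread.
raw = "=?ISO-8859-1?B?VmlhZ3JhIFJpZ2h0IE9ubGluZSBOb3ch?="
decoded = decode_subject(raw)
keyword = "viagra"

print(keyword in raw.lower())      # False: the raw header is base64 text
print(keyword in decoded.lower())  # True: the decoded header contains the word
```

Scanning the raw value misses the keyword because the base64 payload does not contain the literal letters; scanning the decoded value finds it.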

Roberto F.
LogSat Software

Allan (Guest Group)
Posted: 11 September 2003 at 7:53am

Thank you!!

Here is the next example. Outlook shows "Penis Patches info click here."

The word "Penis" is of course included in my blacklists:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1226" name=GENERATOR></HEAD>
<BODY><B>Fra:</B> Norman Ferris [norman_ferrisdh@chello.at]<BR><B>Sendt:</B> 11.
september 2003 22:43<BR><B>Til:</B> Emne:xx@yy.zz<BR><B>Emne:</B> get her off
real large styles . zhqlxv3<BR>
<CENTER>
<TABLE width=580 border=0><K0DY6HC1O7Q6U><KUBWYZ5LQZZ3>
  <TBODY>
  <TR>
    <TD><K48VUT634KOP>
      <CENTER><KUDSPEKK5QJ6E3><FONT face=Arial
      size=+1><B>Ge<KQ1JQA7LK53>n<KCJR9102HQ6OZ>it<KRJJSQG9UTBU51>al
      E<KM0BP8CN2076C29>nl<KWJ83932QVM7K>arg<KNKFD9G1VX1>eme<KHEXG9NEVC8LT2>nt -
      M<K59YMFD3KZA55F0>edi<KKK07UI3AQME>cal
      Br<KYWH39P1ABZ1JX1>eak<KQNHQ05JN0WSX2>thro<K2CMTX827O0WMP3>ug<KWMZ6621TPN>h
      F<KWNA2PM1CJT>or Me<KI25U8S2T6NOS>n<KLDRPQ42AQJNJ7>!
      <BR></FONT><KI83WSO3IGPUBD><FONT face=Arial size=-2>2
      am<KPV6EKZ197Q7B>az<KLI2OEJ2TXDTTP1>in<KZAGYYM20YP>g wa<K8YJLDFRS7KM>ys to
      enl<K3ALVRX2D6X1OF3>arg<KF4OWWCKWUWGC>e yo<K94ILXPG6FSQ3>ur
      man<K1XSRFP1SDI>h<KSP493M10KNRJ>ood - r<KM5W3K83BT7IBS>ead
      b<KME4RAJ2RACQ>elo<KQIUM0P1FO6JG7>w..</FONT><K7E2K5VFJ1N></B><BR></CENTER><BR>Do<KPIGI5C0TDU>ct<KVW8KQA3J72DGC>or<KITY8W736EQ1>s
      work<KR7N4FH3GHZ7>ed for y<KQNDPEF2JC9Y>ear<K0061KM1WFS>s
      cr<KZHQAM83GL6N9>eati<KVMFUGC1DBY>ng a
      p<KHJI9I52FJCQ>i<KK0W8M812MNTE7>l<KOOCL8AMWPL5J16>l to
      e<KR91CRW1KH3>n<KCD7UKY1MI1>l<KLFG145UXX663O>a<KW759A71QL00>r<KE13MZW24291KZ1>g<KBTRGEZ3FVE>e
      th<KSF817UODKJ7OOQ>e <B>ma<KCQRJE33A2TYNE2>le
      g<KUP55ABTV5KYT2>en<KOPLFYEWYSHJUM>ital<KY6DTI02DC2NA>ia</B> by
      len<KLITW1V29PFUI>g<K9UEHUI2UI5MP>th and
      wid<K2AKE0C3I71GKD>th.<KSUEK8EB173R><BR><K5HLKRH8CT6B12R><K6CGSKW16KZKZO3>The
      years of work produced a
      p<KSIBTPD3J3V5L>i<K83K8EH3TRP7TKH>l<K2E7XW72ILYK5>l call<KPFRX26LXTZJ>ed
      "<KB0OAH53DI7E1M>V<KMHKEBV3C01YQY>P<KCCM2J1159AC652>R<KTMM9VT3TDQA>X<KFP6BH77VN25B1>",
      - <A href="http://www.pure-herbal.biz/sknoc/vp/"
      target=_blank><B>V<KV6Q4V8P42B5>P<KHWES9H3T9H>R<KKM4TK734KGT4>X
      P<K1VOFVR3IH1>i<K7KFTYP1XF1>l<KGS2PO32TIH7M>l<KSQ4WVU2VH2W>s
      i<KD49E6K4431DX3>n<K9ZZC8R3IQ4Q3E>f<KVW8HTZ2WS9U>o
      c<KOZWCB02UMO6JN>l<KEA7OO51VXY3WH>i<KAPJIEG3I1B0AP>c<KCKHYFOTZL9LH>k
      h<KNSULGW3S9U09>e<K03ATF37EBJLR0>r<K6AW202246U4>e<KC75B2130EANP63>.</B></A><BR>a<KNPSRFL1320CF92>n<KFDABE42MIB>d
      al<K8CRXEU3M1E85>so a pa<KZA9HTV3ZCXJ>tc<KD9LWF314CFR051>h
      s<K14LJLT3CMG6JI>imil<KQG9T7DBRQ3>air to the q<KDKPM7J3MXDVLZ1>uit
      s<KXS3O1OQA4RRI2>mo<KMSBQ4NG8Q8IO3P>king
      p<K8PUP9D21AY>a<KA2UR3A17R1M>tc<KW3020S3OOXV>h. - <A
      href="http://www.pure-herbal.biz/sknoc/patch/"
      target=_blank><B>P<KRSULO9303VQ>e<KHCQ4F13F94QU>n<KG970DH18JHO>i<KL4N84MAP0X>s
      P<K76NMMN21L98G>a<KXUVOUL1WFFD0EV>t<K4DDB5CEOYW>c<KNP36Z33VBWWF>h<KES79JTJLAU372>e<K7MWU2D2ELEQQL1>s
      i<KPPFRM521WH>n<K0XSG9JWH9K>f<KJB3PI53YAYL5>o
      c<KXCQXSX1P412R>l<KFICYYC14LAY>i<KP3JQ5T2NTFS9W>c<K6KFJTY37ZPCPE2>k
      h<KT2ORPNQR87N>e<KM6A32L1ZM83UG1>r<KMN1PR3074ZHI>e<KE05OGB3ZC1994J>.</B></A><BR></TD><KAILN2S2IWY>
  <TR><KXVKZVV2RL21>
    <TD></TR></TBODY></TABLE>
<TABLE width=580 border=0></CENTER>
  <TBODY>
  <TR>
    <TD><BR><BR><A href="http://www.pure-herbal.biz/sknoc/out.html"><FONT
      face=Arial size=-2>r<KHSTMN92J1UQ>em<KZU8IST3C1P6XS>ov<K5Z34991HECYQR3>e
      yo<K0R13ME3CG9>ur<KDXGF6E2MXW>sel<KFPAZYTK5IE0>f.</FONT></A><BR></TD></TR></TBODY></TABLE><K62X6TQ60UPU><K9UVLKREXQR01></BODY></HTML>

JimMeredith (Newbie)
Posted: 11 September 2003 at 1:16pm

Besides the subject line problem, the old imbe<garbage>dded ta<spam>gs trick is definitely being exploited here.  We're seeing this happen more frequently now than ever in the past... a user called me today and asked if we had turned off our SpamFilter!

Here are just a couple of lines from an email I received today.

<font face="Arial"><b><font size="2">• Do you
wa<pprayerful>nt a lar<pinventory>ger and fir<pcufflink>mer pe<pezra>n<pblue>is?
<br>
• Do you wa<pmanna>nt to give your par<psuppression>tner
more plea<pcollaborate>sure? <br>

The "magic" RegEx statement that Roberto posted several months ago would catch this... but unfortunately it trapped other legitimate tags and caused too many False Positives to be acceptable for our users.  I posted a modification to that RegEx, which eliminated the FPs, but also eliminated much of the effectiveness in the process.

There's probably no easy answer to this issue.  But I'm open to suggestions.

Desperado (Senior Member)
Posted: 12 September 2003 at 12:45am

All,

Here we go again ... I am having VERY good results with the following regexes, but some list servers cause problems, so I also allow the addresses listed further down. Using this combo, I get a very high block ratio with very few false positives.

DO NOT LET THE LIST WORD-WRAP!

Keyword Text File:

((http|3dhttp)://.{0,26}(((%.+%))|@|:)[(\d|\w)])
(http://+[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3})
((<[!--]+[\x20]{0,1}[a-zA-Z0-9]{10,}[\x20]{0,1}[!--](.+)){2,})
(<[!--]+[a-zA-Z0-9]{2}(-->))
((http://http:/\w)|(<(\w){3,10}(\x20/>)|(\*http://w)))
(<(!-- )+[a-zA-Z0-9=]{28,}( -->))
(content\-type:\x20text/(html|plain)(;{0,1})\r\ncontent-transfer\-encoding:\x20base64\r\n)
((limited time (special|offer)))
(((arge your p)|(3 - 5 inches\!)|(herbalpillsonline)|(herbaltrials\.com)|(pillsavings)|(gsc\-100)))
((text\-decoration: blink)|(click here to start))
((your privacy is extremely important to us)|(this is not spam))
(http://www.(\w){1,20}(4u).(biz|com|net))
((re: )+(wicked screensaver|details|approved|thank you!|that movie|your application|re: my details))
(this email address will be expiring)
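
As a rough illustration of how two of these patterns behave, here is a small Python sketch that runs line 2 and line 13 of the list against made-up sample strings. It assumes Python's `re` dialect and case-insensitive matching, either of which may differ from SpamFilter's own regex engine, and the sample inputs are hypothetical.

```python
import re

# Line 2 of the list: a URL whose host is a dotted numeric IP address.
NUMERIC_URL = re.compile(
    r"(http://+[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3}\.{1}[\d]{1,3})",
    re.IGNORECASE,  # assumption: the filter ignores case
)

# Line 13 of the list: one or more "re: " prefixes followed by a stock phrase.
REPLY_BAIT = re.compile(
    r"((re: )+(wicked screensaver|details|approved|thank you!|that movie"
    r"|your application|re: my details))",
    re.IGNORECASE,  # assumption: the filter ignores case
)

numeric_hit = NUMERIC_URL.search('<a href="http://129.237.93.167/offer">click</a>')
named_hit = NUMERIC_URL.search("http://www.example.com/")
bait_hit = REPLY_BAIT.search("Subject: Re: Re: Details")
plain_hit = REPLY_BAIT.search("Subject: Details of the meeting")

print(numeric_hit.group(1))  # http://129.237.93.167
print(named_hit)             # None: the host is a name, not an IP
print(bait_hit.group(0))     # Re: Re: Details
print(plain_hit)             # None: no "re: " prefix
```

Line 2 only looks at the shape of the URL, so it fires on any link to a bare IP address. Line 13 needs at least one "re: " in front of the phrase, so the phrase alone does not trigger it.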

And here is my "Excluded From Addresses" List

*@paypal.com
*@listproc.pcworld.com
*@industryweek.com
*@gpsadvantage.com
*@gwbakeries.com
*@peoples.com
*@*.lga2.nytimes.com
*@*.*.nytimes.com
*@softshare.com
*@regulusgroup.com
*.*@dell.com
*@e-news.fsonline.com
*@lists.n-email.net
*@lists.techtarget.com
*@lyris.stockupticks.com
*@multexinvestornetwork.com
*@newsletter.online.com
*@insightmedia.info
*@nhfairfield.com
*@newhorizons.com
*@rootsweb.com
*@*.rootsweb.com
*@returns.groups.yahoo.com
*@*.classmates.com
owner-apic@peach.ease.lsoft.com
*@listserv.usairways.com

Dan S.

Allan (Guest Group)
Posted: 12 September 2003 at 2:33am

A suggestion:

If Outlook can "filter" the unwanted HTML syntax and display the text, why can't SpamFilter do the same when it's checking for keywords?

Desperado (Senior Member)
Posted: 12 September 2003 at 6:30pm

Allan,

Outlook is not "filtering" the HTML. Spammers are obscuring the message by adding a wide variety of HTML tags and comments and whatever else they come up with. The goal is to locate the METHOD of obfuscation, not the actual message: if the message has been obscured, then it is most likely spam. That is what the regular expressions are doing. Having SpamFilter render the code and then go searching for a word or phrase would, in my opinion, lead to horrendous amounts of false positives. Just because a message has the word penis in it does not make it spam. The same is true for viagra. HOWEVER, if those words are being "hidden" using comments, tags or whatever, then it must be spam. What other possible reason would there be to mask the actual content?
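
The idea of detecting the method rather than the message can be sketched in a few lines of Python. This is an illustration only, not one of the regexes posted in this thread: it counts tags or comments that sit directly between two letters, and the threshold of 3 is an arbitrary assumption.

```python
import re

# A tag or comment with a letter immediately before and after it, i.e. one
# that splits a word. Lookarounds let neighbouring matches share a letter.
INTRA_WORD_TAG = re.compile(r"(?<=[A-Za-z])<[^<>]*>(?=[A-Za-z])")


def count_split_words(html: str) -> int:
    """Count tags/comments that are wedged between two letters."""
    return len(INTRA_WORD_TAG.findall(html))


def looks_obfuscated(html: str, threshold: int = 3) -> bool:
    """Flag the text once the count reaches the (hypothetical) threshold."""
    return count_split_words(html) >= threshold


spam = "Ge<KQ1JQA7LK53>n<KCJR9102HQ6OZ>it<KRJJSQG9UTBU51>al"
legit = "<p>Hello <b>world</b>, see you soon.</p>"

print(count_split_words(spam), looks_obfuscated(spam))    # 3 True
print(count_split_words(legit), looks_obfuscated(legit))  # 0 False
```

Legitimate mail can contain the occasional tag inside a word (bolding part of a word, for example), which is why the sketch counts occurrences instead of rejecting on the first hit.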

Dan S.

Allan (Guest Group)
Posted: 15 September 2003 at 2:51am

I do not quite agree with you. I have added some words to the keyword filter, and regardless of their meaning, I want to filter those mails.

If Outlook (and other mail programs) is capable of filtering out worthless HTML tags when displaying the message, I still can't see why the program can't do the same...

Greetings

Allan

Trinidad (Guest Group)
Posted: 15 September 2003 at 11:02am

I am new to regex and wanted to know if you could give me a breakdown of what a few of your lines are doing, namely lines 1, 2, 5 and 13.

Thanks in advance.

DigitalMan (Guest Group)
Posted: 15 September 2003 at 7:40pm

There is one trick you can use to view the source in Outlook. I use it on most mail that has gotten through SpamFilter but still looks suspect, before opening it.

  1. Disable any preview panes. You should do this anyway, because previewing an HTML email can cause an image to be downloaded, in which case your IP (and possibly a unique code) will be logged, confirming your address as legitimate and causing more spam to come your way.
  2. Highlight the message but do not open it.
  3. Under "File" choose "Save As...". If the message is a plain text email, it will be given the .txt extension (text emails are safe to open in Outlook). HTML emails will be given the .htm extension. I usually just save these to my desktop and then view the source in an HTML editor such as HomeSite. Notepad will suffice.

JimMeredith (Guest Group)
Posted: 16 September 2003 at 12:43pm

Valid points, all...

Dan, your RegEx list is always welcome for reading and comparison.  You've said in a previous post that you're not a "RegEx expert", but it seems more and more that you're becoming one as a result of this project!  Filtering on a COMMON single word might not be such a good idea, but there are plenty of single-word spam traps that are very effective with zero false positives.  You'd be surprised how much spam I'm catching by filtering on the slang word "milf".  Both "vallum" and "vaiium" (consider the spammer's creative use of capitalization on the i's and l's) are effective at trapping a few dozen spam messages each week.

But back to my original concern... trapping the embed<crap>ded tags.  Unfortunately I don't think that even your most recent RegEx list addresses this scenario.  Thus, Allan's post about pre-filtering the HTML tags and then running keyword check does have merit.  A sentence that contains the phrase "pe<blurb>nis enl<!--okay-->argem<extract>ent" then becomes "penis enlargement" and it becomes a policy decision for the mail system's administrator whether or not to filter on those two words.
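
The pre-filtering step described above could be sketched in Python along these lines. It is a minimal illustration under simple assumptions (regex-based stripping instead of a real HTML parser, and a plain substring check standing in for the keyword filter); `strip_markup` and `contains_keyword` are hypothetical names, not SpamFilter functions.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments, possibly multi-line
TAG = re.compile(r"<[^<>]*>")                   # anything else in angle brackets


def strip_markup(html: str) -> str:
    """Delete comments first, then tags, so that split words rejoin."""
    return TAG.sub("", COMMENT.sub("", html))


def contains_keyword(html: str, keyword: str) -> bool:
    """Case-insensitive keyword check against the stripped text."""
    return keyword.lower() in strip_markup(html).lower()


sample = "pe<blurb>nis enl<!--okay-->argem<extract>ent"
print(strip_markup(sample))                     # penis enlargement
print("enlargement" in sample)                  # False: raw source hides the word
print(contains_keyword(sample, "enlargement"))  # True once the markup is removed
```

Deleting tags outright is what rejoins the split words, but it also glues together words separated only by a real tag such as `<br>`, so a production version would need to treat those differently.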

Still, this doesn't rate an enhancement request, not right now.  The statistical scoring enhancement that Roberto has mentioned in previous posts could have a more far-reaching impact on spam control than all of these blacklist/whitelist tweaks combined.  So, I'll just sit tight and wait for THAT beta to arrive.

LogSat (Admin Group)
Posted: 16 September 2003 at 8:30pm

Allan,

We had considered the same idea. There is a (big) performance penalty in having to parse the HTML of incoming emails, so we decided to keep it simple and just work with the email source.

We're working hard to prepare a new version that does DNA fingerprinting on incoming emails, which will greatly diminish the need to specify multiple keywords. Once this is complete, should it not have the desired accuracy, we may go back to implement HTML parsing as well.

Roberto F.
LogSat Software

DigitalMan (Guest Group)
Posted: 16 September 2003 at 8:32pm

Having read this thread, I think the feature request that is emerging is a function that would render HTML embedded in a subject line back to plain text, and then accept or reject the message based on the rendered subject line. So "VIAGRA" and "VIA<crap>GRA" would both be caught if "VIAGRA" is in the keyword filter.

The question to LogSat then becomes, is this feature request feasible and if so, is it sensible?

Richard (Guest Group)
Posted: 17 September 2003 at 7:03pm

I don't believe that removing HTML tags prior to keyword searching would amount to a lot of code or overhead.

"DNA fingerprinting" has failed elsewhere because spammers are on to it, and throw random garbage into messages to change the "fingerprint" -- I hope your algorithm is smarter than others.

Another spammer trick alluded to earlier in this thread is the creative use of "alphabet substitutions" - such as 1 or ! or | for I or l , 0 for O, @ for a, etc., and peculiar spacing, with and without separating characters (-_+=~. are common).

If viagra is the keyword, then my wishlist is for the software to, in addition to stripping out HTML tags, match V!AGRA and Viagr@ and ###VIAGRA### (embedded in other text), V I A G R A, V-I-A-G-R-A  and so on.
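
One way such a wishlist could be approached is to expand each keyword into a tolerant regular expression. The Python sketch below is only an assumption about how that might be done: the substitution table and separator set are taken from the examples in this post, and `obfuscation_pattern` is a hypothetical helper, not an existing SpamFilter feature.

```python
import re

# Look-alike characters per letter (assumed table, based on the examples above).
SUBSTITUTIONS = {"a": "a@", "i": "i1!|l", "l": "l1|i", "o": "o0"}

# Optional filler allowed between letters: whitespace and common separators.
SEPARATORS = r"[\s\-_+=~.*]*"


def obfuscation_pattern(keyword: str):
    """Build a case-insensitive regex that tolerates look-alikes and separators."""
    classes = []
    for ch in keyword.lower():
        variants = SUBSTITUTIONS.get(ch, ch)
        classes.append("[" + re.escape(variants) + "]")
    return re.compile(SEPARATORS.join(classes), re.IGNORECASE)


pattern = obfuscation_pattern("viagra")
variants = ["V!AGRA", "Viagr@", "###VIAGRA###", "V I A G R A",
            "V-I-A-G-R-A", "v1agra", "v*i*a*g*r*a"]
for text in variants:
    print(text, bool(pattern.search(text)))  # every variant prints True
```

The price of this tolerance is more false positives: because separators are optional filler, the same pattern also matches innocent text such as "via graph".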

It is interesting to note that (at least I think this is the case) HTML does not allow SRC= or HREF= to be broken up with comments. So even if VIAGRA is obscured, the website mentioned in the message is not. In the HTML gobbledygook example a few messages back, this is buried in the otherwise unreadable HTML: <A href="http://www.pure-herbal.biz/sknoc/vp/"

One should be able to filter out mail containing "pure-herbal.biz" -- if nothing else!
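
As a sketch of that idea, the following Python pulls the hosts out of `href` and `src` attributes and compares them with a blocklist. The regex-based attribute extraction is a simplification, and `linked_hosts`, `links_to_blocked_domain` and the blocklist itself are hypothetical names for the example.

```python
import re
from urllib.parse import urlsplit

# Value of an href= or src= attribute, quoted or unquoted.
LINK_ATTR = re.compile(r"""(?:href|src)\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def linked_hosts(html: str) -> set:
    """Collect the lowercase host names of all absolute links in the HTML."""
    hosts = set()
    for url in LINK_ATTR.findall(html):
        host = urlsplit(url).hostname  # None for relative URLs
        if host:
            hosts.add(host.lower())
    return hosts


def links_to_blocked_domain(html: str, blocked) -> bool:
    """True if any linked host is a blocked domain or one of its subdomains."""
    return any(
        host == domain or host.endswith("." + domain)
        for host in linked_hosts(html)
        for domain in blocked
    )


snippet = '<A href="http://www.pure-herbal.biz/sknoc/vp/"\n target=_blank><B>V<KV6Q4V8P42B5>P'
print(linked_hosts(snippet))                                  # {'www.pure-herbal.biz'}
print(links_to_blocked_domain(snippet, ["pure-herbal.biz"]))  # True
```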

LogSat (Admin Group)
Posted: 17 September 2003 at 11:54pm

Richard,

Removing the HTML tags requires little code, but it does add more overhead than we feel comfortable with right now.

As far as DNA fingerprinting goes, it is self-adjusting. If spammers decide to spell v1agra or v*i*a*g*r*a, then after receiving a few such emails, the statistical engine will begin to recognize the new patterns and readjust itself. At least it does in our preliminary tests...
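
The thread does not describe how the DNA fingerprinting engine works, so the following Python sketch is NOT LogSat's algorithm. It is a generic naive-Bayes-style token scorer, shown only to illustrate how a statistical filter can pick up a new spelling after seeing a few examples; the class name, the tokenizer and the training messages are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    """Split on whitespace; obfuscated spellings simply become new tokens."""
    return re.findall(r"\S+", text.lower())


class TokenScorer:
    """Tracks token counts for spam and non-spam and scores new text."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        (self.spam if is_spam else self.ham).update(tokenize(text))

    def score(self, text: str) -> float:
        """Sum of smoothed log-odds per token; higher means more spam-like."""
        spam_total = sum(self.spam.values())
        ham_total = sum(self.ham.values())
        vocab = len(set(self.spam) | set(self.ham)) + 1
        total = 0.0
        for token in tokenize(text):
            p_spam = (self.spam[token] + 1) / (spam_total + vocab)
            p_ham = (self.ham[token] + 1) / (ham_total + vocab)
            total += math.log(p_spam / p_ham)
        return total


scorer = TokenScorer()
scorer.train("meeting agenda for monday", is_spam=False)
scorer.train("lunch on monday", is_spam=False)

before = scorer.score("buy v1agra now")
for _ in range(3):  # a few examples of the new spelling arrive and are marked spam
    scorer.train("buy v1agra now", is_spam=True)
after = scorer.score("buy v1agra now")

print(after > before)                         # True: the new spelling now scores higher
print(scorer.score("meeting on monday") < 0)  # True: ordinary mail still scores low
```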

Roberto F.
LogSat Software
