Keyword filter |
Allan
Guest Group |
Posted: 09 September 2003 at 8:13am |
I know the subject has been discussed several times and that the program has been altered on users' behalf. But... I still get mails through that contain words that are listed in the keyword filter. Allan |
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
Allan, can you post the source of one such email? Please note that we need the actual source; a simple copy and paste from your email client will not work, as that will remove any HTML (and non-HTML) tags from the content. We'll also need to know which keyword is being skipped. Roberto F. |
|
Allan
Guest Group |
Here's a header: Received: from IAPOTEK ([62.242.39.50]) by dakmail.dak.pharmakon.dk with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) When I view the message in Outlook, the subject line displays: Viagra Right Online Now! "Viagra" is in my lists
|
|
John
Guest Group |
Allan, If the email is html and is something like this Via<!--some stuff here-->gra then the keyword Viagra will not match the above. See if you can do a "view source" or something to see the actual text that is being sent.
|
|
Desperado
Senior Member Joined: 27 January 2005 Location: United States Status: Offline Points: 1143 |
Allan, If you are using MS Outlook, you will not see the actual source because Outlook "renders" the message into viewable content. If, however, you can set up Outlook Express, at least for a test, it will allow you to view the actual, unrendered source, and you will most likely see that the actual literal word "viagra" does not exist in the source. Spammers use many methods of obfuscation to mask the literal words from filters, which is why using "Regular Expressions" as your keyword filters is so valuable. Dan S. |
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
Allan, With the message headers you posted we were indeed able to replicate the scenario. The source "Subject" header is encoded: Subject: =?ISO-8859-1?B?VmlhZ3JhIFJpZ2h0IE9ubGluZSBOb3ch?= SpamFilter currently decodes only the message body when performing the keyword scan. It does not decode the Subject header, though; it simply scans it "as is", so in this case it does not find the word "V-i-a-g-r-a". We'll try to have it decode the subject as well before the next build is released. Roberto F. |
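To make the decoding step concrete, here is a minimal Python sketch. It is not SpamFilter's code; the helper names `decode_subject` and `subject_matches` are made up for this example, and the Latin-1 fallback is an assumption. It uses the standard library's `email.header.decode_header` to turn the encoded Subject quoted above back into text before the keyword check.

```python
from email.header import decode_header


def decode_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words in a Subject header value."""
    parts = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            # Assumption: fall back to Latin-1 when no charset is declared.
            parts.append(value.decode(charset or "latin-1", errors="replace"))
        else:
            # Unencoded headers come back as plain str.
            parts.append(value)
    return "".join(parts)


def subject_matches(raw_subject: str, keyword: str) -> bool:
    """Case-insensitive keyword check against the decoded subject."""
    return keyword.lower() in decode_subject(raw_subject).lower()


raw = "=?ISO-8859-1?B?VmlhZ3JhIFJpZ2h0IE9ubGluZSBOb3ch?="
decoded = decode_subject(raw)
print(decoded)
print(subject_matches(raw, "viagra"))
```

Scanning `raw` directly for "viagra" finds nothing, because the base64 payload never contains the literal word; the match only appears once the header has been decoded.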
|
Allan
Guest Group |
Thank you!! Here is the next example - Outlook shows P The word "Penis" is of course included in my blacklists: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|
JimMeredith
Newbie Joined: 27 January 2005 Location: United States Status: Offline Points: 28 |
Besides the subject line problem, the old imbe<garbage>dded ta<spam>gs trick is definitely being exploited here. We're seeing this happen more frequently now than ever in the past... a user called me today and asked if we had turned off our SpamFilter! Here are just a couple lines from an email I received today.
<font face="Arial"><b><font size="2">˙FFFF95 Do you
The "magic" RegEx statement that Roberto posted several months ago would catch this... but unfortunately it trapped other legitimate tags and caused too many False Positives to be acceptable for our users. I posted a modification to that RegEx, which eliminated the FPs, but also eliminated much of the effectiveness in the process. There's probably no easy answer to this issue. But I'm open to suggestions. |
|
Desperado
Senior Member Joined: 27 January 2005 Location: United States Status: Offline Points: 1143 |
All, Here we go again ... I am having VERY good results with the following RegEx's but some list servers give problems so I also allow the listed addresses. Using this combo, I get a very high block ratio with very few false positives. DO NOT LET THE LIST WORD-WRAP! Keyword Text File: ((http|3dhttp)://.{0,26}(((%.+%))|@|:)[(\d|\w)]) And here is my "Excluded From Addresses" List *@paypal.com Dan S. |
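For anyone who wants to try the expression outside SpamFilter, here is a small Python sketch that runs the pattern quoted above against a few made-up URLs. The sample URLs, the `looks_obfuscated` helper, and the case-insensitive flag are assumptions for illustration; SpamFilter's own regex engine may behave differently.

```python
import re

# Pattern copied from the post above. IGNORECASE is an assumption.
OBFUSCATED_URL = re.compile(
    r"((http|3dhttp)://.{0,26}(((%.+%))|@|:)[(\d|\w)])",
    re.IGNORECASE,
)


def looks_obfuscated(text: str) -> bool:
    """True if the text contains a URL with %xx%, '@' or ':' soon after '://'."""
    return OBFUSCATED_URL.search(text) is not None


# Hypothetical test inputs, not taken from real mail.
samples = [
    "http://www.example.com/index.html",        # plain URL
    "http://www.example.com@192.0.2.10/login",  # '@' trick hides the real host
    "http://%77%77%77.example.com/",            # percent-encoded host name
    "http://example.com:8080/",                 # legitimate URL with a port
]

for url in samples:
    print(url, looks_obfuscated(url))
```

Note that the last sample, an ordinary URL with an explicit port, also matches because of the `:` alternative. That kind of hit may be one reason an exclusion list is needed alongside the pattern.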
|
Allan
Guest Group |
A suggestion: If Outlook can "filter" the unwanted HTML syntax and display the text, why can't SpamFilter do the same when it's checking for keywords? |
|
Desperado
Senior Member Joined: 27 January 2005 Location: United States Status: Offline Points: 1143 |
Allan, Outlook is not "filtering" the HTML. Spammers are obscuring the message by adding a wide variety of HTML tags and comments and whatever else they come up with. The goal is to locate the METHOD of obfuscation ... not the actual message, and if the message has been obscured, then it is most likely Spam. That is what the regular expressions are doing. Having SpamFilter render the code and then go searching for a word or phrase would, in my opinion, lead to horrendous amounts of false positives. Just because a message has the word penis in it does not make it Spam. The same is true for viagra. HOWEVER, if those words are being "hidden" using comments, tags or whatever, then it must be Spam. What other possible reason would there be to mask the actual content? Dan S. |
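The idea of matching the method of obfuscation rather than the word can be sketched in a few lines of Python. These two patterns are hypothetical and are not the RegEx list discussed in this thread; they simply flag an HTML comment or tag wedged between two letters.

```python
import re

# Hypothetical patterns for illustration only.
# An HTML comment sitting directly between two letters, e.g. Via<!--x-->gra.
COMMENT_IN_WORD = re.compile(r"[A-Za-z]<!--.*?-->[A-Za-z]", re.DOTALL)
# Any tag sitting directly between two letters, e.g. pe<blurb>nis.
TAG_IN_WORD = re.compile(r"[A-Za-z]<[^<>]*>[A-Za-z]")


def has_comment_in_word(source: str) -> bool:
    return COMMENT_IN_WORD.search(source) is not None


def has_tag_in_word(source: str) -> bool:
    return TAG_IN_WORD.search(source) is not None


print(has_comment_in_word("Via<!--some stuff here-->gra"))
print(has_comment_in_word("<p>Hello</p> <!-- layout note --> <p>World</p>"))
print(has_tag_in_word("pe<blurb>nis"))
print(has_tag_in_word("un<b>believable</b>"))
```

The comment-only pattern is narrow and leaves ordinary markup alone. The broader tag pattern also fires on legitimate inline formatting such as `un<b>believable</b>`, which is the kind of false positive that makes these expressions hard to tune.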
|
Allan
Guest Group |
I do not quite agree with you. I have added some words to the keyword filter - regardless of their meaning, I want to filter the mails. If Outlook (and other mail programs) is capable of filtering out worthless HTML tags when displaying the message, I still can't see why the program can't do the same... Greetings Allan |
|
Trinidad
Guest Group |
I am new to regex and wanted to know if you could give me a breakdown of what a few of your lines are doing, namely lines 1, 2, 5 and 13. Thanks in advance. |
|
DigitalMan
Guest Group |
There is one trick you can use to view the source in Outlook. I use this on most of my mail before opening it when it has gotten through SpamFilter but still looks suspect.
|
|
JimMeredith
Guest Group |
Valid points, all... Dan, your RegEx list is always welcome for reading and comparison. You've said in a previous post that you're not a "RegEx expert", but it seems more and more that you're becoming one as a result of this project! Filtering on a COMMON single word might not be such a good idea, but there are plenty of single-word spam traps that are very effective with zero false positives. You'd be surprised how much spam I'm catching by filtering on the slang word "milf". Both "vallum" and "vaiium" (consider the spammer's creative use of capitalization on the i's and l's) are effective at trapping a few dozen spam messages each week. But back to my original concern... trapping the embed<crap>ded tags. Unfortunately I don't think that even your most recent RegEx list addresses this scenario. Thus, Allan's post about pre-filtering the HTML tags and then running keyword check does have merit. A sentence that contains the phrase "pe<blurb>nis enl<!--okay-->argem<extract>ent" then becomes "penis enlargement" and it becomes a policy decision for the mail system's administrator whether or not to filter on those two words. Still, this doesn't rate an enhancement request, not right now. The statistical scoring enhancement that Roberto has mentioned in previous posts could have a more far-reaching impact on spam control than all of these blacklist/whitelist tweaks combined. So, I'll just sit tight and wait for THAT beta to arrive. |
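A minimal Python sketch of the pre-filtering step described above: strip comments and tags first, then run the keyword check on what is left. The regex and the helper names `strip_markup` and `contains_keyword` are assumptions for illustration, not SpamFilter features, and a regex is only a rough stand-in for a real HTML parser.

```python
import re

# Matches an HTML comment or any <...> tag. A rough approximation of markup.
MARKUP = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_markup(source: str) -> str:
    """Remove comments and tags, leaving only the text between them."""
    return MARKUP.sub("", source)


def contains_keyword(source: str, keyword: str) -> bool:
    """Case-insensitive keyword check on the stripped text."""
    return keyword.lower() in strip_markup(source).lower()


sample = "pe<blurb>nis enl<!--okay-->argem<extract>ent"
print(strip_markup(sample))
print(contains_keyword("VIA<crap>GRA", "viagra"))
```

One caveat with this approach: removing tags outright also glues together text from adjacent elements (for example two table cells), so a production version would probably want to treat block-level tags as whitespace.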
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
Allan, We had given thought to your same observations. There is a (big) performance penalty in having to parse incoming email's HTML, so we decided to keep it simple and just work with the email source. We're working hard to prepare a new version that does DNA fingerprinting on incoming emails, which will greatly diminish the need to specify multiple keywords. Once this is complete, should it not have the desired accuracy, we may go back to implement HTML parsing as well. Roberto F.
|
|
DigitalMan
Guest Group |
Having read this thread, I think the feature request that is coming out is a function that would render HTML embedded in a subject line back to text, and then accept or reject the message based on the rendered subject line. So "VIAGRA" and "VIA<crap>GRA" would both be caught if "VIAGRA" is in the keyword filter. The question to LogSat then becomes: is this feature request feasible, and if so, is it sensible? |
|
Richard
Guest Group |
I don't believe that removing HTML tags prior to keyword searching would amount to a lot of code or overhead. "DNA fingerprinting" has failed elsewhere because spammers are on to it, and throw random garbage into messages to change the "fingerprint" -- I hope your algorithm is smarter than others. Another spammer trick alluded to earlier in this thread is the creative use of "alphabet substitutions" - such as 1 or ! or | for I or l, 0 for O, @ for a, etc., and peculiar spacing, with and without separating characters (-_+=~. are common). If viagra is the keyword, then my wishlist is for the software to, in addition to stripping out HTML tags, match V!AGRA and Viagr@ and ###VIAGRA### (embedded in other text), V I A G R A, V-I-A-G-R-A and so on. It is interesting to note that (at least I think this is the case) HTML does not allow SRC= or HREF= to be broken up with comments - thus even if VIAGRA is obscured, the website mentioned within is not - thus in the HTML gobbledygook example a few messages back - this is buried in the otherwise unreadable html: <A href="http://www.pure-herbal.biz/sknoc/vp/" One should be able to filter out mail containing "pure-herbal.biz" -- if nothing else! |
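The wishlist above could be approximated by expanding each keyword into a tolerant regular expression. The following Python sketch is only one way to do it; the substitution table, the separator set and the `keyword_pattern` helper are all assumptions made up for this example.

```python
import re

# Hypothetical look-alike table; extend to taste.
LOOKALIKES = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!|l",
    "l": "l1|i",
    "o": "o0",
    "s": "s5$",
}
# Optional run of whitespace or the separator characters listed in the post.
SEPARATORS = r"[\s\-_+=~.]*"


def keyword_pattern(keyword: str) -> "re.Pattern[str]":
    """Build a regex that tolerates look-alike characters and separators."""
    classes = []
    for ch in keyword.lower():
        chars = LOOKALIKES.get(ch, ch)
        # re.escape keeps characters such as '|' or '$' literal inside [...].
        classes.append("[" + re.escape(chars) + "]")
    return re.compile(SEPARATORS.join(classes), re.IGNORECASE)


viagra = keyword_pattern("viagra")
variants = ["V!AGRA", "Viagr@", "###VIAGRA###", "V I A G R A", "V-I-A-G-R-A"]
for text in variants:
    print(text, bool(viagra.search(text)))
```

The price of this tolerance is more false positives: because whitespace counts as a separator, an innocent phrase such as "via graphics" also matches, so the separator set would need tuning before real use.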
|
LogSat
Admin Group Joined: 25 January 2005 Location: United States Status: Offline Points: 4104 |
Richard, Removing the HTML tags requires little code, but it does add more overhead than we feel comfortable with right now. As far as DNA fingerprinting goes, it is self-adjusting. If spammers decide to spell v1agra or v*i*a*g*r*a, then after receiving a few such emails, the statistical engine will begin to recognize the new patterns and readjust itself. At least it does in our preliminary tests... Roberto F. |
|