Classifying and Filtering Spam Using Search Engines
Oleg Kolesnikov
College of Computing, Georgia Tech
>50% of all e-mail today is spam? Source: brightmail.com
Scale
• IDC: of the 31bn messages sent each day, 18%, or 5.6bn, were spam messages
• Brightmail decoy network stats: 6.7bn spam messages sent in March 2003, varying from 100 to ~100,000 identical e-mails sent at a time
Current techniques to deal with SPAM/UCE:
• Blacklisting
• Signature-based Filtering
• Statistical/Bayesian Filtering
• Heuristic Filtering
• Challenge-Response Filtering
• Sender-pays
• Laws
Blacklisting
• MAPS (Mail Abuse Prevention System) RBL catches only 24% of spam, with 34% false positives (the "spam police" article, gaudi/gaspar)
• Self-appointed sheriffs/vigilantes; legitimate businesses are increasingly caught in the crossfire, e.g. iBill was losing $100k/day during each of the four days it was blacklisted
• Only a first cut at the problem; never blacklists more than 50% of the servers sending spam (Graham)
Sample and Signature-based Filtering
• Set up a network of DECOY e-mail addresses. Any message sent to these addresses must be spam => if the same message is also sent to a protected address, that message must be SPAM, too (that's what Brightmail does)
• Not very flexible -- spammers take the lead in coming up with tricks, e.g. making each spam message slightly different
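A minimal sketch of the decoy/signature idea (not Brightmail's actual implementation; the normalization and digest choice here are assumptions):

  import hashlib
  import re

  decoy_signatures = set()   # digests of messages received at decoy addresses

  def signature(body):
      # Crude normalization so trivial whitespace/case changes still collide
      normalized = re.sub(r"\s+", " ", body.lower()).strip()
      return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

  def is_known_spam(body):
      # A message to a protected address matching a decoy digest is spam
      return signature(body) in decoy_signatures

Randomizing each copy of a spam run defeats exact digests, which is exactly the trick mentioned in the last bullet above.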
Brightmail (used by MS/Hotmail, Earthlink, Verizon, eBay, etc.)
Basic Statistical Filtering
• Weakness: must be TRAINED; Strength: relatively low false positives
• Starts with two message corpora -- spam and legitimate
• Splits messages into TOKENs
• Assigns each token a probability based on how often it appears in the spam corpus, e.g. 'naked' may have a 67% probability of appearing in spam vs. 10% for 'regards'
• When a new message arrives, the statistical filter takes the top N tokens whose probabilities are farthest from the middle (50%) in either direction, applies Bayes' theorem, and comes up with a RANKING for the e-mail
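A minimal sketch of this ranking step, assuming per-token spam probabilities have already been estimated from the two corpora (the default for unseen tokens and N=15 are illustrative choices):

  from functools import reduce

  def rank(tokens, spam_prob, n=15):
      # spam_prob maps token -> estimated probability of appearing in spam;
      # unseen tokens get a mildly "innocent" default (an arbitrary choice here)
      probs = sorted((spam_prob.get(t, 0.4) for t in tokens),
                     key=lambda p: abs(p - 0.5), reverse=True)[:n]
      # Combine the N most "extreme" probabilities, naive-Bayes style
      prod = reduce(lambda a, b: a * b, probs, 1.0)
      inv = reduce(lambda a, b: a * b, (1 - p for p in probs), 1.0)
      return prod / (prod + inv)   # close to 1.0 => looks like spam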
Heuristic Filtering
• What kind of filters can you come up with JUST BY LOOKING at a spam e-mail?
• Sender name looks bogus?
• Header fields are missing?
• Lots of html?
• Take all these rules and heuristic observations, assign weights/points, and put them into a database
• You've got yourself an early version of SPAMASSASSIN
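A toy sketch of the weighted-rules idea (the rules and weights below are made up for illustration; real SpamAssassin rules live in its own rules database):

  import re

  def looks_bogus_sender(msg):   return bool(re.search(r"^From:.*\d{5,}@", msg, re.M))
  def missing_subject(msg):      return not re.search(r"^Subject:", msg, re.M)
  def lots_of_html(msg):         return len(re.findall(r"<[a-zA-Z]+", msg)) > 20

  RULES = [(looks_bogus_sender, 1.5), (missing_subject, 1.0), (lots_of_html, 0.8)]

  def heuristic_score(raw_message):
      # Sum the weights of all rules that fire; compare the total to a threshold
      return sum(weight for rule, weight in RULES if rule(raw_message))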
SpamAssassin
• The way you can make it work (let's say with postfix):
1) perl -MCPAN -e 'install Mail::SpamAssassin'
2) learn on a database of spam and legitimate e-mails using sa-learn (part of spamassassin)
3) add a filter program to pipe all incoming mail through spamc, a part of spamassassin: /usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $?
4) spamc adds headers, something like: X-Spam-Flag: {YES|NO}, X-Spam-Level: ***
5) the headers are caught by a user's procmail recipe and the mail is classified appropriately
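For step 5, a typical procmail recipe that files flagged mail into a spam folder could look like this (the folder name is just an example):

  :0:
  * ^X-Spam-Flag: YES
  spam-can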
Heuristic Filtering Two
• Weakness: the heuristic rules database is public, which makes it relatively easy for spammers to come up with ways to bypass the system => the rules database needs to be updated frequently
• May not be as effective today as other methods, such as statistical filtering
Challenge-Response Filtering
• Whenever you receive an e-mail from someone NOT on your whitelist, an automatic reply is sent telling the sender what steps to take to be considered for the whitelist (e.g. send you a confirmation, make a donation, solve a puzzle, etc.)
• Very effective at stopping spam BUT has a number of drawbacks: valid mail is delayed, it is somewhat harsh -- some senders may find it inconsiderate and never reply -- and it creates extra work for senders, etc.
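A toy sketch of the challenge-response flow (the function names, pending queue, and token scheme are illustrative, not any real product's API):

  import secrets

  whitelist = set()          # confirmed senders
  pending = {}               # challenge token -> (sender, held message)
  inbox = []                 # delivered messages

  def on_incoming(sender, message, send_reply):
      # Deliver mail from whitelisted senders; challenge everyone else
      if sender in whitelist:
          inbox.append(message)
          return
      token = secrets.token_hex(8)
      pending[token] = (sender, message)
      send_reply(sender, "Reply with token %s to be added to the whitelist" % token)

  def on_confirmation(token):
      # Called when a sender answers the challenge correctly
      if token in pending:
          sender, message = pending.pop(token)
          whitelist.add(sender)
          inbox.append(message)   # the delayed message is finally delivered

on_incoming would be wired to the mail delivery hook; send_reply is whatever sends the outgoing challenge. The delay between the two calls is exactly the "valid mail delayed" drawback above.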
Problems with Statistical and other keyword-dependent methods
• 1) Heavily dependent on effective parsing and the presence of "true" tokens, i.e. spammers fooling parsers. Examples:
• White background: <font color=white>research data and other statistically strong keywords that are present in legitimate e-mails</font>
• Splitting words: ch<!-- valid -->eck this p<!-- news -->orn
• Adding extra characters and spaces to confuse parsers (F*R E-E), and so forth (javascript, fake html tags, browser-specific tricks)
• 2) Spam may contain too little text and be TOO close to real e-mails in keywords. This is a more serious problem. I'll give an example later.
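A small illustration of the word-splitting trick: a tokenizer that works on the raw text misses the telltale words unless HTML comments are stripped first (the regexes are illustrative):

  import re

  raw = "ch<!-- valid -->eck this p<!-- news -->orn"

  naive_tokens = re.findall(r"[a-z]+", raw.lower())
  # -> ['ch', 'valid', 'eck', 'this', 'p', 'news', 'orn']  (telltale words are gone)

  stripped = re.sub(r"<!--.*?-->", "", raw)          # drop HTML comments first
  robust_tokens = re.findall(r"[a-z]+", stripped.lower())
  # -> ['check', 'this', 'porn']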
My research
• Developed and implemented a system for filtering unwanted mail using Google
• Can be used WITHOUT training
Thoughts
• Some users must click on those ads or else there would be no spam (somebody IS interested in it after all)
• There may be more such users in the future as new regulations appear and spam becomes less of an annoyance and more of an ad
• Some users may like to receive SPAM-looking messages, for instance, marketing reports, offers, etc., that look very much like spam
Two main observations I use
• Spam is USER-SPECIFIC
• Most spammers expect users to TAKE some ACTION upon reading spam; in other words, there has to be a FEEDBACK mechanism
Targeting the feedback mechanism
• How effective would spam be without an easy feedback mechanism?
URLs as a feedback mechanism
• Of the ~1800 spam messages in the classical spam corpora I have analyzed, ~95% contained URLs
• Of the remaining 5%, approximately half appeared to be damaged submissions (i.e. MIME conversion and other types of errors); the rest consisted of two types of messages:
• Messages with 1-800 numbers and fax numbers (including the Nigerian scam)
• Religious letters
Basic Approach: URLSP
• The basic approach is to extract URLs, apply a user-specific whitelist based on the user's mailbox (masks such as .edu, cnn.com, etc.), and classify everything else as spam
• The first version I implemented has been in use at Tech since December '02
• It has actually been working quite well
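A minimal sketch of this first-cut URLSP idea (the masks, matching rules, and folder names are assumptions based on the description above, not the actual deployed code):

  import re
  from urllib.parse import urlparse

  # Masks derived from the user's own mailbox (cf. whitelist.url later on)
  WHITELIST_MASKS = [".edu", ".mil", ".gov", "cnn.com"]

  URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

  def whitelisted(host):
      # Crude suffix matching; good enough to illustrate the idea
      return any(host == m.lstrip(".") or host.endswith(m) for m in WHITELIST_MASKS)

  def classify(message_body):
      urls = URL_RE.findall(message_body)
      if not urls:
          return "Unknown"     # no feedback URL; needs other methods
      hosts = [urlparse(u).hostname or "" for u in urls]
      if all(whitelisted(h) for h in hosts):
          return "Inbox"
      return "spam-can"        # at least one URL is off the whitelist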
Effective but rather naive
• The first version is effective but rather naive
• Granularity and false positives can be a problem
Next version: Classifying URLs
• CLASSIFY URLs using Google and Open Directory
• Use whitelists/blacklists of categories and URLs BASED on user mailbox and individual preferences
Example
• Based on files automatically generated from your mailbox, configure the system as follows (the blacklist* files are omitted):
whitelist.url:
.edu, .mil, .gov, www.nmap.com, www.epic.org, www.cypherpunks.to, etc.
whitelist.cat:
Top/Computers/Security/Anti_Virus/Products
Top/Computers/Security/Products_and_Tools/Cryptography/PGP
Top/Computers/Security/Products_and_Tools/Password_Tools
...
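A sketch of how the category whitelist might be applied; the Google/Open Directory lookup itself is left abstract and passed in as a callable (this is not the author's actual implementation):

  # whitelist.cat entries as in the example above; blacklist handling omitted
  WHITELIST_CAT = [
      "Top/Computers/Security/Anti_Virus/Products",
      "Top/Computers/Security/Products_and_Tools/Cryptography/PGP",
      "Top/Computers/Security/Products_and_Tools/Password_Tools",
  ]

  def url_is_acceptable(url, lookup_categories):
      # lookup_categories: stand-in for the Google / Open Directory query;
      # it should return ODP category paths for the URL (empty if uncategorized)
      cats = lookup_categories(url)
      if not cats:
          return None          # uncategorized: fall back to other methods
      return any(c.startswith(w) for c in cats for w in WHITELIST_CAT)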
URL Classifier: Categories Extracted from SPAM
• Examples of categories of URLs extracted from spam:
Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics
Top/Business/Employment/Careers
Top/Business/Financial_Services/Mortgages
Top/Business/Investing/Day_Trading/Brokerages
Top/Business/Investing/Day_Trading/Education_and_Training
Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds
Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM
Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search
Top/Shopping/Gifts/Personalized
Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts
...
GTUC v1.0 (Basic)
• Register for a free account on a CoC-based filtering server
• Forward your mail to the server
• The mail will be automatically classified into three folders as it arrives: Inbox, Unknown, spam-can
• Read your mail with IMAP
Spam of the future
• Innovative feedback mechanisms
• Appearance as close to legitimate e-mails as possible, e.g.
>>> From: rcarlos@legitimate.com
Hi, here is an interesting article. You should check it out -- net::“terminator_25”
Roberto Carlos
Solution
• Current best: a combination of approaches
• Categorization and URL-based filtering can help
• Uncategorized URLs? Similarity + retrieval of the html and categorization with token stats/heuristics
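A rough sketch of that fallback for uncategorized URLs: fetch the page and hand its tokens to the statistical ranker sketched earlier (the fetching and tag stripping here are simplistic assumptions):

  import re
  import urllib.request

  def score_uncategorized_url(url, spam_prob, rank):
      # Fetch the page behind the URL and feed its tokens to the statistical
      # ranker from the earlier sketch: rank(tokens, spam_prob) -> [0, 1]
      html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
      text = re.sub(r"<[^>]+>", " ", html)     # crude tag stripping
      tokens = re.findall(r"[a-z]{3,}", text.lower())
      return rank(tokens, spam_prob)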