ELPUB 2007, Vienna
The Fight against Spam: A Machine Learning Approach
Jiri Hynek (jhynek@kiv.zcu.cz), Karel Jezek (jezek_ka@kiv.zcu.cz)
www.textmining.cz
Contents:
• Stats 101
• Today’s Spam Types
• Spammer Tricks
• Text-Based Spam Filter Implementation
• Results
Spamming is publishing:
• Web spam (“comment spam”) on blogs, (unmoderated) forums, and wikis. Why: to trigger a higher page ranking!
• Unsolicited marketing spam in our e-mail: information dissemination to the public. Why: to sell products!
A bit of Terminology:
• “Canned meat made largely from pork”
• Ham vs. spam (spam mail)
• UCE (Unsolicited Commercial Email)
• UBM (Unsolicited Bulk Mail)
• EMP (Excessive Multi-Posting)
• Junk mail
• Bulk email
Stats 101: Top five spam categories
• Online pharmacies: 20.0%
• Mortgage refinancing: 9.7%
• Investment/financial services: 9.0%
• Male products (\/i@gra, CI@1i$): 8.7%
• Discount computer software: 6.9%
Source: Communications of the ACM, February 2007, Vol. 50, No. 2
Stats 101
• 1998: spam was a mere 10% of overall mail volume. Now: 80%. (Communications of the ACM, February 2007, Vol. 50, No. 2)
• Average spammers’ revenue: $1 per 45,000 spams dispatched.
• A database of 100 million e-mail addresses costs 100 dollars, spam software included. (www.symantec.com)
Today’s Spam Types: Text Spam
Commonly used phrases filtered out by antispam filters (and words to avoid, of course):
Free!; 50% off!; Click Here; Call now!; Subscribe; Earn $; Discount!; Eliminate Debt; Double your income; You're a Winner!; Reverses Aging; Hidden; Information you requested; Stop / Stops; Lose Weight; Multi level Marketing; Million Dollars; Opportunity; Compare; Removes; Collect; Amazing; Cash Bonus; Promise You; Credit; Loans; Satisfaction Guaranteed; Serious Cash; Search Engine Listings
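A phrase blacklist of this kind can be sketched in a few lines. This is only an illustration, not the filter presented in the talk: the list SPAM_PHRASES (a handful of phrases from the slide) and the function name find_spam_phrases are assumptions made for the example.

```python
import re

# Hypothetical blacklist: a few of the phrases listed on the slide.
SPAM_PHRASES = [
    "click here",
    "call now",
    "lose weight",
    "satisfaction guaranteed",
    "double your income",
]


def find_spam_phrases(text, phrases=SPAM_PHRASES):
    """Return the blacklisted phrases that occur in text (case-insensitive)."""
    # Lower-case and collapse whitespace so "Click   HERE" still matches.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = []
    for phrase in phrases:
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            hits.append(phrase)
    return hits
```

Exact matching like this is what the spammer tricks on the following slides are designed to defeat.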
Today’s Spam Types: Image-Based Spam
Today’s Spam Types: Image-Based Spam in our mailboxes
Today’s Spam Types: Phishing
Today’s Spam Types: CAPTCHA (fighting web spam)
Common Spammer Tricks
Tricks to fool statistical spam filters:
• Avoidance of keywords (such as stock, Viagra, etc.),
• Frequent changes of the sender’s address,
• Message encoding (such as base64, normally used to carry binary content safely through mail transport),
• Hashing (e.g. insertion of HTML tags into messages),
• Use of images instead of plain text (namely GIF, JPEG, and PNG).
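Two of these tricks, base64 encoding and HTML tag insertion, can be undone before the text reaches a statistical filter. The following is a minimal sketch under simplifying assumptions: the function names are hypothetical, the body is assumed to be UTF-8, and a real mail parser would also handle MIME structure and other transfer encodings.

```python
import base64
import html
import re


def decode_base64_body(payload):
    """Decode a base64-encoded body to text; undecodable bytes are replaced."""
    return base64.b64decode(payload).decode("utf-8", errors="replace")


def strip_html_tags(text):
    """Remove HTML tags and comments, then unescape entities such as &amp;."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    return html.unescape(without_tags)
```

For example, a body that decodes to Vi<b></b>ag<!-- x -->ra is reduced to the plain keyword once the tags are stripped.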
New Spammer Tricks: Character Hashing
I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
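In the example above, each scrambled word keeps its first and last letter and only permutes the letters in between. One possible countermeasure, sketched here as an assumption rather than as the authors' method, is to index known keywords by a signature that ignores the order of the inner letters. The keyword list and function names are hypothetical.

```python
def scramble_signature(word):
    """First letter + sorted inner letters + last letter, lower-cased."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


# Hypothetical keyword list, taken from the words scrambled on the slide.
KEYWORDS = ["finally", "lose", "weight", "struggling", "believe",
            "amazing", "patch", "pounds", "guaranteed", "money"]
SIGNATURES = {scramble_signature(k): k for k in KEYWORDS}


def unscramble(word):
    """Map a scrambled word back to a known keyword, or return it unchanged."""
    return SIGNATURES.get(scramble_signature(word), word)
```

Distinct keywords can share a signature, so a larger dictionary would need to store several candidates per signature.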
New Spammer Tricks
Keyword masking by repeating characters: Buuuyyyy cheeeeaaap viaaagraaa
Word obfuscations: \/laGr@; Need a{} Dpiloma?; sh1pp1ng //orldwide; S0ft T4bs; Ci@li$; repl1ca w4tches from r0lex
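Both tricks can be partly reversed by normalizing the text before filtering. The sketch below is illustrative only: the substitution table is a small hypothetical sample covering the characters seen on this slide, and the two steps are kept separate because collapsing repeated letters also damages legitimate words with double letters (for example "shipping").

```python
import re

# Hypothetical substitution table; multi-character disguises come first.
SUBSTITUTIONS = [
    ("\\/", "v"),
    ("//", "w"),
    ("@", "a"),
    ("$", "s"),
    ("0", "o"),
    ("1", "i"),
    ("4", "a"),
]


def deobfuscate(text):
    """Lower-case the text and replace look-alike characters with letters."""
    result = text.lower()
    for disguised, plain in SUBSTITUTIONS:
        result = result.replace(disguised, plain)
    return result


def collapse_repeats(text):
    """Reduce every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", text)
```

A fixed table cannot catch every disguise: \/laGr@ becomes "vlagra", because the lower-case L standing in for "i" is itself an ordinary letter.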
New Spammer Tricks: Word obfuscations
• There are 62,424 (3 x 12 x 17 x 2 x 3 x 17) ways to portray the name Viagra.
• In fact, there are 600,426,974,379,824,381,952 ways to spell Viagra.
Source: http://cockeyed.com/lessons/viagra/viagra.html
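The first figure is a plain product of independent choices. The check below assumes, as the six factors suggest, that each factor is the number of alternatives available for one of the six letter positions; that reading is an interpretation, not something stated on the slide.

```python
import math

# Assumed meaning: number of alternative renderings per letter position.
alternatives_per_position = [3, 12, 17, 2, 3, 17]

# Independent choices multiply.
total_variants = math.prod(alternatives_per_position)
print(total_variants)
```

The product is 62,424, matching the figure quoted above.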
New Spammer Tricks: ASCII Art
    \|||||/
    ( o o )
-ooO--(_)--Ooo-
     / \
New Spammer Tricks: Good word attacks (Bayesian poisoning)
Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
A Filter to Fight Text-Based Spam
It’s just another Short Document Classification Problem:
• The Itemsets Filter
• Plain Bayes Filter
• LSI Filter
• SVM Filter
• GZip (Compression-based) Filter
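To make the classification framing concrete, here is a minimal sketch of a plain (multinomial naive) Bayes filter with add-one smoothing. It is a textbook formulation written for illustration, not the implementation evaluated in the talk; the class name, whitespace tokenization, and decision threshold are all assumptions.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter; train on both classes before scoring."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labeled message ("spam" or "ham") to the model."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        """log P(spam | text) - log P(ham | text); positive means spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Class prior, then one smoothed likelihood term per word.
            log_prob = math.log(self.message_counts[label] / total_messages)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
            scores[label] = log_prob
        return scores["spam"] - scores["ham"]

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0
```

Appending words that are common in legitimate mail lowers the spam log-odds of a message, which is exactly what the good word attack exploits.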
Standard Spam Testing Collections
• PU1: a mixture of 481 spam messages and 618 legitimate messages
• PU123A: four corpora, based on private mailboxes
• Enron Corpus: 200,399 unique messages collected by 158 users (mostly managers)
Itemsets Spam Filter: Results
FPI = (#ham classified as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam classified as ham) / #spam, i.e. the proportion of spam passing through the filter.
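The two measures follow directly from these definitions. The helper below is a small sketch; the function name and the "ham"/"spam" string labels are assumptions made for the example.

```python
def error_rates(true_labels, predicted_labels):
    """Return (FPI, FNI) for parallel lists of "ham"/"spam" labels."""
    pairs = list(zip(true_labels, predicted_labels))
    ham_total = sum(1 for truth, _ in pairs if truth == "ham")
    spam_total = sum(1 for truth, _ in pairs if truth == "spam")
    ham_as_spam = sum(1 for truth, pred in pairs
                      if truth == "ham" and pred == "spam")
    spam_as_ham = sum(1 for truth, pred in pairs
                      if truth == "spam" and pred == "ham")
    # Guard against an empty class to avoid division by zero.
    fpi = ham_as_spam / ham_total if ham_total else 0.0
    fni = spam_as_ham / spam_total if spam_total else 0.0
    return fpi, fni
```

With four legitimate messages of which one is flagged, and two spam messages of which one slips through, this gives FPI = 0.25 and FNI = 0.5.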
SVM Spam Filter: Results (FPI and FNI as defined above)
GZip Spam Filter: Results (FPI and FNI as defined above)
… We will look into this in the near future
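The general idea behind a compression-based filter can be sketched as follows: a message is assigned to the class whose training text it compresses best with. This is a generic illustration, not the authors' GZip filter (for which no results are given here). It uses zlib, which implements the same DEFLATE algorithm as gzip, and the function names are hypothetical.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the DEFLATE-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify_by_compression(message, spam_corpus, ham_corpus):
    """Label the message by the corpus that makes it cheapest to compress."""
    # Extra bytes needed to encode the message after each corpus.
    spam_cost = compressed_size(spam_corpus + " " + message) - compressed_size(spam_corpus)
    ham_cost = compressed_size(ham_corpus + " " + message) - compressed_size(ham_corpus)
    return "spam" if spam_cost < ham_cost else "ham"
```

Because DEFLATE only looks back over a 32 KB window, a practical version would have to limit or sample the corpus text placed in front of the message.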
Light at the end of the tunnel?
• Payment per e-mail?
  • Quite unlikely…
• E-mail authentication by SIDF
  • Sender ID Framework (by Microsoft)
  • … registered list of servers of domain owners
  • Confirmation of the e-mail source domain (automatically, by ISPs)
  • Protects 40% of legitimate e-mail sent worldwide
  • Helps combat phishing scams / domain spoofing (forging a sender's address)
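The core check behind a registered server list can be illustrated as below. This is a simplified sketch and not the Sender ID protocol itself: real deployments publish the authorized servers in DNS records, whereas the in-memory AUTHORIZED_SENDERS table, its example.org entry (with documentation-only IP ranges), and the function name are all hypothetical.

```python
import ipaddress

# Hypothetical registry: domain -> networks allowed to send mail for it.
AUTHORIZED_SENDERS = {
    "example.org": ["192.0.2.0/24", "198.51.100.25/32"],
}


def is_authorized_sender(domain, sender_ip, registry=AUTHORIZED_SENDERS):
    """True if sender_ip belongs to a network registered for the domain."""
    address = ipaddress.ip_address(sender_ip)
    networks = registry.get(domain.lower(), [])
    return any(address in ipaddress.ip_network(net) for net in networks)
```

A message claiming to come from example.org but delivered from an unlisted address fails the check, which is how this kind of scheme helps against domain spoofing.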
Light at the end of the tunnel?
• DomainKeys Identified Mail (DKIM)
  • Similar technology by Yahoo, Cisco Systems, Sendmail, PGP
  • Based on digital signatures
  • An official Proposed Standard of the Internet Engineering Task Force (IETF)