Filtron: A Learning-Based Anti-Spam Filter

First Conference on Email and Anti-Spam (CEAS) Filtron: A Learning-Based Anti-Spam Filter Eirinaios Michelakis (ernani@iit.demokritos.gr), Ion Androutsopoulos (ion@aueb.gr), George Paliouras (paliourg@iit.demokritos.gr), George Sakkis (gsakkis@rutgers.edu), Panagiotis Stamatopoulos (takis@di.uoa.gr) Mountain View, CA, July 30th and 31st 2004

Outline • Spam Filtering: past, present and future • Anti-spam filtering with Filtron • In Vitro Evaluation • In Vivo Evaluation • Conclusions

Spam Filtering: past, present and future • Past: • Black-lists and white-lists of e-mail addresses • Handcrafted rules looking for suspicious keywords and patterns in headers • Present: • Machine learning-based filters • Mostly using the Naïve Bayes classifier • Examples: Mozilla’s spam filter, POPFILE, K9 • Signature-based filtering (Vipul’s Razor) • Future: • Combination of several techniques (SpamAssassin)

Filtron: An overview • A multi-platform learning-based anti-spam filter. • Features for the simple user: • Personalized: based on her legitimate messages • Automatically updating black/white lists • Efficient: server-side filtering and interception rules • Features for the advanced user and the researcher: • Customizable learning component • Through the Weka open-source machine learning platform • Support for creating publicly available message collections • Privacy-preserving encoding of messages and user profiles • Portable: Implemented in Java and Tcl/Tk • Currently supported on POSIX-compatible mail servers (MS Exchange Server port efforts under way)

Filtron’s Architecture • [Figure: architecture diagram of Filtron. Labels: Spam folders, Legitimate folders, Preprocessor, Attribute Selector, attribute set, black list, white list, Vectorizer, training vectors, Learner, induced classifier, User model]

Preprocessing • Break down mailbox(es) into distinct messages • Remove from every message: • mail headers • html tags • attached files • Remove messages with no textual content • Store 5 messages per sender • Avoids bias towards regular correspondents. • Remove duplicates • Encode messages (optional)
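The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not Filtron's actual code (which is written in Java and Tcl/Tk): the function names, the naive regex for HTML tags, the exact-match duplicate test, and the order of the checks are all assumptions. It also assumes mail headers and attachments have already been stripped, so each message arrives as a (sender, body) pair.

```python
import re
from collections import defaultdict

# Naive tag pattern; an assumption for illustration, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace HTML tags with spaces and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def preprocess(messages, max_per_sender=5):
    """Clean an iterable of (sender, body) pairs.

    Drops messages with no textual content, keeps at most
    `max_per_sender` messages per sender, and removes exact duplicates.
    """
    kept = []
    per_sender = defaultdict(int)
    seen_bodies = set()
    for sender, body in messages:
        text = strip_html(body)
        if not text:                      # no textual content left
            continue
        if per_sender[sender] >= max_per_sender:
            continue                      # sender already has enough messages
        if text in seen_bodies:           # exact duplicate of a kept message
            continue
        seen_bodies.add(text)
        per_sender[sender] += 1
        kept.append((sender, text))
    return kept
```

Capping the number of messages per sender is what keeps a frequent correspondent from dominating the legitimate training data.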

Message Classification

In Vitro Evaluation • We investigated the effect of: • Single-token versus multi-token attributes (n-grams for n=1,2,3) • Number of attributes (40-3000) • Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost) • Training corpus size (~ 10%-100% of full training corpus) • Cost-Sensitive Learning Formulation • Misclassifying a legitimate message as spam (LS) is λ times more serious an error than misclassifying a spam message as legitimate (SL) • Two usage scenarios (λ = 1, 9)
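One common way to turn the λ cost ratio into a decision rule and an evaluation measure is sketched below. The slides do not spell out either formula, so both are assumptions: the threshold t = λ / (1 + λ) on the estimated spam probability, and a weighted accuracy in which every legitimate message counts λ times. All function and parameter names are hypothetical.

```python
def spam_threshold(lam):
    """Probability threshold implied by cost ratio lam (assumed rule)."""
    return lam / (1.0 + lam)


def classify(p_spam, lam):
    """Label a message as spam only if P(spam | message) exceeds the threshold."""
    return "spam" if p_spam > spam_threshold(lam) else "legitimate"


def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """Accuracy with each legitimate message counted lam times.

    n_ll: legitimate classified as legitimate
    n_ls: legitimate classified as spam (the costly LS error)
    n_ss: spam classified as spam
    n_sl: spam classified as legitimate (the SL error)
    """
    return (lam * n_ll + n_ss) / (lam * (n_ll + n_ls) + n_ss + n_sl)
```

Under this rule, λ = 1 gives a threshold of 0.5, while λ = 9 requires the classifier to be 90% sure before a message is treated as spam.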

In Vitro Evaluation (cont.) • Evaluation: • Four message collections (PU1, PU2, PU3, PUA) • Stratified 10-fold cross validation • Results: • No clear winner among learning algorithms with respect to accuracy, so efficiency (or other criteria) matters more in real usage. • Nevertheless, SVMs consistently among the two best • No substantial improvement with n-grams (for n>1) • Refer to the TR for more details: • Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
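For readers unfamiliar with stratified cross validation, the sketch below shows one simple way to build folds that keep the spam/legitimate ratio roughly constant. It is an illustration under assumptions (round-robin dealing within each class, a fixed random seed, hypothetical function names), not the procedure actually used in the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

Each message is used for testing exactly once and for training in the other k - 1 iterations.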

Summary of in Vitro Evaluation

In Vivo Evaluation • Seven-month live evaluation by the third author • Training collection: PU3 • 2313 legitimate / 1826 spam • Learning algorithm: SVM • Cost scenario: λ = 1 • Retained attributes: 520 1-grams • Numeric values (term frequency) • No black-list was used
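To make "1-gram attributes with term-frequency values" concrete, the sketch below maps a message to a numeric vector over a fixed attribute list. The tokenizer, the use of raw counts as term frequency, and the three example attributes are assumptions for illustration; the slides do not list the 520 retained attributes or specify the tokenization.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase single tokens (1-grams)."""
    return TOKEN_RE.findall(text.lower())


def to_vector(text, attributes):
    """Return the count of each retained attribute in the message."""
    counts = Counter(tokenize(text))
    return [counts[attr] for attr in attributes]


# Hypothetical attribute set standing in for the 520 retained 1-grams.
attributes = ["free", "money", "meeting"]
vector = to_vector("FREE money, free offer!", attributes)
```

Tokens outside the retained attribute set, such as "offer" here, are simply ignored by the vectorizer.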

Summary of in Vivo Evaluation

Post-Mortem Analysis: False Positives • 52 false positives (out of 6732) • 52%: Automatically generated messages • subscription verifications, virus warnings, etc. • 22%: Very short messages • 3-5 words in message body • Along with attachments and hyperlinks • 26%: Short messages • 1-2 lines • Written in casual style, often exploited by spammers • With no attachments or hyperlinks

Post-Mortem Analysis: False Negatives • 173 false negatives (out of 6732) • 30%: “Hard Spam” • Little textual information, avoiding common suspicious word patterns • Many images and hyperlinks • Tricks to confuse tokenizers • 8%: Advertisements of pornographic sites with very casual and well-chosen vocabulary • 23%: Non-English messages • Under-represented in the training corpus • 30%: Encoded messages • BASE64 format; Filtron could not process it at that time • 6%: Hoax letters • Long formal letters (“tremendous business opportunity !”) • Many occurrences of the receiver’s full name • 3%: Short messages with unusual content
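The BASE64 category suggests decoding message bodies before tokenization. The sketch below shows how this could be done with Python's standard `email` package; it illustrates the idea only and is not how Filtron handles (or later handled) encoded messages. The function name and the UTF-8 fallback are assumptions.

```python
from email import message_from_string


def text_parts(raw_message):
    """Return the decoded text parts of a raw RFC 822 message.

    get_payload(decode=True) undoes the Content-Transfer-Encoding
    (base64 or quoted-printable) and returns bytes.
    """
    msg = message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue                      # skip containers, images, attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:               # unknown charset name in the header
            texts.append(payload.decode("utf-8", errors="replace"))
    return texts
```

The decoded text can then go through the same preprocessing and vectorization as any other message body.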

Conclusions • Signs of an arms race between spammers and content-based filters • Filtron’s performance deemed satisfactory, though it can be improved with: • More elaborate preprocessing to tackle spammers’ usual countermeasures (misspellings, uncommon words, text on images) • Regular retraining • Currently most promising approach: combination of different filtering approaches along with Machine Learning • Collaborative filtering • Filtering at the transport layer • …
