Design and Evaluation of a Real-Time URL Spam Filtering Service

Design and Evaluation of a Real-Time URL Spam Filtering Service • Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song • IEEE Symposium on Security and Privacy, 2011

OUTLINE • Introduction - Monarch • Related Work • System Design • Implementation • Evaluation • Discussion and Conclusion

Spam URL • Advertisement • Harmful content • Phishing, malware, and scams • Use of compromised and fraudulent accounts • Email, web services

Monarch • Spam URL Filtering as a Service • Tens of millions of features

Related Work • “Detecting spammers on Twitter” (2010) • Post frequency, URLs, friends… • “Behind phishing: an examination of phisher modi operandi” (2008) • Lexical characteristics of phishing URLs • “Cantina: a content-based approach to detecting phishing web sites” (2007) • Parse HTML content

System Design • Monarch’s cloud infrastructure • URL Aggregation • Email providers and Twitter’s streaming API • Feature Collection • Visits a URL with web browsers to collect page content

System Design (cont.) • Monarch’s cloud infrastructure • Feature Extraction • Transforms the raw data into a sparse feature vector • Classification • Training and testing by distributed logistic regression

Collect Raw Features – Web Browser • “A taxonomy of JavaScript redirection spam” (2007) • A lightweight browser is not enough • Poor HTML parsing, lack of JavaScript and plugins • Instrumented version of Firefox • JavaScript enabled • Flash and Java installed • Visits a URL and monitors a number of details

Raw Features • Web Browser • Initial URL and Landing URL, Redirects, Sources and Frames • HTML Content, Page Links • JavaScript Events, Pop-up Windows, Plugins • HTTP Headers • DNS Resolver • Initial, final, and redirect URLs • IP Address Analysis • City, country, ASN • Proxy and Whitelist (200 domains)

Feature Vector • Raw features => sparse feature vector • Canonicalize URLs • Remove obfuscation • Tokenize the text corpus • Split on non-alphanumeric characters
Example: http://adl.tw/~dada/dada2.php?a=1&b=3
=> domain feature [adl,tw]
=> path feature [dada,dada2,php]
=> query parameters feature [a,1,b,3]
=> dense form: (…,adl:true,adm:false,…,dada:true,…,tw:true,…), 49,960,691 features (dimensions) in total
=> sparse form: (1,3,a,adl,b,dada,dada2,php,tw)
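The tokenization on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not Monarch's code: the function names `tokenize`, `url_features`, and `to_sparse_vector` are hypothetical, and it skips the canonicalization and de-obfuscation steps the slide also lists.

```python
import re
from urllib.parse import urlsplit


def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty tokens."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def url_features(url):
    """Return the token lists for the domain, path, and query of a URL."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.hostname or ""),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }


def to_sparse_vector(features):
    """Sparse binary vector: keep only the tokens that are present, sorted."""
    return sorted({tok for toks in features.values() for tok in toks})


feats = url_features("http://adl.tw/~dada/dada2.php?a=1&b=3")
vec = to_sparse_vector(feats)
print(feats)
print(vec)
```

Only tokens that occur are stored, so a URL with nine tokens costs nine entries rather than one per dimension of the full feature space.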

Distributed Classifier Design • Linear classification • x: feature vector • Determine a weight vector w • A parallel online learner • With regularization to yield a sparse weight vector • Labeled data (xi, yi), yi ∈ {-1, 1} • Testing: sign(x · w) => -1 => non-spam site, 1 => spam site
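A linear classifier over sparse binary features reduces to summing the weights of the features that are present and taking the sign. The sketch below assumes the weight vector is a dictionary holding only non-zero weights. The weights shown are made up for illustration, and treating a score of exactly zero as non-spam is an arbitrary choice.

```python
def score(active_features, weights):
    """Dot product x . w when x is binary and given as its active features."""
    return sum(weights.get(feature, 0.0) for feature in active_features)


def classify(active_features, weights):
    """Return 1 (spam) for a positive score, otherwise -1 (non-spam)."""
    return 1 if score(active_features, weights) > 0 else -1


# Hypothetical weights; a trained model would supply these.
example_weights = {"adl": 0.8, "php": 0.3, "tw": -0.2}

print(classify(["adl", "tw", "dada", "php"], example_weights))
print(classify(["tw", "dada"], example_weights))
```

Features missing from the dictionary contribute nothing, which is why a sparse weight vector keeps both memory use and classification time low.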

Training the weight vector • Logistic Regression • With subgradient L1-Regularization • yi(xi · w) larger => f(w) smaller (Classification margin, hyperplane)
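The relationship between margin and loss can be seen in a single online update. This sketch is an illustration, not the authors' algorithm: it uses the standard logistic loss log(1 + exp(-y (x · w))) and handles the L1 term with a soft-threshold (shrinkage) step, which is one common way to obtain sparse weights. The learning rate `lr` and penalty `lam` are hypothetical parameters.

```python
import math


def dot(w, x):
    """x . w for a binary x given as a list of active feature names."""
    return sum(w.get(f, 0.0) for f in x)


def logistic_loss(w, x, y):
    """Loss for one example; shrinks as the margin y * (x . w) grows."""
    return math.log1p(math.exp(-y * dot(w, x)))


def online_step(w, x, y, lr=0.1, lam=0.01):
    """One gradient step on the logistic loss, then L1 shrinkage."""
    margin = y * dot(w, x)
    g = -y / (1.0 + math.exp(margin))  # d(loss) / d(x . w)
    new_w = dict(w)
    for f in x:
        new_w[f] = new_w.get(f, 0.0) - lr * g
    # Soft-threshold every weight toward zero; drop weights that reach zero.
    for f in list(new_w):
        shrunk = max(0.0, abs(new_w[f]) - lr * lam)
        if shrunk == 0.0:
            del new_w[f]
        else:
            new_w[f] = math.copysign(shrunk, new_w[f])
    return new_w


w0 = {}
w1 = online_step(w0, ["a", "b"], 1)
print(logistic_loss(w0, ["a", "b"], 1), logistic_loss(w1, ["a", "b"], 1))
```

With a large penalty the shrinkage removes small weights entirely, which is how regularization produces a sparse weight vector.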

Distributed Classifier Algorithm

Data Set and Assumptions • 1.25 million spam email URLs • 567,784 spam Twitter URLs • 9 million non-spam Twitter URLs • Checking all Twitter URLs against: • Google Safebrowsing, SURBL, URIBL, APWG, Phishtank • A URL is labeled spam if any of its source URLs becomes blacklisted

Data Set and Assumptions (cont.) • On Twitter: • 36% scams, 60% phishing, 4% malware
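The labeling rule for Twitter URLs can be sketched as a membership test. This is an assumption-laden illustration: the real blacklists are queried through their own services and may match on domains rather than exact strings, and the URLs below are placeholders.

```python
def label_url(source_urls, blacklist):
    """Return 1 (spam) if any source URL is blacklisted, otherwise -1."""
    return 1 if any(url in blacklist for url in source_urls) else -1


# Placeholder blacklist standing in for the combined blacklist feeds.
blacklist = {"http://bad.example/landing"}

chain = ["http://short.example/abc", "http://bad.example/landing"]
print(label_url(chain, blacklist))
print(label_url(["http://good.example/"], blacklist))
```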

After regularization

Implementation • Amazon Web Services (AWS) infrastructure • URL Aggregation • A queue that keeps 300,000 URLs • Feature Collection • 20x6 Firefox (4.0b4) on Ubuntu 10.04 • With a custom extension • Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views • Classifier • Hadoop Distributed File System • On a 50-node cluster

Evaluation – Overall Accuracy • 5-fold cross-validation • 500,000 spam and 500,000 non-spam examples • Training set size: 400,000 examples • Ratios: 1:1, 4:1, 10:1 • Testing set size: 200,000 examples • Ratio: 1:1
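As a rough illustration of the evaluation setup, the sketch below generates 5-fold train/test index splits. The helper name `k_fold_indices` and the fixed seed are hypothetical. With 1,000,000 examples each fold holds 200,000, which matches the testing set size on the slide. Subsampling the training portion to 400,000 examples at a chosen class ratio is not shown.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```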

Evaluation – Single Feature

Evaluation – Accuracy Over Time • Training once only vs. retraining every four days

Evaluation – Comparing Email and Tweet Spam • Log odds ratio:
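The formula on this slide did not survive extraction. For orientation, the textbook log odds ratio compares how likely a feature is in one population versus another. Whether the slide used exactly this form is an assumption, and the probabilities below are invented.

```python
import math


def log_odds_ratio(p1, p2):
    """log of (odds of the feature in population 1 / odds in population 2)."""
    odds1 = p1 / (1.0 - p1)
    odds2 = p2 / (1.0 - p2)
    return math.log(odds1 / odds2)


# Invented example: a feature seen in 50% of one population, 20% of the other.
print(log_odds_ratio(0.5, 0.2))
print(log_odds_ratio(0.3, 0.3))
```

A value near zero means the feature is about equally common in both populations. A large magnitude means it is characteristic of one of them.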

Evaluation – The Cost • For Twitter, $22,751 per month

Discussion and Conclusion • Evasion • Feature Evasion • Time-based Evasion • Crawler Evasion • Monarch • Real-time system • Spam URL Filtering as a Service • $22,751 a month
