Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security and Privacy 2011. OUTLINE. Introduction - Monarch Related Work System Design Implementation Evaluation Discussion and Conclusion.
Spam URL • Advertisement • Harmful content • Phishing, malware, and scams • Use of compromised and fraudulent accounts • Email, web services
Monarch • Spam URL Filtering as a Service • Tens of millions of features
Related Work • “Detecting spammers on Twitter” (2010) • Post frequency, URLs, friends… • “Behind phishing: an examination of phisher modi operandi” (2008) • Lexical characteristics of phishing URLs • “Cantina: a content-based approach to detecting phishing web sites” (2007) • Parse HTML content
System Design • Monarch’s cloud infrastructure • URL Aggregation • Email providers and Twitter’s streaming API • Feature Collection • Visits a URL with web browsers to collect page content
System Design (cont.) • Monarch’s cloud infrastructure • Feature Extraction • Transforms the raw data into a sparse feature vector • Classification • Training and testing by distributed logistic regression
Collect Raw Features – Web Browser • “A taxonomy of JavaScript redirection spam” (2007) • A lightweight browser is not enough • Poor HTML parsing, lack of JavaScript and plugins • Instrumented version of Firefox • JavaScript enabled • Flash and Java installed • Visits a URL and monitors a number of details
Raw Features • Web Browser • Initial URL and Landing URL, Redirects, Sources and Frames • HTML Content, Page Links • JavaScript Events, Pop-up Windows, Plugins • HTTP Headers • DNS Resolver • Initial, final, and redirect URLs • IP Address Analysis • City, country, ASN • Proxy and Whitelist (200 domains)
Feature Vector • Raw features => sparse feature vector • Canonicalize URLs • Remove obfuscation • Tokenize the text corpus • Split on non-alphanumeric characters • Example: http://adl.tw/~dada/dada2.php?a=1&b=3 => domain feature [adl,tw]; path feature [dada,dada2,php]; query parameters feature [a,1,b,3] • Dense form: (…,adl:true,adm:false,…,dada:true,…,tw:true,…), 49,960,691 features (dimensions) in total • Sparse form: (1,3,a,adl,b,dada,dada2,php,tw)
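The tokenization step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Monarch's extractor: the function names `url_tokens` and `sparse_vector` are hypothetical, canonicalization and de-obfuscation are skipped, and merging all tokens into one sorted list simply mirrors the slide's example.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into domain, path, and query token lists.

    Tokens are produced by splitting on runs of non-alphanumeric characters.
    """
    parts = urlsplit(url)

    def split(text):
        return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

    return {
        "domain": split(parts.hostname or ""),
        "path": split(parts.path),
        "query": split(parts.query),
    }


def sparse_vector(tokens_by_group):
    """Keep only the features that are present, as a sorted list of names."""
    present = set()
    for tokens in tokens_by_group.values():
        present.update(tokens)
    return sorted(present)


tokens = url_tokens("http://adl.tw/~dada/dada2.php?a=1&b=3")
print(tokens["domain"], tokens["path"], tokens["query"])
print(sparse_vector(tokens))
```

Storing only the names of the features that are present is what keeps a vector with tens of millions of dimensions manageable: every feature not listed is implicitly false.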
Distributed Classifier Design • Linear classification • x: feature vector • Determine a weight vector w • A parallel online learner • With L1 regularization to yield a sparse weight vector • Labeled data (xi, yi), yi ∈ {-1, 1} • Testing: sign(x·w) = -1 => non-spam site, 1 => spam site
Training the weight vector • Logistic regression • With subgradient L1 regularization • yi(xi·w) larger => f(w) smaller (classification margin, hyperplane)
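The testing rule on this slide amounts to taking the sign of the dot product between a feature vector x and the weight vector w. A minimal sketch for sparse binary features follows; the `classify` function, the weights, and the tie-breaking rule (a score of exactly 0 counts as non-spam) are illustrative assumptions, not values from the talk.

```python
def classify(present_features, weights):
    """Return 1 (spam) or -1 (non-spam) from the sign of x . w.

    present_features: names of the features that are true in x.
    weights: dict mapping feature name to weight; absent names weigh 0,
             which is how a sparse weight vector is represented here.
    """
    score = sum(weights.get(name, 0.0) for name in present_features)
    # Assumption: a score of exactly 0 is treated as non-spam.
    return 1 if score > 0 else -1


# Hypothetical weights, chosen only to make the example readable.
weights = {"php": 0.4, "dada2": 1.1, "tw": -0.3}
print(classify(["adl", "tw", "dada2", "php"], weights))
print(classify(["adl", "tw"], weights))
```

Because x is binary and sparse, the dot product reduces to summing the weights of the features that are present, so classification cost depends on the number of present features rather than on the full dimensionality.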
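The slide refers to f(w) without showing it. Assuming it denotes the usual L1-regularized logistic regression objective over N labeled examples (x_i, y_i) with y_i in {-1, 1}, it can be written as below; the regularization strength \lambda is a symbol introduced here, not a value from the slides.

```latex
f(w) = \sum_{i=1}^{N} \log\!\left(1 + \exp\!\left(-y_i \,(x_i \cdot w)\right)\right)
       + \lambda \lVert w \rVert_1
```

Each summand shrinks as the margin y_i (x_i \cdot w) grows, which is the relationship stated on the slide. The \lVert w \rVert_1 term is not differentiable at zero, which is why a subgradient is used, and it pushes many weights to exactly zero, yielding a sparse weight vector.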
Data Set and Assumptions • 1.25 million spam email URLs • 567,784 spam Twitter URLs • 9 million non-spam Twitter URLs • All Twitter URLs are checked against: • Google Safebrowsing, SURBL, URIBL, APWG, Phishtank • A URL is labeled spam if it or any of its source URLs becomes blacklisted
Data Set and Assumptions (cont.) • On Twitter: • 36% scams, 60% phishing, 4% malware
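The labeling assumption for Twitter URLs can be sketched as a membership test over everything a crawl touched. In this illustration the blacklist is a plain in-memory set and the URLs are placeholders; how the real blacklists are queried is not covered in the slides, and `label_url` is a hypothetical name.

```python
def label_url(url_chain, blacklist):
    """Return 1 (spam) if any URL seen during the crawl is blacklisted, else -1.

    url_chain: the initial URL plus its redirects, sources, and landing URL.
    blacklist: a set standing in for the combined blacklist feeds.
    """
    return 1 if any(url in blacklist for url in url_chain) else -1


# Placeholder data for illustration only.
blacklist = {"http://bad.example.com/landing"}
chain = [
    "http://short.example.org/abc",
    "http://bad.example.com/landing",
]
print(label_url(chain, blacklist))
print(label_url(["http://good.example.net/"], blacklist))
```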
Implementation • Amazon Web Services (AWS) infrastructure • URL Aggregation • A queue that holds 300,000 URLs • Feature Collection • 20x6 Firefox (4.0b4) on Ubuntu 10.04 • With a custom extension • Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library, and Route Views • Classifier • Hadoop Distributed File System • On a 50-node cluster
Evaluation – Overall Accuracy • 5-fold cross-validation • 500,000 spam and 500,000 non-spam examples • Training set size: 400,000 examples • 1:1, 4:1, 10:1 • Testing set size: 200,000 examples • 1:1
Evaluation – Accuracy Over Time • Training only once vs. retraining every four days
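For readers unfamiliar with k-fold cross-validation, the generic procedure can be sketched as follows. This shows only the standard splitting scheme; how the talk's folds were sized and rebalanced to the listed ratios is not reproduced, and `k_fold_splits` is a hypothetical helper.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, test) pairs so that each example is tested exactly once."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [e for i, fold in enumerate(folds) if i != held_out for e in fold]
        yield train, test


for train, test in k_fold_splits(range(10), k=5):
    print(len(train), len(test))
```

Each round trains on four folds and tests on the fifth, so reported accuracy is an average over five disjoint test sets rather than a single split.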
Evaluation – Comparing Email and Tweet Spam • Log odds ratio:
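The formula itself did not survive on this slide. As a point of reference, the standard log odds ratio for a binary feature compares how likely the feature is in one set (say, email spam) against another (tweet spam). The sketch below uses that textbook definition, which may differ in detail from the one presented; the `smoothing` parameter is an addition here to avoid division by zero and is not from the talk.

```python
import math


def log_odds_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Log odds ratio of a binary feature between set A and set B.

    count_x: number of examples in set X where the feature is present.
    total_x: number of examples in set X.
    Positive values mean the feature is more typical of A, negative of B.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))


# Hypothetical counts for illustration only.
print(log_odds_ratio(90, 100, 10, 100))
print(log_odds_ratio(50, 100, 50, 100))
```

A value near zero means the feature is about equally common in both sets; a large magnitude marks a feature that separates the two kinds of spam.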
Evaluation – The Cost • For Twitter, $22,751 per month
Discussion and Conclusion • Evasion • Feature Evasion • Time-based Evasion • Crawler Evasion • Monarch • Real-time system • Spam URL Filtering as a Service • $22,751 a month