Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008
Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions
Introduction • The Internet has spawned numerous information-rich environments • Email Systems • World Wide Web • Social Networking Communities • Openness facilitates information sharing, but it also makes these environments vulnerable…
Denial of Information (DoI) Attacks • Deliberate insertion of low quality information (or noise) into information-rich environments • Information analog to Denial of Service (DoS) attacks • Two goals • Promotion of ideals by means of deception • Denial of access to high quality information • Spam is currently the most prominent example of a DoI attack
Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions
Countering Email Spam • Close to 200 billion (yes, billion) emails are sent each day • Spam accounts for around 90% of that email traffic • ~2 million spam messages every second
Problem Description • Email spam detection can be modeled as a binary text classification problem • Two classes: spam and legitimate (non-spam) • Example of supervised learning • Build a model (classifier) based on training data to approximate the target function • Construct a function f: M → {spam, legitimate} that agrees with the true target function F: M → {spam, legitimate} as much as possible
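To make the supervised-learning setup concrete, here is a minimal sketch (not part of the original lecture) that learns an approximation f from labeled training messages; it assumes scikit-learn is available, and the toy messages and labels are invented placeholders:

```python
# Learn f: M -> {spam, legitimate} from labeled examples.
# Toy data; a real system trains on a corpus such as Ling-spam or PU1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "cheap meds buy now limited offer",       # spam
    "win a free prize click here",            # spam
    "meeting moved to 3pm see agenda",        # legitimate
    "draft of the paper attached, comments?", # legitimate
]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

vectorizer = CountVectorizer()                  # bag-of-words features
X_train = vectorizer.fit_transform(train_messages)
f = MultinomialNB().fit(X_train, train_labels)  # the learned classifier

X_new = vectorizer.transform(["free prize offer, click now"])
print(f.predict(X_new))                         # -> ['spam']
```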
Problem Description (cont.) • How do we represent a message? • How do we generate features? • How do we process features? • How do we evaluate performance?
How do we represent a message? • Classification algorithms require a consistent format • Salton’s vector space model (“bag of words”) is the most popular representation • Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
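A small, self-contained illustration of the vector space model (the vocabulary and message below are made up):

```python
import re

# Fixed vocabulary; in practice it is built from the training corpus.
vocabulary = ["free", "offer", "meeting", "viagra", "agenda"]

def to_vector(message: str) -> list[int]:
    """Represent message m as <f1, ..., fn>, where fi counts vocabulary[i]."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [tokens.count(term) for term in vocabulary]

print(to_vector("FREE offer: free meds!"))  # -> [2, 1, 0, 0, 0]
```

Binary (present/absent) and TF-IDF weightings are common alternatives to raw counts.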
How do we generate features? • Sources of information • SMTP connections • Network properties • Email headers • Social networks • Email body • Textual parts • URLs • Attachments
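As a taste of header-based feature generation, the sketch below pulls a few simple features from a raw message using Python's standard email module; the message itself is an invented example:

```python
import email

# Parse a raw RFC 2822 message and derive header/body features.
raw = (b"From: promo@example.com\r\n"
       b"Subject: FREE offer!!!\r\n"
       b"Received: from mx.example.net ([203.0.113.7])\r\n"
       b"\r\n"
       b"Buy now: http://example.com/deal\r\n")

msg = email.message_from_bytes(raw)
features = {
    "subject_has_free": "free" in (msg["Subject"] or "").lower(),
    "subject_exclamations": (msg["Subject"] or "").count("!"),
    "received_hops": len(msg.get_all("Received") or []),  # network path length
    "body_has_url": "http://" in msg.get_payload(),
}
print(features)
```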
How do we process features? • Feature Tokenization • Alphanumeric tokens • N-grams • Phrases • Feature Scrubbing • Stemming • Stop word removal • Feature Selection • Simple feature removal • Information-theoretic algorithms
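Of these steps, feature selection is the least self-explanatory, so here is a sketch of information gain, one of the information-theoretic criteria alluded to above; the documents and labels are toy data:

```python
import math
from collections import Counter

def entropy(pairs):
    """Entropy of the label distribution over (tokens, label) pairs."""
    if not pairs:
        return 0.0
    counts = Counter(label for _, label in pairs)
    return -sum((c / len(pairs)) * math.log2(c / len(pairs))
                for c in counts.values())

def information_gain(term, documents):
    """IG of the binary feature 'term appears in the document'."""
    with_t = [d for d in documents if term in d[0]]
    without_t = [d for d in documents if term not in d[0]]
    n = len(documents)
    return (entropy(documents)
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))

docs = [({"free", "offer"}, "spam"), ({"free", "prize"}, "spam"),
        ({"meeting", "agenda"}, "legit"), ({"paper", "draft"}, "legit")]
for term in ["free", "offer", "paper"]:
    print(term, round(information_gain(term, docs), 3))
# "free" perfectly separates the two classes here, so its IG is 1.0
```

Features are then ranked by IG and only the top k are kept.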
How do we evaluate performance? • Traditional IR metrics • Precision vs. Recall • False positives vs. False negatives • Imbalanced error costs • ROC curves
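A sketch of the error accounting behind these metrics (the predictions and labels are invented):

```python
def evaluate(predicted, actual, positive="spam"):
    """Precision, recall, and raw error counts for a spam classifier."""
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))  # legit flagged as spam
    fn = sum(a == positive != p for p, a in zip(predicted, actual))  # spam let through
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }

pred   = ["spam", "spam", "legit", "spam", "legit"]
actual = ["spam", "legit", "legit", "spam", "spam"]
print(evaluate(pred, actual))  # precision = recall = 2/3; one FP, one FN
```

The error costs are imbalanced because a false positive (a legitimate message lost to the spam folder) typically hurts far more than a false negative; sweeping the classifier's decision threshold and plotting the resulting true/false positive rates yields the ROC curve.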
Classification History • Sahami et al. (1998) • Used a Naïve Bayes classifier • Were the first to apply text classification research to the spam problem • Pantel and Lin (1998) • Also used a Naïve Bayes classifier • Found that Naïve Bayes outperforms RIPPER
Classification History (cont.) • Drucker et al. (1999) • Evaluated Support Vector Machines as a solution to spam • Found that SVM is more effective than RIPPER and Rocchio • Hidalgo and Lopez (2000) • Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
Classification History (cont.) • Up to this point, private corpora were used exclusively in email spam research • Androutsopoulos et al. (2000a) • Created the first publicly available email spam corpus (Ling-spam) • Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
Classification History (cont.) • Androutsopoulos et al. (2000b) • Created another publicly available email spam corpus (PU1) • Confirmed previous research that Naïve Bayes outperforms a keyword-based filter • Carreras and Marquez (2001) • Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
Classification History (cont.) • Androutsopoulos et al. (2004) • Created 3 more publicly available corpora (PU2, PU3, and PUA) • Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB • Zhang et al. (2004) • Used Ling-spam, PU1, and the SpamAssassin corpora • Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB
Classification History (cont.) • CEAS (2004 – present) • Focuses solely on email and anti-spam research • Generates a significant amount of academic and industry anti-spam research • Klimt and Yang (2004) • Published the Enron Corpus – the first large-scale corpus of legitimate email messages • TREC Spam Track (2005 – present) • Produces new corpora every year • Provides a standardized platform to evaluate classification algorithms
Ongoing Research • Concept Drift • New Classification Approaches • Adversarial Classification • Image Spam
Concept Drift • Spam content is extremely dynamic • Topic drift (e.g., specific scams) • Technique drift (e.g., obfuscations) • How do we keep up with the Joneses? • Batch vs. Online Learning
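A sketch of the online alternative, which updates the model message-by-message as the distribution drifts; it assumes scikit-learn, and the three-message "stream" is invented:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no fixed vocabulary, so brand-new or
# obfuscated tokens can appear mid-stream without refitting.
vectorizer = HashingVectorizer(n_features=2**18)
model = SGDClassifier()  # linear model trained by stochastic gradient descent
classes = ["spam", "legitimate"]

stream = [
    ("cheap meds buy now", "spam"),
    ("lunch at noon tomorrow?", "legitimate"),
    ("hot st0ck t1p act fast", "spam"),  # obfuscated tokens drifting in
]
for text, label in stream:
    X = vectorizer.transform([text])
    model.partial_fit(X, [label], classes=classes)  # incremental update
```

A batch learner would have to be periodically retrained from scratch to absorb the same drift.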
New Classification Approaches • Filter Fusion • Compression-based Filtering (sketched below) • Network Behavioral Clustering
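Of these, compression-based filtering is perhaps the least intuitive, so here is a crude sketch: a message is assigned to the class whose corpus compresses it best. Real systems use adaptive models such as PPM or DMC; zlib and the two tiny corpora below are stand-ins:

```python
import zlib

spam_corpus = b"win free prize click now cheap meds limited offer "
legit_corpus = b"meeting agenda attached draft paper review comments "

def extra_bytes(corpus: bytes, message: bytes) -> int:
    """How much larger the compressed corpus gets when message is appended."""
    return len(zlib.compress(corpus + message)) - len(zlib.compress(corpus))

def classify(message: str) -> str:
    m = message.lower().encode()
    return ("spam" if extra_bytes(spam_corpus, m) < extra_bytes(legit_corpus, m)
            else "legitimate")

print(classify("free prize offer"))  # -> 'spam' with these toy corpora
```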
Adversarial Classification • Classifiers assume a clear distinction between spam and legitimate features • Camouflaged messages • Mask spam content with legitimate content • Disrupt decision boundaries for classifiers
Camouflage Attacks • Baseline performance • Accuracies consistently higher than 98% • Classifiers under attack • Accuracies degrade to between 50% and 70% • Retrained classifiers • Accuracies climb back to between 91% and 99%
Camouflage Attacks (cont.) • Retraining postpones the problem, but it doesn’t solve it • We can identify features that are less susceptible to attack, but that’s simply another stalling technique
Image Spam • What happens when an email does not contain textual features? • OCR is easily defeated • Classification using image properties
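A sketch of classifying by image properties instead of text; it assumes the Pillow library, and the file path is a placeholder:

```python
import os
from PIL import Image

def image_features(path: str) -> dict:
    """Simple image properties usable as classifier features (no OCR)."""
    img = Image.open(path)
    width, height = img.size
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "file_size": os.path.getsize(path),
        "format": img.format,  # GIF / PNG / JPEG
        # None means "more than 4096 colors" -- itself a useful signal
        "low_color_count": img.convert("RGB").getcolors(4096) is not None,
    }

print(image_features("attachment.gif"))  # placeholder path
```

These properties then feed a standard classifier (SVM, decision tree, ...) exactly as textual features would.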
Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions
Countering Web Spam • What is web spam? • Traditional definition • Our definition • Between 13.8% and 22.1% of all web pages
Ad Farms • Only contain advertising links (usually ad listings) • Elaborate entry pages used to deceive visitors
Ad Farms (cont.) • Clicking on an entry page link leads to an ad listing • Ad syndicators provide the content • Web spammers create the HTML structures
Parked Domains • Domain parking services • Provide place holders for newly registered domains • Allow ad listings to be used as place holders to monetize a domain • Inevitably, web spammers abused these services
Parked Domains (cont.) • Functionally equivalent to Ad Farms • Both rely on ad syndicators for content • Both provide little to no value to their visitors • Unique Characteristics • Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.) • Typically for sale by owner (“Offer To Buy This Domain”)
Advertisements • Pages advertising specific products or services • Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
Problem Description • Web spam detection can also be modeled as a binary text classification problem • Salton’s vector space model is quite common • Feature processing and performance evaluation are also quite similar • But what about feature generation…
How do we generate features? • Sources of information • HTTP connections • Hosting IP addresses • Session headers • HTML content • Textual properties • Structural properties • URL linkage structure • PageRank scores • Neighbor properties
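To illustrate the HTML-content side, here is a sketch of two content-based features often used in web spam detection, visible-text fraction and compressibility; the tag stripping is deliberately crude, and the sample page is invented:

```python
import re
import zlib

def content_features(html: str) -> dict:
    # Strip scripts/styles and tags; a real system would use an HTML parser.
    no_script = re.sub(r"(?s)<(script|style).*?</\1>", "", html)
    visible = re.sub(r"<[^>]*>", "", no_script)
    raw = html.encode()
    return {
        "visible_fraction": len(visible) / max(len(html), 1),
        "compressibility": len(raw) / max(len(zlib.compress(raw)), 1),
    }

page = "<html><body>" + "<p>buy cheap meds</p>" * 50 + "</body></html>"
print(content_features(page))  # repetitive spam text compresses very well
```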
Classification History • Davison (2000) • Was the first to investigate link-based web spam • Built decision trees to successfully identify “nepotistic links” • Becchetti et al. (2005) • Revisited the use of decision trees to identify link-based web spam • Used link-based features such as PageRank and TrustRank scores
Classification History (cont.) • Drost and Scheffer (2005) • Used Support Vector Machines to classify web spam pages • Relied on content-based features as well as link-based features • Ntoulas et al. (2006) • Built decision trees to classify web spam • Used content-based features (e.g., fraction of visible content, compressibility, etc.)
Classification History (cont.) • Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets • Webb et al. (2006) • Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages) • http://www.webbspamcorpus.org • Castillo et al. (2006) • Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)
Classification History (cont.) • Castillo et al. (2007) • Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set • Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)] • Webb et al. (2008) • Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively • Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set • Found that these classifiers are comparable to (and in many cases, better than) existing approaches
Ongoing Research • Redirection • Phishing • Social Spam
Redirection • 144,801 unique redirect chains (averaging 1.54 HTTP redirects per chain) • 43.9% of web spam pages use some form of HTML or JavaScript redirection
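A sketch of heuristics for the HTML/JavaScript redirection just mentioned; the regexes are illustrative, not exhaustive:

```python
import re

META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.I)
JS_REDIRECT = re.compile(
    r'(window\.location|document\.location|location\.href)\s*=', re.I)

def uses_redirection(html: str) -> bool:
    """Flag pages that redirect via meta refresh or script-based location changes."""
    return bool(META_REFRESH.search(html) or JS_REDIRECT.search(html))

page = '<meta http-equiv="refresh" content="0; url=http://example.com/ads">'
print(uses_redirection(page))  # -> True
```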
Phishing • Interesting form of deception that affects email and web users • Another form of adversarial classification
Social Spam • Comment spam • Bulletin spam • Message spam
Conclusions • Email and web spam are currently two of the largest information security problems • Classification techniques offer an effective way to filter this low quality information • Spammers are extremely dynamic, opening up several important areas of future research…