Network-Level Spam Detection Nick Feamster, Georgia Tech
Spam: More than Just a Nuisance • 95% of all email traffic • Image and PDF spam (PDF spam ~12%) • As of August 2007, one in every 87 emails constituted a phishing attack • Targeted attacks on the rise • 20k-30k unique phishing attacks per month Source: CNET (January 2008), APWG
Detection • Goal: Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham • Question: What features best differentiate spam from legitimate mail? • Content-based filtering: What is in the mail? • IP address of sender: Who is the sender? • Behavioral features: How is the mail sent?
Content-Based Detection: Problems • Low cost to evasion: Spammers can easily alter the features of an email’s content • Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc. • High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
Another Approach: IP Addresses • Problem: IP addresses are ephemeral • Every day, 10% of senders are from previously unseen IP addresses • Possible causes • Dynamic addressing • New infections
Idea: Network-Based Detection • Filter email based on how it is sent, in addition to simply what is sent. • Network-level properties are less malleable • Hosting or upstream ISP (AS number) • Membership in a botnet (spammer, hosting infrastructure) • Network location of sender and receiver • Set of target recipients
Behavioral Blacklisting • Idea: Blacklist sending behavior (“behavioral blacklisting”) • Identify sending patterns commonly used by spammers • Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content
Improving Classification • Lower overhead • Faster detection • Better robustness (i.e., to evasion, dynamism) • Use additional features and combine for more robust classification • Temporal: interarrival times, diurnal patterns • Spatial: sending patterns of groups of senders
SNARE: Automated Sender Reputation • Goal: Sender reputation from a single packet? (or at least as little information as possible) • Lower overhead • Faster classification • Less malleable • Key challenge • What features satisfy these properties and can distinguish spammers from legitimate senders?
Sender-Receiver Geodesic Distance 90% of legitimate messages travel 2,200 miles or less
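To make the distance feature concrete, here is a minimal sketch in Python, assuming sender and receiver IPs have already been geolocated to latitude/longitude (the geolocation step and the example coordinates are illustrative, not part of SNARE):

```python
# Hypothetical sketch: great-circle (geodesic) distance between sender and
# receiver, given (lat, lon) coordinates from some prior geolocation step.
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Example: Atlanta -> London is ~4,200 miles, well beyond the 2,200-mile
# threshold that covers 90% of legitimate mail in the SNARE data.
print(haversine_miles(33.75, -84.39, 51.51, -0.13))
```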
Density of Senders in IP Space For spammers, k nearest senders are much closer in IP space
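A hedged sketch of this density feature follows; treating IPv4 addresses as 32-bit integers and averaging the distance to the k nearest previously seen senders is the assumption here, not SNARE's exact formulation:

```python
# Average numeric distance from a sender to its k nearest neighbors among
# recently seen sender IPs. Bots packed into the same subnets score low;
# legitimate senders tend to be more spread out in IP space.
import ipaddress

def knn_ip_distance(sender: str, seen: list[str], k: int = 20) -> float:
    """Mean distance (in IP-integer space) to the k closest seen senders."""
    s = int(ipaddress.IPv4Address(sender))
    dists = sorted(abs(s - int(ipaddress.IPv4Address(ip))) for ip in seen)
    return sum(dists[:k]) / min(k, len(dists))  # assumes `seen` is non-empty
```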
Other Network-Level Features • Time-of-day at sender • Upstream AS of sender • Message size (and variance) • Number of recipients (and variance)
Combining Features • Put features into the RuleFit classifier • 10-fold cross validation on one day of query logs from a large spam filtering appliance provider • Using only network-level features • Completely automated
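A sketch of the feature-combination step is below. SNARE used the RuleFit classifier; scikit-learn's gradient-boosted trees serve as a stand-in here, and the feature matrix and labels are synthetic placeholders for the appliance query logs:

```python
# Combine network-level features in a supervised classifier and evaluate
# with 10-fold cross validation, as described in the slides.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Columns mimic network-level features: geodesic distance, kNN IP distance,
# sender hour-of-day, message size, number of recipients.
X = rng.random((5000, 5))
y = rng.integers(0, 2, size=5000)  # 1 = spam, 0 = ham (placeholder labels)

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
print(f"mean accuracy: {scores.mean():.3f}")
```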
Cluster-Based Features • Construct a behavioral fingerprint for each sender • Cluster senders with similar fingerprints • Filter new senders that map to existing clusters
Identifying Invariants (figure): a known spammer (76.17.114.xxx) and an unknown sender (24.99.146.xxx), e.g., after DHCP reassignment or a new infection, send spam to the same domains (domain1.com, domain2.com, domain3.com); clustering on sending behavior yields similar behavioral fingerprints.
Building the Classifier: Clustering • Feature: Distribution of email sending volumes across recipient domains • Clustering Approach • Build initial seed list of bad IP addresses • For each IP address, compute feature vector: volume per domain per time interval • Collapse into a single IP x domain matrix • Compute clusters
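A simplified sketch of this step appears below, with k-means standing in for SpamTracker's actual clustering algorithm and a synthetic volume matrix as placeholder data:

```python
# Rows are sender IPs, columns are recipient domains, entries are message
# volumes over the training interval; cluster per-IP sending distributions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
volume = rng.poisson(2.0, size=(1000, 50))  # placeholder IP x domain counts
vectors = normalize(volume.astype(float), norm="l1", axis=1)  # per-IP sending distribution

km = KMeans(n_clusters=20, n_init=10).fit(vectors)
labels = km.labels_  # cluster assignment per sender IP
```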
Clustering: Fingerprint • For each cluster, compute a fingerprint vector • New IPs will be compared to this “fingerprint” (figure: IP x IP matrix; intensity indicates pairwise similarity)
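Continuing the sketch above, one plausible fingerprint is each cluster's mean sending distribution, with new senders scored by cosine similarity to the nearest fingerprint (the specific similarity measure is an assumption here):

```python
# Score a new sender against cluster fingerprints; clusters are seeded from
# known-bad IPs, so higher similarity means more spammer-like behavior.
import numpy as np

def build_fingerprints(vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """One fingerprint (mean sending distribution) per cluster."""
    return np.vstack([vectors[labels == c].mean(axis=0)
                      for c in np.unique(labels)])

def spam_score(fingerprints: np.ndarray, new_vector: np.ndarray) -> float:
    """Max cosine similarity between a new IP's sending vector and any
    cluster fingerprint."""
    sims = fingerprints @ new_vector / (
        np.linalg.norm(fingerprints, axis=1) * np.linalg.norm(new_vector) + 1e-12)
    return float(sims.max())
```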
Evaluation • Emulate the performance of a system that could observe sending patterns across many domains • Build clusters/train on given time interval • Evaluate classification • Relative to labeled logs • Relative to IP addresses that were eventually listed
Early Detection Results • Compare SpamTracker scores on “accepted” mail to the SpamHaus database • About 15% of accepted mail was later determined to be spam • Can SpamTracker catch this? • Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month • 65 emails had a score larger than 5 (85th percentile)
Small Samples Work Well Relatively small samples can achieve low false positive rates
Extensions to Phishing • Goal: Detect phishing attacks based on behavioral properties of the hosting site (vs. static properties of the URL) • Features • URL regular expressions • Registration time of domain • Uptime of hosting site • DNS TTL and redirections • Next time: Discussion of phishing detection/integration
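As one concrete feature from this list, a hedged sketch of extracting the DNS TTL of a domain's A record using the dnspython package (the threshold interpretation is illustrative, not from the slides):

```python
# Fetch the TTL of a domain's A record; fast-flux phishing infrastructure
# often uses very low TTLs so the hosting IP can be rotated quickly.
import dns.resolver  # pip install dnspython

def a_record_ttl(domain: str) -> int:
    """TTL in seconds of the domain's A record."""
    answer = dns.resolver.resolve(domain, "A")
    return answer.rrset.ttl

# e.g. a_record_ttl("example.com") -> suspiciously small values
# (tens of seconds) can signal fast-flux hosting.
```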
Integration with SMITE • Sensors • Extract network features from traffic • IP addresses • Combine with auxiliary data (routing, time, etc.) • Algorithms • Clustering algorithm to identify behavioral fingerprints • Learning algorithm to classify based on multiple features • Correlation • Clusters formed by aggregating sending behavior observed across multiple sensors • Various features also require input from data collected across collections of IP addresses
Summary • Spam increasing, spammers becoming agile • Content filters are falling behind • IP-based blacklists are evadable • Up to 30% of spam not listed in common blacklists at receipt; ~20% remains unlisted after a month • Complementary approach: behavioral blacklisting based on network-level features • Blacklist based on how messages are sent • SNARE: Automated sender reputation • ~90% of the accuracy of existing blacklists, using lightweight network-level features • Cluster-based features to improve accuracy and reduce the need for labeled data
Improvements • Accuracy • Synthesizing multiple classifiers • Incorporating user feedback • Learning algorithms with bounded false positives • Performance • Caching/Sharing • Streaming • Security • Learning in adversarial environments