Network-Level Spam Detection Nick Feamster, Georgia Tech
Spam: More than Just a Nuisance • 95% of all email traffic • Image and PDF spam (PDF spam ~12%) • As of August 2007, one in every 87 emails constituted a phishing attack • Targeted attacks on the rise • 20k-30k unique phishing attacks per month Source: CNET (January 2008), APWG
Detection • Goal: Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham • Question: What features best differentiate spam from legitimate mail? • Content-based filtering: What is in the mail? • IP address of sender: Who is the sender? • Behavioral features: How is the mail sent?
Content-Based Detection: Problems • Low cost to evasion: Spammers can easily alter the features of an email’s content • Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc. • High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
Another Approach: IP Addresses • Problem: IP addresses are ephemeral • Every day, 10% of senders are from previously unseen IP addresses • Possible causes • Dynamic addressing • New infections
Idea: Network-Based Detection • Filter email based on how it is sent, in addition to simply what is sent. • Network-level properties are less malleable • Hosting or upstream ISP (AS number) • Membership in a botnet (spammer, hosting infrastructure) • Network location of sender and receiver • Set of target recipients
Behavioral Blacklisting • Idea: Blacklist sending behavior (“behavioral blacklisting”) • Identify sending patterns commonly used by spammers • Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content
Improving Classification • Lower overhead • Faster detection • Better robustness (i.e., to evasion, dynamism) • Use additional features and combine for more robust classification • Temporal: interarrival times, diurnal patterns • Spatial: sending patterns of groups of senders
SNARE: Automated Sender Reputation • Goal: Sender reputation from a single packet? (or at least as little information as possible) • Lower overhead • Faster classification • Less malleable • Key challenge • What features satisfy these properties and can distinguish spammers from legitimate senders?
Sender-Receiver Geodesic Distance 90% of legitimate messages travel 2,200 miles or less
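To make the distance feature concrete, here is a minimal sketch in Python, assuming sender and receiver IPs have already been geolocated to latitude/longitude (the geolocation step and the example coordinates are illustrative, not part of SNARE):

```python
# Hypothetical sketch: great-circle (geodesic) distance between sender and
# receiver, given (lat, lon) coordinates from some prior geolocation step.
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Example: Atlanta -> London is ~4,200 miles, well beyond the 2,200-mile
# threshold that covers 90% of legitimate mail in the SNARE data.
print(haversine_miles(33.75, -84.39, 51.51, -0.13))
```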
Density of Senders in IP Space For spammers, k nearest senders are much closer in IP space
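A hedged sketch of this density feature follows; treating IPv4 addresses as 32-bit integers and averaging the distance to the k nearest previously seen senders is the assumption here, not SNARE's exact formulation:

```python
# Average numeric distance from a sender to its k nearest neighbors among
# recently seen sender IPs. Bots packed into the same subnets score low;
# legitimate senders tend to be more spread out in IP space.
import ipaddress

def knn_ip_distance(sender: str, seen: list[str], k: int = 20) -> float:
    """Mean distance (in IP-integer space) to the k closest seen senders."""
    s = int(ipaddress.IPv4Address(sender))
    dists = sorted(abs(s - int(ipaddress.IPv4Address(ip))) for ip in seen)
    return sum(dists[:k]) / min(k, len(dists))  # assumes `seen` is non-empty
```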
Other Network-Level Features • Time-of-day at sender • Upstream AS of sender • Message size (and variance) • Number of recipients (and variance)
Combining Features • Put features into the RuleFit classifier • 10-fold cross validation on one day of query logs from a large spam filtering appliance provider • Using only network-level features • Completely automated
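A sketch of the feature-combination step is below. SNARE used the RuleFit classifier; scikit-learn's gradient-boosted trees serve as a stand-in here, and the feature matrix and labels are synthetic placeholders for the appliance query logs:

```python
# Combine network-level features in a supervised classifier and evaluate
# with 10-fold cross validation, as described in the slides.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Columns mimic network-level features: geodesic distance, kNN IP distance,
# sender hour-of-day, message size, number of recipients.
X = rng.random((5000, 5))
y = rng.integers(0, 2, size=5000)  # 1 = spam, 0 = ham (placeholder labels)

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
print(f"mean accuracy: {scores.mean():.3f}")
```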
Cluster-Based Features • Construct a behavioral fingerprint for each sender • Cluster senders with similar fingerprints • Filter new senders that map to existing clusters
Identifying Invariants (figure): a known spammer (76.17.114.xxx) and an unknown sender (24.99.146.xxx), e.g., after DHCP reassignment or a new infection, send spam to the same domains (domain1.com, domain2.com, domain3.com); clustering on sending behavior yields similar behavioral fingerprints.
Building the Classifier: Clustering • Feature: Distribution of email sending volumes across recipient domains • Clustering Approach • Build initial seed list of bad IP addresses • For each IP address, compute feature vector: volume per domain per time interval • Collapse into a single IP x domain matrix • Compute clusters
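A simplified sketch of this step appears below, with k-means standing in for SpamTracker's actual clustering algorithm and a synthetic volume matrix as placeholder data:

```python
# Rows are sender IPs, columns are recipient domains, entries are message
# volumes over the training interval; cluster per-IP sending distributions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
volume = rng.poisson(2.0, size=(1000, 50))  # placeholder IP x domain counts
vectors = normalize(volume.astype(float), norm="l1", axis=1)  # per-IP sending distribution

km = KMeans(n_clusters=20, n_init=10).fit(vectors)
labels = km.labels_  # cluster assignment per sender IP
```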
Clustering: Fingerprint • For each cluster, compute a fingerprint vector • New IPs will be compared to this “fingerprint” (figure: IP x IP matrix; intensity indicates pairwise similarity)
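Continuing the sketch above, one plausible fingerprint is each cluster's mean sending distribution, with new senders scored by cosine similarity to the nearest fingerprint (the specific similarity measure is an assumption here):

```python
# Score a new sender against cluster fingerprints; clusters are seeded from
# known-bad IPs, so higher similarity means more spammer-like behavior.
import numpy as np

def build_fingerprints(vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """One fingerprint (mean sending distribution) per cluster."""
    return np.vstack([vectors[labels == c].mean(axis=0)
                      for c in np.unique(labels)])

def spam_score(fingerprints: np.ndarray, new_vector: np.ndarray) -> float:
    """Max cosine similarity between a new IP's sending vector and any
    cluster fingerprint."""
    sims = fingerprints @ new_vector / (
        np.linalg.norm(fingerprints, axis=1) * np.linalg.norm(new_vector) + 1e-12)
    return float(sims.max())
```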
Evaluation • Emulate the performance of a system that could observe sending patterns across many domains • Build clusters/train on given time interval • Evaluate classification • Relative to labeled logs • Relative to IP addresses that were eventually listed
Early Detection Results • Compare SpamTracker scores on “accepted” mail to the SpamHaus database • About 15% of accepted mail was later determined to be spam • Can SpamTracker catch this? • Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month • 65 emails had a score larger than 5 (85th percentile)
Small Samples Work Well Relatively small samples can achieve low false positive rates
Extensions to Phishing • Goal: Detect phishing attacks based on behavioral properties of the hosting site (vs. static properties of the URL) • Features • URL regular expressions • Registration time of domain • Uptime of hosting site • DNS TTL and redirections • Next time: Discussion of phishing detection/integration
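As one concrete feature from this list, a hedged sketch of extracting the DNS TTL of a domain's A record using the dnspython package (the threshold interpretation is illustrative, not from the slides):

```python
# Fetch the TTL of a domain's A record; fast-flux phishing infrastructure
# often uses very low TTLs so the hosting IP can be rotated quickly.
import dns.resolver  # pip install dnspython

def a_record_ttl(domain: str) -> int:
    """TTL in seconds of the domain's A record."""
    answer = dns.resolver.resolve(domain, "A")
    return answer.rrset.ttl

# e.g. a_record_ttl("example.com") -> suspiciously small values
# (tens of seconds) can signal fast-flux hosting.
```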
Integration with SMITE • Sensors • Extract network features from traffic • IP addresses • Combine with auxiliary data (routing, time, etc.) • Algorithms • Clustering algorithm to identify behavioral fingerprints • Learning algorithm to classify based on multiple features • Correlation • Clusters formed by aggregating sending behavior observed across multiple sensors • Various features also require input from data collected across collections of IP addresses
Summary • Spam increasing, spammers becoming agile • Content filters are falling behind • IP-based blacklists are evadable • Up to 30% of spam not listed in common blacklists at receipt; ~20% remains unlisted after a month • Complementary approach: behavioral blacklisting based on network-level features • Blacklist based on how messages are sent • SNARE: Automated sender reputation • ~90% of the accuracy of existing blacklists, using lightweight network-level features • Cluster-based features to improve accuracy and reduce the need for labeled data
Improvements • Accuracy • Synthesizing multiple classifiers • Incorporating user feedback • Learning algorithms with bounded false positives • Performance • Caching/Sharing • Streaming • Security • Learning in adversarial environments