Network-Level Spam and Scam Defenses Nick Feamster, Georgia Tech with Anirudh Ramachandran, Shuang Hao, Maria Konte, Alex Gray, Jaeyeon Jung, Santosh Vempala
Spam: More than Just a Nuisance • 95% of all email traffic • Image and PDF Spam (PDF spam ~12%) • As of August 2007, one in every 87 emails constituted a phishing attack • Targeted attacks on the rise • 20k-30k unique phishing attacks per month Source: CNET (January 2008), APWG
Approach: Filter • Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham • Question: What features best differentiate spam from legitimate mail? • Content-based filtering: What is in the mail? • IP address of sender: Who is the sender? • Behavioral features: How is the mail sent?
Content Filters: Chasing a Moving Target Images PDFs Excel sheets ...and even mp3s!
Problems with Content Filtering • Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc. • Low cost to evasion: Spammers can easily alter and adjust the features of an email’s content • High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
Another Approach: IP Addresses • Problem: IP addresses are ephemeral • Every day, 10% of senders are from previously unseen IP addresses • Possible causes • Dynamic addressing • New infections
Our Idea: Network-Based Filtering • Filter email based on how it is sent, in addition to simply what is sent. • Network-level properties are less malleable • Network/geographic location of sender and receiver • Set of target recipients • Hosting or upstream ISP (AS number) • Membership in a botnet (spammer, hosting infrastructure)
Why Network-Level Features? • Lightweight: Don’t require inspecting details of packet streams • Can be done at high speeds • Can be done in the middle of the network • Robust: Perhaps more difficult to change some network-level features than message contents
Challenges • Understanding network-level behavior • What network-level behaviors do spammers have? • How well do existing techniques work? • Building classifiers using network-level features • Key challenge: Which features to use? • Two Algorithms: SNARE and SpamTracker • Building the system • Dynamism: Behavior itself can change • Scale: Lots of email messages (and spam!) out there • Applications to phishing and scams
Data: Spam and BGP • Spam Traps: Domains that receive only spam • BGP Monitors: Watch network-level reachability Domain 1 Domain 2 17-Month Study: August 2004 to December 2005
Finding: BGP “Spectrum Agility” • Hijack IP address space using BGP • Send spam (often for only ~10 minutes) • Withdraw the IP address A small club of persistent players appears to be using this technique. Common short-lived prefixes and ASes: 61.0.0.0/8 (AS 4678), 66.0.0.0/8 (AS 21562), 82.0.0.0/8 (AS 8717) Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
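The spectrum-agility pattern can be flagged mechanically from a BGP update feed: look for prefixes that are announced and withdrawn within minutes. A minimal sketch, assuming a hypothetical pre-parsed update log of (timestamp, action, prefix) tuples (not the actual format of the study's BGP monitors):

```python
from collections import defaultdict

def short_lived_prefixes(updates, max_lifetime=600):
    """Find prefixes announced and withdrawn within max_lifetime seconds.

    updates: iterable of (timestamp_sec, action, prefix) tuples, where
    action is 'A' (announce) or 'W' (withdraw), sorted by timestamp.
    Returns {prefix: [lifetime_sec, ...]} for each short-lived episode.
    """
    announced_at = {}
    episodes = defaultdict(list)
    for ts, action, prefix in updates:
        if action == 'A':
            announced_at.setdefault(prefix, ts)   # keep earliest announce
        elif action == 'W' and prefix in announced_at:
            lifetime = ts - announced_at.pop(prefix)
            if lifetime <= max_lifetime:
                episodes[prefix].append(lifetime)
    return dict(episodes)

# Example: a /8 announced, used for ~10 minutes, then withdrawn
log = [
    (0,   'A', '61.0.0.0/8'),
    (580, 'W', '61.0.0.0/8'),    # 580 s lifetime -> flagged
    (100, 'A', '8.8.8.0/24'),    # never withdrawn -> not flagged
]
print(short_lived_prefixes(sorted(log)))
```

In practice one would also correlate the flagged windows with spam arrival times at the trap domains, as the study does.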
Spectrum Agility: Big Prefixes? • Flexibility:Client IPs can be scattered throughout dark space within a large /8 • Same sender usually returns with different IP addresses • Visibility: Route typically won’t be filtered (nice and short)
How Well do IP Blacklists Work? • Completeness: The fraction of spamming IP addresses that are listed in the blacklist • Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
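Both metrics can be computed directly from spam-trap and blacklist data. A rough sketch, assuming hypothetical per-IP first-spam and first-listing timestamps (the study's actual data joins trap logs against Spamhaus snapshots):

```python
def blacklist_metrics(spam_events, listing_times):
    """Completeness: fraction of spamming IPs listed at receipt time.
    Responsiveness: delay from first spam to listing, per late-listed IP.

    spam_events: {ip: first_spam_time}; listing_times: {ip: listed_time}
    (both in seconds; an IP absent from listing_times was never listed).
    """
    listed_at_receipt = sum(
        1 for ip, t in spam_events.items()
        if ip in listing_times and listing_times[ip] <= t
    )
    completeness = listed_at_receipt / len(spam_events)
    responsiveness = {
        ip: listing_times[ip] - t
        for ip, t in spam_events.items()
        if ip in listing_times and listing_times[ip] > t
    }
    return completeness, responsiveness

spam = {'1.2.3.4': 100, '5.6.7.8': 100, '9.9.9.9': 100}
listed = {'1.2.3.4': 50, '5.6.7.8': 400}   # 9.9.9.9 never listed
c, r = blacklist_metrics(spam, listed)
print(c, r)   # one of three IPs listed at receipt; 5.6.7.8 listed 300 s late
```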
Completeness and Responsiveness • 10-35% of spam is unlisted at the time of receipt • 8.5-20% of these IP addresses remain unlisted even after one month Data: Trap data from March 2007, Spamhaus from March and April 2007
Problems with IP Blacklists • IP addresses of senders have considerable churn • Based on ephemeral identifier (IP address) • More than 10% of all spam comes from IP addresses not seen within the past two months • Dynamic renumbering of IP addresses • Stealing of IP addresses and IP address space • Compromised machines • Often require a human to notice/validate the behavior • Spamming is compartmentalized by domain and not analyzed across domains
Outline • Understanding the network-level behavior • What behaviors do spammers have? • How well do existing techniques work? • Classifiers using network-level features • Key challenge: Which features to use? • Two algorithms: SNARE and SpamTracker • System: SpamSpotter • Dynamism: Behavior itself can change • Scale: Lots of email messages (and spam!) out there • Application to phishing and scams
Finding the Right Features • Goal: Sender reputation from a single packet? • Low overhead • Fast classification • In-network • Perhaps more evasion resistant • Key challenge • What features satisfy these properties and can distinguish spammers from legitimate senders?
Set of Network-Level Features • Single-Packet • Geodesic distance • Distance to k nearest senders • Time of day • AS of sender’s IP • Status of email service ports • Single-Message • Number of recipients • Length of message • Aggregate (Multiple Message/Recipient)
Sender-Receiver Geodesic Distance 90% of legitimate messages travel 2,200 miles or less
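The geodesic-distance feature falls out of geolocating both IP endpoints and applying the standard haversine formula. This sketch hard-codes illustrative coordinates rather than doing a real IP-geolocation lookup:

```python
from math import radians, sin, cos, asin, sqrt

def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))   # mean Earth radius ~3959 miles

# Atlanta -> New York: well under the 2,200-mile legitimate-mail mark
d = geodesic_miles(33.75, -84.39, 40.71, -74.01)
print(round(d))
```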
Density of Senders in IP Space For spammers, k nearest senders are much closer in IP space
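The nearest-senders density feature can be approximated by treating IPv4 addresses as 32-bit integers and averaging the distance to the k closest previously seen senders; the exact metric SNARE uses may differ, so treat this as a sketch:

```python
import ipaddress

def knn_ip_distance(sender_ip, recent_senders, k=3):
    """Average numeric distance from sender_ip to its k nearest
    previously seen sender IPs, treating IPv4 addresses as 32-bit ints.
    Bots in the same infected netblock yield small distances."""
    x = int(ipaddress.ip_address(sender_ip))
    dists = sorted(abs(x - int(ipaddress.ip_address(s)))
                   for s in recent_senders)
    return sum(dists[:k]) / k

recent = ['10.0.0.5', '10.0.0.9', '10.0.1.2', '172.16.0.1']
print(knn_ip_distance('10.0.0.7', recent))   # small: dense neighborhood
```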
Local Time of Day at Sender Spammers “peak” at different local times of day
Combining Features: RuleFit • Put features into the RuleFit classifier • 10-fold cross validation on one day of query logs from a large spam filtering appliance provider • Comparable performance to Spamhaus • Incorporating into the system can further reduce FPs • Using only network-level features • Completely automated
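RuleFit itself is a specific rule-ensemble learner; as a stand-in, the sketch below shows only the 10-fold cross-validation pipeline, with a toy single-threshold classifier over one network-level feature and purely synthetic data:

```python
import random

def ten_fold_cv(examples, train_fn, predict_fn, folds=10):
    """10-fold cross-validation accuracy for any train/predict pair.
    examples: list of (feature_dict, label) pairs."""
    random.Random(0).shuffle(examples)
    correct = 0
    for i in range(folds):
        test = examples[i::folds]                    # indices j % folds == i
        train = [e for j, e in enumerate(examples) if j % folds != i]
        model = train_fn(train)
        correct += sum(predict_fn(model, x) == y for x, y in test)
    return correct / len(examples)

# Toy stand-in for RuleFit: one threshold rule on geodesic distance
def train(rows):
    spam = [x['distance'] for x, y in rows if y == 1]
    ham = [x['distance'] for x, y in rows if y == 0]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2  # midpoint

def predict(threshold, x):
    return 1 if x['distance'] > threshold else 0

data = [({'distance': d}, 0) for d in range(0, 2000, 100)] + \
       [({'distance': d}, 1) for d in range(3000, 5000, 100)]
acc = ten_fold_cv(data, train, predict)
print(acc)   # separable synthetic data -> perfect accuracy
```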
Benefits of Whitelisting Whitelisting top 50 ASes: False positives reduced to 0.14%
Outline • Understanding the network-level behavior • What behaviors do spammers have? • How well do existing techniques work? • Building classifiers using network-level features • Key challenge: Which features to use? • Algorithms: SpamTracker and SNARE • System (SpamSpotter) • Dynamism: Behavior itself can change • Scale: Lots of email messages (and spam!) out there
Deployment: Real-Time Blacklist • As mail arrives, lookups received at BL • Queries provide proxy for sending behavior • Train based on received data • Return score Approach
Design Choice: Augment DNSBL • Expressive queries • Spamhaus: $ dig 55.102.90.62.zen.spamhaus.org • Ans: 127.0.0.3 (=> listed in exploits block list) • SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net • Ans: 127.1.3.97 (SpamSpotter score = -3.97) • Also a source of data • Unsupervised algorithms work with unlabeled data
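The score carried in the answer record can be unpacked client-side. This sketch infers the byte layout from the single example above (127.1.3.97 => -3.97), so the exact encoding is an assumption:

```python
def decode_spamspotter(answer):
    """Decode a hypothetical SpamSpotter A-record answer 127.s.i.f into a
    score: sign byte s (1 = negative), integer part i, fractional part f
    in hundredths. Layout inferred from one example: 127.1.3.97 -> -3.97.
    """
    prefix, sign, integer, frac = (int(o) for o in answer.split('.'))
    assert prefix == 127, 'DNSBL answers live in 127/8'
    score = integer + frac / 100
    return -score if sign == 1 else score

print(decode_spamspotter('127.1.3.97'))   # -3.97
```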
Challenges • Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead? • Dynamism: When to retrain the classifier, given that sender behavior changes? • Reliability: How should the system be replicated to better defend against attack or failure? • Evasion resistance: Can the system still detect spammers when they are actively trying to evade?
Latency Performance overhead is negligible.
Sampling Relatively small samples can achieve low false positive rates
Possible Improvements • Accuracy • Synthesizing multiple classifiers • Incorporating user feedback • Learning algorithms with bounded false positives • Performance • Caching/Sharing • Streaming • Security • Learning in adversarial environments
Spam Filtering: Summary • Spam is increasing, and spammers are becoming agile • Content filters are falling behind • IP-based blacklists are evadable • Up to 30% of spam is not listed in common blacklists at receipt; ~20% remains unlisted after a month • Complementary approach: behavioral blacklisting based on network-level features • Key idea: Blacklist based on how messages are sent • SNARE: Automated sender reputation • ~90% of the accuracy of existing blacklists, using only lightweight features • SpamSpotter: Putting it together in an RBL system • SpamTracker: Spectral clustering • Catches significant amounts of spam faster than existing blacklists
Phishing and Scams • Scammers host Web sites on dynamic scam hosting infrastructure • Use DNS to redirect users to different sites when the sites’ hosting locations move • State of the art: Blacklist URLs • Our approach: Blacklist based on network-level fingerprints Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009
Online Scams • Often advertised in spam messages • URLs point to various point-of-sale sites • These scams continue to be a menace • As of August 2007, one in every 87 emails constituted a phishing attack • Scams often hosted on bullet-proof domains • Goal: Study the dynamics of online scams, as seen at a large spam sinkhole
Online Scam Hosting is Dynamic • The sites that a URL received in an email message points to may change over time • Maintains agility as sites are shut down, blacklisted, etc. • One mechanism for hosting sites: fast flux
Mechanism for Dynamics: “Fast Flux” Source: HoneyNet Project
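Fast flux can be surfaced by polling a domain's A records and measuring churn. A sketch over pre-collected resolution snapshots (no live DNS), using two simple heuristic indicators rather than the paper's full methodology:

```python
def flux_score(snapshots):
    """Heuristic fast-flux indicators from repeated DNS A-record lookups.
    snapshots: list of frozensets of IPs, one per polling interval.
    Returns (total distinct IPs, fraction of intervals where the record
    set changed). Fast-flux domains show many distinct IPs and
    near-constant churn; legitimate round-robin reuses a small pool."""
    distinct = set().union(*snapshots)
    changes = sum(a != b for a, b in zip(snapshots, snapshots[1:]))
    return len(distinct), changes / (len(snapshots) - 1)

polls = [frozenset({'1.1.1.1', '2.2.2.2'}),
         frozenset({'3.3.3.3', '4.4.4.4'}),
         frozenset({'5.5.5.5', '6.6.6.6'})]
print(flux_score(polls))   # all-new IPs every poll -> flux-like
```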
Summary of Findings • What are the rates and extents of change? • Differ from legitimate load balancing • Differ across different scam campaigns • How are dynamics implemented? • Many scam campaigns change DNS mappings at all three locations in the DNS hierarchy • A record, NS record, and IP address of NS record • Conclusion: Might be able to detect based on monitoring the dynamic behavior of URLs
Data Collection Method • Three months of spamtrap data • 384 scam hosting domains • 21 unique scam campaigns • Baseline comparison: Alexa “top 500” Web sites
Top 3 Spam Campaigns • Some campaigns hosted by thousands of IPs • Most scam domains exhibit some type of flux • Sharing of IP addresses across different roles (authoritative NS and scam hosting)
Rates of Change • How (and how quickly) do DNS-record mappings change? • Rates of change are much faster than for legitimate load-balanced sites. • Scam domains change on shorter intervals than their TTL values. • Domains for different scam campaigns exhibit different rates of change.
Rates of Change Rates of change are much faster than for legitimate load-balanced sites. • Domains that exhibit fast flux change more rapidly than legitimate domains • Rates of change are inconsistent with actual TTL values
Time Between Record Changes Fast-flux domains tend to change much more frequently than legitimately hosted sites
Rates of Change by Campaign Domains for different scam campaigns exhibit different rates of change.
Rates of Accumulation • How quickly do scams accumulate new IP addresses? • Rates of accumulation differ across campaigns • Some scams only begin accumulating IP addresses after some time
Location • Where in IP address space do hosts for scam sites operate? • Scam networks use a different portion of the IP address space than legitimate sites • 30/8 – 60/8 --- lots of legitimate sites, no scam sites • Sites that host scam domains (both sites and authoritative DNS) are more widely distributed than those for legitimate sites
Location: Many Distinct Subnets Scam sites appear in many more distinct networks than legitimate load-balanced sites.
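The distinct-subnet observation suggests a simple feature: count the distinct /24s covering a domain's hosting IPs. A sketch with illustrative (hypothetical) addresses:

```python
import ipaddress

def distinct_subnets(ips, prefix_len=24):
    """Count distinct /prefix_len networks covering a set of host IPs.
    Scam hosting (compromised machines scattered everywhere) spans many
    subnets; a legitimate load-balanced site clusters in a few."""
    return len({ipaddress.ip_network(f'{ip}/{prefix_len}', strict=False)
                for ip in ips})

scam = ['5.1.2.3', '91.7.7.7', '200.9.1.1', '5.1.9.9']
cdn = ['151.101.1.1', '151.101.1.2', '151.101.1.3']
print(distinct_subnets(scam), distinct_subnets(cdn))   # 4 vs 1
```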
Conclusion • Scam campaigns rely on a dynamic hosting infrastructure • Studying the dynamics of that infrastructure may help us develop better detection methods • Dynamics • Rates of change differ from legitimate sites, and differ across campaigns • Dynamics implemented at all levels of DNS hierarchy • Location • Scam sites distributed across distinct subnets Data: http://www.gtnoise.net/scam/fast-flux.html TR: http://www.cc.gatech.edu/research/reports/GT-CS-08-07.pdf
References • Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, August 2006 • Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, November 2007 • Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009 • Maria Konte, Nick Feamster, Jaeyeon Jung, “Dynamics of Online Scam Hosting Infrastructure”, Passive and Active Measurement, April 2009