Spam and Botnets: Characterization and Mitigation

Spam and Botnets:Characterization and Mitigation Nick FeamsterAnirudh RamachandranDavid DagonGeorgia Tech

Talk Overview • Network-level behavior of spammers • Ultimate goal: Construct spam filters based on network-level properties, rather than content • Content-based properties are malleable • Low cost to evasion:Spammers can easily alter content • High admin cost: Filters must be continually updated • Content-based filters are applied at the destination • Too little, too late:Wasted network bandwidth, storage, etc. • Study of DNS-based blacklists • “Discovery”:One of the mosttelling network-level properties is botnet membership • DNSBL Counter-Intelligence • Network monitoring

Network-Level Behavior of Spammers: Major Findings • Where does spam come from? • Most received from few regions of IP address space • Insight about spammier prefixes could improve filters • Do spammers hijack routes? • A small set of spammers continually advertise short-lived routes • Traceability is not guaranteed • How is spam sent? • Most coming from Windows hosts (likely, bots) • Identification of spamming groups (e.g., botnets) could help

Data Collection • Two domains instrumented with MailAvenger (both on same network) • Sinkhole domain #1 • Continuous spam collection since Aug 2004 • No real email addresses---sink everything • 10 million+ pieces of spam • Legitimate mail corpus from a large email provider (40 million inboxes) • Monitoring BGP route advertisements from same network

Mail Collection: MailAvenger • Highly configurable SMTP server that collects many useful statistics

BGP Data Collection MX 1 MX 2

Spam Study: Major Findings • Where does spam come from? • Most received from few regions of IP address space • Insight about spammier prefixes could improve filters • Do spammers hijack routes? • A small set of spammers continually advertise short-lived routes • Traceability is not guaranteed • How is spam sent? • Most coming from Windows hosts (likely, bots) • identification of spamming groups (e.g., botnets) could help

What IP ranges does spam come from? Spam comes from a few concentrated regions of IP address space Fraction /24 prefix

Distribution across ASes Top 10 ASes by Spam Count Top 10 ASes by Legit Email Count • Top two spamming ASes: 10% of received spam • ASes in the US: most spam • Top ASes for legitimate email are different Points to note

Spam Study: Major Findings • Where does spam come from? • Most received from few regions of IP address space • Insight about spammier prefixes could improve filters • Do spammers hijack routes? • A small set of spammers continually advertise short-lived routes • Traceability is not guaranteed • How is spam sent? • Most coming from Windows hosts (likely, bots) • Indentification of spamming groups (e.g., botnets) could help

BGP Spectrum Agility • Log IP addresses of SMTP relays • Join with BGP route advertisements seen at network where spam trap is co-located. A small club of persistent players appears to be using this technique. Common short-lived prefixes and ASes 61.0.0.0/8 4678 66.0.0.0/8 21562 82.0.0.0/8 8717 ~ 10 minutes Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)

Why Such Big Prefixes? • Flexibility:Client IPs can be scattered throughout dark space within a large /8 • Same sender usually returns with different IP addresses • Visibility: Route typically won’t be filtered (nice and short)

Spam Study: Major Findings • Where does spam come from? • Most received from few regions of IP address space • Insight about spammier prefixes could improve filters • Do spammers hijack routes? • A small set of spammers continually advertise short-lived routes • Traceability is not guaranteed • How is spam sent? • Most coming from Windows hosts (likely, bots) • Identification of spamming groups (e.g., botnets) could help

Characteristics of spamming bots • Distribution across IP space for bots • Similar to IP space distribution for all spam • Lower bot activity in ranges where spam also comes from hijacked routes • Operating Systems of Spamming Hosts • ~ 95% run Windows • The 4% Unix-based hosts send up to 8% spam

Most Bots Send Low Volumes of Spam Most bot IP addresses send very little spam, regardless of how long they have been spamming… Amount of Spam 99% of bots Lifetime (seconds)

Most Bot IP addresses are quiet Percentage of bots 65% of bots only send mail to a domain once over 18 months Lifetime (seconds) Blacklists may want to target IP ranges, rather than individual IPs

Take-Away Lessons Network-Level Spam Filtering • Network-level properties are less malleable, and are observable closer to the sourceof spam • Aggregate properties(e.g., IP prefix, ASN, route used etc.) may be more effective • Some network-level properties can be incorporated into spam filters • could be used as a first-pass filter Redefining End-Host Identifiers • Spam filtering requires a better notion of end-host identity • Securing the Internet routing infrastructureis key to traceabilty

DNS-based Blacklists (DNSBLs) The most prevalent network-level spam filtering mechanism today Various criteria: open relays/proxies, virus senders, bad/unused address spaces etc. Hundreds of DNSBLs of all sizes How to measure the effectiveness of DNSBLs? Completeness Responsiveness What about DNS-Based Blacklists?

Questions about DNSBLs • What is the completenessof the DNSBL? • What is the responsivenessof the DNSBL? • How many distinct domainsare targeted by a spamming host before it is blacklisted? • Does frequency of spam from a host change after it is blacklisted?

Blacklisting: Completeness ~95% of bots listed in one or more blacklists Fraction of all spam received ~80% listed on average Only about half of the IPs spamming from short-lived BGP are listed in any blacklist Number of DNSBLs listing this spammer Spam from IP-agile senders tend to be listed in fewer blacklists

Are IP-Based Blacklists Enough? • Mail Avenger is very aggressive • Eight different blacklists • Cloaking techniques complicate detection • For example, what if a bot could change IP addresses and remain reachable? • LAN agility • BGP agility

A Model of Responsiveness • Response Time • Difficult to calculate without “ground truth” • Can still estimate lower bound Possible Detection Opportunity Infection Time S-Day RBL Listing Response Time Lifecycleof a spamming host

Measuring Responsiveness • Data • 1.5 days worth of packet captures of DNSBL queries from a mirror of Spamhaus • 46 days of pcaps from a hijacked C&C for a Bobax botnet; overlaps with DNSBL queries • Method • Monitor DNSBL for lookups for known Bobax hosts • Look for first query • Look for the first time a query response had a ‘listed’ status

Responsiveness: Preliminary Results • Observed 81,950 DNSBL queries for 4,295 (out of over 2 million) Bobax IPs • Only 255 (6%) Bobax IPs were blacklisted through the end of the Bobax trace (46 days) • 88 IPs became listed during the 1.5 day DNSBL trace • 34 of these were listed after a single detection opportunity Both responsiveness and completeness appear to be low.Much room for improvement.

Domains Performing Lookups • Over 60% are queried by just one IP/AS • Hypothesis: Decreased chances of being reported

So…What can be Done? • Network-level behavior of spammers • Ultimate goal: Construct spam filters based on network-level properties, rather than content • Content-based properties are malleable • Low cost to evasion: Spammers can easily alter content • High admin cost: Filters must be continually updated • Content-based filters are applied at the destination • Too little, too late: Wasted network bandwidth, storage, etc. • Study of DNS-based blacklists • “Discovery”:One of the mosttelling network-level properties is botnet membership • DNSBL Counter-Intelligence • Network monitoring

Mitigation #1: Counter-Intelligence • Botmasters advertise spamming bots for which bots are not listed in any blacklist. • Insight:Someone must be looking up the bots! • Can we fish out these DNSBL “reconnaissance” queries and identify subjects/targets as suspect?

Legitimate queriers are also the targets of queries Reconnaissance queriers are ususally not queried themselves Legit Queries vs. Reconnaissance DNS-BasedBlacklist DNS-BasedBlacklist lookupmx.b.com lookupmx.a.com Legit Mail Server Amx.a.com Legit Mail Server Bmx.b.com email to mx.b.com email to mx.a.com Reconnaissance host

Measurement Approach • Log Spamhaus queries • Construct querier/queried graph • Prune graph: only nodes in the Bobax trace • Examine nodes with high out-degree • Hypothesis: targets of nodes with high out-degree likely bots

Who’s Doing the Lookups? • The botmaster, on behalf of the bots • The bots, on behalf of themselves • The bots, on behalf of each other Known bobax drone! Spam Sinkhole Implication: Use a “seed” set to bootstrap?

Some Problems with Counter-Intel • Constructing the query graph is intensive • Computationally • Storage-wise • Initially pruning the graph with IP addresses of known suspects (e.g., spammers) could help

Mitigation: Network Monitoring • In-network filtering • Requires the ability to detect botnets • Question: Can we detect botnets by observing communication structure among hosts? Example: Migration between command and control hosts New type of problem: essentially coupon collectionHow good are current traffic sampling techniques at exposing these patterns?

Experimental Setup

(Preliminary) Results Feasible sampling rates Conventional sampling techniques are not well-suited to collecting conversations

Summary Lessons • Network-level spam filtering holds promise • Potentially a useful complement to content-based filters • Today’s DNSBLs aren’t doing the tricks • Two critical pieces • Monitoring techniques (which might be used together) • In-network (e.g., with better traffic monitoring techniques) • At the edge (e.g., DNSBL reconnaissance) • Routing security • “Clean-slate” wish list • Better notions of identity • More agile monitoring/sampling techniques

Spam and Botnets: Characterization and Mitigation