Spamming Botnets: Signatures and Characteristics

Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, and Ivan Osipkov. SIGCOMM, 2008. Presented by: Arnold Perez

Outline • Introduction • Goals • AutoRE • Challenges • Design • Results • Botnet characteristics • Contributions • Weaknesses

Introduction • Botnets are commonly used for profit • Botnets rented out to spammers • Botnets can send spam emails at a large scale • Can transmit thousands of emails in a short duration • Difficult to detect and blacklist individual bots

Goals • Understand the behaviors of botnets from the perspective of large email servers that are popular targets • Identify botnet characteristics and trends • Track sending behavior and content patterns • Develop a framework (AutoRE) that identifies botnet hosts by generating botnet spam signatures from emails

AutoRE • Motivated by the recent success of signature-based worm and virus detection systems • Botnet spam emails are often sent in an aggregate fashion, resulting in content prevalence similar to worm propagation • Focuses primarily on URLs embedded in the email

AutoRE Challenges • Spammers often add random, legitimate URLs to content in order to increase the perceived legitimacy of emails

AutoRE Challenges • Spammers use URL obfuscation techniques to evade detection

AutoRE Design • Input • Set of unlabeled email messages • Output • Set of spam URL signatures • Complete URL string • URL regular expression • List of botnet host IP addresses

AutoRE Design • Composed of three modules • URL preprocessor • Extracts URLs and other relevant fields and groups them according to web domain • Group selector • Selects URL groups with the highest degree of burstiness in sending times • RegEx generator • Extracts signatures by processing one group at a time

URL Pre-Processing • Extracts • URL string • Source server IP address • Email sending time • Partitions URLs into groups based on web domains • Assumption: emails from the same spam campaign advertise the same product or service from the same domain
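The pre-processing step can be illustrated with a minimal Python sketch. It is not the paper's implementation: the email dictionary keys (`body`, `source_ip`, `sent_at`), the simple URL pattern, and the use of the full hostname as the "web domain" are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Simplified URL matcher (assumption): real email parsing is more involved.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")

def group_urls_by_domain(emails):
    """Group (url, source_ip, sent_at) records by the URL's hostname.

    `emails` is an iterable of dicts with the hypothetical keys
    'body', 'source_ip', and 'sent_at'.
    """
    groups = defaultdict(list)
    for email in emails:
        for url in URL_PATTERN.findall(email["body"]):
            domain = urlparse(url).hostname
            if domain is None:
                continue  # skip strings that do not parse to a host
            groups[domain].append((url, email["source_ip"], email["sent_at"]))
    return dict(groups)
```

An email containing URLs from several domains contributes one record to each of the corresponding groups.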

URL Group Selection • Each email may belong to more than one group • Use the bursty property of botnet email traffic • Select the group that exhibits the strongest temporal correlation across a large set of distributed senders
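One way to picture group selection is to score each domain group by its sharpest spike of distinct senders. The sketch below is an assumption-laden illustration, not the paper's exact procedure: the one-hour window, the scoring rule, and the record layout `(url, source_ip, sent_at)` with `sent_at` in integer epoch seconds are all hypothetical.

```python
from collections import defaultdict

def burst_score(records, window_seconds=3600):
    """Largest number of distinct sender IPs seen in any one time window."""
    ips_per_window = defaultdict(set)
    for _url, source_ip, sent_at in records:
        ips_per_window[sent_at // window_seconds].add(source_ip)
    return max((len(ips) for ips in ips_per_window.values()), default=0)

def select_burstiest_group(groups, window_seconds=3600):
    """Return the domain whose records show the strongest sender spike.

    `groups` maps domain -> list of records and must be non-empty.
    """
    return max(groups, key=lambda d: burst_score(groups[d], window_seconds))
```

Counting distinct IPs rather than messages rewards groups that are both bursty and spread across many senders.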

Signature Generation and Botnet Identification • Two types of signatures • Complete URL based signature • Regular expression signatures • Signature criteria • Distributed • Bursty • Specific

Signature Generation and Botnet Identification • Distributed • Total number of Autonomous Systems (AS) spanned by source IP addresses must be at least 20 • Bursty • The set of matching URLs must be sent within 5 days • Specific • Complete URLs are specific by definition • For regular expressions, specificity is tested with an entropy reduction metric: the probability of a random string matching the signature is 1/2^90
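The distributed and bursty thresholds above can be expressed as two small checks. This is a sketch under stated assumptions: the function names and input shapes are hypothetical, and the entropy-based specificity test is not modeled here.

```python
from datetime import datetime, timedelta

MIN_AS_COUNT = 20               # "distributed" threshold from the slide
MAX_SPAN = timedelta(days=5)    # "bursty" threshold from the slide

def is_distributed(as_numbers):
    """True if the senders span at least MIN_AS_COUNT distinct ASes."""
    return len(set(as_numbers)) >= MIN_AS_COUNT

def is_bursty(send_times):
    """True if all matching emails were sent within MAX_SPAN of each other."""
    if not send_times:
        return False
    return max(send_times) - min(send_times) <= MAX_SPAN
```

A candidate signature would be kept only when both checks pass for the emails it matches.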

Automatic URL Regular Expression Generation

Signature Tree Construction • Constructs a keyword-based signature tree where each node corresponds to a substring, with the root of the tree set to the domain name • Keywords are the most frequent substrings that are both bursty and distributed

Regular Expression Generation • Detailing • Returns a domain specific regular expression using the keyword-based signature • Generalization • Returns a more general domain-agnostic regular expression by merging very similar domain-specific expressions
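To make the detailing step concrete, the sketch below turns a set of polymorphic URLs into one regular expression by keeping their shared prefix and suffix and replacing the varying middle with a character class and length bounds. This is a simplified stand-in, not AutoRE's algorithm: it handles only a single variable segment, and the helper names are hypothetical.

```python
import os
import re
import string

def char_class(fragments):
    """Pick the narrowest character class covering all fragments."""
    chars = set("".join(fragments))
    if chars <= set(string.digits):
        return "[0-9]"
    if chars <= set(string.ascii_lowercase):
        return "[a-z]"
    if chars <= set(string.ascii_letters + string.digits):
        return "[a-zA-Z0-9]"
    return "."  # fallback: any character

def detail(urls):
    """Build a regex: shared prefix + bounded variable part + shared suffix."""
    prefix = os.path.commonprefix(urls)
    rest = [u[len(prefix):] for u in urls]
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rest]
    lo = min(len(m) for m in middles)
    hi = max(len(m) for m in middles)
    return f"{re.escape(prefix)}{char_class(middles)}{{{lo},{hi}}}{re.escape(suffix)}"
```

Generalization, which would replace the domain portion to obtain a domain-agnostic expression, is not covered by this sketch.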

Datasets and Results • Based on randomly sampled Hotmail email messages • November 2006 • June 2007 • July 2007 • Total of 5,382,460 sampled emails • Pre-classified as either spam or non-spam by human users (labels are not used for signature generation, only for validation)

AutoRE Results • Identified 7,721 botnet spam campaigns • 580,466 spam messages • 340,050 distinct botnet host IP addresses • 5,916 AS

AutoRE Results • Majority of the campaigns belong to the complete URL (CU) category • Number of campaigns identified increased 100% in July 2007 compared to Nov 2006 • Spam volume increased 50% in the same time period • Total number of botnet IPs does not increase proportionally, suggesting that each botnet is being used more aggressively

False Positive Rate • Rate = (number of non-spam emails matching a signature) / (total number of non-spam emails)
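The rate above can be computed directly from a labeled sample. The sketch below assumes each non-spam email is represented as a list of its URLs and each signature as a regular expression string; both representations are hypothetical choices for illustration.

```python
import re

def false_positive_rate(signatures, non_spam_emails):
    """Fraction of non-spam emails with at least one URL matching a signature.

    `signatures` is a list of regex strings; `non_spam_emails` is a list
    of URL lists, one per email.
    """
    if not non_spam_emails:
        return 0.0
    compiled = [re.compile(sig) for sig in signatures]
    matched = sum(
        1
        for urls in non_spam_emails
        if any(pattern.fullmatch(url) for pattern in compiled for url in urls)
    )
    return matched / len(non_spam_emails)
```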

Ability to Detect Future Spam • Experiment • Apply signatures derived in Nov 2006 and June 2007 to the emails collected in July 2007 • Nov 2006 signatures are not useful • Indicates that spam URL patterns evolve over time • June 2007 signatures are highly effective • RE signatures are more robust than CU signatures over time

Regular Expressions vs Keyword Conjunctions • Identical spam detection rates • Difference is in false positive rate

Domain-specific vs Domain-Agnostic Signatures • Generalization effectively preserves the stable structures of polymorphic URLs while removing the volatile domain substrings

Botnet Characteristics • Distribution of IP addresses indicates that the botnet menace is a global phenomenon, with China, Korea, France, and the USA each having a significant number of IP addresses

Botnet Characteristics • When viewed individually, botnet hosts do not exhibit distinct sending patterns • Email content differs considerably even though the target web pages are the same • 50% of botnet spam campaigns have a sending-time standard deviation of less than 1.81 hours, while 90% have a standard deviation of less than 24 hours
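The sending-time spread of a campaign can be measured as follows. This sketch assumes send times are given as epoch seconds and uses the population standard deviation; the slides do not say which estimator the authors used.

```python
import statistics

def send_time_std_hours(send_times_seconds):
    """Population standard deviation of a campaign's send times, in hours.

    `send_times_seconds` must contain at least one timestamp.
    """
    hours = [t / 3600 for t in send_times_seconds]
    return statistics.pstdev(hours)
```

A campaign whose value falls below 1.81 hours would sit in the more tightly synchronized half of the campaigns reported above.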

Botnet Characteristics • Similar number of recipients per email • Share a constant connection rate • Most likely due to rate control seen in botnet software • Large number of campaigns share the same domain-agnostic regular expression signatures • Same botnets participating in multiple spam campaigns

Contributions • AutoRE, a framework that automatically generates URL signatures for spamming botnet detection • Several important findings about botnet spam • Botnet hosts spread across the internet • No distinctive pattern when viewed individually • Botnet host sending patterns

Weaknesses • The AutoRE system analyzes batches of emails after they have all been received • It would be better to do this in real time, stopping email once a campaign has been identified and a signature created • The AutoRE system needs a lot of emails to work effectively • It cannot be used on individual inboxes; it must be placed between the ISP and the incoming email

Weaknesses • I was hoping to take the characteristics found in the paper to use in my own project • The paper shows that botnet spam cannot be identified from individual hosts; the AutoRE system works on group behavior

References • "Spamming Botnets: Signatures and Characteristics". Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, and Ivan Osipkov. SIGCOMM, 2008.
