
BotGraph: Large Scale Spamming Botnet Detection

This research paper discusses the motivation, algorithms, and implementation of a botnet detection system using BotGraph. The system effectively identifies and analyzes web-account abuse and spam sending at a large scale.



Presentation Transcript


  1. BotGraph: Large Scale Spamming Botnet Detection Yao Zhao EECS Department Northwestern University

  2. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  3. Web-Account Abuse Attack [Diagram. Labels: Server; zombie; User/Pwd; Captcha solver; RDSXXTD3]

  4. Problems and Challenges • Web-account Abuse • Signup abuse • Spam sending • Challenges • Accuracy requirement • Stealthy (in terms of spam sending) • Large scale attack and huge Hotmail log data • Our Behavior-based Solutions • Correlate bot-users by their activities and identify the group properties • Design parallel algorithms on DryadLINQ to efficiently process large data

  5. System Architecture [Diagram. Labels: Signup data; EWMA based change detection; Aggressive signups; Verification & prune; Signup botnets; Login data; Graph generation; Login graph; Random graph based clustering; Suspicious clusters; Sendmail data; Verification & prune; Spamming botnets. Annotations: Run on DryadLINQ clusters; Output locally]

  6. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  7. History Based Change Detection [Plot. Annotations: Large prediction error; Back to normal]

  8. EWMA based Change Detection • EWMA (Exponentially Weighted Moving Average) • Y_t: observation at time t; S_t: prediction at time t • S_t = α·Y_{t-1} + (1 − α)·S_{t-1} • Large Prediction Error Implies Change (or Abnormal) • E_t = Y_t − S_t (prediction error) • R_t = Y_t / max(S_t, ε) (relative prediction error) • Apply EWMA based Change Detection to Signup Time Series of Each IP Address
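The EWMA recurrence on this slide can be sketched in a few lines of Python. This is only an illustration: the smoothing factor, the floor ε, the alarm threshold, and the choice to seed the first prediction with the first observation are all assumptions, since the slides do not give values for them.

```python
def ewma_change_points(counts, alpha=0.5, eps=1.0, ratio_threshold=5.0):
    """Flag time steps whose relative prediction error R_t is large.

    counts: signup counts per time step for one IP address.
    alpha, eps, ratio_threshold: illustrative values, not taken from the slides.
    Returns a list of (t, E_t, R_t) tuples for the flagged steps.
    """
    flagged = []
    s_prev = counts[0]  # assumption: seed the prediction with the first observation
    y_prev = counts[0]
    for t in range(1, len(counts)):
        # S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1}
        s_t = alpha * y_prev + (1 - alpha) * s_prev
        y_t = counts[t]
        e_t = y_t - s_t              # prediction error
        r_t = y_t / max(s_t, eps)    # relative prediction error
        if r_t > ratio_threshold:
            flagged.append((t, e_t, r_t))
        s_prev, y_prev = s_t, y_t
    return flagged


# One IP with a quiet history and a single burst of signups at step 5.
signups = [2, 3, 2, 3, 2, 50, 2, 3]
burst = ewma_change_points(signups)
```

Running one such detector per IP address matches the "partition data by IP" idea mentioned later in the deck, because each time series is processed independently.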

  9. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  10. Normal and Bot-user Behaviors • General Behaviors of Normal Users • Log in to Hotmail from home and/or office • The account shares IPs within one AS with others if dynamic IPs are used • General Behaviors of Bot-users • A pool of bots (e.g. thousands) and a pool of bot-users (e.g. hundreds of thousands) • Each bot hosts multiple bot-users • Each bot-user is assigned to different random bots every day • Fixed binding is not used at present • A pair of bot-users has a good chance of sharing several different IPs in different ASes

  11. User-user Graph • Graph Model • A Hotmail account => a node • A pair of accounts sharing IPs => an edge • Edge weight = number of different ASes the shared IPs belong to • Consider only edges with weight > 1 • Key Observations • Bot-users form a giant connected component • Normal users do not form large connected components • Explained by random graph theory
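As a rough illustration of the graph model on this slide, the sketch below builds the weighted user-user graph from login records. The record layout `(user, ip, asn)` and the function name are assumptions made for the example; the real system reads Hotmail logs and builds the graph in parallel on DryadLINQ.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(logins, min_weight=2):
    """Build the user-user graph from (user, ip, asn) login records.

    Edge weight = number of distinct ASes among the IPs two users share.
    Only edges with weight >= min_weight are kept (weight > 1 by default).
    Returns a dict mapping (user_a, user_b) with user_a < user_b to the weight.
    """
    users_by_ip = defaultdict(set)
    asn_of_ip = {}
    for user, ip, asn in logins:
        users_by_ip[ip].add(user)
        asn_of_ip[ip] = asn

    shared_asns = defaultdict(set)
    for ip, users in users_by_ip.items():
        # Every pair of users seen on this IP shares the IP's AS.
        for u, v in combinations(sorted(users), 2):
            shared_asns[(u, v)].add(asn_of_ip[ip])

    return {pair: len(asns) for pair, asns in shared_asns.items()
            if len(asns) >= min_weight}


# Hypothetical records: a and b share IPs in two ASes, a and c in only one.
records = [
    ("a", "10.0.0.1", 100), ("b", "10.0.0.1", 100),
    ("a", "20.0.0.1", 200), ("b", "20.0.0.1", 200),
    ("a", "10.0.0.2", 100), ("c", "10.0.0.2", 100),
]
graph = build_user_graph(records)
```

Enumerating all pairs per IP is quadratic in the number of users on that IP, which is one reason a heavily shared IP (a proxy, for instance) is expensive and is worth filtering out first.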

  12. Random Graph Theory • Random Graph G(n, p) • n nodes, and each pair of nodes has an edge with probability p • Theorem • A graph generated by G(n, p) has average degree d = n·p. • If d < 1, then with high probability the largest component in the graph has size less than O(log n). • If d > 1, then with high probability the graph contains a giant component with size on the order of O(n).
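The threshold behavior in this theorem is easy to observe empirically. The following is a small simulation sketch, not part of the original system: it samples G(n, p) with a fixed seed and reports the largest component size using a union-find structure. The values of n and d are arbitrary choices for the illustration.

```python
import random


def largest_component_size(n, p, seed=0):
    """Sample G(n, p) and return the size of its largest connected component."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:        # each pair gets an edge with probability p
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


n = 1000
sparse = largest_component_size(n, 0.5 / n)   # d = 0.5 < 1: only small components expected
dense = largest_component_size(n, 3.0 / n)    # d = 3.0 > 1: a giant component expected
```

In the BotGraph setting the intuition is that bot-users sharing bots across ASes behave like the d > 1 case, while normal users rarely share IPs across ASes and stay in the d < 1 regime.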

  13. Typical Bot-user Graphs • Strategy 1 • Bot-user accounts are randomly assigned to bots. • Strategy 2 • Keeps a queue of the bot-users. • A bot comes online and gets the top k available (currently not used) bot-users in the queue. • Strategy 3 • Similar to the second case, except that there is no limit on the number of bot-users a bot can request for one day.

  14. Typical Bot-user Graphs • 10000 bot-users, 10-day activity, k = 20

  15. Bot-user Detection Algorithm • Issues • Different bot-user groups may be connected (in the graph with weight threshold 2) • Shared bots • Shared bot-users • No single fixed weight threshold T works • Exceptions: large connected components formed by normal users do exist • Detection Algorithm • Hierarchical algorithm to extract connected components • Pruning

  16. Hierarchical Connected Component Extraction [Diagram. Labels: G; T=2: A, B; T=3: C, D; T=4: E]
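One way to read the hierarchical extraction is as a recursion over the weight threshold T: find components at T, then re-examine each one at T + 1 to see whether it splits into tighter groups. The sketch below follows that reading. The minimum component size and the maximum threshold are assumptions added to make the recursion stop; the slides do not state the actual stopping rule.

```python
def connected_components(nodes, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:                     # iterative depth-first search
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps


def hierarchical_components(weighted_edges, threshold=2, min_size=3, max_threshold=10):
    """Extract components at `threshold`, then recurse with threshold + 1.

    weighted_edges: {(u, v): weight}.
    min_size, max_threshold: hypothetical stopping parameters.
    Returns a list of (threshold, members, children) tuples.
    """
    kept = [pair for pair, w in weighted_edges.items() if w >= threshold]
    nodes = sorted({n for pair in kept for n in pair})
    tree = []
    for comp in connected_components(nodes, kept):
        if len(comp) < min_size:
            continue
        children = []
        if threshold < max_threshold:
            inside = {(u, v): w for (u, v), w in weighted_edges.items()
                      if u in comp and v in comp}
            children = hierarchical_components(inside, threshold + 1,
                                               min_size, max_threshold)
        tree.append((threshold, comp, children))
    return tree


# Two tight triangles joined by one weak edge (c-d, weight 2).
edges = {("a", "b"): 4, ("b", "c"): 4, ("a", "c"): 4,
         ("c", "d"): 2,
         ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3}
tree = hierarchical_components(edges)
```

In this toy input the two groups are merged at T = 2 and separate at T = 3, which is the situation the slide describes for bot-user groups connected through shared bots or shared accounts.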

  17. Exceptions: Connected Subgraphs of Normal Users • Potential Reasons • Some web service providers log in to Hotmail accounts on behalf of users (e.g. Facebook, LinkedIn) • National proxies • Cell phones (e.g. iPhone) • Tor (Onion routing) • Solutions • Filter out some IPs • Prune connected components that are potentially good

  18. Prune Good Groups • Email Sending Frequency • Normal users: generally don’t send many emails on average • Bot-users: to be efficient, send several spam emails every day • Email Size • Normal users: random sizes • Bot-users: (currently) similar email sizes

  19. Prune Good Groups [Figure. Panels: Bad; Good]

  20. Prune Good Groups • Metrics • s1: the percentage of users who send more than 3 emails per day • s2: the percentage of users who send out emails with similar size (peak detection) • Pruning • Threshold of s1 is 0.8 (conservative and wide margins around 0.8) • s2 is used in validation
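The s1 metric and its 0.8 threshold can be sketched as follows. The input layout (a mapping from user to average emails sent per day) is an assumption, as is the reading that groups below the threshold are the ones pruned as likely normal users. The s2 peak-detection metric is not shown because the slides do not describe it in enough detail.

```python
def s1_score(emails_per_day, min_emails=3):
    """Fraction of users in a group who send more than `min_emails` emails per day.

    emails_per_day: {user: average emails sent per day} (hypothetical layout).
    """
    if not emails_per_day:
        return 0.0
    heavy = sum(1 for rate in emails_per_day.values() if rate > min_emails)
    return heavy / len(emails_per_day)


def is_suspicious_group(emails_per_day, threshold=0.8):
    """Keep a group as suspicious when s1 reaches the threshold; otherwise prune it."""
    return s1_score(emails_per_day) >= threshold


bot_like = {"u1": 5, "u2": 6, "u3": 4, "u4": 7, "u5": 1}   # 4 of 5 users send > 3 per day
normal_like = {"a": 0.5, "b": 1, "c": 4}                   # 1 of 3 users sends > 3 per day
decisions = (is_suspicious_group(bot_like), is_suspicious_group(normal_like))
```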

  21. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  22. Parallel Implementation on DryadLINQ • EWMA Algorithm of Signup Abuse Detection • Partition data by IP (straightforward) • Graph Construction • Two algorithms • Connected Component Extraction • Divide and conquer

  23. Connected Component Extraction • Partitions of Edges • (User1, User2, weight) (A, B) (D, G) (B, C) (C, E) (C, D) (E, F) (B, G) (G, D)

  24. Connected Component Extraction (A, B) (D, G) (B, C) (C, E) (C, D) (E, F) (B, G) (G, D) Local Algo (A, B) (D, G) (B, C) (B, E) (C, D) (E, F) (B, G) (B, D)

  25. Connected Component Extraction (A, B) (D, G) (B, C) (B, E) (C, D) (E, F) (B, G) (B, D) Merge and local algo (A, B), (A, C), (A, E), (D, G) (B, D), (B, C), (B, G), (E, F)

  26. Connected Component Extraction (A, B), (A, C), (A, E), (D, G) (B, D), (B, C), (B, G), (E, F) (A, B), (A, C), (A, D), (A, E), (A, F), (A, G)
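The worked example across these slides can be reproduced with a short single-machine sketch. It assumes the "local algo" reduces each partition's edges to star edges of the form (smallest node in the component, member), which is consistent with the edge lists shown on the slides, and that partitions are merged pairwise. The function names are hypothetical, and the real implementation runs on DryadLINQ rather than in one Python process.

```python
def local_components(edges):
    """Reduce an edge list to star edges (representative, member).

    The representative of each component is its smallest node label.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Attach the larger root under the smaller one so that the
            # root is always the smallest label in the component.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv
    return sorted((find(x), x) for x in parent if find(x) != x)


def merge_partitions(partitions):
    """Merge partitions pairwise until one remains (about log2(M) rounds)."""
    parts = [local_components(p) for p in partitions]
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts), 2):
            combined = list(parts[i])
            if i + 1 < len(parts):
                combined += parts[i + 1]
            merged.append(local_components(combined))
        parts = merged
    return parts[0] if parts else []


# The four edge partitions from the slides.
partitions = [
    [("A", "B"), ("D", "G")],
    [("B", "C"), ("C", "E")],
    [("C", "D"), ("E", "F")],
    [("B", "G"), ("G", "D")],
]
final_edges = merge_partitions(partitions)
```

Because each partition is reduced to at most one star edge per node before it is shipped, no partition exceeds the number of users N, which is where the size bound quoted in the analysis slide comes from.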

  27. Connected Component Extraction • Analysis • M partitions and log(M) steps • Partition size ≤ N (number of users) • Overall communication overhead • O(N·log(M)) • Computational overhead

  28. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  29. Detection of Signup Abuse

  30. Detection by User-user Graph

  31. Validations • Manual Check • Verified by Hotmail group • Comparison with Known Spamming Users • Hotmail accounts reported in user complaints • Email Sending Patterns • Email size • False Positive Estimation • Naming pattern • Signup time

  32. Comparison with Complained Users • Ks : known spammer accounts signed up in the studied month • H : set of bot-users detected by EWMA

  33. Comparison with Complained Users • Ks : known spammer accounts that log in from at least 2 ASes • L : set of bot-users detected by user-user graph

  34. Validation of Sending Pattern

  35. False Positive Estimation (1) • Naming Pattern • Clear pattern in names of (current) bot-users • E.g. w9168d4dc8c5c25f9 • Naming pattern score • The largest fraction of users that follow a single naming template from a regular expression pool • The regular expressions don’t quite match normal user names

  36. False Positive Estimation (1) • Naming Score

  37. False Positive Estimation (1) • Naming Score • A majority of the bot-user groups have naming pattern scores close to 1 • A few small bot-user groups have scores lower than 95% • In total, 0.44% of identified bot-users do not strictly follow the naming templates of their corresponding groups. • Take this 0.44% as the false positive bound
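The naming pattern score defined on the earlier slide (the largest fraction of users in a group that follow a single template from a regular expression pool) can be sketched as below. The two templates are hypothetical stand-ins written for this example; the actual regular expression pool is not given in the slides.

```python
import re


def naming_pattern_score(usernames, templates):
    """Largest fraction of usernames that fully match any single template."""
    if not usernames:
        return 0.0
    best = 0
    for template in templates:
        pattern = re.compile(template)
        matches = sum(1 for name in usernames if pattern.fullmatch(name))
        best = max(best, matches)
    return best / len(usernames)


# Hypothetical pool: "w" followed by 16 hex digits, and "first.last".
pool = [r"w[0-9a-f]{16}", r"[a-z]+\.[a-z]+"]
group = ["w9168d4dc8c5c25f9", "w0123456789abcdef", "w00000000deadbeef", "alice.smith"]
score = naming_pattern_score(group, pool)
```

Users in a group who do not match the group's best template are the ones counted toward the false positive bound.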

  38. False Positive Estimation (2) • Signup dates of the detected bot-users • Conservatively treat all accounts signed up before 2007 as legitimate • 0.08% of detected bot-users signed up before 2007 • Among all the accounts in the 2008 dataset, about 59.1% signed up before 2007 • False positive • Assuming normal users' behavior doesn’t change • 0.08% / 59.1% ≈ 0.13%

  39. Outline • Motivation and Problem Definition • BotGraph Algorithms • History based algorithm on Signup detection • Graph-based algorithm on login detection • Parallel Implementation on DryadLINQ • Detection Results • Discussion • Conclusion

  40. Evasion • Signup detection • Be stealthy • Login detection • Fixed binding • Low utilization rate • Bot-accounts bound to one host are easy to group • Fixed AS assignment • Redefine the edge weight to consider IP prefix • Similar to fixed binding • Be stealthy (send as few emails as a normal user)

  41. Related Work • Botnet Detection • Hard in general • HoneyNet • Content-based Spam Detection • Bayesian filtering, AutoRE • Countermeasures: good words, image • Behavior-based Spam Detection • SpamTracker

  42. Conclusions • BotGraph • History-based change detection on signups • Graph-based component to detect stealthy bot-user logins • Parallel Algorithms on DryadLINQ • Fast processing of huge Hotmail logs • Detection • Detected more than 26M bot-accounts in a two-month log • Low false positive rate

  43. Q & A? Thanks!
