300 likes | 406 Views
Andrew G. West and Insup Lee CEAS `11 – September 1, 2011. Towards the Effective Spatio- Temporal Mining of Spam Blacklists. Big Idea / Outline. BIG IDEA : Identify IP addresses that have temporally correlated spam behavior; harness this info. predictively Related work; motivations
E N D
Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio-Temporal Mining of Spam Blacklists
Big Idea / Outline • BIG IDEA: Identify IP addresses that have temporally correlated spam behavior; harness this info. predictively • Related work; motivations • Blacklists as ground truth • Data collection • Measurement study • Temporal association mining • Technique • Parameterization • Negative results; discussion
Usage Example • Blacklist history (time) • IPx • IPy • tnow • 20 min. • 20 min. • What to do “now”? • Assume IPy will be blacklisted • Start blocking; decrease listing latency
Motivations • Recent research leveraging group behaviors [1—5]: • Overcome “cold-start” • Grouping functions: subnets, rDNS hosts, AS, etc. • Botnets a driving force • Non-contiguous in IP space • “Campaigns” should give rise to temporal correlations • Can we calculate grouping function; use for reputation? History AS-REP AS REPALG BLOCK BLK-REP Mail IP IP IP-REP Time Spatial Functions Plot into 3-D Space SPAM or HAM Classify (SVM)
Related Work “How to determine botnet membership?” • Parsing P2P communication graphs • Issues: Unproven, reqs. expansive view (BotGrep [6]) • Blacklists have inherent global view • Similarity algs. over email bodies/URLs • Issues: Privacy, complexity (Botnet Judo [7]) • Mining uses only IP addresses in computation • Law enforcement infiltrations • Data only useful in ex post facto fashion
Blacklists • Why blacklists? • Global compilation; aggregate; low false-positives • We have tons of data • Spamhaus blacklists [8] • PBL (Policy Block List) – Dynamic IP ranges • SBL (Spamhaus Block List) – Static ranges belonging to spam gangs • XBL (Exploits Block List) – IPs spamming due to malware, Trojans (i.e., botnet nodes)
Blacklist Ops • listing duration (d) • listing • de-listing • re-listing • listed • listed • not- listed • IPx • Blacklist history (time)
Blacklist Size • Why?: Desirable to show that blacklists are a reasonable proxy for the spam problem #2 #1 • #1: Spike typical of holiday seasons • #2: Shutdowns of Spamit.com affiliate and Rustock • Small spikes: Evidence of campaigns
Listing Duration (d) • Why?: Re-listings (basis for patterns/ correlations) limited by de-listing speed • Almost universal d=7.5 days • Speaks to static TTL delisting policy • Must only correlate listings, not overlapping durations
DHCP Issues • Why?: Dynamic IPs may not be able to accumulate enough history for mining, or produce stale predictions XBL PBL “possiblydynamic” • A large percentage (80%+) of IPs are dynamic • More important, is how dynamic they are [9] • This fact supports narrow learning windows 11% 10% “knowndynamic” 79% ≈18.4% of all IP space is on the PBL
Relisting Quantity • Why?: Central issue: do some IPs have histories extensive enough to be mined? • #1: 50% of IPs have only 1 listing. Discard. Trim problem space. #1 #2 • #2: 20% of IPs have 5+ listings, yet these account for 66% of all listings (non-trivial).
Relisting Rates • Why?: Dynamism supports tight learning, thus we want all re-listings well clustered temporally. • Media re-listing time is 18 days • Far from a uniform distribution • Also speaks to infection lifetimes
Association Rules • Developed for “market basket” data • “Beer and diapers” example • Apriori and FP-Growth algs. • Example rule • {DIAPERS} →{BEER} • Interest measures [10]: • lift(DIAPERS → BEER) = (3/5) / (4/5) * (4/5) = 0.94 • Ratio of actual support, to expected rand. support
Correlations • Previous: discrete, unordered, and transactional data • Spam data defies these • Continuously distributed • Bi-directionally ordered • Define “correlation radius” (r) to make binary associations • Symmetric but non-associative • Radius enables probabilistic lift and support equivalents
Best Pairs For every IP address, produce a finite “best pairs” list for persistent storage, where ordering determined by “lift”
Implementation • 232 × 232 = Scalability issues • Prune search space with “minimum support” • M=3 produces a 54.3 trillion entry matrix • But 98% sparse • Multi-threaded runtime= 3 days; we used EC2
Free Variables • Correlation radius (r) • Try to capture campaigns with minimal noise • r = 2 hours (4 hour diameter) • Training window length (length(h’)) • Narrow: Infection lifetimes [11], DHCP addresses • Broad: Need for re-listings, bot-to-campaign map • length(h’) = 3 months • Minimum support (m) • Derived based on scalability needs (m=3)
RESULTS AND DISCUSSION 1. “Best pairs” significance 2. Botnet membership capture 3. Blacklist prediction
Rule Significance • Intuition: Lift matrix should have values higher than random chance would suggest #1 • #1: Flip expected; About 0.6% of all pairs correlate more than random #2 • #2: Even at lift=120, 36% chance the correlation is rand. AGGREGATE
Botnet Membership • Intuition: Given a set of botnet IPs, shared member/ pair lifts should exceed member/non-member pairs • Actual dumps: Kraken + Cutwail • 70-80% of IPs are XBL listed, 40% at min. support • 6.0% of shared have non-zero lift, compared to 2.8%
Blacklist Prediction (2) • Prediction criteria • No ballot stuffing; can’t re-guess • Experiment with different thresholds • Same story: Outperforming random, but too minimal to be of any consequence
Discussion (1) • Scalability issues × minor performance increments don’t warrant production • Focus on acute areas of improvement: • DHCP research • 90%+ of IPs at min. support are dynamic, how? • Need reliable IP classification; churn rates • Refining windows/correlations • Non-binary correlations. Gaussian weights. • Time-decay of events in training windows
Discussion (2) 3. Appropriateness of blacklist data • Desirable conciseness (500 million listings = 12GB) • Blacklists inherently latent. Their aggregate, opaque, and binary triggers may blur campaign-level data. • Install on an email server? Collect other metadata • Takeaway; Utility in negative result • Measurement study builds on prior research • Our model serves as foundation for future efforts • Lessons learned about botnet dynamics • Identified poorly understood dynamism areas
References [1] A. Ramachandran and N. Feamster. Understanding the network-level behavior of spammers. In SIGCOMM, 2006. [2] F. Li and M.-H. Hsieh. An empirical study of clustering behavior of spammers and group-based anti-spam strategies. In CEAS, 2006. [3] S. Hao, et al. Detecting spammers with SNARE: Spatio- temporal network-level automated reputation engine. In USENIX Security, 2009. [4] Z. Qian, et al. On network-level clusters for spam detection. In NDSS, 2010 [5] A. G. West, et al. Spam mitigation using spatio-temporal reputations from blacklist history. In ACSAC, 2010. [6] S. Nagaraja, et al. BotGrep: Finding P2P bots with structured graph analysis. In USENIX Security, 2010. [7] A. Pitsillidis, et al. Botnet judo: Fighting spam with itself. In NDSS, 2010. [8] Spamhaus Project. http://www.spamhaus.org/ [9] Y. Xie, et al. How dynamic are IP addresses? In SIGCOMM, 2007. [10] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comp. Surveys, 38(9), 2006. [11] J. E. Dunn. Botnet PCs stay infected for years. Tech World, 2009.
Backup Slides (2) Above: Lift distributions as a consequence of altering minimum support. Above: Lift distributions as a consequence of altering correlation radius and minimum support