
BotGrep: Finding P2P Bots with Structured Graph Analysis

Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, Nikita Borisov (UIUC). USENIX Security Symposium 2010.



Presentation Transcript


  1. BotGrep: Finding P2P Bots with Structured Graph Analysis
     Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, Nikita Borisov (UIUC)
     USENIX Security Symposium 2010

  2. Graph Theory: Cut
     • A cut C = (S, T) is a partition of the vertex set V of a graph G = (V, E). Example: ({1, 2, 3, 4, 5}, {6, 7})
     • The cut-set of a cut is the set of edges whose end points are in different subsets of the partition. Example: {(5, 6), (1, 7)}
     • The size of a cut is the number of edges in the cut-set → small cut, min. cut, …
     [Figure: graph G = (V, E) on nodes 1 to 7, shown before and after the cut]
     http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
     Speaker: Li-Ming Chen
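The cut definitions above can be checked with a few lines of Python. This is a minimal sketch: the slide's figure is not preserved, so the edge list below (a 7-node ring) is an assumption chosen only because it reproduces the cut-set {(5, 6), (1, 7)} quoted on the slide; `cut_set` is a hypothetical helper name.

```python
def cut_set(edges, s):
    """Return the edges with exactly one end point in the node set s."""
    s = set(s)
    return [(u, v) for u, v in edges if (u in s) != (v in s)]

# Assumed example graph (a ring on nodes 1..7); not taken from the slide.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (1, 7)]

crossing = cut_set(edges, {1, 2, 3, 4, 5})   # cut ({1,2,3,4,5}, {6,7})
cut_size = len(crossing)                     # size of the cut
print(crossing, cut_size)
```

A small cut means the two sides are weakly connected, which is the property later slides rely on when they argue that a P2P botnet cannot be separated from itself by a small cut.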

  3. Graph Theory: Random Walk (RW)
     • RW: a trajectory that consists of taking successive random steps on a graph
     • E.g., the path traced by a molecule as it travels in a liquid
     • RWs are usually assumed to be Markov processes
     • Example: a walk that starts at node 1
     [Figure: graph on nodes 1 to 7 with its transition matrix (from node / to node) and the Prob. after 1 RW step and after 2 RW steps]
     http://en.wikipedia.org/wiki/Random_walk
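The "Prob. after 1 RW / 2 RW" idea can be sketched as repeated multiplication of a row vector by the transition matrix. The graph here is an assumed 7-node ring (the slide's own graph and matrix are not preserved), and `transition_matrix` and `step` are hypothetical helper names.

```python
def transition_matrix(adj):
    """Row-stochastic matrix: from node i, move to each neighbor with prob. 1/deg(i)."""
    nodes = sorted(adj)
    P = [[1.0 / len(adj[i]) if j in adj[i] else 0.0 for j in nodes] for i in nodes]
    return nodes, P

def step(q, P):
    """One random-walk step: q_t = q_{t-1} P, with q as a row vector."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

# Assumed graph: ring on nodes 1..7, so node i neighbors i-1 and i+1 (wrapping around).
adj = {i: {i % 7 + 1, (i - 2) % 7 + 1} for i in range(1, 8)}
nodes, P = transition_matrix(adj)

q0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # walk starts at node 1
q1 = step(q0, P)                            # distribution after 1 step
q2 = step(q1, P)                            # distribution after 2 steps
print(q1)
print(q2)
```

On this ring, one step puts half the probability on each neighbor of node 1, and the second step returns half of it to node 1.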

  4. Graph Theory: Stationary Distribution
     • A stationary distribution π is a vector, whose entries are non-negative and sum to 1, that satisfies π = πP (P is the transition matrix)
     • Markov chain mixing time: how large must t be until the time-t distribution q_t is approximately π (i.e., has converged to the stationary Dist.)?
     [Figure: Example 1 and Example 2 on the graph with nodes 1 to 7; in both, the distribution remains steady]
     http://en.wikipedia.org/wiki/Markov_chain
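As a rough illustration of both bullets, the sketch below checks that the degree-proportional vector is stationary for a simple random walk on an undirected graph, and estimates a mixing time as the first t at which the total variation distance to π drops below a tolerance. The graph (a 7-node ring plus one chord), the tolerance `eps`, and all function names are assumptions for illustration.

```python
def stationary_by_degree(adj):
    """pi_v = deg(v) / (2|E|) for a simple random walk on an undirected graph."""
    total = sum(len(nbrs) for nbrs in adj.values())   # equals 2|E|
    return {v: len(adj[v]) / total for v in adj}

def walk_step(q, adj):
    """Push each node's probability mass equally to its neighbors."""
    nxt = {v: 0.0 for v in adj}
    for u, mass in q.items():
        for v in adj[u]:
            nxt[v] += mass / len(adj[u])
    return nxt

def mixing_time(adj, start, eps=0.01, max_steps=10_000):
    """First t with total variation distance between q_t and pi below eps, else None."""
    pi = stationary_by_degree(adj)
    q = {v: 1.0 if v == start else 0.0 for v in adj}
    for t in range(1, max_steps + 1):
        q = walk_step(q, adj)
        if 0.5 * sum(abs(q[v] - pi[v]) for v in adj) < eps:
            return t
    return None

# Assumed graph: ring on 1..7 plus the chord (1, 4).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (1, 7), (1, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

pi = stationary_by_degree(adj)
pi_next = walk_step(pi, adj)      # should equal pi ("remain steady")
t_mix = mixing_time(adj, start=1)
print(pi, t_mix)
```

A fast-mixing graph is one where `t_mix` stays small as the graph grows; the later slides use that property to pick out structured P2P subgraphs.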

  5. Outline
     • Problem Definition
     • System Architecture
     • Approach:
       • Prefiltering Step
       • Clustering P2P Nodes
       • Validation
     • (*) Privacy Preserving Graph Algorithms
     • Results & Discussion
     • Conclusion & My Comments

  6. What Is a Botnet?
     • Bots: compromised hosts, "zombies"
     • Botnets: networks of bots that are under the control of a human operator (botmaster)
     • (Generally looks like) worm + C&C channel
     • Command and Control (C&C) channel: disseminates the botmaster's commands to the bot army
     [Figure labels: Communication (IRC, HTTP, …; can be encrypted); Attack (DoS, spamming, phishing site, …); Propagation (worm: vulnerabilities, file sharing, P2P, …)]

  7. Botnet Structure Change!!
     • Centralized structure → P2P. Why?
       • Growing size of botnets → P2P communication is more efficient and robust
       • Development of mechanisms that detect traditional centralized C&C servers → botnets try to evade detection
     • Question: can ISPs detect P2P botnets and use this as a basis for botnet defense?
     [Figure: (traditional) centralized structure vs. P2P structure]

  8. Problem & Proposed Solution
     • Problem:
       • ISPs have significant visibility into Comm. patterns
       • But how can botnet traffic be separated from background Internet traffic?
     • Proposed approach: BotGrep
       • An algorithm that isolates P2P Comm. structure
       • Based only on information about which pairs of nodes communicate with one another
       • Input: a communication graph
       • Still works when only a partial view of the Comm. graph is available
       • Can support "privacy preserving collaboration"

  9. Challenges
     • Background traffic volume is huge
     • Background traffic is highly variable and continuously changing
     • Botnet traffic blends in with the regular traffic of legitimate users
     • The botnet is tightly integrated and can NOT be separated from the rest of the nodes by a small cut
     • ISP collaboration → scaling issues
     • ISP collaboration → privacy issues

  10. Outline
     • Problem Definition
     • System Architecture
     • Approach:
       • Prefiltering Step
       • Clustering P2P Nodes
       • Validation
     • (*) Privacy Preserving Graph Algorithms
     • Results & Discussion
     • Conclusion & My Comments

  11. BotGrep Architecture
     • Data source 1: combining observations across different network monitors into a single Comm. graph
     • Data source 2: borrow misuse detection to distinguish P2P bots from other P2P applications (speeds up botnet identification)
     • Outputs: a set of suspect hosts (and links)

  12. Inference System
     • As mentioned, the botnet graph is embedded within a background Comm. graph
     • One common feature of structured P2P graphs: fast mixing time (∵ highly structured)
     • BotGrep exploits this feature by performing random walks to identify fast-mixing component(s) and isolate them from the rest of the Comm. graph

  13. Outline
     • Problem Definition
     • System Architecture
     • Approach:
       • Prefiltering Step
       • Clustering P2P Nodes
       • Validation
     • (*) Privacy Preserving Graph Algorithms
     • Results & Discussion
     • Conclusion & My Comments

  14. Problem Formulation
     • Given a Comm. graph: G = (V, E)
       • (note: no clear time period is specified)
     • Assume a P2P graph Gp is embedded → the remaining subgraph contains non-P2P Comm.
     • Goal: partition the input G into {Gp, Gn} in the presence of dynamic background traffic and with only partial visibility of edges

  15. Approach Overview (BotGrep)
     • Idea: perform random walks and compare the relative mixing rates of subgraphs
     • 3 steps:
       • (1) Pre-filtering (actually k-means clustering)
         • Extract a small set of candidate P2P nodes (+ FP)
       • (2) Clustering P2P Nodes (sampling)
         • Apply the modified SybilInfer Algo. to remove FP
       • (3) Validation
         • Validate step (2) based on the fast-mixing characteristic

  16. Step (1): Pre-filtering
     • Idea: for short random walks, the state Prob. associated with nodes in the fast-mixing subgraph is likely to be closer to the "stationary distribution" than that of nodes in the slow-mixing subgraph
     • Input: G = (V, E)
     • Short RW of length t = log(|V|), starting from an initial distribution ("init:")
     • Node scores are normalized by degree (∵ the stationary Dist. is proportional to node degrees), with a dampening constant to undermine high-degree nodes
     • k-means goal: the sum of squares J from points to the assigned cluster centers c_j is minimized (J is the cluster score)
     [Figure: graph on nodes 1 to 7 with the slide's formulas; the speaker's annotation "?? should be" refers to a formula that is not preserved in this transcript]
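The pre-filtering step can be sketched roughly as follows. Several details are assumptions, since the slide's formulas are lost: the walk starts from a uniform distribution, the score is taken as s_i = (q_i / d_i)^(1/r) with r as the dampening constant, and the clustering is a plain 1-D Lloyd's k-means with a deterministic quantile initialization. All function names are hypothetical, and every node is assumed to have at least one neighbor.

```python
import math

def short_walk(adj, t):
    """Distribution after t simple random-walk steps from a uniform start (assumed init)."""
    q = {v: 1.0 / len(adj) for v in adj}
    for _ in range(t):
        nxt = {v: 0.0 for v in adj}
        for u, mass in q.items():
            for v in adj[u]:
                nxt[v] += mass / len(adj[u])
        q = nxt
    return q

def scores(adj, r=100):
    """Assumed score: degree-normalized probability, dampened by an r-th root."""
    t = max(1, round(math.log(len(adj))))      # short walk, t = log(|V|)
    q = short_walk(adj, t)
    return {v: (q[v] / len(adj[v])) ** (1.0 / r) for v in adj}

def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means on scalars; returns labels, centers, and the sum of squares J."""
    pts = sorted(values)
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    nearest = lambda x: min(range(k), key=lambda j: (x - centers[j]) ** 2)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in values:
            groups[nearest(x)].append(x)
        # Keep the old center when a cluster receives no points.
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    labels = [nearest(x) for x in values]
    J = sum((x - centers[l]) ** 2 for x, l in zip(values, labels))
    return labels, centers, J

# Assumed toy graph: a 4-clique {1,2,3,4} with a path 4-5-6-7 attached.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

s = scores(adj)
labels, centers, J = kmeans_1d([s[v] for v in sorted(adj)], k=2)
print(dict(zip(sorted(adj), labels)), J)
```

Nodes whose score is close to the value expected at stationarity end up in the same cluster; the candidate P2P cluster is then passed on to the next step.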

  17. Step (2): Clustering P2P Nodes
     • Step (1)'s output: {G1, G2, …, Gk} → perform the "modified SybilInfer Algo." on each subgraph to remove weakly connected nodes (FP)
     • Concept of the modified SybilInfer Algo. (3 steps):
       • Get "traces" T
         • A trace represents a related vertex-pair obtained by using RW*
       • Use sampling to get P2P nodes
         • Assume a cut X0 consists of P2P nodes
         • Check whether X' is better than X0 according to probability
         • If better, X' replaces X0; else X0 is retained
         • Repeat this over several runs
       • Get {X0, X1, …, XN}; decide P[node i is P2P] = ?
     G. Danezis and P. Mittal, "SybilInfer: Detecting Sybil Nodes using Social Networks," in Proc. NDSS, 2009.

  18. Step (2): Clustering P2P Nodes (cont'd)
     • Modified SybilInfer Algo., Step (1): generation of traces
       • Perform a number n of RWs starting at each node, each of length t = log(|V|)
       • q_t = q_{t-1} · P', where the modified transition matrix P' ensures that the "stationary Dist." of the RW is uniform over all vertices
       • Traces T is the set of starting and ending vertex-pairs of each RW (we are interested in these pairs traversed by RW)
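Trace generation can be sketched as below. The slide does not preserve the definition of P', so this sketch assumes the Metropolis-Hastings style rule used in the SybilInfer paper: move from u to a neighbor v with probability min(1/d_u, 1/d_v), and stay at u with the remaining probability, which makes P' symmetric and hence gives a uniform stationary distribution. Function names and the toy graph are hypothetical.

```python
import math
import random

def modified_step(adj, u, rng):
    """One step of the assumed P': neighbor v w.p. min(1/d_u, 1/d_v), else stay at u."""
    x = rng.random()
    acc = 0.0
    for v in sorted(adj[u]):
        acc += min(1.0 / len(adj[u]), 1.0 / len(adj[v]))
        if x < acc:
            return v
    return u   # self-loop carries the leftover probability

def generate_traces(adj, n_walks, rng):
    """Return (start, end) pairs for n_walks walks of length log(|V|) from every node."""
    t = max(1, round(math.log(len(adj))))
    traces = []
    for start in adj:
        for _ in range(n_walks):
            node = start
            for _ in range(t):
                node = modified_step(adj, node, rng)
            traces.append((start, node))
    return traces

# Assumed toy graph: ring on 1..7 plus the chord (1, 4).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (1, 7), (1, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

traces = generate_traces(adj, n_walks=20, rng=random.Random(0))
print(len(traces), traces[:5])
```

Only the start and end vertices are kept; the intermediate hops are discarded, which matches the slide's description of T as a set of vertex-pairs.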

  19. (My Observation)
     • Traces T: a set of vertex-pairs
       • The end points reflect the connectivity of the graph
       • High-degree nodes may walk to low-degree nodes (with approximately equal Prob. to each neighbor)
       • RW may be trapped by low-degree nodes (if they are connected)
     • P' is a symmetric matrix
     • RW will not converge by using P'!!
     [Figure: graph on nodes 1 to 7 with its transition matrix (from node / to node)]

  20. Step (2): Clustering P2P Nodes (cont'd)
     • Modified SybilInfer Algo., Step (2): a (Bayesian) Prob. model for P2P nodes
       • Given the set of traces T, compute the Prob. that any set of nodes X are all P2P nodes
     [Equation annotations: goal; can be acquired; fixed; assign a uniform Prob. to all walks ending in the set X; trace ends in vertex v in X; trace ends in vertex a in …; number of RWs ending in vertex a (or v)]

  21. Step (2): Clustering P2P Nodes (cont'd)
     • Modified SybilInfer Algo., Step (3): Metropolis-Hastings sampling
       • Enumerating all subsets X of the graph is impossible → sample configurations Xi following this distribution: Xi ~ P(X|T)
       • Given a set of samples S = {X0, X1, …, XN}, we can compute the marginal Prob. of a node being a P2P node as the fraction of samples that contain it
     • Threshold:
       • If P[node i is P2P] > 0.5 (node i exists in more than half of the samples), then P2P; else non-P2P (FP)
     • Next step: validate the P2P group!
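The marginal-probability and threshold step can be illustrated in isolation. This sketch takes the samples as given (it does not implement the Metropolis-Hastings sampler itself), and the sample sets below are made-up illustrative data, not results from the paper.

```python
def marginal_p2p(samples, nodes):
    """P[node i is P2P] estimated as the fraction of samples X_j that contain i."""
    n = len(samples)
    return {v: sum(1 for x in samples if v in x) / n for v in nodes}

def classify(samples, nodes, threshold=0.5):
    """Nodes present in more than `threshold` of the samples are labeled P2P."""
    probs = marginal_p2p(samples, nodes)
    return {v for v in nodes if probs[v] > threshold}

# Hypothetical samples X_0..X_3 over nodes 1..5 (illustrative only).
samples = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {1, 3}]
nodes = [1, 2, 3, 4, 5]

probs = marginal_p2p(samples, nodes)
p2p = classify(samples, nodes)
print(probs, p2p)
```

Node 3 appears in exactly half of the samples, so under the strict "> 0.5" rule on the slide it is treated as non-P2P.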

  22. Step (3): Validation
     • SybilInfer only partitions a graph into two subgraphs
     • We need multiple iterations to get to the desired fastest-mixing subgraph → this requires a validation test
     • If the cut passes all 3 validation tests below, then we are done:
       • (1) Graph conductance test
       • (2) q(t) entropy comparison test
       • (3) Degree-homogeneity test

  23. Step (3): Validation (cont'd)
     • (1) Graph conductance test
       • P2P network is fast mixing → no small cut → graph conductance should be high
     • (2) q(t) entropy comparison test
       • RWs on structured homogeneous P2P graphs are characterized by a high-entropy state Prob. Dist.
       • The KL divergence measure should be close to 0
     • (3) Degree-homogeneity test
       • To rule out star topologies!
       • Measure the dispersion of degree values → should be homogeneous (dispersion close to 0)
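The three quantities named above can be sketched with their textbook definitions. This is not the paper's exact test procedure or thresholds (the slide does not give them): conductance is taken as crossing edges over the smaller side's volume, the entropy comparison as a KL divergence between two distributions, and degree dispersion as the coefficient of variation of node degrees. All function names are hypothetical.

```python
import math

def conductance(adj, s):
    """Crossing edges divided by the smaller volume; s must be a non-trivial subset."""
    s = set(s)
    crossing = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    return crossing / min(vol_s, vol_rest)

def kl_divergence(q, pi):
    """D(q || pi) = sum_v q_v log(q_v / pi_v); 0 when the distributions match."""
    return sum(q[v] * math.log(q[v] / pi[v]) for v in q if q[v] > 0)

def degree_dispersion(adj):
    """Coefficient of variation of degrees; 0 for a perfectly homogeneous graph."""
    degs = [len(nbrs) for nbrs in adj.values()]
    mean = sum(degs) / len(degs)
    var = sum((d - mean) ** 2 for d in degs) / len(degs)
    return math.sqrt(var) / mean

# Assumed examples: a 7-node ring (homogeneous) and a 5-node star (not homogeneous).
ring = {i: {i % 7 + 1, (i - 2) % 7 + 1} for i in range(1, 8)}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

phi = conductance(ring, {1, 2, 3, 4, 5})
uniform = {v: 1 / 7 for v in ring}
kl = kl_divergence(uniform, uniform)
print(phi, kl, degree_dispersion(ring), degree_dispersion(star))
```

The star's large dispersion value shows why the third test is needed: a hub-and-spoke graph can look well connected while having nothing like the homogeneous degrees of a structured P2P overlay.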

  24. Outline
     • Problem Definition
     • System Architecture
     • Approach:
       • Prefiltering Step
       • Clustering P2P Nodes
       • Validation
     • (*) Privacy Preserving Graph Algorithms (ignored)
     • Results & Discussion
     • Conclusion & My Comments

  25. Dataset (Graphs)
     • Background traffic communication graph:
       • Constructed from 1-day real-world traffic traces:
         • (1) Abilene's NetFlow trace (2009/10/22) (104,426 nodes)
         • (2) CAIDA packet-level trace (2009/1/11) (3,839,936 nodes)
     • Botnet graph:
       • Synthetically add links between randomly selected "bots" in the background traffic
       • For sensitivity tests, the botnet graph structures include: (1) de Bruijn, (2) Chord, (3) Kademlia, (4) LEET-Chord
     • Take the combined graph as the algorithm input

  26. An Algorithm Example
     • Background traffic communication graph:
       • GD: Abilene's trace
     • Botnet graph:
       • Gp: de Bruijn structure
       • Randomly select 10,000 nodes from GD
       • Parameters: m = 10 (outgoing links), n = 4 (dimensions)
     • Combined input G = <V, E>:
       • N = |V| = 104,426 nodes (Abilene)
       • |E| = 647,053 edges
     • Goal: extract Gp from GD as accurately as possible!
     [Figure: de Bruijn graph (m, n), illustrated with de Bruijn graph (2, 3)]
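A de Bruijn graph with parameters (m, n) can be generated in a few lines, which also shows why m = 10, n = 4 gives exactly 10,000 bot nodes. This sketch uses the standard construction (nodes are length-n strings over m symbols, encoded here as integers, and each node shifts in one new symbol); mapping these abstract nodes onto randomly chosen hosts, as the experiment does, is left out. `de_bruijn_edges` is a hypothetical helper name.

```python
def de_bruijn_edges(m, n):
    """Directed de Bruijn graph: node x (base-m, n digits) links to (x*m + b) mod m^n."""
    size = m ** n
    return [(x, (x * m + b) % size) for x in range(size) for b in range(m)]

small = de_bruijn_edges(2, 3)    # the (2, 3) graph from the slide's figure
botnet = de_bruijn_edges(10, 4)  # the (10, 4) overlay used in the example

n_bots = len({u for u, _ in botnet})
print(len(small), n_bots, len(botnet))
```

Each node has exactly m outgoing links, so the (10, 4) overlay has 10^4 nodes and 10^5 directed links before it is merged into the background graph.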

  27. An Algorithm Example (cont'd)
     • (step 1)
       • Perform a short random walk starting from every node
       • Get s_i (use r = 100)
       • K-means clustering derives 10 clusters
     • (step 2 & 3)
       • Only check the 4th cluster (yellow)
         • 17,576 nodes
         • Contains honey-net nodes
       • Recursively apply SybilInfer to this cluster and validate, in 3 iterations
       • Result: 10,143 nodes (TP: 9,905 / FP: 238)
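The numbers on this slide and the previous one can be turned into rates with simple arithmetic. The definitions below are assumptions, since the slides do not state them: detection rate is TP over the 10,000 inserted bots, and FP rate is FP over the non-bot nodes of the Abilene graph.

```python
total_nodes = 104_426   # |V| of the combined graph (Abilene)
bots = 10_000           # nodes selected for the de Bruijn botnet
tp, fp = 9_905, 238     # from the final cluster of 10,143 nodes

reported = tp + fp                        # size of the final cluster
detection_rate = tp / bots                # assumed definition: TP / inserted bots
fp_rate = fp / (total_nodes - bots)       # assumed definition: FP / non-bot nodes
precision = tp / reported                 # share of reported nodes that are bots

print(f"{detection_rate:.2%} detected, {fp_rate:.3%} FP rate, {precision:.2%} precision")
```

Under these assumed definitions the example lands at roughly 99% detection with an FP rate of about a quarter of a percent.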

  28. Results
     • (1) Effect of botnet topology
     • (2) Effect of botnet graph size
     • (3) Effect of background graph size
     • (4) Effect of reduced visibility
     • (5) Leveraging Honeynets
     • (6) Effect of inference algorithm

  29. (1) Effect of botnet topology
     • 4 botnet graphs → de Bruijn, Chord, Kademlia, LEET-Chord
     • Overall, performance is stable across these graphs
       • Detection rate > 95%
       • FP rate < 0.42% for LEET-Chord
     • Stealthiness vs. resilience:
       • Randomly remove nodes (%) and check failed paths: LEET-Chord is less resilient to failure
       • → the use of stealth to evade BotGrep would adversely affect the resilience of the botnet

  30. (2) Effect of botnet graph size
     • Experiment:
       • Keep the size of the background traffic graph constant
       • Vary the size of the synthetic botnet graph: 10^2, 10^3, 10^4, or 10^5 bots
     • Finding: as size increases, performance degrades (but only by a small amount)

  31. (3) Effect of background graph size
     • A larger background graph → the botnet is easier to hide inside (?)
     • Experiment:
       • Scale up the background graphs while retaining their statistical properties (procedure omitted here), then insert the botnet
       • E.g., CAIDA: 3.8 million → 30 million nodes (×9)
     • Finding: BotGrep scales well with network size!
     [Figure annotation: ÷9]

  32. (4) Effect of reduced visibility
     • Previous experiments: Gp is present in its entirety
     • Problems of reduced visibility:
       • BotGrep is deployed at only a subset of ISPs
       • Network traffic sampling
     • Experiment:
       • Study the Storm & Kraken botnets
       • Measure the number of inter-bot paths visible from ASes
       • Sort ASes (according to # of paths)
       • Combine the contributions of the sorted ASes and see their "visibility"
     • The 5 most-affected ASes contribute 57~65% visibility

  33. (4) Effect of reduced visibility (cont'd)
     • 57~65% visibility from the top 5 ASes
     • Apply BotGrep on a "combined graph" obtained by removing 40% of the links from the botnet graph

  34. (5) Leveraging Honeynets
     • Perform RWs starting only from the honey-net nodes to obtain a set of candidate P2P nodes in the prefiltering stage (?)
     • Finding:
       • Significantly reduces FP rates
       • Also improves efficiency: only 1 iteration of the modified SybilInfer Algo. is required

  35. (6) Effect of inference algorithm
     • Compare with other graph partitioning algorithms:
       • (1) Edge-importance-based community structure detection
         • Girvan-Newman Betweenness
         • Information centrality (too slow, not considered)
       • (2) Spectral-based approach
         • Modularity Eigenvector
         • Fast Greedy Modularity
     • Run-time: these do not scale well for large datasets → perform BFS and limit the visited nodes to a size of 2k

  36. Related Work (Botnet Detection)
     • Network-based approaches
       • Detect attack traffic
         • Exploit DNS usage patterns, using honeypots
       • Detect control (C&C) traffic
         • Traffic-signature-based detection
         • Statistical-traffic-analysis-based detection
     • Hybrid approaches
       • Detect attack & control traffic
       • Combine network-based and host-based approaches
     • Graph-based approaches
       • Centralized structure
       • P2P structure

  37. Conclusion
     • Goal: localize structured Comm. graphs within network traffic to identify botnet hosts and links
     • Propose BotGrep: searching for structured topologies and separating them from the background Comm. graph
       • Tackles the privacy-preserving issues
       • Achieves a low FP rate and a high detection rate
     • Future work:
       • Consider temporal variation
         • Observe how parts of the Comm. graph change over time
       • Distinguish other P2P structures
       • Address the botnet response problem
         • Do not completely disconnect a node, but mitigate its potential malicious activities

  38. My Comments
     • About the approach:
       • Of the 3 steps, the accuracy of the 1st step seems to be the key factor, but this is not proved
         • 1. Deciding k (in k-means)
         • 2. Clustering is based on the properties of each "node" (qi/di) after the RW; are nodes in the same cluster really connected?
         • Steps 2 & 3 could deal with this issue…
       • What causes FN? (pre-filtering? sampling?)
       • Why adopt the modified SybilInfer Algo. in this paper to remove FP?
         • The original problem deals with a P2P net + sybil nodes
       • Does the length of each RW affect the results?
       • Can we assign weights to edges?
         • E.g., # of connections between 2 nodes

  39. My Comments
     • Time issues:
       • The paper does not consider the effects of time, i.e., traffic logs from different time periods
       • (As mentioned) temporal variation of the Comm. graph
       • The effects of a small P2P network (at the early stage) vs. the effects of a large P2P network (lots of bots)
