BotGrep: Finding P2P Bots with Structured Graph Analysis

Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, Nikita Borisov (UIUC)
USENIX Security Symposium 2010

Graph Theory: Cut
• A cut C = (S, T) is a partition of the vertex set V of a graph G = (V, E)
  • Example: ({1, 2, 3, 4, 5}, {6, 7})
• The cut-set of a cut is the set of edges whose end points are in different subsets of the partition
  • Example: {(5, 6), (1, 7)}
• The size of a cut is the number of edges in the cut-set → small cut, min. cut, …
[Figure: example graph G = (V, E) on nodes 1 to 7, drawn without and with the cut]
http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
Speaker: Li-Ming Chen
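The cut-set definition above can be checked mechanically. The sketch below is illustrative only: the slide's figure is not recoverable, so the edge list is an assumption (a 7-node ring), chosen because it reproduces the cut-set {(5, 6), (1, 7)} quoted on the slide. The names `EDGES` and `cut_set` are hypothetical.

```python
# Assumed edge list (a 7-node ring); the slide's actual figure may differ.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 1)]


def cut_set(edges, s, t):
    """Return the edges that have one end point in s and the other in t."""
    s, t = set(s), set(t)
    return {(u, v) for u, v in edges
            if (u in s and v in t) or (u in t and v in s)}


S, T = {1, 2, 3, 4, 5}, {6, 7}
crossing = cut_set(EDGES, S, T)   # edges crossing the partition
cut_size = len(crossing)          # size of the cut
print(sorted(crossing), cut_size)
```

On this assumed graph the two crossing edges are (5, 6) and (7, 1), so the cut has size 2.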

Graph Theory: Random Walk (RW)
• RW: a trajectory that consists of taking successive random steps on a graph
  • E.g., the path traced by a molecule as it travels in a liquid
• RWs are usually assumed to be Markov processes
• Example: a walk that starts at node 1
[Figure: example graph on nodes 1 to 7; transition matrix (rows: from node 1 to 7, columns: to node 1 to 7); state Prob. after 1 RW step and after 2 RW steps]
http://en.wikipedia.org/wiki/Random_walk
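A small sketch of the random-walk example: build the transition matrix of a simple random walk and push a distribution that starts at node 1 through one and two steps. The graph is an assumption (the same 7-node ring as a stand-in for the lost figure), and `transition_matrix` and `step` are hypothetical helper names.

```python
from fractions import Fraction

# Assumed edge list (7-node ring); the slide's figure is not recoverable.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 1)]
N = 7


def transition_matrix(edges, n):
    """P[i][j] = 1/deg(i) if i and j are adjacent, else 0 (nodes are 1-based)."""
    adj = {i: set() for i in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    P = [[Fraction(0)] * n for _ in range(n)]
    for u, nbrs in adj.items():
        for v in nbrs:
            P[u - 1][v - 1] = Fraction(1, len(nbrs))
    return P


def step(q, P):
    """One step of the walk: q_next = q * P (row vector times matrix)."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(EDGES, N)
q0 = [Fraction(1)] + [Fraction(0)] * (N - 1)   # walk starts at node 1
q1 = step(q0, P)                               # Prob. after 1 RW step
q2 = step(q1, P)                               # Prob. after 2 RW steps
print(q1)
print(q2)
```

Exact fractions are used so the probabilities can be compared without rounding; on the assumed ring, one step splits the mass evenly between nodes 2 and 7.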

Graph Theory: Stationary Distribution
• A stationary distribution π is a vector, whose entries are non-negative and sum to 1, that satisfies π = πP, where P is the transition matrix of the walk
• Markov chain mixing time:
  • How large must t be until the time-t distribution (q^t) is approximately π, i.e., until the walk has converged to the stationary Dist.?
[Figure: Example 1 and Example 2 on the 7-node graph; in both, the distribution remains steady under further steps]
http://en.wikipedia.org/wiki/Markov_chain
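To make the two ideas concrete, the sketch below checks that the degree-proportional vector π_i = d_i / Σ_j d_j is unchanged by one step of a simple random walk on an undirected graph, and tracks how the total variation distance between q^t and π shrinks with t. The graph (a 7-node ring plus one chord) and all helper names are assumptions for illustration.

```python
from fractions import Fraction

# Assumed graph: 7-node ring plus the chord (1, 4), so degrees are not all equal.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 1), (1, 4)]
N = 7


def adjacency(edges, n):
    adj = {i: [] for i in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def walk_step(q, adj):
    """One simple-random-walk step: each node splits its mass evenly among neighbors."""
    nxt = [Fraction(0)] * len(q)
    for u, nbrs in adj.items():
        share = q[u - 1] / len(nbrs)
        for v in nbrs:
            nxt[v - 1] += share
    return nxt


def tv_distance(p, q):
    """Total variation distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


adj = adjacency(EDGES, N)
total_degree = sum(len(nbrs) for nbrs in adj.values())
pi = [Fraction(len(adj[u]), total_degree) for u in range(1, N + 1)]
is_stationary = walk_step(pi, adj) == pi      # pi = pi * P

q = [Fraction(1)] + [Fraction(0)] * (N - 1)   # start at node 1
tv_by_step = []
for _ in range(30):
    q = walk_step(q, adj)
    tv_by_step.append(tv_distance(q, pi))
print(is_stationary, float(tv_by_step[0]), float(tv_by_step[-1]))
```

The mixing time is then the smallest t for which the distance drops below a chosen tolerance; the tolerance itself is a free parameter, not something the slide specifies.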

Outline
• Problem Definition
• System Architecture
• Approach:
  • Prefiltering Step
  • Clustering P2P Nodes
  • Validation
  • (*) Privacy Preserving Graph Algorithms
• Results & Discussion
• Conclusion & My Comments

What is a Botnet?
• Bots: compromised hosts, “zombies”
• Botnets: networks of bots that are under the control of a human operator (botmaster)
• (Generally looks like) worm + C&C channel
• Command and Control (C&C) channel
  • Disseminates the botmaster’s commands to the bot army
[Figure labels: Communication (IRC, HTTP, …; can be encrypted); Attack (DoS, spamming, phishing sites, …); Propagation (worm: vulnerabilities, file sharing, P2P, …)]

Botnet Structure Change!!
• Centralized structure → P2P, why?
  • Growing size of botnets
    • → P2P communication is more efficient and robust
  • Development of mechanisms that detect traditional centralized C&C servers
    • → Botnets try to evade detection
• Question:
  • Can ISPs detect P2P botnets and use this as a basis for botnet defense?
[Figure: (traditional) centralized structure vs. P2P structure]

Problem & Proposed Solution
• Problem:
  • ISPs have significant visibility into the Comm. patterns
  • But how can botnet traffic be separated from background Internet traffic?
• Proposed approach: BotGrep
  • An algorithm that isolates P2P Comm. structure
  • Based only on the information about which pairs of nodes communicate with one another
    • Input: a communication graph
  • Still works when only a partial view of the Comm. graph is available
  • Can support “privacy preserving collaboration”

Challenges
• Background traffic volume is huge
• Background traffic is highly variable and continuously changing
• Botnet traffic blends in with the regular traffic of legitimate users
  • → The botnet is tightly integrated and can NOT be separated from the rest of the nodes by a small cut
• ISP collaboration → scaling issues
• ISP collaboration → privacy issues

BotGrep Architecture
[Figure: BotGrep architecture]
• Data source 1: observations from different network monitors, combined into a single Comm. graph
• Data source 2: misuse detection, borrowed to distinguish P2P bots from other P2P applications (speeds up botnet identification)
• Outputs: a set of suspect hosts (and links)

Inference System
• As mentioned, the botnet graph is embedded within a background Comm. graph
• One common feature of structured P2P graphs:
  • Fast mixing time (∵ highly structured)
• → BotGrep exploits this feature by
  • performing random walks to identify fast-mixing component(s) and isolate them from the rest of the Comm. graph

Problem Formulation
• Given a Comm. graph: G = (V, E)
  • (Note: the paper does not specify a clear time period)
• Assume a P2P graph Gp is embedded in G
  • → The remaining subgraph Gn contains non-P2P Comm.
• Goal:
  • Partition the input G into {Gp, Gn} in the presence of dynamic background traffic and with only partial visibility of the edges

Approach Overview (BotGrep)
• Idea:
  • Perform random walks and compare the relative mixing rates of subgraphs
• 3 steps:
  • (1) Pre-filtering (in practice, k-means clustering)
    • Extract a small set of candidate P2P nodes (+ FP)
  • (2) Clustering P2P Nodes (sampling)
    • Apply the modified SybilInfer Algo. to remove FP
  • (3) Validation
    • Validate step (2) based on the fast-mixing characteristic

Step (1): Pre-filtering
• Idea:
  • For short random walks, the state Prob. associated with nodes in the fast-mixing subgraph is likely to be closer to the “stationary distribution” than that of nodes in the slow-mixing subgraph
• Input: G = (V, E)
• Short RW of length t = log(|V|)
• The state Prob. of each node is normalized by its degree (∵ the stationary Dist. is proportional to node degrees), with a dampening constant to undermine high-degree nodes
• k-means: the goal is to minimize the sum of squares J from the points to their assigned cluster centers cj (J is the cluster score)
[Figure: 7-node example graph; the slide's equations (initial distribution, per-node score, and J) were images and are not recoverable here, including the speaker's “?? should be” annotation on one of them]
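A minimal sketch of the pre-filtering idea, under loudly stated assumptions: the walk starts from the uniform distribution, the per-node score is taken to be s_i = (q_i / d_i)^(1/r) with r as the dampening constant, and the clustering is a plain 1-D Lloyd's k-means with a deterministic quantile start. None of these details is confirmed by the slide, whose equations were lost; `short_walk_scores` and `kmeans_1d` are hypothetical names.

```python
import math


def short_walk_scores(adj, r=100.0):
    """Run a walk of length t = log|V| and return an assumed score per node."""
    n = len(adj)
    t = max(1, round(math.log(n)))
    q = {u: 1.0 / n for u in adj}              # assumed uniform start
    for _ in range(t):
        nxt = {u: 0.0 for u in adj}
        for u, nbrs in adj.items():
            share = q[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        q = nxt
    # Assumed score: degree-normalized probability, dampened by the root 1/r.
    return {u: (q[u] / len(adj[u])) ** (1.0 / r) for u in adj}


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns (centers, labels, J)."""
    pts = sorted(values)
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]

    def nearest(x):
        return min(range(k), key=lambda c: (x - centers[c]) ** 2)

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in values:
            groups[nearest(x)].append(x)
        new = [sum(g) / len(g) if g else centers[j]
               for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    labels = [nearest(x) for x in values]
    J = sum((x - centers[l]) ** 2 for x, l in zip(values, labels))
    return centers, labels, J


# Assumed toy graph: a 4-clique (nodes 0 to 3) with a path 3-4-5-6 attached.
ADJ = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5], 5: [4, 6], 6: [5]}
scores = short_walk_scores(ADJ)
centers, labels, J = kmeans_1d(list(scores.values()), k=2)
print(scores, labels, J)
```

With a large r the scores are compressed toward 1, which is one way a dampening constant could reduce the influence of high-degree nodes; whether this matches the paper's exact transform is not something the slide settles.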

Step (2): Clustering P2P Nodes
• Step (1)’s output: {G1, G2, …, Gk}
  • → Perform the “modified SybilInfer Algo.” on each subgraph to remove weakly connected nodes (FP)
• Concept of the modified SybilInfer Algo. (3 steps):
  • Get “traces” T
    • A trace represents a related vertex pair found by using RW*
  • Use sampling to get P2P nodes
    • Assume a cut X0 consists of P2P nodes
    • Check whether X’ is better than X0 according to probability
    • If better, X’ replaces X0; else X0 is retained
    • Repeat this over several runs
  • Get {X0, X1, …, XN}; decide P[node i is P2P] = ?
G. Danezis and P. Mittal, “SybilInfer: Detecting Sybil Nodes using Social Networks,” in Proc. NDSS, 2009.

Step (2): Clustering P2P Nodes (cont’d)
• Modified SybilInfer Algo.:
  • Step (1) Generation of traces:
    • Perform a number n of RWs starting at each node, each of length t = log(|V|)
    • q^t = q^(t-1) · P’ (P’ ensures that the “stationary Dist.” of the RW is uniform over all vertices)
    • → Traces T is the set of starting and ending vertex pairs of each RW (we are interested in these pairs traversed by the RW)
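The trace-generation step can be sketched as follows. The slide does not show how P’ is built, so the construction here is an assumption: the SybilInfer-style rule P’[u][v] = min(1/d_u, 1/d_v) for neighbors, with the leftover probability kept as a self-loop, which makes P’ symmetric and its stationary distribution uniform. The graph and the names `modified_transition` and `generate_traces` are hypothetical.

```python
import math
import random


def modified_transition(adj):
    """Assumed P': min(1/d_u, 1/d_v) per edge, remaining mass stays at u."""
    P = {}
    for u, nbrs in adj.items():
        row = {v: min(1.0 / len(nbrs), 1.0 / len(adj[v])) for v in nbrs}
        row[u] = max(0.0, 1.0 - sum(row.values()))   # self-loop
        P[u] = row
    return P


def generate_traces(adj, walks_per_node, rng):
    """Return (start, end) pairs of walks of length t = log|V| under P'."""
    P = modified_transition(adj)
    t = max(1, round(math.log(len(adj))))
    traces = []
    for start in adj:
        for _ in range(walks_per_node):
            node = start
            for _ in range(t):
                targets = list(P[node])
                weights = [P[node][v] for v in targets]
                node = rng.choices(targets, weights=weights)[0]
            traces.append((start, node))
    return traces


# Assumed toy graph: a 4-clique (nodes 0 to 3) with a path 3-4-5-6 attached.
ADJ = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5], 5: [4, 6], 6: [5]}
traces = generate_traces(ADJ, walks_per_node=3, rng=random.Random(0))
print(traces[:5])
```

Because the assumed P’ is symmetric, its columns also sum to 1, so the uniform vector satisfies π = πP’; that is the property the slide attributes to P’.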

(My Observation)
• Traces T
  • Set of vertex pairs
  • The end points reflect the connectivity of the graph
• → High-degree nodes may walk to low-degree nodes
  • ~ equal Prob. to each of their neighbors
• → RW may be trapped by low-degree nodes (if they are connected)
• P’ is a symmetric matrix
  • RW will not converge by using P’ !!
[Figure: 7-node example graph and its P’ matrix (rows: from node 1 to 7, columns: to node 1 to 7)]

Step (2): Clustering P2P Nodes (cont’d)
• Modified SybilInfer Algo.:
  • Step (2) A (Bayesian) Prob. model for P2P nodes:
    • Given the set of traces T, compute the Prob. that any set of nodes X are all P2P nodes
[Equations shown as images on the slide; the surviving annotations on their terms are: “goal”; “can be acquired”; “fixed”; “assign a uniform prob. to all walks ending in the set X”; “trace ends in vertex v in X”; “trace ends in vertex a not in X”; “number of RWs ending in vertex a (or v)”]

Step (2): Clustering P2P Nodes (cont’d)
• Modified SybilInfer Algo.:
  • Step (3) Metropolis-Hastings Sampling:
    • Enumerating all subsets X of the graph is impossible
    • → Sample configurations Xi following this distribution
      • Xi ~ P(X|T)
    • Given a set of samples S = {X0, X1, …, XN}, we can compute the marginal Prob. of a node being a P2P node as the fraction of samples that contain it
  • Threshold:
    • If P[node i is P2P] > 0.5, i.e., node i exists in more than half of the samples, then P2P; else non-P2P (FP)
• Next step: validate the P2P group!
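The marginal and threshold rule above amounts to counting how often each node appears across the sampled sets. The sketch below assumes the samples are already available as Python sets (the sampler itself is not shown on the slide and is not reproduced here); the function names and the toy samples are hypothetical.

```python
def marginal_p2p_probability(samples, nodes):
    """P[node i is P2P] estimated as the fraction of samples that contain i."""
    n = len(samples)
    return {i: sum(1 for x in samples if i in x) / n for i in nodes}


def classify(samples, nodes, threshold=0.5):
    """Keep the nodes whose marginal probability exceeds the threshold."""
    probs = marginal_p2p_probability(samples, nodes)
    return {i for i, p in probs.items() if p > threshold}


# Made-up samples X0..X3 over nodes 1 to 5, for illustration only.
samples = [{1, 2, 3}, {1, 2}, {1, 3, 4}, {1, 2, 3}]
nodes = range(1, 6)
probs = marginal_p2p_probability(samples, nodes)
p2p_nodes = classify(samples, nodes)
print(probs, p2p_nodes)
```

Here nodes 1, 2, and 3 appear in more than half of the four samples and are kept, while nodes 4 and 5 are treated as false positives.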

Step (3): Validation
• SybilInfer only partitions a graph into two subgraphs
  • We need to use multiple iterations to get to the desired fastest-mixing subgraph
  • → This requires a validation test
• If the cut passes all 3 validation tests below, then we are done:
  • (1) Graph conductance test
  • (2) q(t) entropy comparison test
  • (3) Degree-homogeneity test

Step (3): Validation (cont’d)
• (1) Graph conductance test
  • P2P network is fast mixing → no small cut → graph conductance should be high
• (2) q(t) entropy comparison test
  • RWs on structured homogeneous P2P graphs are characterized by a high-entropy state Prob. Dist.
  • The KL divergence measure should be close to 0
• (3) Degree-homogeneity test
  • To rule out star topologies!
  • Measure the dispersion of the degree values → should be homogeneous
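The three quantities named in the tests can be computed with textbook definitions, sketched below. These are generic formulas, not necessarily the exact ones used in the paper: conductance is evaluated for one given cut, KL divergence compares two distributions supplied by the caller, and degree dispersion is measured with the coefficient of variation. No pass/fail thresholds are implied; all names and graphs are illustrative assumptions.

```python
import math


def conductance(adj, s):
    """Crossing edges of the cut (s, rest) divided by the smaller side's volume."""
    s = set(s)
    crossing = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    denom = min(vol_s, vol_rest)
    return crossing / denom if denom else 0.0


def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def degree_cv(adj, nodes):
    """Coefficient of variation of the degrees (0 means perfectly homogeneous)."""
    degs = [len(adj[u]) for u in nodes]
    mean = sum(degs) / len(degs)
    var = sum((d - mean) ** 2 for d in degs) / len(degs)
    return math.sqrt(var) / mean


# Assumed toy graphs: an 8-node ring and a star with 5 leaves.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

ring_phi = conductance(ring, {0, 1, 2, 3})        # 2 crossing edges / volume 8
uniform_kl = kl_divergence([0.25] * 4, [0.25] * 4)
ring_cv = degree_cv(ring, ring)
star_cv = degree_cv(star, star)
print(ring_phi, uniform_kl, ring_cv, star_cv)
```

The star's large degree dispersion compared with the ring's zero dispersion illustrates why a degree-homogeneity check helps rule out star topologies.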

Outline
• Problem Definition
• System Architecture
• Approach:
  • Prefiltering Step
  • Clustering P2P Nodes
  • Validation
  • (*) Privacy Preserving Graph Algorithms (ignored)
• Results & Discussion
• Conclusion & My Comments

Dataset (Graphs)
• Background traffic communication graph:
  • Constructed from 1-day real-world traffic traces:
    • (1) Abilene’s NetFlow trace (2009/10/22) (104,426 nodes)
    • (2) CAIDA packet-level trace (2009/1/11) (3,839,936 nodes)
• Botnet graph:
  • Synthetically add links between randomly selected “bots” in the background traffic
  • For the sensitivity test, the botnet graph structures include:
    • (1) de Bruijn, (2) Chord, (3) Kademlia, (4) LEET-Chord
• → Take the combined graph as the algorithm input

An Algorithm Example
• Background traffic communication graph:
  • GD: Abilene’s trace
• Botnet graph:
  • Gp: de Bruijn structure
  • Randomly select 10,000 nodes from GD
  • Parameters: m = 10 (outgoing links), n = 4 (dimensions)
• Combined input G = <V, E>:
  • N = |V| = 104,426 nodes (Abilene)
  • |E| = 647,053 edges
• Goal:
  • Extract Gp from GD as accurately as possible!
[Figure: de Bruijn graph (m, n), illustrated with de Bruijn graph (2, 3)]
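For readers unfamiliar with the topology, the sketch below builds a de Bruijn graph (m, n) in the standard way: nodes are length-n words over m symbols, and each node links to the m words obtained by shifting left and appending a symbol. With m = 10 and n = 4 this gives 10^4 = 10,000 nodes with 10 outgoing links each, consistent with the parameters on the slide. How the paper maps these nodes onto hosts in GD is not shown here; `de_bruijn_graph` is a hypothetical helper name.

```python
from itertools import product


def de_bruijn_graph(m, n):
    """Map each length-n word over m symbols to its m successors."""
    graph = {}
    for word in product(range(m), repeat=n):
        # Shift left by one position and append each possible symbol.
        graph[word] = [word[1:] + (s,) for s in range(m)]
    return graph


small = de_bruijn_graph(2, 3)     # the (2, 3) example: 8 nodes, 2 links each
big = de_bruijn_graph(10, 4)      # the slide's parameters: m = 10, n = 4
print(len(small), small[(0, 1, 1)])
print(len(big))
```

Note that constant words such as (0, 0, 0) link to themselves under this construction; whether the paper keeps or drops such self-loops is not stated on the slide.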

An Algorithm Example (cont’d)
• (Step 1)
  • Perform a short random walk starting from every node
  • Get si (use r = 100)
  • K-means clustering derives 10 clusters
• (Steps 2 & 3)
  • Only check the 4th cluster (yellow)
    • 17,576 nodes
    • Contains honeynet nodes
  • Recursively apply SybilInfer to this cluster and validate, in 3 iterations
    • 10,143 nodes (TP: 9,905 / FP: 238)

Results
• (1) Effect of botnet topology
• (2) Effect of botnet graph size
• (3) Effect of background graph size
• (4) Effect of reduced visibility
• (5) Leveraging Honeynets
• (6) Effect of inference algorithm

(1) Effect of botnet topology
• 4 botnet graphs → de Bruijn, Chord, Kademlia, LEET-Chord
• Overall, performance is stable across these graphs
  • Detection rate > 95%
  • FP rate < 0.42% for LEET-Chord
• Stealthiness vs. resilience:
  • Randomly remove nodes (%) and check the failed paths
  • LEET-Chord is less resilient to failure
  • → Using stealth to evade BotGrep would adversely affect the resilience of the botnet

(2) Effect of botnet graph size
• Experiment:
  • Keep the size of the background traffic graph constant
  • Vary the size of the synthetic botnet graph
    • 10^2, 10^3, 10^4, or 10^5 bots
• Finding:
  • As the botnet size increases, performance degrades (but only by a small amount)

(3) Effect of background graph size
• A larger background graph → the botnet can hide inside it more easily (?)
• Experiment:
  • Scale up the background graphs while retaining their statistical properties (the procedure is omitted here), then insert the botnet
    • e.g., CAIDA: 3.8 million → 30 million nodes (×9)
• Finding:
  • BotGrep scales well with network size!
[Figure annotation: ÷9]

(4) Effect of reduced visibility
• Previous experiments:
  • Gp is present in its entirety
• Problems of reduced visibility:
  • BotGrep is deployed at only a subset of ISPs
  • Network traffic sampling
• Experiment:
  • Study the Storm & Kraken botnets
  • Measure the number of inter-bot paths visible from each AS
  • Sort the ASes (according to # of paths)
  • Let the top-ranked ASes cooperate by contributing their views, and see their combined “visibility”
• Result: the 5 most-affected ASes, contributing their views, give 57~65% visibility

(4) Effect of reduced visibility (cont’d)
• 57~65% visibility from the top 5 ASes
• Apply BotGrep to a “combined graph” obtained by removing 40% of the links from the botnet graph

(5) Leveraging Honeynets
• Perform RWs starting only from the honeynet nodes to obtain a set of candidate P2P nodes in the prefiltering stage (?)
• Finding:
  • Significantly reduces FP rates
  • Also improves efficiency
    • Only 1 iteration is required for the modified SybilInfer Algo.

(6) Effect of inference algorithm
• Compare with other graph partitioning algorithms
  • (1) Edge-importance-based community structure detection
    • Girvan-Newman betweenness
    • Information centrality (too slow, not considered)
  • (2) Spectral-based approach
    • Modularity eigenvector
    • Fast greedy modularity
• Run-time: these do not scale well to large datasets
  • → Perform BFS and limit the visited nodes to a size of 2k

Related Work (Botnet Detection)
• Network-based approaches
  • Detect attack traffic
    • Exploit DNS usage patterns, using honeypots
  • Detect control (C&C) traffic
    • Traffic-signature-based detection
    • Statistical-traffic-analysis-based detection
• Hybrid approaches
  • Detect attack & control traffic
  • Combine network-based and host-based approaches
• Graph-based approaches
  • Centralized structure
  • P2P structure

Conclusion
• Goal: localize structured Comm. graphs within network traffic to identify botnet hosts and links
• Proposes BotGrep: searching for structured topologies and separating them from the background Comm. graph
  • Tackles the privacy-preserving issues
  • Achieves a low FP rate and a high detection rate
• Future work:
  • Consider temporal variation
    • Observe how parts of the Comm. graph change over time
  • Distinguish other P2P structures
  • Address the botnet response problem
    • Do not completely disconnect a node, but mitigate its potential malicious activities

My Comments
• About the approach:
  • Among the 3 steps, the accuracy of the 1st step seems to be the key factor, but this is not proved
    • 1. Deciding k (in k-means)
    • 2. Clustering is based on the properties of each “node” (qi/di) after the RW; are nodes in the same cluster really connected?
      • Steps 2 & 3 could deal with this issue…
  • What causes the FN? (Pre-filtering? Sampling?)
  • Why does this paper adopt the modified SybilInfer Algo. to remove FP?
    • The original problem deals with a P2P net + Sybil nodes
  • Does the length of each RW affect the results?
  • Can we assign weights to edges?
    • e.g., # of connections between 2 nodes

My Comments
• Time issues:
  • The paper does not consider the effects of time, e.g., traffic logs from different time periods
    • (As mentioned) temporal variation of the Comm. graph
  • The effects of a small P2P network (at the early stage) vs. the effects of a large P2P network (lots of bots)
