This study explores how to identify botnets by analyzing Netflow data, focusing on client-to-client traffic that is not P2P. It covers the hypothesis, methodology, problems with Netflow, and strategy. Current status: basic tools are built, and visualization is being used to understand the data.
Netflow and Botnets
Steven M. Bellovin
Columbia University
smb
Hypothesis
• Most hosts are either clients or servers
• P2P traffic is an exception
• Bots talk to other bots, and thus to the command-and-control node
• By looking for unusual traffic flows (client-to-client traffic that isn’t P2P), we can find bots
Methodology
• Use Netflow data to identify clients and servers
• Classify nodes as clients or servers
• Build a traffic matrix from the data to see which clients talk to which other clients
• Exclude P2P traffic, which is generally identifiable based on flow size
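The traffic-matrix step above can be sketched roughly as follows. This is a minimal illustration, not the tooling described in the talk: the function name `build_client_matrix`, the tuple layout of a flow, and the `P2P_BYTES_THRESHOLD` cutoff are all assumptions. In particular, the slides say only that P2P is identifiable by flow size; treating large flows as P2P is a guess made here for the example.

```python
from collections import defaultdict

# Hypothetical cutoff: flows at or above this many bytes are treated as P2P-like.
# The slides do not give a value or say which direction the size test goes.
P2P_BYTES_THRESHOLD = 1_000_000


def build_client_matrix(flows, roles, p2p_bytes=P2P_BYTES_THRESHOLD):
    """Count client-to-client flows, skipping flows that look like P2P by size.

    flows: iterable of (src_addr, dst_addr, byte_count) tuples.
    roles: dict mapping address -> "client" or "server".
    Returns {src: {dst: flow_count}} for client-to-client pairs only.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for src, dst, nbytes in flows:
        # Keep only flows where both endpoints were classified as clients.
        if roles.get(src) != "client" or roles.get(dst) != "client":
            continue
        # Drop flows whose size suggests P2P (assumed: large transfers).
        if nbytes >= p2p_bytes:
            continue
        matrix[src][dst] += 1
    return {src: dict(dsts) for src, dsts in matrix.items()}


roles = {"10.0.0.1": "client", "10.0.0.2": "client", "10.0.0.9": "server"}
flows = [
    ("10.0.0.1", "10.0.0.9", 4_000),      # client -> server: ignored
    ("10.0.0.1", "10.0.0.2", 600),        # small client -> client: counted
    ("10.0.0.1", "10.0.0.2", 5_000_000),  # large client -> client: treated as P2P
    ("10.0.0.2", "10.0.0.1", 450),        # small client -> client: counted
]
matrix = build_client_matrix(flows, roles)
print(matrix)
```

The remaining entries are the candidate "unusual" flows; whether they indicate bots still depends on how reliable the client/server labels in `roles` are.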
Netflow
• Originally from Cisco; now implemented by most router vendors
• Also an IETF “Proposed Standard” (as IPFIX)
• Records “flow information” for “connections” through a given router: src/dst pairs (addresses and port numbers), length, timing, etc.
• Intended for accounting and for traffic engineering
Problems with Netflow
• Flows are unidirectional; need two records for complete picture
• This is a consequence of Internet topology; most inter-ISP connections follow asymmetric paths
• Routers often deliver sampled data; can miss flow start/end packets
• Does not give unambiguous indication of client versus server
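To illustrate the first problem, here is one way the two unidirectional records of a connection might be matched up. It is a sketch under simplifying assumptions: records are plain `(src, sport, dst, dport, proto)` tuples, the name `pair_flows` is hypothetical, and timestamps are ignored, so reuse of the same port pair over time is not handled. With asymmetric routing or sampling, many conversations will legitimately end up in the one-way list.

```python
from collections import defaultdict


def pair_flows(records):
    """Match unidirectional flow records into two-way conversations.

    records: iterable of (src, sport, dst, dport, proto) tuples.
    Returns (complete, one_way): lists of direction-independent keys,
    split by whether records were seen in both directions or only one.
    """
    senders = defaultdict(set)
    for src, sport, dst, dport, proto in records:
        a, b = (src, sport), (dst, dport)
        # Order the endpoints so both directions map to the same key.
        key = (proto, min(a, b), max(a, b))
        # Remember which endpoint sent this record.
        senders[key].add(a)
    complete = [k for k, s in senders.items() if len(s) == 2]
    one_way = [k for k, s in senders.items() if len(s) == 1]
    return complete, one_way


records = [
    ("10.0.0.1", 51000, "10.0.0.9", 80, 6),   # request direction
    ("10.0.0.9", 80, "10.0.0.1", 51000, 6),   # matching reply direction
    ("10.0.0.2", 52000, "10.0.0.9", 443, 6),  # reply never observed
]
complete, one_way = pair_flows(records)
print(complete)
print(one_way)
```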
Strategy
• Build tools at Columbia
• Easy access to machines and data
• Use existing archive of CU netflow data
• Unclear if there are botnets present; get classification right first
• Get other netflow archives (e.g., from predict.org)
• Bring nominally-working code to AT&T to experiment with large-scale datasets
• Compare with previous results from AT&T as check on correctness
Node Classification
• Must use heuristics
• Flag field in netflow data doesn’t show client vs. server
• Timestamp not useful because of sampling
• Current strategy: look at port number distribution
• Clients usually use ports 48K-64K
• Considering using node degree
• But: problems with low-activity hosts?
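A port-distribution heuristic of the kind listed above might look like the sketch below. Only the 48K-64K range comes from the slide; the function name `classify_host`, the fraction thresholds, and the minimum flow count are assumptions chosen for illustration. The "unknown" and "ambiguous" outcomes correspond to the low-activity and hard-to-classify hosts the slides mention.

```python
HIGH_PORT_MIN = 48 * 1024      # 49152, the "48K" lower bound from the slide
HIGH_PORT_MAX = 64 * 1024 - 1  # 65535, the top of the port space


def classify_host(local_ports, client_fraction=0.8, server_fraction=0.2,
                  min_flows=5):
    """Guess a host's role from the ports it used on its own side of each flow.

    local_ports: list of the host's own port number, one entry per flow.
    The thresholds are hypothetical tuning knobs, not values from the talk.
    """
    if len(local_ports) < min_flows:
        return "unknown"  # too little activity to judge
    high = sum(1 for p in local_ports if HIGH_PORT_MIN <= p <= HIGH_PORT_MAX)
    fraction = high / len(local_ports)
    if fraction >= client_fraction:
        return "client"   # mostly high ports on the host's side
    if fraction <= server_fraction:
        return "server"   # mostly low, well-known ports
    return "ambiguous"    # mixed behaviour: needs a closer look


print(classify_host([50000, 51000, 52000, 60000, 61000]))
print(classify_host([80, 80, 443, 80, 80]))
print(classify_host([80, 50000, 51000, 443, 25, 60000]))
print(classify_host([50000]))
```

Note that many operating systems draw client ports from ranges outside 48K-64K, so a real classifier would likely need a wider or configurable range.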
Classification is Hard
• Simple heuristics have not been satisfactory
• Building visualization tools to help us understand the data
Ambiguous Host
Ambiguous Host Scatter Plot
Is this the sort of host we’re looking for?
Current Status
• Have basic tools built
• Working with visualization tools to understand the data
• Next steps:
  • Refine classification algorithms
  • Confirm analysis of bots in sample data
  • Try tools on larger dataset