Information Fusion

Information Fusion Ganesh Godavari

DDoS Data Set
• DARPA DDoS data set (2000) is available
• MIT Lincoln Laboratory
• Data set spans approximately 3 hours
• The five phases of the attack scenario depicted in [1]:
  • IP sweep of the Air Force Base (AFB) from a remote site
  • Probe of live IPs to look for the sadmind daemon running on Solaris hosts
  • Break-ins via the sadmind vulnerability, both successful and unsuccessful, on those hosts
  • Installation of the trojan mstream DDoS software on three hosts at the AFB
  • Launching the DDoS

Related Work
• Charu C. Aggarwal and Philip S. Yu (2001) "Outlier detection for high dimensional data", ACM SIGMOD International Conference on Management of Data, pp. 37-46
• John McHugh (2000) "Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory", ACM TISSEC, 3(4), pp. 262-294
• Risto Vaarandi (2003) "A Data Clustering Algorithm for Mining Patterns From Event Logs", IEEE Workshop on IP Operations and Management

Attack Scenario [1]

Phase 1 Attack (DDoS Data Set)

Id  Date        Time      Duration  SrcIP           Target IP       Analyzer        Service
1   03/07/2000  09:51:36  00:00:00  202.77.162.213  172.16.115.5    tcpdump_inside  icmp-E-R
2   03/07/2000  09:51:36  00:00:05  172.16.112.194  202.77.162.213  tcpdump_inside  icmp-E-Rp
3   03/07/2000  09:51:36  00:00:00  202.77.162.213  172.16.115.20   tcpdump_inside  icmp-E-R
4   03/07/2000  09:51:36  00:00:00  172.16.115.20   202.77.162.213  tcpdump_inside  icmp-E-Rp
5   03/07/2000  09:51:38  00:00:00  202.77.162.213  172.16.115.87   tcpdump_inside  icmp-E-R
6   03/07/2000  09:51:38  00:00:00  172.16.115.87   202.77.162.213  tcpdump_inside  icmp-E-Rp
7   03/07/2000  09:51:41  00:00:00  202.77.162.213  172.16.115.234  tcpdump_inside  icmp-E-R
8   03/07/2000  09:51:50  00:00:00  202.77.162.213  172.16.113.50   tcpdump_inside  icmp-E-R
9   03/07/2000  09:51:50  00:00:00  172.16.113.50   202.77.162.213  tcpdump_inside  icmp-E-Rp
10  03/07/2000  09:51:51  00:00:00  202.77.162.213  172.16.113.84   tcpdump_inside  icmp-E-R
11  03/07/2000  09:51:51  00:00:09  172.16.112.194  202.77.162.213  tcpdump_inside  icmp-E-Rp
12  03/07/2000  09:51:51  00:00:00  202.77.162.213  172.16.113.105  tcpdump_inside  icmp-E-R
13  03/07/2000  09:51:51  00:00:00  172.16.113.105  202.77.162.213  tcpdump_inside  icmp-E-Rp
14  03/07/2000  09:51:52  00:00:00  202.77.162.213  172.16.113.148  tcpdump_inside  icmp-E-R
:   :           :         :         :               :               :               :
    03/07/2000  09:52:00  00:00:00  202.77.162.213  172.16.112.194  tcpdump_inside  icmp-E-R
    03/07/2000  09:52:00  00:00:00  202.77.162.213  172.16.112.207  tcpdump_inside  icmp-E-R

icmp-E-R  => icmp-echo-request
icmp-E-Rp => icmp-echo-reply

Algorithm
Step 1: Go over the data file and build the vocabulary
• Read all the unique fields in the data file
Step 2: Identify the frequent vocabulary in the data file
• How should frequency be determined? How can one determine the frequency threshold?
Step 3: Generate cluster candidates
• Lines containing the same frequent words form a cluster
Step 4: Identify temporal relationships between cluster candidates
• The 24 relationships of data
Step 5: Generate unique lines
• Lines in the data file, based on the candidate clusters
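Steps 1 to 3 can be sketched as follows. This is only a minimal illustration, not the finished algorithm: the function names, the whitespace-split input format, and the threshold value of 2 are assumptions made for the example, and a "word" is taken to be a (column, value) pair so that the same value in different columns is counted separately. Steps 4 and 5 are not covered.

```python
from collections import Counter, defaultdict

def build_vocabulary(rows):
    """Step 1: count every (column, word) pair across all rows."""
    return Counter((col, word) for row in rows for col, word in enumerate(row))

def frequent_words(vocab, threshold):
    """Step 2: keep the pairs that occur more than `threshold` times."""
    return {pair for pair, count in vocab.items() if count > threshold}

def cluster_candidates(rows, frequent):
    """Step 3: rows containing exactly the same frequent words share a candidate."""
    clusters = defaultdict(list)
    for index, row in enumerate(rows):
        signature = tuple(
            (col, word) for col, word in enumerate(row) if (col, word) in frequent
        )
        if signature:  # rows without any frequent word join no candidate
            clusters[signature].append(index)
    return dict(clusters)

# Rows 5 to 9 of the Phase 1 table, with the Id column left out.
rows = [line.split() for line in [
    "03/07/2000 09:51:38 00:00:00 202.77.162.213 172.16.115.87 tcpdump_inside icmp-E-R",
    "03/07/2000 09:51:38 00:00:00 172.16.115.87 202.77.162.213 tcpdump_inside icmp-E-Rp",
    "03/07/2000 09:51:41 00:00:00 202.77.162.213 172.16.115.234 tcpdump_inside icmp-E-R",
    "03/07/2000 09:51:50 00:00:00 202.77.162.213 172.16.113.50 tcpdump_inside icmp-E-R",
    "03/07/2000 09:51:50 00:00:00 172.16.113.50 202.77.162.213 tcpdump_inside icmp-E-Rp",
]]

vocab = build_vocabulary(rows)
frequent = frequent_words(vocab, threshold=2)  # threshold chosen only for this sample
clusters = cluster_candidates(rows, frequent)
for signature, members in clusters.items():
    print(members, [word for _, word in signature])
```

With this sample, the three echo requests end up in one candidate and the two echo replies in another. A lower threshold would make the timestamps frequent as well and split the candidates further, which is the sensitivity the threshold question in Step 2 is about.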

Need Suggestions
• Is it safe to assume that a threshold parameter is provided?
• Cluster candidate generation can generate too much data (next slide shows how)

Cluster Candidate Generation
• Data set has 8 dimensions
• Frequent words (written as a 4-byte column number followed by the word) with frequency above the threshold of 10:
  • 0004202.77.162.213 repeated 22 times
  • 000103/07/2000 repeated 33 times
  • 000300:00:00 repeated 31 times
  • 0007icmp-echo-request repeated 22 times
  • 0007icmp-echo-reply repeated 11 times
  • 0006tcpdump_inside repeated 33 times
  • 0005202.77.162.213 repeated 11 times
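The column-prefixed word format above can be illustrated with a short sketch. The helper names `encode`, `decode`, and `count_tokens` are hypothetical, and the sketch assumes columns are numbered from 0 with the Id in column 0, which is consistent with the date appearing as column 0001 and the service as column 0007.

```python
from collections import Counter

def encode(column, word):
    """Prefix a word with its zero-padded 4-digit column number."""
    return f"{column:04d}{word}"

def decode(token):
    """Split an encoded token back into (column, word)."""
    return int(token[:4]), token[4:]

def count_tokens(rows):
    """Count how often each encoded token occurs in the given rows."""
    return Counter(encode(col, word) for row in rows for col, word in enumerate(row))

# Rows 1 and 2 of the Phase 1 table, including the Id column.
rows = [line.split() for line in [
    "1 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.5 tcpdump_inside icmp-E-R",
    "2 03/07/2000 09:51:36 00:00:05 172.16.112.194 202.77.162.213 tcpdump_inside icmp-E-Rp",
]]

counts = count_tokens(rows)
print(counts["000103/07/2000"])      # date appears in both rows
print(counts["0004202.77.162.213"])  # address as source
print(counts["0005202.77.162.213"])  # same address as target, counted separately
```

The prefix keeps the attacker address as source (column 4) distinct from the same address as target (column 5), which is why it shows up twice in the frequent-word list with different counts.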

Candidate Generation Example
• Example
  03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.5 tcpdump_inside icmp-E-R
  03/07/2000 09:51:36 00:00:05 172.16.112.194 202.77.162.213 tcpdump_inside icmp-E-Rp
  03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.20 tcpdump_inside icmp-E-R
  03/07/2000 09:51:36 00:00:00 172.16.115.20 202.77.162.213 tcpdump_inside icmp-E-Rp
• The first field is common to all lines, so should it be considered a candidate cluster?
  Cluster 1 = { line 1, line 2, line 3, line 4 }
  Cluster 2 = { line 1, line 3, line 4 }
  Cluster 3 = { line 1, line 3 }
  Cluster 4 = { line 2, line 4 }
  Cluster 5 = { line 1, line 2, line 3, line 4 }
  Cluster 6 = { line 1, line 3 }
  Cluster 7 = { line 2, line 4 }
• Reduction, but loss of information?
  • Cluster 1 = { line 1, line 3 }
  • Cluster 2 = { line 2 }
  • Cluster 3 = { line 4 }
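The two groupings on this slide can be reproduced with a small sketch. It assumes that the first grouping builds one cluster per frequent word and the reduced grouping builds one cluster per distinct combination of frequent words in a line; that reading matches the cluster memberships listed above but is an interpretation. The frequent words are taken from the previous slide, with the service names abbreviated as in the example lines, and a line number is prepended so the column numbers agree with that slide.

```python
from collections import defaultdict

# Frequent (column, word) pairs from the previous slide; column 0 is the Id.
FREQUENT = {
    (1, "03/07/2000"),
    (3, "00:00:00"),
    (4, "202.77.162.213"),
    (5, "202.77.162.213"),
    (6, "tcpdump_inside"),
    (7, "icmp-E-R"),
    (7, "icmp-E-Rp"),
}

LINES = [
    "1 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.5 tcpdump_inside icmp-E-R",
    "2 03/07/2000 09:51:36 00:00:05 172.16.112.194 202.77.162.213 tcpdump_inside icmp-E-Rp",
    "3 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.20 tcpdump_inside icmp-E-R",
    "4 03/07/2000 09:51:36 00:00:00 172.16.115.20 202.77.162.213 tcpdump_inside icmp-E-Rp",
]

per_word = defaultdict(list)       # one cluster per frequent word
per_signature = defaultdict(list)  # one cluster per combination of frequent words

for line in LINES:
    fields = line.split()
    line_no = int(fields[0])
    hits = tuple(
        (col, word) for col, word in enumerate(fields) if (col, word) in FREQUENT
    )
    for hit in hits:
        per_word[hit].append(line_no)
    per_signature[hits].append(line_no)

print(len(per_word), "per-word clusters")
print(sorted(per_signature.values()))
```

Grouping per word yields seven overlapping clusters for only four lines, while grouping per combination yields three disjoint ones. The price of the reduction is that line 2 and line 4 are separated solely because of their duration field, even though both are echo replies to the same host.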

Work to be done • Complete the algorithm and coding part

References
[1] MIT Lincoln Laboratory, http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html
