
Information Fusion

This project analyzes the DARPA DDoS data set to study a specific attack scenario, including IP sweep, sadmind probe, break-ins, trojan installation, and DDoS launch. The algorithm detects outliers and generates cluster candidates based on frequent vocabulary. Further work is needed to complete the algorithm implementation.


Presentation Transcript


  1. Information Fusion Ganesh Godavari

  2. DDoS Data Set
     • DARPA DDoS data set (2000) is available
       • MIT Lincoln Laboratory
     • Data set spans approximately 3 hours
     • The five phases of the attack scenario depicted [1]:
       • IP sweep of the Air Force Base (AFB) from a remote site
       • Probe of live IPs to look for the sadmind daemon running on Solaris hosts
       • Break-ins via the sadmind vulnerability, both successful and unsuccessful, on those hosts
       • Installation of the trojan mstream DDoS software on three hosts at the AFB
       • Launching the DDoS

  3. Related Work
     • Charu C. Aggarwal and Philip S. Yu (2001) "Outlier detection for high dimensional data", ACM SIGMOD International Conference on Management of Data, pp. 37-46
     • John McHugh (2000) "Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory", ACM TISSEC, 3(4), pp. 262-294
     • Risto Vaarandi (2003) "A Data Clustering Algorithm for Mining Patterns From Event Logs", IEEE Workshop on IP Operations and Management

  4. Attack Scenario [1]

  5. Phase 1 Attack (DDoS Data Set)

     Id  Date        Time      Duration  SrcIP            Target IP        Analyzer        Service
     1   03/07/2000  09:51:36  00:00:00  202.77.162.213   172.16.115.5     tcpdump_inside  icmp-E-R
     2   03/07/2000  09:51:36  00:00:05  172.16.112.194   202.77.162.213   tcpdump_inside  icmp-E-Rp
     3   03/07/2000  09:51:36  00:00:00  202.77.162.213   172.16.115.20    tcpdump_inside  icmp-E-R
     4   03/07/2000  09:51:36  00:00:00  172.16.115.20    202.77.162.213   tcpdump_inside  icmp-E-Rp
     5   03/07/2000  09:51:38  00:00:00  202.77.162.213   172.16.115.87    tcpdump_inside  icmp-E-R
     6   03/07/2000  09:51:38  00:00:00  172.16.115.87    202.77.162.213   tcpdump_inside  icmp-E-Rp
     7   03/07/2000  09:51:41  00:00:00  202.77.162.213   172.16.115.234   tcpdump_inside  icmp-E-R
     8   03/07/2000  09:51:50  00:00:00  202.77.162.213   172.16.113.50    tcpdump_inside  icmp-E-R
     9   03/07/2000  09:51:50  00:00:00  172.16.113.50    202.77.162.213   tcpdump_inside  icmp-E-Rp
     10  03/07/2000  09:51:51  00:00:00  202.77.162.213   172.16.113.84    tcpdump_inside  icmp-E-R
     11  03/07/2000  09:51:51  00:00:09  172.16.112.194   202.77.162.213   tcpdump_inside  icmp-E-Rp
     12  03/07/2000  09:51:51  00:00:00  202.77.162.213   172.16.113.105   tcpdump_inside  icmp-E-R
     13  03/07/2000  09:51:51  00:00:00  172.16.113.105   202.77.162.213   tcpdump_inside  icmp-E-Rp
     14  03/07/2000  09:51:52  00:00:00  202.77.162.213   172.16.113.148   tcpdump_inside  icmp-E-R
     :   :           :         :         :                :                :               :
         03/07/2000  09:52:00  00:00:00  202.77.162.213   172.16.112.194   tcpdump_inside  icmp-E-R
         03/07/2000  09:52:00  00:00:00  202.77.162.213   172.16.112.207   tcpdump_inside  icmp-E-R

     icmp-E-R  => icmp-echo-request
     icmp-E-Rp => icmp-echo-reply

  6. Algorithm
     Step 1: Go over the data file and build the vocabulary
       • Read all the unique fields in the data file
     Step 2: Identify the frequent vocabulary in the data file
       • How should frequency be determined? How can the frequency threshold be chosen?
     Step 3: Generate cluster candidates
       • Lines containing the same frequent words form a cluster
     Step 4: Identify temporal relationships between cluster candidates
       • The 24 relationships of data
     Step 5: Generate unique lines
       • Lines in the data file based on the candidate cluster
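Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names, the toy rows, and the choice to pass the threshold as a parameter are all assumptions made for the example. A word is identified by its column position as well as its text, and it counts as frequent when it occurs more than `threshold` times.

```python
from collections import Counter, defaultdict

def build_vocabulary(rows):
    """Step 1: count every (column index, word) pair in the data."""
    counts = Counter()
    for row in rows:
        for col, word in enumerate(row):
            counts[(col, word)] += 1
    return counts

def frequent_words(counts, threshold):
    """Step 2: keep the pairs that occur more than `threshold` times."""
    return {pair for pair, n in counts.items() if n > threshold}

def cluster_candidates(rows, frequent):
    """Step 3: map each frequent pair to the line numbers that contain it."""
    candidates = defaultdict(list)
    for line_no, row in enumerate(rows, start=1):
        for col, word in enumerate(row):
            if (col, word) in frequent:
                candidates[(col, word)].append(line_no)
    return dict(candidates)

# Toy data (hypothetical), three lines with three fields each.
rows = [
    ["a", "x", "p"],
    ["a", "y", "p"],
    ["b", "x", "p"],
]
counts = build_vocabulary(rows)
frequent = frequent_words(counts, threshold=1)
candidates = cluster_candidates(rows, frequent)
```

In this sketch every frequent word yields its own candidate, so a line can appear in several candidates at once. Steps 4 and 5 are left out because the slides do not specify them in enough detail.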

  7. Need Suggestions
     • Is it safe to assume that a threshold parameter is provided?
     • Cluster candidate generation can produce too much data (the next slide shows how)

  8. Cluster Candidate Generation
     • Data set has 8 dimensions
     • Frequent words (4-byte column number followed by the word) with a count above the threshold of 10:
       • 0004202.77.162.213 repeated 22
       • 000103/07/2000 repeated 33
       • 000300:00:00 repeated 31
       • 0007icmp-echo-request repeated 22
       • 0007icmp-echo-reply repeated 11
       • 0006tcpdump_inside repeated 33
       • 0005202.77.162.213 repeated 11
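The keys on this slide appear to be a zero-padded, four-character column number followed by the word itself, so the same IP address in the source column (0004) and the target column (0005) is counted as two different words. The sketch below shows one way to build and split such keys. The helper names `encode` and `decode` are hypothetical, and treating the Id field as column 0 is an inference from the keys (the date is 0001), not something the slides state.

```python
def encode(col, word):
    """Prefix a word with its zero-padded, 4-character column number."""
    return f"{col:04d}{word}"

def decode(key):
    """Split an encoded key back into (column number, word)."""
    return int(key[:4]), key[4:]

# First row of the Phase 1 listing; column 0 is assumed to be the Id field.
row = ["1", "03/07/2000", "09:51:36", "00:00:00", "202.77.162.213",
       "172.16.115.5", "tcpdump_inside", "icmp-echo-request"]
keys = [encode(col, word) for col, word in enumerate(row)]
```

A fixed-width prefix keeps decoding unambiguous even when the word itself starts with digits, as IP addresses and dates do.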

  9. Candidate Generation Example
     • Example
       Line 1: 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.5 tcpdump_inside icmp-E-R
       Line 2: 03/07/2000 09:51:36 00:00:05 172.16.112.194 202.77.162.213 tcpdump_inside icmp-E-Rp
       Line 3: 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.20 tcpdump_inside icmp-E-R
       Line 4: 03/07/2000 09:51:36 00:00:00 172.16.115.20 202.77.162.213 tcpdump_inside icmp-E-Rp
     • The first field is the same in every line, so should it be treated as a candidate cluster?
       Cluster 1 = { line 1, line 2, line 3, line 4 }
       Cluster 2 = { line 1, line 3, line 4 }
       Cluster 3 = { line 1, line 3 }
       Cluster 4 = { line 2, line 4 }
       Cluster 5 = { line 1, line 2, line 3, line 4 }
       Cluster 6 = { line 1, line 3 }
       Cluster 7 = { line 2, line 4 }
     • Reduction, but with loss of information?
       • Cluster 1 = { line 1, line 3 }
       • Cluster 2 = { line 2 }
       • Cluster 3 = { line 4 }
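The two cluster listings in this example can be reproduced from the four lines and the frequent-word list of slide 8. The sketch below first builds one candidate per frequent word, which gives the seven overlapping clusters, and then groups lines by the complete set of frequent words they contain, which gives the three reduced clusters. Grouping by this "signature" is an assumption about how the reduction could be done; the slides do not say which method is intended. Column indices here are local to the example (0 is the date) because the lines are shown without an Id field.

```python
from collections import defaultdict

# The four example lines: Date, Time, Duration, SrcIP, TargetIP, Analyzer, Service.
lines = [
    ["03/07/2000", "09:51:36", "00:00:00", "202.77.162.213", "172.16.115.5", "tcpdump_inside", "icmp-E-R"],
    ["03/07/2000", "09:51:36", "00:00:05", "172.16.112.194", "202.77.162.213", "tcpdump_inside", "icmp-E-Rp"],
    ["03/07/2000", "09:51:36", "00:00:00", "202.77.162.213", "172.16.115.20", "tcpdump_inside", "icmp-E-R"],
    ["03/07/2000", "09:51:36", "00:00:00", "172.16.115.20", "202.77.162.213", "tcpdump_inside", "icmp-E-Rp"],
]

# Frequent (column, word) pairs from slide 8, using the abbreviated service names.
FREQUENT = [
    (0, "03/07/2000"),
    (2, "00:00:00"),
    (3, "202.77.162.213"),
    (4, "202.77.162.213"),
    (5, "tcpdump_inside"),
    (6, "icmp-E-R"),
    (6, "icmp-E-Rp"),
]

# One candidate per frequent word: every line number that contains the word.
per_word = [
    [n for n, fields in enumerate(lines, start=1) if fields[col] == word]
    for col, word in FREQUENT
]

# Reduced form: group lines by the full set of frequent words they contain.
by_signature = defaultdict(list)
for n, fields in enumerate(lines, start=1):
    signature = tuple((col, word) for col, word in FREQUENT if fields[col] == word)
    by_signature[signature].append(n)
reduced = list(by_signature.values())
```

Lines 2 and 4 end up in separate reduced clusters only because line 2 has a duration of 00:00:05, which is not a frequent word. The reduced form no longer records that lines 2 and 4 share the same target IP and service, which is the loss of information the slide asks about.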

  10. Work to be done • Complete the algorithm and coding part

  11. References
     [1] MIT Lincoln Laboratory, http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html
