
Critique of 1998 and 1999 DARPA IDS Evaluations

This paper provides a critical analysis of the DARPA Intrusion Detection System (IDS) evaluations conducted by Lincoln Laboratory in 1998 and 1999. It discusses the flaws in data generation, taxonomy, and evaluation process, and offers recommendations for improvement.



Presentation Transcript


  1. Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. By John McHugh. Presented by Hongyu Gao, Feb. 5, 2009

  2. Outline • Lincoln Lab’s evaluation in 1998 • Critique of data generation • Critique of taxonomy • Critique of evaluation process • Brief discussion of the 1999 evaluation • Conclusion

  3. The 1998 evaluation • The most comprehensive evaluation of research on intrusion detection systems that has been performed to date

  4. The 1998 evaluation cont’d • Objective: • “To provide unbiased measurement of current performance levels.” • “To provide a common shared corpus of experimental data that is available to a wide range of researchers”

  5. The 1998 evaluation, cont’d • Simulated a typical air force base network

  6. The 1998 evaluation, cont’d • Collected synthetic traffic data

  7. The 1998 evaluation, cont’d • Researchers tested their systems on the traffic • Receiver Operating Characteristic (ROC) curves were used to present the results
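As a reminder of what a standard ROC curve plots, here is a minimal sketch that sweeps a threshold over scored sessions and records one (false positive rate, true positive rate) point per threshold. The function name `roc_points` and the scores and labels are made up for illustration; they are not data from the evaluation.

```python
def roc_points(scores, labels):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold.

    scores: IDS suspicion score per session (higher means more suspicious).
    labels: 1 if the session is an attack, 0 if it is normal traffic.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scored sessions: three attacks and three normal sessions.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Both axes are rates, so each needs a well-defined denominator: the number of attacks for the y-axis and the number of normal units for the x-axis.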

  8. 1. Critique of data generation • Both background (normal) and attack data are synthesized. • The data are said to represent traffic to and from a typical Air Force base. • For the results to be meaningful, such synthesized data must reflect system performance in realistic scenarios.

  9. Critique of background data • Counterpoint 1 • Real traffic is not well-behaved. • E.g. spontaneous packet storms that are indistinguishable from malicious attempts at flooding. • Such behavior is not included in the background traffic

  10. Critique of background data, cont’d • Counterpoint 2 • Low average data rate

  11. Critique of background data, cont’d • Possible negative consequences • An IDS may produce more false positives (FP) in a realistic scenario. • An IDS may drop packets in a realistic scenario.

  12. Critique of attack data • The distribution of attacks is not realistic • The numbers of attacks in each category (U2R, R2L, DoS, probing) are of the same order

  13. Critique of attack data, cont’d • Possible negative consequences • The aggregate detection rate does not reflect the detection rate in real traffic
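The dependence of the aggregate detection rate on the attack mix can be sketched as a weighted average. All per-category detection rates and attack counts below are hypothetical, chosen only to show the effect; they are not results from the evaluation.

```python
def aggregate_detection_rate(per_category_rate, attack_counts):
    """Weighted average of per-category detection rates, weighted by attack count."""
    total = sum(attack_counts.values())
    detected = sum(per_category_rate[c] * n for c, n in attack_counts.items())
    return detected / total


# Hypothetical per-category detection rates for one IDS.
rates = {"DoS": 0.2, "Probe": 0.3, "R2L": 0.9, "U2R": 0.8}

# Evaluation-style mix: every category is of the same order.
balanced_mix = {"DoS": 25, "Probe": 25, "R2L": 25, "U2R": 25}
# Hypothetical skewed mix dominated by probes and DoS attempts.
skewed_mix = {"DoS": 30, "Probe": 60, "R2L": 8, "U2R": 2}

balanced = aggregate_detection_rate(rates, balanced_mix)
skewed = aggregate_detection_rate(rates, skewed_mix)
print(f"balanced mix: {balanced:.3f}")
print(f"skewed mix:   {skewed:.3f}")
```

The same IDS gets a very different aggregate score under the two mixes, so a score measured on a balanced corpus says little about traffic with a different distribution.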

  14. Critique of the simulated AFB network • Not likely to be realistic • 4 real machines • 3 fixed attack targets • Flat architecture • Possible negative consequences • An IDS can be tuned to look only at traffic targeting certain hosts • Precludes the execution of “smurf” or ICMP echo attacks

  15. 2. Critique of taxonomy • Based on the attacker’s point of view • Denial of service • Remote to user • User to root • Probing • Not useful for describing what an IDS might see

  16. Critique of taxonomy, cont’d • Alternative taxonomies • Classify by protocol layer • Classify by whether a completed protocol handshake is necessary • Classify by severity of attack • Many others…

  17. 3. Critique of evaluation • The unit of evaluation • Sessions are used • Some traffic (e.g. messages originating from Ethernet hubs) is not in any session • Is “session” an appropriate unit?

  18. Critique of evaluation, cont’d • Scoring and ROC • What is the denominator?

  19. Critique of evaluation, cont’d • A non-standard variation of ROC • Substitutes false alarms per day for the x-axis • Possible problem • The number of false alarms per unit time may increase significantly as the data rate increases • Suggested alternatives • Use the total number of alerts (both TP and FP) • Use the standard ROC
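A minimal sketch of why "false alarms per day" is a rate-dependent axis: if an IDS raises a false alarm on a fixed fraction of normal sessions, the daily count grows linearly with traffic volume. The 0.1% figure is used here as an assumed per-session false alarm rate, and the session volumes are hypothetical.

```python
def false_alarms_per_day(false_alarm_rate, normal_sessions_per_day):
    """Expected daily false alarms for a fixed per-session false alarm rate."""
    return false_alarm_rate * normal_sessions_per_day


FALSE_ALARM_RATE = 0.001  # assumed: 0.1% of normal sessions trigger an alarm

# Hypothetical traffic volumes, from a quiet test network to a busy one.
volumes = (10_000, 100_000, 1_000_000)
daily_alarms = [false_alarms_per_day(FALSE_ALARM_RATE, v) for v in volumes]

for volume, alarms in zip(volumes, daily_alarms):
    print(f"{volume:>9} sessions/day -> {alarms:.0f} false alarms/day")
```

The IDS itself is unchanged across the three rows; only the data rate differs. A curve plotted against false alarms per day therefore describes the test traffic as much as the detector.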

  20. Evaluation on Snort

  21. Evaluation on Snort, cont’d • Poor performance on DoS and Probe • Good performance on R2L and U2R • Conclusion on Snort: • The data are not sufficient to draw any conclusion

  22. Critique of evaluation, cont’d • False alarm rate • A crucial concern • The designated maximum value (0.1%) is inconsistent with the maximum operator load set by Lincoln Lab (100/day)
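One way to see the tension between the two limits is to compute the traffic volume at which they coincide. This is only a sketch: it assumes the 0.1% limit is a fraction of normal sessions, and the busier-network volume is hypothetical. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

MAX_FALSE_ALARM_RATE = Fraction(1, 1000)  # 0.1%, assumed to be per normal session
MAX_ALARMS_PER_DAY = 100                  # operator load limit from the slide


def break_even_sessions(max_alarms_per_day, max_false_alarm_rate):
    """Sessions per day at which the rate limit yields exactly the load limit."""
    return Fraction(max_alarms_per_day) / Fraction(max_false_alarm_rate)


limit = break_even_sessions(MAX_ALARMS_PER_DAY, MAX_FALSE_ALARM_RATE)
print(f"limits agree only at {limit} sessions/day")

# Hypothetical busier network: meets the 0.1% rate limit yet exceeds the load limit.
busy_sessions = 500_000
busy_alarms = MAX_FALSE_ALARM_RATE * busy_sessions
print(f"{busy_sessions} sessions/day -> {busy_alarms} false alarms/day")
```

Under these assumptions the two limits agree at a single traffic volume. Above it, an IDS can satisfy the rate limit and still overload the operator.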

  23. Critique of evaluation, cont’d • Does the evaluation result really mean something? • The ROC curve reflects the ability to detect attacks against a background of normal traffic • What does a good IDS consist of? • Algorithm • Reliability • Good signatures • …

  24. Brief discussion of the 1999 evaluation • Some superficial improvements • Additional hosts and host types are added • New attacks are added • None of these addresses the flaws listed above

  25. Brief discussion of the 1999 evaluation, cont’d • Security policy is not clear • What is an attack and what is not? • E.g. scans, probes

  26. Conclusion • The Lincoln Lab evaluation is a major and impressive effort. • The paper critiques the evaluation from several angles: data generation, taxonomy, and the evaluation process.

  27. Follow-up Work • DETER - Testbed for network security technology. • Public facility for medium-scale repeatable experiments in computer security • Located at USC ISI and UC Berkeley. • 300 PC systems running Utah's Emulab software. • Experimenters can access DETER remotely to develop, configure, and manipulate collections of nodes and links with arbitrary network topologies. • A current problem is that the framework has no realistic attack module or background noise generator plugin. Attack distribution is also a problem. • PREDICT - A huge trace repository. It is not public, and there are several legal issues in working with it.

  28. Follow-up Work • KDD Cup - Its goal is to provide datasets from real-world problems to demonstrate the applicability of different knowledge discovery and machine learning techniques. • The 1999 KDD intrusion detection contest uses a labelled version of the 1998 DARPA dataset, annotated with connection features. • There are several problems with KDD Cup. Recently, people have found that average TCP packet size is the feature best correlated with attacks, which clearly points to the dataset's inefficacy.

  29. Discussion • Can the aforementioned problems be addressed? • Dataset • Taxonomy • Unit of analysis • Approach for comparing IDSes • …

  30. The End Thank you
