
Presentation Transcript


  1. Ensemble-based Adaptive Intrusion Detection Wei Fan IBM T.J.Watson Research Salvatore J. Stolfo Columbia University

  2. Data Mining for Intrusion Detection • (pipeline diagram) Connection Records, e.g. (telnet, 10, 3, ...) → Feature Construction, e.g. (ftp, 10, 20, ...) → Label Existing Connections → Training Data → Inductive Learner → Intrusion Detection Model
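A minimal sketch of the training step in this pipeline, assuming the connection records have already been summarized into feature vectors. The slides later name RIPPER as the learner; a decision tree is used here only as a readily available stand-in, and the feature names and labels are illustrative.

```python
# Hedged sketch: learn an intrusion detection model from labeled connection
# records after feature construction. Feature/label values are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

records = pd.DataFrame({
    "service":           ["telnet", "ftp", "http", "ftp"],
    "duration":          [10, 10, 2, 45],
    "num_failed_logins": [3, 0, 0, 5],
    "label":             ["guess_passwd", "normal", "normal", "guess_passwd"],
})

X = pd.get_dummies(records.drop(columns="label"))   # one-hot encode "service"
y = records["label"]

learner = DecisionTreeClassifier()                  # inductive learner
detection_model = learner.fit(X, y)                 # intrusion detection model
print(detection_model.predict(X[:1]))               # classify a connection
```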

  3. Some interesting requirements ... • New types of intrusions are constantly invented by hackers. • Most recent example: the coordinated attacks on many e-business websites in 2000. • Hackers tend to use new types of intrusions that the intrusion detection system is unaware of or weak at detecting. • Data mining for intrusion detection is a very data-intensive process: • very large data • evolving patterns • real-time detection

  4. Question • When new types of intrusions are invented, can we quickly adapt our existing model to detect them before they cause more damage? • Without a solution, a new attack will cause significant damage. • For this kind of problem, a solution that is not completely satisfactory is better than no solution.

  5. Naive Approach - Complete Re-training • (diagram) Existing Training Data + New Data → Merged Training Data → Inductive Learner → NEW Intrusion Detection Model
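A minimal sketch of complete re-training, assuming existing_X / existing_y and new_X / new_y are numeric feature matrices and label arrays as in the earlier sketch. The learner again stands in for RIPPER.

```python
# Hedged sketch of the naive approach: merge everything, then relearn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def retrain_from_scratch(existing_X, existing_y, new_X, new_y):
    merged_X = np.vstack([existing_X, new_X])        # merge all training data
    merged_y = np.concatenate([existing_y, new_y])
    # Learning over the full merged data is exactly the step that becomes
    # too slow when the data is very large.
    return DecisionTreeClassifier().fit(merged_X, merged_y)
```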

  6. Problem with the Naive Approach • Since the data (existing plus new) will be very large, it takes a long time to compute a detection model. • By the time the model is constructed, the new attack will probably have already done significant damage to our system.

  7. New Approach • (diagram) New Data → Learner → NEW Model; NEW Model + Existing Model → Combined Model • Key point: we compute a model from the data on the new types of intrusions only (see the sketch below).
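A minimal sketch of the fast path, assuming new_X / new_y hold features and labels for the new intrusion type (plus normal connections, as slide 24 describes). The existing model H1 is kept unchanged; only H2 is learned, and the decision-tree learner again stands in for RIPPER.

```python
# Hedged sketch: learn H2 from the new data only, keeping H1 untouched.
from sklearn.tree import DecisionTreeClassifier

def train_new_model(new_X, new_y):
    # The new data is small, so this step is fast compared with complete
    # re-training on the merged existing + new data.
    return DecisionTreeClassifier().fit(new_X, new_y)
```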

  8. How do we label connections? • (diagram) A new connection → existing model → labeled as normal or a previously known intrusion type, or its connection type is unrecognized → NEW model → labeled as normal or a new intrusion type
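A minimal sketch of this labeling flow, assuming h1 is the existing three-class model (normal / known intrusion types / anomaly), h2 is the new model, and the literal "anomaly" label string and single-connection interface are illustrative assumptions. The four configurations on the following slides differ in exactly which of H1's predictions are forwarded to H2; this sketch forwards only the unrecognized (anomalous) connections.

```python
# Hedged sketch of the two-model labeling cascade.
def classify_connection(connection_features, h1, h2):
    first_opinion = h1.predict([connection_features])[0]
    if first_opinion != "anomaly":
        # Recognized by the existing model: normal or a known intrusion type.
        return first_opinion
    # Unrecognized by the existing model: the new model decides whether the
    # connection is normal or one of the new intrusion types.
    return h2.predict([connection_features])[0]
```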

  9. Basic Idea • The existing model is built to identify THREE classes: • normal • some known type of intrusion • anomaly: a connection that is neither normal nor any known type of intrusion. • Anomaly detection: we use the artificial anomaly generation method (Fan et al., ICDM 2001)

  10. Anomaly Detection • Generate "artificial anomalies" from training data: similar to "near misses". • Artificial anomalies are data points that are different from the training data. • The algorithm concentrates on feature values that are infrequent in the training data. • Distribution-based Artificial Anomaly (Fan et al, ICDM2001)
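A hedged approximation of distribution-based artificial anomaly generation, not the exact algorithm of Fan et al., ICDM 2001: rarer feature values receive proportionally more artificial anomalies, each built by swapping a rare value into a randomly sampled training example, and candidates that collide with real training examples are discarded so the result stays a "near miss". The sketch assumes categorical features in a pandas DataFrame.

```python
# Hedged approximation of distribution-based artificial anomaly generation.
import pandas as pd

def generate_artificial_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    anomalies = []
    seen = set(map(tuple, df.itertuples(index=False)))   # real training points
    for feature in df.columns:
        counts = df[feature].value_counts()
        max_count = counts.max()
        for value, count in counts.items():
            # The more infrequent the value, the more anomalies we try to make.
            for _ in range(int(max_count - count)):
                candidate = df.sample(1).iloc[0].copy()
                candidate[feature] = value
                if tuple(candidate) not in seen:          # keep only near misses
                    anomalies.append(candidate)
    return pd.DataFrame(anomalies, columns=df.columns)
```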

  11. Four Configurations • H1(x): existing model. • H2(x): new model. • The configurations differ in how H2(x) is computed, • how H1(x) and H2(x) are combined, • and how a connection is processed and classified.

  12. Configuration I

  13. Configuration II

  14. Configuration III

  15. Configuration IV

  16. Experiment • 1998 DARPA Intrusion Detection Evaluation Dataset • 22 different types of intrusions.

  17. Experiment • Intrusions are introduced into the training data in sequence, to simulate new intrusions being invented and launched by hackers over time. • There are 22! possible unique sequences; • we randomly chose 3 unique sequences • and averaged the results (see the sketch below). • Learner: RIPPER with unordered rulesets.
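A minimal sketch of this setup: out of the 22! possible orderings of the 22 intrusion types, draw 3 random permutations and treat each as the order in which "new" intrusions appear. The placeholder type names are illustrative, not the actual DARPA 1998 attack labels.

```python
# Hedged sketch: sample 3 of the 22! possible intrusion sequences.
import random

intrusion_types = [f"intrusion_{i:02d}" for i in range(22)]  # 22 attack types

random.seed(0)                        # fixed seed so the runs are repeatable
sequences = []
for _ in range(3):                    # 3 randomly chosen unique sequences
    order = intrusion_types[:]
    random.shuffle(order)
    sequences.append(order)

# Each sequence is then replayed: at step k, intrusion order[k] is "new" and
# the adaptive models are evaluated on it; results are averaged over sequences.
```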

  18. 3 Unique Sequences

  19. Measurements • All results are on the new intrusion types. • Precision: if I catch a potential thief, what is the probability that it is a real thief? • Recall: what is the probability that real thieves are detected? • Anomaly detection rate: fraction of new intrusions classified as anomaly. • Other detection rate: fraction of new intrusions classified as other (known) types of intrusions.
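A minimal sketch of how these measurements can be computed from true and predicted labels; the "anomaly" and "normal" label strings and the function name are assumptions.

```python
# Hedged sketch: compute the four measurements for one new intrusion type.
# y_true / y_pred are parallel sequences of label strings for the test
# connections; new_type is the label of the newly introduced intrusion.
def evaluate_new_intrusion(y_true, y_pred, new_type):
    pairs = list(zip(y_true, y_pred))
    flagged   = [p for _, p in pairs if p == new_type]            # predicted new
    new_conns = [p for t, p in pairs if t == new_type]            # truly new
    caught    = [p for t, p in pairs if t == new_type and p == new_type]

    n_new = len(new_conns)
    precision    = len(caught) / len(flagged) if flagged else 0.0
    recall       = len(caught) / n_new if n_new else 0.0
    anomaly_rate = sum(p == "anomaly" for p in new_conns) / n_new if n_new else 0.0
    other_rate   = sum(p not in (new_type, "anomaly", "normal")
                       for p in new_conns) / n_new if n_new else 0.0
    return precision, recall, anomaly_rate, other_rate
```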

  20. Precision Results

  21. Recall Results

  22. Anomaly Detection Rate

  23. Other Detection Rate Results

  24. Summary of results • The most accurate is Configuration I, where • the new model is trained from normal data and the new intrusion type, and • everything the old model predicts as normal or anomaly is examined by the new model. • Reason: • The existing model's precision at detecting normal connections influences the combined model's accuracy. • The new data is limited in amount, so the artificial anomalies generated from it are limited as well.

  25. Training Efficiency

  26. Related Work (incomplete list) • Anomaly detection: • SRI's IDES uses the probability distribution of past activities to measure the abnormality of host events; we measure network events. • Forrest et al. use the absence of subsequences to measure abnormality. • Lane and Brodley employ a similar approach but use incremental learning to update the stored sequences of UNIX shell commands. • Ghosh and Schwartzbard use a neural network to learn a profile of normality and a distance function to detect abnormality. • Generating artificial data: • Nigam et al. assign labels to unlabelled data using a classifier trained from labeled data. • Chang and Lippmann applied voice transformation techniques to add artificial training talkers to increase variability. • Multiple classifiers: • Asker and Maclin, "Ensembles as a Sequence of Classifiers".

  27. Summary and Future Work • Proposed a two-step, two-classifier approach for efficient training and fast model deployment. • Empirically tested it in the intrusion detection domain. • Need to test whether it works well in other domains.
