Data Mining for Network Intrusion Detection: Experience with KDDCup’99 Data set

  1. Data Mining for Network Intrusion Detection: Experience with KDDCup’99 Data set Vipin Kumar, AHPCRC, University of Minnesota Group members: L. Ertoz, M. Joshi, A. Lazarevic, H. Ramnani, P. Tan, J. Srivastava

  2. Introduction • Key challenge • Maintain high detection rate while keeping low false alarm rate • Misuse Detection • Two phase learning – PNrule • Classification based on Associations (CBA) approach • Anomaly Detection • Unsupervised (e.g. clustering) and supervised methods to detect novel attacks

  3. DARPA 1998 - KDDCup’99 Data Set • Modification of the DARPA 1998 data set prepared and managed by MIT Lincoln Lab • DARPA 1998 data includes a wide variety of intrusions simulated in a military network environment • 9 weeks of raw TCP dump data simulating a typical U.S. Air Force LAN • 7 weeks for training (5 million connection records) • 2 weeks for testing (2 million connection records)

  4. KDDCup’99 Data Set • Connections are labeled as normal or attacks • Attacks fall into 4 main categories (38 attack types) • DOS - denial of service • Probe - e.g. port scanning • U2R - unauthorized access to root privileges • R2L - unauthorized remote login to a machine • U2R and R2L are extremely small classes • 3 groups of features • Basic, content-based, and time-based features (details on slide 29)

  5. KDDCup’99 Data Set • Training set - ~5 million connections • 10% training set - 494,021 connections • Test set - 311,029 connections • Test data has attack types that are not present in the training data => the problem is more realistic • Training set contains 22 attack types • Test data contains 17 additional new attack types that belong to one of the four main categories

  6. Performance of Winning Strategy • Cost-sensitive bagged boosting (B. Pfahringer)

  7. Simple RIPPER classification • RIPPER trained on 10% of data (494,021 connections) • Test on entire test set (311,029 connections)
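
A hedged sketch of reproducing such a run: it assumes the third-party wittgenstein package (a Python RIPPER implementation), uses the standard KDD Cup ’99 distribution file names, and, since RIPPER learns a single binary concept, shows a one-vs-rest run for one illustrative attack class (smurf). The exact wittgenstein API should be checked against its documentation.

```python
import pandas as pd
import wittgenstein as lw  # third-party RIPPER implementation (pip install wittgenstein)

# Standard KDD Cup '99 file names: 10% training set and the labeled test set.
train = pd.read_csv("kddcup.data_10_percent", header=None)
test = pd.read_csv("corrected", header=None)
label = train.columns[-1]  # the last column holds the connection label

# RIPPER learns one binary concept, so run it one-vs-rest per attack class;
# "smurf." is shown here (KDD labels carry a trailing dot).
clf = lw.RIPPER()
clf.fit(train, class_feat=label, pos_class="smurf.")
pred = clf.predict(test.drop(columns=[label]))
```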

  8. Simple RIPPER on modified data • Remove duplicates and merge the new train and test data sets • Sample 69,980 examples from the merged data set • Sample from the neptune and normal subclasses; other subclasses remain intact • Divide in equal proportions into training and test sets • Apply the RIPPER algorithm to the new data set
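
A pandas sketch of these preparation steps (file names are the standard KDD Cup ’99 distribution names; the per-subclass sample size is an illustrative assumption, since the slide gives only the 69,980 total):

```python
import pandas as pd

train = pd.read_csv("kddcup.data_10_percent", header=None)
test = pd.read_csv("corrected", header=None)
label = train.columns[-1]

# Merge the train and test data and drop exact duplicate connections.
merged = pd.concat([train, test]).drop_duplicates()

# Down-sample only the two dominant subclasses; keep all others intact.
dominant = merged[label].isin(["neptune.", "normal."])
sampled = pd.concat([
    merged[dominant].sample(n=40_000, random_state=0),  # illustrative size
    merged[~dominant],
])

# Divide in equal proportions into new training and test sets.
new_train = sampled.sample(frac=0.5, random_state=0)
new_test = sampled.drop(new_train.index)
```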

  9. Building Predictive Models in NID • Models should handle skewed class distributions • Accuracy is not a sufficient metric for evaluation • Focus on both recall and precision • Recall (R) = TP/(TP + FN) • Precision (P) = TP/(TP + FP) • F-measure = 2*R*P/(R + P) • Notation: rare class = C, large class = NC
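
As a concrete illustration, a minimal Python sketch of these three metrics (the TP/FP/FN counts in the example call are made up, not taken from the slides):

```python
def recall(tp, fn):
    # Fraction of actual attack connections that the model catches.
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    # Fraction of flagged connections that are real attacks.
    return tp / (tp + fp) if tp + fp else 0.0

def f_measure(tp, fp, fn):
    # Harmonic mean of recall and precision; informative for skewed classes.
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p) if r + p else 0.0

# Hypothetical counts for a rare class C: 50 true positives,
# 10 false positives, 30 false negatives.
print(f_measure(tp=50, fp=10, fn=30))  # ~0.714
```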

  10. Predictive Models for Rare Classes • Over-sampling the small class [Ling, Li, KDD 1998] • Down-sizing the large class [Kubat, ICML 1997] • Internally bias the discrimination process to compensate for class imbalance [Fawcett, DMKDD 1997] • PNrule and related work [Joshi, Agarwal, Kumar, SIAM, SIGMOD 2001] • RIPPER with stratification • SMOTE algorithm [Chawla, JAIR 2002] • RareBoost [Joshi, Agarwal, Kumar, ICDM 2001]

  11. PNrule Learning • P-phase: • cover most of the positive examples with high support • seek good recall • N-phase: • remove false positives from the examples covered in the P-phase • N-rules give high accuracy and significant support • Existing techniques can possibly learn erroneous small signatures for the absence of C; PNrule can instead learn strong signatures for the presence of NC in the N-phase
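
PNrule itself is not publicly packaged; the sketch below only approximates the two-phase idea, with scikit-learn decision trees standing in for the P- and N-phase rule learners (the class weights, depths, and 0/1 labeling are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pnrule_sketch(X_train, y_train, X_new):
    # P-phase: favor recall on the rare class C (label 1) by weighting it
    # heavily, so the learned "P-rules" cover most positive examples.
    p_model = DecisionTreeClassifier(max_depth=3, class_weight={0: 1, 1: 10})
    p_model.fit(X_train, y_train)
    covered = p_model.predict(X_train) == 1

    # N-phase: within the covered region only, learn signatures for NC
    # (label 0), i.e. learn to recognize the P-phase false positives.
    n_model = DecisionTreeClassifier(max_depth=3)
    n_model.fit(X_train[covered], (y_train[covered] == 0).astype(int))

    # Final prediction: C if the P-rules fire and the N-rules do not veto.
    fires = p_model.predict(X_new) == 1
    veto = np.zeros(len(X_new), dtype=bool)
    if fires.any():
        veto[fires] = n_model.predict(X_new[fires]) == 1
    return (fires & ~veto).astype(int)
```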

  12. RIPPER vs. PNrule Classification • 5% sample from normal, smurf (DOS), neptune (DOS) from 10% of training data (494,021 connections) • Test on entire test set (311,029 connections)

  13. Classification Based on Associations (CBA) • What are association patterns? • Frequent itemset: captures a set of “items” that co-occur together frequently in a transaction database • Association rule X ⇒ y: predicts the occurrence of item y in a transaction given the presence of itemset X • Support s(X ⇒ y): fraction of all transactions that contain both X and y • Confidence c(X ⇒ y): fraction of the transactions containing X that also contain y (a small worked example follows below)
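
A minimal sketch of these two measures on toy transactions (the items shown are invented for illustration, not actual KDD features):

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # Of the transactions containing the left-hand side X, the fraction
    # that also contain the right-hand side y.
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Toy transactions (illustrative only).
T = [{"tcp", "http", "SF"}, {"tcp", "http", "REJ"},
     {"udp", "dns", "SF"}, {"tcp", "http", "SF"}]
print(support(T, {"tcp", "http"}))             # 0.75
print(confidence(T, {"tcp", "http"}, {"SF"}))  # ~0.667
```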

  14. Classification Based on Associations (CBA) • Previous work: • Use association patterns to improve the overall performance of traditional classifiers • Integrating Classification and Association Rule Mining [Liu, Hsu, Ma, KDD 1998] • CMAR: Accurate Classification Based on Multiple Class-Association Rules [Li, Han, Pei, ICDM 2001] • Associations in network intrusion detection • Use classification based on associations for anomaly detection and misuse detection [Lee, Stolfo, Mok 1999] • Look for abnormal associations [Barbara, Wu, Jajodia, 2001]

  15. Methodology [flowchart] Overall data set → Stratification into per-class subsets (DOS, probe, U2R, R2L, normal) → Frequent Itemset Generation for each class (e.g., F1: {A, B, C} => dos, F2: {B, D} => dos, …; F1: {B, F} => probe, F2: {B, C, H} => probe, …; F1: {A, C, D} => u2r, F2: {E, F, H} => u2r, …; F1: {C, K, L} => r2l, F2: {F, G, H} => r2l, …; F1: {A, B} => normal, F2: {E, G} => normal, …) → Feature Selection → Feed to classifier

  16. Methodology • Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers. • This may work well only if each class is well-represented in the data set. • For the rare class problems, some of the high recall itemsets could be potentially useful, as long as their precision is not too low. • Our approach: • Apply frequent itemset generation algorithm to each class. • Select itemsets to be added as features based on precision, recall and F-Measure. • Apply classification algorithm, i.e., RIPPER, to the new data set.
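
A minimal sketch of the proposed selection step, assuming each record has already been discretized into a set of feature-value “items” (the representation and all names are illustrative assumptions):

```python
def score_itemsets(records, labels, target, itemsets):
    # Treat "itemset present in record" as a predictor of the target class,
    # then rank candidate itemsets by F-measure.
    scored = []
    n_target = sum(l == target for l in labels)
    for iset in itemsets:
        fires = [iset <= r for r in records]
        tp = sum(f and l == target for f, l in zip(fires, labels))
        fp = sum(f and l != target for f, l in zip(fires, labels))
        recall = tp / n_target if n_target else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = recall + precision
        f = 2 * recall * precision / denom if denom else 0.0
        scored.append((f, iset))
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Top-scoring itemsets become new binary features for the classifier
# (RIPPER on the slides): feature value = 1 if the itemset fires.
```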

  17. Experimental Results (on modified data) [result tables: Original RIPPER vs. RIPPER with high-Precision, high-Recall, and high-F-measure rules]

  18. Experimental Results (on modified data) [result tables, continued: Original RIPPER vs. RIPPER with high-Precision, high-Recall, and high-F-measure rules] For rare classes, rules ordered according to F-measure produce the best results.

  19. CBA Summary • Association rules can improve the overall performance of classifiers • The measure used to select rules for feature addition can affect classifier performance • The proposed F-measure rule selection approach leads to better overall performance

  20. Anomaly Detection – Related Work • Detect novel intrusions using pseudo-Bayesian estimators to estimate prior and posterior probabilities of new attacks [Barbara, Wu, SIAM 2001] • Generate artificial anomalies (intrusions) and then use RIPPER to learn intrusions [Fan et al, ICDM 2001] • Detect intrusions by computing changes in estimated probability distributions [Eskin, ICML 2000] • Clustering based approaches [Portnoy et al, 2001]

  21. SNN Clustering on KDDCup’99 data • SNN clustering is suited for finding clusters of varying sizes, shapes, and densities in the presence of noise • Dataset • 10,000 examples were sampled from the neptune, smurf, and normal classes, from both the training and test sets • Other subclasses remain intact • Total number of instances: 97,000 • Applied shared nearest neighbor (SNN) clustering and k-means clustering (a sketch of the SNN similarity computation follows below)
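
SNN clustering builds a density-based clustering on top of a shared nearest neighbor graph; the sketch below shows only the core similarity computation (the choice of k and the dense similarity matrix are simplifications for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    # k-nearest-neighbor list for every point (excluding the point itself).
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    neighbor_sets = [set(row[1:]) for row in idx]

    # SNN similarity = number of shared neighbors, counted only between
    # points that appear in each other's neighbor lists.
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in neighbor_sets[i]:
            if i in neighbor_sets[j]:
                sim[i, j] = len(neighbor_sets[i] & neighbor_sets[j])
    return sim
```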

  22. Clustering Results • SNN clusters of pure new attack types are found

  23. Clustering Results [comparison tables: SNN clustering performance vs. k-means performance, shown for all k-means clusters and for the tightest k-means clusters]

  24. Nearest Neighbor (NN) based Outlier Detection • For each point in the training set, calculate the distance to the closest other point • Build a histogram of these nearest-neighbor distances • Choose a threshold such that only a small percentage (e.g., 2%) of the training set is classified as outliers
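
A minimal sketch of this scheme, assuming numeric feature vectors (scikit-learn's NearestNeighbors does the distance work; the 2% figure mirrors the slide):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_outlier_threshold(train, pct_outliers=2.0):
    # Distance from each training point to its closest other training point.
    nbrs = NearestNeighbors(n_neighbors=2).fit(train)
    dist, _ = nbrs.kneighbors(train)
    nn_dist = dist[:, 1]  # column 0 is the point itself (distance 0)
    # Pick the threshold so ~2% of the training set lies beyond it.
    return nbrs, np.percentile(nn_dist, 100 - pct_outliers)

def flag_anomalies(nbrs, threshold, test):
    # A test connection is anomalous if it is far from everything in training.
    dist, _ = nbrs.kneighbors(test, n_neighbors=1)
    return dist[:, 0] > threshold
```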

  25. Anomaly Detection Using NN Scheme [figure not preserved; original annotation: “attack”]

  26. Novel Attack Detection Using NN Scheme Detection Rate for Novel Attacks = 68.50% False Positive Rate for Normal connections = 2.82%

  27. Novel Attack Detection Using NN Scheme [figure not preserved; original annotation: “novel attacks”] (details on slides 30 and 31)

  28. Conclusions • Predictive models specifically designed for rare class can help in improving the detection of small attack types • SNN clustering based approach shows promise in identifying novel attack types • Simple nearest neighbor based approaches appear capable of detecting anomalies

  29. KDDCup’99 Data Set • KDDCup’99 contains derived high-level features • 3 groups of features • basic features of individual TCP connections (duration, protocol type, service, src & dest bytes, …) • content features within a connection suggested by domain knowledge (e.g. # of failed login attempts) • time-based traffic features of the connection records • “same host” features examine only the connections that have the same destination host as the current connection • “same service” features examine only the connections that have the same service as the current connection

  30. 1-NN on Anomalies

  31. 1-NN on Known Attacks
