Data Mining for Network Intrusion Detection: Experience with KDDCup’99 Data set Vipin Kumar, AHPCRC, University of Minnesota Group members: L. Ertoz, M. Joshi, A. Lazarevic, H. Ramnani, P. Tan, J. Srivastava
Introduction • Key challenge • Maintain a high detection rate while keeping the false alarm rate low • Misuse Detection • Two-phase learning – PNrule • Classification Based on Associations (CBA) approach • Anomaly Detection • Unsupervised (e.g., clustering) and supervised methods to detect novel attacks
DARPA 1998 - KDDCup’99 Data Set • Modification of the DARPA 1998 data set prepared and managed by MIT Lincoln Lab • DARPA 1998 data includes a wide variety of intrusions simulated in a military network environment • 9 weeks of raw TCP dump data simulating a typical U.S. Air Force LAN • 7 weeks for training (5 million connection records) • 2 weeks for testing (2 million connection records)
KDDCup’99 Data Set • Connections are labeled as normal or attacks • Attacks fall into 4 main categories (38 attack types) • DOS - denial of service • Probe - e.g., port scanning • U2R - unauthorized access to root privileges • R2L - unauthorized remote login to a machine • U2R and R2L are extremely small classes • 3 groups of features • Basic, content-based, and time-based features (details)
KDDCup’99 Data Set • Training set - ~5 million connections • 10% training set - 494,021 connections • Test set - 311,029 connections • Test data has attack types that are not present in the training data => the problem is more realistic • Training set contains 22 attack types • Test data contains 17 additional new attack types, each belonging to one of the four main categories
Performance of Winning Strategy • Cost-sensitive bagged boosting (B. Pfahringer)
Simple RIPPER classification • RIPPER trained on 10% of data (494,021 connections) • Test on entire test set (311,029 connections)
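For concreteness, here is a minimal sketch of such an experiment using the open-source `wittgenstein` package as a RIPPER implementation; the file names and the one-vs-rest "smurf" framing are illustrative assumptions, not the original setup.

```python
# A hedged sketch: training a RIPPER-style rule learner on KDD-style data
# with the `wittgenstein` package (pip install wittgenstein). File names and
# the choice of "smurf" as the positive class are illustrative assumptions.
import pandas as pd
import wittgenstein as lw
from sklearn.metrics import precision_score, recall_score

train = pd.read_csv("kddcup_train_10pct.csv")   # hypothetical file name
test = pd.read_csv("kddcup_test.csv")           # hypothetical file name

# RIPPER learns a rule set for one positive class at a time.
X_train, y_train = train.drop(columns="label"), train["label"]
X_test, y_test = test.drop(columns="label"), test["label"]

clf = lw.RIPPER()
clf.fit(X_train, y_train, pos_class="smurf")
clf.out_model()                                 # print the learned rules

preds = clf.predict(X_test)                     # True where a rule fires
print("recall:", recall_score(y_test == "smurf", preds))
print("precision:", precision_score(y_test == "smurf", preds))
```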
Simple RIPPER on modified data • Remove duplicates and merge new train and test data sets • Sample 69,980 examples from the merged data set • Sample from neptune and normal subclass. Other subclasses remain intact. • Divide in equal proportion to training and test sets • Apply RIPPER algorithm on the new data set
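A minimal sketch of this preprocessing, assuming pandas DataFrames whose "label" column holds the attack subclass; the per-class sample sizes are illustrative stand-ins for whatever totals 69,980 examples.

```python
# Deduplicate, merge, down-sample the dominant subclasses, and re-split.
import pandas as pd

train = pd.read_csv("kddcup_train_10pct.csv")    # hypothetical file names
test = pd.read_csv("kddcup_test.csv")

# Remove duplicates and merge the train and test data sets.
merged = pd.concat([train, test]).drop_duplicates().reset_index(drop=True)

# Down-sample only the neptune and normal subclasses.
parts = []
for label, group in merged.groupby("label"):
    if label in ("neptune", "normal"):
        group = group.sample(n=min(len(group), 20_000), random_state=0)  # assumed size
    parts.append(group)               # all other subclasses remain intact
sampled = pd.concat(parts)

# Divide in equal proportion into new training and test sets.
shuffled = sampled.sample(frac=1.0, random_state=0)
half = len(shuffled) // 2
new_train, new_test = shuffled.iloc[:half], shuffled.iloc[half:]
```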
Building Predictive Models in NID • Models should handle skewed class distributions • Accuracy alone is not a sufficient evaluation metric • Focus on both recall and precision of the rare class C (vs. the large class NC) • Recall (R) = TP/(TP + FN) • Precision (P) = TP/(TP + FP) • F-measure = 2*R*P/(R + P)
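These per-class metrics can be computed directly with scikit-learn; the tiny label vectors below are made-up examples for illustration.

```python
# Recall, precision, and F-measure exactly as defined above, per class.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["dos", "normal", "u2r", "normal", "u2r", "dos"]
y_pred = ["dos", "normal", "normal", "normal", "u2r", "dos"]

p, r, f, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["dos", "u2r", "normal"], zero_division=0
)
for cls, pi, ri, fi in zip(["dos", "u2r", "normal"], p, r, f):
    print(f"{cls}: precision={pi:.2f} recall={ri:.2f} F={fi:.2f}")
```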
Predictive Models for Rare Classes • Over-sampling the small class [Ling, Li, KDD 1998] • Down-sizing the large class [Kubat, ICML 1997] • Internally bias the discrimination process to compensate for class imbalance [Fawcett, DMKDD 1997] • PNrule and related work [Joshi, Agarwal, Kumar, SIAM, SIGMOD 2001] • RIPPER with stratification • SMOTE algorithm [Chawla, JAIR 2002] • RareBoost [Joshi, Agarwal, Kumar, ICDM 2001]
PNrule Learning • P-phase: • cover most of the positive examples (class C) with high support • seek good recall • N-phase: • remove false positives from the examples covered in the P-phase • N-rules give high accuracy and significant support • Existing techniques can learn erroneous small signatures for the absence of C; PNrule can instead learn strong signatures for the presence of NC in the N-phase [two-phase diagram omitted]
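PNrule itself is not in common libraries, but the two-phase structure can be sketched with any learner; the sketch below uses shallow decision trees as stand-ins for the P-rules and N-rules, which is an assumption for illustration, not the authors' implementation.

```python
# A structural sketch of two-phase learning in the PNrule spirit: phase 1 (P)
# over-covers the rare class C for high recall; phase 2 (N) is trained only on
# P-covered examples to strip false positives. Decision trees stand in for
# rule induction; inputs are assumed to be numpy arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_pn(X, y_is_C):
    # P-phase: bias heavily toward recall on the rare class C.
    p_model = DecisionTreeClassifier(max_depth=3, class_weight={True: 10, False: 1})
    p_model.fit(X, y_is_C)
    covered = p_model.predict(X).astype(bool)    # examples the P-rules cover
    # N-phase: within the covered region, learn to separate C from the FPs.
    n_model = DecisionTreeClassifier(max_depth=3)
    n_model.fit(X[covered], y_is_C[covered])
    return p_model, n_model

def predict_pn(p_model, n_model, X):
    covered = p_model.predict(X).astype(bool)
    pred = np.zeros(len(X), dtype=bool)
    pred[covered] = n_model.predict(X[covered])  # keep only what survives N-phase
    return pred
```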
RIPPER vs. PNrule Classification • 5% sample from normal, smurf (DOS), neptune (DOS) from 10% of training data (494,021 connections) • Test on entire test set (311,029 connections)
Classification Based on Associations (CBA) • What are association patterns? • Frequent itemset: captures the set of “items” that co-occur together frequently in a transaction database • Association rule X => y: predicts the occurrence of item y in a transaction given the presence of the items in X • Support: s(X => y) = sigma(X ∪ {y}) / N • Confidence: c(X => y) = sigma(X ∪ {y}) / sigma(X)
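The two measures follow directly from their definitions; the tiny transaction database and the rule {flag_S0} => dos below are made-up illustrations.

```python
# Support and confidence computed straight from the definitions above.
transactions = [
    {"flag_S0", "service_http", "dos"},
    {"flag_S0", "dos"},
    {"flag_SF", "service_http", "normal"},
    {"flag_S0", "service_smtp", "dos"},
]
X, y = {"flag_S0"}, "dos"
n_X = sum(X <= t for t in transactions)            # transactions containing X
n_Xy = sum(X | {y} <= t for t in transactions)     # containing X and y
support = n_Xy / len(transactions)                 # s = sigma(X ∪ {y}) / N
confidence = n_Xy / n_X                            # c = sigma(X ∪ {y}) / sigma(X)
print(support, confidence)                         # 0.75, 1.0
```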
Classification Based on Associations (CBA) • Previous work: • Use association patterns to improve the overall performance of traditional classifiers. • Integrating Classification and Association Rule Mining [Liu, Li, KDD 1998] • CMAR: Accurate Classification Based on Multiple Class-Association Rules [Han, ICDM 2001] • Associations in Network Intrusion Detection • Use classification based on associations for anomaly detection and misuse detection [Lee, Stolfo, Mok 1999] • Look for abnormal associations [Barbara, Wu, Jajodia, 2001]
Methodology • Pipeline: stratify the overall data set by class (dos, probe, u2r, r2l, normal), apply feature selection, generate frequent itemsets per class, and feed the selected itemsets to the classifier • Example per-class itemsets from the diagram: {A, B, C} => dos; {A, C, D} => u2r; {C, K, L} => r2l; {B, F} => probe; {A, B} => normal [pipeline diagram omitted]
Methodology • Current approaches use confidence-like measures to select the best rules to add as features into the classifiers • This works well only if each class is well represented in the data set • For rare-class problems, high-recall itemsets can be useful as long as their precision is not too low • Our approach: • Apply a frequent itemset generation algorithm to each class • Select itemsets to be added as features based on precision, recall, and F-measure • Apply a classification algorithm, i.e., RIPPER, to the new data set (see the sketch below)
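A sketch of the per-class itemset selection step, assuming one-hot encoded connection records in a boolean DataFrame `df_bool` and a label Series `labels`; `mlxtend` supplies the frequent itemset generation, and the thresholds are illustrative, not the values used in the experiments.

```python
# Generate frequent itemsets on one class, keep those with high F-measure;
# each kept itemset becomes a new binary feature column for the classifier.
import pandas as pd
from mlxtend.frequent_patterns import apriori

def select_itemsets(df_bool, labels, target_class, min_support=0.01, min_f=0.3):
    class_rows = df_bool[labels == target_class]
    itemsets = apriori(class_rows, min_support=min_support, use_colnames=True)
    selected = []
    for items in itemsets["itemsets"]:
        covered = df_bool[list(items)].all(axis=1)   # records matching the itemset
        tp = (covered & (labels == target_class)).sum()
        fp = (covered & (labels != target_class)).sum()
        fn = (~covered & (labels == target_class)).sum()
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
        if f >= min_f:                               # filter/rank by F-measure
            selected.append(items)
    return selected
```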
Experimental Results (on modified data) • Four configurations compared: original RIPPER, RIPPER with high-precision rules, RIPPER with high-recall rules, RIPPER with high-F-measure rules [results tables omitted]
Experimental Results (on modified data) • For rare classes, rules ordered according to F-measure produce the best results
CBA Summary • Association rules can improve the overall performance of classifiers • The measure used to select rules for feature addition can affect classifier performance • The proposed F-measure-based rule selection leads to better overall performance
Anomaly Detection – Related Work • Detect novel intrusions using pseudo-Bayesian estimators to estimate prior and posterior probabilities of new attacks [Barbara, Wu, SIAM 2001] • Generate artificial anomalies (intrusions) and then use RIPPER to learn intrusions [Fan et al, ICDM 2001] • Detect intrusions by computing changes in estimated probability distributions [Eskin, ICML 2000] • Clustering based approaches [Portnoy et al, 2001]
SNN Clustering on KDDCup’99 Data • SNN clustering is suited for finding clusters of varying sizes, shapes, and densities in the presence of noise • Dataset • 10,000 examples were sampled from the neptune, smurf, and normal classes, from both training and test data • Other subclasses remain intact • Total number of instances: 97,000 • Applied shared nearest neighbor (SNN) based clustering and k-means clustering (see the sketch below)
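A minimal sketch of the SNN idea, assuming a numeric feature matrix X: similarity between two points is the overlap of their k-nearest-neighbor lists, followed by density-based clustering on the induced distance. This follows the general SNN approach (Ertoz, Steinbach, Kumar), not the authors' exact code; k, eps, and min_samples are illustrative choices.

```python
# Shared-nearest-neighbor distances, then DBSCAN on the precomputed matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def snn_distances(X, k=20):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)                       # (n, k) neighbor indices
    neighbor_sets = [set(row) for row in idx]
    n = len(X)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            dist[i, j] = dist[j, i] = k - shared    # fewer shared neighbors => farther
    return dist

def snn_cluster(X, k=20, eps=10.0, min_samples=5):
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(snn_distances(X, k))
```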
Clustering Results • SNN finds pure clusters consisting of new attack types
Clustering Results • K-means performance (all k-means clusters and tightest k-means clusters) vs. SNN clustering performance [comparison tables omitted]
Nearest Neighbor (NN) Based Outlier Detection • For each point in the training set, calculate the distance to the closest other point • Build a histogram of these distances • Choose a threshold such that a small percentage (e.g., 2%) of the training set is classified as outliers (see the sketch below)
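A minimal sketch of this scheme, assuming numeric feature vectors; the 2% figure follows the slide, everything else is illustrative.

```python
# Score = distance to the closest other training point; flag new points whose
# nearest-training-neighbor distance exceeds a percentile-based threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_nn_threshold(X_train, pct_outliers=2.0):
    nn = NearestNeighbors(n_neighbors=2).fit(X_train)
    dist, _ = nn.kneighbors(X_train)      # column 0 is the point itself (distance 0)
    scores = dist[:, 1]                   # distance to the closest other point
    threshold = np.percentile(scores, 100 - pct_outliers)
    return nn, threshold

def flag_anomalies(nn, threshold, X_new):
    dist, _ = nn.kneighbors(X_new, n_neighbors=1)
    return dist[:, 0] > threshold         # True => flagged as anomalous
```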
Anomaly Detection Using NN Scheme [figure contrasting normal and attack connections omitted]
Novel Attack Detection Using NN Scheme • Detection rate for novel attacks = 68.50% • False positive rate for normal connections = 2.82%
Novel Attack Detection Using NN Scheme [per-attack detection figures omitted]
Conclusions • Predictive models specifically designed for rare classes can improve the detection of small attack types • The SNN clustering based approach shows promise in identifying novel attack types • Simple nearest neighbor based approaches appear capable of detecting anomalies
KDDCup’99 Data Set • KDDCup’99 contains derived high-level features • 3 groups of features • basic features of individual TCP connections (duration, protocol type, service, src & dest bytes, …) • content features within a connection suggested by domain knowledge (e.g., # of failed login attempts) • time-based traffic features of the connection records • “same host” features examine only the connections that have the same destination host as the current connection • “same service” features examine only the connections that have the same service as the current connection
1-NN on Anomalies [figure omitted]