KDD’99 Classifier Learning Contest: Network Intrusion • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar
Outline • Motivation • Objective • Results of the KDD’99 Classifier Learning Contest • The Winning Entry: Bagged Boosting • Second Place: Kernel Miner • Third Place: The MP13 Approach • Conclusions • Personal Opinion IDSL
Motivation • Network security is an important issue. • Intrusions should be prevented, or at least detected, as early as possible. • Classifier learning can help solve this problem. IDSL
Objective • To learn a predictive model capable of distinguishing between legitimate and illegitimate connections in a computer network. IDSL
Introduction • 24 entries were submitted for the contest. • The training and test data were prepared by Prof. Sal Stolfo and Prof. Wenke Lee. • Ramesh Agarwal and Mahesh Joshi discovered a data quality issue with the labels of the test data. • Each entry was scored against the corrected test data by an awk scoring script using the cost matrix. IDSL
The Winning Entries • The winning entry was submitted by Dr. Bernhard Pfahringer of the Austrian Research Institute for Artificial Intelligence. • Second place went to Itzhak Levin of LLSoft, Inc. in Israel. • Third place went to Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin of the company MP13 in Moscow, Russia. • The difference in performance between the three best entries is of only marginal statistical significance. IDSL
Performance Of The Winning Entry • The winning entry achieved an average cost of 0.2331 per test example and obtained the following confusion matrix (rows = actual class, columns = predicted class):

             normal    probe      DOS    U2R     R2L
normal       60,262      243       78      4       6
probe           511    3,471      184      0       0
DOS           5,299    1,328  223,226      0       0
U2R             168       20        0     30      10
R2L          14,527      294        0      8   1,360

IDSL
Statistical Significance • The mean score of the winning entry is 0.2331, and the standard deviation of its per-example costs is 0.8334. • The standard error of the mean is 0.8334/sqrt(N). • The test dataset contains 311,029 examples, but these are not all independent; treating only 77,291 of them as independent gives a standard error of 0.8334/sqrt(77,291) = 0.0030. IDSL
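A quick check of this arithmetic, as a minimal Python sketch (both figures are taken from the slide above):

import math

std_dev = 0.8334       # standard deviation of the per-example cost
n_independent = 77291  # effectively independent test examples, per the slide above
print(std_dev / math.sqrt(n_independent))  # ≈ 0.0030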
Statistical Significance (contd.) • By the two-standard-error criterion, the winning entry is significantly superior to all others except the second- and third-best entries. • The first significant difference is between the 17th- and 18th-best entries: 0.2952 − 0.2684 = 0.0268 (about 9 s.e.). IDSL
A Simple Method Performs Well • One entry was simply “the trusty old 1-nearest neighbor classifier,” which scored 0.2523. • Only nine entries scored better than 1-nearest neighbor, and of these only six were statistically significantly better. IDSL
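For concreteness, a minimal sketch of such a 1-nearest-neighbor classifier (illustrative only; the contest entry's distance metric and feature encoding are not described in the source):

import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    # For each query, return the label of the closest training example
    # (plain Euclidean distance; categorical KDD features would need
    # numeric encoding and scaling first).
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)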
Cost-Based Scoring • The cost matrix used for scoring entries (rows = actual class, columns = predicted class; a scoring sketch follows below):

          normal  probe  DOS  U2R  R2L
normal       0      1     2    2    2
probe        1      0     2    2    2
DOS          2      1     0    2    2
U2R          3      2     2    0    2
R2L          4      2     2    2    0

IDSL
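A sketch of how such scoring works in principle (here in Python rather than awk; the function name is an illustrative assumption):

import numpy as np

# Cost of predicting column-class j for actual row-class i,
# ordered: normal, probe, DOS, U2R, R2L (matrix above).
COST = np.array([
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
])

def average_cost(confusion):
    # confusion[i, j] = number of test examples of actual class i predicted as j
    return (confusion * COST).sum() / confusion.sum()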
Training vs. Test Distribution • Basic domain knowledge about network intrusions suggests that the U2R and R2L categories are intrinsically rare. • The actual distributions of attack types in the 10% training dataset and in the test dataset are:

            10% training          test
normal     97,278 (19.69%)    60,593 (19.48%)
probe       4,107  (0.83%)     4,166  (1.34%)
DOS       391,458 (79.24%)   229,853 (73.90%)
U2R            52  (0.01%)       228  (0.07%)
R2L         1,126  (0.23%)    16,189  (5.20%)

IDSL
The Winning Entry: Bagged Boosting • The solution is essentially a mixture of bagging and boosting. • Asymmetric error costs are taken into account by minimizing conditional risk. • The standard sampling-with-replacement methodology of bagging was modified to put a specific focus on the smaller classes that are expensive if predicted wrongly (see the sketch below). IDSL
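A minimal sketch of such biased bootstrap sampling (the weight values and function name are illustrative assumptions, not the contest entry's actual settings):

import numpy as np

def biased_bootstrap(y, sample_size, class_weight, seed=0):
    # Sample indices with replacement, over-weighting the rare classes
    # that are expensive if predicted wrongly (e.g. U2R, R2L).
    rng = np.random.default_rng(seed)
    w = np.array([class_weight[label] for label in y], dtype=float)
    return rng.choice(len(y), size=sample_size, replace=True, p=w / w.sum())

# Example call with hypothetical weights:
# idx = biased_bootstrap(y, 500_000,
#                        {"normal": 1, "probe": 1, "DOS": 1, "U2R": 50, "R2L": 20})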
Bagged Boosting: Preliminary Exploration • In an initial test stage, various standard learning algorithms were applied: C5, Ripper, naive Bayes, nearest neighbor, and a back-propagation neural network. • This initial scenario was a kind of inverted cross-validation: the data was split into ten folds, with only one fold used for learning and the other nine folds for testing. • All variants of C5 performed much better than naive Bayes. • Boosted trees showed a small but significant lead. IDSL
Bagged Boosting: The Final Predictor • Fifty samples were drawn from the original set of roughly five million examples. • For each sample, an ensemble of ten C5 decision trees was induced using both C5's error-cost and boosting options. • The final predictions were computed on top of the 50 sub-ensembles' predictions by minimizing conditional risk. • This risk is defined as the sum, over classes, of the error cost of predicting a specific class times the probability of the respective class. IDSL
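A minimal sketch of this risk-minimizing final step (function and argument names are assumptions for illustration):

import numpy as np

def predict_min_risk(class_probs, cost):
    # class_probs: (n_examples, n_classes) probabilities, e.g. averaged
    # over the 50 sub-ensembles; cost[i, j] = cost of predicting class j
    # when the true class is i.
    risk = class_probs @ cost        # risk[n, j] = sum_i P(i | x_n) * cost[i, j]
    return np.argmin(risk, axis=1)   # class with minimum conditional risk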
Bagged Boosting: Miscellaneous • Training sets of about half a million examples each took C5 less than an hour to process on a two-processor machine. • Processing all 50 samples yielded 50 × 10 trees. • The final production run took more than a day. IDSL
LLSoft’s Results: Kernel Miner • Kernel Miner is a new data-mining tool based on building an optimal decision forest. • It is a tool for the description, classification, and generalization of data, and for predicting new cases. • It is a fully automated tool that provides solutions to database users. IDSL
LLSoft’s Results: General Model and Algorithm • Kernel Miner is based on a global optimization model, which is decomposed into a system of interrelated, coordinated, and mutually consistent models and criteria. • As a result, Kernel Miner constructs a set of locally optimal decision trees (the decision forest), from which it selects the optimal subset of trees (the subforest) used for predicting new cases. • Taking the reliability and stability of predictions into account helps avoid the overfitting problem. IDSL
LLSoft’s Results: Task • Training dataset: 494,021 records. • Each record contained values of 41 independent variables. • The dependent variable was labeled either as normal (0) or as one of four attack categories (1–4). • Test dataset: 311,029 records. IDSL
LLSoft’s Results:Approach And Method Used • 1. Coding of Values of Variables • 2. Constructing the Set of Initial “Good” Partitions • 3. Constructing the Decision Trees • 4. Selection of the Optimal Decision Subforest • 5. Prediction on the Test Dataset IDSL
LLSoft’s Results: An Example Rule • One rule found by Kernel Miner: the type is "smurf" if and only if (519 < src_bytes <= 1032) and (service is ecr_i). IDSL
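The same rule written as a predicate (a sketch; the record is assumed to be a dict keyed by KDD'99 feature names):

def is_smurf(record):
    # Kernel Miner rule quoted above: smurf iff both conditions hold.
    return 519 < record["src_bytes"] <= 1032 and record["service"] == "ecr_i"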
Compare to the Winning Results • Kernel Miner classified 657 more test examples correctly than the winning entry (289,006 versus 288,349). • Equivalently, Kernel Miner made 657 fewer misclassifications (22,023 versus 22,680). • However, Kernel Miner made more misclassifications in the (R2L, Normal) element of the confusion matrix (14,994 versus 14,527), the element carrying the highest cost. IDSL
Analysis of Results • The majority of misclassifications belong to new attack types that were not in the training data. • There were 4,804 errors predicting "normal" for "R2L" records. • The majority of these records were labeled "guess_passwd" in the test dataset (4,110 out of 4,804). • Note that in the 10% training dataset there were only 53 records labeled "guess_passwd". • Kernel Miner determined what is likely the precise pattern for such records, consisting of 10 decision trees. IDSL
The MP13 Approach • The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space. • The approach employed by the MP13 team follows the idea of so-called 'Partner Systems'. • It is aimed at effective data analysis and the resolution of particular problems based on intrinsic formalization of expert knowledge. IDSL
The MP13 Approach: Steps • Verbal rules constructed by an expert proficient in network security technology and familiar with KDD methods. • A first echelon of voting decision trees. • A second echelon of voting decision trees. IDSL
The MP13 Approach: Work Details • In a preliminary stage, 13 decision trees were generated based on a subset of the training data. • The training dataset was randomly split into three subsamples: 25% for tree generation, 25% for tree tuning, and 50% for estimating model quality. • The learning dataset was prepared as the 10% subset of the given complete training database (about 400,000 entries). • Some of the DOS and "normal" connections were randomly removed from the full training database. • Learning proceeded on the "one against the rest" principle. • The testing dataset was converted into a 'potential space' representation. IDSL
The MP13 Approach: Training Algorithm • A version of the 'Fragment' algorithm is used, originally invented at the IITP (Russian Academy of Sciences) in the 'Partner Systems' division. • To construct a decision tree, the training dataset is split into a learning sample and a testing sample. • The learning sample is used to find the structure of a tree and to generate a hierarchy of models on this tree. • The testing sample is used to select a sub-tree of optimal complexity. • Repeated application of the algorithm to various splits of the training data, in different subspaces of the initial data description, generates the set of voting decision trees (see the sketch below). IDSL
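A minimal sketch of this generic grow-then-prune pattern, using scikit-learn's cost-complexity pruning as a stand-in (the 'Fragment' algorithm itself is not public, so all names and parameters here are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def grow_and_prune(X, y, seed=0):
    # Split into a learning sample (tree structure) and a testing sample
    # (used only to pick the sub-tree of optimal complexity).
    X_learn, X_test, y_learn, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    full = DecisionTreeClassifier(random_state=seed).fit(X_learn, y_learn)
    # Candidate sub-trees of decreasing complexity.
    alphas = full.cost_complexity_pruning_path(X_learn, y_learn).ccp_alphas
    # Keep the sub-tree that scores best on the testing sample.
    return max(
        (DecisionTreeClassifier(random_state=seed, ccp_alpha=a).fit(X_learn, y_learn)
         for a in alphas),
        key=lambda t: t.score(X_test, y_test))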
Conclusions • The winning solution was not significantly better than the two runners-up. • Kernel Miner is a continually developing tool; new methods and algorithms are to be realized in future versions. IDSL
Personal Opinion • Different distributions in the training and testing datasets may influence the final result. • A simple method can perform well. • Time complexity should be taken into account. IDSL