KDD’99 Classifier Learning Contest ~ Network Intrusion • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar
Outline • Motivation • Objective • Results of the KDD’99 Classifier Learning Contest • The Winning Entry: Bagged Boosting • Second Place: Kernel Miner • Third Place: The MP13 Approach • Conclusions • Personal Opinion IDSL
Motivation • Network security is an important issue. • How can network intrusions be prevented in advance? • Classifier learning can help solve this problem. IDSL
Objective • To learn a predictive model capable of distinguishing between legitimate and illegitimate connections in a computer network. IDSL
Introduction • 24 entries were submitted for the contest. • The training and test data were prepared by Prof. Sal Stolfo and Prof. Wenke Lee. • A data quality issue with the labels of the test data was discovered by Ramesh Agarwal and Mahesh Joshi. • Each entry was scored against the corrected test data by an awk scoring script using the cost matrix. IDSL
The Winning Entries • The winning entry was submitted by Dr. Bernhard Pfahringer of the Austrian Research Institute for Artificial Intelligence. • Second-place performance was achieved by Itzhak Levin from LLSoft, Inc. in Israel. • Third-place performance was achieved by Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin of the company MP13 in Moscow, Russia. • The differences in performance among the three best entries are only of marginal statistical significance. IDSL
Performance Of The Winning Entry • The winning entry achieved an average cost of 0.2331 per test example. • [Confusion matrix of the winning entry, shown as a figure] IDSL
Statistical Significance • The mean score of the winning entry is 0.2331. • The standard deviation of its per-example cost is 0.8334. • The standard error of the mean is 0.8334/sqrt(N). • The test dataset contains 311,029 examples, but these are not all statistically independent, so the effective sample size N is taken to be 77,291. • The standard error is therefore 0.8334/sqrt(77,291) = 0.0030. IDSL
Statistical Significance (contd.) • The winning entry is significantly superior to all others except the second- and third-best entries (at the 2 s.e. level). • The first significant difference is between the 17th- and 18th-best entries: 0.2952 - 0.2684 = 0.0268, about 9 s.e. IDSL
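The arithmetic above can be checked directly. A minimal sketch in Python, using only the numbers quoted on the slides:

```python
import math

std_dev = 0.8334      # per-example cost standard deviation (from slide)
n_effective = 77291   # effective number of independent test examples
std_err = std_dev / math.sqrt(n_effective)
print(round(std_err, 4))      # -> 0.003

# Two entries differ significantly (at roughly 2 s.e.) if their
# average costs differ by more than:
print(round(2 * std_err, 4))               # -> 0.006
print(round((0.2952 - 0.2684) / std_err))  # -> 9 standard errors
```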
A Simple Method Performs Well • One entry was simply "the trusty old 1-nearest neighbor classifier" (average cost 0.2523). • Only nine entries scored better than 1-nearest neighbor, and of those only six were statistically significantly better. IDSL
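For reference, a 1-nearest-neighbor baseline takes only a few lines. A sketch using scikit-learn on synthetic stand-in data (not the contestant's actual code or data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the 41-feature KDD'99 records (hypothetical data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 41))
y_train = rng.integers(0, 5, size=1000)    # 5 classes: normal + 4 attack types
X_test = rng.normal(size=(100, 41))

knn = KNeighborsClassifier(n_neighbors=1)  # the "trusty old" 1-NN
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```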
Cost-Based Scoring • The cost matrix used for scoring entries is encoded in the sketch below. IDSL
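The cost matrix published in the contest report can be encoded and applied as follows; this is a sketch of the scoring it implies, not the organizers' actual awk script:

```python
import numpy as np

# KDD'99 cost matrix (rows = actual class, cols = predicted class),
# class order: normal, probe, DOS, U2R, R2L.
COST = np.array([
    [0, 1, 2, 2, 2],   # normal
    [1, 0, 2, 2, 2],   # probe
    [2, 1, 0, 2, 2],   # DOS
    [3, 2, 2, 0, 2],   # U2R
    [4, 2, 2, 2, 0],   # R2L (R2L predicted as normal costs 4, the maximum)
])

def average_cost(confusion):
    """Average cost per test example for a 5x5 confusion matrix
    (rows = actual class, cols = predicted class)."""
    confusion = np.asarray(confusion)
    return (confusion * COST).sum() / confusion.sum()
```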
Training vs. Test Distribution • Basic domain knowledge about network intrusions suggests that the U2R and R2L categories are intrinsically rare. • The actual distributions of attack types in the 10% training and test datasets: [distribution table shown as a figure] IDSL
The Winning Entry: Bagged Boosting • The solution is essentially a mixture of bagging and boosting. • Asymmetric error costs are taken into account by minimizing the conditional risk. • The standard sampling-with-replacement methodology of bagging was modified to put a specific focus on the smaller but expensive-if-predicted-wrongly classes (see the sketch below). IDSL
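A minimal sketch of how bagging's sampling with replacement can be biased toward the small but costly classes; the weight values here are illustrative assumptions, not the winner's actual settings:

```python
import numpy as np

def biased_bootstrap(y, sample_size, class_weight, rng):
    """Sample indices with replacement, with per-example probability
    proportional to the weight of the example's class."""
    w = np.array([class_weight[label] for label in y], dtype=float)
    return rng.choice(len(y), size=sample_size, replace=True, p=w / w.sum())

# Illustrative weights: emphasize the rare, expensive-if-misclassified classes.
weights = {"normal": 1.0, "probe": 1.0, "DOS": 0.2, "U2R": 50.0, "R2L": 50.0}
y = np.array(["normal"] * 900 + ["DOS"] * 90 + ["U2R"] * 5 + ["R2L"] * 5)
idx = biased_bootstrap(y, 500, weights, np.random.default_rng(0))
```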
Bagged Boosting: Preliminary Exploration • In an initial test stage, various standard learning algorithms were applied: C5, RIPPER, naive Bayes, nearest neighbor, and a back-propagation neural network. • This initial scenario was a kind of inverted cross-validation: the data was split into ten folds, with only one fold used for learning and the other nine for testing (see the sketch below). • All variants of C5 performed much better than naive Bayes. • Boosted trees showed a small but significant lead. IDSL
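A sketch of the inverted cross-validation scheme, with a scikit-learn decision tree standing in for C5 (the actual entry used C5 and the other learners listed above):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for C5

def inverted_cv(X, y, n_splits=10):
    """Train on ONE fold and test on the other nine
    (the reverse of ordinary cross-validation)."""
    scores = []
    for nine_folds, one_fold in KFold(n_splits, shuffle=True,
                                      random_state=0).split(X):
        # KFold yields (train_index, test_index); here the roles are
        # inverted: learn on the single small fold, test on the rest.
        clf = DecisionTreeClassifier().fit(X[one_fold], y[one_fold])
        scores.append(clf.score(X[nine_folds], y[nine_folds]))
    return float(np.mean(scores))
```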
Bagged Boosting: The Final Predictor • Fifty samples were drawn from the original set of roughly five million examples. • For each sample, an ensemble of ten C5 decision trees was induced using both C5's error-cost and boosting options. • The final predictions were computed on top of the 50 predictions of the sub-ensembles by minimizing the conditional risk. • This risk is defined as the sum over classes of the cost of predicting a specific class times the probability of the respective true class (see the sketch below). IDSL
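A sketch of the combination step: average the class-probability estimates of the 50 sub-ensembles and predict the class with minimum conditional risk. Plain averaging is an assumption about how the sub-ensemble outputs are combined:

```python
import numpy as np

def predict_min_risk(probs_per_ensemble, cost):
    """probs_per_ensemble: shape (n_ensembles, n_examples, n_classes);
    cost[i, j] = cost of predicting class j when the true class is i.
    Returns the risk-minimizing class index for each example."""
    p = np.asarray(probs_per_ensemble).mean(axis=0)  # combine sub-ensembles
    # risk[n, j] = sum_i p[n, i] * cost[i, j]  (the conditional risk)
    risk = p @ np.asarray(cost, dtype=float)
    return risk.argmin(axis=1)
```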
Bagged Boosting: Miscellaneous • A training set of about half a million examples took C5 less than an hour to process on a two-processor machine. • 50 such samples were processed, yielding 50×10 trees. • The final production run took more than a day. IDSL
LLSoft’s Results: Kernel Miner • Kernel Miner is a new data-mining tool based on building an optimal decision forest. • It is a tool for the description, classification, and generalization of data, and for predicting new cases. • It is a fully automated tool that provides solutions to database users. IDSL
LLSoft’s Results: General Model And Algorithm • Kernel Miner is based on a global optimization model. • This global model is decomposed into a system of interrelated, coordinated, and mutually consistent models and criteria. • As a result, Kernel Miner constructs a set of locally optimal decision trees (the decision forest), from which it selects the optimal subset of trees (the subforest) used for predicting new cases. • Taking the reliability and stability of prediction into account makes it possible to avoid overfitting. IDSL
LLSoft’s Results: Task • Training dataset: 494,021 records. • Each record contains the values of 41 independent variables. • The dependent variable labels each record as either normal (0) or as one of four attack categories (1–4). • Test dataset: 311,029 records. IDSL
LLSoft’s Results: Approach And Method Used • 1. Coding of Values of Variables • 2. Constructing the Set of Initial “Good” Partitions • 3. Constructing the Decision Trees • 4. Selection of the Optimal Decision Subforest • 5. Prediction on the Test Dataset IDSL
LLSoft’s Results: An Example Extracted Rule • The type is "smurf" if and only if (519 < src_bytes <= 1032) and (service is ecr_i). IDSL
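Expressed as code, the rule is a direct transcription (src_bytes and service are standard KDD'99 feature names):

```python
def is_smurf(src_bytes: int, service: str) -> bool:
    """Kernel Miner's extracted rule for the 'smurf' attack type."""
    return 519 < src_bytes <= 1032 and service == "ecr_i"
```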
Compare To The Winning Results • Kernel Miner classified 657 more test examples correctly than the winning entry (289,006 versus 288,349), i.e., it made 657 fewer misclassifications (22,023 versus 22,680). • However, Kernel Miner made more misclassifications in the (R2L, Normal) cell of the confusion matrix (14,994 versus 14,527), which carries the highest cost. IDSL
Analysis of Results • The majority of misclassifications belong to new attack types that were not present in the training data. • There were 4,804 errors predicting "normal" for "R2L" records. • The majority of these records were labeled "guess_passwd" in the test dataset (4,110 out of 4,804). • Note that the 10% training dataset contained only 53 records labeled "guess_passwd". • Kernel Miner nevertheless determined a likely precise pattern for such records, consisting of 10 decision trees. IDSL
The MP13 Approach • The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space. • The approach employed by the MP13 team builds on the idea of so-called "Partner Systems". • It aims at effective data analysis and problem resolution based on intrinsic formalization of expert knowledge. IDSL
The MP13 Approach: Steps • Verbal rules constructed by an expert proficient in network security technology and familiar with KDD methods • First echelon of voting decision trees • Second echelon of voting decision trees IDSL
The MP13 Approach: Work Details • In a preliminary stage, 13 decision trees were generated from a subset of the training data. • The training dataset was randomly split into three subsamples: 25% for tree generation, 25% for tree tuning, and 50% for estimating model quality. • The learning dataset was prepared as 10% of the given complete training database (about 400,000 entries). • Some of the DOS and "normal" connections were randomly removed from the full training database. • Learning proceeded on the "one against the rest" principle. • The test dataset was converted into a "potential space" representation. IDSL
The MP13 Approach: Training Algorithm • A version of the 'Fragment' algorithm was used, originally invented at the IITP (Russian Academy of Sciences) in the 'Partner Systems' division. • To construct a decision tree, the training dataset is split into learning and testing samples. • The learning sample is used to find the structure of a tree and to generate a hierarchy of models on this tree. • The testing sample is used to select a sub-tree of optimal complexity. • Repeated application of the algorithm to different splits of the training data, in different subspaces of the initial data description, generates a set of voting decision trees (see the sketch below). IDSL
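The 'Fragment' algorithm itself is not public; the sketch below only illustrates the outer loop described above (trees grown on different random splits and feature subspaces, combined by voting), with a scikit-learn tree and its cost-complexity pruning as stand-ins for the actual tree learner and optimal-complexity sub-tree selection:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_voting_trees(X, y, n_trees=13, subspace=20, seed=0):
    """Illustrative outer loop: each tree is grown on a random learn/test
    split and a random feature subspace; pruning stands in for selecting
    the sub-tree of optimal complexity."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=subspace, replace=False)
        learn = rng.permutation(len(y))[: len(y) // 2]
        tree = DecisionTreeClassifier(ccp_alpha=1e-3)  # pruned tree
        tree.fit(X[np.ix_(learn, feats)], y[learn])
        trees.append((feats, tree))
    return trees

def vote(trees, X):
    """Majority vote across the trees (assumes integer class labels)."""
    preds = np.stack([t.predict(X[:, f]) for f, t in trees]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```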
Conclusions • The winning solution was not significantly better than the two runners-up. • Kernel Miner is a continually developing tool; new methods and algorithms are to be realized in future versions. IDSL
Personal Opinion • The different distributions of the training and test datasets may have influenced the final results. • A simple method can perform well. • Time complexity should be taken into account. IDSL