KDD’99 Classifier Learning Contest ~ Network Intrusion • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar
Outline • Motivation • Objective • Results of the KDD’99 Classifier Learning Contest • The Winning Entry: Bagged Boosting • Second Place: Kernel Miner • Third Place: The MP13 Approach • Conclusions • Personal Opinion IDSL
Motivation • Network security is an important issue. • How can network intrusions be prevented in advance? • Classifier learning can help solve this problem. IDSL
Objective • To learn a predictive model capable of distinguishing between legitimate and illegitimate connections in a computer network. IDSL
Introduction • 24 entries were submitted for the contest. • The training and test data were prepared by Prof. Sal Stolfo and Prof. Wenke Lee. • A data quality issue with the labels of the test data was discovered by Ramesh Agarwal and Mahesh Joshi. • Each entry was scored against the corrected test data by an awk scoring script using the cost matrix. IDSL
The Winning Entries • The winning entry was submitted by Dr. Bernhard Pfahringer of the Austrian Research Institute for Artificial Intelligence. • Second-place performance was achieved by Itzhak Levin from LLSoft, Inc. in Israel. • Third-place performance was achieved by Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin of the company MP13 in Moscow, Russia. • The differences in performance among the three best entries are only of marginal statistical significance. IDSL
Performance Of The Winning Entry • The winning entry achieved an average cost of 0.2331 per test example. • [Confusion matrix of the winning entry, shown as a figure] IDSL
Statistical Significance • The mean score of the winning entry is 0.2331. • The standard deviation of its per-example cost is 0.8334. • The standard error of the mean is 0.8334/sqrt(N). • The test dataset contains 311,029 examples, but these are not all statistically independent, so the effective sample size N is taken to be 77,291. • The standard error is therefore 0.8334/sqrt(77,291) = 0.0030. IDSL
Statistical Significance (contd.) • The winning entry is significantly superior to all others except the second- and third-best entries (at the 2 s.e. level). • The first significant difference is between the 17th- and 18th-best entries: 0.2952 - 0.2684 = 0.0268, about 9 s.e. IDSL
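The arithmetic above can be checked directly. A minimal sketch in Python, using only the numbers quoted on the slides:

```python
import math

std_dev = 0.8334      # per-example cost standard deviation (from slide)
n_effective = 77291   # effective number of independent test examples
std_err = std_dev / math.sqrt(n_effective)
print(round(std_err, 4))      # -> 0.003

# Two entries differ significantly (at roughly 2 s.e.) if their
# average costs differ by more than:
print(round(2 * std_err, 4))               # -> 0.006
print(round((0.2952 - 0.2684) / std_err))  # -> 9 standard errors
```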
A Simple Method Performs Well • One entry was simply "the trusty old 1-nearest neighbor classifier" (average cost 0.2523). • Only nine entries scored better than 1-nearest neighbor, and of those only six were statistically significantly better. IDSL
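For reference, a 1-nearest-neighbor baseline takes only a few lines. A sketch using scikit-learn on synthetic stand-in data (not the contestant's actual code or data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the 41-feature KDD'99 records (hypothetical data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 41))
y_train = rng.integers(0, 5, size=1000)    # 5 classes: normal + 4 attack types
X_test = rng.normal(size=(100, 41))

knn = KNeighborsClassifier(n_neighbors=1)  # the "trusty old" 1-NN
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```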
Cost-Based Scoring • The cost matrix used for scoring entries is encoded in the sketch below. IDSL
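The cost matrix published in the contest report can be encoded and applied as follows; this is a sketch of the scoring it implies, not the organizers' actual awk script:

```python
import numpy as np

# KDD'99 cost matrix (rows = actual class, cols = predicted class),
# class order: normal, probe, DOS, U2R, R2L.
COST = np.array([
    [0, 1, 2, 2, 2],   # normal
    [1, 0, 2, 2, 2],   # probe
    [2, 1, 0, 2, 2],   # DOS
    [3, 2, 2, 0, 2],   # U2R
    [4, 2, 2, 2, 0],   # R2L (R2L predicted as normal costs 4, the maximum)
])

def average_cost(confusion):
    """Average cost per test example for a 5x5 confusion matrix
    (rows = actual class, cols = predicted class)."""
    confusion = np.asarray(confusion)
    return (confusion * COST).sum() / confusion.sum()
```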
Training vs. Test Distribution • Basic domain knowledge about network intrusions suggests that the U2R and R2L categories are intrinsically rare. • The actual distributions of attack types in the 10% training and test datasets: [distribution table shown as a figure] IDSL
The Winning Entry: Bagged Boosting • The solution is essentially a mixture of bagging and boosting. • Asymmetric error costs are taken into account by minimizing the conditional risk. • The standard sampling-with-replacement methodology of bagging was modified to put a specific focus on the smaller but expensive-if-predicted-wrongly classes (see the sketch below). IDSL
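A minimal sketch of how bagging's sampling with replacement can be biased toward the small but costly classes; the weight values here are illustrative assumptions, not the winner's actual settings:

```python
import numpy as np

def biased_bootstrap(y, sample_size, class_weight, rng):
    """Sample indices with replacement, with per-example probability
    proportional to the weight of the example's class."""
    w = np.array([class_weight[label] for label in y], dtype=float)
    return rng.choice(len(y), size=sample_size, replace=True, p=w / w.sum())

# Illustrative weights: emphasize the rare, expensive-if-misclassified classes.
weights = {"normal": 1.0, "probe": 1.0, "DOS": 0.2, "U2R": 50.0, "R2L": 50.0}
y = np.array(["normal"] * 900 + ["DOS"] * 90 + ["U2R"] * 5 + ["R2L"] * 5)
idx = biased_bootstrap(y, 500, weights, np.random.default_rng(0))
```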
Bagged Boosting: Preliminary Exploration • In an initial test stage, various standard learning algorithms were applied: C5, RIPPER, naive Bayes, nearest neighbor, and a back-propagation neural network. • This initial scenario was a kind of inverted cross-validation: the data was split into ten folds, with only one fold used for learning and the other nine for testing (see the sketch below). • All variants of C5 performed much better than naive Bayes. • Boosted trees showed a small but significant lead. IDSL
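A sketch of the inverted cross-validation scheme, with a scikit-learn decision tree standing in for C5 (the actual entry used C5 and the other learners listed above):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for C5

def inverted_cv(X, y, n_splits=10):
    """Train on ONE fold and test on the other nine
    (the reverse of ordinary cross-validation)."""
    scores = []
    for nine_folds, one_fold in KFold(n_splits, shuffle=True,
                                      random_state=0).split(X):
        # KFold yields (train_index, test_index); here the roles are
        # inverted: learn on the single small fold, test on the rest.
        clf = DecisionTreeClassifier().fit(X[one_fold], y[one_fold])
        scores.append(clf.score(X[nine_folds], y[nine_folds]))
    return float(np.mean(scores))
```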
Bagged Boosting: The Final Predictor • Fifty samples were drawn from the original set of roughly five million examples. • For each sample, an ensemble of ten C5 decision trees was induced using both C5's error-cost and boosting options. • The final predictions were computed on top of the 50 predictions of the sub-ensembles by minimizing the conditional risk. • This risk is defined as the sum over classes of the cost of predicting a specific class times the probability of the respective true class (see the sketch below). IDSL
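A sketch of the combination step: average the class-probability estimates of the 50 sub-ensembles and predict the class with minimum conditional risk. Plain averaging is an assumption about how the sub-ensemble outputs are combined:

```python
import numpy as np

def predict_min_risk(probs_per_ensemble, cost):
    """probs_per_ensemble: shape (n_ensembles, n_examples, n_classes);
    cost[i, j] = cost of predicting class j when the true class is i.
    Returns the risk-minimizing class index for each example."""
    p = np.asarray(probs_per_ensemble).mean(axis=0)  # combine sub-ensembles
    # risk[n, j] = sum_i p[n, i] * cost[i, j]  (the conditional risk)
    risk = p @ np.asarray(cost, dtype=float)
    return risk.argmin(axis=1)
```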
Bagged Boosting: Miscellaneous • A training set of about half a million examples took C5 less than an hour to process on a two-processor machine. • 50 such samples were processed, yielding 50×10 trees. • The final production run took more than a day. IDSL
LLSoft’s Results: Kernel Miner • Kernel Miner is a new data-mining tool based on building an optimal decision forest. • It is a tool for the description, classification, and generalization of data, and for predicting new cases. • It is a fully automated tool that provides solutions to database users. IDSL
LLSoft’s Results: General Model And Algorithm • Kernel Miner is based on a global optimization model. • This global model is decomposed into a system of interrelated, coordinated, and mutually consistent models and criteria. • As a result, Kernel Miner constructs a set of locally optimal decision trees (the decision forest), from which it selects the optimal subset of trees (the subforest) used for predicting new cases. • Taking the reliability and stability of prediction into account makes it possible to avoid overfitting. IDSL
LLSoft’s Results: Task • Training dataset: 494,021 records. • Each record contains the values of 41 independent variables. • The dependent variable labels each record as either normal (0) or as one of four attack categories (1–4). • Test dataset: 311,029 records. IDSL
LLSoft’s Results: Approach And Method Used • 1. Coding of Values of Variables • 2. Constructing the Set of Initial “Good” Partitions • 3. Constructing the Decision Trees • 4. Selection of the Optimal Decision Subforest • 5. Prediction on the Test Dataset IDSL
LLSoft’s Results: An Example Extracted Rule • The type is "smurf" if and only if (519 < src_bytes <= 1032) and (service is ecr_i). IDSL
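Expressed as code, the rule is a direct transcription (src_bytes and service are standard KDD'99 feature names):

```python
def is_smurf(src_bytes: int, service: str) -> bool:
    """Kernel Miner's extracted rule for the 'smurf' attack type."""
    return 519 < src_bytes <= 1032 and service == "ecr_i"
```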
Compare To The Winning Results • Kernel Miner classified 657 more test examples correctly than the winning entry (289,006 versus 288,349), i.e., it made 657 fewer misclassifications (22,023 versus 22,680). • However, Kernel Miner made more misclassifications in the (R2L, Normal) cell of the confusion matrix (14,994 versus 14,527), which carries the highest cost. IDSL
Analysis of Results • The majority of misclassifications belong to new attack types that were not present in the training data. • There were 4,804 errors predicting "normal" for "R2L" records. • The majority of these records were labeled "guess_passwd" in the test dataset (4,110 out of 4,804). • Note that the 10% training dataset contained only 53 records labeled "guess_passwd". • Kernel Miner nevertheless determined a likely precise pattern for such records, consisting of 10 decision trees. IDSL
The MP13 Approach • The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space. • The approach employed by the MP13 team builds on the idea of so-called "Partner Systems". • It aims at effective data analysis and problem resolution based on intrinsic formalization of expert knowledge. IDSL
The MP13 Approach: Steps • Verbal rules constructed by an expert proficient in network security technology and familiar with KDD methods • First echelon of voting decision trees • Second echelon of voting decision trees IDSL
The MP13 Approach: Work Details • In a preliminary stage, 13 decision trees were generated from a subset of the training data. • The training dataset was randomly split into three subsamples: 25% for tree generation, 25% for tree tuning, and 50% for estimating model quality. • The learning dataset was prepared as 10% of the given complete training database (about 400,000 entries). • Some of the DOS and "normal" connections were randomly removed from the full training database. • Learning proceeded on the "one against the rest" principle. • The test dataset was converted into a "potential space" representation. IDSL
The MP13 Approach: Training Algorithm • A version of the 'Fragment' algorithm was used, originally invented at the IITP (Russian Academy of Sciences) in the 'Partner Systems' division. • To construct a decision tree, the training dataset is split into learning and testing samples. • The learning sample is used to find the structure of a tree and to generate a hierarchy of models on this tree. • The testing sample is used to select a sub-tree of optimal complexity. • Repeated application of the algorithm to different splits of the training data, in different subspaces of the initial data description, generates a set of voting decision trees (see the sketch below). IDSL
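The 'Fragment' algorithm itself is not public; the sketch below only illustrates the outer loop described above (trees grown on different random splits and feature subspaces, combined by voting), with a scikit-learn tree and its cost-complexity pruning as stand-ins for the actual tree learner and optimal-complexity sub-tree selection:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_voting_trees(X, y, n_trees=13, subspace=20, seed=0):
    """Illustrative outer loop: each tree is grown on a random learn/test
    split and a random feature subspace; pruning stands in for selecting
    the sub-tree of optimal complexity."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=subspace, replace=False)
        learn = rng.permutation(len(y))[: len(y) // 2]
        tree = DecisionTreeClassifier(ccp_alpha=1e-3)  # pruned tree
        tree.fit(X[np.ix_(learn, feats)], y[learn])
        trees.append((feats, tree))
    return trees

def vote(trees, X):
    """Majority vote across the trees (assumes integer class labels)."""
    preds = np.stack([t.predict(X[:, f]) for f, t in trees]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```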
Conclusions • The winning solution was not significantly better than the two runners-up. • Kernel Miner is a continually developing tool; new methods and algorithms are to be realized in future versions. IDSL
Personal Opinion • The different distributions of the training and test datasets may have influenced the final results. • A simple method can perform well. • Time complexity should be taken into account. IDSL