A Data Mining Approach for Building Cost-Sensitive and Light Intrusion Detection Models Quarterly Review – November 2000 North Carolina State University Columbia University Florida Institute of Technology
Outline • Project description • Progress report: • Cost-sensitive modeling (NCSU/Columbia/FIT). • Automated feature and model construction (NCSU). • Anomaly detection (NCSU/Columbia/FIT). • Attack “clustering” and light modeling (FIT). • Real-time architecture and systems (NCSU/Columbia). • Correlation (NCSU). • Collaboration with industry (NCSU/Columbia). • Publications and software distribution. • Effort and budget. • Plan of work for next quarter
New Ideas and Hypotheses (1/2) • High-volume automated attacks can overwhelm a real-time IDS and its staff • IDS needs to consider cost factors: • Damage cost, response cost, operational cost, etc. • Pure statistical accuracy not ideal: • Base-rate fallacy of anomaly detection. • Alternative: the cost (saving) of an IDS.
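The base-rate point above can be made concrete with Bayes' rule. The sketch below is illustrative only: the function name and all numbers are hypothetical, not taken from the project. It shows that when intrusions are rare, even a detector with a high detection rate and a low false alarm rate produces mostly false alarms.

```python
def alarm_precision(base_rate: float, detection_rate: float,
                    false_alarm_rate: float) -> float:
    """P(intrusion | alarm), computed with Bayes' rule."""
    true_alarms = base_rate * detection_rate
    false_alarms = (1.0 - base_rate) * false_alarm_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical setting: 1 event in 10,000 is an intrusion,
# 99% detection rate, 1% false alarm rate.
p = alarm_precision(base_rate=1e-4, detection_rate=0.99, false_alarm_rate=0.01)
print(f"P(intrusion | alarm) = {p:.4f}")
```

With these assumed numbers, fewer than 1 in 100 alarms corresponds to a real intrusion, which is why the slides argue for cost-based rather than purely statistical evaluation.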
New Ideas and Hypotheses (2/2) • Thorough analysis cannot always be done in real-time by one sensor: • Correlation of multiple sensor outputs. • Trend or scenario analysis. • Need better theories and tools for building misuse and anomaly detection models: • Characteristics of normal data and attack signatures can be measured and utilized.
Main Approaches (1/2) • Cost-sensitive models and architecture: • Optimized for the cost metrics defined by users. • Cost-sensitive machine learning algorithms. • Multiple specialized and light sensors dynamically activated/configured in run-time. • “Load balancing” of models and data • Aggregation and correlation. • Cost-effectiveness as the guiding principle and multi-model correlation as the architectural approach.
Main Approaches (2/2) • Theories and tools for more effective anomaly and misuse detection: • Information-theoretic measures for anomaly detection • “Regularity” of normal data is used to build the model. • New algorithms, e.g. • Unsupervised learning using “noisy” data. • Using “artificial anomalies” • An automated system that integrates all these algorithms/tools.
Project Impacts (1/2) • A better understanding of the cost factors, cost models, and cost metrics related to intrusion detection. • Modeling techniques and deployment strategies for cost-effective IDSs • Provide the “best-valued” protection. • “Clustering” techniques for grouping intrusions and building specialized and light sensors. • An architecture for dynamically activating, configuring, and correlating sensors.
Project Impacts (2/2) • More effective misuse and anomaly detection models • With sound theoretical foundations and automation tools. • Analysis/correlation techniques for understanding/recognizing and predicting complex attack scenarios.
Cost-Sensitive Modeling • In previous quarters: • Cost factors and metrics definition and analysis. • Cost model definition. • Cost-sensitive modeling with machine learning. • Evaluation using DARPA off-line data. • Current quarter: • Real-time architecture. • Dynamic cost-sensitive deployment and correlation of sensors.
A Multi Layer/Component Architecture [Figure: architecture diagram. Labels recovered from the diagram: models, Remote IDS/Sensor, Dynamic Cost-sensitive Decision Making, FW, Real-time IDS, Backend IDS, ID Model Builder.]
Next Steps • Study “realistic” cost metrics in the real world. • Implement a prototype system • Demonstrate the advantage of cost-sensitive modeling and dynamic cost-effective deployment • Use representative scenarios for evaluation.
The Data Mining Process of Building ID Models [Figure: pipeline diagram. Stages, from input to output: raw audit data, packets/events (ASCII), connection/session records, patterns, features, models.]
Feature Construction From Patterns [Figure: flow diagram. Labels recovered from the diagram: new intrusion records, normal and historical intrusion records, mining, patterns, compare, intrusion patterns, features, training data, learning, detection models.]
Status and Next Steps • The effectiveness of the algorithms/tools (process steps) has been validated • 1998 DARPA Evaluation. • Automating the process: • Process steps “chained” together. • Process iteration: under development. • Field test: • Advanced Technology Systems, General Dynamics. • Planned public release 2Q-2001. • Dealing with “unlabeled” data • Integrate “anomaly detection over noisy data (Columbia)” algorithms.
Information-Theoretic Measures for Anomaly Detection • Motivations: • Need a formal understanding. • Hypothesis: • Anomaly detection is based on “regularity” of normal data. • Approach: • Entropy and conditional entropy: regularity • Determine how to build a model. • Relative (conditional) entropy: how the regularities of the training and test datasets relate • Determine the performance of a model on test data.
Case Studies • Anomaly detection for Unix processes • “Short sequences” as normal profile. • A classification approach: • Given the first k system calls, predict the k+1st system call • How to determine the “sequence length”, k? Will including other information help? • UNM sendmail system call traces. • MIT Lincoln Lab BSM data. • Anomaly detection for network • How to partition the data – refine the complex subject. • MIT Lincoln Lab tcpdump data.
Entropy and Conditional Entropy • “Impurity” of the dataset • the smaller (the more regular) the better. • “Irregularity” of sequential dependencies • “uncertainty” of a sequence after seeing its prefix (subsequences) • the smaller (the more regular) the better.
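For reference, the two measures named above have the following textbook definitions. These are the standard formulas, not a transcription of the original slide; here X is the class (or next event), C_X its set of values, and Y the prefix already seen.

```latex
H(X) = -\sum_{x \in C_X} P(x) \log P(x)

H(X \mid Y) = -\sum_{x,\, y} P(x, y) \log P(x \mid y)
```

Both are zero for perfectly regular data and grow as the data becomes less predictable, which matches “the smaller the better”.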
Relative (Conditional) Entropy • How different is p from q: • how different is the regularity of test data from that of training data • the smaller the better.
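The comparison between p (for example, the training data distribution) and q (the test data distribution) can be written with the standard relative entropy and its conditional form. These are textbook definitions, given here as an aid rather than a transcription of the original slide.

```latex
\mathrm{relEntropy}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

\mathrm{relCondEntropy}(p \,\|\, q) = \sum_{x,\, y} p(x, y) \log \frac{p(x \mid y)}{q(x \mid y)}
```

Both are zero when the two distributions agree and grow as their regularities diverge.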
Information Gain and Classification • How much can attribute/feature A contribute to the classification process: • the reduction of entropy when the dataset is partitioned according to values of A. • the larger the better. • if A = the first k events in a sequence (i.e., Y) and the class label is the k+1st event • conditional entropy H(X|Y) is just the second term of Gain(X, A) • the smaller the conditional entropy, the better the performance of the classifier.
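The relationship Gain(X, A) = H(X) - H(X|Y) can be explored on a toy trace. The sketch below is a minimal illustration: the trace, the function names, and the use of base-2 logarithms are assumptions, not the project's tooling. It estimates the conditional entropy of the next event given the previous k events, which is the quantity the slides use to choose the sequence length.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical entropy H(X) of a list of symbols, in bits."""
    n = len(symbols)
    return sum(c / n * math.log2(n / c) for c in Counter(symbols).values())


def conditional_entropy(trace, k):
    """Empirical H(X | Y): X is the next event, Y the k events before it."""
    pairs = [(tuple(trace[i:i + k]), trace[i + k])
             for i in range(len(trace) - k)]
    n = len(pairs)
    joint = Counter(pairs)
    prefix = Counter(y for y, _x in pairs)
    # -sum P(x, y) log P(x | y), written with the ratio inverted
    return sum(c / n * math.log2(prefix[y] / c) for (y, _x), c in joint.items())


# Hypothetical system call trace: after "open" the next call is ambiguous
# unless the call before "open" is also known.
trace = ["open", "read", "open", "write"] * 50

for k in (1, 2, 3):
    h_x = entropy(trace[k:])
    h_x_given_y = conditional_entropy(trace, k)
    print(k, round(h_x, 3), round(h_x_given_y, 3), round(h_x - h_x_given_y, 3))
```

On this toy trace the conditional entropy is about 0.5 bits for k = 1 and drops to 0 for k = 2, so a window of two previous calls already makes the next call fully predictable.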
Relative Conditional Entropy btw. Training and Testing Normal Data
Conditional Entropy of In- and Out- bound Email (MIT/LL BSM)
Key Findings • “Regularity” of data can guide how to build a model • For sequential data, conditional entropy directly influences the detection performance • Determines the (best) sequence length and whether to include more information, before building a model. • With cost also considered, determines the “optimal” model. • The expected detection performance on test data is attained only if the regularity of the test data is similar to that of the training data.
Next Steps • Study how to measure more complex environments • Network topology/configuration/traffic, etc. • Extend the principle/approach for misuse detection: • Measure normal, attack, and their relationship • “Parameter adjustment”, performance prediction.
New Anomaly Detection Approaches • Unsupervised training methods • Build models over noisy (not clean) data • Artificial anomalies • Improves performance of misuse and anomaly detection methods. • Network traffic anomaly detection
AD over Noisy Data • Builds normal models over data containing some anomalies. • Motivating assumptions: • Intrusions are extremely rare compared to normal data. • Intrusions are quantitatively different.
Approach Overview • Mixture model • Normal component • Anomalous component • Build probabilistic model of data • Max likelihood test for detection.
Mixture Model of Anomalies • Assume a generative model: • The data is generated with a probability distribution D. • Each element originates from one of two components: • M, the Majority Distribution ($x \in M$). • A, the Anomalous Distribution ($x \in A$). • Thus: $D = (1-\lambda)M + \lambda A$, where $\lambda$ is the probability that an element comes from A.
Modeling Probability Distributions • Train Probability Distributions over current sets of M and A. • $P_M(X)$ = probability distribution for Majority. • $P_A(X)$ = probability distribution for Anomaly. • Any probability modeling method can be used: • Naïve Bayes, Max Entropy, etc.
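The mixture idea and the likelihood test can be sketched on categorical data. Everything specific below is an assumption made for illustration: the smoothed frequency model for the majority distribution, the uniform anomalous distribution, the mixing weight `lam`, the `threshold`, and the toy data. An element is moved from M to A when doing so raises the log-likelihood of the whole dataset by more than the threshold.

```python
import math
from collections import Counter


def log_likelihood(m_counts, n_anom, lam, n_values):
    """Log-likelihood under a mixture with weight lam on the anomalous part.

    Majority: add-one smoothed frequencies (assumption).
    Anomalous: uniform over the observed values (assumption).
    """
    n_major = sum(m_counts.values())
    total = n_major + n_values  # add-one smoothing denominator
    ll = n_major * math.log(1 - lam)
    ll += n_anom * (math.log(lam) - math.log(n_values))
    ll += sum(c * math.log((c + 1) / total) for c in m_counts.values())
    return ll


def detect(data, lam=0.01, threshold=1.0):
    """Return the elements whose move from M to A raises the likelihood."""
    n_values = len(set(data))
    m_counts, anomalies = Counter(data), []
    for x in data:
        base = log_likelihood(m_counts, len(anomalies), lam, n_values)
        m_counts[x] -= 1  # tentatively move x out of the majority set
        moved = log_likelihood(m_counts, len(anomalies) + 1, lam, n_values)
        if moved - base > threshold:
            anomalies.append(x)  # keep the move
        else:
            m_counts[x] += 1     # undo the move
    return anomalies


# Hypothetical noisy training data: one rare request type among many normal ones.
data = ["GET"] * 1000 + ["POST"] * 200 + ["TRACE"]
print(detect(data))
```

On this toy input only the single rare value is flagged; the frequent values stay in the majority set because moving them lowers the likelihood.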
Experiments • Two sets of experiments: • Measured performance against comparison methods over noisy data. • Measured performance trained over noisy data against comparison methods trained over clean data. • The method is robust in both comparisons.
AD Using Artificial Anomalies • Generate abnormal behavior artificially • Assume the given normal data are representative. • “Near misses” of normal behavior are considered abnormal. • Change the value of only one feature in an instance of normal behavior. • Sparsely represented values are sampled more frequently. • “Near misses” help define a tight boundary enclosing the normal behavior.
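One way to read the generation steps above is sketched below. This is an interpretation under stated assumptions, not the project's algorithm: records are tuples of categorical features, a sparse value is injected into a randomly chosen normal record, the number of attempts per value is the gap to the most frequent value, and candidates that coincide with a normal record are discarded. The function name and the toy records are hypothetical.

```python
import random
from collections import Counter


def artificial_anomalies(normal, seed=0):
    """Generate "near misses": normal records with exactly one feature changed."""
    rng = random.Random(seed)
    seen = set(normal)
    anomalies = []
    for f in range(len(normal[0])):
        counts = Counter(row[f] for row in normal)
        if len(counts) < 2:
            continue  # a constant feature offers no alternative value
        max_count = max(counts.values())
        for value, count in counts.items():
            # Sparse values get more attempts (assumption: gap to the mode).
            for _ in range(max_count - count):
                row = rng.choice(normal)
                if row[f] == value:
                    continue  # nothing would change
                candidate = row[:f] + (value,) + row[f + 1:]
                if candidate not in seen:  # must not collide with normal data
                    anomalies.append(candidate)
    return anomalies


# Hypothetical connection records: (protocol, service, flag).
normal = ([("tcp", "http", "SF")] * 8
          + [("tcp", "smtp", "SF")] * 3
          + [("udp", "dns", "SF")] * 2)
generated = artificial_anomalies(normal)
print(len(generated), sorted(set(generated)))
```

The generated records are labeled “anomaly” and added to the training data, so a standard classifier can learn a boundary around the normal records.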
Experimental Results • Learning algorithm: RIPPER • Data: 1998 DARPA evaluation • U2R, R2L, DOS, PRB: 22 “clusters” • Training data: normal and artificial anomalies • Results • Overall detection rate: 94.26% • Overall false alarm rate: 2.02% • 100% detection: buffer_overflow, guess_passwd, phf, back • 0% detection: perl, spy, teardrop, ipsweep, nmap • 50+% detection: 13 out of 22 intrusion subclasses
Combining Anomaly and Misuse Detection • Training data: normal data, artificially generated anomalies, known intrusion data • The learned model can predict normal, anomaly, or known intrusion subclass • Experiments were performed on increasing subsets of known intrusion subclasses in the training data (simulates identified intrusions over time).
Combining Anomaly and Misuse Detection (continued) • Treat phf, pod, teardrop, spy, and smurf as unknown (absent from the training data) • Anomaly detection rate: phf=25%, pod=100%, teardrop=93.91%, spy=50%, smurf=100% • Overall false alarm rate: 0.20% • The false alarm rate dropped from 2.02% to 0.20% when some known attacks were included in training
Adaptive Combined Anomaly and Misuse Detection • Completely re-training the model whenever a new intrusion is found is a very expensive and slow process. • An effective and fast remedy is very important to thwart these attacks. • Re-training is still necessary when time and resources permit.
Multiple Model Adaptive Approach • Generate an additional detection module that is good only at detecting the newly discovered intrusion. • Method 1: trained from normal and new intrusion data • Method 2: trained from new intrusion data and artificial anomalies • When the old classifier predicts “anomaly”, the instance is passed to the new classifier to check whether it is the new intrusion.
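The two-stage decision described above can be sketched as a small wrapper. The classifiers here are hypothetical stand-ins written as plain functions, and the feature names and labels are invented for the example; in practice both would be learned models.

```python
from typing import Callable, Dict

Record = Dict[str, object]
Classifier = Callable[[Record], str]


def make_ensemble(old_model: Classifier, new_model: Classifier,
                  new_label: str) -> Classifier:
    """Consult new_model only when old_model predicts "anomaly"."""
    def predict(record: Record) -> str:
        label = old_model(record)
        if label != "anomaly":
            return label  # normal or an already known intrusion
        # Second stage: is this anomaly the newly discovered intrusion?
        return new_label if new_model(record) == new_label else "anomaly"
    return predict


# Hypothetical stand-ins for the trained classifiers.
def old_model(record: Record) -> str:
    if record["failed_logins"] > 5:
        return "guess_passwd"
    return "anomaly" if record["bytes"] > 10_000 else "normal"


def new_model(record: Record) -> str:
    return "new_intrusion" if record["flag"] == "frag" else "other"


ensemble = make_ensemble(old_model, new_model, "new_intrusion")
print(ensemble({"failed_logins": 0, "bytes": 50_000, "flag": "frag"}))
```

Because the old model is left untouched, only the small second-stage model has to be trained when a new intrusion appears.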
Multiple Model Adaptive Experiment • The “old model” is trained from n intrusions. • A lightweight model is trained from one new intrusion type. • They are combined as an ensemble. • The accuracy and training time are compared with those of one model trained from n + 1 intrusions.
Multiple Model Adaptive Experiment Result • The accuracy difference is very small • recall: +3.4% • precision: -16% • In other words, the ensemble approach detects more instances of the new intrusion, but also misidentifies more anomalies as the new intrusion. • Training time difference: a factor of 150, or a cup of coffee versus one or two days.
Detecting Anomalies in Network Traffic (1/2) • Can we detect intrusions by identifying novel values in network packets? • Anomaly detection is potentially useful in detecting novel attacks. • Our model is trained on attack-free tcpdump data. • Fields in the Transport layer or below are considered.
Detecting Anomalies in Network Traffic (2/2) • Normal field values are learned. • During evaluation, a function scores a packet based on the likelihood of encountering novel field values. • Initial results indicate our learned model compares favorably with other systems on the 1999 DARPA evaluation data.
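The learn-then-score idea above can be sketched as follows. The scoring function is an assumption chosen for illustration, not the project's actual function: each field with a value never seen in training contributes n / r, where n is the number of training observations of that field and r the number of distinct values seen, so a novel value in a field that rarely varies counts more. The class name, field names, and packets are hypothetical.

```python
from collections import defaultdict


class FieldNoveltyModel:
    """Learns normal field values and scores packets by novel values."""

    def __init__(self):
        self.seen = defaultdict(set)   # field -> values observed in training
        self.count = defaultdict(int)  # field -> number of training observations

    def train(self, packets):
        for packet in packets:
            for field, value in packet.items():
                self.seen[field].add(value)
                self.count[field] += 1

    def score(self, packet):
        total = 0.0
        for field, value in packet.items():
            values = self.seen.get(field)
            if values and value not in values:
                # Assumed score: novelty matters more where few values occur.
                total += self.count[field] / len(values)
        return total


# Hypothetical attack-free training packets (transport layer and below).
training = [
    {"ttl": 64, "dport": 80},
    {"ttl": 64, "dport": 443},
    {"ttl": 128, "dport": 25},
    {"ttl": 64, "dport": 80},
]
model = FieldNoveltyModel()
model.train(training)
print(model.score({"ttl": 64, "dport": 80}), model.score({"ttl": 1, "dport": 80}))
```

A packet made only of previously seen values scores 0, and the score grows with each field that carries a novel value.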