410 likes | 616 Views
Learning Rules and Clusters for Network Anomaly Detection. Philip Chan, Matt Mahoney, Muhammad Arshad Florida Institute of Technology. Outline. Related work in anomaly detection Rule Learning algorithm: LERAD Cluster learning algorithm: CLAD Summary and ongoing work.
E N D
Learning Rules and Clusters for Network Anomaly Detection Philip Chan, Matt Mahoney, Muhammad Arshad Florida Institute of Technology
Outline • Related work in anomaly detection • Rule Learning algorithm: LERAD • Cluster learning algorithm: CLAD • Summary and ongoing work
Related Work in Anomaly Detection • Host-based • STIDE (Forrest et al., 96): system calls, instance-based • (Lane & Brodley, 99): user commands, instance-based • ADMIT (Sequeira & Zaki, 02): user commands, clustering • Network-based • NIDES (SRI, 95): addresses and ports, probabilistic • SPADE (Silicon Defense, 01): addresses and ports, probabilistic • ADAM (Barbara et al., 01): hybrid anomaly-misuse detection
Probabilistic Models • Anomaly detection: • P(x | D NoAttacks) • Given training data with no attacks, estimate the probability of seeing event x • Easier if event x was observed during training • actually, since x is normal, we aren’t interested in its likelihood • Harder if event x was not observed (zero frequency problem) • we are interested in the likelihood of anomalies
Estimating Probability with Zero Frequency • r = number of unique values in an attribute in the training data • n = number of instances with the attribute in the training data • Likelihood of observing a novel value in an attribute is estimated by: p = r / n (Witten and Bell, 1991)
Anomaly Score • Likelihood of novel event = p • During detection, if a novel event (unobserved during training) actually occurs: • anomaly score = 1/p [surprise factor]
Example • Training Sequence1 = a, b, c, d, e, b, f, g, c, h • P(NovelLetter) = 8/10 • Z is observed during detection, anomaly score = 10/8 • Training Sequence2 = a, a, b, b, b, a, b, b, a, a • P(NovelLetter) = 2/10 • Z is observed during detection, anomaly score = 10/2
NonstationaryModel • More likely to see a novel value if novel values were seen recently (e.g., during an attack) • During detection, record when the last novel value was observed • ti = number of seconds since the last novel value in attribute Ai • Anomaly score for Ai: Scorei = ti / pi • Anomaly score for an instance = Si Scorei
LEarning Rules for Anomaly Detection (LERAD) • PHAD uses prior probabilities: P(Z) • ALAD uses conditional probabilities: P(Z|A) • More accurate to learn probabilities that are conditioned on multiple attributes: P(Z|A,B,C…) • Combinatorial explosion • Fast algorithm based on sampling
Rules in LERAD • If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3… • Unlike association rules, our rules allow a set of values in the consequent • Unlike classification rules, our rules don’t require a fixed attribute as the consequent
Semantics of a Rule • If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD) • r = size of Z (# of unique values in Z) • n = # of tuples that satisfy the antecedent and have the Z attribute (support)
Overview of the Algorithm • Randomly select pairs of tuples (packets, connections, …) from a sample of the training data • Create candidate rules based on each pair • Estimate the score of each candidate rule based on a sample of the training data • Prune the candidate rules • Update the consequent and calculate the score for each rule using the entire training set
Creating Candidate Rules • Find the matching attributes; for example, given this randomly selected pair of tuples: • <A=1,B=2,C=3,D=4> and <A=1,B=2,C=3,D=6> • Attributes A, B, and, C match • Create these rules: • A=1, B=2 => C=? • B=2, C=3 => A=? • A=1, C=3 => B=?
Estimating Rule Scores • Randomly pick a sample from the training set to estimate the score (n/r) for each rule • The consequent of each rule is now estimated • n/r=100/3 • n/r=10/2 • n/r=200/100 • The larger the score (n/r), the higher the confidence that the rule captures normal behavior
Pruning Candidate Rules • To reduce the amount of time for learning from the entire training set and during detection • High scoring rules: more confidence for top rules • Redundancy check: some rules are not necessary • Coverage check: minimum set of rules that describe the data
Redundancy Check • Rule 1: • Rule 2: • Rule 2 is more general than Rule 1, which is redundant and can be removed • Rule 3: • Rule 2 and Rule 3 don’t overlap • Rule 4: • Rule 4 is more general than Rule 3, remove Rule 3
Coverage Check • A rule can cover multiple tuples, but a tuple can only be covered by one rule (highest-scoring rule). • Rules are checked in descending order of scores • For each rule in the candidate rule set • mark tuples that are covered by the rule • Rules that don’t cover any tuples are removed • Our coverage check includes the redundancy check
Final Training • The selected rules are trained on the entire training set: consequent and score are updated • n/r=100000/5 • n/r=4000/2 • 90% for training the rules • 10% for validating the rules: rules that cause false alarms are removed (being conservative--the remaining rules are highly predictive)
Scoring during Detection • Score for a matched rule that is violated S = t * n/r where t is the duration since the last time the rule was violated (anomaly occurred wrt the rule) • Anomaly score for the tuple = SiSi
Attributes Used in LERAD-tcp • TCP connections are reassembled (similar to ALAD) • Last 2 bytes of the destination IP address • 4 bytes of the source IP address • Source and destination ports • Duration (from the first packet to the last) • Length, TCP flags • First 8 strings in the payload (delimited by space/new line)
Attributes used in LERAD-all • Attributes used in LERAD-tcp • UDP and ICMP header fields
Experimental Data and Parameters • DARPA 99 data set • Training: Week 3; Testing: Weeks 4 & 5 • Training: 35K tuples (LERAD-tcp); 69K tuples (LERAD-all) • Testing: 178K tuples (LERAD-tcp); 941K tuples (LERAD-all) • 1,000 pairs of tuples were sampled to form candidate rules (more didn’t help much) • 100 tuples were sampled to estimate scores for candidate rules (more didn’t help much)
Experimental Results • Average of 5 runs • 10 false alarms per day • 201 attacks; 74 “hard-to-detect” attacks (Lippmann, 2000) • LERAD-tcp: ~117 detections (58%); ~45 “hard-to-detect” (60%) • LERAD-all: ~112 detections (56%); ~41 “hard-to-detect” (55%)
Experimental Time Statistics • Preprocessing: ~7.5 minutes (2.9GB, training set), ~20 minutes (4GB, test set) • LERAD-tcp: ~6 seconds (4MB, training), ~17 seconds (17MB, testing) • LERAD-all: ~12 seconds (8MB, training), ~95 seconds (91MB, testing) • 50-75 learned final rules
Results from Mixed Data (RAID 03) • DARPA 99 data set: attacks are real, background is simulated • Compared with collected real data • Artifacts: smaller range of values, little “crud,” values stop growing • Modified LERAD: 87 detections, 49 (56%) legitimate • Mixed data: 30 detections, 25 (83%) legitimate
CLAD: Clustering for Anomaly Detection (In Data Mining against Cyber Threats, Kumar et al., 03)
Finding Outliers • Cluster the data points • Outliers: points in far away and sparse clusters • Inter-cluster distance: average distance from the rest of the clusters • Density: number of data points in a fixed-volume cluster
CLAD • Simple efficient clustering algorithm (large amount of data) • Clusters with fixed radius • If a point is within the radius of an existing cluster • Add the point to the cluster • Else • The point becomes the centriod of a new cluster
CLAD Issues • Distance for discrete attributes • Values that are more frequent are likely to be more normal and are consider “closer” • Difference in frequency of discrete values • Power-law distributions: logarithm • Radius of clusters • Select a small random sample • Calculate the distance of all pairs • Average of the smallest 1%
Sparse and Dense Regions • Outliers are in distant and sparse regions • However, an attack might generate many connections and can make its neighborhood not sparse. • (distant and sparse) or (distant and dense) • Distant: distance > avg(distance) + sd(distance) • Sparse: density < avg(density) – sd(density) • Dense: density > avg(density) + sd(density)
Experiments • Weeks 1, 2, 4, and 5 • No explicit training-testing, looking for outliers • A model for each port • Ports with less than 1% traffic are lumped into the “Others” model • Anomaly scores are normalized in SD’s, the “Combined” model simply merges the scores from different models
Ongoing Work • On-line, noise-tolerant LERAD • Applying LERAD to system calls, including arguments • Tokenizing payload to create features
Data Mining for Computer Security Workshop at ICDM03Melbourne, FLNov 19, 2003 www.cs.fit.edu/~pkc/dmsec03/
http://www.cs.fit.edu/~pkc/id/ Thank you Questions?