
Learning Rules and Clusters for Network Anomaly Detection

Presentation Transcript


  1. Learning Rules and Clusters for Network Anomaly Detection Philip Chan, Matt Mahoney, Muhammad Arshad Florida Institute of Technology

  2. Outline • Related work in anomaly detection • Rule Learning algorithm: LERAD • Cluster learning algorithm: CLAD • Summary and ongoing work

  3. Related Work in Anomaly Detection • Host-based • STIDE (Forrest et al., 96): system calls, instance-based • (Lane & Brodley, 99): user commands, instance-based • ADMIT (Sequeira & Zaki, 02): user commands, clustering • Network-based • NIDES (SRI, 95): addresses and ports, probabilistic • SPADE (Silicon Defense, 01): addresses and ports, probabilistic • ADAM (Barbara et al., 01): hybrid anomaly-misuse detection

  4. LERAD: Learning Rules for Anomaly Detection (ICDM 03)

  5. Probabilistic Models • Anomaly detection: • P(x | D_NoAttacks) • Given training data with no attacks, estimate the probability of seeing event x • Easier if event x was observed during training • actually, since x is normal, we aren’t interested in its likelihood • Harder if event x was not observed (zero-frequency problem) • we are interested in the likelihood of anomalies

  6. Estimating Probability with Zero Frequency • r = number of unique values in an attribute in the training data • n = number of instances with the attribute in the training data • Likelihood of observing a novel value in an attribute is estimated by: p = r / n (Witten and Bell, 1991)

  7. Anomaly Score • Likelihood of novel event = p • During detection, if a novel event (unobserved during training) actually occurs: • anomaly score = 1/p [surprise factor]

  8. Example • Training Sequence1 = a, b, c, d, e, b, f, g, c, h • P(NovelLetter) = 8/10 • Z is observed during detection, anomaly score = 10/8 • Training Sequence2 = a, a, b, b, b, a, b, b, a, a • P(NovelLetter) = 2/10 • Z is observed during detection, anomaly score = 10/2
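
The arithmetic on this slide is easy to reproduce. A minimal Python sketch (illustrative helper names, not the authors' code):

```python
def novel_value_probability(training_values):
    """Witten-Bell estimate: p = r / n, with r = # unique values, n = # instances."""
    r = len(set(training_values))
    n = len(training_values)
    return r / n

def novel_value_score(training_values):
    """Anomaly score when a previously unseen value occurs: 1 / p = n / r."""
    return 1.0 / novel_value_probability(training_values)

seq1 = list("abcdebfgch")  # 8 unique letters out of 10 -> p = 0.8, score = 1.25
seq2 = list("aabbbabbaa")  # 2 unique letters out of 10 -> p = 0.2, score = 5.0
print(novel_value_score(seq1))  # 1.25
print(novel_value_score(seq2))  # 5.0
```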

  9. Nonstationary Model • More likely to see a novel value if novel values were seen recently (e.g., during an attack) • During detection, record when the last novel value was observed • ti = number of seconds since the last novel value in attribute Ai • Anomaly score for Ai: Scorei = ti / pi • Anomaly score for an instance = Σi Scorei
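
A minimal sketch of this nonstationary score, assuming the per-attribute p_i = r_i/n_i values and the sets of training values are already available; initializing t_i to the time since detection started for the first novel value is an assumption, not something stated on the slide:

```python
class NonstationaryScorer:
    """Illustrative sketch, not the authors' implementation."""

    def __init__(self, p_by_attr, seen_by_attr, detection_start):
        self.p = p_by_attr            # attribute -> r/n estimated from training
        self.seen = seen_by_attr      # attribute -> set of values seen in training
        self.last_novel = {a: detection_start for a in p_by_attr}

    def score(self, instance, now):
        total = 0.0
        for attr, value in instance.items():
            if value not in self.seen[attr]:         # value unobserved during training
                t = now - self.last_novel[attr]      # t_i: seconds since last novel value
                total += t / self.p[attr]            # Score_i = t_i / p_i
                self.last_novel[attr] = now
        return total                                 # instance score = sum of Score_i
```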

  10. LEarning Rules for Anomaly Detection (LERAD) • PHAD uses prior probabilities: P(Z) • ALAD uses conditional probabilities: P(Z|A) • More accurate to learn probabilities that are conditioned on multiple attributes: P(Z|A,B,C…) • Combinatorial explosion • Fast algorithm based on sampling

  11. Rules in LERAD • If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3… • Unlike association rules, our rules allow a set of values in the consequent • Unlike classification rules, our rules don’t require a fixed attribute as the consequent

  12. Semantics of a Rule • If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD) • r = size of Z (# of unique values in Z) • n = # of tuples that satisfy the antecedent and have the Z attribute (support)

  13. Overview of the Algorithm • Randomly select pairs of tuples (packets, connections, …) from a sample of the training data • Create candidate rules based on each pair • Estimate the score of each candidate rule based on a sample of the training data • Prune the candidate rules • Update the consequent and calculate the score for each rule using the entire training set

  14. Creating Candidate Rules • Find the matching attributes; for example, given this randomly selected pair of tuples: • <A=1,B=2,C=3,D=4> and <A=1,B=2,C=3,D=6> • Attributes A, B, and C match • Create these rules: • A=1, B=2 => C=? • B=2, C=3 => A=? • A=1, C=3 => B=?
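
Candidate-rule generation from one sampled pair can be sketched as follows (names are illustrative; tuples are modeled as dicts):

```python
import random

def candidate_rules(t1, t2, max_rules=3):
    """Return candidate rules as (antecedent dict, consequent attribute) pairs."""
    matching = [a for a in t1 if a in t2 and t1[a] == t2[a]]
    random.shuffle(matching)
    rules = []
    for consequent in matching[:max_rules]:
        antecedent = {a: t1[a] for a in matching if a != consequent}
        rules.append((antecedent, consequent))
    return rules

t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
for antecedent, consequent in candidate_rules(t1, t2):
    print(antecedent, "=>", consequent, "= ?")   # e.g. {'A': 1, 'B': 2} => C = ?
```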

  15. Estimating Rule Scores • Randomly pick a sample from the training set to estimate the score (n/r) for each rule • The consequent of each rule is now estimated • n/r=100/3 • n/r=10/2 • n/r=200/100 • The larger the score (n/r), the higher the confidence that the rule captures normal behavior
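
Estimating n/r on a sample could look like the sketch below, where a rule is the (antecedent, consequent attribute) pair produced above:

```python
def estimate_score(rule, sample):
    """Return (n/r, consequent value set) for a candidate rule on a list of tuples."""
    antecedent, consequent = rule
    values = [t[consequent] for t in sample
              if consequent in t
              and all(t.get(a) == v for a, v in antecedent.items())]
    n, r = len(values), len(set(values))
    return (n / r if r else 0.0), set(values)
```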

  16. Pruning Candidate Rules • To reduce the time spent learning from the entire training set and performing detection • High-scoring rules: more confidence in the top rules • Redundancy check: some rules are not necessary • Coverage check: a minimum set of rules that describes the data

  17. Redundancy Check • Rule 1: • Rule 2: • Rule 2 is more general than Rule 1, so Rule 1 is redundant and can be removed • Rule 3: • Rule 2 and Rule 3 don’t overlap • Rule 4: • Rule 4 is more general than Rule 3, so Rule 3 is removed

  18. Coverage Check • A rule can cover multiple tuples, but a tuple can only be covered by one rule (highest-scoring rule). • Rules are checked in descending order of scores • For each rule in the candidate rule set • mark tuples that are covered by the rule • Rules that don’t cover any tuples are removed • Our coverage check includes the redundancy check
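
One way to express the coverage check (which, as noted, subsumes the redundancy check) is this sketch:

```python
def coverage_check(scored_rules, sample):
    """scored_rules: list of (score, antecedent, consequent); keep rules that cover tuples."""
    covered = [False] * len(sample)
    kept = []
    for score, antecedent, consequent in sorted(scored_rules, key=lambda r: r[0],
                                                reverse=True):
        covers_something = False
        for i, t in enumerate(sample):
            if not covered[i] and all(t.get(a) == v for a, v in antecedent.items()):
                covered[i] = True          # tuple is credited to this (highest-scoring) rule
                covers_something = True
        if covers_something:
            kept.append((score, antecedent, consequent))
    return kept
```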

  19. Final Training • The selected rules are trained on the entire training set: consequent and score are updated • n/r=100000/5 • n/r=4000/2 • 90% for training the rules • 10% for validating the rules: rules that cause false alarms are removed (being conservative--the remaining rules are highly predictive)

  20. Scoring during Detection • Score for a matched rule i that is violated: Si = ti * ni/ri, where ti is the duration since the last time rule i was violated (an anomaly occurred with respect to the rule) • Anomaly score for the tuple = Σi Si
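
Detection-time scoring, sketched under the assumption that each learned rule stores its antecedent, consequent attribute, allowed consequent values, and n/r:

```python
def tuple_score(tuple_, rules, last_violation, now, detection_start):
    """rules: rule_id -> (antecedent, consequent, allowed value set, n_over_r)."""
    total = 0.0
    for rule_id, (antecedent, consequent, allowed, n_over_r) in rules.items():
        matched = all(tuple_.get(a) == v for a, v in antecedent.items())
        if matched and tuple_.get(consequent) not in allowed:          # rule violated
            t = now - last_violation.get(rule_id, detection_start)     # time since last violation
            total += t * n_over_r                                      # S_i = t_i * n_i/r_i
            last_violation[rule_id] = now
    return total
```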

  21. Attributes Used in LERAD-tcp • TCP connections are reassembled (similar to ALAD) • Last 2 bytes of the destination IP address • 4 bytes of the source IP address • Source and destination ports • Duration (from the first packet to the last) • Length, TCP flags • First 8 strings in the payload (delimited by space/new line)

  22. Attributes used in LERAD-all • Attributes used in LERAD-tcp • UDP and ICMP header fields

  23. Experimental Data and Parameters • DARPA 99 data set • Training: Week 3; Testing: Weeks 4 & 5 • Training: 35K tuples (LERAD-tcp); 69K tuples (LERAD-all) • Testing: 178K tuples (LERAD-tcp); 941K tuples (LERAD-all) • 1,000 pairs of tuples were sampled to form candidate rules (more didn’t help much) • 100 tuples were sampled to estimate scores for candidate rules (more didn’t help much)

  24. Experimental Results • Average of 5 runs • 10 false alarms per day • 201 attacks; 74 “hard-to-detect” attacks (Lippmann, 2000) • LERAD-tcp: ~117 detections (58%); ~45 “hard-to-detect” (60%) • LERAD-all: ~112 detections (56%); ~41 “hard-to-detect” (55%)

  25. LERAD-all vs. LERAD-tcp

  26. Experimental Time Statistics • Preprocessing: ~7.5 minutes (2.9GB, training set), ~20 minutes (4GB, test set) • LERAD-tcp: ~6 seconds (4MB, training), ~17 seconds (17MB, testing) • LERAD-all: ~12 seconds (8MB, training), ~95 seconds (91MB, testing) • 50-75 learned final rules

  27. Results from Mixed Data (RAID 03) • DARPA 99 data set: attacks are real, background is simulated • Compared with collected real data • Artifacts: smaller range of values, little “crud,” values stop growing • Modified LERAD: 87 detections, 49 (56%) legitimate • Mixed data: 30 detections, 25 (83%) legitimate

  28. CLAD: Clustering for Anomaly Detection (In Data Mining against Cyber Threats, Kumar et al., 03)

  29. Finding Outliers • Cluster the data points • Outliers: points in distant and sparse clusters • Inter-cluster distance: average distance from the rest of the clusters • Density: number of data points in a fixed-volume cluster
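
Because every CLAD cluster has the same fixed radius (and hence fixed volume), density reduces to the member count; inter-cluster distance is just an average over the other centroids, as in this small sketch:

```python
def inter_cluster_distance(centroid, other_centroids, distance):
    """Average distance from one cluster's centroid to all other cluster centroids."""
    return sum(distance(centroid, c) for c in other_centroids) / len(other_centroids)
```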

  30. CLAD • Simple, efficient clustering algorithm (handles large amounts of data) • Clusters with a fixed radius • If a point is within the radius of an existing cluster • Add the point to the cluster • Else • The point becomes the centroid of a new cluster
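
A minimal sketch of this single-pass, fixed-radius clustering; distance() stands in for the mixed-attribute distance discussed on the next slide:

```python
def clad_cluster(points, radius, distance):
    """Return clusters as [centroid, member count] pairs."""
    clusters = []
    for p in points:
        for cluster in clusters:
            if distance(p, cluster[0]) <= radius:
                cluster[1] += 1              # point joins the first cluster within the radius
                break
        else:
            clusters.append([p, 1])          # point becomes the centroid of a new cluster
    return clusters
```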

  31. CLAD Issues • Distance for discrete attributes • Values that are more frequent are likely to be more normal and are considered “closer” • Distance is based on the difference in frequency of discrete values • Frequencies follow power-law distributions, so the logarithm is used • Radius of clusters • Select a small random sample • Calculate the distances between all pairs • Average the smallest 1%
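
The radius heuristic might be sketched as below; the sample size is an assumption, since the slide only says "a small random sample":

```python
import random
from itertools import combinations

def estimate_radius(points, distance, sample_size=200):
    sample = random.sample(points, min(sample_size, len(points)))
    dists = sorted(distance(a, b) for a, b in combinations(sample, 2))
    k = max(1, len(dists) // 100)            # smallest 1% of the pairwise distances
    return sum(dists[:k]) / k
```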

  32. Sparse and Dense Regions • Outliers are in distant and sparse regions • However, an attack might generate many connections and can make its neighborhood not sparse. • (distant and sparse) or (distant and dense) • Distant: distance > avg(distance) + sd(distance) • Sparse: density < avg(density) – sd(density) • Dense: density > avg(density) + sd(density)
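
The outlier test then compares each cluster's inter-cluster distance and density against thresholds one standard deviation from the mean, roughly as follows:

```python
from statistics import mean, stdev

def flag_outlier_clusters(inter_dists, densities):
    """Flag clusters that are distant and either sparse or dense."""
    d_avg, d_sd = mean(inter_dists), stdev(inter_dists)
    den_avg, den_sd = mean(densities), stdev(densities)
    flags = []
    for d, den in zip(inter_dists, densities):
        distant = d > d_avg + d_sd
        sparse = den < den_avg - den_sd
        dense = den > den_avg + den_sd
        flags.append(distant and (sparse or dense))
    return flags
```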

  33. Experiments • Weeks 1, 2, 4, and 5 • No explicit training/testing split; looking for outliers • A model for each port • Ports with less than 1% of the traffic are lumped into the “Others” model • Anomaly scores are normalized in standard deviations (SDs); the “Combined” model simply merges the scores from the different models

  34. Results

  35. LERAD vs. CLAD

  36. Ongoing Work • On-line, noise-tolerant LERAD • Applying LERAD to system calls, including arguments • Tokenizing payload to create features

  37. Data Mining for Computer Security Workshop at ICDM 03, Melbourne, FL, Nov 19, 2003. www.cs.fit.edu/~pkc/dmsec03/

  38. http://www.cs.fit.edu/~pkc/id/ Thank you Questions?
