This presentation explores the use of Variable Length Markov Chains (VLMC) for anomaly detection in intrusion detection systems (IDS). It compares the advantages and disadvantages of signature-based and anomaly-based IDS techniques and explains how a VLMC can be implemented with a Probabilistic Suffix Tree (PST). It covers constructing the VLMC, using the Akaike Information Criterion (AIC) for model selection, and why maximizing the expected log-likelihood selects the best-fit model. It also provides a simulation example using email sending behavior data.
Anomaly Detection using Variable Length Markov Chain

Dong Kwan Lee
What is intrusion and intrusion detection?
• Intrusion
  • Any malicious activity directed at a computer system or the services it provides
• Intrusion detection
  • The process of gathering information from a computer or a network of computers and attempting to detect intruders or system abuse
Detection Techniques
• Signature (rule, misuse) based IDS
  • Looks for sequences of events that match a known attack pattern
  • Advantages
    • Low false positive rate
  • Disadvantages
    • Difficult to detect unknown attacks
    • Rule updates must be distributed securely to all nodes
    • Complexity grows with the number of known attacks
Detection Techniques (Cont)
• Anomaly based IDS
  • Constructs a statistical model (profile) of the typical behavior of a system and monitors how far current activity deviates from that model
  • Advantages
    • Can detect novel, previously unseen attacks
  • Disadvantages
    • High false positive rate
    • Requires substantial computation and memory resources
    • Needs initial training (attack-free training data is required)
Sense of self for UNIX processes (Forrest '96)
• Each program shows a unique pattern of system call usage
• Use sequences of system calls to detect anomalous program behavior
• Map system call IDs to integers and apply statistical analysis
  • exit(1), fork(2), open(3), create(4), …
• N-gram anomaly detection (stide)
  • Training phase: build a database of fixed-length (N) system call patterns
  • Test phase: look up each N-gram of the monitored sequence in the database (a minimal sketch follows)
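A minimal sketch of the stide-style fixed-length N-gram scheme described above; the function names and the use of a Python set as the pattern database are illustrative choices, not taken from the original stide implementation.

```python
# Sketch of fixed-length N-gram (stide-style) anomaly detection
# over integer-mapped system call sequences.

def train_ngrams(calls, n):
    """Collect every length-n window of the training system-call sequence."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}

def count_mismatches(calls, n, normal_db):
    """Count test windows that never appeared in the training database."""
    windows = [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]
    return sum(1 for w in windows if w not in normal_db)

# Toy usage with integer-mapped calls (exit=1, fork=2, open=3, create=4, kill=15):
normal = train_ngrams([1, 2, 3, 4, 2, 3, 4, 1], n=3)
print(count_mismatches([1, 2, 3, 4, 15, 3], n=3, normal_db=normal))  # 2 unseen 3-grams
```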
BSM audit data example

header,134,2,close(2),,Fri Jan 23 16:55:57 1998, + 770003000 msec
argument,1,0x4,fd
path,/etc/security/audit_data
attribute,100660,root,other,8388632,4982,0
subject,root,root,other,root,other,5391,5288,0 0 192.168.0.20
return,success,0
trailer,134

header,120,2,kill(2),,Fri Jan 23 16:55:57 1998, + 780003000 msec
argument,2,0x10,signal
process,root,root,other,root,other,5389,5288,0,192.168.0.20
subject,root,root,other,root,other,5391,5288,0 0 192.168.0.20
return,success,0
trailer,120
Kernel audit events (/etc/security/audit_event)

1: AUE_EXIT  2: AUE_FORK  3: AUE_OPEN  4: AUE_CREAT  5: AUE_LINK  6: AUE_UNLINK
7: AUE_EXEC  8: AUE_CHDIR  9: AUE_MKNOD  10: AUE_CHMOD  11: AUE_CHOWN  12: AUE_UMOUNT
13: AUE_JUNK  14: AUE_ACCESS  15: AUE_KILL  16: AUE_STAT  17: AUE_LSTAT  18: AUE_ACCT
19: AUE_MCTL  20: AUE_REBOOT  21: AUE_SYMLINK  22: AUE_READLINK  23: AUE_EXECVE
24: AUE_CHROOT  25: AUE_VFORK  26: AUE_SETGROUPS  27: AUE_SETPGRP
…
264: AUE_INST_SYNC  265: AUE_SOCKCONFIG
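To connect the raw audit trail with the integer sequences used for detection, the following rough sketch parses the textual audit records shown above and maps each system-call name to an event number. The parsing rule and the EVENT_ID table (in particular the placeholder ID for close, which is not in the excerpt above) are assumptions for illustration only.

```python
# Sketch: turn a textual BSM audit trail into an integer event sequence.

EVENT_ID = {
    "exit": 1, "fork": 2, "open": 3, "creat": 4, "kill": 15,  # from the slide's table
    "close": 112,  # placeholder: not listed in the excerpt above
}

def audit_to_sequence(lines):
    """Extract the system-call name from each 'header' record and map it to an ID."""
    seq = []
    for line in lines:
        if line.startswith("header,"):
            name = line.split(",")[3]      # e.g. "close(2)"
            name = name.split("(")[0]      # strip the "(2)" man-page section
            if name in EVENT_ID:
                seq.append(EVENT_ID[name])
    return seq

# Usage with the two records shown above would yield [112, 15] under these assumptions.
```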
More effective method?
• Exploit the correlation structure of the categorical data sequence
  • Complex sequences usually show an exponentially decaying autocorrelation
  • Estimate the probability distribution of the next symbol from a short, variable-length history of previous symbols
  • Detect anomalous system call sequences using a VLMC
• Why VLMC?
  • Matches the prediction performance of a high-order Markov chain
  • Avoids the curse of dimensionality (the state space does not grow exponentially with the order)
• VLMC construction
  • A VLMC can be implemented as a Probabilistic Suffix Tree (PST)
  • Use the Akaike Information Criterion (AIC) to select the best-fit PST model
Probabilistic Suffix Tree (PST)
• Context tree: the depth of the context used for prediction depends on the particular history observed
• PST example (lookup rule sketched below)
  • Maximum depth L = 2
  • Alphabet size = 2
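A minimal sketch of how a PST can be queried: the tree is stored as a map from context strings to next-symbol distributions, and prediction uses the longest stored suffix of the recent history. The contexts and probabilities below are made up for illustration, not taken from the slide's figure.

```python
# A toy PST with max depth 2 over the alphabet {0, 1}.
PST = {
    "":   {"0": 0.5, "1": 0.5},   # root: marginal distribution
    "0":  {"0": 0.3, "1": 0.7},
    "1":  {"0": 0.6, "1": 0.4},
    "10": {"0": 0.2, "1": 0.8},   # a deeper context is kept only where it helps
}

def next_symbol_dist(history, pst, max_depth=2):
    """Return the distribution attached to the longest stored suffix of history."""
    for depth in range(min(max_depth, len(history)), -1, -1):
        ctx = history[len(history) - depth:]
        if ctx in pst:
            return pst[ctx]
    return pst[""]

print(next_symbol_dist("0110", PST))   # longest stored suffix is "10"
```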
Model selection rule (AIC)
• AIC approximates the Kullback-Leibler (KL) distance between the true model and the estimated model
• Search for the model that maximizes the expected log-likelihood
• Among the candidate models, select the one with the largest value of [maximum log-likelihood - number of free parameters]
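Written out, the selection rule on this slide is the standard AIC criterion (the exact normalization of the penalty is assumed here):

```latex
% Pick the candidate tree T that maximizes the penalized log-likelihood,
% which is equivalent to minimizing AIC.
\[
\hat{T} \;=\; \arg\max_{T}\Bigl(\log \hat{L}(T) - k(T)\Bigr)
\quad\Longleftrightarrow\quad
\hat{T} \;=\; \arg\min_{T}\,\mathrm{AIC}(T),
\qquad
\mathrm{AIC}(T) = -2\log \hat{L}(T) + 2\,k(T),
\]
where $\log \hat{L}(T)$ is the maximized log-likelihood of the training sequence
under tree $T$ and $k(T)$ is the number of free parameters
(independent transition probabilities) in $T$.
```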
Best-fit PST construction
• Initial tree (maximum depth L = 3), built from context counts (a counting sketch follows)
• Training data: 010110110101111011010101101110101110101010101011000
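A counting sketch for building the initial depth-3 tree from the training string above: for every position, each context of length 0 through 3 that ends there gets a count for the symbol that follows it. Function and variable names are illustrative.

```python
from collections import defaultdict

TRAIN = "010110110101111011010101101110101110101010101011000"

def context_counts(data, max_depth):
    """For every context up to max_depth, count which symbol follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(data)):
        for depth in range(0, max_depth + 1):
            if i - depth < 0:
                break
            ctx = data[i - depth:i]        # the depth symbols preceding position i
            counts[ctx][data[i]] += 1      # data[i] is the symbol that follows ctx
    return counts

counts = context_counts(TRAIN, max_depth=3)
# Empirical next-symbol distribution for the context "01":
c01 = counts["01"]
total = sum(c01.values())
print({sym: n / total for sym, n in c01.items()})
```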
Best-fit PST construction (Cont)
• Above tree: maximum log-likelihood -62.5, 6 free parameters, score -62.5 - 6 = -68.5
• Below tree: maximum log-likelihood -63.0, 5 free parameters, score -63.0 - 5 = -68.0
• Since -68.5 < -68.0, the extra child nodes do not improve the penalized likelihood, so they are pruned (see the sketch below)
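The pruning test can be written as a one-line comparison of penalized log-likelihoods; the function below simply restates the slide's rule with illustrative names.

```python
def prefer_expanded(loglik_expanded, k_expanded, loglik_pruned, k_pruned):
    """AIC-style comparison: the higher (log-likelihood - free parameters) wins."""
    return (loglik_expanded - k_expanded) > (loglik_pruned - k_pruned)

# Numbers from the slide: -62.5 - 6 = -68.5 versus -63.0 - 5 = -68.0,
# so the expanded children lose the comparison and are pruned.
print(prefer_expanded(-62.5, 6, -63.0, 5))   # False -> prune
```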
Best-fit PST construction (Cont)
• Final tree
• Likelihood calculation for the string s = 10101101 (sketched below)
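A sketch of scoring the test string s = 10101101 with a fitted PST. Since the slide does not list the final tree's probabilities, the FINAL_PST table below uses made-up values; the scoring walk itself applies the longest-suffix rule, summing log-probabilities symbol by symbol.

```python
import math

# Stand-in for the final tree; contexts and probabilities are illustrative only.
FINAL_PST = {
    "":  {"0": 0.45, "1": 0.55},
    "0": {"0": 0.20, "1": 0.80},
    "1": {"0": 0.60, "1": 0.40},
}

def log_likelihood(s, pst, max_depth=3):
    """Sum log P(next symbol | longest stored context) over the string."""
    total = 0.0
    for i, sym in enumerate(s):
        for depth in range(min(max_depth, i), -1, -1):
            ctx = s[i - depth:i]           # longest candidate context ending at i
            if ctx in pst:
                total += math.log(pst[ctx][sym])
                break
    return total

print(log_likelihood("10101101", FINAL_PST))
```

In detection, a window of system calls whose per-symbol log-likelihood under the trained PST is unusually low would be flagged as anomalous.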