Data and Applications Security Developments and Directions. Dr. Bhavani Thuraisingham, The University of Texas at Dallas. Lecture #20, Guest Lecture: Data Mining for Intrusion Detection, by Mamoun Awad. March 24, 2005.
Data Mining & Intrusion Detection Systems • Mamoun Awad • Dept. of Computer Science • University of Texas at Dallas
Outline • Intrusion Detection • Data Mining • Approach • Data set & Results
What is an intrusion? • An intrusion can be defined as “any set of actions that attempt to compromise the: • integrity, • confidentiality, or • availability of a resource”.
Intrusion Examples • Virus • Buffer overflows • 2000 Outlook Express vulnerability. • Denial of Service (DoS) • An explicit attempt by attackers to prevent legitimate users of a service from using that service. • Address spoofing • A malicious user uses a fake IP address to send malicious packets to a target. • Many others • R2L, U2R, Probe, …
Intrusion Detection System (IDS) • An Intrusion Detection System (IDS) inspects all inbound and outbound network activity and identifies suspicious patterns that may indicate a network or system attack from someone attempting to break into or compromise a system.
Attack Types • Host-based attacks • Gain access to privileged services or resources on a machine. • Network-based attacks • Make it difficult for legitimate users to access various network services
IDS Categories • Intrusion detection systems are split into two groups: • Anomaly detection systems • Identify malicious traffic based on deviations from established normal network behavior. • Misuse detection systems • Identify intrusions based on known patterns (signatures) of malicious activity.
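The two categories can be contrasted with a minimal sketch. The signature strings, the traffic baseline (mean and standard deviation), and the threshold `k` are all hypothetical values chosen for illustration; none of them comes from the lecture.

```python
# Hypothetical signatures; a real misuse detector uses a curated rule base.
SIGNATURES = ("/etc/passwd", "cmd.exe")


def misuse_detect(payload: str) -> bool:
    """Misuse detection: flag a payload containing any known signature."""
    return any(sig in payload for sig in SIGNATURES)


def anomaly_detect(value: float, mean: float, std: float, k: float = 3.0) -> bool:
    """Anomaly detection: flag a value more than k std devs from the normal mean."""
    return abs(value - mean) > k * std


print(misuse_detect("GET /../../etc/passwd"))        # True: matches a signature
print(anomaly_detect(950.0, mean=100.0, std=20.0))   # True: far from the baseline
print(anomaly_detect(110.0, mean=100.0, std=20.0))   # False: within normal range
```

The misuse detector can only catch what is already in its signature list, while the anomaly detector can flag unseen behavior at the cost of flagging unusual but harmless traffic.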
Problem Statement • Goal of Intrusion Detection Systems (IDS): • To detect an intrusion as it happens and be able to respond to it. • False positives: • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion. • Too many false positives: • Users will stop monitoring the IDS because of the noise. • False negatives: • A false negative is a situation where an intrusion is really happening, but the IDS doesn't catch it.
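As a small worked example, the false positive and false negative rates can be computed from labeled outcomes. This sketch assumes the common definitions FP rate = FP / (FP + TN) and FN rate = FN / (FN + TP); the lecture does not state which definitions it uses, and the sample data is made up.

```python
def error_rates(actual, predicted):
    """Return (fp_rate, fn_rate); True means 'intrusion' in both lists."""
    tp = fp = tn = fn = 0
    for is_attack, alarm in zip(actual, predicted):
        if is_attack and alarm:
            tp += 1
        elif is_attack and not alarm:
            fn += 1  # intrusion missed by the IDS
        elif alarm:
            fp += 1  # alarm raised on normal activity
        else:
            tn += 1
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    fn_rate = fn / (fn + tp) if fn + tp else 0.0
    return fp_rate, fn_rate


# Made-up outcomes: 4 real intrusions and 4 normal events.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]
print(error_rates(actual, predicted))  # (0.25, 0.25)
```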
Problem Statement • Misuse Detection
Firewall Rules • Table columns: Order | Protocol | Source IP | Source Port | Destination IP | Destination Port | Action
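A rule table with these columns can be sketched as follows. The sketch assumes first-match semantics in ascending rule order, `"*"` as a wildcard, and a default action of `"deny"`; these conventions, the `Rule` type, and the sample rules are illustrative assumptions, since the slide's table rows are not available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    order: int
    protocol: str
    src_ip: str
    src_port: str
    dst_ip: str
    dst_port: str
    action: str


def _matches(pattern: str, value: str) -> bool:
    # "*" is an assumed wildcard that matches any value.
    return pattern == "*" or pattern == value


def evaluate(rules, protocol, src_ip, src_port, dst_ip, dst_port, default="deny"):
    """Return the action of the first rule (by order) that matches the packet."""
    packet = (protocol, src_ip, src_port, dst_ip, dst_port)
    for rule in sorted(rules, key=lambda r: r.order):
        fields = (rule.protocol, rule.src_ip, rule.src_port, rule.dst_ip, rule.dst_port)
        if all(_matches(f, v) for f, v in zip(fields, packet)):
            return rule.action
    return default  # assumed default when no rule matches


# Hypothetical rules using documentation-reserved addresses.
RULES = [
    Rule(2, "tcp", "*", "*", "10.0.0.5", "80", "accept"),
    Rule(1, "tcp", "192.0.2.7", "*", "*", "*", "deny"),
]

print(evaluate(RULES, "tcp", "192.0.2.7", "4000", "10.0.0.5", "80"))     # deny
print(evaluate(RULES, "tcp", "198.51.100.1", "4000", "10.0.0.5", "80"))  # accept
```

Because the first match decides, the order column matters: swapping the order of the two rules above would let the blocked source reach port 80.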
Problem Statement • Anomaly Detection
Our Approach • [Diagram; labels: Training Data, Training, SVM, Testing Data, Testing, Class. Annotation on Training Data: "Problem???"]
Our Approach • [Diagram; labels: Training Data, Hierarchical Clustering (DGSOT), Training, SVM, Testing Data, Testing, Class]
DGSOT • Learning Process • Winner Node • Update the Tree • Stopping Criteria
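The winner-node, update, and stopping steps can be illustrated with a flat, SOM-style simplification. This is not the DGSOT algorithm itself: tree growth and the hierarchical structure are omitted, and the learning rate, tolerance, and relative-error stopping rule are assumptions made for this sketch.

```python
import math


def winner(nodes, x):
    """Index of the node whose reference vector is closest to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))


def update(nodes, i, x, rate=0.5):
    """Move the winner's reference vector a fraction of the way toward x."""
    nodes[i] = [w + rate * (xi - w) for w, xi in zip(nodes[i], x)]


def train(nodes, data, rate=0.5, tol=1e-3, max_epochs=100):
    """Repeat winner/update passes until the total error stops changing."""
    prev_error = float("inf")
    for _ in range(max_epochs):
        error = 0.0
        for x in data:
            i = winner(nodes, x)
            error += math.dist(nodes[i], x)
            update(nodes, i, x, rate)
        # Assumed stopping criterion: relative change in error below tol.
        if abs(prev_error - error) <= tol * max(error, 1e-12):
            break
        prev_error = error
    return nodes


# Two made-up groups of points and two initial reference vectors.
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
nodes = train([[1.0, 1.0], [9.0, 9.0]], data)
print(nodes)
```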
Support Vector Machine • Support Vector Machines (SVM) • One of the most powerful classification techniques • Find a hyperplane that separates the classes • Based on the idea of mapping data points to a high-dimensional feature space where a separating hyperplane can be found
Feature Mapping • Feature mapping from a two-dimensional input space to a two-dimensional feature space.
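A small example of such a mapping, under assumptions: the map (x1, x2) → (x1², x2²) and the sample points are chosen for illustration and are not taken from the slide's figure. Points inside and outside the unit circle cannot be split by a line in the input space, but in the feature space the line z1 + z2 = 1 separates them. No SVM training is performed here; the separating line is written down by hand.

```python
def feature_map(x1, x2):
    """Assumed map from 2-D input space to 2-D feature space."""
    return (x1 * x1, x2 * x2)


def classify(point, radius_sq=1.0):
    """Linear decision rule in feature space: sign of z1 + z2 - radius_sq."""
    z1, z2 = feature_map(*point)
    return 1 if z1 + z2 - radius_sq > 0 else -1


inner = [(0.1, 0.2), (-0.5, 0.3), (0.0, -0.7)]   # inside the unit circle
outer = [(1.5, 0.0), (-1.0, 1.2), (0.9, -1.1)]   # outside the unit circle

print([classify(p) for p in inner])  # [-1, -1, -1]
print([classify(p) for p in outer])  # [1, 1, 1]
```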
SVM Limitations • Long training time limits its use. • Clustering has a positive impact on SVM training: each cluster is represented by only one reference. • Reduces training time. • May degrade generalization, because fewer points are used for training.
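The reduction can be sketched as follows, assuming the clusters are already known and that the centroid serves as each cluster's reference. Both are assumptions for illustration; the lecture obtains clusters with DGSOT and does not specify here how the reference is computed.

```python
from collections import defaultdict


def cluster_references(points, cluster_ids):
    """Replace each cluster by one reference point (assumed: its centroid)."""
    groups = defaultdict(list)
    for point, cid in zip(points, cluster_ids):
        groups[cid].append(point)
    return {
        cid: tuple(sum(col) / len(members) for col in zip(*members))
        for cid, members in groups.items()
    }


points = [(0, 0), (2, 0), (10, 10), (10, 12)]
cluster_ids = [0, 0, 1, 1]  # hypothetical cluster assignment
refs = cluster_references(points, cluster_ids)
print(refs)  # {0: (1.0, 0.0), 1: (10.0, 11.0)}
```

Here four training points shrink to two references, which is the source of both the shorter training time and the possible loss of generalization.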
Training set • 1998 DARPA data that originated from the MIT Lincoln Lab • http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html • Size: 1,012,477 data points
Data set / Attack Types • DoS • Denial of service. • R2L • Unauthorized access from a remote machine, e.g., guessing a password. • U2R • Unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks. • Probing • Surveillance and other probing, e.g., port scanning.
Results
Method | Weighted Accuracy | Average Accuracy | Average Training Time | Average FP Rate | Average FN Rate
Random Selection | 62.5% | 62.61% | 0.049 hours | 22.40% | 37.38%
Pure SVM | 62.74% | 62.75% | 0.51 hours | 30.75% | 37.24%
SVM + Rocchio Bundling | 63.09% | 63.11% | 0.93 hours | 30.98% | 36.89%
SVM + DGSOT | 63.34% | 63.36% | 0.26 hours | 51.56% | 36.64%
Relevant and Important Publications • “A Dynamical Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen, and J. Zhou, Bioinformatics, Oxford University Press, UK, 20(16), November 2004, pp. 2605-2617. • “Automatic Image Annotation and Retrieval using Weighted Feature Selection,” Lei Wang and Latifur Khan, to appear in a special issue of Multimedia Tools and Applications, Kluwer Publishers. • “Hierarchical Clustering for Complex Data,” Latifur Khan and Feng Luo, to appear in International Journal on Artificial Intelligence Tools, World Scientific Publishers. • “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering,” Latifur Khan, Mamoun Awad, and Bhavani Thuraisingham, to appear in VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.
Relevant and Important Publications • R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, “The 1999 DARPA off-line intrusion detection evaluation,” Computer Networks, 34, pp. 579-595, 2000.