Data and Applications Security Developments and Directions

Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #20
Guest Lecture: Data Mining for Intrusion Detection
By Mamoun Awad
March 24, 2005

Data Mining & Intrusion Detection Systems
Mamoun Awad
Dept. of Computer Science
University of Texas at Dallas

Outline
• Intrusion Detection
• Data Mining
• Approach
• Data Set & Results

What is an intrusion?
• An intrusion can be defined as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”.

Intrusion Examples
• Virus
• Buffer overflows
  • 2000 Outlook Express vulnerability
• Denial of Service (DoS)
  • An explicit attempt by attackers to prevent legitimate users of a service from using that service.
• Address spoofing
  • A malicious user uses a fake IP address to send malicious packets to a target.
• Many others
  • R2L, U2R, Probe, …

Intrusion Detection System (IDS)
• An Intrusion Detection System (IDS) inspects all inbound and outbound network activity and identifies suspicious patterns that may indicate a network or system attack from someone attempting to break into or compromise a system.

Attack Types
• Host-based attacks
  • Gain access to privileged services or resources on a machine.
• Network-based attacks
  • Make it difficult for legitimate users to access various network services.

IDS Categories
• Intrusion detection systems are split into two groups:
  • Anomaly detection systems
    • Identify malicious traffic based on deviations from established normal network behavior.
  • Misuse detection systems
    • Identify intrusions based on known patterns (signatures) of malicious activity.
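The two categories can be contrasted with a minimal sketch. Everything in it is an assumption made for illustration: the signature strings, the connections-per-minute baseline, and the three-standard-deviation threshold are hypothetical and are not taken from the lecture.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signatures; real misuse detectors use much richer rule languages.
SIGNATURES = ["/etc/passwd", "cmd.exe"]


def misuse_alert(payload: str) -> bool:
    """Misuse detection: flag traffic that contains a known attack pattern."""
    return any(sig in payload for sig in SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], threshold: float = 3.0) -> bool:
    """Anomaly detection: flag a measurement far from the normal profile."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    # Distance from the baseline mean, measured in standard deviations.
    return abs(value - mu) / sigma > threshold


# Hypothetical profile of normal traffic (connections per minute).
normal_connections_per_minute = [48, 52, 50, 47, 53, 51, 49, 50]

print(misuse_alert("GET /../../etc/passwd HTTP/1.0"))
print(anomaly_alert(400, normal_connections_per_minute))
print(anomaly_alert(51, normal_connections_per_minute))
```

The misuse check can only catch patterns it already knows, while the anomaly check can flag unseen behavior at the cost of alerting on unusual but legitimate traffic.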

Problem Statement
• Goal of Intrusion Detection Systems (IDS):
  • To detect an intrusion as it happens and be able to respond to it.
• False positives:
  • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion.
  • Too many false positives:
    • Users will stop monitoring the IDS because of the noise.
• False negatives:
  • A false negative is a situation where an intrusion is really happening, but the IDS does not catch it.
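The two error types can be counted with a short sketch. The rate definitions used here (false positives divided by all non-intrusions, false negatives divided by all intrusions) are a common convention and an assumption, since the slides do not say how the rates are computed; the sample labels are made up.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate).

    Both arguments are sequences of booleans where True means "intrusion".
    """
    pairs = list(zip(actual, predicted))
    false_positives = sum(1 for a, p in pairs if not a and p)
    false_negatives = sum(1 for a, p in pairs if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fp_rate = false_positives / negatives if negatives else 0.0
    fn_rate = false_negatives / positives if positives else 0.0
    return fp_rate, fn_rate


# Hypothetical ground truth and IDS verdicts for ten events.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, True, False, False, False, False]

fp_rate, fn_rate = error_rates(actual, predicted)
print(f"FP rate: {fp_rate:.2%}, FN rate: {fn_rate:.2%}")
```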

Layered Security Mechanism

Problem Statement
• Misuse Detection

Firewalls

Firewall Rules
Order | Protocol | Source IP | Source Port | Destination IP | Destination Port | Action
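A rule table with these columns could be evaluated as in the sketch below. The first-match-wins semantics, the "*" wildcard, the default "deny" action, and all addresses and ports are assumptions chosen for illustration; the slide gives only the column names.

```python
from dataclasses import dataclass

ANY = "*"  # wildcard used in this sketch only


@dataclass(frozen=True)
class Rule:
    order: int
    protocol: str
    src_ip: str
    src_port: str
    dst_ip: str
    dst_port: str
    action: str  # "allow" or "deny"

    def matches(self, packet: dict) -> bool:
        """A rule matches when every field is a wildcard or equals the packet's value."""
        fields = ("protocol", "src_ip", "src_port", "dst_ip", "dst_port")
        return all(getattr(self, f) in (ANY, packet[f]) for f in fields)


def decide(rules, packet, default="deny"):
    """Check rules in ascending order and return the action of the first match."""
    for rule in sorted(rules, key=lambda r: r.order):
        if rule.matches(packet):
            return rule.action
    return default


rules = [
    Rule(1, "tcp", ANY, ANY, "10.0.0.5", "80", "allow"),
    Rule(2, "tcp", "192.0.2.7", ANY, ANY, ANY, "deny"),
    Rule(3, ANY, ANY, ANY, ANY, ANY, "deny"),
]

web = {"protocol": "tcp", "src_ip": "192.0.2.7", "src_port": "40000",
       "dst_ip": "10.0.0.5", "dst_port": "80"}
ssh = dict(web, dst_port="22")

print(decide(rules, web))  # rule 1 matches first
print(decide(rules, ssh))  # rule 1 does not match, rule 2 does
```

Under first-match semantics the Order column matters: swapping rules 1 and 2 would block the web request from 192.0.2.7.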

Hierarchical Distributed Firewall Setup

Problem Statement
• Anomaly Detection

Our Approach
[Figure: diagram with the labels Training Data, Training, SVM, Testing Data, Testing, and Class; the training data is annotated "Problem???"]

Our Approach
[Figure: diagram with the labels Training Data, Hierarchical Clustering (DGSOT), Training, SVM, Testing Data, Testing, and Class]

Dynamically Growing Self-Organizing Tree Algorithm (DGSOT)

DGSOT
• Learning Process
• Winner Node
• Update the Tree
• Stopping Criteria
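The winner, update, and stopping steps can be illustrated with a generic self-organizing-map style sketch. This is not the DGSOT algorithm itself: tree growth is omitted, and the learning rate, its decay, and the movement-based stopping rule are assumptions made for illustration.

```python
import math


def winner_index(nodes, x):
    """Return the index of the node whose reference vector is closest to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))


def update_node(reference, x, learning_rate):
    """Move a reference vector a fraction of the way toward the data point x."""
    return [w + learning_rate * (xi - w) for w, xi in zip(reference, x)]


def learn(nodes, data, learning_rate=0.5, tolerance=1e-3, max_epochs=100):
    """Repeat winner selection and update until the nodes barely move."""
    nodes = [list(n) for n in nodes]
    for _ in range(max_epochs):
        moved = 0.0
        for x in data:
            i = winner_index(nodes, x)
            new_reference = update_node(nodes[i], x, learning_rate)
            moved += math.dist(new_reference, nodes[i])
            nodes[i] = new_reference
        learning_rate *= 0.9  # assumed decay schedule
        if moved < tolerance:  # assumed stopping criterion
            break
    return nodes


# Hypothetical two-dimensional data with two obvious groups.
data = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 8], [8, 9]]
trained = learn([[0, 0], [10, 10]], data)
print(trained)
```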

Support Vector Machine
• Support Vector Machines (SVM)
  • One of the most powerful classification techniques
  • Find a hyperplane that separates the classes
  • Based on the idea of mapping data points to a high-dimensional feature space where a separating hyperplane can be found
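The role of the separating hyperplane can be shown with a small sketch. The weight vector w, the bias b, and the points are hand-picked assumptions; no SVM training is performed here, the code only evaluates a given hyperplane and reports which points lie closest to it.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b; its sign says on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def signed_distances(w, b, points, labels):
    """Distance of each point to the hyperplane, positive when correctly classified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [y * decision_value(w, b, x) / norm for x, y in zip(points, labels)]


# Hypothetical data and hyperplane x1 + x2 - 5 = 0.
points = [(1, 1), (0, 0), (4, 4), (6, 5)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 1.0), -5.0

distances = signed_distances(w, b, points, labels)
margin = min(distances)
# Points at the minimum distance play the role of support vectors for this hyperplane.
closest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
print(margin, closest)
```

An SVM trainer searches for the w and b that make this minimum distance as large as possible.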

The value of support vectors and non-support vectors

The effect of adding new data points on the margins

Feature Mapping
Feature mapping from a two-dimensional input space to a two-dimensional feature space.
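One mapping of this kind is sketched below. The specific map (squaring each coordinate), the sample points, and the separating line are assumptions chosen for illustration and are not necessarily the mapping shown on the slide. In the input space the inner points are surrounded by the outer points, so no straight line separates them; after the mapping a straight line does.

```python
def phi(x):
    """Map a 2-D input point to a 2-D feature point by squaring each coordinate."""
    x1, x2 = x
    return (x1 * x1, x2 * x2)


def linear_rule(z, w=(1.0, 1.0), b=-2.0):
    """Linear classifier in feature space: z1 + z2 - 2 >= 0 means class +1."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1


# Hypothetical data: class -1 lies near the origin, class +1 surrounds it.
inner = [(0.5, 0.0), (0.0, -0.8), (-0.6, 0.6), (0.3, -0.4)]
outer = [(2.0, 0.0), (0.0, -2.5), (-1.5, 1.5), (1.8, -1.2)]

for point in inner + outer:
    print(point, "->", phi(point), "class", linear_rule(phi(point)))
```

In feature space the coordinates sum to the squared distance from the origin, which is why a single line splits the two classes.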

SVM Limitations
• Long training time limits its use.
• Clustering has a positive impact on the training of an SVM: each cluster is represented by only one reference point.
  • Reduces training time.
  • Degrades generalization, because fewer points are used for training.
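The idea of replacing each cluster with one reference point can be sketched as follows. Plain k-means is used here as a stand-in for DGSOT, which is a different (hierarchical) algorithm; the function names, the number of clusters per class, and the data are assumptions made for illustration. The reduced set would then be passed to an SVM trainer, which is not shown.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic initialization (the first k points)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(centers[i], p))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(c) / len(group) for c in zip(*group)]
    return centers


def reduce_training_set(points, labels, k_per_class):
    """Replace each class by at most k_per_class labeled cluster centers."""
    reduced = []
    for label in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == label]
        k = min(k_per_class, len(members))
        reduced.extend((tuple(c), label) for c in kmeans(members, k))
    return reduced


# Hypothetical labeled training points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5),
          (9, 0), (9, 1), (10, 0), (10, 1)]
labels = ["normal"] * 6 + ["attack"] * 4

reduced = reduce_training_set(points, labels, k_per_class=2)
print(len(points), "points reduced to", len(reduced))
```

Fewer training points mean faster training, but any detail inside a cluster is lost, which is the generalization cost noted above.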

Hierarchical clustering with SVM flow chart

Training set
• 1998 DARPA data that originated from the MIT Lincoln Lab
• http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
• Size: 1,012,477 data points

Data set / Attack Types
• DoS
  • Denial of service.
• R2L
  • Unauthorized access from a remote machine, e.g., guessing a password.
• U2R
  • Unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks.
• Probing
  • Surveillance and other probing, e.g., port scanning.
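Individual attack labels in the KDD Cup 99 files can be grouped into these four categories with a lookup table. The sketch below covers only a handful of labels and is not a complete mapping; the function name and the "Unknown" fallback are assumptions made for illustration.

```python
from collections import Counter

# Partial mapping from KDD Cup 99 attack labels to the four categories.
CATEGORY_OF = {
    "smurf": "DoS",
    "neptune": "DoS",
    "guess_passwd": "R2L",
    "buffer_overflow": "U2R",
    "portsweep": "Probe",
    "ipsweep": "Probe",
    "normal": "Normal",
}


def to_category(raw_label: str) -> str:
    """Strip whitespace and the trailing period used in the data files, then look up."""
    name = raw_label.strip().rstrip(".")
    return CATEGORY_OF.get(name, "Unknown")


# Hypothetical sample of the last field of a few records.
sample = ["normal.", "smurf.", "smurf.", "portsweep.", "buffer_overflow.", "guess_passwd."]
print(Counter(to_category(label) for label in sample))
```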

Results
Method | Weighted Accuracy | Average Accuracy | Average Training Time | Average FP Rate | Average FN Rate
Random Selection | 62.5% | 62.61% | 0.049 hours | 22.40% | 37.38%
Pure SVM | 62.74% | 62.75% | 0.51 hours | 30.75% | 37.24%
SVM + Rocchio Bundling | 63.09% | 63.11% | 0.93 hours | 30.98% | 36.89%
SVM + DGSOT | 63.34% | 63.36% | 0.26 hours | 51.56% | 36.64%

Relevant and Important Publications
• “A Dynamically Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen, and J. Zhou, Bioinformatics, Oxford University Press, UK, 20(16), November 2004, pp. 2605-2617.
• “Automatic Image Annotation and Retrieval using Weighted Feature Selection,” Lei Wang and Latifur Khan, to appear in a special issue of Multimedia Tools and Applications, Kluwer Publishers.
• “Hierarchical Clustering for Complex Data,” Latifur Khan and Feng Luo, to appear in International Journal on Artificial Intelligence Tools, World Scientific Publishers.
• “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering,” Latifur Khan, Mamoun Awad, and Bhavani Thuraisingham, to appear in VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.

Relevant and Important Publications
• R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, “The 1999 DARPA off-line intrusion detection evaluation,” Computer Networks, 34, pp. 579-595, 2000.
