Data Mining Approach for Network Intrusion Detection

Zhen Zhang
Advisor: Dr. Chung-E Wang
04/24/2002
Department of Computer Science, California State University, Sacramento

Outline
• Background
• Intrusion Detection: promises and challenges
• Data Mining in IDS: how can it help
• Motivation
• Approaches, tasks, problems and my contributions
• Results
• Conclusion and future work

Intrusion Detection: Building a Secure Network
• Primary assumptions
  • System activities are observable
  • Normal and intrusive activities have distinct evidence
• Main techniques
  • Misuse detection: patterns of well-known attacks
  • Anomaly detection: deviation from normal usage
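The two techniques can be contrasted in a few lines. This is only an illustrative sketch: the signature list, the baseline counts, and the threshold `k` are made-up values, not taken from this work.

```python
from statistics import mean, stdev

# Hypothetical signatures: (protocol, destination port) pairs treated as known attacks.
KNOWN_BAD = {("tcp", 31337), ("udp", 31335)}


def misuse_alert(protocol, dst_port):
    """Misuse detection: alert only when traffic matches a known pattern."""
    return (protocol, dst_port) in KNOWN_BAD


def anomaly_alert(baseline_counts, observed_count, k=3.0):
    """Anomaly detection: alert when a value deviates from normal usage
    by more than k sample standard deviations."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    return abs(observed_count - mu) > k * sigma


# Made-up connections-per-minute measurements under normal use.
baseline = [10, 12, 11, 9, 13, 10, 11]
print(misuse_alert("tcp", 80), anomaly_alert(baseline, 200))
```

The misuse check cannot fire on anything absent from its list, while the anomaly check needs no signature but depends entirely on how representative the baseline is.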

Data Mining in IDS
• Shortfalls of current IDS (mostly misuse detection)
  • Variants: intrusions change easily and frequently, so a small modification to an exploit can evade a fixed signature.
  • False positives: normal activity is flagged as an intrusion, which makes real intrusions hard to pick out.
  • False negatives: attacks for which there are no known signatures go undetected.
  • Data overload: the amount of data to analyze grows rapidly.

What is Data Mining
• Data mining: take data and extract patterns or deviations from it.
• Many different types of algorithms: decision trees, link analysis, clustering, association, rule induction, deviation analysis, and sequence analysis.
• Software and tools:
  • MS SQL Server 2000
  • RIPPER and many others

How can Data Mining help
• Variants
  • Anomaly detection looks for deviation from normal usage rather than a specific signature, so variants in an exploit's code are of little concern.
• False positives
  • Identify recurring sequences of alarms that correspond to valid network activity, so those alarms can be recognized as benign.
• False negatives
  • Attacks for which signatures have not yet been developed might still be detected.
• Data overload
  • Data mining reduces a large volume of records to a smaller set of patterns to review.

Summary of my work
• Identify objective
  • Distinguish network attacks from normal traffic
  • New area: several research projects, no commercial products
  • Focus on the principles and a basic implementation of the concepts
• Data collection
• Data pre-processing on the tcpdump dataset
• Apply data mining to the processed data
• Investigate results
• Software packages used: Visual Basic, Microsoft SQL Server 2000 with Analysis Server, tcpdump

Data Collection
• Tcpdump data (http://iris.cs.uml.edu:8080/)
• Tcpdump was run on the gateway to capture traffic between the LAN and external networks, plus broadcast packets within the LAN
• Headers only, no user data
• Filters were applied so that only TCP and UDP packets were captured
• One baseline dataset and 4 simulated attacks

Tcpdump data format
• TCP packet
  • Time stamp
  • Source IP address
  • Source port
  • Destination IP address
  • Destination port
  • Flags (SYN, FIN, PUSH, RST, or . when none is set)
  • Sequence number of the data in this packet
  • Sequence number of the data expected in return (acknowledgment)
  • Number of bytes of receive buffer space available (window)
  • Indication of whether or not the data is urgent

Tcpdump data format
• UDP packet
  • Time stamp
  • Source IP address
  • Source port
  • Destination IP address
  • Destination port
  • Length of the packet
• Example data

Example tcpdump data
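The fields listed on the two format slides can be pulled out of a tcpdump text line with a small parser. The sketch below assumes the classic one-line tcpdump output with numeric addresses (`time src.port > dst.port: flags first:last(nbytes) ack N win N urg N` for TCP, `udp length` for UDP). The sample lines are made up for illustration and are not from the dataset, and `parse_line` is a hypothetical helper name.

```python
import re

# Common prefix: timestamp, source ip.port, destination ip.port.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+)\s+>\s+"
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+):\s+"
    r"(?P<rest>.*)$"
)
# TCP part: flags, then optional sequence range, ack, window and urgent fields.
TCP_RE = re.compile(
    r"^(?P<flags>[SFPR]+|\.)"
    r"(?:\s+(?P<seq_first>\d+):(?P<seq_last>\d+)\((?P<nbytes>\d+)\))?"
    r"(?:\s+ack\s+(?P<ack>\d+))?"
    r"(?:\s+win\s+(?P<win>\d+))?"
    r"(?:\s+urg\s+(?P<urg>\d+))?"
)
UDP_RE = re.compile(r"^udp\s+(?P<length>\d+)")


def parse_line(line):
    """Return a dict of header fields, or None if the line is not recognized."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    pkt = {
        "time": m["time"],
        "src_ip": m["src_ip"], "src_port": int(m["src_port"]),
        "dst_ip": m["dst_ip"], "dst_port": int(m["dst_port"]),
    }
    udp = UDP_RE.match(m["rest"])
    if udp:
        pkt.update(protocol="udp", length=int(udp["length"]))
        return pkt
    tcp = TCP_RE.match(m["rest"])
    if tcp is None:
        return None
    pkt.update(
        protocol="tcp",
        flags=tcp["flags"],
        nbytes=int(tcp["nbytes"]) if tcp["nbytes"] else 0,
        ack=int(tcp["ack"]) if tcp["ack"] else None,
        win=int(tcp["win"]) if tcp["win"] else None,
        urgent=tcp["urg"] is not None,
    )
    return pkt


# Made-up sample lines in the assumed format.
SAMPLES = [
    "09:32:43.910000 10.0.0.5.1042 > 10.0.0.9.23: S 768512:768512(0) win 4096 <mss 1024>",
    "09:32:43.930000 10.0.0.5.1042 > 10.0.0.9.23: . ack 947649 win 4096",
    "09:32:44.120000 10.0.0.5.1043 > 10.0.0.9.53: udp 32",
]
for sample in SAMPLES:
    print(parse_line(sample))
```

Lines that do not fit the assumed layout return `None` rather than raising, so a capture with unexpected packet types can still be processed.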

Data Pre-processing: 80% ~ 90% of the work
• Convert packet-level information to connection-level records
  • Group packets by the same source/destination IP and port
  • Use flags and acks to determine the status of the connection
    • SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH, SHR, OOS1, OOS2
  • Record start time, duration, protocol
  • Calculate bytes in, bytes out, resend rate
• UDP is connectionless, so simply treat each packet as a connection
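The packet-to-connection step can be sketched as follows. This is a simplified illustration, not the Visual Basic code used in this work: it assumes each packet is a dict with `time` (seconds as a float), `protocol`, addresses, ports, `flags` and `nbytes`, it only distinguishes the states SF, REJ, S0 and S1 (everything else gets the placeholder `OTHER`, which is not one of the listed states), and it does not handle port reuse or the resend rate.

```python
def classify_state(orig_flags, resp_flags):
    """Simplified connection status from the flags each side sent."""
    if "S" not in orig_flags:
        return "OTHER"  # no initial SYN seen (placeholder label)
    if "R" in resp_flags and "S" not in resp_flags:
        return "REJ"    # connection attempt answered with a reset
    if "S" not in resp_flags:
        return "S0"     # connection attempt seen, no reply
    if "F" in orig_flags and "F" in resp_flags:
        return "SF"     # normal establishment and termination
    return "S1"         # established, not (fully) closed


def build_connections(packets):
    """Group packets into connection records, ordered by start time."""
    open_tcp = {}
    records = []
    for pkt in sorted(packets, key=lambda p: p["time"]):
        src = (pkt["src_ip"], pkt["src_port"])
        dst = (pkt["dst_ip"], pkt["dst_port"])
        if pkt["protocol"] == "udp":
            # UDP is connectionless: each packet becomes its own record.
            records.append({
                "protocol": "udp", "start": pkt["time"], "duration": 0.0,
                "orig": src, "resp": dst,
                "bytes_out": pkt.get("length", 0), "bytes_in": 0, "state": None,
            })
            continue
        key = tuple(sorted([src, dst]))  # same key for both directions
        conn = open_tcp.get(key)
        if conn is None:
            # The sender of the first packet is taken as the originator.
            conn = open_tcp[key] = {
                "protocol": "tcp", "start": pkt["time"], "end": pkt["time"],
                "orig": src, "resp": dst, "bytes_out": 0, "bytes_in": 0,
                "orig_flags": set(), "resp_flags": set(),
            }
            records.append(conn)
        from_orig = src == conn["orig"]
        conn["end"] = pkt["time"]
        conn["bytes_out" if from_orig else "bytes_in"] += pkt.get("nbytes", 0)
        conn["orig_flags" if from_orig else "resp_flags"].update(pkt.get("flags", ""))
    for conn in records:
        if conn["protocol"] == "tcp":
            conn["duration"] = conn.pop("end") - conn["start"]
            conn["state"] = classify_state(conn["orig_flags"], conn["resp_flags"])
    return records
```

Keying on the sorted endpoint pair lets both directions of a conversation land in the same record, which is what makes the bytes-in and bytes-out totals possible.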

First round of processing: Intrinsic Features

Establish more information: Same Destination Temporal and Statistical Attributes (last 2 seconds)

Establish more information: Same Service Temporal and Statistical Attributes (last 2 seconds)

Second round of processing: Same Destination Temporal and Statistical Attributes
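A sliding two-second window is enough to derive "same destination" and "same service" statistics for each connection. The sketch below is illustrative only: the record keys (`start`, `dst_ip`, `dst_port`, `state`) and the three feature names are assumptions, since the actual attribute tables are not reproduced in this text.

```python
from collections import deque


def add_window_features(connections, window=2.0):
    """For each connection, summarize the connections that started in the
    preceding `window` seconds to the same destination host or service."""
    recent = deque()  # earlier connections still inside the window
    out = []
    for conn in sorted(connections, key=lambda c: c["start"]):
        # Drop connections that started more than `window` seconds ago.
        while recent and conn["start"] - recent[0]["start"] > window:
            recent.popleft()
        same_host = [c for c in recent if c["dst_ip"] == conn["dst_ip"]]
        same_srv = [c for c in recent if c["dst_port"] == conn["dst_port"]]
        features = dict(conn)
        features["same_host_count"] = len(same_host)
        # Fraction of the same-host connections that got no reply (state S0).
        features["same_host_s0_rate"] = (
            sum(c["state"] == "S0" for c in same_host) / len(same_host)
            if same_host else 0.0
        )
        features["same_srv_count"] = len(same_srv)
        out.append(features)
        recent.append(conn)
    return out
```

A burst of unanswered connection attempts to one host shows up as a high `same_host_count` together with a high `same_host_s0_rate`, which is the kind of evidence a per-connection record alone cannot carry.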

Final round of processing
• Final, but important
  • Reduce the amount of data
  • Remove noise and trivial information
  • Reorganize the data, adding new features if necessary
• Challenges
  • Hard to tell which data to reduce or remove
  • Requires tremendous domain knowledge
  • Needs experiments and adjustments

Data Mining
• Decision tree algorithm
• Microsoft SQL Server 2000 Analysis Server
• Steps:
  • Use 80% of the baseline (normal) dataset as training data
  • Use the remaining 20% as validation data and compute the misclassification rate
  • Use 20% of each of the four intrusion datasets as prediction data and compute the misclassification rate
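The 80/20 split and the misclassification computation in the steps above can be sketched independently of the mining tool. This does not reproduce the SQL Server decision tree: `predict` stands in for whatever trained model is being evaluated, and the `label` key and the random seed are assumptions made for the example.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into (training, held-out)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def misclassification_rate(predict, records, label_key="label"):
    """Fraction of records whose predicted label differs from the recorded one."""
    if not records:
        raise ValueError("no records to evaluate")
    wrong = sum(1 for r in records if predict(r) != r[label_key])
    return wrong / len(records)


# Made-up records; a real run would use the processed connection records.
baseline = [{"id": i, "label": "normal"} for i in range(100)]
train, validation = split_dataset(baseline)


def always_normal(record):
    """Stand-in model that predicts 'normal' for everything."""
    return "normal"


print(len(train), len(validation), misclassification_rate(always_normal, validation))
```

Fixing the seed keeps the split reproducible, so misclassification figures from different feature sets can be compared on the same held-out records.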

Dependency Network

Decision Tree

Apply Data Mining Model to Validate/Predict

Results

Conclusion and future improvement
• Accuracy
  • Preliminary experiments using data mining on the tcpdump data showed promising results
  • Depends on sufficient training data and the right feature set
• Performance
  • 6 hours on one dataset (628775 records)
• Size of time window
  • 2 seconds or larger?
• Automated process
  • Call MSSQL DM and DTS procedures within VB
• Real-time monitoring and alarms

References
• Rebecca Gurley Bace, Intrusion Detection, Macmillan Technical Publishing, 2000
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
• Claude Seidman, Data Mining with Microsoft SQL Server 2000, Microsoft Press, 2001
• http://www.cs.columbia.edu/~sal/hpapers/USENIX/usenix.html
• http://iris.cs.uml.edu:8080/network.html
• http://www-nrg.ee.lbl.gov/ (Network Research Group (NRG) of the Information and Computing Sciences Division (ICSD) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, California)

Thank You!
