1. Parallel and Distributed Computing for Cyber Security
2. Progress in HPC - past 6 decades
3. Applications Drive the Technology “I think there is a world market for maybe 5 computers”
- Thomas Watson Sr. (1943)
4. Data Mining - A Driver for Parallel/Distributed Computing Lots of data being collected in the commercial and scientific worlds
Strong competitive pressure to extract and use the information from the data
Scaling of data mining to large data requires HPC
Data and/or computational resources needed for analysis are often distributed
Sometimes the choice is distributed data mining or no data mining
Ownership, privacy, security issues
5. Cyber Intrusion Detection - Motivation Sophistication and severity of cyber attacks are increasing
Large-scale denial of service attacks
Identity Theft/Fraud
Espionage
DOD and other U.S. Government agencies are major targets for sophisticated state-sponsored cyber attacks
Security mechanisms always have inevitable vulnerabilities
Firewalls are not sufficient to ensure security in computer networks
Insider attacks difficult to detect
7. What are Intrusions? Intrusions are actions that attempt to bypass security mechanisms of computer systems. They are caused by:
Attackers accessing the system from Internet
Insider attackers - authorized users attempting to gain and misuse non-authorized privileges
Typical intrusion scenario
9. Intrusion Detection Systems
10. Data Mining for Intrusion Detection Increased interest in data mining based intrusion detection over the past decade
Misuse detection
Suitable for attacks for which it is difficult to build signatures
Builds predictive models from labeled data sets (instances are labeled as “normal” or “intrusive”) to identify known intrusions
Cannot detect unknown and emerging attacks
MADAM ID project, ADAM project, fuzzy association rules [Bridges00], decision trees [Sinclair99], neural networks [Lippmann00, Ghosh99], genetic algorithms [Bridges00, Sinclair99], cost sensitive modeling (AdaCost [Fan99], MetaCost [Domingos99, Ting00]), learning from rare class ([Kubat97, Fawcett97, Provost01, Japkowicz01, Joshi02, Lazarevic03])
Anomaly detection
Detects emerging/novel attacks as deviations from “normal” behavior
Potential high false alarm rate - previously unseen (yet legitimate) system behaviors may also be recognized as anomalies
PHAD, ALAD [Chan01, Cha02], ADAM [Barbara01], finite mixture model [Yamanishi00], χ²-based [Ye01], temporal sequence learning [Lane98], neural networks [Ryan98], generating artificial anomalies [Fan01], clustering [Eskin02], unsupervised SVM [Eskin02, Lazarevic03], outlier detection schemes (MINDS), Bayesian net [Valdes00], Hidden Markov models [Ourston03]
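The idea of flagging deviations from "normal" behavior can be illustrated with a very small distance-based outlier score. This is only a sketch under stated assumptions: the function name `knn_anomaly_scores`, the two connection features, and the sample values are all hypothetical, and the scoring rule (mean distance to the k nearest neighbors) is a generic stand-in, not the scheme used by any of the systems cited above.

```python
import math

def knn_anomaly_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    A larger score means the point lies farther from the rest of the data.
    """
    scores = []
    for i, p in enumerate(points):
        # Brute-force distances from point i to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical connection features: (duration in seconds, kilobytes transferred).
connections = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.5), (30.0, 900.0)]

scores = knn_anomaly_scores(connections, k=2)
# Index of the connection that deviates most from the others.
most_anomalous = max(range(len(connections)), key=scores.__getitem__)
```

Here the last connection receives by far the highest score. The same mechanism also shows where false alarms come from: a legitimate but previously unseen kind of connection would score high for the same reason.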
11. Data Mining for Intrusion Detection Misuse Detection – Building Predictive Models
12. Misuse Detection – Building Predictive Models Data Mining for Intrusion Detection
13. MINDS – Minnesota INtrusion Detection System
14. Typical Anomaly Detection Output
15. Summarization Using Association Patterns
16. Typical MINDS Output
18. Typical Summarization Output
19. Detecting Modes of Network Traffic Using Clustering Used Shared Nearest Neighbor (SNN) clustering
Not distracted by “noise” in the data
CPU intensive: O(N²)
Requires storing an N x K matrix
K (number of neighbors) is typically between 10 and 20
K should be about the size of the smallest expected mode
Clustered 850,000 connections collected over one hour at one US Army Fort
Took 10 hours on a 16 CPU cluster
Found 3135 clusters
Largest clusters around 500 records, smallest cluster 10 records
Large clusters correspond to normal behavior
Many small clusters correspond to policy violations or other undesired behavior
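The cost figures on this slide (O(N²) time, an N x K neighbor matrix) follow from how shared-nearest-neighbor similarity is computed. The sketch below shows one common formulation of SNN similarity, not the full clustering algorithm and not necessarily the exact variant used in the work described here. The function names, the toy 2-D points, and K = 3 are assumptions made for illustration.

```python
import math

def knn_lists(points, k):
    """Brute-force k-nearest-neighbor lists.

    Needs O(N^2) distance computations and returns an N x K table
    of neighbor indices.
    """
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbors.append(order[:k])
    return neighbors

def snn_similarity(neighbors, i, j):
    """Number of neighbors shared by points i and j.

    The similarity is zero unless each point is in the other's list.
    """
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(set(neighbors[i]) & set(neighbors[j]))

# Toy data (hypothetical): two tight groups and one isolated "noise" point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
nbrs = knn_lists(points, k=3)

within_group = snn_similarity(nbrs, 0, 1)   # two points from the same group
across_groups = snn_similarity(nbrs, 0, 4)  # points from different groups
with_noise = snn_similarity(nbrs, 8, 5)     # the noise point and its nearest point
```

Points in the same group share neighbors, while the isolated point has zero similarity even to its nearest neighbor because that neighbor does not list it in return. This is one way to read the claim that the method is not distracted by noise.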
24. Need for HPC Very large data size
Typical network traffic at the university level reaches around 500 million connections per day
Compute-intensive nature of the pattern-finding algorithms
Associative analysis
Clustering
Sequential pattern analysis
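A back-of-the-envelope calculation makes the scale concrete. It combines the 500 million connections per day quoted here with the clustering run reported earlier in the deck (850,000 connections, 10 hours on a 16-CPU cluster). The extrapolation assumes run time grows purely as N² with an unchanged constant factor on the same hardware, which is a simplifying assumption rather than a measured result.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_connections = 500_000_000   # from the slide: university-level traffic per day
clustered_connections = 850_000   # from the clustering slide: one hour at one site
clustering_hours = 10             # reported run time on a 16-CPU cluster

# Average arrival rate implied by the daily volume.
connections_per_second = daily_connections / SECONDS_PER_DAY

# Assumption: run time scales as N^2 with the same constant factor.
size_ratio = daily_connections / clustered_connections
work_ratio = size_ratio ** 2
projected_hours = clustering_hours * work_ratio

print(f"{connections_per_second:,.0f} connections per second")
print(f"data set is {size_ratio:,.0f} times larger")
print(f"projected clustering time: {projected_hours:,.0f} hours")
```

Under these assumptions one day of traffic is several hundred times larger than the clustered sample, and a quadratic algorithm would need millions of hours on the same cluster, which is why both parallel hardware and better-scaling algorithms are needed.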
25. Need for Distributed Intrusion Detection Attacks on the network infrastructure may be launched from several different locations and may target multiple destinations
Stealthy coordinated attacks with low traffic volumes are difficult to detect by IDSs based at a single network site
Detection of such attacks at an early stage requires correlating data from multiple network sites
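Why a single-site IDS can miss a stealthy coordinated attack can be shown with a toy example. Everything in it is hypothetical: the site names, the per-source probe counts, the alert threshold of 10, and the helper functions `flag_locally` and `flag_globally`. The addresses come from ranges reserved for documentation. It assumes sites exchange only per-source counts, not raw traffic.

```python
from collections import Counter

def flag_locally(site_counts, threshold):
    """Sources that one site would flag using only its own counts."""
    return {src for src, n in site_counts.items() if n >= threshold}

def flag_globally(all_site_counts, threshold):
    """Sources flagged after summing per-source counts across sites."""
    total = Counter()
    for site_counts in all_site_counts:
        total.update(site_counts)  # adds counts key by key
    return {src for src, n in total.items() if n >= threshold}

# Hypothetical counts of suspicious probes per source address at three sites.
site_a = {"203.0.113.7": 4, "198.51.100.2": 1}
site_b = {"203.0.113.7": 5}
site_c = {"203.0.113.7": 3, "192.0.2.9": 2}

THRESHOLD = 10  # assumed alert threshold

local_alerts = [flag_locally(s, THRESHOLD) for s in (site_a, site_b, site_c)]
global_alerts = flag_globally([site_a, site_b, site_c], THRESHOLD)
```

No individual site raises an alert, because the source stays below the threshold everywhere. Only the combined counts reveal it.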
31. Centralizing data is not possible
Data needed for analysis is distributed
Cost of centralizing data is too high
Security and privacy issues
Computational resources needed for analysis are distributed
Need for Grid-based IDS
32. Data Mining Middleware for Grids
33. Grid-Based Data Mining: Distributed Network Intrusion Detection
34. Publications Managing Cyber Threats: Issues, Approaches and Challenges, edited by V. Kumar, J. Srivastava, and A. Lazarevic, Kluwer Academic Publishers (forthcoming).
MINDS - Minnesota Intrusion Detection System, Ertöz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava, J., Kumar, V., Dokas, P., Data Mining: Next Generation Challenges and Future Directions, editors: H. Kargupta, A. Joshi, K. Sivakumar, Y. Yesha MIT/AAAI Press, 2004, AHPCRC Technical Report # 2003-121
Detection of Novel Network Attacks Using Data Mining, L. Ertöz, E. Eilertson, A. Lazarevic, P. Tan, P. Dokas, V. Kumar, J. Srivastava, Workshop on Data Mining for Computer Security, IEEE International Conference on Data Mining, Melbourne, FL, November 19, 2003, AHPCRC Technical Report # 2003-108