Network-based Intrusion Detection, Mitigation and Forensics System

Yan Chen, Department of Electrical Engineering and Computer Science, Northwestern University. Lab for Internet & Security Technology (LIST), http://list.cs.northwestern.edu

The Spread of Sapphire/Slammer Worms

Current Intrusion Detection Systems (IDS) • Mostly host-based and not scalable to high-speed networks • Slammer worm infected 75,000 machines in under 10 minutes • Host-based schemes are inefficient and user dependent • Have to install IDS on all user machines! • Mostly simple signature-based • Cannot recognize unknown anomalies/intrusions • New viruses/worms, polymorphism

Current Intrusion Detection Systems (II) • Statistical detection • Unscalable for flow-level detection • IDS vulnerable to DoS attacks • Overall traffic based: inaccurate, high false positives • Cannot differentiate malicious events from unintentional anomalies • Anomalies can be caused by network element faults • E.g., router misconfiguration, link failures, etc.

Network-based Intrusion Detection, Mitigation, and Forensics System • Online traffic recording [SIGCOMM IMC 2004, INFOCOM 2006, ToN to appear] • Reversible sketch for data streaming computation • Record millions of flows (GB of traffic) in a few hundred KB • Small number of memory accesses per packet • Scalable to large key space sizes (2^32 or 2^64) • Online sketch-based flow-level anomaly detection [IEEE ICDCS 2006] [IEEE CG&A, Security Visualization 06] • Adaptively learn the traffic pattern changes • As a first step, detect TCP SYN flooding, horizontal and vertical scans even when mixed • Online stealthy spreader (botnet scan) detection [IWQoS 2007]

Network-based Intrusion Detection, Mitigation, and Forensics System (II) Integrated approach for false positive reduction • Polymorphic worm signature generation & detection [IEEE Symposium on Security and Privacy 2006] [IEEE ICNP 2007 to appear] • Accurate network diagnostics [ACM SIGCOMM 2006] [IEEE INFOCOM 2007] • Scalable distributed intrusion alert fusion w/ DHT [SIGCOMM Workshop on Large Scale Attack Defense 2006] • Large-scale botnet event forensics using honeynet [work in progress]

System Architecture [Figure: streaming packet data enters two parts. Part I, sketch-based monitoring & detection: reversible sketch monitoring; sketch-based statistical anomaly detection (SSAD); local sketch records sent out for aggregation; remote aggregated sketch records. Filtering separates keys of normal flows from keys of suspicious flows. Part II, per-flow monitoring & detection of suspicious flows: per-flow monitoring; polymorphic worm detection; signature-based detection; network fault diagnosis. Output: intrusion or anomaly alarms. Legend: modules on the critical path, modules on the non-critical path, data path, control path.]

System Deployment • Attached to a router/switch as a black box • Edge network detection particularly powerful [Figure: three panels (a), (b), (c) showing the Internet, router, switches and LANs, with the HPNAIDM system attached through a scan port or a splitter. Panel captions: original configuration; monitor each port separately; monitor aggregated traffic from all ports.]

Detecting Stealthy Spreaders Using Online Outdegree Histograms. Yan Gao¹, Yao Zhao¹, Robert Schweller¹, Shobha Venkataraman², Yan Chen¹, Dawn Song² and Ming-Yang Kao¹ (¹ Northwestern University, ² Carnegie Mellon University)

Outline • Motivation • Problem definition • System design • Evaluation • Conclusion

Motivation • High-speed network monitoring • Small amount of memory usage • Small number of memory accesses per packet • Superspreaders vs. stealthy spreaders • Superspreaders: sources that connect to a large number of distinct destinations • e.g. a compromised host doing fast scanning for worm propagation • Stealthy spreaders: a group of sources, each of which makes more than a certain number of (unsuccessful) connections to distinct destinations • e.g. botnet scans or moderate worm propagation

Existing Data Streaming Algorithms • Online entropy estimation approaches: Chakrabarti et al. [STACS 06] and Guha et al. [ACM SODA 06] • Pros: detect unexpected changes in the network traffic • Cons: lose some concrete distribution information • Online histogram estimation algorithms: Gibbons et al. [VLDB 97] and Gilbert et al. [STOC 02] • Pros: provide more information on the features of network traffic • Cons: cannot record the number of unique items • Superspreader detection schemes: Venkataraman et al. [NDSS 05] and Zhao et al. [IMC 05] • Pros: detect sources with a very large outdegree • Cons: memory usage unscalable to small/medium outdegrees such as bot scans • Superspreader detection is a special case of spreader detection

Problem Definitions Two high-level problems • Construct an approximation of the outdegree histogram online • Directly detect the presence of stealthy spreaders without constructing the complete outdegree histogram

Problem Definition • Input: stream S of (Src, Dst) pairs • Output: z, whose powers define the buckets of the histogram (z = 2) [Figure: histogram; x-axis: number of unique destinations, with bucket boundaries 2^0, 2^1, ..., 2^7; y-axis: number of sources]

Problem Definition • Input: stream S of (SIP, DIP) pairs • Output: W_i, the set of sources. A source s is in W_i if and only if the number of unique destinations that s connects to is in the range [z^i, z^(i+1)) [Figure: histogram; x-axis: number of unique destinations, with bucket boundaries 2^0, 2^1, ..., 2^7; y-axis: number of sources]

Problem Definition • Input: stream S of (SIP, DIP) pairs • Output: m_i = |W_i|. Creating an approximate histogram means estimating m_i for each bucket [Figure: histogram; x-axis: number of unique destinations, with bucket boundaries 2^0, 2^1, ..., 2^7; y-axis: number of sources]
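
For reference, the quantities W_i and m_i defined above can be computed exactly offline. The sketch below is only a ground-truth baseline, not the online algorithm of the talk: it stores every distinct destination per source, which is exactly what a streaming method must avoid. The function name and the toy stream are illustrative assumptions.

```python
from collections import defaultdict


def outdegree_histogram(stream, z=2):
    """Exact outdegree histogram (offline baseline, not memory-efficient).

    Returns (W, m) where W[i] is the set of sources whose number of unique
    destinations lies in [z**i, z**(i+1)) and m[i] = len(W[i]).
    """
    dests = defaultdict(set)
    for src, dst in stream:
        dests[src].add(dst)          # duplicates of (src, dst) count once

    W = defaultdict(set)
    for src, ds in dests.items():
        d = len(ds)
        i = 0
        # Integer comparison avoids floating-point log errors at bucket edges.
        while z ** (i + 1) <= d:
            i += 1
        W[i].add(src)

    m = {i: len(srcs) for i, srcs in W.items()}
    return dict(W), m


# Toy stream: "a" reaches 2 destinations, "b" reaches 1, "c" reaches 5.
stream = [("a", 1), ("a", 1), ("a", 2), ("b", 1),
          ("c", 1), ("c", 2), ("c", 3), ("c", 4), ("c", 5)]
W, m = outdegree_histogram(stream)
```

With z = 2, source "b" falls in bucket 0 ([1, 2)), "a" in bucket 1 ([2, 4)) and "c" in bucket 2 ([4, 8)).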

Contribution • Study the problem of detecting stealthy spreaders online • With constant small memory • With a small number of memory accesses per packet • Design the algorithm to detect stealthy spreaders online by approximating the outdegree histogram • Data recording phase • Sampling and coupon collection-based algorithms • Spreader detection phase • Linear regression to find bins where attacks happen • Show that the change of approximated histogram reveals the presence of anomalies

Recording Phase: Sampling Algorithm • Fast: update a smaller number of counters per packet [Figure: sampling algorithm; a packet (src, dst) is directed according to the hash of its source, e.g. 2^-3 ≤ h(src) ≤ 2^-2]
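
The slide only shows that a packet is routed by the range its source hash falls into. One possible reading, sketched below under explicit assumptions, is geometric level assignment: a source belongs to level i when 2^-(i+1) ≤ h(src) < 2^-i, so each packet touches a single level. `MAX_LEVEL`, the SHA-256 stand-in for a uniform hash, and the per-level Python sets (used here in place of a compact counting structure) are all illustrative choices, not details from the talk.

```python
import hashlib

MAX_LEVEL = 8  # hypothetical number of levels


def h(src):
    """Hash a source key to a 64-bit integer (stand-in for a uniform hash)."""
    digest = hashlib.sha256(str(src).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def level_of(src):
    """Level i with 2**-(i+1) <= h(src) / 2**64 < 2**-i, capped at MAX_LEVEL."""
    x = h(src)
    leading_zeros = 64 - x.bit_length()
    return min(leading_zeros, MAX_LEVEL)


def record(levels, src, dst):
    """Update only the one level that src hashes to."""
    levels[level_of(src)].add((src, dst))


levels = [set() for _ in range(MAX_LEVEL + 1)]
packets = [("10.0.0.1", "10.0.1.1"),
           ("10.0.0.1", "10.0.1.2"),
           ("10.0.0.2", "10.0.1.1")]
for src, dst in packets:
    record(levels, src, dst)
```

Under this reading the slide's example range, 2^-3 ≤ h(src) ≤ 2^-2, corresponds to level 2, and a source lands on level i with probability 2^-(i+1).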

Recording Phase: Coupon Collecting Algorithm • Accurate: create a better approximation interim structure • g_i: uniform random hash function for hashing dst to an integer in [1, 2^i] [Figure: coupon collecting algorithm; a packet (src, dst) with 2^-3 ≤ h(src) ≤ 2^-2 is recorded as (src, g_0(dst)), (src, g_1(dst)), (src, g_2(dst)), (src, g_3(dst)), ..., (src, g_d(dst))]
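
The coupon-collecting intuition can be illustrated numerically. If g_i maps destinations uniformly to [1, 2^i], a source with d distinct destinations is expected to produce 2^i · (1 - (1 - 2^-i)^d) distinct values of g_i(dst), the standard balls-into-bins occupancy result. The sketch below shows only this relationship; how the talk's algorithm stores and counts the pairs (src, g_i(dst)) is not given on the slide, and the SHA-256 based `g` is an assumed stand-in for a uniform random hash.

```python
import hashlib


def g(i, dst):
    """Hash dst to an integer in [1, 2**i] (stand-in for a uniform random hash)."""
    digest = hashlib.sha256(f"{i}:{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** i) + 1


def expected_coupons(i, outdegree):
    """Expected number of distinct g_i values over `outdegree` distinct destinations."""
    n = 2 ** i
    return n * (1 - (1 - 1 / n) ** outdegree)


def coupons_seen(i, dsts):
    """Distinct g_i values actually produced by a collection of destinations."""
    return {g(i, d) for d in set(dsts)}


# With 8 coupons (i = 3), 8 distinct destinations are expected to hit about
# 5.25 of them, while 64 destinations hit nearly all 8.
few = expected_coupons(3, 8)
many = expected_coupons(3, 64)
```

The saturation is the useful part: level i distinguishes outdegrees well only while d is comparable to 2^i, which is presumably why several hash ranges g_0, ..., g_d are kept.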

Spreader Detection Phase • Outdegree histogram construction: interim data structure -> final outdegree histogram • Using linear programming method • Build a convex hull • Other constraints • Find the lower and upper bounds for m_i • Solution • Directly use the interim data structure • Pros: obtains a reasonably accurate histogram for normal network traffic • Cons: fails to accurately estimate the outdegree histogram for anomalous traffic

System Design • Change detection • The change of the interim data structure between two time intervals • Stealthy spreader detection: k_i' > c_h (threshold) • System architecture
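
A minimal sketch of the thresholding step, assuming k_i' denotes the per-bucket increase of the interim structure between two consecutive intervals (the slide does not define it) and that c_h is a fixed threshold. The function name and the sample values are hypothetical.

```python
def changed_buckets(prev, curr, threshold):
    """Return the bucket indices i where curr[i] - prev[i] exceeds threshold."""
    return [i for i, (p, c) in enumerate(zip(prev, curr)) if c - p > threshold]


# Hypothetical per-bucket values for two consecutive time intervals.
prev_interval = [50, 20, 8, 3, 1, 0]
curr_interval = [52, 21, 9, 40, 2, 0]
suspicious = changed_buckets(prev_interval, curr_interval, threshold=10)
```

Here only bucket 3 grows by more than the threshold, so it is the only one flagged.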

Spreader Detection Phase • The real scan event [Figure: number of scanners vs. number of distinct destinations; annotations: one peak, close to 0]

Spreader Detection Phase • Linear regression for coupon collecting algorithm • Mean squared error as the fitting metric [Figure: example of linear regression; axes: bucket, value of counting]
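
The fitting step can be sketched with ordinary least squares and mean squared error. What exactly is regressed in the talk, and how the fit is mapped to attack bins, is not stated on the slide; picking the bucket with the largest residual below is just one hypothetical way to use the fit, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line over the points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


buckets = [0, 1, 2, 3, 4]
clean = [1, 3, 5, 7, 9]        # perfectly linear values
bumped = [1, 3, 5, 27, 9]      # same values with a bump in bucket 3

slope_c, icpt_c = fit_line(buckets, clean)
slope_b, icpt_b = fit_line(buckets, bumped)
err_clean = mse(buckets, clean, slope_c, icpt_c)
err_bumped = mse(buckets, bumped, slope_b, icpt_b)

# Bucket whose value deviates most from the fitted line.
residuals = [abs(y - (slope_b * x + icpt_b)) for x, y in zip(buckets, bumped)]
worst_bucket = residuals.index(max(residuals))
```

The clean series fits with zero error, while the bumped series yields a large error and its largest residual sits at bucket 3.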

Evaluation Methodology • Traffic traces • OC-48 CAIDA data on Aug. 14th, 2002 • The average packet rate: 191K/s • The average flow rate: 3.75K/s • A real scanning event collected from one class B honeynet on Jan 7th, 2007 • Port 23 • 2.5 hours • 1,607 unique sources • 1,700,236 scan sessions • Synthetic scanning traces

Simulation Results • Synthetic stealthy scan • False negative: 0; estimation error within 20%: 76.1% • Attack intensity = … • False negative: 17.8%; estimation error within 20%: 33.9% • Estimate ratio = … [Figure: the estimate ratio of scan outdegree; axes: estimate ratio, percentage of detection results]

Simulation Results • Synthetic stealthy scan [Figure: CDF of estimate ratio for spreader intensity estimation; axes: estimate ratio, cumulative percentage (%); marked levels: 35% and 80%]

Simulation Results • Real stealthy scan • Estimation: 90 • Ground truth: 87 [Figure: the histogram of outdegree of scanners collected in the honeynet; axes: number of distinct destinations, number of scanners]

Simulation Results • Real stealthy scan • Mix the 5-min data of a real scanning event with 5-min normal traffic of CAIDA data (distribution over 30 such intervals) [Figure: CDF of estimate ratios of scan outdegree estimation; axes: estimate ratio, cumulative percentage (%); marked level: 80%]

Online Performance • Memory consumption • Our method: O(c log(m)) • Constant memory: 24 × 1 KB = 24 KB • Superspreader: when k is small, the memory usage is closer to the size of the entire data stream N • Memory access per packet • Single memory access per packet for each distinct counting structure • Speed up: processing in parallel or in pipeline • Speed • 3.2 GHz Pentium 4 computer • Recording: 200 seconds for each 5-min CAIDA data interval • Detection: less than 0.1 second
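
A back-of-the-envelope check of these figures, combining them with the 191K packets/s average rate quoted for the CAIDA trace. It assumes that average holds within each 5-minute interval, which the slides do not state; the variable names are illustrative.

```python
PACKET_RATE = 191_000      # average packets per second in the CAIDA trace
INTERVAL_S = 5 * 60        # length of one data interval in seconds
RECORDING_S = 200          # reported recording time per interval in seconds

packets_per_interval = PACKET_RATE * INTERVAL_S       # packets to record
throughput = packets_per_interval / RECORDING_S       # packets processed per second
headroom = INTERVAL_S / RECORDING_S                   # > 1 means faster than real time
memory_kb = 24 * 1                                    # 24 structures of 1 KB each
```

Under that assumption, about 57.3 million packets are recorded in 200 seconds, roughly 286,500 packets per second, or 1.5 times faster than the traffic arrives.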

Conclusion • Propose the stealthy spreader detection problem • Design an online outdegree histogram based stealthy spreader detection algorithm • Propose two randomized algorithms for recording phase • Propose the linear regression based approach for stealthy spreader detection

Questions?
