An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection
Matt Mahoney
mmahoney@cs.fit.edu
Feb. 18, 2003
Is the DARPA/Lincoln Labs IDS Evaluation Realistic?
• The most widely used intrusion detection evaluation data set.
• 1998 data used in the KDD Cup competition with 25 participants.
• 8 participating organizations submitted 18 systems to the 1999 evaluation.
• Tests host- and network-based IDS.
• Tests signature and anomaly detection.
• 58 types of attacks (more than any other evaluation).
• 4 target operating systems.
• Training and test data released after the evaluation to encourage IDS development.
Problems with the LL Evaluation
• Background network data is synthetic.
• SAD (Simple Anomaly Detector) detects too many attacks.
• Compared with real traffic, the range of attribute values is too small and static (TTL, TCP options, client addresses, …).
• Injecting real traffic removes suspect detections from PHAD, ALAD, LERAD, NETAD, and SPADE.
1. Simple Anomaly Detector (SAD)
• Examines only inbound client TCP SYN packets.
• Examines only one byte of the packet.
• Trains on attack-free data (week 1 or 3).
• A value never seen in training is an anomaly.
• If there have been no anomalies for 60 seconds, then output an alarm with score 1.
Example from the slide: Train: 001110111, Test: 010203001323011 (the 60-second intervals mark the quiet periods before alarms). A minimal sketch follows.
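The detector is simple enough to sketch in a few lines of Python. This is only an illustration of the rule stated above, not the code used in the study; packet capture and byte extraction are assumed to happen elsewhere, and packets arrive as (timestamp, byte_value) pairs.

    def sad_train(packets):
        """Record every value of the monitored byte seen in attack-free training traffic."""
        return {value for _, value in packets}

    def sad_detect(packets, allowed, gap=60.0):
        """Alarm with score 1 on a never-seen byte value, but only if the previous
        anomaly occurred more than `gap` seconds earlier."""
        alarms = []
        last_anomaly = None
        for t, value in packets:
            if value not in allowed:                       # novel value = anomaly
                if last_anomaly is None or t - last_anomaly > gap:
                    alarms.append((t, 1))                  # (time, score)
                last_anomaly = t
        return alarms

    # Toy run on the byte sequences from the slide (one packet per second):
    allowed = sad_train(enumerate([0, 0, 1, 1, 1, 0, 1, 1, 1]))
    print(sad_detect(enumerate([0, 1, 0, 2, 0, 3, 0, 0, 1, 3, 2, 3, 0, 1, 1]), allowed))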
DARPA/Lincoln Labs Evaluation
• Weeks 1 and 3: attack-free training data.
• Week 2: training data with 43 labeled attacks.
• Weeks 4 and 5: 201 test attacks.
[Testbed diagram: attacks arrive from the Internet through a router; a sniffer captures traffic to SunOS, Solaris, Linux, and NT victim machines.]
SAD Evaluation
• Develop on weeks 1–2 (available in advance of the 1999 evaluation) to find good bytes.
• Train on week 3 (no attacks).
• Test on weeks 4–5 inside sniffer traffic (177 visible attacks).
• Count detections and false alarms using the 1999 evaluation criteria.
SAD Results
• Variants (bytes) that do well: source IP address (any of 4 bytes), TTL, TCP options, IP packet size, TCP header size, TCP window size, source and destination ports.
• Variants that do well on weeks 1–2 (available in advance) usually do well on weeks 3–5 (evaluation).
• Very low false alarm rates.
• Most detections are not credible.
SAD vs. the 1999 Evaluation
• The top system in the 1999 evaluation, Expert 1, detects 85 of 169 visible attacks (50%) at 100 false alarms (10 per day), using a combination of host- and network-based signature and anomaly detection.
• SAD detects 79 of 177 visible attacks (45%) with 43 false alarms using the third byte of the source IP address.
SAD Detections by Source Address (that should have been missed)
• DoS on public services: apache2, back, crashiis, ls_domain, neptune, warezclient, warezmaster
• R2L on public services: guessftp, ncftp, netbus, netcat, phf, ppmacro, sendmail
• U2R: anypw, eject, ffbconfig, perl, sechole, sqlattack, xterm, yaga
2. Comparison with Real Traffic
• Anomaly detection systems flag rare events (e.g., previously unseen addresses or ports).
• "Allowed" values are learned during training on attack-free traffic.
• Novel values in background traffic would cause false alarms.
• Are novel values more common in real traffic?
Measuring the Rate of Novel Values
• r = number of values observed in training.
• r1 = fraction of values seen exactly once (the Good-Turing estimate of the probability that the next value is novel).
• rh = fraction of values seen only in the second half of training.
• rt = fraction of training time needed to observe half of all values.
Larger values in real data would suggest a higher false alarm rate (a sketch of these measurements follows).
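As a rough illustration, all four statistics can be computed from a chronologically ordered list of attribute values (one per filtered packet). This sketch is an assumption about the bookkeeping, not the measurement code from the study, and it uses packet position as a stand-in for time when computing rt.

    from collections import Counter

    def novelty_stats(values):
        """values: attribute values in chronological order from attack-free training traffic."""
        counts = Counter(values)
        r = len(counts)                                         # distinct values observed
        r1 = sum(1 for c in counts.values() if c == 1) / r      # fraction seen exactly once
        first_half = set(values[:len(values) // 2])
        rh = sum(1 for v in counts if v not in first_half) / r  # seen only in the second half
        seen = set()
        rt = 1.0
        for i, v in enumerate(values, 1):
            seen.add(v)
            if 2 * len(seen) >= r:                              # half of all values observed
                rt = i / len(values)                            # position as a proxy for time
                break
        return r, r1, rh, rt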
Network Data for Comparison
• Simulated data: inside sniffer traffic from weeks 1 and 3, filtered from 32M packets down to 0.6M.
• Real data: collected from www.cs.fit.edu, Oct.–Dec. 2002, filtered from 100M packets down to 1.6M.
• Traffic is filtered and rate-limited to extract the start of inbound client sessions (the NETAD filter, which passes most attacks).
Attributes Measured
• Packet header fields (all filtered packets) for Ethernet, IP, TCP, UDP, and ICMP.
• Inbound TCP SYN packet header fields.
• HTTP, SMTP, and SSH requests (other application protocols are not present in both sets).
Comparison Results
• Synthetic attributes are too predictable: TTL, TOS, TCP options, TCP window size, HTTP and SMTP command formatting.
• Too few sources: client addresses, HTTP user agents, SSH versions.
• Too "clean": no checksum errors, fragmentation, garbage data in reserved fields, or malformed commands.
TCP SYN Source Address
r1 ≈ rh ≈ rt ≈ 50%, which is consistent with a Zipf distribution and a constant growth rate of r.
Real Traffic is Less Predictable
[Plot of r (number of observed values) versus time: r levels off for the synthetic traffic but keeps growing for the real traffic.]
3. Injecting Real Traffic
• Mix equal durations of real traffic into weeks 3–5 (both sets filtered, 344 hours each); see the sketch below.
• We expect r ≥ max(r_SIM, r_REAL), i.e., a realistic false alarm rate.
• Modify PHAD, ALAD, LERAD, NETAD, and SPADE so they cannot separate the two data sources.
• Test at 100 false alarms (10 per day) on 3 mixed sets.
• Compare the fraction of "legitimate" detections on simulated and mixed traffic for the median mixed result.
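A minimal sketch of the mixing step, under the assumption that each trace is a list of (timestamp, packet) records of equal duration: the real trace is shifted onto the simulated clock and the two streams are merged in time order, so a detector cannot trivially tell them apart by timestamp. The merging actually used in the study may differ.

    import heapq

    def mix_traces(simulated, real):
        """Interleave two equal-duration packet traces by timestamp."""
        offset = simulated[0][0] - real[0][0]            # align the trace start times
        shifted = [(t + offset, pkt) for t, pkt in real]
        return list(heapq.merge(simulated, shifted, key=lambda rec: rec[0]))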
PHAD
• Models 34 packet header fields: Ethernet, IP, TCP, UDP, ICMP.
• Global model (no rule antecedents).
• Only novel values are anomalous.
• Anomaly score = tn/r (sketched below), where t = time since the last anomaly, n = number of training packets, and r = number of allowed values.
• No modifications needed.
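A sketch of the tn/r scoring rule, assuming one small model per header field; in PHAD a packet's score is the sum of its per-field scores. Class and method names here are illustrative, not from the original implementation.

    class FieldModel:
        """Anomaly model for a single packet-header field."""
        def __init__(self):
            self.allowed = set()       # values seen in training (r = len(allowed))
            self.n = 0                 # number of training packets
            self.last_anomaly = 0.0    # time of this field's last anomaly

        def train(self, value):
            self.allowed.add(value)
            self.n += 1

        def score(self, t, value):
            """Return t*n/r for a never-seen value, 0 otherwise."""
            if value in self.allowed:
                return 0.0
            s = (t - self.last_anomaly) * self.n / len(self.allowed)
            self.last_anomaly = t
            return s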
ALAD
• Models inbound TCP client requests: addresses, ports, flags, application keywords.
• Score = tn/r.
• Conditioned on destination port/address.
• Modified to remove address conditions and protocols not present in the real traffic (telnet, FTP).
LERAD
• Models inbound client TCP sessions (addresses, ports, flags, 8 words of the payload).
• Learns conditional rules with high n/r.
• Discards rules that generate false alarms in the last 10% of the training data.
• Modified to weight rules by the fraction of real traffic.
Example rule: if port = 80 then word1 = GET or POST (n/r = 10000/2); a sketch of such a rule follows.
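As a rough illustration of how such a rule could be represented and scored (names and structure are assumptions, not the original code), each inbound session is treated as a dictionary of attributes, and a rule contributes tn/r when its antecedent matches but the consequent value was never seen in training:

    class Rule:
        def __init__(self, antecedent, attribute, allowed, n):
            self.antecedent = antecedent     # e.g. {"port": 80}
            self.attribute = attribute       # e.g. "word1"
            self.allowed = allowed           # e.g. {"GET", "POST"}  (r = len(allowed))
            self.n = n                       # training sessions matching the antecedent
            self.last_anomaly = 0.0

        def score(self, t, session):
            if any(session.get(k) != v for k, v in self.antecedent.items()):
                return 0.0                   # rule does not apply to this session
            if session[self.attribute] in self.allowed:
                return 0.0                   # consequent value was seen in training
            s = (t - self.last_anomaly) * self.n / len(self.allowed)
            self.last_anomaly = t
            return s

    # The rule from the slide, with n/r = 10000/2:
    http_rule = Rule({"port": 80}, "word1", {"GET", "POST"}, n=10000)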
NETAD
• Models bytes of inbound client request packets: IP, TCP, TCP SYN, HTTP, SMTP, FTP, telnet.
• Score = tn/r + t_i/f_i (sketched below), which also allows previously seen values to be anomalous, where t_i = time since value i was last seen and f_i = frequency of i in training.
• Modified to remove telnet and FTP.
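A sketch of the score as summarized on this slide (the class and its names are illustrative assumptions, not the NETAD implementation): a never-seen value scores tn/r as before, while a previously seen value i can still score t_i/f_i, so rare, long-unseen values also contribute.

    class ByteModel:
        """Anomaly model for one packet byte, allowing previously seen values to score."""
        def __init__(self):
            self.freq = {}             # f_i: training frequency of each value
            self.n = 0                 # number of training packets
            self.last_seen = {}        # time each value was last seen in detection
            self.last_anomaly = 0.0

        def train(self, value):
            self.freq[value] = self.freq.get(value, 0) + 1
            self.n += 1

        def score(self, t, value):
            if value not in self.freq:                                 # novel value: t*n/r
                s = (t - self.last_anomaly) * self.n / len(self.freq)
                self.last_anomaly = t
            else:                                                      # seen value: t_i/f_i
                s = (t - self.last_seen.get(value, 0.0)) / self.freq[value]
            self.last_seen[value] = t
            return s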
SPADE (Hoagland)
• Models inbound TCP SYN packets.
• Score = 1/P(source IP, destination IP, destination port), sketched below.
• Probability estimated by counting.
• Always in training mode.
• Modified by randomly replacing the real destination IP with one of the 4 simulated targets.
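A sketch of the counting-based probability estimate (an assumption about the bookkeeping, not Hoagland's implementation): the model keeps one joint count over (source IP, destination IP, destination port), never stops training, and scores each TCP SYN by the inverse of its estimated probability.

    from collections import Counter

    class Spade:
        def __init__(self):
            self.counts = Counter()
            self.total = 0

        def score(self, src_ip, dst_ip, dst_port):
            key = (src_ip, dst_ip, dst_port)
            self.total += 1
            self.counts[key] += 1                  # always in training mode
            p = self.counts[key] / self.total      # probability by counting
            return 1.0 / p                         # rare combinations score high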
Criteria for Legitimate Detection
• Source address: the target server must authenticate the source.
• Destination address/port: the attack must use or scan that address/port.
• Packet header field: the attack must write or modify the packet header (probe or DoS).
• No U2R or Data attacks.
Mixed Traffic: Fewer Detections, but More are Legitimate
(Detections out of 177 at 100 false alarms.)
Conclusions
• SAD suggests the presence of simulation artifacts and artificially low false alarm rates.
• The simulated traffic is too clean, static, and predictable.
• Injecting real traffic reduces suspect detections in all 5 systems tested.
Limitations and Future Work
• Only one real data source tested; the results may not generalize.
• Tests on real traffic cannot be replicated due to privacy concerns (root passwords in the data, etc.).
• Each IDS must be analyzed and modified to prevent it from separating the two data sources.
• Is host-based data (BSM, audit logs) affected as well?
Limitations and Future Work (continued)
• Real data may contain unlabeled attacks. We found over 30 suspicious HTTP requests in our data (to a Solaris-based host), for example:
IIS exploit with double URL encoding (IDS evasion?):
GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir
Probe for the Code Red backdoor:
GET /MSADC/root.exe?/c+dir HTTP/1.0
Further Reading
An Analysis of the 1999 DARPA/Lincoln Laboratories Evaluation Data for Network Anomaly Detection
Matthew V. Mahoney and Philip K. Chan
Dept. of Computer Sciences Technical Report CS-2003-02
http://cs.fit.edu/~mmahoney/paper7.pdf