Network Anomaly Diagnosis Status Report 15-03-2006
Contents • Network performance metrics • Active monitoring tools • Preparation of canonical dataset • Anomaly detection • Anomaly diagnosis
Network performance metrics • One-Way Delay (OWD) • Serialization Delay • Propagation Delay • Queuing Delay • Forwarding Delay • Round-Trip Time (RTT) • Delay Variation (Jitter) • Packet Loss • Congestion • Errors • Packet Reordering • Maximum Transmission Unit (MTU) • Bandwidth Delay Product (BDP)
One-Way Delay (OWD) • The time it takes for a packet to travel from its source to its end-to-end destination • It can be broken down into: • per-hop one-way delays, and these in turn into: • per-link and • per-node delay components • The per-link component of one-way delay consists of two sub-components: propagation delay and serialization delay. • The per-node component of one-way delay also consists of two sub-components: forwarding delay and queuing delay.
Serialization Delay • It is the time taken to separate a packet into sequential link transmission units (bits). • It is obtained by dividing the packet size (in bits) by the capacity of the link (in bits per second). • Nowadays, as links increasingly have higher bit rates, serialization delay is less relevant.
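As a worked example, a minimal Python sketch with assumed packet size and link rate:

packet_size_bits = 1500 * 8     # a full 1500-byte Ethernet frame
link_rate_bps = 100e6           # 100 Mbit/s link (assumed)
serialization_delay = packet_size_bits / link_rate_bps
print(f"{serialization_delay * 1e6:.0f} us")   # 120 us here; only 1.2 us at 10 Gbit/s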
Propagation Delay • Propagation delay is the time it takes for signals to move from the transmitting to the receiving end of a link. • On simple links, this is the product of: • the link's physical length and • the characteristic propagation speed of the medium. • On high-speed wide-area network (WAN) paths, delay is usually dominated by propagation times.
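For example, a minimal sketch with assumed values (signals travel at roughly two thirds of the speed of light in optical fiber):

link_length_m = 3_000_000       # 3000 km fiber path (assumed)
speed_m_per_s = 2.0e8           # ~2/3 c in fiber
print(link_length_m / speed_m_per_s)   # 0.015 s = 15 ms one way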
Queuing Delay • Queuing delay is defined as the time a packet has to spend inside a node, such as a router, while waiting for availability of the output link. • It depends on: • the amount of traffic competing to send packets towards the output link and • the priority of the packet
Forwarding Delay • Time spent processing the packet at the node: • reading the forwarding-relevant information (destination address and other headers) and • making the forwarding decision based on the routing table
Round-Trip Time (RTT) • It is the sum of the one-way delays from source A to destination B and back, plus the time it takes B to formulate the response. • Large RTT values can cause problems for TCP and other window-based transport protocols. • The round-trip time limits the achievable throughput, as there can only be a window's worth of unacknowledged data in the network.
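A minimal sketch of this window/RTT limit, with assumed values:

window_bytes = 64 * 1024        # 64 KB TCP window (assumed)
rtt_s = 0.1                     # 100 ms round-trip time (assumed)
max_throughput_bps = window_bytes * 8 / rtt_s
print(f"{max_throughput_bps / 1e6:.1f} Mbit/s")   # ~5.2 Mbit/s, regardless of link speed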
Delay Variation (Jitter) • OWD is not constant on a real network because of competing traffic and contention for processing resources. • The difference between a given packet's actual and average OWD is termed 'delay variation' or jitter. • Jitter should only be compared between packets of equal size, as OWD depends on packet size because of serialization delay.
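A minimal sketch of this per-packet computation (owd_samples is an assumed list of one-way delays, in seconds, for equal-sized packets):

owd_samples = [0.0501, 0.0498, 0.0532, 0.0500, 0.0525]
mean_owd = sum(owd_samples) / len(owd_samples)
jitter = [d - mean_owd for d in owd_samples]       # per-packet delay variation
print([f"{j * 1000:+.2f} ms" for j in jitter])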
Packet Loss • Packet loss is defined as the probability of a packet being lost in transit from a source A to a destination B. • Applications requiring reliable transmission, e.g. bulk data transfers, use retransmission, which reduces performance. • In addition, congestion-sensitive protocols such as standard TCP assume that packet loss is due to congestion, and respond by reducing their transmission rate accordingly. • Congestion and errors are the two main reasons for packet loss.
Congestion • When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. • Since these buffers are also of limited capacity, congestion can lead to queue overflows, which leads to packet loss. • Congestion can be caused by: • a moderate overload condition maintained for an extended amount of time or • the sudden arrival of a very large amount of traffic (a traffic burst)
Errors • Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. • When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.
Packet Reordering • The Internet Protocol (IP) does not guarantee packet ordering. • Packet reordering mainly concerns packets of different sizes: • larger packets take longer to transfer and may be overtaken by smaller packets in transit. • It can be measured by injecting a known traffic pattern via a traffic generator and calculating the reordering at the receiver. • To measure maximal reordering: a short burst of long packets immediately followed by a short burst of short packets
Maximum Transmission Unit (MTU) • It describes the maximum size of an IP packet that can be transferred over the link without fragmentation. • Common MTU sizes are: • 1500 bytes (Ethernet, 802.11 WLAN) • 4470 bytes (FDDI, common default for POS and serial links) • 9000 bytes (Internet2 and GÉANT convention, limit of some Gigabit Ethernet adapters) • 9180 bytes (ATM, SMDS)
Bandwidth Delay Product (BDP) • The Bandwidth Delay Product (BDP) of an end-to-end path is the product of the bottleneck bandwidth and the delay of the path. • It is often useful to think of BDP as the "memory capacity" of a path, i.e. the amount of data that fits entirely into the path between two end-systems. • This relates to throughput, which is the rate at which data is sent and received. • Network paths with a large BDP are called Long Fat Networks or LFNs. • BDP is an important parameter for the performance of window-based protocols such as TCP.
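A minimal sketch with assumed values, showing why LFNs need large TCP windows:

bandwidth_bps = 1e9             # 1 Gbit/s bottleneck bandwidth (assumed)
rtt_s = 0.15                    # 150 ms round-trip delay (assumed)
bdp_bytes = bandwidth_bps * rtt_s / 8
print(f"{bdp_bytes / 1e6:.1f} MB in flight to fill the path")   # ~18.8 MB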
Available bandwidth • It can be estimated from: • the dynamic bandwidth capacity (Cap), obtained by analyzing the minimum inter-packet delay; • the cross-traffic (Xtr), obtained by analyzing the inter-packet delay variability; • and computed as Abw = Cap − Xtr
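A much-simplified sketch of this estimation from inter-packet delays (the gap values are assumed measurements for back-to-back 1500-byte probes; real tools such as ABwE use considerably more careful filtering):

packet_bits = 1500 * 8
gaps = [0.000130, 0.000121, 0.000175, 0.000120, 0.000150]   # seconds (assumed)

cap = packet_bits / min(gaps)                    # capacity from the minimum gap
avg_rate = packet_bits / (sum(gaps) / len(gaps))
xtr = cap - avg_rate                             # extra dispersion attributed to cross-traffic
abw = cap - xtr                                  # Abw = Cap - Xtr
print(f"Cap {cap / 1e6:.0f} Mbit/s, Abw {abw / 1e6:.0f} Mbit/s")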
Active Probing Tools • Some publicly available active network monitoring tools are: • Throughput & Delay Measurement Tools • Ping • Traceroute • Iperf • Thrulay • Path Characterization & Bandwidth Estimation • pathChirp • Pathload • ABwE • Netperf • Nettest
Ping • Uses the ICMP echo mechanism • Measures: • RTT • Packet loss • Reachability • The source timestamps each echo request and compares it with the arrival time of the echo reply to calculate RTT • Packet loss can be measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.
Ping example (slide annotations: remote host, repeat count, packet size, RTT)
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms    (icmp_seq=1 missing: lost packet)
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
Traceroute • Determines the route a packet takes through the Internet to reach its destination, i.e. the sequence of "hops" it traverses. • It sends "probe" packets with TTL values incrementing from one, and uses ICMP "Time Exceeded" messages to detect the "hops" on the way to the specified destination. • It also records "response" times for each hop, and displays losses and other types of failures. • Used to measure: • hop-by-hop connectivity and • RTT
Traceroute example • UDP/ICMP tool to show the route packets take from the local to the remote host (-q 1: one probe per hop, -m 20: max 20 hops)
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
 4 snv-slac.es.net (134.55.208.30) 1.377 ms
 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *     (no response: lost packet or router ignores probes)
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
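A minimal Python sketch of the TTL-probing idea (not the real tool; it requires root for the raw ICMP socket, and the destination port is an arbitrary assumption):

import socket

def trace(dest_name, max_hops=20, port=33434):
    dest_addr = socket.gethostbyname(dest_name)
    for ttl in range(1, max_hops + 1):
        recv_sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        recv_sock.settimeout(2.0)
        recv_sock.bind(("", port))
        send_sock.sendto(b"", (dest_addr, port))        # probe with limited TTL
        try:
            _, addr = recv_sock.recvfrom(512)           # ICMP "Time Exceeded" from the hop
            print(ttl, addr[0])
            if addr[0] == dest_addr:
                break
        except socket.timeout:
            print(ttl, "*")                             # no response: lost or filtered
        finally:
            send_sock.close()
            recv_sock.close()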
Iperf • It reports bandwidth, delay jitter and datagram loss. • It has two modes: • TCP • UDP • In TCP mode it is used to measure the bandwidth. • It reports the MSS (Maximum Segment Size) and the observed read sizes, and supports TCP window size adjustment via socket buffers. • In UDP mode, packet loss and delay jitter can be measured.
Thrulay • Thrulay stands for THRUput and deLAY. • It measures the capacity of a network by sending a bulk TCP stream over it. • One-way delay can be measured by sending a Poisson stream (arrivals at random) of very precisely positioned UDP packets.
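A minimal sketch of Poisson probing, i.e. exponentially distributed inter-send times (the probe rate is an assumed parameter):

import random

rate_pps = 10.0                              # average probes per second (assumed)
send_offsets, t = [], 0.0
for _ in range(5):
    t += random.expovariate(rate_pps)        # exponential gaps -> Poisson arrivals
    send_offsets.append(t)
print(send_offsets)                          # schedule the UDP probes at these times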
Pathload • Pathload is based on the technique of Self-Loading Periodic Streams (SLoPS) for measuring available bandwidth. • A periodic stream in SLoPS consists of K packets of size L, sent over the path at a constant rate R. • If the stream rate R is higher than the available bandwidth, the one-way delays of successive packets at the receiver show an increasing trend. • In pathload, the sender uses UDP for the packet stream and a TCP connection for controlling the measurements. • The sender timestamps each packet upon transmission, which provides the relative one-way delay at the receiver side.
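A minimal sketch of the trend test at the heart of SLoPS (the OWD list and the 0.6 threshold are assumptions; Pathload's actual trend statistics are more elaborate):

owd_ms = [1.1, 1.3, 1.2, 1.6, 1.8, 2.1, 2.4, 2.3]    # relative OWDs within one stream

increasing = sum(b > a for a, b in zip(owd_ms, owd_ms[1:]))
if increasing / (len(owd_ms) - 1) > 0.6:             # mostly increasing delays
    print("R > available bandwidth: probe at a lower rate")
else:
    print("R <= available bandwidth: probe at a higher rate")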
pathChirp • pathChirp is used to estimate unused capacity, i.e. available bandwidth. • Based on the concept of self-induced congestion: • probing rate < available bandwidth → no delay increase • probing rate > available bandwidth → delay increases • Unique to pathChirp is an exponentially spaced chirp probing train.
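A minimal sketch of generating such a chirp train (the spread factor gamma, the initial gap and the train length are assumed parameters):

packet_bits = 1500 * 8
gamma, gap = 1.2, 0.010                      # assumed spread factor and first gap (s)
gaps = []
for _ in range(8):
    gaps.append(gap)
    gap /= gamma                             # each gap shrinks by factor gamma
rates = [packet_bits / g for g in gaps]      # instantaneous probing rate per pair
print([f"{r / 1e6:.1f} Mbit/s" for r in rates])   # one train sweeps a whole range of rates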
ABwE • ABwE is a lightweight available-bandwidth measurement tool based on the packet-pair dispersion technique. • It uses fixed-size packets with a specific delay between each packet pair. • The observed packet-pair delay is converted into available bandwidth estimates. • The tool can be used in continuous mode and detects all substantial bandwidth changes caused by improper routing or by congestion.
Netperf • Netperf is a benchmark that can be used to measure the performance of many different types of networking. • It provides tests for both unidirectional throughput, and end-to-end latency. • Essentially, these tests will measure how fast one system can send data to another and/or how fast that other system can receive it. • Request/response performance is the second area that can be investigated with netperf.
Selection of tools • Shriram et al. conclude their comparison of public end-to-end bandwidth estimation tools on high-speed links with the following remarks: • "Pathload and Pathchirp are the most accurate. Iperf requires maximum buffer size and is sensitive to small packet loss." • Ribeiro et al., in comparisons of pathChirp with pathload and TOPP, clearly show that pathChirp performs far better than the other two tools in single-hop and multi-hop scenarios, as well as on the real Internet. • Cottrell et al. state: "Ping utility does not require any extra installation/deployment efforts and has become a de-facto standard for measuring RTT and packet loss." • The tools selected are pathChirp, PingER, thrulay and traceroute.
References • Cottrell L., Internet Monitoring, presented at NUST Institute of Information Technology (NIIT), Rawalpindi, Pakistan, March 15, 2005 • Cottrell L., Matthews W. and Logg C., Tutorial on Internet Monitoring & PingER at SLAC • Jacobson V., "pathchar — a tool to infer characteristics of Internet paths" • Ribeiro V., Riedi R., Baraniuk R., Navratil J., Cottrell L., pathChirp: Efficient Available Bandwidth Estimation for Network Paths • Jain M., Dovrolis C., End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput, ACM SIGCOMM 2002 • Navratil J. and Cottrell L., ABwE: A Practical Approach to Available Bandwidth Estimation • http://thrulay-hd.sourceforge.net
References • http://www.netperf.org/netperf/training/Netperf.html • Beginner's Guide to Network-Distributed Resource Usage • Cottrell L., Logg C., Overview of IEPM-BW Bandwidth Testing of Bulk Data Transfer, in: SC2002: High Performance Networking and Computing (2002) • http://dast.nlanr.net/Projects/Iperf/iperfdocs_1.7.0.html • http://www-didc.lbl.gov/NCS/netest/ • Shriram A., Murray M., Hyun Y., Brownlee N., Broido A., Fomenkov M., claffy k., "Comparison of Public End-to-End Bandwidth Estimation Tools on High-Speed Links" • Prasad R. S., Dovrolis C., "Capacity estimation using packet dispersion techniques: recent progress and open issues", presentation • http://www.academ.com/nanog/june1997/performance.html • Labit Y., Owezarski P., Larrieu N., Evaluation of Active Measurement Tools for Bandwidth Estimation in Real Environment • Dovrolis C., Jain M., "End-to-end available bandwidth estimation using Pathload", presentation • claffy k., Dovrolis C., "Bandwidth estimation: measurement methodologies and applications", presentation • GÉANT2 Performance Enhancement and Response Team (PERT) User Guide and Best Practice Guide
Anomaly Detection Techniques • Exponential weighted moving average (EWMA) • Holt-Winters forecasting algorithm • Plateau Algorithm • Principal Component Analysis (PCA) • ARIMA-based Box-Jenkins forecasting models • Signal analysis techniques, e.g. the wavelet analysis scheme
EWMA • EWMA is a forecasting-based technique • EWMA predicts the next value in a given time series based on recent history. • Specifically, if z_t denotes the traffic at time t, then the EWMA prediction for time t+1 is given by ẑ_{t+1} as: ẑ_{t+1} = α z_t + (1 − α) ẑ_t, where 0 ≤ α ≤ 1 is the smoothing weight • Anomalies can then be quantified by taking the difference between the forecasted and the actual value, i.e., e_{t+1} = z_{t+1} − ẑ_{t+1}
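A minimal sketch of EWMA-based anomaly scoring (alpha, the threshold and the series are assumed, illustrative values):

def ewma_anomalies(series, alpha=0.3, threshold=3.0):
    forecast = series[0]
    anomalies = []
    for t, z in enumerate(series[1:], start=1):
        residual = z - forecast                          # actual minus forecast
        if abs(residual) > threshold:
            anomalies.append((t, z, residual))
        forecast = alpha * z + (1 - alpha) * forecast    # update the prediction
    return anomalies

series = [10, 11, 10, 12, 11, 25, 11, 10]    # spike at index 5
print(ewma_anomalies(series))                # flags the spike as an anomaly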
Holt-Winters (HW) Algorithm • Time series of service network variables exhibit the following regularities: • A trend over time (e.g., a gradual increase in application daemon requests over a two-month period due to increased subscriber load). • A seasonal trend or cycle (e.g., every day, bytes per second increases in the morning hours, peaks in the afternoon and declines late at night). • Seasonal variability (e.g., application requests fluctuate wildly minute by minute during the peak hours of 4-8 pm, but at 1 am application requests hardly vary at all).
Holt-Winters (HW) Algorithm • The Holt-Winters (HW) algorithm uses a triple Exponential Weighted Moving Average (EWMA) approximation to characterize the time series behavior as a superposition of three components: a baseline, a linear trend and a seasonal effect.
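A minimal sketch of the additive form of this triple EWMA (the smoothing weights alpha, beta, gamma and the season length m are assumed parameters):

def holt_winters(series, m, alpha=0.5, beta=0.1, gamma=0.1):
    level, trend = series[0], 0.0
    season = [0.0] * m                        # one seasonal effect per slot in the cycle
    forecasts = []
    for t, z in enumerate(series):
        forecasts.append(level + trend + season[t % m])   # baseline + trend + season
        last_level = level
        level = alpha * (z - season[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[t % m] = gamma * (z - level) + (1 - gamma) * season[t % m]
    return forecasts   # large |actual - forecast| values flag anomalies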
Plateau Algorithm • It is used to detect a significant change in the base RTT of a path. • A change in the base RTT means that most of the measured RTT values shift, so that new values cluster around a new, higher level. • The plateau detector is based around two windows: • the summary window and • the sample window. • The summary window is used to characterize the normal state of the path, based on the previous samples. • The number of samples stored in the summary window is a user-settable parameter.
Plateau Algorithm • Samples are first considered for the summary window; a trigger selector decides if a sample forms part of a trigger. • In that case, the sample is stored in the sample window instead. • This decision is based on the value of the sample, on the mean and variance of the summary window, and on parameters that the user has chosen: • RTT > mean + (variance × sensitivity)
Plateau Algorithm • A trigger is generated when a number of samples (a user-settable parameter) meeting the above condition are received. • Each time a sample that fulfils the condition is received, a counter is incremented. • While the counter is non-zero, each sample that does not meet the condition decrements the counter. • If the counter reaches the user parameter "trigger duration", a trigger is generated. If the counter returns to zero, the trigger is aborted. • In either case, the data in the sample window is passed to the summary window.
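A minimal sketch of this counter logic (the window size, sensitivity and trigger duration are assumed user parameters; a real implementation also manages the sample window described above):

from collections import deque

def plateau_detect(samples, summary_len=20, sensitivity=2.0, trigger_duration=5):
    summary = deque(samples[:summary_len], maxlen=summary_len)
    counter = 0
    for i, rtt in enumerate(samples[summary_len:], start=summary_len):
        mean = sum(summary) / len(summary)
        var = sum((x - mean) ** 2 for x in summary) / len(summary)
        if rtt > mean + var * sensitivity:    # sample forms part of a trigger
            counter += 1
            if counter >= trigger_duration:
                return i                      # trigger: base RTT has stepped up
        elif counter > 0:
            counter -= 1                      # non-trigger sample backs the counter off
        else:
            summary.append(rtt)               # normal sample updates the summary window
    return None                               # no plateau detected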
Principal Component Analysis (PCA) • PCA is a coordinate transformation method that maps a given set of data points onto new axes, called the principal axes or principal components. • When working with zero-mean data, each principal component points in the direction of maximum variance remaining in the data, given the variance already accounted for in the preceding components. • As such, the first principal component captures the variance of the data to the greatest degree possible on a single axis. • The following principal components then each capture the maximum variance among the remaining orthogonal directions. • Thus, the principal axes are ordered by the amount of data variance that they capture.
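A minimal sketch of computing the principal components via SVD (the matrix is illustrative random data standing in for, e.g., per-link traffic measurements):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # rows: time bins, columns: links (assumed)
X = X - X.mean(axis=0)                       # PCA works on zero-mean data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt                              # rows are the principal axes
variance = s ** 2 / (len(X) - 1)             # variance captured by each component
print(variance)                              # non-increasing: axes ordered by variance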
References • McGregor A.J. and Braun H-W., "Automated Event Detection for Active Measurement Systems", Proceedings of PAM2001, Amsterdam, The Netherlands, April 2001
Preparation of Canonical dataset • Fetch data from monitoring nodes (Pathchirp, Thrulay, Ping, Traceroute) - Adnan, Ali • Study/Analyze Data - Adnan, Ali • Run Run-period - Adnan • Analyze Alerts - Adnan, Ali • Design a format - Adnan, Ali • Rearrange data according to new format - Adnan, Ali
Topology to be used • [Diagram: 4 monitoring nodes; monitored nodes = 21]
Data being collected using this topology • PingER: round-trip time (min, max, avg) • Pathchirp: throughput (avg, max, min, stdev) • Thrulay: throughput (avg, max, min, stdev) and RTT (avg, max, min, stdev) • Traceroute summaries • Host monitoring data (e.g. data from LISA, Nagios, Ganglia) --- candidate
Format of canonical data set • Not yet fixed; it needs some more investigation of the available data from multiple sources. • Data has been fetched for the period 1st Jan 2006 to 28th Feb 2006.