IIT BOMBAY NETWORK MEASUREMENTS Guided by: Prof. Purushottam Kulkarni Submitted by: Manveer Singh Chawla MONITORING THE PERFORMANCE OF BACKHAUL CAMPUS NETWORK
OVERVIEW • Motivation • Problem statement • Related Work • IIT Bombay Network Background • Our Solution • Architecture • Implementation • Experimental Evaluation • Network measurement data • Proxy log analysis • Future Work • Thesis Contribution
MOTIVATION • Consider the following scenarios: • A user writes a mail and clicks Send, but sending fails! • A user is talking with a friend on GTalk and the connection drops • A user is browsing the web but pages load very slowly • What will a novice user do? • No structured approach: • Starts fiddling around with network settings • Reboots the machine • Result? • Wastes a lot of time • May not even find the cause
MOTIVATION CNTD. • Multiple points of failure • User’s machine • Incorrect network settings • Failure of Ethernet card/cable • LAN • Switch • Router • DNS • Proxy • WAN • Web server • Network congestion • No user control over LAN / WAN failures
PROBLEM DEFINITION • Build a measurement tool that monitors the status of elements in the network backbone so that, in case of a network failure, it can detect and diagnose the cause of the failure. These elements include the subnet routers, switches, DNS servers and the network proxy. • Carry out a measurement study of the network proxy covering response time variation, traffic pattern and object size variation across the day
RELATED WORK • Jigsaw • Merges traces to passively measure queuing delays and throughput • We instead summarize a trace to determine the status of nodes • WiFiProfiler • Fault diagnosis of the user machine in a wireless setting • Performs distributed analysis • Ours is centralized processing for a wired network • Network measurement tools • Pathchar: bandwidth, queue size, packet drop rate • Traceroute: RTT, topology
SERVICES • Proxy: netmon • Web caching • Authentication • Content filtering • Firewall • NAT • Packet filtering • Internal and external • DNS • DNS server for the campus • DNS servers in a few subnets • Monitoring • Traffic statistics
MEASUREMENT CHALLENGES • Permission from the Computer Centre • Large volume of data • Unaware and amateur users • Specific hardware required • Deciding what to measure in such a large network • Must use existing infrastructure • Old hardware: unpredictable failures • WAN: the firewall makes diagnosis difficult
SERVER NODE • Replies recorded from the monitored nodes: • ICMP PORT_UNREACHABLE • "Bad request" reply to an HTTP GET request • Query reply from server • Send logs to the diagnostic node after collection
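The slide lists the replies the server node records but not how its probes are built. As a hedged sketch, a DNS probe that yields a "query reply from server" can be written with only the Python standard library by hand-building an RFC 1035 query. The function names, the A-record lookup and the 2-second timeout are illustrative assumptions, not the tool's actual code.

```python
import random
import socket
import struct
from typing import Optional, Tuple


def build_dns_query(qname: str, query_id: int, qtype: int = 1) -> bytes:
    """Build a minimal DNS query (RFC 1035) for one name; qtype 1 = A record."""
    # Header: ID, flags (only RD = recursion desired), QDCOUNT=1, AN/NS/ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    labels = qname.rstrip(".").split(".")
    encoded = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    # QTYPE and QCLASS (1 = IN)
    return header + encoded + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_dns_reply_header(reply: bytes) -> Tuple[int, bool, int]:
    """Return (id, is_response, rcode) from the 12-byte DNS header."""
    if len(reply) < 12:
        raise ValueError("reply is shorter than a DNS header")
    query_id, flags = struct.unpack("!HH", reply[:4])
    return query_id, bool(flags & 0x8000), flags & 0x000F


def dns_probe(server: str, name: str, timeout: float = 2.0) -> Optional[int]:
    """Send one UDP query; return the reply's RCODE, or None if no valid reply."""
    query_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_dns_query(name, query_id), (server, 53))
        try:
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return None
    reply_id, is_response, rcode = parse_dns_reply_header(reply)
    return rcode if reply_id == query_id and is_response else None
```

A `None` result would be logged as "no reply"; any RCODE means the server answered, with 0 meaning success.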
CLIENT NODE • Send logs to the diagnostic node after collection
DIAGNOSTIC NODE CNTD. • Determining the status of the proxy (netmon) • Failure seen • Is it seen by all? • Yes: machine down • No: machine overloaded
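The proxy decision flow reduces to a small classification over per-client observations. This is a minimal sketch assuming each client node reports, for the same interval, whether it saw a proxy failure; the names `diagnose_proxy` and `ProxyStatus` are hypothetical.

```python
from enum import Enum
from typing import Mapping


class ProxyStatus(Enum):
    UP = "up"
    OVERLOADED = "machine overloaded"
    DOWN = "machine down"


def diagnose_proxy(failure_seen: Mapping[str, bool]) -> ProxyStatus:
    """Classify the proxy from per-client observations for one interval.

    failure_seen maps a client-node name to True if that client saw a
    proxy failure in the interval being diagnosed.
    """
    if not failure_seen:
        raise ValueError("need at least one client observation")
    if not any(failure_seen.values()):
        return ProxyStatus.UP          # no client saw a failure
    if all(failure_seen.values()):
        return ProxyStatus.DOWN        # failure seen by all clients
    return ProxyStatus.OVERLOADED      # failure seen by only some clients
```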
DIAGNOSTIC NODE CNTD. • Determining the status of the DNS servers • Is it unreachable for all? • Yes: machine down • No: send back-to-back queries • Internal answered, external not answered: problem in the hierarchy • Other cases: machine overloaded
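The DNS decision flow can be sketched the same way. This assumes (hypothetically) that the diagnostic node already knows which clients found the server unreachable and the outcome of the back-to-back queries for an internal (campus) name and an external name; it is meant to be called only once a failure has been seen.

```python
from enum import Enum
from typing import Mapping


class DnsStatus(Enum):
    DOWN = "machine down"
    HIERARCHY_PROBLEM = "problem in hierarchy"
    OVERLOADED = "machine overloaded"


def diagnose_dns(unreachable: Mapping[str, bool],
                 internal_answered: bool,
                 external_answered: bool) -> DnsStatus:
    """Follow the DNS decision flow after a failure has been observed.

    unreachable maps each client node to True if the DNS server did not
    respond to it; the two flags summarize the back-to-back follow-up
    queries for an internal name and an external name.
    """
    if not unreachable:
        raise ValueError("need at least one client observation")
    if all(unreachable.values()):
        return DnsStatus.DOWN
    if internal_answered and not external_answered:
        return DnsStatus.HIERARCHY_PROBLEM   # server works, upstream resolution fails
    return DnsStatus.OVERLOADED              # all other cases
```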
DIAGNOSTIC NODE CNTD. • Offline mode • Statistics for a specified time period • Online mode • Statistics for the last 10 minutes • Remote query mode • Query the status of a node at a specified time
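A hedged sketch of how the offline and online modes might compute downtime for one node. It assumes probe results are stored as (epoch seconds, up/down) samples taken at a regular interval, so the fraction of down samples approximates the fraction of down time; the function names are illustrative.

```python
from typing import List, Tuple

# One probe result: (epoch seconds, True if the node answered)
Sample = Tuple[float, bool]


def downtime_percent(samples: List[Sample], start: float, end: float) -> float:
    """Offline mode: percentage of samples in [start, end] where the node was down."""
    window = [up for ts, up in samples if start <= ts <= end]
    if not window:
        raise ValueError("no samples in the requested window")
    down = sum(1 for up in window if not up)
    return 100.0 * down / len(window)


def online_downtime_percent(samples: List[Sample], now: float, minutes: int = 10) -> float:
    """Online mode: the same statistic over the last `minutes` minutes up to `now`."""
    return downtime_percent(samples, now - 60 * minutes, now)
```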
SETUP • Server nodes at 8 locations around the campus • Client nodes at 3 locations around the campus • Collected data from 26th March to 15th June • No data for 25th May to 2nd June • Measurements for the following nodes:
DNS SERVICE TIME DISTRIBUTION: OBSERVATIONS • Median response time is very low for all servers • Average is significantly greater than the median, i.e. the distribution is heavy-tailed • kresit-dns has a much higher average and 90th percentile
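The median / average / 90th-percentile comparison can be reproduced with the standard library. This sketch uses the nearest-rank definition of the percentile, which is an assumption; the slides do not say which definition was used.

```python
import statistics
from typing import Dict, Sequence


def service_time_summary(times_ms: Sequence[float]) -> Dict[str, float]:
    """Median, mean and nearest-rank 90th percentile of DNS service times (ms)."""
    if not times_ms:
        raise ValueError("no service times")
    ordered = sorted(times_ms)
    rank = (90 * len(ordered) + 99) // 100  # ceil(0.9 * n) using integer arithmetic
    return {
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "p90": ordered[rank - 1],
    }
```

A mean far above the median, as with kresit-dns, is the signature of a heavy tail: a few very slow replies dominate the average.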
OUTAGE DISTRIBUTIONS • Most outages are short. • Median is ≤ 2 minutes and 90th percentile ≤ 10 minutes for almost all nodes.
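Outage lengths can be derived from probe samples by grouping consecutive down samples. A minimal sketch, assuming samples are sorted and evenly spaced, so a run of k down samples is counted as k probe intervals; the function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def outage_lengths_min(samples: Sequence[Tuple[float, bool]], interval_s: float) -> List[float]:
    """Lengths (in minutes) of runs of consecutive 'down' samples."""
    lengths: List[float] = []
    run = 0
    for _, up in samples:
        if up:
            if run:
                lengths.append(run * interval_s / 60.0)
            run = 0
        else:
            run += 1
    if run:  # an outage still in progress at the end of the trace
        lengths.append(run * interval_s / 60.0)
    return lengths


def outage_summary(lengths: Sequence[float]) -> Tuple[float, float]:
    """(median, nearest-rank 90th percentile) of outage lengths."""
    if not lengths:
        raise ValueError("no outages")
    ordered = sorted(lengths)
    rank = (90 * len(ordered) + 99) // 100  # ceil(0.9 * n)
    return statistics.median(ordered), ordered[rank - 1]
```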
PERCENTAGE DOWNTIME ACROSS DAYS • On most days, downtime is < 2 % for most nodes • There is no clear pattern across days
COMBINED DOWNTIME • netmon ~ 0.24 % • The percentage of time at least one interface is not working is close to the percentage of time all interfaces are not working • Either the whole machine goes down • Or the measurements of the interfaces are not taking place at the same time • The time taken to check the status of a machine is variable
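Comparing "at least one interface down" with "all interfaces down" requires aligning each interface's probes to common time slots, which is where unsynchronized measurements can distort the result. A sketch assuming that alignment has already been done (one up/down flag per interface per slot); the function name is illustrative.

```python
from typing import Dict, List, Tuple


def combined_downtime(status: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (% of slots with at least one interface down, % with all down).

    status maps an interface name to its up/down flags, one per common time slot.
    """
    series = list(status.values())
    if not series or not series[0]:
        raise ValueError("no samples")
    if any(len(s) != len(series[0]) for s in series):
        raise ValueError("interfaces must be aligned to the same time slots")
    slots = list(zip(*series))  # one tuple of flags per time slot
    any_down = sum(1 for slot in slots if not all(slot))
    all_down = sum(1 for slot in slots if not any(slot))
    return 100.0 * any_down / len(slots), 100.0 * all_down / len(slots)
```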
RESULTS SUMMARY • Router failures > DNS failures > netmon failures • Median node outage ≤ 2 min • Small number of outages each day • No pattern across days • Average DNS service time ~ 300 ms • netmon downtime is lower than generally perceived • Dependence on other services: LDAP, DNS • Much of the equipment in the network is old
MOTIVATION • Per-day logs are huge, over 6 GB • Storing logs for long-term historical analysis is a problem • Over 2 TB for a year! • What is the traffic distribution? • What is the object size distribution? • What is the response time distribution? • Is there a trend across days? • What strategy can be used to select logs for long-term historical analysis?
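A quick check of the yearly figure, taking the daily log volume as 6 GB of log text:

$$6\ \text{GB/day} \times 365\ \text{days} = 2190\ \text{GB} \approx 2.2\ \text{TB}$$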
PROXY LOG ANALYSIS • The log file has the following format: • Month Date Time Proxy_Server squid_process_id epoch_timestamp process_time_ms source_ip tcp_status/http_status_code object_size request_type URL user_id hierarchy_code/server_ip object_type/object_sub_type • Stored in a MySQL database • Processed logs for one week, from May 14, 2009 to May 20, 2009 • Size of the log file ~ 6 GB • Number of requests in a day ~ 22 million • Bytes downloaded ~ 401.6 GB
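A sketch of a parser for the log format above, assuming fields are whitespace-separated (no spaces inside the URL) and that exactly 15 tokens appear per line. The sample line used to exercise it, including the `squid[2345]:` process-id token, is invented for illustration; the slides do not show a real line. Month, date and time are skipped because `epoch_timestamp` carries the same information.

```python
from dataclasses import dataclass


@dataclass
class ProxyLogRecord:
    proxy_server: str
    squid_process_id: str
    epoch_timestamp: float
    process_time_ms: int
    source_ip: str
    tcp_status: str
    http_status_code: int
    object_size: int
    request_type: str
    url: str
    user_id: str
    hierarchy_code: str
    server_ip: str
    object_type: str
    object_sub_type: str


def parse_log_line(line: str) -> ProxyLogRecord:
    """Parse one proxy log line in the 15-field format listed on the slide."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    (_month, _date, _time, proxy, pid, epoch, elapsed_ms, src_ip,
     status, size, method, url, user, hierarchy, content_type) = fields
    tcp_status, http_code = status.split("/", 1)
    hierarchy_code, server_ip = hierarchy.split("/", 1)
    # "-" (unknown type) yields object_type "-" and an empty sub-type
    object_type, _, object_sub_type = content_type.partition("/")
    return ProxyLogRecord(
        proxy_server=proxy, squid_process_id=pid,
        epoch_timestamp=float(epoch), process_time_ms=int(elapsed_ms),
        source_ip=src_ip, tcp_status=tcp_status,
        http_status_code=int(http_code), object_size=int(size),
        request_type=method, url=url, user_id=user,
        hierarchy_code=hierarchy_code, server_ip=server_ip,
        object_type=object_type, object_sub_type=object_sub_type,
    )
```

Records produced this way could then be bulk-inserted into the MySQL table; the table layout itself is not given in the slides.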
TRAFFIC DISTRIBUTION ON OBJECT TYPE: REQUESTS • Percentage distribution remains the same across days • Multimedia traffic is the smallest share ~ 0.2 % • Text traffic is the largest share ~ 40 %
TRAFFIC DISTRIBUTION ON OBJECT TYPE: DOWNLOADED BYTES • Percentage distribution remains the same across days • Multimedia traffic is the largest share ~ 38 %
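Both object-type distributions can be computed in one pass over the parsed records. The grouping of MIME top-level types into the five categories used here (audio and video counted as multimedia, everything else unlisted as other) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical grouping of MIME top-level types into the slide's categories
CATEGORY_OF = {"application": "application", "text": "text", "image": "image",
               "audio": "multimedia", "video": "multimedia"}


def type_distribution(records: Iterable[Tuple[str, int]]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Percentage of requests and of downloaded bytes per object category.

    Each record is (object_type, object_size_bytes) taken from one log line.
    """
    requests: Counter = Counter()
    downloaded: Counter = Counter()
    for object_type, size in records:
        category = CATEGORY_OF.get(object_type, "other")
        requests[category] += 1
        downloaded[category] += size
    total_requests = sum(requests.values())
    if total_requests == 0:
        raise ValueError("no records")
    total_bytes = sum(downloaded.values())
    req_pct = {c: 100.0 * n / total_requests for c, n in requests.items()}
    byte_pct = {c: (100.0 * b / total_bytes if total_bytes else 0.0)
                for c, b in downloaded.items()}
    return req_pct, byte_pct
```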
TRAFFIC DISTRIBUTION ON LOCATION: REQUESTS • Percentage distribution remains the same across weekdays • Hostel traffic increases on weekends • Academic traffic decreases on weekends
TRAFFIC DISTRIBUTION ON LOCATION: DOWNLOADED BYTES • The percentage distribution of downloaded bytes follows that of the number of requests • The object type distribution remains the same across days, so the majority of users behave similarly in different locations
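Per-location shares need a map from source IP to location. The subnets below are hypothetical placeholders, since the campus addressing plan is not given in the slides; the rest is a direct count of requests and bytes per location.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical subnet plan: placeholders, not the real IIT Bombay addressing
LOCATION_SUBNETS = {
    "hostel": [ipaddress.ip_network("10.10.0.0/16")],
    "academic": [ipaddress.ip_network("10.20.0.0/16")],
}


def location_of(source_ip: str) -> str:
    """Map a client IP to a location name, or 'other' if no subnet matches."""
    addr = ipaddress.ip_address(source_ip)
    for location, networks in LOCATION_SUBNETS.items():
        if any(addr in net for net in networks):
            return location
    return "other"


def location_share(records: Iterable[Tuple[str, int]]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """records: (source_ip, object_size_bytes) -> (% of requests, % of bytes) per location."""
    requests: Counter = Counter()
    downloaded: Counter = Counter()
    for source_ip, size in records:
        location = location_of(source_ip)
        requests[location] += 1
        downloaded[location] += size
    total_requests = sum(requests.values())
    if total_requests == 0:
        raise ValueError("no records")
    total_bytes = sum(downloaded.values())
    req_pct = {k: 100.0 * v / total_requests for k, v in requests.items()}
    byte_pct = {k: (100.0 * v / total_bytes if total_bytes else 0.0)
                for k, v in downloaded.items()}
    return req_pct, byte_pct
```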
NUMBER OF ARRIVALS PER SECOND • Lower activity from 2 a.m. to 11 a.m. (LAN curtailment) • Activity peaks at 3 p.m., 7 p.m. and 11 p.m. • Average ~ 250, standard deviation ~ 135
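The arrival statistics can be sketched by bucketing arrival times into whole seconds, counting seconds with no arrivals as zero. If `epoch_timestamp` marks when a request completed, the arrival time would be `epoch_timestamp - process_time_ms / 1000`; which definition the study used is an assumption here.

```python
import statistics
from collections import Counter
from typing import Iterable, List, Tuple


def arrivals_per_second(arrival_times: Iterable[float]) -> Tuple[List[int], float, float]:
    """Count arrivals in each whole second (including empty seconds) and
    return (series, mean, population standard deviation)."""
    counts = Counter(int(t) for t in arrival_times)  # truncate to the second
    if not counts:
        raise ValueError("no arrivals")
    first, last = min(counts), max(counts)
    series = [counts.get(s, 0) for s in range(first, last + 1)]
    return series, statistics.mean(series), statistics.pstdev(series)
```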
NUMBER OF REQUESTS CONCURRENTLY SERVED • Average ~ 2000, standard deviation ~ 859 • Follows the arrival curve
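The number of requests concurrently served can be estimated from each request's interval [completion - duration, completion] with a difference array over whole seconds. This sketch counts a request in every second its interval touches; treating `epoch_timestamp` as the completion time and `process_time_ms` as the duration is an assumption.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def concurrency_per_second(requests: Iterable[Tuple[float, float]]) -> Tuple[int, List[int]]:
    """Count requests active in each whole second.

    Each request is (completion_epoch_s, duration_ms); it is treated as active
    in every second from floor(completion - duration) to floor(completion).
    Returns (first_second, counts) with one count per consecutive second.
    """
    delta: Counter = Counter()
    for completion, duration_ms in requests:
        first = math.floor(completion - duration_ms / 1000.0)
        last = math.floor(completion)
        delta[first] += 1      # request becomes active
        delta[last + 1] -= 1   # and is no longer active after `last`
    if not delta:
        return 0, []
    start, stop = min(delta), max(delta)
    counts: List[int] = []
    active = 0
    for second in range(start, stop):
        active += delta[second]  # running sum of the difference array
        counts.append(active)
    return start, counts
```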
MEAN RESPONSE TIME AT TIME OF DAY • Response time remains almost constant throughout the day • A peak at around 4 a.m. • Average ~ 9.8 seconds
MEDIAN RESPONSE TIME AT TIME OF DAY • Median response time remains constant throughout the day, 480 ms for the day • The median is a better estimate of the typical response time on a day than the mean • Neither the median nor the mean response time follows the curves of requests concurrently served or of arrivals
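Per-hour mean and median response times can be computed by grouping records on the local hour of day. The UTC+05:30 offset for IST and the function name are assumptions; the per-hour mean versus median is what the two slides plot.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

IST_OFFSET_S = 5 * 3600 + 30 * 60  # assumption: convert UTC epoch seconds to IST


def hourly_response_stats(records: Iterable[Tuple[float, float]],
                          offset_s: int = IST_OFFSET_S) -> Dict[int, Tuple[float, float]]:
    """records: (epoch_s, response_time_ms) -> {local hour: (mean_ms, median_ms)}."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for epoch_s, response_ms in records:
        hour = int((epoch_s + offset_s) // 3600) % 24
        by_hour[hour].append(response_ms)
    return {hour: (statistics.mean(values), statistics.median(values))
            for hour, values in sorted(by_hour.items())}
```

A single slow transfer in an hour pulls the mean far above the median, which is why the median curve is the steadier indicator.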
CUMULATIVE RESPONSE TIME DISTRIBUTION • For multimedia, the curve is nearly linear • For the remaining categories, it is heavy-tailed • Median response times: application ~ 472 ms, text ~ 563 ms, image ~ 172 ms, multimedia ~ 10175 ms and other ~ 672 ms
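An empirical CDF and per-category medians like those on this slide can be sketched as follows; tied values collapse to the highest cumulative fraction. The function names are illustrative.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def empirical_cdf(values: Iterable[float]) -> List[Tuple[float, float]]:
    """Points (x, F(x)) of the empirical CDF, where F(x) = fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    points: List[Tuple[float, float]] = []
    for i, x in enumerate(ordered, start=1):
        if points and points[-1][0] == x:
            points[-1] = (x, i / n)  # tie: keep the highest cumulative fraction
        else:
            points.append((x, i / n))
    return points


def median_by_category(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """records: (category, response_time_ms) -> median response time per category."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {c: statistics.median(v) for c, v in groups.items()}
```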
CUMULATIVE OBJECT SIZE DISTRIBUTION • For multimedia, object sizes are more evenly distributed • For the remaining categories, 90 % of objects are < 10 KB • Median object sizes: application ~ 1.5 KB, text ~ 0.8 KB, image ~ 1.7 KB, multimedia ~ 903 KB and other ~ 0.46 KB
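Similarly for object sizes, a sketch computing per category the median size and the fraction of objects below 10 KB. Taking 1 KB as 1024 bytes is an assumption.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

KB = 1024  # assumption: 1 KB = 1024 bytes


def size_summary(records: Iterable[Tuple[str, int]],
                 threshold_bytes: int = 10 * KB) -> Dict[str, Tuple[float, float]]:
    """Per category: (median size in KB, fraction of objects smaller than threshold)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for category, size in records:
        groups[category].append(size)
    summary: Dict[str, Tuple[float, float]] = {}
    for category, sizes in groups.items():
        below = sum(1 for s in sizes if s < threshold_bytes)
        summary[category] = (statistics.median(sizes) / KB, below / len(sizes))
    return summary
```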
RESULTS SUMMARY • Multimedia traffic is the major part of WAN traffic (by downloaded bytes) • Percentage traffic distribution • Similar across object types on different days • Similar in different areas except on weekends • Thus any day's log file can be selected as representative of the week • Prefer a larger log file for more data • Select one for weekends and one for weekdays
FUTURE WORK • Characterize request processing time at the proxy • Explore other causes of failure, including the LDAP service • Explore failures on the ISP side, from a vantage point outside the network • Study the traffic within the LAN
THESIS CONTRIBUTIONS • Studied the tools and methodologies used for network measurement • Surveyed and documented the campus network of IIT Bombay • Architecture • Services • Failures • Developed a tool to detect some of these failures • Can be easily extended to detect others • Experimental evaluation of the tool on a testbed • Measurement analysis of proxy logs
BIBLIOGRAPHY [1] Computer Center, IIT Bombay. http://www.cc.iitb.ac.in [2] dnscache. http://cr.yp.to/djbdns/dnscache.html [3] Iperf. http://dast.nlanr.net/Projects/Iperf/ [4] iptables. http://www.netfilter.org/projects/iptables/index.html [5] Jpcap: a Java library for capturing and sending network packets. http://netresearch.ics.uci.edu/kfujii/jpcap/doc/ [6] Squid logs. http://wiki.squid-cache.org/SquidFaq/SquidLogs [7] Traceroute. http://sourceforge.net/projects/traceroute
BIBLIOGRAPHY CNTD. [8] Ultra Monkey. http://www.ultramonkey.org/ [9] Wikimedia. http://www.squid-cache.org/Library/wikimedia.dyn [10] Kostas G. Anagnostakis, Michael Greenwald, and Raphael Ryger. cing: Measuring network-internal delays using only existing infrastructure. In Proceedings of IEEE Infocom, April 2003. [11] Ranveer Chandra, Venkata N. Padmanabhan, and Ming Zhang. WiFiProfiler: Cooperative Diagnosis in Wireless LANs. In Proceedings of the 4th International Conference on Mobile Systems, Applications and Services, June 2006.
BIBLIOGRAPHY CNTD. [12] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, September 2006. [13] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, 2000.
BIBLIOGRAPHY CNTD. [14] Bradley Huffaker, Marina Fomenkov, David Moore, and kc claffy. Macroscopic analyses of the infrastructure: measurement and visualization of Internet connectivity and performance. In Proceedings of Passive and Active Measurements, 2001. [15] Van Jacobson. pathchar: a tool to infer characteristics of Internet paths, 1997. [16] Alex Rousskov and Valery Soloviev. A performance study of the Squid proxy on HTTP/1.0. World-Wide Web Journal, Special Edition on WWW Characterization and Performance Evaluation, 1999.
BIBLIOGRAPHY CNTD. [17] Stefan Savage. Sting: a TCP-based Network Measurement Tool. In Proceedings of the Second Conference on USENIX Symposium on Internet Technologies and Systems, 1999. [18] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. In Proceedings of the 2006 ACM CoNEXT conference, 2006. [19] Nirav S. Uchat. IIT bombay web traffic characterization. [20] Ameya P. Usgaonkar. Network Performance Analysis by Mining Multi-Variate Time Series Data, January 2001.