Root Cause Analysis of TCP Throughput: Methodology, Techniques, and Applications • Matti Siekkinen • Ph.D. Defense, October 30, 2006 • Institut Eurecom, Sophia Antipolis, France
Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work
The Internet: over the last 5 years… • Traffic volumes and number of users have skyrocketed • Access link capacities have multiplied • Dominance shifted from Web+FTP to peer-to-peer applications • TCP is still the dominant transport protocol • Carries over 90% of traffic
The Internet: questions raised • ISPs would like to know how clients are doing • What are the performance limitations that Internet applications are facing? • Why does a client with 4 Mbit/s ADSL access obtain a total download rate of only a few KB/s with eDonkey? • Why do I see no improvement in throughput after upgrading my link? • The Internet does not directly provide answers • The network is dumb! • Need techniques for traffic measurement and analysis
Root Cause Analysis of TCP Throughput • What? • Analysis and inference of the reasons that prevent a given TCP connection from achieving a higher throughput • Reasons are called limitation causes • Why TCP? • TCP typically carries over 90% of all traffic
Background • TCP Rate Analysis Tool (T-RAT) by Zhang et al. (SIGCOMM 2002) • Pioneering research work • Groundbreaking insights • It is not all congestion! • Opened up many questions • We implemented and tested it • Results are way off too often • Fundamental assumptions do not hold • T-RAT analyzes unidirectional traffic • Passively collected measurements • Usable in more cases (asymmetric paths) • The source of the problems
Our approach • We analyze only passive traffic measurements • Capture and store all TCP/IP headers, analyze later off-line • Observe traffic at a single measurement point • Applicable in diverse situations • E.g. at the edge of an ISP’s network • Know all about clients’ downloads and uploads • Bidirectional packet traces • Connection level analysis
Challenges (1/3) • Single measurement point anywhere along the path • Cannot/don’t want to control it • Complicates estimation of parameters (RTT and cwnd) • At point A: RTT ≈ d1, piece of cake… • At point B: RTT ≈ d3 + d4. How to get d4? (Did ack2 trigger data2?) • [Figure: time-sequence diagram of a connection observed at measurement points A and B]
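The d3 + d4 idea can be sketched in Python. This is a minimal illustration, not the estimator used in the thesis: the packet tuple layout, the function name `estimate_rtt`, and the rule "the first new data segment seen after an ACK was triggered by that ACK" are all assumptions made for the example. That last rule is exactly the uncertain step the slide asks about.

```python
from statistics import median

def estimate_rtt(packets):
    """Estimate RTT at a mid-path monitor as d3 + d4 (hypothetical sketch).

    packets: time-ordered list of (ts, kind, num) seen at the monitor.
      kind == "data": num is the end sequence number of the segment
      kind == "ack":  num is the cumulative ACK number
    d3 = data segment -> ACK covering it (monitor <-> receiver)
    d4 = ACK -> next new data segment   (monitor <-> sender), ASSUMING the
         ACK triggered that segment, which holds only for window-limited senders.
    """
    pending = []          # (ts, end_seq) of data segments not yet acknowledged
    last_ack_ts = None    # ACK waiting for the data it may have triggered
    highest_end = 0
    d3, d4 = [], []
    for ts, kind, num in packets:
        if kind == "data":
            if num <= highest_end:
                # Retransmission: samples become ambiguous, so discard them.
                pending.clear()
                last_ack_ts = None
                continue
            highest_end = num
            pending.append((ts, num))
            if last_ack_ts is not None:
                d4.append(ts - last_ack_ts)
                last_ack_ts = None
        else:
            covered = [t for t, end in pending if end <= num]
            if covered:
                # Use the latest covered segment to limit delayed-ACK bias.
                d3.append(ts - covered[-1])
                pending = [(t, end) for t, end in pending if end > num]
                last_ack_ts = ts
    if not d3 or not d4:
        return None
    return median(d3) + median(d4)
```

When the sender is application limited rather than window limited, the d4 samples no longer reflect a round trip, which is one reason several RTT estimation methods need to be compared.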
Challenges (2/3) • A lot of data to analyze • Potentially millions of connections per trace • Deep analysis • For each connection of each trace • Compute a lot of metrics • Divide connections into pieces • Analyse separately and compute more metrics • Need to keep track of everything
Challenges (3/3) • Find the right metrics to characterize all limitations • Not too many • Need to gather a lot of experience • Get it right! • Several methods for computing a particular metric • Choose the “best” for the situation • Try to maximize correctness of results • E.g. 5 ways to estimate RTTs • Careful validations • Benchmark with a lot of reference traces • Cross validate metrics
Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS
Why did we need InTraBase? • First try: ad-hoc scripts and specialized software tools (tcptrace et al.) • Problems: • Management • Data, metadata, and tools • Got lost with files containing data and ad-hoc scripts • Lot of metrics to compute and combine • Cumbersome analysis process • Iterative analysis • Data loses semantics and structure • Scalability • Cannot analyze large enough data sets • [Figure: iterative analysis cycle: Filter, Process, Combine, Store, Interpret]
Our InTraBase approach • Store traffic measurements in files as base data • Upload base data into the DB and process it within the DB • Issue SQL queries • Object-relational DBMS: create functions for advanced processing • [Figure: InTraBase architecture with labels: network link, tcpdump, Web100, application logs, raw base data files, preprocess, database system (base data, metadata, functions, queries, results)]
Benefits from a DBMS-based Approach • Organize and manage data, related metadata, analysis results and tools • Data becomes structured and has semantics • Processing and updating data is easier • Tools “understand” the data: higher-level programming • Searching is more efficient (indexes) • Store reusable intermediate results • It is easier to combine different data sources • E.g. across OSI layers
Example: histogram of the packet inter-arrival times of the fastest connection • [Figure: tables packets (connection id, timestamp, start #seq, end #seq, flags, …) and connections (connection id, bytes, packets, tput, …); functions iat(…) and plot_ts_hist(); output histogram.pdf] • SELECT plot_ts_hist('SELECT * FROM iat(t2.cnxid,t2.reverse,"packets")','histogram.pdf') FROM (SELECT cnxid,reverse FROM cnxs,(SELECT max(throughput) FROM cnxs) AS t1 WHERE cnxs.throughput=t1.max) AS t2;
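The same workflow (base data in a table, derived per-connection data, then a query for the fastest connection) can be tried out with Python's built-in sqlite3 module. This is only a stand-in sketch: InTraBase is built on an object-relational DBMS with user-defined functions, whereas here the inter-arrival times are computed in Python. The column names `ts`, `nbytes`, and `npackets` and the helper names `build_db` and `iat_of_fastest` are assumptions made for the example.

```python
import sqlite3

def build_db(packets):
    """Load base data and derive a per-connection table.

    packets: iterable of (cnxid, reverse, ts, nbytes) tuples.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE packets (cnxid INTEGER, reverse INTEGER, ts REAL, nbytes INTEGER)"
    )
    db.executemany("INSERT INTO packets VALUES (?, ?, ?, ?)", packets)
    # Derived data lives next to the base data, as in the DBMS-based approach.
    db.execute(
        "CREATE TABLE cnxs AS "
        "SELECT cnxid, reverse, COUNT(*) AS npackets, SUM(nbytes) AS bytes, "
        "       SUM(nbytes) * 8.0 / (MAX(ts) - MIN(ts)) AS throughput "
        "FROM packets GROUP BY cnxid, reverse HAVING MAX(ts) > MIN(ts)"
    )
    return db

def iat_of_fastest(db):
    """Return packet inter-arrival times of the highest-throughput connection."""
    row = db.execute(
        "SELECT cnxid, reverse FROM cnxs ORDER BY throughput DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return []
    ts = [t for (t,) in db.execute(
        "SELECT ts FROM packets WHERE cnxid = ? AND reverse = ? ORDER BY ts", row
    )]
    return [later - earlier for earlier, later in zip(ts, ts[1:])]
```

The returned list could then be passed to any plotting tool to produce the histogram.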
Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes
Scope • Study long-lived TCP connections • Short connections are another topic • Dominated by slow start? • Assume FIFO scheduling • Necessary for link capacity estimations with packet dispersion techniques • Reasonable assumption for most traffic • May not hold for cable modem and 802.11 access networks
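To see why FIFO scheduling matters, consider the basic packet-pair form of a dispersion technique: two packets sent back to back leave a FIFO bottleneck spaced by the time it takes to transmit the second one, so capacity ≈ packet size / dispersion. The sketch below is a simplified illustration, not the estimator used in this work. The function name is hypothetical, and taking the median of the samples is a simplification of the mode-based filtering that practical tools tend to use.

```python
from statistics import median

def packet_pair_capacity(arrivals):
    """Estimate bottleneck capacity in bit/s from packet dispersion.

    arrivals: list of (timestamp_s, size_bytes) for packets assumed to have
    been sent back to back and queued FIFO at the bottleneck.
    """
    estimates = []
    for (t_prev, _), (t_cur, size) in zip(arrivals, arrivals[1:]):
        dispersion = t_cur - t_prev
        if dispersion > 0:
            # The second packet of each pair needed `dispersion` seconds to
            # cross the bottleneck link.
            estimates.append(size * 8 / dispersion)
    return median(estimates) if estimates else None
```

With non-FIFO scheduling, packets of other flows can be served between the two packets in ways that break this relation, so the estimate loses its meaning.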
Limitation Causes for TCP Throughput • Application • Transport layer • TCP receiver • Receiver window limitation • TCP protocol • Slow start… • Network layer • Bottleneck link
Application that sends larger bursts separated by idle periods • E.g. BitTorrent, HTTP/1.1 (persistent) • [Figure: transfer periods separated by periods with only keep-alive messages]
Limitation Causes: Application • The application does not even attempt to use all network resources • TCP connections are partitioned into two kinds of periods: • Bulk Transfer Period (BTP): the application constantly provides data to transfer • Buffer B1 never runs out of data • Application Limited Period (ALP): opposite of BTP • TCP has to wait for data because B1 is empty • [Figure: sender and receiver stacks (application, buffers, TCP) connected by the network; B1 is the buffer between the sending application and TCP]
Limitation Causes: TCP Receiver • Receiver advertised window limits the rate • Max amount of outstanding bytes = min(cwnd, rwnd) • Sender is idle waiting for ACKs to arrive • Flow control • Sending application overwhelms receiving application • Buffer B2 is full • Configuration problem (unintentional) • Default receiver advertised window is set too low • Window scaling is not enabled • [Figure: sender and receiver stacks (application, buffers, TCP) connected by the network; B2 is the buffer between TCP and the receiving application]
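A quick calculation shows why a small advertised window matters: a sender can have at most min(cwnd, rwnd) bytes outstanding per round trip, so its rate is bounded by that window divided by the RTT. The helper names below are hypothetical and the sketch ignores header overhead.

```python
def window_limited_rate(cwnd_bytes, rwnd_bytes, rtt_s):
    """Upper bound on the sending rate in bit/s: min(cwnd, rwnd) per RTT."""
    return min(cwnd_bytes, rwnd_bytes) * 8 / rtt_s

def is_receiver_limited(rwnd_bytes, rtt_s, bottleneck_bps):
    """True when the advertised window, not the path, caps the rate."""
    return rwnd_bytes * 8 / rtt_s < bottleneck_bps
```

For example, without window scaling the advertised window cannot exceed 65535 bytes, so over a 100 ms path `window_limited_rate(10**9, 65535, 0.1)` gives roughly 5.2 Mbit/s regardless of how fast the links are.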
Limitation Causes: Network • Limitation is due to congestion at a bottleneck link • Shared bottleneck: obtain only a fraction of its capacity • Non-shared bottleneck: obtain all of its capacity
Our Approach to Root Cause Analysis • Divide & Conquer • Partition connections into BTPs and ALPs • Filter out application impact • Analyze the bulk transfer periods for limitation by • TCP receiver • TCP protocol • Network • Methods are based on metrics computed from packet headers
Why filter out application effect? • Many TCP/IP-level traffic studies do not account for application effect • RTTs, burstiness… • Try to study network properties but end up measuring application effect instead!
Distinguishing BTPs from ALPs: Isolate & Merge algorithm • Phase 1: Isolate • Fact: TCP always tries to send MSS-size packets • Consequence: small packets (size < MSS) and idle time indicate application limitation • Buffer between application and TCP is empty • [Figure: packet timeline. Periods with a large fraction of small packets (smaller than MSS) or with idle time > RTT are marked as ALPs; runs of MSS packets lie between them]
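The Isolate phase can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the thesis: the input format, the `Period` type, and the `max_small_fraction` parameter (what counts as a "large fraction" of small packets) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Period:
    kind: str     # "BTP" or "ALP"
    start: float  # seconds
    end: float    # seconds
    nbytes: int   # payload bytes carried in the period

def isolate(packets, mss, rtt, max_small_fraction=0.2):
    """Split one direction of a connection into BTPs and ALPs.

    packets: time-ordered list of (timestamp_s, payload_bytes).
    An idle gap longer than one RTT, or a burst dominated by packets
    smaller than the MSS, is taken as a sign of application limitation.
    """
    if not packets:
        return []
    # Cut the packet sequence wherever the sender stayed idle for > RTT.
    bursts = [[packets[0]]]
    for prev, cur in zip(packets, packets[1:]):
        if cur[0] - prev[0] > rtt:
            bursts.append([cur])
        else:
            bursts[-1].append(cur)
    periods = []
    for i, burst in enumerate(bursts):
        if i > 0:
            # The idle gap itself is an application limited period.
            periods.append(Period("ALP", bursts[i - 1][-1][0], burst[0][0], 0))
        small = sum(1 for _, size in burst if size < mss)
        kind = "ALP" if small / len(burst) > max_small_fraction else "BTP"
        total = sum(size for _, size in burst)
        periods.append(Period(kind, burst[0][0], burst[-1][0], total))
    return periods
```

A single small packet at the end of a bulk transfer does not turn the burst into an ALP as long as the fraction of small packets stays below the threshold.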
Distinguishing BTPs from ALPs: Isolate & Merge algorithm • Phase 2: Merge • Why? • After Isolate, BTPs may be separated by very short ALPs • Analyze impact of the application • How much do ALPs decrease overall throughput? • How? • Merge subsequent transfer periods separated by an ALP to create a new BTP • Mergers controlled with drop parameter • Iterate until all possible mergers are performed
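A minimal sketch of the Merge phase follows. The slide does not spell out how the drop parameter is applied, so the acceptance rule used here is an assumption: a merger is accepted when the throughput of the merged period is at least (1 - drop) times the lower throughput of the two BTPs being joined. The tuple layout and function names are also hypothetical, and the input is assumed to alternate between BTPs and single ALPs.

```python
def throughput(period):
    """Bytes per second of a (kind, start, end, nbytes) period."""
    _, start, end, nbytes = period
    duration = end - start
    return nbytes / duration if duration > 0 else 0.0

def merge(periods, drop):
    """Merge BTP-ALP-BTP triples until no further merger is acceptable.

    periods: time-ordered list of (kind, start, end, nbytes), kind in
             {"BTP", "ALP"}.
    drop:    tolerated relative throughput decrease in [0, 1]
             (ASSUMED semantics, see the lead-in).
    """
    periods = list(periods)
    changed = True
    while changed:
        changed = False
        for i in range(len(periods) - 2):
            a, gap, b = periods[i], periods[i + 1], periods[i + 2]
            if (a[0], gap[0], b[0]) != ("BTP", "ALP", "BTP"):
                continue
            merged = ("BTP", a[1], b[2], a[3] + gap[3] + b[3])
            reference = min(throughput(a), throughput(b))
            if throughput(merged) >= (1.0 - drop) * reference:
                periods[i:i + 3] = [merged]
                changed = True
                break  # restart the scan on the updated list
    return periods
```

Running the merge with increasing drop values shows how much throughput is lost to the ALPs: with drop = 0 hardly anything is merged, while drop = 1 collapses the whole connection into one period.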
BTP Analysis • Compute limitation scores for each BTP • 4 quantitative scores in [0,1] • We use retransmission rates, inter-arrival time patterns, path capacity, RTT etc. • Perform classification of BTPs into limitation causes • Map (combination of) limitation scores into a cause • Threshold-based scheme
Classification scheme • 4 thresholds need to be set • [Figure: classification diagram over the dispersion score, retransmission score, receiver window limitation score, and b-score]
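The general shape of a threshold-based scheme can be sketched as below. This is purely illustrative and is not the classification diagram from the slide: the order of the tests, the direction of each comparison, the threshold values, and the output labels are all assumptions. In particular, the sketch assumes a low dispersion score means throughput close to the path capacity.

```python
def classify_btp(dispersion, retransmission, rwnd, b_score,
                 th_disp=0.2, th_retr=0.01, th_rwnd=0.5, th_b=0.25):
    """Map four limitation scores in [0, 1] to a cause (hypothetical rules).

    All thresholds are placeholders; in practice they have to be calibrated
    against reference traces.
    """
    for name, value in (("dispersion", dispersion),
                        ("retransmission", retransmission),
                        ("rwnd", rwnd), ("b_score", b_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    # ASSUMED rule order: receiver limitation first, then the two
    # bottleneck cases, then a catch-all.
    if rwnd >= th_rwnd and b_score >= th_b:
        return "TCP receiver"
    if dispersion <= th_disp:
        return "unshared bottleneck"
    if retransmission >= th_retr:
        return "shared bottleneck"
    return "unknown"
```

Each threshold is a parameter of the function so that calibrated values can replace the placeholders.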
Classification: calibrating the thresholds • Difficult task: Diversity vs. Control • Reference data needs to be representative & diverse enough • No simulations • Need to control experiments in some way to get what we want • Reference data with partially controlled experiments • Try to generate transfers limited by a certain cause • FTP downloads from Fedora Core mirror sites • 232 sites covering all continents • Artificial bottleneck links with rshaper • network limitation • Nistnet to add delay • receiver limitation (Wr/RTT < bw) • Control the number of simultaneous downloads • unshared vs. shared bottleneck • [Figure: downloads over the Internet from mirror sites (Australia, Japan, USA, Finland) to Eurecom through Rshaper and Nistnet]
Classification: calibrating the thresholds, example • [Figure: reference transfers with the bottleneck set at 1 Mbit/s, 1 download at a time; annotation: set th1 here]
Part 3: Case study on Performance Analysis of ADSL Clients
Motivation • Stress test for our techniques • Do we learn useful things? • Knowing throughput limitations (=performance) is useful • ISPs want satisfied clients • Need to know what’s going on before things can be improved • Installed InTraBase at France Telecom to study traffic at their ADSL access network • Root cause analysis techniques implemented within InTraBase
Measurement Setup • 24 hours of traffic on March 10, 2006 • 290 GB of TCP traffic • 64% downstream, 36% upstream • Observed packets from ~3000 clients, analyzed only 1335 • Excluded clients did not generate enough traffic for RCA • [Figure: Internet, collect network, and access network, with two pcap probes marked]
Warming up… • Connections • Size distribution highly skewed • Use only 1% of them for RCA • Represent > 85% of all traffic • Clients • Heavy-hitters: 15% of clients generate 85-90% of traffic (up & down) • Low access link utilization • Why?
Results of Limitation Analysis • Striking result • Application limits performance of over 80% of clients • What’s going on?
Application analysis: Application limited traffic • Quite stable and symmetric volumes • Over 80% of all traffic • eDonkey and “other” dominate • [Figure legend: eDonkey, P2P, other]
Application analysis: Saturated access link • No recognized P2P • Asymmetric port 80/8080 downstream • Real Web traffic?
Connecting the evidence… • Most clients’ performance limited by applications • Very low link utilizations for application limited traffic • Most of application limited traffic seems to be P2P • Peers often have asymmetric uplink and downlink capacities • P2P applications/users enforce upload rate limits • Conclusion: most clients’ download performance seems to suffer from P2P clients drastically limiting their upload rates • [Figure: a downloading client with low utilization fetches over the Internet from uploading clients with low capacity plus a rate limiter]
Conclusions • Contributions • Future work
Conclusions: Claims and contributions • (Part 1) DBMSs provide a powerful infrastructure for analysis of passive traffic measurements • Performance is good • (Part 2) We can infer root causes for TCP throughput using • bidirectional packet traces at • a single measurement point located anywhere on the TCP/IP path • (Part 2) Today’s Internet applications interact in diverse ways with TCP • Bias/error in TCP/IP path analysis • Filter out their effects first • (Part 3) TCP root cause analysis techniques with DBMS-based analysis enable: • performance evaluation of applications, • evaluation of network utilization, and • identification of TCP configuration problems
The case is not yet closed… • Short connections • Challenge previous “old” results with RCA • What about persistent connections? • Wireless traffic • Non-FIFO scheduling • Link-layer issues • Extended case study on ADSL clients • We saw a day, what about a week? • Trends, consistency