This presentation discusses potential problems in gigabit networks: hardware, driver, and OS issues (NIC drivers, device management, redundant copies, device polling, zero-copy TCP), protocol stack overhead, TCP stability and utilization, and related experiments and measurements.
Towards Gigabit • David Wei, Netlab@Caltech • For FAST Meeting, July.2
Potential Problems • Hardware / Driver / OS • Protocol Stack Overhead • Scalability of the Protocol Specification • TCP Stability / Utilization (New Congestion Control Algorithm) • Related Experiments & Measurements
Hardware / Drivers / OS • NIC Driver • Device Management (Interrupts) • Redundant Copies • Device Polling (http://info.iet.unipi.it/~luigi/polling/) • Zero-Copy TCP • … www.cs.duke.edu/ari/publications/talks/freebsdcon
Device Polling • Current process for a NIC driver in FreeBSD: • Packet arrives at the NIC • NIC raises a hardware interrupt • CPU jumps to the interrupt handler for that NIC • The MAC-layer handler reads data from the NIC into a queue • Upper layers process the data in the queue (at lower priority) • Drawback: the CPU takes an interrupt for every packet (context switching); interrupts become too frequent for a high-speed device • Live-lock: the CPU is so busy servicing NIC interrupts that it never gets to process the data already in the queue.
Device Polling Device Polling: • Polling: the CPU checks the device when it has time. • Scheduling: the user specifies a time ratio for the CPU to spend on device work versus non-device processing. Advantages: • Balances device service against non-device processing • Improves performance for fast devices
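A minimal sketch, not from the slides, contrasting per-packet interrupts with budgeted polling; the queue names, the POLL_BUDGET value, and the idea of a fixed per-poll budget are illustrative assumptions rather than details of the FreeBSD implementation.

```python
import collections

nic_rx_queue = collections.deque()   # packets DMA'd by the NIC (hypothetical)
app_queue = collections.deque()      # packets waiting for upper-layer processing

POLL_BUDGET = 64                     # max packets drained per poll (assumed value)

def interrupt_model(packet):
    """Interrupt-driven: every arriving packet preempts the CPU."""
    app_queue.append(packet)         # one context switch + handler run per packet

def polling_model():
    """Polling: drain at most POLL_BUDGET packets, then yield the CPU."""
    for _ in range(min(POLL_BUDGET, len(nic_rx_queue))):
        app_queue.append(nic_rx_queue.popleft())
    # Remaining packets wait for the next poll tick, so upper layers and user
    # processes are never starved (this is what avoids receive live-lock).
```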
Protocol Stack Overhead Per-packet overhead: • Ethernet header / checksum • IP header / checksum • TCP header / checksum • Copying / interrupt processing Solution: increase the packet size • Optimal packet size = min{MTU along the path} (fragmentation also hurts performance)
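A back-of-the-envelope calculation, not from the slides, of how the fixed per-packet header cost shrinks as packets grow; the MTU values are just common examples (minimum default, standard Ethernet, jumbo frames, IP maximum).

```python
# Fixed per-packet overhead: Ethernet framing (14 B header + 4 B CRC) outside
# the MTU, plus the IP (20 B) and TCP (20 B) headers inside it.
ETH_FRAMING = 14 + 4
IP_TCP_HEADERS = 40

for mtu in (576, 1500, 9000, 65535):
    payload = mtu - IP_TCP_HEADERS      # application bytes carried per packet
    wire = mtu + ETH_FRAMING            # bytes actually sent on the link
    print(f"MTU {mtu:>5}: payload {payload:>5} B, header overhead {1 - payload / wire:.1%}")
```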
Path MTU Discovery (RFC 1191) Current method: • "Don't Fragment" bit (router: drop or fragment; host: test and enforce) • MTU = min{576, first-hop MTU} • MSS = MTU - 40 • MTU <= 65535 (architectural limit) • MSS <= 65495 (avoids IP sign-bit bugs…) • Drawback: usually too small
Path MTU Discovery • How to discover the PMTU? Current: • Search (proportional decrease / binary search) • Update (periodic increase, reset to the MTU of the first hop) Proposed: • Search/update over a table of typical MTU values • Routers: include a suggested MTU in the "Datagram Too Big" (DTB) message that reports the DF packet drop.
Path MTU Discovery Implementation Host: • Packetization layer (TCP, or a connection over UDP): sets DF and chooses the packet size • IP: stores the PMTU for each known path (routing table) • ICMP: handles the "Datagram Too Big" message Router: • Sends an ICMP "Datagram Too Big" message when a DF datagram exceeds the next-hop MTU Implementation problems: • RFC 2923
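A minimal sketch of the host side in practice, assuming Linux: IP_MTU_DISCOVER / IP_PMTUDISC_DO ask the kernel to set DF and run PMTU discovery, and IP_MTU (value 14 on Linux, not always exported by Python) reads back the cached path MTU; the destination "example.org" is only a placeholder.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the kernel to set DF and perform path MTU discovery on this socket
# (Linux-only constants, so guard their presence).
if hasattr(socket, "IP_MTU_DISCOVER"):
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER, socket.IP_PMTUDISC_DO)

s.connect(("example.org", 80))        # placeholder destination

# Read back the kernel's cached PMTU for this path (Linux option IP_MTU = 14).
IP_MTU = getattr(socket, "IP_MTU", 14)
try:
    print("kernel-cached path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
except OSError:
    print("IP_MTU not supported on this platform")
s.close()
```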
Scalability of Protocol Specifications • Window size space (<= 64 KB) • Sequence number space (wrap-around, <= 2G) • Inadequate frequency of RTT sampling (1 sample per window)
Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < |sequence number space| / bandwidth
Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < 8 * |sequence number space| / bandwidth • |SN space| = 2^31 = 2 GB • Bandwidth = 1 Gbps • MSL <= 16 s • Variance of IP delay <= 16 s • Current TCP: 3 min • Not scalable with bandwidth growth
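A quick check of the arithmetic above, not from the slides, showing how the wrap time shrinks as the line rate grows (which is why a multi-minute MSL stops being safe).

```python
SEQ_SPACE_BYTES = 2 ** 31   # half the 32-bit sequence space, as on the slide

for gbps in (0.01, 0.1, 1, 10):
    wrap_seconds = SEQ_SPACE_BYTES * 8 / (gbps * 1e9)
    print(f"{gbps:>5} Gbps: sequence space wraps in {wrap_seconds:8.1f} s")
```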
TCP Extensions (RFC 1323) • Window scaling: a scale factor S carried in the SYN; effective window = [16-bit window field] * 2^S • RTT measurement: a timestamp on every packet (generated by the sender, echoed by the receiver) • PAWS (Protect Against Wrapped Sequence numbers): uses the timestamp to extend the sequence space (so the timestamp clock must be neither too fast nor too slow: about 1 ms to 1 s per tick) • Header prediction: simplifies per-packet processing
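A small sketch, with made-up numbers, of the two arithmetic ideas above: the scale factor expanding the 16-bit window field, and a PAWS-style timestamp comparison rejecting a stale segment.

```python
def effective_window(window_field: int, scale: int) -> int:
    """RFC 1323 window scaling: the advertised 16-bit field is shifted left by S."""
    assert 0 <= window_field <= 0xFFFF and 0 <= scale <= 14
    return window_field << scale

def paws_reject(segment_tsval: int, last_tsval_seen: int) -> bool:
    """PAWS idea: drop a segment whose timestamp is older (mod 2**32) than the
    newest timestamp already accepted on this connection."""
    diff = (last_tsval_seen - segment_tsval) & 0xFFFFFFFF
    return 0 < diff < 2 ** 31

print(effective_window(0xFFFF, 10))                          # ~64 MB window at S = 10
print(paws_reject(segment_tsval=100, last_tsval_seen=5000))  # True: stale duplicate
```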
High Speed TCP (Floyd '02). Goals: • Achieve a large window with a realistic loss rate (use the current window size in the AIMD parameters) • High speed in a single connection (10 Gbps) • Easy to reach a high sending rate for a given loss rate • How to achieve TCP-friendliness? • Incrementally deployable (no router support required)
High Speed TCP Problem in steady state: • TCP response function: w ≈ 1.2 / √p segments per RTT • A large congestion window therefore requires a very low loss rate Problem in recovery: • Congestion avoidance takes too long to recover (consecutive time-outs)
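A worked example, with assumed numbers (1500-byte packets, 100 ms RTT), of why the standard response function cannot reach the 10 Gbps goal at any realistic loss rate.

```python
# Standard TCP response function: w ≈ 1.2 / sqrt(p)   (segments per RTT)
target_rate_bps = 10e9       # the slide's 10 Gbps goal
packet_bits = 1500 * 8       # assumed packet size
rtt_s = 0.1                  # assumed round-trip time

w = target_rate_bps * rtt_s / packet_bits    # window needed, in segments
p = (1.2 / w) ** 2                           # loss rate that window requires
print(f"window ≈ {w:,.0f} segments, required loss rate ≈ {p:.1e}")
# roughly 83,000 segments and a loss rate near 2e-10: unrealistically low
```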
High Speed TCP Change the TCP response function: • p high (above the maxP corresponding to the default cwnd threshold W): standard TCP • p low (cwnd >= W): use a(w), b(w) instead of the constants a, b when adjusting cwnd • For a given loss rate P and desired window size W1 at P: derive a(w) and b(w), keeping the response function linear on a log-log scale (∆ log W proportional to ∆ log P)
Change TCP Function • Standard TCP: a = 1 (cwnd grows by one segment per RTT), b = 1/2 (cwnd is halved on a loss)
Expectations • Achieve a large window with a realistic loss rate • Relative fairness between standard TCP and High Speed TCP (acquired bandwidth grows with cwnd) • A moderate decrease, instead of halving the window, when congestion is detected (about 0.33 at a window of 1000) • A pre-computed look-up table to implement a(w) and b(w)
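A sketch of such a pre-computed table, using anchor points taken from Floyd's HighSpeed TCP proposal (standard behaviour below a window of 38 at p ≈ 10^-3; a window of 83,000 at p = 10^-7 with decrease factor 0.1); those anchors are assumptions borrowed from the proposal, not values stated on the slides.

```python
import math

LOW_W, LOW_P, LOW_B = 38.0, 1e-3, 0.5        # below this window, behave as standard TCP
HIGH_W, HIGH_P, HIGH_B = 83000.0, 1e-7, 0.1  # target regime for 10 Gbps-class windows

def _frac(w):
    """Position of w between the two anchor windows, on a log scale."""
    return (math.log(w) - math.log(LOW_W)) / (math.log(HIGH_W) - math.log(LOW_W))

def b_hs(w):
    """Decrease factor b(w): interpolated linearly in log(w) between 0.5 and 0.1."""
    return 0.5 if w <= LOW_W else LOW_B + _frac(w) * (HIGH_B - LOW_B)

def a_hs(w):
    """Increase a(w), chosen so the response function is a straight line on a
    log-log plot through the two anchors: a = w^2 * p(w) * 2b / (2 - b)."""
    if w <= LOW_W:
        return 1.0
    p = math.exp(math.log(LOW_P) + _frac(w) * (math.log(HIGH_P) - math.log(LOW_P)))
    return w * w * p * 2.0 * b_hs(w) / (2.0 - b_hs(w))

# Pre-computed look-up table, as the slide suggests (illustrative granularity):
table = {w: (round(a_hs(w), 1), round(b_hs(w), 2)) for w in (38, 1000, 10000, 83000)}
print(table)   # b(1000) comes out near 0.33, matching the slide's example
```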
Slow Start Modification of slow start: • Problem: doubling cwnd every RTT is too aggressive once cwnd is large • Proposal: limit ∆cwnd per RTT during slow start
Limited Slow Start For each ACK: • cwnd <= max_ssthresh: ∆cwnd = MSS (standard TCP slow start) • cwnd > max_ssthresh: ∆cwnd = 0.5 * max_ssthresh / cwnd (at most 0.5 * max_ssthresh per RTT)
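A small sketch of the per-ACK rule above, in segment units; max_ssthresh = 100 is only an example value (the limited slow-start proposal suggests something on that order), and one ACK per segment with no delayed ACKs is assumed.

```python
MAX_SSTHRESH = 100.0   # example threshold in segments (assumed, not from the slide)

def on_ack(cwnd):
    """Limited slow start: cwnd growth per ACK, in segment units."""
    if cwnd <= MAX_SSTHRESH:
        return cwnd + 1.0                         # standard slow start: +1 MSS per ACK
    return cwnd + 0.5 * MAX_SSTHRESH / cwnd       # capped: <= max_ssthresh/2 per RTT

# One RTT's worth of ACKs starting from a large window:
cwnd = 400.0
for _ in range(int(cwnd)):
    cwnd = on_ack(cwnd)
print(round(cwnd, 1))   # grows by roughly 50 segments in the RTT, not by 400
```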
Related Projects • Cray Research ('92) • CASA Testbed ('94) • Duke ('99) • Pittsburgh Supercomputing Center • Portland State Univ. ('00) • Internet2 ('01) • Web100 • Net100 (built on Web100)
Cray Research '92 • TCP/IP Performance at Cray Research (Dave Borman) Configuration: • HIPPI between two dedicated Y-MPs with Model E IOS and UNICOS 8.0 • Memory-to-memory transfer Results: • Direct channel-to-channel: MTU 64K, 781 Mbps • Through a HIPPI switch: MTU 33K, 416 Mbps; MTU 49K, 525 Mbps; MTU 64K, 605 Mbps
CASA Testbed '94 Applied Network Research, San Diego Supercomputer Center + UCSD • Goal: delay and loss characteristics of a HIPPI-based gigabit testbed • Link feature: blocking (HIPPI), a tradeoff between high loss rate and high delay • Conclusion: avoiding packet loss is more important than reducing delay • Performance (delay * bandwidth = 2 MB; RFC 1323 on; Cray machines): 500 Mbps sustained TCP throughput (TTCP/Netperf)
Trapeze/IP (Duke) Goals: • Which optimizations are most useful for reducing host overheads for fast TCP? • How fast does TCP really go, and at what cost? Approaches: • Zero-copy • Checksum offloading Result: • >900 Mbps for MTU > 8K
Trapeze/IP (Duke) • Zero-copy architecture (figure) www.cs.duke.edu/ari/publications/talks/freebsdcon
Enabling High Performance Data Transfers on Hosts (Pittsburgh Supercomputing Center) • Enable RFC 1191 path MTU discovery • Enable RFC 1323 large windows • OS kernel: large enough socket buffers • Application: set its send and receive socket buffer sizes Detailed methods for tuning various OSes.
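A minimal sketch of the application-side step above: sizing socket buffers to the bandwidth-delay product; the 1 Gbps / 50 ms figures are made-up example values, and whether the kernel honors the request depends on its own limits (the kernel tuning step above).

```python
import socket

# Bandwidth-delay product: the buffer a single connection needs to keep the
# pipe full.  Example values, not measurements from the talk.
bandwidth_bps = 1e9          # 1 Gbps path
rtt_s = 0.05                 # 50 ms round trip
bdp_bytes = int(bandwidth_bps * rtt_s / 8)   # ~6.25 MB

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)

# The kernel may clamp the request to its configured maximum, so read it back:
print("send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```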
PSU Experiment Goal: • Round-trip delay and TCP throughput with different window sizes • Influence of different switches (Cisco 3508/3524/5500) and different NICs Environment: • OS: FreeBSD 4.0/4.1 (without RFC 1323?), Linux, Solaris • WAN: 155 Mbps OC-3 over a SONET MAN • Measurement tools: ping + TTCP
PSU Experiment • "Smaller" switches and low-end routers can easily muck things up • Bugs in Linux 2.2 kernels • Different NICs have different performance • A fast PCI bus (64 bits * 66 MHz) is necessary • Switch MTU size can make a difference (giant packets are better) • Bigger TCP window sizes can help, but there seems to be a knee around 4 MB that is not remarked upon in the literature
Internet-2 Experiment Goal: a single TCP connection at 700-800 Mbps over the WAN; relations among window size, MTU, and throughput • OS: FreeBSD 4.3-RELEASE • Architecture: 64-bit, 66 MHz PCI + … • Configuration: sendspace = recvspace = 102400 • Setup: direct connection (back-to-back) and WAN • WAN: symmetric path: host1-Abilene-host2 • Measurement: ping + Iperf
Internet-2 Experiment Back-to-back: • No loss • Found some bugs in FreeBSD 4.3 WAN: • <= 200 Mbps • Asymmetry between the two directions (MTU caching…)
Web100 • Goal: make it easy for non-experts to achieve high bandwidth • Method: get more information out of TCP • Software: measurement instrumentation embedded in the kernel TCP; application layer: diagnostics / auto-tuning • Proposal: RFC 2012 (TCP MIB)
Net100 • Built on Web100 • Auto-tunes parameters for non-experts • Network-aware OS • Bulk file transfer for ORNL • Implementation of Floyd's High Speed TCP