This presentation discusses potential problems in gigabit networks: hardware, driver, and OS issues (NIC drivers, device management, redundant copies, device polling, zero-copy TCP), protocol stack overhead, TCP stability and utilization, and related experiments and measurements.
Towards Gigabit • David Wei, Netlab@Caltech • For FAST Meeting, July.2
Potential Problems • Hardware / Driver / OS • Protocol Stack Overhead • Scalability of the Protocol Specification • TCP Stability / Utilization (New Congestion Control Algorithm) • Related Experiments & Measurements
Hardware / Drivers / OS • NIC Driver • Device Management (Interrupts) • Redundant Copies • Device Polling (http://info.iet.unipi.it/~luigi/polling/) • Zero-Copy TCP • … www.cs.duke.edu/ari/publications/talks/freebsdcon
Device Polling • Current process for a NIC driver in FreeBSD: • Packet arrives at the NIC • NIC raises a hardware interrupt • CPU jumps to the interrupt handler for that NIC • The MAC-layer handler reads data from the NIC into a queue • Upper layers process the data in the queue (at lower priority) • Drawback: the CPU takes an interrupt for every packet (context switching); interrupts become too frequent for a high-speed device • Live-lock: the CPU is so busy servicing NIC interrupts that it never gets to process the data already in the queue.
Device Polling Device Polling: • Polling: the CPU checks the device when it has time. • Scheduling: the user specifies a time ratio for the CPU to spend on device work versus non-device processing. Advantages: • Balances device service against non-device processing • Improves performance for fast devices
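A minimal sketch, not from the slides, contrasting per-packet interrupts with budgeted polling; the queue names, the POLL_BUDGET value, and the idea of a fixed per-poll budget are illustrative assumptions rather than details of the FreeBSD implementation.

```python
import collections

nic_rx_queue = collections.deque()   # packets DMA'd by the NIC (hypothetical)
app_queue = collections.deque()      # packets waiting for upper-layer processing

POLL_BUDGET = 64                     # max packets drained per poll (assumed value)

def interrupt_model(packet):
    """Interrupt-driven: every arriving packet preempts the CPU."""
    app_queue.append(packet)         # one context switch + handler run per packet

def polling_model():
    """Polling: drain at most POLL_BUDGET packets, then yield the CPU."""
    for _ in range(min(POLL_BUDGET, len(nic_rx_queue))):
        app_queue.append(nic_rx_queue.popleft())
    # Remaining packets wait for the next poll tick, so upper layers and user
    # processes are never starved (this is what avoids receive live-lock).
```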
Protocol Stack Overhead Per-packet overhead: • Ethernet header / checksum • IP header / checksum • TCP header / checksum • Copying / interrupt processing Solution: increase the packet size • Optimal packet size = min{MTU along the path} (fragmentation also hurts performance)
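A back-of-the-envelope calculation, not from the slides, of how the fixed per-packet header cost shrinks as packets grow; the MTU values are just common examples (minimum default, standard Ethernet, jumbo frames, IP maximum).

```python
# Fixed per-packet overhead: Ethernet framing (14 B header + 4 B CRC) outside
# the MTU, plus the IP (20 B) and TCP (20 B) headers inside it.
ETH_FRAMING = 14 + 4
IP_TCP_HEADERS = 40

for mtu in (576, 1500, 9000, 65535):
    payload = mtu - IP_TCP_HEADERS      # application bytes carried per packet
    wire = mtu + ETH_FRAMING            # bytes actually sent on the link
    print(f"MTU {mtu:>5}: payload {payload:>5} B, header overhead {1 - payload / wire:.1%}")
```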
Path MTU Discovery (RFC 1191) Current method: • "Don't Fragment" bit (router: drop or fragment; host: test and enforce) • MTU = min{576, first-hop MTU} • MSS = MTU - 40 • MTU <= 65535 (architectural limit) • MSS <= 65495 (avoids IP sign-bit bugs…) • Drawback: usually too small
Path MTU Discovery • How to discover the PMTU? Current: • Search (proportional decrease / binary search) • Update (periodic increase, reset to the MTU of the first hop) Proposed: • Search/update over a table of typical MTU values • Routers: include a suggested MTU in the "Datagram Too Big" (DTB) message that reports the DF packet drop.
Path MTU Discovery Implementation Host: • Packetization layer (TCP, or a connection over UDP): sets DF and chooses the packet size • IP: stores the PMTU for each known path (routing table) • ICMP: handles the "Datagram Too Big" message Router: • Sends an ICMP "Datagram Too Big" message when a DF datagram exceeds the next-hop MTU Implementation problems: • RFC 2923
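A minimal sketch of the host side in practice, assuming Linux: IP_MTU_DISCOVER / IP_PMTUDISC_DO ask the kernel to set DF and run PMTU discovery, and IP_MTU (value 14 on Linux, not always exported by Python) reads back the cached path MTU; the destination "example.org" is only a placeholder.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the kernel to set DF and perform path MTU discovery on this socket
# (Linux-only constants, so guard their presence).
if hasattr(socket, "IP_MTU_DISCOVER"):
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER, socket.IP_PMTUDISC_DO)

s.connect(("example.org", 80))        # placeholder destination

# Read back the kernel's cached PMTU for this path (Linux option IP_MTU = 14).
IP_MTU = getattr(socket, "IP_MTU", 14)
try:
    print("kernel-cached path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
except OSError:
    print("IP_MTU not supported on this platform")
s.close()
```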
Scalability of Protocol Specifications • Window size space (<= 64 KB) • Sequence number space (wrap-around, <= 2G) • Inadequate frequency of RTT sampling (1 sample per window)
Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < |sequence number space| / bandwidth
Sequence Number Space • MSL (Maximum Segment Lifetime) > variance of IP delay • MSL < 8 * |sequence number space| / bandwidth • |SN space| = 2^31 = 2 GB • Bandwidth = 1 Gbps • MSL <= 16 s • Variance of IP delay <= 16 s • Current TCP: 3 min • Not scalable with bandwidth growth
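A quick check of the arithmetic above, not from the slides, showing how the wrap time shrinks as the line rate grows (which is why a multi-minute MSL stops being safe).

```python
SEQ_SPACE_BYTES = 2 ** 31   # half the 32-bit sequence space, as on the slide

for gbps in (0.01, 0.1, 1, 10):
    wrap_seconds = SEQ_SPACE_BYTES * 8 / (gbps * 1e9)
    print(f"{gbps:>5} Gbps: sequence space wraps in {wrap_seconds:8.1f} s")
```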
TCP Extensions (RFC 1323) • Window scaling: a scale factor S carried in the SYN; effective window = [16-bit window field] * 2^S • RTT measurement: a timestamp on every packet (generated by the sender, echoed by the receiver) • PAWS (Protect Against Wrapped Sequence numbers): uses the timestamp to extend the sequence space (so the timestamp clock must be neither too fast nor too slow: about 1 ms to 1 s per tick) • Header prediction: simplifies per-packet processing
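A small sketch, with made-up numbers, of the two arithmetic ideas above: the scale factor expanding the 16-bit window field, and a PAWS-style timestamp comparison rejecting a stale segment.

```python
def effective_window(window_field: int, scale: int) -> int:
    """RFC 1323 window scaling: the advertised 16-bit field is shifted left by S."""
    assert 0 <= window_field <= 0xFFFF and 0 <= scale <= 14
    return window_field << scale

def paws_reject(segment_tsval: int, last_tsval_seen: int) -> bool:
    """PAWS idea: drop a segment whose timestamp is older (mod 2**32) than the
    newest timestamp already accepted on this connection."""
    diff = (last_tsval_seen - segment_tsval) & 0xFFFFFFFF
    return 0 < diff < 2 ** 31

print(effective_window(0xFFFF, 10))                          # ~64 MB window at S = 10
print(paws_reject(segment_tsval=100, last_tsval_seen=5000))  # True: stale duplicate
```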
High Speed TCP (Floyd '02). Goals: • Achieve a large window with a realistic loss rate (use the current window size in the AIMD parameters) • High speed in a single connection (10 Gbps) • Easy to reach a high sending rate for a given loss rate • How to achieve TCP-friendliness? • Incrementally deployable (no router support required)
High Speed TCP Problem in steady state: • TCP response function: w ≈ 1.2 / √p segments per RTT • A large congestion window therefore requires a very low loss rate Problem in recovery: • Congestion avoidance takes too long to recover (consecutive time-outs)
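A worked example, with assumed numbers (1500-byte packets, 100 ms RTT), of why the standard response function cannot reach the 10 Gbps goal at any realistic loss rate.

```python
# Standard TCP response function: w ≈ 1.2 / sqrt(p)   (segments per RTT)
target_rate_bps = 10e9       # the slide's 10 Gbps goal
packet_bits = 1500 * 8       # assumed packet size
rtt_s = 0.1                  # assumed round-trip time

w = target_rate_bps * rtt_s / packet_bits    # window needed, in segments
p = (1.2 / w) ** 2                           # loss rate that window requires
print(f"window ≈ {w:,.0f} segments, required loss rate ≈ {p:.1e}")
# roughly 83,000 segments and a loss rate near 2e-10: unrealistically low
```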
High Speed TCP Change the TCP response function: • p high (above the maxP corresponding to the default cwnd threshold W): standard TCP • p low (cwnd >= W): use a(w), b(w) instead of the constants a, b when adjusting cwnd • For a given loss rate P and desired window size W1 at P: derive a(w) and b(w), keeping the response function linear on a log-log scale (∆ log W proportional to ∆ log P)
Change TCP Function • Standard TCP: a = 1 (cwnd grows by one segment per RTT), b = 1/2 (cwnd is halved on a loss)
Expectations • Achieve a large window with a realistic loss rate • Relative fairness between standard TCP and High Speed TCP (acquired bandwidth grows with cwnd) • A moderate decrease, instead of halving the window, when congestion is detected (about 0.33 at a window of 1000) • A pre-computed look-up table to implement a(w) and b(w)
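A sketch of such a pre-computed table, using anchor points taken from Floyd's HighSpeed TCP proposal (standard behaviour below a window of 38 at p ≈ 10^-3; a window of 83,000 at p = 10^-7 with decrease factor 0.1); those anchors are assumptions borrowed from the proposal, not values stated on the slides.

```python
import math

LOW_W, LOW_P, LOW_B = 38.0, 1e-3, 0.5        # below this window, behave as standard TCP
HIGH_W, HIGH_P, HIGH_B = 83000.0, 1e-7, 0.1  # target regime for 10 Gbps-class windows

def _frac(w):
    """Position of w between the two anchor windows, on a log scale."""
    return (math.log(w) - math.log(LOW_W)) / (math.log(HIGH_W) - math.log(LOW_W))

def b_hs(w):
    """Decrease factor b(w): interpolated linearly in log(w) between 0.5 and 0.1."""
    return 0.5 if w <= LOW_W else LOW_B + _frac(w) * (HIGH_B - LOW_B)

def a_hs(w):
    """Increase a(w), chosen so the response function is a straight line on a
    log-log plot through the two anchors: a = w^2 * p(w) * 2b / (2 - b)."""
    if w <= LOW_W:
        return 1.0
    p = math.exp(math.log(LOW_P) + _frac(w) * (math.log(HIGH_P) - math.log(LOW_P)))
    return w * w * p * 2.0 * b_hs(w) / (2.0 - b_hs(w))

# Pre-computed look-up table, as the slide suggests (illustrative granularity):
table = {w: (round(a_hs(w), 1), round(b_hs(w), 2)) for w in (38, 1000, 10000, 83000)}
print(table)   # b(1000) comes out near 0.33, matching the slide's example
```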
Slow Start Modification of slow start: • Problem: doubling cwnd every RTT is too aggressive once cwnd is large • Proposal: limit ∆cwnd per RTT during slow start
Limited Slow Start For each ACK: • cwnd <= max_ssthresh: ∆cwnd = MSS (standard TCP slow start) • cwnd > max_ssthresh: ∆cwnd = 0.5 * max_ssthresh / cwnd (at most 0.5 * max_ssthresh per RTT)
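A small sketch of the per-ACK rule above, in segment units; max_ssthresh = 100 is only an example value (the limited slow-start proposal suggests something on that order), and one ACK per segment with no delayed ACKs is assumed.

```python
MAX_SSTHRESH = 100.0   # example threshold in segments (assumed, not from the slide)

def on_ack(cwnd):
    """Limited slow start: cwnd growth per ACK, in segment units."""
    if cwnd <= MAX_SSTHRESH:
        return cwnd + 1.0                         # standard slow start: +1 MSS per ACK
    return cwnd + 0.5 * MAX_SSTHRESH / cwnd       # capped: <= max_ssthresh/2 per RTT

# One RTT's worth of ACKs starting from a large window:
cwnd = 400.0
for _ in range(int(cwnd)):
    cwnd = on_ack(cwnd)
print(round(cwnd, 1))   # grows by roughly 50 segments in the RTT, not by 400
```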
Related Projects • Cray Research ('92) • CASA Testbed ('94) • Duke ('99) • Pittsburgh Supercomputing Center • Portland State Univ. ('00) • Internet2 ('01) • Web100 • Net100 (built on Web100)
Cray Research '92 • TCP/IP Performance at Cray Research (Dave Borman) Configuration: • HIPPI between two dedicated Y-MPs with Model E IOS and UNICOS 8.0 • Memory-to-memory transfer Results: • Direct channel-to-channel: MTU 64K, 781 Mbps • Through a HIPPI switch: MTU 33K, 416 Mbps; MTU 49K, 525 Mbps; MTU 64K, 605 Mbps
CASA Testbed '94 Applied Network Research, San Diego Supercomputer Center + UCSD • Goal: delay and loss characteristics of a HIPPI-based gigabit testbed • Link feature: blocking (HIPPI), a tradeoff between high loss rate and high delay • Conclusion: avoiding packet loss is more important than reducing delay • Performance (delay * bandwidth = 2 MB; RFC 1323 on; Cray machines): 500 Mbps sustained TCP throughput (TTCP/Netperf)
Trapeze/IP (Duke) Goals: • Which optimizations are most useful for reducing host overheads for fast TCP? • How fast does TCP really go, and at what cost? Approaches: • Zero-copy • Checksum offloading Result: • >900 Mbps for MTU > 8K
Trapeze/IP (Duke) • Zero-copy architecture (figure) www.cs.duke.edu/ari/publications/talks/freebsdcon
Enabling High Performance Data Transfers on Hosts (Pittsburgh Supercomputing Center) • Enable RFC 1191 path MTU discovery • Enable RFC 1323 large windows • OS kernel: large enough socket buffers • Application: set its send and receive socket buffer sizes Detailed methods for tuning various OSes.
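A minimal sketch of the application-side step above: sizing socket buffers to the bandwidth-delay product; the 1 Gbps / 50 ms figures are made-up example values, and whether the kernel honors the request depends on its own limits (the kernel tuning step above).

```python
import socket

# Bandwidth-delay product: the buffer a single connection needs to keep the
# pipe full.  Example values, not measurements from the talk.
bandwidth_bps = 1e9          # 1 Gbps path
rtt_s = 0.05                 # 50 ms round trip
bdp_bytes = int(bandwidth_bps * rtt_s / 8)   # ~6.25 MB

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)

# The kernel may clamp the request to its configured maximum, so read it back:
print("send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```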
PSU Experiment Goal: • Round-trip delay and TCP throughput with different window sizes • Influence of different switches (Cisco 3508/3524/5500) and different NICs Environment: • OS: FreeBSD 4.0/4.1 (without RFC 1323?), Linux, Solaris • WAN: 155 Mbps OC-3 over a SONET MAN • Measurement tools: ping + TTCP
PSU Experiment • "Smaller" switches and low-end routers can easily muck things up • Bugs in Linux 2.2 kernels • Different NICs have different performance • A fast PCI bus (64 bits * 66 MHz) is necessary • Switch MTU size can make a difference (giant packets are better) • Bigger TCP window sizes can help, but there seems to be a knee around 4 MB that is not remarked upon in the literature
Internet-2 Experiment Goal: a single TCP connection at 700-800 Mbps over the WAN; relations among window size, MTU, and throughput • OS: FreeBSD 4.3-RELEASE • Architecture: 64-bit, 66 MHz PCI + … • Configuration: sendspace = recvspace = 102400 • Setup: direct connection (back-to-back) and WAN • WAN: symmetric path: host1-Abilene-host2 • Measurement: ping + Iperf
Internet-2 Experiment Back-to-back: • No loss • Found some bugs in FreeBSD 4.3 WAN: • <= 200 Mbps • Asymmetry between the two directions (MTU caching…)
Web100 • Goal: make it easy for non-experts to achieve high bandwidth • Method: get more information out of TCP • Software: measurement instrumentation embedded in the kernel TCP; application layer: diagnostics / auto-tuning • Proposal: RFC 2012 (TCP MIB)
Net100 • Built on Web100 • Auto-tunes parameters for non-experts • Network-aware OS • Bulk file transfer for ORNL • Implementation of Floyd's High Speed TCP