260 likes | 525 Views
Transport Layer Enhancements for Unified Ethernet in Data Centers. K. Kant Raj Ramanujan Intel Corp. Exploratory work only, not a committed Intel position. Context. Data center is evolving Fabric should too. Last talk: Enhancements to Ethernet, already on track This talk:
E N D
Transport Layer Enhancements for Unified Ethernet in Data Centers K. Kant Raj Ramanujan Intel Corp Exploratory work only, not a committed Intel position
Context • Data center is evolving Fabric should too. • Last talk: • Enhancements to Ethernet, already on track • This talk: • Enhancements to Transport Layer • Exploratory, not in any standards track.
Outline • Data Center evolution & transport impact • Transport deficiencies & remedies • Many areas of deficiencies … • Only Congestion Control and QoS addressed in detail • Summary & Call to Action
Data Center Today IPC Fabric • Tiered structure • Multiple incompatible fabrics • Ethernet, Fiber Channel, IBA, Myrinet, etc. • Management complexity • Dedicated servers for applications Inflexible resource usage Storage Fabric database query client req/ resp business trans network Fabric SAN storage
Future DC: Stage 1 – Fabric Unification • Enet dominant, but convergence really on IP. • New layer2: PCI-Exp, Optical, WLAN, UWB, … • Most ULP’s run over transport over IP Need to comprehend transport implications iSCSI storage database query client req/ resp business trans
Sub-cluster 2 Storage Nodes Sub-cluster1 Sub-cluster 3 Future DC: Stage 2 – Clustering & Virtualization • SMP Cluster (cost, flexibility, …) • Virtualization • Nodes, network, storage, … Virtual clusters (VC) • Each VC may have multiple traffic types inside Virtual Cluster1 IP ntwk Virtual Cluster 2 Virtual Cluster 3
Future DC: New Usage Models • Dynamically provisioned virtual clusters • Distributed storage (per node) • Streaming traffic (VoIP/IPTV + data services) • HPC in DC • Data mining for focused advertising, pricing, … • Special purpose nodes • Protocol accelerators (XML, authentication, etc.) New models New fabric requirements
Fabric Impact • More types of traffic, more demanding needs. • Protocol impact at all levels • Ethernet: Previous presentation. • IP: Change affects entire infrastructure. • Transport: This talk • Why transport focus? • Change primarily confined to endpoints. • Many app needs relate to transport layer • App. interface (Sockets/RDMA) mostly unchanged. DC evolution Transport evolution
Transport Issues & enhancements • Transport (TCP) enhancement areas • Better Congestion control and QoS • Support media evolution • Support for high availability • Many others • Message based & unordered data delivery. • Connection migration in virtual clusters. • Transport layer multicasting. • How do we enhance transport? • New TCP compatible protocol? • Use an existing protocol (SCTP)? • Evolutionary changes to TCP from DC perspective.
IP MAC App App transport transport IP IP MAC MAC What’s wrong with TCP Congestion control • TCP congestion control (CC) works independently for each connection • By default TCP equalizes throughput undesirable • Sophisticated QoS can change this, but … • Lower level CC Backpressure on transport • Transport layer congestion control is crucial TL cong cntrl Cong feedback ECN/ICMP MAC MAC switch switch router
What’s wrong with QoS? • Elaborate mechanisms • Intserv (RSVP), Diffserv, BW broker, … • … But a nightmare to use • App knowledge, many parameters, sensitivity, … • What do we need? • Simple/intuitive parameters • e.g., streaming or not, normal vs. premium, etc. • Automatic estimation of BW needs. • Application focus, not flow focus! • QoS relevant primarily under congestion Fix TCP congestion control, use IP QoS sparingly.
TCP Congestion Control Enhancements • Collective control of all flows of an app • Applicable to both TCP & UDP • Ensures proportional fairness of multiple inter-relatedflows • Tagging of connections to identify related flows. • Packet loss highly undesirable in DC • Move towards a delay based TCP variant. • Multilevel Coordination • Socket vs. RDMA apps, TCP vs. UDP, … • A layer above transport for coordination
Collective Congestion Control Cong. Control • Control connections thru a congested device together (control set) • Determining control set is challenging • BW requirement estimated automatically during non-congested periods SW0 S23 S13 SW2 SW1 S21 S11 CL2 CL1
Sample Collective Control • App 1: client1 server1 • Database queries over a single connection Drives ~5.0 Mb/s BW • App2: client2 server1 • Similar to App1 Drives 2.5 Mb/s BW • App 3: client3 server2 • FTP, starts at t=30 secs 25 conn. 8 Mb/s
Sample Results Cong. Control • Modified TCP can maintain 2:1 throughput ratio • Also yields lower losses & smaller RTT. Collective control highly desirable within a DC
Adaptation to Media • Problem: TCP assumes loss congestion, and designed for WAN (high loss/delay) • Effects: • Wireless (e.g. UWB) attractive in DC (wiring reduction, mobility, self configuration). • … but TCP is not a suitable transport. • Overkill for communications within a DC. • Solution:A self-adjusting transport • Support multiple congestion/flow-control regimes. • Automatically selected during connection setup.
Path 1 A B Path 2 High Availability Issues • Problem: Single failure broken connection, weak robustness check, … • Effect: Difficult to achieve high availability. • Solution: • Multi-homed connections w/ load sharing among paths. • Ideally, controlled diversity & path management • Difficult: need topology awareness, spanning tree problem,
Summary & call to action • Data Centers are evolving • Transport must evolve too, but a difficult proposition • TCP is heavily entrenched, change needs an industry wide effort • Call to Action • Need to get an industry effort going to define • New features & their implementation • Deployment & compatibility issues. • Change will need push from data center administrators & planners.
Additional Resources • Presentation can be downloaded from the IDF web site – when prompted enter: • Username: idf • Password: fall2005 • Additional backup slides • Several relevant papers available at http://kkant.ccwebhost.com/download.html • Analysis of collective bandwidth control. • SCTP performance in data centers.
Comparative Fabric Features DC requirements TCP lacks many desirable features; SCTP has some
Transport Layer QoS Inter-app Web app • Needed at multiple levels • Between transport uses • Conn. of a given transport • Logical streams DB App • May be on two VM’s on • same physical machine. ntwk IPC page iSCSI Intra-app • Best BW subdivision to maximize performance? text images cntrl data Intra-conn Intra-conn • Requirements • Must be compatible with lower level QoS • PCI-Exp, MAC, etc. • Automatic estimation of bandwidth requirements • Automatic BW control
Multicasting in DC • Software/patch distribution • Multicast to all machines w/ same version. • Characteristics • Medium to large file transfer • Time to finish matters, BW doesn’t. • Scale: 10s to 1000s. • High performance computing • MPI collectives need multicasting • Characteristics • Small but frequent transfers • Latency premium, BW not an issue mostly. • Scale: 10s to 100’s
IP multicasting A A subnet2 subnet1 subnet2 subnet1 outer router outer router Transport layer multicasting TL multicasting
A B subnet2 subnet1 D outer router C subnet3 subnet4 TL multicasting value • Assumptions • A 16 node cluster w/ 4-node subclusters. • Mcast group: 2 nodes in each sub-cluster • Latencies: • endpt: 2 us, ack proc: 1 us, switch: 1 us • App-TL interface: 5 us • Latency w/o mcast • send: 7x2 + 3x1 + 2 = 19 us • ack: 1 + 3x1 + 7x1 = 11 us • reply: 5 + 2 + 7x2 = 21 us • Total: 19+11+21 = 51 us • Latency w/ mcast • send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us • ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us • Total = 17 + 10 + 5 = 32 us • Larger savings in full network mcast.
A subnet2 subnet1 outer router subnet3 subnet4 A S4 n2 n1 S2 S3 n2 n1 n2 n2 n1 n1 Hierarchical Connections • Choose a “leader” in each subnet. • Topology directed • Multicast connections to others nodes via leaders • Ack consolidation at leaders (multicast) • Msg consolidation at leaders (reverse multicast) • Done by a layer above? (layer 4.5?)