
Transport Layer Enhancements for Unified Ethernet in Data Centers


Presentation Transcript


1. Transport Layer Enhancements for Unified Ethernet in Data Centers
  K. Kant, Raj Ramanujan, Intel Corp.
  Exploratory work only, not a committed Intel position.

2. Context
  • Data center is evolving → Fabric should too.
  • Last talk:
    • Enhancements to Ethernet, already on track.
  • This talk:
    • Enhancements to the Transport Layer.
    • Exploratory, not in any standards track.

3. Outline
  • Data Center evolution & transport impact
  • Transport deficiencies & remedies
    • Many areas of deficiencies …
    • Only Congestion Control and QoS addressed in detail
  • Summary & Call to Action

4. Data Center Today
  • Tiered structure
  • Multiple incompatible fabrics
    • Ethernet, Fiber Channel, IBA, Myrinet, etc.
    • Management complexity
  • Dedicated servers for applications → Inflexible resource usage
  [Diagram: clients issue requests/responses over a network fabric to business transaction and database query tiers (IPC fabric), backed by SAN storage over a storage fabric]

5. Future DC: Stage 1 – Fabric Unification
  • Enet dominant, but convergence really on IP.
  • New layer 2s: PCI-Exp, Optical, WLAN, UWB, …
  • Most ULPs run over transport over IP → Need to comprehend transport implications
  [Diagram: client request/response, business transaction, database query, and iSCSI storage traffic converged onto one fabric]

6. Future DC: Stage 2 – Clustering & Virtualization
  • SMP → Cluster (cost, flexibility, …)
  • Virtualization
    • Nodes, network, storage, … → Virtual clusters (VC)
    • Each VC may have multiple traffic types inside
  [Diagram: sub-clusters 1–3 and storage nodes connected by an IP network, grouped into Virtual Clusters 1–3]

7. Future DC: New Usage Models
  • Dynamically provisioned virtual clusters
  • Distributed storage (per node)
  • Streaming traffic (VoIP/IPTV + data services)
  • HPC in DC
    • Data mining for focused advertising, pricing, …
  • Special purpose nodes
    • Protocol accelerators (XML, authentication, etc.)
  New models → New fabric requirements

8. Fabric Impact
  • More types of traffic, more demanding needs.
  • Protocol impact at all levels
    • Ethernet: Previous presentation.
    • IP: Change affects entire infrastructure.
    • Transport: This talk.
  • Why transport focus?
    • Change primarily confined to endpoints.
    • Many app needs relate to the transport layer.
    • App. interface (Sockets/RDMA) mostly unchanged.
  DC evolution → Transport evolution

9. Transport Issues & Enhancements
  • Transport (TCP) enhancement areas
    • Better congestion control and QoS
    • Support media evolution
    • Support for high availability
    • Many others
      • Message based & unordered data delivery.
      • Connection migration in virtual clusters.
      • Transport layer multicasting.
  • How do we enhance transport?
    • New TCP-compatible protocol?
    • Use an existing protocol (SCTP)?
    • Evolutionary changes to TCP from a DC perspective.

10. What’s wrong with TCP congestion control?
  • TCP congestion control (CC) works independently for each connection →
    • By default TCP equalizes throughput → undesirable
    • Sophisticated QoS can change this, but …
  • Lower level CC → Backpressure on transport
  • Transport layer congestion control is crucial
  [Diagram: two hosts (App/transport/IP/MAC) connected through switches and a router; congestion feedback (ECN/ICMP) flows back to transport-layer congestion control at the endpoints]

11. What’s wrong with QoS?
  • Elaborate mechanisms
    • Intserv (RSVP), Diffserv, BW broker, …
  • … But a nightmare to use
    • App knowledge, many parameters, sensitivity, …
  • What do we need?
    • Simple/intuitive parameters
      • e.g., streaming or not, normal vs. premium, etc.
    • Automatic estimation of BW needs.
    • Application focus, not flow focus!
  • QoS relevant primarily under congestion → Fix TCP congestion control, use IP QoS sparingly.
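As a concrete reading of the "simple/intuitive parameters" and "automatic estimation of BW needs" bullets, here is a minimal Python sketch. The AppQoS descriptor and estimate_bandwidth_bps helper are hypothetical names invented for illustration, not an API proposed in the talk:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ServiceTier(Enum):
    NORMAL = "normal"
    PREMIUM = "premium"


@dataclass
class AppQoS:
    """Illustrative per-application QoS descriptor: a few intuitive knobs
    instead of full Intserv/Diffserv parameter sets."""
    streaming: bool = False                    # latency/jitter sensitive?
    tier: ServiceTier = ServiceTier.NORMAL     # normal vs. premium
    est_bandwidth_bps: Optional[float] = None  # None = estimate automatically


def estimate_bandwidth_bps(bytes_sent: int, interval_s: float) -> float:
    """Automatic BW estimation: observe the send rate during uncongested periods."""
    return 8.0 * bytes_sent / interval_s


# Example: a premium streaming app whose bandwidth need is measured, not declared.
qos = AppQoS(streaming=True, tier=ServiceTier.PREMIUM)
qos.est_bandwidth_bps = estimate_bandwidth_bps(bytes_sent=6_250_000, interval_s=10.0)
print(qos)  # est_bandwidth_bps == 5,000,000 (5 Mb/s)
```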

12. TCP Congestion Control Enhancements
  • Collective control of all flows of an app
    • Applicable to both TCP & UDP
    • Ensures proportional fairness of multiple inter-related flows
    • Tagging of connections to identify related flows.
  • Packet loss highly undesirable in DC
    • Move towards a delay-based TCP variant (see the sketch below).
  • Multilevel coordination
    • Socket vs. RDMA apps, TCP vs. UDP, …
    • A layer above transport for coordination
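To make the "delay-based TCP variant" bullet concrete, here is a minimal Vegas-style window-update sketch. It illustrates the general class of algorithm, not the specific variant the authors have in mind; the function name and the alpha/beta thresholds are assumptions:

```python
def delay_based_update(cwnd: float, base_rtt: float, current_rtt: float,
                       alpha: float = 2.0, beta: float = 4.0) -> float:
    """Vegas-style window update: estimate packets queued in the network
    from RTT inflation and adjust cwnd before any loss occurs.

    cwnd        -- congestion window in packets
    base_rtt    -- minimum observed RTT (propagation estimate), seconds
    current_rtt -- latest smoothed RTT sample, seconds
    alpha, beta -- target range for packets queued in the network
    """
    expected_rate = cwnd / base_rtt                     # rate if no queueing
    actual_rate = cwnd / current_rtt                    # rate actually achieved
    queued = (expected_rate - actual_rate) * base_rtt   # est. packets in queues

    if queued < alpha:    # queues nearly empty: probe for more bandwidth
        return cwnd + 1.0
    if queued > beta:     # queues building up: back off before loss happens
        return cwnd - 1.0
    return cwnd           # within the target band: hold steady


# Example: a 10 us-scale DC path whose base RTT of 10 ms units would be far
# smaller in practice; here a 10 ms base RTT inflated to 15 ms by queueing.
print(delay_based_update(cwnd=40.0, base_rtt=0.010, current_rtt=0.015))
```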

13. Collective Congestion Control
  • Control connections thru a congested device together (control set)
  • Determining the control set is challenging
  • BW requirement estimated automatically during non-congested periods
  [Diagram: clients CL1, CL2 and servers S11, S13, S21, S23 connected through switches SW0, SW1, SW2]
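A toy sketch of the allocation step, under the assumption that connections are tagged with an application ID, the connections crossing the congested device form the control set, and per-application demands were estimated during non-congested periods. The function and data layout are invented for illustration; the demand figures echo the sample scenario on the next slide:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each connection is tagged (app_id, conn_id) and carries a demand (bit/s)
# estimated while the path was uncongested.
Connection = Tuple[str, str, float]


def collective_allocation(control_set: List[Connection],
                          bottleneck_bps: float) -> Dict[str, float]:
    """Split the congested device's capacity across *applications* (not
    individual flows) in proportion to their estimated demands."""
    demand_per_app: Dict[str, float] = defaultdict(float)
    for app_id, _conn_id, demand_bps in control_set:
        demand_per_app[app_id] += demand_bps

    total_demand = sum(demand_per_app.values())
    if total_demand <= bottleneck_bps:
        return dict(demand_per_app)           # no congestion: everyone gets demand

    scale = bottleneck_bps / total_demand     # proportional fairness across apps
    return {app: demand * scale for app, demand in demand_per_app.items()}


# Example control set through one congested switch: app1 estimated at 5 Mb/s
# (one connection), app2 at 2.5 Mb/s, app3 an FTP-like app with 25 connections.
control_set = [("app1", "c1", 5.0e6), ("app2", "c2", 2.5e6)] + \
              [("app3", f"f{i}", 0.32e6) for i in range(25)]
print(collective_allocation(control_set, bottleneck_bps=10.0e6))
```

With a hypothetical 10 Mb/s bottleneck, the 5.0, 2.5 and 8 Mb/s demands are scaled uniformly, so app1 and app2 keep the 2:1 ratio reported on the results slide regardless of how many connections app3 opens.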

14. Sample Collective Control
  • App 1: client1 → server1
    • Database queries over a single connection → Drives ~5.0 Mb/s BW
  • App 2: client2 → server1
    • Similar to App 1 → Drives 2.5 Mb/s BW
  • App 3: client3 → server2
    • FTP, starts at t=30 secs → 25 conn. → 8 Mb/s

15. Sample Results
  • Modified TCP can maintain the 2:1 throughput ratio (App 1 vs. App 2)
  • Also yields lower losses & smaller RTT.
  Collective control highly desirable within a DC

16. Adaptation to Media
  • Problem: TCP assumes loss → congestion, and is designed for WAN (high loss/delay)
  • Effects:
    • Wireless (e.g. UWB) attractive in DC (wiring reduction, mobility, self configuration) …
    • … but TCP is not a suitable transport.
    • Overkill for communications within a DC.
  • Solution: A self-adjusting transport
    • Support multiple congestion/flow-control regimes.
    • Automatically selected during connection setup (see the sketch below).
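A sketch of what regime selection at connection setup might look like. The regime names and the selection rules (wireless hop, intra-DC RTT threshold) are illustrative assumptions, not part of the proposal:

```python
from enum import Enum


class CCRegime(Enum):
    DC_DELAY_BASED = "delay-based; loss is not treated as congestion"
    WIRELESS_AWARE = "distinguish link-error loss from congestion loss"
    WAN_LOSS_BASED = "classic loss-based TCP behaviour"


def select_regime(intra_dc: bool, wireless_hop: bool, rtt_ms: float) -> CCRegime:
    """Pick a congestion/flow-control regime during connection setup.

    intra_dc     -- both endpoints inside the data center
    wireless_hop -- path includes a wireless (e.g. UWB) link
    rtt_ms       -- handshake round-trip time in milliseconds
    """
    if wireless_hop:
        # Loss is often link noise, not congestion: don't collapse the window.
        return CCRegime.WIRELESS_AWARE
    if intra_dc and rtt_ms < 1.0:
        # Low-delay, low-loss fabric: classic WAN machinery is overkill.
        return CCRegime.DC_DELAY_BASED
    return CCRegime.WAN_LOSS_BASED


print(select_regime(intra_dc=True, wireless_hop=False, rtt_ms=0.2))
```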

17. High Availability Issues
  • Problem: Single failure → broken connection, weak robustness check, …
  • Effect: Difficult to achieve high availability.
  • Solution:
    • Multi-homed connections w/ load sharing among paths.
    • Ideally, controlled diversity & path management.
    • Difficult: needs topology awareness, spanning tree problem, …
  [Diagram: endpoints A and B connected by two disjoint paths, Path 1 and Path 2]
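A minimal sketch of multi-homed load sharing: traffic is spread over live paths by weight, and a failed path's share is reassigned instead of the connection breaking. This only illustrates the idea; the path weights, liveness detection, and API are assumptions (SCTP-style multi-homing is one existing point of comparison):

```python
from typing import Dict


def share_load(paths: Dict[str, float], up: Dict[str, bool]) -> Dict[str, float]:
    """Return the fraction of traffic to place on each path.

    paths -- path name -> configured weight (e.g. capacity)
    up    -- path name -> liveness, as seen by heartbeats/robustness checks
    """
    live = {p: w for p, w in paths.items() if up.get(p, False)}
    if not live:
        raise RuntimeError("all paths down: connection cannot survive")
    total = sum(live.values())
    return {p: w / total for p, w in live.items()}


paths = {"path1": 10.0, "path2": 10.0}
print(share_load(paths, up={"path1": True, "path2": True}))   # 50/50 split
print(share_load(paths, up={"path1": False, "path2": True}))  # survives path1 failure
```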

18. Summary & Call to Action
  • Data Centers are evolving
    • Transport must evolve too, but a difficult proposition
    • TCP is heavily entrenched; change needs an industry-wide effort
  • Call to Action
    • Need to get an industry effort going to define
      • New features & their implementation
      • Deployment & compatibility issues.
    • Change will need a push from data center administrators & planners.

19. Additional Resources
  • Presentation can be downloaded from the IDF web site – when prompted enter:
    • Username: idf
    • Password: fall2005
  • Additional backup slides
  • Several relevant papers available at http://kkant.ccwebhost.com/download.html
    • Analysis of collective bandwidth control.
    • SCTP performance in data centers.

  20. Backup

21. Comparative Fabric Features
  [Table: DC requirements vs. features of candidate fabrics/transports]
  TCP lacks many desirable features; SCTP has some

22. Transport Layer QoS
  • Needed at multiple levels
    • Between transport uses (inter-app)
      • e.g., a Web app and a DB app, which may be on two VMs on the same physical machine
    • Between connections of a given transport (intra-app)
    • Between logical streams within a connection (intra-conn)
    • Best BW subdivision to maximize performance?
  • Requirements
    • Must be compatible with lower level QoS (PCI-Exp, MAC, etc.)
    • Automatic estimation of bandwidth requirements
    • Automatic BW control
  [Diagram: inter-app QoS between a Web app and a DB app; intra-app QoS among the DB app's network IPC, page, and iSCSI connections; intra-connection QoS among text, images, control, and data streams]
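One way to picture "best BW subdivision" across the three levels is a weighted split down a tree: inter-app, then intra-app, then intra-connection. The hierarchy and weights below are made up for illustration; in the proposal the weights would come from automatic estimation:

```python
from typing import Dict, Union

# A node is either a leaf weight or a dict of child name -> node.
Node = Union[float, Dict[str, "Node"]]

# Inter-app -> intra-app -> intra-connection hierarchy (illustrative weights).
hierarchy: Dict[str, Node] = {
    "web_app": {"page": {"text": 1.0, "images": 3.0}, "ntwk_ipc": 1.0},
    "db_app": {"iscsi": 2.0, "ntwk_ipc": 1.0},
}


def _weight(node: Node) -> float:
    """Weight of a subtree = its leaf weight, or the sum of its children."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(_weight(child) for child in node.values())


def subdivide(node: Node, bw_bps: float, prefix: str = "") -> Dict[str, float]:
    """Recursively split bw_bps among children in proportion to their weights."""
    if isinstance(node, (int, float)):
        return {prefix: bw_bps}
    total = sum(_weight(child) for child in node.values())
    out: Dict[str, float] = {}
    for name, child in node.items():
        out.update(subdivide(child, bw_bps * _weight(child) / total,
                             f"{prefix}/{name}" if prefix else name))
    return out


for leaf, bw in subdivide(hierarchy, 10.0e9).items():
    print(f"{leaf}: {bw / 1e9:.2f} Gb/s")
```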

23. Multicasting in DC
  • Software/patch distribution
    • Multicast to all machines w/ same version.
    • Characteristics
      • Medium to large file transfer
      • Time to finish matters, BW doesn't.
      • Scale: 10s to 1000s.
  • High performance computing
    • MPI collectives need multicasting
    • Characteristics
      • Small but frequent transfers
      • Latency at a premium; BW mostly not an issue.
      • Scale: 10s to 100s

24. IP Multicasting vs. Transport Layer Multicasting
  [Diagram: node A multicasting to subnet1 and subnet2 through an outer router, shown once with IP multicasting and once with transport-layer (TL) multicasting]

25. TL Multicasting Value
  • Assumptions
    • A 16-node cluster w/ 4-node sub-clusters.
    • Mcast group: 2 nodes in each sub-cluster
    • Latencies: endpt: 2 us, ack proc: 1 us, switch: 1 us
    • App-TL interface: 5 us
  • Latency w/o mcast
    • send: 7x2 + 3x1 + 2 = 19 us
    • ack: 1 + 3x1 + 7x1 = 11 us
    • reply: 5 + 2 + 7x2 = 21 us
    • Total: 19 + 11 + 21 = 51 us
  • Latency w/ mcast
    • send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
    • ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
    • Total = 17 + 10 + 5 = 32 us
  • Larger savings in full network mcast.
  [Diagram: nodes A, B, C, D in subnet1–subnet4 connected through an outer router]
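The slide's comparison can be re-derived in a few lines of Python; the script below simply re-computes the 51 us and 32 us totals from the stated per-element latencies, hard-coding the slide's hop counts rather than modeling the topology:

```python
# Per-element latencies from the slide (microseconds): endpoint processing 2,
# ack processing 1, switch hop 1, app <-> transport-layer interface crossing 5.
ENDPT, SWITCH, APP_TL = 2, 1, 5

# Without transport-layer multicast (terms exactly as given on the slide).
send = 7 * ENDPT + 3 * SWITCH + 2               # 19 us
ack = 1 + 3 * 1 + 7 * 1                         # 11 us
reply = APP_TL + 2 + 7 * ENDPT                  # 21 us
print("w/o mcast:", send + ack + reply, "us")   # 51 us

# With transport-layer multicast (leaders fan out; acks are consolidated).
send_m = 3 * ENDPT + 3 * SWITCH + 2 + 2 * (1 + 1) + 2   # 17 us
ack_m = 1 + 1 + 2 * 1 + 3 * 1 + 3 * 1                   # 10 us
print("with mcast:", send_m + ack_m + APP_TL, "us")     # 32 us
```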

26. Hierarchical Connections
  • Choose a "leader" in each subnet.
    • Topology directed
  • Multicast connections to other nodes via leaders →
    • Ack consolidation at leaders (multicast)
    • Msg consolidation at leaders (reverse multicast)
  • Done by a layer above? (layer 4.5?)
  [Diagram: sender A and subnets subnet1–subnet4 behind an outer router; each subnet has a leader (S2, S3, S4, …) with member nodes n1, n2]
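A toy sketch of the hierarchical scheme: the sender contacts one leader per subnet, each leader fans the message out inside its subnet and returns a single consolidated ack. The classes and method names here are invented for illustration, not a protocol definition:

```python
from typing import Dict, List


class SubnetLeader:
    """Leader node for one subnet: fans a message out locally and
    consolidates the members' acks into one ack toward the sender."""

    def __init__(self, subnet: str, members: List[str]):
        self.subnet = subnet
        self.members = members

    def deliver(self, msg: str) -> str:
        acks = [f"ack({m}:{msg})" for m in self.members]          # local fan-out
        return f"ack({self.subnet}, covers {len(acks)} members)"  # consolidation


def hierarchical_multicast(msg: str, leaders: Dict[str, SubnetLeader]) -> List[str]:
    """Sender-side view: one send + one consolidated ack per subnet leader,
    instead of one send + one ack per group member."""
    return [leader.deliver(msg) for leader in leaders.values()]


leaders = {
    "subnet1": SubnetLeader("subnet1", ["n1", "n2"]),
    "subnet2": SubnetLeader("subnet2", ["n1", "n2"]),
    "subnet3": SubnetLeader("subnet3", ["n1", "n2"]),
    "subnet4": SubnetLeader("subnet4", ["n1", "n2"]),
}
print(hierarchical_multicast("patch-v2", leaders))
```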
