
Data Center Networks for the Application



Presentation Transcript


  1. Data Center Networks for the Application Jon Crowcroft, http://www.cl.cam.ac.uk/~jac22

  2. Data Centers don’t just go fast • They need to serve applications • Latency, not just throughput • Availability & Failure Detectors • Application code within network • Work based on Matt Grosvenor’s PhD • And EPSRC funded NaaS project…

  3. 1. Deterministic latency bounding • I learned that what I was teaching was wrong! • I used to say: • Integrated Services is too complex • Admission & scheduling are hard • Priority queues can't do it • PGPS computation for latency? • I present the Qjump scheme, which • Uses intserv (PGPS) style admission control • Uses priority queues for service levels
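The PGPS-style reasoning above can be made concrete. A minimal sketch of the kind of bound involved, under assumed symbols (n hosts fanning in to a link, maximum packet size P, link rate R, and a small fixed processing term ε):

```latex
% Worst case for the highest service level: each of the n fan-in hosts
% emits one maximum-size packet P onto a link of rate R, so a packet
% waits behind at most n-1 others plus its own transmission time.
\[
  \text{latency bound} \;\leq\; n \cdot \frac{P}{R} \;+\; \varepsilon
\]
```

Because n, P and R are all known in a data center (topology, MTU, link speed), the bound is computable at configuration time rather than per-flow.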

  4. Data Center Latency Problem • Tail of the distribution, • due to long/bursty flows interfering • Need to separate classes of flow • Low-latency flows are usually short (or RPCs) • Bulk transfers aren't so latency/jitter sensitive

  5. Data Center Qjump Solution • In the data center, not the general Internet! • Can exploit knowledge of topology, • traffic matrix & • source behaviour • A regular, simpler topology is key • But also a largely "cooperative" world…

  6. 5-port virtual output queued switch

  7. Latency of ping v. iperf

  8. Hadoop perturbs time synch

  9. Hadoop perturbs memcached

  10. Hadoop perturbs Naiad

  11. Qjump – two pieces • At network config time • Compute a set of (8*) rates based on • Traffic matrix & hops => fan-in (f) • At run time • A flow assigns itself a priority/rate class • and is subject to a (per-hypervisor) rate limit * 8 is arbitrary – but often h/w supported
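A minimal sketch of the config-time piece. The function name, the geometric spacing of the throughput factors, and the parameter names are my assumptions, not the paper's exact formula; the idea is that each of the 8 levels trades throughput for latency, from a fully rate-limited latency-guaranteed class (R/n) up to an unconstrained one (R):

```python
def qjump_rates(link_rate: float, fan_in: int, levels: int = 8) -> list[float]:
    """Hypothetical config-time rate computation: return one rate limit
    per service level, from latency-guaranteed (level 0, rate R/n, where
    n is the fan-in derived from topology and traffic matrix) up to full
    link rate at the top level. 8 levels because hardware priority
    queues commonly come in eights."""
    rates = []
    for i in range(levels):
        # throughput factor grows from 1 (strict latency) to fan_in
        # (full throughput); geometric spacing is an assumed choice
        factor = fan_in ** (i / (levels - 1))
        rates.append(link_rate * factor / fan_in)
    return rates
```

At run time a flow simply picks a level; the lower the level, the stronger the latency guarantee and the harsher the rate limit.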

  12. Qjump – self admit/policer (per hypervisor, at least)
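The per-hypervisor policer can be pictured as a token bucket per flow class. This is an illustrative sketch, not QJump's actual implementation; the class and parameter names are invented:

```python
import time

class QjumpPolicer:
    """Hypothetical per-hypervisor policer: a token bucket refilled at
    the class's admitted rate. Packets that exceed the rate are held
    back (here simply rejected), which is what keeps higher-priority
    classes free of contention."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8.0        # refill rate in bytes/second
        self.burst = burst_bytes          # bucket depth
        self.tokens = float(burst_bytes)  # start full
        self.last = time.monotonic()

    def admit(self, pkt_bytes: int) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True
        return False
```

Enforcing this at the hypervisor means untrusted guests cannot exceed their admitted class rate, so the network-wide bound computed at config time still holds.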

  13. Qjump latency thru 1 switch

  14. Memcached latency redux w/ QJ

  15. QJ naiad barrier synch latency redux

  16. Experimental topology

  17. PTPd/memcached alone, with Hadoop, and with Hadoop+QJump

  18. 2PC throughput with and without QJump

  19. QJump: variance compared with other schemes…

  20. Topology for flow completion experiment

  21. Web search FCT, 100Kb average

  22. Web search FCT, 99th centile

  23. Web search FCT, 10Mb average

  24. Data-mining FCT, 100Kb average

  25. Data-mining FCT, 100Kb, 99th centile

  26. Data-mining FCT, 10Mb average

  27. Memcached latency

  28. Qjump scale-up emulation

  29. Latency bound validation topo

  30. Big Picture Comparison – Related work…

  31. 2. Failure Detectors • 2PC & the CAP theorem • Recall CAP (Brewer's Hypothesis) • Consistency, Availability, Partition tolerance • Strong & weak versions! • With a deterministic network & node failure detector, it isn't necessarily so! • What can we use a CAP-able system for?
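Why a hard latency bound changes the picture: if QJump guarantees that a heartbeat is delivered within a known bound D, then a heartbeat that fails to arrive by its deadline is definite evidence of failure, not congestion. A minimal sketch under that assumption (class and method names are mine):

```python
class DeterministicFailureDetector:
    """Sketch assuming the network gives a hard one-way delay bound D
    for heartbeats sent every T seconds. A heartbeat missing for more
    than T + D means the peer (or its link) has definitely failed:
    no false positives, which is what lets protocols like 2PC avoid
    blocking indefinitely on an uncertain coordinator."""

    def __init__(self, period: float, delay_bound: float):
        self.deadline = period + delay_bound
        self.last_heartbeat: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_heartbeat[node] = now

    def failed(self, node: str, now: float) -> bool:
        # a node we have never heard from is treated as alive-for-now
        return now - self.last_heartbeat.get(node, now) > self.deadline
```

With best-effort networks the timeout can only ever be a guess; the deterministic bound is what turns the guess into a proof.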

  32. Consistent, partition-tolerant app? • Software Defined Network update! • Distributed controllers hold distributed rules • Rules change from time to time • Need to update them consistently • Need updates to work in the presence of partitions • By definition! • So Qjump may let us do this too!

  33. 3. Application code -> Network • The last piece of the data center working for the application • Switch and host NICs have a lot of smarts • Network processors, • like GPUs or (net)FPGAs • Can they help applications? • In particular, avoid pathological traffic patterns (e.g. TCP incast)

  34. Application code • E.g. the shuffle phase in map/reduce • Does a bunch of aggregation • (min, max, ave) on a row of results • And is the cause of traffic "implosion" • So do the work in stages in the switches in the net (like merge sort!) • The code is very simple • Cross-compile it onto switch/NIC CPUs
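The staged, merge-sort-like aggregation above can be sketched in a few lines. This is an illustrative model, not the NaaS implementation: each "switch" combines up to fan_in partial aggregates before forwarding, so no single reducer ever sees all the traffic at once:

```python
from functools import reduce

def leaf(value):
    # a partial aggregate is (min, max, sum, count)
    return (value, value, value, 1)

def combine(a, b):
    """Merge two partial (min, max, sum, count) aggregates."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def staged_aggregate(host_values, fan_in=4):
    """Hypothetical in-network shuffle: instead of every mapper sending
    its row straight to one reducer (incast), each stage of switches
    reduces groups of fan_in partial aggregates, tree-fashion, until
    one (min, max, average) result remains."""
    level = [leaf(v) for v in host_values]
    while len(level) > 1:
        level = [reduce(combine, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    mn, mx, total, count = level[0]
    return mn, mx, total / count
```

The key property is that min, max and sum are associative, so the switches can combine partial results in any grouping without changing the answer.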

  35. Other application examples • There are many… • They arose in Active Networks research • Transcoding • Encryption • Compression • Index/Search • Etc.

  36. Need a language to express these • Finite iteration • (not a Turing-complete language) • So design a Python-like language – with strong types! • Work in progress in the NaaS project at Imperial and Cambridge…
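To illustrate the "finite iteration" restriction (this is my example, not the NaaS language itself): a kernel destined for switch/NIC cores is only allowed loops whose trip count is fixed up front, never data-dependent, so termination and worst-case run time are known before it is compiled onto the network hardware:

```python
from typing import Sequence

def bounded_fold(values: Sequence[int], bound: int) -> int:
    """Sum at most `bound` items. The loop bound is a static parameter,
    not computed from the data, so the kernel provably terminates and
    its worst-case cost is known at compile time -- the property a
    non-Turing-complete in-network language is designed to guarantee."""
    total = 0
    for i in range(min(bound, len(values))):   # finite iteration only
        total += values[i]
    return total
```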

  37. Conclusions/Discussion • The data center is a special case! • It's important enough to tackle • We can hard-bound latency easily • We can detect failures and therefore solve some nice distributed consensus problems • We can optimise applications' pathological traffic patterns • Plenty more to do…
