
  1. Presenter: Po-Chun Wu (r00945020@ntu.edu.tw)

  2. Outline • Introduction • BCube Structure • BCube Source Routing (BSR) • Other Design Issues • Graceful degradation • Implementation and Evaluation • Conclusion

  3. Introduction

  4. Container-based modular DC • 1000-2000 servers in a single container • Core benefits of shipping-container DCs: • Easy deployment • High mobility • Just plug in power, network, & chilled water • Increased cooling efficiency • Manufacturing & hardware administration savings

  5. BCube design goals • High network capacity for: • One-to-one unicast • One-to-all and one-to-several reliable groupcast • All-to-all data shuffling • Only use low-end, commodity switches • Graceful performance degradation • Performance degrades gracefully as server/switch failures increase

  6. BCube Structure

  7. BCube structure • A BCubek has: • k+1 levels, numbered 0 through k • n-port switches, n^k of them at each level • n^(k+1) servers and (k+1)·n^k switches in total • Example: n=8, k=3 gives 4 levels connecting 4096 servers, with 512 8-port switches at each level • Each server is assigned a BCube address (ak, ak-1, …, a0), where each ai ∈ [0, n-1] • Neighboring servers' addresses differ in exactly one digit • Switches only connect to servers • Connecting rule: the i-th server in the j-th BCube0 connects to the j-th port of the i-th level-1 switch • Example: server 13 is connected to switches <0,1> and <1,3> • [Figure: a BCube1 with n=4, level-0 switches <0,0>-<0,3>, level-1 switches <1,0>-<1,3>, and servers 00-33; a runnable sketch of the connecting rule follows]
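
To make the connecting rule concrete, here is a minimal Python sketch (illustrative code, not from the paper; build_bcube and the tuple-based switch naming are my own):

```python
from itertools import product

def build_bcube(n, k):
    """Enumerate the server-to-switch links of a BCube_k built from n-port switches.

    A server address is a tuple (a_0, a_1, ..., a_k), least-significant digit
    first, with each a_i in [0, n-1]. The level-i switch a server attaches to
    is named by the server's address with digit a_i removed, and the server
    sits on port a_i of that switch.
    """
    links = []
    for addr in product(range(n), repeat=k + 1):   # all n^(k+1) servers
        for i in range(k + 1):                     # one link per level
            switch = (i, addr[:i] + addr[i + 1:])  # <level, remaining digits>
            links.append((addr, switch, addr[i]))  # (server, switch, port)
    return links

# Sanity check against the slide's example: in a BCube_1 with n=4,
# server "13" (a_1=1, a_0=3, so addr=(3, 1)) reaches <0,1> and <1,3>.
links = build_bcube(n=4, k=1)
print([(sw, port) for srv, sw, port in links if srv == (3, 1)])
# -> [((0, (1,)), 3), ((1, (3,)), 1)], i.e. switches <0,1> and <1,3>
```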

  8. A bigger BCube: 3 levels (k=2) • [Figure: a BCube2 built recursively from n BCube1s plus one more layer of switches]

  9. BCube: server-centric network • Switches never connect to other switches; they only connect to servers • Servers control routing, load balancing, and fault tolerance • [Figure: a packet traveling from src to dst in a BCube1 through switches <0,2> and <1,3>; each switch forwards with a plain MAC-address-to-port table, so commodity switches need no modification]

  10. Bandwidth-intensive application support • One-to-one: one server moves data to another server (e.g., disk backup) • One-to-several: one server transfers the same copy of data to several receivers (e.g., distributed file systems) • One-to-all: one server transfers the same copy of data to all the other servers in the cluster (broadcast) • All-to-all: every server transmits data to all the other servers (e.g., MapReduce)

  11. Multi-paths for one-to-one traffic • Theorem 1. The diameter of a BCubek, i.e. the longest of the shortest paths between server pairs, is k+1 • Theorem 3. There are k+1 parallel (edge-disjoint) paths between any two servers in a BCubek (a sketch of the construction follows) • [Figure: the parallel paths between two servers in a BCube1]
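
A hedged sketch of where the k+1 paths come from (illustrative Python, not the paper's BuildPathSet, which additionally detours through neighbors for digits that already agree): when two addresses differ in every digit, correcting the digits in rotated orders yields paths whose intermediate servers never coincide.

```python
def digit_correct_path(src, dst, order):
    """Walk from src to dst by rewriting one address digit per hop,
    in the given order of digit positions (BCubeRouting's basic idea)."""
    path, cur = [src], list(src)
    for i in order:
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

def parallel_paths(src, dst, k):
    """Build k+1 paths by rotating the digit-correction order; when src and
    dst differ in all k+1 digits, the paths are edge-disjoint."""
    orders = [[(s + j) % (k + 1) for j in range(k + 1)] for s in range(k + 1)]
    return [digit_correct_path(src, dst, o) for o in orders]

# Two parallel paths from server 00 to server 11 in a BCube_1 (k=1):
for p in parallel_paths((0, 0), (1, 1), k=1):
    print(p)
# -> [(0, 0), (1, 0), (1, 1)] and [(0, 0), (0, 1), (1, 1)]
```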

  12. Speedup for one-to-several traffic • Theorem 4. Server A and a set of servers {di | di is A's level-i neighbor} form an edge-disjoint complete graph of diameter 2 • Writing to r servers is therefore r times faster than pipeline replication: the source stripes 1/r of the data to each of the r neighbors over edge-disjoint links instead of pushing the full copy through a single link • [Figure: a source striping chunks P1 and P2 to its neighbors in a BCube1]

  13. Speedup for one-to-all traffic • Theorem 5. There are k+1 edge-disjoint spanning trees in a BCubek, so a source can stripe data across all of them in parallel • The one-to-all and one-to-several spanning trees can be implemented with TCP unicast to achieve reliability • [Figure: edge-disjoint spanning trees rooted at the source server in a BCube1]

  14. Aggregate bottleneck throughput for all-to-all traffic • The flows that receive the smallest throughput are called the bottleneck flows • Aggregate bottleneck throughput (ABT) = (throughput of the bottleneck flow) × (total number of flows in the all-to-all traffic) • Larger ABT means shorter all-to-all job finish time • Theorem 6. The ABT of a BCube network is n/(n-1) × N, where n is the switch port count and N is the total number of servers • In BCube there are no bottleneck links, since all links are used equally
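
For scale (my arithmetic, not a number from the slides): plugging in the earlier example of n = 8 and N = 4096 servers on 1 Gb/s links, Theorem 6 gives ABT = 8/7 × 4096 ≈ 4681 Gb/s, i.e. roughly n/(n-1) ≈ 1.14 Gb/s of all-to-all throughput per server, slightly more than one full link's worth.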

  15. BCube Source Routing (BSR)

  16. BCube Source Routing (BSR) • Server-centric source routing • The source server decides the best path for a flow by probing a set of parallel paths • The source adapts to network conditions by re-probing periodically or on failures • Intermediate servers only forward packets based on the packet header • [Figure: probe packets sent from the source to the destination along the k+1 parallel paths]

  17. BSR path selection • Source server: • 1. Constructs k+1 parallel paths using BuildPathSet • 2. Probes all of these paths (no link-state broadcasting needed) • 3. If a path is unavailable, uses BFS to find an alternative (after removing the links of the existing parallel paths) • 4. Selects the best path by a metric (maximum available bandwidth, or end-to-end delay) • Intermediate servers: • Update the probe's bandwidth field to min(PacketBW, InBW, OutBW) • If the next hop is unreachable, return a failure to the source • Destination server: • Updates the bandwidth field to min(PacketBW, InBW) • Sends the probe response back to the source along the reverse path • (A sketch of this probe bookkeeping follows)
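
A minimal sketch of the probe bookkeeping, assuming plain numeric bandwidth fields (illustrative Python; the function and field names are mine, not the BCube implementation's):

```python
def update_probe_at_intermediate(probe_bw, in_bw, out_bw):
    """Each intermediate server clamps the probe's available-bandwidth field
    to the tighter of its incoming and outgoing link bandwidths."""
    return min(probe_bw, in_bw, out_bw)

def update_probe_at_destination(probe_bw, in_bw):
    """The destination only has its incoming link left to account for."""
    return min(probe_bw, in_bw)

def select_path(probe_responses):
    """Source-side selection: pick the probed path with the maximum
    available bandwidth (end-to-end delay would work as an alternative)."""
    alive = {path: bw for path, bw in probe_responses.items() if bw is not None}
    if not alive:
        return None  # no parallel path answered; fall back to BFS
    return max(alive, key=alive.get)

# Example: two parallel paths, one congested (0.6 Gb/s left) and one broken.
responses = {("00", "01", "11"): 0.6, ("00", "10", "11"): None}
print(select_path(responses))  # -> ('00', '01', '11')
```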

  18. Path adaptation • The source re-runs path selection periodically (every 10 seconds) to adapt to failures and changing network conditions • When a failure report is received, the source switches to another available path immediately, but defers the next full selection round until the timer expires • The timer is randomized to avoid path oscillation
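
A sketch of the randomized re-probe timer (illustrative; the 10-second base interval is from the slide, while the jitter width is my assumption):

```python
import random

BASE_INTERVAL = 10.0   # seconds, from the slide
JITTER = 2.0           # +/- seconds; assumed width, not specified in the talk

def next_probe_delay():
    """Randomize each source's re-probe time so that many sources reacting
    to the same event don't re-select paths in lockstep (path oscillation)."""
    return BASE_INTERVAL + random.uniform(-JITTER, JITTER)

print(next_probe_delay())  # e.g. 9.3
```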

  19. Packet forwarding • Each server maintains two components: • A neighbor status table with (k+1) × (n-1) entries • Maintained by the neighbor maintenance protocol (updated upon probing and packet forwarding) • Neighbors are indexed with a Next Hop Index (NHI) encoding [DP:DV] • DP: the position of the differing digit (2 bits) • DV: the value of the differing digit (6 bits) • The packet header carries the NHI array (8 bytes, for a maximum diameter of 8) • The table is almost static (only the status field changes) • Packet forwarding procedure: • An intermediate server looks up the next hop and, if it is alive, rewrites the next-hop MAC address in the packet • It also refreshes neighbor status from the packets it forwards • Forwarding costs a single table lookup

  20. Path compression and fast packet forwarding • A traditional address array needs 16 bytes: Path(00,13) = {02, 22, 23, 13} • The Next Hop Index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1} • [Figure: the path 00 → 02 → 22 → 23 → 13 in a BCube1, with the forwarding table of server 23]
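
A small Python sketch of the [DP:DV] packing (my own illustration of the encoding described on slide 19; 2 bits select the differing digit, 6 bits carry its new value, so each hop costs one byte):

```python
def nhi_encode(dp, dv):
    """Pack one hop as [DP:DV]: digit position in the top 2 bits,
    digit value in the low 6 bits (values 0..63, i.e. up to 64-port switches)."""
    assert 0 <= dp < 4 and 0 <= dv < 64
    return (dp << 6) | dv

def nhi_decode(byte):
    return byte >> 6, byte & 0x3F

def nhi_path(hops):
    """Encode a path given as (dp, dv) pairs; one byte per hop."""
    return bytes(nhi_encode(dp, dv) for dp, dv in hops)

# Slide 20's example: Path(00,13) = {0:2, 1:2, 0:3, 1:1} fits in 4 bytes,
# versus 16 bytes for four full addresses.
path = nhi_path([(0, 2), (1, 2), (0, 3), (1, 1)])
print(len(path), [nhi_decode(b) for b in path])
# -> 4 [(0, 2), (1, 2), (0, 3), (1, 1)]
```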

  21. Other Design Issues

  22. Partial BCubek • How to build a partial BCubek: (1) build the needed BCubek-1s, then (2) use only part of the layer-k switches? • Solution: connect the BCubek-1s using a full layer of layer-k switches • Advantage: BCubeRouting performs just as in a complete BCube, and BSR works as before • Disadvantage: the switches in layer k are not fully utilized • [Figure: a partial BCube1 with two BCube0s and a full level-1 switch layer]

  23. Packing and wiring (1/2) • 2048 servers and 1280 8-port switches • A partial BCube with n = 8 and k = 3 • One 40-foot container (12 m × 2.35 m × 2.38 m) • 32 racks in a container

  24. Packing and wiring (2/2) • One rack = one BCube1 (64 servers) • Each rack has 44 rack units • 1U holds 2 servers or 4 switches • The 64 servers occupy 32 units and the 40 switches occupy 10 units, 42 of the 44 units in total • A super-rack (8 racks) = one BCube2

  25. Routing to external networks (1/2) • Ethernet has a two-level link-rate hierarchy: 1G for end hosts and 10G for uplinks • [Figure: gateway servers 01, 11, 21, and 31 in a BCube1 connect through 1G links to an aggregator with a 10G uplink]

  26. Routing to external networks (2/2) • When an internal server sends a packet to an external IP address, it chooses one of the gateways • The packet is then routed to that gateway using BSR (BCube Source Routing) • After the gateway receives the packet, it strips the BCube protocol header and forwards the packet to the external network via the 10G uplink
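
A hedged sketch of the egress decision (illustrative only; the talk does not say how a gateway is chosen, so the per-flow hash below is an assumption made to keep one flow on one gateway):

```python
import hashlib

GATEWAYS = ["01", "11", "21", "31"]   # the gateway servers from the figure

def pick_gateway(src_addr, dst_ip, dst_port):
    """Choose a gateway per flow. Hashing the flow identifier keeps every
    packet of a flow on the same gateway (an assumed policy; the slide only
    says the sender chooses one of the gateways)."""
    key = f"{src_addr}|{dst_ip}|{dst_port}".encode()
    digest = int(hashlib.sha1(key).hexdigest(), 16)
    return GATEWAYS[digest % len(GATEWAYS)]

# The chosen gateway becomes the BSR destination; the gateway then strips
# the BCube header and forwards the packet over its 10G uplink.
print(pick_gateway("00", "203.0.113.7", 443))  # prints one of GATEWAYS
```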

  27. Graceful degradation

  28. Comparison topologies: DCell and fat-tree • [Figure: the DCell and fat-tree topologies used as baselines in the following simulations]

  29. Graceful degradation • Graceful degradation: as server or switch failures increase, ABT decreases slowly and there are no dramatic performance drops (simulation-based results) • [Figures: ABT versus server failure rate and versus switch failure rate, comparing BCube, fat-tree, and DCell]

  30. Implementation and Evaluation

  31. Implementation • Software: BCube is implemented as a kernel-mode intermediate driver sitting between the TCP/IP protocol driver and the Ethernet miniport drivers, beneath unmodified applications and the BCube configuration tool • The BCube driver handles packet send/receive, BSR path probing and selection, packet forwarding, neighbor maintenance, the flow-path cache, and available-bandwidth calculation • Hardware: the server's interfaces IF 0 … IF k; packet forwarding, neighbor maintenance, and available-bandwidth calculation can also be offloaded to hardware

  32. Testbed • A BCube testbed with 16 servers (Dell Precision 490 workstations: Intel 2.00 GHz dual-core CPU, 4 GB DRAM, 160 GB disk) • 8 8-port mini-switches (D-Link DGS-1008D 8-port Gigabit switches) • NICs: Intel PRO/1000 PT quad-port Ethernet adapters, plus NetFPGA cards

  33. Bandwidth-intensive application support • Per-server throughput

  34. Support for all-to-all traffic • Total throughput for all-to-all

  35. Related work • [Figure: speedup comparison with related architectures]

  36. Conclusion • BCube is a novel network architecture for shipping-container-based MDCs • It forms a server-centric network architecture • It uses mini-switches instead of larger (e.g., 24-port) switches • BSR enables graceful degradation and meets the special requirements of MDCs
