Presenter: Po-Chun Wu (r00945020@ntu.edu.tw)
Outline
• Introduction
• BCube Structure
• BCube Source Routing (BSR)
• Other Design Issues
• Graceful degradation
• Implementation and Evaluation
• Conclusion
Introduction: container-based modular DC
• 1000-2000 servers in a single container
• Core benefits of shipping-container DCs:
  • Easy deployment
  • High mobility: just plug in power, network, and chilled water
  • Increased cooling efficiency
  • Manufacturing & H/W administration savings
BCube design goals
• High network capacity for:
  • One-to-one unicast
  • One-to-all and one-to-several reliable groupcast
  • All-to-all data shuffling
• Only use low-end, commodity switches
• Graceful performance degradation
  • Performance degrades gracefully as server/switch failures increase
BCube Structure
• A BCube_k has:
  • k+1 levels of switches, level 0 through level k
  • n-port switches, with the same number (n^k) at each level
  • n^(k+1) servers and (k+1)·n^k switches in total
  • Example: n=8, k=3 gives 4 switch levels connecting 4096 servers, with 512 8-port switches at each level
• A server is assigned a BCube address (a_k, a_{k-1}, ..., a_0), where a_i ∈ [0, n-1] for i ∈ [0, k]
  • Neighboring server addresses differ in only one digit
• Switches only connect to servers
• Connecting rule
  • The i-th server in the j-th BCube_0 connects to the j-th port of the i-th level-1 switch
  • Example: server 13 is connected to switches <0,1> and <1,3>
[Figure: a BCube_1 with n=4; level-1 switches <1,0> to <1,3> on top, four BCube_0s with level-0 switches <0,0> to <0,3>, and servers 00 to 33]
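To make the connecting rule concrete, here is a minimal sketch in Python (illustrative only, not the paper's implementation; the function name and data representation are assumptions) that maps a server address to the k+1 switches it attaches to:

```python
# Sketch: which switches does a BCube_k server connect to?
# A server (a_k, ..., a_0) connects, at each level i, to the switch whose id is
# the server address with digit a_i removed, using a_i as the port number.

def server_switches(addr, k):
    """addr: server address as a tuple (a_k, ..., a_0) with digits in [0, n-1].
    Returns a list of (level, switch_id_digits, port) tuples, one per level."""
    assert len(addr) == k + 1
    links = []
    for level in range(k + 1):
        pos = k - level                              # index of digit a_level in addr
        switch_id = addr[:pos] + addr[pos + 1:]      # drop a_level -> switch id at this level
        links.append((level, switch_id, addr[pos]))  # a_level is the switch port used
    return links

# Example from the BCube_1 figure (n=4, k=1): server 13
#   server_switches((1, 3), k=1) -> [(0, (1,), 3), (1, (3,), 1)]
#   i.e. switch <0,1> on port 3 and switch <1,3> on port 1, as stated above.
```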
A bigger BCube: three levels (k=2)
[Figure: a BCube_2 built recursively from BCube_1s connected by level-2 switches]
BCube: a server-centric network
• Server-centric BCube
  • Switches never connect to other switches; they only connect to servers
  • Servers control routing, load balancing, and fault tolerance
[Figure: forwarding example from src 00 to dst 23 in the BCube_1; each switch keeps only a small MAC table for its directly attached servers (e.g. <0,2> and <1,3>), while relay servers rewrite the next-hop MAC and the BCube address travels in the packet header]
Bandwidth-intensive application support
• One-to-one: one server moves data to another server (e.g., disk backup)
• One-to-several: one server transfers the same copy of data to several receivers (e.g., distributed file systems)
• One-to-all: one server transfers the same copy of data to all the other servers in the cluster (broadcast)
• All-to-all: every server transmits data to all the other servers (e.g., MapReduce)
Multi-paths for one-to-one traffic
• Theorem 1. The diameter (longest shortest path) of a BCube_k is k+1.
• Theorem 3. There are k+1 parallel paths between any two servers in a BCube_k.
[Figure: the BCube_1 example topology used to illustrate the parallel paths]
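The k+1 parallel paths come from correcting the address digits in different orders. Below is a minimal sketch of that digit-correcting idea (illustrative only; the paper's BuildPathSet additionally detours through a neighbor when a digit already matches so the paths stay parallel, which is omitted here):

```python
# Sketch of the digit-correcting idea behind BCubeRouting: hop from src to dst
# by fixing one address digit per step, in the order given by a level permutation.

def bcube_route(src, dst, order):
    """src, dst: tuples (a_k, ..., a_0); order: permutation of levels 0..k.
    Returns the list of servers on the path, src first, dst last."""
    k = len(src) - 1
    path = [src]
    cur = list(src)
    for level in order:
        pos = k - level                 # index of digit a_level
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]         # one hop through a level-`level` switch
            path.append(tuple(cur))
    return path

# Two different correction orders give two parallel paths from 00 to 23
# in the BCube_1 example above:
#   bcube_route((0, 0), (2, 3), order=[1, 0])  ->  [(0,0), (2,0), (2,3)]
#   bcube_route((0, 0), (2, 3), order=[0, 1])  ->  [(0,0), (0,3), (2,3)]
```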
Speedup for one-to-several traffic
• Theorem 4. Server A and the set of servers {d_i | d_i is A's level-i neighbor} form an edge-disjoint complete graph of diameter 2.
• Writing to r servers is therefore r times faster than pipelined replication.
[Figure: chunks P1 and P2 being replicated to several servers over edge-disjoint paths in the BCube_1 example]
Speedup for one-to-all traffic
• Theorem 5. There are k+1 edge-disjoint spanning trees in a BCube_k.
• The one-to-all and one-to-several spanning trees can be implemented with TCP unicast to achieve reliability.
[Figure: edge-disjoint spanning trees rooted at the source server in the BCube_1 example]
Aggregate bottleneck throughput for all-to-all traffic
• The flows that receive the smallest throughput are called the bottleneck flows.
• Aggregate bottleneck throughput (ABT) = (throughput of the bottleneck flow) × (total number of flows in the all-to-all traffic)
• Larger ABT means shorter all-to-all job finish time.
• Theorem 6. The ABT for a BCube network is (n / (n-1)) · N, where n is the switch port number and N is the total server number.
• In BCube there are no single bottleneck links, since all links are used equally.
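As a quick sanity check of Theorem 6, the sketch below plugs in the n = 8 case used later (a complete BCube_3 is assumed here for simplicity, so N = 4096; the deployed container is actually a partial BCube):

```python
# Worked example of Theorem 6: ABT = n/(n-1) * N, in units of per-link capacity.
n, k = 8, 3                  # 8-port switches, BCube_3
N = n ** (k + 1)             # 4096 servers in a complete BCube_3
abt = n / (n - 1) * N        # aggregate bottleneck throughput
print(round(abt))            # -> 4681 (i.e. roughly 4.68 Tb/s with 1 Gb/s links)
```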
BCube Source Routing (BSR)
• Server-centric source routing
  • The source server decides the best path for a flow by probing a set of parallel paths
  • The source server adapts to network conditions by re-probing periodically or upon failures
  • Intermediate servers only forward packets based on the packet header
[Figure: the source sends probe packets over k+1 parallel paths through intermediate servers to the destination]
BSR path selection
• Source server:
  1. Construct k+1 parallel paths using BuildPathSet
  2. Probe all these paths (no link-state broadcasting)
  3. If a path is not found, use BFS to find an alternative (after removing the links of the other parallel paths from the graph)
  4. Use a metric to select the best path (maximum available bandwidth, or end-to-end delay)
• Intermediate servers:
  • Update the probe's bandwidth: min(packet BW, incoming-link BW, outgoing-link BW)
  • If the next hop is not found, return a path-failure message to the source
• Destination server:
  • Update the probe's bandwidth: min(packet BW, incoming-link BW)
  • Send the probe response back to the source on the reverse path
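A minimal sketch of the probe bookkeeping and path choice described above (the function names and data shapes are assumptions for illustration, not the driver's actual API):

```python
# Sketch of BSR probe handling: each relay narrows the probe's bandwidth field
# to its local bottleneck; the source then keeps the widest path.

def relay_update(probe_bw, in_link_bw, out_link_bw):
    """Intermediate server: min(packet BW, incoming-link BW, outgoing-link BW)."""
    return min(probe_bw, in_link_bw, out_link_bw)

def destination_update(probe_bw, in_link_bw):
    """Destination server: min(packet BW, incoming-link BW)."""
    return min(probe_bw, in_link_bw)

def select_best_path(probe_responses):
    """probe_responses: {path: available_bandwidth}. Pick the path with the
    maximum available bandwidth (end-to-end delay could be used instead)."""
    return max(probe_responses, key=probe_responses.get)

# Example: three probed paths with different bottlenecks (Mb/s)
# select_best_path({"p0": 450, "p1": 900, "p2": 700}) -> "p1"
```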
Path adaptation
• The source performs path selection periodically (every 10 seconds) to adapt to failures and changing network conditions.
• When a failure is reported, the source switches to an available path right away, but waits for the next timer to expire before running the next selection round rather than re-probing immediately.
• Randomness is added to the timer to avoid path oscillation.
Packet forwarding
• Each server has two components:
  • Neighbor status table, with (k+1)×(n-1) entries
    • Maintained by the neighbor maintenance protocol (updated upon probing / packet forwarding)
    • Neighbors are indexed by the next hop index (NHI) encoding [DP:DV]
      • DP: which digit differs (2 bits)
      • DV: value of that digit (6 bits)
    • The NHA (next hop array) carried in the packet header is 8 bytes (maximum supported diameter = 8)
    • Almost static (only the status field changes)
  • Packet forwarding procedure
    • Intermediate servers rewrite the next-hop MAC address in the packet if the next hop is alive
    • Intermediate servers update neighbor status from the packet
    • Only one table lookup per packet
Path compression and fast packet forwarding
• A traditional address array needs 16 bytes: Path(00,13) = {02, 22, 23, 13}
• The next hop index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1}
[Figure: the BCube_1 example, with the forwarding-table lookup at relay server 23 resolving its next hop 13]
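A minimal sketch of the NHI encoding above (illustrative helpers; nothing beyond the 2-bit/6-bit split stated on the previous slide is taken from the actual header layout):

```python
# Sketch of NHI path compression: each hop is described by the single digit
# that changes (DP, 2 bits) and its new value (DV, 6 bits) -- one byte per hop
# instead of a full address per hop.

def encode_hops(path):
    """path: consecutive server addresses (a_k, ..., a_0), source included.
    Returns one (DP, DV) pair per hop."""
    k = len(path[0]) - 1
    hops = []
    for prev, nxt in zip(path, path[1:]):
        diff = [i for i in range(k + 1) if prev[i] != nxt[i]]
        assert len(diff) == 1, "neighboring servers differ in exactly one digit"
        pos = diff[0]
        hops.append((k - pos, nxt[pos]))      # DP = digit level, DV = new value
    return hops

def pack_nhi(dp, dv):
    """Pack one hop into a single byte: 2-bit DP, 6-bit DV."""
    return (dp << 6) | dv

def apply_nhi(current, dp, dv):
    """A relay derives the next-hop address from its own address and one NHI entry."""
    k = len(current) - 1
    nxt = list(current)
    nxt[k - dp] = dv
    return tuple(nxt)

# Path(00,13) = 00 -> 02 -> 22 -> 23 -> 13:
#   encode_hops([(0,0), (0,2), (2,2), (2,3), (1,3)])
#   -> [(0, 2), (1, 2), (0, 3), (1, 1)]   i.e. the NHI array {0:2, 1:2, 0:3, 1:1}
```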
Partial BCube_k
• Naive approach: (1) build the needed BCube_{k-1}s, then (2) interconnect them with only a partial layer of level-k switches (problematic)
• Solution: connect the BCube_{k-1}s using a full layer of level-k switches
• Advantage: BCubeRouting performs just as in a complete BCube, and BSR works as before
• Disadvantage: switches in layer k are not fully utilized
[Figure: a partial BCube_1 with two BCube_0s (switches <0,0> and <0,1>, servers 00-03 and 10-13) still connected through a full level-1 layer <1,0> to <1,3>]
Packing and wiring (1/2)
• 2048 servers and 1280 8-port switches
  • A partial BCube with n = 8 and k = 3
• 40-foot container (12 m × 2.35 m × 2.38 m)
• 32 racks in a container
Packing and wiring (2/2)
• One rack = one BCube_1
  • Each rack has 44 units (1U = 2 servers or 4 switches)
  • 64 servers occupy 32 units
  • 40 switches occupy 10 units
• Super-rack (8 racks) = one BCube_2
Routing to external networks (1/2)
• Ethernet has a two-level link-rate hierarchy
  • 1G for end hosts and 10G for the uplink
[Figure: servers 01, 11, 21, and 31 act as gateways in the BCube_1; an aggregator with a 10G uplink bridges the container's 1G internal fabric to the external network]
Routing to external networks (2/2)
• When an internal server sends a packet to an external IP address, it chooses one of the gateways.
• The packet is then routed to the gateway using BSR (BCube Source Routing).
• After the gateway receives the packet, it strips the BCube protocol header and forwards the packet to the external network via the 10G uplink.
[Figure: DCell and fat-tree topologies used for comparison]
Graceful degradation
• Graceful degradation: as server or switch failures increase, ABT decreases slowly and there are no dramatic performance drops (simulation-based).
[Figure: ABT under increasing server failures and switch failures, comparing BCube, fat-tree, and DCell]
Implementation
• Software
  • User space: BCube configuration app
  • Kernel: TCP/IP protocol driver sits on top of the BCube driver, implemented as an intermediate driver above the Ethernet miniport driver
  • BCube driver modules: packet send/recv, BSR path probing & selection, packet forwarding, neighbor maintenance, flow-path cache, available-bandwidth calculation
• Hardware (server ports IF 0 through IF k)
  • Packet forwarding, neighbor maintenance, and available-bandwidth calculation can be offloaded to hardware (NetFPGA)
[Figure: the BCube protocol stack on a server]
Testbed
• A BCube testbed
  • 16 servers (Dell Precision 490 workstations with an Intel 2.00 GHz dual-core CPU, 4 GB DRAM, 160 GB disk)
  • 8 8-port mini-switches (D-Link DGS-1008D 8-port Gigabit switches)
• NICs
  • Intel Pro/1000 PT quad-port Ethernet NIC
  • NetFPGA
Bandwidth-intensive application support
[Figure: per-server throughput]
Support for all-to-all traffic
[Figure: total throughput for the all-to-all traffic pattern]
Related work
[Table: comparison with related architectures, including speedup]
Conclusion
• BCube is a novel network architecture for shipping-container-based MDCs
• Forms a server-centric network architecture
• Uses mini-switches instead of 24-port switches
• BSR enables graceful degradation and meets the special requirements of MDCs