Dr. Multicast for Data Center Communication Scalability. Ymir Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman (Cornell University); Yoav Tock (IBM Research Haifa). HotNets, October 5, 2008.
IP Multicast in Data Centers • IPMC is not used in data centers • Would speed up products that use multicast
IP Multicast in Data Centers • Why is IP multicast rarely used? • Limited IPMC scalability on switches/routers and NICs • Broadcast storms: Loss triggers a horde of NACKs, which triggers more loss, etc. • Disruptive even to non-IPMC applications.
IP Multicast in Data Centers • IP multicast has a bad reputation • Works great up to a point, after which it breaks catastrophically
IP Multicast in Data Centers • Bottom line: • Administrators have no control over multicast use ... • Without control, they opt for never.
Dr. Multicast (MCMD) • Policy: Permits data center operators to selectively enable and control IPMC • Transparency: Standard IPMC interface, system calls are overloaded. • Performance: Uses IPMC when possible, otherwise point-to-point unicast • Robustness: Distributed, fault-tolerant service
Terminology • Process: Application that joins logical IPMC groups • Logical IPMC group: A virtualized abstraction • Physical IPMC group: As usual • UDP multi-send: New kernel-level system-call • Collection: Set of logical IPMC groups with identical membership
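The "collection" definition above can be illustrated with a minimal sketch that groups logical IPMC groups by identical membership. The function name `build_collections`, the addresses, and the process ids are hypothetical and only serve the illustration; this is not the MCMD implementation.

```python
from collections import defaultdict


def build_collections(membership):
    """Group logical IPMC groups whose member sets are identical.

    membership: dict mapping a logical group address to a set of process ids.
    Returns a dict mapping frozenset(members) to a sorted list of logical groups.
    """
    grouped = defaultdict(list)
    for group, members in membership.items():
        # frozenset makes the member set hashable, so it can serve as a key
        grouped[frozenset(members)].append(group)
    return {members: sorted(groups) for members, groups in grouped.items()}


# Hypothetical example data: two groups share exactly the same members.
membership = {
    "239.1.1.1": {"p1", "p2", "p3"},
    "239.1.1.2": {"p1", "p2", "p3"},
    "239.1.1.3": {"p2", "p4"},
}
collections = build_collections(membership)
```

Here the first two logical groups fall into one collection, so they could share a single physical resource.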
Acceptable Use Policy • Assume a higher-level network management tool compiles policy into primitives • Explicitly allow a process to use IPMC groups • allow-join(process, logical IPMC) • allow-send(process, logical IPMC) • UDP multi-send always permitted • Additional constraints • max-groups(process, limit) • force-udp(process, logical IPMC)
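One way to picture how these primitives might be evaluated is the sketch below. The `Policy` class, its field names, and the fallback behavior are assumptions made for illustration; the slides only name the four primitives and state that UDP multi-send is always permitted.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # Each set holds (process, logical_group) pairs; all names are hypothetical.
    allow_join: set = field(default_factory=set)
    allow_send: set = field(default_factory=set)
    force_udp: set = field(default_factory=set)
    max_groups: dict = field(default_factory=dict)  # process -> group limit

    def may_join(self, process, group, joined_count):
        """True if the process may join, given how many groups it already joined."""
        if (process, group) not in self.allow_join:
            return False
        limit = self.max_groups.get(process)
        return limit is None or joined_count < limit

    def send_mode(self, process, group):
        """Return 'ipmc' or 'udp-multisend' for a send to the logical group."""
        if (process, group) in self.force_udp:
            return "udp-multisend"
        if (process, group) in self.allow_send:
            return "ipmc"
        # Assumption: without allow-send, fall back to UDP multi-send,
        # which the slide says is always permitted.
        return "udp-multisend"


policy = Policy(
    allow_join={("p1", "g1")},
    allow_send={("p1", "g1")},
    max_groups={"p1": 1},
)
```

In this reading, `max-groups` caps the number of joins per process and `force-udp` overrides an otherwise valid `allow-send`.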
Overview • Library module • Mapping module • Gossip layer • Optimization questions • Results
MCMD Library Module • Transparent: Overloads the IPMC functions • setsockopt(), send(), etc. • Translation: Each logical IPMC group maps to a set of P-IPMC/unicast addresses. • Two extremes
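The translation step can be sketched as a lookup followed by one send per destination. The function `send_logical`, the mapping layout, and all addresses are hypothetical. The example covers one plausible reading of the two extremes (a single physical IPMC address versus a list of unicast addresses), which the slide does not spell out.

```python
def send_logical(payload, logical_group, port, mapping, sendto):
    """Send payload to every address the logical group currently maps to.

    mapping: dict of logical group -> list of physical IPMC or unicast addresses.
    sendto:  callable taking (bytes, (host, port)), e.g. a UDP socket's sendto.
    Returns the number of datagrams handed to sendto.
    """
    destinations = mapping[logical_group]
    for address in destinations:
        sendto(payload, (address, port))
    return len(destinations)


# Hypothetical mapping: one group backed by a physical IPMC address,
# another expanded into point-to-point unicast destinations.
mapping = {
    "239.1.1.1": ["239.0.0.7"],
    "239.1.1.3": ["10.0.0.2", "10.0.0.4"],
}

sent = []  # stands in for a real socket so the sketch needs no network


def record(data, dest):
    sent.append((data, dest))


count_ipmc = send_logical(b"hello", "239.1.1.1", 5000, mapping, record)
count_unicast = send_logical(b"hello", "239.1.1.3", 5000, mapping, record)
```

With a real UDP socket, `sock.sendto` could be passed in place of `record`; the application would keep addressing the logical group and never see the physical destinations.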
MCMD Mapping Role • MCMD Agent runs on each machine • Contacted by the library modules • Provides a mapping • One agent elected to be a leader: • Allocates IPMC resources according to the current policy
MCMD Mapping Role • Allocating IPMC resources: An optimization problem • [Figure: Procs, Collections, L-IPMC; box labeled "This box intentionally left BLACK"]
MCMD Gossip Layer • Runs system-wide as part of the agent • Automatic failure detection • Group membership fully replicated via gossip • Node reports its own state • Future: Replicate more selectively • Leader runs optimization algorithm on data and reports the mapping
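To give a feel for how state spreads epoch by epoch, here is a small push-gossip simulation. It is a generic textbook model, not the MCMD protocol: each informed node contacts one uniformly random peer per epoch, and the function name, node count, and seed are assumptions chosen for illustration.

```python
import random


def gossip_epochs(num_nodes, seed=0, max_epochs=1000):
    """Count epochs until an update that starts at node 0 reaches every node."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    informed = {0}
    for epoch in range(1, max_epochs + 1):
        # Snapshot the set: nodes informed during this epoch
        # only start pushing in the next one.
        for _node in list(informed):
            informed.add(rng.randrange(num_nodes))
        if len(informed) == num_nodes:
            return epoch
    raise RuntimeError("update did not reach all nodes within max_epochs")


epochs = gossip_epochs(64)
```

Because the informed set can at most double per epoch, 64 nodes need at least 6 epochs in this model, which illustrates why gossip alone gives logarithmic rather than immediate propagation.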
MCMD Gossip Layer • But gossip is slow... • Implications: • Slow propagation of group membership • Slow propagation of new maps • We assume a low rate of membership churn • Remedy: Broadcast module • Leader broadcasts urgent messages • Bounded bandwidth of urgent channel • Trade-off between latency and scalability
Optimization Questions • [Figure: Procs, Collections, L-IPMC] • First step: compress logical IPMC groups
Optimization Questions • How compressible are subscriptions? • Multi-objective optimization: • Minimize number of collections • Minimize bandwidth overhead on network • Thm: The general problem is NP-complete • Thm: Under uniform random allocation, there is "little" opportunity for compression. • Social preferences • Lots of duplicates due to replication (e.g. for load balancing)
Optimization Questions • Which collections get an IPMC address? • Thm: Assigning P-IPMC addresses greedily to collections in order of decreasing traffic*size minimizes bandwidth. • Tiling heuristic: • Sort L-IPMC by traffic*size • Greedily collapse identical groups • Assign IPMC to collections in reverse order of traffic*size, UDP-multisend to the rest • Building tilings incrementally
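The greedy assignment can be sketched as follows. This is a simplified reading, not the authors' tiling algorithm: it only collapses groups with identical membership, sums their traffic (an assumption), ranks collections by traffic*size with the largest first, and gives the top `budget` collections a physical IPMC address. The function name, `budget` parameter, and data are hypothetical.

```python
def assign_addresses(groups, budget):
    """Split collections into those given IPMC addresses and the rest.

    groups: dict of logical group -> (set of member processes, traffic rate).
    budget: number of physical IPMC addresses available.
    Returns (ipmc, multisend), each a list of lists of logical group names.
    """
    # Collapse groups with identical membership into one collection.
    collections = {}
    for name, (members, traffic) in groups.items():
        key = frozenset(members)
        names, total = collections.get(key, ([], 0.0))
        collections[key] = (names + [name], total + traffic)

    # Rank by traffic * size, highest first.
    ranked = sorted(
        collections.items(),
        key=lambda item: item[1][1] * len(item[0]),
        reverse=True,
    )
    ipmc = [sorted(names) for _, (names, _) in ranked[:budget]]
    multisend = [sorted(names) for _, (names, _) in ranked[budget:]]
    return ipmc, multisend


# Hypothetical workload: g1 and g2 have identical membership.
groups = {
    "g1": ({"p1", "p2", "p3"}, 10.0),
    "g2": ({"p1", "p2", "p3"}, 5.0),
    "g3": ({"p2", "p4"}, 100.0),
    "g4": ({"p5", "p6"}, 1.0),
}
ipmc, multisend = assign_addresses(groups, budget=2)
```

With two addresses available, the high-traffic collection and the merged three-member collection receive IPMC addresses, and the low-traffic collection falls back to UDP multi-send.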
Experimental Results
Overhead (max. throughput) • Insignificant overhead when mapping L-IPMC to P-IPMC.
Overhead (CPU utilization) • Insignificant overhead when mapping L-IPMC to P-IPMC.
Network Overhead • Gossip layer uses constant background bandwidth; urgent channel behaves well
Latency • Latency of propagation of joins/leaves and new maps
Policy control • A malfunctioning node bombards an existing IPMC group. • MCMD policy prevents ill effects • [Figure annotations: "Traffic starts", "New policy"]
Conclusion • IPMC has been a bad citizen... • Dr. Multicast has the cure! • Opportunity for big performance enhancements and policy control.
Overhead • Linux kernel module increases UDP-multisend throughput by 17% (compared to user-space UDP-multisend)
Latency of events • Gossip: 99% of nodes aware of change within 9 epochs (now 1 sec)
Conclusions • Policy: Allows data center operators to enable and control IPMC • Transparency: Standard IPMC interface, system calls are overloaded. • Performance: Uses IPMC when possible, otherwise point-to-point UDP • Robustness: Distributed, fault-tolerant service
Results • Library Module • Insignificant slowdown • Linux kernel module provides 17% speed-up for UDP multi-send
Optimization questions • [Figure: Users, Groups, Topics; box labeled "This box intentionally left BLACK"] • Multi-objective: • Minimize number of groups • Minimize bandwidth overhead on network • Thm: This problem is NP-complete • Reduction to Minimum Normal Set Basis
MCMD Library Layer • Overloads the IPMC functions • setsockopt(),send(), etc. • Translates logical IPMC addresses to physical IPMC, or point-to-point UDP packets depending on policy • Notifies MCMD immediately about joins/leaves • Learns about new mappings from MCMD • Keeps statistics about group traffic rates
MCMD Library Layer • Overloads the IPMC functions • setsockopt(),send(), etc. • Translates logical IPMC addresses to physical IPMC, or point-to-point UDP packets depending on policy • Caches translation maps • Maintains a connection to MCMD for updates