State of IP Multicast Radia Perlman radia.perlman@sun.com
Outline • Addresses • IGMP • Various Routing Protocols • review of DVMRP, MOSPF, CBT, PIM-DM, PIM-SM, MSDP, BGMP/MASC • problems (scaling, etc.) • potential solutions: Simple Multicast/Express
Addresses • IP address is 4 bytes • “Class A”: top bit is 0 • “Class B”: top bits are 10 • “Class C”: top bits are 110 • IP multicast address is “class D”: top bits are 1110 • Mapping to layer 2: use the bottom 23 bits of the IP address; the top 24 bits of the layer 2 address are the OUI, plus one more fixed bit so that ISOC keeps part of the OUI’s address space
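The class D test and the layer 2 mapping can be sketched in a few lines. This is a minimal illustration, assuming the Ethernet multicast OUI 01-00-5E; the function names `is_class_d` and `multicast_mac` are made up for this example.

```python
import ipaddress

def is_class_d(addr):
    """True when the top four bits of the IPv4 address are 1110."""
    return (int(ipaddress.IPv4Address(addr)) >> 28) == 0b1110

def multicast_mac(addr):
    """Map a class D address to an Ethernet multicast MAC address.

    Only the bottom 23 bits of the IP address survive, so 32 different
    group addresses share each MAC address.
    """
    if not is_class_d(addr):
        raise ValueError(f"{addr} is not a class D address")
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    mac = (0x01005E << 24) | low23  # 24-bit OUI, one zero bit, 23 address bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))
```

Because only 23 bits are copied, 224.0.0.1 and 224.128.0.1 land on the same layer 2 address, so a receiver still has to filter on the IP destination.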
IGMP (Internet Group Management Protocol) • Purpose: router on a LAN discovers which multicast addresses have receivers on the LAN • Rtr sends query. Members respond • V1: IGMP response sent to the derived layer 2 multicast address after a random delay. Rtr listens promiscuously • V2: member can resign (leave). Rtr then queries again • V3: join ({S’s},G) sent to rtr layer 2 address
There are two ways of constructing a design. One way is to make it so simple there are obviously no deficiencies. The other way is to make it so complicated that there are no obvious deficiencies. --- Tony Hoare
DVMRP • Flood and prune • send data everywhere (optimization: reverse path forwarding) • send prune (S,G) • remember who you sent prunes to (in case join happens, so you can de-prune) • remember prunes you received (so you can filter)
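The two kinds of prune state on this slide can be pictured with a toy model. It is only a sketch: the class `FloodAndPruneRouter` and its method names are hypothetical, and timers, grafts on the wire, and the routing exchange itself are left out.

```python
class FloodAndPruneRouter:
    """Toy model of per-(S,G) prune bookkeeping in a flood-and-prune router."""

    def __init__(self, neighbors):
        self.neighbors = set(neighbors)
        self.prunes_received = {}  # (S, G) -> neighbors that asked us to stop
        self.prunes_sent = {}      # (S, G) -> neighbors we asked to stop

    def receive_prune(self, s, g, neighbor):
        self.prunes_received.setdefault((s, g), set()).add(neighbor)

    def send_prune(self, s, g, upstream):
        self.prunes_sent.setdefault((s, g), set()).add(upstream)

    def local_join(self, s, g):
        """A member appeared: return the neighbors we must de-prune toward."""
        return self.prunes_sent.pop((s, g), set())

    def forward_to(self, s, g, arrival):
        """Flood to every neighbor except the arrival link and pruned ones."""
        pruned = self.prunes_received.get((s, g), set())
        return sorted(self.neighbors - {arrival} - pruned)
```

Note that both dictionaries grow with the number of (S,G) pairs the router does not want, which is the scaling concern raised for DVMRP.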
Flooding/RPF • Forward received packet onto all links except the one it was received on • exponential overhead • RPF: Only accept pkt with source S on link L if you’d send to S via L • n^2 overhead: each pkt goes on each link
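The RPF rule is a one-line check against the unicast routing table. The sketch below assumes that table is available as a dictionary from source to outgoing link; `rpf_accept` and `rpf_forward` are illustrative names.

```python
def rpf_accept(next_link_toward, source, arrival_link):
    """Accept a packet from `source` only if it arrived on the link
    this router would itself use to send toward `source`."""
    return next_link_toward.get(source) == arrival_link

def rpf_forward(next_link_toward, links, source, arrival_link):
    """Return the links to flood on, or an empty list if the packet fails RPF."""
    if not rpf_accept(next_link_toward, source, arrival_link):
        return []
    return [link for link in links if link != arrival_link]
```

A copy that arrives over any other link is dropped, which is what stops plain flooding from multiplying packets without bound.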
Why DVMRP Doesn’t Scale • Leaking even a few packets for each of millions of sessions periodically • Prune state: routers keep (S,G) pairs per neighbor for groups they DON’T want (most of the millions)
MOSPF • Pass information about all members for all groups in routing protocol • Calculate spanning tree from source when packet arrives from (S,G) (and cache result) • Scaling issues: • routing control overhead (all group members) • CPU for multiple Dijkstra calculations
CBT [Figure: network of routers (R) with core C and nodes A, B, D, F]
CBT • Build bidirectional tree rooted at Core • Only routers on tree need to know about tree • Only problem: Who is the core? • Two mechanisms specified • configure the routers with (C,G) mappings • do PIM-SM bootstrap protocol (see next)
PIM-SM • Unidirectional Shared Tree (tunnel packet to core) • Plus dynamically formed per-source trees when (enough) traffic occurs
Unidirectional Tree [Figure: network of routers (R) with core C and nodes A, B, D, F]
Bidirectional Tree [Figure: network of routers (R) with core C and nodes A, B, D, F, G]
Dynamically Formed Per-Source Trees • If enough traffic from S • join a tree rooted at S • prune off from shared tree for (S,G) • Routers keep more trees and more prune state • State is timed out, which is a problem with bursty sources
(Simplified) PIM core mapping • PIM: “bootstrap” routers flood advertisements throughout domain • Core capable routers register with elected BSR • BSR announces list of cores
PIM core mapping (cont’d) • Hash algorithm maps M to one of the set of currently alive core capable routers • Core not necessarily near group, so shared tree can be really bad • Advertisements don’t scale, so this is intra-domain only
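The idea of hashing M onto the set of live cores can be illustrated with a highest-random-weight hash. This is a stand-in chosen for clarity, not the hash function that PIM specifies; `pick_core` is a hypothetical name and SHA-256 is used only because it gives the same answer on every router.

```python
import hashlib

def pick_core(group, alive_cores):
    """Every router that sees the same list of alive cores picks the same one."""
    def weight(core):
        return hashlib.sha256(f"{group}|{core}".encode()).digest()
    # Sorting first makes the result independent of the order of the list.
    return max(sorted(alive_cores), key=weight)
```

Nothing in the hash knows where the group's members are, which is why the chosen core can be far from the group.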
Interdomain • Use protocols that don’t scale within domains • Find some way of gluing domains together • BGMP/MASC • MSDP
BGMP/MASC • For interdomain: have each domain dynamically choose and defend a block of multicast addresses • Have interdomain routing protocol pass around “reachability” of multicast address blocks • Join is in direction of multicast address prefix
Scaling Problems • MASC • Harder than asking entire Internet to automatically number itself with IP addresses • Too much bandwidth used • Too hard to debug • Too much of a burden on BGP • Will run out of addresses
MSDP • Multicast Source Discovery Protocol • “Interim solution” until BGMP/MASC done • Configure tunnels between core capable routers in various domains, enough so hopefully Internet is connected • Flood (S,G) for all active (S,G)’s, throughout Internet
MSDP [Figure: twelve nodes marked x; layout not recoverable]
Why MSDP Won’t Scale • Too much information to pass around (all active (S,G) pairs) • Too many tunnels to configure
“Current approach” • Use protocols that don’t scale within a domain • Find some way of hooking domains together for groups with members in different domains • MSDP or MASC
Simple Multicast • What causes the greatest complexity and scalability problems in the design? • Remove the need for those • Result: one scalable mechanism that will work both inside and between domains • Doesn’t need to be called “new protocol”. Can be modification of something else
Solve 90% of the problem as simply as possible. Then remove the remaining 10% from the problem requirements --- Marshall Rose
First Simplification • Don’t bother dynamically creating per-source trees • Instead use a single shared, good bidirectional tree • Less state • Better shared tree (bidirectional)
Bidirectional Suboptimal? • Cost to network to deliver data is NOT higher • Core is NOT a bottleneck • Core can be an endnode, does not need to forward data • With single exit point from “domain”, delay difference from source tree is negligible • Don’t need “optimal”. Need “good enough”
Bidirectional Trees Best • Per-Source Trees • Do NOT make network overhead lower (unless core is poorly chosen) • More state for net (n trees rather than one) • Only metric under which per-source tree is better is delay from source to each receiver • Bidirectional tree, with slight care, can ensure short paths to nearby members from any source
Choosing good bidirectional tree • From each domain (or region separated by expensive links), have routers agree on one exit point per IP address prefix • Choose core to be a member of the group, or close to a member of the group • No “bandwidth bottleneck” around core: it’s just a node in the tree • C can be endnode (only fwd tunneled pkts)
Good Bidirectional Tree [Figure: routers R1, R2, R3, R4]
Next simplification • Forcing all routers in Internet to figure out C from M is too expensive and complicated • Instead, make group ID 8 bytes • Only extra work for endnode: look up 8 byte group ID rather than 4 bytes • Eliminate need for multicast address allocation, domain-wide core advertisements, etc.
Simple Multicast • Bidirectional Tree • Group ID is (C,M) • To create group: choose C, ask C for M • Member discovers 8 byte (C,M) • via email, web page, SDR, directory, etc. • Include C and M in join or IGMP reply • Include C and M in data messages
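An 8-byte (C,M) group ID is easy to picture as two 4-byte IPv4-style values side by side. The byte order and the C-then-M layout below are assumptions made for illustration, since the slides do not give a wire format; `pack_group_id` and `unpack_group_id` are hypothetical helpers.

```python
import ipaddress
import struct

def pack_group_id(core, m):
    """Pack (C, M) into 8 bytes: 4 bytes of C followed by 4 bytes of M (assumed layout)."""
    return struct.pack(
        "!II",
        int(ipaddress.IPv4Address(core)),
        int(ipaddress.IPv4Address(m)),
    )

def unpack_group_id(raw):
    """Recover (C, M) as dotted-quad strings from the 8-byte group ID."""
    core, m = struct.unpack("!II", raw)
    return str(ipaddress.IPv4Address(core)), str(ipaddress.IPv4Address(m))
```

Since C is part of the ID, two cores can hand out the same M without a collision, which is why no global multicast address allocation is needed.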
Simple Multicast Variants • (C,G) in join, not in data messages • requires unique G’s • what if disagreement about C for G? • (C,G) in both join and data • explicitly (e.g., IP option) • MPLS • use link-local destination address
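Link Local Destination Address [Figure: message exchange along the path A, R1, R2, C] • Join C,G is sent on each of the three hops • Each hop is acknowledged with a link-local address: Ack C,G, use X1; Ack C,G, use X2; Ack C,G, use X3 • Data is then sent hop by hop with dest=X1, dest=X2, dest=X3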
Simple Multicast Variants, Cont’d • Express • 8-byte group ID (S,G) • Unidirectional Tree • If multiple senders • create multiple trees • tunnel to S
Issues (with good answers) • Access Control: controlling who sends by configuring “one” node • Reliability if core goes down • Backward compatibility (migrating nodes one at a time)
“Access Control” • Suppose you want to restrict senders? • Express: S can choose not to forward from others • PIM: RP can be configured with authorized senders. Refuse to forward. (but members below 1st hop router will receive pkts) • SM: Core can be configured, and tell others in heartbeat
Access Control, Cont’d • What if list doesn’t fit in the heartbeat msg? • Only say no S “if needed” (after bad S sends) • Only say yes S if needed (S tunnels to core or asks permission of core) • Can have list of yes’s, no’s, or both
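A router applying the yes and no lists from a heartbeat might decide as in the sketch below. The precedence when both lists are present (a "no" entry wins, and a missing "yes" list means anyone not denied may send) is an assumption of this example, as is the function name `may_send`.

```python
def may_send(source, allowed=None, denied=None):
    """Decide whether to forward data from `source` for one group.

    allowed: set of permitted senders, or None if the core sent no yes-list.
    denied:  set of refused senders, or None if the core sent no no-list.
    """
    if denied is not None and source in denied:
        return False          # assumed: an explicit "no" always wins
    if allowed is not None:
        return source in allowed
    return True               # no yes-list: everyone not denied may send
```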
Multiple Groups for Availability • Rather than “backup core”, just create multiple groups (C1,M1), (C2,M2) and members join both • Transmit on one (one where you’re getting heartbeat). Receive on both • Or if application requires absolute timeliness, transmit on both • Also, create multiple groups for load sharing
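The sender-side choice between (C1,M1) and (C2,M2) can be sketched as picking the first group whose core heartbeat is still fresh. The timeout value, the timestamp bookkeeping, and the name `choose_transmit_group` are assumptions for illustration only.

```python
def choose_transmit_group(groups, last_heartbeat, now, timeout):
    """Return the first group whose heartbeat was seen within `timeout` seconds.

    groups:         ordered list of (C, M) pairs the member has joined
    last_heartbeat: dict mapping (C, M) to the time a heartbeat was last heard
    Returns None when no core appears to be alive.
    """
    for group in groups:
        seen = last_heartbeat.get(group)
        if seen is not None and now - seen <= timeout:
            return group
    return None
```

A sender that needs absolute timeliness would skip this choice and transmit on every group in the list.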
Multiple Groups • Interdomain policy might require a tree per source domain • Create a single tree for each domain rather than one per source in that domain • Can use shared tree like RP: if an extra auxiliary tree is created, have it advertised via heartbeat
Distributed Cores • If you really want failover to another core • Have protocol among core capable routers • They advertise among themselves • Winner injects host route • Will be less overhead than PIM BSR protocol advertising throughout domain
Backward Compatibility • Simplest: look different so other multicast protocols won’t forward the packet • Assume incremental deployment • Join sent to Core. Unicast by non-SM rtrs • Data destination=core or M or tunnel endpoint
Automatically discovering Tunnel • R1 sends “join”. Destination=core • Forwarded until it reaches R2 • R2 notes pkt rcv’d from non-neighbor R1 • Adds “tunnel port” to R1 to state for (C,M) • Sends join-ack to R1 • R1 creates “tunnel port” to R2 as parent port for (C,M)
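The join and join-ack exchange can be modeled with two small router objects. This is a sketch only: the class `SMRouter`, the `tunnel:` port naming, and the in-memory message passing are all invented for the example, and forwarding through the intermediate non-SM routers is not modeled.

```python
class SMRouter:
    """Toy Simple Multicast router that learns tunnel ports from joins."""

    def __init__(self, name, neighbors):
        self.name = name
        self.neighbors = set(neighbors)  # directly attached routers
        self.child_ports = {}            # (C, M) -> ports toward members
        self.parent_port = {}            # (C, M) -> port toward the core

    def _port_for(self, peer):
        # A peer that is not a direct neighbor is reached through a tunnel.
        return peer if peer in self.neighbors else f"tunnel:{peer}"

    def receive_join(self, sender, group):
        """Record the sender as a child and return our name for the join-ack."""
        self.child_ports.setdefault(group, set()).add(self._port_for(sender))
        return self.name

    def receive_join_ack(self, acker, group):
        """Record whoever acked the join as the parent for this group."""
        self.parent_port[group] = self._port_for(acker)
```

With R1 and R2 separated by a non-SM router, R2 ends up with a child port `tunnel:R1` and R1 with a parent port `tunnel:R2`, matching the steps on the slide.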
Tunnel needed [Figure: routers R1, R2, R3 and r, core C, and nodes A, B, D] • R1 -- R2 and R2 -- C are “tunnels” • IP option contains both C and M • IP destination address has C or tunnel endpoint or M
New Protocol or New Version of Existing Protocol? • No reason to do “totally new thing” • Two suggestions: bidirectional shared trees, and group ID=(C,G) • The suggestions are orthogonal • CBT and BGMP already do bidirectional trees. PIM could be modified to do it • Easy to modify any of them to get core from pkt
Summary • Shared bidirectional trees • fewer trees to keep track of and maintain • more efficient than tunneling to core • Group ID C+M • trivial address allocation • no extra info for BGP to pass around • no “core capable router advertisements” • controlled selection of core for group
Summary • This stuff doesn’t have to be so complicated • It would be good for Internet if multicast really could allow millions of groups, easily formed by anyone