Network Protocols: Design and Analysis Polly Huang EE NTU http://cc.ee.ntu.edu.tw/~phuang phuang@cc.ee.ntu.edu.tw
Multicast Routing [Deering88b]
Key ideas • lays foundation for IP multicast • defines IP service model • ex. best effort, packet based, anon group • compare to ISIS with explicit group membership, guaranteed ordering (partial or total ordering) • several algorithms • extended/bridged LANs • distance-vector extensions • link-state extensions • cost analysis
Why Multicast • save bandwidth • anonymous addressing
Characterizing Groups • pervasive or dense • most LANs have a receiver • sparse • few LANs have receivers • local • inside a single administrative domain
Service Model • same delivery characteristics as unicast • best effort packet delivery • open-loop (no built-in congestion/flow control) • scoping as control mechanism • groups identified by a single IP address • group membership is open • anyone can join or leave • do security at higher levels
Routing Algorithms • single spanning tree • for bridged LANs • distance-vector based • link-state based
Distance-vector Mcast Rtg • Basic idea: flood and prune • flood: send info about new sources everywhere • prune: routers will tell us if they don’t have receivers • routing info is soft state; periodically re-flood (and prune) to refresh this info • if no refresh, then the info goes away => easy fault recovery
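A minimal sketch of the flood-and-prune soft state described above, assuming per-(source, group, interface) prune entries with an illustrative lifetime; the class and helper names are made up for exposition, not DVMRP's actual data structures:

```python
import time

PRUNE_LIFETIME = 120.0  # seconds before a prune expires and flooding resumes (illustrative value)

class DvmrpRouter:
    """Toy model of DVMRP-style flood-and-prune soft state (not a real implementation)."""

    def __init__(self, interfaces):
        self.interfaces = interfaces   # all attached interfaces
        self.pruned = {}               # (source, group, iface) -> expiry time

    def receive_prune(self, source, group, iface):
        # Soft state: just record an expiry time; no explicit teardown is ever needed.
        self.pruned[(source, group, iface)] = time.time() + PRUNE_LIFETIME

    def outgoing_interfaces(self, source, group, in_iface):
        now = time.time()
        out = []
        for iface in self.interfaces:
            if iface == in_iface:
                continue               # never send back out the interface the packet came in on
            expiry = self.pruned.get((source, group, iface))
            if expiry is not None and expiry > now:
                continue               # downstream said "no receivers" and the prune is still fresh
            out.append(iface)          # absent or expired prune => flood again
        return out
```

If a prune is not refreshed it simply times out and data is flooded again on that branch, which is why fault recovery falls out for free.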
Example Topology (figure: a source s and several LANs with group members g)
Phase 1: Flood using Truncated Broadcast (figure: the source s floods toward all LANs; truncated broadcast: a router that knows it has no group members on its LAN does not broadcast onto that LAN)
Phase 2: Prune (figure: routers with no downstream members send prune (s,g) messages back toward the source)
Phase 3: Graft (figure: a new member sends report (g) on its LAN; its router sends graft (s,g) messages upstream to rejoin the tree)
Phase 4: Steady State (figure: data from s flows only along branches that lead to group members g)
Sending Data in DVMRP • Data packets are sent on all branches of the tree • send on all interfaces except the one they came in on • RPF (Reverse Path Forwarding) Check: • drop packets that arrive on an interface other than the one used to reach the source via unicast • why? suppress errant packets
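The RPF check reduces to one comparison against the unicast routing table. A sketch, where `unicast_next_hop_iface` stands in for a (hypothetical) lookup of the interface used to reach an address via unicast:

```python
def rpf_check(packet_source, arrival_iface, unicast_next_hop_iface):
    """Reverse Path Forwarding: accept a multicast packet only if it arrived on the
    interface this router would itself use to reach the packet's source via unicast.
    A False result means the packet is errant (or looping) and should be dropped."""
    return arrival_iface == unicast_next_hop_iface(packet_source)
```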
DVMRP Pros and Cons • Pros: • simple • works well with many receivers. why? overhead is per-sender, receivers are passive • Cons: • works poorly with many groups (why? every sender in every group floods the nets) • works poorly with sparse groups (why? flood data everywhere and then prune back, expensive if only needed some places)
Link-state Multicast Routing • Basic idea: treat group members (receivers) as new links • flood info about them to everyone in LSA messages (just like ordinary link-state routing) • Compute next-hop for mcast routes on-demand (lazily) • unlike link-state unicast, where routes are recomputed as soon as an LSA arrives • realized as MOSPF
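A sketch of the lazy computation, assuming a caller-supplied `compute_tree(source, group)` that runs the shortest-path computation over the link-state database; the cache-on-demand behavior is the only MOSPF-specific idea shown:

```python
class MospfForwardingCache:
    """Sketch of MOSPF's lazy per-(source, group) computation: forwarding entries are
    built only when data actually arrives, and flushed when a new LSA changes
    topology or membership."""

    def __init__(self, compute_tree):
        self.compute_tree = compute_tree   # assumed: Dijkstra over the link-state DB
        self.cache = {}                    # (source, group) -> outgoing interface list

    def lookup(self, source, group):
        key = (source, group)
        if key not in self.cache:          # first data packet from this source for this group
            self.cache[key] = self.compute_tree(source, group)
        return self.cache[key]

    def on_lsa(self):
        self.cache.clear()                 # recompute lazily as later packets arrive
```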
(figure: source S1, routers Z/X/Y, receivers R1/R2) Link state: each router floods link state advertisements. Multicast: add membership information to the "link state". Each router computes a multicast tree for each active source and builds a forwarding entry with an outgoing interface list.
Z has the network map, including membership at X and Y. Z computes the shortest path tree from S1 to X and Y (when it gets a data packet for G) and puts it in its routing table. W, Q, R each do the same thing as data arrives at them. (figure: source S1, routers Z/W/Q/X/R/Y, receivers R1/R2)
A link state advertisement with new topology may require re-computation of the tree and forwarding entry (only Z and W send new LSA messages, but all routers on the path recompute). (figure: same topology as above)
A link state advertisement (from T) with new membership (R3) may require incremental computation and addition of an interface to the outgoing interface list (at Z). (figure: same topology, plus new router T and new receiver R3)
MOSPF Pros and Cons • Pros: • simple add-on to OSPF • works well with many senders. why? no per-sender state • Cons: • works poorly with many receivers (why? per-receiver costs) • works poorly with sparse groups (why? lots of info goes places that don't want it) • works poorly with large domains (why? link-state scales with the number of links; many links cause frequent changes)
PIM [Deering96a]
Key ideas • want a mcast routing protocol that works well with sparse users • use a single shared tree; fix one router as the rendezvous point
Why not just DVMRP or MOSPF? • With sparse groups, both are expensive • DVMRP problem with many senders • MOSPF problem with many receivers • neither works well with sparse groups • Solution: PIM-SM • use rendezvous point as a place to meet • but downsides: • single point of failure • don't necessarily get shortest path • also concerned about "concentration" of all data going through rendezvous point
New Design Questions • Where to place RP? • How to make the RP robust? • don’t want a single point of failure • How to build the tree given an RP? • How to send data with a shared tree? • What is the overhead of going through RP (a shared tree)? • How to switch from shared tree to SPT?
Where to place RP? • RP is a node to which people send join messages • place it in the core • placing it at the edge is more expensive since traffic must go through it • optimal placement is NP-hard
Robustness • single RP is single point of failure, so must have backup plan • approach: • start with a set of cores • hash the group name to form an ordered list • basic idea: order RPs, hash(G) selects one, use it • if it fails, hash(G) to find the next one • if everyone uses the same hash function, people find the same RPs
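One plausible way to realize "hash(G) selects one, use it": rank the candidate RPs by a hash of (group, candidate) and take the first live one, so every router with the same candidate set and hash function picks the same RP and fails over to the same next one. The `alive()` liveness test and the specific hash are assumptions for the sketch:

```python
import hashlib

def select_rp(group, candidates, alive):
    """Order candidate RPs deterministically by hash(group, candidate) and return the
    first one that is reachable; all routers make the same choice independently."""
    ranked = sorted(candidates,
                    key=lambda rp: hashlib.sha1(f"{group}:{rp}".encode()).hexdigest())
    for rp in ranked:
        if alive(rp):
            return rp
    return None   # no candidate reachable

# e.g. select_rp("224.1.2.3", ["rp-a", "rp-b", "rp-c"], alive=lambda rp: rp != "rp-b")
```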
Building the Shared Tree • Simply send a message towards the RP • use the unicast routing table to get there • Add links to the tree as you go • Stop if you get to a router that's already in the tree • Gets reverse shortest path to RP
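A sketch of that join process, assuming toy router objects that carry a per-group outgoing-interface set (`oif`) and a hypothetical `next_hop_toward()` unicast-table lookup:

```python
def send_join(group, start_router, iface_from_receiver, rp, next_hop_toward):
    """Forward a (*, G) join hop by hop toward the RP along the unicast route, adding the
    branch the join arrived on to each router's outgoing-interface set, and stopping at
    the first router already on the tree (or at the RP itself).
    next_hop_toward(router, rp) is assumed to return
    (upstream_router, iface_on_which_that_router_hears_the_join)."""
    router, iface = start_router, iface_from_receiver
    while True:
        already_on_tree = bool(router.oif.get(group))
        router.oif.setdefault(group, set()).add(iface)   # install (*, G) state for this branch
        if already_on_tree or router is rp:
            return                                       # reverse shortest path to the RP done
        router, iface = next_hop_toward(router, rp)
```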
PIM Example: build shared tree (figure: R1, R2, R3 send Join messages toward the RP; (*, G) state is installed at each hop, giving the shared tree after R1, R2, R3 join)
PIM: Sending Data • If you are on the tree, you just send it as with other mcast protocols • it follows the tree • If you are not on the tree (say, you're a sender but not a group member), the packet is tunneled to the RP, which sends it down the tree • this is why central placement of the RP is important
PIM Example: sending data on the tree (figure: R4 sends data; the (*, G) entries at each router forward it along the shared tree through the RP to the other receivers)
Sending data if not on the tree (figure: S1 unicasts the encapsulated data packet to the RP in a Register message; the RP decapsulates and forwards it down the shared tree)
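A sketch of the two sides of that exchange (the sender's first-hop router and the RP); the function and field names are illustrative, not the PIM-SM wire format:

```python
def first_hop_handle_data(pkt, group, rp, on_tree, forward_on_tree, unicast_send):
    """Sender's first-hop router: forward natively if it already has tree state for the
    group, otherwise tunnel the packet to the RP inside a Register message."""
    if on_tree(group):
        forward_on_tree(group, pkt)
    else:
        unicast_send(rp, {"type": "Register", "group": group, "inner": pkt})

def rp_handle_register(msg, forward_on_tree):
    """RP side: decapsulate the Register and send the inner packet down the (*, G) tree."""
    forward_on_tree(msg["group"], msg["inner"])
```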
What is the cost of the shared tree? • Some data goes further than it should • but latency is bounded to 2x SPT • All data goes on one tree, rather than on many trees • but no guarantee you get multiple paths with source-specific trees • But to optimize things, PIM-SM supports source-specific trees
Build source-specific tree (figure: for a high data rate source S1, receivers' routers send (S1, G) Join messages toward S1, installing (S1, G) state alongside the existing (*, G) shared-tree state from the RP)
Forward packets on the "longest-match" entry (figure: the source-specific (S1, G) entry is a "longer match" for source S1 than the shared-tree (*, G) entry, which can be used by any source; routers holding both use (S1, G) for S1's traffic and (*, G) for everything else)
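The longest-match rule amounts to preferring an (S, G) entry over the (*, G) entry for the same group. A sketch with an assumed forwarding table keyed by (source, group) and ("*", group), and made-up interface names:

```python
def lookup_entry(source, group, table):
    """Prefer the source-specific (S, G) entry; fall back to the shared-tree (*, G) entry."""
    return table.get((source, group)) or table.get(("*", group))

table = {("S1", "G"): ["if2"], ("*", "G"): ["if1", "if2"]}
lookup_entry("S1", "G", table)   # -> ["if2"]         source-specific tree for S1
lookup_entry("S2", "G", table)   # -> ["if1", "if2"]  shared tree, usable by any source
```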
SPT and Shared Trees • Many more details to be careful about • need to handle switchover from shared-tree to SPT gracefully • need to support pruning for both SPT and shared-tree • and have to worry about LANs with multiple routers, multiple senders, etc. • Uses similar protocols (soft-state, refresh, etc.), but lots of details
PIM-SM observations • does a good job at intra-domain mcast routing that scales to • many senders • many receivers • many groups • large bandwidth • preserves original (simple) service model • but quite complex • but actually implemented today
Multi-AS Mcast Routing • Fine, PIM-SM (or DVMRP or MOSPF) work inside an AS, what about between ASes? • lots of policy questions • and have to show ISPs why they should deploy (how they can make money :-) • and convince them the world won’t end • multicast, that’s for high-bandwidth video, right? • multicast can flood all my links with data, right? • what apps, again?
MSDP • Support for inter-domain PIM-SM • Temporary solution • Basic approach: • send all sources to all ASes (like original flood-and-prune) • AS border routers are PIM-SM RPs for their domain
But does this seem complicated? • some people thought so • and commercial deployment has been slow • if we change the service model, maybe we can greatly simplify things • and make it easier for ISPs to understand how to change/manage mcast • EXPRESS
Express [Holbrook99a]
Key ideas • use channels: a single sender, many subscribers • makes mcast tree easier to config • easier to tell who can send • add mechanism to let you count subscribers • easier to think about billing • goal: define a simpler model
Multicast Problems • need billing mechanism • need to know number of subscribers • need access control • need to limit who can send and subscribe • ISPs concerned about mcast • IPv4 mcast addresses too limited • current protocols too complex • single source multicast
Express vs. Multicast Problems • need billing mechanism • record sources • count receivers • need access control • only subscriber can send • IPv4 mcast addresses too limited • address on source and group
Express Approach • all addresses are source specific (S,E) • 2^24 channels per source (2^32 sources) • access control • only source can send • channels optionally protected by "key" (really just a secret) • sub-cast support (encapsulate pkt to any router on the tree [if you know who they are]) • best-effort counting service
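A sketch of the source-specific access check implied above, assuming a channel named by the pair (S, E) and an optional shared secret; field names are illustrative, not the paper's wire format:

```python
def accept_packet(pkt_src, pkt_channel, pkt_key, chan_src, chan_id, chan_key=None):
    """Accept a packet on an EXPRESS-style channel (S, E): the destination must be this
    channel, only the channel's source S may send on it, and an optional secret ("key")
    can additionally gate access."""
    if pkt_channel != chan_id:
        return False
    if pkt_src != chan_src:
        return False                 # source-specific: every other sender is rejected
    if chan_key is not None and pkt_key != chan_key:
        return False                 # channel optionally protected by a secret
    return True
```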
Express Components • ECMP: Express Count Mgt Protocol • like IGMP, but also adds count support • counts used to determine receivers or for other things like voting • not clear how general • session relays • service at source that can relay data on to tree (similar to PIM tunneling)
Observations • Simpler? yes • Enough to justify mcast to ISPs? not clear
Another Alternative: Application-level Multicast • if the ISPs won’t give us multicast, we’ll take it :-) • just do it all at the app • results in some duplicated data on links • and app doesn’t have direct access to unicast routing • but can work… (ex. Yoid project at ISI)
Application-level Multicast Example (figure: the source Src sends copies to end-host relays, which forward them on; some links carry duplicate data)
App-level Multicast • Simplest approach: • send data to central site that forwards • Better approaches: • try to balance load on any one link • try to topologically cluster relays