Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I. T. Rowstron IEEE Journal on Selected Areas in Communications, October 2002
Outline • Pastry • A peer-to-peer location and routing substrate • Scribe • Built on top of Pastry • Experimental evaluation • Delay penalty • Node stress (routing tables) • Link stress (network bandwidth)
Pastry (1/2) • Each Pastry node has a unique, 128-bit nodeId. • The set of existing nodeIds is uniformly distributed. • This is achieved by basing the nodeId on a secure hash of the node’s public key or IP address.
Pastry (2/2) • Each node maintains • A routing table (entries for some of the live nodes) • Each entry maps a nodeId to the associated node’s IP address. • The IP addresses of the nodes in its “leaf set” • Leaf set (l nodes in total) • The set of nodes with • the l/2 numerically closest larger nodeIds • the l/2 numerically closest smaller nodeIds
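The leaf-set rule above can be sketched as follows. This is a simplified illustration using small integer nodeIds (the paper uses 128-bit IDs) and it ignores wraparound on the circular nodeId space; `leaf_set` is a hypothetical helper, not part of Pastry's actual API.

```python
def leaf_set(my_id, live_ids, l=4):
    """Return the l/2 numerically closest smaller and l/2 closest larger nodeIds."""
    smaller = sorted(i for i in live_ids if i < my_id)[-l // 2:]  # closest below
    larger = sorted(i for i in live_ids if i > my_id)[:l // 2]    # closest above
    return smaller + larger

print(leaf_set(100, [10, 50, 90, 99, 101, 120, 200], l=4))
# [90, 99, 101, 120]
```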
Routing • Given a message and a key, Pastry reliably routes the message to the node with the nodeId that is numerically closest to the key among all live nodes. • In each routing step, the current node normally forwards the message to a node whose nodeId shares a longer prefix with the key. • The key can be different from the destination nodeId.
Figure: routing a message from node 65a1fc with key d46a1c.
Locality properties • Short routes property • Concerns the total distance that messages travel along Pastry routes. • In each step, a message is routed to the nearest node with a longer prefix match. • Route convergence property • Concerns the distance traveled by two messages sent to the same key before their routes converge.
Node addition • A new node with nodeId X can initialize its state by contacting a nearby node A. • A will route a special message using X as the key. • This message is routed to the existing node Z with the nodeId numerically closest to X. • X then obtains • the leaf set • the routing table from Z. • Z is the nearest node, so their leaf sets are almost the same. • Their routing tables are very similar.
Failure • To handle node failures, neighboring nodes in the nodeId space periodically exchange keep-alive messages. • If a node is unresponsive for a period T, it is presumed failed. • All members of the failed node’s leaf set are then notified, and they update their leaf sets. • Routing table entries that refer to the failed node are repaired lazily.
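The timeout rule above can be sketched as follows, assuming each node records the last keep-alive timestamp per leaf-set neighbor; the function name and data layout are illustrative, not from the paper.

```python
import time

def failed_neighbors(last_seen, T, now=None):
    """Return neighbors silent for longer than T seconds (presumed failed)."""
    now = time.monotonic() if now is None else now
    return [n for n, t in last_seen.items() if now - t > T]

# 0111 has been silent for 40 s, 1001 for only 10 s:
print(failed_neighbors({"0111": 100.0, "1001": 130.0}, T=30, now=140.0))
# ['0111']
```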
Scribe • Scribe uses Pastry to manage group creation and group joining, and to build a per-group multicast tree. • Implementation • CREATE • JOIN • MULTICAST • LEAVE
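The JOIN mechanism can be sketched as follows: the JOIN message is routed toward the groupId, every node it visits adds the previous hop to its children table for the group (becoming a forwarder), and routing stops at the first node already in the tree. This is a simplified sketch that abstracts Pastry routing as a precomputed path of nodes (`route_path`); the helper name is illustrative.

```python
def handle_join(new_member, route_path, children):
    """children: dict mapping node -> set of child nodes for one group."""
    prev = new_member
    for node in route_path:            # nodes visited en route to the rendezvous point
        already_in_tree = node in children
        children.setdefault(node, set()).add(prev)
        if already_in_tree:            # the tree already reaches here: stop the JOIN
            break
        prev = node

children = {"1100": {"1111"}}          # 1100 is the rendezvous point for this group
handle_join("0111", ["1001", "1100"], children)
print(children)
# 1001 becomes a forwarder for 0111; 1100 gains child 1001
```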
Multicast tree creation. Figure: node 1100 issues CREATE; nodes 0111 and 0100 send JOIN messages, and nodes 1001, 1111, and 1101 become forwarders. With b = 1 (nodeIds are matched one bit at a time), both 1111 and 1101 can serve as a forwarder.
Membership • Rendezvous point • The root of the multicast tree. • Can be changed. • Forwarder • Scribe nodes that are part of a group’s multicast tree. • They may or may not be members of the group. • Each forwarder maintains a children table.
Multicast message dissemination • Multicast sources use Pastry to locate the rendezvous point of a group. • They route to the rendezvous point and ask it to return its IP address. • They cache the rendezvous point’s IP address and use it in subsequent multicasts to the group. • Multicast messages are disseminated from the rendezvous point along the multicast tree. Why a single rendezvous point? Each multicast source could also be viewed as a root, but if each source transmitted data along its own tree, the worst-case delay penalty could double.
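Dissemination along the tree can be sketched as follows: the message travels from the rendezvous point down the per-group tree, and each forwarder relays it to the entries in its children table. The names here are illustrative.

```python
def disseminate(root, children, deliver):
    """Walk the multicast tree from the root, invoking deliver(node) at each hop."""
    stack = [root]
    while stack:
        node = stack.pop()
        deliver(node)                      # forwarders that are also members consume it
        stack.extend(children.get(node, ()))

received = []
tree = {"1100": {"1001"}, "1001": {"0111", "0100"}}
disseminate("1100", tree, received.append)
print(sorted(received))
# ['0100', '0111', '1001', '1100']
```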
Reliability • Each nonleaf node in the tree sends a heartbeat message to its children. • A child suspects that its parent is faulty when it fails to receive heartbeat messages. • Upon detecting the failure of its parent, a node calls Pastry to route a JOIN message to a new parent. • If the failed node is the root, a new root (the live node with the numerically closest nodeId to the groupId) replaces it.
Experimental evaluation • Compare with IP multicast • Delay penalty • Node stress • Link stress • Experimental setup • A network topology with 5,050 routers • Scribe runs on 100,000 end nodes. • 1,500 groups
Delay penalty • Scribe increases the delay to deliver messages relative to IP multicast. • RMD • The ratio between the maximum delay using Scribe and the maximum delay using IP multicast. • RAD • The ratio between the average delay using Scribe and the average delay using IP multicast.
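The two metrics above reduce to simple ratios. A minimal sketch, assuming per-member delay samples measured once over Scribe and once over IP multicast (the numbers below are illustrative, not from the paper):

```python
def rmd(scribe_delays, ip_delays):
    """Ratio of maximum Scribe delay to maximum IP-multicast delay."""
    return max(scribe_delays) / max(ip_delays)

def rad(scribe_delays, ip_delays):
    """Ratio of average Scribe delay to average IP-multicast delay."""
    return (sum(scribe_delays) / len(scribe_delays)) / (sum(ip_delays) / len(ip_delays))

scribe, ip = [40, 60, 100], [30, 40, 50]
print(rmd(scribe, ip))  # 2.0
print(rad(scribe, ip))  # (200/3) / 40 ≈ 1.667
```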
Delay penalty. Figure: the number of groups with a RAD or RMD lower than or equal to a given relative delay (Scribe relative to IP multicast).
Node stress (2/2). Figure: the distribution has a long tail, but on average each node maintains only a few children-table entries.
Link stress. Figure: maximum link stress is 950 for IP multicast versus 4,031 for Scribe.
Bottleneck remover (1/3) • Reasons • Some nodes may have less computational power or bandwidth available than others. • The distribution of children table entries has a long tail. • Algorithm • When a node is overloaded, it selects the group that consumes the most resources. • It chooses the child in this group that is farthest away.
Bottleneck remover (2/3) • The parent drops the chosen child by sending it a message containing the children table for the group. • When the child receives the message, • It measures the delay between itself and the other nodes in the table. • It computes the total delay between itself and the parent via each node in the table. • It sends a JOIN message to the node that provides the smallest combined delay.
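The re-join choice above can be sketched as follows: given the parent's children table, the dropped child picks the node minimizing delay(child, node) + delay(node, parent). The delay values and names below are illustrative.

```python
def pick_new_parent(delay_to_node, delay_node_to_parent):
    """Choose the node minimizing the combined child->node->parent delay."""
    return min(delay_to_node,
               key=lambda n: delay_to_node[n] + delay_node_to_parent[n])

delay_to_node = {"A": 10, "B": 25, "C": 5}         # measured by the dropped child
delay_node_to_parent = {"A": 30, "B": 5, "C": 50}  # carried in the parent's message
print(pick_new_parent(delay_to_node, delay_node_to_parent))
# B  (combined delay 25 + 5 = 30, smaller than A's 40 or C's 55)
```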
Node stress. Figure: with the bottleneck remover, the distribution no longer has a long tail.
Scalability • Evaluating Scribe’s scalability with a large number of groups. • Experimental setup • 50,000 Scribe nodes • 30,000 groups with 11 members each
Node stress (1/2) Collapse will be introduced later.
Node stress (2/2). Figure: the distribution again has a long tail; without modification, Scribe is inefficient for small groups.
Scribe collapse (1/2) • If a multicast group has few members, the group may require many other nodes to become forwarders. (The tree is inefficient.) • The new algorithm collapses long paths in the tree. • It removes nodes that are not members of the group and have only one entry in the group’s children table.
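The collapse rule above can be sketched as follows: any node that is not a group member and has exactly one child is spliced out, with its lone child re-attached to its parent. This is a simplified in-memory sketch (names illustrative), not the paper's distributed protocol.

```python
def collapse(children, members):
    """children: dict mapping node -> set of children for one group's tree."""
    changed = True
    while changed:
        changed = False
        for kids in children.values():
            for kid in kids:
                sub = children.get(kid)
                if kid not in members and sub is not None and len(sub) == 1:
                    kids.remove(kid)
                    kids.update(sub)     # splice the lone child up one level
                    del children[kid]
                    changed = True
                    break                # restart: the structures were mutated
            if changed:
                break

# root and m1 are members; f1 and f2 are single-child non-member forwarders:
tree = {"root": {"f1"}, "f1": {"f2"}, "f2": {"m1"}}
collapse(tree, members={"root", "m1"})
print(tree)
# {'root': {'m1'}}
```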
Link stress. Figure: link stress compared across IP multicast, naïve unicast, Scribe, and Scribe with collapse.