Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I.T. Rowstron. Presented by Yu Feng and Elizabeth Lynch.
Introduction • Application-level multicast • Goals • Scalability • Failure tolerance • Low delay • Effective use of network resources
Pastry • P2P location and routing substrate • Provides: • Scalability • Large numbers of groups • Large numbers of multicast sources • Large numbers of members per group • Self-organization • Peer-to-peer location and routing • Good locality properties
Scribe • Application-level multicast infrastructure • Built on top of Pastry • Takes advantage of Pastry properties • Robustness • Self-organization • Locality • Reliability
nodeId • Each node is assigned a 128-bit nodeId • nodeIds are uniformly distributed • Each node maintains tables that map nodeIds to IP addresses • Approximately (2^b − 1) × ⌈log_{2^b} N⌉ + l table entries per node • O(log_{2^b} N) messages required to update routing state after a node joins or leaves the network
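To get a feel for these numbers (an illustrative configuration, not one from the paper): with b = 4 and N = 10^6 nodes, the routing table has ⌈log_16 10^6⌉ = 5 rows of 2^4 − 1 = 15 entries each, so with a leaf set of l = 16 a node keeps roughly 15 × 5 + 16 = 91 entries, a tiny amount of state for a million-node network.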
Routing Guarantees • A message with a given key is routed to the live node whose nodeId is numerically closest to the key • In a network of N nodes, the average number of routing steps to any node is less than ⌈log_{2^b} N⌉ • Delivery is guaranteed unless ⌊l/2⌋ or more nodes with adjacent nodeIds fail simultaneously
Routing Tables • nodeIds and keys are treated as sequences of digits in base 2^b • Each node's routing table has ⌈log_{2^b} N⌉ rows with 2^b − 1 entries per row • Each entry in row n refers to a node whose nodeId matches the present node's nodeId in the first n digits but whose (n+1)th digit takes one of the 2^b − 1 other possible values • Among the candidates for an entry, the node closest to the present node according to the proximity metric is chosen
Leaf Sets • The l/2 nodes with the closest larger nodeIds and the l/2 nodes with the closest smaller nodeIds relative to the present node's nodeId • Each node maintains IP addresses for its entire leaf set
Routing Algorithm • The current node forwards the message to a node whose nodeId shares a prefix with the key at least one digit (b bits) longer than the prefix the current node shares • If no such node is known, it forwards to a node whose nodeId shares an equally long prefix but is numerically closer to the key
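A minimal sketch of this next-hop rule, with hypothetical RoutingTable and LeafSet helpers standing in for Pastry's actual state (the real implementation handles more corner cases, such as locality-aware tie-breaking):

```java
// Illustrative next-hop selection for prefix routing (b = 4, hex-digit nodeIds).
// RoutingTable and LeafSet are hypothetical stand-ins for Pastry's state.
interface RoutingTable {
    String lookup(int row, char digit);                    // entry at (row, digit), or null
    String closerWithSamePrefix(String localId, String key, int prefixLen);
}

interface LeafSet {
    boolean covers(String key);                            // is key within the leaf-set range?
    String numericallyClosest(String key);                 // may be the local node itself
}

public class PrefixRouter {
    static int sharedPrefixLength(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Returns the next hop's nodeId, or localId if the message is delivered here.
    static String nextHop(String localId, String key, RoutingTable table, LeafSet leaf) {
        if (leaf.covers(key)) {
            return leaf.numericallyClosest(key);           // final hop(s)
        }
        int p = sharedPrefixLength(localId, key);
        String candidate = table.lookup(p, key.charAt(p)); // grows the shared prefix by one digit
        if (candidate != null) return candidate;
        // Rare case: no matching table entry; fall back to any known node with an
        // equally long shared prefix that is numerically closer to the key.
        return table.closerWithSamePrefix(localId, key, p);
    }
}
```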
Locality • Based on a proximity metric provided by the application (e.g., IP routing hops) • Locality properties relevant to Scribe: • Short routes • According to simulations: routes are only 1.59 to 2.2 times the length of the direct path between source and destination • Route convergence • According to simulations: the average distance traveled by two messages sent to the same key before their routes converge is approximately the distance between the two source nodes
Node Addition • The new node X picks a nodeId • X contacts a nearby node A • A routes a special join message with X's nodeId as the key • The message is routed to the node Z whose nodeId is numerically closest to X's • If X == Z, X must choose a new nodeId • X obtains its leaf set from Z • X obtains the ith row of its routing table from the ith node traversed on the route from A to Z • X notifies the appropriate nodes that it is now alive
Node Failure • Nodes with adjacent nodeIds periodically exchange keep-alive messages • If a node is silent for a period of time T, it is presumed failed • All members of the failed node's leaf set are notified; they remove the failed node from their leaf sets and update them with a replacement
Node Recovery • A recovering node contacts the nodes in its last known leaf set • Obtains their current leaf sets • Updates its own leaf set accordingly • Notifies the members of its new leaf set
Pastry API • nodeId = pastryInit(Credentials) • Causes the local node to join an existing Pastry network or start a new one • route(msg, key) • Routes msg to the node with the nodeId numerically closest to key • send(msg, IP-addr) • Sends msg directly to the node at IP-addr
Required Pastry Functions • deliver(msg, key) • Called when msg is received and the local node's nodeId is numerically closest to key among all live nodes, or when msg was transmitted via send() to the local node's IP address • forward(msg, key, nextId) • Called just before msg is forwarded to the node with nodeId = nextId • The application can change the msg contents or the nextId value • If nextId is set to NULL, msg terminates at the local node • newLeafs(leafSet) • Called whenever the local node's leaf set changes
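Taken together, the exported calls and the required upcalls suggest an interface along these lines (a Java paraphrase of the paper's API; the types and signatures here are assumptions, not FreePastry's actual classes):

```java
// Paraphrase of the Pastry API from the paper (hypothetical Java types;
// the real FreePastry interfaces are organized differently).
import java.net.InetAddress;

class Message {}      // stub types for illustration
class Credentials {}
class LeafSet {}

interface Pastry {
    byte[] pastryInit(Credentials credentials); // join or create a network; returns the nodeId
    void route(Message msg, byte[] key);        // route msg toward the live node closest to key
    void send(Message msg, InetAddress addr);   // send msg directly to a known IP address
}

// Upcalls that every application built on Pastry must implement.
interface PastryApplication {
    // msg arrived and this node is numerically closest to key among live nodes,
    // or msg was sent directly to this node's IP address via send().
    void deliver(Message msg, byte[] key);

    // Called just before forwarding; the application may rewrite msg and may
    // return a different nextId, or null to terminate the message here.
    byte[] forward(Message msg, byte[] key, byte[] nextId);

    // Called whenever the local node's leaf set changes.
    void newLeafs(LeafSet leafSet);
}
```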
Scribe Overview • Multicast application framework built on top of Pastry • Any Scribe node may create a group • Other nodes can then join the group and multicast messages to all of its members • Delivery is best effort; ordered delivery is not guaranteed
How? • A group is formed by building a multicast tree: the Pastry routes from each group member to a rendezvous point (the root of the tree) are merged • Multicast messages are sent to the rendezvous point for distribution • Pastry and Scribe are fully decentralized: all decisions are based on local information • This design provides reliability and scalability
Multicast Tree • Scribe creates a multicast tree rooted at the rendezvous point • Scribe nodes that are part of a multicast tree are called forwarders • A forwarder may or may not be a member of the group • Each forwarder maintains a children table • The table holds an entry (IP address and nodeId) for each of the forwarder's children in the multicast tree
Scribe API • create(credentials, groupId) • Creates a new group, using the credentials to control future access • join(credentials, groupId, messageHandler) • Joins the group with the specified groupId; messages multicast to the group are passed to messageHandler • leave(credentials, groupId) • Leaves the group with the specified groupId • multicast(credentials, groupId, message) • Multicasts the specified message to the group with the specified groupId
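As with Pastry, the four calls translate naturally into an interface (again a hypothetical Java rendering, not the authors' code):

```java
// Hypothetical Java rendering of the Scribe API described above.
class Credentials {}  // stub type for illustration

// Invoked for every message multicast to a group this node has joined.
interface MessageHandler {
    void handle(byte[] groupId, byte[] message);
}

interface Scribe {
    void create(Credentials credentials, byte[] groupId);
    void join(Credentials credentials, byte[] groupId, MessageHandler handler);
    void leave(Credentials credentials, byte[] groupId);
    void multicast(Credentials credentials, byte[] groupId, byte[] message);
}
```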
Scribe Implementation: Creating a Group • A Scribe node asks Pastry to route a CREATE message using the groupId as the key [e.g., route(CREATE, groupId)] • Pastry delivers the CREATE message to the node whose nodeId is numerically closest to the groupId • Scribe's deliver method is invoked on that node; it checks the credentials to ensure the group may be created and adds the new groupId to the list of groups it knows • This node becomes the rendezvous point for the newly created group
Scribe Implementation: Joining a Group • The joining node asks Pastry to route a JOIN message with the groupId as the key [e.g., route(JOIN, groupId)]; the message is routed towards the rendezvous point • At each node along the route, Pastry invokes Scribe's forward method • The node checks whether it is already a forwarder for the group • If it is, it simply adds the joining node as a child • If it is not, it creates a children table for the group, adds the joining node as a child, and then routes its own JOIN message with groupId as the key [e.g., route(JOIN, groupId)] to graft itself onto the tree • Finally, it terminates the routing of the JOIN message it received from the source
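To make the control flow concrete, here is a condensed sketch of the JOIN handling inside the forward upcall (hypothetical class and field names; credentials checks and error handling omitted):

```java
// Sketch of Scribe's JOIN handling (illustrative; the real implementation differs).
import java.util.*;

class ScribeNode {
    // groupId -> children table (child nodeId -> child IP address)
    Map<String, Map<String, String>> childrenTables = new HashMap<>();

    // Called from Pastry's forward upcall at each node along a JOIN's route.
    // Returning null terminates routing; returning nextId would let it continue.
    String onJoinForward(JoinMessage msg, String key, String nextId) {
        String groupId = msg.groupId;
        if (!childrenTables.containsKey(groupId)) {
            // Not yet a forwarder: create a children table, then send our own
            // JOIN toward the rendezvous point to graft this node onto the tree.
            childrenTables.put(groupId, new HashMap<>());
            routeJoinTowardRendezvous(groupId);
        }
        // Add the joining node as a child and stop the original message here;
        // this node now receives the group's traffic on the child's behalf.
        childrenTables.get(groupId).put(msg.sourceNodeId, msg.sourceAddr);
        return null;
    }

    void routeJoinTowardRendezvous(String groupId) { /* route(JOIN, groupId) stub */ }
}

class JoinMessage {
    String groupId, sourceNodeId, sourceAddr;
}
```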
Scribe Implementation: Leaving a Group • The leaving node records locally that it has left the group • If its children table for the group is now empty, it sends a LEAVE message to its parent in the tree • The parent removes the leaving node and repeats the same check, so the LEAVE propagates up the tree until it reaches a node whose children table is still non-empty after the removal
Multicast a Message • The source locates the rendezvous point for the group [e.g., route(MULTICAST, groupId)] and asks it to return its IP address • The source caches the IP address and uses it for future multicasts • If the rendezvous point changes or fails, the source uses Pastry again to find the new one • All multicast messages are disseminated from the rendezvous point down the tree
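A sketch of the dissemination step that each node in the tree performs, starting at the rendezvous point (hypothetical names; assumes children tables as sketched above):

```java
// Illustrative per-node dissemination of a multicast message down the tree.
import java.util.*;

class Disseminator {
    Map<String, Map<String, String>> childrenTables = new HashMap<>(); // groupId -> (childId -> IP)
    Set<String> memberGroups = new HashSet<>();                        // groups this node has joined

    // Each node, starting at the rendezvous point, forwards the message to its
    // children over direct connections and delivers locally if it is a member.
    void disseminate(String groupId, byte[] message) {
        for (String childAddr : childrenTables.getOrDefault(groupId, Map.of()).values()) {
            send(message, childAddr); // direct IP send; no Pastry routing needed
        }
        if (memberGroups.contains(groupId)) {
            deliverLocally(groupId, message); // invoke the group's messageHandler
        }
    }

    void send(byte[] message, String addr) { /* network send stub */ }
    void deliverLocally(String groupId, byte[] message) { /* messageHandler stub */ }
}
```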
Reliability of Scribe: Repairing the Tree • Periodically, each non-leaf node in the tree sends a heartbeat message to all of its children • When a child does not receive a heartbeat within a certain period of time, it assumes its parent has failed and sends a JOIN message with the group's identifier • Pastry routes the message to a new parent, thus repairing the multicast tree
Reliability of Scribe: Failure of the Rendezvous Point • The state of the rendezvous point is replicated across the k nodes closest to the root (a typical value of k is 5) • These k nodes are all children of the root • When the root fails, its immediate children detect the failure and rejoin through Pastry • Pastry routes the new JOIN messages to a new root (the live node whose nodeId is numerically closest to the groupId), which takes over the role of the rendezvous point
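A sketch of the repair loop from a child's perspective (the timeout value and helper names are assumptions, not from the paper):

```java
// Illustrative tree repair via heartbeat timeout.
import java.util.*;

class TreeRepair {
    Map<String, Long> lastHeartbeat = new HashMap<>(); // groupId -> last heartbeat time
    long timeoutMillis = 30_000;                       // assumed value, not from the paper

    void onHeartbeat(String groupId) {
        lastHeartbeat.put(groupId, System.currentTimeMillis());
    }

    // Run periodically: if the parent has gone quiet, rejoin through Pastry,
    // which routes the JOIN to a new parent, or to a new root if the old
    // rendezvous point itself has failed.
    void checkParent(String groupId) {
        long last = lastHeartbeat.getOrDefault(groupId, 0L);
        if (System.currentTimeMillis() - last > timeoutMillis) {
            route("JOIN", groupId); // Pastry finds a live ancestor or new root
        }
    }

    void route(String msgType, String key) { /* Pastry routing stub */ }
}
```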
Reliability of Scribe • Children table entries are discarded unless the child periodically sends an explicit message stating that it wants to remain in the table • The tree repair mechanism scales well: • Fault detection is done by sending messages to a small number of nodes • Recovery from faults is local, and only a small number of nodes (O(log_{2^b} N)) is involved
Scribe - Providing Additional Guarantees • Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections between nodes do not fail • Scribe offers a simple mechanism that lets applications implement stronger reliability guarantees: • forwardHandler(msg): invoked by Scribe before the node forwards a multicast message to its children • joinHandler(msg): invoked by Scribe after a new child is added to one of the node's children tables • faultHandler(msg): invoked by Scribe when a node suspects that its parent is faulty
Additional Reliability Example • forwardHandler • The root assigns a sequence number to each message • Multicast messages are buffered by the root and by each node in the multicast tree • Messages are retransmitted after the multicast tree is repaired • faultHandler • Adds the last sequence number n delivered by the node to the JOIN message sent out to repair the tree • joinHandler • Retransmits buffered messages with sequence numbers above n to the new child
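The three handlers compose into roughly the following scheme (a sketch under the assumptions above; message parsing and buffer truncation are left as stubs):

```java
// Illustrative sequence-number recovery built from Scribe's three upcall handlers.
import java.util.*;

class ReliableDelivery {
    long nextSeq = 0;                                    // used only at the root
    NavigableMap<Long, byte[]> buffer = new TreeMap<>(); // seq -> buffered message

    // forwardHandler: the root stamps each message; every node buffers it.
    byte[] onForward(byte[] message, boolean isRoot) {
        long seq = isRoot ? nextSeq++ : readSeq(message);
        buffer.put(seq, message);
        return message;
    }

    // faultHandler: piggyback the last delivered sequence number on the JOIN
    // sent to repair the tree.
    long onParentFault() {
        return buffer.isEmpty() ? -1 : buffer.lastKey();
    }

    // joinHandler: the new parent retransmits everything the child missed.
    void onNewChild(long lastSeqSeenByChild, ChildLink child) {
        for (byte[] msg : buffer.tailMap(lastSeqSeenByChild, false).values()) {
            child.send(msg);
        }
    }

    long readSeq(byte[] message) { return 0; /* parse stub */ }
    interface ChildLink { void send(byte[] msg); }
}
```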
Experimental Setup • Randomly generated network topology with 5,050 routers • Scribe was run on 100,000 end nodes randomly assigned to routers with uniform distribution • Ten different topologies were generated using different random seeds; results are averaged over all ten • Experiments covered a wide range of group sizes and a large number of groups • Size of the group with rank r: gsize(r) = ⌊N × r^(−1.25) + 0.5⌋ • Group membership was selected randomly with uniform distribution
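For example, plugging in N = 100,000: the most popular group (rank 1) has gsize(1) = 100,000 members, i.e., every node, while the smallest of the 1,500 groups (rank 1,500) has ⌊100000 × 1500^(−1.25) + 0.5⌋ = 11 members.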
Delay Penalty • Compares the delay of Scribe multicast against IP multicast • Measures the distribution of the delay to deliver a message to each member of a group • Two metrics: • RMD (relative maximum delay: the ratio of Scribe's maximum delay to IP multicast's) • 50% of groups have an RMD below 1.69 • Maximum = 4.26 • RAD (relative average delay) • 50% of groups have a RAD below 1.68 • Maximum = 2
Node Stress • Stress imposed by maintaining group state and by forwarding and duplicating packets at the end nodes rather than at routers • Measured as the number of groups with non-empty children tables and the number of children table entries per node • In the simulation with 1,500 groups: • Non-empty children tables per node: avg = 2.4, max = 40 • Children table entries per node: avg = 6.2, max = 1,059
Link Stress Experiment • Link stress was computed by counting the number of packets sent over each link when a message is multicast to each of the 1,500 groups • Total number of links: 1,035,295 • Total number of messages for Scribe: 2,489,824 • Total number of messages for IP multicast: 758,853 • Mean number of messages per link: • 2.4 for Scribe • 0.7 for IP multicast • Maximum link stress: • 4,031 for Scribe • 950 for IP multicast
Bottleneck Remover • When a node detects that it is overloaded, it selects the group that consumes the most resources • It then chooses the child in this group that is farthest away according to the proximity metric • The parent drops that child by sending it a message containing the children table for the group, along with the measured delay between each child and the parent • When the child receives the message, it does the following: • It measures the delay between itself and each other child in the received children table • It computes the total delay to the parent via each of those children • Finally, it sends a JOIN message to the node that provides the smallest combined delay, making that node its new parent
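The child's selection step could look like this (a sketch; measureDelay stands in for an actual round-trip measurement):

```java
// Illustrative new-parent selection by a child dropped during bottleneck removal.
import java.util.*;

class BottleneckRemover {
    // delayToParent: the dropping parent's children table for the group, mapping
    // each sibling's address to that sibling's measured delay to the parent.
    String chooseNewParent(Map<String, Double> delayToParent) {
        String best = null;
        double bestTotal = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : delayToParent.entrySet()) {
            // total delay to the old parent if we attach under this sibling
            double total = measureDelay(e.getKey()) + e.getValue();
            if (total < bestTotal) {
                bestTotal = total;
                best = e.getKey();
            }
        }
        return best; // the dropped child sends its JOIN to this node
    }

    double measureDelay(String nodeAddr) { return 0.0; /* round-trip probe stub */ }
}
```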
Bottleneck Remover Results • The mechanism introduces the potential for routing loops • When a loop is detected, the node sends another JOIN message to generate a new random route • The bottleneck remover bounds the number of entries in a node's children tables at the cost of increased link stress during joins • Average link stress increases from 2.4 to 2.7, and the maximum increases from 4,031 to 4,728
Scalability with Many Small Groups • 50,000 Scribe nodes • 30,000 Scribe groups with 11 nodes per group • The average number of children table entries per node is 21.2, compared to an average of only 6.6 for a plain (naïve) multicast • Average link stress: • 6.1 for Scribe • 1.6 for IP multicast • 2.9 for naïve multicast • Scribe's entry counts are higher because it builds trees containing long paths with no branching
Conclusion • Scribe is a fully decentralized, large-scale application-level multicast infrastructure built on top of Pastry • It is designed to scale to a large number of groups and large group sizes, and it supports multiple multicast sources per group • Scribe and Pastry's randomized placement of nodes, groups, and multicast roots balances load across nodes and keeps the multicast trees balanced • Scribe uses a best-effort delivery scheme but can be extended to satisfy stricter multicast requirements • Experimental results show that Scribe can efficiently support a large number of nodes and groups, and a wide range of group sizes, at a modest cost relative to IP multicast