190 likes | 370 Views
A Randomized Error Recovery Algorithm for Reliable Multicast. Zhen Xiao Ken Birman AT&T Labs – Research Cornell University. Challenges in Reliable Multicast. How to avoid message implosion?
E N D
A Randomized Error Recovery Algorithm for Reliable Multicast Zhen Xiao Ken Birman AT&T Labs – ResearchCornell University
Challenges in Reliable Multicast • How to avoid message implosion? • How to confine the impact of a message loss to the region where the loss has occurred? • How to avoid message duplication? • How to reduce error recovery latency?
Previous work • Scalable reliable multicast protocol (SRM) • Tree-based protocol (RMTP, TMTP, LBRRM) • Protocols with router support (PGM, LMS, Search Party)
q crash Failure of a Repair Server
Bimodal Multicast • Appeared in Transaction on Computer Systems, May 1999, by Kenneth P. Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, and Yaron Minsky. • Periodic exchange of message history to resolve inconsistency. • Bimodal delivery guarantees. • Steady throughput even when failures occur. But … • Do not use any hierarchical structure. • Message exchanges happen only periodically.
RRMP: Randomized Reliable Multicast Protocol Key idea: combine previous work on randomized error recovery with the Bimodal Multicast protocol and hierarchical error recovery similar to that employed by tree-based protocols. • Group receivers into a hierarchy. • Do not use any repair server. • parent region: the least upstream region of a receiver in the hierarchy. • Each receiver maintains group membership information about receivers in its region and receivers in its parent region.
Two-phase Error Recovery Assumea receiver p detects a message loss. • local loss: the loss affects a fraction of receivers in p’s region • regional loss: the loss affects all receivers in p’s region Local recovery phase: a receiver tries to recover the loss from randomly selected neighbors. Remote recovery phase: a randomly selected subset of members in the region request retransmissions from the parent region. Concurrent execution of local recovery and remote recovery: p doesn’t know how many members in its region missed the message.
Local Recovery Phase • p sends a request to a randomly selected member q in its region and sets a timer. • If q has the msg, it sends the msg to p. Otherwise it ignores the request. • When p times out, it sends a request to another randomly selected member. • As long as at least one local receiver has the msg, p is able to recover the loss eventually.
Remote Recovery Phase • p randomly selects a receiver r in its parent region. • p sends a request to r with a small probability. • p sets a timer regardless whether it sends a request or not. • If r has the msg, it sends the msg to p. Otherwise it records p’s request. • When p receives the msg, it multicasts the msg in its local region if the msg is not a duplicate. • When p times out, it randomly selects another receiver in its parent region and repeats the above.
Performance Analysis • Implosion avoidance: good • Robustness: good • Recovery latency: • Penalty due to randomization • Likely to get a repair from a neighbor • Repair duplication • Concurrent execution of two recovery phases
Simulation Results in NS2 • TRMP: Tree-based Reliable Multicast Protocol (target for comparison) • Topology: transit-stub networks generated by GT-ITM network generator. • transit-transit link: 45M bps • stub-stub link: 10M bps • transit-stub link: 8M bps • queue limit: 16 packets • Traffic: 1K byte messages, 50 message/sec • Loss pattern: congestion loss caused by background TCP traffic • transit-stub link: 0.71% -- 8.02%, median 4.29% • stub-stub link: 0% -- 1.29%, typically around 0.32% • transit-transit link: no loss
Load Balance request traffic repair traffic Comparison of request and repair traffic when group size increases
Load Balance (cont.) Comparison of repair traffic for a lossy receiver (group size: 160)
Conclusion • Efficient error recovery can be achieved using a hierarchy of regions without imposing any specific structure inside a region. • Better load balancing and robustness can be achieved by diffusing the responsibility of error recovery among all members in the group.