Fault Tolerance

Fault Tolerance CSCI 4780/6780

Reliable Group Communication • Reliable multicasting is important for several applications • Transport-layer protocols rarely offer reliable multicasting • What is reliable multicasting? • A message sent to the group should be delivered to every member • What happens if a process crashes (or joins) during multicasting? • Distinguish multicasting among non-faulty processes from multicasting in the presence of faulty processes

Basic Reliable Multicasting • Group is assumed to be stable (membership does not change) • Communication may be faulty • Underlying unreliable multicasting service • Straightforward if the number of processes is small • Sequence number for each message • Use acknowledgements • Either positive or negative • Retransmission on negative ack or on timeout • Positive acks scale poorly: the sender must collect one ack per receiver for every message
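
A minimal receiver-side sketch of the scheme above, assuming a single sender that numbers its messages 0, 1, 2, ... (the class and method names are hypothetical). The receiver delivers messages in sequence order, holds back out-of-order ones, and reports the gaps it detects as the sequence numbers to negatively acknowledge.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiver for basic reliable multicasting (one sender, seq numbers from 0)."""
    next_expected: int = 0                          # lowest seq number not yet delivered
    pending: dict = field(default_factory=dict)     # out-of-order msgs waiting for a gap to fill
    delivered: list = field(default_factory=list)   # payloads handed to the application, in order

    def on_message(self, seq: int, payload: str) -> list[int]:
        """Handle one incoming message; return the seq numbers to NACK (possibly empty)."""
        if seq < self.next_expected or seq in self.pending:
            return []                               # duplicate, e.g. a retransmission: ignore
        self.pending[seq] = payload
        # Deliver every message that is now contiguous with what was already delivered.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        # Any hole below the highest buffered seq number is a lost message.
        highest = max(self.pending, default=self.next_expected - 1)
        return [s for s in range(self.next_expected, highest) if s not in self.pending]
```

For example, receiving 0, then 2, returns `[1]` on the second call; once 1 arrives (for instance as a retransmission), messages 1 and 2 are delivered together.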

Basic Reliable-Multicasting Schemes • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail • Figure: (a) message transmission, (b) reporting feedback
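
A sender-side sketch of this simple scheme, assuming positive acks and a fixed set of known, non-faulty receivers (all names are hypothetical). The sender keeps each message in a history buffer until every receiver has acknowledged it, serves negative acks from that buffer, and on a timeout lists which receivers still need which messages.

```python
class Sender:
    """Hypothetical sender for the simple scheme: fixed, known, non-faulty receivers."""

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.next_seq = 0
        self.history = {}   # seq -> (payload, set of receivers that have not acked yet)

    def multicast(self, payload):
        """Assign the next sequence number and keep a copy until everyone acks it."""
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (payload, set(self.receivers))
        return seq

    def on_ack(self, receiver, seq):
        """Record a positive ack; drop the buffered copy once every receiver has acked."""
        if seq in self.history:
            _, waiting = self.history[seq]
            waiting.discard(receiver)
            if not waiting:
                del self.history[seq]

    def on_nack(self, seq):
        """Serve a negative ack. The NACKing receiver has not acked seq, so it is still buffered."""
        return self.history[seq][0]

    def on_timeout(self):
        """Return (receiver, seq, payload) triples that must be retransmitted."""
        return [(r, seq, payload)
                for seq, (payload, waiting) in sorted(self.history.items())
                for r in sorted(waiting)]
```

The buffer only works because the receiver set is known in advance; with unknown or failing receivers the sender could never decide when a copy is safe to discard.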

Positive vs. Negative Feedback • Can we do better than either of them? • A hybrid scheme combines the strengths of both while masking their drawbacks • Negative ack for each missing msg, but a positive ack only on every nth msg • A process that does not send its positive ack is sent all msgs of that cycle again
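
A sketch of the hybrid scheme, assuming a cycle of `n` messages after which each receiver is expected to send one positive ack (the class, its methods, and indexing NACKs by position within the cycle are all illustrative assumptions). Lost messages are repaired individually through NACKs; a receiver that stays silent at the end of the cycle is sent the whole cycle again.

```python
class HybridSender:
    """Hypothetical sender: NACKs per lost msg, one positive ack per cycle of n msgs."""

    def __init__(self, receivers, n):
        self.receivers = set(receivers)
        self.n = n
        self.cycle = []      # payloads sent in the current cycle
        self.acked = set()   # receivers that positively acked this cycle

    def multicast(self, payload):
        """Send a msg; returns True when it closes a cycle, i.e. positive acks are now due."""
        self.cycle.append(payload)
        return len(self.cycle) == self.n

    def on_nack(self, index):
        """Negative feedback: retransmit a single msg of the current cycle."""
        return self.cycle[index]

    def on_cycle_ack(self, receiver):
        self.acked.add(receiver)

    def end_cycle(self):
        """Return {receiver: msgs to resend} for receivers that never positively acked."""
        silent = self.receivers - self.acked
        resend = {r: list(self.cycle) for r in sorted(silent)}
        self.cycle, self.acked = [], set()
        return resend
```

The choice of `n` trades feedback volume (one positive ack per receiver per cycle) against how much must be resent to a receiver whose ack is missing.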

Nonhierarchical Feedback Control • Goal: reduce feedback overhead • Only negative feedback is used • Feedback is multicast to all members • Retransmissions are multicast too • The delay before a process sends its feedback has to be carefully tuned • Multicast feedback can needlessly interrupt processes that already received the msg • Processes that regularly miss msgs can form a separate group • Retransmissions go to that group only

Nonhierarchical Feedback Control • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
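
A small model of feedback suppression, assuming each receiver that missed a message schedules its NACK after a random back-off and cancels it if it hears another receiver's multicast NACK first (function names and the fixed `latency` value are hypothetical simplifications). Only receivers whose timers fire before the first NACK reaches them end up sending.

```python
import random


def schedule_timers(receivers, max_delay, rng):
    """Each receiver that missed the msg picks a random back-off in [0, max_delay]."""
    return {r: rng.uniform(0, max_delay) for r in receivers}


def nacks_sent(timers, latency):
    """Return the receivers that actually multicast a NACK under feedback suppression.

    timers:  receiver -> time at which its scheduled NACK would fire
    latency: assumed one-way delay before the others hear a multicast NACK
    """
    first = min(timers.values())
    cutoff = first + latency
    # The earliest timer always fires; anyone whose timer fires before the first
    # NACK reaches them sends a duplicate, everyone else suppresses theirs.
    return sorted(r for r, t in timers.items() if t == first or t < cutoff)


if __name__ == "__main__":
    timers = schedule_timers(["A", "B", "C", "D"], max_delay=10.0, rng=random.Random(7))
    print(nacks_sent(timers, latency=1.0))
```

This makes the tuning problem concrete: a larger `max_delay` spreads the timers out so more NACKs are suppressed, but the first retransmission request (and hence recovery) is delayed longer.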

Hierarchical Feedback Control • Nonhierarchical feedback control may suffice for small multicast groups • Overheads are still too heavy for large groups • Limited geographic scalability • One sender, a large number of receivers • Receivers are partitioned into subgroups • Each subgroup has a coordinator • The coordinator is responsible for retransmissions within its subgroup • Constructing and maintaining the multicast tree is notoriously difficult

Hierarchical Feedback Control • The essence of hierarchical reliable multicasting. • Each local coordinator forwards the message to its children. • A local coordinator handles retransmission requests.
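
A sketch of a hierarchical multicast tree, assuming every node (coordinators and leaf receivers alike) keeps a history buffer and the root is the original sender that never loses a message (the `Coordinator` class, `drop` parameter, and counters are illustrative assumptions). Each node forwards to its children; a node missing a message asks only its parent, so most retransmissions are handled by a local coordinator rather than the sender.

```python
class Coordinator:
    """Hypothetical node in a hierarchical multicast tree; leaves are childless nodes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.history = {}           # seq -> payload kept for local retransmissions
        self.retransmissions = 0    # retransmission requests this node has served
        if parent is not None:
            parent.children.append(self)

    def receive(self, seq, payload, drop=frozenset()):
        """Buffer the msg and forward it to the children; nodes named in `drop` lose it."""
        if self.name in drop:
            return
        self.history[seq] = payload
        for child in self.children:
            child.receive(seq, payload, drop)

    def fetch(self, seq):
        """Return msg seq from the local buffer, asking the parent if it is missing."""
        if seq not in self.history:
            # The root (original sender) is assumed to keep every msg, so this terminates.
            self.history[seq] = self.parent.serve(seq)
        return self.history[seq]

    def serve(self, seq):
        """Handle a child's retransmission request."""
        self.retransmissions += 1
        return self.fetch(seq)
```

Usage: with `root = Coordinator("sender")`, `c1 = Coordinator("c1", root)` and a leaf `b = Coordinator("b", c1)`, a message dropped at `b` is recovered by `b.fetch(seq)` from `c1` without involving the sender.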
