CprE 545: Fault Tolerant Systems
Dependable Real-Time Communication Networks (Prepared by G. Sudha Anil Kumar)
CprE 545 Fault Tolerant Systems (G. Manimaran), Iowa State University
Introduction
• Real-time communication services
  • Digital continuous media (audio & video)
  • Distributed real-time control
• Properties of a typical real-time communication scheme
  • QoS contracted
  • Connection oriented
  • Reservation based
Typical RT-Comm. scheme
• Client specifies the QoS requirements
• Real-time channel established for the required QoS
• Client’s messages are transported over the reserved real-time channel
• What if the channel crashes due to a failure?
  • Channel re-establishment takes a lot of time
  • Sometimes it may not be feasible
• Therefore, we need a dependable channel
Motivational example
• Each link can handle two channels simultaneously
[Figure: mesh of nodes N1 to N9 carrying Channel 1, Channel 2, and Channel 3 between sources S1, S2, S3 and destinations D1, D2, D3. Annotation: "Channel 2 fails to get along".]
Dependable Channel
• Basic idea: To assure successful rerouting and avoid the time-consuming channel establishment process, one or more backup channels are set up a priori, in addition to each primary channel.
• A dependable channel thus consists of a primary channel and one or more backup channels.
• A backup remains in cold standby until it is activated. That is, it carries no data in a normal situation, so the resources reserved for it may be used by other traffic.
• However, backups reduce network utilization, which is why several backups are often multiplexed.
Dependable Channel: An Example
• Each link can handle two channels simultaneously
[Figure: the same mesh of nodes N1 to N9, sources S1, S2, S3 and destinations D1, D2, D3, now showing Channel 1, Channel 2, Channel 3 together with Backup 1, Backup 2, Backup 3. Annotation: "All three channels made it".]
Backup Channel Protocol (BCP): Design Goals
• Per-connection fault tolerance control
• Fast (time-bounded) failure recovery
• Robust failure handling
• Small fault tolerance overhead
• Interoperability/Scalability
Markov model: Reliability of a D-connection
• This model is used to estimate the reliability of each D-connection.
• µ: channel repair rate
• Λ1: failure rate of channel 1
• Λ2: failure rate of channel 2
• Λ3: failure rate of the shared part
[Figure: state-transition diagram with states 0, 1, 2, 3 and transitions labeled Λ1 - Λ3, Λ2 - Λ3, Λ1, Λ2, Λ3, and µ.]
Combinatorial model: Reliability of a D-connection
• This alternative model is simpler.
• Each edge in the network is assumed to have a failure rate of Λ (lambda).
• Pr, the reliability of the D-connection, is the probability that at least one of its channels remains healthy during one time unit.
• For example, for a D-connection with a single backup:
  Pr = P(primary does not fail) + P(primary fails and backup does not fail)
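The single-backup expression can be evaluated numerically. The sketch below is illustrative only and rests on assumptions not stated on the slide: Λ is treated as the probability that an edge fails within one time unit, edge failures are independent, and the primary and backup share no edges. The function names and hop counts are hypothetical.

```python
def path_survival(lam, hops):
    """Probability that every edge of a path survives one time unit.

    Assumes independent edge failures, each with probability lam.
    """
    return (1.0 - lam) ** hops


def d_connection_reliability(lam, primary_hops, backup_hops):
    """Pr = P(primary ok) + P(primary fails) * P(backup ok).

    Assumes the primary and the backup are edge-disjoint, so their
    failures are independent.
    """
    p_primary = path_survival(lam, primary_hops)
    p_backup = path_survival(lam, backup_hops)
    return p_primary + (1.0 - p_primary) * p_backup


if __name__ == "__main__":
    # Hypothetical numbers: 1% edge failure probability, 3-hop primary, 4-hop backup.
    print(d_connection_reliability(0.01, 3, 4))
```

With these hypothetical inputs the primary alone survives with probability of about 0.970, and adding the backup raises Pr to about 0.9988.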
Disadvantages of Backup Channels
• A backup is not free: the same amount of resources as its primary channel must be reserved for it, so that it can be activated immediately when the primary fails.
• As a result, equipping each D-connection with a single backup routed disjointly from its primary reduces the network capacity by 50% or more.
• Large amounts of spare resources (resources reserved for the backups) can seriously reduce the attractiveness of the backup-channel scheme.
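The "50% or more" figure follows from simple arithmetic, sketched below under assumptions that are not on the slide: every hop of a channel reserves the same bandwidth, backups are not multiplexed, and the disjoint backup is at least as long as its primary (for example, because the primary took the shortest path). The function name is hypothetical.

```python
def spare_fraction(primary_hops, backup_hops):
    """Fraction of a D-connection's reserved link resources held idle for the backup.

    Assumes equal bandwidth on every hop and no backup multiplexing.
    """
    return backup_hops / (primary_hops + backup_hops)


if __name__ == "__main__":
    print(spare_fraction(3, 3))  # backup as long as the primary
    print(spare_fraction(3, 5))  # longer, disjoint backup
```

A backup as long as its primary ties up half of the reserved resources, and a longer backup ties up more than half.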
Backup Multiplexing
• Basic idea: At each link, we reserve only a small fraction of the spare resources that all backups going through the link would need.
• That is, the resources for backup channels are overbooked (overloaded).
• One of the key problems in backup multiplexing is deciding which backups will share the same resources.
• A natural solution is to choose backups that are unlikely to be activated simultaneously.
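To see how much overbooking can save on one link, the sketch below compares two extremes. It is a deliberately simplified illustration, not the protocol's actual spare-resource computation: it assumes either that no backups share resources, or that all backups on the link are multiplexed together and at most one of them is active at a time. The function names and bandwidth values are hypothetical.

```python
def spare_without_multiplexing(backup_bandwidths):
    """Each backup gets its own dedicated spare reservation on the link."""
    return sum(backup_bandwidths)


def spare_with_full_multiplexing(backup_bandwidths):
    """All backups share one reservation sized for the largest of them.

    Assumes at most one of the multiplexed backups is activated at a time.
    """
    return max(backup_bandwidths, default=0)


if __name__ == "__main__":
    demands = [2, 3, 1]  # hypothetical bandwidth units of three backups on one link
    print(spare_without_multiplexing(demands), spare_with_full_multiplexing(demands))
```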
Backup Multiplexing
• The probability that two backups belonging to two different D-connections are activated simultaneously is bounded by the probability that their respective primary channels fail simultaneously.
• This probability depends on the routing of the primary channels, and increases with the number of components shared between the primary channels.
• S(Bi, Bj) = probability that the primary channels of backups Bi and Bj fail simultaneously
  = 1 - P(no failure in shared components) × P(no simultaneous failure in the rest)
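The S(Bi, Bj) expression can be sketched in code. The version below makes assumptions beyond the slide: every component fails independently with the same probability Λ per time unit, and the two primaries are described only by how many components they share and how many each has on its own. The function and parameter names are hypothetical.

```python
def simultaneous_failure_prob(lam, shared, only_i, only_j):
    """S(Bi, Bj): probability that both primaries fail in one time unit.

    lam     -- per-component failure probability (assumed equal for all)
    shared  -- number of components common to both primaries
    only_i  -- number of components used only by Bi's primary
    only_j  -- number of components used only by Bj's primary
    """
    # P(no failure in shared components)
    p_shared_ok = (1.0 - lam) ** shared
    # P(both primaries fail somewhere in their non-shared parts)
    p_rest_both_fail = (1.0 - (1.0 - lam) ** only_i) * (1.0 - (1.0 - lam) ** only_j)
    # S = 1 - P(no failure in shared) * P(no simultaneous failure in the rest)
    return 1.0 - p_shared_ok * (1.0 - p_rest_both_fail)


if __name__ == "__main__":
    print(simultaneous_failure_prob(0.01, 0, 3, 3))  # disjoint primaries
    print(simultaneous_failure_prob(0.01, 1, 2, 2))  # one shared component
```

Under these assumptions a single shared component already pushes S above Λ, whereas disjoint primaries give a value on the order of Λ squared.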
How are backups multiplexed?
• The set of backups to be multiplexed together is determined for each backup on each link.
• Bi and Bj are multiplexed if S(Bi, Bj) is smaller than a threshold v, called the multiplexing degree, which is specific to each backup.
• The smaller the v of a backup, the higher its fault tolerance.
• For instance, if the v of a backup Bi is set to Λ (lambda), Bi will not be multiplexed with any other backup whose primary overlaps with Bi’s primary.
• This makes per-connection control of fault tolerance possible, allowing more important connections to have higher fault tolerance.
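The threshold rule can be sketched as follows. This is an illustrative reading of the slide, with assumptions labeled: each primary is represented as a set of hypothetical component identifiers, every component fails independently with probability Λ, and S is computed from the shared and non-shared component counts. The helper names and the example topology are invented for the illustration.

```python
def simultaneous_failure_prob(lam, primary_i, primary_j):
    """S(Bi, Bj) for two primaries given as sets of component ids.

    Assumes independent component failures with equal probability lam.
    """
    shared = len(primary_i & primary_j)
    only_i = len(primary_i - primary_j)
    only_j = len(primary_j - primary_i)
    p_shared_ok = (1.0 - lam) ** shared
    p_rest_both_fail = (1.0 - (1.0 - lam) ** only_i) * (1.0 - (1.0 - lam) ** only_j)
    return 1.0 - p_shared_ok * (1.0 - p_rest_both_fail)


def multiplexing_partners(name, v, primaries, lam):
    """Backups that may share spare resources with backup `name`.

    A backup qualifies when S(name, other) is smaller than the threshold v.
    """
    mine = primaries[name]
    return [
        other
        for other, theirs in primaries.items()
        if other != name and simultaneous_failure_prob(lam, mine, theirs) < v
    ]


if __name__ == "__main__":
    lam = 0.01
    # Hypothetical primaries: B1 and B2 overlap on edge "e3", B3 is disjoint.
    primaries = {
        "B1": {"e1", "e2", "e3"},
        "B2": {"e3", "e4", "e5"},
        "B3": {"e6", "e7"},
    }
    print(multiplexing_partners("B1", lam, primaries, lam))       # v = lambda
    print(multiplexing_partners("B1", 2 * lam, primaries, lam))   # looser threshold
```

With v set to Λ, B1 is multiplexed only with B3, whose primary is disjoint from B1's. Doubling v in this example also admits B2, trading fault tolerance for spare capacity.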
Failure Recovery: Channel Switching Schemes
• Failure reports are sent from the failure-detecting nodes only to the end-nodes of failed channels.
• Failure reports are delivered through the healthy segments of the failed channel’s path.
• Each failure report contains the “channel-id” of the failed channel.
• Channel switching schemes define exactly how the above steps are executed.
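The reporting steps can be sketched as follows. The message layout and function are hypothetical, and the sketch assumes a single failed link whose two endpoints detect the failure; the upstream detector reports toward the source and the downstream detector toward the destination. Which end-nodes actually receive a report depends on the switching scheme in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    channel_id: int   # the "channel-id" of the failed channel
    recipient: str    # end-node the report is addressed to
    route: tuple      # healthy segment the report travels over


def failure_reports(channel_id, path, failed_link):
    """Build the reports sent by the two nodes adjacent to a failed link.

    path        -- node names from source to destination
    failed_link -- (u, v), two consecutive nodes of the path
    """
    u, v = failed_link
    i = path.index(u)
    if i + 1 >= len(path) or path[i + 1] != v:
        raise ValueError("failed_link must join two consecutive nodes of the path")
    upstream = tuple(reversed(path[: i + 1]))  # from detecting node u back to the source
    downstream = tuple(path[i + 1:])           # from detecting node v on to the destination
    return [
        FailureReport(channel_id, recipient=path[0], route=upstream),
        FailureReport(channel_id, recipient=path[-1], route=downstream),
    ]


if __name__ == "__main__":
    for report in failure_reports(7, ["S", "N1", "N2", "D"], ("N1", "N2")):
        print(report)
```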
Channel Switching Scheme - 1
[Figure: a primary channel and a backup channel between Source and Destination, with one failure report and an activation message.]
Channel Switching Scheme - 2
[Figure: a primary channel and a backup channel between Source and Destination, with one failure report and an activation message.]
Channel Switching Scheme - 3
[Figure: a primary channel and a backup channel between Source and Destination, with two failure reports and an activation message.]
Relative performance of the schemes
• Schemes 2 and 3 have an advantage over scheme 1 in recovery delay: data transfer through the new primary channel can resume immediately after the activation message is sent, whereas in scheme 1 data transfer has to wait until the activation message is received by the source node.
• If a failure occurs near the destination node, this advantage is minimal.
• Scheme 3 has an edge over scheme 2 in two respects:
  • All nodes of a channel are informed, which is useful for resource reconfiguration.
  • The channel destination can prepare early for channel switching, and the activation delay is reduced by the bidirectional activation.
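The delay argument can be made concrete with a toy model. Everything in it is an assumption inferred from the text above rather than taken from the scheme diagrams: in scheme 1 the failure report travels to the destination, which then sends the activation message over the backup to the source; in schemes 2 and 3 the source receives a failure report directly and resumes as soon as it sends the activation message. Each hop is assumed to cost one unit of delay, and all names are hypothetical.

```python
def resume_delay(scheme, hops_src_to_failure, hops_failure_to_dst, backup_hops,
                 hop_delay=1.0):
    """Time from failure detection until the source can resume sending data."""
    if scheme == 1:
        # Report to the destination, then activation over the whole backup path.
        return (hops_failure_to_dst + backup_hops) * hop_delay
    if scheme in (2, 3):
        # Report to the source; data resumes right after activation is sent.
        return hops_src_to_failure * hop_delay
    raise ValueError("scheme must be 1, 2, or 3")


if __name__ == "__main__":
    # Hypothetical 5-hop primary with a 6-hop backup.
    near_source = (1, 4, 6)
    near_destination = (4, 1, 6)
    for label, args in (("near source", near_source),
                        ("near destination", near_destination)):
        gain = resume_delay(1, *args) - resume_delay(2, *args)
        print(label, gain)
```

In this model the gain of schemes 2 and 3 over scheme 1 is 9 hop delays when the failure is near the source and only 3 when it is near the destination.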
Channel State Diagram
• N: non-existent state
• P: healthy primary
• B: healthy backup
• U: unhealthy
[Figure: state diagram over the states N, P, B, U, with transitions triggered by: channel establishment message, channel closure message, failure report, activation message, rejoin message, rejoin-timer timeout.]
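A table-driven state machine is one way to express such a diagram. The states and event types below come from the slide, but the arrows of the original figure are not recoverable here, so every (state, event) pair in the table is an assumed, plausible transition rather than the protocol's definition. The event names, and the split of establishment and rejoin messages into primary and backup variants, are hypothetical.

```python
# Assumed transitions; the original figure's edges are not available.
TRANSITIONS = {
    ("N", "establish_primary"): "P",  # channel establishment message (primary)
    ("N", "establish_backup"): "B",   # channel establishment message (backup)
    ("P", "failure_report"): "U",
    ("B", "failure_report"): "U",
    ("B", "activation"): "P",         # backup promoted to primary
    ("U", "rejoin_as_primary"): "P",  # rejoin message
    ("U", "rejoin_as_backup"): "B",   # rejoin message
    ("U", "rejoin_timeout"): "N",     # rejoin-timer timeout
    ("P", "closure"): "N",            # channel closure message
    ("B", "closure"): "N",            # channel closure message
}


def step(state, event):
    """Return the next state; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="N"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["establish_backup", "activation", "failure_report", "rejoin_timeout"]))
```

Ignoring events that have no table entry is a design choice of this sketch; a stricter version could raise an error instead.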
Inferences from the simulation studies
• The higher the degree of backup multiplexing, the higher the resource utilization.
• However, the higher the degree of backup multiplexing, the lower the fault tolerance (reliability).
• Read the paper: S. Han and K. G. Shin, "A primary-backup channel approach to dependable real-time communication in multihop networks," IEEE Trans. Computers, vol. 47, no. 1, pp. 46-61, Jan. 1998.