CprE 545: Fault Tolerant Systems


Presentation Transcript


  1. CprE 545: Fault Tolerant Systems Dependable Real-Time Communication Networks (Prepared by G. Sudha Anil Kumar) CprE 545 Fault Tolerant Systems (G. Manimaran), Iowa State University

  2. Introduction
     • Real-time communication services
       • Digital continuous media (audio and video)
       • Distributed real-time control
     • Properties of a typical real-time communication scheme
       • QoS contracted
       • Connection oriented
       • Reservation based

  3. Typical RT-Comm. scheme
     • The client specifies its QoS requirements.
     • A real-time channel is established for the required QoS.
     • The client's messages are transported over the reserved real-time channel.
     • What if the channel crashes due to a failure? Channel re-establishment takes a long time, and sometimes it may not be feasible. Therefore, we need a dependable channel.

  4. Motivational example
     [Figure: network of nodes N1 to N9 carrying Channel 1, Channel 2, and Channel 3, with sources S1 to S3 and destinations D1 to D3. Each link can handle two channels simultaneously. Annotation: "Channel 2 fails to get along".]

  5. Dependable Channel
     • Basic idea: to assure successful rerouting and avoid the time-consuming channel establishment process, one or more backup channels are set up a priori, in addition to each primary channel.
     • A dependable channel thus consists of a primary channel and one or more backup channels.
     • A backup remains a cold standby until it is activated. That is, it carries no data in a normal situation, so the resources reserved for it may be used by other traffic.
     • However, backups can degrade network utilization, so several backups are often multiplexed.

  6. Dependable Channel: An Example
     [Figure: the same network of nodes N1 to N9, sources S1 to S3, and destinations D1 to D3, now carrying Channel 1, Channel 2, and Channel 3 together with Backup 1, Backup 2, and Backup 3. Each link can handle two channels simultaneously. Annotation: "All three channels made it".]

  7. Backup Channel Protocol (BCP): Design Goals
     • Per-connection fault tolerance control
     • Fast (time-bounded) failure recovery
     • Robust failure handling
     • Small fault tolerance overhead
     • Interoperability/Scalability

  8. Markov model: Reliability of a D-connection
     [Figure: Markov chain with states 0, 1, 2, and 3; transitions are labeled Λ1 – Λ3, Λ2 – Λ3, Λ1, Λ2, Λ3, and µ.]
     • µ: channel repair rate
     • Λ1: failure rate of channel 1
     • Λ2: failure rate of channel 2
     • Λ3: failure rate of the shared part
     • This model is used to estimate the reliability of each D-connection.

  9. Combinatorial model: Reliability of a D-connection
     • This alternative model is simpler.
     • Each edge in the network is assumed to have a failure rate of Λ (lambda).
     • Pr = reliability of the D-connection
     • Pr = probability that at least one channel remains healthy during one time unit.
     • For example, the Pr of a D-connection with a single backup is P(primary does not fail) + P(primary fails and backup does not fail).
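The single-backup case above can be worked through numerically. The following is a minimal sketch, not the slides' own code. It assumes that Λ is treated as the per-edge failure probability over one time unit, that edges fail independently, and that the primary and backup share no edges. The function names and the hop counts are hypothetical.

```python
def channel_survival(hops: int, lam: float) -> float:
    """Probability that every edge of a channel survives one time unit.

    Assumption: each edge fails independently with probability `lam`.
    """
    return (1.0 - lam) ** hops


def d_connection_reliability(primary_hops: int, backup_hops: int, lam: float) -> float:
    """Pr = P(primary does not fail) + P(primary fails and backup does not fail).

    Assumption: primary and backup are edge-disjoint, so their failures
    are independent and the second term factors into a product.
    """
    p_primary = channel_survival(primary_hops, lam)
    p_backup = channel_survival(backup_hops, lam)
    return p_primary + (1.0 - p_primary) * p_backup


if __name__ == "__main__":
    lam = 0.01  # hypothetical per-edge failure probability
    print(f"primary only: {channel_survival(3, lam):.6f}")
    print(f"with backup : {d_connection_reliability(3, 4, lam):.6f}")
```

Under these assumptions the result equals 1 minus the probability that both channels fail, which is the "at least one channel remains healthy" definition of Pr.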

  10. Disadvantages of Backup channels
      • A backup is not free: it requires the same amount of resources as its primary channel to be reserved, so that it can be activated immediately when the primary fails.
      • As a result, equipping each D-connection with a single backup routed disjointly from its primary reduces the network capacity by 50% or more.
      • Large amounts of spare resources (resources reserved for the backups) can seriously degrade the attractiveness of the backup-channel scheme.
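The "50% or more" figure can be illustrated with a small calculation. This is a sketch under stated assumptions: each channel reserves the same bandwidth on every link it crosses, and the disjoint backup is at least as long as the primary (for example, because the primary took the shortest route). The function name and hop counts are hypothetical.

```python
def spare_fraction(primary_hops: int, backup_hops: int, bandwidth: float = 1.0) -> float:
    """Fraction of a D-connection's reserved link bandwidth held for its backup.

    Assumption: the backup reserves the same per-link bandwidth as the primary.
    """
    primary_reserved = primary_hops * bandwidth
    backup_reserved = backup_hops * bandwidth
    return backup_reserved / (primary_reserved + backup_reserved)


if __name__ == "__main__":
    print(spare_fraction(3, 3))  # backup as short as the primary
    print(spare_fraction(3, 5))  # longer, disjoint backup
```

When the backup is exactly as long as the primary, half of the reserved resources are spare. Any longer backup pushes the spare share above one half.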

  11. Backup Multiplexing
      • Basic idea: at each link, we reserve only a very small fraction of the spare resources needed for all backups going through the link.
      • That is, resources for backup channels are overbooked (overloaded).
      • One of the key problems in backup multiplexing is to decide which backups will share the same resources.
      • A natural solution to this problem is to choose those backups which are less likely to be activated simultaneously.
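The effect of overbooking on one link can be sketched as follows. This is an illustration only, not the protocol's actual spare-resource rule. It assumes that at most `max_simultaneous` of the backups on the link are ever activated at the same time and reserves for the worst case. The parameter `max_simultaneous` and the bandwidth values are hypothetical.

```python
from typing import Sequence


def spare_needed(backup_bandwidths: Sequence[float], max_simultaneous: int) -> float:
    """Spare bandwidth to reserve on one link.

    Assumption: at most `max_simultaneous` backups are active at once,
    so the worst case is that the largest ones are activated together.
    """
    if max_simultaneous < 0:
        raise ValueError("max_simultaneous must be non-negative")
    largest_first = sorted(backup_bandwidths, reverse=True)
    return sum(largest_first[:max_simultaneous])


if __name__ == "__main__":
    backups = [2.0, 1.0, 3.0, 1.0]
    print(spare_needed(backups, len(backups)))  # no multiplexing
    print(spare_needed(backups, 1))             # all backups share one reservation
```

With no multiplexing the link reserves the sum of all backup bandwidths. If all four backups share resources, the link reserves only enough for the largest one.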

  12. Backup Multiplexing
      • The probability of simultaneous activation of two backups belonging to two different D-connections is bounded by the probability of simultaneous failure of their respective primary channels.
      • This probability depends on the routing of the primary channels, and increases with the number of components shared between the primary channels.
      • S(Bi, Bj) = probability that backups Bi and Bj are activated simultaneously = 1 – P(no failure in shared components) × P(no simultaneous failure in the rest)
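The expression for S(Bi, Bj) can be evaluated directly once a failure model is chosen. The sketch below assumes that every component fails independently with probability `lam` during the interval of interest, and that "the rest" means the components of each primary that the other primary does not use. The function name and arguments are hypothetical.

```python
def simultaneous_activation_prob(shared: int, rest_i: int, rest_j: int, lam: float) -> float:
    """S(Bi, Bj) = 1 - P(no failure in shared) * P(no simultaneous failure in the rest).

    shared : number of components common to both primaries
    rest_i : components used only by the primary of Bi
    rest_j : components used only by the primary of Bj
    Assumption: independent component failures with probability `lam`.
    """
    p_shared_ok = (1.0 - lam) ** shared
    p_rest_i_fails = 1.0 - (1.0 - lam) ** rest_i
    p_rest_j_fails = 1.0 - (1.0 - lam) ** rest_j
    p_no_simultaneous_rest = 1.0 - p_rest_i_fails * p_rest_j_fails
    return 1.0 - p_shared_ok * p_no_simultaneous_rest


if __name__ == "__main__":
    lam = 0.01
    for shared in range(3):
        rest = 3 - shared  # both primaries have 3 components in total
        s = simultaneous_activation_prob(shared, rest, rest, lam)
        print(f"shared={shared}: S={s:.6f}")
```

In this model S grows with the number of shared components, as the slide states. It is at least `lam` as soon as the primaries share one component.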

  13. How are backups multiplexed?
      • The set of backups to be multiplexed together is determined for each backup on each link.
      • Bi and Bj are multiplexed if S(Bi, Bj) is smaller than a threshold v, called the multiplexing degree, which is specific to each backup.
      • The smaller the v of a backup, the higher the resulting fault tolerance.
      • For instance, if the v of a backup Bi is set to lambda, Bi will not be multiplexed with any other backup whose primary overlaps with Bi's primary.
      • This way, per-connection control of fault tolerance is possible, allowing more important connections to have higher fault tolerance.
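The threshold rule can be sketched as a filter over the backups that cross one link. This is a minimal illustration, not the protocol's implementation. It assumes each primary is described by the set of components it uses, that components fail independently with probability `lam`, and that S is computed as 1 – P(no failure in shared components) × P(no simultaneous failure in the rest). All names and the example topology are hypothetical.

```python
from typing import Dict, FrozenSet, List


def s_prob(primary_i: FrozenSet[str], primary_j: FrozenSet[str], lam: float) -> float:
    """Probability that both primaries fail, i.e. both backups are activated."""
    shared = len(primary_i & primary_j)
    rest_i = len(primary_i - primary_j)
    rest_j = len(primary_j - primary_i)
    p_shared_ok = (1.0 - lam) ** shared
    p_both_rest_fail = (1.0 - (1.0 - lam) ** rest_i) * (1.0 - (1.0 - lam) ** rest_j)
    return 1.0 - p_shared_ok * (1.0 - p_both_rest_fail)


def multiplexable_with(name: str, v: float,
                       primaries: Dict[str, FrozenSet[str]], lam: float) -> List[str]:
    """Backups that `name` may share resources with, given its own threshold v."""
    mine = primaries[name]
    return sorted(other for other, comps in primaries.items()
                  if other != name and s_prob(mine, comps, lam) < v)


if __name__ == "__main__":
    lam = 0.001
    # Components used by the primary of each backup (hypothetical routes).
    primaries = {
        "B1": frozenset({"a", "b", "c"}),
        "B2": frozenset({"c", "d"}),       # overlaps with B1's primary on "c"
        "B3": frozenset({"e", "f", "g"}),  # disjoint from B1's primary
    }
    print(multiplexable_with("B1", lam, primaries, lam))
    print(multiplexable_with("B1", 0.01, primaries, lam))
```

With v set to lambda, B1 is kept apart from B2 because their primaries overlap, and it is still multiplexed with B3. Raising v to a looser value admits B2 as well, which trades fault tolerance for lower spare resources.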

  14. Failure Recovery: Channel Switching Schemes
      • Failure reports are sent from the failure-detecting nodes only to the end-nodes of failed channels.
      • Failure reports are delivered through healthy segments of the failed channel's paths.
      • Each failure report contains the "channel-id" of the failed channel.
      • Channel switching schemes define how exactly the above steps are executed.

  15. Channel Switching Scheme - 1
      [Figure: source and destination connected by a primary channel and a backup channel. Labels: failure report, activation message.]

  16. Channel Switching Scheme - 2
      [Figure: source and destination connected by a primary channel and a backup channel. Labels: failure report, activation message.]

  17. Channel Switching Scheme - 3
      [Figure: source and destination connected by a primary channel and a backup channel. Labels: two failure reports, activation message.]

  18. Relative performance of the schemes
      • Schemes 2 and 3 have an advantage over scheme 1 in terms of recovery delay: data transfer through the new primary channel can be resumed immediately after the activation message is sent, whereas in scheme 1 data transfer has to wait until the activation message is received by the source node.
      • If a failure occurs near the destination node, this advantage will be minimal.
      • Scheme 3 has an edge over scheme 2 in two aspects:
        • All nodes of a channel are informed, which is useful for resource reconfiguration.
        • The channel destination can prepare early for channel switching, and the activation delay is reduced by the bidirectional activation.
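The delay comparison can be made concrete with a simple hop-count model. This is a sketch built on assumptions that the slides do not state explicitly: in scheme 1 the failure report travels to the destination, which then sends the activation message along the backup to the source; in schemes 2 and 3 a failure report reaches the source, which resumes sending as soon as it issues the activation message. Every hop is assumed to cost the same delay. The function and parameter names are hypothetical.

```python
def resume_delay(scheme: int, hops_to_source: int, hops_to_dest: int,
                 backup_hops: int, per_hop_delay: float = 1.0) -> float:
    """Time from failure detection until the source can resume sending data.

    hops_to_source / hops_to_dest: distance from the failure-detecting
    node to each end-node along the healthy part of the primary.
    """
    if scheme == 1:
        # Assumed: report goes to the destination, activation returns over the backup.
        return (hops_to_dest + backup_hops) * per_hop_delay
    if scheme in (2, 3):
        # Assumed: the source resumes once the failure report reaches it.
        return hops_to_source * per_hop_delay
    raise ValueError("scheme must be 1, 2, or 3")


if __name__ == "__main__":
    backup_hops = 7  # hypothetical backup length; the primary has 6 hops
    for label, to_src, to_dst in (("near source", 1, 5), ("near destination", 5, 1)):
        d1 = resume_delay(1, to_src, to_dst, backup_hops)
        d2 = resume_delay(2, to_src, to_dst, backup_hops)
        print(f"failure {label}: scheme 1 = {d1}, scheme 2/3 = {d2}, advantage = {d1 - d2}")
```

In this model the advantage of schemes 2 and 3 shrinks as the failure moves toward the destination, which matches the second bullet above.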

  19. Channel State Diagram
      [Figure: state diagram over states N, P, B, and U. Transition labels: channel establishment message, failure report, channel closure message, activation message, rejoin message, rejoin-timer timeout.]
      • N: non-existent state
      • P: healthy primary
      • B: healthy backup
      • U: unhealthy

  20. Inferences from the simulation studies
      • The higher the degree of backup multiplexing, the higher the resource utilization.
      • However, the higher the degree of backup multiplexing, the lower the fault tolerance (reliability).
      • Read the paper: S. Han and K.G. Shin, "A primary-backup channel approach to dependable real-time communication in multihop networks," IEEE Trans. Computers, vol. 47, no. 1, pp. 46-61, Jan. 1998.
