Introduction
High-Availability Systems: An Example: AT&T
• Pioneered fault tolerance (FT) in telephone switching applications.
• Aggressive availability goal: 2 hours downtime in 40 years (i.e., 3 min/year), with less than 0.01% of the calls handled incorrectly.
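The downtime goal can be turned into an availability figure with simple arithmetic. The sketch below is only an illustration; the function names and the 365.25-day year are assumptions, not part of the original slides.

```python
# Illustrative arithmetic only: function names and the 365.25-day year
# are assumptions made for this sketch.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(total_downtime_hours, years):
    """Average allowed downtime per year, in minutes."""
    return total_downtime_hours * 60 / years


def availability(downtime_min_per_year):
    """Fraction of the year during which the system is up."""
    return 1 - downtime_min_per_year / MINUTES_PER_YEAR


goal = downtime_minutes_per_year(2, 40)   # 2 hours in 40 years
print(goal)                               # minutes of downtime per year
print(availability(goal))                 # roughly "five nines"
```

With 2 hours over 40 years the budget is 3 minutes per year, which corresponds to an availability a little above 99.999%.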
In 1978, Bell Labs collected data on historic trends in the causes of system downtime:
• 20% attributed to HW (good diagnostics and trouble-location programs can help minimize HW-induced downtime).
• 15% attributed to SW (SW deficiencies included improper translation of algorithms into code or improper specifications).
• 35% attributed to recovery deficiencies (these deficiencies can be caused by undetected faults or incorrect fault isolation).
• 30% attributed to human procedural error.
Other studies in the same direction ...
There is some natural redundancy in the telephone switching network: “a telephone user will redial if he gets a wrong number or is disconnected”. However, there is a user aggravation level that must be avoided: “users will redial as long as it does not happen too frequently”.
Note, however, that the thresholds are different for failure to establish a call (moderately high) and disconnection of an established call (very low).
[Levels of recovery in a Telephone Switching System]
In a typical telephone switching system, the tasks of the Central Control Unit are related to:
• Overall system control/administration
• Call processing
• System maintenance
• Automatic isolation of faulty units
• Defensive SW strategies
• Support for rapid repair
[Figure: Typical switching system diagram, showing the Bus Interface, Program Store (PS), Central Control (CC), Call Store (CS), Auxiliary Units (AU), and the Auxiliary Unit (AU) Bus]
CC instructions reside in the program store (PS), while transient information (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS). The Auxiliary Unit (AU) Bus interfaces to disk and magnetic-tape mass storage.
[Figure: Duplex configuration for switching computer, showing duplicated units: Peripheral Unit Buses (PUB1, PUB2), Bus Interfaces 1 and 2, Program Store Buses (PSB1, PSB2), Program Stores 1 and 2 (PS), Central Controls 1 and 2 (CC), Call Stores 1 and 2 (CS), and AU 1 and AU 2 on the Auxiliary Unit (AU) Bus]
PSB: Program Store Bus. PUB: Peripheral Unit Bus.
Assuming that only one of each component is required for a functional system, there are 64 possible system configurations.
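The count of 64 is consistent with six independently selectable duplicated pairs, since 2^6 = 64. The slides do not say which six pairs are counted, so the component labels in this sketch are placeholders chosen for illustration only.

```python
from itertools import product

# Hypothetical choice of six duplicated pairs; the slides give the total
# (64) but do not list which pairs are counted.
DUPLICATED_PAIRS = ["CC", "PS", "CS", "PSB", "PUB", "AU"]


def enumerate_configurations(pairs):
    """Yield every way of picking unit 1 or unit 2 from each pair."""
    for choice in product((1, 2), repeat=len(pairs)):
        yield {name: unit for name, unit in zip(pairs, choice)}


configs = list(enumerate_configurations(DUPLICATED_PAIRS))
print(len(configs))   # 2 ** 6 configurations
print(configs[0])     # every pair using unit 1
```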
1-Both CCs operate in synchronism. Two matcher circuits compare 24 bits of internal state during each 5.5 µs machine cycle.
2-There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.
3-A mismatch generates an interrupt, which calls fault recognition programs to determine which half of the system is faulty.
4-Information can be sampled by the matchers and retained for later examination by diagnostic programs.
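The matching step described above can be sketched in software as a per-cycle comparison of two state words. This is a simplified model, not the actual matcher hardware: the function names are hypothetical, and each CC's monitored state is assumed to be available as an integer per machine cycle.

```python
# Simplified model of the matchers: names and the integer-per-cycle
# representation of CC state are assumptions for this sketch.
MATCH_MASK = (1 << 24) - 1  # compare 24 bits of internal state


def mismatch_bits(state_a, state_b, mask=MATCH_MASK):
    """Return the bits (within the mask) on which the two CCs disagree."""
    return (state_a ^ state_b) & mask


def run_in_lockstep(trace_a, trace_b):
    """Compare both CCs cycle by cycle.

    Returns (cycle, differing_bits) for the first mismatch, which is where
    the real system would raise an interrupt, or None if the traces agree.
    """
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        diff = mismatch_bits(a, b)
        if diff:
            return cycle, diff
    return None


print(run_in_lockstep([1, 2, 3], [1, 2, 7]))  # mismatch in cycle 2
```

Note that a mismatch only shows that the two halves disagree; deciding which half is faulty is left to the fault recognition programs.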
5-The PS employs a Hamming code on the 37 data bits.
6-There are parity check bits over the address plus data bus: the CS has one parity bit over address and data, and another parity bit over the address alone.
7-Both PS and CS automatically retry operations upon error detection.
8-After a fault has been detected, the system configuration logic attempts to establish various combinations of subunits.
9-A sanity program is then executed.
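To illustrate how a Hamming code protects a 37-bit word, here is a textbook single-error-correcting Hamming code. It is a generic sketch and not necessarily the exact code used in the switching system; the function names are hypothetical.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits with a textbook Hamming SEC code.

    Check bits sit at positions 1, 2, 4, 8, ... (1-indexed).
    """
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:   # smallest r with 2**r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)          # index 0 unused, positions are 1..n
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity          # make the covered group even parity
    return code[1:]


def hamming_correct(codeword):
    """Return (corrected_codeword, syndrome); syndrome 0 means no error."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos       # XOR of the positions of all set bits
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1  # the syndrome names the flipped position
    return fixed, syndrome


word = [1, 0, 1] * 12 + [1]       # 37 data bits
encoded = hamming_encode(word)
corrupted = list(encoded)
corrupted[9] ^= 1                 # flip the bit at position 10
repaired, syndrome = hamming_correct(corrupted)
print(len(encoded), syndrome, repaired == encoded)
```

For 37 data bits this construction needs 6 check bits, giving a 43-bit codeword. It corrects any single-bit error but cannot distinguish a double error from a single one without an extra overall parity bit.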
Summarizing some features of the FT system:
• Duplication of ALU.
• 30% of Control Logic devoted to Self-Checking.
• EDAC on disks.
• SW audits.
• Sanity timer (a Sanity Program is similar to a maze that the HW must traverse before the sanity timer times out. If a time-out occurs, the reconfiguration logic generates a new configuration to be tried).
• Integrity monitor (Supervisor).
• Byte parity on datapaths.
• Parity checking where parity is preserved, duplication otherwise.
• Two parity bits on registers.
• Modified Hamming Code on Main Memory.
• Maintenance Channel for observability and controllability.
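The sanity-timer mechanism listed above (try a configuration, run the sanity program, reconfigure on time-out) can be sketched as a simple loop. This is a toy model under stated assumptions: a configuration is a tuple of unit names, a faulty unit is modeled as stalling the sanity program so the timer expires, and all names are hypothetical.

```python
def sanity_program_passes(config, faulty_units, timer_ticks):
    """Model the 'maze': visit every selected unit before the timer expires.

    A faulty unit is assumed to stall the program, which we treat as a
    time-out.
    """
    ticks = 0
    for unit in config:
        if unit in faulty_units:
            return False          # stalled: the sanity timer times out
        ticks += 1
        if ticks > timer_ticks:
            return False          # too slow: the sanity timer times out
    return True


def reconfigure(candidates, faulty_units, timer_ticks=10):
    """Try candidate configurations in order until one passes."""
    for config in candidates:
        if sanity_program_passes(config, faulty_units, timer_ticks):
            return config
    return None                   # no working configuration found


candidates = [("CC1", "PS1"), ("CC2", "PS1"), ("CC2", "PS2")]
print(reconfigure(candidates, faulty_units={"CC1"}))
```

In this model the first configuration fails because it includes the faulty CC1, so the loop moves on and settles on the next candidate that completes the sanity program in time.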