ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
High-Level Fault Tolerance: Checkpointing and Recovery
Introductory material

Overview
• Introduction and basic concept
• Fault model and fault coverage
• Checkpointing and backward error recovery (rollback)
  • General principles
  • Uniprocessor systems
• Summary
• Cost, overhead, and latency issues
• Distributed systems

Introduction
• References
  • Text, Chapter 6
  • [Prad:96] Chapter 3 – sections on rollback and reconfiguration

Introduction (contd.)
• Somewhat higher level than ECC and watchdog techniques; uses re-execution as the basic recovery strategy
• In practice, it is a hardware-assisted software method
• Basic concept: save a fault-free state of the system; if and when an error is detected, reload the fault-free state and re-execute

Introduction - Basic Concept (contd.)
• Three phases of recovery
  • Error detection
  • Damage assessment
  • Recovery – error elimination and arrival at the point where the error was detected
    • Often entails restarting afresh on a system presumed to be fault-free
• Backward error recovery
  • The current process is rolled back to some error-free point and re-executes
  • Trivial solution – start afresh from the beginning of the program
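The save, detect, reload, and re-execute cycle described above can be sketched in a few lines. This is a minimal illustration only, under stated assumptions: the state is a plain dictionary, a checkpoint is taken before every step, and the names `run_with_rollback`, `TransientFault`, `add_ten`, and the acceptance test are hypothetical and not part of the course material.

```python
import copy


class TransientFault(Exception):
    """Raised when the (assumed) error detection mechanism reports an error."""


def run_with_rollback(state, steps, check, max_retries=3):
    """Run each step, rolling back to the last checkpoint when an error is detected."""
    for step in steps:
        checkpoint = copy.deepcopy(state)            # save a fault-free state
        for _ in range(max_retries + 1):
            try:
                step(state)
                if not check(state):                 # error detection (acceptance test)
                    raise TransientFault("acceptance test failed")
                break                                # step accepted, move on
            except TransientFault:
                state.clear()                        # rollback: reload the saved state
                state.update(copy.deepcopy(checkpoint))
        else:
            # The error did not go away with re-execution, so it is not transient.
            raise RuntimeError("error persists after retries")
    return state


# Inject exactly one transient fault: the first call corrupts its result.
faults = iter([True, False, False])


def add_ten(state):
    state["total"] += 10
    if next(faults, False):
        state["total"] = -1                          # corrupted result


result = run_with_rollback({"total": 0}, [add_ten, add_ten],
                           check=lambda s: s["total"] >= 0)
```

The first call to `add_ten` is corrupted, detected, rolled back, and re-executed; the final state is the same as in a fault-free run.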

Fault model and fault coverage
• Possible scenarios
  • Hardware is faulty, software is fault-free
    • Fault detection mechanism exists – in hardware or in software form
  • Hardware is fault-free, software is faulty
  • Both hardware and software are faulty
• Assumptions for backward error recovery
  • A reliable error detection mechanism exists
  • The error can be removed by re-execution
  • The process state can be restored to a previous error-free state

Fault model and fault coverage (contd.)
• Based on the assumptions stated:
  • The method is normally applicable when an error detection mechanism exists, hardware faults are transient, and there are no software faults
  • Methods to address other fault scenarios are
    • Reconfiguration
    • Software fault tolerance, e.g. recovery blocks and N-version programming

Checkpointing and Rollback
• General principles
  • Time redundancy is permissible
  • Transient hardware errors
  • For software errors (design or otherwise): alternative modules exist, or the errors are timing errors that may be resolved during re-execution
  • Reliable error detection mechanism
  • It is feasible to determine checkpoints (system states that need to be saved) in an application
  • The method can apply to redundant as well as nonredundant systems

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  • Save system state at regular intervals
    • How often to save: the checkpoint interval
    • How much to save: can be as little as the PC and status flags, just one instruction, or as much as a log of all messages, the complete program, and associated data values at a given time
    • How long a delay between fault occurrence and its detection (error latency) is tolerable: a large error latency often makes this method less than ideal
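The "how often to save" question can be made concrete with a simple first-order cost model. The sketch below is not taken from the slides. It assumes a fixed cost `save_cost` per checkpoint, a mean time between failures `mtbf`, failures equally likely anywhere in an interval (so about half an interval of work is lost per failure), and `save_cost` much smaller than the interval, which in turn is much smaller than `mtbf`. Under those assumptions the overhead per unit time is roughly C/T + T/(2M), which is minimized at T = sqrt(2CM), a result often called Young's approximation. Restart cost and error latency are ignored.

```python
import math


def overhead_fraction(interval, save_cost, mtbf):
    """Approximate fraction of time lost to checkpointing plus re-execution."""
    saving = save_cost / interval        # time spent writing checkpoints
    rework = interval / (2.0 * mtbf)     # expected work redone after a failure
    return saving + rework


def optimal_interval(save_cost, mtbf):
    """Interval minimizing overhead_fraction under the first-order model."""
    return math.sqrt(2.0 * save_cost * mtbf)


# Hypothetical numbers: a 2 s checkpoint and one failure per hour on average.
best = optimal_interval(save_cost=2.0, mtbf=3600.0)
costs = {t: overhead_fraction(t, 2.0, 3600.0) for t in (60.0, best, 240.0)}
```

Checkpointing twice as often or half as often as `best` both raise the overhead, which is the trade-off behind the choice of checkpoint interval.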

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  • Rollback recovery
    • Where do we go back to: damage assessment
    • Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
    • Restart the computation

Checkpointing and Rollback (contd.)
• What do we need
  • Error detection mechanism
    • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
  • Storage for state/data saving
    • Large enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be open
    • Access time – an issue during storing and retrieval
    • Volatility and stability of the storage
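The stability requirement on checkpoint storage means a crash in the middle of saving must not destroy the previous good checkpoint. One common way to approximate this on an ordinary file system is to write the new checkpoint to a temporary file, flush it to disk, and then rename it over the old one. The sketch below assumes the state can be serialized with `pickle`; the function names and the example state are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path so that path always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # same directory as the target
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())                     # push the data to stable storage
        os.replace(tmp_path, path)                   # swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)                      # never leave a partial file behind
        raise


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "state.ckpt")
    save_checkpoint({"pc": 0x40, "stack": [1, 2, 3]}, ckpt)
    restored = load_checkpoint(ckpt)
    leftovers = os.listdir(workdir)
```

The `fsync` call and the extra file are part of the access-time cost listed above: a more stable checkpoint is also a slower one.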

Checkpointing and Rollback (contd.)
• What do we need (contd.)
  • Events
    • Messages and transactions that should be logged and replayed
  • Procedures to handle errors and restart computation
    • What if errors continue to exist? – a mechanism to handle this is needed
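Logging and replay can be illustrated with a toy process whose state depends only on the messages it receives. The sketch assumes message handling is deterministic, so replaying the log after a rollback reproduces the pre-error state. The class `LoggedProcess` and its balance example are hypothetical.

```python
import copy


class LoggedProcess:
    """Keeps a checkpoint plus a log of the messages delivered since it was taken."""

    def __init__(self):
        self.state = {"balance": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()                  # older messages are covered by the checkpoint

    def deliver(self, amount):
        self._log.append(amount)           # log first, then process
        self._apply(amount)

    def _apply(self, amount):
        self.state["balance"] += amount

    def recover(self):
        """Roll back to the checkpoint and replay the logged messages."""
        self.state = copy.deepcopy(self._checkpoint)
        for amount in self._log:
            self._apply(amount)


proc = LoggedProcess()
proc.deliver(5)
proc.checkpoint()
proc.deliver(7)
proc.deliver(-2)
proc.state["balance"] = 999                # simulated corruption, assumed to be detected
proc.recover()
balance_after_recovery = proc.state["balance"]
```

Without the log, recovery would return the balance to 5 and the two later messages would have to be re-sent by their senders.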

Checkpointing: Uniprocessor systems
• Uniprocess and uniprocessor systems equivalence
• Simplest scheme
  • Instruction re-execution
    • Hardware (parity, self-checking, duplication) reports error
    • Instruction is re-executed using previous data and state
  • Issues
    • Register file update (commit)
    • Latency, especially in pipelined systems
• Key is to determine the state to be saved
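Instruction re-execution and the commit issue can be modeled with a toy register machine. The sketch assumes detection by duplication: each instruction is executed twice and the results are compared, and the register file is updated only when they agree. The instruction format, `make_unit`, and the fault injection are hypothetical simplifications, not a model of any real processor.

```python
def make_unit(faulty_call):
    """Hypothetical ALU that flips one result bit on its Nth call (1-based)."""
    calls = 0

    def unit(op, a, b):
        nonlocal calls
        calls += 1
        result = a + b if op == "add" else a * b
        if calls == faulty_call:
            result ^= 1                    # transient single-bit error
        return result

    return unit


def run(program, regs, unit, max_retries=2):
    """Execute (op, dst, src1, src2) tuples with duplicate-and-compare detection."""
    for op, dst, src1, src2 in program:
        for _ in range(max_retries + 1):
            first = unit(op, regs[src1], regs[src2])
            second = unit(op, regs[src1], regs[src2])
            if first == second:
                regs[dst] = first          # commit only a checked result
                break
            # Mismatch: nothing was committed, so the operands are still intact
            # and the instruction can simply be executed again.
        else:
            raise RuntimeError("error persists: fault is not transient")
    return regs


# r2 = r0 + r1, then r0 = r0 * r2; the fault hits the first execution of the multiply.
program = [("add", "r2", "r0", "r1"), ("mul", "r0", "r0", "r2")]
regs = run(program, {"r0": 2, "r1": 3, "r2": 0}, make_unit(faulty_call=3))
```

The multiply overwrites one of its own operands, which shows why the commit must wait for the check: writing the faulty result to `r0` first would destroy the data needed for re-execution.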

Checkpointing: Uniprocessor systems (contd.)
• Process control systems
  • A program that monitors a process behaves in a predetermined manner – known control flow and typically periodic
  • Define checkpoints statically

Checkpointing: Uniprocessor systems (contd.)
• Process control systems (contd.)
  • Typical objectives
    • Recovery possible in a given time
    • Minimize the total number of checkpoints
  • Methods of this nature were studied in the 1960s
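For a program with a known, straight-line sequence of tasks, the two objectives above can be met with a simple greedy placement: take a new checkpoint only when running the next task would push the re-execution time past the allowed recovery time. This is an illustrative sketch, not one of the classical methods referred to above. It assumes known task execution times, a linear control flow, and negligible checkpoint save and restore cost; `place_checkpoints` and the numbers are hypothetical.

```python
def place_checkpoints(task_times, max_recovery):
    """Return indices of tasks that are immediately preceded by a checkpoint."""
    checkpoints = [0]                      # always checkpoint at the start
    since_last = 0.0                       # work done since the last checkpoint
    for index, duration in enumerate(task_times):
        if duration > max_recovery:
            raise ValueError("a single task exceeds the recovery bound")
        if since_last + duration > max_recovery:
            checkpoints.append(index)      # checkpoint just before this task
            since_last = 0.0
        since_last += duration
    return checkpoints


# Hypothetical periodic control loop: six tasks, at most 8 time units of rework.
placement = place_checkpoints([3, 4, 2, 5, 1, 6], max_recovery=8)
```

Because the control flow is fixed, this placement can be computed once, offline, and compiled into the program.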

Checkpointing: Uniprocessor systems (contd.)
• General purpose systems
  • How much information to save
    • System state consisting of register file, PC, stack, etc.
    • Data?
      • All of it? Can be prohibitive (space and time)
      • So?
      • Only the data that has been modified since the last checkpoint
    • How do we do this efficiently?
      • Caches provide a nice boundary to achieve this
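Saving only the data modified since the last checkpoint can be sketched in software with an undo log: the first write to a location after a checkpoint records the old value, and rollback restores just those locations. This is a software analogy under stated assumptions, not the cache-based mechanism itself; `CheckpointedMemory` and its methods are hypothetical.

```python
_MISSING = object()    # marks a location that did not exist at the checkpoint


class CheckpointedMemory:
    """Memory that remembers old values only for locations written since the checkpoint."""

    def __init__(self, data):
        self._data = dict(data)
        self._undo = {}                    # address -> value at the last checkpoint

    def write(self, addr, value):
        if addr not in self._undo:         # first write since the checkpoint
            self._undo[addr] = self._data.get(addr, _MISSING)
        self._data[addr] = value

    def checkpoint(self):
        """Establish a new checkpoint; return how many locations had been modified."""
        modified = len(self._undo)
        self._undo.clear()
        return modified

    def rollback(self):
        """Restore every modified location to its value at the last checkpoint."""
        for addr, old in self._undo.items():
            if old is _MISSING:
                del self._data[addr]
            else:
                self._data[addr] = old
        self._undo.clear()

    def snapshot(self):
        return dict(self._data)


mem = CheckpointedMemory({"a": 1, "b": 2, "c": 3})
mem.write("a", 10)
mem.write("a", 11)                         # second write to "a" saves nothing new
mem.write("d", 4)
saved_entries = len(mem._undo)
mem.rollback()
after_rollback = mem.snapshot()
```

Only two old values are kept for three writes, and the untouched locations `b` and `c` cost nothing, which is the saving the slide is after.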

Summary
• Discussed classical studies of checkpointing
