Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment. Shantanu Dutt, Hasan Arslan, ECE Dept., University of Illinois-Chicago. Outline. Past Work: General Fault Detection and Tolerance. Past Work: EMI-Induced Faults
Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment
Shantanu Dutt, Hasan Arslan
ECE Dept., University of Illinois-Chicago
Outline
• Past Work: General Fault Detection and Tolerance
• Past Work: EMI-Induced Faults
• Fault Types and Fault Injection Methods
• Proposed Work and System
• Methodologies to Detect Faults
• Questions and Future Outlook
Past Work: General Fault Detection and Tolerance
• Off-line testing of digital circuits
  • Self-diagnosis
  • Tests each functional block
  • Not well suited to real-time applications
• Redundancy
  • Hardware, software or time
  • Carries a high overhead penalty
Past Work: General Fault Detection and Tolerance
• Concurrent on-line testing: adding external hardware that monitors data, address and control lines
  • Memory: error-detecting & correcting codes
  • Computer systems
    • Watchdog processor: detects control flow errors in program execution [Mahmood & McCluskey, TC’88]
    • Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
Past Work: General Fault Detection and Tolerance (contd.)
• Concurrent on-line testing (contd.)
  • Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration
  • FPGA-based systems: on-line testing & FT [Verma, M.S. Thesis, UIC’01], [Dutt et al., ICCAD’99], [Mahapatra & Dutt, FTCS’99], [Abramovici et al., ITC’99]
EM-Induced Faults
• High-level detection of computer failures caused by different types of EM signals [Mojert et al., EMC’01]
  • Radiation therapy machine overdoses patients
  • Space Shuttle cannot launch due to a synchronization error in redundant computers
• Failures in real-time communication & control systems from communication-line errors due to EM signals [Kohlberg & Carter, EMC’01]
• SEUs (Single-Event Upsets): a potential threat to the reliability of integrated circuits operating in radiation environments
  • Space/avionics applications, due to high-energy particles
    • Hubble Space Telescope (NASA space-based astronomical observatory)
  • Ground level (atmospheric neutrons)
Fault Types & Fault Injection Methods
• Error Types
  • Control flow errors: incorrect sequence of instruction execution. Causes: address-generation errors, memory faults, bus faults
  • Data errors. Causes: computation errors, memory & bus faults
  • Hung processor & crashes. Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero
• Error types are NOT mutually exclusive
Fault Types & Fault Injection Methods
• Fault Injection Methods
  • Hardware fault injection
    • With contact (voltage or current changes, using pin-level probes and sockets): Messaline [Arlat et al., FTC’89]
    • Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC’89], MARS [Karlsson et al., DCCA’95]
  • Software fault injection
    • Compile-time injection (modifying program instructions): Doctor [Han et al., CPD’95]
    • Runtime injection (a trigger activates the fault injection mechanism)
      • Time-out
      • Exception/trap: Xception [Carreira et al., DCCA’95]
      • Code insertion: Ferrari [Kanawati et al., FTC’92], Ftape [Tsai et al., FTC’96]
Fault Types & Fault Injection Methods
• Software Fault Injection (contd.)
  • Advantages
    • Does not require expensive hardware
    • Can target applications and operating systems, which is difficult to do with hardware fault injection
  • Disadvantages
    • Changes the structure of the original software
    • Cannot inject faults into locations that are inaccessible to software
Fault Types & Fault Injection Methods
[Figure: fault injection system. Components: controller, workload library, fault library, fault injector, workload generator, monitor, data collector, data analyzer, target system]
Proposed Work
• VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications)
• VHDL-based introduction of fault injection logic in the CPU as well as memory and external buses, to simulate different fault patterns likely to be caused by EMI
• Develop techniques for detecting program errors due to these faults
• Classify the fault types into data, control and hung/crashed processor
• Preliminary results for simulation of faults in external memory address and data buses
Proposed Work
[Figure: fault generator attached to the data bus and address bus between the DLX CPU and memory. Labels: Counter_1, Counter_2, variable-width/variable-period pulse generator, signal line (1/0), data]
• Fault generator parameters
  • Location & values of faults
  • Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.)
  • Duration of faults: [0-50T], where T = CPU clock cycle
  • Start times: [0, Texc(workload)], where Texc = execution time without fault
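The fault types named on this slide can be illustrated with a short software model. This is only a sketch in Python, not the VHDL fault generator the slides propose: the function names, the 32-bit bus width, the trace values and the fixed seed are all assumptions made for illustration.

```python
import random

def stuck_at(word, mask, value):
    """Force every bus line selected by mask to 0 or 1."""
    return (word | mask) if value else (word & ~mask)

def single_random(word, width, rng):
    """Flip one randomly chosen line of a width-bit bus."""
    return word ^ (1 << rng.randrange(width))

def clustered(word, width, cluster_len, rng):
    """Flip cluster_len adjacent lines starting at a random position."""
    start = rng.randrange(width - cluster_len + 1)
    return word ^ (((1 << cluster_len) - 1) << start)

def multiple_random(word, width, count, rng):
    """Flip count distinct, randomly chosen lines."""
    for pos in rng.sample(range(width), count):
        word ^= 1 << pos
    return word

def fault_active(cycle, start, duration):
    """True while the window [start, start + duration) covers this cycle."""
    return start <= cycle < start + duration

def inject(trace, fault, start, duration):
    """Apply fault(word) to the bus words whose cycle index is in the window."""
    return [fault(w) if fault_active(c, start, duration) else w
            for c, w in enumerate(trace)]

rng = random.Random(0)  # fixed seed (assumption) so a run is repeatable
trace = [0x10, 0x14, 0x18, 0x1C]  # made-up address-bus values, one per cycle
# Hold address line 2 at 0 for two cycles starting at cycle 1.
faulty = inject(trace, lambda w: stuck_at(w, 0x4, 0), start=1, duration=2)
```

Here the start time and duration play the roles of the start-time and duration parameters on the slide, measured in clock cycles.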
Proposed Work (contd.)
• Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on power/ground (p/g) lines from the circuit analysis component
• The processor will be partitioned into 4 main modules (control unit, ALU, register file & cache), with separate or common p/g lines, to determine their different degrees of susceptibility
[Figure: cache, control unit, register file and ALU, each with a p/g line]
Methodologies: Control Flow Checking
• A watchdog: a small co-processor that monitors the behavior of the system
• It is provided in advance with information about the processor to be checked (memory accesses, control flow, control signals, ...)
• It compares the information gathered concurrently with the information previously provided
• Its complexity lies between that of current circuit-level and system-level techniques
[Figure: processor, memory hierarchy and watchdog connected by the memory bus; the watchdog also receives a signal from the branch circuit]
Methodologies: Control Flow Checking
_fibo:
    sw -4(r14),r30
    .
    .
    seq r1,r3,r4
    bnez r1,L3
    .
    .
    seq r1,r3,r4
    bnez r1,L3
    j L2
L3: .
    addi r1,r0,#1
    j L1
L2: ..
    ..
• A node is a block of instructions with a branch at the end
• The derived signature of a node is a function (e.g., XOR, LFSR) of all its instructions
• A program graph has an arc from node u to node v if the branch at u can lead to node v
• With signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS’85]
[Figure: program graph with nodes n1 to n5. Labels: Sign(n4), BRT, L1, L1, WD]
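The node, signature and program-graph definitions on this slide can be sketched as a small checker. This is a minimal Python illustration under stated assumptions, not the watchdog design from the cited work: the class name `Watchdog`, the XOR signature choice, and the node contents and arcs in the usage below are all hypothetical.

```python
from functools import reduce

def signature(instr_words):
    """XOR signature of all instruction words in a node (one option from the slide)."""
    return reduce(lambda a, b: a ^ b, instr_words, 0)

class Watchdog:
    """Holds precomputed node signatures and the program graph (hypothetical sketch)."""

    def __init__(self, nodes, arcs):
        # nodes: node name -> list of instruction words
        # arcs:  node name -> set of legal successor node names
        self.expected = {name: signature(words) for name, words in nodes.items()}
        self.arcs = arcs
        self.prev = None

    def check(self, node, fetched_words):
        """Return True if the transition is legal and the run-time signature matches."""
        legal_flow = self.prev is None or node in self.arcs.get(self.prev, set())
        good_signature = signature(fetched_words) == self.expected[node]
        self.prev = node
        return legal_flow and good_signature

# Made-up three-node program: n1 branches to n2 or n3, n2 falls into n3.
nodes = {"n1": [0x11, 0x22], "n2": [0x33], "n3": [0x44, 0x55]}
arcs = {"n1": {"n2", "n3"}, "n2": {"n3"}, "n3": set()}
wd = Watchdog(nodes, arcs)
```

A plain XOR signature misses any pair of faults that flip the same bit position in two words of one node; an LFSR-based signature, the other option named on the slide, is less prone to that kind of cancellation.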
Examples of Error Types
L1: .
    lw r3,0(r30)
    addi r0,r0,#1
    seq r1,r3,r0
    bnez r1,L1
L2: .
    .
    subi r2,r2,#1
    seq r1,r3,r2
    bnez r1,L2
    j L4
L3: .
    addi r1,r0,#1
    j L1
L4: ..
    ..
Error types
• Segmentation fault: r0=24, r3=25
• Hung processor: r2=1, r3=0
• Out-of-bound address: L4=256
• Invalid instruction: instruction code can be changed
Analysis of Errors
• Program never finished (47%)
• Program terminated incorrectly (23%)
• Terminated with incorrect result (23%)
• Terminated with correct result (7%)
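One way to arrive at a four-way breakdown like this is to classify each fault-injection run against a fault-free ("golden") result. The Python sketch below is an assumption about how such a tally could be done, not the authors' procedure: the function names, the run-record layout, and the idea that "never finished" is decided by a timeout are all hypothetical.

```python
from collections import Counter

def classify(finished, clean_exit, result, golden):
    """Map one fault-injection run to one of the four outcome classes."""
    if not finished:              # e.g. the run exceeded a timeout (assumption)
        return "never finished"
    if not clean_exit:            # e.g. ended on an exception or trap
        return "terminated incorrectly"
    if result != golden:
        return "terminated with incorrect result"
    return "terminated with correct result"

def tally(runs, golden):
    """Return the percentage of runs in each outcome class."""
    counts = Counter(classify(f, c, r, golden) for f, c, r in runs)
    return {label: 100.0 * n / len(runs) for label, n in counts.items()}

# Made-up run records: (finished, clean_exit, result); golden result is 8.
runs = [(False, False, None), (True, False, None), (True, True, 7), (True, True, 8)]
percentages = tally(runs, golden=8)
```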
Methodologies: Algorithm-Based Fault Tolerance
• Instruction execution errors
  • Difficult to detect: they occur inside the microprocessor and are not observable to an external watchdog processor
  • Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC’1983]
  • Transient faults occur more frequently than permanent faults in digital systems
  • Detection of transient faults must be done in real time
Methodologies: Algorithm-Based Fault Tolerance
• Use properties of the computation to check the correctness of computed data
• E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it
  • Pre-compute v’ = v1 + v2 + … + vk (input checksum)
  • Compute f(v1), …, f(vk)
  • Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
  • Check whether f(v’) = u; inequality indicates computation error(s)
• Can be used for linear computations such as matrix multiplication, matrix addition and Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
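The checksum steps above can be sketched for one linear computation, a matrix-vector product. This is a minimal Python illustration under assumptions: the matrix `A`, the input vectors, and the helper `faulty_on_call` (which models a transient fault by corrupting one call) are made up, and integer arithmetic is used so the equality test is exact.

```python
def matvec(a, v):
    """Linear computation f(v) = A v on integer lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def checked_apply(f, inputs):
    """Apply f to every input and verify the result with input/output checksums."""
    v_check = vec_sum(inputs)             # input checksum v' = v1 + ... + vk
    outputs = [f(v) for v in inputs]      # f(v1), ..., f(vk)
    u = vec_sum(outputs)                  # output checksum u
    return outputs, f(v_check) == u       # inequality signals an error

def faulty_on_call(f, bad_call):
    """Wrap f so that call number bad_call returns a corrupted first element."""
    state = {"calls": 0}
    def wrapped(v):
        out = list(f(v))
        if state["calls"] == bad_call:
            out[0] ^= 1                   # model a single-bit transient error
        state["calls"] += 1
        return out
    return wrapped

A = [[1, 2], [3, 4]]                      # made-up matrix
f = lambda v: matvec(A, v)
inputs = [[1, 0], [0, 1], [2, 5]]         # made-up input vectors

outputs, ok = checked_apply(f, inputs)
_, ok_faulty = checked_apply(faulty_on_call(f, bad_call=1), inputs)
```

The check costs one extra evaluation of f plus two vector sums. With floating-point data the exact comparison would need to be replaced by a tolerance test.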
Methodologies: Algorithm-Based Fault Tolerance
• Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor
• Calculate the expected execution parameters of each instruction
• Store this information in the watchdog processor (instruction parameter table)
• Compare the parameters of each fetched instruction with the stored data
• If the parameters do not match, signal an error
• Error coverage varies with the program and the microprocessor; for the 8086 instruction set it is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]
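The table lookup and comparison described above might look roughly like the Python sketch below. The slide does not say which execution parameters are stored, so the choice of cycle count and number of memory accesses, the opcode names and every table value here are hypothetical.

```python
# Hypothetical instruction parameter table:
# opcode -> (expected cycles, expected memory accesses). Values are made up.
PARAM_TABLE = {
    "ADD": (1, 0),
    "LW": (2, 1),
    "SW": (2, 1),
}

def check_instruction(opcode, observed_cycles, observed_mem_accesses):
    """Compare what the watchdog observed on the bus with the stored parameters."""
    expected = PARAM_TABLE.get(opcode)
    if expected is None:
        return "error: invalid opcode"
    if expected != (observed_cycles, observed_mem_accesses):
        return "error: parameter mismatch"
    return "ok"

# A load that takes the expected 2 cycles and 1 memory access passes;
# the same load with an extra memory access is flagged.
status_good = check_instruction("LW", 2, 1)
status_bad = check_instruction("LW", 2, 2)
```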
Goals, Questions & Future Outlook
• Q: Are there patterns of errors that lead to computer crashes with high probability?
• Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (saving state & data for later resumption)?
• Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
• Q: If so, can these be used for “early detection & warning” of EM interference?
• Future: Based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults.