
Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment

Shantanu Dutt, Hasan Arslan, ECE Dept., University of Illinois at Chicago



  1. Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment. Shantanu Dutt, Hasan Arslan, ECE Dept., University of Illinois at Chicago

  2. Outline
     • Past Work – General Fault Detection and Tolerance
     • Past Work – EMI-Induced Faults
     • Fault Types and Fault Injection Methods
     • Proposed Work and System
     • Methodologies to Detect Faults
     • Questions and Future Outlook

  3. Past Work – General Fault Detection and Tolerance
     • Off-line testing of digital circuits
       • Self-diagnosis
       • Tests each functional block
       • Not well suited to real-time applications
     • Redundancy
       • Hardware, software, or time
       • Carries a high overhead penalty

  4. Past Work – General Fault Detection and Tolerance
     • Concurrent on-line testing: adding external hardware that monitors the data, address, and control lines
       • Memory: error-detecting and error-correcting codes
       • Computer systems
         • Watchdog processor: detects control-flow errors in program execution [Mahmood & McCluskey, TC’88]
         • Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

  5. Past Work – General Fault Detection and Tolerance (contd.)
     • Concurrent on-line testing (contd.)
       • Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration
       • FPGA-based systems: on-line testing and fault tolerance [Verma, M.S. Thesis, UIC’01], [Dutt et al., ICCAD’99], [Mahapatra & Dutt, FTCS’99], [Abramovici et al., ITC’99]

  6. EM-Induced Faults
     • Detection of high-level computer failures caused by different types of EM signals [Mojert et al., EMC’01]
       • Radiation therapy machine overdoses patients
       • Space Shuttle cannot launch because of a synchronization error in its redundant computers
     • Failures in real-time communication and control systems caused by communication line errors due to EM signals [Kohlberg & Carter, EMC’01]
     • SEUs (single-event upsets): a potential threat to the reliability of integrated circuits operating in a radiation environment
       • Space/avionics applications, due to high-energy particles
         • Hubble Space Telescope (NASA space-based astronomical observatory)
       • Ground level (atmospheric neutrons)

  7. Fault Types & Fault Injection Methods
     • Error types
       • Control-flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
       • Data errors. Causes: computation errors, memory and bus faults
       • Hung processor and crashes. Causes: control unit (C.U.) transitions to dead-end states, invalid instructions, out-of-bound addresses, divide-by-zero
     • Error types are NOT mutually exclusive
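The three error types above can be told apart by comparing a faulty run against a fault-free ("golden") run of the same workload. The sketch below is only an illustration of that idea, not the authors' classifier: the function name, the use of program-counter traces, and the `finished` flag are all assumptions. It returns a set because the error types are not mutually exclusive.

```python
def classify(golden_pcs, golden_result, faulty_pcs, faulty_result, finished):
    """Return the set of error types observed in a faulty run.

    golden_pcs / faulty_pcs: program-counter traces (hypothetical inputs).
    finished: False if the faulty run hung or crashed before completing.
    """
    errors = set()
    if not finished:
        errors.add("hung_or_crashed")

    # Control-flow error: the executed instruction sequence deviates.
    if finished:
        diverged = faulty_pcs != golden_pcs
    else:
        # Only the part of the run that actually executed can be compared.
        diverged = faulty_pcs != golden_pcs[:len(faulty_pcs)]
    if diverged:
        errors.add("control_flow")

    # Data error: the run completed but produced a different result.
    if finished and faulty_result != golden_result:
        errors.add("data")
    return errors
```

Note that this reports observable symptoms only: a corrupted data value that feeds a branch shows up here as a control-flow error, possibly together with a data error.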

  8. Fault Types & Fault Injection Methods
     • Fault injection methods
       • Hardware fault injection
         • With contact (voltage or current changes; uses pin-level probes and sockets): Messaline [Arlat et al., FTC’89]
         • Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC’89], MARS [Karlsson et al., DCCA’95]
       • Software fault injection
         • Compile-time injection (modifying program instructions): Doctor [Han et al., CPD’95]
         • Runtime injection (a trigger activates the fault injection mechanism)
           • Time-out
           • Exception/trap: Xception [Carreira et al., DCCA’95]
           • Code insertion: Ferrari [Kanawati et al., FTC’92], Ftape [Tsai et al., FTC’96]

  9. Fault Types & Fault Injection Methods
     • Software fault injection (contd.)
       • Advantages
         • Does not require expensive hardware
         • Can target applications and operating systems, which is difficult to do with hardware fault injection
       • Disadvantages
         • Changes the structure of the original software
         • Cannot inject faults into locations that are inaccessible to software

  10. Fault Types & Fault Injection Methods
     [Figure: fault injection system. Components: controller, workload library, fault library, fault injector, workload generator, monitor, data collector, data analyzer, target system]

  11. Characteristics of Fault Injection Methods

  12. Proposed Work
     • VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications)
     • VHDL-based introduction of fault injection logic in the CPU, as well as in memory and the external buses, to simulate different fault patterns likely to be caused by EMI
     • Development of techniques for detecting program errors due to these faults
     • Classification of the fault types into data, control, and hung/crashed processor
     • Preliminary results for simulation of faults in the external memory address and data buses

  13. Proposed Work
     [Figure: a fault generator connected to the data bus and address bus between the DLX CPU and memory. Other labels: Counter_1, Counter_2, variable-width, variable-period pulse generator, signal line (1/0), data]
     Fault generator parameters:
     • Location and values of faults
     • Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.)
     • Duration of faults and start times: [0, 50T], where T is the CPU clock cycle, and [0, Texc(workload)], where Texc is the execution time without faults
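The fault types listed on this slide can be illustrated with a small software model of a bus. This is a minimal sketch rather than the VHDL fault generator the slides describe: the 32-bit bus width, the function names, the cluster size of 3, and the choice to model random faults as bit flips are all assumptions. It also assumes the duration range is [0, 50T] and that each trace entry corresponds to one clock cycle T.

```python
import random

BUS_WIDTH = 32  # assumed bus width; the slides do not state one


def fault_mask(fault_type, rng, width=BUS_WIDTH, count=3):
    """Return a bit mask selecting the bus lines hit by the fault."""
    if fault_type in ("stuck_at_0", "stuck_at_1", "single_random"):
        return 1 << rng.randrange(width)
    if fault_type == "clustered":
        # `count` adjacent lines starting at a random position
        start = rng.randrange(width - count + 1)
        return ((1 << count) - 1) << start
    if fault_type == "multiple_random":
        mask = 0
        for bit in rng.sample(range(width), count):
            mask |= 1 << bit
        return mask
    raise ValueError(f"unknown fault type: {fault_type}")


def apply_fault(word, fault_type, mask):
    """Corrupt one bus word according to the fault type."""
    if fault_type == "stuck_at_0":
        return word & ~mask
    if fault_type == "stuck_at_1":
        return word | mask
    return word ^ mask  # random faults are modeled here as bit flips


def inject(trace, fault_type, start, duration, seed=0):
    """Corrupt trace[start:start+duration]; one entry per clock cycle T."""
    rng = random.Random(seed)
    mask = fault_mask(fault_type, rng)
    return [
        apply_fault(w, fault_type, mask) if start <= t < start + duration else w
        for t, w in enumerate(trace)
    ]
```

For example, `inject(trace, "stuck_at_0", start=2, duration=3)` forces one randomly chosen line to 0 for cycles 2 through 4 and leaves the rest of the trace untouched. A fixed seed keeps an injection experiment repeatable.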

  14. Proposed Work (contd.)
     • Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on power/ground (p/g) lines from the circuit analysis component
     • The processor will be partitioned into 4 main modules (control unit, ALU, register file, and cache), with separate or common p/g lines, to determine their different degrees of susceptibility
     [Figure: control unit, ALU, register file, and cache, each with a p/g connection]

  15. Methodologies: Control Flow Checking
     • A watchdog is a small co-processor that monitors the behavior of the system
     • It is provided in advance with information about the processor to be checked (memory accesses, control flow, control signals, ...)
     • It compares the information gathered concurrently with the information provided in advance
     • Its complexity lies between that of current circuit-level and system-level techniques
     [Figure: block diagram with labels Processor, Memory Bus, Memory Hierarchy, Watchdog, and Signal from branch circuit]

  16. Methodologies: Control Flow Checking
     Example program fragment:
        _fibo:  sw    -4(r14),r30
                . .
                seq   r1,r3,r4
                bnez  r1,L3
                . .
                seq   r1,r3,r4
                bnez  r1,L3
                j     L2
        L3:     .
                addi  r1,r0,#1
                j     L1
        L2:     .. ..
     • A node is a block of instructions with a branch at the end
     • The derived signature of a node is a function (e.g., XOR, LFSR) of all its instructions
     • A program graph is one in which there is an arc from node u to node v if the branch at u can lead to node v
     • Because checking is based on signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS’85]
     [Figure: program graph with nodes n1 to n5; other labels: Sign(n4), BRT, L1, WD]
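The node, signature, and program-graph definitions on this slide can be sketched as follows. This is an illustration under stated assumptions, not the scheme from the cited paper: the node names, the instruction encodings, and the arcs are made up, and XOR is used as the signature function because the slide names it as one option.

```python
from functools import reduce

# Hypothetical program: each node is a list of 32-bit instruction words
# ending in a branch. The encodings below are made up for illustration.
NODES = {
    "n1": [0xAC1E0004, 0x10610003],
    "n2": [0x00641828, 0x14200002],
    "n3": [0x20010001, 0x08000010],
}
# Program graph: an arc u -> v exists if the branch at u can lead to v.
ARCS = {"n1": {"n2", "n3"}, "n2": {"n3"}, "n3": {"n1"}}


def signature(instructions):
    """Derived signature of a node: XOR of all its instruction words."""
    return reduce(lambda a, b: a ^ b, instructions, 0)


# Computed before execution and stored in the watchdog.
REFERENCE = {name: signature(words) for name, words in NODES.items()}


def check_run(observed):
    """Check a run given as (node_name, fetched_words) pairs in order.

    Returns None if the run is consistent, else a description of the error.
    """
    prev = None
    for name, words in observed:
        if prev is not None and name not in ARCS[prev]:
            return f"illegal branch {prev} -> {name}"
        if signature(words) != REFERENCE[name]:
            return f"signature mismatch in {name}"
        prev = name
    return None
```

A plain XOR signature misses any pair of flips in the same bit position within a node, which is one reason an LFSR-based signature may be preferred in practice.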

  17. Examples of Error Types
     Example program fragment:
        L1:     .
                lw    r3,0(r30)
                addi  r0,r0,#1
                seq   r1,r3,r0
                bnez  r1,L1
        L2:     . .
                subi  r2,r2,#1
                seq   r1,r3,r2
                bnez  r1,L2
                j     L4
        L3:     .
                addi  r1,r0,#1
                j     L1
        L4:     .. ..
     Error types:
     • Segmentation fault: r0=24, r3=25
     • Hung processor: r2=1, r3=0
     • Out-of-bound address: L4=256
     • Invalid instruction: the instruction code can be changed

  18. Analysis of Error

  19. Analysis of Error
     • Program never finished (47%)
     • Program terminated incorrectly (23%)
     • Terminated with incorrect result (23%)
     • Terminated with correct result (7%)

  20. Methodologies: Algorithm-Based Fault Tolerance
     • Instruction execution errors
       • Difficult to detect: they occur inside the microprocessor and are not observable by an external watchdog processor
       • Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC’1983]
     • Transient faults occur more frequently than permanent faults in digital systems
     • Transient faults must be detected in real time

  21. Methodologies: Algorithm-Based Fault Tolerance
     • Use properties of the computation to check the correctness of the computed data
     • E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it
       • Pre-compute v’ = v1 + v2 + … + vk (input checksum)
       • Compute f(v1), …, f(vk)
       • Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
       • Check whether f(v’) = u; inequality indicates computation error(s)
     • Can be used for linear computations such as matrix multiplication, matrix addition, and Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
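The checksum steps above can be tried on a small linear computation. The sketch below assumes f is multiplication by a fixed integer matrix; the function names and the example data are illustrative and do not come from the slides. Integers are used so that the equality test is exact; a floating-point version would need a tolerance.

```python
def matvec(A, v):
    """The linear computation f(v) = A v, on lists of integers."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def vec_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]


def abft_check(f, inputs, outputs):
    """True if the outputs are consistent with f under linearity."""
    v_prime = vec_sum(inputs)   # input checksum  v' = v1 + ... + vk
    u = vec_sum(outputs)        # output checksum u = f(v1) + ... + f(vk)
    return f(v_prime) == u


A = [[1, 2], [3, 4]]
f = lambda v: matvec(A, v)
inputs = [[1, 0], [0, 1], [2, 3]]
outputs = [f(v) for v in inputs]
print(abft_check(f, inputs, outputs))   # True: no error injected

outputs[1][0] ^= 4                      # flip one bit in one computed value
print(abft_check(f, inputs, outputs))   # False: the checksum exposes the error
```

The check costs one extra evaluation of f plus two vector sums, independent of which of the k computations was corrupted. It cannot tell which output is wrong, and errors that cancel in the sum go undetected.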

  22. Methodologies: Algorithm-Based Fault Tolerance
     • Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor
     • Calculate the expected execution parameters of each instruction
     • Store this information in the watchdog processor (instruction parameter table)
     • Compare the fetched instruction parameters with the stored data
     • If the parameters do not match, give an error message
     • Error coverage varies with the program and the microprocessor; for the 8086 instruction set, it is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]
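The table lookup and comparison described above might look like the sketch below. The slides do not say which execution parameters are stored, so the tuple used here (bus cycles, memory reads, memory writes) is an assumption, and the opcodes and values are hypothetical placeholders rather than real 8086 data.

```python
# Hypothetical instruction parameter table: opcode -> expected
# (bus cycles, memory reads, memory writes). Values are illustrative only.
PARAM_TABLE = {
    "LOAD":  (2, 1, 0),
    "STORE": (2, 0, 1),
    "ADD":   (1, 0, 0),
}


def check_instruction(opcode, observed):
    """Compare observed parameters with the stored expectation."""
    expected = PARAM_TABLE.get(opcode)
    if expected is None:
        return f"invalid opcode {opcode!r}"
    if observed != expected:
        return f"{opcode}: expected {expected}, observed {observed}"
    return None


def monitor(bus_trace):
    """Return the error messages for a sequence of (opcode, observed) pairs."""
    errors = []
    for opcode, observed in bus_trace:
        msg = check_instruction(opcode, observed)
        if msg is not None:
            errors.append(msg)
    return errors
```

A fault that changes an opcode into another valid opcode with identical parameters would pass this check, which is consistent with coverage being below 100%.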

  23. Goals, Questions & Future Outlook
     • Q: Are there patterns of errors that lead to computer crashes with high probability?
     • Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (saving state and data for later resumption)?
     • Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
     • Q: If so, can these be used for “early detection & warning” of EM interference?
     • Future: based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults
