Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment

Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment. Shantanu Dutt, Hasan Arslan, ECE Dept., University of Illinois at Chicago

Outline • Past Work – General Fault Detection and Tolerance • Past Work – EMI-Induced Faults • Fault Types and Fault Injection Methods • Proposed Work and System • Methodologies to Detect Faults • Questions and Future Outlook

Past Work – General Fault Detection and Tolerance • Off-line testing of digital circuits • Self-diagnosis • Tests each functional block • Not well suited to real-time applications • Redundancy • Hardware, software or time • Has a high overhead penalty

Past Work – General Fault Detection and Tolerance • Concurrent on-line testing: adding external hardware that monitors data, address and control lines • Memory: error-detecting & correcting codes • Computer systems • Watchdog processor: detects control flow errors in program execution [Mahmood & McCluskey, TC’88] • Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

Past Work – General Fault Detection and Tolerance (contd.) • Concurrent on-line testing (contd.) • Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration • FPGA-based systems: on-line testing & FT [Verma, M.S. Thesis, UIC’01], [Dutt et al., ICCAD’99], [Mahapatra & Dutt, FTCS’99], [Abramovici et al., ITC’99]

EM-Induced Faults • High-level detection of computer failures due to different types of EM signals [Mojert et al., EMC’01] • Radiation therapy machine overdoses patients • Space Shuttle cannot launch due to a synchronization error in redundant computers • Failures in real-time communication & control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC’01] • SEUs (Single-Event Upsets): a potential threat to the reliability of integrated circuits operating in radiation environments • Space/avionics applications, due to high-energy particles • Hubble Space Telescope • NASA space-based astronomical observatory • Ground level (atmospheric neutrons)

Fault Types & Fault Injection Methods • Error Types • Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults • Data errors. Causes: computation errors, memory & bus faults • Hung processor & crashes. Causes: control unit transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero • Error types are NOT mutually exclusive

Fault Types & Fault Injection Methods • Fault Injection Methods • Hardware fault injection • With contact (voltage or current changes; uses pin-level probes and sockets): Messaline [Arlat et al., FTC’89] • Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC’89], MARS [Karlsson et al., DCCA’95] • Software fault injection • Compile-time injection (modifying program instructions): Doctor [Han et al., CPD’95] • Runtime injection (a trigger starts the fault injection mechanism) • Time-out • Exception/trap: Xception [Carreira et al., DCCA’95] • Code insertion: Ferrari [Kanawati et al., FTC’92], Ftape [Tsai et al., FTC’96]

Fault Types & Fault Injection Methods • Software Fault Injection (contd.) • Advantages • Does not require expensive hardware • Can target applications and operating systems, which is difficult to do with hardware fault injection • Disadvantages • Changes the structure of the original software • Cannot inject faults into locations that are inaccessible to software

Fault Types & Fault Injection Methods • Fault injection system (block diagram components): controller, workload library, fault library, fault injector, workload generator, monitor, data collector, data analyzer, target system

Characteristics of Fault Injection Methods

Proposed Work • VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications) • VHDL-based introduction of fault injection logic in the CPU, as well as in memory and external buses, to simulate different fault patterns likely to be caused by EMI • Develop techniques for detecting program errors due to these faults • Classification of the fault types into data, control and hung/crashed processor • Preliminary results for simulation of faults in external memory address and data buses

Proposed Work • Fault generator (diagram labels): Counter_1, Counter_2, signal line (1/0), data, variable-width variable-period pulse generator, data bus, address bus, DLX CPU, memory • Location & values of faults • Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.) • Duration of faults & start times: [0-50T], T = CPU clock cycle; [0, Texc(workload)], Texc = execution time without fault
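The fault parameters listed on this slide (type, location, start time, duration) can be illustrated with a small software model. The following is only a sketch in Python, not the VHDL fault generator from the proposed work; the names `inject_bus_fault` and `random_mask`, the dictionary layout of a fault, and the 32-bit default bus width are all assumptions made for illustration.

```python
import random

def inject_bus_fault(value, cycle, fault, width=32):
    """Return the bus value observed at `cycle` after applying `fault`.

    `fault` is a hypothetical description with keys:
      kind     -- "stuck_at_0", "stuck_at_1" or "flip"
      mask     -- bit mask selecting the affected bus lines
      start    -- first clock cycle in which the fault is active
      duration -- number of clock cycles the fault stays active
    """
    full = (1 << width) - 1
    active = fault["start"] <= cycle < fault["start"] + fault["duration"]
    if not active:
        return value & full          # fault window not reached or already over
    mask = fault["mask"] & full
    if fault["kind"] == "stuck_at_0":
        return value & ~mask & full  # force selected lines low
    if fault["kind"] == "stuck_at_1":
        return (value | mask) & full # force selected lines high
    if fault["kind"] == "flip":
        return (value ^ mask) & full # invert selected lines
    raise ValueError("unknown fault kind: " + fault["kind"])

def random_mask(rng, width=32, bits=1, clustered=False):
    """Pick `bits` faulty lines: adjacent if `clustered`, else scattered."""
    if clustered:
        low = rng.randrange(width - bits + 1)
        return ((1 << bits) - 1) << low
    mask = 0
    for pos in rng.sample(range(width), bits):
        mask |= 1 << pos
    return mask
```

With `bits=1` the mask models a single random fault, with `clustered=True` a clustered fault, and with `bits>1` multiple random faults. A simulation loop would call `inject_bus_fault` on every bus transfer, passing the current clock cycle.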

Proposed Work (contd.) • Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on p/g lines from the circuit analysis component • Processor will be partitioned into 4 main modules (control unit, ALU, register file & cache), with separate or common p/g lines, to determine their different degrees of susceptibility • Diagram labels: cache, control unit, register file and ALU, each with a p/g line

Methodologies: Control Flow Checking • A watchdog: a small co-processor that monitors the behavior of the system • Provided in advance with information about the processor to be checked (memory access, control flow, control signals, ...) • Compares the information gathered concurrently with the information previously provided • Complexity lies between that of current circuit-level and system-level techniques • Diagram labels: memory hierarchy, watchdog, memory bus, signal from branch circuit, processor

Methodologies: Control Flow Checking • A node is a block of instructions with a branch at the end • The derived signature of a node is a function (e.g., XOR, LFSR) of all its instructions • A program graph is one in which there is an arc from node u to node v if the branch at u can lead to node v • Based on the signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS’85]

Example program (DLX assembly; elisions as on the slide):

    _fibo: sw -4(r14),r30
           . .
           seq r1,r3,r4
           bnez r1,L3
           . .
           seq r1,r3,r4
           bnez r1,L3
           j L2
    L3:    .
           addi r1,r0,#1
           j L1
    L2:    .. ..

Diagram labels: n1, n2, n3, n4, n5, Sign(n4), BRT, L1, L1, WD
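The node, signature and program-graph definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the watchdog design of Mahmood & McCluskey: the `Watchdog` class, its method names, and the assumption that the watchdog is told which node it is observing are all hypothetical, and XOR is used only because the slide names it as one possible signature function.

```python
def node_signature(instruction_words):
    """Derived signature of a node: XOR of all its instruction words."""
    sig = 0
    for word in instruction_words:
        sig ^= word
    return sig

class Watchdog:
    """Checks node signatures and arcs against a precomputed program graph."""

    def __init__(self, nodes, arcs, entry):
        # nodes: node name -> list of instruction words (fault-free program)
        # arcs:  node name -> set of legal successor node names
        self.expected = {name: node_signature(w) for name, w in nodes.items()}
        self.arcs = arcs
        self.entry = entry
        self.current = None  # no node observed yet

    def observe(self, node, fetched_words):
        """Return True if this node execution is consistent with the graph."""
        if self.current is None:
            legal = node == self.entry
        else:
            legal = node in self.arcs.get(self.current, set())
        # An unknown node has no expected signature, so the comparison fails.
        sig_ok = node_signature(fetched_words) == self.expected.get(node)
        self.current = node
        return legal and sig_ok
```

A plain XOR signature misses errors that cancel each other out, for example the same bit flipped in two instructions of one node; an LFSR-based signature, the other option named on the slide, reduces that kind of aliasing.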

Examples of Error Types • Segmentation fault: r0=24, r3=25 • Hung processor: r2=1, r3=0 • Out-of-bound address: L4=256 • Invalid instruction: instruction code can be changed

Example program (DLX assembly; elisions as on the slide):

    L1: .
        lw r3,0(r30)
        addi r0,r0,#1
        seq r1,r3,r0
        bnez r1,L1
    L2: . .
        subi r2,r2,#1
        seq r1,r3,r2
        bnez r1,L2
        j L4
    L3: .
        addi r1,r0,#1
        j L1
    L4: .. ..

Analysis of Error

Analysis of Error • Program never finished (47%) • Program terminated incorrectly (23%) • Terminated with incorrect result (23%) • Terminated with correct result (7%)
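The four outcome categories above can be assigned mechanically by comparing each fault-injected run against a fault-free ("golden") run. The sketch below is an assumption about how such a tally might be organized, not the authors' analysis tool; the function names, the run record fields (`finished`, `exit_ok`, `result`), and the reading of "terminated incorrectly" as an abnormal exit are all hypothetical.

```python
from collections import Counter

def classify_run(finished, exit_ok, result, golden_result):
    """Map one fault-injected run to one of the four outcome categories."""
    if not finished:
        return "never finished"                    # e.g. simulation time-out
    if not exit_ok:
        return "terminated incorrectly"            # assumed: abnormal exit
    if result != golden_result:
        return "terminated with incorrect result"
    return "terminated with correct result"

def tally(runs, golden_result):
    """Count outcomes over a list of hypothetical run records (dicts)."""
    return Counter(
        classify_run(r["finished"], r["exit_ok"], r["result"], golden_result)
        for r in runs
    )
```

The order of the checks matters: a run that never finishes has no meaningful exit status or result, so it is classified first.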

Methodologies: Algorithm-Based Fault Tolerance • Instruction execution errors • Difficult to detect: they occur inside the microprocessor and are not observable to an external watchdog processor • Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC 1983] • Transient faults occur more frequently than permanent faults in digital systems • Detection of transient faults must be done in real time

Methodologies: Algorithm-Based Fault Tolerance • Use properties of the computation to check the correctness of computed data • E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it • Pre-compute v’ = v1 + v2 + … + vk (input checksum) • Compute f(v1), …, f(vk) • Compute u = f(v1) + f(v2) + … + f(vk) (output checksum) • Check if f(v’) = u; inequality indicates computation error(s) • Can be used for linear computations such as matrix multiplication, matrix addition, Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
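The checksum steps on this slide can be sketched for one concrete linear computation, a matrix-vector product. This is an illustrative Python sketch under stated assumptions, not the schemes of the cited papers: the helper names `mat_vec`, `vec_add` and `abft_check` are hypothetical, and integer arithmetic is assumed so that the final comparison can be exact.

```python
def mat_vec(A, v):
    """Linear computation f(v) = A v for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vec_add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def abft_check(f, inputs):
    """Run f on every input and verify the results with checksums.

    Returns (outputs, ok); ok is False when f(v') differs from u.
    """
    # Input checksum v' = v1 + v2 + ... + vk
    checksum_in = inputs[0]
    for v in inputs[1:]:
        checksum_in = vec_add(checksum_in, v)
    # The actual computations f(v1), ..., f(vk)
    outputs = [f(v) for v in inputs]
    # Output checksum u = f(v1) + f(v2) + ... + f(vk)
    checksum_out = outputs[0]
    for y in outputs[1:]:
        checksum_out = vec_add(checksum_out, y)
    # Linearity implies f(v') == u in a fault-free run
    return outputs, f(checksum_in) == checksum_out
```

The check costs one extra evaluation of f plus the two checksum sums, regardless of k. With floating-point data the equality test would need a tolerance, since rounding makes f(v’) and u differ slightly even without faults.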

Methodologies: Algorithm-Based Fault Tolerance • Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor • Calculate the expected execution parameters of each instruction • Store this information in the watchdog processor (instruction parameter table) • Compare the fetched instruction parameters with the stored data • If the parameters do not match, give an error message • Error coverage varies with the program and the microprocessor; for the 8086 instruction set, error coverage is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]
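The table lookup and comparison described on this slide might look roughly as follows. This is a hedged sketch only: the opcodes, the choice of instruction length and bus-cycle count as the "execution parameters", and the names `PARAM_TABLE` and `check_instruction` are invented for illustration and are not real 8086 values or the parameters used by Khan & Tront.

```python
# Hypothetical instruction parameter table: opcode -> expected parameters.
# The opcodes and numbers are made up for illustration, not real 8086 data.
PARAM_TABLE = {
    0x01: {"bytes": 2, "bus_cycles": 1},
    0x02: {"bytes": 3, "bus_cycles": 2},
    0x03: {"bytes": 1, "bus_cycles": 1},
}

def check_instruction(opcode, observed_bytes, observed_bus_cycles):
    """Compare what the watchdog observed on the bus with the stored table."""
    expected = PARAM_TABLE.get(opcode)
    if expected is None:
        return "error: invalid opcode"
    observed = (observed_bytes, observed_bus_cycles)
    if observed != (expected["bytes"], expected["bus_cycles"]):
        return "error: parameter mismatch"
    return "ok"
```

For example, `check_instruction(0x02, 3, 2)` matches the table entry, while observing only one bus cycle for the same opcode is reported as a mismatch.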

Goals, Questions & Future Outlook • Q: Are there patterns of errors that lead to computer crashes with high probability? • Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (saving state & data for later resumption)? • Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults? • Q: If so, can these be used as “early detection & warning” of EM interference? • Future: Based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults.
