Runtime Logic and Interconnect Fault Recovery on Diverse FPGA Architectures

John Lach¹, William H. Mangione-Smith¹, Miodrag Potkonjak²
UCLA Departments of Electrical Engineering¹ and Computer Science²

Outline
• Motivation
• Project goals
• General approach
• Tiling and redundant interconnect
• Architecture applications
• Experimental results
• Conclusions

Motivation
• Long-life and mission-critical applications
• Space and remote systems
• FPGA applications
• Radiation faults
• System constraints
  • power
  • cost
  • size

Project Goals
• High tolerance of multiple faults
  • logic
  • interconnect
  • radiation and manufacturing/operating imperfections
• Reduced system down time
• Transparent
• Low area and timing overhead
• Low memory requirements
• Low design effort
• Applicable to diverse FPGA architectures

General Approach
• Create and store many different instances of the same design
• Each resource is unused in at least one instance
• In the face of a fault, an instance is activated that does not use the faulty resource
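The general approach above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' tool flow: the names `select_instance`, `covers_all_single_faults`, the `cfg_*` labels, and the three-resource example are all hypothetical. Each stored instance is represented simply by the set of resources it uses.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set

# Hypothetical model: an instance is just the set of resource names it occupies.
Instances = Dict[str, FrozenSet[str]]


def covers_all_single_faults(instances: Instances, resources: Iterable[str]) -> bool:
    """True if every resource is left unused by at least one stored instance."""
    return all(
        any(res not in used for used in instances.values())
        for res in resources
    )


def select_instance(instances: Instances, faulty: Set[str]) -> Optional[str]:
    """Return the first stored instance that avoids every faulty resource, if any."""
    for name, used in instances.items():
        if used.isdisjoint(faulty):
            return name
    return None


# Toy example (assumed data): three resources, each instance leaves one unused.
resources = ["LB0", "LB1", "LB2"]
instances = {
    "cfg_a": frozenset({"LB0", "LB1"}),
    "cfg_b": frozenset({"LB1", "LB2"}),
    "cfg_c": frozenset({"LB0", "LB2"}),
}

print(covers_all_single_faults(instances, resources))  # True
print(select_instance(instances, {"LB1"}))             # cfg_c
print(select_instance(instances, {"LB0", "LB1"}))      # None
```

The last call returns `None` because no stored instance avoids both faulty resources at once, which is the multiple-fault limitation that motivates tiling.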

General Approach Problem: Tolerance of Multiple Faults
[Figure: A 6x6 LB design with 4 LB area free]

Solution: Fine-Grained Tiling
• Partition design into a set of “tiles”
• Each tile has some unused resources
  • area overhead???
• Lock the interface between tiles
  • tile independence
• Generate instances of each tile
  • Atomic Fault-Tolerant Blocks (AFTBs)
• Each resource is unused in at least one AFTB
• In the face of a fault, an AFTB is invoked that does not use the faulty resource
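A minimal sketch of per-tile recovery follows, under stated assumptions: a 6x6 grid of logic blocks split into 3x3 tiles, exactly one spare logic block per tile, and one AFTB per possible spare position. The function names (`tile_of`, `make_aftbs`, `recover`) are hypothetical, and real AFTBs would also carry placement and routing data that this model ignores.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Cell = Tuple[int, int]   # (row, col) of a logic block
Tile = Tuple[int, int]   # (tile_row, tile_col)


def tile_of(cell: Cell, tile_size: int = 3) -> Tile:
    """Map a logic block coordinate to the tile that contains it."""
    return (cell[0] // tile_size, cell[1] // tile_size)


def make_aftbs(tile: Tile, tile_size: int = 3) -> Dict[Cell, FrozenSet[Cell]]:
    """Assumed model: one AFTB per cell, each leaving exactly that cell unused."""
    r0, c0 = tile[0] * tile_size, tile[1] * tile_size
    cells = [(r0 + dr, c0 + dc) for dr in range(tile_size) for dc in range(tile_size)]
    return {spare: frozenset(c for c in cells if c != spare) for spare in cells}


def recover(faults: Iterable[Cell], grid: int = 6, tile_size: int = 3) -> Dict[Tile, Cell]:
    """Pick, independently for each tile, an AFTB that avoids that tile's faults.

    Returns the unused (spare) cell of the chosen AFTB per tile. Raises
    RuntimeError if some tile has more faults than this model can absorb.
    """
    faults_by_tile: Dict[Tile, Set[Cell]] = {}
    for cell in faults:
        faults_by_tile.setdefault(tile_of(cell, tile_size), set()).add(cell)

    chosen: Dict[Tile, Cell] = {}
    for tr in range(grid // tile_size):
        for tc in range(grid // tile_size):
            tile = (tr, tc)
            bad = faults_by_tile.get(tile, set())
            for spare, used in make_aftbs(tile, tile_size).items():
                if used.isdisjoint(bad):
                    chosen[tile] = spare
                    break
            else:
                raise RuntimeError(f"tile {tile} has no AFTB avoiding {sorted(bad)}")
    return chosen


# Two faults in different tiles are handled by two independent AFTB swaps.
plan = recover({(0, 0), (4, 5)})
print(plan[(0, 0)], plan[(1, 1)])  # (0, 0) (4, 5)
```

Because the interfaces between tiles are locked, each tile's choice is independent of the others, so only the affected tiles need a different AFTB.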

Tiling Example
[Figure: A 6x6 LB design partitioned into 4 3x3 tiles]
[Figure: AFTB example]
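A rough count shows why tiling helps with storage. The numbers below are an illustration, not figures from the presentation. They assume that an untiled instance is identified by which 4 of the 36 logic blocks it leaves unused, that each 3x3 tile has exactly one spare logic block, and that one AFTB is stored per spare position.

```python
from math import comb

total_lbs = 6 * 6        # logic blocks in the 6x6 design
spare_lbs = 4            # unused logic blocks overall

# Untiled: one full-design instance per choice of 4 unused logic blocks
# is needed to avoid every possible set of 4 faulty logic blocks.
untiled_instances = comb(total_lbs, spare_lbs)

# Tiled: 4 tiles of 3x3, one spare per tile (assumption), so each tile
# needs one AFTB per possible position of its spare logic block.
tiles = 4
lbs_per_tile = 3 * 3
tiled_aftbs = tiles * lbs_per_tile

print(untiled_instances)  # 58905
print(tiled_aftbs)        # 36
```

The two counts do not cover the same fault sets: under these assumptions the tiled scheme handles at most one logic fault per tile, whereas the untiled enumeration handles any four. The comparison only shows how quickly whole-design instances multiply.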

Design Example (Xilinx XC4000 Device)
[Figure: Initial floorplan for PREP 5 benchmark]
[Figure: After tiling and one AFTB identified]
[Figure: After fault detected at (20,3)]

Benefits of Approach
• High reliability
• Low overhead
  • physical resources, circuit performance, memory, design effort
• Runtime management
  • tolerates faults on-line
  • minimizes system downtime
• Flexibility
  • variable timing constraints, resource limitations, and estimated reliability

Interconnect Faults
• Tiling tolerates most interconnect faults
• Inter-tile interconnect
  • global
  • overlapped/segmented
• Reserve redundant interconnect

Diverse Architectures
• Sanders CSRC device
• Xilinx XC4000 family
• Altera Flex 10k series

Sanders CSRC: CSLA and Level 1 Routing

Sanders CSRC: Data Pipes and Level 3 Routing

Sanders CSRC: Pre and Post Pipe Portion Fault Configurations

Sanders CSRC: Hierarchical Redundancy

Xilinx XC4000
[Figure: Initial floorplan for PREP 5 benchmark]
[Figure: After tiling and one AFTB identified]
[Figure: After fault detected at (20,3)]

Xilinx XC4000: Inter-Tile Interconnect Fault Recovery

Altera Flex 10k

Altera Flex 10k: Hierarchical Redundancy

Timing and Area Overhead

Reliability Enhancement: Variable Resource Reliability

Reliability Enhancement: Correlated Fault Model with Variable μ

5000 CLB Design Reliability

Conclusion
• Tolerate logic and interconnect faults
• High tolerance of multiple faults
• Short system down time
  • runtime management
  • transparent
• Low area, timing, memory, effort overhead
• Flexible
  • applications
  • architectures
