Runtime Logic and Interconnect Fault Recovery on Diverse FPGA Architectures • John Lach (1), William H. Mangione-Smith (1), Miodrag Potkonjak (2) • UCLA Departments of Electrical Engineering (1) and Computer Science (2)
Outline • Motivation • Project goals • General approach • Tiling and redundant interconnect • Architecture applications • Experimental results • Conclusions
Motivation • Long-life and mission-critical applications • Space and remote systems • FPGA applications • Radiation faults • System constraints • power • cost • size
Project Goals • High tolerance of multiple faults • logic • interconnect • radiation and manufacturing/operating imperfections • Reduced system downtime • Transparent • Low area and timing overhead • Low memory requirements • Low design effort • Applicable to diverse FPGA architectures
General Approach • Create and store many different instances of the same design • Each resource is unused in at least one instance • In the face of a fault, an instance is activated that does not use the faulty resource
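The selection rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the resource labels (`LB0` to `LB3`), the instance names, and the helpers `covers_all` and `activate` are hypothetical and do not come from the presentation, and each instance is modeled simply as the set of resources it occupies.

```python
from typing import Dict, FrozenSet, Iterable, Optional

def covers_all(instances: Dict[str, FrozenSet[str]], resources: Iterable[str]) -> bool:
    """Check the slide's invariant: every resource is unused in at least one instance."""
    return all(any(r not in used for used in instances.values()) for r in resources)

def activate(instances: Dict[str, FrozenSet[str]], faulty: str) -> Optional[str]:
    """Return the name of the first stored instance that avoids the faulty resource."""
    for name, used in instances.items():
        if faulty not in used:
            return name
    return None  # no stored instance avoids this resource

# Hypothetical device with four logic blocks; the design needs three of them,
# so each stored instance leaves a different block unused.
resources = ["LB0", "LB1", "LB2", "LB3"]
instances = {f"instance_without_{r}": frozenset(resources) - {r} for r in resources}

print(covers_all(instances, resources))   # invariant holds for this toy set
print(activate(instances, "LB2"))         # instance that leaves LB2 unused
```

In this toy model a single fault is always recoverable, but two simultaneous faults are not, because every instance leaves only one block unused.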
General Approach Problem: Tolerance of Multiple Faults • Figure: a 6x6 LB design with an area of 4 LBs free
Solution: Fine-Grained Tiling • Partition design into a set of “tiles” • Each tile has some unused resources • area overhead??? • Lock the interface between tiles • tile independence • Generate instances of each tile • Atomic Fault-Tolerant Blocks (AFTBs) • Each resource is unused in at least one AFTB • In the face of a fault, an AFTB is invoked that does not use the faulty resource
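A rough sketch of why tiling helps with multiple faults: because tile interfaces are locked, each tile can pick its own AFTB independently. The model below is an assumption made for illustration, not the authors' tool flow. It treats an AFTB as the set of logic blocks it occupies, assumes 3x3 tiles with exactly one spare block each, and uses hypothetical names (`make_tile_aftbs`, `select_aftbs`, zero-indexed `(row, col)` coordinates).

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

LB = Tuple[int, int]  # (row, col) of a logic block; zero-indexed, illustrative only

def make_tile_aftbs(row0: int, col0: int, size: int = 3) -> Dict[str, FrozenSet[LB]]:
    """Build one AFTB per logic block in the tile; each AFTB leaves that block unused."""
    cells = [(row0 + r, col0 + c) for r in range(size) for c in range(size)]
    return {f"spare@{cell}": frozenset(cells) - {cell} for cell in cells}

def select_aftbs(tiles: Dict[str, Dict[str, FrozenSet[LB]]],
                 faults: Set[LB]) -> Optional[Dict[str, str]]:
    """Pick, independently for each tile, an AFTB that uses no faulty block."""
    chosen = {}
    for tile_name, aftbs in tiles.items():
        match = next((name for name, used in aftbs.items() if not (used & faults)), None)
        if match is None:
            return None  # this tile has more faults than its spare capacity
        chosen[tile_name] = match
    return chosen

# A 6x6 array split into four 3x3 tiles, as in the tiling example.
tiles = {f"tile({r},{c})": make_tile_aftbs(3 * r, 3 * c) for r in range(2) for c in range(2)}

# Two simultaneous faults in different tiles are both avoided.
chosen = select_aftbs(tiles, {(0, 1), (4, 4)})
print(chosen)

# Two faults inside the same tile exceed its single spare block in this model.
print(select_aftbs(tiles, {(0, 0), (0, 1)}))
```

Under these assumptions the design tolerates one logic fault per tile at the same time, which a single whole-design instance with one spare block cannot do.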
Tiling Example • Figure: a 6x6 LB design partitioned into four 3x3 tiles • Figure: AFTB example
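For the 6x6 example, the mapping from a faulty block to its tile, and the number of stored configurations, can be worked out directly. The sketch assumes zero-indexed coordinates and nine AFTBs per tile (one per block left spare); both are illustrative assumptions, and `tile_of` and `storage_vs_coverage` are hypothetical helper names.

```python
from typing import Tuple

def tile_of(row: int, col: int, tile_size: int = 3) -> Tuple[int, int]:
    """Map a logic block coordinate to the (tile_row, tile_col) that contains it."""
    return (row // tile_size, col // tile_size)

def storage_vs_coverage(num_tiles: int, aftbs_per_tile: int) -> Tuple[int, int]:
    """Return (partial configurations stored, whole-device combinations they compose)."""
    stored = num_tiles * aftbs_per_tile
    composable = aftbs_per_tile ** num_tiles
    return stored, composable

print(tile_of(4, 1))              # block in row 4, column 1 lies in tile (1, 0)
print(storage_vs_coverage(4, 9))  # (36, 6561)
```

With these numbers, 36 stored tile-level configurations can be combined into 6561 distinct whole-device configurations, which is one way to see why tiling keeps memory requirements low.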
Design Example (Xilinx XC4000 Device) • Figure: initial floorplan for PREP 5 benchmark • Figure: after tiling and one AFTB identified • Figure: after fault detected at (20,3)
Benefits of Approach • High reliability • Low overhead • physical resources, circuit performance, memory, design effort • Runtime management • tolerates faults on-line • minimizes system downtime • Flexibility • variable timing constraints, resource limitations, and estimated reliability
Interconnect Faults • Tiling tolerates most interconnect faults • Inter-tile interconnect • global • overlapped/segmented • Reserve redundant interconnect
Diverse Architectures • Sanders CSRC device • Xilinx XC4000 family • Altera Flex 10k series
Sanders CSRC • Figure: Pre and Post Pipe Portion Fault Configurations
Xilinx XC4000 • Figure: initial floorplan for PREP 5 benchmark • Figure: after tiling and one AFTB identified • Figure: after fault detected at (20,3)
Reliability Enhancement: Correlated Fault Model with Variable μ
Conclusion • Tolerate logic and interconnect faults • High tolerance of multiple faults • Short system downtime • runtime management • transparent • Low area, timing, memory, effort overhead • Flexible • applications • architectures