A Survey of Fault Tolerant Methodologies for FPGA’s

A Survey of Fault Tolerant Methodologies for FPGA’s Gökhan Kabukcu 2006703357

Outline • Introduction to FPGAs • Device-Level Fault Tolerance • Methods • Configuration-Level Fault Tolerance • Methods • Comparison of Methodologies • Conclusion

Introduction (FPGA) • A field programmable gate array (FPGA) is a semiconductor device containing programmable logic components and programmable interconnects • Consists of regular arrays of programmable logic blocks (PLBs) • Programmable routing matrix • The configuration of an FPGA determines: • The functionality of the FPGA • Which PLBs will be used • The functionality of the PLBs • Which wire segments will be used for connecting PLBs

Introduction (FPGA) • PLBs are multi-input, multi-output circuits and allow: • Sequential Designs • Combinational Designs • PLBs include: • Look Up Tables (LUTs or small ROMs) • Multiplexers • Flip-Flops

Introduction (FPGA) • Look Up Tables (LUTs): • 4-input, 1-output units • Can be used as: • RAM • ROM • Shift Register • Functional Unit • Configured by a 16-bit “INIT” value (one bit per input combination)

Introduction (FPGA) • An Example: • y=(x1+x2)*x3+x4 • Create truth table • Assign “INIT” to the LUT • Since there are 4 inputs and 1 output, 1 LUT is enough to represent the equation • The LUT can be put into any PLB in the FPGA
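The truth-table step above can be sketched in a few lines of Python. This is an illustration only, under two assumptions not stated on the slide: "+" and "*" are read as logical OR and AND, and the LUT address is formed as x1 + 2*x2 + 4*x3 + 8*x4, with bit `addr` of INIT holding the output for that address. The helper name `lut_init` is hypothetical; real tools may order the address bits differently.

```python
def lut_init(func, n_inputs=4):
    """Pack the truth table of `func` into an INIT integer.

    Assumed convention: address = x1 + 2*x2 + 4*x3 + 8*x4, and bit `addr`
    of the result is the LUT output for that address.
    """
    init = 0
    for addr in range(1 << n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]  # x1 first
        if func(*bits):
            init |= 1 << addr
    return init


def y(x1, x2, x3, x4):
    # '+' read as OR, '*' read as AND (assumption)
    return ((x1 | x2) & x3) | x4


init_y = lut_init(y)
print(f"INIT = 0x{init_y:04X}")  # prints "INIT = 0xFFE0" under this bit ordering
```

All four inputs fit one LUT, so the single 16-bit word fully describes the function.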

Introduction (FPGA) • Another Example: • y=(x1+x2)*x3+x4 • z=y*x5 • Create truth tables • Assign “INIT”s to LUTs • Since there are 5 inputs and 1 output, 2 LUTs are needed to represent the equations • The LUTs can be put into any PLBs in the FPGA • A1 and A0 are “don’t care” inputs
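A rough sketch of the two-LUT cascade follows. It assumes, beyond what the slide states, that "+" and "*" are OR and AND, that the address is A3 A2 A1 A0 with A0 as the least significant bit, that x4..x1 drive A3..A0 of the first LUT, and that x5 and y drive A3 and A2 of the second LUT. The helpers `pack_init` and `lut_read` are hypothetical names.

```python
def pack_init(func):
    """Pack a 4-input truth table into 16 bits.

    Bit `addr` holds the output for address A3 A2 A1 A0 (assumed ordering).
    """
    init = 0
    for addr in range(16):
        a0, a1, a2, a3 = ((addr >> i) & 1 for i in range(4))
        init |= func(a3, a2, a1, a0) << addr
    return init


def lut_read(init, a3, a2, a1, a0):
    """Look up one output bit in a 16-bit INIT word."""
    addr = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
    return (init >> addr) & 1


# LUT 1: y = (x1 + x2) * x3 + x4, with x4..x1 wired to A3..A0.
INIT_Y = pack_init(lambda a3, a2, a1, a0: ((a0 | a1) & a2) | a3)
# LUT 2: z = y * x5, with x5 on A3 and y on A2; A1 and A0 are ignored.
INIT_Z = pack_init(lambda a3, a2, a1, a0: a3 & a2)


def z(x1, x2, x3, x4, x5):
    y = lut_read(INIT_Y, x4, x3, x2, x1)
    return lut_read(INIT_Z, x5, y, 0, 0)  # don't-care inputs tied to 0 here


print(f"INIT_Y = 0x{INIT_Y:04X}, INIT_Z = 0x{INIT_Z:04X}")
```

Because A1 and A0 do not affect the second LUT, each group of four consecutive INIT bits carries the same value.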

Introduction (FPGA) • An example of a full design on an FPGA

Fault Tolerance • Device-Level Fault Tolerance • Attempts to deal with faults at the level of the FPGA hardware • Selects redundant hardware to replace the faulty hardware • Cost: extra hardware resources • Configuration-Level Fault Tolerance • Tolerates faults at the level of the FPGA configuration • When a circuit is placed, fault-free resources are selected • The status of the resources is considered each time a circuit is placed and routed • Cost: extra reconfiguration time

Device-Level FT Methods(1) • Extra Rows • One extra spare row is added • Selection logic is added to bypass the defective row • Vertical wire segments are lengthened by one row • Faults in one row can be tolerated • More than one spare row is needed to tolerate faults in multiple rows

Device-Level FT Methods(2) • Reconfiguration Network • Four architectural changes • Additional routing resources (bypass lines) • Reconfiguration Memory to store locations of faulty resources • On-chip circuitry for reconfiguration routing • Additional column of PLBs

Device-Level FT Methods(2) • Reconfiguration Network • Test and identify faulty resources • Create fault map • Load map into Reconfiguration Memory • On-board router avoids faulty resources • The network is constructed by shifting all PLBs in the fault-containing row towards the right • Method can tolerate 1 fault in each row if there is one extra spare column.
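The row-shifting idea can be sketched as a simple column remapping. This is a software analogy only, not the on-chip circuitry: the function name `remap_rows` and the dictionary-style fault map are assumptions made for illustration, with one spare column at the right edge and at most one fault per row.

```python
def remap_rows(n_rows, n_logical_cols, fault_map):
    """Return, for each row, the physical column used by each logical column.

    fault_map maps a row index to the physical column of its single faulty
    PLB (hypothetical format). Each physical row has n_logical_cols + 1
    columns; the last one is the spare.
    """
    layout = []
    for row in range(n_rows):
        faulty = fault_map.get(row)
        # PLBs at or right of the fault shift one position to the right.
        cols = [c if faulty is None or c < faulty else c + 1
                for c in range(n_logical_cols)]
        layout.append(cols)
    return layout


layout = remap_rows(3, 4, {1: 2})
for row, cols in enumerate(layout):
    print(row, cols)
```

Rows without a fault keep their original columns, so only the fault-containing row pays the shift.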

Device-Level FT Methods(3) • Self-Repairing Architecture • Sub-arrays of PLBs • Routers between sub-arrays • Extra columns of PLBs • PLBs constantly test themselves • If a fault is detected: • The column of the affected PLB is shifted one position to the right • The inter-array routers are adjusted • Area overhead of this method is significant • If there is 1 spare column and N sub-arrays stacked vertically, the method can tolerate N faults at a time

Device-Level FT Methods(4) • Block-Structured Architecture • Goal: tolerate larger and denser patterns of defects efficiently • Blocks of PLBs • The FPGA is configured by a loading arm • The block at the end of the loading arm is configured

Device-Level FT Methods(4) • Block-Structured Architecture • A block is selected by the loading arm and tested • If the test is passed, it is configured, otherwise designated as faulty • Loading arm configures blocks one by one • If the arm cannot extend any further in a path, it’s retracted by one block • Fault tolerance is provided by redundant rows and/or columns • Area overhead is significant
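The extend, test, and retract behaviour of the loading arm resembles a depth-first walk with backtracking. The sketch below is a loose software analogy under assumptions not in the slides: blocks form a rectangular grid, the arm extends to 4-connected neighbours in a fixed order, and `block_ok` stands in for the real block test. The function name `load_blocks` is hypothetical.

```python
def load_blocks(block_ok, start, needed):
    """Configure up to `needed` good blocks, starting the arm at `start`.

    block_ok[r][c] is the assumed test outcome for block (r, c).
    Returns (configured blocks in order, set of blocks found faulty).
    """
    rows, cols = len(block_ok), len(block_ok[0])
    tested = {start}
    faulty = set()
    configured = []
    arm = []  # path of blocks currently spanned by the arm
    if block_ok[start[0]][start[1]]:
        configured.append(start)
        arm.append(start)
    else:
        faulty.add(start)
    while arm and len(configured) < needed:
        r, c = arm[-1]
        candidates = [(r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)]
        nxt = next((p for p in candidates
                    if 0 <= p[0] < rows and 0 <= p[1] < cols
                    and p not in tested), None)
        if nxt is None:
            arm.pop()  # cannot extend: retract the arm by one block
            continue
        tested.add(nxt)
        if block_ok[nxt[0]][nxt[1]]:
            configured.append(nxt)  # test passed: configure and extend
            arm.append(nxt)
        else:
            faulty.add(nxt)  # test failed: designate as faulty
    return configured, faulty


grid = [[True, True, True],
        [False, True, False]]
configured, faulty = load_blocks(grid, start=(0, 1), needed=4)
print(configured, faulty)
```

In this example the arm retracts twice before it reaches the last good block.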

Device-Level FT Methods(5) • Fault Tolerant Segments/Grids • Fault Tolerant Segments: • Adds one track of spare segments to each wiring channel • If a faulty segment is found, the affected segment is shifted to the spare • A single fault can be tolerated • Fault Tolerant Grids: • An entire spare routing grid is added • No additional elements in the routing channel, no extra time delay

Device-Level FT Methods(6) • SRAM Shifting • Based on shifting the entire circuit on the FPGA • PLBs can be allocated in one of 2 ways: • King Allocation: 8 PLBs share one spare, circuit can move in 8 directions • Horse Allocation: 4 PLBs share one spare, circuit can move in 4 directions • Testing determines the faulty cells and feeds this information to the shifter circuitry on the FPGA

Device-Level FT Methods(6) • SRAM Shifting • Additional spare PLBs surrounding the FPGA • Horse Allocation used in the figure • The circuit is shifted up and right • Advantages of the Method: • No external reconfiguration algorithm is required • The timing of the circuit is almost fixed • Any single fault can be tolerated
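A minimal sketch of choosing a shift direction is given below. It assumes, without confirmation from the slides, that the 8 directions of King Allocation are the chess king's moves and that the circuit is modelled as a set of used cell coordinates with unused spare cells interleaved. The function name `find_shift` is hypothetical, and the on-chip shifter is not modelled.

```python
# Assumed: the 8 shift directions are the chess king's moves.
KING_MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]


def find_shift(used_cells, faulty_cell, moves=KING_MOVES):
    """Return a (dr, dc) shift that moves the circuit off the faulty cell.

    Returns (0, 0) if the faulty cell is already unused, or None if no
    single-step shift avoids it.
    """
    if faulty_cell not in used_cells:
        return (0, 0)
    for dr, dc in moves:
        shifted = {(r + dr, c + dc) for r, c in used_cells}
        if faulty_cell not in shifted:
            return (dr, dc)
    return None


# 8 used PLBs around one unused spare at (1, 1).
used = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
shift = find_shift(used, faulty_cell=(0, 0))
print(shift)
```

Here the whole circuit moves one step up and to the left, which places the unused spare position over the faulty cell.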

Configuration-Level FT Methods(1) • Pebble Shifting • Find an initial circuit configuration, then move pieces off faulty units • Occupied PLBs are called pebbles • Pair pebbles on faulty cells with unique, unused cells such that the sum of weighted Manhattan distances is minimized • Start shifting pebbles • If a pebble finds an empty cell other than the intended cell, this empty cell becomes the destination • No limit to the number of faults that can be tolerated

Configuration-Level FT Methods(1) • Pebble Shifting • Example: • 1 and 6 are on faulty cells • Using a minimum-cost, maximum-matching algorithm, the pairings are: 1->v11 and 6->v32 • Element 1 is shifted to its new position • To move 6, we shift 3, 8 and 7 • Now all elements are on non-faulty cells and allocation is done
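The pairing step can be illustrated with a brute-force search over assignments. This is not the minimum-cost, maximum-matching algorithm named above; exhaustive search is swapped in because it is short and adequate for a handful of cells. All weights are assumed to be 1, the coordinates are invented for illustration, and `pair_pebbles` is a hypothetical name. The subsequent shifting of intermediate pebbles is not modelled.

```python
from itertools import permutations


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pair_pebbles(stranded, free_cells):
    """Pair each pebble on a faulty cell with a distinct unused, fault-free
    cell so that the total Manhattan distance is minimal (brute force).

    Requires len(free_cells) >= len(stranded).
    """
    best = min(
        permutations(free_cells, len(stranded)),
        key=lambda dests: sum(manhattan(s, d)
                              for s, d in zip(stranded, dests)),
    )
    return dict(zip(stranded, best))


stranded = [(0, 0), (2, 1)]          # pebbles sitting on faulty cells
free_cells = [(0, 1), (2, 2), (3, 3)]  # unused, fault-free cells
pairing = pair_pebbles(stranded, free_cells)
print(pairing)
```

For larger arrays a polynomial-time assignment algorithm would be the natural replacement for the exhaustive search.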

Configuration-Level FT Methods(2) • Mini-Max Grid Matching • Uses a grid matching algorithm to match faulty logic to empty, non-faulty locations • Like Pebble Shifting, uses a minimum-cost, maximum-matching algorithm • Minimizes the maximum distance between the pairings, since the circuit’s performance is set by the critical (longest) path • Can tolerate faults until there are no unused cells
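The mini-max objective can be sketched by exhaustive search, again as a stand-in for the actual grid matching algorithm. The coordinates are invented for illustration and `minimax_pairing` is a hypothetical name; distances are plain Manhattan distances.

```python
from itertools import permutations


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def minimax_pairing(displaced, free_cells):
    """Pair displaced logic with distinct free cells so that the longest
    single move is as short as possible (brute force).

    Requires len(free_cells) >= len(displaced).
    """
    best = min(
        permutations(free_cells, len(displaced)),
        key=lambda dests: max(manhattan(s, d)
                              for s, d in zip(displaced, dests)),
    )
    return dict(zip(displaced, best))


displaced = [(0, 1), (2, 2)]
free_cells = [(0, 0), (0, 5)]
pairing = minimax_pairing(displaced, free_cells)
print(pairing)
```

In this example the chosen pairing has moves of length 4 and 4. Minimizing the sum instead would select moves of length 1 and 5, which is a smaller total but a longer worst-case move.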

Configuration-Level FT Methods(3) • Node-Covering and Cover Segments • When a fault is discovered, nodes are shifted along the chain (row) towards the right • The last PLB of a chain is reserved as a spare • One fault in a row can be tolerated • Needs no reconfiguration if local routing configurations are present

Configuration-Level FT Methods(4) • Tiling • Partition FPGA into tiles • Precompiled configurations of tiles are stored in memory • Each tile contains system function, some spare logic and interconnect resources • When a logic fault occurs in a tile, the configuration of the tile is replaced by a configuration that does not use the faulty resources • Many logic faults can be tolerated • Local interconnect faults can be tolerated, but global ones can’t be tolerated
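Selecting a replacement tile configuration amounts to a lookup over the precompiled alternatives. The sketch below assumes a hypothetical data layout, a list of (name, set of resources used) pairs per tile, and invented resource names; the slides do not specify how configurations are stored.

```python
def pick_configuration(precompiled, faulty_resources):
    """Return the name of the first precompiled tile configuration that
    uses none of the faulty resources, or None if every one is affected.

    precompiled: list of (name, set_of_resources_used) pairs (assumed format).
    """
    for name, used in precompiled:
        if used.isdisjoint(faulty_resources):
            return name
    return None


# Hypothetical tile with four PLBs; each alternative leaves one unused.
tile_configs = [
    ("cfg_a", {"plb0", "plb1", "plb2"}),
    ("cfg_b", {"plb0", "plb1", "plb3"}),
    ("cfg_c", {"plb0", "plb2", "plb3"}),
]
choice = pick_configuration(tile_configs, faulty_resources={"plb2"})
print(choice)
```

Since the alternatives are compiled ahead of time, the repair needs no place-and-route run, only a swap of the tile's configuration.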

Configuration-Level FT Methods(5) • Cluster-Based • Intracluster tolerance in a PLB • Basic Logic Elements (BLEs or LUTs) • For simple LUT faults, the preferred solution is to use another LUT in the same PLB • Instead of changing the PLB, try to find a solution within the same PLB • In the example, T is faulty and the 4th PLB is used instead of the 2nd PLB

Configuration-Level FT Methods(6) • Column-Based • Treats the design as a set of functional units, where each unit is a column • Like Tiling, uses precompiled configurations, but at lower cost • At least one column should be spare • If there is a faulty cell in a column, the column is shifted toward the spare column • Method can tolerate m faulty columns, where m is the number of columns not occupied by system functions

Comparison of Methodologies(1) • Device-Level (DL) Methods need extra HW and have a higher area cost • DL Methods use one initial reconfiguration and incur no extra reconfiguration cost • Configuration-Level (CL) Methods need more than one reconfiguration, which sometimes results in a high time cost • CL Methods need no extra HW and incur no additional area cost

Comparison of Methodologies(2) • DL Methods are less flexible, therefore less able to improve reliability • CL Methods usually tolerate more faults than DL Methods • Performance impact of fault tolerance is less for DL Methods than CL Methods

Conclusion • No single fault tolerance methodology is better than the others in all cases • DL Methods have less impact on performance but are less flexible • CL Methods tolerate more faults but have more impact on performance
