Redundancy-Aware, Fault-Tolerant Clustering
Jason Cong and Brian Tagiku
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
{cong,btagiku}@cs.ucla.edu
http://cadlab.cs.ucla.edu/
Overview of IC-DFN Efforts at UCLA
• Synthesis for higher level of abstraction
• Architecture and synthesis for nanoFPGAs (jointly with Prof. Tim Cheng, Evelyn Hu, and Kang Wang)
• Synthesis for error-resilient designs
UCLA VLSICAD LAB
xPilot: Platform-Based Synthesis System
• Uniqueness of xPilot
  • Platform-based synthesis and optimization
  • Communication-centric synthesis
• Recent progress on xPilot
  • Refined MMM-to-SSDM translation
  • Efficient and versatile scheduling engine based on a system of difference constraints (DAC’06)
  • Communication-centric binding based on a distributed register file μ-arch (ICCAD’06)
  • Behavior-and-communication co-optimization for interface synthesis (DAC’06)
• Design drivers
  • Motion-JPEG
  • MPEG4 simple profile video decoder
• Hybrid approach on Xilinx XUP board
  • Microblaze (or PowerPC) + HW synthesized blocks
[Figure: xPilot flow diagram. Labels: SystemC/C/MMM; Platform Description & Constraints; xPilot Front End; Profiling; SSDM (System-Level Synthesis Data Model); Analysis; Mapping; Processor & Architecture Synthesis; Interface Synthesis; Behavioral Synthesis; Custom Logic; Drivers + Glue Logic; Processor Cores + Executables; FPSoC]
MPEG-4 Simple Profile Decoder: Synthesis Results
• Complexity of synthesized RTLs
Updated Results on Motion-JPEG Example
• Platform: Xilinx XUP Board
• Input: RAW images; output: encoded JPEG images
• Model #1: 5 Microblazes, FSL-based communication
  • Stages: Preprocess, DCT, Quant, Huffman, Table Modification
• OR Model #2: 4 Microblazes + DCT on FPGA fabrics
  • Stages: Preprocess, HW-DCT, Quant, Huffman, Table Modification
• FSL-based communication is a major performance overhead
Overview of IC-DFN Efforts at UCLA
• Synthesis for higher level of abstraction
• Architecture and synthesis for nanoFPGAs (jointly with Prof. Tim Cheng, Evelyn Hu, and Kang Wang)
• Synthesis for error-resilient designs
  • Redundancy-aware, fault-tolerant clustering
Hierarchical FPGAs
• 2-level, hierarchical circuit logic
  • Level 1: LUTs
  • Level 2: Clusters of LUTs
  • Higher levels (clusters of clusters) also possible
• Uses locality of interconnections to improve circuit performance
Redundancy in FPGAs
• LUTs can fail with some probability
• Allocate extra components (e.g. LUTs) into the system
• Re-route inputs and outputs to a spare LUT
• Ideally, want the spare LUT to be close to the failure so that delay does not increase
Motivational Example
• 4 LUTs (each of delay 1)
• 2 clusters of 3 LUTs
• Inter-cluster edges have delay 3
• Target delay 6
• LUTs fail with probability 0.1
[Figure: example network with LUTs labeled A, B, C, D]
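To see why the placement of spares matters, here is a minimal sketch in Python. It assumes that LUT failures are independent, that spare LUTs fail with the same probability as used ones, and that a failed LUT is only repaired by a spare in its own cluster (so delay does not change). The function name and the two splits compared are illustrative and are not taken from the slides.

```python
from math import comb


def local_repair_probability(cluster_size, used, p_fail):
    """Probability that at least `used` of the cluster's LUTs are working.

    Assumption (not from the slides): every physical LUT in the cluster,
    spare or not, fails independently with probability p_fail, and any
    working LUT in the cluster can take over for a failed one.
    """
    spares = cluster_size - used
    return sum(
        comb(cluster_size, f) * p_fail**f * (1 - p_fail) ** (cluster_size - f)
        for f in range(spares + 1)  # tolerate up to `spares` failures
    )


# 4 used LUTs placed into 2 clusters of 3 LUTs, failure probability 0.1.
balanced = local_repair_probability(3, 2, 0.1) ** 2  # 2 + 2 split
skewed = local_repair_probability(3, 3, 0.1) * local_repair_probability(3, 1, 0.1)  # 3 + 1 split

print(f"2+2 split: {balanced:.3f}")  # about 0.945
print(f"3+1 split: {skewed:.3f}")    # about 0.728
```

Under these assumptions the balanced split leaves one spare next to every used LUT, while the 3+1 split leaves a full cluster with no local spare at all. This is only a proxy: the actual objective is the probability of meeting the target delay, which may also allow repairs across clusters.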
The Problem
• Inputs
  • A network G of n LUTs (acyclic)
  • An FPGA with C clusters, each with M LUTs
  • Inter-cluster interconnect delay d
  • Target circuit delay D
  • Probability p of LUT failure
• Objective
  • Cluster G using no more than C clusters such that the probability of the circuit achieving delay D or faster is maximized
  • LUT duplication is allowed, but at the cost of a spare LUT
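The delay model behind these inputs can be sketched as a longest-path computation over the LUT network. The sketch below assumes every LUT has delay 1, intra-cluster edges have delay 0, and inter-cluster edges have delay d = 3 (the values used elsewhere in the deck); it ignores failures entirely. The function name, the dictionary layout, and the four-LUT example network are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def circuit_delay(preds, cluster_of, lut_delay=1, inter_delay=3):
    """Failure-free delay of a clustered, acyclic LUT network.

    preds:      dict mapping each LUT to the LUTs that feed it
    cluster_of: dict mapping each LUT to its cluster index
    Intra-cluster edges are assumed to cost 0.
    """
    arrival = {}
    # static_order() yields every LUT after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start = 0
        for parent in preds.get(node, ()):
            same = cluster_of[parent] == cluster_of[node]
            edge = 0 if same else inter_delay
            start = max(start, arrival[parent] + edge)
        arrival[node] = start + lut_delay
    return max(arrival.values())


# Hypothetical network: a feeds b and c, which both feed d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

paired = {"a": 0, "b": 0, "c": 1, "d": 1}    # two clusters of two LUTs
isolated = {"a": 0, "b": 1, "c": 2, "d": 3}  # every LUT in its own cluster

print(circuit_delay(preds, paired))    # 6
print(circuit_delay(preds, isolated))  # 9
```

A clustering that respects the capacity M and the cluster count C can then be scored against the target delay D. The hard part of the problem is that the score must hold with high probability once LUT failures and spare assignment are taken into account.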
Dynamic Programming Heuristic
• Use a dynamic programming matrix A
• A is an n × n × D matrix
• Each entry A[i,j,k] stores a clustering solution of LUT i and its predecessors such that
  • Exactly j clusters are used
  • The minimum arrival time at the output of i is k
  • The probability of the circuit achieving delay k is maximized
Dynamic Programming Heuristic
• Filling out the matrix
  • Traverse graph in topological order
  • For PI, form its own cluster
  • For all others
    • Select subset of parents
    • Select clusters of parents and merge
    • Place resulting clustering in A if probability of achieving k is largest so far
    • Repeat for all possible subsets of parents and clusterings
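Two mechanical pieces of this fill procedure can be sketched in Python: enumerating the subsets of a node's parents, and the "keep it only if the probability is the largest so far" update of A[i,j,k]. This is a rough sketch, not the authors' implementation. The class and function names are hypothetical, the table is a sparse dictionary rather than a dense matrix, and the probability of each candidate clustering is assumed to be supplied by some separate estimator.

```python
from dataclasses import dataclass, field
from itertools import chain, combinations


def parent_subsets(parents):
    """Yield every subset of a node's parents, smallest first.

    Assumption: the empty subset stands for "merge with no parent",
    i.e. the node starts a cluster of its own.
    """
    parents = list(parents)
    return chain.from_iterable(
        combinations(parents, r) for r in range(len(parents) + 1)
    )


@dataclass
class Entry:
    probability: float
    clustering: dict  # hypothetical layout: LUT name -> cluster index


@dataclass
class DPTable:
    # Sparse stand-in for A: key is (node i, clusters used j, arrival k).
    cells: dict = field(default_factory=dict)

    def offer(self, node, clusters_used, arrival, probability, clustering):
        """Store the candidate only if it beats the current entry."""
        key = (node, clusters_used, arrival)
        best = self.cells.get(key)
        if best is None or probability > best.probability:
            self.cells[key] = Entry(probability, dict(clustering))
            return True
        return False


table = DPTable()
table.offer("d", 2, 6, 0.70, {"a": 0, "b": 0, "c": 1, "d": 1})
table.offer("d", 2, 6, 0.65, {"a": 0, "b": 1, "c": 1, "d": 1})  # rejected
print(table.cells[("d", 2, 6)].probability)  # 0.7
print(list(parent_subsets(["b", "c"])))
```

The number of parent subsets grows as 2^(number of parents), and each subset is further combined with the stored clusterings of those parents, which is why the procedure is described as a heuristic rather than an exact method.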
DP Heuristic Performance
• All LUTs weight 1
• 10% failure rate
• Intracluster edge delay 0
• Intercluster edge delay 3
• 8 clusters each of 3 LUTs
• Target delay of 7
DP Heuristic Performance
• DP clustering: achieves delay 7 with probability ≈ 0.39
• Min-delay clustering: achieves delay 7 with probability ≈ 0.28
Difficulties
• Best known algorithm for calculating the probability distribution of delays is exponential
  • Reconvergent fan-out introduces dependencies in probabilities
  • Can’t use exact probabilities to guide algorithms/heuristics
  • Hard to evaluate the performance of algorithms/heuristics
• Difficult to assess quality of a sub-clustering of a node and its fan-in cone
  • Global knowledge (e.g. placement of spares) of the clustering is needed
  • Makes dynamic programming a harder approach
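The cost of exact evaluation can be illustrated with a brute-force sketch that sums over all 2^n failure patterns, next to a sampling estimate of the same quantity. Sampling is a technique swapped in here for illustration; the slides do not propose it. The predicate `meets_target` is a hypothetical stand-in for "the circuit still achieves the target delay after repair", and the toy model (3 clusters of 3 LUTs, each tolerating one failure) is an assumption.

```python
import random
from itertools import product


def exact_success(num_luts, p_fail, meets_target):
    """Sum the probability of every failure pattern that still meets
    the target. Visits 2**num_luts patterns, so it only scales to
    very small instances."""
    total = 0.0
    for pattern in product([False, True], repeat=num_luts):
        failed = frozenset(i for i, f in enumerate(pattern) if f)
        weight = p_fail ** len(failed) * (1 - p_fail) ** (num_luts - len(failed))
        if meets_target(failed):
            total += weight
    return total


def sampled_success(num_luts, p_fail, meets_target, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        failed = frozenset(i for i in range(num_luts) if rng.random() < p_fail)
        hits += meets_target(failed)
    return hits / trials


def meets_target(failed):
    # Toy model: LUT i sits in cluster i // 3, and each cluster of
    # three LUTs holds one spare, so it tolerates one failure.
    per_cluster = [0, 0, 0]
    for i in failed:
        per_cluster[i // 3] += 1
    return all(count <= 1 for count in per_cluster)


exact = exact_success(9, 0.1, meets_target)      # 512 patterns
approx = sampled_success(9, 0.1, meets_target)   # 20000 samples
print(f"exact {exact:.4f}, sampled {approx:.4f}")
```

In this toy model the clusters are independent, so the exact value also factors into a product per cluster. With reconvergent fan-out and cross-cluster repair that factorization is no longer available, which is the dependency problem noted above.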
Future Work
• Study the tractability of the problem
• Propose exact or approximation algorithms or better heuristics
• Generalize the interconnect delays so the problem addresses LUT placement
• Study the problem of assigning failures to spares so as to minimize delay