Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates. Yu Hu (1), Satyaki Das (2), Steve Trimberger (2), and Lei He (1). 1. Electrical Engineering Dept., UCLA; 2. Research Labs, Xilinx Inc. Presented by Yu Hu. Address comments to lhe@ee.ucla.edu.
Outline • Introduction • Design of the Macro-gates • Synthesis for the Proposed FPGA Architecture • Comparison of Heterogeneous FPGA Architectures • Conclusions and Future Work
Heterogeneity in FPGA Architectures • Heterogeneity among SLICEs • Programmable logic and routing • Tiles are not identical • soft logic fabric [Kaviani, FPGA’96] • hard structures [Jamieson, FPL’05] • Dedicated hard structures • e.g. DSP • e.g. memory block • Heterogeneity within a SLICE • Programmable logic and routing • Tiles (SLICEs) are identical • Different logic types exist within a SLICE • e.g. LUTs of different sizes [Cong, FPGA’99] • e.g. mixed PLAs and LUTs [Cong, TODAES’05] • e.g. mixed macro-gates and LUTs (figure source: Jamieson, FPL’05)
Heterogeneous FPGA with Macro-Gates • LUTs and macro-gates trade programmability against cost • Xilinx V4 benefits from small gates (MUX2, XOR2) built into SLICEs • The benefit of wider macro-gates • The effectiveness of incorporating wider logic functions (macro-gates) is not clear • Our contributions • Design a new FPGA architecture with mixed LUTs and macro-gates • Propose a new automatic synthesis flow for mapping a circuit to the proposed FPGA architecture • Evaluate the architecture and show that it reduces delay and area by 16.5% and 30%, respectively, compared to the LUT-only architecture
Outline • Introduction • Design of the Macro-gates • Synthesis for the Proposed FPGA Architecture • Comparison of Heterogeneous FPGA Architectures • Conclusions and Future Work
Overview of Macro-Gate Design • Key problem • Select the logic functions for the macro-gate • Problem formulation: • Input: a set of training circuits that have been mapped to K-input LUTs • Output: N K-input Boolean functions: f1, …, fN • Objective: maximize the number of logic functions (in the training circuit set) that can be implemented by f1, …, fN • The proposed solution • Rank the logic functions found in a set of training circuits
NPN-Class Diagram: Organization of Logics • Canonical and efficient representation of all NPN classes • NPN-equivalent: functionally equivalent under input negation, input permutation, or output negation • E.g., f(a,b,c)=a+bc, g(a,b,c)=b’a+b’c • The NPN-cofactor relationship is indicated • DAG: easy to manipulate • It becomes impractical to compute for functions with more than 6 inputs! • Solution: Utilization NPN-Class Diagram [Figure: NPN-class diagram organized by level, from Level0 (constant) through Level1 (1-input), Level2 (2-input) and Level3 (3-input) toward wider inputs]
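The NPN-equivalence of the two example functions above can be checked by brute force. The following is a minimal sketch, not the authors' implementation: it assumes truth tables packed into integers, and the helper names `truth_table` and `npn_canonical` are hypothetical. It enumerates every input permutation, input negation and output negation, which is only practical for a handful of inputs.

```python
from itertools import permutations, product

def truth_table(func, n):
    """Pack the outputs of func over all 2**n input rows into one integer."""
    tt = 0
    for row in range(2 ** n):
        bits = [(row >> i) & 1 for i in range(n)]
        if func(*bits):
            tt |= 1 << row
    return tt

def npn_canonical(tt, n):
    """Smallest truth table reachable by input negation/permutation and output negation."""
    rows = 2 ** n
    full = (1 << rows) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in product((0, 1), repeat=n):
            out = 0
            for row in range(rows):
                # Input i of the transformed function drives original input perm[i],
                # optionally negated.
                src = 0
                for i in range(n):
                    src |= (((row >> i) & 1) ^ neg[i]) << perm[i]
                if (tt >> src) & 1:
                    out |= 1 << row
            # Also consider the output-negated version.
            for cand in (out, out ^ full):
                if best is None or cand < best:
                    best = cand
    return best

# The two functions from the slide: f = a + bc, g = b'a + b'c.
f = truth_table(lambda a, b, c: a | (b & c), 3)
g = truth_table(lambda a, b, c: ((1 - b) & a) | ((1 - b) & c), 3)
xor3 = truth_table(lambda a, b, c: a ^ b ^ c, 3)
```

Two functions belong to the same NPN class exactly when `npn_canonical` returns the same value for both, so the canonical value can serve as a node key in a class diagram.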
UND: Utilization NPN-Class Diagram • The UND is a DAG and a sub-graph of the NCD • It helps with scoring and ranking functions • Each node is labeled: functionality / implementation capability / appearance frequency [Figure: example UND with nodes ab’c’+a’bc’ / 1 / xx%, abc / 1 / xx%, ab’+a’b / 0 / xx%, ab / 0 / xx%, a / 0 / xx%, -0- / 0 / xx%]
UND: Utilization NPN-Class Diagram [Figure: the same example UND in a later animation step; node labels shown include ab’c’+a’bc’ / 1 / xx%, abc / 1 / xx%, ab’+a’b / 1 / xx%, ab / 0 / xx%, a / 1 / xx%, -0- / 0 / xx%]
UND: Utilization NPN-Class Diagram • Calculate implementation capability • The DAG topology of the UND lets us efficiently explore different metrics for ranking functions, e.g., utilization rate [Figure: example UND highlighting the fanout cone of ab’c’+a’bc’; node labels: ab’c’+a’bc’ / 1 / 75%, abc / 1 / 50%, ab’+a’b / 1 / 50%, ab / 0 / 25%, a / 1 / 25%, -0- / 0 / xx%]
Recap: Overall Flow for Macro-Gate Design • Map with LUT-N • Extract logic functions (truth tables) • Generate the Utilization NPN Diagram • Calculate a score for each logic function • Rank the logic functions • Best function: ab’c’+a’bc’ [Figure: example netlist with gates and2, nand2 and inv mapped to LUTs, the extracted truth tables, and the resulting UND annotated with score calculations such as 1+1*2/3+1*1/3=2, 1+1*1/3=1.33, 1+1*1/2=1.5 and 1*1/2=0.5]
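The scoring and ranking steps of this flow can be sketched on a toy UND. The exact score used by the authors is not spelled out in the slides, so the sketch below makes an explicit assumption: a function's utilization is its own appearance count plus the counts of every function reachable from it through cofactor edges (its cone in the DAG). The edge list and the counts are made-up illustration data, and `cone` and `rank_by_utilization` are hypothetical names.

```python
def cone(node, cofactors, seen=None):
    """All nodes reachable from node through cofactor edges, including node itself."""
    seen = set() if seen is None else seen
    if node not in seen:
        seen.add(node)
        for child in cofactors.get(node, ()):
            cone(child, cofactors, seen)
    return seen

def rank_by_utilization(counts, cofactors):
    """Rank functions by the summed appearance counts of their cones (assumed metric)."""
    scores = {
        func: sum(counts.get(other, 0) for other in cone(func, cofactors))
        for func in counts
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical cofactor edges between NPN classes (parent -> cofactor classes).
COFACTORS = {
    "ab'c'+a'bc'": ["ab'+a'b", "ab", "0"],
    "abc": ["ab", "0"],
    "ab'+a'b": ["a"],
    "ab": ["a", "0"],
    "a": ["0"],
}
# Hypothetical appearance counts extracted from mapped training circuits.
COUNTS = {"ab'c'+a'bc'": 2, "abc": 3, "ab'+a'b": 4, "ab": 5, "a": 1, "0": 0}

ranking = rank_by_utilization(COUNTS, COFACTORS)
```

With this toy data the widest class ranks first because its cone covers the most mapped functions; a different metric can be swapped in by changing only the body of `rank_by_utilization`.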
Proposed Macro-Gates and FPGA Architecture • For the IWLS’05 benchmarks, the following four 6-input functions have the highest ranks • GI1=a b c d e f (AND-6) • GI2=a’ b’ c’ + b c f’ + b c’ d’ + b’ c e (MUX-4) • GI3=a b’ c d’ e + b c e f + d e f • GI4=a b’ + a’ c d’ + b’ c’ + e’ + f’ • Together they can implement over 50% of the logic functions in the IWLS’05 benchmarks • [Figure: architecture of the proposed macro-gate and FPGA SLICE]
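The four expressions can be evaluated directly to obtain their 64-row truth tables. The sketch below transcribes GI1 to GI4 as written on the slide and checks that GI2 behaves as a 4:1 multiplexer; the reading of b and c as select inputs with a’, e, d’ and f’ as data inputs is my interpretation of the expression, and `truth_table` and `mux4` are hypothetical helper names.

```python
from itertools import product

MACRO_GATES = {
    "GI1": lambda a, b, c, d, e, f: a and b and c and d and e and f,
    "GI2": lambda a, b, c, d, e, f: (
        (not a and not b and not c) or (b and c and not f)
        or (b and not c and not d) or (not b and c and e)
    ),
    "GI3": lambda a, b, c, d, e, f: (
        (a and not b and c and not d and e) or (b and c and e and f) or (d and e and f)
    ),
    "GI4": lambda a, b, c, d, e, f: (
        (a and not b) or (not a and c and not d) or (not b and not c) or not e or not f
    ),
}

def truth_table(func):
    """List of outputs over all 64 assignments of (a, b, c, d, e, f)."""
    return [bool(func(*bits)) for bits in product((False, True), repeat=6)]

def mux4(a, b, c, d, e, f):
    """4:1 multiplexer with selects (b, c) and data inputs a', e, d', f'."""
    data = {
        (False, False): not a,
        (False, True): e,
        (True, False): not d,
        (True, True): not f,
    }
    return data[(b, c)]

tables = {name: truth_table(func) for name, func in MACRO_GATES.items()}
```

Comparing `tables["GI2"]` with `truth_table(mux4)` confirms the MUX-4 annotation, and `sum(tables["GI1"])` counts the single minterm of the AND-6.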
Outline • Design of the Embedded Macro-gates • Synthesis for the Proposed FPGA Architecture • Technology Mapping for Heterogeneous FPGAs • SAT-based Packing • Place and Routing • Comparison of Heterogeneous FPGA Architectures • Conclusions and Future Work
Functional & Structural Cut Enumeration • Phase 1: Enumerate and label cuts from PIs to POs • Check the feasibility of a cut w.r.t. the macro-gate • Phase 2: Select the best choice from POs to PIs • A general yet efficient solution is SAT-based Boolean matching • See "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping", Session 5C.1, ICCAD’07 [Figure: example cut over inputs w, x, y, z with internal nodes a=(x+y)’ and b=y+wz, so d=ab=(x+y)’(y+wz)=x’y’wz; the truth table of d is looked up in the 4-input macro-gate library ("Is x’y’wz in library?" Yes)]
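The feasibility check in the example can be sketched by composing the node functions of the cut and comparing truth tables. This is only an illustration: the slides use SAT-based Boolean matching, whereas the sketch substitutes exhaustive matching over input permutations, which is fine for 4 inputs. The one-entry library and the helper names are hypothetical.

```python
from itertools import permutations, product

def table(func, n):
    """Truth table of an n-input function as a tuple of booleans."""
    return tuple(bool(func(*bits)) for bits in product((False, True), repeat=n))

def matches_library(func, library, n):
    """True if func equals some library truth table under an input permutation."""
    for perm in permutations(range(n)):
        permuted = table(lambda *x: func(*(x[p] for p in perm)), n)
        if permuted in library:
            return True
    return False

# Node functions from the example cut.
def node_a(x, y):
    return not (x or y)          # a = (x + y)'

def node_b(w, y, z):
    return y or (w and z)        # b = y + wz

def node_d(w, x, y, z):
    return node_a(x, y) and node_b(w, y, z)   # d = ab

# Hypothetical 4-input library holding one function: p q r' s'.
LIBRARY = {table(lambda p, q, r, s: p and q and not r and not s, 4)}

def xor4(w, x, y, z):
    return (w != x) != (y != z)
```

`matches_library(node_d, LIBRARY, 4)` reports a match because d simplifies to x’y’wz, which is the library function with its inputs reordered.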
Key in Technology Mapping: Balance Resource Utilization • An asymmetric architecture causes resource utilization problems • Using one logic resource exclusively leaves much of the fabric unused • Simple yet effective solution: • Change the LUT-MG ratio (LUT# / MG#) by adjusting the area weights of LUTs and macro-gates • Precise calibration is hard to reach with this approach • Target architecture: LUT6 : MacroGate6 = 1:1, so the best LUT-MG ratio is 1:1 [Figure: plot annotated "Total# too large!" and "Hard to obtain precise calibration"]
Post-Mapping Area Recovery (motivation example) • Given: • Target architecture = LUT6 + MG6 • LUT-MG ratio in target architecture = 1:1 • LUT# < MG# in the mapped design • Intrinsic delay (LUT6 : MG6) = 5:4 • Objective: balance the numbers of LUTs and MGs without increasing delay [Figure: mapped netlist from PI to PO with one LUT6 and six MG6 nodes; timing labels 5 / 5, 9 / 13, 4 / 5, 8 / 9, 9 / 9, 13 / 13, and 17 / 17 at the PO]
Post-Mapping Area Recovery (motivation example, continued) [Figure: the same netlist with two LUT6 and five MG6 nodes; the label 9 / 13 becomes 10 / 13 while the PO stays at 17 / 17]
Post-Mapping Area Recovery (motivation example, continued) • Timing slack budgeting is necessary! [Figure: the same netlist with four LUT6 and three MG6 nodes; timing labels 5 / 5, 10 / 13, 5 / 5, 10 / 9, 9 / 9, 14 / 13, and 18 / 17 at the PO, marked "Timing target violation!"]
Post-Mapping Area Recovery by Timing Budgeting • Formulated as an Integer Linear Programming (ILP) problem • Objective (minimize the gap between target and actual LUT-MG ratios): min |m_2 + … + m_7 − 7/2| • Arrival time constraints: a_i + d_j + b_j <= a_j • Clock period target: a_i <= 17 • LUT assignment with given timing slack: (5 − 4) * m_j <= b_j, m_j ∈ {0, 1} • Easily generalized to architectures • with multiple macro-gates • with different input pin numbers [Figure: netlist from PI to PO with one LUT6 and six MG6 nodes, annotated with arrival times a_1 … a_7]
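The effect of this formulation can be illustrated without an ILP solver by enumerating all 0/1 assignments on a small netlist. The sketch below is a brute-force stand-in, not the ILP itself: the 7-node netlist is hypothetical (it is not the one in the figure), node 1 is assumed to be fixed as a LUT6, and the delays 5 and 4 and the clock target 17 are taken from the slides.

```python
from itertools import product

LUT_DELAY, MG_DELAY, CLOCK_TARGET = 5, 4, 17

# Hypothetical netlist: PREDS[j] lists the fan-in nodes of node j.
# Node ids are already in topological order.
PREDS = {1: [], 2: [], 3: [2], 4: [3], 5: [1], 6: [], 7: [4, 5, 6]}
FIXED_LUT = {1}  # assumed: node 1 stays a LUT6

def arrival_times(is_lut):
    """Arrival time at every node output for a given LUT/MG assignment."""
    arrival = {}
    for j in sorted(PREDS):
        delay = LUT_DELAY if is_lut[j] else MG_DELAY
        arrival[j] = max((arrival[i] for i in PREDS[j]), default=0) + delay
    return arrival

def balance(target=7 / 2):
    """Pick m_j in {0, 1} minimizing |sum(m_j) - target| subject to the clock target."""
    free = [j for j in sorted(PREDS) if j not in FIXED_LUT]
    best = None
    for bits in product((0, 1), repeat=len(free)):
        is_lut = {j: True for j in FIXED_LUT}
        is_lut.update(zip(free, map(bool, bits)))
        if max(arrival_times(is_lut).values()) > CLOCK_TARGET:
            continue  # this assignment violates the timing target
        gap = abs(sum(bits) - target)
        if best is None or gap < best[0]:
            best = (gap, is_lut)
    return best

gap, assignment = balance()
```

On this toy netlist only one node of the critical path 2 → 3 → 4 → 7 can absorb the extra unit of LUT delay, which is the slack-budgeting effect the constraints (5 − 4) * m_j <= b_j capture.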
Outline • Design of the Embedded Macro-gates • Synthesis for the Proposed FPGA Architecture • Technology Mapping for Heterogeneous FPGAs • SAT-based Packing • Comparison of Heterogeneous FPGA Architectures • Conclusions and Future Work
SAT-Based Packing • Motivation • Traditional packing tools, e.g., T-VPack, hard-code the architecture specification of a SLICE • They must be re-implemented from scratch when the architecture changes • We propose a unified implementation of the packer for different architectures: easy to perform architecture exploration! • The architecture-dependent sub-problem in packing • Structural feasibility checking of a sub-circuit against the SLICE • Solution • Treat validation of a SLICE packing as a local place & route problem • Use a SAT solver to carry out the validation check
Example of SAT-Based SLICE Packing • Examples of constraints (one for each class of constraint): • Placement and routing choice variables: X@A, X@B, U5@N10 • Exclusivity constraint: (¬X@A) ∨ (¬X@B) • Presence constraint: (X@A) ∨ (¬X@B) • Input/output constraint: X@A → U5@N10 • Routing constraint: (G0→out ∧ U5@N10) → U5@N12
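How such clauses decide packing feasibility can be sketched with a tiny exhaustive satisfiability check. This is an illustration under assumptions, not the authors' packer: a real flow would hand the CNF to a SAT solver, and the presence clause is written here as X@A ∨ X@B (block X occupies at least one site), which is my reading of the intended meaning rather than the literal clause on the slide. `G0->out` is treated as one more Boolean variable.

```python
from itertools import product

VARIABLES = ["X@A", "X@B", "U5@N10", "G0->out", "U5@N12"]

# Each clause is a list of (variable, required_value) literals.
CLAUSES = [
    [("X@A", False), ("X@B", False)],                            # exclusivity
    [("X@A", True), ("X@B", True)],                              # presence (assumed form)
    [("X@A", False), ("U5@N10", True)],                          # X@A -> U5@N10
    [("G0->out", False), ("U5@N10", False), ("U5@N12", True)],   # routing implication
]

def models(clauses, variables):
    """Yield every assignment that satisfies all clauses (exhaustive search)."""
    for values in product((False, True), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[name] == want for name, want in clause)
               for clause in clauses):
            yield assignment

solutions = list(models(CLAUSES, VARIABLES))
packing_is_feasible = bool(solutions)
```

A non-empty `solutions` list means the sub-circuit fits the SLICE under these constraints; changing the architecture only changes the clause list, not the checker.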
Recap: Overall Synthesis Flow [Flowchart: Area weight setting → Cut-based mapping → Area-balance trade-off? (Y / N) → Post-mapping area recovery → Packing] [Figure: example netlist mapped to LUTs, then to a mix of LUT6 and MG6 blocks]
Outline • Motivation and Objectives • Methodology for Logic Function Exploration • Technology Mapping for Heterogeneous FPGAs • Evaluation of Heterogeneous FPGA Architectures • Conclusions and Future Work
Experimental Setting • Design library parameters [Cong, TODAES’05] • Benchmark set: IWLS 2005 • Four architectures are compared: • LUT4, LUT4 + macro-gate, LUT6, and LUT6 + macro-gate • The proposed macro-gate is synthesized with SIS 1.2 • Delay and area model • Interconnect delay is ignored
Delay Comparisons • Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%. • Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%. • A LUT6 can implement more logic functions than a macro-gate
Logic Area Comparisons • Compared to LUT4, LUT4+MG reduces logic area by 12.5%. • Compared to LUT6, LUT6+MG reduces logic area by 16.9%.
Outline • Motivation and Objectives • Methodology for Logic Function Exploration • Technology Mapping for Heterogeneous FPGAs • Comparison of Heterogeneous FPGA Architectures • Conclusions and Future Work
Conclusions and Future Work • Conclusions • A novel FPGA architecture with mixed LUTs and macro-gates is proposed • A synthesis flow for the proposed architecture is implemented • Preliminary experimental results show that the proposed architecture is effective at reducing area and delay • Future Work • Perform physical design for the synthesized circuits, compare routing costs, and evaluate the architecture with interconnect delay taken into account • Study how effectively the proposed architecture reduces power • Examine macro-gates with wider inputs