Clustering of Large Designs for Channel-Width Constrained FPGAs • Marvin Tom, Guy Lemieux • University of British Columbia, Department of Electrical and Computer Engineering, Vancouver, BC, Canada
Overview • Introduction, Goals and Motivation • Reduce channel width, lower cost, make circuits “routable” • Reducing Channel Width By Depopulation • Large Benchmark Circuits • New Clustering Technique • Selective Depopulation • Conclusions and Future Work
Mesh-Based FPGA Architecture • Channel width • Number of routing tracks per channel • Larger FPGA devices: more tiles • Channel width is fixed [Figure: mesh of logic blocks, each labelled L]
Motivation: Area of FPGA Devices • MCNC circuits mapped onto an FPGA • Total layout AREA = SIZE * Number (SIZE of layout tile, Number of layout tiles) [Figure: SIZE of Layout Tile and Number of Layout Tiles]
Motivation: Channel Width Demand • MCNC circuits mapped onto an FPGA • Devices built for worst-case channel width (fixed width) • Interconnect cost dominates (>70%) • Interconnect range: user has no choice! • Logic range: user buys bigger device.
Goal: Reduce Channel Width • Altera Cyclone • Channel width constraint of 80 routing tracks • Constrained FPGA • Channel width constraint of 60 routing tracks • Smaller area, lower cost for low-channel-width circuits • But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable. Can we make them routable in a constrained FPGA?
Possible Solution • Trade off logic utilization for channel width • User can always buy more logic (not more wires) • Trade-off: CLB count for channel width • But can we achieve lower total area ( = SIZE * CLB count)? [Figure: logic-block meshes of FPGA 1 and FPGA 2]
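The area question on this slide can be sketched as simple arithmetic. This is only an illustration of the formula Total Area = SIZE * CLB count; the tile sizes and CLB counts below are hypothetical placeholders, not numbers from the talk, and the function names are made up for this sketch.

```python
def total_area(tile_size: float, clb_count: int) -> float:
    """Total layout area = SIZE of one layout tile * number of CLB tiles."""
    return tile_size * clb_count


def max_clb_growth(tile_size_wide: float, tile_size_narrow: float) -> float:
    """Largest CLB-count ratio (narrow / wide) at which the narrow device breaks even."""
    return tile_size_wide / tile_size_narrow


# Hypothetical values in arbitrary units (assumptions, not measurements).
fpga1 = total_area(tile_size=100.0, clb_count=1000)  # wide channels, fully packed CLBs
fpga2 = total_area(tile_size=80.0, clb_count=1200)   # narrow channels, depopulated CLBs

print(fpga1, fpga2)                  # 100000.0 96000.0
print(max_clb_growth(100.0, 80.0))   # 1.25
```

Under these made-up numbers, the narrower device wins on area as long as depopulation grows the CLB count by less than 1.25x.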
Logic Element: BLE and CLB • Basic Logic Element (BLE) • ‘k’-input LUT + FF • Clustered Logic Block (CLB) • ‘N’ BLEs, ‘N’ outputs • ‘I’ shared inputs • Note: I < k*N [Figure: CLB containing BLE #1 to BLE #5, with ‘I’ inputs and ‘N’ outputs]
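The CLB parameters on this slide can be captured in a small data structure. This is a minimal sketch: the class and method names are hypothetical, and the value I = 40 is an assumed placeholder (the talk gives k and N for its experiments, but not I).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClbArchitecture:
    k: int  # inputs per LUT
    n: int  # BLEs (and outputs) per CLB
    i: int  # shared CLB inputs

    def max_ble_inputs(self) -> int:
        """Input pins needed if every LUT input came from outside the CLB."""
        return self.k * self.n

    def inputs_are_shared(self) -> bool:
        """True when the slide's note I < k*N holds."""
        return self.i < self.max_ble_inputs()


# k=6, N=16 are the values used later in the talk; i=40 is an assumption.
arch = ClbArchitecture(k=6, n=16, i=40)
print(arch.max_ble_inputs())     # 96
print(arch.inputs_are_shared())  # True
```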
CLB Depopulation • Normally: CLBs fully packed • Reduces total # of CLBs needed for circuit • CLB Depopulation: Tessier, DeHon • Do not use all BLEs • Increase # CLBs used • Decrease channel width • Decrease overall area • Problem • Increase in # CLBs high for large circuits • Our work: limits # CLB increase [Figure: CLB with BLE #1 to BLE #5, ‘I’ inputs and ‘N’ outputs]
Uniform Depopulation • Previous work • Depopulate each CLB by equal amount • But… circuit observations • regions of high routing demand • regions of low routing demand • Depopulate in low congestion areas ?? • Unnecessary increase in area
Non-Uniform Depopulation • Our depopulation method: • Assume congestion is localized • Depopulate only congested areas • We show non-uniform depopulation is an • Effective method of channel width reduction • Graceful tradeoff between channel width and area • Makes unroutable circuits routable
CLB Depopulation • General Approach • Use existing clustering tools • Do not fill CLB while clustering • Input-Limited • E.g., maximum 67% input utilization per CLB • Might use all BLEs • BLE-Limited • E.g., maximum 60% BLE utilization per CLB • Might use all inputs [Figure: CLB with BLE #1 to BLE #5, ‘I’ inputs and ‘N’ outputs]
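The two limits above can be illustrated with a toy packer. This is a minimal first-fit sketch, not the clustering tool used in the talk; the `Ble`, `Cluster`, and `pack` names and the four-BLE netlist are hypothetical. It assumes every single BLE fits within `max_inputs` on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Ble:
    name: str
    inputs: frozenset  # names of nets feeding the LUT
    output: str        # name of the net this BLE drives


@dataclass
class Cluster:
    bles: list = field(default_factory=list)

    def external_inputs(self, extra=None):
        """Nets that must enter the CLB if `extra` were added to it."""
        members = self.bles + ([extra] if extra is not None else [])
        produced = {b.output for b in members}
        used = set()
        for b in members:
            used |= b.inputs
        return used - produced


def pack(bles, max_bles, max_inputs):
    """First-fit packing under a BLE limit and an input limit."""
    clusters = []
    for ble in bles:
        for cluster in clusters:
            if (len(cluster.bles) < max_bles
                    and len(cluster.external_inputs(ble)) <= max_inputs):
                cluster.bles.append(ble)
                break
        else:
            clusters.append(Cluster([ble]))  # no existing cluster can take it
    return clusters


# Hypothetical chain of four 2-input BLEs.
netlist = [
    Ble("b0", frozenset({"a", "b"}), "n0"),
    Ble("b1", frozenset({"n0", "c"}), "n1"),
    Ble("b2", frozenset({"n1", "d"}), "n2"),
    Ble("b3", frozenset({"n2", "e"}), "n3"),
]

full = pack(netlist, max_bles=4, max_inputs=10)          # fully packed
ble_limited = pack(netlist, max_bles=2, max_inputs=10)   # 50% BLE utilization
input_limited = pack(netlist, max_bles=4, max_inputs=3)  # tight input budget

print(len(full), len(ble_limited), len(input_limited))   # 1 2 2
```

In this sketch a BLE-limited run caps `max_bles` below the architecture's N while leaving `max_inputs` at I, and an input-limited run does the reverse.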
Reducing Channel Width Results (max cluster size 16) • Input-Limited • No channel width control • BLE-Limited • (Almost) monotonically increasing • Good channel width control
Benchmark Circuit Creation (We want BIG circuits!) (What do REALLY BIG circuits look like?)
Benchmarking Circuits: Some Observations • Altera has bigger benchmarks than academics • We noted similar characteristics: • Some LARGE circuits routable with NARROW routing channels • Some SMALL circuits need WIDE routing channels • What if each circuit is an IP block in a larger system?
Benchmark Creation – IP Blocks • Mimic process of creating large designs • “IP Blocks” <==> MCNC Circuits • SoC <==> Randomly integrate/stitch together “IP Blocks” • IP Blocks have varied interconnect needs • Real-life large designs: System-on-Chip Methodology • IP blocks (own, 3rd party) • Re-use improves productivity • Primarily integration and verification effort
Benchmark Creation – Large Designs • Considered 3 stitching schemes… • Independent • IP Blocks are not connected to each other • Pipeline • Outputs of one IP block connected to inputs of next IP block • Clique • Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
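The three stitching schemes can be sketched as wire lists between blocks. This is an illustrative sketch only: the `IpBlock` type, the pin-matching rules (pin-by-pin for Pipeline, round-robin for Clique), and the block sizes are assumptions, and leftover pins are simply left unconnected here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IpBlock:
    name: str
    num_inputs: int
    num_outputs: int


def stitch(blocks, scheme):
    """Return (src_block, src_output, dst_block, dst_input) connections."""
    wires = []
    if scheme == "independent":
        return wires  # IP blocks are not connected to each other
    if scheme == "pipeline":
        for src, dst in zip(blocks, blocks[1:]):
            for pin in range(min(src.num_outputs, dst.num_inputs)):
                wires.append((src.name, pin, dst.name, pin))
        return wires
    if scheme == "clique":
        next_free = {b.name: 0 for b in blocks}  # next unused input per block
        for src in blocks:
            others = [b for b in blocks if b is not src]
            if not others:
                continue
            for pin in range(src.num_outputs):
                dst = others[pin % len(others)]  # round-robin over other blocks
                if next_free[dst.name] < dst.num_inputs:
                    wires.append((src.name, pin, dst.name, next_free[dst.name]))
                    next_free[dst.name] += 1
        return wires
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical block sizes.
blocks = [IpBlock("A", 4, 2), IpBlock("B", 3, 3), IpBlock("C", 2, 4)]
for scheme in ("independent", "pipeline", "clique"):
    print(scheme, len(stitch(blocks, scheme)))
# independent 0
# pipeline 4
# clique 9
```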
MetaCircuit: Reducing Routed Channel Width? • Observations • IP blocks are tightly connected internally • IP blocks have varied channel width needs • Hypotheses • Placement keeps each “IP block” together • IP block has large routed channel width ==> MetaCircuit has large routed channel width
Hypothesis Testing: MetaCircuit P&R Results • Use VPR FPGA tools from University of Toronto • Hypothesis 1 • VPR placer successfully groups IP blocks from random initial placement • Hypothesis 2 • VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }
Consequences of Hypothesis 2 • Question • Shrink channel width of a few IP blocks ==> shrink channel width of MetaCircuit? • How to shrink channel widths? • Selective CLB depopulation! • Depopulate hard-to-route IP blocks the most • How much to depopulate? • Channel width profiling of IP block…
Meeting Channel Width Constraints: Selective Depopulation • Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation) • Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
IP Block Properties • Cluster IP blocks into N=16, k=6 • VPR: determine minimum channel width for each IP block • Sort IP blocks based on channel width [Table labels: Hard-to-Route Circuits, Easy-to-Route Circuits]
Channel Width Profiling of IP Block • Cluster sizes • NA = FPGA Architecture Cluster Size (fixed) • NC = BLE-Limit Size (variable) • Sweep NC for each IP block
Analysis with Constraint • Given channel-width constraint of 60 tracks • tseng routable (easy) • clma routable for NC <= 10 • clma not routable for NC > 10
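The analysis above amounts to reading the largest routable NC off each block's profile. A minimal sketch follows, assuming a profile is a mapping from NC to the minimum routed channel width measured for that NC. The widths below are hypothetical placeholders chosen only to agree with the slide (clma routable for NC <= 10 at 60 tracks, tseng always routable); they are not measured results.

```python
def max_routable_nc(profile, constraint):
    """Largest BLE limit NC whose minimum channel width meets the constraint."""
    routable = [nc for nc, width in profile.items() if width <= constraint]
    return max(routable) if routable else None


# Hypothetical profiles: NC -> minimum channel width (illustrative only).
profiles = {
    "clma":  {8: 55, 10: 60, 12: 64, 14: 70, 16: 75},
    "tseng": {8: 35, 10: 36, 12: 37, 14: 38, 16: 40},
}

constraint = 60  # channel-width constraint in tracks, as on the slide
for name, profile in profiles.items():
    print(name, max_routable_nc(profile, constraint))
# clma 10
# tseng 16
```

Taking the maximum over all routable NC values, rather than stopping at the first failure, keeps the sketch correct even when a profile is only "almost" monotonic.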
Our Technique: Selective Depopulation • Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation) • Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
Uniform Depopulation • Minimum NC cluster size • Depopulate all clusters equally • E.g., use NC=10 for both IP blocks
Non-Uniform Depopulation • Maximal NC cluster size • Depopulate each IP block according to its maximal cluster size • E.g., clma NC=10, tseng NC=16
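The difference between the two policies can be sketched with a CLB-count estimate. This is a rough lower bound that assumes every CLB is filled exactly to its limit NC; the BLE counts are hypothetical round numbers, while the NC limits (clma 10, tseng 16) come from the slides.

```python
import math


def clbs_needed(num_bles, nc):
    """Lower bound on CLBs when each CLB holds at most nc BLEs."""
    return math.ceil(num_bles / nc)


def uniform_limits(max_nc):
    """Uniform depopulation: every block uses the smallest NC of any block."""
    smallest = min(max_nc.values())
    return {name: smallest for name in max_nc}


def total_clbs(num_bles, limits):
    return sum(clbs_needed(num_bles[name], limits[name]) for name in limits)


max_nc = {"clma": 10, "tseng": 16}        # maximal routable NC per IP block
num_bles = {"clma": 8000, "tseng": 1000}  # hypothetical BLE counts

print(total_clbs(num_bles, uniform_limits(max_nc)))  # 900
print(total_clbs(num_bles, max_nc))                  # 863
```

With these placeholder sizes, the non-uniform policy saves CLBs only in the easy-to-route block, which is where uniform depopulation wastes area.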
Uniform vs. Non-Uniform • Non-uniform depopulation better than uniform • Lower CLB count • Higher LUT utilization [Figure: Total CLBs Needed (x 1,000) and LUT Utilization vs. Channel Width Constraint, Uniform and Non-Uniform curves in each plot]
MetaCircuit Clustering Results • Depopulate the most-congested IP blocks • BLE-limit of each IP block shown (max = 16) • Some IP blocks are depopulated more than others
MetaCircuit P&R Results • Clique MetaCircuit • P&R channel width results closely match “constraints” • Shrink channel width • by ~20% (from 95 to 75), NO AREA INCREASE • by ~50% (from 95 to 50), 1.7x area increase [Figure: Routed Channel Width and Normalized Area vs. Channel Width Constraint]
Other MetaCircuit Results * These latest results are better than those given in paper
Critical Path Delay and Average Wirelength • Expect critical path delay to increase under tighter constraints • Delay “noise” due to instability of floorplan locations • Average wirelength / net increases under tighter constraints
Conclusion • System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources • Depopulating CLBs is effective at reducing channel width • Non-uniform depopulation is important to limit area inflation • Channel width reduced • by 0-20% with < 5% area increase • by up to 50% with 3.3x area increase • Effective solution to trade off CLBs for interconnect! • UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!
Future Work • Real-Life SoC Benchmark • Licensed IP: Bluetooth baseband processor • 325,000 ASIC gates • Numerous IP blocks of varying complexity • Needed to authenticate “Synthetic” results • Automated technique to find “hard” IP blocks • Granularity is based on design hierarchy (?) • Replaces time-consuming Step 1 of process