Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction • Yan Lin and Lei He, EE Department, UCLA • Partially supported by NSF • Address comments to lhe@ee.ucla.edu
Outline • Review and Motivation • Chip-level Vdd-level Assignment Algorithms • Experimental Results • Conclusions
FPGA Power Reduction • Existing FPGAs are power inefficient compared to ASICs [Kusse, ISLPED’98] • Power-aware FPGA CAD algorithms for existing FPGA architectures • CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD’03] • Configuration inversion for leakage reduction [Anderson et al, FPGA’04] • Power-efficient FPGA circuits and architectures • Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA’04] [Li et al, DAC’04] • Vdd-programmable FPGA interconnects [Li et al, ICCAD’04] [Gayasen et al, FPL’04] [Anderson et al, ICCAD’04]
Vdd-programmable Interconnects [Li et al, ICCAD’04] • [Figure: conventional routing switch vs. Vdd-programmable switch with power transistors] • Vdd-programmable switch • Vdd selection for used switches • Power-gating of unused switches • Reduces leakage by 300X • Configurable Vdd-level conversion • Avoids excessive leakage when a low-Vdd switch drives high-Vdd switches • Segment based Vdd-level converter insertion (SLC) • Area overhead: 35% for MCNC benchmark circuits • Leakage overhead: 29% for MCNC benchmark circuits
Previous Approaches w/o LCs • [Gayasen et al, FPL’04] • Level converters are inserted at CLB inputs (outputs) • All routing trees driven by (driving) the source (sink) CLB have the same Vdd-level as the source (sink) CLB • Lacks flexibility • A path-based Vdd-level assignment is performed for CLBs and interconnects • [Anderson et al, ICCAD’04] • The VT drop of an NMOS transistor is used to generate low-Vdd • A positive-feedback PMOS is used to tolerate a low-Vdd switch driving high-Vdd switches • Alternative design of level converter • Still has delay and power penalties
Our Major Contributions • Proposed two ways to avoid using level converters in interconnects • Tree based level converter insertion (TLC) • All the switches in one routing tree have the same Vdd-level • Dual-Vdd tree based level converter insertion (dTLC) • Within one tree, only a high-Vdd switch may drive low-Vdd switches • Proposed several Vdd-level assignment algorithms • Sensitivity based algorithms • TLC-S and dTLC-S for TLC and dTLC, respectively • Linear programming (LP) based algorithm • dTLC-LP for dTLC
Problem Formulations • Assign a Vdd-level to each interconnect switch to minimize interconnect power • Meet the delay target Tspec • Vdd-level converters • are removed within interconnects • are inserted at CLB inputs/outputs and can be used when needed • Tree based LC insertion (TLC) • allows only one Vdd-level within one routing tree • Dual-Vdd tree based LC insertion (dTLC) • allows a high-Vdd switch to drive low-Vdd switches, but not vice versa
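The two legality rules can be stated as small predicates on a routing tree. The sketch below is only an illustration: the dictionary-based tree encoding, the switch names and the function names are assumptions, not part of the original slides.

```python
from typing import Dict, List

HIGH, LOW = "high", "low"

def satisfies_tlc(vdd: Dict[str, str]) -> bool:
    """TLC: every switch in the routing tree uses the same Vdd-level."""
    return len(set(vdd.values())) <= 1

def satisfies_dtlc(fanout: Dict[str, List[str]], vdd: Dict[str, str]) -> bool:
    """dTLC: a low-Vdd switch never drives a high-Vdd switch."""
    for driver, loads in fanout.items():
        for load in loads:
            if vdd[driver] == LOW and vdd[load] == HIGH:
                return False
    return True

# Hypothetical tree: b1 drives b2 and b3, and b3 drives b4.
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": ["b4"], "b4": []}
mixed = {"b1": HIGH, "b2": LOW, "b3": HIGH, "b4": LOW}    # legal for dTLC only
illegal = {"b1": LOW, "b2": HIGH, "b3": LOW, "b4": LOW}   # low-Vdd b1 drives high-Vdd b2
```

Any assignment that is legal under TLC is also legal under dTLC, which is why dTLC offers a larger solution space.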
Delay & Power Model with Dual-Vdd • Interconnect power • Dynamic power • Leakage power is pre-characterized using SPICE • To incorporate dual-Vdd into timing analysis • Pre-characterize the intrinsic delay and effective driving resistance of each switch using SPICE • Calculate routing delay using the Elmore delay model
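A minimal sketch of how pre-characterized, Vdd-dependent switch parameters could feed an Elmore-style delay calculation on a fully buffered routing tree. The numeric values, the `Switch` class and the lumped wire model are assumptions for illustration; they are placeholders, not SPICE data from the original work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical pre-characterized values per Vdd-level (placeholders).
INTRINSIC_DELAY = {"high": 20e-12, "low": 30e-12}  # seconds
DRIVE_RES = {"high": 1.0e3, "low": 1.5e3}          # ohms

@dataclass
class Switch:
    name: str
    vdd: str          # "high" or "low"
    wire_res: float   # resistance of the wire segment driven by this switch
    wire_cap: float   # capacitance of that wire segment
    input_cap: float  # input capacitance seen by the upstream driver
    fanout: List["Switch"] = field(default_factory=list)

def stage_delay(sw: Switch) -> float:
    """Elmore delay of one buffered stage: driver, wire, and fanout inputs."""
    load = sum(f.input_cap for f in sw.fanout)
    return (INTRINSIC_DELAY[sw.vdd]
            + DRIVE_RES[sw.vdd] * (sw.wire_cap + load)
            + sw.wire_res * (sw.wire_cap / 2 + load))

def sink_delays(root: Switch, arrival: float = 0.0,
                out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Accumulate stage delays from the source to every leaf switch."""
    out = {} if out is None else out
    t = arrival + stage_delay(root)
    if not root.fanout:
        out[root.name] = t
    for f in root.fanout:
        sink_delays(f, t, out)
    return out
```

Because each switch is a buffer, the tree decomposes into independent stages, so changing one switch to low-Vdd only changes the delay of its own stage in this simplified model.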
Chip-level Assignment Algorithms • Tree based level converter insertion (TLC) • Sensitivity based algorithm TLC-S • Dual-Vdd tree based level converter insertion (dTLC) • Sensitivity based algorithm dTLC-S • Linear programming (LP) based algorithm dTLC-LP
Sensitivity Based Algorithm TLC-S • Iterative assignment • Assign low-Vdd to the ‘untried’ tree with maximum power sensitivity in each iteration • Reject the assignment if the critical path delay increases • Iteration terminates after all trees are ‘tried’ • Power sensitivity • The power reduction obtained by changing Vdd from high-Vdd to low-Vdd • Power includes both dynamic and leakage power
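The TLC-S loop can be sketched as a greedy pass over routing trees. This is an illustrative sketch, not the authors' implementation: the `critical_path` callback stands in for a full timing analysis, sensitivities are assumed fixed so sorting once is equivalent to picking the maximum each iteration, and the toy timing model is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Set

def tlc_s(sensitivity: Dict[str, float],
          critical_path: Callable[[FrozenSet[str]], float]) -> Set[str]:
    """Greedy tree-level assignment; returns the set of low-Vdd trees."""
    low: Set[str] = set()
    baseline = critical_path(frozenset(low))  # delay with every tree at high-Vdd
    # Try each tree exactly once, in order of decreasing power sensitivity.
    for tree in sorted(sensitivity, key=sensitivity.get, reverse=True):
        trial = low | {tree}
        if critical_path(frozenset(trial)) <= baseline:
            low = trial  # accept: the critical path did not get longer
        # otherwise reject and leave the tree at high-Vdd
    return low

# Hypothetical timing model: each path is (trees on the path, base delay),
# and every low-Vdd tree on a path adds one delay unit.
PATHS = [(["t1", "t2"], 10.0), (["t3"], 6.0)]

def toy_critical_path(low: FrozenSet[str]) -> float:
    return max(base + sum(t in low for t in trees) for trees, base in PATHS)

result = tlc_s({"t1": 5.0, "t2": 3.0, "t3": 1.0}, toy_critical_path)
```

In this toy case only `t3` can be lowered, because `t1` and `t2` both lie on the critical path.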
Sensitivity Based Algorithm dTLC-S • A “candidate switch” is defined as either • A switch that does not drive any switch, or • A switch whose fanout switches have all been assigned low-Vdd • Iterative assignment • Assign low-Vdd to the candidate switch with maximum power sensitivity in each iteration • Reject the assignment if the critical path delay increases • Iteration terminates when there is no candidate switch
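A sketch of the dTLC-S iteration using the candidate-switch definition above. It assumes that a rejected switch is never retried (the slides do not say), and the `critical_path` callback and example data are hypothetical stand-ins for real timing analysis.

```python
from typing import Callable, Dict, FrozenSet, List, Set

def dtlc_s(fanout: Dict[str, List[str]],
           sensitivity: Dict[str, float],
           critical_path: Callable[[FrozenSet[str]], float]) -> Set[str]:
    """Greedy switch-level assignment; returns the set of low-Vdd switches."""
    low: Set[str] = set()
    rejected: Set[str] = set()  # assumption: rejected switches stay at high-Vdd
    baseline = critical_path(frozenset())
    while True:
        # Candidate: untried switch that drives nothing, or whose fanout
        # switches are all already at low-Vdd.
        candidates = [s for s in fanout
                      if s not in low and s not in rejected
                      and all(f in low for f in fanout[s])]
        if not candidates:
            return low
        best = max(candidates, key=sensitivity.get)
        if critical_path(frozenset(low | {best})) <= baseline:
            low.add(best)
        else:
            rejected.add(best)

# Hypothetical tree: b1 drives b2 and b3, and b3 drives b4.
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": ["b4"], "b4": []}
sens = {"b1": 4.0, "b2": 1.0, "b3": 3.0, "b4": 2.0}

def toy_cp(low: FrozenSet[str]) -> float:
    # Toy model: only lowering b3 lengthens the critical path.
    return 1.0 if "b3" in low else 0.0

result = dtlc_s(fanout, sens, toy_cp)
```

Since a switch becomes a candidate only after all of its fanout switches are low-Vdd, every intermediate assignment keeps the dTLC rule that low-Vdd never drives high-Vdd.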
LP Based Algorithm dTLC-LP: Overview • Flow: Single-Vdd placed and routed netlist → Chip-level Time Slack Allocation → Net-level Bottom-up Assignment → Refinement → Dual-Vdd netlist
dTLC-LP: Single-Net Estimation • Slack is represented in multiples of a unit delay increase • The unit is the delay increase of an interconnect segment when its Vdd changes from high-Vdd to low-Vdd • An example • [Figure: routing tree with switches b1 to b4 and sinks sink1 and sink2, shown for several slack allocations s1, s2]
dTLC-LP: Single-Net Estimation (Cont.) • Given the allocated slacks, estimate the number of low-Vdd switches • sik: Slack for the kth sink in the ith routing tree • lik: Number of switches in the path from the source to the kth sink in the ith tree • SLij: Set of sinks in the fanout cone of the jth switch in the ith tree • An example • [Figure: routing tree from the source to three sinks; each sink k is annotated with sk/lk, and each switch takes Min(sk/lk) over the sinks in its fanout cone] • Theorem: The estimation gives a lower bound on the number of low-Vdd switches that can be achieved
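The per-tree estimate can be sketched from the definitions above: each switch contributes the minimum of sk/lk over the sinks in its fanout cone, and the contributions are summed. Capping each contribution at 1 is an assumption (a switch cannot count as more than one low-Vdd switch), and the tree topology and slack values below are hypothetical.

```python
from typing import Dict, List

def estimate_low_vdd(slack: Dict[str, float],
                     path_len: Dict[str, int],
                     cone: Dict[str, List[str]]) -> float:
    """Estimate the number of low-Vdd switches in one routing tree.

    slack[k]    : allocated slack of sink k, in units of the delay increase
    path_len[k] : number of switches on the path from the source to sink k
    cone[j]     : sinks in the fanout cone of switch j
    """
    total = 0.0
    for sinks in cone.values():
        share = min(slack[k] / path_len[k] for k in sinks)
        total += min(1.0, share)  # assumption: cap each switch at 1
    return total

# Hypothetical tree: b1 drives b2 and b3, b2 reaches sink1, b3 -> b4 reaches sink2.
cone = {"b1": ["sink1", "sink2"], "b2": ["sink1"],
        "b3": ["sink2"], "b4": ["sink2"]}
path_len = {"sink1": 2, "sink2": 3}
estimate = estimate_low_vdd({"sink1": 2, "sink2": 1}, path_len, cone)
```

With zero slack on every sink the estimate is zero, and with slack equal to the path length on every sink it equals the number of switches in the tree.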
dTLC-LP: Full-chip Time Slack Allocation • Objective function • fs(i): transition density of the ith tree • Fn(i): estimated number of low-Vdd switches in the ith tree • Directly minimizes dynamic power • May help minimize leakage power, which depends exponentially on the Vdd-level • Constraints • Net-based timing constraints • For PIs and POs • For edges corresponding to routing • For edges other than routing • Upper bound for useful slack • Constraints due to transforming the min function into linear form • Theorem: The time slack allocation problem is an LP problem
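One plausible way to write the slack allocation as an LP is sketched below. The exact formulation is not recoverable from the slides, so this is an assumption throughout: $x_{ij}$ is an auxiliary variable replacing the min term for switch $j$ of tree $i$, $a(v)$ is the arrival time at node $v$, $d(u,v)$ is the edge delay at high-Vdd, and $\Delta d$ is a name chosen here for the unit delay increase.

```latex
\begin{aligned}
\max_{s,\,x,\,a}\quad & \sum_i f_s(i)\,F_n(i), \qquad F_n(i)=\sum_j x_{ij} \\
\text{s.t.}\quad
 & x_{ij} \le \frac{s_{ik}}{l_{ik}} && \forall k \in SL_{ij} \\
 & 0 \le x_{ij} \le 1 \\
 & a(v) = 0 && v \in \mathrm{PI} \\
 & a(v) \le T_{spec} && v \in \mathrm{PO} \\
 & a(v) \ge a(u) + d(u,v) + s_{ik}\,\Delta d && (u,v)\ \text{routing edge to sink } k \text{ of tree } i \\
 & a(v) \ge a(u) + d(u,v) && \text{all other edges} \\
 & 0 \le s_{ik} \le l_{ik}
\end{aligned}
```

Because the objective pushes each $x_{ij}$ upward, at an optimum $x_{ij}$ equals the smallest of its upper bounds, which reproduces the min function with linear constraints only. The bound $s_{ik} \le l_{ik}$ reflects that slack beyond one unit per switch on the path cannot be used.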
dTLC-LP: Net-level Bottom-up Assignment • Perform bottom-up assignment within each tree to leverage the allocated slacks • Bottom-up assignment • Assign low-Vdd to switches in the routing tree in a bottom-up fashion • Slack is reduced by one unit of delay increase in each step • Stop when no slack is left • Theorem: the bottom-up assignment is optimal
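The bottom-up step can be sketched as a post-order traversal of one routing tree. This is an illustration under assumptions: slacks are taken as whole units of the delay increase, a switch is lowered only if every sink in its fanout cone still has at least one unit, and the tree encoding and names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def bottom_up_assign(fanout: Dict[str, List[str]],
                     cone: Dict[str, List[str]],
                     slack: Dict[str, int],
                     root: str) -> Tuple[Set[str], Dict[str, int]]:
    """Assign low-Vdd from the sinks toward the source of one routing tree."""
    remaining = dict(slack)  # do not modify the caller's slack values
    low: Set[str] = set()

    def visit(sw: str) -> None:
        for f in fanout[sw]:
            visit(f)  # handle downstream switches first
        children_low = all(f in low for f in fanout[sw])  # dTLC rule
        has_slack = all(remaining[k] >= 1 for k in cone[sw])
        if children_low and has_slack:
            low.add(sw)
            for k in cone[sw]:
                remaining[k] -= 1  # each step consumes one unit per affected sink

    visit(root)
    return low, remaining

# Hypothetical tree: b1 drives b2 and b3, b2 reaches sink1, b3 -> b4 reaches sink2.
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": ["b4"], "b4": []}
cone = {"b1": ["sink1", "sink2"], "b2": ["sink1"],
        "b3": ["sink2"], "b4": ["sink2"]}
low, left = bottom_up_assign(fanout, cone, {"sink1": 2, "sink2": 1}, "b1")
```

Lowering switches near the sinks first affects the fewest sinks per step, which is the intuition behind working bottom-up.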
Outline • Review and Motivation • Modeling and Problem Formulations • Chip-level Vdd-level Assignment Algorithms • Experimental Results • Conclusions
Experimental Setting • Cluster-based Island Style FPGA Structure • 100% buffered interconnects, subset switch block • Uniform length 4 for all wire segments • ITRS 100nm technology • Use VPR [Betz-Rose-Marquardt] for placement and routing • Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation • Considering short-circuit power, glitch power and input vector • 8% average error compared to SPICE simulation
Interconnect Power Comparison between TLC-S, dTLC-S and dTLC-LP • [Figure: bar chart of interconnect power (watt), split into leakage power and dynamic power, for TLC-S, dTLC-S and dTLC-LP] • dTLC-S and dTLC-LP achieve 6.7% and 6.9% less interconnect power compared to TLC-S, respectively • Interconnect power breakdown • TLC-S, dTLC-S and dTLC-LP have almost the same leakage • dTLC-S and dTLC-LP achieve 13.8% and 15.8% less interconnect dynamic power compared to TLC-S, respectively
dTLC-LP compared to SLC and h2lLCi • [Figure: % of VddL switches vs. critical path delay (ns) for dTLC-LP, SLC and h2lLCi] • [Figure: interconnect power (watt) vs. critical path delay (ns) for dTLC-LP, SLC and h2lLCi] • SLC [Li et al, ICCAD’04] • Segment based level converters inserted in interconnects • Sensitivity based assignment algorithm • h2lLCi [Gayasen et al, FPL’04] • All routing trees driven by the source CLB have the same Vdd-level as the source CLB • Path based assignment algorithm • dTLC-LP, SLC and h2lLCi achieve 77.54%, 74.70% and 41.80% low-Vdd switches w/o relaxing Tspec • At different delays, dTLC-LP achieves • The highest number of low-Vdd switches • The lowest power consumption
Runtime Comparison between TLC-S, dTLC-S and dTLC-LP • [Figure: bar chart of runtime (s) for TLC-S, dTLC-S and dTLC-LP on MCNC benchmarks alu4, apex2, apex4, elliptic, ex1010, frisc, pdc, s38417 and s38584] • TLC-S runs the fastest • dTLC-S versus dTLC-LP • dTLC-S runs 3X faster than dTLC-LP • But achieves similar power consumption
Conclusions and Future Work • Proposed two ways to avoid using level converters in Vdd-programmable interconnects • Tree based level converter insertion (TLC) • Dual-Vdd tree based level converter insertion (dTLC) • Developed chip-level dual-Vdd assignment algorithms w/o level converters • Sensitivity based algorithms TLC-S and dTLC-S • LP based algorithm dTLC-LP • dTLC-LP reduces interconnect power by 64% • dTLC-S obtains a slightly smaller power reduction with a 3X speedup compared to dTLC-LP • Future work • Extend chip-level Vdd-level assignment to interconnects using wire segments of different lengths • Allocate time slack to logic blocks and interconnects in a uniform fashion