Power Reduction for FPGA using Multiple Vdd/Vth Cecille Freeman Monday April 3, 2006
References Fei Li; Yan Lin; Lei He. “Vdd programmability to reduce FPGA interconnect power” in ICCAD 2004. International Conference on Computer Aided Design, 2004, p 760-5. Fei Li; Yan Lin; Lei He. “FPGA power reduction using configurable dual-Vdd” in Proceedings 2004. Design Automation Conference, 2004, p 735-40. Fei Li; Yan Lin; Lei He; Jason Cong. “Low-power FPGA using pre-defined dual-Vdd/dual-Vt fabrics” in ACM/SIGDA International Symposium on Field Programmable Gate Arrays - FPGA, v 12, 2004, p 42-50.
Outline • Introduction • Pre-defined dual Vdd • Dual Vt and dual Vdd structures • CAD tool flow • Results • Configurable Dual Vdd • Structure • CAD • Results • Interconnect Dual Vdd • Structure • CAD • Results
Introduction • Power consumption • FPGAs are less power efficient than ASICs • Reducing power loss is important if FPGAs are going to be used in embedded systems • Previous approaches mostly focus on changing the design implementation • This is the first “in-depth study” of dual Vdd/Vt techniques for FPGA • This technique is fairly common in ASIC
Introduction • Power consumption • Power loss from switching and leakage • Leakage is dominant in deep submicron processes (<100nm) • Both leakage and switching power are reduced by reducing Vdd • Leakage is reduced by increasing Vt • Programmable Vdd/Vt – 40-45% power reduction in ASIC
Introduction • Dynamic Power • $P_{dyn} = \frac{1}{2} \cdot f \cdot E \cdot C \cdot V_{dd}^2$ • f=clock frequency • E=effective transition density • C=load capacitance • Vdd=supply voltage
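The symbols on this slide fit the standard switching-power expression P = 0.5 * f * E * C * Vdd^2. The sketch below assumes that form and uses made-up component values purely to show the quadratic dependence on Vdd; the function name and all numbers are illustrative, not taken from the referenced papers.

```python
def dynamic_power(f_hz, transition_density, load_cap_f, vdd):
    """Switching power in watts, assuming P = 0.5 * f * E * C * Vdd^2."""
    return 0.5 * f_hz * transition_density * load_cap_f * vdd ** 2


# Illustrative values only (assumptions, not from the slides).
p_high = dynamic_power(100e6, 0.2, 10e-15, 1.3)
p_low = dynamic_power(100e6, 0.2, 10e-15, 1.0)

# Lowering Vdd from 1.3 V to 1.0 V scales power by (1.0 / 1.3) ** 2.
ratio = p_low / p_high
print(f"relative switching power at the lower Vdd: {ratio:.2f}")
```

Because f, E and C are unchanged, the ratio depends only on the two supply voltages, which is why supply scaling is attractive for switching power.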
Introduction • Leakage Power • $P_{lkg} = I_{lkg} \cdot V_{dd}$ • Ilkg=leakage current • Vdd=supply voltage • Ilkg increases as Vt decreases
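A rough way to see why raising Vt helps is the textbook subthreshold model, in which leakage current falls exponentially with Vt. The sketch below assumes that model with a hypothetical slope factor n = 1.5 and leakage power equal to Ilkg * Vdd; neither the model parameters nor the example voltages come from the slides.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q near 300 K, in volts


def leakage_current(i0, vt, n=1.5):
    """Assumed subthreshold model: I = I0 * exp(-Vt / (n * kT/q))."""
    return i0 * math.exp(-vt / (n * THERMAL_VOLTAGE))


def leakage_power(i0, vt, vdd, n=1.5):
    """Leakage power as leakage current times supply voltage."""
    return leakage_current(i0, vt, n) * vdd


# Hypothetical example: raise Vt by 0.1 V at a fixed 1.0 V supply.
base = leakage_power(1.0, 0.30, 1.0)
raised = leakage_power(1.0, 0.40, 1.0)
leak_ratio = raised / base
print(f"leakage after raising Vt by 0.1 V: {leak_ratio:.3f} of the original")
```

Under these assumed parameters a Vt increase of about 0.1 V cuts leakage by roughly an order of magnitude, which is the effect the high-Vt SRAM cells later in the deck rely on.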
Introduction • Dual Vdd theory • A lower supply voltage is slower, but results in less power loss • Not all paths in the circuit need to be equally fast • Critical paths use high Vdd for speed • Non-critical paths use low Vdd to save power • Makes use of timing slack
Predefined Dual Vdd/Vt • Design in 3 stages • Determine a good Vdd/Vt scaling from a normal LUT design • Dual Vt within each LUT • Dual Vdd across the chip
Predefined Dual Vdd/Vt • Single Vdd/Vt LUT (normal) • SRAM cell, MUX tree
Predefined Dual Vdd/Vt • Single Vdd/Vt scaling • Scaling across all LUTs • Reduction in switching power (quadratic in the supply voltage) • Large delay penalties as supply is reduced • Examined 3 scaling schemes • Constant Vt • Fixed Vdd/Vt ratio • Constant leakage power
Predefined Dual Vdd/Vt • Scaling Vdd to constant leakage is best
Predefined Dual Vdd/Vt • Dual Vt within a single LUT • SRAM cells can use a high Vt because they are configured at the start and are only read during operation (i.e., no switching delay) • Increasing Vt increases the time taken to program the FPGA
Predefined Dual Vdd/Vt • Vt of SRAM set to get 15X SRAM leakage reduction • Increases configuration time by 13% • MUX (region II) Vdd set using constant leakage scaling • Vdd of SRAM set to be same as MUX (constant in LUT)
Predefined Dual Vdd/Vt • High and Low Vdd LUTs • Need a level converter • Need to determine how the high and low voltage LUTs will be placed on the chip • Need a tool to determine • What should be in low and what should be in high • How the placement and routing should be done
Predefined Dual Vdd/Vt • Level Converter • Basically 2 inverters with a level restore
Predefined Dual Vdd/Vt • FPGA Fabric – 2 choices
Predefined Dual Vdd/Vt • CAD tool • Assignment of high/low LUTs based on “power sensitivity” • The LUT that gives the largest power reduction when moved to low Vdd is changed • If timing constraints are still met, keep the change, otherwise change it back • Placement done using simulated annealing, with an extra cost term for matching the high and low LUT assignment
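The assignment loop on this slide can be sketched as a greedy pass over a small netlist. This is a simplified illustration, not the authors' tool: the `Lut` record, the delay and power numbers, and the use of plain power saving as the "power sensitivity" are all assumptions, and level-converter delay is ignored.

```python
from dataclasses import dataclass


@dataclass
class Lut:
    name: str
    fanin: list        # names of the LUTs driving this one
    delay_high: float  # delay at VddH
    delay_low: float   # delay at VddL (slower)
    power_high: float
    power_low: float


def critical_delay(luts, low):
    """Longest path delay; `luts` must be listed in topological order."""
    arrival = {}
    for lut in luts:
        delay = lut.delay_low if lut.name in low else lut.delay_high
        arrival[lut.name] = delay + max(
            (arrival[src] for src in lut.fanin), default=0.0
        )
    return max(arrival.values())


def assign_low_vdd(luts, max_delay):
    """Try LUTs in order of power saved; revert any move that breaks timing."""
    low = set()
    by_saving = sorted(
        luts, key=lambda l: l.power_high - l.power_low, reverse=True
    )
    for lut in by_saving:
        low.add(lut.name)
        if critical_delay(luts, low) > max_delay:
            low.remove(lut.name)  # timing violated, change back
    return low


# Hypothetical netlist: a -> b -> d is critical, c -> d has slack.
netlist = [
    Lut("a", [], 1.0, 1.5, 4.0, 2.4),
    Lut("b", ["a"], 1.0, 1.5, 5.0, 3.0),
    Lut("c", [], 1.0, 1.5, 3.0, 1.8),
    Lut("d", ["b", "c"], 1.0, 1.5, 2.0, 1.2),
]
low_luts = assign_low_vdd(netlist, max_delay=3.0)
print(sorted(low_luts))
```

In this toy case only `c` ends up on the low supply: every other LUT lies on the critical path, so slowing it would exceed the 3.0 delay budget.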
Predefined Dual Vdd/Vt • Tested on 20 MCNC benchmarks • Dual Vt • 11.6% power reduction for combinational • 14.6% power reduction for sequential • Dual Vdd/Vt • 13.6% combinational, 14.1% sequential • Not as much as expected – routing and placement issues because predefined • Layout • Average 75% to low Vdd LUTs • No significant difference with fabric layout
Configurable Dual Vdd/Vt • The pre-defined fabric did not get good power reduction from dual Vdd because of routing and placement issues • Solution: make each LUT configurable as either a high or a low Vdd LUT, removing the extra placement constraint
Configurable Dual Vdd/Vt • Configurable LUT • Attached by P-MOS transistor to both rails • SRAM configuration bits to determine which rail supplies power • 3 possible configurations • VddL, VddH, Power gated (both off) • Configuration bits also determine if output goes through a level converter
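The three supply states can be pictured as a decode of two switch-control bits, one per PMOS power switch. The encoding below is an assumption for illustration (the slides do not give the bit assignment), as is the rule that a level converter is only needed when a VddL block drives a VddH block.

```python
from enum import Enum


class Supply(Enum):
    VDD_HIGH = "VddH"
    VDD_LOW = "VddL"
    GATED = "power gated"


def decode_supply(high_on: bool, low_on: bool) -> Supply:
    """Map the two (hypothetical) switch-enable bits to a supply state."""
    if high_on and low_on:
        raise ValueError("both switches on would short the two rails")
    if high_on:
        return Supply.VDD_HIGH
    if low_on:
        return Supply.VDD_LOW
    return Supply.GATED  # both switches off


def needs_level_converter(driver: Supply, receiver: Supply) -> bool:
    """Assumed rule: convert only when a low-swing output feeds a VddH block."""
    return driver is Supply.VDD_LOW and receiver is Supply.VDD_HIGH


print(decode_supply(False, True).value)
print(needs_level_converter(Supply.VDD_LOW, Supply.VDD_HIGH))
```

Treating "both on" as an error reflects that the two switches connect the block to different rails; a real configuration generator would simply never emit that combination.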
Configurable Dual Vdd/Vt • Problem: AREA • Normally sleep transistors have high Vt, but this means they are larger • Instead use normal Vt transistors for switches • Normal Vt gives higher leakage • Gate boosting • When a switch is off, apply a gate voltage one Vt higher than the Vdd at the source • Gate boosting is already used in Xilinx FPGAs
Configurable Dual Vdd/Vt • Problem: AREA • Apply switches with a larger granularity • Clusters of 10 logic blocks share one switch configuration • Problem: Leakage from extra SRAM • SRAM can have high Vt because it is not written during operation • Vt set to give a 15X leakage reduction over normal, with a 13% increase in configuration time
Configurable Dual Vdd/Vt • FPGA fabric • Compared fabric with all programmable to one with VddH, VddL and programmable
Configurable Dual Vdd/Vt • CAD tools • Same as for predefined, except the matching cost now includes programmable blocks as being able to be assigned as either high or low LUTs in the placement algorithm
Configurable Dual Vdd/Vt • Results: • Compared to single Vdd FPGAs with Vdd optimized for the same target clock frequency • Full supply programmability • Logic power reduction of 35.5% • Logic block area increased by 24% • Partial supply programmability (1/1/3 H/L/P) • Logic power reduction of 28.62% • Logic block area increased by 14% • Logic area increase is not very significant when compared to area of routing
Configurable Interconnect • Global interconnect power is very high • Becomes more dominant as power reduction is applied to logic blocks • Solution: make the interconnect programmable as well
Configurable Interconnect • Only a small portion of the interconnect is ever being used (avg 11.9% on their tests) • Would be good to power gate the unused portion • 1 configuration bit • VddH, VddL • 2 configuration bits • VddH, VddL, power gated
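The bit counts on this slide follow from the number of supply states each switch must distinguish, and the 11.9% utilization figure bounds how much of the interconnect could be gated. The sketch below works through both; the `residual` leakage factor for a gated switch is a hypothetical parameter, not a value from the slides.

```python
def config_bits(num_states: int) -> int:
    """Minimum number of SRAM bits needed to select one of num_states."""
    return (num_states - 1).bit_length()


def relative_leakage(used_fraction: float, residual: float) -> float:
    """Leakage relative to an ungated fabric.

    Used switches leak fully; unused switches are gated and leak only
    `residual` (0.0 = ideal gating, 1.0 = no gating) of their original value.
    """
    return used_fraction + (1.0 - used_fraction) * residual


bits_without_gating = config_bits(2)  # VddH, VddL
bits_with_gating = config_bits(3)     # VddH, VddL, power gated

used = 0.119  # average utilization reported on the slide
ideal = relative_leakage(used, residual=0.0)
print(bits_without_gating, bits_with_gating, f"{ideal:.3f}")
```

With ideal gating the remaining interconnect leakage equals the used fraction, so the low utilization is what makes the second configuration bit worthwhile.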
Configurable Interconnect • Configuration for routing switches and connection to logic block
Configurable Interconnect • Power considerations for SRAM • Additional SRAM means additional leakage power • SRAM is only programmed once before use • Use same high-Vt SRAM as for configurable logic blocks • Delay considerations • Longer delay through routing switch • Bound delay increase to 6% by properly sizing the tri-state buffer
Configurable Interconnect • CAD tools • Similar to the tools for configurable Vdd/Vt • Uses only the fully programmable block fabric • No placement and routing constraints
Configurable Interconnect • Results • One bit configuration (no power gating) = 22.21% power reduction • Two bit configuration (power gating) = 50.55% power reduction • 56.1% reduction to interconnect power • Power gating reduces FPGA interconnect power by 32% - many unused routing resources can be gated
Summary • Using a Dual Vt LUT decreases power by ~13% • Predefined dual Vdd has very little effect on power because of routing • Programmable Vdd logic cells reduce logic power by 35.5% with full programmability and 28.6% with partial programmability • Fully configurable Vdd logic cells and interconnects with power gating reduce power by 50.55% • Tradeoffs: increase in area, increase in delay, increase in configuration time
Future Work • Reduction of SRAM cells required for programmability • Design of a good power supply network for the chip
Conclusions • Excellent power reduction overall • Excellent design if power reduction is a concern – no changes required to the design itself • Might introduce some timing issues because of extra delay through chip • Might be expensive due to extra area required on the chip