Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability

Yan Lin, Fei Li and Lei He • EE Department, UCLA • LHE@ee.ucla.edu • Partially supported by NSF

Overview • FPGA architecture evaluation • Area and delay [Rose et al, JSSC’90] • Power [Poon et al, FPL’02][Li et al, FPGA’03] • Vdd programmability for power reduction • Concept in [FPGA’03] • Application to logic [FPGA’04][DAC’04] • Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04] • Novel circuits and architecture evaluation for FPGAs with Vdd programmability • Reduce power by 50% with 17% area and 3% delay increase

Outline • Power modeling and architecture evaluation methodology • FPGA Circuits for Vdd Programmability • Architecture Evaluation with Vdd programmability • Conclusions and Ongoing Work

Framework fpgaEva-LP • Inputs: benchmark circuits and architecture specification • Logic optimization (SIS) • Tech-mapping (RASP) • Timing-driven packing (T-VPack) • Placement & routing (VPR) • Parasitic extraction • Cycle-accurate power simulator • Outputs: area, delay and power

FPGA Structure and Models • Cluster-based island-style FPGA structure • 100% buffered interconnects, subset switch block • Input Fc = 50%, output Fc = 25% • Area and delay models similar to [Betz-Rose-Marquardt] • But based on layout and SPICE for 100nm and below • Mixed-level power model from [FPGA’03] • Dynamic power • Capacitive power (functional switching and glitches) • Short-circuit power (depends on transition time) • Static power • Sub-threshold leakage • Reverse-biased leakage • Gate leakage

New Power Model in fpgaEva-LP2 • Short-circuit power scales with signal transition time and switching power • fpgaEva-LP used an average signal transition time • fpgaEva-LP2 calculates the transition time for each buffer as a scaling factor times the buffer delay • The scaling factor is NOT a constant 2 as in the literature, due to input slew • The scaling factor is pre-characterized by SPICE
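The per-buffer model on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `slew_factor` argument and the fitted coefficient `k_sc` are hypothetical and are not taken from fpgaEva-LP2; the slide only says that short-circuit power scales with transition time and switching power, and that the transition time is a pre-characterized factor times the buffer delay.

```python
def transition_time(buffer_delay_ns, slew_factor):
    """Transition time of one buffer: a pre-characterized factor times its delay.

    slew_factor is a hypothetical name for the SPICE-characterized factor;
    the slide notes it is not a fixed constant of 2.
    """
    return slew_factor * buffer_delay_ns


def short_circuit_power(switching_power, transition_time_ns, k_sc):
    """Short-circuit power modeled as proportional to transition time and
    switching power. k_sc (units 1/ns) is an assumed fitting coefficient."""
    return k_sc * transition_time_ns * switching_power


# Illustrative values only, not measurements from the slides.
t_r = transition_time(buffer_delay_ns=0.5, slew_factor=2.4)
p_sc = short_circuit_power(switching_power=10.0, transition_time_ns=t_r, k_sc=0.1)
```

Using a per-buffer transition time instead of one average value lets buffers with slow inputs contribute more short-circuit power than fast ones.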

Validation Using SPICE • Validate by comparison for each power-component • High fidelity with average absolute error of 8%

Impact of Random Seeds in VPR • [Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for circuit s38584 over 10 VPR runs with different seeds] • 12% delay variation and 5% energy variation • Min-delay solution among 10 runs is used
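A minimal sketch of the seed-selection step, assuming each placement-and-routing run is summarized by a delay and an energy value. The record layout, the function names and the numbers are hypothetical, and measuring the spread relative to the minimum is an assumption about how the variation on the slide is defined.

```python
def relative_spread(values):
    """Spread of a metric across runs, relative to its minimum (assumed definition)."""
    return (max(values) - min(values)) / min(values)


def pick_min_delay(runs):
    """Return the run with the smallest critical path delay."""
    return min(runs, key=lambda run: run["delay_ns"])


# Made-up results for three seeds; the slides use 10 runs.
runs = [
    {"seed": 1, "delay_ns": 10.4, "energy_nj": 5.28},
    {"seed": 2, "delay_ns": 11.0, "energy_nj": 5.36},
    {"seed": 3, "delay_ns": 11.6, "energy_nj": 5.50},
]
best = pick_min_delay(runs)
delay_spread = relative_spread([run["delay_ns"] for run in runs])
```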

Evaluation of Single-Vdd FPGAs • Architectures explored • Cluster size N = {6, 8, 10, 12} • LUT size k = {3, 4, 5, 6, 7} • Energy-delay (ED) dominant architectures • Architectures with smaller delay or less energy than any other architecture, i.e. no other architecture is better in both • A relaxed ED-dominant set may also be valuable • [Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per (N, k) architecture]
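The ED-dominant set can be read as a Pareto front over (delay, energy). The sketch below assumes that reading; the function name and the sample values are hypothetical and are not data from the slides.

```python
def ed_dominant(points):
    """Return the architectures that no other architecture beats in both metrics.

    points maps an (N, k) tuple to a (delay_ns, energy_nj) tuple.
    """
    front = []
    for arch, (delay, energy) in points.items():
        dominated = any(
            d2 <= delay and e2 <= energy and (d2 < delay or e2 < energy)
            for other, (d2, e2) in points.items()
            if other != arch
        )
        if not dominated:
            front.append(arch)
    return sorted(front)


# Illustrative values only.
sample = {
    (8, 7): (10.0, 5.5),
    (10, 4): (12.0, 3.5),
    (8, 4): (11.5, 3.6),
    (12, 7): (13.0, 8.0),
}
front = ed_dominant(sample)
```

A relaxed set could be obtained by also keeping points that fall within a small tolerance of the front, which is one way to interpret the "relaxed ED-dominant set" bullet.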

Energy versus Delay • [Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns) for each (N, k), with the current commercial architecture marked] • For 100nm ITRS technology • Min-energy arch (N,k) = (10,4) or (8,4) • Min-delay arch (N,k) = (8,7): 0.8x delay but 1.7x power

Outline • Power modeling and evaluation methodology • FPGA Circuits for Vdd Programmability • Architecture Evaluation with Vdd programmability • Conclusions and Ongoing Work

Vdd-programmable FPGA [FPGA’04][DAC’04][ICCAD’04] • Vdd-programmable logic block • Vdd selection • Power-gating unused blocks • Vdd-programmable switch • Vdd-level conversion is needed when VddL drives VddH • To avoid excessive leakage

Vdd-programmable Routing Switch • Conventional routing switch • Vdd-programmable routing switch • Brute-force design [ICCAD’04] • Two extra SRAM cells for each routing switch • New design • One extra SRAM cell • NAND2 gate built from minimum-size, high-Vt transistors

Vdd-Programmable Interconnect Connection Block • Brute-force design [ICCAD’04] • 2n extra SRAM cells for n connection switches • New design • Only TWO extra SRAM cells for n connection switches • Control logic includes 2n NAND2 gates and a decoder
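The configuration-cell cost of the two connection block designs can be compared with a short sketch. The function and the design labels are hypothetical; only the counts (2n cells for the brute-force design, two cells for the new design) come from the slide.

```python
def extra_sram_cells(n_switches, design):
    """Extra SRAM cells needed by a connection block with n_switches switches."""
    if design == "brute_force":
        return 2 * n_switches  # two extra cells per connection switch
    if design == "new":
        return 2               # two extra cells shared by all switches
    raise ValueError(f"unknown design: {design}")


def cells_saved(n_switches):
    """Cells saved by the new design relative to the brute-force design."""
    return extra_sram_cells(n_switches, "brute_force") - extra_sram_cells(n_switches, "new")
```

The saving grows linearly with n, at the price of the 2n NAND2 gates and the decoder listed on the slide as control logic.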

Power and Delay • Vdd-programmable switch uses • 4X PMOS power transistor for 7X routing switch • 1X PMOS power transistor for 4X connection switch • Compared to conventional switch • 1000X less leakage power • Connection box is 28% faster and has 18% less dynamic power • By moving the mux off the critical path of the connection box

Vdd-gateable Routing Switch • Conventional • Vdd-gateable • Two states: normal Vdd or power-gating • Enables power-gating capability without extra SRAM cells • Can be replaced by a tri-state buffer • [Figure label: power transistor]

Vdd-gateable Connection Block • Conventional • Vdd-gateable • Enables power-gating capability with only one extra SRAM cell and a low-leakage decoder

Outline • Power modeling and evaluation methodology • FPGA Circuits for Vdd Programmability • Architecture Evaluation with Vdd programmability • Conclusions and Ongoing Work

FPGA Architecture Classes • High-Vt is applied to configuration SRAM cells for all the classes

Vdd-level Converters • Class3 removes Vdd-level converters from interconnects in Class1 • With the constraint that no VddL segment drives a VddH segment • We developed a routing algorithm in which each routing tree has a single Vdd level • But trees with different Vdd levels can share the same wire track • Alternative approaches: • Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04] • Our new work [DAC’05] allows dual Vdd in a tree, with chip-level time slack budgeting for extra power reduction
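The constraint behind Class3 can be expressed as a simple check on routing trees. This is a sketch under assumptions: the data layout (an edge list plus a per-node Vdd assignment) and all names are hypothetical and do not describe the authors' router.

```python
VDDL, VDDH = "VddL", "VddH"


def needs_level_converter(driver_vdd, sink_vdd):
    """A converter is needed only when a VddL driver feeds a VddH sink."""
    return driver_vdd == VDDL and sink_vdd == VDDH


def violations(edges, vdd_of):
    """List the (driver, sink) edges that would need a level converter."""
    return [(u, v) for u, v in edges if needs_level_converter(vdd_of[u], vdd_of[v])]


def is_single_vdd_tree(nodes, vdd_of):
    """True if every node of the routing tree uses the same Vdd level."""
    return len({vdd_of[n] for n in nodes}) <= 1


# Hypothetical three-node routing tree: src drives a and b.
edges = [("src", "a"), ("src", "b")]
uniform = {"src": VDDL, "a": VDDL, "b": VDDL}
mixed = {"src": VDDL, "a": VDDH, "b": VDDL}
```

A tree that passes `is_single_vdd_tree` trivially has no violations, which is why the single-Vdd-per-tree rule makes level converters unnecessary in the interconnect.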

Energy versus Delay • [Figure: total FPGA energy per cycle (nJ) versus critical path delay (ns) for Class0 to Class3, points labeled by (N, k); LUT size 7 gives high performance, LUT size 4 gives low energy] • ED-product reduction • 20% by Class1 (Vdd-programmable interconnects w/ level converters) • 45% by Class2 (Vdd-gateable interconnects) • 50% by Class3 (Class1 minus level converters) • Performance degrades 3% due to Vdd programmability
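The ED-product reduction quoted on this slide is a ratio of energy-delay products. The sketch below shows the arithmetic with made-up numbers; the function names and sample values are hypothetical.

```python
def ed_product(energy_nj, delay_ns):
    """Energy-delay product of one architecture."""
    return energy_nj * delay_ns


def ed_reduction(base_energy, base_delay, new_energy, new_delay):
    """Fractional ED-product reduction of a new design relative to a baseline."""
    return 1.0 - ed_product(new_energy, new_delay) / ed_product(base_energy, base_delay)


# Illustrative only: a design that cuts energy sharply but is 3% slower.
reduction = ed_reduction(base_energy=4.0, base_delay=10.0,
                         new_energy=2.2, new_delay=10.3)
```

Because delay enters the product, a small delay penalty slightly offsets the energy saving, so the ED reduction is a little smaller than the energy reduction alone.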

Energy versus Area • [Figure: total FPGA energy per cycle (nJ) versus total FPGA device area for Class0 to Class3, points labeled by (N, k); the min-area and min-energy architectures are annotated] • Average area overhead • 118% for Class1 (Vdd-programmable interconnects w/ level converters) • 17% for Class2 (Vdd-gateable interconnects) • 52% for Class3 (Vdd-programmable interconnects w/o level converters) • Class2 is the best considering both energy and area

Energy Breakdown • [Figure: total FPGA energy (nJ/cycle) for Class0 to Class3 at (N,k) = (12,4), stacked by logic leakage, logic dynamic, local interconnect leakage, local interconnect dynamic, global interconnect leakage and global interconnect dynamic energy] • Class2 and Class3 dramatically reduce global interconnect leakage • But Class1 fails due to leakage in Vdd-level converters

Area Overhead • Class2: Vdd-gateable interconnects + Vdd-programmable CLBs, (N,k) = (12,4) • Logic blocks 3.19%: power transistors & SRAMs (CLBs) 1.39%, Vdd-level converters (CLBs) 1.80% • Connection blocks 10.38%: control 4.82%, power transistors 4.96%, SRAMs 0.60% • Routing switches 3.87%: power transistors 3.87% • 17% = 9% for power transistors + 5% for control + 2% for SRAM
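The area figures on this slide can be cross-checked by summing the per-component overheads. The percentages below are the ones shown in the chart; assigning them to groups (logic blocks, connection blocks, routing switches) is an assumption based on label adjacency, supported by the fact that the group subtotals then add up exactly.

```python
# (group, component) -> area overhead in percent, as read from the chart.
overhead = {
    ("logic blocks", "power transistors and SRAMs"): 1.39,
    ("logic blocks", "Vdd-level converters"): 1.80,
    ("connection blocks", "control"): 4.82,
    ("connection blocks", "power transistors"): 4.96,
    ("connection blocks", "SRAMs"): 0.60,
    ("routing switches", "power transistors"): 3.87,
}


def group_total(group):
    """Sum the overhead of every component in one group, rounded to 0.01%."""
    return round(sum(v for (g, _), v in overhead.items() if g == group), 2)


total = round(sum(overhead.values()), 2)
logic = group_total("logic blocks")
connection = group_total("connection blocks")
routing = group_total("routing switches")
```

Under this grouping the interconnect power transistors contribute 4.96% + 3.87%, which rounds to the 9% in the slide's summary line.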

Conclusions and New Results • Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGAs • Vdd-gating offers a better area-power tradeoff than Vdd-selection • 45% energy-delay product reduction with 17% area overhead • Architecture with Vdd-programmability • LUT size 4: low energy and area • LUT size 7: best performance • New results [DAC’05] • Time slack allocation for Vdd-programmable interconnects • Device and architecture co-optimization for 77% energy-delay reduction

References and Download • All references and tools at http://eda.ee.ucla.edu • Results in the slides have been updated compared to the paper in ISFPGA’05
