Exponential Challenges, Exponential Rewards— The Future of Moore’s Law

Exponential Challenges, Exponential Rewards—The Future of Moore’s Law Shekhar Borkar Intel Fellow Circuit Research, Intel Labs Fall, 2004

ISSCC 2003—Gordon Moore said… “No exponential is forever… But We can delay Forever”

Outline • The exponential challenges • Circuit and mArch solutions • Major paradigm shifts in design • Integration & SOC • The exponential reward • Summary

How do you get there? Goal: 1TIPS by 2010 Pentium® 4 Architecture Pentium® Pro Architecture Pentium® Architecture 486 386 286 8086

GATE DRAIN SOURCE BODY Technology Scaling GATE Xj DRAIN SOURCE D Tox BODY Leff Technology has scaled well, will it in the future?

Technology Outlook

I ≠0 I = 0 On I = 1ma/u I = ∞ I = 0 I ≠0 Off I = 0 I ≠ 0 Sub-threshold Leakage Is Transistor a Good Switch?

Exponential Challenge #1 Subthreshold Leakage

Sub-threshold Leakage Assume: 0.25mm, Ioff = 1na/m 5X increase each generation at 30ºC Sub-threshold leakage increases exponentially

SD Leakage Power SD leakage power becomes prohibitive

Leakage Power A. Grove, IEDM 2002 Leakage power limits Vt scaling

Exponential Challenge #2 Gate Leakage

90nm MOS Transistor 50nm Gate 1.2 nm SiO2 Silicon substrate Gate Oxide is Near Limit If Tox scaling slows down, then Vdd scaling will have to slow down High-K dielectric is crucial

Exponential Challenge #3 Energy & Power

Energy per Logic Operation Energy per logic operation scaling will slow down

The Power Crisis Business as usual is not an option

Exponential Challenge #4 Variations

Lithography Wavelength 365nm 248nm 193nm 180nm 130nm Gap 90nm 65nm 45nm Generation 32nm 13nm EUV Random Dopant Fluctuations Source: Mark Bohr, Intel Sub-wavelength Lithography Heat Flux (W/cm2) Results in Vcc variation Temperature Variation (°C) Hot spots Sources of Variations

Low Freq Low Isb High Freq High Isb High Freq Medium Isb Frequency & SD Leakage 0.18 micron ~1000 samples 30% 20X

High Freq Medium Isb Low Freq Low Isb High Freq High Isb Vt Distribution 0.18 micron ~1000 samples ~30mV

Exponential Challenge #5 Platform & System

3000 2500 2000 System Volume ( cubic inch) 1500 1000 500 0 PC tower Mini tower m-tower Slim line Small pc 1.5 100 Pentium ® III 75 Projected Heat Dissipation Volume 1.0 Pentium ® 4 Heat-Sink Volume (in3) Thermal Budget (oC/W) 50 Air Flow Rate (CFM) Projected Air Flow Rate 0.5 25 Thermal Budget 0 0 250 0 50 100 150 200 Power (W) Platform Requirements Shrinking volume Quieter Yet, High Performance Thermal budget decreasing Higher heat sink volume Higher air flow rate

Exponential Challenge #6 Economics

Litho Cost FAB Cost G. Moore ISSCC 03 www.icknowledge.com $ per Transistor $ per MIPS Exponential Costs

Product Cost Pressure Shrinking ASP, and shrinking $ budget for power

Must Fit in Power Envelope ) 1400 2 SiO2 Lkg 10 mm Die 1200 SD Lkg Active 1000 800 Power (W), Power Density (W/cm 8 MB 600 4 MB 400 2 MB 1 MB 200 0 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, and Architecture to constrain the power

Tox scaling will slow down—may stop? Vdd scaling will slow down—may stop? Vt scaling will slow down—may stop? Approaching constant Vdd scaling Energy/logic op will not scale Some Implications

The Gigascale Dilemma • 1B T integration capacity will be available • But could be unusable due to power • Logic T growth will slow down • Transistor performance will be limited Solutions • Low power design techniques • Improve design efficiency—Multi everywhere • Valued performance by even higher integration (of potentially slower transistors)

Solutions Power—active and leakage Variations Microarchitecture

Slow Fast Slow High Supply Voltage Multiple Supply Voltages Low Supply Voltage Replicated Designs Vdd Vdd/2 Freq = 0.5 Vdd = 0.5 Throughput = 1 Power = 0.25 Area = 2 Pwr Den = 0.125 Freq = 1 Vdd = 1 Throughput = 1 Power = 1 Area = 1 Pwr Den = 1 Logic Block Logic Block Logic Block Active Power Reduction

Body Bias Stack Effect Sleep Transistor Vbp Vdd +Ve Logic Block Equal Loading Vbn -Ve 5-10X Reduction 2-1000X Reduction 2-10X Reduction Leakage Control

power 2 2 target frequency probability 1.5 1.5 1 1 0.5 0.5 0 0 large high small low Transistor size Low-Vt usage Circuit Design Tradeoffs • Higher probability of target frequency with: • Larger transistor sizes • Higher Low-Vt usage • But with power penalty

60% # critical paths 1.4 40% 1.3 Number of dies 20% Mean clock frequency 1.2 0% 0.9 1.1 1.3 1.5 1.1 Clock frequency 1 9 17 25 # of critical paths Impact of Critical Paths • With increasing # of critical paths • Both s and m become smaller • Lower mean frequency

Logic depth: 16 1.0 to Ion-s NMOS Ion PMOS Ion Delay 0.5 Ratio of s/m s m s m / / delay-s 5.6% 3.0% 4.2% 0.0 16 49 Logic depth Impact of Logic Depth Device I NMOS 40% ON PMOS 20% Number of samples (%) Delay 40% 20% 0% -16% -8% 0% 8% 16% Variation (%)

1.5 1.5 1 1 0.5 0.5 0 0 less more large small # uArch critical paths Logic depth frequency target frequency probability mArchitecture Tradeoffs • Higher target frequency with: • Shallow logic depth • Larger number of critical paths • But with lower probability

2 2 power 1.5 1.5 target frequency probability 1 1 0.5 0.5 0 0 large high small low Transistor size Low-Vt usage 1.5 1.5 1 1 0.5 0.5 frequency 0 target frequency probability 0 less more large small # uArch critical paths Logic depth Variation-tolerant Design Balance power & frequency with variation tolerance

Due to variations in: Vdd, Vt, and Temp Path Delay Deterministic Deterministic Probabilistic Frequency Probabilistic 10X variation ~50% total power # of Paths # of Paths Delay Target Delay Target Leakage Power Probabilistic Design Probability Delay Deterministic design techniques inadequate in the future

Today: Local Optimization Single Variable Tomorrow: Global Optimization Multi-variate Shift in Design Paradigm • Multi-variable design optimization for: • Yield and bin splits • Parameter variations • Active and leakage power • Performance

Resistor Network PD & Counter Delay Bias Amplifier CUT Adaptive Body Bias--Experiment Multiple subsites Resistor Network 5.3 mm 4.5 mm 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Technology 150nm CMOS Number of 21 Die frequency: Min(F1..F21) Die power: Sum(P1..P21) subsites per die 0.5V FBB to Body bias range 0.5V RBB Bias resolution 32 mV

noBB ABB within die ABB 97% highest bin • For given Freq and Power density • 100% yield with ABB • 97% highest freq bin with ABB for within die variability 100% yield Adaptive Body Bias--Results 100% 60% Accepted die 20% 0% Higher Frequency 

Design & mArch Efficiency Employ efficient design & mArchitectures

Memory Latency CPU Cache Memory Small ~few Clocks Large 50-100ns Assume: 50ns Memory latency Cache miss hurts performance Worse at higher frequency

Increase on-die Memory Large on die memory provides: Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power

Multi-threading Thermals & Power Delivery designed for full HW utilization Single Thread Full HW Utilization Wait for Mem ST Multi-Threading Wait for Mem MT1 Wait MT2 MT3 Multi-threading improves performance without impacting thermals & power delivery

Chip Multi-Processing C1 C2 Cache C3 C4 • Multi-core, each core Multi-threaded • Shared cache and front side bus • Each core has different Vdd & Freq • Core hopping to spread hot spots • Lower junction temperature

Special Purpose Hardware TCP Offload Engine Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines 2.23 mm X 3.54 mm, 260K transistors Special purpose HW—Best Mips/Watt

Special-purpose hardware  more MIPS/mm² SIMD integer and FP instructions in several ISAs Valued Performance: SOC (System on a Chip) Heterogeneous Si, SiGe, GaAs Polylithic Si Monolithic Special HW Wireline Opto-Electronics CMOS RF CPU Memory RF Dense Memory

Multi-Threaded, Multi-Core Multi Threaded Era of Thread & Processor Level Parallelism Special Purpose HW Speculative, OOO Super Scalar 486 386 Era of Instruction Level Parallelism 286 8086 Era of Pipelined Architecture The Exponential Reward Multi-everywhere: MT, CMP

Summary—Delaying Forever • Gigascale transistor integration capacity will be available—Power and Energy are the barriers • Variations will be even more prominent—shift from Deterministic toProbabilistic design • Improve design efficiency • Multi—everywhere, & SOC valued performance • Exploit integration capacity to deliver performance in power/cost envelope

Exponential Challenges, Exponential Rewards— The Future of Moore’s Law