510 likes | 829 Views
Exponential Challenges, Exponential Rewards— The Future of Moore’s Law. Shekhar Borkar Intel Fellow Circuit Research, Intel Labs Fall, 2004. ISSCC 2003— Gordon Moore said…. “No exponential is forever… But We can delay Forever”. Outline. The exponential challenges
E N D
Exponential Challenges, Exponential Rewards—The Future of Moore’s Law Shekhar Borkar Intel Fellow Circuit Research, Intel Labs Fall, 2004
ISSCC 2003—Gordon Moore said… “No exponential is forever… But We can delay Forever”
Outline • The exponential challenges • Circuit and mArch solutions • Major paradigm shifts in design • Integration & SOC • The exponential reward • Summary
How do you get there? Goal: 1TIPS by 2010 Pentium® 4 Architecture Pentium® Pro Architecture Pentium® Architecture 486 386 286 8086
GATE DRAIN SOURCE BODY Technology Scaling GATE Xj DRAIN SOURCE D Tox BODY Leff Technology has scaled well, will it in the future?
I ≠0 I = 0 On I = 1ma/u I = ∞ I = 0 I ≠0 Off I = 0 I ≠ 0 Sub-threshold Leakage Is Transistor a Good Switch?
Exponential Challenge #1 Subthreshold Leakage
Sub-threshold Leakage Assume: 0.25mm, Ioff = 1na/m 5X increase each generation at 30ºC Sub-threshold leakage increases exponentially
SD Leakage Power SD leakage power becomes prohibitive
Leakage Power A. Grove, IEDM 2002 Leakage power limits Vt scaling
Exponential Challenge #2 Gate Leakage
90nm MOS Transistor 50nm Gate 1.2 nm SiO2 Silicon substrate Gate Oxide is Near Limit If Tox scaling slows down, then Vdd scaling will have to slow down High-K dielectric is crucial
Exponential Challenge #3 Energy & Power
Energy per Logic Operation Energy per logic operation scaling will slow down
The Power Crisis Business as usual is not an option
Exponential Challenge #4 Variations
Lithography Wavelength 365nm 248nm 193nm 180nm 130nm Gap 90nm 65nm 45nm Generation 32nm 13nm EUV Random Dopant Fluctuations Source: Mark Bohr, Intel Sub-wavelength Lithography Heat Flux (W/cm2) Results in Vcc variation Temperature Variation (°C) Hot spots Sources of Variations
Low Freq Low Isb High Freq High Isb High Freq Medium Isb Frequency & SD Leakage 0.18 micron ~1000 samples 30% 20X
High Freq Medium Isb Low Freq Low Isb High Freq High Isb Vt Distribution 0.18 micron ~1000 samples ~30mV
Exponential Challenge #5 Platform & System
3000 2500 2000 System Volume ( cubic inch) 1500 1000 500 0 PC tower Mini tower m-tower Slim line Small pc 1.5 100 Pentium ® III 75 Projected Heat Dissipation Volume 1.0 Pentium ® 4 Heat-Sink Volume (in3) Thermal Budget (oC/W) 50 Air Flow Rate (CFM) Projected Air Flow Rate 0.5 25 Thermal Budget 0 0 250 0 50 100 150 200 Power (W) Platform Requirements Shrinking volume Quieter Yet, High Performance Thermal budget decreasing Higher heat sink volume Higher air flow rate
Exponential Challenge #6 Economics
Litho Cost FAB Cost G. Moore ISSCC 03 www.icknowledge.com $ per Transistor $ per MIPS Exponential Costs
Product Cost Pressure Shrinking ASP, and shrinking $ budget for power
Must Fit in Power Envelope ) 1400 2 SiO2 Lkg 10 mm Die 1200 SD Lkg Active 1000 800 Power (W), Power Density (W/cm 8 MB 600 4 MB 400 2 MB 1 MB 200 0 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, and Architecture to constrain the power
Tox scaling will slow down—may stop? Vdd scaling will slow down—may stop? Vt scaling will slow down—may stop? Approaching constant Vdd scaling Energy/logic op will not scale Some Implications
The Gigascale Dilemma • 1B T integration capacity will be available • But could be unusable due to power • Logic T growth will slow down • Transistor performance will be limited Solutions • Low power design techniques • Improve design efficiency—Multi everywhere • Valued performance by even higher integration (of potentially slower transistors)
Solutions Power—active and leakage Variations Microarchitecture
Slow Fast Slow High Supply Voltage Multiple Supply Voltages Low Supply Voltage Replicated Designs Vdd Vdd/2 Freq = 0.5 Vdd = 0.5 Throughput = 1 Power = 0.25 Area = 2 Pwr Den = 0.125 Freq = 1 Vdd = 1 Throughput = 1 Power = 1 Area = 1 Pwr Den = 1 Logic Block Logic Block Logic Block Active Power Reduction
Body Bias Stack Effect Sleep Transistor Vbp Vdd +Ve Logic Block Equal Loading Vbn -Ve 5-10X Reduction 2-1000X Reduction 2-10X Reduction Leakage Control
power 2 2 target frequency probability 1.5 1.5 1 1 0.5 0.5 0 0 large high small low Transistor size Low-Vt usage Circuit Design Tradeoffs • Higher probability of target frequency with: • Larger transistor sizes • Higher Low-Vt usage • But with power penalty
60% # critical paths 1.4 40% 1.3 Number of dies 20% Mean clock frequency 1.2 0% 0.9 1.1 1.3 1.5 1.1 Clock frequency 1 9 17 25 # of critical paths Impact of Critical Paths • With increasing # of critical paths • Both s and m become smaller • Lower mean frequency
Logic depth: 16 1.0 to Ion-s NMOS Ion PMOS Ion Delay 0.5 Ratio of s/m s m s m / / delay-s 5.6% 3.0% 4.2% 0.0 16 49 Logic depth Impact of Logic Depth Device I NMOS 40% ON PMOS 20% Number of samples (%) Delay 40% 20% 0% -16% -8% 0% 8% 16% Variation (%)
1.5 1.5 1 1 0.5 0.5 0 0 less more large small # uArch critical paths Logic depth frequency target frequency probability mArchitecture Tradeoffs • Higher target frequency with: • Shallow logic depth • Larger number of critical paths • But with lower probability
2 2 power 1.5 1.5 target frequency probability 1 1 0.5 0.5 0 0 large high small low Transistor size Low-Vt usage 1.5 1.5 1 1 0.5 0.5 frequency 0 target frequency probability 0 less more large small # uArch critical paths Logic depth Variation-tolerant Design Balance power & frequency with variation tolerance
Due to variations in: Vdd, Vt, and Temp Path Delay Deterministic Deterministic Probabilistic Frequency Probabilistic 10X variation ~50% total power # of Paths # of Paths Delay Target Delay Target Leakage Power Probabilistic Design Probability Delay Deterministic design techniques inadequate in the future
Today: Local Optimization Single Variable Tomorrow: Global Optimization Multi-variate Shift in Design Paradigm • Multi-variable design optimization for: • Yield and bin splits • Parameter variations • Active and leakage power • Performance
Resistor Network PD & Counter Delay Bias Amplifier CUT Adaptive Body Bias--Experiment Multiple subsites Resistor Network 5.3 mm 4.5 mm 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Technology 150nm CMOS Number of 21 Die frequency: Min(F1..F21) Die power: Sum(P1..P21) subsites per die 0.5V FBB to Body bias range 0.5V RBB Bias resolution 32 mV
noBB ABB within die ABB 97% highest bin • For given Freq and Power density • 100% yield with ABB • 97% highest freq bin with ABB for within die variability 100% yield Adaptive Body Bias--Results 100% 60% Accepted die 20% 0% Higher Frequency
Design & mArch Efficiency Employ efficient design & mArchitectures
Memory Latency CPU Cache Memory Small ~few Clocks Large 50-100ns Assume: 50ns Memory latency Cache miss hurts performance Worse at higher frequency
Increase on-die Memory Large on die memory provides: Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power
Multi-threading Thermals & Power Delivery designed for full HW utilization Single Thread Full HW Utilization Wait for Mem ST Multi-Threading Wait for Mem MT1 Wait MT2 MT3 Multi-threading improves performance without impacting thermals & power delivery
Chip Multi-Processing C1 C2 Cache C3 C4 • Multi-core, each core Multi-threaded • Shared cache and front side bus • Each core has different Vdd & Freq • Core hopping to spread hot spots • Lower junction temperature
Special Purpose Hardware TCP Offload Engine Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines 2.23 mm X 3.54 mm, 260K transistors Special purpose HW—Best Mips/Watt
Special-purpose hardware more MIPS/mm² SIMD integer and FP instructions in several ISAs Valued Performance: SOC (System on a Chip) Heterogeneous Si, SiGe, GaAs Polylithic Si Monolithic Special HW Wireline Opto-Electronics CMOS RF CPU Memory RF Dense Memory
Multi-Threaded, Multi-Core Multi Threaded Era of Thread & Processor Level Parallelism Special Purpose HW Speculative, OOO Super Scalar 486 386 Era of Instruction Level Parallelism 286 8086 Era of Pipelined Architecture The Exponential Reward Multi-everywhere: MT, CMP
Summary—Delaying Forever • Gigascale transistor integration capacity will be available—Power and Energy are the barriers • Variations will be even more prominent—shift from Deterministic toProbabilistic design • Improve design efficiency • Multi—everywhere, & SOC valued performance • Exploit integration capacity to deliver performance in power/cost envelope