VLSI Design Challenges for Gigascale Integration
Shekhar Borkar, Intel Corp.
October 25, 2005
Outline
• Technology scaling challenges
• Circuit and design solutions
• Microarchitecture advances
• Multi-everywhere
• Summary
How do you get there?
Goal: 10 TIPS by 2015
[Figure: processor generations 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, Pentium® 4 Architecture]
Technology Scaling
[Figure: MOS transistor cross-sections showing GATE, SOURCE, DRAIN, and BODY, with dimensions Tox, Xj, D, and Leff]
Scaling will continue, but with challenges!
The Leakage(s)…
[Figure: 90nm MOS transistor; labels: Gate, 1.2 nm SiO2, Si, 50nm]
Must Fit in Power Envelope
[Figure: Power (W) and Power Density (W/cm2) for a 10 mm die at 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm, split into Active, SD Lkg, and SiO2 Lkg; vertical axis 0 to 1400]
Technology, Circuits, and Architecture to constrain the power
Solutions
• Move away from Frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
• Multi-threading
• Chip level multi-processing
• Throughput oriented designs
• Valued performance by higher level of integration
• Monolithic & Polylithic
Leakage Solutions
[Figure: Planar Transistor and Tri-gate Transistor; labels: Gate electrode, Gate, 3.0nm High-k, 1.2 nm SiO2, Silicon substrate]
For a few generations, then what?
Active Power Reduction
Multiple Supply Voltages
[Figure: Slow, Fast, and Slow logic blocks fed from a High Supply Voltage and a Low Supply Voltage]
Throughput Oriented Designs
[Figure: one Logic Block at Vdd compared with two Logic Blocks at Vdd/2]
One Logic Block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1
Two Logic Blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
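The throughput-oriented numbers on this slide can be reproduced with a short calculation. This is a minimal sketch, assuming dynamic power scales as Vdd^2 x Freq with equal switched capacitance per logic block and leakage ignored; the function name `design_point` is illustrative and not from the talk.

```python
# Assumption: relative dynamic power = copies * Vdd^2 * Freq (leakage ignored).
# All quantities are normalized to one logic block at Vdd = 1, Freq = 1.

def design_point(vdd: float, freq: float, copies: int) -> dict:
    """Return normalized throughput, power, area, and power density."""
    power = copies * vdd ** 2 * freq   # each copy switches at freq under vdd
    area = float(copies)               # each copy occupies one unit of area
    return {
        "throughput": copies * freq,   # copies work in parallel
        "power": power,
        "area": area,
        "power_density": power / area,
    }

baseline = design_point(vdd=1.0, freq=1.0, copies=1)
parallel = design_point(vdd=0.5, freq=0.5, copies=2)

print("baseline:", baseline)
print("parallel:", parallel)
```

Under these assumptions the two-block design keeps the same throughput at a quarter of the power and one eighth of the power density, which agrees with the slide's figures.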
Leakage Control
• Body Bias
• Stack Effect
• Sleep Transistor
[Figure: circuit sketches for each technique; labels: Vbp, Vbn, Vdd, +Ve, -Ve, Logic Block, Equal Loading]
Reduction factors shown: 5-10X, 2-1000X, 2-10X
Optimum Frequency
[Figure, Process Technology: Power and Efficiency vs. Relative Frequency (1 to 10); Sub-threshold Leakage increases exponentially]
[Figure, Pipeline Depth: Power and Efficiency vs. Relative Pipeline Depth (1 to 10); Optimum marked]
[Figure, Pipeline & Performance: Performance vs. Relative Frequency (Pipelining) (1 to 10); Diminishing Return]
• Maximum performance with
• Optimum pipeline depth
• Optimum frequency
Memory Latency
[Figure: CPU, Cache (small, ~few clocks), Memory (large, 50-100ns)]
Assume: 50ns memory latency
Cache miss hurts performance
Worse at higher frequency
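To see why a miss is "worse at higher frequency", it helps to convert the fixed 50ns latency into clock cycles. The sketch below is illustrative only: the clock frequencies are hypothetical values chosen for the example, and `miss_penalty_cycles` is not a name from the talk.

```python
def miss_penalty_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Clock cycles the core waits for one memory access."""
    return latency_ns * freq_ghz  # ns * (cycles per ns)

MEMORY_LATENCY_NS = 50.0  # from the slide's "Assume: 50ns memory latency"

for freq_ghz in (1.0, 2.0, 4.0):  # hypothetical clock frequencies
    cycles = miss_penalty_cycles(MEMORY_LATENCY_NS, freq_ghz)
    print(f"{freq_ghz:.0f} GHz: {cycles:.0f} cycles lost per cache miss")
```

The latency in nanoseconds stays constant, so the number of cycles lost per miss grows in proportion to the clock frequency.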
Increase on-die Memory
Large on-die memory provides:
Increased Data Bandwidth & Reduced Latency
Hence, higher performance for much lower power
Multi-threading
Thermals & Power Delivery designed for full HW utilization
[Figure: HW utilization timelines. Single Thread (ST): Full HW Utilization, then Wait for Mem. Multi-Threading: MT1, MT2, and MT3 run while other threads wait for memory]
Multi-threading improves performance without impacting thermals & power delivery
Single Core Power/Performance
Moore’s Law: more transistors for advanced architectures
Delivers higher peak performance
But… lower power efficiency
Chip Multi-Processing
[Figure: cores C1, C2, C3, C4 around a shared Cache]
• Multi-core, each core Multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Core hopping to spread hot spots
• Lower junction temperature
Dual Core
Rule of thumb, in the same process technology…
[Figure: one Core with Cache compared with two Cores with Cache]
Single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
Dual core: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8
Multi-Core
[Figure: a Large Core with Cache next to small cores C1, C2, C3, C4 with Cache; bar charts of Power and Performance on a 1 to 4 scale]
Small Core relative to Large Core: Power = 1/4, Performance = 1/2
Multi-Core: Power efficient
Better power and thermal management
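The power-efficiency argument can be checked with simple arithmetic. This sketch reads the slide's labels as the small core's power and performance relative to the large core, and it assumes throughput adds up linearly across cores (a fully parallel workload), which the slide does not state. The names `LARGE_CORE`, `SMALL_CORE`, and `chip_totals` are illustrative.

```python
# Assumption: values are normalized to the large core (power = 1, perf = 1),
# and performance sums linearly over cores (fully parallel workload).
LARGE_CORE = {"power": 1.0, "perf": 1.0}
SMALL_CORE = {"power": 0.25, "perf": 0.5}   # Power = 1/4, Performance = 1/2

def chip_totals(core: dict, count: int) -> dict:
    """Total power, total performance, and performance per unit power."""
    power = core["power"] * count
    perf = core["perf"] * count
    return {"power": power, "perf": perf, "perf_per_watt": perf / power}

one_large = chip_totals(LARGE_CORE, 1)
four_small = chip_totals(SMALL_CORE, 4)

print("one large core:  ", one_large)
print("four small cores:", four_small)
```

Under these assumptions four small cores fit in the power budget of one large core while delivering twice the throughput. Any serial portion of the workload reduces that advantage.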
Special Purpose Hardware
TCP/IP Offload Engine: 2.23 mm x 3.54 mm, 260K transistors
Opportunities: Network processing engines, MPEG Encode/Decode engines, Speech engines
Special purpose HW provides best Mips/Watt
Performance Scaling
Amdahl’s Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)
Serial% = 6.7%: N = 16, N1/2 = 8 (16 Cores, Perf = 8)
Serial% = 20%: N = 6, N1/2 = 3 (6 Cores, Perf = 3)
Parallel software key to Multi-core success
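The two data points on this slide follow directly from Amdahl's Law as written above. The sketch below evaluates the formula for both cases; the function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# The two cases quoted on the slide: (Serial%, N)
cases = [(0.067, 16), (0.20, 6)]

for serial_fraction, n_cores in cases:
    speedup = amdahl_speedup(serial_fraction, n_cores)
    print(f"Serial% = {serial_fraction:.1%}, N = {n_cores}: speedup = {speedup:.2f}")
```

With 6.7% serial code, 16 cores give a speedup close to 8; with 20% serial code, 6 cores give a speedup of 3. In both cases the chip delivers only about half of its core count in performance.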
From Multi to Many…
13mm, 100W, 48MB Cache, 4B Transistors, in 22nm
[Figure: die layouts with 12 Cores, 24 Cores, and 144 Cores]
Future Multi-core Platform
Heterogeneous Multi-Core Platform
[Figure: grid of blocks labeled GP (General Purpose Cores), SP (Special Purpose HW), and C, joined by an Interconnect fabric]
The New Era of Computing
[Figure: progression of architecture eras]
Era of Pipelined Architecture: 8086, 286, 386, 486
Era of Instruction Level Parallelism: Super Scalar; Speculative, OOO
Era of Thread & Processor Level Parallelism: Multi Threaded; Multi-Threaded, Multi-Core; Special Purpose HW
Multi-everywhere: MT, CMP
Summary
• Business as usual is not an option
• Performance at any cost is history
• Must make a Right Hand Turn (RHT)
• Move away from frequency alone
• Future µArchitectures and designs
• More memory (larger caches)
• Multi-threading
• Multi-processing
• Special purpose hardware
• Valued performance with higher integration