VLSI Design Challenges for Gigascale Integration

Shekhar Borkar, Intel Corp.
October 25, 2005

Outline
• Technology scaling challenges
• Circuit and design solutions
• Microarchitecture advances
• Multi-everywhere
• Summary

How do you get there?
Goal: 10 TIPS by 2015
[Chart: processor generations 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, Pentium® 4 Architecture]

Technology Scaling
[Figure: MOS transistor cross-section with gate, source, drain, and body; labeled dimensions Tox, Xj, D, Leff]
Scaling will continue, but with challenges!

Technology Outlook

The Leakage(s)…
[Figure: 90nm MOS transistor showing the gate, a 1.2 nm SiO2 layer, and the Si substrate; 50nm label]

Must Fit in Power Envelope
[Chart: Power (W) and power density (W/cm²) for a 10 mm die at 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm, split into active power, SD leakage, and SiO2 leakage; vertical axis 0 to 1400]
Technology, circuits, and architecture to constrain the power

Solutions
• Move away from frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
• Multi-threading
• Chip level multi-processing
• Throughput oriented designs
• Valued performance by higher level of integration
• Monolithic & Polylithic

Leakage Solutions
[Figure labels: planar transistor, tri-gate transistor, gate electrode, 3.0nm high-k, 1.2 nm SiO2, silicon substrate]
For a few generations, then what?

Active Power Reduction
Multiple Supply Voltages: [Figure: slow, fast, and slow logic blocks; the labels read high supply voltage and low supply voltage]
Throughput Oriented Designs:
• One logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1
• Two logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
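The throughput-oriented numbers on this slide can be checked with a short sketch. It assumes the usual switching-power relation P ∝ C·Vdd²·f, that switched capacitance scales with area, and that frequency scales linearly with Vdd (as the slide's Freq = 0.5 at Vdd = 0.5 implies). The function and variable names are illustrative, not from the presentation.

```python
def dynamic_power(capacitance, vdd, freq):
    """Relative switching power, P ∝ C * Vdd^2 * f (all inputs normalized)."""
    return capacitance * vdd ** 2 * freq

# Baseline: one logic block at full voltage and full frequency.
base_power = dynamic_power(capacitance=1.0, vdd=1.0, freq=1.0)
base_area = 1.0
base_throughput = 1 * 1.0  # number of blocks * frequency

# Throughput-oriented design: two blocks, each at half voltage and half frequency.
# Assumption: doubling the logic doubles both area and switched capacitance.
par_power = dynamic_power(capacitance=2.0, vdd=0.5, freq=0.5)
par_area = 2.0
par_throughput = 2 * 0.5

print(par_throughput, par_power, par_power / par_area)  # 1.0 0.25 0.125
```

Under these assumptions the sketch reproduces the slide: the same throughput at a quarter of the power and an eighth of the power density, paid for with twice the area.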

Leakage Control
[Figure: three leakage-control techniques applied to a logic block: body bias (Vbp +Ve, Vbn -Ve), stack effect (equal loading), and sleep transistor; the reduction labels read 5-10X, 2-1000X, and 2-10X]

Pipeline & Performance
[Chart: performance vs. relative frequency (pipelining), showing diminishing return]
[Chart: power efficiency vs. relative pipeline depth, with an optimum]
[Chart: process technology vs. relative frequency; sub-threshold leakage increases exponentially]
Optimum Frequency
• Maximum performance with
• Optimum pipeline depth
• Optimum frequency

Memory Latency
[Figure: CPU, cache (small, ~few clocks), memory (large, 50-100ns)]
Assume: 50ns memory latency
Cache miss hurts performance
Worse at higher frequency
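Why a miss is "worse at higher frequency" follows from simple arithmetic: a fixed latency in nanoseconds costs more clock cycles as the clock gets faster. A minimal sketch, using the slide's 50ns latency; the clock frequencies below are hypothetical values chosen only for illustration.

```python
def miss_penalty_cycles(latency_ns, freq_ghz):
    """Clock cycles spent waiting on memory: ns * (cycles per ns)."""
    return latency_ns * freq_ghz

MEMORY_LATENCY_NS = 50.0  # from the slide's "Assume: 50ns memory latency"

# Hypothetical core frequencies, not taken from the presentation.
for freq_ghz in (1.0, 2.0, 4.0):
    cycles = miss_penalty_cycles(MEMORY_LATENCY_NS, freq_ghz)
    print(f"{freq_ghz} GHz: {cycles:.0f} cycles per miss")
```

The same 50ns miss costs 50 cycles at 1 GHz but 200 cycles at 4 GHz, so each miss wastes more potential work as frequency rises.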

Increase on-die Memory
Large on-die memory provides increased data bandwidth and reduced latency
Hence, higher performance for much lower power

Multi-threading
Thermals & power delivery are designed for full HW utilization
[Figure: a single thread (ST) alternates between full HW utilization and waiting for memory; with multi-threading, threads MT1, MT2, and MT3 fill each other's memory waits]
Multi-threading improves performance without impacting thermals & power delivery

Single Core Power/Performance
Moore’s Law → more transistors for advanced architectures
Delivers higher peak performance
But… lower power efficiency

Chip Multi-Processing
[Figure: cores C1, C2, C3, C4 around a shared cache]
• Multi-core, each core multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Core hopping to spread hot spots
• Lower junction temperature

Dual Core: Rule of thumb
In the same process technology…
• Single core with cache: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
• Dual core with cache: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8

Multi-Core
[Figure: a large core with cache next to four small cores C1, C2, C3, C4 with cache; bar charts compare power and performance]
Small core vs. large core: Power = 1/4, Performance = 1/2
Multi-Core: power efficient
Better power and thermal management
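The power-efficiency argument can be made concrete with a short sketch. It reads the slide's "Power = 1/4, Performance = 1/2" as a small core relative to a large core, and it assumes a perfectly parallel workload so that the throughput of several small cores simply adds up; both are assumptions, and the names are illustrative.

```python
# Normalized to the large core (power = 1, performance = 1).
LARGE_CORE = {"power": 1.0, "perf": 1.0}
SMALL_CORE = {"power": 0.25, "perf": 0.5}  # slide: Power = 1/4, Performance = 1/2

def multi_core(core, count):
    """Total power and throughput for `count` identical cores.

    Assumption: the workload is perfectly parallel, so throughput adds linearly.
    """
    return {"power": core["power"] * count, "perf": core["perf"] * count}

four_small = multi_core(SMALL_CORE, 4)

print(four_small)                                # {'power': 1.0, 'perf': 2.0}
print(four_small["perf"] / four_small["power"])  # 2.0
```

In the same power budget as one large core, four small cores deliver twice the throughput in this idealized case. Real gains depend on how much of the software is parallel.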

Special Purpose Hardware
TCP/IP Offload Engine: 2.23 mm x 3.54 mm, 260K transistors
Opportunities:
• Network processing engines
• MPEG encode/decode engines, speech engines
Special purpose HW provides best MIPS/Watt

Performance Scaling
Amdahl’s Law: Parallel Speedup = 1/(Serial% + (1-Serial%)/N)
• Serial% = 6.7%: N = 16, N1/2 = 8 (16 cores, Perf = 8)
• Serial% = 20%: N = 6, N1/2 = 3 (6 cores, Perf = 3)
Parallel software is key to multi-core success
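The two Amdahl's Law data points on this slide can be reproduced with a few lines. This is a minimal sketch of the formula as written on the slide; the function name is illustrative.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Parallel speedup = 1 / (Serial% + (1 - Serial%) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# The slide's two cases: (serial fraction, number of cores).
for serial, cores in ((0.067, 16), (0.20, 6)):
    speedup = amdahl_speedup(serial, cores)
    print(f"serial={serial:.1%}, cores={cores}: speedup = {speedup:.2f}")
```

With 6.7% serial code, 16 cores give a speedup of about 8. With 20% serial code, 6 cores give a speedup of 3. In both cases half of the cores' potential is lost to the serial portion, which is why the slide stresses parallel software.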

From Multi to Many…
[Figure: die configurations with 12 cores, 24 cores, and 144 cores]
13mm, 100W, 48MB cache, 4B transistors, in 22nm

Future Multi-core Platform
[Figure: grid of general purpose cores (GP), special purpose HW (SP), and blocks labeled C, connected by an interconnect fabric]
Heterogeneous Multi-Core Platform

The New Era of Computing
[Chart: processor generations 8086, 286, 386, 486, Super Scalar, Speculative OOO, Multi Threaded, and Multi-Threaded Multi-Core with Special Purpose HW, grouped into three eras: Era of Pipelined Architecture, Era of Instruction Level Parallelism, Era of Thread & Processor Level Parallelism]
Multi-everywhere: MT, CMP

Summary
• Business as usual is not an option
• Performance at any cost is history
• Must make a Right Hand Turn (RHT)
• Move away from frequency alone
• Future µArchitectures and designs
• More memory (larger caches)
• Multi-threading
• Multi-processing
• Special purpose hardware
• Valued performance with higher integration
