Thousand Core Chips A Technology Perspective. Shekhar Borkar Intel Corp. June 7, 2007. Outline. Technology outlook Evolution of Multi—thousands of cores? How do you feed thousands of cores Future challenges: variations and reliability Resiliency Summary. Technology Outlook.
Thousand Core Chips: A Technology Perspective Shekhar Borkar Intel Corp. June 7, 2007
Outline • Technology outlook • Evolution of Multi—thousands of cores? • How do you feed thousands of cores • Future challenges: variations and reliability • Resiliency • Summary
Terascale Integration Capacity • Total transistors, 300mm2 die • ~100MB Cache • ~1.5B Logic Transistors • 100+B Transistor integration capacity
300mm2 Die Scaling Projections Freq scaling will slow down Vdd scaling will slow down Power will be too high
Why Multi-core? – Performance • Ever-increasing single cores yield diminishing performance in a power envelope • Multi-cores provide potential for near-linear performance speedup
Why Dual-core? – Power • Rule of thumb, in the same process technology… • One core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1 • Two cores: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8 [Figure: one core with cache beside two cores with cache]
From Dual to Multi-core • Small core relative to large core: Power = 1/4, Performance = 1/2 • Multi-Core: Power efficient • Better power and thermal management [Figure: a large core with cache beside four small cores (C1 to C4) with cache; bar charts of power and performance on a 1 to 4 scale]
Future Multi-core Platform • General Purpose Cores (GP) • Special Purpose HW (SP) • Interconnect fabric • Heterogeneous Multi-Core Platform (SOC) [Figure: grid of tiles labeled GP, SP, and C]
Fine Grain Power Management • Cores with critical tasks: Freq = f, at Vdd; TPT = 1, Power = 1 • Non-critical cores: Freq = f/2, at 0.7xVdd; TPT = 0.5, Power = 0.25 • Cores shut down: TPT = 0, Power = 0 [Figure: per-core supply (Vdd or 0.7xVdd) and frequency (f, f/2, or 0) selectors]
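The power figures for the three core states can be checked with a short sketch. It assumes the standard first-order CMOS relation that dynamic power scales with frequency times the square of the supply voltage, and it ignores leakage; the slide does not state which model it uses, and the function names are illustrative only.

```python
def relative_power(freq_scale, vdd_scale):
    """Relative dynamic power, assuming P is proportional to f * Vdd^2 (leakage ignored)."""
    return freq_scale * vdd_scale ** 2


def relative_throughput(freq_scale):
    """Relative throughput (TPT), assuming it tracks clock frequency."""
    return freq_scale


# (frequency scale, Vdd scale) for the three core states on the slide.
CORE_STATES = {
    "critical": (1.0, 1.0),      # Freq = f, at Vdd
    "non_critical": (0.5, 0.7),  # Freq = f/2, at 0.7xVdd
    "shut_down": (0.0, 0.0),     # core powered off
}

for name, (freq_scale, vdd_scale) in CORE_STATES.items():
    tpt = relative_throughput(freq_scale)
    power = relative_power(freq_scale, vdd_scale)
    print(f"{name}: TPT = {tpt:.2f}, Power = {power:.3f}")
```

Under this assumption a non-critical core draws 0.5 x 0.49 = 0.245 of full power, which the slide rounds to 0.25: half the throughput for about a quarter of the power.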
Performance Scaling • Amdahl’s Law: Parallel Speedup = 1/(Serial% + (1-Serial%)/N) • Serial% = 6.7%: N = 16, N½ = 8 (16 Cores, Perf = 8) • Serial% = 20%: N = 6, N½ = 3 (6 Cores, Perf = 3) • Parallel software key to Multi-core success
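The two data points on this slide follow directly from the Amdahl's Law formula it quotes. A minimal sketch (the function name is illustrative, not from the talk):

```python
def amdahl_speedup(serial_fraction, cores):
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# The two cases quoted on the slide.
for serial_fraction, cores in [(0.067, 16), (0.20, 6)]:
    speedup = amdahl_speedup(serial_fraction, cores)
    print(f"Serial = {serial_fraction:.1%}, N = {cores}: speedup = {speedup:.2f}")
```

With 6.7% serial code, 16 cores give a speedup of about 8; with 20% serial code, 6 cores give a speedup of 3. In both cases the cores deliver half of their ideal speedup, which is why the serial fraction has to shrink as core counts grow.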
From Multi to Many… • 13mm, 100W, 48MB Cache, 4B Transistors, in 22nm • Configurations shown: 12 Cores, 48 Cores, 144 Cores
From Many to Too Many… • 13mm, 100W, 96MB Cache, 8B Transistors, in 16nm • Configurations shown: 24 Cores, 96 Cores, 288 Cores
On Die Network Power 300mm2 Die • A careful balance of: • Throughput performance • Single thread performance (core size) • Core and network power
Observations • Scaling multi-core demands more parallelism every generation • Thread level, task level, application level • Many (or too many) cores do not always mean… • The highest performance • The highest MIPS/Watt • The lowest power • If on-die network power is significant, then power is even worse • Now software, too, must follow Moore’s Law
Memory BW Gap Busses have become wider to deliver necessary memory BW (10 to 30 GB/sec) Yet, memory BW is not enough Many Core System will demand 100 GB/sec memory BW How do you feed the beast?
IO Pins and Power • State of the art: 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec • 25 mW/Gb/sec = 25 Watts • Bus-width = 1,000/5 = 200, about 400 pins (differential) • Too many signal pins, too much power
High speed busses • Busses are transmission lines • L-R-C effects • Need signal termination • Signal processing consumes power [Figure: chip-to-chip bus, > 5mm] Solutions: • Reduce distance to << 5mm (R-C bus, <2mm) • Reduce signaling speed (~1Gb/sec) • Increase pins to deliver BW • 1-2 mW/Gbps Solution: • 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec • 2 mW/Gb/sec = 2 Watts • Bus-width = 1,000/1 = 1,000 pins
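The IO arithmetic for the two signaling options (state of the art versus a short, slow R-C bus) can be reproduced with a small sketch. The function and parameter names are illustrative. The 5 Gb/sec and 1 Gb/sec per-signal rates are read off the divisions "1,000/5" and "1,000/1" on the slides, and two pins per signal is assumed for the differential case to match "about 400 pins".

```python
import math


def io_budget(bandwidth_gbps, mw_per_gbps, gbps_per_signal, pins_per_signal):
    """Return (IO power in watts, signal pin count) for a target bandwidth."""
    power_watts = bandwidth_gbps * mw_per_gbps / 1000.0
    signals = math.ceil(bandwidth_gbps / gbps_per_signal)
    return power_watts, signals * pins_per_signal


# 100 GB/sec is treated as roughly 1 Tb/sec = 1,000 Gb/sec, as on the slides.
state_of_art = io_budget(1000, 25, 5, pins_per_signal=2)  # differential signaling
short_rc_bus = io_budget(1000, 2, 1, pins_per_signal=1)   # assumed single-ended

print("State of the art: %.0f W, %d pins" % state_of_art)
print("Short R-C bus:    %.0f W, %d pins" % short_rc_bus)
```

The trade is explicit: the short bus cuts IO power from 25 W to 2 W, but needs 1,000 pins instead of about 400, which is what motivates the packaging options on the following slides.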
Anatomy of a Silicon Chip [Figure labels: Heat-sink, Heat, Si Chip, Power, Signals, Package]
System in a Package [Figure: two Si chips on one package] • Limited pins: 10mm / 50 micron = 200 pins • Signal distance is large, ~10 mm: higher power • Complex package
DRAM on Top [Figure labels: Heat-sink, CPU, DRAM, Package] • Heat-sink Temp = 85°C • Junction Temp = 100+°C • High temp, hot spots • Not good for DRAM
DRAM at the Bottom [Figure labels: Heat-sink, DRAM, CPU, Package] • Power and IO signals go through DRAM to CPU • Thin DRAM die • Through DRAM vias • The most promising solution to feed the beast
Reliability • Extreme device variations (wider) • Soft Error FIT/Chip (Logic & Mem) • Burn-in may phase out…? • Time dependent device degradation
Implications to Reliability • Extreme variations (Static & Dynamic) will result in unreliable components • Impossible to design reliable systems as we know them today • Transient errors (Soft Errors) • Gradual errors (Variations) • Time dependent (Degradation) • Reliable systems with unreliable components: Resilient µArchitectures
Implications to Test • One-time-factory testing will be out • Burn-in to catch chip infant-mortality will not be practical • Test HW will be part of the design • Dynamically self-test, detect errors, reconfigure, & adapt
In a Nut-shell… • 100 Billion Transistors: 100 BT integration capacity • Billions unusable (variations) • Some will fail over time • Intermittent failures • Yet, deliver high performance in the power & cost envelope
Resiliency with Many-Core [Figure: grid of 16 cores] • Dynamic on-chip testing • Performance profiling • Cores in reserve (spares) • Binning strategy • Dynamic, fine grain, performance and power management • Coarse-grain redundancy checking • Dynamic error detection & reconfiguration • Decommission aging cores, swap with spares • Dynamically… • Self test & detect • Isolate errors • Confine • Reconfigure, and • Adapt
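The "decommission aging cores, swap with spares" idea can be illustrated with a toy model. This is a hypothetical sketch, not a mechanism described in the talk: `CorePool`, `reconfigure`, and the `self_test` callback are invented names, and a real design would do this in hardware and firmware rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CorePool:
    """Hypothetical bookkeeping for active, spare, and retired core IDs."""
    active: List[int]
    spares: List[int]
    retired: List[int] = field(default_factory=list)

    def reconfigure(self, self_test: Callable[[int], bool]) -> None:
        """Self-test each active core; retire failures and swap in spares."""
        still_active = []
        for core in self.active:
            if self_test(core):
                still_active.append(core)
                continue
            self.retired.append(core)  # isolate and confine the failing core
            if self.spares:
                # Reconfigure: a spare takes the failed core's slot.
                # The spare itself is not tested until the next pass.
                still_active.append(self.spares.pop(0))
        self.active = still_active


# Example: core 1 fails its self-test and is replaced by spare core 4.
pool = CorePool(active=[0, 1, 2, 3], spares=[4, 5])
pool.reconfigure(lambda core: core != 1)
print(pool.active, pool.spares, pool.retired)
```

Once the spares run out, the pool simply shrinks, so the chip loses throughput gradually instead of failing outright.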
Summary • Moore’s Law with Terascale integration capacity will allow integration of thousands of cores • Power continues to be the challenge • On-die network power could be significant • Optimize for power with the size of the core and the number of cores • 3D Memory technology needed to feed the beast • Many-cores will deliver the highest performance in the power envelope with resiliency