Runnemede: Disruptive Technologies for UHPC
John Gustafson, Intel Labs
HPC User Forum – Houston 2011
The battle lines are drawn…
“We’re going to try to make the entire exascale machine cache-coherent.” - Bill Dally, Nvidia
“Caches are for morons.” - Shekhar Borkar, Intel
Intel’s UHPC Approach
• Design test chips with the idea of maximizing learning.
• Very different from producing product roadmap processor designs.
• Going from Peta to Exa is nothing like the last few 1000x increases…
Building with Today’s Technology
TFLOP machine today: 5 kW total
• Compute: 200 W (200 pJ per FLOP)
• Memory: 150 W (0.1 B/FLOP @ 1.5 nJ per byte)
• Com: 100 W (100 pJ of communication per FLOP)
• Disk: 100 W (10 TB of disk @ 1 TB/disk @ 10 W)
• Decode and control, translations, power supply losses, cooling, etc.: 4450 W
KW Tera, MW Peta, GW Exa?
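The per-FLOP energies on this slide turn into watts once they are multiplied by the sustained rate. A minimal Python sketch of that arithmetic follows, assuming the machine sustains exactly 1 TFLOP/s; the variable names are illustrative and not from the talk.

```python
# Energy per operation (joules) times operations per second gives watts.
FLOPS = 1e12  # assumption: 1 TFLOP/s sustained

compute_w = 200e-12 * FLOPS        # 200 pJ per FLOP
memory_w = 0.1 * 1.5e-9 * FLOPS    # 0.1 B/FLOP at 1.5 nJ per byte
com_w = 100e-12 * FLOPS            # 100 pJ of communication per FLOP
disk_w = 10 * 10.0                 # 10 disks of 1 TB at 10 W each
overhead_w = 4450.0                # decode/control, supply losses, cooling, etc.

total_w = compute_w + memory_w + com_w + disk_w + overhead_w
useful_fraction = compute_w / total_w  # share of power spent on arithmetic

print(f"total: {total_w:.0f} W, compute share: {useful_fraction:.1%}")
```

With these figures, arithmetic accounts for only a few percent of the total, which is the point the slide's 5 kW stack makes visually.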
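The Power & Energy Challenge
TFLOP machine today (5 kW) vs. TFLOP machine then, with exa technology (~20 W)
• Other (control, power supply losses, cooling, etc.): 4550 W today, 5 W then
• Disk: 100 W today, ~3 W then
• Com: 100 W today, ~5 W then
• Memory: 150 W today, 2 W then
• Compute: 200 W today, 5 W then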
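To see what the two TFLOP budgets imply at larger scales, the sketch below multiplies each by the number of TFLOP/s in a peta- or exa-scale machine. It assumes power grows linearly with the FLOP rate and reads the ~20 W figure as the total for the exa-technology TFLOP machine; both are assumptions, and the names are hypothetical.

```python
# Assumption: power scales linearly with the sustained FLOP rate.
TODAY_W_PER_TFLOP = 5000.0  # 5 kW TFLOP machine today
THEN_W_PER_TFLOP = 20.0     # ~20 W, read here as the exa-technology total

SCALES = {"tera": 1, "peta": 1_000, "exa": 1_000_000}  # TFLOP/s per machine


def machine_watts(watts_per_tflop: float, scale: str) -> float:
    """Total power of a machine at the given scale."""
    return watts_per_tflop * SCALES[scale]


exa_today_w = machine_watts(TODAY_W_PER_TFLOP, "exa")  # gigawatt range
exa_then_w = machine_watts(THEN_W_PER_TFLOP, "exa")    # megawatt range
improvement = TODAY_W_PER_TFLOP / THEN_W_PER_TFLOP

print(exa_today_w / 1e9, "GW today vs", exa_then_w / 1e6, "MW then")
```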
Scaling Assumptions
65 nm core + local memory
• Logic (DP FP add and multiply, integer core, RF, router): 5 mm² (50%)
• Memory, 0.35 MB: 5 mm² (50%)
• Total: 10 mm², 3 GHz, 6 GF, 1.8 W
8 nm core + local memory
• Logic (DP FP add and multiply, integer core, RF, router): 0.17 mm² (50%)
• Memory, 0.35 MB: 0.17 mm² (50%)
• Total: 0.34 mm² (~0.6 mm on a side), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W
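The per-tile numbers above can be turned into efficiency and core-count estimates. The following Python sketch assumes each core retires one add and one multiply per cycle (which matches 3 GHz giving 6 GF and 4.6 GHz giving 9.2 GF); the `CoreTile` class and the exaflop estimate are illustrative, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class CoreTile:
    """One core plus its local memory, as described on the slide."""
    name: str
    area_mm2: float
    ghz: float
    watts: float
    flops_per_cycle: int = 2  # assumption: one FP add and one FP multiply per cycle

    @property
    def gflops(self) -> float:
        return self.ghz * self.flops_per_cycle

    @property
    def gflops_per_watt(self) -> float:
        return self.gflops / self.watts


tile_65nm = CoreTile("65 nm", area_mm2=10.0, ghz=3.0, watts=1.8)
tile_8nm_low = CoreTile("8 nm (low power)", area_mm2=0.34, ghz=4.6, watts=0.24)
tile_8nm_high = CoreTile("8 nm (high power)", area_mm2=0.34, ghz=4.6, watts=0.46)

# Cores needed for 1 EFLOP/s of peak, ignoring memory, network, and overheads.
cores_for_exaflop = 1e18 / (tile_8nm_low.gflops * 1e9)
core_power_mw = (cores_for_exaflop * tile_8nm_low.watts / 1e6,
                 cores_for_exaflop * tile_8nm_high.watts / 1e6)

for tile in (tile_65nm, tile_8nm_low, tile_8nm_high):
    print(f"{tile.name}: {tile.gflops:.1f} GF, {tile.gflops_per_watt:.1f} GF/W")
```

Under these assumptions, an exaflop of peak needs on the order of a hundred million such cores, which is why the summary slide says that voltage scaling "explodes parallelism".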
Near Threshold Logic
[Figure: energy efficiency (GOPS/Watt, log scale) and active leakage power (mW) versus supply voltage (0.2 to 1.4 V), 65nm CMOS, 50°C. Annotations: subthreshold region, 320mV, 9.6X.]
H. Kaul et al, 16.6: ISSCC08
Revise DRAM Architecture
Energy cost today: ~150 pJ/bit
Traditional DRAM (RAS address, then CAS address)
• Activates many pages
• Lots of reads and writes (refresh)
• Small amount of read data is used
• Requires small number of pins
New DRAM architecture
• Activates few pages
• Read and write (refresh) what is needed
• All read data is used
• Requires large number of I/Os (3D)
Data Locality
Data movement is expensive: keep it local
(1) Core-to-core communication on the chip: ~10 pJ per byte
(2) Chip-to-chip communication: ~100 pJ per byte
(3) Chip-to-memory communication: ~1.5 nJ per byte, ~150 pJ per byte
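A rough way to compare these costs is to ask how many watts each level would draw if all traffic went through it. The Python sketch below assumes 0.1 bytes moved per FLOP at 1 TFLOP/s (the ratio used on the earlier power slide) and keeps both chip-to-memory figures under neutral, hypothetical names, since the slide does not say which applies when.

```python
# Approximate energy per byte moved, in joules, from the slide.
PJ = 1e-12
ENERGY_PER_BYTE_J = {
    "core_to_core": 10 * PJ,
    "chip_to_chip": 100 * PJ,
    "chip_to_memory_low": 150 * PJ,    # the smaller of the two memory figures
    "chip_to_memory_high": 1500 * PJ,  # the larger figure, 1.5 nJ per byte
}


def movement_power_watts(level: str, bytes_per_flop: float, flops_per_second: float) -> float:
    """Power spent moving data if every byte travels at the given level."""
    return ENERGY_PER_BYTE_J[level] * bytes_per_flop * flops_per_second


# Assumption: 0.1 B/FLOP at 1 TFLOP/s sustained.
power_by_level = {
    level: movement_power_watts(level, bytes_per_flop=0.1, flops_per_second=1e12)
    for level in ENERGY_PER_BYTE_J
}

for level, watts in power_by_level.items():
    print(f"{level}: {watts:.0f} W")
```

The same traffic costs roughly two orders of magnitude more when it goes to memory than when it stays between cores on the chip, which is the case for keeping data local.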
Disruptive Approach to Faults
• We tend to assume that execution faults (soft errors, hard errors) are rare. And it’s a valid speculation. Currently.
• Soon, we will need much more paranoia in hardware designs.
Road to Unreliability?
Resiliency will be the cornerstone
Resiliency
Minimal overhead for resiliency
• Functions: error detection, fault isolation, fault confinement, reconfiguration, recovery & adapt
• Layers: applications, system software, programming system, microcode and platform, microarchitecture, circuit & design
Execution Model and Codelets
• Codelet: code that can be executed non-preemptively with an “event-driven” model
• Shared memory model based on LC (Location Consistency, a generalized single-assignment model [GaoSarkar1980])
Diagram labels: Sea of Codelets, Programming Models/Systems (Rich), Run Time System, Cores, Hardware Abstraction, Advanced Hardware Monitoring, Net, Peripherals/Devices
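The codelet idea, code that runs to completion once the events it waits on have arrived, can be illustrated with a toy scheduler. This is a minimal single-threaded Python sketch under assumed names (`Codelet`, `Runtime`, `then`, `signal`); it is not the Runnemede runtime or its API, and it does not model Location Consistency.

```python
from collections import deque


class Codelet:
    """A unit of work that fires once all of its input slots are filled."""

    def __init__(self, name, fn, num_inputs):
        self.name = name
        self.fn = fn
        self.pending = num_inputs  # events still outstanding
        self.inputs = {}
        self.consumers = []        # (codelet, slot) pairs fed by our result

    def then(self, consumer, slot):
        """Route this codelet's result into a slot of another codelet."""
        self.consumers.append((consumer, slot))


class Runtime:
    """Toy event-driven scheduler: runs ready codelets without preemption."""

    def __init__(self):
        self.ready = deque()
        self.trace = []

    def spawn(self, codelet):
        """Enqueue a codelet that has no inputs to wait for."""
        self.ready.append(codelet)

    def signal(self, codelet, slot, value):
        """Deliver one event; the codelet becomes ready when none are pending."""
        if slot in codelet.inputs:
            raise ValueError(f"slot {slot!r} of {codelet.name} written twice")
        codelet.inputs[slot] = value
        codelet.pending -= 1
        if codelet.pending == 0:
            self.ready.append(codelet)

    def run(self):
        results = {}
        while self.ready:
            codelet = self.ready.popleft()
            out = codelet.fn(**codelet.inputs)  # runs to completion
            self.trace.append(codelet.name)
            results[codelet.name] = out
            for consumer, slot in codelet.consumers:
                self.signal(consumer, slot, out)
        return results


# Two producers feed one consumer, which fires only after both have finished.
a = Codelet("a", lambda: 2, num_inputs=0)
b = Codelet("b", lambda: 3, num_inputs=0)
add = Codelet("add", lambda x, y: x + y, num_inputs=2)
a.then(add, "x")
b.then(add, "y")

rt = Runtime()
rt.spawn(a)
rt.spawn(b)
results = rt.run()
print(results["add"], rt.trace)
```

Because a codelet is only enqueued when its last input arrives, it never blocks mid-execution, so the scheduler needs no preemption or context switching.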
Summary
• Voltage scaling to reduce power and energy
• Explodes parallelism
• Cost of communication vs computation: a critical balance
• Resiliency to combat side-effects and unreliability
• Programming system for extreme parallelism
• Application driven, HW/SW co-design approach
• Self-awareness & execution model to harmonize