Runnemede: Disruptive Technologies for UHPC

Runnemede : Disruptive Technologies for UHPC. John Gustafson Intel Labs HPC User Forum – Houston 2011. The battle lines are drawn…. “We’re going to try to make the entire exascale machine cache-coherent .” —Bill Dally, Nvidia. “Caches are for morons.” —Shekhar Borkar, Intel.

Presentation Transcript


  1. Runnemede: Disruptive Technologies for UHPC. John Gustafson, Intel Labs. HPC User Forum – Houston 2011

  2. The battle lines are drawn… “We’re going to try to make the entire exascale machine cache-coherent.” —Bill Dally, Nvidia “Caches are for morons.” —Shekhar Borkar, Intel

  3. Intel’s UHPC Approach • Design test chips with the idea of maximizing learning. • Very different from producing product roadmap processor designs. • Going from Peta to Exa is nothing like the last few 1000x increases…

  4. Building with Today’s Technology: TFLOP Machine today
     • Compute: 200W (200pJ per FLOP)
     • Memory: 150W (0.1B/FLOP @ 1.5nJ per Byte)
     • Com: 100W (100pJ com per FLOP)
     • Disk: 100W (10TB disk @ 1TB/disk @ 10W)
     • Decode and control, translations, power supply losses, cooling, etc.: 4450W
     • Total: 5KW
     KW Tera, MW Peta, GW Exa?
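The component figures on this slide follow from multiplying an energy cost per operation by the operation rate. A minimal sketch of that arithmetic, using only the slide's numbers; the function and variable names are illustrative and not from the talk, and the linear scaling in `scaled_total` is an assumption that simply replicates today's technology.

```python
# Power = energy per operation x operations per second.
# All constants are the slide's figures; the names are illustrative.

TFLOPS = 1e12  # FLOP/s for a "TFLOP machine"


def watts(joules_per_flop: float, flops: float = TFLOPS) -> float:
    """Sustained power for a given energy cost per FLOP."""
    return joules_per_flop * flops


compute_w = watts(200e-12)       # 200 pJ per FLOP
memory_w = watts(0.1 * 1.5e-9)   # 0.1 B/FLOP at 1.5 nJ per byte
com_w = watts(100e-12)           # 100 pJ of communication per FLOP
disk_w = 10 * 10.0               # 10 disks of 1 TB at 10 W each
overhead_w = 4450.0              # decode/control, translations, supply losses, cooling

total_w = compute_w + memory_w + com_w + disk_w + overhead_w


def scaled_total(flops: float) -> float:
    """Total power if today's machine were simply replicated (assumed linear)."""
    return total_w * (flops / TFLOPS)


if __name__ == "__main__":
    for label, rate in [("Tera", 1e12), ("Peta", 1e15), ("Exa", 1e18)]:
        print(f"{label}: {scaled_total(rate):.3g} W")
```

Under these assumptions the total grows by a factor of 1000 at each step, which is the "KW Tera, MW Peta, GW Exa?" progression the slide asks about.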

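  5. The Power & Energy Challenge: TFLOP Machine today vs. TFLOP Machine then, with Exa Technology
     • Compute: 200W today, 5W then
     • Memory: 150W today, 2W then
     • Com: 100W today, ~5W then
     • Disk: 100W today, ~3W then
     • Other: 4550W today, 5W then
     • Total: 5KW today, ~20W then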

  6. Scaling Assumptions
     • 65 nm Core + Local Memory: DP FP Add, Multiply, Integer Core, RF, Router 5mm2 (50%); Memory 0.35MB 5mm2 (50%). Total: 10 mm2, 3 GHz, 6 GF, 1.8 W
     • 8 nm Core + Local Memory: DP FP Add, Multiply, Integer Core, RF, Router 0.17mm2 (50%); Memory 0.35MB 0.17mm2 (50%). Total: 0.34 mm2 (~0.6mm on a side), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W
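The two design points above can be compared directly. The sketch below derives efficiency (GF per watt), compute density (GF per mm2), and the number of such cores needed to reach a target rate. Only the area, clock, GF, and watt figures come from the slide; the dictionary layout, function names, and the exaflop core count are illustrative arithmetic, not claims made in the talk.

```python
import math

# Slide figures for one core plus its local memory; the layout is illustrative.
CORES = {
    "65 nm": {"area_mm2": 10.0, "ghz": 3.0, "gflops": 6.0, "watts": (1.8, 1.8)},
    "8 nm": {"area_mm2": 0.34, "ghz": 4.6, "gflops": 9.2, "watts": (0.24, 0.46)},
}


def gflops_per_watt(core: dict) -> tuple:
    """Return (worst case, best case) efficiency over the stated power range."""
    low_w, high_w = core["watts"]
    return core["gflops"] / high_w, core["gflops"] / low_w


def gflops_per_mm2(core: dict) -> float:
    """Compute density of one core plus local memory."""
    return core["gflops"] / core["area_mm2"]


def flops_per_cycle(core: dict) -> float:
    """GF divided by GHz gives floating-point operations per clock."""
    return core["gflops"] / core["ghz"]


def cores_needed(target_flops: float, core: dict) -> int:
    """Cores required to reach target_flops at peak rate (assumes perfect scaling)."""
    return math.ceil(target_flops / (core["gflops"] * 1e9))


if __name__ == "__main__":
    for name, core in CORES.items():
        worst, best = gflops_per_watt(core)
        print(name, f"{worst:.1f} to {best:.1f} GF/W,",
              f"{gflops_per_mm2(core):.1f} GF/mm2,",
              f"{cores_needed(1e18, core):,} cores for 1 EF peak")
```

Both design points work out to two floating-point operations per cycle, so the per-core speedup comes from clock rate alone, while the efficiency gain comes from the lower power per core.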

  7. Near Threshold Logic. [Plot, 65nm CMOS, 50°C: Energy Efficiency (GOPS/Watt) and Active Leakage Power (mW) versus Supply Voltage (V), 0.2 to 1.4 V. Annotations: 9.6X, 320mV, Subthreshold Region.] H. Kaul et al, 16.6: ISSCC08

  8. Revise DRAM Architecture. Energy cost today: ~150 pJ/bit
     • Traditional DRAM: Activates many pages. Lots of reads and writes (refresh). Small amount of read data is used. Requires small number of pins.
     • New DRAM architecture: Activates few pages. Read and write (refresh) what is needed. All read data is used. Requires large number of I/Os (3D).
     [Diagram labels: RAS Addr, CAS Addr, Page]

  9. Data Locality
     • Core-to-core communication on the chip: ~10 pJ per Byte
     • Chip-to-chip communication: ~100 pJ per Byte
     • Chip-to-memory communication: ~1.5 nJ per Byte, ~150 pJ per Byte
     Data movement is expensive: keep it local. (1) Core to core, (2) Chip-to-chip, (3) Memory
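To make the cost of data movement concrete, the sketch below totals the energy for a given traffic mix using the per-byte figures above. It uses the ~1.5 nJ per byte figure for memory; the two traffic mixes are hypothetical examples, not measurements from the talk, and the names are illustrative.

```python
PJ = 1e-12  # joules per picojoule

# Per-byte energy costs from the slide.
ENERGY_PER_BYTE = {
    "core_to_core": 10 * PJ,
    "chip_to_chip": 100 * PJ,
    "chip_to_memory": 1500 * PJ,  # ~1.5 nJ per byte
}


def movement_energy(bytes_moved: dict) -> float:
    """Joules spent moving data, given bytes moved at each level."""
    return sum(ENERGY_PER_BYTE[level] * count for level, count in bytes_moved.items())


# Hypothetical traffic mixes for 1 GB of data movement (assumed, for illustration).
GB = 1e9
all_from_memory = {"chip_to_memory": GB}
mostly_local = {
    "core_to_core": 0.90 * GB,
    "chip_to_chip": 0.09 * GB,
    "chip_to_memory": 0.01 * GB,
}

if __name__ == "__main__":
    remote_j = movement_energy(all_from_memory)
    local_j = movement_energy(mostly_local)
    print(f"all from memory: {remote_j:.3f} J")
    print(f"mostly local:    {local_j:.3f} J")
    print(f"ratio:           {remote_j / local_j:.1f}x")
```

With these assumed mixes, keeping 90% of the traffic on chip reduces movement energy by well over an order of magnitude, which is the point of the slide's ordering.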

  10. Disruptive Approach to Faults • We tend to assume that execution faults (soft errors, hard errors) are rare. And it’s a valid speculation. Currently. • Soon, we will need much more paranoia in hardware designs.

  11. Road to Unreliability? Resiliency will be the cornerstone

  12. Resiliency: minimal overhead for resiliency
     • Functions: Error detection, Fault isolation, Fault confinement, Reconfiguration, Recovery & Adapt
     • Layers: Applications; System Software; Programming system; Microcode, Platform; Microarchitecture; Circuit & Design

  13. Execution Model and Codelets
     • Codelet: code that can be executed non-preemptively with an “event-driven” model
     • Shared memory model based on LC (Location Consistency, a generalized single-assignment model [GaoSarkar1980])
     [Diagram layers: Programming Models/Systems (Rich); Sea of Codelets; Run Time System; Hardware Abstraction; Cores, Advanced Hardware Monitoring, Net, Peripherals/Devices]
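The codelet idea can be illustrated with a toy scheduler: a codelet becomes ready when all the events it waits on have fired, and once started it runs to completion without preemption. This is a minimal single-threaded sketch, not the Runnemede runtime; the `Codelet` class, `run_codelets` function, and the example graph are hypothetical. Each result is written exactly once, which echoes single assignment but does not implement Location Consistency.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Codelet:
    name: str
    body: Callable[[Dict[str, object]], object]
    needs: List[str] = field(default_factory=list)  # codelets whose results it waits on


def run_codelets(codelets: List[Codelet]) -> Dict[str, object]:
    """Fire each codelet once all of its inputs are available."""
    by_name = {c.name: c for c in codelets}
    pending = {c.name: set(c.needs) for c in codelets}  # inputs still missing
    consumers: Dict[str, List[str]] = {}
    for c in codelets:
        for dep in set(c.needs):
            consumers.setdefault(dep, []).append(c.name)

    ready = deque(c.name for c in codelets if not c.needs)
    results: Dict[str, object] = {}
    while ready:
        name = ready.popleft()
        codelet = by_name[name]
        inputs = {dep: results[dep] for dep in codelet.needs}
        results[name] = codelet.body(inputs)  # runs to completion, written once
        # Completion is the "event" that may enable downstream codelets.
        for consumer in consumers.get(name, []):
            pending[consumer].discard(name)
            if not pending[consumer]:
                ready.append(consumer)
    return results


# Hypothetical four-codelet graph: two producers, a sum, and a square.
graph = [
    Codelet("a", lambda _: 2),
    Codelet("b", lambda _: 3),
    Codelet("sum", lambda x: x["a"] + x["b"], needs=["a", "b"]),
    Codelet("sq", lambda x: x["sum"] ** 2, needs=["sum"]),
]

if __name__ == "__main__":
    print(run_codelets(graph))
```

In a real system the ready queue would be drained by many cores at once; the sketch keeps one queue so that the firing rule stays visible.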

  14. Summary • Voltage scaling to reduce power and energy • Explodes parallelism • Cost of communication vs computation—critical balance • Resiliency to combat side-effects and unreliability • Programming system for extreme parallelism • Application driven, HW/SW co-design approach • Self-awareness & execution model to harmonize
