Beyond Multi-core: The Dawning of the Era of Tera
Intel™ 80-core Tera-scale Research Processor

Jim Held, Intel Fellow & Director, Tera-scale Computing Research, Intel Corporation

Agenda
• Tera-scale Computing
  • Motivation
  • Platform Vision
• Teraflops Research Processor
  • Key Ingredients
  • Power Management
  • Performance
  • Programming
  • Key Learnings
• Work in progress
• Summary

Emerging Applications will demand Tera-scale performance
[Figure: four log-scale charts of computational demand (GFLOPs) and memory bandwidth (GB/s), one per application class. Ray-Tracing: Bar Scene, 1M pixel; Beetle Scene, 1M pixel. Computer Vision: 2 Webcam Video Surveillance; 4 Camera Body Tracking. Physical Simulation: CFD, 75x50x50, 10 fps; CFD, 150x100x100, 30 fps. Financial Analytics: ALM, 6 assets, 7 branches, 1 min; ALM, 6 assets, 10 branches, 1 sec. Image courtesy of Prof. Ron Fedkiw, Stanford University.]

A Tera-scale Platform Vision
[Diagram labels: Special Purpose Engines; Integrated IO devices; Cache (per core); Last Level Cache; Integrated Memory Controllers; Scalable On-die Interconnect Fabric; Off Die interconnect; High Bandwidth Memory; IO; Socket Inter-Connect]

Tera-scale Computing Research
• Applications – Identify, characterize & optimize
• Programming – Empower the mainstream
• System Software – Scalable services
• Memory Hierarchy – Feed the compute engine
• On-Die Interconnect – High bandwidth, low latency
• Cores – Power-efficient general & special function

Teraflops Research Processor
Goals:
• Deliver Tera-scale performance
  • Single precision TFLOP at desktop power
  • Frequency target 5GHz
  • Bi-section B/W order of Terabits/s
  • Link bandwidth in hundreds of GB/s
• Prototype two key technologies
  • On-die interconnect fabric
  • 3D stacked memory
• Develop a scalable design methodology
  • Tiled design approach
  • Mesochronous clocking
  • Power-aware capability
[Die photo labels: die 21.72mm x 12.64mm; single tile 1.5mm x 2.0mm; PLL; TAP; I/O Area]

Key Ingredients
• Special Purpose Cores
  • High performance Dual FPMACs
• 2D Mesh Interconnect
  • High bandwidth low latency router
  • Phase-tolerant tile to tile communication
• Mesochronous Clocking
  • Modular & scalable
  • Lower power
• Workload-aware Power Management
  • Sleep instructions and Packets
  • Chip voltage & freq. control
[Tile diagram labels: Crossbar Router, 40 GB/s, with Mesochronous Interfaces (MSINT); Processing Engine (PE) containing 2KB Data memory (DMEM), 3KB Inst. memory (IMEM), RIB, a 6-read, 4-write 32 entry RF, and two FPMACs (FPMAC0, FPMAC1), each with multiply, add and Normalize stages. Numeric bus-width labels (96, 64, 39, 32) are not placed.]
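To make the 2D mesh idea concrete, here is a minimal sketch of how a packet's hop count could be reasoned about. It rests on assumptions the slide does not state: an 8-row by 10-column arrangement of the 80 tiles, row-major tile numbering, and minimal dimension-ordered routing (row direction first, then column). The names `tile_coords`, `hop_count` and `xy_route` are hypothetical.

```python
# Hypothetical sketch: tile coordinates and hop counts on a 2D mesh.
# ASSUMPTIONS (not stated on the slide): an 8 x 10 layout, row-major
# numbering, and minimal dimension-ordered routing.

ROWS, COLS = 8, 10  # assumed layout; only the total of 80 tiles comes from the deck


def tile_coords(tile_id):
    """Map a tile number (0..79) to (row, col) under row-major numbering."""
    if not 0 <= tile_id < ROWS * COLS:
        raise ValueError("tile_id out of range")
    return divmod(tile_id, COLS)


def hop_count(src, dst):
    """Number of router-to-router links crossed on a minimal path."""
    (r0, c0), (r1, c1) = tile_coords(src), tile_coords(dst)
    return abs(r0 - r1) + abs(c0 - c1)


def xy_route(src, dst):
    """Tiles visited when moving along the row first, then along the column."""
    r, c = tile_coords(src)
    r1, c1 = tile_coords(dst)
    path = [src]
    while c != c1:          # travel within the row
        c += 1 if c1 > c else -1
        path.append(r * COLS + c)
    while r != r1:          # then travel within the column
        r += 1 if r1 > r else -1
        path.append(r * COLS + c)
    return path
```

Under these assumptions the longest minimal path, corner to corner, crosses 7 + 9 = 16 links, which is one reason a low-latency router matters in a mesh of this size.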

Fine Grain Power Management
• 21 sleep regions per tile (not all shown)
• Savings when a block sleeps:
  • FP Engine 1: 90% less power
  • FP Engine 2: 90% less power
  • Data Memory: 57% less power
  • Instruction Memory: 56% less power
  • Router: 10% less power (stays on to pass traffic)
• Dynamic sleep
  • STANDBY: memory retains data, 50% less power/tile
  • FULL SLEEP: memories fully off, 80% less power/tile
Scalable power to match workload demands
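The per-tile STANDBY and FULL SLEEP figures can be turned into a rough chip-level estimate. The sketch below is a simplification, not a model of the real chip: it assumes every awake tile draws the same power, that the 50% and 80% savings apply uniformly, and that nothing outside the tiles consumes power. The function name `relative_chip_power` is hypothetical.

```python
# Rough estimate of chip power relative to "all tiles awake".
# ASSUMPTIONS: identical tiles, uniform savings, no power outside the tiles.

STANDBY_SAVING = 0.50     # from the slide: 50% less power/tile
FULL_SLEEP_SAVING = 0.80  # from the slide: 80% less power/tile


def relative_chip_power(awake, standby, full_sleep):
    """Fraction of all-awake power drawn with the given mix of tile states."""
    total = awake + standby + full_sleep
    if total == 0:
        raise ValueError("at least one tile is required")
    drawn = (awake * 1.0
             + standby * (1.0 - STANDBY_SAVING)
             + full_sleep * (1.0 - FULL_SLEEP_SAVING))
    return drawn / total


if __name__ == "__main__":
    # Example mix: half of the 80 tiles awake, half fully asleep.
    print(f"{relative_chip_power(40, 0, 40):.2f} of all-awake power")
```

Because the router in a sleeping tile stays on to pass traffic, a real measurement would not follow this linear estimate exactly.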

Leakage Savings
• Dynamic sleep
  • Total measured idle power 13W → ~7W sleeping
• Regulated sleep for memory arrays
  • State retention
  • Memory Clamping
2X-5X leakage power reduction
[Tile diagram, est. breakdown @ 1.2V, 110C. Blocks shown: Crossbar Router with MSINTs; Processing Engine (PE) with 2KB Data memory (DMEM), 3KB Inst. Memory (IMEM), RIB, 6R, 4W 32 entry Register File, FPMAC0 and FPMAC1. Leakage labels, awake → sleeping: 42mW → 8mW (at the Crossbar Router), 15mW → 7.5mW, 63mW → 21mW; further labels 82mW, 70mW and 100mW, 20mW (twice). The block belonging to each remaining label is not recoverable.]

Router Power Management
• Activity based power management
• Individual port enables
  • Queues on sleep and clock gated when port idle
7X power reduction for idle routers
[Diagram label: 924mW]

Power Performance Results
[Figure: four measured plots for N=80 tiles at 80°C. Panel titles: Peak Performance (frequency in GHz versus Vcc in V), Average Power Efficiency (GFLOPS/W versus GFLOPS), Measured Power (active and leakage power in W versus Vcc in V), and Leakage (% of total power versus Vcc in V, sleep enabled versus sleep disabled, 2X). Labels that survive: 0.32 TFLOP at 1GHz, 1 TFLOP at 3.16GHz, 1.63 TFLOP at 5.1GHz, 1.81 TFLOP at 5.67GHz; peak efficiency 19.4 GFLOPS/W; 1TFLOP @ 97W; 1.33TFLOP @ 230W.]
Stencil: 1TFLOP @ 97W, 1.07V; All tiles awake/asleep
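The TFLOP figures labelled on the performance chart follow from simple arithmetic on the tile count and clock frequency. The sketch below assumes each FPMAC completes one multiply-accumulate per cycle and that this counts as two floating-point operations; the deck gives the 80 tiles and the dual FPMACs per tile but does not spell out the counting convention. The function name `peak_tflops` is hypothetical.

```python
# Peak single-precision throughput from tile count and clock frequency.
TILES = 80                      # from the deck
FPMACS_PER_TILE = 2             # "Dual FPMACs" per tile, from the deck
FLOPS_PER_FPMAC_PER_CYCLE = 2   # ASSUMPTION: one multiply + one add per cycle


def peak_tflops(freq_ghz, tiles=TILES):
    """Peak TFLOPS with `tiles` tiles active at `freq_ghz` GHz."""
    gflops = tiles * FPMACS_PER_TILE * FLOPS_PER_FPMAC_PER_CYCLE * freq_ghz
    return gflops / 1000.0


if __name__ == "__main__":
    for f in (1.0, 3.16, 5.1, 5.67):  # frequency points labelled on the chart
        print(f"{f:.2f} GHz -> {peak_tflops(f):.2f} TFLOPS")
```

With this counting convention the four frequencies give 0.32, 1.01, 1.63 and 1.81 TFLOPS, in line with the labels on the chart and the 1.01 Teraflops peak quoted under Key Learnings.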

Programming Results
• Not designed as a general Software Development Vehicle
  • Small memory
  • ISA limitations
  • Limited data ports
• Four kernels hand-coded to explore delivered performance:
  • Stencil: 2D heat diffusion equation
  • SGEMM for 100x100 matrices
  • Spreadsheet doing weighted sums
  • 64-point 2D FFT (with 64 tiles)
• Demonstrated utility and high scalability of message passing programming models on many-core
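For readers unfamiliar with the stencil kernel, the sketch below shows one explicit time step of a 2D heat diffusion update on a small grid. It is a plain, single-threaded illustration of the computation pattern only; it is not the hand-coded kernel, and how that kernel partitioned the grid across tiles or exchanged boundary values by message passing is not described in this deck. The name `heat_step` and the coefficient `alpha` are hypothetical.

```python
# One explicit time step of 2D heat diffusion using a 5-point stencil.
# Illustration only: not the hand-coded kernel that ran on the chip.


def heat_step(grid, alpha=0.25):
    """Return a new grid; boundary cells are held fixed.

    Each interior cell is updated from its four nearest neighbours.
    With alpha = 0.25 the new value is exactly the neighbours' average.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so boundary values carry over
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbours = (grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1])
            new[i][j] = grid[i][j] + alpha * (neighbours - 4.0 * grid[i][j])
    return new


if __name__ == "__main__":
    # Hot top edge, cold everywhere else.
    grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
    for _ in range(10):
        grid = heat_step(grid)
    for row in grid:
        print(" ".join(f"{v:.3f}" for v in row))
```

Each cell update needs only nearest-neighbour data, which is why this kind of kernel maps naturally onto tiles that exchange edge values with their mesh neighbours.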

Key Learnings
• Teraflop performance is possible within a mainstream power envelope
  • Peak of 1.01 Teraflops at 62 watts
  • Measured peak power efficiency of 19.4 GFLOPS/Watt
• Tile-based methodology fulfilled its promise
  • Design possible with ½ the team in ½ the time
  • Pre & Post-Si debug reduced – fully functional on A0
• Fine-grained power management pays off
  • Hierarchical clock gating and sleep transistor techniques
  • Up to 3X measured reduction in standby leakage power
  • Scalable low-power mesochronous clocking
• Excellent SW performance possible in this message-based architecture
  • Further improvements possible with additional instructions, larger memory, wider data ports
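Power efficiency is throughput divided by power, so the operating points quoted in this deck can be compared directly. The sketch below uses three of them: 1.01 Teraflops at 62 watts from this slide, and 1TFLOP @ 97W (stencil) and 1.33TFLOP @ 230W from the results slide. The function name `gflops_per_watt` is hypothetical.

```python
# Power efficiency (GFLOPS/W) for operating points quoted in the deck.


def gflops_per_watt(tflops, watts):
    """Convert a (TFLOPS, watts) operating point to GFLOPS per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tflops * 1000.0 / watts


OPERATING_POINTS = {
    "peak, 1.01 TFLOPS @ 62 W": (1.01, 62.0),
    "stencil, 1 TFLOP @ 97 W": (1.0, 97.0),
    "1.33 TFLOP @ 230 W": (1.33, 230.0),
}

if __name__ == "__main__":
    for label, (tflops, watts) in OPERATING_POINTS.items():
        print(f"{label}: {gflops_per_watt(tflops, watts):.1f} GFLOPS/W")
```

All three points come out below the measured peak of 19.4 GFLOPS/Watt, so that peak presumably corresponds to a different, lower-power operating point.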

Work in Progress: Stacked Memory Prototype
• 256 KB SRAM per core
• 4X C4 bump density
• 3200 thru-silicon vias
Memory access to match the compute power
[Diagram labels: 80-tile processor "Polaris" with Cu bumps (denser than C4 pitch); memory "Freya" with thru-silicon vias (C4 pitch); package]

Teraflops on IA Pat Gelsinger – Intel Developer Forum 2007

Summary
• Emerging applications will demand teraflop performance
• Teraflop performance is possible within a mainstream power envelope
• Intel is developing technologies to enable Tera-scale computing

Questions

Acknowledgments
Sriram Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Priya Iyer, Arvind Singh, Tiju Jacob, Shailendra Jain, Sriram Venkataraman, Yatin Hoskote, Nitin Borkar, Rob van der Wijngaart, Michael Frumkin and Tim Mattson
