Challenges for High Performance Processors

Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo

What’s the challenge?
• Our primary goal: performance
• How?
  • increase the number and/or operating frequency of functional units, AND
  • supply the functional units with sufficient data (bandwidth)
• Problems:
  • Memory Wall: system performance is limited by poor memory performance
  • Power Wall: power consumption is approaching the cooling limit
France-Japan PAAP Workshop

Memory Wall Problem
• Performance improvement
  • CPU: 55% / year
  • DRAM: 7% / year
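To make the two growth rates concrete, the following minimal sketch compounds them and reports how far CPU performance pulls ahead of DRAM performance. It assumes both rates stay constant over the period; the function name `relative_gap` is made up for this illustration.

```python
def relative_gap(years, cpu_rate=0.55, dram_rate=0.07):
    """Return the factor by which the CPU/DRAM performance ratio grows.

    Assumes constant yearly improvement rates (55% for CPU, 7% for DRAM,
    as quoted on the slide).
    """
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap has grown x{relative_gap(years):.1f}")
```

Under these assumptions the gap widens by roughly 45% per year, which is why it compounds so quickly.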

Example of Memory Wall: performance of a 2 GHz Pentium 4 for a[i]=b[i]+c[i]
[Figure: performance in the L1-hit, L2-hit, and cache-miss regions; annotation: 1/6]
• non-blocking cache & out-of-order issue → lack of effective memory throughput

Recap: Memory Wall Problem
• growing gap between processor and memory speed
• performance is limited by memory ability in High Performance Computing (HPC)
  • long access latency of main memory
  • lack of throughput of main memory
→ making full use of wide-bandwidth local (on-chip) memory is indispensable
• on-chip memory space is a valuable resource
  • not enough for HPC
  • should exploit data locality
• cf. Itanium2/Montecito: huge L3 cache (12MB x 2)

Does cache work well in HPC?
• works well in many cases, but is not the best for HPC
• data location and replacement are decided by hardware
  × unfortunate line conflicts occur although most data accesses are regular
    ex. data used only once flush out other useful data
• transfer size between cache and off-chip memory is fixed
  • for consecutive data: a larger transfer size is preferable
  • for non-consecutive data: large line transfers incur unnecessary data transfer → waste of bandwidth
• Most HPC applications exhibit regularity in data access, which caches sometimes fail to exploit.
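The bandwidth waste for non-consecutive data can be estimated with a small model. The sketch below is an illustration, not taken from the talk: it assumes 8-byte elements, line-aligned arrays, and a stride that divides the line evenly when it is shorter than a line. The function name `useful_fraction` is hypothetical.

```python
def useful_fraction(line_bytes, elem_bytes, stride_elems):
    """Fraction of each fetched cache line that a strided sweep actually uses.

    Assumes aligned data and a fixed line size (the cache cannot shrink
    the transfer for sparse accesses).
    """
    stride_bytes = elem_bytes * stride_elems
    if stride_bytes >= line_bytes:
        # Only one element is used from every fetched line.
        return elem_bytes / line_bytes
    # Several elements share a line; a stride of 1 uses the whole line.
    return elem_bytes / stride_bytes


if __name__ == "__main__":
    for line in (32, 128):
        for stride in (1, 4, 16):
            frac = useful_fraction(line, 8, stride)
            print(f"line={line:3d}B stride={stride:2d}: {frac:.1%} of traffic is useful")
```

With consecutive data the larger line wastes nothing, while for a stride of 16 elements the larger line carries four times as much unused data per access as the smaller one.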

SCIMA (Software Controlled Integrated Memory Architecture) [kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)
• addressable SCM (Software Controllable Memory) in addition to ordinary cache
  • a part of logical address space
  • no inclusive relations with cache
• SCM and cache are reconfigurable at the granularity of way
[Figure: overview of SCIMA; labels: ALU, FPU, register, reconfigurable SCM/Cache, NIA, Memory (DRAM), Network]
[Figure: address space]

Data Transfer Instruction
• load/store
  • register ↔ SCM/Cache
• page-load/page-store
  • SCM ↔ Off-Chip Memory
  • large granularity transfer
    • wider effective bandwidth by reducing latency stall
  • block stride transfer
    • avoid unnecessary data transfer
    • more effective utilization of On-Chip Memory
[Figure labels: load/store, line transfer, page-load/page-store, New, Register, SCM, Cache, Off-Chip Memory]

Strategy of Software Control
• SCM must be controlled by software
• arrays are classified into 6 groups by consecutiveness and reusability

Consecutiveness | not-reusable                   | reusable
consecutive     | (1) use SCM as a stream buffer | (4) reserve SCM for reused data
stride          | (2) use SCM as a stream buffer | (5) reserve SCM for reused data
irregular       | (3) not use SCM                | (6) reserve SCM for reused data

• first, apply (1) (2): allocate small stream buffer in SCM
• second, apply (4) (5) and (6): allocate rest area of SCM for reused data
• prototype of semi-automatic compiler: users specify hints on reusability of data arrays
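The six-group classification on this slide can be read as a lookup on two array properties. The following is a minimal sketch of that reading, not the authors' compiler: the names `Access`, `scm_policy`, and `allocation_order` are hypothetical, and the group numbering follows the slide as far as it can be recovered.

```python
from enum import Enum


class Access(Enum):
    CONSECUTIVE = "consecutive"
    STRIDE = "stride"
    IRREGULAR = "irregular"


def scm_policy(access, reusable):
    """Map (consecutiveness, reusability) to the SCM usage named on the slide."""
    if reusable:
        return "reserve SCM for reused data"  # groups (4), (5), (6)
    if access is Access.IRREGULAR:
        return "not use SCM"  # group (3)
    return "use SCM as a stream buffer"  # groups (1), (2)


def allocation_order(arrays):
    """Order arrays for SCM allocation: stream buffers first, then reused data.

    `arrays` is a list of (name, access, reusable) tuples; arrays in
    group (3) get no SCM space and are left out.
    """
    streams = [n for n, a, r in arrays if not r and a is not Access.IRREGULAR]
    reused = [n for n, a, r in arrays if r]
    return streams + reused


if __name__ == "__main__":
    arrays = [
        ("matrix_values", Access.CONSECUTIVE, False),
        ("vector_x", Access.IRREGULAR, True),
        ("index_list", Access.IRREGULAR, False),
    ]
    for name, access, reusable in arrays:
        print(name, "->", scm_policy(access, reusable))
    print("allocation order:", allocation_order(arrays))
```

In the semi-automatic compiler described on the slide, the `reusable` flag would come from user hints; here it is simply passed in by hand.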

Results of Memory Traffic
• benchmark programs: CG, FT, QCD
• assumption
  • cache model: cache size = 64KB (4way), SCM size = 0KB
  • SCIMA mode: cache size = 16KB (1way), SCM size = 48KB
  • total # of ways: 4
  • line size: 32B, 128B
• unnecessary memory traffic is suppressed
• memory traffic decreases by 1% - 61% in SCIMA, due to full exploitation of data reusability

Results of Performance
• assumption
  • load/store latency: 2 cycles
  • bus throughput: 4B/cycle
  • memory latency: 40 cycles
• breakdown of normalized execution time
  • CPU busy time
  • latency stall: elapsed time due to memory latency
  • throughput stall: elapsed time due to lack of throughput
• 1.3-2.5 times faster than cache
  • latency stall reduction by large granularity of data transfer
  • throughput stall reduction by suppressing unnecessary data transfer
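A rough way to see why large-granularity transfers reduce latency stall is to model one transfer as a fixed latency plus the time on the bus. The sketch below uses the memory latency (40 cycles) and bus throughput (4B/cycle) stated on this slide; the assumption that each transfer pays the full latency once with no overlap, and the 4096-byte page size, are illustrative choices rather than figures from the talk.

```python
MEMORY_LATENCY_CYCLES = 40   # from the slide's assumptions
BUS_BYTES_PER_CYCLE = 4      # from the slide's assumptions


def transfer_cycles(size_bytes):
    """Cycles for one off-chip transfer: fixed latency plus bus occupancy."""
    return MEMORY_LATENCY_CYCLES + size_bytes / BUS_BYTES_PER_CYCLE


def effective_bandwidth(size_bytes):
    """Bytes per cycle actually delivered when transfers are not overlapped."""
    return size_bytes / transfer_cycles(size_bytes)


if __name__ == "__main__":
    # 32B and 128B are the evaluated line sizes; 4096B is a hypothetical page.
    for size in (32, 128, 4096):
        print(f"{size:5d}B transfer: {effective_bandwidth(size):.2f} B/cycle")
```

As the transfer grows, the fixed latency is amortized over more bytes and the effective bandwidth approaches the 4B/cycle bus limit.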

Power Wall
• Next focus: power consumption of processors
• Is there any room for power reduction?
• If yes, then how to reduce it?
[Figure: Trends of Heat Density; marked data point: Itanium (130W)]

Observation (1): Moore’s Law
• Number of transistors: doubles every 18 months

Observation (2) – frequency –
• Frequency: doubles every 3 years
• Number of transistors: doubles every 18 months
• Amount of switching on a chip: 8 times every 3 years (2x from frequency, 4x from transistor count)

Observation (3) – performance –
• # of switching on a chip: 8 times every 3 years
• effective performance: 4 times every 3 years
  • “microprocessor performance improved 55% per year”, from “Computer Architecture: A Quantitative Approach” by J. Hennessy and D. Patterson, Morgan Kaufmann
• unnecessary switching = chance of power reduction: doubles every 3 years
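The arithmetic behind these three observations can be checked in a few lines. This sketch only restates the slide's figures; treating switching activity as the product of frequency growth and transistor-count growth is the simplification the slides themselves use, and the variable names are made up here.

```python
YEARS = 3

# Observation (2): frequency doubles every 3 years.
frequency_growth = 2 ** (YEARS / 3)

# Observation (1): transistor count doubles every 18 months.
transistor_growth = 2 ** (YEARS * 12 / 18)

# Switching on a chip grows with both.
switching_growth = frequency_growth * transistor_growth

# Observation (3): performance improves 55% per year.
performance_growth = 1.55 ** YEARS

# Switching that does not turn into performance.
unnecessary_growth = switching_growth / performance_growth

if __name__ == "__main__":
    print(f"switching   x{switching_growth:.1f} per {YEARS} years")
    print(f"performance x{performance_growth:.2f} per {YEARS} years")
    print(f"unnecessary x{unnecessary_growth:.2f} per {YEARS} years")
```

The 55% yearly improvement compounds to a little under 4x over three years, so the slide's "4 times" and "doubles" are rounded values.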

An Evidence of the Observation: unnecessary switching = x2 / 3 years
[Zyuban00] @ ISLPED’00
• energy per instruction increases to exploit ILP for higher performance
  • at functional units: no increase
  • at issue window, register file: increase
  • flushed instructions due to incorrect prediction: increase
[Figure: access energy per instruction (nJ) vs. issue width (4, 6, 8, 10, 12); legend: rename map table, bypass mechanism, load/store window, issue window, register file, functional units, flushed instruction; annotations: committed instruction, waste of power]

Registers
• Registers consume a lot of power
  • roughly speaking, power ∝ (num. of registers) × (num. of ports)
  • high-performance wide-issue superscalar processors → more registers, more read/write ports
• Open Question
  • in HPC, what is the best way to use many functional units (or accelerators) from the perspective of register file design?
    • scalar registers with SIMD operations
    • vector registers with vector operations
    • ………
• Personal Impression
  • vector registers are accessed in a well-organized fashion, so it is easy to reduce “num. of ports” by a sub-banking technique
  • can vector operations make good use of local on-chip memory? (at least, traditional vector processors never can!)

Dual Core helps …
Rule of thumb: in the same process technology…
• Single core with cache: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
• Dual core with cache: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8

Multi-Core helps more …
• Small core compared with large core: Power = 1/4, Performance = 1/2
• no need for wider instruction issue
• Multi-Core: power efficient; better power and thermal management
[Figure: power and performance bars (scale 1-4) for a large core with cache versus four small cores C1-C4 with cache]

Leakage problem
[Figure: Power (W) and Power Density (W/cm2), 0 to 1400, for a 10 mm die at 90nm, 65nm, 45nm, 32nm, 22nm, 16nm; legend: Active, SD Lkg, SiO2 Lkg [Borkar-MICRO05]]
[Figure: leakage current; labels: VDD, Input 0, ON, OFF [Borkar-MICRO05]]
• IEEE Computer Magazine
• How to attack the leakage problem?

Introduction of our research
• Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
  • 5-year project started October 2006
  • supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
• Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation of various design levels including circuit, architecture, and system software
• Members:
  • Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
  • Prof. M. Namiki (Tokyo Univ of Agri. Tech): OS
  • Prof. H. Amano (Keio Univ): architecture & F/E design
  • Prof. K. Usami (Shibaura I.T.): circuit & B/E design

How to reduce leakage: Power Gating
• Focusing on Power Gating for reducing leakage
  • Inserting a Power Switch (PS) between VDD and GND
  • Turning off the PS when sleeping
[Figure: logic gates between VDD and GND; with power gating, logic gates between VDD and Virtual GND, and a Power Switch controlled by Sleep between Virtual GND and GND]

Run-time Power Gating (RTPG)
• control the power switch at run time
  • Coarse grain: mobile processor by Renesas (independent power domains for BB module, MPEG module, ..)
  • Fine grain (our target): power gating within a module
[Figure labels: Circuit A, Circuit B, Circuit C, Power Switch, Sleep Control ckt]

Fine-grain Run-time Power Gating
• Longer sleep time is preferable
  • Leakage savings
  • Overheads: power penalties for wakeup
• Evaluation through a real chip has not been reported
• Test vehicle: 32b x 32b Multiplier
  • Either or both operands (input data) are likely to be less than 16 bits wide
  • Circuit portions that compute the upper bits of the product need not operate → they waste leakage power
• By detecting 0s in the upper 16 bits of the operands, power-gate the internal multiplier array
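The zero-detection idea and the sleep-time trade-off on this slide can be sketched in software as follows. This is an illustration only: it treats operands as unsigned 32-bit values (signed operands would need sign-extension detection instead), it does not model how the test chip maps operand widths to power domains, and all function names and the example energy numbers are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def upper_half_is_zero(x):
    """True when the upper 16 bits of an unsigned 32-bit operand are all 0."""
    return ((x & MASK32) >> 16) == 0


def narrow_operand_count(a, b):
    """Number of multiplier operands (0, 1, or 2) that fit in 16 bits."""
    return int(upper_half_is_zero(a)) + int(upper_half_is_zero(b))


def break_even_cycles(wakeup_energy, leakage_saved_per_cycle):
    """Sleep length at which saved leakage equals the wakeup overhead.

    Both arguments are in the same (arbitrary) energy unit; sleeping for
    fewer cycles than this costs more energy than it saves.
    """
    return wakeup_energy / leakage_saved_per_cycle


if __name__ == "__main__":
    print(narrow_operand_count(0x00001234, 0x0000FFFF))  # both operands narrow
    print(narrow_operand_count(0x12345678, 0x00000003))  # one operand narrow
    print(break_even_cycles(wakeup_energy=100.0, leakage_saved_per_cycle=4.0))
```

The break-even function makes explicit why longer sleep times are preferable: the wakeup penalty is paid once, while the leakage saving accumulates every cycle the block stays off.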

Test chip "Pinnacle"
• real measurement: exhibits good power reduction
[Figure: power dissipation (mW), 2.0 to 4.0, at 25C, 85C, and 125C, with FG-RTPG not applied and applied; Sequence 1 (No sleep), Sequence 2 (Domain H sleeps), Sequence 3 (Domain H and M sleep)]
Current Status
• Designing a pipelined microprocessor with FG-RTPG
• Compiler (instruction scheduler) to increase sleep time

Low Power Linux Scheduler based on statistical modeling
• Co-optimization of System Software and Architecture
• Objective:
  • a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint
• How to find the lowest frequency that satisfies the performance constraint?
  • it depends on hardware and program characteristics
  • performance ratio is different from frequency ratio
  • hard to find the answer directly → modeling by statistical analysis of hardware events
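The frequency-selection problem on this slide can be illustrated with a deliberately simple stand-in for the statistical model. The sketch below assumes execution time splits into a frequency-dependent CPU part and a frequency-independent memory part; this two-term model, the frequency list, and the function names are assumptions for illustration, not the model built in the project.

```python
def perf_ratio(freq_ghz, max_ghz, cpu_gcycles, mem_seconds):
    """Predicted performance at `freq_ghz` relative to running at `max_ghz`.

    Assumed model: time = cpu_gcycles / frequency + mem_seconds, where the
    memory time does not scale with CPU frequency.
    """
    t_max = cpu_gcycles / max_ghz + mem_seconds
    t = cpu_gcycles / freq_ghz + mem_seconds
    return t_max / t


def lowest_frequency(freqs_ghz, threshold, cpu_gcycles, mem_seconds):
    """Pick the lowest frequency whose predicted performance meets `threshold`."""
    max_ghz = max(freqs_ghz)
    for f in sorted(freqs_ghz):
        if perf_ratio(f, max_ghz, cpu_gcycles, mem_seconds) >= threshold:
            return f
    return max_ghz  # only reached if the threshold exceeds 1.0


if __name__ == "__main__":
    freqs = [1.0, 1.5, 2.0]  # hypothetical available frequencies in GHz
    # CPU-bound process: performance tracks frequency closely.
    print(lowest_frequency(freqs, 0.9, cpu_gcycles=2.0, mem_seconds=0.0))
    # Memory-bound process: halving the frequency costs little performance.
    print(lowest_frequency(freqs, 0.9, cpu_gcycles=0.4, mem_seconds=2.0))
```

The two example processes show why the performance ratio differs from the frequency ratio: the memory-bound process tolerates a much lower frequency under the same constraint.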

Evaluation result
• Platform: Pentium M 760 (Max 2.00 GHz, FSB 533 MHz)
• Specified threshold: black dotted line
• Performance is within the threshold in all cases except mgrid
  • 3-7% below the threshold
• An accurate model is obtained
• A Linux scheduler using this model has been developed
France-Japan PAAP Workshop, May 8, 2007

Summary
• Challenges for high performance processors:
  • Memory Wall and Power Wall
• One solution to the memory wall
  • make good use of on-chip memory with software controllability
• Solutions to the power wall
  • many cores will relax the problem, but
  • leakage current is becoming a big problem
  • new research/approaches are required
  • our project “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs” was introduced
