Challenges for High Performance Processors
Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo
What’s the challenge?
• Our primary goal: performance
• How?
  • increase the number and/or operating frequency of functional units, AND
  • supply the functional units with sufficient data (bandwidth)
• Problems:
  • Memory Wall: system performance is limited by poor memory performance
  • Power Wall: power consumption is approaching the cooling limit
France-Japan PAAP Workshop
Memory Wall Problem
• Performance improvement
  • CPU: 55% / year
  • DRAM: 7% / year
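The two growth rates above compound, which is why the gap widens so quickly. A minimal sketch, assuming both rates stay constant and both performances are normalized to 1 in year 0 (the function name is illustrative, not from the slides):

```python
CPU_GROWTH = 0.55   # per-year CPU improvement quoted on the slide
DRAM_GROWTH = 0.07  # per-year DRAM improvement quoted on the slide


def performance_gap(years: int) -> float:
    """CPU-to-DRAM performance ratio after `years`, normalized to 1 at year 0."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


if __name__ == "__main__":
    for y in (0, 1, 5, 10):
        print(f"after {y:2d} years: gap x{performance_gap(y):.1f}")
```

Under these assumptions the gap grows by roughly 45% per year and passes a factor of 40 after a decade.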
Example of Memory Wall: Performance of a 2GHz Pentium 4 for a[i]=b[i]+c[i]
[Figure: performance for L1 hit, L2 hit, and cache miss; annotation: 1/6]
• non-blocking cache & out-of-order issue
• lack of effective memory throughput
Recap: Memory Wall Problem
• growing gap between processor and memory speed
• performance is limited by memory ability in High Performance Computing (HPC)
  • long access latency of main memory
  • lack of throughput of main memory
• making full use of wide-bandwidth local memory (on-chip memory) is indispensable
  • Itanium2/Montecito: huge L3 cache (12MB x 2)
• on-chip memory space is a valuable resource
  • not enough for HPC
  • should exploit data locality
Does cache work well in HPC?
• works well in many cases, but is not the best for HPC
• data location and replacement are decided by hardware
  × unfortunate line conflicts occur although most data accesses are regular
    ex. data used only once flushes out other useful data
• transfer size between cache and off-chip memory is fixed
  • for consecutive data: a larger transfer size is preferable
  • for non-consecutive data: a large line transfer incurs unnecessary data transfer, which wastes bandwidth
• Most HPC applications exhibit regularity in data access, which a cache often fails to exploit.
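The bandwidth waste for non-consecutive data can be illustrated by counting how many of the transferred bytes are actually used. This is a minimal sketch, assuming a cold cache, line-aligned data, elements that never straddle a line, and no reuse; the function name and parameters are illustrative only:

```python
def useful_fraction(n_elems: int, elem_bytes: int,
                    stride_bytes: int, line_bytes: int) -> float:
    """Fraction of bytes fetched from off-chip memory that the program uses.

    Element i lives at byte address i * stride_bytes. Every distinct cache
    line touched costs one full line transfer.
    """
    lines_touched = {(i * stride_bytes) // line_bytes for i in range(n_elems)}
    transferred = len(lines_touched) * line_bytes
    return (n_elems * elem_bytes) / transferred


if __name__ == "__main__":
    # 8-byte elements, consecutive (stride 8) versus a 64-byte stride
    for line in (32, 128):
        print(line,
              useful_fraction(1024, 8, 8, line),
              useful_fraction(1024, 8, 64, line))
```

With consecutive 8-byte elements every fetched byte is used for either line size, while with a 64-byte stride a 32B line delivers 25% useful data and a 128B line only 12.5%.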
SCIMA (Software Controlled Integrated Memory Architecture) [kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)
• addressable SCM (Software Controllable Memory) in addition to ordinary cache
  • a part of the logical address space
  • no inclusive relation with the cache
• SCM and cache are reconfigurable at the granularity of a way
[Figure: overview of SCIMA (ALU, FPU, registers, reconfigurable SCM/cache, NIA, memory (DRAM), network) and its address space]
Data Transfer Instructions
• load/store
  • register ↔ SCM/Cache
• page-load/page-store (new)
  • SCM ↔ off-chip memory
  • large-granularity transfer: wider effective bandwidth by reducing latency stalls
  • block stride transfer: avoids unnecessary data transfer and uses on-chip memory more effectively
[Figure: register, SCM, cache, and off-chip memory connected by load/store, line transfer, and page-load/page-store]
Strategy of Software Control
• SCM must be controlled by software
• arrays are classified into 6 groups by consecutiveness and reusability

  Consecutiveness | not-reusable                   | reusable
  consecutive     | (1) use SCM as a stream buffer | (4) reserve SCM for reused data
  stride          | (2) use SCM as a stream buffer | (5) reserve SCM for reused data
  irregular       | (3) not use SCM                | (6) reserve SCM for reused data

• first, apply (1) (2): allocate a small stream buffer in SCM
• second, apply (4) (5) and (6): allocate the rest of SCM for reused data
• prototype of a semi-automatic compiler: users specify hints on the reusability of data arrays
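The six-group classification and the two-step allocation order can be sketched as a small lookup. This is only an illustration of the decision table on this slide, not the prototype compiler; the names `Access`, `scm_strategy`, and `allocation_order` are hypothetical:

```python
from enum import Enum
from typing import Dict, List, Tuple


class Access(Enum):
    CONSECUTIVE = "consecutive"
    STRIDE = "stride"
    IRREGULAR = "irregular"


def scm_strategy(access: Access, reusable: bool) -> Tuple[int, str]:
    """Return (group number, strategy) for one array, following the slide."""
    if reusable:
        group = {Access.CONSECUTIVE: 4, Access.STRIDE: 5, Access.IRREGULAR: 6}[access]
        return group, "reserve SCM for reused data"
    if access is Access.IRREGULAR:
        return 3, "not use SCM"
    group = 1 if access is Access.CONSECUTIVE else 2
    return group, "use SCM as a stream buffer"


def allocation_order(arrays: Dict[str, Tuple[Access, bool]]) -> List[Tuple[str, int]]:
    """Order arrays as on the slide: groups (1)(2) first, then (4)(5)(6).

    Arrays in group (3) do not use SCM and are left out.
    """
    groups = [(name, scm_strategy(a, r)[0]) for name, (a, r) in arrays.items()]
    first = [g for g in groups if g[1] in (1, 2)]
    second = [g for g in groups if g[1] in (4, 5, 6)]
    return first + second


if __name__ == "__main__":
    hints = {
        "x": (Access.CONSECUTIVE, True),
        "idx": (Access.IRREGULAR, False),
        "y": (Access.STRIDE, False),
    }
    print(allocation_order(hints))
```

The reusability flags play the role of the user-supplied hints; how the access pattern itself is detected is outside this sketch.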
Results of Memory Traffic
• benchmark programs: CG, FT, QCD
• assumptions
  • cache model: cache size = 64KB (4-way), SCM size = 0KB
  • SCIMA mode: cache size = 16KB (1-way), SCM size = 48KB
  • total number of ways: 4
  • line size: 32B, 128B
• memory traffic decreases by 1% - 61% in SCIMA
  • unnecessary memory traffic is suppressed
  • due to full exploitation of data reusability
Results of Performance
• assumptions: load/store latency: 2 cycles; bus throughput: 4B/cycle; memory latency: 40 cycles
• normalized execution time is broken down into
  • CPU busy time
  • latency stall: elapsed time due to memory latency
  • throughput stall: elapsed time due to lack of throughput
• 1.3-2.5 times faster than cache
  • latency stall reduced by large-granularity data transfer
  • throughput stall reduced by suppressing unnecessary data transfer
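The effect of transfer granularity on latency stall can be sketched with the memory latency and bus throughput assumed on this slide. This is a deliberately simplified model, not the simulator behind the results: it assumes every transfer is serialized, pays the full memory latency once, and overlaps with nothing else. The function name is illustrative:

```python
import math

MEMORY_LATENCY = 40   # cycles per off-chip transfer (from the slide)
BUS_BYTES_PER_CYCLE = 4  # bus throughput (from the slide)


def transfer_cycles(total_bytes: int, granule_bytes: int) -> dict:
    """Cycles to move `total_bytes` in chunks of `granule_bytes`.

    Assumption: transfers are serialized and never overlapped, so each
    chunk pays the full memory latency once.
    """
    n_transfers = math.ceil(total_bytes / granule_bytes)
    latency = n_transfers * MEMORY_LATENCY
    throughput = math.ceil(total_bytes / BUS_BYTES_PER_CYCLE)
    return {"latency": latency, "throughput": throughput,
            "total": latency + throughput}


if __name__ == "__main__":
    for granule in (32, 128, 4096):
        print(granule, transfer_cycles(4096, granule))
```

Moving 4KB of consecutive data costs the same bus cycles in every case, but the latency part shrinks from 5120 cycles with 32B lines to 40 cycles with a single 4KB transfer, which is the intuition behind page-load/page-store.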
Power Wall
[Figure: trends of heat density; marked data point: Itanium (130W)]
• Next focus: power consumption of processors
• Is there any room for power reduction?
• If yes, then how to reduce it?
Observation (1): Moore’s Law
• Number of transistors: doubles every 18 months
Observation (2): frequency
• Frequency: doubles every 3 years
• Number of transistors: doubles every 18 months
• Amount of switching on a chip: 8 times every 3 years
Observation (3): performance
• amount of switching on a chip: 8 times every 3 years
• effective performance: 4 times every 3 years
  • “microprocessor performance improved 55% per year”, from “Computer Architecture: A Quantitative Approach” by J. Hennessy and D. Patterson, Morgan Kaufmann
• unnecessary switching = chance of power reduction: doubles every 3 years
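The arithmetic behind the three observations can be checked in a few lines. A minimal sketch using only the rates quoted on the slides; the variable names are illustrative:

```python
YEARS = 3

# Observation (1): transistor count doubles every 18 months
transistor_growth = 2 ** (YEARS * 12 / 18)      # x4 in 3 years
# Observation (2): frequency doubles every 3 years
frequency_growth = 2 ** (YEARS / 3)             # x2 in 3 years
# Switching on a chip scales with transistors times frequency
switching_growth = transistor_growth * frequency_growth

# Observation (3): performance improves 55% per year
performance_growth = 1.55 ** YEARS              # about x3.7, rounded to x4 on the slide

# Switching that does not translate into performance
unnecessary_switching = switching_growth / performance_growth

if __name__ == "__main__":
    print(f"switching   x{switching_growth:.0f}")
    print(f"performance x{performance_growth:.2f}")
    print(f"unnecessary x{unnecessary_switching:.2f}")
```

Switching grows eightfold while performance grows a little under fourfold, so the ratio is slightly above 2, matching the slide's "doubles every 3 years".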
Evidence for the Observation: unnecessary switching = x2 / 3 years
[Zyuban00] @ ISLPED’00
• energy per instruction increases when ILP is exploited for higher performance
  • at functional units: no increase
  • at issue window, register file: increase
  • flushed instructions due to incorrect prediction: increase (waste of power)
[Figure: access energy per committed instruction (nJ) versus issue width (4, 6, 8, 10, 12), broken down into rename map table, bypass mechanism, load/store window, issue window, register file, functional units, and flushed instructions]
Registers
• Registers consume a lot of power
  • roughly speaking, power ∝ (num. of registers) x (num. of ports)
  • high-performance wide-issue superscalar processors need more registers and more read/write ports
• Open question
  • in HPC, what is the best way to use many functional units (or accelerators) from the perspective of register file design?
    • scalar registers with SIMD operations
    • vector registers with vector operations
    • ...
• Personal impression
  • vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by sub-banking
  • can vector operations make good use of local on-chip memory? (at least, traditional vector processors never can!)
Dual Core helps ...
Rule of thumb, in the same process technology:

  Configuration       | Voltage | Freq | Area | Power | Perf
  single core + cache | 1       | 1    | 1    | 1     | 1
  dual core + cache   | -15%    | -15% | 2    | 1     | ~1.8
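One way to arrive at numbers close to those in the rule of thumb is a linear scaling rule. The slide does not state the rule it uses, so the coefficients below are an assumption: a 1% voltage (and frequency) reduction is taken to cost about 0.66% performance and save about 3% power per core. Perfect scaling across the two cores is also assumed:

```python
# Assumed linear coefficients (not given on the slide)
POWER_PCT_PER_VOLT_PCT = 3.0
PERF_PCT_PER_VOLT_PCT = 0.66


def scaled_core(voltage_drop_pct: float) -> tuple:
    """(power, perf) of one core relative to the baseline core."""
    power = 1 - POWER_PCT_PER_VOLT_PCT * voltage_drop_pct / 100
    perf = 1 - PERF_PCT_PER_VOLT_PCT * voltage_drop_pct / 100
    return power, perf


def dual_core(voltage_drop_pct: float) -> tuple:
    """(power, perf) of two scaled cores, assuming perfect parallel scaling."""
    power, perf = scaled_core(voltage_drop_pct)
    return 2 * power, 2 * perf


if __name__ == "__main__":
    power, perf = dual_core(15)
    print(f"power ~{power:.2f}, perf ~{perf:.2f}")
```

With a 15% reduction this gives roughly the same power as the single core and about 1.8 times the performance, consistent with the figures above.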
Multi-Core helps more ...
• Small core relative to large core: Power = 1/4, Performance = 1/2
• Multi-core (small cores C1 to C4 sharing a cache): power efficient
  • no need for wider instruction issue
  • better power and thermal management
[Figure: bar charts of power and performance for a large core with cache versus small cores C1 to C4 with cache]
Leakage problem
[Figure: power (W) and power density (W/cm2) of a 10 mm die for 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm, broken down into active, SD Lkg, and SiO2 Lkg] [Borkar-MICRO05], IEEE Computer Magazine
[Figure: leakage current from VDD through ON and OFF transistors with input 0]
• How to attack the leakage problem?
Introduction of our research
• Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
  • 5-year project started in October 2006
  • supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
• Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation of various design levels including circuit, architecture, and system software
• Members:
  • Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
  • Prof. M. Namiki (Tokyo Univ of Agri. Tech): OS
  • Prof. H. Amano (Keio Univ): architecture & F/E design
  • Prof. K. Usami (Shibaura I.T.): circuit & B/E design
How to reduce leakage: Power Gating
• Focusing on power gating for reducing leakage
• Inserting a Power Switch (PS) between VDD and GND
• Turning off the PS when the logic gates sleep
[Figure: logic gates between VDD and GND, compared with logic gates between VDD and a virtual GND that a power switch, driven by a Sleep signal, connects to GND]
Run-time Power Gating (RTPG)
• control the power switch at run time
• Coarse grain: mobile processor by Renesas (independent power domains for BB module, MPEG module, ...)
• Fine grain (our target): power gating within a module
[Figure: Circuits A, B, and C, each with a power switch driven by a sleep control circuit]
Fine-grain Run-time Power Gating
• Longer sleep time is preferable
  • Leakage savings
  • Overheads: power penalties for wakeup
• Evaluation through a real chip has not been reported
• Test vehicle: 32b x 32b multiplier
  • Either or both operands (input data) are likely to be less than 16 bits wide
  • Circuit portions that compute the upper bits of the product need not operate, yet they waste leakage power
  • By detecting 0s in the upper 16 bits of the operands, power-gate the internal multiplier array
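The zero-detection idea can be sketched in software. This is an illustration only: the operands are assumed to be unsigned, and the mapping from operand widths to the hypothetical domains "H" (product bits 48-63) and "M" (product bits 32-47) is an assumption made here, not the partitioning of the actual chip:

```python
MASK32 = 0xFFFFFFFF


def upper16_is_zero(x: int) -> bool:
    """True if the upper 16 bits of a 32-bit unsigned operand are all 0."""
    return (x & MASK32) >> 16 == 0


def sleep_domains(a: int, b: int) -> set:
    """Hypothetical multiplier domains that can be power-gated for a * b.

    One narrow operand: the product fits in 48 bits, so "H" can sleep.
    Two narrow operands: the product fits in 32 bits, so "M" can sleep too.
    """
    narrow = upper16_is_zero(a) + upper16_is_zero(b)
    if narrow == 2:
        return {"H", "M"}
    if narrow == 1:
        return {"H"}
    return set()


if __name__ == "__main__":
    for a, b in ((3, 5), (0x12345, 7), (0x10000, 0x10000)):
        print(hex(a), hex(b), sorted(sleep_domains(a, b)))
```

The detection itself is cheap (a 16-input zero check per operand); whether gating pays off depends on how long the domain can stay asleep relative to the wakeup penalty noted above.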
Test chip "Pinnacle"
• real measurement: exhibits good power reduction
[Figure: power dissipation (mW) at 25C, 85C, and 125C, with FG-RTPG not applied versus applied, for Sequence 1 (no sleep), Sequence 2 (Domain H sleeps), and Sequence 3 (Domains H and M sleep)]
Current Status
• Designing a pipelined microprocessor with FG-RTPG
• Compiler (instruction scheduler) to increase sleep time
Low Power Linux Scheduler based on statistical modeling
• Co-optimization of system software and architecture
• Objective:
  • a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint
• How to find the lowest frequency that satisfies the performance constraint?
  • it depends on hardware and program characteristics
  • the performance ratio differs from the frequency ratio
  • hard to find the answer straightforwardly
  • approach: modeling by statistical analysis of hardware events
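The frequency-selection step can be sketched as follows. The performance model here is a stand-in, not the statistical model of the project: it assumes that a fraction `mem_frac` of the execution time at full speed is memory-bound and does not scale with core frequency. The frequency list and all names are hypothetical:

```python
def predicted_perf_ratio(freq: float, fmax: float, mem_frac: float) -> float:
    """Predicted performance at `freq` relative to `fmax`.

    Stand-in model (an assumption): only the CPU-bound share of the
    run time, 1 - mem_frac, stretches when the frequency is lowered.
    """
    relative_time = (1 - mem_frac) * (fmax / freq) + mem_frac
    return 1 / relative_time


def lowest_frequency(freqs, mem_frac: float, threshold: float) -> float:
    """Lowest frequency whose predicted performance meets the threshold."""
    fmax = max(freqs)
    for f in sorted(freqs):
        if predicted_perf_ratio(f, fmax, mem_frac) >= threshold:
            return f
    return fmax


if __name__ == "__main__":
    freqs_ghz = [0.8, 1.2, 1.6, 2.0]  # hypothetical operating points
    for mem_frac in (0.0, 0.8, 0.95):
        print(mem_frac, lowest_frequency(freqs_ghz, mem_frac, threshold=0.9))
```

A CPU-bound process must stay at the top frequency to keep 90% of its performance, while a strongly memory-bound one can drop to the lowest, which illustrates why the performance ratio differs from the frequency ratio. In a real scheduler `mem_frac` would be replaced by a prediction derived from hardware event counters.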
Evaluation result
• Platform: Pentium M 760 (Max 2.00 GHz, FSB 533 MHz)
• Specified threshold: black dotted line
• Performance is within the threshold in all cases except for mgrid
  • 3-7% below the threshold
• An accurate model is obtained
• A Linux scheduler using this model has been developed
May 8, 2007
Summary
• Challenges for high performance processors: Memory Wall and Power Wall
• One solution to the memory wall
  • make good use of on-chip memory with software controllability
• Solutions to the power wall
  • many cores will relax the problem, but
  • leakage current is becoming a big problem
  • new research/approaches are required
  • our project “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs” was introduced