Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction • Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen • Presenter: Borys Bradel
Introduction • Different programs have different requirements (e.g. ILP) • Extends to phases of a single program • Heterogeneous cores • Use core that matches the requirements • Reuse existing cores • Use multiple generations of the same family of processors
Outline • Methodology • Hardware • Assumptions • Power • Experiments • Optimal – energy/energy delay product • Heuristic based – static/dynamic • Related Work • Conclusion
Single-ISA Multi-Core Benefits • Small area overhead because of the growth in core sizes between generations • Clock frequencies of older cores would scale with technology • P3 1 GHz = P4 1.4 GHz • Pipeline depth was increased precisely because the cores could not scale otherwise
Hardware – Alpha Family • 2 in-order cores • EV4 = 21064 • EV5 = 21164 • 2 out-of-order cores • EV6 = 21264 • EV8- = 21464 (multithreading support removed)
Hardware Size • 15% more area than just using 21464
Assumptions • Can switch cores dynamically • Private L1 caches and a common L2 cache • All cores use 0.10 micron technology • A single process executes on a single core at any one time • 2.1 GHz clock (the 600 MHz, 0.35 micron 21264 scaled to 0.10 micron) • Input voltage 1.2 V • Cores shut down when idle • 1000 cycle restart cost (staged, phase-locked loop left alone) • 150 ns memory access • Stall cycles determined through CACTI
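The 2.1 GHz figure is consistent with scaling the 21264's clock inversely with feature size. The sketch below shows that arithmetic only; the linear scaling rule is a first-order assumption, and the function name is hypothetical rather than taken from the slides.

```python
def scaled_clock_mhz(base_mhz: float, base_feature_um: float,
                     target_feature_um: float) -> float:
    """Scale a clock frequency inversely with feature size.

    Assumption: frequency grows linearly as the feature size shrinks.
    """
    return base_mhz * base_feature_um / target_feature_um


# 21264 (EV6): 600 MHz at 0.35 micron, scaled to 0.10 micron.
ev6_mhz = scaled_clock_mhz(600.0, 0.35, 0.10)
print(f"{ev6_mhz / 1000:.1f} GHz")
```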
Power Model • Use Wattch to account for activity-based dissipation • Use scaling and offset factors to account for other factors • This hybrid model is closer to manufacturers' data points • Peak power: data sheet values less L2 cache and output pin power • Typical power: scaled based on Intel chips
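One way to picture the scaling and offset factors is as a linear correction applied to the simulator's activity-based power. The sketch below fits that correction through two reference points (peak and typical). The two-point fit, the function names, and all numbers are illustrative assumptions, not the procedure or data from the paper.

```python
def fit_scale_offset(wattch_peak: float, wattch_typical: float,
                     ref_peak: float, ref_typical: float) -> tuple[float, float]:
    """Fit scale and offset so corrected power hits both reference points.

    Assumption: a straight line through (peak, typical) is adequate.
    """
    scale = (ref_peak - ref_typical) / (wattch_peak - wattch_typical)
    offset = ref_peak - scale * wattch_peak
    return scale, offset


def calibrated_power(wattch_watts: float, scale: float, offset: float) -> float:
    """Apply the linear correction to a raw activity-based estimate."""
    return scale * wattch_watts + offset


# Hypothetical numbers for one core, in watts.
scale, offset = fit_scale_offset(wattch_peak=40.0, wattch_typical=20.0,
                                 ref_peak=70.0, ref_typical=40.0)
print(calibrated_power(30.0, scale, offset))
```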
Performance Modeling • Use SMTSIM, a cycle-accurate simulator • SimPoint is used to identify a representative portion of each program and how many instructions need to be fast-forwarded to reach it
Oracle Switching for Energy • Performance always within 10% of EV8-
Oracle Switching for Energy Delay Product • Performance always within 50% of EV8-
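The two oracle policies can be sketched as a per-interval choice: among the cores whose performance stays within a threshold of EV8-, pick the one with the lowest energy or the lowest energy-delay product. This is a minimal sketch under assumptions: the threshold is read as a bound on per-interval delay relative to EV8-, and `IntervalStats`, `oracle_pick`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalStats:
    core: str
    energy: float  # energy used in this interval (illustrative units)
    delay: float   # time taken to execute this interval


def oracle_pick(stats, baseline="EV8-", max_slowdown=0.10, metric="energy"):
    """Pick the cheapest core whose delay is within max_slowdown of baseline."""
    base = next(s for s in stats if s.core == baseline)
    limit = base.delay * (1 + max_slowdown)
    eligible = [s for s in stats if s.delay <= limit]

    def cost(s):
        # "energy" minimizes energy; anything else minimizes energy * delay.
        return s.energy if metric == "energy" else s.energy * s.delay

    return min(eligible, key=cost).core


# Hypothetical measurements of one interval on each core.
interval = [
    IntervalStats("EV4", energy=1.0, delay=3.0),
    IntervalStats("EV5", energy=2.0, delay=1.4),
    IntervalStats("EV6", energy=4.0, delay=1.05),
    IntervalStats("EV8-", energy=10.0, delay=1.0),
]
print(oracle_pick(interval, max_slowdown=0.10, metric="energy"))
print(oracle_pick(interval, max_slowdown=0.50, metric="energy_delay"))
```

The baseline core is always eligible, so the choice never fails; a looser threshold simply admits slower, lower-power cores.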
Others • Voltage/frequency scaling – not as good • Static core selection • Only EV6 and EV8- are used • Dynamic heuristic • Running average performance within 10% • Every 100 time intervals (100 million instructions), cores are sampled for 5 intervals • Select the best core based on sampling
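The sampling schedule of the dynamic heuristic can be sketched as a loop: run on the current core for 100 intervals, sample the candidate cores, then continue on the best one. This sketch assumes each candidate is sampled for 5 consecutive intervals and that "best" means lowest total sampled cost; it omits the running-average performance constraint. The function name and the `cost` callback are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def run_dynamic_heuristic(cores: Sequence[str],
                          cost: Callable[[str, int], float],
                          total_intervals: int,
                          period: int = 100,
                          sample_len: int = 5,
                          start_core: str = "EV8-") -> List[str]:
    """Return the core used in each interval of the run."""
    schedule: List[str] = []
    current = start_core
    while len(schedule) < total_intervals:
        # Steady phase: stay on the currently selected core.
        for _ in range(period):
            if len(schedule) == total_intervals:
                return schedule
            schedule.append(current)
        # Sampling phase: try every candidate core in turn.
        totals: Dict[str, float] = {}
        for core in cores:
            total = 0.0
            for _ in range(sample_len):
                if len(schedule) == total_intervals:
                    return schedule
                total += cost(core, len(schedule))
                schedule.append(core)
            totals[core] = total
        # Continue on the core with the lowest sampled cost.
        current = min(totals, key=totals.get)
    return schedule


# Hypothetical constant per-interval cost (e.g. energy-delay) for each core.
fixed_cost = {"EV4": 3.0, "EV5": 1.0, "EV6": 2.0, "EV8-": 4.0}
plan = run_dynamic_heuristic(list(fixed_cost), lambda c, t: fixed_cost[c], 130)
print(plan[0], plan[100], plan[125])
```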
Related Work • Gating-based power optimization • Cannot gate at a fine enough granularity • May still have leakage • This could be thought of as gating to reduce capabilities of different units • Voltage and frequency scaling • Chip-wide – one size does not fit all • Fine-grained – granularity problems
Conclusions • Heterogeneous multi-core architectures reduce the energy-delay product • Finer-grained than other approaches • Using several cores from the same family is good • Reduces development/testing costs • Is it scalable? • Just use EV6?