Understanding Sources of Inefficiency in General-Purpose Chips

Written by: Hameed et al., ISCA 2010 • Presented by: David Schlais

Motivations • Most modern systems are power-critical • Cell phones, servers, tablets, etc. • ASIC • Low power / Low flexibility • Chip Multiprocessors (CMP) • Higher power / High flexibility • Goal: Low power / High flexibility • Paper attempts to identify what causes CMP power inefficiencies

Application-Specific Integrated Circuits (ASICs) • ASIC • Application Specific • Performs specific task efficiently • Power • Area • Low overhead in order to ‘execute’ tasks

Chip Multiprocessors • General-Purpose • High overhead • Execution units in red • Rest is overhead • High Flexibility • Can perform many instructions http://pages.cs.wisc.edu/~david/courses/cs752/Fall2014/handouts/lecture/03_pipeline.pdf

What is the energy gap between these two? And how do we minimize it?

Exploring CMP vs ASIC gap • H.264 encoding (MPEG-4 Advanced Video Coding) • Compared ASIC for H.264 encoding vs Tensilica CMP • Why H.264 encoding? • Large ASIC vs CMP gap (150-500x power gap) • Easier to see inefficiencies in CMP • ASICs for H.264 are commercially available • Commercial products serve as benchmark • Contains variety of computational motifs • Results are applicable to larger set of applications

H.264 (MPEG-4 AVC) • 99% of tasks are summed up in 5 steps: • Integer Motion Estimation (IME) • Computes vector of image-block motion • Fractional Motion Estimation (FME) • Refines initial match to quarter-pixel resolution • Intra Prediction (IP) • Based on surrounding image-blocks, makes prediction • Transform and Quantization (DCT/Quant) • Determines difference between prediction and current • Context Adaptive Binary Arithmetic Coding (CABAC) • Encodes coefficients
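To make the IME step concrete, here is a minimal sketch of an exhaustive integer-pixel block search. It assumes a sum-of-absolute-differences (SAD) cost, which is a common choice but is not stated on the slide; the function names, the toy frames, and the search radius are all hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def integer_motion_search(cur, ref, top, left, size, radius):
    """Return (dy, dx, cost) of the best integer-pixel match in `ref`
    for the block of `cur` at (top, left), searching +/- radius pixels."""
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, get_block(ref, y, x, size))
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


# Toy frames (hypothetical): a 2x2 patch moves one pixel down and right.
ref = [[0] * 6 for _ in range(6)]
cur = [[0] * 6 for _ in range(6)]
patch = [[10, 20], [30, 40]]
for r in range(2):
    for c in range(2):
        ref[1 + r][1 + c] = patch[r][c]
        cur[2 + r][2 + c] = patch[r][c]

motion = integer_motion_search(cur, ref, top=2, left=2, size=2, radius=2)
print(motion)
```

Each candidate position repeats the same subtract, absolute-value, and accumulate pattern over overlapping pixels, which is the kind of regular, data-parallel work the later slides target with SIMD and custom instructions.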

CMP power breakdown • Only ~6% of total energy goes towards computation
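A quick back-of-the-envelope check helps put the ~6% figure in context. The sketch below assumes, as a simplification that is not from the paper, that all non-computation energy could be removed while the computation energy itself stays unchanged; the variable names are hypothetical.

```python
def max_gain_from_removing_overhead(compute_fraction):
    """Upper bound on the energy improvement if everything except
    computation were free and the computation energy stayed the same."""
    return 1.0 / compute_fraction


compute_fraction = 0.06      # ~6% of energy goes to computation (from the slide)
reported_gap = (150, 500)    # ASIC vs CMP gap quoted earlier in the deck

bound = max_gain_from_removing_overhead(compute_fraction)
print(f"Best case from removing overhead alone: {bound:.1f}x")
print(f"Reported gap: {reported_gap[0]}-{reported_gap[1]}x")
```

Under this assumption, stripping away overhead alone yields well under 20x, so closing a 150-500x gap also requires making the computation itself cheaper.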

Instruction Decode Logic • Decoding is ~30% of overall power • Repetitive decoding of similar instructions • Solution? • 16- and 18-way SIMD datapaths • Many operations can be done in parallel • Solution? • 2- and 3-slot VLIW instructions • Impact: • 10x performance increase • 10x energy decrease • Decoding still takes ~30% of power

Work done per instruction • [Figure: application-specific instructions vs. RISC instructions]

“Magic” Instructions • Desire: • Single instruction executes 100s of operations • Reuse/shift pixels to prevent repetitive x-1, x0, x1, x2, x3 loads • Requirements: • Hardware support • Custom data storage elements • Links to provide large amounts of data to these storage units • Path from storage units to functional units • Instruction support • New instruction to dictate when to do these operations
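The pixel-reuse idea can be illustrated in software. The sketch below compares a filter that reloads its whole window for every output against one that keeps the window in a shift-register-like buffer and loads only one new pixel per step. It is an analogy for the custom storage described above, not the paper's hardware; the 5-tap window mirrors the x-1 to x3 example, and the uniform weights and pixel values are hypothetical.

```python
from collections import deque


def filter_naive(pixels, weights):
    """Reload every pixel of the window for each output sample."""
    taps = len(weights)
    loads = 0
    out = []
    for i in range(len(pixels) - taps + 1):
        window = pixels[i:i + taps]   # taps fresh loads per output
        loads += taps
        out.append(sum(w * p for w, p in zip(weights, window)))
    return out, loads


def filter_shift_register(pixels, weights):
    """Keep the window in a fixed-size buffer; load one new pixel per step."""
    taps = len(weights)
    window = deque(maxlen=taps)       # oldest pixel is shifted out automatically
    loads = 0
    out = []
    for p in pixels:
        window.append(p)              # a single load per step
        loads += 1
        if len(window) == taps:
            out.append(sum(w * q for w, q in zip(weights, window)))
    return out, loads


pixels = list(range(16))              # hypothetical row of pixel values
weights = [1, 1, 1, 1, 1]             # hypothetical 5-tap weights

naive_out, naive_loads = filter_naive(pixels, weights)
shift_out, shift_loads = filter_shift_register(pixels, weights)
print(naive_loads, shift_loads)
```

Both versions produce the same outputs, but the shifted window touches each pixel once instead of once per overlapping window. In hardware, the custom storage elements and their dedicated paths to the functional units play the role of the buffer here.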

Results

Takeaways • Fetching instructions is expensive • Current techniques only provide partial improvement • SIMD • VLIW • Processors have large overheads and spend only a small % of energy computing • Power of functional units (FUs) alone exceeds ASIC designs • Motivates automatic tools to create modified instructions

Takeaways Cont. • Application-specific instructions and hardware reduce inefficiencies • “Instructions” performing hundreds of operations are ideal • Extensible processor is not enough • Current RISC architectures contain inefficiencies not present in ASIC designs • Still many areas to improve processor power & performance

Progress since 2010 • Neural acceleration • Esmaeilzadeh, Hadi, et al. "Neural acceleration for general-purpose approximate programs." Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2012. • Dynamically specialized datapaths • Govindaraju, Venkatraman, Chen-Han Ho, and Karthikeyan Sankaralingam. "Dynamically specialized datapaths for energy efficient computing." High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on. IEEE, 2011. • Power-efficient compute-intensive GPGPU • Gilani, Syed Zulqarnain, Nam Sung Kim, and Michael J. Schulte. "Power-efficient computing for compute-intensive GPGPU applications." High Performance Computer Architecture (HPCA), 2013 IEEE 19th International Symposium on. IEEE, 2013.

Thank you • Questions?
