Multicores, Manycores and Amdahl’s Law (2012)
Amdahl’s Law – Reminder • Original Amdahl’s Law for n identical cores • f – fraction of parallelizable execution time • (1-f) – fraction of totally sequential execution time • Sequential runs on a single core • Parallel runs on all n cores • Q: What are the hidden assumptions?
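The formula itself appears on the original slide only as an image; written out with the notation above (f parallelizable, 1−f sequential, n identical cores), Amdahl’s Law is:

```latex
\text{Speedup}(f, n) = \frac{1}{(1-f) + \dfrac{f}{n}}
```

For example, f = 0.9 on n = 16 cores gives 1 / (0.1 + 0.9/16) ≈ 6.4, far below the ideal 16.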
Multicore CPU • Manycore – tens or hundreds of cores • Why don’t we have a Sandy Bridge with 100 cores? (Figure: Intel’s Sandy Bridge)
Core Performance Constraints • Manufacturing technology • Area (for more logic) • Area = Money; Manufacturing constraints • Power (for more logic, higher frequencies) • Sub-threshold leakage current • More power requires better cooling solutions
Large Core Performance • We have a baseline core (BCE) with area = 1 and performance = 1 • We can add microarchitectural features, e.g., accurate branch prediction, big data caches, a uOp cache, out-of-order execution (OOOE) • The new core area is then r (r > 1) • The large core is faster, with performance perf(r) • Q: For which perf(r) function is a large core better than multiple small ones? • So what is perf(r)? (Figure: a simple in-order BCE next to a large core)
Area: Pollack’s Rule • An empirical rule: single-core performance grows roughly as the square root of the invested logic area • Multicore implications. For example, double the CPU logic and get: • ~40% more performance with a larger single core • For purely parallel code – 100% more performance with a dual-core
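Written out, the usual statement of Pollack’s Rule, and the one that reproduces the ~40% figure above, is:

```latex
\text{perf}(r) \approx \sqrt{r}
\qquad\Rightarrow\qquad
\text{perf}(2) \approx \sqrt{2} \approx 1.41
```

That is, doubling the logic of one core buys roughly 40% more single-thread performance, whereas spending the same area on a second identical core doubles throughput for perfectly parallel code.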
Power • Power is usually considered proportional to area • In this presentation we consider area as the main constraint • This is not completely true [Esmaeilzadeh’11] • For simplicity we keep this approximation
Why Multicore/Manycore? • More performance per mm² and per watt for parallel code • Less power (& heat) • Save power by turning individual cores on and off • Run each core at an optimized frequency/power point • Load balance to distribute heat • Lower die temperatures • New performance constraint: the parallel fraction
Cost Model • To find the best-performing CPU configuration we need a cost model • Basic core – Baseline Core Equivalent (BCE) • The chip is limited to no more than n BCEs • Performance • The performance of each BCE is 1 • Architects can combine the resources of r BCEs to create a powerful core with performance perf(r) • f – fraction of the parallelizable execution time
Symmetric Multicore Chips • Run the sequential part on one core • Run the parallel part on all cores • (Figures: n=16, r=1 – sixteen 1-BCE cores; n=16, r=4 – four 4-BCE cores)
Symmetric Multicore Chips • n/r identical cores • Each core has performance perf(r) • Execution: • Sequential part – one core; performance – perf(r) • Parallel part – all cores; performance – perf(r) · n/r
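Combining the two phases above (sequential on one r-BCE core, parallel on all n/r cores) gives the symmetric-chip speedup used in the following examples; this is the formulation of [Hill’08]:

```latex
\text{Speedup}_{\text{symmetric}}(f, n, r) =
\frac{1}{\dfrac{1-f}{\text{perf}(r)} + \dfrac{f \cdot r}{\text{perf}(r) \cdot n}}
```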
Symmetric, n=16 • F=0.9, R=2, Cores=8, Speedup=6.7 • As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
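As a sanity check of the quoted numbers, plugging f=0.9, n=16, r=2 and the Pollack estimate perf(2)=√2 into the symmetric formula gives:

```latex
\text{Speedup} = \frac{1}{\dfrac{0.1}{\sqrt{2}} + \dfrac{0.9 \cdot 2}{\sqrt{2} \cdot 16}}
\approx \frac{1}{0.0707 + 0.0795} \approx 6.7
```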
Symmetric, n=256 • F=0.999: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16) – MORE CORES! • F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9) – CORE ENHANCEMENTS & MORE CORES! • F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7) – CORE ENHANCEMENTS!
Symmetric Multicores • In symmetric multicores with fixed n and perf(r)=sqrt(r), maximum performance is achieved at a specific core size r (the condition is reconstructed in the sketch below) • Q1: When will a single core perform better than any symmetric multicore? • Q2: In the optimal configuration, how is the execution time split between the sequential and parallel parts?
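The optimality condition did not survive the conversion of the slide. A sketch of the reconstruction, assuming perf(r)=√r as stated: the speedup is maximized where the denominator of the symmetric formula, (1−f)/√r + f√r/n, is minimized, i.e.

```latex
\frac{d}{dr}\!\left(\frac{1-f}{\sqrt{r}} + \frac{f\sqrt{r}}{n}\right) = 0
\quad\Longrightarrow\quad
r_{\text{opt}} = \frac{n\,(1-f)}{f} \qquad (\text{clamped to } 1 \le r \le n)
```

This matches the earlier examples: n=256, f=0.99 gives r ≈ 2.6 (the R=3 configuration), and n=256, f=0.9 gives r ≈ 28.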
Asymmetric Multicore Chips • Run the sequential part on the big core • Run the parallel part on all cores • (Figure: one 4-BCE core plus twelve 1-BCE cores, n=16)
Asymmetric Multicore Chips • One large r-BCE core with performance perf(r) • n − r small 1-BCE cores with performance 1 each • Execution: • Sequential part – the large core; performance – perf(r) • Parallel part – all cores; performance – perf(r) + n − r
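Combining the two phases as listed above yields the asymmetric-chip speedup (again following [Hill’08]):

```latex
\text{Speedup}_{\text{asymmetric}}(f, n, r) =
\frac{1}{\dfrac{1-f}{\text{perf}(r)} + \dfrac{f}{\text{perf}(r) + n - r}}
```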
Asymmetric, n=256 • Is the potential of an asymmetric architecture greater than that of a symmetric one? • F=0.99: R=41, Cores=216, Speedup=166 (recall the symmetric result for F=0.99: Speedup=80)
Dynamic (Composed) Multicore Chips • Combine up to r cores to boost sequential performance • Helper threads • Thread-Level Speculation • Hardware support may be required • Q: Why “up to r cores”?
Dynamic (Composed) Multicore Chips • Execution: • Sequential part – 1 big core; performance - perf(r) • Parallel part – all cores; performance – n
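With the sequential phase on the composed core and the parallel phase on all n BCEs, the dynamic-chip speedup is:

```latex
\text{Speedup}_{\text{dynamic}}(f, n, r) =
\frac{1}{\dfrac{1-f}{\text{perf}(r)} + \dfrac{f}{n}}
```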
Dynamic, n=256 • Q: How does the dynamic multicore scale relative to the symmetric and asymmetric designs? • F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166) • Note: #Cores is always N=256
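A minimal Python sketch of the cost model, assuming perf(r)=√r; the function names are mine, not from the original slides. Brute-forcing r for n=256, f=0.99 reproduces the numbers quoted above: roughly 80 for symmetric (R=3), 166 for asymmetric (R=41) and 223 for dynamic (R=256).

```python
import math

def perf(r):
    # Pollack's Rule: an r-BCE core performs roughly sqrt(r) times a 1-BCE core
    return math.sqrt(r)

def symmetric(f, n, r):
    # n/r identical r-BCE cores; sequential part on one of them, parallel part on all
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    # one r-BCE core plus (n - r) 1-BCE cores; parallel part uses all of them
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    # sequential part on a core composed of up to r BCEs, parallel part on all n BCEs
    return 1.0 / ((1 - f) / perf(r) + f / n)

def best(model, f, n):
    # brute-force the core size r that maximizes speedup for a given chip model
    return max((model(f, n, r), r) for r in range(1, n + 1))

if __name__ == "__main__":
    n, f = 256, 0.99
    for name, model in (("symmetric", symmetric), ("asymmetric", asymmetric), ("dynamic", dynamic)):
        speedup, r = best(model, f, n)
        print(f"{name:10s} r={r:3d} speedup={speedup:6.1f}")
    # expected (rounded): symmetric r=3 ~80, asymmetric r=41 ~166, dynamic r=256 ~223
```

The symmetric model is evaluated here for every r, not only divisors of n, which is enough to match the rounded values on the slides.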
Manufacturing Technology • New manufacturing technology will not save us
Summary • Multicores and manycores are required due to the diminishing returns of large cores • Amdahl’s Law allows us to predict the performance of the various architectures • The dynamic (composed) architecture is promising • To take advantage of future CPUs, the parallel fraction of the code must be very high • …and even then we are still going to have a problem
References • Amdahl’s Law in the Multicore Era [Hill’08] • Thousand Core Chips—A Technology Perspective [Borkar’07] • Dark Silicon and the End of Multicore Scaling [Esmaeilzadeh’11] • Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors [Morad’05]