Multicores, Manycores and Amdahl’s Law. 2012.
Amdahl’s Law – Reminder • Original Amdahl’s Law for n identical cores: Speedup(f, n) = 1 / ((1-f) + f/n) • f – fraction of parallelizable execution time • (1-f) – fraction of totally sequential execution time • Sequential runs on a single core • Parallel runs on all n cores • Q: What are the hidden assumptions?
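As a quick illustration of the reminder above, here is a minimal Python sketch of the original law. The function name `amdahl_speedup` and the sample values are illustrative choices, not part of the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Original Amdahl's Law for n identical cores.

    f is the parallelizable fraction of execution time; the remaining
    (1 - f) runs on a single core, the parallel part on all n cores.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    # Illustrative sweep: the same code (f = 0.9) on more and more cores.
    for cores in (1, 16, 256):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Even with 256 cores the speedup for f = 0.9 stays below 1 / (1 - f) = 10, which is the limit the sequential part imposes.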
Multicore CPU • Manycore – tens or hundreds of cores • Q: Why don’t we have a Sandy Bridge with 100 cores? (Figure: Intel’s Sandy Bridge)
Core Performance Constraints • Manufacturing technology • Area (for more logic) • Area = Money; Manufacturing constraints • Power (for more logic, higher frequencies) • Sub-threshold leakage current • More power requires better cooling solutions
Large Core Performance • We have a baseline core (BCE) with area=1, performance=1 (e.g., a simple in-order core) • We can add microarchitectural features (e.g., accurate branch prediction, big data caches, uOp cache, OOOE) • New core area is then r (r>1) • Large core is faster, with performance of perf(r) • Q: For which perf(r) function is a large core better than multiple small ones? • So what is perf(r)?
Area: Pollack’s Rule • An empirical rule: performance grows roughly with the square root of core area (logic complexity) • Multicore implications. For example, double the CPU logic and get: • About 40% more performance with a larger single core • For purely parallel code, 100% more performance with a dual-core
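The two figures quoted for Pollack's Rule can be reproduced with a small sketch. It assumes the common square-root form of the rule (performance proportional to the square root of area); the function names are hypothetical.

```python
import math


def pollack_perf(area: float) -> float:
    """Performance of one core of the given area (baseline: area 1 -> perf 1).

    Assumes the square-root form of Pollack's Rule.
    """
    return math.sqrt(area)


def larger_core_gain(area_factor: float) -> float:
    """Relative gain from spending area_factor times the logic on ONE core."""
    return pollack_perf(area_factor) - 1.0


def more_cores_gain(area_factor: int) -> float:
    """Relative gain from spending the same area on baseline cores.

    Valid only for purely parallel code, where throughput adds up linearly.
    """
    return area_factor * pollack_perf(1.0) - 1.0


if __name__ == "__main__":
    print(f"larger single core: +{larger_core_gain(2.0):.0%}")
    print(f"dual-core (parallel code): +{more_cores_gain(2):.0%}")
```

The gap between the two gains is what motivates multicores for parallel code; the sequential fraction is what limits it.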
Power • Power is usually considered as proportional to area • In this presentation we consider area as the main constraint • Not completely true [Esmaeilzadeh’11] • For simplicity we keep with
Why Multicore/Manycore? • More performance per mm² and per watt for parallel code • Less power (& heat) • Save power by turning individual cores on and off • Run each core at an optimized frequency/power • Load balance to distribute heat • Lower die temperatures • New performance constraint: parallel fraction
Cost Model • To find the best-performing CPU configuration we need a cost model • Basic core – Baseline Core Equivalent (BCE) • Chip is limited to no more than n BCEs • Performance • Performance of each BCE is 1 • Architects can expand the resources of r BCEs to create a powerful core with performance of perf(r) • f – fraction of the parallelizable execution time
Symmetric Multicore Chips • Run the sequential part on one core • Run the parallel part on all cores • Examples: n=16, r=1 gives sixteen 1-BCE cores; n=16, r=4 gives four 4-BCE cores
Symmetric Multicore Chips • n/r identical cores • Each core performance perf(r) • Execution • Sequential part – 1 core; performance - perf(r) • Parallel part – all cores; performance - perf(r) * n/r
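The execution model above translates directly into a speedup formula: time is (1-f)/perf(r) for the sequential part plus f/(perf(r) * n/r) for the parallel part. The sketch below assumes perf(r) = sqrt(r) as the default; the function name and signature are illustrative.

```python
import math
from typing import Callable


def symmetric_speedup(f: float, n: int, r: float,
                      perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of a symmetric chip: n/r identical cores of r BCEs each.

    Sequential part: one core at perf(r).
    Parallel part: all n/r cores, i.e. perf(r) * n / r in total.
    The speedup is relative to a single 1-BCE core.
    """
    sequential_time = (1.0 - f) / perf(r)
    parallel_time = f / (perf(r) * n / r)
    return 1.0 / (sequential_time + parallel_time)


if __name__ == "__main__":
    # Eight 2-BCE cores on a 16-BCE chip, f = 0.9.
    print(round(symmetric_speedup(0.9, 16, 2), 1))
```

With r = 1 and perf(1) = 1 the expression reduces to the original Amdahl's Law for n cores.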
Symmetric, n=16 • F=0.9: R=2, Cores=8, Speedup=6.7 • As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
Symmetric, n=256 • F→1 (F=0.999): R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16) – MORE CORES! • F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9) – CORE ENHANCEMENTS & MORE CORES! • F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7) – CORE ENHANCEMENTS!
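The design points for n=256 can be found by a brute-force sweep over integer core sizes. This is a sketch under two assumptions: perf(r) = sqrt(r), and a fractional core count n/r is allowed, as in the analytical model. The first case is evaluated at f = 0.999, an assumed value chosen because it reproduces the quoted speedup of 204. Function names are hypothetical.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Symmetric multicore speedup with perf(r) = sqrt(r) (assumed)."""
    perf = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))


def best_symmetric_r(f: float, n: int) -> int:
    """Integer core size r in 1..n that maximizes the symmetric speedup."""
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r))


if __name__ == "__main__":
    n = 256
    for f in (0.999, 0.99, 0.9):
        r = best_symmetric_r(f, n)
        print(f"f={f}: r={r}, cores={n // r}, "
              f"speedup={symmetric_speedup(f, n, r):.1f}")
```

The sweep shows the trend in the slide: the smaller the parallel fraction, the more of the area budget goes into each core rather than into more cores.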
Symmetric Multicores • In symmetric multicores with fixed n, perf(r)=sqrt(r), maximum performance is achieved when: • Q1: When will a single core perform better than any symmetric multicore? • Q2: In the optimal configuration, what are the proportions of the execution time between the optimal sequential and parallel parts?
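For perf(r) = sqrt(r), the optimum can be derived by minimizing the normalized execution time T(r) of the symmetric model. The derivation below is a reconstruction from that model, not the slide's original formula; it treats r as continuous and ignores the constraint 1 ≤ r ≤ n, which must be applied afterwards.

```latex
\begin{align*}
T(r) &= \frac{1-f}{\sqrt{r}} + \frac{f\,\sqrt{r}}{n} \\
\frac{dT}{dr} &= -\frac{1-f}{2\,r^{3/2}} + \frac{f}{2\,n\,\sqrt{r}} = 0 \\
r_{\mathrm{opt}} &= \frac{n\,(1-f)}{f}
\end{align*}
```

As a sanity check, n = 256 and f = 0.9 give r_opt ≈ 28.4, consistent with the R=28 design point for the 256-BCE symmetric chip.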
Asymmetric Multicore Chips • Run the sequential part on the big core • Run the parallel part on all cores • Example (n=16): one 4-BCE core and twelve 1-BCE cores
Asymmetric Multicore Chips • One large r-BCE core with performance of perf(r) • n-r small 1-BCE cores with performance of 1 • Execution: • Sequential part – 1 core; performance - perf(r) • Parallel part – all cores; performance - perf(r) + n - r
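The asymmetric execution model can be sketched the same way: the sequential part runs at perf(r) and the parallel part at perf(r) + n - r. The code assumes perf(r) = sqrt(r) by default; names are illustrative.

```python
import math
from typing import Callable


def asymmetric_speedup(f: float, n: int, r: float,
                       perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of one r-BCE core plus (n - r) 1-BCE cores.

    Sequential part: the big core alone, at perf(r).
    Parallel part: the big core and all small cores, perf(r) + n - r.
    """
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    sequential_time = (1.0 - f) / perf(r)
    parallel_time = f / (perf(r) + n - r)
    return 1.0 / (sequential_time + parallel_time)


if __name__ == "__main__":
    # One 41-BCE core plus 215 small cores on a 256-BCE chip, f = 0.99.
    print(round(asymmetric_speedup(0.99, 256, 41)))
```

Unlike the symmetric case, enlarging the big core costs only r - perf(r) units of parallel throughput, so a much larger r becomes worthwhile.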
Asymmetric, n=256 • Q: Is the potential of an asymmetric architecture greater than that of a symmetric one? • F=0.99: R=41, Cores=216, Speedup=166 (recall the symmetric result for F=0.99: R=3, Cores=85, Speedup=80)
Dynamic (Composed) Multicore Chips • Combine up to r cores to boost sequential performance • Helper threads • Thread-Level Speculation • Hardware support may be required • Q: Why “up to r cores”?
Dynamic (Composed) Multicore Chips • Execution: • Sequential part – 1 big core; performance - perf(r) • Parallel part – all cores; performance – n
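For the dynamic (composed) chip, the sequential part runs at perf(r) while the parallel part uses all n BCEs as separate cores. A minimal sketch, again assuming perf(r) = sqrt(r) and using an illustrative function name:

```python
import math
from typing import Callable


def dynamic_speedup(f: float, n: int, r: float,
                    perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of a dynamic chip that fuses up to r BCEs on demand.

    Sequential part: one composed core at perf(r).
    Parallel part: all n baseline cores, total performance n.
    """
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    sequential_time = (1.0 - f) / perf(r)
    parallel_time = f / n
    return 1.0 / (sequential_time + parallel_time)


if __name__ == "__main__":
    # Fuse all 256 BCEs for the sequential part, f = 0.99.
    print(round(dynamic_speedup(0.99, 256, 256)))
```

Because the parallel term does not depend on r, the speedup never decreases as r grows, so under this model the best choice is always r = n.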
Dynamic, n=256 • Q: How does a dynamic multicore scale relative to symmetric and asymmetric ones? • F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166) • Note: #Cores is always N=256
Manufacturing Technology • New manufacturing technology will not save us
Summary • Multicores and manycores are required due to the diminishing returns of large cores • Amdahl’s Law allows us to predict the performance of various architectures • Dynamic (composed) architecture is promising • To take advantage of future CPUs, the parallel fraction of the code must be very high • …and still we are going to have a problem
References • Amdahl’s Law in the Multicore Era [Hill’08] • Thousand Core Chips: A Technology Perspective [Borkar’07] • Dark Silicon and the End of Multicore Scaling [Esmaeilzadeh’11] • Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors [Morad’05]