Extending Amdahl’s Law in the Multicore Era Erlin Yao, Yungang Bao, Guangming Tan and Mingyu Chen Institute of Computing Technology, Chinese Academy of Sciences yaoerlin@gmail.com, {baoyg,tgm,cmy}@ncic.ac.cn
A Brief Intro of ICT, CAS • ICT has developed the Loongson CPU. • ICT has built the fastest HPC system in China, Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500.
Outline • I. Background and Related Works • II. Model of Multicore Scalability • III. Symmetrical Multicore Chips • IV. Asymmetrical Multicore Chips • V. Dynamic Multicore Chips • VI. Conclusion and Future Work
We are in the Multi-Core Era • The mainstream market is already dominated by multicore processors • Intel: 2-core Core Duo, 4-core i7 • AMD: 2-core Athlon, 4-core Opteron • IBM: 2-core POWER6, 9-core Cell • Sun: 8-core T1/T2 • ……
Many-Core is coming • Some processor vendors have announced or released their manycore processors • Tilera: 64-core • Intel: 80-core • GPGPU: 100x-core • ……
Revisiting Amdahl’s Law in the Multi/Many-Core Era • Assume that a fraction f of a program’s execution time is infinitely parallelizable with no scheduling overhead, while the remaining fraction, 1 − f, is totally sequential, and that p processors are used to accelerate the parallel fraction. • Fixed-size speedup: the amount of work to be executed is independent of the number of processors.
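The speedup formula shown on this slide did not survive in the transcript. With f and p as defined above, the standard form of Amdahl's law is:

```latex
\[
\mathrm{Speedup}_{\mathrm{Amdahl}}(f, p) \;=\; \frac{1}{(1 - f) + \dfrac{f}{p}}
\]
```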
Implications of Amdahl’s Law • Despite its simplicity, Amdahl’s law applies broadly and gives important insights such as: • (i) Attack the common case: When f is small, optimization will have little effect. • (ii) The aspects you ignore also limit speedup: Even if p approaches infinity, speedup is bounded by 1/(1−f) .
Mark Hill et al.’s Insights • Hill and Marty apply Amdahl’s law to multicore hardware by constructing a cost model for the number and performance of cores in one chip. • Obtaining optimal multicore performance requires further research both in extracting more parallelism and in making sequential cores faster. • Woo and Lee have extended Hill’s work by taking power and energy into account.
Motivation of Our Work • The revised Amdahl’s Law model provides a better understanding of multicore scalability. • However, there is little work on theoretical analysis. • This paper presents our investigations on theoretical analysis of multicore scalability and attempts to find the optimal results under different conditions.
Model of Multicore Scalability • We adopt the same cost model for multicore hardware proposed by Hill and Marty, which makes two assumptions: • First, a multicore chip of given size and technology generation can contain at most n base core equivalents (BCEs). • Second, an individual core built from more resources (r BCEs) achieves better sequential performance, perf(r), with 1 < perf(r) < r for r > 1. • The architecture of multicore chips can be classified into three types: • Symmetric • Asymmetric • Dynamic
Model-Symmetrical • A symmetric multicore chip requires that all its cores have the same cost. • Example: given 16 BCEs. • r = 8: 2 cores × 8 BCEs/core • r = 4: 4 cores × 4 BCEs/core • Given the resource budget of n BCEs, we have n/r cores, each with r BCEs. Performance of each core is perf(r). Then we get
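The equation that followed on the slide is missing from this transcript. Under Hill and Marty's cost model, which the slides say is adopted here, the symmetric speedup is usually written as below: the sequential fraction runs on one core at perf(r), and the parallel fraction runs on all n/r cores.

```latex
\[
\mathrm{Speedup}_{\mathrm{symmetric}}(f, n, r) \;=\;
\frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f \cdot r}{\mathrm{perf}(r) \cdot n}}
\]
```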
Model-Asymmetrical • In an asymmetric multicore chip, several cores are more powerful than the others. • Example: given 16 BCEs • 1 four-BCE core and 12 base cores. • 1 six-BCE core and 10 base cores. • Given the resource budget of n BCEs, we have 1+n−r cores with one larger core (with r BCEs) and n−r base cores (with 1 BCE each). Then we get
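The equation is again missing from the transcript. In Hill and Marty's model the asymmetric speedup takes the following form: the sequential fraction runs on the large core alone, and the parallel fraction uses the large core together with the n − r base cores.

```latex
\[
\mathrm{Speedup}_{\mathrm{asymmetric}}(f, n, r) \;=\;
\frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]
```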
Model-Dynamic • A dynamic multicore chip can dynamically combine up to r cores into one core in order to boost sequential performance. • In sequential mode, it can execute with performance of perf(r) when the dynamic techniques use r BCEs. • In parallel mode, it can obtain performance of n using all base cores in parallel. • Then, we get
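The equation is missing from the transcript here as well. Following the two modes described on this slide (sequential performance perf(r), parallel performance n), the dynamic speedup in Hill and Marty's model is:

```latex
\[
\mathrm{Speedup}_{\mathrm{dynamic}}(f, n, r) \;=\;
\frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f}{n}}
\]
```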
Symmetrical Multicore Chips • For fixed n and r, speedup is an increasing function of f • For fixed f and r, speedup is an increasing function of n • Increasing both the parallel fraction (f) and the number of base cores (n) improves the speedup of a symmetric multicore chip. • For fixed f and n, we have the following theorem:
Symmetrical Multicore Chips • For any fixed f and c (with perf(r) = r^c), • if f < c, the maximum speedup is achieved at r = n. • if f > c and n is not big, the maximum speedup is achieved at r = 1. • if f > c and n is big enough, the maximum speedup is achieved between the extremes: several BCEs should be devoted to each core so that it offers reasonable individual performance.
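The three cases can be explored numerically. This is a minimal sketch, not the authors' method: it assumes Hill and Marty's symmetric speedup formula and perf(r) = r^c, the function names are hypothetical, and r is swept over every integer from 1 to n even when r does not divide n.

```python
def symmetric_speedup(f, n, r, c):
    """Hill-Marty symmetric speedup, assuming perf(r) = r**c."""
    perf = r ** c
    # Sequential phase runs on one core; parallel phase uses all n/r cores.
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))


def best_symmetric_r(f, n, c):
    """Integer r in 1..n that maximizes the symmetric speedup (brute force)."""
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r, c))


if __name__ == "__main__":
    # One parameter set per case: f < c; f > c with small n; f > c with large n.
    for f, n, c in [(0.3, 16, 0.5), (0.9, 4, 0.5), (0.9, 256, 0.5)]:
        r = best_symmetric_r(f, n, c)
        print(f"f={f}, n={n}, c={c}: best r={r}, "
              f"speedup={symmetric_speedup(f, n, r, c):.2f}")
```

The brute-force sweep is deliberate: it makes no assumption about the shape of the speedup curve, so the same helper can be reused with other perf functions by changing one line.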
Symmetrical Multicore Chips • If n is big enough, will the maximum speedup always be achieved between the extremes for any perf(x) < x? • Counterexamples: • (i) perf(x) = kx, for any 0 < k < 1; • (ii) perf(x) = x^c, for any f < c < 1.
Asymmetrical Multicore Chips • Similarly, increasing both the parallel fraction (f) and the number of BCEs (n) improves the speedup of an asymmetric multicore chip. • For fixed f and n, we have the following theorem:
Asymmetrical Multicore Chips • If f > c and n is not big, the maximum speedup is achieved at r = 1. • If f < c and n is not big, the maximum speedup is achieved at r = n. • For any fixed f and c, if n is big enough, the maximum speedup is achieved at some r0 with 1 < r0 < n.
Asymmetrical Multicore Chips • Note that the optimal r0 in Theorem 2 cannot be solved for analytically. • r0 grows approximately linearly with n, and if n is big enough, r0 comes arbitrarily close to n in relative terms.
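Since r0 has no closed form, it can be located by a numerical sweep. This is a rough sketch under stated assumptions: Hill and Marty's asymmetric speedup formula, perf(r) = r^c, integer r only, and hypothetical function names.

```python
def asymmetric_speedup(f, n, r, c):
    """Hill-Marty asymmetric speedup, assuming perf(r) = r**c."""
    perf = r ** c
    # Sequential phase: the large core alone.
    # Parallel phase: the large core plus the n - r base cores.
    return 1.0 / ((1.0 - f) / perf + f / (perf + n - r))


def best_asymmetric_r(f, n, c):
    """Integer r in 1..n that maximizes the asymmetric speedup (brute force)."""
    return max(range(1, n + 1), key=lambda r: asymmetric_speedup(f, n, r, c))


if __name__ == "__main__":
    # Small n with f > c, small n with f < c, then growing n with f > c.
    for f, n, c in [(0.9, 4, 0.5), (0.2, 4, 0.8),
                    (0.9, 256, 0.5), (0.9, 4096, 0.5)]:
        r0 = best_asymmetric_r(f, n, c)
        print(f"f={f}, n={n}, c={c}: r0={r0}, r0/n={r0 / n:.3f}")
```

Printing r0/n for increasing n gives a quick way to watch how the optimal large-core size scales with the chip budget.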
Asymmetrical Multicore Chips • If n is big enough, will the maximum speedup always be achieved between the extremes for any perf(x) < x? • Counterexample: • perf(x) = kx, for any f < k < 1. • For saturated functions, • Like p(x) = x^c, p(x) = k·x^c + m·x^c′ + …, where c, c′ < 1.
Asymmetrical Multicore Chips • Based on the simplistic assumptions of Amdahl’s law, it makes most sense to devote extra resources to increasing only one core’s capability. In fact we have the following theorem: • Although the asymmetric architecture of one large core plus many base cores was originally assumed for simplicity, it is indeed the optimal architecture in terms of speedup.
Dynamic Multicore Chips • We should increase both f and n to enhance the speedup of a dynamic multicore chip. • For fixed f and n, • if perf(r) is an increasing function of r, speedup is also an increasing function of r • the maximum speedup is therefore always achieved at r = n. • Dynamic multicore chips can offer potential speedups that are never worse, and can be greater, than those of symmetric or asymmetric multicore chips with identical perf(r) functions. • So researchers should continue to investigate methods that approximate a dynamic multicore chip.
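The dominance claim can be checked numerically for a given parameter set. This is an illustrative sketch only: it assumes Hill and Marty's three speedup formulas and perf(r) = r^c, and all function names are hypothetical.

```python
def perf(r, c):
    """Assumed sequential performance of a core built from r BCEs."""
    return r ** c


def symmetric(f, n, r, c):
    return 1.0 / ((1.0 - f) / perf(r, c) + f * r / (perf(r, c) * n))


def asymmetric(f, n, r, c):
    return 1.0 / ((1.0 - f) / perf(r, c) + f / (perf(r, c) + n - r))


def dynamic(f, n, r, c):
    # Sequential mode uses r BCEs as one core; parallel mode uses all n BCEs.
    return 1.0 / ((1.0 - f) / perf(r, c) + f / n)


def dynamic_dominates(f, n, c, tol=1e-12):
    """True if, for every integer r in 1..n, dynamic >= both other designs."""
    return all(
        dynamic(f, n, r, c) >= max(symmetric(f, n, r, c),
                                   asymmetric(f, n, r, c)) - tol
        for r in range(1, n + 1)
    )


if __name__ == "__main__":
    f, n, c = 0.9, 64, 0.5
    print("dynamic dominates:", dynamic_dominates(f, n, c))
    best_r = max(range(1, n + 1), key=lambda r: dynamic(f, n, r, c))
    print("best r for dynamic chip:", best_r)
```

The small tolerance only absorbs floating-point rounding at r = 1, where the three formulas coincide.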
Potentials of Maximum Speedups • Recall that in Amdahl’s law, even if the number of processors approaches infinity, the speedup is bounded by 1/(1−f). • In the multicore model, increasing n improves the speedup continuously. Under the assumption perf(r) = r^c, the speedup approaches infinity as n approaches infinity, even if the performance index c is small.
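A small numerical illustration of this contrast follows. It is a sketch under assumptions: it uses the dynamic model evaluated at r = n with perf(r) = r^c, chosen here only because that case has the simplest expression, and the function names are hypothetical.

```python
def amdahl_bound(f):
    """Upper bound on classical Amdahl speedup as processors go to infinity."""
    return 1.0 / (1.0 - f)


def dynamic_speedup_at_full_r(f, n, c):
    """Dynamic-chip speedup at r = n, assuming perf(r) = r**c."""
    return 1.0 / ((1.0 - f) / n ** c + f / n)


if __name__ == "__main__":
    f, c = 0.9, 0.1  # a deliberately small performance index c
    print(f"Amdahl bound: {amdahl_bound(f):.2f}")
    for k in (4, 10, 20, 40):
        n = 2 ** k
        print(f"n=2^{k}: speedup={dynamic_speedup_at_full_r(f, n, c):.2f}")
```

With a small c the growth is slow, so very large n is needed before the speedup clearly exceeds the classical bound, which is why the loop uses powers of two.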
Implications and Results • A theoretical analysis of multicore scalability is presented, and quantitative conditions are given for obtaining optimal multicore performance. • The theorems and corollary provide computer architects with a better understanding of multicore design types, enabling them to make more informed tradeoffs. • However, our precise quantitative results are suspect because the real world is much more complex. The model considered here ignores many important structures. • This theoretical analysis attempts to provide insights for future work.
Future Work • In real applications, the parallel fraction is not infinitely parallelizable. The degree of parallelism may be bounded by some constant d, or may even be random in some circumstances. • Introducing practical structures, such as the memory hierarchy, shared caches, etc. • More cores might allow more parallelism for larger problem sizes. Fixed-time speedup, as in Gustafson’s law, should be considered. • … …
Acknowledgements • We would like to thank Professor Mark Hill for his valuable comments and suggestions. • We also appreciate the help of Dr. Mark Squillant and the arrangements made by the MAMA organizers for this video presentation.
Thanks. Questions and comments are welcome.