Heterogeneous CPU Cores
March 11, 2014
Kevin Stewart, Derrik Huey, Shuai Xu
Outline
• Introduction to Multi-cores
• ARM big.LITTLE Technology
• Multi-thread Programming
ECE 570 W14 – Heterogeneous CPU Cores
Multi-Cores
• Multi-cores and why they are needed
• Cost and Power Benefits
• Heterogeneous Cores and Homogeneous Cores
• Who are the Players in this field?
Multi-cores and why they are needed
• Multi-cores came about because frequency scaling ran out of headroom
• Higher clock frequencies hit physical barriers: power dissipation and heat
• Adding a second core is easier than doubling the frequency
Cost and Power Benefits
• Active power dissipation (switching power)
• Standby power dissipation
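A small worked example may help here. It assumes the usual first-order CMOS model for switching power, P = a · C · V² · f (activity factor a, switched capacitance C, supply voltage V, clock frequency f). The model, the function names and every number below are illustrative assumptions, not figures from the slides. The sketch compares one core at full frequency with two cores at half the frequency and a reduced supply voltage.

```c
#include <stdio.h>

/* First-order CMOS switching power: P = a * C * V^2 * f.
 * a: activity factor, c: switched capacitance (F),
 * v: supply voltage (V), f: clock frequency (Hz). */
double switching_power(double a, double c, double v, double f)
{
    return a * c * v * v * f;
}

/* Total switching power of n identical cores under the same conditions. */
double multicore_power(int n, double a, double c, double v, double f)
{
    return n * switching_power(a, c, v, f);
}

/* Print the comparison. All numbers are hypothetical. */
void print_comparison(void)
{
    const double a = 0.2;    /* hypothetical activity factor */
    const double c = 1e-9;   /* hypothetical switched capacitance, 1 nF */

    double one = multicore_power(1, a, c, 1.0, 2e9);  /* 1 core, 2 GHz, 1.0 V */
    double two = multicore_power(2, a, c, 0.8, 1e9);  /* 2 cores, 1 GHz, 0.8 V */

    printf("one core : %.3f W\n", one);
    printf("two cores: %.3f W\n", two);
}
```

Calling `print_comparison()` from a `main` function shows the effect: with these assumed values the single core dissipates 0.4 W and the pair 0.256 W for the same nominal number of cycles per second, because the lower frequency allows a lower voltage and the voltage enters the formula squared.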
Heterogeneous and Homogeneous cores
• Homogeneous: all cores are identical
  • Symmetric Multi-Processing (SMP)
• Heterogeneous: the cores differ
  • Heterogeneous Multi-Processing (HMP)
  • Application-Specific Processing (ASP)
  • System on Chip (SoC)
Who are the players in the field?
• The usual cast
• The mobile arena
Outline
• Introduction to big.LITTLE
• big and LITTLE cores
• The challenge of cache coherency
• Pairing big and LITTLE cores
• Software challenges
• Benchmarks and market overview
Heterogeneous CPU cores
• Dynamically adapt to computing needs
• Combination of small and large core(s)
  • Large core(s) active: high performance
  • Small core(s) active: low power
• ARM's proprietary implementation is called big.LITTLE
Requirements for cores
• Caches need to be compatible
• Same fundamental architecture (code compatible)
• Can have different micro-architecture
  • LITTLE core: Cortex-A7
  • big core: Cortex-A15
The LITTLE core: A7
• ARM Cortex-A7 micro-architecture
• In-order execution
• Dual issue
• 8 to 10 stage pipeline
Figure from Ref. [1]
The big core: A15
• ARM Cortex-A15 micro-architecture
• Out-of-order execution
• Triple issue
• 15 to 24 stage pipeline
Figure from Ref. [1]
The big core: A15 compared with the A7
• 4x larger area
• 4x higher power consumption
• 2-3x higher performance
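Taking these ratios at face value, a rough performance-per-watt comparison follows. This is a back-of-the-envelope reading that assumes the 4x power and 2-3x performance figures refer to the same workload:

```latex
\[
\frac{(\text{performance}/\text{power})_{\text{A15}}}
     {(\text{performance}/\text{power})_{\text{A7}}}
= \frac{2 \text{ to } 3}{4}
= 0.5 \text{ to } 0.75
\]
```

Under that assumption the A7 does roughly 1.3x to 2x more work per unit of energy, which is the motivation for keeping light workloads on the LITTLE core.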
Communication between cores
• big and LITTLE cores need to be able to talk with each other
• Cache coherency!
Cache coherency
Figure from Ref. [7]
Switching between cores
• How does switching between cores work?
Switching between cores
• A migration takes fewer than 20,000 cycles, which is under 20 µs at a 1 GHz clock
Figure from Ref. [9]
Pairing of big.LITTLE cores
• Switching threshold
Figure from Ref. [4]
Pairing of big.LITTLE - Summary
• Cluster Switching Mode
  • All tasks are assigned to one cluster while the other one is inactive
• CPU Migration Mode (In-kernel switcher)
  • big and LITTLE cores are grouped in pairs
• Heterogeneous Multi-Processing Mode (Global Task Scheduling)
  • Tasks are assigned to cores independently
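The switching-threshold idea behind CPU Migration can be sketched as a small decision function for one big/LITTLE pair. This is an illustration only: the two thresholds, the 0-100 load scale and all names are hypothetical and are not taken from any real in-kernel switcher. Two thresholds are used instead of one so that a load hovering around a single value does not cause constant back-and-forth migration.

```c
#include <stdio.h>

enum core_type { CORE_LITTLE, CORE_BIG };

/* Hypothetical thresholds on a 0-100 load scale. */
#define UP_THRESHOLD   80   /* move to the big core above this load */
#define DOWN_THRESHOLD 30   /* move back to the LITTLE core below this load */

/* Decide which core of a big/LITTLE pair should run next,
 * given the core in use now and the measured load. */
enum core_type next_core(enum core_type current, int load)
{
    if (current == CORE_LITTLE && load > UP_THRESHOLD)
        return CORE_BIG;
    if (current == CORE_BIG && load < DOWN_THRESHOLD)
        return CORE_LITTLE;
    return current;  /* inside the hysteresis band: stay on the same core */
}

/* Replay a load trace, starting on the LITTLE core,
 * and print the core chosen at each step. */
void replay(const int *loads, int n)
{
    enum core_type core = CORE_LITTLE;
    for (int i = 0; i < n; i++) {
        core = next_core(core, loads[i]);
        printf("load %3d -> %s\n", loads[i],
               core == CORE_BIG ? "big" : "LITTLE");
    }
}
```

In Cluster Switching Mode the same decision would be taken once for the whole cluster, whereas in Heterogeneous Multi-Processing Mode the scheduler places each task individually instead of switching a pair.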
Software challenges
• Task scheduling
  • Operating System needs to assign tasks to specific cores
  • Utilize already available drivers for Dynamic Voltage Frequency Scaling (DVFS)
Software challenges
• Cluster Switching and CPU Migration implemented in Linux Kernel and Android OS
• Heterogeneous Multi-Processing support in development (2013)
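On Linux, user space can also restrict a task to particular cores through the CPU affinity interface. The sketch below uses the glibc wrappers `sched_getaffinity` and `sched_setaffinity`. The helper names `pin_to_cpu` and `first_allowed_cpu` are made up for illustration, and which CPU numbers correspond to big or LITTLE cores is platform-specific and not assumed here.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}

/* Return the lowest-numbered CPU the calling thread is allowed
 * to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

A program would typically call `first_allowed_cpu()` (or consult platform documentation) to pick a core and then `pin_to_cpu()` before starting its work. Affinity only restricts where the scheduler may place the task; deciding automatically which tasks belong on which core type remains the scheduler's job.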
Benchmarks
• Geekbench 3
• Higher performance with similar power consumption
Figures from Ref. [1], [8]
Applications
Conclusion
• ARM big.LITTLE technology
  • Can be combined with other power-saving techniques such as DVFS or power/clock gating
  • Cluster Switching Mode
  • CPU Migration
  • Heterogeneous Multi-Processing
• Multi-thread Operating System
• Multi-thread Core
• Multi-thread Programming
  • Pthread
• GPU Programming
• C++ AMP
Multi-thread Operating System
• Thread
  • A thread is essentially a single sequence of instructions
• Single-thread OS
  • Only one task can run at a time
  • Example: DOS
  • Low CPU utilization
• Multi-thread OS
  • Several threads can exist at the same time, which makes multitasking possible
  • Higher CPU utilization
Multi-thread Core
• Intel Hyper-Threading (HT) Technology
• Simultaneous multithreading (SMT)
• According to Intel, Hyper-Threading used only 5% more die area than a comparable non-hyper-threaded processor, while performance was 15–30% better
• In some situations the technology can reduce the performance of a physical processor or increase its power consumption
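From the software side, SMT shows up as additional logical processors. A minimal sketch, assuming a Linux or similar system where `sysconf` accepts `_SC_NPROCESSORS_ONLN` (a widely available extension, not part of core POSIX); the function names are illustrative:

```c
#include <stdio.h>
#include <unistd.h>

/* Number of logical processors currently online, as seen by the OS.
 * On an SMT/Hyper-Threading machine each hardware thread counts as one,
 * so this can be larger than the number of physical cores. */
long logical_processors(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Print the count; a natural upper bound for the number of worker threads. */
void print_logical_processors(void)
{
    printf("logical processors online: %ld\n", logical_processors());
}
```

A threaded program can use this value to choose how many worker threads to start instead of hard-coding a thread count.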
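Multi-thread Programming
Single-thread code

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define M 200
#define N 300
static int matrixA[M][N], matrixB[N][M], res[M][M];  /* inputs and M x M result (zero-initialized) */

int main(void)
{
    int i, j, k;
    for (i = 0; i < M; i++)                 /* fill both inputs with random values */
        for (k = 0; k < N; k++) { matrixA[i][k] = rand() % 10; matrixB[k][i] = rand() % 10; }
    clock_t start = clock();
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            for (k = 0; k < N; k++)
                res[i][j] += matrixA[i][k] * matrixB[k][j];   /* calculate the result */
    clock_t finish = clock();
    printf("Time used: %.2f s\n", (double)(finish - start) / CLOCKS_PER_SEC);
    return 0;
}

Multiplying two random matrices of size (200,300) and (300,200) takes about 0.07 s.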
Multi-thread Programming
Multi-thread code (Pthread)

int ids[num_p];                        /* one argument per thread; passing &i would race with the loop */
for (i = 0; i < num_p; i++) {
    ids[i] = i;
    int err = pthread_create(&tids[i], NULL, func, &ids[i]);   /* create a thread */
    if (err) {                         /* pthread_create returns an error number, it does not set errno */
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        exit(1);
    }
}
for (i = 0; i < num_p; i++)
    pthread_join(tids[i], NULL);       /* wait for all the threads */
for (i = 0; i < M; i++)
    for (j = 0; j < M; j++)
        for (k = 0; k < N; k++)
            res[i][j] += arr[i][j][k]; /* add the partial results together */

With 4 threads, multiplying two random matrices of size (200,300) and (300,200) takes about 0.02 s.
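The two reported timings can be turned into a speedup and a parallel efficiency. The arithmetic below treats both figures as elapsed (wall-clock) times, which is an assumption: `clock()` measures CPU time, and in a multi-threaded program CPU time adds up across threads.

```latex
\[
S = \frac{T_{1}}{T_{4}} = \frac{0.07\,\text{s}}{0.02\,\text{s}} = 3.5,
\qquad
E = \frac{S}{p} = \frac{3.5}{4} = 0.875
\]
```

Here $T_1$ is the single-thread time, $T_4$ the time with $p = 4$ threads, $S$ the speedup and $E$ the efficiency. A speedup of 3.5 on 4 threads is close to the ideal value of 4, although with timings rounded to two decimals the result is only approximate.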
GPU Programming
• CUDA (Compute Unified Device Architecture)
  • Introduced by NVIDIA in 2006 and described by NVIDIA as the world's first solution for general computing on GPUs
  • Special hardware architecture
  • Runs only on NVIDIA GPUs
  • Supports C/C++, C#, Python, Fortran
• OpenCL (Open Computing Language)
  • Introduced by Apple in 2008
  • Supports almost all GPUs
  • Kernels are written in a C-based language
GPU Programming

// MatrixMulKernel and TILE_WIDTH are defined elsewhere (not shown on the slide)
extern "C" void MatrixMultiplication_CUDA(const float *M, const float *N, float *P, int Width)
{
    cudaSetDevice(0);
    float *Md, *Nd, *Pd;
    size_t size = (size_t)Width * Width * sizeof(float);
    // Allocate device memory for the two inputs and the result
    cudaMalloc((void **)&Md, size);
    cudaMalloc((void **)&Nd, size);
    cudaMalloc((void **)&Pd, size);
    // Copy the input matrices from host memory to device memory
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // One thread per output element; Width is assumed to be a multiple of TILE_WIDTH
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // Copy the result back to host memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
C++ AMP
• C++ Accelerated Massive Parallelism
• Introduced by Microsoft in 2012
• Supported only by Visual Studio 11 (2012) or later
• Real heterogeneous programming: uses both CPU and GPU
• Automatically controls how many threads run in parallel
C++ AMP

array_view<const int, 2> a(M, W, vA), b(W, N, vB);   // read-only views of the inputs
array_view<int, 2> c(M, N, vC);                       // view of the result
c.discard_data();                                     // the old contents of vC need not be copied to the accelerator
parallel_for_each(c.extent, [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    int sum = 0;
    for (int i = 0; i < b.extent[0]; i++)
        sum += a(row, i) * b(i, col);
    c[idx] = sum;
});
c.synchronize();                                      // copy the result back to vC
Summary
• Heterogeneous CPU Cores
  • Circumvent the “Power Wall”
  • Next step after Homogeneous Multi-Cores
• ARM big.LITTLE is one heterogeneous solution
• Many of the challenges of Homogeneous Multi-Cores still apply
  • Finding ILP
  • Writing parallel programs
Questions?
References
[1] P. Greenhalgh, “big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7,” ARM White Paper, Sep. 2011.
[2] MediaTek, “MediaTek Enables ARM big.LITTLE™ Heterogeneous Multi-Processing Technology in Mobile SoCs,” MediaTek White Paper, 2013.
[3] “ARM Processors: Combining large and small compu... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2011/10/19/combining-large-and-small-compute-engines--arm-cortex-a7. [Accessed: 17-Feb-2014].
[4] “ARM Processors: Ten Things to Know About big.LI... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle. [Accessed: 18-Feb-2014].
[5] “big.LITTLE Processing - ARM.” [Online]. Available: http://www.arm.com/products/processors/technologies/biglittleprocessing.php. [Accessed: 15-Feb-2014].
[6] “ARM Processors: big.LITTLE and AMBA 4 ACE keep ... | ARM Connected Community.” [Online]. Available: http://community.arm.com/groups/processors/blog/2011/11/10/biglittle-and-amba-4-ace-keep-your-cache-warm-and-avoid-flushes. [Accessed: 17-Feb-2014].
[7] “CoreLink CCI-400 Cache Coherent Interconnect - ARM.” [Online]. Available: http://www.arm.com/products/system-ip/interconnect/corelink-cci-400.php. [Accessed: 15-Feb-2014].
[8] H. Chung, M. Kang, and H.-D. Cho, “Heterogeneous Multi-Processing Solution of Exynos 5 Octa with ARM® big.LITTLE™ Technology,” Samsung White Paper, Nov. 2013.
[9] A. Stevens, “Introduction to AMBA® 4 ACE™ and big.LITTLE™ Processing Technology,” ARM White Paper, http://www.arm.com, Jun. 2011.
[10] “Software Techniques for ARM big.LITTLE Systems | ARM Connected Community.” [Online]. Available: http://community.arm.com/docs/DOC-2875. [Accessed: 17-Feb-2014].
[11] “Linux support for ARM big.LITTLE [LWN.net].” [Online]. Available: http://lwn.net/Articles/481055/. [Accessed: 17-Feb-2014].
[12] “A big.LITTLE scheduler update [LWN.net].” [Online]. Available: http://lwn.net/Articles/501501/. [Accessed: 17-Feb-2014].