SMYLE: Scalable ManYcore for Low-Energy computing
Koji Inoue¹ and Masaaki Kondo²
¹Kyushu University, ²The University of Electro-Communications
Moving to the Manycore Era!
[Figure: number of cores per chip vs. year, 2002–2012 — from 2–8 cores (Intel Core, IBM POWERx, AMD Opteron, SPARC64, Sun T1/T2, STI Cell BE, Oracle T3, SPARC64 IXfx, Opteron 6200) through 16–64 cores (MIT RAW, UT TRIPS (OPN), TILERA TILE64, ClearSpeed CSX600, Intel 80-core, Intel MIC Knights Ferry / Knights Corner, Toshiba Manycore) up to 128–256 cores (CSX700, TILE-GX100).]
Why Manycore?
• A large number of simple cores enables efficient "on-chip" parallel execution (many-core vs. multi-core)
• Potential of many-core: with the same HW resource, compare a single 256-BCE core, four 64-BCE cores, and 256 1-BCE cores (a worked model follows below)
• A single BCE (Base Core Equivalent): the HW resource required to implement the smallest baseline core
M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, Vol. 41, Issue 7, pp. 33–38, July 2008
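To make the comparison concrete, here is a minimal sketch of the symmetric-multicore speedup model from the cited Hill & Marty paper, which assumes a core built from r BCEs delivers perf(r) = sqrt(r); the parallel fraction f = 0.99 is an assumption chosen for illustration.

```c
/* Sketch of Hill & Marty's symmetric-multicore model:
 * n BCEs total, cores of r BCEs each (n/r cores), perf(r) = sqrt(r). */
#include <math.h>
#include <stdio.h>

static double speedup(double f, double n, double r) {
    double perf = sqrt(r);                 /* perf of one r-BCE core */
    return 1.0 / ((1.0 - f) / perf        /* sequential part */
                  + f * r / (perf * n));  /* parallel part on n/r cores */
}

int main(void) {
    double f = 0.99;                       /* assumed parallel fraction */
    printf("1 x 256-BCE core : %5.1f\n", speedup(f, 256, 256));
    printf("4 x  64-BCE cores: %5.1f\n", speedup(f, 256, 64));
    printf("256 x 1-BCE cores: %5.1f\n", speedup(f, 256, 1));
    return 0;
}
```

With f = 0.99 this gives roughly 16x, 31x, and 72x, respectively, which is the slide's point: many simple cores win when parallelism is abundant.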
Do we have enough parallelism in a "single" multi-threaded program?
• In-order cores @ 1 GHz (1–128 cores) with private 32 KB L1 and 512 KB L2
• 2D-mesh NoC (no contention); 100 ns DRAM latency
• Simulator: Graphite (MIT); benchmarks: SPLASH-2
[Figure: speedup over a single core vs. # of cores for Barnes, LU_non_contiguous, Ocean_non_contiguous, Radix, Water-nsquared, and Water-spatial.]
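The curves above come from Graphite simulations of SPLASH-2, not from native runs; purely as an illustration of how such strong-scaling data are produced (speedup(n) = T(1)/T(n)), here is a minimal OpenMP sketch with a placeholder kernel, not SPLASH-2 code.

```c
/* Minimal strong-scaling measurement sketch (placeholder kernel). */
#include <omp.h>
#include <stdio.h>

#define N (1 << 22)
static double a[N];

static double run(int threads) {
    omp_set_num_threads(threads);
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 0.5 + 1.0;          /* stand-in parallel work */
    return omp_get_wtime() - t0;
}

int main(void) {
    double t1 = run(1);                   /* single-core baseline T(1) */
    for (int n = 1; n <= 128; n *= 2)
        printf("%3d cores: speedup %.2f\n", n, t1 / run(n));
    return 0;
}
```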
SMYLE: Flexible Execution of Multiple Multi-threaded Programs
• Each application (Application-X, Application-Y) is split per task (Task A, B, C) into host code and accelerated code
• Host code runs on host cores (P); accelerated code runs on a Virtual Accelerator on Manycore (VAM)
• VAMs are sized to the task — e.g., a 16-core, a 32-core, and a 64-core VAM carved out of the same manycore fabric (see the host-side sketch below)
[Figure: a manycore chip partitioned into 16-, 32-, and 64-core VAMs, each paired with a host core P.]
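Since SMYLE exposes accelerated code through OpenCL (next slide), one plausible host-side view is that each VAM appears as an OpenCL compute device. The following sketch uses only standard OpenCL 1.x host API calls; that a VAM enumerates as CL_DEVICE_TYPE_ACCELERATOR is an assumption for illustration, not a documented SMYLE interface.

```c
/* Host-side sketch: treating a VAM as an OpenCL accelerator device. */
#include <CL/cl.h>

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    /* Assumption: a VAM shows up as an accelerator-class device. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Host code stays on the host core; accelerated kernels would be
       built and enqueued to the VAM like any OpenCL kernel:
       clCreateProgramWithSource / clEnqueueNDRangeKernel / ... */

    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```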
Compile Flow
• C source program → CLtrump (C to OpenCL C) → host program + accelerator programs (all OpenCL C) → SMYLE OpenCL compiler → host binary, VAM configuration, VAM binaries → SMYLEref architecture
• CLtrump: C-to-OpenCL translation with user-interactive parallelization (a translation sketch follows below)
  H. Tomiyama, "OpenCL Compiler and Runtime Library for Embedded Manycore SoCs," MPSoC 2012 (Monday)
• SMYLE OpenCL compiler and run-time library: binary generation, VAM architecture synthesis
  - OpenCL support, heterogeneous executions
  - Synthesizes the VAM organization: # of cores, mapping, memory configuration, etc.
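To show the kind of loop-to-kernel translation a C-to-OpenCL translator like CLtrump performs, here is an illustrative before/after pair; this exact output is an assumption, not CLtrump's actual output.

```c
/* Original C loop in the source program (shown as a comment):
 *
 *   void scale(float *dst, const float *src, float k, int n) {
 *       for (int i = 0; i < n; i++)
 *           dst[i] = k * src[i];
 *   }
 *
 * Corresponding OpenCL C kernel: one work-item per loop iteration,
 * launched with a global size of n. */
__kernel void scale_kernel(__global float *dst,
                           __global const float *src,
                           float k)
{
    int i = get_global_id(0);   /* this work-item's iteration index */
    dst[i] = k * src[i];
}
```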
SMYLEref Architecture
• Shared memory; clusters connected by an NoC
• Private L1 caches + distributed shared L2 cache
• No hardware coherence support (see the software-handoff sketch below)
[Figure: clusters of four scalar cores, each with private IL1/DL1, sharing a distributed L2 slice behind a router and packet controller; a host cluster, memory controller (SDRAM), and I/O controller with peripherals attach to the NoC.]
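Without hardware coherence, software must manage private-L1 contents around shared-data handoffs. The sketch below illustrates that burden with a producer/consumer pair; dcache_writeback() and dcache_invalidate() are hypothetical intrinsics invented for this sketch, not part of SMYLEref's documented API.

```c
/* Sketch of software-managed coherence on private L1 caches. */
#include <stddef.h>

extern void dcache_writeback(void *addr, size_t len);  /* hypothetical */
extern void dcache_invalidate(void *addr, size_t len); /* hypothetical */

volatile int ready;            /* flag in shared (L2/DRAM) memory */
float buf[1024];

void producer(void) {
    for (int i = 0; i < 1024; i++) buf[i] = i * 0.5f;
    dcache_writeback(buf, sizeof buf);   /* push data out of private L1 */
    ready = 1;                           /* publish the flag afterwards */
}

void consumer(void) {
    while (!ready)
        ;                                /* spin on the flag */
    dcache_invalidate(buf, sizeof buf);  /* drop any stale L1 copies */
    /* ... now buf can be read safely ... */
}
```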
DCFS: Dynamic Core-count and Frequency Scaling
• Under a fixed power budget, trade core count against clock frequency: e.g., 64 cores @ low frequency, 32 cores @ mid frequency, or 16 cores @ high frequency (a selection sketch follows below)
[Figure: three power-budget diagrams splitting the same total power across 64, 32, or 16 active cores.]
S. Imamura, H. Sasaki, N. Fukumoto, K. Inoue, and K. Murakami, "Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling," 2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, Mar. 2012.
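A minimal sketch of the DCFS idea: all configurations below sit at roughly the same power budget (the iso-power pairs are the ones shown on the next slide), and the runtime picks whichever maximizes measured throughput. The perf() callback is a stand-in for the runtime's online measurement; this is not the paper's actual algorithm.

```c
/* Sketch: pick the best (cores, frequency) pair under one power budget. */
struct config { int cores; double ghz; };

static const struct config iso_power[] = {   /* from the slides */
    {64, 1.4}, {48, 1.7}, {32, 2.0}, {24, 2.3}, {18, 2.6},
};

/* Hypothetical: throughput measured by a short online probe run. */
extern double perf(int cores, double ghz);

struct config pick_best(void) {
    int n = (int)(sizeof iso_power / sizeof iso_power[0]);
    struct config best = iso_power[0];
    double best_p = perf(best.cores, best.ghz);
    for (int i = 1; i < n; i++) {
        double p = perf(iso_power[i].cores, iso_power[i].ghz);
        if (p > best_p) { best_p = p; best = iso_power[i]; }
    }
    return best;                              /* run at this config */
}
```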
DCFS: Dynamic Core-count and Frequency Scaling
• Consider the trade-off between parallelism (scalability) and frequency; optimize the # of cores and the frequency at runtime!
• Best configuration under a power constraint: blackscholes: 64 cores @ 1.4 GHz; x264: 24 cores @ 2.3 GHz; dedup: 18 cores @ 2.6 GHz
• Up to 3.7x speedup over a conventional 64-core execution, while also saving energy
[Figure: normalized performance vs. # of cores for blackscholes, x264, and dedup at the iso-power points 64@1.4 GHz, 48@1.7 GHz, 32@2.0 GHz, 24@2.3 GHz, and 18@2.6 GHz.]
Conclusions
• Introduced the SMYLE project
• The future of SoC implementation: MaPSoC — good-bye, hardware macros!
• Other research topics: manycore architecture for video mining (w/ TOPS Systems Corp.); programming environment (w/ Fixstars Solutions Inc.)
Acknowledgement
• Special thanks to the SMYLE project members
• This research was supported in part by the New Energy and Industrial Technology Development Organization (NEDO)
Flexible HW Barrier Support
• Tree-style barrier network: supports high-speed, flexible synchronization
• Barrier trees are configured per VAM, so several VAMs (e.g., VAM1–VAM4) synchronize independently
[Figure: eight clusters (0–7) connected by a barrier tree of intermediate nodes and a root node, with tree segments assigned to VAM1–VAM4.]
Flexible HW Barrier Support
• High-speed barrier operation: 66x improvement over a SW implementation (a SW-baseline sketch follows below)
• Parallel barrier execution across VAMs
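For context on the 66x number, here is a minimal sense-reversing software barrier in C11 atomics — the kind of spin-based SW baseline such hardware barriers are typically compared against, not the project's actual baseline code.

```c
/* Minimal sense-reversing centralized software barrier (C11 atomics). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;       /* threads yet to arrive this phase */
    atomic_bool sense;       /* global phase flag */
    int nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->count, n);
    atomic_init(&b->sense, false);
    b->nthreads = n;
}

/* Each thread keeps its own local_sense, flipped on every barrier. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                /* spin until the phase flips */
    }
}
```

Every thread crosses shared cache lines and spins on a global flag, which is exactly the traffic a dedicated barrier tree removes.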