SMYLE: Scalable ManYcore for Low-Energy computing

Koji Inoue (Kyushu University) and Masaaki Kondo (The University of Electro-Communications)


Presentation Transcript


  1. SMYLE: Scalable ManYcore for Low-Energy computing. Koji Inoue (1) and Masaaki Kondo (2); (1) Kyushu University, (2) The University of Electro-Communications

  2. Moving to Manycore Era!
     [Figure: number of cores per chip (log scale, 2 to 256) versus year (2002 to 2012). Chips plotted: ClearSpeed CSX600, CSX700, Intel 80-core, Toshiba Manycore, TILERA TILE64, TILE-GX100, Knights Corner, Intel MIC (Knights Ferry), UT TRIPS (OPN), MIT RAW, STI Cell BE, Oracle T3, SPARC64 IXfx, Opteron 6200, Sun T1 and T2, Intel Core, IBM PowerX, AMD Opteron, SPARC64.]

  3. Why Manycore?
     • A large number of simple cores
     • Efficient “on-chip” parallel execution
     • Potential of many-core: the same HW resource (256 BCEs) can implement a single 256-BCE core, quad 64-BCE cores (multi-core), or 256 1-BCE cores (many-core)
     • A single BCE (Base Core Equivalent) is the HW resource required to implement the smallest baseline core
     • M. Hill and M. Marty, “Amdahl’s Law in the Multicore Era,” IEEE Computer, Vol. 41, Issue 7, pp. 33-38, July 2008
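The BCE comparison on this slide can be made concrete with the symmetric-multicore speedup model from the cited paper. The sketch below is a minimal illustration and not part of the slides: it assumes the paper's modelling choice that a core built from r BCEs has sequential performance sqrt(r), and the function and parameter names are hypothetical.

```python
import math


def core_perf(r: float) -> float:
    """Sequential performance of a core built from r BCEs.

    Assumption: perf(r) = sqrt(r), the modelling choice used in the
    cited Amdahl's-law paper; it is not stated on the slide.
    """
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup over one 1-BCE core for a chip of n BCEs split into n / r cores.

    f is the parallelizable fraction of the program (0 <= f <= 1).
    The serial part runs on one r-BCE core; the parallel part uses all cores.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if r <= 0 or n % r != 0:
        raise ValueError("r must be positive and divide n")
    serial_time = (1.0 - f) / core_perf(r)
    parallel_time = f * r / (core_perf(r) * n)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # The three organizations from the slide: 256 x 1 BCE, 4 x 64 BCE, 1 x 256 BCE.
    for f in (0.5, 0.9, 0.99):
        row = [round(symmetric_speedup(f, 256, r), 1) for r in (1, 64, 256)]
        print(f, row)
```

Under this model the single 256-BCE core yields a speedup of 16 regardless of f, while the 256 one-BCE cores approach a speedup of 256 only as f approaches 1.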

  4. Do we have enough parallelism in a “single” multi-threaded program?
     • In-order cores @ 1 GHz (1 to 128 cores) with private 32 KB L1 and 512 KB L2
     • 2D mesh NoC (no contention)
     • 100 ns DRAM latency
     • Simulator: Graphite (MIT)
     • Benchmarks (SPLASH-2): Barnes, LU_non_contiguous, Water-spatial, Ocean_non_contiguous, Water-nsquared, Radix
     [Figure: speed-up over a single core versus number of cores for each benchmark.]

  5. SMYLE: Flexible Execution of Multiple Multi-threaded Programs
     • Tasks (Task A, Task B, Task C) from applications (Application-X, Application-Y) each consist of host code and accelerated code
     • VAM: Virtual Accelerator on Manycore; P: host core
     [Figure: three VAMs of different sizes (32-core, 16-core, and 64-core), each drawn as a group of cores (c) with a host core (P).]
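The idea of carving one manycore chip into several differently sized VAMs can be sketched as a simple partitioning step. This is a minimal sketch under stated assumptions, not the SMYLE runtime: the function `allocate_vams`, the contiguous first-fit policy, the 128-core chip size, and the task-to-size pairing are all hypothetical; only the VAM sizes 32, 16, and 64 come from the slide.

```python
def allocate_vams(total_cores, requests):
    """Assign each request a contiguous, non-overlapping range of core IDs.

    requests is a list of (name, core_count) pairs, served in order.
    Returns a dict mapping each name to a range of core IDs.
    Raises ValueError if a request is invalid or the chip runs out of cores.
    """
    allocation = {}
    next_free = 0
    for name, size in requests:
        if size <= 0:
            raise ValueError(f"{name}: core count must be positive")
        if next_free + size > total_cores:
            raise ValueError(f"{name}: not enough free cores")
        allocation[name] = range(next_free, next_free + size)
        next_free += size
    return allocation


if __name__ == "__main__":
    # Hypothetical pairing of tasks to the VAM sizes shown on the slide.
    vams = allocate_vams(128, [("Task A", 32), ("Task B", 16), ("Task C", 64)])
    for name, cores in vams.items():
        print(name, cores.start, cores.stop - 1)
```

A real allocator would also have to respect cluster boundaries and NoC locality, which this sketch ignores.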

  6. Compile Flow
     • CLtrump
       • C to OpenCL translation
       • User-interactive parallelization
     • OpenCL support
       • Heterogeneous executions
     • Synthesize VAM organization
       • Number of cores, mapping
       • Memory configuration
       • Etc.
     • Flow: C source program → CLtrump (C to OpenCL C) → host program (OpenCL C) and accelerator programs (OpenCL C) → SMYLE OpenCL Compiler (binary generation, VAM architecture synthesis, run-time library) → host binary, VAM configuration, VAM binary → SMYLEref architecture
     • H. Tomiyama, “OpenCL Compiler and Runtime Library for Embedded Manycore SoCs,” MPSoC 2012 (Monday)

  7. SMYLEref Architecture
     • Shared memory
     • Clusters + NoC
     • Private L1$ + distributed shared L2$
     • No hardware coherence support
     [Figure: a cluster of eight scalar cores, each with its own IL1 and DL1, around a distributed shared L2$ with a router and packet controller; a host cluster with the host; memory controller (SDRAM), I/O controller, and peripherals.]

  8. DCFS: Dynamic Core-count and Frequency Scaling
     • Three example configurations: 64 cores @ low frequency, 32 cores @ mid frequency, 16 cores @ high frequency
     [Figure: for each configuration, the chip layout of cores (c) and host cores (P), and power plotted against core ID (1 to 64, 1 to 32, and 1 to 16).]
     • S. Imamura, H. Sasaki, N. Fukumoto, K. Inoue, and K. Murakami, “Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling,” 2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, Mar. 2012.

  9. DCFS: Dynamic Core-count and Frequency Scaling
     • Best configuration under a power constraint
       • blackscholes: 64 cores @ 1.4 GHz
       • x264: 24 cores @ 2.3 GHz
       • dedup: 18 cores @ 2.6 GHz
     • Consider the trade-off between parallelism (scalability) and frequency
     • Optimize the number of cores and the frequency at runtime!
     • Up to 3.7x speed-up over a conventional 64-core execution
     • Saving energy consumption!
     [Figure: normalized performance versus number of cores for blackscholes, x264, and dedup; labeled configurations: 64@1.4GHz, 48@1.7GHz, 32@2.0GHz, 24@2.3GHz, 18@2.6GHz.]
     • S. Imamura, H. Sasaki, N. Fukumoto, K. Inoue, and K. Murakami, “Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling,” 2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, Mar. 2012.

  10. DCFS: Dynamic Core-count and Frequency Scaling
      • Best configuration under a power constraint
        • blackscholes: 64 cores @ 1.4 GHz
        • x264: 24 cores @ 2.3 GHz
        • dedup: 18 cores @ 2.6 GHz
      • Consider the trade-off between parallelism (scalability) and frequency
      • Optimize the number of cores and the frequency at runtime!
      • Up to 3.7x speed-up over a conventional 64-core execution
      • Saving energy consumption!
      [Figure: normalized performance versus number of cores, with the best configuration marked for each benchmark: blackscholes 64@1.4GHz, x264 24@2.3GHz, dedup 18@2.6GHz.]
      • S. Imamura, H. Sasaki, N. Fukumoto, K. Inoue, and K. Murakami, “Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling,” 2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, Mar. 2012.
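The trade-off between core count and frequency can be illustrated by picking, among configurations that fit one power budget, the one with the highest estimated performance. The sketch below is not the algorithm from the cited paper: the candidate list reuses the five core-count/frequency pairs labeled in the DCFS figure and assumes they all fit the same power constraint, while the performance model (Amdahl's law scaled linearly by clock frequency) and the names `estimated_performance` and `best_configuration` are hypothetical.

```python
# Candidate (core count, frequency in GHz) pairs labeled in the slide's figure.
# Assumption: all five fit within the same power constraint.
CANDIDATES = [(64, 1.4), (48, 1.7), (32, 2.0), (24, 2.3), (18, 2.6)]


def estimated_performance(cores, freq_ghz, parallel_fraction):
    """Hypothetical model: Amdahl's-law speedup scaled linearly by frequency."""
    serial_fraction = 1.0 - parallel_fraction
    return freq_ghz / (serial_fraction + parallel_fraction / cores)


def best_configuration(parallel_fraction, candidates=CANDIDATES):
    """Return the (cores, freq_ghz) pair with the highest estimated performance."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    return max(
        candidates,
        key=lambda cfg: estimated_performance(cfg[0], cfg[1], parallel_fraction),
    )


if __name__ == "__main__":
    for p in (1.0, 0.99, 0.9):
        print(p, best_configuration(p))
```

In this toy model a perfectly parallel program prefers 64 cores at 1.4 GHz, while a program with a 10% serial part prefers 18 cores at 2.6 GHz, which mirrors the qualitative difference the slides report between blackscholes and dedup. A runtime system would rely on measured behavior rather than a fixed parallel fraction.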

  11. FPGA prototyping: 32 cores (8 cores/FPGA x 4 boards)

  12. Conclusions
      • Introduction of the SMYLE project
      • Future of SoC implementation
        • MaPSoC
        • Good-bye hardware macro!
      • Other research topics
        • Manycore architecture for video mining (w/ TOPS Systems Corp.)
        • Programming environment (w/ Fixstars Solutions Inc.)

  13. Acknowledgement
      • Special thanks to the SMYLE project members
      • This research was supported in part by the New Energy and Industrial Technology Development Organization (NEDO)

  14. Flexible HW Barrier Support
      • Tree-style barrier network
      • Supports high-speed, flexible synchronization
      [Figure: barrier tree network with a root node and intermediate nodes connecting clusters 0 to 7; VAM1 is marked.]

  15. Flexible HW Barrier Support
      • Tree-style barrier network
      • Supports high-speed, flexible synchronization
      [Figure: barrier tree network with a root node and intermediate nodes connecting clusters 0 to 7; VAM1, VAM2, VAM3, and VAM4 are marked.]

  16. Flexible HW Barrier Support
      • High-speed barrier operation
      • 66x improvement over a SW implementation
      • Parallel barrier execution for VAMs
      [Figure: barrier tree network with a root node and intermediate nodes connecting clusters 0 to 7; VAM1, VAM2, VAM3, and VAM4 are marked.]
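The behavior described on the barrier slides, a tree that combines arrival signals and lets several VAMs synchronize independently, can be modelled in software as a masked AND-reduction over a binary tree. This is only a behavioral sketch under stated assumptions: the slides do not describe the hardware's internals, and the class `BarrierTree`, the binary fan-in, and the cluster-to-VAM mapping used in the example are hypothetical.

```python
class BarrierTree:
    """Software model of a masked AND-reduction tree over cluster arrival flags."""

    def __init__(self, num_clusters, vams):
        if num_clusters < 1 or num_clusters & (num_clusters - 1):
            raise ValueError("num_clusters must be a power of two")
        self.num_clusters = num_clusters
        # vams maps a VAM name to the set of cluster IDs it occupies.
        self.vams = {name: frozenset(members) for name, members in vams.items()}
        self.arrived = {name: set() for name in self.vams}

    def _subtree_done(self, name, lo, hi):
        """True if every member of the VAM among clusters lo..hi-1 has arrived."""
        if hi - lo == 1:
            # Leaf: clusters outside the VAM are masked and count as done.
            return lo not in self.vams[name] or lo in self.arrived[name]
        mid = (lo + hi) // 2
        # Intermediate node: AND of the two child subtrees.
        return self._subtree_done(name, lo, mid) and self._subtree_done(name, mid, hi)

    def arrive(self, name, cluster):
        """Record an arrival; return True if it releases the VAM's barrier."""
        if cluster not in self.vams[name]:
            raise ValueError(f"cluster {cluster} is not part of {name}")
        self.arrived[name].add(cluster)
        if self._subtree_done(name, 0, self.num_clusters):
            self.arrived[name].clear()  # reset for the next barrier episode
            return True
        return False


if __name__ == "__main__":
    # Hypothetical mapping of eight clusters to four VAMs.
    tree = BarrierTree(8, {"VAM1": {0, 1}, "VAM2": {2, 3}, "VAM3": {4, 5}, "VAM4": {6, 7}})
    print(tree.arrive("VAM1", 0), tree.arrive("VAM2", 2), tree.arrive("VAM1", 1))
```

Because each VAM keeps its own arrival state and masks out clusters it does not own, one VAM's barrier can complete while another VAM is still waiting.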
