SMYLE: Scalable ManYcore for Low-Energy computing
Koji Inoue¹ and Masaaki Kondo²
¹Kyushu University, ²The University of Electro-Communications
Moving to the Manycore Era!
[Figure: number of cores per chip vs. year, 2002–2012 — from 2–8 cores (Intel Core, IBM POWERx, AMD Opteron, SPARC64, Sun T1/T2, STI Cell BE, Oracle T3, SPARC64 IXfx, Opteron 6200) through 16–64 cores (MIT RAW, UT TRIPS (OPN), TILERA TILE64, ClearSpeed CSX600, Intel 80-core, Intel MIC Knights Ferry / Knights Corner, Toshiba Manycore) up to 128–256 cores (CSX700, TILE-GX100).]
Why Manycore?
• A large number of simple cores enables efficient "on-chip" parallel execution (many-core vs. multi-core)
• Potential of many-core: with the same HW resource, compare a single 256-BCE core, four 64-BCE cores, and 256 1-BCE cores (a worked model follows below)
• A single BCE (Base Core Equivalent): the HW resource required to implement the smallest baseline core
M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, Vol. 41, Issue 7, pp. 33–38, July 2008
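To make the comparison concrete, here is a minimal sketch of the symmetric-multicore speedup model from the cited Hill & Marty paper, which assumes a core built from r BCEs delivers perf(r) = sqrt(r); the parallel fraction f = 0.99 is an assumption chosen for illustration.

```c
/* Sketch of Hill & Marty's symmetric-multicore model:
 * n BCEs total, cores of r BCEs each (n/r cores), perf(r) = sqrt(r). */
#include <math.h>
#include <stdio.h>

static double speedup(double f, double n, double r) {
    double perf = sqrt(r);                 /* perf of one r-BCE core */
    return 1.0 / ((1.0 - f) / perf        /* sequential part */
                  + f * r / (perf * n));  /* parallel part on n/r cores */
}

int main(void) {
    double f = 0.99;                       /* assumed parallel fraction */
    printf("1 x 256-BCE core : %5.1f\n", speedup(f, 256, 256));
    printf("4 x  64-BCE cores: %5.1f\n", speedup(f, 256, 64));
    printf("256 x 1-BCE cores: %5.1f\n", speedup(f, 256, 1));
    return 0;
}
```

With f = 0.99 this gives roughly 16x, 31x, and 72x, respectively, which is the slide's point: many simple cores win when parallelism is abundant.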
Do we have enough parallelism in a "single" multi-threaded program?
• In-order cores @ 1 GHz (1–128 cores) with private 32 KB L1 and 512 KB L2
• 2D-mesh NoC (no contention); 100 ns DRAM latency
• Simulator: Graphite (MIT); benchmarks: SPLASH-2
[Figure: speedup over a single core vs. # of cores for Barnes, LU_non_contiguous, Ocean_non_contiguous, Radix, Water-nsquared, and Water-spatial.]
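The curves above come from Graphite simulations of SPLASH-2, not from native runs; purely as an illustration of how such strong-scaling data are produced (speedup(n) = T(1)/T(n)), here is a minimal OpenMP sketch with a placeholder kernel, not SPLASH-2 code.

```c
/* Minimal strong-scaling measurement sketch (placeholder kernel). */
#include <omp.h>
#include <stdio.h>

#define N (1 << 22)
static double a[N];

static double run(int threads) {
    omp_set_num_threads(threads);
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 0.5 + 1.0;          /* stand-in parallel work */
    return omp_get_wtime() - t0;
}

int main(void) {
    double t1 = run(1);                   /* single-core baseline T(1) */
    for (int n = 1; n <= 128; n *= 2)
        printf("%3d cores: speedup %.2f\n", n, t1 / run(n));
    return 0;
}
```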
SMYLE: Flexible Execution of Multiple Multi-threaded Programs
• Each application (Application-X, Application-Y) is split per task (Task A, B, C) into host code and accelerated code
• Host code runs on host cores (P); accelerated code runs on a Virtual Accelerator on Manycore (VAM)
• VAMs are sized to the task — e.g., a 16-core, a 32-core, and a 64-core VAM carved out of the same manycore fabric (see the host-side sketch below)
[Figure: a manycore chip partitioned into 16-, 32-, and 64-core VAMs, each paired with a host core P.]
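Since SMYLE exposes accelerated code through OpenCL (next slide), one plausible host-side view is that each VAM appears as an OpenCL compute device. The following sketch uses only standard OpenCL 1.x host API calls; that a VAM enumerates as CL_DEVICE_TYPE_ACCELERATOR is an assumption for illustration, not a documented SMYLE interface.

```c
/* Host-side sketch: treating a VAM as an OpenCL accelerator device. */
#include <CL/cl.h>

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    /* Assumption: a VAM shows up as an accelerator-class device. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Host code stays on the host core; accelerated kernels would be
       built and enqueued to the VAM like any OpenCL kernel:
       clCreateProgramWithSource / clEnqueueNDRangeKernel / ... */

    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```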
Compile Flow
• C source program → CLtrump (C to OpenCL C) → host program + accelerator programs (all OpenCL C) → SMYLE OpenCL compiler → host binary, VAM configuration, VAM binaries → SMYLEref architecture
• CLtrump: C-to-OpenCL translation with user-interactive parallelization (a translation sketch follows below)
  H. Tomiyama, "OpenCL Compiler and Runtime Library for Embedded Manycore SoCs," MPSoC 2012 (Monday)
• SMYLE OpenCL compiler and run-time library: binary generation, VAM architecture synthesis
  - OpenCL support, heterogeneous executions
  - Synthesizes the VAM organization: # of cores, mapping, memory configuration, etc.
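To show the kind of loop-to-kernel translation a C-to-OpenCL translator like CLtrump performs, here is an illustrative before/after pair; this exact output is an assumption, not CLtrump's actual output.

```c
/* Original C loop in the source program (shown as a comment):
 *
 *   void scale(float *dst, const float *src, float k, int n) {
 *       for (int i = 0; i < n; i++)
 *           dst[i] = k * src[i];
 *   }
 *
 * Corresponding OpenCL C kernel: one work-item per loop iteration,
 * launched with a global size of n. */
__kernel void scale_kernel(__global float *dst,
                           __global const float *src,
                           float k)
{
    int i = get_global_id(0);   /* this work-item's iteration index */
    dst[i] = k * src[i];
}
```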
SMYLEref Architecture
• Shared memory; clusters connected by an NoC
• Private L1 caches + distributed shared L2 cache
• No hardware coherence support (see the software-handoff sketch below)
[Figure: clusters of four scalar cores, each with private IL1/DL1, sharing a distributed L2 slice behind a router and packet controller; a host cluster, memory controller (SDRAM), and I/O controller with peripherals attach to the NoC.]
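Without hardware coherence, software must manage private-L1 contents around shared-data handoffs. The sketch below illustrates that burden with a producer/consumer pair; dcache_writeback() and dcache_invalidate() are hypothetical intrinsics invented for this sketch, not part of SMYLEref's documented API.

```c
/* Sketch of software-managed coherence on private L1 caches. */
#include <stddef.h>

extern void dcache_writeback(void *addr, size_t len);  /* hypothetical */
extern void dcache_invalidate(void *addr, size_t len); /* hypothetical */

volatile int ready;            /* flag in shared (L2/DRAM) memory */
float buf[1024];

void producer(void) {
    for (int i = 0; i < 1024; i++) buf[i] = i * 0.5f;
    dcache_writeback(buf, sizeof buf);   /* push data out of private L1 */
    ready = 1;                           /* publish the flag afterwards */
}

void consumer(void) {
    while (!ready)
        ;                                /* spin on the flag */
    dcache_invalidate(buf, sizeof buf);  /* drop any stale L1 copies */
    /* ... now buf can be read safely ... */
}
```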
DCFS: Dynamic Core-count and Frequency Scaling
• Under a fixed power budget, trade core count against clock frequency: e.g., 64 cores @ low frequency, 32 cores @ mid frequency, or 16 cores @ high frequency (a selection sketch follows below)
[Figure: three power-budget diagrams splitting the same total power across 64, 32, or 16 active cores.]
S. Imamura, H. Sasaki, N. Fukumoto, K. Inoue, and K. Murakami, "Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling," 2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, Mar. 2012.
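A minimal sketch of the DCFS idea: all configurations below sit at roughly the same power budget (the iso-power pairs are the ones shown on the next slide), and the runtime picks whichever maximizes measured throughput. The perf() callback is a stand-in for the runtime's online measurement; this is not the paper's actual algorithm.

```c
/* Sketch: pick the best (cores, frequency) pair under one power budget. */
struct config { int cores; double ghz; };

static const struct config iso_power[] = {   /* from the slides */
    {64, 1.4}, {48, 1.7}, {32, 2.0}, {24, 2.3}, {18, 2.6},
};

/* Hypothetical: throughput measured by a short online probe run. */
extern double perf(int cores, double ghz);

struct config pick_best(void) {
    int n = (int)(sizeof iso_power / sizeof iso_power[0]);
    struct config best = iso_power[0];
    double best_p = perf(best.cores, best.ghz);
    for (int i = 1; i < n; i++) {
        double p = perf(iso_power[i].cores, iso_power[i].ghz);
        if (p > best_p) { best_p = p; best = iso_power[i]; }
    }
    return best;                              /* run at this config */
}
```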
DCFS: Dynamic Core-count and Frequency Scaling
• Consider the trade-off between parallelism (scalability) and frequency; optimize the # of cores and the frequency at runtime!
• Best configuration under a power constraint: blackscholes: 64 cores @ 1.4 GHz; x264: 24 cores @ 2.3 GHz; dedup: 18 cores @ 2.6 GHz
• Up to 3.7x speedup over a conventional 64-core execution, while also saving energy
[Figure: normalized performance vs. # of cores for blackscholes, x264, and dedup at the iso-power points 64@1.4 GHz, 48@1.7 GHz, 32@2.0 GHz, 24@2.3 GHz, and 18@2.6 GHz.]
Conclusions
• Introduced the SMYLE project
• The future of SoC implementation: MaPSoC — good-bye, hardware macros!
• Other research topics: manycore architecture for video mining (w/ TOPS Systems Corp.); programming environment (w/ Fixstars Solutions Inc.)
Acknowledgement
• Special thanks to the SMYLE project members
• This research was supported in part by the New Energy and Industrial Technology Development Organization (NEDO)
Flexible HW Barrier Support
• Tree-style barrier network: supports high-speed, flexible synchronization
• Barrier trees are configured per VAM, so several VAMs (e.g., VAM1–VAM4) synchronize independently
[Figure: eight clusters (0–7) connected by a barrier tree of intermediate nodes and a root node, with tree segments assigned to VAM1–VAM4.]
Flexible HW Barrier Support
• High-speed barrier operation: 66x improvement over a SW implementation (a SW-baseline sketch follows below)
• Parallel barrier execution across VAMs
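For context on the 66x number, here is a minimal sense-reversing software barrier in C11 atomics — the kind of spin-based SW baseline such hardware barriers are typically compared against, not the project's actual baseline code.

```c
/* Minimal sense-reversing centralized software barrier (C11 atomics). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;       /* threads yet to arrive this phase */
    atomic_bool sense;       /* global phase flag */
    int nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->count, n);
    atomic_init(&b->sense, false);
    b->nthreads = n;
}

/* Each thread keeps its own local_sense, flipped on every barrier. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                /* spin until the phase flips */
    }
}
```

Every thread crosses shared cache lines and spins on a global flag, which is exactly the traffic a dedicated barrier tree removes.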