Exploiting Both Pipelining and Data Parallelism with SIMD RA Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science & Technology) ARC, March 21, 2012, Hong Kong
Reconfigurable Architecture (image source: ChipDesignMag.com) • Reconfigurable architecture • High performance • Flexible (cf. ASIC) • Energy efficient (cf. GPU)
Coarse-Grained Reconfigurable Architecture (examples: ADRES, MorphoSys) <figure: main processor coupled with a CGRA, main memory, and a DMA controller> • Coarse-grained RA • Word-level granularity • Dynamic reconfigurability • Simpler to compile • Execution model
Application Mapping <compilation flow: application → front-end → IR → partitioner; sequential code goes through conventional C compilation to assembly, while loops go through DFG generation and place & route (guided by the architecture parameters) to produce a configuration; an extended assembler emits the executable plus configuration> • Place and route the DFG on the PE array mapping space • Several constraints must be satisfied • Nodes must be mapped to PEs that provide the required functionality • Data transfer between mapped nodes must be guaranteed (routable) • Resource consumption should be minimized for performance
Software Pipelining • Modulo scheduling-based mapping • II: Initiation Interval <figure: a 5-node DFG over A[i], B[i], and C[i] mapped onto PE0-PE3, repeating every II = 2 cycles>
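A minimal illustration (my own sketch, not the authors' scheduler) of the resource-constrained lower bound on II that a modulo scheduler starts from:

#include <stdio.h>

/* Resource-bound minimum II: the DFG's operation count divided by the
   number of PEs, rounded up; a modulo scheduler starts at this II and
   increases it until a valid schedule is found. */
static int res_min_ii(int num_ops, int num_pes) {
    return (num_ops + num_pes - 1) / num_pes;   /* ceiling division */
}

int main(void) {
    /* The slide's example: a 5-node DFG (nodes 0-4) on a 2x2 array (PE0-PE3). */
    printf("ResMII = %d cycles\n", res_min_ii(5, 4));   /* prints 2, matching II = 2 */
    return 0;
}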
Problem: Scalability • Several problems arise on a large-scale CGRA • Lack of parallelism • Limited ILP in general applications • Configuration size (when unrolling) • Very large mapping space to search during placement and routing • Skyrocketing compilation time • As a result, CGRAs remain at 4x4, or 8x8 at the most
Overview • Background • SIMD Reconfigurable Architecture (SIMD RA) • Mapping on SIMD RA • Evaluation
SIMD Reconfigurable Architecture • Consists of multiple identical parts, called cores • Identical so that configurations can be reused • At least one load-store PE in each core <figure: Core 1-Core 4 connected to Bank1-Bank4 through a crossbar switch>
Advantages of SIMD RA • More iterations executed in parallel • Performance scales with the PE array size • Short compilation time thanks to the small mapping space • Achieves a denser scheduled configuration • Higher utilization and performance • Requirement: the loop must not have a loop-carried dependence <figure: four small cores run iterations 0-11 four at a time, versus one large core running iterations 0-5 one at a time>
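A rough back-of-envelope model (my own sketch, not a figure from the slides) of why performance scales with the number of cores when iterations are independent:

#include <stdio.h>

/* With C identical cores each running the same modulo-scheduled kernel,
   a loop of T independent iterations takes roughly ceil(T/C) * II cycles,
   ignoring prologue/epilogue and memory stalls. */
static int approx_cycles(int trip_count, int num_cores, int ii) {
    int iters_per_core = (trip_count + num_cores - 1) / num_cores;
    return iters_per_core * ii;
}

int main(void) {
    /* 12 independent iterations, II = 2: one core vs. four cores. */
    printf("1 core : ~%d cycles\n", approx_cycles(12, 1, 2));   /* ~24 */
    printf("4 cores: ~%d cycles\n", approx_cycles(12, 4, 2));   /* ~6  */
    return 0;
}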
Overview • Background • SIMD Reconfigurable Architecture (SIMD RA) • Bank Conflict Minimization in SIMD RA • Evaluation
Problems of SIMD RA Mapping • A new mapping problem: iteration-to-core mapping • The iteration mapping affects performance • It is tied to the data mapping • It affects the number of bank conflicts • Example: 15 iterations of for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; } distributed over Core 1-Core 4
Mapping Schemes • Example loop: for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; } • Iteration-to-core mapping • Sequential: iterations 0-3 → core 1, 4-7 → core 2, 8-11 → core 3, 12-14 → core 4 • Interleaved: iterations 0,4,8,12 → core 1; 1,5,9,13 → core 2; 2,6,10,14 → core 3; 3,7,11 → core 4 • Data mapping (across the banks behind the crossbar switch) • Sequential: consecutive elements fill each bank in turn (A[0], A[1], A[2], ... followed by B[0], B[1], ...) • Interleaved: elements are spread word-by-word across the banks (A[0], A[4], A[8], A[12] in one bank, A[1], A[5], A[9], A[13] in the next, and so on)
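A small sketch of the two iteration-to-core assignments for the 15-iteration example on four cores (the function names and constants are mine, chosen to mirror the slide):

#include <stdio.h>

#define NUM_CORES 4
#define TRIP 15

/* Sequential: consecutive iterations go to the same core (0-3 -> core 0, ...). */
static int core_sequential(int i) {
    int per_core = (TRIP + NUM_CORES - 1) / NUM_CORES;   /* 4 iterations per core */
    return i / per_core;
}

/* Interleaved: iteration i goes to core i mod 4 (0,4,8,12 -> core 0, ...). */
static int core_interleaved(int i) {
    return i % NUM_CORES;
}

int main(void) {
    for (int i = 0; i < TRIP; i++)
        printf("iter %2d: sequential -> core %d, interleaved -> core %d\n",
               i, core_sequential(i), core_interleaved(i));
    return 0;
}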
Interleaved Data Placement • With interleaved data placement, interleaved iteration assignment is better than sequential iteration assignment • Weak for strided accesses (e.g., a configuration with Load A[2i] instead of Load A[i]), as sketched below • Reduces the number of utilized banks • Increases bank conflicts <figure: interleaved iterations (0,4,8,12 → core 1, ...) accessing word-interleaved banks through the crossbar switch>
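A sketch of why word-interleaved placement is weak for strided accesses (assuming four banks and element-granularity interleaving; the access pattern and counts are only illustrative):

#include <stdio.h>

#define NUM_BANKS 4

/* Interleaved placement: element A[k] lives in bank k mod 4. */
static int bank_of(int element_index) {
    return element_index % NUM_BANKS;
}

/* Count how many distinct banks a strided access pattern touches. */
static void count_banks(const char *label, int stride) {
    int used[NUM_BANKS] = {0};
    for (int i = 0; i < 16; i++)
        used[bank_of(i * stride)] = 1;
    int n = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        n += used[b];
    printf("%s: %d of %d banks used\n", label, n, NUM_BANKS);
}

int main(void) {
    count_banks("unit stride (A[i]) ", 1);   /* all 4 banks -> few conflicts      */
    count_banks("stride 2    (A[2i])", 2);   /* only 2 banks -> more bank conflicts */
    return 0;
}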
Sequential Data Placement • Does not work well with SIMD mapping as-is • Causes frequent bank conflicts • Data tiling • i) modify the array base addresses, or • ii) rearrange the data in the local memory • Sequential iteration assignment with data tiling suits SIMD mapping <figure: after tiling, A[0..3] and B[0..3] sit in bank 1, A[4..7] and B[4..7] in bank 2, and so on, so each core accesses only its own bank>
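A sketch of the tiled layout under sequential iteration assignment (my own illustration, assuming four banks and four iterations per core): each core's elements end up in a single bank, so the cores never contend.

#include <stdio.h>

#define NUM_BANKS 4
#define TILE      4   /* iterations (and elements) handled per core */

/* After tiling, element A[i] resides in bank i / TILE, i.e., the bank
   owned by the core that will execute iteration i. */
static int tiled_bank(int i) {
    return i / TILE;
}

int main(void) {
    for (int i = 0; i < 15; i++)
        printf("A[%2d] -> bank %d (used only by core %d)\n",
               i, tiled_bank(i), tiled_bank(i));
    return 0;
}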
Summary of Mapping Combinations • Two out of the four combinations have strong advantages • Interleaved iteration, interleaved data mapping • Weak for strided accesses • Simple data management • Sequential iteration, sequential data mapping (with data tiling) • More robust against bank conflicts • Data rearranging overhead
Experimental Setup • Sets of loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks • Target system • Two CGRA sizes: 4x4 and 8x4 • 2x2 cores, each with one load-store PE and one multiplier PE • Mesh + diagonal connections between PEs • Full crossbar switch between PEs and local memory banks • Compared against non-SIMD mapping • Original: previous non-SIMD mapping • SIMD: our approach (interleaved iteration, interleaved data mapping)
Configuration Size • Reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA
Runtime • Improved by 29% and 32% for the two CGRA sizes
Conclusion • Presented the SIMD reconfigurable architecture • Exploits data parallelism and instruction-level parallelism at the same time • Advantages of the SIMD reconfigurable architecture • Scales well to a large number of PEs • Alleviates the increase in compilation time • Increases performance and reduces configuration size
Core Size • For a large loop, a small core might not be a good match • Merge multiple cores ⇒ macrocore • No hardware modification required <figure: Core 1 and Core 2 merged into Macrocore 1, Core 3 and Core 4 into Macrocore 2, above Bank1-Bank4 and the crossbar switch>
SIMD RA Mapping Flow • Check the SIMD requirement (on failure, fall back to traditional mapping) • Select the core size • Iteration mapping: Int-Int or Seq-Tiling • Array placement: implicit, or explicit data tiling • Operation mapping: modulo scheduling • If scheduling fails, increase II and repeat • If scheduling fails and II exceeds MaxII, increase the core size (see the sketch below)
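A runnable sketch of this flow (the scheduler is a stub, and every function name and success condition is my own placeholder, not the authors' tool):

#include <stdbool.h>
#include <stdio.h>

/* Stub SIMD check: the loop must have no loop-carried dependence. */
static bool loop_is_simd_friendly(void) { return true; }

/* Stub modulo scheduler: placeholder success condition for illustration. */
static bool modulo_schedule(int core_size, int ii) {
    return core_size * ii >= 6;
}

int main(void) {
    const int max_ii = 4;
    if (!loop_is_simd_friendly()) {
        printf("fall back to traditional (non-SIMD) mapping\n");
        return 0;
    }
    for (int core_size = 1; core_size <= 4; core_size *= 2) {   /* core -> macrocore */
        /* iteration mapping and data tiling (Int-Int or Seq-Tiling) chosen here */
        for (int ii = 1; ii <= max_ii; ii++) {                  /* on failure, increase II */
            if (modulo_schedule(core_size, ii)) {
                printf("mapped: core size %d, II = %d\n", core_size, ii);
                return 0;
            }
        }
        /* II exceeded MaxII: increase the core size and retry */
    }
    printf("fall back to traditional (non-SIMD) mapping\n");
    return 0;
}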