1 / 22

Exploiting Both Pipelining and Data Parallelism with SIMD RA

Exploiting Both Pipelining and Data Parallelism with SIMD RA. Yongjoo Kim *, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo * and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science & Technology). ARC March 21, 2012 Hong Kong.

Download Presentation

Exploiting Both Pipelining and Data Parallelism with SIMD RA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Exploiting Both Pipelining and Data Parallelism with SIMD RA Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, IngooHeo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science & Technology) ARC March 21, 2012Hong Kong

  2. Reconfigurable Architecture Source: ChipDesignMag.com • Reconfigurable architecture • High performance • Flexible • Cf. ASIC • Energy efficient • Cf. GPU

  3. Coarse-Grained Reconfigurable Architecture ADRES Main Processor CGRA MainMemory DMA Controller MorphoSys • Coarse-Grained RA • Word-level granularity • Dynamic reconfigurability • Simpler to compile • Execution model

  4. DFG generation Place & Route Application Mapping Application Front-end IR Arch Param. Partitioner Seq Code Loops Mapping for CGRA Conventional Ccompilation <DFG> <CGRA> Assembly Configuration Extended assembler Exec. + Config. • Place and route DFG on the PE array mapping space • Should satisfy several constraints • Should map nodes on the PE which have a right functionality • Data transfer between nodes should be guaranteed • Resource consumption should be minimized for performance

  5. Software Pipelining • Modulo scheduling-based mapping A[i] B[i] 0 1 C[i] 3 2 4 0 1 3 2 0 0 1 1 4 PE1 PE0 II = 2 cycles 3 3 2 2 PE2 PE3 4 4 II : Initiation Interval time

  6. Problem - Scalability • Suffer several problems in a large scale CGRA • Lack of parallelism • Limited ILP in general applications • Configuration size(in unrolling case) • Search a very large mapping space for placement and routing • Skyrocketing compilation time CGRAs remain at 4x4 or 8x8 at the most.

  7. Overview Background SIMD Reconfigurable Architecture (SIMD RA) Mapping on SIMD RA Evaluation

  8. SIMD Reconfigurable Architecture • Consists of multiple identical parts, called cores • Identical for the reuse of configurations • At least one load-store PE in each core Core 1 Core 2 Core 3 Core 4 Crossbar Switch Bank1 Bank2 Bank3 Bank4

  9. Large Core Advantages of SIMD-RA Large Core • More iterations executed in parallel • Scale with the PE array size • Short compilation time thanks to small mapping space • Archive denser scheduled configuration • Higher utilization and performance. • Loop must not have loop-carried dependence. time Core 1 Core 2 Core 3 Core 4 Iter. 9 Iter. 6 Iter. 3 Iter. 0 Iter. 4 Iter. 10 Iter. 7 Iter. 1 Iteration 3 Iteration 0 Iter. 11 Iteration 1 Iter. 8 Iteration 5 Iter. 5 Iteration 4 Iteration 2 Iter. 2 Core 1 Core 2 Core 3 Core 4 time

  10. Overview Background SIMD Reconfigurable Architecture (SIMD RA) Bank Conflict Minimization in SIMD RA Evaluation

  11. Problems of SIMD RA mapping • New mapping problem • Iteration-to-core mapping • Iteration mapping affects on the performance • related with a data mapping • affect the number of bank conflicts 15 iterations for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; } Core 1 Core 2 Core 3 Core 4

  12. Crossbar Switch Mapping schemes A[0] A[4] A[8] A[12] B[1] B[5] B[9] B[13] B[0] B[1] B[2] B[3] B[4] B[5] B[13] B[14] A[0] A[1] A[2] A[3] A[4] A[5] A[13] A[14] A[3] A[7] A[11] B[0] B[4] B[8] B[12] A[2] A[6] A[10] A[14] B[3] B[7] B[11] A[1] A[5] A[9] A[13] B[2] B[6] B[10] B[14] Iteration-to-core mapping Data mapping … … Iter. 0-3 Iter. 4-7 for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; } Iter. 8-11 Iter. 12-14 Crossbar Switch < Sequential > < Sequential > Iter. 0,4,8,12 Iter. 1,5,9,13 Iter. 2,6,10,14 Iter. 3,7,11 < Interleaving > < Interleaving >

  13. Interleaving data placement Iter. 0-3 Iter. 4-7 A[3] A[7] A[11] B[0] B[4] B[8] B[12] A[2] A[6] A[10] A[14] B[3] B[7] B[11] A[1] A[5] A[9] A[13] B[2] B[6] B[10] B[14] A[0] A[4] A[8] A[12] B[1] B[5] B[9] B[13] • With interleaving data placement, interleaved iteration assignment is better than sequential iteration assignment. • Weak in stride accesses • reduce the number of utilized banks, • increase bank conflicts Configuration Load A[2i] Load A[i] … Iter. 8-11 Iter. 12-14 … … Crossbar Switch Iter. 0,4,8,12 Iter. 1,5,9,13 Iter. 2,6,10,14 Iter. 3,7,11

  14. Crossbar Switch Sequential data placement Iter. 0-3 Iter. 4-7 A[0] A[1] A[2] A[3] B[0] B[1] B[2] B[3] A[4] A[5] A[6] A[7] B[4] B[5] B[6] B[7] A[8] A[9] A[10] A[11] B[8] B[9] B[10] B[11] A[12] A[13] A[14] B[12] B[13] B[14] A[0] A[1] A[2] A[3] A[4] A[5] A[13] A[14] B[0] B[1] B[2] B[3] B[4] B[5] B[13] B[14] • Cannot work well with SIMD mapping • Cause frequent bank conflicts • Data tiling • i) array base address modification • ii) rearranging data on the local memory. • Sequential iteration assignment with data tiling suits for SIMD mapping Configuration Load A[i] … Iter. 8-11 Iter. 12-14 … … … … Crossbar Switch Iter. 0,4,8,12 Iter. 1,5,9,13 Iter. 2,6,10,14 Iter. 3,7,11 14

  15. Summary of Mapping Combinations Analysis • Two out of the four combinations have strong advantages • Interleaved iteration, interleaved data mapping • Weak in accesses with stride • Simple data management • Sequential iteration, sequential data mapping (with data tiling) • More robust against bank conflict • Data rearranging overhead

  16. Experimental Setup • Sets of loop kernels from OpenCV, multimedia and SPEC2000 benchmarks • Target system • Two CGRA sizes – 4x4, 8x4 • 2x2 core with one load-store PE and one multiplier PE • Mesh + diagonal connections between PEs • Full crossbar switch between PEs and local memory banks • Compared with non-SIMD mapping • Original : non-SIMD previous mapping • SIMD : Our approach (interleaving-interleaving mapping)

  17. Configuration Size reduced by 61% in 4x4 CGRA, 79% in 8x4 CGRA

  18. Runtime 29% 32%

  19. Conclusion • Presented SIMD reconfigurable architecture • Exploit data parallelism and instruction level parallelism at the same time • Advantages of SIMD reconfigurable architecture • Scale the large number of PEs well • Alleviate increasing compilation time • Increase performance and reduce configuration size

  20. Thank you!

  21. Core size • In a large loop case, • small core might not be a good match • Merge multiple cores ⇒Macrocore • No HW modification require Core 1 Macrocore 1 Core 2 Macrocore 2 Core 3 Core 4 Crossbar Switch Bank1 Bank2 Bank3 Bank4

  22. SIMD RA mapping flow Check SIMD Requirement Traditional Mapping Fail Select Core Size Iteration Mapping Int-Int Seq-Tiling Array Placement (Implicit) Data Tiling Operation Mapping Modulo Scheduling If scheduling fails, increase II and repeat. If scheduling fails and MaxII<II, increase core size.

More Related