SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park1, Sangwon Seo2, Hyunchul Park3, Hyoun Kyu Cho1, and Scott Mahlke1
March 6, 2012
• 1University of Michigan, Ann Arbor
• 2Qualcomm Incorporated, San Diego, CA
• 3Programming Systems Lab, Intel Labs, Santa Clara, CA
Convergence of Functionalities
[Figure: anatomy of an iPhone; 4G wireless, audio, video, 3D, and navigation functionality converging onto a flexible accelerator]
• Convergence of functionalities demands a flexible solution
• Applications have different characteristics
SIMD: Attractive Alternative to ASICs
• Suitable for running wireless and multimedia applications on future embedded systems
• Advantages
  • High throughput
  • Low fetch-decode overhead
  • Easy to scale
• Disadvantages
  • Hard to realize high resource utilization
  • High SIMDization overhead
• Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC architecture, etc.
[Chart: example SIMD machine at 100 MOps/mW; energy-efficiency comparison against VLIW (5.6x) and SIMD FUs (2x)]
Under-utilization on Wide SIMD
[Chart: execution-time distribution at different SIMD widths for AAC, 3D, and H.264 on a 16-way SIMD machine; resource utilization ranges from full to under-utilized]
• Multimedia applications have various natural SIMD widths
• SIMD width characterization of innermost loops (Intel compiler rule), inside and across applications
• How to use idle SIMD resources?
Traditional Solutions for Under-utilization
• Dynamic power gating
  • Selectively cut off unused SIMD lanes
  • Effective dynamic and leakage power savings
  • Transition time and power overhead
  • High area overhead
• Thread-level parallelism
  • Execute multiple threads having separate data
  • Different instruction flows
  • Input-dependent control flow
  • High memory pressure
[Figure: threads 1-4 mapped onto SIMD lane groups, with unused groups switched off]
Objective of This Work
• Beyond loop-level SIMD: put idle SIMD lanes to work
• Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient
• Possible SIMD instructions inside a vectorized basic block
  • Perform the same work
  • Same data flow
  • More than 50% of total instructions have some opportunities
• Challenges
  • High data-movement overhead between lanes
  • Hard to find the best instruction-packing combination
Partial SIMD Opportunity
for (it = 0; it < 4; it++) {
  i = a[it] + b[it];
  j = c[it] + d[it];
  k = e[it] + f[it];
  l = g[it] + h[it];
  m = i + j;
  n = k + l;
  result[it] = m + n;
}
1. Loop-level SIMDization: the four loop iterations occupy SIMD lanes 0-3 (+4 lanes)
2. Partial SIMDization: the four independent adds of each iteration fill the remaining lanes 4-15 (+12 lanes)
[Figure: occupancy of the 16 SIMD lanes under the two schemes]
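The lane occupancy above can be sketched in a few lines. This is a hypothetical lane model in Python, not the paper's implementation: `vector_add` stands in for one wide SIMD add, and `partial_simd` lays out the 16 independent adds of the loop (4 iterations x 4 adds) across 16 lanes so they complete in a single vector operation.

```python
def vector_add(lhs, rhs):
    """One wide SIMD add: every lane computes independently."""
    return [x + y for x, y in zip(lhs, rhs)]

def partial_simd(a, b, c, d, e, f, g, h):
    """Hypothetical 16-lane layout: lane = it * 4 + op."""
    lhs = [src[it] for it in range(4) for src in (a, c, e, g)]
    rhs = [src[it] for it in range(4) for src in (b, d, f, h)]
    s = vector_add(lhs, rhs)  # all 16 adds in one vector step
    # Reduction tree per iteration (itself SIMDizable across iterations).
    return [(s[4 * it] + s[4 * it + 1]) + (s[4 * it + 2] + s[4 * it + 3])
            for it in range(4)]
```

The result matches the scalar loop `result[it] = ((a+b)+(c+d)) + ((e+f)+(g+h))` for each iteration, but the eight vector adds of plain loop-level SIMDization collapse to two on a machine wide enough to hold the packed lanes.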
Subgraph-Level Parallelism (SGLP)
• Data-level parallelism between identical subgraphs
  • SIMDizable operators
  • Isomorphic dataflow
  • No dependencies on each other
• Advantages
  • Minimized overhead: no inter-lane data movement inside a subgraph
  • High instruction-packing gain: multiple instructions inside a subgraph increase the packing gain
[Figure: cost vs. gain for candidate subgraph pairings]
Example: Program Order (2-degree), FFT Kernel
[Figure: FFT dataflow packed in program order across two SIMD lanes; eight inter-lane moves are required between the packed operation pairs]
Gain: 1 = 9 (SIMD) − 8 (overhead)
Example: SGLP (2-degree), FFT Kernel
[Figure: the same FFT dataflow packed by identical subgraphs; only two inter-lane moves remain]
Gain: 7 = 9 (SIMD) − 2 (overhead)
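Both FFT examples apply one formula: the net gain of a packing is the number of instructions saved by SIMD pairing minus the inter-lane moves it introduces. A minimal sketch (the function name is illustrative; the counts are read off the two examples):

```python
def sglp_gain(simd_saved, inter_lane_moves):
    """Net benefit of a packing: SIMD savings minus data-movement overhead."""
    return simd_saved - inter_lane_moves

# 2-degree FFT kernel: both packings save 9 instructions, but
# program-order pairing needs 8 inter-lane moves, SGLP only 2.
program_order_gain = sglp_gain(9, 8)  # 1
sglp_gain_value = sglp_gain(9, 2)     # 7
```

This is why SGLP prefers packing whole identical subgraphs: data flowing inside a subgraph stays in its lane, so almost all of the SIMD savings survive as net gain.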
Compilation Overview
• Inputs: application and hardware information
• Loop unrolling & vectorization → loop-level vectorized basic block
• Dataflow generation → dataflow graph
• 1. Subgraph identification → identical subgraphs
• 2. SIMD lane assignment → lane-assigned subgraphs
• 3. Code generation
1. Subgraph Identification
• Heuristic discovery
  • Grow subgraphs from seed nodes and find identical subgraphs
• Additional conditions over traditional subgraph search
  • Corresponding operators are identical
  • Operand types must match: register/constant
  • No inter-subgraph dependencies
[Figure: example dataflow graph with inputs a-h and constants 256, 2, 1 feeding multiply and add nodes into result]
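The three conditions can be sketched as a predicate over a dataflow graph. This is an illustrative Python sketch, not the paper's code: the graph encoding (`op`, `kinds`, `preds`) is hypothetical, the seed-growing heuristic itself is not shown, and the dependence check covers only direct edges where a real pass would check transitively.

```python
def depends(g, src_ids, dst_ids):
    """True if any node in dst_ids directly reads a value produced in src_ids.
    (Simplified: a real pass would check transitive dependences.)"""
    return any(p in src_ids for n in dst_ids for p in g[n]["preds"])

def identical(g, sub_a, sub_b):
    """Can two candidate subgraphs (node-id lists in matched order) be packed?"""
    if len(sub_a) != len(sub_b):
        return False
    for x, y in zip(sub_a, sub_b):
        if g[x]["op"] != g[y]["op"]:          # corresponding operators identical
            return False
        if g[x]["kinds"] != g[y]["kinds"]:    # operand types match (reg/const)
            return False
    # No inter-subgraph dependencies in either direction.
    return (not depends(g, set(sub_a), sub_b)
            and not depends(g, set(sub_b), sub_a))
```

For example, two independent adds with register operands pack together, while an add and its own consumer do not, and neither does a pair whose operand kinds differ (register vs. constant).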
2. SIMD Lane Assignment
• Select subgraphs to be packed and assign them to SIMD lane groups
  • Pack the maximum number of instructions with minimum overhead
  • Safe parallel execution without dependence violations
  • Criteria: gain, affinity, partial order
• 1. Subgraph gain: assign lanes in decreasing order of gain (e.g., gain: A > B > C > D)
• 2. Affinity: data movement between different subgraphs, based on producer/consumer and common producer/consumer relations; the affinity value counts how many related operations exist between subgraphs; assign the lane group with the highest affinity (e.g., B0 is closer to A0 than to A1)
• 3. Partial-order check: the partial order of identical subgraphs inside the SIMD lanes must be the same (e.g., opposite orderings of C0 and C1 across lane groups conflict)
[Figure: subgraphs A-D placed on lane groups 0-3 and 4-7 over time; a conflicting partial order of C0/C1 is rejected]
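A deliberately simplified greedy sketch of the first two criteria follows (Python; all names and data structures are illustrative, not the paper's). It visits subgraphs in decreasing gain order and places each on the lane group it has the highest affinity with; the partial-order check, which would reject conflicting placements, is only noted in a comment.

```python
def assign_lanes(gains, affinity, lane_groups):
    """Greedy placement sketch.
    gains:       {subgraph_name: packing gain}
    affinity:    {(subgraph_name, lane_group): affinity score}
    lane_groups: list of available lane-group ids
    (A full pass would also verify that identical subgraphs keep the
    same partial order across lane groups before accepting a placement.)
    """
    placement = {}
    for name in sorted(gains, key=gains.get, reverse=True):
        placement[name] = max(lane_groups,
                              key=lambda lg: affinity.get((name, lg), 0))
    return placement
```

With gains A > B > C and affinities pulling B toward A's lane group, B lands next to its producer and avoids an inter-lane move, which is exactly what the affinity criterion is for.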
Experimental Setup
• 144 loops from industry-level optimized media applications
  • AAC decoder (MPEG-4 audio decoding, low-complexity profile)
  • H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF)
  • 3D (3D graphics rendering)
• Target architecture: wide vector machines
  • SIMD width: 16-64
  • SODA-style wide vector instruction set
  • Single-cycle data shuffle instruction (≈ vperm (VMX), vec_perm (AltiVec))
• IMPACT frontend compiler + cycle-accurate simulator
• Compared to two other solutions
  • SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00]
  • ILP: instruction-level parallelism on equally wide VLIW machines
• Apply 2- to 4-degree SGLP
Static Performance
[Chart: static speedup of SLP, SGLP, and ILP for AAC, H.264, 3D, and their average]
• SGLP retains a trend similar to ILP after overhead is considered
• Max 1.66x at 4-way (SLP: 1.27x)
• See the paper for representative kernels (FFT, DCT, HalfPel, …)
Dynamic Performance on SIMD
[Chart: dynamic speedup of SLP and SGLP for AAC, H.264, 3D, and their average]
• The available degree of SGLP is exploited only when the natural SIMD width is insufficient (up to 4-way)
• Max 1.76x speedup (SLP: 1.29x)
Energy @ H.264 Execution
• 200 MHz, IBM 65 nm technology
[Figure: SIMD vs. VLIW configurations built from control units and 8-wide SIMD datapaths; the SIMD configuration is 30% more energy efficient]
Conclusion
• SIMD is an energy-efficient solution for mobile systems.
• SIMD programming of multimedia applications is a challenging problem due to the varying degrees of SIMD parallelism.
• Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside vectorized basic blocks.
• SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.
Questions?
• For more information: http://cccp.eecs.umich.edu
Example 2: High-level View
[Figure: kernels 0-2 with natural SIMD widths 8, 4, and 8 scheduled over lane groups 0-3 and 4-7 across time; during kernel 1, subgraphs A1 and C1 run in the otherwise idle lanes 4-7]
Gain = (A1 + C1) (SIMD) − ((A1→B) + (C1→D)) (overhead)
Static Performance
[Chart: static speedup per kernel (FFT and MDCT from AAC; MatMul4x4 and MatMul3x3 from 3D; HalfPel and QuarterPel from H.264) and per-application averages]
• Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3)
• SGLP retains a trend similar to ILP after overhead is considered
• Max 1.66x at 4-way (SLP: 1.27x)