SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park1, Sangwon Seo2, Hyunchul Park3, Hyoun Kyu Cho1, and Scott Mahlke1
March 6, 2012
• 1University of Michigan, Ann Arbor
• 2Qualcomm Incorporated, San Diego, CA
• 3Programming Systems Lab, Intel Labs, Santa Clara, CA
Convergence of Functionalities
[Figure: anatomy of an iPhone; 4G wireless, audio, video, 3D, and navigation functionality converging onto a flexible accelerator]
• Convergence of functionalities demands a flexible solution
• Applications have different characteristics
SIMD: Attractive Alternative to ASICs
• Suitable for running wireless and multimedia applications on future embedded systems
• Advantages
  • High throughput
  • Low fetch-decode overhead
  • Easy to scale
• Disadvantages
  • Hard to realize high resource utilization
  • High SIMDization overhead
• Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC architecture, etc.
[Chart: example SIMD machine at 100 MOps/mW; energy-efficiency comparison against VLIW (5.6x) and SIMD FUs (2x)]
Under-utilization on Wide SIMD
[Chart: execution-time distribution at different SIMD widths for AAC, 3D, and H.264 on a 16-way SIMD machine; resource utilization ranges from full to under-utilized]
• Multimedia applications have various natural SIMD widths
• SIMD width characterization of innermost loops (Intel compiler rule), inside and across applications
• How to use idle SIMD resources?
Traditional Solutions for Under-utilization
• Dynamic power gating
  • Selectively cut off unused SIMD lanes
  • Effective dynamic and leakage power savings
  • Transition time and power overhead
  • High area overhead
• Thread-level parallelism
  • Execute multiple threads having separate data
  • Different instruction flows
  • Input-dependent control flow
  • High memory pressure
[Figure: threads 1-4 mapped onto SIMD lane groups, with unused groups switched off]
Objective of This Work
• Beyond loop-level SIMD: put idle SIMD lanes to work
• Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient
• Possible SIMD instructions inside a vectorized basic block
  • Perform the same work
  • Same data flow
  • More than 50% of total instructions have some opportunities
• Challenges
  • High data-movement overhead between lanes
  • Hard to find the best instruction-packing combination
Partial SIMD Opportunity
for (it = 0; it < 4; it++) {
  i = a[it] + b[it];
  j = c[it] + d[it];
  k = e[it] + f[it];
  l = g[it] + h[it];
  m = i + j;
  n = k + l;
  result[it] = m + n;
}
1. Loop-level SIMDization: the four loop iterations occupy SIMD lanes 0-3 (+4 lanes)
2. Partial SIMDization: the four independent adds of each iteration fill the remaining lanes 4-15 (+12 lanes)
[Figure: occupancy of the 16 SIMD lanes under the two schemes]
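The lane occupancy above can be sketched in a few lines. This is a hypothetical lane model in Python, not the paper's implementation: `vector_add` stands in for one wide SIMD add, and `partial_simd` lays out the 16 independent adds of the loop (4 iterations x 4 adds) across 16 lanes so they complete in a single vector operation.

```python
def vector_add(lhs, rhs):
    """One wide SIMD add: every lane computes independently."""
    return [x + y for x, y in zip(lhs, rhs)]

def partial_simd(a, b, c, d, e, f, g, h):
    """Hypothetical 16-lane layout: lane = it * 4 + op."""
    lhs = [src[it] for it in range(4) for src in (a, c, e, g)]
    rhs = [src[it] for it in range(4) for src in (b, d, f, h)]
    s = vector_add(lhs, rhs)  # all 16 adds in one vector step
    # Reduction tree per iteration (itself SIMDizable across iterations).
    return [(s[4 * it] + s[4 * it + 1]) + (s[4 * it + 2] + s[4 * it + 3])
            for it in range(4)]
```

The result matches the scalar loop `result[it] = ((a+b)+(c+d)) + ((e+f)+(g+h))` for each iteration, but the eight vector adds of plain loop-level SIMDization collapse to two on a machine wide enough to hold the packed lanes.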
Subgraph-Level Parallelism (SGLP)
• Data-level parallelism between identical subgraphs
  • SIMDizable operators
  • Isomorphic dataflow
  • No dependencies on each other
• Advantages
  • Minimized overhead: no inter-lane data movement inside a subgraph
  • High instruction-packing gain: multiple instructions inside a subgraph increase the packing gain
[Figure: cost vs. gain for candidate subgraph pairings]
Example: Program Order (2-degree), FFT Kernel
[Figure: FFT dataflow packed in program order across two SIMD lanes; eight inter-lane moves are required between the packed operation pairs]
Gain: 1 = 9 (SIMD) − 8 (overhead)
Example: SGLP (2-degree), FFT Kernel
[Figure: the same FFT dataflow packed by identical subgraphs; only two inter-lane moves remain]
Gain: 7 = 9 (SIMD) − 2 (overhead)
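Both FFT examples apply one formula: the net gain of a packing is the number of instructions saved by SIMD pairing minus the inter-lane moves it introduces. A minimal sketch (the function name is illustrative; the counts are read off the two examples):

```python
def sglp_gain(simd_saved, inter_lane_moves):
    """Net benefit of a packing: SIMD savings minus data-movement overhead."""
    return simd_saved - inter_lane_moves

# 2-degree FFT kernel: both packings save 9 instructions, but
# program-order pairing needs 8 inter-lane moves, SGLP only 2.
program_order_gain = sglp_gain(9, 8)  # 1
sglp_gain_value = sglp_gain(9, 2)     # 7
```

This is why SGLP prefers packing whole identical subgraphs: data flowing inside a subgraph stays in its lane, so almost all of the SIMD savings survive as net gain.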
Compilation Overview
• Inputs: application and hardware information
• Loop unrolling & vectorization → loop-level vectorized basic block
• Dataflow generation → dataflow graph
• 1. Subgraph identification → identical subgraphs
• 2. SIMD lane assignment → lane-assigned subgraphs
• 3. Code generation
1. Subgraph Identification
• Heuristic discovery
  • Grow subgraphs from seed nodes and find identical subgraphs
• Additional conditions over traditional subgraph search
  • Corresponding operators are identical
  • Operand types must match: register/constant
  • No inter-subgraph dependencies
[Figure: example dataflow graph with inputs a-h and constants 256, 2, 1 feeding multiply and add nodes into result]
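The three conditions can be sketched as a predicate over a dataflow graph. This is an illustrative Python sketch, not the paper's code: the graph encoding (`op`, `kinds`, `preds`) is hypothetical, the seed-growing heuristic itself is not shown, and the dependence check covers only direct edges where a real pass would check transitively.

```python
def depends(g, src_ids, dst_ids):
    """True if any node in dst_ids directly reads a value produced in src_ids.
    (Simplified: a real pass would check transitive dependences.)"""
    return any(p in src_ids for n in dst_ids for p in g[n]["preds"])

def identical(g, sub_a, sub_b):
    """Can two candidate subgraphs (node-id lists in matched order) be packed?"""
    if len(sub_a) != len(sub_b):
        return False
    for x, y in zip(sub_a, sub_b):
        if g[x]["op"] != g[y]["op"]:          # corresponding operators identical
            return False
        if g[x]["kinds"] != g[y]["kinds"]:    # operand types match (reg/const)
            return False
    # No inter-subgraph dependencies in either direction.
    return (not depends(g, set(sub_a), sub_b)
            and not depends(g, set(sub_b), sub_a))
```

For example, two independent adds with register operands pack together, while an add and its own consumer do not, and neither does a pair whose operand kinds differ (register vs. constant).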
2. SIMD Lane Assignment
• Select subgraphs to be packed and assign them to SIMD lane groups
  • Pack the maximum number of instructions with minimum overhead
  • Safe parallel execution without dependence violations
  • Criteria: gain, affinity, partial order
• 1. Subgraph gain: assign lanes in decreasing order of gain (e.g., gain: A > B > C > D)
• 2. Affinity: data movement between different subgraphs, based on producer/consumer and common producer/consumer relations; the affinity value counts how many related operations exist between subgraphs; assign the lane group with the highest affinity (e.g., B0 is closer to A0 than to A1)
• 3. Partial-order check: the partial order of identical subgraphs inside the SIMD lanes must be the same (e.g., opposite orderings of C0 and C1 across lane groups conflict)
[Figure: subgraphs A-D placed on lane groups 0-3 and 4-7 over time; a conflicting partial order of C0/C1 is rejected]
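A deliberately simplified greedy sketch of the first two criteria follows (Python; all names and data structures are illustrative, not the paper's). It visits subgraphs in decreasing gain order and places each on the lane group it has the highest affinity with; the partial-order check, which would reject conflicting placements, is only noted in a comment.

```python
def assign_lanes(gains, affinity, lane_groups):
    """Greedy placement sketch.
    gains:       {subgraph_name: packing gain}
    affinity:    {(subgraph_name, lane_group): affinity score}
    lane_groups: list of available lane-group ids
    (A full pass would also verify that identical subgraphs keep the
    same partial order across lane groups before accepting a placement.)
    """
    placement = {}
    for name in sorted(gains, key=gains.get, reverse=True):
        placement[name] = max(lane_groups,
                              key=lambda lg: affinity.get((name, lg), 0))
    return placement
```

With gains A > B > C and affinities pulling B toward A's lane group, B lands next to its producer and avoids an inter-lane move, which is exactly what the affinity criterion is for.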
Experimental Setup
• 144 loops from industry-level optimized media applications
  • AAC decoder (MPEG-4 audio decoding, low-complexity profile)
  • H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF)
  • 3D (3D graphics rendering)
• Target architecture: wide vector machines
  • SIMD width: 16-64
  • SODA-style wide vector instruction set
  • Single-cycle data shuffle instruction (≈ vperm (VMX), vec_perm (AltiVec))
• IMPACT frontend compiler + cycle-accurate simulator
• Compared to two other solutions
  • SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00]
  • ILP: instruction-level parallelism on equally wide VLIW machines
• Apply 2- to 4-degree SGLP
Static Performance
[Chart: static speedup of SLP, SGLP, and ILP for AAC, H.264, 3D, and their average]
• SGLP retains a trend similar to ILP after overhead is considered
• Max 1.66x at 4-way (SLP: 1.27x)
• See the paper for representative kernels (FFT, DCT, HalfPel, …)
Dynamic Performance on SIMD
[Chart: dynamic speedup of SLP and SGLP for AAC, H.264, 3D, and their average]
• The available degree of SGLP is exploited only when the natural SIMD width is insufficient (up to 4-way)
• Max 1.76x speedup (SLP: 1.29x)
Energy @ H.264 Execution
• 200 MHz, IBM 65 nm technology
[Figure: SIMD vs. VLIW configurations built from control units and 8-wide SIMD datapaths; the SIMD configuration is 30% more energy efficient]
Conclusion
• SIMD is an energy-efficient solution for mobile systems.
• SIMD programming of multimedia applications is a challenging problem due to the varying degrees of SIMD parallelism.
• Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside vectorized basic blocks.
• SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.
Questions?
• For more information: http://cccp.eecs.umich.edu
Example 2: High-level View
[Figure: kernels 0-2 with natural SIMD widths 8, 4, and 8 scheduled over lane groups 0-3 and 4-7 across time; during kernel 1, subgraphs A1 and C1 run in the otherwise idle lanes 4-7]
Gain = (A1 + C1) (SIMD) − ((A1→B) + (C1→D)) (overhead)
Static Performance
[Chart: static speedup per kernel (FFT and MDCT from AAC; MatMul4x4 and MatMul3x3 from 3D; HalfPel and QuarterPel from H.264) and per-application averages]
• Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3)
• SGLP retains a trend similar to ILP after overhead is considered
• Max 1.66x at 4-way (SLP: 1.27x)