Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*
University of Michigan, *ARM Ltd.
Computational Efficiency
• Low power envelope
• More useful work/transistors
• Hardware accelerators
  • Niagara II encryption engine
[Figure source: AMD Analyst Day, 12/14/06]
How Are Accelerators Used?
• Control statically placed in binary
[Figure: Program, CPU, Accel.]
Problem With Static Control
• Not forward/backward compatible
[Figure: Program, CPU, Accel.]
Solution: Virtualization
• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary
[Figure: Engineer/Compiler, Program, Trans., Proc., Accel.]
Liquid SIMD
• Virtualize SIMD accelerators
• Why virtualize SIMD?
  • Intel MMX to SSE2
  • ARM v6 to Neon
  • Wide vectors useful [Lin 06]
SIMD Accelerator Assumptions
• Same instruction stream
• Separate pipeline – memory interface
[Figure: Fetch, Decode, Scalar Exec, SIMD Exec, Retire]
How to Virtualize
• Use scalar ISA to represent SIMD operations
• Compatibility, low overhead
• Key: easy to translate
[Figure: Program, Branch]
Virtualization Architecture
[Figure: Fetch, Decode, Execute, Retire, Trans., uCode Cache, Accel.]
1. Data Parallel Operations
    for (i = 0; i < 8; i++) {
        r1 = A[i];
        r2 = B[i];
        r3 = r1 + r2;
        r4 = r3 & constant;
        C[i] = r4;
    }
[Figure: A, B, +, &, C]
1a. What If There’s No Scalar Equivalent?
    for (i = 0; i < 8; i++) {
        r1 = A[i];
        r2 = B[i];
        r3 = r1 + r2;
        cmp r3, #FF;
        r3 = movgt #FF;
        ...
    }
Idioms can always be constructed
[Figure: A, B, SADD]
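The compare-and-clamp idiom on this slide can be sketched in plain C. This is only an illustration: the unsigned 8-bit element width is an assumption read off the `#FF` constant, and the names `sat_add_u8` and `sat_add_loop` are hypothetical, not taken from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unsigned 8-bit saturating add written as the
 * add / compare / conditional-move idiom from the slide.
 * The 8-bit width is an assumption based on the #FF constant. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned r3 = (unsigned)a + (unsigned)b; /* r3 = r1 + r2      */
    if (r3 > 0xFFu) {                        /* cmp r3, #FF       */
        r3 = 0xFFu;                          /* r3 = movgt #FF    */
    }
    return (uint8_t)r3;
}

/* The scalar loop a translator would have to recognize as one SADD. */
void sat_add_loop(const uint8_t *A, const uint8_t *B, uint8_t *C, size_t n) {
    for (size_t i = 0; i < n; i++) {
        C[i] = sat_add_u8(A[i], B[i]);
    }
}
```

The sum is computed in a wider type so the comparison against 0xFF can actually observe the overflow; a pure 8-bit add would wrap before the compare.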
2. Scalarizing Permutations
    for (i = 0; i < 8; i++) {
        …
        r1 = r2 + r3;
        tmp[i] = r1;
    }
    for (i = 0; i < 8; i++) {
        r1 = offset[i];
        r2 = tmp[r1 + i];
        r3 = r2 & const;
        …
    }
    offset = {4, 4, 4, 4, -4, -4, -4, -4}
[Figure: +, &]
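A runnable sketch of the two-loop pattern above, under stated assumptions: the elided parts of the slide are filled with loads from `A` and `B` and a store to `C` purely for illustration, the elements are 8-bit, and the function name `add_swap_halves_mask` is hypothetical. With offset = {4, 4, 4, 4, -4, -4, -4, -4}, the second loop reads the upper half of `tmp` into the lower half of the result and vice versa.

```c
#include <stdint.h>

#define N 8

/* Offset table from the slide: element i reads tmp[i + offset[i]],
 * which swaps the lower and upper four elements. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

/* Hypothetical example: add, permute through memory, then mask.
 * Inputs A, B and output C are assumptions standing in for the
 * elided parts of the slide. */
void add_swap_halves_mask(const uint8_t A[N], const uint8_t B[N],
                          uint8_t mask, uint8_t C[N]) {
    uint8_t tmp[N];

    /* Loop 1: data-parallel work before the permutation. */
    for (int i = 0; i < N; i++) {
        tmp[i] = (uint8_t)(A[i] + B[i]);
    }

    /* Loop 2: the permutation expressed as an offset-indexed load,
     * followed by the remaining data-parallel work. */
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];
        uint8_t r2 = tmp[r1 + i];
        C[i] = (uint8_t)(r2 & mask);
    }
}
```

Splitting the loop at the permutation means every element of `tmp` is written before any permuted read happens, which is what makes the offset-indexed load safe.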
3. Scalarizing Reductions
    for (i = 0; i < 8; i++) {
        …
        r1 = A[i];
        r2 = r2 + r1;
        …
    }
[Figure: +]
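To illustrate why a loop-carried accumulator like `r2` can still map to SIMD hardware, here is a minimal sketch comparing the scalar reduction with a 4-lane version emulated using an array of partial sums. The lane emulation, the 4-lane width, and the function names are assumptions for illustration; this is not Neon code and not necessarily how the translator in the talk handles reductions.

```c
#include <stdint.h>

#define N 8
#define LANES 4

/* Scalar form from the slide: one accumulator carried across iterations. */
int32_t reduce_scalar(const int32_t A[N]) {
    int32_t r2 = 0;
    for (int i = 0; i < N; i++) {
        int32_t r1 = A[i];
        r2 = r2 + r1;
    }
    return r2;
}

/* Emulated 4-lane form (an assumption, not the authors' design):
 * each lane keeps a partial sum, and the lanes are combined once
 * after the loop. */
int32_t reduce_lanes(const int32_t A[N]) {
    int32_t acc[LANES] = {0, 0, 0, 0};
    for (int i = 0; i < N; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            acc[l] += A[i + l];
        }
    }
    int32_t sum = 0;
    for (int l = 0; l < LANES; l++) {
        sum += acc[l];
    }
    return sum;
}
```

The two forms agree for integer addition (absent overflow) because it is associative and commutative; a floating-point reduction could differ in rounding when reordered this way.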
Applied to ARM Neon
• All instructions supported except…
  • VTBL – indirect indexing: v1 = vtbl v2, v3
  • Interleaved memory accesses
• Not needed in evaluated benchmarks
[Figure: v2, v3, indices 1 0 1 3, v1; Mem, v1]
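For readers unfamiliar with VTBL, a scalar emulation of a single-register table lookup is sketched below. The function name and the 8-byte table size are assumptions for illustration, and the slide does not say which operand is the table and which holds the indices. The point is that the indices are run-time register values rather than a fixed offset array, which is presumably why this instruction falls outside the offset-based permutation pattern.

```c
#include <stdint.h>

#define LANES 8

/* Hypothetical scalar emulation of a one-register VTBL lookup:
 * each output byte is table[idx[i]], and an out-of-range index
 * yields 0. */
void vtbl_emulate(const uint8_t table[LANES], const uint8_t idx[LANES],
                  uint8_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        out[i] = (idx[i] < LANES) ? table[idx[i]] : 0;
    }
}
```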
Translation to SIMD
• Update induction variable
• Use inverse of defined translation rules
Scalar input:
    for (i = 0; i < 8; i++) {
        r1 = A[i];
        r2 = B[i];
        r3 = r1 + r2;
        r4 = offset[i];
        C[i + r4] = r3;
    }
Translated SIMD output:
    for (i = 0; i < 8; i += 4) {
        v1 = A[i];
        v2 = B[i];
        v3 = v1 + v2;
        v3 = shuffle v3;
        C[i] = v3;
    }
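The scalar-to-SIMD correspondence can be checked with a small emulation. This is a sketch under assumptions: vectors are emulated as 4-element arrays rather than real Neon registers, the function names are hypothetical, and the offset table {1, -1, 1, -1, ...} (swap adjacent pairs) is chosen here so that the permutation stays inside one 4-lane group; it is not the offset table shown on the earlier slide.

```c
#include <stdint.h>

#define N 8
#define LANES 4

/* Assumed offset table: swaps adjacent elements, so every move stays
 * within a single 4-lane vector. */
static const int offset[N] = {1, -1, 1, -1, 1, -1, 1, -1};

/* Scalar representation: add, then store through an offset-indexed address. */
void scalar_form(const uint8_t A[N], const uint8_t B[N], uint8_t C[N]) {
    for (int i = 0; i < N; i++) {
        uint8_t r3 = (uint8_t)(A[i] + B[i]);
        int r4 = offset[i];
        C[i + r4] = r3;
    }
}

/* Emulated SIMD form: the induction variable advances by the vector
 * width, and the offset-indexed store becomes an in-register shuffle
 * followed by a contiguous vector store. */
void simd_form(const uint8_t A[N], const uint8_t B[N], uint8_t C[N]) {
    for (int i = 0; i < N; i += LANES) {
        uint8_t v3[LANES];
        uint8_t shuffled[LANES];
        for (int l = 0; l < LANES; l++) {      /* v3 = v1 + v2    */
            v3[l] = (uint8_t)(A[i + l] + B[i + l]);
        }
        for (int l = 0; l < LANES; l++) {      /* v3 = shuffle v3 */
            shuffled[l + offset[i + l]] = v3[l];
        }
        for (int l = 0; l < LANES; l++) {      /* C[i] = v3       */
            C[i + l] = shuffled[l];
        }
    }
}
```

Both functions write the same values to `C`, which is the property a dynamic translator relies on when it replaces the scalar loop with the vector one.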
Translator Design
• Translator: efficiency, speed, flexibility
[Figure: Engineer/Compiler, Program, Trans., Proc., Accel.]
Evaluation
• Trimaran ARM
• Hand SIMDized loops
• SimpleScalar model of ARM926 w/ Neon SIMD
• VHDL translator, 130 nm std. cell
Liquid SIMD Issues
• Code bloat
  • <1% overhead beyond baseline
• Register pressure
  • Not a problem
• Translator cost
  • 0.2 mm² + 2 KB cache
• Translation overhead
Translation Overhead
[Figure: translation overhead for MediaBench, Kernels, and SPECfp benchmarks]
Summary
• Accelerators are more common and evolving
• Costly binary migration
• SIMD virtualization using scalar ISA
• One binary: forward/backward compatibility
• Negligible overhead
Questions?