Platform-based design Abstracting SIMD Hardware By: Joris Dobbelsteen, Lennart de Graaf References: Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping – Clark, Hormati, et al.
Some issues w.r.t. SIMD • Issue 1: • SIMD programming requires in-depth knowledge of the architecture • Issue 2: • SIMD is still evolving fast (especially in embedded systems) • Width changes (e.g. from 64- to 128-bit vectors) • Functionality increases (instruction set expands) • This leads to problems with regard to (forward) compatibility • Our objective: • Find possible solutions for these issues (in the literature)
Short web research • Several papers on SIMDization (issue 1): • Exploiting Vector Parallelism in Software-Pipelined Loops – • Larsen, Rabbah, Amarasinghe • Software pipelining is often used for ILP; however, it does not use specific vector resources without explicit instructions. • For vectorization, compilers generate separate loops for the vectorizable and non-vectorizable parts. This diminishes ILP. • A 'selective vectorization' algorithm improves on this.
Short web research • Several papers on SIMDization (issue 1): • Compilation Techniques for Multimedia Processors • Krall, Lelait • Scalar code is converted to SIMD code • Loop unrolling is used to create the possibility of performing iterations in parallel • Does not work when iterations have data dependencies!
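The unrolling idea above can be sketched in C. This is a minimal illustration, not code from the paper: the hypothetical `add_arrays` loop is unrolled by 4, and because the four statements per pass are independent, a compiler can map them onto one 4-wide SIMD add; a loop-carried dependency (noted in the comment) would block this.

```c
/* Sketch of loop unrolling as a SIMDization enabler (assumed example,
 * not from the Krall/Lelait paper). Each iteration is independent,
 * so the four unrolled statements can become one 4-wide SIMD add. */
void add_arrays(const int *a, const int *b, int *c, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)      /* leftover iterations stay scalar */
        c[i] = a[i] + b[i];
    /* Counter-example: c[i] = c[i-1] + a[i] carries a dependency
     * between iterations, so unrolling would not expose lanes. */
}
```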
Short web research • Compatibility (issue 2): • Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping • Clark, Hormati, Yehia, Mahlke, Flautner • This paper describes a way to abstract SIMD hardware. It describes: • a compiler/translation framework to realize Liquid SIMD • vector-width independent • a lightweight dynamic runtime SIMD code generator • This is a first step in solving the prior issues. • (N.B. this paper does not involve SIMDization) • We zoom in on this paper!
Decoupling Principle • The paper describes a way to 'abstract' SIMD hardware, specifically with regard to the second issue (compatibility) • Start off with SIMD instructions for a specific SIMD architecture • Then, broadly, two steps are taken: • Mapping of SIMD instructions to an equivalent scalar representation • Can take place at compile time • Functionally, SIMD and scalar are equivalent • Dynamic translation from scalar to SIMD instructions for a (different) SIMD architecture • Takes place dynamically • In hardware • In software (just like JIT)
Step 1: SIMD to scalar • The paper describes a table with rules to convert SIMD instructions to scalar instructions • Rule example: all elements are independent (figure: one vector '+' split into per-element scalar '+' operations) • (relatively simple operation)
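The element-independence rule above can be written out in C. A minimal sketch, assuming a 4-wide source architecture (`VLEN` is our own name, not the paper's): one SIMD add becomes a functionally identical per-element scalar sequence.

```c
#define VLEN 4  /* width of the original SIMD architecture (assumed) */

/* One VLEN-wide SIMD add written as its scalar-equivalent sequence.
 * Every lane is independent, so the mapping is a plain per-element
 * loop: same result, no SIMD hardware required. */
void simd_add_as_scalar(const int v1[VLEN], const int v2[VLEN],
                        int out[VLEN]) {
    for (int lane = 0; lane < VLEN; lane++)
        out[lane] = v1[lane] + v2[lane];
}
```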
Step 2: Scalar to SIMD • Based on the scalar instructions and information about the (other) SIMD architecture, the conversion takes place the other way around (figure: scalar '+' operations regrouped into vector '+' operations) • N.B. scalar-to-SIMD conversion is only done for the number of scalar instructions that is divisible by the width of the SIMD architecture
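The divisibility note above can be made concrete. A sketch under our own naming (`vectorizable_count` is a hypothetical helper, not from the paper): only the portion of the scalar sequence that fills whole target-width vectors is re-vectorized; the remainder stays scalar.

```c
/* How many of n_scalar_ops element-wise scalar operations can be
 * regrouped into full vectors of the target SIMD width? Only whole
 * vectors count; the remainder must stay scalar. */
int vectorizable_count(int n_scalar_ops, int target_width) {
    return (n_scalar_ops / target_width) * target_width;
}
```

For example, 10 independent scalar adds retargeted to a 4-wide machine yield two full vector adds (8 operations), with 2 operations left scalar.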
Liquid SIMD • Migrating code from one SIMD architecture to another is now easy (figure: translation flow): • Compile time: SIMD instructions for arch X → Liquid SIMD (to scalar) → scalar instructions for any arch, a.k.a. virtualized SIMD code • Runtime: scalar instructions → to SIMD for arch X / arch Y / arch Z → SIMD instructions for arch N
Dynamic translation • The paper describes a hardware design for dynamic translation: simple combinational logic to detect and translate (the reverse of the SIMD-to-scalar mapping)
Is life really that simple? • No: converting from SIMD to scalar becomes more difficult if multiple vector elements are used to compute one result • Example: determining a minimum (figure: tree of pairwise 'min' operations across lanes) • Translation back to SIMD is impossible in the current implementation
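The minimum example shows why such operations resist re-vectorization. A sketch in C (our own illustration): the scalar form carries a running result from step to step, so the instructions form a dependence chain rather than independent lanes, and the simple per-element rules no longer apply.

```c
#include <limits.h>

/* Scalar form of a vector-minimum reduction. Each step reads the
 * running result m, so the scalar instructions form a dependence
 * chain across elements -- the pattern the dynamic translator
 * cannot map back to a SIMD instruction in the current design. */
int vec_min(const int *v, int n) {
    int m = INT_MAX;
    for (int i = 0; i < n; i++)
        if (v[i] < m)
            m = v[i];
    return m;
}
```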
SIMD to scalar • It gets worse if vector elements are reordered • Example: butterfly operation (figure: elements a b c d reordered to d c b a) • Resource usage increases! • Translation back to SIMD is not possible.
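The butterfly case can be sketched the same way (a minimal illustration, assuming the 4-element a b c d → d c b a reversal shown on the slide): the scalar form must move values across lane positions, which costs extra registers and breaks the lane-wise pattern the translator recognizes.

```c
/* Scalar form of a 4-element butterfly (full reversal). Elements
 * cross lane positions, so the scalar code needs a separate output
 * buffer (extra register pressure) and, per the paper, cannot be
 * translated back into a SIMD permute by the current implementation. */
void butterfly4(const int in[4], int out[4]) {
    out[0] = in[3];
    out[1] = in[2];
    out[2] = in[1];
    out[3] = in[0];
}
```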
Drawbacks • Code is less efficient than code optimized (by hand) for a specific SIMD architecture • Because of the translations, more registers are used than before • Some instructions cannot be supported • The code cache needs to be bigger, because both the scalar and the generated SIMD code need to be stored (in the current implementation) • Only works up to a predefined maximum vector width
Strong points • + (fairly) SIMD-architecture independent • + less code rewriting • + new hardware = speedup without extra effort • + code even runs when no SIMD accelerator is present • + the approach is 'general': it recognizes the 'structure' of the code by means of a set of rules (instead of using conversions for specific instructions)
Weak points • Extra overhead (although small) • Efficiency of SIMDization depends on the quality of the set of translation rules • Not a perfect exploitation of SIMD (in really demanding apps, we have the feeling that the processor's capabilities are not fully utilized) because: • Some instructions cannot be translated • Future instructions are not utilized • In our opinion: • it primarily solves the second issue • it's a first small step … but there's a long way to go!