ODES-9 Workshop on Optimizations for DSP and Embedded Systems
Eliminating non-productive memory operations in narrow-bitwidth architectures
Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC)
64-bit datapath: 64-bit addressing and high-precision computing
[Figure: a 64-bit adder, then the same 64-bit width drawn as four 16-bit adders]
16-bit integer datapath
64-bit addressing and high-precision computing
40% of computations need only a 16-bit datapath
Caveat: a 64-bit computation becomes 8 * 16-bit computations (DBT, dynamic binary translation)
[Figure: four 16-bit adders]
What does non-productive mean?
  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
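The example above can be read as follows: when a 64-bit value is handled as four 16-bit chunks, the upper three chunks of both operands are zero and only the lowest chunk carries information. The sketch below illustrates that reading in Python. It is not the authors' method: the names `split_chunks` and `chunked_add` are hypothetical, and "productive" is assumed here to mean "a chunk addition with at least one non-zero input or an incoming carry". It also models only the four additions, not the eight 16-bit computations mentioned in the caveat.

```python
CHUNK_BITS = 16
CHUNKS = 4                      # 64 bits / 16 bits
MASK = (1 << CHUNK_BITS) - 1


def split_chunks(value):
    """Split a 64-bit value into four 16-bit chunks, least significant first."""
    return [(value >> (CHUNK_BITS * i)) & MASK for i in range(CHUNKS)]


def chunked_add(a, b):
    """Add two 64-bit values chunk by chunk, propagating the carry.

    Returns (sum modulo 2**64, number of productive chunk additions),
    where "productive" is the assumed definition from the text above.
    """
    result, carry, productive = 0, 0, 0
    for i, (x, y) in enumerate(zip(split_chunks(a), split_chunks(b))):
        if x or y or carry:
            productive += 1
        total = x + y + carry
        result |= (total & MASK) << (CHUNK_BITS * i)
        carry = total >> CHUNK_BITS
    return result, productive


total, productive = chunked_add(0x1, 0x25)
print(hex(total), productive)   # prints: 0x26 1
```

For the slide's operands, three of the four chunk additions only add zeros, which is the kind of work the talk aims to eliminate.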
Contributions and conclusions
• Narrow ISA offers more opportunities to remove non-productive memory operations
• 50% of dynamic narrow operations are non-productive
• Memory Productiveness Pruning: profile-guided, dynamic optimization
Energy efficient code generation for processors with exposed datapath
Dongrui She, Yifan He, Bart Mesman, Henk Corporaal (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction
Register Reuse Scheduling
Gergö Barany
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
Register allocation and spilling
[Figure: virtual registers mapped to physical registers and memory]
Register allocation with reuse candidates
[Figure: data dependence graph of a basic block and the matching interference graph, with value pairs marked "definitely overlap", "definitely NO overlap", or "possible overlap"]
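The idea of enforcing a scheduling decision through extra DDG arcs can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the talk: the DDG is a plain dictionary from instruction to successor set, and the helper names `reaches` and `enforce_reuse` as well as the instruction names are hypothetical. To let a new value reuse the register of an old one, the sketch adds an arc from every last use of the old value to the definition of the new value, and refuses when that would create a cycle.

```python
def reaches(graph, src, dst):
    """Return True if dst is reachable from src along DDG arcs."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


def enforce_reuse(graph, last_uses, new_def):
    """Add arcs last_use -> new_def so new_def is scheduled after the old value dies.

    Returns False and leaves the graph unchanged if any arc would close a cycle.
    """
    # An instruction that both consumes the old value and defines the new one
    # needs no extra arc.
    uses = [use for use in last_uses if use != new_def]
    if any(reaches(graph, new_def, use) for use in uses):
        return False
    for use in uses:
        graph.setdefault(use, set()).add(new_def)
    return True


# Hypothetical basic block: add consumes a and b, mul consumes the sum and c.
ddg = {
    "load_a": {"add"},
    "load_b": {"add"},
    "load_c": {"mul"},
    "add": {"mul"},
}

# Let c reuse the register of a: a's last use (add) must precede load_c.
c_reuses_a = enforce_reuse(ddg, ["add"], "load_c")
# Let a reuse the register of c: impossible, load_a already precedes mul.
a_reuses_c = enforce_reuse(ddg, ["mul"], "load_a")
print(c_reuses_a, a_reuses_c)   # prints: True False
```

Once the arc is in the graph, any scheduler that respects the DDG produces an order in which the two lifetimes cannot overlap, so the decision is enforced implicitly.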
Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
Mounira Bachir, Sid-Ahmed-Ali Touati, Albert Cohen
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is tied to memory requirements and I-cache performance
Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones
Results: “Good” if there are enough functional units to perform the additional move operations, with acceptable execution time
Periodic Register Allocation
• Rotating Register File
• Move operations: d-1 MOVs/iteration, where d is the iteration span of variables
• Loop unrolling: 3 * code size
• Modulo Variable Expansion: using 9 registers instead of 8 (MAXLIVE = 8)
  [Figure: a[i] b[i] c[i], a[i+1] b[i+1] c[i+1], a[i+2] b[i+2] c[i+2]]
• Meeting Graph: lifetime in cycles; the lifetime interval of c ends when the interval of b begins
Meeting Graph
[Figure: a, b, c laid out over eight consecutive iterations, i through i+7]
Circuit Decomposition
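The trade-off between unrolling factor and register count can be sketched numerically. The block below is an illustration under assumptions, not the algorithm from the talk. It uses one common formulation of modulo variable expansion (unroll by the largest per-variable need, then give each variable the smallest divisor of the unroll factor that covers its need) and the usual meeting-graph rule that the unroll factor is the least common multiple of the circuit weights. The lifetimes 3, 3, 2 with II = 1 are assumed values, chosen only because they reproduce the figures on the slides (MAXLIVE = 8, 9 registers, 3 * code size); the slides do not state them. `math.lcm` needs Python 3.9 or later.

```python
from math import ceil, lcm


def mve(lifetimes, ii):
    """Modulo variable expansion: return (unroll factor, registers used)."""
    needs = [ceil(lt / ii) for lt in lifetimes]
    unroll = max(needs)
    # Each variable gets the smallest divisor of the unroll factor
    # that is at least its own register need.
    regs = sum(
        min(d for d in range(n, unroll + 1) if unroll % d == 0)
        for n in needs
    )
    return unroll, regs


def meeting_graph_unroll(circuit_weights):
    """Unroll factor for a meeting-graph decomposition: lcm of circuit weights."""
    return lcm(*circuit_weights)


# Assumed lifetimes (in cycles) for a, b, c at II = 1; MAXLIVE = 3 + 3 + 2 = 8.
print(mve([3, 3, 2], ii=1))           # prints: (3, 9)

# One circuit of weight 8 uses 8 registers but needs 8 kernel copies.
print(meeting_graph_unroll([8]))      # prints: 8

# Hypothetical decomposition into two circuits of weight 4.
print(meeting_graph_unroll([4, 4]))   # prints: 4
```

The `[4, 4]` split is purely hypothetical: inserting moves changes the lifetimes, so the weights that are actually reachable depend on the loop. It only shows why smaller circuits with compatible weights reduce the unroll factor.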
Main Conference: 2011 International Symposium on Code Generation and Optimization
MAO – an extensible Micro-Architectural Optimizer
Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google)
Micro-architectural: not always documented
Proprietary compilers at an advantage!
SPEC2000 int vs. SPEC2000 int + 1 NOP instruction: -7% execution time
[Figure: Loop vs. NOP + Loop]
Example: instruction decoding in Core 2 happens in chunks of 16 bytes
[Figure: the loop relative to a 16-byte alignment boundary, without and with the NOP]
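The alignment effect can be made concrete with a little address arithmetic. The sketch below is not MAO code: the function names and the example addresses are hypothetical, and the only fact taken from the slide is the 16-byte decode chunk. It counts how many 16-byte chunks a loop body overlaps and how much padding would move its start to the next boundary.

```python
CHUNK = 16  # decode chunk size in bytes, as stated on the slide for Core 2


def chunks_touched(start, length):
    """Number of 16-byte chunks overlapped by code at [start, start + length)."""
    first = start // CHUNK
    last = (start + length - 1) // CHUNK
    return last - first + 1


def padding_to_align(start):
    """Bytes of padding needed to move start to the next 16-byte boundary."""
    return -start % CHUNK


# Hypothetical 14-byte loop body starting at address 0x1009.
start, length = 0x1009, 14
print(chunks_touched(start, length))        # prints: 2

pad = padding_to_align(start)
print(pad)                                  # prints: 7
print(chunks_touched(start + pad, length))  # prints: 1
```

A loop that fits in one chunk after padding can be decoded from a single fetch per iteration. Whether that pays off on a given processor is the kind of detail the talk proposes to discover with micro-benchmarks.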
Contributions and conclusions
• Extensible assembly-to-assembly optimizer
• Does not fit in the GCC flow, because not enough information is preserved after the RTL level
• Discover micro-architectural details semi-automatically through generation of micro-benchmarks
Dynamic register promotion of stack variables
Jianjun Li, Chenggang Wu, Wei-Chung Hsu
Use DBT to let x86 binaries use the extra registers on x86-64
Recompiling is not always an option (legacy binaries)
Compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)
Language and compiler support for auto-tuning variable-accuracy algorithms
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
• New programming language, toolchain and run-time environment
• Technique for mapping variable-accuracy code to enable efficient auto-tuning
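The time/accuracy trade-off can be illustrated with a toy tuner. The sketch below is plain Python, not PetaBricks syntax, and is not taken from the talk. `newton_sqrt` and `tune` are hypothetical names, the iteration count stands in for an accuracy knob exposed to the compiler, and the relative-error metric and training inputs are assumptions.

```python
def newton_sqrt(x, iterations):
    """Approximate sqrt(x) with a fixed number of Newton iterations."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def worst_error(iterations, inputs):
    """Largest relative error of guess**2 against x over the training inputs."""
    return max(abs(newton_sqrt(x, iterations) ** 2 - x) / x for x in inputs)


def tune(target_error, inputs, max_iterations=50):
    """Return the cheapest iteration count that meets the accuracy target."""
    for n in range(max_iterations + 1):
        if worst_error(n, inputs) <= target_error:
            return n
    raise ValueError("no setting meets the accuracy target")


training = [2.0, 10.0, 1000.0]
loose = tune(1e-2, training)    # cheap setting for a loose target
tight = tune(1e-10, training)   # more iterations for a tight target
print(loose, tight)
```

A caller that only needs two digits pays for fewer iterations than one that needs ten. Choosing such settings automatically, per accuracy target, is the job the slide assigns to the auto-tuner.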
Practical memory checking with Dr. Memory
Derek Bruening (Google), Qin Zhao (MIT)
Existing memory checking tools (e.g. Valgrind)
slow
many false positives
x86
A trace-based Java JIT compiler retrofitted from a method-based compiler
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM)
Extend the compilation scope from methods to traces
Traces span multiple method invocations
More powerful than method inlining
Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting
Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors
Tyler Sondag and Hridesh Rajan
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are used to map the whole cluster
Results: 36% average process speedup with negligible overheads
[Figure: phase-based tuning]
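The method described above (cluster statically, profile one representative, map the whole cluster) can be sketched as follows. This is a simplified illustration, not the authors' implementation: sections are clustered by an identical static signature, the signature contents, section names, core types, and throughput numbers are all hypothetical, and `measure` stands in for whatever runtime monitoring a real system would use.

```python
from collections import defaultdict


def cluster_sections(signatures):
    """Group code sections whose static signatures are identical."""
    clusters = defaultdict(list)
    for name, signature in signatures.items():
        clusters[signature].append(name)
    return list(clusters.values())


def assign_cores(clusters, measure):
    """Map every section of a cluster to the core type that suits its representative.

    measure(section) returns {core_type: observed throughput}.
    """
    assignment = {}
    for members in clusters:
        sample = measure(members[0])        # profile one representative only
        best = max(sample, key=sample.get)
        for section in members:
            assignment[section] = best
    return assignment


# Hypothetical static signatures: (dominant instruction mix, memory intensity).
signatures = {
    "fft_loop": ("fp", "low_mem"),
    "dot_loop": ("fp", "low_mem"),
    "copy_loop": ("int", "high_mem"),
}

profiled = []


def fake_measure(section):
    """Stand-in for runtime monitoring; the numbers are made up."""
    profiled.append(section)
    if section == "copy_loop":
        return {"big": 1.0, "little": 1.1}
    return {"big": 2.0, "little": 1.0}


assignment = assign_cores(cluster_sections(signatures), fake_measure)
print(assignment)
print(profiled)   # only one section per cluster is measured
```

Because `dot_loop` shares a signature with `fft_loop`, it inherits the mapping without being measured itself, which is where the low overhead in this scheme comes from.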
Vapor SIMD: Auto-Vectorize Once, Run Everywhere
Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Albert Cohen, Ayal Zaks
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix and match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers
[Figure: Vectorizing for different platforms]
[Figure: Split vectorization scheme]
[Figure: Interoperable compilation flows]