
Workshop on Optimizations for DSP and Embedded Systems


Presentation Transcript


  1. ODES-9 Workshop on Optimizations for DSP and Embedded Systems

  2. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). 64-bit datapath: 64-bit addressing and high-precision computing. [Diagram: 64-bit operands pass through a single 64-bit adder to a 64-bit result.]

  3. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). 64-bit datapath: 64-bit addressing and high-precision computing. [Diagram: the same 64-bit operands handled by four 16-bit adders.]

  4. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). 16-bit integer datapath with 64-bit addressing and high-precision computing: 40% of computations need only a 16-bit datapath. Caveat: a 64-bit computation becomes 8 × 16-bit computations (under dynamic binary translation, DBT). [Diagram: four 16-bit adders.]

  5. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). What does non-productive mean? 0x0000 0000 0000 0001 + 0x0000 0000 0000 0025 = 0x0000 0000 0000 0026

  6. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). What does non-productive mean? 0x0000000000000001 + 0x0000000000000025 = 0x0000000000000026

  7. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). What does non-productive mean? 0x0000 0000 0000 0001 + 0x0000 0000 0000 0025 = 0x0000 0000 0000 0026 (viewed as 16-bit chunks, only the lowest chunk carries any information)

  8. Eliminating non-productive memory operations in narrow-bitwidth architectures. Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC). Contributions and conclusions: • A narrow ISA offers more opportunities to remove non-productive memory operations • 50% of dynamic narrow operations are non-productive • Memory Productiveness Pruning: a profile-guided, dynamic optimization
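A minimal sketch of the underlying idea, reusing the example from slides 5-7 (my own code, not the paper's mechanism): when a 64-bit value is kept as four 16-bit chunks, only the chunks that carry information need real work; for 0x0000000000000026 that is a single chunk, so the other three chunk operations are non-productive.

```c
#include <stdint.h>
#include <stdio.h>

/* Count the 16-bit chunks of a 64-bit value that carry information under
 * zero extension; operations on the remaining chunks are non-productive. */
static int productive_chunks(uint64_t v)
{
    int needed = 1;                          /* the low chunk always counts */
    for (int c = 1; c < 4; c++)
        if ((uint16_t)(v >> (16 * c)) != 0)
            needed = c + 1;                  /* highest non-zero chunk */
    return needed;
}

int main(void)
{
    uint64_t sum = 0x0000000000000001ULL + 0x0000000000000025ULL;
    int p = productive_chunks(sum);
    printf("sum = 0x%016llx: %d productive chunk(s), %d non-productive\n",
           (unsigned long long)sum, p, 4 - p);
    return 0;
}
```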

  9. Energy-efficient code generation for processors with exposed datapath. Dongrui She, Yifan He, Bart Mesman, Henk Corporaal (TUE). Exposed datapath: software controls every movement in the datapath. Example: transport-triggered architectures (Henk Corporaal). Register file access reduction.

  10. Register Reuse Scheduling. Gergö Barany. Objective: minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation. Motivation: spill code generated by the compiler has a crucial effect on program performance. Method: implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG). Results: 8.9% less spilling, 3.4% smaller static spill costs.
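A minimal, source-level sketch of the scheduling idea (my own example; the paper adds the arcs to the compiler's DDG, not to C source): if the last use of one value is forced to be scheduled before the definition of another, their live ranges no longer overlap and the two values can share one register, so no spill is needed.

```c
/* Hypothetical extra DDG arc:  last use of 'a'  ->  definition of 'b'.
 * With the resulting order, 'a' is dead before 'b' is defined, so the
 * register allocator can place both values in the same physical register. */
int block(int x, int y)
{
    int a = x + 1;      /* definition of a */
    int u = a - 3;      /* last use of a, scheduled early */
    int b = y * 2;      /* definition of b: may reuse a's register */
    return u + b;
}
```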

  11. Register Allocation and spilling. [Diagram: virtual registers are mapped to physical registers or spilled to memory.] (Register Reuse Scheduling)

  12. Register Allocation with reuse candidates. [Diagram: from the data dependence graph of a basic block and the interference graph, pairs of values are classified as 'definitely overlap', 'definitely NO overlap', or 'possible overlap'.] (Register Reuse Scheduling)

  13. Register Allocation with reuse candidates. (Register Reuse Scheduling)

  14. Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling. Mounira Bachir, Sid-Ahmed-Ali Touati, Albert Cohen. Objective: minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II). Motivation: code size is tied to memory requirements and I-cache performance. Method: strategically insert move operations, without increasing the II, to split meeting-graph components into smaller ones. Results: "good" if there are enough functional units to perform the additional move operations, with acceptable execution time.

  15. Periodic Register Allocation • Rotating Register File [Diagram: a rotating register file R.] (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  16. Periodic Register Allocation • Rotating Register File • Move operations: d-1 MOVs per iteration, where d is the iteration span of the variables (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  17. Periodic Register Allocation • Rotating Register File • Move operations • Loop unrolling: 3× the code size (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  18. Periodic Register Allocation • Rotating Register File • Move operations • Loop unrolling • Modulo Variable Expansion [Example: variables a, b, c across iterations i, i+1, i+2; MVE uses 9 registers even though MAXLIVE = 8.] (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)
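A minimal C-level sketch of modulo variable expansion (my own example, not the slide's kernel): a value loaded one iteration ahead is live across the loop back-edge, so without a rotating register file the kernel is unrolled by two and the value is renamed into two copies (t0, t1), so that consecutive iterations do not overwrite each other.

```c
#include <stddef.h>

/* b[i + 1] is loaded one iteration ahead (software pipelining), so the
 * loaded value lives across the back-edge.  The kernel is unrolled by 2
 * and the value is renamed into two copies, t0 and t1. */
void scale(int *a, const int *b, size_t n)
{
    if (n == 0)
        return;
    int t0 = b[0], t1;
    size_t i = 0;
    for (; i + 2 < n; i += 2) {
        t1 = b[i + 1];        /* load for iteration i + 1 */
        a[i] = t0 * 3;        /* use value loaded for iteration i */
        t0 = b[i + 2];        /* load for iteration i + 2 */
        a[i + 1] = t1 * 3;    /* use value loaded for iteration i + 1 */
    }
    for (; i < n; i++)        /* epilogue for the remaining iterations */
        a[i] = b[i] * 3;
}
```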

  19. Periodic Register Allocation • Rotating Register File • Move operations • Loop unrolling • Modulo Variable Expansion • Meeting Graph: lifetimes are measured in cycles; the lifetime interval of c ends when the interval of b begins (they 'meet'). (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  20. Meeting Graph [Example: the variables a, b, c of eight consecutive iterations i through i+7 arranged in the meeting graph.] (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  21. Circuit Decomposition. (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

  22. Main Conference: 2011 International Symposium on Code Generation and Optimization

  23. MAO – an extensible Micro-Architectural Optimizer. Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google). Micro-architectural details are not always documented; proprietary compilers are at an advantage! [Example: SPEC2000 int vs. SPEC2000 int with one extra NOP instruction inserted before a loop: -7% execution time.]

  24. MAO – an extensible Micro-Architectural Optimizer. Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google). Micro-architectural details are not always documented. Example: instruction decoding in the Core 2 proceeds in chunks of 16 bytes. [Diagram: the SPEC2000 int loop with and without the NOP, shown relative to the 16-byte alignment boundaries.]
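A rough illustration of why that matters (my own back-of-the-envelope sketch with an assumed loop-body size, not MAO code): the number of aligned 16-byte decode chunks a loop body occupies depends on its start offset, so a one-byte NOP that shifts the loop across a boundary can change how many chunks the front end must fetch per iteration.

```c
#include <stdio.h>

/* Number of aligned 16-byte chunks covered by a code region that starts
 * at byte offset 'start' and is 'size' bytes long. */
static unsigned chunks_touched(unsigned start, unsigned size)
{
    unsigned first = start / 16;
    unsigned last  = (start + size - 1) / 16;
    return last - first + 1;
}

int main(void)
{
    const unsigned body = 20;             /* hypothetical loop body size */
    for (unsigned off = 0; off < 16; off++)
        printf("loop start offset %2u -> %u decode chunks per iteration\n",
               off, chunks_touched(off, body));
    return 0;
}
```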

  25. MAO – an extensible Micro-Architectural Optimizer. Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google). Contributions and conclusions: • An extensible assembly-to-assembly optimizer • Does not fit in the GCC flow, because after the RTL level not enough information is preserved • Discovers micro-architectural details semi-automatically through generation of micro-benchmarks

  26. Dynamic register promotion of stack variables. Jianjun Li, Chenggang Wu, Wei-Chung Hsu. Use dynamic binary translation (DBT) to let x86 binaries use the extra registers of x86-64: recompiling is not always an option (legacy binaries), and compute-intensive applications gain speed when using 64-bit. Challenge: implicit stack accesses, solved using page protection and stack switching (with a shadow stack).
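A rough source-level sketch of what register promotion buys (my own illustration; the paper does this at the binary level through DBT): a stack slot that is read and written in every iteration of a hot loop is kept in a spare register for the whole loop and written back once at the end. The catch noted on the slide is that code reaching the slot through an implicit stack access would see a stale value, hence the page-protection and shadow-stack machinery.

```c
#include <stddef.h>

/* Before: 'sum' lives in a stack slot (volatile forces it to memory,
 * like a spill slot); every iteration loads and stores the slot. */
long sum_slot(const long *v, size_t n)
{
    volatile long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = sum + v[i];
    return sum;
}

/* After promotion: the value stays in a register inside the loop and is
 * written back to the slot only once.  An implicit access to the slot
 * while the loop runs would observe a stale value. */
long sum_promoted(const long *v, size_t n)
{
    volatile long slot = 0;
    long r = slot;                 /* promote the slot into a register */
    for (size_t i = 0; i < n; i++)
        r += v[i];
    slot = r;                      /* single write-back */
    return slot;
}
```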

  27. Language and compiler support for auto-tuning variable-accuracy algorithms. Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe (MIT). PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler. • A new programming language, toolchain and run-time environment • A technique for mapping variable-accuracy code to enable efficient auto-tuning

  28. Practical memory checking with Dr. Memory. Derek Bruening (Google), Qin Zhao (MIT). Existing memory-checking tools (e.g. Valgrind) are slow and produce many false positives. x86.
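For context, a tiny example (mine, not from the talk) of the kind of defects such a memory checker reports: an uninitialized read and a heap buffer overflow, both of which typically go unnoticed in a normal run.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *buf = malloc(4 * sizeof *buf);
    if (!buf)
        return 1;
    int x;                 /* never initialized */
    if (x > 0)             /* uninitialized read: flagged by the checker */
        puts("positive");
    buf[4] = 42;           /* one element past the end: heap overflow */
    free(buf);
    return 0;
}
```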

  29. A trace-based Java JIT compiler retrofitted from a method-based compiler. Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM). Extend the compilation scope from methods to traces. Traces span multiple method invocations, which is more powerful than method inlining.

  30. A trace-based Java JIT compiler retrofitted from a method-based compiler. Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM). Claim: current trace JITs are immature; keep the advanced optimization infrastructure by retrofitting an existing method-based compiler.

  31. Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors. Tyler Sondag and Hridesh Rajan. Objective: design and apply a transparent, fully automatic process called phase-based tuning which adapts an application to effectively utilize performance-asymmetric multicores. Motivation: the trend towards performance asymmetry among the cores of a single chip. Method: statically partition the application into code sections that are likely to have similar runtime behavior; the runtime characteristics exhibited by representative sections are used to map the whole cluster. Results: 36% average process speedup with negligible overheads.

  32. Phase-based tuning. (Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors)

  33. Vapor SIMD: Auto-Vectorize Once, Run Everywhere. Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rozen, Albert Cohen, Ayal Zaks. Objective: design a split vectorization framework and study how it compares to a monolithic one. Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse. Method: mix and match existing open compilation tools, namely GCC and Mono. Results: comparable to specialized monolithic offline compilers.
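As an illustration of the kind of code the split scheme targets (my example, not from the paper): a simple unit-stride loop like the one below is auto-vectorized once, offline, into a generic target-neutral vector form, and the JIT later maps that form onto whatever SIMD ISA the machine provides (e.g. SSE, AltiVec, NEON).

```c
/* A unit-stride loop of this shape is vectorized once into a generic
 * vector form; the JIT later lowers that form to the SIMD width and
 * instructions of the machine it happens to run on. */
void saxpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```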

  34. Vectorizing for different platforms. (Vapor SIMD: Auto-Vectorize Once, Run Everywhere)

  35. Split vectorization scheme. (Vapor SIMD: Auto-Vectorize Once, Run Everywhere)

  36. Interoperable compilation flows. (Vapor SIMD: Auto-Vectorize Once, Run Everywhere)

  37. This is not a bullet slide.
