Workshop on Optimizations for DSP and Embedded Systems

ODES-9 Workshop on Optimizations for DSP and Embedded Systems

Eliminating non-productive memory operations in narrow-bitwidth architectures
Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC)
• 64-bit datapath: 64-bit addressing and high-precision computing
• Diagram: one 64-bit adder (64 bits in, 64 bits out) versus four 16-bit adders
• 16-bit integer datapath: 40% of computations need only a 16-bit datapath
• Caveat: a 64-bit computation becomes 8 * 16-bit computations (DBT)

What does non-productive mean?
  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
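The addition on this slide can be traced in code. This is a minimal sketch, assuming that a 64-bit addition is emulated as four 16-bit chunk additions with carry, and that a chunk operation counts as non-productive when both input chunks and the incoming carry are zero. The function names and this working definition are illustrative assumptions, not taken from the talk.

```python
MASK16 = 0xFFFF  # one 16-bit chunk


def split16(value):
    """Split a 64-bit value into four 16-bit chunks, least significant first."""
    return [(value >> (16 * i)) & MASK16 for i in range(4)]


def add64_in_chunks(a, b):
    """Add two 64-bit values chunk by chunk (result wraps modulo 2**64).

    Returns (result, productive), where productive[i] is False when chunk i
    had zero inputs and zero carry-in (the assumed meaning of "non-productive").
    """
    carry = 0
    result = 0
    productive = []
    for i, (x, y) in enumerate(zip(split16(a), split16(b))):
        productive.append(bool(x or y or carry))
        total = x + y + carry
        result |= (total & MASK16) << (16 * i)
        carry = total >> 16
    return result, productive


result, productive = add64_in_chunks(0x0000000000000001, 0x0000000000000025)
print(hex(result))   # 0x26
print(productive)    # [True, False, False, False]
```

Under these assumptions, only the lowest chunk does useful work in the slide's example; the three upper chunk operations (and the memory accesses that would feed them) contribute nothing.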

Contributions and conclusions
• A narrow ISA offers more opportunities to remove non-productive memory operations
• 50% of dynamic narrow operations are non-productive
• Memory Productiveness Pruning: profile-guided, dynamic optimization

Energy efficient code generation for processors with exposed datapath
Dongrui She, Yifan He, Bart Mesman, Henk Corporaal (TUE)
• Exposed datapath: software controls every movement in the datapath
• Example: transport-triggered architecture (Henk Corporaal)
• Register file access reduction

Register Reuse Scheduling
Gergö Barany
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs

Register allocation and spilling (Register Reuse Scheduling)
Diagram labels: virtual registers, physical registers, memory

Register allocation with reuse candidates (Register Reuse Scheduling)
Diagram labels: data dependence graph, basic block, interference graph; definitely overlap, definitely NO overlap, possible overlap
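The three overlap classes named on the slide can be derived from the DDG alone. The sketch below is one possible reading, not the author's implementation: a value is named after the instruction that defines it, `uses` lists the instructions that read it, and the names `reachable` and `classify` as well as the example graph are hypothetical.

```python
def reachable(ddg, start):
    """Return all instructions reachable from start via one or more DDG arcs."""
    seen, stack = set(), list(ddg.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(ddg.get(node, ()))
    return seen


def classify(ddg, uses, v, w):
    """Classify how the live ranges of values v and w relate in any schedule."""
    after = {n: reachable(ddg, n) for n in ddg}

    def dead_before(x, y):
        # Every use of x is y's defining instruction or must precede it.
        return all(u == y or y in after[u] for u in uses[x])

    def live_past(x, y):
        # Some use of x must come strictly after y's definition.
        return any(u in after[y] for u in uses[x])

    if dead_before(v, w) or dead_before(w, v):
        return "definitely no overlap"
    if live_past(v, w) and live_past(w, v):
        return "definitely overlap"
    return "possible overlap"


# Hypothetical basic block: c = a + b; e = c + d; f stores e.
ddg = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": ["f"], "f": []}
uses = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": ["f"]}

print(classify(ddg, uses, "a", "d"))  # possible overlap
ddg["c"].append("d")                  # extra arc: d must be scheduled after c
print(classify(ddg, uses, "a", "d"))  # definitely no overlap
```

Adding the arc from c to d is the kind of scheduling decision the method slide describes: it turns a possible overlap into a guaranteed reuse opportunity, at the cost of constraining the schedule.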

Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
Mounira Bachir, Sid-Ahmed-Ali Touati, Albert Cohen
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is related to memory requirements and I-cache performance
Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones
Results: “Good” if there are enough functional units to perform the additional move operations and the execution time is acceptable

Periodic Register Allocation (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)
• Rotating register file
• Move operations: d-1 MOVs per iteration, where d is the iteration span of variables
• Loop unrolling: 3 * code size
• Modulo variable expansion: a[i] b[i] c[i], a[i+1] b[i+1] c[i+1], a[i+2] b[i+2] c[i+2], using 9 registers instead of 8 (MAXLIVE = 8)
• Meeting graph: lifetimes in cycles; the lifetime interval of c ends when the interval of b begins
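The "9 registers instead of 8" figure for modulo variable expansion can be reproduced with a small calculation. This is a sketch under stated assumptions: the lifetimes (6, 6 and 4 cycles) and II = 2 are hypothetical values chosen to match the slide's numbers, and the variant shown unrolls by the largest per-variable instance count and gives each variable the smallest register count that divides the unroll factor.

```python
from math import ceil


def modulo_variable_expansion(lifetimes, ii):
    """Return (unroll_factor, registers_per_variable) for the kernel loop.

    lifetimes maps a variable to its lifetime in cycles; ii is the
    initiation interval. Each variable needs ceil(lifetime / ii) live
    instances; its register count is rounded up to a divisor of the
    unroll factor so the allocation repeats every unrolled kernel.
    """
    copies = {v: max(1, ceil(lt / ii)) for v, lt in lifetimes.items()}
    unroll = max(copies.values())
    registers = {
        v: next(d for d in range(q, unroll + 1) if unroll % d == 0)
        for v, q in copies.items()
    }
    return unroll, registers


def register_lower_bound(lifetimes, ii):
    """Lower bound on registers: total lifetime per iteration divided by II."""
    return ceil(sum(lifetimes.values()) / ii)


lifetimes = {"a": 6, "b": 6, "c": 4}  # hypothetical lifetimes in cycles
unroll, registers = modulo_variable_expansion(lifetimes, ii=2)
print(unroll)                              # 3
print(sum(registers.values()))             # 9
print(register_lower_bound(lifetimes, 2))  # 8
```

With these assumed numbers the kernel is unrolled three times and c receives a third register it does not strictly need, which is the gap between 9 and the bound of 8 that the meeting graph approach aims to close.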

Meeting Graph (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)
Figure: instances a[i] b[i] c[i] through a[i+7] b[i+7] c[i+7] (eight iterations)

Circuit Decomposition (Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling)

Main Conference 2011 International Symposium on Code Generation and Optimization

MAO – an extensible Micro-Architectural Optimizer
Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google)
• Micro-architectural details are not always documented, which puts proprietary compilers at an advantage
• SPEC2000 int: adding 1 NOP instruction in front of a loop gave 7% less execution time
• Example: instruction decoding in Core 2 happens in chunks of 16 bytes
• Diagram: the loop's position relative to the 16-byte alignment boundary, without and with the NOP
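The effect of a single NOP on decode chunks is easy to check by hand. The following is a minimal sketch, assuming only what the slide states (16-byte decode chunks); the loop address and size are hypothetical, and the helper names are not part of MAO.

```python
CHUNK = 16  # decode chunk size in bytes, as stated for Core 2 on the slide


def chunks_spanned(start, size):
    """Number of 16-byte decode chunks touched by code at [start, start + size)."""
    first = start // CHUNK
    last = (start + size - 1) // CHUNK
    return last - first + 1


def padding_to_align(start):
    """Bytes of padding needed so that code at start begins on a chunk boundary."""
    return -start % CHUNK


# Hypothetical loop: 16 bytes of code starting at address 0x100F.
start, size = 0x100F, 16
print(chunks_spanned(start, size))        # 2
pad = padding_to_align(start)
print(pad)                                # 1
print(chunks_spanned(start + pad, size))  # 1
```

In this made-up case a one-byte NOP halves the number of chunks the loop body occupies, which illustrates why such a small change can have a measurable effect.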

Contributions and conclusions
• Extensible assembly-to-assembly optimizer
• Does not fit in the GCC flow, because not enough information is preserved after the RTL level
• Discovers micro-architectural details semi-automatically through generation of micro-benchmarks

Dynamic register promotion of stack variables
Jianjun Li, Chenggang Wu, Wei-Chung Hsu
• Use DBT to let x86 binaries use the extra registers on x86-64
• Recompiling is not always an option (legacy binaries)
• Compute-intensive applications gain speed when using 64-bit
• Challenge: implicit stack accesses
• Solved using page protection and stack switching (with shadow stack)

Language and compiler support for auto-tuning variable-accuracy algorithms
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
• New programming language, toolchain and run-time environment
• Technique for mapping variable-accuracy code to enable efficient auto-tuning

Practical memory checking with Dr. Memory
Derek Bruening (Google), Qin Zhao (MIT)
Existing memory checking tools (e.g. Valgrind): slow, many false positives
x86

A trace-based Java JIT compiler retrofitted from a method-based compiler
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM)
• Extend the compilation scope from methods to traces
• Traces span multiple method invocations
• More powerful than method inlining
• Claim: current trace-JITs are immature
• Keep the advanced optimization infrastructure by retrofitting

Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors
Tyler Sondag and Hridesh Rajan
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. Exhibited runtime characteristics of representative sections are used to map the whole cluster
Results: 36% average process speedup with negligible overheads

Phase-based tuning (Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors)

Vapor SIMD: Auto-Vectorize Once, Run Everywhere
Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rozen, Albert Cohen, Ayal Zaks
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix and match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers

Vectorizing for different platforms Vapor SIMD: Auto-Vectorize Once, Run Everywhere

Split vectorization scheme Vapor SIMD: Auto-Vectorize Once, Run Everywhere

Interoperable compilation flows Vapor SIMD: Auto-Vectorize Once, Run Everywhere

This is not a bullet slide.
