Code Compaction for UniCore on Link-Time Optimization Platform. Zhang Jiyu, Compilation Toolchain Group, MPRC
CLOU is a Link-time Optimizer for UniCore [Figure: object files with code, data, and meta sections, and stages labeled Linking, Translation to IR, CFG construction & Optimizations, Layout, and Assembling, ending in the executable. A graph modified from Diablo.]
Code Compaction based on CLOU • Motivation of code compaction • Limited memory and energy resources for embedded systems • Code density affects both memory and energy consumption • Goal: reducing code size without losing performance • Code compaction at different levels 1. Typical optimizations for code size reduction at link-time 2. Hot/cold code splitting 3. New mixed code generation method
Typical Optimizations for Code Size Reduction • Redundant code elimination • Computations whose results have been computed previously and are guaranteed to be available at that point • Unreachable code elimination • Code fragments to which there is no control flow path from the entry node • Many of them follow useless comparisons • Dead code elimination • Computations whose results are never used • Peephole optimization • Procedural abstraction (might lead to performance loss)
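Unreachable code elimination can be illustrated with a small graph search. The following is a minimal sketch, not CLOU's implementation: the CFG is assumed to be a plain dictionary mapping a block name to its successor names, and the block names and the helper functions `reachable_blocks` and `eliminate_unreachable` are hypothetical.

```python
from collections import deque


def reachable_blocks(cfg, entry):
    """Return the set of blocks reachable from `entry` (breadth-first search)."""
    seen = {entry}
    work = deque([entry])
    while work:
        block = work.popleft()
        for succ in cfg.get(block, ()):
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return seen


def eliminate_unreachable(cfg, entry):
    """Drop every block that has no control flow path from the entry node."""
    live = reachable_blocks(cfg, entry)
    return {block: list(succs) for block, succs in cfg.items() if block in live}


# Hypothetical CFG: the comparison in "cmp" was found to be useless, so the
# edge to "else" has already been removed and "else" has no predecessor left.
cfg = {
    "entry": ["cmp"],
    "cmp": ["then"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}
pruned = eliminate_unreachable(cfg, "entry")
print(sorted(pruned))
```

The search starts from the entry node only, so a block that is referenced by nothing but other unreachable blocks is removed as well.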
Experiments for Typical Optimizations for Code Size Reduction • Benchmark: Mediabench • Code size reduction • Average: 12.8% • Max: 22.3% • Performance improvement • Average: 2.4% • Max: 4.2%
Hot/Cold Code Splitting [Figure: three code layouts, numbered 1 to 3, arranging the condition, hot code, cold code, and remaining code in different orders] • Less code transferred from remote to local, from disk to memory, or from memory to cache • Question: might splitting be too conservative or lead to performance loss? • Split hot and cold code through basic block reordering
Hot/Cold Code Splitting • PH: a popular greedy approach • Structural-analysis-based basic block reordering • Most of a program can be decomposed into several typical structures • Cost model for each structure • Minimal-cost layout: the optimal layout for each local structure, based on profiling information
Basic Block Reordering • Cost Model • Different kinds of control flow edges have different costs • For a specific order, the cost is accumulated over the control flow edges [equation lost in extraction] • A list can be derived for each structure: f(structure, frequencies of all edges) → the best order of basic blocks for the local structure
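The idea of a per-structure cost model can be sketched as follows. This is an illustration under stated assumptions, not the cost model from the slides: the two edge costs (`FALLTHROUGH_COST`, `TAKEN_BRANCH_COST`), the if-then-else block names, and the profile counts are all hypothetical, and the best order is found by brute-force enumeration, which is only practical for small local structures.

```python
from itertools import permutations

# Hypothetical edge costs: an edge is free when its target is laid out
# directly after its source, and costs a taken branch otherwise.
FALLTHROUGH_COST = 0
TAKEN_BRANCH_COST = 2


def layout_cost(order, edge_freq):
    """Sum frequency * cost over all control flow edges for one block order."""
    pos = {block: i for i, block in enumerate(order)}
    total = 0
    for (src, dst), freq in edge_freq.items():
        adjacent = pos[dst] == pos[src] + 1
        total += freq * (FALLTHROUGH_COST if adjacent else TAKEN_BRANCH_COST)
    return total


def best_order(blocks, edge_freq):
    """Try every order that keeps the first block first; return the cheapest."""
    entry, rest = blocks[0], blocks[1:]
    candidates = [(entry,) + perm for perm in permutations(rest)]
    return min(candidates, key=lambda order: layout_cost(order, edge_freq))


# Hypothetical profile of an if-then-else: "then" is hot, "else" is cold.
edge_freq = {
    ("cond", "then"): 90,
    ("cond", "else"): 10,
    ("then", "join"): 90,
    ("else", "join"): 10,
}
order = best_order(["cond", "then", "else", "join"], edge_freq)
print(order, layout_cost(order, edge_freq))
```

With this profile the cheapest order places the cold `else` block after `join`, so the hot path runs through fall-through edges only. This is the hot/cold separation that reordering is meant to achieve.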
Experiments • Complexity: O(N log N), where N is the number of basic blocks • Experiment results (not using other link-time optimizations) [Figures: normalized cycle counts; normalized cache miss rate]
Mixed Code Generation • Dual-width Instruction Set • 32-bit ISA: more powerful • 16-bit ISA: more compact • Less coding space for operations • Smaller register fields • Smaller immediate fields
Example: one 32-bit instruction and its 16-bit equivalent
32-bit:
    add r0, r0, 0xff800000
16-bit:
    str r2, [addr]
    mov r2, 0xff
    lsl r2, #1
    add r2, #1
    lsl r2, #23
    add r0, r2
    ld r2, [addr]
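The 16-bit expansion can be checked by emulating register r2 with 32-bit wraparound arithmetic. The sketch below assumes that `lsl` discards bits shifted past bit 31 and that the 16-bit `mov` accepts only a short immediate such as 0xff. The helper `build_constant` is hypothetical. It searches for the final shift amount that turns 0x1ff into the constant 0xff800000.

```python
MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


def build_constant(shift):
    """Emulate the constant-building part of the 16-bit sequence on r2."""
    r2 = 0xFF                    # mov r2, 0xff
    r2 = (r2 << 1) & MASK32      # lsl r2, #1   -> 0x1fe
    r2 = (r2 + 1) & MASK32       # add r2, #1   -> 0x1ff
    r2 = (r2 << shift) & MASK32  # lsl r2, #shift
    return r2


target = 0xFF800000
matching_shifts = [s for s in range(32) if build_constant(s) == target]
print(matching_shifts, hex(build_constant(23)), hex(build_constant(24)))
```

A shift of 23 is the only one that yields 0xff800000. A shift of 24 pushes the top bit out of the register and leaves 0xff000000. Either way, one 32-bit instruction becomes seven 16-bit instructions, which is the cost of the smaller immediate field.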
Mixed Code Generation • Related work on dual-width instruction set design and mixed code generation • Coarse-grained, function-level mixed code generation • By BX in ARM and JALX in MIPS • Simple fine-grained, instruction-level mixed code generation • By BX in ARM and JALX in MIPS • By a single specific mode-changing instruction • Specialized coding • An instruction word with a leading one holds one 32-bit instruction; a word with a leading zero holds two 16-bit instructions • 16-bit ISA extensions • Problem: these approaches always lead to performance loss
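The specialized-coding scheme can be sketched as a decoder that inspects the leading bit of each instruction word. The exact bit layout is not given in the slides, so everything below is an assumption: bit 31 is taken as the leading bit, and in a zero-leading word the two narrow instructions are taken from bits 30..16 and bits 15..0. The function `split_word` is hypothetical.

```python
def split_word(word):
    """Split one 32-bit instruction word according to its leading bit."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    if word >> 31:
        # Leading one: the whole word is a single 32-bit instruction.
        return [("32-bit", word & 0x7FFFFFFF)]
    # Leading zero: the word carries two narrow instructions
    # (assumed layout: upper slot in bits 30..16, lower slot in bits 15..0).
    return [("16-bit", (word >> 16) & 0x7FFF), ("16-bit", word & 0xFFFF)]


stream = [0x80000001, 0x12345678]
decoded = [insn for word in stream for insn in split_word(word)]
print(decoded)
```

The leading bit is spent on every word, which illustrates why such schemes reduce the coding space available to the instructions themselves.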
Potential benefit • Analysis of programs in Mediabench • 27851 different instructions in all programs • ⌈log2(27851)⌉ = 15, so 15 bits are enough to give every distinct instruction its own code
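The bit count follows from a short calculation. The sketch below uses integer arithmetic only. The helper names `bits_to_index` and `count_distinct` are hypothetical, and the small word list is made-up data used only to show the counting step.

```python
def bits_to_index(n_distinct):
    """Smallest number of bits that can give n_distinct items unique codes."""
    return max(1, (n_distinct - 1).bit_length())


def count_distinct(instruction_words):
    """Number of different instruction words in a program image."""
    return len(set(instruction_words))


bits_needed = bits_to_index(27851)
print(bits_needed)

# Made-up example: five words, three of them distinct.
sample = [0x1A000000, 0x1A000000, 0x2B000001, 0x3C000002, 0x2B000001]
print(count_distinct(sample), bits_to_index(count_distinct(sample)))
```

Since 2^14 = 16384 < 27851 ≤ 32768 = 2^15, the result is 15 bits, which is below the 16 bits available in a narrow instruction.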
Two Main Kinds of Frequent Instructions • Two-operand instructions • mov rd, rm or short immediate • cmp rn, rm or short immediate • Branch/Jump [Figure: distribution of immediate offsets of branch instructions]
The Idea of Mode-Changing Instruction Set (MC) • Extend the 32-bit ISA with a small MC instruction set (using the reserved coding space) • An MC instruction changes the CPU mode • It also performs its own normal operation • Scan for suitable 32-bit instructions to be encoded into 16-bit instructions [Figure: a mixed code fragment with MC instructions]
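The scanning step can be sketched as a single pass that looks for runs of instructions that fit a narrow encoding. This is an illustration only. The encodability rule in `fits_16bit`, the `min_run` threshold, the mnemonics, and the size estimate are hypothetical and are not taken from the UniCore ISA or from CLOU.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Insn:
    op: str
    regs: Tuple[int, ...]
    imm: Optional[int] = None


def fits_16bit(insn):
    """Hypothetical rule: two-operand form with at most a short immediate."""
    if insn.op not in {"mov", "cmp", "add"}:
        return False
    if len(insn.regs) + (insn.imm is not None) > 2:
        return False
    return insn.imm is None or 0 <= insn.imm <= 0xFF


def find_narrow_runs(code, min_run=2):
    """Return (start, end) index pairs of runs worth switching modes for."""
    runs, start = [], None
    for i, insn in enumerate(code + [None]):  # the sentinel closes the last run
        ok = insn is not None and fits_16bit(insn)
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


def mixed_size(code, runs):
    """Estimated size in bytes: 2 per narrow instruction, 4 per wide one."""
    narrow = sum(end - start for start, end in runs)
    return 4 * (len(code) - narrow) + 2 * narrow


code = [
    Insn("ldw", (0, 1), 4),            # not in the narrow subset
    Insn("mov", (2,), 0xFF),
    Insn("cmp", (2, 3)),
    Insn("add", (0, 2)),
    Insn("add", (0, 0), 0xFF800000),   # three operands, long immediate
    Insn("mov", (1, 2)),               # isolated: run shorter than min_run
]
runs = find_narrow_runs(code)
print(runs, mixed_size(code, runs), 4 * len(code))
```

Because an MC instruction also performs its own normal operation, the mode switch itself adds no bytes, so the estimate does not add any per-run overhead. The `min_run` threshold is kept only as a tunable assumption.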
Modification to Micro Architecture • No extra cycles • One more 16-bit instruction-fetch buffer • An MC-decoder [Figures: mixed code execution in Unicore-I pipeline; improved mixed code execution in Unicore-I pipeline]
Mixed Code Generation [Figure: tool flow with the input programs, the Instruction Analyzer, the Mode-Changing Instructions, the Link-Time Optimizer, the mixed-coded program, and the Simulator]
Experiment Results • Normalized code size (results not using other link-time optimizations)
Conclusion • Code compaction on Link-Time Optimization Platform • Compiler optimizations applied at link time • Typical optimizations for code size reduction • Program layout optimization • Hot/cold code splitting through basic block reordering • Machine code generation • Mixed code generation • Experiment Results • Average code size reduction: 32.9% • Average performance improvement: 9.1%
Instruction Analysis [Figure: instruction format type classifications]
Experiment Results [Figures: normalized dynamic instruction numbers; normalized cycle counts]