Project Presentation

Project Presentation by Joshua George • Advisor: Dr. Jack Davidson

Exploiting hardware for loop optimizations • ZOLB – Zero Overhead Loop Buffers • Decrement, compare and jump instructions • Compare with zero instructions

Background • VPO – Very Portable Optimizer – operates on a low-level, m/c independent form – RTLs • ZOLB – Many DSPs have a compiler-managed cache for loops • Reduces loop overhead • No branch • Buffered in internal buffer – saves on instruction fetch • Can reduce code size • Power overhead remains low • Decrement, compare and jump • E.g. loop instruction on x86, banz on tms320c54x • Compare with zero • E.g. SPARC

Status • Added support for Repeat instructions (ZOLB) on tms320c54x. • Support for converting loops to count down so as to make use of decrement, compare and jmp instructions – retargeted to three machines – x86, SPARC and tms320c54x.

Implementation - guidelines • Add minimum possible code to m/c dependent part (md), while doing most of the implementation in the m/c independent part (lib). • Design the interface between lib and md to allow for possible issues with other targets.

Issues (ZOLB) • How to describe effectively? An example:
    BRC=10;                       (Block Repeat Count)
    RSA=L1;REA=EN[L1];            (Repeat Start Address and Repeat End Address)
    L1: w[0]=w[0]+1;W[w[0]]=0;
    PC=BRC>0,L1;BRC=BRC-1;
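The RTL description above can be read as a small state machine. The following is a minimal sketch of that reading, not the VPO implementation: it assumes that both effects of the last RTL are evaluated in parallel, so the compare sees BRC before the decrement. The names `run_block_repeat`, `memory` and `state` are hypothetical.

```python
def run_block_repeat(brc, body):
    """Run `body` under the RTL `PC=BRC>0,L1; BRC=BRC-1;`.

    Assumption: the compare reads BRC before the decrement takes effect.
    Returns how many times the body executed.
    """
    executions = 0
    while True:
        body()
        executions += 1
        jump_back = brc > 0  # compare uses the old value of BRC
        brc -= 1             # decrement happens alongside the compare
        if not jump_back:
            return executions


# Hypothetical data memory and address register standing in for W[] and w[0].
memory = [1] * 16
state = {"w0": 0}


def loop_body():
    # L1: w[0]=w[0]+1; W[w[0]]=0;
    state["w0"] += 1
    memory[state["w0"]] = 0


count = run_block_repeat(10, loop_body)
```

Under this reading a repeat count of N runs the body N+1 times, which would be consistent with the `rpt #9` used for a ten-iteration loop in the example slide.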

Issues (ZOLB) • How to bind the rpt instruction to the start of the rpt block (on the tms320c54x, the start of the rpt block is implicitly the instruction after the rpt instruction)? • Changing vpo to support ‘binding’ of an instruction to the next would be overkill. • Solution: Make fixentry() take care of this, after vpo has finished its optimization loop.

Issues (ZOLB) • How to describe unrepeatable instructions? • The machine description sets the UNREPEATABLE flag for each unrepeatable instruction. • Machine description also provides a list of instructions that disappear after conversion. VPO ignores instructions in this list when checking for unrepeatability.
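A minimal sketch of the check described above, assuming a simplified instruction record. `Insn`, `loop_is_repeatable` and the choice of which instructions are flagged are hypothetical and only illustrate how the UNREPEATABLE flag and the "disappearing" list interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    text: str
    unrepeatable: bool = False  # set by the machine description (assumed field)


def loop_is_repeatable(loop_insns, disappearing):
    """True if every instruction that survives the conversion is repeatable.

    Instructions in `disappearing` are skipped, since they will not be
    present in the repeat block after conversion.
    """
    return all(
        not insn.unrepeatable
        for insn in loop_insns
        if insn not in disappearing
    )


# Illustrative loop: the compare and branch go away after conversion.
body = [
    Insn("w[0]=w[0]+1;W[w[0]]=0;"),
    Insn("r[0]=r[0]-(_A+10);"),
    Insn("PC=r[0]<0,L1;", unrepeatable=True),  # flag chosen for illustration
]
disappearing = {body[1], body[2]}
```

With the branch listed as disappearing the loop passes the check; without that list the same loop would be rejected.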

Issues (ZOLB) • How to specify end-label? • If we simply label the next block, vpo won't print the label since it cannot see a jump to that label. • Solution: Use a mangled version of the start label (e.g. L1_end) as the end label for the rpt instruction. Output the same mangled version of the start label when the last instruction in the rpt block is encountered in fixentry. Note that this last instruction contains the start label.
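The label mangling can be sketched as follows. This is only an illustration of the idea, not the code in fixentry: `LAST_RTL`, `end_label` and `emit_block` are hypothetical names, and the regular expression is tailored to the RTL text shown in these slides.

```python
import re

# Matches the closing RTL of a repeat block, e.g. "PC=b[0]>0,L1;b[0]=b[0]-1;",
# and captures the start label it jumps to. The pattern is illustrative only.
LAST_RTL = re.compile(r"PC=b\[0\]>0,(\w+);b\[0\]=b\[0\]-1;")


def end_label(start_label):
    """Derive the end label from the start label (L1 becomes L1_end)."""
    return start_label + "_end"


def emit_block(rtls):
    """Replace the closing RTL with the mangled end label; keep the rest."""
    out = []
    for rtl in rtls:
        match = LAST_RTL.fullmatch(rtl)
        if match:
            # The last RTL carries the start label, so the end label
            # can be reconstructed here without any extra bookkeeping.
            out.append(end_label(match.group(1)) + ":")
        else:
            out.append(rtl)
    return out


emitted = emit_block(["L1: w[0]=w[0]+1;W[w[0]]=0;", "PC=b[0]>0,L1;b[0]=b[0]-1;"])
```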

Implementation • Information supplied by md to lib. • Which instructions are unrepeatable. • The number of instructions that would remain after the conversion. • The list of rtls involved in the compare and jmp. • The elements involved in the compare (the register, the expression it is being compared with, and the relational operator) → helps to determine iteration count. • Identifying a comparison rtl. • How to initialize a register to an expression. • Note: Many md parts were already in place, e.g. loop strength reduction support code.

Implementation (cont..) • The md does the actual insertion of rpt rtls
    b[0]=10;                      (Block Repeat Count) → sr_init()
    b[1]=L1;b[2]=EN[L1];          (Repeat Start Address and Repeat End Address) → md_convert_rpt_block
    L1: w[0]=w[0]+1;W[w[0]]=0;
    PC=b[0]>0,L1;b[0]=b[0]-1;     → md_convert_rpt_block (the last rtl is simply converted to a label when outputting the assembly)

Implementation • What is done in lib? • Ensuring that the instructions in the loop are repeatable. • Counting number of instructions that will remain in loop after conversion. This is useful to allow md to determine if it wants to convert this to a single-rpt instruction. • Analysis of uses and life-time of loop control variable to determine if control variable increments can disappear. • Finding iteration count of the loop. • Identifying loop control variable/increment points. • Finding loop exit block. • Note : A lot of functionality (marked above) was already present in vpo lib.
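Finding the iteration count can be illustrated for the simplest case. The sketch below assumes a bottom-tested loop whose control variable starts at a known value, changes by a constant step each iteration, and is compared against a constant bound. The function names are hypothetical and the real analysis in vpo works on RTL expressions rather than plain integers.

```python
import operator

RELOPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def ceil_div(a, b):
    """Integer ceiling division for b > 0."""
    return -(-a // b)


def iteration_count(init, bound, step, relop):
    """Iterations of: do { var += step; } while (var relop bound);"""
    if step == 0 or relop not in RELOPS:
        raise ValueError("unsupported loop shape")
    if (step > 0) != (relop in ("<", "<=")):
        raise ValueError("loop does not move toward its bound")
    distance = (bound - init) if step > 0 else (init - bound)
    stride = abs(step)
    if relop in ("<", ">"):
        count = ceil_div(distance, stride)
    else:
        count = distance // stride + 1
    return max(count, 1)  # a bottom-tested loop always runs at least once


def simulate(init, bound, step, relop):
    """Brute-force cross-check: actually run the loop and count."""
    var, count = init, 0
    while True:
        var += step
        count += 1
        if not RELOPS[relop](var, bound):
            return count


# The example loop: start at _A (taken as 0 here), step 1, continue while var < _A + 10.
example = iteration_count(0, 10, 1, "<")
```

For the example loop this gives ten iterations, so a repeat operand of one less (as in `rpt #9`) matches the count.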

Example
Before conversion – 5 instructions. Has a branch.
    w[0]=_A;                      → stm #0, ar0
    L1: w[0]=w[0]+1;W[w[0]]=0;    → st #0, *ar0+
    r[0] = (w[0]{24)}24;          → ld *(ar0), A
    r[0] = r[0] - (_A + 10);      → sub _A+#10, A
    PC=r[0]<0,L1;                 → bc L1, Alt
After conversion to a single instruction repeat – only 3 instructions. Dynamic instruction count becomes much lower once the instruction is in the pipeline.
    w[0]=_A;                      → stm #0,ar0
    n[1]=L6;n[2]=EL[L6];n[0]=9;   → rpt #9
    L6: w[0]=w[0]+1;W[w[0]]=0;    → st #0, *ar0+
    PC=n[0]>0,L6;n[0]=n[0]-1;

Future work • How to prevent vpo from changing block size (e.g. when spills are added)? • In single repeat instruction, how to add support for auto-increment direct addressing mode? • E.g. rpt #123 mvdk *ar1, #800h

Count down loops • Objective – convert loops to count down to zero, instead of counting up to a constant or counting down to a constant. • Reasoning • Most architectures have a single compare to zero instruction. Comparing to other values needs at least one more instruction. • Some architectures can decrement, compare and jmp in a single instruction!
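A small sketch of the transformation, written as ordinary Python rather than RTL. It assumes a loop that zeroes `n` consecutive words (n ≥ 1) and whose pointer is still needed for addressing, so the conversion introduces a separate counter. `zero_count_up` and `zero_count_down` are hypothetical names used only for this illustration.

```python
def zero_count_up(mem, base, n):
    """Original shape: the exit test compares against a non-zero bound."""
    p = base
    while True:
        p += 1
        mem[p] = 0
        if not (p - (base + n) < 0):  # subtract the bound, then test the sign
            break


def zero_count_down(mem, base, n):
    """Converted shape: a counter runs down to zero and is tested directly."""
    p = base
    count = n  # initialised once, in the loop pre-header
    while True:
        p += 1
        mem[p] = 0
        count -= 1      # decrement ...
        if count == 0:  # ... and compare with zero
            break


mem_up = [1] * 16
mem_down = [1] * 16
zero_count_up(mem_up, 2, 10)
zero_count_down(mem_down, 2, 10)
```

Both versions touch the same memory; only the exit test changes, which is what lets a target fold the decrement, compare and jump into fewer instructions.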

Implementation • Information supplied by md to lib • List of registers that are candidates to form the count down to zero induction variable (e.g. on x86 it is advantageous to do this conversion only if the count down uses the ecx register). • Whether this conversion is worthwhile on this loop.

Implementation (cont..) • Information supplied by md to lib • How to initialize a register to an expression. • How to decrement a register. • Elements of a comparison. • Identifying a comparison rtl. • The relop used for comparing to zero.

Implementation • What is done in lib? • Finding the expression that represents the iteration count. • Identifying the loop control variable/increment points. • Analysis of uses and life-time of the loop control variable to determine if conversion is worthwhile. Decision made by md. • Identifying the exit block. • Spill/re-load new loop control variable if needed.

Implementation • What is done in lib? • Analyze list of candidate registers to select the best one for this loop. • First preference – the current control variable, provided it is free. • If worthwhile, then any other free register. • Last option is to use a register that is live across the loop, but not used within the loop. This register will have to be spilled in the loop pre-header and reloaded at loop exit.
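The preference order above can be sketched as a small selection function. This is an illustration under simplifying assumptions, not the vpo lib code: registers are plain strings, liveness information is passed in as ready-made sets, and all names are hypothetical.

```python
def pick_countdown_register(candidates, control_var, free,
                            live_across_unused, other_reg_worthwhile):
    """Return (register, needs_spill) or (None, False) if nothing fits.

    candidates           -- registers the md allows for the count-down variable
    control_var          -- register holding the current loop control variable
    free                 -- registers that are free for the whole loop
    live_across_unused   -- registers live across the loop but unused inside it
    other_reg_worthwhile -- md's verdict on using a register other than control_var
    """
    # First preference: the current control variable, provided it is free.
    if control_var in candidates and control_var in free:
        return control_var, False
    # If worthwhile, then any other free candidate register.
    if other_reg_worthwhile:
        for reg in candidates:
            if reg in free:
                return reg, False
    # Last option: a register live across the loop but not used in it;
    # it must be spilled in the pre-header and reloaded at loop exit.
    for reg in candidates:
        if reg in live_across_unused:
            return reg, True
    return None, False


# x86-style illustration: only ecx is a candidate, and it is live across the loop.
choice = pick_countdown_register(["ecx"], "eax", {"edx"}, {"ecx"}, True)
```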

Performance – spec on x86

Analysis • Average performance has improved after applying the count down optimization.

Conclusion • More fine-tuning needed to realize substantial performance gains. • Primary objective of adding easily retargetable support for these loop optimizations accomplished – retargeted to 3 targets!

Acknowledgements • Dr. Jack Davidson (advisor) • Jason Hiser • Clark Coleman
