Project Presentation

Project Presentation
by Joshua George
Advisor: Dr. Jack Davidson

Exploiting hardware for loop optimizations
• ZOLB – Zero Overhead Loop Buffers
• Decrement, compare and jump instructions
• Compare with zero instructions

VPO - Very Portable Optimizer
• VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists)
• VPO is retargeted by providing a machine description (MD) of the target machine and revising a few machine-dependent routines
• VPO is small, easily extended, and extremely effective

Implementation - guidelines
• Add as little code as possible to the machine-dependent part (MDP), and do most of the implementation in the machine-independent part (LIB)
• Design the interface between LIB and MDP to allow for possible issues with other targets
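One way to picture this split is a small capability record that the MDP fills in and the LIB consults. The sketch below is purely illustrative: `loop_hw_caps`, `loop_strategy`, and `choose_strategy` are hypothetical names and are not part of VPO's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record the machine-dependent part (MDP) would fill in. */
struct loop_hw_caps {
    int  has_repeat_buffer;   /* target has a zero overhead loop buffer     */
    long max_repeat_count;    /* largest count the repeat hardware can hold */
    int  has_dec_and_branch;  /* target has a decrement-compare-jump insn   */
};

enum loop_strategy { KEEP_LOOP, USE_COUNT_DOWN, USE_REPEAT_BUFFER };

/* Machine-independent (LIB) decision that looks only at the record,
 * never at target-specific details. */
enum loop_strategy choose_strategy(const struct loop_hw_caps *caps,
                                   long trip_count)
{
    assert(caps != NULL);
    if (trip_count < 1)
        return KEEP_LOOP;                 /* unknown or empty trip count */
    /* The repeat hardware is loaded with trip_count - 1. */
    if (caps->has_repeat_buffer && trip_count - 1 <= caps->max_repeat_count)
        return USE_REPEAT_BUFFER;
    if (caps->has_dec_and_branch)
        return USE_COUNT_DOWN;
    return KEEP_LOOP;
}
```

Keeping the decision in one machine-independent routine means a new target only has to describe its hardware, which is the spirit of the guideline above.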

ZOLB - Zero Overhead Loop Buffer
• DSP algorithms – loop intensive
  • E.g. FIR filter
• DSPs – power consumption and code size constraints
• ZOLB hardware – compiler-managed loop cache
• No branch
• Instructions executed from buffer
• Doesn’t need more power
• Reduces code size
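As a reminder of what "loop intensive" means here, a minimal C sketch of one FIR output sample follows. The function name `fir_sample` and its argument layout are assumptions made for illustration, not code from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one output sample of an ntaps-tap FIR filter.
 * coeff[i] is multiplied with x[i]; the caller arranges the sample order.
 * Hypothetical helper for illustration only. */
long fir_sample(const int *coeff, const int *x, size_t ntaps)
{
    long acc = 0;
    assert(coeff != NULL && x != NULL);
    /* Known trip count, tiny body, no calls: the kind of loop
     * a zero overhead loop buffer is meant for. */
    for (size_t i = 0; i < ntaps; i++)
        acc += (long)coeff[i] * x[i];
    return acc;
}
```

With a body this small, the compare and branch executed on every iteration are a large share of the work, which is the overhead the loop buffer removes.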

ZOLB on TMS320C54X
• TMS320C54X - a popular DSP from TI
• Has block repeat and single-instruction repeat
• Single repeat
    rpt #127
    st #0, *ar0+

ZOLB on TMS320C54X
• Block repeat
    stm #127, brc
    rptb L2
    L1: ….
    L2:

Conversion example
RTLs before conversion:
    w[0]=_A;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10 - _A;
    PC = r[0]<0,L1;
RTLs after conversion:
    w[0] = _A;
    b[1]=L1;
    b[2]=EN[L1];
    b[0]=9;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    PC=b[0]>0,L1;b[0]=b[0]-1;
Assembly before conversion:
    stm #_A, ar0
    L1: st #0, *ar0+
    ld *(ar0), A
    sub (_A+#10), A
    bc L1, Alt
Assembly after conversion:
    stm #_A, ar0
    rpt #9
    st #0, *ar0+
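The effect of the conversion can be modeled in C. This is a sketch under the assumption that a repeat count of n-1 issues the repeated instruction n times, which matches the `rpt #9` used for ten stores in the example; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form: explicit counted loop with a compare and a branch. */
void clear_with_branch(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 0;
}

/* Model of the single-repeat form: a count of n - 1 is loaded, then the
 * one store is issued count + 1 times through a post-incremented pointer. */
void clear_with_repeat(int *a, size_t n)
{
    assert(n >= 1);            /* the repeated store runs at least once */
    size_t count = n - 1;      /* e.g. 9 for ten stores                 */
    int *p = a;
    do {
        *p++ = 0;              /* plays the role of: st #0, *ar0+       */
    } while (count-- != 0);
}
```

Both functions leave the array in the same state; the second has no induction variable to compare against a limit inside the loop body.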

Retargetability
• 205 lines of C code added to MDP
  • Various other parts of MDP reused, for example the code that returns details of a comparison instruction
• 322 lines of C code added to LIB
  • Various other parts of VPO reused, for example the loop analysis code

Future work
• How to prevent VPO from changing the block size (for example, when spills are added)?
• How to add support for the auto-increment direct addressing mode in the single repeat instruction?
  • E.g.
      rpt #123
      mvdk *ar1, #800h

Count down loops
• Objective – convert loops to count down to zero, instead of counting up to a constant or counting down to a constant
• Reasoning
  • Most architectures have a single compare-to-zero instruction. Comparing to other values needs at least one more instruction.
  • Some architectures can decrement, compare and jump in a single instruction!
  • Sometimes it is possible to use one less register in the loop when counting down.
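At the source level the idea looks like the C sketch below. It assumes a bottom-tested loop whose body never reads the counter; `steps_up` and `steps_down` are hypothetical names and the `work += 3` body is only a stand-in.

```c
#include <assert.h>

/* Counts up to n: the exit test needs both the counter and the limit. */
unsigned steps_up(unsigned n)
{
    unsigned work = 0, i = 0;
    assert(n >= 1);
    do {
        work += 3;             /* stand-in body that never reads i */
        i++;
    } while (i < n);
    return work;
}

/* Counts down to zero: the exit test needs only the counter. */
unsigned steps_down(unsigned n)
{
    unsigned work = 0, i = n;
    assert(n >= 1);
    do {
        work += 3;
        i--;
    } while (i != 0);
    return work;
}
```

Both functions run the body the same number of times. In the second, the limit is needed only to initialize the counter, so it does not have to stay live inside the loop.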

Example – TMS320C54X
Exploiting the banz instruction
Before conversion:
    w[0] = 0;
    L1: …
    w[0]=w[0] + 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10;
    PC = r[0]<0,L1;
After conversion:
    w[0] = 10;
    L1: …
    w[0]=w[0] - 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 0;
    PC = r[0]!0,L1;
After folding down:
    w[0] = 10;
    L1: …
    PC=(w[0]-1)!0,L1;w[0]=w[0]-1; (banz *ar0-,L1)

Example – x86
Exploiting the loop instruction. One register (r[6]) freed from the loop.
Before conversion:
    r[4] = 0;
    L1: ….
    r[4] = r[4] + 1;
    n[0] = r[6] ? r[4];
    PC = n[0]<0,L1;
After conversion:
    r[4] = r[6];
    L1: ….
    r[4] = r[4] - 1;
    n[0] = 0 ? r[4];
    PC = n[0]!0,L1;
After folding down:
    r[4] = r[6];
    L1: …
    PC=0?r[4]-1!0,L1;r[4]=r[4]-1; (loop L1)

Example – SPARC
Exploiting the subcc instruction
Before conversion:
    r[16]=0;
    L6: ST=test;
    r[16]=r[16]+1;
    IC=r[16]?2;
    PC=IC<0,L6; IC
After conversion:
    r[16]=2;
    L6: ST=test;
    IC=r[16]-1:0;r[16]=r[16]-1; (subcc)
    PC=IC!0,L6; IC
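In each example the converted loop starts its counter at the number of iterations of the original loop (10, r[6], and 2). A minimal sketch of that calculation for a bottom-tested count-up loop follows; `trip_count` is a hypothetical helper, it assumes a positive step and an exit test of the form "continue while counter < limit", and it ignores arithmetic overflow.

```c
#include <assert.h>

/* Iterations of:  i = init; do { body; i += step; } while (i < limit);
 * Hypothetical helper; assumes step > 0 and no overflow. */
long trip_count(long init, long limit, long step)
{
    assert(step > 0);
    /* A bottom-tested loop always runs its body at least once. */
    if (limit - init <= step)
        return 1;
    /* Otherwise round (limit - init) / step up to a whole iteration. */
    return (limit - init + step - 1) / step;
}
```

For the SPARC example, `trip_count(0, 2, 1)` gives 2, the value loaded into r[16] after conversion.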

Retargetability
• Lines of C code added to MDP (including the elaborate comments!)
  • 83 on TMS320C54X
  • 87 on x86
  • 78 on SPARC
• 634 lines of C code added to LIB
• Compared to ZOLB support, this optimization is almost completely implemented in LIB

Performance – SPEC on x86

Analysis
• Average performance has improved after applying the count down optimization

Conclusion
• More fine-tuning is needed to realize substantial performance gains
• Primary objective of adding easily retargetable support for these loop optimizations accomplished – retargeted to 3 targets!

Acknowledgements
Dr. Jack Davidson (advisor)
Jason Hiser
Clark Coleman
