
Retargetting of VPO to the tms320c54x - a status report

Retargetting of VPO to the tms320c54x - a status report. Presented by Joshua George Advisor: Dr. Jack Davidson. Status. Register assignment and allocation Common sub-expression elimination Constant propagation/Copy propagation Induction variable elimination Code motion


Presentation Transcript


  1. Retargetting of VPO to the tms320c54x - a status report
     Presented by Joshua George
     Advisor: Dr. Jack Davidson

  2. Status
     • Register assignment and allocation
     • Common sub-expression elimination
     • Constant propagation/Copy propagation
     • Induction variable elimination
     • Code motion
     • Recurrence detection

  3. Status (continued)
     • Strength reduction
     • Instruction selection
     • Dead code elimination
     • Constant folding (simp())
     • Branch minimization
     • Support for repeat blocks

  4. The tms320c54x
     • One 40-bit ALU and two 40-bit accumulators (A, B), which are r[0] and r[2] in vpo
     • One 17x17-bit parallel multiplier with an adder for single-cycle MAC operation
     • One barrel shifter
     • Eight 16-bit address registers (AR0-AR7), which are w[0]..w[7] in vpo

  5. Compiler writer woes
     • Address arithmetic: only a constant can be added to an address register. This complicates the optimizer (e.g. the strength reduction code).
     • Interesting note:
         r[0]=(w[0]{24)}24;
         r[0]=r[0]+1;
         w[1]=r[0];      /* w[1]=w[0]+1 gets rejected */
         W[w[1]]=50;     /* by instruction selection */
         ---------------------------
         w[0]=w[0]+1;
         W[w[0]+1]=50;
       The first sequence cannot normally collapse into the more efficient second sequence. But after minimize_registers, instruction selection is able to fold them into a single instruction.
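
The RTL `r[0]=(w[0]{24)}24;` appears throughout these slides. Reading `{` as a left shift and `}` as an arithmetic right shift (an assumption based on how the slides pair it with sign-extended loads), it moves a 16-bit address register into a 40-bit accumulator with sign extension, since 40 - 16 = 24. A minimal C sketch of that reading follows; the function names are hypothetical and are not taken from vpo.

```c
#include <stdint.h>

#define ACC_BITS 40
#define ACC_MASK ((UINT64_C(1) << ACC_BITS) - 1)   /* low 40 bits set */

/* Interpret a 40-bit pattern as a signed two's-complement value. */
static int64_t acc_to_signed(uint64_t bits)
{
    bits &= ACC_MASK;
    if (bits & (UINT64_C(1) << (ACC_BITS - 1)))
        return (int64_t)bits - (INT64_C(1) << ACC_BITS);
    return (int64_t)bits;
}

/* Hypothetical model of the RTL (w{24)}24 on a 40-bit accumulator. */
int64_t sext16_to_acc(uint16_t w)
{
    /* w{24 : shift the 16-bit value to the top of the 40-bit field */
    uint64_t left = ((uint64_t)w << 24) & ACC_MASK;
    int64_t as_signed = acc_to_signed(left);
    /* }24 : arithmetic right shift; the low 24 bits are zero,
       so an exact division gives the same result portably */
    return as_signed / (INT64_C(1) << 24);
}
```

Under this model, 0xFFFF in an address register becomes -1 in the accumulator, which is why every move through an accumulator drags a shift pair into the RTL.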

  6. Compiler writer woes
     • 16-bit word addressing required special-case handling in the lcc frontend.
     • Only 2 accumulator registers.
     • The Local Register Assigner had to be fixed to handle this.
     • Lots of spills. Refined vpo to use memory disambiguation techniques in instruction selection (maybe_same()).

  7. Compiler writer woes
     • No pipeline interlocks => unprotected pipeline conflicts.
     • 40-bit accumulator. Needed a major change to simp(). Complicated the machine description with sign-extends and ANDs.
     • Global data is placed in a special cinit section and is relocated to RAM at run-time. VISTA/EASE code instrumentation had to be done differently from other targets.
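
To illustrate why a 40-bit accumulator affects constant folding, here is a small C sketch of folding an addition at 40-bit width. It is not vpo's simp(); `wrap40` and `fold_add40` are hypothetical names, and the sketch assumes that results wrap around in two's complement (no saturation).

```c
#include <stdint.h>

#define ACC40_MASK     UINT64_C(0xFFFFFFFFFF)   /* low 40 bits */
#define ACC40_SIGN_BIT UINT64_C(0x8000000000)   /* bit 39 */

/* Reduce a bit pattern to a signed 40-bit value (wrap-around assumed). */
int64_t wrap40(uint64_t bits)
{
    bits &= ACC40_MASK;
    if (bits & ACC40_SIGN_BIT)
        return (int64_t)bits - INT64_C(0x10000000000);
    return (int64_t)bits;
}

/* Fold "a + b" the way a 40-bit register would compute it. */
int64_t fold_add40(int64_t a, int64_t b)
{
    /* add as unsigned to avoid signed overflow, then wrap to 40 bits */
    return wrap40((uint64_t)a + (uint64_t)b);
}
```

A folder that assumed 32-bit or 64-bit host arithmetic would produce a different constant for the overflow case, for example when adding 1 to the largest positive 40-bit value.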

  8. Compiler writer woes
     • Compare and jump has the induction variable and the value to compare with spread over two instructions. All targets until now had a simple compare and jump. This resulted in a small change to the vpo lib/md interface.
     • E.g. AR1 (w[1]) is the induction variable and runs from 0 to 9. The loop exit check:
         SSBX SXM         // s[0]=1; (set sign-ext on)
         LD *(AR1),A ;    // r[0]=(w[1]{24)}24;
         SUB #10,A,A ;    // r[0]=r[0]-10;
         BC L1,ALT ;      // PC=r[0],0?L1;
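
The exit check above can be modelled in C as a subtract followed by a branch on a negative accumulator. This is only a sketch: it assumes the loop is tested at the bottom and that the ALT condition means "accumulator A less than zero"; the function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 16-bit address register value (sign-extension mode on). */
static int32_t sext16(uint16_t w)
{
    return (w & 0x8000u) ? (int32_t)w - 0x10000 : (int32_t)w;
}

/* Models LD / SUB #10 / BC L1,ALT: the branch is taken when acc < 0. */
int branch_taken(uint16_t ar1)
{
    int32_t acc = sext16(ar1);  /* load the induction variable */
    acc -= 10;                  /* subtract the loop bound */
    return acc < 0;             /* assumed meaning of ALT */
}

/* Count how often the body runs when AR1 starts at 0. */
int count_iterations(void)
{
    uint16_t ar1 = 0;
    int n = 0;
    do {
        n++;      /* loop body */
        ar1++;    /* induction variable update */
    } while (branch_taken(ar1));
    return n;
}
```

The comparison against the bound lives in the SUB, while the branch only inspects the sign of the result, which is the split that required the interface change.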

  9. Timeline of progress on this project
     Spring 2002
     • Code-expander completed.
     • Only basic addressing modes and instructions supported.
     • Stack layout
     • Calling sequence
     • Data declarations
     • Structure operations
     • Passes ctests/ptests with instruction selection.
     • Support for stdargs added.

  10. Timeline of progress on this project
      Fall 2002
      • Major changes to simp() to handle 40-bit arithmetic.
      • Enabled Register Coloring and CSE.
      • A lot of work on comp() to allow better instruction selection and other optimizations (e.g. w[1]=(((w[1]{24)}24)+1)&65535 folds down to w[1]=w[1]+1; only now can strength reduction detect the induction variable).
      • Integrated VISTA into mainline vpo.
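
The fold of the sign-extend, add, and mask sequence into a plain 16-bit increment is safe because the mask discards exactly the bits that sign extension introduced. The following C sketch checks this exhaustively over all 16-bit values; the helper names are hypothetical and the code is a model of the arithmetic, not of comp() itself.

```c
#include <stdint.h>

/* Sign-extend a 16-bit value, modelling (w{24)}24. */
static int32_t sext16(uint16_t w)
{
    return (w & 0x8000u) ? (int32_t)w - 0x10000 : (int32_t)w;
}

/* Unfolded form: sign-extend, add 1, mask with 65535. */
uint16_t unfolded_increment(uint16_t w)
{
    return (uint16_t)((uint32_t)(sext16(w) + 1) & 65535u);
}

/* Folded form: w[1]=w[1]+1 on a 16-bit register. */
uint16_t folded_increment(uint16_t w)
{
    return (uint16_t)(w + 1u);
}

/* Number of 16-bit inputs on which the two forms disagree. */
long count_mismatches(void)
{
    long bad = 0;
    for (uint32_t w = 0; w <= 0xFFFFu; w++)
        if (unfolded_increment((uint16_t)w) != folded_increment((uint16_t)w))
            bad++;
    return bad;
}
```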

  11. Timeline of progress on this project
      Spring 2003
      • Enabled Code motion & Strength reduction.
      • Further refined the machine description/grammar.
      • Started work on Zero Overhead Loop Buffer (ZOLB) support.
      • Second merge of VISTA with vpo done.
      • Retargeted VISTA to the tms320c54x.

  12. To-Dos/Future work
      • Parallel instructions
      • Issues with ZOLB (details later)
      • Scheduling
      • The banz instruction (very useful for loops), which allows comparison of an address register with zero
      • Circular addressing

  13. TI’s compiler cl500 has...
      • Inter-procedural analysis
      • For example, if the parameters to a function are constants or globals, the actual parameters are substituted into the function, thus avoiding expensive stack frame setup.
      • Inline expansion of runtime-support library functions.
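
As a rough source-level picture of the parameter substitution described above, the sketch below shows a call with a constant argument and what the caller might look like after the constant is substituted into the callee. The functions are hypothetical examples; cl500's actual transformation works on its own intermediate form and is not shown in the slides.

```c
/* General version: the callee receives both arguments at run time. */
int scale(int x, int factor)
{
    return x * factor;
}

/* Before: the constant 4 is passed as an actual parameter. */
int caller_before(int x)
{
    return scale(x, 4);
}

/* After (illustrative): the constant has been substituted into the
   callee's body, so no argument passing for it is needed. */
int caller_after(int x)
{
    return x * 4;
}
```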

  14. Code comparison
      Code fragment: get address of local _a
      VPO:
          r[2]=(w[7]{24)}24;
          r[2]=r[2]+_l0_2_a;
          w[3]=r[2]&65535;      // w[3]=w[7]+_l0_2_a
      ----------------------------
      cl500 (TI-compiler):
          w[3]=w[7];
          w[3]=w[3]+_l0_2_a;

  15. Code comparison
      • Code fragment:
          for (i = 0; i < STRUCTSIZE; i++) // STRUCTSIZE=2
              sum += b.field[i];
        Because vpo maintains the running sum in a 16-bit register (an address register), it uses 2 extra instructions and loses the opportunity to convert the loop into a repeat-single instruction. The TI compiler maintains the sum in an accumulator register.

  16. VPO: AR3 (w[3]) points to start of array. AR1 maintains the running sum.
          brc=1;
          rptb .L10_rpt_end-1
      .L10:
          ld *(AR1),A       // r[0]=(w[1]{24)}24;
          add *AR3+,A       // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
          stl A,*(AR1)      // w[1]=r[0]&65535;
      .L10_rpt_end:
      --------------------------------------------
      cl500 (TI-compiler): AR3 (w[3]) points to start of array. A (r[0]) maintains the running sum.
          RPT #1
      L5: ADD *AR3+,A       // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
      L6:
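
The two code sequences can be compared with a small C model that tracks both the result and the number of loop-body instructions executed. This is a sketch under stated assumptions: `brc=1` and `RPT #1` are each taken to mean two iterations, matching STRUCTSIZE=2, and the struct and function names are hypothetical.

```c
#include <stdint.h>

#define STRUCTSIZE 2

struct loop_result {
    uint16_t sum;            /* final 16-bit sum */
    int body_instructions;   /* loop-body instructions executed */
};

/* Sign-extend a 16-bit value into a wider accumulator model. */
static int32_t sext16(uint16_t w)
{
    return (w & 0x8000u) ? (int32_t)w - 0x10000 : (int32_t)w;
}

/* VPO-style loop: the sum lives in a 16-bit address register. */
struct loop_result sum_in_address_register(const uint16_t *field, uint16_t initial)
{
    struct loop_result r = { initial, 0 };
    for (int i = 0; i < STRUCTSIZE; i++) {
        int32_t acc = sext16(r.sum);                   /* ld  *(AR1),A */
        acc += sext16(field[i]);                       /* add *AR3+,A  */
        r.sum = (uint16_t)((uint32_t)acc & 0xFFFFu);   /* stl A,*(AR1) */
        r.body_instructions += 3;
    }
    return r;
}

/* cl500-style loop: the sum stays in the accumulator. */
struct loop_result sum_in_accumulator(const uint16_t *field, uint16_t initial)
{
    struct loop_result r = { 0, 0 };
    int32_t acc = sext16(initial);
    for (int i = 0; i < STRUCTSIZE; i++) {
        acc += sext16(field[i]);                       /* ADD *AR3+,A */
        r.body_instructions += 1;
    }
    r.sum = (uint16_t)((uint32_t)acc & 0xFFFFu);
    return r;
}
```

Both models yield the same 16-bit sum, but the first executes three instructions per iteration where the second executes one, which is the gap the slide describes.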

  17. Zero Overhead Loop Buffers
      • Loops are buffered in a special internal buffer using an rpt instruction whose parameters are the start label, the end label and the loop count. Access to this buffer may be faster than fetching the instructions from memory.
      • The usual branch instruction at the end of the loop is no longer necessary when using a repeat instruction, and hence pipeline bubbles are avoided.
      • On the tms320c54x, a single-instruction rpt allows memory block copies/initializations without using an address register.

  18. Detail on ZOLB
      • Advantages of doing it in vpo
          • Can make use of all the information that vpo has already collected about the loop.
          • Easily retargetable
              • Code in the machine-independent part is reused.
              • Code in the machine-dependent part for one target provides a framework for the new target.
      • After conversion to a Repeat Block, registers may be freed up. Other optimizations may get enabled.

  19. Status of ZOLB
      • Repeat Blocks with a loop iteration count known at compile time are implemented.
      • Plan to implement the banz instruction, which is the next best option to ZOLB.
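
For comparison with repeat blocks, the following C sketch models a loop closed by a banz-style branch, where an address register is tested against zero and counted down. It assumes the register is tested first and then post-decremented, so an initial value of N gives N+1 passes through the body; that reading and the function name are assumptions, not details from the slides.

```c
#include <stdint.h>

/* Hypothetical model of a loop closed by a banz-style branch. */
int count_banz_iterations(uint16_t initial_ar)
{
    uint16_t ar = initial_ar;
    int iterations = 0;
    for (;;) {
        iterations++;     /* loop body */
        if (ar == 0)      /* register is zero: fall through, loop ends */
            break;
        ar--;             /* otherwise count down and branch back */
    }
    return iterations;
}
```

Unlike a repeat block, this form still executes one branch per iteration, but it folds the compare and the counter update into that branch.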

  20. Acknowledgements
      • Dr. Jack Davidson (advisor)
      • Jason Hiser
      • Clark Coleman
