450 likes | 549 Views
Basic Block and Trace. Chapter 8. (1) Semantic gap. Tree IR. Machine Languages. (2) IR is not proper for optimization analysis. Eg: - Some expressions have side effects ESEQ, CALL Tree representation => no execution order is assumed. - Semantic Gap
E N D
Basic Block and Trace Chapter 8
(1) Semantic gap Tree IR Machine Languages (2) IR is not proper for optimization analysis • Eg: • - Some expressions have side effects • ESEQ, CALL • Tree representation => no execution order is assumed. • - Semantic Gap • CJUMP vs. Jump on Condition • 2 targets 1 target + “fall through”
Semantic Gap Continued - ESEQ within expression is inconvenient - evaluation order matters - CALL node within expression causes side effect ! - CALL node within the argument – expression of other CALL nodes will cause problem if the args of result are passed in the same (one) register. - Rewrite Tree into an equivalent tree(Canonical Form) SEQ S1 SEQ => S1;S2;S3;S4;S5 SEQ SEQ S2 S3 S4 S5
Transformation Step 1: A tree is rewritten into a list of “canonical trees” without SEQ or ESEQ nodes. -> Tree. StmList linearize(Tree.Stm S); Step 2: Grouping into a set of “basic blocks” which contains no internal jumps or labels -> BasicBlocks Step 3: Basic Blocks are ordered into a set of “traces” in which every CJUMP is immediately followed by false label. -> Trace Schedule(BasicBlock b)
8.1 Canonical Trees Def : canonical trees as having following properties: 1. No SEQ or ESEQ 2. The parent of each CALL is either EXP(..) or MOVE(TEMP t, ….) => Separate SEQs and EXPressions
Transformations on ESEQ -move ESEQ to higher level. Eg. ESEQ ESEQ Case 1: ESEQ e SEQ S1 e S2 S2 S1 Case 2: BINOP MEM JUMP CJUMP e2 op e l1 l2 op ESEQ ESEQ ESEQ ESEQ e1 e1 e1 e1 S S S S ESEQ ESEQ SEQ ESEQ MEM BINOP S CJUMP JMP S S S e1 op e1 e2 op e1 e2 l1 l2 e1
Case 3: BINOP CJUMP op e1 ESEQ op e1 ESEQ ㅣ1 ㅣ2 e2 e2 S S1 ESEQ ESEQ MOVE ESEQ MOVE SEQ e1 S e1 S TEMP BINOP TEMP CJUMP t t op TEMP e2 op TEMP e2 ㅣ1 ㅣ2 t t
Case 4: When S does not affect e1 in case 3 and (s and e1 have not I/O) CJUMP BINOP op e1 ESEQ ㅣ1 ㅣ2 op e1 ESEQ e2 S1 e2 S if s,e1 commute if s,e1 commute ESEQ SEQ S S BINOP CJUMP op e1 e2 op e1 e2 ㅣ1 ㅣ2
How can we tell if tow expressions commute? MOVE(MEM(x),y) MEM(z) x = z (aliased) not commute x z commute CONST(n) can commute with any Expression!! => Be Conservative !! ?? We don’t know yet!
General Rewriting Rules • Identify the subexpressions. • Pull the ESEQs out of the stm or exp. • Ex: • [e1,e2,ESEQ(S,e3)] • -> (s1;[e1,e2,e3]) s1, e1,e2 commute • -> (SEQ (MOVE(t1, e1),SEQ(MOVE(t2,e2),s)); • -> (SEQ(MOVE(t1,e1),s); [TEMP(t1),e2,e3] • Reorder(ExpListexps) => (stms; ExpList)
MOVING CALLS TO TOPLEVEL • CALL returns its result in the same register TEMP(Ri) • BINOP(+,CALL(…),CALL(…)) Solution CALL(fun,args) -> ESEQ(MOVE(TEMP t,CALL()),TEMP t) Then eliminate ESEQ. => need extra TEMP(t) (registers) do_stm(MOVE(TEMP tnew, CALL(f, args))) do_stm (EXP(CALL(f, args))) - will not reorder on CALL node - will reorder on f and args as the children of MOVE overwrite TEMP(RV)
A LINEAR LIST OF STATEMENTS S0 [S0’ ; ] SEQ(SEQ(SEQ…())) => a;b;c SEQ SEQ a SEQ SEQ c linear(stm s) a b b c
8.2 TAMING CONDITIONAL BRANCHES BASIC BLOCK a sequence of statements entered at the beginning exited at the end - The 1st stmt is a LABEL - The last stmt is a JUMP a CJUMP - no other LABELs, JUMPs, CJUMPs.. Cond CJUMP CJUMP T F Cond F: …… t T: . . . JUMP LABEL
Algorithm • Scan from beginning to end • - when Label is found, begin new Block • - when JUMP or CJUMP is found, a block is ended • - If block ends without JUMP or CJUMP, • insert JUMP LABEL, LABEL ; • Epilogue block of Function. • Label it as DONE • and put JUMP DONE at the end of • body of the function. • -> Canon.BasicBlocks.
Traces Trace: a sequence of stmts that could be consecutively executed during the execution of the program. We want a set of traces that exactly covers the program : one block in one trace. To reduce JUMPs, fewer traces are preferred !! Exit
Idea !! JUMP on False 1 1 remove JUMP 2 2 F T 3 3 4 4 F->T T->F T F 5 5 6 6 JUMP 7 7
Algorithm 8.2 (Canon.Trace Schedule) Put all the blocks of the Program into a list Q. while Q is not empty Start a new(empty) trace, call it T. Remove the head element b from Q. while b is not marked Mark b; T <- T;b. Examine the successors of b. if there is nay unmarked successor C b <- C. END the current trace T.
Finishing Up • - analysis and optimizations are efficient • for basic blocks (not for stmts level) • Some local arrangement • CJUMP + false Label => OK • CJUMP + true Label => reverse condition • CJUMP lt,lf + no lt lf • CJUMP lt, lf’; LABEL lf’; JUMP lf • JUMP on true lt; JUMP lf • chance to optimized !! • Finding optimal trace is not easy !!
Instruction Selection Chapter 9
What we are going to do. Tree IR machine Instruction (Jouette Architecture or SPARC or MIPS or Pentium or T ) MEM LOAD R1,ea(c); BINOP CONST + ea C
Machine Example - Jouette Architecture Register R0 always contains zero ri TEMP ADDI ri <- rj+C ADD ri <- rj+rk + + + CONST MUL ri <- rj*rk CONST CONST * SUBI ri <- rj-C - SUB ri <- rj-rk - CONST BINOP DIV ri <- rj/rk LOAD ri <- M[rj+C] / MEM MEM MEM MEM CONST + + CONST CONST • Instructions produces a result in a register => EXP
STORE M[rj+C] <-ri MOVEM M[rj] <-M[ri] MOVE MOVE MOVE MOVE MOVE MEM MEM MEM MEM MEM MEM CONST + + CONST CONST • Execution of instructions produce side effects on Mem. • => Stm
9 MOVE MEM MEM 8 6 + + 7 2 5 CONST x MEM * FP 3 4 CONST 4 TEMP i + 1 CONST a FP Tiling the IR tree ex: a[i]:= x i:register a,x:frame var 2 LOAD r1<-M[fp+a] 4 ADDI r2<- r0 + 4 5 MUL r2 <- ri*r2 6 ADD r1 <- r1+r2 8 LOAD r2<-M[fp+x] 9 STORE M[r1+0]<- r2
9 MOVE MEM MEM 8 6 + + 7 2 5 CONST x MEM * FP 3 4 CONST 4 TEMP i + 1 CONST a FP Another Solution ex:a[i]:= x i:register a,x:frame var 2 LOAD r1<-M[fp+a] 4 ADDI r2<- r0 + 4 5 MUL r2 <- ri*r2 6 ADD r1 <- r1+r2 8 ADDI r2<- fp+x 9 MOVEM M[r1]<- M[r2 ]
10 MOVE 9 MEM MEM 8 6 + + 3 7 5 CONST x MEM * FP 4 CONST 4 2 1 TEMP i + CONST a FP Or Another Tiles with a different set of tile-pattern 1 ADDI r1<- r0 + a 2 ADD r1 <- fp +r1 3 LOAD r1<-M[r1+0] 4 ADDI r2<- r0 + 4 5 MUL r2 <- ri*r2 6 ADD r1 <- r1+r2 7 ADDI r2<- r0 + x 8 ADD r2<- fp+ r2 9 LOAD r2<-M[r2+0] 10 STORE M[r1+0]<- r2
OPTIMAL and OPTIMUM TILINGS • Optimum Tiling : one whose tiles sum to the • lowest possible value. • cost of tile : instr. exe. time, # of bytes, ...... • Optimal Tiling : one where no two adjacent tiles can • be combined into a single tile of • lower cost. then why we keep ? are enough. 30 25
Algorithms for Instruction Selection • Optimal vs Optimum • simple maybe hard • CISC vs RISC • (Complex Instr. Set Computer) • tile size large small • optimal >= optimum optimal ~= optimum • instruction cost varies almost same! • on addressing mode
Maximal Munch – optimal tiling algorithm • starting at root, find the largest tile that fits. • repeat step 1 for several subtrees • which are generated(remain)!! • 3. Generate instructions for each tile • (which are in reverse order) • => traverse tree of tiles in post-order • When several tiles can be matched, select • the largest tile(which covers the most nodes). • If same tiles are matched, choose an arbitrary one.
Implementation • See Program 9.3 for example(p181) • case statements for each root type!! • There is at least one tile for each type of root node!!
40 Dynamic Programming– finding optimum tiling finding optimum solutions based on optimum solutions of each subproblem!! 30+2+40+5=? 1. Assign cost to every node in the tree. 2. Find several matches. 3. Compute the cost for each match. 4. Choose the best one. 5. Let the cost be the value of node. 10+20+40 +4= MEM 30+2 A B 10 20
Example + MEM + + CONST CONST1 CONST2 + ADDI ADDI CONST MEM node MEM MEM MEM 2 + + CONST CONST 1 1 LOAD ri<-M[rj] LOAD ri<-M[rj+c] LOAD ri<-M[rj+c] cost 1+2 1+1 1+1
Example : Schizo-Jouette machine d+ ADDI di <- dj +C ADD di <- dj +dk d d d+ d+ d* MUL di <- dj *dk d CONST CONST d CONSTd d d SUBI di <- dj -C d- SUB di <- dj -dk d d d- DIV di <- dj /dk d/ d CONST d d MOVEA dj<- ai da MOVED aj<- di ad Tree Grammars A generalization of DP for machines with complex instruction set and several classes of registers and addressing modes. ai : address register dj : data register
dMEM dMEM dMEM dMEM LOAD di<-M[aj+C] CONST a + + a CONST CONST a STORE M[aj+C]<- di MOVE MOVE MOVE MOVE MEM MEM MEM MEM d d d d CONST a + + a CONST CONST a MOVEM M[aj]<- M[ai ] MOVE MEM MEM a a
Use Context-free grammar to describe the tiles; ex: nonterminal s : statement a : address d : data d -> MEM(+(a,CONST)) d-> MEM(+(CONST,a)) d-> MEM(CONST) d-> MEM(a) d -> a MOVEA a -> d MOVED LOAD s MOVE(MEM(+(a,CONST)), d) STORE s MOVE(M(a),M(a)) MOVEM => ambiguous grammar!! -> parse based on the minimum cost!!
Efficiency of Tiling Algorithms Order of Execution Cost for “Maximal Munch & Dynamic Programming” T : # of different tiles. K : # of non-leaf node of tile (in average) K’: largest # of node that need to be examined to choose the right tile ~= the size of largest tile T’: average # of tile-patterns which matches at each tree node Ex: for RISC machine T = 50, K = 2, K’= 4, T’ = 5 ,
N : # of input nodes in a tree. complexity = N/K * ( K’ + T’) of maximal Munch # of node (#of patterns) to be examined to find matched pattern to find minimum cost complexity of Dynamic Programming = N * (K’ + T’) “linear to N”
9.2 RISC vs CISC • RISC • 1. 32registers. • 2. only one class of integer/pointer registers. • 3. arithmetic operations only between registers. • 4. “three-address” instruction form r1<-r2 & r3 • 5. load and store instructions with only the • M[reg+const] addressing mode. • 6. every instruction exactly 32 bits long. • 7. One result or effect per instruction.
CISC(Complex Instruction Set Computers) • Complex Addressing Mode • 1. few registers (16 or 8 or 6). • 2. registers divided into different classes. • 3. arithmetic operations can access registers or • memory through “addressing mode”. • 4. “two-address” instruction of the form • r1<-r1 & r2. • 5. several different addressing modes. • 6. variable length instruction format. • 7. instruction with side effects. • eg: auto-increment/decrement.
Solutions for CISC • 1. Few registers. • - do it in register allocation phase. • 2. Classes of registers. • - specify the operands and result explicitly. • - ex: left opr of arith op (e.g. mul) must be eax • - t1 t2 x t3 ==> • - move eax, t2 eax t2 • - mul t3 eax eax x t3; edx garbage • - mov t1 eax t1 eax • 3. Two addressing instructions • - add extra move instruction -> resgister allocation • t1 <- t2+t3 move t1,t2 t1<- t2 • add t1,t3 t1<- t1+t3
4.Arithmetic operations can address memory. - actually handled by “register spill” phase. - load memory operand into register and store back into memory -> may trash registers!! -ex: add [ebp – 8,] ecx is equivalent to - mov eax, [ebp –8] - add eax, ecx - mov [ebp – 8], eax
5. several addressing modes • - takes time to execute (no faster than multiInstr seq) • “trash” fewer registers • short instruction sequence • select appropriate patterns for addressing mode. • 6. Variable Length Instructions • - let assembler do generate binary code. • 7. Instruction with Side effect • eg: r2 <- M[r1]; r1<- r1 + 4; • - difficult to model!! • (a) ignore the auto increment-> forget it! • (b) try to match special idioms • (c) try to invent new algorithms.
Abstract Assembly Language Instructions • assembly language instruction without register assignment. package assem; public abstract class Instr { public String assem; // instr template public abstract temp.TempList use(); // retrun src list public abstract temp.TempList def(); // return dst list public abstract Targets jumps(); // return jump public String format(temp.tempMap m); // txt of assem instr } public Targets(temp.LabelList labes);
// dst, src and jump can be null. public OPER(String assem, TempList dst, TempList src, temp.LabelList jump); public OPER(String assem, TempList dst, TempList src); public MOVE(String assem, Temp dst, Temp src) public LABEL(String assem, temp.Label label);
Example • assem.Instr is independent of th etarget machine assembly. ex: MEM( +( fp, CONST(8)) ==> new OPER(“LOAD ‘d0 <- M[‘s0 + 8]”, new TempList(new Temp(),null), new TempList(frame.FP(), null)); call format(…) on the above Instr. we get LOAD r1 <- M[r27+8] assume reg. allocator assign r1 to the new Temp and r27 is the frame pointer register.
Another Example • *(+(Temp(t87), CONST(3)), MEM(temp(t92)) • assem dst src • ADDI ‘d0 <- ‘s0 + 3 t908 t87 • LOAD ‘d0 <- M[‘s0+0] t909 t92 • MUL ‘d0 <- ‘s0*’s1 t910 t908,t909 • after register allocation, the instr look like: • ADDI r1 <- r12 + 3 t908/r1 t87/r12 • LOAD r2 <- M[r13+0] t909/r2 t92/r13 • MUL r1 <- r1*r2 t910/r1 • Two-address instructions • t1 t1 + t2 ==> • assem dst src • add ‘d0 ‘s1 t1 t1,t2