Peephole Optimization

• Final pass over generated code:
• Examine a few consecutive instructions: 2 to 4
• See if an obvious replacement is possible: store/load pairs
    MOV %eax => mema
    MOV mema => %eax
• Can eliminate the second instruction without needing any global knowledge of mema
• Use algebraic identities
• Special-case individual instructions
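The store/load pattern can be sketched as a scan over adjacent instructions. This is a minimal illustration, not a real compiler pass: the `(opcode, source, destination)` tuple encoding and the function name `drop_redundant_loads` are assumptions made for the example, and it assumes that no label sits between the two instructions and that the memory location is not volatile.

```python
def is_register(operand):
    # Assumption for this sketch: register names start with "%".
    return operand.startswith("%")

def drop_redundant_loads(code):
    """Drop a load that immediately follows a store of the same register
    to the same memory location (MOV r => m ; MOV m => r)."""
    out = []
    for ins in code:
        prev = out[-1] if out else None
        if (prev is not None
                and prev[0] == "MOV" and ins[0] == "MOV"
                and is_register(prev[1]) and not is_register(prev[2])
                and ins[1] == prev[2] and ins[2] == prev[1]):
            continue  # the register already holds the stored value
        out.append(ins)
    return out

example = [
    ("MOV", "%eax", "mema"),   # store
    ("MOV", "mema", "%eax"),   # redundant load
    ("ADD", "%ebx", "%eax"),
]
optimized = drop_redundant_loads(example)
```

Only the two adjacent instructions are inspected, which is why no global knowledge of mema is needed.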

Algebraic identities
• Worth recognizing in single instructions with a constant operand:
• A * 2 = A + A
• A * 1 = A
• A * 0 = 0
• A / 1 = A
• More delicate with floating-point: A * 0 = 0 fails when A is NaN or infinite
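A sketch of how these integer identities might be applied to one instruction at a time. The `(op, reg, const)` encoding and the `LOADI` opcode are hypothetical, chosen only for the illustration; an empty result means the instruction can be deleted.

```python
def simplify(instr):
    """Return the list of instructions that replaces one (op, reg, const) tuple."""
    op, reg, const = instr
    if op == "MUL" and const == 0:
        return [("LOADI", reg, 0)]   # A * 0 = 0 (integers only)
    if op == "MUL" and const == 1:
        return []                    # A * 1 = A: nothing to do
    if op == "MUL" and const == 2:
        return [("ADD", reg, reg)]   # A * 2 = A + A
    if op == "DIV" and const == 1:
        return []                    # A / 1 = A: nothing to do
    return [instr]                   # no identity applies

def simplify_all(code):
    """Apply simplify to every instruction and concatenate the results."""
    return [new for instr in code for new in simplify(instr)]
```

For example, `simplify_all([("MUL", "%eax", 1), ("MUL", "%ebx", 2)])` leaves only the `ADD` on `%ebx`. The floating-point caveat is easy to check: `float("nan") * 0` is NaN, not 0, so the `A * 0` rule must not be applied to floating-point operands.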

Is this ever helpful?
• Why would anyone write X * 1?
• Why bother to correct such obvious junk code?
• In fact one might write
    #define MAX_TASKS 1
    ...
    a = b * MAX_TASKS;
• Also, seemingly redundant code can be produced by other optimizations. This is an important effect.

Replace Multiply by Shift
• A := A * 4;
• Can be replaced by a 2-bit left shift (signed/unsigned)
• But must worry about overflow if the language does
• A := A / 4;
• If unsigned, can replace with a shift right
• For signed values, arithmetic shift right is a well-known problem: it does not round negative results the way division does
• Language may allow it anyway (traditional C)

Addition chains for multiplication
• If multiply is very slow (or on a machine with no multiply instruction, like the original SPARC), decomposing a constant operand into a sum of powers of two can be effective:
• X * 125 = X * 128 - X * 4 + X
• Two shifts, one subtract and one add, which may be faster than one multiply
• Note similarity with efficient exponentiation method
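One way to find such a decomposition mechanically is the non-adjacent form of the constant, which uses signed powers of two. This technique is an assumption brought in for illustration, since the slide does not say how the decomposition is chosen. The function names are hypothetical, and the sketch only handles positive constants.

```python
def signed_power_terms(c):
    """Decompose a positive constant into (sign, shift) pairs such that
    c == sum(sign * 2**shift), using the non-adjacent form."""
    terms = []
    shift = 0
    while c:
        if c & 1:
            digit = 2 - (c % 4)      # +1 if c % 4 == 1, -1 if c % 4 == 3
            terms.append((digit, shift))
            c -= digit
        c >>= 1
        shift += 1
    return terms

def multiply_by_constant(x, c):
    """Multiply x by the positive constant c using only shifts, adds and subtracts."""
    total = 0
    for sign, shift in signed_power_terms(c):
        if sign > 0:
            total += x << shift
        else:
            total -= x << shift
    return total
```

For the constant 125 this yields the terms +1, -4 and +128, which is the decomposition shown above.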

The Right Shift problem
• Arithmetic right shift: shift right and use the sign bit to fill the most significant bits
    -5     111111...1111111011
    SAR    111111...1111111101
• which is -3, not -2
• In most languages -5/2 = -2
• Prior to C99, implementations were allowed to truncate towards or away from zero if either operand was negative
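The discrepancy is easy to reproduce, because Python's `>>` on negative integers behaves like an arithmetic shift. The sketch below also shows a common repair that is not stated on the slide: add a bias of 2^k - 1 to negative operands before shifting, so that the result truncates toward zero. The function names are hypothetical.

```python
def truncating_div(x, d):
    """Reference: integer division that truncates toward zero (d > 0)."""
    q = abs(x) // d
    return -q if x < 0 else q

def div_pow2(x, k):
    """Divide x by 2**k with a shift, truncating toward zero.
    Negative operands get a bias of 2**k - 1 before the shift."""
    bias = (1 << k) - 1 if x < 0 else 0
    return (x + bias) >> k

plain_shift = -5 >> 1          # arithmetic shift rounds toward negative infinity
biased_shift = div_pow2(-5, 1) # agrees with truncating division
```

Here `plain_shift` is -3 while `biased_shift` is -2, matching `truncating_div(-5, 2)`.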

Folding Jumps to Jumps
• A jump to an unconditional jump can copy the target address
          JNE lab1
          ...
    lab1: JMP lab2
• Can be replaced by JNE lab2
• As a result, lab1 may become dead (unreferenced)

Jump to Return
• A jump to a return can be replaced by a return
          JMP lab1
          ...
    lab1: RET
• Can be replaced by RET
• lab1 may become dead code

Tail Recursion Elimination
• A subprogram is tail-recursive if the last computation is a call to itself:
    function last (lis : list_type) return list_type is
    begin
       if lis.next = null then
          return lis;
       else
          return last (lis.next);
       end if;
    end;
• Recursive call can be replaced with
    lis := lis.next;
    goto start;   -- added label
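The same transformation, written out by hand in Python as an illustration. CPython does not perform this optimization itself, so the loop below is what a compiler would produce, not what Python does automatically. The `Node` class and the two function names are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    next: Optional["Node"] = None

def last_recursive(lis):
    """Tail-recursive form: the recursive call is the last computation."""
    if lis.next is None:
        return lis
    return last_recursive(lis.next)

def last_iterative(lis):
    """Same function after tail recursion elimination."""
    while True:              # plays the role of the added "start" label
        if lis.next is None:
            return lis
        lis = lis.next       # lis := lis.next; goto start

def build_list(n):
    """Build a linked list holding 0 .. n-1 and return its head."""
    head = None
    for value in reversed(range(n)):
        head = Node(value, head)
    return head
```

The iterative version uses constant stack space, so `last_iterative(build_list(5000))` works on a list that would exhaust a small recursion limit in the recursive version.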

Advantages of tail recursion elimination
• Saves time: an assignment and a jump are faster than a call with one parameter
• Saves stack space: converts linear stack usage to constant usage
• In languages with no loops, this may be a required optimization: it is specified in the Scheme standard

Tail-recursion elimination at the Instruction Level
• Consider the sequence on the x86:
    CALL func
    RET
• CALL pushes the return point on the stack, RET in the body of func removes it, RET in the caller returns
• Can generate instead:
    JMP func
• Now RET in func returns to the original caller, because there is a single return address on the stack

Peephole optimization in the REALIA COBOL compiler
• Full compiler for Standard COBOL, targeted to the IBM PC
• Now distributed by Computer Associates
• Runs in 150K bytes, but must be able to handle very large programs that run on mainframes
• No global optimization possible: multiple linear passes over code, no global data structures, no flow graph
• Multiple peephole optimizations; the compiler iterates until the code is stable. Each pass scans the code backwards to minimize address recomputations

Typical COBOL code: control structures and perform blocks.
    Process-Balance.
        if Balance is negative then
            perform Send-Bill
        else
            perform Record-Credit
        end-if.
    Send-Bill.
        ...
    Record-Credit.
        ...

Simple Assembly: perform equivalent to call
    Pb: cmp balance, 0
        jnl L1
        call Sb
        jmp L2
    L1: call Rc
    L2: ret
    Sb: ...
        ret
    Rc: ...
        ret

Fold jump to return statement
    Pb: cmp balance, 0
        jnl L1
        call Sb
        jmp L2          -- jump to return
    L1: call Rc
    L2: ret
    Sb: ...
        ret
    Rc: ...
        ret

Corresponding Assembly
    Pb: cmp balance, 0
        jnl L1          -- jump to unconditional jump
        call Sb
        ret             -- folded
    L1: jmp Rc          -- will become useless
    L2: ret
    Sb: ...
        ret
    Rc: ...
        ret

Code following a jump is unreachable
    Pb: cmp balance, 0
        jnl Rc          -- folded
        jmp Sb
        ret             -- unreachable
    L1: jmp Rc          -- unreachable
    Sb: ...
        ret
    Rc: ...
        ret

Jump to following instruction is a noop
    Pb: cmp balance, 0
        jnl Rc
        jmp Sb          -- jump to next instruction
    Sb: ...
        ret
    Rc: ...
        ret

Final code
    Pb: cmp balance, 0
        jnl Rc
    Sb: ...
        ret
    Rc: ...
        ret
• Final code as efficient as inlining
• All transformations are local. Each optimization may yield further optimization opportunities
• Iterate till no further change
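The whole sequence can be replayed with a small fixed-point loop. This is a sketch under stated assumptions, not the REALIA implementation. The `(op, arg)` tuple encoding, the function names, and the `exported` set are hypothetical: the set names the labels that must survive because other code may reference them. The sketch scans forwards rather than backwards and handles only the opcodes in this example.

```python
JUMPS = {"jmp", "jnl"}  # hypothetical opcode set; "jmp" is the unconditional jump

def next_instruction(code, i):
    """Return the first non-label item after index i, or None."""
    for item in code[i + 1:]:
        if item[0] != "label":
            return item
    return None

def target_of(code, label):
    """Return the first instruction at a label, or None if the label is absent."""
    if ("label", label) not in code:
        return None
    return next_instruction(code, code.index(("label", label)))

def rewrite(code):
    """Fold jump-to-jump, jump-to-return, and call-followed-by-return."""
    out = []
    for i, (op, arg) in enumerate(code):
        if op in JUMPS:
            target = target_of(code, arg)
            if target and target[0] == "jmp" and target[1] != arg:
                arg = target[1]                  # jump to unconditional jump
            elif op == "jmp" and target and target[0] == "ret":
                op, arg = "ret", None            # jump to return
        elif op == "call":
            nxt = next_instruction(code, i)
            if nxt and nxt[0] == "ret":
                op = "jmp"                       # call followed by return
        out.append((op, arg))
    return out

def cleanup(code, exported):
    """Drop unreachable code, jumps to the next instruction, and dead labels."""
    reachable, live = [], True
    for item in code:
        live = live or item[0] == "label"
        if live:
            reachable.append(item)
        if item[0] in ("jmp", "ret"):
            live = False                         # nothing falls through
    kept = []
    for i, (op, arg) in enumerate(reachable):
        labels_after = set()
        for nxt_op, nxt_arg in reachable[i + 1:]:
            if nxt_op != "label":
                break
            labels_after.add(nxt_arg)
        if op in JUMPS and arg in labels_after:
            continue                             # jump to following instruction
        kept.append((op, arg))
    used = set(exported) | {a for o, a in kept if o in JUMPS or o == "call"}
    return [item for item in kept if item[0] != "label" or item[1] in used]

def optimize(code, exported):
    """Iterate the local rewrites until the code stops changing."""
    while True:
        new = cleanup(rewrite(code), exported)
        if new == code:
            return code
        code = new

PB = [("label", "Pb"), ("cmp", "balance, 0"), ("jnl", "L1"), ("call", "Sb"),
      ("jmp", "L2"), ("label", "L1"), ("call", "Rc"), ("label", "L2"),
      ("ret", None), ("label", "Sb"), ("...", None), ("ret", None),
      ("label", "Rc"), ("...", None), ("ret", None)]
FINAL = optimize(PB, exported={"Pb", "Sb", "Rc"})
```

On this input the loop stabilizes at the final code shown above: a compare, a single `jnl Rc`, and the two paragraph bodies. Each rule looks only at an instruction and its immediate neighbours or jump target, and no flow graph is built.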

Arcane tricks
• Consider a typical maximum computation
    if A >= B then
       C := A;
    else
       C := B;
    end if;
• For simplicity assume all unsigned, and all in registers

Eliminating max jump on x86
• Simple-minded asm code
        CMP A, B
        JNAE L1
        MOV A=>C
        JMP L2
    L1: MOV B=>C
    L2:
• One jump in either case

Computing max without jumps on X86
• Architecture-specific trick: use subtract with borrow instruction and carry flag
    CMP A, B          ; CF = 1 if B > A, CF = 0 if A >= B
    SBB %eax,%eax     ; all 1's if B > A, all 0's if A >= B
    MOV %eax, C
    NOT C             ; all 0's if B > A, all 1's if A >= B
    AND B=>%eax       ; B if B > A, 0 if A >= B
    AND A=>C          ; 0 if B > A, A if A >= B
    OR %eax=>C        ; B if B > A, A if A >= B
• More instructions, but NO JUMPS
• Supercompiler: exhaustive search of instruction patterns to uncover similar tricks
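The mask arithmetic can be checked by simulating the sequence on 32-bit unsigned values. This is only a model of the instructions above, assuming 32-bit registers. The carry flag is derived from the sign of the unbounded difference, which stands in for the borrow that CMP produces, and the function name is hypothetical.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers

def max_unsigned(a, b):
    """Branch-free maximum of two 32-bit unsigned values, step by step."""
    carry = ((a - b) >> 32) & 1   # CMP A, B : CF = 1 exactly when A < B
    eax = (0 - carry) & MASK      # SBB %eax,%eax : all 1's if B > A, else 0
    c = eax                       # MOV %eax, C
    c = ~c & MASK                 # NOT C : all 0's if B > A, else all 1's
    eax &= b                      # AND B=>%eax : B if B > A, else 0
    c &= a                        # AND A=>C : 0 if B > A, else A
    c |= eax                      # OR %eax=>C : the larger of A and B
    return c
```

Exactly one of the two masks is all ones, so the final OR selects A or B without any conditional control flow.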
