CS 3214 Introduction to Computer Systems

CS 3214 Introduction to Computer Systems, Godmar Back, Lecture 8

Announcements
• Read Chapter 6
• Project 2 due Fri, Sep 25
  • Need to pass at least “Firecracker” phase
• Exercise 6 due Fri, Sep 25
• Stay tuned for exercise 7 (tomorrow)
• Midterm: Tuesday Oct 20
CS 3214 Fall 2009

Some of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron, http://csapp.cs.cmu.edu/public/lectures.html

Part 3: Code Optimization

Machine-Independent Optimizations

Dangers of Recursion
    public class List {
        private List next;
        List(List next) {
            this.next = next;
        }
        int size() {
            return next == null ? 1 : next.size() + 1;
        }
        public static void main(String []av) {
            List l = null;
            for (int i = 0; i < 20000; i++)
                l = new List(l);
            System.out.println(l.size());
        }
    }

    Exception in thread "main" java.lang.StackOverflowError
        at List.size(List.java:7)
        at List.size(List.java:7)
        at List.size(List.java:7)
        at List.size(List.java:7)
        ...

Optimizing Compilers & Recursion
• “To understand recursion, one must first understand recursion”
  • Once you do, you know not to use it
• Naïve use of recursion turns algorithms with O(1) space complexity into (entirely unnecessary) O(n) space complexity
  • Means you run out of stack memory needlessly
• Optimizing compilers will remove recursion if they can
  • For tail and linear recursion

Recursion Removed
Recursive source:
    struct node {
        struct node *next;
        void *value;
    };

    int list_size(struct node *node)
    {
        return node == 0 ? 0 : list_size(node->next) + 1;
    }

Generated code (no call instruction remains):
    list_size:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        testl   %edx, %edx
        je      .L5
        movl    $1, %ecx
        .p2align 4,,7
    .L6:
        movl    (%edx), %edx
        movl    %ecx, %eax
        leal    1(%ecx), %ecx
        testl   %edx, %edx
        jne     .L6
    .L5:
        popl    %ebp
        ret

Equivalent iterative version:
    int list_size(struct node *node)
    {
        int len = 0;
        while (node != 0) {
            node = node->next;
            len++;
        }
        return len;
    }

Machine-Dependent Optimizations

Modern CPU Design
[Block diagram: Instruction Control (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) sends operations to Execution (functional units: Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store), which exchange addresses and data with the Data Cache. Operation results, register updates, and "Prediction OK?" flow back to Instruction Control.]
ILP – Instruction Level Parallelism
• Out-of-order: executes operations in possibly different order than specified in assembly code
• Super-scalar: can do more than one instruction at once (resources permitting)
• Pipelined: some units start with the next operation before current one is finished

What About Branches?
• Challenge
  • Instruction Control Unit must work well ahead of Exec. Unit
    • To generate enough operations to keep EU busy
  • When it encounters a conditional branch, it cannot reliably determine where to continue fetching

    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25
    80489fe: movl %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx

(Figure labels on the listing: Executing; Fetching & Decoding)

Branch Outcomes
• When it encounters a conditional branch, the processor cannot determine where to continue fetching
  • Branch Taken: Transfer control to branch target
  • Branch Not-Taken: Continue with next instruction in sequence
• Cannot resolve until outcome determined by branch/integer unit

    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25

Branch Not-Taken:
    80489fe: movl %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx

Branch Taken:
    8048a25: cmpl %edi,%edx
    8048a27: jl 8048a20
    8048a29: movl 0xc(%ebp),%eax
    8048a2c: leal 0xffffffe8(%ebp),%esp
    8048a2f: movl %ecx,(%eax)

Branch Prediction
• Idea
  • Guess which way branch will go
  • Begin executing instructions at predicted position
    • But don’t actually modify register or memory data

    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25
    . . .

Predict Taken; execute:
    8048a25: cmpl %edi,%edx
    8048a27: jl 8048a20
    8048a29: movl 0xc(%ebp),%eax
    8048a2c: leal 0xffffffe8(%ebp),%esp
    8048a2f: movl %ecx,(%eax)

Branch Prediction Through Loop
Assume vector length = 100. The loop body below appears once per iteration in the instruction stream:

    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl 80488b1

• i = 98: Predict Taken (OK)
• i = 99: Predict Taken (Oops)
• i = 100: Read invalid location
• i = 101
(Figure brackets over the iterations: Executed; Fetched)

Branch Misprediction Invalidation
Assume vector length = 100. The loop body below appears once per iteration in the instruction stream:

    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl 80488b1

• i = 98: Predict Taken (OK)
• i = 99: Predict Taken (Oops)
• i = 100: Invalidate
• i = 101: Invalidate (only the first three instructions, through incl %edx, are in flight)

Branch Misprediction Recovery
Assume vector length = 100

i = 98: Predict Taken (OK)
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl 80488b1

i = 99: Learn not taken
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl 80488b1
    80488bb: leal 0xffffffe8(%ebp),%esp
    80488be: popl %ebx
    80488bf: popl %esi
    80488c0: popl %edi

• Performance Cost
  • Misprediction on Pentium III wastes ~14 clock cycles
  • That’s a lot of time on a high performance processor

Branch Prediction Strategies
• Most Intel processors use “forward branch not taken, backward branch taken” for first time a branch is seen
  • Even if done always, fits loops well: misprediction penalty paid only during last iteration
• May be augmented by branch history buffer
• Some architectures have “hinted branch” instructions where likely direction is encoded in instruction word
• User-guided branch hinting
  • GCC: __builtin_expect(expr, value)

Branch Hinting
    /* !!(x) normalizes any nonzero value to 1 so it can match the expected value */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    extern int g(int), h(int);

    int f_likely_g(int v)
    {
        return likely(v) ? g(v) : h(v);
    }

    int f_likely_h(int v)
    {
        return unlikely(v) ? g(v) : h(v);
    }

Branch Hinting
    f_likely_h:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        testl   %eax, %eax
        jne     .L7
        movl    $0, 8(%ebp)
        popl    %ebp
        jmp     h
    .L7:
        popl    %ebp
        jmp     g

    f_likely_g:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        testl   %eax, %eax
        je      .L9
        popl    %ebp
        jmp     g
    .L9:
        movl    $0, 8(%ebp)
        popl    %ebp
        jmp     h

Aside: absent user-provided branch hinting, the compiler guesses which direction the branch goes and arranges code accordingly

Avoiding Branches
• On Modern Processors, Branches Are Very Expensive
  • Unless prediction can be reliable
  • When possible, best to avoid altogether
• Example
  • Compute maximum of two values
    • 14 cycles when prediction correct
    • 29 cycles when incorrect

    int max(int x, int y)
    {
        return (x < y) ? y : x;
    }

        movl 12(%ebp),%edx   # Get y
        movl 8(%ebp),%eax    # rval=x
        cmpl %edx,%eax       # rval:y
        jge L11              # skip when >=
        movl %edx,%eax       # rval=y
    L11:

Avoiding Branches with Bit Tricks
• Use masking rather than conditionals
• Compiler still uses conditional
  • 16 cycles when predict correctly
  • 32 cycles when mispredict

    int bmax(int x, int y)
    {
        int mask = -(x>y);
        return (mask & x) | (~mask & y);
    }

        xorl %edx,%edx       # mask = 0
        movl 8(%ebp),%eax
        movl 12(%ebp),%ecx
        cmpl %ecx,%eax
        jle L13              # skip if x<=y
        movl $-1,%edx        # mask = -1
    L13:

Avoiding Branches with Bit Tricks
• Force compiler to generate desired code

    int bvmax(int x, int y)
    {
        volatile int t = (x>y);
        int mask = -t;
        return (mask & x) | (~mask & y);
    }

        movl 8(%ebp),%ecx    # Get x
        movl 12(%ebp),%edx   # Get y
        cmpl %edx,%ecx       # x:y
        setg %al             # (x>y)
        movzbl %al,%eax      # Zero extend
        movl %eax,-4(%ebp)   # Save as t
        movl -4(%ebp),%eax   # Retrieve t

• volatile declaration forces value to be written to memory
  • Compiler must therefore generate code to compute t
  • Simplest way is setg/movzbl combination
• Not very elegant!
  • A hack to get control over compiler
• 22 clock cycles on all data
  • Better than misprediction
• Check before doing (x86_64 code might do already)

Machine-Dependent Opt. Summary
• Pointer Code
  • Look carefully at generated code to see whether helpful
• Loop Unrolling
  • Some compilers do this automatically
  • Generally not as clever as what can be achieved by hand
• Consider Branch Hinting
• Exposing Instruction-Level Parallelism
  • Very machine dependent
• Avoid spilling
• Best if performed by compiler when tuning for a particular architecture

Role of Programmer
How should I write my programs, given that I have a good, optimizing compiler?
• Don’t: Smash Code into Oblivion
  • Hard to read, maintain & assure correctness
• Do:
  • Select best algorithm
  • Write code that’s readable & maintainable
    • Use procedures, recursion, avoid built-in constant limits
    • Even though these factors can slow down code
  • Eliminate optimization blockers
    • Allows compiler to do its job

Profiling
• “Premature Optimization is the Root of all Evil” – Donald Knuth
• Where should you start optimizing?
  • Where it pays off the most: inner loops, where code spends most of its time
• Amdahl’s Law
  • If you speed up the portion of a program in which it spends a fraction $\alpha$ of its time ($\alpha \in [0, 1]$) by a factor of $k$, the overall speedup of the program will be
    $$S = \frac{1}{(1 - \alpha) + \alpha / k}$$
  • See 5.15.3 in book

Profiling
• Profilers monitor where a program spends its time
• Callgraph profiling:
  • Instrument procedure entry & exit and record how often each function is called and by whom
• Statistical profiling
  • Periodically interrupt (sample) program counter, then extrapolate how much time was spent in each function
• gprof is a classic if crude tool for that purpose, but it demonstrates the principle still used in modern tools such as Intel’s VTune or AMD’s CodeAnalyst
  • Compile with ‘-pg’ switch

Example: -pg
    int strlen(char *s)
    {
        int len = 0;
        while (*s++) {
            len++;
        }
        return len;
    }

    strlen:
        pushl   %ebp
        movl    %esp, %ebp
        call    mcount
        movl    8(%ebp), %edx
        movl    $0, %eax
        cmpb    $0, (%edx)
        je      .L4
        movl    $0, %eax
    .L5:
        addl    $1, %eax
        cmpb    $0, (%eax,%edx)
        jne     .L5
    .L4:
        popl    %ebp
        ret

Example
    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    void turnAtoX(char *s)
    {
        int i, l = strlen(s);
        for (i = 0; i < l; i++)
            if (s[i] == 'A')
                s[i] = 'X';
    }

Example (cont’d)
    int main(int ac, char *av[])
    {
        char buf[1024];
        while (fgets(buf, sizeof buf, stdin)) {
            turnAtoX(buf);
            lower(buf);
            fputs(buf, stdout);
        }
        return 0;
    }

    gcc -O2 -pg lower.c -o lower
    perl -e '$i = 0; while ($i++ < 1000000) { print "AASSAXXXXXXXXXXXXXXXXXXXXXFDFDBCDEF\n"; }' | ./lower > /dev/null
    gprof ./lower > lower.gprof

Example Output
      %   cumulative   self               self     total
     time   seconds   seconds     calls  us/call  us/call  name
    82.42      1.36     1.36   38000000     0.04     0.04  strlen
    10.30      1.53     0.17    1000000     0.17     1.49  lower
     4.85      1.61     0.08    1000000     0.08     0.12  turnAtoX
     2.42      1.65     0.04                               main

• Sorted by time
• Provides both cumulative time (including time spent in called functions; see the total us/call column) and time spent directly in the function (self)
• Provides accurate call count

Call Graph
    index % time    self  children    called           name
                                                           <spontaneous>
    [1]    100.0    0.04    1.61                       main [1]
                    0.17    1.32  1000000/1000000          lower [2]
                    0.08    0.04  1000000/1000000          turnAtoX [4]

• gprof provides sections of callgraph for each function
  • These are not backtraces!
[Figure: call graph. main calls lower (1000000) and turnAtoX (1000000); lower calls strlen (37000000); turnAtoX calls strlen (1000000).]

Call Graph: lower
    index % time    self  children    called           name
                    0.17    1.32  1000000/1000000          main [1]
    [2]     90.6    0.17    1.32  1000000              lower [2]
                    1.32    0.00 37000000/38000000         strlen [3]

Call Graph: strlen
    index % time    self  children    called           name
                    0.04    0.00  1000000/38000000         turnAtoX [4]
                    1.32    0.00 37000000/38000000         lower [2]
    [3]     82.4    1.36    0.00 38000000              strlen [3]

Call Graph: turnAtoX
    index % time    self  children    called           name
                    0.08    0.04  1000000/1000000          main [1]
    [4]      7.0    0.08    0.04  1000000              turnAtoX [4]
                    0.04    0.00  1000000/38000000         strlen [3]

Benefits & Limitations
• Benefits:
  • Can quickly direct attention to hotspots, especially in complex systems
• Limitations:
  • Input dependent
    • Shows only this path
  • Program must run a minimum period for sampling to be accurate

Advanced Profilers
• Intel VTune, AMD CodeAnalyst
  • Use performance counters of processor and relate to code
  • Examples: number of retired instructions, floating point ops, etc.
  • Cache behavior
• “Whole-system” profilers, e.g. oprofile
  • Example showed only application code
  • Bottleneck may lie in libraries or OS
