
Computer Systems

Learn how to optimize program performance by using pointers instead of array indices, doubles instead of floats, optimizing inner loops, and more.

manzano




Presentation Transcript


  1. Computer Systems Optimizing program performance Computer Systems – optimizing program performance

  2. Performance can make the difference
     • Use pointers instead of array indices
     • Use doubles instead of floats
     • Optimize inner loops
     • Recommendations by Patrick van der Smagt (1991) for neural net implementations

  3. Performance gain
     • A factor of 10 can easily be gained
     • We now know how programs are executed:
       • Load/use hazards (20% of load instructions → 1 bubble)
       • Mispredicted branches (40% of jmp instructions → 2 bubbles)
       • Returns from procedure calls (100% of ret instructions → 3 bubbles)
     • Directions for optimizing procedures and loops
     • Gain has to be measured

  4. Amdahl's Law
     When we speed up one part of a program, the effect on overall performance is limited by the significance of that part.
     • If a part of the system initially consumed a fraction α of the execution time, speeding up that part by a factor k gives an overall speedup S that is much less than k
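The formula on this slide did not survive extraction. Assuming the standard statement of Amdahl's law, where a fraction alpha of the run time is made k times faster, the overall speedup is S = 1 / ((1 - alpha) + alpha / k). A minimal C sketch of that relation follows; the function name `amdahl_speedup` is illustrative and does not come from the slides.

```c
#include <assert.h>

/* Overall speedup S when a fraction alpha (0..1) of the original
 * run time is made k times faster (standard form of Amdahl's law).
 * The name amdahl_speedup is hypothetical, not from the slides. */
double amdahl_speedup(double alpha, double k)
{
    assert(alpha >= 0.0 && alpha <= 1.0);
    assert(k > 0.0);
    /* The untouched part (1 - alpha) still takes its full time;
     * only the alpha part shrinks by the factor k. */
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, with alpha = 0.6 and k = 3 the overall speedup is 1 / (0.4 + 0.2), roughly 1.67, well below the local factor of 3.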

  5. Recipe for optimizing
     • Use a profiler to find the most used procedure
     • Optimize the inner loop of that procedure

         for (i = 0; i < n; i++)
             for (j = 0; j < n; j++)
                 a[n*i + j] = b[j];

  6. Optimizing Compilers
     • Provide efficient mapping to machine
       • register allocation
       • code selection and ordering
       • eliminating minor inefficiencies
     • Have difficulty with "optimization blockers"
       • potential memory aliasing
       • potential procedure side-effects

  7. Manual solution
     • Code movement

       Original:

         for (i = 0; i < n; i++)
             for (j = 0; j < n; j++)
                 a[n*i + j] = b[j];

       With n*i moved out of the inner loop:

         for (i = 0; i < n; i++) {
             int ni = n*i;
             for (j = 0; j < n; j++)
                 a[ni + j] = b[j];
         }

     Most compilers do a good job with array code and simple loop structures.
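To check that code movement preserves behavior, the two loop forms can be wrapped in functions and compared. This is only a sketch: the function names `fill_original` and `fill_hoisted`, the `const` qualifier on `b`, and the pointer-stepping variant are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: n*i is recomputed on every inner iteration. */
void fill_original(int *a, const int *b, int n)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* Code motion: the row start a + n*i is computed once per outer
 * iteration, and the inner loop only steps a pointer. */
void fill_hoisted(int *a, const int *b, int n)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < n; i++) {
        int *p = a + n*i;          /* hoisted out of the inner loop */
        for (int j = 0; j < n; j++)
            *p++ = b[j];
    }
}
```

Both functions write the same n*n values in the same order, so any speed difference comes only from how often the multiplication is performed.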

  8. Compiler's solution
     • As long as no optimization blockers are present, compilers can't be beaten

       Original:

         for (i = 0; i < n; i++)
             for (j = 0; j < n; j++)
                 a[n*i + j] = b[j];

       What the compiler effectively produces:

         for (i = 0; i < n; i++) {
             int ni = n*i;
             int *p = a+ni;
             for (j = 0; j < n; j++)
                 *p++ = b[j];
         }

       Generated assembly:

           imull %ebx,%eax            # i*n
           movl 8(%ebp),%edi          # a
           leal (%edi,%eax,4),%edx    # p = a+i*n (scaled by 4)
         # Inner Loop
         .L40:
           movl 12(%ebp),%edi         # b
           movl (%edi,%ecx,4),%eax    # b+j (scaled by 4)
           movl %eax,(%edx)           # *p = b[j]
           addl $4,%edx               # p++ (scaled by 4)
           incl %ecx                  # j++
           jl .L40                    # loop if j<n

  9. Memory Aliasing

         void twiddle1(int *xp, int *yp)
         {
             *xp += *yp;
             *xp += *yp;
         }

         void twiddle2(int *xp, int *yp)
         {
             *xp += 2 * *yp;
         }

     • Call twiddle(&xp, &xp), so both arguments point to the same int
     • twiddle1: the result is 4 × xp
     • twiddle2: the result is 3 × xp
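The aliasing case can be reproduced directly. The sketch below repeats the two functions from the slide, with the stray colon in twiddle1 read as a semicolon; the `assert` guards are an addition for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Adds *yp to *xp twice. If xp == yp, the second addition sees the
 * already doubled value, so the result is 4 times the original. */
void twiddle1(int *xp, int *yp)
{
    assert(xp != NULL && yp != NULL);
    *xp += *yp;
    *xp += *yp;
}

/* Adds 2 * *yp in one step. If xp == yp, the result is 3 times
 * the original. */
void twiddle2(int *xp, int *yp)
{
    assert(xp != NULL && yp != NULL);
    *xp += 2 * *yp;
}
```

With distinct arguments the two functions agree (x = 5, y = 3 gives 11 either way). With `int x = 5; twiddle1(&x, &x);` the value becomes 20, while `twiddle2(&x, &x)` on a fresh 5 gives 15. A compiler therefore cannot replace twiddle1 with twiddle2 unless it can prove the pointers never alias.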

  10. Side effects

         int func1(int x)
         {
             return f(x) + f(x) + f(x) + f(x);
         }

         int func2(int x)
         {
             return 4 * f(x);
         }

     • Given f(x) { return counter++; }, call func(0):
     • func1(0) = 0+1+2+3 = 6
     • func2(0) = 4*0 = 0
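A runnable sketch of the side-effect example follows. It assumes `counter` is a global that starts at 0, which matches the sums on the slide; the helper `reset_counter` is hypothetical and exists only so that both functions can be tried from the same starting state.

```c
#include <assert.h>

static int counter = 0;            /* global state modified by f */

/* f ignores its argument and returns the old counter value while
 * incrementing it: a function with a side effect. */
int f(int x) { (void)x; return counter++; }

/* Four calls to f: the sum is 0+1+2+3 = 6 when counter starts at 0
 * (the sum does not depend on the order of evaluation). */
int func1(int x) { return f(x) + f(x) + f(x) + f(x); }

/* One call to f: the result is 4*0 = 0 when counter starts at 0. */
int func2(int x) { return 4 * f(x); }

/* Hypothetical helper, not on the slide. */
void reset_counter(void) { counter = 0; }
```

Because func1 and func2 return different values and leave `counter` in different states, a compiler that cannot see the body of f must not rewrite one into the other.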

  11. Limitations for Compilers
     • Operate under a fundamental constraint
       • Must not cause any change in program behavior under any possible condition
       • Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions
     • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
       • e.g., data ranges may be more limited than variable types suggest
     • Most analysis is performed only within procedures
       • whole-program analysis is too expensive in most cases
     • Most analysis is based only on static information
       • compiler has difficulty anticipating run-time inputs
     • When in doubt, the compiler must be conservative

  12. Machine-independent versus Machine-dependent optimizations
     • Optimizations you should do regardless of processor / compiler
       • Code motion (out of the loop)
       • Reducing procedure calls
       • Eliminating unneeded memory usage
       • Sharing common sub-expressions
     • Machine-dependent optimizations
       • Pointer code
       • Unrolling
       • Enabling instruction-level parallelism

  13. Optimization Example

         void combine1(vec_ptr v, data_t *dest)
         {
             int i;
             *dest = IDENT;
             for (i = 0; i < vec_length(v); i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest = *dest OPER val;
             }
         }

     • Procedure
       • Compute aggregate OPER of all elements of vector
       • Store result at destination location
     • Clock cycles per element (CPE), integer addition: 42.06 (compiled -g), 31.25 (compiled -O2)

  14. Move Call Out of Loop

         void combine2(vec_ptr v, data_t *dest)
         {
             int i;
             int length = vec_length(v);
             *dest = IDENT;
             for (i = 0; i < length; i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest = *dest OPER val;
             }
         }

         int vec_length(vec_ptr v)
         {
             return v->len;
         }

     • Optimization
       • Move call to vec_length out of inner loop
       • Value does not change from one iteration to next
       • Function calls are expensive
     • CPE: 20.66 (compiled -O2)
       • vec_length() requires 10 clock cycles

  15. Bypass data-abstraction

         void combine3(vec_ptr v, data_t *dest)
         {
             int i;
             int length = vec_length(v);
             int *data = get_vec_start(v);
             *dest = IDENT;
             for (i = 0; i < length; i++)
                 *dest = *dest OPER data[i];
         }

         int get_vec_element(vec_ptr v, int index, int *dest)
         {
             if (index < 0 || index >= v->len)
                 return 0;
             *dest = v->data[index];
             return 1;
         }

     • Optimization
       • Avoid procedure call to retrieve each vector element
       • Get pointer to start of array before loop
       • Within loop just do pointer reference
       • Not as clean in terms of data abstraction
     • CPE: 6.00 (compiled -O2)
       • get_vec_element() requires 14 clock cycles
       • Bounds checking is expensive

  16. Eliminate Unneeded Memory Refs

         void combine4(vec_ptr v, int *dest)
         {
             int i;
             int length = vec_length(v);
             int *data = get_vec_start(v);
             int sum = IDENT;
             for (i = 0; i < length; i++)
                 sum = sum OPER data[i];
             *dest = sum;
         }

     • Optimization
       • Don't need to store in destination until end
       • Local variable sum held in register
       • Avoids 1 memory read and 1 memory write per iteration
     • CPE: 2.00 (compiled -O2)
       • Memory references are expensive!
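To experiment with the combine functions, the vector abstraction has to be filled in. The sketch below assumes a struct with `len` and `data` fields (the slides only show `v->len` and `v->data`), sets `data_t` to `int`, and picks integer addition (`IDENT` 0, `OPER` +); `get_vec_start` is assumed to return `v->data`. Only the first and last versions are shown.

```c
#include <assert.h>
#include <stddef.h>

typedef int data_t;                /* assumption: integer elements */
#define IDENT 0                    /* identity for addition */
#define OPER  +

/* Assumed layout; the slides only reveal the len and data fields. */
typedef struct { int len; data_t *data; } vec_rec, *vec_ptr;

int vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked element access, as on slide 15. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Slide 13: calls vec_length and get_vec_element on every iteration. */
void combine1(vec_ptr v, data_t *dest)
{
    assert(v != NULL && dest != NULL);
    *dest = IDENT;
    for (int i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* Slide 16: length and data pointer hoisted, result accumulated in
 * a local variable and stored once at the end. */
void combine4(vec_ptr v, data_t *dest)
{
    assert(v != NULL && dest != NULL);
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t sum = IDENT;
    for (int i = 0; i < length; i++)
        sum = sum OPER data[i];
    *dest = sum;
}
```

When `dest` does not point into the vector, both versions produce the same result, for example 10 for the elements 1, 2, 3, 4. The CPE figures quoted on the slides come from the original measurements and are not reproduced by this sketch.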

  17. Why did the compiler do that?
     • Different behavior due to memory aliasing
     • Call combine(v, get_vec_start(v)+2) with OPER = *, so dest aliases the last element of the vector
     • combine3: [2,3,5] → [2,3,1] → [2,3,2] → [2,3,6] → [2,3,36]
     • combine4: [2,3,5] → [2,3,5] → [2,3,5] → [2,3,5] → [2,3,30]
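The two traces can be reproduced with a stripped-down sketch that works on a plain int array instead of the vector type. The function names `combine3_prod` and `combine4_prod`, the array-based signature, and the fixed choice of multiplication with identity 1 are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* combine3 style: accumulates directly in *dest on every iteration.
 * If dest points into data, later iterations read the partial
 * result instead of the original element. */
void combine3_prod(int *data, int length, int *dest)
{
    assert(data != NULL && dest != NULL);
    *dest = 1;                         /* identity for multiplication */
    for (int i = 0; i < length; i++)
        *dest = *dest * data[i];
}

/* combine4 style: accumulates in a local and stores once at the
 * end, so data is read unmodified throughout the loop. */
void combine4_prod(int *data, int length, int *dest)
{
    assert(data != NULL && dest != NULL);
    int prod = 1;
    for (int i = 0; i < length; i++)
        prod = prod * data[i];
    *dest = prod;
}
```

Calling `combine3_prod(a, 3, a + 2)` on {2, 3, 5} leaves {2, 3, 36}, while `combine4_prod` on the same input leaves {2, 3, 30}. Since the results differ under aliasing, the compiler cannot turn the combine3 form into the combine4 form on its own.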

  18. Machine Independent
     • Code motion
       • Reduce the frequency with which a computation is performed
       • Applies if it will always produce the same result
       • Especially moving expensive code out of a loop

  19. Conclusion
     How should I write my programs, given that I have a good, optimizing compiler?
     • Don't: smash code into oblivion
       • Hard to read, maintain, and assure correctness
     • Do:
       • Select best algorithm and data representation
       • Write code that's readable and maintainable
         • Procedures, recursion, without built-in constant limits
         • Even though these factors can slow down code
       • Focus on inner loops
         • Detailed optimization means detailed measurement

  20. Assignment
     • Practice Problems
       • Practice Problem 5.1: "What is the effect of the call swap(&xp, &xp)?"
       • Practice Problem 5.3: "Indicate the number of function calls in 3 fragments"
