
Computer Systems


Presentation Transcript


  1. Computer Systems: Optimizing program performance

  2. Performance can make the difference
  • Use pointers instead of array indices (see the sketch below)
  • Use doubles instead of floats
  • Optimize inner loops
  • Recommendations by Patrick van der Smagt (1991) for neural-network implementations
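
  The first recommendation is easy to illustrate. Below is a minimal sketch contrasting array indexing with pointer stepping for the same summation loop; the function names sum_index and sum_pointer are illustrative, not from the slides, and on current compilers both versions usually compile to equivalent code.

    /* Array-index version: the element address is recomputed from i. */
    int sum_index(int *data, int n)
    {
        int i;
        int sum = 0;
        for (i = 0; i < n; i++)
            sum += data[i];
        return sum;
    }

    /* Pointer version: step a pointer through the array instead. */
    int sum_pointer(int *data, int n)
    {
        int *end = data + n;
        int sum = 0;
        while (data < end)
            sum += *data++;
        return sum;
    }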

  3. Machine-independent versus machine-dependent optimizations
  • Optimizations you should do regardless of processor / compiler:
    • Code motion (out of the loop)
    • Reducing procedure calls (see the sketch after this list)
    • Eliminating unneeded memory usage
    • Sharing common sub-expressions
  • Machine-dependent optimizations:
    • Pointer code
    • Unrolling
    • Enabling instruction-level parallelism
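
  A minimal sketch of code motion and reducing procedure calls, using the vec_ptr vector type that appears later in the deck; the names combine_before, combine_after, and get_vec_element are illustrative, not from the slides.

    /* Before: vec_length(v) is re-evaluated on every loop test, and every
       element goes through a per-element procedure call. */
    void combine_before(vec_ptr v, int *dest)
    {
        int i, val;
        int sum = 0;
        for (i = 0; i < vec_length(v); i++) {
            get_vec_element(v, i, &val);    /* assumed bounds-checked accessor */
            sum += val;
        }
        *dest = sum;
    }

    /* After code motion and call reduction: the length is hoisted out of
       the loop and the data pointer is fetched once. */
    void combine_after(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);         /* moved out of the loop */
        int *data = get_vec_start(v);       /* one call instead of n */
        int sum = 0;
        for (i = 0; i < length; i++)
            sum += data[i];
        *dest = sum;
    }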

  4. Machine-dependent optimization: one has to know today's architectures
  • Superscalar (Pentium): often two instructions/cycle
  • Dynamic execution (P6): three instructions out-of-order/cycle
  • Explicit parallelism (Itanium): six execution units

  5. Pentium III Design
  [Block diagram: the instruction-control unit (fetch control, instruction cache, instruction decode, retirement unit, register file, branch prediction) feeds operations to the execution unit's functional units (integer/branch, general integer, FP add, FP mult/div, load, store); addresses and data go through the data cache, and operation results flow back as register updates.]

  6. Functional Units of Pentium III
  • Multiple instructions can execute in parallel:
    • 1 load
    • 1 store
    • 2 integer (one may be a branch)
    • 1 FP addition
    • 1 FP multiplication or division

  7. Performance of Pentium III operations
  • Many instructions are pipelined and can issue once per cycle

    Instruction                   Latency   Cycles/Issue
    Load / Store                     3           1
    Integer Multiply                 4           1
    Integer Divide                  36          36
    Double/Single FP Multiply        5           2
    Double/Single FP Add             3           1
    Double/Single FP Divide         38          38
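
  Read together with the later slides, the two columns bound what is achievable: a chain of dependent operations can do no better than a CPE equal to the latency (4 for a running integer product, as slide 17 shows), while independent operations can issue every Cycles/Issue cycles (a CPE of 1 for integer multiply, given enough parallelism).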

  8. Instruction Control
  • Grabs instructions from memory
    • Based on the current PC + predicted targets for predicted branches
    • The hardware dynamically guesses (possibly) the branch target
  • Translates instructions into operations
    • Primitive steps required to perform the instruction
    • A typical instruction requires 1–3 operations
  • Converts register references into tags
    • An abstract identifier linking the destination of one operation with the sources of later operations

  9. Translation Example
  • Version of combine4: integer data, multiply operation
  • Translation of the first iteration:

    .L24:                            # Loop:
      imull (%eax,%edx,4),%ecx       # t *= data[i]
      incl %edx                      # i++
      cmpl %esi,%edx                 # i:length
      jl .L24                        # if < goto Loop

    becomes the operations

      load (%eax,%edx.0,4)  -> t.1
      imull t.1, %ecx.0     -> %ecx.1
      incl %edx.0           -> %edx.1
      cmpl %esi, %edx.1     -> cc.1
      jl-taken cc.1
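
  The C source of combine4 is not reproduced on the slide. A minimal sketch of what it presumably looks like for integer data and the multiply operation, using the same vec_length/get_vec_start helpers that combine5 and combine6 rely on below, and with the register mapping implied by the assembly comments (t in %ecx, i in %edx, data in %eax, length in %esi):

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);     /* kept in %esi */
        int *data = get_vec_start(v);   /* base address in %eax */
        int t = 1;                      /* running product, kept in %ecx */

        for (i = 0; i < length; i++)    /* i kept in %edx */
            t *= data[i];               /* imull (%eax,%edx,4),%ecx */
        *dest = t;
    }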

  10. Visualizing Operations
  [Data-flow diagram: load and incl start from %edx.0; the load produces t.1, which imull combines with %ecx.0 to produce %ecx.1; cmpl produces cc.1, consumed by jl; the vertical axis is time.]

      load (%eax,%edx,4)    -> t.1
      imull t.1, %ecx.0     -> %ecx.1
      incl %edx.0           -> %edx.1
      cmpl %esi, %edx.1     -> cc.1
      jl-taken cc.1

  • Operations:
    • Vertical position denotes the time at which an operation executes
    • An operation cannot begin until its operands are available
    • Height denotes latency

  11. 4 Iterations of Combining Sum
  [Diagram: the operations of four consecutive iterations scheduled in parallel, requiring 4 integer ops per cycle.]
  • Resource analysis / performance:
    • Unlimited resources should give a CPE of 1.0
    • Would require executing 4 integer operations in parallel

  12. Pentium Resource Constraints
  • Only two integer functional units
  • Set priority based on program order
  • Performance: sustains a CPE of 2.0

  13. Loop Unrolling (measured CPE = 1.33)

    void combine5(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length - 2;
        int *data = get_vec_start(v);
        int sum = 0;
        int i;

        /* Combine 3 elements at a time */
        for (i = 0; i < limit; i += 3) {
            sum += data[i] + data[i+2] + data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            sum += data[i];
        }
        *dest = sum;
    }

  • Optimization:
    • Combine multiple iterations into a single loop body
    • Amortizes loop overhead across multiple iterations
    • Finish the extra elements at the end
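
  combine4, combine5, and combine6 all rely on a small vector abstraction that the slides never define. A minimal sketch of one possible definition, so the listings are self-contained (the struct and field names are assumptions):

    typedef struct {
        int len;      /* number of elements */
        int *data;    /* underlying array */
    } vec_rec, *vec_ptr;

    /* Return the number of elements in the vector. */
    int vec_length(vec_ptr v) { return v->len; }

    /* Return a pointer to the start of the underlying array. */
    int *get_vec_start(vec_ptr v) { return v->data; }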

  14. Resource distribution with Loop Unrolling
  • Predicted performance:
    • Can complete an iteration in 3 cycles
    • Should give a CPE of 1.0

  15. Effect of Unrolling
  • Only helps the integer sum in our examples
    • The other cases are constrained by functional-unit latencies
  • The effect is nonlinear in the degree of unrolling
    • Many subtle effects determine the exact scheduling of operations

  16. Unrolling is for long vectors
  • New source of overhead: finishing the remaining elements when the vector length is not divisible by the degree of unrolling (e.g. with 3-way unrolling, a 1000-element vector leaves one element for the cleanup loop)

  17. 3 Iterations of Combining Product
  • Unlimited resource analysis:
    • Assume an operation can start as soon as its operands are available
  • Performance:
    • The limiting factor becomes latency: each multiply needs the previous product, so iterations serialize on the 4-cycle integer-multiply latency
    • Gives a CPE of 4.0

  18. Iteration splitting

    void combine6(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length - 1;
        int *data = get_vec_start(v);
        int x0 = 1;
        int x1 = 1;
        int i;

        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            x0 *= data[i];
            x1 *= data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 *= data[i];
        }
        *dest = x0 * x1;
    }

  • Optimization:
    • Make operands available earlier by accumulating into two independent products (x0, x1)
    • Combine them at the end
  • Performance: CPE = 2.0
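
  A rough way to check the CPE figures quoted on these slides is to time many passes over a long vector, divide by the total element count, and multiply by the clock rate. The harness below is only a sketch: it assumes combine6 and the vec_rec sketch above are in scope, new_vec is a hypothetical constructor that allocates and fills a vector, and clock() is far too coarse to time a single pass, hence the repetition.

    #include <stdio.h>
    #include <time.h>

    #define N    100000
    #define REPS 1000

    int main(void)
    {
        vec_ptr v = new_vec(N);   /* hypothetical: allocate and fill N elements */
        int result, r;
        clock_t start;
        double secs;

        start = clock();
        for (r = 0; r < REPS; r++)
            combine6(v, &result);
        secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* Printing result keeps the compiler from discarding the calls. */
        printf("%.2f ns per element (result %d)\n",
               1e9 * secs / ((double)N * REPS), result);
        return 0;
    }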

  19. Resource distribution with Iteration Splitting
  • Predicted performance:
    • Makes use of both execution units
    • Gives a CPE of 2.0

  20. Results for Pentium III
  • The biggest gain comes from the basic optimizations
  • But the last little bit helps

  21. Results for Pentium 4
  • Higher latencies (integer multiply = 14, FP add = 5.0, FP multiply = 7.0)
  • Clock runs at 2.0 GHz
  • Not an improvement over the 1.0 GHz P3 for integer multiply
  • Avoids the FP multiplication anomaly

  22. Machine-Dependent Optimization Summary
  • Loop unrolling
    • Some compilers do this automatically
    • Generally not as clever as what can be achieved by hand
  • Exposing instruction-level parallelism
    • Very machine dependent
  • Warning:
    • Benefits depend heavily on the particular machine
    • Do this only for performance-critical parts of the code
    • Best if performed by the compiler, but GCC on IA32/Linux is not very good at it
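
  For reference: GCC's -funroll-loops option asks the compiler to do the unrolling itself; whether that matches careful hand unrolling depends on the loop and the target, so the only reliable guide is measurement.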

  23. Conclusion: how should I write my programs, given that I have a good optimizing compiler?
  • Don't: smash the code into oblivion
    • Hard to read, maintain, and assure correctness
  • Do:
    • Select the best algorithm and data representation
    • Write code that is readable and maintainable: procedures, recursion, no built-in constant limits, even though these factors can slow the code down
    • Focus on the inner loops
  • Detailed optimization means detailed measurement

  24. Assignment
  • Practice problems
    • Practice Problem 5.8: associations of aprod
  • Optimization lab

  25. Performance of Pentium Core 2 operations
  • Two stores and one load in a single cycle
  • The performance of integer divide (IDIV) depends on the quotient

    Instruction                   Latency   Cycles/Issue
    Load / Store                     1          0.33
    Integer Multiply                 3           1
    Integer Divide                 17-41       12-36
    Double/Single FP Multiply        5           2
    Double/Single FP Add             3           1
    Double/Single FP Divide         32          32
