
Code Optimization

Code Optimization. Outline: optimization blockers (memory aliasing, side effects in function calls); understanding modern processors (superscalar, out-of-order execution); more code optimization techniques; performance tuning. Suggested reading: 5.1, 5.7 ~ 5.16.



Presentation Transcript


  1. Code Optimization

  2. Outline • Optimization Blockers • Memory aliasing • Side effects in function calls • Understanding Modern Processors • Superscalar • Out-of-order execution • More Code Optimization techniques • Performance Tuning • Suggested reading • 5.1, 5.7 ~ 5.16

  3. 5.1 Capabilities and Limitations of Optimizing Compilers • Review of • 5.3 Program Example • 5.4 Eliminating Loop Inefficiencies • 5.5 Reducing Procedure Calls • 5.6 Eliminating Unneeded Memory References

  4. Example P387
      void combine1(vec_ptr v, data_t *dest)
      {
          int i;
          *dest = IDENT;
          for (i = 0; i < vec_length(v); i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest = *dest OPER val;
          }
      }
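
The combine functions rely on a vector abstract data type that the slides do not show. The following is a minimal sketch of what it might look like, assuming `data_t` is `int`, the combining operation is addition, and a hypothetical constructor `new_vec`; the struct layout and the bounds check in `get_vec_element` are assumptions for illustration, not the course's actual definitions.

```c
#include <assert.h>
#include <stdlib.h>

typedef int data_t;   /* assumed element type */
#define IDENT 0       /* assumed identity element for the operation */
#define OPER  +       /* assumed combining operation */

/* Assumed layout: a length plus a pointer to the element storage. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Hypothetical constructor: zero-filled vector, or NULL on failure. */
vec_ptr new_vec(int len)
{
    assert(len >= 0);
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(data_t));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Called once per iteration by combine1, once in total by combine2. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Bounds-checked read: returns 1 and stores the element, or 0 if out of range. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Direct access to the storage, which skips the per-element bounds check. */
data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}
```

Under these assumptions, each iteration of combine1 pays for one call to `vec_length` and one bounds-checked call to `get_vec_element`, which is the overhead the later versions remove.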

  5. Example P388
      void combine2(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          *dest = IDENT;
          for (i = 0; i < length; i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest = *dest OPER val;
          }
      }

  6. Example P392
      void combine3(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          *dest = IDENT;
          for (i = 0; i < length; i++) {
              *dest = *dest OPER data[i];
          }
      }

  7. Example P394
      void combine4(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          int x = IDENT;
          for (i = 0; i < length; i++)
              x = x OPER data[i];
          *dest = x;
      }

  8. Machine Independent Opt. Results • Optimizations • Reduce function calls and memory references within loop

  9. Machine Independent Opt. Results • Performance Anomaly • Computing the FP product of all elements is exceptionally slow • Very large speedup when accumulating in a temporary • Memory uses a 64-bit format; registers use 80 bits • Benchmark data caused overflow of 64 bits, but not 80 • [Results table; row labels: Combine1, P385 Combine1, P388 Combine2, P392 Combine3, P394 Combine4; measured values not preserved in this transcript]

  10. Optimization Blockers P394
      void combine4(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          int sum = 0;
          for (i = 0; i < length; i++)
              sum += data[i];
          *dest = sum;
      }

  11. Optimization Blocker: Memory Aliasing P394 • Aliasing • Two different memory references specify a single location • Example • v: [3, 2, 17] • combine3(v, get_vec_start(v)+2) --> ? • combine4(v, get_vec_start(v)+2) --> ?

  12. Optimization Blocker: Memory Aliasing • Observations • Easy to have happen in C • Since allowed to do address arithmetic • Direct access to storage structures • Get in habit of introducing local variables • Accumulating within loops • Your way of telling compiler not to check for aliasing

  13. Optimizing Compilers • Provide efficient mapping of program to machine • register allocation • code selection and ordering • eliminating minor inefficiencies

  14. Optimizing Compilers • Don’t (usually) improve asymptotic efficiency • up to programmer to select best overall algorithm • big-O savings are (often) more important than constant factors • but constant factors also matter • Have difficulty overcoming “optimization blockers” • potential memory aliasing • potential procedure side-effects

  15. Limitations of Optimizing Compilers • Operate Under Fundamental Constraint • Must not cause any change in program behavior under any possible condition • Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions

  16. Limitations of Optimizing Compilers • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles • e.g., data ranges may be more limited than variable types suggest • e.g., using an “int” in C for what could be an enumerated type • (obfuscated: made confusing or unclear)

  17. Limitations of Optimizing Compilers • Most analysis is performed only within procedures • whole-program analysis is too expensive in most cases • Most analysis is based only on static information • compiler has difficulty anticipating run-time inputs • When in doubt, the compiler must be conservative

  18. Optimization Blockers P380 • Memory aliasing
      void twiddle1(int *xp, int *yp)
      {
          *xp += *yp;
          *xp += *yp;
      }
      void twiddle2(int *xp, int *yp)
      {
          *xp += 2 * *yp;
      }

  19. Optimization Blockers P381 • Function call and side effect
      int f(int);
      int func1(int x)
      {
          return f(x) + f(x) + f(x) + f(x);
      }
      int func2(int x)
      {
          return 4 * f(x);
      }

  20. Optimization Blockers P381 • Function call and side effect
      int counter = 0;
      int f(int x)
      {
          return counter++;
      }
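
Putting func1, func2 and this definition of f together shows why the compiler may not merge the four calls into one. This is a self-contained sketch; `check_side_effect` is a hypothetical helper added only to exercise the functions.

```c
#include <assert.h>

int counter = 0;

/* f ignores its argument and modifies global state on every call. */
int f(int x)
{
    (void)x;
    return counter++;
}

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    return 4 * f(x);
}

/* Hypothetical helper: compare the two functions from the same starting state. */
void check_side_effect(void)
{
    counter = 0;
    assert(func1(1) == 6);   /* the four calls return 0, 1, 2, 3 in some order */
    assert(counter == 4);

    counter = 0;
    assert(func2(1) == 0);   /* a single call returns 0 */
    assert(counter == 1);
}
```

Starting from counter == 0, func1 returns 0 + 1 + 2 + 3 = 6 and leaves counter at 4, while func2 returns 4 * 0 = 0 and leaves counter at 1. The order in which the four calls in func1 are evaluated is unspecified in C, but the sum is the same either way.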

  21. 5.7 Understanding Modern Processors 5.7.1 Overall Operation

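  22. Modern CPU Design Figure 5.11 P396 • [Block diagram: Instruction Control (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) sends operations to Execution (functional units: Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store, plus the Data Cache); addresses and instructions flow between Fetch Control, Instruction Cache and Instruction Decode; operation results, register updates and a “Prediction OK?” signal flow back to Instruction Control; Load and Store exchange addresses and data with the Data Cache]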

  23. [Figure 5.11 repeated with numbered labels: labels 1) to 5) on the Instruction Control side, including 2) Fetch Control and 4) Retirement Unit; functional units (1) Integer/branch, (2) General Integer, (3) FP Add, (4) FP mult/div, (5) Load, (6) Store; (7) Data Cache; signals as before: instructions, operations, register updates, “Prediction OK?”, operation results, addr, data]

  24. Modern Processor P396 • Superscalar • Perform multiple operations on every clock cycle • Out-of-order execution • The order in which the instructions execute need not correspond to their ordering in the assembly program

  25. Modern Processor P396 • Two main parts • Instruction Control Unit • Responsible for reading a sequence of instructions from memory • Generating from above instructions a set of primitive operations to perform on program data • Execution Unit

  26. 1) Instruction Control Unit • Instruction Cache • A special, high speed memory containing the most recently accessed instructions.

  27. 1) Instruction Control Unit • Instruction Decoding Logic • Takes actual program instructions • Converts them into a set of primitive operations • Each primitive operation performs some simple task • Simple arithmetic, Load, Store • addl %eax, 4(%edx) becomes three operations: • load 4(%edx) --> t1 • addl %eax, t1 --> t2 • store t2, 4(%edx) • Register renaming P397 P398

  28. 2) Fetch Control • Fetch Ahead P396 • Fetches well ahead of currently accessed instructions • ICU has enough time to decode these • ICU has enough time to send decoded operations down to the EU

  29. Fetch Control • Branch Prediction P397 • Branch taken or fall through • Guess whether branch is taken or not • Speculative Execution P397 • Fetch, decode and execute according to the branch prediction • Before the branch outcome has been determined

  30. 5.7 Understanding Modern Processors 5.7.2 Functional Unit Performance

  31. Multi-functional Units • Multiple Instructions Can Execute in Parallel • 1 load • 1 store • 2 integer (one may be branch) • 1 FP Addition • 1 FP Multiplication or Division

  32. Multi-functional Units Figure 5.12 P400 • Some Instructions Take > 1 Cycle, but Can be Pipelined
      Instruction                  Latency   Cycles/Issue
      Load / Store                 3         1
      Integer Multiply             4         1
      Integer Divide               36        36
      Double/Single FP Multiply    5         2
      Double/Single FP Add         3         1
      Double/Single FP Divide      38        38

  33. 5.7 Understanding Modern Processors 5.7.1 Overall Operation

  34. Execution Unit • Receives operations from ICU • Each cycle it may receive more than one operation • Operations are queued in buffer

  35. Execution Unit • Operation is dispatched to one of multi-functional units, whenever • All the operands of an operation are ready • Suitable functional units are available • Execution results are passed among functional units • (7) Data Cache P398 • A high speed memory containing the most recently accessed data values

  36. 4) Retirement Unit P398 • Instructions need to commit in serial order • Misprediction • Exception • Updates architectural state • Memory and register values

  37. 5.7.3 A Closer Look at Processor Operation • Translating Instructions into Operations

  38. Translation Example P401
      .L24:                           # Loop:
        imull (%eax,%edx,4),%ecx      # t *= data[i]
        incl %edx                     # i++
        cmpl %esi,%edx                # i:length
        jl .L24                       # if < goto Loop

      Instruction                     Operations
      imull (%eax,%edx,4),%ecx        load (%eax,%edx.0,4) --> t.1
                                      imull t.1, %ecx.0 --> %ecx.1
      incl %edx                       incl %edx.0 --> %edx.1
      cmpl %esi,%edx                  cmpl %esi, %edx.1 --> cc.1
      jl .L24                         jl-taken cc.1

  39. Understanding Translation Example P401 • Split into two operations • Load reads from memory to generate temporary result t.1 • Multiply operation just operates on registers
      imull (%eax,%edx,4),%ecx
          load (%eax,%edx.0,4) --> t.1
          imull t.1, %ecx.0 --> %ecx.1

  40. Understanding Translation Example P401 • Operands • Register %eax does not change in the loop • Its value is retrieved from the register file during decoding
      imull (%eax,%edx,4),%ecx
          load (%eax,%edx.0,4) --> t.1
          imull t.1, %ecx.0 --> %ecx.1

  41. Understanding Translation Example P401 • Operands • Register %ecx changes on every iteration • Uniquely identify different versions as • %ecx.0, %ecx.1, %ecx.2, … • Register renaming • Values passed directly from producer to consumers
      imull (%eax,%edx,4),%ecx
          load (%eax,%edx.0,4) --> t.1
          imull t.1, %ecx.0 --> %ecx.1

  42. Understanding Translation Example P402 • Register %edx changes on each iteration • Renamed as %edx.0, %edx.1, %edx.2, …
      incl %edx
          incl %edx.0 --> %edx.1

  43. Understanding Translation Example P402 • Condition codes are treated similarly to registers • Assign a tag to define the connection between producer and consumer
      cmpl %esi,%edx
          cmpl %esi, %edx.1 --> cc.1

  44. Understanding Translation Example P402 • Instruction control unit determines destination of jump • Predicts whether the branch will be taken • Starts fetching instructions at the predicted destination
      jl .L24
          jl-taken cc.1

  45. Understanding Translation Example P401 • Execution unit simply checks whether or not prediction was OK • If not, it signals instruction control • Instruction control then “invalidates” any operations generated from misfetched instructions • Begins fetching and decoding instructions at correct target
      jl .L24
          jl-taken cc.1

  46. Visualizing Operations Figure 5.13 P403 • Operations • Vertical position denotes time at which executed • Cannot begin operation until operands available • Height denotes latency • Operands • Arcs shown only for operands that are passed within execution unit
      load (%eax,%edx.0,4) --> t.1
      imull t.1, %ecx.0 --> %ecx.1
      incl %edx.0 --> %edx.1
      cmpl %esi, %edx.1 --> cc.1
      jl-taken cc.1
      [Figure: dataflow graph with a time axis. %edx.0 feeds load and incl; incl produces %edx.1, which feeds cmpl; cmpl produces cc.1, which feeds jl; load produces t.1, which together with %ecx.0 feeds imull, producing %ecx.1]

  47. Visualizing Operations Figure 5.14 P403 • Operations • Same as before, except that add has latency of 1
      load (%eax,%edx.0,4) --> t.1
      addl t.1, %ecx.0 --> %ecx.1
      incl %edx.0 --> %edx.1
      cmpl %esi, %edx.1 --> cc.1
      jl-taken cc.1
      [Figure: same dataflow graph as Figure 5.13 with addl in place of imull, drawn against a time axis]

  48. 3 Iterations of Combining Product Figure 5.15 P404 • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • Performance • Limiting factor becomes latency of integer multiplier • Gives CPE of 4.0

  49. 4 Iterations of Combining Sum Figure 5.16 P405 4 integer ops • Unlimited Resource Analysis • Performance • Can begin a new iteration on each clock cycle • Should give CPE of 1.0 • Would require executing 4 integer operations in parallel
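
The two unlimited-resource results (CPE 4.0 for product, CPE 1.0 for sum) follow from the same arithmetic: each iteration's combining operation has to wait for the previous iteration's result, so the chain advances by one operation latency per element. The function below is a deliberately simplified model of that reasoning, not a processor simulator; `chain_cycles` and its `startup` parameter (cycles before the first combining operation can begin, for example the first load) are hypothetical names introduced for illustration.

```c
#include <assert.h>

/*
 * Estimated cycles to combine n elements when every combining operation
 * depends on the one before it and all other work overlaps with the chain.
 * Assumes unlimited functional units, as in the analysis on the slides.
 */
long chain_cycles(long n, int op_latency, int startup)
{
    assert(n >= 0 && op_latency > 0 && startup >= 0);
    return startup + n * (long)op_latency;
}
```

In this model the cost of one additional element is exactly `op_latency`: 4 cycles with the integer multiply latency from Figure 5.12, and 1 cycle with the add latency stated for Figure 5.14. The startup term does not affect the cycles per element. The model ignores the limit on functional units, which is why the sum case is described as needing 4 integer operations in parallel to reach CPE 1.0.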

  50. Combining Product: Resource Constraints Figure 5.17 P406
