
Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems

37th International Symposium on Microarchitecture, 2004. Gerolf Hoflehner, Knud Kirkegaard, Rod Skinner, Daniel Lavery, Yong-fong Lee, Wei Li. Intel® Compiler Lab. May 29, 2006. SNU, IDB Lab.


Presentation Transcript


  1. Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems • 37th International Symposium on Microarchitecture, 2004 • Gerolf Hoflehner, Knud Kirkegaard, Rod Skinner, Daniel Lavery, Yong-fong Lee, Wei Li • Intel® Compiler Lab • May 29, 2006 • SNU, IDB Lab. • Kisung Kim

  2. Introduction • Describes compiler optimizations • Produce a 40% speed-up in OLTP performance • Implemented in the Intel C/C++ Itanium compiler • OLTP workload • A large number of clients • Each updates a small portion of the database through short-running transactions • e.g. banking, airline reservations • → Large instruction and data footprint and high I/O traffic

  3. A Repertoire of compiler optimizations for Server Applications • RSE traffic reduction • setjmp()/longjmp() Optimization • Linux preemption model • Data layout optimizations • Instruction prefetching

  4. RSE Traffic Reduction • Itanium architecture • 128 integer registers; r32-r127 are stacked • Each procedure has its own variable-size register stack frame • Register stack • Allocates a single variable-size register stack frame to each procedure

  5. RSE Traffic Reduction • RSE (Register Stack Engine) • Maps a register stack frame onto the physical register file and copies register values to and from memory in response to overflow and underflow conditions • alloc instruction • Determines the size of the register stack frame • Second alloc instruction in the same procedure • Does not cause RSE spills • Shrinks the register stack frame when it allocates a smaller frame than the previous alloc instruction

  6. RSE Traffic Reduction • Unoptimized RSE • foo(): alloc rx=0, 90, 0 … call bar() … alloc rz=0, 90, 0 • bar(): alloc ry=0,50,0 … return • [Figure: register stack state at steps 1-3, with the label "Needs Spill"]

  7. RSE Traffic Reduction • Optimized RSE • foo(): alloc rx=0, 90, 0 … alloc rz=0,30,0 … call bar() … alloc rz=0, 90, 0 • bar(): alloc ry=0,50,0 … return • [Figure: register stack state at steps 1-4]
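The effect of the extra alloc in the two diagrams can be checked with simple arithmetic. The sketch below is a deliberately simplified model, not the actual RSE behavior: it assumes only foo's frame is resident when bar is called, that the 96 stacked registers (r32-r127) are the whole budget, and it ignores any overlap between the caller's output registers and the callee's frame. The function name `spill_needed` is hypothetical.

```c
#include <stddef.h>

/* Stacked integer registers available, r32-r127 (from the slides). */
#define STACKED_PHYS_REGS 96

/*
 * Simplified model (an assumption, not the real RSE algorithm):
 * returns how many registers would have to be spilled to memory so that a
 * callee frame of `callee_frame` registers fits next to `in_use` registers
 * that are already occupied by caller frames.
 */
size_t spill_needed(size_t in_use, size_t callee_frame)
{
    size_t total = in_use + callee_frame;
    return total > STACKED_PHYS_REGS ? total - STACKED_PHYS_REGS : 0;
}
```

Under this model, calling bar (50 registers) with foo's full frame of 90 gives `spill_needed(90, 50)`, which is 44 registers, while shrinking foo's frame to 30 first gives `spill_needed(30, 50)`, which is 0.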

  8. RSE Traffic Reduction • Shrink register stack • Liveness analysis determines the registers that are unused at the point of the call • If the number of dead registers on top of the register stack exceeds a given threshold, the register stack is reduced by the number of dead registers

  9. Overhead • The compiler cannot know whether shrinking the register stack will actually decrease RSE traffic at run time • alloc instruction → scheduling constraints, increase in code size • For an OLTP workload, the empirically found sweet spot was a threshold of 10 registers • alloc instructions are inserted only when at least 10 registers at the top of the register stack are found dead
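The threshold rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the Intel compiler's implementation: liveness is given as a plain boolean array (index 0 is the bottom of the frame), and the names `dead_registers_on_top`, `frame_size_for_call`, and `RSE_SHRINK_THRESHOLD` are hypothetical. Only the threshold value of 10 comes from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold from the slides: shrink only if at least 10 top registers are dead. */
#define RSE_SHRINK_THRESHOLD 10

/*
 * live[i] tells whether stacked register i of the current frame is live
 * across the call (i = 0 is the bottom, frame_size - 1 is the top).
 * Returns the length of the run of dead registers at the top of the frame.
 */
size_t dead_registers_on_top(const bool *live, size_t frame_size)
{
    size_t dead = 0;
    while (dead < frame_size && !live[frame_size - 1 - dead])
        dead++;
    return dead;
}

/*
 * Returns the frame size to use across the call: the shrunken size when the
 * dead run reaches the threshold, otherwise the unchanged size, in which
 * case no extra alloc would be inserted.
 */
size_t frame_size_for_call(const bool *live, size_t frame_size)
{
    size_t dead = dead_registers_on_top(live, frame_size);
    return dead >= RSE_SHRINK_THRESHOLD ? frame_size - dead : frame_size;
}
```

With a 90-register frame in which only the lowest 30 registers are live, the sketch shrinks the frame to 30, matching the alloc sizes in the optimized diagram. A frame with only four dead registers on top is left alone, since the added alloc would cost more than it is likely to save.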

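  10. setjmp()/longjmp() Optimization • Sequences of setjmp()/longjmp() code are a common pattern in database applications • setjmp() • Saves the system state in a jmp_buf structure • Returns 0 when called directly • longjmp() • Reinstates the function state from the jmp_buf • Does not return to its caller; execution resumes at the setjmp() call, which then returns a non-zero value (1 in this example) • [Figure: control-flow graph: V1 is defined, then r=setjmp() and a branch on r==0 (T/F); one successor uses V1, the other defines V2, uses V2, and calls foo()]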
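The pattern described on this slide looks roughly like the following in C. It is a minimal sketch of error recovery with setjmp()/longjmp(); the names `recovery_point`, `run_statement`, and `execute_with_recovery` are hypothetical and do not come from the presentation or from any database code base.

```c
#include <setjmp.h>

/* Saved execution context; filled in by setjmp(). */
static jmp_buf recovery_point;

/* Hypothetical operation that aborts by jumping back to the recovery point. */
static void run_statement(int fail)
{
    if (fail)
        longjmp(recovery_point, 1);  /* never returns to this function */
}

/*
 * Returns 0 if the statement ran to completion, or -1 if it bailed out
 * through longjmp().
 */
int execute_with_recovery(int fail)
{
    if (setjmp(recovery_point) == 0) {
        /* Direct return from setjmp(): normal path. */
        run_statement(fail);
        return 0;
    }
    /* Second return from setjmp(), triggered by longjmp(): recovery path. */
    return -1;
}
```

Calling `execute_with_recovery(0)` takes the normal path and returns 0; calling `execute_with_recovery(1)` re-enters at the setjmp() call and returns -1.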

  11. setjmp()/longjmp() Optimization • Limit the floating-point registers available • Server applications do not need many floating-point operations • Use only the eight scratch floating-point argument registers • Avoids saving/restoring the preserved floating-point registers in the jmp_buf buffer (320 bytes)

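  12. setjmp()/longjmp() Optimization • Cross lifetime • A lifetime that is live at the setjmp() call • Needs special care • [Figure: two control-flow graphs around r=setjmp() with a branch on r==0 (T/F). In the first, V1 is defined before the setjmp() call and used on one branch, while V2 is defined and used on the other branch, which calls foo(). In the second, both V1 and V2 have been assigned to r37]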

  13. setjmp()/longjmp() Optimization • Solution • Dedicate the register for the rest of the procedure • Copy it to a real preserved register (r4-r7) • Spill to a dedicated memory stack location • Explicitly model the control flow from any function that might call longjmp() to the associated setjmp() call

  14. setjmp()/longjmp() Optimization • [Figure: the same two control-flow graphs with the cross lifetime handled: V1 is assigned to r37 and V2 to a separate register, r38]
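The slides describe the hazard at the register-allocation level. C programmers meet a source-level analogue: the C standard leaves a non-volatile automatic variable indeterminate after longjmp() if it was modified between the setjmp() and longjmp() calls. Declaring it `volatile` forces it into memory, which resembles the "spill to a dedicated memory stack location" option. The sketch below is an illustration only; the names `retry_env`, `fail_now`, and `count_attempts` are hypothetical.

```c
#include <setjmp.h>

static jmp_buf retry_env;

/* Hypothetical callee that always bails out through longjmp(). */
static void fail_now(void)
{
    longjmp(retry_env, 1);
}

/*
 * `attempts` is live across the setjmp() call and is modified before the
 * longjmp(). Making it volatile keeps it in memory, so its value is well
 * defined when control re-enters at setjmp().
 */
int count_attempts(void)
{
    volatile int attempts = 0;

    if (setjmp(retry_env) == 0) {
        attempts = 1;   /* modified between setjmp() and longjmp() */
        fail_now();     /* jumps back; setjmp() now returns 1 */
    }
    return attempts;    /* reached only after the longjmp() */
}
```

Without the `volatile` qualifier, a compiler is free to keep `attempts` in a register whose contents are restored to the setjmp()-time state, which is the same kind of conflict the figures show for r37.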

  15. setjmp()/longjmp() Optimization • Reduce spill/fill • Reduce memory stack size • Reduce code size • Eliminate spill/fill of callee-preserved integer registers at function entries and exits • Cost: increase in RSE traffic

  16. Linux Preemption Model • Symbol preemption • A symbol is preemptible if, at some time after linkage, the object it refers to may change

        main:
            #include <stdio.h>
            int bar(void);   /* defined in the shareable object */
            int g = 10;
            void foo() { printf("main %d\n", g); }
            int main() { int i; i = bar(); printf("bar = %d\n", i); return 0; }

        so ("shareable object"):
            #include <stdio.h>
            int g = 5;
            void foo() { printf("so %d\n", g); }
            int bar() { foo(); return g; }

        Result (symbol preemption): main 10, bar = 10
        Result (no symbol preemption): so 5, bar = 5

  17. Cost of Symbol Preemption • Requires position-independent code • Position-independent code: does not contain any absolute addresses → important for shared libraries • Indirect addressing through the linkage table • Global data • Addressed through the linkage table via gp (global pointer) • Extra level of indirection

        Through the linkage table:
            add r3 = @ltoff(data),gp
            ld8 r2 = [r3]
            ld4 r8 = [r2]

        gp-relative, without the linkage table:
            add r2 = @gprel(data),gp
            ld4 r8 = [r2]
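The extra level of indirection can be modeled in plain C. This is only an analogy for the two instruction sequences above, not what a compiler or linker literally emits: `linkage_table_slot` is a hypothetical stand-in for one linkage-table entry, and `data` and `other` are made-up globals.

```c
#include <stdint.h>

/* Hypothetical global standing in for `data` on the slide. */
int32_t data = 42;

/* A second definition that could preempt `data` at load time. */
int32_t other = 7;

/* Model of one linkage-table slot: holds the address the symbol resolves to. */
int32_t *linkage_table_slot = &data;

/* Preemptible access: first load the address from the table, then the value. */
int32_t load_preemptible(void)
{
    int32_t *addr = linkage_table_slot;  /* corresponds to ld8 r2 = [r3] */
    return *addr;                        /* corresponds to ld4 r8 = [r2] */
}

/* Non-preemptible access: the location is fixed, so one load suffices. */
int32_t load_direct(void)
{
    return data;                         /* corresponds to ld4 r8 = [r2] */
}
```

Pointing `linkage_table_slot` at `other` changes what `load_preemptible()` returns without touching its code, which is the flexibility preemption buys. `load_direct()` keeps returning the value of `data` and saves one dependent load.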

  18. Other Optimizations • Data layout optimizations • Move strings and constants to a read-only section • Sort the local data on the memory stack based on frequency and size • Better D-cache utilization • Instruction prefetching • .few/.many completers → control instruction prefetching • Specify how many bundles get prefetched at the branch target
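Sorting local data by frequency and size can be sketched with `qsort`. The slide does not say how the two keys are combined, so the ordering used here (most frequently accessed first, smaller objects first on ties) is an assumption, and `struct local_var`, `by_freq_then_size`, and `order_locals` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of one local variable on the memory stack. */
struct local_var {
    const char *name;
    unsigned    freq;  /* estimated number of accesses */
    size_t      size;  /* size in bytes */
};

/* Assumed ordering: higher frequency first, then smaller size first. */
static int by_freq_then_size(const void *a, const void *b)
{
    const struct local_var *x = a;
    const struct local_var *y = b;

    if (x->freq != y->freq)
        return x->freq > y->freq ? -1 : 1;
    if (x->size != y->size)
        return x->size < y->size ? -1 : 1;
    return 0;
}

/* Reorders the locals in place so that hot, small objects end up adjacent. */
void order_locals(struct local_var *vars, size_t n)
{
    qsort(vars, n, sizeof vars[0], by_freq_then_size);
}
```

Grouping the hot, small locals together means they tend to share cache lines, which is one way such a layout could lead to the better D-cache utilization mentioned on the slide.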

  19. Evaluation • Scaled setup • 4P Itanium 2, 1.5 GHz • 3 MB, 6 MB L3 cache • 32 GB memory • Large workload • Cached setup • 4P Itanium 2, 1.5 GHz • 3 MB, 6 MB L3 cache • 8 GB memory • Small workload • Negligible disk I/O • Runs CPU-bound • A high speed-up on a cached setup does not necessarily translate into a high speed-up on a scaled setup • Software: Red Hat Linux 2.1, Oracle V9, and Intel Itanium Compiler V7.1

  20. Speed-Ups per Optimization

  21. Conclusion • Compiler optimizations are essential for OLTP performance on both cached and scaled setups • Memory traffic continues to be the major bottleneck for OLTP workloads • Interaction among the compiler optimizations may well deserve further study
