Code Layout Optimization for Transaction Processing Workloads. Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, and Mateo Valero. 2006/05/29 KINS Kyuhwan Kim.
Introduction • OLTP (OnLine Transaction Processing) • A form of transaction processing conducted over a computer network. • Examples: electronic banking, order processing, e-commerce. • A large number of clients continually access and update small portions of the database through short-running transactions. • Large memory stall times, caused by large instruction and data footprints and high communication miss rates.
Introduction (cont.) • Code Layout Optimization • Large applications have a particular problem: • They contain a lot of instructions. • The entire application cannot be held on-chip at any one time. • The processor stalls while waiting to fetch new instructions from memory. • Holding more useful instructions on-chip improves performance.
Outline • Introduction • Code Layout Optimizations • Methodology • Behavior of the Database Application in Isolation • Combined Database Application and O/S Behavior • Conclusion
Code Layout Optimizations • Spike • DTKS tool for performing code optimization after linking. • Profile-driven optimization. • Three parts of the Spike optimizer algorithm • Basic Block Chaining • Fine-Grain Procedure Splitting • Procedure Ordering
Basic Block Chaining • Definition • Order the basic blocks within a procedure. • Algorithm • Simple greedy algorithm. • Sort the flow edges by weight. • Chain the two blocks joined by the heaviest remaining edge. • Gain • Improves instruction cache behavior.
Ex) Basic Block Chaining [Figure: control-flow graph of basic blocks A1 to A8 and the chained layout derived from it. Legend: unconditional branch / fall-through, conditional branch, node weight, branch probability. Node weights: A1 10, A2 10, A3 10, A4 6, A5 4, A6 2.4, A7 7.6, A8 10. Conditional branches are labeled with probabilities 0.6 / 0.4.]
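The greedy chaining steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Spike's implementation: the function name `chain_basic_blocks`, the edge-list input format, the block names, and the edge weights are all hypothetical, and the small graph below is not the one in the figure.

```python
def chain_basic_blocks(edges, entry):
    """Greedy basic block chaining (illustrative sketch).

    edges: list of (src, dst, weight) control-flow edges from a profile.
    Returns a list of chains, with the chain holding `entry` first.
    """
    # Every block starts out as its own one-block chain.
    chain_of = {entry: [entry]}
    for src, dst, _ in edges:
        for blk in (src, dst):
            chain_of.setdefault(blk, [blk])

    # Visit edges from heaviest to lightest (ties keep input order).
    for src, dst, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        first, second = chain_of[src], chain_of[dst]
        # Link only if src ends one chain and dst starts a different one.
        if first is not second and first[-1] == src and second[0] == dst:
            first.extend(second)
            for blk in second:
                chain_of[blk] = first

    # Collect the distinct chains, then move the entry chain to the front.
    chains = []
    for chain in chain_of.values():
        if not any(chain is seen for seen in chains):
            chains.append(chain)
    chains.sort(key=lambda c: c is not chain_of[entry])
    return chains


# Hypothetical profile: the "hot" path is taken 9 times out of 10.
edges = [
    ("entry", "test", 10),
    ("test", "hot", 9),
    ("test", "cold", 1),
    ("hot", "exit", 9),
    ("cold", "exit", 1),
]
layout = chain_basic_blocks(edges, "entry")
print(layout)
```

In this sketch the frequently executed path ends up as one fall-through sequence, while the rarely executed block is left in a chain of its own that can be placed elsewhere.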
Fine-Grain Procedure Splitting • Definition • Divide each chain into multiple code segments, each treated as a new procedure. • Algorithm • Fine-grain: end a segment at every unconditional branch or return. (studied only) • Hot/cold: split into a hot part and a cold part. (currently available) • Gain • Gives the procedure ordering algorithm an extra degree of flexibility.
Ex) Fine-Grain Procedure Splitting [Figure: one procedure split into four. Procedure 1 ends in an unconditional branch; Procedures 2, 3, and 4 each end in a subroutine return (RET).]
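A minimal sketch of the fine-grain split, assuming each block in a chain is tagged with how it ends. The function name `split_chain`, the terminator labels (`"cond"`, `"fallthrough"`, `"jump"`, `"ret"`), and the block names are hypothetical and chosen only for illustration.

```python
def split_chain(chain):
    """Split a chain after every unconditional branch or return.

    chain: list of (block_name, terminator) pairs, where terminator is one
    of "fallthrough", "cond", "jump" (unconditional branch), or "ret".
    Returns a list of segments, each a list of block names.
    """
    segments, current = [], []
    for name, terminator in chain:
        current.append(name)
        # Control never falls through a jump or a return, so the next
        # block can be placed anywhere: close the segment here.
        if terminator in ("jump", "ret"):
            segments.append(current)
            current = []
    if current:  # trailing blocks that do not end in a jump or return
        segments.append(current)
    return segments


# Hypothetical chain of five blocks.
chain = [
    ("b1", "cond"),
    ("b2", "jump"),
    ("b3", "fallthrough"),
    ("b4", "ret"),
    ("b5", "ret"),
]
segments = split_chain(chain)
print(segments)
```

Each resulting segment can then be handed to the procedure ordering step as if it were a procedure of its own.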
Procedure Ordering • Definition • Place related procedures near one another. • Algorithm • Build the call graph and weight each edge by its number of calls. • Select the most heavily weighted edge and merge its two nodes. • When merging, use the weights of the original graph to decide how to join the two nodes. • Iterate until the graph is reduced to a single node. • Gain • Improves instruction cache behavior.
Ex) Procedure Ordering [Figure: weighted call graph over procedures A, B, C, D, and E, merged step by step. Merged nodes shown: A,C; B,D; D,B,A,C. Final order: E,D,B,A,C.]
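The merge loop can be sketched as follows. This is a simplified illustration, not Spike's code: the function names, the input format, and the call counts are assumptions. In particular, the edge weights below are hypothetical values chosen for the sketch and are not read from the figure.

```python
def edge_weight(orig, p, q):
    """Call count between procedures p and q in the original graph."""
    return orig.get(frozenset((p, q)), 0)


def order_procedures(procs, calls):
    """Greedy procedure ordering (illustrative sketch).

    procs: list of procedure names.
    calls: dict mapping (caller, callee) pairs to call counts; the graph
           is treated as undirected.
    """
    orig = {}
    for (p, q), w in calls.items():
        key = frozenset((p, q))
        orig[key] = orig.get(key, 0) + w

    groups = [[p] for p in procs]  # each node holds an ordered list
    while len(groups) > 1:
        # Heaviest edge in the working graph: the weight between two
        # groups is the sum of the original weights that cross them.
        _, i, j = max(
            ((sum(edge_weight(orig, p, q) for p in g for q in h), i, j)
             for i, g in enumerate(groups)
             for j, h in enumerate(groups) if i < j),
            key=lambda t: t[0],
        )
        g, h = groups[i], groups[j]
        # Four ways to join the two lists; keep the one whose touching
        # procedures call each other most often in the ORIGINAL graph.
        options = [(g, h), (g, h[::-1]), (g[::-1], h), (h, g)]
        left, right = max(
            options, key=lambda o: edge_weight(orig, o[0][-1], o[1][0])
        )
        groups = [x for k, x in enumerate(groups) if k not in (i, j)]
        groups.append(left + right)
    return groups[0]


# Hypothetical call counts.
calls = {("A", "C"): 10, ("B", "D"): 8, ("A", "B"): 7,
         ("D", "E"): 4, ("C", "E"): 1}
order = order_procedures(["A", "B", "C", "D", "E"], calls)
print(order)
```

With these assumed weights the sketch merges A with C, then B with D, then joins the two pairs so that A and B are adjacent, and finally attaches E next to D.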
Methodology • OLTP Workload • TPC-B • Oracle 8.0.4 • Collecting Profiles • OLTP profile data: Pixie. • Kernel profile: Tru64 Unix kprofile tool. • Hardware and Simulation Platforms • SimOS-Alpha environment
Behavior of the DB App. Only • Instruction cache misses • X-axis: cache line size • Y-axis: number of instruction cache misses • Misses are reduced by 55 to 65%. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]
Experiment (cont.) • Impact of the different code layout optimizations. • Procedure ordering on its own increases cache misses. • The largest benefit comes from basic block chaining. • Procedure ordering after splitting improves performance further.
Experiment (cont.) • Sequentially executed instructions. • With the optimized binary, the run of sequentially executed instructions grows from 7.3 to over 10 instructions. • Temporal locality. • Measured as the number of instructions reused before eviction. • The optimized binary increases the number of instructions reused.
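One way to picture the "sequentially executed instructions" metric is as the average length of a run of consecutive instruction addresses in an execution trace. The sketch below is an assumption about how such a number could be computed, not the measurement method used in the study; the function name and the toy trace are hypothetical, and the 4-byte step reflects Alpha's fixed-width instructions.

```python
def average_run_length(pcs, insn_bytes=4):
    """Average number of instructions executed between control transfers.

    pcs: list of program-counter values in execution order.
    A run ends whenever the next address is not the previous address
    plus one instruction.
    """
    if not pcs:
        return 0.0
    runs = 1
    for prev, cur in zip(pcs, pcs[1:]):
        if cur != prev + insn_bytes:
            runs += 1
    return len(pcs) / runs


# Toy trace: three runs (3, 2, and 3 instructions long).
trace = [0, 4, 8, 100, 104, 0, 4, 8]
print(average_run_length(trace))
```

A layout that turns taken branches into fall-throughs lengthens these runs, which is what the chaining step aims for.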
Behavior of Combined DB App. & OS • Instruction cache misses • Misses are reduced by 45 to 60%. • For comparison, misses are reduced by 55 to 65% for the application in isolation. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]
Experiment (cont.) • Interference between the application and the OS • The majority of application misses are due to self-interference. • The kernel interferes very little with itself. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]
Conclusion • Profile-driven compiler optimizations improve code layout for OLTP workloads. • Application in isolation: cache misses are reduced by 55 to 65%. • With the OS: cache misses are reduced by 45 to 60%. • Overall, these optimizations improve performance by a factor of 1.33.