Code Layout Optimization for Transaction Processing Workloads

Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, and Mateo Valero 2006/05/29 KINS Kyuhwan Kim

Introduction • OLTP (OnLine Transaction Processing) • A form of transaction processing conducted over a computer network. • Examples: electronic banking, order processing, e-commerce. • A large number of clients continually access and update small portions of the database through short-running transactions. • Large memory stall time, caused by large instruction and data footprints and high communication miss rates.

Introduction (cont.) • Code Layout Optimization • Large applications have a particular problem: • They contain a lot of instructions. • The entire application cannot be held on-chip at any one time. • The processor stalls while waiting to fetch new instructions from memory. • Hold more useful instructions on-chip → improve performance.

Outline • Introduction • Code Layout Optimizations • Methodology • Behavior of the Database Application in Isolation • Combined Database Application and O/S Behavior • Conclusion

Code Layout Optimizations • Spike • DTKS tool for performing code optimization after linking. • Profile-driven optimization. • Three parts of the Spike optimizer algorithm: • Basic Block Chaining • Fine-Grain Procedure Splitting • Procedure Ordering

Basic Block Chaining • Definition • Order the basic blocks within a procedure. • Algorithm • Simple greedy algorithm. • Sort the flow edges by weight. • Chain the two blocks joined by the heaviest edge. • Gain • Improves instruction cache behavior.

Ex) Basic Block Chaining [Figure: control-flow graph of basic blocks A1 to A8. Legend: unconditional branch / fall-through, conditional branch, node weight, branch probability. Node weights: A1 10, A2 10, A3 10, A4 6, A5 4, A6 2.4, A7 7.6, A8 10. Branch probabilities: 0.6 / 0.4.]
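The greedy chaining steps on the previous slide can be sketched roughly as follows. This is a minimal illustration, not Spike's implementation: the function name `chain_basic_blocks`, the edge-dictionary input format, and the toy control-flow graph are all assumptions made for the example, and the toy weights are not read from the figure.

```python
def chain_basic_blocks(edges):
    """Greedy chaining sketch.

    edges: dict mapping (src, dst) -> profiled execution count
           (hypothetical input format, chosen for this example).
    Returns a list of chains, each a list of block names in layout order.
    """
    # Every block starts in its own single-block chain.
    chain_of = {}
    for src, dst in edges:
        for block in (src, dst):
            chain_of.setdefault(block, [block])

    # Visit edges from heaviest to lightest (ties broken by name).
    for (src, dst), _weight in sorted(edges.items(),
                                      key=lambda kv: (-kv[1], kv[0])):
        head, tail = chain_of[src], chain_of[dst]
        if head is tail:
            continue  # both blocks are already in the same chain
        if head[-1] != src or tail[0] != dst:
            continue  # src must end its chain and dst must start its chain
        head.extend(tail)  # dst now falls through directly after src
        for block in tail:
            chain_of[block] = head

    # Collect each distinct chain once, in first-seen order.
    chains = []
    for chain in chain_of.values():
        if not any(chain is c for c in chains):
            chains.append(chain)
    return chains


# Toy control-flow graph with made-up weights.
toy_edges = {("A", "B"): 10, ("B", "C"): 6, ("B", "D"): 4,
             ("C", "E"): 6, ("D", "E"): 4}
print(chain_basic_blocks(toy_edges))
```

On the toy graph the hot path A, B, C, E ends up contiguous, while the less frequently executed block D is left in a chain of its own.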

Fine-Grain Procedure Splitting • Definition • Divide each chain into multiple code segments → new procedures. • Algorithm • Split at each unconditional branch or return (used in this study only). • Split into hot and cold parts (currently available). • Gain • Gives the procedure ordering algorithm an extra degree of flexibility.

Ex) Fine-Grain Procedure Splitting [Figure: one chain split into Procedures 1 to 4. Procedure 1 ends in an unconditional branch; Procedures 2, 3, and 4 each end in a subroutine return (RET).]
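As a rough sketch of the splitting rule, a chain can be cut after every block that ends in an unconditional branch or a return, since control never falls through past those points. The function name `split_chain`, the terminator labels, and the toy chain below are assumptions for illustration only.

```python
def split_chain(chain):
    """Split a chain into segments at unconditional branches and returns.

    chain: list of (block_name, terminator) pairs, where terminator is one of
           "fallthrough", "cond", "jump", "ret" (hypothetical labels).
    Returns a list of segments, each a list of block names.
    """
    segments, current = [], []
    for name, terminator in chain:
        current.append(name)
        # Nothing falls through a jump or a return, so the next block
        # can start a new, independently placeable segment.
        if terminator in ("jump", "ret"):
            segments.append(current)
            current = []
    if current:
        segments.append(current)  # trailing blocks with no closing jump/ret
    return segments


# Toy chain with made-up block names.
toy_chain = [("b1", "cond"), ("b2", "jump"), ("b3", "fallthrough"),
             ("b4", "ret"), ("b5", "ret")]
print(split_chain(toy_chain))
```

Each resulting segment can then be handed to procedure ordering as if it were a separate procedure.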

Procedure Ordering • Definition • Place related procedures near one another. • Algorithm • Build the call graph and weight each edge by the number of calls. • Select the most heavily weighted edge and merge its two nodes. • When merging, use the weights from the original graph. • Iterate until the graph is reduced to a single node. • Gain • Improves instruction cache behavior.

Ex) Procedure Ordering [Figure: weighted call graph over procedures A, B, C, D, E, merged step by step: (A,C) and (B,D), then (D,B,A,C), then the final order E,D,B,A,C.]
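The merge loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `order_procedures`, the undirected `frozenset` edge encoding, and the toy call counts are hypothetical, and the weights are made up rather than read from the figure. Reading "use the weights from the original graph" as "orient the two merged groups so the most heavily connected procedures sit next to each other" is likewise an interpretation.

```python
from itertools import product


def order_procedures(call_weights):
    """Greedy procedure ordering sketch.

    call_weights: dict mapping frozenset({p, q}) -> number of calls between
                  procedures p and q (hypothetical input format).
    Returns one list of procedure names in layout order.
    """
    def w(p, q):
        # Weight of the edge in the ORIGINAL call graph (0 if absent).
        return call_weights.get(frozenset((p, q)), 0)

    groups = [[p] for p in sorted({p for pair in call_weights for p in pair})]

    while len(groups) > 1:
        # Pick the two groups joined by the heaviest combined edge.
        i, j = max(
            ((i, j) for i in range(len(groups))
             for j in range(i + 1, len(groups))),
            key=lambda ij: sum(w(p, q) for p in groups[ij[0]]
                               for q in groups[ij[1]]),
        )
        g, h = groups[i], groups[j]
        # Try all four orientations and keep the one whose two junction
        # procedures have the heaviest edge in the original graph.
        candidates = [a + b for a, b in product((g, g[::-1]), (h, h[::-1]))]
        merged = max(candidates, key=lambda c: w(c[len(g) - 1], c[len(g)]))
        groups = [x for k, x in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups[0]


# Toy call graph with made-up call counts.
toy_calls = {
    frozenset(("A", "C")): 10,
    frozenset(("B", "D")): 8,
    frozenset(("A", "B")): 7,
    frozenset(("D", "E")): 3,
    frozenset(("C", "E")): 1,
}
print(order_procedures(toy_calls))
```

With these toy weights the merges proceed as (A,C), then (B,D), then the two pairs are joined at the A-B edge, and E is attached last next to D.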

Methodology • OLTP Workload • TPC-B • Oracle 8.0.4 • Collecting Profiles • OLTP profile data → Pixie. • Kernel profile → Tru64 Unix kprofile tool. • Hardware and Simulation Platforms • SimOS-Alpha environment

Behavior of the DB App. Only • Instruction cache misses • X-axis: cache line size • Y-axis: number of instruction cache misses • Misses are reduced by 55~65%. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]

Experiment (cont.) • Impact of the different code layout optimizations. • Procedure ordering → increases cache misses. • The largest benefit comes from basic block chaining. • Procedure ordering after splitting → improves performance further.

Experiment (cont.) • Sequentially executed instructions. • Optimized binary → increases from 7.3 to over 10 instructions. • Temporal locality. • Number of instructions reused before eviction. • Optimized binary → increases the number of instructions reused.
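One way to measure the "sequentially executed instructions" metric is the mean length of runs of consecutive instruction addresses in an execution trace. The sketch below assumes a fixed 4-byte instruction size and a trace given as a plain list of fetch addresses; the function name `mean_sequential_run` and the toy trace are hypothetical, and this is not necessarily how the study computed its figures.

```python
def mean_sequential_run(trace, insn_size=4):
    """Average number of instructions executed before a control-flow break.

    trace: list of instruction fetch addresses in execution order
           (hypothetical input format).
    insn_size: bytes per instruction (assumed fixed).
    """
    if not trace:
        return 0.0
    runs, length = [], 1
    for prev, cur in zip(trace, trace[1:]):
        if cur == prev + insn_size:
            length += 1          # next sequential instruction
        else:
            runs.append(length)  # a taken branch, call, or return ended the run
            length = 1
    runs.append(length)
    return sum(runs) / len(runs)


# Toy trace: runs of 3, 2, and 1 instructions.
toy_trace = [0, 4, 8, 100, 104, 200]
print(mean_sequential_run(toy_trace))
```

A better code layout turns more taken branches into fall-throughs, which lengthens these runs.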

Behavior of Combined DB App. & OS • Instruction cache misses • Misses are reduced by 45~60%. • For comparison, the reduction for the app. in isolation is 55~65%. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]

Experiment (cont.) • Interference between the app. and the OS • The majority of app. misses arise from self-interference. • The kernel interferes very little with itself. [Figure legend: baseline OLTP binary vs. optimized OLTP binary]

Conclusion • Profile-driven compiler optimization to improve code layout in OLTP workloads. • App. in isolation → 55~65% fewer instruction cache misses. • With OS → 45~60% fewer instruction cache misses. • Overall, these optimizations yield a 1.33x improvement in performance.
