PINTOS: An Execution Phase Based Optimization and Simulation Tool

PINTOS: An Execution Phase Based Optimization and Simulation Tool • Wei Hsu, Jinpyo Kim, Sreekumar Kodak • Computer Science Department, University of Minnesota • October 9, 2004 • PIN Tutorial at ASPLOS'04

Outline • What is Pintos? • What can Pintos do? • Phase detection for optimization and simulation • Optimization (instruction prefetching) • Fast Simulation • Summary

What is Pintos? • PINTOS is a PIN-based tool for optimization and simulation • A research framework that supports adaptive object code optimization • Supports deep analysis of run-time program behavior for object code optimization (e.g. instruction and data prefetching) • Integrates hardware performance monitoring (HPM, via pfmon) with dynamic instrumentation (PIN) • Also supports fast performance simulation • Identifies program phases (at coarse and fine granularity) • Generates simulation strings that capture representative program behaviors

Pintos Framework [Figure: block diagram with an optimization path (program profile, pfmon profile analysis, opt targets, PIN-based analysis with control flow and cache sim, filtered opt targets) and a simulation path (program profile, pfmon profile analysis, phase targets, PIN-based phase detection, phase info, simulation string gen, simulation strings)]

Our Background • ADORE dynamic optimization system [Figure: ADORE block diagram showing the main thread, code cache, dynamic optimization thread, phase detection, trace selection, optimization, deployment, kernel / pfmon, and the hardware performance monitoring unit]

ADORE Performance: Speedup of ORC2.1 +O2 Compiled SPEC2000 Benchmarks

ADORE Performance at Different Sampling Rates

Future Enhancements to ADORE • I-cache prefetching • Helper thread based optimizations • Value prediction based optimizations • Dynamically undoing aggressive optimizations (e.g. control/data speculation, indirect array prefetches) • Software branch prediction

What can Pintos do for us? • Pintos uses pfmon to identify high-level performance problems (e.g. I-cache misses) and to locate target code (phases) for optimization • Pintos then uses a PIN-based analysis tool to focus on the target code (phases) and conduct deep analysis • Pintos provides a framework for deep analysis of program behavior, so that we can experiment with new object code optimization techniques and feed them to ADORE • Simulation strings can be generated by Pintos and used for more efficient micro-architecture simulations

Phase based Optimization and Simulation • In Pintos, a phase is a sequence of code that consistently exhibits certain performance behaviors, for example: • Gzip shows consistent and repeated data cache miss patterns • Crafty exhibits consistent I-cache misses • A repeating phase can serve as a unit for dynamic and adaptive optimization, or for fast performance simulation • The optimization unit can be a basic block, trace, procedure, or region (loops and loop nests, including complex control transfers) • The simulation unit can be an extended code sequence

Phase Detection • One phase detection method doesn’t fit all needs. • Dynamic data cache prefetching requires coarse grain phases (e.g. loops) while dynamic I-cache prefetching requires fine-grain phases (e.g. frequent calling paths). • A phase tuple is used to determine the current point of execution in PIN instrumentation • Phase tuple: (phase ID #, ip addr, # of retired insts)
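The phase tuple can be pictured as a small record that an instrumentation routine consults to decide which phase is executing. The following is a minimal sketch in Python, not Pintos code: the field names (`phase_id`, `ip`, `retired_insts`), the helper `current_phase`, and the convention that tuples are sorted by retired-instruction count are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class PhaseTuple(NamedTuple):
    """Hypothetical layout of (phase ID #, ip addr, # of retired insts)."""
    phase_id: int       # phase ID #
    ip: int             # instruction pointer address where the phase begins
    retired_insts: int  # retired-instruction count at that point


def current_phase(marks: List[PhaseTuple], retired: int) -> Optional[int]:
    """Return the ID of the phase active at a given retired-instruction count.

    Assumes `marks` is sorted by retired_insts (an illustrative convention).
    Returns None if execution has not yet reached the first mark.
    """
    counts = [m.retired_insts for m in marks]
    i = bisect_right(counts, retired) - 1  # last mark at or before `retired`
    return marks[i].phase_id if i >= 0 else None


# Made-up example: phase 0 repeats after phase 1.
marks = [
    PhaseTuple(0, 0x4000, 0),
    PhaseTuple(1, 0x4A10, 1_000_000),
    PhaseTuple(0, 0x4000, 2_500_000),
]
```

With these made-up marks, `current_phase(marks, 1_500_000)` falls inside the second entry, so an analysis routine could restrict deep instrumentation to the phases it cares about.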

Pintos for Optimization (I-Prefetch) • Many applications still suffer from significant I-cache misses (e.g. database applications, some SPEC CPU2000 benchmarks) • A predictable call sequence results in a relatively low miss rate • Complex control flows cause a high miss rate from streaming prefetches

I-Cache Miss Analysis (pfmon)
• Miss address based info: Crafty (2125/4760000)
    Misses covered   Top miss PCs (share of 2125)
    25%              30   (1.41%)
    50%              91   (4.28%)
    75%              228  (10.73%)
    90%              442  (20.80%)
  Each top miss PC was caused by 10-40 different paths.
• Path based info: Crafty (8016/4760000)
    Misses covered   Top paths (share of 8016)
    25%              28   (0.34%)
    50%              126  (1.57%)
    75%              436  (5.43%)
    90%              1118 (13.94%)
  Each top path leading to an I-cache miss has 1-2 possible prefetch targets.
• The data show that we can reduce the points of interest for instruction prefetching.
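The coverage figures on this slide can be read as "how many of the most frequent miss PCs (or paths) account for a given share of all sampled misses". The slide does not state this explicitly, so that reading is an assumption. Under it, a sketch of the computation over a sampled profile might look like the following; the function name `entries_for_coverage` and the toy sample data are invented for illustration.

```python
from collections import Counter


def entries_for_coverage(miss_counts, thresholds=(0.25, 0.50, 0.75, 0.90)):
    """For each threshold, count how many of the most frequent entries are
    needed to cover at least that fraction of all samples."""
    total = sum(miss_counts.values())
    ranked = sorted(miss_counts.values(), reverse=True)  # hottest first
    result = {}
    covered = 0
    n = 0
    for t in sorted(thresholds):
        # Keep adding the next-hottest entry until the threshold is reached.
        while n < len(ranked) and covered < t * total:
            covered += ranked[n]
            n += 1
        result[t] = n
    return result


# Toy profile: 100 sampled misses spread over four miss PCs (made-up data).
samples = [0x10] * 40 + [0x20] * 30 + [0x30] * 18 + [0x40] * 12
coverage = entries_for_coverage(Counter(samples))
```

The same routine applies to path-based profiles if the keys are path identifiers instead of miss addresses.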

Exploring prospective points of instruction prefetching (PIN) • Pintos generates prospective paths leading to frequent I-cache misses by analyzing the pfmon profile • The PIN instrumentation routine constructs a control flow graph and simulates the instruction cache along the execution • It inserts I-cache prefetch instructions for the prospective paths based on control flow edge weights and estimated cache replacement [Figure: paths frequently causing I-cache misses, a control flow graph with blocks B1-B8, and an instruction cache simulator]

Exploring prospective points of instruction prefetching (PIN) • Key observation: most I-cache misses happen in the cache lines that follow the entry to or the return from a function call • L1I cache misses are mostly capacity misses, so we need to estimate how a prefetch affects the incoming instruction stream • Key idea: run ahead by exploring the CFG and the I-cache simulator • Evaluate the prospective paths given by Pintos [Figure: paths frequently causing I-cache misses, a control flow graph with blocks B1-B8, and an instruction cache simulator]
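To make the "simulate the I-cache and evaluate a candidate prefetch" idea concrete, here is a minimal sketch of a set-associative LRU instruction cache model with a prefetch hook. It is a toy model, not the Pintos simulator: the class `ICacheSim`, the helper `misses_for`, and the cache geometry defaults are assumptions chosen for illustration, and the real tool drives its simulator from PIN instrumentation rather than from a Python list of addresses.

```python
from collections import OrderedDict


class ICacheSim:
    """Toy set-associative instruction cache with LRU replacement."""

    def __init__(self, size_bytes=16 * 1024, line_bytes=64, ways=4):
        # Geometry values are illustrative defaults, not a specific CPU.
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def _touch(self, addr):
        """Bring the line holding addr into the cache; return True on a hit."""
        line = addr // self.line_bytes
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)    # mark as most recently used
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict the least recently used line
        s[line] = True
        return False

    def fetch(self, addr):
        """Demand fetch: counts toward the hit/miss statistics."""
        hit = self._touch(addr)
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        return hit

    def prefetch(self, addr):
        """Prefetch: fills the line (possibly evicting another) without
        counting a demand miss."""
        self._touch(addr)


def misses_for(trace, prefetches=None, **cfg):
    """Replay a fetch-address trace and return the demand miss count.

    `prefetches` maps a trigger address to the address to prefetch when the
    trigger is fetched (a hypothetical way to describe a candidate point).
    """
    cache = ICacheSim(**cfg)
    prefetches = prefetches or {}
    for addr in trace:
        cache.fetch(addr)
        if addr in prefetches:
            cache.prefetch(prefetches[addr])
    return cache.misses


# A caller at 0x1000 followed by a callee entry at 0x8000 (made-up addresses).
trace = [0x1000, 0x1004, 0x8000, 0x8004]
baseline = misses_for(trace)
with_prefetch = misses_for(trace, prefetches={0x1000: 0x8000})
```

Because a prefetch can evict a line that is still needed, comparing `baseline` against `with_prefetch` on the same trace is one simple way to estimate whether a candidate prefetch point helps or hurts when misses are capacity-dominated.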

Pintos for Fast Simulation • Execution-driven micro-architectural simulation is commonly used for evaluating new micro-architecture features and the corresponding code optimizations • A complete simulation often takes too long, so new methods for fast simulation, such as SimPoint and SMARTS, have been proposed • PASS (Phase Aware Stratified Sampling) is a different way to generate representative and customized traces for targeted simulations

Fast Simulation Techniques • Truncated execution: Run Z, FastForward-W-R • Sampling: SMARTS, SimPoint, stratified sampling • Reduced input sets: MinneSPEC

Problems of Previous Work • Truncated execution gives very inaccurate results • Reduced input sets do not always behave the same as the reference inputs, so performance estimates based on reduced input sets may be misleading

Mechanism of SMARTS [Figure: program run time divided into repeating segments labeled W, U, and (K-1)*U] • W: warm-up time (fixed to 2000 instructions for SPEC2000) • U: detailed simulation (fixed to 1000 instructions for SPEC2000) • (K-1)*U: functional simulation with functional warming (the tool gives the value of K for which the IPC will be within ±3% of the actual value at a 99.7% confidence level)
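The "within ±3% at 99.7% confidence" target is a standard sampling-theory statement: with n measured units whose coefficient of variation is V, the confidence interval half-width relative to the mean is roughly z * V / sqrt(n), with z = 3 for about 99.7% under a normal approximation. The sketch below shows that arithmetic only; it is not the SMARTS tool's code, and the function names are made up for illustration.

```python
import math
from statistics import mean, stdev

Z_997 = 3.0  # normal quantile for roughly 99.7% two-sided confidence


def relative_error(samples, z=Z_997):
    """Confidence interval half-width as a fraction of the sample mean."""
    n = len(samples)
    cv = stdev(samples) / mean(samples)  # coefficient of variation
    return z * cv / math.sqrt(n)


def samples_needed(cv, eps=0.03, z=Z_997):
    """Smallest n such that z * cv / sqrt(n) <= eps."""
    return math.ceil((z * cv / eps) ** 2)


# Made-up per-unit IPC measurements from four detailed-simulation units.
ipc_units = [1.0, 1.2, 0.8, 1.0]
err = relative_error(ipc_units)
```

The required sample count grows with the square of the variation, which is why a program with steady phases needs far fewer measured units than one with highly variable behavior.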

Issues in Previous Work • SMARTS: the values of U and W are fixed for the SPEC2000 suite and have to be identified again for every new benchmark suite (very time consuming) • SMARTS: oversampling in steady phases; it does not effectively exploit the existence of phases in programs • SimPoint: the user chooses the length of the simulation point (100 million, 10 million, or 1 million instructions) • SimPoint: provides simulation points based on clustering of basic block profiles, which are generated using sim-fast or ATOM

Phase Aware Stratified Sampling (PASS) • Deploys a hierarchical method to detect coarse and fine grain program phases: (1) tracking the calling stack (stable bottom = coarse grain phase) → inter-procedure; (2) detecting loops within the procedure → intra-procedure; (3) tracking data access patterns such as stride within loops (fine grain phases) • Selects stratified samples from each phase until reaching high statistical confidence
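The last bullet, sampling each phase until the estimate is statistically tight, can be sketched with the textbook stratified estimator: the overall estimate is the weight-averaged per-phase mean, and its variance is the sum of w_h^2 * s_h^2 / n_h over phases. The code below is an illustration under assumptions, not the PASS implementation: the greedy rule of adding a sample to the phase that contributes the most variance, the stopping thresholds, and all names are invented here.

```python
import math
from itertools import cycle
from statistics import mean, variance


def stratified_estimate(strata):
    """strata: list of (weight, samples); weights sum to 1 and every stratum
    has at least two samples. Returns (estimate, standard_error)."""
    est = sum(w * mean(s) for w, s in strata)
    var = sum(w * w * variance(s) / len(s) for w, s in strata)
    return est, math.sqrt(var)


def sample_until_confident(phases, eps=0.03, z=3.0, start=2, max_per_phase=10_000):
    """phases: list of (weight, draw), where draw() returns one measurement
    (e.g. the IPC of one sampled unit) from that phase.

    Adds samples until z * SE <= eps * |estimate| or every phase is capped.
    Returns (estimate, standard_error, samples_taken_per_phase).
    """
    samples = [[draw() for _ in range(start)] for _, draw in phases]
    while True:
        strata = [(w, s) for (w, _), s in zip(phases, samples)]
        est, se = stratified_estimate(strata)
        capped = all(len(s) >= max_per_phase for s in samples)
        if z * se <= eps * abs(est) or capped:
            return est, se, [len(s) for s in samples]
        # Assumed allocation rule: sample the phase contributing most variance.
        contrib = [
            w * w * variance(s) / len(s) if len(s) < max_per_phase else -1.0
            for w, s in strata
        ]
        i = contrib.index(max(contrib))
        samples[i].append(phases[i][1]())


# Made-up workload: a steady phase (70% of execution) and a variable phase.
noisy = cycle([1.5, 2.5]).__next__
phases = [(0.7, lambda: 1.0), (0.3, noisy)]
est, se, counts = sample_until_confident(phases)
```

In this toy run the steady phase keeps only its initial two samples while the variable phase receives all the additional ones, which illustrates how phase awareness can avoid the oversampling of steady phases noted earlier for SMARTS.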

IPC vs SimPoint (cc1-166, 1 million insts)

IPC vs Phase Classification on PASS (cc1-166, 1 million insts)

IPC vs SimPoint (cc1-166, 250 million insts)

IPC vs SimPoint (gzip-source, 1 million insts)

IPC vs Phase Classification on PASS (gzip-source, 1 million insts)

IPC vs SimPoint (gzip-source, 250 million insts)

IPC vs SimPoint (mcf-ref, 1 million insts)

IPC vs Phase Classification on PASS (mcf-ref)

IPC vs SimPoint (mcf-ref, 250 million insts)

IPC vs Phase Classification on PASS (gap-ref, 1 million insts)

IPC vs SimPoint (gap-ref, 250 million insts)

Summary • We showed how our research framework (Pintos) combines HPM sampling (pfmon) and dynamic instrumentation (Pin) for adaptive object code optimization and micro-architectural simulation • PASS (Phase Aware Stratified Sampling) may offer a more efficient way to simulate the interaction between compiler optimizations and new micro-architectural features
