This presentation describes a model and approach for predicting cross-platform performance using partial execution, focusing on iteration-based scientific applications. Short partial executions, combined with observation-based performance translation, are used to estimate resource usage and job wall time. The presentation also covers the challenges of modeling and simulating increasingly large and complex machines and applications.
Cross-Platform Performance Prediction Using Partial Execution Leo T. Yang, Xiaosong Ma*, Frank Mueller Department of Computer Science, Center for High Performance Simulations (CHiPS), North Carolina State University (*Joint Faculty with Oak Ridge National Laboratory) Supercomputing 2005
Presentation Roadmap • Introduction • Model and approach • Performance results • Conclusion and future work Supercomputing 2005
Cross-Platform Performance Prediction • Users face a wide selection of machines • Need cross-platform performance prediction to • Choose a platform to use / purchase • Estimate resource usage • Estimate job wall time • Machines and applications both grow larger and more complex • Modeling- and simulation-based approaches become harder and more expensive • Existing performance data are not reused in performance prediction Supercomputing 2005
Observation-based Performance Prediction • Observe cross-platform behavior • Treat applications and platforms as black boxes • Avoid case-by-case model building • Cover the entire application • Computation • Communication • I/O • Convenient with third-party libraries • Observation: existence of a “reference platform” • Goal: a cross-platform meta-predictor • Approach: performance translation based on relative performance [Diagram: a run known to take T = 20 hrs on the reference platform is translated to an unknown T = ? hrs on the target platform] Supercomputing 2005
Main Idea: Utilizing Partial Execution • Observation: the majority of scientific applications are iteration-based • Highly repetitive behavior • phases -> timesteps • Execute short partial executions • Low-cost “test drives” • Simple API (application indicates number of timesteps k) • Quit after k timesteps [Diagram: partial executions on the reference and target systems yield relative performance = 0.6, used to translate the measured full run (Full-1) into the predicted full run (Full-2)] Supercomputing 2005
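To make the translation concrete, a small worked example (assuming relative performance is defined as the ratio of target to reference per-timestep times): if the partial executions show that a timestep on the target takes 0.6 times as long as on the reference, and a full run is known to take 20 hours on the reference, the predicted full-run time on the target is roughly 0.6 × 20 = 12 hours.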
Application Model • Execution of parallel simulations modeled as regular expression I(C*[W])*F • I: one-time initialization phase • C: computation phase • W: optional I/O phase • F: one-time finalization phase • Different phases likely have different cross-platform relative performance • Major challenges • Avoid impact of initially unstable performance • Predict correct mixture of C and W phases Supercomputing 2005
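For example, a run that writes output every third timestep unrolls to I C C C W C C C W … F, one instance of I(C*[W])*F; a run with no periodic I/O is simply I C C … C F.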
Partial Execution • Terminate applications prematurely • API • init_timestep() • Optional, useful with a large setup phase • begin_timestep() • end_timestep(maxsteps) • “begin” and “end” calls bracket a C or CW phase • Execution terminated after maxsteps timesteps • Easy-to-use interface • 2-3 lines of code inserted into the source code Supercomputing 2005
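The sketch below shows how these calls might be inserted into a typical timestep loop. The three API names and their roles are taken from the presentation; the extern signatures, the surrounding application routines, and the constants are assumptions for illustration only.

    /* Instrumenting an existing timestep loop for partial execution.
       begin_timestep()/end_timestep() bracket each C (or C+W) phase, and the
       library terminates the run after maxsteps timesteps; init_timestep()
       optionally marks the end of a large setup phase. Signatures assumed. */
    extern void init_timestep(void);
    extern void begin_timestep(void);
    extern void end_timestep(int maxsteps);

    void setup(void);           /* the application's own routines (hypothetical) */
    void compute_step(int t);
    void write_output(int t);

    int main(void) {
        setup();
        init_timestep();                  /* optional: large setup phase ends here */
        for (int t = 0; t < 100000; t++) {
            begin_timestep();             /* start of one timestep (C or C+W) */
            compute_step(t);
            if (t % 10 == 0) write_output(t);
            end_timestep(50);             /* partial run: terminate after 50 timesteps */
        }
        /* finalization code below this point is never reached in a partial run */
        return 0;
    }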
Base Prediction Model • Given a reference platform and a target platform • Perform one or more partial executions on both • Compute the average per-timestep execution time on each platform • Compute the relative performance between the two • Compute an overall execution time estimate for the target platform • Prediction accuracy measured as “prediction performance” (the predicted-to-actual ratio) Supercomputing 2005
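A minimal, self-contained sketch of these steps with made-up numbers; the ratio direction (target over reference) and all example values are assumptions for illustration, not measured data.

    #include <stdio.h>

    int main(void) {
        /* average per-timestep times from the partial executions (example values) */
        double t_step_ref    = 1.20;   /* seconds per timestep on the reference platform */
        double t_step_target = 0.72;   /* seconds per timestep on the target platform */
        double T_full_ref    = 20.0;   /* known full-run wall time on the reference (hours) */

        double rel_perf = t_step_target / t_step_ref;   /* relative performance (0.6 here) */
        double T_pred   = rel_perf * T_full_ref;        /* predicted full-run time on the target */

        /* prediction performance: predicted-to-actual ratio once the target run is available */
        double T_actual = 11.5;                         /* hypothetical measured target time (hours) */

        printf("relative performance      = %.2f\n", rel_perf);
        printf("predicted target time     = %.1f hours\n", T_pred);
        printf("predicted-to-actual ratio = %.2f\n", T_pred / T_actual);
        return 0;
    }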
Refined Prediction Model • Problem 1: initial performance fluctuations • Variance due to cache warm-up, etc. • May span dozens of timesteps • Problem 2: periodic I/O phases • I/O frequency often configurable and determined at run time • Unified solution • Monitor per-timestep performance variance at runtime • Identify anomalies and repeated patterns • Filter out early, unstable timestep measurements • Consider only later results once performance stabilizes • Combine early timestep overheads into the initialization cost • Compute sliding-window averages of per-timestep overheads • Use multiples of the observed pattern length as the window size Supercomputing 2005
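A minimal sketch of this filtering idea on synthetic per-timestep times; the synthetic data, the warm-up cutoff, and the window policy below are illustrative assumptions, not the authors' exact algorithm.

    #include <stdio.h>

    #define NSTEPS 40

    int main(void) {
        /* synthetic per-timestep times: cache warm-up noise, then a repeating C,C,C,W pattern */
        double step_time[NSTEPS];
        for (int t = 0; t < NSTEPS; t++) {
            step_time[t] = 1.0;                            /* base compute cost */
            if (t < 5)      step_time[t] += (5 - t) * 0.4; /* early warm-up overhead */
            if (t % 4 == 3) step_time[t] += 2.0;           /* periodic I/O every 4th step */
        }

        int warmup      = 5;                /* early steps folded into initialization cost */
        int pattern_len = 4;                /* observed repeat length (C,C,C,W) */
        int window      = 2 * pattern_len;  /* window size = multiple of the pattern length */

        /* sliding-window averages over the stable region only; because the window is a
           multiple of the pattern length, every window sees the same C/W mix */
        for (int start = warmup; start + window <= NSTEPS; start++) {
            double sum = 0.0;
            for (int t = start; t < start + window; t++)
                sum += step_time[t];
            printf("steps %2d..%2d: avg %.3f s/timestep\n", start, start + window - 1, sum / window);
        }
        return 0;
    }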
Proof-of-concept experiments • Questions: • Is the relative performance observed in a very short early period indicative of overall relative performance? • Can we reuse partial execution data to predict executions with different configurations? • Experiment settings • Large-scale codes: • 2 ASCI Purple benchmarks (sphot and sPPM) • fusion code (Gyro) • rocket simulation (GENx) • Full runs take >5 hours • 10 supercomputers at SDSC, NCSA, ORNL, LLNL, UIUC, NCSU, NERSC • 7 architectures (SP3, SP4, Altix, Cray X1, and 3 clusters: G5, Xeon, Itanium) Supercomputing 2005
Base Model Accuracy (Sphot) • High accuracy with very short partial execution Supercomputing 2005
Refined Model (sPPM, Ram -> Henry2) • Issues: • Ram: initialization variance • Henry2: I/O on 1 in 10 timesteps • Smarter algorithms • Initialization filter • Sliding window • Handle anomalies and periodic I/O [Chart: normalized performance] Supercomputing 2005
Application with Variable Problem Size • GENx Rocket Simulation (CSAR, UIUC), Turing -> Frost • Limited accuracy with variable timesteps Supercomputing 2005
Reusing Partial Execution Data • Scientists often repeat runs with different configurations • Number of processors • Input size and data content • Computation tasks • Results from the Gyro fusion simulation on 5 platforms [Charts: avg. error 12.1% - 25.8%; avg. error 5.6% - 37.9%] Supercomputing 2005
Conclusion • Empirical performance prediction works! • Real-world production codes • Multiple parallel platforms • Highly accurate predictions • Limitations with • Variable problem sizes • Input-size / processor scaling • Observation-based prediction is • Simple • Portable • Low cost (a few timesteps) [Diagram: reference run of T = 20 hrs translated to target platforms with T = 1 hr, T = 2 hrs, T = 10 hrs] Supercomputing 2005
Related Work • Parallel program performance prediction • Application-specific analytical models • Compiler/instrumentation tools • Simulation-based predictions • Cross-platform performance studies • Mostly examine multiple platforms individually • Grid job schedulers • Do not offer cross-platform performance translation Supercomputing 2005
Ongoing and Future Work • Evaluate with AMR applications • Automated partial execution • Automatic computation phase identification • Binary rewriting to avoid source code modification • Extend to non-dedicated systems • For job schedulers Supercomputing 2005