Towards More Meaningful Machine Comparisons Dr. Allan Snavely PMaC (Performance Modeling & Characterization) Group Leader www.sdsc.edu/PMaC SDSC
PMaC Mission • To bring scientific rigor to the art of performance prediction • for procurement • for architectural tradeoffs • for guiding applications to the best-suited machine • for performance tuning
PMaC Mission • To bridge the gap between benchmarks and cycle-accurate simulation • Benchmarks have dubious relevancy to real apps, particularly on future machines • Cycle-accurate simulations take too long
Projects • MAPS (Memory Access Patterns) • memory subsystem & interconnect signatures • MetaSim • an on-the-fly simulator for playing “what if?” (4 orders of magnitude faster than cycle-accurate simulation) • Pseudocode Cache Simulator • Scientific Application Loop Set • Terascale Application Information • IDC HPC List
People • Dr. Allan Snavely, Group Leader • Dr. Laura Carrington, Xiaofeng Gao (MAPS) • Dr. Stuart Johnson (Pseudocode simulator) • Dr. Larry Carter (senior technical advisor) • Dr. Wayne Pfeiffer (Scientific Application Loop Set) • Nicole Wolter (Paraver/Dimemas) • Dr. Bob Leary (resident mathematician)
What’s wrong with benchmarks? • May anti-correlate with actual performance [1]
[1] John L. Gustafson and Rajat Todi, "Conventional Benchmarks as a Sample of the Performance Spectrum," Ames Laboratory, USDOE
PMaC Methods • Performance modeling via separation of concerns • Machine signatures • Application profiles • Convolution methods
Memory hierarchy parameters • L1: 8192 word, 128 way, 16 block • TLB: 131072 word, 4KB pages, 2 way • L2: 1048576 word, 4 way, 16 block
MAPS • Useful in its own right for more meaningful machine comparisons at a glance • Work going forward to port to Compaq TCS1, SX-5, T90, SV1, MTA, Sun HPC 10K, Origin, others? • Provides input to MetaSim (next)
Meta-Sim • Takes 2 inputs • a program • a description of a machine • Consumes instrumented trace data “on-the-fly” • 100-fold slowdown (as opposed to 1M-fold!) • Performs an automated predictive convolution
Meta-Sim • Models caches and TLB • any number of levels • arbitrary sizes, line lengths, associativities • Does accounting at the basic-block level • Looks for memory access patterns
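To make the idea of a configurable cache model concrete, here is a minimal sketch of a multi-level, set-associative cache simulator driven by an address trace. It is not Meta-Sim's implementation: the class and function names are hypothetical, LRU replacement is an assumption (the slides do not state a policy), and the sizes used in the usage note are illustrative only.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """One cache level with LRU replacement (assumed policy); sizes in words."""

    def __init__(self, size_words, ways, block_words):
        self.block_words = block_words
        self.ways = ways
        self.num_sets = size_words // (ways * block_words)
        # Each set maps block number -> True, ordered from least to most recent.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, word_addr):
        """Return True on a hit; on a miss, load the block and return False."""
        block = word_addr // self.block_words
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)     # evict the least recently used block
        lines[block] = True
        return False


def simulate(levels, addresses):
    """Send each address down the hierarchy until some level hits."""
    for addr in addresses:
        for cache in levels:
            if cache.access(addr):
                break
    return [(c.hits, c.misses) for c in levels]
```

For example, `simulate([SetAssociativeCache(32, 1, 16), SetAssociativeCache(128, 2, 16)], list(range(64)) * 2)` walks a 64-word array twice through a small direct-mapped first level and a larger two-way second level, and returns per-level hit and miss counts that could be tallied per basic block.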
A (simplistic) Convolution
$$\mathrm{MFLOPS} = \sum_{i=1}^{n} \mathrm{Wt.\,BB}_i \times \mathrm{Rate\,BB}_i \times \mathrm{Intensity\,BB}_i$$
• Wt. BB_i = % of total memory references • Rate BB_i = sustained rate of memory references • Intensity BB_i = ratio of floating point ops to memory ops
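The convolution above can be sketched in a few lines of code. The `BasicBlock` type, the `predict_mflops` function, and the sample numbers are hypothetical and chosen only for illustration; the weight is expressed as a fraction (0 to 1) rather than a percentage, and the rate is assumed to be in millions of memory references per second so that the result comes out in MFLOPS.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    weight: float     # fraction of the program's total memory references (0..1)
    rate: float       # sustained memory references per second, in millions
    intensity: float  # floating-point ops per memory op


def predict_mflops(blocks):
    """Sum weight * rate * intensity over all basic blocks."""
    return sum(b.weight * b.rate * b.intensity for b in blocks)


# Illustrative, made-up profile of three basic blocks.
profile = [
    BasicBlock(weight=0.5, rate=400.0, intensity=0.5),  # cache-friendly loop
    BasicBlock(weight=0.3, rate=100.0, intensity=1.0),  # strided access
    BasicBlock(weight=0.2, rate=20.0, intensity=2.0),   # random access
]
print(predict_mflops(profile))  # roughly 138 (100 + 30 + 8)
```

In this separation of concerns, the weights and intensities would come from the application profile and the rates from the machine signature, so swapping in another machine's rates gives a prediction for that machine without re-profiling the application.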
How to determine the rate of memory access for a BB? • sum = sum + a(k)*b(colidx(k)) • Even if only 33% of the memory references in a BB fall through to main memory (MM), they may slow the whole BB down to the speed of MM accesses • Why?
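One way to see the effect, under the assumption that references in a block do not overlap so their service times simply add: the block's effective rate is a time-weighted (harmonic) mean of the per-level rates, not an arithmetic mean, so the slowest level dominates. The function name and the rates below are hypothetical and purely illustrative.

```python
def effective_rate(fractions_and_rates):
    """Harmonic-mean rate for references served by several memory levels.

    Each pair is (fraction of references, rate of that level); the result
    is in the same unit as the input rates. Assumes no overlap of accesses.
    """
    time_per_ref = sum(fraction / rate for fraction, rate in fractions_and_rates)
    return 1.0 / time_per_ref


# Made-up rates in millions of references per second.
cache_rate, mm_rate = 1000.0, 10.0
mixed = effective_rate([(2 / 3, cache_rate), (1 / 3, mm_rate)])
naive = (2 / 3) * cache_rate + (1 / 3) * mm_rate

print(round(mixed, 1))  # about 29.4: close to the main-memory bound of 30
print(round(naive, 1))  # about 670: the arithmetic mean is wildly optimistic
```

With these numbers the one-third of references that reach main memory account for almost all of the block's time, which is why the whole block runs at close to main-memory speed.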
Occam’s Razor • Only add complexity if required to explain observed phenomena • Observation: this approach is just as accurate as SMTSIM (Tullsen, Snavely, et al.) but 4 orders of magnitude faster!
Conventional Benchmarks as a Sample of the Performance Spectrum
Work going forward • Development of probes à la MAPS for floating point and integer functional unit issue, logical operations, I/O • Increase sophistication of convolutions as required to fit observed facts • Big goal: a robust set of metrics and methods for performance modeling and characterization