180 likes | 246 Views
Optimum System Balance for Systems of Finite Price. John D. McCalpin, Ph.D. IBM Corporation Austin, TX. SuperComputing 2004 Pittsburgh, PA November 10, 2004. Overview. The HPC Challenge Benchmark was announced last year at SuperComputing’2003 The HPC Challenge Benchmark consists of
E N D
Optimum System Balance for Systems of Finite Price John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2004 Pittsburgh, PA November 10, 2004
Overview • The HPC Challenge Benchmark was announced last year at SuperComputing’2003 • The HPC Challenge Benchmark consists of • LINPACK (HPL) • STREAM • PTRANS (transposing the array used by HPL) • RandomAccess (random read/modify/write) • FFT • and some low-level MPI latency & BW measurements • No single figure of merit is defined
Overview (continued) • Q: What is a “balanced” system? • My answer:“A balanced system is one for which the primary applications are limited in performance by the most expensive component(s) of the system.”
The Two Questions • We need to understand both performance and cost in the context of low-level component metrics • Performance • What performance model? • Use a harmonically weighted, time-based model • Cost • What cost model? • Simple linear additive cost model
Performance Model • Composite Figures of Merit for Performance must be based on “time” rather than “rate” • i.e., weighted harmonic means of rates • Why? • Combining “rates” in any other way fails to have a “Law of Diminishing Returns” • Time = Work/Rate • Repeat for each component: Ti = Wi/Ri
A Simple Composite Model • Assume the time to solution is composed of a compute time proportional to peak GFLOPS plus a memory transfer time proportional to sustained memory bandwidth • Assume “x Bytes/FLOP” to get: • Target SPECfp_rate2000 as the workload
Optimized Model Results • Results rounded to nearby round values: • Bytes/FLOP for large caches === 0.16 • Bytes/FLOP for small caches === 0.80 • Size of asymptotically large cache === ~12 MB • Coefficient of best fit === ~6.4 • The units of the coefficient are SPECfp_rate2000 / Effective GFLOPS
Cost Model • Assume simple linear additive model • FLOPS cost some amount • Sustained BW costs a different amount • Define:b = Rmem / Rcpug = Wmem / Wcpud = ($/BW) / ($/FLOPS) System characteristics Application characteristics Technology characteristics
How to Optimize? • For a given application g (Wmem/Wcpu), what is the optimum system balance b? • Many people seem to believe that the system should be “balanced” to the application: • boptimal = g i.e., • Rmem/Rcpu = Wmem/Wcpu • This does not optimize price/performance
The Correct Optimization • This is actually an easy optimization problem • Minimize cost/performance • Same as minimizing cost * time • Optimum cost/performance occurs at • b = sqrt(g/d) • Definitely not intuitive!
Example: High BW, expensive BW g = 3 relatively high BW d = 3 relatively expensive BW Optimum price/performance is at b=1, not b=3 Improvement in price/performance is ~30%
High BW, very expensive memory g = 3 relatively high BW d = 10 very expensive BW Optimum price/performance is at b=0.58, not b=3 Improvement in price/performance is ~50%
Low-BW, expensive BW g = 0.1 low BW application d = 3 moderately expensive BW Optimum price/performance is at b=0.18, not b=0.1 Improvement in price/performance is ~5% More BW helps here even though it is expensive, because the application does not need much.
Medium BW, expensive BW g = 1 modest BW d = 3 moderately expensive BW Optimum price/performance is at b=0.58, not b=1 Improvement in price/performance is ~10%
Summary • Balance is important to cost/performance • You must understand performance • You must understand cost • Optimum cost-performance is not intuitive!