High Performance Computing: Concepts, Methods & Means Performance I: Benchmarking Prof. Thomas Sterling Department of Computer Science Louisiana State University January 23rd, 2007
Topics • Definitions, properties and applications • Early benchmarks • Everything you ever wanted to know about Linpack (but were afraid to ask) • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Basic Performance Metrics • Time related: • Execution time [seconds] • wall clock time • system and user time • Latency • Response time • Rate related: • Rate of computation • floating point operations per second [flops] • integer operations per second [ops] • Data transfer (I/O) rate [bytes/second] • Effectiveness: • Efficiency [%] • Memory consumption [bytes] • Productivity [utility/($*second)] • Modifiers: • Sustained • Peak • Theoretical peak
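To connect the Effectiveness and Modifiers entries above, a small worked relation (the 4 and 10 Gflops figures below are purely illustrative and not from the slides):

\mathrm{efficiency} = \frac{R_{\mathrm{sustained}}}{R_{\mathrm{peak}}} \times 100\%, \qquad \text{e.g. } \frac{4\ \mathrm{Gflops}}{10\ \mathrm{Gflops}} \times 100\% = 40\%

where the sustained rate is what the workload actually achieves and the (theoretical) peak is the upper bound the hardware could deliver.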
What Is a Benchmark? Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster] • The term “benchmark” also commonly applies to the specially designed programs used in benchmarking • A benchmark should: • be domain specific (the more general the benchmark, the less useful it is for anything in particular) • be a distillation of the essential attributes of a workload • avoid using a single metric to express the overall performance • Kinds of computational benchmarks: • synthetic: specially created programs that impose a load on a specific component of the system • application: derived from a real-world application program
Purpose of Benchmarking • To define the playing field • To provide a tool enabling quantitative comparisons • Acceleration of progress • enable better engineering by defining measurable and repeatable objectives • Establishing a performance agenda • measure release-to-release or version-to-version progress • set goals to meet • be understandable and useful also to people without expertise in the field (managers, etc.)
Properties of a Good Benchmark • Relevance: meaningful within the target domain • Understandability • Good metric(s): linear, orthogonal, monotonic • Scalability: applicable to a broad spectrum of hardware/architecture • Coverage: does not over-constrain the typical environment • Acceptance: embraced by users and vendors • Has to enable comparative evaluation • Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD ‘97
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Early Benchmarks • Whetstone • Floating point intensive • Dhrystone • Integer and character string oriented • Livermore Fortran Kernels • “Livermore Loops” • Collection of short kernels • NAS kernel • 7 Fortran test kernels for aerospace computation The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Linpack Overview • Introduced by Jack Dongarra in 1979 • Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and G. W. Stewart (now superseded by the LAPACK library) • Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers • Provides an estimate of the system’s effective floating-point performance • Does not reflect the overall performance of the machine!
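The reported Mflops figure follows from the nominal operation count for factorizing and solving an n×n dense system, the convention used by Linpack-style benchmarks (the formula itself is not spelled out on the slide):

\mathrm{ops}(n) \approx \tfrac{2}{3}n^{3} + 2n^{2}, \qquad \mathrm{Mflops} = \frac{\mathrm{ops}(n)}{t \times 10^{6}}

where t is the measured wall-clock time in seconds for the factorization and solve.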
Linpack Benchmark Variants • Linpack Fortran (single processor) • N=100 • N=1000, TPP, best effort • Linpack’s Highly Parallel Computing benchmark (HPL) • Java Linpack
Linpack Fortran Performance on Different Platforms Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps
Fortran Linpack Demo

> ./linpack
Please send the results of this run to:
Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: dongarra@cs.utk.edu
This is version 29.5.04.

     norm. resid      resid           machep          x(1)            x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
     dgefa      dgesl      total     mflops       unit      ratio       b(1)
 times for array with leading dimension of 201
 4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03 -9.159E-15
 4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03  1.000E+00
 4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03  1.000E+00
 4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03  5.298E+02
 times for array with leading dimension of 200
 4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
 4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
 4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03  1.000E+00
 4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03  5.298E+02
end of tests -- this version dated 05/29/04

Column legend (slide callouts): dgefa = time spent in the matrix factorization routine; dgesl = time spent in the solver; total = total time (dgefa + dgesl); mflops = sustained floating point rate; unit = “timing” unit (obsolete); ratio = fraction of Cray-1S execution time (obsolete); b(1) = first element of the right hand side vector. Two different leading array dimensions (201 and 200) are used to test the effect of array placement in memory. Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
Linpack’s Highly Parallel Computing Benchmark (HPL) • Measures the performance of distributed memory machines • Used in the “Linpack Benchmark Report” (Table 3) and to determine the order of machines on the Top500 list • The portable version (written in C) • External dependencies: • MPI-1.1 functionality for inter-node communication • BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx+y) and inner (dot) product (DDOT: a = Σ xᵢyᵢ), both sketched below • Ground rules: • allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound) • the same matrix as in the driver program must be used • no restrictions on problem size
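For illustration, minimal unoptimized C versions of the two vector kernels named above; these are sketches of what the BLAS/VSIPL dependency supplies, not the library code itself:

/* DAXPY: y = alpha*x + y (double precision, unit stride) */
void daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* DDOT: a = sum over i of x[i]*y[i] (double precision inner product) */
double ddot(int n, const double *x, const double *y)
{
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += x[i] * y[i];
    return a;
}

A tuned BLAS replaces these loops with blocked, vectorized, and threaded code, which is where most of the measured HPL performance comes from.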
HPL Linpack Metrics • The HPL implementation of the benchmark is run for different problem sizes N on the entire machine • For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax • Another metric obtainable from the benchmark is N1/2, the problem size for which half of the maximum performance (Rmax/2) is achieved • The Rmax value is used to rank supercomputers in the Top500 list; listed along with this number are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2
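The theoretical peak Rpeak quoted alongside Rmax is not measured; it is computed from the hardware specification. A common rule-of-thumb estimate (an assumption of this note, not stated on the slide):

R_{\mathrm{peak}} = (\text{number of cores}) \times (\text{clock rate}) \times (\text{double precision flops per cycle per core})

The ratio Rmax/Rpeak is often quoted as the efficiency of the HPL run, and N1/2 is read off the measured performance curve as the smallest problem size at which the rate reaches Rmax/2.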
Ten Fastest Supercomputers On Current Top500 List Source: http://www.top500.org/list/2006/11/100
HPL Demo

> mpirun -np 4 xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:
N      : 5000
NB     : 32
PMAP   : Row-major process mapping
P      : 2 1 4
Q      : 2 4 1
PFACT  : Left
NBMIN  : 2
NDIV   : 2
RFACT  : Left
BCAST  : 1ringM
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
  1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
  2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V          N    NB    P    Q    Time    Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32    2    2    7.14    1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N )        = 0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )  = 0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0051580 ...... PASSED
============================================================================
T/V          N    NB    P    Q    Time    Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32    1    4    7.00    1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N )        = 0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )  = 0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0043224 ...... PASSED
============================================================================
T/V          N    NB    P    Q    Time    Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32    4    1    7.00    1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N )        = 0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )  = 0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0054928 ...... PASSED
============================================================================
Finished 3 tests with the following results:
  3 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html
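Written out more readably, the three scaled residuals checked in the output above are (ε denotes the relative machine precision; each value must fall below the 16.0 threshold for the run to pass):

r_1 = \frac{\|Ax-b\|_{\infty}}{\varepsilon\,\|A\|_{1}\,N}, \qquad r_2 = \frac{\|Ax-b\|_{\infty}}{\varepsilon\,\|A\|_{1}\,\|x\|_{1}}, \qquad r_3 = \frac{\|Ax-b\|_{\infty}}{\varepsilon\,\|A\|_{\infty}\,\|x\|_{\infty}}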
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Other Parallel Benchmarks • High Performance Computing Challenge (HPCC) benchmarks • Devised and sponsored to enrich the benchmarking parameter set • NAS Parallel Benchmarks (NPB) • Powerful set of metrics • Reflects computational fluid dynamics • NPBIO-MPI • Stresses external I/O system
HPC Challenge Benchmark Consists of 7 individual tests: • HPL (Linpack TPP): floating point rate of execution of a solver of a dense linear system of equations • DGEMM: floating point rate of execution of double precision matrix-matrix multiplication • STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for a simple vector kernel (see the sketch after this list) • PTRANS (parallel matrix transpose): total communication capacity of the network, using pairwise communicating processes • RandomAccess: the rate of random integer updates of memory (in GUPS: Giga-Updates Per Second) • FFT: floating point rate of execution of double precision complex 1-D Discrete Fourier Transform • b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
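To make the STREAM entry concrete, here is a minimal single-threaded triad kernel with its bandwidth calculation. This is a hypothetical sketch in the spirit of the benchmark; the real STREAM code repeats the kernel, validates results, and reports the best of several trials.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* wall-clock time in seconds */
static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    const long n = 20000000;                 /* array length (illustrative) */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double q = 3.0;
    double t0 = wtime();
    for (long i = 0; i < n; i++)             /* triad: a = b + q*c */
        a[i] = b[i] + q * c[i];
    double t1 = wtime();

    /* the kernel touches three arrays of 8-byte doubles: 2 reads + 1 write */
    double gbytes = 3.0 * 8.0 * (double)n / 1.0e9;
    printf("triad bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
    printf("a[n-1] = %f (printed so the compiler cannot drop the loop)\n", a[n - 1]);

    free(a); free(b); free(c);
    return 0;
}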
Comparison of HPCC Results on Selected Supercomputers • Notes: • all metrics shown are “higher-better”, except for the Random Ring Latency • machine labels include: machine name (optional), manufacturer and system name, affiliation, and (in parentheses) processor/network fabric type
NAS Parallel Benchmarks • Derived from computational fluid dynamics (CFD) applications • Consist of five kernels and three pseudo-applications • Exist in several flavors: • NPB 1: original paper-and-pencil specification • generally proprietary implementations by hardware vendors • NPB 2: MPI-based sources distributed by NAS • supplements NPB 1 • can be run with little or no tuning • NPB 3: implementations in OpenMP, HPF and Java • derived from the NPB-serial version with improved serial code • a set of multi-zone benchmarks was added • tests the implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP combined with MPI) • GridNPB 3: new suite of benchmarks designed to rate the performance of computational grids • includes only four benchmarks, derived from the original NPB • written in Fortran and Java • uses Globus as grid middleware
NPB 2 Overview • Multiple problem classes (S, W, A, B, C, D) • Tests written mainly in Fortran (IS in C): • BT (block tri-diagonal solver with 5x5 block size) • CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix) • EP (“embarrassingly parallel”; evaluates an integral by means of pseudorandom trials; see the sketch after this list) • FT (3-D PDE solver using Fast Fourier Transforms) • IS (large integer sort; tests both integer computation speed and network performance) • LU (a regular-sparse, 5x5 block lower and upper triangular system solver) • MG (simplified multigrid kernel; tests both short and long distance data communication) • SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations) • Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html
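As a flavor of what the EP kernel does, a simplified C sketch: pairs of uniform pseudorandom deviates are drawn, pairs falling inside the unit circle are transformed into Gaussian deviates, and their sums are accumulated. This is an illustrative stand-in only; the official NPB EP code prescribes a specific linear congruential generator, per-annulus deviate counts, and verification values, none of which are reproduced here.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const long n = 1L << 20;      /* number of pairs to attempt (illustrative) */
    long accepted = 0;
    double sx = 0.0, sy = 0.0;    /* sums of the generated Gaussian deviates */

    srand(12345);                 /* arbitrary seed; not the NPB generator */
    for (long i = 0; i < n; i++) {
        double x = 2.0 * rand() / (double)RAND_MAX - 1.0;
        double y = 2.0 * rand() / (double)RAND_MAX - 1.0;
        double t = x * x + y * y;
        if (t <= 1.0 && t > 0.0) {                 /* keep pairs inside the unit circle */
            double f = sqrt(-2.0 * log(t) / t);    /* Marsaglia polar transform */
            sx += x * f;
            sy += y * f;
            accepted++;
        }
    }
    printf("pairs accepted: %ld  sums: %.6f %.6f\n", accepted, sx, sy);
    return 0;
}

Because each trial is independent, the work parallelizes with essentially no communication, which is why this kernel serves as the "embarrassingly parallel" baseline of the suite.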
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Benchmarking Organizations • SPEC (Standard Performance Evaluation Corporation) • Created to satisfy the need for realistic, fair and standardized performance tests • Motto: “An ounce of honest data is worth more than a pound of marketing hype” • TPC (Transaction Processing Performance Council) • Formed primarily due to the lack of reliable database benchmarks
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Presentation of the Results • Tables • Graphs • Bar graphs (a) • Scatter plots (b) • Line plots (c) • Pie charts (d) • Gantt charts (e) • Kiviat graphs (f) • Enhancements • Error bars, boxes or confidence intervals • Broken or offset scales (be careful!) • Multiple curves per graph (but avoid overloading) • Data labels, colors, etc.
Kiviat Graph Example Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml
Mixed Graph Example (figure) Characterization of NSF/CCT parallel applications (WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D) on the POWER5 architecture, showing computation and communication fractions and the breakdown into floating point, load/store, and other operations (using data collected by IPM)
Graph Do’s and Don’ts • Good graphs: • Require minimum effort from the reader • Maximize information • Maximize information-to-ink ratio • Use commonly accepted practices • Avoid ambiguity • Poor graphs: • Have too many alternatives on a single chart • Display too many y-variables on a single chart • Use vague symbols in place of text • Show extraneous information • Select scale ranges improperly • Use a line chart instead of a bar graph Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10
Common Mistakes in Benchmarking From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain: • Only average behavior represented in test workload • Skewness of device demands ignored • Loading level controlled inappropriately • Caching effects ignored • Buffering sizes not appropriate • Inaccuracies due to sampling ignored • Ignoring monitoring overhead • Not validating measurements • Not ensuring same initial conditions • Not measuring transient performance • Using device utilizations for performance comparisons • Collecting too much data but doing very little analysis
Misrepresentation of Performance Results on Parallel Computers • Quote only 32-bit performance results, not 64-bit results • Present performance for an inner kernel, representing it as the performance of the entire application • Quietly employ assembly code and other low-level constructs • Scale problem size with the number of processors, but omit any mention of this fact • Quote performance results projected to the full system • Compare your results with scalar, unoptimized code run on another platform • When direct run time comparisons are required, compare with an old code on an obsolete system • If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation • Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar • Mutilate the algorithm used in the parallel implementation to match the architecture • Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment • If all else fails, show pretty pictures and animated videos, and don't talk about performance Reference: David Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
Definitions, properties and applications • Early benchmarks • Linpack • Other parallel benchmarks • Organized benchmarking • Presentation and interpretation of results • Summary
Knowledge Factors & Skills • Knowledge factors: • benchmarking and metrics • performance factors • Top500 list • Skill set: • determine state of system resources and manipulate them • acquire, run and measure benchmark performance • launch user application codes
Material For Test Basic performance metrics (slide 4) Definition of benchmark in own words; purpose of benchmarking; properties of good benchmark (slides 5, 6, 7) Linpack: what it is, what does it measure, concepts and complexities (slides 15, 17, 18) HPL: (slides 21 and 24) Linpack compare and contrast (slide 25) General knowledge about HPCC and NPB suites (slides 31 and 34) Benchmark result interpretation (slides 49, 50)