Benchmarks for Parallel Systems
Sources/Credits:
• "Performance of Various Computers Using Standard Linear Equations Software", Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
• Top500: http://www.top500.org ; FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
• Courtesy: Jack Dongarra (Top500), http://www.top500.org
• "The LINPACK Benchmark: Past, Present, and Future", Jack Dongarra, Piotr Luszczek, and Antoine Petitet
• NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
LINPACK (Dongarra, 1979) • Dense system of linear equations • Originally distributed as part of the user's guide for the LINPACK package (1979) • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark
LINPACK benchmark • Implemented on top of BLAS1 • Two main operations – DGEFA (LU factorization by Gaussian elimination, O(n³)) and DGESL (solution of Ax = b using the factors, O(n²)) • Major operation (about 97% of the time) – DAXPY: y = y + α·x (see the sketch below) • DAXPY accounts for roughly n³/3 + n² multiply-add pairs, hence about 2n³/3 + 2n² flops • 64-bit floating-point arithmetic
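For illustration, a minimal C sketch of the DAXPY kernel described above (not the reference BLAS routine); each element update is one multiply plus one add, i.e. 2 flops:

/* Minimal DAXPY sketch: y = y + alpha * x in 64-bit floating point.
   Illustrative only; the benchmark calls the BLAS1 routine DAXPY. */
void daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];   /* 1 multiply + 1 add = 2 flops per element */
}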
LINPACK • N=100: a 100x100 system of equations. No changes to the code allowed; the user only supplies a timing routine called SECOND, so the only tuning comes from the compiler • N=1000: a 1000x1000 system – the user may implement any algorithm, as long as it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always charges 2n³/3 + 2n² operations when computing the rate • "Highly Parallel Computing" benchmark – any software may be used and the matrix size can be chosen. Used in the Top500 • All variants use 64-bit floating-point arithmetic
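Because the driver always charges 2n³/3 + 2n² operations, the reported rate follows directly from the measured time. A minimal sketch in C (the problem size and timing in main are purely hypothetical):

/* Sketch: LINPACK rate from the fixed operation count 2n^3/3 + 2n^2,
   regardless of the algorithm actually used by the solver. */
#include <stdio.h>

double linpack_gflops(int n, double seconds)
{
    double nn = (double)n;
    double flops = 2.0 * nn * nn * nn / 3.0 + 2.0 * nn * nn;
    return flops / seconds / 1e9;
}

int main(void)
{
    /* hypothetical example: an N=1000 solve measured at 0.05 s */
    printf("%.2f Gflop/s\n", linpack_gflops(1000, 0.05));
    return 0;
}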
LINPACK • 100x100 – inner loop optimization • 1000x1000 – three-loop/whole program optimization • Scalable parallel program – Largest problem that can fit in memory
HPL Algorithm • 2-D block-cyclic data distribution (see the sketch below) • Right-looking LU factorization • Panel factorization: various options – Crout, left-looking, or right-looking recursive variants based on matrix multiply – number of sub-panels – recursive stopping criterion – pivot search and broadcast by binary exchange
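A minimal sketch of the 2-D block-cyclic ownership rule, assuming a P x Q process grid and block size nb (the names here are illustrative, not HPL's actual API):

/* Block-cyclic mapping sketch: matrix element (i, j) lies in global block
   (i / nb, j / nb); that block is owned by process (I mod P, J mod Q)
   in a P x Q process grid. Illustrative only. */
typedef struct { int prow; int pcol; } proc_coord;

proc_coord owner_of_element(int i, int j, int nb, int P, int Q)
{
    proc_coord c;
    c.prow = (i / nb) % P;   /* process row owning block row i/nb    */
    c.pcol = (j / nb) % Q;   /* process column owning block col j/nb */
    return c;
}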
HPL algorithm • Panel broadcast: various options • Update of trailing matrix: look-ahead pipeline • Validity check: scaled residuals should be O(1)
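As an illustration of the validity check (hedged: the exact normalization varies across HPL versions), one scaled residual of the kind HPL reports is
\[ r_\infty = \frac{\lVert A x - b \rVert_\infty}{\varepsilon\,\lVert A \rVert_\infty\,\lVert x \rVert_\infty\, n} \]
where \(\varepsilon\) is the machine epsilon; the run is accepted when such residuals are O(1).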
Top500 (www.top500.org) • Top500 list started in 1993 • Published twice a year – June and November • For each system it reports Nmax (problem size that achieves the best performance), Rmax (maximal LINPACK performance achieved), N1/2 (problem size needed to reach half of Rmax), and Rpeak (theoretical peak performance)
NAS Parallel Benchmarks – NPB • Also used for evaluation of supercomputers • A set of 8 programs derived from CFD applications • 5 kernels, 3 pseudo-applications • NPB 1 – original benchmarks • NPB 2 – NAS's MPI implementation; NPB 2.4 Class D has more work and more I/O • NPB 3 – implementations based on OpenMP, HPF, and Java • GridNPB3 – for computational grids • NPB 3 multi-zone – for hybrid parallelism
NPB 1.0 (March 1994) • Defines Class A and Class B problem sizes • "Paper and pencil" algorithmic specifications • Generic benchmarks, as compared to the MPI-based LINPACK • General rules for implementations – Fortran 90 or C, 64-bit arithmetic, etc. • Sample implementations provided
Kernel Benchmarks • EP – embarrassingly parallel; almost no communication (a minimal sketch of its communication pattern follows this list) • MG – multigrid; regular communication • CG – conjugate gradient; irregular long-distance communication • FT – a 3-D PDE solved using FFTs; rigorous test of long-distance communication • IS – large integer sort • Detailed rules regarding: – brief statement of the problem – algorithm to be used – validation of results – where to insert timing calls – method for generating random numbers – submission of results
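A hedged MPI sketch of the communication pattern EP exercises (not the NPB EP code): each rank does fully independent work, and the only communication is a single reduction at the end.

/* Hedged sketch of an embarrassingly parallel pattern like EP's:
   independent per-rank work, then one reduction. Not the NPB code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Independent work: a dummy local accumulation (the real EP benchmark
       counts Gaussian pairs produced by a prescribed random-number scheme). */
    srand((unsigned)(rank + 1));
    for (long i = 0; i < 1000000; i++)
        local += (double)rand() / RAND_MAX;

    /* The only communication in the whole run: one global reduction. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}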
Pseudo-applications / Synthetic CFD codes • Benchmark 1 – perform a few iterations of the approximate factorization algorithm, solving block-tridiagonal systems (BT) • Benchmark 2 – perform a few iterations of the diagonal (scalar pentadiagonal) form of the approximate factorization algorithm (SP) • Benchmark 3 – perform a few iterations of SSOR (LU)
Class A and Class B • Table of problem sizes comparing the Sample Code, Class A, and Class B versions (table omitted)
NPB 2.0 (1995) • MPI and Fortran 77 implementations • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT) • Class C – a bigger problem size • Benchmark rules – results are reported by how much of the source code was changed: 0%, up to 5%, or more than 5%
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003) • EP and IS added • FT rewritten • NPB 2.4 – Class D and a rationale for the Class D sizes • 2.4 I/O – a new benchmark problem based on BT (BTIO) to test output capabilities • An MPI-IO implementation of the same, with different options such as whether or not to use collective buffering (a hedged sketch of that choice follows)
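A hedged MPI-IO sketch of the choice mentioned above (not the BTIO code): a collective write, which lets the MPI library apply collective buffering, versus an independent write of the same per-rank block.

/* Hedged MPI-IO sketch: each rank writes a contiguous block of doubles,
   either collectively (enabling collective buffering) or independently.
   Function and file names are illustrative, not part of BTIO. */
#include <mpi.h>

void write_block(MPI_Comm comm, const char *path,
                 const double *buf, int count, int use_collective)
{
    MPI_File fh;
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Each rank writes its block at a disjoint offset. */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);

    if (use_collective)
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
    else
        MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}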