Tuning LINPACK NxN for HP Platforms
Hsin-Ying Lin [lin@rsn.hp.com], Piotr Luszczek [luszczek@utk.edu]
MLIB team/HEPS/SCL/TCD, Hewlett Packard Company
HiPer’01, Bremen, Germany, October 8, 2001
Technical Systems Division * Scalable Computing Lab
Hsin-Ying Lin lin@rsn.hp.com (T)(972)497-4897 hiper01.ppt Printed:8/17/2014 8:31:39 AM
Why tune LINPACK NxN
• Customers use the TOP500 list as one of the criteria to purchase machines
• HP wants to increase the number of computers on the TOP500 list and to help demonstrate HP’s commitment to high performance computing
• See http://www.top500.org/
What is LINPACK NxN
• LINPACK NxN benchmark
  • Solves a system of linear equations by some method
  • Allows the vendors to choose the size of the problem for the benchmark
  • Measures execution time for each problem size
• LINPACK NxN report
  • Nmax – the size of the chosen problem run on a machine
  • Rmax – the performance in Gflop/s for the chosen problem size run on the machine
  • N1/2 – the problem size at which half the Rmax execution rate is achieved
  • Rpeak – the theoretical peak performance in Gflop/s for the machine
• LINPACK NxN is used to rank the TOP500 fastest computers in the world
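The report quantities above can be derived from a series of timed runs. The sketch below is illustrative only: it assumes the conventional LINPACK operation count of 2/3·n³ + 2·n² for an n-by-n solve (not stated on the slides), and the function names and input format are hypothetical.

```python
def linpack_flops(n):
    """Assumed conventional LINPACK operation count for an n-by-n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n, seconds):
    """Execution rate in Gflop/s for one timed run of size n."""
    return linpack_flops(n) / seconds / 1e9


def summarize(runs, rpeak_gflops):
    """Summarize timed runs given as (n, seconds) pairs.

    Nmax is the size that gave the best rate, Rmax is that rate, and
    N1/2 is approximated by the smallest measured size that reaches at
    least half of Rmax (a real report would interpolate or run more sizes).
    """
    rates = [(n, gflops(n, t)) for n, t in runs]
    nmax, rmax = max(rates, key=lambda r: r[1])
    nhalf = min(n for n, r in rates if r >= rmax / 2)
    return {
        "nmax": nmax,
        "rmax": rmax,
        "nhalf": nhalf,
        "efficiency": rmax / rpeak_gflops,  # Rmax as a fraction of Rpeak
    }
```

Comparing Rmax with Rpeak shows what fraction of the theoretical peak the tuned benchmark actually reaches.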
TOP500 – Past, Present, and Future
• June 2000 – 47 HP systems
  • Cut-off: 43.82 Gflop/s (performance of the 500th computer)
• November 2000 – 5 HP systems
  • Cut-off: 55.1 Gflop/s (26% increase from June 2000)
• June 2001 – 41 HP systems
  • Cut-off: 67.78 Gflop/s (23% increase from November 2000)
• November 2001 – ??? HP systems
  • Cut-off: 83-92 Gflop/s (23-36% estimated increase from June 2001)
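The percentages and the projected range on this slide follow from simple arithmetic on the quoted cut-offs. A minimal check, using only the figures above (variable names are illustrative):

```python
# Cut-off values (Gflop/s) as quoted on the slide.
cutoffs = [("June 2000", 43.82), ("November 2000", 55.1), ("June 2001", 67.78)]


def growth_percent(old, new):
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


# List-over-list growth, rounded to whole percent.
growth = [round(growth_percent(a[1], b[1])) for a, b in zip(cutoffs, cutoffs[1:])]

# Projected November 2001 cut-off for the 23% and 36% growth estimates.
projected = [round(67.78 * (1 + p / 100.0)) for p in (23, 36)]

print(growth)     # [26, 23]
print(projected)  # [83, 92]
```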
HP list in TOP500 (June 2001)
HP’s TOP500 Status and Goals
• About 30 systems missed the 55.1 Gflop/s entry threshold by 1 Gflop/s on Nov. 1, 2000
  Goal for Nov. 1, 2001: Ensure all 64-CPU Superdome systems are listed in the TOP500
• Lack of excellent MPI-based LINPACK NxN algorithms despite relatively good single-node LINPACK NxN performance
  Goal for Nov. 1, 2001: Develop a more scalable algorithm for multiple-node systems
The Road to a Highly Scalable LINPACK NxN Algorithm
Studied the public domain software HPL (High Performance LINPACK benchmark).
Q: Why HPL?
A: Other vendors use HPL for their LINPACK NxN benchmark and show good scalability.
See: http://www.netlib.org/benchmark/hpl
HPL (High Performance LINPACK)
• MPI implementation of the LINPACK NxN benchmark
• Algorithm keywords
  • One- and two-dimensional block-cyclic data distribution
  • Right-looking variant of the LU factorization
  • Row partial pivoting
  • Multiple look-ahead depths
  • Recursive panel factorization
• Highly tunable (matrix dimension, blocking factor, grid topology, broadcast/factorization algorithms, data alignment)
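As an illustration of the block-cyclic distribution named above, the following sketch maps a global matrix entry to the process that owns it. It is not HPL code; the function and parameter names (`nb` for the blocking factor, `p` and `q` for the process-grid dimensions) are assumptions chosen for the example.

```python
def block_cyclic_owner(i, j, nb, p, q):
    """Return the (grid_row, grid_col) of the process owning entry (i, j).

    The matrix is cut into nb-by-nb blocks, and block (bi, bj) is dealt
    out cyclically to process (bi mod p, bj mod q) on a p-by-q grid.
    """
    return ((i // nb) % p, (j // nb) % q)


def ownership_map(n, nb, p, q):
    """Owner of every nb-by-nb block of an n-by-n matrix, as a 2D list."""
    nblocks = (n + nb - 1) // nb  # ceiling division
    return [
        [block_cyclic_owner(bi * nb, bj * nb, nb, p, q) for bj in range(nblocks)]
        for bi in range(nblocks)
    ]
```

Setting `p = 1` reduces this to the one-dimensional (column-panel) distribution; larger blocking factors trade load balance for fewer, larger messages.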
HPL (High Performance LINPACK)
HPL solves a linear system of order n of the form: A x = b
• Compute the LU factorization with partial pivoting of the n-by-(n+1) matrix: [A,b] = [[L,U],y]
• Since the lower triangular factor L is applied to b as the factorization progresses, the solution x is obtained by solving the upper triangular system: Ux = y
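The idea of factoring the augmented matrix [A,b] and then solving Ux = y can be shown on a small serial example. This is a plain, unblocked sketch in pure Python, not HPL's blocked parallel algorithm; the function name is hypothetical, and the multipliers that form L are applied and then discarded.

```python
def solve_augmented(a, b):
    """Solve A x = b by Gaussian elimination on the augmented matrix [A, b]."""
    n = len(a)
    # Build the n-by-(n+1) augmented matrix [A, b].
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]

    for k in range(n):
        # Row partial pivoting: bring the largest |entry| of column k to row k.
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[piv][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[piv] = m[piv], m[k]

        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            # The last column (b) is updated with the same multiplier,
            # so it becomes y as the factorization progresses.
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]

    # Back substitution on the upper triangular system U x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

For example, `solve_augmented([[2, 1], [1, 3]], [3, 5])` solves 2x + y = 3, x + 3y = 5.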
Caveats of HPL
• The lower triangular matrix L is left unpivoted and the array of pivots is not returned.
• Array b is part of matrix A.
• These imply that HPL is not general-purpose LU factorization software and cannot be used to solve for multiple right-hand sides simultaneously.
Cyclic 1D division of matrix into 8 panels – with 4 processors
Panel:  0  1  2  3  4  5  6  7
Owner:  P0 P1 P2 P3 P0 P1 P2 P3
Factor panel 0
Update panels 1-7 using panel 0
Factor panel 1
Update panels 2-7 using panel 1
Factor panel 2
. . .
Factor panel 7
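The factor/update sequence on this slide can be generated mechanically. The sketch below only enumerates the order of operations and which processor owns each panel under a cyclic assignment; it performs no numerical work, and the names are illustrative.

```python
def panel_owner(panel, nprocs):
    """Cyclic 1D distribution: panel k belongs to processor k mod nprocs."""
    return panel % nprocs


def right_looking_schedule(npanels, nprocs):
    """List the steps of a right-looking factorization over column panels.

    Each step is (operation, target_panel, source_panel, owner_of_target).
    After panel k is factored, every panel to its right is updated with it.
    """
    steps = []
    for k in range(npanels):
        steps.append(("factor", k, None, panel_owner(k, nprocs)))
        for j in range(k + 1, npanels):
            steps.append(("update", j, k, panel_owner(j, nprocs)))
    return steps


if __name__ == "__main__":
    for op, target, source, owner in right_looking_schedule(8, 4):
        if op == "factor":
            print(f"P{owner}: factor panel {target}")
        else:
            print(f"P{owner}: update panel {target} using panel {source}")
```

In this ordering the owner of panel k+1 cannot begin factoring until its update is done, which is the dependency the look-ahead variant reorders.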
Look-Ahead Algorithm
Panel:  0  1  2  3  4  5  6  7
Owner:  P0 P1 P2 P3 P0 P1 P2 P3
Factor panel 0
Update panel 1 using panel 0
Factor panel 1
Mark panel 1 as factored
Update panel 5 using panel 0
Update panel 5 using panel 1
. . .
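One way to read the ordering on this slide is as a depth-1 look-ahead: the next panel is updated and factored first, and the remaining updates with the current panel follow. The sketch below generates that ordering and filters it for one processor; it is an interpretation of the slide under a cyclic panel assignment, not HPL's implementation, and the function names are hypothetical.

```python
def lookahead_schedule(npanels):
    """Operation order for a depth-1 look-ahead over column panels.

    Each step is (operation, target_panel, source_panel).
    """
    steps = [("factor", 0, None)]
    for k in range(npanels - 1):
        nxt = k + 1
        # Critical path first: bring the next panel up to date and factor it.
        steps.append(("update", nxt, k))
        steps.append(("factor", nxt, None))
        # Then apply panel k to the rest of the trailing panels.
        for j in range(nxt + 1, npanels):
            steps.append(("update", j, k))
    return steps


def steps_for(steps, rank, nprocs):
    """Steps whose target panel is owned by `rank` under cyclic assignment."""
    return [s for s in steps if s[1] % nprocs == rank]


if __name__ == "__main__":
    # View from P1, which owns panels 1 and 5 when 8 panels go to 4 processors.
    for step in steps_for(lookahead_schedule(8), 1, 4)[:4]:
        print(step)
```

Every panel still receives its updates in order before it is factored, so the arithmetic is unchanged; only the order of independent work differs.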
Characteristics of HPL
• Is most suitable for cluster systems, i.e. relatively many low-performance CPUs connected by a relatively low-speed network.
• Is not suitable for SMPs, as MPI incurs overhead that causes substantial deterioration of performance for a benchmark code.
• When the look-ahead technique is used with MPI, additional memory must be allocated on each CPU for a communication buffer. In an SMP system, such a buffer is unnecessary because of the shared memory mechanism.
Approach for Tuning LINPACK NxN
• Leverage algorithms in HPL
• Use pthreads instead of MPI for a single node
• Use a hybrid of MPI and pthreads for a multi-node (Constellation) system: MPI across nodes and pthreads within the node
• Leverage HP MLIB’s BLAS routines to improve single-CPU performance. See http://www.hp.com/go/mlib
SD PA8600 vs. other machines
Note: Smaller is better for the numbers under “Ratio”
Constellation PA8600 Performance
Chart labels: 3.9x, 3.8x, 1.9x, 1.9x
G: Gigabit Ethernet
H: Hyper Fabric
Summary
• We believe that we reached our first goal.
• We accomplished our second goal: more scalable code for the HP Constellation system.
• A 4x32-CPU SD PA8600 could be ranked close to the TOP 100, based on the TOP500 list of June 2001.
• A 1x64-CPU SD PA8600 could be ranked within the TOP 250, based on the TOP500 list of June 2001.
• Performance per CPU of the SD PA8600 is about 1.5x, 1.9x, and 2.5x that of the IBM Power3, SGI O3000, and Sun HPC1000, respectively.