Explore Gaussian Elimination methods tolerant to large latencies for efficient matrix operations on grid computing systems. Learn about batched pivoting for minimizing synchronous communications.
Highly Latency Tolerant Gaussian Elimination
Toshio ENDO, Kenjiro TAURA
University of Tokyo
Highly Latency Tolerant GE/T. Endo

Background
• Demand for large-scale computing is increasing
• Grid computing is attractive for improving cost performance
• Grids have been successful for applications with little communication
  • Master-worker, parameter sweeping
  • …@home projects
• Evaluations of applications with frequent communication on the Grid are still rare
  • Matrix operations, PDE solvers
  • MPICH-G2, Cactus-G

Obstacles to Running Apps with Frequent Comm.
• Volatility, heterogeneity of computing nodes
  • Will be solved by progress of programming tools
• Low bandwidth of WAN
  • Will be solved by improvements of WAN
• Large latencies of WAN
  • More than 10 ms on the Grid >> a few µs on supercomputers
  • Large latencies will remain an obstacle!
Algorithms that tolerate latencies are important

Target Computation
Gaussian elimination (GE) of dense matrices for solving linear equations
• Same as LU decomposition, Linpack
• Used for
  • Fluid simulations, structural analysis
  • Top500 ranking
• Difficult to achieve good performance with large latencies
  • Partial pivoting (PP) introduces frequent synchronous communications

Overview of This Work
• A Gaussian elimination algorithm that tolerates large latencies is presented
• An alternative pivoting method, named batched pivoting (BP), is proposed
  • More latency tolerant than PP
  • BP can greatly reduce the frequency of synchronous communications

Outline
• Gaussian elimination with partial pivoting
• Batched pivoting
• Evaluation
  • Latency tolerance
  • Numerical accuracy
• Summary

Gaussian Elimination with Partial Pivoting
GE of an n×n matrix A:
for k = 1 to n
  Pivoting: find the element with the largest absolute value (the pivot) in the k-th column
  Row exchange
  Update
Since the pivot is used as a divisor, its absolute value should be as large as possible

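The three steps of the loop above (pivoting, row exchange, update) can be sketched sequentially as follows. This is a minimal illustration in plain Python, not the HPL implementation; the function name and the augmented-matrix layout are assumptions made for the example.

```python
def solve_with_partial_pivoting(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    a: list of n rows, each a list of n floats; b: list of n floats.
    (Illustrative helper; not part of HPL.)
    """
    n = len(a)
    # Work on an augmented copy [A | b] so the caller's data is untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Pivoting: row with the largest absolute value in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        # Row exchange.
        m[k], m[p] = m[p], m[k]
        # Update: eliminate column k from the rows below the pivot row.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

In a distributed run, the `max` over column k is the step that needs data from every node holding part of that column, which is why each iteration implies a synchronization.
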
Problem of GE with PP
• Well-known distribution: 2D block cyclic distribution
  • Good: communication amount is small, O(n²√p)
  • Each column is partitioned among nodes
  • Each pivot selection requires synchronization
• With large latencies, synchronization costs become the bottleneck
[Figure: 2D block cyclic distribution of an n×n matrix with block size sb; # of nodes p = 6 (= 2×3)]

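To see why every column is split across nodes under a 2D block cyclic distribution, the ownership rule can be sketched as below. The function names and the zero-based (row, column) process coordinates are assumptions for illustration; the 2×3 grid matches the p = 6 example on the slide.

```python
def owner(i, j, sb, grid_rows, grid_cols):
    """Process-grid coordinates owning element (i, j), zero-based indices.

    sb is the block size; blocks are dealt out cyclically in both dimensions.
    """
    return ((i // sb) % grid_rows, (j // sb) % grid_cols)


def column_owners(j, n, sb, grid_rows, grid_cols):
    """All processes that hold some part of column j of an n x n matrix."""
    return sorted({owner(i, j, sb, grid_rows, grid_cols) for i in range(n)})


# With a 2 x 3 grid, column 0 is shared by both process rows,
# so choosing its pivot needs communication between them.
owners_of_first_column = column_owners(0, n=8, sb=2, grid_rows=2, grid_cols=3)
```
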
Performance of GE/PP with Large Latencies
• We emulated large artificial latencies on a Linux cluster
  • Identical latencies are inserted between all pairs of nodes
  • base, +2ms, +5ms, +10ms
• High Performance Linpack (HPL) is measured
  • GE with PP
  • Matrix size n=32768
  • 64 (=8x8) nodes
With +10ms latency, it gets 6 times slower!
GE with PP performs poorly under large latencies

How about Other Pivoting Methods?
[Figure: pivoting methods arranged from strict to relaxed. Labels: Complete, Partial, Rook [Neal92], Threshold [Malard91] etc., Pairwise [Sorensen85] etc., No pivoting, Batched (ours). Annotations: "Not latency tolerant", "Numerically unstable".]

The Aim of Batched Pivoting (BP)
• BP reduces the frequency of synchronous communications
• We batch the pivot selections of several contiguous steps
• The batch size d is determined in advance
• Synchronous communications occur only every d steps

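A back-of-the-envelope model shows what "only every d steps" buys. The sketch below assumes, purely for illustration, that each pivot synchronization costs exactly one network latency and that nothing overlaps; real costs also depend on the reduction pattern, bandwidth, and overlap with computation, so these are not measured numbers.

```python
import math


def pivot_sync_time(n, latency_s, d=1):
    """Estimated time spent in pivot synchronizations (assumed model).

    n: matrix size, latency_s: latency in seconds,
    d: batch size (d=1 corresponds to partial pivoting).
    """
    syncs = math.ceil(n / d)   # one synchronization per batch of d steps
    return syncs * latency_s


pp_time = pivot_sync_time(32768, 0.010)        # about 327.68 s in this model
bp_time = pivot_sync_time(32768, 0.010, d=64)  # about 5.12 s in this model
```

With n=32768 and +10ms latency, the model gives roughly 328 s of pure waiting for d=1 and about 5 s for d=64, which is the effect batching aims for.
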
Batched Pivoting Algorithm (1)
Algorithm that selects d contiguous pivots
• In the figure, d columns are partitioned between P1 and P2
• Each node duplicates the columns and makes a sub-matrix
• Each node locally and speculatively performs GE with PP
• Each node obtains d pivot candidates
[Figure: d columns of width d within blocks of size sb, distributed over nodes P1 to P4]

Batched Pivoting Algorithm (2)
• Sets of pivot candidates are gathered
• The ‘best’ set is selected
  • We try to avoid bad pivots
• Contents of the best set are broadcast as final pivots
[Figure, d=4: the two candidate sets are compared and one is adopted]
  P1: "I recommend 4.8 on 50th row, -2.5 on 241st row, 4.3 on 285th row, -3.6 on 36th row"
  P2: "I recommend -9.2 on 310th row, 6.8 on 121st row, 0.8 on 170th row, -5.9 on 146th row"

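The two phases (local speculative GE, then choosing one candidate set) can be sketched on a single machine as below. The slides do not define how the ‘best’ set is judged, so the criterion used here, preferring the set whose smallest absolute pivot is largest, is an assumption, as are all function names and the data layout.

```python
def local_candidates(local_rows, d):
    """Speculative local GE with PP on one node's rows of a d-column panel.

    local_rows: list of (global_row_index, [d floats]).
    Returns a list of (global_row_index, pivot_value), at most d entries.
    """
    work = [(g, list(v)) for g, v in local_rows]  # duplicate; originals untouched
    picks = []
    for k in range(min(d, len(work))):
        p = max(range(k, len(work)), key=lambda i: abs(work[i][1][k]))
        work[k], work[p] = work[p], work[k]
        g, pivot_row = work[k]
        pivot = pivot_row[k]
        if pivot == 0.0:
            break  # local GE failed; this node offers an incomplete set
        picks.append((g, pivot))
        for i in range(k + 1, len(work)):
            f = work[i][1][k] / pivot
            for j in range(k, d):
                work[i][1][j] -= f * pivot_row[j]
    return picks


def choose_best_set(candidate_sets, d):
    """Pick one node's complete set (ASSUMED rule: maximize the worst pivot)."""
    complete = [s for s in candidate_sets if len(s) == d]
    if not complete:
        raise ValueError("local GE failed on all nodes")
    return max(complete, key=lambda s: min(abs(v) for _, v in s))


# Candidate sets quoted in the figure (row number, pivot value).
p1 = [(50, 4.8), (241, -2.5), (285, 4.3), (36, -3.6)]
p2 = [(310, -9.2), (121, 6.8), (170, 0.8), (146, -5.9)]
chosen = choose_best_set([p1, p2], d=4)
```

Under this assumed rule the P1 set would be chosen, because P2 contains the small pivot 0.8; the slide text does not state which set the figure adopts. Only the gather of candidate sets and the broadcast of the result need communication, once per d steps.
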
Comparison with PP
• Selected pivots
  • PP selects the pivot of each step independently
  • BP takes the pivots of d contiguous steps from a single node
  • Pivots may be worse than with PP
• Computation costs
  • PP: [formula not recoverable]
  • BP: [formula not recoverable; it includes an extra term for the local GE]
  • The difference is small if d << n

Comparison with Other Pivoting Methods
• Threshold pivoting [Malard91] etc.
  • It may not select the ‘best’ pivot: an element may become the pivot if it satisfies a threshold condition
  • Good: It can reduce communication for row exchange
  • Bad: Not latency tolerant
• Pairwise pivoting [Sorensen85] etc.
  • It repeatedly takes two adjacent rows and eliminates one of the two (cf. bubble sort)
  • Good: It enables pipelining of pivot selections
  • Bad: Numerically unstable

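The exact threshold condition from the slide is not preserved in this text. As an illustration only, the sketch below uses the common textbook form, in which a candidate is acceptable when its absolute value is at least u times the largest absolute value in the column, for some u in (0, 1]. The function name, the `preferred` argument, and the default u = 0.5 are assumptions.

```python
def threshold_pivot(column, preferred, u=0.5):
    """Return the index of the chosen pivot within `column`.

    column: list of floats (the candidate pivot column).
    preferred: indices that are cheap to use, e.g. rows needing no exchange.
    u: threshold in (0, 1]; u = 1 reduces to ordinary partial pivoting.
    """
    biggest = max(range(len(column)), key=lambda i: abs(column[i]))
    limit = u * abs(column[biggest])
    for i in preferred:
        if abs(column[i]) >= limit:
            return i        # good enough, so avoid the row exchange
    return biggest          # fall back to the largest element
```

Note that the column maximum is still needed to evaluate the test, so every step still requires a global reduction; this is consistent with the slide's remark that threshold pivoting saves row-exchange traffic but is not latency tolerant.
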
Environment for Parallel Experiments
• 192-node Linux cluster
  • Dual Xeon 2.4/2.8GHz (1 CPU per node is used)
  • Gigabit Ethernet
  • Latencies: 55 to 75 µs
• BP is implemented by modifying HPL
  • MPICH 1.2.6
  • BLAS library by Kazushige Goto

Network Structure of Cluster
[Figure: switch topology of the ISTBS cluster at U. Tokyo. Labels: SW, FW, FS; groups of 14 and 20; link speeds 1Gbps, 2Gbps, 4Gbps.]

Basic Parallel Performance
• Comparing speeds of PP and BP (d=4, 16, 64)
  • 32 to 160 nodes
  • n=32768, sb=256
  • No emulated latency
• BP shows scalability similar to that of PP
• BP suffers from the overhead of additional computation
  • 7 to 15% with d=64

Performance with Large Latencies
• Large latencies are added
  • +2ms, +5ms, +10ms between all pairs of nodes
  • 64 (=8x8) nodes
  • n=32768, sb=256
• With latencies, BP is much faster
• BP is stable against large latencies!
  • The larger d is, the more latency tolerant BP becomes

Evaluation Method of Numerical Accuracy
We conducted experiments to evaluate numerical accuracy
• Partial, Batched, Threshold, Pairwise, and No pivoting are compared
• Done on a single node
  • In BP, blocks of size 64 are regarded as nodes
• 100 random matrices for each condition
  • Matrix sizes are 128 to 2048
• Normalized residuals are evaluated, defined in terms of the computed solution and the machine epsilon ε
• Next slide shows the average residuals

Numerical Accuracy
• PP achieves the best accuracy
• No pivoting and Pairwise are numerically unstable
• BP and Threshold achieve accuracy comparable to PP
  • Average residuals of BP (d=4) are 1.1 to 1.6 times those of PP
• The sizes of the residuals depend on d
  • Tradeoff between accuracy and latency tolerance

Summary
• A GE algorithm that tolerates the large latencies of the Grid
• Batched pivoting greatly reduces the number of synchronous communications
• BP achieves numerical accuracy comparable to PP

Future Work
• Performance evaluation on an actual Grid
• Improvement of accuracy
  • Combining batched pivoting with complete or rook pivoting
• Theoretical error analysis
  • cf. average case analysis by Trefethen et al.

Another Approach: Column Distribution
• When each column is placed on a single node, synchronization is not necessary
• It is latency tolerant, but…
• Slower because of the increase in communication amount

Why PP Is Fragile to Large Latencies
• Batching several steps is a well-known technique for row exchange and update
• Then, can we reduce synchronizations for pivoting?
• No. Pivoting cannot be batched or pipelined
  • Each pivot selection depends on the pivoting of preceding steps!
• In total, n synchronizations are required
If latencies are too large, synchronization costs become the bottleneck

Bandwidth Requirements
• Estimation of the speed limit when the bisection bandwidth is given
  • Depends on n
  • Very optimistic
• Our experiments require 250Mbps
• Bluegene/L (Jun. 2005) requires 5Gbps
[Figure: estimated speed limit versus bisection bandwidth, with "Our experiments" and "Bluegene/L" marked]

Effects of Latencies
• Estimation of the speed limit when the latency is given
  • Depends on n
  • Very optimistic
• With latencies above 7ms, we can never obtain the performance of Bluegene/L
[Figure: estimated speed limit versus latency, with "Our experiments" and "Bluegene/L" marked]

Displayed Residual of HPL Differs from Actual Value
• Displayed results (the division by N is not shown in the third line)
    ============================================================================
    T/V        N    NB   P   Q   Time   Gflops
    ----------------------------------------------------------------------------
    W21L2L4    1024 256  1   1   0.30   2.357e+00
    ----------------------------------------------------------------------------
    ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0307237 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0135777 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035758 ...... PASSED
• Source code (HPL_pdtest.c): the third residual is actually divided by N
    resid1 = resid0 / ( TEST->epsil * Anorm1 * (double)(N) );
    resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1 );
    resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );

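The three scaled residuals in the HPL_pdtest.c excerpt can be reproduced for a small dense system as sketched below. The function name is illustrative, and `sys.float_info.epsilon` is only a stand-in for HPL's `TEST->epsil`, which may differ from it by a constant factor; treat the absolute values as indicative only.

```python
import sys


def hpl_style_residuals(a, x, b, eps=sys.float_info.epsilon):
    """Scaled residuals following the resid1..resid3 expressions above.

    a: n x n matrix as a list of rows, x: computed solution, b: right-hand side.
    """
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    resid0 = max(abs(v) for v in r)                        # ||Ax-b||_oo
    anorm1 = max(sum(abs(a[i][j]) for i in range(n))       # ||A||_1: max column sum
                 for j in range(n))
    anorm_i = max(sum(abs(v) for v in row) for row in a)   # ||A||_oo: max row sum
    xnorm1 = sum(abs(v) for v in x)                        # ||x||_1
    xnorm_i = max(abs(v) for v in x)                       # ||x||_oo
    resid1 = resid0 / (eps * anorm1 * n)
    resid2 = resid0 / (eps * anorm1 * xnorm1)
    resid3 = resid0 / (eps * anorm_i * xnorm_i * n)        # includes the division by N
    return resid1, resid2, resid3
```
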
Numerical Accuracy (2)
Compares PP and BP with large matrices
• Matrices are generated by HPL
• On 64 (=8x8) nodes

Detail of Accuracy (Small Matrices)
[Figure panels: standard deviation; worst case in 100 trials]

Detail of Accuracy (Large Matrices)
• From left to right:
  • ||Ax-b||_oo / ( eps * ||A||_1 * N )
  • ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  • ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N )

Notes
• Simple batched pivoting may fail if the local GEs fail on all nodes
  • This situation may occur with (nearly) sparse matrices
• We can recover by restarting from the failed column
  • In that case, the number of synchronizations approaches that of partial pivoting