Highly Latency Tolerant Gaussian Elimination

Toshio ENDO, Kenjiro TAURA
University of Tokyo
Highly Latency Tolerant GE/T. Endo

Background
• Demand for large-scale computing is increasing
• Grid computing is attractive for improving cost performance
• The Grid has been successful for applications that communicate infrequently
  • Master-worker, parameter sweeping
  • …@home projects
• Evaluations of applications with frequent communication on the Grid are still rare
  • Matrix ops, PDE solvers
  • MPICH-G2, Cactus-G

Obstacles to Running Apps with Frequent Comm.
• Volatility, heterogeneity of computing nodes: will be solved by progress of programming tools
• Low bandwidth of WAN: will be solved by improvements of WAN
• Large latencies of WAN: will remain an obstacle!
  • More than 10 ms on the Grid >> a few us on supercomputers
Algorithms that tolerate latencies are important

Target Computation
Gaussian elimination (GE) of dense matrices for solving linear equations
• Same as LU decomposition, Linpack
• Used for
  • Fluid simulations, structural analysis
  • Top500 ranking
• Difficult to achieve good performance with large latencies
  • Partial pivoting (PP) introduces frequent synchronous communications

Overview of This Work
• A Gaussian elimination algorithm that tolerates large latencies is presented
• An alternative pivoting method, named batched pivoting (BP), is proposed
  • More latency tolerant than PP
  • BP can greatly reduce the frequency of synchronous communications

Outline
• Gaussian elimination with partial pivoting
• Batched pivoting
• Evaluation
  • Latency tolerance
  • Numerical accuracy
• Summary

Gaussian Elimination with Partial Pivoting
GE of an n×n matrix A:
for k = 1 to n
  Pivoting: find the element with the largest absolute value (the pivot) in the k-th column
  Row exchange
  Update
Since the pivot is used as a divisor, its absolute value should be large
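The three steps of the loop (pivoting, row exchange, update) can be sketched on a single node as follows. This is a minimal illustration in plain Python, not the HPL implementation; the function name `solve_with_partial_pivoting` and the use of an augmented matrix are assumptions made for the example.

```python
def solve_with_partial_pivoting(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Hypothetical helper for illustration; a is a list of n rows, b a list of n values.
    """
    n = len(a)
    # Work on an augmented copy so the caller's data is untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Pivoting: row with the largest |element| in column k, among rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        # Row exchange.
        m[k], m[p] = m[p], m[k]
        # Update: eliminate column k from the rows below the pivot row.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

In a distributed run, the `max` over column k is the step that spans several nodes, which is why it forces a synchronization at every k.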

Problem of GE with PP
• Well-known distribution: 2D block cyclic distribution
  • Good: Comm. amount is small
• Each column is partitioned among nodes
• Each pivot selection requires synchronization
With large latencies, sync. costs become the bottleneck
[Figure: 2D block cyclic layout of an n×n matrix with block size sb; # of nodes p = 6 (= 2x3); communication amount given as an O(.) expression in n and p]

Performance of GE/PP with Large Latencies
• We emulated large artificial latencies on a Linux cluster
  • Identical latencies are inserted between all pairs of nodes
  • base, +2ms, +5ms, +10ms
• High Performance Linpack (HPL) is measured
  • GE with PP
  • Matrix size n=32768
  • 64 (=8x8) nodes
With +10ms latency, it gets 6 times slower!
GE with PP is vulnerable to large latencies

How about Other Pivoting Methods?
[Figure: pivoting methods placed on a scale from "Strict" to "Relaxed": Complete, Rook [Neal92], Partial, Threshold [Malard91] etc., Pairwise [Sorensen85] etc., No pivoting, and Batched (ours). Annotations: "Not latency tolerant", "Numerically unstable".]

The Aim of Batched Pivoting (BP)
• BP reduces the frequency of synchronous communications
• We batch pivot selections of several contiguous steps
  • The batch size d is determined in advance
  • Synchronous communications occur only every d steps
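A rough way to see the effect of the batch size d is to count synchronization rounds. The sketch below uses a deliberately crude model, assumed only for illustration: each round costs exactly one one-way latency, and all other costs are ignored. The function name `sync_cost_seconds` is hypothetical; d = 1 stands in for one synchronization per step, as in PP.

```python
import math

def sync_cost_seconds(n, d, latency_s):
    """Crude estimate of time spent in pivot synchronizations.

    ASSUMPTION: one synchronization every d steps, each costing one latency.
    """
    rounds = math.ceil(n / d)
    return rounds * latency_s

if __name__ == "__main__":
    n = 32768  # matrix size used in the experiments
    for d in (1, 4, 16, 64):
        rounds = math.ceil(n / d)
        print(f"d={d:3d}  rounds={rounds:6d}  "
              f"sync time at +10ms: {sync_cost_seconds(n, d, 0.010):8.2f} s")
```

A real synchronization involves several message exchanges, so the absolute numbers are optimistic; the point is only that the round count scales as n/d.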

Batched Pivoting Algorithm (1)
Algorithm that selects d contiguous pivots
• In the figure, d columns are partitioned between P1 and P2
• Each node duplicates the columns and makes a sub-matrix
• Each node locally and speculatively performs GE with PP
• Each node obtains d pivot candidates
[Figure: a panel of d columns within a 2D block cyclic layout (block size sb) over nodes P1 to P4]

Batched Pivoting Algorithm (2)
• Sets of pivot candidates are gathered
• The 'best' set is selected
  • We try to avoid bad pivots
  • d=4 in the figure
• Contents of the best set are broadcast as final pivots
[Figure: P1 and P2 each recommend a set; the sets are compared and one is adopted.
  P1: "I recommend 4.8 on the 50th row, -2.5 on the 241st row, 4.3 on the 285th row, -3.6 on the 36th row"
  P2: "I recommend -9.2 on the 310th row, 6.8 on the 121st row, 0.8 on the 170th row, -5.9 on the 146th row"]
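The two phases (local speculative GE, then choosing one set) can be sketched as below. This is a single-process illustration, not the modified HPL code. The slides do not state how the 'best' set is judged, so the scoring rule in `set_quality` (prefer the set whose smallest |pivot| is largest) is purely an assumption, and all function names are hypothetical.

```python
def local_pivot_candidates(rows, d):
    """Speculative GE with partial pivoting on one node's copy of a d-column panel.

    rows: list of (global_row_index, [d values]) owned by this node (at least d rows).
    Returns d (pivot_value, global_row_index) pairs, one per elimination step.
    """
    work = [(idx, [float(v) for v in vals]) for idx, vals in rows]
    candidates = []
    for k in range(d):
        # Local partial pivoting over the rows not yet used as pivots.
        p = max(range(k, len(work)), key=lambda i: abs(work[i][1][k]))
        work[k], work[p] = work[p], work[k]
        pivot_idx, pivot_row = work[k]
        candidates.append((pivot_row[k], pivot_idx))
        if pivot_row[k] == 0.0:
            continue  # local GE failed at this step; the set will score poorly
        # Local update of the remaining rows of the panel copy.
        for _, r in work[k + 1:]:
            factor = r[k] / pivot_row[k]
            for j in range(k, d):
                r[j] -= factor * pivot_row[j]
    return candidates

def set_quality(candidates):
    # ASSUMPTION (not from the slides): a set is as good as its smallest |pivot|.
    return min(abs(value) for value, _ in candidates)

def select_best_set(candidate_sets):
    """Pick one node's whole set; all d pivots come from that single node."""
    return max(candidate_sets, key=set_quality)
```

In a parallel setting, only the gather of the candidate sets and the broadcast of the winner need communication, once per d steps.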

Comparison with PP
• Selected pivots
  • PP selects the pivot of each step independently
  • BP takes pivots of d contiguous steps from a single node
  • Pivots may be worse than with PP
• Computation costs
  • PP: [formula not preserved]
  • BP: [formula not preserved], including a term for local GE
  • The difference is small if d<<n

Comparison with Other Pivoting Methods
• Threshold pivoting [Malard91] etc.
  • It may not select the 'best' pivot. An element may become the pivot if [condition not preserved] holds
  • Good: It can reduce communication for row exchange
  • Bad: Not latency tolerant
• Pairwise pivoting [Sorensen85] etc.
  • It repeatedly takes two adjacent rows and eliminates one of the two (cf. bubble sort)
  • Good: It enables pipelining pivot selections
  • Bad: Numerically unstable

Environment for Parallel Experiments
• 192-node Linux cluster
  • Dual Xeon 2.4/2.8GHz (1 CPU per node is used)
  • Gigabit Ethernet
  • Latencies: 55 to 75 us
• BP is implemented by modifying HPL
  • mpich 1.2.6
  • BLAS library by Kazushige Goto

Network Structure of Cluster
ISTBS cluster at U. Tokyo
[Figure: network diagram of switches (SW) joined by 1Gbps, 2Gbps, and 4Gbps links, with groups labeled 14 and 20, plus FW and FS]

Basic Parallel Performance
• Comparing speeds of PP and BP (d=4, 16, 64)
  • 32 to 160 nodes
  • n=32768, sb=256
  • No emulated latency
• BP shows similar scalability to that of PP
• BP suffers from overheads of additional computation
  • 7 to 15% with d=64

Performance with Large Latencies
• Large latencies are added
  • +2ms, +5ms, +10ms between all pairs of nodes
  • 64 (=8x8) nodes
  • n=32768, sb=256
• BP is stable against large latencies!
  • The larger d is, the more latency tolerant it becomes
With latencies, BP is much faster

Evaluation Method of Numerical Accuracy
We conducted experiments to evaluate numerical accuracy
• Partial, Batched, Threshold, Pairwise, and No pivoting are compared
• Done on a single node
  • In BP, blocks of size 64 are regarded as nodes
• 100 random matrices for each condition
  • Matrix sizes are 128 to 2048
• Normalized residuals are evaluated [formula not preserved]
  • [symbol not preserved]: computed solution, ε: machine epsilon [value not preserved]
• Next slide shows the average residuals

Numerical Accuracy
• PP achieves the best accuracy
• No pivoting and Pairwise are numerically unstable
• BP and Threshold achieve accuracy comparable to PP
  • Average residuals of BP (d=4) are 1.1 to 1.6 times those of PP
  • The sizes of residuals depend on d
Tradeoff between accuracy and latency tolerance

Summary
• A GE algorithm that tolerates the large latencies of the Grid
• Batched pivoting greatly reduces the number of synchronous communications
• BP achieves numerical accuracy comparable to PP

Future Work
• Performance evaluation on an actual Grid
• Improvement of accuracy
  • Combining batched pivoting with complete or rook pivoting
• Theoretical error analysis
  • cf. average case analysis by Trefethen et al.

Another Approach: Column Distribution
• When each column is placed on a single node, synchronization is not necessary
• It is latency tolerant, but…
  • Slower because of the increase in comm. amount

Why PP Is Fragile to Large Latencies
• Batching several steps is a well-known technique for row exchange and update
• Then, can we reduce synchronizations for pivoting?
  • No. Pivoting cannot be batched or pipelined
  • Each pivot selection depends on the pivoting of the preceding steps!
• In total, n synchronizations are required
If latencies are too large, synchronization costs become the bottleneck

Bandwidth Requirements
• Estimation of the limit speed when the bisection bandwidth is given
  • Depends on n
  • Very optimistic
• Our experiments require 250Mbps
• Bluegene/L (Jun. 2005) requires 5Gbps
[Figure: estimated limit speed versus bisection bandwidth, with points marked "Our experiments" and "Bluegene/L"]

Effects of Latencies
• Estimation of the limit speed when the latency is given
  • Depends on n
  • Very optimistic
• With >7ms latencies, we can never obtain the performance of Bluegene/L
[Figure: estimated limit speed versus latency, with points marked "Our experiments" and "Bluegene/L"]

Displayed Residual of HPL Differs from the Actual Value
• Displayed results
    ============================================================================
    T/V        N    NB   P   Q   Time   Gflops
    ----------------------------------------------------------------------------
    W21L2L4    1024 256  1   1   0.30   2.357e+00
    ----------------------------------------------------------------------------
    ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0307237 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0135777 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035758 ...... PASSED
  The division by N is not shown in the third line
• Source code (HPL_pdtest.c)
    resid1 = resid0 / ( TEST->epsil * Anorm1 * (double)(N) );
    resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1 );
    resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );
  The third residual is actually divided by N

Numerical Accuracy (2)
Compares PP and BP with large matrices
• Matrices are generated by HPL
• On 64 (=8x8) nodes

Detail of Accuracy (Small Matrices)
[Figure panels: "Standard deviation" and "Worst case in 100 trials"]

Detail of Accuracy (Large Matrices)
• From left to right:
  • ||Ax-b||_oo / ( eps * ||A||_1 * N )
  • ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  • ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N )
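The three normalized residuals listed above can be computed for a small dense system as in the following sketch. It is plain Python for illustration, not HPL code; the function name `normalized_residuals` is hypothetical, and taking `eps` from `sys.float_info.epsilon` is an assumption (HPL determines its own machine epsilon, which may differ by a constant factor).

```python
import sys

def normalized_residuals(a, x, b, eps=sys.float_info.epsilon):
    """Return the three scaled residuals for a computed solution x of a x = b.

    ASSUMPTION: eps defaults to Python's float epsilon, not HPL's own value.
    """
    n = len(a)
    # resid0 = ||Ax - b||_oo (largest absolute component of the residual vector).
    resid0 = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    # Matrix 1-norm: largest absolute column sum.
    a_norm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    # Matrix infinity-norm: largest absolute row sum.
    a_norm_inf = max(sum(abs(v) for v in row) for row in a)
    x_norm1 = sum(abs(v) for v in x)
    x_norm_inf = max(abs(v) for v in x)
    resid1 = resid0 / (eps * a_norm1 * n)
    resid2 = resid0 / (eps * a_norm1 * x_norm1)
    resid3 = resid0 / (eps * a_norm_inf * x_norm_inf * n)
    return resid1, resid2, resid3
```

Because all three values share the same numerator, they differ only in scaling, which makes it easy to check whether a reported figure includes the factor N.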

Notes
• Simple batched pivoting may fail if local GEs fail on all nodes
  • This situation may occur with (nearly) sparse matrices
• We can recover by restarting from the failed column
  • In that case, the number of synchronizations approaches that of partial pivoting
