Exploiting Mixed Precision: Iterative Refinement for the Solution of Linear Systems Jack Dongarra INNOVATIVE COMPUTING LABORATORY University of Tennessee Oak Ridge National Laboratory University of Manchester
Five Important Features to Consider When Computing at Scale • Effective use of many-core and hybrid architectures • Dynamic data-driven execution • Block data layout • Exploiting mixed precision in the algorithms • Single precision is 2X faster than double precision • With GP-GPUs, 10X • Self-adapting / auto-tuning of software • Too hard to do by hand • Fault-tolerant algorithms • With 1,000,000's of cores, things will fail • Communication-avoiding algorithms • For dense computations, from O(n log p) to O(log p) communications • s-step GMRES computes the basis (x, Ax, A²x, …, Aˢx) — see the sketch below
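As a small illustration of the s-step basis mentioned in the last bullet, here is a minimal MATLAB sketch (the matrix, starting vector, and s are hypothetical placeholders; a real s-step GMRES would also orthogonalize this basis):
A = rand(100) + 100*eye(100);   % hypothetical test matrix
x = rand(100,1);                % starting vector
s = 4;                          % number of s-step basis vectors
V = zeros(100, s+1);            % will hold [x, A*x, A^2*x, ..., A^s*x]
V(:,1) = x;
for k = 1:s
    V(:,k+1) = A * V(:,k);      % one matrix-vector product per basis vector
end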
Performance of Single Precision on Conventional Processors • We have a similar situation on our commodity processors: SP is 2X as fast as DP on many systems • The Intel Pentium and AMD Opteron have SSE2 • 2 flops/cycle in DP • 4 flops/cycle in SP • IBM PowerPC has AltiVec • 8 flops/cycle in SP • 4 flops/cycle in DP • No DP in AltiVec (DP runs on the scalar floating-point units) • Single precision is faster because: • Operations are faster • Data motion is reduced • Larger blocks give higher locality in cache
Idea Goes Something Like This… • Exploit 32-bit floating point as much as possible, especially for the bulk of the computation • Correct or update the solution with selective use of 64-bit floating point to provide a refined result • Intuitively: • Compute a 32-bit result, • Calculate a correction to the 32-bit result using selected higher precision, and • Apply the correction to the 32-bit result using the higher precision.
Mixed-Precision Iterative Refinement
L, U = lu(A)            SINGLE   O(n³)
x = U\(L\b)             SINGLE   O(n²)
r = b – A·x             DOUBLE   O(n²)
WHILE || r || not small enough
   z = U\(L\r)          SINGLE   O(n²)
   x = x + z            DOUBLE   O(n)
   r = b – A·x          DOUBLE   O(n²)
END
• Iterative refinement for dense systems, Ax = b, can work this way • Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating-point results when DP floating point is used • Fixed precision: everything in one precision; mixed precision: the residual is computed in the higher precision • It can be shown that with this approach we can compute the solution to 64-bit floating-point accuracy • Requires extra storage: 1.5 times the normal amount in total • O(n³) work is done in the lower precision • O(n²) work is done in the higher precision • Problems arise if the matrix is ill-conditioned in SP, i.e. condition number around O(10⁸)
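The bound referenced above can be summarized informally as follows (a paraphrase of the usual convergence statement, not the exact theorem from those papers; ε_s and ε_d denote the single- and double-precision unit roundoffs, and modest factors in n are suppressed):
\[
\| x - x_k \| \;\lesssim\; \bigl( \kappa_\infty(A)\,\varepsilon_s \bigr)^{k} \, \| x - x_0 \| \;+\; O(\varepsilon_d)\,\| x \| ,
\]
so the process converges only when \( \kappa_\infty(A)\,\varepsilon_s \ll 1 \); since \( \varepsilon_s \approx 6\times 10^{-8} \), a condition number near 10⁸ is exactly where it breaks down, matching the last bullet above.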
In Matlab on My Laptop! • Matlab can perform 32-bit floating-point arithmetic for many computations • Matlab uses LAPACK and Intel's MKL BLAS underneath
sa = single(a);  sb = single(b);
[sl, su, sp] = lu(sa);                 % O(n^3), single precision
sx = su\(sl\(sp*sb));                  % O(n^2), single precision
x  = double(sx);
r  = b - a*x;                          % O(n^2), double precision
i  = 0;
while (norm(r) > res1)                 % res1 = requested residual tolerance
    i   = i + 1;
    sr  = single(r);
    sx1 = su\(sl\(sp*sr));             % O(n^2), single precision
    x1  = double(sx1);
    x   = x + x1;
    r   = b - a*x;                     % O(n^2), double precision
    if (i == 30), break; end
end
• Bulk of the work, O(n³), is in single precision • Refinement, O(n²), is in double precision • The correction to the SP result is computed in DP and added to the solution in DP
Another Look at Iterative Refinement • On a Pentium using SSE2, single precision can perform 4 floating-point operations per cycle, while double precision can perform 2 floating-point operations per cycle • In addition there is reduced memory traffic (roughly a factor of 2 for SP data) • Intel Pentium M (T2500, 2 GHz), Matlab on my laptop, solving Ax = b: • A\b in double precision: 1.4 Gflop/s • A\b in single precision with iterative refinement, with the same accuracy as DP: 3 Gflop/s • About a 2X speedup
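For context, the single-core theoretical peaks implied by the rates quoted above (back-of-the-envelope arithmetic, not a vendor figure, assuming one core at 2 GHz): DP peak ≈ 2 flops/cycle × 2.0 GHz = 4 Gflop/s and SP peak ≈ 4 flops/cycle × 2.0 GHz = 8 Gflop/s, so the measured 1.4 Gflop/s (DP) and 3 Gflop/s (SP with refinement) are each a little over a third of peak.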
Results for Mixed Precision Iterative Refinement for Dense Ax = b • Single precision is faster than DP because: • Higher parallelism within vector units • 4 ops/cycle (usually) instead of 2 ops/cycle • Reduced data motion • 32 bit data instead of 64 bit data • Higher locality in cache • More data items in cache
Intel Pentium IV Prescott (3.4 GHz), Goto BLAS (1 thread) [Performance chart for Ax = b: SP vs. DP rates of the factorization A = LU, the triangular solves x = U\(L\b), and the update y = y + A*x]
Quadruple Precision • Intel Xeon 3.2 GHz • Reference implementation of the quad-precision BLAS • Accuracy: 10⁻³² • No more than 3 steps of iterative refinement are needed.
IBM Cell Processor • IBM, Toshiba, Sony (similar to the chip in the PS3) • Each Cell contains 8 APUs • An APU is a self-contained vector processor that acts independently of the others • 4 floating-point units capable of a total of 32 Gflop/s (8 Gflop/s each) • 204 Gflop/s peak in 32-bit floating point; 64-bit floating point runs at 15 Gflop/s (3.2 GHz) • IEEE format, but 32-bit arithmetic only rounds toward zero, and overflow is set to the largest representable number
IBM Cell 3.2 GHz, Ax = b [Performance chart; the peak reference line is 8 SGEMMs running embarrassingly parallel] • Double-precision solve: 3.9 secs • Single-precision solve: 0.30 secs • Mixed-precision solve with iterative refinement: 0.47 secs, an 8.3X speedup over double precision • Algorithm so effective it is now a part of LAPACK
PS3 Hardware Overview • STI Cell processor at 3.2 GHz with a PowerPC core and 8 SPEs, each with a 25.6 Gflop/s peak in 32-bit floating point • One SPE is disabled/broken (yield issues) and one is reserved by the GameOS hypervisor, leaving 6 SPEs available • 32-bit peak performance: 6 × 25.6 Gflop/s = 153.6 Gflop/s • 64-bit peak performance: 6 × 1.8 Gflop/s = 10.8 Gflop/s • 25 GB/s injection bandwidth to the 256 MiB memory • 200 GB/s between the SPEs • 1 Gb/s NIC
PlayStation 3 LU Codes [Performance chart; the peak reference line is 6 SGEMMs running embarrassingly parallel]
Sparse Direct Solver and Iterative Refinement • The MUMPS package is based on a multifrontal approach, which generates small dense matrix multiplies
Sparse Iterative Methods (PCG) • Outer/inner iteration scheme: the outer iteration runs in 64-bit floating point and the inner iteration runs in 32-bit floating point (see the sketch below)
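A minimal MATLAB sketch of the inner/outer idea, under some assumptions: the test matrix, tolerances, and iteration limits are made up, the inner solver here is plain (unpreconditioned) CG rather than the PCG/PGMRES variants reported on the next slide, and the matrix is copied to dense single precision because MATLAB has no single-precision sparse type.
n  = 900;  A = gallery('poisson', 30);  b = rand(n,1);   % hypothetical SPD test problem
As = single(full(A));                                    % 32-bit copy used by the inner solver
x  = zeros(n,1);
for outer = 1:20                                         % outer loop in 64-bit
    r = b - A*x;                                         % true residual in double
    if norm(r) <= 1e-12*norm(b), break; end
    % inner solve of A*z = r by plain CG, entirely in 32-bit
    z = zeros(n,1,'single');  rs = single(r);  p = rs;
    rho = rs'*rs;  rho0 = rho;
    for inner = 1:50
        q = As*p;  alpha = rho/(p'*q);
        z = z + alpha*p;  rs = rs - alpha*q;
        rho_new = rs'*rs;
        if rho_new <= (1e-4)^2*rho0, break; end          % loose inner tolerance
        p = rs + (rho_new/rho)*p;  rho = rho_new;
    end
    x = x + double(z);                                   % correction applied in double
end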
Mixed Precision Computations for Sparse Inner/Outer-type Iterative Solvers [Charts] • Speedups for mixed-precision inner-SP/outer-DP (SP/DP) iterative methods vs. DP/DP — CG², GMRES², PCG², and PGMRES² with diagonal preconditioning (higher is better) • Iterations for mixed-precision SP/DP iterative methods vs. DP/DP (lower is better) • Machine: Intel Woodcrest (3 GHz, 1333 MHz bus) • Stopping criterion: residual reduction of 10⁻¹² relative to r₀ • Matrix sizes from 6,021 to 240,000, with varying condition numbers
Cray XD-1 (OctigaBay Systems) • Experiments with Field Programmable Gate Arrays, where the arithmetic format can be specified • Six Xilinx Virtex-4 Field Programmable Gate Arrays (FPGAs) per chassis
Mixed Precision Iterative Refinement – FPGA Performance Test – Junqing Sun et al • Characteristics of a multiplier on an FPGA* (using DSP48 blocks) * XC4LX160-10
Mixed Precision Iterative Refinement – Random Matrix Test – Junqing Sun et al [Chart: refinement iterations needed for customized floating-point formats (sXXe11) on random matrices; axes show number of bits vs. number of iterations]
Mixed Precision Hybrid Direct Solver – Profiled Time* on Cray XD-1 – Junqing Sun et al * For a 128x128 matrix • High Performance Mixed-Precision Linear Solver for FPGAs, Junqing Sun, Gregory D. Peterson, Olaf Storaasli, to appear in IEEE TPDC • LU decomposition execution time for lower-precision data formats is much less than for higher-precision formats • Direct solver execution time consists of four parts: LU decomposition, communication, triangular solves, and iterative refinement; the LU time usually dominates and shrinks for lower-precision computations • LU performance: the mixed-precision design outperforms a 2.2 GHz Opteron processor by 6x for LU decomposition and by 2.7x for the full direct solver; lower-precision designs achieve higher performance
Intriguing Potential • Exploit lower precision as much as possible • Payoff in performance • Faster floating point • Less data to move • Automatically switch between SP and DP to match the desired accuracy • Compute the solution in SP and then a correction to the solution in DP • Potential for GPUs, FPGAs, and special-purpose processors • Use as little precision as you can get away with and then improve the accuracy • Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used (see the sketch below) • Newton-style correction: Δx = −A\(Ax − b) = A\(b − Ax)
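To make the Newton connection concrete, here is a hedged MATLAB sketch (the nonlinear system f and its Jacobian J are invented for illustration; the point is only where the precisions go): the Jacobian solve is done in 32-bit, while the residual and the update are kept in 64-bit.
f = @(x) [x(1)^2 + x(2)^2 - 4; x(1)*x(2) - 1];           % hypothetical nonlinear system f(x) = 0
J = @(x) [2*x(1), 2*x(2); x(2), x(1)];                   % its Jacobian
x = [2; 0.2];                                            % starting guess
for k = 1:20
    r = f(x);                                            % residual evaluated in double
    if norm(r) < 1e-12, break; end
    dx = double( single(J(x)) \ single(-r) );            % Newton correction solved in single
    x = x + dx;                                          % update applied in double
end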
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA) These new algorithms: • have very low granularity, so they scale very well (multicore, petascale computing, …) • remove a lot of dependencies among the tasks (multicore, distributed computing) • avoid latency (distributed computing, out-of-core) • rely on fast kernels These new algorithms need new kernels and rely on efficient scheduling algorithms.
Coding for an Abstract Multicore • Parallel software for multicores should have two characteristics: • Fine granularity: • A high level of parallelism is needed • Cores will probably be associated with relatively small local memories; this requires splitting an operation into tasks that operate on small portions of data, in order to reduce bus traffic and improve data locality • Asynchronicity: • As the degree of thread-level parallelism grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of the algorithm
Major Changes to Software • Must rethink the design of our software • Another disruptive technology • Similar to what happened with cluster computing and message passing • Rethink and rewrite the applications, algorithms, and software • Numerical libraries will change • For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
Steps in the LAPACK LU • Factor a panel • Backward row swap • Forward row swap • Triangular solve • Matrix multiply (see the blocked-LU sketch below)
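For orientation, here is a minimal MATLAB sketch of a blocked right-looking LU built from those steps (illustrative only: no pivoting, so it omits the row swaps that LAPACK's DGETRF/DLASWP perform, and the test matrix is made diagonally dominant so that this is safe; the block size and matrix size are arbitrary):
n = 512;  nb = 64;                                       % hypothetical sizes
A = rand(n) + n*eye(n);                                  % diagonally dominant test matrix
for k = 1:nb:n
    p    = k:min(k+nb-1, n);                             % current block of columns
    rest = p(end)+1:n;                                   % trailing part
    for j = p                                            % factor the diagonal block (unpivoted)
        rows = j+1:p(end);
        A(rows,j) = A(rows,j) / A(j,j);
        A(rows,j+1:p(end)) = A(rows,j+1:p(end)) - A(rows,j) * A(j,j+1:p(end));
    end
    L1 = tril(A(p,p),-1) + eye(numel(p));  U1 = triu(A(p,p));
    A(p,rest)    = L1 \ A(p,rest);                       % triangular solve (DTRSM-like)
    A(rest,p)    = A(rest,p) / U1;                       % triangular solve (DTRSM-like)
    A(rest,rest) = A(rest,rest) - A(rest,p) * A(p,rest); % trailing update (DGEMM-like)
end
% A now holds the unit lower triangle of L and the upper triangle of U in place.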
LU Timing Profile (4-processor system) [Timing chart: threads with no lookahead; time spent in each component — DGETF2 (panel factorization), DLASWP left and right (row swaps), DTRSM (triangular solve), DGEMM (matrix multiply) — with execution proceeding in bulk-synchronous phases]
Adaptive Lookahead – Dynamic Event-Driven Multithreading • The ideas are not new; many papers use the DAG approach • The work here reorganizes the algorithms to use this approach
Provide Highest Performance (multi-core, x86) [Performance chart comparing PLASMA, MKL, ACML, and LAPACK]
Cholesky on the CELL (single-precision results) • 1 Cell (8 SPEs): 186 Gflop/s, 91% of peak, 97% of SGEMM peak • 2 Cells (16 SPEs): 365 Gflop/s, 89% of peak, 95% of SGEMM peak
If We Had a Small Matrix Problem • We would generate the DAG, find the critical path, and execute it • The DAG is too large to generate ahead of time • Do not generate it explicitly • Dynamically generate the DAG as we go • Machines will have large numbers of cores in a distributed fashion • Will have to engage in message passing • Distributed management • Locally, have a runtime system
The DAGs are Large • Here is the DAG for a factorization on a 20 x 20 matrix • For a large matrix, say O(10⁶), the DAG is huge • Many challenges for the software
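As a rough order-of-magnitude check (an estimate, not a figure from the slide): a tiled one-sided factorization of a t x t grid of tiles has on the order of t³/6 to t³/3 tasks, so t = 20 already gives on the order of a thousand tasks, and a matrix large enough to need hundreds of tiles per side pushes the DAG into the millions of tasks — far too many to build and store ahead of time.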
Why Hybrid Computing? • Hardware to Software Trends
Achievable Peak for GEMM on Multicore and GPUs • Can we make the transition from LAPACK to MAGMA with minor modifications (e.g., relying mostly on highly efficient BLAS)? • In general, NO, as is evident from current MAGMA results, where the high-performance results are based on new techniques for extracting parallelism and reducing communication
Extracting Parallelism • Algorithms as DAGs: current multicore algorithms use small tasks/tiles, while hybrid CPU+GPU algorithms mix small and large tasks • Approach: • The critical path is done on the CPU and overlapped with the GPU work whenever possible through proper task scheduling (e.g., look-ahead) • Algorithmic changes (possibly less stable) • Challenges: • Splitting algorithms into tasks, e.g., in a way that is "heterogeneity-aware", "auto-tuned", etc. • Scheduling task execution • New algorithms, and studies of their associated numerical stability • Reusing current multicore results (even though multicore and hybrid computing differ) • [Current multicore efforts are based on "tiled" algorithms and "uniform task splitting" with "block data layout"; current hybrid work leans more toward standard data layout, algorithms with variable block sizes, and large GPU tasks with large CPU-GPU data transfers to minimize latency overheads] • [Hybrid computing presents more opportunity to match algorithmic requirements to the underlying architecture components, e.g., the current main factorizations; how about the Hessenberg reduction, which is hard (an open problem) on multicore?]
CPU: Intel Xeon 2.33 GHz; GPU: NVIDIA GTX280 • Here we considered 3 steps of iterative refinement on symmetric positive definite matrices (using Cholesky); a sketch of the scheme follows below
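A hedged MATLAB sketch of that SPD scheme (illustrative only: the matrix and right-hand side are synthetic, and the fixed count of 3 refinement steps simply mirrors the slide rather than a convergence test):
n = 2000;  A = rand(n);  A = A*A' + n*eye(n);  b = rand(n,1);   % hypothetical SPD system
R = chol(single(A));                                            % O(n^3) factorization in 32-bit, A = R'*R
x = double( R \ (R' \ single(b)) );                             % 32-bit triangular solves
for k = 1:3                                                     % 3 refinement steps, as on the slide
    r = b - A*x;                                                % residual in 64-bit
    x = x + double( R \ (R' \ single(r)) );                     % 32-bit correction, 64-bit update
end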
Current Work • Algorithms (in particular LU) for multicore + GPU systems • Challenges: • How to split the computation • Software development • Tuning [Chart: work splitting for a single GPU plus an 8-core host]