The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View

Presentation Transcript


  1. The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick April 28, 2008

  2. Outline • Overview of Par Lab • Motivation & Scope • Driving Applications • Need for Parallel Libraries and Frameworks • Parallel Libraries • Success Metric • High performance (speed and accuracy) • Autotuning • Required Functionality • Ease of use • Summary of meeting goals, other talks • Identify opportunities for collaboration

  3. Outline • Overview of Par Lab • Motivation & Scope • Driving Applications • Need for Parallel Libraries and Frameworks • Parallel Libraries • Success Metric • High performance (speed and accuracy) • Autotuning • Required Functionality • Ease of use • Summary of meeting goals, other talks • Identify opportunities for collaboration

  4. A Parallel Revolution, Ready or Not • Old Moore’s Law is over • No more doubling the speed of sequential code every 18 months • New Moore’s Law is here • 2X processors (“cores”) per chip every technology generation, but at the same clock rate • A sea change for the HW & SW industries, since it changes the model of programming and debugging

  5. “Motif" Popularity (Red HotBlue Cool) • How do compelling apps relate to 13 motifs?

  6. “Motif" Popularity (Red HotBlue Cool) • How do compelling apps relate to 13 motifs?

  7. Choosing Driving Applications • “Who needs 100 cores to run M/S Word?” • Need compelling apps that use 100s of cores • How did we pick applications? • Enthusiastic expert application partner, leader in field, promise to help design, use, evaluate our technology • Compelling in terms of likely market or social impact, with short term feasibility and longer term potential • Requires significant speed-up, or a smaller, more efficient platform to work as intended • As a whole, applications cover the most important • Platforms (handheld, laptop, games) • Markets (consumer, business, health)

  8. Compelling Client Applications • Image query by example (image database of 1000s of images) • Music/Hearing • Robust speech input • Parallel browser • Personal health

  9. “Motif" Popularity (Red HotBlue Cool) • How do compelling apps relate to 13 motifs?

  10. Par Lab Research Overview • Theme: easy to write correct programs that run efficiently on manycore • [Software-stack diagram, top to bottom:] • Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser • Motifs • Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Static Verification, Type Systems • Efficiency Layer: Efficiency Languages, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers, Legacy Code • Correctness and efficiency tools spanning both layers: Directed Testing, Sketching, Dynamic Checking, Debugging with Replay, Diagnosing Power/Performance • OS: Legacy OS, OS Libraries & Services, Hypervisor • Arch.: Multicore/GPGPU, RAMP Manycore

  11. Par Lab Research Overview • [Software-stack diagram repeated from slide 10 as a section divider]

  12. Developing Parallel Software • 2 types of programmers → 2 layers • Efficiency Layer (10% of today’s programmers) • Expert programmers build frameworks & libraries, hypervisors, … • “Bare metal” efficiency possible at the Efficiency Layer • Productivity Layer (90% of today’s programmers) • Domain experts / naïve programmers productively build parallel apps using frameworks & libraries • Frameworks & libraries composed to form app frameworks • Effective composition techniques allow the efficiency programmers to be highly leveraged → create a language for Composition and Coordination (C&C) • Talk by Kathy Yelick

  13. Par Lab Research Overview • [Software-stack diagram repeated from slide 10 as a section divider]

  14. Outline • Overview of Par Lab • Motivation & Scope • Driving Applications • Need for Parallel Libraries and Frameworks • Parallel Libraries • Success Metric • High performance (speed and accuracy) • Autotuning • Required Functionality • Ease of use • Summary of meeting goals, other talks • Identify opportunities for collaboration

  15. Success Metric - Impact • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >86M web hits @ Netlib (incl. CLAPACK, LAPACK95) • 35K hits/day • [Image captions: Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000), built on ScaLAPACK; Xiaoye Li: Sparse LU]

  16. High Performance (Speed and Accuracy) • Matching Algorithms to Architectures (8 talks) • Autotuning – generate fast algorithms automatically depending on architecture and problem • Communication-Avoiding Linear Algebra – avoiding latency and bandwidth costs • Faster Algorithms (2 talks) • Symmetric eigenproblem (O(n²) instead of O(n³)) • Sparse LU factorization • More accurate algorithms (2 talks) • Either at “usual” speed, or at any cost • Structure-exploiting algorithms • Roots(p) (O(n²) instead of O(n³))

  17. High Performance (Speed and Accuracy) • Matching Algorithms to Architectures (8 talks) • Autotuning – generate fast algorithms automatically depending on architecture and problem • Communication-Avoiding Linear Algebra – avoiding latency and bandwidth costs • Faster Algorithms (2 talks) • Symmetric eigenproblem (O(n²) instead of O(n³)) • Sparse LU factorization • More accurate algorithms (2 talks) • Either at “usual” speed, or at any cost • Structure-exploiting algorithms • Roots(p) (O(n²) instead of O(n³))

  18. Automatic Performance Tuning • Writing high-performance software is hard • Ideal: get a high fraction of peak performance from one algorithm • Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, … • The best choice can require knowing a lot of applied mathematics and computer science, and changes with each new hardware or compiler release • Goal: automation • Generate and search a space of algorithms • Past successes: PHiPAC, ATLAS, FFTW, Spiral • Many conferences, DOE projects, …
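To make "generate and search" concrete, here is a minimal C sketch of the search half: time every candidate register-block size and keep the fastest. The csr_matrix/bcsr_matrix types and the convert_to_bcsr, spmv_blocked, and free_bcsr helpers are hypothetical stand-ins for generated kernel variants, not any real library's API; a real autotuner also uses performance models (e.g. the fill ratio discussed below) to prune the search.

    #include <time.h>
    #include <float.h>

    /* Hypothetical interfaces, one generated kernel per block size. */
    typedef struct csr_matrix csr_matrix;
    typedef struct bcsr_matrix bcsr_matrix;
    bcsr_matrix *convert_to_bcsr(const csr_matrix *A, int r, int c);
    void spmv_blocked(const bcsr_matrix *A, const double *x, double *y);
    void free_bcsr(bcsr_matrix *A);

    /* Time every register-block size up to 8x8 and keep the fastest. */
    void tune_spmv(const csr_matrix *A, const double *x, double *y,
                   int *best_r, int *best_c)
    {
        double best_time = DBL_MAX;
        for (int r = 1; r <= 8; r++)
            for (int c = 1; c <= 8; c++) {
                bcsr_matrix *Arc = convert_to_bcsr(A, r, c); /* pads with zeros */
                clock_t t0 = clock();
                for (int trial = 0; trial < 10; trial++)
                    spmv_blocked(Arc, x, y);
                double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
                if (t < best_time) { best_time = t; *best_r = r; *best_c = c; }
                free_bcsr(Arc);
            }
    }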

  19. The Difficulty of Tuning SpMV

    // y <- y + A*x
    for all nonzero A(i,j):
        y(i) += A(i,j) * x(j)

    // Compressed sparse row (CSR)
    for each row i:
        t = 0
        for k = row[i] to row[i+1]-1:
            t += A[k] * x[J[k]]    // A[] holds values, J[] column indices
        y[i] += t

  • Exploit 8x8 dense blocks
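A minimal C sketch of the blocked alternative, register-blocked (BCSR) SpMV, shown here for 2x2 blocks; an autotuner generates one such kernel per candidate size, up to the 8x8 blocks mentioned above. The array names (val, brow, bcol) and layout are illustrative assumptions, not taken from the slides.

    /* SpMV y += A*x for a BCSR matrix with 2x2 register blocks.
     * brow[i]..brow[i+1]-1 indexes the blocks of block-row i; bcol[b] is
     * the block-column; val stores each 2x2 block contiguously, row-major. */
    void spmv_bcsr_2x2(int nblockrows, const int *brow, const int *bcol,
                       const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nblockrows; i++) {
            double y0 = 0.0, y1 = 0.0;          /* accumulate in registers */
            for (int b = brow[i]; b < brow[i+1]; b++) {
                const double *v  = val + 4 * b;      /* the 2x2 block */
                const double *xp = x + 2 * bcol[b];
                y0 += v[0] * xp[0] + v[1] * xp[1];
                y1 += v[2] * xp[0] + v[3] * xp[1];
            }
            y[2*i]   += y0;
            y[2*i+1] += y1;
        }
    }

The payoff is fewer index loads (one column index per block instead of per entry) and reuse of x in registers; the cost is the zero fill quantified on the next slides.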

  20. Speedups on Itanium 2: The Need for Search • [Plot: SpMV performance over register-block sizes; the reference implementation reaches 7.6% of peak (Mflop/s), the best blocked version 31.1%]

  21. Speedups on Itanium 2: The Need for Search • [Same plot, annotated: the best block size is 4x2, reaching 31.1% of peak vs. 7.6% for the reference]

  22. SpMV Performance: raefsky3 • [Plot of blocked SpMV performance on the raefsky3 test matrix; not reproduced]

  23. More Surprises Tuning SpMV • A more complex example: 3x3 blocking • Logical grid of 3x3 cells

  24. Extra Work Can Improve Efficiency • Continuing the example: 3x3 blocking on a logical grid of 3x3 cells • Pad with zeros • “Fill ratio” = 1.5, i.e. 1.5x as many flops • On Pentium III: 1.5x speedup! (2/3 the time), so the flop rate rises by 1.5² = 2.25x
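A sketch of how a tuner can compute the fill ratio exactly for a candidate r x c blocking of an n x n CSR matrix; a production tuner such as OSKI estimates it by sampling rows instead, so this brute-force version is an illustrative assumption, not OSKI's implementation.

    #include <stdlib.h>

    /* Fill ratio = (entries stored after padding to r x c blocks) /
     * (true nonzeros). row/col are the CSR index arrays; nnz = row[n]. */
    double fill_ratio(int n, const int *row, const int *col, int r, int c)
    {
        long nnz = row[n], blocks = 0;
        char *seen = calloc((size_t)(n + c - 1) / c, 1); /* block-col marks */
        for (int bi = 0; bi < (n + r - 1) / r; bi++) {
            int lo = bi * r, hi = (lo + r < n) ? lo + r : n;
            for (int i = lo; i < hi; i++)            /* count distinct blocks */
                for (int k = row[i]; k < row[i+1]; k++)
                    if (!seen[col[k] / c]) { seen[col[k] / c] = 1; blocks++; }
            for (int i = lo; i < hi; i++)            /* reset the marks */
                for (int k = row[i]; k < row[i+1]; k++)
                    seen[col[k] / c] = 0;
        }
        free(seen);
        return (double)(blocks * r * c) / (double)nnz;
    }

For the 3x3 example above this returns 1.5: the padded blocks do 1.5x the flops, yet the kernel runs at a 2.25x higher flop rate, netting the 1.5x speedup.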

  25. Autotuned Performance of SpMV (Intel Clovertown, AMD Opteron, Sun Niagara2 “Huron”) • [Stacked bar charts of cumulative optimizations: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), FW fix and array padding (Niagara2), etc.] • Clovertown was already fully populated with DIMMs • Gave Opteron as many DIMMs as Clovertown • Firmware update for Niagara2 • Array padding to avoid inter-thread conflict misses • PPEs use ~1/3 of Cell chip area
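As one example from the optimization stack above, a sketch of software prefetching in the CSR inner loop. __builtin_prefetch is a GCC/Clang extension (a hint, so running past the end of the arrays is harmless), and the prefetch distance is exactly the kind of parameter an autotuner searches over; the value 16 here is an illustrative guess, not a tuned result.

    #define PF_DIST 16   /* tunable prefetch distance, in elements */

    void spmv_csr_prefetch(int n, const int *row, const int *col,
                           const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double t = 0.0;
            for (int k = row[i]; k < row[i+1]; k++) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);
                __builtin_prefetch(&col[k + PF_DIST], 0, 0);
                t += val[k] * x[col[k]];
            }
            y[i] += t;
        }
    }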

  26. Autotuning SpMV • Large search space of possible optimizations • Large speedups possible • Parallelism adds more! • Later talks • Sam Williams on tuning SpMV for a variety of multicore and other platforms • Ankit Jain on an easy-to-use system for incorporating autotuning into applications • Kaushik Datta on tuning the special case of stencils • Rajesh Nishtala on tuning collective communications • But don’t you still have to write difficult code to generate the search space?

  27. Program Synthesis • The best implementation/data structure is hard to write and hard to identify; don’t do this by hand • Sketching: code generation using 2QBF • Spec: simple implementation (3-loop 3D stencil) • Sketch: optimized skeleton (5 loops, missing some indices/bounds) • Result: optimized code (tiled, prefetched, time-skewed) • Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils

  28. Communication-Avoiding Linear Algebra (CALU) • Exponentially growing gaps between • Floating point time << 1/Network BW << Network Latency • Improving 59%/year vs 26%/year vs 15%/year • Floating point time << 1/Memory BW << Memory Latency • Improving 59%/year vs 23%/year vs 5.5%/year • Goal: reorganize linear algebra to avoid communication • Not just hiding communication (speedup ≤ 2x) • Arbitrary speedups possible • Possible for both dense and sparse linear algebra

  29. CALU Summary (1/4) • QR or LU decomposition of an m x n matrix, m >> n • Parallel implementation • Conventional: O(n log p) messages • New: O(log p) messages - optimal • Performance: QR 5x faster on a cluster, LU 7x faster • Serial implementation with fast memory of size W • Conventional: O(mn/W) moves of data from slow to fast memory (mn/W = how many times larger the matrix is than fast memory) • New: O(1) moves of data • Performance: out-of-core QR only 2x slower than having infinite DRAM • Expect gains with multicore as well • Price: • Some redundant computation (but flops are cheap!) • Different representation of the answer for QR (tree-structured) • LU stable in practice so far, but it is not GEPP
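A structural sketch of how the parallel QR (TSQR) reaches O(log p) messages: each processor factors its local block once, then R factors are combined pairwise up a binary tree, yielding the tree-structured representation mentioned above. This is an illustration of the communication pattern only; local_qr and extract_r are hypothetical helpers (e.g. wrappers over LAPACK's dgeqrf), matrices are assumed row-major, and p is assumed to be a power of two.

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    /* Hypothetical helpers: local_qr() QR-factors a row-major m x n matrix
     * in place, leaving R in its top n x n block; extract_r() copies that
     * block out. Both could wrap LAPACK's dgeqrf. */
    void local_qr(int m, int n, double *a);
    void extract_r(int m, int n, const double *a, double *r);

    /* TSQR sketch: O(log p) messages on the critical path instead of the
     * conventional O(n log p). On return, rank 0 holds the final R. */
    void tsqr(int m_local, int n, double *A_local, double *R, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        double *stack = malloc(sizeof(double) * 2 * n * n);
        local_qr(m_local, n, A_local);             /* local factorization */
        extract_r(m_local, n, A_local, stack);     /* my R -> top half    */

        for (int step = 1; step < p; step *= 2) {
            if (rank % (2 * step) == 0) {          /* parent: recv, combine */
                MPI_Recv(stack + n * n, n * n, MPI_DOUBLE, rank + step, 0,
                         comm, MPI_STATUS_IGNORE);
                local_qr(2 * n, n, stack);         /* QR of stacked [R1;R2] */
            } else {                               /* child: send R, done  */
                MPI_Send(stack, n * n, MPI_DOUBLE, rank - step, 0, comm);
                break;
            }
        }
        if (rank == 0) memcpy(R, stack, sizeof(double) * n * n);
        free(stack);
    }

The redundant computation named as the "price" is visible here: every combine step re-factors a small 2n x n stack of R factors, but each such QR is tiny compared with the local factorization.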

  30. CALU Summary (2/4) • QR or LU decomposition of an n x n matrix • Communication lower by a factor of b = block size • Lots of speedup possible (modeled and measured) • Modeled speedups of new QR over ScaLAPACK • IBM Power 5 (512 procs): up to 9.7x • Petascale (8K procs): up to 22.9x • Grid (128 procs): up to 11x • Measured and modeled speedups of new LU over ScaLAPACK • IBM Power 5 (Bassi): up to 2.3x speedup (measured) • Cray XT4 (Franklin): up to 1.8x speedup (measured) • Petascale (8K procs): up to 80x (modeled) • Speed limit: Cholesky? Matmul? • Extends to sparse LU • Communication more dominant, so the payoff may be higher • Speed limit: sparse Cholesky? • Talk by Xiaoye Li on an alternative

  31. CALU Summary (3/4) • Take k steps of a Krylov subspace method • GMRES, CG, Lanczos, Arnoldi • Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio • Parallel implementation • Conventional: O(k log p) messages • New: O(log p) messages - optimal • Serial implementation • Conventional: O(k) moves of data from slow to fast memory • New: O(1) moves of data - optimal • Can incorporate some preconditioners • Need to be able to “compress” interactions between distant i, j • Hierarchical, semiseparable matrices … • Lots of speedup possible (modeled and measured) • Price: some redundant computation • Talks by Marghoob Mohiyuddin and Mark Hoemmen
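The O(1) data-movement claim comes from the "matrix powers" kernel: fetch k levels of ghost values in one exchange, then compute x, Ax, …, A^k x with no further communication, at the price of recomputing a shrinking halo. A minimal C sketch for a 1D 3-point stencil, the simplest "well-partitioned" matrix; the layout is an illustrative assumption, not the general graph-based implementation described in the talks.

    /* Compute V[s] = A^s * x for s = 0..k, where A is the 1D 3-point
     * Laplacian, given a local vector extended with k ghost values on each
     * side (one message per neighbor total, instead of one per step).
     * Row s of V has width w = n_local + 2k; only entries at distance >= s
     * from the ends are valid, so after k steps exactly the n_local owned
     * entries remain. Caller fills row 0 with the extended x. */
    void matrix_powers_1d(int n_local, int k, double *V)
    {
        int w = n_local + 2 * k;
        for (int s = 1; s <= k; s++) {
            const double *prev = V + (s - 1) * w;
            double *cur = V + s * w;
            for (int i = s; i < w - s; i++)
                cur[i] = -prev[i-1] + 2.0 * prev[i] - prev[i+1];
        }
    }

The redundant computation is the triangular region of ghost entries recomputed at each level; for modest surface-to-volume ratios it is a small fraction of the local work.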

  32. CALU Summary (4/4) • Lots of related work • Some going back to 1960’s • Reports discuss this comprehensively, we will not • Our contributions • Several new algorithms, improvements on old ones • Unifying parallel and sequential approaches to avoiding communication • Time for these algorithms has come, because of growing communication costs • Systematic examination of as much of linear algebra as we can • Why just linear algebra?

  33. Linear Algebra on GPUs • Important part of the architectural space to explore • Talk by Vasily Volkov • NVIDIA has licensed our BLAS (SGEMM) • Fastest implementations of dense LU, Cholesky, QR • 80-90% of “peak” • Require various GPU-specific optimizations • Use the CPU for BLAS1 and BLAS2, the GPU for BLAS3 • In LU, replace TRSM by TRTRI + GEMM (about as stable as GEPP)
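The TRSM-to-GEMM substitution can be sketched in portable C with CBLAS/LAPACKE; the GPU version makes the analogous cuBLAS calls. This is an illustration of the idea, assuming column-major storage and a unit-lower-triangular panel factor, not Volkov's implementation.

    #include <stdlib.h>
    #include <string.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Replace the triangular solve X = inv(L)*B (TRSM) with an explicit
     * inverse plus a matrix multiply (TRTRI + GEMM), so the bulk of the
     * work runs as BLAS3. L is the nb x nb unit-lower-triangular panel
     * factor; B and X are nb x n. */
    void trsm_via_gemm(int nb, int n, const double *L, int ldl,
                       const double *B, int ldb, double *X, int ldx)
    {
        /* Copy L's lower triangle into a zeroed buffer so GEMM sees exact
           zeros above the diagonal, force the unit diagonal, then invert. */
        double *Linv = calloc((size_t)nb * nb, sizeof *Linv);
        for (int j = 0; j < nb; j++) {
            memcpy(Linv + j * nb + j, L + j * ldl + j,
                   (size_t)(nb - j) * sizeof *Linv);
            Linv[j * nb + j] = 1.0;
        }
        LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', nb, Linv, nb);

        /* X = inv(L) * B as a single GEMM */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nb, n, nb, 1.0, Linv, nb, B, ldb, 0.0, X, ldx);
        free(Linv);
    }

Explicit inversion is normally avoided for stability, but for the well-conditioned unit-triangular panels of partial-pivoting LU it is observed to be about as stable as GEPP, as the slide notes.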

  34. High Performance (Speed and Accuracy) • Matching Algorithms to Architectures (8 talks) • Autotuning – generate fast algorithms automatically depending on architecture and problem • Communication-Avoiding Linear Algebra – avoiding latency and bandwidth costs • Faster Algorithms (2 talks) • Symmetric eigenproblem (O(n²) instead of O(n³)) • Sparse LU factorization • More accurate algorithms (2 talks) • Either at “usual” speed, or at any cost • Structure-exploiting algorithms • Roots(p) (O(n²) instead of O(n³))

  35. Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem • Talk by Osni Marques / B. Parlett / I. Dhillon / C. Voemel • 2006 SIAM Linear Algebra Prize for Parlett, Dhillon • Parallel Sparse LU • Talk by Xiaoye Li • Up to 10x faster HQR • R. Byers / R. Mathias / K. Braman • 2003 SIAM Linear Algebra Prize • Extensions to QZ: • B. Kågström / D. Kressner / T. Adlerborn • Faster Hessenberg, tridiagonal, bidiagonal reductions: • R. van de Geijn / E. Quintana-Orti • C. Bischof / B. Lang • G. Howell / C. Fulton

  36. Collaborators • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak • Other Academic Institutions • UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara • TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • INRIA, LBL • Industrial Partners (predating Par Lab) • Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA

  37. High Performance (Speed and Accuracy) • Matching Algorithms to Architectures (8 talks) • Autotuning – generate fast algorithms automatically depending on architecture and problem • Communication-Avoiding Linear Algebra – avoiding latency and bandwidth costs • Faster Algorithms (2 talks) • Symmetric eigenproblem (O(n²) instead of O(n³)) • Sparse LU factorization • More accurate algorithms (2 talks) • Either at “usual” speed, or at any cost • Structure-exploiting algorithms • Roots(p) (O(n²) instead of O(n³))

  38. More Accurate Algorithms • Motivation: user requests, debugging • Iterative refinement for Ax=b and least squares • “Promise” the right answer for O(n²) additional cost • Talk by Jason Riedy • Arbitrary-precision versions of everything, using your favorite multiple-precision package • Talk by Yozo Hida • Jacobi-based SVD: faster than QR, and can be arbitrarily more accurate • Drmac / Veselic
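The classical iterative-refinement loop behind the O(n²)-extra-cost promise, sketched with LAPACKE/CBLAS: factor once (O(n³)), then each refinement step costs only a residual and a pair of triangular solves (O(n²)). Production refinement computes the residual in higher precision than the factorization; both are double here for brevity, so this is a shape sketch rather than the accuracy-guaranteeing version in the talk.

    #include <stdlib.h>
    #include <string.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Solve Ax = b, then refine: r = b - A*x, solve A*d = r with the
     * existing LU factors, x += d. Column-major, one right-hand side. */
    void solve_and_refine(int n, const double *A, const double *b,
                          double *x, int steps)
    {
        double *LU = malloc((size_t)n * n * sizeof *LU);
        double *r = malloc((size_t)n * sizeof *r);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

        memcpy(LU, A, (size_t)n * n * sizeof *LU);
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, LU, n, ipiv);  /* A = P*L*U */

        memcpy(x, b, (size_t)n * sizeof *x);
        LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, 1, LU, n, ipiv, x, n);

        for (int it = 0; it < steps; it++) {
            memcpy(r, b, (size_t)n * sizeof *r);              /* r = b    */
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,    /* r -= A*x */
                        -1.0, A, n, x, 1, 1.0, r, 1);
            LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, 1, LU, n, ipiv, r, n);
            cblas_daxpy(n, 1.0, r, 1, x, 1);                  /* x += d   */
        }
        free(LU); free(r); free(ipiv);
    }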

  39. What could go into linear algebra libraries?

    For all linear algebra problems
      For all matrix/problem structures
        For all data types
          For all architectures and networks
            For all programming interfaces
              Produce best algorithm(s) w.r.t. performance and accuracy
              (including condition estimates, etc.)

  Need to prioritize, automate, and enlist help!

  40. What do users want? (1/2) • Performance, ease of use, functionality, portability • Composability • On multicore, expect to implement dense codes via DAG scheduling (Dongarra’s PLASMA) • Talk by Krste Asanovic / Heidi Pan on threads • Reproducibility • Made challenging by nonassociativity of floating point • Ongoing collaborations on Driving Apps • Jointly analyzing needs • Talk by T. Keaveny on Medical Application • Other apps so far: mostly dense and sparse linear algebra, FFTs • some interesting structured needs emerging
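Why the reproducibility item above is hard, in a few lines: floating-point addition is not associative, so any change in reduction order (thread count, schedule, blocking) can change the bits of a sum. A self-contained demonstration:

    #include <stdio.h>

    int main(void)
    {
        double a = 1e16, b = -1e16, c = 1.0;
        /* c is absorbed when added to a or b first: the spacing between
           adjacent doubles near 1e16 is 2.0, so -1e16 + 1.0 == -1e16 */
        printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
        printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0 */
        return 0;
    }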

  41. What do users want? (2/2) • DOE/NSF User Survey • Small but interesting sample at www.netlib.org/lapack-dev • What matrix sizes do you care about? • 1000s: 34% • 10,000s: 26% • 100,000s or 1Ms: 26% • How many processors, on distributed memory? • >10: 34%, >100: 31%, >1000: 19% • Do you use more than double precision? • Sometimes or frequently: 16% • New graduate program in CSE with 106 faculty from 18 departments • New needs may emerge

  42. Highlights of New Dense Functionality • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai, Wang • More generalized Sylvester/Lyapunov eqns: • Kågström, Jonsson, Granat • Structured eigenproblems • Selected matrix polynomials: • Mehrmann

  43. Organizing Linear Algebra • www.netlib.org/lapack • www.netlib.org/scalapack • gams.nist.gov • www.netlib.org/templates • www.cs.utk.edu/~dongarra/etemplates

  44. Improved Ease of Use • Which do you prefer?

    A \ B

    CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

    CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF,
                  DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX,
                  DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )

  45. Ease of Use: One approach • Easy interfaces vs access to details • Some users want access to all details, because • Peak performance matters • Control over memory allocation • Other users want “simpler” interface • Automatic allocation of workspace • No universal agreement across systems on “easiest interface” • Leave decision to higher level packages • Keep expert driver / simple driver / computational routines • Add wrappers for other languages • Fortran95, Java, Matlab, Python, even C • Automatic allocation of workspace • Add wrappers to convert to “best” parallel layout
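A sketch of the "simple driver plus wrappers" idea at the C level: a backslash-style solve that hides pivot and workspace allocation behind LAPACKE's dgesv simple driver; a distributed version would wrap PDGESV the same way. The function name and interface are illustrative, not part of any existing package.

    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve A * X = B in the spirit of "A \ B". Column-major; A (n x n)
     * and B (n x nrhs) are overwritten, the solution is returned in B.
     * Returns 0 on success, >0 if U is exactly singular. */
    int solve(int n, int nrhs, double *A, double *B)
    {
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        if (!ipiv) return -1;
        lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
                                        A, n, ipiv, B, n);
        free(ipiv);
        return (int)info;
    }

Expert users who need control over memory and error bounds would still call the expert driver (PDGESVX) directly, which is why the slide argues for keeping both layers.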

  46. Outline • Overview of Par Lab • Motivation & Scope • Driving Applications • Need for Parallel Libraries and Frameworks • Parallel Libraries • Success Metric • High performance (speed and accuracy) • Autotuning • Required Functionality • Ease of use • Summary of meeting goals, other talks

  47. Some goals for the meeting • Introduce Par Lab • Describe numerical library efforts in detail • Exchange information • User needs, tools, goals • Identify opportunities for collaboration

  48. Summary of other talks (1) • Monday, April 28 (531 Cory) • 12:00 - 12:45 Jim Demmel - Overview of Par Lab / Numerical Libraries • 12:45 - 1:00 Avneesh Sud (Microsoft) - Introduction to library effort at Microsoft • 1:00 - 1:45 Sam Williams / Ankit Jain - Tuning sparse matrix-vector multiply / Parallel OSKI • 1:45 - 1:50 Break • 1:50 - 2:20 Marghoob Mohiyuddin - Avoiding communication in SpMV-like kernels • 2:20 - 2:50 Mark Hoemmen - Avoiding communication in Krylov subspace methods • 2:50 - 3:00 Break • 3:00 - 3:30 Rajesh Nishtala - Tuning collective communication • 3:30 - 4:00 Yozo Hida - High accuracy linear algebra • 4:00 - 4:25 Jason Riedy - Iterative refinement in linear algebra • 4:25 - 4:30 Break • 4:30 - 5:00 Tony Keaveny - Medical image analysis in Par Lab • 5:00 - 5:30 Ras Bodik / Armando Solar-Lezama - Program synthesis by sketching • 5:30 - 6:00 Vasily Volkov - Linear algebra on GPUs

  49. Summary of other talks (2) • Tuesday, April 29 (Wozniak Lounge) • 9:00 - 10:00 Kathy Yelick - Programming systems for Par Lab • 10:00 - 10:30 Kaushik Datta - Tuning stencils • 10:30 - 11:00 Xiaoye Li - Parallel sparse LU factorization • 11:00 - 11:30 Osni Marques - Parallel symmetric eigensolvers • 11:30 - 12:00 Krste Asanovic / Heidi Pan - Thread system

  50. Extra Slides