Retreat into BLIS Field G. Van Zee
Funding and publications • NSF • Award OCI-1148125: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences. (Funded June 1, 2012 - May 31, 2015.) • Other sources (e.g. Microsoft) • ACM Transactions on Mathematical Software • “BLIS: A Framework for Rapid Instantiation of BLAS Functionality” (submitted) • “The BLIS Framework: Experiments in Portability” (submitted)
Preview • What is BLIS? • Why BLIS and not BLAS? How is BLIS an improvement over existing BLAS implementations? • I’ve heard BLIS will make me more productive. How? • What kind of performance can I expect? • (and many other questions)
What is BLIS? • BLAS-like Library Instantiation Software • BLIS is a framework for • Quickly instantiating high-performance BLAS-like libraries • “Why ‘BLAS-like’?”… • For now, just assume BLAS-like = BLAS
What is BLAS? • Basic Linear Algebra Subprograms • Level 1: vector-vector [Lawson et al. 1979] • Level 2: matrix-vector [Dongarra et al. 1988] • Level 3: matrix-matrix [Dongarra et al. 1990] • Why are BLAS important?
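For orientation, here is what a typical Level-3 call looks like through the CBLAS interface; the matrices and values below are an illustrative example, not taken from the slides.

```c
#include <cblas.h>

int main( void )
{
    /* C := alpha*A*B + beta*C with 2x2 column-major matrices (dgemm). */
    double A[ 4 ] = { 1, 2, 3, 4 };   /* A = [ 1 3 ; 2 4 ] */
    double B[ 4 ] = { 5, 6, 7, 8 };   /* B = [ 5 7 ; 6 8 ] */
    double C[ 4 ] = { 0, 0, 0, 0 };

    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 2, 2, 2,        /* m, n, k       */
                 1.0, A, 2,      /* alpha, A, lda */
                 B, 2,           /* B, ldb        */
                 0.0, C, 2 );    /* beta, C, ldc  */

    return 0;   /* C now holds A*B = [ 23 31 ; 34 46 ] */
}
```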
Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc.
Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc. • The idea is simple: • if the BLAS interface is “standardized”, and • if optimized / high-performance implementation exists for your architecture then higher-level applications can easily benefit
Why are BLAS important? • Plenty of BLAS implementations available • Vendor • ACML (AMD), ESSL (IBM), MKL (Intel), cuBLAS (NVIDIA), MLIB (HP), MathKeisan (NEC), Accelerate (Apple), etc. • Open source • netlib, GotoBLAS, OpenBLAS, ATLAS, etc. • So why do we need BLIS?
Why do we need BLIS? • Actually, there are two questions • Why do we need BLIS? • Why should we want BLIS? • Let’s look at the first question
Why do we need BLIS? • The BLAS interface is limiting for some applications • To be expected – it was finalized 20-30 years ago! • How exactly is the BLAS interface limiting? • After all, it’s served us well for a long time
Limitations of BLAS interface • Interface only allows column-major storage • We want to support column-major storage, row-major storage, and general stride (tensors). • Further yet, we want to support operands of mixed storage formats. Example: C := C + A B, where A is column-stored, B is row-stored, and C has general stride (see the sketch below).
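The flexibility being asked for can be stated in a few lines of C. This is only a minimal sketch of the addressing convention, not BLIS code: a matrix is a buffer plus a row stride (rs) and a column stride (cs), so column-major storage is the special case rs = 1, row-major storage the special case cs = 1, and general stride allows both to exceed 1.

```c
#include <stdio.h>
#include <stddef.h>

/* Element (i,j) of a matrix described by (buffer, row stride, column stride). */
static double* elem( double* buf, int i, int j, int rs, int cs )
{
    return &buf[ (size_t)i * rs + (size_t)j * cs ];
}

int main( void )
{
    double a[ 6 ] = { 1, 2, 3, 4, 5, 6 };   /* storage for a 2 x 3 matrix */

    /* Column-major: rs = 1, cs = leading dimension (2); (0,1) -> 3.0 */
    printf( "col-major (0,1): %g\n", *elem( a, 0, 1, 1, 2 ) );

    /* Row-major: rs = leading dimension (3), cs = 1; (0,1) -> 2.0 */
    printf( "row-major (0,1): %g\n", *elem( a, 0, 1, 3, 1 ) );

    /* General stride: both rs and cs may exceed 1 (e.g. a tensor slice). */
    return 0;
}
```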
Limitations of BLAS interface • Why do we need general stride storage?
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice?
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice? • It may be non-contiguous in both dimensions (see the sketch below)
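To see why, here is a minimal sketch (my own illustration, with an assumed first-index-fastest tensor layout): fixing one index of a contiguously stored three-dimensional tensor yields a matrix whose row and column strides are both greater than one, which the BLAS interface cannot describe but general-stride storage handles directly.

```c
#include <stdio.h>

int main( void )
{
    /* A 4 x 5 x 6 tensor stored contiguously, first index fastest:
       element (i,j,l) lives at offset i + j*4 + l*4*5. */
    const int m = 4, n = 5, k = 6;
    double t[ 4 * 5 * 6 ];
    for ( int idx = 0; idx < m * n * k; ++idx ) t[ idx ] = idx;

    /* Fix the first index (i = 2). The resulting 5 x 6 slice has
       row stride m = 4 and column stride m*n = 20: non-contiguous in
       both dimensions, yet valid general-stride matrix storage. */
    double* slice = &t[ 2 ];
    int     rs    = m;       /* stride between consecutive rows    */
    int     cs    = m * n;   /* stride between consecutive columns */

    printf( "slice(1,3) = %g\n", slice[ 1 * rs + 3 * cs ] );   /* t[66] */
    return 0;
}
```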
Limitations of BLAS interface • Incomplete support for complex operations (no “conjugate without transposition”) • Examples: axpy, gemv, gemm, her, herk, trmv, trmm, trsv, trsm
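To make the limitation concrete (my own illustration, not from the slides): with an interface that only offers op(A) in {A, A^T, A^H}, applying a conjugated but untransposed matrix requires the workaround

```latex
\[
  \overline{A}\,x \;=\; \overline{\,A\,\overline{x}\,}
\]
```

that is, conjugate x, call gemv with no transpose, then conjugate the result: extra O(n) passes, extra workspace, and extra code at every call site. A “conjugate without transposition” option handles this inside the kernel at no extra cost.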
Limitations of BLAS interface • BLAS API is opaque • No uniform way to access lower-level kernels • Why would one want access to these kernels? • Optimize higher-level (LAPACK-level) operations • Control packing, computation for multithreading • Implement new operations (without “reinventing the wheel”)
Limitations of BLAS interface • Operation support has not changed in over two decades • BLAST Technical Forum attempted to ratify some improvements • Revisions largely ignored by implementors. Why? • Best guess: No official reference implementation
Why do we need BLIS? • Why does this mean we need BLIS? • The BLAS API itself cannot be changed • We can’t get a better interface by building a better BLAS implementation – we need something else altogether • This was actually one of the primary motivations for developing BLIS
Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed
Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed • This is why BLIS needs to exist
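As a rough illustration of what this buys the caller, here is a gemm call through BLIS's typed API as I recall it from roughly this time frame; the exact signature, type names, and constants should be checked against the BLIS documentation, so treat the sketch below as an assumption-laden example rather than the definitive interface.

```c
#include "blis.h"

int main( void )
{
    dim_t    m = 4, n = 4, k = 4;
    dcomplex alpha = { 1.0, 0.0 }, beta = { 0.0, 0.0 };
    dcomplex a[ 4 * 4 ] = { { 0.0, 0.0 } };   /* m x k, column-stored */
    dcomplex b[ 4 * 4 ] = { { 0.0, 0.0 } };   /* k x n, row-stored    */
    dcomplex c[ 4 * 4 ] = { { 0.0, 0.0 } };   /* m x n, column-stored */

    bli_init();

    /* C := alpha * conj(A) * B + beta * C, where
       A uses row stride 1, column stride m (column-major),
       B uses row stride n, column stride 1 (row-major), and
       BLIS_CONJ_NO_TRANSPOSE requests conjugation without transposition --
       none of which the BLAS interface can express in one call. */
    bli_zgemm( BLIS_CONJ_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
               m, n, k,
               &alpha, a, 1, m,
                       b, n, 1,
               &beta,  c, 1, m );

    bli_finalize();
    return 0;
}
```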
Why should we want BLIS? • Now, why should someone want BLIS?
Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer
Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer • If you’re a developer • As a framework, BLIS makes it easier to implement high-performance BLAS • Case study: Intel SCC
Why should we want BLIS? • How does BLIS make implementing high-performance BLAS easier? • First, let’s discuss: Why is it normally so time-consuming? • Let’s look at general matrix-matrix multiplication (gemm) as implemented by Kazushige Goto in GotoBLAS • [Goto and van de Geijn 2008]
The gemm algorithm (block diagrams omitted; the steps they depicted are summarized here) • Partition C and B into column panels of width NC • Partition A and the current panel of B along the k dimension into panels of depth KC • Pack the row panel of B into contiguous micro-panels of width NR • Partition A into blocks of height MC • Pack the block of A into contiguous micro-panels of height MR • Multiply the packed block of A by the packed panel of B to update the corresponding part of C
The gemm algorithm • Goto called this the “inner kernel” • Typically takes the shape of a block-panel multiply • Consists of three loops • Coded entirely in assembly language (≈ 2000 lines)
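To make the loop structure concrete, here is a self-contained, unoptimized C sketch of this blocking; it is my own illustration, not GotoBLAS or BLIS source, and the block sizes NC/KC/MC/NR/MR and helper names are placeholders. The two innermost loops plus the micro-kernel correspond roughly to what Goto hand-coded in assembly as the “inner kernel”.

```c
#include <stddef.h>

/* Sketch of the Goto gemm blocking for C += A*B, all matrices column-major.
   Real implementations tune NC/KC/MC/NR/MR per architecture and write the
   micro-kernel in assembly or vector intrinsics. */
enum { NC = 64, KC = 64, MC = 32, NR = 4, MR = 4 };
#define MIN( x, y ) ( (x) < (y) ? (x) : (y) )

/* Micro-kernel: C[mr x nr] += Ap * Bp, where Ap holds kc columns of MR
   elements and Bp holds kc rows of NR elements (zero-padded at the edges). */
static void ukernel( int kc, int mr, int nr, const double* Ap,
                     const double* Bp, double* C, int ldc )
{
    for ( int p = 0; p < kc; ++p )
        for ( int j = 0; j < nr; ++j )
            for ( int i = 0; i < mr; ++i )
                C[ i + (size_t)j*ldc ] += Ap[ p*MR + i ] * Bp[ p*NR + j ];
}

/* Pack a kc x nc panel of B into NR-wide micro-panels (zero-padded). */
static void pack_B( int kc, int nc, const double* B, int ldb, double* Bp )
{
    for ( int j = 0; j < nc; j += NR )
        for ( int p = 0; p < kc; ++p )
            for ( int jj = 0; jj < NR; ++jj )
                *Bp++ = ( j+jj < nc ) ? B[ p + (size_t)(j+jj)*ldb ] : 0.0;
}

/* Pack an mc x kc block of A into MR-tall micro-panels (zero-padded). */
static void pack_A( int mc, int kc, const double* A, int lda, double* Ap )
{
    for ( int i = 0; i < mc; i += MR )
        for ( int p = 0; p < kc; ++p )
            for ( int ii = 0; ii < MR; ++ii )
                *Ap++ = ( i+ii < mc ) ? A[ (i+ii) + (size_t)p*lda ] : 0.0;
}

void gemm_sketch( int m, int n, int k,
                  const double* A, int lda, const double* B, int ldb,
                  double* C, int ldc )
{
    static double Bp[ KC * NC ], Ap[ MC * KC ];

    for ( int jc = 0; jc < n; jc += NC ) {              /* NC partition */
        int nc = MIN( NC, n - jc );
        for ( int pc = 0; pc < k; pc += KC ) {          /* KC partition */
            int kc = MIN( KC, k - pc );
            pack_B( kc, nc, &B[ pc + (size_t)jc*ldb ], ldb, Bp );
            for ( int ic = 0; ic < m; ic += MC ) {      /* MC partition */
                int mc = MIN( MC, m - ic );
                pack_A( mc, kc, &A[ ic + (size_t)pc*lda ], lda, Ap );
                /* Goto's "inner kernel": block-panel multiply. */
                for ( int jr = 0; jr < nc; jr += NR )
                    for ( int ir = 0; ir < mc; ir += MR )
                        ukernel( kc, MIN( MR, mc-ir ), MIN( NR, nc-jr ),
                                 &Ap[ ir*kc ], &Bp[ jr*kc ],
                                 &C[ (ic+ir) + (size_t)(jc+jr)*ldc ], ldc );
            }
        }
    }
}
```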
Level-3 BLAS • So I just write one “inner kernel” and I’m done, right? • That would be great! But no.
Level-3 BLAS • General matrix multiply (gemm) • Nine cases: each of A and B may appear with no transpose, transpose (T), or conjugate-transpose (H)
Level-3 BLAS • So we need three packing routines (at least) • One for each of: No transpose, Transpose, Conjugate-transpose • Three more if packing of A and B isn’t consolidated
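As an illustration of what distinguishes the three variants (my own sketch, not code from any library), packing differs only in how the source operand is read; folding the cases into one routine via a flag is the kind of consolidation a framework can do once, instead of hand-writing each case.

```c
#include <complex.h>
#include <stddef.h>

typedef enum { NO_TRANSPOSE, TRANSPOSE, CONJ_TRANSPOSE } pack_trans_t;

enum { MR = 4 };

/* Pack one MR x kc micro-panel of a complex matrix A, reading A either
   as stored, transposed, or conjugate-transposed (edge padding omitted).
   In a traditional GotoBLAS-style library each case is a separate
   hand-written routine. */
static void pack_micropanel( pack_trans_t trans, int mr, int kc,
                             const double complex* A, int lda,
                             double complex* Ap )
{
    for ( int p = 0; p < kc; ++p )
        for ( int i = 0; i < mr; ++i )
        {
            double complex aip =
                ( trans == NO_TRANSPOSE ) ? A[ i + (size_t)p * lda ]
                                          : A[ p + (size_t)i * lda ];
            if ( trans == CONJ_TRANSPOSE ) aip = conj( aip );
            Ap[ p * MR + i ] = aip;   /* packed: MR elements per k-index */
        }
}
```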
Level-3 BLAS • Symmetric matrix multiplication (symm) • Four cases: A stored in the lower or upper triangle, applied from the left or from the right
Level-3 BLAS • Symmetric matrix multiplication (symm) • Needs a special packing routine for each case • Lower- and upper-stored A, left and right sides • Then, we can call the gemm inner kernel as if the block had no structure
Level-3 BLAS • So to support gemm and symm, we need one inner kernel and seven pack routines • Hermitian matrix multiply (hemm)? • Can reuse inner kernel • Needs different packing on matrix A (to conjugate the unstored regions) • Okay, one inner kernel and 11 pack routines • What else?
Level-3 BLAS • Symmetric rank-k update (syrk) • Four cases: C stored in the lower or upper triangle, updated by either A A^T or A^T A
Level-3 BLAS • Symmetric rank-k update (syrk) • Needs two special inner kernels • Lower- and upper-stored matrices C • Also needs to be able to pack a transposed matrix A
Level-3 BLAS • Total so far: three inner kernels and 12 pack routines • What about Hermitian rank-k update (herk)? • Need to be able to pack conjugate-transpose of A • Symmetric/Hermitian rank-2k updates can reuse kernels for rank-k
Level-3 BLAS • Triangular matrix multiplication (trmm) • 24 cases: A on the left or right, lower- or upper-stored, with no transpose, transpose, or conjugate-transpose, and a unit or non-unit diagonal (2 × 2 × 3 × 2 = 24)
Level-3 BLAS • Triangular matrix multiplication (trmm) • Needs two (or four) special inner kernels • Lower- and upper-stored matrices A (left and right cases?) • Also needs to be able to pack only the stored region of matrix A, possibly [conjugate-]transposed, with a unit or non-unit diagonal