Retreat into BLIS Field G. Van Zee
Funding and publications • NSF • Award OCI-1148125: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences. (Funded June 1, 2012 - May 31, 2015.) • Other sources (e.g. Microsoft) • ACM Transactions on Mathematical Software • “BLIS: A Framework for Rapid Instantiation of BLAS Functionality” (submitted) • “The BLIS Framework: Experiments in Portability” (submitted)
Preview • What is BLIS? • Why BLIS and not BLAS? How is BLIS an improvement over existing BLAS implementations? • I’ve heard BLIS will make me more productive. How? • What kind of performance can I expect? • (and many other questions)
What is BLIS? • BLAS-like Library Instantiation Software • BLIS is a framework for • Quickly instantiating high-performance BLAS-like libraries • “Why ‘BLAS-like’?”… • For now, just assume BLAS-like = BLAS
What is BLAS? • Basic Linear Algebra Subprograms • Level 1: vector-vector [Lawson et al. 1979] • Level 2: matrix-vector [Dongarra et al. 1988] • Level 3: matrix-matrix [Dongarra et al. 1990] • Why are BLAS important?
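For orientation, here is what a typical Level-3 call looks like through the CBLAS interface; the matrices and values below are an illustrative example, not taken from the slides.

```c
#include <cblas.h>

int main( void )
{
    /* C := alpha*A*B + beta*C with 2x2 column-major matrices (dgemm). */
    double A[ 4 ] = { 1, 2, 3, 4 };   /* A = [ 1 3 ; 2 4 ] */
    double B[ 4 ] = { 5, 6, 7, 8 };   /* B = [ 5 7 ; 6 8 ] */
    double C[ 4 ] = { 0, 0, 0, 0 };

    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 2, 2, 2,        /* m, n, k       */
                 1.0, A, 2,      /* alpha, A, lda */
                 B, 2,           /* B, ldb        */
                 0.0, C, 2 );    /* beta, C, ldc  */

    return 0;   /* C now holds A*B = [ 23 31 ; 34 46 ] */
}
```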
Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc.
Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc. • The idea is simple: • if the BLAS interface is “standardized”, and • if optimized / high-performance implementation exists for your architecture then higher-level applications can easily benefit
Why are BLAS important? • Plenty of BLAS implementations available • Vendor • ACML (AMD), ESSL (IBM), MKL (Intel), cuBLAS (NVIDIA), MLIB (HP), MathKeisan (NEC), Accelerate (Apple), etc. • Open source • netlib, GotoBLAS, OpenBLAS, ATLAS, etc. • So why do we need BLIS?
Why do we need BLIS? • Actually, there are two questions • Why do we need BLIS? • Why should we want BLIS? • Let’s look at the first question
Why do we need BLIS? • The BLAS interface is limiting for some applications • To be expected – it was finalized 20-30 years ago! • How exactly is the BLAS interface limiting? • After all, it’s served us well for a long time
Limitations of BLAS interface • Interface only allows column-major storage • We want to support column-major storage, row-major storage, and general stride (tensors). • Further yet, we want to support operands of mixed storage formats. Example: C := C + A B, where A is column-stored, B is row-stored, and C has general stride (see the sketch below).
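The flexibility being asked for can be stated in a few lines of C. This is only a minimal sketch of the addressing convention, not BLIS code: a matrix is a buffer plus a row stride (rs) and a column stride (cs), so column-major storage is the special case rs = 1, row-major storage the special case cs = 1, and general stride allows both to exceed 1.

```c
#include <stdio.h>
#include <stddef.h>

/* Element (i,j) of a matrix described by (buffer, row stride, column stride). */
static double* elem( double* buf, int i, int j, int rs, int cs )
{
    return &buf[ (size_t)i * rs + (size_t)j * cs ];
}

int main( void )
{
    double a[ 6 ] = { 1, 2, 3, 4, 5, 6 };   /* storage for a 2 x 3 matrix */

    /* Column-major: rs = 1, cs = leading dimension (2); (0,1) -> 3.0 */
    printf( "col-major (0,1): %g\n", *elem( a, 0, 1, 1, 2 ) );

    /* Row-major: rs = leading dimension (3), cs = 1; (0,1) -> 2.0 */
    printf( "row-major (0,1): %g\n", *elem( a, 0, 1, 3, 1 ) );

    /* General stride: both rs and cs may exceed 1 (e.g. a tensor slice). */
    return 0;
}
```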
Limitations of BLAS interface • Why do we need general stride storage?
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice?
Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice? • It may be non-contiguous in both dimensions (see the sketch below)
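To see why, here is a minimal sketch (my own illustration, with an assumed first-index-fastest tensor layout): fixing one index of a contiguously stored three-dimensional tensor yields a matrix whose row and column strides are both greater than one, which the BLAS interface cannot describe but general-stride storage handles directly.

```c
#include <stdio.h>

int main( void )
{
    /* A 4 x 5 x 6 tensor stored contiguously, first index fastest:
       element (i,j,l) lives at offset i + j*4 + l*4*5. */
    const int m = 4, n = 5, k = 6;
    double t[ 4 * 5 * 6 ];
    for ( int idx = 0; idx < m * n * k; ++idx ) t[ idx ] = idx;

    /* Fix the first index (i = 2). The resulting 5 x 6 slice has
       row stride m = 4 and column stride m*n = 20: non-contiguous in
       both dimensions, yet valid general-stride matrix storage. */
    double* slice = &t[ 2 ];
    int     rs    = m;       /* stride between consecutive rows    */
    int     cs    = m * n;   /* stride between consecutive columns */

    printf( "slice(1,3) = %g\n", slice[ 1 * rs + 3 * cs ] );   /* t[66] */
    return 0;
}
```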
Limitations of BLAS interface • Incomplete support for complex operations (no “conjugate without transposition”) • Examples: axpy, gemv, gemm, her, herk, trmv, trmm, trsv, trsm
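To make the limitation concrete (my own illustration, not from the slides): with an interface that only offers op(A) in {A, A^T, A^H}, applying a conjugated but untransposed matrix requires the workaround

```latex
\[
  \overline{A}\,x \;=\; \overline{\,A\,\overline{x}\,}
\]
```

that is, conjugate x, call gemv with no transpose, then conjugate the result: extra O(n) passes, extra workspace, and extra code at every call site. A “conjugate without transposition” option handles this inside the kernel at no extra cost.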
Limitations of BLAS interface • BLAS API is opaque • No uniform way to access lower-level kernels • Why would one want access to these kernels? • Optimize higher-level (LAPACK-level) operations • Control packing, computation for multithreading • Implement new operations (without “reinventing the wheel”)
Limitations of BLAS interface • Operation support has not changed in over two decades • BLAST Technical Forum attempted to ratify some improvements • Revisions largely ignored by implementors. Why? • Best guess: No official reference implementation
Why do we need BLIS? • Why does this mean we need BLIS? • The BLAS API itself cannot be changed • We can’t get a better interface by building a better BLAS implementation – we need something else altogether • This was actually one of the primary motivations for developing BLIS
Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed
Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed • This is why BLIS needs to exist
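As a rough illustration of what this buys the caller, here is a gemm call through BLIS's typed API as I recall it from roughly this time frame; the exact signature, type names, and constants should be checked against the BLIS documentation, so treat the sketch below as an assumption-laden example rather than the definitive interface.

```c
#include "blis.h"

int main( void )
{
    dim_t    m = 4, n = 4, k = 4;
    dcomplex alpha = { 1.0, 0.0 }, beta = { 0.0, 0.0 };
    dcomplex a[ 4 * 4 ] = { { 0.0, 0.0 } };   /* m x k, column-stored */
    dcomplex b[ 4 * 4 ] = { { 0.0, 0.0 } };   /* k x n, row-stored    */
    dcomplex c[ 4 * 4 ] = { { 0.0, 0.0 } };   /* m x n, column-stored */

    bli_init();

    /* C := alpha * conj(A) * B + beta * C, where
       A uses row stride 1, column stride m (column-major),
       B uses row stride n, column stride 1 (row-major), and
       BLIS_CONJ_NO_TRANSPOSE requests conjugation without transposition --
       none of which the BLAS interface can express in one call. */
    bli_zgemm( BLIS_CONJ_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
               m, n, k,
               &alpha, a, 1, m,
                       b, n, 1,
               &beta,  c, 1, m );

    bli_finalize();
    return 0;
}
```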
Why should we want BLIS? • Now, why should someone want BLIS?
Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer
Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer • If you’re a developer • As a framework, BLIS makes it easier to implement high-performance BLAS • Case study: Intel SCC
Why should we want BLIS? • How does BLIS make implementing high-performance BLAS easier? • First, let’s discuss: Why is it normally so time-consuming? • Let’s look at general matrix-matrix multiplication (gemm) as implemented by Kazushige Goto in GotoBLAS • [Goto and van de Geijn 2008]
The gemm algorithm (block diagrams omitted; the steps they depicted are summarized here) • Partition C and B into column panels of width NC • Partition A and the current panel of B along the k dimension into panels of depth KC • Pack the row panel of B into contiguous micro-panels of width NR • Partition A into blocks of height MC • Pack the block of A into contiguous micro-panels of height MR • Multiply the packed block of A by the packed panel of B to update the corresponding part of C
The gemm algorithm • Goto called this the “inner kernel” • Typically takes the shape of a block-panel multiply • Consists of three loops • Coded entirely in assembly language (≈ 2000 lines)
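To make the loop structure concrete, here is a self-contained, unoptimized C sketch of this blocking; it is my own illustration, not GotoBLAS or BLIS source, and the block sizes NC/KC/MC/NR/MR and helper names are placeholders. The two innermost loops plus the micro-kernel correspond roughly to what Goto hand-coded in assembly as the “inner kernel”.

```c
#include <stddef.h>

/* Sketch of the Goto gemm blocking for C += A*B, all matrices column-major.
   Real implementations tune NC/KC/MC/NR/MR per architecture and write the
   micro-kernel in assembly or vector intrinsics. */
enum { NC = 64, KC = 64, MC = 32, NR = 4, MR = 4 };
#define MIN( x, y ) ( (x) < (y) ? (x) : (y) )

/* Micro-kernel: C[mr x nr] += Ap * Bp, where Ap holds kc columns of MR
   elements and Bp holds kc rows of NR elements (zero-padded at the edges). */
static void ukernel( int kc, int mr, int nr, const double* Ap,
                     const double* Bp, double* C, int ldc )
{
    for ( int p = 0; p < kc; ++p )
        for ( int j = 0; j < nr; ++j )
            for ( int i = 0; i < mr; ++i )
                C[ i + (size_t)j*ldc ] += Ap[ p*MR + i ] * Bp[ p*NR + j ];
}

/* Pack a kc x nc panel of B into NR-wide micro-panels (zero-padded). */
static void pack_B( int kc, int nc, const double* B, int ldb, double* Bp )
{
    for ( int j = 0; j < nc; j += NR )
        for ( int p = 0; p < kc; ++p )
            for ( int jj = 0; jj < NR; ++jj )
                *Bp++ = ( j+jj < nc ) ? B[ p + (size_t)(j+jj)*ldb ] : 0.0;
}

/* Pack an mc x kc block of A into MR-tall micro-panels (zero-padded). */
static void pack_A( int mc, int kc, const double* A, int lda, double* Ap )
{
    for ( int i = 0; i < mc; i += MR )
        for ( int p = 0; p < kc; ++p )
            for ( int ii = 0; ii < MR; ++ii )
                *Ap++ = ( i+ii < mc ) ? A[ (i+ii) + (size_t)p*lda ] : 0.0;
}

void gemm_sketch( int m, int n, int k,
                  const double* A, int lda, const double* B, int ldb,
                  double* C, int ldc )
{
    static double Bp[ KC * NC ], Ap[ MC * KC ];

    for ( int jc = 0; jc < n; jc += NC ) {              /* NC partition */
        int nc = MIN( NC, n - jc );
        for ( int pc = 0; pc < k; pc += KC ) {          /* KC partition */
            int kc = MIN( KC, k - pc );
            pack_B( kc, nc, &B[ pc + (size_t)jc*ldb ], ldb, Bp );
            for ( int ic = 0; ic < m; ic += MC ) {      /* MC partition */
                int mc = MIN( MC, m - ic );
                pack_A( mc, kc, &A[ ic + (size_t)pc*lda ], lda, Ap );
                /* Goto's "inner kernel": block-panel multiply. */
                for ( int jr = 0; jr < nc; jr += NR )
                    for ( int ir = 0; ir < mc; ir += MR )
                        ukernel( kc, MIN( MR, mc-ir ), MIN( NR, nc-jr ),
                                 &Ap[ ir*kc ], &Bp[ jr*kc ],
                                 &C[ (ic+ir) + (size_t)(jc+jr)*ldc ], ldc );
            }
        }
    }
}
```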
Level-3 BLAS • So I just write one “inner kernel” and I’m done, right? • That would be great! But no.
Level-3 BLAS • General matrix multiply (gemm) • Nine cases: each of A and B may appear with no transpose, transpose (T), or conjugate-transpose (H)
Level-3 BLAS • So we need three packing routines (at least) • One for each of: No transpose, Transpose, Conjugate-transpose • Three more if packing of A and B isn’t consolidated
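As an illustration of what distinguishes the three variants (my own sketch, not code from any library), packing differs only in how the source operand is read; folding the cases into one routine via a flag is the kind of consolidation a framework can do once, instead of hand-writing each case.

```c
#include <complex.h>
#include <stddef.h>

typedef enum { NO_TRANSPOSE, TRANSPOSE, CONJ_TRANSPOSE } pack_trans_t;

enum { MR = 4 };

/* Pack one MR x kc micro-panel of a complex matrix A, reading A either
   as stored, transposed, or conjugate-transposed (edge padding omitted).
   In a traditional GotoBLAS-style library each case is a separate
   hand-written routine. */
static void pack_micropanel( pack_trans_t trans, int mr, int kc,
                             const double complex* A, int lda,
                             double complex* Ap )
{
    for ( int p = 0; p < kc; ++p )
        for ( int i = 0; i < mr; ++i )
        {
            double complex aip =
                ( trans == NO_TRANSPOSE ) ? A[ i + (size_t)p * lda ]
                                          : A[ p + (size_t)i * lda ];
            if ( trans == CONJ_TRANSPOSE ) aip = conj( aip );
            Ap[ p * MR + i ] = aip;   /* packed: MR elements per k-index */
        }
}
```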
Level-3 BLAS • Symmetric matrix multiplication (symm) • Four cases: A stored in the lower or upper triangle, applied from the left or from the right
Level-3 BLAS • Symmetric matrix multiplication (symm) • Needs a special packing routine for each case • Lower- and upper-stored A, left and right sides • Then, we can call the gemm inner kernel as if the block had no structure
Level-3 BLAS • So to support gemm and symm, we need one inner kernel and seven pack routines • Hermitian matrix multiply (hemm)? • Can reuse inner kernel • Needs different packing on matrix A (to conjugate the unstored regions) • Okay, one inner kernel and 11 pack routines • What else?
Level-3 BLAS • Symmetric rank-k update (syrk) • Four cases: C stored in the lower or upper triangle, updated by either A A^T or A^T A
Level-3 BLAS • Symmetric rank-k update (syrk) • Needs two special inner kernels • Lower- and upper-stored matrices C • Also needs to be able to pack a transposed matrix A
Level-3 BLAS • Total so far: three inner kernels and 12 pack routines • What about Hermitian rank-k update (herk)? • Need to be able to pack conjugate-transpose of A • Symmetric/Hermitian rank-2k updates can reuse kernels for rank-k
Level-3 BLAS • Triangular matrix multiplication (trmm) • 24 cases: A on the left or right, lower- or upper-stored, with no transpose, transpose, or conjugate-transpose, and a unit or non-unit diagonal (2 × 2 × 3 × 2 = 24)
Level-3 BLAS • Triangular matrix multiplication (trmm) • Needs two (or four) special inner kernels • Lower- and upper-stored matrices A (left and right cases?) • Also needs to be able to pack only the stored region of matrix A, possibly [conjugate-]transposed, with a unit or non-unit diagonal