Anatomy of a High-Performance Many-Threaded Matrix Multiplication Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee
Introduction • Shared-memory parallelism for GEMM • Many-threaded architectures require more sophisticated methods of parallelism • We survey the opportunities for parallelism and explain which ones we exploit • Finer-grained parallelism is needed
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
GotoBLAS Approach • The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n
[Figure series: blocking C += A B for the memory hierarchy (registers, L1, L2, and L3 caches, main memory)]
• C and B are partitioned into column panels of width nc
• A and each panel of B are partitioned in the k dimension with block size kc; the kc × nc panel of B is packed into the L3 cache
• Each row panel of A is partitioned into mc × kc blocks, each packed into the L2 cache
• The packed panel of B is traversed in micro-panels of width nr, which reside in the L1 cache
• The packed block of A is traversed in micro-panels of height mr; the micro-kernel updates an mr × nr block of C held in registers
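To make the layered structure concrete, here is a minimal C sketch of the five-loop algorithm. This is not the BLIS/GotoBLAS implementation: the blocking parameters NC, KC, MC, NR, MR are illustrative values, the packing routines and scalar micro-kernel are simplified stand-ins, and m, n, k are assumed to be multiples of the blocking sizes.

/* A sketch of the GotoBLAS layered algorithm for C += A * B (row-major).
 * NC/KC/MC/NR/MR are hypothetical blocking parameters chosen so that a
 * KC x NC panel of B fits in the L3 cache, an MC x KC block of A fits in
 * the L2 cache, a KC x NR micro-panel of B fits in the L1 cache, and an
 * MR x NR block of C fits in registers. */
enum { NC = 4096, KC = 256, MC = 96, NR = 4, MR = 4 };

/* Pack a KC x NC panel of B into contiguous micro-panels of width NR. */
static void pack_B(int kc, int nc, const double *B, int ldb, double *Bp)
{
    for (int jr = 0; jr < nc; jr += NR)
        for (int p = 0; p < kc; ++p)
            for (int j = 0; j < NR; ++j)
                *Bp++ = B[p * ldb + jr + j];
}

/* Pack an MC x KC block of A into contiguous micro-panels of height MR. */
static void pack_A(int mc, int kc, const double *A, int lda, double *Ap)
{
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; ++p)
            for (int i = 0; i < MR; ++i)
                *Ap++ = A[(ir + i) * lda + p];
}

/* Micro-kernel: accumulate an MR x NR block of C in "registers". */
static void micro_kernel(int kc, const double *Ap, const double *Bp,
                         double *C, int ldc)
{
    double c[MR][NR] = { { 0.0 } };
    for (int p = 0; p < kc; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                c[i][j] += Ap[p * MR + i] * Bp[p * NR + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += c[i][j];
}

/* C += A * B; m, n, k assumed multiples of the blocking parameters. */
void gemm(int m, int n, int k, const double *A, int lda,
          const double *B, int ldb, double *C, int ldc)
{
    static double Bp[KC * NC];  /* packed panel of B (L3-resident) */
    static double Ap[MC * KC];  /* packed block of A (L2-resident) */

    for (int jc = 0; jc < n; jc += NC)            /* 5th loop: NC-wide panels of B, C */
        for (int pc = 0; pc < k; pc += KC) {      /* 4th loop: KC-deep slices of A, B */
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {  /* 3rd loop: MC-tall blocks of A    */
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                for (int jr = 0; jr < NC; jr += NR)      /* 2nd loop: micro-panels of B */
                    for (int ir = 0; ir < MC; ir += MR)  /* 1st loop: micro-panels of A */
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
}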
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
Multiple Levels of Parallelism: the 1st loop (ir, around the micro-kernel) • All threads share the micro-panel of B • Each thread has its own micro-panel of A • Fixed number of iterations: mc/mr
Multiple Levels of Parallelism: the 2nd loop (jr, over micro-panels of B) • All threads share the block of A • Each thread has its own micro-panel of B • Fixed number of iterations: nc/nr • Good if the L2 cache is shared
Multiple Levels of Parallelism: the 3rd loop (ic, over blocks of A) • All threads share the panel of B • Each thread has its own block of A • Number of iterations is not fixed (it depends on m) • Good if there are multiple L2 caches • (A sketch of these first three options follows below)
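As a hedged illustration of the three options above, the sketch below reuses the blocking parameters and helper routines from the earlier gemm sketch and parallelizes the 2nd (jr) loop with a plain OpenMP parallel-for; BLIS itself partitions loop ranges with its own thread infrastructure rather than OpenMP pragmas. Moving the pragma to the ir loop gives the 1st-loop option (threads share a micro-panel of B); moving it to the ic loop gives the 3rd-loop option, provided each thread then packs its own private block of A.

/* 2nd-loop (jr) parallelism: all threads share the block of A packed in
 * the (shared) L2 cache, and each thread multiplies its own NR-wide
 * micro-panels of B. The loop has a fixed nc/nr iterations, independent
 * of the problem size. Compile with OpenMP enabled (e.g. -fopenmp). */
void gemm_jr_parallel(int m, int n, int k, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    static double Bp[KC * NC], Ap[MC * KC];
    for (int jc = 0; jc < n; jc += NC)
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                #pragma omp parallel for        /* threads split the jr range */
                for (int jr = 0; jr < NC; jr += NR)
                    for (int ir = 0; ir < MC; ir += MR)
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
}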
Multiple Levels of Parallelism: the 4th loop (pc, over the k dimension) • Each iteration updates the entire matrix C • Iterations of the loop are not independent • Parallelizing it requires a mutex when updating C, or a reduction (see the sketch below)
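A deliberately naive sketch of why this loop is unattractive, using a hypothetical gemm_k_parallel with unblocked inner loops: splitting the k dimension across threads forces each thread to keep a private copy of C and pay for a final reduction, serialized here with a critical section (a mutex).

#include <stdlib.h>

/* 4th-loop (k-dimension) parallelism: every iteration touches all of C,
 * so each thread accumulates into a private copy Cp and the partial
 * results are merged under a critical section. A tree reduction would
 * reduce, but not eliminate, this overhead. */
void gemm_k_parallel(int m, int n, int k, const double *A, int lda,
                     const double *B, int ldb, double *C, int ldc)
{
    #pragma omp parallel
    {
        double *Cp = calloc((size_t)m * n, sizeof *Cp); /* private C */
        #pragma omp for                   /* threads split the k range */
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j)
                    Cp[i * n + j] += A[i * lda + p] * B[p * ldb + j];
        #pragma omp critical              /* serialized reduction into C */
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                C[i * ldc + j] += Cp[i * n + j];
        free(Cp);
    }
}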
Multiple Levels of Parallelism: the 5th loop (jc, over column panels of B and C) • All threads share the matrix A • Each thread has its own panel of B • Number of iterations is not fixed (it depends on n) • Good if there are multiple L3 caches • Good for NUMA reasons
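A hedged sketch of this 5th-loop option, again reusing the helpers and blocking parameters from the earlier gemm sketch (gemm_jc_parallel is an illustrative name): each thread packs thread-local copies of the panel of B and block of A, so no packed buffer is shared, which suits machines with multiple L3 caches and NUMA domains. The trade-off is that the loop has only about n/NC iterations, so the parallelism is coarse-grained.

#include <stdlib.h>

/* 5th-loop (jc) parallelism: threads work on disjoint NC-wide column
 * panels of B and C and share only the (read-only) matrix A. Packing
 * into thread-local buffers keeps each panel of B in the L3 cache (and
 * NUMA domain) closest to the thread that uses it. */
void gemm_jc_parallel(int m, int n, int k, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    #pragma omp parallel for
    for (int jc = 0; jc < n; jc += NC) {
        double *Bp = malloc(sizeof(double) * KC * NC); /* thread-local */
        double *Ap = malloc(sizeof(double) * MC * KC); /* thread-local */
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                for (int jr = 0; jr < NC; jr += NR)
                    for (int ir = 0; ir < MC; ir += MR)
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
        free(Bp);
        free(Ap);
    }
}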
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
Intel Xeon Phi • Many threads: 60 cores, 4 threads per core • At least 2 threads per core are needed to fully utilize the FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We treat part of the L2 cache as a virtual L1 • Each core has its own L2 cache
IBM Blue Gene/Q • (Not quite as) many threads: 16 cores, 4 threads per core • At least 2 threads per core are needed to fully utilize the FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We treat part of the L2 cache as a virtual L1 • Single large, shared L2 cache
Thank You • Questions? • Source code available at code.google.com/p/blis/