Anatomy of a High-Performance Many-Threaded Matrix Multiplication Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee
Introduction • Shared-memory parallelism for GEMM • Many-threaded architectures require more sophisticated methods of parallelism • We survey the opportunities for parallelism and explain which ones we exploit • Finer-grained parallelism is needed
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
GotoBLAS Approach • The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n
[Figure series: blocking C += A B for the memory hierarchy (registers, L1, L2, and L3 caches, main memory)]
• C and B are partitioned into column panels of width nc
• A and each panel of B are partitioned in the k dimension with block size kc; the kc × nc panel of B is packed into the L3 cache
• Each row panel of A is partitioned into mc × kc blocks, each packed into the L2 cache
• The packed panel of B is traversed in micro-panels of width nr, which reside in the L1 cache
• The packed block of A is traversed in micro-panels of height mr; the micro-kernel updates an mr × nr block of C held in registers
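To make the layered structure concrete, here is a minimal C sketch of the five-loop algorithm. This is not the BLIS/GotoBLAS implementation: the blocking parameters NC, KC, MC, NR, MR are illustrative values, the packing routines and scalar micro-kernel are simplified stand-ins, and m, n, k are assumed to be multiples of the blocking sizes.

/* A sketch of the GotoBLAS layered algorithm for C += A * B (row-major).
 * NC/KC/MC/NR/MR are hypothetical blocking parameters chosen so that a
 * KC x NC panel of B fits in the L3 cache, an MC x KC block of A fits in
 * the L2 cache, a KC x NR micro-panel of B fits in the L1 cache, and an
 * MR x NR block of C fits in registers. */
enum { NC = 4096, KC = 256, MC = 96, NR = 4, MR = 4 };

/* Pack a KC x NC panel of B into contiguous micro-panels of width NR. */
static void pack_B(int kc, int nc, const double *B, int ldb, double *Bp)
{
    for (int jr = 0; jr < nc; jr += NR)
        for (int p = 0; p < kc; ++p)
            for (int j = 0; j < NR; ++j)
                *Bp++ = B[p * ldb + jr + j];
}

/* Pack an MC x KC block of A into contiguous micro-panels of height MR. */
static void pack_A(int mc, int kc, const double *A, int lda, double *Ap)
{
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; ++p)
            for (int i = 0; i < MR; ++i)
                *Ap++ = A[(ir + i) * lda + p];
}

/* Micro-kernel: accumulate an MR x NR block of C in "registers". */
static void micro_kernel(int kc, const double *Ap, const double *Bp,
                         double *C, int ldc)
{
    double c[MR][NR] = { { 0.0 } };
    for (int p = 0; p < kc; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                c[i][j] += Ap[p * MR + i] * Bp[p * NR + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += c[i][j];
}

/* C += A * B; m, n, k assumed multiples of the blocking parameters. */
void gemm(int m, int n, int k, const double *A, int lda,
          const double *B, int ldb, double *C, int ldc)
{
    static double Bp[KC * NC];  /* packed panel of B (L3-resident) */
    static double Ap[MC * KC];  /* packed block of A (L2-resident) */

    for (int jc = 0; jc < n; jc += NC)            /* 5th loop: NC-wide panels of B, C */
        for (int pc = 0; pc < k; pc += KC) {      /* 4th loop: KC-deep slices of A, B */
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {  /* 3rd loop: MC-tall blocks of A    */
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                for (int jr = 0; jr < NC; jr += NR)      /* 2nd loop: micro-panels of B */
                    for (int ir = 0; ir < MC; ir += MR)  /* 1st loop: micro-panels of A */
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
}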
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
Multiple Levels of Parallelism: the 1st loop (ir, around the micro-kernel) • All threads share the micro-panel of B • Each thread has its own micro-panel of A • Fixed number of iterations: mc/mr
Multiple Levels of Parallelism: the 2nd loop (jr, over micro-panels of B) • All threads share the block of A • Each thread has its own micro-panel of B • Fixed number of iterations: nc/nr • Good if the L2 cache is shared
Multiple Levels of Parallelism: the 3rd loop (ic, over blocks of A) • All threads share the panel of B • Each thread has its own block of A • Number of iterations is not fixed (it depends on m) • Good if there are multiple L2 caches • (A sketch of these first three options follows below)
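As a hedged illustration of the three options above, the sketch below reuses the blocking parameters and helper routines from the earlier gemm sketch and parallelizes the 2nd (jr) loop with a plain OpenMP parallel-for; BLIS itself partitions loop ranges with its own thread infrastructure rather than OpenMP pragmas. Moving the pragma to the ir loop gives the 1st-loop option (threads share a micro-panel of B); moving it to the ic loop gives the 3rd-loop option, provided each thread then packs its own private block of A.

/* 2nd-loop (jr) parallelism: all threads share the block of A packed in
 * the (shared) L2 cache, and each thread multiplies its own NR-wide
 * micro-panels of B. The loop has a fixed nc/nr iterations, independent
 * of the problem size. Compile with OpenMP enabled (e.g. -fopenmp). */
void gemm_jr_parallel(int m, int n, int k, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    static double Bp[KC * NC], Ap[MC * KC];
    for (int jc = 0; jc < n; jc += NC)
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                #pragma omp parallel for        /* threads split the jr range */
                for (int jr = 0; jr < NC; jr += NR)
                    for (int ir = 0; ir < MC; ir += MR)
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
}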
Multiple Levels of Parallelism: the 4th loop (pc, over the k dimension) • Each iteration updates the entire matrix C • Iterations of the loop are not independent • Parallelizing it requires a mutex when updating C, or a reduction (see the sketch below)
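A deliberately naive sketch of why this loop is unattractive, using a hypothetical gemm_k_parallel with unblocked inner loops: splitting the k dimension across threads forces each thread to keep a private copy of C and pay for a final reduction, serialized here with a critical section (a mutex).

#include <stdlib.h>

/* 4th-loop (k-dimension) parallelism: every iteration touches all of C,
 * so each thread accumulates into a private copy Cp and the partial
 * results are merged under a critical section. A tree reduction would
 * reduce, but not eliminate, this overhead. */
void gemm_k_parallel(int m, int n, int k, const double *A, int lda,
                     const double *B, int ldb, double *C, int ldc)
{
    #pragma omp parallel
    {
        double *Cp = calloc((size_t)m * n, sizeof *Cp); /* private C */
        #pragma omp for                   /* threads split the k range */
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j)
                    Cp[i * n + j] += A[i * lda + p] * B[p * ldb + j];
        #pragma omp critical              /* serialized reduction into C */
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                C[i * ldc + j] += Cp[i * n + j];
        free(Cp);
    }
}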
Multiple Levels of Parallelism: the 5th loop (jc, over column panels of B and C) • All threads share the matrix A • Each thread has its own panel of B • Number of iterations is not fixed (it depends on n) • Good if there are multiple L3 caches • Good for NUMA reasons
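A hedged sketch of this 5th-loop option, again reusing the helpers and blocking parameters from the earlier gemm sketch (gemm_jc_parallel is an illustrative name): each thread packs thread-local copies of the panel of B and block of A, so no packed buffer is shared, which suits machines with multiple L3 caches and NUMA domains. The trade-off is that the loop has only about n/NC iterations, so the parallelism is coarse-grained.

#include <stdlib.h>

/* 5th-loop (jc) parallelism: threads work on disjoint NC-wide column
 * panels of B and C and share only the (read-only) matrix A. Packing
 * into thread-local buffers keeps each panel of B in the L3 cache (and
 * NUMA domain) closest to the thread that uses it. */
void gemm_jc_parallel(int m, int n, int k, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    #pragma omp parallel for
    for (int jc = 0; jc < n; jc += NC) {
        double *Bp = malloc(sizeof(double) * KC * NC); /* thread-local */
        double *Ap = malloc(sizeof(double) * MC * KC); /* thread-local */
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[pc * ldb + jc], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {
                pack_A(MC, KC, &A[ic * lda + pc], lda, Ap);
                for (int jr = 0; jr < NC; jr += NR)
                    for (int ir = 0; ir < MC; ir += MR)
                        micro_kernel(KC, &Ap[ir * KC], &Bp[jr * KC],
                                     &C[(ic + ir) * ldc + jc + jr], ldc);
            }
        }
        free(Bp);
        free(Ap);
    }
}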
Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results
Intel Xeon Phi • Many threads: 60 cores, 4 threads per core • At least 2 threads per core are needed to fully utilize the FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We treat part of the L2 cache as a virtual L1 • Each core has its own L2 cache
IBM Blue Gene/Q • (Not quite as) many threads: 16 cores, 4 threads per core • At least 2 threads per core are needed to fully utilize the FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We treat part of the L2 cache as a virtual L1 • Single large, shared L2 cache
Thank You • Questions? • Source code available at code.google.com/p/blis/