Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Intel talk
Motivation
• “The Free Lunch Is Over” – Herb Sutter
• Parallelize or perish
• Popular libraries like Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten:
  • FORTRAN 77
  • Column-major order matrix storage
  • 187+ operations for each datatype
  • One routine (algorithm) per operation
Teaser
[Chart: better theoretical peak performance]
Goals
• Programmability
  • Use tools provided by FLAME
• Parallelism
  • Directed acyclic graph (DAG) scheduling
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
SuperMatrix
• Formal Linear Algebra Method Environment (FLAME)
  • High-level abstractions for expressing linear algebra algorithms
• Cholesky factorization: A → L L^T
SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */     /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*-----------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                           /* ********** */     /* ************* */
                              &ABL, /**/ &ABR,   A20, A21, /**/ A22,
                              FLA_TL );
  }
SuperMatrix
• Cholesky factorization
• Iterations 1 and 2 each perform the same three tasks on a shrinking partition:
  • CHOL: A11 := Chol( A11 )
  • TRSM: A21 := A21 A11^-T
  • SYRK: A22 := A22 - A21 A21^T
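For reference, these updates follow from equating blocks in the partitioned factorization; a sketch of the standard derivation, in the slide's notation:

  \begin{align*}
  \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
  &=
  \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
  \begin{pmatrix} L_{11}^{T} & L_{21}^{T} \\ 0 & L_{22}^{T} \end{pmatrix} \\
  A_{11} &= L_{11} L_{11}^{T}
    && \text{CHOL: } A_{11} \leftarrow \mathrm{Chol}( A_{11} ) \\
  A_{21} &= L_{21} L_{11}^{T}
    && \text{TRSM: } A_{21} \leftarrow A_{21} A_{11}^{-T} \\
  A_{22} &= L_{21} L_{21}^{T} + L_{22} L_{22}^{T}
    && \text{SYRK: } A_{22} \leftarrow A_{22} - A_{21} A_{21}^{T}
  \end{align*}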
SuperMatrix
• LAPACK-style implementation

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
SuperMatrix
• FLASH
  • Storage-by-blocks, algorithm-by-blocks (sketched below)
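A minimal sketch of what storage-by-blocks means in code (illustrative types and names, not the actual FLASH API): the top-level object is a grid of pointers to contiguously stored blocks, so an algorithm-by-blocks addresses each whole block through a single pointer.

  #include <stdlib.h>

  typedef struct {
      int      b;        /* block (tile) dimension          */
      int      nb;       /* number of blocks per row/column */
      double **blocks;   /* nb*nb pointers, row-major grid  */
  } BlockedMatrix;

  static BlockedMatrix *blocked_create( int n, int b )
  {
      BlockedMatrix *A = malloc( sizeof *A );
      A->b  = b;
      A->nb = ( n + b - 1 ) / b;
      A->blocks = malloc( (size_t)A->nb * A->nb * sizeof *A->blocks );
      for ( int i = 0; i < A->nb * A->nb; i++ )
          A->blocks[ i ] = calloc( (size_t)b * b, sizeof( double ) );
      return A;
  }

  /* Block (i,j): tasks operate on whole blocks, so this pointer is
     all a dependence analysis needs to identify an operand. */
  static double *blocked_at( BlockedMatrix *A, int i, int j )
  {
      return A->blocks[ i * A->nb + j ];
  }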
SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */     /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                           /* ********** */     /* ************* */
                              &ABL, /**/ &ABR,   A20, A21, /**/ A22,
                              FLA_TL );
  }
SuperMatrix
• Cholesky factorization on a 3 × 3 grid of blocks
• Iteration 1 generates six tasks (the numbered DAG nodes):
  • CHOL0: A0,0 := Chol( A0,0 )
  • TRSM1: A1,0 := A1,0 A0,0^-T
  • TRSM2: A2,0 := A2,0 A0,0^-T
  • SYRK3: A1,1 := A1,1 - A1,0 A1,0^T
  • GEMM4: A2,1 := A2,1 - A2,0 A1,0^T
  • SYRK5: A2,2 := A2,2 - A2,0 A2,0^T
SuperMatrix
• Cholesky factorization
• Iteration 2 adds three tasks that depend on results of iteration 1:
  • CHOL6: A1,1 := Chol( A1,1 )
  • TRSM7: A2,1 := A2,1 A1,1^-T
  • SYRK8: A2,2 := A2,2 - A2,1 A2,1^T
SuperMatrix
• Cholesky factorization
• Iteration 3 adds the final task:
  • CHOL9: A2,2 := Chol( A2,2 )
SuperMatrix
• Separation of concerns
• Analyzer
  • Decomposes subproblems into component tasks
  • Stores tasks sequentially in a global task queue
  • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
• Dispatcher
  • Spawns threads
  • Schedules and dispatches tasks to threads in parallel
SuperMatrix
• Analyzer
  • Detects flow, anti, and output dependencies (a sketch follows)
  • Embeds pointers into hierarchical matrices
  • Block size manifests as the size of the contiguously stored blocks
  • The analysis can be performed statically
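A sketch of how such an analyzer can compute the DAG purely from each task's input and output blocks (illustrative C, not the actual SuperMatrix source). Tasks arrive in sequential program order; an edge is added whenever a block is read after a write (flow), written after a read (anti), or written after a write (output). Because blocks are contiguously stored and reached through embedded pointers, pointer equality suffices to detect that two tasks touch the same block.

  typedef struct Task Task;
  struct Task {
      void  (*func)( Task * );   /* operation to run, e.g. a TRSM  */
      double *in[ 2 ];           /* input blocks (NULL if unused)  */
      double *out;               /* output block                   */
      Task   *succ[ 64 ];        /* tasks that must wait on this   */
      int     nsucc;
      int     npred;             /* unsatisfied dependence count   */
  };

  static void add_edge( Task *from, Task *to )
  {
      from->succ[ from->nsucc++ ] = to;
      to->npred++;
  }

  /* Analyze task t against the n previously stored tasks. */
  static void analyze( Task *t, Task **earlier, int n )
  {
      for ( int k = 0; k < n; k++ ) {
          Task *e = earlier[ k ];
          for ( int i = 0; i < 2; i++ ) {
              if ( t->in[ i ] != NULL && t->in[ i ] == e->out )
                  add_edge( e, t );                    /* flow   */
              if ( t->out != NULL && t->out == e->in[ i ] )
                  add_edge( e, t );                    /* anti   */
          }
          if ( t->out != NULL && t->out == e->out )
              add_edge( e, t );                        /* output */
      }
  }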
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Scheduling
• Dispatcher (sketched in C below)

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
          Update dependent task
          if dependent task is ready then
              Enqueue dependent task
          end
      end
  end
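A sketch of this loop in C (illustrative: one shared FIFO, Task as in the analyzer sketch above, and the locking or atomics a real dispatcher needs elided for brevity):

  static Task *ready[ 1024 ];
  static int   head = 0, tail = 0;
  static int   remaining;            /* tasks not yet executed */

  static void  enqueue( Task *t ) { ready[ tail++ % 1024 ] = t; }
  static Task *dequeue( void )
  { return head == tail ? NULL : ready[ head++ % 1024 ]; }

  /* Each worker thread runs this loop: dequeue a ready task, execute
     it, then decrement the predecessor count of each dependent task;
     a dependent whose count reaches zero becomes ready. */
  static void worker( void )
  {
      while ( remaining > 0 ) {
          Task *t = dequeue();
          if ( t == NULL ) continue;          /* wait for work   */
          t->func( t );                       /* execute task    */
          remaining--;
          for ( int i = 0; i < t->nsucc; i++ )
              if ( --t->succ[ i ]->npred == 0 )
                  enqueue( t->succ[ i ] );    /* now ready       */
      }
  }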
Scheduling
• Supermarket
  • p lines for p cashiers
  • Efficient enqueue and dequeue
  • Schedule depends on the task-to-thread assignment
• Bank
  • 1 line for p tellers
  • Enqueue and dequeue become bottlenecks
  • Dynamic dispatching of tasks to threads
Scheduling
• Single queue
  • The set of all ready and available tasks
  • FIFO or priority ordering
• [Diagram: one shared queue with enqueue/dequeue serving PE0 … PEp-1]
Scheduling
• Multiple queues
  • Work stealing, data affinity
• [Diagram: one queue per processing element PE0 … PEp-1]
Scheduling
• Work stealing

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      if task ≠ Ø then
          Execute task
          Update dependent tasks …
      else
          Steal task
      end
  end

• Enqueue
  • Place all dependent tasks on the queue of the same thread that executed the task
• Steal
  • Select a random thread and remove a task from the tail of its queue (sketched below)
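A sketch of the deque discipline this implies (illustrative; synchronization again elided). The owner pushes and pops at one end of its own deque, while a thief picks a random victim and removes from the other end, so the oldest tasks are the ones that migrate:

  #include <stdlib.h>

  #define NTHREADS 24   /* illustrative thread count */

  typedef struct {
      Task *task[ 256 ];            /* Task as in the analyzer sketch */
      int   bot, top;               /* bot: owner end; top: thief end */
  } Deque;

  static Deque dq[ NTHREADS ];

  static void push( int me, Task *t )      /* owner enqueues */
  { dq[ me ].task[ dq[ me ].bot++ ] = t; }

  static Task *pop( int me )               /* owner dequeues */
  {
      Deque *d = &dq[ me ];
      return d->bot > d->top ? d->task[ --d->bot ] : NULL;
  }

  static Task *steal( int me )             /* thief removes from tail */
  {
      int victim = rand() % NTHREADS;
      Deque *d = &dq[ victim ];
      if ( victim == me || d->top >= d->bot ) return NULL;
      return d->task[ d->top++ ];
  }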
Scheduling
• Work stealing mailbox
  • Each thread has an associated mailbox
  • A task is enqueued onto a queue and also placed in a mailbox
  • Tasks can be assigned to mailboxes using a 2D distribution
  • Before attempting a steal, a thread first checks its mailbox
    • Optimizes for data locality instead of random stealing
  • The mailbox is only checked when a steal would otherwise occur
Scheduling
• Data affinity
  • Assign all tasks that write to a particular block to the same thread (owner-computes rule, sketched below)
  • 2D block cyclic distribution
• Execution trace
  • Cholesky factorization: 4000 × 4000
  • Total time: 2D data affinity ≈ FIFO queue
  • Idle threads: 2D ≈ 27%, FIFO ≈ 17%
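A sketch of the owner-computes mapping under a 2D block cyclic distribution (PR × PC is an illustrative thread grid, not a value from the talk):

  #define PR 4   /* thread grid rows    */
  #define PC 6   /* thread grid columns */

  /* The thread owning block (i,j) executes every task that writes to
     that block; the mapping is fixed by the block coordinates alone. */
  static int owner( int i, int j )
  {
      return ( i % PR ) * PC + ( j % PC );
  }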
Scheduling
• Data granularity
  • Cost of a task >> cost of enqueue and dequeue
• Single vs. multiple queues
  • FIFO queue increases load balance
  • 2D data affinity decreases data communication
  • Combine the best aspects of both
Scheduling
• Cache affinity
  • Single priority queue sorted by task height
  • Software cache
    • LRU replacement
    • Line = block
    • Fully associative
• [Diagram: shared queue with enqueue/dequeue serving PE0 … PEp-1, each with a software cache $0 … $p-1]
Scheduling
• Cache affinity
  • Enqueue
    • Insert task
    • Sort queue via task heights
  • Dequeue (sketched below)
    • Search queue for a task whose output block is in the software cache
    • If found, return that task; otherwise return the head task
  • Dispatcher
    • Updates the software cache via a cache coherency protocol with write invalidation
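A sketch of the affinity-aware dequeue (illustrative; queue locking elided, Task as in the earlier sketches):

  #define CACHE_LINES 8   /* N = number of cached blocks per thread */

  typedef struct {
      double *block[ CACHE_LINES ];   /* software cache contents */
  } SoftCache;

  static int cached( SoftCache *c, double *blk )
  {
      for ( int i = 0; i < CACHE_LINES; i++ )
          if ( c->block[ i ] == blk ) return 1;
      return 0;
  }

  /* queue[0..*n) is kept sorted by task height (head first).  Prefer
     a task whose output block this thread already holds; otherwise
     fall back to the head task. */
  static Task *dequeue_affine( Task **queue, int *n, SoftCache *mine )
  {
      int pick = 0;
      for ( int i = 0; i < *n; i++ )
          if ( cached( mine, queue[ i ]->out ) ) { pick = i; break; }
      Task *t = queue[ pick ];
      for ( int i = pick; i + 1 < *n; i++ )    /* close the gap */
          queue[ i ] = queue[ i + 1 ];
      ( *n )--;
      return t;
  }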
Scheduling
• Optimizations
  • Prefetching
    • N = number of cache lines (blocks)
    • Touch the first N blocks accessed by the DAG to preload the cache before execution starts
  • Thread preference
    • Allow the thread that enqueues a task to dequeue it before other threads have the opportunity
    • Limits the variability of blocks migrating between threads
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Performance
• Target architecture
  • 4-socket 2.66 GHz Intel Dunnington
  • 24 cores
  • Linux and Windows
  • 16 MB shared L3 cache per socket
• OpenMP
  • Intel compiler 11.1
• BLAS
  • Intel MKL 10.2
Performance
• Implementations
  • SuperMatrix + serial MKL
    • FIFO queue, cache affinity
  • FLAME + multithreaded MKL
  • Multithreaded MKL
  • PLASMA + serial MKL
• Double-precision real floating-point arithmetic
• Tuned block sizes
Performance
• Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA)
  • Innovative Computing Laboratory, University of Tennessee
• Creates persistent POSIX threads
• Static pipelining
  • All threads execute the sequential algorithm by tiles
  • If a task is ready, execute it; otherwise, stall
  • The DAG is never explicitly constructed
• Copies the matrix from column-major order storage to block data layout and back to column-major
• Does not address programmability
Performance
[Performance charts]
Performance
• Inversion of a symmetric positive definite matrix, in three sweeps (composed below):
  • Cholesky factorization: A → L L^T (CHOL)
  • Inversion of a triangular matrix: R := L^-1 (TRINV)
  • Triangular matrix multiplication by its transpose: A^-1 := R^T R (TTMM)
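For reference, the three sweeps compose by a standard identity (with R = L^-1):

  \begin{align*}
  A      &= L L^{T}                   && \text{CHOL}  \\
  R      &:= L^{-1}                   && \text{TRINV} \\
  A^{-1} &= L^{-T} L^{-1} = R^{T} R   && \text{TTMM}
  \end{align*}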
Performance
[Performance charts]