
Runtime Data Flow Graph Scheduling of Matrix Computations


Presentation Transcript


  1. Runtime Data Flow Graph Scheduling of Matrix Computations • Ernie Chan

  2. Motivation
  • “The Free Lunch Is Over” – Herb Sutter
  • Parallelize or perish
  • Popular libraries like Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten
    • FORTRAN 77
    • Column-major order matrix storage
    • 187+ operations for each datatype
    • One routine (algorithm) per operation

  3. Teaser • (figure: performance graph; annotations “Better” and “Theoretical Peak Performance”)

  4. Goals
  • Programmability
    • Use tools provided by FLAME
  • Parallelism
    • Directed acyclic graph (DAG) scheduling

  5. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  6. SuperMatrix
  • Formal Linear Algebra Method Environment (FLAME)
    • High-level abstractions for expressing linear algebra algorithms
  • Cholesky factorization: A → L L^T

  7. SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************ */   /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************* */   /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  8. SuperMatrix
  • Cholesky Factorization
  • Iterations 1 and 2 each perform the same three tasks on the current partitioning:
    • CHOL: A11 := Chol( A11 )
    • TRSM: A21 := A21 A11^-T
    • SYRK: A22 := A22 – A21 A21^T

  9. SuperMatrix
  • LAPACK-style Implementation

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

  10. SuperMatrix • FLASH • Storage-by-blocks, algorithm-by-blocks
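  As a rough illustration of storage-by-blocks, the sketch below (plain C; the types and names are illustrative, not the FLASH API) stores the matrix as a 2D array of descriptors, each pointing to a contiguously stored block:

      typedef struct
      {
          double *data;      /* b x b entries, contiguous, column-major within the block */
      } Block;

      typedef struct
      {
          int    mb, nb;     /* number of blocks in each dimension                */
          int    b;          /* block (tile) size                                 */
          Block *blocks;     /* mb * nb block descriptors, column-major by block  */
      } BlockMatrix;

      /* Global element ( i, j ) lives in block ( i / b, j / b )
         at local offset ( i % b, j % b ). */
      static double get_entry( const BlockMatrix *A, int i, int j )
      {
          const Block *blk = &A->blocks[ ( j / A->b ) * A->mb + ( i / A->b ) ];
          return blk->data[ ( j % A->b ) * A->b + ( i % A->b ) ];
      }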

  11. SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************ */   /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************* */   /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  12. SuperMatrix
  • Cholesky Factorization, Iteration 1
    • CHOL0: A0,0 := Chol( A0,0 )

  13. SuperMatrix
  • Cholesky Factorization, Iteration 1
    • CHOL0: A0,0 := Chol( A0,0 )
    • TRSM1: A1,0 := A1,0 A0,0^-T
    • TRSM2: A2,0 := A2,0 A0,0^-T

  14. SuperMatrix
  • Cholesky Factorization, Iteration 1
    • CHOL0: A0,0 := Chol( A0,0 )
    • TRSM1: A1,0 := A1,0 A0,0^-T
    • TRSM2: A2,0 := A2,0 A0,0^-T
    • SYRK3: A1,1 := A1,1 – A1,0 A1,0^T
    • GEMM4: A2,1 := A2,1 – A2,0 A1,0^T
    • SYRK5: A2,2 := A2,2 – A2,0 A2,0^T

  15. SuperMatrix
  • Cholesky Factorization, Iteration 2 (DAG now contains CHOL0 … SYRK5 plus the new tasks)
    • CHOL6: A1,1 := Chol( A1,1 )
    • TRSM7: A2,1 := A2,1 A1,1^-T
    • SYRK8: A2,2 := A2,2 – A2,1 A2,1^T

  16. SuperMatrix
  • Cholesky Factorization, Iteration 3 (completes the DAG of ten tasks)
    • CHOL9: A2,2 := Chol( A2,2 )
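  To make the task numbering on the preceding slides concrete, this small, self-contained C program (an illustration only, not the SuperMatrix analyzer) enumerates the tasks of algorithm-by-blocks Cholesky on a 3×3 grid of blocks, reproducing CHOL0 through CHOL9:

      #include <stdio.h>

      int main( void )
      {
          int n  = 3;    /* 3 x 3 blocks gives the ten tasks CHOL0 ... CHOL9 */
          int id = 0;

          for ( int k = 0; k < n; k++ )
          {
              printf( "%2d: CHOL  A%d,%d := Chol( A%d,%d )\n", id++, k, k, k, k );

              for ( int i = k + 1; i < n; i++ )
                  printf( "%2d: TRSM  A%d,%d := A%d,%d A%d,%d^-T\n",
                          id++, i, k, i, k, k, k );

              for ( int j = k + 1; j < n; j++ )
                  for ( int i = j; i < n; i++ )
                      if ( i == j )
                          printf( "%2d: SYRK  A%d,%d := A%d,%d - A%d,%d A%d,%d^T\n",
                                  id++, j, j, j, j, j, k, j, k );
                      else
                          printf( "%2d: GEMM  A%d,%d := A%d,%d - A%d,%d A%d,%d^T\n",
                                  id++, i, j, i, j, i, k, j, k );
          }
          return 0;
      }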

  17. SuperMatrix
  • Separation of Concerns
  • Analyzer
    • Decomposes subproblems into component tasks
    • Stores tasks sequentially in a global task queue
    • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
  • Dispatcher
    • Spawns threads
    • Schedules and dispatches tasks to threads in parallel

  18. SuperMatrix
  • Analyzer
    • Detect flow, anti, and output dependencies
    • Embed pointers into hierarchical matrices
    • Block size manifests as the size of the contiguously stored blocks
    • Can be performed statically
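  A hedged sketch of how dependencies can be classified purely from each task's input and output blocks (the struct and function below are illustrative, not the SuperMatrix internals; the block pointers embedded in the hierarchical matrix serve as identifiers):

      typedef struct
      {
          void *in[3];   int n_in;    /* blocks read by the task    */
          void *out[1];  int n_out;   /* blocks written by the task */
      } Task;

      /* Returns nonzero if 'later' (enqueued after 'earlier') must wait for it. */
      static int depends_on( const Task *earlier, const Task *later )
      {
          for ( int i = 0; i < earlier->n_out; i++ )
          {
              for ( int j = 0; j < later->n_in; j++ )       /* flow: read-after-write    */
                  if ( earlier->out[i] == later->in[j] )  return 1;
              for ( int j = 0; j < later->n_out; j++ )      /* output: write-after-write */
                  if ( earlier->out[i] == later->out[j] ) return 1;
          }
          for ( int i = 0; i < earlier->n_in; i++ )         /* anti: write-after-read    */
              for ( int j = 0; j < later->n_out; j++ )
                  if ( earlier->in[i] == later->out[j] )  return 1;
          return 0;
      }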

  19. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  20. Scheduling
  • Dispatcher

      Enqueue ready tasks
      while tasks are available do
          Dequeue task
          Execute task
          foreach dependent task do
              Update dependent task
              if dependent task is ready then
                  Enqueue dependent task
              end
          end
      end

  21. Scheduling • Dispatcher (same pseudocode as the previous slide)
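  A minimal C-style sketch of the dispatcher loop above, run by every worker thread; the Queue type, the queue_enqueue/queue_dequeue/task_execute helpers, and the per-task dependent list with its n_unsatisfied counter are assumptions for illustration:

      void dispatcher( Queue *ready, int total_tasks, int *tasks_done )
      {
          while ( __sync_fetch_and_add( tasks_done, 0 ) < total_tasks )
          {
              Task *t = queue_dequeue( ready );        /* NULL if nothing is ready yet */
              if ( t == NULL ) continue;

              task_execute( t );
              __sync_fetch_and_add( tasks_done, 1 );

              for ( int i = 0; i < t->n_dependent; i++ )
              {
                  Task *d = t->dependent[i];
                  /* last unsatisfied dependency released: d becomes ready */
                  if ( __sync_sub_and_fetch( &d->n_unsatisfied, 1 ) == 0 )
                      queue_enqueue( ready, d );
              }
          }
      }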

  22. Scheduling
  • Supermarket
    • p lines for p cashiers
    • Efficient enqueue and dequeue
    • Schedule depends on the task-to-thread assignment
  • Bank
    • 1 line for p tellers
    • Enqueue and dequeue become bottlenecks
    • Dynamic dispatching of tasks to threads

  23. Scheduling
  • Single Queue
    • Set of all ready and available tasks
    • FIFO, priority
  (figure: one shared queue feeding processing elements PE0 … PEp-1)

  24. Scheduling
  • Multiple Queues
    • Work stealing, data affinity
  (figure: one queue per processing element PE0 … PEp-1)

  25. Scheduling
  • Work Stealing

      Enqueue ready tasks
      while tasks are available do
          Dequeue task
          if task ≠ Ø then
              Execute task
              Update dependent tasks …
          else
              Steal task
          end
      end

  • Enqueue
    • Place all dependent tasks on the queue of the same thread that executes the task
  • Steal
    • Select a random thread and remove a task from the tail of its queue
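  A sketch of this work-stealing loop in C (the Deque type and the deque_*/task_* helpers are assumptions, not code from the talk): each thread works from the head of its own queue and, when it is empty, steals from the tail of a random victim's queue:

      #include <stdlib.h>

      void worker( int me, Deque queues[], int p )
      {
          while ( tasks_remaining() )
          {
              Task *t = deque_pop_head( &queues[me] );
              if ( t == NULL )
              {
                  int victim = rand() % p;                    /* pick a random thread */
                  t = deque_steal_tail( &queues[victim] );    /* steal from its tail  */
                  if ( t == NULL ) continue;
              }
              task_execute( t );
              /* dependents that become ready go onto this thread's own queue */
              for ( int i = 0; i < t->n_dependent; i++ )
                  if ( task_update( t->dependent[i] ) )
                      deque_push_head( &queues[me], t->dependent[i] );
          }
      }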

  26. Scheduling
  • Work Stealing Mailbox
    • Each thread has an associated mailbox
    • Enqueue task onto a queue and place it in a mailbox
    • Tasks can be assigned to mailboxes using a 2D distribution
    • Before attempting a steal, first check the mailbox
    • Optimizes for data locality instead of random stealing
    • The mailbox is only checked when a steal would occur
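  Extending the previous sketch with the mailbox idea (again with hypothetical Mailbox and helper names): before resorting to a random steal, a thread first checks its own mailbox, which holds tasks routed to it by the 2D distribution:

      Task *find_work( int me, Deque queues[], Mailbox mail[], int p )
      {
          Task *t = deque_pop_head( &queues[me] );
          if ( t != NULL ) return t;

          t = mailbox_take( &mail[me] );       /* checked only when a steal would occur */
          if ( t != NULL ) return t;

          return deque_steal_tail( &queues[rand() % p] );   /* fall back to random steal */
      }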

  27. Scheduling
  • Data Affinity
    • Assign all tasks that write to a particular block to the same thread
    • Owner-computes rule
    • 2D block cyclic distribution (figure: 3×3 grid of blocks labeled with owning threads 0–3)
  • Execution Trace
    • Cholesky factorization: 4000×4000
    • Total time: 2D data affinity ≈ FIFO queue
    • Idle threads: 2D ≈ 27%, FIFO ≈ 17%
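  A tiny sketch of the owner-computes rule under a 2D block cyclic distribution; the pr × pc thread grid and the owner() helper are illustrative, not the dissertation's code:

      /* Thread that owns block ( i, j ) and therefore executes every task writing it. */
      static int owner( int i, int j, int pr, int pc )
      {
          return ( i % pr ) * pc + ( j % pc );
      }

      /* For a 3 x 3 grid of blocks on a 2 x 2 thread grid ( pr = pc = 2 ):
             0 1 0
             2 3 2
             0 1 0   */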

  28. Scheduling
  • Data Granularity
    • Cost of a task >> cost of enqueue and dequeue
  • Single vs. Multiple Queues
    • FIFO queue increases load balance
    • 2D data affinity decreases data communication
    • Combine the best aspects of both

  29. Scheduling
  • Cache Affinity
    • Single priority queue sorted by task height
    • Software cache
      • LRU
      • Line = block
      • Fully associative
  (figure: shared queue feeding PE0 … PEp-1, each with a software cache $0 … $p-1)

  30. Scheduling
  • Cache Affinity
  • Enqueue
    • Insert task
    • Sort queue via task heights
  • Dequeue
    • Search queue for a task with an output block in the software cache
    • If found, return that task; otherwise, return the head task
  • Dispatcher
    • Update software cache via cache coherency protocol with write invalidation
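  A rough sketch of that cache-affinity dequeue in C (the PriorityQueue and SoftwareCache interfaces are assumptions for illustration): prefer a task whose output block already resides in the dequeuing thread's software cache, otherwise take the head of the queue:

      Task *dequeue_cache_affinity( PriorityQueue *q, SoftwareCache *cache )
      {
          Task *head = queue_first( q );           /* queue is kept sorted by task height */
          if ( head == NULL ) return NULL;

          for ( Task *t = head; t != NULL; t = queue_next( q, t ) )
              if ( cache_contains( cache, t->out[0] ) )   /* output block already resident? */
                  return queue_remove( q, t );

          return queue_remove( q, head );          /* otherwise fall back to the head task */
      }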

  31. Scheduling
  • Optimizations
  • Prefetching
    • N = number of cache lines (blocks)
    • Touch the first N blocks accessed by the DAG to preload the cache before execution starts
  • Thread preference
    • Allow the thread enqueuing a task to dequeue it before other threads have the opportunity
    • Limit the variability of blocks migrating between threads
  (figure: processing elements and their caches connected through memory coherency)

  32. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  33. Performance
  • Target Architecture
    • 4-socket 2.66 GHz Intel Dunnington
      • 24 cores
      • 16 MB shared L3 cache per socket
      • Linux and Windows
    • OpenMP
      • Intel compiler 11.1
    • BLAS
      • Intel MKL 10.2

  34. Performance
  • Implementations
    • SuperMatrix + serial MKL
      • FIFO queue, cache affinity
    • FLAME + multithreaded MKL
    • Multithreaded MKL
    • PLASMA + serial MKL
  • Double-precision real floating-point arithmetic
  • Tuned block size

  35. Performance
  • Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA)
    • Innovative Computing Laboratory, University of Tennessee
    • Creates persistent POSIX threads
    • Static pipelining
      • All threads execute the sequential algorithm by tiles
      • If a task is ready, execute it; otherwise, stall
    • DAG is not explicitly constructed
    • Copies the matrix from column-major order storage to block data layout and back to column-major
    • Does not address programmability

  36.–40. Performance (performance graphs)

  41. Performance
  • Inversion of a Symmetric Positive Definite Matrix
    • Cholesky factorization: A → L L^T (CHOL)
    • Inversion of a triangular matrix: R := L^-1 (TRINV)
    • Triangular matrix multiplication by its transpose: A^-1 := R^T R (TTMM)
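  The three steps correspond to standard LAPACK routines; the sketch below is one possible sequential realization (the trailing-underscore Fortran bindings are an assumption about the toolchain, and error handling is omitted):

      /* dpotrf: A -> L L^T (CHOL); dtrtri: R := L^-1 (TRINV);
         dlauum with 'Lower': forms R^T R in place, i.e. A^-1 (TTMM). */
      extern void dpotrf_( const char *uplo, const int *n, double *A,
                           const int *lda, int *info );
      extern void dtrtri_( const char *uplo, const char *diag, const int *n,
                           double *A, const int *lda, int *info );
      extern void dlauum_( const char *uplo, const int *n, double *A,
                           const int *lda, int *info );

      void spd_inverse( double *A, int n, int ldA )
      {
          int info;
          dpotrf_( "Lower", &n, A, &ldA, &info );              /* CHOL  */
          dtrtri_( "Lower", "Non-unit", &n, A, &ldA, &info );  /* TRINV */
          dlauum_( "Lower", &n, A, &ldA, &info );              /* TTMM  */
      }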

  42.–50. Performance (performance graphs)
