SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks Ernie Chan PPoPP 2008
Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008
Inversion of an SPD Matrix • Three Sweeps • Cholesky factorization (Chol): A → U^T U • Inversion of a triangular matrix (Trinv): R := U^{-1} • Triangular matrix multiplication by its transpose (Ttmm): A^{-1} := R R^T PPoPP 2008
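Why the three sweeps compute the inverse (a one-line derivation in LaTeX from the factorization above):

  A = U^T U \;\Rightarrow\; A^{-1} = \left( U^T U \right)^{-1} = U^{-1} U^{-T} = R R^T, \qquad \text{where } R = U^{-1}.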
Inversion of an SPD Matrix • Exposing Parallelism • Parallelizing the three sweeps independently of one another creates inherent synchronization points between them • Programmability • Use the tools provided by FLAME PPoPP 2008
Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008
Algorithms-by-Blocks PPoPP 2008
Algorithms-by-Blocks

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------------*/
    FLA_Chol( FLA_UPPER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A12 );
    FLA_Syrk( FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
              FLA_MINUS_ONE, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                           /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }

PPoPP 2008
Algorithms-by-Blocks • Cholesky Factorization • Iteration 1: tasks CHOL, TRSM, TRSM, SYRK, GEMM, SYRK PPoPP 2008
Algorithms-by-Blocks • Cholesky Factorization • Iteration 2: tasks CHOL, TRSM, SYRK PPoPP 2008
Algorithms-by-Blocks • Cholesky Factorization • Iteration 3: task CHOL PPoPP 2008
Algorithms-by-Blocks • FLASH • Matrix of matrices: each element of the top-level matrix is itself a submatrix (block) PPoPP 2008
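A minimal sketch of this storage scheme in C (hypothetical types and names; this is not the actual FLASH API): the top-level matrix is a two-dimensional array of block objects, so each block is a contiguous unit of memory that a single task can read or write.

  #include <stdlib.h>

  /* Hypothetical "matrix of matrices": each top-level element is a block. */
  typedef struct {
    int     m, n;      /* dimensions of this block                    */
    double *buffer;    /* contiguous storage for the block's elements */
  } block_t;

  typedef struct {
    int      mb, nb;   /* number of block rows and block columns      */
    block_t *blocks;   /* mb x nb array of blocks, stored row-major   */
  } hier_matrix_t;

  /* Create an mb x nb hierarchical matrix with b x b blocks. */
  static hier_matrix_t hier_create( int mb, int nb, int b )
  {
    hier_matrix_t A = { mb, nb, malloc( mb * nb * sizeof( block_t ) ) };
    for ( int i = 0; i < mb * nb; i++ ) {
      A.blocks[ i ].m      = b;
      A.blocks[ i ].n      = b;
      A.blocks[ i ].buffer = calloc( (size_t) b * b, sizeof( double ) );
    }
    return A;
  }

Because each block is self-contained, an operation on one block touches memory disjoint from every other block, which is what lets tasks on different blocks run concurrently.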
Algorithms-by-Blocks

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*----------------------- Chol Variant 3 --------------------------*/
    FLASH_Chol( FLA_UPPER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A12 );
    FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
                FLA_MINUS_ONE, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                           /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }

PPoPP 2008
Algorithms-by-Blocks

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*----------------------- Trinv Variant 3 -------------------------*/
    FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_MINUS_ONE, A11, A12 );
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_ONE, A01, A12, FLA_ONE, A02 );
    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A01 );
    FLASH_Trinv( FLA_UPPER_TRIANGULAR, FLA_NONUNIT_DIAG, A11 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                           /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }

PPoPP 2008
Algorithms-by-Blocks

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*----------------------- Ttmm Variant 1 --------------------------*/
    FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_ONE, A01, FLA_ONE, A00 );
    FLASH_Trmm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A01 );
    FLASH_Ttmm( FLA_UPPER_TRIANGULAR, A11 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                           /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }

PPoPP 2008
Algorithms-by-Blocks • SuperMatrix: Analyzer • Decomposes subproblems into component tasks • Enqueues tasks onto a global task queue • Internally calculates all dependencies between tasks, which form a directed acyclic graph

  FLASH_Chol_op ( A );
  FLASH_Trinv_op( A );
  FLASH_Ttmm_op ( A );
  FLASH_Queue_exec( );

PPoPP 2008
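To make the dependency analysis concrete, here is a rough sketch (hypothetical structures; the talk does not show SuperMatrix's internal representation): each task records how many of its inputs are still unsatisfied and which tasks depend on it, and these edges form the DAG.

  #define MAX_DEPS 16

  /* Hypothetical task node; SuperMatrix's real representation differs. */
  typedef struct task {
    void (*func)( void ** );             /* the wrapped operation        */
    void  *args[ 4 ];                    /* operand blocks               */
    int    n_unsatisfied;                /* inputs not yet produced      */
    int    n_dependents;                 /* number of outgoing DAG edges */
    struct task *dependents[ MAX_DEPS ]; /* tasks waiting on this one    */
  } task_t;

  /* Analyzer: record that task "succ" consumes a block "pred" writes. */
  static void add_dependency( task_t *pred, task_t *succ )
  {
    pred->dependents[ pred->n_dependents++ ] = succ;
    succ->n_unsatisfied++;
  }

A task with n_unsatisfied == 0 has all of its inputs available and is ready to execute.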
Algorithms-by-Blocks • SuperMatrix: Dispatcher • Places ready and available tasks on a global waiting queue • Threads asynchronously dequeue tasks from the head of the waiting queue • Once a task completes execution, notifies its dependent tasks and updates the waiting queue • Loops until all tasks have completed execution (a sketch of such a worker loop follows) PPoPP 2008
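Continuing the hypothetical task_t sketch above, each thread could run a worker loop like the following; waiting_dequeue and waiting_enqueue stand in for the thread-safe global waiting queue, and the counter updates are assumed to happen atomically or under a lock.

  /* Hypothetical interface to the global waiting queue (thread-safe). */
  task_t *waiting_dequeue( void );        /* returns NULL once all done */
  void    waiting_enqueue( task_t *t );

  static void worker( void )
  {
    task_t *t;
    while ( ( t = waiting_dequeue() ) != NULL ) {
      t->func( t->args );                           /* execute the task   */
      for ( int i = 0; i < t->n_dependents; i++ ) { /* notify dependents  */
        task_t *d = t->dependents[ i ];
        if ( --d->n_unsatisfied == 0 )              /* last input arrived */
          waiting_enqueue( d );                     /* task becomes ready */
      }
    }
  }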
Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008
Performance • Target Architecture • 16-CPU Itanium2 • ccNUMA: 8 dual-processor nodes • OpenMP: Intel Compiler 9.0 • BLAS: GotoBLAS 1.15, Intel MKL 8.1 PPoPP 2008
Performance • Implementations • SuperMatrix + serial BLAS • FLAME + multithreaded BLAS • LAPACK + multithreaded BLAS • Block size = 192 • Processors = 16 PPoPP 2008
Performance • SuperMatrix Implementation • Fixed block size • Varying block sizes can lead to better performance • Experiments show that 192 is generally best • Simplest scheduling • No sorting of tasks so that those on the critical path execute earlier • No attempt to improve data locality in these experiments PPoPP 2008
Performance • [Performance graphs comparing the SuperMatrix, FLAME, and LAPACK implementations, linked against GotoBLAS and MKL] PPoPP 2008
Performance • Results • The only difference between FLAME and LAPACK is the use of different algorithmic variants • GotoBLAS and MKL yield similar performance curves • SuperMatrix performance ramps up much faster PPoPP 2008
Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008
Conclusion • Abstractions hide details of parallelization from users • SuperMatrix extracts parallelism across subroutine boundaries PPoPP 2008
Authors • Field G. Van Zee • Paolo Bientinesi • Enrique S. Quintana-Ortí • Gregorio Quintana-Ortí • Robert van de Geijn • The University of Texas at Austin • Duke University • Universidad Jaume I PPoPP 2008
Acknowledgements • We thank the other members of the FLAME team for their support • Funding • NSF Grants • CCF-0540926 • CCF-0702714 PPoPP 2008
References

[1] Paolo Bientinesi, Brian Gunter, and Robert van de Geijn. Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix. ACM Transactions on Mathematical Software. To appear.

[2] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.

[3] Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying Your Dependencies with SuperMatrix. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 2007.

[4] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: The LU Factorization. Accepted to MTAAP 2008.

[5] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Accepted to Euromicro PDP 2008.

PPoPP 2008
Conclusion • More Information http://www.cs.utexas.edu/~flame • Questions? echan@cs.utexas.edu PPoPP 2008