Satisfying Your Dependencies with SuperMatrix
Ernie Chan
Cluster 2007
Motivation
• Transparent Parallelization of Matrix Operations for SMP and Multi-Core Architectures
  • Schedule submatrix operations out-of-order via dependency analysis
• Programmability
  • High-level abstractions to hide the details of parallelization from the user
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
SuperMatrix
SuperMatrix

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                            &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_LU_nopiv( A11 );
  FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_UNIT_DIAG, FLA_ONE, A11, A12 );
  FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 );
  FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix
• LU Factorization Without Pivoting
• Iteration 1 [DAG diagram: 1 LU, 4 TRSM, 4 GEMM tasks on a 3 x 3 matrix of blocks]
• Iteration 2 [DAG diagram: 1 LU, 2 TRSM, 1 GEMM task on the trailing 2 x 2]
• Iteration 3 [DAG diagram: 1 LU task on the final block]
SuperMatrix
• FLASH
• Matrix of matrices: the matrix is stored as a matrix of submatrix blocks (see the sketch below)
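What "matrix of matrices" means can be pictured with a small C struct: the top-level matrix holds references to contiguously stored submatrix blocks. This is a conceptual sketch only; the names below (flash_matrix_t, blk) are illustrative assumptions, not the actual libflame FLASH data structures.

  /* Conceptual sketch of FLASH-style hierarchical storage (illustrative,
   * not the libflame implementation).  The outer matrix is an mb x nb
   * array of pointers, each referring to one contiguous b x b block. */
  typedef struct {
      int      mb, nb;   /* number of block rows and block columns    */
      int      b;        /* dimension of each square submatrix block  */
      double **blk;      /* blk[ j*mb + i ] points to block (i,j)     */
  } flash_matrix_t;

Storing the matrix by blocks means every task in the code that follows reads and writes whole blocks, which is what makes block-level dependence analysis natural.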
SuperMatrix

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                            &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_LU_nopiv( A11 );
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_UNIT_DIAG, FLA_ONE, A11, A12 );
  FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 );
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );

Note the only changes from the flat code two slides back: FLA_ calls become FLASH_ calls, the repartitioning advances one block (rather than b elements) at a time, and FLASH_Queue_exec( ) finally triggers execution of the enqueued tasks.
SuperMatrix
• Analyzer
• Delay execution and place tasks on a queue
• Tasks are function pointers annotated with input/output information (see the sketch below)
• Compute dependence information (flow, anti, output) between all tasks
• Create DAG of tasks
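One way to picture the task records the analyzer builds is the C sketch below. The field names and fixed-size arrays are illustrative assumptions, not the actual libflame structures; the point is that each task carries a function pointer plus its input and output blocks, from which flow, anti, and output dependencies can be derived.

  /* Illustrative sketch of a SuperMatrix-style task record (assumed
   * names, not the libflame structures). */
  #define MAX_OPERANDS   4    /* a GEMM task touches only 3 blocks */
  #define MAX_DEPENDENTS 32

  typedef struct task {
      void (*func)( struct task *t );       /* wrapped operation, e.g. a GEMM */
      void *in [ MAX_OPERANDS ]; int n_in;  /* blocks this task reads         */
      void *out[ MAX_OPERANDS ]; int n_out; /* blocks this task writes        */
      int   n_deps;                         /* unresolved incoming DAG edges  */
      struct task *dependents[ MAX_DEPENDENTS ]; /* outgoing DAG edges        */
      int   n_dependents;
  } task_t;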
SuperMatrix
• Dispatcher
• Use DAG to execute tasks out-of-order in parallel
• Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
• SuperScalar vs. SuperMatrix
SuperMatrix
• Dispatcher
• 4 threads
• 5 x 5 matrix of blocks
• 55 tasks
• 18 stages
[DAG diagram of all 55 tasks: 5 LU, 20 TRSM, and 30 GEMM across the five iterations]
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Implementation
• Analyzer
[Diagram: LU, TRSM, and GEMM tasks on the linear Task Queue, and the DAG of tasks derived from them]
Implementation
• Analyzer
• FLASH routines enqueue tasks onto a global task queue
• Dependencies between each pair of tasks are calculated and stored in the task structure
• Each submatrix block stores the last task enqueued that writes to it
• Flow dependencies occur when a subsequent task reads that block (see the sketch below)
• DAG is embedded in the task queue
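The last-writer rule in the bullets above can be sketched in a few lines of C, building on the task_t sketch earlier. The names are assumptions for illustration; anti and output dependencies, which the analyzer also tracks, are omitted here for brevity.

  /* Sketch of flow-dependence detection via the last-writer rule
   * (illustrative names, flow dependencies only). */
  typedef struct block {
      double *buf;          /* the block's elements                         */
      task_t *last_writer;  /* last task enqueued that writes to this block */
  } block_t;

  static void add_edge( task_t *from, task_t *to )
  {
      from->dependents[ from->n_dependents++ ] = to;  /* DAG edge from -> to */
      to->n_deps++;                             /* one more input to wait on */
  }

  static void analyze_task( task_t *t, block_t **reads,  int n_reads,
                                       block_t **writes, int n_writes )
  {
      int i;
      for ( i = 0; i < n_reads; i++ )     /* flow (read-after-write) edges */
          if ( reads[ i ]->last_writer != NULL )
              add_edge( reads[ i ]->last_writer, t );
      for ( i = 0; i < n_writes; i++ )
          writes[ i ]->last_writer = t;   /* t becomes the last writer */
  }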
Implementation
• Dispatcher
[Diagram: tasks flowing from the Task Queue onto the Waiting Queue, with threads dequeuing ready LU, TRSM, and GEMM tasks]
Implementation
• Dispatcher
• Place ready and available tasks on a global waiting queue
• First task on the task queue is always ready and available
• Threads asynchronously dequeue tasks from the head of the waiting queue
• Once a task completes execution, notify dependent tasks and update the waiting queue
• Loop until all tasks complete execution (see the worker-loop sketch below)
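The bullets above amount to the loop each dispatcher thread runs; a minimal sketch follows, building on the task_t record sketched earlier. The two queue helpers are hypothetical names for the waiting-queue operations, not libflame functions; dequeue_ready_task( ) is assumed to return NULL once every task has completed.

  #include <pthread.h>

  task_t *dequeue_ready_task( void );  /* blocks until a task is ready (assumed) */
  void    enqueue_ready( task_t *t );  /* appends to the waiting queue (assumed) */

  static pthread_mutex_t dag_lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker( void *arg )
  {
      task_t *t;
      while ( ( t = dequeue_ready_task() ) != NULL )
      {
          t->func( t );                      /* execute, e.g. a block GEMM  */
          pthread_mutex_lock( &dag_lock );
          for ( int i = 0; i < t->n_dependents; i++ )
          {
              task_t *d = t->dependents[ i ];
              if ( --d->n_deps == 0 )        /* all of d's inputs are ready */
                  enqueue_ready( d );        /* d joins the waiting queue   */
          }
          pthread_mutex_unlock( &dag_lock );
      }
      return NULL;
  }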
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Performance Results
Performance Results
• GotoBLAS 1.13 installed on all machines
• Supported Operations
  • LAPACK-level functions
    • Cholesky factorization
    • LU factorization without pivoting
  • All level-3 BLAS
    • GEMM, TRMM, TRSM
    • SYMM, SYRK, SYR2K
    • HEMM, HERK, HER2K
Performance Results
• Implementations
  • SuperMatrix + serial BLAS
  • FLAME + multithreaded BLAS
  • LAPACK + multithreaded BLAS
• Block size = 192
• Processing elements = 8
Performance Results
• SuperMatrix Implementation
  • Fixed block size
    • Varying block sizes can lead to better performance
    • Experiments show 192 is generally the best
  • Simplest scheduling
    • No sorting to execute tasks on the critical path earlier
    • No attempt to improve data locality in these experiments
Performance Results
[Six slides of performance graphs]
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Conclusion
• Apply out-of-order execution techniques to schedule tasks
• The whole is greater than the sum of the parts
  • Exploit parallelism between operations
• Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties
Conclusion
• Programmability
  • Code at a high level without needing to deal with aspects of parallelization
Authors
• Ernie Chan
• Field G. Van Zee
• Enrique S. Quintana-Ortí
• Gregorio Quintana-Ortí
• Robert van de Geijn
• The University of Texas at Austin
• Universidad Jaume I
Acknowledgements
• We thank the Texas Advanced Computing Center (TACC) for access to their machines and for their support
• Funding
  • NSF Grants
    • CCF-0540926
    • CCF-0702714
Conclusion
• More Information
  http://www.cs.utexas.edu/users/flame
• Questions?
  echan@cs.utexas.edu