Satisfying Your Dependencies with SuperMatrix
Ernie Chan
Cluster 2007
Motivation
• Transparent Parallelization of Matrix Operations for SMP and Multi-Core Architectures
  • Schedule submatrix operations out-of-order via dependency analysis
• Programmability
  • High-level abstractions to hide the details of parallelization from the user
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
SuperMatrix
SuperMatrix

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                            &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_LU_nopiv( A11 );
  FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_UNIT_DIAG, FLA_ONE, A11, A12 );
  FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 );
  FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix
• LU Factorization Without Pivoting
• Iteration 1 [DAG diagram: 1 LU, 4 TRSM, 4 GEMM tasks on a 3 x 3 matrix of blocks]
• Iteration 2 [DAG diagram: 1 LU, 2 TRSM, 1 GEMM task on the trailing 2 x 2]
• Iteration 3 [DAG diagram: 1 LU task on the final block]
SuperMatrix
• FLASH
• Matrix of matrices: the matrix is stored as a matrix of submatrix blocks (see the sketch below)
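What "matrix of matrices" means can be pictured with a small C struct: the top-level matrix holds references to contiguously stored submatrix blocks. This is a conceptual sketch only; the names below (flash_matrix_t, blk) are illustrative assumptions, not the actual libflame FLASH data structures.

  /* Conceptual sketch of FLASH-style hierarchical storage (illustrative,
   * not the libflame implementation).  The outer matrix is an mb x nb
   * array of pointers, each referring to one contiguous b x b block. */
  typedef struct {
      int      mb, nb;   /* number of block rows and block columns    */
      int      b;        /* dimension of each square submatrix block  */
      double **blk;      /* blk[ j*mb + i ] points to block (i,j)     */
  } flash_matrix_t;

Storing the matrix by blocks means every task in the code that follows reads and writes whole blocks, which is what makes block-level dependence analysis natural.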
SuperMatrix

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                            &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_LU_nopiv( A11 );
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_UNIT_DIAG, FLA_ONE, A11, A12 );
  FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 );
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );

Note the only changes from the flat code two slides back: FLA_ calls become FLASH_ calls, the repartitioning advances one block (rather than b elements) at a time, and FLASH_Queue_exec( ) finally triggers execution of the enqueued tasks.
SuperMatrix
• Analyzer
• Delay execution and place tasks on a queue
• Tasks are function pointers annotated with input/output information (see the sketch below)
• Compute dependence information (flow, anti, output) between all tasks
• Create DAG of tasks
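One way to picture the task records the analyzer builds is the C sketch below. The field names and fixed-size arrays are illustrative assumptions, not the actual libflame structures; the point is that each task carries a function pointer plus its input and output blocks, from which flow, anti, and output dependencies can be derived.

  /* Illustrative sketch of a SuperMatrix-style task record (assumed
   * names, not the libflame structures). */
  #define MAX_OPERANDS   4    /* a GEMM task touches only 3 blocks */
  #define MAX_DEPENDENTS 32

  typedef struct task {
      void (*func)( struct task *t );       /* wrapped operation, e.g. a GEMM */
      void *in [ MAX_OPERANDS ]; int n_in;  /* blocks this task reads         */
      void *out[ MAX_OPERANDS ]; int n_out; /* blocks this task writes        */
      int   n_deps;                         /* unresolved incoming DAG edges  */
      struct task *dependents[ MAX_DEPENDENTS ]; /* outgoing DAG edges        */
      int   n_dependents;
  } task_t;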
SuperMatrix
• Dispatcher
• Use DAG to execute tasks out-of-order in parallel
• Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
• SuperScalar vs. SuperMatrix
SuperMatrix
• Dispatcher
• 4 threads
• 5 x 5 matrix of blocks
• 55 tasks
• 18 stages
[DAG diagram of all 55 tasks: 5 LU, 20 TRSM, and 30 GEMM across the five iterations]
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Implementation
• Analyzer
[Diagram: LU, TRSM, and GEMM tasks on the linear Task Queue, and the DAG of tasks derived from them]
Implementation
• Analyzer
• FLASH routines enqueue tasks onto a global task queue
• Dependencies between each pair of tasks are calculated and stored in the task structure
• Each submatrix block stores the last task enqueued that writes to it
• Flow dependencies occur when a subsequent task reads that block (see the sketch below)
• DAG is embedded in the task queue
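The last-writer rule in the bullets above can be sketched in a few lines of C, building on the task_t sketch earlier. The names are assumptions for illustration; anti and output dependencies, which the analyzer also tracks, are omitted here for brevity.

  /* Sketch of flow-dependence detection via the last-writer rule
   * (illustrative names, flow dependencies only). */
  typedef struct block {
      double *buf;          /* the block's elements                         */
      task_t *last_writer;  /* last task enqueued that writes to this block */
  } block_t;

  static void add_edge( task_t *from, task_t *to )
  {
      from->dependents[ from->n_dependents++ ] = to;  /* DAG edge from -> to */
      to->n_deps++;                             /* one more input to wait on */
  }

  static void analyze_task( task_t *t, block_t **reads,  int n_reads,
                                       block_t **writes, int n_writes )
  {
      int i;
      for ( i = 0; i < n_reads; i++ )     /* flow (read-after-write) edges */
          if ( reads[ i ]->last_writer != NULL )
              add_edge( reads[ i ]->last_writer, t );
      for ( i = 0; i < n_writes; i++ )
          writes[ i ]->last_writer = t;   /* t becomes the last writer */
  }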
Implementation
• Dispatcher
[Diagram: tasks flowing from the Task Queue onto the Waiting Queue, with threads dequeuing ready LU, TRSM, and GEMM tasks]
Implementation
• Dispatcher
• Place ready and available tasks on a global waiting queue
• First task on the task queue is always ready and available
• Threads asynchronously dequeue tasks from the head of the waiting queue
• Once a task completes execution, notify dependent tasks and update the waiting queue
• Loop until all tasks complete execution (see the worker-loop sketch below)
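The bullets above amount to the loop each dispatcher thread runs; a minimal sketch follows, building on the task_t record sketched earlier. The two queue helpers are hypothetical names for the waiting-queue operations, not libflame functions; dequeue_ready_task( ) is assumed to return NULL once every task has completed.

  #include <pthread.h>

  task_t *dequeue_ready_task( void );  /* blocks until a task is ready (assumed) */
  void    enqueue_ready( task_t *t );  /* appends to the waiting queue (assumed) */

  static pthread_mutex_t dag_lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker( void *arg )
  {
      task_t *t;
      while ( ( t = dequeue_ready_task() ) != NULL )
      {
          t->func( t );                      /* execute, e.g. a block GEMM  */
          pthread_mutex_lock( &dag_lock );
          for ( int i = 0; i < t->n_dependents; i++ )
          {
              task_t *d = t->dependents[ i ];
              if ( --d->n_deps == 0 )        /* all of d's inputs are ready */
                  enqueue_ready( d );        /* d joins the waiting queue   */
          }
          pthread_mutex_unlock( &dag_lock );
      }
      return NULL;
  }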
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Performance Results
Performance Results
• GotoBLAS 1.13 installed on all machines
• Supported Operations
  • LAPACK-level functions
    • Cholesky factorization
    • LU factorization without pivoting
  • All level-3 BLAS
    • GEMM, TRMM, TRSM
    • SYMM, SYRK, SYR2K
    • HEMM, HERK, HER2K
Performance Results
• Implementations
  • SuperMatrix + serial BLAS
  • FLAME + multithreaded BLAS
  • LAPACK + multithreaded BLAS
• Block size = 192
• Processing elements = 8
Performance Results
• SuperMatrix Implementation
  • Fixed block size
    • Varying block sizes can lead to better performance
    • Experiments show 192 is generally the best
  • Simplest scheduling
    • No sorting to execute tasks on the critical path earlier
    • No attempt to improve data locality in these experiments
Performance Results
[Six slides of performance graphs]
Outline
• SuperMatrix
• Implementation
• Performance Results
• Conclusion
Conclusion
• Apply out-of-order execution techniques to schedule tasks
• The whole is greater than the sum of the parts
  • Exploit parallelism between operations
• Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties
Conclusion
• Programmability
  • Code at a high level without needing to deal with aspects of parallelization
Authors
• Ernie Chan
• Field G. Van Zee
• Enrique S. Quintana-Ortí
• Gregorio Quintana-Ortí
• Robert van de Geijn
• The University of Texas at Austin
• Universidad Jaume I
Acknowledgements
• We thank the Texas Advanced Computing Center (TACC) for access to their machines and for their support
• Funding
  • NSF Grants
    • CCF-0540926
    • CCF-0702714
Conclusion
• More Information
  http://www.cs.utexas.edu/users/flame
• Questions?
  echan@cs.utexas.edu