SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures Ernie Chan The University of Texas at Austin SPAA 2007
Motivation • Motivating Example • Cholesky Factorization • A → L Lᵀ SPAA 2007
Motivation • Peak Performance 96 Gflops [performance graph] SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
Performance • Target Architecture • 16 CPU Itanium2 • NUMA • 8 dual-processor nodes • OpenMP • Intel Compiler 9.0 • BLAS • GotoBLAS 1.06 • Intel MKL 8.1 SPAA 2007
Performance • Implementations • Multithreaded BLAS (Sequential algorithm) • LAPACK dpotrf • FLAME var3 • Serial BLAS (Parallel algorithm) • Data-flow SPAA 2007
Performance • Implementations • Column-major order storage • Varying block sizes • { 64, 96, 128, 160, 192, 224, 256 } • Select best performance for each problem size SPAA 2007
Performance SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
FLAME • Formal Linear Algebra Methods Environment • High-level abstraction away from indices • “Views” into matrices • Seamless transition from algorithms to code SPAA 2007
FLAME • Cholesky Factorization

    for ( j = 0; j < n; j++ ) {
      A[j,j] = sqrt( A[j,j] );
      for ( i = j+1; i < n; i++ )
        A[i,j] = A[i,j] / A[j,j];
      for ( k = j+1; k < n; k++ )
        for ( i = k; i < n; i++ )
          A[i,k] = A[i,k] - A[i,j] * A[k,j];
    }

SPAA 2007
FLAME • LAPACK dpotrf • Different variant (right-looking)

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

SPAA 2007
FLAME • Partitioning Matrices SPAA 2007
FLAME SPAA 2007
FLAME

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,    0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                          /* ************** */  /* ********************* */
                                                  &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                             b, b, FLA_BR );
      /*---------------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*---------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                       A10, A11, /**/ A12,
                              /* *************** */  /* ****************** */
                                &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                                FLA_TL );
    }

SPAA 2007
FLAME SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Data-flow • 2D data affinity • Contiguous storage • Conclusion SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 1 [figure: tasks Chol, Trsm, Trsm, Syrk, Gemm, Syrk generated on the blocks] SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 2 [figure: tasks Chol, Trsm, Syrk generated on the blocks] SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 3 [figure: task Chol generated on the last block] SPAA 2007
SuperMatrix • Analyzer • Delay execution and place tasks on queue • Tasks are function pointers annotated with input/output information • Compute dependence information (flow, anti, output) between all tasks • Create DAG of tasks SPAA 2007
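A rough illustration of this step (not the actual FLASH_Queue internals; the task struct, touches(), and analyzer_enqueue() below are hypothetical names): each task records the blocks it reads and writes, and a DAG edge is added whenever a task reads or writes a block that an earlier task wrote, or overwrites a block that an earlier task read.

    /* Hypothetical sketch of the analyzer stage, not the FLAME/FLASH API. */
    #define MAX_ARGS  4
    #define MAX_TASKS 1024

    typedef struct task {
        void (*func)( struct task * );   /* wrapped BLAS/LAPACK block operation */
        void  *in [MAX_ARGS];            /* blocks read by the task             */
        void  *out[MAX_ARGS];            /* blocks written by the task          */
        int    n_in, n_out;
        struct task *succ[MAX_TASKS];    /* outgoing DAG edges                  */
        int    n_succ;
        int    n_pred;                   /* number of unmet dependences         */
    } task;

    static task *queue[MAX_TASKS];       /* tasks in program (sequential) order */
    static int   n_tasks = 0;

    static int touches( void *blk, void **list, int n )
    {
        for ( int i = 0; i < n; i++ )
            if ( list[i] == blk ) return 1;
        return 0;
    }

    /* Delay execution: place the task on the queue and record flow, anti, and
       output dependences against earlier tasks that touch the same blocks.    */
    void analyzer_enqueue( task *t )
    {
        for ( int k = 0; k < n_tasks; k++ ) {
            task *prev = queue[k];
            int dep = 0;
            for ( int i = 0; i < t->n_in  && !dep; i++ )   /* flow: read after write */
                dep = touches( t->in[i],  prev->out, prev->n_out );
            for ( int i = 0; i < t->n_out && !dep; i++ )   /* anti and output deps   */
                dep = touches( t->out[i], prev->in,  prev->n_in ) ||
                      touches( t->out[i], prev->out, prev->n_out );
            if ( dep ) {                                   /* add a DAG edge         */
                prev->succ[ prev->n_succ++ ] = t;
                t->n_pred++;
            }
        }
        queue[ n_tasks++ ] = t;
    }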
SuperMatrix • Analyzer [figure: the task queue (Chol, Trsm, Trsm, Syrk, Gemm, Syrk, Chol, …) and the corresponding DAG of tasks] SPAA 2007
SuperMatrix • FLASH • Matrix of matrices SPAA 2007
SuperMatrix

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,    0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                          /* ************** */  /* ********************* */
                                                  &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                             1, 1, FLA_BR );
      /*---------------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*---------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                       A10, A11, /**/ A12,
                              /* *************** */  /* ****************** */
                                &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                                FLA_TL );
    }

    FLASH_Queue_exec( );

SPAA 2007
SuperMatrix • Dispatcher • Use DAG to execute tasks out-of-order in parallel • Akin to Tomasulo’s algorithm and instruction-level parallelism on blocks of computation • SuperScalar vs. SuperMatrix SPAA 2007
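Continuing the sketch above (same hypothetical task struct and MAX_TASKS constant; this is an illustration of the out-of-order idea, not the SuperMatrix runtime), the dispatcher can be pictured as an OpenMP worker loop: each thread repeatedly grabs any task whose dependences are all satisfied, executes it, and then releases its successors.

    /* Hypothetical dispatcher sketch; reuses the task struct from the analyzer sketch. */
    #include <omp.h>

    static task *ready[MAX_TASKS];                 /* tasks whose dependences are met   */
    static int   n_ready = 0;

    void dispatcher_execute( task **all, int n )
    {
        int done = 0;
        for ( int k = 0; k < n; k++ )              /* seed with initially ready tasks   */
            if ( all[k]->n_pred == 0 )
                ready[ n_ready++ ] = all[k];

        #pragma omp parallel shared( done )
        {
            while ( 1 ) {
                task *t = NULL;
                int finished;
                #pragma omp critical (scheduler)
                {
                    finished = ( done == n );
                    if ( !finished && n_ready > 0 )
                        t = ready[ --n_ready ];    /* out-of-order: any ready task      */
                }
                if ( finished ) break;
                if ( t == NULL ) continue;         /* spin until a predecessor finishes */

                t->func( t );                      /* execute the block operation       */

                #pragma omp critical (scheduler)
                {
                    done++;
                    for ( int k = 0; k < t->n_succ; k++ )
                        if ( --t->succ[k]->n_pred == 0 )   /* release dependents        */
                            ready[ n_ready++ ] = t->succ[k];
                }
            }
        }
    }

Out-of-order execution falls out naturally: a task becomes eligible the moment its predecessors in the DAG complete, regardless of its position in the original sequential order.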
SuperMatrix • Dispatcher • 4 threads • 5 x 5 matrix of blocks • 35 tasks • 14 stages [figure: DAG of the 35 Chol, Trsm, Syrk, and Gemm tasks] SPAA 2007
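A quick arithmetic check on the task count (not from the slides): for an n x n matrix of blocks, blocked Cholesky generates n Chol tasks, n(n-1)/2 Trsm tasks, n(n-1)/2 Syrk tasks, and n(n-1)(n-2)/6 Gemm tasks, for a total of n(n+1)(n+2)/6. With n = 5 that is 5 + 10 + 10 + 10 = 35 tasks, matching the slide.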
SuperMatrix SPAA 2007
SuperMatrix • Dispatcher • Tasks write to block [2,2] • No data affinity [figure: DAG of tasks, highlighting the tasks that overwrite block [2,2]] SPAA 2007
SuperMatrix • Data Affinity: tasks are assigned to threads by the owner computes rule (each task is denoted by the block it overwrites, and blocks of matrices are mapped to threads) • CPU Affinity: threads are bound to processors SPAA 2007
SuperMatrix • Data Affinity • 2D block cyclic decomposition (ScaLAPACK) • 4 x 4 matrix of blocks assigned to 2 x 2 mesh SPAA 2007
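A minimal sketch of the owner mapping this implies (owner_of_block() and the p x q mesh parameters are illustrative assumptions, not part of the FLAME/SuperMatrix API):

    /* Hypothetical helper: thread that owns block (i,j) under a 2D block-cyclic
       decomposition over a p x q mesh of threads.  With the owner computes rule,
       a task is executed by the thread that owns the block it overwrites.       */
    int owner_of_block( int i, int j, int p, int q )
    {
        return ( i % p ) * q + ( j % q );
    }

    /* Example: a 4 x 4 matrix of blocks on a 2 x 2 mesh (p = q = 2):

            j=0  j=1  j=2  j=3
       i=0   0    1    0    1
       i=1   2    3    2    3
       i=2   0    1    0    1
       i=3   2    3    2    3                                                     */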
SuperMatrix SPAA 2007
SuperMatrix • Contiguous Storage • One level of blocking • The user does not need to know about the underlying storage of the data SPAA 2007
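One way to picture the storage hidden behind this interface is the following packing sketch (pack_blocks() and the exact layout are assumptions for illustration, not the actual FLASH code): the matrix is copied once from column-major order into separately allocated b x b blocks, after which every block operation works on contiguous data.

    /* Hypothetical packing routine: copy a column-major n x n matrix (leading
       dimension lda) into separately allocated, contiguous b x b blocks.
       Assumes b divides n; each block is itself stored in column-major order. */
    #include <stdlib.h>
    #include <string.h>

    double **pack_blocks( const double *A, int n, int lda, int b )
    {
        int      nb  = n / b;                               /* blocks per dimension  */
        double **blk = malloc( (size_t) nb * nb * sizeof( double * ) );

        for ( int j = 0; j < nb; j++ )
            for ( int i = 0; i < nb; i++ ) {
                double *B = malloc( (size_t) b * b * sizeof( double ) );
                for ( int jj = 0; jj < b; jj++ )            /* one column of block   */
                    memcpy( &B[ jj * b ],
                            &A[ ( j * b + jj ) * lda + i * b ],
                            (size_t) b * sizeof( double ) );
                blk[ j * nb + i ] = B;                      /* block (i,j), col-major */
            }
        return blk;
    }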
SuperMatrix SPAA 2007
SuperMatrix • GotoBLAS vs. MKL • All previous graphs link with GotoBLAS • MKL better tuned for small matrices on Itanium2 SPAA 2007
SuperMatrix SPAA 2007
SuperMatrix • Results • LAPACK chose a poorly performing variant • Data affinity and contiguous storage show a clear advantage • Multithreaded GotoBLAS is tuned for large matrices • MKL is better tuned for small matrices SPAA 2007
SuperMatrix SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
Conclusion • Key Points • View blocks of matrices as units of computation instead of scalars • Apply instruction-level parallelism to blocks • Abstractions away from low-level details of scheduling SPAA 2007
Authors • Ernie Chan • Enrique S. Quintana-Ortí • Gregorio Quintana-Ortí • Robert van de Geijn • Universidad Jaume I • The University of Texas at Austin SPAA 2007
Acknowledgements • We thank the other members of the FLAME team for their support • Field Van Zee • Funding • NSF grant CCF-0540926 SPAA 2007
References

[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - PARA 2006, B. Kågström and E. Elmroth, eds., accepted for Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

SPAA 2007
Conclusion • More Information http://www.cs.utexas.edu/users/flame • Questions? echan@cs.utexas.edu SPAA 2007