SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures Ernie Chan The University of Texas at Austin SPAA 2007
Motivation • Motivating Example • Cholesky Factorization • A → L Lᵀ SPAA 2007
Motivation • Peak Performance 96 Gflops [performance graph] SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
Performance • Target Architecture • 16 CPU Itanium2 • NUMA • 8 dual-processor nodes • OpenMP • Intel Compiler 9.0 • BLAS • GotoBLAS 1.06 • Intel MKL 8.1 SPAA 2007
Performance • Implementations • Multithreaded BLAS (Sequential algorithm) • LAPACK dpotrf • FLAME var3 • Serial BLAS (Parallel algorithm) • Data-flow SPAA 2007
Performance • Implementations • Column-major order storage • Varying block sizes • { 64, 96, 128, 160, 192, 224, 256 } • Select best performance for each problem size SPAA 2007
Performance SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
FLAME • Formal Linear Algebra Methods Environment • High-level abstraction away from indices • “Views” into matrices • Seamless transition from algorithms to code SPAA 2007
FLAME • Cholesky Factorization

    for ( j = 0; j < n; j++ ) {
      A[j,j] = sqrt( A[j,j] );
      for ( i = j+1; i < n; i++ )
        A[i,j] = A[i,j] / A[j,j];
      for ( k = j+1; k < n; k++ )
        for ( i = k; i < n; i++ )
          A[i,k] = A[i,k] - A[i,j] * A[k,j];
    }

SPAA 2007
FLAME • LAPACK dpotrf • Different variant (right-looking)

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

SPAA 2007
FLAME • Partitioning Matrices SPAA 2007
FLAME SPAA 2007
FLAME

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,    0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                          /* ************** */  /* ********************* */
                                                  &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                             b, b, FLA_BR );
      /*---------------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*---------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                       A10, A11, /**/ A12,
                              /* *************** */  /* ****************** */
                                &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                                FLA_TL );
    }

SPAA 2007
FLAME SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Data-flow • 2D data affinity • Contiguous storage • Conclusion SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 1 [figure: tasks Chol, Trsm, Trsm, Syrk, Gemm, Syrk generated on the blocks] SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 2 [figure: tasks Chol, Trsm, Syrk generated on the blocks] SPAA 2007
SuperMatrix • Cholesky Factorization • Iteration 3 [figure: task Chol generated on the last block] SPAA 2007
SuperMatrix • Analyzer • Delay execution and place tasks on queue • Tasks are function pointers annotated with input/output information • Compute dependence information (flow, anti, output) between all tasks • Create DAG of tasks SPAA 2007
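A rough illustration of this step (not the actual FLASH_Queue internals; the task struct, touches(), and analyzer_enqueue() below are hypothetical names): each task records the blocks it reads and writes, and a DAG edge is added whenever a task reads or writes a block that an earlier task wrote, or overwrites a block that an earlier task read.

    /* Hypothetical sketch of the analyzer stage, not the FLAME/FLASH API. */
    #define MAX_ARGS  4
    #define MAX_TASKS 1024

    typedef struct task {
        void (*func)( struct task * );   /* wrapped BLAS/LAPACK block operation */
        void  *in [MAX_ARGS];            /* blocks read by the task             */
        void  *out[MAX_ARGS];            /* blocks written by the task          */
        int    n_in, n_out;
        struct task *succ[MAX_TASKS];    /* outgoing DAG edges                  */
        int    n_succ;
        int    n_pred;                   /* number of unmet dependences         */
    } task;

    static task *queue[MAX_TASKS];       /* tasks in program (sequential) order */
    static int   n_tasks = 0;

    static int touches( void *blk, void **list, int n )
    {
        for ( int i = 0; i < n; i++ )
            if ( list[i] == blk ) return 1;
        return 0;
    }

    /* Delay execution: place the task on the queue and record flow, anti, and
       output dependences against earlier tasks that touch the same blocks.    */
    void analyzer_enqueue( task *t )
    {
        for ( int k = 0; k < n_tasks; k++ ) {
            task *prev = queue[k];
            int dep = 0;
            for ( int i = 0; i < t->n_in  && !dep; i++ )   /* flow: read after write */
                dep = touches( t->in[i],  prev->out, prev->n_out );
            for ( int i = 0; i < t->n_out && !dep; i++ )   /* anti and output deps   */
                dep = touches( t->out[i], prev->in,  prev->n_in ) ||
                      touches( t->out[i], prev->out, prev->n_out );
            if ( dep ) {                                   /* add a DAG edge         */
                prev->succ[ prev->n_succ++ ] = t;
                t->n_pred++;
            }
        }
        queue[ n_tasks++ ] = t;
    }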
SuperMatrix • Analyzer [figure: the task queue (Chol, Trsm, Trsm, Syrk, Gemm, Syrk, Chol, …) and the corresponding DAG of tasks] SPAA 2007
SuperMatrix • FLASH • Matrix of matrices SPAA 2007
SuperMatrix

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,    0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                          /* ************** */  /* ********************* */
                                                  &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                             1, 1, FLA_BR );
      /*---------------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*---------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                       A10, A11, /**/ A12,
                              /* *************** */  /* ****************** */
                                &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                                FLA_TL );
    }

    FLASH_Queue_exec( );

SPAA 2007
SuperMatrix • Dispatcher • Use DAG to execute tasks out-of-order in parallel • Akin to Tomasulo’s algorithm and instruction-level parallelism on blocks of computation • SuperScalar vs. SuperMatrix SPAA 2007
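Continuing the sketch above (same hypothetical task struct and MAX_TASKS constant; this is an illustration of the out-of-order idea, not the SuperMatrix runtime), the dispatcher can be pictured as an OpenMP worker loop: each thread repeatedly grabs any task whose dependences are all satisfied, executes it, and then releases its successors.

    /* Hypothetical dispatcher sketch; reuses the task struct from the analyzer sketch. */
    #include <omp.h>

    static task *ready[MAX_TASKS];                 /* tasks whose dependences are met   */
    static int   n_ready = 0;

    void dispatcher_execute( task **all, int n )
    {
        int done = 0;
        for ( int k = 0; k < n; k++ )              /* seed with initially ready tasks   */
            if ( all[k]->n_pred == 0 )
                ready[ n_ready++ ] = all[k];

        #pragma omp parallel shared( done )
        {
            while ( 1 ) {
                task *t = NULL;
                int finished;
                #pragma omp critical (scheduler)
                {
                    finished = ( done == n );
                    if ( !finished && n_ready > 0 )
                        t = ready[ --n_ready ];    /* out-of-order: any ready task      */
                }
                if ( finished ) break;
                if ( t == NULL ) continue;         /* spin until a predecessor finishes */

                t->func( t );                      /* execute the block operation       */

                #pragma omp critical (scheduler)
                {
                    done++;
                    for ( int k = 0; k < t->n_succ; k++ )
                        if ( --t->succ[k]->n_pred == 0 )   /* release dependents        */
                            ready[ n_ready++ ] = t->succ[k];
                }
            }
        }
    }

Out-of-order execution falls out naturally: a task becomes eligible the moment its predecessors in the DAG complete, regardless of its position in the original sequential order.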
SuperMatrix • Dispatcher • 4 threads • 5 x 5 matrix of blocks • 35 tasks • 14 stages [figure: DAG of the 35 Chol, Trsm, Syrk, and Gemm tasks] SPAA 2007
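A quick arithmetic check on the task count (not from the slides): for an n x n matrix of blocks, blocked Cholesky generates n Chol tasks, n(n-1)/2 Trsm tasks, n(n-1)/2 Syrk tasks, and n(n-1)(n-2)/6 Gemm tasks, for a total of n(n+1)(n+2)/6. With n = 5 that is 5 + 10 + 10 + 10 = 35 tasks, matching the slide.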
SuperMatrix SPAA 2007
SuperMatrix • Dispatcher • Tasks write to block [2,2] • No data affinity [figure: DAG of tasks, highlighting the tasks that overwrite block [2,2]] SPAA 2007
SuperMatrix • Data Affinity: tasks are assigned to threads by the owner computes rule (each task is denoted by the block it overwrites, and blocks of matrices are mapped to threads) • CPU Affinity: threads are bound to processors SPAA 2007
SuperMatrix • Data Affinity • 2D block cyclic decomposition (ScaLAPACK) • 4 x 4 matrix of blocks assigned to 2 x 2 mesh SPAA 2007
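A minimal sketch of the owner mapping this implies (owner_of_block() and the p x q mesh parameters are illustrative assumptions, not part of the FLAME/SuperMatrix API):

    /* Hypothetical helper: thread that owns block (i,j) under a 2D block-cyclic
       decomposition over a p x q mesh of threads.  With the owner computes rule,
       a task is executed by the thread that owns the block it overwrites.       */
    int owner_of_block( int i, int j, int p, int q )
    {
        return ( i % p ) * q + ( j % q );
    }

    /* Example: a 4 x 4 matrix of blocks on a 2 x 2 mesh (p = q = 2):

            j=0  j=1  j=2  j=3
       i=0   0    1    0    1
       i=1   2    3    2    3
       i=2   0    1    0    1
       i=3   2    3    2    3                                                     */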
SuperMatrix SPAA 2007
SuperMatrix • Contiguous Storage • One level of blocking • The user does not need to know about the underlying storage of the data SPAA 2007
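One way to picture the storage hidden behind this interface is the following packing sketch (pack_blocks() and the exact layout are assumptions for illustration, not the actual FLASH code): the matrix is copied once from column-major order into separately allocated b x b blocks, after which every block operation works on contiguous data.

    /* Hypothetical packing routine: copy a column-major n x n matrix (leading
       dimension lda) into separately allocated, contiguous b x b blocks.
       Assumes b divides n; each block is itself stored in column-major order. */
    #include <stdlib.h>
    #include <string.h>

    double **pack_blocks( const double *A, int n, int lda, int b )
    {
        int      nb  = n / b;                               /* blocks per dimension  */
        double **blk = malloc( (size_t) nb * nb * sizeof( double * ) );

        for ( int j = 0; j < nb; j++ )
            for ( int i = 0; i < nb; i++ ) {
                double *B = malloc( (size_t) b * b * sizeof( double ) );
                for ( int jj = 0; jj < b; jj++ )            /* one column of block   */
                    memcpy( &B[ jj * b ],
                            &A[ ( j * b + jj ) * lda + i * b ],
                            (size_t) b * sizeof( double ) );
                blk[ j * nb + i ] = B;                      /* block (i,j), col-major */
            }
        return blk;
    }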
SuperMatrix SPAA 2007
SuperMatrix • GotoBLAS vs. MKL • All previous graphs link with GotoBLAS • MKL better tuned for small matrices on Itanium2 SPAA 2007
SuperMatrix SPAA 2007
SuperMatrix • Results • LAPACK chose a poorly performing variant • Data affinity and contiguous storage show a clear advantage • Multithreaded GotoBLAS is tuned for large matrices • MKL is better tuned for small matrices SPAA 2007
SuperMatrix SPAA 2007
Outline • Performance • FLAME • SuperMatrix • Conclusion SPAA 2007
Conclusion • Key Points • View blocks of matrices as units of computation instead of scalars • Apply instruction-level parallelism to blocks • Abstractions away from low-level details of scheduling SPAA 2007
Authors • Ernie Chan • Enrique S. Quintana-Ortí • Gregorio Quintana-Ortí • Robert van de Geijn • Universidad Jaume I • The University of Texas at Austin SPAA 2007
Acknowledgements • We thank the other members of the FLAME team for their support • Field Van Zee • Funding • NSF grant CCF-0540926 SPAA 2007
References

[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - PARA 2006, B. Kågström and E. Elmroth, eds., accepted for Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

SPAA 2007
Conclusion • More Information http://www.cs.utexas.edu/users/flame • Questions? echan@cs.utexas.edu SPAA 2007