Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures Ernie Chan
How to Program SCC?
• 48 cores in a 6×4 mesh with 2 cores per tile
• 4 DDR3 memory controllers
[Figure: the SCC die layout. Each tile holds two cores (Core 0, Core 1) with per-core L2 caches (L2$0, L2$1), a router (R), and a message passing buffer (MPB); the 24 tiles form the mesh, flanked by the four memory controllers and the system I/F.]
Outline
• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion
Elemental
• New, modern distributed-memory dense linear algebra library
• Replacement for PLAPACK and ScaLAPACK
• Object-oriented data structures for matrices
• Coded in C++
• Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
• Implemented entirely using bulk synchronous communication
Elemental
• Two-Dimensional Process Grid:
  0 2 4
  1 3 5
• Tile the process grid over the matrix to assign each matrix element to a process (a sketch of this owner computation follows)
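To make the mapping concrete, here is a minimal sketch in C of the element-to-process owner computation, assuming the column-major rank ordering shown in the 2×3 grid above; Elemental itself is a C++ library, so this illustrates the arithmetic, not its API.

#include <stdio.h>

/* Owner of matrix element (i, j) under an elemental (torus-wrap)
   distribution over an r x c process grid, with ranks assigned
   column-major as in the 2 x 3 grid above (0 2 4 / 1 3 5). */
static int owner(int i, int j, int r, int c)
{
    return (i % r) + (j % c) * r;
}

int main(void)
{
    int r = 2, c = 3;   /* the 2 x 3 grid from the slides */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 6; j++)
            printf("%d ", owner(i, j, r, c));
        printf("\n");
    }
    return 0;
}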
Elemental
• Redistributing the Matrix Over a Process Grid
• Collective communication
Outline
• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion
Collective Communication
• RCCE Message Passing API
• Blocking send and receive
int RCCE_send( char *buf, size_t num, int dest );
int RCCE_recv( char *buf, size_t num, int src );
• Potential for deadlock: in a communication cycle such as 0 → 1 → 2 → 3 → 4 → 5 → 0, every core blocks in its send and no receive is ever posted (a sketch of the failure follows)
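A minimal sketch of the failure, assuming RCCE's RCCE_ue() and RCCE_num_ues() calls for the caller's core id and the core count:

#include "RCCE.h"

/* Naive ring exchange: every core sends to its right neighbor, then
   receives from its left.  RCCE_send blocks until the matching
   RCCE_recv is posted, so with every core stuck in the send and no
   receive yet posted, the cycle deadlocks. */
void naive_ring_exchange(char *sendbuf, char *recvbuf, size_t size)
{
    int me    = RCCE_ue();        /* this core's id        */
    int np    = RCCE_num_ues();   /* total number of cores */
    int right = (me + 1) % np;
    int left  = (me - 1 + np) % np;

    RCCE_send(sendbuf, size, right);  /* all cores block here */
    RCCE_recv(recvbuf, size, left);   /* never reached        */
}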
Collective Communication
• Avoiding Deadlock
• Even number of cores in cycle: the exchange completes in two phases, e.g., even-ranked cores send while odd-ranked cores receive, then the roles reverse
• Odd number of cores in cycle: one send-receive pair cannot be matched within the first two phases, so a third phase completes the exchange (a deadlock-free sketch follows)
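A minimal deadlock-free sketch under the same assumptions, splitting the ring by even and odd ranks so every blocking send meets a posted receive:

#include "RCCE.h"

/* Deadlock-free ring exchange: even-ranked cores send first and then
   receive; odd-ranked cores do the reverse, so every blocking send is
   paired with an already-posted receive.  With an odd cycle length one
   pair still serializes, which matches the extra step in the
   odd-cycle figure above. */
void safe_ring_exchange(char *sendbuf, char *recvbuf, size_t size)
{
    int me    = RCCE_ue();
    int np    = RCCE_num_ues();
    int right = (me + 1) % np;
    int left  = (me - 1 + np) % np;

    if (me % 2 == 0) {
        RCCE_send(sendbuf, size, right);
        RCCE_recv(recvbuf, size, left);
    } else {
        RCCE_recv(recvbuf, size, left);
        RCCE_send(sendbuf, size, right);
    }
}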
Collective Communication
• Allgather
int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
• Before: each core holds only its own contribution in inbuf; after: every core's outbuf holds the concatenation of all cores' contributions (the "before" and "after" figures on these slides)
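A minimal usage sketch, assuming RCCE_COMM_WORLD is RCCE's predefined global communicator:

#include "RCCE.h"

/* Each core contributes one double; afterwards every core holds all
   contributions, ordered by core rank. */
void gather_values(double my_val, double *all_vals /* length RCCE_num_ues() */)
{
    RCCE_allgather((char *) &my_val, (char *) all_vals,
                   sizeof(double), RCCE_COMM_WORLD);
}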
Collective Communication
• Cyclic (Bucket) Algorithm
• Allgather: the output buffer is split into one bucket per core; in each of p − 1 steps, every core forwards the bucket it received last to its right neighbor while receiving the next bucket from its left neighbor (the original slides animate these steps; a sketch follows)
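A minimal sketch of the bucket algorithm on top of the blocking primitives above, reusing the even/odd ordering from the previous section to keep the ring deadlock-free:

#include <string.h>
#include "RCCE.h"

/* Cyclic (bucket) allgather: outbuf is partitioned into np buckets of
   num bytes each.  Each core seeds its own bucket with inbuf, then in
   np-1 steps passes the bucket it received last to its right neighbor
   while receiving the next one from its left neighbor. */
void bucket_allgather(char *inbuf, char *outbuf, size_t num)
{
    int me    = RCCE_ue();
    int np    = RCCE_num_ues();
    int right = (me + 1) % np;
    int left  = (me - 1 + np) % np;

    memcpy(outbuf + me * num, inbuf, num);   /* seed own bucket */

    for (int step = 0; step < np - 1; step++) {
        int send_idx = (me - step + np) % np;      /* bucket to pass on */
        int recv_idx = (me - step - 1 + np) % np;  /* bucket arriving   */

        if (me % 2 == 0) {
            RCCE_send(outbuf + send_idx * num, num, right);
            RCCE_recv(outbuf + recv_idx * num, num, left);
        } else {
            RCCE_recv(outbuf + recv_idx * num, num, left);
            RCCE_send(outbuf + send_idx * num, num, right);
        }
    }
}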
Outline
• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion
Off-Chip Shared-Memory
• Distributed vs. Shared-Memory
[Figure: the 6×4 SCC tile mesh with its four memory controllers and system I/F, labeled to contrast the shared-memory and distributed-memory views of off-chip memory.]
Off-Chip Shared-Memory
• SuperMatrix
• Map dense matrix computation to a directed acyclic graph
• No matrix distribution
• Store DAG and matrix in off-chip shared-memory
[Figure: task DAG for a 3×3 blocked Cholesky factorization, with nodes CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5; CHOL6; TRSM7; SYRK8; CHOL9.]
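For concreteness, here is a small sketch (not SuperMatrix's API, which the slides do not show) that enumerates the tasks of a right-looking blocked Cholesky factorization; for a 3×3 blocked matrix it produces exactly the ten tasks in the DAG above. A runtime like SuperMatrix would enqueue each task with its input/output blocks so the dependencies form the DAG.

#include <stdio.h>

/* Enumerate the tasks of a right-looking blocked Cholesky
   factorization of an n x n matrix of blocks.  For n = 3 this prints
   CHOL0 through CHOL9 in the order shown in the DAG above. */
static void enumerate_cholesky_tasks(int n)
{
    int id = 0;
    for (int k = 0; k < n; k++) {
        printf("CHOL%d: factor block (%d,%d)\n", id++, k, k);
        for (int i = k + 1; i < n; i++)
            printf("TRSM%d: solve block (%d,%d)\n", id++, i, k);
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j <= i; j++) {
                if (i == j)
                    printf("SYRK%d: update block (%d,%d)\n", id++, i, i);
                else
                    printf("GEMM%d: update block (%d,%d)\n", id++, i, j);
            }
    }
}

int main(void) { enumerate_cholesky_tasks(3); return 0; }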
Off-Chip Shared-Memory
• Non-cacheable vs. Cacheable Shared-Memory
• Non-cacheable
  • Allow for a simple programming interface
  • Poor performance
• Cacheable
  • Need software-managed cache coherency mechanism
  • Execute on data stored in cache
  • Interleave distributed and shared-memory programming concepts
(A sketch of the software-managed coherence pattern follows.)
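A minimal sketch of the pattern, with cache_invalidate and cache_writeback as hypothetical stand-ins for whatever flush/invalidate mechanism the platform provides; SCC has no hardware cache coherence, so software must bracket accesses to cacheable shared data this way.

#include <stddef.h>

/* Hypothetical helpers: invalidate or write back the cache lines
   covering [addr, addr + size).  Named for illustration only. */
extern void cache_invalidate(void *addr, size_t size);
extern void cache_writeback(void *addr, size_t size);

/* One task executing on a block held in cacheable off-chip
   shared-memory: invalidate before reading so stale cached copies are
   discarded, compute in cache, write back afterwards so other cores
   observe the result. */
void execute_task_on_block(double *block, size_t bytes)
{
    cache_invalidate(block, bytes);  /* read fresh data from off-chip memory */
    /* ... perform the task's computation on the cached block ... */
    cache_writeback(block, bytes);   /* publish results to other cores */
}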
Outline
• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion
Conclusion
• Distributed vs. Shared-Memory
• Elemental vs. SuperMatrix?
• A Collective Communication Library for SCC
• RCCE_comm: released under the LGPL and available on the public Intel SCC software repository
http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/
Acknowledgments
• We thank the other members of the FLAME team for their support
  • Bryan Marker, Jack Poulson, and Robert van de Geijn
• We thank Intel for access to SCC and their help
  • Timothy G. Mattson and Rob F. Van Der Wijngaart
• Funding
  • Intel Corporation
  • National Science Foundation
Conclusion
• More Information: http://www.cs.utexas.edu/~flame
• Questions? echan@cs.utexas.edu