
Presentation Transcript


1. Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures
   Ernie Chan

2. How to Program SCC?
   • 48 cores in a 6×4 mesh of tiles, with 2 cores per tile
   • 4 DDR3 memory controllers
   [Diagram: SCC layout. Each tile holds Core 0 and Core 1 with their L2 caches (L2$0, L2$1), a message passing buffer (MPB), and a router (R); the routers form the on-chip mesh, which connects to the four memory controllers and the system I/F.]
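Although this slide describes the hardware, it helps to see the shape of a program for it. The sketch below is a minimal RCCE-style SPMD program: every core runs the same binary, asks the runtime for its own core id and the total number of cores, and later communicates through the blocking calls shown on the following slides. Only the send/receive prototypes appear on the slides; the RCCE_APP entry point and the RCCE_init, RCCE_ue, RCCE_num_ues, and RCCE_finalize calls follow the usual RCCE conventions and should be read as assumptions.

    #include <stdio.h>
    #include "RCCE.h"

    /* Minimal SPMD sketch for SCC: the same binary runs on every core. */
    int RCCE_APP(int argc, char **argv)
    {
        RCCE_init(&argc, &argv);       /* start the RCCE runtime on this core */

        int me  = RCCE_ue();           /* this core's id (unit of execution)  */
        int num = RCCE_num_ues();      /* number of participating cores       */

        printf("Hello from core %d of %d\n", me, num);

        RCCE_finalize();
        return 0;
    }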

3. Outline
   • How to Program SCC?
   • Elemental
   • Collective Communication
   • Off-Chip Shared-Memory
   • Conclusion

4. Elemental
   • New, Modern Distributed-Memory Dense Linear Algebra Library
   • Replacement for PLAPACK and ScaLAPACK
   • Object-oriented data structures for matrices
   • Coded in C++
   • Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
   • Implemented entirely using bulk synchronous communication

5. Elemental
   • Two-Dimensional Process Grid:
       0 2 4
       1 3 5
   • Tile the process grid over the matrix to assign each matrix element to a process

6. Elemental
   • Two-Dimensional Process Grid:
       0 2 4
       1 3 5
   • Tile the process grid over the matrix to assign each matrix element to a process

7. Elemental
   • Two-Dimensional Process Grid:
       0 2 4
       1 3 5
   • Tile the process grid over the matrix to assign each matrix element to a process
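To make the elemental (torus-wrap) mapping concrete, the sketch below computes which process in the 2×3 grid shown above (0 2 4 on the top row, 1 3 5 on the bottom) owns a given matrix element, and where that element sits in the owner's local storage. The column-major process numbering and the index formulas follow the standard elemental distribution rule; they are an illustration, not Elemental's actual internals.

    /* Elemental (torus-wrap) mapping of a matrix onto an r x c process grid,
     * with processes numbered column-major: r = 2, c = 3 gives the grid
     *     0 2 4
     *     1 3 5
     * from the slide. Element (i, j) is owned by process
     * (i mod r) + r * (j mod c) and stored locally at (i / r, j / c). */
    typedef struct { int rank; int local_i; int local_j; } placement;

    placement elemental_placement(int i, int j, int r, int c)
    {
        placement p;
        p.rank    = (i % r) + r * (j % c);  /* process row + r * process column */
        p.local_i = i / r;                  /* local row index                  */
        p.local_j = j / c;                  /* local column index               */
        return p;
    }

For example, with r = 2 and c = 3, element (3, 4) lands on process (3 mod 2) + 2 * (4 mod 3) = 3 in the grid above.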

8. Elemental
   • Redistributing the Matrix Over a Process Grid
   • Collective communication

9. Outline
   • How to Program SCC?
   • Elemental
   • Collective Communication
   • Off-Chip Shared-Memory
   • Conclusion

10. Collective Communication
    • RCCE Message Passing API
    • Blocking send and receive
      int RCCE_send( char *buf, size_t num, int dest );
      int RCCE_recv( char *buf, size_t num, int src );
    • Potential for deadlock
    [Diagram: six cores, 0 through 5, passing messages around a cycle]
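To see why the blocking calls can deadlock, consider the naive ring shift sketched below, in which every core first calls RCCE_send to its right neighbor and only then RCCE_recv from its left neighbor. Because RCCE_send blocks until the receiver picks the message up, all cores can sit in the send at the same time. The prototypes come from the slide; the neighbor arithmetic is an illustrative assumption.

    #include <stddef.h>
    #include "RCCE.h"

    /* Naive ring shift: every core sends right, then receives from the left.
     * With blocking, unbuffered sends, all cores can block in RCCE_send at
     * the same time, so this sketch can deadlock by design. */
    void naive_ring_shift(char *sendbuf, char *recvbuf, size_t num)
    {
        int me    = RCCE_ue();
        int p     = RCCE_num_ues();
        int right = (me + 1) % p;
        int left  = (me - 1 + p) % p;

        RCCE_send(sendbuf, num, right);   /* everyone can block here ...       */
        RCCE_recv(recvbuf, num, left);    /* ... so no one reaches the receive */
    }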

11. Collective Communication
    • Avoiding Deadlock
    • Even number of cores in cycle
    [Diagram: the six-core cycle (0 through 5) shown over two communication steps]

12. Collective Communication
    • Avoiding Deadlock
    • Odd number of cores in cycle
    [Diagram: a five-core cycle (0 through 4) shown over three communication steps]
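One standard way to break the cycle, consistent with the even/odd discussion on these two slides, is to order the blocking calls by core parity: even-numbered cores send first and then receive, odd-numbered cores do the reverse. The sketch below shows the idea; it is not the exact RCCE_comm implementation, and with an odd number of cores the last exchange simply resolves one step later rather than in lock-step.

    #include <stddef.h>
    #include "RCCE.h"

    /* Deadlock-free ring shift: parity ordering guarantees that somewhere in
     * the cycle a receive is already posted for every blocking send. */
    void safe_ring_shift(char *sendbuf, char *recvbuf, size_t num)
    {
        int me    = RCCE_ue();
        int p     = RCCE_num_ues();
        int right = (me + 1) % p;
        int left  = (me - 1 + p) % p;

        if (me % 2 == 0) {
            RCCE_send(sendbuf, num, right);  /* even cores send first ...   */
            RCCE_recv(recvbuf, num, left);
        } else {
            RCCE_recv(recvbuf, num, left);   /* ... odd cores receive first */
            RCCE_send(sendbuf, num, right);
        }
    }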

13. Collective Communication
    • Allgather
      int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
    [Diagram: buffer contents before the allgather]

14. Collective Communication
    • Allgather
      int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
    [Diagram: buffer contents after the allgather]
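A small usage sketch of the interface shown above: each core contributes one double and, after the call, holds the contributions of all cores in rank order. The RCCE_allgather prototype is taken from the slide; RCCE_COMM_WORLD as the all-cores communicator and the byte-based count are assumptions.

    #include <stdlib.h>
    #include "RCCE.h"

    /* Gather one double from every core onto every core. */
    void gather_one_double_per_core(void)
    {
        int     p   = RCCE_num_ues();
        double  my  = (double) RCCE_ue();            /* this core's value */
        double *all = (double *) malloc(p * sizeof(double));

        RCCE_allgather((char *) &my, (char *) all,
                       sizeof(double), RCCE_COMM_WORLD);

        /* all[i] now holds core i's contribution on every core. */
        free(all);
    }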

15. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather

16. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather

17. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather

18. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather

19. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather

20. Collective Communication
    • Cyclic (Bucket) Algorithm
    • Allgather
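The cyclic (bucket) algorithm animated on the preceding slides can be sketched directly on top of the blocking send/receive calls: in each of p-1 steps, every core forwards the block it obtained in the previous step to its right neighbor while receiving the next block from its left neighbor, so after p-1 steps every core holds all p blocks. This is an illustration of the algorithm, not the released RCCE_comm code; the parity ordering from the earlier slides is reused to keep the blocking calls deadlock-free.

    #include <stddef.h>
    #include <string.h>
    #include "RCCE.h"

    /* Bucket (cyclic) allgather: outbuf holds p blocks of num bytes each;
     * core me's contribution starts in inbuf and ends up, along with every
     * other core's block, in outbuf on all cores. */
    void bucket_allgather(char *inbuf, char *outbuf, size_t num)
    {
        int me    = RCCE_ue();
        int p     = RCCE_num_ues();
        int right = (me + 1) % p;
        int left  = (me - 1 + p) % p;

        memcpy(outbuf + (size_t) me * num, inbuf, num);   /* own block in place */

        for (int s = 0; s < p - 1; s++) {
            char *sendptr = outbuf + (size_t)((me - s + p) % p) * num;      /* forward previous block */
            char *recvptr = outbuf + (size_t)((me - s - 1 + p) % p) * num;  /* receive next block     */

            if (me % 2 == 0) {                    /* parity ordering avoids deadlock */
                RCCE_send(sendptr, num, right);
                RCCE_recv(recvptr, num, left);
            } else {
                RCCE_recv(recvptr, num, left);
                RCCE_send(sendptr, num, right);
            }
        }
    }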

21. Collective Communication

22. Elemental

23. Elemental

24. Elemental

25. Elemental

26. Outline
    • How to Program SCC?
    • Elemental
    • Collective Communication
    • Off-Chip Shared-Memory
    • Conclusion

27. Off-Chip Shared-Memory
    • Distributed vs. Shared-Memory
    [Diagram: the SCC tile mesh with its four memory controllers and system I/F, contrasting the distributed-memory view with the off-chip shared-memory view]

28. Off-Chip Shared-Memory
    • SuperMatrix
    • Map dense matrix computation to a directed acyclic graph
    • No matrix distribution
    • Store DAG and matrix on off-chip shared-memory
    [Diagram: task DAG for a blocked Cholesky factorization: CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5; CHOL6; TRSM7; SYRK8; CHOL9]
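The DAG on this slide is the task graph of a right-looking blocked Cholesky factorization of a 3×3 blocked matrix. The loop below is a sketch of SuperMatrix-style task generation: for nb = 3 it enumerates exactly the tasks shown (CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7, SYRK8, CHOL9) in program order, with the runtime assumed to build the DAG edges from each task's block reads and writes. The enqueue_task helper and its arguments are hypothetical, not the SuperMatrix API.

    /* Task generation for a right-looking blocked Cholesky factorization.
     * For nb = 3 this enqueues the ten tasks pictured on the slide.
     * enqueue_task() is a hypothetical helper; a SuperMatrix-style runtime
     * would infer dependencies from the blocks each task reads and writes. */
    typedef enum { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM } task_kind;

    extern void enqueue_task(task_kind kind, int i, int j, int k);  /* hypothetical */

    void generate_cholesky_dag(int nb)        /* nb = blocks per dimension */
    {
        for (int k = 0; k < nb; k++) {
            enqueue_task(TASK_CHOL, k, k, k);           /* factor diagonal block A[k][k]     */
            for (int i = k + 1; i < nb; i++)
                enqueue_task(TASK_TRSM, i, k, k);       /* triangular solve on A[i][k]       */
            for (int j = k + 1; j < nb; j++) {
                enqueue_task(TASK_SYRK, j, j, k);       /* rank-k update of diagonal A[j][j] */
                for (int i = j + 1; i < nb; i++)
                    enqueue_task(TASK_GEMM, i, j, k);   /* update off-diagonal A[i][j]       */
            }
        }
    }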

29. Off-Chip Shared-Memory
    • Non-cacheable vs. Cacheable Shared-Memory
    • Non-cacheable
      • Allow for a simple programming interface
      • Poor performance
    • Cacheable
      • Need software managed cache coherency mechanism
      • Execute on data stored in cache
      • Interleave distributed and shared-memory programming concepts
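The slides name the need for a software-managed coherency mechanism without showing it, so the following is only an illustration of the general pattern on a chip without hardware cache coherence: invalidate the affected cache lines before reading a shared block, compute on the cached data, and flush the lines after writing so other cores see the results. The invalidate_cache_range and flush_cache_range helpers are hypothetical placeholders, not RCCE or SCC API calls.

    #include <stddef.h>

    /* Hypothetical placeholders for a platform-specific mechanism that
     * evicts or invalidates the cache lines covering [addr, addr + bytes). */
    extern void invalidate_cache_range(void *addr, size_t bytes);
    extern void flush_cache_range(void *addr, size_t bytes);

    /* Run a task on a block that lives in off-chip shared memory while
     * keeping caches consistent by hand. */
    void execute_task_on_shared_block(double *block, size_t bytes,
                                      void (*task)(double *))
    {
        invalidate_cache_range(block, bytes);  /* drop any stale cached copies */
        task(block);                           /* compute on (now fresh) data  */
        flush_cache_range(block, bytes);       /* publish results to memory    */
    }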

30. Off-Chip Shared-Memory

31. Outline
    • How to Program SCC?
    • Elemental
    • Collective Communication
    • Off-Chip Shared-Memory
    • Conclusion

32. Conclusion
    • Distributed vs. Shared-Memory
    • Elemental vs. SuperMatrix?
    • A Collective Communication Library for SCC
    • RCCE_comm: released under LGPL and available on the public Intel SCC software repository
      http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/

33. Acknowledgments
    • We thank the other members of the FLAME team for their support
      • Bryan Marker, Jack Poulson, and Robert van de Geijn
    • We thank Intel for access to SCC and their help
      • Timothy G. Mattson and Rob F. Van Der Wijngaart
    • Funding
      • Intel Corporation
      • National Science Foundation

34. Conclusion
    • More Information: http://www.cs.utexas.edu/~flame
    • Questions? echan@cs.utexas.edu
