Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Introduction
• Programmability
  • Use tools provided by FLAME
• Parallelism
  • Directed acyclic graph (DAG) scheduling
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
SuperMatrix
• Formal Linear Algebra Methods Environment (FLAME)
  • High-level abstractions for expressing linear algebra algorithms
• Cholesky Factorization
SuperMatrix
• Cholesky Factorization
• Each iteration repartitions the matrix and performs three kinds of updates:
  CHOL: Chol( A11 )
  TRSM: A21 A11-T
  SYRK: A22 – A21 A21T
• Iteration 2 applies the same updates to the trailing submatrix
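In equation form, one iteration performs the following blocked updates (a sketch of the right-looking lower-triangular variant, with A11 the current diagonal block, A21 the panel below it, and A22 the trailing submatrix):

```latex
% One iteration of blocked right-looking Cholesky (lower-triangular case)
\[
A_{11} \leftarrow \mathrm{Chol}(A_{11}), \qquad
A_{21} \leftarrow A_{21} A_{11}^{-T}, \qquad
A_{22} \leftarrow A_{22} - A_{21} A_{21}^{T}
\]
```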
SuperMatrix
• Cholesky Factorization on a 3 × 3 matrix of blocks unrolls into the following tasks, which form a DAG:
• Iteration 1
  CHOL0: Chol( A0,0 )
  TRSM1: A1,0 A0,0-T
  TRSM2: A2,0 A0,0-T
  SYRK3: A1,1 – A1,0 A1,0T
  GEMM4: A2,1 – A2,0 A1,0T
  SYRK5: A2,2 – A2,0 A2,0T
• Iteration 2
  CHOL6: Chol( A1,1 )
  TRSM7: A2,1 A1,1-T
  SYRK8: A2,2 – A2,1 A2,1T
• Iteration 3
  CHOL9: Chol( A2,2 )
SuperMatrix
• Cholesky Factorization [figure: the matrix stored as a matrix of blocks]
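To make the decomposition concrete, here is a minimal C sketch of the loop nest that generates exactly the tasks enumerated above; enqueue_task and chol_analyze are hypothetical names standing in for the analyzer, not the libflame API:

```c
#include <stdio.h>

/* Illustrative stand-in for the analyzer's task-queue insertion;
 * not the libflame API. */
static int next_id = 0;
static void enqueue_task(const char *op, int i, int j, int k) {
    printf("%s%d on block (%d,%d), panel %d\n", op, next_id++, i, j, k);
}

/* Unroll blocked Cholesky on an n x n grid of blocks into tasks. */
void chol_analyze(int n) {
    for (int k = 0; k < n; k++) {
        enqueue_task("CHOL", k, k, k);          /* A[k][k] := Chol(A[k][k])      */
        for (int i = k + 1; i < n; i++)
            enqueue_task("TRSM", i, k, k);      /* A[i][k] := A[i][k] A[k][k]^-T */
        for (int i = k + 1; i < n; i++) {
            for (int j = k + 1; j < i; j++)
                enqueue_task("GEMM", i, j, k);  /* A[i][j] -= A[i][k] A[j][k]^T  */
            enqueue_task("SYRK", i, i, k);      /* A[i][i] -= A[i][k] A[i][k]^T  */
        }
    }
}

int main(void) { chol_analyze(3); return 0; }
```

Running it with n = 3 prints CHOL0 through CHOL9 in the same order as the enumeration above.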
SuperMatrix
• Separation of Concerns
• Analyzer
  • Decomposes subproblems into component tasks
  • Stores tasks sequentially in a global task queue
  • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
• Dispatcher
  • Spawns threads
  • Schedules and dispatches tasks to threads in parallel
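As a sketch of how dependencies can be computed from input and output parameters alone, the following hypothetical analyzer records, for every block, the last task that wrote it; each new task then depends on the writers of the blocks it touches. All names are illustrative, and a full analyzer would also handle anti- and output dependences:

```c
#include <stdio.h>

#define N 3   /* blocks per matrix dimension */

typedef struct {
    const char *name;
    int deps[4];     /* indices of prerequisite tasks */
    int n_deps;
} task_t;

static task_t tasks[16];
static int n_tasks = 0;
static int last_writer[N][N];   /* task that last wrote block (i,j), or -1 */

/* Register a task reading blocks (ai,aj) and (bi,bj) (pass -1,-1 to skip)
 * and writing block (wi,wj); record flow dependencies on prior writers. */
static void add_task(const char *name, int ai, int aj, int bi, int bj,
                     int wi, int wj) {
    task_t *t = &tasks[n_tasks];
    t->name = name;
    t->n_deps = 0;
    int blocks[3][2] = { {ai, aj}, {bi, bj}, {wi, wj} };
    for (int r = 0; r < 3; r++) {
        int i = blocks[r][0], j = blocks[r][1];
        if (i >= 0 && last_writer[i][j] >= 0)
            t->deps[t->n_deps++] = last_writer[i][j];
    }
    last_writer[wi][wj] = n_tasks++;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            last_writer[i][j] = -1;

    /* Iteration 1 of the 3 x 3 example: inputs/outputs only. */
    add_task("CHOL0", -1, -1, -1, -1, 0, 0);
    add_task("TRSM1",  0,  0, -1, -1, 1, 0);
    add_task("TRSM2",  0,  0, -1, -1, 2, 0);
    add_task("SYRK3",  1,  0, -1, -1, 1, 1);
    add_task("GEMM4",  2,  0,  1,  0, 2, 1);
    add_task("SYRK5",  2,  0, -1, -1, 2, 2);

    for (int k = 0; k < n_tasks; k++) {
        printf("%s depends on:", tasks[k].name);
        for (int d = 0; d < tasks[k].n_deps; d++)
            printf(" %s", tasks[tasks[k].deps[d]].name);
        printf("\n");
    }
    return 0;
}
```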
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Scheduling
• Dispatcher

  foreach task in DAG do
      if task is ready then
          Enqueue task
      end
  end
  while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
          Update dependent task
          if dependent task is ready then
              Enqueue dependent task
          end
      end
  end
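A minimal C sketch of this dispatcher loop, assuming per-task dependence counters and a ready queue; every name here is illustrative rather than the libflame API:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct task {
    void (*execute)(struct task *);
    struct task **dependents;   /* tasks that consume this task's output */
    int n_dependents;
    atomic_int unmet_deps;      /* producers that have not yet finished  */
} task_t;

/* Toy stand-in for a thread-safe ready queue (a real dispatcher would
 * protect this with a lock or use a lock-free queue). */
static task_t *ready[1024];
static int head = 0, tail = 0;
static void enqueue(task_t *t) { ready[tail++] = t; }
static task_t *dequeue(void)   { return head < tail ? ready[head++] : NULL; }

/* Seed the queue with tasks that are ready from the start. */
void seed(task_t **dag, int n) {
    for (int i = 0; i < n; i++)
        if (atomic_load(&dag[i]->unmet_deps) == 0)
            enqueue(dag[i]);
}

/* Loop run by each worker thread: execute a task, then update its
 * dependents; whoever releases the last dependency enqueues the task. */
void worker(void) {
    task_t *t;
    while ((t = dequeue()) != NULL) {
        t->execute(t);
        for (int i = 0; i < t->n_dependents; i++) {
            task_t *d = t->dependents[i];
            if (atomic_fetch_sub(&d->unmet_deps, 1) == 1)
                enqueue(d);
        }
    }
}
```

The atomic decrement ensures that exactly one worker, the one releasing the final dependency, enqueues each dependent task.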
Scheduling
• Supermarket model
  • One line per cashier
  • Efficient enqueue and dequeue
  • Schedule depends on the task-to-thread assignment
• Bank model
  • One line for all tellers
  • Enqueue and dequeue can become bottlenecks
  • Dynamic dispatching of tasks to threads
Scheduling
• Single Queue
  • Set of all ready and available tasks
  • FIFO or priority ordering
  • All processing elements PE0 … PEp-1 enqueue to and dequeue from the same queue
Scheduling
• Cache Affinity
  • Single priority queue sorted by task height
  • One software cache $0 … $p-1 per processing element PE0 … PEp-1
    • LRU replacement
    • Line = block
    • Fully associative
Scheduling
• Cache Affinity
• Dequeue
  • Search the queue for a task whose output block resides in the software cache
  • If found, return that task
  • Otherwise, return the head task
• Enqueue
  • Insert task
  • Sort the queue by task height
• Dispatcher
  • Updates the software caches via a cache coherency protocol with write invalidation
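A sketch of that dequeue policy, assuming the queue is kept sorted by descending task height and in_cache() queries one thread's software cache; the types and names are illustrative only:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int out_i, out_j; int height; } task_t;

typedef struct {
    task_t *items[256];   /* kept sorted by descending task height */
    int     count;
} prio_queue_t;

/* Assumed query against one thread's software cache: does it currently
 * hold block (i,j)? */
extern bool in_cache(int thread, int i, int j);

/* Prefer a task whose output block already sits in this thread's
 * software cache; otherwise fall back to the head (tallest) task. */
task_t *dequeue_affinity(prio_queue_t *q, int thread) {
    if (q->count == 0)
        return NULL;
    int pick = 0;   /* default: head of the priority queue */
    for (int k = 0; k < q->count; k++) {
        task_t *t = q->items[k];
        if (in_cache(thread, t->out_i, t->out_j)) { pick = k; break; }
    }
    task_t *chosen = q->items[pick];
    for (int k = pick; k + 1 < q->count; k++)   /* close the gap */
        q->items[k] = q->items[k + 1];
    q->count--;
    return chosen;
}
```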
Scheduling
• Multiple Graphics Processing Units
  • View each GPU as a single accelerator rather than as hundreds of streaming processors
  • Data must be transferred explicitly from main memory to the GPU
  • No hardware cache coherency is provided
• Hybrid Execution Model
  • Execute tasks on both the CPU and the GPUs
Scheduling
• Software-Managed Cache Coherency
  • Use the software caches developed for cache affinity to handle data transfers
  • Allow a block to remain dirty on one GPU until it is requested by another GPU
  • Any scheduling algorithm, in particular cache affinity, can then be applied when utilizing GPUs
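A sketch of this lazy coherency on a per-block basis, assuming per-block metadata and transfer helpers standing in for cudaMemcpy-style copies; none of these names come from the libflame API:

```c
typedef struct {
    int owner;   /* GPU id holding the newest copy, or -1 for the host */
    int dirty;   /* newest copy not yet written back to the host?      */
} block_meta_t;

extern void copy_to_host(int gpu, int i, int j);  /* device-to-host */
extern void copy_to_gpu(int gpu, int i, int j);   /* host-to-device */

/* Ensure GPU `gpu` holds a current copy of block (i,j) before running
 * a task there that reads the block. */
void acquire_block(block_meta_t *m, int gpu, int i, int j) {
    if (m->owner == gpu)
        return;                         /* already current on this GPU */
    if (m->dirty && m->owner >= 0) {
        copy_to_host(m->owner, i, j);   /* write back the dirty copy   */
        m->dirty = 0;
    }
    copy_to_gpu(gpu, i, j);
    m->owner = gpu;
}

/* After a task on GPU `gpu` writes block (i,j), only mark it dirty;
 * the write-back is deferred until another device asks for the block. */
void release_block(block_meta_t *m, int gpu) {
    m->owner = gpu;
    m->dirty = 1;
}
```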
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Performance
• CPU Target Architecture
  • 4-socket 2.66 GHz Intel Dunnington
  • 24 cores
  • Linux and Windows
  • 16 MB shared L3 cache per socket
• OpenMP
  • Intel compiler 11.1
• BLAS
  • Intel MKL 10.2
Performance
• Implementations
  • SuperMatrix + serial MKL
    • FIFO queue, cache affinity
  • FLAME + multithreaded MKL
  • Multithreaded MKL
  • PLASMA + serial MKL
• Double-precision real floating-point arithmetic
• Tuned block size
Performance
• PLASMA
  • v2.1.0 uses static pipelining for scheduling and does not construct a DAG
  • v2.2.0 uses dynamic scheduling that attains roughly the same performance as the FIFO queue
• MAGMA
  • v1.0 supports only single-GPU execution
  • Does not attempt to minimize data transfers
Performance [figures omitted: CPU performance graphs]
Performance
• Generalized Eigenproblem
  A x = λ B x
  where A is symmetric and B is symmetric positive definite
• Cholesky Factorization
  B = L LT
  where L is a lower triangular matrix, so that A x = λ L LT x
• Standard Form
  Multiply the equation by L-1 on the left and substitute y = LT x:
  C y = λ y, where C = L-1 A L-T and y = LT x
• Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form
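For reference, the same reduction written out in full; this is the standard derivation, with the symbols as reconstructed above:

```latex
\[
A x = \lambda B x,\quad B = L L^{T}
\;\Longrightarrow\;
\underbrace{L^{-1} A L^{-T}}_{C}\,\underbrace{L^{T} x}_{y}
  = \lambda\, \underbrace{L^{T} x}_{y}
\;\Longrightarrow\;
C y = \lambda y .
\]
```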
Performance
• Reduction from … [figure and performance graphs omitted]
Performance
• GPU Target Architecture
  • 2-socket 2.82 GHz Intel Harpertown with an NVIDIA Tesla S1070
  • 4 × 602 MHz Tesla C1060 GPUs
  • 4 GB DDR memory per GPU
  • Linux
• CUDA
  • CUBLAS 3.0
• Single-precision real floating-point arithmetic
Performance [figures omitted: GPU performance graphs]
Performance
• Results
  • Cache affinity vs. FIFO queue
  • SuperMatrix out-of-order vs. PLASMA in-order execution
  • Strong scalability on both CPU and GPU
  • A block size of 896 is typically used on the GPU
  • Performance is representative of other dense linear algebra operations
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Conclusion
• Separation of Concerns
  • Allows us to experiment with different scheduling algorithms
  • Allowed the runtime system to be ported to multiple GPUs
• Locality, Locality, Locality
  • Data communication is as important as load balance when scheduling matrix computations
Acknowledgments
• We thank the other members of the FLAME team for their support
• Funding from NSF, Microsoft, and Intel
• SuperMatrix is implemented within the open-source library libflame, released under the LGPL
Conclusion
• More Information
  http://www.cs.utexas.edu/~flame
• Questions?
  erniec@nvidia.com