Parallel Computing: Dense Matrix Multiplication on CUDA • Srikar Vinjamuri • Mentor: Matt Johnson
Why use parallel computing? • Limits of serial computing: transmission speeds, limits to miniaturization, economic limitations • Use of non-local resources (e.g. SETI@home) • Ability to solve larger problems, such as dense matrix multiplication
CUDA = Compute Unified Device Architecture • Uses the GPU for general-purpose computation • Devotes more transistors to data processing than to data caching and flow control • Suited to problems with high arithmetic intensity • No graphics API is needed to program it • Reduces DRAM memory bandwidth bottlenecks • Memory access latency is hidden by computation
CUDA Memory Model • Read-write per-thread registers • Read-write per-thread local memory • Read-write per-block shared memory • Read-write per-grid global memory • Read-only per-grid constant memory
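A minimal sketch of how these memory spaces appear in CUDA C (the kernel, names, and size N are illustrative, and the launch is assumed to use one block of N threads): automatic scalars typically live in registers, small per-thread arrays may spill to local memory, __shared__ declares per-block shared memory, plain device pointers reference global memory, and __constant__ declares the read-only constant memory.

```cuda
#define N 256

__constant__ float coeff[N];              // read-only per-grid constant memory

__global__ void memorySpacesDemo(const float *in, float *out)  // in/out point to global memory
{
    __shared__ float tile[N];             // read-write per-block shared memory

    int i = threadIdx.x;                  // automatic variable: per-thread register
    float scratch[4];                     // small per-thread array; may spill to local memory

    tile[i] = in[i];                      // global -> shared
    __syncthreads();                      // make the tile visible to the whole block

    scratch[0] = tile[i] * coeff[i];
    out[i] = scratch[0];                  // register -> global
}
```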
CUDA Application Programming Interface • Built on the familiar C language • Minimal set of extensions to C • Simple and intuitive run-time library: • Host component: device, memory, and code module management; execution control • Device component: math functions, synchronization functions, type conversion and casting • Common component: vector types and a subset of the standard C library • Language extensions: • Function type qualifiers: whether a function runs on the host or the device, and from where it may be called • Variable type qualifiers: the memory space in which a device variable resides • Execution configuration directive for launching kernels (see the sketch after this list) • Four built-in variables giving grid and block dimensions and block and thread indices
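A hedged host-plus-device sketch of these extensions (kernel name, sizes, and data are made up for illustration): __device__ marks a GPU-only function, __global__ marks a kernel callable from the host, the <<<grid, block>>> directive supplies the execution configuration, and gridDim, blockDim, blockIdx, threadIdx are the four built-in variables.

```cuda
#include <cuda_runtime.h>

__device__ float square(float v) { return v * v; }    // device function, callable only from the GPU

__global__ void squareAll(const float *in, float *out, int n)   // kernel, launched from the host
{
    // Built-in variables give each thread its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square(in[i]);
}

int main()
{
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));    // host component: memory management
    cudaMalloc((void **)&d_out, n * sizeof(float));    // (input left uninitialized in this sketch)

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    squareAll<<<grid, block>>>(d_in, d_out, n);        // execution configuration directive
    cudaDeviceSynchronize();                           // host component: execution control

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```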
Application: Dense Matrix Multiplication • Problem: multiply a dense n x n matrix A by an n x 1 vector x to yield an n x 1 vector y • Serial computation: involves n^2 multiplications and additions • An alternative parallel computation follows
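For reference, a plain serial sketch of the O(n^2) computation y = A·x (function name and row-major layout are assumptions for illustration):

```cuda
// Serial reference: y = A * x, where A is n x n (row-major), x and y are n x 1.
void matvec_serial(const float *A, const float *x, float *y, int n)
{
    for (int row = 0; row < n; ++row) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];   // n multiplications and additions per row
        y[row] = sum;
    }
}
```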
Parallel Dense Matrix Multiplication • We consider only the simplest case here: p = n • The p x p matrix is then partitioned among the p processes, one row per process • The p x 1 vector x is also partitioned so that each process owns one element of x, as shown in the figure
contd. • Here we use an all-to-all broadcast, in which every node transmits its information to every other node, so that each process obtains the full vector x (a CUDA analogue is sketched below).
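The slide describes a message-passing decomposition with p = n processes; a rough CUDA analogue (illustrative only, not the presenter's code) assigns one thread per row, with the effect of the broadcast obtained by every thread reading the whole of x from global memory:

```cuda
// One thread per row (the p = n case): thread i computes y[i] = dot(A[i, :], x).
__global__ void matvec_kernel(const float *A, const float *x, float *y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];   // every thread reads all of x (the "broadcast")
        y[row] = sum;
    }
}
```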