
Parallel Computing: Dense Matrix Multiplication on CUDA


Presentation Transcript


  1. Parallel Computing: Dense Matrix Multiplication on CUDA • Srikar Vinjamuri • Mentor: Matt Johnson

  2. Serial Computing vs Parallel Computing

  3. Why use parallel computing? • Limits of serial computing • Transmission speeds • Limits to miniaturization • Economic limitations • Use of non-local resources (e.g., SETI@Home) • Solving larger problems (e.g., dense matrix multiplication)

  4. Types of parallel computers: Flynn’s Classical Taxonomy

  5. Parallel Computing Memory Architectures

  6. CUDA = Compute Unified Device Architecture • Uses the GPU for general-purpose computation. • More transistors are devoted to data processing than to data caching and flow control. • Suited to problems with high arithmetic intensity. • No graphics API is needed to program it. • High DRAM bandwidth reduces memory bottlenecks. • Memory access latency can be hidden by running many threads.
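To ground this, below is a minimal sketch of a complete CUDA program, assuming a kernel named square and an array length of 1024 (both illustrative, not from the slides). It shows GPU computation with no graphics API involved: allocate device memory, launch a kernel, copy the result back.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread squares one array element.
    __global__ void square(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1024;
        float host[1024];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements.
        square<<<(n + 255) / 256, 256>>>(dev, n);

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        printf("host[3] = %f\n", host[3]);  // expect 9.0
        return 0;
    }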

  7. CUDA Memory Model • Read-write per-thread registers. • Read-write per-thread local memory. • Read-write per-block shared memory. • Read-write per-grid global memory. • Read-only per-grid constant memory (written by the host).
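The qualifiers below illustrate these memory spaces in code; this is a sketch, and the kernel name memory_spaces and the variable names are illustrative. The host would set the constant symbol with cudaMemcpyToSymbol before launching, and launch with blockDim.x of at most 256.

    #include <cuda_runtime.h>

    // Per-grid constant memory: read-only on the device, written by the host
    // via cudaMemcpyToSymbol(scale, &value, sizeof(float)).
    __constant__ float scale;

    __global__ void memory_spaces(const float *in, float *out, int n) {
        // Per-block shared memory: read-write, visible to all threads in the block.
        __shared__ float tile[256];

        // i lives in a per-thread register (or per-thread local memory if spilled).
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n) tile[threadIdx.x] = in[i];           // global -> shared
        __syncthreads();                                // make shared writes visible
        if (i < n) out[i] = tile[threadIdx.x] * scale;  // shared * constant -> global
    }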

  8. CUDA Application Programming Interface • Built on the familiar C language. • A minimal set of extensions to C. • A simple and intuitive run-time library: • Host component: device, memory, and code-module management, and execution control. • Device component: math functions, synchronization functions, type conversion, casting. • Common component: vector types and a subset of the standard C library. • Language extensions: • Function type qualifiers: whether a function runs on the host or the device, and from where it may be called. • Variable type qualifiers: the memory space a device variable lives in. • An execution-configuration directive for launching kernels. • Four built-in variables giving grid/block dimensions and block/thread indices.
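A short sketch of these extensions (the kernel and helper names are illustrative): __device__ and __global__ are function type qualifiers, <<<grid, block>>> is the execution-configuration directive, and gridDim, blockDim, blockIdx, and threadIdx are the four built-in variables.

    #include <cuda_runtime.h>

    // __device__: runs on the device, callable only from device code.
    __device__ float axpy(float a, float x, float y) { return a * x + y; }

    // __global__: a kernel; runs on the device, callable from the host.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        // Built-in variables locate this thread within the launch.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = axpy(a, x[i], y[i]);
    }

    void launch_saxpy(float a, const float *dx, float *dy, int n) {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        saxpy<<<grid, block>>>(a, dx, dy, n);  // execution-configuration directive
    }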

  9. Application: Dense Matrix Multiplication • Problem: multiply a dense n×n matrix A with an n×1 vector x to yield an n×1 vector y. • Serial computation: involves n^2 multiplications and additions. • An alternative parallel computation follows.
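For reference, the serial computation is two nested loops. This sketch assumes A is stored row-major; the function name matvec_serial is illustrative.

    // Serial y = A * x for a dense n x n matrix A (row-major): n^2 multiply-adds.
    void matvec_serial(const float *A, const float *x, float *y, int n) {
        for (int row = 0; row < n; ++row) {
            float sum = 0.0f;
            for (int col = 0; col < n; ++col)
                sum += A[row * n + col] * x[col];
            y[row] = sum;
        }
    }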

  10. Parallel Dense Matrix Multiplication • We consider only the simplest case here: p = n. • The n×n matrix is then partitioned row-wise among the p processors. • The n×1 vector x is also partitioned so that each process owns one element of x, as shown in the figure.

  11. contd. • Here we use an all-to-all broadcast, in which every node transmits its element of x to every other node, so that each processor holds the full vector before computing its row of y. A CUDA sketch follows below.
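On CUDA, a rough analogue of this scheme (a sketch, not the presenter's exact code) assigns one thread per row, mirroring p = n, and stages x through shared memory tile by tile so that every thread in a block sees each element of x; the shared-memory staging plays the role of the all-to-all broadcast. The kernel and tile names are illustrative; launch with blockDim.x equal to TILE.

    #define TILE 256

    __global__ void matvec_rows(const float *A, const float *x, float *y, int n) {
        __shared__ float xs[TILE];  // block-wide "broadcast" copy of one slice of x
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;

        for (int base = 0; base < n; base += TILE) {
            int col = base + threadIdx.x;
            xs[threadIdx.x] = (col < n) ? x[col] : 0.0f;  // cooperative load of one tile
            __syncthreads();  // the tile of x is now visible to the whole block

            if (row < n) {
                int limit = min(TILE, n - base);
                for (int k = 0; k < limit; ++k)
                    sum += A[row * n + base + k] * xs[k];
            }
            __syncthreads();  // don't overwrite the tile while others still read it
        }
        if (row < n) y[row] = sum;
    }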

  12. Questions?
