Parallel Computing: Dense Matrix Multiplication on CUDA • Srikar Vinjamuri • Mentor: Matt Johnson
Why use parallel computing? • Limits of serial computing: transmission speeds, limits to miniaturization, economic limitations • Use of non-local resources (e.g. SETI@home) • Ability to solve larger problems, such as dense matrix multiplication
CUDA = Compute Unified Device Architecture • Uses the GPU for general-purpose computation • Devotes more transistors to data processing than to data caching and flow control • Suited to problems with high arithmetic intensity • No graphics API is needed to program it • Reduces DRAM memory bandwidth bottlenecks • Memory access latency is hidden by computation
CUDA Memory Model • Read-write per-thread registers • Read-write per-thread local memory • Read-write per-block shared memory • Read-write per-grid global memory • Read-only per-grid constant memory
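A minimal sketch of how these memory spaces appear in CUDA C (the kernel, names, and size N are illustrative, and the launch is assumed to use one block of N threads): automatic scalars typically live in registers, small per-thread arrays may spill to local memory, __shared__ declares per-block shared memory, plain device pointers reference global memory, and __constant__ declares the read-only constant memory.

```cuda
#define N 256

__constant__ float coeff[N];              // read-only per-grid constant memory

__global__ void memorySpacesDemo(const float *in, float *out)  // in/out point to global memory
{
    __shared__ float tile[N];             // read-write per-block shared memory

    int i = threadIdx.x;                  // automatic variable: per-thread register
    float scratch[4];                     // small per-thread array; may spill to local memory

    tile[i] = in[i];                      // global -> shared
    __syncthreads();                      // make the tile visible to the whole block

    scratch[0] = tile[i] * coeff[i];
    out[i] = scratch[0];                  // register -> global
}
```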
CUDA Application Programming Interface • Built on the familiar C language • Minimal set of extensions to C • Simple and intuitive run-time library: • Host component: device, memory, and code module management; execution control • Device component: math functions, synchronization functions, type conversion and casting • Common component: vector types and a subset of the standard C library • Language extensions: • Function type qualifiers: whether a function runs on the host or the device, and from where it may be called • Variable type qualifiers: the memory space in which a device variable resides • Execution configuration directive for launching kernels (see the sketch after this list) • Four built-in variables giving grid and block dimensions and block and thread indices
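A hedged host-plus-device sketch of these extensions (kernel name, sizes, and data are made up for illustration): __device__ marks a GPU-only function, __global__ marks a kernel callable from the host, the <<<grid, block>>> directive supplies the execution configuration, and gridDim, blockDim, blockIdx, threadIdx are the four built-in variables.

```cuda
#include <cuda_runtime.h>

__device__ float square(float v) { return v * v; }    // device function, callable only from the GPU

__global__ void squareAll(const float *in, float *out, int n)   // kernel, launched from the host
{
    // Built-in variables give each thread its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square(in[i]);
}

int main()
{
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));    // host component: memory management
    cudaMalloc((void **)&d_out, n * sizeof(float));    // (input left uninitialized in this sketch)

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    squareAll<<<grid, block>>>(d_in, d_out, n);        // execution configuration directive
    cudaDeviceSynchronize();                           // host component: execution control

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```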
Application: Dense Matrix Multiplication • Problem: multiply a dense n x n matrix A by an n x 1 vector x to yield an n x 1 vector y • Serial computation: involves n^2 multiplications and additions • An alternative parallel computation follows
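For reference, a plain serial sketch of the O(n^2) computation y = A·x (function name and row-major layout are assumptions for illustration):

```cuda
// Serial reference: y = A * x, where A is n x n (row-major), x and y are n x 1.
void matvec_serial(const float *A, const float *x, float *y, int n)
{
    for (int row = 0; row < n; ++row) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];   // n multiplications and additions per row
        y[row] = sum;
    }
}
```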
Parallel Dense Matrix Multiplication • We consider only the simplest case here: p = n • The p x p matrix is then partitioned among the p processes, one row per process • The p x 1 vector x is also partitioned so that each process owns one element of x, as shown in the figure
contd. • Here we use an all-to-all broadcast, in which every node transmits its information to every other node, so that each process obtains the full vector x (a CUDA analogue is sketched below).
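The slide describes a message-passing decomposition with p = n processes; a rough CUDA analogue (illustrative only, not the presenter's code) assigns one thread per row, with the effect of the broadcast obtained by every thread reading the whole of x from global memory:

```cuda
// One thread per row (the p = n case): thread i computes y[i] = dot(A[i, :], x).
__global__ void matvec_kernel(const float *A, const float *x, float *y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];   // every thread reads all of x (the "broadcast")
        y[row] = sum;
    }
}
```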