CUBLAS and CUSPARSE MVM Timing
Gavin Harrison
NVIDIA Memory Hierarchy
• Global memory: large, but high latency.
• Shared memory: a cache shared by each set of processors.
• Constant/texture memory: read-only regions of global memory with an on-chip cache.
• Constant memory is faster, but has only one port.
• Texture memory doesn’t suffer greatly from irregular access, and it also benefits from 2D spatial locality.
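As a rough illustration of how two of these memory spaces look in CUDA source, the sketch below keeps a small set of polynomial coefficients in constant memory and the data arrays in global memory. The names `coeffs`, `eval_poly`, and `eval_poly_on_gpu`, and the polynomial itself, are hypothetical and chosen only for this example. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <array>
#include <vector>

constexpr int NUM_COEFFS = 3;  // hypothetical: a quadratic polynomial

// Constant memory: read-only inside kernels and served by an on-chip cache.
__constant__ float coeffs[NUM_COEFFS];

// x and y live in global memory (large, high latency).
__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Horner's rule. In each iteration every thread of a warp reads the
    // same coeffs[k], which is the access pattern constant memory suits.
    float acc = 0.0f;
    for (int k = NUM_COEFFS - 1; k >= 0; --k)
        acc = acc * x[i] + coeffs[k];
    y[i] = acc;
}

// Host helper (hypothetical): evaluates c[0] + c[1]*x + c[2]*x^2 on the GPU.
std::vector<float> eval_poly_on_gpu(const std::array<float, NUM_COEFFS>& c,
                                    const std::vector<float>& x)
{
    int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, c.data(), NUM_COEFFS * sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    eval_poly<<<blocks, threads>>>(d_x, d_y, n);

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Shared memory and texture memory are not used in this sketch; it only contrasts a constant-memory symbol with ordinary global-memory arrays.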
Tuning SMVM for GPU (GTX 280)
• Use multiple threads per row; synchronize with __syncthreads() and combine the partial results.
• Access memory at stride.
  • Half warps access sequential addresses.
  • Allows for fewer memory reads from global memory.
• Align rows.
  • Also helps decrease memory reads from global memory.
• Use texture memory for the input vector.
  • The input vector is reused.
  • Texture reads are cached, and benefit from spatial locality.
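The first two tuning points can be sketched as a CUDA kernel. This is a minimal sketch, not the kernel that was timed. It assumes the matrix is stored in CSR format (the slides do not name the format), assigns a half warp of 16 threads to each row, and combines the partial sums in shared memory with `__syncthreads()`. The names `csr_spmv_half_warp_per_row` and `spmv_on_gpu` are hypothetical. For simplicity it reads the input vector with ordinary global loads rather than through texture memory, does not pad rows for alignment, and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS_PER_ROW = 16;  // one half warp cooperates on each row
constexpr int BLOCK_SIZE = 128;      // must be a multiple of THREADS_PER_ROW

__global__ void csr_spmv_half_warp_per_row(int num_rows,
                                           const int* row_ptr,
                                           const int* col_idx,
                                           const float* vals,
                                           const float* x,
                                           float* y)
{
    __shared__ float partial[BLOCK_SIZE];

    int tid  = threadIdx.x;
    int lane = tid % THREADS_PER_ROW;  // position within the half warp
    int row  = (blockIdx.x * blockDim.x + tid) / THREADS_PER_ROW;

    // Lanes start at consecutive nonzeros and advance by THREADS_PER_ROW,
    // so the half warp reads sequential addresses of vals and col_idx.
    float sum = 0.0f;
    if (row < num_rows) {
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += THREADS_PER_ROW)
            sum += vals[j] * x[col_idx[j]];
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction of the 16 partial sums belonging to each row.
    for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) {
        if (lane < offset)
            partial[tid] += partial[tid + offset];
        __syncthreads();  // reached by every thread in the block
    }

    if (lane == 0 && row < num_rows)
        y[row] = partial[tid];
}

// Host helper (hypothetical): computes y = A * x for a CSR matrix A.
// Assumes at least one row and at least one stored nonzero.
std::vector<float> spmv_on_gpu(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<float>& vals,
                               const std::vector<float>& x)
{
    int num_rows = static_cast<int>(row_ptr.size()) - 1;
    std::vector<float> y(num_rows);

    int *d_row_ptr = nullptr, *d_col_idx = nullptr;
    float *d_vals = nullptr, *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_row_ptr, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_col_idx, col_idx.size() * sizeof(int));
    cudaMalloc(&d_vals, vals.size() * sizeof(float));
    cudaMalloc(&d_x, x.size() * sizeof(float));
    cudaMalloc(&d_y, num_rows * sizeof(float));

    cudaMemcpy(d_row_ptr, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, col_idx.data(), col_idx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, vals.data(), vals.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    int rows_per_block = BLOCK_SIZE / THREADS_PER_ROW;
    int blocks = (num_rows + rows_per_block - 1) / rows_per_block;
    csr_spmv_half_warp_per_row<<<blocks, BLOCK_SIZE>>>(
        num_rows, d_row_ptr, d_col_idx, d_vals, d_x, d_y);

    cudaMemcpy(y.data(), d_y, num_rows * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row_ptr);
    cudaFree(d_col_idx);
    cudaFree(d_vals);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Giving each row a half warp only pays off when rows hold enough nonzeros to keep the 16 lanes busy; for very short rows most lanes contribute zero to the reduction.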
Improvements in Fermi (GTX 580)
• General L1/L2 cache structure.
• L1 cache and shared memory are configurable as 48 KB or 16 KB each (64 KB split between them).
• L2 is 768 KB.
• Improved support for double-precision floating-point numbers.
• Added support for 32-bit integer multiplication.
• 32 SPs per SM.
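The configurable L1/shared split is exposed through the CUDA runtime call `cudaFuncSetCacheConfig`. The sketch below requests the larger L1 configuration for one kernel. The kernel `scale_kernel` and the helper `scale_with_l1_preference` are hypothetical placeholders, the runtime treats the setting as a preference rather than a guarantee, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel that uses no shared memory, so a larger L1 is the
// natural preference for it.
__global__ void scale_kernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host helper (hypothetical): scales a vector on the GPU after asking for
// the L1-heavy split.
std::vector<float> scale_with_l1_preference(std::vector<float> data, float factor)
{
    int n = static_cast<int>(data.size());
    if (n == 0) return data;

    // On Fermi this selects 48 KB L1 / 16 KB shared memory for the kernel.
    // Other architectures may interpret or ignore the hint.
    cudaFuncSetCacheConfig(scale_kernel, cudaFuncCachePreferL1);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_data, factor, n);

    cudaMemcpy(data.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return data;
}
```

A kernel that relies heavily on shared memory would pass `cudaFuncCachePreferShared` instead, which on Fermi selects the 16 KB L1 / 48 KB shared configuration.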