Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, Srinivasan Parthasarathy and P. Sadayappan Department of Computer Science The Ohio State University
Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work
Introduction • Sparse Matrix-Vector Multiplication (SpMV) • y = Ax, where A is a sparse matrix and x is a dense vector. • Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods. • Focus of much research • Scientific Applications, e.g. finite element method • Graph Mining algorithms • PageRank, Random Walk with Restart, HITS • Industrial Strength Efforts • CPUs, Clusters • GPUs (focus of this talk)
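To make the kernel concrete, here is a minimal host-side reference for y = Ax with A in the standard CSR format (an illustrative sketch; the function and variable names are ours, not from the talk):

// Reference CSR SpMV on the host: y = A*x.
// CSR layout: row_ptr holds n+1 offsets; col_idx and vals hold the nnz entries.
void spmv_csr_host(int n, const int *row_ptr, const int *col_idx,
                   const float *vals, const float *x, float *y)
{
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];   // random access into x
        y[i] = sum;
    }
}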
Why GPUs • High Performance [performance comparison figure omitted] • High Productivity • CUDA (now) vs. OpenGL (previously, with many more complications)
Background: CUDA Architecture • Programming Model (logical hierarchy): • Grid • Block • Thread • Kernel
Background: CUDA Architecture • Hardware (physical): • A set of multiprocessors, each with 8 scalar processors and 1 instruction unit • A warp = 32 threads that run the same instruction concurrently • Conditional divergence within a warp is serialized • Different warps: time-shared • Memory system • Global memory: accesses should be coalesced • Shared memory: 16KB per block • Constant/texture memory • Constant values, cached • 16KB constant cache; 6~8KB texture cache per multiprocessor • Registers
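As a small illustration of this model, the toy kernel below (our own example, not from the talk) maps one thread to one array element; consecutive threads in a warp then touch consecutive addresses, so their global-memory loads coalesce into a single transaction:

// Each thread handles one element; thread i reads in[i], so a warp's
// 32 loads fall on consecutive addresses and coalesce.
__global__ void scale(const float *in, float *out, int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    if (i < n)
        out[i] = alpha * in[i];
}
// Launch with one thread per element, e.g.:
//   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);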
Power-law Graphs and Challenges • Power-law graphs • Large number of nodes with low degree • Few nodes with very high degree • Challenges for GPU-based computation of SpMV on such graphs • Load balancing • Inefficient memory access • Conditional divergence • Problem Statement • Can we do better than competing industrial-strength efforts for processing matrices representing such graphs? • Does it yield end-to-end improvements in graph mining applications (e.g. PageRank)?
Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work
Key Insights from Benchmarking • Three kinds of memory accesses: accesses to A, accesses to x, and writes to y • Previous methods have optimized accesses to A • Observation 1: each row accesses random elements in vector x • Observation 2: these accesses are non-coalesced, with poor locality • Solution 1: tile A by columns and store x in the texture cache • Note: the texture cache size is not published; we estimate it at 250 KB (= 64,000 columns), so the entire x cannot fit in the texture cache
Tiling • Observation 3: column lengths follow a power-law distribution • Many short columns offer little re-use of x, so they gain no benefit from tiling • Solution 2: reorder columns by length and partially tile A • Un-tiled elements are computed separately
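A sketch of how Solutions 1 and 2 combine: each column tile covers a slice of x small enough for the texture cache, the tile's slice of x is bound to a texture before launch, and partial sums accumulate into y across tiles. This uses the legacy texture-reference API of the CUDA generation contemporary with the Tesla C1060; the tile bookkeeping names are hypothetical:

// One tile of A in CSR-like form; col_idx is relative to the tile's
// first column, so tex_x only needs to cover that slice of x.
texture<float, 1, cudaReadModeElementType> tex_x;

__global__ void spmv_tile(int n_rows, const int *row_ptr,
                          const int *col_idx, const float *vals, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * tex1Dfetch(tex_x, col_idx[j]); // cached read of x
        y[row] += sum;   // accumulate partial results across tiles
    }
}
// Host loop (sketch): for each tile, bind its slice of x, then launch.
//   cudaBindTexture(0, tex_x, d_x + tile_first_col, tile_cols * sizeof(float));
//   spmv_tile<<<grid, block>>>(n_rows, d_row_ptr, d_col_idx, d_vals, d_y);
// Columns too short to benefit from re-use stay un-tiled and are handled
// in a separate pass, per Solution 2.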
Composite Storage of Tiles • Observation 4: the number of non-zeros per row in each tile also follows a power law • Observation 5: within each tile, performance is limited by • load imbalance • non-coalesced global memory accesses • serialization from conditional thread divergence • Solution 3: composite tile storage scheme, based on a basic observation from the benchmarking study • Row-major storage performs well on long rows (16 threads per row) • Column-major storage performs well on short rows (1 thread per row)
Row and Column Composite Storage • Reorder the rows in each tile from long to short • Rows are partitioned into workloads of similar size • A thread warp is assigned to compute one workload • A workload is a rectangular area of non-zeros with width w and height h • If w > h, row-major storage • Else, column-major storage
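A host-side sketch of this layout decision (names are ours): after sorting a tile's rows from longest to shortest and cutting them into rectangles with roughly equal non-zero counts, each rectangle picks its storage order from its own shape:

// One warp computes one rectangular workload of non-zeros.
struct Workload {
    int first_row;   // first of h consecutive (length-sorted) rows
    int h;           // number of rows in the rectangle
    int w;           // row length (width) inside the rectangle
    bool row_major;  // chosen storage order
};

Workload choose_layout(int first_row, int h, int w)
{
    Workload wl = {first_row, h, w, false};
    // Wide rectangle (few long rows): row-major, so the warp's threads
    // cooperate within one row with coalesced accesses.
    // Tall rectangle (many short rows): column-major, so each thread owns
    // a row and consecutive threads still read consecutive addresses.
    wl.row_major = (w > h);
    return wl;
}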
Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work
Experiments • Hardware Configuration • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory • CPU: dual-core Opteron 2.6GHz, 8GB of 667 MHz DDR2 main memory • All experiments are run with a single process and a single GPU • Datasets [table omitted]
SpMV Kernel: Power-law Matrices [results figure omitted]
SpMV Kernel: Unstructured (Non-power-law) Matrices [results figure omitted]
Data Mining Applications • Given a directed graph G = (V, E) with adjacency matrix A • PageRank: p = cW^Tp + (1 − c)Up • W is the row normalization of A • c = 0.85; U is an n × n matrix with all elements set to 1/n • Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i: r = cWr + (1 − c)q • Here W is the column normalization of A • c = 0.9; the ith element of the restart vector q is 1, the others are all 0 • HITS: each web page is assigned an authority score a and a hub score h, updated as a = A^Th and h = Aa
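For example, PageRank reduces to one SpMV per power iteration: since Up = (1/n)·1 whenever p is a probability vector, the U term is a constant and only W^Tp needs the sparse kernel. A self-contained host sketch under that assumption (our naming; the real implementation would call the GPU kernel instead of the inner loop):

#include <utility>   // std::swap

// One power-iteration step: p_next = c * (W^T p) + (1 - c)/n.
// W^T is given in CSR (row_ptr/col_idx/vals), so row i of W^T
// gathers the current ranks of node i's in-neighbors.
void pagerank(int n, const int *row_ptr, const int *col_idx,
              const float *vals, float c, float *p, float *p_next,
              int iters)
{
    for (int i = 0; i < n; ++i) p[i] = 1.0f / n;          // uniform start
    for (int t = 0; t < iters; ++t) {
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;                              // (W^T p)[i]
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += vals[j] * p[col_idx[j]];
            p_next[i] = c * sum + (1.0f - c) / n;          // U term is constant
        }
        std::swap(p, p_next);
    }
}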
Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work
Conclusions and Future Work • Architecture-conscious optimizations for SpMV, exploiting • Architecture features of the GPU • Characteristics of graph mining applications • Significant performance improvement on power-law graph datasets • Future work • Parameter auto-tuning based on the non-zero distribution • Blocking and loop unrolling • Extension to distributed systems to handle larger datasets
Thank you • Questions? • Acknowledgements: • Grants: • DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611 • NSF CAREER Grant IIS-0347662
Outline • Motivation and Background • Limitations of Previous Approach • Methods • Experiments • Conclusions and Future work
Limitations of Previous Work • NVIDIA's SpMV library is built on different storage formats for matrix A • CSR kernel: imbalanced workload amongst threads; non-coalesced memory accesses • CSR-vector kernel: many short rows waste threads • Optimized CSR-vector (Baskaran et al.)
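For reference, a sketch of the CSR-vector idea (after Bell and Garland's published kernel, not this paper's method): one 32-thread warp strides over each row's non-zeros and reduces in shared memory, which coalesces the reads on long rows but idles most of the warp on short ones. A block size of 256 is assumed:

__global__ void spmv_csr_vector(int n_rows, const int *row_ptr,
                                const int *col_idx, const float *vals,
                                const float *x, float *y)
{
    __shared__ volatile float sdata[256];         // one slot per thread
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / 32;                          // one warp per row
    int lane = threadIdx.x & 31;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += vals[j] * x[col_idx[j]];       // coalesced across the warp
        sdata[threadIdx.x] = sum;
        // warp-synchronous tree reduction (no __syncthreads needed)
        if (lane < 16) sdata[threadIdx.x] += sdata[threadIdx.x + 16];
        if (lane <  8) sdata[threadIdx.x] += sdata[threadIdx.x +  8];
        if (lane <  4) sdata[threadIdx.x] += sdata[threadIdx.x +  4];
        if (lane <  2) sdata[threadIdx.x] += sdata[threadIdx.x +  2];
        if (lane <  1) sdata[threadIdx.x] += sdata[threadIdx.x +  1];
        if (lane == 0) y[row] = sdata[threadIdx.x];
    }
}
// Launch with 32 threads per row, e.g. 256-thread blocks covering 8 rows each.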
Limitations of Previous Work • COO kernel • Each warp works on one interval of non-zeros; warps run in parallel • Within one warp, threads perform a binary reduction and must check whether two operands come from the same row • Result: thread divergence and low thread-level parallelism
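For contrast, a much simplified COO variant (one thread per non-zero, accumulating with atomicAdd). Note this is not NVIDIA's kernel, which instead uses the per-warp segmented reduction described above, and that float atomicAdd requires compute capability 2.0+, newer than the C1060 used here:

// y must be zero-initialized before the launch.
__global__ void spmv_coo_atomic(int nnz, const int *rows, const int *cols,
                                const float *vals, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per non-zero
    if (i < nnz)
        atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);
}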
Limitations of Previous Work • ELL kernel • Requires row lengths to be bounded by a small number k; rows shorter than k are padded with 0s • Data and index matrices are stored in column-major order; each thread works on one row • On power-law matrices, the long rows cannot be bounded by a small k • HYB kernel: ELL + COO • The ELL part covers only a small amount of the computation and the COO part is slow; increasing the ELL ratio introduces memory (padding) overhead
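A sketch of the ELL kernel just described (our naming): with every row padded to k entries and the matrices stored column-major, thread r reads entry (r, j) at offset j·n + r, so the warp's loads stay coalesced; padded slots hold index 0 and value 0, so they contribute nothing:

__global__ void spmv_ell(int n_rows, int k, const int *col_idx,
                         const float *vals, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = 0; j < k; ++j) {
            int idx = j * n_rows + row;               // column-major layout
            sum += vals[idx] * x[col_idx[idx]];       // padded vals are 0
        }
        y[row] = sum;
    }
}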