Hardware Acceleration Using GPUs
M Anirudh
Guide: Prof. Sachin Patkar
VLSI Consortium
April 4, 2008
Advantages of Using Graphics Processors
• Parallel architectures with lots of ALUs
• High memory bandwidth
• Cheap, fast and scalable
• New generation within 2 years
• High Gflops/$

Cons
• No double precision yet (only SP floating point operations)
• Loss of precision (not fully IEEE 754 compliant)
NVIDIA GeForce 8 Series cards
• Currently using 8500GT to test our algorithms
• 8500GT has 16 processors, a theoretical peak floating-point performance of 28.8 Gflops, and a memory bandwidth of 12.8 GB/s
• Scalable architecture
• 8800 GT: 128 processors, ~350 Gflops and 86.4 GB/s
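The peak figures above can be compared with a little arithmetic. The sketch below uses only the numbers quoted on the slide; the dictionary layout and function names are illustrative, and "flops per byte" here is simply peak Gflops divided by peak bandwidth, not a measured quantity.

```python
# Theoretical peak figures as quoted on the slide.
cards = {
    "8500GT": {"processors": 16, "gflops": 28.8, "bandwidth_gb_s": 12.8},
    # The slide gives "~350 Gflops" for the 8800 GT; 350.0 is that rounded value.
    "8800 GT": {"processors": 128, "gflops": 350.0, "bandwidth_gb_s": 86.4},
}


def gflops_per_processor(card):
    """Peak Gflops divided by the number of processors."""
    return card["gflops"] / card["processors"]


def flops_per_byte(card):
    """Peak flops available per byte of peak memory bandwidth."""
    return card["gflops"] / card["bandwidth_gb_s"]


if __name__ == "__main__":
    for name, card in cards.items():
        print(
            f"{name}: {gflops_per_processor(card):.2f} Gflops/processor, "
            f"{flops_per_byte(card):.2f} flops/byte"
        )
```

On these numbers, both cards can perform more than two floating-point operations for every byte they can fetch at peak bandwidth, so a kernel that does little arithmetic per byte loaded is likely to be limited by memory rather than by the ALUs.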
GeForce 8500GT Architecture
[Figure: block diagram with a thread scheduler, two control units, 16 ALUs, two local memories, shared memory, and global memory]
Programming Model
• Massively multi-threaded
• Threads -> warps -> blocks -> grid
• Shared memory and global memory
• Coalesced memory access: 5 GB/s to 70 GB/s
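The thread hierarchy can be illustrated with plain index arithmetic. This is a minimal sketch in Python rather than device code: the function names are hypothetical, it models a one-dimensional grid only, and it assumes the 32-thread warp size used by NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumed here)


def global_thread_id(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx):
    """Index of the warp, within its block, that a thread belongs to."""
    return thread_idx // WARP_SIZE


def blocks_needed(n, block_dim):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + block_dim - 1) // block_dim


if __name__ == "__main__":
    n, block_dim = 1000, 128
    print("blocks in grid:", blocks_needed(n, block_dim))
    print("thread 5 of block 2 handles element", global_thread_id(2, block_dim, 5))
    print("thread 70 of a block sits in warp", warp_of(70))
```

With one element per thread, consecutive threads of a warp touch consecutive elements, which is the access pattern that coalescing rewards.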
Results
• Matrix-vector operations are slow because of the data transfer from host to device.
• Matrix-matrix multiplication reaches 10 Gflops on the GPU, compared with more than 2 Gflops on the CPU and 6 Gflops reported using BLAS.
• The Nvidia 8800 card has been observed to reach up to 180 Gflops for matrix-matrix multiplication using optimized algorithms.
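A rough operation count suggests why transfer cost hurts matrix-vector products far more than matrix-matrix products. The sketch below assumes dense n x n operands, counts a multiply-add as 2 flops, and assumes every operand and result crosses the host-device link exactly once; the function names are illustrative.

```python
def matvec_flops(n):
    """Flops for y = A x with a dense n x n matrix."""
    return 2 * n * n


def matvec_words(n):
    """Words moved: the matrix, the input vector, and the output vector."""
    return n * n + 2 * n


def matmul_flops(n):
    """Flops for C = A B with dense n x n matrices."""
    return 2 * n ** 3


def matmul_words(n):
    """Words moved: two input matrices and one output matrix."""
    return 3 * n * n


def flops_per_word(flops, words):
    """Arithmetic performed per word transferred."""
    return flops / words


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        mv = flops_per_word(matvec_flops(n), matvec_words(n))
        mm = flops_per_word(matmul_flops(n), matmul_words(n))
        print(f"n={n}: matvec {mv:.2f} flops/word, matmul {mm:.2f} flops/word")
```

Under these assumptions the matrix-vector ratio stays just below 2 flops per word for any n, while the matrix-matrix ratio grows as 2n/3, so only the latter can amortize the transfer.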
Conclusion
• Most reported performances for GPUs are ~30-40% of theoretical peak performance. These are still 5x - 10x faster than the CPU
• Considerable understanding and work are required to fully optimize code
• Matrix-matrix operations are easily an order of magnitude faster than on the CPU

Future Work
• Aim is to develop optimized routines for LU decomposition, Cholesky, Conjugate Gradient etc.
• Try to incorporate these routines into the DC Analyzer, both to improve performance and to tackle larger data sizes.
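The fraction-of-peak and speedup figures in the conclusion can be checked against the numbers reported earlier in the deck. This sketch assumes that the 10 Gflops matrix-matrix result was measured on the 8500GT (the slides name it as the test card but do not tie the figure to it explicitly) and that the 180 Gflops figure is set against the ~350 Gflops peak quoted for the 8800 GT; the function names are illustrative.

```python
def fraction_of_peak(measured_gflops, peak_gflops):
    """Achieved performance as a fraction of theoretical peak."""
    return measured_gflops / peak_gflops


def speedup(gpu_gflops, cpu_gflops):
    """Ratio of GPU throughput to CPU throughput."""
    return gpu_gflops / cpu_gflops


if __name__ == "__main__":
    # Assumption: 10 Gflops was measured on the 8500GT (peak 28.8 Gflops).
    print(f"8500GT matmul: {fraction_of_peak(10.0, 28.8):.1%} of peak")
    # Assumption: 180 Gflops is compared with the ~350 Gflops 8800 GT peak.
    print(f"8800 matmul:   {fraction_of_peak(180.0, 350.0):.1%} of peak")
    # The slides report "2+" Gflops on the CPU; 2.0 is used as a lower bound.
    print(f"GPU vs CPU:    {speedup(10.0, 2.0):.1f}x")
```

Under the first assumption, the measured result is about 35% of peak, inside the 30-40% range quoted above.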