
LAPACK on the NVIDIA G80 Processor

LAPACK on the NVIDIA G80 Processor. Robert Liao Tracy Wang CS252 Spring 2007. Overview. Traditional GPU Architecture The NVIDIA G80 Processor CUDA (Compute Unified Device Architecture) LAPACK Performance and Issues. A Quick Note on Naming.


Presentation Transcript


  1. LAPACK on the NVIDIA G80 Processor Robert Liao Tracy Wang CS252 Spring 2007

  2. Overview • Traditional GPU Architecture • The NVIDIA G80 Processor • CUDA (Compute Unified Device Architecture) • LAPACK • Performance and Issues

  3. A Quick Note on Naming • “G80” is the codename for the GPU found in the following graphics cards. • NVIDIA GeForce 8 Series Graphics Cards • NVIDIA Quadro FX 4600 • NVIDIA Quadro FX 5600

  4. Traditional GPUs • [Figure from Intel Corporation]

  5. Traditional GPUs • GPUs talk Polygons • [Pipeline diagram: From CPU → Vertex Processor → Pixel Fragmenting Creation → Process Fragments → Merge Output → Display]

  6. Traditional GPUs • OpenGL and DirectX abstract this away. • [Pipeline diagram: From CPU → Vertex Processor → Pixel Fragmenting Creation → Process Fragments → Merge Output → Display]

  7. The NVIDIA G80 Architecture • Reconfigurable Processor Pipeline • [Figure from NVIDIA]

  8. G80 History and Specifications • Project Started in Summer of 2002. • 128 Compute Cores • 1.35 GHz in the GeForce 8800 • Floating Point Ops • Stream Processor Architecture • One Computing Unit Streams into another Computing Unit

  9. The CUDA Interface to the G80 • Compute Unified Device Architecture • C Interface for Performing Operations on the NVIDIA Processor • Keeps traditional C memory semantics within the context of a GPU

  10. Working with CUDA • A custom compiler is provided to compile C code that the GPU can understand. • The API functions provide a whole host of ways to interface with the GPU. • CUDA Libraries are provided for common tasks. • The CUDA Runtime helps manage memory. • No DirectX or OpenGL knowledge needed!

  11. Working with CUDA • Running C on the CPU: malloc, free, CPU Code • Running C on the GPU: cudaMalloc, cudaFree, GPU Code • Pointers on one side stay on one side. This will create issues for existing applications.

  12. LAPACK • Linear Algebra PACKage • Implemented in Fortran 77 • Interfaces with BLAS (Basic Linear Algebra Subprograms) • Professor James Demmel involved in Project

  13. CLAPACK • An F2C’ed version of LAPACK. • Very ugly!

      s_rsle(&io___8);
      do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
      e_rsle();
      if (nm < 1) {
          s_wsfe(&io___10);
          do_fio(&c__1, " NM ", (ftnlen)4);
          do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
          do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
          e_wsfe();
          nm = 0;
          fatal = TRUE_;
      } else if (nm > 12) {
          s_wsfe(&io___11);
          do_fio(&c__1, " NM ", (ftnlen)4);
          do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
          do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
          e_wsfe();
          nm = 0;

  14. CUBLAS • NVIDIA’s CUDA Based Implementation of BLAS • Many functions are similar, but argument signatures are slightly different • Adds some other functions as well • cublasAlloc • cublasFree • CUBLAS lives in the GPU world

  15. CLAPACK and CUBLAS • Putting them together is not as easy as just linking CLAPACK to CUBLAS. • Matrices and data structures must be moved into GPU memory space. • CLAPACK executes on the CPU. • CUBLAS executes on the GPU. • [Call flow diagram: CLAPACK Function → Memory copy CPU->GPU → CUBLAS → Memory copy GPU->CPU]
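
The call flow on this slide (CPU routine, copy in, GPU routine, copy out) can be sketched in plain C. This is only a simulation under stated assumptions: the "device" is an ordinary heap buffer, and `dev_vec`, `dev_alloc`, `dev_upload`, `dev_sscal`, `dev_download`, `dev_free`, and `host_sscal` are all hypothetical names invented for illustration. On real hardware the corresponding steps would go through CUBLAS calls such as `cublasAlloc` and `cublasFree`, plus whatever copy and kernel routines the library provides.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for GPU memory: a separate heap buffer that the
 * "CPU side" never touches directly. Not a real CUDA or CUBLAS type. */
typedef struct { float *data; size_t n; } dev_vec;

static int dev_alloc(dev_vec *d, size_t n) {
    d->data = malloc(n * sizeof(float));
    d->n = n;
    return d->data ? 0 : -1;
}

static void dev_free(dev_vec *d) {
    free(d->data);
    d->data = NULL;
    d->n = 0;
}

/* Simulated memory copy CPU->GPU. */
static void dev_upload(dev_vec *d, const float *host) {
    memcpy(d->data, host, d->n * sizeof(float));
}

/* Simulated memory copy GPU->CPU. */
static void dev_download(const dev_vec *d, float *host) {
    memcpy(host, d->data, d->n * sizeof(float));
}

/* Stand-in for a BLAS level-1 scale (x := alpha * x) running "on the GPU". */
static void dev_sscal(dev_vec *d, float alpha) {
    for (size_t i = 0; i < d->n; ++i)
        d->data[i] *= alpha;
}

/* What a CPU-side caller sees: a host pointer goes in and the result
 * comes back in the same host buffer. Every call pays for two copies. */
int host_sscal(size_t n, float alpha, float *x) {
    dev_vec d;
    if (dev_alloc(&d, n) != 0)
        return -1;
    dev_upload(&d, x);      /* CPU -> GPU */
    dev_sscal(&d, alpha);   /* compute in device memory */
    dev_download(&d, x);    /* GPU -> CPU */
    dev_free(&d);
    return 0;
}
```

Calling `host_sscal(3, 2.0f, x)` on a host array doubles it in place. The two copies around each call are the overhead the slide refers to.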

  16. CLAPACK Concentration • General Solve • sgesv • Computes solution to linear system of equations A × X = B • To Solve, A is factored into three matrices, P, L, and U. • P = Permutation Matrix • L = Lower Triangular • U = Upper Triangular • Currently, our results cover the triangular factoring step
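
To make the P, L, U factoring step concrete, here is a minimal sketch of an unblocked, textbook LU factorization with partial pivoting, followed by the triangular solves. It is not the code from CLAPACK or from the presenters' port, and LAPACK itself uses a blocked algorithm. The function names `lu_factor` and `lu_solve` are hypothetical, and column-major storage is assumed because that is how Fortran lays out matrices.

```c
#include <math.h>

/* In-place LU factorization with partial pivoting of an n x n matrix in
 * column-major order (element (i,j) is a[i + j*n]).
 * On return the upper triangle holds U, the strict lower triangle holds
 * the multipliers of L (unit diagonal implied), and piv[k] is the row
 * that was swapped with row k.
 * Returns 0 on success, or k+1 if column k has no nonzero pivot. */
int lu_factor(int n, float *a, int *piv) {
    for (int k = 0; k < n; ++k) {
        int p = k;                               /* find largest pivot */
        for (int i = k + 1; i < n; ++i)
            if (fabsf(a[i + k*n]) > fabsf(a[p + k*n]))
                p = i;
        piv[k] = p;
        if (a[p + k*n] == 0.0f)
            return k + 1;
        if (p != k) {                            /* swap rows k and p */
            for (int j = 0; j < n; ++j) {
                float t = a[k + j*n];
                a[k + j*n] = a[p + j*n];
                a[p + j*n] = t;
            }
        }
        for (int i = k + 1; i < n; ++i) {        /* eliminate below pivot */
            a[i + k*n] /= a[k + k*n];
            for (int j = k + 1; j < n; ++j)
                a[i + j*n] -= a[i + k*n] * a[k + j*n];
        }
    }
    return 0;
}

/* Solve A x = b with the factors from lu_factor; b is overwritten by x. */
void lu_solve(int n, const float *a, const int *piv, float *b) {
    for (int k = 0; k < n; ++k) {                /* b := P b */
        float t = b[k];
        b[k] = b[piv[k]];
        b[piv[k]] = t;
    }
    for (int i = 1; i < n; ++i)                  /* forward: L y = P b */
        for (int j = 0; j < i; ++j)
            b[i] -= a[i + j*n] * b[j];
    for (int i = n - 1; i >= 0; --i) {           /* backward: U x = y */
        for (int j = i + 1; j < n; ++j)
            b[i] -= a[i + j*n] * b[j];
        b[i] /= a[i + i*n];
    }
}
```

For example, with rows (2, 1) and (4, 3) stored column-major as `{2, 4, 1, 3}` and right-hand side `{3, 7}`, factoring and then solving overwrites the right-hand side with the solution of the system. The inner update loops in `lu_factor` are the part that a BLAS library, on the CPU or the GPU, would normally carry out.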

  17. Performance Results

  18. Performance Results

  19. Performance Issues • Much copying must be done from the CPU to GPU and GPU to CPU to communicate results. • Why not convert all pointers into GPU pointers? • Requires CLAPACK to run in GPU memory. • Could be someone’s research paper…
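
A rough cost model shows why the copying matters. The symbols are assumptions introduced here, not taken from the slides: $B$ is the CPU-GPU transfer bandwidth in bytes per second, and $F_{\mathrm{CPU}}$, $F_{\mathrm{GPU}}$ are sustained floating-point rates with $F_{\mathrm{GPU}} > F_{\mathrm{CPU}}$. An $n \times n$ single-precision matrix occupies $4n^2$ bytes, and LU factorization costs about $\tfrac{2}{3}n^3$ floating-point operations. In the best case, where the matrix crosses the bus only once in each direction:

```latex
\begin{align*}
T_{\mathrm{CPU}}(n) &\approx \frac{2n^3}{3\,F_{\mathrm{CPU}}}, \\
T_{\mathrm{GPU}}(n) &\approx \frac{8n^2}{B} + \frac{2n^3}{3\,F_{\mathrm{GPU}}}, \\
T_{\mathrm{GPU}}(n) < T_{\mathrm{CPU}}(n)
  &\iff n > \frac{12}{B\left(\dfrac{1}{F_{\mathrm{CPU}}} - \dfrac{1}{F_{\mathrm{GPU}}}\right)}.
\end{align*}
```

Under this model the copy term grows as $n^2$ while the compute term grows as $n^3$, so the GPU only pays off above some matrix size. Copying around every individual BLAS call multiplies the first term and pushes that break-even size higher.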

  20. Other Issues • Floating Point Behaves Differently • Section 5.2 of the CUDA Programming Guide Discusses Deviations from IEEE-754 • No support for denormalized numbers • Underflowed numbers are flushed to zero • We noticed some results appearing as 0.0001 instead of 0, for example
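
The flush-to-zero bullet can be illustrated on the host. The sketch below emulates that one behavior in C and assumes an IEEE-754 host that does keep denormalized values. `flush_to_zero` is a hypothetical helper name, and the sketch does not reproduce the other deviations the CUDA Programming Guide lists.

```c
#include <float.h>
#include <math.h>

/* Emulate flush-to-zero: a subnormal (denormalized) float is replaced by
 * a zero of the same sign; every other value passes through unchanged. */
float flush_to_zero(float x) {
    return (fpclassify(x) == FP_SUBNORMAL) ? copysignf(0.0f, x) : x;
}
```

On a host with denormal support, `FLT_MIN / 4.0f` is a small nonzero value, while `flush_to_zero(FLT_MIN / 4.0f)` compares equal to zero. A CPU result and a GPU result can therefore disagree near the bottom of the float range even when both run the same algorithm.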

  21. Current State • Investigating some interesting memory issues on the GPU side. • Allocations Mysteriously Fail.

  22. Conclusions To Date • Small data sets are better left on the CPU. • GPU calculations may not be appropriate for scientific computing, depending on needs.

  23. Future Directions • Moving all of LAPACK into GPU • Resolving the copying issue • Perhaps resolved by unifying the CPU and GPU? • Want to give it a try? • Can’t find Quadro FX 5600 on Market (MSRP $2,999) • GeForce 8 Series have the G80 Processor • GeForce 8500GT ($99.99) • GeForce 8800GTX ($939.99)

  24. Questions
