LAPACK on the NVIDIA G80 Processor
Robert Liao, Tracy Wang
CS252, Spring 2007
Overview
• Traditional GPU Architecture
• The NVIDIA G80 Processor
• CUDA (Compute Unified Device Architecture)
• LAPACK
• Performance and Issues
A Quick Note on Naming
• “G80” is the codename for the GPU found in the following graphics cards:
• NVIDIA GeForce 8 Series graphics cards
• NVIDIA Quadro FX 4600
• NVIDIA Quadro FX 5600
Traditional GPUs
[Diagram: the GPU in the traditional PC system architecture, from Intel Corporation]
Traditional GPUs
• GPUs talk polygons.
• Pipeline: From CPU → Vertex Processor → Pixel Creation (Fragmenting) → Process Fragments → Merge Output → Display
Traditional GPUs
• OpenGL and DirectX abstract this pipeline away.
• Pipeline: From CPU → Vertex Processor → Pixel Creation (Fragmenting) → Process Fragments → Merge Output → Display
The NVIDIA G80 Architecture
• Reconfigurable processor pipeline
[Diagram: G80 unified architecture, from NVIDIA]
G80 History and Specifications
• Project started in summer of 2002
• 128 compute cores
• 1.35 GHz in the GeForce 8800
• Floating-point operations
• Stream processor architecture: one computing unit streams into another computing unit
The CUDA Interface to the G80
• Compute Unified Device Architecture
• C interface for performing operations on the NVIDIA processor
• Carries traditional C memory semantics into the context of a GPU
Working with CUDA
• A custom compiler is provided to compile C code that the GPU can understand.
• The API functions provide many ways to interface with the GPU.
• CUDA libraries are provided for common tasks.
• The CUDA runtime helps manage memory.
• No DirectX or OpenGL knowledge needed!
Working with CUDA
Running C on the CPU:
• malloc
• free
• CPU code
Running C on the GPU:
• cudaMalloc
• cudaFree
• GPU code
Pointers on one side stay on one side. This will create issues for existing applications.
LAPACK
• Linear Algebra PACKage
• Implemented in Fortran 77
• Interfaces with BLAS (Basic Linear Algebra Subprograms)
• Professor James Demmel is involved in the project
CLAPACK
• An f2c’ed version of LAPACK.
• Very ugly!

    s_rsle(&io___8);
    do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
    e_rsle();
    if (nm < 1) {
        s_wsfe(&io___10);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;
        fatal = TRUE_;
    } else if (nm > 12) {
        s_wsfe(&io___11);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;
CUBLAS
• NVIDIA’s CUDA-based implementation of BLAS
• Many functions are similar to BLAS, but argument signatures differ slightly
• Adds some other functions as well: cublasAlloc, cublasFree
• CUBLAS lives in the GPU world
CLAPACK and CUBLAS
• Putting them together is not as easy as just linking CLAPACK to CUBLAS.
• Matrices and data structures must be moved into GPU memory space.
• CLAPACK executes on the CPU; CUBLAS executes on the GPU.
• Flow: CLAPACK function → memory copy CPU→GPU → CUBLAS → memory copy GPU→CPU
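The copy-in / call / copy-out flow above can be sketched with the legacy CUBLAS API of this era. This is an illustrative sketch, not code from the project: gpu_sgemm is a hypothetical wrapper name, and building it requires the CUDA toolkit and a CUDA-capable GPU, so it is shown untested.

```c
/* Sketch: wrapping one BLAS call the way a CLAPACK routine must when
 * CUBLAS supplies the BLAS layer.  Hypothetical wrapper; legacy
 * CUBLAS API names (cublasAlloc, cublasSetMatrix, ...). */
#include <cublas.h>

/* C = alpha*A*B + beta*C, column-major, m x k, k x n, m x n */
void gpu_sgemm(int m, int n, int k, float alpha,
               const float *A, const float *B, float beta, float *C)
{
    float *dA, *dB, *dC;

    cublasInit();                                  /* once per process */
    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);

    /* CPU -> GPU: host pointers mean nothing to CUBLAS until copied */
    cublasSetMatrix(m, k, sizeof(float), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(float), B, k, dB, k);
    cublasSetMatrix(m, n, sizeof(float), C, m, dC, m);

    cublasSgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

    /* GPU -> CPU: results must come back before CLAPACK can continue */
    cublasGetMatrix(m, n, sizeof(float), dC, m, C, m);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```

Every CUBLAS call a CLAPACK routine makes pays this round-trip copy cost, which is exactly the performance issue discussed below.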
CLAPACK Concentration
• General solve: sgesv
• Computes the solution to the linear system of equations A × X = B
• To solve, A is factored into three matrices, P, L, and U:
• P = permutation matrix
• L = lower triangular
• U = upper triangular
• Currently, our results cover the triangular factoring step
Performance Issues
• Much copying must be done from CPU to GPU and GPU to CPU to communicate results.
• Why not convert all pointers into GPU pointers?
• That would require CLAPACK itself to run in GPU memory.
• Could be someone’s research paper…
Other Issues
• Floating point behaves differently on the G80.
• Section 5.2 of the CUDA Programming Guide discusses deviations from IEEE-754:
• No support for denormalized numbers
• Underflowed numbers are flushed to zero
• We noticed some results appearing as 0.0001 instead of 0, for example
Current State
• Investigating some interesting memory issues on the GPU side.
• Allocations mysteriously fail.
Conclusions To Date
• Small data sets are better left on the CPU.
• GPU calculations may not be appropriate for scientific computing, depending on needs.
Future Directions
• Moving all of LAPACK onto the GPU
• Resolving the copying issue — perhaps by unifying CPU and GPU memory?
• Want to give it a try?
• The Quadro FX 5600 is hard to find on the market (MSRP $2,999)
• GeForce 8 Series cards have the G80 processor:
• GeForce 8500GT ($99.99)
• GeForce 8800GTX ($939.99)