GPU in HPC
Scott A. Friedman
friedman@ats.ucla.edu
ATS Research Computing Technologies
First of all…
• Double precision is coming!
• GPU: late '07 or early '08 (NVIDIA)
  • Will be half speed, according to word on the street
  • At G80 speed, that equals roughly 175 GFLOPS
• Cell HPC: summer '08
  • First appears in the LANL Roadrunner
  • A 5x increase, to roughly 100 GFLOPS
Hardware
• Remember!
  • GPUs are for graphics (graphics processing unit)
  • Think data parallelism!
  • Must hide memory latency
    • Lots of computation per memory access: 'arithmetic intensity' (see the sketch after this slide)
    • Low-latency memory is a precious resource
• Limitations
  • Registers are zeroed, minimal shared/static data, no read-modify-write buffers
  • Varying latencies, dependent on the type of memory accessed
  • Designed for independent operations (a legacy of graphics)
  • Lots of gotchas that will kill performance
  • Hardware is constantly changing
• Current generation
  • Proprietary architectures
  • NVIDIA G80: 128 ALUs, roughly 350 GFLOPS single precision
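As a hedged illustration (not part of the original talk) of what 'arithmetic intensity' means in practice, here are two CUDA kernel sketches; the kernel names and constants are invented for illustration only.

// Memory bound: roughly one flop per 8 bytes of global-memory traffic.
__global__ void low_intensity(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Higher arithmetic intensity: each value loaded from slow global memory is
// reused for many flops in registers, which helps hide the memory latency.
__global__ void high_intensity(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];                      // one global load
        for (int k = 0; k < 64; ++k)          // many register-only flops
            x = x * 1.000001f + 0.5f;
        out[i] = x;                           // one global store
    }
}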
Programming Model
• Streaming
  • Elements (arrays) are processed by a kernel (function)
  • Sounds like a SIMD vector processor
    • Not exactly; the term SPMD is often used (P = program)
    • No index operations on streams
  • Input stream(s) -> compute -> output stream
    • No dependencies between stream elements (see the minimal kernel sketch after this slide)
    • CUDA relaxes this somewhat
• Experimentation required
  • Balancing is essential
    • Compute rather than move data
    • Maximize use of the precious low-latency, high-bandwidth memory
    • Cover latencies with as much computation as possible
    • High arithmetic intensity: you will hear this a lot!
  • Often better to re-compute than to cache data
  • Avoid code that is memory bound
    • Memory speeds are improving much more slowly than the number of ALUs
    • Better to batch memory moves into large transfers
  • Complex memory access rules have a major impact on performance
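As a hedged sketch of the streaming model above (not from the original slides), here is a minimal element-wise CUDA program using the modern runtime API: each thread handles one stream element, with no dependencies between elements. Names such as scale_kernel and the constants are illustrative assumptions.

#include <cuda_runtime.h>
#include <vector>

// Streaming style: each thread reads one input element, computes, and writes
// one output element; elements are completely independent of one another.
__global__ void scale_kernel(const float* in, float* out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_in(n, 3.0f), h_out(n);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Batch the memory moves into one large transfer each way.
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}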
Tools
• Cell SDK
  • Direct access to the hardware
  • Very low level
• CUDA (NVIDIA 8xxx and later)
  • C API that provides a scalar execution model (with caveats)
  • Low level; think of MPI? A certain amount of hardware abstraction
  • The user maps the problem domain to the processing units and memory hierarchy
  • A re-imagining of graphics hardware in programming concepts (e.g. threads, arrays)
    • GLSL and other graphics tools are even lower level, but not as necessary now
  • A kernel runs over a 1-, 2-, or 3-D grid of blocks of threads (see the reduction sketch after this slide)
    • Threads within a block can communicate via on-chip shared memory and synchronize
    • Blocks are independent!
      • No communication between blocks
      • No execution ordering or concurrency guarantees
  • Free, but specific to NVIDIA hardware (will hide future architecture changes)
• CTM (AMD/ATI)
  • Similar to, but lower level than, CUDA
• RapidMind
  • Integrates into C++ code
  • Higher-level abstractions; think of OpenMP?
  • SPMD oriented (streams and kernels); more restrictive than CUDA
  • Portable?
    • Let the experts do the mapping to the memory hierarchy
    • Several back-ends supported: Cell, GPUs, multicore CPUs
    • Allows tuning to specific hardware
  • Not free
• Brook, Sh
  • Open-source tools
  • Sh is the precursor to the RapidMind kit
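A hedged sketch (again not from the original slides) of the grid : blocks : threads hierarchy described above: threads within a block cooperate through on-chip shared memory and __syncthreads(), while the blocks themselves stay independent and never communicate. It assumes a power-of-two block size; the names block_sum, d_in, and d_partial are illustrative.

#include <cuda_runtime.h>

// Each block sums its own chunk of the input into one partial result.
// Threads within a block share on-chip memory and synchronize; blocks do not
// communicate, so each partial sum is produced independently.
__global__ void block_sum(const float* in, float* partial, int n)
{
    extern __shared__ float sdata[];              // per-block shared memory
    const int tid = threadIdx.x;
    const int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                              // valid only within a block

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];           // one result per block
}

// Launch sketch: a 1-D grid of independent blocks, each with its own threads
// and dynamically sized shared memory.
//   block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);

Because blocks cannot communicate, the host (or a second kernel launch) has to combine the per-block partial sums into the final result.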
Resources
• One-stop shopping
  • http://www.gpgpu.org
• More good stuff
  • http://www.rapidmind.com/resources.php
• Great survey paper
  • http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
• Cell HPC presentation
  • http://www.power.org/resources/devcorner/cellcorner/hpcspe.pdf
• SIGGRAPH 2007 GPGPU course (very good)
  • http://www.gpgpu.org/s2007/
• IBM Cell
  • http://www.ibm.com/developerworks/power/cell/
• NVIDIA
  • http://developer.nvidia.com/object/cuda.html
• AMD/ATI
  • http://ati.amd.com/technology/streamcomputing/index.html
• RapidMind
  • http://developer.rapidmind.com
• Google is your friend, of course
Conclusions
• This is the future of the highest-performance codes
  • GPU, Cell, or Larrabee-esque multi-core
  • Industry is scaling cores, not clocks
  • Industry contacts share that customers are 'in denial' and need to get on board
• Programming is going to get a whole lot more complex
  • Memory hierarchies
  • Load and system balancing
  • More and more are doing it; fewer and fewer are any good at it
    • Education!
• Mapping problem domains to these architectures is still evolving
  • Lots of clever solutions to lots of problems
  • At the domain and algorithm level
• Tools are currently pretty weak
  • Industry appears to be aware of this, and not just of the market opportunity
• Hopefully
  • APIs will insulate us from the variety and evolution of hardware
Thank you
• Questions?
• Please feel free to contact me
• ATS has several resources that you can access to try some of these things out
  • Sony PlayStation 3, Cell SDK
  • NVIDIA 8800 GTX, CUDA, RapidMind