Utilization of GPUs for General Computing Presenter: Charlene DiMeglio Paper: Aspects of GPU for General Purpose High Performance Computing Suda, Reiji, et al.
Overview • Problem: • Want to use the GPU for things other than graphics; however, the costs can be high • Solution: • Improve the CUDA drivers • Results: • Compared to a node of a supercomputer, the GPU is worth it • Conclusion: • These improvements make using GPGPUs more feasible
Problem: The Need for More Computation Power • Why GPUs? • GPUs are not fully utilized as a resource, often sitting idle when not being used for graphics • Better performance for less power compared to CPUs • What’s the issue? Cost. • Efficient scheduling – timing data loads with their uses • Memory management – using the small amount of memory available effectively • Loads and stores – waiting for memory transfers, which take hundreds of cycles
Solutions • Brook+ by AMD, Larrabee by Intel • CUDA by NVIDIA • Greatest technological maturity at the time • The paper investigates the existing technology and suggests improvements • (Architecture diagram: 30 multiprocessors, 8 streaming processors each, 16 KB of shared memory per multiprocessor)
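The multiprocessor and shared-memory figures in the diagram describe the Tesla-generation hardware the paper targets. A quick way to confirm them on any CUDA device is to query the runtime; a minimal sketch (assuming device index 0):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the first CUDA device; on a Tesla C1060 this reports
    // 30 multiprocessors and 16 KB of shared memory per block.
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```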
NVIDIA’s Tesla C1060 GPU vs. Hitachi HA8000-tc/RS425 (T2K) Supercomputer • T2K – the fastest supercomputer in Japan at the time
Issues to Overcome • High SIMD vector length • Small main memory size • High register spill cost • No L2 cache, but rather read-only texture caches
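A common way to work around the lack of a general L2 cache is to stage reused data in the per-multiprocessor shared memory by hand. The kernel below is a generic sketch, not code from the paper, and assumes a block size of 256 threads: each block loads one tile into shared memory once, then reuses it from on-chip storage instead of issuing repeated global loads.

```
#include <cuda_runtime.h>

#define TILE 256  // assumed block size; must match the launch configuration

// Each block stages one tile of the input into the 16 KB shared memory,
// then every thread reads the whole tile from on-chip memory instead of
// re-reading global memory (each global load costs hundreds of cycles).
__global__ void tile_sum(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // wait until the tile is loaded

    float s = 0.0f;
    for (int j = 0; j < TILE; ++j)        // reuse data from shared memory
        s += tile[j];
    if (i < n) out[i] = s;
}
```

A host would launch it as, e.g., tile_sum<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n).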
Methods to Hide Latency • A CUDA compiler option limits the number of registers used per warp • 1 warp = a group of 32 threads in a block executed in SIMD fashion • Maximizes the number of warps that can run at a time • Could cause register spills • Variable-sized multi-round data transfer scheduling over PCI Express • PCI Express allows data transfer, GPU computation, and CPU computation to occur in parallel • Allows for a constant flow of information • Reduces overhead to O(log x / x), compared with the larger overhead of uniform scheduling (see the sketch below)
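The register cap mentioned above corresponds to nvcc’s --maxrregcount option (or the __launch_bounds__ qualifier). For the multi-round transfer scheduling, the rough idea is to split the input into chunks and overlap each chunk’s PCI Express copy with computation on another chunk. The sketch below is an assumption-laden illustration, not the paper’s scheduler: process_chunk, the chunk size, and the round count are placeholders, and the rounds are equal-sized for brevity where the paper’s scheduler varies them.

```
#include <cuda_runtime.h>

__global__ void process_chunk(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                    // stand-in for real work
}

int main() {
    const int rounds = 4;
    const int chunk  = 1 << 20;                 // illustrative sizes only
    float *h, *d;
    cudaHostAlloc((void**)&h, (size_t)rounds * chunk * sizeof(float),
                  cudaHostAllocDefault);        // pinned memory for async copies
    cudaMalloc((void**)&d, (size_t)rounds * chunk * sizeof(float));

    cudaStream_t s[rounds];
    for (int r = 0; r < rounds; ++r) cudaStreamCreate(&s[r]);

    // Each round's host-to-device copy is issued in its own stream, so the
    // copy of round r+1 can overlap the kernel of round r over PCI Express.
    for (int r = 0; r < rounds; ++r) {
        float* hp = h + (size_t)r * chunk;
        float* dp = d + (size_t)r * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[r]);
        process_chunk<<<(chunk + 255) / 256, 256, 0, s[r]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[r]);
    }
    cudaDeviceSynchronize();

    for (int r = 0; r < rounds; ++r) cudaStreamDestroy(s[r]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```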
Methods to Hide Latency • When the computation time between communications > the communication latency, it is worth sending the data over to the GPU • Increasing bandwidth and message size makes the constant term in the overhead latency less significant • Efficient use of registers to prevent spills • Deciding where work should be done, GPU vs. CPU (work sharing) • Minimizing divergent warps using the atomic operations found in CUDA • Divergent warps occur when the threads of a warp take different branches and must follow both paths (see the sketch below)
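As an illustration of the atomic-operation point, the generic stream-compaction kernel below (not code from the paper) uses CUDA’s atomicAdd so that qualifying threads reserve unique output slots, instead of per-thread bookkeeping that would send the warp down longer divergent paths.

```
#include <cuda_runtime.h>

// Compact the elements of `in` that are positive into `out`.
// atomicAdd hands each qualifying thread a unique output index,
// keeping the divergent region of the kernel short.
__global__ void compact(const float* in, float* out, int* count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(count, 1);   // reserve one slot atomically
        out[slot] = in[i];
    }
}
```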
Results • Variable-sized multi-round data transfer scheduling • [Chart: transfer performance vs. number of rounds]
Results • Use of atomic instructions in CUDA to minimize latency
Conclusion • CUDA gives programmers the ability to harness the power of the GPU for general uses. • The improvements presented make this option more feasible. • Strategic use of GPGPUs as a resource will improve speed and efficiency. • However, the presented material is mainly theoretical, with little strong data to back it up. • More suggestions than implementations, aimed at promoting GPGPU use.