GPUs and Accelerators
Jonathan Coens, Lawrence Tan, Yanlin Li
Outline • Graphics Processing Units • Features • Motivation • Challenges • Accelerator • Methodology • Performance Evaluation • Discussion • Rigel • Methodology • Performance Evaluation • Discussion • Conclusion
Graphics Processing Units (GPU) • GPU • Special-purpose processors designed to render 3D scenes • Found in almost every desktop today • Features • Highly parallel processors • Better floating-point performance than CPUs • ATI Radeon X1900: 250 GFLOPS • Motivation • Use GPUs for general-purpose programming • Challenges • Difficult for programmers to program directly • Trade-off between programmability and performance
[Figure: GeForce 6600GT (NV43) GPU]
Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses • Methodology • Uses data parallelism to program the GPU (SIMD) • Parallel arrays exposed as C# objects • No aspects of the GPU are exposed to the programmer • The programmer only needs to know how to use the parallel-array API • Accelerator takes care of the conversion to pixel-shader code • Parallel programs are represented as expression DAGs (a sketch of this deferred-evaluation style follows below)
[Figure: Simplified block diagram of a GPU]
[Figure: Expression DAG with shader breaks marked]
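Accelerator itself is exposed as a C# library, but its key mechanism can be sketched compactly: operations on parallel arrays do not execute immediately; they build an expression DAG, and only when a result is demanded does Accelerator compile the DAG (split at shader breaks) into pixel-shader passes. Below is a minimal C++ sketch of that deferred-evaluation style, assuming invented names (`ParallelArray`, `Node`, `eval`) that are not Accelerator's actual API; a real backend would emit shader code inside `eval` rather than loop on the CPU.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical sketch of Accelerator's deferred-evaluation style.
// Each operation records a node in an expression DAG instead of
// computing immediately; eval() walks the DAG once, which is where
// Accelerator would instead generate and run pixel-shader code.
struct Node {
    std::vector<float> data;                // leaf: concrete values
    std::shared_ptr<Node> lhs, rhs;         // interior: operands
    std::function<float(float, float)> op;  // interior: elementwise op
};

struct ParallelArray {
    std::shared_ptr<Node> node;

    explicit ParallelArray(std::vector<float> v)
        : node(std::make_shared<Node>(Node{std::move(v), nullptr, nullptr, {}})) {}

    ParallelArray(std::shared_ptr<Node> n) : node(std::move(n)) {}

    friend ParallelArray operator+(const ParallelArray& a, const ParallelArray& b) {
        return ParallelArray(std::make_shared<Node>(
            Node{{}, a.node, b.node, [](float x, float y) { return x + y; }}));
    }
    friend ParallelArray operator*(const ParallelArray& a, const ParallelArray& b) {
        return ParallelArray(std::make_shared<Node>(
            Node{{}, a.node, b.node, [](float x, float y) { return x * y; }}));
    }
};

// Force evaluation: in Accelerator this is where the DAG is split at
// "shader breaks" and compiled to GPU code.
std::vector<float> eval(const ParallelArray& a) {
    const Node& n = *a.node;
    if (!n.lhs) return n.data;  // leaf node holds concrete data
    std::vector<float> l = eval(ParallelArray(n.lhs));
    std::vector<float> r = eval(ParallelArray(n.rhs));
    std::vector<float> out(l.size());
    for (size_t i = 0; i < l.size(); ++i) out[i] = n.op(l[i], r[i]);
    return out;
}
```

With this structure, a call like `eval(a + b * c)` sees the whole expression at once, which is what lets Accelerator fuse element-wise operations into a small number of shader passes instead of one pass per operator.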
Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses • Performance Evaluation • Performance of Accelerator versus hand-coded pixel-shader programs on a GeForce 7800 GTX and an ATI X1800, shown as speedup relative to the C++ versions of the programs
[Figure: Speedup of Accelerator programs on various GPUs compared to C++ programs running on a CPU]
Rigel: 1024-core Accelerator-Specific Architecture • SPMD programming model • Global address space • RISC instruction set • Write-back caches • Cores laid out in clusters of 8, each cluster with a local cache • Custom cores (optimized for space / power) Hierarchical Task Queueing • Single queue from the programmer's perspective • Architecture handles distributing tasks (see the sketch below) • Customizable via API • Task granularity • Static vs. dynamic scheduling
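The hierarchical queue can be pictured as a single logical queue backed by per-cluster queues that refill in batches. The sketch below is a hypothetical software rendering of that idea (the class and method names are ours, not Rigel's actual task-model API, and Rigel implements this in its on-chip runtime rather than with OS mutexes):

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

using Task = int;  // stand-in for a task descriptor

// Hypothetical two-level task queue in the spirit of Rigel's
// hierarchical task queueing: the programmer sees one queue, but tasks
// are physically distributed across per-cluster queues, and a core
// only touches the shared level when its cluster's queue runs dry.
class TaskQueue {
public:
    explicit TaskQueue(size_t clusters) : local_(clusters) {}

    // Single queue from the programmer's perspective.
    void enqueue(Task t) {
        std::lock_guard<std::mutex> g(global_mu_);
        global_.push_back(t);
    }

    // Each core dequeues via its cluster; the global queue is only
    // consulted (in a batch) when the local queue is empty, which
    // keeps contention off the shared level.
    std::optional<Task> dequeue(size_t cluster) {
        Local& l = local_[cluster];
        std::lock_guard<std::mutex> g(l.mu);
        if (l.q.empty()) refill(l);
        if (l.q.empty()) return std::nullopt;
        Task t = l.q.front();
        l.q.pop_front();
        return t;
    }

private:
    struct Local {
        std::mutex mu;
        std::deque<Task> q;
    };

    static constexpr size_t kBatch = 8;  // tasks moved per refill (tunable)

    void refill(Local& l) {
        std::lock_guard<std::mutex> g(global_mu_);
        for (size_t i = 0; i < kBatch && !global_.empty(); ++i) {
            l.q.push_back(global_.front());
            global_.pop_front();
        }
    }

    std::mutex global_mu_;
    std::deque<Task> global_;
    std::vector<Local> local_;
};
```

The `kBatch` constant stands in for the task-granularity knob mentioned above: larger refill batches reduce traffic at the shared level at some cost in load balance.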
Rigel's Performance: Fairly Successful • Achieved speedup utilizing all 1024 cores • The hierarchical task structure scaled effectively to 1024 cores Issues • Cache coherence! • Memory invalidate broadcasts slow the system down • Barrier flags • Task enqueue / dequeue variables • Coherence is not done in hardware... • Instead, lazy-evaluation write-through barriers at the cluster level (a two-level barrier sketch follows below)
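The barrier problem is concrete: if all 1024 cores spin on one shared flag or counter, every update triggers invalidation traffic across the whole chip. The usual remedy, and roughly the shape of Rigel's cluster-level approach, is a two-level barrier in which cores synchronize within their cluster first and only one representative per cluster touches the global state. Below is a hedged sketch of such a two-level sense-reversal barrier using C++ atomics; Rigel itself uses lazy write-through at the cluster level rather than hardware-coherent atomics, so treat this as an illustration of the structure, not of Rigel's implementation:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical two-level barrier: intra-cluster arrival first, then one
// representative per cluster participates in the global phase. This cuts
// traffic on the shared counter from one update per core to one per cluster.
class HierarchicalBarrier {
public:
    HierarchicalBarrier(size_t clusters, size_t cores_per_cluster)
        : clusters_(clusters), per_cluster_(cores_per_cluster) {}

    void wait(size_t cluster) {
        ClusterState& c = state_[cluster];
        int my_sense = c.sense.load(std::memory_order_relaxed);

        // Local phase: the last core to arrive in the cluster proceeds
        // to the global phase on behalf of the whole cluster.
        if (c.arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == per_cluster_) {
            c.arrived.store(0, std::memory_order_relaxed);
            global_phase();
            c.sense.store(1 - my_sense, std::memory_order_release);  // release cluster
        } else {
            while (c.sense.load(std::memory_order_acquire) == my_sense) { /* spin locally */ }
        }
    }

private:
    struct ClusterState {
        std::atomic<size_t> arrived{0};
        std::atomic<int> sense{0};
    };

    void global_phase() {
        int my_sense = global_sense_.load(std::memory_order_relaxed);
        if (global_arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == clusters_) {
            global_arrived_.store(0, std::memory_order_relaxed);
            global_sense_.store(1 - my_sense, std::memory_order_release);
        } else {
            while (global_sense_.load(std::memory_order_acquire) == my_sense) { /* spin */ }
        }
    }

    size_t clusters_, per_cluster_;
    std::atomic<size_t> global_arrived_{0};
    std::atomic<int> global_sense_{0};
    ClusterState state_[64];  // fixed maximum cluster count for this sketch
};
```

This reduces updates to the shared counter from one per core to one per cluster (128 instead of 1024 for clusters of 8) and keeps most of the spinning on cluster-local state.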
Improving Rigel • Will the hierarchical task structure continue to scale? If not, where will the boundary be? (Think multiple cache levels, but with processor tasks) • How could we implement barriers or queues that avoid contention but still scale? (Is software-managed cache coherence appropriate?) • Is specialized hardware the way to go (clusters of 8 custom cores), or can it be replaced by general-purpose cores?
Generic and Custom Accelerators • It is difficult to design a programming interface between the programmer and a multi-core system that is generic enough • GPUs are limited by the SIMD programming model • Accelerator-specific hardware platforms still have issues supporting SPMD • Efficiently scaling to more cores remains an open problem How do we solve these issues?