This article investigates fine-grained CPU-GPU coupling in multi-core systems, with a focus on improving both performance and ease of programming. Topics covered include trends in hybrid PC architecture, the limitations of today's coarse-grained CPU-GPU coupling, the differences between CPU and GPU cores, and the benefits of a queue-based programming model. The aim is to make GPU cores first-class execution engines in a multi-core system and to explore the potential of fine-grained interaction between cores.
Hybrid PC architecture
Jeremy Sugerman, Kayvon Fatahalian
Trends
• Multi-core CPUs
• Generalized GPUs
  • Brook, CTM, CUDA
• Tighter CPU-GPU coupling
  • PS3
  • Xbox 360
  • AMD “Fusion” (faster bus, but GPU still treated as batch coprocessor)
CPU-GPU coupling
• Important apps (game engines) exhibit workloads suited to both CPU- and GPU-style cores
• CPU friendly: IO, AI/planning, collisions, adaptive algorithms
• GPU friendly: geometry processing, shading, physics (fluids/particles)
CPU-GPU coupling
• Current: coarse-grained interaction
  • Control: CPU launches a batch of work, waits for results before sending more commands (multi-pass)
  • Necessitates algorithmic changes
• GPU is a slave coprocessor
  • Limited mechanisms to create new work
  • CPU must deliver LARGE batches
• CPU sends GPU commands via the “driver” model
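To make the coarse-grained pattern above concrete, here is a minimal sketch of the multi-pass, driver-mediated control flow; `Batch`, `driver_submit`, `driver_wait`, and the stub bodies are hypothetical stand-ins, not a real driver API.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical driver-model types and calls (stubs for illustration only).
struct Command {};                               // a state change or draw call
struct Batch { std::vector<Command> commands; };

Batch build_large_batch(int pass) {              // CPU assembles a LARGE batch
    return Batch{std::vector<Command>(1000)};
}
void driver_submit(const Batch& b) {             // hand the batch to the driver
    std::printf("submitted %zu commands\n", b.commands.size());
}
void driver_wait() {}                            // block until the GPU drains
void read_back_and_update() {}                   // readback drives the next pass

int main() {
    // Multi-pass structure: the GPU cannot create work for itself, so the
    // CPU stalls between passes and reshapes the algorithm around batching.
    for (int pass = 0; pass < 3; ++pass) {
        driver_submit(build_large_batch(pass));
        driver_wait();
        read_back_and_update();
    }
}
```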
Fundamentally different cores
• “CPU” cores
  • Small number (tens) of HW threads
  • Software (OS) thread scheduling
  • Memory system prioritizes minimizing latency
• “GPU” cores
  • Many HW threads (>1000), hardware scheduled
  • Minimize per-thread state (state kept on-chip): shared PC, wide SIMD execution, small register file
  • No thread stack
  • Memory system prioritizes throughput
• (Not yet clear: sync, SW-managed memory, isolation, resource constraints)
GPU as a giant scheduler
[Figure: the graphics pipeline as a chain of stages connected by on-chip queues. The command buffer feeds IA → VS (1-to-1) → GS (1-to-N, bounded, via an output stream) → RS (1-to-N, unbounded) → PS (1-to-0 or X, X static) → OM; data buffers live off chip.]
GPU as a giant scheduler
[Figure: hardware view of the same pipeline. A hardware scheduler, driven by a thread scoreboard, dispatches VS/GS/PS work from on-chip command, vertex, primitive, and fragment queues (read-modify-write) to the processing cores; IA, RS, and OM are fixed-function stages, and data buffers are off chip.]
GPU as a giant scheduler
• Rasterizer (+ input command processor) is a domain-specific HW work scheduler
  • Millions of work items/frame
  • On-chip queues of work
  • Thousands of HW threads active at once
• CPU threads (via API commands), GS programs, and fixed-function logic generate work
• Pipeline describes dependencies
• What is the work here?
  • Vertices
  • Geometric primitives
  • Fragments
  • In the future: rays?
• Well-defined resource requirements for each category
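One way to read “well-defined resource requirements for each category” is as a static per-type budget the scheduler can consult when deciding how many threads of each kind fit on chip. The categories come from the slide above, but the field names and numbers below are invented purely for illustration.

```cpp
#include <cstdint>

// Hypothetical work categories from the pipeline (rays are speculative).
enum class WorkType : uint8_t { Vertex, Primitive, Fragment, Ray };

// Illustrative static resource budget per category (numbers are made up).
struct ResourceReq {
    uint16_t registers;      // per-thread register file slots
    uint16_t scratch_bytes;  // per-thread on-chip scratch
    uint8_t  max_outputs;    // fan-out bound (e.g. GS amplification limit)
};

constexpr ResourceReq kBudget[] = {
    /* Vertex    */ {16, 0,  1},  // 1-to-1
    /* Primitive */ {32, 64, 8},  // 1-to-N, N bounded
    /* Fragment  */ {8,  0,  1},  // 1-to-(0 or X)
    /* Ray       */ {24, 32, 2},  // future work: secondary rays
};

struct WorkItem {
    WorkType type;
    uint32_t payload;  // index into an off-chip data buffer
};

constexpr const ResourceReq& budget(WorkType t) {
    return kBudget[static_cast<uint8_t>(t)];
}
```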
The project
• Investigate making “GPU” cores first-class execution engines in a multi-core system
• Add:
  • Fine-grained interaction between cores
  • Processing work on any core can create new work (for any other core)
• Hypothesis: scheduling work (actions) is the key problem
  • Keeping state on-chip
• Drive architecture simulation with an interactive graphics pipeline augmented with ray tracing
Our architecture
• Multi-core processor = some “CPU”-style + some “GPU”-style cores
• Unified system address space
• “Good” interconnect between cores
• Actions (work) on any core can create new work
• Potentially…
  • Software-managed configurable L2
  • Synchronization/signaling primitives across actions
Need new scheduler
• GPU HW scheduler leverages highly domain-specific information
  • Knows dependencies
  • Knows resources used by threads
• Need to move to a more general-purpose HW/SW scheduler, yet still do okay
• Questions
  • What scheduling algorithms?
  • What information is needed to make decisions?
Programming model = queues
• Model the system as a collection of work queues
• Create work = enqueue
• SW-driven dispatch of “CPU” core work
• HW-driven dispatch of “GPU” core work
• Application code does not dequeue
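A minimal sketch of this model, assuming a single software scheduler standing in for both the SW and HW dispatchers: application code only calls enqueue, kernels bound to queues may create work for any queue, and only the scheduler dequeues. All names (`Scheduler`, `Item`, `Kernel`) are hypothetical.

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>

// The system is a collection of named work queues; creating work == enqueue.
struct Item { int payload; };

class Scheduler {
public:
    using Kernel = std::function<void(Item, Scheduler&)>;

    // Bind a kernel (and, implicitly, an environment) to a queue up front.
    void add_queue(const std::string& name, Kernel k) {
        queues_[name].kernel = std::move(k);
    }
    // The only operation application code uses: create work on some queue.
    // The queue must already exist (at() throws otherwise).
    void enqueue(const std::string& name, Item it) {
        queues_.at(name).pending.push_back(it);
    }
    // Stand-in for the HW/SW dispatcher: drain queues, letting kernels
    // enqueue new work (for any queue) as they run.
    void run() {
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& [name, q] : queues_) {
                while (!q.pending.empty()) {
                    Item it = q.pending.front();
                    q.pending.pop_front();
                    q.kernel(it, *this);  // kernel may call enqueue()
                    progress = true;
                }
            }
        }
    }

private:
    struct Queue { Kernel kernel; std::deque<Item> pending; };
    std::unordered_map<std::string, Queue> queues_;
};
```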
Benefits of queues
• Describe classes of work
• Associate queues with environments
  • GPU (no gather)
  • GPU + gather
  • GPU + create work (bounded)
  • CPU
  • CPU + SW-managed L2
• Opportunity to coalesce/reorder work
  • Fine-grained creation, bulk execution
• Describe dependencies
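The environment classes above could be captured as a per-queue descriptor, so the scheduler knows where a queue's work may run and whether items may be coalesced; the sketch below is just a transcription of the slide's list into a hypothetical type.

```cpp
// Execution environments a queue can be bound to (from the list above).
enum class Environment {
    Gpu,               // GPU, no gather
    GpuGather,         // GPU + gather
    GpuCreateBounded,  // GPU + create work (bounded amplification)
    Cpu,               // CPU
    CpuManagedL2,      // CPU + SW-managed L2
};

// Hypothetical per-queue descriptor: binding a kernel and an environment to
// the queue lets the scheduler coalesce items and batch state changes.
struct QueueDesc {
    Environment env;
    const char* kernel_name;  // kernel/resources associated with the queue
    bool        coalesce;     // permit reordering/batching of items
};
```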
Decisions
• Granularity of work
  • Enqueue elements or batches?
  • “Coherence” of work (batching state changes)
• Associate kernels/resources with queues (part of the environment)?
• Constraints on enqueue
  • Fail gracefully in case of work explosion
• Scheduling policy
  • Minimize state (size of queues)
  • How to understand dependencies
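One possible answer to the enqueue-constraint question, sketched under the assumption that queue capacity models finite on-chip storage: enqueue reports failure instead of silently spilling, so the scheduler can apply backpressure or fall back to coarser batching.

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical bounded queue: capacity models limited on-chip storage.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Producers must handle failure rather than assume unbounded growth.
    [[nodiscard]] bool try_enqueue(T item) {
        if (items_.size() >= capacity_) return false;  // fail gracefully
        items_.push_back(std::move(item));
        return true;
    }

    // Scheduler-only in the real model; application code never dequeues.
    std::optional<T> dequeue() {
        if (items_.empty()) return std::nullopt;
        T front = std::move(items_.front());
        items_.pop_front();
        return front;
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
};
```

A producer whose try_enqueue fails could be suspended until a consumer drains the queue, which is one way to keep queue state on chip.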
First steps
• Coarse architecture simulation
  • Hello world = run CPU + GPU threads, GPU threads create other threads (see the sketch below)
  • Identify GPU ISA additions
• Establish what information the scheduler needs
  • What are the “environments”?
• Eventually drive the simulation with a hybrid renderer
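Building on the hypothetical `Scheduler` sketch from the programming-model section, the “hello world” milestone might look like the following: a “CPU” kernel seeds a “GPU” queue, and each GPU work item creates further GPU work items; the depth bound is arbitrary.

```cpp
int main() {
    Scheduler s;  // from the programming-model sketch above

    // "GPU" kernel: each item spawns two children until a depth bound,
    // i.e. GPU threads creating other GPU threads.
    s.add_queue("gpu", [](Item it, Scheduler& sch) {
        std::printf("gpu item at depth %d\n", it.payload);
        if (it.payload < 4) {
            sch.enqueue("gpu", Item{it.payload + 1});
            sch.enqueue("gpu", Item{it.payload + 1});
        }
    });

    // "CPU" kernel: seeds the GPU queue instead of driving it via a driver.
    s.add_queue("cpu", [](Item, Scheduler& sch) {
        sch.enqueue("gpu", Item{0});
    });

    s.enqueue("cpu", Item{0});
    s.run();
}
```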
Evaluation
• Compare against architectural alternatives
  • Multi-pass rendering (very coarse-grained) with a domain-specific scheduler
  • Paper: “GPU” microarchitecture comparison with our design
    • Scheduling resources
    • On-chip state / performance tradeoff
    • On-chip bandwidth
  • Many-core homogeneous CPU
Summary
• Hypothesis: elevating “GPU” cores to first-class execution engines is a better way to build a hybrid system
  • Apps with dynamic/irregular components
  • Performance
  • Ease of programming
• Allow all cores to generate new work by adding to system queues
• Scheduling work in these queues is the key issue (goal: keep queues on chip)
Three fronts
• GPU micro-architecture
  • GPU work creating GPU work
  • Generalization of the DirectX 10 GS
• CPU-GPU integration
  • GPU cores as first-class execution environments (dump the driver model)
  • Unified view of work throughout the machine
  • Any core creates work for other cores
• GPU resource management
  • Ability to correctly manage/virtualize GPU resources
  • Window manager