Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford University
September 12, 2008
Background, Outline • Stanford Graphics / Architecture Research • Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan • To appear in ACM Transactions on Graphics • CPU, GPU trends… and collision? • Two research areas: • HW/SW Interface, Programming Model • Future Graphics API
Problem Statement • Drive efficient development and execution on many-/multi-core systems. • Support homogeneous, heterogeneous cores. • Inform future hardware • Status Quo: • GPU Pipeline (Good for GL, otherwise hard) • CPU (No guidance, fast is hard)
GRAMPS • Software defined graphs • Producer-consumer, data-parallelism • Initial focus on rendering • [Figure: two example stage graphs. Rasterization Pipeline: Rasterize → Fragment Queue → Shade → Output Fragment Queue → FB Blend → Frame Buffer. Ray Tracing Graph: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer. Legend: thread stages, shader stages, fixed-function stages, queues, stage outputs.]
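GRAMPS itself is not shown in the slides; as a rough illustration of the producer-consumer graph idea, here is a minimal Python sketch (all names hypothetical, stages run sequentially rather than in parallel) of the rasterization pipeline's structure:

```python
from collections import deque

class Stage:
    """A pipeline stage: consumes items from an input queue and
    produces results into an output queue (None for a sink)."""
    def __init__(self, name, work_fn, in_q=None, out_q=None):
        self.name, self.work_fn = name, work_fn
        self.in_q, self.out_q = in_q, out_q

    def run(self):
        while self.in_q:                       # drain the input queue
            result = self.work_fn(self.in_q.popleft())
            if self.out_q is not None:
                self.out_q.append(result)

# Queues connecting the stages (cf. the Fragment Queue in the graph)
fragment_q = deque()
shaded_q = deque()
frame_buffer = []

rasterize = Stage("Rasterize", lambda prim: {"frag": prim},
                  in_q=deque(["tri0", "tri1"]), out_q=fragment_q)
shade = Stage("Shade", lambda f: {**f, "color": "red"},
              in_q=fragment_q, out_q=shaded_q)
blend = Stage("FB Blend", lambda f: frame_buffer.append(f["color"]),
              in_q=shaded_q)

for s in (rasterize, shade, blend):
    s.run()
print(frame_buffer)  # → ['red', 'red']
```

In real GRAMPS the stages execute concurrently and the scheduler decides when each runs; this toy version only shows how queues decouple producers from consumers.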
As a Graphics Evolution • Not (too) radical for ‘graphics’ • Like fixed → programmable shading • Pipeline undergoing massive shake up • Diversity of new parameters and use cases • Bigger picture than ‘graphics’ • Rendering is more than GL/D3D • Compute is more than rendering • Some ‘GPUs’ are losing their innate pipeline
As a Compute Evolution (1) • Sounds like streaming: Execution graphs, kernels, data-parallelism • Streaming: “squeeze out every FLOP” • Goals: bulk transfer, arithmetic intensity • Intensive static analysis, custom chips (mostly) • Bounded space, data access, execution time
As a Compute Evolution (2) • GRAMPS: “interesting apps are irregular” • Goals: • Dynamic, data-dependent code • Aggregate work at run-time • Heterogeneous commodity platforms • Naturally allows streaming when applicable
GRAMPS’ Role • A ‘graphics pipeline’ is now an app! • GRAMPS models parallel state machines. • Compared to status quo: • More flexible than a GPU pipeline • More guidance than bare metal • Portability in between • Not domain specific
GRAMPS Interfaces • Host/Setup: Create execution graph • Thread: Stateful, singleton • Shader: Data-parallel, auto-instanced
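The three interfaces can be illustrated with a hypothetical mini-API (the real GRAMPS interface is not reproduced in these slides; all names below are invented for illustration). A Thread stage is a stateful singleton; a Shader stage is a pure function that the runtime may auto-instance per input element:

```python
class ThreadStage:
    """Thread interface: stateful, singleton. One long-lived
    instance processes its inputs in order and keeps state."""
    def __init__(self):
        self.count = 0                      # persists across inputs

    def process(self, items):
        out = []
        for x in items:
            self.count += 1
            out.append((self.count, x))     # e.g. assign sequence numbers
        return out

def shader_stage(item):
    """Shader interface: data-parallel, stateless. Safe for the
    runtime to auto-instance one copy per queue element."""
    return item * 2

# "Host/Setup" role: wire the graph, then execute it
tiler = ThreadStage()
numbered = tiler.process([10, 20, 30])
shaded = [shader_stage(v) for _, v in numbered]  # conceptually parallel
```

The key contrast: the Thread stage's counter would be a data race if it were instanced, while the Shader function has no state to race on, which is exactly what lets the runtime replicate it freely.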
GRAMPS Entities (1) • Queues: Connect stages; accessed via windows • Dynamically sized • Ordered or unordered • Fixed max capacity or spill to memory • Buffers: Random access, pre-allocated • RO, RW Private, RW Shared (Not Supported)
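The "fixed max capacity or spill to memory" choice can be sketched as follows; this is a simplified model (the hypothetical `reserve_push` collapses GRAMPS's reserve-then-commit window protocol into one call):

```python
from collections import deque

class GrampsQueue:
    """Sketch of a GRAMPS queue: a fixed-capacity in-core region,
    with overflow spilling to a (simulated) memory backing store."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.core = deque()      # fast, bounded storage
        self.spill = deque()     # unbounded backing memory

    def reserve_push(self, item):
        if len(self.core) < self.capacity:
            self.core.append(item)
        else:
            self.spill.append(item)   # capacity exceeded: spill

    def pop(self):
        item = self.core.popleft()
        if self.spill:                # refill core from spilled items
            self.core.append(self.spill.popleft())
        return item

q = GrampsQueue(capacity=2)
for r in ["ray0", "ray1", "ray2"]:
    q.reserve_push(r)
# "ray2" did not fit in core and landed in the spill area
```

A bounded-only queue instead applies back-pressure (the producer blocks), which is the other design point the slide names.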
GRAMPS Entities (2) • Queue Sets: Independent sub-queues • Instanced parallelism plus mutual exclusion • Hard to fake with just multiple queues
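A queue set is one logical queue made of keyed sub-queues: consumers can run in parallel across keys while each key's items are handled by at most one consumer at a time. A minimal sketch (keys standing in for, say, screen tiles; the scheduler's exclusive-grant machinery is omitted):

```python
from collections import defaultdict, deque

class QueueSet:
    """Sketch of a GRAMPS queue set: independent sub-queues under
    one logical queue. Work for different keys can be consumed in
    parallel; work for one key is mutually exclusive by construction."""
    def __init__(self):
        self.subqueues = defaultdict(deque)

    def push(self, key, item):
        self.subqueues[key].append(item)

    def drain(self, key):
        # In GRAMPS the scheduler grants a consumer instance exclusive
        # access to one sub-queue; here we just drain it sequentially.
        q, out = self.subqueues[key], []
        while q:
            out.append(q.popleft())
        return out

qs = QueueSet()
for tile, frag in [(0, "a"), (1, "b"), (0, "c")]:
    qs.push(tile, frag)
```

Faking this with N plain queues forces the graph to hard-code N and loses the single-logical-queue view, which is the "hard to fake" point on the slide.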
GRAMPS Scheduler • Tiered Scheduler • ‘Fat’ cores: per-thread, per-core • ‘Micro’ cores: shared hw scheduler • Top level: tier N
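The tiers cooperate through shared queue state. One simple policy a tier can apply (a sketch of the general idea, not the exact algorithm from the paper) is to prefer runnable stages deeper in the graph, draining data toward the frame buffer and keeping queues small:

```python
def pick_stage(stages):
    """stages: list of (depth_in_graph, has_input_ready, name).
    Prefer the deepest stage with work available, so in-flight data
    is retired before new work is admitted upstream. Returns None
    if nothing is runnable."""
    runnable = [s for s in stages if s[1]]
    if not runnable:
        return None
    return max(runnable, key=lambda s: s[0])[2]

# Example: FB Blend (deepest) has no input yet, so Shade is chosen
# over Rasterize even though both are runnable.
stages = [(0, True, "Rasterize"), (1, True, "Shade"), (2, False, "FB Blend")]
```

This deepest-first bias is one plausible reading of how the "Initial Results" slide's small queues come about; the real scheduler also weighs per-tier costs and priorities.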
What We’ve Built (Apps) • Direct3D Pipeline (with Ray-tracing Extension): Input Vertex Queues 1…N → IA 1…N → VS 1…N → Primitive Queues 1…N → Rast → Fragment Queue → PS → Sample Queue Set → OM → Frame Buffer, with Vertex Buffers and RO as buffers. Ray-tracing Extension: PS → Ray Queue → Trace → Ray Hit Queue → PS2. • Ray-tracing Graph: Tiler → Tile Queue → Sampler → Sample Queue → Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer. • [Figure legend: thread stages, shader stages, fixed-function stages, queues, stage outputs, push outputs.]
Initial Results • Queues are small, utilization is good
GRAMPS Portability • Portability really means performance. • Less portable than GL/D3D • GRAMPS graph is (more) hardware sensitive • More portable than bare metal • Enforces modularity • Best case, just works • Worst case, saves boilerplate
High-level Challenges • Is GRAMPS a suitable GPU evolution? • Enable pipeline competitive with bare metal? • Enable innovation: advanced / alternative methods? • Is GRAMPS a good parallel compute model? • Map well to hardware, hardware trends? • Support important apps? • Concepts influence developers?
What’s Next: Implementation • Better scheduling • Less bursty, better slot filling • Dynamic priorities • Handle graphs with loops better • More detailed costs • Bill for scheduling decisions • Bill for (internal) synchronization • More statistics
What’s Next: Programming Model • Yes: Graph modification (state change) • Probably: Data sharing / ref-counting • Maybe: Blocking inter-stage calls (join) • Maybe: Intra-/inter-stage synchronization primitives
What’s Next: Possible Workloads • REYES, hybrid graphics pipelines • Image / video processing • Game physics • Collision detection or particles • Physics and scientific simulation • AI, finance, sort, search or database query, … • Heavy dynamic data manipulation • k-D tree / octree / BVH build • Lazy/adaptive/procedural tree or geometry