Explore high-level programming models for many-core systems, focusing on constructs for parallelism expression, scheduling support, and optimizing coherence dynamically. Designed for systems-savvy developers to efficiently map programs onto hardware.
Programming Many-Core Systems with GRAMPS Jeremy Sugerman 14 May 2010
The single fast core era is over • Trends: • Changing metrics: ‘scale out’, not just ‘scale up’ • Increasing diversity: many different mixes of ‘cores’ • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core • Problem: How does one program all this complexity?!
High-level programming models • Two major advantages over threads & locks • Constructs to express/expose parallelism • Scheduling support to help manage concurrency, communication, and synchronization • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
My biases: workloads • Interesting applications have irregularity • Large bundles of coherent work are efficient • Producer-consumer idiom is important • Goal: Rebuild coherence dynamically by aggregating related work as it is generated.
My target audience • Highly informed, but (good) lazy • Understands the hardware and best practices • Dislikes rote work; prefers power over constraints • Goal: Let systems-savvy developers efficiently develop programs that map efficiently onto their hardware.
Contributions: Design of GRAMPS • Programs are graphs of stages and queues • Queues: maximum capacities, packet sizes • Stages: no, limited, or total automatic parallelism; fixed, variable, or reduction (in-place) outputs [Figure: Simple Graphics Pipeline]
Contributions: Implementation • Broad application scope: • Rendering, MapReduce, image processing, … • Multi-platform applicability: • GRAMPS runtimes for three architectures • Performance: • Scale-out parallelism, controlled data footprint • Compares well to schedulers from other models • (Also: Tunable)
Outline • GRAMPS overview • Study 1: Future graphics architectures • Study 2: Current multi-core CPUs • Comparison with schedulers from other parallel programming models
GRAMPS • Programs are graphs of stages and queues • Expose the program structure • Leave the program internals unconstrained
Writing a GRAMPS program • Design the application graph and queues • Design the stages • Instantiate and launch [Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html]
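The three steps above (design the graph and queues, design the stages, instantiate) can be sketched in a few lines of Python. This is only an illustration of the "graph of stages and queues" idea: the class names, the `add_queue` / `add_stage` methods, and the stage and queue names are all hypothetical and are not the GRAMPS API.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    capacity_packets: int  # maximum number of packets the queue may hold
    packet_bytes: int      # size of one packet


@dataclass
class Stage:
    name: str
    kind: str              # "thread", "shader", or "fixed"
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class Graph:
    """Hypothetical container: a program is a graph of stages and queues."""
    stages: dict = field(default_factory=dict)
    queues: dict = field(default_factory=dict)

    def add_queue(self, name, capacity_packets, packet_bytes):
        self.queues[name] = Queue(name, capacity_packets, packet_bytes)

    def add_stage(self, name, kind, inputs=(), outputs=()):
        # Stages may only refer to queues that were declared first.
        for q in (*inputs, *outputs):
            if q not in self.queues:
                raise KeyError(f"unknown queue: {q}")
        self.stages[name] = Stage(name, kind, list(inputs), list(outputs))

    def worst_case_footprint(self):
        # Every queue full at once: sum of capacity * packet size.
        return sum(q.capacity_packets * q.packet_bytes
                   for q in self.queues.values())


def build_example():
    """A made-up three-stage producer/consumer graph."""
    g = Graph()
    g.add_queue("raw", capacity_packets=4, packet_bytes=1024)
    g.add_queue("processed", capacity_packets=8, packet_bytes=256)
    g.add_stage("produce", "thread", outputs=["raw"])
    g.add_stage("transform", "shader", inputs=["raw"], outputs=["processed"])
    g.add_stage("consume", "thread", inputs=["processed"])
    return g
```

Because every queue declares a capacity and a packet size up front, a worst-case footprint bound can be read straight off the graph before anything runs.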
Queues • Bounded size, operate at “packet” granularity • “Opaque” and “Collection” packets • GRAMPS can optionally preserve ordering • Required for some workloads, adds overhead
Thread (and Fixed) stages • Preemptible, long-lived, stateful • Often merge, compare, or repack inputs • Queue operations: Reserve/Commit • (Fixed: Thread stages in custom hardware)
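One way to picture bounded queues with Reserve/Commit is the single-threaded sketch below. It assumes that reserved-but-uncommitted packets count against the queue's capacity; that accounting rule, and every name in the code, is an assumption for illustration rather than the behavior of the GRAMPS runtime.

```python
from collections import deque


class PacketQueue:
    """Hypothetical bounded queue that operates at packet granularity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ready = deque()  # committed packets, visible to consumers
        self.in_flight = 0    # packets reserved but not yet committed/released

    def reserve_output(self):
        """Claim space for one packet; None means the queue is full."""
        if len(self.ready) + self.in_flight >= self.capacity:
            return None
        self.in_flight += 1
        return []             # an empty packet for the producer to fill

    def commit_output(self, packet):
        """Publish a filled packet to consumers."""
        self.in_flight -= 1
        self.ready.append(packet)

    def reserve_input(self):
        """Take the oldest committed packet; None means nothing is ready."""
        if not self.ready:
            return None
        self.in_flight += 1   # space stays occupied while the consumer reads
        return self.ready.popleft()

    def commit_input(self):
        """Release the space held by a packet taken with reserve_input."""
        self.in_flight -= 1
```

A `None` from either reserve call is the natural point for a scheduler to switch to another stage, which matches the idea that Thread stages are preemptible at queue operations.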
Shader stages: • Automatically parallelized: • Horde of non-preemptible, stateless instances • Pre-reserve/post-commit • Push: Variable/conditional output support • GRAMPS coalesces elements into full packets
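The push-with-coalescing behavior can be sketched as follows: stateless instances emit zero or one element each, and a collector groups those elements into full packets. The `Coalescer` class, the `shade` function, and the packet size are hypothetical names chosen for this example.

```python
class Coalescer:
    """Hypothetical collector that turns pushed elements into full packets."""

    def __init__(self, packet_size):
        self.packet_size = packet_size
        self.partial = []        # elements waiting for a packet to fill
        self.full_packets = []   # packets ready for the downstream stage

    def push(self, element):
        self.partial.append(element)
        if len(self.partial) == self.packet_size:
            self.full_packets.append(self.partial)
            self.partial = []

    def flush(self):
        # At end of input, emit whatever is left as a short packet.
        if self.partial:
            self.full_packets.append(self.partial)
            self.partial = []


def shade(x, out):
    """Stateless instance with conditional output: only even inputs emit."""
    if x % 2 == 0:
        out.push(x * x)
```

For example, running `shade` over `range(10)` with a packet size of 4 yields one full packet of four squares, and `flush()` then emits the single leftover element as a short packet.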
Queue sets: Mutual exclusion • Independent exclusive (serial) subqueues • Created statically or on first output • Densely or sparsely indexed • Bonus: Automatically instanced Thread stages [Figures: Cookie Dough Pipeline; Cookie Dough (with queue set)]
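A rough sketch of a queue set, assuming sparse indexing by key, subqueues created on first output, and at most one consumer per subqueue at a time. The `QueueSet` class and its `push` / `acquire` / `release` methods are invented for illustration and run single-threaded here.

```python
from collections import deque


class QueueSet:
    """Hypothetical queue set: independent subqueues, each consumed serially."""

    def __init__(self):
        self.subqueues = {}  # sparse index: key -> deque of packets
        self.held = set()    # keys whose subqueue currently has a consumer

    def push(self, key, packet):
        # The subqueue is created on first output to a new key.
        self.subqueues.setdefault(key, deque()).append(packet)

    def acquire(self):
        """Return (key, packet) from a non-empty subqueue nobody holds."""
        for key, q in self.subqueues.items():
            if q and key not in self.held:
                self.held.add(key)
                return key, q.popleft()
        return None

    def release(self, key):
        """Consumer is done with this subqueue; others may now take it."""
        self.held.discard(key)
```

Different keys can be processed in parallel, while packets that share a key are processed one at a time, which is what provides mutual exclusion without explicit locks in the stage code.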
A few other tidbits • In-place Shader stages / coalescing inputs • Instanced Thread stages • Queues as barriers / read all-at-once
Formative influences • The Graphics Pipeline, early GPGPU • “Streaming” • Work-queues and task-queues
Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)
Graphics is a natural first domain • Table stakes for commodity parallelism • GPUs are full of heterogeneity • Poised to transition from fixed/configurable pipeline to programmable • We have a lot of experience in it
The Graphics Pipeline in GRAMPS • Graph, setup are (application) software • Can be customized or completely replaced • Like the transition to programmable shading • Not (unthinkably) radical • Fits current hw: FIFOs, cores, rasterizer, …
Reminder: Design goals • Broad application scope • Multi-platform applicability • Performance: scale-out, footprint-aware
The Experiment • Three renderers: • Rasterization, Ray Tracer, Hybrid • Two simulated future architectures • Simple scheduler for each
Scope: Two(-plus) renderers [Figure labels: Rasterization Pipeline (with ray tracing extension); Ray Tracing Extension; Ray Tracing Graph]
Platforms: Two simulated systems • CPU-Like: 8 Fat Cores, Rast • GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched
Performance— Metrics “Maximize machine utilization while keeping working sets small” • Priority #1: Scale-out parallelism • Parallel utilization • Priority #2: ‘Reasonable’ bandwidth / storage • Worst case total footprint of all queues • Inherently a trade-off versus utilization
Performance– Scheduling • Simple prototype scheduler (both platforms): • Static stage priorities • Only preempt on Reserve and Commit • No dynamic weighting of current queue sizes [Figure: stages labeled from (Lowest) to (Highest) priority]
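A minimal sketch of static-priority scheduling on a three-stage pipeline. It assumes that stages further downstream get higher priority, so queues are drained before more work is produced; the slide does not state which direction the priorities run, so treat that ordering, and all names here, as assumptions. Scheduling decisions happen only between stage steps, loosely mirroring "only preempt on Reserve and Commit".

```python
from collections import deque


def pick_stage(stages):
    """Return the runnable stage with the highest static priority, or None."""
    runnable = [s for s in stages if s["can_run"]()]
    return max(runnable, key=lambda s: s["priority"], default=None)


def run_pipeline(n_items, capacity):
    """Hypothetical source -> middle -> sink pipeline with bounded queues."""
    q1, q2 = deque(), deque()
    state = {"produced": 0, "consumed": [], "peak": 0}

    def source():
        q1.append(state["produced"])
        state["produced"] += 1

    def middle():
        q2.append(q1.popleft() * 2)

    def sink():
        state["consumed"].append(q2.popleft())

    stages = [
        {"priority": 0, "step": source,
         "can_run": lambda: state["produced"] < n_items and len(q1) < capacity},
        {"priority": 1, "step": middle,
         "can_run": lambda: len(q1) > 0 and len(q2) < capacity},
        {"priority": 2, "step": sink,
         "can_run": lambda: len(q2) > 0},
    ]

    # Re-evaluate priorities after every step (requires Python 3.8+).
    while (stage := pick_stage(stages)) is not None:
        stage["step"]()
        state["peak"] = max(state["peak"], len(q1) + len(q2))
    return state["consumed"], state["peak"]
```

With this priority order each item flows through the whole pipeline before the next one is produced, so the queues stay nearly empty. In a real parallel system that would starve the machine of work, which is the utilization-versus-footprint trade-off the metrics slide describes.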
Performance— Results • Utilization: 95+% for all but rasterized fairy (~80%). • Footprint: < 600KB CPU-like, < 1.5MB GPU-like • Surprised how well the simple scheduler worked • Maintaining order costs footprint
Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
Reminder: Design goals • Broad application scope • Multi-platform applicability • Performance: scale-out, footprint-aware
The Experiment • 9 applications, 13 configurations • One (more) architecture: multi-core x86 • It’s real (no simulation here) • Built with pthreads, locks, and atomics • Per-pthread task-priority-queues with work-stealing • More advanced scheduling
Scope: Application bonanza • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though) • MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA • Cilk(-like): Mergesort • CUDA: Gaussian, SRAD • StreamIt: FM, TDE
Scope: Many different idioms [Figure panels: Ray Tracer, FM, MapReduce, SRAD, Merge Sort]
Platform: 2x quad-core Nehalem • Queues: copy in/out, global (shared) buffer • Threads: user-level scheduled contexts • Shaders: create one task per input packet [Figure: Native: 8 HyperThreaded Core i7’s]
Performance— Metrics (Reminder) “Maximize machine utilization while keeping working sets small” • Priority #1: Scale-out parallelism • Priority #2: ‘Reasonable’ bandwidth / storage
Performance– Scheduling • Static per-stage priorities (still) • Work-stealing task-priority-queues • Eagerly create one task per packet (naïve) • Keep running stages until a low watermark • (Limited dynamic weighting of queue depths)
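The "work-stealing task-priority-queues" idea can be sketched as per-worker priority queues where a worker steals only when its own queue is empty. This single-threaded sketch omits locking, and its steal policy (scan the other workers in round-robin order and take the victim's best task) is an assumption, not the policy of the GRAMPS runtime. All names are hypothetical.

```python
import heapq


class Worker:
    """Hypothetical per-thread task queue ordered by stage priority."""

    def __init__(self):
        self.heap = []  # entries: (-priority, sequence number, task)
        self.seq = 0    # tie-breaker so tasks themselves are never compared

    def push(self, priority, task):
        heapq.heappush(self.heap, (-priority, self.seq, task))
        self.seq += 1

    def pop(self):
        """Highest-priority local task, or None if the queue is empty."""
        return heapq.heappop(self.heap)[2] if self.heap else None


def next_task(workers, me):
    """Prefer local work; otherwise steal from another worker."""
    task = workers[me].pop()
    if task is not None:
        return task
    for offset in range(1, len(workers)):
        victim = workers[(me + offset) % len(workers)]
        task = victim.pop()
        if task is not None:
            return task
    return None  # no work anywhere
```

Keeping tasks local by default preserves locality, and stealing keeps idle workers busy when load is uneven.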
Performance– Good Scale-out • (Footprint: Good; detail a little later) [Figure: Parallel Speedup versus Hardware Threads]
Performance– Low Overheads • ‘App’ and ‘Queue’ time are both useful work. [Figure: Execution Time Breakdown (8 cores / 16 hyperthreads); axis: Percentage of Execution]
Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
Three archetypes • Task-Stealing: (Cilk, TBB) • Low overhead with fine granularity tasks • No producer-consumer, priorities, or data-parallel • Breadth-First: (CUDA, OpenCL) • Simple scheduler (one stage at a time) • No producer-consumer, no pipeline parallelism • Static: (StreamIt / Streaming) • No runtime scheduler; complex schedules • Cannot adapt to irregular workloads
The Experiment • Re-use the exact same application code • Modify the scheduler per archetype: • Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks • Breadth-First: Unbounded queues, stage at a time, top-to-bottom • Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
Seeing is believing (ray tracer) [Figure panels: GRAMPS, Breadth-First, Task-Stealing, Static (SAS)]
Comparison: Execution time • Mostly similar: good parallelism, load balance • Breadth-first can exhibit load imbalance • Task-stealing can ping-pong, cause contention [Figure: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First); axis: Percentage of Time]
Comparison: Footprint • Breadth-First is pathological (as expected) [Figure: Relative Packet Footprint (Log-Scale); size versus GRAMPS]
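A toy model suggests why stage-at-a-time execution with unbounded queues has a large footprint: every intermediate result exists at once before the next stage starts. The sketch compares it with a bounded producer-consumer loop. The functions and the numbers are illustrative assumptions, not the measured results from the study.

```python
def breadth_first_peak(n_items):
    """Run each stage to completion before the next; return peak queued items."""
    q1 = list(range(n_items))   # stage 1 finishes: all of its outputs are queued
    peak = len(q1)
    q2 = [x * 2 for x in q1]    # stage 2 finishes: again everything is queued
    peak = max(peak, len(q2))
    return peak


def bounded_pipeline_peak(n_items, capacity):
    """Interleave producer and consumer; the producer stalls on a full queue."""
    queue, produced, consumed, peak = [], 0, 0, 0
    while consumed < n_items:
        if produced < n_items and len(queue) < capacity:
            queue.append(produced)   # producer step
            produced += 1
        else:
            queue.pop(0)             # consumer step drains one item
            consumed += 1
        peak = max(peak, len(queue))
    return peak
```

In this model the breadth-first peak grows linearly with the input size, while the bounded pipeline never exceeds the queue capacity regardless of input size.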
Footprint: GRAMPS & Task-Stealing • GRAMPS gets insight from the graph: • (Application-specified) queue bounds • Group tasks by stage for priority, preemption [Figures: Relative Packet Footprint and Relative Task Footprint; panels: MapReduce, Ray Tracer]
Static scheduling is challenging • Generating good Static schedules is *hard*. • Static schedules are fragile: • Small mismatches compound • Hardware itself is dynamic (cache traffic, IRQs, …) • Limited upside: dynamic scheduling is cheap! [Figure panels: Packet Footprint; Execution Time]