Programming Many-Core Systems with GRAMPS

Presentation Transcript


  1. Programming Many-Core Systems with GRAMPS Jeremy Sugerman 14 May 2010

  2. The single fast core era is over • Trends: changing metrics (‘scale out’, not just ‘scale up’); increasing diversity (many different mixes of ‘cores’) • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core • Problem: How does one program all this complexity?!

  3. High-level programming models • Two major advantages over threads & locks • Constructs to express/expose parallelism • Scheduling support to help manage concurrency, communication, and synchronization • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

  4. My biases: workloads • Interesting applications have irregularity • Large bundles of coherent work are efficient • The producer-consumer idiom is important • Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

  5. My target audience • Highly informed, but (good) lazy • Understands the hardware and best practices • Dislikes rote work; prefers power over constraints • Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

  6. Contributions: Design of GRAMPS • Programs are graphs of stages and queues • Queues: maximum capacities, packet sizes • Stages: no, limited, or total automatic parallelism; fixed, variable, or reduction (in-place) outputs (Figure: Simple Graphics Pipeline)

  7. Contributions: Implementation • Broad application scope: • Rendering, MapReduce, image processing, … • Multi-platform applicability: • GRAMPS runtimes for three architectures • Performance: • Scale-out parallelism, controlled data footprint • Compares well to schedulers from other models • (Also: Tunable)

  8. Outline • GRAMPS overview • Study 1: Future graphics architectures • Study 2: Current multi-core CPUs • Comparison with schedulers from other parallel programming models

  9. GRAMPS Overview

  10. GRAMPS • Programs are graphs of stages and queues • Expose the program structure • Leave the program internals unconstrained

  11. Writing a GRAMPS program • Design the application graph and queues • Design the stages • Instantiate and launch (Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
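
The transcript never shows the GRAMPS API itself, so the following is a minimal, self-contained C++ sketch of the three steps above. Every name here (QueueDesc, StageDesc, GraphDesc, addQueue, addStage) is invented for illustration; it only mimics the idea of describing a program as a graph of stages and queues and then launching it.

```cpp
// Illustrative sketch only -- these types are invented, not the real GRAMPS API.
#include <cstdio>
#include <string>
#include <vector>

struct QueueDesc {                     // a bounded queue between two stages
    std::string name;
    int packetSize;                    // elements per packet
    int maxPackets;                    // capacity bound
};

struct StageDesc {                     // a stage: Thread (stateful) or Shader (data-parallel)
    std::string name;
    enum Kind { Thread, Shader } kind;
    std::vector<int> inputs, outputs;  // indices into the graph's queue list
};

struct GraphDesc {
    std::vector<QueueDesc> queues;
    std::vector<StageDesc> stages;
    int addQueue(QueueDesc q) { queues.push_back(q); return (int)queues.size() - 1; }
    void addStage(StageDesc s) { stages.push_back(s); }
};

int main() {
    // 1. Design the application graph and queues (cookie-dough-style pipeline).
    GraphDesc g;
    int doughQ  = g.addQueue({"dough",   /*packetSize=*/32, /*maxPackets=*/8});
    int cookieQ = g.addQueue({"cookies", 32, 8});
    g.addStage({"Mixer", StageDesc::Thread, {},        {doughQ}});
    g.addStage({"Scoop", StageDesc::Shader, {doughQ},  {cookieQ}});
    g.addStage({"Oven",  StageDesc::Thread, {cookieQ}, {}});

    // 2. Stage bodies are written separately (slides 13-14 sketch the two kinds).
    // 3. Instantiate and launch: a real runtime would start scheduling here;
    //    this sketch just prints the graph.
    for (const StageDesc& s : g.stages)
        std::printf("stage %s (%s)\n", s.name.c_str(),
                    s.kind == StageDesc::Thread ? "Thread" : "Shader");
    return 0;
}
```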

  12. Queues • Bounded size, operate at “packet” granularity • “Opaque” and “Collection” packets • GRAMPS can optionally preserve ordering • Required for some workloads, adds overhead
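
No queue declaration appears in the deck, so here is a tiny hedged sketch of the knobs this slide lists, again with invented names: a packet size, a capacity bound, opaque vs. collection packets, and an optional ordering guarantee.

```cpp
// Invented illustration of per-queue configuration; not the real GRAMPS interface.
#include <cstddef>

enum class PacketKind {
    Opaque,      // fixed-format packets whose layout only the stages understand
    Collection   // header plus a variable number of elements
};

struct QueueConfig {
    std::size_t packetBytes   = 4096;  // queues operate at "packet" granularity
    std::size_t maxPackets    = 16;    // bounded size: producers stall when full
    PacketKind  kind          = PacketKind::Collection;
    bool        preserveOrder = false; // optional: needed by some workloads, adds overhead
};
```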

  13. Thread (and Fixed) stages • Preemptible, long-lived, stateful • Often merge, compare, or repack inputs • Queue operations: Reserve/Commit • (Fixed: Thread stages in custom hardware)
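
Reserve/Commit is only named on the slide; the sketch below shows the idiom a Thread stage would follow, using a toy deque-backed queue of int "packets" as a stand-in for the real runtime (ToyQueue and threadStage are assumptions for illustration, not GRAMPS code).

```cpp
// Sketch of the Thread-stage Reserve/Commit idiom over a toy queue.
#include <cstdio>
#include <deque>
#include <optional>

struct ToyQueue {
    std::deque<int> packets;
    std::optional<int> reserve() {                // claim the next input packet, if any
        if (packets.empty()) return std::nullopt;
        int p = packets.front(); packets.pop_front();
        return p;
    }
    void commit(int p) { packets.push_back(p); }  // publish a finished packet
};

// A long-lived, stateful Thread stage that merges/repacks its input.
void threadStage(ToyQueue& in, ToyQueue& out) {
    int accumulated = 0;                          // stages may keep state across packets
    while (auto p = in.reserve()) {               // Reserve input
        accumulated += *p;                        // ... merge / compare / repack ...
        out.commit(accumulated);                  // Commit output
    }
}

int main() {
    ToyQueue in, out;
    for (int i = 1; i <= 4; ++i) in.commit(i);
    threadStage(in, out);
    std::printf("produced %zu packets\n", out.packets.size());
}
```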

  14. Shader stages: • Automatically parallelized: • Horde of non-preemptible, stateless instances • Pre-reserve/post-commit • Push: Variable/conditional output support • GRAMPS coalesces elements into full packets
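
The deck shows no Shader code either, so here is a hedged sketch of the idea: a stateless per-element kernel with push-style variable output. The Ray type, shadeOneRay, and the lambda collector are invented; a real runtime would coalesce pushed elements into full packets across many parallel instances.

```cpp
// Sketch of a Shader-style stage: stateless, per-element, with variable ("push") output.
#include <cstdio>
#include <functional>
#include <vector>

struct Ray { float t; };

// A stateless kernel instance: consumes one element, pushes 0..N outputs.
void shadeOneRay(const Ray& r, const std::function<void(Ray)>& push) {
    if (r.t > 0.5f) {                 // conditional output: only some rays spawn more work
        push(Ray{r.t * 0.5f});
        push(Ray{r.t * 0.25f});
    }
}

int main() {
    std::vector<Ray> inputPacket = {{0.9f}, {0.2f}, {0.7f}};
    std::vector<Ray> coalesced;       // the runtime would coalesce pushes into full packets
    for (const Ray& r : inputPacket)  // instances would run in parallel under GRAMPS
        shadeOneRay(r, [&](Ray out) { coalesced.push_back(out); });
    std::printf("%zu pushed elements\n", coalesced.size());
}
```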

  15. Queue sets: Mutual exclusion • Independent exclusive (serial) subqueues • Created statically or on first output • Densely or sparsely indexed • Bonus: Automatically instanced Thread stages (Figure: Cookie Dough Pipeline)

  16. Queue sets: Mutual exclusion (continued) • Independent exclusive (serial) subqueues • Created statically or on first output • Densely or sparsely indexed • Bonus: Automatically instanced Thread stages (Figure: Cookie Dough Pipeline with a queue set)
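
As a concrete (and entirely invented) picture of a queue set: producers route each item to a subqueue chosen by key, subqueues are created on first output, and the runtime drains each subqueue serially, which is what provides the mutual exclusion described above.

```cpp
// Sketch of a queue set: one logical queue split into independent, serial
// subqueues selected by an index (e.g., which output bin an item belongs to).
#include <cstdio>
#include <deque>
#include <unordered_map>

struct QueueSet {
    std::unordered_map<int, std::deque<int>> sub;               // sparsely indexed subqueues
    void push(int key, int item) { sub[key].push_back(item); }  // created on first output
};

int main() {
    QueueSet qs;
    // Producer(s): route each item by key; items sharing a key land in the same subqueue.
    for (int item = 0; item < 8; ++item) qs.push(item % 3, item);

    // Consumers: the runtime may run one consumer per subqueue in parallel, but each
    // subqueue is drained by at most one instance at a time (mutual exclusion).
    for (auto& [key, q] : qs.sub) {
        std::printf("subqueue %d:", key);
        while (!q.empty()) { std::printf(" %d", q.front()); q.pop_front(); }
        std::printf("\n");
    }
}
```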

  17. A few other tidbits • In-place Shader stages / coalescing inputs • Instanced Thread stages • Queues as barriers / read all-at-once

  18. Formative influences • The Graphics Pipeline, early GPGPU • “Streaming” • Work-queues and task-queues

  19. Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

  20. Graphics is a natural first domain • Table stakes for commodity parallelism • GPUs are full of heterogeneity • Poised to transition from fixed/configurable pipeline to programmable • We have a lot of experience in it

  21. The Graphics Pipeline in GRAMPS • Graph, setup are (application) software • Can be customized or completely replaced • Like the transition to programmable shading • Not (unthinkably) radical • Fits current hw: FIFOs, cores, rasterizer, …

  22. Reminder: Design goals • Broad application scope • Multi-platform applicability • Performance: scale-out, footprint-aware

  23. The Experiment • Three renderers: • Rasterization, Ray Tracer, Hybrid • Two simulated future architectures • Simple scheduler for each

  24. Scope: Two(-plus) renderers (Figures: Rasterization Pipeline with Ray Tracing Extension; Ray Tracing Graph)

  25. Platforms: Two simulated systems • CPU-Like: 8 Fat Cores, Rast • GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched

  26. Performance— Metrics “Maximize machine utilization while keeping working sets small” • Priority #1: Scale-out parallelism • Parallel utilization • Priority #2: ‘Reasonable’ bandwidth / storage • Worst case total footprint of all queues • Inherently a trade-off versus utilization
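
To make the footprint metric concrete, here is a small sketch (the queue names and numbers are made up, not measurements from the study) that totals worst-case footprint as capacity times packet size over all queues.

```cpp
// Sketch: worst-case total footprint = sum over queues of (max packets * packet bytes).
#include <cstddef>
#include <cstdio>
#include <vector>

struct QueueSpec { const char* name; std::size_t maxPackets, packetBytes; };

int main() {
    std::vector<QueueSpec> queues = {   // example numbers, not measured data
        {"rayQueue",     64, 4096},
        {"fragQueue",   128, 2048},
        {"shadowQueue",  32, 4096},
    };
    std::size_t total = 0;
    for (const auto& q : queues) total += q.maxPackets * q.packetBytes;
    std::printf("worst-case queue footprint: %zu KB\n", total / 1024);
}
```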

  27. Performance— Scheduling Simple prototype scheduler (both platforms): • Static stage priorities • Only preempt on Reserve and Commit • No dynamic weighting of current queue sizes (Figure: pipeline graph annotated with stage priorities, lowest to highest)

  28. Performance— Results • Utilization: 95+% for all but rasterized fairy (~80%). • Footprint: < 600KB CPU-like, < 1.5MB GPU-like • Surprised how well the simple scheduler worked • Maintaining order costs footprint

  29. Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

  30. Reminder: Design goals • Broad application scope • Multi-platform applicability • Performance: scale-out, footprint-aware

  31. The Experiment • 9 applications, 13 configurations • One (more) architecture: multi-core x86 • It’s real (no simulation here) • Built with pthreads, locks, and atomics • Per-pthread task-priority-queues with work-stealing • More advanced scheduling

  32. Scope: Application bonanza • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though) • MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA • Cilk(-like): Mergesort • CUDA: Gaussian, SRAD • StreamIt: FM, TDE

  33. Scope: Many different idioms (Figures: GRAMPS graphs for Ray Tracer, FM, MapReduce, SRAD, and Merge Sort)

  34. Platform: 2x quad-core Nehalem (natively, 8 hyperthreaded Core i7 cores) • Queues: copy in/out, global (shared) buffer • Threads: user-level scheduled contexts • Shaders: create one task per input packet
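
A minimal sketch of the "one task per input packet" rule for Shader stages, with invented Packet/Task structures: each packet that arrives on a Shader's input queue becomes one schedulable task tagged with its stage (that tag is what the priority scheduler on the next slide uses).

```cpp
// Sketch: every input packet destined for a Shader stage becomes one schedulable task.
#include <cstdio>
#include <deque>
#include <vector>

struct Packet { std::vector<int> elements; };
struct Task   { int stageId; Packet packet; };   // tagged with its stage (for priority)

int main() {
    std::deque<Task> taskQueue;                  // per-worker queue in the real system
    const int shaderStageId = 2;
    for (int p = 0; p < 4; ++p)                  // four packets arrive on the input queue
        taskQueue.push_back(Task{shaderStageId, Packet{{p, p + 1}}});
    std::printf("%zu tasks created\n", taskQueue.size());
}
```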

  35. Performance— Metrics (Reminder) “Maximize machine utilization while keeping working sets small” • Priority #1: Scale-out parallelism • Priority #2: ‘Reasonable’ bandwidth / storage

  36. Performance– Scheduling • Static per-stage priorities (still) • Work-stealing task-priority-queues • Eagerly create one task per packet (naïve) • Keep running stages until a low watermark • (Limited dynamic weighting of queue depths)
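
The policy above can be pictured with this self-contained sketch (data structures and numbers invented; the low-watermark and dynamic-weighting details are omitted): each worker serves the highest-priority task in its own queue and steals from another worker when its queue is empty.

```cpp
// Sketch of the multi-core policy: static per-stage priorities,
// per-worker task queues with stealing. Invented structures, not GRAMPS code.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int stagePriority; int work; };   // priority from the stage; work = payload

using TaskQueue = std::deque<Task>;

// Pick a task: prefer the highest-priority task in "mine"; otherwise steal.
bool pickTask(TaskQueue& mine, std::vector<TaskQueue>& all, Task& out) {
    if (!mine.empty()) {
        auto best = mine.begin();
        for (auto it = mine.begin(); it != mine.end(); ++it)
            if (it->stagePriority > best->stagePriority) best = it;
        out = *best; mine.erase(best);
        return true;
    }
    for (auto& victim : all) {                   // work stealing from other workers
        if (&victim == &mine || victim.empty()) continue;
        out = victim.back(); victim.pop_back();
        return true;
    }
    return false;
}

int main() {
    std::vector<TaskQueue> workers(2);
    workers[1] = {{1, 10}, {3, 5}, {2, 7}};      // only worker 1 has tasks queued
    Task t;
    while (pickTask(workers[0], workers, t))     // worker 0 steals until nothing is left
        std::printf("ran task with priority %d\n", t.stagePriority);
}
```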

  37. Performance– Good Scale-out • Footprint: good; detail a little later (Figure: parallel speedup versus hardware threads)

  38. Performance– Low Overheads • ‘App’ and ‘Queue’ time are both useful work (Figure: execution time breakdown as a percentage of execution, 8 cores / 16 hyperthreads)

  39. Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

  40. Three archetypes • Task-Stealing: (Cilk, TBB) • Low overhead with fine granularity tasks • No producer-consumer, priorities, or data-parallel • Breadth-First: (CUDA, OpenCL) • Simple scheduler (one stage at a time) • No producer-consumer, no pipeline parallelism • Static: (StreamIt / Streaming) • No runtime scheduler; complex schedules • Cannot adapt to irregular workloads

  41. GRAMPS is a natural framework

  42. The Experiment • Re-use the exact same application code • Modify the scheduler per archetype: • Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks • Breadth-First: Unbounded queues, stage at a time, top-to-bottom • Static: Unbounded queues, offline per-thread schedule using SAS / SGMS

  43. Seeing is believing (ray tracer) (Figures: execution visualizations for GRAMPS, Breadth-First, Task-Stealing, and Static (SAS))

  44. Comparison: Execution time • Mostly similar: good parallelism, load balance (Figure: time breakdown for GRAMPS, Task-Stealing, and Breadth-First as a percentage of time)

  45. Comparison: Execution time • Breadth-first can exhibit load imbalance (Figure: time breakdown for GRAMPS, Task-Stealing, and Breadth-First as a percentage of time)

  46. Comparison: Execution time • Task-stealing can ping-pong and cause contention (Figure: time breakdown for GRAMPS, Task-Stealing, and Breadth-First as a percentage of time)

  47. Comparison: Footprint • Breadth-First is pathological (as expected) (Figure: relative packet footprint size versus GRAMPS, log scale)

  48. Footprint: GRAMPS & Task-Stealing (Figures: relative packet footprint; relative task footprint)

  49. Footprint: GRAMPS & Task-Stealing • GRAMPS gets insight from the graph: • (Application-specified) queue bounds • Group tasks by stage for priority, preemption (Figures: MapReduce and Ray Tracer footprints)

  50. Static scheduling is challenging • Generating good static schedules is *hard* • Static schedules are fragile: • Small mismatches compound • Hardware itself is dynamic (cache traffic, IRQs, …) • Limited upside: dynamic scheduling is cheap! (Figures: packet footprint; execution time)
