DeNovo†: Rethinking Hardware for Disciplined Parallelism

Presentation Transcript


  1. DeNovo†: Rethinking Hardware for Disciplined Parallelism Byn Choi, Rakesh Komuravelli, Hyojin Sung, Rob Bocchino, Sarita Adve, Vikram Adve Other collaborators: Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel) Applications: John Hart, Victor Lu †De Novo = From the beginning, anew

  2. Motivation • Goal: Power-, complexity-, performance-scalable hardware • Today: shared-memory • Directory-based coherence • Complex, unscalable • Address, communication, coherence granularity is cache line • Software-oblivious; inefficient in power, bandwidth, latency, area • Difficult programming model • Data races, non-determinism, no safety/composability/modularity, … • Can’t specify “what value a read can return” a.k.a. memory model • Data races defy acceptable semantics; mismatched hardware/software • Fundamentally broken for hardware & software

  3. Motivation • Goal: Power-, complexity-, performance-scalable hardware • Today: shared-memory • Directory-based coherence • Complex, unscalable • Address, communication, coherence granularity is cache line • Inefficient in power, bandwidth, latency, area, especially for O-O codes • Difficult programming model • Data races, non-determinism, no safety/composability/modularity, … • Can’t specify “what value a read can return” a.k.a. memory model • Data races defy acceptable semantics; mismatched hardware/software • Fundamentally broken for hardware & software Banish shared memory?

  4. Motivation • Goal: Power-, complexity-, performance-scalable hardware • Today: shared-memory • Directory-based coherence • Complex, unscalable • Address, communication, coherence granularity is cache line • Inefficient in power, bandwidth, latency, area, especially for O-O codes • Difficult programming model • Data races, non-determinism, no safety/composability/modularity, … • Can’t specify “what value a read can return” a.k.a. memory model • Data races defy acceptable semantics; mismatched hardware/software • Fundamentally broken for hardware & software Banish wild shared memory!

  5. Motivation • Goal: Power-, complexity-, performance-scalable hardware • Today: shared-memory • Directory-based coherence • Complex, unscalable • Address, communication, coherence granularity is cache line • Inefficient in power, bandwidth, latency, area, especially for O-O codes • Difficult programming model • Data races, non-determinism, no safety/composability/modularity, … • Can’t specify “what value a read can return” a.k.a. memory model • Data races defy acceptable semantics; mismatched hardware/software • Fundamentally broken for hardware & software Banish wild shared memory! Need disciplined shared memory!

  6. What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  8. What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  9. What is Shared-Memory? Disciplined Shared-Memory = Global address space + Explicit, structured side-effects (in place of implicit, anywhere communication and synchronization)

  10. Disciplined Shared-Memory • Top-down view: prog model, language, … (earlier today) • Use explicit effects for semantic guarantees • Data-race-freedom, determinism-by-default, controlled non-determinism • Reward: simple semantics, safety, composability, … • Bottom-up view: hardware, runtime, … • Use explicit effects + above guarantees for • Simple coherence and consistency • Software-aware address/comm/coherence granularity, data layout • Reward: power-, complexity-, performance-scalable hardware • Top-down and bottom-up views are synergistic!

  11. DeNovo Research Strategy • Start with deterministic (data-race-free) codes • Common & best case, basis for extension to other codes • Disciplined non-deterministic codes • Wild non-deterministic, legacy codes • Work with languages for best h/w-s/w interface • Current driver is DPJ • End-goal is language-oblivious interface • Work with realistic applications • Current work with kd-trees
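
To make slide 11 concrete: DPJ expresses the discipline through region and effect annotations on ordinary Java code. The sketch below is illustrative only (class, region, and method names are hypothetical, not the authors' code); the DPJ-specific annotations appear as comments because compiling the real syntax needs the DPJ compiler rather than plain javac.

    // Plain-Java sketch of what DPJ makes explicit. The DPJ region and effect
    // annotations are shown as comments; all names here are illustrative.
    class Particle {
        // DPJ:  region Pos, F;             -- declare named, disjoint heap regions
        double pos;    // DPJ:  double pos   in Pos;   (field lives in region Pos)
        double force;  // DPJ:  double force in F;     (field lives in region F)

        // DPJ effect summary:  void interact(double otherPos) reads Pos writes F
        // The compiler checks the body against the summary and can report, per
        // parallel phase, which regions may be written (here: F) to the hardware.
        void interact(double otherPos) {
            force += otherPos - pos;
        }
    }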

  12. Progress Summary • Deterministic codes • Language-level model with DPJ, translation to h/w interface • Design for simple coherence, s/w-aware comm. and layout • Baseline coherence implemented, rest underway • Disciplined non-deterministic codes • Language-level model with DPJ (and Intel) • Wild non-deterministic, legacy codes • Just begun, will be influenced by above • Collaborations with languages & applications groups • DPJ; first scalable SAH-quality kd-tree construction

  14. Coherence and Consistency Key Insights • Guaranteed determinism • Explicit effects

  15. Coherence and Consistency Key Insights • Guaranteed determinism ⇒ Read should return value of last write in sequential order • From same task in this parallel phase, if it exists • Or from previous parallel phase • No concurrent conflicting writes in this phase • Explicit effects

  16. Coherence and Consistency Key Insights • Guaranteed determinism ⇒ Read should return value of last write in sequential order • From same task in this parallel phase, if it exists • Or from previous parallel phase • No concurrent conflicting writes in this phase • Explicit effects ⇒ Compiler knows all regions written in this parallel phase • Cache can self-invalidate before next parallel phase • Invalidates data in writeable regions not written by itself
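
A minimal plain-Java sketch of the self-invalidation bookkeeping slides 15-16 describe, assuming a hypothetical per-line region tag and a phase-boundary hook (all names invented for illustration; the real mechanism is hardware, not Java). Copies this core wrote are registered and kept; other valid copies in regions the phase could have written are dropped.

    import java.util.*;

    // Sketch of an L1 cache that self-invalidates at a parallel-phase boundary.
    class L1CacheSketch {
        enum State { INVALID, VALID, REGISTERED }
        static class Line { State state = State.INVALID; int regionId; long data; }

        private final Map<Long, Line> lines = new HashMap<>();   // address -> line

        // Called at the phase boundary with the write-effect set the compiler
        // computed for the phase that just ended.
        void selfInvalidate(Set<Integer> regionsWrittenThisPhase) {
            for (Line l : lines.values()) {
                boolean possiblyStale = regionsWrittenThisPhase.contains(l.regionId);
                // Keep what this core itself wrote (REGISTERED); drop other copies
                // in written regions, since another task may have updated them.
                if (possiblyStale && l.state == State.VALID) l.state = State.INVALID;
            }
        }
    }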

  17. Today's Coherence Protocols • Snooping: broadcast, ordered networks • Directory: avoid broadcast through indirection • Complexity: Races in protocol • Overhead: Sharer list • Performance: All cache misses go through directory

  18. Today's Coherence Protocols • Snooping: broadcast, ordered networks • Directory: avoid broadcast through indirection • Complexity: Races in protocol • Race-free software ⇒ (almost) race-free coherence protocol • No transient states, much simpler protocol • Overhead: Sharer list • Performance: All cache misses go through directory

  19. Today's Coherence Protocols • Snooping: broadcast, ordered networks • Directory: avoid broadcast through indirection • Complexity: Races in protocol • Race-free software ⇒ (almost) race-free coherence protocol • No transient states, much simpler protocol • Overhead: Sharer list • Explicit effects enable self-invalidations • No need for sharer lists • Performance: All cache misses go through directory

  20. Today's Coherence Protocols • Snooping: broadcast, ordered networks • Directory: avoid broadcast through indirection • Complexity: Races in protocol • Race-free software ⇒ (almost) race-free coherence protocol • No transient states, much simpler protocol • Overhead: Sharer list • Explicit effects enable self-invalidations • No need for sharer lists • Performance: All cache misses go through directory • Directory only tracks one up-to-date copy, not sharers or serialization • Data copies can move from cache to cache without telling directory

  21. Baseline DeNovo Coherence • Assume (for now): Private L1, shared L2; single word line • Directory tracks one current copy of line, not sharers • L2 data arrays double as directory • Keep valid data or registered core id, no space overhead • S/W inserts self-invalidates for regions w/ write effects • L1 states = invalid, valid, registered • No transient states: Protocol ≈ 3-state textbook pictures • Formal specification and verification with Intel
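
To illustrate slide 21's "L2 data arrays double as directory" point, here is a hedged sketch of a single-word L2 entry that stores either up-to-date data or the id of the registered core, so no sharer list or transient state is needed. The interface and names are hypothetical, not the actual DeNovo design.

    // Sketch of one single-word L2 entry acting as both data store and directory.
    class L2EntrySketch {
        private boolean holdsData;   // true: 'payload' is data; false: it is a core id
        private long payload;

        // A core's write registers the word to that core. Data-race-freedom
        // guarantees no concurrent conflicting write, so there is no transient
        // "pending" state to track.
        synchronized long register(int coreId) {
            long previous = payload;     // previous data or previous owner id
            holdsData = false;
            payload = coreId;            // the directory now just names the owner
            return previous;
        }

        // A read miss: return data if L2 has it; otherwise the request would be
        // forwarded to the registered core (not modeled in this sketch).
        synchronized Long readIfPresent() {
            if (holdsData) return payload;   // autoboxed word value
            return null;                     // caller must ask the registered core
        }
    }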

  22. DeNovo Region Granularity • DPJ regions too fine-grained • Ensure accesses to individual objects/fields don’t interfere • Too many regions for hardware • DeNovo only needs aggregate data written in phase • E.g., can summarize a field of entire data structure as one region Can we aggregate to few enough regions, without excessive invalidations?
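
One way to picture the aggregation this slide asks about, assuming a hypothetical encoding in which the compiler collapses many fine-grained DPJ regions (used to prove non-interference between tasks) into a few coarse hardware region ids that only name which fields a phase writes:

    // Hypothetical mapping from fine-grained DPJ regions to coarse DeNovo regions:
    // every Body.pos field, across the whole data structure, is one hardware region,
    // and likewise for every Body.force field.
    final class DeNovoRegionIds {
        static final int BODY_POS   = 1;   // all 'pos' fields, one region
        static final int BODY_FORCE = 2;   // all 'force' fields, one region
    }
    // Conceptually the compiler would then communicate, per parallel phase,
    // something like writeSet = { BODY_FORCE }, which is enough for the caches
    // to self-invalidate without tracking per-object regions in hardware.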

  23. Evaluation Methodology • Modified Wisconsin GEMS + (Intel!) Simics simulator • 4 apps: LU, FFT, Barnes, kd-tree • Converted DPJ regions into DeNovo regions by hand • Compared DeNovo vs. MESI for single word lines • Goal: Does simplicity impact performance? E.g., are self-invalidations too conservative? (Efficiency enhancements are next step)

  24. Results for Baseline Coherence • DeNovo comparable to MESI • Simple protocol is foundation for efficiency enhancements

  25. Improving Performance & Power Insight: Can always copy valid data to another cache • w/o demand access, w/o going through directory • If later demand read sees it, it must be correct • No false sharing effects (no loss of “ownership”) • Simple line-based protocol (with word-based valid bits) • Can get line from anywhere, not just from directory L2 • Point-to-point transfer, sender-initiated transfer • Can transfer more than line at a time • Point-to-point bulk transfer • Transfer granularity can be region-driven • AoS vs. SoA optimization is natural
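
As a rough illustration of the word-based valid bits mentioned on this slide, the sketch below (plain Java, hypothetical names) merges only the valid words received from a peer cache; a later demand read that hits a valid word is guaranteed correct, so copies can move cache to cache without the directory and without false-sharing effects.

    // Sketch of a cache line whose words are individually marked valid.
    class LineSketch {
        static final int WORDS_PER_LINE = 8;
        final long[] words = new long[WORDS_PER_LINE];
        final boolean[] valid = new boolean[WORDS_PER_LINE];

        // Opportunistically merge whatever valid words a peer cache sent us,
        // without acquiring ownership or notifying the directory.
        void mergeFromPeer(LineSketch peer) {
            for (int i = 0; i < WORDS_PER_LINE; i++) {
                if (peer.valid[i] && !valid[i]) {
                    words[i] = peer.words[i];
                    valid[i] = true;
                }
            }
        }

        // Demand read: a hit only if that particular word is valid here.
        Long read(int wordIndex) {
            if (valid[wordIndex]) return words[wordIndex];
            return null;   // miss: fall back to the normal protocol
        }
    }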

  26. Towards Ideal Efficiency Current systems: • Address, transfer, coherence granularity = fixed cache line DeNovo so far: • Transfer is flexible, but address is still line, coherence is still word Next step: Region-centered caches • Use regions for memory layout • Region-based “pool allocation,” fields of same region at fixed strides • Cache banks devoted to regions • Regions accessed together give address, transfer granularity • Regions w/ same sharing behavior give coherence granularity • Applicable to main memory and pin bandwidth • Interactions with runtime scheduler
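
A small sketch of the region-driven layout idea, assuming a hypothetical pool in which each region's fields are stored as one contiguous array (the structure-of-arrays form of the AoS vs. SoA point on the previous slide): a phase that writes only one region then touches one dense address range, which gives a natural address, transfer, and coherence granularity.

    // Hypothetical region-based "pool allocation": fields of the same region are
    // laid out contiguously at a fixed stride instead of interleaved per object.
    final class BodyPoolSketch {
        final double[] pos;     // region "Pos":   pos of body i is pos[i]
        final double[] force;   // region "Force": force of body i is force[i]

        BodyPoolSketch(int n) { pos = new double[n]; force = new double[n]; }

        // A phase that writes only region "Force" touches only force[0..n), a
        // single contiguous block the hardware can fetch or invalidate in bulk.
        void computeForces() {
            for (int i = 0; i < force.length; i++) force[i] = -pos[i];
        }
    }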

  27. Summary • Current shared-memory models fundamentally broken • Semantics, programmability, hardware • Disciplined programming models solve these problems • DeNovo = hardware for disciplined programming • Software-driven memory hierarchy • Coherence, consistency, communication, data layout, … • Simpler, faster, cooler, cheaper, … • Sponsor interactions: • Nick Carter, Mani Azimi/Akhilesh Kumar/Ching-Tsun Chou • Non-deterministic model: Adam Welc/Shpeisman/Ni • SCC prototype exploration: Jim Held

  28. Next steps • Phase 1: • Full implementation, verification, results for deterministic codes • Design and results for disciplined non-determinism • Design for wild non-deterministic, legacy codes • Continue work with language and application groups • Explore prototyping on SCC • Phase 2: • Design and simulation results for complete DeNovo system running large applications • Language-oblivious hardware-software interface • Prototype and tech transfer
