DeNovo†: Rethinking Hardware for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Rob Bocchino, Sarita Adve, Vikram Adve
Other collaborators:
• Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel)
• Applications: John Hart, Victor Lu
†De Novo = from the beginning, anew
Motivation
• Goal: power-, complexity-, performance-scalable hardware
• Today's shared memory:
  • Directory-based coherence
    • Complex, unscalable
    • Address, communication, and coherence granularity is the cache line
    • Software-oblivious: inefficient in power, bandwidth, latency, and area, especially for object-oriented codes
  • Difficult programming model
    • Data races, non-determinism, no safety/composability/modularity, …
    • Can't specify "what value a read can return," a.k.a. the memory model
    • Data races defy acceptable semantics; hardware and software are mismatched
• Fundamentally broken for hardware & software
• Banish wild shared memory! Need disciplined shared memory!
What is Shared-Memory?
Shared-Memory = Global address space + implicit, anywhere communication and synchronization
Wild Shared-Memory = Global address space + implicit, anywhere communication and synchronization
Disciplined Shared-Memory = Global address space + explicit, structured side-effects (replacing implicit, anywhere communication and synchronization)
Disciplined Shared-Memory
• Top-down view: programming model, language, … (earlier today)
  • Use explicit effects for semantic guarantees
    • Data-race-freedom, determinism-by-default, controlled non-determinism
  • Reward: simple semantics, safety, composability, …
• Bottom-up view: hardware, runtime, … (sketched below)
  • Use explicit effects + the above guarantees for
    • Simple coherence and consistency
    • Software-aware address/communication/coherence granularity and data layout
  • Reward: power-, complexity-, performance-scalable hardware
• Top-down and bottom-up views are synergistic!
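As a concrete illustration of the bottom-up use of effects, the sketch below (C++, with hypothetical names and interface; the actual DeNovo hardware-software interface may differ) shows the kind of information a compiler or runtime could hand to the cache hierarchy at a parallel-phase boundary: a small bitmask of the regions the ending phase may have written.

```cpp
#include <bitset>

// Hypothetical hardware region budget and write-effect summary type.
constexpr int kMaxHwRegions = 64;
using RegionMask = std::bitset<kMaxHwRegions>;

// Stand-in for a special instruction or memory-mapped register write that
// tells the local cache which regions were writeable in the phase just ended,
// triggering self-invalidation of stale copies (see the coherence slides).
void hw_phase_boundary(const RegionMask& written) {
    (void)written;  // placeholder: real hardware would act on this mask
}

int main() {
    RegionMask written_this_phase;
    written_this_phase.set(3);  // e.g., region holding kd-tree split planes
    written_this_phase.set(7);  // e.g., region holding per-task output buffers
    hw_phase_boundary(written_this_phase);  // end of parallel phase
    return 0;
}
```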
DeNovo Research Strategy
• Start with deterministic (data-race-free) codes
  • Common & best case; basis for extension to other codes
• Disciplined non-deterministic codes
• Wild non-deterministic, legacy codes
• Work with languages for the best h/w-s/w interface
  • Current driver is DPJ
  • End goal is a language-oblivious interface
• Work with realistic applications
  • Current work with kd-trees
Progress Summary
• Deterministic codes
  • Language-level model with DPJ; translation to the h/w interface
  • Design for simple coherence, s/w-aware communication and layout
  • Baseline coherence implemented; rest underway
• Disciplined non-deterministic codes
  • Language-level model with DPJ (and Intel)
• Wild non-deterministic, legacy codes
  • Just begun; will be influenced by the above
• Collaborations with languages & applications groups
  • DPJ; first scalable SAH-quality kd-tree construction
Coherence and Consistency: Key Insights
• Guaranteed determinism
  • A read should return the value of the last write in sequential order
    • From the same task in this parallel phase, if it exists
    • Otherwise, from the previous parallel phase
  • No concurrent conflicting writes in this phase
• Explicit effects
  • The compiler knows all regions written in this parallel phase
  • A cache can self-invalidate before the next parallel phase (see the sketch below)
    • It invalidates data in writeable regions not written by itself
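A minimal sketch of that self-invalidation step follows (C++; the data structures, field names, and sizes are assumptions for illustration, not the actual DeNovo hardware). At a phase boundary, every valid word belonging to a region written in the ending phase is invalidated unless this core itself wrote (touched) it; registered data the core owns stays put.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

enum class State : uint8_t { Invalid, Valid, Registered };

struct CacheWord {
    State   state   = State::Invalid;
    uint8_t region  = 0;      // hardware region id carried with the data
    bool    touched = false;  // written by this core during the current phase
};

constexpr int kL1Words = 8 * 1024;

// Phase-boundary self-invalidation driven by the phase's write-effect summary.
void self_invalidate(std::array<CacheWord, kL1Words>& l1,
                     const std::bitset<64>& written_regions) {
    for (CacheWord& w : l1) {
        // Registered (owned) data and data this core wrote remain usable;
        // other valid data in written regions may be stale, so drop it.
        if (w.state == State::Valid && !w.touched && written_regions.test(w.region)) {
            w.state = State::Invalid;
        }
        w.touched = false;  // reset the per-phase touched bit
    }
}
```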
Today's Coherence Protocols
• Snooping: broadcast; requires ordered networks
• Directory: avoids broadcast through indirection, but:
  • Complexity: races in the protocol
    • Race-free software ⇒ (almost) race-free coherence protocol
    • No transient states; a much simpler protocol
  • Overhead: sharer lists
    • Explicit effects enable self-invalidations
    • No need for sharer lists
  • Performance: all cache misses go through the directory
    • The directory only tracks the one up-to-date copy, not sharers or serialization
    • Data copies can move from cache to cache without telling the directory
Baseline DeNovo Coherence
• Assume (for now): private L1s, shared L2, single-word cache lines
• Directory tracks the one current copy of a line, not sharers
  • L2 data arrays double as the directory
  • Each entry keeps either valid data or the registered core id, so there is no extra space overhead
• Software inserts self-invalidates for regions with write effects
• L1 states: Invalid, Valid, Registered
  • No transient states: protocol ≈ the 3-state textbook pictures (sketched below)
• Formal specification and verification with Intel
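The sketch below (C++; structure and names are illustrative assumptions, not the verified protocol) shows the core of this idea for single-word lines: an L2/directory entry reuses its data storage to hold either the up-to-date value or the id of the core whose L1 has registered the word, so no sharer list is needed.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

enum class L2State : uint8_t { Valid, Registered };

struct L2Entry {
    L2State  state;
    uint32_t bits;   // data value if Valid, registered core id if Registered
};

class L2Directory {
public:
    // Read request: return the data if L2 holds the current copy; otherwise
    // the caller must fetch it from the registered core's L1 (id in `bits`).
    std::optional<uint32_t> read(uint64_t addr) const {
        auto it = mem_.find(addr);
        if (it == mem_.end()) return 0u;              // backing memory, zero-filled here
        if (it->second.state == L2State::Valid) return it->second.bits;
        return std::nullopt;                          // forward to core it->second.bits
    }

    // Registration request on a write: the writer becomes the single
    // up-to-date copy; L2 records only the writer's core id.
    void register_write(uint64_t addr, uint32_t writer_core) {
        mem_[addr] = L2Entry{L2State::Registered, writer_core};
    }

private:
    std::unordered_map<uint64_t, L2Entry> mem_;       // word-granularity entries
};
```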
DeNovo Region Granularity
• DPJ regions are too fine-grained
  • They ensure accesses to individual objects/fields don't interfere
  • Too many regions for hardware to track
• DeNovo only needs the aggregate data written in a phase
  • E.g., a field of an entire data structure can be summarized as one region (see the sketch below)
• Open question: can we aggregate to few enough regions without excessive invalidations?
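A hedged illustration of that aggregation (C++; region ids, field names, and the mapping policy are assumptions): instead of one DPJ region per node, each field of the whole kd-tree gets a single hardware region id, so a phase that writes only split planes is summarized by one region regardless of tree size.

```cpp
#include <cstdint>

// Coarse hardware regions: one per field of the entire data structure.
enum HwRegion : uint8_t {
    REGION_SPLIT    = 0,  // all `split` fields of all nodes
    REGION_CHILDREN = 1,  // all child indices of all nodes
};

struct KdNode {
    float split;          // conceptually tagged with REGION_SPLIT
    int   left, right;    // conceptually tagged with REGION_CHILDREN
};

// The write effect "this phase writes the split field of every node"
// collapses to a single bit in the phase's region mask.
uint64_t write_effect_for_split_phase() {
    return 1ull << REGION_SPLIT;
}
```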
Evaluation Methodology
• Modified Wisconsin GEMS + (Intel!) Simics simulator
• 4 apps: LU, FFT, Barnes, kd-tree
• Converted DPJ regions into DeNovo regions by hand
• Compared DeNovo vs. MESI with single-word lines
• Goal: does simplicity impact performance? E.g., are self-invalidations too conservative?
• (Efficiency enhancements are the next step)
Results for Baseline Coherence
• DeNovo is comparable to MESI
• The simple protocol is a foundation for efficiency enhancements
Improving Performance & Power
• Insight: valid data can always be copied to another cache
  • Without a demand access and without going through the directory
  • If a later demand read sees it, it must be correct
  • No false-sharing effects (no loss of "ownership")
• Simple line-based protocol with per-word valid bits (sketched below)
  • A line can be obtained from anywhere, not just from the directory/L2
  • Point-to-point transfer; sender-initiated transfer
• Can transfer more than a line at a time
  • Point-to-point bulk transfer
  • Transfer granularity can be region-driven
  • AoS vs. SoA optimization is natural
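Below is a minimal sketch (C++; the line size, field names, and merge policy are assumptions) of a line-based entry with per-word valid bits: a whole line may be copied in opportunistically from any cache, and only the words known to be valid are adopted, so stale words cause no false sharing and no ownership is lost.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kWordsPerLine = 8;

struct DeNovoLine {
    std::array<uint32_t, kWordsPerLine> words{};
    std::bitset<kWordsPerLine>          valid;       // per-word valid bits
    std::bitset<kWordsPerLine>          registered;  // per-word registered (owned) bits
};

// Merge an opportunistically received copy into the local line, taking only
// words we do not already hold in a valid or registered state.
void merge_prefetched(DeNovoLine& local, const DeNovoLine& incoming) {
    for (int w = 0; w < kWordsPerLine; ++w) {
        if (!local.valid[w] && !local.registered[w] && incoming.valid[w]) {
            local.words[w] = incoming.words[w];
            local.valid[w] = true;
        }
    }
}
```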
Towards Ideal Efficiency
• Current systems: address, transfer, and coherence granularity = fixed cache line
• DeNovo so far: transfer is flexible, but addressing is still line-based and coherence is word-based
• Next step: region-centered caches
  • Use regions for memory layout
    • Region-based "pool allocation": fields of the same region at fixed strides (see the layout sketch below)
  • Cache banks devoted to regions
  • Regions accessed together give the address and transfer granularity
  • Regions with the same sharing behavior give the coherence granularity
• Applicable to main memory and pin bandwidth
• Interactions with the runtime scheduler
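The layout idea can be pictured as the usual AoS-to-SoA transformation, driven by regions rather than by hand (C++ sketch; the structure names and fields are illustrative assumptions). Pooling each region's fields contiguously puts all of a region's data at fixed strides, so it can be laid out, transferred, and kept coherent as a unit.

```cpp
#include <cstddef>
#include <vector>

// Conventional array-of-structs (AoS) layout: fields of different regions
// are interleaved within every node.
struct KdTreeAoS {
    struct Node { float split; int left, right; };
    std::vector<Node> nodes;
};

// Region-pooled (SoA) layout: each region's field lives in its own pool,
// at a fixed stride, matching the region-centered cache organization.
struct KdTreeRegionPools {
    std::vector<float> split;        // region: split planes
    std::vector<int>   left, right;  // region: child links
    void resize(std::size_t n) { split.resize(n); left.resize(n); right.resize(n); }
};
```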
Summary
• Current shared-memory models are fundamentally broken
  • Semantics, programmability, hardware
• Disciplined programming models solve these problems
• DeNovo = hardware for disciplined programming
  • Software-driven memory hierarchy: coherence, consistency, communication, data layout, …
  • Simpler, faster, cooler, cheaper, …
• Sponsor interactions:
  • Nick Carter; Mani Azimi/Akhilesh Kumar/Ching-Tsun Chou
  • Non-deterministic model: Adam Welc/Tatiana Shpeisman/Yang Ni
  • SCC prototype exploration: Jim Held
Next Steps
• Phase 1:
  • Full implementation, verification, and results for deterministic codes
  • Design and results for disciplined non-determinism
  • Design for wild non-deterministic, legacy codes
  • Continue work with language and application groups
  • Explore prototyping on SCC
• Phase 2:
  • Design and simulation results for a complete DeNovo system running large applications
  • Language-oblivious hardware-software interface
  • Prototype and tech transfer