DeNovo : Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism



  1. DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou University of Illinois at Urbana-Champaign and Intel denovo@cs.illinois.edu

  2. Motivation • Goal: Safe and efficient parallel computing • Easy, safe programming model • Complexity-, power-, performance-scalable hardware • Today: shared memory • Complex, power- and performance-inefficient hardware • Complex directory coherence, unnecessary traffic, ... • Difficult programming model • Data races, non-determinism, composability/modularity?, testing? • Mismatched interface between HW and SW, a.k.a. the memory model • Can’t specify “what value a read can return” • Data races defy acceptable semantics ⇒ fundamentally broken for hardware and software

  3. Motivation: Banish shared memory? (same bullets as slide 2)

  4. Motivation: Banish wild shared memory! Need disciplined shared memory! (same bullets as slide 2)

  5. What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  6. What is Shared-Memory? (animation build; same content as slide 5)

  7. What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  8. What is Shared-Memory? (animation build; same content as slide 7)

  9. What is Shared-Memory? Disciplined Shared-Memory = Global address space + Explicit, structured side-effects (in place of implicit, anywhere communication and synchronization)

  10. Benefits of Explicit Effects • Disciplined shared memory: explicit effects + structured parallel control • Strong safety properties: determinism-by-default, explicit and safe non-determinism; simple semantics, composability, testability (Deterministic Parallel Java, DPJ) • Efficiency in complexity, performance, and power: simplify coherence and consistency; optimize communication and storage layout (DeNovo) • Result: a simple programming model AND complexity-, power-, performance-scalable hardware

  11. DeNovo Hardware Project • Exploit discipline for efficient hardware • Current driver is DPJ (Deterministic Parallel Java) • End goal is language-oblivious interface • Software research strategy • Deterministic codes • Safe non-determinism • OS, legacy, … • Hardware research strategy • On-chip: coherence, consistency, communication, data layout • Off-chip: similar ideas apply, next step

  12. Current Hardware Limitations • Complexity • Subtle races and numerous transient states in the protocol • Hard to extend for optimizations • Storage overhead • Directory overhead for sharer lists • Performance and power inefficiencies • Invalidation and ack messages • False sharing • Indirection through the directory • Fixed cache-line communication granularity

  13. Contributions (addressing the limitations on slide 12) • Complexity: no transient states; simple to extend for optimizations • Storage overhead: no storage overhead for directory information • Performance and power: no invalidation and ack messages; no false sharing; no indirection through the directory; flexible, not cache-line, communication • Up to 81% reduction in memory stall time

  14. Outline • Motivation • Background: DPJ • Base DeNovo Protocol • DeNovo Optimizations • Evaluation • Complexity • Performance • Conclusion and Future Work

  15. Background: DPJ [OOPSLA 09] • Extension to Java; fully Java-compatible • Structured parallel control: nested fork-join style • foreach, cobegin • A novel region-based type and effect system • Speedups close to hand-written Java programs • Expressive enough for irregular, dynamic parallelism

  16. Regions and Effects • Region: a name for a set of memory locations • Programmer assigns a region to each field and array cell • Regions partition the heap • Effect: a read or write on a region • Programmer summarizes effects of method bodies • Compiler checks that • Region types are consistent, effect summaries are correct • Parallel tasks are non-interfering (no conflicts) • Simple, modular type checking (no inter-procedural …) • Programs that type-check are guaranteed determinism

  17. Example: A Pair Class (declaring and using region names; region names have static scope, one per class)

    class Pair {
      region Blue, Red;
      int X in Blue;
      int Y in Red;

      void setX(int x) writes Blue {
        this.X = x;
      }

      void setY(int y) writes Red {
        this.Y = y;
      }

      void setXY(int x, int y) writes Blue; writes Red {
        cobegin {
          setX(x); // writes Blue
          setY(y); // writes Red
        }
      }
    }

  18. Example: A Pair Class (same code as slide 17; highlights writing the method effect summaries writes Blue and writes Red)

  19. Example: A Pair Class (same code as slide 17; highlights expressing parallelism: the cobegin tasks setX and setY have the inferred effects writes Blue and writes Red, which do not interfere)
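
For contrast, here is a minimal sketch (mine, not from the talk) of a class the DPJ compiler would reject: both cobegin tasks write the same region, so their effect summaries interfere and the non-interference check fails at compile time. The class and method names are illustrative only.

    class Counter {
      region Cnt;
      int value in Cnt;

      void bump() writes Cnt {
        this.value = this.value + 1;
      }

      void bumpTwice() writes Cnt {
        cobegin {
          bump();   // writes Cnt
          bump();   // writes Cnt: conflicts with the task above, so this does not type-check
        }
      }
    }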

  20. Outline • Motivation • Background: DPJ • Base DeNovo Protocol • DeNovo Optimizations • Evaluation • Complexity • Performance • Conclusion and Future Work

  21. Memory Consistency Model [diagram: two stores to 0xa and a load of 0xa around a parallel phase] • Guaranteed determinism • Read returns the value of the last write in sequential order • Same task in this parallel phase • Or before this parallel phase

  22. Memory Consistency Model (build of slide 21: a coherence mechanism must supply that last write's value to the load)
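
A small worked example of the guarantee, written in the DPJ-style syntax of the earlier slides (my own illustration, not a slide; class, region, and field names are made up). Each read is annotated with the value the model requires it to return.

    class Demo {
      region R, S;
      int x in R;
      int y in S;

      void phase() writes R; reads S {
        // Assume x == 1 and y == 1 were written before this parallel phase.
        cobegin {
          { this.x = 2; int a = this.x; }  // task 1: a == 2, its own last write in this phase
          { int b = this.y; }              // task 2: b == 1, the last write before the phase
        }
        // Task 2 could not legally read x: that read would interfere with task 1's
        // write to region R, so the effect system rejects such a program outright.
      }
    }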

  23. Cache Coherence • Coherence Enforcement • Invalidate stale copies in caches • Track up-to-date copy • Explicit effects • Compiler knows all regions written in this parallel phase • Cache can self-invalidate before next parallel phase • Invalidates data in writeable regions not accessed by itself • Registration • Directory keeps track of one up-to-date copy • Writer updates before next parallel phase
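
A minimal sketch of the self-invalidation step described above (my own code, not from the talk; CacheWord, L1SelfInvalidate, and the touched flag are illustrative names). At a parallel-phase boundary, the compiler supplies the set of regions written during the phase, and the core invalidates its valid-but-untouched words in those regions; words it registered or accessed itself stay usable, and no invalidation messages are ever sent.

    import java.util.*;

    enum DeNovoState { INVALID, VALID, REGISTERED }

    class CacheWord {
        DeNovoState state = DeNovoState.INVALID;
        int region;          // region id assigned by the compiler
        boolean touched;     // read or written by this core during the current phase
    }

    class L1SelfInvalidate {
        final List<CacheWord> words = new ArrayList<>();

        // Called at a parallel-phase boundary with the regions written in that phase.
        void selfInvalidate(Set<Integer> writtenRegions) {
            for (CacheWord w : words) {
                boolean mayBeStale = w.state == DeNovoState.VALID
                        && writtenRegions.contains(w.region)
                        && !w.touched;
                if (mayBeStale) {
                    w.state = DeNovoState.INVALID;  // no invalidation or ack messages needed
                }
                w.touched = false;                  // reset for the next phase
            }
        }
    }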

  24. Basic DeNovo Coherence [state diagram: Invalid, Valid, Registered; a read takes Invalid to Valid, a write takes Invalid or Valid to Registered] • Assume (for now): private L1, shared L2; single-word lines • Data-race freedom at word granularity • No space overhead: keep valid data or the registered core id; L2 data arrays double as the directory • No transient states
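
A rough sketch of the per-word L1 transitions in the three-state protocol just described (my own framing; the class and method names are made up, and the L2 side, which records the registering core's id in its data array, is omitted).

    enum WordState { INVALID, VALID, REGISTERED }

    class WordTransitions {
        // L1 transition for one word on a processor read.
        static WordState onRead(WordState s) {
            // An Invalid word fetches the up-to-date copy (from L2, or from the core
            // the L2 names as registered) and becomes Valid; Valid and Registered hit.
            return (s == WordState.INVALID) ? WordState.VALID : s;
        }

        // L1 transition for one word on a processor write.
        static WordState onWrite(WordState s) {
            // Every write registers the word: the L2 records this core as holding the
            // up-to-date copy. No invalidations or acks are needed, because data-race
            // freedom guarantees no other task reads this word in the same phase.
            return WordState.REGISTERED;
        }
    }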

  25. Example Run [animation: L1 of Core 1, L1 of Core 2, and the shared L2; registrations for the written words flow to L2, which acks, and words move among Registered, Valid, and Invalid]

    class Pair {
      X in DeNovo-region Blue;
      Y in DeNovo-region Red;

      void setX(int x) writes Blue {
        this.X = x;
      }
    }

    Pair PArray[size];
    ...
    Phase1 writes Blue { // DeNovo effect
      foreach i in 0, size {
        PArray[i].setX(…);
      }
      self_invalidate(Blue);
    }

  26. Practical DeNovo Coherence [diagram: cache "line merging", one tag per line] • Basic protocol impractical: high tag storage overhead (a tag per word) • Address/transfer granularity > coherence granularity • DeNovo line-based protocol: traditional software-oblivious spatial locality, with coherence granularity still at the word, so no word-level false sharing
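
One way to picture the line-based organization (a sketch under my own naming, not the paper's design: DeNovoLine, WORDS_PER_LINE, and readable are illustrative): a single address tag per line, but coherence state kept per word, so a line can hold a mix of Valid, Invalid, and Registered words.

    enum WordState { INVALID, VALID, REGISTERED }

    class DeNovoLine {
        static final int WORDS_PER_LINE = 8;   // illustrative line size, not from the talk

        long tag;                              // one address tag for the whole line
        final WordState[] wordState = new WordState[WORDS_PER_LINE];
        final int[] data = new int[WORDS_PER_LINE];

        // Coherence is still tracked per word, so a write by another core to word i
        // never invalidates word j in this line: no word-level false sharing, even
        // though data is still allocated and transferred at line granularity.
        boolean readable(int wordIndex) {
            return wordState[wordIndex] != WordState.INVALID;
        }
    }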

  27. Current Hardware Limitations (revisited: four of the items below are now checked off) • Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations • Storage overhead: directory overhead for sharer lists • Performance and power inefficiencies: invalidation and ack messages; false sharing; indirection through the directory; fixed cache-line communication granularity

  28. Protocol Optimizations: Insights • A traditional directory must be updated at every transfer; DeNovo can copy valid data around freely • Traditional systems send a cache line at a time; DeNovo uses regions to transfer only the relevant data

  29. Protocol Optimizations [diagram: L1 of Core 1, L1 of Core 2, and the shared L2, with words in Registered, Valid, or Invalid state; a core issues LD X3, illustrating direct cache-to-cache transfer and flexible communication]

  30. Protocol Optimizations (build of slide 29) • Direct cache-to-cache transfer • Flexible communication (both sketched below)
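
A very rough sketch of the response side of both optimizations (entirely my own framing; the talk only names the ideas, and ResponderSketch, wordRegion, and the assumption that this core holds the up-to-date copy of the requested word are mine): because sharers are never tracked, a core can hand its data to a requester directly without a directory update, and it can reply with just the up-to-date words of the requested region rather than a fixed cache line.

    class ResponderSketch {
        static final int WORDS_PER_LINE = 8;                       // illustrative
        WordState[] wordState = new WordState[WORDS_PER_LINE];     // per-word coherence state
        int[] data = new int[WORDS_PER_LINE];

        // Reply to a remote request for 'wordIndex', whose field lies in region
        // wordRegion[wordIndex]. Direct transfer: the reply goes straight to the
        // requester, with no directory update. Flexible communication: include the
        // other up-to-date words of the same region instead of the whole line.
        java.util.List<Integer> reply(int wordIndex, int[] wordRegion) {
            java.util.List<Integer> payload = new java.util.ArrayList<>();
            for (int i = 0; i < WORDS_PER_LINE; i++) {
                boolean sameRegion = wordRegion[i] == wordRegion[wordIndex];
                boolean upToDate = wordState[i] != WordState.INVALID;
                if (i == wordIndex || (sameRegion && upToDate)) {
                    payload.add(data[i]);
                }
            }
            return payload;
        }
    }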

  31. Current Hardware Limitations (revisited: with the optimizations, all seven items below are checked off) • Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations • Storage overhead: directory overhead for sharer lists • Performance and power inefficiencies: invalidation and ack messages; false sharing; indirection through the directory; fixed cache-line communication granularity

  32. Outline • Motivation • Background: DPJ • Base DeNovo Protocol • DeNovo Optimizations • Evaluation • Complexity • Performance • Conclusion and Future Work

  33. Protocol Verification • DeNovo vs. word-level MESI, model-checked with Murphi • Correctness: six bugs found in the MESI protocol, difficult to find and fix; three bugs in the DeNovo protocol, simple to fix • Complexity: 15x fewer reachable states for DeNovo; 20x shorter verification runtime

  34. Performance Evaluation Methodology • Simulator: Simics + GEMS + Garnet • System Parameters • 64 cores • Simple in-order core model • Workloads • FFT, LU, Barnes-Hut, and radix from SPLASH-2 • bodytrack and fluidanimate from PARSEC 2.1 • kd-Tree (two versions) [HPG 09]

  35. MESI Word (MW) vs. DeNovo Word (DW) [chart: FFT, LU, kdFalse, bodytrack, fluidanimate, radix, kdPadded, Barnes] • DW's performance is competitive with MW

  36. MESI Line (ML) vs. DeNovo Line (DL) [chart: FFT, LU, kdFalse, bodytrack, fluidanimate, radix, kdPadded, Barnes] • DL shows about the same or better memory stall time than ML • DL outperforms ML significantly on apps with false sharing

  37. Optimizations on DeNovo Line [chart: FFT, LU, kdFalse, bodytrack, fluidanimate, radix, kdPadded, Barnes] • Combined optimizations perform best, except for LU and bodytrack • Apps with low spatial locality suffer from line-granularity allocation

  38. Network Traffic [traffic chart: FFT, LU, kdFalse, bodytrack, fluidanimate, radix, kdPadded, Barnes] • DeNovo has less traffic than MESI in most cases • DeNovo incurs more write traffic due to word-granularity registration • Can be mitigated with a "write-combining" optimization

  39. Conclusion and Future Work • DeNovo rethinks hardware for disciplined models • Complexity: no transient states (20x faster to verify than MESI); extensible: optimizations without new states • Storage overhead: no directory overhead • Performance and power inefficiencies: no invalidations, acks, false sharing, or indirection; flexible, not cache-line, communication • Up to 81% reduction in memory stall time • Future work: region-driven layout, off-chip memory, safe non-determinism, sync, OS, legacy

  40. Thank You!
