DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism. Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. Department of Computer Science, University of Illinois at Urbana-Champaign.
Motivation • Shared memory is the de facto model for multicore SW and HW • BUT … • Complex SW: data races, unstructured parallelism, memory model, … • Inefficient HW: complex coherence/consistency, unnecessary traffic, … • Recent work on disciplined shared memory • SW: Easier programming model • HW: If SW is more disciplined, can we build more efficient HW? • DeNovo: Holistic rethinking of the entire memory hierarchy
Disciplined Shared Memory • Disciplined shared memory = global address space + explicit, structured side-effects • Replaces the implicit, anywhere communication and synchronization of conventional shared memory
Disciplined Shared Memory • Disciplined shared memory: explicit effects + structured parallel control • Deterministic Parallel Java (DPJ) – strong safety properties • Determinism-by-default, simple semantics (OOPSLA ’09) • DeNovo – performance-, complexity-, and power-efficient • Simplify coherence and consistency (PACT ’11)
Limitation • DeNovo for deterministic programs • Important assumptions • No conflicting concurrent accesses, only barrier synchronization • Known side-effects • These assumptions allowed DeNovo to eliminate design complexity and inefficiency • Challenges for non-deterministic programs • The assumptions no longer hold • Can have conflicting concurrent accesses; must support lock synchronization • Side-effects unknown in critical sections • Applications with lock-based non-determinism are common
Contribution • Disciplined shared memory: explicit effects + structured parallel control • Deterministic Parallel Java (DPJ) – strong safety properties • Determinism-by-default, simple semantics • Explicit & safe non-determinism (POPL ’11) • DeNovoND: Non-deterministic codes with benefits of DeNovo • Minimal additional HW for non-determinism • Comparable performance to MESI • 30% lower network traffic than MESI • PLUS all advantages of DeNovo for deterministic codes
Outline • Motivation • Background • DPJ/DeNovo for deterministic codes • DPJ support for disciplined non-determinism • DeNovoND Design • DeNovoND Implementation • Evaluation • Conclusion and Future Work
DPJ for Deterministic Codes • Java-compatible type system • Structured parallel control • Fork-join parallelism • Explicit regions and effects • Regions divide the heap • Read or write effects on regions • Data-race freedom guarantee • Simple, modular type checking • Hardware – simplify coherence problems! • [Figure: parallel tasks issuing stores (ST) and a load (LD), with write effects on heap regions]
DeNovo for Deterministic Codes • Coherence Enforcement • Invalidate stale copies in private cache • Track up-to-date copy • Explicit effects • Compiler knows all writeable regions in this parallel phase • Cache can self-invalidate before next parallel phase • Registration • Directory keeps track of one up-to-date copy • Writer registers itself before next parallel phase
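The two mechanisms on this slide, registration on writes and self-invalidation at the end of a parallel phase, can be sketched in a few lines of Python. This is a minimal illustrative model, not the DeNovo hardware: the class names `L2Registry` and `L1Cache`, the per-address granularity, and the way a load miss is handled are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    INVALID = 0
    VALID = 1
    REGISTERED = 2


class L2Registry:
    """Hypothetical stand-in for the shared L2: remembers the registered core per address."""

    def __init__(self):
        self.registered_core = {}  # address -> core id

    def register(self, addr, core_id):
        self.registered_core[addr] = core_id


class L1Cache:
    """Hypothetical private cache that tracks a region and a state per address."""

    def __init__(self, core_id, registry):
        self.core_id = core_id
        self.registry = registry
        self.lines = {}  # address -> (region, State)

    def load(self, addr, region):
        _, state = self.lines.get(addr, (region, State.INVALID))
        if state is State.INVALID:
            state = State.VALID  # miss: assume the up-to-date copy is fetched
        self.lines[addr] = (region, state)
        return state

    def store(self, addr, region):
        # The writer registers itself so the L2 knows where the up-to-date copy is.
        self.registry.register(addr, self.core_id)
        self.lines[addr] = (region, State.REGISTERED)

    def self_invalidate(self, written_regions):
        # At the end of a phase, drop Valid copies in regions that were writeable.
        # Registered copies are kept: this core wrote them, so they are up to date.
        for addr, (region, state) in list(self.lines.items()):
            if region in written_regions and state is State.VALID:
                self.lines[addr] = (region, State.INVALID)
```

In this sketch no invalidation message is ever sent from one core to another: each cache discards its own possibly stale copies based on the regions the phase could write.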
DeNovo for Deterministic Codes • No space overhead • Keep valid data or registered core id • LLC data arrays double as directory (registry) • No transient states • No invalidation traffic • No false sharing • [Figure: state diagram with the states Invalid, Valid, and Registered and transitions on Read and Write]
Example Run • [Figure: L1 caches of Core 1 and Core 2 above the shared L2; X and Y are in DeNovo regions. Events shown: stores (ST), registrations at the L2 with Acks, and self-invalidate( ). Legend: Registered, Valid, Invalid.]
DPJ Support for Safe Non-Determinism • Non-determinism comes from conflicting concurrent accesses • Isolate these accesses as “atomic” • Enclosed in “atomic” sections • “Atomic” regions and effects • “Disciplined” non-determinism • Race freedom, strong isolation • Determinism-by-default semantics • DeNovoND converts “atomic” statements into locks
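As a rough illustration of the last bullet, the sketch below guards the only conflicting accesses of a small fork-join program with a lock. It uses Python's standard `threading` module rather than DPJ, and the `SharedCounter` class and `run` helper are hypothetical names introduced for this example; DPJ's actual `atomic` syntax and its compilation to locks are not shown.

```python
import threading


class SharedCounter:
    """Shared state whose conflicting accesses are confined to a lock-protected section."""

    def __init__(self):
        self.value = 0
        self.finish_order = []
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # plays the role of an "atomic" section
            self.value += 1

    def record_finish(self, task_id):
        with self._lock:
            self.finish_order.append(task_id)


def run(num_tasks=4, increments=1000):
    counter = SharedCounter()

    def task(task_id):
        for _ in range(increments):
            counter.increment()
        counter.record_finish(task_id)

    # Fork-join: start all tasks, then wait for all of them.
    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, counter.finish_order
```

The final count is the same on every run, while the finish order may differ between runs. That is the kind of non-determinism the slide calls disciplined: it is confined to explicitly marked sections.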
Outline • Motivation • Background • DeNovoND Design • Memory Consistency Model • Distributed Queue-based Lock • DeNovoND Implementation • Evaluation • Conclusion and Future Work
Memory Consistency Model • Deterministic accesses: a load returns the value of the last store • By the same task in this parallel phase • Or before this parallel phase • Enforced by the DeNovo coherence mechanism • [Figure: ST 0xa followed by LD 0xa across a parallel phase]
Memory Consistency Model • Non-deterministic accesses: a load returns the value of the last store • By the same task in this parallel phase • Or before this parallel phase • Or in preceding critical sections • [Figure: ST 0xa in a parallel phase, ST 0xa in a critical section, then LD 0xa]
Coherence for non-deterministic data • Coherence Enforcement • Invalidate stale copies in private cache • Track up-to-date copy • When to invalidate? • Between the start of a critical section and any read • What to invalidate? • Entire cache? Regions with “atomic” effects? • Track atomic writes in a signature, transfer it with the lock • Registration • Writer registers before the next critical section
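The idea of tracking atomic writes in a signature and transferring it with the lock can be sketched as follows. This is a simplified model under stated assumptions: the signature is an exact Python `set` rather than a hardware filter, stores write through to a hypothetical `SharedL2` object, and the names `Core`, `acquire`, `release`, `atomic_store`, and `atomic_load` are invented for the example.

```python
class SharedL2:
    """Hypothetical shared level: holds values and the registered core per address."""

    def __init__(self):
        self.values = {}
        self.owner = {}

    def register(self, addr, core, value):
        previous = self.owner.get(addr)
        if previous is not None and previous is not core:
            previous.registered.discard(addr)  # registration moves to the new writer
        self.owner[addr] = core
        self.values[addr] = value  # simplification: write through


class Core:
    def __init__(self, l2):
        self.l2 = l2
        self.valid = {}          # addr -> locally cached (possibly stale) value
        self.registered = set()  # addresses this core currently has registered
        self.signature = set()   # atomic writes seen so far (exact stand-in)

    def acquire(self, incoming_signature):
        # The signature arrives together with the lock.
        self.signature |= incoming_signature

    def release(self):
        # The signature leaves together with the lock.
        return set(self.signature)

    def atomic_store(self, addr, value):
        self.valid.pop(addr, None)
        self.registered.add(addr)
        self.l2.register(addr, self, value)
        self.signature.add(addr)

    def atomic_load(self, addr):
        if addr in self.registered:
            return self.l2.values[addr]   # own write is up to date
        if addr in self.signature:
            self.valid.pop(addr, None)    # self-invalidate a possibly stale copy
        if addr not in self.valid:
            self.valid[addr] = self.l2.values[addr]  # read miss
        return self.valid[addr]
```

A core that acquires the lock therefore invalidates only the addresses named in the signature it received, not its entire cache. The sketch never resets the signature and re-invalidates on every load that hits it.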
Distributed Queue-based Lock • Lock primitive that works on DeNovoND • No directory, no write invalidation • No spinning for lock • Modeled after QOSB Lock • Lock requests form a distributed queue • But much simpler • Details in the paper
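The slide defers the design to the paper, so the following is only a generic sketch of a queue-based lock, not DeNovoND's protocol. It assumes a tail pointer kept at the lock's home location and a successor link kept by each waiter; the `QueueLock` class and its method names are hypothetical, and the model is sequential (no real concurrency or messages).

```python
class QueueLock:
    """Sequential model of a lock whose waiters form a linked queue."""

    def __init__(self):
        self.tail = None    # last requester, kept at the lock's home location
        self.holder = None  # core currently holding the lock
        self.next_of = {}   # core -> successor; in hardware this link would live at each core

    def request(self, core):
        """Enqueue a request. Returns True if the lock is granted immediately."""
        previous, self.tail = self.tail, core
        if previous is None:
            self.holder = core
            return True
        self.next_of[previous] = core  # predecessor learns who comes next
        return False                   # wait for a direct hand-off; no spinning

    def release(self, core):
        """Hand the lock directly to the successor, if any, and return it."""
        if self.holder != core:
            raise ValueError("only the holder can release the lock")
        successor = self.next_of.pop(core, None)
        if successor is None:
            self.tail = None  # queue is empty again
        self.holder = successor
        return successor
```

Because each release names exactly one successor, the lock moves from core to core in request order, and waiting cores never poll a shared location.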
Outline • Motivation • Background • DeNovoND Design • DeNovoND Implementation • Evaluation • Conclusion and Future Work
Access Signatures • Simple and small hardware Bloom filter per core • Track accesses with “atomic” effects only • Only 256 bits suffice • Operations on Bloom filter • On write: insert address • On read: query filter for address for self-invalidation
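A 256-bit Bloom filter with the two operations listed above can be sketched as below. The slide fixes only the size; the number of hash functions, the use of SHA-256 from Python's `hashlib` as a hash source, and the names `AccessSignature`, `insert`, `may_contain`, `merge`, and `reset` are assumptions for illustration.

```python
import hashlib


class AccessSignature:
    """Small Bloom filter over addresses; the bit vector is stored in one integer."""

    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address (hash choice is arbitrary here).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def insert(self, addr):
        """On an atomic write: record the address."""
        for position in self._positions(addr):
            self.bits |= 1 << position

    def may_contain(self, addr):
        """On an atomic read: True means the cached copy may be stale."""
        return all((self.bits >> position) & 1 for position in self._positions(addr))

    def merge(self, other):
        """Combine with a signature received from another core."""
        self.bits |= other.bits

    def reset(self):
        self.bits = 0
```

A Bloom filter can report an address that was never inserted, but it never misses one that was. For self-invalidation that error direction is safe: a false positive costs an unnecessary miss, while a written address is always caught.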
Example Run • [Figure: L1 caches of Core 1 and Core 2 above the shared L2. X and Y are in DeNovo regions; Z and W are in atomic DeNovo regions. Events shown: stores (ST) with registration and Ack, lock transfers carrying the signature for Z and W, loads (LD) with read misses, self-invalidate( ), and a filter reset.]
Optimization to reduce self-invalidation • Loads to words in Registered state • “Touched-atomic” bit • Set on first atomic load • Subsequent loads don’t self-invalidate • More in the paper • [Figure: X and Y in DeNovo regions, Z and W in atomic DeNovo regions; a store (ST) and repeated loads (LD), followed by self-invalidate( )]
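The effect of the two optimizations on a single cached word can be sketched like this. It is a simplified model: the `Word` class and `atomic_load` function are hypothetical, the signature lookup is passed in as a boolean, and the point at which the hardware clears the touched-atomic bit is not modeled because the slide does not state it.

```python
class Word:
    """One cached word with a coherence state and a touched-atomic bit."""

    def __init__(self, state="Invalid"):
        self.state = state  # "Invalid", "Valid", or "Registered"
        self.touched_atomic = False


def atomic_load(word, in_signature):
    """Perform an atomic load on a word. Returns True if the load misses."""
    if word.state == "Registered":
        return False  # this core wrote the word, so no self-invalidation is needed
    if in_signature and not word.touched_atomic:
        word.state = "Invalid"  # first atomic load: drop the possibly stale copy
    word.touched_atomic = True
    miss = word.state == "Invalid"
    if miss:
        word.state = "Valid"  # assume the miss fetches the up-to-date copy
    return miss
```

For a word that is Valid and named in the signature, the first call misses and every later call hits, which is the saving the touched-atomic bit is meant to provide.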
Overheads • Hardware Bloom filter • 256 bits per core • Storage overhead • One additional state, but no storage overhead (2 bits) • “Touched-atomic” bit per word in L1 • Communication overhead • Bloom filter piggybacked on lock transfer message • Writeback messages for locks • Lock writebacks carry more info
Evaluation Methodology • Simulator: Simics + GEMS + Garnet • System Parameters • 16 in-order cores • Workloads • SPLASH-2, PARSEC and STAMP • Unchanged except region/effect and self-invalidation • Protocols • MESI and DeNovoND • With idealized locks and realistic locks
MESI vs. DeNovoND: Idealized lock • DeNovoND performs comparably to MESI for all apps • For both DIL-INF and DIL-256 • [Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
MESI vs. DeNovoND: Realistic lock • pthread lock vs. distributed queue-based lock • DeNovoND performs comparably to or better than MESI • [Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
Network Traffic (Realistic lock) • DeNovoND has 33% less traffic than MESI (67% max) • No invalidation traffic • Reduced load misses due to lack of false sharing • [Figure: per-application network traffic for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
Conclusions and Future Work • DeNovoND: Efficient HW support for non-determinism • Minimal additional HW for safe non-determinism • Comparable performance to MESI • 30% lower network traffic than MESI • PLUS all advantages of DeNovo for deterministic codes • Future work: broaden the application space further • Pipeline parallelism, “lock-free” data structures, OS, legacy codes…