Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models • Harold “Trey” Cain, IBM T.J. Watson Research Center • Prof. Mikko H. Lipasti, University of Wisconsin • RACES’12

Gotta go back in time! • Part of Ph.D. Dissertation • Never submitted for publication, until now. • Looked particularly relevant when I saw the RACES CFP. • Journey back in time to the year 2004, when… • … Mark Zuckerberg launched Facebook • … Janet Jackson suffered a “wardrobe malfunction” during the Super Bowl halftime show • … an incumbent president was being challenged by a Massachusetts politician • 88 mph here we come! Cain and Lipasti RACES’12

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering • From the RACES website: • “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.” • A hardware developer’s perspective: • Constraints of Legacy Code • What if we want to apply this principle, but have no control over the applications that are running on a system? • Can one build a coherence protocol that avoids synchronizing cores as much as possible? • For example by allowing each core to use stale versions of cache lines as long as possible • While maintaining architectural correctness; i.e. we will not break existing code • If we do that, what will happen?

Cache-Coherent Shared-memory multiprocessors • Are ubiquitous • Coherence misses are a major source of performance loss for shared memory applications [Figure panels: 10 years ago, Today]

[Figure: 16 MB L3 cache misses per 1000 instructions]

Edge-Chasing Delayed Consistency (ECDC) • A new hardware implementation of POWER weak ordering • Not a new consistency model • Allows a cache line to be non-speculatively read after being invalidated. • Based on necessary conditions • Processor must fetch new data only if causally dependent on it.

Constraint graph • Introduced for SC by Landin et al., ISCA-18 • A directed graph represents a multithreaded execution • Nodes represent dynamic instances of instructions • Edges represent their transitive orders (program order, RAW, WAW, WAR). • If the constraint graph is acyclic, then the execution is correct
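The acyclicity test on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' tooling: the `has_cycle` helper and the node labels are hypothetical, and the sample edges describe an assumed two-processor execution with program-order, RAW, and WAR edges.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    # 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    state = {node: 0 for node in graph}

    def visit(node):
        state[node] = 1
        for succ in graph[node]:
            # A successor that is still on the DFS path closes a cycle.
            if state[succ] == 1 or (state[succ] == 0 and visit(succ)):
                return True
        state[node] = 2
        return False

    return any(state[node] == 0 and visit(node) for node in graph)


# Hypothetical execution: P1 stores A then B, P2 loads B then A.
sc_edges = [
    ("P1: ST A", "P1: ST B"),  # program order
    ("P1: ST B", "P2: LD B"),  # RAW: the load observes the store to B
    ("P2: LD B", "P2: LD A"),  # program order
    ("P2: LD A", "P1: ST A"),  # WAR: the load read A before the store
]
print(has_cycle(sc_edges))       # cyclic, so not a correct SC execution
print(has_cycle(sc_edges[:-1]))  # without the WAR edge the graph is acyclic
```

Dropping the WAR edge corresponds to the load of A observing the new value, which leaves the graph acyclic.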

Constraint graph example - WO • Proc 1: ST A; MB; ST B • Proc 2: LD B; MB; LD A • Edges: ST A → MB (ST->MB order), MB → ST B (MB->ST order), ST B → LD B (read-after-write dependence order), LD B → MB (LD->MB order), MB → LD A (MB->LD order), LD A → ST A (write-after-read dependence order) • Observation: An aggressive coherence protocol can ignore coherence messages unless doing so will create a cycle in the constraint graph

Edge-chasing delayed consistency • Based on edge-chasing algorithms used by distributed database systems for deadlock detection • A cycle in the wait-for graph (WFG) is detected when a locally created probe is received [Figure: probe passed among P1, P2, P3, P4]
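A minimal sketch of edge chasing for deadlock detection follows. It simulates the distributed probe forwarding in a single process, which is a simplification; the `probe_returns` function and the `wait_for` dictionary layout are hypothetical names chosen for illustration.

```python
def probe_returns(wait_for, initiator):
    """Forward a probe created by `initiator` along wait-for edges.

    Returns True if the probe arrives back at its creator, which means the
    wait-for graph contains a cycle through `initiator`.
    """
    forwarded = set()  # processes that have already forwarded this probe
    pending = list(wait_for.get(initiator, ()))
    while pending:
        proc = pending.pop()
        if proc == initiator:
            return True  # the locally created probe was received again
        if proc not in forwarded:
            forwarded.add(proc)
            pending.extend(wait_for.get(proc, ()))
    return False


# Hypothetical wait-for graph: P1 waits for P2, P2 for P3, P3 for P4, P4 for P1.
ring = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
chain = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}
print(probe_returns(ring, "P1"), probe_returns(chain, "P1"))
```

In a real distributed system each process would only see its own outgoing edges and forward the probe in a message; the loop above stands in for that message passing.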

ECDC - Basic idea • Observation: Cycles in constraint graph can be detected using a similar mechanism • Protocol: • Upon write miss, create a “probe” • Upon receipt of invalidation, add probe to cache line • Continue to read stale block until the probe is re-observed on another message • Pass probe to other processors at communication
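The receiver side of this protocol can be sketched as below. This is a simplified model under stated assumptions, not the hardware design: the `StaleLine` class and its method names are hypothetical, probes are modeled as plain strings, and a `None` return from `read` stands for "the stale copy may no longer be used, take a miss".

```python
class StaleLine:
    """One cache line that may keep serving reads after an invalidation."""

    def __init__(self, data):
        self.data = data
        self.supplanter = None  # probe attached when the invalidation arrived
        self.usable = True

    def on_invalidate(self, probe):
        # Remember the writer's probe instead of discarding the line.
        self.supplanter = probe

    def on_message(self, probe_set):
        # Seeing the supplanter probe again means this processor may now be
        # causally dependent on the write, so the stale copy must go.
        if self.supplanter is not None and self.supplanter in probe_set:
            self.usable = False

    def read(self):
        return self.data if self.usable else None


line = StaleLine(data=42)
line.on_invalidate("probeA")
before = line.read()                   # stale value is still readable
line.on_message({"probeX"})            # unrelated probe: keep the stale copy
unrelated = line.read()
line.on_message({"probeA", "probeX"})  # supplanter re-observed
after = line.read()
print(before, unrelated, after)
```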

Example – necessary miss (SC) • Proc 1: LD A; LD B; LD A • Proc 2: ST A; ST B • Line A is in proc 1’s cache with valid bit = 1 at the first LD A and valid bit = 0 (supplanter ProbeA) at the second • Edges: LD A → ST A (WAR), ST B → LD B (RAW), ST A → second LD A (RAW)

Detecting critical writes • Some write values shouldn’t be delayed (e.g. lock releases, barriers, etc.) • Two heuristics • Atomic primitives – any cache block that has been touched by a store-conditional should not be delayed • Polling detection – If consecutive cache accesses have same PC and address, discard stale line
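The polling-detection heuristic can be sketched as follows. Only that second heuristic is modeled; the `PollingDetector` class and its `is_polling` method are hypothetical names, and the PC and address values are made up for the example.

```python
class PollingDetector:
    """Flag an access whose PC and address match the previous cache access."""

    def __init__(self):
        self.last_access = None  # (pc, address) of the previous cache access

    def is_polling(self, pc, address):
        access = (pc, address)
        polling = access == self.last_access
        self.last_access = access
        return polling


detector = PollingDetector()
first = detector.is_polling(pc=0x400, address=0x80)   # nothing to compare with
second = detector.is_polling(pc=0x400, address=0x80)  # same PC and address
third = detector.is_polling(pc=0x404, address=0x80)   # different PC
print(first, second, third)
```

When `is_polling` returns True for a stale line, the heuristic on the slide would discard that line so the spinning load sees the new value.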

Performance Evaluation • PHARMSim – Cycle-mode Full System Simulator • Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator • Out-of-order single-threaded core • 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines • Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock) • Stride-based prefetcher modeled after Power4 • Lock-free list insertion microbenchmark • Full applications • SPLASH2: fft, fmm, ocean, radix, raytrace • Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99

Why delayed consistency? • False sharing/Silent sharing • Convergent/Data-race tolerant algorithms • Genetic algorithms • Parallel equation solvers • Sparse matrix factorization • Lock-free parallel linked data structures

Lock-free Algorithms • For example list insertion: • New node’s next pointer set to cur • CAS operation atomically updates prev’s next pointer to new • Increasingly common [Figure: list nodes prev and cur, with node new inserted between them]
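The insertion steps can be sketched as below. Python has no hardware compare-and-swap, so this sketch emulates CAS with a per-node lock purely to make the retry loop visible; that emulation, the `Node` class, and the `insert_after` helper are all assumptions for illustration, and a real lock-free implementation would use an atomic instruction instead.

```python
import threading


class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self._lock = threading.Lock()  # only used to emulate an atomic CAS

    def cas_next(self, expected, new):
        """Set self.next to `new` only if it still equals `expected`."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False


def insert_after(prev, key):
    """Insert a new node after `prev`, retrying if another thread interferes."""
    while True:
        cur = prev.next
        new = Node(key, next_node=cur)  # new node's next pointer set to cur
        if prev.cas_next(cur, new):     # CAS updates prev's next pointer to new
            return new


head = Node(0)
tail = Node(9)
head.next = tail
inserted = insert_after(head, 5)
stale_cas = head.cas_next(tail, Node(7))  # fails: head.next is no longer tail
print(head.next.key, inserted.next.key, stale_cas)
```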

Prior work (Delayed consistency) • Invalidate-based receiver-delayed protocols, sender-delayed protocols (Dubois et al., SC ’91) • Lazy release consistency (Keleher et al., ISCA ’92) • Update-based receiver-delayed, sender-delayed protocols (Afek et al., TOPLAS ’93) • Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95) • Write cache for reducing bandwidth in update coherence protocol (Dahlgren and Stenstrom, JPDC ’95)

Lock-free list microbenchmark • Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02] • 15 threads randomly updating or searching linked list, 1 thread performing searches

Intolerable miss reduction • Left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set

ECDC Performance (Infinite resources)

Conclusions • Of nine applications studied, performance improvement for two • Mostly due to reduction in false sharing misses • Other applications: • Not enough coherence misses, or • The avoidance of those misses does not improve performance • We believe these results generalize to lock-based programs • Other programming models may have potential • As shown, lock-free data structures • Should also apply to transactional programming model • But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth • Best to identify apps with a communication bottleneck before attacking

Questions?

Backup slides

Base machine model

Causality (Lamport) • An instruction i is causally dependent upon instruction j if there is a directed path from j to i • Two operations are concurrent if neither causally depends upon the other • Coherence misses are a significant source of performance degradation for many applications • If two operations are concurrent, why is their performance penalized? [Figure: timeline of P1, P2, P3 executing st A, st C, ld A, st B, ld C, ld B, ld A]
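The two definitions translate directly into reachability queries. The sketch below assumes the execution is given as an adjacency dictionary; the `causally_depends` and `concurrent` helpers and the three-operation example are hypothetical.

```python
def causally_depends(graph, i, j):
    """True if there is a directed path from j to i (i depends on j)."""
    stack, seen = [j], set()
    while stack:
        node = stack.pop()
        for succ in graph.get(node, ()):
            if succ == i:
                return True
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return False


def concurrent(graph, a, b):
    """Two operations are concurrent if neither causally depends on the other."""
    return not causally_depends(graph, a, b) and not causally_depends(graph, b, a)


# Hypothetical execution: P2's load observes P1's store; P3's store is unrelated.
execution = {
    "P1: st A": ["P2: ld A"],
    "P2: ld A": [],
    "P3: st C": [],
}
print(causally_depends(execution, "P2: ld A", "P1: st A"))
print(concurrent(execution, "P1: st A", "P3: st C"))
```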

Prior work: formal memory model representations • Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13) • Acyclic graph representation (Landin et al., ISCA-18) • Modeling memory operation as a series of sub-operations (Collier, RAPA) • Acyclic graph + sub-operations (Adve, thesis) • Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)

Anatomy of a cycle • Proc 1: ST A; ST B • Proc 2: LD B; LD A • Edges: ST A → ST B (program order), ST B → LD B (RAW, created by a cache miss), LD B → LD A (program order), LD A → ST A (WAR, created by an incoming invalidate)

Other prior work • Speculative stale value usage • LVP with Stale Values (Lepak, Ph.D. Thesis ’03) • Coherence Decoupling (Huh et al., ASPLOS ’04) • Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00)

Constraint graph extensions • Constraint graph definition differs for other consistency models • Processor consistency • Remove program order edges from stores to subsequent loads • Remaining single-thread orders: edges from • Loads to subsequent loads • Stores to subsequent stores • Loads to subsequent stores

Constraint graph extensions • Constraint graph definition differs for other consistency models • Weak ordering • Remove program order edges • Add single-thread ordering edges between • memory barrier and preceding/following instructions • same address reads/writes • dependent instructions
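The single-thread ordering rules for SC, processor consistency (PC), and weak ordering (WO) can be sketched as one function. This is an illustration under assumptions: the `order_edges` name, the `(kind, address)` encoding of instructions, and the explicit `deps` argument for dependent instructions are all hypothetical, and the edges are returned as index pairs within one thread.

```python
def order_edges(thread, model, deps=()):
    """Return the set of (earlier, later) index pairs that `model` orders.

    `thread` is a list of (kind, address) tuples with kind in {"LD", "ST", "MB"}.
    `deps` lists dependent-instruction pairs, which are only used for "WO".
    """
    edges = set(deps) if model == "WO" else set()
    for i, (kind_i, addr_i) in enumerate(thread):
        for j in range(i + 1, len(thread)):
            kind_j, addr_j = thread[j]
            if model == "SC":
                ordered = True  # full program order
            elif model == "PC":
                # Drop only the store-to-subsequent-load edges.
                ordered = not (kind_i == "ST" and kind_j == "LD")
            elif model == "WO":
                barrier = "MB" in (kind_i, kind_j)
                same_address = addr_i is not None and addr_i == addr_j
                ordered = barrier or same_address
            else:
                raise ValueError(f"unknown model: {model}")
            if ordered:
                edges.add((i, j))
    return edges


dekker_thread = [("ST", "A"), ("LD", "B")]
barrier_thread = [("ST", "A"), ("MB", None), ("ST", "B")]
print(order_edges(dekker_thread, "SC"))
print(order_edges(dekker_thread, "PC"))
print(order_edges(barrier_thread, "WO"))
```

The edge sets list every ordered pair rather than only adjacent ones, which keeps the PC rule a one-line filter.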

PC Example – Dekker’s Alg. • Proc 1: ST A; LD B • Proc 2: ST B; LD A • Write-after-read dependence order: LD B → ST B, LD A → ST A • Lack of store-to-load order results in acyclic graph

Constraint graph example - SC • Proc 1: ST A; ST B • Proc 2: LD B; LD A • Edges: ST A → ST B (program order), ST B → LD B (read-after-write dependence order), LD B → LD A (program order), LD A → ST A (write-after-read dependence order) • Cycle indicates that execution is incorrect

Constraint graph example - PC • Proc 1: ST A; ST B • Proc 2: LD B; LD A • Edges: ST A → ST B (program order), ST B → LD B (read-after-write dependence order), LD B → LD A (program order), LD A → ST A (write-after-read dependence order)

ECDC Conceptual Description • Identify causal dependences (upstream probe sets) • 1 upstream set per processor • 2 upstream sets per cache block (read set, write set) • Communicating dependences • Probe sets passed on response messages • Probes attached to incoming invalidation messages • Extra ProbePropagation messages sent at memory barriers • Identifying usable stale blocks • Extra stable state in cache (ST) • Supplanter probe

ECDC Operation [Table: contents of the processor upstream set Φproc-upstream and of the per-block sets Φ(read|write)A and Φ(read|write)B, shown initially and after each of: 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C]

Finite ECDC Performance • When restricting PPB/STAB resources (220 KB per processor) • 16k probe lifetime counter • 128 entry STAB per processor • 32 Entry PPB per processor/directory controller (256 PPB virtual namespace) • TPC-H/SPECweb99 performance is within the margin of error of the infinite-resource results

Non-atomicity of writes • Absent from model • Effect on optimizations • Forces unnecessary orders to exist • Correct, but another example of over-conservatism • Hopefully, infrequent performance divot • Example: Processor p1: st r1, [A] • Processor p2: ld r1, [A]; st r2, [r1] • Processor p3: ld r1, [B]; membar; ld r2, [A]

ECDC Base machine model

Mapping ECDC to HW • STAB – Maintains supplanting probe for each stale cache block • PPB – Maintains approximation of upstream sets • In caches – 2 extra bits for stale state and synch heuristic [Figure: block diagram showing processor (P), I$, D$, L2 $, STAB, PPB, Castout PPB, NIC, Dir, MemCtr, DRAM]

Probe representation • Each probe represented by n-bit timer • Stale block may be used until supplanting probe timer expires • Probe set in p-processor system represented by p timers

STAB Detail [Figure: incoming invalidates update per-processor counters (p1: 21646, p2: 13523, p3: 998); STAB entries sit beside the cache and pair an address with a timer: 0xc123 → 8123, 0x24e2 → 12525, 0x8000 → 10425, 0xf2e5 → 92569, 0x112c → 998]

PPB Detail [Figure: an incoming upstream set is mapped through an address hash into a timer index table backed by a shift register of probe timers; expired entries form the expired upstream set]

Memory consistency review • Memory consistency model • Specifies the programming interface to a shared memory • i.e. the allowable interleaving of instructions • Models discussed here: • Sequential Consistency • Processor Consistency • No store-to-load program order • Weak Ordering • Order wrt memory barriers • Same-address order • Dependence order

Example – necessary miss (SC) • Proc 1: LD A; LD B; LD A • Proc 2: ST A; ST B • Block A is in proc 1’s cache with valid bit = 1 at the first LD A and valid bit = 0 at the second • Edges: LD A → ST A (WAR), ST A → ST B (PO), ST B → LD B (RAW), LD A → LD B (PO), LD B → LD A (PO)

Example – avoidable miss (SC) • Proc 1: LD A; LD B; LD A • Proc 2: ST B; ST A • Block A is in proc 1’s cache with valid bit = 1 at the first LD A and valid bit = 0 at the second • Edges: LD A → ST A (WAR), ST B → ST A (PO), ST B → LD B (RAW), LD A → LD B (PO), LD B → LD A (PO)

Typical ReadX transaction [Figure: requester R, home H, sharers S1 and S2; messages: 1. ReadX, 2(a) Sharers/Data, 2(b) Inval to S1, 2(c) Inval to S2, 3(a) Inval Ack from S1, 3(b) Inval Ack from S2] • When sending invalidation, create probe, add to PPB • At receipt of invalidation (2b, 2c) add probe to STAB • When sending invalidate acknowledgment, add probe set to the response • When receiving invalidate acknowledgment, add incoming probe set to the PPB

Invalidation to read distance

Invalidation to read distance (synch)

Invalidation to read distance (data)

STAB entry death cdf

STAB Entry Lifetime
