Speculative Sequential Consistency with Little Custom Storage
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
http://www.ece.cmu.edu/~puma2
Chris Gniady and Babak Falsafi

Distributed Shared Memory (DSM)
[Diagram: several nodes, each with a CPU, cache, memory bus, memory and DSM hardware, connected by a network]
Logically shared but physically distributed memory
• Shared-memory programming
• Scalable
• Long shared memory access can be a bottleneck!

Programming DSM
To achieve high performance:
• Release Consistency (RC)
  • Relaxes memory order
  • Software annotation
What programmers want:
• Sequential Consistency (SC)
  • Intuitive
  • Memory order enforced (slow)
Prior work: Speculative SC (SC++) [ISCA’99]
• Hardware speculatively relaxes order
• High performance & intuitive
• Large custom “speculative history” queue

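Why SC is the "intuitive" model can be illustrated with the classic store-buffering litmus test. The sketch below is a toy enumeration, not anything from the talk: it lists every global order of four memory operations, once keeping each thread's program order (SC) and once dropping program order entirely. The fully unordered case is a deliberately crude stand-in for a relaxed model and is not a faithful model of RC. The names `THREADS`, `run` and `outcomes` are made up for illustration.

```python
from itertools import permutations

# Store-buffering litmus test (x and y start at 0):
#   Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
THREADS = [
    [("st", "x"), ("ld", "y", "r0")],
    [("st", "y"), ("ld", "x", "r1")],
]

def run(schedule):
    """Execute one global order of (thread, op_index) pairs; return (r0, r1)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for tid, idx in schedule:
        op = THREADS[tid][idx]
        if op[0] == "st":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    return regs["r0"], regs["r1"]

def outcomes(keep_program_order):
    """Collect every (r0, r1) result reachable under the chosen ordering rule."""
    ops = [(t, i) for t in range(2) for i in range(2)]
    results = set()
    for order in permutations(ops):
        if keep_program_order:
            # SC: each thread's operations appear in program order.
            if any(order.index((t, 0)) > order.index((t, 1)) for t in range(2)):
                continue
        results.add(run(order))
    return results

sc_outcomes = outcomes(keep_program_order=True)
relaxed_outcomes = outcomes(keep_program_order=False)
```

In this toy model `(0, 0)` never appears in `sc_outcomes` but does appear in `relaxed_outcomes`, which is the kind of surprise that RC asks the programmer to rule out with annotations.
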
This Talk’s Contributions
• Characterize history size across apps
  • Varies from 16 to 8K entries!
  • Bursty: Over 85% of time empty
• Propose SC++Lite
  • Allocates history in memory hierarchy
  • Enhances scalability across apps & systems
  • Reduces custom storage from 51 KB to 2 KB
Result: Speculative SC (almost) for Free!

Outline
• Overview
• Memory Ordering in RC
• Memory Ordering in SC++
• SC++Lite: SC++ with Little Custom Storage
• Results
• Conclusions

Memory Ordering in RC
[Diagram: reorder buffer holding ST X, LD A, ST A, LD Y, LD Z; LD/ST queue with ST X, LD Y and LD Z misses outstanding; LD A and ST A pass through the ALU and retire out of order]
• “LD A” & “ST A” retire out of order
• Overlaps “ST X”, “LD Y” & “LD Z” misses
• Software guarantees overlap is ok!

SC++: Hardware Relaxes Memory Order [ISCA’99]
[Diagram: reorder buffer holding ST X, LD A, ST A, LD Y, LD Z; LD/ST queue with ST X, LD Y and LD Z misses outstanding; speculatively retired instructions enter the speculative history queue; incoming coherence messages are looked up in the history for potential rollback]
• Speculatively retires instructions in hardware
• Rolls back when coherence messages hit in history

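The two bullets above can be sketched as a small software model. This is a rough illustration under simplifying assumptions, not the hardware design from the ISCA’99 paper: each history entry is reduced to a sequence number plus the block address it touched, and the class and method names (`SpeculativeHistory`, `retire_speculatively`, `commit_through`, `on_coherence_message`) are hypothetical.

```python
from collections import deque

class SpeculativeHistory:
    """Toy model of a speculative history queue (all names are hypothetical)."""

    def __init__(self):
        self.entries = deque()  # (seq, block_addr), oldest entry first

    def retire_speculatively(self, seq, block_addr):
        # Record an instruction retired while an older miss is still pending.
        self.entries.append((seq, block_addr))

    def commit_through(self, seq):
        # Older misses completed: entries up to seq are no longer speculative.
        while self.entries and self.entries[0][0] <= seq:
            self.entries.popleft()

    def on_coherence_message(self, block_addr):
        # Find the oldest speculative entry that touched the incoming block.
        hit = next((s for s, a in self.entries if a == block_addr), None)
        if hit is None:
            return None  # no conflict, speculation stays valid
        # Squash the conflicting instruction and everything younger than it.
        while self.entries and self.entries[-1][0] >= hit:
            self.entries.pop()
        return hit  # sequence number to roll back to
```

For example, after speculatively retiring instructions 1, 2 and 3 on blocks `0xA0`, `0xB0` and `0xC0`, a coherence message for `0xB0` returns 2 and leaves only instruction 1 in the history.
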
SC++’s Implementation Overhead
Speculative History Queue:
• On-chip custom storage
• Grows up to subsequent missing load
• Size is application & system dependent
• Must assume worst-case size at design!
Can we (virtually) eliminate custom storage in SC++?

SC++Lite: SC++ with Little Custom Storage
Store history into memory hierarchy!
• Queue allocated at boot time in physical memory
• Use block buffer to pack history, ship to L2
• Store ack updates head pointer (in LD/ST queue)
• ROB retirement updates tail pointer
• “Dead” history is not written back!

Memory Ordering in SC++Lite
[Diagram: reorder buffer (LD A, ST A, LD Y, LD Z) and LD/ST queue with outstanding misses (ST X, ST Z, LD Y, LD Z) and a head index; speculatively retired instructions enter the speculative block buffer, which ships cache blocks to a location in L2; incoming coherence messages are looked up for potential rollback]
• Only history burst retires into L2
• History in L2 typically discarded

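The memory-resident queue described on the last two slides can be approximated in software as follows. This is a minimal sketch under stated assumptions rather than the actual SC++Lite hardware: indices grow monotonically instead of wrapping around a fixed region, L2 is modeled as a dictionary of packed blocks, and every name (`MemoryResidentHistory`, `retire`, `store_ack`, `discarded`) is hypothetical.

```python
class MemoryResidentHistory:
    """Toy model: history packed into blocks, shipped to L2, tracked by head/tail."""

    def __init__(self, entries_per_block=4):
        self.entries_per_block = entries_per_block
        self.head = 0            # index of the oldest live entry
        self.tail = 0            # index of the next entry to write
        self.block_buffer = []   # partially filled block, still on chip
        self.l2 = {}             # block number -> packed entries held in L2
        self.discarded = 0       # blocks dropped without any writeback

    def retire(self, entry):
        """ROB retirement appends an entry and advances the tail pointer."""
        self.block_buffer.append(entry)
        self.tail += 1
        if len(self.block_buffer) == self.entries_per_block:
            block_no = (self.tail - 1) // self.entries_per_block
            self.l2[block_no] = self.block_buffer  # ship the full block to L2
            self.block_buffer = []

    def store_ack(self, committed_upto):
        """A store ack advances the head; entries below it are no longer live."""
        self.head = max(self.head, min(committed_upto, self.tail))
        first_live_block = self.head // self.entries_per_block
        for block_no in [b for b in self.l2 if b < first_live_block]:
            del self.l2[block_no]  # dead history: dropped, never written back
            self.discarded += 1

    def live_entries(self):
        return self.tail - self.head
```

Retiring ten entries with four entries per block ships two full blocks to the modeled L2 and leaves two entries in the block buffer; a store ack that commits the first eight entries then discards both blocks without writing them back.
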
SC++Lite Design Requirements
Avoid perturbing application’s critical path!
SBB (Speculative Block Buffer):
• Size depends on L2 latency & retirement rate
• Large enough to filter store hits into L2
L2:
• Required bandwidth is proportional to retirement rate
• Large blocks help
• Small blocks may need multiporting
• Head & tail registers reduce history traffic

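The dependence of SBB size on L2 latency and retirement rate can be made concrete with a back-of-the-envelope estimate. The formula below is an assumption made for illustration (entries retired while one block is in flight to L2, rounded up to whole blocks), not a sizing rule stated in the talk, and the function name and example numbers are hypothetical.

```python
import math

def sbb_entries_needed(l2_latency_cycles, retire_per_cycle, entries_per_block):
    """Estimate SBB capacity as the entries retired during one L2 access."""
    in_flight = l2_latency_cycles * retire_per_cycle
    # Round up to whole blocks, since history is shipped one block at a time.
    blocks = math.ceil(in_flight / entries_per_block)
    return blocks * entries_per_block

# Hypothetical inputs: 10-cycle L2 access, 2.5 entries retired per cycle,
# 8 history entries packed per cache block.
example_entries = sbb_entries_needed(10, 2.5, 8)
```

With these made-up inputs the estimate comes to 32 entries; a longer L2 latency or a faster retirement rate raises it proportionally.
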
Outline
• Overview
• Memory Ordering in RC
• Memory Ordering in SC++
• SC++Lite: SC++ with Little Custom Storage
• Results
• Conclusions

Experimental Methodology
Using RSIM:
• 16 nodes with 1 GHz, 8-issue CPU
• 128-entry ROB & LD/ST queue
• Average remote-to-local access ratio of ~2
• 32-Kbyte, direct-mapped L1 cache
• 512-Kbyte, 8-way L2 cache, 64 GB/s
• 256-entry Lookup Table
• 32-entry SBB

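To put the L2 bandwidth figure in perspective, the arithmetic below converts it to bytes per CPU cycle. It is only a rough calculation under assumptions that the slide does not state: "GB" is taken as 10^9 bytes, and the peak retirement rate is taken to equal the 8-wide issue width.

```python
L2_BANDWIDTH_BYTES_PER_S = 64 * 10**9  # "64 GB/s", assuming GB = 10^9 bytes
CLOCK_HZ = 10**9                       # 1 GHz CPU clock
RETIRE_WIDTH = 8                       # assumed equal to the 8-issue width

# Bytes the L2 can move per CPU cycle at peak bandwidth.
bytes_per_cycle = L2_BANDWIDTH_BYTES_PER_S // CLOCK_HZ

# Peak L2 bytes available per instruction if all 8 slots retire every cycle.
bytes_per_retired_instruction = bytes_per_cycle / RETIRE_WIDTH
```

Under these assumptions the L2 can move 64 bytes per cycle, or 8 bytes per retired instruction at peak retirement. Since the history is empty most of the time, sustained history traffic would be well below that peak.
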
History Size Characterization
• System & application dependent: varies 16–4K entries
• History is bursty: non-empty less than 15% of the time

Base RC, SC++ & SC++Lite
• Up to 80% gap between SC & RC
• 31% average speedup for SC++, 28% for SC++Lite

Sensitivity to 4x Network Latency
• SC++ requires 2x queue size to perform best
• SC++Lite’s performance remains stable

Custom Storage Requirements
SC++:
• ~51 KB of custom storage
• Doubles for 4x network latency
• Radix shows worst-case history
SC++Lite:
• ~2 KB of custom storage for all apps
• Performance insensitive to network latency

Conclusions
Previously showed [ISCA’99]:
• Speculative SC achieves RC’s performance
This talk:
• Proposed SC++Lite
  • Allocates history in memory hierarchy
  • Enhances scalability across apps & systems
Result: Speculative SC (almost) for Free!

For More Information
Please visit our web site at
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
http://www.ece.cmu.edu/~puma2