Is SC + ILP = RC?

C. Gniady, B. Falsafi, and T.N. Vijaykumar - Purdue
Presented by: Eric Carty-Fickes

Introduction • SC (sequential consistency) • memory order enforced by hardware • easier to program • worse performance due to conservatism • RC (release consistency) • memory order specified by software • harder to program • better performance due to explicitness
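The "easier to program" point can be made concrete with the classic two-processor store/load test. The sketch below is only an illustration, not anything from the paper: the names `P0`, `P1`, `interleavings`, and `sc_outcomes` are made up here. It enumerates every interleaving that keeps each processor's program order, which is what SC promises the programmer.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each list."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Each operation is (kind, location, destination register or None).
P0 = [("st", "x", None), ("ld", "y", "r0")]  # P0: x = 1; r0 = y
P1 = [("st", "y", None), ("ld", "x", "r1")]  # P1: y = 1; r1 = x


def sc_outcomes(p0, p1):
    """Collect the (r0, r1) results reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(p0, p1):
        mem = {"x": 0, "y": 0}
        regs = {}
        for kind, loc, reg in order:
            if kind == "st":
                mem[loc] = 1
            else:
                regs[reg] = mem[loc]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


print(sorted(sc_outcomes(P0, P1)))
```

Under SC the result r0 = r1 = 0 never appears, so the programmer needs no fences to rule it out. Under RC the same guarantee requires explicit synchronization in the program.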

Catching up to RC • SC limitation: no software guarantees • memory order is arbitrary, no devices such as fences • SC can speculatively allow loads and stores to bypass one another • processor state must be remembered, but rollbacks should be avoided (slow) • superscalar rollbacks are faster • rollbacks caused by data races, false sharing, cache conflicts • encourage load/store speculation but make it transparent • check for reading or replacement of speculative blocks
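Why the bypassing has to be speculative and checked: a toy model, assuming each processor has a private one-entry store buffer (an assumption for illustration, not the paper's hardware). Each store is split into an issue event `S` and a later drain-to-memory event `D`, and the following load `L` may run before the drain. The event names and helper functions are hypothetical.

```python
from itertools import permutations

# Per processor p: Sp = store enters the buffer, Dp = store drains to memory,
# Lp = the later load reads memory. P0 runs x = 1; r0 = y. P1 runs y = 1; r1 = x.
EVENTS = ["S0", "L0", "D0", "S1", "L1", "D1"]


def valid(order):
    """The store is issued before its drain and before the load; L and D are unordered."""
    pos = {event: i for i, event in enumerate(order)}
    return all(
        pos[f"S{p}"] < pos[f"L{p}"] and pos[f"S{p}"] < pos[f"D{p}"]
        for p in (0, 1)
    )


def run(order):
    """Execute one event order and return (r0, r1)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for event in order:
        if event == "D0":
            mem["x"] = 1
        elif event == "D1":
            mem["y"] = 1
        elif event == "L0":
            regs["r0"] = mem["y"]
        elif event == "L1":
            regs["r1"] = mem["x"]
    return regs["r0"], regs["r1"]


bypass_outcomes = {run(order) for order in permutations(EVENTS) if valid(order)}
print(sorted(bypass_outcomes))
```

With unchecked bypassing the result (0, 0) becomes reachable, which SC forbids. This is why the slides pair load/store speculation with detection (a speculatively used block being touched by another processor or replaced) and rollback.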

SC++ • ILP allows more speculation in SC – invisible to outside world due to in-order retirement • branch predictors, superscalar, non-blocking caches • maybe can perform up to the level of RC • allows stores to bypass as well as loads • allows out-of-order operations to hide latency • quickly recovers from misspeculation • assumes applications designed for MPs/DSM

SC++ Architecture • modeled after the MIPS R10000 (R10K) • SHiQ (Speculative History Queue) allows for prefetching and non-blocking caches • other processors see SC • history buffer allows speculative retirement • unblocks RoB stores • load/store queue takes stores from RoB • BLT (Block Lookup Table) holds block addresses for SHiQ
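A rough sketch of how a history queue and a block lookup table could cooperate, based only on the bullets above. The class `ToySpeculativeCore`, its method names, and the "undo the whole queue" rollback policy are simplifying assumptions for illustration, not the paper's design.

```python
from collections import deque


class ToySpeculativeCore:
    """Hypothetical toy model: a history queue plus a block lookup table."""

    def __init__(self, num_regs=4):
        self.regs = {f"r{i}": 0 for i in range(num_regs)}
        self.shiq = deque()  # (dest_reg, old_value, block_addr), oldest first
        self.blt = {}        # block_addr -> number of queue entries using it

    def retire_speculatively(self, dest, value, block):
        """Retire a load early, logging what is needed to undo it."""
        self.shiq.append((dest, self.regs[dest], block))
        self.blt[block] = self.blt.get(block, 0) + 1
        self.regs[dest] = value

    def commit_oldest(self):
        """The oldest entry is no longer speculative; drop its undo record."""
        _dest, _old, block = self.shiq.popleft()
        self.blt[block] -= 1
        if self.blt[block] == 0:
            del self.blt[block]

    def on_invalidate_or_replace(self, block):
        """Return True if the event hit a speculative block and forced a rollback."""
        if block not in self.blt:
            return False  # block holds no speculative state; nothing to undo
        while self.shiq:  # undo youngest first so the oldest saved value wins
            dest, old, _blk = self.shiq.pop()
            self.regs[dest] = old
        self.blt.clear()
        return True


core = ToySpeculativeCore()
core.retire_speculatively("r1", 5, block=0x40)
core.retire_speculatively("r2", 7, block=0x80)
print(core.on_invalidate_or_replace(0xC0))  # block with no speculative state
print(core.on_invalidate_or_replace(0x40))  # block used by a speculative load
print(core.regs)
```

The table keeps the common case cheap: an incoming invalidation or a cache replacement needs only one lookup, and the queue is touched only on a hit.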

Simulations • using RSIM for 8-node DSM, 16 KB L1, 8 MB L2 • all use non-blocking caches, prefetching, speculative loads • rollbacks = 1 cycle • SC++ rollbacks = 4 wide • SC blocks at stores • RC hides network latency with store overlaps • raytrace hurt by lock patterns, slow network

More Simulations • RoB increase = more prefetch time • unstructured causes many rollbacks for SC • SC++o = no speculative stores • radix and raytrace = store-intensive, full load/store queue

Another Simulation • L2 size reduced • less room for speculative state • lu sees many rollbacks caused by replacements

Conclusions/Questions • SC++ nearly up to snuff with RC with minor additional hardware • does this really matter – is it that much harder to program with RC? • does this add any significant risk of errors due to extra hardware and speculation? • do you buy their argument that applications causing rollback are not suited to DSM systems anyway?
