Is SC + ILP = RC?

Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University
Presented by Vamshi Kadaru
Spring 2005: CS 7968 Parallel Computer Architecture

Introduction
• Multiprocessors are widely available; the question is how to maximize their performance
• Atomicity of operations (synchronization)
• Allow in-order processors to overlap store latency with other work (e.g., bypassing loads, overlapping with network latency)
• Allow processors to execute out of order (speculation)
• There is a trade-off between programmability and performance
• To simplify programming, implement a shared-memory abstraction

Memory Models
• Shared-memory systems implement memory consistency models
• Different models make different guarantees; the processor can reorder or overlap memory operations as long as the guarantees are upheld
• Sequential Consistency (SC) is the simplest model: memory operations appear to execute in program order
• Relaxed memory models require only some memory operations to perform in program order
• Release Consistency (RC) is the most relaxed of the relaxed memory models
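
A minimal sketch, not taken from the paper, of what the SC guarantee rules out. It enumerates every interleaving of a two-processor store-buffering litmus test and collects the register outcomes. The programs, variable names, and helper functions are illustrative assumptions, and "relaxed" here only means that a processor's load may be performed before its own earlier store.

```python
# Store-buffering litmus test (illustrative names):
#   P0: x = 1; r0 = y        P1: y = 1; r1 = x
P0 = [("store", "x", 1), ("load", "y", "r0")]
P1 = [("store", "y", 1), ("load", "x", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(trace):
    """Execute one global order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in trace:
        if op == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r0"], regs["r1"]

def outcomes(p0, p1):
    """Set of (r0, r1) results over all interleavings of p0 and p1."""
    return {run(t) for t in interleavings(p0, p1)}

# SC: each processor's operations stay in program order.
sc_outcomes = outcomes(P0, P1)

# Relaxed (assumption): either processor may perform its load before its store.
relaxed_outcomes = (
    sc_outcomes
    | outcomes(P0[::-1], P1)
    | outcomes(P0, P1[::-1])
    | outcomes(P0[::-1], P1[::-1])
)

print(sorted(sc_outcomes))
print(sorted(relaxed_outcomes))
```

Under SC at least one load must observe the other processor's store, so the outcome where both registers read 0 never appears; once loads may pass earlier stores, it does.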

Current Memory Consistency Models
• Sequential Consistency (SC)
  • HP and MIPS processors
• Processor Consistency (PC)
  • Intel processors
• Total Store Order (TSO)
  • Sun SPARC
• Release Consistency (RC)
  • Sun SPARC, DEC Alpha, IBM PowerPC

Current Optimizations
• Techniques used to exploit ILP
  • Branch prediction
  • Execute multiple instructions per cycle
  • Non-blocking caches to overlap memory operations
  • Out-of-order execution
• Implement precise exceptions and speculative execution
  • Reorder buffer

Comparing SC and RC
• Sequential Consistency (SC)
  • Guarantees memory order using hardware
  • Easier to program
  • Prevents high performance due to conservative nature
• Release Consistency (RC)
  • Guarantees memory order using software
  • Harder to program; more burden on programmer
  • Achieves highest performance due to explicitness

SC Implementations
• Current SC implementations use ILP optimizations
  • Hardware prefetching and non-blocking caches to overlap loads and stores using the reorder buffer
  • Speculative load execution using the reorder buffer and a special history buffer to roll back in case of invalidation
• Limitations
  • Stores cannot bypass other memory operations
  • Long-latency remote stores cause the relatively small reorder buffer and load/store queue to fill up, blocking the pipeline
  • Capacity and conflict misses in small L2 caches cause frequent rollbacks
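
A toy cycle-count model of the reorder-buffer limitation listed above. It is a sketch under stated assumptions, not the simulated machine: one dispatch and one in-order retire per cycle, a store's latency starts when it enters the reorder buffer (standing in for a prefetch), and the function name, instruction mix, and latency value are all hypothetical.

```python
from collections import deque

def cycles_to_run(program, rob_size, store_latency):
    """Count cycles for an in-order-retire pipeline with a bounded reorder buffer.

    program: sequence of "st" (long-latency store) or "op" (1-cycle instruction).
    A store cannot retire until it completes, so it blocks everything behind it.
    """
    rob = deque()               # completion cycle of each in-flight instruction
    pending = deque(program)
    cycle = 0
    while pending or rob:
        cycle += 1
        # Retire the oldest instruction if it has completed.
        if rob and rob[0] <= cycle:
            rob.popleft()
        # Dispatch the next instruction if the reorder buffer has room.
        if pending and len(rob) < rob_size:
            kind = pending.popleft()
            latency = store_latency if kind == "st" else 1
            rob.append(cycle + latency)
    return cycle

# Two stores separated by independent work (hypothetical instruction mix).
program = (["st"] + ["op"] * 7) * 2
small_rob = cycles_to_run(program, rob_size=4, store_latency=20)
large_rob = cycles_to_run(program, rob_size=16, store_latency=20)
print(small_rob, large_rob)
```

With the small buffer the second store cannot enter until the first retires, so the two latencies serialize; with the larger buffer they overlap.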

RC Implementations
• RC lets the programmer specify ordering constraints (fence instructions) among specific memory operations to enforce order
• RC implementations use store buffering to allow loads and stores to bypass pending stores
• Unlike SC, RC can use binding prefetches to perform loads in the reorder buffer
• RC can also relax ordering among fence instructions and use rollback mechanisms if there is a memory model violation
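
The store-buffering and fence behaviour described above can be sketched with a toy model. The `Core` class, its method names, and the `data`/`flag` addresses are assumptions made for illustration; real hardware drains the buffer asynchronously rather than only at a fence.

```python
from collections import deque

class Core:
    """Toy core with a FIFO store buffer in front of a shared memory dict."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()   # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        # The store is buffered; shared memory is updated later.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A load bypasses pending stores but must see this core's own newest
        # store to the same address (store-to-load forwarding).
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def fence(self):
        # A fence completes only after every earlier store is in memory.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value

memory = {"data": 0, "flag": 0}
producer, consumer = Core(memory), Core(memory)

producer.store("data", 42)
producer.store("flag", 1)
early = consumer.load("flag")        # stores are still buffered
own = producer.load("data")          # forwarded from the producer's buffer
producer.fence()
late = (consumer.load("flag"), consumer.load("data"))
print(early, own, late)
```

Before the fence the consumer still sees the old flag while the producer already sees its own data; after the fence both stores are visible to everyone.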

SC Programmability with RC Performance
• SC can approach RC if hardware can provide support for:
  • Speculatively relaxing the order of loads and stores
  • Loads and stores to appear to take place atomically and in program order
  • Instructions to be allowed to execute out of program order
  • Processor state must be remembered for rollbacks
• Limitations (costs)
  • Memory order is arbitrary; no guarantees
  • Rollbacks must be infrequent (enough space needed)

SC++ Architecture
• Modeled after the R10k
• SHiQ allows for prefetching and non-blocking caches
• Other processors see SC
• History buffer allows speculative retirement
  • Unblocks RoB stores
• Load/store queue takes stores from RoB
• BLT holds the block addresses for SHiQ entries
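
A rough sketch of how a history queue and a block lookup table could cooperate, assuming a much simpler design than the SC++ hardware. The class, method names, and block size are hypothetical, and for simplicity a conflict undoes the whole queue rather than only the affected suffix.

```python
from collections import deque, Counter

BLOCK_SIZE = 64  # bytes per cache block; an illustrative value

class SpeculativeHistory:
    """Toy stand-in for a history queue (SHiQ) plus a block lookup table (BLT)."""

    def __init__(self, state):
        self.state = state        # processor state: name -> value (names must exist)
        self.shiq = deque()       # (block, name, old_value), oldest first
        self.blt = Counter()      # block -> number of SHiQ entries touching it

    def retire_speculatively(self, addr, name, new_value):
        # Log the old value so the update can be undone, then apply it.
        block = addr // BLOCK_SIZE
        self.shiq.append((block, name, self.state[name]))
        self.blt[block] += 1
        self.state[name] = new_value

    def commit_oldest(self):
        # The oldest entry is no longer speculative; drop its history.
        block, _, _ = self.shiq.popleft()
        self.blt[block] -= 1
        if self.blt[block] == 0:
            del self.blt[block]

    def on_invalidation(self, addr):
        # Fast check: only blocks present in the BLT can force a rollback.
        if addr // BLOCK_SIZE not in self.blt:
            return False
        # Undo every speculative update, youngest first.
        while self.shiq:
            _, name, old = self.shiq.pop()
            self.state[name] = old
        self.blt.clear()
        return True

state = {"r1": 0, "r2": 0}
history = SpeculativeHistory(state)
history.retire_speculatively(0x100, "r1", 7)
history.retire_speculatively(0x240, "r2", 9)
ignored = history.on_invalidation(0x800)      # block not tracked: no rollback
rolled_back = history.on_invalidation(0x104)  # same block as 0x100: rollback
print(ignored, rolled_back, state)
```

The lookup table keeps the common case cheap: an incoming invalidation is checked against a small set of block addresses instead of a scan of the whole queue.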

Experimental Setup
• Simulator: RSIM, on an 8-node DSM
• Each DSM node is an R10k-like processor
• Memory model implementations use
  • Non-blocking caches
  • Hardware prefetching for loads and stores
  • Speculative load execution
• No speculative retirement is done in either SC or RC

Base System Configuration
• Each R10k processor node has the above configuration
• Large L2 cache – eliminates capacity and conflict misses
• Base configuration is used unless otherwise specified

Some points to remember
• SC and RC implementations…
  • Use non-blocking caches
  • Use hardware prefetching for loads and stores
  • Perform speculative loads
• SC++ uses…
  • Speculative History Queue (SHiQ)
  • Block Lookup Table (BLT)
• Rollbacks due to instructions in the reorder buffer take one cycle
• Rollbacks due to instructions in the SHiQ take 4 cycles

Results – Base System
• Speedup is normalized to that of the SC implementation
• RC is better than SC
  • Best for radix
• SC++ performs better than or equal to RC
  • For raytrace it performs much better

Results – Network Latency
• Network latency increased by 4x
• RC hides the network latency by overlapping stores
• SC++inf keeps up with RC
• raytrace does less well, since the longer network latency dominates its lock patterns

Results – Reorder Buffer Size
• A larger reorder buffer allows more prefetch time
• Speeds up both SC and RC
  • Hides store latencies by allowing more time for prefetches
• In raytrace, no speedup in either SC or RC
  • Memory operations don’t overlap much
• In structured, the gap grows
  • Due to an increase in the number of rollbacks in SC

Results – SHiQ Size & Speculative Stores
• Absence of speculative stores causes significant performance loss
  • radix and raytrace
• Reducing SHiQ sizes leads to performance degradation
  • em3d and radix

Results – L2 Cache Size
• Two effects of a smaller L2 cache
  • Less room for speculative state => gap widens
  • Lots of load misses for both SC and RC => might narrow performance gap
• lu and radix – the high load miss rate degrades performance
• SC++ is also sensitive to rollbacks due to replacements

Conclusions
• SC can perform as well as RC if hardware provides enough support for speculation
• SC++ allows speculative bypassing for both loads and stores
• SC++ minimizes additional overhead on the processor pipeline's critical paths by using the following structures
  • SHiQ: to store speculative state and absorb remote latencies
  • BLT: to allow fast lookups in the SHiQ
