Supporting Highly-Decoupled Thread-Level Redundancy for Parallel Programs M Wasiur Rashid, Michael Huang University of Rochester
Motivation for Thread-Level Redundancy • Noise-induced hardware errors are an important threat • Shrinking transistors are fundamentally more vulnerable • Scaling increases both noise sources and victims • Error frequency will rise in unprotected circuits • Efficient and effective protection against errors in logic is: • Important: logic error rates are rising (Seifert et al. IRPS'01, Shivakumar et al. DSN'02) • Challenging: no ECC-equivalent partial redundancy • TLR is natural and well understood: AR-SMT, SRT, CRT, SRTR, CRTR • Advantages of TLR over circuit/device-level redundancies • Flexible: can easily be turned on or off on demand • Avoids fundamental issues of lower-level redundancies
Design Goals • Support parallel programs efficiently • Not a trivial extension of supporting single-threaded apps • Cover memory subsystem logic (coherence, consistency) • Manage natural non-determinism in parallel execution • Decouple redundancy support from core logic • Minimize impact on the critical path • Minimize design intrusion into core logic • Decouple the timing of redundant threads • Do not require lock-stepping • Validation can happen long after retirement • To tolerate long latencies in communicating and validating results
High-Level Overview • Two wavefronts move independently (including in the memory hierarchy) • Architectural state is compared every epoch, so large buffering capacity is needed • Non-determinism handling and buffering are implemented entirely in off-path support
Decoupling with Post-Commit Buffer • PCB handles redundancy; L1 handles semantic processing • PCB keeps written cache lines and writes them back after validation • L1 need not write back (a dirty line is simply discarded) • Stores from the processor also write into the PCB • Timing-critical path of the L1 is left intact • An L1 miss or coherence activity needs to search the PCB [Figure: L1 Cache alongside the PCB]
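The store/search/write-back flow of the PCB can be sketched in a few lines of Python. This is an illustrative behavioral model only — class and method names are invented here, and the epoch-granularity validation step is simplified to a single call; it is not the paper's hardware design.

```python
# Behavioral sketch of a Post-Commit Buffer (PCB). Names and the
# validation trigger are assumptions for illustration.

class PostCommitBuffer:
    def __init__(self):
        self.entries = {}   # line address -> data written this epoch
        self.memory = {}    # backing store; updated only after validation

    def store(self, addr, data):
        # Processor stores write into the PCB alongside the L1, so a
        # dirty L1 line can later be discarded without a write-back.
        self.entries[addr] = data

    def search(self, addr):
        # An L1 miss or coherence request must also check the PCB.
        return self.entries.get(addr)

    def validate_and_commit(self):
        # After the two wavefronts' architectural states compare equal,
        # buffered lines drain to memory.
        self.memory.update(self.entries)
        self.entries.clear()

pcb = PostCommitBuffer()
pcb.store(0x100, 42)
assert pcb.search(0x100) == 42      # visible to misses/coherence
assert 0x100 not in pcb.memory      # but not yet committed to memory
pcb.validate_and_commit()
assert pcb.memory[0x100] == 42
```

The key property the sketch shows: the L1's timing path never depends on the PCB — only the (rarer) miss path consults it.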
Challenges to Address • Isn't the PCB a very large store queue (and therefore impractical)? – No • PCB is searched only on a miss – not timing critical, but… • It does need many more entries (~100s) to be useful • If multiple versions of a line exist, the search can be very slow • Either search each segment sequentially or use a priority encoder • Frequent searches are undesirable energy-wise
Using States to Address Multiple Copies • 3 states: Valid, Invalid, Superseded • Superseded lines exist only for committing and do not participate in searches • Guarantees at most one valid version of a line in any PCB • Searches are always parallel; no priority encoding needed
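The single-valid-version invariant can be demonstrated with a small model. This is a hedged sketch — the state names follow the slide (Valid/Superseded), but the list-based storage and method names are illustrative assumptions, not the hardware organization.

```python
# Sketch of PCB line states: writing a new version demotes the prior
# Valid copy to Superseded, so at most one Valid entry per address can
# ever answer a search (enabling a simple parallel one-hit search).

VALID, SUPERSEDED = "V", "SD"

class StatefulPCB:
    def __init__(self):
        self.slots = []  # list of (addr, data, state) entries

    def store(self, addr, data):
        # Demote any existing Valid copy of this line; it is kept only
        # so it can still be committed in order.
        self.slots = [(a, d, SUPERSEDED if a == addr and s == VALID else s)
                      for a, d, s in self.slots]
        self.slots.append((addr, data, VALID))

    def search(self, addr):
        # Only Valid entries participate in the search.
        hits = [d for a, d, s in self.slots if a == addr and s == VALID]
        assert len(hits) <= 1   # the invariant: no priority encoder needed
        return hits[0] if hits else None

spcb = StatefulPCB()
spcb.store(0x40, "old")
spcb.store(0x40, "new")
assert spcb.search(0x40) == "new"
assert sum(1 for a, d, s in spcb.slots if s == VALID) == 1
```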
Using Pointers as a Filter • If the line is present in the cache, there is no need to search the PCB • The pointer also reduces bloom filter clogging [Figure: L1 Cache lines carry tag/data/pointer fields; the PCB holds tag/data/state entries, with older versions of a line marked Superseded (SD) and a bloom filter summarizing the PCB's tags]
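The bloom-filter half of the mechanism is standard and easy to sketch: a miss in the filter proves the line is absent from the PCB, so the search can be skipped entirely. The sizes, hash construction, and names below are assumptions for illustration, not the paper's parameters.

```python
# Illustrative bloom filter over PCB tags. A negative answer is exact
# (the line is definitely not in the PCB); a positive answer may be a
# false positive, which only costs an unnecessary search.

import hashlib

class BloomFilter:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, key):
        # Derive k independent bit positions from a cryptographic hash;
        # real hardware would use cheap XOR-based hashes instead.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p] = True

    def might_contain(self, key):
        return all(self.array[p] for p in self._positions(key))

pcb_filter = BloomFilter()
pcb_filter.add(0x200)                      # line buffered in the PCB
assert pcb_filter.might_contain(0x200)     # inserted keys always hit
if not pcb_filter.might_contain(0x999):
    pass  # guaranteed absent: skip the PCB search and save energy
```

The slide's per-line pointer plays the complementary role: same-processor accesses that hit an up-to-date L1 line never even consult the filter, which keeps the filter from clogging.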
Effectiveness of the Optimizations • Setup: multiprocessor simulator based on SimpleScalar, running the SPLASH-2 benchmarks plus two other shared-memory programs • Only 0.67% of PCB searches remain • The pointer and the bloom filter each filter out about half • The pointer works well for same-processor searches • The bloom filter works well for remote-processor requests • Without pointers, bloom filter false positives are 21X-800X higher
The Issue of Non-Determinism • Non-determinism in parallel execution can lead to different outcomes in the two wavefronts • Discrepancies then appear as soft errors and cannot be addressed by rollback • Possible solutions • Eliminate non-determinism completely – lockstepping • Ignore the root cause of non-determinism and address the symptom: pass load results via, e.g., a load value queue (LVQ) • Our approach: throttle retirement/fetch to maintain race outcomes [Figure: two interleavings of T1's st x,0 with T2's st x,1 and ld x; in the computing wavefront the load returns 1, in an unconstrained verification wavefront it may return 0]
Subepoch-Based Instruction Partitioning [Figure: potential races split each thread's instruction stream into subepochs s, s+1, …; per-thread counts n_s,j record thread Tj's share of subepoch s, and the verification wavefront stalls (e.g., before st x,1) at each subepoch boundary] • (Potential) races partition instructions into subepochs • Races are guaranteed to happen only across subepochs • Maintain "lockstepping" of subepochs • Stall the fetch or commit stage (sequential consistency) • Guarantees deterministic replay
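The subepoch mechanism can be sketched as a toy replay loop: the computing wavefront records how many instructions each thread contributed to each subepoch (the n_s,j counts on the slide), and the verification wavefront stalls a thread once it has consumed its count, advancing only at subepoch boundaries. Everything besides the counts idea is an illustrative simplification.

```python
# Toy deterministic replay via subepoch counts. Since races are
# guaranteed to cross subepoch boundaries, any order *within* a
# subepoch is safe; enforcing the boundaries reproduces race outcomes.

def replay(threads, subepoch_counts):
    """threads: per-thread instruction lists.
    subepoch_counts: one list per subepoch giving each thread's
    recorded instruction count (n_s,j) for that subepoch."""
    order = []
    pcs = [0] * len(threads)          # per-thread program counters
    for counts in subepoch_counts:    # lockstep over subepochs
        for tid, n in enumerate(counts):
            for _ in range(n):        # thread tid "stalls" after n insts
                order.append(threads[tid][pcs[tid]])
                pcs[tid] += 1
    return order

t1 = ["st x,0", "add"]
t2 = ["st x,1", "ld x"]
# Recorded partition: subepoch 1 holds T1's st x,0; subepoch 2 holds
# T1's add plus T2's st x,1 and ld x — so st x,0 always precedes st x,1.
out = replay([t1, t2], [[1, 0], [1, 2]])
assert out.index("st x,0") < out.index("st x,1") < out.index("ld x")
```

In hardware this throttling is applied at the fetch or commit stage rather than in software, but the ordering guarantee it provides is the same.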
Other Issues Detailed in the Paper • PCB write-back bandwidth • PCB moderately increases bandwidth demand • Simple support can cut down the increase significantly • Subepoch transition implementation details • Enforcing subepoch boundaries • Not always necessary to guarantee race outcomes • Studied three different policies on when to enforce • Storage and energy overhead for all structures
Experimental Setup • Modified SimpleScalar 3.0b simulator modeling a CMP • Snoopy-based MESI coherence protocol • Sequential consistency memory model • SPLASH-2 benchmark suite, plus ilink and tsp
Performance Impact • Additional performance impact is less than 2.5% on average [Figure: normalized TLR execution time for 8- and 16-processor configurations]
Summary • TLR offers flexible protection with fundamental advantages • Proposed a design with comprehensive coverage 1. PCB decouples redundancy support from core logic – with optimizations, dynamic cost is very low 2. Broad-brush subepoch synchrony guarantees race outcomes – requires only non-intrusive support to throttle retirement • Overall performance impact is small
Thank You Questions?