Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance

Sanjeev Kumar, Dongming Jiang, Rohit Chandra, Jaswinder Pal Singh

Classic Study on Synchronization • Software Algorithms for Locks and Barriers [Mellor-Crummey et al., TOCS’91] • Multiprocessor machines • BBN Butterfly, Sequent Symmetry • Microbenchmarks • Little benefit from special hardware support • Handle memory/network contention in software

Case for Hardware Support • Fetch&Op [Laudon et al., ISCA’97] • Origin 2000 • Microbenchmarks (Counter & Barrier) • QOLB [Kagi et al., ISCA’97] • Simulations • Microbenchmarks & Applications (Locks) • Better performance with Hardware Support

Our Study • Re-examine synchronization • 64-processor Origin 2000 • New architectures: CC-NUMA • New primitives: LL-SC • Applications (SPLASH-2) and microbenchmarks • Applications : Little benefit from H/W support • Locks : Small performance benefit sometimes • Barriers : Load-imbalance dominates

Outline • Background • Performance evaluation: Microbenchmarks • Synchronization primitives on Origin 2000 • Lock and Barrier algorithms and performance • Performance evaluation: Applications • Is further hardware support valuable ? • Conclusions

Synchronization Primitives on Origin 2000 • LL-SC • 2 instructions, cached • Flexible • Atomic update : Contention → Retries • Wait : Spinning in cache, cache coherence • Fetch&Op • Special locations, uncached • Inflexible, e.g. Atomic Swap • Atomic update : Contention at memory • Wait : Spinning → Traffic, no cache coherence
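The atomic-update tradeoff on this slide can be sketched in portable C++. This is a minimal illustration under stated assumptions, not the Origin 2000 mechanism: C++ exposes no LL-SC pair, so `compare_exchange_weak` stands in for it (it may fail and force a retry, much like a failed SC), and `fetch_add` stands in for Fetch&Op only in the sense of "one operation, no retry loop". It does not model the uncached special locations. The names `llsc_style_increment`, `fetch_op_increment`, `UpdateStats`, and `run_update_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// LL-SC style: read, compute, attempt a conditional store, retry on failure.
// Returns how many times the conditional store had to be retried.
std::uint64_t llsc_style_increment(std::atomic<std::uint64_t>& counter) {
    std::uint64_t retries = 0;
    std::uint64_t old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++retries;  // 'old' was refreshed by the failed exchange; try again
    }
    return retries;
}

// Fetch&Op style: a single atomic operation with no software retry loop.
std::uint64_t fetch_op_increment(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_acq_rel);
}

struct UpdateStats {
    std::uint64_t final_value;
    std::uint64_t retries;
};

// Hypothetical driver: 'threads' workers each perform 'per_thread' increments.
UpdateStats run_update_demo(int threads, int per_thread, bool use_llsc_style) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> retries{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            std::uint64_t local_retries = 0;
            for (int i = 0; i < per_thread; ++i) {
                if (use_llsc_style) {
                    local_retries += llsc_style_increment(counter);
                } else {
                    fetch_op_increment(counter);
                }
            }
            retries.fetch_add(local_retries);
        });
    }
    for (auto& th : pool) th.join();
    return {counter.load(), retries.load()};
}
```

Both variants reach the same final count; only the LL-SC style variant can report retries, and how many depends on contention and on the machine.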

Lock Algorithms (1): Simple • One location • Atomic test-and-set • LL-SC • Fetch&Op • [Figure: processors testing a single location, "Available ? No"]
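A minimal sketch of the simple lock, assuming `std::atomic<bool>::exchange` as the atomic test-and-set (on an LL-SC machine the compiler would typically build it from an LL/SC loop). The class name `SimpleLock` and the driver `run_simple_lock_demo` are hypothetical, and the `yield()` in the spin loop is a convenience for oversubscribed test machines rather than part of the algorithm.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Simple lock: every waiter hammers one shared location with test-and-set.
class SimpleLock {
    std::atomic<bool> held_{false};
public:
    void lock() {
        // exchange() returns the previous value: true means someone else holds it.
        while (held_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { held_.store(false, std::memory_order_release); }
};

// Hypothetical driver: increments a plain (non-atomic) counter under the lock.
long run_simple_lock_demo(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    SimpleLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++shared;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```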

Lock Algorithms (2): Ticket • Like in a bakery • Proportional backoff • Atomic fetch-and-increment • LL-SC • Fetch&Op • [Figure: Next-Ticket and Now-Serving counters with processors holding tickets]
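The ticket lock with proportional backoff can be sketched as follows. This is an illustration, not the code measured in the study: `fetch_add` plays the role of the atomic fetch-and-increment, and the backoff unit `kBackoffYields` is a made-up constant (a real implementation would tune a delay to the expected critical-section length). `TicketLock` and `run_ticket_lock_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<std::uint32_t> next_ticket_{0};  // "Next-Ticket"
    std::atomic<std::uint32_t> now_serving_{0};  // "Now-Serving"
    static constexpr std::uint32_t kBackoffYields = 1;  // assumed backoff unit
public:
    void lock() {
        // Take a ticket, like in a bakery.
        const std::uint32_t my_ticket =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        for (;;) {
            const std::uint32_t serving =
                now_serving_.load(std::memory_order_acquire);
            if (serving == my_ticket) return;
            // Proportional backoff: wait longer the further back we are in line.
            // Unsigned subtraction stays correct across counter wraparound.
            const std::uint32_t ahead = my_ticket - serving;
            for (std::uint32_t i = 0; i < ahead * kBackoffYields; ++i) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // Only the holder writes now_serving_, so a plain read-then-store is safe.
        now_serving_.store(now_serving_.load(std::memory_order_relaxed) + 1,
                           std::memory_order_release);
    }
};

// Hypothetical driver: increments a plain counter under the lock.
long run_ticket_lock_demo(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    TicketLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++shared;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

Because waiters only read `now_serving_` and each reads it less often the further back it stands, the single hot location sees far fewer accesses than in the simple lock.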

Lock Algorithms (3): MCS • Queuing • Local spinning • Atomic Compare-and-Swap • LL-SC • Not Fetch&Op • [Figure: processors linked in a queue, each spinning on its own flag]
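A compact sketch of the MCS queuing lock in its textbook form, assuming `exchange` for the swap and `compare_exchange_strong` for the compare-and-swap. Each waiter spins only on the `locked` flag in its own queue node, which is what "local spinning" refers to. `MCSNode`, `MCSLock`, and `run_mcs_demo` are hypothetical names, and the `yield()` calls are a testing convenience.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail_{nullptr};
public:
    void lock(MCSNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Swap ourselves in as the new tail of the queue.
        MCSNode* pred = tail_.exchange(&me, std::memory_order_acq_rel);
        if (pred != nullptr) {
            pred->next.store(&me, std::memory_order_release);
            // Local spinning: only our own node is polled.
            while (me.locked.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to empty.
            MCSNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
                return;
            }
            // A successor is enqueueing; wait until it links itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off
    }
};

// Hypothetical driver: each thread uses its own queue node.
long run_mcs_demo(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    MCSLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            MCSNode node;
            for (int i = 0; i < per_thread; ++i) {
                lock.lock(node);
                ++shared;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

The release path depends on compare-and-swap to detect "I am still the tail", which a fetch-and-increment style operation cannot express.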

Lock-Delay Microbenchmark • [Figure: results for Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), and TicketProp (Fetch&Op)]

Barrier Algorithms (1): Central • Increment a counter • Wait on a location • Atomic fetch-and-increment • LL-SC • Fetch&Op • [Figure: processors incrementing an Arrived counter and waiting on a Go location]
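A minimal sketch of a central barrier: arrivals increment one counter and everyone else waits on one location. Sense reversal is added here so the barrier can be reused across phases; that detail is an assumption of this sketch, since the slide does not say how reuse is handled. `CentralBarrier` and `run_central_barrier_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class CentralBarrier {
    const int n_;
    std::atomic<int> arrived_{0};   // the "Arrived" counter
    std::atomic<bool> go_{false};   // the "Go" location everyone waits on
public:
    explicit CentralBarrier(int n) : n_(n) { assert(n > 0); }

    // Each thread keeps its own 'local_sense', initially false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) == n_ - 1) {
            // Last arrival: reset the counter, then release the waiters.
            arrived_.store(0, std::memory_order_relaxed);
            go_.store(local_sense, std::memory_order_release);
        } else {
            while (go_.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }
};

// Hypothetical driver: after each barrier, every thread checks that all
// threads have published the current phase number.
bool run_central_barrier_demo(int threads, int phases) {
    CentralBarrier barrier(threads);
    std::vector<int> slot(threads, -1);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            bool sense = false;
            for (int p = 0; p < phases; ++p) {
                slot[t] = p;
                barrier.wait(sense);              // all writes are now visible
                for (int v : slot) {
                    if (v != p) ok.store(false);
                }
                barrier.wait(sense);              // all reads done before next write
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```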

Barrier Algorithms (2): Tournament • Tree of locations • Spin on different locations • Avoid hotspot and contention • Atomic fetch-and-increment • LL-SC • Fetch&Op • [Figure: processors and a tree of flag locations]
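One way to sketch a tournament barrier is with statically chosen winners: in round r, thread i (a multiple of 2^(r+1)) waits for thread i + 2^r, and the loser then sleeps on its own wake-up location until the champion's release propagates back down the tree. The layout below is one common formulation and an assumption of this sketch, not necessarily the variant measured in the study. Flags carry an epoch number instead of a boolean so the barrier is reusable. `TournamentBarrier` and `run_tournament_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class TournamentBarrier {
    const int n_;
    const int rounds_;
    std::vector<std::atomic<unsigned>> arrive_;  // arrive_[winner * rounds_ + round]
    std::vector<std::atomic<unsigned>> wake_;    // wake_[loser]

    static int rounds_for(int n) {
        int r = 0;
        while ((1 << r) < n) ++r;
        return r;
    }
public:
    explicit TournamentBarrier(int n)
        : n_(n), rounds_(rounds_for(n)),
          arrive_(static_cast<std::size_t>(n) * rounds_for(n)),
          wake_(static_cast<std::size_t>(n)) { assert(n > 0); }

    // 'epoch' is per-thread state, initially 0.
    void wait(int id, unsigned& epoch) {
        ++epoch;
        int r = 0;
        for (; r < rounds_; ++r) {
            const int stride = 1 << r;
            if (id % (2 * stride) != 0) {
                // Loser of round r: tell the winner, then wait to be woken.
                arrive_[(id - stride) * rounds_ + r].store(epoch, std::memory_order_release);
                while (wake_[id].load(std::memory_order_acquire) != epoch) {
                    std::this_thread::yield();
                }
                break;
            }
            // Winner of round r: wait for the opponent, if one exists.
            if (id + stride < n_) {
                while (arrive_[id * rounds_ + r].load(std::memory_order_acquire) != epoch) {
                    std::this_thread::yield();
                }
            }
        }
        // Wake the losers of every round this thread won, highest round first.
        for (int w = r - 1; w >= 0; --w) {
            const int loser = id + (1 << w);
            if (loser < n_) wake_[loser].store(epoch, std::memory_order_release);
        }
    }
};

// Hypothetical driver: same phase check as for a central barrier.
bool run_tournament_demo(int threads, int phases) {
    TournamentBarrier barrier(threads);
    std::vector<int> slot(threads, -1);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            unsigned epoch = 0;
            for (int p = 0; p < phases; ++p) {
                slot[t] = p;
                barrier.wait(t, epoch);
                for (int v : slot) {
                    if (v != p) ok.store(false);
                }
                barrier.wait(t, epoch);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

Each location has exactly one writer and one reader per episode, so no single location becomes a hotspot. A production version would also pad each flag to its own cache line.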

Barrier-Null Microbenchmark • [Figure: results for Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), and Hybrid (LL-SC, Fetch&Op)]

Microbenchmarks Summary • LL-SC • Simplest algorithms perform poorly e.g. Simple lock and Central barrier • Smarter algorithms perform much better • Fetch&Op supports faster synchronization

Outline • Background • Performance evaluation: Microbenchmarks • Performance evaluation: Applications • Is further hardware support valuable ? • Conclusions

Choosing Applications: Methodology • Applications from SPLASH-2 • Undo optimizations (Added locks and barriers) • Problem Size • At least 25-fold speedup on 64 processors • Base case • Best LL-SC lock and barrier

Base Performance

Application Performance Using Different Locks • Base : MCS, LL-SC • Better algorithm helps • Fetch&Op traffic hurts • [Figure: application performance by lock type; one bar labeled 1.65]

Application Performance Using Different Barriers • Base : Tournament, LL-SC • Load-imbalance dominates • Fetch&Op traffic hurts • [Figure: application performance by barrier type; bars labeled 2.68 and 1.52]

Applications Summary • LL-SC • Locks : Better algorithm helps • Barriers : Load imbalance dominates • Fetch&Op • Traffic due to spinning hurts performance • Different from the microbenchmarks

Outline • Background • Performance evaluation: Microbenchmarks • Performance evaluation: Applications • Is further hardware support valuable ? • Locks • Barriers • Conclusions

Sensitivity to Lock Performance • Adding round-trip network delays • Extrapolate : 20-30 % improvement from better hardware • [Figure: panels for Raytrace and Radiosity]

When do faster locks help Applications ? • Applications sensitive to Lock performance • Raytrace, Radiosity (~20-30 %) • Substantial time in synchronization • Small contended critical sections • Critical section size = actual + lock overhead • Lock overhead dilates the critical section • Effect on performance ∝ size of critical section • 2 Apps : ~5 µs (1-2 updates to shared locations)
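The dilation argument on this slide is simple arithmetic, sketched below. The struct and function names are hypothetical, and the numbers in the usage note are made up for illustration; they are not measurements from the study.

```cpp
#include <cassert>

// Times are in microseconds.
struct CriticalSection {
    double actual_us;         // time spent doing the protected work
    double lock_overhead_us;  // acquire + release cost added by the lock
};

// Critical section size = actual + lock overhead.
double effective_us(const CriticalSection& cs) {
    return cs.actual_us + cs.lock_overhead_us;
}

// How much the lock stretches the section relative to the work alone.
double dilation(const CriticalSection& cs) {
    assert(cs.actual_us > 0.0);
    return effective_us(cs) / cs.actual_us;
}

// Under heavy contention the sections serialize, so the time on the
// critical path grows linearly with the effective section size.
double serialized_us(const CriticalSection& cs, long acquisitions) {
    return effective_us(cs) * static_cast<double>(acquisitions);
}
```

With an assumed 4 µs of real work and 1 µs of lock overhead, the section is dilated by a factor of 1.25, and 1000 contended acquisitions serialize to 5000 µs. The smaller the real work, the larger the share a faster lock can remove.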

Can we fix contention problems in these cases in the Application ? • Yes. Fix was fairly easy • Raytrace • Global counter → Partial reductions • Radiosity • Single buffer allocation queue → Multiple queues • Tasks added to local queue → Distribute • Significant performance improvement • Raytrace : 90%, Radiosity : 220%
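The "global counter to partial reductions" idea can be illustrated generically. This sketch is not the actual Raytrace change: it contrasts a single lock-protected counter with per-thread partial counts that are combined once at the end. `PaddedCount`, `count_with_global_lock`, and `count_with_partial_reduction` are hypothetical names, and the 64-byte padding assumes a typical cache-line size.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Contended version: every update takes the same lock.
std::uint64_t count_with_global_lock(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++total;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// Padding keeps each partial count on its own (assumed 64-byte) cache line.
struct PaddedCount {
    std::uint64_t value = 0;
    char pad[56] = {};
};

// Partial reduction: each thread updates only its own slot, with no lock;
// the partial counts are summed once after the workers finish.
std::uint64_t count_with_partial_reduction(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::vector<PaddedCount> partial(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i) {
                ++partial[t].value;
            }
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (const PaddedCount& p : partial) total += p.value;
    return total;
}
```

Both functions return the same total; the second removes the small contended critical section entirely, which is the kind of application-level fix the slide describes.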

Barriers • Load-imbalance dominates • Other applications • Well-balanced with little communication • Like the microbenchmarks; Real applications ? • Well-balanced computation & communication • SOR : nearest neighbor on a grid • Barriers : 61 % execution time • Still dominates. Communication Imbalance

Summary & Conclusions • Fetch&Op does not help Applications • At least for well-known lock & barrier algorithms • Using applications is important • Little benefit from hardware support • Locks : helps sometimes, but fixable in the application • Barriers : load imbalance dominates • Sound Methodology

Tournament barrier with Fetch&Op • Worse performance • Preliminary measurements indicated worse overhead in addition to traffic • Barrier performance did not make a difference in the applications

Small problem size • Raytrace : Decreases lock time • Barnes : Load-imbalance increases • Water-Nsq : Load-imbalance and Serialization • Ocean & SOR : Barrier time remains same • Radiosity & Water-Spatial : Not available

SOR Breakdown • Load imbalance dominates time spent in barriers
