Scalable Synchronization Algorithms in Multi-core Processors • Jeremy Denham • April 7, 2008
Outline • Motivation • Background / Previous work • Experimentation • Results • Questions
Motivation • Modern processor design is dominated by the multi-core paradigm. • We are still figuring out what to do with the extra cores • A different way of thinking about “shared-memory multiprocessors” • Distributed apps? • Synchronization will be important.
Background / Previous work • Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott, 1991. • Scalable, busy-wait synchronization algorithms • No memory or interconnect contention • O(1) remote references per lock acquisition or barrier episode • Spin locks and barriers
Spin Locks • “Spin” on lock by busy-waiting until available. • Typically involves “fetch-and-Φ” operations • Must be atomic!
Simple spin locks • “Test-and-set” • Needs processor support to make it atomic • “fetch-and-store” • xchg works on x86 • Loop until the lock is acquired • Expensive! • The lock word is frequently accessed, too • Heavy interconnect traffic (see the sketch below)
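A minimal test-and-set spin lock sketch in C11 atomics (illustrative only; the names spinlock_t, spin_lock, and spin_unlock are mine, not from the paper):

```c
#include <stdatomic.h>

/* Test-and-set spin lock: one shared flag, hammered by every waiter. */
typedef struct { atomic_flag held; } spinlock_t;
/* initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */

static void spin_lock(spinlock_t *l) {
    /* atomic_flag_test_and_set is a fetch-and-store: it sets the flag
       and returns the previous value. Spin until we saw it clear. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* every iteration may generate coherence/interconnect traffic */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```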
Ticket lock • Can reduce fetch-and-Φ ops to one per lock acquisition • FIFO service guarantee • Two counters • Requests • Releases • fetch_and_increment the request counter • Wait until the release counter reflects your turn • Still problematic: every waiter spins on the same release counter (sketch below)
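A ticket lock sketch, again in C11 atomics with my own naming:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* request counter: bumped once per acquire */
    atomic_uint now_serving;  /* release counter: bumped once per release */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    /* The single fetch-and-increment per acquisition takes a ticket. */
    unsigned turn = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                              memory_order_relaxed);
    /* FIFO: wait for the release counter to reach our ticket. All
       waiters poll the same word -- the lock's remaining weakness. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != turn)
        ;
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```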
Queue-based approach • T.E. Anderson • Array-based: incoming processors take slots in a queue, and each spins on its own slot • Lock holder hands the lock off to the next slot in the queue • Faster than the ticket lock, but needs O(P) space per lock (sketch below)
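An array-based queue lock sketch in the spirit of Anderson's (slot layout and names are mine; NSLOTS must be a power of two at least as large as the thread count):

```c
#include <stdatomic.h>

#define NSLOTS 16  /* power of two, >= max concurrent threads */

typedef struct {
    atomic_int  can_enter[NSLOTS];  /* 1 = slot holder may enter */
    atomic_uint next_slot;
} array_lock_t;

/* init: can_enter[0] = 1, all other slots 0, next_slot = 0 */

static unsigned array_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % NSLOTS;
    while (atomic_load(&l->can_enter[me]) == 0)
        ;                               /* spin on a private slot */
    atomic_store(&l->can_enter[me], 0); /* consume the grant      */
    return me;                          /* keep slot for release  */
}

static void array_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->can_enter[(me + 1) % NSLOTS], 1);  /* hand off */
}
```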
MCS List-based Queuing Lock • FIFO Guarantee • Local spinning! • Small constant amount of space • Cache coherence a non-issue
Details • Each processor allocates a record • next link • boolean flag • Adds to queue • Spins locally on its own flag • Owner passes the lock to the next waiter in the queue as necessary (sketch below)
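A C11 sketch of the MCS lock (the paper gives pseudocode; this translation and its names are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;   /* link to our successor in the queue */
    atomic_bool           locked;
} qnode;

typedef struct { qnode *_Atomic tail; } mcs_lock;  /* tail == NULL: free */

static void mcs_acquire(mcs_lock *l, qnode *me) {
    atomic_store(&me->next, NULL);
    qnode *pred = atomic_exchange(&l->tail, me);  /* enqueue at the tail */
    if (pred != NULL) {                /* lock held: wait for a handoff */
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me); /* let our predecessor find us */
        while (atomic_load(&me->locked))
            ;                          /* spin on OUR OWN flag: local spinning */
    }
}

static void mcs_release(mcs_lock *l, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expected = me;          /* no visible successor: try to free */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                          /* a new arrival is still linking in */
    }
    atomic_store(&succ->locked, false);  /* pass the lock on */
}
```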
Barriers • Mechanism for “phase separation” • Block processes from proceeding until all others have reached a checkpoint • Designed for repetitive use
Centralized Barriers • “Local” and “global” sense • As processor arrives • Reverse local sense • Signal its arrival • If last, reverse global sense • Else spin • Lots of spinning…
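A sense-reversing centralized barrier sketch (C11; variable names are mine):

```c
#include <stdatomic.h>

#define P 2  /* number of participating threads */

static atomic_int count = P;         /* arrivals still outstanding  */
static atomic_int global_sense = 0;  /* flipped by the last arriver */

static void central_barrier(int *local_sense) {
    *local_sense = !*local_sense;               /* reverse local sense */
    if (atomic_fetch_sub(&count, 1) == 1) {     /* signal our arrival  */
        atomic_store(&count, P);                /* last arriver resets */
        atomic_store(&global_sense, *local_sense);  /* ...and releases */
    } else {
        while (atomic_load(&global_sense) != *local_sense)
            ;   /* everyone else spins on the one global flag */
    }
}
```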
Dissemination Barriers • Barrier information is “disseminated” algorithmically • In each synchronization round k, processor i signals processor (i + 2^k) mod P, where P is the number of processors • Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P • log(P) operations on the critical path, P log(P) remote operations in total
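A sketch (C11, my naming, P fixed at compile time). As in the paper, two flag “parities” keep consecutive barrier episodes from overwriting each other's signals:

```c
#include <stdatomic.h>

#define P    8
#define LOGP 3  /* ceil(log2 P) rounds */

/* flags[parity][k][i]: processor i has been signaled in round k. */
static atomic_int flags[2][LOGP][P];

/* Each thread keeps its own parity (init 0) and sense (init 1). */
static void dissemination_barrier(int me, int *parity, int *sense) {
    for (int k = 0; k < LOGP; k++) {
        int partner = (me + (1 << k)) % P;       /* signal (i + 2^k) mod P */
        atomic_store(&flags[*parity][k][partner], *sense);
        while (atomic_load(&flags[*parity][k][me]) != *sense)
            ;                                    /* wait on (i - 2^k) mod P */
    }
    if (*parity == 1)
        *sense = !*sense;   /* flip sense every other episode */
    *parity = 1 - *parity;
}
```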
Tournament Barriers • Tree-based approach • Outcome statically determined • “Roles” for each round • “loser” notifies “winner,” then drops out • “winner” waits to be notified, participates in next round • “champion” sets global flag when over • log(P) rounds • Heavy interconnect traffic…
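A tournament barrier sketch for P a power of two (C11; the loop structure and names are mine, and here the champion releases everyone through one global sense flag rather than unwinding notifications back down the tree):

```c
#include <stdatomic.h>

#define P    8
#define LOGP 3

/* round_flag[k][w]: w's statically paired loser in round k has arrived. */
static atomic_int round_flag[LOGP][P];
static atomic_int champion_sense = 0;

static void tournament_barrier(int me, int *sense) {
    *sense = !*sense;
    for (int k = 0; k < LOGP; k++) {
        if ((me & ((1 << (k + 1)) - 1)) != 0) {
            /* Statically determined loser: notify our winner, drop out. */
            atomic_store(&round_flag[k][me - (1 << k)], *sense);
            break;
        }
        /* Winner: wait to be notified, then play the next round. */
        while (atomic_load(&round_flag[k][me]) != *sense)
            ;
    }
    if (me == 0)
        atomic_store(&champion_sense, *sense);  /* champion: all arrived */
    else
        while (atomic_load(&champion_sense) != *sense)
            ;                                   /* wait for the champion */
}
```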
MCS Approach • Also tree-based • Local spinning • O(P) space for P processors • (2P – 2) network transactions • O(log P) network transactions on critical path
The idea • Use two P-node trees • “child-not-ready” flag for each child present in parent • When all children have signaled arrival, parent signals its parent • When root detects all children have arrived, signals to the group that it can proceed to next barrier.
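A simplified sketch of the arrival half (C11, my layout): a 4-ary arrival tree as in the paper, but with the root releasing everyone through a single sense flag instead of the paper's separate binary wakeup tree, which sacrifices local spinning on wakeup:

```c
#include <stdatomic.h>

#define P 8  /* thread i's arrival-tree children are 4i+1 .. 4i+4 */

typedef struct {
    atomic_int child_done[4];  /* "child-not-ready" flags, sense-encoded */
    int        n_children;     /* init: number of child indices < P      */
} tree_node_t;

static tree_node_t node[P];
static atomic_int  release_sense = 0;  /* flipped by the root */

static void tree_barrier(int me, int *sense) {
    *sense = !*sense;
    /* Wait until every child signals that its whole subtree arrived. */
    for (int c = 0; c < node[me].n_children; c++)
        while (atomic_load(&node[me].child_done[c]) != *sense)
            ;
    if (me != 0) {
        /* Signal our parent, then wait for the group release. */
        atomic_store(&node[(me - 1) / 4].child_done[(me - 1) % 4], *sense);
        while (atomic_load(&release_sense) != *sense)
            ;
    } else {
        atomic_store(&release_sense, *sense);  /* root: proceed to next barrier */
    }
}
```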
MCS Results • Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines • BBN • Supports up to 256 processor nodes • 8 MHz MC68000 • Sequent • Supports up to 30 processor nodes • 16 MHz Intel 80386 • The Sequent results are most relevant here
My Experiments • Want to extend the work to multi-core machines • Raw scalability is of limited interest (not that many cores); instead focus on • Shared resources • Core load
Equipment • Intel Core 2 Duo T5200 processor (Centrino Duo platform) • Two cores • 1.60 GHz per core • 2 MB shared L2 cache • Windows Vista • 2 GB DDR2 memory
Experimental Procedure • Evaluate basic and MCS approaches • Simple and complex evaluations • Core pinning • Load ramping
Challenges • Code porting • Lots of Linux-specific code • Win32 thread API • Esoteric… • How to pin a thread to a core? • Timing • Finding a μsec-granularity Win32 timer • Surprisingly archaic C code (sketch of both below)
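For reference, a sketch of the two Win32 pieces in question, thread pinning and high-resolution timing (these are standard Win32 calls; the surrounding code is illustrative):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Pin the calling thread to core 0: bit n of the mask = core n. */
    if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
        fprintf(stderr, "SetThreadAffinityMask failed\n");
        return 1;
    }

    /* QueryPerformanceCounter ticks at QueryPerformanceFrequency Hz,
       fine enough for microsecond-granularity measurements. */
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    /* ... timed region, e.g. N lock acquire/release pairs ... */
    QueryPerformanceCounter(&stop);

    printf("elapsed: %.3f us\n",
           (stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart);
    return 0;
}
```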
Progress • Spin lock base code ported • Barriers nearly done • Simple spin-lock experiments done • More complex ones on the way
Results • Simple spin lock tests • The simple lock outperforms MCS on: • Empty critical section • Simple FP critical section • Single core • Dual core • MCS pays more per-acquisition overhead, which dominates at this small scale • Next steps: • More threads! • More critical-section complexity