Jeremy Denham April 7, 2008 Scalable Synchronization Algorithms in Multi-core Processors
Outline • Motivation • Background / Previous work • Experimentation • Results • Questions
Motivation • Modern processor design trends are primarily concerned with the multi-core design paradigm. • Still figuring out what to do with them • Different way of thinking about “shared-memory multiprocessors” • Distributed apps? • Synchronization will be important.
Background / Previous work • Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott 1991. • Scalable, busy-wait synchronization algorithms • No memory or interconnect contention • O(1) remote references per mechanism utilization • Spin locks and barriers
Spin Locks • “Spin” on lock by busy-waiting until available. • Typically involves “fetch-and-Φ” operations • Must be atomic!
Simple spin locks • “Test-and-set” • Needs processor support to make it atomic • “fetch-and-store” • xchg works on x86 • Loop until lock is possessed • Expensive! • Frequently accessed, too • Networking issues
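A minimal sketch of the simple spin lock described on this slide, written in portable C++ rather than with the x86 xchg instruction directly. `std::atomic<bool>::exchange` plays the role of fetch-and-store. The names `TasLock` and `run_tas_demo` are hypothetical, and the `yield()` in the spin loop is an addition so the sketch still behaves when threads outnumber cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set spin lock built on fetch-and-store (atomic exchange).
class TasLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // exchange() atomically stores true and returns the previous value;
        // the lock is ours only if the previous value was false.
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // assumption: be polite when oversubscribed
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Hypothetical driver: every thread increments a plain counter under the lock.
long run_tas_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Every failed attempt is still an atomic write to the same shared location, which is the expensive, frequently accessed traffic the slide warns about.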
Ticket lock • Can reduce fetch-and-Φ ops to one per lock acquisition • FIFO service guarantee • Two counters • Requests • Releases • fetch_and_increment request counter • Wait until release counter reflects turn • Still problematic…
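The two-counter scheme can be sketched as follows. This is an illustrative C++ version, not the code used in the experiments. `TicketLock` and `run_ticket_demo` are hypothetical names, and `fetch_add` stands in for fetch_and_increment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<unsigned> requests{0};  // next ticket to hand out
    std::atomic<unsigned> releases{0};  // ticket currently being served
public:
    void lock() {
        // The single fetch-and-increment per acquisition.
        const unsigned my_ticket = requests.fetch_add(1, std::memory_order_relaxed);
        // Read-only spinning until the release counter reaches our ticket.
        while (releases.load(std::memory_order_acquire) != my_ticket)
            std::this_thread::yield();  // assumption: yield when oversubscribed
    }
    void unlock() {
        // Only the lock holder updates the release counter.
        releases.fetch_add(1, std::memory_order_release);
    }
};

// Hypothetical driver: every thread increments a plain counter under the lock.
long run_ticket_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TicketLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

All waiters still poll the same release counter, so each release disturbs every waiting processor. That shared polling is one reading of "still problematic".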
Queue-based approach • T.E. Anderson • Incoming processes put themselves in the queue • Lock holder hands off the lock to next in queue • Faster than ticket, but more space
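A sketch of an array-based queue lock in the spirit of Anderson's design, assuming the maximum number of contending threads is known up front. `AndersonLock`, `run_anderson_demo`, and the 64-byte padding are illustrative assumptions. The padding guesses at a typical cache-line size so that each waiter spins on its own line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class AndersonLock {
    struct Slot {
        std::atomic<bool> can_enter{false};
        char pad[64 - sizeof(std::atomic<bool>)];  // assumed cache-line spacing
    };
    std::vector<Slot> slots;                 // one slot per possible contender
    std::atomic<unsigned long long> tail{0}; // next free position in the queue
public:
    explicit AndersonLock(std::size_t max_threads) : slots(max_threads) {
        slots[0].can_enter.store(true);      // first arrival may enter at once
    }
    // Returns the slot index, which the caller passes back to unlock().
    std::size_t lock() {
        const std::size_t me = tail.fetch_add(1) % slots.size();
        while (!slots[me].can_enter.load(std::memory_order_acquire))
            std::this_thread::yield();
        slots[me].can_enter.store(false, std::memory_order_relaxed);  // reset for reuse
        return me;
    }
    void unlock(std::size_t me) {
        // Hand the lock to whoever queued up behind us.
        slots[(me + 1) % slots.size()].can_enter.store(true, std::memory_order_release);
    }
};

// Hypothetical driver: every thread increments a plain counter under the lock.
long run_anderson_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    AndersonLock lock(static_cast<std::size_t>(threads));
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                const std::size_t slot = lock.lock();
                ++counter;
                lock.unlock(slot);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

The per-lock array sized by the number of processors is the "more space" cost noted on the slide.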
MCS List-based Queuing Lock • FIFO Guarantee • Local spinning! • Small constant amount of space • Cache coherence a non-issue
Details • Each processor allocates a record • next link • boolean flag • Adds to queue • Spins locally • Owner passes lock to next user in queue as necessary
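The record-and-queue steps above can be sketched in C++ as follows. This is a minimal illustration of the MCS list-based lock using `std::atomic`, not the code ported for the experiments. `McsNode`, `McsLock`, and `run_mcs_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Per-thread record: a next link and a boolean flag.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
    std::atomic<McsNode*> tail{nullptr};  // last record in the queue, or null
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Append our record to the queue with one fetch-and-store.
        McsNode* pred = tail.exchange(&me, std::memory_order_acq_rel);
        if (pred != nullptr) {
            pred->next.store(&me, std::memory_order_release);
            // Spin only on our own record.
            while (me.locked.load(std::memory_order_acquire))
                std::this_thread::yield();  // assumption: yield when oversubscribed
        }
    }
    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing tail back to empty.
            McsNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire))
                return;
            // A successor is mid-enqueue; wait for it to link itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr)
                std::this_thread::yield();
        }
        succ->locked.store(false, std::memory_order_release);  // pass the lock on
    }
};

// Hypothetical driver: every thread increments a plain counter under the lock.
long run_mcs_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            McsNode node;  // this thread's queue record
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

The extra bookkeeping in `lock` and `unlock` compared with a single exchange is a plausible source of procedural overhead at small thread counts.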
Barriers • Mechanism for “phase separation” • Block processes from proceeding until all others have reached a checkpoint • Designed for repetitive use
Centralized Barriers • “Local” and “global” sense • As processor arrives • Reverse local sense • Signal its arrival • If last, reverse global sense • Else spin • Lots of spinning…
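The arrival steps above map onto a sense-reversing barrier along these lines. This is a sketch under assumptions: `SenseBarrier` and `run_sense_barrier_demo` are hypothetical names, and the driver only checks that nobody leaves a barrier episode before everyone has arrived.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SenseBarrier {
    const int total;                        // number of participants
    std::atomic<int> remaining;             // arrivals still outstanding
    std::atomic<bool> global_sense{false};
public:
    explicit SenseBarrier(int n) : total(n), remaining(n) {}
    // local_sense is per-thread state and must start out false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;                       // reverse local sense
        if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last to arrive: reset the count, then reverse the global sense.
            remaining.store(total, std::memory_order_relaxed);
            global_sense.store(local_sense, std::memory_order_release);
        } else {
            // Everyone else spins on the single shared flag.
            while (global_sense.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();  // assumption: yield when oversubscribed
        }
    }
};

// Hypothetical driver: returns how often a thread left the barrier too early.
int run_sense_barrier_demo(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    SenseBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            bool local_sense = false;
            for (int r = 1; r <= rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait(local_sense);
                // After round r, all threads must have arrived r times.
                if (arrived.load() < r * threads) violations.fetch_add(1);
            }
        });
    for (auto& th : pool) th.join();
    return violations.load();
}
```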
Dissemination Barriers • Barrier information is “disseminated” algorithmically • At each synchronization stage k, processor i signals processor (i + 2^k) mod P, where P is the number of processors • Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P • log(P) operations on critical path, P log(P) remote operations
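A sketch of the signalling pattern, assuming partners are a power-of-two distance apart in each stage. To keep it short, each flag here is a monotonically increasing counter rather than the parity and sense flags of the published algorithm. That substitution, and the names `DisseminationBarrier` and `run_dissemination_demo`, are illustrative choices.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class DisseminationBarrier {
    int P, rounds;
    std::vector<std::atomic<unsigned>> flags;  // flags[k * P + i], zero-initialized
    static int rounds_for(int p) {             // ceil(log2(p))
        int r = 0;
        while ((1 << r) < p) ++r;
        return r;
    }
public:
    explicit DisseminationBarrier(int p)
        : P(p), rounds(rounds_for(p)),
          flags(static_cast<std::size_t>(rounds_for(p)) * p) {}

    // Processor that i signals in stage k: (i + 2^k) mod P.
    static int partner(int i, int k, int P) { return (i + (1 << k)) % P; }

    // episode is per-thread state and must start out 0.
    void wait(int i, unsigned& episode) {
        ++episode;
        for (int k = 0; k < rounds; ++k) {
            // Signal the partner ahead of us for this stage...
            flags[k * P + partner(i, k, P)].fetch_add(1, std::memory_order_release);
            // ...and wait for the signal from (i - 2^k) mod P.
            while (flags[k * P + i].load(std::memory_order_acquire) < episode)
                std::this_thread::yield();  // assumption: yield when oversubscribed
        }
    }
};

// Hypothetical driver: returns how often a thread left the barrier too early.
int run_dissemination_demo(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    DisseminationBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            unsigned episode = 0;
            for (int r = 1; r <= rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait(t, episode);
                if (arrived.load() < r * threads) violations.fetch_add(1);
            }
        });
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Each thread spins only on flags that a single, statically known partner writes, and no thread is a central collection point.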
Tournament Barriers • Tree-based approach • Outcome statically determined • “Roles” for each round • “loser” notifies “winner,” then drops out • “winner” waits to be notified, participates in next round • “champion” sets global flag when over • log(P) rounds • Heavy interconnect traffic…
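The static roles can be sketched like this. In round k, a thread whose index is an odd multiple of 2^k is the loser and reports to the winner 2^k positions below it, and thread 0 ends up as champion. This version wakes everyone through one global flag, as the slide describes, and uses episode counters instead of sense flags for brevity. `TournamentBarrier` and `run_tournament_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class TournamentBarrier {
    int P, rounds;
    std::vector<std::atomic<unsigned>> arrived;  // arrived[k * P + winner]
    std::atomic<unsigned> released{0};           // global flag set by the champion
    static int rounds_for(int p) {               // ceil(log2(p))
        int r = 0;
        while ((1 << r) < p) ++r;
        return r;
    }
public:
    explicit TournamentBarrier(int p)
        : P(p), rounds(rounds_for(p)),
          arrived(static_cast<std::size_t>(rounds_for(p)) * p) {}

    // episode is per-thread state and must start out 0.
    void wait(int i, unsigned& episode) {
        ++episode;
        for (int k = 0; k < rounds; ++k) {
            const int step = 1 << k;
            if (i % (2 * step) == step) {
                // Loser: notify the statically chosen winner, then drop out.
                arrived[k * P + (i - step)].store(episode, std::memory_order_release);
                break;
            }
            if (i + step < P) {
                // Winner: wait for this round's loser before moving on.
                while (arrived[k * P + i].load(std::memory_order_acquire) != episode)
                    std::this_thread::yield();
            }
            // Otherwise there is no opponent this round (a bye).
        }
        if (i == 0) {
            released.store(episode, std::memory_order_release);  // champion
        } else {
            while (released.load(std::memory_order_acquire) != episode)
                std::this_thread::yield();
        }
    }
};

// Hypothetical driver: returns how often a thread left the barrier too early.
int run_tournament_demo(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    TournamentBarrier barrier(threads);
    std::atomic<int> arrived_count{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            unsigned episode = 0;
            for (int r = 1; r <= rounds; ++r) {
                arrived_count.fetch_add(1);
                barrier.wait(t, episode);
                if (arrived_count.load() < r * threads) violations.fetch_add(1);
            }
        });
    for (auto& th : pool) th.join();
    return violations.load();
}
```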
MCS Approach • Also tree-based • Local spinning • O(P) space for P processors • (2P – 2) network transactions • O(log P) network transactions on critical path
The idea • Use two P-node trees • “child-not-ready” flag for each child present in parent • When all children have signaled arrival, parent signals its parent • When root detects all children have arrived, signals to the group that it can proceed to next barrier.
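A compact sketch of the two-tree scheme. Following the Mellor-Crummey and Scott paper rather than anything stated on the slides, it assumes a 4-ary arrival tree and a binary wakeup tree laid out by index. `McsTreeBarrier` and `run_tree_barrier_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class McsTreeBarrier {
    struct Node {
        std::atomic<bool> child_not_ready[4];   // written by arrival-tree children
        bool have_child[4];
        std::atomic<bool> parent_sense{false};  // written by wakeup-tree parent
        bool sense = true;                      // private to the owning thread
    };
    std::vector<Node> nodes;
    int P;
public:
    explicit McsTreeBarrier(int p) : nodes(p), P(p) {
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < 4; ++j) {
                nodes[i].have_child[j] = (4 * i + j + 1 < P);
                nodes[i].child_not_ready[j].store(nodes[i].have_child[j]);
            }
    }
    void wait(int i) {
        Node& n = nodes[i];
        // Wait until every child in the arrival tree has signaled.
        for (int j = 0; j < 4; ++j)
            while (n.child_not_ready[j].load(std::memory_order_acquire))
                std::this_thread::yield();
        // Re-arm the flags for the next barrier episode.
        for (int j = 0; j < 4; ++j)
            n.child_not_ready[j].store(n.have_child[j], std::memory_order_relaxed);
        if (i != 0) {
            // Signal our arrival-tree parent, then spin on our own wakeup flag.
            nodes[(i - 1) / 4].child_not_ready[(i - 1) % 4]
                .store(false, std::memory_order_release);
            while (n.parent_sense.load(std::memory_order_acquire) != n.sense)
                std::this_thread::yield();
        }
        // Root or woken node: release children in the binary wakeup tree.
        if (2 * i + 1 < P) nodes[2 * i + 1].parent_sense.store(n.sense, std::memory_order_release);
        if (2 * i + 2 < P) nodes[2 * i + 2].parent_sense.store(n.sense, std::memory_order_release);
        n.sense = !n.sense;
    }
};

// Hypothetical driver: returns how often a thread left the barrier too early.
int run_tree_barrier_demo(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    McsTreeBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            for (int r = 1; r <= rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait(t);
                if (arrived.load() < r * threads) violations.fetch_add(1);
            }
        });
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Each non-root node performs one write up the arrival tree and receives one write down the wakeup tree, which is where the (2P - 2) transaction count comes from.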
MCS Results • Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines • BBN • Supports up to 256 processor nodes • 8 MHz MC68000 • Sequent • Supports up to 30 processor nodes • 16 MHz Intel 80386 • Most concerned with Sequent
My Experiments • Want to extend to multi-core machines • Scalability of limited usefulness (not that many cores) • Shared resources • Core load
Equipment • Intel Centrino Duo T5200 Processor • Two cores • 1.60 GHz per core • 2MB L2 Cache • Windows Vista • 2GB DDR2 Memory
Experimental Procedure • Evaluate basic and MCS approaches • Simple and complex evaluations • Core pinning • Load ramping
Challenges • Code porting • Lots of Linux-specific code • Win32 Thread API • Esoteric… • How to pin a thread to a core? • Timing • Win32 μsec-granularity measurement • Surprisingly archaic C code
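One possible answer to the pinning and timing questions, offered as a sketch rather than the approach actually used here. On Windows it calls `SetThreadAffinityMask`. The Linux branch using `pthread_setaffinity_np` is included only so the example builds elsewhere. `std::chrono::steady_clock` is swapped in as a portable stand-in for a Win32 high-resolution timer. `pin_current_thread` and `average_microseconds` are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

// Try to restrict the calling thread to one core; returns false on failure
// or on platforms this sketch does not cover.
bool pin_current_thread(unsigned core) {
#if defined(_WIN32)
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#else
    (void)core;
    return false;
#endif
}

// Average wall-clock cost of one call to body(), in microseconds.
template <class F>
double average_microseconds(F&& body, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) body();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

A typical use would pin each worker with `pin_current_thread(core)` at the top of its thread function and wrap the acquire and release pair in `average_microseconds`. Timing many iterations and dividing reduces the impact of coarse timer resolution.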
Progress • Spin lock base code ported • Barriers nearly done • Simple experiments for spin locks done • More complex on the way
Results • Simple spin lock tests • Simple lock outperforms MCS on: • Empty Critical Section • Simple FP Critical Section • Single core • Dual core • More procedural overhead for MCS on small scale • Next steps: • More threads! • More critical section complexity