Scalable Synchronization Algorithms in Multi-core Processors • Jeremy Denham • April 7, 2008
Outline • Motivation • Background / Previous work • Experimentation • Results • Questions
Motivation • Modern processor design is dominated by the multi-core paradigm. • We are still figuring out what to do with the extra cores • A different way of thinking about “shared-memory multiprocessors” • Distributed apps? • Synchronization will be important.
Background / Previous work • Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott, 1991. • Scalable, busy-wait synchronization algorithms • No memory or interconnect contention • O(1) remote references per lock acquisition or barrier episode • Spin locks and barriers
Spin Locks • “Spin” on lock by busy-waiting until available. • Typically involves “fetch-and-Φ” operations • Must be atomic!
Simple spin locks • “Test-and-set” • Needs processor support to make it atomic • “fetch-and-store” • xchg works on x86 • Loop until the lock is acquired • Expensive! • The lock word is frequently accessed, too • Heavy interconnect traffic (see the sketch below)
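A minimal test-and-set spin lock sketch in C11 atomics (illustrative only; the names spinlock_t, spin_lock, and spin_unlock are mine, not from the paper):

```c
#include <stdatomic.h>

/* Test-and-set spin lock: one shared flag, hammered by every waiter. */
typedef struct { atomic_flag held; } spinlock_t;
/* initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */

static void spin_lock(spinlock_t *l) {
    /* atomic_flag_test_and_set is a fetch-and-store: it sets the flag
       and returns the previous value. Spin until we saw it clear. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* every iteration may generate coherence/interconnect traffic */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```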
Ticket lock • Can reduce fetch-and-Φ ops to one per lock acquisition • FIFO service guarantee • Two counters • Requests • Releases • fetch_and_increment the request counter • Wait until the release counter reflects your turn • Still problematic: every waiter spins on the same release counter (sketch below)
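A ticket lock sketch, again in C11 atomics with my own naming:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* request counter: bumped once per acquire */
    atomic_uint now_serving;  /* release counter: bumped once per release */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    /* The single fetch-and-increment per acquisition takes a ticket. */
    unsigned turn = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                              memory_order_relaxed);
    /* FIFO: wait for the release counter to reach our ticket. All
       waiters poll the same word -- the lock's remaining weakness. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != turn)
        ;
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```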
Queue-based approach • T.E. Anderson • Array-based: incoming processors take slots in a queue, and each spins on its own slot • Lock holder hands the lock off to the next slot in the queue • Faster than the ticket lock, but needs O(P) space per lock (sketch below)
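An array-based queue lock sketch in the spirit of Anderson's (slot layout and names are mine; NSLOTS must be a power of two at least as large as the thread count):

```c
#include <stdatomic.h>

#define NSLOTS 16  /* power of two, >= max concurrent threads */

typedef struct {
    atomic_int  can_enter[NSLOTS];  /* 1 = slot holder may enter */
    atomic_uint next_slot;
} array_lock_t;

/* init: can_enter[0] = 1, all other slots 0, next_slot = 0 */

static unsigned array_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % NSLOTS;
    while (atomic_load(&l->can_enter[me]) == 0)
        ;                               /* spin on a private slot */
    atomic_store(&l->can_enter[me], 0); /* consume the grant      */
    return me;                          /* keep slot for release  */
}

static void array_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->can_enter[(me + 1) % NSLOTS], 1);  /* hand off */
}
```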
MCS List-based Queuing Lock • FIFO Guarantee • Local spinning! • Small constant amount of space • Cache coherence a non-issue
Details • Each processor allocates a record • next link • boolean flag • Adds to queue • Spins locally on its own flag • Owner passes the lock to the next waiter in the queue as necessary (sketch below)
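A C11 sketch of the MCS lock (the paper gives pseudocode; this translation and its names are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;   /* link to our successor in the queue */
    atomic_bool           locked;
} qnode;

typedef struct { qnode *_Atomic tail; } mcs_lock;  /* tail == NULL: free */

static void mcs_acquire(mcs_lock *l, qnode *me) {
    atomic_store(&me->next, NULL);
    qnode *pred = atomic_exchange(&l->tail, me);  /* enqueue at the tail */
    if (pred != NULL) {                /* lock held: wait for a handoff */
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me); /* let our predecessor find us */
        while (atomic_load(&me->locked))
            ;                          /* spin on OUR OWN flag: local spinning */
    }
}

static void mcs_release(mcs_lock *l, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expected = me;          /* no visible successor: try to free */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                          /* a new arrival is still linking in */
    }
    atomic_store(&succ->locked, false);  /* pass the lock on */
}
```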
Barriers • Mechanism for “phase separation” • Block processes from proceeding until all others have reached a checkpoint • Designed for repetitive use
Centralized Barriers • “Local” and “global” sense • As processor arrives • Reverse local sense • Signal its arrival • If last, reverse global sense • Else spin • Lots of spinning…
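A sense-reversing centralized barrier sketch (C11; variable names are mine):

```c
#include <stdatomic.h>

#define P 2  /* number of participating threads */

static atomic_int count = P;         /* arrivals still outstanding  */
static atomic_int global_sense = 0;  /* flipped by the last arriver */

static void central_barrier(int *local_sense) {
    *local_sense = !*local_sense;               /* reverse local sense */
    if (atomic_fetch_sub(&count, 1) == 1) {     /* signal our arrival  */
        atomic_store(&count, P);                /* last arriver resets */
        atomic_store(&global_sense, *local_sense);  /* ...and releases */
    } else {
        while (atomic_load(&global_sense) != *local_sense)
            ;   /* everyone else spins on the one global flag */
    }
}
```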
Dissemination Barriers • Barrier information is “disseminated” algorithmically • In each synchronization round k, processor i signals processor (i + 2^k) mod P, where P is the number of processors • Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P • log(P) operations on the critical path, P log(P) remote operations in total
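A sketch (C11, my naming, P fixed at compile time). As in the paper, two flag “parities” keep consecutive barrier episodes from overwriting each other's signals:

```c
#include <stdatomic.h>

#define P    8
#define LOGP 3  /* ceil(log2 P) rounds */

/* flags[parity][k][i]: processor i has been signaled in round k. */
static atomic_int flags[2][LOGP][P];

/* Each thread keeps its own parity (init 0) and sense (init 1). */
static void dissemination_barrier(int me, int *parity, int *sense) {
    for (int k = 0; k < LOGP; k++) {
        int partner = (me + (1 << k)) % P;       /* signal (i + 2^k) mod P */
        atomic_store(&flags[*parity][k][partner], *sense);
        while (atomic_load(&flags[*parity][k][me]) != *sense)
            ;                                    /* wait on (i - 2^k) mod P */
    }
    if (*parity == 1)
        *sense = !*sense;   /* flip sense every other episode */
    *parity = 1 - *parity;
}
```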
Tournament Barriers • Tree-based approach • Outcome statically determined • “Roles” for each round • “loser” notifies “winner,” then drops out • “winner” waits to be notified, participates in next round • “champion” sets global flag when over • log(P) rounds • Heavy interconnect traffic…
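A tournament barrier sketch for P a power of two (C11; the loop structure and names are mine, and here the champion releases everyone through one global sense flag rather than unwinding notifications back down the tree):

```c
#include <stdatomic.h>

#define P    8
#define LOGP 3

/* round_flag[k][w]: w's statically paired loser in round k has arrived. */
static atomic_int round_flag[LOGP][P];
static atomic_int champion_sense = 0;

static void tournament_barrier(int me, int *sense) {
    *sense = !*sense;
    for (int k = 0; k < LOGP; k++) {
        if ((me & ((1 << (k + 1)) - 1)) != 0) {
            /* Statically determined loser: notify our winner, drop out. */
            atomic_store(&round_flag[k][me - (1 << k)], *sense);
            break;
        }
        /* Winner: wait to be notified, then play the next round. */
        while (atomic_load(&round_flag[k][me]) != *sense)
            ;
    }
    if (me == 0)
        atomic_store(&champion_sense, *sense);  /* champion: all arrived */
    else
        while (atomic_load(&champion_sense) != *sense)
            ;                                   /* wait for the champion */
}
```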
MCS Approach • Also tree-based • Local spinning • O(P) space for P processors • (2P – 2) network transactions • O(log P) network transactions on critical path
The idea • Use two P-node trees • “child-not-ready” flag for each child present in parent • When all children have signaled arrival, parent signals its parent • When root detects all children have arrived, signals to the group that it can proceed to next barrier.
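A simplified sketch of the arrival half (C11, my layout): a 4-ary arrival tree as in the paper, but with the root releasing everyone through a single sense flag instead of the paper's separate binary wakeup tree, which sacrifices local spinning on wakeup:

```c
#include <stdatomic.h>

#define P 8  /* thread i's arrival-tree children are 4i+1 .. 4i+4 */

typedef struct {
    atomic_int child_done[4];  /* "child-not-ready" flags, sense-encoded */
    int        n_children;     /* init: number of child indices < P      */
} tree_node_t;

static tree_node_t node[P];
static atomic_int  release_sense = 0;  /* flipped by the root */

static void tree_barrier(int me, int *sense) {
    *sense = !*sense;
    /* Wait until every child signals that its whole subtree arrived. */
    for (int c = 0; c < node[me].n_children; c++)
        while (atomic_load(&node[me].child_done[c]) != *sense)
            ;
    if (me != 0) {
        /* Signal our parent, then wait for the group release. */
        atomic_store(&node[(me - 1) / 4].child_done[(me - 1) % 4], *sense);
        while (atomic_load(&release_sense) != *sense)
            ;
    } else {
        atomic_store(&release_sense, *sense);  /* root: proceed to next barrier */
    }
}
```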
MCS Results • Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines • BBN • Supports up to 256 processor nodes • 8 MHz MC68000 • Sequent • Supports up to 30 processor nodes • 16 MHz Intel 80386 • The Sequent results are most relevant here
My Experiments • Want to extend the work to multi-core machines • Raw scalability is of limited interest (not that many cores); instead focus on • Shared resources • Core load
Equipment • Intel Core 2 Duo T5200 processor (Centrino Duo platform) • Two cores • 1.60 GHz per core • 2 MB shared L2 cache • Windows Vista • 2 GB DDR2 memory
Experimental Procedure • Evaluate basic and MCS approaches • Simple and complex evaluations • Core pinning • Load ramping
Challenges • Code porting • Lots of Linux-specific code • Win32 thread API • Esoteric… • How to pin a thread to a core? • Timing • Finding a μsec-granularity Win32 timer • Surprisingly archaic C code (sketch of both below)
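For reference, a sketch of the two Win32 pieces in question, thread pinning and high-resolution timing (these are standard Win32 calls; the surrounding code is illustrative):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Pin the calling thread to core 0: bit n of the mask = core n. */
    if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
        fprintf(stderr, "SetThreadAffinityMask failed\n");
        return 1;
    }

    /* QueryPerformanceCounter ticks at QueryPerformanceFrequency Hz,
       fine enough for microsecond-granularity measurements. */
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    /* ... timed region, e.g. N lock acquire/release pairs ... */
    QueryPerformanceCounter(&stop);

    printf("elapsed: %.3f us\n",
           (stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart);
    return 0;
}
```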
Progress • Spin lock base code ported • Barriers nearly done • Simple spin-lock experiments done • More complex ones on the way
Results • Simple spin lock tests • The simple lock outperforms MCS on: • Empty critical section • Simple FP critical section • Single core • Dual core • MCS pays more per-acquisition overhead, which dominates at this small scale • Next steps: • More threads! • More critical-section complexity