Synchronization

Spinlocks and all the rest

Synchronization Overview • Cache coherency • Single versus Multi-core • Under versus Oversubscribed • Atomic operations • …

Synchronization Overview • Spinlock: acquire_lock(lock) { while (TAS(lock) == true); } • TAS (test and set): puts true in the address and returns the old value
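The TAS spinlock on this slide can be sketched in C11, where atomic_flag_test_and_set has exactly the "store true, return old value" behavior. The type and function names (tas_lock, tas, tas_acquire, tas_release) are made up for this sketch, and the memory orders are one reasonable choice rather than the only one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: a single flag, clear when the lock is free. */
typedef struct { atomic_flag held; } tas_lock;

/* TAS: atomically store true and return the previous value. */
bool tas(tas_lock *l) {
    return atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void tas_acquire(tas_lock *l) {
    /* Old value true means someone else holds the lock: try again. */
    while (tas(l)) { }
}

void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock would be declared as tas_lock l = { ATOMIC_FLAG_INIT }; every failed tas call is still a write to the shared cache line, which is the contention problem later slides address.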

Synchronization • Mellor-Crummey and Scott, 1991 • Analyzed spinlocks and barriers • Linear, proportional, and exponential backoff • Ticket locks with a “now serving” counter • Proposed the MCS lock, a queue-based lock
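A minimal sketch of the ticket lock with its "now serving" counter, assuming C11 atomics. The names ticket_lock, ticket_acquire, and ticket_release are illustrative, and returning the ticket number from acquire is only there to make the behavior visible.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: both counters start at 0. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock;

unsigned ticket_acquire(ticket_lock *l) {
    /* Take a ticket; fetch_add returns the value before the increment. */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* Spin with plain loads until our number is being served. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me) { }
    return me;
}

void ticket_release(ticket_lock *l) {
    /* Only the holder writes now_serving, so a load plus a store suffices. */
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

Unlike the plain TAS lock, waiters are served in arrival order, and each acquire performs a single atomic read-modify-write.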

Overview • Synchronization Types to be Discussed • Further Developments • Implementation Details

Types to be Discussed • Mutual Exclusion: Spinlock, Mutex, Reader Writer Lock • Execution Point: Barrier • Queues, etc. (time permitting)

Spinlocks • Spin until lock is acquired • Simple Implementation • Contention on lock
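One way to ease the contention noted here is the exponential backoff listed among the schemes Mellor-Crummey and Scott analyzed. The sketch below assumes C11 atomics; the constants BACKOFF_MIN and BACKOFF_MAX and all function names are illustrative and untuned.

```c
#include <assert.h>
#include <stdatomic.h>

enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };  /* illustrative, untuned */

typedef struct { atomic_flag held; } backoff_lock;

/* Double the delay, saturating at BACKOFF_MAX. */
unsigned next_delay(unsigned delay) {
    return delay * 2 > BACKOFF_MAX ? BACKOFF_MAX : delay * 2;
}

void backoff_acquire(backoff_lock *l) {
    unsigned delay = BACKOFF_MIN;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* Failed attempt: stay off the lock's cache line for a while. */
        for (volatile unsigned i = 0; i < delay; i++) { }
        delay = next_delay(delay);
    }
}

void backoff_release(backoff_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The empty delay loop is a stand-in; a real implementation would likely use a pause instruction or a randomized delay.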

Queued Spinlock • Create a local lock • Spin on it • On release, signal next waiter • Additional operations • Reduced contention

Mutex • Wait to acquire • May use thread scheduler to wait
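A rough sketch of a mutex that involves the thread scheduler: it spins a bounded number of times, then calls POSIX sched_yield to give up the CPU. Yielding is only a stand-in for real blocking (a production mutex would sleep on a kernel wait queue), and SPIN_LIMIT and the yield_mutex names are assumptions of this sketch.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { SPIN_LIMIT = 100 };  /* illustrative spin budget */

typedef struct { atomic_bool held; } yield_mutex;

/* One acquisition attempt: succeeds only if held was false. */
bool yield_mutex_try(yield_mutex *m) {
    bool expected = false;
    return atomic_compare_exchange_strong_explicit(
        &m->held, &expected, true,
        memory_order_acquire, memory_order_relaxed);
}

void yield_mutex_lock(yield_mutex *m) {
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (yield_mutex_try(m)) return;
        sched_yield();  /* let the scheduler run someone else */
    }
}

void yield_mutex_unlock(yield_mutex *m) {
    atomic_store_explicit(&m->held, false, memory_order_release);
}
```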

Reader Writer Lock • Readers can operate simultaneously with other readers • Only writers cause problems • Often spinlock plus count of readers
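The "spinlock plus count of readers" design might look like the following, assuming C11 atomics. Readers hold the guard spinlock only long enough to bump the count; a writer keeps the guard for its whole critical section and waits for the count to drain. All names are hypothetical, and this simple form lets a waiting writer block new readers.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_flag guard;    /* spinlock protecting entry to the lock */
    atomic_int  readers;  /* number of readers currently inside */
} rw_lock;

void guard_lock(rw_lock *l) {
    while (atomic_flag_test_and_set_explicit(&l->guard, memory_order_acquire)) { }
}

void guard_unlock(rw_lock *l) {
    atomic_flag_clear_explicit(&l->guard, memory_order_release);
}

void read_lock(rw_lock *l) {
    guard_lock(l);   /* waits here while a writer holds the guard */
    atomic_fetch_add_explicit(&l->readers, 1, memory_order_acquire);
    guard_unlock(l); /* other readers may now enter too */
}

void read_unlock(rw_lock *l) {
    atomic_fetch_sub_explicit(&l->readers, 1, memory_order_release);
}

void write_lock(rw_lock *l) {
    guard_lock(l);   /* keep new readers and other writers out */
    /* Wait for readers that were already inside to leave. */
    while (atomic_load_explicit(&l->readers, memory_order_acquire) != 0) { }
}

void write_unlock(rw_lock *l) {
    guard_unlock(l);
}
```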

Barrier • Keep a group of threads in “sync” • A barrier has to distinguish two events • The old barrier, because some waiting threads may not have resumed yet • The new barrier, because other threads may already have reached it
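One common way to tell the old barrier episode from the new one is sense reversal, sketched here with C11 atomics. This is a swapped-in illustration, not necessarily the scheme the slides have in mind; sense_barrier and barrier_wait are hypothetical names, and each thread is assumed to keep its own local_sense flag, initially false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* threads still to arrive in this episode */
    atomic_bool sense;     /* flips each time the barrier opens */
    int         nthreads;  /* group size; count starts at this value */
} sense_barrier;

void barrier_wait(sense_barrier *b, bool *local_sense) {
    /* The value sense will take once this episode completes. */
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arriver: re-arm the count first, then release everyone. */
        atomic_store_explicit(&b->count, b->nthreads, memory_order_relaxed);
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* Spin until the last arriver flips sense for this episode. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) { }
    }
}
```

A thread that races ahead into the next episode waits for the opposite sense value, so it cannot be confused with a slow thread still leaving the previous one.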

Further Developments

Scalable RW Lock • Modification to MCS lock • Count of Readers + Writer Waiting Flag • Queue of waiting threads • Readers unblock readers on acquire • Writers unblock next thread on release • John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable reader-writer synchronization for shared-memory multiprocessors. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming (PPOPP '91). ACM, New York, NY, USA, 106-113.

Scalable RW Lock cont. • Split up the reader access • Since readers can acquire the lock with readers, have multiple locks • Writers, however, need all of the reader locks • Wilson C. Hsieh and William E. Weihl. 1992. Scalable Reader-Writer Locks for Parallel Systems. In Proceedings of the 6th International Parallel Processing Symposium, Viktor K. Prasanna and Larry H. Canter (Eds.). IEEE Computer Society, Washington, DC, USA, 656-659.
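The idea of splitting reader access can be illustrated with one small lock per reader slot: a reader takes only its own slot, while a writer must take all of them. This is a loose sketch of the idea, not the algorithm from the Hsieh and Weihl paper; NSLOTS, the function names, and the assumption that each reader thread owns a distinct slot id are all made up here.

```c
#include <assert.h>
#include <stdatomic.h>

enum { NSLOTS = 4 };  /* illustrative: one slot per reader thread */

typedef struct { atomic_flag slot[NSLOTS]; } split_rw_lock;

void split_init(split_rw_lock *l) {
    for (int i = 0; i < NSLOTS; i++) atomic_flag_clear(&l->slot[i]);
}

void slot_lock(atomic_flag *f) {
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire)) { }
}

void slot_unlock(atomic_flag *f) {
    atomic_flag_clear_explicit(f, memory_order_release);
}

/* A reader touches only its own slot, so readers do not contend. */
void split_read_lock(split_rw_lock *l, int id)   { slot_lock(&l->slot[id]); }
void split_read_unlock(split_rw_lock *l, int id) { slot_unlock(&l->slot[id]); }

/* A writer needs every slot; a fixed order avoids writer-writer deadlock. */
void split_write_lock(split_rw_lock *l) {
    for (int i = 0; i < NSLOTS; i++) slot_lock(&l->slot[i]);
}

void split_write_unlock(split_rw_lock *l) {
    for (int i = 0; i < NSLOTS; i++) slot_unlock(&l->slot[i]);
}
```

The trade-off is visible in the loop bounds: reads stay cheap while the cost of a write grows with the number of slots. A real version would also pad each slot to its own cache line.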

Scalable RW Lock cont. • Or use a C-SNZI • Closable scalable nonzero indicator • Like a semaphore, but can be “closed” • What about write upgrade? • Yossi Lev, Victor Luchangco, and Marek Olszewski. 2009. Scalable reader-writer locks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (SPAA '09). ACM, New York, NY, USA, 101-110.

Biased Locks • First and second class “citizens” • Like readers / writers, but all exclusive • Secondary locks request the lock • Primary holder grants them the lock • Nalini Vasudevan, Kedar S. Namjoshi, and Stephen A. Edwards. 2010. Simple and fast biased locks. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 65-74.

MCS Extensions • Queue based locks • What if threads are preempted? • Add a time component to the lock • Stale elements are skipped • Michael L. Scott and William N. Scherer. 2001. Scalable queue-based spin locks with timeout. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming (PPoPP '01). ACM, New York, NY, USA, 44-52. • B. He, W. N. Scherer III, and M. L. Scott. “Preemption Adaptivity in Time-Published Queue-Based Spin Locks,” 11th Intl. Conf. on High Performance Computing, Goa, India, Dec. 2005.

Spinning vs Blocking • Spinning = busy-waiting • Blocking = thread scheduling • What is the trade-off between the two schemes? • Tested Solaris pthread implementation that does both • Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN '09). ACM, New York, NY, USA, 21-26.

Tree Barriers, etc. • Many threads all signaling a single count causes contention • Use signal and wakeup trees, with different degrees

Hardware Supported Barriers • Introduce dedicated on-chip connections • Single Centralized Controller • Transmission lines • Jungju Oh, Milos Prvulovic, and Alenka Zajic. 2011. TLSync: support for multiple fast barriers using on-chip transmission lines. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 105-116.

Implementation Details

Architectural Primitives • Compare and Swap(mem, old, new): if (*mem == old) *mem = new; returns what was in mem • LL/SC: LL loads a value; SC to the same address succeeds only if the data is unmodified since the LL
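The compare-and-swap described here returns the old contents, whereas C11's atomic_compare_exchange_strong returns a success flag and writes the observed value back through its second argument. A small wrapper recovers the slide's form. The names cas and fetch_and_increment are chosen for this sketch, with fetch_and_increment included only as a usage example.

```c
#include <assert.h>
#include <stdatomic.h>

/* CAS in the slide's form: install new_val only if *mem == old,
 * and return what was in *mem beforehand. */
int cas(atomic_int *mem, int old, int new_val) {
    /* On failure C11 stores the observed value into `old`;
     * on success `old` already equals the previous contents. */
    atomic_compare_exchange_strong(mem, &old, new_val);
    return old;
}

/* Usage example: an atomic increment built from a CAS retry loop. */
int fetch_and_increment(atomic_int *mem) {
    int seen = atomic_load(mem);
    int prev;
    while ((prev = cas(mem, seen, seen + 1)) != seen)
        seen = prev;  /* someone else changed *mem: retry from their value */
    return seen;
}
```

Portable C has no direct LL/SC; atomic_compare_exchange_weak is allowed to fail spuriously, which is what lets compilers implement it with an LL/SC pair on hardware that provides one.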

Test and Test-and-Set • Synchronization instructions are expensive • So don’t do them until likely to succeed • Test the lock, then Test-and-set the lock • Caveat emptor • Can lead to races if used incorrectly • Can save time like TryToAcquire rather than release
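A minimal test-and-test-and-set sketch, assuming C11 atomics; the ttas_ names are illustrative. The important detail for avoiding the races mentioned above is that the first test is only a hint: the result of the atomic exchange, not the earlier load, decides whether the lock was acquired.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } ttas_lock;

bool ttas_try_acquire(ttas_lock *l) {
    /* Test: a plain load can be served from the local cache. */
    if (atomic_load_explicit(&l->held, memory_order_relaxed))
        return false;
    /* Test-and-set: pay for the atomic operation only now. Another
     * thread may have won in between, so check the old value. */
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void ttas_acquire(ttas_lock *l) {
    while (!ttas_try_acquire(l)) { }
}

void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```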

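Queued Spinlock Details
    void acquire_queued_spinlock(entry** lock, entry* me) {
        me->next = NULL;
        me->state = UNLOCKED;
        entry* prev = atomic_swap(lock, me);  // enqueue self as the new tail
        if (prev == NULL) return;             // queue was empty: lock acquired
        me->state = LOCKED;
        prev->next = me;                      // link in behind the predecessor
        while (me->state == LOCKED);          // spin on our own entry
    }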

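Queued Spinlock Details cont.
    void release_queued_spinlock(entry** lock, entry* me) {
        while (me->next == NULL) {
            // No visible successor: if we are still the tail, empty the queue.
            if (me == CAS(lock, me, NULL)) return;
            // Otherwise a successor is mid-enqueue; wait for it to link in.
        }
        me->next->state = UNLOCKED;  // hand the lock to the successor
    }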
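The acquire and release routines from these two slides can be written against C11 atomics roughly as follows. This is a sketch under assumptions: the queued_spinlock typedef, the boolean locked field (standing in for the LOCKED/UNLOCKED state), and the memory orders are choices made here, not taken from the slides. It also sets locked before the swap, as the usual MCS formulation does, rather than after.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct entry {
    _Atomic(struct entry *) next;  /* successor in the queue, if any */
    atomic_bool locked;            /* true while this waiter must spin */
} entry;

typedef _Atomic(entry *) queued_spinlock;  /* queue tail; NULL when free */

void acquire_queued_spinlock(queued_spinlock *lock, entry *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Swap ourselves in as the new tail. */
    entry *prev = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (prev == NULL) return;  /* queue was empty: lock acquired */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own entry only. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
}

void release_queued_spinlock(queued_spinlock *lock, entry *me) {
    entry *next = atomic_load_explicit(&me->next, memory_order_acquire);
    while (next == NULL) {
        entry *expected = me;
        /* Still the tail? Then empty the queue and finish. */
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, NULL,
                memory_order_acq_rel, memory_order_relaxed))
            return;
        /* A successor is mid-enqueue; wait for it to link in. */
        next = atomic_load_explicit(&me->next, memory_order_acquire);
    }
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```

Each thread passes its own entry, typically on its stack, and must pass the same entry to release that it used for acquire.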

Bibliography • Dave Dice, Virendra J. Marathe, and Nir Shavit. 2011. Flat-combining NUMA locks. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures (SPAA '11). ACM, New York, NY, USA, 65-74. • B. He, W. N. Scherer III, and M. L. Scott. “Preemption Adaptivity in Time-Published Queue-Based Spin Locks,” 11th Intl. Conf. on High Performance Computing, Goa, India, Dec. 2005. • Wilson C. Hsieh and William E. Weihl. 1992. Scalable Reader-Writer Locks for Parallel Systems. In Proceedings of the 6th International Parallel Processing Symposium, Viktor K. Prasanna and Larry H. Canter (Eds.). IEEE Computer Society, Washington, DC, USA, 656-659. • Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN '09). ACM, New York, NY, USA, 21-26. • Yossi Lev, Victor Luchangco, and Marek Olszewski. 2009. Scalable reader-writer locks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (SPAA '09). ACM, New York, NY, USA, 101-110. • Peter S. Magnusson, Anders Landin, and Erik Hagersten. 1994. Queue Locks on Cache Coherent Multiprocessors. In Proceedings of the 8th International Symposium on Parallel Processing, Howard Jay Siegel (Ed.). IEEE Computer Society, Washington, DC, USA, 165-171.

Bibliography cont. • John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9, 1 (February 1991), 21-65. • John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable reader-writer synchronization for shared-memory multiprocessors. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming (PPOPP '91). ACM, New York, NY, USA, 106-113. • Jungju Oh, Milos Prvulovic, and Alenka Zajic. 2011. TLSync: support for multiple fast barriers using on-chip transmission lines. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 105-116. • Michael L. Scott and William N. Scherer. 2001. Scalable queue-based spin locks with timeout. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming (PPoPP '01). ACM, New York, NY, USA, 44-52. • Nalini Vasudevan, Kedar S. Namjoshi, and Stephen A. Edwards. 2010. Simple and fast biased locks. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 65-74.

Lock free list • Store head pointer • Atomically update head
    void push(node** head, node* n) {
        node* old;
        node* now = *head;
        do {
            old = now;
            n->next = old;  // link the new node in front of the head we saw
        } while ((now = CAS(head, old, n)) != old);  // retry if head changed
    }
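The same push can be sketched with C11 atomics. Here atomic_compare_exchange_weak refreshes old with the current head whenever it fails, which replaces the explicit now/old bookkeeping. The node layout and the stack_head typedef are assumptions of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int value;  /* hypothetical payload */
} node;

typedef _Atomic(node *) stack_head;

void push(stack_head *head, node *n) {
    node *old = atomic_load_explicit(head, memory_order_relaxed);
    do {
        /* Link in front of the head we last observed. On CAS failure
         * `old` has been refreshed with the current head. */
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, n,
                 memory_order_release, memory_order_relaxed));
}
```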

“ABA” Problem • Push C // pending • Pop A • Pop B • Push A • // Does Push C complete successfully now?

“ABA” Problem cont. • Pop A // pending • Pop A • Pop B • Push A • Does Pop A succeed?
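The pending Pop A question can be answered by replaying the interleaving in a single thread. The sketch below assumes a stack that starts as A -> B -> C and a pop written in the same CAS-loop style as push; cas, pop, and replay_aba are names invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node { struct node *next; char name; } node;

/* CAS in the slide's form: returns what was in *mem. */
node *cas(_Atomic(node *) *mem, node *old, node *new_val) {
    atomic_compare_exchange_strong(mem, &old, new_val);
    return old;
}

void push(_Atomic(node *) *head, node *n) {
    node *old, *now = atomic_load(head);
    do { old = now; n->next = old; } while ((now = cas(head, old, n)) != old);
}

node *pop(_Atomic(node *) *head) {
    node *old, *now = atomic_load(head);
    do {
        old = now;
        if (old == NULL) return NULL;
    } while ((now = cas(head, old, old->next)) != old);
    return old;
}

/* Replays the slide's interleaving; true if the stale CAS is accepted. */
bool replay_aba(void) {
    node c = { NULL, 'C' }, b = { &c, 'B' }, a = { &b, 'A' };
    _Atomic(node *) head = &a;        /* stack: A -> B -> C */
    /* Pending Pop A: it has read head and head->next, but no CAS yet. */
    node *seen = atomic_load(&head);  /* A */
    node *seen_next = seen->next;     /* B */
    pop(&head);                       /* other thread: Pop A */
    pop(&head);                       /* other thread: Pop B */
    push(&head, &a);                  /* other thread: Push A, now A -> C */
    /* The pending pop resumes: head is A again, so its CAS succeeds. */
    bool accepted = (cas(&head, seen, seen_next) == seen);
    /* head now points at B, which is no longer part of the stack. */
    return accepted && atomic_load(&head) == &b;
}
```

So the pending pop does succeed, and that is the problem: the head value matched even though the list underneath it changed, leaving head pointing at a removed node and C unreachable.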
