Dynamic Software Transactional Memory Idan Igra Topics in Reliable Distributed Computing (048961) Technion, Nov 2008
Agenda • Motivation • Software Transactional Memory • Dynamic Software Transactional Memory • Fraser’s STM • Dynamic STM vs. Fraser’s STM • A blocking STM implementation • Another obstruction-free STM implementation by Fraser • DSTM contention management
Authors of the DSTM paper: William N. Scherer III, Department of Computer Science, University of Rochester, Rochester, NY 14620, USA, scherer@cs.rochester.edu • Mark Moir, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA, mark.moir@sun.com • Victor Luchangco, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA, victor.luchangco@sun.com • Maurice Herlihy, Department of Computer Science, Brown University, Providence, RI 02912, USA, mph@cs.brown.edu
Multicore history • Parallel computing was used mainly for HPC and networking. • PRAM and other shared-memory models are not realistic. • BSP and LogP (message-passing models) were used. • They were usable only by HPC specialists. • They demand a complicated system analysis per application. • Hardware constraints now force multicore architectures. • Today’s parallel programming is based on locks. • Coarse-grained locking prevents parallelism; fine-grained locking is hard to use. • Code reuse demands exposing internal locks. • There is no conventional way to associate a mutex with the data it protects.
Nonblocking liveness properties • Wait freedom: every process that tries to perform an operation completes it in a finite number of steps. • Lock freedom: if any process tries to perform an operation, then some process will succeed in completing an operation. • Obstruction freedom: a process that runs by itself, without interference from other processes, completes its operation.
Atomic hardware primitives • Load_Linked / Store_Conditional (LL/SC): LL(addr) returns the value stored at addr. A subsequent SC(addr, val) writes val into addr only if addr has not been written since the last LL. • Compare And Swap (CAS): CAS(addr, e, v) atomically writes v into addr if the value stored at addr equals e; otherwise addr is left unchanged. • MCAS: m CAS operations performed atomically (particular case: DCAS, with m = 2).
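The CAS semantics above can be sketched in a few lines. This is an illustration only: Python exposes no CAS instruction, so a lock stands in for the hardware's atomicity, and the names `AtomicCell`, `cas` and `atomic_increment` are hypothetical.

```python
import threading


class AtomicCell:
    """One memory word with a simulated CAS (the lock stands in for hardware atomicity)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def cas(self, expected, new):
        """Write `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_increment(cell):
    """Typical CAS retry loop: read, compute, try to publish, retry on failure."""
    while True:
        old = cell.load()
        if cell.cas(old, old + 1):
            return old + 1
```

A failed `cas` means another thread changed the word between the read and the write, so the loop simply rereads and tries again.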
Helping methodology • A methodology for non-blocking algorithms. • A process that needs data held by another process helps that process complete its operation. • Helping is usually recursive. • It is widely used in transactional memory, in particular for software implementations of MCAS (known as k-RMW).
Software Transactional Memory • A transaction first tries to acquire all the data it needs. • If it succeeds, it computes the transaction and releases the data. • If it fails, it releases everything and retries.
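The acquire / compute / release / retry cycle above can be sketched as follows. This is a simplified illustration, not the original algorithm: it omits helping, a lock stands in for the atomic primitive, and `Location`, `run_static_transaction` and `max_retries` are hypothetical names.

```python
import threading


class Location:
    """A shared word plus an ownership field (the lock simulates an atomic update)."""

    def __init__(self, value):
        self.value = value
        self.owner = None
        self._guard = threading.Lock()

    def try_acquire(self, tx_id):
        with self._guard:
            if self.owner is None:
                self.owner = tx_id
                return True
            return False

    def release(self, tx_id):
        with self._guard:
            if self.owner == tx_id:
                self.owner = None


def run_static_transaction(tx_id, locations, compute, max_retries=100):
    """Acquire the whole (pre-declared) data set, apply `compute`, release; retry on failure."""
    for _ in range(max_retries):
        acquired = []
        for loc in locations:
            if not loc.try_acquire(tx_id):
                break
            acquired.append(loc)
        try:
            if len(acquired) == len(locations):
                new_values = compute([loc.value for loc in locations])
                for loc, value in zip(locations, new_values):
                    loc.value = value
                return True
        finally:
            # Success or failure, everything acquired in this attempt is released.
            for loc in acquired:
                loc.release(tx_id)
    return False
```

Note that the data set must be known before the transaction starts, which is exactly the restriction that dynamic STM removes.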
Software Transactional Memory Why Software Transactional Memory? • Unexpected delays degrade the performance of lock-based code, on top of its inherent programming difficulties. • Memory allocation and deallocation cause synchronization conflicts. • Hardware transactional memory lacks platform support and portability, and suffers from delay anomalies. • Methods such as translating the code into k-RMW actions are non-trivial. • Working on a copy of the object does not suit large data structures. • A programmable and flexible non-blocking parallel programming method is needed.
Software Transactional Memory Data set pre-acquiring: • Unintuitive programming. • Reduced parallelism. • Common data structures have to be acquired in their entirety. • Dynamic data structures are impossible.
Software Transactional Memory Hardware support • LL/SC is not commonly supported by hardware. • The operating system can support it. • This is much slower. • It reduces parallelism (it forces some scheduling). • A more useful primitive can be defined.
Software Transactional Memory Wait freedom cost: • Complicated acquiring code. • Not flexible. • Uncommon primitives. • Long locking time.
Dynamic STM • Also supports dynamic transactions, whose data set changes while they run. • Satisfies obstruction freedom. • A modular contention manager enforces progress, implements priorities and adapts to the application.
Dynamic STM Implementation principles: • A TM object points to a Locator, which contains an old version, a new version and the last transaction that opened the object for writing. • The current version is determined by that transaction’s status (active / aborted / committed). • All objects are committed at once by changing the status. • Obstruction freedom is obtained by aborting a conflicting transaction (subject to the contention manager’s agreement).
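The Locator scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: locks simulate the CAS on the status word and on the locator pointer, the conflict policy is "always abort the other transaction" (a real DSTM asks the contention manager first), and all class and method names are hypothetical.

```python
import copy
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"


class Transaction:
    def __init__(self):
        self.status = ACTIVE
        self._guard = threading.Lock()  # simulates CAS on the status word

    def _cas_status(self, expected, new):
        with self._guard:
            if self.status == expected:
                self.status = new
                return True
            return False

    def commit(self):
        # One status change commits every object this transaction opened.
        return self._cas_status(ACTIVE, COMMITTED)

    def abort(self):
        return self._cas_status(ACTIVE, ABORTED)


class Locator:
    def __init__(self, owner, old, new):
        self.owner, self.old, self.new = owner, old, new


class TMObject:
    def __init__(self, value):
        creator = Transaction()
        creator.commit()
        self.locator = Locator(creator, None, value)
        self._guard = threading.Lock()  # simulates CAS on the locator pointer

    def current_version(self):
        loc = self.locator
        return loc.new if loc.owner.status == COMMITTED else loc.old

    def open_write(self, tx):
        while True:
            if tx.status != ACTIVE:
                raise RuntimeError("transaction is no longer active")
            loc = self.locator
            if loc.owner is tx:
                return loc.new
            if loc.owner.status == ACTIVE:
                loc.owner.abort()  # assumed policy: abort the conflicting writer
            current = loc.new if loc.owner.status == COMMITTED else loc.old
            fresh = Locator(tx, current, copy.deepcopy(current))
            with self._guard:
                if self.locator is loc:  # install only if nobody raced us
                    self.locator = fresh
                    return fresh.new
```

A writer mutates the returned private copy; other threads keep seeing the old version until the owner's status flips to committed.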
Dynamic STM DSTM properties and results: • Writing DSTM code, and converting sequential code into DSTM code, is much more natural. • Releases can significantly increase performance. • Reusing simpler algorithms to build a bigger one is easier with DSTM. • Disadvantage: there is no way to know that an object was opened for reading.
Dynamic STM • Obstruction freedom: • is simple, • is good enough for some applications, • enables implementing priorities, • enables separating correctness from progress, • and, most importantly, removes the need for a helping mechanism. • However, one can argue that it is not a real progress property.
Dynamic STM Discussion DSTM vs. STM: • DSTM relates to STM as coarse-grained locking relates to fine-grained locking. • But STM meets a real progress requirement rather than a weakened one (obstruction freedom). • Releases, as an integral part of the mechanism, reduce conflicts (compared to locks). • Non-blocking progress, and obstruction freedom in particular, is better because delayed or failed processes do not stop the whole system (a very strong point for DSTM). • DSTM’s implementation might lose that gain on truly parallel systems. • Letting the contention manager do the work is exactly like assuming that the scheduler will do it.
Fraser’s STM An STM should satisfy: • Small, fixed storage overhead per object. • Few shared-memory operations. • Short contention time. • A short window in which transactions can meet. Nice to have: • Support for varying object sizes. • Nested transactions.
Fraser’s STM • Every object is represented by a pointer to an object header, which consists of a version number and a pointer to the data block. • Open for read returns the data block pointer. • Open for write returns a pointer to a shadow copy. • Commit is done by acquiring all the opened objects, using MCAS and helping.
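The header / shadow-copy layout above can be sketched as follows. This is a single-threaded illustration under loud assumptions: a global lock replaces the MCAS-plus-helping commit, so the sketch is blocking rather than lock-free, and `ObjectHeader`, `Transaction` and their methods are hypothetical names.

```python
import copy
import threading

_commit_guard = threading.Lock()  # stand-in for MCAS + helping (assumption)


class ObjectHeader:
    """Version number plus a reference to the current data block."""

    def __init__(self, data):
        self.version = 0
        self.data = data


class Transaction:
    def __init__(self):
        self.reads = {}   # header -> version seen when opened
        self.writes = {}  # header -> (version seen, shadow copy)

    def open_read(self, header):
        if header in self.writes:
            return self.writes[header][1]
        self.reads.setdefault(header, header.version)
        return header.data

    def open_write(self, header):
        if header not in self.writes:
            seen = self.reads.pop(header, header.version)
            self.writes[header] = (seen, copy.deepcopy(header.data))
        return self.writes[header][1]

    def commit(self):
        with _commit_guard:
            # Validate: every opened object must still have the version we saw.
            for header, seen in self.reads.items():
                if header.version != seen:
                    return False
            for header, (seen, _) in self.writes.items():
                if header.version != seen:
                    return False
            # Publish the shadow copies and bump the versions.
            for header, (_, shadow) in self.writes.items():
                header.data = shadow
                header.version += 1
            return True
```

Because writers only touch shadow copies, nothing is visible to other transactions until commit, which is the lazy-acquire behaviour compared with DSTM later in the deck.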
Fraser’s STM • Problem: acquiring and releasing read-only objects blocks non-conflicting transactions. • This is critical for data structures with a single entry point (e.g. the head of a linked list). • Solution: do not acquire read-only objects. • Add a read-checking state in which the transaction checks all the objects it opened read-only, verifying that no other transaction updates them during this time.
Fraser’s STM • Deadlock prevention: T1 can abort T2 only if: • both transactions are in the read-checking status, • T2 holds a location that T1 tries to read, • T1 < T2 according to a given total order on transactions.
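The three conditions above amount to a small predicate. The sketch below encodes them directly; the `Tx` record, its fields and the use of an integer for the total order are assumptions made for illustration.

```python
from dataclasses import dataclass, field

READ_CHECKING = "read-checking"


@dataclass
class Tx:
    order: int                                   # position in the total order
    status: str
    held: set = field(default_factory=set)       # locations this transaction holds
    to_read: set = field(default_factory=set)    # locations it is read-checking


def may_abort(t1, t2):
    """True only when all three conditions from the slide hold."""
    return (
        t1.status == READ_CHECKING
        and t2.status == READ_CHECKING
        and bool(t2.held & t1.to_read)
        and t1.order < t2.order
    )
```

When two read-checking transactions each hold what the other reads, only the one that is smaller in the total order may abort the other, so the cycle is always broken in one direction.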
DSTM vs. FSTM FSTM is much better: • Lazy acquire exposes a transaction to others only for a very short time, which reduces the number of conflicts. • DSTM’s indirection levels decrease performance (mainly for read-only transactions). • The contention manager required by obstruction freedom has a 5-10% overhead and is hard to design.
DSTM vs. FSTM DSTM is much better: • Eager acquire helps detect conflicts earlier. • This is possible thanks to the weakness of obstruction freedom. • Fewer CAS operations (N+1 for DSTM vs. 2N+2 for FSTM). • The implementation is simpler and more efficient. • MCAS causes a lot of cache-block thrashing.
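As a worked check of the CAS counts quoted above, and assuming N denotes the number of objects a transaction acquires (the slide does not define N):

```latex
\frac{\mathrm{CAS}_{\mathrm{FSTM}}}{\mathrm{CAS}_{\mathrm{DSTM}}}
  = \frac{2N+2}{N+1}
  = \frac{2(N+1)}{N+1}
  = 2,
\qquad \text{e.g. } N = 10:\ 22 \text{ vs. } 11 .
```

Under these counts FSTM issues exactly twice as many CAS operations as DSTM for every N.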
DSTM vs. FSTM DSTM is better for workloads in which: • Many locations are opened. • Accesses are mainly writes to the same location (IntSet). • Transactions must be serialized (stack). FSTM is better for workloads in which: • Livelocks are common (RBTree). • Transactions are small. • The conflict probability is small (IntSetRelease).
DSTM vs. FSTM General remarks: • Not validating repeatedly improves performance. • How can inconsistent (aborted) transactions be avoided?
Contention Management Recall: a DSTM contention manager should: • ensure progress, • eventually return from every call, • eventually abort the conflicting transaction. Management approaches are tested for: • Various data sets. • Visible / invisible reads (non-optimistic / optimistic). • Eliminating unnecessary aborts.
Contention Management • Aggressive: always abort the enemy transaction. A good baseline for comparison. • Polite: back off before aborting. Sensitive to preemption, page faults, etc. • Randomized: flip a (balanced) coin to decide between aborting and waiting (64 ns). • Eruption: a transaction helps the transaction that blocks it by passing on its momentum (momentum = successful opens + momentum of the transactions it blocks). • The reasoning is to let transactions that hold critical data finish.
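The first three policies above can be sketched behind one small interface. Everything about the interface is an assumption for illustration: the `resolve(attempt)` method, its `("abort" | "wait", nanoseconds)` return value, and the exponential backoff parameters of `Polite` are hypothetical. Only the 64 ns wait of `Randomized` comes from the slide.

```python
import random


class Aggressive:
    """Always abort the enemy transaction."""

    def resolve(self, attempt):
        return ("abort", 0)


class Polite:
    """Back off a bounded number of times (exponentially, by assumption), then abort."""

    def __init__(self, max_backoffs=8, base_ns=16):
        self.max_backoffs = max_backoffs
        self.base_ns = base_ns

    def resolve(self, attempt):
        if attempt < self.max_backoffs:
            return ("wait", self.base_ns * 2 ** attempt)
        return ("abort", 0)


class Randomized:
    """Flip a balanced coin between aborting and waiting a fixed interval."""

    def __init__(self, wait_ns=64, rng=None):
        self.wait_ns = wait_ns
        self.rng = rng if rng is not None else random.Random()

    def resolve(self, attempt):
        if self.rng.random() < 0.5:
            return ("abort", 0)
        return ("wait", self.wait_ns)
```

A transaction that finds an object held by an enemy would call `resolve` with the number of attempts so far and either wait for the returned interval or abort the enemy. Every policy must eventually answer "abort" with probability 1, which is what preserves obstruction freedom.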
Contention Management • Karma: the older transaction (in terms of number of opens) wins. Opens made during previous aborted runs are also counted. • Kindergarten: at first, back off before aborting. Later, aborts are done in turns. • KillBlocked: a transaction aborts the transaction that blocks it if that transaction is itself blocked (or after a fixed time). • Timestamp: the older transaction wins. A failure detector is used. • QueueOnBlock: blocked transactions are released in queue order when the blocking transaction has finished (or after a fixed time).
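The two "older wins" rules above, Karma and Timestamp, can be sketched as comparison functions. This is a simplification: the real managers also involve waiting and a failure detector, which are omitted here, and the tie-breaking in favour of the enemy as well as all names (`Tx`, `karma_winner`, `timestamp_winner`) are assumptions.

```python
import itertools

_clock = itertools.count()  # logical clock handing out start timestamps


class Tx:
    def __init__(self, name):
        self.name = name
        self.start = next(_clock)  # smaller value means older transaction
        self.karma = 0             # number of opens, accumulated across aborted runs

    def opened_object(self):
        self.karma += 1

    def restart_after_abort(self):
        # Karma is deliberately NOT reset: opens from aborted runs still count.
        pass


def karma_winner(me, enemy):
    """The transaction with more accumulated opens wins; ties go to the enemy (assumption)."""
    return me if me.karma > enemy.karma else enemy


def timestamp_winner(me, enemy):
    """The transaction that started earlier wins."""
    return me if me.start < enemy.start else enemy
```

Keeping karma across restarts means a repeatedly aborted transaction gains priority with every run and eventually wins.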
Contention Management Results: • Most managers, except Timestamp, are good for IntSetRelease with invisible reads. • Aggressive, Randomized, Eruption and Polite perform badly. • QueueOnBlock and KillBlocked perform well only for RBTree with invisible reads. • Timestamp is good only for Counter. • Kindergarten is excellent, except for IntSetRelease with visible reads and for RBTree. • Karma is not good for IntSet or for LFUCache with visible reads.
Contention Management Visible reads vs. invisible reads: • In IntSet and Counter there is no difference, as all accesses are writes. • In IntSetRelease visible reads are better (except for Kindergarten, which is bad with both). • Visible reads offer a way to avoid conflicts on short accesses. • In LFUCache for all managers, and in RBTree for all but Karma, invisible reads are much better. • Most conflicts are between a reader that scans its path and a writer that updates the path to the root.
Blocking STM implementation Why not worry about blocking (mainly compared to obstruction freedom)? • Long transactions must be aborted anyway. Obstruction freedom is enforced only for a single transaction. • Context switches are not a problem: • they are temporary, • the OS adapts automatically, • the platform can help (by priorities, etc.). • Independent failures: • are not common on multicore machines, • sequential programs also fail due to a single failure.
Blocking STM implementation Non-blocking is bad because: • Metadata and the object must be stored separately in order to be non-blocking. • This doubles the cache misses. • Assume N active transactions on N processors: a new transaction must not be blocked, so the number of conflicts increases.
Blocking STM implementation • Every transaction keeps, in its private data, a descriptor per opened object (consisting of the version, a pointer and possibly a copy). • Every object has a lock (with deadlock prevention), which is used when trying to commit. • Accesses wait for the object to be unlocked. Read accesses are optimistic. • There is a priority mechanism.
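The design above can be sketched as follows. This is an illustration under assumptions, not the actual blocking STM implementation: deadlock prevention is done here by locking written objects in a fixed id order, the priority mechanism is omitted, values are assumed immutable, and all names are hypothetical.

```python
import itertools
import threading

_ids = itertools.count()


class VersionedObject:
    """Data stored in place, next to its version number and its lock."""

    def __init__(self, value):
        self.oid = next(_ids)
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.descriptors = {}  # object -> private descriptor

    def _open(self, obj):
        if obj not in self.descriptors:
            with obj.lock:  # wait until the object is unlocked, then snapshot it
                self.descriptors[obj] = {
                    "version": obj.version, "copy": obj.value, "dirty": False}
        return self.descriptors[obj]

    def read(self, obj):
        return self._open(obj)["copy"]  # optimistic: validated only at commit

    def write(self, obj, value):
        d = self._open(obj)
        d["copy"], d["dirty"] = value, True

    def commit(self):
        written = sorted((o for o, d in self.descriptors.items() if d["dirty"]),
                         key=lambda o: o.oid)  # fixed order avoids deadlock
        for obj in written:
            obj.lock.acquire()
        try:
            # Validate every opened object against the version seen at open time.
            if any(o.version != d["version"] for o, d in self.descriptors.items()):
                return False
            for obj in written:
                obj.value = self.descriptors[obj]["copy"]
                obj.version += 1
            return True
        finally:
            for obj in written:
                obj.lock.release()
```

Because the version and lock live with the object itself, a reader touches one memory region instead of following a separate metadata pointer, which is the cache-miss argument made for blocking designs.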
Blocking STM implementation [Figure: CPU time for various numbers of processors]
Blocking STM implementation [Figure: CPU time for various contention instances]
Blocking STM implementation Discussion: • Context switches ARE a problem because of long delays. • Failures are more common in parallel programs than in sequential ones. • Is delay more interesting than throughput?
Another STM • As in DSTM, committing is done by changing a status, and the current version is determined by the owner transaction’s status. • But as in FSTM, before committing the transaction tries to acquire all of its ownership records. • A wait method is provided so that a transaction can wait for acquired data before retrying.
Another STM • An ownership record (orec) contains either the version number of one (or more) objects or a pointer to the owner transaction’s descriptor. • Before committing, every transaction tries to acquire its owned data. • If the data is already acquired, the transaction can abort the other transaction, wait for it to finish, or wake it up (if it is sleeping).
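The "either a version number or an owner pointer" idea above can be sketched as a single field holding one of two kinds of value. The lock simulates the CAS on the orec word, and `Orec`, `TxDescriptor`, `acquire` and `release` are hypothetical names.

```python
import threading


class TxDescriptor:
    def __init__(self, name):
        self.name = name
        self.status = "active"


class Orec:
    """Ownership record: holds an int version when free, a TxDescriptor when owned."""

    def __init__(self):
        self.content = 0
        self._guard = threading.Lock()  # simulates CAS on the orec word

    def owner(self):
        c = self.content
        return c if isinstance(c, TxDescriptor) else None

    def acquire(self, tx, expected_version):
        """Succeeds only if the orec is free and still at the version the caller saw."""
        with self._guard:
            if isinstance(self.content, TxDescriptor):
                return False
            if self.content != expected_version:
                return False
            self.content = tx
            return True

    def release(self, tx, new_version):
        """The owner puts a version number back, making the orec free again."""
        with self._guard:
            if self.content is tx:
                self.content = new_version
                return True
            return False
```

A failed `acquire` that finds an owner is the point where the transaction chooses between aborting the owner, waiting for it, or waking it up.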
References • Robert Ennals. Software Transactional Memory Should Not Be Obstruction-Free. Technical Report IRC-TR-06-052, Intel Research Cambridge, January 2006. • K. Fraser. Practical Lock-Freedom. Technical Report UCAM-CL-TR-579, Cambridge University Computer Laboratory, February 2004. • Tim Harris, Keir Fraser. Language support for lightweight transactions. Proceedings of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, October 26-30, 2003, Anaheim, California, USA. • Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software Transactional Memory for Dynamic-Sized Data Structures. ACM Symposium on Principles of Distributed Computing (PODC): 92-101, 2003. • Maurice Herlihy, Victor Luchangco. Distributed computing and the multicore revolution. ACM SIGACT News, v.39 n.1, March 2008. • Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Design Tradeoffs in Modern Software Transactional Memory Systems. In: Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers. Houston, TX, October 2004. • N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special Issue(10): 99-116, 1997. • William N. Scherer III and Michael L. Scott. Contention Management in Dynamic Software Transactional Memory. In: Proceedings of the ACM PODC Workshop on Concurrency and Synchronization in Java Programs. St. John's, NL, Canada, July 2004. In conjunction with PODC'04.
More reading Ennals’ blocking STM: • Robert Ennals. Efficient Software Transactional Memory. Intel Research Cambridge Technical Report IRC-TR-05-051, 2005. PRAM: • S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th Annual Symposium on Theory of Computing, pages 114-118, 1978. • Phillip B. Gibbons, Yossi Matias, Vijaya Ramachandran. Can shared-memory model serve as a bridging model for parallel computation? Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, p.72-83, June 23-25, 1997, Newport, Rhode Island, United States. • P. B. Gibbons. A more practical PRAM model. Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, p.158-168, June 18-21, 1989, Santa Fe, New Mexico, United States. Popular older message-passing models: • David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, Thorsten von Eicken. LogP: towards a realistic model of parallel computation. ACM SIGPLAN Notices, v.28 n.7, p.1-12, July 1993. • Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, v.33 n.8, p.103-111, Aug. 1990. Memory allocation in multicore: • Andrei Gorine, Konstantin Knizhnik. Tackling memory allocation in multicore and multithreaded applications. McObject LLC, May 29, 2006. Available on the internet from http://www.embedded.com/columns/showArticle.jhtml?articleID=188101359 • Voon-Yee Vee, Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared Memory Multiprocessors. Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), p.230, June 23-25, 1999. • P.R. Wilson, M.S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In H.G. Baker, editor, Proceedings of International Workshop on Memory Management (IWMM'95), volume 986 of Lecture Notes in Computer Science, pages 1-116, Kinross, Scotland, Sept. 1995.