Dynamic Performance Tuning of Word-Based Software Transactional Memory Pascal Felber Christof Fetzer Torvald Riegel Prepared by Gil Sadis Transactional memory
Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline
Glossary • TL2 – One of the fastest word-based software transactional memories designed by David Dice, Ori Shalev, and Nir Shavit in 2006 • TM – Transactional Memory • STM – Software Transactional Memory • Encounter-time locking – memory writes are done by first temporarily acquiring a lock for a given location, writing the value directly, and logging it in the undo log • Commit-time locking – locks memory locations only during the commit phase Introduction
TM has been proposed as a lightweight mechanism to synchronize threads • TM alleviates many of the problems associated with locking • TM offers the benefits of transactions without incurring the overhead of a database • TM makes memory behave transactionally, much like a database Introduction
“There is no ‘one-size-fits-all’ STM implementation and adaptive mechanisms are necessary to make the most of an STM infrastructure.” • The performance of STM implementations depends on several factors: • Design – word-based vs. object-based, lock-based vs. non-blocking, write-through vs. write-back • Configuration parameters – for example, the number of locks or the mapping of locks to memory addresses • Workload – for example, the ratio of update to read-only transactions Introduction
A new idea: TinySTM, a lightweight and highly efficient lock-based STM implementation that dynamically tunes its performance at runtime • Introduces novel mechanisms to reduce the validation cost of large read sets without increasing the abort rate Introduction
Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline
Word-based TM • Access memory at the granularity of machine words or larger chunks of memory • More widely applicable, for example in applications that do not explicitly specify associated objects and run in unmanaged environments • Most word-based STM designs rely upon a shared array of locks to manage concurrent accesses to memory Related work
Object-based TM • Access memory only at object granularity • Require the TM to be aware of the object associated with every access • An example of an object-based TM algorithm is the Lazy Snapshot Algorithm (LSA), which verifies at each object access that the view observed by a transaction is consistent Related work
Time-based TM (TBTM) • Based on a notion of time or progress • A global time base to reason about the consistency of data accessed by transactions and about the order in which transactions commit • The simplest implementation for a global time base is a shared integer counter • On large systems in which contention on this counter results in a significant bottleneck, external clocks or multiple synchronized physical clocks can be used as scalable time bases Related work
Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline
Word-based STM implementation that uses locks to protect shared memory locations • Uses a time-based design • Uses a single-version, word-based variant of the LSA algorithm; it is very similar to TL2, but follows different design strategies on some key aspects TinySTM
Uses encounter-time locking for two main reasons: • Empirical observations indicate that detecting conflicts early often increases transaction throughput because transactions do not perform useless work. Commit-time locking may help avoid some read-write conflicts, but in general conflicts discovered at commit time cannot be resolved without aborting at least one transaction • It allows reads-after-writes to be handled efficiently without requiring expensive or complex mechanisms TinySTM
TinySTM implements two strategies for accesses to memory: • Write-through – transactions directly write to memory and revert their updates in case they need to abort • Write-back – transactions delay their updates to memory until commit time TinySTM
Like most word-based STM designs, TinySTM relies upon a shared array of locks to manage concurrent accesses to memory • Each lock covers a portion of the address space • Each lock is the size of an address, and its least significant bit is used to indicate whether the lock is owned TinySTM – basic algorithm (locks and versions)
If it is not owned, we store in the remaining bits a version number that corresponds to the commit timestamp of the transaction that last wrote to one of the memory locations covered by the lock • If the lock is owned, we store in the remaining bits the address of either the owner transaction (when using write-through) or an entry in the write set of the owner transaction (when using write-back) TinySTM – basic algorithm (locks and versions)
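The lock-word encoding can be pictured with a minimal C sketch; the field layout and helper names are assumptions for illustration, not TinySTM's actual definitions:

```c
#include <stdint.h>

/* One lock word per stripe of the address space (illustrative layout).
 * Bit 0: "owned" flag. Remaining bits: either a version number (the
 * commit timestamp of the last writer) or a pointer-sized reference
 * to the owner transaction / its write-set entry. */
typedef uintptr_t lock_word_t;

#define LOCK_OWNED_BIT ((lock_word_t)1)

static inline int lock_is_owned(lock_word_t lw) {
    return (lw & LOCK_OWNED_BIT) != 0;
}

/* Version numbers occupy the upper bits so bit 0 stays clear. */
static inline lock_word_t lock_from_version(lock_word_t version) {
    return version << 1;
}

static inline lock_word_t lock_get_version(lock_word_t lw) {
    return lw >> 1;                /* meaningful only when not owned */
}

/* When owned, the remaining bits hold an (aligned) pointer. */
static inline lock_word_t lock_from_owner(void *owner) {
    return (lock_word_t)owner | LOCK_OWNED_BIT;
}
```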
When writing to a memory location, a transaction first identifies the lock entry that covers the memory address and atomically reads its value • If the lock bit is set, the transaction checks whether it is the owner of the lock. If so, it simply writes the new value and returns. Otherwise, the transaction can either wait for some time or abort immediately; TinySTM uses the latter • If the lock bit is not set, the transaction tries to acquire the lock by writing a new value in the entry TinySTM – basic algorithm (Reads & writes)
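A hedged sketch of this encounter-time write path in C11; the lock-array mapping, the tx_abort and tx_log_old_value helpers, and the write-through behavior shown here are illustrative assumptions, not TinySTM's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_LOCKS       (1u << 16)                 /* assumed lock-array size */
#define LOCK_OWNED_BIT ((uintptr_t)1)
#define GET_LOCK(a)    (&locks[((uintptr_t)(a) >> 2) % NB_LOCKS])

extern _Atomic uintptr_t locks[NB_LOCKS];         /* shared lock array */

typedef struct tx tx_t;                           /* transaction descriptor */
void tx_abort(tx_t *tx);                          /* assumed: rolls back the tx */
void tx_log_old_value(tx_t *tx, uintptr_t *addr); /* assumed: undo-log append */

bool tx_write(tx_t *tx, uintptr_t *addr, uintptr_t value) {
    _Atomic uintptr_t *lock = GET_LOCK(addr);
    uintptr_t lw = atomic_load(lock);

    if (lw & LOCK_OWNED_BIT) {
        if ((lw & ~LOCK_OWNED_BIT) == (uintptr_t)tx) {
            *addr = value;        /* we already own the lock: write in place */
            return true;
        }
        tx_abort(tx);             /* owned by someone else: abort immediately */
        return false;
    }
    /* Not owned: try to acquire by installing a pointer to ourselves. */
    if (!atomic_compare_exchange_strong(lock, &lw,
                                        (uintptr_t)tx | LOCK_OWNED_BIT)) {
        tx_abort(tx);             /* lost the race for the lock */
        return false;
    }
    tx_log_old_value(tx, addr);   /* save the old value for rollback */
    *addr = value;                /* write-through: update memory now */
    return true;
}
```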
When reading a memory location, a transaction must verify that the lock is not owned. To that end, the transaction reads the lock, then the memory location, and finally the lock again • If the lock is not owned and its value (i.e., version number) did not change between the two reads, then the value read is consistent TinySTM – basic algorithm (Reads & writes)
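A matching sketch of the read protocol, reusing the same assumed lock mapping; a full implementation would additionally validate the observed version against the transaction's snapshot and record the read in the read set:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_LOCKS       (1u << 16)
#define LOCK_OWNED_BIT ((uintptr_t)1)
#define GET_LOCK(a)    (&locks[((uintptr_t)(a) >> 2) % NB_LOCKS])

extern _Atomic uintptr_t locks[NB_LOCKS];   /* shared lock array (as above) */

typedef struct tx tx_t;
void tx_abort(tx_t *tx);                    /* assumed helper */

bool tx_read(tx_t *tx, const uintptr_t *addr, uintptr_t *out) {
    _Atomic uintptr_t *lock = GET_LOCK(addr);

    uintptr_t lw1 = atomic_load(lock);      /* 1. read the lock */
    if (lw1 & LOCK_OWNED_BIT) {
        tx_abort(tx);                       /* owned by another transaction */
        return false;
    }
    uintptr_t value = *addr;                /* 2. read the memory location */
    uintptr_t lw2 = atomic_load(lock);      /* 3. read the lock again */
    if (lw1 != lw2) {
        tx_abort(tx);                       /* version changed in between */
        return false;
    }
    *out = value;                           /* consistent read */
    return true;
}
```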
Write-through access • Updates are written directly to memory and previous values are stored in an undo log to be reinstated upon abort • Has lower commit-time overhead • Write-back access • Updates are stored in a write log and written to memory upon commit • Has lower abort overhead TinySTM – basic algorithm (Write-through vs. write-back)
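The two strategies differ mainly in what the per-transaction log records; a minimal illustration (field names are assumptions):

```c
#include <stdint.h>

/* Write-through: memory is updated immediately, so the undo log keeps
 * the OLD value, reinstated if the transaction aborts. */
typedef struct {
    uintptr_t *addr;
    uintptr_t  old_value;
} undo_entry_t;

/* Write-back: memory is left untouched until commit, so the write log
 * keeps the NEW value, written out when the transaction commits. */
typedef struct {
    uintptr_t *addr;
    uintptr_t  new_value;
} write_entry_t;
```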
Using dynamic memory within transactions is not trivial: • Consider the case of a transaction that inserts an element in a dynamic data structure such as a linked list • If memory is allocated but the transaction fails, it might not be properly reclaimed, which results in memory leaks • One cannot free memory in a transaction unless one can guarantee that it will not abort • TinySTM provides memory-management functions that allow transactional code to use dynamic memory TinySTM – basic algorithm (Memory Management)
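A hedged sketch of how such memory-management functions can work; the names tx_malloc/tx_free and the bookkeeping shown are illustrative, not TinySTM's exact API. The idea: allocations are rolled back on abort, and frees are deferred until commit.

```c
#include <stdlib.h>

/* Illustrative bookkeeping for transactional dynamic memory. */
typedef struct mem_node {
    void *ptr;
    struct mem_node *next;
} mem_node_t;

typedef struct {
    mem_node_t *allocated;   /* freed if the transaction aborts       */
    mem_node_t *to_free;     /* freed only if the transaction commits */
} tx_mem_t;

static void push(mem_node_t **list, void *ptr) {
    mem_node_t *n = malloc(sizeof *n);
    n->ptr = ptr;
    n->next = *list;
    *list = n;
}

void *tx_malloc(tx_mem_t *tx, size_t size) {
    void *p = malloc(size);
    push(&tx->allocated, p);         /* remember it in case we abort */
    return p;
}

void tx_free(tx_mem_t *tx, void *ptr) {
    push(&tx->to_free, ptr);         /* defer the actual free to commit */
}

void tx_mem_on_abort(tx_mem_t *tx) {
    for (mem_node_t *n = tx->allocated; n; ) {
        mem_node_t *next = n->next;
        free(n->ptr);                /* undo the allocation */
        free(n);
        n = next;
    }
    /* deferred frees are dropped: that memory stays allocated
     * (a full implementation would also release the other list's nodes) */
}

void tx_mem_on_commit(tx_mem_t *tx) {
    for (mem_node_t *n = tx->to_free; n; ) {
        mem_node_t *next = n->next;
        free(n->ptr);                /* now it is safe to release it */
        free(n);
        n = next;
    }
}
```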
TinySTM uses a shared counter as its clock • In case contention on this global counter becomes a bottleneck in large systems, more scalable time bases such as an external clock or multiple synchronized physical clocks can be used TinySTM – basic algorithm (Clock Management)
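A minimal sketch of such a shared-counter clock using C11 atomics (names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global time base: a single shared counter. Committing update
 * transactions draw a new timestamp by incrementing it atomically. */
static _Atomic uint64_t global_clock = 0;

static inline uint64_t clock_current(void) {
    return atomic_load_explicit(&global_clock, memory_order_acquire);
}

static inline uint64_t clock_next(void) {
    /* fetch_add returns the old value, so add 1 to get the new timestamp */
    return atomic_fetch_add_explicit(&global_clock, 1,
                                     memory_order_acq_rel) + 1;
}
```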
In addition to the array of l locks, TinySTM maintains a smaller hierarchical array of h << l counters • As atomic operations are costly on most architectures, the size of the hierarchical array must be chosen with care: larger h values reduce the validation overhead but may require more atomic operations TinySTM – Hierarchical Locking
Memory addresses are mapped to the counters using a hash function • A counter covers multiple locks and the associated memory addresses • Two memory locations that are mapped to the same lock are also mapped to the same counter TinySTM – Hierarchical Locking
Calculation: • l is chosen as a multiple of h, typically l = 2^i and h = 2^j with i > j • lock index = hash(addr) mod l • counter index = hash(addr) mod h TinySTM – Hierarchical Locking
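With l and h chosen as powers of two, both modulo operations reduce to bit masks; a small sketch (the sizes and the hash function are assumptions):

```c
#include <stdint.h>

#define NB_LOCKS    (1u << 16)   /* l = 2^i */
#define NB_COUNTERS (1u << 6)    /* h = 2^j, with j < i */

/* Simple hash: word-granularity address (the shift is a tunable). */
static inline uintptr_t hash_addr(const void *addr) {
    return (uintptr_t)addr >> 2;
}

static inline unsigned lock_index(const void *addr) {
    return (unsigned)(hash_addr(addr) & (NB_LOCKS - 1));      /* mod l */
}

static inline unsigned counter_index(const void *addr) {
    return (unsigned)(hash_addr(addr) & (NB_COUNTERS - 1));   /* mod h */
}

/* Because l is a multiple of h, two addresses with the same lock index
 * necessarily have the same counter index. */
```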
Each transaction additionally maintains two private data structures: a read mask and a write mask of h bits each • Read sets are partitioned into h independent parts • When reading or writing a memory location, a transaction first determines to which shared counter i in the hierarchical array it maps and sets the corresponding bit in its read or write mask TinySTM – Hierarchical Locking
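The per-transaction masks can be sketched as plain bit sets over the h counters (assuming h fits in a machine word; names are illustrative):

```c
#include <stdint.h>

/* One bit per hierarchical counter: bit i set means the transaction
 * has read (or written) at least one location covered by counter i. */
typedef struct {
    uint64_t read_mask;     /* assumes h <= 64 */
    uint64_t write_mask;
} tx_masks_t;

static inline void mark_read(tx_masks_t *m, unsigned counter_idx) {
    m->read_mask |= (uint64_t)1 << counter_idx;
}

static inline void mark_write(tx_masks_t *m, unsigned counter_idx) {
    m->write_mask |= (uint64_t)1 << counter_idx;
}
```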
The evaluation used the same red-black tree benchmark application as the evaluation of TL2, as well as a linked list • All tests were run on an 8-core Intel Xeon machine at 2 GHz running Linux 2.6.18-4 (64-bit) TinySTM – Experimental Evaluation
Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline
TinySTM’s most important tuning parameters: • The hash function mapping a memory location to a lock: TinySTM right-shifts the address and takes the result modulo the size of the lock array (#shifts) • The number of entries in the lock array (l or #locks) • The size of the array used for hierarchical locking (h) Dynamic Tuning
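A sketch of the mapping with the tuning parameters made explicit as runtime variables, so the tuner can change them (the word-granularity shift and default values are assumptions):

```c
#include <stdint.h>

/* Tunable parameters, adjusted at runtime by the tuner. */
static unsigned long nb_locks = 1ul << 16;   /* l (#locks), a power of two */
static unsigned      shifts   = 0;           /* extra right-shifts (#shifts) */

/* Right-shift the address, then reduce modulo the lock-array size.
 * Each extra shift assigns twice as many consecutive words to one lock. */
static inline unsigned long map_addr_to_lock(const void *addr) {
    return (((uintptr_t)addr >> 2) >> shifts) & (nb_locks - 1);
}
```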
The first observation is that throughput increases with the number of locks • A smaller number of locks could reduce the validation time of an update transaction (because fewer locks need to be checked), but the performance penalty of false sharing dominates Dynamic Tuning
The shift tuning parameter improves the sharing of locks within a transaction • The number of shifts specifies how many consecutive words are assigned to the same lock Dynamic Tuning
A small hierarchical array limits the overhead of atomic operations and permits a quick check of whether an update transaction can commit • However, too small an array will result in many false positives Dynamic Tuning
Tuning strategy: • Start with a sensible configuration: 2^16 locks, a shift of 0, and a hierarchical array of size 1 • 8 possible moves: (1-2) double/halve the number of locks, (3-4) increase/decrease the number of shifts, (5-6) double/halve the size of the hierarchical array, (7) a nop, and (8) reverse • Reverse occurs when throughput decreases by 2%, or falls 10% below the configuration with the highest throughput so far Dynamic Tuning
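A hedged sketch of one hill-climbing step of such a tuner; the move set and thresholds follow the strategy above, while the helper functions are assumed to exist elsewhere:

```c
typedef enum {
    MOVE_LOCKS_X2, MOVE_LOCKS_DIV2,   /* double / halve the number of locks   */
    MOVE_SHIFT_INC, MOVE_SHIFT_DEC,   /* increase / decrease the shift        */
    MOVE_HIER_X2, MOVE_HIER_DIV2,     /* double / halve the hierarchical array */
    MOVE_NOP, MOVE_REVERSE
} move_t;

extern double measure_throughput(void);  /* commits/s over a sampling period */
extern void   apply_move(move_t m);      /* reconfigure the STM accordingly  */
extern move_t pick_next_move(void);      /* choose the next move to try      */
extern move_t inverse_of(move_t m);      /* the move undoing the given one   */

/* Called periodically: keep exploring while throughput holds up,
 * otherwise reverse the last move. */
void tuning_step(void) {
    static double best = 0.0, prev = 0.0;
    static move_t last = MOVE_NOP;

    double now = measure_throughput();
    if (now > best)
        best = now;

    /* Reverse when throughput drops by more than 2% since the last step,
     * or is more than 10% below the best configuration seen so far. */
    if (now < 0.98 * prev || now < 0.90 * best) {
        apply_move(inverse_of(last));
        last = MOVE_NOP;
    } else {
        last = pick_next_move();
        apply_move(last);
    }
    prev = now;
}
```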
Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline
Automatic tuning and adaptivity are especially important given that there is no agreement on what constitutes a typical workload or a good benchmark for transactional memory • They allow us to exploit the full potential of current TM designs while being ready for workload classes yet to be identified Conclusions