Read-Write Lock Allocation in Software Transactional Memory
Amir Ghanbari Bavarsad and Ehsan Atoofian
Lakehead University

Transactional Memory
• Software transactional memory (STM) exploits a global clock to validate transactional data
  • Pros: reduces validation overhead
  • Cons: contention on the global clock
• Alternative: Read Write Lock Allocation (RWLA)
  • Pros: no central clock
  • Cons: overhead if a TX aborts
• Speculative RWLA: changes validation policy dynamically → Speedup: up to 66%
[Figure: processors P1 … Pn, each with a cache ($), sharing the Global Clock]

Outline
• Background
• RWLA
• Speculative RWLA
• Conclusion

Counter in STM
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
[Figure: the code above is labeled T1]

Validation in STM
• Transactional data are validated using:
  • Global clock
    • Shared variable
    • Timestamp for transactions
  • Lock
    • Memory is mapped to Lock Table
    • Each entry of the table:
      • Version #
[Figure: Global Clock; memory locations mapped to Lock Table entries, each holding a Version #]

Updating Global Clock & Lock
• Increment Global Clock
• Version # = global_clock
[Figure: counter in Memory maps to a Lock Table entry; its Version # is set from the Global Clock]

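The two update steps above can be sketched in C as follows. This is only an illustration under stated assumptions: C11 atomics, a hypothetical table size, a hypothetical address hash (`lock_index`), and a hypothetical helper name (`publish_write`). It uses a plain atomic fetch-and-add, which is not necessarily how GV4 or TL2 increments the clock, and it leaves out the locking a real STM performs around write-back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 1024u   /* hypothetical table size */

static atomic_uint_fast64_t global_clock;                 /* shared clock, starts at 0 */
static atomic_uint_fast64_t lock_table[LOCK_TABLE_SIZE];  /* one version # per entry */

/* Hypothetical hash: map a memory address to a lock-table index. */
size_t lock_index(const void *addr) {
    return (size_t)(((uintptr_t)addr >> 3) % LOCK_TABLE_SIZE);
}

/* Increment the global clock, then stamp the entry with the new clock value. */
uint64_t publish_write(const void *addr) {
    assert(addr != NULL);
    uint64_t wv = atomic_fetch_add(&global_clock, 1) + 1;  /* new clock value */
    atomic_store(&lock_table[lock_index(addr)], wv);       /* Version # = global_clock */
    return wv;
}
```

Every writer touches `global_clock`, which is the shared cache line that becomes a point of contention when transactions commit frequently.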
Validation in STM
• rv (read version) is set to global_clock
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
[Figure: code labeled T1; rv is held in the metadata for TX1 and taken from the Global Clock]

Successful Read Validation
• rv >= version #
• The most recent write to counter occurred before TM_BEGIN()
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
[Figure: code labeled T1; rv in the metadata for TX1; Global Clock]

Failed Read Validation
• rv < version #
• The most recent write to counter occurred after TM_BEGIN()
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
[Figure: code labeled T1; rv in the metadata for TX1; Global Clock]

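The read-validation rule from the last three slides reduces to one comparison. The sketch below is a minimal C illustration; the type `tx_t` and the functions `tx_begin` and `read_is_valid` are hypothetical names, not part of TL2 or of the TM_* macros shown on the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-transaction metadata: only the read version is modeled. */
typedef struct {
    uint64_t rv;
} tx_t;

/* At transaction begin, rv is set to the current value of the global clock. */
void tx_begin(tx_t *tx, uint64_t global_clock_now) {
    assert(tx != NULL);
    tx->rv = global_clock_now;
}

/* A read is valid when the location's version # is not newer than rv. */
bool read_is_valid(const tx_t *tx, uint64_t version) {
    assert(tx != NULL);
    return tx->rv >= version;
}
```

For example, a transaction that begins when the clock reads 5 accepts a location with version 3 or 5 and rejects one with version 6, because that write happened after the transaction began.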
Overhead of Validation
• This method, called GV4, results in many cache coherence misses if transactions commit frequently
[Figure: processors P1 … Pn, each with a cache ($), sharing the Global Clock]

Outline
• Background
• RWLA
• Speculative RWLA
• Conclusion

Read Write Lock Allocation (RWLA)
• Lock
  • Memory is mapped to Lock Table
  • Each entry of the table:
    • Lock bit
    • Read bits
[Figure: each Lock Table entry holds a lock bit followed by read bits Pn-1 … P1 P0; memory locations map to entries]

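One possible way to pack such an entry into a single word is sketched below in C. The word width, the limit of 16 threads, the position of the lock bit just above the read bits, and all macro and function names are assumptions made for illustration; the slides do not specify the encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 16u                           /* assumed thread limit */
#define LOCK_BIT    (UINT32_C(1) << MAX_THREADS)  /* lock bit sits above the read bits */
#define READ_BIT(p) (UINT32_C(1) << (p))          /* read bit of processor p */
#define READ_MASK   (LOCK_BIT - 1u)               /* all read bits P(n-1) .. P0 */

/* True if a writer currently holds the entry. */
bool entry_is_locked(uint32_t entry) {
    return (entry & LOCK_BIT) != 0;
}

/* True if processor p has registered itself as a reader. */
bool entry_has_reader(uint32_t entry, unsigned p) {
    assert(p < MAX_THREADS);
    return (entry & READ_BIT(p)) != 0;
}

/* True if any processor has its read bit set. */
bool entry_has_any_reader(uint32_t entry) {
    return (entry & READ_MASK) != 0;
}
```

Keeping the lock bit and the read bits in one word lets a single atomic compare-and-swap check and update both, and no shared clock is involved.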
TM_READ
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
• TM_READ(): Lock bit is free?
  • Yes → set read bit in the corresponding lock entry
  • No → abort
[Figure: lock entry shown across three animation steps: 0 0 0 ….. 0 0 0; lock bit 1 0 0 0 ….. 0 0 0; 0 0 0 ….. 0 0 1]

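The TM_READ decision above could be implemented with a compare-and-swap loop, roughly as in this C sketch. The entry encoding (lock bit above 16 read bits), the function name `rwla_read`, and the use of C11 atomics are assumptions for illustration, not the authors' implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 16u                           /* assumed thread limit */
#define LOCK_BIT    (UINT32_C(1) << MAX_THREADS)  /* assumed position of the lock bit */
#define READ_BIT(p) (UINT32_C(1) << (p))          /* read bit of thread p */

/* Returns true if the read may proceed, false if the transaction must abort. */
bool rwla_read(_Atomic uint32_t *entry, unsigned tid) {
    assert(tid < MAX_THREADS);
    uint32_t old = atomic_load(entry);
    for (;;) {
        if (old & LOCK_BIT)
            return false;  /* lock bit is not free: abort */
        /* Lock bit is free: try to set this thread's read bit. */
        if (atomic_compare_exchange_weak(entry, &old, old | READ_BIT(tid)))
            return true;
        /* CAS failed: old now holds the current value, so re-check the lock bit. */
    }
}
```

Checking the lock bit and setting the read bit in one CAS prevents a writer from acquiring the entry between the two steps.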
TM_WRITE
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
• TM_WRITE: All read bits are clear?
  • No → abort
  • Yes → acquire lock
    • Failed → abort
[Figure: lock entry shown across three animation steps: 0 0 1 ….. 0 0 0; 0 0 0 ….. 0 0 1; 1 0 0 0 ….. 0 0 0]

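A matching C sketch of the TM_WRITE decision follows. It rests on several assumptions that the slides do not confirm: the entry encoding (lock bit above 16 read bits), the name `rwla_write`, and in particular the choice to ignore the writer's own read bit, which is made here only because the example transaction reads counter before writing it. The sketch does not track which thread owns the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 16u                           /* assumed thread limit */
#define LOCK_BIT    (UINT32_C(1) << MAX_THREADS)  /* assumed position of the lock bit */
#define READ_BIT(p) (UINT32_C(1) << (p))          /* read bit of thread p */
#define READ_MASK   (LOCK_BIT - 1u)               /* all read bits */

/* Returns true if the lock was acquired, false if the transaction must abort. */
bool rwla_write(_Atomic uint32_t *entry, unsigned tid) {
    assert(tid < MAX_THREADS);
    uint32_t old = atomic_load(entry);
    for (;;) {
        if (old & LOCK_BIT)
            return false;  /* lock already held: acquire failed, abort */
        /* Assumption: the writer's own read bit does not count as a conflict. */
        if (old & READ_MASK & ~READ_BIT(tid))
            return false;  /* another thread is reading: abort */
        if (atomic_compare_exchange_weak(entry, &old, old | LOCK_BIT))
            return true;   /* lock acquired */
        /* CAS failed: old was refreshed, so repeat both checks. */
    }
}
```

On abort, a real implementation would also have to clear the read bits the transaction set earlier, which is the abort overhead the introduction attributes to RWLA.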
Experimental Framework
• Benchmarks: STAMP v0.9.7
  • Run to completion
  • Measured statistics over 10 runs
• TL2 as an STM framework
• Two Intel Xeon E5660, 6-way CMP

Performance of RWLA
[Figure: performance of RWLA; an arrow marks the "better" direction]

Speculative RWLA
• Conflicts occur frequently → select GV4
• Conflicts occur rarely → select RWLA
• How to predict conflicts?

Contention Predictor
• xi: global transaction history, bipolar value
• wi: weight vector
• Prediction:
  • y ≥ 0 → predict commit
  • y < 0 → predict abort
• Update:
  • If the outcomes of the current TX and TXi agree/disagree → increment/decrement wi
[Figure: inputs 1, x1, …, xn weighted by w0, w1, …, wn and combined into output y]

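The predictor described above resembles a perceptron, and a small C sketch makes the prediction and update rules concrete. Several details are assumptions, not taken from the slides: y is computed as the weighted sum w0 + Σ wi·xi, +1 encodes a commit and −1 an abort, the history length is 8, weights are updated after every transaction without saturation, and a predicted abort is read as "conflicts are frequent" and mapped to GV4. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HISTORY_LEN 8   /* hypothetical history length n */

typedef enum { POLICY_RWLA, POLICY_GV4 } policy_t;

typedef struct {
    int w[HISTORY_LEN + 1];  /* w[0] pairs with the constant input 1 */
    int x[HISTORY_LEN];      /* +1 = commit, -1 = abort; x[0] is the newest outcome */
} predictor_t;

void predictor_init(predictor_t *p) {
    assert(p != NULL);
    for (int i = 0; i <= HISTORY_LEN; i++) p->w[i] = 0;
    for (int i = 0; i < HISTORY_LEN; i++)  p->x[i] = 1;  /* assume earlier commits */
}

/* Assumed output: y = w0 * 1 + sum over i of w_i * x_i. */
int predictor_output(const predictor_t *p) {
    int y = p->w[0];
    for (int i = 0; i < HISTORY_LEN; i++) y += p->w[i + 1] * p->x[i];
    return y;
}

/* y >= 0 predicts commit, y < 0 predicts abort. */
bool predict_commit(const predictor_t *p) {
    return predictor_output(p) >= 0;
}

/* Assumed mapping: predicted abort selects GV4, predicted commit selects RWLA. */
policy_t choose_policy(const predictor_t *p) {
    return predict_commit(p) ? POLICY_RWLA : POLICY_GV4;
}

/* Increment w_i when the outcome agrees with x_i, decrement it otherwise. */
void predictor_update(predictor_t *p, bool committed) {
    int t = committed ? 1 : -1;
    p->w[0] += t;  /* the constant input 1 agrees only with a commit */
    for (int i = 0; i < HISTORY_LEN; i++)
        p->w[i + 1] += (p->x[i] == t) ? 1 : -1;
    for (int i = HISTORY_LEN - 1; i > 0; i--) p->x[i] = p->x[i - 1];
    p->x[0] = t;   /* record the newest outcome in the history */
}
```

A caller would invoke `choose_policy` before starting a transaction and `predictor_update` once the transaction commits or aborts.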
Performance of Speculative RWLA
• # of threads varies between 2 and 16
• On average, the performance improvement ranges from 21% in Bayes to 47% in Labyrinth
[Figure: performance of Speculative RWLA; an arrow marks the "better" direction]

Conclusion
• RWLA overcomes contention over the global clock
• Applications react differently to GV4 and RWLA
• Speculative RWLA changes validation policy dynamically
• Speculative RWLA improves the performance of STMs by up to 66%

Thank You! Questions?