Two Ways of Speeding Up Transactional Memory Algorithms

Vincent Gramoli
Joint work with Pascal Felber, Rachid Guerraoui, Derin Harmanci

Roadmap • Motivations • Transactional Memory • Problems of Efficiency • Input Acceptance • Elastic Transactions • Conclusion

Single CPU Limitations
• Transistor size still decreases [Moore’s law]
• Induced overheating disturbs computation
• Clock speed has stopped doubling since 2004 [“The free lunch is over” by Herb Sutter]

Manufactured Multicores
• SUN Niagara 2 w/ 8 cores & 64 HW threads
• Intel COO announces Multicore revolution
• AMD announces the 2-core Opteron
• Intel announces 6-core Xeon 7000 series
• Intel announces 4-core Xeon 5000 series
• SUN announces the 8-core Niagara
• AMD announces the 4-core Opteron
• Intel announces 8-core Nehalem EX

Concurrent Programming
• Difficult task:
  • Using locks, how to avoid deadlock?
    Thread1 {lock(x); lock(y);} // Thread2 {lock(y); lock(x);}
  • Using lock-free (LF) primitives, how can composition preserve atomicity?
    LF-move(x,y) ≠ LF-delete(x) + LF-insert(y)
• Dedicated to expert programmers:
  • Database programmers
  • Scientific computing programmers
• What about other programmers?
• Democratizing multicores requires new programming abstractions
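The deadlock on this slide arises because the two threads take x and y in opposite orders. One common remedy is to acquire locks in a single global order. The following is a minimal Python sketch of that idea. The helper `acquire_in_order` and the use of `id()` as the ordering key are assumptions made for illustration and do not come from the slides.

```python
import threading
from contextlib import ExitStack

def acquire_in_order(stack, *locks):
    """Hypothetical helper: take every lock in one global order (here, by id)."""
    for lock in sorted(locks, key=id):
        stack.enter_context(lock)  # released automatically when the stack exits

x = threading.Lock()
y = threading.Lock()
counter = 0

def worker(first, second, rounds):
    global counter
    for _ in range(rounds):
        with ExitStack() as stack:
            # The threads name the locks in opposite orders, but the
            # acquisition order is the same, so no cyclic wait can form.
            acquire_in_order(stack, first, second)
            counter += 1

t1 = threading.Thread(target=worker, args=(x, y, 1000))
t2 = threading.Thread(target=worker, args=(y, x, 1000))
t1.start()
t2.start()
t1.join()
t2.join()
```

The sketch fixes this one example, but the discipline has to be followed by every piece of code that touches x and y, which is the composition burden the slide points at.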

Transactional Memory
An abstraction: a black box that encapsulates all synchronizations
• all read/write accesses to shared data are protected transparently
BEGIN_TX R(act) W(act,v) END_TX
Assume we want to read (R) and write (W) a shared bank account ‘act’ atomically. We simply have to label the region of the sequential code using transaction delimiters BEGIN_TX and END_TX.
• BEGIN_TX: after this point, operations will be handled by the TM
• R(act): read through the TM?
  • TM: Sounds good, I keep track of your read
  • TM: you can return v1
• With two transactions, BEGIN_TX R(act) W(act,v’) END_TX and BEGIN_TX R(act) W(act,v) END_TX: write through the TM?
  • TM: Sounds good, I keep track of your write
  • TM: write has been scheduled
• Back to BEGIN_TX R(act) W(act,v) END_TX: write through the TM?
  • TM: No way, there is a risk of safety violation
  • abort, roll-back, and restart the whole transaction later on
• END_TX: after this point, all operations become unprotected again

Transactional Memory
An abstraction: a black box that encapsulates all synchronizations
• all read/write accesses to shared data are protected transparently
• atomicity is preserved under transaction composition: delete + insert = move
delete(acc, amt) { BEGIN_TX v = R(acc) W(acc,v-amt) END_TX }
insert(acc, amt) { BEGIN_TX v = R(acc) W(acc,v+amt) END_TX }
move(acc1, acc2, amt) { BEGIN_TX delete(acc1, amt) insert(acc2, amt) END_TX }
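The composition on this slide can be illustrated with a toy Python sketch. It is not an STM: a single global re-entrant lock stands in for the TM, and there is no abort or roll-back. The names `atomic`, `accounts`, and `total` are assumptions introduced for the example. It only shows that nesting delete and insert inside move keeps the transfer atomic for an observer.

```python
import threading
from contextlib import contextmanager

_tm_lock = threading.RLock()  # assumption: one global re-entrant lock plays the TM

@contextmanager
def atomic():
    """Toy BEGIN_TX / END_TX; a nested block joins the enclosing one."""
    with _tm_lock:
        yield

accounts = {"acc1": 100, "acc2": 0}

def delete(acc, amt):
    with atomic():
        v = accounts[acc]
        accounts[acc] = v - amt

def insert(acc, amt):
    with atomic():
        v = accounts[acc]
        accounts[acc] = v + amt

def move(acc1, acc2, amt):
    with atomic():          # delete + insert composed into one atomic step
        delete(acc1, amt)
        insert(acc2, amt)

def total():
    with atomic():
        return sum(accounts.values())

observed = []

def mover():
    for _ in range(200):
        move("acc1", "acc2", 1)
        move("acc2", "acc1", 1)

def observer():
    for _ in range(200):
        observed.append(total())

threads = [threading.Thread(target=mover),
           threading.Thread(target=mover),
           threading.Thread(target=observer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The observer never sees money missing between the delete and the insert, because both run inside the transaction opened by move.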

1st Problem: Wasted Effort
• Transactions waste effort when aborting and rolling back
• Some aborts are unnecessary
BEGIN_TX W(x) END_TX
BEGIN_TX R(x) END_TX
[Figure: events labeled (1) to (4) give the interleaving of the two transactions]
• Although both transactions can commit safely, one is aborted by common STMs: TL2, WSTM, DSTM, TinySTM

2nd Problem: Lack of Concurrency
• Transactions ensure stronger guarantees than necessary
• Example: sorted linked list implementation of integer set
[Figure: linked list with nodes h, x, y, z, t; insert(x) runs alongside search(z)]
search(z): BEGIN_TX R(h) R(y) R(z) END_TX
insert(x): BEGIN_TX … W(h) END_TX
• Both transactions could commit without violating linked-list linearizability, but transactional models consider read/write atomicity.

Roadmap Motivations Transactional Memory Problems of Efficiency Input Acceptance Elastic Transactions Conclusion

A Metric for Input Acceptance
• TM efficiency depends on
  • Execution speed
  • Number of successful (committed) transactions
• Input acceptance is the ability of a TM to commit transactions
• The commit-abort ratio is “σ”: # committed tx / # complete tx

How do STMs perform w.r.t. this metric? • Ideal goal: no abort (σ = 1) • A TM accepts an input if σ = 1 • What is accepted by the existing STMs?
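As a small worked example of the metric, the Python sketch below computes σ and the acceptance test σ = 1. It assumes that a "complete" transaction is one that either committed or aborted. That reading, like the function names, is an assumption and is not stated on the slides.

```python
def commit_abort_ratio(committed, aborted):
    """sigma = # committed tx / # complete tx (assumed: complete = committed + aborted)."""
    complete = committed + aborted
    if complete == 0:
        raise ValueError("sigma is undefined without any complete transaction")
    return committed / complete

def accepts(committed, aborted):
    """A TM accepts an input if sigma = 1, i.e. nothing was aborted."""
    return commit_abort_ratio(committed, aborted) == 1

sigma_with_abort = commit_abort_ratio(3, 1)   # 3 commits, 1 abort
sigma_ideal = commit_abort_ratio(5, 0)        # no abort
```

With three commits and one abort, σ is 0.75 and the input is not accepted. With no abort, σ is 1.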

Identifying TM designs

Formalizing Workload as an Input
Events (i.e., an alphabet):
• $s_i$: start event of transaction $i$
• $w^x_i$: write request of transaction $i$ on location $x$
• $r^x_i$: read request of transaction $i$ on location $x$
• $\pi^{(x)}_i$: any event of transaction $i$ (on location $x$)
• $c_i$: commit request of transaction $i$
An input pattern is a totally ordered set of events (i.e., a word)
An input class is a set of input patterns (i.e., a language):
• $|$ represents the choice (e.g., “a | b” means “a” or “b”)
• $*$ represents the Kleene closure (e.g., “a*” means “ε|a|aa|…”)
• $\neg$ represents the complement (e.g., “¬a” means “any event but a”)

Input Acceptance Upper-bound of VWIR
• Theorem. There is no VWIR design that accepts the following input class:
  $C_2 = \pi^* \, (r^x_i \, \neg c_i^* \, w^x_j \, \neg c_i^* \, c_j \mid w^x_j \, \neg c_j^* \, r^x_i) \, \pi^*$
• Example: BEGIN_TX W(x) END_TX alongside BEGIN_TX R(x) END_TX
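Since an input class is a language, membership of a concrete input pattern can be checked mechanically. The Python sketch below encodes class C2 as a regular expression over event tokens. The token encoding is an assumption made for the example: `s1` and `c1` are the start and commit of transaction 1, `w1x` and `r2x` are a write and a read on location x, and locations are single lowercase letters. The functions `not_commit`, `c2_regex`, and `in_c2` are hypothetical helpers, not part of any STM.

```python
import re

# One event token followed by a space: s<i>, c<i>, r<i><loc> or w<i><loc>.
EVENT = r"(?:[sc]\d+|[rw]\d+[a-z]) "

def not_commit(i):
    """Regex for 'any event but c_i' (the complement of c_i)."""
    return rf"(?!c{i} ){EVENT}"

def c2_regex(i, j, x):
    """C2 = pi* (r^x_i (not c_i)* w^x_j (not c_i)* c_j | w^x_j (not c_j)* r^x_i) pi*"""
    r_i, w_j, c_j = f"r{i}{x} ", f"w{j}{x} ", f"c{j} "
    read_first = rf"{r_i}(?:{not_commit(i)})*{w_j}(?:{not_commit(i)})*{c_j}"
    write_first = rf"{w_j}(?:{not_commit(j)})*{r_i}"
    return re.compile(rf"(?:{EVENT})*(?:{read_first}|{write_first})(?:{EVENT})*")

def in_c2(pattern, i, j, x):
    """True if the space-separated input pattern belongs to C2 for reader i, writer j."""
    text = "".join(token + " " for token in pattern.split())
    return c2_regex(i, j, x).fullmatch(text) is not None

# Transaction 1 writes x, transaction 2 reads x before transaction 1 commits.
overlapping = in_c2("s1 s2 w1x r2x c1 c2", i=2, j=1, x="x")
# Transaction 2 starts only after transaction 1 has committed.
sequential = in_c2("s1 w1x c1 s2 r2x c2", i=2, j=1, x="x")
```

The overlapping pattern falls in C2, while the sequential one does not, because the commit of the writer separates the write from the read.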

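Going further
• Other classes:
  • $C_1 = \pi^* \, (\pi^x_i \, \neg c_i^* \, w^x_j \mid w^x_j \, \neg c_j^* \, \pi^x_i) \, \pi^*$
  • $C_3 = \pi^* \, (r^x_i \, \neg c_i^* \, w^x_j \mid w^x_j \, \neg c_j^* \, r^x_i) \, \neg c_i^* \, c_j \, \pi^*$
  • $C_4 = (\neg w^x)^* \, r^x_i \, \neg c_i^* \, w^x_j \, \neg c_i^* \, c_j \, \neg c_i^* \, s_k \, \neg(c_i \mid c_k \mid r^x_k)^* \, w^y_k \, \neg(c_i \mid c_k \mid r^x_k)^* \, c_k \, \neg c_i^* \, r^y_i \, \pi^*$
• Other impossibility results:
  • Theorem 1. VWVR design does not accept input class $C_1$.
  • Theorem 3. IWIR design does not accept input class $C_3$.
  • Theorem 4. CTR design does not accept input class $C_4$.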

Input Acceptance Classification
• VWVR (e.g., SXM): ~C1
• VWIR (e.g., DSTM, TinySTM): ~C2
• IWIR (e.g., WSTM, TL2): ~C3
• CTR (e.g., TSTM): ~C4
• RTR (e.g., SSTM): ~C5, with C5 = Ø
• Serializable STM needs to track all conflicts

Experimental Validation: Scalability
• 20% Update operations: 10% linked-list insert, 10% linked-list delete
• 80% Other operations: linked-list contains
• Dual quad-core Intel Xeon

Roadmap Motivations Transactional Memory Problem Input Acceptance Elastic Transactions Conclusion

Software Transactional Memories
• TinySTM, LSA-STM, SSTM, SwissTM: efficient?
[Figure: linked list with nodes h, x, y, z, t; insert(x) runs alongside search(z)]
search(z): BEGIN_TX R(h) R(y) R(z) END_TX
insert(x): BEGIN_TX … W(h) END_TX
• The two transactions cannot both commit because read/write atomicity is violated, even though linked-list linearizability is guaranteed.

Elastic Transactional Memory (ε-STM)
• Elastic transactions: weaker than normal ones
[Figure: linked list with nodes h, x, y, z, t; insert(x) runs alongside search(z)]
search(z): BEGIN_TX R(h) R(y) R(z) END_TX
insert(x): BEGIN_TX … W(h) END_TX
• The goal is to cut transactions into sub-parts

Elastic Transactional Memory (ε-STM)
• Elastic transactions: weaker than normal ones
search(z): BEGIN_TX R(h) R(y) R(z) END_TX
insert(x): BEGIN_TX … W(h) END_TX
search(z): BEGIN_EL_TX R(h) R(y) R(z) END_TX
insert(x): BEGIN_EL_TX … W(h) END_TX
• Cut: an elastic transaction is cut in 2 parts with respective operations π(x,*) and π(y,*) if:
  • there are no two writes on x and y in between;
  • all writes are in the same part;
  • the first operation of any part is a read.

Elastic Transactional Memory (ε-STM)
• Elastic transactions: weaker than normal ones
[Figure: linked list with nodes h, x, y, z, t; insert(x) runs alongside search(z)]
• The key idea is that when reading element e:
  • the predecessor has not changed since it was read,
  • or e has not changed since the predecessor was read.
• This ensures that the parsing is always consistent although atomicity is relaxed.
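The key idea on this slide can be sketched as a check on timestamps. The Python below is a single-threaded simulation under stated assumptions: a global counter `clock`, a per-node timestamp `ts` of the last write, and the helpers `write_next`, `elastic_read_ok`, and `normal_read_ok` are all hypothetical and are not taken from ε-STM. It contrasts the elastic check with a check that validates the whole read set.

```python
class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt
        self.ts = 0  # assumed: clock value of the last write to this node

class Clock:
    def __init__(self):
        self.now = 0

clock = Clock()

def write_next(node, nxt):
    """Committed write to node.next; stamps the node with a fresh clock value."""
    clock.now += 1
    node.next = nxt
    node.ts = clock.now

def elastic_read_ok(pred, pred_read_time, e):
    """Reading e is fine if pred is unchanged since it was read,
    or if e is unchanged since pred was read."""
    return pred.ts <= pred_read_time or e.ts <= pred_read_time

def normal_read_ok(read_set):
    """A regular transaction requires every location read so far to be unchanged."""
    return all(node.ts <= read_time for node, read_time in read_set)

# List h -> y -> z -> t, as in search(z).
t = Node("t")
z = Node("z", t)
y = Node("y", z)
h = Node("h", y)

h_read_time = clock.now      # search(z) reads h
x = Node("x", y)
write_next(h, x)             # insert(x) commits W(h) in between

elastic_ok = elastic_read_ok(h, h_read_time, y)   # y unchanged since h was read
normal_ok = normal_read_ok([(h, h_read_time)])    # h changed: would abort

y_read_time = clock.now      # search(z) reads y
write_next(y, z)             # both y and z are then overwritten
write_next(z, t)
both_changed_ok = elastic_read_ok(y, y_read_time, z)
```

In the first interleaving the elastic check lets search(z) continue where a full read-set validation would abort it. When both the predecessor and the element have changed, the elastic check also fails.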

Elastic Transactional Memory (ε-STM) • Elastic transactions: • Weaker than normal ones (cannot implement sum) • Compatible with normal ones (retain simplicity)
