Nested Parallelism in Transactional Memory. Kunal Agrawal, Jeremy T. Fineman, and Jim Sukha (MIT).
Program Representation

    ParallelIncrement() {
      parallel            // P1
        { x ← x+1 }       // S1
        { x ← x+1 }       // S2
    }

• The parallel keyword allows the two following code blocks (each enclosed in { }) to execute in parallel.
• We model the execution of a multithreaded program as a walk of a series-parallel computation tree.
• Internal nodes of the tree are S (series) or P (parallel) nodes. The leaves of the tree are memory operations.
• All child subtrees of an S node must execute in series, in left-to-right order. The child subtrees of a P node can potentially execute in parallel.

[Figure: computation tree. Root S0 has child P1; P1 has children S1 and S2. S1 has leaves u1 (R x) and u2 (W x); S2 has leaves u3 (R x) and u4 (W x).]
Data Races

    ParallelIncrement() {
      parallel            // P1
        { x ← x+1 }       // S1
        { x ← x+1 }       // S2
    }

• Two (or more) parallel accesses to the same memory location, at least one of which is a write, constitute a data race. (In the tree, two accesses can happen in parallel if their least common ancestor is a P node.)
• There are races between u1 and u4, u2 and u3, and u2 and u4.
• Data races lead to nondeterministic program behavior.
• Traditionally, locks are used to prevent data races.

[Figure: the same tree. S0 → P1 → {S1, S2}, with leaves u1 (R x), u2 (W x) under S1 and u3 (R x), u4 (W x) under S2.]
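The race rule above (same location, at least one write, least common ancestor is a P node) can be checked mechanically. The following is a minimal sketch, not part of the original talk: the `Node` class and the helper names are hypothetical, and the tree is the ParallelIncrement tree from the slide.

```python
from itertools import combinations

class Node:
    """A node of a series-parallel computation tree (hypothetical helper)."""
    def __init__(self, kind, name, children=(), loc=None):
        self.kind = kind              # "S", "P", "R" (read leaf) or "W" (write leaf)
        self.name = name
        self.loc = loc                # memory location, used by leaves only
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def lca(a, b):
    """Least common ancestor of two nodes of the same tree."""
    seen = {id(n) for n in path_to_root(a)}
    for n in path_to_root(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")

def leaves(node):
    """Memory operations below `node`, in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def data_races(root):
    """Pairs of leaves that access the same location in parallel, one a write."""
    races = []
    for a, b in combinations(leaves(root), 2):
        has_write = "W" in (a.kind, b.kind)
        if a.loc == b.loc and has_write and lca(a, b).kind == "P":
            races.append((a.name, b.name))
    return races

# The ParallelIncrement tree from the slide.
u1, u2 = Node("R", "u1", loc="x"), Node("W", "u2", loc="x")
u3, u4 = Node("R", "u3", loc="x"), Node("W", "u4", loc="x")
s0 = Node("S", "S0", [Node("P", "P1", [Node("S", "S1", [u1, u2]),
                                       Node("S", "S2", [u3, u4])])])

print(data_races(s0))   # [('u1', 'u4'), ('u2', 'u3'), ('u2', 'u4')]
```

The pair (u1, u3) is not reported because both accesses are reads, and (u1, u2) is not reported because their least common ancestor is the S node S1.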
Transactional Memory

    ParallelIncrement() {
      parallel                        // P1
        { atomic { x ← x+1 } }        // transaction A, in S1
        { atomic { x ← x+1 } }        // transaction B, in S2
    }

• Transactional memory has been proposed as an alternative to locks.
• The programmer simply encloses the critical region in an atomic block. The runtime system ensures that the region executes atomically by tracking its reads and writes, detecting conflicts, and aborting and retrying if necessary.

[Figure: S0 → P1 → {S1, S2}. Transaction A under S1 contains u1 (R x) and u2 (W x); transaction B under S2 contains u3 (R x) and u4 (W x).]
Nested Parallelism

    ParallelIncrement() {
      parallel                  // P1
        { x ← x+1 }             // S1
        { x ← x+1
          parallel              // P2
            { x ← x+1 }         // S3
            { x ← x+1 }         // S4
        }                       // S2
    }

• One can generate more parallelism by nesting parallel blocks.

[Figure: S0 → P1 → {S1, S2}. S1 has leaves u1 (R x), u2 (W x). S2 has leaves u3 (R x), u4 (W x) followed by P2; P2 has children S3 with u5 (R x), u6 (W x) and S4 with u7 (R x), u8 (W x).]
Nested Parallelism in Transactions

    ParallelIncrement() {
      parallel                      // P1
        { atomic { x ← x+1 } }      // transaction A, in S1
        { atomic {
            x ← x+1
            parallel                // P2
              { x ← x+1 }           // S3
              { x ← x+1 }           // S4
          }                         // transaction B
        }                           // S2
    }

• Use transactions to prevent data races. (Notice the parallelism inside transaction B.)
• Unfortunately, this program still has data races: the increments in S3 and S4 run in parallel inside B and are not themselves protected by transactions.

[Figure: S0 → P1 → {S1, S2}. Transaction A under S1 contains u1 (R x), u2 (W x). Transaction B under S2 contains u3 (R x), u4 (W x), and P2; P2 has children S3 with u5 (R x), u6 (W x) and S4 with u7 (R x), u8 (W x).]
Nested Parallelism and Nested Transactions

    ParallelIncrement() {
      parallel
        { atomic { x ← x+1 } }            // A
        { atomic {
            x ← x+1
            parallel
              { atomic { x ← x+1 } }      // C
              { atomic { x ← x+1 } }      // D
          }                               // B
        }                                 // S2
    }

• Add more transactions.
• Transactions C and D are nested inside transaction B. Therefore transaction B has both nested transactions and nested parallelism.

[Figure: S0 → P1 → {S1, S2}. Transaction A under S1 contains u1, u2. Transaction B under S2 contains u3, u4, and P2; P2 has children S3 with transaction C (u5, u6) and S4 with transaction D (u7, u8).]
Our Contribution

• We describe CWSTM, a theoretical design for a software transactional memory system that allows nested parallelism in transactions, for dynamic multithreaded languages that use a work-stealing scheduler.
• Our design efficiently supports nesting and parallelism of unbounded depth.
• CWSTM supports
    • efficient eager conflict detection, and
    • eager updates (fast commits).
• We prove that CWSTM exhibits small overhead on a program with transactions compared to the same program with all atomic blocks removed.
More Precisely…

• A work-stealing scheduler guarantees that a transaction-less program with work T1 and critical path T∞ running on P processors completes in time O(T1/P + T∞).
    • Provides linear speedup when T1/T∞ >> P.
• If a program has no aborts and no read contention*, then CWSTM completes the program with transactions in time O(T1/P + P·T∞).
    • Provides linear speedup when T1/T∞ >> P^2.

*In the presence of multiple readers, a write to a memory location has to check for conflicts against multiple readers.
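The two speedup conditions follow from asking when the T1/P term dominates each bound. A short worked derivation, assuming "linear speedup" means a running time of O(T1/P):

```latex
\[
  \frac{T_1}{P} + T_\infty = O\!\left(\frac{T_1}{P}\right)
  \iff T_\infty = O\!\left(\frac{T_1}{P}\right)
  \iff \frac{T_1}{T_\infty} = \Omega(P),
\]
\[
  \frac{T_1}{P} + P\,T_\infty = O\!\left(\frac{T_1}{P}\right)
  \iff P\,T_\infty = O\!\left(\frac{T_1}{P}\right)
  \iff \frac{T_1}{T_\infty} = \Omega(P^2).
\]
```

The extra factor of P on the critical-path term is what raises the required parallelism from P to P^2.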
Outline

• Introduction
• Semantics of TM
• Difficulty of Conflict Detection
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work
Conflicts in Transactions

    parallel
      { atomic { x ← 1
                 y ← 2 }              // A
      }                               // S1
      { atomic { z ← 3
                 atomic { z ← 4
                          x ← 5 }     // C
               }                      // B
      }                               // S2

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.
• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

Write sets as the example executes (one slide build per step):
    0. Initially: W(A) = {}, W(B) = {}, W(C) = {}
    1. A writes x: W(A) = {x}
    2. B writes z: W(B) = {z}
    3. C writes z: W(C) = {z}
    4. C writes x: W(C) = {z, x}. CONFLICT!! (A and C are in parallel, and both write sets contain x.)

[Figure: S0 → P1 → {S1, S2}. Transaction A under S1; transaction B under S2 with transaction C nested inside it; the leaves u1 to u5 are the writes to x, y, and z.]
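The conflict rule can be replayed on this example in a few lines. This is a minimal sketch rather than the CWSTM mechanism: each transaction is identified by a hypothetical root-to-transaction path of node labels, "in parallel" is decided by the last common label on two paths, and only the newly written location is checked, which is the incremental form of "write sets overlap".

```python
def in_parallel(path_a, path_b):
    """True if the last node shared by the two root paths is a P node."""
    common = None
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common = a
    return common is not None and common.startswith("P")

class Txn:
    """A transaction with its write set W(T) (hypothetical helper)."""
    def __init__(self, name, path):
        self.name = name
        self.path = path        # labels from the root down to this transaction
        self.writes = set()     # write set W(T)

def write(txn, loc, active):
    """Add loc to W(txn) and return the names of conflicting active transactions."""
    txn.writes.add(loc)
    return [t.name for t in active
            if t is not txn and in_parallel(t.path, txn.path) and loc in t.writes]

A = Txn("A", ("S0", "P1", "S1", "A"))
B = Txn("B", ("S0", "P1", "S2", "B"))
C = Txn("C", ("S0", "P1", "S2", "B", "C"))   # nested inside B
active = [A, B, C]

print(write(A, "x", active))   # []
print(write(B, "z", active))   # []
print(write(C, "z", active))   # []    (B also wrote z, but B is an ancestor of C)
print(write(C, "x", active))   # ['A'] (A and C are in parallel and both wrote x)
```

Note that C writing z does not conflict with B: their last common node is B itself, not a P node, so they are not in parallel.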
Nested Transactions: Commit and Abort

• If two transactions conflict, one of them is aborted and its write set is discarded.
• If a transaction completes without a conflict, it is committed and its write set is merged with its parent transaction's write set.

In the example, W(A) = {y}, W(B) = {z, x}, and W(C) = {z, u}. When C commits, W(B) becomes {z, x, u}.

[Figure: S0 → P1 → {S1, S2}. Transaction A under S1; transaction B under S2 containing writes and P2; P2 has children S3 and S4 holding the nested transactions C and D.]
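The commit and abort rules amount to a set union and a set discard. A minimal sketch, with a hypothetical `Txn` class and write sets taken from the slide (the write set given to D here is an illustrative assumption):

```python
class Txn:
    """A (possibly nested) transaction with a write set (hypothetical helper)."""
    def __init__(self, name, parent=None, writes=()):
        self.name = name
        self.parent = parent            # enclosing transaction, or None
        self.writes = set(writes)       # write set W(T)

def commit(txn):
    """Merge the committing transaction's write set into its parent's."""
    if txn.parent is not None:
        txn.parent.writes |= txn.writes

def abort(txn):
    """Discard the aborted transaction's write set."""
    txn.writes.clear()

B = Txn("B", writes={"z", "x"})
C = Txn("C", parent=B, writes={"z", "u"})
D = Txn("D", parent=B, writes={"z", "y"})   # assumed write set, for illustration

commit(C)
print(sorted(B.writes))   # ['u', 'x', 'z']

abort(D)
print(sorted(B.writes))   # ['u', 'x', 'z']  (nothing from D reaches B)
```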
Conflicts in Serial Transactions

• Virtually all proposed TM systems focus on the case where transactions are serial (no P nodes in subtrees of transactions).
• Two writes to the same memory location cause a conflict if and only if they are on different threads.
• The TM system can just check whether some other thread wrote to the memory location.

[Figure: S0 → P1 → {S1, S2}, with S1 running on Thread 1 and S2 on Thread 2. Serial nested transactions A to F write x and z; the final build marks a CONFLICT.]
Thread ID is not enough

• A work-stealing scheduler does not create a thread for every S node; instead, it schedules the computation on a fixed number of worker threads.
• The runtime cannot simply compare worker IDs to determine whether two transactions conflict.

[Figure: a deep computation tree (root X0, P and S nodes, transactions X2, Y1 to Y3, Z1 to Z4) scheduled on 3 workers, with inactive and unexecuted nodes marked. W(Y1) = {x, ..}, W(Y2) = {x, ..}, and W(Y3) = {x, ..}.]
CWSTM Invariant: Conflict-Free Execution

INVARIANT 1: At any time, for any given location L, all active transactions that have L in their write set fall along a (root-to-leaf) chain. Let X be the end of the chain.

INVARIANT 2: If Z tries to access object L:
• No conflict if X is an ancestor of Z (e.g., Z1).
• Conflict if X is not an ancestor of Z (e.g., Z2).

[Figure: computation tree with transactions Y0, Y1, Y2, Y3, X, Z1, and Z2. Legend: inactive nodes, active nodes, and transactions that accessed L.]
Design Attempt 1

• For every L, keep an access stack for L, holding the chain of active transactions that have L in their write set.
• Access stacks are changed on commits and aborts. If Y3 commits, it is replaced by Y2. If Y3 aborts, it disappears from the stack and Y1 is at the top.
• Let X be the top of the access stack for L. When transaction Z tries to access L, report a conflict if and only if X is not an ancestor of Z.

[Figure: the same computation tree with transactions Y0, Y1, Y2, Y3, X, Z1, and Z2, shown beside the access stack for L, which lists Y0, Y1, ..., with Top = X. Legend: inactive nodes, active nodes, and transactions that accessed L.]
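The three rules of this first design can be sketched directly. This is an illustrative model, not the authors' implementation: the class and method names are hypothetical, "ancestor" is assumed to include the transaction itself, and each transaction only records its enclosing transaction.

```python
class Txn:
    """A transaction that knows its enclosing transaction (hypothetical helper)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def has_ancestor(self, other):
        """True if `other` is this transaction or encloses it (inclusive: an assumption)."""
        t = self
        while t is not None:
            if t is other:
                return True
            t = t.parent
        return False

class AccessStacks:
    """One access stack per memory location; the top is the end of the list."""
    def __init__(self):
        self.stacks = {}

    def access(self, z, loc):
        """Return False on conflict; otherwise record the access and return True."""
        stack = self.stacks.setdefault(loc, [])
        if stack and not z.has_ancestor(stack[-1]):
            return False                      # top X is not an ancestor of Z
        if not stack or stack[-1] is not z:
            stack.append(z)
        return True

    def commit(self, y, write_set):
        """On commit, y is replaced by its parent on every stack in its write set."""
        for loc in write_set:
            stack = self.stacks[loc]
            stack.pop()
            if y.parent is not None and (not stack or stack[-1] is not y.parent):
                stack.append(y.parent)

    def abort(self, y, write_set):
        """On abort, y simply disappears from every stack in its write set."""
        for loc in write_set:
            self.stacks[loc].pop()

y0 = Txn("Y0"); y1 = Txn("Y1", y0); x = Txn("X", y1)
z1 = Txn("Z1", x)      # nested inside X
z2 = Txn("Z2", y1)     # in a different branch under Y1

stacks = AccessStacks()
for t in (y0, y1, x):
    stacks.access(t, "L")

print(stacks.access(z2, "L"))   # False: X is not an ancestor of Z2
print(stacks.access(z1, "L"))   # True:  X is an ancestor of Z1
stacks.commit(z1, {"L"})
print([t.name for t in stacks.stacks["L"]])   # ['Y0', 'Y1', 'X']
```

The `commit` method touches one stack per location in the committing transaction's write set, which is the cost the following slides analyze.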
Maintenance of access stack on commit

• Consider a serial program with a chain of nested transactions Y0, Y1, …, Yd. Each Yi accesses a unique location Li.
• Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.
• On commit of transaction Yi, (d-i+1) access stacks must be updated.
• Overhead due to transaction commits: O(d^2).

Write sets as the commits propagate up the chain, with the cost of the commit that produced each one:

    Y0:    W(Y0)   = {L0, L1, L2, …, Ld-1, Ld}    O(d)
    Y1:    W(Y1)   = {L1, L2, L3, …, Ld-1, Ld}    O(d-1)
    Y2:    W(Y2)   = {L2, L3, …, Ld-1, Ld}
    …
    Yd-2:  W(Yd-2) = {Ld-2, Ld-1, Ld}             O(2)
    Yd-1:  W(Yd-1) = {Ld-1, Ld}                   O(1)
    Yd:    W(Yd)   = {Ld}
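The quadratic overhead can be confirmed by counting. A minimal sketch, assuming one access-stack update per location in the committing transaction's write set (the function name is hypothetical):

```python
def commit_updates(d):
    """Count access-stack updates when the chain Y0..Yd commits innermost-first."""
    write_sets = [{f"L{i}"} for i in range(d + 1)]   # W(Yi) = {Li} initially
    updates = 0
    for i in range(d, -1, -1):
        updates += len(write_sets[i])        # Yi's commit touches d-i+1 stacks
        if i > 0:
            write_sets[i - 1] |= write_sets[i]   # merge into the parent
    return updates

for d in (1, 3, 10, 100):
    print(d, commit_updates(d), (d + 1) * (d + 2) // 2)
```

The count equals 1 + 2 + … + (d+1) = (d+1)(d+2)/2, which is O(d^2), compared with O(d) work for the same program without transactions.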
Lazy Access Stack

• Don't update access stacks on commits.
• Every transaction Y in the stack implicitly represents its closest active transactional ancestor.

[Figure: computation tree with transactions Y0 to Y9, X, and Z1 to Z4, shown beside two stacks: the lazy access stack for L and the equivalent (non-lazy) access stack, with Top = X. Legend: inactive nodes, active nodes, and transactions that accessed L.]
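The lazy rule can be sketched by leaving committed transactions on the stack and resolving the top entry at lookup time. This is only an illustration of the idea under stated assumptions: the names are hypothetical, aborts are not modeled, "ancestor" is taken to be inclusive, and the naive walk up the parent chain used here is not claimed to achieve the bounds stated earlier in the talk.

```python
class Txn:
    """A transaction with a parent pointer and an active flag (hypothetical helper)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.active = True

def commit(txn):
    """Lazy commit: mark the transaction inactive and leave every stack untouched."""
    txn.active = False

def closest_active(txn):
    """The transaction itself if active, else its closest active ancestor (or None)."""
    while txn is not None and not txn.active:
        txn = txn.parent
    return txn

def is_ancestor(x, z):
    """True if x is z or encloses z (inclusive: an assumption)."""
    while z is not None:
        if z is x:
            return True
        z = z.parent
    return False

def conflicts(stack, z):
    """Conflict iff the transaction represented by the top entry is not an ancestor of z."""
    if not stack:
        return False
    x = closest_active(stack[-1])
    return x is not None and not is_ancestor(x, z)

y0 = Txn("Y0"); y1 = Txn("Y1", y0); y2 = Txn("Y2", y1)
z = Txn("Z", y1)        # nested inside Y1
w = Txn("W", y0)        # nested inside Y0 but outside Y1

stack_for_L = [y0, y2]  # Y0 and Y2 accessed L; Y1 did not
commit(y2)              # the stack is NOT updated; Y2 now stands for Y1

print(conflicts(stack_for_L, z))   # False: Y1 is an ancestor of Z
print(conflicts(stack_for_L, w))   # True:  Y1 is not an ancestor of W
commit(y1)
print(conflicts(stack_for_L, w))   # False: the top entry now stands for Y0
```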