Computer Architecture Research Overview
Focus on: Transactional Memory
Rajeev Balasubramonian
School of Computing, University of Utah
http://www.cs.utah.edu/~rajeev
What is Computer Architecture?
• To a large extent, computer architecture determines:
  • the number of instructions used to execute a program
  • the time each instruction takes to execute
  • the idle cycles when no work gets done
  • the number of instructions that can execute in parallel
The Best Chip in 2004
[Chip diagram: a single P4-like core next to a 2MB L2 cache]
The Advent of Multi-Core Chips
[Chip diagram: many cores and cache banks tiled across the die]
• In the past, performance magically increased by 50% every year
• In the future, this improvement will be only ~20% every year
• … unless … the application is multi-threaded!
Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects and on-chip communication
• Power- and temperature-efficient designs
• Designs tolerant of errors
For publications, see http://www.cs.utah.edu/~rajeev/research.html
Multi-Threaded Applications
• Parallel or multi-threaded applications are difficult to write: they need lots of coordination and data exchange between threads (referred to as synchronization)
• Example: a banking database holds Alice & Bob’s joint account with a balance of $1000; at ATM 1, Alice deposits $100, while at ATM 2, Bob deposits $100
Multi-Threaded Applications
• The banking example goes wrong if the two deposits interleave:
  • ATM 1 (Alice): Rd balance -- $1000; Update balance -- $1100; Write balance -- $1100
  • ATM 2 (Bob): Rd balance -- $1000; Update balance -- $1100; Write balance -- $1100
• Both ATMs read the old balance of $1000 before either write lands, so both write $1100: one deposit is lost and the account ends at $1100 instead of $1200
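To make the race concrete, here is a minimal C++ sketch of the lost update (the thread and variable names are illustrative, not from the original slides): both threads can read the balance before either writes it back, so one deposit can vanish.

  #include <iostream>
  #include <thread>

  int balance = 1000;                    // Alice & Bob's shared balance

  // Unsynchronized deposit: read, compute, write back (not atomic)
  void deposit(int amount) {
      int observed = balance;            // Rd balance
      observed += amount;                // Update balance
      balance = observed;                // Write balance
  }

  int main() {
      std::thread alice(deposit, 100);   // ATM 1
      std::thread bob(deposit, 100);     // ATM 2
      alice.join();
      bob.join();
      // Expected $1200, but if both threads read $1000 before either
      // writes, both store $1100 and one deposit is lost.
      std::cout << "balance = " << balance << "\n";
  }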
Synchronization with Locks
Bank:
  lock(L1); read balance; calculate interest; update balance; unlock(L1);
ATM-withdraw:
  lock(L1); read balance; decrement; update balance; unlock(L1);
ATM-deposit:
  lock(L1); read balance; increment; update balance; unlock(L1);
• Each snippet executes atomically, as if it is the only process in the system
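A minimal sketch of the same code with a lock, using C++ std::mutex to play the role of lock(L1)/unlock(L1) (names are illustrative); the lock makes each snippet's read-modify-write atomic with respect to the others.

  #include <mutex>

  int balance = 1000;
  std::mutex L1;                              // the lock from the slide

  void atm_deposit(int amount) {
      std::lock_guard<std::mutex> guard(L1);  // lock(L1)
      balance = balance + amount;             // read balance; increment; update balance
  }                                           // unlock(L1) when guard goes out of scope

  void atm_withdraw(int amount) {
      std::lock_guard<std::mutex> guard(L1);  // lock(L1)
      balance = balance - amount;             // read balance; decrement; update balance
  }                                           // unlock(L1)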
Problems with Locks
• Deadlocks!
Thread 1: lock(L1); lock(L2); … unlock(L2); unlock(L1);
Thread 2: lock(L2); lock(L1); … unlock(L1); unlock(L2);
• If thread 1 holds L1 while waiting for L2 and thread 2 holds L2 while waiting for L1, neither can ever proceed
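A sketch of that deadlock with two std::mutex objects standing in for L1 and L2 (the thread bodies are illustrative): each thread grabs one lock and can then block forever waiting for the other.

  #include <mutex>

  std::mutex L1, L2;

  void thread1() {
      std::lock_guard<std::mutex> a(L1);   // lock(L1)
      std::lock_guard<std::mutex> b(L2);   // lock(L2) -- blocks if thread2 holds L2
      // ... work ...
  }                                        // unlock(L2); unlock(L1)

  void thread2() {
      std::lock_guard<std::mutex> a(L2);   // lock(L2)
      std::lock_guard<std::mutex> b(L1);   // lock(L1) -- blocks if thread1 holds L1
      // ... work ...
  }
  // Run concurrently, thread1 and thread2 can each hold one lock while
  // waiting for the other, and neither ever makes progress.

The conventional fixes are a global lock-ordering discipline or an acquire-both primitive such as C++'s std::scoped_lock(L1, L2); transactions sidestep the problem entirely.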
Problems with Locks
• Performance inefficiencies!
Ticket sale:
  lock(L1);
  if (condt1) traverse linked list till you find the entry
  if (condt2) sell the ticket
  unlock(L1);
• The single coarse lock serializes every buyer for the whole traversal, even when different buyers would end up touching different entries
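A rough sketch of the inefficiency (the ticket list and field names are invented for illustration): the single coarse lock forces every buyer to wait for the full traversal of every other buyer, even when they sell tickets for different events.

  #include <list>
  #include <mutex>
  #include <string>

  struct Ticket { std::string event; bool sold = false; };

  std::list<Ticket> tickets;   // shared ticket list
  std::mutex L1;               // one coarse lock guards the whole list

  bool sell_ticket(const std::string& event) {
      std::lock_guard<std::mutex> guard(L1);   // lock(L1): serializes all buyers
      for (Ticket& t : tickets) {              // traverse list till you find the entry
          if (t.event == event && !t.sold) {   // condt1 / condt2
              t.sold = true;                   // sell the ticket
              return true;
          }
      }
      return false;
  }                                            // unlock(L1)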
Transactions
• New paradigm to simplify programming:
  • instead of lock-unlock, use transaction begin-end
• Can yield better performance; eliminates deadlocks
• Programmers can freely encapsulate code sections within transactions and not worry about the impact on performance and correctness
• The programmer specifies the code sections they’d like to see execute atomically – the hardware takes care of the rest (provides the illusion of atomicity)
Transactions
• Transactional semantics:
  • when a transaction executes, it is as if the rest of the system is suspended and the transaction is in isolation
  • the reads and writes of a transaction happen as if they are all a single atomic operation
  • if the above conditions are not met, the transaction fails to commit (abort) and tries again
Example transaction:
  transaction begin
    read shared variables
    arithmetic
    write shared variables
  transaction end
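As one concrete way to write transaction begin/end today, here is a minimal sketch using GCC's experimental transactional-memory extension (built with g++ -fgnu-tm); the shared variables are illustrative, and hardware TM systems expose different primitives.

  // Build with: g++ -fgnu-tm tx_example.cpp
  int shared_a = 0;
  int shared_b = 0;

  void update() {
      __transaction_atomic {        // transaction begin
          int tmp = shared_a;       // read shared variables
          tmp = 2 * tmp + 1;        // arithmetic
          shared_b = tmp;           // write shared variables
      }                             // transaction end: commit, or abort and retry
  }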
Applications
• A transaction executes speculatively in the hope that there will be no conflicts
• Can replace a lock-unlock pair with a transaction begin-end
  • the lock is blocking, the transaction is not
  • programmers can conservatively introduce transactions without worsening performance
With a lock:
  lock (lock1)
  read A
  operations
  write A
  unlock (lock1)
With a transaction:
  transaction begin
  read A
  operations
  write A
  transaction end
Example 1
With a lock:
  lock (lock1)
  counter = counter + 1;
  unlock (lock1)
With a transaction:
  transaction begin
  counter = counter + 1;
  transaction end
• No apparent advantage to using transactions (apart from fault resiliency)
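For intuition, the same counter can be written in standard C++ with an atomic compare-and-swap retry loop, which mimics the optimistic execute/validate/retry behavior of a transaction (this is only an analogy for a single variable, not the hardware mechanism).

  #include <atomic>

  std::atomic<int> counter{0};

  void increment() {
      int expected = counter.load();
      // Optimistically compute the new value, then try to "commit" it.
      // If another thread changed counter in the meantime, the CAS fails,
      // expected is refreshed with the current value, and we retry --
      // much like an aborted transaction re-executing.
      while (!counter.compare_exchange_weak(expected, expected + 1)) {
          // loop retries with the freshly observed value in expected
      }
  }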
Example 2
• Producer-consumer relationships – producers place tasks at the tail of a work-queue and consumers pull tasks out of the head
Enqueue:
  transaction begin
  if (tail == NULL)
    update head and tail
  else
    update tail
  transaction end
Dequeue:
  transaction begin
  if (head->next == NULL)
    update head and tail
  else
    update head
  transaction end
• With locks, neither thread can proceed in parallel since head/tail may be updated – with transactions, enqueue and dequeue can proceed in parallel – transactions will be aborted only if the queue is nearly empty
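For contrast, a lock-based version of the work queue is sketched below (the node layout mirrors the slide's head/tail description; the names are illustrative). With a single lock, enqueue and dequeue always serialize, whereas the transactional versions above conflict only when the queue is nearly empty and both ends touch the same node.

  #include <mutex>

  struct Node { int task; Node* next = nullptr; };

  struct WorkQueue {
      Node* head = nullptr;
      Node* tail = nullptr;
      std::mutex lock;                       // one lock serializes both ends

      void enqueue(Node* n) {
          std::lock_guard<std::mutex> g(lock);
          if (tail == nullptr) {             // empty queue: update head and tail
              head = tail = n;
          } else {                           // otherwise update tail only
              tail->next = n;
              tail = n;
          }
      }

      Node* dequeue() {
          std::lock_guard<std::mutex> g(lock);
          Node* n = head;
          if (n == nullptr) return nullptr;  // nothing to dequeue
          if (n->next == nullptr) {          // last element: update head and tail
              head = tail = nullptr;
          } else {                           // otherwise update head only
              head = n->next;
          }
          return n;
      }
  };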
Detecting Conflicts – Basic Implementation
• When a transaction does a write, do not update memory; save the new value in cache and keep track of all modified lines (if the transaction is aborted, invalidate these lines)
• Also keep track of all the cache lines read by the transaction
• When another transaction commits, compare its write set with your own read set – a match causes an abort
• At transaction end, express intent to commit, broadcast the write-set
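A toy software model of this bookkeeping, invented purely for illustration (a real design tracks cache lines in hardware): writes are buffered and recorded in a write set, reads are recorded in a read set, a committing transaction broadcasts its write set, and any transaction whose read set intersects it aborts.

  #include <cstdint>
  #include <unordered_map>
  #include <unordered_set>

  using Addr = std::uintptr_t;
  using Memory = std::unordered_map<Addr, int>;

  struct Transaction {
      std::unordered_set<Addr> read_set;           // lines read by this transaction
      std::unordered_map<Addr, int> write_buffer;  // speculative writes, not yet in memory
      bool aborted = false;

      int read(Addr a, const Memory& memory) {
          read_set.insert(a);
          auto w = write_buffer.find(a);           // see our own speculative write first
          if (w != write_buffer.end()) return w->second;
          auto m = memory.find(a);
          return m == memory.end() ? 0 : m->second;
      }

      void write(Addr a, int value) {
          write_buffer[a] = value;                 // buffer it; memory is untouched
      }

      // Called when another transaction commits and broadcasts its write set.
      void observe_commit(const std::unordered_set<Addr>& their_writes) {
          for (Addr a : their_writes)
              if (read_set.count(a)) { aborted = true; return; }   // conflict: abort
      }

      // Try to commit: publish buffered writes and report the write set to broadcast.
      bool commit(Memory& memory, std::unordered_set<Addr>& writes_to_broadcast) {
          if (aborted) return false;               // caller re-executes the transaction
          for (const auto& [a, v] : write_buffer) {
              memory[a] = v;
              writes_to_broadcast.insert(a);
          }
          return true;
      }
  };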
Key Problem
• At the end of the transaction, the transaction’s writes are broadcast – the commit does not happen until everyone that needs to see the writes has seen them
• Broadcasts are not scalable! In a multi-core with 64 processors, 63 other transactions may have to wait while one transaction is busy broadcasting its writes
• Need efficient algorithms to handle a commit and clever design of on-chip networks to improve speed/power
Algorithm 1 – Sequential
• Distribute memory into N nodes – each transaction keeps track of the nodes that are read and written
[Diagram: processors P1–PN running transactions T1–TN, attached to memory nodes M1–MN]
• If two transactions touch different nodes, they can commit in parallel
• If two transactions happen to touch the same node, they must be aware of each other in case one has to abort
Algorithm designed by Seth Pugsley, Junior in the CS program
See tech report at http://www.cs.utah.edu/~rajeev/pubs/tr-07-016.pdf
Algorithm 1 – Sequential
• Each transaction attempts to occupy the nodes in its commit set in ascending order – a node can be occupied by only one transaction
• Must wait if another transaction has occupied the node; once all nodes are occupied, can proceed with commit
[Diagram: processors P1–PN with transactions T1–TN and memory nodes M1–MN]
Example 1: T1: nodes 1, 4, 7; T2: nodes 3, 4, 8 (both need node 4, so one must wait for the other)
Example 2: T1: nodes 1, 4, 7; T2: nodes 3, 5, 8 (disjoint commit sets, so both commits proceed in parallel)
Algorithm 1 – Sequential
• Cannot have hardware deadlocks: since nodes are occupied in increasing order, a transaction is always waiting for a transaction that is further ahead – cannot have a cycle of dependences
• If transactions usually do not pose conflicts for nodes, multiple transactions can commit in parallel
• Disadvantages: must occupy nodes sequentially, and conflicts lead to long delays
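A simplified software sketch of the sequential commit protocol (per-node mutexes stand in for "occupying" a directory node; the structures are invented for illustration): sort the commit set and occupy the nodes in ascending order, which is exactly what rules out a cycle of waiting transactions.

  #include <algorithm>
  #include <mutex>
  #include <vector>

  constexpr int kNumNodes = 8;             // N memory/directory nodes
  std::mutex node_occupied[kNumNodes];     // holding the mutex = occupying the node

  // commit_set holds the distinct node ids read/written by the transaction.
  void sequential_commit(std::vector<int> commit_set) {
      std::sort(commit_set.begin(), commit_set.end());   // ascending order
      for (int node : commit_set)
          node_occupied[node].lock();      // may wait; a waiter only holds lower ids,
                                           // so no cycle of dependences can form
      // ... all nodes occupied: make the transaction's writes visible ...

      for (int node : commit_set)
          node_occupied[node].unlock();    // release occupancy
  }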
Algorithm 2 – Speculative
• Attempt to occupy every node in the commit set in parallel – if any node is already occupied, revert to the sequential algorithm (otherwise, deadlocks could arise)
• Should typically perform no worse than the sequential algorithm
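Continuing the same toy model, the speculative variant tries to grab every node at once with try_lock and, if any attempt fails, releases everything and falls back to the sequential protocol above (falling back is what avoids deadlock).

  #include <mutex>
  #include <utility>
  #include <vector>

  // From the previous sketch:
  extern std::mutex node_occupied[];                    // per-node occupancy
  void sequential_commit(std::vector<int> commit_set);  // ordered fallback path

  void speculative_commit(std::vector<int> commit_set) {
      std::vector<int> held;
      for (int node : commit_set) {
          if (!node_occupied[node].try_lock()) {        // someone already occupies it
              for (int h : held)                        // back off completely ...
                  node_occupied[h].unlock();
              sequential_commit(std::move(commit_set)); // ... and fall back
              return;
          }
          held.push_back(node);                         // occupied without waiting
      }
      // ... all nodes occupied "in parallel": commit ...

      for (int node : held)
          node_occupied[node].unlock();
  }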
Algorithm 3 – Momentum
• Attempt to occupy nodes in parallel – every request has a momentum value to indicate how many nodes have already been occupied by the transaction
• If a transaction finds that a node is already occupied, it can attempt to steal occupancy if it has a higher momentum
• The system is deadlock- and livelock-free (the transaction with the highest momentum at any time has a path to completion)
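A deliberately simplified, centralized model of the momentum rule is sketched below (the real protocol is distributed and handles victims more carefully; everything here is invented for illustration): each node records its occupant, a requester with strictly higher momentum may steal the node, and a lower-momentum requester must wait and retry.

  #include <mutex>

  constexpr int kNumNodes = 8;
  constexpr int kMaxTx    = 64;
  constexpr int kFree     = -1;

  std::mutex directory;    // one big lock keeps this sketch short
  int  occupant[kNumNodes] = { kFree, kFree, kFree, kFree,
                               kFree, kFree, kFree, kFree };  // occupant of each node
  int  momentum_of[kMaxTx];       // nodes already occupied by each transaction
  bool lost_occupancy[kMaxTx];    // set when a transaction has a node stolen

  // Transaction tx asks for node. Returns true if tx now occupies it.
  bool request_node(int tx, int node) {
      std::lock_guard<std::mutex> g(directory);
      int holder = occupant[node];
      if (holder == kFree) {                          // free: simply occupy it
          occupant[node] = tx;
          ++momentum_of[tx];
          return true;
      }
      if (momentum_of[tx] > momentum_of[holder]) {    // higher momentum steals occupancy
          occupant[node] = tx;
          ++momentum_of[tx];
          --momentum_of[holder];
          lost_occupancy[holder] = true;              // victim must re-request this node
          return true;
      }
      return false;                                   // lower momentum: wait and retry
  }

The intuition for livelock freedom carries over: the transaction with the highest momentum can always occupy or steal the node it asks for, so at least one transaction is always making progress.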
Interconnects as a Bottleneck
• In the past, on-chip data transmission on wires cost almost nothing
• Interconnect speed and power have been improving, but not at the same rate as transistor speeds
• Hence, relative to computation, communication is much more expensive
• In the near future, it will take 100 cycles to travel across the chip
• 50% of chip power can be attributed to interconnects
On-Going Explorations
• For the various on-chip communications just described, what is the optimal on-chip network?
• What topology works best? What router microarchitecture is most efficient in terms of performance and power?
• What wires work best? Depends on criticality of specific data transfer…
To Learn More…
• CS/EE 3810: Computer Organization
• CS/EE 6810: Computer Architecture
• CS/EE 7810: Advanced Computer Architecture
• CS/EE 7820: Parallel Computer Architecture
• CS 7937 / 7940: Architecture Reading Seminar