Hardware Transactional Memory (Herlihy, Moss, 1993) Some slides are taken from a presentation by Royi Maimon & Merav Havuv, prepared for a seminar given by Prof. Yehuda Afek.
Outline • Hardware Transactional Memory (HTM) • Transactions • Caches and coherence protocols • General Implementation • Simulation
What is a transaction? • A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts • If a transaction commits, all the loads and stores appear to have executed atomically • If a transaction aborts, none of its stores take effect • Transaction operations aren't visible until they commit (if they do)
Transactional Memory • A new multiprocessor architecture • The goal: implementing non-blocking synchronization that is • efficient • easy to use compared with conventional techniques based on mutual exclusion • Implemented by straightforward extensions to multiprocessor cache-coherence protocols and/or by software mechanisms
A cache is an associative (a.k.a. content-addressable) memory • Conventional memory: given an address A, returns the data stored at A • Associative memory: given data D, returns an address A such that *A = D
Cache tags and address structure [Figure: main memory and cache] • Indexes and tags are typically high-order address bits
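To make the address structure concrete, here is a minimal sketch of how an address might be split into tag, index, and offset. The geometry (32-byte lines, 256 sets, 32-bit addresses) and the names `addr_parts` and `split_address` are assumptions chosen for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry, chosen only for illustration. */
enum {
    OFFSET_BITS = 5,  /* 32-byte cache lines */
    INDEX_BITS  = 8   /* 256 sets */
};

typedef struct {
    uint32_t tag;     /* remaining high-order bits, stored with the line */
    uint32_t index;   /* selects the set to probe */
    uint32_t offset;  /* byte position within the line */
} addr_parts;

static addr_parts split_address(uint32_t addr)
{
    addr_parts p;
    assert(OFFSET_BITS + INDEX_BITS < 32);  /* leave room for a tag */
    p.offset = addr & ((1u << OFFSET_BITS) - 1u);
    p.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    p.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return p;
}
```

A lookup would use `index` to pick a set and then compare `tag` against the tags stored there; a fully associative cache, such as the transactional cache discussed later, has no index field and compares the tag against every entry.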
Cache-Coherence Protocol • In multiprocessors, each processor typically has its own local cache memory. Goals: • Minimize average latency due to memory access • Decrease bus traffic • Maximize cache hit ratio • A cache-coherence protocol manages the consistency of caches and main memory: • Shared memory semantics are maintained • Caches and main memory communicate to guarantee coherency
The need to maintain coherency Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
Coherency requirements Text taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
Snoopy Cache • All caches monitor (snoop) the activity on a global bus/interconnect to determine whether they hold a copy of the block of data that is requested on the bus.
Coherence protocol types • Write-through: the information is written to both the cache block and the block in the lower-level memory • Write-back: the information is written only to the cache block. The modified cache block is written to main memory only when it is replaced
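The difference between the two policies can be seen in a toy model of a single cache block that counts how many writes reach memory. This is only a sketch; the `one_line_cache` type and the function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned cached;      /* value held in the cache block */
    unsigned memory;      /* value held in lower-level memory */
    int dirty;            /* write-back only: block differs from memory */
    unsigned mem_writes;  /* number of writes that reached memory */
} one_line_cache;

/* Write-through: every store updates the cache block and memory. */
static void write_through(one_line_cache *c, unsigned v)
{
    assert(c != 0);
    c->cached = v;
    c->memory = v;
    c->mem_writes++;
}

/* Write-back: a store updates only the cache block and marks it dirty. */
static void write_back(one_line_cache *c, unsigned v)
{
    assert(c != 0);
    c->cached = v;
    c->dirty = 1;
}

/* Replacement: a dirty block is written to memory exactly once. */
static void evict(one_line_cache *c)
{
    assert(c != 0);
    if (c->dirty) {
        c->memory = c->cached;
        c->mem_writes++;
        c->dirty = 0;
    }
}
```

Three consecutive stores cost three memory writes under write-through but only one (at replacement) under write-back, which is why write-back reduces bus traffic.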
3-state Coherence protocol • Invalid: cache line/block does not contain legal information • Shared: cache line/block contains information that may be shared by other caches • Modified/exclusive: cache line/block was modified while in cache and is exclusively owned by current cache
Cache-coherency mechanism – state transition diagram [Figure: transitions based on processor requests; transitions based on bus requests] Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
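Since the diagram itself is not reproduced here, the following sketch encodes a typical write-invalidate, write-back 3-state machine for one cache line. It is an approximation under stated assumptions: the event names are hypothetical, and replacement of a line on a conflicting miss is left out.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Events seen by one cache for one line (hypothetical names). */
typedef enum {
    CPU_READ, CPU_WRITE,  /* requests from the local processor */
    BUS_READ, BUS_WRITE   /* requests snooped from other caches */
} line_event;

/* Returns the next state. Sets *write_back when a modified copy must be
   flushed to memory before another cache can use the block. */
static line_state next_state(line_state s, line_event e, int *write_back)
{
    assert(write_back != 0);
    *write_back = 0;
    switch (e) {
    case CPU_READ:
        return s == INVALID ? SHARED : s;  /* read miss fetches a shared copy */
    case CPU_WRITE:
        return MODIFIED;                   /* gain exclusive ownership */
    case BUS_READ:
        if (s == MODIFIED) { *write_back = 1; return SHARED; }
        return s;
    case BUS_WRITE:
        if (s == MODIFIED) *write_back = 1;
        return INVALID;                    /* another cache now owns the block */
    }
    return s;
}
```

The first two cases correspond to transitions driven by processor requests, the last two to transitions driven by bus requests.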
HTM-supported API The following primitive instructions for accessing memory are provided: • Load-transactional (LT): reads the value of a shared memory location into a private register. • Load-transactional-exclusive (LTX): like LT, but “hinting” that the location is likely to be modified. • Store-transactional (ST): tentatively writes a value from a private register to a shared memory location. • Commit (COMMIT) • Abort (ABORT) • Validate (VALIDATE): tests the current transaction status.
Some definitions • Read set: the set of locations read by LT by a transaction • Write set: the set of locations accessed by LTX or ST issued by a transaction • Data set (footprint): the union of the read and write sets. • A set of values in memory is inconsistent if it couldn’t have been produced by any serial execution of transactions
Intended Use Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would: (1) use LT or LTX to read from a set of locations, (2) use VALIDATE to check that the values read are consistent, (3) use ST to modify a set of locations, (4) use COMMIT to make the changes permanent. If either the VALIDATE or the COMMIT fails, the process returns to Step (1).
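The read, validate, store, commit pattern above can be sketched in C. Real LT/LTX/ST/VALIDATE/COMMIT/ABORT are hardware instructions, so the versions below are single-threaded software stand-ins (an assumption made purely so the example runs): stores go to a small log and reach memory only on commit. The `transfer` function and the log layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Single-threaded stand-ins for the HTM primitives (hypothetical). */
enum { LOG_MAX = 8 };
static struct { unsigned *addr; unsigned val; } write_log[LOG_MAX];
static size_t log_len = 0;
static int tstatus = 1;                  /* 1 = active, 0 = aborted */

static void reset_tx(void) { log_len = 0; tstatus = 1; }

static unsigned LT(unsigned *a)
{
    /* A transaction must see its own tentative stores (newest first). */
    for (size_t i = log_len; i-- > 0; )
        if (write_log[i].addr == a) return write_log[i].val;
    return *a;
}
static unsigned LTX(unsigned *a) { return LT(a); }  /* the hint has no effect here */

static void ST(unsigned *a, unsigned v)
{
    assert(log_len < LOG_MAX);           /* tiny log, like a tiny cache */
    write_log[log_len].addr = a;
    write_log[log_len].val = v;
    log_len++;
}
static int VALIDATE(void)
{
    int ok = tstatus;
    if (!ok) reset_tx();                 /* aborted: discard and start over */
    return ok;
}
static void ABORT(void) { reset_tx(); }
static int COMMIT(void)
{
    int ok = tstatus;
    if (ok)                              /* make tentative stores permanent */
        for (size_t i = 0; i < log_len; i++)
            *write_log[i].addr = write_log[i].val;
    reset_tx();
    return ok;
}

/* Move `amount` from *from to *to as one transaction. */
static int transfer(unsigned *from, unsigned *to, unsigned amount)
{
    for (;;) {
        unsigned f = LTX(from);          /* step 1: read the locations */
        unsigned t = LTX(to);
        if (!VALIDATE()) continue;       /* step 2: are the values consistent? */
        if (f < amount) { ABORT(); return 0; }
        ST(from, f - amount);            /* step 3: tentative writes */
        ST(to, t + amount);
        if (COMMIT()) return 1;          /* step 4: make them permanent */
    }
}
```

In this stand-in nothing can interfere, so the retry path is never taken; on real hardware a conflicting access by another processor would make VALIDATE or COMMIT fail and send the loop back to step 1.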
Implementation • Hardware transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols • Herlihy and Moss suggested extending the “snoopy” cache protocol for a shared bus to support transactional memory • Supports short-lived transactions with a relatively small data set.
The basic idea • Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost • Once a transaction conflict is detected, it can be resolved in a variety of ways
Implementation • Each processor maintains two caches • Regular cache: for non-transactional operations • Transactional cache: a small, fully associative cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory (until commit) • An entry may reside in one cache or the other but not in both
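Cache line states • Each cache line (regular or transactional) has one of the following states (named as in the Herlihy and Moss paper): INVALID, VALID, DIRTY (modified), RESERVED (exclusive) • Each transactional cache line has, in addition, one of these tags: TC_INVALID, TC_NORMAL, TC_ABORT (holds the “new” values, discarded on abort), TC_COMMIT (holds the “old” values, discarded on commit)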
Cleanup • When the transactional cache needs space for a new entry, it searches for: • A TC_INVALID entry • If none, a TC_NORMAL entry • Finally, a TC_COMMIT entry (why can such entries be replaced?)
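The search order above can be sketched as a small victim-selection routine. The function name and the array-of-tags representation are hypothetical, and returning -1 when only TC_ABORT entries remain is an assumption: the slide does not say what happens in that case.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_tag;

/* Returns the index of the entry to reuse, preferring TC_INVALID, then
   TC_NORMAL, then TC_COMMIT. Returns -1 if every entry is TC_ABORT
   (assumption: tentative new values are never chosen as victims). */
static int pick_victim(const tc_tag *tags, size_t n)
{
    static const tc_tag order[] = { TC_INVALID, TC_NORMAL, TC_COMMIT };
    assert(tags != NULL);
    for (size_t k = 0; k < sizeof order / sizeof order[0]; k++)
        for (size_t i = 0; i < n; i++)
            if (tags[i] == order[k])
                return (int)i;
    return -1;
}
```

Hardware would perform this search associatively rather than with nested loops; the loops only spell out the priority order.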
Processor actions • Each processor maintains two flags: • The transaction active (TACTIVE) flag: indicates whether a transaction is in progress • The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False) • Non-transactional operations behave exactly as in the original cache-coherence protocol
Example – LT operation:
(1) Look for a TC_ABORT entry.
  • Found? Return its value.
(2) Not found? Look for a TC_NORMAL entry.
  • Found? Change it to TC_ABORT and allocate another TC_COMMIT entry with the same value.
(3) Not found? (Cache miss.) Ask to read this block from the shared memory.
  • Successful read: create two entries, TC_ABORT and TC_COMMIT.
  • Busy signal: abort the transaction: • TSTATUS=FALSE • Drop TC_ABORT entries • All TC_COMMIT entries are set to TC_NORMAL
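The LT flow above can be sketched as a toy software model of the transactional cache. Everything here is illustrative: the `tcache` layout, the `bus_read_fn` callback standing in for the bus read cycle, and the two toy bus functions are hypothetical, and for brevity the sketch only reuses TC_INVALID slots instead of running the full cleanup search.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_tag;
typedef struct { tc_tag tag; unsigned addr; unsigned value; } tc_entry;

enum { TC_SIZE = 8 };
typedef struct {
    tc_entry e[TC_SIZE];
    int tstatus;                          /* 1 = active, 0 = aborted */
} tcache;

/* Hypothetical bus model: returns 1 and fills *out, or 0 for BUSY. */
typedef int (*bus_read_fn)(unsigned addr, unsigned *out);

static tc_entry *find(tcache *c, unsigned addr, tc_tag tag)
{
    for (size_t i = 0; i < TC_SIZE; i++)
        if (c->e[i].tag == tag && c->e[i].addr == addr) return &c->e[i];
    return NULL;
}

/* Simplification: only free (TC_INVALID) slots are reused. */
static tc_entry *alloc_entry(tcache *c, tc_tag tag, unsigned addr, unsigned v)
{
    for (size_t i = 0; i < TC_SIZE; i++)
        if (c->e[i].tag == TC_INVALID) {
            c->e[i].tag = tag; c->e[i].addr = addr; c->e[i].value = v;
            return &c->e[i];
        }
    return NULL;
}

static void abort_tx(tcache *c)
{
    c->tstatus = 0;
    for (size_t i = 0; i < TC_SIZE; i++) {
        if (c->e[i].tag == TC_ABORT)       c->e[i].tag = TC_INVALID; /* drop new */
        else if (c->e[i].tag == TC_COMMIT) c->e[i].tag = TC_NORMAL;  /* keep old */
    }
}

/* Returns 1 and stores the value in *out, or 0 if the transaction aborted. */
static int lt(tcache *c, unsigned addr, unsigned *out, bus_read_fn bus_read)
{
    tc_entry *x = find(c, addr, TC_ABORT);
    if (x) { *out = x->value; return 1; }               /* step 1 */

    tc_entry *n = find(c, addr, TC_NORMAL);
    if (n) {                                            /* step 2 */
        tc_entry *old = alloc_entry(c, TC_COMMIT, addr, n->value);
        assert(old != NULL);                            /* sketch assumes a free slot */
        n->tag = TC_ABORT;
        *out = n->value;
        return 1;
    }

    unsigned v;                                         /* step 3: cache miss */
    if (!bus_read(addr, &v)) { abort_tx(c); return 0; } /* busy signal */
    tc_entry *a = alloc_entry(c, TC_ABORT, addr, v);
    tc_entry *b = alloc_entry(c, TC_COMMIT, addr, v);
    assert(a != NULL && b != NULL);
    *out = v;
    return 1;
}

/* Toy bus models for trying the sketch out. */
static int bus_ok(unsigned addr, unsigned *out) { *out = addr + 100; return 1; }
static int bus_busy(unsigned addr, unsigned *out) { (void)addr; (void)out; return 0; }
```

Calling `lt` with `bus_ok` on an empty cache exercises the miss path and leaves a TC_ABORT/TC_COMMIT pair behind; a second `lt` for the same address then hits the TC_ABORT entry without touching the bus.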
Snoopy cache actions: • Both the regular cache and the transactional cache snoop on the bus. • A cache ignores any bus cycles for lines not in that cache. • The transactional cache’s behavior: • If TSTATUS=False, or if the operation isn’t transactional, the cache acts just like the regular cache, but ignores entries with state other than TC_NORMAL • Otherwise: on an LT by another CPU, if the state is TC_NORMAL or the line has not been written to, the cache returns the value; in all other cases it returns BUSY
Committing/aborting a transaction • Upon commit • Set all entries tagged by TC_COMMIT to TC_INVALID • Set all entries tagged by TC_ABORT to TC_NORMAL • Upon abort • Set all entries tagged by TC_ABORT to TC_INVALID • Set all entries tagged by TC_COMMIT to TC_NORMAL Since transactional cache is small, it is assumed that these operations can be done in parallel.
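The two tag updates are mirror images of each other, which a short sketch makes explicit. The function name and the array-of-tags representation are hypothetical, and the sequential loop is only a stand-in for what the slide assumes the hardware does in parallel.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_tag;

/* On commit the old values (TC_COMMIT) are discarded and the new values
   (TC_ABORT) become ordinary entries; on abort the roles are swapped. */
static void finish_transaction(tc_tag *tags, size_t n, int committed)
{
    assert(tags != NULL || n == 0);
    tc_tag discard = committed ? TC_COMMIT : TC_ABORT;
    tc_tag keep    = committed ? TC_ABORT  : TC_COMMIT;
    for (size_t i = 0; i < n; i++) {
        if (tags[i] == discard)   tags[i] = TC_INVALID;
        else if (tags[i] == keep) tags[i] = TC_NORMAL;
    }
}
```

Because no data is copied, only tags are rewritten, both outcomes cost the same and neither needs to touch main memory.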
Simulation • We’ll see example code for the producer/consumer algorithm using the transactional memory architecture. • The simulation runs on both cache-coherence protocols: snoopy and directory-based. • The simulation uses 32 processors • The simulation finishes when 2^16 operations have completed.
Part of Producer/Consumer Code

    #include <stdlib.h>   /* random() */

    typedef struct {
        Word deqs;               // holds the head's index
        Word enqs;               // holds the tail's index
        Word items[QUEUE_SIZE];
    } queue;

    unsigned queue_deq(queue *q) {
        unsigned head, tail, result;
        unsigned backoff = BACKOFF_MIN;
        unsigned wait;
        while (1) {
            result = QUEUE_EMPTY;
            tail = LTX(&q->enqs);
            head = LTX(&q->deqs);
            if (head != tail) {              /* queue not empty? */
                result = LT(&q->items[head % QUEUE_SIZE]);
                ST(&q->deqs, head + 1);      /* advance counter */
            }
            if (COMMIT()) break;
            /* abort => exponential backoff before retrying */
            wait = random() % (1u << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
        return result;
    }
The results [Figures: snoopy cache; directory-based coherency]
Key Limitations: • Transaction size is limited by cache size • Transaction length is effectively limited by the scheduling quantum • Process migration is problematic
MSA: A few sample research directions • Theoretical • Are there counters/stacks/queues with sub-linear write-contention? • What is the space complexity of obstruction-free read/write consensus? • What is the step complexity of a 1-time read/write counter? • ... • (More) practical • The design of efficient lock-free/blocking concurrent objects • Defining more realistic metrics for blocking synchronization, and designing algorithms that are efficient w.r.t. these metrics • Improving the usability of transactional memory • ...