Making the Fast Case Common and the Uncommon Case Simple in the Unbounded Transactional Memory
Two Ideas of This Paper Use a permissions-only cache to reduce the rate at which less-efficient overflow handling mechanisms are invoked. When an overflow does happen, simply serialize transactions. Two approaches: OneTM-Serialized and OneTM-Concurrent.
Baseline of This Paper Eager Version Management: uses in-place updates and LogTM-style logging to distinguish architected and speculative state. To avoid logging the same memory block multiple times, logging is skipped when the block's W bit is already set. Eager/Lazy Conflict Detection. Transactions cannot survive a context switch: the transaction is aborted if one occurs.
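The log-filtering rule in the baseline can be sketched as a toy model. This is only an illustration under stated assumptions: the class `EagerTx`, its method names, and the dict-based memory are hypothetical and are not taken from the paper or from LogTM.

```python
class EagerTx:
    """Toy model: in-place updates plus an undo log, filtered by a per-block W bit."""

    def __init__(self, memory):
        self.memory = memory   # dict: block address -> value (assumed pre-populated)
        self.undo_log = []     # (address, old value) pairs, oldest first
        self.w_bits = set()    # blocks whose W bit is currently set

    def store(self, addr, value):
        # Log the old value only on the first transactional write to a block.
        if addr not in self.w_bits:
            self.undo_log.append((addr, self.memory[addr]))
            self.w_bits.add(addr)
        self.memory[addr] = value  # in-place (eager) update

    def commit(self):
        # New values are already in place; just discard the log and W bits.
        self.undo_log.clear()
        self.w_bits.clear()

    def abort(self):
        # Walk the log backwards to restore the pre-transaction values.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()
        self.w_bits.clear()
```

Two stores to the same block leave a single log entry, which is the point of checking the W bit before logging.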
Making the Fast Case Common – Using Permission-only Cache Track transactional R/W bits for blocks replaced from the processor’s data cache by retaining coherence permissions, but not data. It is organized as a tagged, set-associative structure that contains R/W bits per entry.
The Permission-only Cache is: Read by external coherence requests as part of conflict detection. Updated when a transactional block is replaced from the data cache. Invalidated on a commit or abort. Read on transactional store misses to avoid redundantly logging the block.
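The four operations listed above can be sketched with a small software model. It assumes an unbounded dictionary instead of a tagged, set-associative structure, and the class and method names are hypothetical; the conflict rule used (any external access conflicts with a W bit, an external write conflicts with an R bit) is the usual reader/writer rule and is an assumption here.

```python
class PermissionsOnlyCache:
    """Toy permissions-only cache: keeps R/W bits per block, never data."""

    def __init__(self):
        self.entries = {}  # block address -> (read_bit, write_bit)

    def on_data_cache_eviction(self, block, r, w):
        # Updated when a transactional block is replaced from the data cache.
        if r or w:
            old_r, old_w = self.entries.get(block, (False, False))
            self.entries[block] = (old_r or r, old_w or w)

    def conflicts_with(self, block, external_is_write):
        # Read by external coherence requests as part of conflict detection.
        r, w = self.entries.get(block, (False, False))
        return w or (external_is_write and r)

    def already_logged(self, block):
        # Read on transactional store misses to avoid redundant logging.
        return self.entries.get(block, (False, False))[1]

    def clear(self):
        # Invalidated on a commit or abort.
        self.entries.clear()
```

For example, after evicting a block that was only read, an external write to it reports a conflict while an external read does not.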
Implementation of Permission-only Cache Naïve implementation: a full tag plus R/W bits for each entry, which has a high space overhead. Alternative: use sector cache techniques. With good page-level spatial locality, a 4 KB permissions-only cache allows a transaction to access up to 16K replaced blocks (1 MB of data with 64 B cache lines).
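The 16K-block figure can be checked with a short calculation. The sketch assumes 2 bits (R and W) per tracked block and ignores the space taken by sector tags; the function name `po_cache_reach` is hypothetical.

```python
def po_cache_reach(po_bytes, bits_per_block=2, line_bytes=64):
    """Return (blocks tracked, bytes of data covered), ignoring tag overhead."""
    blocks = po_bytes * 8 // bits_per_block
    return blocks, blocks * line_bytes

blocks, reach = po_cache_reach(4 * 1024)
print(blocks, reach)  # 16384 blocks, 1048576 bytes (1 MB)
```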
Using L2 Cache to Support Larger Transactions without Overflow Dynamically share the L2 cache so that its frames can act as PO cache entries. Add a “permissions-only valid bit” to each frame to indicate when it holds transactional R/W bits rather than data. A 4 MB L2 cache with 64 B cache lines supports up to 1 GB of data accessed in a transaction. Reasonable? Maybe not so easy.
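The 1 GB upper bound follows if every L2 frame is repurposed to hold R/W bits. The calculation below is a sketch under that best-case assumption, with 2 bits per tracked block; the function name `l2_as_po_reach` is hypothetical.

```python
def l2_as_po_reach(l2_bytes, line_bytes=64, bits_per_block=2):
    """Best case: every L2 frame holds permission bits instead of data."""
    frames = l2_bytes // line_bytes                      # number of L2 frames
    blocks_per_frame = line_bytes * 8 // bits_per_block  # R/W pairs per frame
    return frames * blocks_per_frame * line_bytes        # bytes of data covered

reach = l2_as_po_reach(4 * 1024 * 1024)
print(reach // 2**30, "GB")
```

The bound is only reached when no frame holds data and every permission bit in every frame is usable, which is one reason to doubt it in practice.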
Possible Low Utilization of Permission-only Cache What if the program has little page-level spatial locality? For example, a first address with (tag=100, index=100) and a second address with (tag=200, index=100) map to the same set. An N-way PO cache? Using part of the L2 cache may not be reasonable. A Bloom filter may be the solution.
Argument about Power Issue of Permission-only Cache The paper claims that the permissions-only cache is often empty; in these circumstances, it can be completely powered down to save dynamic and static power. How is the delay of powering it up covered? Define a threshold in the data cache?
Making the Uncommon Case Simple Now consider a transaction that has overflowed. Other ideas could be used to handle it, since the overflow rate has already been reduced. In this paper: serialize overflowed transactions. Exclusivity of overflowed execution is achieved via the shared transaction status word (STSW), which resides in a fixed location in the virtual address space of each process.
OneTM-Serialized using STSW The STSW can be coherently cached in a special register to make these checks inexpensive. A transaction enters overflowed execution after it has atomically changed the overflow bit from unset to set. If it finds the bit already set, the processor must stall.
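The test-and-set behaviour of the overflow bit can be sketched in software. This is a model only: a `threading.Lock` stands in for the hardware's atomic update, and the class `STSW` with its method names is hypothetical.

```python
import threading

class STSW:
    """Toy shared transaction status word holding a single overflow bit."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for hardware atomicity
        self.overflowed = False

    def try_enter_overflow(self):
        # Atomically change the bit from unset to set; fail if already set.
        with self._guard:
            if self.overflowed:
                return False  # caller must stall and retry later
            self.overflowed = True
            return True

    def leave_overflow(self):
        # Called when the overflowed transaction commits or aborts.
        with self._guard:
            self.overflowed = False
```

A second transaction calling `try_enter_overflow()` while the bit is set gets `False`, which corresponds to the stall described above.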
OneTM-Concurrent Introduces per-block persistent transaction metadata as part of the architected state. Each cache-block-sized block of physical memory is augmented with additional R/W bits and a 14-bit OTID. Note that there is only ONE overflowed transaction at a time; therefore, one set of bits per block is sufficient. The bits are set as the overflowed transaction accesses memory.
Lazy Metadata Clearing The processor checks whether the OTID in a block's metadata equals the OTID in the STSW to decide whether a conflict has occurred. If equal, stall until the overflowed bit in the STSW is unset. If not equal, the metadata is stale: clear it. False conflicts are possible.
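The check can be sketched as a small function. Several details are assumptions rather than statements from the slides: metadata whose OTID matches while the overflow bit is unset is treated as stale, a read of a block that the overflowed transaction has only read is not treated as a conflict, and the names `BlockMeta` and `check_block` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    r: bool = False
    w: bool = False
    otid: int = 0  # 14-bit ID in the slides; a plain int here

def check_block(meta, stsw_overflowed, stsw_otid, is_write):
    """Return 'stall' or 'proceed' for an access by a non-overflowed thread."""
    if not (meta.r or meta.w):
        return "proceed"                 # no metadata set on this block
    if stsw_overflowed and meta.otid == stsw_otid:
        # Metadata belongs to the currently running overflowed transaction.
        if meta.w or is_write:
            return "stall"               # wait until the overflow bit is unset
        return "proceed"                 # read/read sharing (assumption)
    # OTID mismatch (or no overflowed transaction): stale metadata, clear lazily.
    meta.r = meta.w = False
    return "proceed"
```

Stale bits left behind by an earlier overflowed transaction are cleared only when a later access touches the block, so no bulk clearing pass is needed at commit.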
Permission-only Cache Relates to Our Project The PO cache would be power hungry. Also, it is a dedicated structure for evicted blocks. The PO cache will catch the blocks evicted from the VR. We can set a threshold in the VR; once the threshold is reached, turn on the PO cache. Move blocks in the shared state first, because the PO cache does not keep the data.