Signatures in Transactional Memory Systems

Dissertation Defense, Luke Yen, 1/29/2009

Key Contributions
• Trend: Transactional memory (TM) is an emerging parallel programming paradigm in which programmer-annotated transactions execute atomically (all or nothing).
• Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events (e.g., cache evictions).
• Contribution: LogTM-SE HTM, which uses simple hardware and interacts with the operating system to virtualize transactions. No overhead on cache evictions.

Key Contributions Cont.
• Challenge #2: (1) H3 signatures have high area & power overheads, and (2) thread-private references cause false conflicts.
• Contribution: Notary: (1) Page-Block-XOR hashing, which performs similarly to H3 at lower overhead, and (2) stack- & heap-based privatization.
• Challenge #3: HTM system performance is difficult to understand.
• Contribution: TMProf: lightweight hardware performance counters that help HTM designers & TM programmers.
• Challenge #4: Signatures suffer from false conflicts.
• Contribution: Six hardware/software signature extensions to mitigate false conflicts.

Outline
• Introduction and Background
  • Transactional Memory background
  • LogTM-SE [HPCA 2007] (Contribution #1)
• Notary [MICRO 2008] (Contribution #2, focus of presentation)
• TMProf (Submitted for publication) (Contribution #3, focus of presentation)
• Conclusion
* Skip “Extensions to Signatures” (Contribution #4)

Transactional Memory (TM)
• Locks do not compose
  • Can lead to deadlocks
• TM programmer says
  • “I want this atomic”
• TM system
  • “Makes it so”
• Focus on Hardware TM (HTM) implementations
  • Fast
  • Leverage cache coherence & speculation
  • But hardware is finite & should be policy-free
Example:
    void move(T s, T d, Obj key) {
        atomic {
            tmp = s.remove(key);
            d.insert(key, tmp);
        }
    }

LogTM Signature Edition (LogTM-SE) at 50,000 feet
• HTMs are fast
  • Version management for transaction commits & aborts: HW handles old/new versions (e.g., write buffer)
  • Conflict detection so that only non-conflicting transactions commit: HW handles conflict detection (R/W bits & coherence)
• But closely coupled to the L1 cache
  • On critical paths & hard for SW to save/restore
• Our approach: decoupled, simple HW, SW control
• LogTM-SE
  • HW: LogTM’s log + signatures (from Illinois Bulk)
  • SW: unbounded nesting, thread switching, & paging

Signature Background
• Signatures are used to summarize a transaction’s read- and write-sets and to detect conflicts with them
• Inspired by the Bulk system [Ceze,ISCA’06]
• Imprecise; can be implemented with Bloom filters
  • Can have false positives, but never false negatives
• Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording)
• Ex: Use k Bloom filters of size m/k, with independent hash functions
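
The last bullet (k Bloom filters of m/k bits each) can be sketched in software as follows. This is only an illustrative model, not the hardware design: the class name `Signature`, the default sizes, and the salted BLAKE2b hash are assumptions made for the example, standing in for whatever hash circuits the hardware uses.

```python
import hashlib


class Signature:
    """Parallel Bloom filter: k banks of m/k bits, one hash function per bank."""

    def __init__(self, m_bits=2048, k=2):
        assert m_bits % k == 0, "m must divide evenly into k banks"
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is an int used as a bit vector

    def _index(self, bank, addr):
        # Stand-in hash (assumption): BLAKE2b salted with the bank number,
        # so each bank gets a different, effectively independent function.
        digest = hashlib.blake2b(
            addr.to_bytes(8, "little"), digest_size=8, salt=bytes([bank])
        ).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, addr):
        """Record addr by setting one bit in every bank."""
        for b in range(self.k):
            self.banks[b] |= 1 << self._index(b, addr)

    def might_contain(self, addr):
        """True if addr may have been inserted; False means definitely not."""
        return all(
            (self.banks[b] >> self._index(b, addr)) & 1 for b in range(self.k)
        )

    def clear(self):
        """Reset the signature, e.g. when a transaction commits or aborts."""
        self.banks = [0] * self.k
```

An inserted address always tests positive (no false negatives), while an address that was never inserted can still test positive if its bits collide with earlier insertions; that collision is the false conflict the later slides try to reduce.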

Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion

Notary Executive Summary
Tackle 2 problems with hardware signatures:
• Problem 1: The best signature hashing (i.e., H3) has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3
  • Ex: 8x fewer gates: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, and propose a privatization interface for higher performance

Signature hash functions
• Which hash function is best? [Sanchez, YEN, MICRO’07]
  • Bit-selection? Hash simply decodes some number of input bits
  • H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits
• Result (LogTM-SE w/ 2kb signatures): H3 is better with >=2 hash functions
• However, H3 uses many multi-level XOR trees
• Can we improve this?
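
To make the two hash families concrete, here is a minimal software sketch. The function names, the seeded random masks, and the 32-bit address width are assumptions for illustration; a hardware H3 fixes its masks at design time and evaluates each output bit with an XOR tree.

```python
import random


def bit_select(addr, lo, c):
    """Bit-selection: use c consecutive address bits, starting at bit lo."""
    return (addr >> lo) & ((1 << c) - 1)


def make_h3(c, addr_bits=32, seed=0):
    """Build an H3-style hash with c output bits.

    Each output bit is the XOR (parity) of the address bits picked by a
    random mask, so on average half of the address bits feed each output.
    """
    rng = random.Random(seed)
    masks = [rng.getrandbits(addr_bits) for _ in range(c)]

    def h3(addr):
        value = 0
        for i, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1  # XOR of selected bits
            value |= parity << i
        return value

    return h3
```

With c = 10, `make_h3(10)` yields indices into a 1024-bit filter bank, which matches one bank of a 2kb signature split across two hash functions.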

H3 implementation
• Number of XOR gates: [formula not preserved in transcript]
• Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature
• Can we reduce the total gate count?

Entropy defined
• Insight: Use the most random bits for hashing
• Use entropy to measure bit randomness
• Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
  • p(x_i) = the probability of the occurrence of value x_i
  • N = number of sample values random variable x can take on
• Entropy = amount of information required on average to describe the outcome of variable x (in bits)
  • Ex: What is the best possible lossless compression?
Entropy value of an n-bit field:
• Min, 0 bits: n-bit field has a constant value with probability 1
• Max, n bits: all bit patterns in the n-bit field are equally probable
• Other cases: between 0 and n bits
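
As a small worked example of this definition (the probabilities are made up for illustration and do not come from the workloads), take a 2-bit field whose four values occur with probabilities 1/2, 1/4, 1/8, and 1/8:

```latex
\begin{align*}
H &= -\sum_{i=1}^{4} p(x_i)\,\log_2 p(x_i) \\
  &= \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 \\
  &= 1.75 \text{ bits}
\end{align*}
```

The result falls between the minimum of 0 bits (a constant field) and the maximum of 2 bits (all four patterns equally likely), so this field carries less information than its width suggests.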

Our measures of entropy
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
  • Global entropy: upper bound on entropy of address
• Q2: How does entropy change within an address?
  • Local entropy: entropy of a bit-field within the address
[Figure: address bits 31..6, showing the field used for global entropy and the field used for local entropy, offset by NSkip]

Entropy results
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
  • Smaller windows (<16b) may not reach the global entropy value
  • Larger windows (>16b) hide some fine-grain info
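
The two measurements can be sketched as below. This is a hedged reading of the slides: the helper names, the 64-byte block size (6 offset bits), and the choice to start the local window NSkip bits above the block offset are assumptions, and the input is any list of addresses rather than the real workload traces.

```python
from collections import Counter
from math import log2

BLOCK_BITS = 6  # assumption: 64-byte cache blocks, so bits 5..0 are ignored


def entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def global_entropy(addrs):
    """Entropy of the whole block address (all bits above the block offset)."""
    return entropy([a >> BLOCK_BITS for a in addrs])


def local_entropy(addrs, nskip, width=16):
    """Entropy of a width-bit window that skips nskip low block-address bits."""
    lo = BLOCK_BITS + nskip
    mask = (1 << width) - 1
    return entropy([(a >> lo) & mask for a in addrs])
```

Sweeping `nskip` from 0 to 10 and plotting `local_entropy` against `global_entropy` would reproduce the shape of the analysis described on the slide, showing where in the address the randomness lives.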

Page-Block-XOR (PBX)
• Motivated by 3 findings:
• (1) Lower-order bits have the most entropy
  • Follows from our entropy results
• (2) XORing two bit-fields produces random hash values
  • From prior work on XOR hashing (e.g., data placement in caches, DRAM)
• (3) Bit-field overlaps can lead to higher false positives
  • Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures)

PBX implementation
• For 2kb signatures with 2 hash functions:
  • 20 XOR gates for PBX vs 160 XOR gates for H3!
• PPN and cache-index fields are not tied to system params:
  • Use entropy to find two non-overlapping bit-fields with high randomness
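
A single PBX-style hash can be sketched as the XOR of two non-overlapping bit-fields. The field positions below (bits 6..15 and bits 16..25 of a 32-bit address) and the function name are assumptions chosen for illustration; the slides say the real fields are picked by entropy analysis and do not state how the second hash function selects its fields.

```python
def pbx_hash(addr, c=10, lo_index=6, lo_page=16):
    """XOR a c-bit low-order field with a c-bit higher-order field.

    lo_index and lo_page are the starting bit positions of the two fields
    (hypothetical defaults). The fields must not overlap.
    """
    assert lo_page >= lo_index + c, "bit-fields must not overlap"
    mask = (1 << c) - 1
    index_field = (addr >> lo_index) & mask  # cache-index-like bits
    page_field = (addr >> lo_page) & mask    # page-number-like bits
    return index_field ^ page_field          # one 2-input XOR per output bit
```

Each output bit costs one 2-input XOR, so two 10-bit hashes need 20 gates, which is consistent with the gate count quoted on the slide.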

Summary thus far
• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost PBX
  • Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: To be described

Privatization
• Problem: False conflicts caused by thread-private addrs
  • Conflicts are avoided if these addrs are not inserted in the thread’s signatures
Two privatization solutions:
• (1) Remove private stack references from sigs.
  • Very little work for programmer/compiler
  • Benefits depend on the fraction of stack addresses versus all transactional references
• (2) Language-level interface (e.g., private_malloc(), shared_malloc())
  • Even higher performance boost
  • WARNING: Incorrectly marking shared objects as private can lead to program errors!

Page-based implementation
• Each page is assigned a status, private or shared
  • Invariant: A page is shared if any object on it is shared
• If the stack is private, the library marks stack pages as private
• If using the privatization heap functions, mark heap pages accordingly

OS support
• OS allocates different physical page frames for shared and private pages
  • Sets a per-frame bit in the translation entry if the page is shared
  • Reduces the number of page frames used by packing objects with the same status together
• Signatures insert memory addresses only for transactional references to shared pages
  • Query the page sharing bit in the HW TLB & the current transactional status
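
The insertion filter described above can be modeled in a few lines. Everything named here is hypothetical: `Tlb`, `maybe_insert`, the 8 kB page size, and the use of a plain Python set as the signature are stand-ins so the logic can run in isolation.

```python
from dataclasses import dataclass, field

PAGE_BITS = 13   # assumption: 8 kB pages
BLOCK_BITS = 6   # assumption: 64-byte cache blocks


@dataclass
class Tlb:
    """Toy TLB that only tracks which page numbers carry the shared bit."""
    shared_pages: set = field(default_factory=set)

    def is_shared(self, addr):
        return (addr >> PAGE_BITS) in self.shared_pages


def maybe_insert(signature, tlb, addr, in_transaction):
    """Insert addr's block into the signature only if it can cause a conflict.

    References outside a transaction, or to pages marked private, are
    filtered out and never set signature bits.
    """
    if in_transaction and tlb.is_shared(addr):
        signature.add(addr >> BLOCK_BITS)
        return True
    return False
```

Filtering at insertion time keeps private addresses from setting bits at all, which is why it removes the false conflicts they would otherwise cause; it also shows why mislabeling a shared page as private is dangerous, since real conflicts on that page would go undetected.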

Methodology
• Full-system simulation (GEMS)
• Transistor-level design for area & power of XOR gates
• CACTI for Bloom filter bit array area & power
  • Linear scaling to 65nm or 90nm for area, original 400nm for power
• Single-chip CMP
  • 16 single-threaded, in-order cores
  • 32kB, 4-way private L1 I & D
  • 8MB, 8-way shared L2 cache
  • MESI directory protocol
• Signatures from 64b-64kb (8B-8kB) & “perfect”

Workloads
• Micro-benchmarks
• SPLASH-2 apps
  • Barnes & Raytrace: exert the most signature pressure
• Stanford STAMP apps
  • Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada, Intruder
• DNS server
  • BIND

PBX vs H3 area & power • Area & power overheads (2kb, k=4):

PBX vs H3 execution time
• PBX performs similarly to H3

Privatization results summary
• Removing private stack references from signatures did not help
  • Most addr references are not to the stack
  • Most likely because we run with the SPARC ISA; other ISAs (e.g., x86) would likely benefit more
• Privatization interface helps five workloads
  • The remainder either do not have private heap structures or do not have a high transactional duty cycle

Privatization interface results
• Can improve execution time

Conclusions
• Tackle 2 problems with signature designs:
  • (1) Area and power overheads of H3 hashing
    • E.g., 160 XOR gates for H3, 20 for PBX
  • (2) False conflicts due to signature bits set by private memory references
• Our solutions:
  • (1) Use entropy analysis to guide the hashing function (PBX), a low-cost alternative that performs similarly to H3
  • (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations
• Notary can be applied to non-TM uses:
  • PBX hashing can transfer directly
  • Privatization may transfer if addr filtering applies

Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion

TMProf Executive Summary
• TM exposes more parallelism than lock-based programs
  • Complex thread interactions
• How can an HTM designer understand HTM performance?
• How can a TM programmer understand TM program performance?
• TMProf: Per-processor hardware performance counters that count cumulative event frequencies & overheads in an HTM system

Critical-section Parallelism
• TM enables critical-section parallelism: more thread interleavings
[Figure: with locks, Thread 0 and Thread 1 each acquire Lock A; with TM, Thread 0 and Thread 1 each execute xact_begin]

Hard to Predict Program Performance
• TM programmers may not have mastered the intricacies of the HTM system
• Programs may run faster on a specific HTM
• Example: [figure not preserved in transcript]

Profiling with TMProf
• Allows HTM designers & TM programmers to understand HTM performance
• With TMProf: [figure not preserved in transcript]

Background on Conflicts
• Three types: RW, WR, and WW
• Analogous to WAR, RAW, and WAW dependencies in uniprocessors
Thread 0                Thread 1                Conflict
xact_begin … LD A …     xact_begin … ST A …     RW
xact_begin … ST B …     xact_begin … LD B …     WR
xact_begin … ST C …     xact_begin … ST C …     WW
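
The naming can be captured in a tiny classifier. This is a sketch under one assumption taken from the WAR/RAW/WAW analogy: the first letter names the earlier access to the block and the second letter names the later, conflicting access from another thread. The function name and the "LD"/"ST" strings are illustrative.

```python
def classify_conflict(first_access, second_access):
    """Name the conflict between two transactions touching the same block.

    first_access is the earlier access ("LD" or "ST"); second_access is the
    later access from a different thread. Two loads never conflict.
    """
    kinds = {"LD": "R", "ST": "W"}
    if first_access == "LD" and second_access == "LD":
        return None
    return kinds[first_access] + kinds[second_access]
```

For the three rows above this returns "RW", "WR", and "WW" respectively.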

Conflict Detection & Resolution
• Conflicts are detected eagerly or lazily
  • Eagerly: when requests occur
  • Lazily: at transaction commit
• Conflict resolution
  • Stall or abort on conflict
  • Choose the set of procs to take action

TMProf
• Per-processor HW counters measuring cumulative event frequencies and cumulative event overheads
• Two implementations: Base & Extended
  • Base (BaseTMProf): Breaks down HTM execution cycles into common components
  • Extended (ExtTMProf): Builds on BaseTMProf & adds HTM-specific transaction-level profiling

BaseTMProf & ExtTMProf
• BaseTMProf:
  • Total cycles = stalls + aborts + wasted_trans + useful_trans + committing + nontrans + implementation specific
  • Assumes in-order procs, but can be extended for out-of-order procs
• ExtTMProf: BaseTMProf profiling plus
  • Size of aborted transactions
  • Amount of transactional work after write-set prediction
  • HTMs may add more detailed profiling in the future
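
The BaseTMProf cycle breakdown can be modeled as a set of per-processor counters. The class and method names are hypothetical, and one detail is an assumption added for the sketch: cycles spent inside a transaction are held in a pending counter and credited to useful_trans or wasted_trans only once the transaction commits or aborts, since the outcome is not known earlier.

```python
from collections import Counter

COMPONENTS = (
    "stalls", "aborts", "wasted_trans", "useful_trans",
    "committing", "nontrans", "impl_specific",
)


class BaseTMProfCounters:
    """Software model of one processor's BaseTMProf cycle counters."""

    def __init__(self):
        self.cycles = Counter()
        self.pending_trans = 0  # assumption: in-flight transactional cycles

    def charge(self, component, n=1):
        """Add n cycles to one named component."""
        if component not in COMPONENTS:
            raise ValueError(f"unknown component: {component}")
        self.cycles[component] += n

    def charge_trans(self, n=1):
        """Add n cycles of work inside the current transaction."""
        self.pending_trans += n

    def commit(self):
        """Transaction committed: its work was useful."""
        self.cycles["useful_trans"] += self.pending_trans
        self.pending_trans = 0

    def abort(self):
        """Transaction aborted: its work was wasted."""
        self.cycles["wasted_trans"] += self.pending_trans
        self.pending_trans = 0

    def total(self):
        return sum(self.cycles.values())

    def breakdown(self):
        """Fraction of total cycles spent in each component."""
        total = self.total()
        return {c: (self.cycles[c] / total if total else 0.0) for c in COMPONENTS}
```

Because every cycle lands in exactly one component, the fractions returned by `breakdown()` sum to 1 for completed transactions, mirroring the total-cycles equation on the slide.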

Two Case Studies
• TMProf profiling of two HTMs:
  • LogTM-SE (eager conflict detection & version management, EE)
  • Approximation of Stanford’s TCC (lazy conflict detection & version management, LL)
• Examine key parameters of eager & lazy conflict detection
  • Idealize version management
• Same system parameters as Notary
  • 16-processor CMP w/ in-order, single-issue processor cores
  • Perfect signatures
  • Same workloads

EE: Different Conflict Resolutions
• Three different conflict resolutions:
  • Base, Timestamp, Hybrid
  • All use timestamps
• Base: Requestor stalls until possible deadlock
• Timestamp: Older requestors always abort younger transactions. Younger requestors are stalled by older transactions.
• Hybrid: Base, except that an RW conflict from an older writer aborts the younger reader
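
The three policies can be written as one decision function. This is a sketch of the rules as stated on the slide, with several assumptions: a smaller timestamp means an older transaction, the action strings are invented labels, and the slide does not say which transaction aborts once Base detects a possible deadlock, so that case is returned as a generic "break_deadlock".

```python
def resolve(policy, requestor_ts, holder_ts, conflict, possible_deadlock=False):
    """Decide what happens when a requestor conflicts with a holder.

    policy:   "base", "timestamp", or "hybrid"
    conflict: "RW", "WR", or "WW" (holder's access first, requestor's second)
    Smaller timestamp = older transaction (assumption).
    """
    requestor_older = requestor_ts < holder_ts

    if policy == "timestamp":
        # Older requestors win; younger requestors wait for older holders.
        return "abort_holder" if requestor_older else "stall_requestor"

    if policy == "hybrid" and conflict == "RW" and requestor_older:
        # Older writer requesting a block read by a younger transaction.
        return "abort_holder"

    if policy in ("base", "hybrid"):
        # Stall until a possible deadlock is detected; who aborts then is
        # not specified in the slide, so it is left abstract here.
        return "break_deadlock" if possible_deadlock else "stall_requestor"

    raise ValueError(f"unknown policy: {policy}")
```

Comparing the outputs for the same inputs shows where the policies diverge: an older writer hitting a younger reader stalls under Base but aborts the reader under both Timestamp and Hybrid.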

EE: Write-set Prediction
• Avoid aborts from a load-then-store pattern within a thread
• Predict & serialize on these conflicts
[Figure: timelines of threads T0, T1, and T2 issuing GetS and GetX requests, annotated with STALL and ABORT events]

Results from Conflict Resolutions
Trends:
• 1) Timestamp & Hybrid are better than Base

Timestamp & Hybrid Better than Base
• Fewer total stalls, and all “RW Requestor older” stalls are eliminated
