Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)
Motivation
• Lifetime of GPU Application Development
  [Figure: Functionality and Performance over Time, comparing Fine-Grained Locking with Transactional Memory (?)]
• Performance, e.g. N-Body with 5M bodies
  • CUDA SDK: O(n²) – 1640 s (barrier)
  • Barnes Hut: O(n log n) – 5.2 s (locks)
Are TM and GPUs Incompatible?
• GPUs different from Multi-Core CPUs
  • 1000s Concurrent Scalar Threads
  • Challenges (from TM perspective)
• Our Solution: KILO TM
  • Hardware TM for GPUs
Hardware TM for GPUs Challenge #1: SIMD Hardware
• On GPUs, scalar threads in a warp/wavefront execute in lockstep
• A Warp with 4 Scalar Threads (T0-T3):
    ...
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    ...
• Some threads end up Committed and others Aborted: Branch Divergence!
KILO TM – Solution to Challenge #1: SIMD Hardware
• Transaction Abort: Like a Loop
• Extend SIMT Stack
    ...
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    ...
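The "abort like a loop" idea can be pictured with a small Python model. This is only a sketch under stated assumptions, not the hardware mechanism: `run_warp_transaction`, the boolean `active` mask, and the rule that lockstep threads all read the same value (so one thread wins per pass) are hypothetical simplifications.

```python
def run_warp_transaction(num_threads=4):
    """Toy model: every thread in a warp runs atomic { counter += 1 }.

    Assumed simplification: all active threads read the same value in
    lockstep, so only the first one to commit in each pass validates; the
    rest abort and are re-executed, like further iterations of a loop.
    """
    memory = {"counter": 0}
    active = [True] * num_threads      # active mask for the current pass
    passes = 0
    while any(active):                 # loop until no thread needs a retry
        passes += 1
        # TxBegin: each active thread logs the value it read.
        read_log = {t: memory["counter"] for t in range(num_threads) if active[t]}
        retry = [False] * num_threads  # threads that abort in this pass
        for t in sorted(read_log):     # TxCommit, one thread at a time
            if memory["counter"] == read_log[t]:
                memory["counter"] = read_log[t] + 1   # validated: commit
            else:
                retry[t] = True                       # stale read: abort
        active = retry                 # aborted threads form the next pass
    return memory["counter"], passes
```

With four fully conflicting threads the model takes four passes, one commit per pass; transactions that touch different data would all commit in the first pass.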
Hardware TM for GPUs Challenge #2: Transaction Rollback
• CPU Core: 10s of Registers
  • Register File copied to a Checkpoint Register File @ TX Entry, restored @ TX Abort
• GPU Core (SM): 32k Registers, shared by many Warps
  • Checkpoint?
  • 2MB Total On-Chip Storage
KILO TM – Solution to Challenge #2: Transaction Rollback
• SW Register Checkpoint
  • Most TX: Registers overwritten at first use
  • TX in Barnes Hut: Checkpoint 2 registers
• Example: r2 is Overwritten by the first LD, so an Abort needs no saved copy of it
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
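One way to see why most transactions need few or no checkpointed registers is a tiny liveness check. This is a sketch, assuming straight-line code and the rule that a register must be saved only if the transaction reads its pre-transaction value and also overwrites it; `registers_to_checkpoint` and the `(dest, sources)` encoding are made up for illustration.

```python
def registers_to_checkpoint(instructions):
    """Return registers to save before a straight-line transaction.

    Each instruction is (dest, sources); dest is None for stores.
    Assumed rule: save a register only if it is read before any write
    inside the transaction and is also written inside the transaction.
    """
    written = set()
    live_in = set()      # registers read before being written in the TX
    for dest, sources in instructions:
        for reg in sources:
            if reg not in written:
                live_in.add(reg)
        if dest is not None:
            written.add(dest)
    return live_in & written


# The transaction from the slide: LD r2,[B]; ADD r2,r2,2; ST r2,[A]
slide_tx = [("r2", []), ("r2", ["r2"]), (None, ["r2"])]
# A hypothetical transaction that reads r1 before overwriting it.
counter_tx = [("r1", ["r1"]), (None, ["r1"])]

print(registers_to_checkpoint(slide_tx))    # nothing to save
print(registers_to_checkpoint(counter_tx))  # r1 must be saved
```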
Hardware TM for GPUs Challenge #3: Conflict Detection
• Existing HTMs use Cache Coherence Protocol
  • Not Available on GPUs
  • No Private Data Cache per Thread
• Signatures?
  • 1024-bit / Thread
  • 3.8MB / 30k Threads
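The signature storage figure follows from simple arithmetic. The check below assumes "30k" means 30,000 threads and 1 MB means 10^6 bytes; both are assumptions, not stated on the slide.

```python
BITS_PER_SIGNATURE = 1024      # per thread, from the slide
THREADS = 30_000               # "30k threads", taken as 30,000 (assumption)

bytes_per_signature = BITS_PER_SIGNATURE // 8
total_mb = bytes_per_signature * THREADS / 1e6   # decimal megabytes (assumption)
print(f"{bytes_per_signature} B/thread, {total_mb:.2f} MB total")
```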
Hardware TM for GPUs Challenge #4: Write Buffer
• GPU Core (SM): 1024-1536 Threads (in Warps) share one L1 Data Cache
• Fermi’s L1 Data Cache (48kB) = 384 X 128B Lines
• Problem: 384 lines / 1536 threads < 1 line per thread!
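The lines-per-thread shortfall can be reproduced directly from the slide's numbers; the variable names are illustrative only.

```python
L1_BYTES = 48 * 1024           # Fermi L1 data cache, from the slide
LINE_BYTES = 128

lines = L1_BYTES // LINE_BYTES
for threads in (1024, 1536):
    # Well under one cache line per thread in both configurations.
    print(f"{threads} threads: {lines / threads:.3f} lines per thread")
```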
KILO TM: Value-Based Conflict Detection
• Global Memory starts with A=1, B=0
• TX1 atomic {B=A+1}: Read-Log A=1, Write-Log B=2 (in Private Memory)
    TxBegin
    LD r1,[A]
    ADD r1,r1,1
    ST r1,[B]
    TxCommit
• TX2 atomic {A=B+2}: Read-Log B=0, Write-Log A=2 (in Private Memory)
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
• Self-Validation + Abort: at commit, a TX re-reads its Read-Log addresses from Global Memory and compares values; e.g. if TX1 commits B=2 first, TX2's logged B=0 no longer matches and TX2 aborts
  • Only detects existence of conflict (not identity)
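The read-log / write-log scheme on this slide can be modelled in a few lines of Python. This is a minimal software sketch, not the KILO TM hardware: the `Transaction` class and its method names are hypothetical, and commits here run one after another.

```python
class Transaction:
    """Toy transaction with private read and write logs."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}    # address -> value observed
        self.write_log = {}   # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:          # read our own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def store(self, addr, value):
        self.write_log[addr] = value        # buffered, invisible to others

    def validate(self):
        # Self-validation: re-read every logged address and compare values.
        return all(self.memory[a] == v for a, v in self.read_log.items())

    def commit(self):
        if not self.validate():
            return False                    # abort: some logged value changed
        self.memory.update(self.write_log)
        return True


memory = {"A": 1, "B": 0}
tx1, tx2 = Transaction(memory), Transaction(memory)
tx1.store("B", tx1.load("A") + 1)   # TX1: atomic {B = A + 1}
tx2.store("A", tx2.load("B") + 2)   # TX2: atomic {A = B + 2}

tx1_ok = tx1.commit()   # validates A=1, then publishes B=2
tx2_ok = tx2.commit()   # read-log holds B=0 but memory now holds B=2
```

TX2 learns only that one of its logged values changed, not which transaction changed it, which is the "existence, not identity" property on the slide.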
Parallel Validation? Data Race!?!
• Global Memory starts with A=1, B=0
• TX1 atomic {B=A+1}: Read-Log A=1, Write-Log B=2
• TX2 atomic {A=B+2}: Read-Log B=0, Write-Log A=2
• Serial outcomes: Tx1 then Tx2: A=4, B=2 OR Tx2 then Tx1: A=2, B=3
• If both validate in parallel, both pass and both commit: Global Memory ends with A=2, B=2, which matches neither serial order
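The race can be replayed with the logs from the slide. In this sketch the dictionaries `tx1` and `tx2` and the helper `validate` are illustrative names; "parallel" is modelled by validating both transactions before either one writes back.

```python
memory = {"A": 1, "B": 0}
tx1 = {"reads": {"A": 1}, "writes": {"B": 2}}   # atomic {B = A + 1}
tx2 = {"reads": {"B": 0}, "writes": {"A": 2}}   # atomic {A = B + 2}


def validate(tx):
    """Value-based check: every logged read still matches memory."""
    return all(memory[a] == v for a, v in tx["reads"].items())


# Both transactions validate before either one has written back.
both_pass = validate(tx1) and validate(tx2)
if both_pass:
    memory.update(tx1["writes"])
    memory.update(tx2["writes"])

serial_outcomes = [{"A": 4, "B": 2}, {"A": 2, "B": 3}]
is_serializable = memory in serial_outcomes
print(memory, "serializable:", is_serializable)
```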
Serialize Validation?
• Commit Unit runs Validation + Commit (V + C) against Global Memory for one TX at a time: TX1 first, while TX2 Stalls
• Benefit #1: No Data Race
• Benefit #2: No Live Lock
• Drawback: Serializes Non-Conflicting Transactions (“collateral damage”)
V = Validation, C = Commit
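Serialized validation can be sketched as a queue drained by a single commit step. This is a toy model under assumptions: `run_serialized`, `tx1_body`, and `tx2_body` are hypothetical names, and an aborted transaction is simply re-executed and put back at the end of the queue.

```python
from collections import deque


def tx1_body(mem):              # atomic {B = A + 1}
    a = mem["A"]
    return {"A": a}, {"B": a + 1}       # (read log, write log)


def tx2_body(mem):              # atomic {A = B + 2}
    b = mem["B"]
    return {"B": b}, {"A": b + 2}


def run_serialized(memory, bodies):
    """Validate and commit one transaction at a time; return abort count."""
    # Every transaction first executes speculatively against the same memory.
    queue = deque((body, *body(memory)) for body in bodies)
    aborts = 0
    while queue:
        body, reads, writes = queue.popleft()
        # V then C with no other commit in between, so no race is possible.
        if all(memory[a] == v for a, v in reads.items()):
            memory.update(writes)
        else:
            aborts += 1
            queue.append((body, *body(memory)))   # re-execute and requeue
    return aborts


memory = {"A": 1, "B": 0}
aborts = run_serialized(memory, [tx1_body, tx2_body])
print(memory, "aborts:", aborts)
```

In this run TX2 aborts once, re-executes, and the final state equals the "Tx1 then Tx2" serial order. The cost is that even transactions touching unrelated data would wait in the same queue.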
Solution: Speculative Validation
• Key Idea: Split Conflict Detection into two parts
  • Recently Committed TX: in Parallel
  • Concurrently Committing TX: in Commit Order
    • Approximate
• Conflict Rare: Good Commit Parallelism
[Figure: Commit Unit timeline against Global Memory; TX1, TX2 and TX3 overlap their V+C phases, with RS steps and one Stall. V = Validation, C = Commit]
KILO TM Implementation
• Minimal Modification to Existing GPU Arch.
  • SIMT Stacks
  • Commit Unit
  • TX Log Unit
Evaluation Methodology
• GPGPU-Sim 3.0 (BSD license)
  • Detailed: IPC Correlation of 0.93 vs GT200
  • KILO TM (Timing-Driven Memory Accesses)
• GPU TM Applications
  • Hash Table (HT-H, HT-L)
  • Bank Account (ATM)
  • Cloth Physics (CL)
  • Barnes Hut (BH)
  • CudaCuts (CC)
  • Data Mining (AP)
Performance (vs. Serializing TX)
• Higher is Better
• Serializing TX ≈ Coarse-Grained Locks
Performance (Exec. Time)
• Lower is Better
• Captures 59% of FG Lock Performance
[Figure: Normalized Exec. Time (axis 0 to 3) of Ideal TM, KILO TM and FG Lock for HT-H, HT-L, ATM, CL, BH, CC and AP]
Implementation Complexity
• Logs in Private Memory @ L1 Data Cache
• Commit Unit
  • 5kB Last Writer History Unit
  • 19kB Transaction Status
  • 32kB Read-Set and Write-Set Buffer
• CACTI 5.3 @ 40nm
  • 0.40mm² x 6 Memory Partitions
  • 0.5% of 520mm²
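The area overhead on this slide can be checked with a few lines of arithmetic. The sketch assumes the 0.40 mm² figure is per commit unit with one unit per memory partition, which is how the "x 6" reads; the dictionary keys are illustrative names.

```python
unit_storage_kb = {
    "last_writer_history": 5,
    "transaction_status": 19,
    "read_write_set_buffer": 32,
}
AREA_PER_UNIT_MM2 = 0.40   # assumed to be per commit unit
PARTITIONS = 6
DIE_MM2 = 520

storage_per_unit_kb = sum(unit_storage_kb.values())
total_area_mm2 = AREA_PER_UNIT_MM2 * PARTITIONS
overhead_pct = 100 * total_area_mm2 / DIE_MM2
print(f"{storage_per_unit_kb} kB per unit, "
      f"{total_area_mm2:.1f} mm2 total, {overhead_pct:.2f}% of die")
```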
Summary
• KILO TM: Hardware TM for GPUs
  • 1000s of Concurrent Scalar TXs
  • Handles Scalar TX Abort
  • No cache coherence protocol dependency
  • Word-level conflict detection
  • Unbounded Transaction
• 59% Fine-Grained Locking Performance
• 128X Faster than Serializing TX Execution
• 0.5% Area Overhead
Questions?
Backup Slides
ABA Problem?
• Classic Example: Linked List Based Stack
• Thread 0 – pop():
    while (true) {
        t = top;
        next = t->Next;
        // thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
[Figure: stack top -> A -> B -> C -> Null; thread 0 holds t = A and next = B; thread 2 pops A, pops B and pushes A, leaving top -> A -> C -> Null]
ABA Problem?
• atomicCAS protects only a single word
  • Only part of the data structure
• Value-based conflict detection protects all relevant parts of the data structure
    while (true) {
        t = top;
        next = t->Next;
        if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
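The difference between the two checks can be replayed sequentially in Python. This is an illustrative sketch: `Node`, `cas_top`, and the `read_log` list are hypothetical names, and the interleaving of thread 0 and thread 2 is written out by hand rather than run on real threads.

```python
class Node:
    def __init__(self, name, next=None):
        self.name = name
        self.next = next


C = Node("C")
B = Node("B", C)
A = Node("A", B)
stack = {"top": A}              # top -> A -> B -> C


def cas_top(expected, new):
    """Single-word compare-and-swap on `top` (identity comparison)."""
    if stack["top"] is expected:
        stack["top"] = new
        return True
    return False


# Thread 0 starts pop(): reads top and top->Next, then is delayed.
t = stack["top"]                # A
nxt = t.next                    # B
read_log = [("top", t), ("A.next", nxt)]   # what a transaction would log

# Thread 2 runs to completion: pop A, pop B, push A.
stack["top"] = A.next           # pop A: top -> B
stack["top"] = B.next           # pop B: top -> C
A.next = stack["top"]           # push A: A -> C
stack["top"] = A                # top -> A -> C

# Value-based validation re-reads every logged location.
current = {"top": stack["top"], "A.next": A.next}
validation_ok = all(current[key] is value for key, value in read_log)

# atomicCAS compares only `top`, which is A again, so it succeeds.
cas_ok = cas_top(t, nxt)
print("CAS succeeded:", cas_ok, "| top is now", stack["top"].name)
print("value-based validation passed:", validation_ok)
```

The CAS leaves `top` pointing at B, a node thread 2 already popped, while validating both logged reads notices that `A.next` changed from B to C.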