Hardware Transactional Memory for GPU Architectures*. Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt. University of British Columbia. *In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44).
Hardware Transactional Memory for GPU Architectures* Wilson W. L. Fung Inderpeet Singh Andrew Brownsword Tor M. Aamodt University of British Columbia *In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)
Motivation
• Lifetime of GPU Application Development
[Figure: Performance vs. Time and Functionality vs. Time, comparing Fine-Grained Locking with Transactional Memory (?)]
• E.g. N-Body with 5M bodies
  • CUDA SDK: O(n^2) – 1640 s (barrier)
  • Barnes Hut: O(n log n) – 5.2 s (locks)
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt Hardware TM for GPU Architectures
Talk Outline
• What we mean by “GPU” in this work.
• Data Synchronization on GPUs.
• What is Transactional Memory (TM)?
• TM is compatible with OpenCL.
• … but is TM compatible with GPU hardware?
• KILO TM: A Hardware TM for GPUs.
• Results
What is a GPU (in this work)?
• GPU is NVIDIA/AMD-like, Compute Accelerator
• SIMD HW + Aggressive Memory Subsystem => High Compute Throughput and Efficiency
• Non-Graphics API: OpenCL, DirectCompute, CUDA
• Programming Model: Hierarchy of scalar threads
  • Kernel, Work Group / Thread Blocks, Wavefront / Warp, Scalar Thread
• Today: Limited Communication & Synchronization
  • Shared (Local) Memory and Barrier within a work group, Global Memory across blocks
Baseline GPU Architecture
[Figure: SIMT Cores connected through an Interconnection Network to Memory Partitions. Each SIMT Core has a SIMT Front End (Fetch, Decode, Schedule, Branch; Done (Warp ID) signal), a SIMD Datapath, and a Memory Subsystem (SMem, Non-Coherent L1 D-Cache, Tex $, Const $, Icnt. Network port). Each Memory Partition has a Last-Level Cache Bank, an Atomic Op. Unit, and an Off-Chip DRAM Channel.]
Stack-Based SIMD Reconvergence (“SIMT”) (Levinthal SIGGRAPH’84, Fung MICRO’07)
[Figure: a warp of four threads (Thread 1 to Thread 4, sharing a Common PC) runs a control-flow graph whose blocks carry active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111. A per-warp stack of (Reconv. PC, Next PC, Active Mask) entries, with a TOS pointer, runs the divergent paths C and D one after the other and reconverges at E. Timeline: A, B, C, D, E, G.]
Data Synchronizations on GPUs
• Motivation
  • Solve wider range of problems on GPU
  • Data Race => Data Synchronization
• Current Solution: Atomic read-modify-write (32-bit/64-bit).
• Best Sol’n?
• Why Transactional Memory?
  • E.g. N-Body with 5M bodies (traditional sync, not TM)
    • CUDA SDK: O(n^2) – 1640 s (barrier)
    • Barnes Hut: O(n log n) – 5.2 s (atomics, harder to get right)
  • Easier to Write/Debug Efficient Algorithms
  • Practical efficiency: want the efficiency of a GPU with reasonable (not superhuman) effort and time.
Data Synchronizations on GPUs
• Deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard
[Figure: # Possible Global Lock States vs. # Locks x # Sharing Threads. Which of these states are deadlocks?!]
• Other general problems with lock-based synchronization
  • Implicit relationship between locks and objects being protected
  • Code is not composable
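Data Synchronization Problems Specific to GPUs
• Interaction between locks and SIMT control flow can cause deadlocks
Spin lock (can deadlock under SIMT: the lock holder waits at the reconvergence point for threads still spinning):
  A: while(atomicCAS(lock,0,1)==1);
  B: // Critical Section …
  C: *lock = 0;
Restructured so the lock holder releases the lock inside the loop body:
  A: done = 0;
  B: while(!done){
  C:   if(atomicCAS(lock,0,1)==0){  // old value 0 means this thread acquired the lock
  D:     // Critical Section …
  E:     *lock = 0;
  F:     done = 1;
  G:   }
  H: }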
Transactional Memory
• Program specifies atomic code blocks called transactions [Herlihy’93]
Lock Version (Potential Deadlock!):
  Lock(X[a]); Lock(X[b]); Lock(X[c]);
  X[c] = X[a]+X[b];
  Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);
TM Version:
  atomic { X[c] = X[a]+X[b]; }
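The lock version can deadlock when two threads take the same locks in different orders, which happens easily when a, b and c are data dependent. A common workaround, which is not shown on the slide, is to acquire the locks in one canonical order. The sketch below is host-side C++ rather than GPU code, and `lock_order` and `locked_add` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: the distinct indices among a, b, c in ascending order.
// Acquiring locks in this canonical order rules out circular waits.
std::vector<std::size_t> lock_order(std::size_t a, std::size_t b, std::size_t c) {
    std::vector<std::size_t> idx{a, b, c};
    std::sort(idx.begin(), idx.end());
    idx.erase(std::unique(idx.begin(), idx.end()), idx.end());
    return idx;
}

// X[c] = X[a] + X[b] under per-element locks (one mutex per element of X).
void locked_add(std::vector<int>& X, std::vector<std::mutex>& locks,
                std::size_t a, std::size_t b, std::size_t c) {
    const std::vector<std::size_t> order = lock_order(a, b, c);
    for (std::size_t i : order) locks[i].lock();
    X[c] = X[a] + X[b];
    // Release in reverse acquisition order.
    for (auto it = order.rbegin(); it != order.rend(); ++it) locks[*it].unlock();
}
```

The bookkeeping needed just to stay deadlock-free is what the one-line `atomic { ... }` version removes from the programmer's plate.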
Transactional Memory
• Programmers’ View: transactions execute one after another in some order (TX1 then TX2, OR TX2 then TX1)
• Non-conflicting transactions may run in parallel
• Conflicting transactions automatically serialized
[Figure: Memory locations A to D. When TX1 and TX2 touch different locations, both Commit. When they conflict, TX1 Commits, TX2 Aborts, and TX2 Commits on re-execution.]
Transactional Memory
• Each transaction has 3 phases
  • Execution
    • Track all memory accesses (Read-Set and Write-Set)
  • Validation
    • Detect any conflicting accesses between transactions
    • Resolve conflict if needed (abort/stall)
  • Commit
    • Update global memory
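The three phases can be sketched in a few lines of host-side C++. This is a generic illustration that detects conflicts by intersecting address sets, one textbook option and not necessarily what KILO TM does. The names `Tx`, `conflicts` and `commit`, and the use of integer addresses, are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <set>

using Addr = int;
using Memory = std::map<Addr, int>;

struct Tx {
    std::set<Addr> read_set;        // addresses loaded during execution
    std::map<Addr, int> write_set;  // buffered stores: address -> new value

    // Execution phase: loads see the transaction's own buffered stores first.
    int load(const Memory& mem, Addr a) {
        auto w = write_set.find(a);
        if (w != write_set.end()) return w->second;
        read_set.insert(a);
        auto m = mem.find(a);
        return m == mem.end() ? 0 : m->second;
    }
    // Stores are buffered and stay invisible to others until commit.
    void store(Addr a, int v) { write_set[a] = v; }
};

// Validation phase: `tx` conflicts with `other` if `other` writes
// any address that `tx` has read or written.
bool conflicts(const Tx& tx, const Tx& other) {
    for (const auto& w : other.write_set) {
        if (tx.read_set.count(w.first) || tx.write_set.count(w.first)) return true;
    }
    return false;
}

// Commit phase: publish the buffered stores to global memory.
void commit(Memory& mem, const Tx& tx) {
    for (const auto& w : tx.write_set) mem[w.first] = w.second;
}
```

A transaction that fails validation would be aborted (its sets discarded and its body re-executed) or stalled until the other transaction has committed.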
Transactional Memory on OpenCL
• A natural extension to OpenCL Programming Model
• Program can launch many more threads than the hardware can execute concurrently
• GPU-TM? Current threads running transactions do not need to wait for future unscheduled threads
[Figure label: GPU HW]
Are TM and GPUs Incompatible?
The problem with GPUs (from TM perspective):
• 1000s of concurrent threads
• Inter-thread spatial locality common
• No cache coherence
• No private cache for each thread (Buffering?)
• Tx Abort => Control flow divergence
Hardware TM for GPUs Challenge: Conflict Detection
• Scalable Coherence: accesses tracked in a Private Data Cache per thread, conflicts found through bus invalidations
  • No coherence on GPUs? Each scalar thread needs own cache?
• Signature: 1024-bit Signature/Thread => 3.8MB / 30k Threads
[Figure: transactions TX1 to TX4 with accesses R(A), W(C); R(D); R(A); R(C), W(B). A bus invalidation of C (Inv C) flags Conflict!]
Hardware TM for GPUs Challenge: Transaction Rollback
• CPU Core: 10s of Registers; Checkpoint Register File @ TX Entry, restore @ TX Abort
• GPU Core (SM): 32k Registers shared by many warps. Checkpoint?
• 2MB Total On-Chip Storage
Hardware TM for GPUs Challenge: Access Granularity and Write Buffer
• CPU Core: 1-2 Threads, 32kB L1 Data Cache
• GPU Core (SM): 1024-1536 Threads; Fermi’s L1 Data Cache (48kB) = 384 X 128B Lines
• Problem: 384 lines / 1536 threads < 1 line per thread!
[Figure: warps buffer TX writes in the L1 Data Cache until Commit to Global Memory]
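The arithmetic behind the "less than one line per thread" observation, worked out under the assumption that 1 kB means 1024 B:

```latex
\begin{align*}
\text{L1 lines} &= \frac{48 \times 1024\ \text{B}}{128\ \text{B/line}} = 384 \\
\text{lines per thread (1536 threads)} &= \frac{384}{1536} = 0.25
  \;\Rightarrow\; 0.25 \times 128\ \text{B} = 32\ \text{B per thread} \\
\text{lines per thread (1024 threads)} &= \frac{384}{1024} = 0.375
  \;\Rightarrow\; 0.375 \times 128\ \text{B} = 48\ \text{B per thread}
\end{align*}
```

Even at the lower thread count, a cache-line-granularity write buffer held in L1 would leave each thread with well under one line.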
Hardware TM on GPUs Challenge: SIMT Hardware
• On GPUs, scalar threads in a warp/wavefront execute in lockstep
  ...
  TxBegin
  LD r2,[B]
  ADD r2,r2,2
  ST r2,[A]
  TxCommit
  ...
[Figure: A Warp with 8 Scalar Threads reaches TxCommit; some threads are Committed, others Aborted. Reconvergence?]
Goal • We take it as a given that most programmers trying lock based programming on a GPU will give up before they manage to get their application working. • Hence, our goal was to find the most efficient approach to implement TM on GPU. Hardware TM for GPU Architectures
KILO TM • Supports 1000s of concurrent transactions • Transaction-aware SIMT stack • No cache coherence protocol dependency • Word-level conflict detection • Captures 59% of FG Lock Performance • 128X Faster than Serialized Tx Exec. Hardware TM for GPU Architectures
KILO TM: Design Highlights • Value-Based Conflict Detection • Self-Validation + Abort: Simple Communication • No Cache Coherence Dependence • Speculative Validation • Increase Commit Parallelism Hardware TM for GPU Architectures
High Level GPU Architecture+ KILO TM Implementation Overview Hardware TM for GPU Architectures
KILO TM: SIMT Core Changes
• SW Register Checkpoint
  • Observation: Most overwritten registers not used
  • Compiler analysis can identify what to checkpoint
• Transaction Abort
  • ~ Do-While Loop
  • Extend SIMT Stack with special entries to track aborted transactions in each warp
  TxBegin
  LD r2,[B]
  ADD r2,r2,2
  ST r2,[A]
  TxCommit
[Figure labels: Overwritten, Abort]
Transaction-Aware SIMT Stack
Code (implicit loop when abort):
  A: t = tid.x;
     if (…) {
  B:   tx_begin;
  C:   x[t%10] = y[t] + 1;
  D:   if (s[t])
  E:     y[t] = 0;
  F:   tx_commit;
  G:   z = y[t];
     }
  H: w = y[t+1];
Stack entries are listed bottom to top. Type: N = normal, R = retry, T = transaction.
@ tx_begin (Active Mask copied into the new T entry):
  Type PC RPC Active Mask
  N    H  --  1111 1111
  N    B  H   1111 0011
  R    C  --  0000 0000
  T    C  --  1111 0011  <- TOS
Branch Divergence within Tx:
  Type PC RPC Active Mask
  N    H  --  1111 1111
  N    B  H   1111 0011
  R    C  --  0000 0000
  T    F  --  1111 0011
  N    E  F   0001 0011  <- TOS
@ tx_commit, thread 6 & 7 failed validation:
  Type PC RPC Active Mask
  N    H  --  1111 1111
  N    B  H   1111 0011
  R    C  --  0000 0011
  T    F  --  0000 0000
@ tx_commit, restart Tx for thread 6 & 7 (Active Mask + PC copied from the R entry):
  Type PC RPC Active Mask
  N    H  --  1111 1111
  N    B  H   1111 0011
  R    C  --  0000 0000
  T    C  --  0000 0011  <- TOS
@ tx_commit, all threads with Tx committed:
  Type PC RPC Active Mask
  N    H  --  1111 1111
  N    G  H   1111 0011
  R    C  --  0000 0000
KILO TM: Value-Based Conflict Detection
• Self-Validation + Abort:
  • Only detects existence of conflict (not identity) => No Tx to Tx Msg – Simple Communication
TX1 atomic{B=A+1}: TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit
  Private Memory: Read-Log A=1, Write-Log B=2
TX2 atomic{A=B+2}: TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit
  Private Memory: Read-Log B=0, Write-Log A=2
Global Memory: A=1, B=0 initially; B=2 once TX1 commits, so the B=0 in TX2’s Read-Log no longer matches
Parallel Validation? Data Race!?!
• Init: A=1, B=0
• Tx1 then Tx2: B=2, A=4
• Tx2 then Tx1: A=2, B=3
• TX1 atomic{B=A+1}: Read-Log A=1, Write-Log B=2
• TX2 atomic{A=B+2}: Read-Log B=0, Write-Log A=2
• If both validate against Global Memory (A=1, B=0) in parallel, both pass and both commit: A=2, B=2, which matches neither serial order
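The TX1/TX2 example can be replayed in a small host-side C++ model of value-based validation. It is a sketch under simplifying assumptions: memory is a map from names to integers, each transaction reads one location and writes one, and `execute`, `validate`, `commit` and the two `run_*` drivers are hypothetical names rather than KILO TM interfaces.

```cpp
#include <cassert>
#include <map>
#include <string>

using Memory = std::map<std::string, int>;

struct Tx {
    std::map<std::string, int> read_log;   // address -> value observed
    std::map<std::string, int> write_log;  // address -> value to store
};

// Hypothetical transaction body: dst = src + k, logging instead of storing.
Tx execute(const Memory& mem, const std::string& src, const std::string& dst, int k) {
    Tx tx;
    const int v = mem.at(src);
    tx.read_log[src] = v;
    tx.write_log[dst] = v + k;
    return tx;
}

// Self-validation: every logged read must still match global memory.
bool validate(const Memory& mem, const Tx& tx) {
    for (const auto& r : tx.read_log)
        if (mem.at(r.first) != r.second) return false;
    return true;
}

void commit(Memory& mem, const Tx& tx) {
    for (const auto& w : tx.write_log) mem[w.first] = w.second;
}

// Validation and commit happen one transaction at a time; TX2 retries on failure.
Memory run_serialized(Memory mem) {
    Tx tx1 = execute(mem, "A", "B", 1);
    Tx tx2 = execute(mem, "B", "A", 2);
    if (validate(mem, tx1)) commit(mem, tx1);
    while (!validate(mem, tx2)) tx2 = execute(mem, "B", "A", 2);  // abort + re-execute
    commit(mem, tx2);
    return mem;
}

// Racy variant: both transactions validate before either one commits.
Memory run_parallel_validation(Memory mem) {
    Tx tx1 = execute(mem, "A", "B", 1);
    Tx tx2 = execute(mem, "B", "A", 2);
    const bool ok1 = validate(mem, tx1);
    const bool ok2 = validate(mem, tx2);
    if (ok1) commit(mem, tx1);
    if (ok2) commit(mem, tx2);
    return mem;
}
```

Starting from A=1, B=0, `run_serialized` ends with A=4, B=2 (the "Tx1 then Tx2" outcome), while `run_parallel_validation` ends with A=2, B=2, a state that no serial order produces.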
Serialize Validation?
• The Commit Unit validates and commits (V + C) one transaction at a time against Global Memory; TX2 Stalls until TX1 is done
• Benefit #1: No Data Race
• Benefit #2: No Live Lock (generic lazy TM prob.)
• Drawback: Serializes Non-Conflicting Transactions (“collateral damage”)
Identifying Non-conflicting Tx: Step 1: Leverage Parallelism
[Figure: each Global Memory Partition has its own Commit Unit; TX1, TX2 and TX3 are handled by the Commit Units of the partitions they access.]
Solution: Speculative Validation • Key Idea: Split Validation into two parts • Part 1: Check recently committed transactions • Part 2: Check concurrently committing transactions Hardware TM for GPU Architectures
KILO TM: Speculative Validation
• Memory subsystem is deeply pipelined and highly parallel
• Commit Unit stages at each Global Memory Partition: Log Transfer (Read-Log, Write-Log), Validation Queue, Spec. Validation with Hazard Detection, Validation Wait, Finalize Outcome, Commit
• Example in flight: TX1 R(C),W(D); TX2 R(A),W(B); TX3 R(D),W(E)
KILO TM: Speculative Validation
• Example in flight: TX1 R(C),W(D); TX2 R(A),W(B); TX3 R(D),W(E)
• Last Writer History: Recency Lookup Table (Addr, CID), with evicted entries going to a Bloom Filter
  • Recorded writers: D by TX1, B by TX2, E by TX3
  • Lookups: C? Nil; A? Nil; D? TX1
• TX3 reads D while TX1’s write to D is still pending, so Hazard Detection makes TX3 STALL in Validation Wait
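A rough host-side C++ model of the hazard check on this slide: each incoming transaction looks up the addresses it read, and if an earlier transaction still in flight wrote one of them, it has to wait for that writer. This is a simplification under stated assumptions. An unbounded `std::unordered_map` stands in for the Recency Lookup Table plus Bloom Filter, commit IDs are plain integers, and `LastWriterHistory`, `admit` and `retire` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct TxAccess {
    int cid;                          // commit ID: arrival order at the commit unit
    std::vector<std::string> reads;   // addresses in the read-log
    std::vector<std::string> writes;  // addresses in the write-log
};

// Simplified last-writer history: address -> commit ID of the latest in-flight writer.
class LastWriterHistory {
public:
    // Returns the commit ID of the youngest pending writer that `tx` must wait for,
    // or -1 (Nil) if none of its reads hit a pending write.
    int admit(const TxAccess& tx) {
        int wait_for = -1;
        for (const auto& a : tx.reads) {
            auto it = last_writer_.find(a);
            if (it != last_writer_.end() && it->second > wait_for) wait_for = it->second;
        }
        for (const auto& a : tx.writes) last_writer_[a] = tx.cid;
        return wait_for;
    }

    // Drop the entries of a transaction whose outcome is final.
    void retire(int cid) {
        for (auto it = last_writer_.begin(); it != last_writer_.end();) {
            if (it->second == cid) it = last_writer_.erase(it);
            else ++it;
        }
    }

private:
    std::unordered_map<std::string, int> last_writer_;
};
```

Feeding in TX1 R(C),W(D), TX2 R(A),W(B) and TX3 R(D),W(E) in that order yields Nil, Nil and TX1, matching the lookups on the slide.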
Log Storage
• Transaction logs are stored at the private memory of each thread
• Located in DRAM, cached in L1 and L2 caches
• Each Wavefront keeps a Read-Log Ptr and a Write-Log Ptr
[Figure: T0’s view of private memory vs. Consecutive physical address. Log entries (Address, Value) recorded by threads T0 to T3 for the sequence LD, ST, LD.]
Log Transfer
• Entries heading to same memory partition can be grouped into a larger packet
[Figure: log entries between the Read-Log Ptr and Write-Log Ptr are sorted by destination partition and sent as Packets to Commit Units, e.g. (A, 3) and (C, 9) to Commit Unit 0, (B, 4) to Commit Unit 1, (D, 7) to Commit Unit 2.]
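Grouping log entries by destination partition can be sketched as follows. The address-to-partition mapping here (128-byte lines interleaved across partitions) is an assumption made for the example, not something the slide specifies, and `LogEntry`, `partition_of` and `group_by_partition` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct LogEntry {
    std::uint32_t addr;   // byte address that was read or written
    std::uint32_t value;  // value observed (read-log) or to be stored (write-log)
};

// Assumed mapping: consecutive 128-byte lines rotate across the partitions.
std::uint32_t partition_of(std::uint32_t addr, std::uint32_t num_partitions) {
    const std::uint32_t kLineBytes = 128;
    return (addr / kLineBytes) % num_partitions;
}

// One packet per partition, holding all entries bound for that commit unit
// in their original log order.
std::map<std::uint32_t, std::vector<LogEntry>>
group_by_partition(const std::vector<LogEntry>& log, std::uint32_t num_partitions) {
    std::map<std::uint32_t, std::vector<LogEntry>> packets;
    for (const LogEntry& e : log) {
        packets[partition_of(e.addr, num_partitions)].push_back(e);
    }
    return packets;
}
```

With three partitions and entries at addresses 0, 128, 384 and 256, the entries at 0 and 384 share one packet for partition 0, while the other two travel alone.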
Distributed Commit / HW Org. Hardware TM for GPU Architectures
ABA Problem?
• Classic Example: Linked List Based Stack
• Thread 0 – pop():
  while (true) {
    t = top;
    next = t->Next;
    // thread 2: pop A, pop B, push A
    if (atomicCAS(&top, t, next) == t) break; // succeeds!
  }
[Figure: the stack starts as top -> A -> B -> C -> Null. Thread 0 reads t = A and next = B. Thread 2 then pops A, pops B and pushes A, leaving top -> A -> C -> Null. Thread 0’s atomicCAS still sees top == A and succeeds, setting top to the already-popped B.]
ABA Problem?
• atomicCAS protects only a single word
  • Only part of the data structure
• Value-based conflict detection protects all relevant parts of the data structure
  while (true) {
    t = top;
    next = t->Next;
    if (atomicCAS(&top, t, next) == t) break; // succeeds!
  }
[Figure: stack top -> A -> B -> C -> Null]
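The interleaving can be replayed deterministically on the host. The sketch below uses `std::atomic` in place of CUDA's `atomicCAS` and runs both "threads" sequentially in one function, so it is an illustration of the hazard and not GPU code. `Node`, `AbaOutcome` and `simulate_aba` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    char name;
    Node* next;
};

struct AbaOutcome {
    bool cas_succeeded;       // did the single-word CAS on `top` succeed?
    char top_after_cas;       // node that `top` points to afterwards
    bool value_check_passed;  // would re-checking every value read have passed?
};

AbaOutcome simulate_aba() {
    Node c{'C', nullptr}, b{'B', &c}, a{'A', &b};
    std::atomic<Node*> top{&a};

    // Thread 0 begins pop(): it reads top and top->next.
    Node* t = top.load();
    Node* next = t->next;  // B

    // Thread 2 interleaves: pop A, pop B, push A.
    top.store(a.next);     // pop A: top = B
    top.store(b.next);     // pop B: top = C
    a.next = top.load();   // push A: A.next = C
    top.store(&a);         //         top = A

    // Value-based validation compares every value thread 0 read, not just `top`.
    const bool value_ok = (top.load() == t) && (t->next == next);

    // Thread 0 resumes: the CAS inspects only the single word `top`.
    Node* expected = t;
    const bool cas_ok = top.compare_exchange_strong(expected, next);

    return {cas_ok, top.load()->name, value_ok};
}
```

The CAS succeeds and leaves `top` pointing at B, which is no longer in the stack, whereas the value check fails because `A.next` changed from B to C between the read and the commit attempt.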
Evaluation Methodology
• GPGPU-Sim 3.0 (BSD license)
  • Detailed: IPC Correlation of 0.93 vs GT 200
  • KILO TM (Timing-Driven Memory Accesses)
• GPU TM Applications
  • Hash Table (HT-H, HT-L)
  • Bank Account (ATM)
  • Cloth Physics (CL)
  • Barnes Hut (BH)
  • CudaCuts (CC)
  • Data Mining (AP)
GPGPU-Sim 3.0.x running SASS (decuda)
• 0.976 correlation on the subset of the CUDA SDK that decuda correctly disassembles
• Note: Rest of data uses PTX instead of SASS (0.93 correlation)
• (We believe GPGPU-Sim is a reasonable proxy.)
Performance (vs. Serializing Tx) Hardware TM for GPU Architectures
Absolute Performance (IPC)
• TM on GPU performs well for applications with low contention.
• Performs poorly with memory divergence, low parallelism, high conflict rate (tackle through alg. design/tuning?)
• CPU vs GPU?
  • CC: FG-Lock version 400X faster than its CPU version
  • BH: FG-Lock version 2.5X faster than its CPU version
[Figure: IPC per benchmark]
Performance (Exec. Time) • Captures 59% of FG Lock Performance • 128X Faster than Serialized Tx Exec. Hardware TM for GPU Architectures
KILO TM Scaling Hardware TM for GPU Architectures
Abort Commit Ratio
• Increasing number of TXs => increase probability of conflict
• Two possible solutions (future work):
  • Solution 1: Application performance tuning (easier with TM vs. FG Lock)
  • Solution 2: Transaction scheduling
Thread Cycle Breakdown • Status of a thread at each cycle • Categories: • TC: In a warp stalled by concurrency control • TO: In a warp committing its transactions • TW: Have passed commit, and waiting for other threads in the warp to pass • TA: Executing an eventually aborted transaction • TU: Executing an eventually committed transaction (Useful work) • AT: Acquiring a lock or doing an Atomic Operation • BA: Waiting at a Barrier • NL: Doing non-transactional (Normal) work Hardware TM for GPU Architectures
Thread Cycle Breakdown
[Figure: per-benchmark bars for HT-H, HT-L, ATM, CL, BH, CC and AP, comparing the FGL, KL, KL-UC and IDEAL configurations]
Core Cycle Breakdown • Action performed by a core at each cycle • Categories: • EXEC: Issuing a warp for execution • STALL: Stalled by a downstream warp • SCRB: All warps blocked by the scoreboard, due to data hazards, concurrency control, pending commits (or any combination thereof) • IDLE: None of the warps are ready in the instruction buffer. Hardware TM for GPU Architectures
Core Cycle Breakdown
[Figure: per-benchmark bars comparing the FGL, KL, KL-UC and IDEAL configurations]
Read-Write Buffer Usage Hardware TM for GPU Architectures
# In-Flight Buffers Hardware TM for GPU Architectures
Implementation Complexity
• Logs in Private Memory @ L1 Data Cache
• Commit Unit
  • 5kB Last Writer History Unit
  • 19kB Transaction Status
  • 32kB Read-Set and Write-Set Buffer
• CACTI 5.3 @ 40nm
  • 0.40mm^2 x 6 Memory Partitions
  • 0.5% of 520mm^2
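A quick check of the area figure, plus the per-commit-unit storage total. The total assumes the three listed structures account for all commit-unit storage, which the slide does not state explicitly:

```latex
\begin{align*}
\text{storage per commit unit} &= 5 + 19 + 32 = 56\ \text{kB} \\
\text{total added area} &= 6 \times 0.40\ \text{mm}^2 = 2.40\ \text{mm}^2 \\
\text{fraction of die} &= \frac{2.40\ \text{mm}^2}{520\ \text{mm}^2} \approx 0.0046 \approx 0.5\%
\end{align*}
```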