Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
GPU Architecture (based on Tesla Architecture)
• SM – Streaming Multiprocessor
• SP – Scalar Processor
• SIMT – Single Instruction Multiple Thread
SM Architecture (based on Tesla Architecture)
• Fetch Mechanism
  • Fetch 1 instruction for the selected warp
  • Stall fetch for a warp when it executes a load/store or when it encounters a branch
• Scheduler Policy
  • Oldest first and in-order (within warp)
• Caches
  • I-Cache, Shared Memory, Constant Cache and Texture Cache
Handling Multiple Memory Requests
• MSHR/Memory Request Queue
  • Allows merging of memory requests (intra-core)
• DRAM Controller
  • Allows merging of memory requests (inter-core)
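As a rough illustration of intra-core merging, the sketch below models an MSHR as a table keyed by cache-line address: a request to a line that is already outstanding is merged into the existing entry instead of generating new memory traffic. The class name, the method names, and the 64-byte line size are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed line size, for illustration only


class Mshr:
    """Toy miss-status holding register table (hypothetical model)."""

    def __init__(self):
        self.pending = defaultdict(list)  # line address -> waiting warp ids
        self.sent = 0    # requests forwarded to memory
        self.merged = 0  # requests absorbed by an outstanding entry

    def request(self, warp_id, addr):
        line = addr // LINE_BYTES
        if line in self.pending:
            self.merged += 1  # same line already in flight: merge
        else:
            self.sent += 1    # first request for this line goes to memory
        self.pending[line].append(warp_id)

    def fill(self, line):
        """Data returned: release every warp waiting on this line."""
        return self.pending.pop(line, [])


mshr = Mshr()
for warp_id, addr in [(0, 128), (1, 130), (2, 4096), (3, 128)]:
    mshr.request(warp_id, addr)
print(mshr.sent, mshr.merged)  # 2 2
```

Warps 0, 1 and 3 all touch line 2, so only one of their requests leaves the core; the other two are merged.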
Code Example - Intra-Core Merging
• From MonteCarlo in CUDA SDK

    for(iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
        …
        for(int i = iSum; i < pathN; i += SUM_N) {
            // i is derived from threadIdx.x only; blockIdx.x does not appear
            real r = d_Samples[i];
            real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
            sumCall.Expected += callValue;
            sumCall.Confidence += callValue * callValue;
        }
        …
    }

Notation: A_{X,Y}, where X – Block Id, Y – Thread Id
iSum_{0,2} = 2, iSum_{1,2} = 2, iSum_{2,2} = 2
i_{0,2} = 2, i_{1,2} = 2, i_{2,2} = 2
r_{0,2} = r_{1,2} = r_{2,2} = d_Samples[2]
• Multiple blocks are assigned to the same SM
• Threads with corresponding IDs in different blocks access the same memory locations
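To see why these accesses can merge, the following Python sketch replays the index arithmetic of the two loops for thread 2 of several blocks. The sizes (BLOCK_DIM, SUM_N, PATH_N, NUM_BLOCKS) are small made-up values and the helper name is hypothetical. The only point is that the set of d_Samples indices depends on the thread ID and not on the block ID.

```python
def sample_indices(thread_id, block_dim, sum_n, path_n):
    """Indices of d_Samples read by one thread, mirroring the CUDA loops."""
    indices = []
    for i_sum in range(thread_id, sum_n, block_dim):
        for i in range(i_sum, path_n, sum_n):
            indices.append(i)
    return indices


# Made-up sizes for illustration (not the SDK's actual values).
BLOCK_DIM, SUM_N, PATH_N, NUM_BLOCKS = 4, 8, 32, 3

per_block = {
    block_id: sample_indices(2, BLOCK_DIM, SUM_N, PATH_N)
    for block_id in range(NUM_BLOCKS)
}
# The block ID never enters the index computation, so thread 2 of every
# block reads the same locations, starting with d_Samples[2].
print(per_block[0])  # [2, 10, 18, 26, 6, 14, 22, 30]
```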
Why look at Fetch?
• Allows implicit control over resources allocated to a warp
• Can control progress of a warp
• Can boost performance by fetching more for critical warps
• Implicit resource control within a core
Why look at DRAM Scheduling?
• Memory system is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
By controlling Fetch and DRAM Scheduling, we can control performance
How is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
Fetch Policies
• Round Robin (RR) [default in Tesla architecture]
• FAIR
  • Ensures uniform progress of all warps
• ICOUNT [Tullsen’96]
  • Same as ICOUNT in SMT
  • Tries to increase throughput by giving priority to fast-moving threads
• Least Recently Fetched (LRF)
  • Prevents starvation of warps
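The four policies can be read as four different selection keys over the set of ready warps. The sketch below is one possible interpretation: the slides give only one-line descriptions, so the Warp fields, the function names, and the exact keys are assumptions. In particular, FAIR is modeled as "fewest instructions fetched so far" and ICOUNT as "fewest instructions currently in the pipeline".

```python
from dataclasses import dataclass


@dataclass
class Warp:
    wid: int
    fetched: int = 0      # instructions fetched so far
    in_flight: int = 0    # fetched but not yet retired
    last_fetch: int = -1  # cycle of the most recent fetch


def pick_rr(warps, last_wid):
    """Round robin: next warp id after the one fetched last."""
    ordered = sorted(warps, key=lambda w: w.wid)
    later = [w for w in ordered if w.wid > last_wid]
    return (later or ordered)[0]


def pick_fair(warps):
    """FAIR (assumed): warp that has fetched the fewest instructions."""
    return min(warps, key=lambda w: (w.fetched, w.wid))


def pick_icount(warps):
    """ICOUNT (assumed): warp with the fewest instructions in the pipeline."""
    return min(warps, key=lambda w: (w.in_flight, w.wid))


def pick_lrf(warps):
    """LRF: warp whose most recent fetch is the oldest."""
    return min(warps, key=lambda w: (w.last_fetch, w.wid))


warps = [
    Warp(0, fetched=10, in_flight=3, last_fetch=7),
    Warp(1, fetched=4, in_flight=5, last_fetch=9),
    Warp(2, fetched=8, in_flight=1, last_fetch=2),
]
print(pick_rr(warps, last_wid=2).wid,
      pick_fair(warps).wid,
      pick_icount(warps).wid,
      pick_lrf(warps).wid)  # 0 1 2 2
```

Ties are broken by warp id here purely to keep the example deterministic.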
New Oracle Based Fetch Policies
• ALL
  • Gives priority to longer warps (total length until termination)
  • Ensures all warps finish at the same time; this results in higher occupancy
Priorities: warp 0 > warp 1 > warp 2 > warp 3
New Oracle Based Fetch Policies
• BAR
  • Gives priority to warps with a greater number of instructions to the next barrier
  • Idea is to reduce wait time at barriers
Priorities: warp 0 > warp 1 > warp 2 > warp 3
Priorities: warp 2 > warp 1 > warp 0 > warp 3
New Oracle Based Fetch Policies
• MEM_BAR
  • Similar to BAR but gives higher priority to warps with more memory instructions
Priorities: warp 0 > warp 2 > warp 1 = warp 3
Priorities: warp 1 > warp 0 = warp 2 > warp 3
Priority(Wa) > Priority(Wb)
  if MemInst(Wa) > MemInst(Wb), or
  if MemInst(Wa) = MemInst(Wb) AND Inst(Wa) > Inst(Wb)
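The three oracle policies differ only in the quantity used to rank warps, which the sketch below expresses as sort keys. The MEM_BAR rule maps directly onto a lexicographic comparison of (memory instructions, instructions). The WarpOracle fields, the instruction counts, and the tie-break by warp id are made up for illustration; the sketch also assumes that MEM_BAR counts memory instructions up to the next barrier, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass
class WarpOracle:
    wid: int
    to_end: int          # instructions remaining until termination
    to_barrier: int      # instructions remaining until the next barrier
    mem_to_barrier: int  # memory instructions until the next barrier (assumed)


def rank(warps, key):
    """Warp ids from highest to lowest priority (ties broken by lower id)."""
    ordered = sorted(warps, key=lambda w: (key(w), -w.wid), reverse=True)
    return [w.wid for w in ordered]


def all_key(w):
    return w.to_end


def bar_key(w):
    return w.to_barrier


def mem_bar_key(w):
    # More memory instructions wins; instruction count breaks ties.
    return (w.mem_to_barrier, w.to_barrier)


warps = [
    WarpOracle(0, to_end=900, to_barrier=40, mem_to_barrier=4),
    WarpOracle(1, to_end=700, to_barrier=60, mem_to_barrier=2),
    WarpOracle(2, to_end=500, to_barrier=80, mem_to_barrier=4),
    WarpOracle(3, to_end=300, to_barrier=20, mem_to_barrier=1),
]
print(rank(warps, all_key))      # [0, 1, 2, 3]
print(rank(warps, bar_key))      # [2, 1, 0, 3]
print(rank(warps, mem_bar_key))  # [2, 0, 1, 3]
```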
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner’00]
• FR_FAIR (new policy)
  • Row hit with fairness
  • Ensures uniform progress of warps
• REM_INST (new Oracle based policy)
  • Row hit with priority for warps with a greater number of instructions remaining until termination
  • Prioritizes longer warps
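The difference between the two baseline policies can be sketched with a single-bank model: FCFS always services the oldest request, while FRFCFS first looks for the oldest request that hits the currently open row. The function name, the request format, and the assumption that all requests are already queued are simplifications for illustration. FR_FAIR and REM_INST are not modeled because the slides do not specify their selection rules in enough detail.

```python
def schedule(requests, policy):
    """Service order and row-buffer hits for one DRAM bank.

    requests: list of (warp_id, row) in arrival order, all already queued.
    policy:   "FCFS" or "FRFCFS".
    """
    queue = list(requests)
    open_row = None
    order, hits = [], 0
    while queue:
        pick = 0  # oldest request by default
        if policy == "FRFCFS":
            for idx, (_, row) in enumerate(queue):
                if row == open_row:
                    pick = idx  # oldest request that hits the open row
                    break
        warp_id, row = queue.pop(pick)
        if row == open_row:
            hits += 1
        open_row = row
        order.append(warp_id)
    return order, hits


reqs = [(0, 5), (1, 9), (2, 5), (3, 9), (0, 5)]
print(schedule(reqs, "FCFS"))    # ([0, 1, 2, 3, 0], 0)
print(schedule(reqs, "FRFCFS"))  # ([0, 2, 0, 1, 3], 3)
```

With this request stream FCFS alternates between rows 5 and 9 and never hits the open row, while FRFCFS drains row 5 first and gets three hits, at the cost of delaying warps 1 and 3.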
Experimental Setup
• Simulated GPU Architecture
  • 8 SMs
  • Frontend: 1 wide, 1KB I-Cache, branch stall
  • Execution: 8-wide SIMD execution unit, in-order (IO) scheduling, 4-cycle latency for most instructions
  • Caches: 64KB software-managed cache, 8 load accesses/cycle
  • Memory: 32B-wide bus, 8 DRAM banks
  • RR fetch, FRFCFS DRAM scheduling (baseline)
• Trace-driven, cycle-accurate simulator
• Per-warp traces generated using GPU Ocelot [Kerr’09]
Benchmarks
• Taken from
  • CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
  • PARBOIL [UIUC’09] – MRI-Q, MRI-FHD, CP, PNS
  • RODINIA [Che’09] – Leukocyte, Cell, Needle
• Classification based on lengths of warps
  • Symmetric, if <= 2% divergence
  • Asymmetric, otherwise (results included in paper)
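The classification rule can be sketched as a threshold test on warp lengths. The slides do not define how divergence is measured, so the metric below (spread between the longest and shortest warp, relative to the longest) is purely an assumption, as are the function names and the sample lengths.

```python
def divergence(warp_lengths):
    """Assumed metric: spread of warp lengths relative to the longest warp."""
    longest = max(warp_lengths)
    return (longest - min(warp_lengths)) / longest


def classify(warp_lengths, threshold=0.02):
    """Symmetric if divergence is at most the threshold (2% in the slides)."""
    if divergence(warp_lengths) <= threshold:
        return "symmetric"
    return "asymmetric"


print(classify([1000, 1000, 990, 1000]))  # symmetric
print(classify([1000, 600, 950, 1000]))   # asymmetric
```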
Results – Symmetric Applications
Baseline: RR + FRFCFS
• Compute intensive – no variation with different fetch policies
• Memory bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR
Results – Symmetric Applications
Baseline: RR + FRFCFS
• On average, better than FRFCFS
• MersenneTwister shows huge improvement
• REM_INST DRAM policy performs similarly to FR_FAIR
Analysis: MonteCarlo
FRFCFS DRAM Scheduling
• Fairness-oriented fetch policies improve performance by increasing intra-core merging
Analysis: MersenneTwister
Baseline: RR + FRFCFS
• Fairness-oriented DRAM scheduling (FR_FAIR, REM_INST) improves performance by increasing the DRAM row buffer hit ratio
Analysis: BlackScholes
FRFCFS DRAM Scheduling
• Fairness-oriented fetch policies increase memory-level parallelism (MLP)
• Increased MLP and row buffer hit ratio improve performance
Conclusion
• Compute intensive applications
  • Fetch and DRAM scheduling do not matter
• Symmetric memory intensive applications
  • Fairness-oriented fetch (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
  • MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory intensive applications
  • No correlation between performance and fetch and DRAM scheduling policies
THANK YOU!