APOGEE: Adaptive Prefetching on GPU for Energy Efficiency. Ankit Sethia¹, Ganesh Dasika², Mehrzad Samadi¹, Scott Mahlke¹ (¹University of Michigan, ²ARM R&D Austin). Introduction. High Throughput – 1 TeraFlop. High Energy Efficiency. Programmability. Further push efficiency?
APOGEE: Adaptive Prefetching on GPU for Energy Efficiency
Ankit Sethia¹, Ganesh Dasika², Mehrzad Samadi¹, Scott Mahlke¹
¹University of Michigan, ²ARM R&D Austin
Introduction
• High Throughput – 1 TeraFlop
• High Energy Efficiency
• Programmability
• Further push efficiency?
  • High performance on mobile
  • Lower cost in supercomputers
Background
• Hide latency by fine-grained multi-threading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead
  • Scheduling
  • Divergence
[Figure: GPU core block diagram. Labels: Register File Banks, Warp Scheduler, ALU, SFU, SFU, MemCtrlr, Data Cache, To global memory.]
Motivation - I
Too many warps decrease efficiency.
Motivation - II
Hardware added to hide memory latency is under-utilized.
APOGEE: Overview
• Prefetch data from memory to cache.
• Less latency to hide.
• Less multithreading.
• Less register context.
[Figure: GPU core block diagram with a prefetcher added. Labels: Register File Banks, Warp Scheduler, ALU, SFU, SFU, Prefetcher, MemCtrlr, Data Cache, From global memory, To global memory.]
Traditional CPU Prefetching
[Figure: address ranges accessed by each warp. W0: 0-31, W1: 32-63, W2: 64-95, W3: 96-127, ..., W30: 4032-4063, W31: 4064-4095. The address deltas observed between successive accesses are 64, -32, 64.]
• Stride cannot be found.
• Next line cannot be prefetched in time.
• Neither traditional prefetching technique works.
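A minimal sketch of why a CPU-style stride detector struggles with the pattern on this slide. It assumes a hypothetical issue order in which the scheduler runs warps W0, W2, W1, W3 and each warp loads 32 consecutive elements; the issue order and the helper names are illustrative assumptions, not taken from the slides.

```python
# Hypothetical illustration: a per-PC stride detector sees loads from many
# warps interleaved by the scheduler, so consecutive deltas are not constant.

WARP_SIZE = 32  # threads per warp; each warp touches 32 consecutive elements


def warp_base(warp_id):
    """First element index touched by a warp."""
    return warp_id * WARP_SIZE


def observed_deltas(issue_order):
    """Deltas between consecutive base addresses as seen at the load PC."""
    bases = [warp_base(w) for w in issue_order]
    return [b - a for a, b in zip(bases, bases[1:])]


def has_stable_stride(deltas):
    """A simple stride prefetcher locks on only if all deltas agree."""
    return len(set(deltas)) == 1


# Assumed out-of-order warp issue: W0, W2, W1, W3.
deltas = observed_deltas([0, 2, 1, 3])
print(deltas)                     # [64, -32, 64]
print(has_stable_stride(deltas))  # False
```

With in-order issue (W0, W1, W2, W3) the same detector would see a constant delta of 32, so the failure in this sketch comes from the interleaving of warps, not from the access pattern of any single warp.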
Fixed Offset Address Prefetching
[Figure: the same address ranges (0-31, 32-63, 64-95, 96-127, 128-160, 160-191, ..., 4032-4063, 4064-4095) divided between Warp 0 and Warp 1; successive accesses by the same warp are 64 apart.]
Fewer warps enable more iterations with a fixed offset (stride * numWarp).
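The fixed offset named on this slide (stride * numWarp) can be sketched as follows. The sketch assumes each warp covers 32 consecutive elements per iteration and that warps take turns over the array; the function names and the `distance` parameter are hypothetical and only serve the illustration.

```python
WARP_SIZE = 32  # threads per warp, one element per thread (assumed)


def foa_offset(stride, num_warps):
    """Fixed offset between a warp's consecutive iterations."""
    return stride * num_warps


def warp_bases(warp_id, num_warps, total_elements, stride=WARP_SIZE):
    """Base element index a warp touches in each of its iterations."""
    offset = foa_offset(stride, num_warps)
    return list(range(warp_id * stride, total_elements, offset))


def prefetch_base(miss_base, stride, num_warps, distance=1):
    """Base index to prefetch, `distance` iterations ahead of a miss."""
    return miss_base + distance * foa_offset(stride, num_warps)


print(warp_bases(0, 2, 256))            # [0, 64, 128, 192]
print(warp_bases(1, 2, 256))            # [32, 96, 160, 224]
print(len(warp_bases(0, 32, 1024)))     # 1
print(len(warp_bases(0, 2, 1024)))      # 16
print(prefetch_base(64, WARP_SIZE, 2))  # 128
```

In this toy setup, 32 warps over 1024 elements leave each warp a single iteration, so there is nothing to predict, while 2 warps give each warp 16 iterations at a constant offset of 64.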
Timeliness in Prefetching
[Figure: three-state machine per prefetch entry. 00 → 01 when a prefetch is sent to memory; 01 → 10 when the prefetch is received from memory; 10 → 00 on a new load. A timeline marks early, timely, and slow prefetches relative to the load.]
• Increase the prefetch distance if the next load happens in state 01 (slow prefetch).
• Decrease the prefetch distance if a correctly prefetched address in state 10 was a miss (early prefetch).
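A minimal sketch of the three-state timeliness rule on this slide. The state encodings 00, 01, and 10 and the two adjustment rules come from the slide; the step size of 1, the distance bounds, and the class and method names are assumptions made for illustration.

```python
IDLE, SENT, RECEIVED = 0b00, 0b01, 0b10  # state encodings from the slide


class TimelinessEntry:
    """Per-entry prefetch distance control (hypothetical structure)."""

    def __init__(self, distance=1, min_distance=1, max_distance=8):
        self.distance = distance
        self.min_distance = min_distance  # assumed lower bound
        self.max_distance = max_distance  # assumed upper bound
        self.state = IDLE

    def on_prefetch_sent(self):
        self.state = SENT

    def on_prefetch_received(self):
        if self.state == SENT:
            self.state = RECEIVED

    def on_load(self, cache_hit):
        """Handle the next load to a correctly predicted address."""
        if self.state == SENT:
            # Load arrived before the data: prefetch was slow, look further ahead.
            self.distance = min(self.distance + 1, self.max_distance)
        elif self.state == RECEIVED and not cache_hit:
            # Data arrived but was gone by the time of use: prefetch was early.
            self.distance = max(self.distance - 1, self.min_distance)
        self.state = IDLE


entry = TimelinessEntry(distance=2)
entry.on_prefetch_sent()
entry.on_load(cache_hit=False)   # slow prefetch
print(entry.distance)            # 3
entry.on_prefetch_sent()
entry.on_prefetch_received()
entry.on_load(cache_hit=False)   # early prefetch
print(entry.distance)            # 2
```

A load that hits in state 10 leaves the distance unchanged, which corresponds to the timely case on the slide's timeline.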
FOA Prefetching
[Figure: FOA prefetcher datapath. Labels: Prefetch Table, Current PC, Miss address of thread index, # Threads, Prefetch Address, Prefetch Enable, Prefetch Queue; operators shown: -, *, +, =, >.]
GRAPHICS MEMORY ACCESSES
Three major types of accesses:
• Fixed Offset Access: Data accessed by adjacent threads in an array have a fixed offset.
• Thread Invariant Address: The same address is accessed by all the threads in a warp.
• Texture Access: Address accessed during a texturing operation.
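The first two access types above can be told apart from the per-thread addresses of a single warp-level load. The sketch below is one way to express those definitions; the function name and the returned labels are hypothetical, and texture accesses are not covered because they are identified by the operation, not by the address pattern.

```python
def classify_warp_access(addresses):
    """Classify one warp-level load from its per-thread addresses."""
    if len(addresses) < 2:
        return "unknown"
    # Thread invariant: every thread in the warp reads the same address.
    if all(a == addresses[0] for a in addresses):
        return "thread_invariant"
    # Fixed offset: adjacent threads differ by one constant offset.
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(deltas) == 1:
        return "fixed_offset"
    return "other"


print(classify_warp_access([0x100] * 4))     # thread_invariant
print(classify_warp_access([0, 4, 8, 12]))   # fixed_offset
print(classify_warp_access([0, 4, 16, 20]))  # other
```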
Thread Invariant Addresses
[Figure: TIA prefetch table and per-iteration timelines (Iteration 1, 2, 3, 5) for loads at PC0 (constant load), PC1, PC2, and PC3. Prefetches are marked as slow or timely. Labels: TIA Prefetch Table (entries keyed by PC, example address 0xabc), Prefetch Address, Prefetch Queue, Time.]
Experimental Evaluation
• Benchmarks
  • Mesa Driver, SPEC-Viewperf traces, MV5 GPGPU benchmarks
• Performance Eval
  • MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency
  • Prefetcher – 32 entry
  • Prefetch latency – 10 cycles
  • D-Cache – 64 kB, 8-way, 32 bytes per line
• Power Eval
  • Dynamic power from analytical model, Hong et al. (ISCA '10)
  • Static power from published numbers and tools:
    • FPU, SFU, Caches, Register File, Fetch/Decode/Schedule
APOGEE Performance
On average, a 19% speedup with APOGEE.
APOGEE Performance - II
MTA with 32 warps is within 3% of APOGEE.
Performance and Power
[Figure: performance and power of SIMT, MTA, and APOGEE; markers denote the number of warps (32, 16, 8, 4, 2, 1).]
• 20% speedup over SIMT, with 14k fewer registers.
• Around 51% perf/Watt improvement.
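As a rough consistency check, the two headline numbers on this slide imply a power figure. The sketch assumes that both the 20% speedup and the 51% perf/Watt improvement are measured against the same SIMT baseline on the same workloads, which the slide does not state explicitly; the function name is hypothetical.

```python
def implied_power_ratio(speedup, perf_per_watt_gain):
    """Power relative to baseline, from perf/Watt = performance / power."""
    return speedup / perf_per_watt_gain


# 20% speedup -> 1.20x performance; 51% perf/Watt improvement -> 1.51x.
ratio = implied_power_ratio(1.20, 1.51)
print(round(ratio, 3))  # 0.795
```

Under that assumption, the design would draw roughly 80% of the baseline power, that is, about a 20% reduction alongside the speedup.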
Prefetcher Accuracy
APOGEE has 93.5% accuracy in prediction.
D-Cache Miss Rate
Prefetching results in 80% reduction in cache accesses.
Conclusion
• Use of high multi-threading on GPUs is inefficient.
• Adaptive prefetching exploits the regular memory access pattern of GPU applications:
  • Adapt for the timeliness of prefetching
  • Prefetch for two different patterns
• Over 12 graphics and GPGPU benchmarks, 20% improvement in performance and 51% improvement in performance/Watt.