
APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Presentation Transcript


  1. APOGEE: Adaptive Prefetching on GPU for Energy Efficiency Ankit Sethia¹, Ganesh Dasika², Mehrzad Samadi¹, Scott Mahlke¹ (¹University of Michigan, ²ARM R&D Austin)

  2. Introduction • High throughput: 1 TeraFlop • High energy efficiency • Programmability • Can efficiency be pushed further? • High performance on mobile • Lower cost in supercomputers

  3. Background • Hide latency by fine-grained multithreading • 20,000 in-flight threads • 2 MB on-chip register file • Management overhead • Scheduling • Divergence [Diagram: baseline SIMT pipeline, with register file banks feeding the ALUs and SFUs, a warp scheduler, and a data cache connected through the memory controller to global memory]

  4. Motivation - I Too many warps decrease efficiency

  5. Motivation - II Hardware added to hide memory latency is under-utilized

  6. APOGEE: Overview • Prefetch data from memory to cache • Less latency to hide • Less multithreading • Less register context [Diagram: the baseline SIMT pipeline with a prefetcher added alongside the warp scheduler, filling the data cache from global memory via the memory controller]

  7. Traditional CPU Prefetching With 32 warps (W0–W31) striding over one array, the miss addresses observed at a single load PC interleave across warps, so the per-PC deltas alternate (e.g., 64, -32, 64, ...) and no stable stride can be found. Next-line prefetching, meanwhile, cannot be issued early enough to be timely. Neither traditional prefetching technique works. [Diagram: interleaved per-warp address ranges 0–31 (W0), 32–63 (W1), ..., 4064–4095 (W31)]
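The failure mode above can be reproduced in a few lines. This is a sketch, not the paper's hardware: the warp schedule and addresses are assumed values chosen to mirror the slide's example.

```python
# Hypothetical sketch: why a classic per-PC stride detector fails when
# many warps interleave at the same load PC. Warp w touches 32 * w, but
# the scheduler does not issue warps in order, so consecutive miss
# addresses seen at that PC have no stable delta.
sched_order = [0, 2, 1, 3, 2, 4]           # assumed out-of-order warp schedule
addrs = [32 * w for w in sched_order]      # miss addresses seen at one PC
deltas = [b - a for a, b in zip(addrs, addrs[1:])]
print(deltas)  # alternating deltas such as 64, -32, 64, ... -> no stride
```

A stride prefetcher keyed only on the PC would never converge on these deltas, which is the slide's point.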

  8. Fixed Offset Address Prefetching With fewer warps, each warp runs more iterations over the array, and successive accesses by the same warp at the same PC differ by a fixed offset (stride × numWarps), which the prefetcher can learn reliably. [Diagram: warps W0–W5 each advancing by a fixed offset of 64 between iterations]
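The fixed-offset observation can be sketched numerically. The stride and warp count below are assumed example values, not figures from the paper:

```python
# Sketch of the FOA pattern: with num_warps warps striding over a
# contiguous array, warp w in iteration i touches
# stride*w + i*stride*num_warps, so the per-(PC, warp) delta is the
# constant fixed offset stride * num_warps.
stride, num_warps = 32, 4                  # assumed example parameters

def foa_addr(warp, iteration):
    return stride * warp + iteration * stride * num_warps

w0_addrs = [foa_addr(0, i) for i in range(4)]
offsets = [b - a for a, b in zip(w0_addrs, w0_addrs[1:])]
print(offsets)  # every delta equals stride * num_warps
```

Tracking the (PC, warp) pair instead of the PC alone is what turns the irregular interleaved stream back into a constant-stride one.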

  9. Timeliness in Prefetching A two-bit state tracks each prefetch: 00 (new load, prefetch sent to memory) → 01 (prefetch in flight) → 10 (prefetch received from memory). • Increase the prefetch distance if the next load arrives in state 01: the prefetch was too slow. • Decrease the prefetch distance if a correctly prefetched address in state 10 was still a miss: the prefetch was too early and the data was evicted before use.
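The distance-adjustment policy above can be sketched as a tiny state machine. The state names and update rules follow the slide; the class and method names are my own, not the paper's:

```python
# Minimal sketch of the timeliness automaton: 00 -> 01 when the
# prefetch is sent, 01 -> 10 when it returns from memory, and the
# prefetch distance is nudged based on the state the next load finds.
class TimelinessFSM:
    SENT, IN_FLIGHT, ARRIVED = "00", "01", "10"

    def __init__(self, distance=1):
        self.state = self.SENT
        self.distance = distance

    def prefetch_sent(self):
        self.state = self.IN_FLIGHT

    def prefetch_received(self):
        self.state = self.ARRIVED

    def on_load(self, hit=True):
        if self.state == self.IN_FLIGHT:
            self.distance += 1                     # too slow: fetch further ahead
        elif self.state == self.ARRIVED and not hit:
            self.distance = max(1, self.distance - 1)  # too early: data evicted
        self.state = self.SENT                     # next prefetch begins
```

Each load thus steers the distance toward the point where prefetches arrive just before they are needed.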

  10. FOA Prefetching [Diagram: FOA prefetcher datapath, with the prefetch table indexed by the current PC; the prefetch address is computed from the miss address of a thread, the learned stride scaled by the number of threads, and the prefetch distance, and is pushed into the prefetch queue when prefetching is enabled]

  11. Graphics Memory Accesses Three major types of accesses: • Fixed offset access: data accessed by adjacent threads in an array has a fixed offset • Thread-invariant address: the same address is accessed by all threads in a warp • Texture access: addresses accessed during texturing operations

  12. Thread Invariant Addresses [Diagram: TIA prefetch table operation across loop iterations, with per-PC entries (e.g., PC1 → 0xabc) recording thread-invariant loads; in early iterations the prefetch for the next iteration's address arrives late (slow prefetch), and the prefetcher adjusts until later iterations see timely prefetches, with prefetch addresses pushed to the prefetch queue]
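The thread-invariant case can be sketched with a small per-PC table. This is my own hedged reconstruction of the mechanism, not the paper's datapath; the table layout and function names are assumptions:

```python
# Sketch of thread-invariant-address (TIA) prefetching: a warp's load
# is invariant when every lane computes the same address. Once the
# cross-iteration stride for that PC is observable, the next
# iteration's address can be enqueued for prefetch.
tia_table = {}        # pc -> (last address, cross-iteration stride)
prefetch_queue = []   # addresses handed to the memory system

def observe_load(pc, lane_addrs):
    addr = lane_addrs[0]
    if any(a != addr for a in lane_addrs):
        return                                  # not thread-invariant
    last, _stride = tia_table.get(pc, (None, None))
    if last is not None:
        stride = addr - last
        prefetch_queue.append(addr + stride)    # next iteration's address
        tia_table[pc] = (addr, stride)
    else:
        tia_table[pc] = (addr, None)            # first sighting: learn only
```

One prefetch table entry per load PC suffices here, since every thread in the warp shares the address.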

  13. Experimental Evaluation • Benchmarks • Mesa driver, SPEC Viewperf traces, MV5 GPGPU benchmarks • Performance evaluation • MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency • Prefetcher: 32 entries • Prefetch latency: 10 cycles • D-cache: 64 kB, 8-way, 32 bytes per line • Power evaluation • Dynamic power from the analytical model of Hong et al. (ISCA '10) • Static power from published numbers and tools, covering the FPU, SFU, caches, register file, and fetch/decode/schedule logic

  14. APOGEE Performance On average 19% improvement in speedup with APOGEE

  15. APOGEE Performance -II MTA with 32 warps is within 3% of APOGEE

  16. Performance and Power • 20% speedup over SIMT, with 14K fewer registers • Around 51% perf/Watt improvement [Chart legend: SIMT, MTA, and APOGEE configurations at 1, 2, 4, 8, 16, and 32 warps]

  17. Prefetcher Accuracy APOGEE has 93.5% accuracy in prediction

  18. D-Cache Miss Rate Prefetching results in 80% reduction in cache accesses

  19. Conclusion • Relying on high degrees of multithreading on GPUs is inefficient • Adaptive prefetching exploits the regular memory access patterns of GPU applications: • Adapts to the timeliness of prefetching • Prefetches for two different access patterns • Over 12 graphics and GPGPU benchmarks: 20% improvement in performance and 51% improvement in performance/Watt

  20. Questions?
