APOGEE: Adaptive Prefetching on GPU for Energy Efficiency

Ankit Sethia (1), Ganesh Dasika (2), Mehrzad Samadi (1), Scott Mahlke (1)
(1) University of Michigan, (2) ARM R&D Austin

Introduction
• High Throughput: 1 TeraFlop
• High Energy Efficiency
• Programmability
• Further push efficiency?
  • High performance on mobile
  • Lower cost in supercomputers

Background
• Hide latency by fine-grained multi-threading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead
  • Scheduling
  • Divergence
[Figure: GPU core block diagram with register file banks, warp scheduler, ALU, SFUs, memory controller, and data cache, connected to global memory]

Motivation - I
Too many warps decrease efficiency.

Motivation - II
Hardware added to hide memory latency is under-utilized.

APOGEE: Overview
• Prefetch data from memory to cache.
• Less latency to hide.
• Less multithreading.
• Less register context.
[Figure: GPU core block diagram (register file banks, warp scheduler, ALU, SFUs, memory controller, data cache) with a prefetcher added; data paths from and to global memory]

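Traditional CPU Prefetching
[Figure: warps W0 to W31 each access a contiguous address block (W0: 0-31, W1: 32-63, W2: 64-95, W3: 96-127, ..., W30: 4032-4063, W31: 4064-4095); successive accesses are annotated with address differences of 64, -32, 64]
• Stride cannot be found.
• Next line cannot be prefetched in time.
• Neither of the two traditional prefetching techniques works.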
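A small sketch of why a stride detector fails here. It assumes that warp w reads the block starting at address w * 32 (as in the W0: 0-31, W1: 32-63 layout on the slide) and that the 64, -32, 64 differences come from the scheduler issuing warps in the order W0, W2, W1, W3. That ordering and the function name `observed_deltas` are illustrative assumptions, not something the slide states.

```python
def observed_deltas(warp_order, threads_per_warp=32):
    """Address differences between consecutive loads, as a stride detector sees them.

    Assumes warp w reads the block that starts at w * threads_per_warp.
    """
    bases = [w * threads_per_warp for w in warp_order]
    return [nxt - cur for cur, nxt in zip(bases, bases[1:])]


# Warps issued strictly in order: a constant stride is visible.
in_order = observed_deltas([0, 1, 2, 3])

# Warps issued out of order (hypothetical schedule): no stable stride.
interleaved = observed_deltas([0, 2, 1, 3])

print(in_order)     # [32, 32, 32]
print(interleaved)  # [64, -32, 64]
```

With the out-of-order schedule the detector never sees the same difference twice in a row, so it has no stride to lock onto.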

Fixed Offset Address Prefetching
[Figure: the same address blocks (W0: 0-31 through W31: 4064-4095) distributed over two warps, Warp 0 and Warp 1; successive accesses by a warp are annotated with an offset of 64]
• Fewer warps enable more iterations with a fixed offset (stride * numWarp).
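The fixed offset (stride * numWarp) can be illustrated with a few lines of arithmetic. This is a sketch under assumed conventions: each warp reads a block of `stride` addresses, blocks are handed to warps round-robin, and the names `warp_base_address`, `prefetch_address`, and `distance` are hypothetical.

```python
def warp_base_address(warp_id, iteration, num_warps, stride=32):
    """Base address a warp reads in a given iteration (round-robin block assignment)."""
    return (iteration * num_warps + warp_id) * stride


def fixed_offset(num_warps, stride=32):
    """Offset between a warp's accesses in consecutive iterations."""
    return stride * num_warps


def prefetch_address(current_base, num_warps, stride=32, distance=1):
    """Address to prefetch `distance` iterations ahead of the current access."""
    return current_base + distance * fixed_offset(num_warps, stride)


# Two warps, as in the slide's figure: each warp advances by 64 per iteration.
warp0 = [warp_base_address(0, it, num_warps=2) for it in range(3)]
warp1 = [warp_base_address(1, it, num_warps=2) for it in range(3)]
next_for_warp0 = prefetch_address(warp0[0], num_warps=2)

print(warp0)           # [0, 64, 128]
print(warp1)           # [32, 96, 160]
print(next_for_warp0)  # 64
```

Because the offset depends only on the stride and the number of warps, a warp's next address is predictable no matter how the scheduler interleaves the warps.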

Timeliness in Prefetching
[Figure: timeline split into states 00, 01, and 10 by the events "Load", "Prefetch sent to Memory", and "Prefetch received from Memory"; the next load is marked as an early, timely, or slow prefetch. A state diagram over 00, 01, and 10 has transitions labeled "Prefetch sent to Memory", "Prefetch received from Memory", and "New Load"]
• Increase the prefetch distance if the next load happens in state 01.
• Decrease the prefetch distance if a correctly prefetched address in state 10 was a miss.
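The two adjustment rules can be sketched as a small state machine. Several details are assumptions made for illustration: the transition order 00 to 01 to 10 and back to 00 on a new load, a step size of one, the `max_distance` cap, and the class and method names.

```python
class PrefetchEntry:
    """Timeliness tracking for one prefetch-table entry (illustrative sketch)."""

    IDLE, SENT, RECEIVED = "00", "01", "10"

    def __init__(self, distance=1, max_distance=8):
        self.state = self.IDLE
        self.distance = distance
        self.max_distance = max_distance  # hypothetical cap, not from the slides

    def prefetch_sent(self):
        self.state = self.SENT

    def prefetch_received(self):
        if self.state == self.SENT:
            self.state = self.RECEIVED

    def new_load(self, cache_hit):
        """Apply the two rules, then return to state 00."""
        if self.state == self.SENT:
            # Load arrived before the data did: prefetch was too slow.
            self.distance = min(self.distance + 1, self.max_distance)
        elif self.state == self.RECEIVED and not cache_hit:
            # Data arrived but missed anyway: prefetch was too early.
            self.distance = max(self.distance - 1, 1)
        self.state = self.IDLE


entry = PrefetchEntry(distance=2)

entry.prefetch_sent()
entry.new_load(cache_hit=False)      # slow prefetch
slow_distance = entry.distance

entry.prefetch_sent()
entry.prefetch_received()
entry.new_load(cache_hit=False)      # early prefetch
early_distance = entry.distance

entry.prefetch_sent()
entry.prefetch_received()
entry.new_load(cache_hit=True)       # timely prefetch, distance unchanged
timely_distance = entry.distance
```

A timely prefetch leaves the distance untouched, so the distance only moves when the load arrives in the wrong state.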

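FOA Prefetching
[Figure: FOA prefetcher datapath. Inputs: current PC, miss address, thread index, # threads. Blocks: prefetch table, subtract (-), compare (=), multiply (*), add (+), increment (+1), and threshold compare (>). Outputs: Prefetch Enable and Prefetch Address, feeding the prefetch queue]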
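One plausible reading of this datapath, written as a sketch rather than a description of the actual hardware: a table indexed by load PC stores the last miss address and thread index, derives a per-thread offset from consecutive misses, counts how often the offset repeats, and enables prefetching once the count passes a threshold. The field names, the threshold value, and the exact update policy are all assumptions.

```python
class FOATable:
    """Illustrative fixed-offset-address prefetch table, keyed by load PC."""

    def __init__(self, threshold=1):
        self.entries = {}           # pc -> {"addr", "tid", "offset", "confidence"}
        self.threshold = threshold  # hypothetical confidence threshold

    def observe(self, pc, addr, tid, num_threads):
        """Record a miss; return a prefetch address once the offset is trusted."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = {"addr": addr, "tid": tid,
                                "offset": None, "confidence": 0}
            return None

        tid_delta = tid - entry["tid"]
        addr_delta = addr - entry["addr"]
        entry["addr"], entry["tid"] = addr, tid
        if tid_delta == 0:
            return None  # same thread index: no per-thread offset to derive

        # Per-thread offset; None if the address gap is not a whole multiple.
        offset = addr_delta // tid_delta if addr_delta % tid_delta == 0 else None
        if offset is not None and offset == entry["offset"]:
            entry["confidence"] += 1
        else:
            entry["offset"], entry["confidence"] = offset, 0

        if entry["confidence"] > self.threshold:
            # Prefetch enable: miss address + offset * number of threads.
            return addr + entry["offset"] * num_threads
        return None


table = FOATable(threshold=1)
results = [table.observe(pc=0x10, addr=a, tid=t, num_threads=128)
           for a, t in [(0, 0), (128, 32), (256, 64), (384, 96)]]
print(results)  # [None, None, None, 896]
```

In this run the per-thread offset is 4, so once it has repeated often enough the table predicts 384 + 4 * 128 = 896.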

Graphics Memory Accesses
Three major types of accesses:
• Fixed Offset Access: data accessed by adjacent threads in an array have a fixed offset.
• Thread Invariant Address: the same address is accessed by all the threads in a warp.
• Texture Access: address accessed during a texturing operation.
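The first two access types can be told apart from the per-thread addresses of a single warp load. The sketch below is an illustration with a hypothetical helper name. Texture accesses are identified by the operation rather than the address pattern, so this sketch lumps them with everything else under "other".

```python
def classify_warp_access(addresses):
    """Classify one warp-wide load from its per-thread addresses (in thread order)."""
    if len(addresses) < 2:
        return "unknown"
    deltas = {nxt - cur for cur, nxt in zip(addresses, addresses[1:])}
    if deltas == {0}:
        return "thread invariant"   # every thread reads the same address
    if len(deltas) == 1:
        return "fixed offset"       # adjacent threads differ by a constant
    return "other"


fixed = classify_warp_access([0, 4, 8, 12])
invariant = classify_warp_access([0xabc] * 4)
irregular = classify_warp_access([0, 4, 64, 12])

print(fixed, invariant, irregular)  # fixed offset thread invariant other
```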

Thread Invariant Addresses
[Figure: per-iteration timelines of the loads at PC0 (constant load), PC1, PC2, and PC3, with prefetches marked as slow or timely; a TIA prefetch table holding PC and prefetch address entries (for example 0xabc) feeds the prefetch queue]

Experimental Evaluation
• Benchmarks
  • Mesa Driver, SPEC-Viewperf traces, MV5 GPGPU benchmarks
• Performance Eval
  • MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency.
  • Prefetcher: 32 entries
  • Prefetch latency: 10 cycles
  • D-Cache: 64 kB, 8-way, 32 bytes per line.
• Power Eval
  • Dynamic power from analytical model, Hong et al. (ISCA '10).
  • Static power from published numbers and tools:
    • FPU, SFU, Caches, Register File, Fetch/Decode/Schedule.

APOGEE Performance
On average, 19% improvement in speedup with APOGEE.

APOGEE Performance - II
MTA with 32 warps is within 3% of APOGEE.

Performance and Power
[Figure: plot comparing SIMT, MTA, and APOGEE; marker legend for # of warps: 32, 16, 8, 4, 2, and 1]
• 20% speedup over SIMT, with 14k fewer registers.
• Around 51% perf/Watt improvement.

Prefetcher Accuracy
APOGEE has 93.5% accuracy in prediction.

D-Cache Miss Rate
Prefetching results in 80% reduction in cache accesses.

Conclusion
• Use of high multi-threading on GPUs is inefficient
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
  • Adapt for the timeliness of prefetching
  • Prefetch for two different patterns
• Over 12 graphics and GPGPU benchmarks, 20% improvement in performance and 51% improvement in performance/Watt

Questions?
