1 / 18

Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors. Mark Gebhart Daniel R. Johnson The University of Texas University of Illinois Stephen W. Keckler William J. Dally

ronat
Download Presentation

Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors Mark Gebhart Daniel R. Johnson The University of Texas University of Illinois Stephen W. Keckler William J. Dally NVIDIA / The University of Texas NVIDIA / Stanford University David Tarjan Erik Lindholm Kevin Skadron NVIDIA NVIDIA University of Virginia

  2. Motivation • All processors are effectively power limited • From mobile, to desktop, to enterprise, to GPU • Energy efficiency is a primary design constraint • Throughput processors • Massive multithreading to tolerate memory latency • Large register files • Complicated thread scheduler • Focus of this work • Improve energy efficiency of register file and scheduler without sacrificing performance

  3. Optimization Opportunity #1 • Large number of threads hide two types of latency • Long: Global memory access (~400 cycles) • Short: ALU and shared memory access (8-20 cycles) • Partition threads into two sets • Active threads for short latency events • Inactive threads for long latency events • Simplify thread scheduler

  4. Optimization Opportunity #2 • Examined register reuse patterns of GPU workloads • Up to 40% of values read once within 3 instructions of being produced • Exploit register file locality with a cache

  5. Outline • Motivation • Proposed Architecture • Baseline GPU • Thread Scheduling • Register File Caching • Evaluation • Conclusion

  6. Baseline GPU Architecture • 16 streaming multiprocessors (SM) per chip • Small amount of on-chip cache • Memory interface designed to maximize bandwidth rather than latency

  7. Single Level Warp Scheduler • 1024 threads per SM to tolerate latency • 32 warps • 32 threads per warp • Scheduler chooses one warp to execute each cycle

  8. Two-Level Warp Scheduler • Only active warps issue instructions • Simplified scheduler chooses from active warps • Long latency events trigger active warps to be descheduled • Goal: Minimize number of active warps without harming performance

  9. Outline • Motivation • Proposed Architecture • Baseline GPU • Thread Scheduling • Register File Caching • Evaluation • Conclusion

  10. Baseline SM • Register file heavily banked for high bandwidth • 32 SIMT lanes • Reduced number of special function units • Low latency access to scratchpad memory

  11. Two-Level Warp Scheduler and RFC • Register file cache (RFC) • Close to functional units • 21 times smaller than MRF • Only active warps have RFC entries • When warp is descheduled RFC is flushed • Static liveness information used to prevent writeback of dead values

  12. Program Execution RFC MRF Dead values PC add R3, R1, R2 sub R4, R1, R3 ld.global R5, R4 mul R6, R3, R4 div R7, R5, R6 R7 R3 R1 R2 R4 R6 R5 R5 R6 R3 R4 R1 R3 R4 R1 R2 Warp is descheduled until load completes RFC is flushed ALU R3 R4 R5 R6 R7

  13. Outline • Motivation • Background • Proposed Architecture • Evaluation • Methodology • Performance • Energy • Conclusion

  14. Methodology • Simulator • Custom trace-based performance simulator • Workloads • 19 Video processing: H264 encoding, Video enhancement • 11 Simulation: Molecular dynamics, Computational graphics, Path finding • 7 Image processing: Image blur, JPEG • 18 HPC: DGEMM, SGEMM, FFT • 155 Shader: 12 recent video games • Energy modeling • Synthesized SRAM banks and flip-flop arrays to measure simulated energy per access cost • Model wire energy as function of distance traveled

  15. Performance Evaluation • Minimal performance loss with 8 active warps • 3% performance loss with 6 active warps • 15% performance loss with 4 active warps

  16. Caching Evaluation • 6-entry per-thread RFC removes 40-80% of MRF accesses

  17. Energy Evaluation • 36% RF energy reduction with no performance loss • 6 entries per thread / 8 active warps

  18. Conclusion • RFC and two-level warp scheduler reduce RF energy by 36% • 5.4% of SM energy, 3.8% chip-wide • Energy-efficient designs enable high performance • Combination of many techniques required for future designs • Future work • Different RF hierarchies • Compiler managed rather than register cache

More Related