Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Mark Gebhart (The University of Texas)
Daniel R. Johnson (University of Illinois)
Stephen W. Keckler (NVIDIA / The University of Texas)
William J. Dally (NVIDIA / Stanford University)
David Tarjan (NVIDIA)
Erik Lindholm (NVIDIA)
Kevin Skadron (University of Virginia)
Motivation
• All processors are effectively power limited
  • From mobile, to desktop, to enterprise, to GPU
  • Energy efficiency is a primary design constraint
• Throughput processors
  • Massive multithreading to tolerate memory latency
  • Large register files
  • Complicated thread scheduler
• Focus of this work
  • Improve the energy efficiency of the register file and scheduler without sacrificing performance
Optimization Opportunity #1
• A large number of threads hides two types of latency
  • Long: global memory access (~400 cycles)
  • Short: ALU and shared memory access (8–20 cycles)
• Partition threads into two sets
  • Active threads for short-latency events
  • Inactive threads for long-latency events
• Simplify the thread scheduler
Optimization Opportunity #2
• Examined register reuse patterns of GPU workloads
• Up to 40% of values are read only once, within 3 instructions of being produced
• Exploit register file locality with a cache
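The reuse measurement above can be sketched as a simple pass over an instruction trace: for every value produced, record where it is read, then count values read exactly once within a small window. This is a minimal illustration, not the paper's tooling; the trace format and example program below are assumptions (the example mirrors the short sequence shown later in the deck).

```python
# Sketch of a register reuse-distance analysis over an instruction trace.
# Counts values that are read exactly once, within `window` instructions
# of being produced. Trace format is hypothetical: (dest, [srcs]) tuples.

def reuse_stats(trace, window=3):
    produced_at = {}   # register name -> index of its producing instruction
    reads = {}         # producing index -> indices of instructions reading it
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in produced_at:
                reads.setdefault(produced_at[s], []).append(i)
        produced_at[dest] = i          # a new definition kills the old value
        reads.setdefault(i, [])
    short_lived = sum(
        1 for p, rs in reads.items()
        if len(rs) == 1 and rs[0] - p <= window
    )
    return short_lived, len(reads)

# Example trace mirroring the five-instruction sequence shown later.
trace = [
    ("R3", ["R1", "R2"]),   # add R3, R1, R2
    ("R4", ["R1", "R3"]),   # sub R4, R1, R3
    ("R5", ["R4"]),         # ld.global R5, R4
    ("R6", ["R3", "R4"]),   # mul R6, R3, R4
    ("R7", ["R5", "R6"]),   # div R7, R5, R6
]
```

On this toy trace, two of the five produced values (R5 and R6) are read exactly once within three instructions, which is the kind of locality an RFC can capture.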
Outline
• Motivation
• Proposed Architecture
  • Baseline GPU
  • Thread Scheduling
  • Register File Caching
• Evaluation
• Conclusion
Baseline GPU Architecture
• 16 streaming multiprocessors (SMs) per chip
• Small amount of on-chip cache
• Memory interface designed to maximize bandwidth rather than minimize latency
Single-Level Warp Scheduler
• 1024 threads per SM to tolerate latency
  • 32 warps × 32 threads per warp
• Scheduler chooses one warp to execute each cycle
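The cost the two-level design attacks is visible in a sketch of this baseline: every cycle the scheduler must consider all 32 warps to find one that is ready. The round-robin policy below is an illustrative assumption; the deck does not specify the baseline selection policy.

```python
# Hedged sketch of a single-level warp scheduler: each cycle, scan all
# warps (round-robin from the last issued warp) for one ready to issue.
# The selection policy here is assumed for illustration.

def select_warp(ready, last_issued, num_warps=32):
    """ready: set of warp ids able to issue this cycle."""
    for offset in range(1, num_warps + 1):
        w = (last_issued + offset) % num_warps
        if w in ready:
            return w
    return None   # no warp ready: the pipeline stalls this cycle
```

Selecting among all 32 warps every cycle is what makes the scheduler's comparison logic large; the two-level scheme on the next slide shrinks that set.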
Two-Level Warp Scheduler
• Only active warps issue instructions
• Simplified scheduler chooses among the active warps
• Long-latency events trigger active warps to be descheduled
• Goal: minimize the number of active warps without harming performance
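The two-level policy described above can be sketched as two warp pools: the inner scheduler only ever selects from the small active set, a long-latency event (such as a global load) demotes a warp to the pending set, and a completed event promotes it back. This is a minimal model under assumed semantics, not the paper's hardware design; class and method names are illustrative.

```python
# Minimal sketch of a two-level warp scheduler: a small active pool the
# inner scheduler round-robins over, and a pending pool for warps waiting
# on long-latency events.

from collections import deque

class TwoLevelScheduler:
    def __init__(self, warps, max_active=8):
        self.active = deque(warps[:max_active])    # eligible to issue
        self.pending = deque(warps[max_active:])   # waiting on long latency

    def issue(self):
        """Round-robin: pick the next active warp to issue from."""
        if not self.active:
            return None
        warp = self.active.popleft()
        self.active.append(warp)
        return warp

    def deschedule(self, warp):
        """Long-latency event (e.g. global load): demote the warp."""
        self.active.remove(warp)
        self.pending.append(warp)

    def wakeup(self, warp):
        """Event completed: the warp becomes eligible to issue again."""
        self.pending.remove(warp)
        self.active.append(warp)
```

With `max_active=8`, the issue logic only ever compares 8 candidates instead of 32, which is the source of the scheduler simplification the slide claims.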
Baseline SM
• Register file heavily banked for high bandwidth
• 32 SIMT lanes
• Reduced number of special function units
• Low-latency access to scratchpad memory
Two-Level Warp Scheduler and RFC
• Register file cache (RFC)
  • Close to the functional units
  • 21× smaller than the main register file (MRF)
• Only active warps have RFC entries
• When a warp is descheduled, its RFC entries are flushed
• Static liveness information prevents write-back of dead values
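The flush policy above can be sketched as follows: when a warp is descheduled, every cached value is either written back to the MRF or, if static liveness information has marked it dead, simply dropped, saving the write-back energy. A minimal sketch under assumed semantics; the class and method names are hypothetical, not the paper's.

```python
# Sketch of the RFC write-back policy on warp deschedule: live values are
# written back to the MRF, values marked dead by static liveness info are
# dropped without a write-back.

class RegisterFileCache:
    def __init__(self):
        self.entries = {}   # register name -> cached value
        self.dead = set()   # registers marked dead by static liveness info

    def write(self, reg, value):
        self.entries[reg] = value

    def mark_dead(self, reg):
        """Static liveness info: this value will never be read again."""
        self.dead.add(reg)

    def flush(self, mrf):
        """Warp descheduled: write back only live values, then clear."""
        written_back = 0
        for reg, value in self.entries.items():
            if reg not in self.dead:
                mrf[reg] = value     # energy spent only on live values
                written_back += 1
        self.entries.clear()
        self.dead.clear()
        return written_back
```

In the execution example on the next slide, values like R3 and R4 that have already seen their last read fall into the dead set and are never written back.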
Program Execution (animation slide: values move between the RFC and MRF as the sequence executes)
• Instruction sequence:
  add R3, R1, R2
  sub R4, R1, R3
  ld.global R5, R4
  mul R6, R3, R4
  div R7, R5, R6
• On the long-latency load, the warp is descheduled until the load completes and the RFC is flushed; dead values are not written back to the MRF
Outline
• Motivation
• Background
• Proposed Architecture
• Evaluation
  • Methodology
  • Performance
  • Energy
• Conclusion
Methodology
• Simulator
  • Custom trace-based performance simulator
• Workloads
  • 19 video processing: H.264 encoding, video enhancement
  • 11 simulation: molecular dynamics, computational graphics, path finding
  • 7 image processing: image blur, JPEG
  • 18 HPC: DGEMM, SGEMM, FFT
  • 155 shader: 12 recent video games
• Energy modeling
  • Synthesized SRAM banks and flip-flop arrays to measure energy per access
  • Wire energy modeled as a function of distance traveled
Performance Evaluation
• Minimal performance loss with 8 active warps
• 3% performance loss with 6 active warps
• 15% performance loss with 4 active warps
Caching Evaluation
• A 6-entry-per-thread RFC removes 40–80% of MRF accesses
Energy Evaluation
• 36% register file energy reduction with no performance loss
  • 6 RFC entries per thread, 8 active warps
Conclusion
• The RFC and two-level warp scheduler reduce register file energy by 36%
  • 5.4% of SM energy, 3.8% of chip-wide energy
• Energy-efficient designs enable high performance
• A combination of many such techniques will be required for future designs
• Future work
  • Different register file hierarchies
  • Compiler-managed hierarchy rather than a hardware-managed register cache