Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Mark Gebhart (The University of Texas)
Daniel R. Johnson (University of Illinois)
Stephen W. Keckler (NVIDIA / The University of Texas)
William J. Dally (NVIDIA / Stanford University)
David Tarjan (NVIDIA)
Erik Lindholm (NVIDIA)
Kevin Skadron (University of Virginia)
Motivation
• All processors are effectively power limited
  • From mobile, to desktop, to enterprise, to GPU
  • Energy efficiency is a primary design constraint
• Throughput processors
  • Massive multithreading to tolerate memory latency
  • Large register files
  • Complicated thread scheduler
• Focus of this work
  • Improve the energy efficiency of the register file and scheduler without sacrificing performance
Optimization Opportunity #1
• A large number of threads hides two types of latency
  • Long: global memory access (~400 cycles)
  • Short: ALU and shared memory access (8–20 cycles)
• Partition threads into two sets
  • Active threads for short-latency events
  • Inactive threads for long-latency events
• Simplify the thread scheduler
Optimization Opportunity #2
• Examined register reuse patterns of GPU workloads
• Up to 40% of values are read only once, within 3 instructions of being produced
• Exploit register file locality with a cache
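The reuse measurement above can be sketched as a simple pass over an instruction trace: for every value produced, record where it is read, then count values read exactly once within a small window. This is a minimal illustration, not the paper's tooling; the trace format and example program below are assumptions (the example mirrors the short sequence shown later in the deck).

```python
# Sketch of a register reuse-distance analysis over an instruction trace.
# Counts values that are read exactly once, within `window` instructions
# of being produced. Trace format is hypothetical: (dest, [srcs]) tuples.

def reuse_stats(trace, window=3):
    produced_at = {}   # register name -> index of its producing instruction
    reads = {}         # producing index -> indices of instructions reading it
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in produced_at:
                reads.setdefault(produced_at[s], []).append(i)
        produced_at[dest] = i          # a new definition kills the old value
        reads.setdefault(i, [])
    short_lived = sum(
        1 for p, rs in reads.items()
        if len(rs) == 1 and rs[0] - p <= window
    )
    return short_lived, len(reads)

# Example trace mirroring the five-instruction sequence shown later.
trace = [
    ("R3", ["R1", "R2"]),   # add R3, R1, R2
    ("R4", ["R1", "R3"]),   # sub R4, R1, R3
    ("R5", ["R4"]),         # ld.global R5, R4
    ("R6", ["R3", "R4"]),   # mul R6, R3, R4
    ("R7", ["R5", "R6"]),   # div R7, R5, R6
]
```

On this toy trace, two of the five produced values (R5 and R6) are read exactly once within three instructions, which is the kind of locality an RFC can capture.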
Outline
• Motivation
• Proposed Architecture
  • Baseline GPU
  • Thread Scheduling
  • Register File Caching
• Evaluation
• Conclusion
Baseline GPU Architecture
• 16 streaming multiprocessors (SMs) per chip
• Small amount of on-chip cache
• Memory interface designed to maximize bandwidth rather than minimize latency
Single-Level Warp Scheduler
• 1024 threads per SM to tolerate latency
  • 32 warps × 32 threads per warp
• Scheduler chooses one warp to execute each cycle
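The cost the two-level design attacks is visible in a sketch of this baseline: every cycle the scheduler must consider all 32 warps to find one that is ready. The round-robin policy below is an illustrative assumption; the deck does not specify the baseline selection policy.

```python
# Hedged sketch of a single-level warp scheduler: each cycle, scan all
# warps (round-robin from the last issued warp) for one ready to issue.
# The selection policy here is assumed for illustration.

def select_warp(ready, last_issued, num_warps=32):
    """ready: set of warp ids able to issue this cycle."""
    for offset in range(1, num_warps + 1):
        w = (last_issued + offset) % num_warps
        if w in ready:
            return w
    return None   # no warp ready: the pipeline stalls this cycle
```

Selecting among all 32 warps every cycle is what makes the scheduler's comparison logic large; the two-level scheme on the next slide shrinks that set.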
Two-Level Warp Scheduler
• Only active warps issue instructions
• Simplified scheduler chooses among the active warps
• Long-latency events trigger active warps to be descheduled
• Goal: minimize the number of active warps without harming performance
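The two-level policy described above can be sketched as two warp pools: the inner scheduler only ever selects from the small active set, a long-latency event (such as a global load) demotes a warp to the pending set, and a completed event promotes it back. This is a minimal model under assumed semantics, not the paper's hardware design; class and method names are illustrative.

```python
# Minimal sketch of a two-level warp scheduler: a small active pool the
# inner scheduler round-robins over, and a pending pool for warps waiting
# on long-latency events.

from collections import deque

class TwoLevelScheduler:
    def __init__(self, warps, max_active=8):
        self.active = deque(warps[:max_active])    # eligible to issue
        self.pending = deque(warps[max_active:])   # waiting on long latency

    def issue(self):
        """Round-robin: pick the next active warp to issue from."""
        if not self.active:
            return None
        warp = self.active.popleft()
        self.active.append(warp)
        return warp

    def deschedule(self, warp):
        """Long-latency event (e.g. global load): demote the warp."""
        self.active.remove(warp)
        self.pending.append(warp)

    def wakeup(self, warp):
        """Event completed: the warp becomes eligible to issue again."""
        self.pending.remove(warp)
        self.active.append(warp)
```

With `max_active=8`, the issue logic only ever compares 8 candidates instead of 32, which is the source of the scheduler simplification the slide claims.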
Baseline SM
• Register file heavily banked for high bandwidth
• 32 SIMT lanes
• Reduced number of special function units
• Low-latency access to scratchpad memory
Two-Level Warp Scheduler and RFC
• Register file cache (RFC)
  • Close to the functional units
  • 21× smaller than the main register file (MRF)
• Only active warps have RFC entries
• When a warp is descheduled, its RFC entries are flushed
• Static liveness information prevents write-back of dead values
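The flush policy above can be sketched as follows: when a warp is descheduled, every cached value is either written back to the MRF or, if static liveness information has marked it dead, simply dropped, saving the write-back energy. A minimal sketch under assumed semantics; the class and method names are hypothetical, not the paper's.

```python
# Sketch of the RFC write-back policy on warp deschedule: live values are
# written back to the MRF, values marked dead by static liveness info are
# dropped without a write-back.

class RegisterFileCache:
    def __init__(self):
        self.entries = {}   # register name -> cached value
        self.dead = set()   # registers marked dead by static liveness info

    def write(self, reg, value):
        self.entries[reg] = value

    def mark_dead(self, reg):
        """Static liveness info: this value will never be read again."""
        self.dead.add(reg)

    def flush(self, mrf):
        """Warp descheduled: write back only live values, then clear."""
        written_back = 0
        for reg, value in self.entries.items():
            if reg not in self.dead:
                mrf[reg] = value     # energy spent only on live values
                written_back += 1
        self.entries.clear()
        self.dead.clear()
        return written_back
```

In the execution example on the next slide, values like R3 and R4 that have already seen their last read fall into the dead set and are never written back.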
Program Execution (animation slide: values move between the RFC and MRF as the sequence executes)
• Instruction sequence:
  add R3, R1, R2
  sub R4, R1, R3
  ld.global R5, R4
  mul R6, R3, R4
  div R7, R5, R6
• On the long-latency load, the warp is descheduled until the load completes and the RFC is flushed; dead values are not written back to the MRF
Outline
• Motivation
• Background
• Proposed Architecture
• Evaluation
  • Methodology
  • Performance
  • Energy
• Conclusion
Methodology
• Simulator
  • Custom trace-based performance simulator
• Workloads
  • 19 video processing: H.264 encoding, video enhancement
  • 11 simulation: molecular dynamics, computational graphics, path finding
  • 7 image processing: image blur, JPEG
  • 18 HPC: DGEMM, SGEMM, FFT
  • 155 shader: 12 recent video games
• Energy modeling
  • Synthesized SRAM banks and flip-flop arrays to measure energy per access
  • Wire energy modeled as a function of distance traveled
Performance Evaluation
• Minimal performance loss with 8 active warps
• 3% performance loss with 6 active warps
• 15% performance loss with 4 active warps
Caching Evaluation
• A 6-entry-per-thread RFC removes 40–80% of MRF accesses
Energy Evaluation
• 36% register file energy reduction with no performance loss
  • 6 RFC entries per thread, 8 active warps
Conclusion
• The RFC and two-level warp scheduler reduce register file energy by 36%
  • 5.4% of SM energy, 3.8% of chip-wide energy
• Energy-efficient designs enable high performance
• A combination of many such techniques will be required for future designs
• Future work
  • Different register file hierarchies
  • Compiler-managed hierarchy rather than a hardware-managed register cache