Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
José-María Arnau, Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
Focusing on Mobile GPUs
• Market demands (1)
• Technology limitations (2)
→ Energy-efficient mobile GPUs
Image sources:
(1) http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html (Samsung Galaxy SII vs. Samsung Galaxy Note running the game Shadow Gun 3D)
(2) http://www.ispsd.com/02/battery-psd-templates/
GPU Performance and Memory
A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games.
• Graphical workloads:
  • Large working sets not amenable to caching
  • Texture memory accesses are fine-grained and unpredictable
• Traditional techniques to deal with memory:
  • Caches
  • Prefetching
  • Multithreading
Outline
• Background
• Methodology
• Multithreading & Prefetching
• Decoupled Access/Execute
• Conclusions
Assumed GPU Architecture
Assumed Fragment Processor
• Warp: group of threads executed in lockstep (SIMD group)
• 4 threads per warp
• 4-wide vector registers (16 bytes)
• 36 registers per thread
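A minimal sketch (not the authors' simulator code) of the warp organization listed above: four threads advance in lockstep behind a single program counter, each holding its own set of 36 four-wide vector registers. All type and member names are illustrative.

#include <array>
#include <cstdint>

struct Vec4 { float x, y, z, w; };               // one 16-byte vector register

struct ThreadContext {
    std::array<Vec4, 36> regs;                    // 36 registers per thread
};

struct Warp {
    static constexpr int kLanes = 4;              // 4 threads per warp (SIMD group)
    std::array<ThreadContext, kLanes> lanes;
    uint32_t pc = 0;                              // single PC: lanes run in lockstep

    // Apply one instruction to every lane before advancing the shared PC.
    template <typename Instr>
    void step(const Instr& instr) {
        for (ThreadContext& lane : lanes) instr(lane);
        ++pc;
    }
};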
Methodology
• Power model: CACTI 6.5 and Qsilver
Workload Selection
• 2D games:
  • Small/medium sized textures
  • Texture filtering: 1 memory access
  • Small fragment programs
• Simple 3D games:
  • Small/medium sized textures
  • Texture filtering: 1-4 memory accesses
  • Small/medium fragment programs
• Complex 3D games:
  • Medium/big sized textures
  • Texture filtering: 4-8 memory accesses
  • Big, memory intensive fragment programs
Improving Performance Using Multithreading
• Very effective
• High energy cost (25% more energy)
• Huge register file needed to maintain the state of all the threads
• 36 KB main register file (MRF) for a GPU with 16 warps/core (bigger than the L2)
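A quick back-of-the-envelope check of the 36 KB figure, using the parameters from the "Assumed Fragment Processor" slide (36 registers/thread, 16 bytes/register, 4 threads/warp, 16 warps/core):

#include <cstdio>

int main() {
    constexpr int kRegistersPerThread = 36;   // per-thread registers
    constexpr int kBytesPerRegister   = 16;   // 4-wide vector of 4-byte components
    constexpr int kThreadsPerWarp     = 4;    // SIMD width assumed in the talk
    constexpr int kWarpsPerCore       = 16;   // heavily multithreaded configuration

    constexpr int mrf_bytes = kRegistersPerThread * kBytesPerRegister *
                              kThreadsPerWarp * kWarpsPerCore;
    std::printf("Main register file per core: %d KB\n", mrf_bytes / 1024);  // 36 KB
}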
Employing Prefetching
• Hardware prefetchers:
  • Global History Buffer
    K. J. Nesbit and J. E. Smith. "Data Cache Prefetching Using a Global History Buffer". HPCA, 2004.
  • Many-Thread Aware
    J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc. "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications". MICRO, 2010.
• Prefetching is effective, but there is still ample room for improvement
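For context, a heavily simplified, software-only sketch of the delta-correlation idea behind the Global History Buffer prefetcher cited above: on each miss, the two most recent address deltas are matched against older history, and the deltas that followed the match are replayed as prefetch candidates. The real design uses an index table plus a linked FIFO of miss addresses, and the history length and prefetch degree below are assumptions, not the configuration evaluated in the talk.

#include <cstddef>
#include <cstdint>
#include <vector>

class GhbDeltaPrefetcher {
public:
    explicit GhbDeltaPrefetcher(std::size_t history = 256, int degree = 2)
        : history_(history), degree_(degree) {}

    // Record a miss address; return addresses worth prefetching.
    std::vector<uint64_t> on_miss(uint64_t addr) {
        misses_.push_back(addr);
        if (misses_.size() > history_) misses_.erase(misses_.begin());

        std::vector<uint64_t> prefetches;
        const std::size_t n = misses_.size();
        if (n < 5) return prefetches;  // need the current delta pair plus history

        const int64_t d1 = static_cast<int64_t>(misses_[n - 1] - misses_[n - 2]);
        const int64_t d2 = static_cast<int64_t>(misses_[n - 2] - misses_[n - 3]);

        // Walk the history backwards looking for the same (d2, d1) delta pair.
        for (std::size_t i = n - 2; i >= 3; --i) {
            const int64_t h1 = static_cast<int64_t>(misses_[i - 1] - misses_[i - 2]);
            const int64_t h2 = static_cast<int64_t>(misses_[i - 2] - misses_[i - 3]);
            if (h1 != d1 || h2 != d2) continue;
            // Replay the deltas that followed the historical match.
            uint64_t next = addr;
            for (int k = 0; k < degree_ && i + k < n; ++k) {
                next += misses_[i + k] - misses_[i + k - 1];
                prefetches.push_back(next);
            }
            break;
        }
        return prefetches;
    }

private:
    std::size_t history_;
    int degree_;
    std::vector<uint64_t> misses_;  // global FIFO history of miss addresses
};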
Decoupled Access/Execute
• Use the fragment information to compute the addresses that will be requested when processing the fragment
• Issue memory requests while the fragments are waiting in the tile queue
• Tile queue size:
  • Too small: timeliness is not achieved
  • Too big: cache conflicts
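A minimal sketch of this decoupling, assuming a simple nearest-texel addressing scheme: the access side computes and prefetches a fragment's texture addresses as it enters the tile queue, and the execute side shades it later, ideally hitting in the L1 texture cache. The helper names, texture layout, and base addresses are illustrative assumptions, not the paper's hardware interface.

#include <cstdint>
#include <deque>
#include <vector>

struct Fragment {
    float u, v;        // interpolated texture coordinates in [0, 1]
    int   texture_id;  // which texture the fragment program will sample
};

// Illustrative address generation: nearest-texel lookup into 256x256 textures
// with 4-byte texels at fixed, made-up base addresses.
std::vector<uint64_t> texel_addresses(const Fragment& f) {
    static const uint64_t kTexBase[] = {0x100000, 0x200000};
    constexpr int kWidth = 256, kTexelBytes = 4;
    const int x = static_cast<int>(f.u * (kWidth - 1));
    const int y = static_cast<int>(f.v * (kWidth - 1));
    return {kTexBase[f.texture_id] +
            static_cast<uint64_t>(y * kWidth + x) * kTexelBytes};
}

// Placeholder for a prefetch port into the L1 texture cache model.
void issue_prefetch(uint64_t /*addr*/) {}

class DecoupledTileQueue {
public:
    // Access side: runs when the fragment enters the tile queue, long before
    // it is shaded, so the memory requests overlap with the queueing delay.
    void push(const Fragment& f) {
        for (uint64_t addr : texel_addresses(f))  // addresses the fragment will need
            issue_prefetch(addr);                  // warm the L1 texture cache
        queue_.push_back(f);
    }

    // Execute side: fragments are shaded later; texture fetches should now hit.
    bool pop(Fragment& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    // The queue depth is the timeliness knob from the bullets above: too small
    // and prefetches arrive late, too large and prefetched lines evict each other.
    std::deque<Fragment> queue_;
};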
Inter-Core Data Sharing
• 66.3% of cache misses are requests to data available in the L1 cache of another fragment processor
• Use the prefetch queue to detect inter-core data sharing
  • Saves bandwidth to the L2 cache
  • Saves power (L1 caches are smaller than the L2)
  • Associative comparisons require additional energy
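A sketch of the remote-L1 lookup idea under the assumption that each fragment processor can be probed for the lines it currently caches; the talk's hardware detects sharing through the prefetch queue, which this sketch abstracts away. Names and interfaces are illustrative.

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct FragmentProcessor {
    std::unordered_set<uint64_t> l1_lines;  // cache lines currently held in this L1

    bool probe(uint64_t line) const { return l1_lines.count(line) != 0; }
};

enum class Source { RemoteL1, L2 };

// Decide where a miss from core `requester` is serviced from.
Source service_miss(const std::vector<FragmentProcessor>& cores,
                    std::size_t requester, uint64_t line) {
    for (std::size_t c = 0; c < cores.size(); ++c) {
        if (c == requester) continue;
        if (cores[c].probe(line))     // associative check across sibling L1s
            return Source::RemoteL1;  // core-to-core transfer; L2 bandwidth saved
    }
    return Source::L2;                // fall back to the shared L2 cache
}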
Decoupled Access/Execute
• 33% faster than hardware prefetchers, with 9% energy savings
• DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, providing 34% energy savings
Benefits of Remote L1 Cache Accesses
• Single-threaded GPU
• Baseline: Global History Buffer prefetcher
• 30% speedup
• 5.4% energy savings
Conclusions
• High-performance, energy-efficient GPUs can be architected based on the decoupled access/execute concept
• A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional-unit latency) provides the most energy-efficient solution
• Allowing remote L1 cache accesses provides L2 cache bandwidth savings and energy savings
• The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings
Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
Thank you! Questions?
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)