
Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor



Presentation Transcript


  1. Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor José-María Arnau, Joan-Manuel Parcerisa (UPC) Polychronis Xekalakis (Intel)

  2. Focusing on Mobile GPUs • Market demands¹ and technology limitations² call for energy-efficient mobile GPUs ¹ Samsung Galaxy SII vs. Samsung Galaxy Note running the 3D game Shadow Gun: http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html ² http://www.ispsd.com/02/battery-psd-templates/ Jose-Maria Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis 2

  3. GPU Performance and Memory A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games • Graphical workloads: • Large working sets not amenable to caching • Texture memory accesses are fine-grained and unpredictable • Traditional techniques to deal with memory: • Caches • Prefetching • Multithreading

  4. Outline • Background • Methodology • Multithreading & Prefetching • Decoupled Access/Execute • Conclusions

  5. Assumed GPU Architecture

  6. Assumed Fragment Processor • Warp: a group of threads executed in lockstep (SIMD group) • 4 threads per warp • 4-wide vector registers (16 bytes) • 36 registers per thread

  7. Methodology Power model: CACTI 6.5 and Qsilver

  8. Workload Selection • 2D games: small/medium-sized textures; texture filtering: 1 memory access; small fragment programs • Simple 3D games: small/medium-sized textures; texture filtering: 1-4 memory accesses; small/medium fragment programs • Complex 3D games: medium/big-sized textures; texture filtering: 4-8 memory accesses; big, memory-intensive fragment programs

  9. Improving Performance Using Multithreading • Very effective • High energy cost (25% more energy) • Huge register file to maintain the state of all the threads • 36 KB MRF for a GPU with 16 warps/core (bigger than the L2)
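The 36 KB figure follows directly from the fragment-processor parameters on slide 6; a quick arithmetic check (a sketch in Python, using only the sizes stated in the slides):

```python
# MRF sizing from the slides' fragment-processor parameters:
# 4 threads/warp, 4-wide 16-byte vector registers, 36 registers/thread.
REG_BYTES = 16          # 4-wide vector register (4 lanes x 4 bytes)
REGS_PER_THREAD = 36
THREADS_PER_WARP = 4

def mrf_bytes(warps_per_core: int) -> int:
    """Total register storage one core must keep live for all its warps."""
    return warps_per_core * THREADS_PER_WARP * REGS_PER_THREAD * REG_BYTES

print(mrf_bytes(16))    # 36864 bytes = 36 KB, matching the slide
print(mrf_bytes(2))     # 4608 bytes: a 2-warp decoupled core needs far less state
```

This is why the talk treats heavy multithreading as energy-hungry: the register file alone outgrows the L2 cache at 16 warps/core.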

  10. Employing Prefetching • Hardware prefetchers: • Global History Buffer • K. J. Nesbit and J. E. Smith. “Data Cache Prefetching Using a Global History Buffer”. HPCA, 2004. • Many-Thread Aware • J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc. “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications”. MICRO, 2010. • Prefetching is effective but there is still ample room for improvement
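As a rough illustration of the Global History Buffer baseline, the sketch below implements PC-localized delta correlation, the prediction scheme the GHB is typically used for. It is a simplification: the real GHB is a single circular miss buffer with per-PC link pointers, whereas here a per-PC deque stands in for that structure.

```python
from collections import defaultdict, deque

class DeltaCorrelationPrefetcher:
    """Illustrative PC-localized delta-correlation prefetcher in the spirit
    of Nesbit & Smith (HPCA 2004). Not the paper's exact hardware."""

    def __init__(self, history=8, degree=2):
        self.history = defaultdict(lambda: deque(maxlen=history))
        self.degree = degree                  # prefetches issued per miss

    def on_miss(self, pc, addr):
        h = self.history[pc]
        h.append(addr)
        addrs = list(h)
        deltas = [b - a for a, b in zip(addrs, addrs[1:])]
        if len(deltas) < 3:
            return []                         # not enough history yet
        key = (deltas[-2], deltas[-1])        # most recent delta pair
        for i in range(len(deltas) - 3, 0, -1):
            if (deltas[i - 1], deltas[i]) == key:
                preds, base = [], addr
                for d in deltas[i + 1:i + 1 + self.degree]:
                    base += d                 # replay the deltas that followed
                    preds.append(base)
                return preds
        return []

pf = DeltaCorrelationPrefetcher()
for a in [0, 4, 12, 16, 24]:                  # repeating +4, +8 stride pattern
    preds = pf.on_miss(pc=0, addr=a)
print(preds)                                  # [28, 36]
```

The slides' point is that even a correlating prefetcher like this struggles with the fine-grained, hard-to-predict texture accesses of the complex 3D workloads.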

  11. Decoupled Access/Execute • Use the fragment information to compute the addresses that will be requested when processing the fragment • Issue memory requests while the fragments are waiting in the tile queue • Tile queue size: • Too small: timeliness is not achieved • Too big: cache conflicts
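The "too big: cache conflicts" side of the tile-queue tradeoff can be seen in a toy model (a sketch, not the paper's hardware; one texel line per fragment, prefetches assumed to complete instantly, so the "too small: timeliness" side would additionally need a memory-latency model):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully-associative LRU set of line addresses (stand-in for the L1)."""
    def __init__(self, lines):
        self.capacity, self.data = lines, OrderedDict()
    def touch(self, addr):
        hit = addr in self.data
        self.data[addr] = True
        self.data.move_to_end(addr)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used
        return hit

def run_dae(texel_lines, queue_size, cache_lines=8):
    """When a fragment enters the tile queue its texel line is prefetched;
    the execute stage consumes it queue_size fragments later. Returns how
    many execute-stage reads still found their line resident."""
    cache, hits = LRUCache(cache_lines), 0
    for i, line in enumerate(texel_lines):
        cache.touch(line)                     # access stage: prefetch early
        if i >= queue_size:                   # execute stage lags behind
            hits += cache.touch(texel_lines[i - queue_size])
    for j in range(max(0, len(texel_lines) - queue_size), len(texel_lines)):
        hits += cache.touch(texel_lines[j])   # drain the remaining queue
    return hits

lines = [64 * i for i in range(50)]           # 50 distinct texel lines
print(run_dae(lines, queue_size=3))           # 50: every read hits in time
print(run_dae(lines, queue_size=12))          # 0: prefetched lines evicted
```

With a queue deeper than the cache can cover, each line is pushed out by newer prefetches before the execute stage reaches it, which is exactly the conflict problem the slide warns about.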

  12. Inter-Core Data Sharing • 66.3% of cache misses are requests for data available in the L1 cache of another fragment processor • Use the prefetch queue to detect inter-core data sharing • Saves bandwidth to the L2 cache • Saves power (L1 caches are smaller than the L2) • Associative comparisons require additional energy
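The lookup path the slide describes can be sketched as follows (names and structure are illustrative; in the talk the sharing check reuses the prefetch queue rather than a full associative probe of every remote L1):

```python
def memory_lookup(addr, core, l1s, stats):
    """On a local L1 miss, check the other fragment processors' L1 contents
    before spending L2 bandwidth. l1s is a list of per-core sets of resident
    line addresses; stats tallies where each request was served."""
    if addr in l1s[core]:
        stats["local_hit"] += 1
    elif any(addr in other for i, other in enumerate(l1s) if i != core):
        stats["remote_hit"] += 1      # served core-to-core: no L2 traffic
        l1s[core].add(addr)           # fill the local L1 with the line
    else:
        stats["l2_access"] += 1       # true miss: go to the L2
        l1s[core].add(addr)

l1s = [{0x100}, {0x200}, set(), set()]
stats = {"local_hit": 0, "remote_hit": 0, "l2_access": 0}
memory_lookup(0x200, 0, l1s, stats)   # resident in core 1: remote hit
memory_lookup(0x300, 0, l1s, stats)   # nowhere: L2 access
memory_lookup(0x100, 0, l1s, stats)   # local hit
print(stats)                          # {'local_hit': 1, 'remote_hit': 1, 'l2_access': 1}
```

With 66.3% of misses resolvable this way, most L2 traffic can be avoided, at the cost of the extra associative comparisons the slide notes.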

  13. Decoupled Access/Execute • 33% faster than hardware prefetchers, 9% energy savings • DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, providing 34% energy savings

  14. Benefits of Remote L1 Cache Accesses • Single-threaded GPU • Baseline: Global History Buffer • 30% speedup • 5.4% energy savings

  15. Conclusions • High-performance, energy-efficient GPUs can be architected around the decoupled access/execute concept • A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional-unit latency) provides the most energy-efficient solution • Allowing remote L1 cache accesses saves L2 cache bandwidth and energy • The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings

  16. Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor Thank you! Questions? José-María Arnau (UPC) Joan-Manuel Parcerisa (UPC) Polychronis Xekalakis (Intel)
