Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware

Presentation Transcript


  1. Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware • Tim Foley, Mike Houston, Pat Hanrahan • Computer Graphics Lab, Stanford University

  2. Motivation • GPU Programming • Interactive shading • Offline rendering • Computation • physical simulations • numerical methods • BrookGPU [Buck et al. 2004] • Shouldn’t be constrained by hardware limits • but demand high runtime performance

  3. Motivation – Multipass Partitioning • Divide GPU program (shader) into a partition • set of rendering passes • each pass satisfies all resource constraints • save/restore intermediate values in textures • Many possible partitions exist • The problem: • given a program, find the best partition
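
To make slide 3 concrete: below is a minimal C++ sketch (ours, not from the talk; all names hypothetical) of a computation split into two passes, with the intermediate value saved to a buffer the way a partition saves it to a texture.

    #include <cstdio>
    #include <vector>

    // Hypothetical shader: out = t + t*c, where t = a*b.
    // Pretend the whole program exceeds hardware limits, so we
    // split it into two passes around the intermediate t.
    int main() {
        const int kPixels = 4;
        std::vector<float> a = {1, 2, 3, 4}, b = {5, 6, 7, 8};
        std::vector<float> c = {2, 2, 2, 2};
        std::vector<float> t(kPixels), out(kPixels);

        // Pass 1: compute t and "save" it (texture -> buffer here).
        for (int i = 0; i < kPixels; ++i) t[i] = a[i] * b[i];

        // Pass 2: "restore" t and finish the computation.
        for (int i = 0; i < kPixels; ++i) out[i] = t[i] + t[i] * c[i];

        for (int i = 0; i < kPixels; ++i) std::printf("%g\n", out[i]);
        return 0;
    }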

  4. Related Work • SGI’s ISL [Peercy et al. 2000] • treat OpenGL machine as SIMD processor • Recursive Dominator Split (RDS) [Chan et al. 2002] • graph partitioning of shader dag • Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004] • partition around flow control and schedule passes • Mio [Riffel et al. 2004] • instruction scheduling with backtracking

  5. Contribution • Merging Recursive Dominator Split (MRDS) • MRDS – Extends RDS • support shaders with multiple outputs • support hardware with multiple render targets • generate better-performing partitions • same running time as RDS

  6. Outline • Motivation • Related Work • RDS Algorithm • MRDS Algorithm • Results • Future Work

  7–9. RDS - Overview • Input: dag of n nodes • shader ops • inputs • interpolants • constants • textures • Goal: mark subset of nodes as splits • split nodes define pass boundaries • 2^n possible subsets
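
One plausible C++ rendering of the setup on slides 7–9 (an assumed representation, not the paper's actual data structures): the shader is a dag of nodes, and a partition is just the subset of nodes marked as splits, hence 2^n candidate subsets for n nodes.

    #include <vector>

    // One shader value: its kind, the values it consumes, and a
    // flag marking it as a split (a pass boundary whose result is
    // written to a texture).
    struct Node {
        enum Kind { Op, Input, Interpolant, Constant, Texture } kind;
        std::vector<int> operands;  // indices of child nodes
        bool isSplit = false;
    };

    using ShaderDag = std::vector<Node>;

    int main() {
        ShaderDag dag(3);
        dag[0].kind = Node::Input;
        dag[1].kind = Node::Input;
        dag[2] = {Node::Op, {0, 1}, true};  // mark this op as a split
        return 0;
    }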

  10. RDS - Overview • Combination of approaches to limit search space • Save/recompute decisions • primary performance tradeoff • Dominator tree • used to avoid save/recompute tradeoffs

  11–14. RDS – Save / Recompute • M – multiply referenced node
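
The save/recompute decision on slides 11–14 reduces to a cost comparison. A sketch with invented cost weights, since the talk does not give the actual cost model:

    #include <cstdio>

    // Invented cost weights standing in for the compiler's model.
    struct Costs {
        float passOverhead;  // fixed cost of an extra pass
        float texSaveLoad;   // bandwidth to save + restore one value
        float instr;         // cost of one recomputed instruction
    };

    // Saving M pays the pass/bandwidth cost once; recomputing M
    // pays its instruction cost once per extra reference.
    bool shouldSave(const Costs& c, int instrsInM, int extraRefs) {
        float saveCost      = c.passOverhead + c.texSaveLoad;
        float recomputeCost = c.instr * instrsInM * extraRefs;
        return saveCost < recomputeCost;
    }

    int main() {
        Costs c = {100.0f, 16.0f, 1.0f};
        // 8 instructions, 3 extra references: 116 vs 24 -> recompute.
        std::printf("save M? %d\n", shouldSave(c, 8, 3));
        return 0;
    }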

  15. Dominator • B dom G • all paths to G go through B

  16. Dominator Tree

  17. Key Insight • if B, G in same pass and B dom G, then no save/recompute costs for G
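
A naive way to check the relation on slide 15 (illustrative only; RDS builds a full dominator tree instead): B dominates G exactly when no shader output can reach G once B is removed from the dag.

    #include <vector>

    struct Node { std::vector<int> operands; };
    using ShaderDag = std::vector<Node>;

    // True if g is reachable from 'from' without passing through b.
    static bool reaches(const ShaderDag& dag, int from, int g, int b,
                        std::vector<bool>& seen) {
        if (from == b || seen[from]) return false;
        if (from == g) return true;
        seen[from] = true;
        for (int child : dag[from].operands)
            if (reaches(dag, child, g, b, seen)) return true;
        return false;
    }

    // b dom g: every path from a root (shader output) to g goes
    // through b.
    bool dominates(const ShaderDag& dag, const std::vector<int>& roots,
                   int b, int g) {
        for (int r : roots) {
            std::vector<bool> seen(dag.size(), false);
            if (reaches(dag, r, g, b, seen)) return false;
        }
        return true;
    }

    int main() {
        // Chain 0 -> 1 -> 2: node 1 dominates node 2.
        ShaderDag dag = {{{1}}, {{2}}, {{}}};
        return dominates(dag, {0}, 1, 2) ? 0 : 1;
    }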

  18–19. MRDS – Multiple-Output Shaders

  20. MRDS – Multiple-Output Hardware

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = x*x - y*y;
        y' = 2*x*y;
        x = x';
        y = y';
    }
    ...

  21–22. MRDS – Multiple-Output Hardware

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = f( x, y );
        y' = g( x, y );
        x = x';
        y = y';
    }
    ...

  23–24. MRDS – Multiple-Output Hardware • State cannot fit in single output

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = f( x, y );
        y' = g( x, y );
        x = x';
        y = y';
    }
    ...
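
A CPU-side analogy in C++ for slides 20–24, with hypothetical f and g and scalars in place of float4: the loop state lives in two variables, so single-output hardware needs either two passes per iteration or recomputation, while multiple render targets let one pass write both new values.

    #include <cstdio>

    // Stand-ins for the slide's f and g: one complex-square step.
    static float f(float x, float y) { return x * x - y * y; }
    static float g(float x, float y) { return 2 * x * y; }

    int main() {
        const int N = 8;
        float x = 0.3f, y = 0.5f;
        for (int i = 0; i < N; ++i) {
            // Single output: pass 1 writes x', pass 2 writes y',
            // each re-reading the old x, y from textures.
            // With MRT, one pass writes both x' and y'.
            float xNext = f(x, y);
            float yNext = g(x, y);
            x = xNext;
            y = yNext;
        }
        std::printf("x = %g, y = %g\n", x, y);
        return 0;
    }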

  25. MRDS – Dominating Sets • Dominating Set S = {A,D} • S dom G • All paths to G go through element of S • S, G in same pass • avoid save/recompute for G

  26–30. MRDS – Pass Merging • Generate initial passes with RDS • Find potential merges • check if valid • evaluate change in cost • Execute from best to worst • revalidate • Stop when no more beneficial merges
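
A hedged C++ outline of the greedy loop on slides 26–30; the Pass type, validity test, and benefit score are placeholders the talk does not spell out.

    #include <vector>

    struct Pass { /* nodes, outputs, resource usage (elided) */ };

    // Placeholders: a real implementation checks the merged pass
    // against every hardware limit and re-runs the cost model.
    bool  mergeIsValid(const Pass&, const Pass&) { return false; }
    float mergeBenefit(const Pass&, const Pass&) { return 0.0f; }
    Pass  mergePair(const Pass& a, const Pass&)  { return a; }

    void mergePasses(std::vector<Pass>& passes) {
        for (;;) {
            float best = 0.0f;          // only accept beneficial merges
            int bi = -1, bj = -1;
            // Find the best valid merge among all pairs of passes.
            for (int i = 0; i < (int)passes.size(); ++i)
                for (int j = i + 1; j < (int)passes.size(); ++j) {
                    if (!mergeIsValid(passes[i], passes[j])) continue;
                    float ben = mergeBenefit(passes[i], passes[j]);
                    if (ben > best) { best = ben; bi = i; bj = j; }
                }
            if (bi < 0) return;          // no beneficial merges remain
            // Execute the merge; the next iteration revalidates the
            // remaining candidates against the new pass set.
            passes[bi] = mergePair(passes[bi], passes[bj]);
            passes.erase(passes.begin() + bj);
        }
    }

    int main() {
        std::vector<Pass> passes(3);
        mergePasses(passes);
        return 0;
    }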

  31–32. MRDS – Pass Merging • What if RDS chose to recompute G? • Merge between passes A and D • eliminates duplicate instructions • gets high score

  33. MRDS – Time Complexity • Cost of merging dominated by initial search • iterates over s^2 pairs of splits • each pair requires size-s set operations and 1 compiler call • O(s^2(s+n)) • s = O(n) in worst case • MRDS = O(n^3) in worst case • in practice we expect s << n • Assumes compiler calls are linear • not true for fxc

  34. MRDS' • RDS uses linear search for save/recompute • evaluates cost of both alternatives with RDS_h • RDS = O(n * RDS_h) = O(n^3) • MRDS merges after RDS has made these decisions • MRDS = O(RDS + n^3) = O(n^3) • MRDS' merges during cost evaluation • adds linear factor in worst case • MRDS' = O(n * (RDS_h + n^3)) = O(n^4)
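
Collecting the running-time claims of slides 33–34 in one place (RDS_h is RDS's per-candidate subroutine; the slides' assumption of linear compiler calls is carried through):

    \begin{aligned}
    \text{merge search} &= O\big(s^2 (s + n)\big) = O(n^3) \quad \text{since } s = O(n) \\
    \text{RDS}   &= O(n \cdot \mathrm{RDS}_h) = O(n^3) \\
    \text{MRDS}  &= O(\text{RDS} + n^3) = O(n^3) \\
    \text{MRDS}' &= O\big(n \cdot (\mathrm{RDS}_h + n^3)\big) = O(n^4)
    \end{aligned}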

  35. Results • 3 Brook Programs • Procedural Fire • Mandelbrot Fractal • Matrix Multiply • Compiled for ATI Radeon 9800 XT with • RDS • MRDS • MRDS'

  36. Results – Procedural Fire • MRDS' better than MRDS and RDS • better save/recompute decisions • results in less bandwidth used

  37. Results – Compile Times

  38. Results – Mandelbrot Fractal • MRDS', MRDS better than RDS • iterative computation – state in 2 variables • RDS duplicates computation

  39. Results – Matrix Multiply • Matrix-matrix multiply benefits from blocking • blocking cuts computation roughly in half • Blocking requires multiple outputs • performance limited by MRT performance

  40. Summary • Modified RDS algorithm, MRDS • supports multiple-output shaders • generates code for multiple-render-targets • easy to implement, same running time • generates better-performing partitions

  41. Future Work • Implementations • Ashli • combine with Mio • Exploit new hardware • data-dependent flow control • large numbers of outputs

  42. Acknowledgements • Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot • RDS implementation, design discussions • Kayvon Fatahalian, Ian Buck • GPUBench results • ATI • hardware • DARPA, ATI, IBM, NVIDIA, SONY • funding
