Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware Tim Foley Mike Houston Pat Hanrahan Computer Graphics Lab Stanford University
Motivation GPU Programming • Interactive shading • Offline rendering • Computation • physical simulations • numerical methods • BrookGPU [Buck et al. 2004] • Shouldn’t be constrained by hardware limits • but demand high runtime performance
Motivation – Multipass Partitioning • Divide GPU program (shader) into a partition • set of rendering passes • each pass satisfies all resource constraints • save/restore intermediate values in textures • Many possible partitions exist • The problem: • given a program, find the best partition
Related Work • SGI’s ISL [Peercy et al. 2000] • treat OpenGL machine as SIMD processor • Recursive Dominator Split (RDS) [Chan et al. 2002] • graph partitioning of shader dag • Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004] • partition around flow control and schedule passes • Mio [Riffel et al. 2004] • instruction scheduling with backtracking
Contribution • Merging Recursive Dominator Split (MRDS) • MRDS – Extends RDS • support shaders with multiple outputs • support hardware with multiple render targets • generate more optimal partitions • same running time as RDS
Outline • Motivation • Related Work • RDS Algorithm • MRDS Algorithm • Results • Future Work
RDS - Overview • Input: dag of n nodes • shader ops • inputs • interpolants • constants • textures • Goal: mark subset of nodes as splits • split nodes define pass boundaries • 2^n possible subsets
RDS - Overview Combination of approaches to limit search space • Save/recompute decisions • primary performance tradeoff • Dominator tree • used to avoid save/recompute tradeoffs
RDS – Save / Recompute M – multiply referenced node
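The save/recompute tradeoff for a multiply referenced node can be illustrated with a small sketch. The DAG, the node names, and the cost model below are all assumptions made for illustration; they are not the cost metric used by RDS.

```python
from collections import Counter

# Hypothetical shader DAG: each op maps to the operands it reads.
# Leaves (inputs, constants, textures) have no operands.
dag = {
    "out": ["a", "b"],
    "a": ["m"],
    "b": ["m"],
    "m": ["t0", "t1"],
    "t0": [],
    "t1": [],
}

def multiply_referenced(dag):
    """Return non-leaf nodes that are read by more than one parent op."""
    refs = Counter(child for operands in dag.values() for child in operands)
    return sorted(n for n, count in refs.items() if count > 1 and dag[n])

def subtree_ops(dag, node):
    """Count the distinct non-leaf ops needed to compute `node`."""
    seen = set()
    stack = [node]
    while stack:
        n = stack.pop()
        if n in seen or not dag[n]:
            continue
        seen.add(n)
        stack.extend(dag[n])
    return len(seen)

def save_vs_recompute(dag, node, extra_uses, pass_overhead=1, fetch_cost=1):
    """Toy cost model (an assumption, not the paper's):
    save      = one extra pass to write `node`, plus one fetch per use
    recompute = re-run the node's subtree once per extra use
    """
    save = pass_overhead + fetch_cost * (extra_uses + 1)
    recompute = subtree_ops(dag, node) * extra_uses
    return {"save": save, "recompute": recompute}
```

Here `m` is the only multiply referenced node. With the toy costs, recomputing a one-op subtree is cheaper than saving it; a deeper subtree would tip the balance toward saving.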
Dominator • B dom G • all paths to G go through B
Key Insight if B, G in same pass and B dom G then no save/recompute costs for G
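The dominator relation can be checked directly from its definition: B dom G holds when G becomes unreachable from the output once B is removed. The sketch below assumes a DAG rooted at the shader output with edges pointing from each op to its operands; the graph and node names are hypothetical. A production implementation would build a dominator tree rather than test each pair separately.

```python
def reachable(dag, root, blocked=frozenset()):
    """Nodes reachable from `root` without passing through `blocked` nodes."""
    seen = set()
    stack = [root]
    while stack:
        n = stack.pop()
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(dag[n])
    return seen

def dominates(dag, root, b, g):
    """True if every path from `root` to `g` goes through `b`."""
    if b == g:
        return True
    return g in reachable(dag, root) and g not in reachable(dag, root, {b})

# Hypothetical DAG: g is used by both c and d, and both sit under b.
example = {
    "out": ["b"],
    "b": ["c", "d"],
    "c": ["g"],
    "d": ["g"],
    "g": [],
}
```

In this example `b` dominates `g`, so placing both in one pass avoids any save/recompute cost for `g`, while `c` alone does not dominate `g`.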
MRDS – Multiple-Output Hardware float4 x, y; ... for( i=0; i<N; i++ ) { x' = x*x - y*y; y' = 2*x*y; x = x'; y = y'; } ...
MRDS – Multiple-Output Hardware float4 x, y; ... for( i=0; i<N; i++ ) { x' = f( x, y ); y' = g( x, y ); x = x'; y = y'; } ...
MRDS – Multiple-Output Hardware • State cannot fit in single output float4 x, y; ... for( i=0; i<N; i++ ) { x' = f( x, y ); y' = g( x, y ); x = x'; y = y'; } ...
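A rough way to see why two state variables hurt on single-output hardware is to count passes. The sketch below is a simplification under stated assumptions: scalars stand in for the `float4` values, `f` and `g` are taken from the concrete loop shown earlier, and each pass may write either one value (single output) or both (multiple outputs).

```python
def f(x, y):
    return x * x - y * y

def g(x, y):
    return 2 * x * y

def iterate_single_output(x, y, n):
    """Each pass can write one value, so x' and y' need separate passes."""
    passes = 0
    for _ in range(n):
        new_x = f(x, y)   # pass writing only x'
        passes += 1
        new_y = g(x, y)   # second pass reads the same inputs, writes only y'
        passes += 1
        x, y = new_x, new_y
    return x, y, passes

def iterate_multi_output(x, y, n):
    """One pass per iteration writes both x' and y'."""
    passes = 0
    for _ in range(n):
        x, y = f(x, y), g(x, y)
        passes += 1
    return x, y, passes
```

Both versions compute the same state, but the single-output version uses twice as many passes, and any work shared between `f` and `g` would be repeated in each of them.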
MRDS – Dominating Sets • Dominating Set S = {A,D} • S dom G • All paths to G go through element of S • S, G in same pass • avoid save/recompute for G
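The dominating-set condition generalizes the single-node test: S dom G holds when removing every element of S disconnects G from the output. The graph below is a hypothetical stand-in for the slide's figure, with edges pointing from each op to its operands.

```python
def reachable_without(dag, root, blocked):
    """Nodes reachable from `root` while avoiding every node in `blocked`."""
    seen = set()
    stack = [root]
    while stack:
        n = stack.pop()
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(dag[n])
    return seen

def set_dominates(dag, root, s, g):
    """True if every path from `root` to `g` goes through an element of `s`."""
    if g in s:
        return True
    return (g in reachable_without(dag, root, set())
            and g not in reachable_without(dag, root, set(s)))

# Hypothetical DAG: two outputs' worth of work, a and d, both read g.
two_outputs = {
    "out": ["a", "d"],
    "a": ["g"],
    "d": ["g"],
    "g": ["t"],
    "t": [],
}
```

Neither `a` nor `d` dominates `g` alone, but the set `{a, d}` does, which is the case a multiple-output pass can exploit.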
MRDS – Pass Merging • Generate initial passes with RDS • Find potential merges • check if valid • evaluate change in cost • Execute from best to worst • revalidate • Stop when no more beneficial merges
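The merging loop above can be sketched as a greedy procedure. Everything concrete here is an assumption: passes are modeled as sets of ops and outputs, validity is a simple resource check standing in for a compiler call, the cost model and limits are invented, and ordering dependencies between passes are ignored.

```python
from itertools import combinations

PASS_OVERHEAD = 10   # hypothetical fixed cost per rendering pass
MAX_OPS = 8          # hypothetical instruction limit per pass
MAX_OUTPUTS = 4      # hypothetical number of render targets

def cost(p):
    return PASS_OVERHEAD + len(p["ops"])

def merge(a, b):
    # Ops shared by both passes appear once in the merged pass.
    return {"ops": a["ops"] | b["ops"], "outputs": a["outputs"] | b["outputs"]}

def valid(p):
    return len(p["ops"]) <= MAX_OPS and len(p["outputs"]) <= MAX_OUTPUTS

def merge_passes(initial):
    passes = dict(enumerate(initial))   # pass id -> pass
    next_id = len(passes)
    while True:
        # Find potential merges, check validity, evaluate change in cost.
        candidates = []
        for i, j in combinations(passes, 2):
            merged = merge(passes[i], passes[j])
            if valid(merged):
                gain = cost(passes[i]) + cost(passes[j]) - cost(merged)
                if gain > 0:
                    candidates.append((gain, i, j))
        if not candidates:
            return list(passes.values())   # no more beneficial merges
        # Execute from best to worst, revalidating that both passes still exist.
        candidates.sort(reverse=True)
        for gain, i, j in candidates:
            if i in passes and j in passes:
                passes[next_id] = merge(passes.pop(i), passes.pop(j))
                next_id += 1
```

Two passes that both recompute a shared op score higher than unrelated passes, because the merge removes the duplicate instruction as well as one pass of overhead.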
MRDS – Pass Merging • What if RDS chose to recompute G? • Merge between passes A and D • eliminates duplicate instructions • gets high score
MRDS – Time Complexity • Cost of merging dominated by initial search • iterates over s^2 pairs of splits • each pair requires size-s set operations and 1 compiler call • O(s^2(s+n)) • s = O(n) in worst case • MRDS = O(n^3) in worst case • in practice we expect s << n • Assumes compiler calls are linear • not true for fxc
MRDS' • RDS uses linear search for save/recompute • evaluates cost of both alternatives with RDS_h • RDS = O(n * RDS_h) = O(n^3) • MRDS merges after RDS has made these decisions • MRDS = O(RDS + n^3) = O(n^3) • MRDS' merges during cost evaluation • adds linear factor in worst case • MRDS' = O(n * (RDS_h + n^3)) = O(n^4)
Results • 3 Brook Programs • Procedural Fire • Mandelbrot Fractal • Matrix Multiply • Compiled for ATI Radeon 9800 XT with • RDS • MRDS • MRDS'
Results – Procedural Fire • MRDS' better than MRDS and RDS • better save/recompute decisions • results in less bandwidth used
Results – Mandelbrot Fractal • MRDS', MRDS better than RDS • iterative computation – state in 2 variables • RDS duplicates computation
Results – Matrix Multiply • Matrix-matrix multiply benefits from blocking • blocking cuts computation by a factor of ~2 • Blocking requires multiple outputs • performance limited by MRT performance
Summary • Modified RDS algorithm, MRDS • supports multiple-output shaders • generates code for multiple-render-targets • easy to implement, same running time • generates better-performing partitions
Future Work • Implementations • Ashli • combine with Mio • Exploit new hardware • data-dependent flow control • large numbers of outputs
Acknowledgements • Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot • RDS implementation, design discussions • Kayvon Fatahalian, Ian Buck • GPUBench results • ATI • hardware • DARPA, ATI, IBM, NVIDIA, SONY • funding