Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware

Tim Foley, Mike Houston, Pat Hanrahan • Computer Graphics Lab, Stanford University

Motivation – GPU Programming • Interactive shading • Offline rendering • Computation • physical simulations • numerical methods • BrookGPU [Buck et al. 2004] • Shouldn’t be constrained by hardware limits • but demand high runtime performance

Motivation – Multipass Partitioning • Divide GPU program (shader) into a partition • set of rendering passes • each pass satisfies all resource constraints • save/restore intermediate values in textures • Many possible partitions exist • The problem: • given a program, find the best partition

Related Work • SGI’s ISL [Peercy et al. 2000] • treat OpenGL machine as SIMD processor • Recursive Dominator Split (RDS) [Chan et al. 2002] • graph partitioning of shader dag • Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004] • partition around flow control and schedule passes • Mio [Riffel et al. 2004] • instruction scheduling with backtracking

Contribution • Merging Recursive Dominator Split (MRDS) • MRDS – Extends RDS • supports shaders with multiple outputs • supports hardware with multiple render targets • generates better-performing partitions • same running time as RDS

Outline • Motivation • Related Work • RDS Algorithm • MRDS Algorithm • Results • Future Work

RDS - Overview • Input: dag of n nodes • shader ops • inputs • interpolants • constants • textures • Goal: mark subset of nodes as splits • split nodes define pass boundaries • 2^n possible subsets

RDS - Overview Combination of approaches to limit search space • Save/recompute decisions • primary performance tradeoff • Dominator tree • used to avoid save/recompute tradeoffs

RDS – Save / Recompute M – multiply referenced node

Dominator • B dom G • all paths to G go through B

Dominator Tree

Key Insight if B, G in same pass and B dom G then no save/recompute costs for G
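The dominator relation used here can be checked directly on a small graph. The following is a minimal sketch, not the RDS implementation: it assumes the shader dag is stored as a dictionary from each node to the operands it reads, with the shader output as the root. The node names and the helper names `reachable` and `dominates` are hypothetical. The idea is that B dom G exactly when G can no longer be reached from the root once B is removed.

```python
def reachable(children, root, blocked=frozenset()):
    """Return the nodes reachable from root without entering any blocked node."""
    if root in blocked:
        return set()
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen


def dominates(children, root, b, g):
    """True if every path from root to g passes through b (assumes g is reachable)."""
    return g not in reachable(children, root, blocked={b})


# Hypothetical dag: the output reads A and B; M is referenced by A, B and G.
example_dag = {
    "out": ["A", "B"],
    "A": ["M"],
    "B": ["M", "G"],
    "G": ["M"],
}

b_dom_g = dominates(example_dag, "out", "B", "G")  # G is only reachable via B
a_dom_m = dominates(example_dag, "out", "A", "M")  # M is also reachable via B
```

In this toy graph G sits entirely below B, so placing both in one pass needs no save or recompute decision for G, while the multiply referenced node M still does.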

MRDS – Multiple-Output Shaders

MRDS – Multiple-Output Hardware float4 x, y; ... for( i=0; i<N; i++ ) { x' = x*x - y*y; y' = 2*x*y; x = x'; y = y'; } ...

MRDS – Multiple-Output Hardware float4 x, y; ... for( i=0; i<N; i++ ) { x' = f( x, y ); y' = g( x, y ); x = x'; y = y'; } ...

MRDS – Multiple-Output Hardware • State cannot fit in single output float4 x, y; ... for( i=0; i<N; i++ ) { x' = f( x, y ); y' = g( x, y ); x = x'; y = y'; } ...
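To make the cost of splitting this loop concrete, here is a rough sketch in Python rather than shader code. It assumes the update from the earlier slide (x' = x*x - y*y, y' = 2*x*y) and simply counts passes per iteration; the function names and the pass-counting model are illustrative assumptions, not measurements from the talk.

```python
def step_single_output(x, y):
    """One iteration when each pass can write only one value: two passes."""
    new_x = x * x - y * y  # pass 1: reads old x, y and writes x'
    new_y = 2 * x * y      # pass 2: reads the same old x, y and writes y'
    return new_x, new_y, 2


def step_multi_output(x, y):
    """One iteration when a pass can write both values: one pass."""
    return x * x - y * y, 2 * x * y, 1


def run(step, x, y, iterations):
    """Apply step repeatedly and return the final state and total pass count."""
    passes = 0
    for _ in range(iterations):
        x, y, used = step(x, y)
        passes += used
    return x, y, passes


single = run(step_single_output, 0.5, 0.25, 4)
multi = run(step_multi_output, 0.5, 0.25, 4)
```

Both variants compute the same state, but the single-output version needs twice as many passes, and any work shared between f and g would be repeated in each of them.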

MRDS – Dominating Sets • Dominating Set S = {A,D} • S dom G • All paths to G go through element of S • S, G in same pass • avoid save/recompute for G
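The set version of the test can be sketched the same way as for a single dominator: S dom G if G becomes unreachable from every output once all elements of S are removed. The graph below is a hypothetical two-output example, and `reachable` and `set_dominates` are illustrative names, not part of MRDS itself.

```python
def reachable(children, roots, blocked=frozenset()):
    """Return the nodes reachable from any root without entering blocked nodes."""
    seen = set()
    stack = []
    for root in roots:
        if root not in blocked and root not in seen:
            seen.add(root)
            stack.append(root)
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen


def set_dominates(children, roots, s, g):
    """True if every path from any root to g passes through some element of s."""
    return g not in reachable(children, roots, blocked=frozenset(s))


# Hypothetical two-output shader: both outputs reach G, one via A and one via D.
example_dag = {
    "out1": ["A"],
    "out2": ["D"],
    "A": ["G"],
    "D": ["G"],
}
example_roots = ["out1", "out2"]

pair_dominates = set_dominates(example_dag, example_roots, {"A", "D"}, "G")
a_alone_dominates = set_dominates(example_dag, example_roots, {"A"}, "G")
```

Neither A nor D dominates G on its own here, which is why a single-node dominator test is not enough once a shader has several outputs.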

MRDS – Pass Merging • Generate initial passes with RDS • Find potential merges • check if valid • evaluate change in cost • Execute from best to worst • revalidate • Stop when no more beneficial merges

MRDS – Pass Merging • What if RDS chose to recompute G? • Merge between passes A and D • eliminates duplicate instructions • gets high score
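The merging loop can be sketched as a simple greedy procedure. This is a simplified variant, not the MRDS implementation: it re-scans all pairs after every merge instead of sorting the candidates once and revalidating them, it ignores dependencies between passes, and it uses a made-up cost model (instruction count plus a fixed per-pass overhead). The limits `MAX_INSTRUCTIONS`, `MAX_OUTPUTS` and `PASS_OVERHEAD` are hypothetical values.

```python
from itertools import combinations

MAX_INSTRUCTIONS = 8  # hypothetical per-pass instruction limit
MAX_OUTPUTS = 4       # hypothetical number of render targets
PASS_OVERHEAD = 3     # hypothetical fixed cost of issuing one pass


def cost(p):
    """Toy cost model: instruction count plus a fixed per-pass overhead."""
    return len(p["ops"]) + PASS_OVERHEAD


def merge(a, b):
    """Union of two passes; instructions present in both are kept only once."""
    return {"ops": a["ops"] | b["ops"], "outputs": a["outputs"] | b["outputs"]}


def is_valid(p):
    """A pass is valid if it fits the instruction and output limits."""
    return len(p["ops"]) <= MAX_INSTRUCTIONS and len(p["outputs"]) <= MAX_OUTPUTS


def merge_passes(passes):
    """Repeatedly execute the best valid merge until none reduces the cost."""
    passes = list(passes)
    while True:
        best = None
        for i, j in combinations(range(len(passes)), 2):
            merged = merge(passes[i], passes[j])
            if not is_valid(merged):
                continue
            saving = cost(passes[i]) + cost(passes[j]) - cost(merged)
            if saving > 0 and (best is None or saving > best[0]):
                best = (saving, i, j, merged)
        if best is None:
            return passes
        _, i, j, merged = best
        passes = [p for k, p in enumerate(passes) if k not in (i, j)] + [merged]


# Passes A and D both recompute the shared instructions g1..g3.
initial = [
    {"ops": {"g1", "g2", "g3", "a1"}, "outputs": {"x"}},
    {"ops": {"g1", "g2", "g3", "d1"}, "outputs": {"y"}},
    {"ops": {"e1", "e2", "e3", "e4", "e5", "e6"}, "outputs": {"z"}},
]
result = merge_passes(initial)
```

In this example the two passes that duplicate g1..g3 are merged into one two-output pass, while the third pass stays separate because any merge with it would exceed the instruction limit.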

MRDS – Time Complexity • Cost of merging dominated by initial search • iterates over s^2 pairs of splits • each pair requires size-s set operations and 1 compiler call • O(s^2(s+n)) • s = O(n) in worst case • MRDS = O(n^3) in worst case • in practice we expect s << n • Assumes compiler calls are linear • not true for fxc
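Written out step by step, the bound follows from the quantities on this slide (s splits, n nodes), under the slide's own assumption that one compiler call is linear in n. The symbol T_merge is introduced here only for illustration.

```latex
\begin{align*}
T_{\text{merge}}
  &= \underbrace{O(s^2)}_{\text{pairs of splits}} \cdot
     \bigl( \underbrace{O(s)}_{\text{set operations}}
          + \underbrace{O(n)}_{\text{one compiler call}} \bigr)
   = O\bigl(s^2 (s + n)\bigr) \\
s = O(n) \;&\Longrightarrow\;
T_{\text{merge}} = O\bigl(n^2 (n + n)\bigr) = O(n^3)
\end{align*}
```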

MRDS' • RDS uses linear search for save/recompute • evaluates cost of both alternatives with RDSh • RDS = O(n * RDSh) = O(n^3) • MRDS merges after RDS has made these decisions • MRDS = O(RDS + n^3) = O(n^3) • MRDS' merges during cost evaluation • adds linear factor in worst case • MRDS' = O(n * (RDSh + n^3)) = O(n^4)

Results • 3 Brook Programs • Procedural Fire • Mandelbrot Fractal • Matrix Multiply • Compiled for ATI Radeon 9800 XT with • RDS • MRDS • MRDS'

Results – Procedural Fire • MRDS' better than MRDS and RDS • better save/recompute decisions • results in less bandwidth used

Results – Compile Times

Results – Mandelbrot Fractal • MRDS', MRDS better than RDS • iterative computation – state in 2 variables • RDS duplicates computation

Results – Matrix Multiply • Matrix-matrix multiply benefits from blocking • blocking cuts computation by a factor of ~2 • Blocking requires multiple outputs • performance limited by MRT performance

Summary • Modified RDS algorithm, MRDS • supports multiple-output shaders • generates code for multiple-render-targets • easy to implement, same running time • generates better-performing partitions

Future Work • Implementations • Ashli • combine with Mio • Exploit new hardware • data-dependent flow control • large numbers of outputs

Acknowledgements • Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot • RDS implementation, design discussions • Kayvon Fatahalian, Ian Buck • GPUBench results • ATI • hardware • DARPA, ATI, IBM, NVIDIA, SONY • funding
