Fragment-Parallel Composite and Filter
Anjul Patney, Stanley Tzeng, and John D. Owens
University of California, Davis
Parallelism in Interactive Graphics
• Well expressed in hardware as well as APIs
• Consistently growing in degree and expression
  • More and more cores on upcoming GPUs
  • From programmable shaders to pipelines
• We should rethink algorithms to exploit this
  • This paper provides one example: parallelization of the composite/filter stages
A Feed-Forward Rendering Pipeline
Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels
Composite & Filter
• Input: an unordered list of fragments at each pixel's sample locations
• Output: pixel colors
• Assumption: no fragments are discarded
Basic Idea
• Map processors to pixels (pixel-parallel) or to fragments (fragment-parallel)
• Pixel-parallel mapping: insufficient parallelism
• Fragment-parallel mapping: irregularity
Motivation
• Most applications have low depth complexity
  • Pixel-level parallelism is sufficient
• We are interested in applications with
  • Very high depth complexity
  • High variation in depth complexity
• Furthermore
  • Future platforms will demand more parallelism
  • High depth complexity can limit pixel-parallelism
Related Work: Order-Independent Transparency (OIT)
• Depth-Peeling [Everitt 01]: one pass per transparent layer
• Stencil-Routed A-buffer [Myers & Bavoil 07]: one pass per 8 depth layers (limited by the maximum MSAA samples per pixel)
• Bucket Depth-Peeling [Liu et al. 09]: one pass per up to 32 layers (limited by the maximum number of render targets)
Related Work: Order-Independent Transparency (OIT)
• OIT using Direct3D 11 [Gruen et al. 10]: build per-pixel fragment linked lists, then sort and composite per pixel
• Hair Self-Shadowing [Sintorn et al. 09]: each fragment computes its own contribution; assumes constant opacity
Related Work: Programmable Rendering Pipelines
• RenderAnts [Zhou et al. 09]: sort fragments globally, then composite/filter per pixel
• FreePipe [Liu et al. 10]: sort fragments globally, then composite/filter per pixel
Pixel-Parallel Formulation
Figure: one thread per subsample (thread IDs j … j+6 for subsamples Sj … S(j+6) of pixels Pi … P(i+2)); each thread serially composites every fragment of its subsample. P: pixel, S: subsample.

Fragment-Parallel Formulation
Figure: one thread per fragment (thread IDs j … j+23), so the fragments of a single subsample are spread across many threads. P: pixel, S: subsample.
Fragment-Parallel Formulation
How can this behavior be achieved? Revisit the composite equation for fragments 1, 2, …, N over the background B:
Cs = α1·C1 + (1-α1)·{ α2·C2 + (1-α2)·( … (αN·CN + (1-αN)·CB) … ) }
Expanding the recurrence:
Cs = 1·α1·C1 + (1-α1)·α2·C2 + (1-α1)(1-α2)·α3·C3 + … + (1-α1)(1-α2)…(1-αk-1)·αk·Ck + … + (1-α1)(1-α2)…(1-αN)·CB
Each term is the product of a local contribution Lk = αk·Ck and a global contribution Gk = (1-α1)(1-α2)…(1-αk-1).
Fragment-Parallel Formulation
Cs = G1·L1 + G2·L2 + G3·L3 + … + GN·LN, with Gk = (1-α1)·(1-α2)…(1-αk-1) and Lk = αk·Ck
• Lk is trivially parallel (a purely local computation per fragment)
• Gk is the result of a scan operation (running product)
• For the list of input fragments: compute G[ ] and L[ ], then multiply them
• Perform a reduction to add the subpixel contributions
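To make the two steps concrete, here is a minimal NumPy sketch of compositing one subsample via a prefix product and a sum. It is not the paper's code; scalar colors and an explicit opaque background fragment are simplifying assumptions.

```python
import numpy as np

# One subsample's fragments, sorted front-to-back, background appended with alpha = 1.
alpha = np.array([0.5, 0.25, 1.0])   # alpha_1 .. alpha_N, background last
color = np.array([1.0, 0.8, 0.2])    # C_1 .. C_N, background color C_B last

L = alpha * color                                           # local contributions L_k = alpha_k * C_k
G = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))   # G_k = prod_{j<k} (1 - alpha_j), an exclusive scan
C_s = np.sum(G * L)                                         # subsample color: sum_k G_k * L_k

# G is exactly what a segmented scan computes in parallel, and the final sum is a
# segmented reduction, so both steps map onto standard data-parallel primitives.
print(C_s)   # 0.675, identical to back-to-front alpha blending of the same fragments
```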
Fragment-Parallel Formulation
• Filter, for every pixel: Cp = Cs1·κ1 + Cs2·κ2 + … + CsM·κM
• This can be expressed as another reduction, after multiplying by the subpixel weights κm
• It can be merged with the previous reduction
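A tiny check, with toy numbers not taken from the paper, of why the merge is valid: premultiplying each local contribution Lk by its subsample's filter weight κ before the reduction equals filtering the composited subsample color afterwards, because the weight distributes over the sum.

```python
import numpy as np

G = np.array([1.0, 0.5, 0.375])      # global contributions of one subsample's fragments
L = np.array([0.5, 0.2, 0.075])      # local contributions of the same fragments
kappa = 0.25                         # this subsample's filter weight

# Filtering after the composite equals compositing with premultiplied weights.
assert np.isclose(kappa * np.sum(G * L), np.sum(G * (L * kappa)))
```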
Fragment-Parallel Composite & Filter: Final Algorithm
1. Two-key sort (subpixel ID, depth)
2. Segmented scan (obtain Gk)
3. Premultiply with weights (Lk, κm)
4. Segmented reduction
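The following is an end-to-end CPU sketch of these four steps for one pixel with two subsamples. The flat fragment buffer, the field names, and the NumPy stand-ins for the GPU sort, segmented scan, and segmented reduction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

sub_id = np.array([0, 1, 0, 1, 0])              # subsample each fragment belongs to
depth  = np.array([0.7, 0.2, 0.1, 0.9, 0.4])
alpha  = np.array([1.0, 0.3, 0.6, 1.0, 0.2])    # fully opaque fragments act as the background
color  = np.array([1.0, 0.8, 0.4, 0.1, 0.9])
kappa  = np.array([0.5, 0.5])                   # filter weight of each subsample

# 1. Two-key sort: primary key subsample ID, secondary key depth (front-to-back).
order = np.lexsort((depth, sub_id))
sub_id, alpha, color = sub_id[order], alpha[order], color[order]

# 2. Segmented scan: exclusive running product of (1 - alpha) within each segment.
G = np.empty_like(alpha)
for s in np.unique(sub_id):                     # each segment is scanned in parallel on the GPU
    seg = (sub_id == s)
    G[seg] = np.concatenate(([1.0], np.cumprod(1.0 - alpha[seg])[:-1]))

# 3. Premultiply with weights: L_k times the filter weight of its subsample.
L = alpha * color * kappa[sub_id]

# 4. Segmented reduction: sum all contributions belonging to this pixel.
print(np.sum(G * L))
```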
Fragment-Parallel Formulation
Figure: within each pixel's fragment list, a segmented scan (product) produces the Gk terms and a segmented reduction (sum) produces the final pixel value. P: pixel, S: subsample.
Implementation
• Hardware used: NVIDIA GeForce GTX 280
• We require fast segmented scan and reduce
  • The CUDPP library provides these, which restricts the implementation to NVIDIA CUDA
• No direct access to the hardware rasterizer, so we wrote our own
Example System – Polygons
• Applications: games
• Depth complexity: 1 to a few tens of layers; suited to pixel-parallel
• Rasterizer: fragment-parallel software rasterizer

Example System – Particles
• Applications: simulations, games
• Depth complexity: hundreds of layers, with high depth variance
• Rasterizer: particle-parallel sprite rasterizer

Example System – Volumes
• Applications: scientific visualization
• Depth complexity: tens to hundreds of layers, with low depth variance
• Rasterizer: major-axis-slice rasterizer

Example System – Reyes
• Applications: offline rendering
• Depth complexity: tens of layers, with moderate depth variance
• Rasterizer: data-parallel micropolygon rasterizer
Limitations
• Increased memory traffic: several passes through CUDPP primitives
• Unclear how to optimize for special cases such as threshold opacity or threshold depth complexity
Summary and Conclusion
• Parallel formulation of the composite equation
  • Maps well to known data-parallel primitives
  • Can be integrated with the filter stage
  • Consistent performance across varying workloads
• FPC is applicable to future rendering pipelines
  • Exploits a higher degree of parallelism, better matched to the size of the rendering workload
  • A tool for building programmable pipelines
Future Work
• Performance
  • Reduction in memory traffic
  • Extension to special-case scenes
  • Hybrid PPC-FPC formulations
• Applications
  • Integration with a hardware rasterizer
  • Cinematic rendering, Photoshop
Acknowledgments
• NSF Award 0541448
• SciDAC Institute for Ultrascale Visualization
• NVIDIA Research Fellowship
• Equipment donated by NVIDIA
• Discussions and feedback: Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD), and the anonymous reviewers
• Implementation assistance: Jeff Stuart, Shubho Sengupta