Introduction to Parallel Rendering: Sorting, Chromium, and MPI
Mengxia Zhu, Spring 2006
Parallel Rendering
• The graphics rendering process is computationally intensive
• Parallel computation is a natural way to achieve higher performance
• Two levels of parallelism:
  • Functional parallelism – pipelining
  • Data parallelism – multiple results computed at the same time
Data Parallel Algorithms
• There are many taxonomies for categorizing parallel algorithms:
  • Image space vs. object space
  • Shared-memory vs. distributed-memory architectures
  • MPI, OpenMP, …
• We need a uniform framework to study and understand parallel rendering
Sorting in Rendering
• Rendering can be viewed as a sorting process:
  • Sort primitives from object coordinates to screen coordinates
• The key procedure: calculating the effect of each primitive on each pixel
• This view lets us study the computational and communication costs of parallel rendering
Sorting Categories
• Where this "sort" takes place determines the structure of the parallel algorithm
• Sort-first
  • during geometry processing
  • distributes "raw" primitives
• Sort-middle
  • between geometry processing and rasterization
  • distributes screen-space primitives
• Sort-last
  • during rasterization
  • distributes pixels/fragments
Sorting (cont.)
[Figure: the three pipeline structures, each built from geometry (G) and rasterization (R) stages – sort-first redistributes before G, sort-middle between G and R, and sort-last after R into a compositing stage (C)]
• A landmark paper: "A Sorting Classification of Parallel Rendering," Molnar et al., IEEE CG&A, 1994
Sort First
• Primitives are initially assigned arbitrarily
• A pre-transformation determines which screen regions each primitive covers
• Primitives are then redistributed over the network to the correct renderer
• That renderer performs the work of the entire pipeline for the primitive from that point on
Sort First (cont.)
• Screen space is partitioned into non-overlapping 2D tiles, each rendered independently by a tightly coupled pair of geometry and rasterization processors
• The sub-images of the 2D tiles are composited without depth comparison; a sketch of the classification step follows below
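To make the classification step concrete, here is a minimal sketch in C. It assumes a hypothetical 1024x768 screen split into a 4x2 grid of tiles, one renderer per tile; the struct layout and all constants are illustrative, not from the slides.

```c
/* Sort-first classification sketch: map a transformed primitive's
 * screen-space bounding box to the tile(s), and hence renderer(s),
 * that must receive it. All names and constants are illustrative. */
#include <stdio.h>

#define WIDTH   1024
#define HEIGHT  768
#define TILES_X 4
#define TILES_Y 2
#define TILE_W  (WIDTH  / TILES_X)   /* 256 */
#define TILE_H  (HEIGHT / TILES_Y)   /* 384 */

typedef struct { float xmin, ymin, xmax, ymax; } ScreenBBox;

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* A primitive that straddles a tile boundary is sent to every tile
 * its bounding box overlaps. */
static void classify(const ScreenBBox *b)
{
    int tx0 = clamp((int)b->xmin / TILE_W, 0, TILES_X - 1);
    int tx1 = clamp((int)b->xmax / TILE_W, 0, TILES_X - 1);
    int ty0 = clamp((int)b->ymin / TILE_H, 0, TILES_Y - 1);
    int ty1 = clamp((int)b->ymax / TILE_H, 0, TILES_Y - 1);
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            printf("send primitive to renderer %d\n", ty * TILES_X + tx);
}

int main(void)
{
    ScreenBBox b = { 200.0f, 100.0f, 600.0f, 300.0f }; /* overlaps 3 tiles */
    classify(&b);
    return 0;
}
```

A primitive overlapping several tiles is sent to, and rendered by, every one of them, which is part of sort-first's overhead.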
Analysis Terms
• Assume a dataset containing $n_r$ raw primitives with average size $a_r$.
• Primitives that result from tessellation are called display primitives. If $T$ is the tessellation ratio, there are $n_d = T n_r$ of these, with average size $a_d = a_r / T$. If there is no tessellation, $T = 1$, $n_d = n_r$, and $a_d = a_r$.
• Assume an image containing $A$ pixels, with $S$ samples computed per pixel, and assume that all primitives lie within the viewing frustum.
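As a quick numerical check of these terms (values chosen arbitrarily for illustration):

```latex
% Illustrative values, not from the slides:
% n_r = 10^5 raw primitives, average size a_r = 40 pixels, T = 4.
\begin{align*}
  n_d      &= T\,n_r  = 4 \times 10^5 \text{ display primitives},\\
  a_d      &= a_r / T = 10 \text{ pixels average size},\\
  n_d\,a_d &= n_r\,a_r = 4 \times 10^6 \text{ pixels of total coverage (unchanged)}.
\end{align*}
```

Tessellation multiplies the number of items that must be redistributed by $T$ while leaving the total pixel coverage constant, which is why the sort-middle analysis below flags high communication cost when $T$ is large.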
Sort-First Analysis
• Pros:
  • Low communication requirements when tessellation or oversampling is high, or when inter-frame coherence is exploited
  • Processors implement the entire rendering pipeline for a given screen region
• Cons:
  • Susceptible to load imbalance (primitives tend to clump into a few regions)
  • Exploiting coherence is difficult
Sort Middle
• Primitives are initially assigned arbitrarily
• Primitives are fully transformed, lit, etc., by the geometry processor to which they are initially assigned
• Transformed primitives are then distributed over the network to the rasterizer assigned to their region of the screen, as in the MPI sketch below
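A minimal MPI sketch of this redistribution step, assuming each rank acts as both a geometry processor and a rasterizer that owns one horizontal band of the screen; the primitive layout, the band-ownership rule, and all sizes are illustrative.

```c
/* Sort-middle redistribution sketch: after transformation, each rank
 * counts how many of its screen-space primitives fall in each other
 * rank's screen band, exchanges the counts, and would then exchange
 * the primitives themselves. Layout and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { float x, y, z; } ScreenPrim;  /* transformed primitive */

/* One horizontal band of the screen per rank. */
static int owner(const ScreenPrim *p, int nranks, int height)
{
    int band = (int)p->y / (height / nranks);
    return band < 0 ? 0 : (band >= nranks ? nranks - 1 : band);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for the output of geometry processing on this rank. */
    int nlocal = 1000, height = 768;
    ScreenPrim *prims = malloc(nlocal * sizeof *prims);
    for (int i = 0; i < nlocal; i++) {
        prims[i].x = 0.0f;
        prims[i].y = (float)((i * 37 + rank) % height);
        prims[i].z = 0.0f;
    }

    /* Count primitives destined for each rasterizer... */
    int *sendcnt = calloc(size, sizeof(int));
    for (int i = 0; i < nlocal; i++)
        sendcnt[owner(&prims[i], size, height)]++;

    /* ...and tell every rank how much it will receive. */
    int *recvcnt = malloc(size * sizeof(int));
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, MPI_COMM_WORLD);

    /* A full implementation would now pack primitives by destination,
     * build displacement arrays, and call MPI_Alltoallv to deliver each
     * primitive to the rasterizer that owns its screen band. */

    free(prims); free(sendcnt); free(recvcnt);
    MPI_Finalize();
    return 0;
}
```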
Sort-Middle Analysis
• Pros:
  • Redistribution occurs at a "natural" place in the pipeline
• Cons:
  • High communication cost if $T$ is high, since display primitives rather than raw primitives are distributed
  • Susceptible to load imbalance in the same way as sort-first
• Overhead:
  • Display-primitive distribution cost
  • Tessellation factor
Sort Last
• Defers sorting until the end of the pipeline (the image-composition phase)
• Renderers operate independently until the visibility stage
• Fragments are transmitted over the network to compositing processors to resolve visibility, as in the depth-test sketch below
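The visibility resolve itself reduces to a per-pixel depth test. A minimal sketch, assuming each renderer produces a full-screen buffer of fragments carrying a depth and a color (the field names are illustrative):

```c
/* Sort-last visibility resolve sketch: merge an incoming full-screen
 * fragment buffer into a local one, keeping the nearer fragment at
 * each pixel. Field names are illustrative. */
typedef struct { float z; unsigned rgba; } Fragment;

static void composite(Fragment *dst, const Fragment *src, int npixels)
{
    for (int i = 0; i < npixels; i++)
        if (src[i].z < dst[i].z)    /* smaller z is nearer the viewer */
            dst[i] = src[i];
}
```

For opaque geometry this test is order-independent, which is what lets the renderers run fully independently until the merge; note that each renderer ships an entire frame's worth of fragments, which is the source of the heavy pixel traffic noted below.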
Sort-Last Analysis
• Pros:
  • Renderers implement the full pipeline and are independent until pixel merging
  • Less prone to load imbalance
  • Very scalable
• Cons:
  • Pixel traffic can be extremely high
Image Composition
• A naïve approach is binary compositing:
  • Each disjoint pair of processors composites its two sub-images into one
  • N/2 sub-images remain after the first stage
  • Half of the remaining processors pair up for each subsequent level of compositing, so the other half sit idle
• The binary-swap compositing method ensures that every processor participates in all stages of the process
• The key idea: at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces and each takes responsibility for one half
Binary-Swap Example
[Figure: the binary-swap compositing algorithm for four processors]
• A minimal MPI sketch of the algorithm follows below
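The sketch assumes a power-of-two processor count, an image size divisible by it, and the same fragment layout and depth test as the composite() sketch above; all names are illustrative.

```c
/* Binary-swap sketch: at each stage a rank splits its current image
 * region in half with its partner, ships one half, receives the other,
 * and depth-composites what it receives. After log2(P) stages each
 * rank owns the finished 1/P of the image. Assumes P is a power of
 * two and npix is divisible by P; names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { float z; unsigned rgba; } Fragment;

void binary_swap(Fragment *img, int npix, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    Fragment *tmp = malloc((npix / 2) * sizeof *tmp);
    int lo = 0, hi = npix;               /* region this rank still owns */
    for (int bit = 1; bit < size; bit <<= 1) {
        int partner = rank ^ bit;
        int mid = lo + (hi - lo) / 2;
        /* Lower partner keeps the left half; upper keeps the right. */
        int keep_lo = (rank & bit) ? mid : lo;
        int keep_hi = (rank & bit) ? hi  : mid;
        int send_lo = (rank & bit) ? lo  : mid;
        int nkeep = keep_hi - keep_lo;   /* halves are equal in size */

        MPI_Sendrecv(img + send_lo, nkeep * (int)sizeof(Fragment), MPI_BYTE,
                     partner, 0,
                     tmp, nkeep * (int)sizeof(Fragment), MPI_BYTE,
                     partner, 0, comm, MPI_STATUS_IGNORE);

        /* Same depth test as composite(): nearer fragment wins. */
        for (int i = 0; i < nkeep; i++)
            if (tmp[i].z < img[keep_lo + i].z)
                img[keep_lo + i] = tmp[i];

        lo = keep_lo; hi = keep_hi;
    }
    free(tmp);
    /* img[lo..hi) now holds this rank's 1/P strip of the final image;
     * an MPI_Gather(v) would assemble the full frame for display. */
}
```

With four processors this runs two stages: after the first swap each processor owns half the image, after the second a quarter, and no processor is ever idle.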
Which to Choose?
• It depends:
  • Which approach can best be matched to the hardware's capabilities?
  • The number of primitives, tessellation factor, coherence, etc., are all considerations; there are many tradeoffs
Load Balancing
• Strategies for better load balancing (a task-queue sketch follows this list):
  • Task queuing: order the task queue by decreasing task size, so the work grain becomes finer as the queue drains
  • Load stealing: nodes steal smaller tasks from other nodes once they have completed their own tasks
  • Time stamps: each task carries a timeout; if a node cannot finish its task before the timeout, it re-partitions the remainder of the task and re-distributes it
• Hierarchical data structures, such as octrees and k-d trees, are commonly used
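A minimal sketch of the task-queuing strategy in C with pthreads, assuming tasks are pre-sorted by decreasing size and claimed through a shared atomic cursor; the task sizes and thread count are illustrative.

```c
/* Task-queue sketch: tasks sorted by decreasing size are claimed
 * through an atomic cursor, so the work grain becomes finer as the
 * queue drains. Sizes and thread count are illustrative. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   8
#define NTHREADS 4

static const int task_size[NTASKS] = { 900, 700, 500, 300, 200, 100, 50, 25 };
static atomic_int next_task;

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* claim the next task */
        if (t >= NTASKS)
            break;                                /* queue exhausted */
        printf("thread %ld renders task of size %d\n", id, task_size[t]);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Because large tasks are claimed first, the tail of the schedule consists of small tasks, which evens out the threads' finishing times.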
References
• These slides reference content from:
  • Jian Huang, University of Tennessee, Knoxville
  • William Gropp and Ewing Lusk, Argonne National Laboratory