Introduction to Parallel Rendering: Sorting, Chromium, and MPI
Mengxia Zhu, Spring 2006
Parallel Rendering
• The graphics rendering process is computationally intensive
• Parallel computation is a natural way to achieve higher performance
• Two levels of parallelism:
  • Functional parallelism – pipelining
  • Data parallelism – multiple results computed at the same time
Data Parallel Algorithms
• There are many taxonomies for categorizing parallel algorithms:
  • Image space vs. object space
  • Shared-memory vs. distributed-memory architectures
  • MPI, OpenMP, …
• We need a uniform framework to study and understand parallel rendering
Sorting in Rendering
• Rendering can be viewed as a sorting process:
  • Sort primitives from object coordinates to screen coordinates
• The key procedure: calculating the effect of each primitive on each pixel
• This view lets us study the computational and communication costs of parallel rendering
Sorting Categories
• Where this "sort" takes place determines the structure of the parallel algorithm
• Sort-first
  • during geometry processing
  • distributes "raw" primitives
• Sort-middle
  • between geometry processing and rasterization
  • distributes screen-space primitives
• Sort-last
  • during rasterization
  • distributes pixels/fragments
Sorting (cont.)
[Figure: the three pipeline structures, each built from geometry (G) and rasterization (R) stages – sort-first redistributes before G, sort-middle between G and R, and sort-last after R into a compositing stage (C)]
• A landmark paper: "A Sorting Classification of Parallel Rendering," Molnar et al., IEEE CG&A, 1994
Sort First
• Primitives are initially assigned arbitrarily
• A pre-transformation determines which screen regions each primitive covers
• Primitives are then redistributed over the network to the correct renderer
• That renderer performs the work of the entire pipeline for the primitive from that point on
Sort First (cont.)
• Screen space is partitioned into non-overlapping 2D tiles, each rendered independently by a tightly coupled pair of geometry and rasterization processors
• The sub-images of the 2D tiles are composited without depth comparison; a sketch of the classification step follows below
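To make the classification step concrete, here is a minimal sketch in C. It assumes a hypothetical 1024x768 screen split into a 4x2 grid of tiles, one renderer per tile; the struct layout and all constants are illustrative, not from the slides.

```c
/* Sort-first classification sketch: map a transformed primitive's
 * screen-space bounding box to the tile(s), and hence renderer(s),
 * that must receive it. All names and constants are illustrative. */
#include <stdio.h>

#define WIDTH   1024
#define HEIGHT  768
#define TILES_X 4
#define TILES_Y 2
#define TILE_W  (WIDTH  / TILES_X)   /* 256 */
#define TILE_H  (HEIGHT / TILES_Y)   /* 384 */

typedef struct { float xmin, ymin, xmax, ymax; } ScreenBBox;

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* A primitive that straddles a tile boundary is sent to every tile
 * its bounding box overlaps. */
static void classify(const ScreenBBox *b)
{
    int tx0 = clamp((int)b->xmin / TILE_W, 0, TILES_X - 1);
    int tx1 = clamp((int)b->xmax / TILE_W, 0, TILES_X - 1);
    int ty0 = clamp((int)b->ymin / TILE_H, 0, TILES_Y - 1);
    int ty1 = clamp((int)b->ymax / TILE_H, 0, TILES_Y - 1);
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            printf("send primitive to renderer %d\n", ty * TILES_X + tx);
}

int main(void)
{
    ScreenBBox b = { 200.0f, 100.0f, 600.0f, 300.0f }; /* overlaps 3 tiles */
    classify(&b);
    return 0;
}
```

A primitive overlapping several tiles is sent to, and rendered by, every one of them, which is part of sort-first's overhead.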
Analysis Terms
• Assume a dataset containing $n_r$ raw primitives with average size $a_r$.
• Primitives that result from tessellation are called display primitives. If $T$ is the tessellation ratio, there are $n_d = T n_r$ of these, with average size $a_d = a_r / T$. If there is no tessellation, $T = 1$, $n_d = n_r$, and $a_d = a_r$.
• Assume an image containing $A$ pixels, with $S$ samples computed per pixel, and assume that all primitives lie within the viewing frustum.
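As a quick numerical check of these terms (values chosen arbitrarily for illustration):

```latex
% Illustrative values, not from the slides:
% n_r = 10^5 raw primitives, average size a_r = 40 pixels, T = 4.
\begin{align*}
  n_d      &= T\,n_r  = 4 \times 10^5 \text{ display primitives},\\
  a_d      &= a_r / T = 10 \text{ pixels average size},\\
  n_d\,a_d &= n_r\,a_r = 4 \times 10^6 \text{ pixels of total coverage (unchanged)}.
\end{align*}
```

Tessellation multiplies the number of items that must be redistributed by $T$ while leaving the total pixel coverage constant, which is why the sort-middle analysis below flags high communication cost when $T$ is large.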
Sort-First Analysis
• Pros:
  • Low communication requirements when tessellation or oversampling is high, or when inter-frame coherence is exploited
  • Processors implement the entire rendering pipeline for a given screen region
• Cons:
  • Susceptible to load imbalance (primitives tend to clump into a few regions)
  • Exploiting coherence is difficult
Sort Middle
• Primitives are initially assigned arbitrarily
• Primitives are fully transformed, lit, etc., by the geometry processor to which they are initially assigned
• Transformed primitives are then distributed over the network to the rasterizer assigned to their region of the screen, as in the MPI sketch below
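A minimal MPI sketch of this redistribution step, assuming each rank acts as both a geometry processor and a rasterizer that owns one horizontal band of the screen; the primitive layout, the band-ownership rule, and all sizes are illustrative.

```c
/* Sort-middle redistribution sketch: after transformation, each rank
 * counts how many of its screen-space primitives fall in each other
 * rank's screen band, exchanges the counts, and would then exchange
 * the primitives themselves. Layout and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { float x, y, z; } ScreenPrim;  /* transformed primitive */

/* One horizontal band of the screen per rank. */
static int owner(const ScreenPrim *p, int nranks, int height)
{
    int band = (int)p->y / (height / nranks);
    return band < 0 ? 0 : (band >= nranks ? nranks - 1 : band);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for the output of geometry processing on this rank. */
    int nlocal = 1000, height = 768;
    ScreenPrim *prims = malloc(nlocal * sizeof *prims);
    for (int i = 0; i < nlocal; i++) {
        prims[i].x = 0.0f;
        prims[i].y = (float)((i * 37 + rank) % height);
        prims[i].z = 0.0f;
    }

    /* Count primitives destined for each rasterizer... */
    int *sendcnt = calloc(size, sizeof(int));
    for (int i = 0; i < nlocal; i++)
        sendcnt[owner(&prims[i], size, height)]++;

    /* ...and tell every rank how much it will receive. */
    int *recvcnt = malloc(size * sizeof(int));
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, MPI_COMM_WORLD);

    /* A full implementation would now pack primitives by destination,
     * build displacement arrays, and call MPI_Alltoallv to deliver each
     * primitive to the rasterizer that owns its screen band. */

    free(prims); free(sendcnt); free(recvcnt);
    MPI_Finalize();
    return 0;
}
```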
Sort-Middle Analysis
• Pros:
  • Redistribution occurs at a "natural" place in the pipeline
• Cons:
  • High communication cost if $T$ is high, since display primitives rather than raw primitives are distributed
  • Susceptible to load imbalance in the same way as sort-first
• Overhead:
  • Display-primitive distribution cost
  • Tessellation factor
Sort Last
• Defers sorting until the end of the pipeline (the image-composition phase)
• Renderers operate independently until the visibility stage
• Fragments are transmitted over the network to compositing processors to resolve visibility, as in the depth-test sketch below
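The visibility resolve itself reduces to a per-pixel depth test. A minimal sketch, assuming each renderer produces a full-screen buffer of fragments carrying a depth and a color (the field names are illustrative):

```c
/* Sort-last visibility resolve sketch: merge an incoming full-screen
 * fragment buffer into a local one, keeping the nearer fragment at
 * each pixel. Field names are illustrative. */
typedef struct { float z; unsigned rgba; } Fragment;

static void composite(Fragment *dst, const Fragment *src, int npixels)
{
    for (int i = 0; i < npixels; i++)
        if (src[i].z < dst[i].z)    /* smaller z is nearer the viewer */
            dst[i] = src[i];
}
```

For opaque geometry this test is order-independent, which is what lets the renderers run fully independently until the merge; note that each renderer ships an entire frame's worth of fragments, which is the source of the heavy pixel traffic noted below.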
Sort-Last Analysis
• Pros:
  • Renderers implement the full pipeline and are independent until pixel merging
  • Less prone to load imbalance
  • Very scalable
• Cons:
  • Pixel traffic can be extremely high
Image Composition
• A naïve approach is binary compositing:
  • Each disjoint pair of processors composites its two sub-images into one
  • N/2 sub-images remain after the first stage
  • Half of the remaining processors pair up for each subsequent level of compositing, so the other half sit idle
• The binary-swap compositing method ensures that every processor participates in all stages of the process
• The key idea: at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces and each takes responsibility for one half
Binary-Swap Example
[Figure: the binary-swap compositing algorithm for four processors]
• A minimal MPI sketch of the algorithm follows below
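The sketch assumes a power-of-two processor count, an image size divisible by it, and the same fragment layout and depth test as the composite() sketch above; all names are illustrative.

```c
/* Binary-swap sketch: at each stage a rank splits its current image
 * region in half with its partner, ships one half, receives the other,
 * and depth-composites what it receives. After log2(P) stages each
 * rank owns the finished 1/P of the image. Assumes P is a power of
 * two and npix is divisible by P; names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { float z; unsigned rgba; } Fragment;

void binary_swap(Fragment *img, int npix, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    Fragment *tmp = malloc((npix / 2) * sizeof *tmp);
    int lo = 0, hi = npix;               /* region this rank still owns */
    for (int bit = 1; bit < size; bit <<= 1) {
        int partner = rank ^ bit;
        int mid = lo + (hi - lo) / 2;
        /* Lower partner keeps the left half; upper keeps the right. */
        int keep_lo = (rank & bit) ? mid : lo;
        int keep_hi = (rank & bit) ? hi  : mid;
        int send_lo = (rank & bit) ? lo  : mid;
        int nkeep = keep_hi - keep_lo;   /* halves are equal in size */

        MPI_Sendrecv(img + send_lo, nkeep * (int)sizeof(Fragment), MPI_BYTE,
                     partner, 0,
                     tmp, nkeep * (int)sizeof(Fragment), MPI_BYTE,
                     partner, 0, comm, MPI_STATUS_IGNORE);

        /* Same depth test as composite(): nearer fragment wins. */
        for (int i = 0; i < nkeep; i++)
            if (tmp[i].z < img[keep_lo + i].z)
                img[keep_lo + i] = tmp[i];

        lo = keep_lo; hi = keep_hi;
    }
    free(tmp);
    /* img[lo..hi) now holds this rank's 1/P strip of the final image;
     * an MPI_Gather(v) would assemble the full frame for display. */
}
```

With four processors this runs two stages: after the first swap each processor owns half the image, after the second a quarter, and no processor is ever idle.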
Which to Choose?
• It depends:
  • Which approach can best be matched to the hardware's capabilities?
  • The number of primitives, tessellation factor, coherence, etc., are all considerations; there are many tradeoffs
Load Balancing
• Strategies for better load balancing (a task-queue sketch follows this list):
  • Task queuing: order the task queue by decreasing task size, so the work grain becomes finer as the queue drains
  • Load stealing: nodes steal smaller tasks from other nodes once they have completed their own tasks
  • Time stamps: each task carries a timeout; if a node cannot finish its task before the timeout, it re-partitions the remainder of the task and re-distributes it
• Hierarchical data structures, such as octrees and k-d trees, are commonly used
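A minimal sketch of the task-queuing strategy in C with pthreads, assuming tasks are pre-sorted by decreasing size and claimed through a shared atomic cursor; the task sizes and thread count are illustrative.

```c
/* Task-queue sketch: tasks sorted by decreasing size are claimed
 * through an atomic cursor, so the work grain becomes finer as the
 * queue drains. Sizes and thread count are illustrative. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   8
#define NTHREADS 4

static const int task_size[NTASKS] = { 900, 700, 500, 300, 200, 100, 50, 25 };
static atomic_int next_task;

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* claim the next task */
        if (t >= NTASKS)
            break;                                /* queue exhausted */
        printf("thread %ld renders task of size %d\n", id, task_size[t]);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Because large tasks are claimed first, the tail of the schedule consists of small tasks, which evens out the threads' finishing times.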
References
• These slides reference content from:
  • Jian Huang, University of Tennessee, Knoxville
  • William Gropp and Ewing Lusk, Argonne National Laboratory