Practical Parallel Processing for Today’s Rendering Challenges SIGGRAPH 2001 Course 40 Los Angeles, CA
Speakers • Alan Chalmers, University of Bristol • Tim Davis, Clemson University • Erik Reinhard, University of Utah • Toshi Kato, SquareUSA
Schedule • Introduction • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
Schedule • Introduction (Davis) • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
The Need for Speed • Graphics rendering is time-consuming • large amount of data in a single image • animations much worse • Demand continues to rise for high-quality graphics
Rendering and Parallel Processing • A holy union • Many graphics rendering tasks can be performed in parallel • Often “embarrassingly parallel”
3-D Graphics Boards • Getting better • Perform “tricks” with texture mapping • Steve Jobs’ remark on constant frame rendering time
Parallel / Distributed Rendering • Fundamental Issues • Task Management • Task subdivision, Migration, Load balancing • Data Management • Data distributed across system • Communication
Schedule • Introduction • Parallel / Distributed Rendering Issues (Chalmers) • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
Introduction “Parallel processing is like a dog’s walking on its hind legs. It is not done well, but you are surprised to find it done at all” [Steve Fiddes (apologies to Samuel Johnson)] • Co-operation • Dependencies • Scalability • Control
Co-operation • Solution of a single problem • One person takes a certain time to solve the problem • Divide problem into a number of sub-problems • Each sub-problem solved by a single worker • Reduced problem solution time • BUT • co-operation overheads
Working Together • Overheads • access to pool • collision avoidance
Dependencies • Divide a problem into a number of distinct stages • Parallel solution of one stage before next can start • May be too severe ⇒ no parallel solution • each sub-problem dependent on previous stage • Dependency-free problems • order of task completion unimportant • BUT co-operation still required
Building with Blocks • Strictly sequential • Dependency-free
Scalability • Upper bound on the number of workers • Additional workers will NOT improve solution time • Shows how suitable a problem is for parallel processing • Given problem ⇒ finite number of sub-problems • more workers than tasks brings no further benefit • Upper bound may be (a lot) less than number of tasks • bottlenecks
Bottleneck at Doorway More workers may result in LONGER solution time
Control • Required by all parallel implementations • What constitutes a task • When has the problem been solved • How to deal with multiple stages • Forms of control • centralised • distributed
Control Required • Sequential • Parallel
Inherent Difficulties • Failure to successfully complete • Sequential solution • deficiencies in algorithm or data • Parallel solution • deficiencies in algorithm or data • deadlock • data consistency
Novel Difficulties • Factors arising from implementation • Deadlock • processor waiting indefinitely for an event • Data consistency • data is distributed amongst processors • Communication overheads • latency in message transfer
Evaluating Parallel Implementations • Realisation penalties • Algorithmic penalty • nature of the algorithm chosen • Implementation penalty • need to communicate • concurrent computation & communication activities • idle time
Task Management • Providing tasks to the processors • Problem decomposition • algorithmic decomposition • domain decomposition • Definition of a task • Computational Model
Problem Decomposition • Exploit parallelism • Inherent in algorithm • algorithmic decomposition • parallelising compilers • Applying same algorithm to different data items • domain decomposition • need for explicit system software support
Abstract Definition of a Task • Principal Data Item (PDI) - the item to which the algorithm is applied • Additional Data Items (ADIs) - data needed to complete the computation
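As a concrete, purely illustrative reading of this definition, a task might be represented as a small Python record; the field names below are ours, not from the course code:

from dataclasses import dataclass, field

@dataclass
class Task:
    pdi: object                               # Principal Data Item, e.g. one primary ray to trace
    adis: list = field(default_factory=list)  # Additional Data Items, e.g. candidate scene objects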
Computational Models • Determines the manner in which tasks are allocated to PEs • Maximise PE computation time • Minimise idle time • load balancing • Evenly allocate tasks amongst the processors
Data Driven Models • All PDIs allocated to specific PEs before computation starts • Each PE knows a priori which PDIs it is responsible for • Balanced (geometric decomposition) • evenly allocate tasks amongst the processors: portion at each PE = number of PDIs ÷ number of PEs • if the number of PDIs is not an exact multiple of the number of PEs, some PEs do one extra task
Balanced Data Driven • solution time = initial distribution + (computation time for all 24 PDIs ÷ 3 PEs) + result collation
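A worked sketch of the balanced allocation using the slide's 24 PDIs and 3 PEs; the timing values are assumed purely for illustration:

num_pdis = 24              # PDIs in the slide's example
num_pes = 3                # PEs in the slide's example
portion_at_each_pe = num_pdis // num_pes        # = 8 tasks per PE

# Assumed illustrative timings, not figures from the course notes:
time_per_task = 1.0        # computation time per PDI
distribution_time = 0.5    # initial distribution of the PDIs
collation_time = 0.5       # collecting the results

solution_time = distribution_time + portion_at_each_pe * time_per_task + collation_time
print(portion_at_each_pe, solution_time)        # 8 tasks each, 9.0 time units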
Demand Driven Model • Task computation time unknown • Work is allocated dynamically as PEs become idle • PEs no longer bound to particular PDIs • PEs explicitly demand new tasks • Task supplier process must satisfy these demands
Dynamic Allocation of Tasks • solution time = (total comp time for all PDIs ÷ number of PEs) + 2 × total comms time
Task Supplier Process
PROCESS Task_Supplier()
Begin
  remaining_tasks := total_number_of_tasks
  results_outstanding := total_number_of_tasks
  (* initialise all processors with one task *)
  FOR p = 1 TO number_of_PEs
    SEND task TO PE[p]
    remaining_tasks := remaining_tasks - 1
  ENDFOR
  WHILE results_outstanding > 0 DO
    RECEIVE result FROM PE[i]
    results_outstanding := results_outstanding - 1
    IF remaining_tasks > 0 THEN
      SEND task TO PE[i]
      remaining_tasks := remaining_tasks - 1
    ENDIF
  ENDWHILE
End (* Task_Supplier *)
Simple demand driven task supplier
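Below is a minimal runnable sketch of the same demand driven farm in Python, using threads and queues to stand in for PEs and message passing; it illustrates the scheme above rather than the course's implementation, and it assumes there is at least one task per PE:

import queue
import threading

def pe_worker(pe_id, task_q, request_q):
    # A processing element: compute a task, return the result, implicitly demand more work.
    while True:
        task = task_q.get()
        if task is None:                     # sentinel: no work left
            break
        result = task * task                 # stand-in for the real rendering computation
        request_q.put((pe_id, result))

def task_supplier(tasks, num_pes, request_q, task_qs):
    # Mirrors the pseudocode: one task per PE to start, then one task per result received.
    remaining = list(tasks)
    for p in range(num_pes):                 # assumes at least one task per PE
        task_qs[p].put(remaining.pop(0))
    results = []
    while len(results) < len(tasks):
        pe, result = request_q.get()         # a PE reports a result and asks for more work
        results.append(result)
        task_qs[pe].put(remaining.pop(0) if remaining else None)
    return results

tasks = list(range(16))
num_pes = 4
request_q = queue.Queue()
task_qs = [queue.Queue() for _ in range(num_pes)]
workers = [threading.Thread(target=pe_worker, args=(p, task_qs[p], request_q))
           for p in range(num_pes)]
for w in workers:
    w.start()
print(task_supplier(tasks, num_pes, request_q, task_qs))
for w in workers:
    w.join()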
Load Balancing • All PEs should complete at the same time • Some PEs busy with complex tasks • Other PEs available for easier tasks • Computation effort of each task unknown • hot spot at end of processing ⇒ unbalanced solution • Any knowledge about hot spots should be used
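One simple, illustrative way to use such knowledge (not from the course notes): if a rough cost estimate per task exists, issue the expensive tasks first so no hot spot remains for the end of the run.

def order_by_estimated_cost(tasks, estimate):
    # Hand out the likely hot spots first so they cannot arrive at the end of the run.
    return sorted(tasks, key=estimate, reverse=True)

# Hypothetical image regions with a rough cost guess (object count per region):
regions = [("r0", 3), ("r1", 40), ("r2", 7), ("r3", 22)]
print(order_by_estimated_cost(regions, estimate=lambda r: r[1]))
# [('r1', 40), ('r3', 22), ('r2', 7), ('r0', 3)]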
Task Definition & Granularity • Computational elements • Atomic element (ray-object intersection) • sequential problem’s lowest computational element • Task (trace complete path of one ray) • parallel problem’s smallest computational element • Task granularity • number of atomic units is one task
Task Packet • Unit of task distribution • Informs a PE of which task(s) to perform • Task packet may include • indication of which task(s) to compute • data items (the PDI and (possibly) ADIs) • Task packet for ray tracer ⇒ one or more rays to be traced
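A sketch of what such a packet might look like for the ray tracer, with illustrative field names:

from dataclasses import dataclass, field

@dataclass
class TaskPacket:
    task_ids: list                                   # indication of which task(s) to compute
    rays: list                                       # the PDIs: one or more rays to be traced
    scene_items: list = field(default_factory=list)  # optional ADIs shipped with the packet

# Granularity choice: how many rays travel in one packet.
packet = TaskPacket(task_ids=[17], rays=[((0.0, 0.0, 0.0), (0.0, 0.0, -1.0))])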
Algorithmic Dependencies • Algorithm adopted for parallelisation: • May specify order of task completion • Dependencies MUST be preserved • Algorithmic dependencies introduce: • synchronisation points ⇒ distinct problem stages • data dependencies ⇒ careful data management
Distributed Task Management • Centralised task supply • All requests for new tasks to System Controller ⇒ bottleneck • Significant delay in fetching new tasks • Distributed task supply • task requests handled remotely from System Controller • spread of communication load across system • reduced time to satisfy task request
Preferred Bias Allocation • Combining Data driven & Demand driven • Balanced data driven • tasks allocated in a predetermined manner • Demand driven • tasks allocated dynamically on demand • Preferred Bias: Regions are purely conceptual • enables the exploitation of any coherence
Conceptual Regions • task allocation no longer arbitrary
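An illustrative sketch of preferred bias allocation: each PE draws from its own conceptual region first, and only takes tasks from another region once its own is exhausted, so most of the spatial coherence is preserved:

def next_task(pe_id, regions):
    # regions[p] holds the pending tasks of PE p's conceptual region.
    if regions[pe_id]:                       # preferred bias: draw from our own region first
        return regions[pe_id].pop(0)
    for other in range(len(regions)):        # own region empty: take work from another region
        if regions[other]:
            return regions[other].pop()      # take from the far end to preserve the owner's coherence
    return None                              # nothing left anywhere

regions = [[0, 1, 2], [3, 4], []]            # PE 2 has already emptied its region
print(next_task(2, regions))                 # returns a task taken from PE 0's region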
Data Management • Providing data to the processors • World model • Virtual shared memory • Data manager process • local data cache • requesting & locating data • Consistency
Remote Data Fetches • Advanced data management • Minimising communication latencies • Prefetching • Multi-threading • Profiling • Multi-stage problems
Data Requirements • Requirements may be large • Fit in the local memory of each processor • world model • Too large for each local memory • distributed data • provide virtual world model/virtual shared memory
Virtual Shared Memory (VSM) • Providing a conceptual single memory space • Memory is in fact distributed • Request is the same for both local & remote data • Speed of access may be (very) different • Levels of VSM implementation (higher to lower): system software (provided by the DM process), compiler (HPF, ORCA), operating system (coherent paging), hardware (DDM, DASH, KSR-1)
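A minimal sketch of the data-manager view of VSM, assuming a fetch_remote callable that locates and returns an item held by another PE; the class and names are illustrative:

class DataManager:
    # Local data cache plus a uniform get(): the request looks the same
    # whether the item is local or remote; only the access time differs.
    def __init__(self, local_items, fetch_remote):
        self.cache = dict(local_items)       # items resident in this PE's memory
        self.fetch_remote = fetch_remote     # assumed callable that locates and returns a remote item

    def get(self, item_id):
        if item_id not in self.cache:
            self.cache[item_id] = self.fetch_remote(item_id)   # remote fetch: the slow path
        return self.cache[item_id]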
Consistency • Read/write can result in inconsistencies • Distributed memory • multiple copies of the same data item • Updating such a data item • update all copies of this data item • invalidate all other copies of this data item
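A sketch of the invalidation option (the alternative would broadcast the updated value to all copies instead); the message format and send primitive are assumed:

def write_item(item_id, value, local_cache, copy_holders, send):
    # Write-invalidate: keep one valid copy, tell every other holder to discard theirs.
    # `send(pe, message)` is an assumed message-passing primitive.
    local_cache[item_id] = value
    for pe in copy_holders.get(item_id, []):
        send(pe, ("INVALIDATE", item_id))    # those PEs must re-fetch the item on next access
    copy_holders[item_id] = []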
Minimising Impact of Remote Data • Failure to find a data item locally ⇒ remote fetch • Time to find data item can be significant • Processor idle during this time • Latency difficult to predict • e.g. depends on current message densities • Data management must minimise this idle time
Data Management Techniques • Hiding the Latency • Overlapping the communication with computation • prefetching • multi-threading • Minimising the Latency • Reducing the time of a remote fetch • profiling • caching
Prefetching • Exploiting knowledge of data requests • A priori knowledge of data requirements • nature of the problem • choice of computational model • DM can prefetch them (up to some specified horizon) • available locally when required • overlapping communication with computation
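A sketch of prefetching up to a fixed horizon, building on the DataManager sketch above; items_needed is an assumed predictor of the data items a task will touch:

def prefetch(dm, upcoming_tasks, items_needed, horizon=4):
    # Fetch the data of the next few tasks while the current task is still computing.
    for task in upcoming_tasks[:horizon]:
        for item_id in items_needed(task):
            if item_id not in dm.cache:
                dm.cache[item_id] = dm.fetch_remote(item_id)   # ideally issued asynchronously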
Multi-Threading • Keeping PE busy with useful computation • Remote data fetch ⇒ current task stalled • Start another task (Processor kept busy) • separate threads of computation (BSP) • Disadvantages: Overheads • Context switches between threads • Increased message densities • Reduced local cache for each thread
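An illustrative sketch of the idea using Python generators as lightweight threads: a task yields when it would stall on a remote fetch, and the PE switches to another ready task instead of idling (a real system would use threads or BSP supersteps):

def run_tasks(task_generators):
    # Round-robin between tasks; a task yields whenever it must wait
    # for a remote data fetch, so the PE always has useful work.
    ready = list(task_generators)
    results = []
    while ready:
        gen = ready.pop(0)
        try:
            next(gen)          # run until the task stalls (yields) or finishes
            ready.append(gen)  # stalled: reschedule after the others
        except StopIteration as done:
            results.append(done.value)
    return results

def trace_task(task_id):
    yield                      # pretend we stalled here on a remote data fetch
    return f"result {task_id}"

print(run_tasks([trace_task(i) for i in range(3)]))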
Results for Multi-Threading • More than the optimal number of threads reduces performance • “Cache 22” situation • more threads ⇒ less local cache per thread ⇒ more data misses
Profiling • Reducing the remote fetch time • At the end of computation all data requests are known • if known then can be prefetched • Monitor data requests for each task • build up a “picture” of possible requirements • Exploit spatial coherence (with preferred bias allocation) • prefetch those data items likely to be required
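An illustrative sketch of the monitoring side: record which data items each task's region actually requested, then predict (and prefetch) the needs of a new task from the profiles of its neighbouring regions; the neighbours function is an assumed adjacency helper:

from collections import defaultdict

profile = defaultdict(set)               # task region -> data items it actually requested

def record_request(region, item_id):
    profile[region].add(item_id)         # called by the data manager on every data request

def predicted_items(region, neighbours):
    # Spatial coherence: a task's needs resemble those of already-profiled neighbouring regions.
    guess = set()
    for n in neighbours(region):
        guess |= profile.get(n, set())
    return guess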