Practical Parallel Processing for Today’s Rendering Challenges SIGGRAPH 2001 Course 40 Los Angeles, CA
Speakers • Alan Chalmers, University of Bristol • Tim Davis, Clemson University • Erik Reinhard, University of Utah • Toshi Kato, SquareUSA
Schedule • Introduction • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
Schedule • Introduction (Davis) • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
The Need for Speed • Graphics rendering is time-consuming • large amount of data in a single image • animations much worse • Demand continues to rise for high-quality graphics
Rendering and Parallel Processing • A holy union • Many graphics rendering tasks can be performed in parallel • Often “embarrassingly parallel”
3-D Graphics Boards • Getting better • Perform “tricks” with texture mapping • Steve Jobs’ remark on constant frame rendering time
Parallel / Distributed Rendering • Fundamental Issues • Task Management • Task subdivision, Migration, Load balancing • Data Management • Data distributed across system • Communication
Schedule • Introduction • Parallel / Distributed Rendering Issues (Chalmers) • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion
Introduction “Parallel processing is like a dog’s walking on its hind legs. It is not done well, but you are surprised to find it done at all” [Steve Fiddes (apologies to Samuel Johnson)] • Co-operation • Dependencies • Scalability • Control
Co-operation • Solution of a single problem • One person takes a certain time to solve the problem • Divide problem into a number of sub-problems • Each sub-problem solved by a single worker • Reduced problem solution time • BUT • co-operation overheads
Working Together • Overheads • access to pool • collision avoidance
Dependencies • Divide a problem into a number of distinct stages • Parallel solution of one stage before next can start • May be too severe ⇒ no parallel solution • each sub-problem dependent on previous stage • Dependency-free problems • order of task completion unimportant • BUT co-operation still required
Building with Blocks • Strictly sequential • Dependency-free
Scalability • Upper bound on the number of workers • Additional workers will NOT improve solution time • Shows how suitable a problem is for parallel processing • Given problem ⇒ finite number of sub-problems • more workers than tasks brings no further benefit • Upper bound may be (a lot) less than number of tasks • bottlenecks
Bottleneck at Doorway More workers may result in LONGER solution time
Control • Required by all parallel implementations • What constitutes a task • When has the problem been solved • How to deal with multiple stages • Forms of control • centralised • distributed
Control Required • Sequential • Parallel
Inherent Difficulties • Failure to successfully complete • Sequential solution • deficiencies in algorithm or data • Parallel solution • deficiencies in algorithm or data • deadlock • data consistency
Novel Difficulties • Factors arising from implementation • Deadlock • processor waiting indefinitely for an event • Data consistency • data is distributed amongst processors • Communication overheads • latency in message transfer
Evaluating Parallel Implementations • Realisation penalties • Algorithmic penalty • nature of the algorithm chosen • Implementation penalty • need to communicate • concurrent computation & communication activities • idle time
Task Management • Providing tasks to the processors • Problem decomposition • algorithmic decomposition • domain decomposition • Definition of a task • Computational Model
Problem Decomposition • Exploit parallelism • Inherent in algorithm • algorithmic decomposition • parallelising compilers • Applying same algorithm to different data items • domain decomposition • need for explicit system software support
Abstract Definition of a Task • Principal Data Item (PDI) - the item to which the algorithm is applied • Additional Data Items (ADIs) - data needed to complete the computation
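As a concrete, purely illustrative reading of this definition, a task might be represented as a small Python record; the field names below are ours, not from the course code:

from dataclasses import dataclass, field

@dataclass
class Task:
    pdi: object                               # Principal Data Item, e.g. one primary ray to trace
    adis: list = field(default_factory=list)  # Additional Data Items, e.g. candidate scene objects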
Computational Models • Determines the manner in which tasks are allocated to PEs • Maximise PE computation time • Minimise idle time • load balancing • Evenly allocate tasks amongst the processors
Data Driven Models • All PDIs allocated to specific PEs before computation starts • Each PE knows a priori which PDIs it is responsible for • Balanced (geometric decomposition) • evenly allocate tasks amongst the processors: portion at each PE = number of PDIs ÷ number of PEs • if the number of PDIs is not an exact multiple of the number of PEs, some PEs do one extra task
Balanced Data Driven • solution time = initial distribution + (computation time for all 24 PDIs ÷ 3 PEs) + result collation
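A worked sketch of the balanced allocation using the slide's 24 PDIs and 3 PEs; the timing values are assumed purely for illustration:

num_pdis = 24              # PDIs in the slide's example
num_pes = 3                # PEs in the slide's example
portion_at_each_pe = num_pdis // num_pes        # = 8 tasks per PE

# Assumed illustrative timings, not figures from the course notes:
time_per_task = 1.0        # computation time per PDI
distribution_time = 0.5    # initial distribution of the PDIs
collation_time = 0.5       # collecting the results

solution_time = distribution_time + portion_at_each_pe * time_per_task + collation_time
print(portion_at_each_pe, solution_time)        # 8 tasks each, 9.0 time units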
Demand Driven Model • Task computation time unknown • Work is allocated dynamically as PEs become idle • PEs no longer bound to particular PDIs • PEs explicitly demand new tasks • Task supplier process must satisfy these demands
Dynamic Allocation of Tasks • solution time = (total comp time for all PDIs ÷ number of PEs) + 2 × total comms time
Task Supplier Process
PROCESS Task_Supplier()
Begin
  remaining_tasks := total_number_of_tasks
  results_outstanding := total_number_of_tasks
  (* initialise all processors with one task *)
  FOR p = 1 TO number_of_PEs
    SEND task TO PE[p]
    remaining_tasks := remaining_tasks - 1
  ENDFOR
  WHILE results_outstanding > 0 DO
    RECEIVE result FROM PE[i]
    results_outstanding := results_outstanding - 1
    IF remaining_tasks > 0 THEN
      SEND task TO PE[i]
      remaining_tasks := remaining_tasks - 1
    ENDIF
  ENDWHILE
End (* Task_Supplier *)
Simple demand driven task supplier
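Below is a minimal runnable sketch of the same demand driven farm in Python, using threads and queues to stand in for PEs and message passing; it illustrates the scheme above rather than the course's implementation, and it assumes there is at least one task per PE:

import queue
import threading

def pe_worker(pe_id, task_q, request_q):
    # A processing element: compute a task, return the result, implicitly demand more work.
    while True:
        task = task_q.get()
        if task is None:                     # sentinel: no work left
            break
        result = task * task                 # stand-in for the real rendering computation
        request_q.put((pe_id, result))

def task_supplier(tasks, num_pes, request_q, task_qs):
    # Mirrors the pseudocode: one task per PE to start, then one task per result received.
    remaining = list(tasks)
    for p in range(num_pes):                 # assumes at least one task per PE
        task_qs[p].put(remaining.pop(0))
    results = []
    while len(results) < len(tasks):
        pe, result = request_q.get()         # a PE reports a result and asks for more work
        results.append(result)
        task_qs[pe].put(remaining.pop(0) if remaining else None)
    return results

tasks = list(range(16))
num_pes = 4
request_q = queue.Queue()
task_qs = [queue.Queue() for _ in range(num_pes)]
workers = [threading.Thread(target=pe_worker, args=(p, task_qs[p], request_q))
           for p in range(num_pes)]
for w in workers:
    w.start()
print(task_supplier(tasks, num_pes, request_q, task_qs))
for w in workers:
    w.join()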
Load Balancing • All PEs should complete at the same time • Some PEs busy with complex tasks • Other PEs available for easier tasks • Computation effort of each task unknown • hot spot at end of processing ⇒ unbalanced solution • Any knowledge about hot spots should be used
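One simple, illustrative way to use such knowledge (not from the course notes): if a rough cost estimate per task exists, issue the expensive tasks first so no hot spot remains for the end of the run.

def order_by_estimated_cost(tasks, estimate):
    # Hand out the likely hot spots first so they cannot arrive at the end of the run.
    return sorted(tasks, key=estimate, reverse=True)

# Hypothetical image regions with a rough cost guess (object count per region):
regions = [("r0", 3), ("r1", 40), ("r2", 7), ("r3", 22)]
print(order_by_estimated_cost(regions, estimate=lambda r: r[1]))
# [('r1', 40), ('r3', 22), ('r2', 7), ('r0', 3)]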
Task Definition & Granularity • Computational elements • Atomic element (ray-object intersection) • sequential problem’s lowest computational element • Task (trace complete path of one ray) • parallel problem’s smallest computational element • Task granularity • number of atomic units is one task
Task Packet • Unit of task distribution • Informs a PE of which task(s) to perform • Task packet may include • indication of which task(s) to compute • data items (the PDI and (possibly) ADIs) • Task packet for ray tracer ⇒ one or more rays to be traced
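A sketch of what such a packet might look like for the ray tracer, with illustrative field names:

from dataclasses import dataclass, field

@dataclass
class TaskPacket:
    task_ids: list                                   # indication of which task(s) to compute
    rays: list                                       # the PDIs: one or more rays to be traced
    scene_items: list = field(default_factory=list)  # optional ADIs shipped with the packet

# Granularity choice: how many rays travel in one packet.
packet = TaskPacket(task_ids=[17], rays=[((0.0, 0.0, 0.0), (0.0, 0.0, -1.0))])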
Algorithmic Dependencies • Algorithm adopted for parallelisation: • May specify order of task completion • Dependencies MUST be preserved • Algorithmic dependencies introduce: • synchronisation points ⇒ distinct problem stages • data dependencies ⇒ careful data management
Distributed Task Management • Centralised task supply • All requests for new tasks to System Controller ⇒ bottleneck • Significant delay in fetching new tasks • Distributed task supply • task requests handled remotely from System Controller • spread of communication load across system • reduced time to satisfy task request
Preferred Bias Allocation • Combining Data driven & Demand driven • Balanced data driven • tasks allocated in a predetermined manner • Demand driven • tasks allocated dynamically on demand • Preferred Bias: Regions are purely conceptual • enables the exploitation of any coherence
Conceptual Regions • task allocation no longer arbitrary
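An illustrative sketch of preferred bias allocation: each PE draws from its own conceptual region first, and only takes tasks from another region once its own is exhausted, so most of the spatial coherence is preserved:

def next_task(pe_id, regions):
    # regions[p] holds the pending tasks of PE p's conceptual region.
    if regions[pe_id]:                       # preferred bias: draw from our own region first
        return regions[pe_id].pop(0)
    for other in range(len(regions)):        # own region empty: take work from another region
        if regions[other]:
            return regions[other].pop()      # take from the far end to preserve the owner's coherence
    return None                              # nothing left anywhere

regions = [[0, 1, 2], [3, 4], []]            # PE 2 has already emptied its region
print(next_task(2, regions))                 # returns a task taken from PE 0's region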
Data Management • Providing data to the processors • World model • Virtual shared memory • Data manager process • local data cache • requesting & locating data • Consistency
Remote Data Fetches • Advanced data management • Minimising communication latencies • Prefetching • Multi-threading • Profiling • Multi-stage problems
Data Requirements • Requirements may be large • Fit in the local memory of each processor • world model • Too large for each local memory • distributed data • provide virtual world model/virtual shared memory
Virtual Shared Memory (VSM) • Providing a conceptual single memory space • Memory is in fact distributed • Request is the same for both local & remote data • Speed of access may be (very) different • Levels of VSM implementation (higher to lower): system software (provided by the DM process), compiler (HPF, ORCA), operating system (coherent paging), hardware (DDM, DASH, KSR-1)
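A minimal sketch of the data-manager view of VSM, assuming a fetch_remote callable that locates and returns an item held by another PE; the class and names are illustrative:

class DataManager:
    # Local data cache plus a uniform get(): the request looks the same
    # whether the item is local or remote; only the access time differs.
    def __init__(self, local_items, fetch_remote):
        self.cache = dict(local_items)       # items resident in this PE's memory
        self.fetch_remote = fetch_remote     # assumed callable that locates and returns a remote item

    def get(self, item_id):
        if item_id not in self.cache:
            self.cache[item_id] = self.fetch_remote(item_id)   # remote fetch: the slow path
        return self.cache[item_id]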
Consistency • Read/write can result in inconsistencies • Distributed memory • multiple copies of the same data item • Updating such a data item • update all copies of this data item • invalidate all other copies of this data item
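A sketch of the invalidation option (the alternative would broadcast the updated value to all copies instead); the message format and send primitive are assumed:

def write_item(item_id, value, local_cache, copy_holders, send):
    # Write-invalidate: keep one valid copy, tell every other holder to discard theirs.
    # `send(pe, message)` is an assumed message-passing primitive.
    local_cache[item_id] = value
    for pe in copy_holders.get(item_id, []):
        send(pe, ("INVALIDATE", item_id))    # those PEs must re-fetch the item on next access
    copy_holders[item_id] = []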
Minimising Impact of Remote Data • Failure to find a data item locally ⇒ remote fetch • Time to find data item can be significant • Processor idle during this time • Latency difficult to predict • e.g. depends on current message densities • Data management must minimise this idle time
Data Management Techniques • Hiding the Latency • Overlapping the communication with computation • prefetching • multi-threading • Minimising the Latency • Reducing the time of a remote fetch • profiling • caching
Prefetching • Exploiting knowledge of data requests • A priori knowledge of data requirements • nature of the problem • choice of computational model • DM can prefetch them (up to some specified horizon) • available locally when required • overlapping communication with computation
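A sketch of prefetching up to a fixed horizon, building on the DataManager sketch above; items_needed is an assumed predictor of the data items a task will touch:

def prefetch(dm, upcoming_tasks, items_needed, horizon=4):
    # Fetch the data of the next few tasks while the current task is still computing.
    for task in upcoming_tasks[:horizon]:
        for item_id in items_needed(task):
            if item_id not in dm.cache:
                dm.cache[item_id] = dm.fetch_remote(item_id)   # ideally issued asynchronously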
Multi-Threading • Keeping PE busy with useful computation • Remote data fetch ⇒ current task stalled • Start another task (Processor kept busy) • separate threads of computation (BSP) • Disadvantages: Overheads • Context switches between threads • Increased message densities • Reduced local cache for each thread
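An illustrative sketch of the idea using Python generators as lightweight threads: a task yields when it would stall on a remote fetch, and the PE switches to another ready task instead of idling (a real system would use threads or BSP supersteps):

def run_tasks(task_generators):
    # Round-robin between tasks; a task yields whenever it must wait
    # for a remote data fetch, so the PE always has useful work.
    ready = list(task_generators)
    results = []
    while ready:
        gen = ready.pop(0)
        try:
            next(gen)          # run until the task stalls (yields) or finishes
            ready.append(gen)  # stalled: reschedule after the others
        except StopIteration as done:
            results.append(done.value)
    return results

def trace_task(task_id):
    yield                      # pretend we stalled here on a remote data fetch
    return f"result {task_id}"

print(run_tasks([trace_task(i) for i in range(3)]))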
Results for Multi-Threading • More than the optimal number of threads reduces performance • “Cache 22” situation • more threads ⇒ less local cache per thread ⇒ more data misses
Profiling • Reducing the remote fetch time • At the end of computation all data requests are known • if known then can be prefetched • Monitor data requests for each task • build up a “picture” of possible requirements • Exploit spatial coherence (with preferred bias allocation) • prefetch those data items likely to be required
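An illustrative sketch of the monitoring side: record which data items each task's region actually requested, then predict (and prefetch) the needs of a new task from the profiles of its neighbouring regions; the neighbours function is an assumed adjacency helper:

from collections import defaultdict

profile = defaultdict(set)               # task region -> data items it actually requested

def record_request(region, item_id):
    profile[region].add(item_id)         # called by the data manager on every data request

def predicted_items(region, neighbours):
    # Spatial coherence: a task's needs resemble those of already-profiled neighbouring regions.
    guess = set()
    for n in neighbours(region):
        guess |= profile.get(n, set())
    return guess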