
Practical Parallel Processing for Today’s Rendering Challenges SIGGRAPH 2001 Course 40


Presentation Transcript


  1. Practical Parallel Processing for Today’s Rendering Challenges SIGGRAPH 2001 Course 40 Los Angeles, CA

  2. Speakers • Alan Chalmers, University of Bristol • Tim Davis, Clemson University • Erik Reinhard, University of Utah • Toshi Kato, SquareUSA

  3. Schedule • Introduction • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion

  4. Schedule • Introduction (Davis) • Parallel / Distributed Rendering Issues • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion

  5. The Need for Speed • Graphics rendering is time-consuming • large amount of data in a single image • animations much worse • Demand continues to rise for high-quality graphics

  6. Rendering and Parallel Processing • A holy union • Many graphics rendering tasks can be performed in parallel • Often “embarrassingly parallel”

  7. 3-D Graphics Boards • Getting better • Perform “tricks” with texture mapping • Steve Jobs’ remark on constant frame rendering time

  8. Parallel / Distributed Rendering • Fundamental Issues • Task Management • Task subdivision, Migration, Load balancing • Data Management • Data distributed across system • Communication

  9. Schedule • Introduction • Parallel / Distributed Rendering Issues (Chalmers) • Classification of Parallel Rendering Systems • Practical Applications • Summary / Discussion

  10. Introduction “Parallel processing is like a dog’s walking on its hind legs. It is not done well, but you are surprised to find it done at all” [Steve Fiddes (apologies to Samuel Johnson)] • Co-operation • Dependencies • Scalability • Control

  11. Co-operation • Solution of a single problem • One person takes a certain time to solve the problem • Divide problem into a number of sub-problems • Each sub-problem solved by a single worker • Reduced problem solution time • BUT • co-operation → overheads

  12. Working Together • Overheads • access to pool • collision avoidance

  13. Dependencies • Divide a problem into a number of distinct stages • Parallel solution of one stage before next can start • May be too severe → no parallel solution • each sub-problem dependent on previous stage • Dependency-free problems • order of task completion unimportant • BUT co-operation still required

  14. Building with Blocks Strictly sequential Dependency-free

  15. Scalability • Upper bound on the number of workers • Additional workers will NOT improve solution time • Shows how suitable a problem is for parallel processing • Given problem → finite number of sub-problems • more workers than tasks • Upper bound may be (a lot) less than number of tasks • bottlenecks
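A rough worked example, assuming 16 equal, independent sub-problems of one time unit each and ignoring all co-operation overheads, makes the upper bound concrete: with w workers the ideal solution time is ceil(16 / w), so workers beyond 16 cannot help.

    import math

    TASKS = 16                                # finite number of sub-problems
    for workers in (1, 2, 4, 8, 16, 32):
        time = math.ceil(TASKS / workers)     # ideal time, no overheads
        print(f"{workers:2d} workers -> {time:2d} time units")
    # 32 workers finish no sooner than 16: the task count bounds the useful workers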

  16. Bottleneck at Doorway More workers may result in LONGER solution time

  17. Control • Required by all parallel implementations • What constitutes a task • When has the problem been solved • How to deal with multiple stages • Forms of control • centralised • distributed

  18. Control Required Sequential Parallel

  19. Inherent Difficulties • Failure to successfully complete • Sequential solution • deficiencies in algorithm or data • Parallel solution • deficiencies in algorithm or data • deadlock • data consistency

  20. Novel Difficulties • Factors arising from implementation • Deadlock • processor waiting indefinitely for an event • Data consistency • data is distributed amongst processors • Communication overheads • latency in message transfer

  21. Evaluating Parallel Implementations • Realisation penalties • Algorithmic penalty • nature of the algorithm chosen • Implementation penalty • need to communicate • concurrent computation & communication activities • idle time

  22. Solution Times

  23. Task Management • Providing tasks to the processors • Problem decomposition • algorithmic decomposition • domain decomposition • Definition of a task • Computational Model

  24. Problem Decomposition • Exploit parallelism • Inherent in algorithm • algorithmic decomposition • parallelising compilers • Applying same algorithm to different data items • domain decomposition • need for explicit system software support

  25. Abstract Definition of a Task • Principal Data Item (PDI) - the data item to which the algorithm is applied • Additional Data Items (ADIs) - data needed to complete the computation

  26. Computational Models • Determines the manner in which tasks are allocated to PEs • Maximise PE computation time • Minimise idle time • load balancing • Evenly allocate tasks amongst the processors

  27. Data Driven Models • All PDIs allocated to specific PEs before computation starts • Each PE knows a priori which PDIs it is responsible for • Balanced (geometric decomposition) • evenly allocate tasks amongst the processors • portion at each PE = number of PDIs / number of PEs • if PDIs not an exact multiple of PEs then some PEs do one extra task
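A minimal Python sketch of this balanced (geometric) allocation; the names pdis and num_pes are illustrative, not from the course code:

    def balanced_allocation(pdis, num_pes):
        """Split the PDIs across PEs before computation starts."""
        base, extra = divmod(len(pdis), num_pes)
        chunks, start = [], 0
        for pe in range(num_pes):
            size = base + (1 if pe < extra else 0)   # some PEs take one extra task
            chunks.append(pdis[start:start + size])
            start += size
        return chunks

    print(balanced_allocation(list(range(10)), 3))   # portions of 4, 3 and 3 PDIs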

  28. Balanced Data Driven • solution time = initial distribution + 24/3 + result collation (e.g. 24 PDIs shared evenly amongst 3 PEs)

  29. Demand Driven Model • Task computation time unknown • Work is allocated dynamically as PEs become idle • PEs no longer bound to particular PDIs • PEs explicitly demand new tasks • Task supplier process must satisfy these demands

  30. Dynamic Allocation of Tasks • solution time = (total comp time for all PDIs / number of PEs) + 2 × total comms time

  31. Task Supplier Process

      PROCESS Task_Supplier()
      Begin
        remaining_tasks := total_number_of_tasks
        (* initialise all processors with one task *)
        FOR p = 1 TO number_of_PEs
          SEND task TO PE[p]
          remaining_tasks := remaining_tasks - 1
        WHILE results_outstanding DO
          RECEIVE result FROM PE[i]
          IF remaining_tasks > 0 THEN
            SEND task TO PE[i]
            remaining_tasks := remaining_tasks - 1
          ENDIF
      End (* Task_Supplier *)

      Simple demand driven task supplier
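The pseudocode above maps directly onto a message-passing style. The following is a hedged, runnable Python sketch using multiprocessing queues, where the shared task queue plays the role of the supplier (idle PEs pull the next task as they finish) and the squaring "computation" is only a placeholder:

    from multiprocessing import Process, Queue

    def worker(pe_id, task_q, result_q):
        # PE loop: demand the next task as soon as the previous one finishes
        for task in iter(task_q.get, None):          # None marks "no tasks left"
            result_q.put((pe_id, task, task * task)) # placeholder computation

    def task_supplier(tasks, num_pes):
        task_q, result_q = Queue(), Queue()
        pes = [Process(target=worker, args=(p, task_q, result_q))
               for p in range(num_pes)]
        for p in pes:
            p.start()
        for t in tasks:                              # tasks taken on demand by idle PEs
            task_q.put(t)
        for _ in pes:                                # one sentinel per PE
            task_q.put(None)
        results = [result_q.get() for _ in tasks]
        for p in pes:
            p.join()
        return results

    if __name__ == "__main__":
        print(task_supplier(list(range(8)), num_pes=3))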

  32. Load Balancing • All PEs should complete at the same time • Some PEs busy with complex tasks • Other PEs available for easier tasks • Computation effort of each task unknown • hot spot at end of processing → unbalanced solution • Any knowledge about hot spots should be used

  33. Task Definition & Granularity • Computational elements • Atomic element (ray-object intersection) • sequential problem’s lowest computational element • Task (trace complete path of one ray) • parallel problem’s smallest computational element • Task granularity • number of atomic units is one task

  34. Task Packet • Unit of task distribution • Informs a PE of which task(s) to perform • Task packet may include • indication of which task(s) to compute • data items (the PDI and (possibly) ADIs) • Task packet for ray tracer → one or more rays to be traced
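One possible shape for such a packet, as an illustrative sketch only (the field names are not from the course material):

    from dataclasses import dataclass, field

    @dataclass
    class TaskPacket:
        """Unit of task distribution: tells a PE what to compute and with what data."""
        task_ids: list                               # which task(s) to compute
        pdi: object                                  # principal data item, e.g. rays to trace
        adis: list = field(default_factory=list)     # optional additional data items

    packet = TaskPacket(task_ids=[17, 18], pdi=["ray17", "ray18"])
    print(packet)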

  35. Algorithmic Dependencies • Algorithm adopted for parallelisation: • May specify order of task completion • Dependencies MUST be preserved • Algorithmic dependencies introduce: • synchronisation points → distinct problem stages • data dependencies → careful data management

  36. Distributed Task Management • Centralised task supply • All requests for new tasks to System Controller → bottleneck • Significant delay in fetching new tasks • Distributed task supply • task requests handled remotely from System Controller • spread of communication load across system • reduced time to satisfy task request

  37. Preferred Bias Allocation • Combining Data driven & Demand driven • Balanced data driven • tasks allocated in a predetermined manner • Demand driven • tasks allocated dynamically on demand • Preferred Bias: Regions are purely conceptual • enables the exploitation of any coherence
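A hedged sketch of preferred bias allocation: tasks are grouped into conceptual regions, each PE prefers its own region (preserving coherence), but an idle PE may still take work from elsewhere; all names here are illustrative:

    def next_task(pe, regions, preferred_region):
        """Demand driven supply with preferred bias: own region first, then steal."""
        own = regions[preferred_region[pe]]
        if own:
            return own.pop()                  # exploit coherence within the region
        for region in regions:                # region boundaries are only conceptual
            if region:
                return region.pop()
        return None                           # nothing left anywhere

    regions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # conceptual regions of PDIs
    preferred_region = {0: 0, 1: 1, 2: 2}           # PE -> its preferred region
    print(next_task(0, regions, preferred_region))  # PE 0 draws from region 0 first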

  38. Conceptual Regions • task allocation no longer arbitrary

  39. Data Management • Providing data to the processors • World model • Virtual shared memory • Data manager process • local data cache • requesting & locating data • Consistency

  40. Remote Data Fetches • Advanced data management • Minimising communication latencies • Prefetching • Multi-threading • Profiling • Multi-stage problems

  41. Data Requirements • Requirements may be large • Fit in the local memory of each processor • world model • Too large for each local memory • distributed data • provide virtual world model/virtual shared memory

  42. Virtual Shared Memory (VSM) • Providing a conceptual single memory space • Memory is in fact distributed • Request is the same for both local & remote data • Speed of access may be (very) different • Levels at which VSM may be provided (higher to lower): • System Software - provided by DM process • Compiler - HPF, ORCA • Operating System - coherent paging • Hardware - DDM, DASH, KSR-1
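A minimal sketch of the "same request, different speed" behaviour when VSM is provided by a data manager process; DataManager, fetch_remote and the cache dictionary are illustrative, not an API from the course:

    class DataManager:
        """Local data cache with a single get() for both local and remote items."""
        def __init__(self, local_items, fetch_remote):
            self.cache = dict(local_items)    # items already resident on this PE
            self.fetch_remote = fetch_remote  # callable that asks another PE (slow)

        def get(self, item_id):
            if item_id not in self.cache:     # miss -> remote fetch, then cache it
                self.cache[item_id] = self.fetch_remote(item_id)
            return self.cache[item_id]        # hit -> fast local access

    dm = DataManager({"objA": "local copy"}, fetch_remote=lambda k: f"remote copy of {k}")
    print(dm.get("objA"), "|", dm.get("objB"))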

  43. Consistency • Read/write can result in inconsistencies • Distributed memory • multiple copies of the same data item • Updating such a data item • either update all copies of this data item • or invalidate all other copies of this data item
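As an illustrative sketch of the invalidation alternative (not the course's actual protocol), a write keeps the writer's copy and drops every other cached copy so later reads must refetch:

    class PE:
        def __init__(self):
            self.cache = {}

    def write_item(item_id, value, writer, copies):
        """Invalidation sketch: update the writer's copy, invalidate all others."""
        writer.cache[item_id] = value
        for pe in copies:
            if pe is not writer:
                pe.cache.pop(item_id, None)   # stale copy removed

    a, b = PE(), PE()
    a.cache["x"] = b.cache["x"] = 1           # two copies of the same data item
    write_item("x", 2, writer=a, copies=[a, b])
    print(a.cache, b.cache)                   # {'x': 2} {}  -> b refetches on next read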

  44. Minimising Impact of Remote Data • Failure to find a data item locally → remote fetch • Time to find data item can be significant • Processor idle during this time • Latency difficult to predict • e.g. depends on current message densities • Data management must minimise this idle time

  45. Data Management Techniques • Hiding the Latency • Overlapping the communication with computation • prefetching • multi-threading • Minimising the Latency • Reducing the time of a remote fetch • profiling • caching

  46. Prefetching • Exploiting knowledge of data requests • A priori knowledge of data requirements • nature of the problem • choice of computational model • DM can prefetch them (up to some specified horizon) • available locally when required • overlapping communication with computation
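A small sketch of prefetching up to a horizon; predict_needs and fetch_async are hypothetical hooks standing in for the DM's a priori knowledge and its asynchronous communication:

    def prefetch(cache, task, predict_needs, fetch_async, horizon=4):
        """Request up to `horizon` predicted data items ahead of their use."""
        for item_id in predict_needs(task)[:horizon]:
            if item_id not in cache:
                fetch_async(item_id)          # request overlaps with computation

    issued = []
    prefetch(cache={"d0": 0}, task="task1",
             predict_needs=lambda t: ["d0", "d1", "d2"],
             fetch_async=issued.append, horizon=2)
    print(issued)                             # ['d1'] - d0 was already local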

  47. Multi-Threading • Keeping PE busy with useful computation • Remote data fetch → current task stalled • Start another task (Processor kept busy) • separate threads of computation (BSP) • Disadvantages: Overheads • Context switches between threads • Increased message densities • Reduced local cache for each thread
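A toy illustration of the idea (cooperative "threads" as Python generators rather than real threads): a task yields while its remote fetch is outstanding, and the PE switches to another ready task instead of idling:

    def task(name, fetch_ticks):
        for _ in range(fetch_ticks):          # remote data still in flight
            yield f"{name}: waiting on remote data"
        yield f"{name}: computing with fetched data"

    def multi_threaded_pe(threads):
        """Round-robin context switching keeps the PE busy during remote fetches."""
        while threads:
            t = threads.pop(0)
            try:
                print(next(t))
                threads.append(t)             # context switch: requeue, run another
            except StopIteration:
                pass                          # this thread's task has finished

    multi_threaded_pe([task("T1", fetch_ticks=2), task("T2", fetch_ticks=1)])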

  48. Results for Multi-Threading • More than the optimal number of threads reduces performance • “Cache 22” situation • less local cache → more data misses → more threads

  49. Profiling • Reducing the remote fetch time • At the end of computation all data requests are known • if known then can be prefetched • Monitor data requests for each task • build up a “picture” of possible requirements • Exploit spatial coherence (with preferred bias allocation) • prefetch those data items likely to be required
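A hedged sketch of such profiling: record which data items each conceptual region actually requested and treat the most frequent ones as prefetch candidates for neighbouring tasks (the structures are illustrative):

    from collections import Counter, defaultdict

    profile = defaultdict(Counter)            # region -> counts of requested items

    def record_request(region, item_id):
        profile[region][item_id] += 1         # build up a picture of requirements

    def prefetch_candidates(region, top_n=3):
        """Data items most often requested in this region so far."""
        return [item for item, _ in profile[region].most_common(top_n)]

    for item in ["objA", "objA", "objB", "objC", "objA"]:
        record_request(region=5, item_id=item)
    print(prefetch_candidates(5))             # ['objA', 'objB', 'objC']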

  50. Spatial Coherence
