Mass Market Applications of Massively Parallel Computing

Presentation Transcript


  1. Mass Market Applications of Massively Parallel Computing Chas. Boyd

  2. Outline • Projections of future hardware • The client computing space • Mass-market parallel applications • Common application characteristics • Interesting processor features

  3. The Physics of Silicon • The way processors get faster has fundamentally changed • No more free performance gains due to clock rate and Instruction-Level Parallelism • Yet gates-per-die continues to grow • Possibly faster now that clock rate isn’t an issue • Estimate: doubling every 2-2.5 years • New area means more cores and caches • In-order core counts may grow faster than Out-of-Order core counts do

  4. The Old Story

  5. A Surplus of Cores • ‘More cores than we know what to do with’ • Literally • Servers scale with transaction counts • Technical computing has a long history of dealing with parallel workloads • What are the opportunities for the PC client? • Are there mass-market applications that are parallelizable?

  6. Requirements of the Mass-Market Space • Fairly easy to program and maintain • Cannot break on future hardware or operating systems • Transparent backward and forward compatibility • Mass-market customers hate regressions! • Consumer software must operate for decades • Must get faster automatically • Why we are here

  7. AMD’s Term: ‘Personal Stream Computing’ • Actually nothing like ‘stream processing’ as used by Stanford’s Brook, etc.

  8. Data-Parallel Processing • The key technique; how do we apply it to consumers? • What takes lots of data? • Media: pixels, audio samples • Video, imaging, audio • Games
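
The pattern behind all of these is one thread per data element, with no dependencies between elements. A minimal sketch in CUDA (a hypothetical kernel, not from the talk) of a per-pixel brightness adjustment:

    // One thread per pixel: the canonical data-parallel media kernel.
    __global__ void brighten(const unsigned char* in, unsigned char* out,
                             int n, float gain)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[i] * gain;                        // independent per-element work
            out[i] = v > 255.0f ? 255 : (unsigned char)v;  // clamp to byte range
        }
    }

Because no element reads another's result, the kernel scales transparently with core count, exactly the property the mass market needs.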

  9. Video • Decode, encode, transcode • Motion estimation, DCT, quantization • Effects • Anything you would want to do to an image • Scaling, sepia, DVE effects (transitions) • Indexing • Search/recognition via convolutions
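
As an illustration of why these stages parallelize so well, here is a CUDA sketch (hypothetical, not the speaker's code) of the sum-of-absolute-differences cost at the heart of block-matching motion estimation: one thread per 8x8 block, one candidate vector (dx, dy). It assumes the reference frame is padded so the offsets stay in bounds and the grid exactly covers the frame's blocks.

    // Score one candidate motion vector for every 8x8 block of the frame.
    __global__ void sad8x8(const unsigned char* cur, const unsigned char* ref,
                           int stride, int blocksX, int dx, int dy,
                           unsigned int* cost)
    {
        int bx = blockIdx.x * blockDim.x + threadIdx.x;   // block column
        int by = blockIdx.y * blockDim.y + threadIdx.y;   // block row
        unsigned int sad = 0;
        for (int y = 0; y < 8; ++y)
            for (int x = 0; x < 8; ++x) {
                int cy = by * 8 + y, cx = bx * 8 + x;
                sad += abs(cur[cy * stride + cx]
                         - ref[(cy + dy) * stride + (cx + dx)]);
            }
        cost[by * blocksX + bx] = sad;   // a later pass picks the best vector
    }

Every block and every candidate vector is independent, so the search is massively parallel; note the low arithmetic intensity, though: only a few ALU ops per byte read.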

  10. Imaging • Demosaicing • Extract colors with knowledge of sensor layout • Segmentation • Identify areas of image to process • Cleanup • Color correction, noise removal, etc. • Indexing • Identify areas for tagging
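
To ground the demosaicing bullet, a small CUDA sketch (hypothetical; real pipelines use smarter, edge-aware filters) of the simplest case: recovering green at a red or blue site of an RGGB Bayer mosaic by averaging its four green neighbours.

    // Bilinear green reconstruction at an interior non-green Bayer site.
    __device__ float green_at(const float* raw, int width, int x, int y)
    {
        return 0.25f * (raw[(y - 1) * width + x] + raw[(y + 1) * width + x]
                      + raw[y * width + (x - 1)] + raw[y * width + (x + 1)]);
    }

Each output pixel depends only on a fixed neighbourhood of the input, so the whole image can be processed in parallel.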

  11. User Interaction with Media • Client applications can/should be interactive • Mass market wants full automation • ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images • Automating media processing requires analysis • Recognition, segmentation, image understanding • Is this image outdoors or inside? • Is this image right-side up? • Does it contain faces? • Are their eyes red?

  12. Imaging Markets • In some sense, the broader the market, the more sophisticated the algorithm required • Although pro-sumers care more about performance, and they are the ones who write the reviews

  13. FFT Performance

  14. Game Applications of Massive Parallelism • Rendering • Imaging • Physics • IK (inverse kinematics) • AI

  15. Ultima Underworld 1993

  16. Dark Messiah 2007

  17. Game Rendering • Well established at this point, but new techniques keep being discovered • Rendering different terms at different spatial scales • E.g., irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered • Spherical harmonic coefficient manipulations
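
To make the spherical-harmonics point concrete: once the incident lighting and a surface point's transfer function are each projected onto (say) 9 SH coefficients, irradiance falls out of a single dot product, independent of how many light sources were folded into the lighting vector. A hypothetical CUDA sketch, not the speaker's code:

    // Irradiance from 3rd-order SH: lighting coefficients dotted with the
    // precomputed per-vertex transfer coefficients.
    __device__ float shIrradiance(const float light[9], const float transfer[9])
    {
        float e = 0.0f;
        for (int k = 0; k < 9; ++k)
            e += light[k] * transfer[k];
        return e;
    }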

  18. Game Imaging • Post processing • Reduction (histogram or single average value) • Exposure estimation based on log average luminance • Exposure correction • Oversaturation extraction • Large blurs (proportional to screen size) • Depth of field • Motion blur
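
The exposure-estimation bullet hides a classic parallel reduction. A hedged CUDA sketch (hypothetical; assumes 256-thread, power-of-two blocks) of the first pass, which sums log-luminance per block in shared memory:

    // Stage 1 of the log-average-luminance reduction for auto-exposure.
    __global__ void logLumReduce(const float* lum, float* partial, int n)
    {
        __shared__ float s[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? logf(lum[i] + 1e-4f) : 0.0f;  // delta avoids log(0)
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = s[0];                          // one sum per block
    }

A second pass reduces the per-block partials; the log-average luminance is then exp(total / n), which drives the exposure correction.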

  19. Half-Life 2

  20. Half-Life 2

  21. Half-Life 2

  22. Game Physics • Particles (non-interacting) • Particles (interacting) • Rigid bodies • Deformable bodies • Etc.
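
The easiest rung of that ladder is worth seeing in code. A minimal CUDA sketch (hypothetical) of non-interacting particles: one thread per particle, no communication at all.

    struct Particle { float3 pos, vel; };

    // Explicit Euler step under constant acceleration g; fully independent.
    __global__ void stepParticles(Particle* p, int n, float3 g, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        p[i].vel.x += g.x * dt;  p[i].vel.y += g.y * dt;  p[i].vel.z += g.z * dt;
        p[i].pos.x += p[i].vel.x * dt;
        p[i].pos.y += p[i].vel.y * dt;
        p[i].pos.z += p[i].vel.z * dt;
    }

Each step down the list (interacting particles, rigid bodies, deformables) adds neighbour reads and write contention, which is where the i/o and ordering issues discussed later begin.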

  23. Game Processor Evolution • [Diagram: game workloads arranged along two axes, CPU to GPU and offline to real time. AI, animation, the game stack, and physics run on the CPU; content-creation tasks such as mesh modeling and texture creation migrate from offline processes into real-time vertex and pixel shaders on the GPU]

  24. Common Properties of Mass Apps • Results of client computations are displayed at interactive rates • Fundamental requirement of client systems • Tight coupling with graphics is optimal • Physical proximity to renderer is beneficial • Smaller data types are key

  25. Support for Image Data Types • Pixels, texels, motion vectors, etc. • Image data more important than float32s • Data declines in size as importance drops • Bytes, words, fp11, fp16, single, double • Bytes may be declining in importance • Hardware support for formatting is useful • Clock cycles required by shift/or/mul, etc. cost too much power
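
The formatting cost the last bullet refers to is easy to see. A CUDA sketch (hypothetical) of unpacking one byte-packed RGBA8 pixel in software: four shifts, four masks, and four multiplies that dedicated format-conversion hardware can do for free.

    // Manual unpack of a packed 32-bit RGBA8 pixel into normalized floats.
    __device__ float4 unpackRGBA8(unsigned int p)
    {
        const float s = 1.0f / 255.0f;
        return make_float4(( p        & 0xffu) * s,
                           ((p >>  8) & 0xffu) * s,
                           ((p >> 16) & 0xffu) * s,
                           ((p >> 24)        ) * s);
    }

Spent on every texel of every frame, those ALU cycles add up in both time and power.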

  26. I/O Considerations • For most computations other than 3-D rendering, GPUs are i/o bound • Arithmetic intensity is lower than what GPUs are tuned for • Convolutions are the canonical example • Support for efficient data types is very important

  27. GPU Arithmetic Intensity Projection

  28. GPU Arithmetic Intensity Projection • 2-3 more process doublings before new memory technologies will help much • Stacked die? 2K-wide bus? Optical? • Estimate: at least a 4x increase in the number of compute instructions per read operation • Arithmetic intensities reach 64?? • This is fine for 3-D rendering • Other data-parallel apps need more i/o

  29. I/O Patterns • Solutions will have a variety of mechanisms to help with worsening i/o constraints • Data re-use (at cache size scales) is relatively rare in media applications • Read-write use of memory is rare • Read-write caches are less critical • Streaming data behavior is sufficient • Read contention and write contention are the issue, not read-after-write scenarios

  30. Interesting Techniques • Shared registers • Possibly interesting to help with i/o bandwidth • Reducing on-chip bandwidth may help power/heat • Scatter • Can be useful in scenarios that don’t thrash the output subsystem • Can reduce pressure on the gather input system

  31. Convolution • Key element of almost all image- and video-processing operations • Scaling, glows, blurs, search, segmentation • Algorithm has very low arithmetic intensity • 1 MAD per sample • Also has huge re-use (on the order of the kernel size) • Shared registers should cut redundant reads by a factor of the kernel size
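
A hedged CUDA sketch (hypothetical; assumes a 256-thread block and an odd tap count) of the technique: stage a tile of the input plus its apron in shared memory, so each sample is fetched from DRAM once but reused once per tap.

    #define KERNEL 9                // tap count (odd)
    #define RADIUS (KERNEL / 2)
    #define TILE   256              // must equal blockDim.x

    // 1-D convolution with shared-memory staging to exploit re-use.
    __global__ void conv1d(const float* in, float* out, const float* w, int n)
    {
        __shared__ float tile[TILE + 2 * RADIUS];
        int i = blockIdx.x * TILE + threadIdx.x;
        // Load the tile body, then the left/right aprons, clamping at edges.
        tile[threadIdx.x + RADIUS] = in[min(max(i, 0), n - 1)];
        if (threadIdx.x < RADIUS) {
            tile[threadIdx.x] = in[max(i - RADIUS, 0)];
            tile[threadIdx.x + RADIUS + TILE] = in[min(i + TILE, n - 1)];
        }
        __syncthreads();
        if (i < n) {
            float acc = 0.0f;
            for (int k = 0; k < KERNEL; ++k)   // 1 MAD per tap
                acc += w[k] * tile[threadIdx.x + k];
            out[i] = acc;
        }
    }

Without the staging, every output would issue KERNEL independent DRAM reads; with it, off-chip traffic drops by roughly the kernel size.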

  32. Processor Core Types • Heterogeneous Many-core • In-Order vs. Out-of-Order • Distinction arose from targeting 2 different workload design points • Software can select ideal core type for each algorithm (workload design point) • Since not all cores can be powered anyway • Hardware can make trade-offs on: • Power, area, performance growth rate

  33. Workloads • [Chart: workloads plotted by code branchiness against memory-access pattern; CPUs target branchy code with local memory accesses, GPUs target streaming memory accesses]

  34. Workload Differences • General processing: small batches, frequent branches, many data inter-dependencies, scalar ops, vector ops • Media processing: large batches, few branches, few data inter-dependencies, scalar ops, vector ops

  35. Lesson from GPGPU Research • Many important tasks have data-parallel implementations • Typically requires a new algorithm • May be just as maintainable • Definitely more scalable with core count

  36. APIs Must Hide Implementations • Implementation attributes must be hidden from apps to enable scaling over time • Number of cores operating • Number of registers available • Number of i/o ports • Sizes of caches • Thread scheduling policies • Otherwise, these cannot change, and performance will not grow

  37. Order of Thread Execution • Shared registers and scatter share a pitfall: • It may be possible to write code that depends on the order of thread execution • This violates the scaling requirement • The order of thread execution may vary from run to run (each frame) • Will certainly vary between implementations • Cross-vendor and within vendor product lines • All such code is considered incorrect
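
The pitfall is easy to write down. In this hypothetical CUDA sketch, two threads may scatter to the same address, and which write survives depends on scheduling, so results can differ frame to frame and between GPUs:

    // BROKEN: output depends on the (undefined) order of thread execution
    // whenever two entries of target[] collide.
    __global__ void brokenScatter(float* out, const int* target,
                                  const float* val, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[target[i]] = val[i];   // last writer wins; "last" is undefined
    }

A gather formulation, or a commutative atomic reduction (cf. the atomic ops on the Additional GPU Features slide), makes the result order-independent.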

  38. System Design Goals • Enable massively parallel implementations • Efficient scaling to 1000s of cores • No blocking/waiting • No constraints on order of thread execution • No read-after-write hazards • Enable future compatibility • New hardware releases, new operating systems

  39. Other Computing Paradigms • CPU-originated: • Lock-based, lockless • Message passing • Transactional memory • May not scale well to 1000s of cores • GPU paradigms • CUDA, CTM • May not scale well over time

  40. High-Level APIs • Microsoft Accelerator • Google PeakStream • RapidMind • Acceleware • Stream processing • Brook, Sequoia

  41. Additional GPU Features • Linear Filtering • 1-D, 2-D, 3-D floating point array indices • Image and video data benefit • Triangle Interpolators • Address calculations take many clocks • Blenders • Atomic reduction ops reduce ordering concerns • 4-vector operations • Vector data, syntactic convenience
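
The blender point generalizes: because addition is commutative and associative, an atomic reduction gives the same answer regardless of thread execution order. A hypothetical CUDA sketch, the byte histogram from the game-imaging slide written order-independently:

    // 256-bin histogram via atomic adds; correct under any thread ordering.
    __global__ void histogram256(const unsigned char* img,
                                 unsigned int* bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[img[i]], 1u);   // commutative, so order cannot matter
    }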

  42. Processor Opportunities • Client computing performance can improve • The client space is a large untapped opportunity for parallel processing • Hardware changes required are minimal and fairly obvious • Fast display, efficient i/o, scalable over time
