Mass Market Applications of Massively Parallel Computing Chas. Boyd
Outline • Projections of future hardware • The client computing space • Mass-market parallel applications • Common application characteristics • Interesting processor features
The Physics of Silicon • The way processors get faster has fundamentally changed • No more free performance gains due to clock rate and Instruction-Level Parallelism • Yet gates-per-die continues to grow • Possibly faster now that clock rate isn’t an issue • Estimate: doubling every 2-2.5 years • New area means more cores and caches • In-order core counts may grow faster than Out-of-Order core counts do
A Surplus of Cores • ‘More cores than we know what to do with’ • Literally • Servers scale with transaction counts • Technical Computing • history of dealing with parallel workloads • What are the opportunities for the PC client? • Are there mass market applications that are parallelizable?
Requirements of Mass Market Space • Fairly easy to program and maintain • Cannot break on future hardware or operating systems • Transparent backward and forward compatibility • Mass market customers hate regressions! • Consumer software must operate for decades • Must get faster automatically • Why we are here
AMD Term: • Personal Stream Computing • Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.
Data-Parallel Processing • Key technique, how do we apply it to consumers? • What takes lots of data? • Media, pixels, audio samples • Video, imaging, audio • Games
Video • Decode, encode, transcode • Motion Estimation, DCT, Quantization • Effects • Anything you would want to do to an image • Scaling, sepia, DVE effects (transitions) • Indexing • Search/Recognition (convolutions)
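Per-pixel effects such as sepia are data-parallel because each output pixel depends only on its own input pixel. A minimal Python sketch, assuming 8-bit RGB tuples and the commonly quoted sepia weights; `sepia_pixel` and `apply_effect` are hypothetical names that do not appear in the talk:

```python
def sepia_pixel(rgb):
    """Map one 8-bit RGB pixel to sepia; it depends on no other pixel."""
    r, g, b = rgb
    # Commonly quoted sepia weights (an assumption, not from the slides).
    sr = 0.393 * r + 0.769 * g + 0.189 * b
    sg = 0.349 * r + 0.686 * g + 0.168 * b
    sb = 0.272 * r + 0.534 * g + 0.131 * b
    # Round, then clamp to the 8-bit range.
    return tuple(min(255, int(round(v))) for v in (sr, sg, sb))


def apply_effect(pixels, effect):
    """Apply a per-pixel effect. The iterations are independent, so a
    runtime would be free to spread them over any number of cores."""
    return [effect(p) for p in pixels]


image = [(0, 0, 0), (255, 255, 255), (100, 150, 200)]
result = apply_effect(image, sepia_pixel)
```

Because `apply_effect` never lets one pixel read another's result, the same code would be valid whether the pixels are processed one at a time or all at once.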
Imaging • Demosaicing • Extract colors with knowledge of sensor layout • Segmentation • Identify areas of image to process • Cleanup • Color correction, noise removal, etc. • Indexing • Identify areas for tagging
User Interaction with Media • Client applications can/should be interactive • Mass market wants full automation • ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images • Automating media processing requires analysis • Recognition, segmentation, image understanding • Is this image outdoors or inside? • Is this image right-side up? • Does it contain faces? • Are their eyes red?
Imaging Markets • In some sense, the broader the market, the more sophisticated the algorithm required • Although pro-sumers care more about performance, and they are the ones that write the reviews
Game Applications of Mass Parallel • Rendering • Imaging • Physics • Inverse kinematics (IK) • AI
Game Rendering • Well established at this point, but new techniques keep being discovered • Rendering different terms at different spatial scales • E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered • Spherical harmonic coefficient manipulations
Game Imaging • Post processing • Reduction (histogram or single average value) • Exposure estimation based on log average luminance • Exposure correction • Oversaturation extraction • Large blurs (proportional to screen size) • Depth of field • Motion blur
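The exposure-estimation bullet above relies on a reduction over the whole frame. A minimal Python sketch of the log-average (geometric mean) luminance follows; the `key` value of 0.18 and the small `delta` offset are conventional tone-mapping choices assumed here, and the function names are hypothetical:

```python
import math


def log_average_luminance(luminances, delta=1e-4):
    """Geometric mean of luminance: exp(mean(log(delta + L))).

    `delta` (an assumed small offset) keeps log() finite for black pixels.
    """
    total = sum(math.log(delta + lum) for lum in luminances)
    return math.exp(total / len(luminances))


def exposure_scale(luminances, key=0.18):
    """Scale factor that maps the log-average luminance to `key`."""
    return key / log_average_luminance(luminances)


frame = [0.01, 0.1, 1.0, 10.0]
avg = log_average_luminance(frame, delta=0.0)
scale = exposure_scale(frame)
```

The sum inside `log_average_luminance` is the reduction step: the per-pixel logarithms are independent, and only the final accumulation needs to combine results.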
Game Physics • Particles (non-interacting) • Particles (interacting) • Rigid bodies • Deformable bodies • Etc.
Game Processor Evolution • [Figure: diagram with the labels CPU, GPU, Game Stack, AI, Animation, Physics, Content Creation Process, Mesh Modeling, Vertex Shader, Texture Creation, Pixel Shader, Offline, Real Time]
Common Properties of Mass Apps • Results of client computations are displayed at interactive rates • Fundamental requirement of client systems • Tight coupling with graphics is optimal • Physical proximity to renderer is beneficial • Smaller data types are key
Support for Image Data Types • Pixels, texels, motion vectors, etc. • Image data more important than float32s • Data declines in size as importance drops • Bytes, words, fp11, fp16, single, double • Bytes may be declining in importance • Hardware support for formatting is useful • Clock cycles required by shift/or/mul, etc. cost too much power
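To illustrate why hardware format support matters, the sketch below shows the shift, mask, and multiply work that software must do per channel when a packed pixel is unpacked by hand. It is a Python illustration only; the RGBA8 channel layout and the reading of "word" as 16 bits are assumptions, and the helper names are hypothetical:

```python
import struct


def pack_rgba8(r, g, b, a):
    """Pack four 8-bit channels into one 32-bit word with shift/or."""
    return (a << 24) | (b << 16) | (g << 8) | r


def unpack_rgba8_to_float(word):
    """Unpack to floats in [0, 1]: one shift, mask, and multiply per channel."""
    return tuple(((word >> shift) & 0xFF) * (1.0 / 255.0)
                 for shift in (0, 8, 16, 24))


word = pack_rgba8(255, 128, 0, 255)
channels = unpack_rgba8_to_float(word)

# Bytes per value for the formats named above (fp11 has no struct code).
sizes = {name: struct.calcsize(code) for name, code in
         [("byte", "B"), ("word", "H"), ("fp16", "e"),
          ("single", "f"), ("double", "d")]}
```

Every channel costs three arithmetic operations before any useful work starts, which is the overhead that dedicated format-conversion hardware would remove.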
I/O Considerations • Most computations other than 3-D rendering are i/o bound on GPUs • Their arithmetic intensity is lower than the level GPUs are built for • Convolutions • Support for efficient data types is very important
GPU Arithmetic Intensity Projection • 2-3 more process doublings before new memory technologies will help much • Stacked die?, 2k wide bus?, optical? • Estimate at least 4x increase in number of compute instructions per read operation • Arithmetic intensities reach 64?? • This is fine for 3-D rendering • Other data-parallel apps need more i/o
I/O Patterns • Solutions will have a variety of mechanisms to help with worsening i/o constraints • Data re-use (at cache size scales) is relatively rare in media applications • Read-write use of memory is rare • Read-write caches are less critical • Streaming data behavior is sufficient • Read contention and write contention are the issue, not read-after-write scenarios
Interesting Techniques • Shared registers • Possibly interesting to help with i/o bandwidth • Reducing on-chip bandwidth may help power/heat • Scatter • Can be useful in scenarios that don’t thrash output subsystem • Can reduce pressure on gather input system
Convolution • Key element of almost all image and video processing operations • Scaling, glows, blurs, search, segmentation • Algorithm has very low arithmetic intensity • 1 MAD per sample • Also has huge re-use (order of kernel size) • Shared registers should reduce memory reads per output sample by a factor of kernel size
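The re-use argument can be made concrete by counting sample fetches. The Python sketch below compares a 1-D convolution in which every output re-reads its whole window with one in which neighbouring outputs share a sliding window. The window only stands in for shared registers, the kernel is applied without flipping, and both function names are hypothetical:

```python
def convolve_naive(samples, kernel):
    """Each output fetches its whole window: k reads and k MADs per output."""
    k = len(kernel)
    reads = 0
    out = []
    for i in range(len(samples) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += samples[i + j] * kernel[j]  # one multiply-add per read
            reads += 1
        out.append(acc)
    return out, reads


def convolve_shared(samples, kernel):
    """Neighbouring outputs share a sliding window, so each input sample
    is fetched once. The window stands in for shared registers."""
    k = len(kernel)
    window = list(samples[:k - 1])
    reads = k - 1
    out = []
    for i in range(k - 1, len(samples)):
        window.append(samples[i])  # the only new fetch for this output
        reads += 1
        out.append(sum(w * c for w, c in zip(window, kernel)))
        window.pop(0)
    return out, reads


samples = [1, 2, 3, 4, 5, 6]
kernel = [0.25, 0.5, 0.25]
out_naive, reads_naive = convolve_naive(samples, kernel)
out_shared, reads_shared = convolve_shared(samples, kernel)
```

Both versions perform the same multiply-adds, but the naive one issues `k` reads per output while the shared one approaches one read per output as the input grows.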
Processor Core Types • Heterogeneous Many-core • In-Order vs. Out-of-Order • Distinction arose from targeting 2 different workload design points • Software can select ideal core type for each algorithm (workload design point) • Since not all cores can be powered anyway • Hardware can make trade-offs on: • Power, area, performance growth rate
Workloads • [Figure: chart positioning CPUs and GPUs; labels: Code Branchiness, Local Memory Accesses, Streaming Memory Access]
Workload Differences • General Processing: small batches, frequent branches, many data inter-dependencies, scalar ops, vector ops • Media Processing: large batches, few branches, few data inter-dependencies, scalar ops, vector ops
Lesson from GPGPU Research • Many important tasks have data-parallel implementations • Typically requires a new algorithm • May be just as maintainable • Definitely more scalable with core count
APIs Must Hide Implementations • Implementation attributes must be hidden from apps to enable scaling over time • Number of cores operating • Number of registers available • Number of i/o ports • Sizes of caches • Thread scheduling policies • Otherwise, these cannot change, and performance will not grow
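One way to picture the hiding described above is an interface where the caller supplies only a per-element kernel and the data. The Python sketch below is a hypothetical `parallel_map`, not an API from the talk; the `_workers` argument exists only so the example can show that results do not depend on it:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(kernel, data, _workers=None):
    """Hypothetical API: callers pass a per-element kernel and data only.

    The worker count is an implementation detail chosen by the runtime;
    `_workers` is exposed here purely for demonstration.
    """
    with ThreadPoolExecutor(max_workers=_workers) as pool:
        return list(pool.map(kernel, data))


def brighten(value):
    """Per-element kernel: no knowledge of cores, caches, or scheduling."""
    return min(255, value + 40)


pixels = [0, 100, 230, 255]
on_one_worker = parallel_map(brighten, pixels, _workers=1)
on_eight_workers = parallel_map(brighten, pixels, _workers=8)
```

Since `brighten` cannot observe how many workers exist, the implementation could change that number on future hardware without breaking the application.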
Order of Thread Execution • Shared registers and scatter share a pitfall: • It may be possible to write code that is dependent on the order of thread execution • This violates scaling requirement • The order of thread execution may vary from run-to-run (each frame) • Will certainly vary between implementations • Cross-vendor and within vendor product lines • All such code is considered incorrect
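The pitfall can be shown with a sequential simulation in Python (no real threads are involved; `run_threads` is a hypothetical stand-in for a scheduler). A scatter in which the last writer wins gives different answers under different orders, while a commutative reduction such as max does not:

```python
import random


def run_threads(thread_ids, body, state):
    """Run `body` once per thread id in the given order (simulated scheduler)."""
    for tid in thread_ids:
        body(tid, state)
    return state


def last_writer_wins(tid, state):
    # Incorrect: every thread scatters to the same slot, so the result
    # records whichever thread the scheduler happened to run last.
    state["slot"] = tid


def reduce_max(tid, state):
    # Order-independent: max is commutative and associative.
    state["slot"] = max(state["slot"], tid)


ids = list(range(8))
forward = run_threads(ids, last_writer_wins, {"slot": -1})["slot"]
backward = run_threads(ids[::-1], last_writer_wins, {"slot": -1})["slot"]

shuffled = ids[:]
random.Random(0).shuffle(shuffled)
max_in_order = run_threads(ids, reduce_max, {"slot": -1})["slot"]
max_shuffled = run_threads(shuffled, reduce_max, {"slot": -1})["slot"]
```

Here `forward` and `backward` disagree, which is exactly the kind of code the slide classes as incorrect, whereas the two max results agree for any ordering.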
System Design Goals • Enable massively parallel implementations • Efficient scaling to 1000s of cores • No blocking/waiting • No constraints on order of thread execution • No read-after-write hazards • Enable future compatibility • New hardware releases, new operating systems
Other Computing Paradigms • CPU-originated: • Lock-based, Lockless • Message Passing • Transactional Memory • May not scale well to 1000s of cores • GPU Paradigms • CUDA, CtM • May not scale well over time
High Level APIs • Microsoft Accelerator • Google PeakStream • RapidMind • Acceleware • Stream processing • Brook, Sequoia
Additional GPU Features • Linear Filtering • 1-D, 2-D, 3-D floating point array indices • Image and video data benefit • Triangle Interpolators • Address calculations take many clocks • Blenders • Atomic reduction ops reduce ordering concerns • 4-vector operations • Vector data, syntactic convenience
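Linear filtering with floating-point array indices can be sketched in software as follows. This is a 1-D Python illustration with clamp-to-edge addressing assumed; `sample_linear` is a hypothetical name, and GPU hardware performs the equivalent work in its texture units:

```python
import math


def sample_linear(array, x):
    """Read a 1-D array at a floating-point index with linear filtering."""
    x = min(max(x, 0.0), len(array) - 1.0)  # clamp to the valid range (assumed)
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(array) - 1)
    t = x - i0  # fractional distance between the two neighbours
    return array[i0] * (1.0 - t) + array[i1] * t


row = [0.0, 10.0, 20.0, 40.0]
halfway = sample_linear(row, 1.5)
quarter = sample_linear(row, 2.25)
```

The floor, clamp, and blend are the address calculations and arithmetic that software would otherwise spend clocks on for every image or video sample.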
Processor Opportunities • Client computing performance can improve • Client space is a large un-tapped opportunity for parallel processing • Hardware changes required are minimal and fairly obvious • Fast display, efficient i/o, scalable over time