Mass Market Applications of Massively Parallel Computing

Chas. Boyd

Outline • Projections of future hardware • The client computing space • Mass-market parallel applications • Common application characteristics • Interesting processor features

The Physics of Silicon • The way processors get faster has fundamentally changed • No more free performance gains due to clock rate and Instruction-Level Parallelism • Yet gates-per-die continues to grow • Possibly faster now that clock rate isn’t an issue • Estimate: doubling every 2-2.5 years • New area means more cores and caches • In-order core counts may grow faster than Out-of-Order core counts do

The Old Story

A Surplus of Cores • ‘More cores than we know what to do with’ • Literally • Servers scale with transaction counts • Technical computing has a history of dealing with parallel workloads • What are the opportunities for the PC client? • Are there mass market applications that are parallelizable?

Requirements of Mass Market Space • Fairly easy to program and maintain • Cannot break on future hardware or operating systems • Transparent backward and forward compatibility • Mass market customers hate regressions! • Consumer software must operate for decades • Must get faster automatically • Why we are here

AMD Term: • Personal Stream Computing • Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.

Data-Parallel Processing • Key technique, how do we apply it to consumers? • What takes lots of data? • Media, pixels, audio samples • Video, imaging, audio • Games

Video • Decode, encode, transcode • Motion Estimation, DCT, Quantization • Effects • Anything you would want to do to an image • Scaling, sepia, DVE effects (transitions) • Indexing • Search/Recognition: convolutions

Imaging • Demosaicing • Extract colors with knowledge of sensor layout • Segmentation • Identify areas of image to process • Cleanup • Color correction, noise removal, etc. • Indexing • Identify areas for tagging

User Interaction with Media • Client applications can/should be interactive • Mass market wants full automation • ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images • Automating media processing requires analysis • Recognition, segmentation, image understanding • Is this image outdoors or inside? • Is this image right-side up? • Does it contain faces? • Are their eyes red?

Imaging Markets • In some sense, the broader the market, the more sophisticated the algorithm required • Although pro-sumers care more about performance, and they are the ones that write the reviews

FFT Performance

Game Applications of Mass Parallel • Rendering • Imaging • Physics • IK • AI

Ultima Underworld 1993

Dark Messiah 2007

Game Rendering • Well established at this point, but new techniques keep being discovered • Rendering different terms at different spatial scales • E.g. Irradiance can be computed more sparsely than exit radiance enabling large increases in the number of incident light sources considered • Spherical harmonic coefficient manipulations

Game Imaging • Post processing • Reduction (histogram or single average value) • Exposure estimation based on log average luminance • Exposure correction • Oversaturation extraction • Large blurs (proportional to screen size) • Depth of field • Motion blur
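The exposure-estimation bullet above can be illustrated with a minimal sketch. It assumes Rec. 709 luma weights, a small `delta` to guard against log(0), and a mid-grey `key` of 0.18. None of these values come from the talk, and the function names are hypothetical.

```python
import math

def luminance(r, g, b):
    # Rec. 709 luma weights (an assumption; the talk does not name a formula).
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def log_average_luminance(pixels, delta=1e-4):
    # Reduction step: average the log-luminance over all pixels, then
    # exponentiate. delta guards against log(0) on black pixels.
    total = sum(math.log(delta + luminance(r, g, b)) for r, g, b in pixels)
    return math.exp(total / len(pixels))

def exposure_scale(pixels, key=0.18):
    # Scale factor that maps the log-average luminance to a mid-grey "key".
    return key / log_average_luminance(pixels)

pixels = [(0.5, 0.5, 0.5)] * 4
scale = exposure_scale(pixels)
```

The per-pixel logarithm is independent work, and only the final sum is a reduction, which is why this step maps naturally onto data-parallel hardware.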

Half-Life 2

Game Physics • Particles: non-interacting • Particles: interacting • Rigid bodies • Deformable bodies • Etc.

Game Processor Evolution [Figure. Labels: CPU, GPU; Game Stack (AI, Animation, Physics); Content Creation Process (Mesh Modeling, Texture Creation); Vertex Shader, Pixel Shader; Offline, Real Time]

Common Properties of Mass Apps • Results of client computations are displayed at interactive rates • Fundamental requirement of client systems • Tight coupling with graphics is optimal • Physical proximity to renderer is beneficial • Smaller data types are key

Support for Image Data Types • Pixels, texels, motion vectors, etc. • Image data more important than float32s • Data declines in size as importance drops • Bytes, words, fp11, fp16, single, double • Bytes may be declining in importance • Hardware support for formatting is useful • Clock cycles required by shift/or/mul, etc. cost too much power
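To make the cost of software formatting concrete, here is a minimal sketch of unpacking and repacking one 8-bit-per-channel pixel by hand. The byte layout (red in the low byte) and the function names are assumptions for illustration only.

```python
def unpack_rgba8(word):
    # Four shifts and four masks per pixel when done in software.
    r = word & 0xFF
    g = (word >> 8) & 0xFF
    b = (word >> 16) & 0xFF
    a = (word >> 24) & 0xFF
    # One multiply per channel to normalise to [0, 1].
    return tuple(c * (1.0 / 255.0) for c in (r, g, b, a))

def pack_rgba8(r, g, b, a):
    # Inverse: clamp, scale, round, then shift/or into one 32-bit word.
    def to_byte(x):
        return int(min(max(x, 0.0), 1.0) * 255.0 + 0.5)
    return to_byte(r) | (to_byte(g) << 8) | (to_byte(b) << 16) | (to_byte(a) << 24)

word = 0x80FF4000
roundtrip = pack_rgba8(*unpack_rgba8(word))
```

Every pixel read or written pays for these shifts, masks, and multiplies unless the hardware converts formats on load and store, which is the point of the slide's last two bullets.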

I/O Considerations • On GPUs, most computations other than 3-D rendering are i/o bound • Their arithmetic intensity is lower than what GPUs are built to sustain • Convolutions • Support for efficient data types is very important

GPU Arithmetic Intensity Projection

GPU Arithmetic Intensity Projection • 2-3 more process doublings before new memory technologies will help much • Stacked die?, 2k wide bus?, optical? • Estimate at least 4x increase in number of compute instructions per read operation • Arithmetic intensities reach 64?? • This is fine for 3-D rendering • Other data-parallel apps need more i/o
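A rough worked example of why the projection matters for non-rendering workloads. This is a simplified bandwidth-bound model, not a measurement. The starting intensity of 16 is inferred from the slide's "4x" and "64??" figures, and the per-kernel operation counts are illustrative assumptions.

```python
def compute_utilization(algorithm_ops_per_read, hardware_ops_per_read):
    # A kernel that issues fewer ops per read than the hardware can sustain
    # per read is i/o bound; the ALUs idle for the remaining fraction.
    return min(1.0, algorithm_ops_per_read / hardware_ops_per_read)

today = 16       # assumed: 64 / 4, inferred from the slide
projected = 64   # figure quoted on the slide

kernels = [
    ("naive convolution (1 MAD per read)", 1),
    ("rendering-like kernel (assumed 64 ops per read)", 64),
]
for name, ops in kernels:
    print(name, compute_utilization(ops, today), compute_utilization(ops, projected))
```

Under these assumptions the rendering-like kernel stays fully compute-bound, while the naive convolution drops from 1/16 to 1/64 of peak arithmetic throughput.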

I/O Patterns • Solutions will have a variety of mechanisms to help with worsening i/o constraints • Data re-use (at cache size scales) is relatively rare in media applications • Read-write use of memory is rare • Read-write caches are less critical • Streaming data behavior is sufficient • Read contention and write contention are the issue, not read-after-write scenarios

Interesting Techniques • Shared registers • Possibly interesting to help with i/o bandwidth • Reducing on-chip bandwidth may help power/heat • Scatter • Can be useful in scenarios that don’t thrash output subsystem • Can reduce pressure on gather input system

Convolution • Key element of almost all image and video processing operations • Scaling, glows, blurs, search, segmentation • Algorithm has very low arithmetic intensity • 1 MAD per sample • Also has huge re-use (order of kernel size) • Shared registers should cut memory reads by a factor of kernel size, raising effective arithmetic intensity accordingly
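The re-use argument can be sketched by counting memory reads in a 1-D filter. This is a minimal illustration in which a small sliding window stands in for shared registers. The function names are hypothetical, and the kernel is applied without flipping, as is common for image filters.

```python
def convolve_naive(signal, kernel):
    # Every output sample re-reads len(kernel) inputs from "memory".
    k = len(kernel)
    reads = 0
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]   # 1 MAD per sample read
            reads += 1
        out.append(acc)
    return out, reads

def convolve_shared(signal, kernel):
    # Each input is read once into a window standing in for shared registers.
    k = len(kernel)
    window = []
    reads = 0
    out = []
    for x in signal:
        window.append(x)   # one memory read per input sample
        reads += 1
        if len(window) > k:
            window.pop(0)
        if len(window) == k:
            out.append(sum(w * c for w, c in zip(window, kernel)))
    return out, reads

signal = list(range(10))
kernel = [0.25, 0.5, 0.25]
naive_out, naive_reads = convolve_naive(signal, kernel)
shared_out, shared_reads = convolve_shared(signal, kernel)
```

For a long signal the ratio of naive reads to shared reads approaches the kernel size, while the number of MADs is unchanged.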

Processor Core Types • Heterogeneous Many-core • In-Order vs. Out-of-Order • Distinction arose from targeting 2 different workload design points • Software can select ideal core type for each algorithm (workload design point) • Since not all cores can be powered anyway • Hardware can make trade-offs on: • Power, area, performance growth rate

Workloads [Figure. Labels: GPUs, CPUs; Code Branchiness; Local Memory Accesses, Streaming Memory Access]

Workload Differences • General Processing: small batches, frequent branches, many data inter-dependencies, scalar ops, vector ops • Media Processing: large batches, few branches, few data inter-dependencies, scalar ops, vector ops

Lesson from GPGPU Research • Many important tasks have data-parallel implementations • Typically requires a new algorithm • May be just as maintainable • Definitely more scalable with core count

APIs Must Hide Implementations • Implementation attributes must be hidden from apps to enable scaling over time • Number of cores operating • Number of registers available • Number of i/o ports • Sizes of caches • Thread scheduling policies • Otherwise, these cannot change, and performance will not grow
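A minimal sketch of the idea, assuming a hypothetical `parallel_map` entry point. Python's `ThreadPoolExecutor` is used only as a stand-in for a data-parallel runtime, and `brighten` is an illustrative kernel.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(kernel, data, workers=None):
    # The runtime, not the application, decides how many workers to use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, data))

def brighten(pixel):
    # Per-element kernel: it sees one input and knows nothing about core
    # counts, cache sizes, or scheduling policy.
    return min(pixel + 16, 255)

data = [0, 100, 250]
one_worker = parallel_map(brighten, data, workers=1)
many_workers = parallel_map(brighten, data, workers=8)
```

Because the kernel never observes the worker count, the same application code produces the same result on one worker or eight, which leaves the implementation free to change underneath it.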

Order of Thread Execution • Shared registers and scatter share a pitfall: • It may be possible to write code that is dependent on the order of thread execution • This violates scaling requirement • The order of thread execution may vary from run-to-run (each frame) • Will certainly vary between implementations • Cross-vendor and within vendor product lines • All such code is considered incorrect
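The pitfall can be shown with a small simulation in which a list of thread ids stands in for the hardware's execution order. The names are hypothetical, and the order-independent variant uses integer addition to mimic an atomic reduction op.

```python
def scatter_overwrite(pairs, size, order):
    # Order-dependent: when two threads target the same slot, the last one wins.
    out = [0] * size
    for t in order:
        index, value = pairs[t]
        out[index] = value
    return out

def scatter_add(pairs, size, order):
    # Order-independent for integers: addition is commutative and associative,
    # as with an atomic reduction op.
    out = [0] * size
    for t in order:
        index, value = pairs[t]
        out[index] += value
    return out

pairs = [(0, 1), (0, 2), (1, 3)]   # threads 0 and 1 collide on slot 0
forward = [0, 1, 2]
backward = [2, 1, 0]
```

With these inputs `scatter_overwrite` returns a different result for each order, while `scatter_add` returns the same result for both. The first pattern is the kind of code the slide calls incorrect.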

System Design Goals • Enable massively parallel implementations • Efficient scaling to 1000s of cores • No blocking/waiting • No constraints on order of thread execution • No read-after-write hazards • Enable future compatibility • New hardware releases, new operating systems

Other Computing Paradigms • CPU-originated: • Lock-based, Lockless • Message Passing • Transactional Memory • May not scale well to 1000s of cores • GPU Paradigms • CUDA, CtM • May not scale well over time

High Level APIs • Microsoft Accelerator • Google PeakStream • RapidMind • Acceleware • Stream processing • Brook, Sequoia

Additional GPU Features • Linear Filtering • 1-D, 2-D, 3-D floating point array indices • Image and video data benefit • Triangle Interpolators • Address calculations take many clocks • Blenders • Atomic reduction ops reduce ordering concerns • 4-vector operations • Vector data, syntactic convenience
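What linear filtering with a floating-point array index does can be sketched in software for the 1-D case. This minimal version assumes clamp-to-edge addressing and samples centred on integer indices. Real GPU texture units use their own coordinate conventions, and the function name is hypothetical.

```python
import math

def sample_linear(array, x):
    # Fetch at a floating-point index by blending the two nearest elements.
    x = min(max(x, 0.0), len(array) - 1.0)   # clamp-to-edge addressing (assumed)
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(array) - 1)
    t = x - i0
    return array[i0] * (1.0 - t) + array[i1] * t

row = [0.0, 10.0, 20.0]
mid = sample_linear(row, 0.5)
```

In software each fetch costs a clamp, a floor, two reads, and a blend. Dedicated filtering hardware performs the same work as part of the read, which is why image and video data benefit.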

Processor Opportunities • Client computing performance can improve • Client space is a large un-tapped opportunity for parallel processing • Hardware changes required are minimal and fairly obvious • Fast display, efficient i/o, scalable over time
