The Imagine Stream Processor Flexibility with Performance March 30, 2001

William J. Dally
Computer Systems Laboratory
Stanford University
billd@csl.stanford.edu

Outline
• Motivation
• We need low-power, programmable TeraOps
• The problem is bandwidth
• Growing gap between special-purpose and general-purpose hardware
• It's easy to make ALUs, hard to keep them fed
• A stream processor gives programmable bandwidth
• Streams expose locality and concurrency in the application
• A bandwidth hierarchy exploits this
• Imagine is a 20GFLOPS prototype stream processor
• Many opportunities to do better
• Scaling up
• Simplifying programming
Convergence Workshop

Motivation
• Some things I’d like to do with a few TeraOps
• Have a realistic face-to-face meeting with someone in Boston without riding an airplane
• 4-8 cameras, extract depth, fit model, compress, render to several screens
• High-quality rendering at video rates
• Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s

The good news – FLOPS are cheap, OPS are cheaper
• 32-bit FPU – 2GFLOPS/mm² – 400GFLOPS/chip
• 16-bit add – 40GOPS/mm² – 8TOPS/chip
[Figure: layouts of an integer adder and a local RF, with dimension labels 460 and 146.7 (units extracted as mm, likely µm)]
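The per-area and per-chip figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the chip-level numbers come from simply multiplying density by die area; the slide does not state the area, so the value computed here is an inference, not a quoted figure.

```python
# Densities and per-chip totals quoted on the slide.
fpu_gflops_per_mm2 = 2.0      # 32-bit FPU
fpu_gflops_per_chip = 400.0
add_gops_per_mm2 = 40.0       # 16-bit add
add_gops_per_chip = 8000.0    # 8 TOPS

# Implied area if the chip were filled with each unit
# (an assumption for illustration, not a figure from the slide).
area_from_fpu = fpu_gflops_per_chip / fpu_gflops_per_mm2
area_from_add = add_gops_per_chip / add_gops_per_mm2

print(area_from_fpu, area_from_add)
```

Both ratios give the same area in mm², which suggests the two per-chip figures were derived from one assumed die size.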

The bad news – General-purpose processors can’t harness this

Why do Special-Purpose Processors Perform Well?
• Lots (100s) of ALUs
• Fed by dedicated wires/memories

Care and Feeding of ALUs
[Figure: an ALU and the structures that feed it. Labels: Instr. Cache, IP, IR, Regs, Instruction Bandwidth, Data Bandwidth]
‘Feeding’ structure dwarfs ALU

The problem is bandwidth
• Can we solve this bandwidth problem without sacrificing programmability?

Streams expose locality and concurrency
• Operations within a kernel operate on local data
• Kernels can be partitioned across chips to exploit control parallelism
• Streams expose data parallelism
[Figure: stream graph. Image 0 and Image 1 each pass through two convolve kernels; a SAD kernel combines the results into a Depth Map]

A Bandwidth Hierarchy exploits locality and concurrency
[Figure: SDRAM (2GB/s) feeds the Stream Register File (32GB/s), which feeds the ALU clusters (544GB/s)]
• VLIW clusters with shared control
• 41.2 32-bit operations per word of memory bandwidth
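The three bandwidth figures and the operations-per-word figure can be tied together with a short calculation. This is a sketch under two assumptions: that 2GB/s, 32GB/s, and 544GB/s belong to the SDRAM, the stream register file, and the cluster-local registers respectively, and that a "word" is 32 bits.

```python
# Peak bandwidths at each level of the hierarchy (GB/s), as quoted on the slide.
memory_gbs = 2.0      # off-chip SDRAM
srf_gbs = 32.0        # stream register file
cluster_gbs = 544.0   # local register files inside the ALU clusters

# Each level supplies many times the bandwidth of the level above it.
srf_over_memory = srf_gbs / memory_gbs          # 16x
cluster_over_srf = cluster_gbs / srf_gbs        # 17x
cluster_over_memory = cluster_gbs / memory_gbs  # 272x

# Assumption: a "word" is 32 bits (4 bytes).
BYTES_PER_WORD = 4
memory_gwords_per_s = memory_gbs / BYTES_PER_WORD   # 0.5 Gwords/s

# 41.2 operations per memory word then implies this arithmetic rate (GOPS).
ops_per_word = 41.2
implied_gops = ops_per_word * memory_gwords_per_s

print(srf_over_memory, cluster_over_srf, cluster_over_memory, implied_gops)
```

Under these assumptions the implied rate is about 20.6 GOPS, in line with the roughly 20GFLOPS peak quoted for Imagine elsewhere in the talk.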

Bandwidth Usage
[Figure: bandwidth usage at each level of the hierarchy: SDRAM, Stream Register File, ALU Clusters; labels 544GB/s, 2GB/s, 32GB/s]

The Imagine Stream Processor
[Figure: block diagram of the Imagine Stream Processor. Labels: Host Processor, Stream Controller, Microcontroller, Stream Register File, ALU Cluster 0 through ALU Cluster 7, Streaming Memory System with four SDRAMs, Network Interface, Network]

Arithmetic Clusters
[Figure: one arithmetic cluster. Labels: functional units +, +, +, *, *, /, and CU; Local Register File; Cross Point; To SRF; From SRF; Intercluster Network]

Performance
[Figure: performance chart. Group labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]

Power
GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

A Look Inside an Application: Stereo Depth Extraction
• 320x240 8-bit grayscale images
• 30-disparity search
• 220 frames/second
• 12.7 GOPS
• 5.7 GOPS/W
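A few derived quantities help put these numbers in context. The sketch below uses only the figures on the slide; the per-pixel and power values are inferred from them and are not stated in the talk.

```python
# Figures quoted on the slide.
width, height = 320, 240
frames_per_s = 220
disparities = 30
gops = 12.7
gops_per_watt = 5.7

# Pixels processed per second across all frames.
pixels_per_s = width * height * frames_per_s

# Average operations spent per pixel per frame (convolutions and search combined).
ops_per_pixel = gops * 1e9 / pixels_per_s

# Upper bound on operations per pixel per disparity, since part of the
# work goes to the convolutions rather than the search.
ops_per_pixel_per_disparity = ops_per_pixel / disparities

# Power implied by dividing throughput by efficiency.
implied_watts = gops / gops_per_watt

print(pixels_per_s, round(ops_per_pixel), round(implied_watts, 2))
```

That works out to roughly 750 operations per pixel per frame at a little over 2W.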

Stereo Depth Extractor
Convolutions
• Load original packed row
• Unpack (8bit -> 16 bit)
• 7x7 Convolve
• 3x3 Convolve
• Store convolved row
Disparity Search
• Load Convolved Rows
• Calculate BlockSADs at different disparities
• Store best disparity values
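The disparity-search stage can be illustrated with a small block-SAD search. This is a minimal sketch in plain Python, not the Imagine kernel code: the function names, the 3x3 block size, and the synthetic test images are all assumptions made for illustration, and the convolution stages are skipped.

```python
def block_sad(left, right, row, col, disparity, half):
    """Sum of absolute differences between the block in `left` centred at
    (row, col) and the block in `right` shifted left by `disparity` pixels."""
    total = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            total += abs(left[row + dr][col + dc]
                         - right[row + dr][col + dc - disparity])
    return total


def best_disparities(left, right, max_disparity, half=1):
    """For each interior pixel, return the disparity with the smallest block SAD."""
    height, width = len(left), len(left[0])
    result = [[0] * width for _ in range(height)]
    for row in range(half, height - half):
        # Start far enough right that every shifted block stays in bounds.
        for col in range(half + max_disparity, width - half):
            sads = [block_sad(left, right, row, col, d, half)
                    for d in range(max_disparity + 1)]
            result[row][col] = sads.index(min(sads))
    return result


def pattern(r, c):
    """Hypothetical synthetic texture used only for this demonstration."""
    return (c * c + 3 * r) % 97


# The right image is the left image shifted by two pixels.
left = [[pattern(r, c) for c in range(16)] for r in range(6)]
right = [[pattern(r, c + 2) for c in range(16)] for r in range(6)]
disparity_map = best_disparities(left, right, max_disparity=4)
print(disparity_map[2][5:15])
```

On the synthetic pair, every pixel that is searched recovers the two-pixel shift. A real implementation would reuse partial sums between neighbouring blocks rather than recomputing each SAD from scratch.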

7x7 Convolve Kernel
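The kernel shown on this slide is not preserved in the transcript. As a rough illustration of the idea, the sketch below is a 7x7 convolution written as a kernel that consumes a stream of rows and keeps only seven rows of local state. It is plain Python with hypothetical names, not Imagine kernel code, and it applies the weights without flipping them (strictly a correlation).

```python
from collections import deque


def convolve7x7(rows, weights):
    """Consume a stream of image rows and yield convolved rows.

    `rows` is any iterable of equal-length lists; `weights` is a 7x7 list
    of lists. Only the last seven rows are held as local state. Border
    pixels are skipped, so each output row is 6 elements shorter than
    the input rows.
    """
    window = deque(maxlen=7)  # local working set: the 7 most recent rows
    for row in rows:
        window.append(row)
        if len(window) < 7:
            continue  # not enough rows yet to produce output
        out = []
        for col in range(len(row) - 6):
            acc = 0
            for i in range(7):
                for j in range(7):
                    acc += weights[i][j] * window[i][col + j]
            out.append(acc)
        yield out


# Demonstration: a box filter of ones over a constant image.
image = [[1] * 10 for _ in range(9)]
ones = [[1] * 7 for _ in range(7)]
convolved = list(convolve7x7(image, ones))
print(len(convolved), convolved[0])
```

Each input row is read once from the stream and every multiply-add works on the seven-row window, which is the kind of locality the streaming model is meant to expose.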

Imagine gives high performance with low power and flexible programming
• Matches capabilities of communication-limited technology to demands of signal and image processing applications
• Performance
• compound stream operations realize >10GOPS on key applications
• can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
• Power
• three-level register hierarchy gives 2-10GOPS/W
• Flexibility
• programmed in “C”
• streaming model
• conditional stream operations enable applications like sort

A look forward
• Next steps
• Build some Imagine prototypes
• Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems
• Longer term
• ‘Industrial Strength’ Imagine – 100-200GFLOPS/chip
• Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
• Graphics extensions
• Texture cache, raster unit – as SRF clients
• A streaming supercomputer
• 64-bit FP, high-bandwidth global memory, MIMD extensions
• Simplified stream programming
• Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data

Take home message
• VLSI technology enables us to put TeraOPS on a chip
• Conventional general-purpose architecture cannot exploit this
• The problem is bandwidth
• Casting an application as kernels operating on streams exposes locality and concurrency
• A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth
• Bandwidth hierarchy, compound stream operations
• Imagine is a prototype stream processor
• One chip – 20GFLOPS peak, 10GFLOPS sustained, 4W
• Systems scale to TeraFLOPS and more
