The Imagine Stream Processor: Flexibility with Performance. March 30, 2001. William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu
Outline • Motivation • We need low-power, programmable TeraOps • The problem is bandwidth • Growing gap between special-purpose and general-purpose hardware • It's easy to make ALUs, hard to keep them fed • A stream processor gives programmable bandwidth • Streams expose locality and concurrency in the application • A bandwidth hierarchy exploits this • Imagine is a 20GFLOPS prototype stream processor • Many opportunities to do better • Scaling up • Simplifying programming Convergence Workshop
Motivation • Some things I’d like to do with a few TeraOps • Have a realistic face-to-face meeting with someone in Boston without riding an airplane • 4-8 cameras, extract depth, fit model, compress, render to several screens • High-quality rendering at video rates • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
The good news – FLOPS are cheap, OPS are cheaper • 32-bit FPU – 2GFLOPS/mm2 – 400GFLOPS/chip • 16-bit add – 40GOPS/mm2 – 8TOPS/chip • [Figure: VLSI layouts of an integer adder with its local RF, drawn to scale]
The bad news – General-purpose processors can’t harness this
Why do Special-Purpose Processors Perform Well? • Lots (100s) of ALUs • Fed by dedicated wires/memories
Care and Feeding of ALUs • [Figure: conventional processor datapath: instruction cache, IP/IR supplying instruction bandwidth, register file supplying data bandwidth] • The ‘feeding’ structure dwarfs the ALU
The problem is bandwidth • Can we solve this bandwidth problem without sacrificing programmability?
Streams expose locality and concurrency • Operations within a kernel operate on local data • Kernels can be partitioned across chips to exploit control parallelism • Streams expose data parallelism • [Figure: stereo depth pipeline: Image 0 and Image 1 each pass through two convolve kernels; an SAD kernel then produces the Depth Map]
A Bandwidth Hierarchy exploits locality and concurrency • VLIW clusters with shared control • 41.2 32-bit operations per word of memory bandwidth • [Figure: four SDRAMs at 2GB/s feed a Stream Register File at 32GB/s, which feeds the ALU clusters at 544GB/s]
Bandwidth Usage • [Figure: measured bandwidth demand at each level of the hierarchy: SDRAM (2GB/s), Stream Register File (32GB/s), ALU clusters (544GB/s)]
The Imagine Stream Processor • [Figure: block diagram: four SDRAM channels behind a streaming memory system; stream controller, host interface, network interface, and microcontroller; a Stream Register File feeding ALU clusters 0–7]
Arithmetic Clusters • [Figure: one cluster: adders, multipliers, and a divide unit (+ + + * * /), each with a local register file, connected through a crosspoint to and from the SRF; a CU links the cluster to the intercluster network]
Performance • [Figure: bar chart of sustained performance for 16-bit kernels, 16-bit applications, a floating-point kernel, and a floating-point application]
Power • [Figure: bar chart of power efficiency; GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9]
A Look Inside an Application: Stereo Depth Extraction • 320x240 8-bit grayscale images • 30-disparity search • 220 frames/second • 12.7 GOPS • 5.7 GOPS/W
Stereo Depth Extractor • Convolution stage: load original packed row → unpack (8-bit → 16-bit) → 7x7 convolve → 3x3 convolve → store convolved row • Disparity-search stage: load convolved rows → calculate block SADs at different disparities → store best disparity values
7x7 Convolve Kernel • [Figure: the 7x7 convolve kernel]
Imagine gives high performance with low power and flexible programming • Matches capabilities of communication-limited technology to demands of signal and image processing applications • Performance • compound stream operations realize >10GOPS on key applications • can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board) • Power • three-level register hierarchy gives 2-10GOPS/W • Flexibility • programmed in “C” • streaming model • conditional stream operations enable applications like sort
A look forward • Next steps • Build some Imagine prototypes • Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems • Longer term • ‘Industrial Strength’ Imagine – 100-200GFLOPS/chip • Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth • Graphics extensions • Texture cache, raster unit – as SRF clients • A streaming supercomputer • 64-bit FP, high-bandwidth global memory, MIMD extensions • Simplified stream programming • Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data.
Take home message • VLSI technology enables us to put TeraOPS on a chip • Conventional general-purpose architecture cannot exploit this • The problem is bandwidth • Casting an application as kernels operating on streams exposes locality and concurrency • A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth • Bandwidth hierarchy, compound stream operations • Imagine is a prototype stream processor • One chip – 20GFLOPS peak, 10GFLOPS sustained, 4W • Systems scale to TeraFLOPS and more.