
  1. The Imagine Stream Processor: Flexibility with Performance. March 30, 2001. William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu

  2. Outline • Motivation • We need low-power, programmable TeraOps • The problem is bandwidth • Growing gap between special-purpose and general-purpose hardware • It's easy to make ALUs, hard to keep them fed • A stream processor gives programmable bandwidth • Streams expose locality and concurrency in the application • A bandwidth hierarchy exploits this • Imagine is a 20 GFLOPS prototype stream processor • Many opportunities to do better • Scaling up • Simplifying programming

  3. Motivation • Some things I'd like to do with a few TeraOps • Have a realistic face-to-face meeting with someone in Boston without getting on an airplane • 4-8 cameras, extract depth, fit model, compress, render to several screens • High-quality rendering at video rates • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s

  4. The good news – FLOPS are cheap, OPS are cheaper • 32-bit FPU – 2 GFLOPS/mm² – 400 GFLOPS/chip • 16-bit add – 40 GOPS/mm² – 8 TOPS/chip • [Figure: layout of an integer adder and its local register file]
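A quick consistency check on these density figures (my arithmetic, assuming the per-area and per-chip numbers refer to the same die):

\[
\frac{400\ \text{GFLOPS/chip}}{2\ \text{GFLOPS/mm}^2} \approx 200\ \text{mm}^2 \text{ of die},
\qquad
40\ \text{GOPS/mm}^2 \times 200\ \text{mm}^2 = 8{,}000\ \text{GOPS/chip} = 8\ \text{TOPS/chip}.
\]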

  5. The bad news – General-purpose processors can't harness this

  6. Why do Special-Purpose Processors Perform Well? • Lots (100s) of ALUs • Fed by dedicated wires/memories

  7. Care and Feeding of ALUs • [Figure: a single ALU with its instruction cache, instruction pointer, instruction register, and register file supplying instruction and data bandwidth] • The 'feeding' structure dwarfs the ALU

  8. The problem is bandwidth • Can we solve this bandwidth problem without sacrificing programmability?

  9. Streams expose locality and concurrency • Operations within a kernel operate on local data • Kernels can be partitioned across chips to exploit control parallelism • Streams expose data parallelism • [Figure: stereo pipeline in which Image 0 and Image 1 each pass through two convolve kernels before an SAD kernel produces the depth map]

  10. A Bandwidth Hierarchy exploits locality and concurrency • VLIW clusters with shared control • 41.2 32-bit operations per word of memory bandwidth • [Figure: four SDRAM channels (2 GB/s) feed the stream register file (32 GB/s), which in turn feeds the ALU clusters (544 GB/s)]
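A back-of-the-envelope check on the 41.2 figure (my arithmetic, not from the slide), assuming peak arithmetic throughput of roughly 20.6 Gops/s, consistent with the 20 GFLOPS peak quoted elsewhere in the deck, and 32-bit (4-byte) memory words:

\[
\frac{20.6 \times 10^{9}\ \text{ops/s}}{(2\ \text{GB/s}) / (4\ \text{B/word})}
= \frac{20.6\ \text{Gops/s}}{0.5\ \text{Gwords/s}}
\approx 41.2\ \text{operations per memory word}.
\]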

  11. Bandwidth Usage • [Figure: bandwidth used at each level of the hierarchy relative to the 2 GB/s SDRAM, 32 GB/s stream register file, and 544 GB/s ALU cluster capacities]

  12. The Imagine Stream Processor • [Figure: block diagram: four SDRAM channels attach to the streaming memory system; the stream controller, microcontroller, and network interface surround the stream register file, which feeds eight ALU clusters (0-7); the host processor and network connect off-chip]

  13. Arithmetic Clusters • [Figure: one cluster: three adders, two multipliers, and a divide unit, each fed from a local register file; a crosspoint connects them, with ports to and from the SRF, and a communication unit (CU) links to the intercluster network]

  14. Performance • [Chart: sustained performance for 16-bit kernels and applications and for a floating-point kernel and application]

  15. Power • [Chart: power efficiency across benchmarks; GOPS/W values of 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, and 6.9]

  16. A Look Inside an Application: Stereo Depth Extraction • 320x240 8-bit grayscale images • 30-disparity search • 220 frames/second • 12.7 GOPS • 5.7 GOPS/W
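Two rough consistency checks on these figures (my arithmetic, ignoring the convolution preprocessing and image borders): the implied power draw, and the implied work per pixel per disparity.

\[
\frac{12.7\ \text{GOPS}}{5.7\ \text{GOPS/W}} \approx 2.2\ \text{W},
\qquad
\frac{12.7 \times 10^{9}\ \text{ops/s}}{320 \times 240 \times 30 \times 220\ \text{pixel-disparities/s}} \approx 25\ \text{ops per pixel-disparity}.
\]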

  17. Stereo Depth Extractor • Convolution stage: load original packed row, unpack (8-bit -> 16-bit), 7x7 convolve, 3x3 convolve, store convolved row • Disparity-search stage: load convolved rows, calculate block SADs at different disparities, store best disparity values (a plain-C sketch of the pipeline follows)
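The following plain-C sketch mirrors the two-stage structure above. It is not Imagine's StreamC/KernelC code: the function names, the 7-element SAD window, and the simplification to single-row (rather than block) SADs are assumptions made for illustration; only the image size and disparity range come from slide 16.

    #include <limits.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define W 320            /* image width  (slide 16) */
    #define H 240            /* image height (slide 16) */
    #define DISPARITIES 30   /* disparity search range (slide 16) */
    #define WINDOW 7         /* SAD window length: an assumption, not from the slides */

    /* Stage 1 stand-in: unpack an 8-bit row to 16 bits.  The real pipeline
     * also applies the 7x7 and 3x3 convolutions here (see the next slide). */
    static void convolve_row(const uint8_t *packed, int16_t *out)
    {
        for (int x = 0; x < W; x++)
            out[x] = (int16_t)packed[x];          /* unpack 8-bit -> 16-bit */
    }

    /* Stage 2: for each pixel, pick the disparity whose 1 x WINDOW sum of
     * absolute differences (SAD) against the other image is smallest. */
    static void disparity_row(const int16_t *left, const int16_t *right,
                              uint8_t *best)
    {
        for (int x = 0; x < W - WINDOW - DISPARITIES; x++) {
            int best_sad = INT_MAX, best_d = 0;
            for (int d = 0; d < DISPARITIES; d++) {
                int sad = 0;
                for (int k = 0; k < WINDOW; k++)
                    sad += abs(left[x + k] - right[x + d + k]);
                if (sad < best_sad) { best_sad = sad; best_d = d; }
            }
            best[x] = (uint8_t)best_d;            /* store best disparity value */
        }
    }

    int main(void)
    {
        static uint8_t img0[H][W], img1[H][W];    /* camera inputs (not filled here) */
        static int16_t conv0[H][W], conv1[H][W];
        static uint8_t depth[H][W];

        for (int y = 0; y < H; y++) {             /* stream rows through both stages */
            convolve_row(img0[y], conv0[y]);
            convolve_row(img1[y], conv1[y]);
            disparity_row(conv0[y], conv1[y], depth[y]);
        }
        return 0;
    }

On Imagine, each stage would be a kernel reading rows from the stream register file and writing its results back, with the eight clusters working on different records of the row stream in parallel.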

  18. 7x7 Convolve Kernel
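The kernel source that appeared on this slide is not captured in the transcript. As an illustration only, here is a minimal plain-C sketch of a 7-tap convolution pass over one row, assuming a separable 7x7 filter (one horizontal pass followed by one vertical pass); the tap values are placeholders, not the application's real coefficients, and this is not Imagine's KernelC.

    #include <stdint.h>

    #define W    320   /* row width (slide 16) */
    #define TAPS 7

    /* Placeholder taps: a crude smoothing kernel whose weights sum to 16. */
    static const int16_t coeff[TAPS] = { 1, 2, 3, 4, 3, 2, 1 };

    /* One horizontal pass of a separable 7x7 convolution over a 16-bit row.
     * Border pixels (3 on each side) are left untouched. */
    void convolve7_row(const int16_t *in, int16_t *out)
    {
        for (int x = 0; x <= W - TAPS; x++) {
            int32_t acc = 0;
            for (int k = 0; k < TAPS; k++)
                acc += coeff[k] * in[x + k];
            out[x + TAPS / 2] = (int16_t)(acc / 16);   /* normalize by tap sum */
        }
    }

Roughly speaking, this inner loop is the work each arithmetic cluster performs in the streaming formulation, with the row records supplied from the stream register file.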

  19. Imagine gives high performance with low power and flexible programming • Matches capabilities of communication-limited technology to demands of signal and image processing applications • Performance • compound stream operations realize >10 GOPS on key applications • can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board) • Power • three-level register hierarchy gives 2-10 GOPS/W • Flexibility • programmed in "C" • streaming model • conditional stream operations enable applications like sort (see the sketch below)
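On conditional stream operations: the idea is that a kernel appends to (or reads from) a stream only for records where a condition holds, which is what lets data-dependent algorithms such as sort map onto the SIMD clusters. Below is a minimal plain-C sketch of the concept, using a hypothetical stream_t type and stream_append_if helper rather than Imagine's actual API.

    #include <stddef.h>

    /* Toy stream: a bounded array plus a length.  On Imagine, streams live in
     * the stream register file and conditional outputs are compacted in hardware. */
    typedef struct { int *data; size_t len, cap; } stream_t;

    /* Hypothetical conditional append: the record is written only if cond holds. */
    static void stream_append_if(stream_t *s, int value, int cond)
    {
        if (cond && s->len < s->cap)
            s->data[s->len++] = value;
    }

    /* Partition step of a sort expressed with conditional output streams:
     * every input record is examined, but each lands in exactly one output. */
    void partition_kernel(const stream_t *in, int pivot,
                          stream_t *lo, stream_t *hi)
    {
        for (size_t i = 0; i < in->len; i++) {
            int v = in->data[i];
            stream_append_if(lo, v, v <  pivot);
            stream_append_if(hi, v, v >= pivot);
        }
    }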

  20. A look forward • Next steps • Build some Imagine prototypes • Dual-processor 40 GFLOPS systems, 64-processor TeraFLOPS systems • Longer term • 'Industrial Strength' Imagine – 100-200 GFLOPS/chip • Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth • Graphics extensions • Texture cache, raster unit – as SRF clients • A streaming supercomputer • 64-bit FP, high-bandwidth global memory, MIMD extensions • Simplified stream programming • Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data.

  21. Take home message • VLSI technology enables us to put TeraOPS on a chip • Conventional general-purpose architecture cannot exploit this • The problem is bandwidth • Casting an application as kernels operating on streams exposes locality and concurrency • A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth • Bandwidth hierarchy, compound stream operations • Imagine is a prototype stream processor • One chip – 20 GFLOPS peak, 10 GFLOPS sustained, 4 W • Systems scale to TeraFLOPS and more.
