Stream Architecture: Rethinking Media Processor Design

Stream Architecture: Rethinking Media Processor Design. Scott Rixner April 9, 2001. Rice University Computer Systems Laboratory. Video/image compression & decompression MPEG, JPEG, ... Signal Processing DSL modems, cellular base stations, ... Image synthesis


Presentation Transcript


  1. Stream Architecture: Rethinking Media Processor Design. Scott Rixner, April 9, 2001. Rice University Computer Systems Laboratory

  2. Media Processing • Video/image compression & decompression: MPEG, JPEG, ... • Signal processing: DSL modems, cellular base stations, ... • Image synthesis: polygon rendering, image-based rendering, ... • Image understanding: face recognition, depth extraction, ...

  3. Stereo Depth Extraction • Inputs: left camera image and right camera image, 640x480 @ 30 fps • Requirements: 11 GOPS • Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W • Output: depth map
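
The figures on this slide can be turned into a rough per-pixel budget. The sketch below is only back-of-the-envelope arithmetic on the slide's numbers; it assumes the 11 GOPS requirement is spread evenly over every pixel position of one 640x480 frame at 30 fps, which the slide does not state.

```python
# Back-of-the-envelope arithmetic on the stereo depth extraction figures.
WIDTH, HEIGHT, FPS = 640, 480, 30     # from the slide
REQUIRED_GOPS = 11.0                  # from the slide
IMAGINE_GOPS = 12.1                   # from the slide
IMAGINE_GOPS_PER_WATT = 4.6           # from the slide

# Pixel positions processed per second for one image stream.
pixels_per_second = WIDTH * HEIGHT * FPS

# Assumption: the requirement is spread evenly over those pixel positions.
ops_per_pixel = REQUIRED_GOPS * 1e9 / pixels_per_second

# Power implied by dividing throughput by power efficiency.
implied_watts = IMAGINE_GOPS / IMAGINE_GOPS_PER_WATT

print(f"{pixels_per_second} pixels/s")
print(f"{ops_per_pixel:.0f} operations per pixel")
print(f"{implied_watts:.2f} W implied")
```

Under that assumption the workload comes to roughly 1,200 operations per pixel position, which is consistent with the "compute intensive" characterization later in the talk.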

  4. Outline • Stream Processing • VLSI Constraints • Register Organization • Imagine • Conclusions

  5. Media Processing Characteristics • Low-precision data: 24% 8-bit integer operations, 29% 16-bit integer operations • Abundant data parallelism • Little global data reuse: average of 1.5 references per global data word • Numerous computations per global reference: 50-500 operations per global data reference

  6. Stream Processing • Little data reuse (pixels never revisited) • Highly data parallel (output pixels not dependent on other output pixels) • Compute intensive (>60 operations per memory reference) [Diagram: input data Image 0 and Image 1 each pass through two convolve kernels; the resulting streams feed a SAD kernel that produces the output data, the depth map]
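
The kernel-and-stream structure on this slide can be sketched in a few lines. This is an illustrative Python model, not Imagine code: the function names, the 1-D scanline simplification, and the `max_disp` and `window` parameters are all assumptions made for the example. Each kernel consumes whole streams (here, scanlines) and produces a new stream, and no pixel is revisited once a kernel has consumed it.

```python
from typing import List

Row = List[float]

def convolve(row: Row, taps: Row) -> Row:
    """Kernel: 1-D FIR filter over one scanline (valid region only)."""
    n = len(taps)
    return [sum(row[i + k] * taps[k] for k in range(n))
            for i in range(len(row) - n + 1)]

def sad_disparity(left: Row, right: Row, max_disp: int, window: int) -> List[int]:
    """Kernel: per pixel, pick the disparity with the smallest
    sum of absolute differences (SAD) over a small window."""
    out = []
    for x in range(max_disp, len(left) - window + 1):
        costs = [sum(abs(left[x + w] - right[x - d + w]) for w in range(window))
                 for d in range(max_disp + 1)]
        out.append(costs.index(min(costs)))
    return out

def depth_pipeline(left_img: List[Row], right_img: List[Row], taps: Row,
                   max_disp: int = 4, window: int = 3) -> List[List[int]]:
    """Image 0 / Image 1 -> convolve -> convolve -> SAD -> depth map."""
    depth = []
    for l_row, r_row in zip(left_img, right_img):
        l = convolve(convolve(l_row, taps), taps)
        r = convolve(convolve(r_row, taps), taps)
        depth.append(sad_disparity(l, r, max_disp, window))
    return depth

# Toy input: the right scanline is the left one shifted by two pixels.
left = [float(i * i) for i in range(12)]
right = left[2:] + [0.0, 0.0]
toy_depth = depth_pipeline([left], [right], taps=[1.0])
```

Because each output pixel depends only on a small neighborhood of the input streams, the loop bodies are independent of one another, which is the data parallelism the slide refers to.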

  7. Locality and Concurrency • Operations within a kernel operate on local data • Kernels can be partitioned across chips to exploit control parallelism • Streams expose data parallelism [Diagram: Image 0 and Image 1 each pass through two convolve kernels, then a SAD kernel produces the depth map]

  8. Sony PlayStation2 [Diagram labels: Emotion Engine; FPU; VPU0; VPU1; MIPS core; IPU; Graphics Synthesizer; display; RDRAM, I/O, DMAC, etc.]

  9. Special vs. General Purpose • Special purpose: fixed function, high performance • General purpose: programmable, insufficient performance [Diagram labels: instruction cache, IP, IR, registers]

  10. Register Files Dwarf ALUs

  11. Register File Area • Each cell requires 1 word line per port and 1 bit line per port • Each cell grows as $p^2$ • $R$ registers in the file • Area: $p^2 R \propto N^3$ [Figure: register bit cell]

  12. Register File Access Delay • Signal must traverse the word line to access the cell and the bit line to transfer data • Wire capacitance dominates • Delay: $p R^{1/2} \propto N^{3/2}$ [Figure: register file]

  13. Register File Power Dissipation • 100% utilization requires driving all $p R^{1/2}$ bit lines • Wire capacitance dominates • Power: $p^2 R \propto N^3$ [Figure: register file]

  14. Centralized Register Organization • Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
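
The scaling on the last four slides can be checked numerically. The sketch below assumes that both the port count p and the register count R grow linearly with the number of ALUs N; the per-ALU constants (`ports_per_alu`, `regs_per_alu`) are illustrative values chosen for the example, not figures from the talk.

```python
import math

def central_rf_cost(n_alus, ports_per_alu=3, regs_per_alu=16):
    """Relative area, delay and power of a single central register file.

    ports_per_alu and regs_per_alu are illustrative assumptions; the
    slides only rely on p and R both growing in proportion to N.
    """
    p = ports_per_alu * n_alus
    r = regs_per_alu * n_alus
    area = p * p * r            # cell area grows as p^2, times R registers
    delay = p * math.sqrt(r)    # p * R^(1/2), as on the access delay slide
    power = p * p * r           # same p^2 * R term as on the power slide
    return area, delay, power

small = central_rf_cost(8)
large = central_rf_cost(16)

area_growth = large[0] / small[0]    # doubling N multiplies area by 2^3
delay_growth = large[1] / small[1]   # doubling N multiplies delay by 2^(3/2)

print(area_growth, round(delay_growth, 3))
```

Doubling the ALU count multiplies area and power by 8 and delay by about 2.83, regardless of the constants chosen, which is why a centralized register file becomes impractical as N grows.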

  15. Partitioned Organizations • SIMD: data-parallel axis • Distributed Register Files (DRF): instruction-level parallel axis • Hierarchical: memory hierarchy axis • Stream: optimizing for streams

  16. SIMD Register Organization • Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$

  17. Distributed Register Organization • Area, Power $\propto N^2$; Delay $\propto N$

  18. Combining SIMD and DRF [Figure labels: Scalar, SIMD, Central, DRF]

  19. Hierarchical Register Organization • Area, Power $\propto N^3$; Delay $\propto N^{3/2}$ [Figure label: Hierarchical, T=40]

  20. Hierarchical Organizations [Figure labels: Scalar, SIMD, Central, DRF]

  21. Stream Register Organization • Area, Power $\propto N^2/C$; Delay $\propto N/C$

  22. Stream Organizations [Figure labels: Scalar, SIMD, Central, DRF]

  23. Comparison of Organizations • 48 ALUs (32-bit), 500 MHz • Stream organization improves on the central organization by: area 195x, delay 20x, power 430x
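
The asymptotic expressions from the register organization slides can be evaluated side by side for the 48-ALU case. This is a sketch only: it uses the constant-free terms (N ALUs, C clusters), and the choice of C = 8 is an assumption taken from the eight ALU clusters labeled on the Imagine block diagram later in the talk. The ratios reported on this slide include constant factors that these terms drop, so the printed numbers are not expected to match 195x or 20x.

```python
def scaling(n, c):
    """Constant-free area/power and delay terms for each organization."""
    return {
        "central": {"area_power": n**3,        "delay": n**1.5},
        "simd":    {"area_power": n**3 / c**2, "delay": (n / c)**1.5},
        "drf":     {"area_power": n**2,        "delay": n},
        "stream":  {"area_power": n**2 / c,    "delay": n / c},
    }

N_ALUS = 48     # from the comparison slide
CLUSTERS = 8    # assumption: the eight ALU clusters shown for Imagine

terms = scaling(N_ALUS, CLUSTERS)

# Central vs. stream, using only the asymptotic terms.
area_ratio = terms["central"]["area_power"] / terms["stream"]["area_power"]
delay_ratio = terms["central"]["delay"] / terms["stream"]["delay"]

for name, t in terms.items():
    print(f"{name:8s} area/power ~ {t['area_power']:10.1f}  delay ~ {t['delay']:7.1f}")
print(f"central/stream: area {area_ratio:.0f}x, delay {delay_ratio:.1f}x")
```

Even without constants, the ordering is the same as on the slide: the stream organization has the smallest area, power and delay terms of the four.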

  24. Performance • 16% performance drop (8% with latency constraints) • 180x improvement

  25. Stream Architecture • Stream processing: matched to media processing; exposes locality and concurrency • Stream register organization: efficiency of special-purpose hardware; optimized for streaming applications • Data bandwidth: bandwidth hierarchy; memory access scheduling; conditional streams

  26. The Imagine Stream Processor [Block diagram labels: host processor; stream controller; microcontroller; ALU clusters 0-7; stream register file; streaming memory system with four SDRAMs; network interface; network]

  27. Arithmetic Clusters [Diagram labels: local register files; functional units (+, *, *, +, +, /); scratch-pad; communication unit (CU); cross point; to SRF; from SRF; intercluster network]

  28. Bandwidth Hierarchy • 41.2 32-bit operations per word of memory bandwidth [Diagram: SDRAM to stream register file, 2 GB/s; stream register file to ALU clusters, 32 GB/s; within the ALU clusters, 544 GB/s]
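
The three bandwidth figures on this slide are easier to compare as ratios. The sketch below assumes the usual reading of the diagram (2 GB/s at the SDRAM, 32 GB/s at the stream register file, 544 GB/s inside the clusters) and decimal gigabytes; the last line simply multiplies the slide's 41.2 operations per word by the memory word rate and is an implication of those assumptions, not a figure from the talk.

```python
GB = 1e9  # assumption: decimal gigabytes

# Assumed assignment of the slide's three figures to hierarchy levels.
levels = [
    ("SDRAM", 2 * GB),
    ("stream register file", 32 * GB),
    ("cluster local register files", 544 * GB),
]

memory_bw = levels[0][1]
ratios = [bw / memory_bw for _, bw in levels]

# 32-bit words delivered per second at the memory level.
words_per_second = memory_bw / 4

# Operation rate implied by 41.2 operations per memory word.
implied_gops = 41.2 * words_per_second / 1e9

for (name, _), ratio in zip(levels, ratios):
    print(f"{name}: {ratio:.0f}x memory bandwidth")
print(f"implied rate: {implied_gops:.1f} GOPS")
```

Each level supplies an order of magnitude more bandwidth than the one below it, so a kernel that keeps its working data in the clusters needs only a small fraction of its operands from memory.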

  29. Stream Recirculation

  30. Bandwidth Demands of FIR Filter

  31. Bandwidth Utilization of FIR Filter

  32. Performance [Chart groups: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]

  33. Power • GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3

  34. Relative Performance and Power Efficiency [Chart: FFT; panels: performance, power efficiency]

  35. Imagine Floorplan • Tapeout ~Q2 ’01 • 21 million transistors: 6M SRF SRAM, 6M UC SRAM, 6M clusters, 3M other • Target: 32 FO4; 300 MHz at SSSS, 500 MHz at TTSS • TI GS30KA: 0.15 µm L_drawn • 457 signal pins

  36. Imagine Team: William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner, Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren

  37. Conclusions • Media processing: little data reuse, highly data parallel, compute intensive • VLSI: stream register organization, bandwidth hierarchy • Imagine: stream architecture, 10 GOPS sustained application performance, 5 GOPS/W application power efficiency
