Stream Architecture: Rethinking Media Processor Design
Scott Rixner
April 9, 2001
Rice University Computer Systems Laboratory
Media Processing
• Video/image compression and decompression: MPEG, JPEG, ...
• Signal processing: DSL modems, cellular base stations, ...
• Image synthesis: polygon rendering, image-based rendering, ...
• Image understanding: face recognition, depth extraction, ...
Stereo Depth Extraction
[Figure: left camera image and right camera image, with the resulting depth map]
• 640x480 @ 30 fps
• Requirements
  • 11 GOPS
• Imagine stream processor
  • 12.1 GOPS, 4.6 GOPS/W
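The figures on this slide can be related with a little arithmetic. The sketch below is not from the talk: it assumes one depth value is produced per input pixel at the stated frame rate, and it treats the 12.1 GOPS and 4.6 GOPS/W figures as describing the same run. The variable names are illustrative.

```python
# Figures quoted on the slide.
WIDTH, HEIGHT, FPS = 640, 480, 30
REQUIRED_GOPS = 11.0        # stated requirement
IMAGINE_GOPS = 12.1         # stated Imagine throughput
IMAGINE_GOPS_PER_W = 4.6    # stated Imagine power efficiency

# Derived quantities (assumption: one output depth value per pixel per frame).
pixels_per_second = WIDTH * HEIGHT * FPS
ops_per_pixel = REQUIRED_GOPS * 1e9 / pixels_per_second

# Assumption: both Imagine figures refer to the same workload.
implied_power_w = IMAGINE_GOPS / IMAGINE_GOPS_PER_W

print(f"{pixels_per_second:,} pixels/s")
print(f"about {ops_per_pixel:.0f} operations per pixel")
print(f"about {implied_power_w:.2f} W implied power")
```

Under these assumptions the requirement works out to roughly 1,200 operations per pixel, which is consistent with the talk's point that media kernels are compute intensive.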
Outline
• Stream Processing
• VLSI Constraints
• Register Organization
• Imagine
• Conclusions
Media Processing Characteristics
• Low-precision data
  • 24% 8-bit integer operations
  • 29% 16-bit integer operations
• Abundant data parallelism
• Little global data reuse
  • Average of 1.5 references per global data word
• Numerous computations per global reference
  • 50-500 operations per global data reference
Stream Processing
[Figure: kernel and stream diagram. Input data Image 0 and Image 1 each pass through convolve kernels; the resulting streams feed a SAD kernel whose output data is the depth map.]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (>60 operations per memory reference)
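The kernel and stream structure in the diagram can be sketched in ordinary Python, with generators standing in for streams and functions standing in for kernels. This is only an illustration of the programming model, not Imagine code: the row-at-a-time streams, the 3-tap smoothing filter, the window size, and the names `convolve`, `sad`, and `stereo_depth` are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence


def clamp(i: int, n: int) -> int:
    """Clamp an index into the range [0, n - 1]."""
    return min(max(i, 0), n - 1)


def convolve(stream: Iterable[Sequence[float]],
             taps: Sequence[float]) -> Iterator[List[float]]:
    """Kernel: filter each row of the input stream with a 1-D filter."""
    half = len(taps) // 2
    for row in stream:
        n = len(row)
        out = []
        for x in range(n):
            acc = 0.0
            for k, t in enumerate(taps):
                acc += t * row[clamp(x + k - half, n)]
            out.append(acc)
        yield out  # each output row is produced once and never revisited


def sad(left: Iterable[Sequence[float]],
        right: Iterable[Sequence[float]],
        max_disparity: int,
        window: int = 1) -> Iterator[List[int]]:
    """Kernel: per pixel, pick the disparity with the smallest
    sum of absolute differences over a small horizontal window."""
    for l_row, r_row in zip(left, right):
        n = len(l_row)
        depth = []
        for x in range(n):
            best_d, best_cost = 0, float("inf")
            for d in range(max_disparity + 1):
                cost = 0.0
                for w in range(-window, window + 1):
                    cost += abs(l_row[clamp(x + w, n)]
                                - r_row[clamp(x + w - d, n)])
                if cost < best_cost:
                    best_d, best_cost = d, cost
            depth.append(best_d)
        yield depth


def stereo_depth(image0, image1, max_disparity: int = 3) -> List[List[int]]:
    """Chain the kernels as in the diagram: two convolves per image, then SAD."""
    smooth = [0.25, 0.5, 0.25]  # hypothetical filter, chosen for illustration
    left = convolve(convolve(image0, smooth), smooth)
    right = convolve(convolve(image1, smooth), smooth)
    return list(sad(left, right, max_disparity))


if __name__ == "__main__":
    right_img = [[0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0]]
    left_img = [[0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0]]  # same feature, shifted by 2
    print(stereo_depth(left_img, right_img))
```

Each output pixel depends only on a small neighbourhood of the input streams, so rows (and pixels within a row) could be processed in parallel, and intermediate rows are passed from kernel to kernel without being stored globally.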
Locality and Concurrency
[Figure: the same kernel and stream diagram (Image 0 and Image 1, convolve kernels, SAD, depth map), annotated as follows]
• Operations within a kernel operate on local data
• Kernels can be partitioned across chips to exploit control parallelism
• Streams expose data parallelism
Sony PlayStation2
[Figure: block diagram with Emotion Engine (MIPS core, FPU, VPU0, VPU1, IPU), Graphics Synthesizer, display, and RDRAM, I/O, DMAC, etc.]
Special vs. General Purpose
[Figure labels: instruction cache, IP, IR, registers]
• Special purpose
  • Fixed function
  • High performance
• General purpose
  • Programmable
  • Insufficient performance
Register Files Dwarf ALUs
Register File Area
[Figure: register bit cell]
• Each cell requires:
  • 1 word line per port
  • 1 bit line per port
• Each cell grows as $p^2$
• $R$ registers in the file
• Area: $p^2 R \propto N^3$
Register File Access Delay
[Figure: register file]
• Signal must traverse:
  • Word line to access cell
  • Bit line to transfer data
• Wire capacitance dominates
• Delay: $p R^{1/2} \propto N^{3/2}$
Register File Power Dissipation
[Figure: register file]
• 100% utilization requires driving all $p R^{1/2}$ bit lines
• Wire capacitance dominates
• Power: $p^2 R \propto N^3$
Centralized Register Organization
• Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
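The $N^3$ and $N^{3/2}$ results follow from the per-slide formulas once the port count $p$ and register count $R$ are both taken to grow linearly with the number of ALUs $N$. The sketch below makes that substitution explicit. The constants (`ports_per_alu=3`, `regs_per_alu=1`) and the function name are assumptions for illustration; only the growth rates are meaningful, not the absolute values.

```python
def central_register_file(n_alus: int,
                          ports_per_alu: int = 3,
                          regs_per_alu: int = 1) -> dict:
    """Relative cost of one central register file serving n_alus ALUs.

    Assumes p and R both scale linearly with N (constants are hypothetical).
    """
    p = ports_per_alu * n_alus   # ports: p grows with N
    r = regs_per_alu * n_alus    # registers: R grows with N
    return {
        "area": p ** 2 * r,      # area  = p^2 R      -> N^3
        "delay": p * r ** 0.5,   # delay = p R^(1/2)  -> N^(3/2)
        "power": p ** 2 * r,     # power = p^2 R      -> N^3
    }


base = central_register_file(1)
for n in (1, 2, 4, 8, 16):
    cost = central_register_file(n)
    print(f"N={n:2d}  area x{cost['area'] / base['area']:.0f}  "
          f"delay x{cost['delay'] / base['delay']:.1f}  "
          f"power x{cost['power'] / base['power']:.0f}")
```

Doubling the number of ALUs multiplies area and power by 8 and delay by about 2.8, which is why a single central register file does not scale to tens of ALUs.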
Partitioned Organizations
• SIMD
  • Data-parallel axis
• Distributed Register Files (DRF)
  • Instruction-level parallel axis
• Hierarchical
  • Memory hierarchy axis
• Stream
  • Optimizing for streams
SIMD Register Organization
• Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$
Distributed Register Organization
• Area, Power $\propto N^2$; Delay $\propto N$
Combining SIMD and DRF
[Figure labels: Scalar, SIMD, Central, DRF]
Hierarchical Register Organization
• Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
[Figure label: Hierarchical, T=40]
Hierarchical Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
Stream Register Organization
• Area, Power $\propto N^2/C$; Delay $\propto N/C$
Stream Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
Comparison of Organizations
• 48 ALUs (32-bit), 500 MHz
• Stream organization improves on the central organization by:
  • Area: 195x
  • Delay: 20x
  • Power: 430x
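The asymptotic expressions from the preceding slides can be tabulated side by side. The sketch below evaluates them for $N = 48$ ALUs and, as an assumption, $C = 8$ clusters (the number of ALU clusters drawn in the Imagine diagram). Constant factors are dropped, so the ratios it prints show trends only and are not expected to reproduce the 195x, 20x, and 430x figures, which come from the talk's detailed cost model.

```python
def scaling(n: float, c: float) -> dict:
    """Asymptotic cost of each register organization (constants dropped).

    n: number of ALUs, c: number of SIMD clusters.
    """
    return {
        "central": {"area_power": n ** 3,          "delay": n ** 1.5},
        "SIMD":    {"area_power": n ** 3 / c ** 2, "delay": (n / c) ** 1.5},
        "DRF":     {"area_power": n ** 2,          "delay": n},
        "stream":  {"area_power": n ** 2 / c,      "delay": n / c},
    }


N_ALUS = 48      # from the slide
N_CLUSTERS = 8   # assumption: eight clusters, as in the Imagine diagram

costs = scaling(N_ALUS, N_CLUSTERS)
central = costs["central"]
for name, cost in costs.items():
    print(f"{name:8s} area/power {cost['area_power']:10.0f} "
          f"(central/{name}: {central['area_power'] / cost['area_power']:6.1f}x)  "
          f"delay {cost['delay']:7.1f} "
          f"(central/{name}: {central['delay'] / cost['delay']:5.1f}x)")
```

The ordering matches the slides: SIMD and DRF each remove one factor of the central cost, and the stream organization combines both partitionings.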
Performance
[Figure annotations: 16% performance drop (8% with latency constraints); 180x improvement]
Stream Architecture
• Stream Processing
  • Matched to media processing
  • Exposes locality and concurrency
• Stream Register Organization
  • Efficiency of special-purpose hardware
  • Optimized for streaming applications
• Data bandwidth
  • Bandwidth hierarchy
  • Memory access scheduling
  • Conditional streams
The Imagine Stream Processor
[Figure: Imagine block diagram with four SDRAMs, streaming memory system, host processor, stream controller, network, network interface, stream register file, microcontroller, and ALU clusters 0 through 7]
Arithmetic Clusters
[Figure: cluster datapath with local register files, functional units (+, *, *, +, +, /), communication unit (CU), scratch-pad register file, cross point, intercluster network, and connections to and from the SRF]
Bandwidth Hierarchy
• 41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM, stream register file, and ALU clusters, with bandwidth labels 2 GB/s, 32 GB/s, and 544 GB/s]
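The three bandwidth figures can be read as successive levels of a hierarchy. The sketch below assumes the natural assignment (2 GB/s at the SDRAM, 32 GB/s at the stream register file, 544 GB/s inside the ALU clusters), since the flattened figure does not pair labels with levels explicitly. It also assumes that "operations per word of memory bandwidth" means peak operation rate divided by memory word rate; the implied peak rate is a derived number, not one stated on the slide.

```python
# Bandwidth labels from the slide; the level assignment is an assumption.
MEMORY_GBPS = 2.0      # SDRAM
SRF_GBPS = 32.0        # stream register file
CLUSTER_GBPS = 544.0   # within the ALU clusters

OPS_PER_MEMORY_WORD = 41.2   # from the slide
BYTES_PER_WORD = 4           # 32-bit words

# Ratio of each level to memory bandwidth.
srf_ratio = SRF_GBPS / MEMORY_GBPS
cluster_ratio = CLUSTER_GBPS / MEMORY_GBPS

# Assumption: ops-per-word = peak op rate / memory word rate.
memory_gwords_per_s = MEMORY_GBPS / BYTES_PER_WORD
implied_peak_gops = OPS_PER_MEMORY_WORD * memory_gwords_per_s

print(f"memory : SRF : cluster = 1 : {srf_ratio:.0f} : {cluster_ratio:.0f}")
print(f"memory bandwidth = {memory_gwords_per_s:.1f} Gwords/s")
print(f"implied peak rate = {implied_peak_gops:.1f} GOPS")
```

Under this reading, an application must perform tens of operations per word fetched from memory to keep the ALUs busy, which is the property the earlier slides attribute to media kernels.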
Stream Recirculation
Bandwidth Demands of FIR Filter
Bandwidth Utilization of FIR Filter
Performance
[Figure labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]
Power
• GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
Relative Performance and Power Efficiency
[Figure: FFT; panels: Performance, Power Efficiency]
Imagine Floorplan
• Tapeout ~Q2 '01
• 21 million transistors
  • 6M SRF SRAM
  • 6M UC SRAM
  • 6M clusters
  • 3M other
• Target: 32 FO4
  • 300 MHz at SSSS
  • 500 MHz at TTSS
• TI GS30KA: 0.15 µm L_drawn
• 457 signal pins
Imagine Team
• William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner
• Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren
Conclusions
• Media Processing
  • Little data reuse
  • Highly data parallel
  • Compute intensive
• VLSI
  • Stream register organization
  • Bandwidth hierarchy
• Imagine
  • Stream architecture
  • 10 GOPS sustained application performance
  • 5 GOPS/W application power efficiency