Memory System Support for Image Processing
Lixin Zhang, John B. Carter, Wilson C. Hsieh, Sally A. McKee
Department of Computer Science, University of Utah
Presented by Lixin Zhang

Characteristics of Image Processing
• High data bandwidth needs
• Large cache footprints
• Lack of data reuse
• Non-unit strides
• Bottom line: traditional memory systems do not work well for image processing!
• But image-processing access patterns are often predictable!

Outline
• Characteristics of image processing
• Basic idea
• Two remapping algorithms
• Benchmarks
• Performance
• Conclusion

Basic Idea: An Innovative Memory System (Impulse)
• Allow sparsely stored data to be accessed densely
• Load only the data needed by the processor into caches

Approaches
[Figure: virtual space is translated by the MMU/TLB into physical space, which consists of real physical space and shadow address space; the Impulse MC maps it onto real physical memory]
• Use “unused” physical address space (shadow addresses) to reorganize data
• The memory controller (Impulse MC) maps shadow addresses to physical memory
• OS/compiler support

Simple Example
• Sum of diagonal elements of a dense matrix
    for (i = 0; i < n; i++)
        sum += A[i][i];
• Problems
  • Wasted bus bandwidth
  • Low cache utilization
  • Low cache hit ratio
[Figure: physical memory → conventional memory controller → cache, annotated "wasted bus bandwidth"]

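To put a rough number on the waste, the sketch below counts the bytes a conventional controller moves for the diagonal sum against the bytes the loop consumes. The function names are hypothetical, and the model assumes that every diagonal element sits in its own cache line, which holds once a matrix row is at least one line long.

```c
#include <stddef.h>

/* Bytes moved over the bus when each of the n diagonal elements
 * lives in a different cache line (assumed, see the text above). */
size_t bytes_fetched(size_t n, size_t line_bytes) {
    return n * line_bytes;
}

/* Bytes the loop actually consumes: one element per iteration. */
size_t bytes_used(size_t n, size_t elem_bytes) {
    return n * elem_bytes;
}

/* Share of each fetched line that is useful, in percent, rounded down. */
unsigned utilization_percent(size_t line_bytes, size_t elem_bytes) {
    return (unsigned)(100 * elem_bytes / line_bytes);
}
```

With 8-byte elements and 128-byte lines (both illustrative values), only about 6% of the traffic is useful data.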
Using Impulse
• Strided remapping
    diagonal = Impulse_remap(A, n, ...);
    for (i = 0; i < n; i++)
        sum += diagonal[i];
• Benefits
  • No wasted bus bandwidth
  • Better cache utilization
  • Higher cache hit ratio
[Figure: physical memory → Impulse memory controller → cache, with the diagonal packed densely at a shadow address]

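The effect of a strided remapping can be modeled in plain software. The sketch below is only an emulation: on Impulse the memory controller performs the gather and no copy is made, whereas here the dense alias is materialized in a scratch buffer. The names gather_strided and sum_diagonal are hypothetical, and the arguments of Impulse_remap elided on the slide are not reconstructed.

```c
#include <stddef.h>

/* Software model of a strided remapping: dense[k] stands for base[k * stride].
 * Impulse would do this gather in the memory controller; here we copy. */
void gather_strided(const double *base, size_t stride, size_t count,
                    double *dense) {
    for (size_t k = 0; k < count; k++)
        dense[k] = base[k * stride];
}

/* Diagonal of a row-major n x n matrix: consecutive diagonal elements
 * are n + 1 elements apart. */
double sum_diagonal(const double *a, size_t n, double *scratch) {
    gather_strided(a, n + 1, n, scratch);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += scratch[i];   /* unit-stride accesses */
    return sum;
}
```

For a 3 x 3 matrix holding 1 through 9 in row-major order, the scratch buffer ends up holding 1, 5, 9.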
Address Translations
[Figure: Conventional system: virtual memory → MMU/TLB → physical memory. Impulse system: virtual memory (diagonal) → MMU/TLB → shadow memory → pseudo-virtual memory → physical memory (A)]

Impulse MC Internals
• Shadow descriptor: stores remapping information
• ALU: shadow addresses ==> pseudo-virtual addresses
• Page table: pseudo-virtual addresses ==> real addresses

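The two translation steps can be sketched as two small functions. Everything concrete here is an assumption made for illustration: the fields of shadow_desc, the 4 KB page size, and the flat frames[] array standing in for the controller's page table. The slides do not describe the actual descriptor layout.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u  /* assumed 4 KB pages */

/* Hypothetical descriptor for one strided remapping. */
typedef struct {
    uint64_t shadow_base;   /* first shadow address of the dense alias */
    uint64_t pv_base;       /* pseudo-virtual address of element 0 */
    uint64_t elem_bytes;    /* size of one element */
    uint64_t stride_bytes;  /* distance between consecutive elements */
} shadow_desc;

/* ALU step: shadow address -> pseudo-virtual address. */
uint64_t shadow_to_pv(const shadow_desc *d, uint64_t shadow_addr) {
    uint64_t off = shadow_addr - d->shadow_base;
    uint64_t index = off / d->elem_bytes;    /* which element */
    uint64_t within = off % d->elem_bytes;   /* byte inside the element */
    return d->pv_base + index * d->stride_bytes + within;
}

/* Page-table step: pseudo-virtual address -> real address.
 * frames[] maps a pseudo-virtual page number to a physical frame number. */
uint64_t pv_to_real(const uint64_t *frames, uint64_t pv_addr) {
    uint64_t offset = pv_addr & ((1ull << PAGE_SHIFT) - 1);
    return (frames[pv_addr >> PAGE_SHIFT] << PAGE_SHIFT) | offset;
}
```

Keeping the two steps separate mirrors the slide: the ALU deals only with the remapping arithmetic, and the page table deals only with where pages live in DRAM.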
Outline
• Characteristics of image processing
• Basic idea
• Two remapping algorithms
  • Transpose remapping
  • Scatter/gather through an indirection vector
• Benchmarks
• Performance
• Conclusion

Transpose Remapping
• Create the transposed version of a matrix
  Original:
    for (j = 0; j < n; j++)
        for (i = 0; i < m; i++)
            ..A[i][j]..;
  With Impulse:
    TA = Impulse_map(...);
    for (j = 0; j < n; j++)
        for (i = 0; i < m; i++)
            ..TA[j][i]..;
• MC maps TA[j][i] to A[i][j]
• Benefit
  • Unit-stride accesses instead of row-size-stride accesses

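The mapping from TA to A is pure index arithmetic, which the following sketch spells out. It assumes A is an m x n row-major matrix, matching the loop bounds on the slide, so that TA is n x m. The function names are hypothetical.

```c
#include <stddef.h>

/* Linear index into A (m x n, row-major) of the element behind TA[j][i]. */
size_t transpose_index(size_t j, size_t i, size_t n) {
    return i * n + j;
}

/* Same mapping, starting from a linear offset k into TA (n x m, row-major).
 * Walking k = 0, 1, 2, ... is the unit-stride traversal of TA. */
size_t ta_offset_to_a_index(size_t k, size_t m, size_t n) {
    size_t j = k / m;   /* row of TA, column of A */
    size_t i = k % m;   /* column of TA, row of A */
    return transpose_index(j, i, n);
}
```

Consecutive offsets into TA land n elements apart in A, which is exactly the row-size stride the processor no longer has to issue itself.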
Scatter/gather through an Indirection Vector
• Reorganize an array according to an indirection vector
  Original:
    for (i = 0; i < n; i++)
        ..A[iv[i]]..;
  With Impulse:
    NA = Impulse_map(...);
    for (i = 0; i < n; i++)
        ..NA[i]..;
• MC maps NA[i] to A[iv[i]]
• Benefits
  • NA is accessed sequentially
  • No need to access iv in the processor

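A software model of the gather makes the mapping explicit. As with any emulation of Impulse, the copy below only illustrates what NA contains; the controller would resolve NA[i] on demand without the processor ever reading iv. The name gather_indirect and the element and index types are assumptions.

```c
#include <stddef.h>

/* Software model of scatter/gather remapping: na[i] stands for a[iv[i]]. */
void gather_indirect(const double *a, const size_t *iv, size_t n,
                     double *na) {
    for (size_t i = 0; i < n; i++)
        na[i] = a[iv[i]];
}
```

After the gather, a loop over na touches memory sequentially regardless of how scattered the entries of iv are.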
Outline
• Characteristics of image processing
• Basic idea
• Two remapping algorithms
• Benchmarks
  • Volume rendering
  • Image rotation
  • Image filtering
• Performance
• Conclusion

Volume Rendering
• Algorithm
  • Brute-force ray tracing, orthographic tracer, 4x4x4 macro cells
• Optimization
  • Pre-compute the sequence of voxels visited by each ray
• Impulse
  • Map voxels on a ray to contiguous shadow addresses

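Pre-computing a voxel sequence amounts to filling an indirection vector. The sketch below does this for the simplest case, a ray travelling along the z-axis. It assumes a flat volume stored with x varying fastest and ignores the 4x4x4 macro cells used by the benchmark; both function names are hypothetical.

```c
#include <stddef.h>

/* Linear index of voxel (x, y, z) in an nx * ny * nz volume,
 * stored with x varying fastest (an assumed layout). */
size_t voxel_index(size_t x, size_t y, size_t z, size_t nx, size_t ny) {
    return (z * ny + y) * nx + x;
}

/* Pre-compute the voxels visited by a ray through (x, y) along z.
 * iv[] can then drive a scatter/gather remapping. */
void ray_along_z(size_t x, size_t y, size_t nx, size_t ny, size_t nz,
                 size_t *iv) {
    for (size_t z = 0; z < nz; z++)
        iv[z] = voxel_index(x, y, z, nx, ny);
}
```

Successive entries are nx * ny elements apart, so without remapping every step along such a ray touches a different cache line.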
Image Rotation
• Algorithm
  • Separable image warp
  • Three-shear image rotation: horizontal, vertical, horizontal
• Optimization
  • Tile the second shear operation
• Impulse
  • Create transposed versions for the second shear operation

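The slide does not state the shear factors. In the standard three-shear decomposition, a rotation by θ uses a horizontal shear of a = -tan(θ/2), a vertical shear of b = sin θ, and the same horizontal shear again. The sketch below applies that sequence to a single point; the type and function names are hypothetical, and the factors are passed in directly so the block needs no math library.

```c
/* A 2-D point. */
typedef struct { double x, y; } point;

/* Horizontal shear: x += a * y. Rows move, so accesses stay within a row. */
static point shear_x(point p, double a) { p.x += a * p.y; return p; }

/* Vertical shear: y += b * x. Columns move, so accesses walk down columns. */
static point shear_y(point p, double b) { p.y += b * p.x; return p; }

/* Horizontal, vertical, horizontal. For a rotation by theta,
 * pass a = -tan(theta / 2) and b = sin(theta). */
point three_shear(point p, double a, double b) {
    p = shear_x(p, a);
    p = shear_y(p, b);
    p = shear_x(p, a);
    return p;
}
```

The middle, vertical shear is the one that walks down columns of a row-major image, which is why the slide singles out the second shear for tiling or transpose remapping.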
Image Filtering
• Algorithm
  • Binomial filter: applies a two-dimensional mask
  • Decomposed into a pair of linear filters: first along rows, then along columns
• Impulse
  • Create transposed versions of both the input and the output image for the column pass
[Figure: mask of order 3]

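The row-then-column decomposition can be sketched for the smallest case, an order-3 filter with the 1-D kernel [1 2 1]/4. The border handling (clamping), the integer pixel type, and the function names are assumptions; the benchmark's own implementation is not shown on the slides.

```c
#include <stddef.h>

/* One 1-D pass of the [1 2 1]/4 kernel over `count` samples that are
 * `stride` elements apart. Borders are clamped (an assumption). */
static void binomial3_1d(const int *src, int *dst, size_t count,
                         size_t stride) {
    for (size_t k = 0; k < count; k++) {
        size_t l = (k == 0) ? 0 : k - 1;
        size_t r = (k + 1 == count) ? k : k + 1;
        dst[k * stride] = (src[l * stride] + 2 * src[k * stride]
                           + src[r * stride]) / 4;
    }
}

/* Row pass (unit stride) followed by column pass (stride = width). */
void binomial3_2d(const int *in, int *tmp, int *out,
                  size_t width, size_t height) {
    for (size_t y = 0; y < height; y++)
        binomial3_1d(in + y * width, tmp + y * width, width, 1);
    for (size_t x = 0; x < width; x++)
        binomial3_1d(tmp + x, out + x, height, width);
}
```

Feeding in a 3 x 3 image whose only non-zero pixel is a 16 in the center returns the 2-D mask itself: 1 2 1 / 2 4 2 / 1 2 1. The column pass runs with a stride of one image row, which is the access pattern a transposed view turns into unit stride.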
Performance of Volume Rendering
• The rays are parallel to the x-axis; 1k x 1k x 1k volume
• Note: times are in millions of cycles; TLB misses are in millions

Performance of Volume Rendering
• The rays are perpendicular to the x-axis

Performance of Image Rotation
• Rotate a 1k x 1k color image through one radian, using 32x32 tiles

Performance of Image Filtering
• Order-121 binomial filter on a 1024x1024 color image

Conclusion
• Impulse memory system
  • Reorganizes data in shadow space
  • MC maps shadow addresses back to DRAM
  • Improves memory system performance
• Improved image processing benchmarks
  • Volume rendering by 226%
  • Image rotation by 19%
  • Image filtering by 44.7%
• Looking for applications with
  • Poor cache/TLB behavior
  • And probably predictable access patterns

Simulation Environment
• 120 MHz HP PA-RISC 1.1 processor
• 120 MHz HP Runway bus
• L1 cache
  • 32 Kbytes, 32-byte lines, direct-mapped, virtually indexed, physically tagged, 1-cycle latency, write-through
• L2 cache
  • 256 Kbytes, 128-byte lines, 2-way associative, physically indexed, physically tagged, 8-cycle latency, write-allocate, write-back
• MCache
  • 8 Kbytes for non-remapped data; 512 bytes for each remapped data structure

MC-based Prefetching
• Basic idea: prefetch data from DRAM into the MC
  • Hides DRAM latency
• MCache: a small SRAM at the MC
  • A buffer for non-remapped data
  • A small buffer for each remapped data structure
• Simple scheme
  • Sequential prefetch for non-remapped data
  • Configurable strided prefetch for remapped data
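Both prefetch schemes reduce to generating addresses at a fixed step from the current access. The sketch below is a guess at the shape of that logic, not the controller's actual design: the function name, the notion of a prefetch depth, and the choice to start one step past the current address are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill out[] with the next `depth` addresses after `addr`, each
 * `step_bytes` further on. Sequential prefetch would pass the line
 * size as the step; strided prefetch would pass the configured stride. */
void prefetch_targets(uint64_t addr, uint64_t step_bytes, size_t depth,
                      uint64_t *out) {
    for (size_t k = 0; k < depth; k++)
        out[k] = addr + (uint64_t)(k + 1) * step_bytes;
}
```

Calling it with a 128-byte step models sequential prefetch of the following lines; calling it with the stride of a remapped structure models the configurable strided case.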