Memory System Support for Image Processing

Memory System Support for Image Processing • Lixin Zhang, John B. Carter, Wilson C. Hsieh, Sally A. McKee • Department of Computer Science, University of Utah • Presented by Lixin Zhang

Characteristics of Image Processing • High data bandwidth needs • Large cache footprints • Lack of data reuse • Non-unit strides • Bottom line: traditional memory systems do not work well for image processing! • But the access patterns of image processing code are often predictable!

Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Benchmarks • Performance • Conclusion

Basic Idea: An Innovative Memory System (Impulse) • Allow sparsely-stored data to be accessed densely • Load only the data needed by the processor into caches

Approaches • Use “unused” physical address space (shadow addresses) to reorganize data • Memory controller (Impulse MC) maps shadow addresses to physical memory • OS/compiler support
[Figure: virtual space, MMU/TLB, physical space (real physical space and shadow address space), Impulse MC, real physical memory]

Simple Example • Sum of diagonal elements of a dense matrix
    for (i = 0; i < n; i++) sum += A[i][i];
• Problems • Wasted bus bandwidth • Low cache utilization • Low cache hit ratio
[Figure: physical memory, conventional memory controller, cache; label: wasted bus bandwidth]

Using Impulse • Strided remapping
    diagonal = Impulse_remap(A, n, ...);
    for (i = 0; i < n; i++) sum += diagonal[i];
• Benefits • No wasted bus bandwidth • Better cache utilization • Higher cache hit ratio
[Figure: physical memory, Impulse memory controller, cache; label: shadow address]
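To make the strided remapping concrete, here is a minimal software stand-in, assuming a row-major n x n matrix of doubles. `gather_strided` and `diagonal_sum` are hypothetical names, not part of the Impulse interface. Impulse performs the equivalent gather inside the memory controller, so the processor never makes this copy; the sketch only shows which elements end up adjacent.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical software stand-in for a strided remapping: copy elements
 * 0, stride, 2*stride, ... of src into dst so that dst is dense.
 * (Impulse does this gather in the memory controller, not in software.)
 */
void gather_strided(const double *src, size_t stride,
                    double *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    for (size_t k = 0; k < count; k++)
        dst[k] = src[k * stride];
}

/* Sum the diagonal of a row-major n x n matrix through a dense alias. */
double diagonal_sum(const double *a, size_t n, double *diagonal)
{
    /* A[i][i] sits at offset i*n + i = i*(n + 1): a stride of n + 1. */
    gather_strided(a, n + 1, diagonal, n);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += diagonal[i];          /* unit-stride accesses */
    return sum;
}
```

For a 3 x 3 matrix holding 1 through 9 in row-major order, `diagonal` receives 1, 5, 9. Every byte of a cache line filled from the dense alias is useful, which is the bandwidth and cache-utilization benefit the slide lists.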

Address Translations
[Figure: conventional system: virtual memory, MMU/TLB, physical memory. Impulse system: virtual memory (A, diagonal), MMU/TLB, shadow memory, pseudo-virtual memory, physical memory]

Impulse MC Internals • Shadow descriptor: stores remapping information • ALU: shadow addresses ==> pseudo-virtual addresses • Page table: pseudo-virtual addresses ==> real addresses
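The two translation steps can be sketched as follows for a strided remapping. This is an illustration under stated assumptions, not the hardware design: the `shadow_desc` fields, the 4 KB page size, and the flat `frames` array used as a page table are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* assumed 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Hypothetical shadow descriptor for one strided remapping. */
struct shadow_desc {
    uint64_t shadow_base;   /* first shadow address of the dense alias */
    uint64_t pv_base;       /* pseudo-virtual address of element 0 */
    uint64_t elem_size;     /* bytes per element */
    uint64_t stride;        /* bytes between consecutive elements */
};

/* "ALU" step: shadow address -> pseudo-virtual address. */
uint64_t shadow_to_pv(const struct shadow_desc *d, uint64_t shadow)
{
    assert(d->elem_size > 0 && shadow >= d->shadow_base);
    uint64_t off    = shadow - d->shadow_base;
    uint64_t index  = off / d->elem_size;    /* which element */
    uint64_t within = off % d->elem_size;    /* byte inside the element */
    return d->pv_base + index * d->stride + within;
}

/*
 * "Page table" step: pseudo-virtual address -> real physical address.
 * frames[v] holds the physical frame number of pseudo-virtual page v.
 */
uint64_t pv_to_phys(const uint64_t *frames, uint64_t pv)
{
    return (frames[pv >> PAGE_SHIFT] << PAGE_SHIFT) | (pv & PAGE_MASK);
}
```

With 8-byte elements and a stride of 4104 bytes (the diagonal of a 512 x 512 matrix of doubles), byte 19 of the shadow region falls in element 2 at byte 3, so it maps to pseudo-virtual address 2 * 4104 + 3 = 8211, which the page-table step then relocates to whatever frame backs pseudo-virtual page 2.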

Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Transpose Remapping • Scatter/gather through an indirection vector • Benchmarks • Performance • Conclusion

Transpose Remapping • Create the transposed version of a matrix
    Original:
    for (j = 0; j < n; j++) for (i = 0; i < m; i++) ..A[i][j]..;
    With Impulse:
    TA = Impulse_map(...);
    for (j = 0; j < n; j++) for (i = 0; i < m; i++) ..TA[j][i]..;
• MC maps TA[j][i] to A[i][j] • Benefit • Unit-stride accesses instead of row-size-stride accesses
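The index arithmetic behind the TA[j][i] to A[i][j] mapping can be sketched in software, assuming A is a row-major m x n array and TA is its n x m alias. `transpose_index` and `read_transposed` are hypothetical helper names; in Impulse the mapping is applied by the memory controller on shadow addresses.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map the linear index k of an element of TA (n rows, m columns) to the
 * linear index of the same element in A (m rows, n columns).
 */
size_t transpose_index(size_t k, size_t m, size_t n)
{
    assert(m > 0 && n > 0 && k < m * n);
    size_t j = k / m;        /* row of TA, column of A */
    size_t i = k % m;        /* column of TA, row of A */
    return i * n + j;
}

/* Read TA[j][i] without materializing TA. */
double read_transposed(const double *a, size_t m, size_t n,
                       size_t j, size_t i)
{
    return a[transpose_index(j * m + i, m, n)];
}
```

Walking k from 0 to m*n - 1 visits TA with unit stride while the underlying reads of A jump by n elements. Those jumps are the accesses Impulse moves out of the processor's cache hierarchy and into the memory controller.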

Scatter/gather through an Indirection Vector • Reorganize an array according to an indirection vector
    Original:
    for (i = 0; i < n; i++) ..A[iv[i]]..;
    With Impulse:
    NA = Impulse_map(...);
    for (i = 0; i < n; i++) ..NA[i]..;
• MC maps NA[i] to A[iv[i]] • Benefits: • NA is accessed sequentially • No need to access iv in the processor
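A software analogue of the gather, assuming arrays of doubles and a `size_t` indirection vector, looks like this. `gather_indirect` and `sum_dense` are hypothetical names; Impulse resolves `iv` in the memory controller, so neither the copy nor the `iv` loads are executed by the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Build the dense array na such that na[i] == a[iv[i]]. */
void gather_indirect(const double *a, const size_t *iv,
                     double *na, size_t n)
{
    assert(a != NULL && iv != NULL && na != NULL);
    for (size_t i = 0; i < n; i++)
        na[i] = a[iv[i]];
}

/* The consumer loop touches only na, with unit stride. */
double sum_dense(const double *na, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += na[i];
    return sum;
}
```

Once the gather is done elsewhere, the consumer loop is a plain sequential scan: `iv` no longer competes for cache space, and each fetched line holds only requested elements.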

Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Benchmarks • Volume rendering • Image rotation • Image filtering • Performance • Conclusion

Volume Rendering • Algorithm • Brute-force ray tracing, orthographic tracer, 4x4x4 macro cells • Optimization • Pre-compute voxel sequences being visited • Impulse • Map voxels on a ray to contiguous shadow addresses

Image Rotation • Algorithm • Separable image warp • Three-shear image rotation: horizontal, vertical, horizontal • Optimization • Tile the second shear operation • Impulse • Create transposed versions for the second shear operation
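The slide does not give the shear factors. One standard three-shear decomposition of a rotation by an angle θ (commonly attributed to Paeth, and assumed here rather than taken from the slides) is:

```latex
R(\theta) =
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
=
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ \sin\theta & 1 \end{pmatrix}
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}
```

The outer factors are horizontal shears, which shift pixels within a row. The middle factor is a vertical shear, which shifts pixels along a column; in a row-major image that means row-size strides, which is why the second shear is the step targeted by tiling and by transpose remapping.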

Image Filtering • Algorithm • Binomial filter: applies a two-dimensional mask • Decomposed into a pair of linear filters: first along rows, then along columns • Impulse • Create transposed versions of both the input and output images for the walk along columns
[Figure: mask of order 3]
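The row-then-column decomposition can be sketched as below. This is a minimal illustration, not the benchmark code: it assumes an odd filter order of at most 32, integer pixel values, and clamped borders, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill k[0..order-1] with C(order-1, t); return their sum, 2^(order-1). */
uint64_t binomial_kernel(uint64_t *k, size_t order)
{
    assert(order >= 1 && order <= 32);
    k[0] = 1;
    for (size_t t = 1; t < order; t++)
        k[t] = k[t - 1] * (order - t) / t;
    return (uint64_t)1 << (order - 1);
}

/* Filter `count` samples spaced `stride` elements apart (clamped borders). */
void filter_line(const uint64_t *src, uint64_t *dst, size_t count,
                 size_t stride, const uint64_t *k, size_t order,
                 uint64_t norm)
{
    size_t half = order / 2;
    for (size_t x = 0; x < count; x++) {
        uint64_t acc = 0;
        for (size_t t = 0; t < order; t++) {
            size_t pos = x + t;                   /* x + t - half, clamped */
            pos = pos < half ? 0 : pos - half;
            if (pos >= count)
                pos = count - 1;
            acc += k[t] * src[pos * stride];
        }
        dst[x * stride] = acc / norm;
    }
}

/* Separable binomial filter on a w x h row-major image; tmp is scratch. */
void binomial_filter_2d(const uint64_t *in, uint64_t *tmp, uint64_t *out,
                        size_t w, size_t h, size_t order)
{
    uint64_t k[32];
    uint64_t norm = binomial_kernel(k, order);
    for (size_t y = 0; y < h; y++)       /* row pass: unit stride */
        filter_line(in + y * w, tmp + y * w, w, 1, k, order, norm);
    for (size_t x = 0; x < w; x++)       /* column pass: stride of w */
        filter_line(tmp + x, out + x, h, w, k, order, norm);
}
```

The column pass is the one that strides by a full row, so it is the pass that benefits from reading and writing transposed aliases of the images.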

Performance of Volume Rendering • The rays are parallel to the x-axis, 1k x 1k x 1k volume • Note: time is in millions of cycles; TLB misses are in millions.

Performance of Volume Rendering • The rays are perpendicular to the x-axis

Performance of Image Rotation • Rotate 1kx1k color image through one radian, 32x32 tile

Performance of Image Filtering • Order-121 binomial filter on a 1024x1024 color image

Conclusion • Impulse memory system • Reorganizes data in shadow space • MC maps shadow accesses back to DRAM • Improves performance of the memory system • Improved image processing benchmarks • Volume rendering by 226% • Image rotation by 19% • Image filtering by 44.7% • Looking for applications with • Poor cache/TLB behavior • And probably predictable access patterns

Simulation Environment • 120MHz HP PA-RISC 1.1 processor • 120MHz HP Runway Bus • L1 Cache • 32 Kbytes, 32-byte line, direct-mapped, virtually-indexed, physically-tagged, 1-cycle latency, write-through • L2 Cache • 256 Kbytes, 128-byte line, 2-way associative, physically-indexed, physically-tagged, 8-cycle latency, write-allocate, write-back • MCache • 8 Kbytes for non-remapped data; 512 bytes for each remapped data structure

MC-based Prefetching • Basic idea: prefetch data from DRAMs to MC • Hide DRAM latency • MCache: a small SRAM at the MC • A buffer for non-remapped data • A small buffer for each remapped data structure • Simple scheme: • Sequential prefetch for non-remapped data • Configurable-strided prefetch for remapped data
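The prefetch scheme can be modeled with a toy one-entry buffer, assuming the controller always fetches the address one stride past the last request. The `mc_prefetcher` structure and both functions are hypothetical and ignore timing, buffer capacity, and DRAM banking; the sketch only shows why the stride must be configurable for remapped data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-entry prefetch buffer at the memory controller. */
struct mc_prefetcher {
    uint64_t stride;       /* bytes between consecutive expected requests */
    uint64_t next_addr;    /* address whose data the buffer currently holds */
    int      valid;        /* nonzero once a prefetch has been issued */
};

/* Serve one request; return 1 if it hit the prefetched data, else 0. */
int mc_access(struct mc_prefetcher *p, uint64_t addr)
{
    assert(p != NULL);
    int hit = p->valid && p->next_addr == addr;
    p->next_addr = addr + p->stride;   /* prefetch the next expected address */
    p->valid = 1;
    return hit;
}

/* Count buffer hits over a sequence of requests. */
size_t count_hits(struct mc_prefetcher *p, const uint64_t *addrs, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (size_t)mc_access(p, addrs[i]);
    return hits;
}
```

In this model a sequential prefetcher (stride equal to one 128-byte line) hits on every request after the first for a linear scan, but never hits on a 4104-byte diagonal walk. Setting the stride to 4104 restores the hits, which is the role of the configurable-stride prefetch for remapped data.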
