Interleaved Pixel Lookup for Embedded Computer Vision Kota Yamaguchi, Yoshihiro Watanabe, Takashi Komuro, Masatoshi Ishikawa
Outline • Introduction • Problems in applying interleaving • Techniques • Example: Lucas-Kanade • Conclusion
Purpose • To find a technique to efficiently implement a parallel memory for pixel lookup operations [Figure: camera captures (images), image processing (where interleaving applies), and computer vision tasks (model objects, feature space, e.g. pose, shape)]
Motivation • Strong influence on downstream performance • Massive memory operations • Always a headache for embedded designers [Figure: camera captures (images), image processing, and computer vision tasks (model objects, feature space, e.g. pose, shape)]
Motivation • Interleaving in graphics hardware • Texram [Schilling, 96] • Texture memory in recent GPUs • Is it also beneficial for embedded computer vision hardware? • Yes, if appropriately implemented
Pixel lookup operations • Geometry-to-pixel conversion • Input images serve as a lookup table [Figure: geometry stream …, x_{k+2}, x_{k+1}, x_k converted into pixel stream …, I(x_{k+2}), I(x_{k+1}), I(x_k)]
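The geometry-to-pixel conversion on this slide can be sketched in a few lines. This is only an illustration: the function name `pixel_lookup`, the row-major list-of-lists image, and integer coordinates are assumptions, not details from the slides.

```python
def pixel_lookup(image, coords):
    """Yield I(x_k) for each coordinate x_k = (x, y) in the geometry stream.

    `image` is assumed to be a row-major list of rows, used as a lookup table.
    """
    for x, y in coords:
        yield image[y][x]


# A tiny 3x2 input image and a short geometry stream (illustrative values).
image = [[10, 11, 12],
         [20, 21, 22]]
geometry_stream = [(0, 0), (2, 1), (1, 0)]
pixels = list(pixel_lookup(image, geometry_stream))
print(pixels)
```

Each coordinate produces one independent, effectively random read, which is why the memory behind the table dominates throughput.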
Straightforward implementation • Random access memory • Expensive and slow [Figure: geometry stream, RAM holding input images, pixel stream]
Interleaved implementation • Higher throughput with the same capacity • But suffers from partitioning and alignment issues [Figure: geometry stream, interleaved memory holding input images as packed words, pixel stream]
Partitioning issue • The parallel word does not match the pixels an operation needs • e.g. neighboring 1x4 pixels are packed into a word, but each operation requires 4x1 pixels [Figure: several reads followed by an align step to assemble the pixels for one operation]
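The mismatch can be made concrete by counting how many packed words one operation touches. In this sketch the helper `words_touched` is hypothetical, and "1x4" is read as four pixels along a row while "4x1" is read as four pixels down a column; the slides do not spell out the orientation.

```python
def words_touched(x0, y0, win_w, win_h, word_w, word_h):
    """Return the set of packed words overlapped by a win_w x win_h window at (x0, y0)."""
    return {(x // word_w, y // word_h)
            for y in range(y0, y0 + win_h)
            for x in range(x0, x0 + win_w)}


# Words pack 4 pixels along a row, but the operation needs 4 pixels down a column.
row_packed = words_touched(0, 0, win_w=1, win_h=4, word_w=4, word_h=1)
# Words pack exactly the shape the operation needs.
matched = words_touched(0, 0, win_w=1, win_h=4, word_w=1, word_h=4)
print(len(row_packed), len(matched))
```

With the mismatched packing every word read delivers one useful pixel out of four, so the parallel word brings no throughput gain.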
Misalignment issue • Unaligned access requires multiple reads and sub-word alignment [Figure: an access crossing a word boundary, served by two reads and an align step]
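A one-dimensional sketch shows where the extra read and the alignment step come from. The word size of 4 pixels and the function name `read_unaligned` are assumptions chosen for illustration.

```python
WORD = 4  # pixels per packed word (assumed)


def read_unaligned(memory_words, start):
    """Fetch WORD consecutive pixels beginning at pixel index `start`.

    Returns the pixels and the number of word reads that were needed.
    """
    first = start // WORD
    offset = start % WORD
    reads = 1 if offset == 0 else 2          # crossing a boundary costs a second read
    fetched = []
    for w in range(first, first + reads):
        fetched.extend(memory_words[w])
    return fetched[offset:offset + WORD], reads  # the slice is the sub-word alignment


memory_words = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
aligned = read_unaligned(memory_words, 4)
unaligned = read_unaligned(memory_words, 6)
print(aligned, unaligned)
```

In two dimensions a window can straddle both a horizontal and a vertical boundary, so the number of reads can grow to four.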
Techniques • 2D partitioning • Indirect addressing • Data switching
2D partitioning • View the entire image as a tiling of spatial patterns • Packed word = required spatial pattern • Avoids the partitioning issue [Figure: a spatial pattern mapped to a packed word across memory banks]
Spatial pattern • A certain pattern recurs in a lookup sequence • e.g. 2x2 block for interpolation • e.g. 3x3 block for convolution [Figure: lookup sequence over input images made of 2x2 blocks such as (i, j), (i+1, j), (i, j+1), (i+1, j+1)]
2D partitioning and misalignment • Tiled patterns guarantee that the data elements of a word are always distributed across the banks, even if an access overlaps address boundaries [Figure: Banks 1 to 4 holding elements 1 to 4 of two accesses]
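The guarantee can be checked with a small sketch. It assumes a 2x2 pattern and a bank assignment of `(x mod P) + P * (y mod Q)`; this mapping is one natural choice and is not taken from the slides.

```python
P, Q = 2, 2  # pattern width and height (assumed 2x2, as for interpolation)


def bank_of(x, y):
    """Bank that stores pixel (x, y) under the assumed tiled assignment."""
    return (x % P) + P * (y % Q)


def banks_hit(x0, y0):
    """Sorted list of banks touched by the P x Q window whose top-left is (x0, y0)."""
    return sorted(bank_of(x, y)
                  for y in range(y0, y0 + Q)
                  for x in range(x0, x0 + P))


# Every window position, aligned or not, touches each bank exactly once.
conflict_free = all(banks_hit(x0, y0) == list(range(P * Q))
                    for y0 in range(8) for x0 in range(8))
print(conflict_free)
```

Because any P consecutive x values cover all residues modulo P (and likewise for y), no two pixels of a window ever compete for the same bank.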
Indirect addressing • Generating a patterned address for each bank removes the multiple reads otherwise needed for a misaligned access [Figure: address generator driving Banks 1 to 4]
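A minimal sketch of such an address generator follows. It assumes a 2x2 pattern, an 8x8 image, a bank assignment of `(x mod P) + P * (y mod Q)`, and row-major tile addresses inside each bank; all names and sizes are illustrative rather than taken from the slides.

```python
P, Q = 2, 2            # pattern size (assumed)
W, H = 8, 8            # image size (assumed, multiples of P and Q)
TILES_PER_ROW = W // P


def bank_of(x, y):
    return (x % P) + P * (y % Q)


def addr_of(x, y):
    """Address of pixel (x, y) inside its bank: the index of its tile."""
    return (y // Q) * TILES_PER_ROW + (x // P)


def build_banks(image):
    banks = [[None] * (TILES_PER_ROW * (H // Q)) for _ in range(P * Q)]
    for y in range(H):
        for x in range(W):
            banks[bank_of(x, y)][addr_of(x, y)] = image[y][x]
    return banks


def bank_addresses(x0, y0):
    """One address per bank for the P x Q window whose top-left is (x0, y0)."""
    addrs = []
    for b in range(P * Q):
        bx, by = b % P, b // P
        x = x0 + (bx - x0) % P   # the window column stored in this bank
        y = y0 + (by - y0) % Q   # the window row stored in this bank
        addrs.append(addr_of(x, y))
    return addrs


image = [[y * W + x for x in range(W)] for y in range(H)]
banks = build_banks(image)
addrs = bank_addresses(3, 5)                  # a misaligned window
outputs = [banks[b][a] for b, a in enumerate(addrs)]
print(addrs, outputs)
```

All four banks are read in the same cycle with different addresses, so the misaligned window at (3, 5) costs one parallel access instead of four word reads. The outputs arrive in bank order, not window order.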
Data switching • A switch removes the throughput loss caused by sub-word alignment [Figure: address generator, Banks 1 to 4, and a switch reordering their outputs]
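The switch can be modeled as a permutation selected by the window position. The sketch below assumes a 2x2 pattern and a bank assignment of `(x mod P) + P * (y mod Q)`; the function name `switch` and the example values are hypothetical.

```python
P, Q = 2, 2  # pattern size (assumed)


def switch(bank_outputs, x0, y0):
    """Permute per-bank outputs into row-major window order.

    Window position (dx, dy) is served by the bank that stores
    pixel (x0 + dx, y0 + dy) under the assumed tiled assignment.
    """
    return [bank_outputs[((x0 + dx) % P) + P * ((y0 + dy) % Q)]
            for dy in range(Q) for dx in range(P)]


# Outputs of banks 0..3 for a window at (3, 5) in an 8-pixel-wide image
# whose pixel values are y * 8 + x (illustrative data).
window = switch([52, 51, 44, 43], 3, 5)
identity = switch(["a", "b", "c", "d"], 0, 0)
print(window, identity)
```

The permutation depends only on `x0 mod P` and `y0 mod Q`, so in hardware it reduces to a small multiplexer network that works in the same cycle as the read.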
Techniques overview [Figure: geometry stream, address generator (indirect addressing), memory banks holding input images (2D partitioning), data switching, pixel stream]
Example: Lucas-Kanade • Image registration algorithm • Nonlinear least squares solving for the parameters of an affine transformation between input and template [Baker & Matthews, 04] [Figure: input image and template image fed to the Gauss-Newton method, which outputs affine parameters]
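For reference, the Gauss-Newton iteration can be written as follows. This is the forward-additive formulation from Baker & Matthews with their affine parameterization; the slides do not state which variant the hardware uses, so treat the choice as an assumption. Here T is the template, I the input image, and p = (p_1, …, p_6) the affine parameters.

```latex
\begin{aligned}
W(\mathbf{x};\mathbf{p}) &=
  \begin{pmatrix} (1+p_1)\,x + p_3\,y + p_5 \\ p_2\,x + (1+p_4)\,y + p_6 \end{pmatrix} \\
H &= \sum_{\mathbf{x}}
  \left[\nabla I \frac{\partial W}{\partial \mathbf{p}}\right]^{T}
  \left[\nabla I \frac{\partial W}{\partial \mathbf{p}}\right] \\
\Delta\mathbf{p} &= H^{-1} \sum_{\mathbf{x}}
  \left[\nabla I \frac{\partial W}{\partial \mathbf{p}}\right]^{T}
  \left[T(\mathbf{x}) - I\big(W(\mathbf{x};\mathbf{p})\big)\right] \\
\mathbf{p} &\leftarrow \mathbf{p} + \Delta\mathbf{p}
\end{aligned}
```

Both I(W(x; p)) and the gradient ∇I are evaluated at warped coordinates, which is where the per-pixel, per-iteration lookup comes from.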
LK data flow • Bottleneck: the for-each-x, for-each-iteration stack • Includes pixel lookup [Figure: data-flow diagram with nested "For each" loops]
Pixel lookup in LK • Converts affine-warped coordinates to pixels • Looks up the neighboring 4x4 pixels for each output [Figure: warped coordinates index a pixel lookup table built from input images; the raw pixels pass through interpolation to give warped input pixels and warped gradient pixels]
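One way to turn a warped coordinate into a 4x4 lookup is sketched below. The affine parameterization follows Baker & Matthews; placing the 4x4 block one pixel up and to the left of the integer part, and the names `warp_affine` and `block_origin`, are assumptions for illustration rather than details from the slides.

```python
import math


def warp_affine(x, y, p):
    """Apply the affine warp W(x; p) with p = (p1, ..., p6)."""
    p1, p2, p3, p4, p5, p6 = p
    return (1 + p1) * x + p3 * y + p5, p2 * x + (1 + p4) * y + p6


def block_origin(wx, wy):
    """Top-left corner of the 4x4 neighborhood plus the fractional offsets.

    Assumes the block spans one pixel before and two pixels after the
    integer part of the coordinate in each direction.
    """
    ix, iy = math.floor(wx), math.floor(wy)
    return ix - 1, iy - 1, wx - ix, wy - iy


p = (0.0, 0.0, 0.0, 0.0, 2.5, 1.25)   # pure translation, illustrative values
wx, wy = warp_affine(10, 20, p)
origin = block_origin(wx, wy)
print((wx, wy), origin)
```

The integer corner selects the 4x4 block to fetch, and the fractional offsets select the filter kernel coefficients applied to it.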
Straightforward implementation [Figure: input images in RAM supply raw pixels to multiply-adds with filter kernels]
Interleaved implementation [Figure: an address generator drives the memory banks of an interleaved memory holding input images with 4x4 block partitioning; raw pixels feed multiply-adds with filter kernels]
Comparison of memory configurations • Implementing peripherals is easier than increasing memory capacity
FPGA implementation of LK pipeline • Interleaving alone yields 16x higher throughput for the dedicated pipeline [Figure: block diagram of the dedicated hardware pipeline and FPU; blocks: Affine Warp Calculator, Filter Kernel Generator, Gradient / Interpolation Filter, Jacobian Filter, Hessian Matrix Calculator, SDPU Calculator, Error Calculator, Input Pixel Table, Template Pixel Table, FP ALU, FP Register; loops: for each x, for each iteration]
HDL synthesis • 16x higher throughput with the same capacity requirement and feasible hardware cost • Estimated performance: 200 fps for registration of five 64x64 8-bit image patches at 100 MHz • Assumption: all registrations converge within 10 iterations
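The quoted figures can be cross-checked with a rough cycle budget. This sketch assumes that every pixel of every patch needs one 4x4 lookup per iteration and it ignores all other pipeline stages; it is not the authors' performance model.

```python
patches, width, height = 5, 64, 64   # from the slide
iterations = 10                      # convergence assumption from the slide
fps = 200
clock_hz = 100_000_000
block = 4 * 4                        # pixels fetched per output

# 4x4 lookups required per second under the stated assumptions.
outputs_per_second = patches * width * height * iterations * fps
# Clock cycles available for each 4x4 lookup at 100 MHz.
cycles_per_output = clock_hz / outputs_per_second
# Cycles per second needed if a memory delivered only one pixel per cycle.
single_pixel_cycles = outputs_per_second * block

print(outputs_per_second, cycles_per_output, single_pixel_cycles)
```

Under these assumptions about 2.4 cycles are available per output. A memory delivering one pixel per cycle would need roughly 655 million cycles per second, well above the 100 MHz clock, whereas fetching a full 4x4 block per cycle fits within it.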
Summary • Interleaved pixel lookup • Sub-word parallel memory operations exploiting spatial patterns in lookup sequences • Techniques • 2D partitioning • Indirect addressing • Data switching • Example: Lucas-Kanade • 16x higher throughput with the same memory capacity and feasible hardware cost