A Parallel Implementation of MSER detection GPGPU Final Project Lin Cao
Review Invariant to affine transformations such as rotation, translation, and scale change; denotes a set of stable connected components detected in a grayscale image.
Review • An MSER is a stable connected component of a thresholded image • All pixels inside an MSER have higher or lower intensities than those in the surrounding region • Regions are selected to be stable over a range of intensity thresholds
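The idea above can be illustrated with a minimal, pure-Python sketch (not the project's GPU code): threshold a toy image at several levels and observe that the dark blob's connected-component area stays constant over a wide threshold range, which is exactly what makes it "stable".

```python
from collections import deque

def connected_components(mask, w, h):
    """Return the areas of 4-connected components in a boolean mask (BFS flood fill)."""
    seen = [[False] * w for _ in range(h)]
    areas = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                area, q = 0, deque([(y, x)])
                seen[y][x] = True
                while q:
                    cy, cx = q.popleft()
                    area += 1
                    for ny, nx in ((cy-1,cx),(cy+1,cx),(cy,cx-1),(cy,cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                areas.append(area)
    return areas

# Toy 4x4 grayscale image: a dark blob on a bright background.
img = [[200, 200, 200, 200],
       [200,  10,  10, 200],
       [200,  10,  10, 200],
       [200, 200, 200, 200]]

# Threshold at increasing levels; the blob's area stays 4 across all of
# them, so it is a stable (MSER-like) region.
for t in (50, 100, 150):
    mask = [[img[y][x] <= t for x in range(4)] for y in range(4)]
    print(t, connected_components(mask, 4, 4))
```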
Sequential and Parallel Approach

Sequential {
  bucketSort( );
  Find( );
  Union( );
  Update( );
  computeVariation( );
  findRoot( );
  leastVariation( );
}

Parallel {
  buildDirectedGraph( );
  blockReduction( );
  parentCompression( );
  // already get regions
  GetRegion( );
  computeVariation( );
  leastVariation( );
}
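The sequential Find/Union steps above are the classic union-find (disjoint-set) pattern; a minimal sketch of that pattern is below. Names and the merge order are illustrative, not the project's actual code.

```python
# Minimal union-find sketch of the sequential Find()/Union() steps.
parent = {}

def find(x):
    # Walk parent links to the root, compressing the path on the way back.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    # Merge the components containing a and b.
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# Pixels visited in intensity order (bucketSort) are merged with already
# processed neighbours, growing connected components level by level.
for p in range(6):
    parent[p] = p
union(0, 1); union(1, 2); union(4, 5)
assert find(0) == find(2) and find(4) == find(5) and find(0) != find(4)
```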
buildDirectedGraph The parent of each pixel must have a value no less than the pixel's own value. Memory usage: local memory (visited, members) and shared memory. Edges are also processed here for the next step.
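A hedged CPU-side sketch of the parent rule above: point each pixel at a 4-neighbour whose intensity is at least its own, so every parent's value is no less than the child's. The grid layout and neighbour order are assumptions, not the project's kernel.

```python
# Sketch: build the directed graph by choosing, for each pixel, a
# 4-neighbour with intensity >= its own as its parent (pixels with no
# such neighbour become their own root).
def build_directed_graph(img, w, h):
    parent = {}
    for y in range(h):
        for x in range(w):
            parent[(y, x)] = (y, x)  # default: self-rooted
            for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                if 0 <= ny < h and 0 <= nx < w and img[ny][nx] >= img[y][x]:
                    parent[(y, x)] = (ny, nx)
                    break
    return parent

img = [[1, 2],
       [3, 4]]
g = build_directed_graph(img, 2, 2)
# Invariant from the slide: a parent's value is no less than its child's.
assert all(img[p[0]][p[1]] >= img[c[0]][c[1]] for c, p in g.items())
```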
Block Reduction (16×16 and 8×8 blocks)
Block Reduction The number of reduction iterations is logarithmic (log₂) in the block width; in total, 3 iterations are needed.
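A sketch of the tree-style reduction behind that iteration count (assuming the 3 iterations correspond to an 8-wide block, since log₂ 8 = 3): each iteration halves the number of active elements, as parallel threads would in a GPU block.

```python
# Tree reduction: combine pairs at halving strides, counting iterations.
def block_reduce(vals):
    vals = list(vals)
    iterations = 0
    stride = len(vals) // 2
    while stride >= 1:
        for i in range(stride):
            vals[i] += vals[i + stride]  # pairwise combine, like parallel threads
        stride //= 2
        iterations += 1
    return vals[0], iterations

total, iters = block_reduce(range(8))   # sum of 0..7 in log2(8) = 3 steps
assert total == 28 and iters == 3
```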
Block Reduction Load edge information into each pixel; update when the horizontal-edge condition holds: if (horizontal_pixelUpdate)
Block Reduction History buffer
Parent Compression Uses shared memory, exploiting parent locality.
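A minimal sketch of what parent compression does (the shared-memory tiling is omitted): rewrite every pixel's parent pointer to point directly at its root, so later lookups follow one link instead of a chain, which is also what makes parent locality pay off.

```python
# Path compression over a parent-pointer table: after compress(), every
# entry points directly at its root.
def compress(parent):
    for p in list(parent):
        root = p
        while parent[root] != root:
            root = parent[root]
        parent[p] = root
    return parent

chain = {0: 1, 1: 2, 2: 3, 3: 3}   # 0 -> 1 -> 2 -> 3
compress(chain)
assert chain == {0: 3, 1: 3, 2: 3, 3: 3}
```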
FindRegion • FindRoot, so that each region's tree can be processed independently • Find each region's parent and child based on the delta, so that the variation can be computed: var = (area(parent) − area(child)) / area(current region) • Send the region information to the CPU • Scan every region's tree and find the minimal variation, which gives the MSER regions • Filter the regions
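The variation rule above can be checked on a toy branch of the component tree (the areas and candidate regions here are made-up examples, not measured data): compute var for nested regions, then keep the candidate with the least variation.

```python
# var = (area(parent) - area(child)) / area(current region), per the slide.
def variation(area_parent, area_child, area_current):
    return (area_parent - area_child) / area_current

# Areas of three nested regions along one branch of the tree.
areas = {'child': 40, 'current': 50, 'parent': 60}
v = variation(areas['parent'], areas['child'], areas['current'])
assert abs(v - 0.4) < 1e-9

# Among candidate regions, the MSER is the one with minimal variation.
candidates = {'A': 0.4, 'B': 0.15, 'C': 0.9}
assert min(candidates, key=candidates.get) == 'B'
```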
Performance Analysis • For a 256×256 image (timing chart)
Performance Analysis • For a 1024×768 image (timing chart)
Performance Analysis Why is 8×8 better than 16×16? • local memory usage • number of recursions • block execution • number of block-reduction iterations • parent locality
Performance Analysis GPU vs. CPU timing • intermediate values • synchronization • recording information • memory transfer
Conclusion • The data dependency is very large, but it can still be resolved. • The approach should also suit multicore microprocessors, whose individual cores are stronger than a single GPU thread. • The bottleneck is still memory.
Future Work • More efficient block reduction (decoder and encoder) • Memory random access • GPU code efficiency