Learn about a cutting-edge framework for image recognition, from segmentation to matching. Discover how this method handles various challenges like changes in pose and illumination to deliver accurate results.
The Visual Recognition Machine • Jitendra Malik, University of California at Berkeley
From images to objects • Labeled sets: tiger, grass, etc.
Recognition • Possible for both instances and object classes (Mona Lisa vs. faces, or Beetle vs. cars) • Tolerant to changes in pose and illumination
Three stages • Segmentation: Images → Regions • Association: Regions → Super-regions • Matching: Super-regions → Prototype views
Boundaries of image regions defined by a number of attributes • Brightness/color • Texture • Motion • Stereoscopic depth • Familiar configuration
Image Segmentation as Graph Partitioning • Build a weighted graph G = (V, E) from the image • V: image pixels • E: connections between pairs of nearby pixels • Partition the graph so that similarity within groups is large and similarity between groups is small: Normalized Cuts [Shi & Malik 97]
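The graph construction can be sketched in a few lines. This is a minimal illustration only: the brightness-based weight exp(-(difference)^2 / sigma^2), the 8-connected neighborhood, and the names `build_pixel_graph`, `radius`, and `sigma` are assumptions made for the example, not details given on the slide.

```python
import math

def build_pixel_graph(image, radius=1, sigma=0.1):
    """Return {(i, j): weight} for pairs of nearby pixels, with i < j.

    image: list of rows of brightness values in [0, 1].
    radius and sigma are illustrative parameters, not values from the talk.
    """
    rows, cols = len(image), len(image[0])
    edges = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c  # node index of pixel (r, c)
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0):
                        continue
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    j = rr * cols + cc
                    if i < j:  # store each undirected edge once
                        diff = image[r][c] - image[rr][cc]
                        # Assumed similarity: close brightness gives a weight near 1.
                        edges[(i, j)] = math.exp(-(diff * diff) / (sigma * sigma))
    return edges

# Toy image: a dark block on the left, a bright column on the right.
image = [
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
graph = build_pixel_graph(image)
```

Edges inside the dark block get weights near 1, while edges crossing into the bright column get weights near 0, which is the structure a partitioning criterion can exploit.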
Some Terminology for Graph Partitioning • How do we bipartition a graph:
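The formula graphic for this slide did not survive extraction. In standard graph-partitioning notation (an assumption here, with w(u, v) denoting the weight on the edge between nodes u and v), a bipartition splits V into two disjoint sets A and B, and its cost is the total weight of the edges removed:

```latex
\[
\mathrm{cut}(A, B) = \sum_{u \in A,\; v \in B} w(u, v),
\qquad A \cup B = V, \quad A \cap B = \emptyset .
\]
```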
Normalized Cut, A measure of dissimilarity • Minimum cut is not appropriate since it favors cutting small pieces. • Normalized Cut, Ncut:
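The Ncut formula is missing from the extracted slide. A small sketch, assuming the definition from the cited Shi and Malik work, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), shows why the minimum cut favors small pieces. The toy graph and the function names are hypothetical.

```python
def cut(weights, a, b):
    """Total weight of edges with one endpoint in a and the other in b."""
    return sum(w for (u, v), w in weights.items()
               if (u in a and v in b) or (u in b and v in a))

def assoc(weights, a):
    """Total weight from nodes in a to all nodes of the graph."""
    # Each undirected edge is stored once, so it counts once per endpoint in a.
    return sum(w * ((u in a) + (v in a)) for (u, v), w in weights.items())

def ncut(weights, a, b):
    """Normalized cut: the cut cost relative to each side's total connectivity."""
    c = cut(weights, a, b)
    return c / assoc(weights, a) + c / assoc(weights, b)

# Two tight triangles joined by a weak bridge, plus one weakly attached node 6.
weights = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
    (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
    (2, 3): 0.3,  # bridge between the two triangles
    (5, 6): 0.2,  # pendant node
}
balanced = ({0, 1, 2}, {3, 4, 5, 6})
isolated = ({0, 1, 2, 3, 4, 5}, {6})

print("cut :", cut(weights, *balanced), cut(weights, *isolated))
print("ncut:", ncut(weights, *balanced), ncut(weights, *isolated))
```

On this graph the plain cut is smaller when node 6 is split off by itself, whereas Ncut is smaller for the balanced split between the two triangles.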
Solving the Normalized Cut problem • The exact discrete solution to Ncut is NP-complete, even on a regular grid [Papadimitriou’97] • Drawing on spectral graph theory, a good approximation can be obtained by solving a generalized eigenvalue problem.
Normalized Cut As Generalized Eigenvalue problem • after simplification, we get
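The resulting equation is missing from the extracted slide. In the Shi and Malik formulation cited earlier, with W the symmetric matrix of edge weights and D the diagonal matrix of the row sums of W (symbols chosen to match the (D-W) term used in the deck), the relaxed problem is usually written as:

```latex
\[
\min_{y}\; \frac{y^{T}(D - W)\,y}{y^{T} D\, y}
\quad \text{subject to } y^{T} D\,\mathbf{1} = 0
\qquad\Longrightarrow\qquad
(D - W)\,y = \lambda D\,y .
\]
```

The eigenvector with the second-smallest eigenvalue is then thresholded to give the two groups.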
Computational Aspects • Solving for the generalized eigensystem (D-W) y = λ D y • (D-W) is of size N × N, but it is sparse with O(N) nonzero entries, where N is the number of pixels • Solved using the Lanczos algorithm.
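For intuition, the eigenvector computation can be sketched on a toy graph. Note the substitution: this sketch uses plain power iteration on a small dense matrix rather than the Lanczos algorithm on a sparse one, and the function name and the six-node graph are hypothetical. It assumes the generalized problem (D-W) y = λ D y, converted to the symmetric matrix D^-1/2 W D^-1/2.

```python
import math

def second_generalized_eigenvector(W, iterations=500):
    """Approximate the second-smallest eigenvector y of (D - W) y = lambda D y.

    W: dense symmetric list-of-lists of weights with positive row sums.
    Runs power iteration on (I + D^-1/2 W D^-1/2) / 2 after projecting out
    the trivial eigenvector D^1/2 * 1 (which has lambda = 0).
    """
    n = len(W)
    d = [sum(row) for row in W]           # node degrees (diagonal of D)
    s = [math.sqrt(x) for x in d]         # diagonal of D^1/2
    norm0 = math.sqrt(sum(d))
    v0 = [x / norm0 for x in s]           # trivial eigenvector, unit length
    z = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # Remove the component along the trivial eigenvector.
        dot = sum(a * b for a, b in zip(z, v0))
        z = [a - dot * b for a, b in zip(z, v0)]
        # Apply (I + D^-1/2 W D^-1/2) / 2; the shift keeps eigenvalues nonnegative.
        u = [z[j] / s[j] for j in range(n)]
        wz = [sum(W[i][j] * u[j] for j in range(n)) / s[i] for i in range(n)]
        z = [0.5 * (a + b) for a, b in zip(z, wz)]
        length = math.sqrt(sum(a * a for a in z))
        z = [a / length for a in z]
    return [z[i] / s[i] for i in range(n)]  # map back: y = D^-1/2 z

def toy_graph():
    """Two triangles (weight 1.0) joined by one weak edge (weight 0.1)."""
    W = [[0.0] * 6 for _ in range(6)]
    edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
             (2, 3, 0.1)]
    for i, j, w in edges:
        W[i][j] = W[j][i] = w
    return W

y = second_generalized_eigenvector(toy_graph())
labels = [value > 0 for value in y]  # threshold at zero to bipartition
```

Thresholding y at zero separates the two triangles. A sparse Lanczos-type solver would be the practical choice at image scale, since a dense N × N matrix is out of reach when N is the number of pixels.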
Three stages • Segmentation: Images → Regions • Association: Regions → Super-regions • Matching: Super-regions → Prototype views
Association • Number of super-regions of size k in image with n regions is approximately (4**k)*n/k • For typical images, this ranges between 1000 and 10000 • Plausibility ordering could reduce effective number substantially • Computing time for this stage negligible
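The count estimate above is easy to evaluate. In this sketch the function name and the sample values n = 100 regions and k = 3 or 4 are assumptions chosen only to illustrate the formula; the slide does not state which n and k are typical.

```python
def approx_super_region_count(n_regions, k):
    """Slide estimate: about (4**k) * n / k super-regions of size k."""
    return (4 ** k) * n_regions / k

# Hypothetical segmentation with 100 regions.
for k in (3, 4):
    print(k, approx_super_region_count(100, k))
```

With these assumed values the estimate lands inside the 1000 to 10000 range quoted on the slide.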
Three stages • Segmentation: Images → Regions • Association: Regions → Super-regions • Matching: Super-regions → Prototype views
Matching • Objects are represented by a set of prototypical views (~10 per object) • For each super-region S, calculate probability that it is an instance of view V • Determine most probable labeling of image into objects
Matching super-regions to views • Based on color, texture and shape similarity • Color, texture matching is relatively well understood and fast • Shape matching is difficult because the algorithm should tolerate pose, illumination and intra-category variation • GOAL: small misclassification error with few views.
Core idea • Find corresponding points on the two shapes and use those to deform prototype into alignment • Allowing this flexibility reduces number of prototype views needed
Matching with original and deformed prototypes [Figure: columns labeled Prototype, Test, Error]
Only 25 deformable templates needed (instead of 60 K) to get 5% error
Computing cost on a Pentium PC • Segmentation: 5 minutes/image • Matching: 0.5 sec/match
Cost on a 10**4-node machine (assuming linear speedup over the Pentium PC figures) • Segmentation: 0.03 sec/image, which is about 30 Hz (video rate) • Matching: 20K matches/sec at full resolution (100 points/shape)
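These figures follow from the single-PC costs of 5 minutes per image and 0.5 sec per match if the work divides perfectly across nodes. That perfect scaling is an assumption of this sketch, and the variable names are illustrative.

```python
NODES = 10 ** 4

# Single-PC costs quoted in the deck.
segmentation_seconds_pc = 5 * 60   # 5 minutes per image
match_seconds_pc = 0.5             # seconds per match

# Assumption: ideal linear speedup, no communication overhead.
segmentation_seconds = segmentation_seconds_pc / NODES
matches_per_second = NODES / match_seconds_pc

print(segmentation_seconds, "sec/image")
print(matches_per_second, "matches/sec")
```

Any parallel overhead would push the segmentation rate below video rate, so the numbers are best read as an upper bound on throughput.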
How many prototype views can one match at 1 Hz? • 1K candidate super-regions • Consider only 1% of matches at full resolution (10% pass color/texture filter, 10% of those pass low resolution shape filter) • If half time spent in pruning and half in full resolution matching, 1000 prototype views can be matched at 1 Hz.
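The budget above can be checked with a short calculation. The function and parameter names are hypothetical; the rate of 20K full-resolution matches/sec is the figure for the 10**4-node machine, and the other numbers come from this slide.

```python
def views_matchable_per_second(full_res_matches_per_sec, candidate_regions,
                               color_texture_pass, low_res_pass,
                               full_res_time_share):
    """Prototype views that can be matched each second under the filter cascade."""
    # Only part of each second is available for full-resolution matching.
    budget = full_res_matches_per_sec * full_res_time_share
    # Fraction of (super-region, view) pairs that survive both filters.
    surviving_fraction = color_texture_pass * low_res_pass
    # Each view costs candidate_regions * surviving_fraction full matches.
    return budget / (candidate_regions * surviving_fraction)

views = views_matchable_per_second(
    full_res_matches_per_sec=20_000,
    candidate_regions=1_000,
    color_texture_pass=0.10,
    low_res_pass=0.10,
    full_res_time_share=0.5,
)
print(round(views))
```

Each view needs about 10 full-resolution matches (1% of 1K super-regions), and half of the 20K matches/sec budget is 10K, which gives roughly 1000 views per second.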
What can one do with matching 1000 views a second? • Worst case: 100 object categories (1000 views at about 10 views per object) • Best case depends on how well one can exploit context, hierarchy and hashing • Cf. humans can recognize 10-100K objects
Memory requirements • 10 K object categories * 10 views/category * 100 * 100 pixels/view * 1 byte/pixel gives us 1 Gigabyte.
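The memory estimate is a straightforward product, written out here with illustrative variable names and the values from the slide.

```python
categories = 10_000        # object categories
views_per_category = 10    # prototype views per category
pixels_per_view = 100 * 100
bytes_per_pixel = 1

total_bytes = categories * views_per_category * pixels_per_view * bytes_per_pixel
print(total_bytes / 10 ** 9, "GB")
```

This counts only raw view storage at one byte per pixel; any index or hashing structure would add to it.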
Concluding remarks • Speech in 1985 was in the same state as vision in 2000. The adoption of Hidden Markov Models led to a decade of research that refined the paradigm for continuous speech recognition. • The proposed three-stage framework for recognition (segmentation, association and matching) could provide the same focus and coherence to vision research, leading to general-purpose object recognition in 10 years.