A presentation of the context-based vision system for place and object recognition by Antonio Torralba, Kevin Murphy, Bill Freeman, and Mark Rubin, presented by David Lee. It covers the system's components and its performance on place recognition and object recognition.
Context-based vision system for place and object recognition • Antonio Torralba, Kevin Murphy, Bill Freeman, Mark Rubin • Presented by David Lee • Some slides borrowed from Kevin Murphy
24 filtered images → downsample each to 4x4 → 4x4x24 = 384 dim → reduce to 80 dim
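The downsampling step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of block averaging of absolute responses, and the 32x32 dummy inputs are all assumptions. The final reduction from 384 to 80 dimensions is left out because the slide does not state how it is done.

```python
def downsample_to_grid(response, grid=4):
    """Average |response| over a grid x grid tiling of a 2-D list.

    Assumes the image has at least `grid` rows and columns.
    """
    rows, cols = len(response), len(response[0])
    out = []
    for gy in range(grid):
        y0, y1 = gy * rows // grid, (gy + 1) * rows // grid
        for gx in range(grid):
            x0, x1 = gx * cols // grid, (gx + 1) * cols // grid
            cells = [abs(response[y][x])
                     for y in range(y0, y1) for x in range(x0, x1)]
            out.append(sum(cells) / len(cells))
    return out


def gist_vector(filtered_images, grid=4):
    """Concatenate the downsampled outputs of every filter into one vector."""
    vec = []
    for response in filtered_images:
        vec.extend(downsample_to_grid(response, grid))
    return vec


# 24 dummy filter outputs (hypothetical data): image k is constant with value k.
filtered = [[[float(k)] * 32 for _ in range(32)] for k in range(24)]
features = gist_vector(filtered)
print(len(features))  # 4 * 4 * 24 entries
```

Each filter contributes 16 values, so 24 filters give the 384-dimensional vector named on the slide.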
Visualizing the filter bank output • Images • 80-dimensional representation
Hidden Markov Model • Hidden states = location (63 values) • Observations = v_t^G ∈ R^80 • Transition model encodes topology of environment • Observation model is a mixture of Gaussians (100 views per place)
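A mixture-of-Gaussians observation model of this kind can be sketched as below. The slide only says that each place has a mixture built from 100 views; the equal mixture weights, the isotropic covariance with a shared `sigma`, and all names here are assumptions made for illustration.

```python
import math


def log_gaussian_isotropic(v, mean, sigma):
    """Log density of an isotropic Gaussian N(mean, sigma^2 I) at v."""
    d = len(v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, mean))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def place_likelihood(v, prototypes, sigma):
    """p(v | place) as an equal-weight mixture centred on stored views."""
    logs = [log_gaussian_isotropic(v, m, sigma) for m in prototypes]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return math.exp(top) * sum(math.exp(l - top) for l in logs) / len(logs)


# Two hypothetical places with two stored views each (2-D toy features).
place_a = [[0.0, 0.0], [1.0, 0.0]]
place_b = [[5.0, 5.0], [6.0, 5.0]]
v = [0.5, 0.0]
like_a = place_likelihood(v, place_a, sigma=1.0)
like_b = place_likelihood(v, place_b, sigma=1.0)
print(like_a > like_b)
```

In the system described on the slide, the feature vectors would be 80-dimensional and each place would hold 100 views rather than two.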
Hidden Markov Model • Observation likelihood (mixture of Gaussians) • Prediction prior • Transition matrix (MLE, counting)
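The labels above correspond to the standard HMM filtering recursion: the transition matrix turns the previous belief into a prediction prior, which is multiplied by the observation likelihood and renormalised. A minimal sketch follows; the function names, the uniform fallback for places never seen in training, and the toy numbers are assumptions rather than details from the slides.

```python
def count_transitions(labels, n_states):
    """Maximum-likelihood transition matrix obtained by counting label pairs."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Assumption: fall back to uniform for states with no outgoing counts.
            matrix.append([1.0 / n_states] * n_states)
    return matrix


def forward_step(belief, transition, likelihood):
    """One filtering update: predict with the transition matrix, then correct."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    unnorm = [likelihood[j] * predicted[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Toy walk through three places (hypothetical labels).
A = count_transitions([0, 0, 1, 1, 2, 2, 0], n_states=3)
belief = [1.0, 0.0, 0.0]        # previous belief: certainly in place 0
likelihood = [0.2, 0.6, 0.2]    # p(v_t | place) from the observation model
posterior = forward_step(belief, A, likelihood)
print(posterior)
```

Because place 2 is never reached directly from place 0 in the training labels, its posterior stays at zero here regardless of the observation, which is how the transition model encodes the topology of the environment.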
Scene Categorization • 17 categories (office, corridor, street, etc.) • Train a separate HMM on category labels
Performance on known environment • Ground truth vs. system estimate • Specific location • Location category • Indoor/outdoor
Comparison of features • Categorization • Recognition
Effect of HMM on recognition • Without HMM (but with temporal smoothing) • With HMM
Object priming • Predict object properties based on context (top-down signals): • Visual gist, v_t^G • Specific location, Q_t • Kind of location, C_t
Object Priming • Estimate of current place (output of HMM) • Probability of object i in image vi given entire video sequence • Probability of object i given current observation & place • Prior probability of object i being in place q (MLE) • Observation likelihood (mixture of Gaussians) • Probability of object i • Again…
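Read together, the labels above suggest a two-stage computation: Bayes' rule gives the probability of object i for each candidate place, and the HMM's place estimate then weights those per-place probabilities. The sketch below is a reconstruction under that reading, not the authors' formula; the function names and the toy numbers are assumptions.

```python
def object_given_place(lik_present, lik_absent, prior_present):
    """P(object present | current observation, place q) via Bayes' rule.

    lik_present / lik_absent: observation likelihood of the gist with and
    without the object; prior_present: prior that the object is in place q.
    """
    num = lik_present * prior_present
    den = num + lik_absent * (1.0 - prior_present)
    return num / den


def object_posterior(place_belief, lik_present, lik_absent, prior_present):
    """Weight the per-place object probability by the HMM place estimate."""
    return sum(
        b * object_given_place(lp, la, pr)
        for b, lp, la, pr in zip(place_belief, lik_present,
                                 lik_absent, prior_present)
    )


# Two hypothetical places; all numbers are made up for illustration.
place_belief = [0.8, 0.2]     # output of the HMM
lik_present = [0.6, 0.1]      # p(v_t | object present, place q)
lik_absent = [0.2, 0.3]       # p(v_t | object absent, place q)
prior_present = [0.5, 0.1]    # prior of the object being in place q
p_object = object_posterior(place_belief, lik_present,
                            lik_absent, prior_present)
print(round(p_object, 4))
```

The place belief acts as a soft selector: places the HMM considers unlikely contribute little to the final object probability even when their own object prior is high.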
Predicting object position and scale • Probability of an object i being present and location being q (output of previous system) • Estimate of mask • Estimate of mask given current gist, place, and object • Delta • Gaussian
Conclusion • Real-world problem (and it works!) • Uses only global features (context) • How much did {HMM / place prior} affect {place recognition / object detection}? Can we really say "context" did the job?