Learning and Inference in Vision: from Features to Scene Understanding

Learning and Inference in Vision: from Features to Scene Understanding Jonathan Huang, Tomasz Malisiewicz MLD Student Research Symposium, 2009

Sky Bridge Sign Trees Car Road

Huge datasets: PASCAL Visual Object Classes (VOC) Challenge dataset. ~15,000 annotated images, ~35,000 annotated object instances, 20 object classes with segmentations and bounding boxes

Huge datasets: LabelMe dataset. ~11,845 static images, >100,000 labeled polygons

Outline I. Recognizing single object classes (Jon) II. Scene understanding with multiple classes (Tomasz)

Recognition task #1: Find all markers

Recognition task #2: Find all cats Object recognition is often hard due to: Geometric Variability

Variation within an object class

Viewpoint/Scales/Illumination Variability Images from Flickr

From Pixels to Visual Features. Scene → (imaging) → Pixels → (low-level features) → Features → (higher-level inference) → “car”

Local Visual Features. Images are high dimensional! (640 width) × (480 height) = 307,200 pixels. Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)
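
A minimal sketch of the idea on this slide, assuming NumPy and a synthetic image: a 640 × 480 image holds 307,200 raw values, while a 16 × 16 region can be summarized by a 9-bin histogram of gradient orientations. The function name, bin count, and region size are illustrative choices, not taken from the talk.

```python
import numpy as np

def orientation_histogram(region, n_bins=9):
    """Normalized histogram of unsigned gradient orientations in a 2-D region."""
    gy, gx = np.gradient(region.astype(float))   # derivatives along rows, then columns
    angles = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
    has_gradient = np.hypot(gx, gy) > 1e-9       # ignore perfectly flat pixels
    counts, _ = np.histogram(angles[has_gradient], bins=n_bins, range=(0.0, np.pi))
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)

# Synthetic 480 x 640 image whose intensity increases from left to right.
image = np.tile(np.arange(640, dtype=float), (480, 1))
n_pixels = image.size                            # raw dimensionality of the image

# Summarize one 16 x 16 region with only n_bins numbers.
hist = orientation_histogram(image[100:116, 200:216])
print(n_pixels, hist.shape)
```

Because the synthetic image only varies horizontally, all of the histogram mass falls into the first orientation bin.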

Key ideas in feature design Be invariant to stuff you don’t care about… while not being too invariant

Object classification Inference: What object class is this? Learning: What does each object class look like? Let’s look at a simpler example first… Cow or Horse??

Document classification analogy. Classify each document as sports or politics. Document 1 (???): John Terry scored on a header to lift Chelsea to a 1-0 victory over Manchester United and extend the Blues’ Premier League lead to 5 points. Chelsea had been frustrated by Manchester United for 76 minutes, but took advantage of a free kick awarded when Darren Fletcher fouled Ashley Cole. Brian Ching scored six minutes into overtime and the Houston Dynamo advanced to Major League Soccer’s Western ... Document 2 (???): In the Senate, where proposals differ substantially from the House-passed measure on issues like a government-run plan and how to pay for coverage, the bill is stalled while budget analysts assess its overall costs. The slim margin in the House — the bill passed with just two votes to spare, and 39 Democrats opposed it — suggests even greater challenges in the Senate, where the majority leader, ...

Bag-of-words models for text classification “Much of the meaning behind written language is preserved even when the ordering of the individual words is lost.” [El-Arini et al.,’09] bag words (Sue Ann)
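
A small sketch of the bag-of-words idea, using only the Python standard library: shuffling a sentence leaves its bag of words unchanged, so a classifier that looks only at word counts gives the same answer. The topic vocabularies below are hypothetical and hand-picked for illustration; a real system would learn them from labeled documents.

```python
import random
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, discarding word order."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return Counter(w for w in words if w)

original = ("Chelsea took advantage of a free kick awarded "
            "when Darren Fletcher fouled Ashley Cole")
words = original.split()
random.Random(0).shuffle(words)                  # destroy the word order
scrambled = " ".join(words)

# Hypothetical hand-picked topic vocabularies, for illustration only.
TOPIC_WORDS = {
    "sports": {"kick", "fouled", "scored", "league", "header"},
    "politics": {"senate", "house", "bill", "votes", "democrats"},
}

def classify(text):
    """Pick the topic whose vocabulary overlaps most with the document's bag."""
    bag = bag_of_words(text)
    scores = {topic: sum(bag[w] for w in vocab) for topic, vocab in TOPIC_WORDS.items()}
    return max(scores, key=scores.get)

same_bag = bag_of_words(original) == bag_of_words(scrambled)
print(same_bag, classify(scrambled))
```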

Document classification analogy (same documents, word order scrambled). ???: but to on Darren awarded Fletcher advanced Ashley lift over to 1-0 scored advantage Major for lead 76 Chelsea Premier to Terry League John Houston the kick Chelsea took United points. free minutes fouled United been frustrated overtime Manchester six a when League a extend victory Ching 5 and to and Western Manchester Brian Cole. Dynamo Soccer’s by a minutes, Blues’ the had header into of scored ... ???: the margin how In on majority 39 costs. with measure slim overall — to like opposed suggests challenges pay even substantially stalled government-run where the issues votes it the where bill for spare, from bill and a Senate, analysts coverage, in — the Democrats greater differ two proposals budget its House assess while Senate, to in just the leader and the plan passed the is House-passed The ...

Visual words (discretization) Words are discrete, visual features are typically continuous… Discretization via clustering/vector quantization
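
The discretization step can be sketched as follows, assuming NumPy and plain k-means as the clustering method (the slide does not say which clustering algorithm was used). Random vectors stand in for real local descriptors; the function names and the codebook size of 5 are illustrative.

```python
import numpy as np

def quantize(features, centers):
    """Assign each feature vector to the index of its nearest cluster center."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(features, k, n_iters=20, seed=0):
    """Plain k-means: returns a (k, dim) codebook of 'visual words'."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iters):
        labels = quantize(features, centers)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:                 # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def visual_word_histogram(features, centers):
    """Bag of visual words: count how often each codebook entry is used."""
    return np.bincount(quantize(features, centers), minlength=len(centers))

# Random stand-ins for continuous local descriptors (200 descriptors, 8 dimensions).
features = np.random.default_rng(1).normal(size=(200, 8))
codebook = kmeans(features, k=5)
hist = visual_word_histogram(features, codebook)
print(hist)
```

After this step an image is represented by a count vector, exactly like a text document in the bag-of-words model.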

Visual words [Sivic et al., ‘05]

Object classification with bag of words [Sivic et al., ‘05]

Object classification with bag of words. Performance on Caltech 101 dataset with linear SVM on bag-of-words vectors (classes shown: Faces, Airplanes, Cars) [Csurka et al., ‘04]

Object Detection problem. Detection: Locate all the faces in this image. Classification: Is this a face, or not a face?

Face detection via a series of classifications (a.k.a. sliding window brain damage)
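
A rough sketch of detection as a series of classifications, assuming NumPy. The window size, stride, and the brightness-threshold classifier are hypothetical stand-ins; a real detector would plug in a trained face classifier and also scan over multiple scales.

```python
import numpy as np

def sliding_windows(width, height, window=24, stride=8):
    """Yield (left, top, w, h) for every window position on a regular grid."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield left, top, window, window

def detect(image, classify_window, window=24, stride=8):
    """Run a binary classifier on every window and keep the positives."""
    height, width = image.shape
    detections = []
    for left, top, w, h in sliding_windows(width, height, window, stride):
        if classify_window(image[top:top + h, left:left + w]):
            detections.append((left, top, w, h))
    return detections

def is_bright(patch):
    """Toy stand-in for a face classifier: fires on almost uniformly bright patches."""
    return patch.mean() > 0.99

# Dark 64 x 64 image with one bright 24 x 24 square at left=16, top=24.
image = np.zeros((64, 64))
image[24:48, 16:40] = 1.0

n_windows = sum(1 for _ in sliding_windows(64, 64))
detections = detect(image, is_bright)
print(n_windows, detections)
```

Even this tiny image requires 36 classifier evaluations at a single scale, which hints at why exhaustive window scanning gets expensive.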

False Detection Missed Faces Sliding window detection results

The need for… capturing spatial relationships

One Approach: Create a more descriptive (complicated) feature. Histograms of Oriented Gradients (HOG) features [Dalal & Triggs, ‘06]. Pipeline: Original Image → Estimated Image Gradients (gradient magnitudes, gradient orientations) → Subdivided Image cells → Histogrammed gradients in each cell
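
The pipeline on this slide can be sketched in a few lines, assuming NumPy. This is a simplified illustration only: it keeps the gradient, cell, and histogram steps but omits the block normalization and vote interpolation of the full HOG descriptor. The cell size of 8 and 9 orientation bins are assumed defaults.

```python
import numpy as np

def hog_cells(image, cell=8, n_bins=9):
    """Per-cell histograms of gradient orientation, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))            # estimated image gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi              # unsigned, in [0, pi)
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, n_bins))
    for r in range(rows):                                  # subdivide into cells
        for c in range(cols):
            block = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bins[block].ravel(),
                                     weights=magnitude[block].ravel(),
                                     minlength=n_bins)
    return hist

# Synthetic 32 x 64 image whose intensity increases from top to bottom.
image = np.tile(np.arange(32.0)[:, None], (1, 64))
hist = hog_cells(image)
print(hist.shape)
```

Unlike a single bag of words for the whole image, the grid of cells keeps coarse information about where each gradient pattern occurs.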

People Tracking with HOG features better

Modeling Spatial Relationships with Deformable Part Based Models. Spring-based models: Parts prefer low-energy configurations [Fischler & Elschlager, ’73], [Ramanan et al., ’07], [Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]

Parts Based Model. Goal: Assign model parts to image regions, preserving both local appearance and spatial relationships. Vertices: Local Appearance. Edges: Spatial Relationship.

Parts based models - Inference Problem. Inference problem: What is the best scoring assignment f? The score of an assignment is the sum of a local appearance term and a pairwise spatial relationship term. For trees, belief propagation gives an exact solution in polynomial time. Inference is NP-hard for general graphs.
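
A minimal sketch of exact inference on a tree, assuming NumPy, a star-shaped model (every part connected to a root), and a quadratic spring penalty as the pairwise term. The score being maximized is assumed to be the sum of appearance scores minus the spring penalties; the function names, the candidate grid, and the random scores are all illustrative. The result is compared against brute-force enumeration.

```python
import itertools
import numpy as np

def score_assignment(f, appearance, locations, offsets, stiffness=1.0):
    """Appearance scores minus quadratic spring penalties (part 0 is the root)."""
    total = sum(appearance[p, f[p]] for p in range(len(f)))
    for p in range(1, len(f)):
        d = locations[f[p]] - locations[f[0]] - offsets[p]
        total -= stiffness * float(d @ d)
    return float(total)

def best_star_assignment(appearance, locations, offsets, stiffness=1.0):
    """Max-sum message passing on a star: each part sends one message to the root."""
    n_parts, n_locs = appearance.shape
    total = appearance[0].copy()
    best_child = np.zeros((n_parts, n_locs), dtype=int)
    for p in range(1, n_parts):
        # disp[r, c] = displacement error if the root is at r and part p is at c
        disp = locations[None, :, :] - locations[:, None, :] - offsets[p]
        table = appearance[p][None, :] - stiffness * (disp ** 2).sum(axis=2)
        best_child[p] = table.argmax(axis=1)     # best part location for each root location
        total += table.max(axis=1)               # message from part p to the root
    root = int(total.argmax())
    assignment = [root] + [int(best_child[p, root]) for p in range(1, n_parts)]
    return assignment, float(total[root])

rng = np.random.default_rng(0)
locations = np.array([[x, y] for y in range(3) for x in range(3)], dtype=float)
appearance = rng.normal(size=(3, len(locations)))          # 3 parts, 9 candidate locations
offsets = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)  # ideal offsets from the root

assignment, score = best_star_assignment(appearance, locations, offsets)
brute_force = max(score_assignment(f, appearance, locations, offsets)
                  for f in itertools.product(range(len(locations)), repeat=3))
print(assignment, score, brute_force)
```

Message passing costs on the order of (parts × locations²) operations here, whereas brute force grows exponentially with the number of parts.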

Parts based models - Learning Problem. Linear models: the local appearance term and the pairwise spatial relationship term are both linear in their weights. Learning linear models: Find weight vectors that best separate positive and negative examples, e.g., with a convex max-margin objective, subject to positive examples lying on one side and negative examples on the other [Kumar et al., ’09]
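
One common way to write such a max-margin objective is the soft-margin form below. This is a generic sketch, not necessarily the exact formulation of the cited work: the stacked weight vector $w$, the joint feature map $\Phi(x, f)$ (appearance and pairwise features of image $x$ under assignment $f$), the slack variables $\xi_n$, and the constant $C$ are all assumed notation.

```latex
\min_{w,\; \xi \ge 0} \quad \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{n} \xi_n
\qquad \text{s.t.} \qquad
\begin{cases}
w^\top \Phi(x_n, f_n) \;\ge\; 1 - \xi_n & \text{for positive examples } (x_n, f_n), \\[4pt]
w^\top \Phi(x_n, f) \;\le\; -1 + \xi_n \quad \forall f & \text{for negative examples } x_n .
\end{cases}
```

The objective is convex in $w$ and $\xi$; the constraints place positive examples on one side of the margin and every assignment in a negative image on the other.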

Person deformable part model. Quadratic spatial configuration model. Root filter (8x8 resolution). Part filter (4x4 resolution). [Felzenszwalb et al., ’09]

[Felzenszwalb et al., ’09]

[Ramanan et al,’09]

Outline I. Recognizing single object classes (Jon) II. Scene understanding with multiple classes (Tomasz)

Part II: Scene Understanding with Multiple Classes. Goal: Predict Many Different Objects in a Single Image (image labels: Car, Tree, Building, Fire Hydrant, Fence, Sidewalk)

Wait... • What’s wrong with just learning a different sliding window classifier for each object type in the world?

The image as seen from an object detector’s point of view

Relationships between objects make recognition possible. Antonio Torralba. The Context Challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html

Objects as the “Parts” of a Scene (Scene Model vs. Deformable Part Model). Key Challenge in Scene Understanding: Modeling relationships between objects from different categories

Fixed Extent “Things” vs Free-form “Stuff” (image labels: Tree, Building, Car, Fence, Fire Hydrant, Sidewalk). Things have a well-defined shape. A part of a car is not a car. Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.

3 Types of Scene Models: Pixel-based, Window-based, Segment-based

Pixel-based Scene Understanding • Produces segmentation • Works well on “stuff” • Unable to reason about instances • Only limited notion of context. TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006

Pixel-wise Conditional Random Fields (TextonBoost) • Inference: y^* = argmax_y p(y|x) • Training: Use boosting to learn unary potential • Future Direction: Higher-Order Cliques. TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006
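
A toy sketch of approximate MAP inference in a pixel-wise CRF, assuming NumPy, a 4-connected grid, and a Potts smoothness term. Iterated conditional modes (ICM) is swapped in here as a simple local-search optimizer; it is not claimed to be the inference procedure of the cited paper. The unary costs stand in for negative log-probabilities from a learned per-pixel classifier, and all names and values are illustrative.

```python
import numpy as np

def icm(unary, smoothness=1.0, n_sweeps=5):
    """Approximately minimize unary costs plus a Potts penalty between 4-neighbors.

    unary: (H, W, K) array of per-pixel, per-label costs.
    Returns an (H, W) array of labels. Minimizing this energy corresponds to
    maximizing p(y|x) when p is proportional to exp(-energy).
    """
    h, w, k = unary.shape
    labels = unary.argmin(axis=2)                 # start from independent per-pixel decisions
    for _ in range(n_sweeps):
        for r in range(h):
            for c in range(w):
                cost = unary[r, c].copy()
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        # pay `smoothness` for every label that disagrees with this neighbor
                        cost += smoothness * (np.arange(k) != labels[rr, cc])
                labels[r, c] = cost.argmin()
    return labels

# 5 x 5 grid, 2 labels: every pixel prefers label 0 except a noisy center pixel.
unary = np.zeros((5, 5, 2))
unary[..., 1] = 1.0
unary[2, 2] = [1.0, 0.0]

independent = unary.argmin(axis=2)
smoothed = icm(unary)
print(independent[2, 2], smoothed[2, 2])
```

The pairwise term overrules the isolated noisy pixel, which is the "limited notion of context" a pixel-wise model can provide.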

Window-based Scene Understanding • Possible to model interactions between object instances. • Often not possible to model “stuff” using windows. • Window assumption also questionable for some “things.” Discriminative models for multi-class object layout. Desai et al. ICCV 2009. Object Recognition by Scene Alignment. Russell et al. NIPS 2007.

Discriminative models for multi-class object layout • Inference via Greedy Forward Search • Training
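
A rough sketch of greedy forward search over candidate windows, in plain Python. It assumes the layout score is a sum of per-window scores plus symmetric pairwise interaction scores between selected windows; the function name and the toy numbers are hypothetical, and the details of the cited model may differ.

```python
def greedy_forward_search(unary, pairwise):
    """Greedily turn on detections while doing so increases the total score.

    unary[i]       : score for selecting window i on its own
    pairwise[i][j] : symmetric interaction score if windows i and j are both selected
    """
    n = len(unary)
    selected = []
    gain = list(unary)                       # gain of adding each window to the current set
    while True:
        candidates = [i for i in range(n) if i not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda i: gain[i])
        if gain[best] <= 0:                  # no remaining window improves the score
            break
        selected.append(best)
        for i in candidates:
            if i != best:
                gain[i] += pairwise[best][i]
    return selected

# Toy example: windows 0 and 1 overlap heavily (mutually exclusive),
# windows 0 and 2 support each other, windows 1 and 2 do not interact.
unary = [2.0, 1.5, 0.5]
pairwise = [
    [0.0, -3.0, 1.0],
    [-3.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
selected = greedy_forward_search(unary, pairwise)
print(selected)
```

The search is not guaranteed to find the best-scoring subset, but each step only needs to update the gains of the remaining windows.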

Window-based results

Region-Based Scene Understanding • Use a segmentation algorithm to extract stable regions • Use a CRF to label those segments • Problem: Hard to get object-segments. • Problem: Inference difficult for fully connected models.

Region-Based CRF Spatial Relations • Training: Bag of Words with Nearest Neighbor classifier • Maximum Likelihood training of pairwise potentials. Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.
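
A minimal sketch of a nearest-neighbor classifier on bag-of-words histograms, assuming NumPy. The L1 distance between normalized histograms, the three-word vocabulary, and the toy training regions are assumptions for illustration; the slide does not specify the distance or the data.

```python
import numpy as np

def nearest_neighbor_label(query_hist, train_hists, train_labels):
    """Label a region with the label of its closest training histogram."""
    q = query_hist / query_hist.sum()
    t = train_hists / train_hists.sum(axis=1, keepdims=True)
    distances = np.abs(t - q).sum(axis=1)    # L1 distance between normalized histograms
    return train_labels[int(distances.argmin())]

# Hypothetical visual-word counts for three labeled training regions.
train_hists = np.array([[8.0, 1.0, 1.0],
                        [1.0, 8.0, 1.0],
                        [1.0, 1.0, 8.0]])
train_labels = ["sky", "road", "car"]

label = nearest_neighbor_label(np.array([2.0, 9.0, 1.0]), train_hists, train_labels)
print(label)
```

In a region-based CRF, scores like these distances would supply the appearance evidence for each segment, with the pairwise potentials handling relations between segments.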
