Learning Spatial Context: Can stuff help us find things?

Geremy Heitz, Daphne Koller • April 14, 2008 • DAGS • Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, with no specific or distinctive spatial extent or shape. • Thing (n): An object with a specific size and shape.

Outline • Sliding window object detection • What is context? • The Things and Stuff (TAS) model • Results

Object Detection • Task: Find all the cars in this image. • Return a “bounding box” for each • Evaluation: • Maximize true positives • Minimize false positives • Precision-Recall tradeoff

Sliding Window Detection • Consider every bounding box • All shifts • All scales • Possibly all rotations • Each box gets a score: D(x,y,s,Θ) • Detections: Local peaks in D() • Pros: • Covers the entire image • Flexible to allow variety of D()’s • Cons: • Brute force – can be slow • Only considers features in box • Example scores from the figure: D = 1.5, D = -0.3
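The exhaustive scoring step above can be sketched as follows. This is a minimal illustration, not the detector used in the talk: the function names, the base window size, the stride, and the toy score function `mean_minus_half` are all assumptions, rotations (Θ) are omitted, and a threshold stands in for the "local peaks in D()" step.

```python
def sliding_window_detect(image, score_fn, base=(32, 32), scales=(1.0, 2.0),
                          stride=8, thresh=0.0):
    """Score every box over all shifts and scales; keep boxes with D > thresh.

    `image` is a list of rows (grayscale). Returns (x, y, w, h, D) tuples.
    """
    height, width = len(image), len(image[0])
    detections = []
    for s in scales:
        h, w = int(base[0] * s), int(base[1] * s)
        step = max(1, int(stride * s))  # coarser shifts for larger boxes
        for y in range(0, height - h + 1, step):
            for x in range(0, width - w + 1, step):
                patch = [row[x:x + w] for row in image[y:y + h]]
                d = score_fn(patch)  # D(x, y, s): uses only pixels in the box
                if d > thresh:
                    detections.append((x, y, w, h, d))
    return detections


def mean_minus_half(patch):
    """Hypothetical stand-in for D(): mean intensity of the patch minus 0.5."""
    total = sum(sum(row) for row in patch)
    return total / (len(patch) * len(patch[0])) - 0.5
```

Because `score_fn` sees only the cropped patch, the sketch also makes the last "con" concrete: nothing outside the box can influence the score.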

Features: Haar wavelets Haar filters and integral image Viola and Jones, ICCV 2001 The average intensity in the block is computed with four sums independently of the block size. BOOSTING!
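The integral image trick can be illustrated with a short sketch. The function names and the particular two-rectangle feature are illustrative assumptions, not code from Viola and Jones; the point is that any block sum needs only four table lookups, whatever the block size.

```python
def integral_image(img):
    """Return table ii with ii[y][x] = sum of img over rows < y and cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h block with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect(ii, x, y, w, h):
    """Left-half sum minus right-half sum (one simple Haar-like feature)."""
    half = w // 2
    return box_sum(ii, x, y, half, h) - box_sum(ii, x + half, y, half, h)
```

Dividing `box_sum` by `w * h` gives the average intensity mentioned on the slide.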

Features: Edge fragments Opelt, Pinz, Zisserman, ECCV 2006 Weak detector = Match of edge chain(s) from training image to edgemap of test image BOOSTING!

Histograms of oriented gradients • SIFT, D. Lowe, ICCV 1999 • Dalal & Triggs, 2006 SVM!

Sliding Window Results • PASCAL Visual Object Classes Challenge, Cows 2006 • Recall = TP / (TP + FN) • Precision = TP / (TP + FP) • score(A,B) = |A∩B| / |A∪B|, for ground-truth box A and detected box B • True Pos: B s.t. score(A,B) > 0.5 for some A • False Pos: B s.t. score(A,B) < 0.5 for all A • False Neg: A s.t. score(A,B) < 0.5 for all B
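The overlap score and the resulting counts can be written out directly. This sketch follows the definitions on the slide; the function names and the `(x1, y1, x2, y2)` box format are assumptions, and a score of exactly 0.5, which the slide leaves undefined, is treated here as a non-match.

```python
def overlap_score(a, b):
    """score(A, B) = |A ∩ B| / |A ∪ B| for boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(truth, detections, thresh=0.5):
    """Count TP/FP over detections B and FN over ground-truth boxes A."""
    tp = sum(1 for b in detections
             if any(overlap_score(a, b) > thresh for a in truth))
    fp = len(detections) - tp
    fn = sum(1 for a in truth
             if all(overlap_score(a, b) <= thresh for b in detections))
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Under these definitions, several detections that overlap the same ground-truth box all count as true positives; a stricter protocol would count the duplicates as false positives.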

Satellite Detection Example

Quantitative Evaluation [Plot: recall rate vs. false positives per image]

Why does this suck? True Positives in Context False Positives in Context False Positives out of Context Context!

What is Context?

What is Context? gist car “likely” keyboard “unlikely” • Thing-Thing: • Scene-Thing: • Stuff-Stuff: Torralba et al., 2005; Gould et al., 2008; Rabinovich et al., 2007

What is Context? • Stuff-Thing: • Based on intuitive “relationships” Road = cars here Trees = no cars Houses = cars nearby

Things • Candidate detections • Bounding Box + Score • Boolean R.V. Ti • Ti = 1: Candidate is a positive detection • Thing-only model: Image Window Wi, Ti

Stuff • Coherent image regions • Coarse “superpixels” • Feature vector Fj in Rn • Cluster label Sj • Stuff-only model • Naïve Bayes Sj Fj

Relationships S72 = Trees S10 = Road S4 = Houses • Descriptive Relations • “Near”, “Above”, “In front of”, etc. • Choose a set R • Rij: Relation between detection i and region j • Relationship model Sj Ti Rij

The TAS Model • Wi: Window • Ti: Object Presence • Sj: Region Label • Fj: Region Features • Rij: Relationship • Plate diagram nodes: Image Window Wi, Ti, Rij, Sj, Fj; plates N and J
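As a reading aid, the plate diagram can be written as a factorization. This assumes the arrows run Wi → Ti, Sj → Fj, and (Ti, Sj) → Rij, which matches the thing-only, stuff-only, and relationship models on the preceding slides, with N candidate windows and J regions; it is a reconstruction from the variable list, not an equation taken from the slides.

```latex
P(T, S, F, R \mid W)
  = \prod_{i=1}^{N} P(T_i \mid W_i)
    \prod_{j=1}^{J} P(S_j)\, P(F_j \mid S_j)
    \prod_{i=1}^{N} \prod_{j=1}^{J} P(R_{ij} \mid T_i, S_j)
```

Each factor is a local conditional distribution, which is what makes the learned parameters directly interpretable.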

Unrolled Model R11 = “Left” S1 T1 R21 = “Above” S2 R31 = “Left” T2 S3 R13 = “In” S4 T3 R33 = “In” S5 Candidate Windows Image Regions

Learning • Everything observed except Sj’s • Expectation-Maximization • Mostly discrete variables • Like Mixture-of-Gaussians • An ode to directed models: Oh directed probabilistic models You are so beautiful and palatable Because unlike your undirected friends Your parameters are so very interpretable - Unknown Russian Mathematician (Translated by Geremy Heitz)

Learned Satellite Clusters

Inference • Goal: • Gibbs Sampling • Easy to sample Ti’s given Sj’s and vice versa • Could do distributional particles
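The alternating sampling step can be sketched on a toy problem. This is an illustration under stated assumptions, not the implementation from the talk: it assumes the directed structure Wi → Ti, Sj → Fj, (Ti, Sj) → Rij, estimates posterior marginals of each Ti, and uses hypothetical parameter tables (`det_prob`, `prior_s`, `lik_f`, `p_rel`) that would come from the detector and from learning.

```python
import random


def sample_index(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if u < acc:
            return k
    return len(weights) - 1


def gibbs_tas(det_prob, prior_s, lik_f, rel, p_rel,
              sweeps=500, burn_in=100, seed=0):
    """Estimate P(T_i = 1 | evidence) by Gibbs sampling (all names assumed).

    det_prob[i]     : P(T_i = 1 | W_i), the detector's prior
    prior_s[s]      : P(S_j = s)
    lik_f[j][s]     : P(F_j | S_j = s), evaluated at the observed features
    rel[i][j]       : index of the observed relationship R_ij
    p_rel[t][s][r]  : P(R_ij = r | T_i = t, S_j = s)
    """
    rng = random.Random(seed)
    n, m, k = len(det_prob), len(lik_f), len(prior_s)
    T = [1 if rng.random() < p else 0 for p in det_prob]
    S = [rng.randrange(k) for _ in range(m)]
    counts = [0] * n
    for sweep in range(sweeps):
        # Sample each T_i given all S_j: detector prior times relation terms.
        for i in range(n):
            weights = []
            for t in (0, 1):
                p = det_prob[i] if t == 1 else 1.0 - det_prob[i]
                for j in range(m):
                    p *= p_rel[t][S[j]][rel[i][j]]
                weights.append(p)
            T[i] = sample_index(weights, rng)
        # Sample each S_j given all T_i: prior, feature likelihood, relations.
        for j in range(m):
            weights = []
            for s in range(k):
                p = prior_s[s] * lik_f[j][s]
                for i in range(n):
                    p *= p_rel[T[i]][s][rel[i][j]]
                weights.append(p)
            S[j] = sample_index(weights, rng)
        if sweep >= burn_in:
            for i in range(n):
                counts[i] += T[i]
    return [c / (sweeps - burn_in) for c in counts]
```

In a toy setup with one region whose features strongly suggest "road", a candidate lying in that region should end up with a higher posterior than one outside it, even when both start from the same detector score.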

Results - Satellite • Posterior: TAS Model • Prior: Detector Only • Region Labels

Results - Satellite

PASCAL VOC Challenge • 2005 Challenge • 2232 images split into {train, val, test} • Cars, Bikes, People, and Motorbikes • 2006 • 5304 images split into {train, test} • 12 classes, we use Cows and Sheep • Results reported for challenge with state-of-the-art approaches • Caveat: They didn’t get to see the test set before the challenge, but I did!

Results – PASCAL Cows

Results – PASCAL Bicycles Cluster #3

Results – PASCAL • Good examples • Discover “true positives” • Remove “false positives”

Results – VOC 2005 Car Motorbike Bicycle People

Results – VOC 2006 Sheep Cow

Conclusions • Detectors can benefit from context • The TAS model captures an important type of context • We can improve any sliding window detector using TAS • The TAS model can be interpreted and matches our intuitions • Geremy is smart

Detections in Context • Task: Identify all cars in the satellite image • Idea: The surrounding context adds info to the local window detector • Figure panels: Region Labels (Houses, Road); Prior: Detector Only; Posterior: TAS Model

Equations
