
Integrating Vision Models for Holistic Scene Understanding


Presentation Transcript


  1. Integrating Vision Models for Holistic Scene Understanding Geremy Heitz CS223B March 4th, 2009

  2. Scene/Image Understanding What’s happening in these pictures?

  3. Human View of a “Scene” BUILDING PEOPLE WALKING BUS CAR ROAD “A car passes a bus on the road, while people walk past a building.”

  4. Computer View of a “Scene” [Figure labels: BUILDING, ROAD, STREET SCENE] Can we integrate all of these subtasks, so that the whole > the sum of the parts?

  5. Outline • Overview • Integrating Vision Models • CCM: Cascaded Classification Models • Learning Spatial Context • TAS: Things and Stuff • Future Directions [Heitz et al. NIPS 2008a] [Heitz & Koller ECCV 2008]

  6. Image/Scene Understanding • Primitives (objects, parts, surfaces, regions): established techniques address these in isolation, reasoning over image statistics • Interactions (context, actions): a complex web of relations, well represented by graphical models • Scene Descriptions: reasoning over more abstract entities, e.g., “a man and a dog are walking on a sidewalk in front of a building” (image entities: man, building, cigarette, backpack, dog, sidewalk)

  7. Why will integration help? What is this object?

  8. More Context Context is key!

  9. Outline • Overview • Integrating Vision Models • CCM: Cascaded Classification Models • Learning Spatial Context • TAS: Things and Stuff • Future Directions [Heitz et al. NIPS 2008a]

  10. Human View of a “Scene” [Figure labels: BUILDING, PEOPLE WALKING, BUS, CAR, ROAD] • Scene Categorization • Object Detection • Region Labeling • Depth Reconstruction • Surface Orientations • Boundary/Edge Detection • Outlining/Refined Localization • Occlusion Reasoning • ...

  11. Related Work • Intrinsic Images: [Barrow and Tenenbaum, 1978], [Tappen et al., 2005] • Hoiem et al., “Closing the Loop in Scene Interpretation”, 2008 • We want to focus more on “semantic” classes • We want to be flexible to using outside models • We want an extendable framework, not one engineered for a particular set of tasks

  12. How Should we Integrate? • Single joint model over all variables • Pros: tighter interactions, more designer control • Cons: need expertise in each of the subtasks • Simple, flexible combination of existing models • Pros: state-of-the-art component models, easier to extend • Cons: the limited “black-box” interface to components misses some of the modeling power (DETECTION: Dalal & Triggs, 2005; DEPTH RECONSTRUCTION: Saxena et al., 2007; REGION LABELING: Gould et al., 2007)

  13. Cascaded Classification Models — image features (fDET, fREG, fREC) feed independent models (DET0, REG0, REC0), whose outputs feed context-aware models (DET1, REG1, REC1) for object detection, region labeling, and 3D reconstruction.
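
A minimal sketch of this cascade in Python, assuming each task model is a plain scoring function; the toy linear models, feature shapes, and task names ("det", "reg", "rec") are illustrative, not the authors' implementation:

```python
import numpy as np

# Minimal CCM cascade sketch: stage-0 models see only their own image
# features; stage-1 models additionally see the other tasks' stage-0 outputs.
def run_ccm(feats, stage0_models, stage1_models):
    tasks = list(stage0_models)
    # Stage 0: independent predictions from image features alone.
    y0 = {t: stage0_models[t](feats[t]) for t in tasks}
    # Stage 1: re-predict each task with the other tasks' outputs appended.
    y1 = {}
    for t in tasks:
        context = np.concatenate([y0[u] for u in tasks if u != t])
        y1[t] = stage1_models[t](np.concatenate([feats[t], context]))
    return y0, y1

# Toy usage: random linear scorers stand in for the real DET/REG/REC models.
rng = np.random.default_rng(0)
feats = {t: rng.normal(size=10) for t in ("det", "reg", "rec")}
s0 = {t: (lambda x, w=rng.normal(size=10): np.atleast_1d(x @ w)) for t in feats}
s1 = {t: (lambda x, w=rng.normal(size=12): np.atleast_1d(x @ w)) for t in feats}
print(run_ccm(feats, s0, s1))
```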

  14. Integrated Model for Scene Understanding • Object Detection • Multi-class Segmentation • Depth Reconstruction • Scene Categorization — I’ll show you these

  15. Basic Object Detection — score each candidate detection window W against the object classes (Car, Person, Motorcycle, Boat, Sheep, Cow); declare a detection when Score(W) > 0.5

  16. Base Detector – HOG • HOG detector [Dalal & Triggs, CVPR 2005]: compute a feature vector X over the window and score it with a linear SVM classifier
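
A hedged sketch of a HOG-window scorer along these lines, using scikit-image's hog and a linear SVM; the window size, HOG parameters, and random toy data are illustrative, not the detector's actual configuration:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Toy windows and labels; a real detector trains on cropped positive/negative
# image windows at a fixed size.
rng = np.random.default_rng(0)
windows = rng.random((40, 64, 64))        # 40 grayscale 64x64 windows
labels = rng.integers(0, 2, size=40)

# Feature vector X: histogram-of-oriented-gradients per window.
X = np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2)) for w in windows])

# Linear SVM classifier; decision_function gives the window score D(W).
clf = LinearSVC().fit(X, labels)
print(clf.decision_function(X[:3]))
```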

  17. Context-Aware Object Detection • From base detector: log score D(W) • From scene category: MAP category, marginals (e.g., scene type: urban) • From region labels: how much of each label is in a window adjacent to W (e.g., % of “road” below W) • From depths: mean and variance of depths in W, estimate of “true” object size • Final classifier: P(Y) = Logistic(Φ(W))
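
A sketch of that final classifier, assuming the context features have already been computed; the feature layout, weights, and helper names here are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assemble Phi(W) from the base score plus scene, region, and depth context,
# then score it with a logistic model: P(Y = 1 | W) = Logistic(w . Phi(W) + b).
def context_aware_score(base_log_score, scene_marginals, region_fracs,
                        depths, w, b):
    phi = np.concatenate([
        [base_log_score],                  # D(W) from the base detector
        scene_marginals,                   # scene-category marginals
        region_fracs,                      # fraction of each label near W
        [depths.mean(), depths.var()],     # depth statistics inside W
    ])
    return sigmoid(w @ phi + b)

# Toy call: 3 scene categories, 4 region labels => 1 + 3 + 4 + 2 = 10 features.
rng = np.random.default_rng(0)
p = context_aware_score(1.7, np.array([0.8, 0.1, 0.1]),
                        np.array([0.5, 0.2, 0.2, 0.1]),
                        rng.normal(15.0, 2.0, size=50),
                        rng.normal(size=10), 0.0)
print(p)
```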

  18. Multi-class Segmentation CRF Model • Label each pixel as one of {‘grass’, ‘road’, ‘sky’, etc.} • Conditional random field (CRF) over superpixels • Singleton potentials: log-linear function of boosted detector scores for each class • Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image [Gould et al., IJCV 2007]
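
A minimal sketch of such a CRF's energy, with toy unary scores and a pairwise class-affinity table standing in for the learned, location-conditioned potentials:

```python
import numpy as np

# Energy of a superpixel labeling: singleton terms (negative log-linear
# classifier scores) plus pairwise terms on adjacent superpixel pairs.
def crf_energy(labels, unary, pairwise, edges):
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    e += sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return e

# Toy instance: 4 superpixels, 3 classes ('grass', 'road', 'sky'),
# chain adjacency.
rng = np.random.default_rng(0)
unary = rng.random((4, 3))
pairwise = rng.random((3, 3))
print(crf_energy([0, 0, 1, 2], unary, pairwise, [(0, 1), (1, 2), (2, 3)]))
```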

  19. Context-Aware Multi-class Seg. Where is the grass? Additional feature: relative location map

  20. Depth Reconstruction CRF • Label each pixel with its distance from the camera • Conditional random field (CRF) over superpixels with continuous variables • Models depth as a linear function of features, with pairwise smoothness constraints [Saxena et al., PAMI 2008] http://make3d.stanford.edu

  21. Depth Reconstruction with Context • Find d* with the BLACK BOX depth model • Context constraints: SKY is far away; GRASS is horizontal • Reoptimize depths with the new constraints: dCCM = argmin_d α‖d − d*‖ + β‖d − dCONTEXT‖
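
If the two penalties are taken (as assumed here) to be squared L2 norms, the reoptimization has a closed form: a weighted average of the black-box depths and the context-implied depths.

```python
import numpy as np

# Assuming squared norms, alpha*||d - d*||^2 + beta*||d - d_ctx||^2 is
# minimized coordinate-wise by the weighted average below.
def blend_depths(d_star, d_context, alpha, beta):
    return (alpha * d_star + beta * d_context) / (alpha + beta)

d_star = np.array([10.0, 12.0, 80.0])      # black-box depth estimates
d_ctx = np.array([9.0, 12.5, 200.0])       # e.g., "sky is far away"
print(blend_depths(d_star, d_ctx, alpha=1.0, beta=0.5))
```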

  22. Training • Notation: I = image; f = image features (fD, fS, fZ); Ŷ = output labels (ŶD, ŶS, ŶZ) • Training regimes • Independent: each model trained alone on its own image features • Ground: context-aware models trained with ground-truth inputs (ŶS*, ŶZ*)

  23. Training • CCM training regime: later models can learn to ignore the mistakes of previous models • Training realistically emulates the testing setup • Allows disjoint datasets • K-CCM: a CCM with K levels of classifiers (a training sketch follows)
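
A sketch of that regime using cross-validated stage-0 predictions (as in stacking) so that stage 1 trains on realistic, imperfect context; the toy features and the use of logistic models are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # toy per-window features
y = rng.integers(0, 2, size=200)         # toy detection labels

# Stage 0: held-out predictions emulate the mistakes stage 1 will see at
# test time (training on ground truth instead would be the "Ground" regime).
p0 = cross_val_predict(LogisticRegression(), X, y, cv=5,
                       method="predict_proba")

# Stage 1: original features plus stage-0 context (here the same task's
# probabilities stand in for the other tasks' outputs).
stage1 = LogisticRegression().fit(np.hstack([X, p0]), y)
```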

  24. Experiments • DS1 • 422 Images, fully labeled • Categorization, Detection, Multi-class Segmentation • 5-fold cross validation • DS2 • 1745 Images, disjoint labels • Detection, Multi-class Segmentation, 3D Reconstruction • 997 Train, 748 Test

  25. CCM Results – DS1 [Charts: scene categories; pedestrian, car, boat, and motorbike detection; region labels]

  26. CCM Results – DS2 [Chart: boat detection]

  27. Example Results [Images: independent vs. CCM outputs]

  28. Example Results [Images: independent vs. CCM objects and regions]

  29. Understanding the man “a man, a dog, a sidewalk, a building”

  30. Outline • Overview • Integrating Vision Models • CCM: Cascaded Classification Models • Learning Spatial Context • TAS: Things and Stuff • Future Directions [Heitz & Koller ECCV 2008]

  31. Things vs. Stuff From: Forsyth et al. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996. Thing (n): an object with a specific size and shape. (DETECTIONS) Stuff (n): material defined by a homogeneous or repetitive pattern of fine-scale properties, but with no specific or distinctive spatial extent or shape. (REGIONS)

  32. Cascaded Classification Models (recap) — image features (fDET, fREG, fREC) feed independent models (DET0, REG0, REC0), whose outputs feed context-aware models (DET1, REG1, REC1) for object detection, region labeling, and 3D reconstruction.

  33. CCMs vs. TAS • CCM is feedforward: features fDET, fREG feed DET0, REG0, whose outputs feed DET1, REG1 • TAS models detections (DET) and regions (REG) jointly, linked by explicit relationship variables

  34. Satellite Detection Example [Image: candidate detections marked TRUE POSITIVE and FALSE POSITIVE]

  35. Stuff-Thing Context • Based on spatial relationships • Intuition: road = cars here; trees = no cars; houses = cars nearby • “Cars drive on roads”, “Cows graze on grass”, “Boats sail on water” • Goal: unsupervised

  36. Things • Detection Ti ∈ {0,1} • Ti = 1: candidate image window Wi contains a positive detection • P(Ti) = Logistic(score(Wi))

  37. Stuff • Coherent image regions: coarse “superpixels” • Feature vector Fj in Rn • Cluster label Sj in {1…C} • Stuff model: naïve Bayes over (Sj, Fj)
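
A sketch of the stuff side as clustering region features with a diagonal-covariance Gaussian mixture (the naive-Bayes assumption: features conditionally independent given the cluster); the feature dimensionality and cluster count are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 8))     # toy region feature vectors F_j in R^8

# Diagonal covariances make each feature conditionally independent given the
# cluster label S_j, i.e., a naive-Bayes-style generative model.
stuff = GaussianMixture(n_components=5, covariance_type="diag",
                        random_state=0).fit(F)
S = stuff.predict(F)              # hard cluster labels S_j in {1..C}
print(np.bincount(S))
```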

  38. Relationships • Descriptive relations: “near”, “above”, “in front of”, etc. • Choose a set R = {r1…rK} • Rijk = 1: detection i and region j have relation k • Relationship model example: R1,10,in = 1 when detection T1 lies in region S10 = Road (other clusters: S72 = Trees, S4 = Houses)

  39. Unrolled Model [Graphical model: candidate windows T1–T3 linked to image regions S1–S5 by relation variables, e.g., R1,1,left = 1, R2,1,above = 0, R3,1,left = 1, R1,3,near = 0, R3,3,in = 1]

  40. Learning the Parameters • Assume we know R • Sj is hidden (always hidden); everything else is observed: Fj and Rijk always, Ti supervised in the training set • Expectation-Maximization: “contextual clustering” • Parameters are readily interpretable
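
A minimal EM loop for the hidden cluster labels, shown here as plain Gaussian-mixture EM over region features to illustrate the shape of the "contextual clustering" iteration; the real model also couples S to the detections T through the relations R, which this toy version omits:

```python
import numpy as np

def em_clusters(F, C=3, iters=25, seed=0):
    rng = np.random.default_rng(seed)
    N, _ = F.shape
    mu = F[rng.choice(N, size=C, replace=False)]   # init means from the data
    pi = np.full(C, 1.0 / C)
    for _ in range(iters):
        # E-step: responsibilities under unit-variance Gaussian components.
        d2 = ((F[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        logp = np.log(pi) - 0.5 * d2
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: refit mixing weights and means from soft assignments.
        pi = r.mean(axis=0)
        mu = (r.T @ F) / r.sum(axis=0)[:, None]
    return pi, mu, r

rng = np.random.default_rng(1)
F = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in (-4, 0, 4)])
pi, mu, r = em_clusters(F)
print(pi, mu, sep="\n")
```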

  41. Which Relationships to Use? • Rijk = spatial relationship between candidate i and region j, e.g.: • Rij1 = candidate in region • Rij2 = candidate closer than 2 bounding boxes (BBs) to region • Rij3 = candidate closer than 4 BBs to region • Rij4 = candidate farther than 8 BBs from region • Rij5 = candidate 2 BBs left of region • Rij6 = candidate 2 BBs right of region • Rij7 = candidate 2 BBs below region • Rij8 = candidate more than 2 and less than 4 BBs from region • … • RijK = candidate near region boundary • How do we avoid overfitting? (A sketch of computing such relations follows.)
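
A sketch of computing a few such relations, measuring offsets in multiples of the candidate's bounding-box size; the exact thresholds and the image-coordinate convention (y grows downward) are assumptions:

```python
# Candidate window as (x, y, width, height); region summarized by its centroid.
def spatial_relations(window, region_centroid):
    x, y, w, h = window
    cx, cy = x + w / 2.0, y + h / 2.0
    rx, ry = region_centroid
    dx = (rx - cx) / w                 # horizontal offset in BB units
    dy = (ry - cy) / h                 # vertical offset (positive = region below)
    dist = (dx ** 2 + dy ** 2) ** 0.5
    return {
        "in_region": abs(dx) < 0.5 and abs(dy) < 0.5,
        "closer_than_2BB": dist < 2.0,
        "closer_than_4BB": dist < 4.0,
        "farther_than_8BB": dist > 8.0,
        "cand_2BB_left_of_region": dx > 2.0,   # region lies to the right
        "cand_2BB_right_of_region": dx < -2.0, # region lies to the left
        "cand_2BB_below_region": dy < -2.0,    # region lies above
    }

print(spatial_relations((100, 100, 40, 40), (300, 110)))
```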

  42. Learning the TAS Relations • Intuition: a “detached” Rijk is an inactive relationship • Structural EM iterates: learn parameters, then decide which edge to toggle • Evaluate candidate structures with ℓ(T | F, W, R), which requires inference • Better results than using the standard E[ℓ(T, S, F, W, R)]
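
A sketch of the edge-selection loop: greedily toggle one candidate relation at a time, keeping a change only when the score improves; score_model is a hypothetical stand-in for relearning parameters and running inference to evaluate ℓ(T | F, W, R):

```python
# Greedy structural search over which relation types are active.
def learn_relations(candidate_edges, score_model):
    active = frozenset()
    best = score_model(active)
    improved = True
    while improved:
        improved = False
        for e in candidate_edges:
            trial = active ^ {e}               # toggle one edge on/off
            s = score_model(trial)
            if s > best:
                active, best, improved = trial, s, True
    return active

# Toy scorer: pretend relations 'in' and 'near' help, others slightly hurt.
useful = {"in", "near"}
score = lambda edges: sum(+1.0 if e in useful else -0.1 for e in edges)
print(learn_relations(["in", "near", "left", "right"], score))
```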

  43. Inference • Goal: compute the posterior over detections, P(T | F, W, R) • Block Gibbs sampling • Easy to sample the Ti’s given the Sj’s, and vice versa
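
A sketch of that sampler under an assumed simple coupling (each relation-linked region's cluster adds a learned bonus to a detection's log-odds); the potentials and shapes are illustrative, not the paper's exact model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Block Gibbs: given S, the T_i are independent; given T, the S_j are
# independent, so each block is resampled in one vectorized pass.
def block_gibbs(det_scores, region_logits, coupling, R, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_det, n_reg = R.shape
    C = region_logits.shape[1]
    S = rng.integers(0, C, size=n_reg)
    for _ in range(iters):
        # Sample T | S: base score plus bonuses from related regions' clusters.
        p = sigmoid(det_scores + R @ coupling[S])
        T = (rng.random(n_det) < p).astype(float)
        # Sample S | T: region evidence plus bonuses from active detections.
        ctx = (R * T[:, None]).sum(axis=0)           # active detections near j
        logits = region_logits + ctx[:, None] * coupling[None, :]
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        S = np.array([rng.choice(C, p=probs[j]) for j in range(n_reg)])
    return T, S

rng = np.random.default_rng(1)
R = (rng.random((6, 4)) < 0.4).astype(float)     # binary "near" relations
T, S = block_gibbs(rng.normal(size=6), rng.normal(size=(4, 3)),
                   np.array([1.5, 0.0, -1.5]), R)
print(T, S)
```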

  44. Learned Satellite Clusters

  45. Results – Satellite [Images: prior (detector only) vs. posterior detections and posterior region labels]

  46. Discovered Context – Bicycles [Examples: bicycles and stuff cluster #3]

  47. TAS Results – Bicycles • Examples: discover “true positives”, remove “false positives” [Images: candidate BIKE windows]

  48. Results – VOC 2005 [Precision-recall curves: TAS vs. base detector]

  49. Understanding the man “a man and a dog on a sidewalk, in front of a building”

  50. Outline • Overview • Integrating Vision Models • CCM: Cascaded Classification Models • Learning Spatial Context • TAS: Things and Stuff • Future Directions
