3D Scene Models 6.870 Object recognition and scene understanding Krista Ehinger. Questions. What makes a good 3D scene model? How accurate does it need to be? How far can you get with automatic surface detection? Where do you need human input?. Modelling the scene.
3D Scene Models 6.870 Object recognition and scene understanding Krista Ehinger
Questions • What makes a good 3D scene model? How accurate does it need to be? • How far can you get with automatic surface detection? Where do you need human input?
Modelling the scene • Real scenes have way too many surfaces
Modelling the scene • Option 1: Diorama world
Tour Into the Picture (TIP) • Model the scene as 5 planes + foreground objects • Easy implementation: planes/objects defined by humans Y. Horry, K.I. Anjyo and K. Arai. "Tour Into the Picture: Using a spidery mesh user interface to make animation from a single image". ACM SIGGRAPH 1997
TIP Implementation • User defines the vanishing point and the rear wall of the scene (inner rectangle) • Given some assumptions about the camera, the position and size of all planes can be computed... (Horry et al., 1997)
Defining the box • Define planes: Floor -> y=0, Ceiling -> y=H • Given the horizon (through the vanishing point), the 3D corners of the floor and ceiling can be computed from their 2D image positions (Horry et al., 1997)
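One way to picture this step is a ray-plane intersection. The sketch below is not the TIP authors' code: it assumes a pinhole camera at height `cam_height` looking along +z, with the principal point at the vanishing point `(u0, v0)` and a known focal length in pixels. All names and numbers are hypothetical.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject_to_plane(u: float, v: float, u0: float, v0: float,
                         focal: float, cam_height: float, plane_y: float) -> Point3D:
    """Intersect the viewing ray through pixel (u, v) with the plane y = plane_y.

    Assumed setup (not from the slides): camera at (0, cam_height, 0),
    image v grows downward, world y grows upward.
    """
    dx, dy, dz = u - u0, -(v - v0), focal
    if dy == 0:
        raise ValueError("pixel lies on the horizon; the ray never meets a horizontal plane")
    t = (plane_y - cam_height) / dy
    if t <= 0:
        raise ValueError("the plane is not in front of the camera along this ray")
    return (t * dx, plane_y, t * dz)


# A floor corner (below the horizon) and a ceiling corner (above it), with H = 3.
floor_corner = backproject_to_plane(100, 50, 0, 0, 100, 1.0, 0.0)
ceiling_corner = backproject_to_plane(-100, -50, 0, 0, 100, 1.0, 3.0)
print(floor_corner, ceiling_corner)
```

Pixels below the horizon can only land on the floor and pixels above it only on the ceiling, which is why the function rejects the other combinations.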
Defining the box • Once the positions of the planes are known, compute the texture of the planes (Horry et al., 1997)
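Texture extraction can be sketched as inverse mapping: for each texel on a plane, project its 3D position into the photo and copy the color found there. The code below is an illustration under the same assumed pinhole camera as a typical TIP setup (camera at height `cam_height`, principal point at the vanishing point), with nearest-neighbor sampling. The function names and the toy image are hypothetical.

```python
from typing import List, Optional, Tuple


def project(point: Tuple[float, float, float], u0: float, v0: float,
            focal: float, cam_height: float) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates (u to the right, v downward)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return u0 + focal * x / z, v0 - focal * (y - cam_height) / z


def sample_floor_texture(image: List[list], x_range: Tuple[float, float],
                         z_range: Tuple[float, float], size: int, u0: float,
                         v0: float, focal: float, cam_height: float) -> List[List[Optional[object]]]:
    """Fill a size x size floor texture by sampling image[row][col] at projected texel centers."""
    rows, cols = len(image), len(image[0])
    texture = []
    for i in range(size):
        z = z_range[0] + (i + 0.5) * (z_range[1] - z_range[0]) / size
        row = []
        for j in range(size):
            x = x_range[0] + (j + 0.5) * (x_range[1] - x_range[0]) / size
            u, v = project((x, 0.0, z), u0, v0, focal, cam_height)
            c, r = round(u), round(v)
            # Texels that project outside the photo are left empty (None).
            row.append(image[r][c] if 0 <= r < rows and 0 <= c < cols else None)
        texture.append(row)
    return texture


# Toy "photo" whose pixel value records its own (row, col) position.
photo = [[(r, c) for c in range(100)] for r in range(100)]
texture = sample_floor_texture(photo, (-0.2, 0.2), (2.0, 4.0), 2, 50, 0, 100, 1.0)
print(texture)
```

A real implementation would interpolate between pixels rather than round to the nearest one; nearest-neighbor keeps the sketch short.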
What about foreground objects? • Assume a quadrangle attached to the floor; compute its attachment points, then its upper points • Hierarchical model of foreground objects (Horry et al., 1997)
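The attachment-point idea can be sketched as follows: the bottom corners are located by intersecting their viewing rays with the floor, and the top edge is placed at the same depth. This is a simplified illustration, not the paper's procedure. It assumes a pinhole camera at height `cam_height` with the principal point at the vanishing point, a quadrangle standing vertically on the floor (y = 0), and a top pixel row `v_top` measured above the bottom-left corner; all names are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def floor_point(u: float, v: float, u0: float, v0: float,
                focal: float, cam_height: float) -> Point3D:
    """3D floor position (y = 0) seen at pixel (u, v); v must be below the horizon."""
    if v <= v0:
        raise ValueError("attachment points must lie below the horizon")
    z = focal * cam_height / (v - v0)
    return ((u - u0) * z / focal, 0.0, z)


def standing_quad(bottom_left: Tuple[float, float], bottom_right: Tuple[float, float],
                  v_top: float, u0: float, v0: float,
                  focal: float, cam_height: float) -> List[Point3D]:
    """Corners (bottom-left, bottom-right, top-right, top-left) of a vertical quadrangle."""
    bl = floor_point(*bottom_left, u0, v0, focal, cam_height)
    br = floor_point(*bottom_right, u0, v0, focal, cam_height)
    # The top-left corner shares the depth of the bottom-left attachment point,
    # so its height follows from its image row alone.
    height = cam_height - (v_top - v0) * bl[2] / focal
    return [bl, br, (br[0], height, br[2]), (bl[0], height, bl[2])]


quad = standing_quad((-50, 50), (50, 50), -25, 0, 0, 100, 1.0)
print(quad)
```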
Extracting foreground objects • Foreground objects are removed and added to a mask • Holes in the background are filled in using photo completion software (Horry et al., 1997)
TIP Discussion • Pros: accurate model (due to human input); deals with foreground objects and occlusions • Cons: requires human input (not automatic); model is too simple for many real-world scenes
Modelling the scene • Option 2: Pop-up book world
Automatic Photo Pop-Up • Three classes of surface: ground, sky, vertical • Not just a box: can model more kinds of scenes • Automatic classification, no labeling D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up", ACM SIGGRAPH 2005.
Photo Pop-Up Implementation • Pixels -> superpixels -> constellations • Automatic labeling of constellations as ground, vertical, or sky • Define angles of vertical planes (using attachment to ground) • Map textures to vertical planes (as in TIP) (Hoiem et al., 2005)
Superpixels, constellations • Superpixels are groups of neighboring pixels that have nearly the same color (Tao et al., 2001) • Superpixels are assigned to constellations according to how likely they are to share a label (ground, vertical, sky), based on the difference between their feature vectors
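A greedy grouping along these lines might look like the sketch below. It is an illustration only and may differ from the paper's exact procedure: constellations are seeded with randomly chosen superpixels, and each remaining superpixel joins the group whose members it is most likely to share a label with. The likelihood function here is a toy stand-in (a decaying function of a 1-D feature difference), not a learned classifier.

```python
import math
import random
from typing import Callable, List


def form_constellations(n_superpixels: int, n_constellations: int,
                        same_label_prob: Callable[[int, int], float],
                        seed: int = 0) -> List[List[int]]:
    """Greedily group superpixel indices into n_constellations groups."""
    rng = random.Random(seed)
    order = list(range(n_superpixels))
    rng.shuffle(order)
    # Seed each constellation with one randomly chosen superpixel.
    groups = [[sp] for sp in order[:n_constellations]]
    for sp in order[n_constellations:]:
        # Score each group by the mean pairwise same-label likelihood.
        scores = [sum(same_label_prob(sp, m) for m in g) / len(g) for g in groups]
        best = max(range(len(groups)), key=scores.__getitem__)
        groups[best].append(sp)
    return groups


# Toy 1-D features: two loose clusters of superpixels.
toy_features = [0.0, 0.1, 0.05, 5.0, 5.1, 4.9]


def toy_prob(i: int, j: int) -> float:
    return math.exp(-abs(toy_features[i] - toy_features[j]))


constellations = form_constellations(len(toy_features), 2, toy_prob)
print(constellations)
```

Because the seeds are random, different seeds give different groupings, which fits the idea of forming several sets of constellations per image.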
Feature vectors • Color features: RGB, hue, saturation • Texture features: difference of oriented Gaussians, textons • Location (absolute and percentile) • Number of superpixels in constellation • Line and intersection detectors • Not used: constellation shape (contiguous, number of sides), some texture features
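As a rough illustration of the color and location entries, the sketch below computes a small feature dictionary for one superpixel using only the Python standard library. The feature names, the choice of 10th and 90th percentiles, and the pixel representation are assumptions made for this example; texture and line features are omitted.

```python
import colorsys
from statistics import mean
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, Tuple[int, int, int]]  # (x, y, (r, g, b)), channels 0..255


def percentile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank style percentile of an already sorted list, q in [0, 1]."""
    return sorted_values[round(q * (len(sorted_values) - 1))]


def superpixel_features(pixels: List[Pixel], width: int, height: int) -> Dict[str, float]:
    """Mean color, hue/saturation, and normalized location of one superpixel."""
    r = mean(p[2][0] for p in pixels) / 255.0
    g = mean(p[2][1] for p in pixels) / 255.0
    b = mean(p[2][2] for p in pixels) / 255.0
    hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    xs = sorted(p[0] / width for p in pixels)
    ys = sorted(p[1] / height for p in pixels)
    return {
        "mean_r": r, "mean_g": g, "mean_b": b,
        "hue": hue, "saturation": saturation,
        "x_mean": mean(xs), "y_mean": mean(ys),
        "y_10th": percentile(ys, 0.1), "y_90th": percentile(ys, 0.9),
        "n_pixels": float(len(pixels)),
    }


red_patch = [(0, 0, (255, 0, 0)), (1, 0, (255, 0, 0)),
             (0, 1, (255, 0, 0)), (1, 1, (255, 0, 0))]
features = superpixel_features(red_patch, width=4, height=4)
print(features)
```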
Training process • For each of 82 labeled training images: compute superpixels, features, pairwise likelihoods; form a set of N constellations (N = 3 to 25), each labeled with ground truth; compute constellation features; compute constellation label and homogeneity likelihoods
Training process • AdaBoost weak classifiers learn to estimate whether two superpixels have the same label (based on their feature vectors) • Another set of AdaBoost weak classifiers learns the constellation label and homogeneity likelihoods (expressed as percent ground, vertical, sky, mixed) • Emphasis on classifying larger constellations
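To make the boosting step concrete, here is a minimal discrete AdaBoost with threshold stumps as the weak classifiers. It is a textbook sketch, not the variant used in the paper, whose weak learners and loss may differ. In the toy data, each row is an assumed feature difference between two superpixels and the label is +1 for "same label" and -1 otherwise.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, int, float, int]  # (alpha, feature index, threshold, polarity)


def stump_predict(row: List[float], f: int, thr: float, pol: int) -> int:
    """Predict pol when the feature is at or above the threshold, -pol otherwise."""
    return pol if row[f] >= thr else -pol


def train_stump(X: List[List[float]], y: List[int], w: List[float]):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, f, thr, pol) != yi)
                if best is None or err < best[3]:
                    best = (f, thr, pol, err)
    return best


def adaboost(X: List[List[float]], y: List[int], rounds: int) -> List[Stump]:
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        f, thr, pol, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, f, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Stump], row: List[float]) -> int:
    score = sum(a * stump_predict(row, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1


X = [[0.05], [0.1], [0.9], [1.2]]   # toy feature differences
y = [1, 1, -1, -1]                  # +1: same label, -1: different label
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
print(predictions)
```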
Building the 3D model • Along the vertical/ground boundary, fit line segments (Hough transform); the goal is to find the simplest shape (fewest lines) • Project lines up from the corners of the boundary lines, then cut and fold (Hoiem et al., 2005)
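The Hough transform step can be illustrated with a small voting scheme: every boundary pixel votes for all lines (theta, rho) passing through it, and the most-voted bin is taken as the strongest line. The sketch below finds only a single best line in pure Python. Splitting the boundary into several segments and minimizing their number, as the slide describes, is not shown, and the parameter names and bin sizes are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def hough_best_line(points: Iterable[Tuple[float, float]],
                    n_theta: int = 180, rho_step: float = 1.0) -> Tuple[float, float, int]:
    """Return (theta, rho, votes) of the strongest line x*cos(theta) + y*sin(theta) = rho."""
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))] += 1
    (k, rho_bin), count = votes.most_common(1)[0]
    return math.pi * k / n_theta, rho_bin * rho_step, count


# Toy ground/vertical boundary: ten pixels along the horizontal line y = 5.
boundary = [(x, 5) for x in range(0, 100, 10)]
theta, rho, count = hough_best_line(boundary)
print(theta, rho, count)
```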
Photo Pop-Up Demonstration D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up", ACM SIGGRAPH 2005.
Photo Pop-Up Discussion • Pros: automatic; can handle a variety of scenes, not just boxes • Cons: no handling of foreground objects; misclassification leads to very strange models; only 2 kinds of surface besides sky: ground, vertical (Hoiem et al., 2005)
Modelling the scene • Option 3: Actually try to model surface angles
3D Scene Structure from Still Image • Compute surface normal for each surface • No right-angle assumptions; surfaces can have any angle • Automatic (trained on images with known depth maps)
3D Scene Implementation • Segment image into superpixels • Estimate surface normal of each superpixel (using Markov Random Field model) • Optional: Detect and extract foreground objects • Map textures to planes [Figure: original image; modeled depth map] A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image". In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007
Image features • Superpixel features (x_i): color and texture features as in Photo Pop-Up; vector also includes features of neighboring superpixels • Boundary features (x_ij): color difference, texture difference, edge detector
Markov Random Field Model • First term: model planes in terms of image features of superpixels • Second term: model planes in terms of pairs of superpixels, with constraints... (Saxena et al., 2007)
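In generic form, a two-term model of this kind factorizes into per-superpixel and pairwise factors. The expression below is a schematic sketch, not a transcription of the paper's equation. The symbols are assumptions: alpha_i denotes the plane parameters of superpixel i, x_i its image features, x_ij the boundary features between superpixels i and j, theta the learned parameters, and Z a normalizing constant.

```latex
P(\alpha \mid X; \theta) \;=\; \frac{1}{Z}
  \prod_{i} f_1\!\left(\alpha_i \mid x_i; \theta\right)
  \prod_{(i,j)} f_2\!\left(\alpha_i, \alpha_j \mid x_{ij}\right)
```

The first product corresponds to the first term (each plane explained by its own superpixel's features), and the second to the pairwise term that carries the constraints between neighboring superpixels.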
Model constraints • Connected structure: except where there is an occlusion, neighboring superpixels are likely to be connected • Coplanar structure: except where there are folds, neighboring superpixels are likely to lie on the same plane • Co-linearity: long straight lines in the image correspond to straight lines in 3D
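The first two constraints can be pictured as penalties on depth disagreement between neighboring planes. The sketch below assumes each superpixel's plane is written as the set of points q with alpha . q = 1, so the depth along a unit viewing ray r is 1 / (alpha . r). That parameterization and the function names are assumptions for illustration. A connected-structure penalty compares the two planes along a ray through their shared boundary, while a coplanarity penalty compares them along a ray through the interior of one superpixel.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def depth_along_ray(alpha: Sequence[float], ray: Sequence[float]) -> float:
    """Depth at which a unit ray from the camera meets the plane {q : alpha . q = 1}."""
    denom = dot(alpha, ray)
    if denom <= 0:
        raise ValueError("ray does not hit the plane in front of the camera")
    return 1.0 / denom


def depth_gap(alpha_i: Sequence[float], alpha_j: Sequence[float],
              ray: Sequence[float]) -> float:
    """Absolute difference between the depths two planes predict along one ray."""
    return abs(depth_along_ray(alpha_i, ray) - depth_along_ray(alpha_j, ray))


wall = (0.0, 0.0, 0.5)         # the plane z = 2
tilted = (0.5, 0.0, 0.5)       # also passes through (0, 0, 2), but at another angle
shared_ray = (0.0, 0.0, 1.0)   # ray through the boundary the two superpixels share
interior_ray = (0.6, 0.0, 0.8) # ray through the interior of one superpixel

connected_cost = depth_gap(wall, tilted, shared_ray)    # planes meet here: a fold, not an occlusion
coplanar_cost = depth_gap(wall, tilted, interior_ray)   # planes disagree away from the fold
print(connected_cost, coplanar_cost)
```

In this toy case the two planes are connected but not coplanar, which is the "fold" situation the second constraint allows for.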
Foreground objects • Automatically detected foreground objects may be removed from the model (for example: pedestrians, using the Dalal & Triggs detector) • Detected objects add 3D cues (pedestrians are basically vertical and occlude other surfaces)
Results A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image". In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007
3D Scene Discussion • Pros: handles a variety of scene types; fairly accurate (about 2/3 of scenes correct); automatic; handles foreground objects • Cons: still fails on about 1/3 of scenes
Discussion • Simple 3D models are adequate for many scenes • You can get pretty far without human input (though human annotation of scenes would still give better results) • Extensions? Use photo completion techniques to handle occlusions? Massive training sets -> better 3D models?