
Indoor Scene Segmentation using a Structured Light Sensor

This presentation explores indoor scene recognition using a Kinect sensor, introducing a new indoor depth dataset and a CRF-based model, and examining the use of RGB and depth cues for accurate segmentation. The model combines appearance terms with spatial smoothness to optimize class labels. Descriptor types such as RGB-SIFT, Depth-SIFT, Depth-SPIN, and RGBD-SIFT are employed, and appearance models are defined using neural networks that output a probability distribution over classes. Spatial priors, including 2D and 3D location priors, are integrated to improve segmentation accuracy, and aligning rooms with a normalized cylindrical coordinate system is discussed.



Presentation Transcript


  1. Indoor Scene Segmentation using a Structured Light Sensor ICCV 2011 Workshop on 3D Representation and Recognition Nathan Silberman and Rob Fergus Courant Institute

  2. Overview Indoor Scene Recognition using the Kinect • Introduce new Indoor Scene Depth Dataset • Describe CRF-based model • Explore the use of RGB/depth cues

  3. Motivation • Indoor Scene recognition is hard • Far less texture than outdoor scenes • More geometric structure

  4. Motivation • Indoor Scene recognition is hard • Far less texture than outdoor scenes • More geometric structure • Kinect gives us depth map (and RGB) • Direct access to shape and geometry information

  5. Overview Indoor Scene Recognition using the Kinect • Introduce new Indoor Scene Depth Dataset • Describe CRF-based model • Explore the use of RGB/depth cues

  6. Capturing our Dataset

  7. Statistics of the Dataset * Labels obtained via LabelMe

  8. Dataset Examples Living Room RGB Raw Depth Labels

  9. Dataset Examples Living Room RGB Depth* Labels * Bilateral Filtering used to clean up raw depth image

  10. Dataset Examples Bathroom RGB Depth Labels

  11. Dataset Examples Bedroom RGB Depth Labels

  12. Existing Depth Datasets • RGB-D Dataset [1] • Stanford Make3d [2]. [1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011. [2] B. Liu, S. Gould and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

  13. Existing Depth Datasets • Point Cloud Data [1] • B3DO [2]. [1] Abhishek Anand, Hema Swetha Koppula, Thorsten Joachims, Ashutosh Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS, 2011. [2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

  14. Dataset Freely Available http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

  15. Overview Indoor Scene Recognition using the Kinect • Introduce new Indoor Scene Depth Dataset • Describe CRF-based model • Explore the use of RGB/depth cues

  16. Segmentation using CRF Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j) • Standard CRF formulation • Optimized via graph cuts • Discrete label set (~12 classes)
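To make the cost function concrete, here is a minimal sketch (not the authors' implementation) of evaluating the CRF energy for a candidate labeling on a pixel grid, assuming per-pixel unary costs and a constant Potts penalty; in practice the labeling is optimized with graph cuts rather than by exhaustive evaluation.

```python
import numpy as np

def crf_energy(labels, unary_cost, potts_weight=1.0):
    """Evaluate Cost(labels) = sum_i LocalTerm(label_i)
                             + sum_{adjacent i,j} SpatialSmoothness(label_i, label_j).

    labels      : (H, W) int array, one class label per pixel
    unary_cost  : (H, W, K) array, local cost of assigning each of K labels
    potts_weight: penalty paid whenever 4-connected neighbours disagree
    """
    h, w = labels.shape
    # Local (unary) terms: cost of the chosen label at each pixel.
    rows, cols = np.indices((h, w))
    local = unary_cost[rows, cols, labels].sum()

    # Potts pairwise terms over horizontal and vertical neighbours.
    disagree_h = (labels[:, 1:] != labels[:, :-1]).sum()
    disagree_v = (labels[1:, :] != labels[:-1, :]).sum()
    smooth = potts_weight * (disagree_h + disagree_v)
    return local + smooth

# Toy usage: 12 classes on a small grid, starting from the greedy per-pixel labeling.
H, W, K = 4, 5, 12
unary = np.random.rand(H, W, K)
labeling = unary.argmin(axis=2)
print(crf_energy(labeling, unary, potts_weight=0.5))
```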

  17. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)

  18. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)

  19. Appearance Term Appearance(label_i | descriptor_i) • Several descriptor types to choose from: • RGB-SIFT • Depth-SIFT • Depth-SPIN • RGBD-SIFT • RGB-SIFT/D-SPIN

  20. Descriptor Type: RGB-SIFT • RGB image from the Kinect • 128-D • Extracted over a discrete grid

  21. Descriptor Type: Depth-SIFT • Depth image from the Kinect with linear scaling • 128-D • Extracted over a discrete grid
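A rough sketch of how a dense Depth-SIFT descriptor could be computed with OpenCV, assuming the depth map is linearly rescaled to 8 bits and SIFT is evaluated at keypoints placed on a fixed grid; the grid stride and keypoint size here are illustrative choices, not the values used in the paper.

```python
import cv2
import numpy as np

def dense_depth_sift(depth, stride=10, patch_size=16):
    """Extract 128-D SIFT descriptors over a discrete grid of a depth image.

    depth: (H, W) float array of depths (metric or raw Kinect units).
    Returns (N, 128) descriptors and their (x, y) grid locations.
    """
    # Linear scaling of the depth map into an 8-bit image, as on the slide.
    d8 = cv2.normalize(depth.astype(np.float32), None, 0, 255,
                       cv2.NORM_MINMAX).astype(np.uint8)

    h, w = d8.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(patch_size, h - patch_size, stride)
                 for x in range(patch_size, w - patch_size, stride)]

    sift = cv2.SIFT_create()                 # requires opencv-python >= 4.4
    _, descriptors = sift.compute(d8, keypoints)
    locations = np.array([kp.pt for kp in keypoints])
    return descriptors, locations
```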

  22. Descriptor Type: Depth-SPIN • Depth image from the Kinect with linear scaling • 50-D (radius × depth bins) • Extracted over a discrete grid. A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433–449, 1999.

  23. Descriptor Type: RGBD-SIFT • RGB image from the Kinect and depth image from the Kinect with linear scaling • SIFT descriptors concatenated into a 256-D descriptor

  24. Descriptor Type: RGB-SIFT/D-SPIN • RGB-SIFT from the RGB image and Depth-SPIN from the depth image (linear scaling) • Concatenated into a 178-D descriptor
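Concatenating per-location descriptors is straightforward; a minimal sketch assuming 128-D RGB-SIFT and 50-D Depth-SPIN vectors extracted at the same grid points:

```python
import numpy as np

def concat_descriptors(rgb_sift, depth_spin):
    """Stack descriptors computed at the same grid locations.

    rgb_sift  : (N, 128) RGB-SIFT descriptors
    depth_spin: (N, 50)  Depth-SPIN descriptors
    Returns an (N, 178) combined descriptor per location.
    """
    assert rgb_sift.shape[0] == depth_spin.shape[0]
    return np.concatenate([rgb_sift, depth_spin], axis=1)
```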

  25. Appearance Model Appearance(label_i | descriptor_i) • Modeled by a neural network with a single hidden layer • Input: descriptor at each location

  26. Appearance Model Appearance(label_i | descriptor_i) • Softmax output layer: 13 classes • 1000-D hidden layer • 128/178/256-D input: descriptor at each location

  27. Appearance Model Appearance(label_i | descriptor_i), interpreted as p(label | descriptor): a probability distribution over classes • Softmax output layer: 13 classes • 1000-D hidden layer • 128/178/256-D input: descriptor at each location

  28. Appearance Model Appearance(label_i | descriptor_i) • Probability distribution over 13 classes • Trained with backpropagation • 1000-D hidden layer • 128/178/256-D input: descriptor at each location
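A minimal numpy sketch of the appearance model's shape: one hidden layer followed by a softmax that turns a descriptor into a distribution over the 13 classes. The sigmoid activation and random weights are assumptions for illustration; the real model is trained with backpropagation on labeled descriptors.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class AppearanceModel:
    """p(label | descriptor) via a single-hidden-layer network."""

    def __init__(self, input_dim=128, hidden_dim=1000, n_classes=13, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.01, (input_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0, 0.01, (hidden_dim, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, descriptors):
        """descriptors: (N, input_dim) -> (N, n_classes) class probabilities."""
        hidden = 1.0 / (1.0 + np.exp(-(descriptors @ self.W1 + self.b1)))  # sigmoid (assumed)
        return softmax(hidden @ self.W2 + self.b2)

# Usage: class probabilities for 128-D descriptors at 5 grid locations.
model = AppearanceModel(input_dim=128)
probs = model.forward(np.random.rand(5, 128))   # each row sums to 1
```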

  29. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)

  30. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)

  31. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i) • Location(i) combines 2D priors and 3D priors

  32. Location Priors: 2D • 2D Priors are histograms of P(class,location) • Smoothed to avoid image-specific artifacts
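A sketch of building the 2D priors as per-class spatial histograms over normalized image coordinates, smoothed with a Gaussian to avoid image-specific artifacts; the bin count and smoothing width are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_2d_priors(label_maps, n_classes=13, bins=32, sigma=1.5):
    """Estimate P(class, location) from a list of (H, W) ground-truth label maps.

    Returns an (n_classes, bins, bins) array; priors[c, y, x] is the smoothed
    probability of class c at that normalized image location.
    """
    counts = np.zeros((n_classes, bins, bins))
    for labels in label_maps:
        h, w = labels.shape
        ys, xs = np.indices((h, w))
        # Map pixel coordinates into histogram bins.
        by = (ys * bins // h).ravel()
        bx = (xs * bins // w).ravel()
        np.add.at(counts, (labels.ravel(), by, bx), 1)

    # Smooth each class map spatially, then normalize to a distribution.
    for c in range(n_classes):
        counts[c] = gaussian_filter(counts[c], sigma)
    return counts / counts.sum()

# Usage: priors = build_2d_priors(training_label_maps); look up a pixel's bin at test time.
```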

  33. Motivation: 3D Location Priors • 2D priors don't capture 3D geometry • 3D priors can be built from depth data • Rooms are of different shapes and sizes, so how do we align them?

  34. Motivation: 3D Location Priors • To align rooms, we use a normalized cylindrical coordinate system, built from the band of maximum depths along each vertical scanline
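One way to read the normalized coordinate on the slide is as a relative depth: each pixel's depth divided by the maximum depth found in a band around its vertical scanline, so 0 is near the camera and 1 is at the room boundary. A rough sketch under that assumption:

```python
import numpy as np

def relative_depth(depth, band=5):
    """Normalize each pixel's depth by the maximum depth in a band of columns
    around its vertical scanline (an approximation of the room extent).

    depth: (H, W) array of metric depths. Returns values roughly in [0, 1].
    """
    h, w = depth.shape
    rel = np.empty_like(depth, dtype=np.float64)
    for x in range(w):
        lo, hi = max(0, x - band), min(w, x + band + 1)
        col_max = depth[:, lo:hi].max()          # farthest point near this scanline
        rel[:, x] = depth[:, x] / max(col_max, 1e-6)
    return np.clip(rel, 0.0, 1.0)
```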

  35. Relative Depth Distributions • Density of relative depth (0 to 1) for the Table, Television, Bed, and Wall classes

  36. Location Priors: 3D

  37. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j), where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i) • Location(i) combines 2D priors and 3D priors

  38. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j) • Spatial Smoothness: penalty for adjacent labels disagreeing (standard Potts model)

  39. Model Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j) • Spatial modulation of smoothness: • None • RGB edges • Depth edges • RGB + depth edges • Superpixel edges • Superpixel + RGB edges • Superpixel + depth edges
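A sketch of how the Potts penalty might be spatially modulated: the smoothness weight between neighbouring pixels is reduced where an RGB or depth edge separates them, so label changes are cheaper across strong boundaries. The edge measure and weighting function here are illustrative choices, not the exact variants evaluated in the paper.

```python
import numpy as np

def smoothness_weights(rgb_gray, depth, beta_rgb=10.0, beta_depth=2.0):
    """Per-edge Potts weights for horizontal neighbours, lowered at RGB/depth edges.

    rgb_gray, depth: (H, W) float arrays.
    Returns (H, W-1) weights in (0, 1]; multiply these into the Potts penalty.
    """
    drgb = np.abs(np.diff(rgb_gray, axis=1))     # RGB edge strength between neighbours
    ddep = np.abs(np.diff(depth, axis=1))        # depth edge strength between neighbours
    # Strong edges -> small weight -> cheap to change label across the edge.
    return np.exp(-beta_rgb * drgb) * np.exp(-beta_depth * ddep)
```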

  40. Experimental Setup • 60% train (~1408 images) • 40% test (~939 images) • 10-fold cross-validation • Images of the same scene never appear in both train and test • Performance criterion is pixel-level classification accuracy (mean diagonal of the confusion matrix) • 12 most common classes, plus 1 background class (from the rest)
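The evaluation criterion, the mean of the diagonal of the row-normalized confusion matrix (i.e. average per-class pixel accuracy), can be computed as in this small sketch:

```python
import numpy as np

def mean_diagonal_accuracy(pred, gt, n_classes=13):
    """Mean of the diagonal of the confusion matrix, normalized per true class.

    pred, gt: integer label arrays over all test pixels (any shape).
    """
    pred, gt = np.ravel(pred), np.ravel(gt)
    conf = np.zeros((n_classes, n_classes))
    np.add.at(conf, (gt, pred), 1)                 # rows: true class, cols: predicted class
    row_sums = conf.sum(axis=1, keepdims=True)
    conf = conf / np.maximum(row_sums, 1)          # each true class row sums to 1
    return conf.diagonal().mean()
```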

  41. Evaluating Descriptors • Pixel-level accuracy (percent) for 2D descriptors vs. 3D descriptors

  42. Evaluating Location Priors • Pixel-level accuracy (percent) with location priors, for 2D and 3D descriptors

  43. Conclusion • Kinect depth signal helps scene parsing • Still a long way from great performance • Showed standard approaches on RGB-D data • Lots of potential for more sophisticated methods • No complicated geometric reasoning • http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

  44. Preprocessing the Data We use open source calibration software [1] to infer: • Parameters of RGB & Depth cameras • Homography between cameras. [1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011
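Once the calibration software provides the camera parameters and the homography between the depth and RGB cameras, warping the depth map into the RGB frame can be done as in this sketch; the 3x3 matrix below is a placeholder for the calibration output, not a real calibration.

```python
import cv2
import numpy as np

# Placeholder for the 3x3 homography recovered by the calibration tool.
H_depth_to_rgb = np.eye(3, dtype=np.float64)

def align_depth_to_rgb(depth, rgb_shape):
    """Warp the depth image into the RGB camera's image plane."""
    h, w = rgb_shape[:2]
    return cv2.warpPerspective(depth.astype(np.float32), H_depth_to_rgb, (w, h),
                               flags=cv2.INTER_NEAREST)  # nearest: avoid mixing depths at boundaries
```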

  45. Preprocessing the Data • Bilateral filter used to diffuse depth across regions of similar RGB intensity • Naïve GPU implementation runs in ~100 ms
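A naïve CPU sketch of the idea behind the depth cleanup: a joint (cross) bilateral filter in which noisy or missing depth is averaged over a neighbourhood, weighted by spatial distance and by RGB intensity similarity, so depth diffuses only within regions of similar appearance. Window size and sigmas are illustrative; the authors use a GPU implementation that runs in roughly 100 ms.

```python
import numpy as np

def joint_bilateral_depth(depth, gray, radius=5, sigma_s=3.0, sigma_r=0.1, valid_min=1e-6):
    """Diffuse depth across regions of similar RGB intensity.

    depth: (H, W) depth map with 0 where missing.  gray: (H, W) intensity in [0, 1].
    """
    out = np.zeros_like(depth, dtype=np.float64)
    weight_sum = np.zeros_like(out)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))

    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # np.roll wraps at image borders; acceptable for a sketch.
            shifted_d = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
            shifted_g = np.roll(np.roll(gray, dy, axis=0), dx, axis=1)
            # Range weight from RGB similarity, zeroed where the neighbour has no depth.
            wgt = spatial[dy + radius, dx + radius] \
                  * np.exp(-((gray - shifted_g) ** 2) / (2 * sigma_r**2)) \
                  * (shifted_d > valid_min)
            out += wgt * shifted_d
            weight_sum += wgt
    return np.where(weight_sum > 0, out / np.maximum(weight_sum, 1e-12), depth)
```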

  46. Motivation Results from spatial pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% achieved by [1] on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes. [1] Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. S. Lazebnik, C. Schmid, and J. Ponce, CVPR 2006.
