Depth Estimation via Scene Classification
Vladimir Nedović (vnedovic@science.uva.nl)
with: Arnold Smeulders & Jan-Mark Geusebroek (UvA), André Redert (Philips Research)
28-05-2008
Order in Pollock's Chaos
• Jackson Pollock, Blue Poles: Number 1, 1952
  • seems chaotic, but there is structure, same as in natural image statistics
  • R.P. Taylor, A.P. Micolich and D. Jonas, Fractal Analysis of Pollock's Drip Paintings, Nature, vol. 399, p. 422 (1999)
• Pre-perspective (Gothic art, before 1430): Simone Martini (1285-1344)
• Post-perspective (Quattrocento, after 1430): Sandro Botticelli, Annunciation, 1489-90
  • viewpoint constraints understood, influence on film art
• 'Modal' scene configurations: structures orthogonal to each other
  • Know any tilted buildings?
  • W. Richards, A. Jepson and J. Feldman, Priors, Preferences and Categorical Percepts, in Perception as Bayesian Inference, pp. 80-111, 1996
Outline
• Introduction
• Related work
• Our approach
• Preliminary classification
• Conclusions
Introduction
The context: fully automatic 2D to 3D conversion of video data for 3DTV
• We know about stereo, structure from motion, etc., but can we also derive depth from a single image?
  • humans can, right?
• Can we exploit some constraints?
  • is the data really chaotic?
  • what about perceptual limitations of viewers?
GOAL: in a fast manner, obtain an approximate, but visually pleasing 3D model from a single image
Related work
• Related work (1): Torralba & Oliva
  • showed that depth can be derived from structure, itself derived from natural image statistics (IEEE PAMI 2001)
• Related work (2): Hoiem (Carnegie Mellon Univ.)
  • obtained 3D orientation of scene surfaces using machine learning (ICCV 2005)
  • improved object detection (CVPR 2006 best paper) + accounted for occlusions to derive relative ordering of elements (ICCV 2007)
  • BUT:
    • outdoor images only + assumes sky & ground are always present
    • i.e. accounts for less than half of all possibilities
• Related work (3): Saxena (Stanford Univ.)
  • 3D mesh from ML on low-level features (no classes)
Our approach
Our approach: depth estimation via geometric scene classification
• i.e. holistic, not pixel-based
• Separate a visual scene into its two constituent elements (stage and object):
  • consider objects separately from the stage on which they act
Determine the 3D stage model first
• Stage ≈ first approximation of global depth
• reduces subsequent (finer) depth processing tasks
• can guide other processes, e.g. object localization & recognition
V. Nedović et al., ICCV 2007
Our approach: stage models
• For the stage, a rough depth model is sufficient
• Regularities arise from:
  • natural image statistics -> texture gradients
  • viewpoint constraints -> perspective
  • modal configurations & film rules -> orthogonality
• Exploit the geometric structure of images, which reduces the number of possible configurations
• Only a few configurations are prominent => the first step in depth estimation can be stage classification
Our approach: stage hierarchy
• Structure of the visual world leads to only 15 geometric scene types
• Influence of structure is identical indoors & outdoors => such a distinction is unnecessary
• Three-level hierarchy
  • perform classification in steps: first determine the geometric neighbourhood, then proceed further
Our approach: three-level hierarchy
• 2-3 sub-stages per stage, accounting for variability in parameters
• geometry at the bottom level is so constrained that pre-defined crude depth maps are already possible
  • i.e. no parameter estimation needed!
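The step-wise descent through the hierarchy, ending in a pre-defined depth map, might be sketched as follows. This is only an illustration under stated assumptions: the group, stage and sub-stage names, the `choose` callable (a stand-in for a trained classifier restricted to a candidate list), the two depth-map shapes, and the convention that 1.0 means far and 0.0 means near are all hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, List, Sequence, Tuple

DepthMap = List[List[float]]

# Hypothetical hierarchy (group -> stage -> sub-stages). The names are
# placeholders; the slides do not list the actual 15 stages.
HIERARCHY: Dict[str, Dict[str, List[str]]] = {
    "open": {"ground_sky": ["low_horizon", "high_horizon"]},
    "enclosed": {"corridor": ["left_wall", "right_wall"]},
}


def vertical_ramp(rows: int, cols: int, horizon: float) -> DepthMap:
    """Far (1.0) at and above the horizon, linearly nearer towards the bottom.

    `horizon` is a fraction of the image height measured from the top,
    with 0 <= horizon < 1.
    """
    depth = []
    for r in range(rows):
        y = r / (rows - 1) if rows > 1 else 0.0
        d = 1.0 if y <= horizon else 1.0 - (y - horizon) / (1.0 - horizon)
        depth.append([d] * cols)
    return depth


def horizontal_ramp(rows: int, cols: int, near_left: bool) -> DepthMap:
    """Depth changes linearly from one image side to the other."""
    row = [c / (cols - 1) if cols > 1 else 0.0 for c in range(cols)]
    if not near_left:
        row = row[::-1]
    return [list(row) for _ in range(rows)]


# Hypothetical lookup table: one crude, fixed depth map per sub-stage,
# so no parameters are estimated from the image itself.
PREDEFINED: Dict[str, Callable[[int, int], DepthMap]] = {
    "low_horizon": lambda r, c: vertical_ramp(r, c, 0.7),
    "high_horizon": lambda r, c: vertical_ramp(r, c, 0.3),
    "left_wall": lambda r, c: horizontal_ramp(r, c, near_left=True),
    "right_wall": lambda r, c: horizontal_ramp(r, c, near_left=False),
}

Chooser = Callable[[Sequence[float], List[str]], str]


def classify_hierarchically(features: Sequence[float],
                            choose: Chooser) -> Tuple[str, str, str]:
    """Pick a group first, then a stage within it, then a sub-stage."""
    group = choose(features, sorted(HIERARCHY))
    stage = choose(features, sorted(HIERARCHY[group]))
    sub_stage = choose(features, HIERARCHY[group][stage])
    return group, stage, sub_stage


def depth_for(features: Sequence[float], choose: Chooser,
              rows: int, cols: int) -> DepthMap:
    """Classify, then return the pre-defined depth map of the sub-stage."""
    _, _, sub_stage = classify_hierarchically(features, choose)
    return PREDEFINED[sub_stage](rows, cols)
```

Because each step only chooses among the children of the previous choice, every classifier in such a scheme faces a much smaller problem than a flat 15-way decision would.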
Preliminary classification (1)
• Proof of concept with a single feature type
  • natural image statistics-based Weibull features (i.e. texture gradients)
• TRECVID dataset of TV news used for evaluation
• Features extracted based on a 4x4 region grid over the image
  • two features per region => 64 features in total
A.F. Smeaton et al., "Evaluation campaigns and TRECVid", 8th ACM Int'l Workshop on Multimedia Info. Retrieval, 2006.
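A grid-based Weibull feature extractor could look roughly like the sketch below. Several things here are assumptions rather than the authors' method: the slide does not give the exact Weibull formulation, so a standard two-parameter Weibull is fitted by maximum likelihood to absolute pixel differences; and since a 4x4 grid with two features per region yields 32 values rather than the stated 64, the sketch assumes the two parameters (shape, scale) are computed for two derivative directions (x and y) in each region. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]


def fit_weibull(samples: Sequence[float], eps: float = 1e-6,
                k_lo: float = 0.01, k_hi: float = 50.0) -> Tuple[float, float]:
    """Maximum-likelihood fit of a two-parameter Weibull to |samples|.

    Returns (shape, scale). Values are clamped to `eps` so logs are defined;
    a flat region therefore ends up with the upper bound `k_hi` as shape.
    """
    xs = [max(abs(v), eps) for v in samples]
    x_max = max(xs)
    xs = [x / x_max for x in xs]  # rescale to (0, 1]; the shape is scale-invariant
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k: float) -> float:
        # Likelihood equation for the shape; increasing in k, zero at the MLE.
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    if score(k_hi) < 0.0:
        shape = k_hi
    else:
        lo, hi = k_lo, k_hi
        for _ in range(60):  # plain bisection on the shape parameter
            mid = 0.5 * (lo + hi)
            if score(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        shape = 0.5 * (lo + hi)
    scale = x_max * (sum(x ** shape for x in xs) / len(xs)) ** (1.0 / shape)
    return shape, scale


def region_derivatives(img: Image, r0: int, r1: int,
                       c0: int, c1: int) -> Tuple[List[float], List[float]]:
    """Forward differences in x and y, using only pixels inside the region."""
    dx = [img[r][c + 1] - img[r][c]
          for r in range(r0, r1) for c in range(c0, c1 - 1)]
    dy = [img[r + 1][c] - img[r][c]
          for r in range(r0, r1 - 1) for c in range(c0, c1)]
    return dx, dy


def grid_weibull_features(img: Image, grid: int = 4) -> List[float]:
    """(shape, scale) for x and y derivatives in each cell of a grid x grid layout.

    Assumes each cell is at least 2x2 pixels.
    """
    rows, cols = len(img), len(img[0])
    features: List[float] = []
    for gr in range(grid):
        r0, r1 = gr * rows // grid, (gr + 1) * rows // grid
        for gc in range(grid):
            c0, c1 = gc * cols // grid, (gc + 1) * cols // grid
            for deriv in region_derivatives(img, r0, r1, c0, c1):
                features.extend(fit_weibull(deriv))
    return features
```

With the default `grid=4`, the returned vector has 16 regions x 2 directions x 2 parameters = 64 entries, which matches the feature count on the slide under the assumption stated above.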
Preliminary classification (2)
• Support Vector Machines (SVM) classifier based on a 1 vs. 1 multi-class approach
• two-step classification, average within group (assuming super-stage is known)
[Results figure: stage groups; individual stages (results of symmetrical variants combined)]
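The 1 vs. 1 multi-class scheme trains one binary classifier per pair of classes and lets them vote. The sketch below shows only that voting structure; as a loudly labelled stand-in, each pairwise decision uses a nearest-class-mean rule instead of an SVM, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
PairModel = Dict[Tuple[str, str], Tuple[List[float], List[float]]]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_one_vs_one(data: Dict[str, List[Vector]]) -> PairModel:
    """One binary 'classifier' per unordered pair of class labels.

    Stand-in only: each pair stores the two class means, not an SVM.
    """
    means = {label: mean_vector(vs) for label, vs in data.items()}
    return {(a, b): (means[a], means[b])
            for a, b in combinations(sorted(data), 2)}


def predict_one_vs_one(model: PairModel, x: Vector) -> str:
    """Each pairwise classifier casts one vote; the most-voted label wins.

    Ties are broken by alphabetical label order.
    """
    votes: Dict[str, int] = {}
    for (a, b), (mean_a, mean_b) in model.items():
        winner = a if squared_distance(x, mean_a) <= squared_distance(x, mean_b) else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(sorted(votes), key=votes.get)
```

For k classes this scheme needs k(k-1)/2 pairwise classifiers, so a flat decision over all 15 stages would use 105 of them, while a classifier restricted to the stages of a single group needs far fewer.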
Conclusions (1)
• We need a fast & approximate solution:
  • do only what is necessary, viewers may not perceive it anyway
  • generalize where possible, to reduce the problem at every step
• Separate a scene into a stage and the objects
• Determine the stage 3D model first
  • rough model is sufficient
  • plus, structure greatly reduces the number of possible configurations
  • and, stage will help us to locate and process objects
Conclusions (2)
• Due to structure, we can create simple models that fit TV data
  • 15 stages are sufficient
  • no need to distinguish between indoor & outdoor
• Therefore, we can use scene classification as the first step in depth estimation
Conclusions (3)
• Our approach: three-step classification
  • geometry at the bottom is constrained enough, so we can already assign pre-defined depth maps
  • no parameter estimation necessary
• Proof of concept demonstrated with a single feature type
  • performance much better than chance
  • but enhancements needed (more features etc.)