This talk presents Cascaded Classification Models (CCMs), a framework for combining existing models into a holistic scene understanding system. It covers four tasks: object detection, scene categorization, region labeling, and depth reconstruction. It motivates the use of visual context, reviews related work on 3D from line drawings and intrinsic images, lists the desiderata for a context model, and describes the CCM framework. Results are reported for experiments on two datasets, SU-1 and SU-2.
Cascaded Classification Models: Combining Models for Holistic Scene Understanding • Helping models play nice since 2008... • Geremy Heitz, Steve Gould, Ashutosh Saxena, Daphne Koller • August 11, 2008 • DAGS
Outline • Understanding Scene Understanding • Related Work • Model Desiderata • CCM Framework • Results • Extensions
Computer View of a “Scene” • SKY • GRASS • SEASIDE PASTURE
Human View of a “Scene” She’s walking. A cow Some grass… “The cow is walking through the grass on a pasture by the sea.”
Scene Understanding • Requires combining many tasks • Object Detection • Scene Categorization • Region Labeling • Depth Reconstruction • Requires the “right” representation • Matches the questions we might ask • Operates at multiple granularities • The whole is greater than the sum… • What information can they share?
Visual Context • Context (from http://www.thefreedictionary.com): • “The words before and after a word or passage in a piece of writing that contribute to its meaning.” • Visual Context: • “The visual objects ‘near’ a particular visual object that contribute to its meaning” • Visual Context Cues: • “Signals obtained from nearby visual objects that may help a classifier classify a query object”
Outline • Understanding Scene Understanding • Related Work • Model Desiderata • CCM Framework • Results • Extensions
3D from Line Drawings • David Waltz – “Understanding Line Drawings of Scenes with Shadows” - 1975
Intrinsic Images • Barrow and Tenenbaum – “Recovering intrinsic scene characteristics from images” - 1978 • Tappen et al. – “Recovering intrinsic images from a single image” - 2005 Original Image Reflectance Image Shading Image
Scene Understanding • Derek Hoiem – “Closing the Loop in Scene Interpretation” – CVPR 2008 • Uses “Intrinsic Image” idea • But… • Tailored specifically to his previous models • Fewer classes • Regions get generic properties • Hard to pronounce his name
Context Model Desiderata • Allow state-of-the-art subcomponents • Generic method of combining them • Limited interface into “black boxes” • REGION LABELING (Gould et al., 2007) • DEPTH RECONSTRUCTION (Saxena et al., 2007) • DETECTION (Dalal & Triggs, 2006)
Context Model Desiderata • Learn from datasets with arbitrary sets of labels • Different components improve each other • Datasets shown: MSRC Multiclass Segmentation, Pascal Visual Object Classes, LabelMe, Stanford Range Image Data
Cascaded Classification Models • Component modules must have 3 properties • Learning: The classifier should be able to learn from a set of training instances. • Classification: We should be able to obtain a classification of the output variables. • Connectivity: The classifier should provide a mechanism for including features from other modules.
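The three module properties can be read as a small programming interface. The sketch below is one possible rendering in Python, not the authors' code: the names `CCMModule`, `fit`, `predict`, `augment`, and the toy `NearestMeanModule` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Protocol, Sequence


class CCMModule(Protocol):
    """Hypothetical interface; method names are illustrative assumptions."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None:
        """Learning: train from a set of instances."""

    def predict(self, x: Sequence[float]) -> int:
        """Classification: return a label for the output variable."""

    def augment(self, x: Sequence[float], context: Sequence[float]) -> List[float]:
        """Connectivity: fold features from other modules into x."""


class NearestMeanModule:
    """Toy classifier standing in for a real black-box component."""

    def __init__(self) -> None:
        self.means: Dict[int, List[float]] = {}

    def fit(self, X, y):
        # Store the per-class mean of every feature column.
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.means[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(self, x):
        # Return the class whose mean is closest in squared distance.
        def dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.means[label]))

        return min(self.means, key=dist)

    def augment(self, x, context):
        # Simplest possible connectivity: concatenate the context features.
        return list(x) + list(context)
```

Any component that offers these three operations could be plugged into a cascade without exposing its internals, which is the "limited interface" idea from the desiderata slides.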
CCMs • I: Image • Φ: Image Features • Ŷ: Output labels • Features for level ℓ+1 computed from Φ and labels of level ℓ • [Figure: image I feeds features ΦD, ΦS, ΦZ; each level 0, 1, …, L produces outputs ŶD, ŶS, ŶZ]
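The rule that features for level ℓ+1 come from Φ plus the labels of level ℓ can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the function name `run_cascade` is hypothetical, labels are passed as plain numbers, and each task receives the previous-level labels of the other tasks only.

```python
from typing import Callable, Dict, List

Predict = Callable[[List[float]], int]


def run_cascade(
    phi: Dict[str, List[float]],
    tiers: List[Dict[str, Predict]],
) -> List[Dict[str, int]]:
    """Run every tier in order and return the labels produced at each level.

    phi maps a task name to its image features; tiers[l] maps a task name
    to the (already trained) classifier used at level l.
    """
    history: List[Dict[str, int]] = []
    prev: Dict[str, int] = {}
    for tier in tiers:
        current: Dict[str, int] = {}
        for task, predict in tier.items():
            # Context = previous-level labels of the other tasks (assumption).
            context = [float(prev[other]) for other in sorted(prev) if other != task]
            current[task] = predict(phi[task] + context)
        history.append(current)
        prev = current
    return history
```

At level 0 the context is empty, so each classifier sees only its own image features, which corresponds to running the components independently.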
How to use black boxes? • BLACK BOX: WAHOO CLASSIFIER • Output Labels Y_WAHOO • BLACK BOX: SHAZAM CLASSIFIER • Output Labels Y_SHAZAM
CCMs for Scene Understanding • Scene Categorization • Object Detection • Region Labeling • Depth Reconstruction
Scene Categorization • C = { ‘urban’, ‘rural’, ‘ocean’, ‘other’ } • Image features: RGB Mean/Stddev, YCbCr Mean/Stddev • From Detection: # of detections of each object • From Region Labeling: Fraction of each region type
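The two context inputs to the scene categorizer (detection counts and region-type fractions) can be computed as below. This is a sketch, assuming detections arrive as a list of class names and region labels as one label per pixel; the function name and argument layout are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def scene_context_features(
    detections: Sequence[str],
    pixel_labels: Sequence[str],
    object_classes: Sequence[str],
    region_classes: Sequence[str],
) -> List[float]:
    """Build [count per object class] + [fraction per region class]."""
    det_counts = Counter(detections)
    region_counts = Counter(pixel_labels)
    total = len(pixel_labels)
    counts = [float(det_counts[c]) for c in object_classes]
    # Guard against an empty label map to avoid dividing by zero.
    fractions = [region_counts[c] / total if total else 0.0 for c in region_classes]
    return counts + fractions
```

For an image with one cow and two boats whose pixels are 40% grass and 30% sky, the vector starts with the counts 1 and 2 and continues with the fractions 0.4 and 0.3. These values would be appended to the color statistics before classification.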
Object Detection – HOG Features • Dalal & Triggs, 2006 • SVM
Object Detection - Sliding Window • Consider every bounding box • All shifts • All scales • Possibly all rotations • Each box gets a score: • D(x,y,s,Θ) • Detections: • Local peaks in D() • Example window scores: D = 1.5, D = -0.3
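Taking detections as local peaks in D() can be illustrated on a single-scale score grid. The sketch below is a simplification, not the detector's actual procedure: it ignores scale and rotation, uses an 8-neighbor comparison, and applies a score threshold, all of which are assumptions.

```python
from typing import List, Sequence, Tuple


def local_peaks(
    scores: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for cells that beat the threshold and all neighbors."""
    rows, cols = len(scores), len(scores[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            s = scores[y][x]
            if s <= threshold:
                continue
            # Collect the up-to-8 surrounding cells, clipped at the border.
            neighbors = [
                scores[j][i]
                for j in range(max(0, y - 1), min(rows, y + 2))
                for i in range(max(0, x - 1), min(cols, x + 2))
                if (j, i) != (y, x)
            ]
            if all(s > n for n in neighbors):
                peaks.append((x, y, s))
    return peaks
```

A full detector would run the same test jointly over position and scale, so that one object does not fire at several neighboring window sizes.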
Object Detection • $\Phi = [\,1,\ D(x,y,s),\ X,\ Y,\ X^2,\ Y^2,\ XY,\ W,\ W^2\,]$ • $P(Y) = \mathrm{LogReg}(\Phi, w)$ • $Y = \mathbf{1}\{\text{is a car}\}$ • F2: Detector score of window • F10: Amount of “building” above window • F50: Variance of depths in window
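The window feature vector and the logistic regression on top of it can be written out directly. This is a minimal sketch: the function names are hypothetical, the weights would come from training, and the context features from the slide (such as the amount of “building” above the window) would be appended to the base vector in the same way.

```python
import math
from typing import List, Sequence


def window_features(score: float, x: float, y: float, w: float) -> List[float]:
    """Base features: [1, D, X, Y, X^2, Y^2, XY, W, W^2]."""
    return [1.0, score, x, y, x * x, y * y, x * y, w, w * w]


def logreg(phi: Sequence[float], weights: Sequence[float]) -> float:
    """P(Y = 1) as the logistic function of the dot product of weights and phi."""
    z = sum(p * wt for p, wt in zip(phi, weights))
    return 1.0 / (1.0 + math.exp(-z))
```

With all weights at zero the model returns 0.5 for every window. A positive weight on the detector score alone reproduces the ranking of the original detector, and the remaining weights let position, size, and context shift that ranking.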
Region Labeling • CRF (figure labels: SKY, GRASS) • Y = { ‘grass’, ‘road’, ‘tree’, ‘sky’, ‘water’, ‘building’, ‘foreground’ } • Mean R,G,B; Mean H,U,V; Texture Responses; Area; Aspect Ratio; … • Delta R,G,B; Offset Vector; …
Region Labeling Context • Predict “grass” • Relative Location Map
Depth Reconstruction with Context • SKY: Normals point out • GRASS: Normals point up • Find d* (BLACK BOX) • Reoptimize depths with new constraints: $d_{\mathrm{CCM}} = \arg\min \; \gamma \lVert d - d^{*} \rVert + \beta \lVert n - n_{\mathrm{CONTEXT}} \rVert + \dots$
SU-CCM • SKY • GRASS • SEASIDE PASTURE • Grass = Flat, Sky = Far, FG = Vertical • 40% Grass, 30% Sky… • 1 cow, 2 boats…
Results • Experiments on 2 datasets • SU-1 • 362 images, fully labeled • Scene categorization, object detection, region labeling • Gathered by us • SU-2 • 1746 images, disjoint labels • Object detection, region labeling, depth reconstruction • Combination of PASCAL data, MSRC data, Stanford Range Image Data, other…
Methods • Independent • Level 0 Models • Groundtruth • Each tier is trained using the groundtruth outputs from the previous tier • 2-CCM • Parameters from tier 1 are copied to all other levels • 5-CCM • [Figure: CCM diagram with image I, features ΦD, ΦS, ΦZ, and outputs ŶD, ŶS, ŶZ at levels 0, 1, …, L]
SU-1 Segment Labeling • [Plot: Pixel Accuracy (y-axis, 0.65 to 0.75) vs. Classification Tiers (x-axis, 1 to 6); series: Independent, Groundtruth, 2-CCM, 5-CCM]
SU-1 Object Detection • [Plot: Detection AP (y-axis, 0.33 to 0.38) vs. Classification Tiers (x-axis, 1 to 6)] • Detection AP = Robust Area Under Precision-Recall Curve
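For readers unfamiliar with the metric, the area under a precision-recall curve can be computed from ranked detections as sketched below. The slides do not define the "robust" variant, so this shows only the plain step-wise average precision; the function name and inputs are hypothetical.

```python
from typing import Sequence


def average_precision(
    scores: Sequence[float],
    is_true_positive: Sequence[bool],
    num_positives: int,
) -> float:
    """Mean of the precision values measured at each correctly detected object.

    num_positives is the number of groundtruth objects, including missed ones.
    """
    if num_positives == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / num_positives
```

Missed objects lower the score because they count in `num_positives` but never add a precision term.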
SU-1 Scene Categorization • [Plot: Scene Category Acc. (y-axis, 0.6 to 0.8) vs. Classification Tiers (x-axis, 1 to 6)]
Scene Understanding • Requires combining many tasks • Object Detection • Scene Categorization • Region Labeling • Depth Reconstruction • Requires the “right” representation • Matches the questions we might ask • Operates at multiple granularities • The whole is greater than the sum… • What information can they share?
Descriptive Classification • Localized Test Outlines: Up, Down • City walking during rush hour? OR Long walk on the beach? • Object Level • Scene Level?