Extracting Simple Verb Frames from Images: Toward Holistic Scene Understanding. Prof. Daphne Koller Research Group, Stanford University. Geremy Heitz. DARPA CLLR Workshop, December 2, 2008.
Grand Goal: Scene Understanding. Image labels: Cigarette, Backpack, Man, Dog. Example descriptions: “A cow walking through the grass on a pasture by the sea”; “man wearing a backpack, smoking a cigarette, walking a dog on a sidewalk”
Understanding Verb Frames • Primitives • Objects • Parts • Surfaces • Regions • Interactions • Context • Actions
Methods exist to extract these, but we need to both do a better job and get them all at once.
Frame: to walk. “a man is walking on a sidewalk”; “a dog is walking on a sidewalk”. Image labels: Man, Building, Cigarette, Backpack, Dog, Sidewalk.
Modeling verb frames requires understanding the interactions between primitives, which fit well into the framework of graphical models.
Outline • Extracting the Primitives • Qualitative 3D Scene Layout • Modeling Relationships • Learning Frames • Refined Characterization of Objects
Computer View of a “Scene”. Labels: BUILDING, ROAD, STREET SCENE
Object Detection. Detection window W is kept when Score(W) > 0.5. Classes in the legend: Car, Person, Motorcycle, Boat, Sheep, Cow.
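The thresholding step on this slide can be sketched as follows. This is a minimal illustration, not the detector used in the talk: the `Window` type, its fields, and the candidate scores are hypothetical, and only the `Score(W) > 0.5` test comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """A candidate detection window (hypothetical structure for illustration)."""
    x: int
    y: int
    width: int
    height: int
    label: str    # one of the slide's classes, e.g. "Car" or "Cow"
    score: float  # classifier confidence for this window


def keep_detections(windows, threshold=0.5):
    """Keep windows whose score strictly exceeds the threshold, as in Score(W) > 0.5."""
    return [w for w in windows if w.score > threshold]


# Made-up candidate windows, used only to exercise the function.
candidates = [
    Window(10, 20, 64, 32, "Car", 0.82),
    Window(40, 25, 64, 32, "Cow", 0.31),
    Window(90, 60, 32, 64, "Person", 0.50),
]
detections = keep_detections(candidates)
```

With these made-up scores only the "Car" window survives; the "Person" window at exactly 0.50 is dropped because the comparison is strict.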
Finding the Primitives Jointly. Region labels: SKY, GRASS. Scene label: SEASIDE PASTURE. Layout: Grass = Flat, Sky = Far, FG = Vertical. Region statistics: 40% Grass, 30% Sky… Object counts: 1 cow, 2 boats… [Heitz et al., NIPS 2008a]
Results – TAS Model. Panels: Contextual Detector, Base Detector. [Heitz et al., ECCV 2008]
Qualitative 3D Scene Layout. Primitives imply a certain 3D layout of the scene, although absolute depth may not be preserved. For example: sky is a far, vertical plane; water and road are horizontal planes; objects “pop up” from the image.
Modeling Relationships • We have explored how to model 2D relationships • We should be able to extend this to 3D relationships [Heitz et al., ECCV 2008] [Gould et al., IJCV 2008] Example relationships: Beside, In front of, On
Outline • Extracting the Primitives • Qualitative 3D Scene Layout • Modeling Relationships • Learning Frames • Refined Characterization of Objects
Learning Semantics: Verb Frames. Given primitives, rough layout, and relationships, let’s learn subjects, verbs, and objects for frames: The [S] [V] the [O]. [S], [O]: CAR, ROAD, COW, GRASS, PERSON, APPLE, … [V]: WALKS ON, EATS, DRIVES ON, JUMPS OVER, THROWS, …
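One way to picture the frame template is as a small record type plus a counting step over labeled examples. The sketch below is only an assumption-laden illustration: the slide does not say how frames are learned, so the `VerbFrame` type, the `most_common_verb` helper, and the example frames are all hypothetical, and simple counting stands in for whatever learning method the group intends.

```python
from collections import Counter
from typing import NamedTuple


class VerbFrame(NamedTuple):
    """One instance of the template "The [S] [V] the [O]." (hypothetical type)."""
    subject: str
    verb: str
    obj: str

    def sentence(self) -> str:
        return f"The {self.subject.lower()} {self.verb.lower()} the {self.obj.lower()}."


# Made-up labeled frames using the vocabulary listed on the slide.
observed = [
    VerbFrame("COW", "WALKS ON", "GRASS"),
    VerbFrame("COW", "WALKS ON", "GRASS"),
    VerbFrame("COW", "EATS", "GRASS"),
    VerbFrame("CAR", "DRIVES ON", "ROAD"),
    VerbFrame("PERSON", "EATS", "APPLE"),
]


def most_common_verb(frames, subject, obj):
    """Return the verb seen most often with this subject/object pair, or None."""
    counts = Counter(f.verb for f in frames if f.subject == subject and f.obj == obj)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


best = most_common_verb(observed, "COW", "GRASS")
frame = VerbFrame("COW", best, "GRASS")
```

A real system would condition on the extracted layout and relationships rather than on the subject/object pair alone; the counting step only shows where such evidence would enter.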
Refined Characterization We need to know that the white stick is a cigarette… and where the man’s mouth is… in order to determine that he’s smoking.
Refined Object Characterization • Set of “keypoint” landmarks • Outline shape defined by connecting contour [Heitz et al., NIPS 2008b, IJCV in submission]
Results: Rhino, Giraffe, Llama
Mammals: Running, Standing, Eating, Standing [Heitz et al., NIPS 2008b, IJCV in submission]
Activity Recognition (Drinking, Eating). Image labels: Grass, Eating, Cow.
1) Localize the landmarks of the cow, including the head.
2) Extract histogram of “stuff” in a window around the head landmark.
3) Make a decision.
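The three steps on this slide can be sketched roughly as below, assuming the head landmark has already been localized (step 1) and that a per-pixel "stuff" label map is available. The function names, the window size, the 0.5 cutoff, and the grass/water decision rule are all assumptions made for illustration; the slide does not specify how the decision is made.

```python
from collections import Counter


def stuff_histogram(label_map, center, half_size):
    """Normalized histogram of stuff labels in a square window around `center`.

    `label_map` is a list of rows of label strings, `center` is (row, col) and
    must lie inside the image; the window is clipped at the image border.
    """
    rows, cols = len(label_map), len(label_map[0])
    r0, c0 = center
    counts = Counter()
    for r in range(max(0, r0 - half_size), min(rows, r0 + half_size + 1)):
        for c in range(max(0, c0 - half_size), min(cols, c0 + half_size + 1)):
            counts[label_map[r][c]] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def decide_activity(histogram, cutoff=0.5):
    """Hypothetical decision rule: mostly grass near the head means eating,
    mostly water means drinking."""
    if histogram.get("grass", 0.0) >= cutoff:
        return "Eating"
    if histogram.get("water", 0.0) >= cutoff:
        return "Drinking"
    return "Unknown"


# Made-up 4x4 label map; the head landmark is assumed to be at row 2, column 1.
label_map = [
    ["sky", "sky", "sky", "sky"],
    ["grass", "grass", "grass", "sky"],
    ["grass", "grass", "grass", "grass"],
    ["grass", "grass", "grass", "grass"],
]
head = (2, 1)
hist = stuff_histogram(label_map, head, half_size=1)
activity = decide_activity(hist)
```

In this toy example the 3x3 window around the head contains only grass, so the rule returns "Eating". A learned classifier over the histogram could replace the hand-set cutoff.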
Activity Recognition with People (Running, Walking, Standing, Hitting) • Pose of the person is one of the important factors • Also need to recognize the objects the person interacts with
How far can we take this? • Front legs off ground = Jumping • Apple near mouth = Eating • Ball near hands = Throwing
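The three cue-to-action pairs on this slide could be written as simple rules over localized landmarks and detected objects. This is only a sketch: the `scene` dictionary layout, the pixel coordinates, and the 20-pixel notion of "near" are hypothetical, and the slide itself poses these rules as an open question rather than a method.

```python
import math


def near(p, q, max_dist):
    """True when two (x, y) points are within `max_dist` pixels of each other."""
    return math.dist(p, q) <= max_dist


def infer_actions(scene, max_dist=20.0):
    """Apply the slide's three rules to a dict of cues (hypothetical format)."""
    actions = []
    if scene.get("front_legs_off_ground"):
        actions.append("Jumping")
    if "apple" in scene and "mouth" in scene and near(scene["apple"], scene["mouth"], max_dist):
        actions.append("Eating")
    if "ball" in scene and "hands" in scene and near(scene["ball"], scene["hands"], max_dist):
        actions.append("Throwing")
    return actions


# Made-up landmark and object positions in pixel coordinates.
scene = {
    "apple": (105, 52),
    "mouth": (100, 50),
    "ball": (300, 200),
    "hands": (120, 140),
}
actions = infer_actions(scene)
```

Here the apple is about 5 pixels from the mouth while the ball is far from the hands, so only "Eating" fires.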
Does phased learning help? • Cartoon/Caricature: Exaggerates the most salient features of the object class. • Simple BG: Real object with no confusing clutter. • Cluttered BG: Object in standard pose on natural background. • Articulated: Once we have built a strong appearance model, can we learn complicated articulations?
Our Related Papers
• G. Elidan, B. Packer, G. Heitz, and D. Koller. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. UAI, 2008.
• S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-Class Segmentation with Relative Location Prior. IJCV, 2008.
• S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller. Integrating Visual and Range Data for Robotic Object Detection. ECCV Workshop M2SFA2, 2008.
• G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things. ECCV, 2008.
• G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. NIPS, 2008.
• G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based Object Localization for Descriptive Classification. NIPS, 2008.