Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University. {bangpeng,feifeili}@cs.stanford.edu

Human-Object Interaction
• Robots interacting with objects
• Automatic sports commentary: “Kobe is dunking the ball.”
• Medical care

Human-Object Interaction
• Holistic image-based classification (previous talk: Grouplet). Grouplet is a generic feature for structured objects, or for interactions of groups of objects. Examples: playing bassoon vs. playing saxophone; Caltech101.
• Detailed understanding and reasoning. HOI activity: tennis forehand.

Human-Object Interaction
• Holistic image-based classification
• Detailed understanding and reasoning
  • Human pose estimation: head, torso, left arm, right arm, left leg, right leg
  • Object detection: tennis racket
  • HOI activity: tennis forehand

Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
• Model Representation
• Model Learning
• Model Inference
• Experiments
• Conclusion

Human pose estimation & Object detection
Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image regions that look like a body part
Related work: Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009

Human pose estimation & Object detection
Object detection facilitates pose estimation: the pose is easier to estimate given that the object is detected.

Human pose estimation & Object detection
Object detection is challenging:
• Objects are small, low-resolution, or partially occluded
• Image regions can look similar to the detection target
Related work: Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009

Human pose estimation & Object detection
Pose estimation facilitates object detection: the object is easier to detect given that the pose is estimated.

Human pose estimation & Object detection Mutual Context

Context in Computer Vision
• Previous work: use context cues to facilitate object detection. Helpful, but detection with context outperforms detection without context only moderately (~3-4%).
• Our approach: two challenging tasks serve as mutual context for each other.
[Figure: results without context vs. with mutual context.]
Related work: Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010; Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Viola & Jones, 2001; Lampert et al, 2008

Mutual Context Model Representation
• A: Activity (tennis forehand, croquet shot, volleyball smash)
• O: Object (tennis racket, croquet mallet, volleyball)
• H: Human pose. Captures intra-class variations: more than one H for each A; unobserved during training.
• P1, P2, …, PN: Body parts. For each part P: lP: location; θP: orientation; sP: scale.
• fO, f1, f2, …, fN: Image evidence. f: shape context [Belongie et al, 2002].

Mutual Context Model Representation
Markov Random Field over A, O, H, the body parts P1, P2, …, PN, and the image evidence fO, f1, f2, …, fN. Each clique contributes a clique potential scaled by a clique weight.
• Co-occurrence potentials: frequency of co-occurrence between A, O, and H.
• Spatial potentials: spatial relationship (size, location, orientation) among the object and body parts. The structural connectivity among the body parts and the object is obtained by structure learning.
• Detection potentials: discriminative part detection scores, computed with shape context + AdaBoost [Andriluka et al, 2009] [Belongie et al, 2002] [Viola & Jones, 2001].
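The combination of clique weights and clique potentials can be sketched as a weighted sum over the edges of the model. This is a minimal illustration only: the function name `model_score`, the edge labels, and every numeric value below are hypothetical, and the slides do not confirm the exact functional form.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def model_score(potentials: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Sum of clique potentials, each scaled by its clique weight."""
    return sum(weights[edge] * value for edge, value in potentials.items())


# Hypothetical potential values for one candidate configuration.
potentials = {
    ("A", "O"): 0.8,    # co-occurrence of activity and object
    ("A", "H"): 0.6,    # co-occurrence of activity and pose
    ("O", "H"): 0.7,    # co-occurrence of object and pose
    ("O", "P1"): 0.5,   # spatial relationship between object and a body part
    ("O", "fO"): 0.9,   # object detection score
    ("P1", "f1"): 0.4,  # body-part detection score
}
# Uniform weights, chosen only for illustration.
weights = {edge: 1.0 for edge in potentials}

score = model_score(potentials, weights)
print(round(score, 3))
```

Under this reading, inference amounts to searching for the configuration of A, O, H, and the parts that gives the highest score.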

Model Learning
Input: training images (e.g. cricket bowling, cricket shot).
Goals:
• Hidden human poses (hidden variables)
• Structural connectivity (structure learning)
• Potential parameters and potential weights (parameter estimation)

Model Learning
Goal: hidden human poses.
Approach: [figure; example class: croquet shot]

Model Learning
Goal: structural connectivity.
Approach: hill-climbing.
• Objective: joint density of the model, combined with a Gaussian prior on the number of edges.
• Moves: add an edge or remove an edge.
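The hill-climbing search over edges can be sketched as a greedy loop that toggles one edge at a time. This is a minimal sketch under stated assumptions: `log_density` stands in for the log of the joint density of the model and is supplied by the caller, the Gaussian prior is applied to the edge count in log form, and the toy edge gains in the usage part are invented for illustration. The slides do not give the exact scoring function or stopping rule.

```python
import itertools
from typing import Callable, FrozenSet, List, Tuple

Edge = Tuple[str, str]


def log_edge_prior(n_edges: int, mean: float, std: float) -> float:
    """Log of an unnormalized Gaussian prior on the number of edges."""
    return -0.5 * ((n_edges - mean) / std) ** 2


def hill_climb(
    nodes: List[str],
    log_density: Callable[[FrozenSet[Edge]], float],
    mean_edges: float,
    std_edges: float,
) -> Tuple[FrozenSet[Edge], float]:
    """Greedy structure search: repeatedly add or remove the single edge
    that most improves log-density plus log-prior, until nothing improves."""
    candidates = [tuple(sorted(pair)) for pair in itertools.combinations(nodes, 2)]

    def score(edges: FrozenSet[Edge]) -> float:
        return log_density(edges) + log_edge_prior(len(edges), mean_edges, std_edges)

    current: FrozenSet[Edge] = frozenset()
    current_score = score(current)
    while True:
        best, best_score = current, current_score
        for edge in candidates:
            # Toggle the edge: remove it if present, add it otherwise.
            neighbor = current - {edge} if edge in current else current | {edge}
            neighbor_score = score(neighbor)
            if neighbor_score > best_score:
                best, best_score = neighbor, neighbor_score
        if best == current:
            return current, current_score
        current, current_score = best, best_score


# Toy usage with hypothetical per-edge gains standing in for the model density.
gains = {("O", "P1"): 2.0, ("P1", "P2"): 1.0, ("O", "P2"): -1.0}
edges, final_score = hill_climb(
    ["O", "P1", "P2"],
    log_density=lambda es: sum(gains[e] for e in es),
    mean_edges=2.0,
    std_edges=1.0,
)
print(sorted(edges), final_score)
```

Greedy search of this kind can stop at a local optimum, which is one reason a prior on the edge count is useful for keeping the learned structure sparse.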

Model Learning
Goal: potential parameters.
Approach:
• Maximum likelihood
• Standard AdaBoost
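For a co-occurrence potential, a maximum-likelihood estimate over discrete labels reduces to normalized counts. The sketch below assumes that reading; the function name `cooccurrence_frequency` and the sample labels are hypothetical, and the slides do not state the exact form of the estimated parameters.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def cooccurrence_frequency(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Maximum-likelihood estimate of a joint distribution over label pairs:
    each pair's count divided by the total number of observed pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical (activity, object) labels from four training images.
samples = [
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("croquet shot", "croquet mallet"),
    ("volleyball smash", "volleyball"),
]
freq = cooccurrence_frequency(samples)
print(freq[("tennis forehand", "tennis racket")])
```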

Model Learning
Goal: potential weights.
Approach: max-margin learning.
Notation:
• xi: potential values of the i-th image.
• wr: potential weights of the r-th pose.
• y(r): activity of the r-th pose.
• ξi: a slack variable for the i-th image.
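The notation above fits a standard multi-class max-margin formulation, which can be sketched as follows. This is an assumption rather than the authors' stated objective: the margin constraint (the true pose must outscore every pose of a different activity by at least 1, up to the slack), the trade-off parameter `beta`, and all numeric values are hypothetical.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def slack(x: List[float], true_pose: int,
          weights: Dict[int, List[float]], activity: Dict[int, str]) -> float:
    """Slack for one image: the largest margin violation against any pose r
    whose activity y(r) differs from the activity of the true pose."""
    true_score = dot(weights[true_pose], x)
    worst = 0.0
    for r, w_r in weights.items():
        if activity[r] == activity[true_pose]:
            continue  # poses of the same activity are not competitors
        worst = max(worst, 1.0 - (true_score - dot(w_r, x)))
    return worst


def objective(xs: List[List[float]], true_poses: List[int],
              weights: Dict[int, List[float]], activity: Dict[int, str],
              beta: float = 1.0) -> float:
    """Regularizer on the weights plus beta times the total slack."""
    reg = 0.5 * sum(dot(w, w) for w in weights.values())
    total_slack = sum(slack(x, c, weights, activity)
                      for x, c in zip(xs, true_poses))
    return reg + beta * total_slack


# Hypothetical weights for three poses; poses 0 and 1 share one activity.
weights = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 1.0]}
activity = {0: "tennis serve", 1: "tennis serve", 2: "volleyball smash"}
xs = [[1.0, 0.5], [2.0, 0.0]]
true_poses = [0, 0]

print(slack(xs[0], 0, weights, activity), objective(xs, true_poses, weights, activity))
```

In practice the weights would be found by minimizing this objective with a quadratic-programming or subgradient solver; the sketch only evaluates it for fixed weights.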

Learning Results
[Figures: results for cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, and volleyball smash.]

Model Inference
• Input: the learned models.
• Detections: head detection, torso detection, tennis racket detection.
• Compositional inference [Chen et al, 2007] over the layout of the object and body parts.
• Output: [figure]

Dataset and Experiment Setup
Sport data set [Gupta et al, 2009]: 6 classes; 180 training images (supervised with object and part locations) and 120 testing images.
Classes: cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash.
Tasks:
• Object detection
• Pose estimation
• Activity classification

Object Detection Results
• Objects: cricket ball, cricket bat, croquet mallet, tennis racket, volleyball.
• Methods compared: sliding window, pedestrian context, our method. [Andriluka et al, 2009] [Dalal & Triggs, 2006]
[Figure label: valid region.]

Object Detection Results
Examples comparing sliding window, pedestrian context, and our method:
• Cricket ball: small object
• Volleyball: background clutter

Human Pose Estimation Results
• Tennis serve model: our estimation result vs. Andriluka et al, 2009.
• Volleyball smash model: our estimation result vs. Andriluka et al, 2009.
[Figures: four further estimation results.]

Activity Classification Results
• Methods compared: our model; Gupta et al, 2009; bag-of-words (SIFT + SVM).
• Figure annotations: "No scene information"; "Scene is critical!!". Example classes: cricket shot, tennis forehand.

Conclusion
Human-Object Interaction: Grouplet representation vs. mutual context model.
Next steps:
• Pose estimation & object detection on PPMI images.
• Modeling multiple objects and humans.
