Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples

Yu-Gang Jiang, Zhenguo Li, Shih-Fu Chang • IEEE Transactions on CSVT, 2011

Outline • Context-Based Action Retrieval Framework • Experiment Results • Conclusion

Framework • Video Representation and Negative Sample Selection • Obtaining Action Context • Scene Recognition • Object Recognition • Estimating Action-Scene-Object Relationship • Incorporating Multiple Contextual Cues

Context-Based Action Retrieval Framework

A. Video Representation and Negative Sample Selection • Use the bag-of-features framework • Use k-means clustering to generate 4000 visual words • Quantize each video clip into two 4000-D histograms of visual words • Apply Local and Global Consistency (LGC) [27] • Pick negative samples after propagation

[27] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," in Proc. Neural Inform. Process. Syst., 2004, pp. 321–328.
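The propagation step can be illustrated with a minimal sketch of the iterative LGC update from [27], F ← αSF + (1 − α)Y, where S is the symmetrically normalized affinity matrix. The function names, the Gaussian bandwidth `sigma`, the value of `alpha`, and the rule for picking negatives (the lowest-scoring unlabeled samples) are assumptions made for illustration; the slides do not specify them.

```python
import math

def lgc_propagate(features, labels, sigma=1.0, alpha=0.9, iters=200):
    """Iterative LGC update: F <- alpha * S * F + (1 - alpha) * Y.

    features: list of feature vectors (e.g. visual-word histograms)
    labels:   1 for a known positive sample, 0 for an unlabeled sample
    Returns one propagated score per sample.
    """
    n = len(features)
    # Gaussian affinity matrix with a zero diagonal.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    Y = [float(y) for y in labels]
    F = Y[:]
    for _ in range(iters):
        F = [alpha * sum(S[i][j] * F[j] for j in range(n)) + (1 - alpha) * Y[i]
             for i in range(n)]
    return F

def pick_negatives(scores, labels, k):
    """Assumed rule: take the k unlabeled samples with the lowest scores."""
    unlabeled = [i for i, y in enumerate(labels) if y == 0]
    return sorted(unlabeled, key=lambda i: scores[i])[:k]
```

With one labeled positive and three unlabeled clips, the unlabeled clip close to the positive receives a high score, while the two distant clips score near zero and are the ones returned by `pick_negatives`.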

B. Scene Recognition • Train a separate classifier for each of the two bag-of-features representations and average their probability predictions • The scene models are learned with SVMs • Adopt 10 scene classes
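The averaging step amounts to simple late fusion. The sketch below assumes each classifier outputs a per-class probability dictionary; the scene names and probability values are hypothetical and serve only to show the arithmetic.

```python
def average_predictions(probs_a, probs_b):
    """Late fusion: average per-class probabilities from two classifiers."""
    if probs_a.keys() != probs_b.keys():
        raise ValueError("both classifiers must score the same scene classes")
    return {scene: 0.5 * (probs_a[scene] + probs_b[scene]) for scene in probs_a}

# Hypothetical outputs of two scene classifiers, one per histogram type.
probs_feature_1 = {"kitchen": 0.70, "road": 0.10, "office": 0.20}
probs_feature_2 = {"kitchen": 0.50, "road": 0.30, "office": 0.20}
fused = average_predictions(probs_feature_1, probs_feature_2)
```

Averaging keeps the fused values a valid probability distribution whenever both inputs are.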

B. Object Recognition • The detector covers only three object classes: person, chair, and car • Define actions • Track objects based on location and box size • Discard isolated detections • Compute the average spatial distance between different types of objects
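The tracking and filtering steps above might look roughly like the following sketch. It assumes detections arrive as `(label, (x, y, w, h))` tuples per frame; the greedy frame-to-frame linking, the thresholds `max_shift` and `max_scale`, and averaging over all box pairs are illustrative choices, not the rules used in the paper.

```python
import math
from itertools import product

def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def similar(box_a, box_b, max_shift=30.0, max_scale=1.5):
    """Assumed matching rule: nearby centers and comparable box areas."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    shift = math.hypot(ax - bx, ay - by)
    area_a, area_b = box_a[2] * box_a[3], box_b[2] * box_b[3]
    scale = max(area_a, area_b) / min(area_a, area_b)
    return shift <= max_shift and scale <= max_scale

def link_tracks(frames):
    """frames: one list of (label, box) detections per frame."""
    tracks = []  # each track: {"label", "boxes", "last_frame"}
    for t, detections in enumerate(frames):
        for label, box in detections:
            for track in tracks:
                if (track["label"] == label and track["last_frame"] == t - 1
                        and similar(track["boxes"][-1], box)):
                    track["boxes"].append(box)
                    track["last_frame"] = t
                    break
            else:
                # No matching track from the previous frame: start a new one.
                tracks.append({"label": label, "boxes": [box], "last_frame": t})
    # Discard isolated detections: tracks observed in a single frame only.
    return [tr for tr in tracks if len(tr["boxes"]) > 1]

def mean_distance(tracks, label_a, label_b):
    """Average center distance over all box pairs of two object types."""
    boxes_a = [b for tr in tracks if tr["label"] == label_a for b in tr["boxes"]]
    boxes_b = [b for tr in tracks if tr["label"] == label_b for b in tr["boxes"]]
    dists = [math.hypot(center(a)[0] - center(b)[0], center(a)[1] - center(b)[1])
             for a, b in product(boxes_a, boxes_b)]
    return sum(dists) / len(dists) if dists else None
```

A detection that appears in only one frame never extends a track, so it is dropped before any distances are computed.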

C. Estimating Action-Scene-Object Relationship • Define a context-based inference score • The score should clearly separate the positive samples (P) from the negative samples (N) • The score should be similar for two samples that are close to each other

C. Estimating Action-Scene-Object Relationship • F: prediction matrix of contextual cues (n training samples, m contextual cues) • c: coefficient vector
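One natural reading of these definitions is that the context-based inference score of each sample is the product of its row of F with c. The sketch below assumes that reading, with F stored as n rows of m cue predictions; the orientation of F and all numeric values are assumptions for illustration.

```python
def context_scores(F, c):
    """Score each sample as a weighted sum of its contextual-cue predictions.

    F: n rows (one per sample), each holding m cue predictions.
    c: m coefficients, one per contextual cue.
    """
    for row in F:
        if len(row) != len(c):
            raise ValueError("each row of F needs one entry per coefficient")
    return [sum(f_ij * c_j for f_ij, c_j in zip(row, c)) for row in F]

# Hypothetical predictions of three cues on two samples.
F_example = [[0.9, 0.1, 0.8],
             [0.2, 0.7, 0.1]]
c_example = [1.0, -0.5, 0.5]
scores_example = context_scores(F_example, c_example)
```

A positive coefficient marks a cue that supports the action, and a negative one marks a cue that argues against it.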

C. Estimating Action-Scene-Object Relationship • [Equation: objective function with terms labeled Constraint 1 and Constraint 2]

D. Incorporating Multiple Contextual Cues • Given an action a and a test sample x, four quantities are involved: • the context weight parameter • the prediction score of contextual cues on x • the action prediction score based on raw visual features • the refined prediction after incorporating contextual cues
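The exact combination rule is not legible in the slide. As a hedged sketch, the snippet below assumes a simple linear fusion in which the refined prediction is the raw-feature score plus the weighted context score; the function name, the weight value, and the scores are hypothetical.

```python
def refine_score(raw_score, context_score, weight):
    """Assumed linear fusion of the raw-feature and context predictions."""
    return raw_score + weight * context_score

# Hypothetical (raw score, context score) pairs for two test clips.
clips = {"clip_a": (0.55, -0.40), "clip_b": (0.50, 0.60)}
refined = {name: refine_score(raw, ctx, weight=0.5)
           for name, (raw, ctx) in clips.items()}
ranking = sorted(refined, key=refined.get, reverse=True)
```

Under this assumption, a clip with a slightly lower raw score can overtake another once its context agrees with the action.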

Experiment Results • Mean average precision (mAP) • Retrieval Performance by Raw Features
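For reference, mAP can be computed as in the sketch below. It uses the standard definition of average precision over a ranked list and assumes every relevant clip appears somewhere in that list; the relevance lists are made-up examples, not results from the paper.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 flags in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query (here: per-action) average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical action queries with ranked relevance flags.
example_rankings = [[1, 0, 1, 0], [0, 1]]
example_map = mean_average_precision(example_rankings)
```

The first query scores (1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.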

Experiment Results • Scene vs. Object

Experiment Results • Comparison to the state of the art • SVM learning • Movie script-mining

Conclusion • An algorithm based on the semi-supervised learning paradigm is used to model action-scene-object dependencies from limited samples • The algorithm can be applied to other types of action videos
