Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples • Yu-Gang Jiang, Zhenguo Li, Shih-Fu Chang • IEEE Transactions on CSVT, 2011
Outline • Context-based Action Retrieval Framework • Experiment Results • Conclusion
Framework • Video Representation and Negative Sample Selection • Obtaining Action Context • Scene Recognition • Object Recognition • Estimating Action-Scene-Object Relationship • Incorporating Multiple Contextual Cues
A. Video Representation and Negative Sample Selection • Use the bag-of-features framework • Use k-means clustering to generate 4000 visual words • Quantize each video clip into two 4000-D histograms of visual words • Apply Local and Global Consistency (LGC) [27] • Pick negative samples after propagation
[27] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Proc. Neural Inform. Process. Syst., 2004, pp. 321–328.
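The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `bof_histogram` assigns each local descriptor to its nearest visual word, `lgc_scores` runs the iterative form of LGC from [27] (F ← αSF + (1 − α)Y), and `pick_negatives` assumes that "picking negatives after propagation" means taking the unlabeled clips with the lowest propagated scores. The function names, the Gaussian affinity, and the values of `alpha`, `sigma`, and `iters` are all assumptions made for the example.

```python
import math

def bof_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook; return an L1-normalized histogram."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        # Index of the nearest visual word (Euclidean distance)
        nearest = min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        hist[nearest] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def lgc_scores(features, positives, alpha=0.9, sigma=1.0, iters=200):
    """Propagate positive labels with the LGC iteration F <- alpha*S*F + (1-alpha)*Y."""
    n = len(features)
    # Gaussian affinity with a zero diagonal (assumed kernel choice)
    W = [[0.0 if i == j else
          math.exp(-math.dist(features[i], features[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    Y = [1.0 if i in positives else 0.0 for i in range(n)]
    F = Y[:]
    for _ in range(iters):
        F = [alpha * sum(S[i][j] * F[j] for j in range(n)) + (1 - alpha) * Y[i]
             for i in range(n)]
    return F

def pick_negatives(scores, positives, k):
    """Return the k unlabeled indices with the lowest propagated scores (assumed rule)."""
    unlabeled = [i for i in range(len(scores)) if i not in positives]
    return sorted(unlabeled, key=lambda i: scores[i])[:k]

# Toy data: two well-separated clusters, one labeled positive in the first cluster
clips = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
scores = lgc_scores(clips, positives={0})
negatives = pick_negatives(scores, positives={0}, k=2)
```

In the toy data, the two clips far from the labeled positive receive almost no propagated score, so they are the ones selected as pseudo-negatives.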
B. Scene Recognition • Train a separate classifier for each of the two bag-of-features representations and average their probability predictions • The scene models are learned by SVM • Adopt 10 scene classes
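Averaging the two classifiers' outputs is a simple late-fusion step. The sketch below assumes each classifier already returns a per-class probability for a clip; the scene names and probability values are made up for illustration and are not the 10 classes used in the paper.

```python
def average_fusion(prob_a, prob_b):
    """Average per-class probabilities from two classifiers for one clip."""
    if prob_a.keys() != prob_b.keys():
        raise ValueError("both classifiers must score the same scene classes")
    return {scene: 0.5 * (prob_a[scene] + prob_b[scene]) for scene in prob_a}

# Hypothetical outputs of two scene classifiers, one per bag-of-features histogram
p_feature1 = {"kitchen": 0.7, "street": 0.2, "office": 0.1}
p_feature2 = {"kitchen": 0.5, "street": 0.1, "office": 0.4}

fused = average_fusion(p_feature1, p_feature2)
best_scene = max(fused, key=fused.get)  # class with the highest averaged probability
```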
B. Object Recognition • The detector covers only three object classes: person, chair, and car • Define actions • Track objects across frames based on location and box size • Discard isolated detections • Compute the average spatial distance between different types of objects
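The tracking and distance steps above might look something like the following. This is only a sketch under stated assumptions: detections are `(frame, label, (x, y, w, h))` tuples with positive width and height, linking is greedy between consecutive frames, and the thresholds `max_shift`, `max_scale`, and `min_length` are hypothetical values, not taken from the paper.

```python
import math
from itertools import product

def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def same_object(box_a, box_b, max_shift=30.0, max_scale=1.5):
    """Link two boxes if their centers are close and their areas are similar."""
    shift = math.dist(center(box_a), center(box_b))
    area_a, area_b = box_a[2] * box_a[3], box_b[2] * box_b[3]
    ratio = max(area_a, area_b) / min(area_a, area_b)
    return shift <= max_shift and ratio <= max_scale

def build_tracks(detections, min_length=2):
    """Greedily link detections (sorted by frame) into tracks; drop isolated ones."""
    tracks = []
    for frame, label, box in detections:
        for track in tracks:
            last_frame, last_label, last_box = track[-1]
            if (last_label == label and frame == last_frame + 1
                    and same_object(last_box, box)):
                track.append((frame, label, box))
                break
        else:
            tracks.append([(frame, label, box)])  # no match: start a new track
    return [t for t in tracks if len(t) >= min_length]

def mean_distance(tracks, label_a, label_b):
    """Average center distance between two object types over frames where both appear."""
    by_frame = {}
    for track in tracks:
        for frame, label, box in track:
            by_frame.setdefault(frame, {}).setdefault(label, []).append(center(box))
    dists = [math.dist(a, b)
             for objs in by_frame.values()
             for a, b in product(objs.get(label_a, []), objs.get(label_b, []))]
    return sum(dists) / len(dists) if dists else None

detections = [
    (0, "person", (0, 0, 10, 20)), (0, "chair", (30, 0, 10, 20)),
    (1, "person", (2, 0, 10, 20)), (1, "chair", (30, 0, 10, 20)),
    (5, "car", (100, 100, 50, 30)),  # isolated detection, discarded below
]
tracks = build_tracks(detections)
person_chair = mean_distance(tracks, "person", "chair")
```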
C. Estimating Action-Scene-Object Relationship • Define a context-based inference score • The score should clearly separate the positive samples P from the negative samples N • The score should be similar for two samples that are close to each other
C. Estimating Action-Scene-Object Relationship • F : prediction matrix of the m contextual cues on the n training samples • c : coefficient vector over the m contextual cues
C. Estimating Action-Scene-Object Relationship • Objective annotated with Constraint 1 and Constraint 2 (equation not preserved in the slide text)
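Since the slide's equation is not preserved, the following is only one plausible reading of the two requirements stated earlier (separate P from N; give close samples similar scores), not the paper's actual objective. It assumes F has one row per training sample and one column per contextual cue, labels of +1 for P and −1 for N, a similarity graph `W` over the samples, and a hypothetical trade-off weight `mu`.

```python
def context_scores(F, c):
    """Score each sample as a weighted sum of its contextual-cue predictions (s = F c)."""
    return [sum(f_ij * c_j for f_ij, c_j in zip(row, c)) for row in F]

def objective(F, c, labels, W, mu=1.0):
    """Assumed objective: label-fitting term plus smoothness term over graph W."""
    s = context_scores(F, c)
    # Fitting term: scores should match +1 on positives (P) and -1 on negatives (N)
    fit = sum((s_i - y_i) ** 2 for s_i, y_i in zip(s, labels))
    # Smoothness term: samples that are close (large W[i][j]) should score alike
    n = len(s)
    smooth = 0.5 * sum(W[i][j] * (s[i] - s[j]) ** 2
                       for i in range(n) for j in range(n))
    return fit + mu * smooth

# Toy example: two samples, two cues, one positive and one negative
F = [[1.0, 0.0], [0.0, 1.0]]
c = [1.0, -1.0]
labels = [1.0, -1.0]
W = [[0.0, 1.0], [1.0, 0.0]]
value = objective(F, c, labels, W)
```

In this toy case the coefficient vector fits both labels exactly, so the whole objective value comes from the smoothness penalty between the two linked samples.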
D. Incorporating Multiple Contextual Cues • Given an action a and a test sample x, the slide defines: • the context weight parameter • the prediction score of the contextual cues on x • the action prediction score based on raw visual features • the refined prediction after incorporating contextual cues
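The slide's symbols did not survive extraction, so the exact combination rule is not recoverable here. As an illustration only, the sketch below assumes the refined prediction is the raw-feature score plus the weighted context score, with the context score computed as a dot product of cue predictions and learned coefficients. The function name, argument names, and numeric values are hypothetical.

```python
def refine_score(raw_score, cue_predictions, coefficients, context_weight):
    """Assumed linear fusion: raw score + context_weight * (cue predictions . coefficients)."""
    context_score = sum(p * c for p, c in zip(cue_predictions, coefficients))
    return raw_score + context_weight * context_score

# Hypothetical values: two contextual cues (e.g. one scene, one object) for a test clip x
refined = refine_score(
    raw_score=0.4,
    cue_predictions=[0.9, 0.2],
    coefficients=[0.5, -0.5],
    context_weight=0.2,
)
```

Under this assumption, setting the context weight to zero recovers the raw-feature prediction, which gives a natural baseline for comparison.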
Experiment Results • Mean average precision (mAP) • Retrieval Performance by Raw Features
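For reference, mAP can be computed as in the sketch below. It assumes each query (action) yields a full ranking of the database, given as a list of 0/1 relevance flags in ranked order, so that every relevant clip appears somewhere in the list.

```python
def average_precision(ranked_relevance):
    """AP for one ranked list of 0/1 relevance flags (best-ranked item first)."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant item
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy queries: relevant clips at ranks 1 and 3, and at rank 2
m_ap = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```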
Experiment Results • Scene vs. Object
Experiment Results • Comparison to the state of the art • SVM learning • Movie script-mining
Conclusion • An algorithm based on the semi-supervised learning paradigm models the action-scene-object dependency from limited samples • The algorithm can be applied to other types of action videos