
Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples



  1. Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples Yu-Gang Jiang, Zhenguo Li, Shih-Fu Chang IEEE Transactions on Circuits and Systems for Video Technology (CSVT), 2011

  2. Outline • Context-based Action Retrieval Framework • Experiment Results • Conclusion

  3. Framework • Video Representation and Negative Sample Selection • Obtaining Action Context • Scene Recognition • Object Recognition • Estimating Action-Scene-Object Relationship • Incorporating Multiple Contextual Cues

  4. Context-Based Action Retrieval Framework

  5. A. Video Representation and Negative Sample Selection • Use the bag-of-features framework

  6. A. Video Representation and Negative Sample Selection • Use the bag-of-features framework • Use k-means clustering to generate 4000 visual words

  7. A. Video Representation and Negative Sample Selection • Use the bag-of-features framework • Use k-means clustering to generate 4000 visual words • Quantize each video clip into two 4000-D histograms of visual words
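The quantization step described above can be sketched as follows. The descriptor dimensionality, the random stand-in data, and the small vocabulary size are illustrative assumptions (the slides use k-means with 4000 visual words):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical local descriptors (e.g. 128-D appearance/motion features);
# sizes and values are illustrative stand-ins, not the paper's data.
rng = np.random.default_rng(0)
train_descriptors = rng.random((5000, 128))  # descriptors pooled over training clips
clip_descriptors = rng.random((300, 128))    # descriptors of a single video clip

# Learn the visual vocabulary with k-means (the slides use 4000 words;
# a small k keeps this sketch fast).
k = 50
kmeans = KMeans(n_clusters=k, n_init=3, random_state=0).fit(train_descriptors)

# Quantize the clip: assign each descriptor to its nearest visual word
# and build a normalized k-D histogram of word counts.
words = kmeans.predict(clip_descriptors)
histogram = np.bincount(words, minlength=k).astype(float)
histogram /= histogram.sum()
```

In the paper's setup each clip yields two such histograms, one per feature channel.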

  8. A. Video Representation and Negative Sample Selection • Use the bag-of-features framework • Use k-means clustering to generate 4000 visual words • Quantize each video clip into two 4000-D histograms of visual words • Apply Local and Global Consistency(LGC) [27] • Pick negative samples after propagation [27] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Proc. Neural Inform. Process. Syst., 2004, pp. 321–328.
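The LGC propagation of [27] used for negative sample selection can be sketched roughly as below. The Gaussian affinity, the `sigma`/`alpha` values, and the toy 1-D data are assumptions for illustration; the idea is that clips receiving the lowest propagated scores are least related to the positives and can be picked as negatives:

```python
import numpy as np

def lgc_propagate(X, y, alpha=0.99, sigma=1.0):
    """Learning with Local and Global Consistency (Zhou et al., 2004).

    X : (n, d) feature matrix (e.g. the clips' visual-word histograms)
    y : (n,) initial labels, 1 for the few positive examples, 0 otherwise
    Returns the propagated scores (I - alpha * S)^{-1} y.
    """
    # Gaussian affinity with a zeroed diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    # Closed form of the propagation iteration F <- alpha*S*F + (1-alpha)*y
    # (the constant (1 - alpha) scaling is dropped; it does not affect ranking)
    return np.linalg.solve(np.eye(len(X)) - alpha * S, y.astype(float))

# Toy usage: two clips near the positive example, two far away.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([1, 0, 0, 0, 0])
scores = lgc_propagate(X, y)
negatives = np.argsort(scores)[:2]  # the two least-related clips
```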

  9. Context-Based Action Retrival Framework

  10. B. Scene Recognition • Train a separate classifier on each of the two bag-of-features representations and average their probability predictions • The scene models are learned with SVMs • Adopt 10 scene classes
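The late-fusion scheme on this slide (one SVM per bag-of-features channel, probability predictions averaged) might look like the sketch below; the toy features, labels, and sizes are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the two bag-of-features channels of the same clips.
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
chan1 = rng.random((40, 8)) + labels[:, None]        # channel-1 histograms
chan2 = rng.random((40, 8)) + 0.5 * labels[:, None]  # channel-2 histograms

# One probabilistic SVM per channel
svm1 = SVC(probability=True, random_state=0).fit(chan1, labels)
svm2 = SVC(probability=True, random_state=0).fit(chan2, labels)

# Late fusion: average the per-class probability predictions
p = 0.5 * (svm1.predict_proba(chan1) + svm2.predict_proba(chan2))
scene_pred = p.argmax(axis=1)
```

Averaging probabilities (rather than concatenating features) keeps each channel's classifier independent, which is the scheme the slide describes.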

  11. B. Object Recognition • The detector covers only three object classes: person, chair, and car • Define actions • Track objects based on location and box size • Discard isolated detections • Compute the average spatial distance between different types of objects
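The last bullet, the average spatial distance between object types, can be sketched as below; the box coordinates and the `avg_spatial_distance` helper are hypothetical, not from the paper:

```python
import numpy as np

def box_center(box):
    """Center (x, y) of a detection box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def avg_spatial_distance(boxes_a, boxes_b):
    """Average center-to-center distance between detections of two object
    types (e.g. person vs. chair), a simple proxy for the spatial-relation
    cue described on the slide."""
    dists = [np.linalg.norm(box_center(a) - box_center(b))
             for a in boxes_a for b in boxes_b]
    return float(np.mean(dists))

# Hypothetical detections in one clip: one person, two chairs
person = [(10, 10, 30, 60)]
chairs = [(40, 30, 60, 60), (80, 30, 100, 60)]
d = avg_spatial_distance(person, chairs)
```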

  12. B. Object Recognition

  13. Context-Based Action Retrieval Framework

  14. C. Estimating Action-Scene-Object Relationship • Define a context-based inference score that • distinguishes well between samples from P (positives) and N (negatives) • produces similar scores when two samples are close

  15. C. Estimating Action-Scene-Object Relationship • F : m × n prediction matrix of the contextual cues (m contextual cues, n training samples) • c : coefficient vector over the m cues

  16. C. Estimating Action-Scene-Object Relationship • Constraint 1 and Constraint 2 (the corresponding equations are shown on the slide)
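Assuming the context-based inference score combines the m cue predictions linearly as cᵀF (consistent with the F and c defined on slide 15, though the paper's exact formulation may differ), a minimal sketch with illustrative numbers:

```python
import numpy as np

# Hypothetical sizes: m = 3 contextual cues, n = 4 samples.
# F[i, j] is cue i's prediction score on sample j.
F = np.array([[0.9, 0.8, 0.2, 0.1],   # cue 1 (e.g. a scene model)
              [0.7, 0.6, 0.3, 0.2],   # cue 2
              [0.5, 0.9, 0.1, 0.4]])  # cue 3 (e.g. an object model)
c = np.array([0.5, 0.3, 0.2])         # learned cue weights (illustrative)

# One context-based inference score per sample
scores = c @ F
```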

  17. Context-Based Action Retrieval Framework

  18. D. Incorporating Multiple Contextual Cues • Given an action a and a test sample x, the slide defines: the context weight parameter, the prediction score of the contextual cues on x, the action prediction score based on raw visual features, and the refined prediction after incorporating the contextual cues
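A minimal sketch of this fusion step, assuming a simple linear blend controlled by the context weight (the paper's exact fusion form may differ; all values here are illustrative):

```python
def refine(raw_score, context_score, lam=0.5):
    """Refined prediction for a test sample x: blend the raw visual-feature
    score with the contextual-cue score using the context weight `lam`.
    A linear blend is an assumption made for this sketch."""
    return (1.0 - lam) * raw_score + lam * context_score

refined = refine(raw_score=0.4, context_score=0.8, lam=0.5)  # -> 0.6
```

With `lam = 0` the contextual cues are ignored and the raw prediction is returned unchanged.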

  19. Experiment Results • Mean average precision (mAP) • Retrieval Performance by Raw Features
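The evaluation metric can be computed per action query as below; mAP is then the mean of these AP values over all queries. The toy labels and scores are illustrative:

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision of one ranked retrieval list.

    labels : 0/1 relevance per sample; scores : retrieval scores.
    AP averages the precision measured at the rank of each relevant hit.
    """
    order = np.argsort(scores)[::-1]          # rank samples by score, descending
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                  # relevant items seen so far
    ranks = np.arange(1, len(labels) + 1)
    precisions = hits / ranks                 # precision at each rank
    return float((precisions * labels).sum() / labels.sum())

# A perfect ranking gives AP = 1.0
ap = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```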

  20. Experiment Results • Scene vs. Object

  21. Experiment Results • Scene vs. Object

  22. Experiment Results • Comparison to the state of the art • SVM learning • Movie script-mining

  23. Conclusion • An algorithm based on the semi-supervised learning paradigm is used to model the action-scene-object dependency from limited samples • The algorithm can also be applied to other types of action videos
