Cross-View Action Recognition via View Knowledge Transfer
Jingen Liu¹, Mubarak Shah², Benjamin Kuipers¹, Silvio Savarese¹
¹ Department of EECS, University of Michigan, Ann Arbor, MI, USA
² Department of EECS, University of Central Florida, Orlando, FL, USA
• IEEE International Conference on Computer Vision and Pattern Recognition, 2011
Cross-View Action Recognition
• View 1: has labeled examples to train an action classifier F1
• View 2: has NO training examples for the action of interest, e.g., “Checking watch”
• Question: how can knowledge of view 1 be used to recognize unknown actions in view 2?
[Figure: the classifier trained on the low-level feature representation of “Checking watch” in View 1 is applied to the low-level feature representation of “Checking watch” in View 2, with an unknown outcome (“?”).]
Cross-View Action Recognition
• Directly use classifier F1 to recognize actions of view 2?
  • No! Performance decreases dramatically
  • Motion appearance looks very different across views
Analogy to Text Analysis
• Cross-lingual text categorization/retrieval [Bel et al. 2004, Pirkola 98]
  • Translate the texts into a common language
  • E.g., an interlingua, as used in machine translation [Hutchins et al. 92]
• Underlying assumption: a word-by-word translation is available
[Figure: texts in Chinese and in French are mapped to a common language or to an interlingua.]
Our Proposal
• An “action view interlingua”
  • Treat each viewpoint as a language and construct its vocabulary
  • Model an action by a Bag-of-Visual-Words (BoVW)
  • Translate the two BoVWs into an “action view interlingua”
[Figure: videos of View 1 and View 2 are quantized with vocabularies V1 and V2 into histograms of visual words, and both histograms are mapped to the action view interlingua.]
Previous Work
• Geometry-based approaches
  • Geometric measurement of body joints
    • C. Rao et al. IJCV 2002, V. Parameswaran et al. IJCV 2006, etc.
    • Require stable body joint detection and tracking
  • 3D reconstruction related
    • D. Weinland et al. ICCV07, P. Yan et al. CVPR08, F. Lv et al. ICCV07, D. Gavrila et al. CVPR96, R. Li et al. ICCV07, etc.
    • Strict alignment between views
    • Computationally expensive reconstruction
• Temporal self-similarity matrix [Junejo et al. ECCV08]
  • No knowledge transfer
  • Poor performance on the top view
Previous Work
• Transfer-based approaches
  • Farhadi et al. ECCV08
    • Requires feature-to-feature correspondence at the frame level
    • Mapping is provided by a trained predictor
    • Mapping is conducted in one direction only
  • Farhadi et al. ICCV 09
    • Abstract discriminative aspects
    • Training a hash mapping
    • No explicit model transfer
Our Contributions
• Advantages of our approach
  • More flexible: no geometric constraints, no human body joint detection or tracking, and no 3D reconstruction
  • No requirement for strict temporal alignment
  • Two-directional mapping rather than one-directional
  • No supervision needed for bilingual word discovery
  • Fuses transferred multi-view knowledge using the Locally Weighted Ensemble method
[Figure: information exchange between first-view features and second-view features.]
Our Framework
• Phase I: Discovery of bilingual words
  • Given N pairs of unlabelled videos captured from two views
  • Learn two view-dependent visual vocabularies
  • Discover bilingual words by bipartite graph partitioning
[Figure: the BoVW models M1 (first view, vocabulary V1) and M2 (second view, vocabulary V2) form the training data matrix M; a bipartite graph between V1 and V2 is partitioned to obtain bilingual words and bag-of-bilingual-words (BoBW) models.]
Our Framework
• Phase II: cross-view novel action recognition
[Figure: training videos of the source view are converted from Bag-of-Visual-Words to Bag-of-Bilingual-Words using the bilingual words, and the source-view action model is learned from them (training the classifier on the source view). Testing videos of the target view are converted in the same way, and the novel action is recognized with that model (testing the classifier on the target view).]
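The Phase II flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the word-to-bilingual-word assignments (`source_map`, `target_map`), the toy histograms, and the nearest-neighbour classifier are all hypothetical stand-ins (the slides do not say which classifier is used).

```python
def to_bilingual(histogram, word_to_bilingual, num_bilingual):
    """Pool a visual-word histogram into an L1-normalised bag of bilingual words."""
    bag = [0.0] * num_bilingual
    for word, count in enumerate(histogram):
        bag[word_to_bilingual[word]] += count
    total = sum(bag)
    return [b / total for b in bag] if total else bag


def nearest_neighbour(query, training):
    """Return the label of the training bag closest to `query` in L1 distance."""
    def l1(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))
    return min(training, key=lambda item: l1(query, item[0]))[1]


# Hypothetical Phase I output: each view-specific word index -> bilingual word index.
source_map = [0, 0, 0, 1, 1]
target_map = [0, 0, 1, 1, 1]

# Labelled source-view videos (toy BoVW histograms), re-expressed in bilingual words.
source_training = [
    (to_bilingual([3, 2, 4, 0, 1], source_map, 2), "check-watch"),
    (to_bilingual([0, 1, 0, 5, 4], source_map, 2), "kick"),
]

# An unlabelled target-view video goes through the target-view mapping,
# then is classified by the model trained only on source-view data.
target_query = to_bilingual([4, 4, 1, 0, 1], target_map, 2)
predicted = nearest_neighbour(target_query, source_training)
```

Because both views end up in the same bilingual-word space, the classifier never sees view-specific vocabulary indices.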
Low-level Action Representation
• Acquiring the training matrix M
[Figure: for View 1 and View 2, a feature detector extracts 3D cuboids; feature clustering groups them into visual words (e.g., Visual Word A, Visual Word B) that form the visual vocabulary; each video is then described by a video-words histogram (Bag-of-Visual-Words model). M is indexed by examples and visual words.]
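As a rough sketch of the last step of this pipeline, the snippet below quantizes the descriptors of one video against a fixed vocabulary and counts the hits per visual word. The 2-D descriptors and the three cluster centres are toy values chosen for illustration; real cuboid descriptors are much higher-dimensional, and the vocabulary would come from clustering.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to `descriptor`."""
    best_index, best_dist = 0, math.inf
    for index, centre in enumerate(vocabulary):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, centre))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, vocabulary):
    """Count how many descriptors of one video fall on each visual word."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, vocabulary)] += 1
    return histogram


# Toy vocabulary of three visual words and four descriptors from one video.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
video_descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]

# One histogram per video; stacking them gives the entries of M for one view.
histogram = bovw_histogram(video_descriptors, vocabulary)
```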
Bipartite Graph Modeling
• Build a bipartite graph between the two views
  • X: visual words of view 1; Y: visual words of view 2
  • The edge weight matrix W is defined in terms of S, where S is a similarity matrix
• Generate the similarity matrix S
  • In the column space of M, each entry S(i, j) of S can be estimated
[Figure: training matrix M (video examples by visual words, source view and target view) and the bipartite graph with weights W between X and Y.]
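One way to picture this construction is sketched below. Two assumptions are made here that the slide does not confirm: S(i, j) is taken to be the cosine similarity between column i of the view-1 block of M and column j of the view-2 block, and W uses the usual bipartite block form with S and its transpose off the diagonal and zeros elsewhere. The matrices `M1` and `M2` are toy data.

```python
import math


def column(matrix, j):
    """Column j of a row-major matrix (one value per video pair)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(M1, M2):
    """S[i][j]: similarity of view-1 word i and view-2 word j across video pairs."""
    n1, n2 = len(M1[0]), len(M2[0])
    return [[cosine(column(M1, i), column(M2, j)) for j in range(n2)]
            for i in range(n1)]


def bipartite_weights(S):
    """Assumed block form: W = [[0, S], [S^T, 0]] (no edges within a view)."""
    n1, n2 = len(S), len(S[0])
    W = [[0.0] * (n1 + n2) for _ in range(n1 + n2)]
    for i in range(n1):
        for j in range(n2):
            W[i][n1 + j] = S[i][j]
            W[n1 + j][i] = S[i][j]
    return W


# Toy data: 3 video pairs; 2 visual words in view 1, 3 visual words in view 2.
M1 = [[2, 0], [1, 0], [0, 3]]
M2 = [[4, 0, 0], [2, 0, 1], [0, 5, 0]]
S = similarity_matrix(M1, M2)
W = bipartite_weights(S)
```

Words that tend to fire in the same video pairs get a high S entry even though they live in different vocabularies, which is what the partitioning step relies on.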
Bipartite Graph Bi-Partitioning
• Bipartite graph partition:
  • [1] H. Zha, X. He, C. Ding, H. Simon & M. Gu, CIKM 2001
  • [2] I. S. Dhillon, SIGKDD 2001
[Figure: A. before partition; B. after partition. Two clusters, (1, 2, 3; a, b) and (4, 5; c, d, e), give two bilingual words.]
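A minimal sketch of a spectral bi-partition in the spirit of [2] follows. It is a simplification, not the cited algorithm verbatim: the second singular vectors of the degree-normalised S are found by power iteration with deflation, and the split is made by sign rather than by k-means. The matrix `S` is a toy example arranged to mirror the two clusters in the figure.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def bipartition(S, iterations=500):
    """Split rows and columns of S into two co-clusters (True/False flags)."""
    n1, n2 = len(S), len(S[0])
    d1 = [sum(row) for row in S]                               # row degrees
    d2 = [sum(S[i][j] for i in range(n1)) for j in range(n2)]  # column degrees
    A = [[S[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n2)]
         for i in range(n1)]
    # The leading right singular vector of A is known in closed form.
    v1 = normalize([math.sqrt(d) for d in d2])
    v = [float(j + 1) for j in range(n2)]  # arbitrary deterministic start
    for _ in range(iterations):
        # Deflate the leading direction, then apply A^T A once.
        proj = sum(a * b for a, b in zip(v, v1))
        v = [a - proj * b for a, b in zip(v, v1)]
        u = [sum(A[i][j] * v[j] for j in range(n2)) for i in range(n1)]
        v = normalize([sum(A[i][j] * u[i] for i in range(n1))
                       for j in range(n2)])
    u = [sum(A[i][j] * v[j] for j in range(n2)) for i in range(n1)]
    # Rescale by degree to match the usual embedding; the sign is unaffected.
    rows = [u[i] / math.sqrt(d1[i]) >= 0 for i in range(n1)]
    cols = [v[j] / math.sqrt(d2[j]) >= 0 for j in range(n2)]
    return rows, cols


# Toy similarity: view-1 words 1..5 (rows) against view-2 words a..e (columns).
S = [
    [0.9, 0.8, 0.0, 0.1, 0.0],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.8, 0.7],
    [0.0, 0.1, 0.8, 0.9, 0.9],
]
rows, cols = bipartition(S)
```

Rows and columns that share a flag form one bilingual word. For more than two bilingual words, the cited methods use several singular vectors followed by clustering.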
IXMAS Data Set
• IXMAS videos: 11 actions performed by 10 actors, taken from 5 views
[Figure: example frames of Check-watch, Scratch-head, Wave-hand, Kicking, Pick-up, and Sit-down from cameras C0 to C4.]
Data Partition
[Figure: the IXMAS action classes (Check-watch, Scratch-head, Wave-hand, Pick-up, Kick, Sit-down, ...) are split into classes Y and classes Z, each observed in a source view and a target view. Panel labels: “Learning Bilingual Words”; “Training Z classes” in one setting and “Training Z+Y classes” in the other; “Testing Z classes”.]
Results on View Knowledge Transfer
• “W/O” and “W/” show the results without and with view knowledge transfer via the bag of bilingual words, respectively
• Average: “W/O” = 10.9%, “W/” = 67.4%
[Table: results for each pair of training view and testing view.]
Performance Comparison
• Low-level features: ST cuboids + shape-flow features [D. Tran et al. ECCV 2008]
• Columns “A”: A. Farhadi et al. ECCV 2008
• Columns “B”: I. N. Junejo et al. ECCV 2008
• Columns “C”: A. Farhadi et al. ICCV 2009
Transferred Knowledge Fusion
• One target view vs. n-1 source views
• Each source view has its own action classifier
• How to fuse this knowledge into a final decision?
• Locally Weighted Ensemble strategy [Gao et al. SIGKDD 08]
[Figure: positive (+) and negative (–) examples with the decision regions (R) of the classifier of source 1, the classifier of source 2, and their fusion.]
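The fusion idea can be sketched as a per-example weighted vote. The sketch below only loosely follows the locally weighted ensemble idea and makes its own assumptions: a classifier's weight at a test example is the Jaccard overlap between the examples it labels like that example and the examples that share its cluster. The cluster assignments, labels, and posteriors are toy values.

```python
def local_weight(model_labels, cluster_ids, index):
    """Agreement between a model's labelling and the clustering around one example."""
    same_label = {i for i, lab in enumerate(model_labels)
                  if i != index and lab == model_labels[index]}
    same_cluster = {i for i, cid in enumerate(cluster_ids)
                    if i != index and cid == cluster_ids[index]}
    union = same_label | same_cluster
    return len(same_label & same_cluster) / len(union) if union else 0.0


def fuse(posteriors, weights):
    """Weighted average of per-classifier class posteriors for one example."""
    total = sum(weights)
    if total == 0:  # fall back to a plain average if no classifier is trusted
        weights, total = [1.0] * len(weights), float(len(weights))
    fused = {c: sum(w * p[c] for w, p in zip(weights, posteriors)) / total
             for c in posteriors[0]}
    return max(fused, key=fused.get), fused


# Six unlabelled target-view examples, clustered into two groups (toy values).
cluster_ids = [0, 0, 0, 1, 1, 1]
labels_a = ["+", "+", "+", "-", "-", "-"]  # source classifier A agrees with the clusters
labels_b = ["+", "-", "+", "-", "+", "-"]  # source classifier B does not

# Fuse the two classifiers' posteriors for example 0.
weight_a = local_weight(labels_a, cluster_ids, 0)
weight_b = local_weight(labels_b, cluster_ids, 0)
label, fused = fuse([{"+": 0.7, "-": 0.3}, {"+": 0.2, "-": 0.8}],
                    [weight_a, weight_b])
```

The classifier whose decisions match the local structure of the target data dominates the vote, so a source view that transfers poorly to this region of the target view is down-weighted there.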
Knowledge Fusion Results
• Each column denotes a testing (target) view; the remaining four views are the source views
Summary
• Create an “action view interlingua” for cross-view action recognition
• Bilingual words serve as a bridge for view knowledge transfer
• Fuse the knowledge transferred from multiple views using the Locally Weighted Ensemble method
• Our approach achieves state-of-the-art performance
Thank You!
Acknowledgements: UMich Intelligent Robotics Lab, UMich Computer Vision Lab, UCF Computer Vision Lab, NSF