Cross-View Action Recognition via View Knowledge Transfer

Jingen Liu1, Mubarak Shah2, Benjamin Kuipers1, Silvio Savarese1
1 Department of EECS, University of Michigan, Ann Arbor, MI, USA
2 Department of EECS, University of Central Florida, Orlando, FL, USA
IEEE International Conference on Computer Vision and Pattern Recognition, 2011

Cross-View Action Recognition
• View 1: has labeled examples to train an action classifier F1
• View 2: has NO training examples for the action, e.g., “Checking watch”
• Question: How can knowledge from view 1 be used to recognize unknown actions in view 2?
[Figure: “Checking watch” seen from View 1 and View 2, each with a low-level feature representation; the classifier from View 1 is marked “?” for View 2]

Cross-View Action Recognition
• Can classifier F1 be used directly to recognize actions in view 2?
• No! Performance decreases dramatically
• Motion appearance looks very different across views
[Figure: “Checking watch” seen from View 1 and View 2, each with a low-level feature representation; the classifier from View 1 is marked “?” for View 2]

Analogy to Text Analysis
• Cross-lingual text categorization/retrieval [Bel et al. 2004, Pirkola 98]
• Translate them into a common language
• E.g., an interlingua, as used in machine translation [Hutchins et al. 92]
• Underlying assumption: having word-by-word
[Figure: text in Chinese and in French mapped to common languages or an interlingua]

Our Proposal
• An “action view interlingua”
• Treat each viewpoint as a language; construct a vocabulary for each
• Model an action by a Bag-of-Visual-Words (BoVW)
• Translate the two BoVWs into an “action view interlingua”
[Figure: videos of View 1 and View 2 -> vocabularies V1 and V2 -> histograms of visual words -> an action view interlingua]

Previous Work
• Geometry-based approaches
• Geometric measurement of body joints
• C. Rao et al. IJCV 2002, V. Parameswaran et al. IJCV 2006, etc.
• Require stable body joint detection and tracking
• 3D reconstruction related
• D. Weinland et al. ICCV07, P. Yan et al. CVPR08, F. Lv et al. ICCV07, D. Gavrila et al. CVPR96, R. Li et al. ICCV07, etc.
• Strict alignments between views
• Computationally expensive in reconstruction
• Temporal self-similarity matrix [Junejo et al. ECCV08]
• No knowledge transfer
• Poor performance on top view

Previous Work
• Transfer-based approaches
• Farhadi et al. ECCV08
• Requires feature-to-feature correspondence at frame level
• Mapping is provided by a trained predictor
• Mapping is conducted in one direction
• Farhadi et al. ICCV09
• Abstract discriminative aspects
• Training a hash mapping
• No explicit model transfer

Our Contributions
• Advantages of our approach
• More flexible: no geometry constraints, human body joint detection and tracking, or 3D reconstruction required
• No requirement for strict temporal alignment
• Bidirectional mapping rather than one-directional
• No supervision needed for bilingual word discovery
• Fuse transferred multi-view knowledge using the Locally Weighted Ensemble method
[Figure: information exchange between first-view features and second-view features]

Our Framework
• Phase I: Discovery of bilingual words
• Given N pairs of unlabelled videos captured from two views
• Learn two view-dependent visual vocabularies
• Discover bilingual words by bipartite graph partitioning
[Figure: Phase I pipeline. Labels: First View, Second View, Vocabulary V1, Vocabulary V2, BoVW models MS and MT, Training Data Matrix M, Bipartite Graph, Graph Partitioning, Bilingual Words]

Our Framework
• Phase I: Discovery of bilingual words
• Given N pairs of unlabelled videos captured from two views
• Learn two view-dependent visual vocabularies
• Discover bilingual words by bipartite graph partitioning
[Figure: Phase I pipeline. Labels: First View, Second View, Vocabulary V1, Vocabulary V2, BoVW models M1 and M2, Training Data Matrix M, Bipartite Graph, Graph Partitioning, Bilingual Words, BoBW models]

Our Framework
• Phase II: cross-view novel action recognition
[Figure: Phase II pipeline. Source view: training videos -> Bag-of-Visual-Words -> Bag-of-Bilingual-Words -> action model learning (training classifier on source view). Target view: testing videos -> Bag-of-Visual-Words -> Bag-of-Bilingual-Words -> novel action recognition (testing classifier on target view). Both views use the same bilingual words.]
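The conversion from Bag-of-Visual-Words to Bag-of-Bilingual-Words in Phase II can be sketched as follows. This is a minimal illustration, assuming that each visual word has already been assigned to exactly one bilingual word and that the bilingual histogram simply sums the counts of its member words; the function name `to_bilingual_histogram` and the word assignments are hypothetical, and the slides do not spell out the exact conversion.

```python
import numpy as np

def to_bilingual_histogram(bovw_hist, word_to_bilingual, n_bilingual):
    """Sum the counts of all visual words assigned to the same bilingual word."""
    return np.bincount(word_to_bilingual, weights=bovw_hist, minlength=n_bilingual)

# Hypothetical assignments: source-view words 1,2,3 and target-view words a,b
# belong to bilingual word 0; source-view words 4,5 and target-view words c,d,e
# belong to bilingual word 1.
source_word_to_bilingual = np.array([0, 0, 0, 1, 1])
target_word_to_bilingual = np.array([0, 0, 1, 1, 1])

# View-dependent BoVW histograms of one video per view (made-up counts).
source_hist = np.array([4, 3, 2, 1, 0])
target_hist = np.array([5, 4, 0, 1, 0])

source_bobw = to_bilingual_histogram(source_hist, source_word_to_bilingual, 2)
target_bobw = to_bilingual_histogram(target_hist, target_word_to_bilingual, 2)
print(source_bobw, target_bobw)
```

After the conversion both histograms live in the same two-dimensional space, so a classifier trained on source-view histograms can be applied to target-view histograms without retraining.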

Low-level Action Representation
• Acquiring the training matrix M
[Figure: View 1 and View 2 videos -> feature detector -> 3D cuboid extraction -> feature clustering -> visual vocabulary (Visual Word A, Visual Word B) -> video-words histogram (Bag-of-Visual-Words model); matrix M with axes labeled “visual words” and “examples” (d)]
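The clustering and histogram steps of the Bag-of-Visual-Words model can be sketched as below. This is a rough illustration, not the authors' implementation: it assumes plain k-means (Lloyd's algorithm) for feature clustering and random vectors in place of real cuboid descriptors; the function names and all sizes are hypothetical.

```python
import numpy as np

def assign_words(descriptors, centers):
    """Index of the nearest cluster center (visual word) for each descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Plain k-means: the k cluster centers act as the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_words(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Count how often each visual word occurs in one video."""
    return np.bincount(assign_words(descriptors, centers), minlength=len(centers))

# Stand-in data: 200 descriptors pooled from one view, 30 from a single video.
rng = np.random.default_rng(1)
all_descriptors = rng.normal(size=(200, 8))
video_descriptors = rng.normal(size=(30, 8))

vocab = build_vocabulary(all_descriptors, k=5)
hist = bovw_histogram(video_descriptors, vocab)
print(hist)
```

Stacking one such histogram per video as a row, with the words of both views side by side as columns, would give a matrix of the kind the slide calls M, assuming that layout.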

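Bipartite Graph Modeling
• Build a bipartite graph between two views
• Edge weight matrix W (equation not recoverable from the slide), where S is a similarity matrix
• Generate similarity matrix S
• In the column space of M, each S(i,j) of S can be estimated (formula not recoverable from the slide)
[Figure: bipartite graph with edge weights W between X (visual words of view 1, source view) and Y (visual words of view 2, target view); matrix M with axes “visual words” and “video examples”]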
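One way to build such a graph is sketched below under two explicit assumptions that the slide does not confirm: S is taken to be the cosine similarity between columns of M (one column per visual word, one row per paired video), and W is taken to be the usual block form for a bipartite graph, with S in the off-diagonal blocks and zeros elsewhere. The function names and the small matrices are hypothetical.

```python
import numpy as np

def cross_view_similarity(M1, M2):
    """Cosine similarity between word columns of view 1 and view 2 (assumed measure)."""
    A = M1 / np.linalg.norm(M1, axis=0, keepdims=True)
    B = M2 / np.linalg.norm(M2, axis=0, keepdims=True)
    return A.T @ B  # shape: (|V1|, |V2|)

def bipartite_weights(S):
    """Assumed block form W = [[0, S], [S^T, 0]]: edges only run between the views."""
    n1, n2 = S.shape
    return np.block([[np.zeros((n1, n1)), S],
                     [S.T, np.zeros((n2, n2))]])

# Made-up word counts for 3 paired videos: 2 words in view 1, 3 words in view 2.
M1 = np.array([[2.0, 0.0],
               [1.0, 0.0],
               [0.0, 3.0]])
M2 = np.array([[4.0, 0.0, 1.0],
               [2.0, 0.0, 0.0],
               [0.0, 1.0, 5.0]])

S = cross_view_similarity(M1, M2)
W = bipartite_weights(S)
print(S.round(2))
print(W.shape)
```

Words that tend to fire in the same paired videos get a high similarity, which is what lets the later partitioning step group them into one bilingual word.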

Bipartite Graph Bi-Partitioning
• Bipartite graph partition:
• [1] H. Zha, X. He, C. Ding, H. Simon & M. Gu, CIKM 2001
• [2] I.S. Dhillon, SIGKDD 2001
[Figure: A. Before partition; B. After partition. Two clusters (1,2,3; a,b) & (4,5; c,d,e) -> two bilingual words]
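A spectral bi-partitioning in the spirit of the cited co-clustering work can be sketched as follows. It is an illustration under stated assumptions, not necessarily the exact procedure used in the talk: the similarity matrix is normalized by the square roots of the row and column degrees, the second singular vectors are rescaled, and their sign defines the two clusters. The matrix S below is made up to mimic the example on the slide (words 1,2,3 with a,b and words 4,5 with c,d,e), and every word is assumed to have at least one nonzero edge.

```python
import numpy as np

def bipartition_words(S):
    """Split both word sets into two joint clusters using the second singular vectors."""
    S = np.asarray(S, dtype=float)
    d1 = S.sum(axis=1)                    # degrees of view-1 words
    d2 = S.sum(axis=0)                    # degrees of view-2 words
    Sn = S / np.sqrt(np.outer(d1, d2))    # D1^(-1/2) S D2^(-1/2)
    U, _, Vt = np.linalg.svd(Sn)
    zx = U[:, 1] / np.sqrt(d1)            # embedding of view-1 words
    zy = Vt[1, :] / np.sqrt(d2)           # embedding of view-2 words
    return (zx > 0).astype(int), (zy > 0).astype(int)

# Rows: view-1 words 1..5; columns: view-2 words a..e (made-up similarities).
S = np.array([[1.0, 0.9, 0.1, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.1, 0.0],
              [0.9, 0.7, 0.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8, 0.9],
              [0.0, 0.1, 0.9, 1.0, 0.7]])

x_labels, y_labels = bipartition_words(S)
print(x_labels, y_labels)
```

Which cluster receives label 0 or 1 depends on the sign convention of the SVD, so only the grouping is meaningful. More than two bilingual words could be obtained by applying the split recursively or by clustering several singular vectors, which is left out here.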

IXMAS Data Set
• IXMAS videos: 11 actions performed by 10 actors, taken from 5 views.
[Figure: example frames of Check-watch, Scratch-head, Wave-hand, Kicking, Pick-up, and Sit-down from cameras C0, C1, C2, C3, C4]

Data Partition
[Figure: IXMAS data of the source view and the target view, each split into classes Y and classes Z; example actions: Check-watch, Scratch-head, Wave-hand, Pick-up, Kick, Sit-down]

Data Partition
[Figure: IXMAS data of View 1 (source view) and View 2 (target view), each split into classes Y and classes Z; panels: “Learning Bilingual Words”, “Training Z classes”, “Testing Z classes”; example actions: Check-watch, Scratch-head, Wave-hand, Pick-up, Kick, Sit-down]

Data Partition
[Figure: IXMAS data of the source view and the target view, each split into classes Y and classes Z; panels: “Learning Bilingual Words”, “Training Z+Y classes”, “Testing Z classes”; example actions: Check-watch, Scratch-head, Wave-hand, Pick-up, Kick, Sit-down]

Results on View Knowledge Transfer
[Table: results for each training view / testing view pair]
• “W/O” and “W/” show the results without and with view knowledge transfer via the bag of bilingual words, respectively
• Average: “W/O” = 10.9%, “W/” = 67.4%

Performance of Transfer
[Table: results for each training view / testing view pair]
• “W/O” and “W/” show the results without and with view knowledge transfer via the bag of bilingual words, respectively
• Average: “W/O” = 10.9%, “W/” = 67.4%

Performance of Transfer
[Table: results for each training view / testing view pair]
• “W/O” and “W/” show the results without and with knowledge transfer, respectively
• Average: woTran = 10.9%, wTran = 67.4%

Performance Comparison
• Low-level features: ST cuboids + shape-flow features [D. Tran et al. ECCV 2008]
• Columns “A”: A. Farhadi et al. ECCV 2008
• Columns “B”: I. N. Junejo et al. ECCV 2008
• Columns “C”: A. Farhadi et al. ICCV 2009

Transferred Knowledge Fusion
• One target view vs. n-1 source views
• Each source view has an action classifier
• How to fuse the knowledge into a final decision?
• Locally Weighted Ensemble strategy [Gao et al. SIGKDD 08]
[Figure: scatter of “+” and “–” examples with regions marked R, showing the classifier of source 1, the classifier of source 2, and their fusion]
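The fusion step can be sketched as a per-example weighted average of the class posteriors produced by the source-view classifiers. This is a simplified illustration: in the cited Locally Weighted Ensemble method the weights are derived from the local structure of the target data around each test example, whereas here they are simply given as inputs. The function name and all numbers are hypothetical.

```python
import numpy as np

def fuse_predictions(posteriors, weights):
    """Combine per-model class posteriors using per-example model weights.

    posteriors: (n_models, n_examples, n_classes)
    weights:    (n_models, n_examples), any non-negative values
    """
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum(axis=0, keepdims=True)  # sum to 1 per example
    fused = (weights[:, :, None] * posteriors).sum(axis=0)
    return fused.argmax(axis=1), fused

# Two source-view classifiers, two test videos, two action classes (made-up values).
posteriors = [[[0.9, 0.1], [0.4, 0.6]],   # classifier of source 1
              [[0.3, 0.7], [0.2, 0.8]]]   # classifier of source 2
weights = [[0.8, 0.5],                    # source 1 is trusted more on video 0
           [0.2, 0.5]]

labels, fused = fuse_predictions(posteriors, weights)
print(labels, fused)
```

Because the weights vary per example, a source view that matches the target view well in one region of the feature space can dominate there without dominating everywhere.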

Knowledge Fusion Results
• Each column denotes a testing (target) view; the remaining four views are source views

Detailed Recognition Rate

Summary
• Create an “action view interlingua” for cross-view action recognition
• Bilingual words serve as a bridge for view knowledge transfer
• Fuse knowledge transferred from multiple views using the Locally Weighted Ensemble method
• Our approach achieves state-of-the-art performance

Thank You!
Acknowledgements: UMich Intelligent Robotics Lab, UMich Computer Vision Lab, UCF Computer Vision Lab, NSF
