Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests Tsz-Ho Yu T-K Kim Danhang Tang Sponsored by
Motivation • Multiple cameras with inverse kinematics [Bissacco et al. CVPR2007] [Yao et al. IJCV2012] [Sigal IJCV2011] • Specialized hardware (e.g. structured-light sensor, ToF camera) [Shotton et al. CVPR2011] [Baak et al. ICCV2011] [Ye et al. CVPR2011] [Sun et al. CVPR2012] • Learning-based (regression) [Navaratnam et al. BMVC2006] [Andriluka et al. CVPR2010]
Motivation • Discriminative approaches (RF) have achieved great success in human body pose estimation. • Efficient – real-time • Accurate – works per frame, does not rely on tracking • Require a large dataset to cover many poses • Train on synthetic, test on real data • Do not exploit kinematic constraints Examples: Shotton et al. CVPR'11, Girshick et al. ICCV'11, Sun et al. CVPR'12
Challenges for Hand? • Viewpoint changes and self-occlusions • Discrepancy between synthetic and real data is larger than for the human body • Labeling is difficult and tedious!
Our method • Viewpoint changes and self-occlusions → Hierarchical Hybrid Forest • Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning • Labeling is difficult and tedious! → Semi-supervised Learning
Existing Approaches • Generative approaches • Model-fitting • No training is required • Slow • Needs initialisation and tracking Oikonomidis et al. ICCV2011, Ballan et al. ECCV2012 (motion capture), De La Gorce et al. PAMI2010, Hamer et al. ICCV2009 • Discriminative approaches • Similar solutions to human body pose estimation • Performance on real data remains challenging Xu and Cheng ICCV2013, Stenger et al. IVC2007, Keskin et al. ECCV2012, Wang et al. SIGGRAPH2009
Our method • Viewpoint changes and self-occlusions → Hierarchical Hybrid Forest • Discrepancy between synthetic and real data is larger than for the human body • Labeling is difficult and tedious!
Hierarchical Hybrid Forest Viewpoint Classification: Qa • STR forest: • Qa – Viewpoint classification quality (information gain) Qapv = α·Qa + (1−α)·β·Qp + (1−α)·(1−β)·Qv
Hierarchical Hybrid Forest Viewpoint Classification: Qa Finger Joint Classification: Qp • STR forest: • Qa – Viewpoint classification quality (information gain) • Qp – Joint label classification quality (information gain) Qapv = α·Qa + (1−α)·β·Qp + (1−α)·(1−β)·Qv
Hierarchical Hybrid Forest Viewpoint Classification: Qa Finger Joint Classification: Qp Pose Regression: Qv • STR forest: • Qa – Viewpoint classification quality (information gain) • Qp – Joint label classification quality (information gain) • Qv – Compactness of voting vectors (determinant of the covariance of voting vectors) Qapv = α·Qa + (1−α)·β·Qp + (1−α)·(1−β)·Qv
Hierarchical Hybrid Forest Viewpoint Classification: Qa Finger Joint Classification: Qp Pose Regression: Qv • STR forest: • Qa – Viewpoint classification quality (information gain) • Qp – Joint label classification quality (information gain) • Qv – Compactness of voting vectors (determinant of the covariance of voting vectors) • (α, β) – Margin measures of viewpoint labels and joint labels Qapv = α·Qa + (1−α)·β·Qp + (1−α)·(1−β)·Qv
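A minimal sketch (Python/NumPy) of how a hybrid split quality of this form could be evaluated for one candidate split. The function names are hypothetical, the covariance trace is used only as a simple stand-in for the paper's vote-compactness measure, and α, β are passed in as plain weights, whereas the STR forest derives them from margin measures of the viewpoint and joint labels.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(labels, go_left):
    """Information gain of splitting `labels` by the boolean mask `go_left`."""
    n = len(labels)
    left, right = labels[go_left], labels[~go_left]
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def vote_compactness(votes, go_left):
    """Compactness of 3D voting vectors after the split.

    Illustrative proxy: negated, size-weighted trace of the vote covariance,
    so that larger values correspond to more compact votes.
    """
    def spread(v):
        return float(np.trace(np.cov(v.T))) if len(v) > 1 else 0.0
    n = len(votes)
    left, right = votes[go_left], votes[~go_left]
    return -(len(left) / n * spread(left) + len(right) / n * spread(right))

def hybrid_quality(view_labels, joint_labels, votes, go_left, alpha, beta):
    """Qapv = a*Qa + (1-a)*b*Qp + (1-a)*(1-b)*Qv for one candidate split."""
    q_a = info_gain(view_labels, go_left)    # viewpoint classification quality
    q_p = info_gain(joint_labels, go_left)   # finger-joint classification quality
    q_v = vote_compactness(votes, go_left)   # pose regression compactness
    return alpha * q_a + (1 - alpha) * beta * q_p + (1 - alpha) * (1 - beta) * q_v
```

With α close to 1 the split is driven by viewpoint classification (top of the hierarchy); as α and β shrink, joint classification and finally vote compactness dominate, which mirrors the coarse-to-fine structure of the hierarchical hybrid forest.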
Our method • Viewpoint changes and self-occlusions • Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning • Labeling is difficult and tedious! → Semi-supervised Learning
Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {Rl, Ru, S}: Rl labeled, Ru unlabeled • Synthetic data S: • Generated from an articulated hand model. All labeled. • Realistic data R: • Captured from a PrimeSense depth sensor • A small part of R, Rl, is labeled manually (the rest forms the unlabeled set Ru)
Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {Rl, Ru, S}: • Realistic data R: • Captured from a Kinect depth sensor • A small part of R, Rl, is labeled manually (the rest forms the unlabeled set Ru) • Synthetic data S: • Generated from an articulated hand model, where |S| >> |R|
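As one concrete (assumed) way to organise this training set, the sketch below bundles the labeled real set Rl, the unlabeled real set Ru and the synthetic set S; the class and field names are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class HandSample:
    """One depth patch with optional annotations (None for unlabeled Ru)."""
    depth_patch: np.ndarray                # cropped depth image
    viewpoint: Optional[int] = None        # quantised viewpoint label
    joint_label: Optional[int] = None      # finger-joint / part label
    vote: Optional[np.ndarray] = None      # 3D offset to the joint position

@dataclass
class TrainingSet:
    """D = {Rl, Ru, S}: labeled real, unlabeled real, synthetic."""
    Rl: list  # small, manually labeled real set
    Ru: list  # large, unlabeled real set
    S: list   # synthetic set, fully labeled, with |S| >> |R|
```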
Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {Rl, Ru, S}: • Similar data points in Rl and S are paired (a penalty is applied if a split function separates a pair)
Semi-supervised learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {Rl, Ru, S}: • Similar data points in Rl and S are paired (a penalty is applied if a split function separates a pair) • Introduce a semi-supervised term to make use of unlabeled real data when evaluating the split function
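A minimal sketch of how such transductive and semi-supervised terms could enter the evaluation of a candidate split. The specific choices here (fraction of separated real–synthetic pairs as the penalty, nearest-labeled-neighbour distance for the unlabeled data, and the weights `lam`/`mu`) are illustrative assumptions, not the paper's exact terms.

```python
import numpy as np

def transductive_penalty(pairs, go_left):
    """Fraction of paired (real, synthetic) samples separated by the split.

    `pairs` holds (real_idx, synth_idx) index pairs into the node's sample
    array; `go_left[i]` tells whether sample i is routed to the left child.
    """
    if not pairs:
        return 0.0
    separated = sum(1 for r, s in pairs if go_left[r] != go_left[s])
    return separated / len(pairs)

def semi_supervised_score(features, labeled_mask, go_left):
    """Reward splits whose unlabeled real samples stay close (in feature
    space) to the labeled samples routed to the same child node."""
    score = 0.0
    for side in (go_left, ~go_left):
        lab = features[side & labeled_mask]
        unlab = features[side & ~labeled_mask]
        if len(lab) and len(unlab):
            # mean distance of each unlabeled sample to its nearest labeled one
            d = np.linalg.norm(unlab[:, None, :] - lab[None, :, :], axis=-1)
            score -= d.min(axis=1).mean()
    return score

def split_quality(q_apv, pairs, features, labeled_mask, go_left, lam=0.5, mu=0.5):
    """Combine the hybrid quality with the transductive and semi-supervised
    terms; `lam` and `mu` are illustrative weights, not the paper's values."""
    return (q_apv
            - lam * transductive_penalty(pairs, go_left)
            + mu * semi_supervised_score(features, labeled_mask, go_left))
```

The intent is that splits which keep matched synthetic/real pairs together transfer the synthetic labels to the real domain, while the unlabeled real patches regularise the trees toward the appearance of real sensor data.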
Experiment settings • Training data: • Synthetic data (337.5K images) • Real data (81K images, <1.2K labeled) • Evaluation data: • Three different testing sequences • Sequence A – single viewpoint (450 frames) • Sequence B – multiple viewpoints, slow hand movements (1000 frames) • Sequence C – multiple viewpoints, fast hand movements (240 frames)
Self comparison experiment • Self comparison (Sequence A): • This graph shows the joint classification accuracy on Sequence A. • The realistic and synthetic baselines produced similar accuracies. • Using the transductive term is better than simply augmenting the real data with synthetic data. • Combining all terms achieves the best results.
Multi-view experiments • Multi-view experiment (Sequence C):
Conclusion • A 3D hand pose estimation algorithm • STR forest: semi-supervised and transductive regression forest • A data-driven refinement scheme to rectify the shortcomings of the STR forest • Real-time (25 Hz on an Intel i7 PC without CPU/GPU optimisation) • Outperforms the state of the art • Makes use of unlabelled data, requiring less manual annotation • More accurate in real scenarios
Thank you! http://www.iis.ee.ic.ac.uk/icvl