Dependency Modeling for Information Fusion with Applications in Visual Recognition Andy Jinhua MA Instructor: Prof. Pong Chi YUEN
Outline • Motivation • Related Works • Supervised Spatio-Temporal Manifold Learning • Linear Dependency Modeling • Reduced Analytic Dependency Modeling • Conclusion
Motivation • Multiple features provide complementary information, e.g. • Color information can distinguish Daffodil from Windflower Daffodil Windflower Flower images from Oxford Flowers dataset [CVPR’06]
Motivation • Multiple features provide complementary information, e.g. • Color information can distinguish Daffodil from Windflower • Shape characteristics can distinguish Daffodil from Buttercup Daffodil Windflower Buttercup Flower images from Oxford Flowers dataset [CVPR’06]
Motivation • Fusion by estimating the joint distribution, but • Not accurate in high dimensions • The independence assumption can simplify the fusion process, but • May not be valid in practice • Degrades the fusion performance • Existing dependency modeling techniques are based on the normal assumption, but • Not robust to non-normal cases • Solution • Develop dependency modeling methods without the normal assumption
Outline • Motivation • Related Works • Supervised Spatio-Temporal Manifold Learning • Linear Dependency Modeling • Reduced Analytic Dependency Modeling • Conclusion
Related Works • Probabilistic approach • Independence assumption based [TPAMI’98] • Product, Sum, Majority Votes • Normal assumption based [TPAMI’09] • Independent Normal (IN) combination • Dependent Normal (DN) combination • Non-probabilistic approach • Supervised weighting • LPBoost [ML’02] • LP-B [ICCV’09] • Reduced multivariate polynomial (RM) [TCSVT’04] • Multiple kernel learning (MKL) [ICML’04, JMLR’06, JMLR’08] • Unsupervised approach • Signal strength combination (SSC) [TNNLS’12] • Graph-regularized robust late fusion (GRLF) [CVPR’12]
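The independence-assumption combination rules listed above (product, sum, majority vote) can be sketched in a few lines. The matrix `P` below is made-up illustrative data, not results from the thesis:

```python
import numpy as np

def product_rule(P):
    """Product rule: multiply per-classifier posteriors and renormalize.
    P has shape (num_classifiers, num_classes)."""
    s = np.prod(P, axis=0)
    return s / s.sum()

def sum_rule(P):
    """Sum rule: average the per-classifier posteriors."""
    s = np.sum(P, axis=0)
    return s / s.sum()

def majority_vote(P):
    """Majority vote: each classifier votes for its top class."""
    votes = np.bincount(np.argmax(P, axis=1), minlength=P.shape[1])
    return votes / votes.sum()

# Three classifiers, three classes (illustrative posteriors).
P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.4, 0.1],
              [0.3, 0.6, 0.1]])
```

All three rules assume the per-feature classifier outputs are independent given the class, which is exactly the assumption the proposed dependency models relax.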
Outline • Motivation • Related Works • Supervised Spatio-Temporal Manifold Learning • Linear Dependency Modeling • Reduced Analytic Dependency Modeling • Conclusion
Spatio-Temporal Manifold Learning • Why manifold learning • Can discover non-linear structures in visual data • Successful applications in image analysis, e.g. Laplacianfaces • Limitation for video applications • Temporal information not fully considered • Proposed method can • Discover non-linear structures • Utilize global constraint of temporal labels
Manifold Learning Based Action Recognition Framework Video Input → Preprocessing (information saliency method [PR’09]) → Action Unit → Image representation → Feature Vectors → Spatio-temporal manifold projection → Embedded Manifold → Classification → Label Output
Supervised Spatial (SS) Topology • Construct a new topological base by • Local information • Label information • Mathematical formulation • Temporal adjacency neighbors are in • Poses deform continuously over time
Temporal Pose Correspondence (TPC) Topology • Sequences of the same action share similar poses • Employ dynamic time warping (DTW) [TASSP’78] to construct TPC sets • TPC topological base is
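Dynamic time warping itself is standard; a minimal sketch of the alignment cost used to find temporal pose correspondences between sequences (absolute difference as the local distance is an illustrative choice, not necessarily the one used in the thesis):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between two 1-D sequences.
    Classic O(n*m) dynamic program with |a_i - b_j| local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion, and match moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For real pose data the scalar entries would be replaced by per-frame feature vectors and a vector distance; the warping path (recoverable by backtracking through `D`) gives the TPC correspondence sets.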
Topology Combination • Combine SS and TPC topological bases • Supervised spatio-temporal neighborhood topology learning (SSTNTL) can • Preserve local structure • Separate sequences of different actions
Experiments • Methods for comparison • Manifold learning methods • Locality preserving projection (LPP) [NIPS’03] • Supervised LPP (SLPP) [SP’07] • Locality sensitive discriminant analysis (LSDA) [IJCAI’07] • Local spatio-temporal discriminant embedding (LSTDE) [CVPR’08] • State-of-the-art action recognition algorithms • Classifier: nearest neighbor framework with median Hausdorff distance [TIP’07] where is the learnt projection, is action label, represents query data, is training data index
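The median Hausdorff distance used by the nearest-neighbour classifier can be sketched as below. This is an illustrative reading of the distance (median of nearest-neighbour distances in both directions), not the exact formulation from [TIP’07]:

```python
import numpy as np

def median_hausdorff(A, B):
    """Median Hausdorff distance between two point sets
    (rows are projected frames of a sequence)."""
    # Pairwise Euclidean distances, shape (len(A), len(B)).
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    forward = np.median(D.min(axis=1))   # median over A of nearest point in B
    backward = np.median(D.min(axis=0))  # median over B of nearest point in A
    return max(forward, backward)
```

Using the median instead of the maximum makes the distance robust to a few outlier frames, which matters for noisy action sequences.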
Experiments • Datasets For Evaluation • Weizmann Human Action • KTH Human Action • UCF Sports • HOllywood Human Action (HOHA) • Cambridge Gesture • Image representation after preprocessing • Gray-scale for Weizmann and KTH • Gist [IJCV’01] for KTH, UCF Sports, HOHA and Cambridge Gesture • Perform principal component analysis (PCA) [TPAMI’97] to avoid the singular matrix problem
Results • Accuracy (%) compared with other manifold embedding methods • Our method achieves the highest accuracy • The image representation method affects the performance
Results • Accuracy (%) compared with state-of-the-art methods under different scenarios in KTH • Outdoor (S1), Scale Change (S2), Clothes Change (S3), Indoor (S4) • Interest region based methods outperform others under a fixed camera setting • Interest point based methods, e.g. Tracklet and AFMKL, are better for scale change
Results • Does global constraint of temporal labels help? • Compare the proposed method with and without TPC neighbors Neighbors not detected by local similarity
Outline • Motivation • Related Works • Supervised Spatio-Temporal Manifold Learning • Linear Dependency Modeling • Reduced Analytic Dependency Modeling • Conclusion
Linear Dependency Modeling • Main idea • If and • Under the independence assumption [TPAMI’98] • Final decision is dominated by one classifier Feature 1 Input Classifier 1 Independent fusion Independent Model (Product): Feature 2 Classifier 2 … … (Windflower image) Feature M Classifier M Output × Not windflower Add dependency terms, s.t. the fused score is large
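A tiny numeric illustration of the dominance problem described above (the scores are made up): under the product rule, one confident classifier scoring near zero drags the whole fused score down, even when every other classifier agrees on the true class.

```python
import numpy as np

# Per-classifier posteriors for the true class (illustrative values):
# three classifiers agree strongly, one is confidently wrong.
scores = np.array([0.9, 0.9, 0.9, 0.01])

product = np.prod(scores)   # dominated by the single 0.01 score
average = np.mean(scores)   # far less affected by the outlier
```

Here `product` falls below 0.01 while `average` stays above 0.6, which is why the product rule misclassifies the Windflower image in the slide and why adding dependency terms to lift the fused score helps.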
Linear Classifier Dependency Modeling (LCDM) • Design of the dependency term • Dependency terms cannot be too large • Dependency weight to determine the feature importance • Prior probability • Following J. Kittler et al [TPAMI’98], suppose posteriors will not deviate dramatically from priors, where is small • Define dependency term as , with Dependency weight Prior Small number
Linear Classifier Dependency Modeling (LCDM) • Main idea • If and • Dependency model Dependency model Feature 1 Input Classifier 1 Proposed: Feature 2 Classifier 2 … … (Windflower image) Feature M Classifier M Output √ Windflower Dependency term
Linear Classifier Dependency Modeling (LCDM) • Expand the product formulation by neglecting higher-order terms • Linear Classifier Dependency Model (LCDM) is where ,
Linear Feature Dependency Modeling (LFDM) • Why dependency modeling in feature level? • Feature level contains more information • Symbol definition • Denote , • Denote , • Denote the label viewed as a random variable • Rigorous result • By the Data Processing Inequality • The feature level carries more information about the label, i.e. where represents mutual information
Linear Feature Dependency Modeling (LFDM) • Posterior probability can be written as • Linear Feature Dependency Model (LFDM) is where can be calculated by one- dimensional density estimation
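The one-dimensional density estimation mentioned above could be done with, for example, a Gaussian kernel density estimate over each feature dimension; a minimal sketch with an illustrative fixed bandwidth (the thesis does not specify the estimator, so treat this as one plausible choice):

```python
import numpy as np

def kde_1d(train, query, bandwidth=0.5):
    """One-dimensional Gaussian kernel density estimate:
    average a Gaussian kernel centred on each training value."""
    diffs = (query[:, None] - train[None, :]) / bandwidth
    k = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / bandwidth
```

In an LFDM-style pipeline one such estimate would be fitted per class and per feature dimension, keeping every density estimation problem one-dimensional and therefore tractable, unlike estimating the full joint distribution.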
Model Learning • Objective function in LCDM Maximizing margins Normalization constraint Dependency model constraint
Model Learning • Objective function in LFDM • Solve by off-the-shelf techniques
Estimation Error Analysis • Upper bounds of error factors in LCDM and LFDM where and represent estimation errors in and • Compare the denominators and numerators • LFDM is better than LCDM in the worst case,
Experiments • Methods for comparison • Independence assumption: Sum rule [TPAMI’98] • Normal assumption [TPAMI’09]: Independent Normal (IN) and Dependent Normal (DN) combination rules • Boosting methods: LPBoost [ML’02] and LP-B [ICCV’09] • Multiple kernel learning (MKL) [JMLR’08] • Support vector machines (SVM) as base classifier • Datasets for evaluation • Synthetic data • Oxford 17 Flower • Human Action
Experiments with Synthetic data • Data setting: 4 kinds of distributions • Independent Normal (IndNormal) • Dependent Normal (DepNormal) • Independent Non-Normal (IndNonNor) • Dependent Non-Normal (DepNonNor) • Results: recognition rates • IN and DN methods outperform others under normal distributions
Experiments with Synthetic data • Data setting: 4 kinds of distributions • Independent Normal (IndNormal) • Dependent Normal (DepNormal) • Independent Non-Normal (IndNonNor) • Dependent Non-Normal (DepNonNor) • Results: recognition rates • IN and DN methods outperform others under normal distributions • LCDM achieves the best results when the distributions are non-normal
Experiments with Oxford 17 Flower Dataset • Data setting • 17 flowers with 80 images per category • 3 predefined splits with 17 × 40 for training, 17 × 20 for validation, and 17 × 20 for testing • 7 kinds of features [CVPR’06] • Shape, color, texture, HSV, HoG, SIFT internal, and SIFT boundary • Results: recognition accuracy Example images • Feature combination outperforms any single feature • LCDM achieves the highest accuracy
Experiments with Human Action Datasets • Data setting • Weizmann • Nine-fold cross-validation • KTH • Training (8 persons), validation (8 persons), and testing (9 persons) • Space-time interest point (STIP) detection [VSPETS’05] STIP detection example in Weizmann STIP detection example in KTH
Experiments with Human Action Datasets • Data setting • Weizmann • Nine-fold cross-validation • KTH • Training (8 persons), validation (8 persons), and testing (9 persons) • Space-time interest point (STIP) detection [VSPETS’05] • 8 kinds of descriptors are computed on each STIP • Gray-scale intensity • Intensity difference • HoF and HoG without grid • HoF and HoG with 2D grid • HoF and HoG with 3D grid • 8 kinds of features are generated by Bag-of-Words
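The Bag-of-Words step above can be sketched as quantizing each descriptor against a visual codebook and histogramming the word assignments; the codebook here is assumed to be given (typically learned by k-means, though the slide does not say which clustering was used):

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Map local descriptors (rows) to their nearest codebook word
    and return a normalized word-frequency histogram."""
    # Squared distances from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

Each video then yields one fixed-length histogram per descriptor type, giving the 8 feature vectors that the fusion methods combine.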
Experiments with Human Action Datasets • Recognition accuracy (%) • LFDM outperforms others • The feature-level improvement by LFDM is significant Classifier fusion Feature fusion
Outline • Motivation • Related Works • Supervised Spatio-Temporal Manifold Learning • Linear Dependency Modeling • Reduced Analytic Dependency Modeling • Conclusion
Problems in Linear Dependency Modeling (LDM) • Product formulation may not best model the dependency • Assumption that posteriors will not deviate dramatically from priors • With strong classifiers, the deviation could be large • Propose a new method removing these two assumptions Dependency term
Analytic Dependency Modeling • Observation • Independent fusion [TPAMI’98] Constant w.r.t label Function of posteriors
Analytic Dependency Modeling • Observation • Independent fusion [TPAMI’98] • Linear Dependency Model where denote
Analytic Dependency Modeling • General score fusion model • Explicitly write out by converged power series • Denote and as weight vector • Rearrange according to where and is an analytic function of similar to
Analytic Dependency Modeling • By Bayes’ rule and the marginal distribution property where is a linear function of Trivial solution to the equation system
Analytic Dependency Modeling • By Bayes’ rule and the marginal distribution property where is a linear function of • Under the independence condition, the solution to the equation system is trivial, i.e. • Model dependency by setting a non-trivial solution
Reduced Model • Analytic function contains infinite number of coefficients
Reduced Model • Analytic function contains infinite number of coefficients • Approximate by converged power series property • Reduced Analytic Dependency Model (RADM)
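The reduction above keeps only a finite number of power-series terms. A hedged sketch of the idea follows; the actual RADM basis functions and learned weights differ, so this only illustrates truncating an analytic expansion into a finite feature map:

```python
import numpy as np

def reduced_poly_features(scores, order=3):
    """Expand per-classifier scores into truncated power-series
    terms (scores, scores**2, ..., scores**order) — a finite
    stand-in for the analytic function's infinite coefficients."""
    return np.concatenate([scores ** k for k in range(1, order + 1)])

def fuse(scores, weights, order=3):
    """Fused score: a linear combination of the truncated terms."""
    return reduced_poly_features(scores, order) @ weights
```

Because `fuse` is linear in `weights`, learning the reduced model reduces to an ordinary linear estimation problem, which is what makes the closed-form solution on the next slide possible.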
Model Learning • Objective function of the empirical classification error • Objective function of the dependency model constraint • Final optimization problem with a regularization term • Solve by setting the first derivative to zero
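Setting the first derivative of a regularized least-squares objective to zero gives a closed-form solution; a generic sketch of that pattern (not the exact RADM objective, whose error and constraint terms are not reproduced on the slide):

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.1):
    """Minimize ||Xw - y||^2 + lam * ||w||^2.
    Gradient = 2 X^T (Xw - y) + 2 lam w = 0
    =>  w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

The regularization term keeps the system well-conditioned even when the feature matrix is rank-deficient, so no iterative solver is needed.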
Experiments • Methods for comparison • Sum rule [TPAMI’98] • Independent Normal (IN) combination rule [TPAMI’09] • Dependent Normal (DN) combination rule [TPAMI’09] • Multi-class LPBoost, namely LP-B [ICCV’09] • Reduced multivariate polynomial (RM) [TCSVT’04] • Signal strength combination (SSC) [TNNLS’12] • Graph-regularized robust late fusion (GRLF) [CVPR’12] • Datasets for evaluation • PASCAL VOC 2007 • Columbia Consumer Video (CCV) • HOllywood Human Action (HOHA)
Experiments with VOC 2007 and CCV Datasets • PASCAL VOC 2007 • 20 classes, 9,963 images, 5,011 for training, 4,952 for testing • 8 features [CVPR’10] • RGB, HSV, LAB, dense SIFT, Harris SIFT, dense HUE and Harris HUE with horizontal decomposition • Gist descriptor • Columbia Consumer Video (CCV) • 20 categories, 9,317 videos, 4,659 for training, 4,658 for testing • 3 features [ICMR’11] • Visual features: SIFT and space-time interest point (STIP) • Audio feature: Mel-frequency cepstral coefficients (MFCC)
Experiments with VOC 2007 and CCV Datasets • Mean average precision (MAP) • RADM achieves highest MAP
RADM Fusion with SSTNTL • Data setting • HOHA dataset is used • 8 actions • Answer Phone (AnP), Get out of Car (GoC), Hand Shake (HS), Hug Person (HP), Kiss (Ki), Sit Down (SiD), Sit Up (SiU), Stand Up (StU) • Features • Supervised spatio-temporal neighborhood topology learning (SSTNTL) • 8 kinds of space-time interest point (STIP) based features STIP detection examples