
Action Recognition from Video Using Feature Covariance Matrices



Presentation Transcript


  1. Action Recognition from Video Using Feature Covariance Matrices Kai Guo, Prakash Ishwar, Senior Member, IEEE, and Janusz Konrad, Fellow, IEEE

  2. Outline • Introduction • Framework • Action Feature • Experiments • Conclusion

  3. Introduction • A new approach to action representation, one based on the empirical covariance matrix of a bag of local action features. • We apply the covariance matrix representation to two types of local feature collections: 1. A sequence of silhouettes of an object (the so-called silhouette tunnel) 2. The optical flow.

  4. Introduction • We focus on two distinct types of classifiers: 1. The nearest-neighbor (NN) classifier. 2. The sparse linear approximation (SLA) classifier. • We transform the supervised classification problem in the closed convex cone of covariance matrices into an equivalent problem in the vector space of symmetric matrices via the matrix logarithm.

  5. Framework • Feature Covariance Matrices • We adopt a “bag of dense local feature vectors” modeling approach. • Inspired by Tuzel et al.’s work, the feature-covariance matrix can provide a very discriminative representation for action recognition.

  6. Framework • Let F = {f_n} denote a “bag of feature vectors” extracted from a video sample, and let the size of the feature set be |F| = N. • The empirical estimate of the covariance matrix of F is given by: C = (1/(N−1)) Σ_{n=1}^{N} (f_n − μ)(f_n − μ)^T (1) • where μ = (1/N) Σ_{n=1}^{N} f_n is the empirical mean feature vector.
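The covariance computation above can be sketched in a few lines of numpy; the function name and the unbiased 1/(N−1) normalization are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def empirical_covariance(F):
    """Empirical covariance matrix of a bag of d-dimensional feature
    vectors F, given as an (N, d) array."""
    F = np.asarray(F, dtype=float)
    N = F.shape[0]
    mu = F.mean(axis=0)          # empirical mean feature vector
    D = F - mu                   # centered feature vectors
    return D.T @ D / (N - 1)     # (d, d) unbiased covariance estimate
```

For an (N, d) feature array this agrees with `np.cov(F.T)`, which uses the same 1/(N−1) normalization by default.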

  7. Framework • Log-Covariance Matrices • A key idea is to map the convex cone of covariance matrices to the vector space of symmetric matrices by using the matrix logarithm proposed by Arsigny et al. • The eigen-decomposition of C is given by C = U D U^T, where U is orthogonal and D is the diagonal matrix of eigenvalues. • Then log(C) := U log(D) U^T, where log(D) is the diagonal matrix obtained from D by replacing D’s diagonal entries by their logarithms.
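A minimal sketch of this mapping, assuming numpy and a symmetric positive-definite input (`log_covariance` is a hypothetical helper name):

```python
import numpy as np

def log_covariance(C):
    """Matrix logarithm of a symmetric positive-definite covariance
    matrix via its eigen-decomposition C = U D U^T:
    log(C) = U log(D) U^T, with log applied to D's diagonal entries."""
    w, U = np.linalg.eigh(C)             # eigenvalues w, eigenvectors U
    return U @ np.diag(np.log(w)) @ U.T  # symmetric log-covariance matrix
```

The result is a plain symmetric matrix, so distances between log-covariance matrices can be computed with ordinary vector-space norms.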

  8. Framework • Classification Using Log-Covariance Matrices • Nearest-Neighbor (NN) Classification: • Given a query sample, find the most similar sample in the annotated training set, where similarity is measured with respect to some distance measure, and assign its label to the query sample.
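The NN rule can be sketched as follows; the Frobenius norm on log-covariance matrices is one reasonable distance choice here, not necessarily the paper's exact measure:

```python
import numpy as np

def nn_classify(query_logcov, train_logcovs, train_labels):
    """Nearest-neighbor classification in the vector space of
    log-covariance matrices: return the label of the training sample
    closest to the query under the Frobenius norm."""
    dists = [np.linalg.norm(query_logcov - L, ord='fro')
             for L in train_logcovs]
    return train_labels[int(np.argmin(dists))]
```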

  9. Framework • Sparse Linear Approximation (SLA) Classification: • We approximate the log-covariance matrix of a query sample by a sparse linear combination of log-covariance matrices of all training samples p1, . . . , pN.

  10. Framework • Given a query sample q, one may attempt to express it as a linear combination of training samples by solving the matrix-vector equation P α = q, where P = [p_1, . . . , p_N] is the matrix of training samples. • A sparse solution can be sought by solving the following NP-hard optimization problem: min ||α||_0 subject to P α = q. • If the optimal solution α* is sufficiently sparse, it can be recovered by replacing the ℓ0 norm with the ℓ1 norm. • In practice the equation P α = q may have no exact solution; this difficulty can be overcome by introducing a noise term as follows: q = P α + z, where z is an additive noise term whose length is assumed to be bounded by ε, i.e., ||z||_2 ≤ ε. • This leads to the following ℓ1-minimization problem: min ||α||_1 subject to ||P α − q||_2 ≤ ε.
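As an illustrative stand-in for the constrained ℓ1 program, the closely related Lagrangian (Lasso) form can be solved with iterative soft-thresholding (ISTA); this is a sketch under that substitution, not the paper's solver:

```python
import numpy as np

def ista(P, q, lam=0.1, n_iter=100):
    """Sparse coefficients for the l1-regularized least-squares problem
    min_a 0.5*||P a - q||_2^2 + lam*||a||_1, solved with iterative
    soft-thresholding (ISTA)."""
    a = np.zeros(P.shape[1])
    step = 1.0 / np.linalg.norm(P, ord=2) ** 2   # 1/L, L = Lipschitz const.
    for _ in range(n_iter):
        g = a - step * P.T @ (P @ a - q)          # gradient step on data term
        a = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrinkage
    return a
```

Any standard ℓ1 solver (basis pursuit denoising, LARS, etc.) could be used in its place; the small shrinkage threshold is what drives most coefficients exactly to zero.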

  11. Framework • Use a reconstruction residual error (RRE) measure to decide the query class. • Let α_i* denote the coefficients of α* associated with class i (having label l_i), corresponding to the columns of the training matrix P_i. • The RRE measure of class i is defined as: RRE_i = ||q − P_i α_i*||_2. • To annotate the query sample, we assign the class label that leads to the minimum RRE.
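A minimal sketch of RRE-based label assignment, assuming the sparse coefficient vector `alpha` has already been computed (all names are illustrative):

```python
import numpy as np

def rre_classify(P, alpha, labels, q):
    """Assign q the label of the class with minimum reconstruction
    residual error: for each class i, keep only the columns of P and
    entries of alpha belonging to that class and measure the
    residual ||q - P_i alpha_i||_2."""
    labels = np.asarray(labels)
    best_label, best_rre = None, np.inf
    for li in np.unique(labels):
        mask = labels == li
        rre = np.linalg.norm(q - P[:, mask] @ alpha[mask])
        if rre < best_rre:
            best_label, best_rre = li, rre
    return best_label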

  12. Action Feature • Silhouette Tunnel Shape Features • Our goal is to reliably discriminate between shapes, not to accurately reconstruct them; hence a coarse, low-dimensional representation of shape suffices. • We capture the shape of the 3D silhouette tunnel by the empirical covariance matrix of a bag of thirteen-dimensional local shape features.

  13. Action Feature • With each pixel s of the silhouette tunnel we associate the following 13-dimensional feature vector f(s) that captures certain shape characteristics of the tunnel:

  14. Action Feature • After obtaining the 13-dimensional silhouette shape feature vectors, we can compute their 13 × 13 covariance matrix, denoted by C, using (1) (with N = |S|), where μ = (1/|S|) Σ_{s∈S} f(s) is the mean feature vector. • Thus, C is an empirical covariance matrix of the collection of vectors F.

  15. Action Feature • Optical Flow Features • Here we use a variant of the Horn and Schunck method, which optimizes a functional based on residuals from the intensity constraints and a smoothness regularization term. • Let I(x, y, t) denote the luminance of the raw video sequence at pixel position (x, y, t) and let u(x, y, t) represent the corresponding optical flow vector. • Based on I(x, y, t) and u(x, y, t), we use the following feature vector f(x, y, t):
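A bare-bones sketch of the classic Horn-Schunck iteration (the paper uses a variant; this is the textbook scheme, shown only to illustrate the idea):

```python
import numpy as np

def horn_schunck(I1, I2, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck optical flow: alternate between a local
    average of the flow field and a correction from the linearized
    brightness-constancy constraint. I1, I2 are consecutive grayscale
    frames; alpha weights the smoothness regularizer. Returns (u, v)."""
    I1, I2 = np.asarray(I1, float), np.asarray(I2, float)
    Iy, Ix = np.gradient(I1)                 # spatial intensity gradients
    It = I2 - I1                             # temporal gradient
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(n_iter):
        # 4-neighbor average of the current flow estimate
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                 + np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                 + np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        # closed-form update from the brightness-constancy data term
        t = (Ix * u_avg + Iy * v_avg + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_avg - Ix * t
        v = v_avg - Iy * t
    return u, v
```

For a horizontal intensity ramp shifted by one pixel between frames, the recovered flow converges to u = 1, v = 0, as the brightness-constancy constraint predicts.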

  16.–22. Experiments [Slides 16–22 presented the experimental results as figures and tables; only the slide titles survive in this transcript.]

  23. Conclusion • The action recognition framework that we have developed in this paper is conceptually simple, easy to implement, and has good run-time performance. • The TRECVID [63] and VIRAT [64] video datasets exemplify these types of real-world challenges, and much work remains to be done to address them.

  24. Conclusion • Our method’s relative simplicity, as compared to some of the top methods in the literature, enables almost tuning-free, rapid deployment and real-time operation. • This opens new application areas outside the traditional surveillance/security arena, for example in sports video annotation and customizable human-computer interaction.

  25. The End
