
Local Descriptors for Spatio-Temporal Recognition

Local Descriptors for Spatio-Temporal Recognition. Ivan Laptev and Tony Lindeberg, Computational Vision and Active Perception Laboratory (CVAP), Dept. of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden.


Presentation Transcript


  1. Local Descriptors for Spatio-Temporal Recognition. Ivan Laptev and Tony Lindeberg, Computational Vision and Active Perception Laboratory (CVAP), Dept. of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden

  2. Motivation. Area: interpretation of non-rigid motion. Non-rigid motion results in visual events such as • Occlusions, disocclusions • Appearance, disappearance • Unifications, splits • Velocity discontinuities. Events are often characterized by non-constant motion and complex spatio-temporal appearance; they provide a compact way to capture important aspects of spatio-temporal structure.

  3. Local Motion Events. Idea: look for spatio-temporal neighborhoods that maximize the local variation of image values over space and time.

  4. Interest points • Spatial domain (Harris and Stephens, 1988): select maxima over (x, y) of H = det(μ) − k·trace²(μ), where μ is the second-moment matrix of spatial image gradients • Analogy in space-time: select space-time maxima of H = det(μ) − k·trace³(μ), i.e. points with high variation of image values over space and time (Laptev and Lindeberg, ICCV’03)
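The two operator variants above can be written out concretely. A minimal sketch, assuming the 3x3 spatio-temporal second-moment matrix μ has already been computed from smoothed image derivatives; the value k = 0.005 is a typical choice for the space-time case, not a number stated on the slide:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def trace3(m):
    return m[0][0] + m[1][1] + m[2][2]

def space_time_harris(mu, k=0.005):
    """Space-time corner measure H = det(mu) - k * trace(mu)^3.
    Interest points are local maxima of H over (x, y, t)."""
    return det3(mu) - k * trace3(mu) ** 3

# A matrix with three strong, distinct eigenvalues (variation in all of
# x, y and t) scores high ...
mu_corner = [[10.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 6.0]]
# ... while a rank-deficient matrix (no variation along time) scores low.
mu_edge = [[10.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 0.0]]

print(space_time_harris(mu_corner) > space_time_harris(mu_edge))  # True
```

The cubed trace (rather than the squared trace of the spatial operator) matches the fact that μ is 3x3 here, so all three eigenvalues must be significant.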

  5. Synthetic examples: a velocity discontinuity (spatio-temporal ”corner”); a unification and a split.

  6. Image transformations • Spatial scale • Temporal scale • Galilean transformation (a point p maps to a transformed point p’). Estimate these transformations locally to obtain invariance to them (Laptev and Lindeberg ICCV’03, ICPR’04)
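The Galilean transformation listed above can be made concrete. A minimal sketch, assuming the standard constant-velocity shear of space over time, x' = x + vx·t, y' = y + vy·t, t' = t, which is what velocity adaptation inverts:

```python
def galilean(p, vx, vy):
    """Apply a Galilean transformation to a space-time point p = (x, y, t):
    constant relative camera motion shears space linearly in time."""
    x, y, t = p
    return (x + vx * t, y + vy * t, t)

def stabilize(p, vx, vy):
    """Undo camera motion (vx, vy): the inverse Galilean warp,
    i.e. a Galilean transformation with the negated velocity."""
    return galilean(p, -vx, -vy)

p = (2.0, 3.0, 4.0)
q = galilean(p, vx=0.5, vy=-0.25)      # point as seen by a moving camera
print(stabilize(q, vx=0.5, vy=-0.25))  # recovers (2.0, 3.0, 4.0)
```

In the actual method the velocity parameters are estimated locally at each feature rather than given, but the warp itself has this form.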

  7. Feature detection: selection of spatial scale. Scale selection gives invariance with respect to size changes.

  8. Feature detection: velocity adaptation. (Figure: stabilized camera vs. stationary camera.)

  9. Feature detection: selection of temporal scale. Selection of temporal scales captures the temporal extent of events.

  10. Features from human actions

  11. Why local features in space-time? • Make a sparse and informative representation of complex motion patterns; • Obtain robustness w.r.t. missing data (occlusions) and outliers (complex, dynamic backgrounds, multiple motions); • Match similar events in image sequences; • Recognize image patterns of non-rigid motion. • Do not rely on tracking or spatial segmentation prior to motion recognition

  12. Space-time neighborhoods. (Examples: boxing, walking, hand waving.)

  13. Local space-time descriptors • Describe image structures in the neighborhoods of detected features, defined by positions and covariance matrices • A well-founded choice of local descriptor is the local jet (Koenderink and van Doorn, 1987) computed from spatio-temporal Gaussian derivatives (here at interest points p_i): the vector of Gaussian derivatives of the image sequence up to a given order, evaluated at locally adapted scales
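As a 1-D analogue of the local jet, the sketch below smooths a signal with a normalized Gaussian kernel and reads off first- and second-order derivatives by finite differences; the real descriptor does the same over (x, y, t) at the locally adapted scales, collecting all mixed derivatives into one vector:

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete, normalized Gaussian kernel of the given radius."""
    ks = [math.exp(-(i * i) / (2.0 * sigma * sigma))
          for i in range(-radius, radius + 1)]
    s = sum(ks)
    return [k / s for k in ks]

def convolve(signal, kernel):
    """1-D convolution with replicated borders."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += k * signal[idx]
        out.append(acc)
    return out

def local_jet_1d(signal, i, sigma=1.0):
    """First- and second-order Gaussian derivative responses at position i:
    a 1-D stand-in for the spatio-temporal jet (L_x, L_xx, ...)."""
    L = convolve(signal, gaussian_kernel(sigma, radius=int(3 * sigma) + 1))
    d1 = (L[i + 1] - L[i - 1]) / 2.0          # central difference
    d2 = L[i + 1] - 2.0 * L[i] + L[i - 1]     # second difference
    return d1, d2

ramp = [0.5 * x for x in range(20)]           # linear signal, slope 0.5
d1, d2 = local_jet_1d(ramp, i=10)
print(round(d1, 6), round(d2, 6))             # slope recovered, curvature ~0
```

On a linear ramp the first derivative recovers the slope and the second derivative vanishes, which is a quick sanity check for the smoothing-then-differencing pipeline.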

  14. Use of descriptors: Clustering • Group similar points in the space of image descriptors using K-means clustering • Select significant clusters. (Figure: clusters c1, c2, c3, c4 and the subsequent classification.)
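The grouping step above is plain K-means. A minimal sketch of Lloyd's algorithm on toy 2-D points standing in for the high-dimensional jet descriptors (the data, initial centers, and iteration count here are all illustrative):

```python
def kmeans(points, centers, iters=10):
    """Lloyd's K-means: alternate nearest-center assignment and
    mean-of-group center updates."""
    groups = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: nearest center for each point
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # update step: move each center to the mean of its group
        for j, g in enumerate(groups):
            if g:
                centers[j] = tuple(sum(col) / len(g) for col in zip(*g))
    return centers, groups

pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),      # cluster near the origin
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]      # cluster near (5, 5)
centers, groups = kmeans(pts, centers=[(0.0, 0.0), (1.0, 1.0)])
print(len(groups[0]), len(groups[1]))  # 3 3
```

"Selecting significant clusters" would then amount to keeping only groups that are large or compact enough, a criterion the slide leaves unspecified.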

  15. Use of descriptors:Clustering

  16. Use of descriptors:Matching • Find similar events in pairs of video sequences

  17. Are other descriptors better? Consider the following choices for describing a spatio-temporal neighborhood: • Multi-scale spatio-temporal derivatives • Projections to orthogonal bases obtained with PCA • Histogram-based descriptors

  18. Multi-scale derivative filters. Derivatives up to order 2 (9 components) or order 4 (34 components), at 3 spatial scales and 3 temporal scales: • 9 x 3 x 3 = 81 or 34 x 3 x 3 = 306 dimensional descriptors
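The component counts above follow from combinatorics: the number of distinct partial derivatives of order n in the three variables (x, y, t) is the number of size-n multisets over 3 symbols, C(n + 2, 2). A quick check of the slide's arithmetic:

```python
from math import comb

def num_derivatives(order):
    """Distinct partial derivatives of a given order in 3 variables
    (x, y, t): multisets of size `order` over 3 symbols, C(order + 2, 2)."""
    return comb(order + 2, 2)

def descriptor_dim(max_order, n_spatial_scales=3, n_temporal_scales=3):
    """Total derivative components (orders 1..max_order) and the resulting
    descriptor dimensionality across all scale combinations."""
    n_derivs = sum(num_derivatives(o) for o in range(1, max_order + 1))
    return n_derivs, n_derivs * n_spatial_scales * n_temporal_scales

print(descriptor_dim(2))  # (9, 81)
print(descriptor_dim(4))  # (34, 306)
```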

  19. PCA descriptors • Compute normal flow or optic flow in locally adapted spatio-temporal neighborhoods of features • Subsample the flow fields to a resolution of 9x9x9 pixels • Learn PCA basis vectors (separately for each type of flow) from features in training sequences • Project the flow fields of new features onto the 100 most significant eigen-flow-vectors; the projection coefficients form the descriptor
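A toy version of the learn-basis-then-project step, using 2-D samples in place of the 9x9x9 flow fields so the covariance eigendecomposition has a closed form; `pca_2d` and `project` are hypothetical helpers, and only the single principal axis is found rather than the 100 eigen-flow-vectors of the slide:

```python
import math

def pca_2d(vectors):
    """Principal axis of 2-D samples via the closed-form eigendecomposition
    of the 2x2 covariance matrix."""
    n = len(vectors)
    mx = sum(v[0] for v in vectors) / n
    my = sum(v[1] for v in vectors) / n
    cxx = sum((v[0] - mx) ** 2 for v in vectors) / n
    cyy = sum((v[1] - my) ** 2 for v in vectors) / n
    cxy = sum((v[0] - mx) * (v[1] - my) for v in vectors) / n
    # largest eigenvalue of [[cxx, cxy], [cxy, cyy]]
    lam = 0.5 * (cxx + cyy + math.hypot(cxx - cyy, 2.0 * cxy))
    if abs(cxy) > 1e-12:
        ex, ey = lam - cyy, cxy          # eigenvector for lam
    else:                                # axis-aligned special case
        ex, ey = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(ex, ey)
    return (ex / norm, ey / norm)

def project(v, axis):
    """Projection coefficient of one sample onto a basis vector."""
    return v[0] * axis[0] + v[1] * axis[1]

# flow samples lying roughly along the diagonal direction (1, 1)
flows = [(1.0, 1.0), (2.0, 2.1), (-1.0, -0.9), (0.5, 0.6)]
axis = pca_2d(flows)
print(axis)  # components roughly equal: the axis points along (1, 1)
```

The real descriptor is the vector of such projection coefficients onto each retained eigenvector.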

  20. Position-dependent histograms • Divide the neighborhood of each point p_i into M^3 subneighborhoods, here M = 1, 2, 3 • Compute space-time gradients (Lx, Ly, Lt)^T or optic flow (vx, vy)^T at combinations of 3 temporal and 3 spatial scales, defined relative to the locally adapted detection scales • Compute separable histograms over all subneighborhoods, derivatives/velocities and scales
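The layout of such a descriptor can be sketched directly: split the neighborhood into M^3 cells and histogram each gradient component separately per cell. This assumes unit-cube positions and a fixed bin range for illustration; the paper's exact binning and multi-scale handling are not reproduced:

```python
def pd_histograms(points, M=2, bins=4, lo=-1.0, hi=1.0):
    """Position-dependent separable histograms. Each point carries a
    position in the unit cube [0, 1)^3 and a gradient (gx, gy, gt);
    the neighborhood is split into M^3 cells and each gradient component
    is histogrammed separately per cell."""
    # hist[cell][component][bin]
    hist = [[[0] * bins for _ in range(3)] for _ in range(M ** 3)]
    for (x, y, t), grad in points:
        cell = (int(x * M) * M + int(y * M)) * M + int(t * M)
        for comp, g in enumerate(grad):
            b = int((g - lo) / (hi - lo) * bins)
            b = min(max(b, 0), bins - 1)       # clamp out-of-range values
            hist[cell][comp][b] += 1
    return hist

pts = [((0.1, 0.1, 0.1), (0.5, -0.5, 0.0)),
       ((0.9, 0.9, 0.9), (0.2, 0.2, 0.2))]
h = pd_histograms(pts, M=2, bins=4)
print(sum(sum(c) for cell in h for c in cell))  # 6: 2 points x 3 components
```

"Separable" here means each component gets its own 1-D histogram rather than one joint 3-D histogram, which keeps the descriptor size linear in the number of bins.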

  21. Evaluation: Action Recognition. Database: walking, running, jogging, handwaving, handclapping, boxing. Initially, recognition with a Nearest Neighbor Classifier (NNC): • Take sequences of X subjects for training (S_train) • For each test sequence s_test find the closest training sequence s_train,i by minimizing the distance between their feature descriptors • The action of s_test is regarded as recognized if class(s_test) = class(s_train,i)
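The NNC protocol above reduces to a one-line minimization once a per-sequence descriptor and a distance are fixed. A sketch with toy descriptor vectors and Euclidean distance (one of the distances compared later in the talk):

```python
def nn_classify(test_desc, train):
    """Nearest-neighbour classification: assign the test sequence the
    class of the training sequence whose descriptor is closest in
    (squared) Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train, key=lambda item: dist2(test_desc, item[0]))
    return best[1]

# toy (descriptor, class) pairs standing in for real training sequences
train = [([0.0, 0.0, 1.0], "walking"),
         ([1.0, 0.0, 0.0], "boxing"),
         ([0.0, 1.0, 0.0], "handwaving")]
print(nn_classify([0.1, 0.0, 0.9], train))  # walking
```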

  22. Results: Recognition rates (all) Scale and velocity adapted features Scale-adapted features

  23. Results: Recognition rates (Hist) Scale and velocity adapted features Scale-adapted features

  24. Results: Recognition rates (Jets) Scale and velocity adapted features Scale-adapted features

  25. Results: Comparison Global-STG-HIST: Zelnik-Manor and Irani CVPR’01 Spatial-4Jets: Spatial interest points (Harris and Stephens, 1988)

  26. Confusion matrices Position-dependent histograms for space-time interest points Local jets at spatial interest points

  27. Confusion matrices: STG-PCA vs. STG-PD2HIST (ED: Euclidean distance)

  28. Related work • Mikolajczyk and Schmid CVPR’03, ECCV’02 • Lowe ICCV’99 • Zelnik-Manor and Irani CVPR’01 • Fablet, Bouthemy and Pérez PAMI’02 • Laptev and Lindeberg ICCV’03, IVC 2004, ICPR’04 • Efros et al. ICCV’03 • Harris and Stephens Alvey’88 • Koenderink and van Doorn PAMI 1992 • Lindeberg IJCV 1998

  29. Summary • Descriptors of local spatio-temporal features enable classification and matching of motion events in video • Position-dependent histograms of space-time gradients and optic flow give high recognition performance; these results are consistent with findings for the SIFT descriptor (Lowe, 1999) in the spatial domain. Future work: • Include spatial and temporal consistency of local features • Handle multiple actions in the scene • Exploit information in between events

  30. walking running jogging handwaving handclapping boxing

  31. Results: Recognition rates. Scalar-product distance vs. Euclidean distance.

  32. Walking model • Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle • Define the state of the model X at the moment t0 by the position, the size, the phase and the velocity of the person • Associate each phase with a silhouette of the person extracted from the original sequence

  33. Sequence alignment • Given a data sequence with the current moment t0, detect and classify interest points in the time window of length tw: (t0, t0 − tw) • Transform model features according to X and for each model feature f_m,i = (x_m,i, y_m,i, t_m,i, σ_m,i, τ_m,i, c_m,i) compute its distance d_i to the closest data feature f_d,j with matching class c_d,j = c_m,i • Define the ”fit function” D of a model configuration X as a sum of the distances of all features, weighted w.r.t. their ”age” (t0 − t_m,i) such that recent features get more influence on the matching
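The age-weighted fit function can be sketched once a weighting scheme is chosen. The exponential decay and the time constant tau below are illustrative assumptions; the slide only states that recent features get more influence:

```python
import math

def fit_function(matches, t0, tau=1.0):
    """Age-weighted fit D for a model configuration: sum of squared
    feature distances d_i, down-weighted exponentially with the feature's
    age (t0 - t_i) so that recent features dominate the matching.
    `matches` is a list of (t_i, d_i) pairs."""
    return sum(math.exp(-(t0 - t_i) / tau) * d_i ** 2
               for t_i, d_i in matches)

# identical residuals, carried once by recent and once by old features:
recent = [(9.5, 0.4), (9.9, 0.4)]
old = [(5.0, 0.4), (6.0, 0.4)]
# the same misfit costs much more when it sits on recent features
print(fit_function(recent, t0=10.0) > fit_function(old, t0=10.0))  # True
```

Minimizing D over X with Gauss-Newton (next slide) then pulls the model toward the features that matter most right now.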

  34. Sequence alignment. At each moment t0, minimize D with respect to X using the standard Gauss-Newton minimization method. (Figure: data features vs. model features.)

  35. Experiments

  36. Experiments
