Local Descriptors for Spatio-Temporal Recognition. Ivan Laptev and Tony Lindeberg. Computational Vision and Active Perception Laboratory (CVAP), Dept. of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden.
Motivation
Area: Interpretation of non-rigid motion. Non-rigid motion results in visual events such as
• Occlusions, disocclusions
• Appearance, disappearance
• Unifications, splits
• Velocity discontinuities
Events are often characterized by non-constant motion and complex spatio-temporal appearance. Events provide a compact way to capture important aspects of spatio-temporal structure.
Local Motion Events
Idea: look for spatio-temporal neighborhoods that maximize the local variation of image values over space and time.
Interest points
• Spatial domain (Harris and Stephens, 1988): select maxima over (x, y) of the corner function H = det(µ) - k·trace²(µ), where µ is the second-moment matrix of spatial image gradients.
• Analogy in space-time: select space-time maxima of H = det(µ) - k·trace³(µ), i.e. points with high variation of image values over space and time (Laptev and Lindeberg, ICCV'03).
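The two corner functions above can be sketched as follows. This is a minimal illustration of the measures only, not the full detector: the second-moment matrices `mu` are hand-picked placeholder values rather than ones computed from a real image, and the constants `k` are assumed typical values.

```python
# Sketch of the Harris corner measure and its space-time extension.
# The 2x2 / 3x3 second-moment matrices below are illustrative placeholders,
# not matrices computed from real image gradients.

def harris_2d(mu, k=0.04):
    """Spatial measure: H = det(mu) - k * trace(mu)^2 for a 2x2 matrix."""
    det = mu[0][0] * mu[1][1] - mu[0][1] * mu[1][0]
    tr = mu[0][0] + mu[1][1]
    return det - k * tr ** 2

def harris_3d(mu, k=0.005):
    """Space-time extension: H = det(mu) - k * trace(mu)^3 for a 3x3 matrix
    (k = 0.005 is an assumed value)."""
    a, b, c = mu[0]
    d, e, f = mu[1]
    g, h, i = mu[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    tr = a + e + i
    return det - k * tr ** 3

# A neighborhood with large variation in all directions (all eigenvalues
# large) yields a large H; a flat or 1-D structure yields a small or
# negative H.
mu2 = [[2.0, 0.0], [0.0, 2.0]]
print(harris_2d(mu2))  # det = 4, trace = 4 -> 4 - 0.04 * 16 = 3.36
```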
Synthetic examples
• Velocity discontinuity (spatio-temporal "corner")
• Unification and split
Image transformations
• Spatial scale
• Temporal scale
• Galilean transformation
Estimate these locally to obtain invariance to the corresponding transformations (Laptev and Lindeberg, ICCV'03, ICPR'04).
Feature detection: Selection of spatial scale Invariance with respect to size changes
Feature detection: Velocity adaptation Stabilized camera Stationary camera
Feature detection: Selection of temporal scale Selection of temporal scales captures the temporal extent of events
Why local features in space-time?
• Make a sparse and informative representation of complex motion patterns;
• Obtain robustness w.r.t. missing data (occlusions) and outliers (complex, dynamic backgrounds, multiple motions);
• Match similar events in image sequences;
• Recognize image patterns of non-rigid motion;
• Do not rely on tracking or spatial segmentation prior to motion recognition.
Space-time neighborhoods boxing walking hand waving
Local space-time descriptors
• Describe image structures in the neighborhoods of detected features, defined by their positions and covariance matrices.
• A well-founded choice of local descriptor is the local jet (Koenderink and van Doorn, 1987), computed from spatio-temporal Gaussian derivatives of the image sequence evaluated at the interest points pi.
Use of descriptors: Clustering
• Group similar points in the space of image descriptors using K-means clustering
• Select significant clusters (c1, c2, c3, c4) and use them for classification
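The clustering step can be sketched with a minimal Lloyd's-algorithm K-means over descriptor vectors. The descriptors here are random placeholders forming two well-separated groups; the dimensionality, cluster count, and iteration budget are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder descriptors: two well-separated groups in descriptor space.
descs = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                   rng.normal(5.0, 0.1, (20, 4))])

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(descs, k=2)
# The two groups of descriptors fall into two distinct clusters.
print(labels[:20], labels[20:])
```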
Use of descriptors: Matching
• Find similar events in pairs of video sequences
Other descriptors better?
Consider the following choices for describing a spatio-temporal neighborhood:
• Multi-scale spatio-temporal derivatives
• Projections to orthogonal bases obtained with PCA
• Histogram-based descriptors
Multi-scale derivative filters Derivatives up to order 2 or 4; 3 spatial scales; 3 temporal scales: • 9 x 3 x 3 = 81 or 34 x 3 x 3 = 306 dimensional descriptors
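The dimensionalities above follow from counting the distinct partial derivatives of orders 1 through N in three variables (x, y, t), since mixed partials commute. A quick check:

```python
from math import comb

def num_derivatives(max_order, num_vars=3):
    """Number of distinct partial derivatives of orders 1..max_order in
    num_vars variables.  Derivatives of exact order k in d variables
    number C(k + d - 1, d - 1) (stars and bars)."""
    return sum(comb(k + num_vars - 1, num_vars - 1)
               for k in range(1, max_order + 1))

print(num_derivatives(2))  # 3 + 6 = 9        -> 9 x 3 x 3 = 81 dims
print(num_derivatives(4))  # 3 + 6 + 10 + 15 = 34 -> 34 x 3 x 3 = 306 dims
```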
PCA descriptors
• Compute normal flow or optic flow in locally adapted spatio-temporal neighborhoods of features
• Subsample the flow fields to a resolution of 9x9x9 pixels
• Learn PCA basis vectors (separately for each flow type) from features in training sequences
• Project the flow fields of new features onto the 100 most significant eigen-flow-vectors
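The learn-and-project steps above can be sketched with an SVD-based PCA. The training data here is random placeholder data standing in for the subsampled flow fields (9x9x9 voxels with two velocity components, flattened); only the 100-component projection follows the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder training data: 200 flow fields, each subsampled to 9x9x9
# voxels with 2 velocity components, then flattened (9*9*9*2 = 1458 dims).
train = rng.standard_normal((200, 9 * 9 * 9 * 2))

# Learn the PCA basis from the training features.
mean = train.mean(axis=0)
centered = train - mean
# SVD of the centered data: rows of vt are the eigen-flow-vectors,
# ordered by decreasing singular value.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:100]            # keep the 100 most significant components

# Project a new feature's flow field onto the basis -> 100-dim descriptor.
new_flow = rng.standard_normal(9 * 9 * 9 * 2)
descriptor = basis @ (new_flow - mean)
print(descriptor.shape)     # (100,)
```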
Position-dependent histograms
• Divide the neighborhood of each point pi into M^3 subneighborhoods, here M = 1, 2, 3
• Compute space-time gradients (Lx, Ly, Lt)T or optic flow (vx, vy)T at combinations of 3 temporal and 3 spatial scales derived from the locally adapted detection scales
• Compute separable histograms over all subneighborhoods, derivatives/velocities and scales
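The cell-division and separable-histogram steps can be sketched as below. The gradients are random placeholders, and the neighborhood size, bin count, and bin range are assumed values; only the M^3-cell layout and per-component histogramming follow the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder space-time gradients (Lx, Ly, Lt) on a 9x9x9 neighborhood.
grads = rng.standard_normal((9, 9, 9, 3))

def position_dependent_hist(grads, M=2, bins=4, lo=-3.0, hi=3.0):
    """Split the neighborhood into M^3 cells, histogram each gradient
    component separately in each cell, and concatenate the histograms."""
    n = grads.shape[0]
    edges = np.linspace(lo, hi, bins + 1)
    cuts = np.linspace(0, n, M + 1).astype(int)  # cell borders per axis
    feats = []
    for i in range(M):
        for j in range(M):
            for k in range(M):
                cell = grads[cuts[i]:cuts[i + 1],
                             cuts[j]:cuts[j + 1],
                             cuts[k]:cuts[k + 1]]
                for c in range(cell.shape[-1]):  # one histogram per component
                    h, _ = np.histogram(cell[..., c], bins=edges)
                    feats.append(h)
    return np.concatenate(feats)

desc = position_dependent_hist(grads, M=2)
print(desc.shape)   # 2^3 cells * 3 components * 4 bins = (96,)
```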
Evaluation: Action recognition
Database: walking, running, jogging, handwaving, handclapping, boxing
Initially, recognition with a Nearest Neighbor Classifier (NNC):
• Take sequences of X subjects for training (Strain)
• For each test sequence stest, find the closest training sequence strain,i by minimizing the distance between their feature descriptors
• The action of stest is regarded as recognized if class(stest) = class(strain,i)
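The NNC steps above can be sketched as follows. The 2-D descriptors and the Euclidean distance are illustrative placeholders; the slide does not fix a particular sequence-level distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nnc(test_desc, train_set):
    """train_set: list of (descriptor, class_label) pairs.  Return the
    label of the closest training sequence under the distance."""
    return min(train_set, key=lambda dc: euclidean(test_desc, dc[0]))[1]

# Toy training set with made-up 2-D descriptors.
train = [([0.0, 0.0], "walking"),
         ([5.0, 5.0], "boxing"),
         ([0.0, 5.0], "jogging")]
print(nnc([0.5, 0.2], train))   # closest to ([0, 0], "walking")
```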
Results: Recognition rates (all) Scale and velocity adapted features Scale-adapted features
Results: Recognition rates (Hist) Scale and velocity adapted features Scale-adapted features
Results: Recognition rates (Jets) Scale and velocity adapted features Scale-adapted features
Results: Comparison Global-STG-HIST: Zelnik-Manor and Irani CVPR’01 Spatial-4Jets: Spatial interest points (Harris and Stephens, 1988)
Confusion matrices Position-dependent histograms for space-time interest points Local jets at spatial interest points
Confusion matrices STG-PCA, ED STG-PD2HIST, ED
Related work
• Mikolajczyk and Schmid, CVPR'03, ECCV'02
• Lowe, ICCV'99
• Zelnik-Manor and Irani, CVPR'01
• Fablet, Bouthemy and Pérez, PAMI'02
• Laptev and Lindeberg, ICCV'03, IVC 2004, ICPR'04
• Efros et al., ICCV'03
• Harris and Stephens, Alvey'88
• Koenderink and van Doorn, PAMI 1992
• Lindeberg, IJCV 1998
Summary
• Descriptors of local spatio-temporal features enable classification and matching of motion events in video
• Position-dependent histograms of space-time gradients and optic flow give high recognition performance; results are consistent with findings for the SIFT descriptor (Lowe, 1999) in the spatial domain
Future work:
• Include spatial and temporal consistency of local features
• Handle multiple actions in the scene
• Exploit information in between events
Results: Recognition rates
Scalar-product distance vs. Euclidean distance
Walking model
• Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle
• Define the state X of the model at moment t0 by the position, size, phase and velocity of the person
• Associate each phase with a silhouette of the person extracted from the original sequence
Sequence alignment
• Given a data sequence with current moment t0, detect and classify interest points in the time window of length tw: (t0 - tw, t0)
• Transform the model features according to X and, for each model feature fm,i = (xm,i, ym,i, tm,i, σm,i, τm,i, cm,i), compute its distance di to the closest data feature fd,j of the same class, cd,j = cm,i
• Define the "fit function" D of model configuration X as a sum of the distances of all features, weighted w.r.t. their "age" (t0 - tm) so that recent features get more influence on the matching
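The fit function D can be sketched as below. For brevity the features are simplified to (x, y, t, class), dropping the scale components, and an exponential age weighting is assumed; the slide specifies only that recent features get more influence, not the exact weighting.

```python
import math

def fit_function(model_feats, data_feats, t0, decay=1.0):
    """D(X): sum over model features of the distance to the closest
    data feature of the same class, weighted so that recent features
    (small t0 - t_m) count more.  Feature format: (x, y, t, class).
    The exponential weight exp(-decay * (t0 - t_m)) is an assumption."""
    D = 0.0
    for (xm, ym, tm, cm) in model_feats:
        same_class = [f for f in data_feats if f[3] == cm]
        if not same_class:
            continue
        d = min(math.hypot(xm - xd, ym - yd) + abs(tm - td)
                for (xd, yd, td, _) in same_class)
        weight = math.exp(-decay * (t0 - tm))
        D += weight * d
    return D

# Toy model and data features (made-up coordinates and classes).
model = [(1.0, 1.0, 9.0, "c1"), (2.0, 2.0, 5.0, "c2")]
data = [(1.0, 1.0, 9.0, "c1"), (2.5, 2.0, 5.0, "c2")]
print(fit_function(model, data, t0=10.0))  # small but nonzero misfit
```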
Sequence alignment
At each moment t0, minimize D with respect to X using a standard Gauss-Newton minimization method.
(Figure: data features overlaid with model features)