Local Descriptors for Spatio-Temporal Recognition. Ivan Laptev and Tony Lindeberg. Computational Vision and Active Perception Laboratory (CVAP), Dept. of Numerical Analysis and Computer Science, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden.
Motivation
Area: Interpretation of non-rigid motion. Non-rigid motion results in visual events such as
• Occlusions, disocclusions
• Appearance, disappearance
• Unifications, splits
• Velocity discontinuities
Events are often characterized by non-constant motion and complex spatio-temporal appearance. Events provide a compact way to capture important aspects of spatio-temporal structure.
Local Motion Events
Idea: look for spatio-temporal neighborhoods that maximize the local variation of image values over space and time.
Interest points
• Spatial domain (Harris and Stephens, 1988): select maxima over (x, y) of the corner function H = det(µ) - k·trace²(µ), where µ is the second-moment matrix of spatial image gradients.
• Analogy in space-time: select space-time maxima of H = det(µ) - k·trace³(µ), i.e. points with high variation of image values over space and time (Laptev and Lindeberg, ICCV'03).
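The two corner functions above can be sketched as follows. This is a minimal illustration of the measures only, not the full detector: the second-moment matrices `mu` are hand-picked placeholder values rather than ones computed from a real image, and the constants `k` are assumed typical values.

```python
# Sketch of the Harris corner measure and its space-time extension.
# The 2x2 / 3x3 second-moment matrices below are illustrative placeholders,
# not matrices computed from real image gradients.

def harris_2d(mu, k=0.04):
    """Spatial measure: H = det(mu) - k * trace(mu)^2 for a 2x2 matrix."""
    det = mu[0][0] * mu[1][1] - mu[0][1] * mu[1][0]
    tr = mu[0][0] + mu[1][1]
    return det - k * tr ** 2

def harris_3d(mu, k=0.005):
    """Space-time extension: H = det(mu) - k * trace(mu)^3 for a 3x3 matrix
    (k = 0.005 is an assumed value)."""
    a, b, c = mu[0]
    d, e, f = mu[1]
    g, h, i = mu[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    tr = a + e + i
    return det - k * tr ** 3

# A neighborhood with large variation in all directions (all eigenvalues
# large) yields a large H; a flat or 1-D structure yields a small or
# negative H.
mu2 = [[2.0, 0.0], [0.0, 2.0]]
print(harris_2d(mu2))  # det = 4, trace = 4 -> 4 - 0.04 * 16 = 3.36
```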
Synthetic examples
• Velocity discontinuity (spatio-temporal "corner")
• Unification and split
Image transformations
• Spatial scale
• Temporal scale
• Galilean transformation
Estimate these locally to obtain invariance to the corresponding transformations (Laptev and Lindeberg, ICCV'03, ICPR'04).
Feature detection: Selection of spatial scale Invariance with respect to size changes
Feature detection: Velocity adaptation Stabilized camera Stationary camera
Feature detection: Selection of temporal scale Selection of temporal scales captures the temporal extent of events
Why local features in space-time?
• Make a sparse and informative representation of complex motion patterns;
• Obtain robustness w.r.t. missing data (occlusions) and outliers (complex, dynamic backgrounds, multiple motions);
• Match similar events in image sequences;
• Recognize image patterns of non-rigid motion;
• Do not rely on tracking or spatial segmentation prior to motion recognition.
Space-time neighborhoods boxing walking hand waving
Local space-time descriptors
• Describe image structures in the neighborhoods of detected features, defined by their positions and covariance matrices.
• A well-founded choice of local descriptor is the local jet (Koenderink and van Doorn, 1987), computed from spatio-temporal Gaussian derivatives of the image sequence evaluated at the interest points pi.
Use of descriptors: Clustering
• Group similar points in the space of image descriptors using K-means clustering
• Select significant clusters (c1, c2, c3, c4) and use them for classification
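The clustering step can be sketched with a minimal Lloyd's-algorithm K-means over descriptor vectors. The descriptors here are random placeholders forming two well-separated groups; the dimensionality, cluster count, and iteration budget are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder descriptors: two well-separated groups in descriptor space.
descs = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                   rng.normal(5.0, 0.1, (20, 4))])

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(descs, k=2)
# The two groups of descriptors fall into two distinct clusters.
print(labels[:20], labels[20:])
```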
Use of descriptors: Matching
• Find similar events in pairs of video sequences
Other descriptors better?
Consider the following choices for describing a spatio-temporal neighborhood:
• Multi-scale spatio-temporal derivatives
• Projections to orthogonal bases obtained with PCA
• Histogram-based descriptors
Multi-scale derivative filters Derivatives up to order 2 or 4; 3 spatial scales; 3 temporal scales: • 9 x 3 x 3 = 81 or 34 x 3 x 3 = 306 dimensional descriptors
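The dimensionalities above follow from counting the distinct partial derivatives of orders 1 through N in three variables (x, y, t), since mixed partials commute. A quick check:

```python
from math import comb

def num_derivatives(max_order, num_vars=3):
    """Number of distinct partial derivatives of orders 1..max_order in
    num_vars variables.  Derivatives of exact order k in d variables
    number C(k + d - 1, d - 1) (stars and bars)."""
    return sum(comb(k + num_vars - 1, num_vars - 1)
               for k in range(1, max_order + 1))

print(num_derivatives(2))  # 3 + 6 = 9        -> 9 x 3 x 3 = 81 dims
print(num_derivatives(4))  # 3 + 6 + 10 + 15 = 34 -> 34 x 3 x 3 = 306 dims
```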
PCA descriptors
• Compute normal flow or optic flow in locally adapted spatio-temporal neighborhoods of features
• Subsample the flow fields to a resolution of 9x9x9 pixels
• Learn PCA basis vectors (separately for each flow type) from features in training sequences
• Project the flow fields of new features onto the 100 most significant eigen-flow-vectors
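The learn-and-project steps above can be sketched with an SVD-based PCA. The training data here is random placeholder data standing in for the subsampled flow fields (9x9x9 voxels with two velocity components, flattened); only the 100-component projection follows the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder training data: 200 flow fields, each subsampled to 9x9x9
# voxels with 2 velocity components, then flattened (9*9*9*2 = 1458 dims).
train = rng.standard_normal((200, 9 * 9 * 9 * 2))

# Learn the PCA basis from the training features.
mean = train.mean(axis=0)
centered = train - mean
# SVD of the centered data: rows of vt are the eigen-flow-vectors,
# ordered by decreasing singular value.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:100]            # keep the 100 most significant components

# Project a new feature's flow field onto the basis -> 100-dim descriptor.
new_flow = rng.standard_normal(9 * 9 * 9 * 2)
descriptor = basis @ (new_flow - mean)
print(descriptor.shape)     # (100,)
```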
Position-dependent histograms
• Divide the neighborhood of each point pi into M^3 subneighborhoods, here M = 1, 2, 3
• Compute space-time gradients (Lx, Ly, Lt)T or optic flow (vx, vy)T at combinations of 3 temporal and 3 spatial scales derived from the locally adapted detection scales
• Compute separable histograms over all subneighborhoods, derivatives/velocities and scales
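The cell-division and separable-histogram steps can be sketched as below. The gradients are random placeholders, and the neighborhood size, bin count, and bin range are assumed values; only the M^3-cell layout and per-component histogramming follow the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder space-time gradients (Lx, Ly, Lt) on a 9x9x9 neighborhood.
grads = rng.standard_normal((9, 9, 9, 3))

def position_dependent_hist(grads, M=2, bins=4, lo=-3.0, hi=3.0):
    """Split the neighborhood into M^3 cells, histogram each gradient
    component separately in each cell, and concatenate the histograms."""
    n = grads.shape[0]
    edges = np.linspace(lo, hi, bins + 1)
    cuts = np.linspace(0, n, M + 1).astype(int)  # cell borders per axis
    feats = []
    for i in range(M):
        for j in range(M):
            for k in range(M):
                cell = grads[cuts[i]:cuts[i + 1],
                             cuts[j]:cuts[j + 1],
                             cuts[k]:cuts[k + 1]]
                for c in range(cell.shape[-1]):  # one histogram per component
                    h, _ = np.histogram(cell[..., c], bins=edges)
                    feats.append(h)
    return np.concatenate(feats)

desc = position_dependent_hist(grads, M=2)
print(desc.shape)   # 2^3 cells * 3 components * 4 bins = (96,)
```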
Evaluation: Action recognition
Database: walking, running, jogging, handwaving, handclapping, boxing
Initially, recognition with a Nearest Neighbor Classifier (NNC):
• Take sequences of X subjects for training (Strain)
• For each test sequence stest, find the closest training sequence strain,i by minimizing the distance between their feature descriptors
• The action of stest is regarded as recognized if class(stest) = class(strain,i)
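The NNC steps above can be sketched as follows. The 2-D descriptors and the Euclidean distance are illustrative placeholders; the slide does not fix a particular sequence-level distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nnc(test_desc, train_set):
    """train_set: list of (descriptor, class_label) pairs.  Return the
    label of the closest training sequence under the distance."""
    return min(train_set, key=lambda dc: euclidean(test_desc, dc[0]))[1]

# Toy training set with made-up 2-D descriptors.
train = [([0.0, 0.0], "walking"),
         ([5.0, 5.0], "boxing"),
         ([0.0, 5.0], "jogging")]
print(nnc([0.5, 0.2], train))   # closest to ([0, 0], "walking")
```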
Results: Recognition rates (all) Scale and velocity adapted features Scale-adapted features
Results: Recognition rates (Hist) Scale and velocity adapted features Scale-adapted features
Results: Recognition rates (Jets) Scale and velocity adapted features Scale-adapted features
Results: Comparison Global-STG-HIST: Zelnik-Manor and Irani CVPR’01 Spatial-4Jets: Spatial interest points (Harris and Stephens, 1988)
Confusion matrices Position-dependent histograms for space-time interest points Local jets at spatial interest points
Confusion matrices STG-PCA, ED STG-PD2HIST, ED
Related work
• Mikolajczyk and Schmid, CVPR'03, ECCV'02
• Lowe, ICCV'99
• Zelnik-Manor and Irani, CVPR'01
• Fablet, Bouthemy and Pérez, PAMI'02
• Laptev and Lindeberg, ICCV'03, IVC 2004, ICPR'04
• Efros et al., ICCV'03
• Harris and Stephens, Alvey'88
• Koenderink and van Doorn, PAMI 1992
• Lindeberg, IJCV 1998
Summary
• Descriptors of local spatio-temporal features enable classification and matching of motion events in video
• Position-dependent histograms of space-time gradients and optic flow give high recognition performance; results are consistent with findings for the SIFT descriptor (Lowe, 1999) in the spatial domain
Future work:
• Include spatial and temporal consistency of local features
• Handle multiple actions in the scene
• Exploit information in between events
Results: Recognition rates
Scalar-product distance vs. Euclidean distance
Walking model
• Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle
• Define the state X of the model at moment t0 by the position, size, phase and velocity of the person
• Associate each phase with a silhouette of the person extracted from the original sequence
Sequence alignment
• Given a data sequence with current moment t0, detect and classify interest points in the time window of length tw: (t0 - tw, t0)
• Transform the model features according to X and, for each model feature fm,i = (xm,i, ym,i, tm,i, σm,i, τm,i, cm,i), compute its distance di to the closest data feature fd,j of the same class, cd,j = cm,i
• Define the "fit function" D of model configuration X as a sum of the distances of all features, weighted w.r.t. their "age" (t0 - tm) so that recent features get more influence on the matching
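The fit function D can be sketched as below. For brevity the features are simplified to (x, y, t, class), dropping the scale components, and an exponential age weighting is assumed; the slide specifies only that recent features get more influence, not the exact weighting.

```python
import math

def fit_function(model_feats, data_feats, t0, decay=1.0):
    """D(X): sum over model features of the distance to the closest
    data feature of the same class, weighted so that recent features
    (small t0 - t_m) count more.  Feature format: (x, y, t, class).
    The exponential weight exp(-decay * (t0 - t_m)) is an assumption."""
    D = 0.0
    for (xm, ym, tm, cm) in model_feats:
        same_class = [f for f in data_feats if f[3] == cm]
        if not same_class:
            continue
        d = min(math.hypot(xm - xd, ym - yd) + abs(tm - td)
                for (xd, yd, td, _) in same_class)
        weight = math.exp(-decay * (t0 - tm))
        D += weight * d
    return D

# Toy model and data features (made-up coordinates and classes).
model = [(1.0, 1.0, 9.0, "c1"), (2.0, 2.0, 5.0, "c2")]
data = [(1.0, 1.0, 9.0, "c1"), (2.5, 2.0, 5.0, "c2")]
print(fit_function(model, data, t0=10.0))  # small but nonzero misfit
```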
Sequence alignment
At each moment t0, minimize D with respect to X using a standard Gauss-Newton minimization method.
(Figure: data features overlaid with model features)