A general survey of previous works on Action Recognition • Sobhan Naderi Parizi • September 2009
List of papers • Statistical Analysis of Dynamic Actions • On Space-Time Interest Points • Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words • What, where and who? Classifying events by scene and object recognition • Recognizing Actions at a Distance • Recognizing Human Actions: A Local SVM Approach • Retrieving Actions in Movies • Learning Realistic Human Actions from Movies • Actions in Context • Selection and Context for Action Recognition
Non-parametric Distance Measure for Action Recognition • Paper info: • Title: • Statistical Analysis of Dynamic Actions • Authors: • Lihi Zelnik-Manor • Michal Irani • TPAMI 2006 • A preliminary version appeared in CVPR 2001 • “Event-Based Video Analysis”
“Statistical Analysis of Dynamic Actions” • Overview: • Introduces a non-parametric distance measure • Video matching (no action model): given a reference video, similar sequences are found • Dense features from multiple temporal scales (only corresponding scales are compared) • The temporal extent of videos in each category should be the same! (fast and slow versions of a dance count as different actions) • A new database is introduced • Periodic activities (walk) • Non-periodic activities (Punch, Kick, Duck, Tennis) • Temporal textures (water) • www.wisdom.weizmann.ac.il/~vision/EventDetection.html
“Statistical Analysis of Dynamic Actions” • Feature description: • Space-time gradient of each pixel • Threshold the gradient magnitudes • Normalization (ignoring appearance) • Absolute value (invariant to dark/light transitions) • Direction invariant
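One plausible reading of the bullets above can be sketched in a few lines of NumPy; this is a hedged illustration, not the authors' code (the function name, threshold value, and array layout are assumptions): per-pixel space-time gradients are computed on pre-blurred gray-level frames, pixels with small gradient magnitude are discarded, and the remaining gradients are normalized and made sign-invariant.

```python
import numpy as np

def spacetime_gradient_features(video, mag_thresh=10.0, eps=1e-8):
    """Minimal sketch of the per-pixel space-time gradient measurements.

    `video`: (T, H, W) float array of (pre-blurred) gray-level frames.
    `mag_thresh`: illustrative threshold on gradient magnitude.
    """
    # Space-time gradients along (t, y, x)
    dt, dy, dx = np.gradient(video)

    # Keep only pixels with significant space-time change
    mag = np.sqrt(dx**2 + dy**2 + dt**2)
    mask = mag > mag_thresh

    # Normalize by the magnitude (discarding appearance/contrast) and take
    # absolute values (invariance to dark<->light transitions and to the
    # direction of motion).
    feats = np.abs(np.stack([dx, dy, dt], axis=-1)) / (mag[..., None] + eps)
    return feats[mask]          # (N, 3) normalized gradient components
```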
“Statistical Analysis of Dynamic Actions” • Comments: • Actions are represented by 3L independent 1D distributions (L being the number of temporal scales) • The frames are blurred first • Robust to changes of appearance, e.g. highly textured clothing • Action recognition/localization: • For a test video sequence S and a reference sequence of T frames: • Each consecutive sub-sequence of length T is compared to the reference • In case of multiple reference videos: • Mahalanobis distance
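The recognition/localization step can then be illustrated as a sliding-window comparison, reusing `spacetime_gradient_features` from the sketch above. This is a simplified single-scale sketch with a chi-square-style histogram distance; the paper works over multiple temporal scales and combines several references via a Mahalanobis distance, so the exact statistic differs.

```python
import numpy as np

def component_histograms(feats, bins=32):
    """Per-component (x, y, t) 1D histograms of the normalized gradients."""
    return [np.histogram(feats[:, c], bins=bins, range=(0.0, 1.0), density=True)[0]
            for c in range(feats.shape[1])]

def chi2(h1, h2, eps=1e-8):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def sequence_distances(test_video, ref_video, T, step=1, bins=32):
    """Slide a window of the reference length T over the test sequence and
    compare per-component gradient histograms (single scale, sketch only)."""
    ref_hists = component_histograms(spacetime_gradient_features(ref_video), bins)
    dists = []
    for start in range(0, len(test_video) - T + 1, step):
        window = test_video[start:start + T]
        win_hists = component_histograms(spacetime_gradient_features(window), bins)
        d = sum(chi2(h_ref, h_win) for h_ref, h_win in zip(ref_hists, win_hists))
        dists.append((start, d))   # low distance -> candidate detection
    return dists
```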
Space-Time Interest Points (STIP) • Paper info: • Title: • On Space-Time Interest Points • Authors: • Ivan Laptev: INRIA / IRISA • IJCV 2009
“On Space-Time Interest Points” • Extends Harris detector to 3D (space-time) • Local space-time points with non-constant motion: • Points with accelerated motion: physical forces • Independent space and time scales • Automatic scale selection
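A minimal NumPy/SciPy sketch of a space-time (3D) Harris response is given below; this is an illustration, not Laptev's implementation: the scale values and the constant k are illustrative, and the separate spatial/temporal scale selection (next slide) is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris cornerness for a (T, H, W) grayscale video.
    sigma/tau: spatial/temporal smoothing scales; s: integration-window factor."""
    # Smooth, then take space-time gradients Lt, Ly, Lx
    L = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)

    # Second-moment (structure) tensor, averaged over a larger window
    w = dict(sigma=(s * tau, s * sigma, s * sigma))
    Mxx, Myy, Mtt = (gaussian_filter(g * g, **w) for g in (Lx, Ly, Lt))
    Mxy, Mxt, Myt = (gaussian_filter(a * b, **w)
                     for a, b in ((Lx, Ly), (Lx, Lt), (Ly, Lt)))

    # det(mu) - k * trace(mu)^3; local maxima are candidate interest points
    det = (Mxx * (Myy * Mtt - Myt**2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace**3
```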
“On Space-Time Interest Points” • Automatic scale selection procedure: • Detect interest points • Move in the direction of the optimal scale • Repeat until a locally optimal scale is reached (iterative) • The procedure cannot be used in real time: • Future frames are needed • Estimation approaches exist to work around this problem
Unsupervised Action Recognition • Paper info: • Title: • Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words • Authors: • Juan Carlos Niebles: University of Illinois • Hongcheng Wang: University of Illinois • Li Fei-Fei: University of Illinois • BMVC 2006
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Generative graphical model (pLSA) • The space-time interest point detector of Piotr Dollár et al. is used • Laptev's STIP detector is too sparse • A dictionary of video words is created • The method is unsupervised • Simultaneous action recognition/localization • Evaluations on: • KTH action database • Skating actions database (4 action classes)
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Overview of the method: • w: video word • d: video sequence • z: latent topic (action category)
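pLSA models the word-document co-occurrences as p(w|d) = Σ_z p(w|z) p(z|d) and is fit with EM. The sketch below is a generic pLSA implementation over a word-document count matrix, not the authors' code; topics z play the role of action categories.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal pLSA (EM) sketch. `counts`: (n_words, n_docs) count matrix."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_docs));  p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | w, d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]          # (W, Z, D)
        p_z_wd = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate p(w|z) and p(z|d) from expected counts
        expected = counts[:, None, :] * p_z_wd                  # (W, Z, D)
        p_w_z = expected.sum(2); p_w_z /= p_w_z.sum(0, keepdims=True) + 1e-12
        p_z_d = expected.sum(0); p_z_d /= p_z_d.sum(0, keepdims=True) + 1e-12
    # A test video is classified by its dominant topic in p(z|d)
    # (the paper folds in test videos while keeping p(w|z) fixed).
    return p_w_z, p_z_d
```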
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Feature descriptor: • Brightness gradient + PCA • Brightness gradients were found to be about as effective as optical flow for capturing motion • Multiple actions can be localized in the same video • Average classification accuracy: • KTH action database: 81.5% • Skating dataset: 80.67%
Event recognition in sport images • Paper info: • Title: • What, where and who? Classifying events by scene and object recognition • Authors: • Li-Jia Li: University of Illinois • Li Fei-Fei: Princeton University • ICCV 2007
“What, where and who? Classifying events by scene and object recognition” • Goal of the paper: • Event classification in still images • Scene labeling • Object labeling • Approach: • Generative graphical model • Assumes that objects and scenes are independent given the event category • Ignores spatial relationships between objects
“What, where and who? Classifying events by scene and object recognition” • Information channels: • Scene context (holistic representation) • Object appearance • Geometric layout (sky at infinity / vertical structures / ground plane) • Feature extraction: • 12x12 patches sampled on a regular grid (10x10 spacing) • For each patch: • SIFT feature (used for both the scene and object models) • Layout label (used only for the object model)
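A hedged sketch of the dense sampling step, using OpenCV's SIFT at grid keypoints as a stand-in for the descriptors actually used in the paper; the patch size and grid spacing follow the slide, everything else is illustrative.

```python
import cv2
import numpy as np

def dense_sift(gray, step=10, patch=12):
    """Dense SIFT on a regular grid. `gray`: uint8 grayscale image."""
    h, w = gray.shape
    # Keypoints every `step` pixels, each covering a `patch`-sized region
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, h - patch // 2, step)
           for x in range(patch // 2, w - patch // 2, step)]
    sift = cv2.SIFT_create()
    kps, desc = sift.compute(gray, kps)
    # Positions (for the layout label lookup) + 128-D appearance descriptors
    return np.array([kp.pt for kp in kps]), desc
```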
“What, where and who? Classifying events by scene and object recognition” • The graphical model • E: event • S: scene • O: object • X: scene feature • A: appearance feature • G: geometry layout
“What, where and who? Classifying events by scene and object recognition” • A new database is compiled: • 8 sport event categories (downloaded from the web) • Bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing • Average classification accuracy over all 8 event classes = 74.3%
“What, where and who? Classifying events by scene and object recognition” • Sample results:
Action recognition in medium resolution regimes • Paper info: • Title: • Recognizing Actions at a Distance • Authors: • Alexei A. Efros: UC Berkeley • Alexander C. Berg: UC Berkeley • Greg Mori: UC Berkeley • Jitendra Malik: UC Berkeley • ICCV 2003
“Recognizing Actions at a Distance” • Overall review: • Actions at medium resolution (person about 30 pixels tall) • Proposes a new motion descriptor • KNN for classification • A consistent tracking bounding box of the actor is required • Action recognition is done only inside the tracking bounding box • Motion is described in terms of the relative movement of body parts • The tracker itself provides no information about the movements
“Recognizing Actions at a Distance” • Motion feature: • For each frame, a local temporal neighborhood is considered • Optical flow is extracted (alternatives: raw pixel values, temporal gradients) • OF is noisy: • half-wave rectification + blurring • To preserve motion information: • the OF vector is decomposed into its horizontal/vertical components
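A minimal sketch of this motion descriptor for one frame pair, using OpenCV's Farnebäck flow as a stand-in for the flow estimator used in the paper; the four half-wave rectified channels and the blurring follow the description above, the parameter values are illustrative.

```python
import numpy as np
import cv2

def motion_channels(prev_gray, cur_gray, blur_sigma=3.0):
    """Half-wave rectified, blurred optical-flow channels for one frame pair.
    `prev_gray`, `cur_gray`: uint8 grayscale frames inside the tracking box."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification into four non-negative channels: F+x, F-x, F+y, F-y
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    # Blurring suppresses flow noise and small misalignments
    return [cv2.GaussianBlur(c, (0, 0), blur_sigma) for c in channels]
```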
“Recognizing Actions at a Distance” • Similarity measure: the frame-to-frame motion similarity between frame i of sequence A and frame j of sequence B is the correlation of their blurred motion channels over a space-time window: S(i, j) = Σ_{t∈T} Σ_{(x,y)∈I} Σ_c a^c_{i+t}(x, y) · b^c_{j+t}(x, y) • i, j: frame indices • T: temporal extent of the comparison window • I: spatial extent (the stabilized tracking box) • a^c, b^c: the c-th blurred motion channel of sequences A and B
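Given the per-frame motion-channel neighborhoods, the whole similarity matrix reduces to one large dot product; a minimal sketch follows (the array layout is an assumption, building on the motion-channel sketch above).

```python
import numpy as np

def motion_similarity(A, B):
    """Frame-to-frame similarity matrix S[i, j].
    A: (nA, T, C, H, W) array; for every frame of the first sequence, the
    stacked blurred motion channels of its temporal neighborhood.
    B: (nB, T, C, H, W) array for the second sequence."""
    A_flat = A.reshape(A.shape[0], -1)
    B_flat = B.reshape(B.shape[0], -1)
    return A_flat @ B_flat.T   # used by the KNN classifier over frames
```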
“Recognizing Actions at a Distance” • New Dataset: • Ballet (stationary camera): • 16 action classes • 2 men + 2 women • Easy dataset (controlled environment) • Tennis (real action, stationary camera): • 6 action classes (stand, swing, move-left, …) • different days/location/camera position • 2 players (man + woman) • Football (real action, moving camera): • 8 action classes (run-left 45˚, run-left, walk-left, …) • Zoom in/out
“Recognizing Actions at a Distance” • Average classification accuracy: • Ballet: 87.44% (5NN) • Tennis: 64.33% (5NN) • Football: 65.38% (1NN) • What can be done?
“Recognizing Actions at a Distance” • Applications: • Do as I Do: • Replace actors in videos • Do as I Say: • Develop real-world motions in computer games • 2D/3D skeleton transfer • Figure Correction: • Remove occlusion/clutter in movies
KTH Action Dataset • Paper info: • Title: • Recognizing Human Actions: A Local SVM Approach • Authors: • Christian Schuldt: KTH University • Ivan Laptev: KTH University • ICPR 2004
“Recognizing Human Actions: A Local SVM Approach” • New dataset (KTH action database): • 2391 video sequences • 6 action classes (Walking, Jogging, Running, Handclapping, Boxing, Hand-waving) • 25 persons • Static camera • 4 scenarios: • Outdoors (s1) • Outdoors + scale variation (s2): the hardest scenario • Outdoors + cloth variation (s3) • Indoors (s4)
“Recognizing Human Actions: A Local SVM Approach” • Features: • Sparse (STIP detector) • Spatio-temporal jets of order 4 • Different feature representations: • Raw jet feature descriptors • Exponential kernel on the histogram of jets • Spatial HoG with temporal pyramid • Different classifiers: • SVM • NN
“Recognizing Human Actions: A Local SVM Approach” • Experimental results: • Local Feature (jets) + SVM performs the best • SVM outperforms NN • HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients) • Average classification accuracy on all scenarios = 71.72%
Action Recognition in Real Scenarios • Paper info: • Title: • Retrieving Actions in Movies • Authors: • Ivan Laptev: INRIA / IRISA • Patrick Pérez: INRIA / IRISA • ICCV 2007
“Retrieving Actions in Movies” • A new action database from real movies • Experiments only on the Drinking action (vs. random clips and vs. Smoking) • Main contributions: • Recognizing unrestricted real actions • Key-frame priming • Configuration of experiments: • Action recognition (on pre-segmented sequences) • Comparing different features • Action detection (using key-frame priming)
“Retrieving Actions in Movies” • Real movie action database: • 105 drinking actions • 141 smoking actions • Different scenes/people/views • www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html • Action representation: • R = (P, ΔP) • P = (X, Y, T): space-time coordinates • ΔP = (ΔX, ΔY, ΔT): spatio-temporal extent • ΔX: 1.6 × width of the head bounding box • ΔY: 1.3 × height of the head bounding box
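A tiny sketch of how the cuboid R = (P, ΔP) could be derived from a detected head bounding box; the 1.6/1.3 factors follow the slide, while the placement of the space-time center and the temporal extent argument are assumptions for illustration.

```python
def action_cuboid(head_box, t, dt):
    """head_box: (x, y, w, h) of the detected head; t: frame index of the
    key moment; dt: assumed temporal extent of the action (in frames)."""
    x, y, w, h = head_box
    P = (x + w / 2.0, y + h / 2.0, t)    # space-time center (assumed placement)
    dP = (1.6 * w, 1.3 * h, dt)          # spatial extent scaled from head size
    return P, dP
```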
“Retrieving Actions in Movies” • Learning scheme: • Discrete AdaBoost + FLD (Fisher Linear Discriminant) weak learners • All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting) • Slightly temporally randomized sequences are added to the training set • HoG (4 bins) / OF (5 bins) histograms are used • Local features: • θ = (x, y, t, δx, δy, δt, β, ψ) • β ∈ {plain, temp-2, spat-4} • ψ ∈ {OF5, Grad4}
“Retrieving Actions in Movies” • HoG captures shape, OF captures motion • The most informative motions occur at the start & end of the action • Key-frame: • The moment when the hand reaches the head • Boosted histograms on HoG • There is no motion information around the key-frame alone • Integrating motion and key-frame information should therefore help
“Retrieving Actions in Movies” • Experiments: • OF / OF+HoG / STIP+NN / key-frame only • OF and OF+HoG work best on the hard test (drinking vs. smoking) • Extension of OF5 to OFGrad9 does not help! • Key-frame priming: • #FPs decreases significantly (complementary info. channels) • Significant gain in overall accuracy: • it is better to model motion and appearance separately • Speed of the key-primed version: 3 seconds per frame
“Retrieving Actions in Movies” • Possible extensions: • Extend the experiments to more action classes • Make it real-time
Automatic Video Annotation • Paper info: • Title: • Learning Realistic Human Actions from Movies • Authors: • Ivan Laptev: INRIA / IRISA • Marcin Marszalek: INRIA / LEAR • Cordelia Schmid: INRIA / LEAR • Benjamin Rozenfeld: Bar-Ilan University • CVPR 2008
“Learning Realistic Human Actions from Movies” • Overview: • Automatic movie annotation: • Alignment of movie scripts • Text classification • Classification of real actions • Provides a new dataset • Beats state-of-the-art results on the KTH dataset • Extends the spatial pyramid to a space-time pyramid
“Learning Realistic Human Actions from Movies” • Movie script: • Publicly available textual description of: • Scenes • Characters • Transcribed dialogs • Actions (descriptive) • Limitations: • No exact time alignment • No guarantee of correspondence with the actions actually shown • Actions are described in free, diverse wording • Actions may be missed when there is little dialog to align with
“Learning Realistic Human Actions from Movies” • Automatic annotation: • Subtitles include exact time alignment • The timing of scripts is recovered by matching them to subtitles • Action labels are assigned to script text by a text classifier • New dataset: • 8 action classes (AnswerPhone, GetOutCar, SitUp, …) • Two training sets (automatically/manually annotated) • 60% of the automatic training set is correctly annotated • http://www.irisa.fr/vista/actions
“Learning Realistic Human Actions from Movies” • Action classification approach: • BoF framework (k=4000) • Space-time pyramids • 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2} • 4 temporal grids: {t1, t2, t3, ot2} • STIP with multiple scales • HoG and HoF
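One channel of this representation can be sketched as follows: each quantized STIP is binned into a spatial/temporal grid cell, a k-bin word histogram is built per cell, and the cell histograms are concatenated. This is an illustration only; the grid encoding and default values are assumptions.

```python
import numpy as np

def st_bof_histogram(points, words, video_shape, grid=(1, 3, 1), k=4000):
    """One spatio-temporal channel of the bag-of-features representation.
    points: iterable of STIP coordinates (x, y, t); words: their visual-word ids;
    video_shape: (T, H, W); grid: (nx, ny, nt), e.g. (1, 3, 1) for a 1x3
    spatial grid with no temporal subdivision."""
    T, H, W = video_shape
    nx, ny, nt = grid
    hist = np.zeros((nx, ny, nt, k))
    for (x, y, t), w in zip(points, words):
        cx = min(int(x / W * nx), nx - 1)
        cy = min(int(y / H * ny), ny - 1)
        ct = min(int(t / T * nt), nt - 1)
        hist[cx, cy, ct, w] += 1
    hist = hist.reshape(-1)
    return hist / (hist.sum() + 1e-12)   # L1-normalized channel histogram
```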
“Learning Realistic Human Actions from Movies” • Feature extraction: • A volume of (2kσ x 2kσ x 2kτ) is taken around each STIP, where σ/τ are its spatial/temporal scales (k=9) • The volume is divided into a grid of cells • HoG and HoF are calculated for each grid cell and concatenated • The per-cell descriptors are then concatenated once more according to the pattern of the spatio-temporal pyramid
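For the local descriptor itself, here is a minimal sketch of the volume extraction around a detected STIP; the HoG/HoF computation on the grid of cells inside the volume is omitted, and the array layout is an assumption.

```python
def stip_volume(video, x, y, t, sigma, tau, k=9):
    """Crop the (2k*sigma x 2k*sigma x 2k*tau) space-time patch around a STIP.
    `video`: (T, H, W) array; sigma/tau: the point's spatial/temporal scales."""
    rs, rt = int(k * sigma), int(k * tau)
    return video[max(t - rt, 0):t + rt,
                 max(y - rs, 0):y + rs,
                 max(x - rs, 0):x + rs]
```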
“Learning Realistic Human Actions from Movies” • Different channels: • Each spatio-temporal template: one channel • Greedy search to find the best channel combination • Kernel function: exponential (RBF-style) kernel on the chi-square distance, combined over channels • Observations: • HoG performs better than HoF • No temporal subdivision is preferred (temporal grid = t1) • Combination of channels improves classification in the real-movie scenario • Mean AP on the KTH action database = 91.8% • Mean AP on the real-movie database: • Trained on the manually annotated dataset: 39.5% • Trained on the automatically annotated dataset: 22.9% • Random classifier (chance): 12.5%
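The multi-channel kernel described above can be sketched as follows: per-channel chi-square distances are normalized by the mean channel distance A_c over the training set and combined in an exponential kernel. Treat this as a sketch of the reported kernel form, not the authors' implementation.

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-12):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(hists_i, hists_j, mean_dists):
    """hists_i / hists_j: lists of per-channel histograms for two videos;
    mean_dists: list of A_c, the mean chi-square distance of channel c
    over the training set."""
    d = sum(chi2_dist(hi, hj) / A
            for hi, hj, A in zip(hists_i, hists_j, mean_dists))
    return np.exp(-d)
```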
“Learning Realistic Human Actions from Movies” • Future work: • Increase robustness to annotation noise • Improve script-to-video alignment • Learn on a larger automatically annotated database • Experiment with more low-level features • Move from BoF to detector-based methods • The reported per-class results also show: • the effect of temporal subdivision when combining channels (HMM-based methods should work) • the best spatio-temporal pyramid pattern varies per class, so context is captured best when the action is scene-dependent
Image Context in Action Recognition • Paper info: • Title: • Actions in Context • Authors: • Marcin Marszalek: INRIA / LEAR • Ivan Laptev: INRIA / IRISA • Cordelia Schmid: INRIA / LEAR • CVPR 2009
“Actions in Context” • Contributions: • Automatic learning of scene classes from video • Improve action recognition using image context, and vice versa • Movie scripts are used for automatic training • For both actions and scenes: BoF + SVM • New large database: • 12 action classes • 69 movies • 10 scene classes • www.irisa.fr/vista/actions/hollywood2
“Actions in Context” • For automatic annotation, scenes are identified only from text • Features: • SIFT (modeling scene) on 2D-Harris • HoG and HoF (motion) on 3D-Harris (STIP)
“Actions in Context” • Features: • SIFT: extracted at 2D-Harris detections • Captures static appearance • Used for modeling scene context • Calculated on individual frames (one frame every 2 seconds) • HoG/HoF: extracted at 3D-Harris (STIP) detections • HoG captures dynamic appearance • HoF captures motion patterns • One video dictionary is created per channel • A histogram of video words is created for each channel • Classifier: • SVM using the chi2 distance • Exponential kernel (RBF) • Summed over multiple channels
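Putting the pieces together, here is a hedged sketch of training such a classifier with scikit-learn's precomputed-kernel SVM; the multi-channel exponential chi-square kernel mirrors the kernel sketch in the previous section, and the per-video channel histograms (SIFT / HoG / HoF bags of features) are assumed inputs, not the authors' pipeline.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel_matrix(channel_hists, mean_dists):
    """Gram matrix of the multi-channel exponential chi-square kernel.
    channel_hists[c]: (n_videos, n_bins_c) histogram matrix for channel c;
    mean_dists[c]: mean chi-square distance A_c of that channel."""
    n = channel_hists[0].shape[0]
    D = np.zeros((n, n))
    for H, A in zip(channel_hists, mean_dists):
        for i in range(n):
            for j in range(n):
                D[i, j] += 0.5 * np.sum((H[i] - H[j]) ** 2 /
                                        (H[i] + H[j] + 1e-12)) / A
    return np.exp(-D)

def train_context_action_svm(channel_hists, labels, mean_dists):
    """SVM on the precomputed multi-channel kernel (sklearn handles
    multi-class internally, which differs from a strict one-vs-rest setup)."""
    K = chi2_kernel_matrix(channel_hists, labels_free := mean_dists) if False else \
        chi2_kernel_matrix(channel_hists, mean_dists)
    clf = SVC(kernel="precomputed")
    clf.fit(K, labels)
    return clf
```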
“Actions in Context” • Evaluations: • SIFT: better for context (scene) classification • HoG/HoF: better for action classification • Context alone already classifies actions fairly well! • Combination of the 3 channels works best