
Action Recognition


Presentation Transcript


  1. A general survey of previous works on Action Recognition Sobhan Naderi Parizi September 2009

  2. List of papers • Statistical Analysis of Dynamic Actions • On Space-Time Interest Points • Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words • What, where and who? Classifying events by scene and object recognition • Recognizing Actions at a Distance • Recognizing Human Actions: A Local SVM Approach • Retrieving Actions in Movies • Learning Realistic Human Actions from Movies • Actions in Context • Selection and Context for Action Recognition

  3. Non-parametric Distance Measure for Action Recognition • Paper info: • Title: • Statistical Analysis of Dynamic Actions • Authors: • Lihi Zelnik-Manor • Michal Irani • TPAMI 2006 • A preliminary version appeared in CVPR 2001 as “Event-Based Analysis of Video”

  4. “Statistical Analysis of Dynamic Actions” • Overview: • Introduces a non-parametric distance measure • Video matching (no action model): given a reference video, similar sequences are found • Dense features from multiple temporal scales (only corresponding scales are compared) • The temporal extent of videos in each category should be the same! (fast and slow dancing count as different actions) • A new database is introduced • Periodic activities (Walk) • Non-periodic activities (Punch, Kick, Duck, Tennis) • Temporal textures (water) • www.wisdom.weizmann.ac.il/~vision/EventDetection.html

  5. “Statistical Analysis of Dynamic Actions” • Feature description: • Space-time gradient of each pixel • Threshold the gradient magnitudes • Normalization (ignoring appearance) • Absolute value (invariant to dark/light transitions) • Direction invariant
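
A minimal NumPy sketch of this per-pixel feature, assuming grayscale frames in [0, 1]; the function name and threshold value are illustrative, not taken from the paper:

```python
import numpy as np

def spacetime_gradient_features(clip, threshold=0.1):
    """Per-pixel space-time gradient features (illustrative sketch).

    clip: (T, H, W) float array of (blurred) grayscale frames in [0, 1].
    Returns one 3-vector per "moving" pixel: the absolute, magnitude-
    normalized gradient components.
    """
    gt, gy, gx = np.gradient(clip.astype(np.float64))  # gradients along t, y, x
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    mask = mag > threshold  # keep only pixels with significant change
    # Normalizing by the magnitude discards appearance/contrast; the
    # absolute value gives invariance to dark<->light transitions.
    feats = np.stack([np.abs(g[mask]) / (mag[mask] + 1e-8)
                      for g in (gx, gy, gt)], axis=1)
    return feats  # shape (n_active_pixels, 3)
```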

  6. “Statistical Analysis of Dynamic Actions” • Comments: • Actions are represented by 3L independent 1D distributions (L being the number of temporal scales) • The frames are blurred first • Robust to change of appearance, e.g. highly textured clothing • Action recognition/localization: • For a test video sequence S and a reference sequence of T frames: • Each consecutive sub-sequence of length T is compared to the reference (see the sketch below) • In case of multiple reference videos: • Mahalanobis distance
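
A hedged sketch of this sliding-window matching, assuming per-frame feature histograms are already computed; the paper compares bundles of per-scale gradient-component distributions with its own non-parametric measure, for which a single χ² comparison stands in here:

```python
import numpy as np

def chi2(p, q, eps=1e-8):
    """Symmetric chi-square distance between two L1-normalized histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def localize_action(test_hists, ref_hist, T):
    """Score every window of T frames in the test sequence against the
    reference histogram; low scores mark likely occurrences.

    test_hists: (N, B) per-frame feature histograms; ref_hist: (B,)."""
    scores = []
    for s in range(len(test_hists) - T + 1):
        win = test_hists[s:s + T].sum(axis=0).astype(np.float64)
        win /= win.sum() + 1e-8
        scores.append(chi2(win, ref_hist))
    return np.array(scores)
```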

  7. Space-Time Interest Points (STIP) • Paper info: • Title: • On Space-Time Interest Points • Authors: • Ivan Laptev: INRIA / IRISA • IJCV 2005

  8. “On Space-Time Interest Points” • Extends Harris detector to 3D (space-time) • Local space-time points with non-constant motion: • Points with accelerated motion: physical forces • Independent space and time scales • Automatic scale selection
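
A rough sketch of the resulting 3D Harris response H = det(μ) − k·trace(μ)³ over the smoothed second-moment matrix μ; the scale parameters and k below are illustrative values, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(clip, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris response H = det(mu) - k * trace(mu)^3.

    clip: (T, H, W) grayscale video volume; local maxima of the returned
    volume are the interest points."""
    L = gaussian_filter(clip.astype(np.float64), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)

    def smooth(a):  # integration-scale smoothing of the moment entries
        return gaussian_filter(a, (s * tau, s * sigma, s * sigma))

    xx, yy, tt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    xy, xt, yt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)
    # det and trace of the 3x3 second-moment matrix mu at every voxel
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3
```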

  9. “On Space-Time Interest Points” • Automatic scale selection procedure: • Detect interest points • Move in the direction of the optimal scale • Repeat until a locally optimal scale is reached (iterative) • The procedure cannot be used in real time: • Future frames are needed • There exist estimation approaches to solve this problem

  10. Unsupervised Action Recognition • Paper info: • Title: • Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words • Authors: • Juan Carlos Niebles: University of Illinois • Hongcheng Wang: University of Illinois • Li Fei-Fei: University of Illinois • BMVC 2006

  11. “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Generative graphical model (pLSA) • The interest-point detector of Piotr Dollár et al. is used • Laptev’s STIP detector is too sparse • A dictionary of video words is created • The method is unsupervised • Simultaneous action recognition/localization • Evaluations on: • KTH action database • Skating actions database (4 action classes)

  12. “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Overview of the method: • w: video word • d: video sequence • z: latent topic (action category)
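
A tiny EM sketch of pLSA on a video-word count matrix, to make the w/d/z roles concrete; this is an illustrative implementation, not the authors' code:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """EM for pLSA. counts: (W, D) matrix of video-word counts per
    sequence; returns P(w|z) and P(z|d). Classify a sequence d by
    argmax_z P(z|d)."""
    rng = np.random.default_rng(seed)
    W, D = counts.shape
    p_w_z = rng.random((W, n_topics)); p_w_z /= p_w_z.sum(0)  # P(w|z)
    p_z_d = rng.random((n_topics, D)); p_z_d /= p_z_d.sum(0)  # P(z|d)
    for _ in range(n_iter):
        # E-step: P(z|w,d) proportional to P(w|z) * P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]  # (W, Z, D)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts
        expected = counts[:, None, :] * joint
        p_w_z = expected.sum(axis=2); p_w_z /= p_w_z.sum(0) + 1e-12
        p_z_d = expected.sum(axis=0); p_z_d /= p_z_d.sum(0) + 1e-12
    return p_w_z, p_z_d
```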

  13. “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” • Feature descriptor: • Brightness gradient + PCA • The brightness gradient was found equivalent to optical flow for capturing motion • Multiple actions can be localized in the video • Average classification accuracy: • KTH action database: 81.5% • Skating dataset: 80.67%
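
A hedged sketch of such a descriptor, assuming space-time cuboids have already been cut out around the detected interest points; the component count is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def cuboid_descriptors(cuboids, n_components=100):
    """Flatten the brightness gradients of each space-time cuboid and
    reduce with PCA. cuboids: (N, T, H, W) float array; n_components
    must not exceed N or the flattened feature length."""
    feats = []
    for c in cuboids:
        gt, gy, gx = np.gradient(c)  # gradients along t, y, x
        feats.append(np.concatenate([g.ravel() for g in (gx, gy, gt)]))
    return PCA(n_components=n_components).fit_transform(np.asarray(feats))
```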

  14. Event recognition in sport images • Paper info: • Title: • What, where and who? Classifying events by scene and object recognition • Authors: • Li-Jia Li: University of Illinois • Li Fei-Fei: Princeton University • ICCV 2007

  15. “What, where and who? Classifying events by scene and object recognition” • Goal of the paper: • Event classification in still images • Scene labeling • Object labeling • Approach: • Generative graphical model • Assumes that objects and scenes are independent given the event category • Ignores spatial relationships between objects

  16. “What, where and who? Classifying events by scene and object recognition” • Information channels: • Scene context (holistic representation) • Object appearance • Geometrical layout (sky at infinity/vertical structure/ground plane) • Feature extraction: • 12x12 patches obtained by grid sampling (10x10) • For each patch: • SIFT feature (used both for scene and object models) • Layout label (used only for object model)
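
A sketch of the grid-sampled patch features using OpenCV's SIFT (in the main OpenCV package since 4.4); the grid step and patch size follow the slide, everything else is illustrative:

```python
import cv2

def dense_sift(gray, step=10, patch=12):
    """SIFT descriptors on a regular grid of patches (dense sampling).

    gray: uint8 grayscale image; returns an (n_patches, 128) array."""
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, h - patch // 2, step)
           for x in range(patch // 2, w - patch // 2, step)]
    sift = cv2.SIFT_create()
    _, desc = sift.compute(gray, kps)
    return desc
```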

  17. “What, where and who? Classifying events by scene and object recognition” • The graphical model • E: event • S: scene • O: object • X: scene feature • A: appearance feature • G: geometry layout

  18. “What, where and who? Classifying events by scene and object recognition” • A new database is compiled: • 8 sport event categories (downloaded from the web) • Bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing • Average classification accuracy over all 8 event classes = 74.3%

  19. “What, where and who? Classifying events by scene and object recognition” • Sample results:

  20. Action recognition in medium resolution regimes • Paper info: • Title: • Recognizing Actions at a Distance • Authors: • Alexei A. Efros: UC Berkeley • Alexander C. Berg: UC Berkeley • Greg Mori: UC Berkeley • Jitendra Malik: UC Berkeley • ICCV 2003

  21. “Recognizing Actions at a Distance” • Overall review: • Actions in medium resolution (30 pix tall) • Proposes a new motion descriptor • KNN for classification • A consistent tracking bounding box of the actor is required • Action recognition is done only on the tracking bounding box • Motion is described in terms of relative movement of body parts • No information about movements is given by the tracker

  22. “Recognizing Actions at a Distance” • Motion feature: • For each frame, a local temporal neighborhood is considered • Optical flow is extracted (other alternatives: image pixel values, temporal gradients) • OF is noisy: • half-wave rectification + blurring • To preserve motion info: • the OF vector is decomposed into its vertical/horizontal components (see the sketch below)
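
A sketch of these rectified, blurred flow channels; Farneback flow and the blur width stand in for whatever the authors used:

```python
import cv2
import numpy as np

def motion_channels(prev, curr, blur_sigma=3.0):
    """Half-wave rectified, blurred optical-flow channels.

    prev, curr: consecutive uint8 grayscale frames (person-centered crops)."""
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification: split each component into +/- channels
    chans = [np.maximum(fx, 0), np.maximum(-fx, 0),
             np.maximum(fy, 0), np.maximum(-fy, 0)]
    # Blurring turns the noisy flow into a smooth, comparable motion field
    return [cv2.GaussianBlur(c, (0, 0), blur_sigma) for c in chans]
```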

  23. “Recognizing Actions at a Distance” • Similarity measure: • i, j: frame indices • T: temporal extent of the comparison window • I: spatial extent • a_c, b_c: blurred motion channels of the 1st (A) and 2nd (B) video sequences • S(i, j) = Σ_{t∈T} Σ_{(x,y)∈I} Σ_c a_c(x, y, i+t) · b_c(x, y, j+t)
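
A direct sketch of this similarity matrix, assuming the rectified, blurred motion channels from the previous slide are stacked per frame:

```python
import numpy as np

def frame_similarity(A, B, T=5):
    """S(i, j) for all frame pairs, following the formula above.

    A, B: (n_frames, n_channels, H, W) stacks of blurred motion channels;
    T is the half-width of the temporal window (2T+1 frames)."""
    S = np.zeros((len(A), len(B)))
    for i in range(T, len(A) - T):
        for j in range(T, len(B) - T):
            # sum over the temporal window, channels, and spatial extent I
            S[i, j] = np.sum(A[i - T:i + T + 1] * B[j - T:j + T + 1])
    return S
```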

  24. “Recognizing Actions at a Distance” • New Dataset: • Ballet (stationary camera): • 16 action classes • 2 men + 2 women • Easy dataset (controlled environment) • Tennis (real action, stationary camera): • 6 action classes (stand, swing, move-left, …) • different days/location/camera position • 2 players (man + woman) • Football (real action, moving camera): • 8 action classes (run-left 45˚, run-left, walk-left, …) • Zoom in/out

  25. “Recognizing Actions at a Distance” • Average classification accuracy: • Ballet: 87.44% (5NN) • Tennis: 64.33% (5NN) • Football: 65.38% (1NN) • What can be done?

  26. “Recognizing Actions at a Distance” • Applications: • Do as I Do: • Replace actors in videos • Do as I Say: • Develop real-world motions in computer games • 2D/3D skeleton transfer • Figure Correction: • Remove occlusion/clutter in movies

  27. KTH Action Dataset • Paper info: • Title: • Recognizing Human Actions: A Local SVM Approach • Authors: • Christian Schüldt: KTH • Ivan Laptev: KTH • Barbara Caputo: KTH • ICPR 2004

  28. “Recognizing Human Actions: A Local SVM Approach” • New dataset (KTH action database): • 2391 video sequences • 6 action classes (Walking, Jogging, Running, Handclapping, Boxing, Hand-waving) • 25 persons • Static camera • 4 scenarios: • Outdoors (s1) • Outdoors + scale variation (s2): the hardest scenario • Outdoors + clothing variation (s3) • Indoors (s4)

  29. “Recognizing Human Actions: A Local SVM Approach” • Features: • Sparse (STIP detector) • Spatio-temporal jets of order 4 • Different feature representations: • Raw jet feature descriptors • Exponential kernel on the histogram of jets • Spatial HoG with temporal pyramid • Different classifiers: • SVM • NN
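
A sketch of an exponential χ²-style kernel over feature histograms with a precomputed-kernel SVM; the kernel form and gamma are illustrative stand-ins for the paper's local-feature kernels:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel K(x, y) = exp(-gamma * chi2(x, y))."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = 0.5 * np.sum((x - y) ** 2 / (x + y + 1e-10))
    return np.exp(-gamma * K)

# Usage with a precomputed-kernel SVM (train/test histograms assumed given):
# clf = SVC(kernel="precomputed")
# clf.fit(chi2_kernel(train_hists, train_hists), train_labels)
# pred = clf.predict(chi2_kernel(test_hists, train_hists))
```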

  30. “Recognizing Human Actions: A Local SVM Approach” • Experimental results: • Local Feature (jets) + SVM performs the best • SVM outperforms NN • HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients) • Average classification accuracy on all scenarios = 71.72%

  31. Action Recognition in Real Scenarios • Paper info: • Title: • Retrieving Actions in Movies • Authors: • Ivan Laptev: INRIA / IRISA • Patrick Pérez: INRIA / IRISA • ICCV 2007

  32. “Retrieving Actions in Movies” • A new action database from real movies • Experiments only on the Drinking action vs. random clips/Smoking • Main contributions: • Recognizing unrestricted real actions • Key-frame priming • Configuration of experiments: • Action recognition (on pre-segmented sequences) • Comparing different features • Action detection (using key-frame priming)

  33. “Retrieving Actions in Movies” • Real movie action database: • 105 drinking actions • 141 smoking actions • Different scenes/people/views • www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html • Action representation: • R = (P, ΔP) • P = (X, Y, T): space-time coordinates • ΔP = (ΔX, ΔY, ΔT): space-time extent • ΔX: 1.6 × width of the head bounding box • ΔY: 1.3 × height of the head bounding box

  34. “Retrieving Actions in Movies” • Learning scheme: • Discrete AdaBoost + FLD (Fisher Linear Discriminant) • All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting) • Slightly temporally-randomized sequences are added to the training set • HoG (4 bins) / OF (5 bins) is used • Local features: • Θ = (x, y, t, δx, δy, δt, β, Ψ) • β ∈ {plain, temp-2, spat-4} • Ψ ∈ {OF5, Grad4}

  35. “Retrieving Actions in Movies” • HoG captures shape, OF captures motion • Informative motions: start & end of action • Key-frame: • when the hand reaches the head • Boosted histogram on HoG • No motion info around the key-frame • Integration of motion & key-frame should help

  36. “Retrieving Actions in Movies” • Experiments: • OF / OF+HoG / STIP+NN / only key-frame • OF / OF+HoG works best on the hard test (drinking vs. smoking) • Extension of OF5 to OFGrad9 does not help! • Key-frame priming: • #FPs decreases significantly (different information channels) • Significant overall accuracy gain: • it is better to model motion and appearance separately • Speed of the key-primed version: 3 seconds per frame

  37. “Retrieving Actions in Movies” • Possible extensions: • Extend the experiments to more action classes • Make it real-time

  38. Automatic Video Annotation • Paper info: • Title: • Learning Realistic Human Actions from Movies • Authors: • Ivan Laptev: INRIA / IRISA • Marcin Marszałek: INRIA / LEAR • Cordelia Schmid: INRIA / LEAR • Benjamin Rozenfeld: Bar-Ilan University • CVPR 2008

  39. “Learning Realistic Human Actions from Movies” • Overview: • Automatic movie annotation: • Alignment of movie scripts • Text classification • Classification of real actions • Provides a new dataset • Beats state-of-the-art results on the KTH dataset • Extends the spatial pyramid to a space-time pyramid

  40. “Learning Realistic Human Actions from Movies” • Movie script: • Publicly available textual description of: • Scene description • Characters • Transcribed dialogs • Actions (descriptive) • Limitations: • No exact timing alignment • No guarantee of correspondence with real actions • Actions are expressed literally (diverse descriptions) • Actions may be missed in scenes without dialog (no subtitles to align against)

  41. “Learning Realistic Human Actions from Movies” • Automatic annotation: • Subtitles include exact time alignment • Script timing is recovered by matching scripts to subtitles • Textual descriptions of actions are identified by a text classifier • New dataset: • 8 action classes (AnswerPhone, GetOutCar, SitUp, …) • Two training sets (automatically/manually annotated) • 60% of the automatic training set is correctly annotated • http://www.irisa.fr/vista/actions

  42. “Learning Realistic Human Actions from Movies” • Action classification approach: • BoF framework (k=4000) • Space-time pyramids • 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2} • 4 temporal grids: {t1, t2, t3, ot2} • STIP with multiple scales • HoG and HoF

  43. “Learning Realistic Human Actions from Movies” • Feature extraction: • A volume of (2kσ x 2kσ x 2kτ) is taken around each STIP, where σ/τ is the spatial/temporal extent (k=9) • The volume is divided into a grid • HoG and HoF are calculated for each grid cell and concatenated • These concatenated features are concatenated once more according to the pattern of the space-time pyramid (see the sketch below)
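
A sketch of one pyramid channel: a bag-of-features histogram over a space-time grid, assuming the STIPs have already been quantized into dictionary words:

```python
import numpy as np

def spacetime_bof(points, words, K, sx, sy, st):
    """L1-normalized histogram of video words over an sx * sy * st grid.

    points: (N, 3) STIP coordinates (x, y, t) normalized to [0, 1];
    words: (N,) dictionary indices; K: dictionary size (e.g. 4000)."""
    hist = np.zeros((sx, sy, st, K))
    for (x, y, t), w in zip(points, words):
        xi = min(int(x * sx), sx - 1)
        yi = min(int(y * sy), sy - 1)
        ti = min(int(t * st), st - 1)
        hist[xi, yi, ti, w] += 1
    hist = hist.reshape(-1)  # per-cell histograms, concatenated
    return hist / max(hist.sum(), 1.0)

# e.g. the "2x2 spatial, t3 temporal" channel is spacetime_bof(p, w, 4000, 2, 2, 3)
```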

  44. “Learning Realistic Human Actions from Movies” • Different channels: • Each spatio-temporal template: one channel • Greedy search to find the best channel combination • Kernel function: K(H_i, H_j) = exp(−Σ_c D_c(H_i, H_j) / A_c), with D_c the χ² distance of channel c and A_c its normalizer • Observations: • HoG performs better than HoF • No temporal subdivision is preferred (temporal grid = t1) • Combination of channels improves classification in the real scenario • Mean AP on the KTH action database = 91.8% • Mean AP on the real movies database: • Trained on the manually annotated dataset: 39.5% • Trained on the automatically annotated dataset: 22.9% • Random classifier (chance): 12.5%
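
A sketch of that combined kernel, assuming one histogram matrix per channel; in the paper A_c is the mean χ² distance over training pairs, which the caller supplies here:

```python
import numpy as np

def chi2_dist(X, Y, eps=1e-10):
    """Pairwise chi-square distances between rows of X and rows of Y."""
    return 0.5 * np.array([[np.sum((x - y) ** 2 / (x + y + eps)) for y in Y]
                           for x in X])

def combined_kernel(channels_a, channels_b, normalizers):
    """K(i, j) = exp(-sum_c D_c(i, j) / A_c) over the selected channels.

    channels_a / channels_b: lists with one histogram matrix per channel;
    normalizers: one A_c per channel."""
    acc = sum(chi2_dist(Xa, Xb) / A
              for Xa, Xb, A in zip(channels_a, channels_b, normalizers))
    return np.exp(-acc)
```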

  45. “Learning Realistic Human Actions from Movies” • Future work: • Increase robustness to annotation noise • Improve script-to-video alignment • Learn on a larger database of automatic annotations • Experiment with more low-level features • Move from BoF to detector-based methods • The results table (not reproduced here) shows: • the effect of temporal division when combining channels (HMM-based methods should work) • the pattern of the space-time pyramid changes so that context is best captured when the action is scene-dependent

  46. Image Context in Action Recognition • Paper info: • Title: • Actions in Context • Authors: • Marcin Marszałek: INRIA / LEAR • Ivan Laptev: INRIA / IRISA • Cordelia Schmid: INRIA / LEAR • CVPR 2009

  47. “Actions in Context” • Contributions: • Automatic learning of scene classes from video • Improve action recognition using image context, and vice versa • Movie scripts are used for automatic training • For both actions and scenes: BoF + SVM • New large database: • 12 action classes • 69 movies involved • 10 scene classes • www.irisa.fr/vista/actions/hollywood2

  48. “Actions in Context” • For automatic annotation, scenes are identified only from text • Features: • SIFT (modeling scene) on 2D-Harris • HoG and HoF (motion) on 3D-Harris (STIP)

  49. “Actions in Context” • Features: • SIFT: extracted at 2D-Harris detections • Captures static appearance • Used for modeling scene context • Calculated for a single frame every 2 seconds • HoG/HoF: extracted at 3D-Harris (STIP) detections • HoG captures dynamic appearance • HoF captures motion patterns • One video dictionary per channel is created • A histogram of video words is created for each channel • Classifier: • SVM with an exponential (RBF-style) χ² kernel • Summed over multiple channels

  50. “Actions in Context” • Evaluations: • SIFT: better for context • HoG/HoF: better for actions • Context alone can also classify actions fairly well! • Combination of the 3 channels works best
