1 / 24

Human Activity Recognition at Mid and Near Range

Human Activity Recognition at Mid and Near Range. Ram Nevatia University of Southern California Based on work of several collaborators: F. Lv, P. Natarajan, S. Lee, C. Huang International Workshop on Video 2009 May 26, 2009. Activity Recognition: Motivation.

parley
Download Presentation

Human Activity Recognition at Mid and Near Range

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Human Activity Recognition at Mid and Near Range Ram Nevatia University of Southern California Based on work of several collaborators: F. Lv, P. Natarajan, S. Lee, C. Huang International Workshop on Video 2009 May 26, 2009

  2. Activity Recognition: Motivation • Is the key content of a video (along with scene description) • Useful for • Monitoring (alerts) • Indexing (forensic, deep analysis, entertainment…) • HCI • ..

  3. Activity Recognition: Goals • Goal is not just to give a name, but also a description (not just the verb but a sentence) • Who, what, when, where, why etc…? • Some of these inferences require object recognition in addition to “action” recognition • Actor, object, instrument…. • Context and story understanding is important to infer intent

  4. Action as Change of State • A change in state, is given by some function, say f (s, s’, t), • Example: walking changes position of the walker • An event can also be defined over an interval where some properties of f are constant (or within a certain range) • Example: walking at a constant speed or in the same direction • Recognition methods require some estimate of the state, such as positions or pose of actors, their trajectories and relation to scene objects

  5. Event Composition • Composite Events • Compositions of other, simpler events. • Composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building. • Primitive events: those we choose not to decompose, e.g. walking • Primitive events can be recognized directly from observables, by using standard classifiers. • Graphical models, such as HMMs and CRFs are natural tools for recognition of composite events.

  6. Hierarchical Models • Hierarchical structure of events is naturally reflected in hierarchical graphical models

  7. Issues in Activity Recognition • Variations in image/video appearance due to changes in viewpoint, illumination, clothing, style of activity etc. • Inherent ambiguities in 2-D videos • Reliable detection and tracking of objects, especially those directly involved in activities • Temporal segmentation • “Recognition” of novel events

  8. Mid vs Near Range • Mid-range • Limbs of human body, particularly the arms, are not distinguishable • Common approach is to detect and track moving objects and make inferences based on trajectories • Near-range • Hands/arms are visible; activities are defined by pose transitions, not just the position transitions • Pose tracking is difficult; top-down methods are commonly used

  9. Mid-Range Example Example of abandoned luggage detection Based on trajectory analysis and simple object detection/recognition Uses a simple Bayesian classifier and logical reasoning about order of sub-events Tested on PETS and ETISEO data

  10. Tracking in Crowded Environments • Results from CVPR09 paper

  11. Dealing with Track Failures • In crowded environments, track fragmentation is common • Events of interest themselves may cause occlusions, e.g. two (or more) people meeting • Possible event detection can trigger a re-evaluation of the tracks • Meeting event example • People must have been separate, then get close to each other and stay together for some time • How to distinguish between passing by and meeting? Both may cause tracks to vanish.

  12. Meeting Event Result (Videos) Meeting Event Detection Result Tracking Result

  13. Events requiring fine Pose Tracking • Many events, e.g. gestures, requiring tracking of body pose, not just position • Humans pose has large degrees of freedom • >50 joint angles/positions • Bottom up pose tracking approaches are slow and not robust • Top down approaches attempt to recognize activity and pose simultaneously • Note that usually data is not pre-segmented into primitive action segments • Closed-world assumption

  14. … … … Input sequence … … … … 3D body pose Activity Recognition w/o Tracking check watch Action segments punch kick pick up throw +

  15. Difficulties • Viewpoint change & pose ambiguity (with a single camera view) • Spatial and temporal variations (style, speed)

  16. Key Poses and Action Nets • Key poses are determined by an automatic method that computes large changes in energy; key poses may be shared among different actions

  17. Experiments: Training Set 15 action models 177 key poses 6372 nodes in Action Net

  18. Action Net: Apply constraints 0o 10o …

  19. Experiments: Test Set 50 clips, average length 1165 frames 5 viewpoints 10 actors (5 men, 5 women)

  20. Experiments: Results

  21. A Video Result extracted blob & ground truth original frame without action net with action net

  22. Working with Natural Environments • Foreground segmentation is difficult • Leads to use of lower level features, e.g. edges and optical flow • Key poses are not discriminative enough w/o accurate segmentation; actor position also needs to be inferred • We introduce use of continuous pose sequence. • More general graphical models that include • Hierarchy • Transition probabilities may depend on observations • Observations may depend on multiple states • Duration models (HMMs imply an exponential decay)

  23. Experiments Tested the approach on videos of 6 actions- sit-on-ground(SG), standup-from-ground(StG), sit-on-chair(SC), standup-from-ground(StC), pickup(PK), point(P). Collected instances of these actions around 4 tilt angles and 5 pan angles A total of 400 instances over all actions with various backgrounds. We compared the relative importance of shape, flow and duration features with our system (shape+flow+duration).

  24. Results Combining flow and shape produces a clear improvement. Bulk of the expense is in computing the flow.

More Related