240 likes | 451 Views
Human Activity Recognition at Mid and Near Range. Ram Nevatia University of Southern California Based on work of several collaborators: F. Lv, P. Natarajan, S. Lee, C. Huang International Workshop on Video 2009 May 26, 2009. Activity Recognition: Motivation.
E N D
Human Activity Recognition at Mid and Near Range Ram Nevatia University of Southern California Based on work of several collaborators: F. Lv, P. Natarajan, S. Lee, C. Huang International Workshop on Video 2009 May 26, 2009
Activity Recognition: Motivation • Is the key content of a video (along with scene description) • Useful for • Monitoring (alerts) • Indexing (forensic, deep analysis, entertainment…) • HCI • ..
Activity Recognition: Goals • Goal is not just to give a name, but also a description (not just the verb but a sentence) • Who, what, when, where, why etc…? • Some of these inferences require object recognition in addition to “action” recognition • Actor, object, instrument…. • Context and story understanding is important to infer intent
Action as Change of State • A change in state, is given by some function, say f (s, s’, t), • Example: walking changes position of the walker • An event can also be defined over an interval where some properties of f are constant (or within a certain range) • Example: walking at a constant speed or in the same direction • Recognition methods require some estimate of the state, such as positions or pose of actors, their trajectories and relation to scene objects
Event Composition • Composite Events • Compositions of other, simpler events. • Composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building. • Primitive events: those we choose not to decompose, e.g. walking • Primitive events can be recognized directly from observables, by using standard classifiers. • Graphical models, such as HMMs and CRFs are natural tools for recognition of composite events.
Hierarchical Models • Hierarchical structure of events is naturally reflected in hierarchical graphical models
Issues in Activity Recognition • Variations in image/video appearance due to changes in viewpoint, illumination, clothing, style of activity etc. • Inherent ambiguities in 2-D videos • Reliable detection and tracking of objects, especially those directly involved in activities • Temporal segmentation • “Recognition” of novel events
Mid vs Near Range • Mid-range • Limbs of human body, particularly the arms, are not distinguishable • Common approach is to detect and track moving objects and make inferences based on trajectories • Near-range • Hands/arms are visible; activities are defined by pose transitions, not just the position transitions • Pose tracking is difficult; top-down methods are commonly used
Mid-Range Example Example of abandoned luggage detection Based on trajectory analysis and simple object detection/recognition Uses a simple Bayesian classifier and logical reasoning about order of sub-events Tested on PETS and ETISEO data
Tracking in Crowded Environments • Results from CVPR09 paper
Dealing with Track Failures • In crowded environments, track fragmentation is common • Events of interest themselves may cause occlusions, e.g. two (or more) people meeting • Possible event detection can trigger a re-evaluation of the tracks • Meeting event example • People must have been separate, then get close to each other and stay together for some time • How to distinguish between passing by and meeting? Both may cause tracks to vanish.
Meeting Event Result (Videos) Meeting Event Detection Result Tracking Result
Events requiring fine Pose Tracking • Many events, e.g. gestures, requiring tracking of body pose, not just position • Humans pose has large degrees of freedom • >50 joint angles/positions • Bottom up pose tracking approaches are slow and not robust • Top down approaches attempt to recognize activity and pose simultaneously • Note that usually data is not pre-segmented into primitive action segments • Closed-world assumption
… … … … Input sequence … … … … 3D body pose Activity Recognition w/o Tracking check watch Action segments punch kick pick up throw +
Difficulties • Viewpoint change & pose ambiguity (with a single camera view) • Spatial and temporal variations (style, speed)
Key Poses and Action Nets • Key poses are determined by an automatic method that computes large changes in energy; key poses may be shared among different actions
Experiments: Training Set 15 action models 177 key poses 6372 nodes in Action Net
Action Net: Apply constraints 0o 10o …
Experiments: Test Set 50 clips, average length 1165 frames 5 viewpoints 10 actors (5 men, 5 women)
A Video Result extracted blob & ground truth original frame without action net with action net
Working with Natural Environments • Foreground segmentation is difficult • Leads to use of lower level features, e.g. edges and optical flow • Key poses are not discriminative enough w/o accurate segmentation; actor position also needs to be inferred • We introduce use of continuous pose sequence. • More general graphical models that include • Hierarchy • Transition probabilities may depend on observations • Observations may depend on multiple states • Duration models (HMMs imply an exponential decay)
Experiments Tested the approach on videos of 6 actions- sit-on-ground(SG), standup-from-ground(StG), sit-on-chair(SC), standup-from-ground(StC), pickup(PK), point(P). Collected instances of these actions around 4 tilt angles and 5 pan angles A total of 400 instances over all actions with various backgrounds. We compared the relative importance of shape, flow and duration features with our system (shape+flow+duration).
Results Combining flow and shape produces a clear improvement. Bulk of the expense is in computing the flow.