470 likes | 530 Views
Explore the three levels of motion understanding, from movement to action to activity, and delve into the grammar-based representation and parsing of structured events using Stochastic Context-Free Grammars. Learn about the advantages of using SCFGs for handling uncertainties in activities and how they enforce spatial and temporal consistency for parsing complex behaviors.
E N D
Seeing Action: Part 2Statistics and/or Structure Aaron Bobickafb@cc.gatech.edu School of Interactive Computing College of ComputingGeorgia Tech
Context Continuing from the lower middle??? • Three levels of understanding motion or behavior: Movement - atomic behaviors defined by motion "Bending down", "(door) rising up", Swinging a hammer Action – a single, semantically meaningful "event" "Opening a door", "Lifting a package" Typically short in time Might be definable in terms of motion; especially so in a particular context. Activity – a behavior or collection of actions with a purpose/intention. "Delivering packages" Typically has causal underpinnings Can be thought of as statistically structured events Maybe Actions are movements in context??
Structure and Statistics (Old and new) • Grammar-based representation and parsing • Highly expressive for activity description • Easy to build higher level activity from reused low level vocabulary. • P-Net (Propagation nets) – really stochastic Petri nets • Specify the structure – with some annotation can learn detectors and triggering probabilities • Statistics of events • Low level events are statistically sequenced – too hard to learn full model. • N-grams or suffix trees
"Higher-level" Activities: Known structure, uncertain elements • Many activities are comprised of a priori defined sequences of primitive elements. • Dancing, conducting, pitching, stealing a car from a parking lot. • The states are not hidden. • The activities can be described by a set of grammar-like rules; often ad hoc approaches taken. • But, the sequences are uncertain: • Uncertain performance of elements • Uncertain observation of elements
The basic idea and approach • Low-level primitives with uncertain feature detection (individual elements might be HMMs) • High-level description found by parsing input stream of uncertain primitives. • Extend Stochastic Context Free Grammars to handle perceptually relevant uncertainty. Idea: split the problem into: Approach:
Stochastic CFGs • Traditional SCFGs have probabilities associated with the production rules. Traditional parsing yields most likely parse given a known set of input symbols. PIECE -> BAR PIECE | [0.5] BAR [0.5] BAR -> TWO | [0.5] THREE [0.5] THREE -> down3 right3 up3 [1.0] TWO -> down2 up2 [1.0] • Thanks to Andreas Stolcke’spriori work on parsing SCFGsusing efficient Earley parser.
Extending SCFGs (Ivanov and Bobick, PAMI) • Within the parser we handle: • Uncertainty about input symbols Input is multi-valued string (vector of likelihoods) • Deletion, substitution, and insertion errors Introduce error rules • Individually recognized primitives typically temporally inconsistent Introduce penalty for overlap. Spatial and temporal consistency enforced. • Need to define when a symbol has been generated. • How do we learn production probabilities? (Not many examples.) Make sure not too sensitive to them.
Enforcing temporal consistency Output of oneHMM parsing backwards - Output event P(primitive) Time
Event Grammar and Parsing • Tracker generates events: ENTER, LOST, FOUND, EXIT, STOP. Tracks have properties (e.g. size) and trajectories. • Tracker assigns class to each event, though only probabilistically. • Parser parses single stream that contains interleaved events: (CAR-ENTER, CAR-STOP, PERSON-FOUND, CAR-EXIT, PERSON-EXIT) • Parser enforces spatial and temporal consistency for each object class and interactions (e.g. to be a PICK-UP, the PERSON-FOUND event must be close to CAR-STOP) • Spatial and temporal consistency eliminates symbolic ambiguity.
Advantages of SCFGs • What grammar can do (simplified): CAR_PASS -> CAR_ENTER CAR_EXIT | CAR_ENTER CAR_HIDDEN CAR_EXIT CAR_HIDDEN -> CAR_LOST CAR_FOUND | CAR_LOST CAR_FOUND CAR_HIDDEN • Skip allows concurrency (and junk): PERSON_LOST -> person_lost | SKIP person_lost • Concurrent parse: Events: ce pe cl cf cs px pl cx PICKUP -> ce pe clcfcs px plcx P_PASS -> ce pe cl cf cs px pl cx
Advantages of STCFG approach • Structure and components of activities defined a priori and are the right levels of annotation to recover (compare to HMMs). • FSM vs CFG is not the point. Rather explicit representation of structural elements and uncertainties. • Often many (enough) examples of each primitive to support training, but not of higher level activity. • Allows for integration of heterogeneous primitive detectors; only assumes likelihood generation. • More robust than ad-hoc rule based techniques: handles errors through probability. • No notion of causality, or anything other than (multi-stream) sequencing.
Advantages of STCFG approach • Structure and components of activities defined a priori and are the right levels of annotation to recover (compare to HMMs). • FSM vs CFG is not the point. Rather explicit representation of structural elements and uncertainties. • Often many (enough) examples of each primitive to support training, but not of higher level activity. • Allows for integration of heterogeneous primitive detectors; only assumes likelihood generation. • More robust than ad-hoc rule based techniques: handles errors through probability. • No notion of causality, or anything other than (multi-stream) sequencing.
Some Q's about Representations… • Scope and Range: • thoughts??? • "Language" of the representation • Grammar of explicit symbols • Computability of an instance: • Quite easy. Given the input string the parsing is both the computation and the matching • Learnability of the "class": • Inside-outside algorithm for learning CFGs but lets be serious… • Stability in face of perceptual uncertainty • Explicitly designed to handle this uncertainty • Inference-support • Depends on what you mean by inference. No notion of real semantics or explicit time.
P-Nets (Propagation Networks) (Shi and Bobick, ’04 and ’06) • Nodes represent activationintervals • Active vs. inactive: Token propagation • More than one node can be active at a time! • Links represent partial order as well logical constraint • Duration model on each link and node: • Explicit model on length of activation • Explicit model on length between successive intervals • Observation model on each node
Conceptual Schema • Logical relation • Autonomous assumption: logic constraint only exists at start/end points of any intervals • Condition probability function can represent any logical function Examples of logic constraint
Propagation Net – Computing • Computational Schema A DBN style rollout to compute corresponding conceptual schema
Experiment: Glucose Project • Task: monitor an user to calibrate a glucose meter and point out operating error as feedback. • Constructed 16 node P-Net as representation • 3 subjects with total of 21 perfect sequences, 10 missing_1_step sequences and 10 missing_6_steps sequences
D-Condensation Initiate 1 particle at dummy starting node Repeat For each particle generate all possible consequent states calculate the probability for each states End Select n particles to survive Until the final time steps is reached Output the path represented by the particle with highest probability
Experiment: Label individual frames Labeling individual nodes Labels on Node J: Insert
And now some statistics… • Problem: the higher level world of activity is not usually a P-Net or an FSM or an HMM or … • Two possible solutions: • Understand what's really going on… …another time. • Lose the structure
Stochastic sequencing (Hamid and Bobick) • A priori define some low-level "actions"/events that can be stochastically detected in context – e.g. Door opening • Collect training data (streams of events) of activities – making a delivery, UPS pick-up, trash collection • Collect histograms of N-tuples and do both activity discovery and recognition • Later can focus on anomalies • Advantages: cheat where easy, learn the hard stuff, exploit the context
Two levels in the representation Low Level: Events (computer vision problem) • Background subtraction and Foreground extraction (better “modeling”) • Classifying (per frame) each foreground object as either • Person • Vehicle (what type if possible) • Package • Tool used to move packages • Miscellaneous object • Tracking people, vehicles, packages, tools, and miscellaneous objects over multiple frames
Two levels in the representation • Higher Level: Statistical characterization of subsequences • Instances of same activity class have certain common subsequences. • But, partially ordered will typically rearrange subsequences within the sequence. • Find a “soft” characterization of the statistics of the subsequences • Deifne similarity measure for such characterization. • Allows discovery of activity classes • Allows for detection of anomalous examples • Caveats: • We provide the events – whether it’s manually or specifying the detector doesn’t really matter (except for publication) • Training needs pre-segmentation
Experimental Setup – Loading Dock • Barnes & Noble Loading Dock Area • One month worth of data: • 5 days a week • 9 a.m. till 5 p.m. • Event Vocabulary – 61 events • Hand-labeled for testing activity labeling, noise sensitivity. • Training detectors for these events Bird’s Eye View of Experimental Setup
Activity-Class Discovery • Treating activities as individual instances • Activity-class discovery – finding maximal cliques in edge weighted graphs • Need to come up with: • Activity Similarity Metric • Procedure to group similar activities
Activity Similarity • Two types of differences • structural differences • frequency differences • Sim(A,B) = 1 – normalized difference between the counts of non-zeros event n-grams • Properties • additive identity • is commutative • does not follow triangular in-equality
Activity-Class Discovery • A graphic theoretic problem of finding maximal cliques in edge-weighted graphs [Pavan, Pelillo ‘03] • Sequentially find maximal cliques in edge weighted graph of activities • Activities different enough from all the regular activities are anomalies
Anomaly Detection • Compute the within-Class similarity of the test activity w.r.t. previous class members • Learn the detection threshold from training data – can be done using an R.O.C.
Anomaly "Explanation" • Explanatory features – their frequency has high mean and low variance • Explanation based on features that were: • Missingfrom an anomaly but were frequently and consistentlypresent in regular members • Extraneous in an anomaly but consistently absentfrom the regular members
Results General Characteristics of Discovered Activity Classes Few of the detected Anomalies • UPS Delivery Vehicles • Fed Ex Delivery Vehicles • Delivery Trucks – multiple packages delivered • Cars and vans, only 1 or 2 packages delivered • Motorized cart used to pick and drop packages • Van deliveries – no use of motorized cart • Delivery trucks – multiple people • Back door of delivery not closed • (b) More than usual number of people • involved in unloading • (c) Very few vocabulary events performed
Results • Are the detected anomalous activities ‘interesting’ from human view-point? • Anecdotal Validation: • Studied 7 users • Showed each user 8 regular activities selected randomly • Showed each user 10 test activities, 5 regular and 5 detected anomalous activities • 8 out of 10 activity-labels of the users matched the labels of our system • Probability of this match happening by chance is 4.4%
Some Q's about Representations… (more discussion) • Scope and Range: • A monitored scene with pre-designed detectors • "Language" of the representation • Histograms and other statistics of feature n-gram occurrences • Computability of an instance: • Given detectors, easy to compute • Learnability of the "class": • Full power of statistical learning. Even allowing notion of outlier detector. • Stability in face of perceptual uncertainty • Fair. Needs to be better. • Inference-support • Distance-in-feature space reasoning only.