S-Seer: A Selective Perception System for Multimodal Office Activity Recognition Nuria Oliver & Eric Horvitz Adaptive Systems & Interaction Microsoft Research
Overview of the Talk • Background: Seer system • Value of information • Selective Perception Policies • Selective-Seer (S-Seer) • Experiments and Video • Summary and Future Directions
Background and Motivation • Research Area: Automatic recognition of human behavior from sensory observations • Applications: • Multimodal human-computer interaction • Visual surveillance, office awareness, distributed teams • Accessibility, medical applications
Sensing in Multimodal Systems • Multimodal sensing & reasoning in personal computing as central vs. peripheral • Multimodal signal processing: Typically requires a large portion, if not nearly all, of the computational resources • Need for strategies to control the allocation of resources for perception in multimodal systems • Design-time and/or real-time
Seer Office Awareness System(ICMI 2002, CVIU 2004 (to appear)) • Seer: Prototype for performing real-time, multimodal, multi-scale office activity recognition • Distinguish among: • Phone conversation • Face-to-face conversation • Working on the computer • Presentation • Nobody around • Distant conversation • Other activities
Multimodal Inputs • Vision: One static Firewire camera sampled at 30 fps • Audio: Two binaural mini-microphones (20-16000 Hz, SNR 58 dB), sampled at 44.1 kHz • Keyboard and mouse: History of the activity during the last 1, 5 and 60 seconds
HMMs for Behavior Recognition • Graphical model: hidden states S emitting observations O at times T = 1..N (e.g., a four-state HMM) • State trellis: the most likely path through states 1..4 is computed with the Viterbi algorithm
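The Viterbi decoding the slide refers to can be sketched as follows. This is a minimal, generic implementation of the standard algorithm, not code from Seer; the matrix shapes and names (`pi`, `A`, `B`) are conventional HMM notation assumed here.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path through an HMM (Viterbi algorithm).

    obs: sequence of discrete observation indices
    pi:  initial state distribution, shape (N,)
    A:   transition matrix, A[s, l] = P(state l | state s), shape (N, N)
    B:   observation matrix, B[s, o] = P(obs o | state s), shape (N, M)
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers along the trellis
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # rows: previous state, cols: next
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With a two-state HMM whose observations track the state closely, the decoded path follows the observation switches.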
Several Limitations of HMMs for Multimodal Reasoning • First-order Markov assumption doesn't address long-term dependencies and multiple time granularities • Assumes single-process dynamics, but signals may be generated by multiple processes • Context limited to a single state variable; representing multiple processes by a Cartesian product HMM quickly becomes intractable • Large parameter space implies large data needs • Empirical experience: Representation sensitive to changes in the environment (lighting, background noise, etc.)
Seer Explored Layered HMMs (LHMMs) • Goal: Decompose parameter space to reduce training and re-training requirements. • Approach: Segment the problem into distinct layers that operate at different temporal granularities. • Consequence: Data explained at different levels of temporal abstraction.
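The layering idea can be made concrete with a small sketch: a bank of HMMs classifies short windows of raw observations, and the resulting window labels become the (coarser-granularity) observation sequence for a second-level bank. This is an illustrative toy, not Seer's implementation; `forward_loglik`, `classify`, and `layered_classify` are names invented here.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under one HMM."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

def classify(obs, bank):
    """Index of the HMM in the bank that best explains the window."""
    return max(range(len(bank)), key=lambda k: forward_loglik(obs, *bank[k]))

def layered_classify(raw_obs, bank1, bank2, window):
    """Two-layer inference: layer-1 labels over short windows become the
    discrete observation sequence for the layer-2 bank at coarser granularity."""
    labels = [classify(raw_obs[i:i + window], bank1)
              for i in range(0, len(raw_obs) - window + 1, window)]
    return classify(labels, bank2)
```

Each layer's parameters can thus be trained (and retrained) independently, which is the decomposition benefit the slide describes.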
Seer: Multi-Scale Activity Recognition (example over time) • Audio: phone ring → speech → silence → speech → silence • Video: one person present → one active person present → one person present • High-level interpretation: On the Telephone
Layered HMM Architecture • Layer 1: bank of K level-1 HMM classifiers over raw observations O_1..O_T with time granularity T • Layer 2: bank of K' level-2 HMM classifiers with coarser time granularity T', taking the layer-1 outputs as observations • Layer 3: bank of K'' level-3 HMM classifiers with time granularity T''
SEER's Architecture • Raw signals → feature vectors: • Audio: PCA on LPC coefficients; energy, mean & variance of the fundamental frequency; zero-crossing rate; sound localization • Video: skin-color probability, face density, foreground/background, motion density • Keyboard/mouse activity • First-level audio HMMs classify: ambient noise, human speech, music, keyboard, phone ring • First-level video HMMs classify: nobody present, one person present, one active person present, multiple people present • Second level combines the first-level classification results with sound localization and keyboard/mouse activity to recognize: phone conversation, face-to-face conversation, working on the computer, presentation, nobody present, distant conversation
Value of LHMMs for Seer Task • Comparison between traditional (Cartesian product) HMMs and LHMMs • 60 minutes of office activity data (10 min/activity; 3 users) • 50% of data for training and 50% for testing • 6 office activities recognized: • Phone Conversation • Face-to-Face Conversation • Working on the Computer • Distant Conversation • Presentation • Nobody Around
Selective Perception Policies (ICMI’03) • Seer performs well but sensing consumes a large portion of the available CPU • Seek to understand value and cost of different sensors / analyses • Define policies for dynamically selecting sensors/features: • Principled decision-theoretic approach • Expected Value of Information (EVI) • Heuristic approach • Observational Frequencies: Rate-based perception • Random • Select features randomly as background case
Related work • Principles for Guiding Perception • Expected value of information (EVI) as a core concept of decision analysis (Raiffa, Howard) • Value of information in probabilistic reasoning systems, use in sequential diagnosis (Gorry 1979; Ben-Bassat 1980; Horvitz et al. 1989; Heckerman et al. 1990) • Probability and utility to model the behavior of vision modules (Bolles, IJCAI'77), to score plans of perceptual actions (Garvey '76), reliability indicators to fuse vision modules (Toyama & Horvitz 2000) • Growing interest in applying decision theory in perceptual applications in the area of active vision search tasks (Rimey '93)
Policy 1: Expected Value of Information (EVI) • Decision-theoretic principles to determine the value of observations • EVI computed by considering the value of eliminating uncertainty about the state of the observational features under consideration • Example: Vision sensor (camera) features: • Motion density • Face density • Foreground density • Skin color density • There are K = 2^4 = 16 possible combinations of these features, representing the plausible sets of observations.
Subsets of Features: Example (16 combinations) • No features • Singles: Skin Color Probability; Motion Density; Face Density; Fgnd/Bckgnd • Pairs: Skin Color & Motion Density; Skin Color & Face Density; Skin Color & Fgnd/Bckgnd; Motion Density & Face Density; Motion Density & Fgnd/Bckgnd; Face Density & Fgnd/Bckgnd • Triples: Skin Color & Motion Density & Face Density; Skin Color & Motion Density & Fgnd/Bckgnd; Skin Color & Face Density & Fgnd/Bckgnd; Motion Density & Face Density & Fgnd/Bckgnd • All four: Skin Color & Face Density & Motion Density & Fgnd/Bckgnd
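The enumeration above is just the power set of the four vision features. A one-liner generates it; the feature names below are shorthand stand-ins for the slide's labels.

```python
from itertools import combinations

# Shorthand for the four vision features on the slide
features = ["skin_color", "motion_density", "face_density", "fg_bg"]

# Every subset, from the empty set up to all four features: 2^4 = 16
subsets = [frozenset(c) for r in range(len(features) + 1)
           for c in combinations(features, r)]
```

The empty set (compute nothing this step) and the full set (compute everything) are both legitimate candidates for the selection policies that follow.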
Criticality of Utility Model • EVI guides sensing by considering the influence of observations on the expected utility of the system's actions • Need to endow the system with a representation of the utility of actions in the world • Assess the utilities U(A_i, A_j) as the value of asserting that the real-world activity is A_i when it is actually A_j • Maximum expected utility action: A* = argmax_i Σ_j U(A_i, A_j) p(A_j | E)
Considering the Outcome of Making an Observation • Expected value (EV) of observing & computing a set of features f_k, whose values are denoted f • E: prior observational evidence • Represent uncertainty about the values the system will observe when evaluating f_k • Consider the change in expected value given the current probability distribution: EV(f_k) = Σ_f p(f | E) max_i Σ_j U(A_i, A_j) p(A_j | E, f), and EVI(f_k) = EV(f_k) − max_i Σ_j U(A_i, A_j) p(A_j | E)
Balancing Costs and Benefits • The net expected value of information (NEVI) of a feature combination f_k is NEVI(f_k) = EVI(f_k) − C(f_k) • The cost C(f_k) is the cost assigned to the computational latency associated with sensing and computing the feature combination f_k • If the difference is positive, it is worth collecting the information and therefore computing the features
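The EVI/NEVI computation above can be sketched for the discrete case. This is a generic illustration under an assumed utility matrix `U[i, j]` (utility of asserting activity i when j holds) and discretized feature values; the function names are invented here, not S-SEER's.

```python
import numpy as np

def evi(p_f, p_A_given_f, p_A, U):
    """Expected value of information for one feature combination.

    p_f:         p(f = v | E) over the discretized values v of the features
    p_A_given_f: p(A | E, f = v) for each value v, shape (V, num_activities)
    p_A:         current posterior p(A | E) over activities
    U:           U[i, j] = utility of asserting activity i when j holds
    """
    # Expected utility after observing f, averaged over its possible values
    eu_after = sum(pv * (U @ post).max()
                   for pv, post in zip(p_f, p_A_given_f))
    eu_now = (U @ p_A).max()
    return eu_after - eu_now

def next_feature_set(candidates, costs):
    """Greedy step: the combination with the highest positive NEVI, or None."""
    nevi = {k: evi(*candidates[k]) - costs[k] for k in candidates}
    best = max(nevi, key=nevi.get)
    return best if nevi[best] > 0 else None
```

A perfectly informative feature under a 0/1 (identity) utility model yields EVI equal to the removed uncertainty, while a feature that leaves the posterior unchanged yields zero, and so never justifies a positive cost.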
Cost Models • Distinct cost models: • Measure of total computation usage • Cost associated with latencies that users will experience • Costs of computation can be context dependent • Example: an expected cost model that takes into account the likelihood that the user will experience poor responsiveness, and the frustration incurred if so
Single and Multistep Analyses • Real-world applications of EVI typically employ a greedy approach, i.e., computing the next best observation at each step • In our analysis, we extend typical EVI computations by reasoning about groups (combinations) of features • We select the feature combination with the greatest EVI at each step • This is the sequential-diagnosis, or hypothetico-deductive, cycle
Hypothetico-Deductive Cycle • Probabilistic module (the layered HMMs) supplies the probability model for selective perception analysis • Control module uses that model to decide which set of features to compute next
EVI with HMMs • Given that our probabilistic models are HMMs, the predictive term p(f_k | E) can be computed as: p(f_k | E) = Σ_s Σ_l α_t(s) a_{sl} b_l(f_k) • Where: • α_t(s) is the forward variable at time t and state s • a_{sl} is the state transition probability of going from state s to state l • b_l(f_k) is the observation probability of f_k in state l • all of them for the model under consideration
EVI in HMMs • If we discretize the observation space, the NEVI becomes a sum over bins: NEVI(f_k) = Σ_{v=1..M} p(f_k = v | E) max_i Σ_j U(A_i, A_j) p(A_j | E, f_k = v) − max_i Σ_j U(A_i, A_j) p(A_j | E) − C(f_k) • In SEER we discretize the observation space into M bins, with M typically 10 • The computational overhead of carrying out EVI in the discrete case is O(M·F·N²·J), where • M is the maximum cardinality (number of bins) of the features, • F is the number of feature combinations, • N is the maximum number of states in the HMMs, and • J is the number of HMMs.
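The predictive term plugged into the discrete NEVI sum is the HMM's one-step observation forecast. A minimal sketch in standard HMM notation (the function name is invented here):

```python
import numpy as np

def predictive_obs_dist(alpha, A, B):
    """One-step predictive distribution over the M discretized observation
    bins: p(o_{t+1} = bin | o_1..t) = sum_s sum_l alpha_t(s) a_{sl} b_l(bin).

    alpha: normalized forward variable at time t, shape (N,)
    A:     transition probabilities a[s, l], shape (N, N)
    B:     observation probabilities b[l, bin], shape (N, M)
    """
    return (alpha @ A) @ B
```

Evaluating this for each of the F feature combinations across the J HMMs, over M bins and N² transitions, gives the O(M·F·N²·J) overhead stated on the slide.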
Policy 2: Heuristic Rate-based Perception • For comparison, we consider selective perception policies based on defining observational frequencies and duty cycles for each feature • Each modality (audio classification, video classification, sound localization, keyboard/mouse) is switched ON and OFF over time according to its period and duty cycle
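A rate-based schedule reduces to a simple test per sensor and time step. This is a generic sketch of the idea, not S-SEER's scheduler; the `phase` parameter is an added assumption for staggering sensors.

```python
def sensor_on(t, period, duty_cycle, phase=0.0):
    """True when a sensor should be ON at time t under a rate-based policy.

    period:     length of one ON/OFF cycle in seconds
    duty_cycle: fraction of the period the sensor stays ON (0..1)
    phase:      offset so different sensors need not wake simultaneously
    """
    return ((t - phase) % period) < duty_cycle * period
```

For example, video classification at a 50% duty cycle over a 2-second period is ON for the first second of every cycle and OFF for the second.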
Policy 3: Random Selection • Baseline policy for comparisons • Randomly select a feature combination from all possible
S-Seer Architecture • Same two-level architecture as Seer: audio HMMs (ambient noise, human speech, music, keyboard, phone ring), video HMMs (nobody present, one person present, one active person present, multiple people present), sound localization, and keyboard/mouse activity feed the top-level activity classifier • A selective perception module is added at each level: before each classification step it decides which subset of features to observe and compute
Experiments with Selective Perception • Qualitative and formal evaluations • Activity labels: • DC: Distant Conversation • NP: Nobody Present • O: Other • P: Presentation • FFC: Face-to-Face Conversation • WC: Working on Computer • PC: Phone Conversation
Comparison of the Selective Perception Policies • Mean accuracies when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)
Comparison of the Selective Perception Policies • Mean computational costs (% CPU time) when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)
Richer Utility and Cost Models • Initial experiments • Identity matrix as the system's utility model • Measure of the cost, C(f_k), as the percentage of CPU usage • Richer utility models for misdiagnosis • One can assess in $ the cost to a user of misclassifying activity A_j as A_i • Seek $ amounts that users would be willing to pay to avoid having the activity misdiagnosed, for all possible N-1 misdiagnoses • Richer models for the cost of perceptual analysis • We map computational costs and utility to the same $ currency • Cost: $ that a user would be willing to pay to avoid latencies of different kinds in different settings
Context-sensitive Cost Models • S-SEER's domain-level reasoning supports such context-sensitive cost models • Assuming the cost (C) of computation is zero when the users are not using the computer, we can generate an expected cost (EC) of perception as follows: EC(f_k) = C(f_k) (1 − Σ_{i=1..m} p(A_i | E)) • where • C(f_k) represents the latency associated with observing and analyzing the set of features f_k • E represents the evidence already observed • the index 1..m ranges over the subset of activities that do not include interaction with the user
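The context-sensitive expected cost above amounts to discounting the latency by the probability that the user is away. A minimal sketch under that assumption (the function and parameter names are invented here):

```python
def expected_cost(latency, p_activity, non_interactive):
    """Expected perceptual cost, assuming cost is zero whenever the user
    is not interacting with the computer.

    latency:         C(f_k), latency of observing and analyzing the features
    p_activity:      posterior p(A_i | E) over the recognized activities
    non_interactive: indices of activities involving no user interaction
    """
    p_idle = sum(p_activity[i] for i in non_interactive)
    return latency * (1.0 - p_idle)
```

So when the system is already confident that nobody is present or a conversation is happening away from the computer, even expensive feature sets become cheap in expectation.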
Studies with Richer Utility and Cost Models • Condition cost models on the software application that has focus • Consider that the user is interacting vs. not interacting • Analysis of the influence of an activity-dependent cost model • 900 sequences of office activity (150 seq/activity) with • Rich cost of misdiagnosis • Activity-dependent cost model: cost is higher when the user is interacting with the computer • e.g., Presentation, person present in another activity vs. Nobody Present, distant conversation overheard, etc.
Feature Activation (% time): constant cost vs. activity-dependent cost
Summary • Decision-theoretic approach to feature selection in multimodal systems • How do observations affect the utility of the system? • Selective perception significantly reduces the computational burden of S-SEER while preserving good recognition accuracy • In comparative studies, EVI provides the best overall trade-off between the recognition accuracy of the system and its computational burden
Future Work • Extending utility models • Models of the cost of latencies • Cost of misdiagnosis in applications • Models of persistence and volatility • Models that represent the decay of confidence about states of the world with increasing time since an observation was made • Design-time and real-time applications • Extension of the decision-theoretic approach to other graphical models • Emotional content