CALO VISUAL INTERFACE RESEARCH PROGRESS
David Demirdjian, Trevor Darrell
MIT CSAIL
pTablet (or pLaptop!)
Goal: visual cues to conversation or interaction state:
• presence
• attention
• turn-taking
• agreement and grounding gestures
• emotion and expression cues
• visual speech features
Functional Capabilities
Help CALO infer:
• whether the user is still participating in a conversation or interaction, or is focused on the interface or listening to another person;
• when the user is speaking, plus further features pertaining to visual speech;
• non-verbal cues showing whether the user is confirming understanding of, or agreement with, the current topic or question, or is confused or irritated, both for meeting understanding and for the CALO UI…
Machine Learning Research Challenges
We focus on learning methods that capture personalized interaction:
• Articulatory models of visual speech
• Sample-based methods for body tracking
• Hidden-state conditional random fields
• Context-based gesture recognition
(Not all are yet in the deployed demo…)
Articulatory models of visual speech
• Traditional models of visual speech presume synchronous units based on visemes, the visual correlates of phonemes.
• Audiovisual speech production is often asynchronous.
• We instead model speech with a series of loosely coupled streams of articulatory features (see the sketch below).
(See Saenko and Darrell, ICMI 2004, and Saenko et al., ICCV 2005, for more information.)
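A minimal sketch of the loose-coupling idea, not the cited models: two articulatory feature streams (say, a lip-opening stream and a voicing stream) are decoded jointly, and instead of forcing them to change state together, each frame pays a soft penalty proportional to how far apart their states are. State counts, dynamics, and the penalty are illustrative assumptions.

```python
import numpy as np

def decode_coupled_streams(log_emis_a, log_emis_v, async_penalty=1.0):
    """Jointly decode two articulatory-feature streams (toy joint Viterbi).

    log_emis_a: (T, Na) per-frame log-scores for the audio-side feature states
    log_emis_v: (T, Nv) per-frame log-scores for the visual-side feature states
    The streams are loosely coupled: they may occupy different states at a
    frame, but each frame pays async_penalty * |i - j| for the disagreement.
    Returns the best joint state sequence [(i_t, j_t), ...].
    """
    T, Na = log_emis_a.shape
    _, Nv = log_emis_v.shape
    coupling = -async_penalty * np.abs(
        np.arange(Na)[:, None] - np.arange(Nv)[None, :])
    score = log_emis_a[0][:, None] + log_emis_v[0][None, :] + coupling
    back = []
    for t in range(1, T):
        # Toy left-to-right dynamics: each stream may stay or advance by one.
        best_prev = np.full((Na, Nv), -np.inf)
        argprev = np.zeros((Na, Nv, 2), dtype=int)
        for di in (0, 1):
            for dj in (0, 1):
                shifted = np.full((Na, Nv), -np.inf)
                shifted[di:, dj:] = score[:Na - di, :Nv - dj]
                better = shifted > best_prev
                best_prev = np.where(better, shifted, best_prev)
                argprev[better] = (di, dj)
        score = best_prev + log_emis_a[t][:, None] + log_emis_v[t][None, :] + coupling
        back.append(argprev)
    # Trace back the best joint path from the last frame.
    i, j = np.unravel_index(np.argmax(score), score.shape)
    path = [(i, j)]
    for argprev in reversed(back):
        di, dj = argprev[i, j]
        i, j = i - di, j - dj
        path.append((i, j))
    return path[::-1]
```

Setting async_penalty to a large value recovers a fully synchronous (viseme-like) model; small values let the visual stream lead or lag the audio stream, which is the behavior the articulatory-feature view is meant to capture.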
Sample-based methods for body tracking
• Tracking human bodies requires exploring a high-dimensional state space.
• Estimated posteriors are often sharp and multimodal.
• We developed new tracking techniques based on a novel approximate nearest-neighbor hashing method; they provide comprehensive pose coverage and optimally integrate information over time.
• These techniques are suitable for real-time markerless motion capture, and for tracking the human body to infer attention and gesture.
(See Demirdjian et al., ICCV 2005, and Taycher et al., CVPR 2006, for more information.)
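A generic stand-in for the idea, not the hashing scheme of the cited papers: a database of (image descriptor, pose) exemplars is indexed with random-projection hash codes, a query descriptor retrieves its bucket, candidates are re-ranked by descriptor distance, and a crude exponential smoother stands in for temporal integration. All dimensions and data below are made up.

```python
import numpy as np

class HashedPoseLookup:
    """Toy locality-sensitive hashing over (descriptor, pose) exemplars."""

    def __init__(self, descriptors, poses, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((descriptors.shape[1], n_bits))
        self.descriptors, self.poses = descriptors, poses
        self.buckets = {}
        for idx, code in enumerate(self._hash(descriptors)):
            self.buckets.setdefault(code, []).append(idx)
        self.prev_pose = None

    def _hash(self, X):
        # Random-projection sign bits, packed into a hashable key per row.
        bits = (X @ self.proj) > 0
        return [hash(row.tobytes()) for row in bits]

    def query(self, descriptor, alpha=0.5):
        code = self._hash(descriptor[None, :])[0]
        cand = self.buckets.get(code)
        if not cand:                      # empty bucket: fall back to full search
            cand = range(len(self.poses))
        cand = np.asarray(list(cand))
        dists = np.linalg.norm(self.descriptors[cand] - descriptor, axis=1)
        pose = self.poses[cand[np.argmin(dists)]]
        # Exponential smoothing as a crude stand-in for temporal integration.
        if self.prev_pose is not None:
            pose = alpha * pose + (1 - alpha) * self.prev_pose
        self.prev_pose = pose
        return pose

# Hypothetical usage with random stand-in data:
db_desc = np.random.randn(1000, 64)       # image descriptors (e.g. silhouette features)
db_pose = np.random.randn(1000, 20)       # corresponding joint-angle vectors
tracker = HashedPoseLookup(db_desc, db_pose)
estimate = tracker.query(np.random.randn(64))
```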
Hidden-state conditional random fields
• Discriminative techniques are efficient and accurate, and learn to represent only the portion of the state necessary for a specific task.
• Conditional random fields are effective at recognizing visual gestures, but lack the ability of generative models to capture gesture substructure through hidden state.
• We have developed a hidden-state conditional random field formulation.
(See Wang et al., CVPR 2006.)
Hidden Conditional Random Fields for Head Gesture Recognition
3 classes: nods, shakes, junk
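A much-simplified scoring sketch, not the CVPR 2006 formulation: here each class gets its own chain of hidden states (the actual HCRF shares hidden states across classes and lets the label enter through label-specific potentials), the hidden sequence is marginalized by a forward recursion over node and edge potentials, and the highest-scoring class wins. Feature dimensions, hidden-state counts, and parameters are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

CLASSES = ["nod", "shake", "junk"]          # the three classes on the slide

def hcrf_score(frames, W, T):
    """Log-score for one class: marginalize a chain of hidden states.

    frames: (n_frames, d) per-frame features (e.g. head rotation velocities)
    W: (n_hidden, d) node weights; T: (n_hidden, n_hidden) transition weights
    """
    node = frames @ W.T                      # (n_frames, n_hidden) node potentials
    alpha = node[0]
    for t in range(1, len(frames)):
        # alpha[j] = logsumexp_i(alpha[i] + T[i, j]) + node[t, j]
        alpha = logsumexp(alpha[:, None] + T, axis=0) + node[t]
    return logsumexp(alpha)

def classify(frames, params):
    """params maps class name -> (W, T); returns the arg-max class."""
    scores = {c: hcrf_score(frames, W, T) for c, (W, T) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical usage with random parameters (real ones would be learned by
# maximizing conditional likelihood on labeled nod/shake/junk sequences):
rng = np.random.default_rng(0)
params = {c: (rng.standard_normal((4, 3)), rng.standard_normal((4, 4)))
          for c in CLASSES}
sequence = rng.standard_normal((30, 3))     # 30 frames of 3-D head-motion features
print(classify(sequence, params))
```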
Context-based gesture recognition
• Recognition of the user's gestures should be done in the context of the current interaction.
• Visual recognition can be augmented with context cues from the interaction state:
  • conversational dialog with an embodied agent
  • interaction with a conventional windows-and-mouse interface
(See Morency, Sidner and Darrell, ICMI 2005, and Morency and Darrell, IUI 2006.)
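An illustrative fusion in the spirit of that work, with made-up feature names and weights: the visual recognizer's per-class scores are shifted by context cues such as "the embodied agent just asked a yes/no question" or "the mouse is active", then renormalized.

```python
import numpy as np

def contextual_gesture_posterior(visual_logits, context_features, context_weights):
    """Combine visual gesture scores with interaction-context cues.

    visual_logits: dict class -> log-score from the visual recognizer
    context_features: e.g. {"agent_asked_yes_no": 1.0, "mouse_active": 0.0}
    context_weights: per-class weight for each context feature (illustrative)
    """
    classes = list(visual_logits)
    scores = np.array([
        visual_logits[c] + sum(context_weights[c][f] * v
                               for f, v in context_features.items())
        for c in classes])
    probs = np.exp(scores - scores.max())
    return dict(zip(classes, probs / probs.sum()))

# Hypothetical example: a yes/no question raises the prior on nods and shakes;
# mouse activity lowers it (the user is probably attending to the GUI instead).
weights = {
    "nod":   {"agent_asked_yes_no": 1.5, "mouse_active": -1.0},
    "shake": {"agent_asked_yes_no": 1.5, "mouse_active": -1.0},
    "junk":  {"agent_asked_yes_no": 0.0, "mouse_active":  0.5},
}
posterior = contextual_gesture_posterior(
    {"nod": 0.2, "shake": -0.5, "junk": 0.1},
    {"agent_asked_yes_no": 1.0, "mouse_active": 0.0},
    weights)
```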
User-Adaptive Agreement Recognition
• Person's idiolect
• User agreement from recognized speech and head gestures
• Multimodal co-training
Challenges:
• Asynchrony between modalities
• "Missing data" problem
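A hedged sketch of the co-training idea across the speech and head-gesture views; the actual system's learners and features are not specified on the slide, so scikit-learn logistic regressions and integer agree/disagree labels are assumptions. Each view's classifier pseudo-labels the unlabeled examples it is most confident about, and those examples grow the shared labeled pool used by both views.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(X_speech, X_gesture, y, U_speech, U_gesture,
            rounds=5, per_round=10):
    """Toy two-view co-training for agreement recognition.

    (X_speech, X_gesture, y): labeled examples described in both views.
    (U_speech, U_gesture): the same unlabeled examples in both views.
    Assumes integer class labels (e.g. 0 = disagree, 1 = agree).
    """
    Ls, Lg, Ly = list(X_speech), list(X_gesture), list(y)
    Us, Ug = list(U_speech), list(U_gesture)
    clf_s, clf_g = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        clf_s.fit(np.array(Ls), Ly)
        clf_g.fit(np.array(Lg), Ly)
        for clf in (clf_s, clf_g):
            if not Us:
                break
            view = Us if clf is clf_s else Ug
            proba = clf.predict_proba(np.array(view))
            picks = np.argsort(proba.max(axis=1))[-per_round:]
            # Move the most confidently pseudo-labeled examples (and their
            # counterparts in the other view) into the labeled pool.
            for i in sorted(picks, reverse=True):
                Ly.append(int(clf.classes_[proba[i].argmax()]))
                Ls.append(Us.pop(i))
                Lg.append(Ug.pop(i))
    return clf_s, clf_g
```

This is where the slide's challenges bite: the two views are only aligned if the modalities have been synchronized, and an example missing one modality cannot be moved across views without an imputation step.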
Status
New pTablet functionalities:
• Face/gaze tracking
• Head gesture recognition (nod/shake) + gaze
• Lip/mouth motion detection
• User enrollment/recognition (ongoing work)
A/V integration:
• Audio-visual synchronization/calibration
• Meeting visualization/understanding
pTablet system (diagram): the pTablet camera feeds VTracker, which maintains a user model (frontal view) and a 6-D pose estimate, and a head gesture recognition module that also receives speech audio; the system emits OAA messages carrying person ID, head pose, gesture, and lips-moving status.
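The field names below are the ones listed on the slide (person ID, head pose, gesture, lips moving); the dataclass/JSON layout itself is purely hypothetical and is not the actual OAA message schema.

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class PTabletMessage:
    """Hypothetical payload mirroring the fields on the slide;
    the real OAA message format is not specified here."""
    person_id: str          # recognized user (or "unknown")
    head_pose: tuple        # 6-D pose: (x, y, z, roll, pitch, yaw)
    gesture: str            # "nod", "shake", or "none"
    lips_moving: bool       # output of the speaking-activity detector
    timestamp: float = 0.0

msg = PTabletMessage("user_042", (0.1, 0.0, 0.6, 0.0, 5.0, -12.0),
                     "nod", True, time.time())
print(json.dumps(asdict(msg)))   # e.g. forwarded to the A/V Integrator
```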
Speaking activity detection
• Face tracking as rigid pose + non-rigid facial deformations.
• Speaking activity ~ high motion energy in the mouth/lips region.
• This is a weak assumption (e.g. a hand moving in front of the mouth will also trigger speaking-activity detection), but it complements audio-based speaker detection well.
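A minimal sketch of the motion-energy cue, assuming the face tracker supplies a mouth bounding box per frame; the threshold value and box format are illustrative.

```python
import cv2
import numpy as np

class MouthMotionDetector:
    """Speaking-activity proxy: frame-difference energy in the mouth ROI.

    As noted on the slide, anything moving in the box (e.g. a hand passing
    in front of the mouth) also fires, so this signal is meant to complement
    audio-based speaker detection rather than replace it.
    """
    def __init__(self, energy_threshold=8.0):
        self.threshold = energy_threshold
        self.prev = None

    def update(self, frame_bgr, mouth_box):
        x, y, w, h = mouth_box                       # box from the face tracker
        roi = cv2.cvtColor(frame_bgr[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
        roi = roi.astype(np.float32)
        if self.prev is None or self.prev.shape != roi.shape:
            self.prev, energy = roi, 0.0             # first frame / ROI resized
        else:
            energy = float(np.mean(cv2.absdiff(roi, self.prev)))
            self.prev = roi
        return energy > self.threshold, energy
```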
User enrollment/recognition
Idea:
• At start, the user is automatically identified and logged in by the pTablet.
• If the user is not recognized, or is misrecognized, they have to log in manually.
• Face recognition is based on a feature set matching algorithm (The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, Grauman and Darrell, ICCV 2005).
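A rough sketch of set-of-features matching in the spirit of the cited pyramid match kernel, not the authors' implementation: descriptor sets are binned on grids of increasing cell size, and matches newly found at coarser levels are down-weighted. The bin scheme and the identification step are illustrative.

```python
import numpy as np
from collections import Counter

def pyramid_match(X, Y, levels=5):
    """Simplified pyramid-match score between two feature sets.

    X, Y: (n, d) arrays of local descriptors (e.g. from face patches),
    assumed scaled to non-negative values. Features are binned on grids of
    side 2**level; matches newly found at level L get weight 1 / 2**L.
    """
    prev_inter, score = 0.0, 0.0
    for level in range(levels):
        side = 2.0 ** level
        hx = Counter(map(tuple, np.floor(X / side).astype(int)))
        hy = Counter(map(tuple, np.floor(Y / side).astype(int)))
        inter = sum(min(hx[b], hy[b]) for b in hx.keys() & hy.keys())
        score += (inter - prev_inter) / (2.0 ** level)
        prev_inter = inter
    return score

def identify(probe_set, enrolled):
    """Return the enrolled user whose feature set matches best, or None.

    enrolled: dict user_id -> descriptor set collected at enrollment.
    In practice a threshold would decide "not recognized -> manual login".
    """
    scores = {uid: pyramid_match(probe_set, feats)
              for uid, feats in enrolled.items()}
    return max(scores, key=scores.get) if scores else None
```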
Audio-Visual Calibration
• Temporal calibration: aligning audio with visual data. How? By aligning lip motion energy in the images with audio energy.
• Geometric calibration: estimating camera location/orientation in the world coordinate system.
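A minimal sketch of the temporal-calibration step: cross-correlate the per-frame lip motion energy with the audio energy (resampled to the video frame rate) and take the lag that maximizes the correlation. Frame rate and search window are assumptions.

```python
import numpy as np

def estimate_av_lag(lip_energy, audio_energy, fps=30.0, max_lag_s=2.0):
    """Estimate the audio-visual time offset by cross-correlation.

    lip_energy: per-video-frame lip motion energy (from the face tracker)
    audio_energy: audio energy resampled to the same frame rate
    Returns the lag in seconds that best aligns the two signals; a positive
    value means the audio stream lags the video stream.
    """
    n = min(len(lip_energy), len(audio_energy))
    a = np.asarray(lip_energy[:n], dtype=float)
    b = np.asarray(audio_energy[:n], dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-8)       # normalize both signals
    b = (b - b.mean()) / (b.std() + 1e-8)
    max_lag = int(max_lag_s * fps)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(a[max(0, -k):n - max(0, k)] * b[max(0, k):n - max(0, -k)])
            for k in lags]
    return lags[int(np.argmax(corr))] / fps
```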
Audio-Visual Integration (diagram): CAMEO and the pTablets.
Calibration
• Alternative approach: estimate the position/orientation of the pTablets with or without a global view (e.g. from CAMEO).
• Idea: use discourse information (e.g. who is talking to whom, a dialog between two people) and local head pose to find the locations of the pTablets…
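A very rough, hypothetical sketch of that idea, not a full solver for tablet poses: whenever the dialog state says person i is addressing person j, person i's head yaw (measured in their own pTablet's camera frame) is treated as an observation of the bearing from tablet i toward person j, and observations are combined by a circular mean.

```python
import numpy as np
from collections import defaultdict

def pairwise_bearings(dialog_events):
    """Accumulate bearing estimates between participants from discourse cues.

    dialog_events: list of (speaker, addressee, head_yaw_rad) tuples, where
    head_yaw_rad is the speaker's head yaw measured by their own pTablet
    while the dialog state says they are addressing that listener.
    Returns {(speaker, addressee): mean bearing in radians}.
    """
    sums = defaultdict(complex)
    for speaker, addressee, yaw in dialog_events:
        sums[(speaker, addressee)] += np.exp(1j * yaw)   # circular mean via unit vectors
    return {pair: float(np.angle(z)) for pair, z in sums.items()}

# Hypothetical example: A repeatedly turns ~30 degrees right while talking to B.
events = [("A", "B", np.radians(28)), ("A", "B", np.radians(33)),
          ("B", "A", np.radians(-40))]
print(pairwise_bearings(events))
```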
A/V Integration
AVIntegrator:
• Same functionalities as Year 2 (e.g. includes activity recognition, etc.)
• Modified to accept calibration data estimated externally
Integration and activity estimation
• A/V integration
• Activity estimation:
  • Who's in the room?
  • Who is looking at whom?
  • Who is talking to whom?
  • …
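An illustrative rule-based fusion of the per-pTablet outputs into these meeting-level facts; the field names, tolerance, and bearing source are assumptions, not the AVIntegrator's actual logic.

```python
def estimate_activity(states, bearings, gaze_tol_deg=15.0):
    """Toy fusion of per-pTablet outputs into meeting-level activity.

    states: dict person -> {"present": bool, "head_yaw": deg, "speaking": bool}
    bearings: dict (i, j) -> bearing of person j from person i, in degrees
              (e.g. from the geometric calibration step).
    Returns who is in the room, who is looking at whom, who is talking to whom.
    """
    in_room = [p for p, s in states.items() if s["present"]]
    looking_at, talking_to = {}, {}
    for i in in_room:
        for j in in_room:
            if i == j or (i, j) not in bearings:
                continue
            # "i looks at j" if i's head yaw points within tolerance of j.
            delta = (states[i]["head_yaw"] - bearings[(i, j)] + 180) % 360 - 180
            if abs(delta) < gaze_tol_deg:
                looking_at[i] = j
                if states[i]["speaking"]:
                    talking_to[i] = j
    return in_room, looking_at, talking_to
```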
A/V Integrator system (diagram): VTrackers running on CAMEO and on each pTablet feed the A/V Integrator, together with calibration information, speech recognition output, and discourse/dialog input (e.g. the current speaker); the integrator emits OAA/MOKB messages carrying the user list, the speaker, agrees/disagrees, and who spoke to whom.
Demonstration?
• Real-time meeting understanding
• Use of the pTablet suite for interaction with a personal CALO, e.g.:
  • head pose/lip motion for speaking activity detection
  • yes/no answers by head nods/shakes
  • visual login