Natural Interfaces. Conversation would improve many interactions. Currently, conversational interfaces are useless in most situations with more than one user, or with real-world references. Visual context is missing.
1. Learning and Vision for Multimodal Conversational Interfaces
3. Visual Context for Conversation Who is there? (presence, identity)
Which person said that? (audiovisual grouping)
Where are they? (location)
What are they looking / pointing at? (pose, gaze)
What are they doing? (activity)
4. Learning Visual conversational context cues are hard to model analytically.
Learning methods are appropriate
Different techniques for different cues, levels of representation, input modes, ...
(At least for now…)
5. Today Speaker segregation using audio-visual mutual information
discard background sounds
separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
attention
agreement
Articulated pose tracking by learning model constraints, or example-based inference…
gesture
“body language”
7. Is that you talking?
8. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)?
9. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)?
Model-free?
10. Audio-visual synchrony Yes, by learning a model of audio-visual synchrony. Three approaches:
Pixel-wise correlation of audio with video [Hershey and Movellan]
Correlation of optimal projection [Slaney and Covell]
Non-parametric Mutual Information analysis on optimal projection [Fisher et al.]
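As a concrete illustration of the first approach, here is a minimal sketch of pixel-wise audio-video correlation in the spirit of Hershey and Movellan; the per-frame audio-energy input and all names are illustrative assumptions, not the original system.

```python
# Minimal sketch: correlate the audio energy track with each pixel's
# intensity track over a window of frames.
import numpy as np

def pixelwise_av_correlation(frames, audio_energy):
    """frames: (T, H, W) grayscale video; audio_energy: (T,) per-frame audio energy.
    Returns an (H, W) map of absolute correlation between audio and each pixel."""
    T, H, W = frames.shape
    v = frames.reshape(T, -1).astype(np.float64)
    v -= v.mean(axis=0)                      # zero-mean each pixel track
    a = audio_energy - audio_energy.mean()   # zero-mean audio track
    num = a @ v                              # covariance numerator, per pixel
    den = np.linalg.norm(a) * np.linalg.norm(v, axis=0) + 1e-12
    return np.abs(num / den).reshape(H, W)   # high values = audio-synchronous pixels
```

Pixels whose intensity co-varies with the audio, such as a speaker's mouth region, stand out in the returned map.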
11. Audio-based Image localization E.g., locate visual sources given audio information:
12. Audio-based Image localization Image variance (ignoring audio) will find all motion in the sequence:
13. Audio-based Image localization Estimate mutual information between audio and video:
15. Canonical correlation projection Different from Hershey and Movellan in that it asks what combination of audio and video data produces the best correlation, rather than treating each pixel independently. However, performance depends on both the training and testing data sizes.
Perhaps the best contribution was showing that MFCC and LPC representations work much better than audio power or spectrogram features with this technique.
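A minimal sketch of the optimal-projection idea, assuming per-frame MFCC features and flattened pixel vectors as inputs; this illustrates canonical correlation analysis generally, not Slaney and Covell's exact pipeline.

```python
# Minimal sketch: CCA finds one projection of the audio features and one of
# the video pixels whose outputs are maximally correlated.
import numpy as np
from sklearn.cross_decomposition import CCA

def av_cca_correlation(audio_feats, video_feats):
    """audio_feats: (T, Da) e.g. MFCCs per frame; video_feats: (T, Dv) flattened pixels.
    Returns the correlation of the top canonical pair."""
    cca = CCA(n_components=1)
    a_proj, v_proj = cca.fit_transform(audio_feats, video_feats)
    return np.corrcoef(a_proj[:, 0], v_proj[:, 0])[0, 1]
```

On held-out data, a high canonical correlation suggests the streams are synchronous; as noted above, results depend on the training and testing data sizes.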
16. Non-parametric Mutual Information Match audio to video using adaptive feature basis
Exploit joint statistics of image and audio signal
Efficient non-parametric density estimation
17. Maximally Informative Subspace The key difference is that we have no labels for the training data; we have to learn statistics for both modalities.
We learn projections that reveal “simple structure”
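A minimal sketch of the non-parametric MI estimate on 1-D projections; a histogram density stands in for the kernel estimator of Fisher et al., and in the full method the projections themselves would be adapted to maximize this quantity. All names are illustrative.

```python
# Minimal sketch: estimate mutual information between projected audio and
# video samples from a joint histogram.
import numpy as np

def mutual_information_1d(a, v, bins=16):
    """a, v: (T,) projected audio and video samples. Returns an MI estimate in nats."""
    p_av, _, _ = np.histogram2d(a, v, bins=bins)
    p_av = p_av / p_av.sum()                 # joint distribution
    p_a = p_av.sum(axis=1, keepdims=True)    # audio marginal
    p_v = p_av.sum(axis=0, keepdims=True)    # video marginal
    mask = p_av > 0
    return float((p_av[mask] * np.log(p_av[mask] / (p_a @ p_v)[mask])).sum())
```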
18. Audio-visual synchrony detection Compute similarity matrix for 8 subjects. [Figure: example MI values 0.68, 0.61, 0.19, 0.20]
19. Today Speaker segregation using audio-visual mutual information
discard background sounds
separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
attention
agreement
Articulated pose tracking by learning model constraints, or example-based inference…
gesture
“body language”
20. Head pose tracking
21. Lots of Work on Face Pose Tracking… Cylindrical approx. [La Cascia & Sclaroff]
3D Mesh approx. [Essa]
3D Morphable model [Blanz & Vetter]
Multi-view keyframes from 3D model [Vacchetti et al.]
View-based eigenspaces [Srinivasan & Boyer] [Pentland et al.]
…
Online: still hard to initialize...
Offline: constrained pose space
22. Pose Estimation
23. User Dependent Keyframes
24. User-Independent Prior Model
25. 3D View-based Eigenspaces
26. View-based Eigenspaces The images show the mean face plus the first three eigenvectors
27. 3D View-based Eigenspaces Per-view PCA
We also keep basis images for the depth channel
28. 3D View-based Eigenspaces The product U·D gives the weights
Three techniques: (1) independent SVDs; (2) concatenate I and Z, then SVD; (3) SVD on I, then transfer the weights (sketched below)
Informally, we found (3) to work best so far; we are still exploring this topic
Depth basis images are kept instead of depth eigenspaces
Variation in depth is not independent of variation in intensity:
intensity variations are more relevant for matching identity;
depth variations show up along with intensity variations
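A minimal sketch of technique (3), assuming stacked per-view intensity and depth images; the least-squares transfer step is one plausible reading of "transfer weights", not necessarily the original implementation.

```python
# Minimal sketch: SVD on intensity images only, then reuse the intensity
# weights to form matching depth basis images for the same view.
import numpy as np

def build_view_basis(I, Z, k=3):
    """I, Z: (N, D) stacked intensity / depth images for one view (rows = examples).
    Returns intensity basis, per-example weights, and transferred depth basis."""
    I0, Z0 = I - I.mean(axis=0), Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(I0, full_matrices=False)
    W = U[:, :k] * s[:k]                 # U·D: per-example weights on intensity basis
    basis_I = Vt[:k]                     # intensity eigenvectors, (k, D)
    # Transfer: least-squares depth basis that reproduces Z from the same weights
    basis_Z = np.linalg.lstsq(W, Z0, rcond=None)[0]
    return basis_I, W, basis_Z
```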
29. Reconstruction
30. Reconstruction λ = var(I) / var(Z) balances the intensity and depth channels
31. Reconstruction All views and the depth channel enter the reconstruction equation
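A minimal sketch of the reconstruction step, assuming standard PCA-style projection and the λ = var(I)/var(Z) weighting noted above; names and shapes are illustrative.

```python
# Minimal sketch: project the intensity subwindow onto the basis, reuse the
# weights for depth, and score with a lambda-weighted residual.
import numpy as np

def reconstruct(x_i, x_z, mean_i, mean_z, basis_i, basis_z, lam):
    """x_i, x_z: (D,) intensity / depth subwindows; basis_*: (k, D) bases."""
    w = basis_i @ (x_i - mean_i)             # weights from intensity projection
    rec_i = mean_i + basis_i.T @ w           # intensity reconstruction
    rec_z = mean_z + basis_z.T @ w           # depth reconstruction, same weights
    err = np.sum((x_i - rec_i) ** 2) + lam * np.sum((x_z - rec_z) ** 2)
    return rec_i, rec_z, err

# lam would be set to var(I) / var(Z), matching the slide's note.
```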
32. Pose Estimation Depth images are used as well
Motion model is constant (identity matrix), i.e., a random walk
33. Pose Estimation Depth images are used as well
Motion model is constant (identity matrix), i.e., a random walk
Deltas are pose-change measurements (a filter sketch follows)
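A minimal sketch of the filter implied by these slides: a Kalman step with an identity motion model (random walk) and pose-change deltas as measurements, taking the observation matrix C as identity; dimensions and noise covariances are illustrative assumptions.

```python
# Minimal sketch: one predict/update cycle of a random-walk Kalman filter
# over a 6-DOF pose, driven by a measured pose change (delta).
import numpy as np

def random_walk_kalman_step(x, P, delta, Q, R):
    """x: (6,) pose (translation + rotation); P: (6,6) covariance;
    delta: (6,) measured pose change with noise covariance R; Q: process noise."""
    # Predict: identity motion model, so the predicted state is unchanged
    x_pred, P_pred = x, P + Q
    # Update: fold the delta in as a direct observation of the new pose (C = I)
    z = x + delta
    S = P_pred + R                         # innovation covariance
    K = P_pred @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - x_pred)
    P_new = (np.eye(len(x)) - K) @ P_pred
    return x_new, P_new
```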
34. Experiments Image sequences from stereo cameras
Prior model: 14 subjects in 28 orientations
Ground truth with Inertia Cube sensor
Compare with OSU pose estimator [Srinivasan & Boyer ’02]
Use the same training set for the eigenspaces
The OSU estimator learns an interpolation between correlation coefficients to estimate pose
35. Results (video) Errors are computed by subtracting the ground truth; Rx, Ry, Rz are merged into a single plot
36. Exploiting cascades for speed But the correlation search step is very slow!
Using a cascade detection paradigm [Viola & Jones], many patterns can be quickly rejected.
Set the false negative rate to be very low (e.g. 1%) per stage
each stage may have a low hit rate (30-40%), but the overall architecture is efficient and accurate
Multi-view cascade detection to obtain a coarse initial pose estimate (see the arithmetic sketch below)
In general, simple classifiers are more efficient but also weaker.
We could define a computational risk hierarchy (by analogy with structural risk minimization)…
A nested set of classifier classes
The training process is reminiscent of boosting…
- previous classifiers reweight the examples used to train subsequent classifiers
The goal of the training process is different
- instead of minimizing errors, minimize false positives
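The cascade arithmetic can be made concrete: per-stage rates multiply across stages. The sketch below uses the slide's numbers, reading the 30-40% per-stage hit rate as the fraction of non-targets that survive a stage; the stage count is an illustrative assumption.

```python
# Minimal sketch: per-stage detection and pass-through rates compound, so a
# cascade of mildly selective stages is accurate overall yet cheap on average.
stages = 10
miss_rate_per_stage = 0.01     # 1% false negatives per stage (99% detection)
pass_rate_per_stage = 0.35     # ~30-40% of non-targets survive each stage

overall_detection = (1 - miss_rate_per_stage) ** stages   # about 0.904
overall_false_pos = pass_rate_per_stage ** stages          # about 2.8e-5

print(f"detection = {overall_detection:.3f}, false positives = {overall_false_pos:.1e}")
```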
37. Pose aware interfaces
Interface Agent responds to gaze of user
agent should know when it’s being attended to
turn-taking pragmatics
eventually, anaphora + object reference
Prototype
Smart-room interface “sam”
Early experiments with face tracker on meeting room table…
38. SAM
40. Head nod detection Track 6DOF motion of head nod and shake gestures
Experiment with a simple motion-energy ratio test.
Initial results are promising
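A minimal sketch of one plausible motion-energy ratio test, assuming the tracker's rotational velocities are available: nods concentrate energy in pitch (Rx), shakes in yaw (Ry). The threshold and window are illustrative assumptions.

```python
# Minimal sketch: compare rotational motion energy about the pitch and yaw
# axes over a detection window.
import numpy as np

def nod_or_shake(rx_vel, ry_vel, ratio_thresh=2.0):
    """rx_vel, ry_vel: (T,) pitch / yaw velocities from the 6-DOF head tracker."""
    e_pitch = np.sum(np.square(rx_vel))    # nod energy
    e_yaw = np.sum(np.square(ry_vel))      # shake energy
    if e_pitch > ratio_thresh * e_yaw:
        return "nod"
    if e_yaw > ratio_thresh * e_pitch:
        return "shake"
    return "neither"
```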
41. Today Speaker segregation using audio-visual mutual information
discard background sounds
separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
attention
agreement
Articulated pose tracking by learning model constraints, or example-based inference…
gesture
“body language”
42. Articulated pose sensing
43. Learning Articulated Tracking Model-based approach works for 3-D data and pure articulation constraints…
Need to learn joint limits and other behavioral constraints (with a classic model-based tracker)
Without direct 3-D data, example-based techniques are most promising…
44. Model-based Approach
45. Model-based Approach
46. Model-based Approach
47. ICP with articulated motion constraint Minimize the distance between the 3-D data and the 3-D articulated model
Apply ICP to each limb in the articulated model to find its motion (twist) δk = (ω, t) with covariance Λk
Enforce joint constraints: find a set of motions δk′ close to the original motions that satisfy the joint constraints
Pure articulation can be expressed as a linear projection of the stacked rigid motions (see the sketch below)
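A minimal sketch of the constraint-enforcement step, assuming the articulated subspace is given as a basis matrix A; this is the Mahalanobis-weighted projection the slide describes, with all names illustrative.

```python
# Minimal sketch: project stacked per-limb twists onto the articulated
# subspace, weighting by the inverse ICP covariance.
import numpy as np

def project_onto_articulation(d, L, A):
    """d: (6K,) stacked rigid motions; L: (6K, 6K) covariance; A: (6K, m) basis
    of motions satisfying the joint constraints. Returns constrained motions d'."""
    Linv = np.linalg.inv(L)
    # Weighted least squares: q minimizes (d - A q)^T L^{-1} (d - A q)
    q = np.linalg.solve(A.T @ Linv @ A, A.T @ Linv @ d)
    return A @ q
```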
48. Non-linear constraints Limitations of pure articulation constraints:
Cannot capture the limits on the range of motion of human joints
Cannot capture behavioral limits of body pose
Learning approach: learn a discriminative model of valid / invalid pose (see the sketch below)
Train an SVM for use as a Lagrangian constraint
Valid body poses extracted from mocap data (150,000 poses)
Invalid body poses generated randomly
Cross-validation classification error rates around 0.061%
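A minimal sketch of the valid/invalid pose classifier, with random stand-ins for the mocap data; only the train-an-SVM-and-use-its-decision-value pattern is taken from the slides, and all sizes and names are illustrative.

```python
# Minimal sketch: SVM separating valid poses (mocap) from invalid poses
# (random joint-angle vectors); its decision value can act as a pose-validity
# constraint inside the tracker.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
valid = rng.normal(0.0, 0.3, size=(2000, 20))          # stand-in for mocap poses
invalid = rng.uniform(-np.pi, np.pi, size=(2000, 20))  # random joint angles

X = np.vstack([valid, invalid])
y = np.hstack([np.ones(len(valid)), -np.ones(len(invalid))])

clf = SVC(kernel="rbf").fit(X, y)
# clf.decision_function(pose) > 0 flags a pose as valid; in tracking, this
# value can serve as the Lagrangian constraint term the slide mentions.
```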
49. Video
50. Multimodal gestures Three key parts and decision functions: body tracker, gesture detection, and gesture classification
Computationally expensive: gesture classification and body tracking
51. Learning pose without 3-D observations A model-based approach is difficult with more impoverished observations, e.g., contour or edge features
Example based learning approach
Generate corpus of training data with model (Poser)
Find nearest neighbors using fast hashing techniques (LSH)
Optionally use local regression on NN
With segmented contours
shape context features
bipartite graph matching via the Earth Mover’s Distance
With unsegmented edge features
feature selection using paired classification problem
extend LSH to use “Parameter sensitive Hashing”
52. Parameter sensitive hashing When an explicit feature (shape context) is not available, feature selection is needed
Features for an optimal distance can be found by training a classifier on an equivalence task
LSH + classifier-based feature selection = PSH
i.e., hashing functions sensitive to distance in parameter space, not feature space
“Parameter Sensitive Hashing” [Shakhnarovich et al.]
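A minimal sketch of hashing-based example lookup: plain random-projection LSH stands in here, whereas PSH would select hash functions sensitive to parameter-space distance via the paired classification task above; all names are illustrative.

```python
# Minimal sketch: random binary hashes bucket training examples so that a
# query only touches the examples in its colliding bucket.
import numpy as np
from collections import defaultdict

class SimpleLSH:
    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random projection hashes
        self.table = defaultdict(list)

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def index(self, features, poses):
        for f, p in zip(features, poses):
            self.table[self._key(f)].append((f, p))

    def query(self, f):
        """Return (feature, pose) pairs in the colliding bucket: candidates
        for nearest-neighbor search or local regression."""
        return self.table.get(self._key(f), [])
```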
53. Parameter sensitive hashing
54. Saturday Workshop
55. Schedule
56. Today Learning methods are critical for robust estimation of synchrony, pose and other conversational context cues:
Speaker segregation using audiovisual mutual information
Head pose estimation using multi-view manifolds and detection cascade trees
Real-time articulated tracking from stereo data with SVM-based joint constraints
Monocular tracking using example-based inference with fast nearest neighbor methods
57. Acknowledgements Greg Shakhnarovich
Kristen Grauman
Neal Checka
David Demirdjian
Theresa Ko
John Fisher
Louis-Philippe Morency
Mike Siracusa
…