
Multimodal Analysis of Expressive Human Communication: Speech and gesture interplay




  1. Multimodal Analysis of Expressive Human Communication: Speech and gesture interplay Ph.D. Dissertation Proposal Carlos Busso Adviser: Dr. Shrikanth S. Narayanan

  2. Outline • Introduction • Analysis • Recognition • Synthesis • Conclusions

  3. Introduction Motivation • Gestures and speech are intricately coordinated to express messages • Affective and articulatory goals jointly modulate these channels in a non-trivial manner • A joint analysis of these modalities is needed to better understand expressive human communication • Goals: • Understand how to model the spatio-temporal modulation of these communicative goals in gestures and speech • Use these models to improve human-machine interfaces • Computers could give specific and appropriate help to users • Realistic facial animation could be improved by learning human-like gestures This proposal focuses on the analysis, recognition and synthesis of expressive human communication under a multimodal framework

  4. Introduction Open challenges • How to model the spatio-temporal emotional modulation? • If audio-visual models do not consider how the coupling between gestures and speech changes in the presence of emotion, they will not accurately reflect the manner in which humans communicate • Which interdependencies between the various communicative channels appear in conveying verbal and non-verbal messages? • Interplay between communicative, affective and social goals • How to infer meta-information from speakers (emotion, engagement)? • How are gestures used to respond to the feedback given by the listener? • How are the verbal and non-verbal messages conveyed by one speaker perceived by others? • How to use these models to design and enhance applications that will help and engage users?

  5. Introduction Proposed Approach

  6. Facial Gesture/speech Interrelation C. Busso and S.S. Narayanan. Interrelation between Speech and Facial Gestures in Emotional Utterances. Under submission to IEEE Transactions on Audio, Speech and Language Processing. Analysis • Introduction • Analysis • Facial Gesture/speech Interrelation • Affective/Linguistic Interplay • Recognition • Synthesis • Conclusions

  7. Facial gestures/speech interrelation Motivation • Gestures and speech interact and cooperate to convey a desired message [McNeill,1992], [Vatikiotis,1996], [Cassell,1994] • Notable among communicative components are the linguistic, emotional and idiosyncratic aspects of human communication • Both gestures and speech are affected by these modulations • It is important to understand the interrelation between facial gestures and speech in terms of all these aspects of human communication

  8. Facial gestures/speech interrelation Goals • To focus on the linguistic and emotional aspects of human communication • To investigate the relation between certain gestures and acoustic features • To propose recommendations for synthesis and recognition applications Related work • Relationship between gestures and speech as conversational functions [Ekman,1979], [Cassell,1999], [Valbonesi,2002], [Graf,2002], [Granstrom,2005] • Relationship between gestures and speech as results of articulation [Vatikiotis,1996], [Yehia,1998], [Jiang,2002], [Barker,1999] • Relationship between gestures and speech influenced by emotions [Nordstrand,2003], [Caldognetto,2003], [Bevacqua,2004], [Lee,2005]

  9. Facial gestures/speech interrelation Proposed Framework: Data-driven approach • Pearson's correlation is used to quantify the relationship between speech and facial features • Affine Minimum Mean-Square Error (MMSE) estimation is used to estimate the facial gestures from speech • Sentence-level mapping • Global-level mapping
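A minimal sketch (not the author's implementation) of the two steps named on this slide: an affine MMSE (least-squares) mapping from frame-level speech features to a facial feature, and Pearson's correlation between the estimated and measured trajectories. Data, dimensions and variable names are illustrative.

```python
# Illustrative sketch: affine MMSE mapping from speech features to one facial
# feature, plus Pearson's correlation between estimate and ground truth.
import numpy as np

def affine_mmse_fit(X, y):
    """Fit y ~ X @ w + b in the least-squares (MMSE) sense."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])     # append bias column
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return coef[:-1], coef[-1]

def pearson_r(a, b):
    """Pearson's correlation between two 1-D signals."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

# Toy data standing in for acoustic features and a facial marker trajectory.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 13))                    # e.g. 13 MFCCs per frame
y = X @ rng.standard_normal(13) + 0.1 * rng.standard_normal(500)

w, b = affine_mmse_fit(X, y)                          # sentence- or global-level fit
y_hat = X @ w + b
print("Pearson r between estimated and true trajectory:", pearson_r(y_hat, y))
```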

  10. Facial gestures/speech interrelation Audio-Visual Database • Four emotions are targeted • Sadness • Anger • Happiness • Neutral state • 102 markers to track facial expressions • Single subject • Phoneme-balanced corpus (258 sentences) • Facial motion and speech are simultaneously captured

  11. Facial gestures/speech interrelation Facial and acoustic features • Speech • Prosodic features (source of the speech): pitch, energy and their first and second derivatives • MFCCs (vocal tract) • Facial features • Head motion • Eyebrow • Lips • Each marker is grouped into the upper, middle or lower face region
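A minimal sketch of this acoustic feature set using librosa, which is an assumption since the slides do not name the extraction toolkit; pitch, RMS energy and their first and second derivatives form the 6-D prosodic vector, and 13 MFCCs stand in for the vocal-tract features.

```python
# Illustrative feature extraction with librosa (assumed toolkit, not the
# one used in the thesis).
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))           # any mono waveform works

# Prosodic features: pitch (F0) and frame energy, plus 1st/2nd derivatives.
f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr)
energy = librosa.feature.rms(y=y)[0]
n = min(len(f0), len(energy))
prosody = np.vstack([f0[:n], energy[:n]])
prosody = np.vstack([prosody,
                     librosa.feature.delta(prosody),            # first derivative
                     librosa.feature.delta(prosody, order=2)])  # second derivative

# Vocal-tract features: MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print("prosodic feature matrix:", prosody.shape)      # (6, frames)
print("MFCC matrix:", mfcc.shape)                     # (13, frames)
```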

  12. Facial gestures/speech interrelation Correlation results: Sentence-level • High levels of correlation • Correlation levels are higher when MFCC features are used • Clear emotional effects • Correlation levels are equal to or greater than in the neutral case • Happiness and anger are similar (Figures: sentence-level correlation by emotion (Neutral, Sad, Happy, Angry) for prosodic vs. MFCC features)

  13. Facial gestures/speech interrelation Correlation results: Global-level • Correlation levels decrease compared to the sentence-level mapping • The link between facial gestures and speech varies from sentence to sentence • Correlation levels are higher when MFCC features are used • The lower face region presents the highest correlation • Clear emotional effects • Correlation levels for neutral speech are higher than for the emotional categories (Figures: global-level correlation by emotion (Neutral, Sad, Happy, Angry) for prosodic vs. MFCC features)

  14. Facial gestures/speech interrelation Mapping parameters • Goal: study the structure of the mapping parameters • Approach: Principal Component Analysis (PCA) • For each facial feature, find the number of components P needed to cover 90% of the variance • Emotion-dependent vs. emotion-independent analysis
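A minimal sketch of this PCA step, assuming the mapping parameters are collected as one vector per sentence; the variance threshold and the toy data are illustrative.

```python
# Illustrative sketch: number of principal components P covering 90% of the
# variance of a set of mapping-parameter vectors.
import numpy as np

def components_for_variance(params, threshold=0.90):
    """params: (n_samples, dim) matrix, one mapping-parameter vector per row."""
    centered = params - params.mean(axis=0)
    # Singular values give the variance explained by each principal direction.
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

rng = np.random.default_rng(1)
# Toy parameters that live mostly in a 3-D subspace of a 14-D space.
basis = rng.standard_normal((3, 14))
params = rng.standard_normal((258, 3)) @ basis + 0.05 * rng.standard_normal((258, 14))
P = components_for_variance(params)
print(f"P = {P} components cover 90% of the variance")
```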

  15. Facial gestures/speech interrelation Mapping parameter results • Parameters cluster in a small subspace • Prosodic-based parameters cluster in a smaller subspace than MFCC-based parameters • Further evidence of an emotion-dependent influence in the relationship between facial gestures and speech (Figure: fraction of eigenvectors needed to span 90% or more of the variance of the mapping parameter T)

  16. Facial gestures/speech interrelation Mapping parameter results • Correlation levels as a function of P • The slope for prosodic-based features is lower than for MFCCs • Smaller dimension of the cluster • The slope depends on the facial region • Different levels of coupling (Figures: correlation vs. P for prosodic and MFCC features in the upper, middle and lower face regions)

  17. Affective/Linguistic Interplay C. Busso and S.S. Narayanan. Interplay between linguistic and affective goals in facial expression during emotional utterances. To appear in the International Seminar on Speech Production (ISSP 2006). Affective/Linguistic Interplay • Introduction • Analysis • Facial Gesture/speech Interrelation • Affective/Linguistic Interplay • Recognition • Synthesis • Conclusions

  18. Linguistic/affective interplay Motivation • Linguistic and emotional goals jointly modulate speech and gestures to convey the desired messages • Articulatory and affective goals co-occur during normal human interaction, sharing the same channels • Some control needs to buffer, prioritize and execute these communicative goals in a coherent manner Hypotheses • Linguistic and affective goals interplay interchangeably as primary and secondary controls • During speech, affective goals are displayed under articulatory constraints • Some facial areas have more degrees of freedom to display non-verbal cues

  19. Linguistic/affective interplay Previous results • Low vowels (/a/), with a less restrictive tongue position, show greater emotional coloring than high vowels (/i/) [Yildirim,2004], [Lee,2005], [Lee,2004] Approach • The focus of this analysis is on the interplay in facial expressions • Compare facial expressions of neutral and emotional utterances with the same semantic content • Correlation • Euclidean distance • The database is a subset of the MOCAP data

  20. Linguistic/affective interplay Facial activation analysis • Measure of facial motion • The lower face area has the highest activeness levels • Articulatory processes play a crucial role • Emotional modulation • Happy and angry are more active • Sadness is less active than neutral • Activeness in the upper face region increases more than in other regions (Figure: facial activation by emotion: Neutral, Sad, Happy, Angry)

  21. Linguistic/affective interplay Neutral vs. emotional analysis • Goal: Compare in detail the facial expressions displayed during neutral and emotional utterances with similar semantic content • Dynamic Time Warping (DTW) is used to align the utterances
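A minimal DTW sketch (illustrative, not the thesis code) showing how two utterances of different lengths can be aligned before the frame-wise comparisons reported on the following slides.

```python
# Illustrative DTW alignment of two feature sequences of different lengths.
import numpy as np

def dtw_path(a, b):
    """a: (Ta, d), b: (Tb, d). Returns list of aligned index pairs (i, j)."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

neutral = np.random.default_rng(2).standard_normal((40, 3))   # toy marker trajectories
emotional = np.random.default_rng(3).standard_normal((55, 3))
print("alignment length:", len(dtw_path(neutral, emotional)))
```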

  22. Linguistic/affective interplay Correlation analysis: neutral vs. emotional • Higher correlation implies higher articulatory constraints • The lower facial region has the highest correlation levels • More constrained • The upper facial region has the lowest correlation levels • It can communicate non-verbal information regardless of the linguistic content (Figures: median correlation results for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs)

  23. Linguistic/affective interplay Euclidean distance analysis: neutral vs. emotional • After scaling the facial features, the Euclidean distance was estimated • High values indicate that facial features are more independent of the articulation • Similar results as in the correlation analysis • The upper face region is less constrained by articulatory processes (Figures: median Euclidean distance results for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs)
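A minimal sketch of this distance measure, assuming the utterances have already been DTW-aligned; the per-feature scaling shown here (z-normalization over both utterances) is one plausible choice, not necessarily the one used in the original analysis.

```python
# Illustrative sketch: scaled Euclidean distance between aligned frames of a
# neutral and an emotional rendition of the same sentence.
import numpy as np

def scaled_euclidean(neutral, emotional):
    """neutral, emotional: (T, d) DTW-aligned facial feature matrices."""
    both = np.vstack([neutral, emotional])
    mu, sigma = both.mean(axis=0), both.std(axis=0) + 1e-8   # per-feature scaling
    a = (neutral - mu) / sigma
    b = (emotional - mu) / sigma
    return np.linalg.norm(a - b, axis=1).mean()              # average over frames

rng = np.random.default_rng(4)
neutral = rng.standard_normal((60, 5))
emotional = neutral + 0.5 * rng.standard_normal((60, 5))
print("distance:", scaled_euclidean(neutral, emotional))
```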

  24. Analysis Remarks from analysis section • Facial gestures and speech are strongly interrelated • The correlation levels present inter-emotion differences • There is an emotion-dependent structure in the mapping parameters that may be learned • The prosodic-based mapping parameter set is grouped in a small cluster • Facial areas and speech are coupled at different resolutions

  25. Analysis Remarks from analysis section • During speech, facial activeness is mainly driven by articulation • However, linguistic and affective goals co-occur during active speech • There is an interplay between linguistic and affective goals in facial expression • The forehead and cheeks have more degrees of freedom to convey non-verbal messages • The lower face region is more constrained by the articulatory process

  26. Emotion recognition C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Sixth International Conference on Multimodal Interfaces ICMI 2004, State College, PA, 2004, pp. 205–211, ACM Press. Recognition • Introduction • Analysis • Recognition • Emotion recognition • Engagement recognition • Synthesis • Conclusions

  27. Multimodal Emotion Recognition Motivation • Emotions are an important element of human-human interaction • Design improved human-machine interfaces • Give specific and appropriate help to the user Hypotheses • Modalities give complementary information • Some emotions are better recognized in a particular domain • A multimodal approach provides better performance and robustness Related work • Decision-level fusion systems (rule-based systems) [Chen,1998], [DeSilva,2000], [Yoshitomi,2000] • Feature-level fusion systems [Chen,1998_2], [Huang,1998]

  28. Multimodal Emotion Recognition Proposed work • Analyze the strengths and limitations of unimodal systems for recognizing emotional states • Study the performance of the multimodal system • The MOCAP database is used • Sentence-level features (e.g. mean, variance, range) • Speech: prosodic features • Facial expression: upper and middle face areas • Sequential backward feature selection • Support vector machine classifier (SVC) • Decision-level and feature-level integration
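A minimal sketch of this pipeline with scikit-learn, shown for the feature-level fusion case; the feature counts, kernel and selection budget are illustrative assumptions rather than the settings used in the thesis.

```python
# Illustrative feature-level fusion classifier: concatenate modalities, apply
# sequential backward feature selection, classify with an SVM.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n = 200                                                   # toy sentences
speech_feats = rng.standard_normal((n, 10))               # sentence-level prosodic stats
face_feats = rng.standard_normal((n, 20))                 # upper/middle face stats
labels = rng.integers(0, 4, size=n)                       # 4 emotion classes

# Feature-level fusion: concatenate the modalities before classification.
X = np.hstack([speech_feats, face_feats])

clf = make_pipeline(
    StandardScaler(),
    SequentialFeatureSelector(SVC(kernel="rbf"), n_features_to_select=10,
                              direction="backward", cv=3),
    SVC(kernel="rbf"),
)
print("CV accuracy:", cross_val_score(clf, X, labels, cv=3).mean())
```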

  29. Multimodal Emotion Recognition Emotion recognition results • From speech • Average ~70% • Confusion sadness-neutral • Confusion happiness-anger • From facial expression • Average ~85% • Confusion anger-sadness • Confusion neutral-happiness • Confusion sadness-neutral • Multimodal system (feature-level) • Average ~90% • Confusion neutral-sadness • Other pairs are correctly separated

  30. Engagement recognition C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P.G. Georgiou, I. Cohen, S. Narayanan. Smart Room: Participant and Speaker Localization and Identification. In Proc. ICASSP, Philadelphia, PA, March 2005. C. Busso, P.G. Georgiou and S.S. Narayanan. Real-time monitoring of participants’ interaction in a meeting using audio-visual sensors. Under submission to the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007). Inferring participants’ engagement • Introduction • Analysis • Recognition • Emotion recognition • Engagement recognition • Synthesis • Conclusions

  31. Inferring participants’ engagement Motivation • At the small-group level, the strategies of one participant are affected by the strategies of the other participants • Automatic annotation of human interaction will provide better tools for analyzing teamwork and collaboration strategies • Examples of applications in which monitoring human interaction is very useful are summarization, retrieval and classification of meetings Goals • Infer meta-information from participants in a multiperson meeting • Monitor and track the behaviors, strategies and engagement of the participants • Infer the interaction flow of the discussion

  32. Inferring participants’ engagement Approach • Extract high-level features from automatic annotations of speaker activity (e.g. number and average duration of each turn) • Use an intelligent environment equipped with audio-visual sensors to obtain the annotations Related work • Intelligent environments [Checka,2004], [Gatica-Perez,2003], [Pingali,1999] • Monitoring human interaction [McCowan,2005], [Banerjee,2004], [Zhang,2006], [Basu,2001]

  33. Inferring participants’ engagement Smart Room • Visual • 4 FireWire CCD cameras • 360° omnidirectional camera • Audio • 16-channel microphone array • Directional microphone (SID)

  34. Inferring participants’ engagement Localization and identification • After fusing the audio-visual data streams, the system gives • Participants’ locations • Seating arrangement • Speaker identity • Speaker activity • Testing (~85%) • Three 20-minute meetings (4 participants) • Casual conversation with interruptions and overlap

  35. Inferring participants’ engagement Participant interaction • High-level features per participant • Number of turns • Average duration of turns • Amount of time as active speaker • Transition matrix depicting turn-taking between participants • Evaluation • Hand-based annotation of speaker activity • The results described here correspond to one of the meetings
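A minimal sketch (illustrative only) of how these high-level features can be derived from a frame-level speaker-activity annotation, including the turn-taking transition matrix; the annotation format and frame rate are assumptions.

```python
# Illustrative sketch: turn statistics and speaker transition matrix from a
# frame-level speaker-activity annotation (speaker id per frame, -1 = silence).
import numpy as np

def turn_features(activity, n_speakers, frame_sec=0.1):
    """activity: 1-D array of speaker ids per frame (-1 for silence)."""
    turns = []                                            # [speaker, n_frames] per turn
    for spk in activity:
        if spk == -1:
            continue
        if turns and turns[-1][0] == spk:
            turns[-1][1] += 1
        else:
            turns.append([spk, 1])

    n_turns = np.bincount([s for s, _ in turns], minlength=n_speakers)
    speak_time = np.bincount(activity[activity >= 0], minlength=n_speakers) * frame_sec
    avg_turn = np.where(n_turns > 0, speak_time / np.maximum(n_turns, 1), 0.0)

    # Transition matrix: who tends to take the floor after whom.
    trans = np.zeros((n_speakers, n_speakers))
    for (a, _), (b, _) in zip(turns[:-1], turns[1:]):
        trans[a, b] += 1
    return n_turns, avg_turn, speak_time, trans

activity = np.array([0, 0, 0, -1, 1, 1, 0, 0, 2, 2, 2, 1])   # toy annotation
print(turn_features(activity, n_speakers=4))
```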

  36. Inferring participants’ engagement Results: Participant interaction • The automatic annotations are a good approximation • The distribution of time used as active speaker correlates with dominance [Rienks,2006] • Subject 1 spoke more than 65% of the time • Discussions are characterized by many short turns to show agreement (e.g. “uh-huh”) and longer turns taken by mediators [Burger,2002] • Subject 1 was leading the discussion • Subject 3 was only an active listener (Figures: estimated vs. ground-truth turn durations, active-speaker time distributions and numbers of turns)

  37. Inferring participants’ engagement Results: Participant interaction • The transition matrix gives the interaction flow and turn-taking patterns • Claim: transitions between speakers approximate who was being addressed • To evaluate this hypothesis, the addressee was manually annotated and compared with the transition matrix • The transition matrix provides a good first approximation to identifying the interlocutor dynamics • The discussion was mainly between subjects 1 and 3 (Figures: estimated vs. ground-truth transition matrices)

  38. Inferring participants’ engagement Results: Participant interaction • These high-level features can be estimated in small windows over time to infer participants’ engagement • Subject 4 was not engaged • Subjects 1, 2 and 3 were engaged (Figure: dynamic behavior of speakers’ activeness over time)

  39. Recognition Remarks from recognition section • Multimodal approaches to infer meta-information from speakers give better performance than unimodal systems • When acoustic and facial features are fused, the performance and the robustness of the emotion recognition system improve measurably • In small-group meetings, it is possible to accurately estimate in real time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion

  40. Head motion synthesis C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, “Natural head motion synthesis driven by acoustic prosodic features,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283–290, July 2005. C. Busso, Z. Deng, M. Grimm, U. Neumann and S. Narayanan. Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis. IEEE Transactions on Audio, Speech and Language Processing, March 2007. Synthesis • Introduction • Analysis • Recognition • Synthesis • Head motion synthesis • Conclusions • Future Work

  41. Natural Head Motion Synthesis Motivation • The mapping between facial gestures and speech can be learned using a more sophisticated framework • A useful and practical application is avatars driven by speech • Engaging human-computer interfaces and applications such as animated feature films have motivated realistic avatars • Focus of this section: head motion

  42. Natural Head Motion Synthesis Why head motion? • It has received little attention compared to other gestures • Important for acknowledging active listening • Improves acoustic perception [Munhall,2004] • Distinguishes interrogative and declarative statements [Munhall,2004] • Recognize speaker identity [Hill,2001] • Segment spoken content [Graf,2002]

  43. Natural Head Motion Synthesis Hypotheses • Head motion is important for human-like facial animation • Head motion changes the perception of the emotion • Head motion can be synthesized from acoustic features Related Work • Rule-based systems [Pelachaud,1994] • Gaussian Mixture Models [Costa,2001] • Specific head motions (e.g. ‘nod’) [Cassell,1994], [Graf,2002] • Example-based systems [Deng,2004], [Chuang,2004]

  44. Natural Head Motion Synthesis Proposed Framework • Hidden Markov Models (HMMs) are trained to capture the temporal relation between the prosodic features and the head motion sequence • Vector quantization is used to produce a discrete representation of head poses • A two-step smoothing technique is used, based on a first-order Markov model and spherical cubic interpolation • Emotion perception is studied by rendering deliberate mismatches between the emotional speech and the emotional head motion sequences

  45. Natural Head Motion Synthesis Database and features • Same audio-visual database • Acoustic features ~ prosody (6D) • Pitch • RMS energy • First and second derivatives • Head motion ~ head rotation (3 DOF) • Reduces the number of HMMs • For a close-up view of the face, translation effects are less important

  46. Natural Head Motion Synthesis Head motion analysis in expressive speech • Prosodic features are coupled with head motion (emotion dependent) • Emotional patterns in activeness, range and velocity • Discriminant analysis ~ 65.5% • Emotion-dependent models are needed

  47. Natural Head Motion Synthesis Head motion analysis in expressive speech • Head motions are modeled with HMMs • HMMs provide a suitable and natural framework to model the temporal relation between prosodic features and head motions • HMMs will be used as a sequence generator (head motion sequence) • Discrete head pose representation • The 3D head motion data is quantized into K clusters using vector quantization • Each cluster is characterized by its mean and covariance
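A minimal sketch of this quantization step, using scikit-learn's K-means as a stand-in vector quantizer; the number of clusters and the toy pose data are illustrative assumptions.

```python
# Illustrative sketch: quantize 3-D head poses into K clusters and keep each
# cluster's mean and covariance, yielding a discrete head-pose sequence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
head_pose = rng.standard_normal((5000, 3))            # toy Euler angles per frame

K = 16
vq = KMeans(n_clusters=K, n_init=10, random_state=0).fit(head_pose)

# Each cluster i is characterized by its mean and covariance.
means = vq.cluster_centers_
covs = [np.cov(head_pose[vq.labels_ == i].T) for i in range(K)]
cluster_sequence = vq.labels_                         # discrete head-pose sequence
print("cluster means shape:", means.shape)
```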

  48. Natural Head Motion Synthesis Learning natural head motion • The observations O are the acoustic prosodic features • One HMM is trained for each head pose cluster Vi • Likelihood distribution: P(O|Vi) • It is modeled as a Markov process • A mixture of M Gaussian densities is used to model the pdf of the observations • Standard algorithms are used to train the parameters (forward-backward, Baum-Welch re-estimation)
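A minimal sketch using hmmlearn, which is an assumption since the slides do not name a toolkit: one HMM with Gaussian-mixture emissions is trained per head-pose cluster on the prosodic sequences assigned to that cluster, and P(O|Vi) is evaluated for a new segment.

```python
# Illustrative sketch: one GMM-emission HMM per head-pose cluster, trained
# with Baum-Welch, then scored on a new prosodic segment.
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(7)
K = 4                                                  # head-pose clusters (toy value)
models = []
for i in range(K):
    # Prosodic observation sequences (6-D) belonging to cluster i.
    sequences = [rng.standard_normal((30, 6)) for _ in range(20)]
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    hmm = GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=20)
    hmm.fit(X, lengths)                                # Baum-Welch re-estimation
    models.append(hmm)

# Likelihood P(O | V_i) of a new prosodic segment under each cluster model.
segment = rng.standard_normal((30, 6))
loglik = [m.score(segment) for m in models]
print("most likely cluster:", int(np.argmax(loglik)))
```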

  49. Natural Head Motion Synthesis Learning natural head motion • Prior distribution: P(Vi) • It is built as a bi-gram model learned from the data (1st smoothing step) • Transitions between clusters that do not appear in the training data are penalized • This smoothing constraint is imposed in the decoding step
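A minimal sketch of the bi-gram prior over head-pose clusters; the value assigned to unseen transitions is an illustrative choice, not the penalty used in the original work.

```python
# Illustrative sketch: bi-gram transition prior over head-pose clusters, with
# unseen transitions penalized by a small count instead of zero.
import numpy as np

def bigram_prior(cluster_sequence, K, unseen_penalty=1e-4):
    counts = np.zeros((K, K))
    for a, b in zip(cluster_sequence[:-1], cluster_sequence[1:]):
        counts[a, b] += 1
    counts[counts == 0] = unseen_penalty               # penalize unseen transitions
    return counts / counts.sum(axis=1, keepdims=True)  # rows: P(V_j | V_i)

train_seq = np.random.default_rng(8).integers(0, 4, size=1000)
print(bigram_prior(train_seq, K=4).round(3))
```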

  50. Natural Head Motion Synthesis Synthesis of natural head motion • For a novel sentence, the HMMs generate the most likely head motion sequence • Interpolation is used to smooth the cluster transition region (2nd smoothing step)
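A minimal decoding sketch: per-segment HMM log-likelihoods are combined with the bi-gram prior in a Viterbi search over cluster sequences, and the resulting pose trajectory is then smoothed. A simple moving average stands in here for the spherical cubic interpolation used in the original framework, and all data are toy values.

```python
# Illustrative sketch: Viterbi decoding of the head-pose cluster sequence
# followed by trajectory smoothing (moving average as a stand-in smoother).
import numpy as np

def decode_clusters(loglik, log_prior):
    """loglik: (T, K) log P(O_t | V_i); log_prior: (K, K) log P(V_j | V_i)."""
    T, K = loglik.shape
    score = np.zeros((T, K)); back = np.zeros((T, K), dtype=int)
    score[0] = loglik[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_prior + loglik[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):          # backtrack the best sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def smooth(trajectory, width=5):
    kernel = np.ones(width) / width
    return np.column_stack([np.convolve(trajectory[:, d], kernel, mode="same")
                            for d in range(trajectory.shape[1])])

rng = np.random.default_rng(9)
K, T = 4, 50
cluster_means = rng.standard_normal((K, 3))            # toy head pose per cluster
seq = decode_clusters(rng.standard_normal((T, K)),
                      np.log(np.full((K, K), 1.0 / K)))
head_motion = smooth(cluster_means[np.array(seq)])     # smoothed pose trajectory
print(head_motion.shape)
```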
