Emotional Speech
Emotional Speech Detection: Laurence Devillers, LIMSI-CNRS, devil@limsi.fr
Expression of Emotions in Speech Synthesis: Marc Schröder, DFKI, schroed@dfki.de
Humaine Plenary Meeting, 4-6 June 2007, Paris
Overview
Challenge: a real-time system for detecting "real-life" emotional speech, in order to build an affectively competent agent.
Emotion is considered in the broad sense: real-life emotions are often shaded, blended or masked owing to social factors.
Static emotion detection system (emotional unit level: word, chunk, sentence)
• Statistical approach (e.g. SVM) using a large amount of data to train the models
• 4-6 emotions detected, rarely more
A state-of-the-art automatic emotion recognition system has two main components: feature extraction from the observation O, and emotion detection, which uses the trained emotion models to select the emotion Ei maximising P(Ei|O).
Performance on realistic data (CEICES): 2 emotions > 80%, 4 emotions > 60%.
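As an illustration, a minimal sketch of such a static classifier in Python with scikit-learn; the feature matrix, class labels and SVM settings below are placeholders, not the CEICES setup:

```python
# Minimal sketch of a static emotion classifier over per-unit feature vectors.
# Data are random placeholders; in practice X would hold paralinguistic cues
# extracted for each emotional unit (word, chunk or sentence).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 40)         # 200 emotional units x 40 features
y = np.random.randint(0, 4, 200)    # 4 emotion classes, e.g. Anger/Fear/Sadness/Neutral

# SVM with probability estimates, so its output can be read as P(Ei|O)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
posteriors = clf.predict_proba(X[:1])   # approximate P(Ei|O) for one unit
```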
Automatic emotion detection
The difficulty of the detection task increases with the variability of the emotional speech expression, along four dimensions:
• Speaker (dependent/independent, age, gender, health)
• Environment (transmission channel, background noise)
• Number and type of emotions (primary, secondary)
• Acted vs. real-life data, and the application context
Automatic emotion detection: research evolution (1996-2007)
• 1996: more than 4 primary acted emotions; data from actors; speaker-dependent; quiet room; channel-dependent.
• 2003: Emotion/Non-emotion (WoZ), positive/negative emotions in HMI, 2-5 realistic emotions (children, CEICES) in HMI, real-life call-centre emotions in HMI; WoZ and call-centre data; pluri-speaker; channel-dependent with adaptation.
• 2007: more than 5 real emotions, emotion in interaction; acted/WoZ/real-life data (fiction, documentaries, TV news, clips); speaker-independent, with adaptation to gender, personality, health, age and culture; public places, phone, overlapping voices; channel-independent.
The evolution spans five dimensions: emotion representation, data, speakers, environment and transmission.
Challenge with spontaneous emotions
• Authenticity is present, but there is no control over the emotion
• Appropriate labels and measures are needed to validate the annotation
• Emotions are often blended (Scherer: Geneva Airport Lost Luggage Study)
Annotation and validation of the annotation:
• Expert annotation phase by several coders (10 coders here, 5 for CEICES, often only two)
• Control of the quality of the annotations: intra-/inter-annotator agreement, perception tests
• Perception tests validate the annotation scheme and the annotations: perception of emotion mixtures (40 subjects), NEG/POS valence, importance of the context
• They also give a measure for comparing human perception with automatic detection
A minimal agreement computation is sketched below.
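Inter-annotator agreement on categorical labels is commonly quantified with Cohen's kappa; a minimal sketch, where the label sequences are invented for illustration rather than taken from CEMO or CEICES:

```python
# Inter-annotator agreement on categorical emotion labels via Cohen's kappa.
# The two label sequences are invented examples.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["Anger", "Fear", "Neutral", "Fear", "Relief", "Anger"]
coder_2 = ["Anger", "Anxiety", "Neutral", "Fear", "Relief", "Fear"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance level
```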
Human-human real-life corpora (audio and audio-visual).
Context-dependent emotion labels
Do the labels represent the emotion of the task or context under consideration?
Example from real-life emotion studies (call centre): the label Fear covers different expressions of fear arising from different contexts: the callers' fear of losing money, the callers' fear for their life, the agents' fear of making a mistake.
The difference is not just a question of intensity/activation.
-> Primary vs. secondary fear?
-> Degree of urgency / reality of the threat?
Fear in fiction (movies) has been studied in many different contexts.
How to generalise? Should labels be defined as a function of the type of context? So far only the social role (agent/caller) has been defined as context.
See the poster of C. Clavel.
Emotional labels
• Most detection systems use a discrete representation of emotion
• This requires a sufficient amount of data per class; to that end, we use a hierarchical organisation of labels (LIMSI example), as sketched below
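A minimal sketch of such a hierarchical organisation, in which fine-grained labels are pooled into coarse classes when the data per label are too sparse; the hierarchy below is illustrative, not the actual LIMSI annotation scheme:

```python
# Illustrative hierarchy: fine-grained annotation labels grouped into coarse
# training classes so that sparse labels can be pooled.
COARSE_CLASSES = {
    "Fear":    ["Fear", "Anxiety", "Stress", "Panic"],
    "Anger":   ["Anger", "Annoyance", "Impatience"],
    "Sadness": ["Sadness", "Disappointment"],
    "Relief":  ["Relief"],
    "Neutral": ["Neutral"],
}

FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE_CLASSES.items()
                  for fine in fines}

def coarsen(label: str) -> str:
    """Map a fine-grained annotation to its coarse training class."""
    return FINE_TO_COARSE.get(label, "Neutral")

print(coarsen("Annoyance"))   # -> "Anger"
```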
No bad coders, but different perceptions. Combining the annotations of different coders into a soft vector of emotions:
Labeler 1: (Major) Annoyance, (Minor) Interest
Labeler 2: (Major) Stress, (Minor) Annoyance
Each label receives the weight wM for a Major choice and wm for a Minor choice; the weights are summed over coders and normalised by the total weight W. With wM = 2, wm = 1 and W = 6, this gives ((wM + wm)/W Annoyance, wM/W Stress, wm/W Interest) = (0.5 Annoyance, 0.33 Stress, 0.17 Interest).
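A minimal sketch of this combination in Python, using the slide's example annotations and weights (wM = 2, wm = 1):

```python
# Combining annotations from several coders into a soft emotion vector:
# major and minor labels are weighted and normalised by the total weight.
from collections import Counter

W_MAJOR, W_MINOR = 2, 1

annotations = [
    {"major": "Annoyance", "minor": "Interest"},   # labeler 1
    {"major": "Stress",    "minor": "Annoyance"},  # labeler 2
]

weights = Counter()
for ann in annotations:
    weights[ann["major"]] += W_MAJOR
    weights[ann["minor"]] += W_MINOR

total = sum(weights.values())
soft_vector = {emotion: w / total for emotion, w in weights.items()}
print(soft_vector)   # Annoyance 0.50, Stress 0.33, Interest 0.17 (rounded)
```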
Speech data processing (LIMSI, see the poster of L. Vidrascu)
• Paralinguistic cues (~200), extracted with Praat: prosodic features (F0, formants, energy), micro-prosody (jitter, shimmer, ...), disfluencies, affect bursts
• Standard features: pitch level and range, energy level and range, speaking rate, spectral features (formants, MFCCs)
• Less standard features: voice quality as local disturbances (jitter/shimmer), disfluencies (pauses, filled pauses), affect bursts; see the Ní Chasaide poster
• Linguistic cues from the transcription (words): preprocessing, stemming, n-gram models
• Cue combination and classification with the WEKA toolkit: attribute selection, SVM, ... (www.cs.waikato.ac.nz, Witten & Frank, 1999)
• Open issues: affect bursts and new features such as voice quality need to be detected automatically, and the phone signal is not of sufficient quality for many existing techniques
A feature-extraction sketch follows below.
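For illustration, a minimal per-utterance feature-extraction sketch in Python; the slide's pipeline uses Praat, so the librosa library substituted here, the 16 kHz sampling rate and the small feature subset are assumptions, not the LIMSI setup:

```python
# Minimal per-utterance acoustic feature extraction (subset of the ~200 cues):
# pitch level/range, energy level/range and MFCC statistics.
import numpy as np
import librosa

def utterance_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)

    # Pitch: level and range over voiced frames (standard prosodic cues)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
    f0_mean = np.nanmean(f0)
    f0_range = np.nanmax(f0) - np.nanmin(f0)

    # Energy: level and range from frame-wise RMS
    rms = librosa.feature.rms(y=y)[0]
    energy_mean, energy_range = rms.mean(), rms.max() - rms.min()

    # Spectral features: MFCC means and standard deviations
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    return np.concatenate([[f0_mean, f0_range, energy_mean, energy_range],
                           mfcc_stats])
```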
LIMSI: results with paralinguistic cues (SVMs), from 2 to 5 emotion classes (% of correct detection). Abbreviations used in the results table: Fe: fear, Sd: sadness, Ag: anger, Ax: anxiety, St: stress, Re: relief.
25 best features for 5-class detection (Anger, Fear, Sadness, Relief, Neutral)
• Features relevant to all of the classes were selected, and the selected set differs from one class to another
• The media channel (phone vs. microphone), the type of data (adult vs. children, realistic vs. naturalistic) and the emotion classes all have an impact on the most relevant feature set
• Out of our 5 classes, Sadness is the least well recognised when the cues are not combined
A feature-selection sketch follows below.
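As an illustration of selecting the best cues before classification: the slide's pipeline uses WEKA attribute selection, while scikit-learn's ANOVA F-test is substituted here and the data are random placeholders:

```python
# Sketch of keeping the 25 most relevant features before SVM training,
# analogous to WEKA's attribute selection step.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(300, 200)        # 300 utterances x ~200 paralinguistic cues
y = np.random.randint(0, 5, 300)    # 5 classes: Anger, Fear, Sadness, Relief, Neutral

selector = SelectKBest(score_func=f_classif, k=25).fit(X, y)
best_features = selector.get_support(indices=True)   # indices of the 25 best cues
X_reduced = selector.transform(X)                     # reduced feature matrix
print(best_features)
```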
Real-life emotional systems
• A system trained on acted data is inadequate for detection on real-life data (Batliner)
• A GEMEP/CEMO comparison shows different emotions; first experiments give an acceptable detection score only for Anger
• Studies of real-life emotions are therefore necessary
Detection results on call-centre data, the state of the art for "realistic emotions": > 80% for 2 emotions, > 60% for 4 emotions, ~55% for 5 emotions.
Challenges ahead
Short-term:
• Acceptable solutions for targeted applications are within reach
• Use a dynamic model of emotion for real-time emotion detection (history memory)
• New features: information on voice quality, affect bursts and disfluencies extracted automatically from the signal, without requiring exact speech recognition
• Detect relaxed/tense voice (Scherer)
• Add contextual knowledge to the blind statistical model: social role, type of action, regulation (adapting the emotional expression to strategic interaction goals; face theory, Goffman)
Long-term:
• A dynamic emotion process based on an appraisal model
• Combining information at several levels: acoustic/linguistic, multimodal cues, and contextual information (social role)
Demo (coffee break…)
Thanks