Robust Recognition of Emotion from Speech Mohammed E. Hoque Mohammed Yeasin Max M. Louwerse {mhoque, myeasin, mlouwerse}@memphis.edu Institute for Intelligent Systems University of Memphis
Presentation Overview • Motivation • Methods • Database • Results • Conclusion
Motivations • Animated agents should recognize learners' emotions in e-Learning environments. • Agents need to be sensitive and adaptive to learners' emotions.
Methods • Our method is partially motivated by the work of Lee and Narayanan [1], who first introduced the notion of salient words.
Shortcomings of Lee and Narayanan's work Lee et al. argued that there is a one-to-one correspondence between a word and a positive or negative emotion. This is NOT true in every case.
Examples Figure 1: Pictorial depiction of the word "okay" uttered with different intonations (confusion, flow, normal, delight) to express different emotions.
More examples… Scar!! Scar??
More examples… Two months!! Two months??
Our Hypothesis • Lexical information extracted from combined prosodic and acoustic features that correspond to the intonation patterns of "salient words" will yield robust recognition of emotion from speech. • It also provides a framework for signal-level analysis of speech for emotion.
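A minimal sketch (not the authors' code) of how word-level segments might be isolated before feature extraction, assuming word timestamps from a forced aligner and a hand-picked salient-word list; both the timestamps and the word list here are illustrative assumptions:

```python
# Sketch: slice audio around "salient words" so prosodic/acoustic features
# are computed only on those segments rather than the whole utterance.
import numpy as np

SALIENT_WORDS = {"okay", "great", "yes", "yeah", "no", "good", "right", "really", "what", "god"}

def salient_segments(audio, sr, word_times):
    """audio: 1-D numpy array; sr: sample rate;
    word_times: list of (word, start_sec, end_sec) from a forced aligner (assumed input)."""
    segments = []
    for word, start, end in word_times:
        if word.lower() in SALIENT_WORDS:
            segments.append((word, audio[int(start * sr):int(end * sr)]))
    return segments

# Toy usage with synthetic audio and hypothetical timestamps:
sr = 16000
audio = np.random.randn(sr * 3).astype(np.float32)
segs = salient_segments(audio, sr, [("okay", 0.5, 0.9), ("the", 1.0, 1.1)])
print([(w, len(x)) for w, x in segs])
```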
Details on the Database • 15 utterances were selected for four emotion categories: confusion/uncertain, delight, flow (confident, encouragement), and frustration [2]. • Utterances were stand-alone ambiguous expressions in conversations, dependent on the context. • Examples are “Great”, “Yes”, “Yeah”, “No”, “Ok”, “Good”, “Right”, “Really”, “What”, “God”.
Details on the Database… • Three graduate students listened to the audio clips. • They successfully distinguished between the positive and negative emotions 65% of the time. • No specific instructions were given as to what intonation patterns to listen to.
High Level Diagram Word-level utterances → Feature Extraction → Data Projection → Classifiers → Positive / Negative Figure 2. The high-level description of the overall emotion recognition process.
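As an illustration of the Figure 2 pipeline, the sketch below chains feature scaling, a projection step, and a binary positive/negative classifier. The slide does not name the projection or the classifier, so PCA and an SVM are assumptions here, and the data is random placeholder input:

```python
# Sketch of the Figure 2 pipeline: word-level features -> projection -> binary classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 24))      # 60 word-level utterances, 24 prosodic/acoustic features (placeholder)
y = rng.integers(0, 2, size=60)    # 1 = positive emotion, 0 = negative emotion (placeholder labels)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))
```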
Hierarchical Classifiers Emotion → Positive (Delight, Flow) / Negative (Confusion, Frustration) Figure 3. The design of the hierarchical binary classifiers.
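The Figure 3 hierarchy can be sketched as one top-level positive/negative classifier plus one binary classifier per branch (delight vs. flow, confusion vs. frustration). The choice of logistic regression below is an illustrative assumption, not the model used in the work, and the training data is synthetic:

```python
# Sketch of hierarchical binary classification: positive/negative first, then per-branch labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

POSITIVE = {"delight", "flow"}

class HierarchicalEmotionClassifier:
    def fit(self, X, labels):
        labels = np.array(labels)
        y_top = np.array([1 if l in POSITIVE else 0 for l in labels])
        self.top = LogisticRegression(max_iter=1000).fit(X, y_top)          # positive vs. negative
        pos = y_top == 1
        self.pos = LogisticRegression(max_iter=1000).fit(X[pos], labels[pos])    # delight vs. flow
        self.neg = LogisticRegression(max_iter=1000).fit(X[~pos], labels[~pos])  # confusion vs. frustration
        return self

    def predict(self, X):
        top = self.top.predict(X)
        return [ (self.pos if t == 1 else self.neg).predict(x.reshape(1, -1))[0]
                 for x, t in zip(X, top) ]

# Toy usage with random features and the four emotion labels:
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 24))
labels = ["delight", "flow", "confusion", "frustration"] * 10
clf = HierarchicalEmotionClassifier().fit(X, labels)
print(clf.predict(X[:4]))
```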
Emotion Models using Lexical Information • Pitch: minimum, maximum, mean, standard deviation, absolute value, quantile, ratio between voiced and unvoiced frames. • Duration: ε_time, ε_height. • Intensity: minimum, maximum, mean, standard deviation, quantile. • Formant: first through fifth formants, second formant / first formant, third formant / first formant. • Rhythm: speaking rate.
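The pitch and intensity statistics listed above can be computed directly from an already-extracted F0 contour and intensity track (e.g. from Praat). A minimal numpy sketch, assuming unvoiced frames are marked as 0 Hz; formant ratios such as F2/F1 would follow the same pattern:

```python
# Sketch: per-utterance pitch and intensity statistics from pre-extracted tracks.
import numpy as np

def pitch_stats(f0):
    voiced = f0[f0 > 0]                       # keep only voiced frames
    return {
        "min": voiced.min(), "max": voiced.max(),
        "mean": voiced.mean(), "std": voiced.std(),
        "quantile_25": np.quantile(voiced, 0.25),
        "quantile_75": np.quantile(voiced, 0.75),
        "voiced_unvoiced_ratio": len(voiced) / max(1, int((f0 <= 0).sum())),
    }

def intensity_stats(intensity):
    return {
        "min": intensity.min(), "max": intensity.max(),
        "mean": intensity.mean(), "std": intensity.std(),
        "quantile_25": np.quantile(intensity, 0.25),
    }

# Toy contour: 100 frames, half unvoiced (0 Hz), half around 180 Hz
f0 = np.concatenate([np.zeros(50), 180 + 20 * np.sin(np.linspace(0, 3, 50))])
print(pitch_stats(f0))
```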
Duration Features Figure 4. Measures of F0 for computing the parameters (ε_time, ε_height), which correspond to the rising and lowering of intonation. Inclusion of height and time accounts for possible low or high pitch accents.
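A hedged sketch of one plausible way to measure ε_time and ε_height from an F0 contour, reading ε_time as the time between the final F0 valley and the following peak and ε_height as the F0 change over that interval; the exact definition used in the paper may differ, so treat this only as an illustration of rise/fall measurement:

```python
# Sketch: rise time and rise height of an F0 contour (one reading of eps_time, eps_height).
import numpy as np

def rise_parameters(f0, frame_step=0.01):
    """f0: voiced F0 contour (Hz); frame_step: seconds per frame (assumed 10 ms)."""
    i_min = int(np.argmin(f0))                    # lowest point of the contour
    i_max = i_min + int(np.argmax(f0[i_min:]))    # highest point after the valley
    eps_time = (i_max - i_min) * frame_step       # duration of the rise
    eps_height = f0[i_max] - f0[i_min]            # magnitude of the rise
    return eps_time, eps_height

# Toy rising contour, e.g. "okay?" said with question intonation
f0 = np.linspace(150, 220, 40)
print(rise_parameters(f0))   # roughly 0.39 s rise, 70 Hz height
```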
Limitations and Future work • Algorithm • Feature Selection • Discourse Information • Future efforts will include fusion of video and audio data in a signal-level framework. • Database • Clipping arbitrary words from a conversation may be ineffective in various cases. • May need to look at words in sequence.
Acknowledgments • This research was partially supported by grant NSF-IIS-0416128 awarded to the third author. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding institution.