Speaker-Dependent Audio-Visual Emotion Recognition
Sanaul Haq and Philip J.B. Jackson
Centre for Vision, Speech and Signal Processing, University of Surrey, UK
Overview
• Introduction
• Method
• Audio-visual experiments
• Summary
Introduction
• Human-to-human interaction: emotions play an important role by allowing people to express themselves beyond the verbal domain.
• Message:
  – Linguistic information: the textual content of a message.
  – Paralinguistic information: body language, facial expressions, pitch and tone of voice, health and identity. [Busso et al., 2007]
• Audio-based emotion recognition: Borchert et al., 2005; Lin and Wei, 2005; Luengo et al., 2005.
• Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.
• Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.
Method
• Recording of an audio-visual database
• Feature extraction
• Feature selection
• Feature reduction
• Classification
Audio-visual database
• Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.
• 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.
• Size of database: 480 sentences, 120 utterances per subject.
• For visual features, the face was painted with 60 markers.
• 3dMD dynamic face capture system [3dMD capture system, 2009].
Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
Feature extraction
Audio features:
• Pitch features: min., max., mean and standard deviation of Mel-scale pitch and its first-order difference.
• Energy features: min., max., mean, standard deviation and range of the energy in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
• Duration features: voiced speech, unvoiced speech and sentence durations, voiced-to-unvoiced speech duration ratio, speech rate, and voiced-speech-to-sentence duration ratio.
• Spectral features: mean and standard deviation of 12 MFCCs.
Visual features:
• Mean and standard deviation of the 2D facial marker coordinates (60 markers).
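The statistical features above are all utterance-level summaries of frame-level tracks. A minimal sketch of that summarisation, assuming a pre-computed frame-level contour (the pitch values below are made up for illustration):

```python
import numpy as np

def utterance_stats(track):
    """Summarise a frame-level track (e.g. pitch or band energy) into the
    utterance-level statistics used as features: min, max, mean, std."""
    track = np.asarray(track, dtype=float)
    return {
        "min": track.min(),
        "max": track.max(),
        "mean": track.mean(),
        "std": track.std(),
    }

# Hypothetical frame-level pitch contour for one utterance.
pitch = [120.0, 130.0, 128.0, 140.0, 135.0]
feats = utterance_stats(pitch)
# The same summary is applied to the first-order difference of the track.
delta_feats = utterance_stats(np.diff(pitch))
```

The full feature vector is the concatenation of such statistics over all pitch, energy, duration and MFCC tracks.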
Feature extraction
Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.
Feature selection
Plus l-Take Away r algorithm [Chen, 1978]
• Combines the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms: l features are added by forward steps, then r are removed by backward steps, until the target subset size is reached.
• Selection criterion: Bhattacharyya distance.
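A sketch of the Plus l-Take Away r loop, with a generic subset-scoring function standing in for the Bhattacharyya distance (the feature names and per-feature merits below are invented for illustration):

```python
def plus_l_take_r(features, criterion, k, l=2, r=1):
    """Plus l-Take Away r selection (Chen, 1978 style): alternately add the
    l best features (SFS steps) and drop the r least useful (SBS steps)
    until k features are selected. `criterion` scores a candidate subset."""
    selected = []
    while True:
        for _ in range(l):                       # plus-l: forward steps
            remaining = [f for f in features if f not in selected]
            if not remaining:
                break
            best = max(remaining, key=lambda f: criterion(selected + [f]))
            selected.append(best)
            if len(selected) == k:
                return selected
        for _ in range(r):                       # take-away-r: backward steps
            worst = max(selected,
                        key=lambda f: criterion([g for g in selected if g != f]))
            selected.remove(worst)

# Toy criterion: subset score is the sum of made-up per-feature merits.
merit = {"pitch_mean": 3.0, "energy_std": 2.5, "mfcc1_mean": 2.0, "dur_ratio": 0.5}
chosen = plus_l_take_r(list(merit), lambda s: sum(merit[f] for f in s), k=2)
```

With an additive toy criterion the backward steps never change the answer; with a real class-separability criterion such as the Bhattacharyya distance they let the search undo earlier greedy choices, which is the point of combining SFS and SBS.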
Feature reduction
• Principal Component Analysis (PCA): projects the features onto the directions of maximum variance, ignoring class labels.
• Linear Discriminant Analysis (LDA): maximises the between-class distance relative to the within-class distance.
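The contrast between the two reductions can be sketched on toy 2-D data; this is an illustrative implementation, not the paper's code. PCA keeps the high-variance axis whether or not it discriminates, while the Fisher LDA direction weights the class-mean difference by the inverse within-class scatter:

```python
import numpy as np

def pca(X, n_components):
    """PCA: project centred data onto its top principal axes
    (directions of maximum total variance, labels ignored)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def lda_direction(X, y):
    """Two-class Fisher LDA direction: w ∝ Sw^{-1} (m1 - m0), i.e. the
    between-class mean difference scaled by inverse within-class scatter."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in (0, 1))
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Toy data: classes separated along axis 0, but axis 1 has more variance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [0.5, 2.0], size=(50, 2)),
               rng.normal([3, 0], [0.5, 2.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
Z = pca(X, 1)            # 1-D projection by total variance
w = lda_direction(X, y)  # discriminative direction (mostly axis 0 here)
```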
Classification
Gaussian classifier
A Gaussian classifier uses Bayes decision theory, where the class-conditional probability density is assumed to be Gaussian for each class ω_i:

    p(x|ω_i) = N(x; μ_i, Σ_i)

The Bayes decision rule assigns x to the class with the largest posterior probability:

    decide ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i

where P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x) is the posterior probability and P(ω_i) is the prior class probability.
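A minimal sketch of such a classifier: fit a mean, covariance and prior per class, then pick the class maximising log p(x|ω_i) + log P(ω_i) (the toy two-class data is invented for illustration):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate per-class mean, covariance and prior from training data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc.T, bias=True), len(Xc) / len(X))
    return params

def log_gaussian(x, mean, cov):
    """Log of the multivariate Gaussian density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + diff @ np.linalg.solve(cov, diff))

def classify(x, params):
    """Bayes rule: the class maximising log p(x|c) + log P(c) also
    maximises the posterior P(c|x), since p(x) is common to all classes."""
    return max(params, key=lambda c: log_gaussian(x, *params[c][:2])
                                     + np.log(params[c][2]))

# Toy 2-D training data: two well-separated classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(40, 2)),
               rng.normal(4.0, 0.5, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)
params = fit_gaussian_classes(X, y)
```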
Human evaluation experiments
• Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.
• 10 subjects: 5 native speakers, 5 non-native; half of the evaluators were female.
• 10 different sets of the data were created using the Balanced Latin Square method. [Edwards, 1962]
• Training: three facial-expression pictures, two audio files, and a short movie clip for each emotion.
• Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
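The balanced Latin square used to order stimuli can be generated with the standard zig-zag construction (a sketch, not necessarily the exact procedure used in the experiments): for an even number of conditions n, each condition appears once per row and column, and each condition immediately precedes every other condition equally often across rows.

```python
def balanced_latin_square(n):
    """Balanced Latin square for even n via the zig-zag first row
    0, 1, n-1, 2, n-2, ...; subsequent rows shift every entry by 1 mod n."""
    base = [0]
    left, right = 1, n - 1
    while len(base) < n:
        base.append(left)
        left += 1
        if len(base) < n:
            base.append(right)
            right -= 1
    return [[(b + row) % n for b in base] for row in range(n)]

# 4 conditions -> 4 presentation orders; with 10 evaluators and more
# conditions the same construction yields one order per evaluator set.
square = balanced_latin_square(4)
```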
Experiments
• Audio
• Visual
• Audio-visual: decision-level fusion
  – Overcomes the problem of the different time scales and metric levels of the audio and visual data.
  – Avoids the high dimensionality of the concatenated vector that results from feature-level fusion.
[Haq, Jackson & Edge, 2008; Busso et al., 2004; Wang et al., 2005; Zeng et al., 2007; Pal et al., 2006]
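Decision-level fusion can be sketched as a weighted combination of the per-modality posteriors, after each classifier has run on its own time scale (the posterior values and the equal weighting below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def fuse_decisions(p_audio, p_visual, w=0.5):
    """Decision-level fusion: weighted sum of the two classifiers'
    posterior vectors, renormalised; the fused label is the argmax.
    Only class posteriors cross the modality boundary, so differing
    frame rates and feature scales never have to be reconciled."""
    p = w * np.asarray(p_audio) + (1 - w) * np.asarray(p_visual)
    return int(np.argmax(p)), p / p.sum()

# Hypothetical posteriors over 4 emotion classes from each classifier.
audio = [0.10, 0.60, 0.20, 0.10]    # audio favours class 1
visual = [0.05, 0.55, 0.30, 0.10]   # visual agrees
label, fused = fuse_decisions(audio, visual)
```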
Results
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.
Results
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.
Results
7 Emotion Classes
Results
4 Emotion Classes
Summary
Conclusions
• For the British English emotional database, a recognition rate comparable to human performance was achieved (speaker-dependent).
• LDA outperformed PCA with the top 40 features.
• Both audio and visual information were useful for emotion recognition.
• Important audio features: energy and MFCCs.
• Important visual features: mean of the marker y-coordinates (vertical movement of the face).
• Reason for the system's higher performance: the human evaluators were not given an equal opportunity to exploit the data.
Future Work
• Perform speaker-independent experiments, using more data and other classifiers, such as GMM and SVM.
References
Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.
Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.
Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech – Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147-151.
Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.
Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.
Edwards, A.L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.
3dMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May, 2009.
Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-Visual Feature Selection and Reduction for Emotion Classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.
Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int'l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.
Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.
Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp. 721-724.
Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 30-37.
Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.
Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 38-45.
Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.
Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.
Thank you for your attention!