
Speaker-Dependent Audio-Visual Emotion Recognition





  1. Speaker-Dependent Audio-Visual Emotion Recognition. Sanaul Haq and Philip J.B. Jackson, Centre for Vision, Speech and Signal Processing, University of Surrey, UK. Outline: Introduction, Method, Audio-visual experiments, Summary.

  2. Overview: Introduction, Method, Audio-visual experiments, Summary.

  3. Introduction
  • Human-to-human interaction: emotions play an important role by allowing people to express themselves beyond the verbal domain.
  • A message carries linguistic information (the textual content) and paralinguistic information (body language, facial expressions, pitch and tone of voice, health and identity) [Busso et al., 2007].
  • Audio-based emotion recognition: Borchert et al., 2005; Lin and Wei, 2005; Luengo et al., 2005.
  • Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.
  • Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.

  4. Method
  • Recording of audio-visual database
  • Feature extraction
  • Feature selection
  • Feature reduction
  • Classification

  5. Audio-visual database
  • Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.
  • 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.
  • Size of database: 480 sentences, 120 utterances per subject.
  • For visual features, the face was painted with 60 markers.
  • 3dMD dynamic face capture system [3dMD capture system, 2009].
  Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).

  6. Feature extraction
  Audio features:
  • Pitch features: min., max., mean and standard deviation of Mel-frequency pitch and its first-order difference.
  • Energy features: min., max., mean, standard deviation and range of energy in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
  • Duration features: voiced speech, unvoiced speech and sentence durations, voiced-to-unvoiced speech duration ratio, speech rate, voiced-speech-to-sentence duration ratio.
  • Spectral features: mean and standard deviation of 12 MFCCs.
  Visual features:
  • Mean and standard deviation of the 2D facial marker coordinates (60 markers).
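As an illustration (not from the slides), the per-utterance summary statistics used for the pitch and energy features could be computed as follows; the pitch-track values here are hypothetical:

```python
import numpy as np

def summary_stats(track):
    """Min., max., mean and standard deviation of a per-frame track,
    as used for the pitch and energy features."""
    track = np.asarray(track, dtype=float)
    return {"min": track.min(), "max": track.max(),
            "mean": track.mean(), "std": track.std()}

# Hypothetical pitch track in Hz (voiced frames only).
f0 = [110.0, 115.0, 120.0, 118.0, 112.0]
feats = summary_stats(f0)           # statistics of the track itself
dfeats = summary_stats(np.diff(f0)) # statistics of its first-order difference
```

The same function would be applied to the frame energies in each frequency band, and to the per-frame marker coordinates for the visual features.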

  7. Feature extraction
  Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.

  8. Feature selection
  Plus l-Take Away r algorithm [Chen, 1978]:
  • Combines the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms: at each stage, l features are added by SFS and r are removed by SBS.
  • Selection criterion: the Bhattacharyya distance between class distributions.
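A minimal sketch of the Plus l-Take Away r loop, not the authors' implementation: for simplicity it scores a subset by summing univariate Bhattacharyya distances per feature for two Gaussian classes, rather than the full multivariate criterion:

```python
import numpy as np

def bhattacharyya(x1, x2):
    """Univariate Bhattacharyya distance between two Gaussian classes."""
    m1, m2 = x1.mean(), x2.mean()
    v1, v2 = x1.var(), x2.var()
    v = 0.5 * (v1 + v2)
    return 0.125 * (m1 - m2) ** 2 / v + 0.5 * np.log(v / np.sqrt(v1 * v2))

def plus_l_take_away_r(X1, X2, k, l=2, r=1):
    """Select k feature indices: l greedy forward (SFS) steps,
    then r backward (SBS) steps, repeated until k are kept."""
    def score(S):
        return sum(bhattacharyya(X1[:, j], X2[:, j]) for j in S)
    selected, remaining = [], list(range(X1.shape[1]))
    while len(selected) < k:
        for _ in range(l):   # SFS: add the feature that most improves the score
            if not remaining:
                break
            best = max(remaining, key=lambda j: score(selected + [j]))
            selected.append(best); remaining.remove(best)
        for _ in range(r):   # SBS: drop the feature whose removal hurts least
            if len(selected) > 1:
                worst = max(selected,
                            key=lambda j: score([s for s in selected if s != j]))
                selected.remove(worst); remaining.append(worst)
    return selected[:k]
```

With synthetic two-class data where only the first feature separates the classes, the procedure keeps that feature.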

  9. Feature reduction
  • Principal Component Analysis (PCA): projects the features onto the directions of maximum variance.
  • Linear Discriminant Analysis (LDA): finds the projection that maximises the between-class distance relative to the within-class distance.
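A compact sketch of both reductions under the usual scatter-matrix formulation (an illustration, not the slide's exact procedure): PCA from the SVD of the centred data, LDA from the eigenvectors of Sw⁻¹Sb:

```python
import numpy as np

def pca(X, k):
    """Project onto the k directions of maximum total variance."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def lda_directions(X, y, k):
    """LDA directions: eigenvectors of Sw^-1 Sb, i.e. the projections
    maximising between-class relative to within-class scatter."""
    mu, d = X.mean(axis=0), X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * diff @ diff.T          # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:k]]
```

On data whose class means differ only along the first axis, the leading LDA direction aligns with that axis even when the within-class variance is larger elsewhere, which is the property PCA lacks.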

  10. Classification
  Gaussian classifier: uses Bayes decision theory, where the class-conditional probability density p(x|ω_i) is assumed to be Gaussian for each class ω_i. The Bayes decision rule assigns x to class ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i, where the posterior probability P(ω_i|x) ∝ p(x|ω_i) P(ω_i), and P(ω_i) is the prior class probability.
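The rule above can be sketched directly (a minimal illustration with a full covariance per class; the slide does not specify the covariance structure used):

```python
import numpy as np

def fit_gaussian_classifier(X, y):
    """Per-class Gaussian: (mean, covariance, prior) for each class."""
    return {c: (X[y == c].mean(axis=0),
                np.cov(X[y == c], rowvar=False),
                np.mean(y == c))
            for c in np.unique(y)}

def classify(model, x):
    """Bayes decision rule: argmax over classes of
    log p(x|class) + log P(class) (an unnormalised log posterior)."""
    scores = {}
    for c, (mu, cov, prior) in model.items():
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        scores[c] = (-0.5 * diff @ np.linalg.solve(cov, diff)
                     - 0.5 * logdet + np.log(prior))
    return max(scores, key=scores.get)
```

Because the normalising term p(x) is common to all classes, comparing these unnormalised log posteriors gives the same decision as comparing P(ω_i|x).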

  11. Human evaluation experiments
  • Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.
  • 10 subjects: 5 native speakers, 5 non-native; half of the evaluators were female.
  • 10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962].
  • Training: three facial expression pictures, two audio files and a short movie clip for each emotion.
  • Emotion groups: Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).

  12. Experiments
  • Modalities: audio, visual and audio-visual.
  • Decision-level fusion [Busso et al., 2004; Wang et al., 2005; Zeng et al., 2007; Pal et al., 2006]:
  • Overcomes the problem of the different time scales and metric levels of the audio and visual data.
  • Avoids the high dimensionality of the concatenated vector that results from feature-level fusion [Haq, Jackson & Edge, 2008].
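A hedged sketch of decision-level fusion: combine the per-class posteriors from the audio and visual classifiers with a weighted product rule (the slides do not state the exact combination rule used; the posterior values and the stream weight alpha below are hypothetical):

```python
import numpy as np

def fuse_decisions(audio_post, visual_post, alpha=0.5):
    """Decision-level fusion by a weighted product of stream posteriors,
    computed as a weighted sum of log posteriors; alpha weights the
    audio stream and would be tuned on held-out data."""
    fused = {c: alpha * np.log(audio_post[c])
                + (1 - alpha) * np.log(visual_post[c])
             for c in audio_post}
    return max(fused, key=fused.get)

# Hypothetical posteriors over three emotion classes.
audio = {"anger": 0.6, "happiness": 0.3, "neutral": 0.1}
visual = {"anger": 0.2, "happiness": 0.7, "neutral": 0.1}
```

Because each stream is classified independently, the audio and visual features never have to share a time scale or be concatenated into one high-dimensional vector, which is the advantage noted above.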

  13. Results
  Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.

  14. Results
  Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.

  15. Results: 7 emotion classes.

  16. Results: 4 emotion classes.

  17. Summary
  Conclusions:
  • For the British English emotional database, speaker-dependent recognition rates comparable to human performance were achieved.
  • LDA outperformed PCA with the top 40 features.
  • Both audio and visual information were useful for emotion recognition.
  • Important audio features: energy and MFCCs.
  • Important visual features: mean of the marker y-coordinates (vertical movement of the face).
  • A possible reason for the system's higher performance: the human evaluators were not given equal opportunity to exploit the data.
  Future work:
  • Perform speaker-independent experiments, using more data and other classifiers such as GMM and SVM.

  18. References
  Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.
  Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.
  Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech: Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147-151.
  Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.
  Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.
  Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.
  Edwards, A.L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.
  3dMD 4D Capture System. Online: http://www.3dmd.com, accessed 3 May 2009.

  19. References
  Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.
  Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int'l Conf. on Machine Learning and Cybernetics, pp. 4898-4901.
  Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.
  Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp. 721-724.
  Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 30-37.
  Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.
  Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 38-45.
  Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.
  Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.

  20. Thank you for your attention!
