
Cues to Emotion: Language

Cues to Emotion: Language. Suzanne Yuen, Monday Oct 5, 2009, COMS 6998. Overview: Two-Stream Emotion Recognition for Call Center Monitoring; Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis.



  1. Cues to Emotion: Language Suzanne Yuen Monday Oct 5, 2009 COMS 6998

  2. Overview • Two-Stream Emotion Recognition for Call Center Monitoring • Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

  3. Two Stream Emotion Recognition for Call Center Monitoring • Background: To aid supervisors in evaluating agents at call centers* • Objective: To present a two-stream processing technique for detecting strong emotion • Previous Work: • Fernandez categorized affect into four main components: intonation, loudness, rhythm, and voice quality • Yang studied feature selection methods in text categorization and suggested that information gain should be used • Petrushin and Yacoub examined agitation and calm states in people-machine interaction *A typical medium-sized call center receives about 100,000 calls per day

  4. Two-Stream Recognition • Semantic Stream • Performed speech-to-text conversion • Text classification algorithms identified phrases such as “pleasure,” “thanks,” “useless,” and “disgusting.” • Acoustic Stream • Extracted features based on pitch and energy • Trained on 900 calls, ~60 hrs of speech • Vocabulary system of more than 10,000 words • TF-IDF scheme = Term Frequency – Inverse Document Frequency
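As a rough illustration of the TF-IDF scheme named on this slide, the sketch below weights each word by how often it occurs in an utterance and how rare it is across utterances. It uses the common textbook form tf × log(N / df); the slides do not say which variant the paper used, and the function name `tf_idf` and the toy utterances are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: plain tf * log(N / df) weighting, which may differ
    from the variant used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical utterances built from the example words on the slide.
utterances = [
    "thanks it was a pleasure".split(),
    "this service is useless".split(),
    "thanks this is disgusting".split(),
]
weights = tf_idf(utterances)
```

In this toy data "pleasure" occurs in only one utterance, so it receives a higher weight than "thanks", which occurs in two. A word that occurs in every utterance gets weight zero.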

  5. Implementation • Method: • Two streams analyzed separately: • speech utterance/acoustic features • spoken text/semantics/speech recognition of conversation • Confidence levels of two streams combined • Examined 3 emotions • Neutral • Hot-anger • Happy • Tested two data sets: • LDC data • 20 real-world call-center calls
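The slide states only that the confidence levels of the two streams are combined, not how. One simple possibility is a weighted average of per-emotion confidences followed by picking the highest score, sketched below. The function `combine_streams`, the `acoustic_weight` parameter, and the example confidence values are all hypothetical and are not taken from the paper.

```python
EMOTIONS = ("neutral", "hot-anger", "happy")


def combine_streams(acoustic, semantic, acoustic_weight=0.5):
    """Fuse per-emotion confidences from the two streams.

    Assumption: a weighted average; the paper's actual combination
    rule is not given in the slides.
    """
    combined = {
        emotion: acoustic_weight * acoustic[emotion]
        + (1.0 - acoustic_weight) * semantic[emotion]
        for emotion in EMOTIONS
    }
    # The predicted label is the emotion with the highest fused score.
    label = max(combined, key=combined.get)
    return label, combined


# Made-up confidences for a single utterance.
acoustic = {"neutral": 0.5, "hot-anger": 0.3, "happy": 0.2}
semantic = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, scores = combine_streams(acoustic, semantic)
```

With equal weights, the semantic evidence for hot-anger outweighs the acoustic preference for neutral in this example. Setting `acoustic_weight=1.0` reduces the rule to the acoustic stream alone.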

  6. Two Stream - Conclusion • Table 2 suggested that two-stream analysis is more accurate than acoustic or semantic analysis alone • Recognition on LDC data was significantly higher than on real-world data • Neutral emotion was recognized less accurately • Combining the two streams improved identification of “happy” and “anger” emotions by about 20% • Low acoustic stream accuracy may be attributed to the length of sentences in real-world data: people do not normally show markedly different emotions over long sentences

  7. Discussion • Gupta analyzed three emotions (happy, neutral, hot-anger): Why break it down into these categories? Implications? Can this technique be applied to a wider range of emotions? For other applications? • Speech-to-text conversion may not transcribe the complete conversation. Would further examination greatly improve results? What are the pros and cons? • The pitch range was 50-400 Hz. The research may not be applicable outside this range. Do you think it is necessary to examine other frequencies? • In this paper, the TF-IDF (Term Frequency – Inverse Document Frequency) technique is used to classify utterances. Accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them give better results? What are the pros and cons of using the TF-IDF technique?

  8. Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • Previous work: • 1995; Mozziconacci suggested that VQ combined with f0 could create affect • 2002; Gobl suggested synthesized stimuli with VQ can add affective coloring. Study suggested that “VQ + f0” stimuli are more affective than “f0 only” • 2003; Gobl tested VQ with large f0 range. Did not examine contribution of affect-related f0 contours • Objective: To examine effects of VQ and f0 on affect expression

  9. Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • 3 series of stimuli of a Swedish utterance – “jaadjo”: • Stimuli exemplifying VQ • Stimuli with modal voice quality with different affect-related f0 contours • Stimuli combining both • Tested parameters exemplifying 5 voice qualities (VQ): • Modal voice • Breathy voice • Whispery voice • Lax-creaky voice • Tense voice • 15 synthesized stimuli test samples (see Table 1)

  10. What is Voice Quality? Phonation Gestures • Derived from a variety of laryngeal and supralaryngeal features • Adductive tension: interarytenoid muscles adduct the arytenoid cartilages • Medial compression: adductive force on vocal processes; adjustment of ligamental glottis • Longitudinal tension: tension of vocal folds

  11. Tense Voice • Very strong tension of vocal folds, very high tension in vocal tract

  12. Whispery Voice • Very low adductive tension • Medial compression moderately high • Longitudinal tension moderately high • Little or no vocal fold vibration • Turbulence generated by friction of air in and above larynx

  13. Creaky Voice • Vocal fold vibration at low frequency, irregular • Low tension (only ligamental part of glottis vibrates) • The vocal folds strongly adducted • Longitudinal tension weak • Moderately high medial compression

  14. Breathy Voice • Tension low • Minimal adductive tension • Weak medial compression • Medium longitudinal vocal fold tension • Vocal folds do not come together completely, leading to frication

  15. Modal Voice • “Neutral” mode • Muscular adjustments moderate • Vibration of vocal folds periodic, full closing of glottis, no audible friction • Frequency of vibration and loudness in low to mid range for conversational speech

  16. Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • Six sub-tests with 20 native speakers of Hiberno-English. • Rated on 12 different affective attributes: • Sad – happy • Intimate – formal • Relaxed – stressed • Bored – interested • Apologetic – indignant • Fearless – scared • Participants were asked to mark their response on a scale for each pair (e.g. from “intimate” to “formal”), which also included “no affective load”

  17. Voice Quality and f0 Test: Conclusion • Categorized results into 4 groups. No simple one-to-one mapping between quality and affect • “Happy” was most difficult to synthesize • Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech. VQ appears to be crucial for expressive synthesis

  18. Voice Quality and f0 Test: Discussion • If the scale runs from 1 to 7, then the midpoint, 4, should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig 2) seem strong? • In terms of VQ and f0, the groupings in Fig 2 seem to suggest that certain affects are closely related. What are the implications of this? For example, are happy and indignant affects closer than relaxed or formal? Do you agree? • Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper? • Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can the range of voice quality be compared with frequency? Do you think they are comparable? Is there a different way to describe these qualities?

  19. Questions?
