
Emotional Speech

Emotional Speech. Julia Hirschberg CS 6998. Today. Defining emotional speech Emotional categories Eliciting judgments Producing emotional speech Detecting emotional speech A Subclass: Deceptive speech. Cowie ‘00. Is there a good theoretical or practical definition of emotional speech?


Presentation Transcript


  1. Emotional Speech Julia Hirschberg CS 6998

  2. Today • Defining emotional speech • Emotional categories • Eliciting judgments • Producing emotional speech • Detecting emotional speech • A Subclass: Deceptive speech

  3. Cowie ‘00 • Is there a good theoretical or practical definition of emotional speech? • “Full-blown” emotion vs. emotional state • Cause and effect descriptions • Primary and secondary (second order) • Everyday descriptions • Representations • Biological

  4. Dimensions in continuous space, e.g. • Valence: positive or negative • Activation level: how disposed to take action • Structural models: different ways of appraising situation that evokes emotion • e.g. positive or negative? Does situation help agent to achieve his/her goals? • Timing as a key variable • sadness vs. grief vs. depression vs. gloominess
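The dimensional view on this slide can be sketched as a point in a two-dimensional (valence, activation) space. This is a minimal illustration only: the class name, the value ranges, and the prototype coordinates below are assumptions made for the example, not values from Cowie '00.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class EmotionPoint:
    valence: float     # assumed range: -1.0 (negative) to +1.0 (positive)
    activation: float  # assumed range: 0.0 (passive) to 1.0 (strongly disposed to act)


# Hypothetical prototype locations, chosen only to make the example concrete.
PROTOTYPES = {
    "anger": EmotionPoint(-0.8, 0.9),
    "sadness": EmotionPoint(-0.7, 0.2),
    "joy": EmotionPoint(0.8, 0.8),
    "contentment": EmotionPoint(0.7, 0.2),
}


def nearest_label(point):
    """Return the prototype label closest to `point` in Euclidean distance."""
    def distance(name):
        proto = PROTOTYPES[name]
        return math.dist(
            (point.valence, point.activation),
            (proto.valence, proto.activation),
        )
    return min(PROTOTYPES, key=distance)
```

A point such as `EmotionPoint(-0.6, 0.8)` (negative, highly activated) falls nearest the assumed "anger" prototype, which shows how category labels can be read off a continuous representation.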

  5. How are emotions expressed? • Display rules? In speech? • Mixing • Simulation

  6. Schroeder ‘01: Emotion in Synthesis • How is a given emotion expressed in speech? • What are the properties of the emotion to be expressed? How are they related to those of other emotions? • What kind of synthesizer works best? • Formant • Diphone • Unit selection

  7. Prosody rules: what to modify? • How do we evaluate the results? • Forced choice • Free response • Recognition rate • Perceived naturalness
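For the forced-choice evaluation listed above, a recognition rate per intended emotion can be computed as the fraction of listener responses that match the emotion the synthesizer was asked to express. The sketch below assumes responses are available as (intended, chosen) pairs; the function names and that data layout are illustrative choices, not something the slides specify.

```python
from collections import Counter


def recognition_rates(responses):
    """responses: iterable of (intended_emotion, chosen_emotion) pairs.

    Returns a dict mapping each intended emotion to the fraction of
    responses in which listeners chose that same emotion.
    """
    totals = Counter()
    correct = Counter()
    for intended, chosen in responses:
        totals[intended] += 1
        if intended == chosen:
            correct[intended] += 1
    return {emotion: correct[emotion] / totals[emotion] for emotion in totals}


def chance_level(num_choices):
    """Expected recognition rate if listeners guessed uniformly at random."""
    return 1.0 / num_choices
```

Comparing each rate with `chance_level(k)` for a k-way forced choice gives a rough sense of whether listeners recognized the intended emotion better than guessing.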

  8. Ten Bosch ‘00: Emotion Recognition • How hard is the problem? • Is ‘standard’ ASR technology well-suited to it? • Acoustic and language models target short local events • Feature extraction normalizes/excludes e.g. pitch, rate, amplitude -- why? • Interaction: emotional speech and ASR performance • Synthesis needs one good example but...

  9. Ang et al. • Challenges: • Use output from ASR system • Use automatic prosodic features • Find good speaker normalization • Combine with lexical features • Pioneered approach of “direct modeling” – no use of intermediate phonological units • Applications: detecting frustration, disappointment/tiredness, amusement/surprise • Results: prediction comparable to human accuracy (70-75%)

  10. Method: Prosodic Models • Extract pitch from signal • Speech recognizer outputs word and phone alignments (duration features) • Utterance-level features extracted (e.g., max speaker-normalized pitch in the longest phone-normalized vowel, etc.) • Decision trees created to provide posterior probabilities of emotion classes given features • Feature selection from development test set • Separate test set used for evaluation
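Two steps of this method can be sketched in a few lines: speaker normalization of a pitch feature, and a tree leaf that reports class posteriors. The code below is a simplified stand-in, not the implementation from Ang et al.: it uses a per-speaker z-score as the normalization and a single hand-set split instead of a learned decision tree, and the field names, labels, and threshold are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def speaker_normalize(utterances):
    """utterances: list of dicts with 'speaker', 'max_pitch' (Hz), 'label'.

    Returns a list of (z_score, label) pairs, where each z-score is computed
    against the mean and standard deviation of that speaker's own pitch values.
    """
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt["max_pitch"])
    # Fall back to 1.0 when a speaker's values have zero spread.
    stats = {s: (mean(v), pstdev(v) or 1.0) for s, v in by_speaker.items()}
    normalized = []
    for utt in utterances:
        mu, sd = stats[utt["speaker"]]
        normalized.append(((utt["max_pitch"] - mu) / sd, utt["label"]))
    return normalized


def leaf_posteriors(samples, threshold):
    """One-split stump: estimate P(label | leaf) from label counts per leaf.

    Samples with z > threshold go to the 'high' leaf, the rest to 'low'.
    """
    leaves = {"low": Counter(), "high": Counter()}
    for z, label in samples:
        leaves["high" if z > threshold else "low"][label] += 1
    return {
        name: {label: n / sum(counts.values()) for label, n in counts.items()}
        for name, counts in leaves.items()
        if counts
    }
```

The normalization matters because a raw value such as 200 Hz can be a high pitch for one speaker and a typical pitch for another; after the z-score, the same split applies across speakers.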

  11. Prosodic Features • Duration features • Phone / Vowel / Syllable Durations • Normalized by Phone/Vowel Means, Speaker • Speaking rate features (vowels/time) • Pause features • Speech to pause ratio, number of long pauses • Maximum pause length • Energy features (RMS energy) • Pitch features • Used pitch stylization algorithm (Sonmez et al.) • LTM model of F0 to estimate speaker range • Pitch ranges, slopes, locations of interest • Spectral tilt features • Other (non-prosodic) features • Position of utterance in dialog • Repeat or correction
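Several of the features listed above have simple definitions that can be sketched directly. The example below assumes the speech has already been segmented into labeled speech and pause intervals (for instance from recognizer alignments); the function names, the segment layout, and the 0.5 s threshold for a "long" pause are assumptions for illustration, not values from the slides.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of a non-empty sequence of amplitude samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def speaking_rate(num_vowels, duration_s):
    """Speaking rate as vowels per second (the 'vowels/time' feature)."""
    return num_vowels / duration_s


def pause_features(segments, long_pause=0.5):
    """segments: list of (kind, duration_s), kind being 'speech' or 'pause'.

    `long_pause` is an assumed threshold in seconds.
    """
    speech = sum(d for kind, d in segments if kind == "speech")
    pauses = [d for kind, d in segments if kind == "pause"]
    total_pause = sum(pauses)
    return {
        "speech_to_pause_ratio": speech / total_pause if total_pause else float("inf"),
        "num_long_pauses": sum(1 for d in pauses if d >= long_pause),
        "max_pause": max(pauses, default=0.0),
    }
```

Pitch stylization and the LTM model of F0 are not sketched here, since the slides do not give enough detail to reproduce them.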

  12. Emotion in Deception • Motivation: why might such cues exist? • Deception evokes emotion in deceivers (e.g. Ekman ‘85-92) • Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech • Elation at successful deceiving: higher pitch, faster, louder, greater elaboration

  13. Acoustic/Prosodic/Lexical Cues • Are deceivers less forthcoming? • Shorter speech with fewer details • Are lies less compelling than truths? • Less plausible, logical, more discrepancies • Less verbal and vocal ‘involvement’ • Less verbal ‘immediacy’: more passives, negations, indirect speech • More uncertainty (subjective) • More repetitions • Are liars less positive, pleasant?

  14. More negative statements, complaints • Are liars more tense? • Nervous overall • Vocal tension • High pitch • Do lies contain fewer ‘imperfections’? • Fewer self-repairs • Fewer admissions of forgetfulness • Fewer scene descriptions, details • More mention of peripheral events or relationships

  15. Current State-of-the-Art • No single cue to deceptive speech: most studied are visual • Other acoustic/prosodic features proposed, but evidence mixed so far • Loudness/intensity • Speaking rate • Response latency • Disfluencies • No attested method to detect deception automatically using acoustic/prosodic/lexical cues • All current findings are descriptive, suggestive • All proposed methods require human intervention

  16. Our Approach • Elicit deceptive and non-deceptive corpus • Motivation: Identity-relevant (self-image) and instrumental (monetary) incentives • “Real” deception vs. acted • Good recording conditions • Tasks/interview paradigm • Transcription/annotation • Acoustic/prosodic/lexical analysis to identify features of interest, test validity of paradigm • Automatic feature extraction and analysis to train models of deceptive and non-deceptive speech

  17. Corpus Collection • Subjects asked to perform tasks for comparison with target profile of 25 top entrepreneurs • Performance manipulated to produce performance same as/differing from target • Monetary incentive to convince an interviewer they matched target • Recorded interview/interrogation • Biographical information (t/f) • “Big lie” on task performance • “Local lie”: Pedal indicators of t/f for each answer

  18. Collection • To date: 15 subjects, totaling ~3h of subject speech • Planned: 7-8 hours of subject speech

  19. Results of Prosodic/Acoustic Analysis • On Arizona Mock Theft data subset: • 32 interviews/72 min, required segmentation, recording issues (50/160 min more being segmented) • Significant pitch feature differences between deceptive and non-deceptive speech, but... • Highly motivated speakers lower pitch when lying • Low motivation speakers raise pitch when lying • Males lower pitch when lying • Females raise pitch when lying

  20. On Columbia corpus: • Preliminary analyses of 8 speakers for ‘local’ t/f • Significant differences in pitch range for six subjects, but differ from Mock Theft with respect to gender • Lexical findings: • Preliminary analyses on Columbia data using LIWC show negative words more prevalent in deceptive speech
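The lexical finding above rests on counting words from a category dictionary. The sketch below shows that kind of count in its simplest form. It does not use LIWC itself: the word set is a tiny hypothetical stand-in, and the tokenization is an assumption.

```python
import re

# Hypothetical stand-in lexicon; this is NOT the LIWC dictionary.
NEGATIVE_WORDS = {"no", "not", "never", "bad", "hate", "wrong"}


def negative_word_rate(text):
    """Fraction of tokens in `text` that appear in NEGATIVE_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in NEGATIVE_WORDS for token in tokens) / len(tokens)
```

Computing this rate separately over segments a subject marked as true and as false would give the kind of per-condition comparison the slide describes.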
