Acoustic Cues to Emotional Speech

Julia Hirschberg (joint work with Jennifer Venditti and Jackson Liscombe)
Columbia University
26 June 2003

Motivation
• A speaker’s emotional state conveys important and potentially useful information
  • To recognize (e.g. Spoken Dialogue Systems, tutoring systems)
  • To generate (e.g. games)
• If we know what emotion is and what aspects of production convey different types
• Defining emotion in multidimensional space
  • Valence: happy vs. sad
  • Activation: sad vs. despairing

Features that might convey emotion
• Acoustic and prosodic
• Lexical and syntactic
• Facial and gestural

Previous Research
• Emotion detection in corpus studies
  • Batliner, Noeth, et al.; Ang et al.: anger/frustration in dialogue systems
  • Lee et al.: pos/neg emotion in call center data
  • Ringel & Hirschberg: voicemail
• … in laboratory studies
  • Forced choice among 10-12 emotion categories
  • Sometimes with confidence rating

Problems
• Hard to identify emotions reliably
  • Variation in ‘emotional’ utterances: production and perception
  • How can we obtain better training data?
• Easier to detect variation in activation than in valence
• Variation in ‘emotional’ utterances
• Large space of potential features
  • Which are necessary and sufficient?

New methods for eliciting judgments
• Hypothesis: Utterances in natural speech may evoke multiple emotions
• Elicit judgments on multiple scales
• Tokens from LDC Emotional Prosody Speech and Transcripts Corpus
  • Professional actors reading 4-syllable dates and numbers
  • disgust, panic, anxiety, hot anger, cold anger, despair, sadness, elation, happiness, interest, boredom, shame, pride, contempt, neutrality

Modified category set:
• Positive: confident, encouraging, friendly, happy, interested
• Negative: angry, anxious, bored, frustrated, sad
• Neutral
• For study: 1 token of each from each of 4 voices plus practice tokens
• Subjects participated over the internet

• 40 native speakers of standard American English with no reported hearing impairment
• 17 female, 23 male, all 18+
• 4 random orders rotated among subjects

Correlations between Judgments

               sad   ang   bor   fru   anx   fri   con   hap   int   enc
sad                  .06   .44   .26   .22  -.27  -.32  -.42  -.32  -.33
angry                      .05   .70   .21  -.41   .02   .37  -.09  -.32
bored                            .14  -.14  -.28  -.17  -.32  -.42  -.27
frustrated                             .32  -.43  -.09  -.47  -.16  -.39
anxious                                     -.14  -.25  -.17   .07  -.14
friendly                                           .44   .77   .59   .75
confident                                                .45   .51   .53
happy                                                          .58   .73
interested                                                           .62
encouraging
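The slides do not say which correlation coefficient was used. As a minimal sketch, assuming Pearson's r and assuming each scale's ratings are stored as one list per emotion in the same judgment order, a pairwise table like the one above could be computed as follows. The names `pearson`, `correlation_table` and the `toy` ratings are hypothetical illustrations, not data or code from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_table(ratings):
    """Map each unordered pair of scales to its Pearson r (upper triangle only).

    `ratings` maps a scale name to a list of ratings, one per judgment,
    with every list in the same judgment order.
    """
    names = list(ratings)
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical toy ratings (NOT data from the study).
toy = {
    "happy":    [5, 4, 1, 1, 2],
    "friendly": [4, 5, 2, 1, 1],
    "sad":      [1, 1, 4, 5, 3],
}
table = correlation_table(toy)
```

Only the upper triangle is stored, mirroring the layout of the slide, since r is symmetric in its two arguments.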

What acoustic features correlate with which emotion categories?
• F0: min, max, mean, ‘range’, stdev
• RMS: min, max, mean, range, stdev
• Voiced samples/all samples (VCD)
• Mean syllable length
• TILT: spectral tilt (second harmonic minus first, over a 30 ms window) of the highest-amplitude vowel and of the nuclear stressed vowel
• Type of nuclear accent, contour, phrasal ending
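The summary statistics in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' extraction pipeline: it assumes an F0 track given as one value per frame with 0 marking unvoiced frames, takes range as max minus min (the slide's quoted ‘range’ for F0 may be defined differently), uses the population standard deviation, and approximates the voiced ratio over frames rather than samples. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize(values):
    """Return min, max, mean, range and (population) stdev of a sequence."""
    values = list(values)
    if not values:
        raise ValueError("cannot summarize an empty sequence")
    lo, hi = min(values), max(values)
    return {
        "min": lo,
        "max": hi,
        "mean": mean(values),
        "range": hi - lo,          # assumption: range = max - min
        "stdev": pstdev(values),   # assumption: population stdev
    }


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames of `frame_len` samples."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [
        sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def voiced_ratio(f0_track):
    """Fraction of frames that are voiced (F0 > 0)."""
    if not f0_track:
        raise ValueError("empty F0 track")
    return sum(1 for f in f0_track if f > 0) / len(f0_track)


def f0_features(f0_track):
    """Summary statistics over the voiced frames only."""
    return summarize(f for f in f0_track if f > 0)


# Hypothetical toy contour in Hz; 0.0 marks an unvoiced frame.
f0_track = [0.0, 110.0, 120.0, 130.0, 0.0, 0.0, 140.0, 100.0]
f0_stats = f0_features(f0_track)
vcd = voiced_ratio(f0_track)
rms_stats = summarize(frame_rms([1.0, -1.0, 1.0, -1.0], 2))
```

In practice the F0 track and the waveform samples would come from a pitch tracker and an audio reader, which are outside the scope of this sketch.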

Results
• F0, RMS and rate distinguish emotion categories by activation (act)
  • +act correlates with higher F0 and RMS and a faster rate
  • but these features do not distinguish valence (val)
• Tilt of highest-amplitude vowel groups +act emotions with different val into different categories (e.g. friendly, happy, encouraging vs. angry, frustrated)
• Phrase accent/boundary tone also separates +val from -val

• H-L% is positively correlated with -val and negatively with +val
• +val is positively correlated with L-L%; -val is not

Predicting Emotion Categories Automatically
• 1760 judgment/token datapoints (90%/10% training/test)
  • collapse 2-5 ratings to one
• Ripper machine learning algorithm
• Baseline: choose most frequent ranking
• Mean performance over all emotions 75% (22% improvement over baseline)
• Individual emotion categories

• Happy, encouraging, sad, and anxious predicted well
• Confident and interested show little improvement
• Which features best predict which emotion categories?
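The evaluation setup described above (collapsing ratings 2-5 into one class and comparing against a most-frequent-class baseline) can be sketched as below. This is a hedged illustration: it assumes a 1-5 rating scale on which 1 means the emotion was not perceived, which the slides imply but do not state, and it does not implement Ripper itself. The function names and toy ratings are hypothetical, and the slides do not say whether the reported 22% improvement is absolute or relative.

```python
from collections import Counter


def collapse_rating(rating):
    """Collapse ratings 2-5 into one 'present' class.

    Assumption: a 1-5 scale on which 1 means the emotion was not perceived.
    """
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return "present" if rating >= 2 else "absent"


def majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    if not train_labels:
        raise ValueError("no training labels")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty, equal-length label lists")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy ratings for a single emotion scale (NOT study data).
train_ratings = [1, 1, 3, 1, 5, 1, 2, 1, 1, 4]
test_ratings = [1, 2, 1, 1]

train = [collapse_rating(r) for r in train_ratings]
test = [collapse_rating(r) for r in test_ratings]

guess = majority_baseline(train)
baseline_acc = accuracy([guess] * len(test), test)
```

A learned classifier's accuracy on the same held-out 10% would then be compared against `baseline_acc`, once per emotion category.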

Best Performing Features

Conclusions
• New features to distinguish valence: spectral tilt and prosodic endings
• New understanding of relations among emotion categories
  • Judgments
  • Features

Current/Future Work
• Use ML to rank rather than classify (RankBoost)
• Eye-tracking task, matching tokens to ‘emotional’ pictures
  • Web survey to ‘norm’ pictures
  • Layout issues
