
WP3 speech and emotion (analysis & recognition)





  1. Human Language Technologies (HLT) WP3: Speech and Emotion (Analysis & Recognition)

  2. Databases and Annotations

  3. UERLN: SYMPAFLY • Fully automatic speech dialogue telephone system for flight reservation and booking, recorded at different system stages; 270 dialogues. • Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success. • The distribution of emotional user states follows a nested Pareto (80/20) principle (illustrated below).
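A minimal sketch of what checking that nested Pareto property could look like, assuming word-level label strings; `pareto_layers` is a hypothetical helper, not part of the WP3 tooling:

```python
# Hypothetical helper: cumulative shares of emotional-state labels, sorted
# by frequency. "Nested Pareto (80/20)" then means the top ~20% of states
# cover ~80% of the tokens, and the tail again splits roughly 80/20.
from collections import Counter

def pareto_layers(labels):
    counts = Counter(labels).most_common()              # states, most frequent first
    total = sum(c for _, c in counts)
    layers, cum = [], 0
    for state, c in counts:
        cum += c
        layers.append((state, c / total, cum / total))  # (state, share, cumulative)
    return layers
```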

  4. UERLN: AIBO • Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY's AIBO robot in a Wizard-of-Oz scenario; cf. WP5 (plus English and read speech) • Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0; labelling of voice quality. Emotional user states also annotated for the English data.

  5. AIBO disobedient: from motherese to angry (word labels, presumably M = motherese, E = emphatic, A = angry; English gloss on the next slide)
g'radeaus Aibolein ja M fein M gut M machst M du M *da M | *tz läufst du mal bitte nach links | stopp E Aibo stopp | nach links E umdrehen | nein M <*ne> nein M <*ne> nein M <*ne> so M weit M *simma M noch M nicht M aufstehen M Schlafmütze M komm M hoch M | ja M so M ist M es M <*is> guter M Hund M lauf mal jetzt nach links | nach links Aibo | Aibolein M aufstehen M *son M sonst M werd' M ich M böse M hoch E | nach A links A | Aibo A nach A links A | Aibolein A ganz A böser A Hund A jetzt A stehst A du A auf A | hoch A | dreh dich ein bisschen | ja M so ist es <*is> gut stopp Aibo stopp | *tz lauf g'radeaus |

  6. UERLN: Different Conceptualizations (English gloss of the previous slide, contrasting two user conceptualizations of the robot)
Pet dog: Straight on little Aibo ok great you're doing fine | now please to the left | stop Aibo stop | turn to the left | no no no we aren't that far yet get up sleepyhead get up | yes that's a good dog now go left | left Aibo | little Aibo get up else I'm getting angry get up | Aibo left | little Aibo bad boy now get up | turn a little | ok that's fine stop Aibo stop | straight on
Remote control tool: Aibo straight on | stop Aibo stop | turn round to the left Aibo | get up | turn round to the left Aibo | get up | turn round, to the left Aibo | get up | get up Aibo | now go left | now straight on Aibo | st' straight on

  7. ITC: Targhe • Fully automatic speech dialogue telephone system • 15.6 hours of natural Italian speech • 9444 files (turns), of which 450 are emotionally rich • Word level: orthographic transcription and word segmentation; prosodic peculiarities annotated • Turn level: holistic emotion labels • SympaFly (cf. UERLN) used for comparison and benchmarking

  8. UKA: LDC2002S28 • Elicited emotional speech database; native speakers of American English • Labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature-set development

  9. UKA: ISL Meeting Corpus • 18 recordings of multi-party meetings (mean 5.1 participants, mean duration 35 minutes); American English • Annotations: orthographic transcription, Verbmobil II annotations, and discourse-level annotations

  10. Assessment of Data Collection • Focus on: spontaneous, realistic data; important/new types of dialogues/interaction; evaluation of annotations • The collection represents a considerable percentage of the realistic (processed and available) databases world-wide

  11. Features & Classification

  12. UERLN: Features • Large feature vector computed over a context of ± 2 words (a sketch follows below): • ≈ 95 prosodic (duration, energy, F0, pauses) • ≈ 80 spectral (HNR, formant-based frequencies and energies) • 24 MFCC • ≈ 30 POS (part of speech) • Language models & dialogue-based features
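As an illustration of word-level prosody with left/right context: a minimal sketch, not the actual UERLN extractor; the five features and the librosa-based F0/energy estimation are assumptions.

```python
# Hypothetical word-level prosodic features (duration, energy, F0) stacked
# over a +/- 2-word context window, in the spirit of the list above.
import numpy as np
import librosa

def word_prosody(y, sr, start, end):
    """Five simple prosodic features for one word segment (start/end in seconds)."""
    seg = y[int(start * sr):int(end * sr)]
    rms = librosa.feature.rms(y=seg)[0]                 # frame-wise energy
    f0, voiced, _ = librosa.pyin(seg, fmin=75, fmax=500, sr=sr)
    f0 = f0[~np.isnan(f0)]                              # keep voiced frames only
    return np.array([
        end - start,                                    # duration in seconds
        rms.mean(), rms.max(),                          # energy statistics
        f0.mean() if f0.size else 0.0,                  # mean F0 (Hz)
        np.ptp(f0) if f0.size else 0.0,                 # F0 range (Hz)
    ])

def context_vector(y, sr, words, i, ctx=2):
    """Features of word i and its +/- ctx neighbours; zero-padded at utterance edges."""
    vecs = [word_prosody(y, sr, *words[j]) if 0 <= j < len(words) else np.zeros(5)
            for j in range(i - ctx, i + ctx + 1)]
    return np.concatenate(vecs)                         # 5 words x 5 features = 25 dims
```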

  13. ITC: Features • Baseline feature set: 96 features based on energy, duration, and pitch • Final feature set: 273 features (many redundant) based on energy, duration, pitch, and pauses • Different pitch extractors tried: normalized cross-correlation, weighted autocorrelation, the UERLN PDA; different subsets compared (a sketch of the first extractor follows below) • Different tests to reduce the feature space, e.g. principal component analysis
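A minimal sketch of the normalized cross-correlation idea behind the first extractor; the lag range and the absence of a voicing decision are simplifications, not the ITC implementation.

```python
# Toy normalized cross-correlation (NCC) pitch estimator for one frame:
# pick the lag whose NCC with the shifted frame is highest. Real extractors
# add voicing decisions, octave-error handling, and smoothing over frames.
import numpy as np

def ncc_pitch(frame, sr, fmin=75.0, fmax=500.0):
    frame = frame - frame.mean()                 # remove DC offset
    lo, hi = int(sr / fmax), int(sr / fmin)      # candidate lags in samples
    best_lag, best_ncc = lo, -1.0
    for lag in range(lo, min(hi, len(frame) - 1)):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        ncc = np.dot(a, b) / denom if denom > 0 else 0.0
        if ncc > best_ncc:
            best_lag, best_ncc = lag, ncc
    return sr / best_lag                         # F0 estimate in Hz
```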

  14. UKA: 133 Acoustic Features • pitch, voiced/unvoiced energy, quartiles (15) • voice quality, Praat metrics (11) • harmonicity, quartiles (5) and Praat metrics (3) • zero-crossing rate vs. energy, histogram (20) • correlation/regression coefficients (36) • vocal tract volume, quartiles (25) • duration/timing, Verbmobil features (18)
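For illustration, one of these groups rebuilt under assumptions: a joint zero-crossing-rate vs. energy histogram flattened to 20 features. The 5 × 4 binning is a guess to match the feature count; the actual UKA binning is not specified here.

```python
# Sketch of the "zero-crossing rate vs. energy, histogram (20)" group: a
# joint 2-D histogram of frame-wise ZCR and RMS energy, flattened.
import numpy as np
import librosa

def zcr_energy_histogram(y, zcr_bins=5, energy_bins=4):
    zcr = librosa.feature.zero_crossing_rate(y)[0]   # frame-wise ZCR
    rms = librosa.feature.rms(y=y)[0]                # frame-wise energy
    hist, _, _ = np.histogram2d(zcr, rms, bins=(zcr_bins, energy_bins))
    return (hist / hist.sum()).ravel()               # normalized, 5 x 4 = 20 values
```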

  15. Classifiers • UERLN: Linear Discriminant Analysis (LDA), Decision Trees (CARTs), Neural Networks (NN), Support Vector Machines (SVM), Gaussian Mixtures (GM), Language Models (LM) • ITC: Decision Trees (CARTs), Neural Networks (NN) • UKA: linear classifiers, Neural Networks (NN), Support Vector Machines (SVM)
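A rough side-by-side of the main classifier families in scikit-learn terms, defaults only; the partners used their own implementations and tuning, and the GM and LM components are omitted here.

```python
# Hedged sketch: four of the classifier families above under a common
# cross-validation; Gaussian Mixtures and Language Models are left out.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "LDA":  LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(),
    "NN":   MLPClassifier(max_iter=1000),
    "SVM":  SVC(),
}

def compare(X, y):
    for name, clf in CLASSIFIERS.items():
        scores = cross_val_score(clf, X, y, cv=5)    # 5-fold accuracy
        print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```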

  16. UERLN Classification I: SympaFly • GM/NN, 2 classes (neutral vs. problem), learn ≠ test • LDA, 4 classes • SVM/CART, 2 classes, leave-one-out (loo) • dialogue step success, 2 classes, SVM: CL 82.5 • dialogue success, 2 classes, CART: CL 85.4 • RR: overall recognition rate; CL: class-wise averaged recognition rate
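In scikit-learn terms, the two figures of merit map onto accuracy and balanced accuracy (mean per-class recall); a minimal sketch:

```python
# RR = overall recognition rate (plain accuracy); CL = class-wise averaged
# recognition rate (mean per-class recall, i.e. balanced accuracy).
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def rr_cl(y_true, y_pred):
    rr = accuracy_score(y_true, y_pred)
    cl = balanced_accuracy_score(y_true, y_pred)
    return rr, cl
```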

  17. UERLN Classification II: AIBO • full label set: joyful, surprised, motherese, neutral (default), rest (non-neutral), bored, helpless/hesitant, emphatic, touchy (= irritated), angry, reprimanding • mapped to 4 cover classes "AMEN", classified with NN
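A sketch of the 4-class "AMEN" mapping suggested by the list, reading AMEN as Angry, Motherese, Emphatic, Neutral; the assignment of the rarer states is an assumption.

```python
# Assumed cover-class mapping: touchy and reprimanding are folded into
# Angry; rare states (joyful, surprised, bored, helpless) are excluded.
AMEN = {
    "angry": "Angry", "touchy": "Angry", "reprimanding": "Angry",
    "motherese": "Motherese",
    "emphatic": "Emphatic",
    "neutral": "Neutral",
}

def to_amen(label):
    return AMEN.get(label)   # None for states excluded from the 4-class task
```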

  18. ITC Classification II: • final feature set: 273 (acoustic/temporal) features • 2-class problem (neutral vs. non-neutral) • RR = overall recognition rate; CL = class-wise averaged recognition rate; N = neutral turns; NN = non-neutral turns

  19. UKA Classification II: 133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers.

  20. Assessment of Features • a pool of many different features/feature groups implemented and compared • prosodic features are better (more consistent) than "spectral" features on realistic speech • combining knowledge sources improves performance • open question: relevance of single features (and feature classes)?

  21. Assessment of Classifications • not much difference in classification performance between different classifiers (linear classifiers are highly competitive in speaker-independent classification) • large differences between speaker-dependent and speaker-independent classification (see the protocol sketch below)
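The speaker-independent condition is commonly evaluated with leave-one-speaker-out cross-validation; a minimal scikit-learn sketch, with the SVC choice being arbitrary:

```python
# Leave-one-speaker-out: no speaker occurs in both training and test data,
# which is what makes the condition "speaker-independent".
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

def speaker_independent_score(X, y, speaker_ids):
    logo = LeaveOneGroupOut()                        # one fold per speaker
    scores = cross_val_score(SVC(), X, y, cv=logo, groups=speaker_ids)
    return scores.mean()
```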

  22. Categories & Dimensions (cf. also tomorrow)

  23. UKA: Meeting Annotation • open-set holistic labelling of 5 meetings by 3 labellers • meeting audio appears to be rich in non-neutral speech

  24. UKA: Towards New Dimensions for Social Interaction in Meetings • dimensions denoting conflict, building community, skepticism, etc. • e.g. power (weak to strong) and support (self to group)

  25. Assessment of Categories & Dimensions • new categories, new dimensions, new consistency measure • prototypical "full-blown" emotions are rare • labels depend on the type of data (call centre, human-robot, different types of multi-party meetings) • new dimensions that model not emotions but the interaction between participants in communication • new entropy-based consistency measure (sketched below)
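A minimal sketch of what an entropy-based consistency measure could look like, as an assumption for illustration; the actual WP3 measure is not given on this slide. Low mean entropy of the labellers' votes per item means high agreement.

```python
# Hypothetical entropy-based consistency: average the entropy of the label
# votes per item across labellers; 0 bits = perfect agreement.
import numpy as np
from scipy.stats import entropy

def mean_label_entropy(votes):
    """votes: per-item label lists, e.g. [["angry", "angry", "neutral"], ...]"""
    ents = []
    for item in votes:
        _, counts = np.unique(item, return_counts=True)
        ents.append(entropy(counts / counts.sum(), base=2))
    return float(np.mean(ents))
```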

  26. Thank you for your attention
