
  1. Affective computing course Lecture – 5 Emotional speech recognition

  2. Outline • Speech in affective computing • Vocal expression of emotion • Linear model of speech production • Speech production/Psychoacoustics • Speech technology/feature extraction • Mel-frequency cepstral coefficients • Basic prosodic features • Shimmer/Jitter • Higher formants • Glottal flow • Emotion recognition from speech • Machine learning • State-of-the-art performance summary • Course programming assignment databases and tools

  3. Speech in affective computing • Affective messages in speech • Emotions, moods, feelings • Intentional, elicited • Lexical (explicit) • Affective words, stress, etc. • Paralinguistic (implicit) • Speech prosody • voice quality • vocalizations

  4. Vocal expression of emotion • Emotional models for speech • Discrete, dimensional, etc. • Acoustic and prosodic features of emotion • Emotional effects in speech • Perception of emotional speech • Perception tests (semantic truth data) • Forced choice, wide alternatives, free-form, dimensional annotation • Cross-cultural emotion perception • Universality of recognition

  5. Dimensional model of emotion: circumplex

  6. Emotional effects in speech

  7. Perception of emotional speech

  8. Brief overview of speech • Speech production • Linearized source filter model • Voice source and vocal tract • Auditory perception and psychoacoustics • Acoustic speech signals • Pressure and flow (measurement) • Taxonomy of speech • Basic linguistic structures and phonetics • Segmentals and suprasegmentals • Phonemes, prosody, quality, etc.

  9. Prosodic parameters • Pitch and intonation • Quality (vocal source and tract) • Intensity • Durations • Linear model of speech production (figure)
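In the linearized source-filter view behind the slide's figure, speech is modeled as a glottal source filtered by the vocal tract and lip radiation; in the z-domain (a standard formulation, not quoted from the slides): S(z) = G(z) · V(z) · R(z), where G(z) is the glottal flow, V(z) the vocal tract transfer function, and R(z) the lip radiation, often approximated by a differentiator R(z) ≈ 1 − z⁻¹.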

  10. Speech production

  11. Formants, vocal tract

  12. Formants: vowels

  13. Psychoacoustics • Frequency • Mel/Bark scales • Tuning curves • Affected by spectral and temporal composition • Masking • Formant frequency integration (3.5 Bark) • Intensity • Exponential functions • Equal loudness contours • Head-related transfer functions (HRTF) • Omnidirectional sounds • NOTE: vocal effort, Lombard effect!
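Both scales are simple closed-form warpings of physical frequency. A minimal sketch using the common O'Shaughnessy mel formula and Traunmüller's Bark approximation (standard textbook formulas, assumed here rather than taken from the slides):

```python
import numpy as np

def hz_to_mel(f):
    """O'Shaughnessy mel formula: equal mel steps roughly match
    equal perceived pitch increments."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_bark(f):
    """Traunmueller's Bark approximation: one Bark is roughly one
    critical band of auditory masking."""
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

print(hz_to_mel(1000.0))   # -> ~1000 mel (the scale's reference point)
print(hz_to_bark(1000.0))  # -> ~8.5 Bark
```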

  14. Speech technology • Segmentation • Active speech level, voiced/unvoiced, phonemes, etc. NOTE Linguistics: segmentals/suprasegmentals • Speech feature extraction • Acoustic features • Spectral features and other transformations • Glottal source parameterization (voice quality) • Formant analysis • Prosodic features (suprasegmentals) • Pitch/F0, Loudness/Intensity, Rhythm/Duration • “Voice quality” NOTE the taxonomical ambiguity!

  15. Segmentation • Active voice level • Energy and frequency measures • Silence, noises, speech • Voiced/Unvoiced • Periodicity (autocorrelation, cepstrum) • Phonation • Higher level segmentation • Model selection/clustering, F0/intensity trends, pauses, phoneme/lexical analysis • Diarization, words, sentences and utterances
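A minimal frame-level sketch of the first two segmentation stages above: energy separates silence from speech, and autocorrelation periodicity separates voiced from unvoiced frames. The thresholds and pitch range are illustrative assumptions, not values from the course:

```python
import numpy as np

def voiced_unvoiced(x, sr, frame_len=0.03, hop=0.01,
                    energy_thresh=1e-4, periodicity_thresh=0.3):
    """Label each frame 'silence', 'unvoiced', or 'voiced'.

    Frame energy gates silence; the strongest normalized autocorrelation
    peak in a plausible pitch-lag range (50-400 Hz) gates voicing."""
    n, h = int(frame_len * sr), int(hop * sr)
    lag_min, lag_max = int(sr / 400), int(sr / 50)
    labels = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] * np.hanning(n)
        if np.mean(frame ** 2) < energy_thresh:
            labels.append("silence")
            continue
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        ac /= ac[0] + 1e-12                      # normalize so ac[0] == 1
        peak = ac[lag_min:min(lag_max, n - 1)].max()
        labels.append("voiced" if peak > periodicity_thresh else "unvoiced")
    return labels
```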

  16. Acoustic and prosodic features • Acoustic features • Traditional acoustic features (MFCC, LPC, PLP, etc.) • Simple signal measures (e.g. zero-crossings, HNR) • Other spectral measures (e.g. formants, long-term spectrum) • Prosodic features (suprasegmental properties) • Pitch • Pitch tracker • F0 contour and derivative distributions • Duration • Voiced/unvoiced/silence segmentation • Distributions of segments and segment ratios • Phoneme segmentation • Speech rate • Intensity • FFT and short segment energy • Energy contours and spectral parameters • Quality • Inverse filtering • Vocal source parameters
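One common way to turn the frame-level contours listed above into a fixed-length feature vector is to summarize each contour with distribution statistics. A sketch of that recipe (the particular statistics are a typical choice, not the course's official feature set):

```python
import numpy as np

def contour_stats(contour, prefix):
    """Summarize a frame-level contour (e.g. F0 or energy) with
    distribution statistics over its defined (nonzero) frames."""
    c = np.asarray(contour, dtype=float)
    c = c[c > 0]                    # drop unvoiced/undefined frames (F0 = 0)
    d = np.diff(c)                  # frame-to-frame derivative
    return {
        f"{prefix}_mean": c.mean(), f"{prefix}_std": c.std(),
        f"{prefix}_min": c.min(), f"{prefix}_max": c.max(),
        f"{prefix}_range": np.ptp(c),
        f"{prefix}_dmean": d.mean(), f"{prefix}_dstd": d.std(),
    }

# A full feature vector might combine several contours plus segment ratios:
# feats = {**contour_stats(f0, "f0"), **contour_stats(energy, "en"),
#          "voiced_ratio": n_voiced / n_frames}
```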

  17. Features: Mel-cepstrum • Mel-frequency cepstral coefficients (MFCC) • Mel-scale spaced filter bank • Corresponds to the human auditory system (equal perceived pitch increments) • Usually ~12-24 coefficients used with a 50% overlapping window • Cepstral mean subtraction for relative features • Delta and delta-delta features possible for sequences
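A minimal extraction matching the recipe above, sketched with the librosa library (the file name is a placeholder; 13 coefficients and a 25 ms window with 50% overlap are typical values within the slide's ranges):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)       # hypothetical input file

# 13 coefficients, 25 ms windows, hop = half the window (50% overlap)
n_fft = int(0.025 * sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=n_fft // 2)

mfcc -= mfcc.mean(axis=1, keepdims=True)           # cepstral mean subtraction
delta = librosa.feature.delta(mfcc)                # first-order deltas
delta2 = librosa.feature.delta(mfcc, order=2)      # delta-deltas
features = np.vstack([mfcc, delta, delta2])        # shape (39, n_frames)
```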

  18. Features: Basic prosodic features

  19. Emotional F0 contour examples • Same text with different speakers and different emotions • Same text and speakers with different emotions

  20. Features: Shimmer/Jitter
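The slide's plots are not reproduced in the transcript. For reference, the usual local definitions are the mean absolute cycle-to-cycle difference of the pitch period (jitter) or peak amplitude (shimmer), relative to the respective mean; a minimal sketch of these standard definitions (assumed, not quoted from the slide):

```python
import numpy as np

def local_jitter(periods):
    """Cycle-to-cycle F0 period perturbation, as a fraction of the mean period."""
    t = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(t))) / np.mean(t)

def local_shimmer(amplitudes):
    """Cycle-to-cycle amplitude perturbation, as a fraction of the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)
```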

  21. Glottal flow, Liljencrants-Fant (LF) model

  22. Inverse filtering • Glottal flow and its derivative from a microphone-recorded pressure signal • Uses the source-filter model • E.g. Iterative Adaptive Inverse Filtering (IAIF) • Parameterization (direct or model fitting) • Time-based • OQ, SQ, etc. • Amplitude-based • AC flow, min. flow, etc. • Frequency-domain • FFT, AR • Spectral decay, HNR
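A heavily simplified, single-pass sketch of the core inverse-filtering step (real IAIF iterates and first removes a coarse glottal estimate before fitting the vocal tract; the file name and LPC order rule of thumb are illustrative assumptions):

```python
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("vowel.wav", sr=16000)     # hypothetical sustained vowel

# All-pole vocal tract estimate; order ~ sr/1000 + 2 is a common rule of thumb
a = librosa.lpc(y, order=int(sr / 1000) + 2)

# Inverse filter with A(z): the residual approximates the glottal flow derivative
dglottal = lfilter(a, [1.0], y)

# Leaky integration approximately cancels the lip-radiation differentiation,
# giving a rough glottal flow estimate
glottal = lfilter([1.0], [1.0, -0.99], dglottal)
```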

  23. Features: Glottal parameters • Open quotient (OQ) • (opening + closing)/total cycle • Closed quotient (CQ) • 1 − OQ • Closing quotient (ClQ) • closing/total cycle • Quasi-open quotient (QOQ) • quasi-open/total cycle • Speed quotient (SQ) • opening/closing • Amplitude quotient (AQ) • (flow ampl.)/(flow deriv. ampl.) • Normalized amplitude quotient (NAQ) • AQ/total cycle
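Once the cycle landmarks are measured, these quotients are plain ratios; a direct transcription of the slide's formulas (QOQ omitted, since it additionally needs the quasi-open time measured at an amplitude threshold):

```python
def glottal_quotients(t_open, t_close, T, ac_flow, d_peak):
    """Glottal cycle quotients from measured landmarks.

    t_open, t_close: opening/closing phase durations; T: cycle length;
    ac_flow: AC flow amplitude; d_peak: magnitude of the flow-derivative minimum.
    """
    oq = (t_open + t_close) / T      # open quotient
    aq = ac_flow / d_peak            # amplitude quotient
    return {
        "OQ": oq,
        "CQ": 1.0 - oq,              # closed quotient
        "ClQ": t_close / T,          # closing quotient
        "SQ": t_open / t_close,      # speed quotient
        "AQ": aq,
        "NAQ": aq / T,               # normalized amplitude quotient
    }
```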

  24. Inverse filtering: real example

  25. Emotion recognition from speech • Machine learning tools used frequently • Feature selection and transformations • Sequential floating search (SFFS), principal component analysis (PCA), nonlinear manifold modeling, etc. • Classifiers • Linear discriminant analysis (LDA), k-nearest neighbors (kNN), support vector machines (SVM), hidden Markov models (HMM), neural networks (NN) • Validation and bias • Cross-validation, structural risk minimization (SRM), etc.
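A minimal sketch of the kind of pipeline these bullets describe, using scikit-learn with an SVM and cross-validation (one combination of the options named above; the data here is a random placeholder standing in for extracted prosodic/acoustic features):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: (n_utterances, n_features) feature matrix, y: emotion class labels
X, y = np.random.randn(200, 24), np.random.randint(0, 4, 200)  # placeholder

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)          # 5-fold cross-validation
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```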

  26. Feature selection • Sequential forward/backward floating search • The single best feature initializes the feature vector (or all features, for the backward variant) • A forward step (adding a feature) is taken when the addition gives the largest performance increase • A backward step (removing a feature) is taken when the removal increases performance • The process continues until the desired dimension is reached
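A simplified sketch of the forward-floating variant (SFFS) under the assumptions that cross-validated accuracy is the performance criterion and that any scikit-learn classifier is supplied; this is an illustration of the algorithm above, not the course's implementation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def sffs(X, y, clf, target_dim):
    """Sequential forward floating search (simplified sketch).

    Alternates a forward step (add the feature whose inclusion scores best)
    with conditional backward steps (drop a feature whenever the reduced
    set beats the best score previously seen at that subset size)."""
    def score(idx):
        return cross_val_score(clf, X[:, idx], y, cv=3).mean()

    selected, best = [], {}          # best score seen per subset size
    while len(selected) < target_dim:
        # forward step
        rest = [f for f in range(X.shape[1]) if f not in selected]
        gains = [score(selected + [f]) for f in rest]
        selected.append(rest[int(np.argmax(gains))])
        k = len(selected)
        best[k] = max(best.get(k, -np.inf), max(gains))
        # conditional (floating) backward steps
        while len(selected) > 2:
            drops = [score([f for f in selected if f != g]) for g in selected]
            j, k = int(np.argmax(drops)), len(selected) - 1
            if drops[j] > best.get(k, -np.inf):
                del selected[j]
                best[k] = drops[j]
            else:
                break
    return selected
```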

  27. SFFS: flowchart

  28. Manifold learning • Specialized data transformation methods • Attempt to retain the topological structure of the data • Nonlinear methods • Neighborhood-connected graphs • Structurally motivated distance metrics • Many methods developed (largely similar) • Global, full-spectral: • PCA (linear or kernel based) • Isomap (geodesic distance) • Sparse, local-linearity based: • LLE (Locally Linear Embedding)

  29. Supervised Isomap • K-nearest neighbor linking • Shortest path calculation (Dijkstra’s algorithm) • Classical multidimensional scaling • Supervised weights/modified divergence function
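A compact sketch of these steps: kNN graph, Dijkstra geodesics, classical MDS. The supervised twist used here, inflating between-class distances by a constant factor, is one common recipe and an assumption, not necessarily the exact weighting used in the course:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def supervised_isomap(X, y, n_neighbors=8, n_components=2, penalty=2.0):
    y = np.asarray(y)
    # 1) pairwise distances, inflated between classes (the supervision step)
    D = squareform(pdist(X))
    D[y[:, None] != y[None, :]] *= penalty
    # 2) k-nearest-neighbor graph; geodesics via Dijkstra's algorithm
    #    (assumes the resulting graph is connected)
    G = np.full_like(D, np.inf)
    for i in range(len(D)):
        nn = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nn] = D[i, nn]
    geo = shortest_path(G, method="D", directed=False)
    # 3) classical MDS on the geodesics (double-centered Gram matrix)
    n = len(geo)
    J = np.eye(n) - 1.0 / n
    B = -0.5 * J @ (geo ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# emb = supervised_isomap(features, labels)   # (n_samples, 2) embedding
```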

  30. Nonlinear mapping: Isomap (figure: 2-D valence-activation embedding of neutral, sad, angry and happy samples)

  31. Embedding performance: Isomap • Emotional speech data • Prosodic/acoustic features • Supervised Isomap • Class weights • Nonlinear divergence measure • Supervised learning • Sequential forward floating search • Parameter grid search • Classifier optimization target • kNN classifier in embedding space • GRNN mapping of data • Validation • Hold-out cross-validation • Person-independent (figure includes a human reference)

  32. Emotion classification: LDA (figure: LDA scatter of F0 mean vs. voiced segment ratio) • Basic emotions: (1) Neutral, (2) Sad, (3) Angry, (4) Happy

  33. State-of-the-art methods • Pitch tracker • Autocorrelation is probably the best short-term method • Cepstrum is also fine in practice • A better estimate of glottal closures is needed • e.g. waveform matching (time-domain) • Classifier • SVM or neural network • Any classifier accepting nonlinear data will do • Training • Genetic algorithms, floating search • PCA transformation of features seems to help very little; nonlinear methods (e.g. Isomap) are better
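A bare-bones autocorrelation F0 estimator for a single voiced frame, in the spirit of the first bullet (peak picking without interpolation and a fixed 50-400 Hz search range are simplifications; Boersma's method in the bibliography adds window correction and octave-cost heuristics):

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate F0 of one voiced frame as the lag of the strongest
    normalized autocorrelation peak within the [fmin, fmax] range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac /= ac[0] + 1e-12                 # normalize so ac[0] == 1
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                     # period in samples -> Hz
```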

  34. State-of-the-art performance • Theoretical performance according to the literature • 60-70% in an automatic, speaker-independent, limited emotion case (discrimination) • Neutral, sad, happy, angry • 55-70% for a human reference in non-limited recognition of basic emotions in a multicultural context • Neutral, sad, happy, angry, disgusted, surprised, fearful • In practice • 40-90+% depending on the scenario constraints, sample size, quality, number of emotions, and available features

  35. Available databases, speech • Emotional speech databases • MediaTeam emotional speech corpus (easy) • Finnish, acted (stereotypical), unimodal, basic emotions (4-7 discrete classes), proprietary • Hytke (challenging) • Finnish, spontaneous (emphasized), multimodal, SAM scale (dimensional annotation: valence, activation) • Speech, facial video, heart rate (RR), eye tracking, posture (via eye tracking)

  36. The MediaTeam Emotional Speech Corpus • The MediaTeam Emotional Speech Corpus is a large database of simulated emotional speech for continuous spoken Finnish • 14 professional actors (8 men, 6 women, aged between 25 and 50) were recruited to simulate basic emotions: neutral, sadness, happiness, anger • The speech material was a semantically neutral text about the nutritional value of the Finnish crowberry: a one-minute, phonetically rich text of some 100 words • Anechoic room, 44.1 kHz, mono, 16 bits

  37. Listening tests • 50 Finnish 8th-9th grade school students • Male and female • Random-order forced-choice test; emotion discrimination • Weekly sessions over 2 months • Accuracy 57-93%, on average 76.9%

  38. Listener/actor performances (figures: per-listener and per-actor results)

  39. Speech tools (GPL licenses) • Praat • Phonetics toolbox/software for prosody and general speech analysis • http://www.fon.hum.uva.nl/praat/ • Voicebox • MATLAB toolbox for speech processing, feature extraction, etc. • http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html • TKK Aparat • Inverse filtering and voice source parameterization MATLAB toolkit (needs runtime) • http://sourceforge.net/projects/aparat/

  40. Tools: Praat

  41. Bibliography • Airas M (2008) TKK Aparat: An environment for voice inverse filtering and parameterization. Logopedics Phoniatrics Vocology 33(1): 49–64. • Alku P (2011) Glottal inverse filtering analysis of human voice production — A review of estimation and parameterization methods of the glottal excitation and their applications. Sadhana 36(5): 623–650. • Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences 17: 97–110. • Bosch L ten (2003) Emotions, speech and the ASR framework. Speech Communication 40(1-2): 213–225. • Fant G, Liljencrants J & Lin Q (1985) A four-parameter model of glottal flow. STL-QPSR 4: 1–13. • Scherer KR (2003) Vocal communication of emotion: A review of research paradigms. Speech Communication 40: 227–256. • Tenenbaum JB, de Silva V & Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319–2323. • Ververidis D & Kotropoulos C (2006) Emotional speech recognition: Resources, features, and methods. Speech Communication 48: 1162–1181.
