Affective computing course Lecture 5 – Emotional speech recognition
Outline • Speech in affective computing • Vocal expression of emotion • Linear model of speech production • Speech production/Psychoacoustics • Speech technology/feature extraction • Mel-frequency cepstral coefficients • Basic prosodic features • Shimmer/Jitter • Higher formants • Glottal flow • Emotion recognition from speech • Machine learning • State-of-the-art performance summary • Course programming assignment databases and tools
Speech in affective computing • Affective messages in speech • Emotions, moods, feelings • Intentional, elicited • Lexical (explicit) • Affective words, stress, etc. • Paralinguistic (implicit) • Speech prosody • Voice quality • Vocalizations
Vocal expression of emotion • Emotional models for speech • Discrete, dimensional, etc. • Acoustic and prosodic features of emotion • Emotional effects in speech • Perception of emotional speech • Perception tests (semantic truth data) • Forced choice, wide alternatives, free-form, dimensional annotation • Cross-cultural emotion perception • Universality of recognition
Brief overview of speech • Speech production • Linearized source filter model • Voice source and vocal tract • Auditory perception and psychoacoustics • Acoustic speech signals • Pressure and flow (measurement) • Taxonomy of speech • Basic linguistic structures and phonetics • Segmentals and suprasegmentals • Phonemes, prosody, quality, etc.
Prosodic parameters • Pitch and intonation • Quality (vocal source and tract) • Intensity • Duration
Linear model of speech production
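The linear model can be summarized, under the usual textbook assumptions, as a z-domain cascade of glottal source, vocal tract filter, and lip radiation:

```latex
S(z) = G(z)\,V(z)\,L(z), \qquad L(z) \approx 1 - z^{-1}
```

where G(z) is the glottal volume-velocity source, V(z) the vocal tract transfer function, and L(z) the (approximately differentiating) lip radiation.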
Psychoacoustics • Frequency • Mel/Bark scales • Tuning curves • Affected by spectral and temporal composition • Masking • Formant frequency integration (3.5 bark) • Intensity • Exponential functions • Equal loudness contours • Head-related transfer functions (HRTF) • Omnidirectional sounds • NOTE: vocal effort, Lombard effect!
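A widely used analytic form of the mel scale (the exact constants vary slightly between implementations) maps frequency f in Hz to mels as:

```latex
m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
f(m) = 700\left(10^{m/2595} - 1\right)
```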
Speech technology • Segmentation • Active speech level, voiced/unvoiced, phonemes, etc. NOTE Linguistics: segmentals/suprasegmentals • Speech feature extraction • Acoustic features • Spectral features and other transformations • Glottal source parameterization (voice quality) • Formant analysis • Prosodic features (suprasegmentals) • Pitch/F0, Loudness/Intensity, Rhythm/Duration • “Voice quality” NOTE the taxonomical ambiguity!
Segmentation • Active voice level • Energy and frequency measures • Silence, noises, speech • Voiced/Unvoiced • Periodicity (autocorrelation, cepstrum) • Phonation • Higher level segmentation • Model selection/clustering, F0/intensity trends, pauses, phoneme/lexical analysis • Diarization, words, sentences and utterances
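A minimal frame-level sketch of the first two segmentation steps (active level and voiced/unvoiced); the thresholds and frame sizes are illustrative, not the course's actual settings:

```python
import numpy as np

def segment_frames(x, fs, frame_len=0.03, hop=0.01,
                   energy_ratio=0.05, periodicity_thresh=0.3,
                   f0_min=60.0, f0_max=400.0):
    """Label frames as 'silence', 'unvoiced' or 'voiced'.

    Energy relative to the utterance mean separates silence from
    speech; the normalized autocorrelation peak in the F0 lag range
    separates voiced from unvoiced frames. Thresholds are illustrative.
    """
    n, h = int(frame_len * fs), int(hop * fs)
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    ref_energy = np.mean(x ** 2)
    labels = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] * np.hanning(n)
        if np.mean(frame ** 2) < energy_ratio * ref_energy:
            labels.append("silence")
            continue
        # Normalized autocorrelation: a strong peak in the F0 lag
        # range indicates periodic (voiced) speech.
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        ac = ac / (ac[0] + 1e-12)
        peak = np.max(ac[lag_min:lag_max])
        labels.append("voiced" if peak > periodicity_thresh else "unvoiced")
    return labels
```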
Acoustic and prosodic features • Acoustic features • Traditional acoustic features (MFCC, LPC, PLP, etc.) • Simple signal measures (e.g. zero-crossings, HNR) • Other spectral measures (e.g. formants, long-term spectrum) • Prosodic features (suprasegmental properties) • Pitch • Pitch tracker • F0 contour and derivative distributions • Duration • Voiced/unvoiced/silence segmentation • Distributions of segments and segment ratios • Phoneme segmentation • Speech rate • Intensity • FFT and short-segment energy • Energy contours and spectral parameters • Quality • Inverse filtering • Vocal source parameters
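Once the F0, energy and voicing contours are available, many of the prosodic features above reduce to distribution statistics; a minimal sketch (the feature set shown here is illustrative, not the slides' exact set):

```python
import numpy as np

def prosodic_features(f0, energy, labels):
    """Distribution statistics over per-frame contours.

    f0     : per-frame F0 in Hz (0 for unvoiced/silent frames)
    energy : per-frame short-time energy
    labels : per-frame 'voiced'/'unvoiced'/'silence' labels
    """
    f0v = f0[f0 > 0]                          # voiced-frame F0 values
    df0 = np.diff(f0v) if f0v.size > 1 else np.zeros(1)
    n_voiced = labels.count("voiced")
    n_unvoiced = labels.count("unvoiced")
    n_silence = labels.count("silence")
    n = len(labels)
    return {
        # F0 contour and derivative distributions
        "f0_mean": f0v.mean(), "f0_std": f0v.std(), "f0_range": np.ptp(f0v),
        "df0_abs_mean": np.abs(df0).mean(),
        # Intensity/energy contour statistics
        "energy_mean": energy.mean(), "energy_std": energy.std(),
        # Duration-type features from the segmentation
        "voiced_ratio": n_voiced / n,
        "voiced_to_unvoiced": n_voiced / max(n_unvoiced, 1),
        "pause_ratio": n_silence / n,
    }
```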
Features: Mel-cepstrum • Mel-Frequency Cepstral Coefficients (MFCC) • Mel-scale spaced filter bank • Corresponds to the human auditory system (equal perceived pitch increments) • Usually ~12–24 coefficients with a 50% overlapping window • Cepstral mean subtraction for relative features • Delta and delta-delta features possible for sequences
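A minimal MFCC extraction sketch using the librosa package; the file name and frame settings are placeholders:

```python
import numpy as np
import librosa

# Load speech and compute 13 MFCCs on overlapping frames.
y, sr = librosa.load("utterance.wav", sr=16000)         # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=256)   # 50% overlap

# Cepstral mean subtraction: remove the per-utterance mean of each
# coefficient so that features are relative to channel/speaker bias.
mfcc_cms = mfcc - mfcc.mean(axis=1, keepdims=True)

# Delta and delta-delta features capture the temporal dynamics.
delta = librosa.feature.delta(mfcc_cms)
delta2 = librosa.feature.delta(mfcc_cms, order=2)
features = np.vstack([mfcc_cms, delta, delta2])          # (39, n_frames)
```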
Emotional F0 contour examples • [Figure] Same text with different speakers and different emotions • [Figure] Same text and speakers with different emotions
Inverse filtering • Glottal flow and its derivative from a microphone-recorded pressure signal • Uses the source-filter model • E.g. Iterative Adaptive Inverse Filtering (IAIF) • Parameterization (direct or model fitting) • Time-based • OQ, SQ, etc. • Amplitude-based • AC flow, min. flow, etc. • Frequency-domain • FFT, AR • Spectral decay, HNR
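A deliberately simplified single-pass sketch of glottal inverse filtering. This is not the full IAIF iteration: one LPC fit stands in for the vocal tract and a leaky integrator for the radiation cancellation, so part of the glottal spectral tilt leaks into the filter estimate.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def simple_inverse_filter(x, fs, lpc_order=None, leak=0.99):
    """Single-pass glottal inverse filtering sketch (NOT full IAIF)."""
    if lpc_order is None:
        lpc_order = int(fs / 1000) + 2            # common rule of thumb
    # 1) Estimate the vocal tract with linear prediction on the
    #    recorded pressure signal.
    a = librosa.lpc(x, order=lpc_order)
    # 2) Inverse filter: the LPC residual approximates the glottal
    #    flow derivative (lip radiation acts roughly as a differentiator).
    flow_deriv = lfilter(a, [1.0], x)
    # 3) Leaky integration cancels the differentiation and yields an
    #    estimate of the glottal flow itself.
    flow = lfilter([1.0], [1.0, -leak], flow_deriv)
    return flow, flow_deriv
```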
Features: Glottal parameters • open quotient (OQ) • (opening + closing)/total cycle • closed quotient (CQ) • 1 - OQ • closing quotient (ClQ) • closing/total cycle • quasi-open quotient (QOQ) • quasi-open phase/total cycle • speed quotient (SQ) • opening/closing • amplitude quotient (AQ) • (flow amplitude)/(flow derivative amplitude) • normalized amplitude quotient (NAQ) • AQ/total cycle
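Assuming the opening/closing durations and flow amplitudes of a cycle have already been measured from the inverse-filtered signal, the time- and amplitude-based quotients above are simple ratios (QOQ is left out because it needs an additional level-crossing analysis of the flow pulse):

```python
def glottal_quotients(t_open, t_close, T, flow_amp, dflow_min):
    """Time- and amplitude-based quotients for one glottal cycle.

    t_open    : duration of the opening phase (s)
    t_close   : duration of the closing phase (s)
    T         : total cycle length (s)
    flow_amp  : AC amplitude of the glottal flow
    dflow_min : magnitude of the negative peak of the flow derivative
    The opening/closing instants are assumed to be detected beforehand.
    """
    oq  = (t_open + t_close) / T        # open quotient
    cq  = 1.0 - oq                      # closed quotient
    clq = t_close / T                   # closing quotient
    sq  = t_open / t_close              # speed quotient
    aq  = flow_amp / dflow_min          # amplitude quotient (seconds)
    naq = aq / T                        # normalized amplitude quotient
    return {"OQ": oq, "CQ": cq, "ClQ": clq, "SQ": sq, "AQ": aq, "NAQ": naq}
```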
Emotion recognition from speech • Machine learning tools are used frequently • Feature selection and transformations • Sequential floating search (SFFS), principal component analysis (PCA), nonlinear manifold modeling, etc. • Classifiers • Linear discriminant analysis (LDA), k-nearest neighbors (kNN), support vector machines (SVM), hidden Markov models (HMM), neural networks (NN) • Validation and bias • Cross-validation, structural risk minimization (SRM), etc.
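A minimal scikit-learn pipeline in the spirit of the list above (scaling + PCA + SVM with speaker-grouped cross-validation for a speaker-independent estimate); the data here are synthetic placeholders, not any of the course corpora:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GroupKFold

# Placeholder data: 200 utterances, 40 features, 4 emotions, 10 speakers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)
speakers = rng.integers(0, 10, size=200)

clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                    SVC(kernel="rbf", C=10.0, gamma="scale"))

# Grouping folds by speaker avoids the optimistic bias of letting the
# same speaker appear in both training and test sets.
scores = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5),
                         groups=speakers)
print("mean accuracy: %.2f" % scores.mean())
```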
Feature Selection • Sequential Forward/Backward Floating Search • The single best feature initializes the selected set (or all features, for backward search) • Forward step: the feature whose addition improves performance the most is added • Backward (floating) step: an already selected feature is removed if its removal improves performance • The process continues until the desired dimensionality is reached
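A compact sketch of the forward floating search; scoring uses cross-validated accuracy, and the details (tie handling, stopping rule) are simplified relative to full SFFS:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def sffs(X, y, estimator, target_dim, cv=5):
    """Sequential forward floating selection, minimal sketch."""
    def score(idx):
        return cross_val_score(estimator, X[:, idx], y, cv=cv).mean()

    selected, best = [], -np.inf
    while len(selected) < target_dim:
        # Forward step: add the feature that helps the most.
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        selected.append(f_best)
        best = score(selected)
        # Floating (backward) steps: drop a previously selected feature
        # if that improves the score; never drop the one just added.
        improved = True
        while improved and len(selected) > 2:
            improved = False
            for f in [g for g in selected if g != f_best]:
                trial = [g for g in selected if g != f]
                s = score(trial)
                if s > best:
                    selected, best, improved = trial, s, True
                    break
    return selected
```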
Manifold Learning • Specialized data transformation methods • Attempt to retain the topological structure of the data • Nonlinear methods • Neighborhood-connected graphs/trees • Structurally motivated distance metrics • Many methods have been developed (with large mutual similarities) • Global, full-spectral methods • PCA (linear or kernel based) • Isomap (geodesic distance) • Sparse, local-linearity-based methods • LLE (Locally Linear Embedding)
Supervised Isomap • K-nearest neighbor linking • Shortest path calculation (Dijkstra’s algorithm) • Classical multidimensional scaling • Supervised weights/modified divergence function
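A minimal end-to-end sketch of the steps listed above. The supervision used here (inflating between-class distances by a fixed factor before building the neighborhood graph) is a simple stand-in for the actual class weights/modified divergence function:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def supervised_isomap(X, y, n_neighbors=8, n_components=2, alpha=0.3):
    """Minimal supervised Isomap sketch (heuristic class weighting)."""
    y = np.asarray(y)
    d = cdist(X, X)
    same = (y[:, None] == y[None, :])
    d_sup = np.where(same, d, d * (1.0 + alpha))   # supervised weighting

    # k-nearest-neighbor linking: zero entries mean "no edge" in the
    # dense csgraph representation.
    n = len(X)
    graph = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d_sup[i])[1:n_neighbors + 1]
        graph[i, nn] = d_sup[i, nn]
        graph[nn, i] = d_sup[i, nn]                # keep the graph symmetric

    # Geodesic distances via Dijkstra's algorithm
    # (assumes the k-NN graph is connected; otherwise raise n_neighbors).
    G = shortest_path(graph, method="D", directed=False)

    # Classical multidimensional scaling on the geodesic distance matrix.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```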
Nonlinear mapping: Isomap • [Figure: 2-D Isomap embedding of emotional speech on valence/activation axes; classes: neutral, sad, angry, happy]
Embedding performance: Isomap • Emotional speech data • Prosodic/acoustic features • Supervised Isomap • Class weights • Nonlinear divergence measure • Supervised learning • Sequential forward floating search • Parameter grid search • Classifier optimization target • kNN classifier in embedding space • GRNN mapping of data • Validation • Hold-out cross-validation • Person independent • [Figure: recognition accuracy compared with a human reference]
Emotion classification: LDA • Basic emotions: (1) Neutral, (2) Sad, (3) Angry, (4) Happy • [Figure: LDA class regions in the F0 mean vs. voiced segment ratio feature plane]
State-of-the-art methods • Pitch tracker • Autocorrelation is probably the best short-term method • Cepstrum is also acceptable in practice • A better estimate of glottal closures is needed • e.g. waveform matching (time-domain) • Classifier • SVM or neural network • Any classifier that handles nonlinear data will do • Training • Genetic algorithms, floating search • PCA transformation of features seems to help very little; nonlinear methods (e.g. Isomap) work better
State-of-the-art performance • Theoretical performance according to the literature • 60–70% in an automatic, speaker-independent, limited emotion case (discrimination) • Neutral, sad, happy, angry • 55–70% for a human reference in non-limited recognition of basic emotions in a multicultural context • Neutral, sad, happy, angry, disgusted, surprised, fearful • In practice • 40–90+% depending on the scenario constraints, sample size, quality, number of emotions, and available features
Available databases, speech • Emotional speech databases • MediaTeam emotional speech corpus (easy) • Finnish, acted (stereotypical), unimodal, basic emotions (4–7 discrete classes), proprietary • Hytke (challenging) • Finnish, spontaneous (emphasized), multimodal, SAM scale (dimensional annotation: valence, activation) • Speech, facial video, heart rate (RR), eye tracking, posture (via eye tracking)
The MediaTeam Emotional Speech Corpus • The MediaTeam Emotional Speech Corpus is a large database of simulated emotional speech in continuous spoken Finnish • 14 professional actors (8 men, 6 women, aged between 25 and 50) were recruited to simulate the basic emotions: neutral, sadness, happiness, anger • The speech material was a semantically neutral text dealing with the nutritional value of the Finnish crowberry: about one minute long, some 100 words, phonetically rich • Recorded in an anechoic room, 44.1 kHz, mono, 16 bits
Listening tests • 50 Finnish 8th–9th grade school students • Male and female • Random-order forced-choice test; emotion discrimination • Weekly sessions over 2 months • Accuracy 57–93%, 76.9% on average
Listener/Actor performances • [Figures: recognition performance per listener and per actor]
Speech tools (GPL licenses) • Praat • Phonetics toolbox/software for prosody and general speech analysis • http://www.fon.hum.uva.nl/praat/ • Voicebox • MATLAB toolbox for speech processing, feature extraction, etc. • http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html • TKK Aparat • Inverse filtering and voice source parameterization MATLAB toolkit (needs a runtime) • http://sourceforge.net/projects/aparat/
Bibliography • Airas M (2008) TKK Aparat: An environment for voice inverse filtering and parameterization. Logopedics Phoniatrics Vocology 33(1): 49–64. • Alku P (2011) Glottal inverse filtering analysis of human voice production – A review of estimation and parameterization methods of the glottal excitation and their applications. Sadhana 36(5): 623–650. • Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences 17: 97–110. • Bosch L ten (2003) Emotions, speech and the ASR framework. Speech Communication 40(1–2): 213–225. • Fant G, Liljencrants J & Lin Q (1985) A four-parameter model of glottal flow. STL-QPSR 4: 1–13. • Scherer KR (2003) Vocal communication of emotion: A review of research paradigms. Speech Communication 40: 227–256. • Tenenbaum JB, de Silva V & Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319–2323. • Ververidis D & Kotropoulos C (2006) Emotional speech recognition: Resources, features, and methods. Speech Communication 48: 1162–1181.