Affective computing course Lecture 5 – Emotional speech recognition
Outline • Speech in affective computing • Vocal expression of emotion • Linear model of speech production • Speech production/Psychoacoustics • Speech technology/feature extraction • Mel-frequency cepstral coefficients • Basic prosodic features • Shimmer/Jitter • Higher formants • Glottal flow • Emotion recognition from speech • Machine learning • State-of-the-art performance summary • Course programming assignment databases and tools
Speech in affective computing • Affective messages in speech • Emotions, moods, feelings • Intentional, elicited • Lexical (explicit) • Affective words, stress, etc. • Paralinguistic (implicit) • Speech prosody • Voice quality • Vocalizations
Vocal expression of emotion • Emotional models for speech • Discrete, dimensional, etc. • Acoustic and prosodic features of emotion • Emotional effects in speech • Perception of emotional speech • Perception tests (semantic truth data) • Forced choice, wide alternatives, free-form, dimensional annotation • Cross-cultural emotion perception • Universality of recognition
Brief overview of speech • Speech production • Linearized source filter model • Voice source and vocal tract • Auditory perception and psychoacoustics • Acoustic speech signals • Pressure and flow (measurement) • Taxonomy of speech • Basic linguistic structures and phonetics • Segmentals and suprasegmentals • Phonemes, prosody, quality, etc.
Prosodic parameters • Pitch and intonation • Quality (vocal source and tract) • Intensity • Duration
Linear model of speech production
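The linear model can be summarized, under the usual textbook assumptions, as a z-domain cascade of glottal source, vocal tract filter, and lip radiation:

```latex
S(z) = G(z)\,V(z)\,L(z), \qquad L(z) \approx 1 - z^{-1}
```

where G(z) is the glottal volume-velocity source, V(z) the vocal tract transfer function, and L(z) the (approximately differentiating) lip radiation.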
Psychoacoustics • Frequency • Mel/Bark scales • Tuning curves • Affected by spectral and temporal composition • Masking • Formant frequency integration (3.5 bark) • Intensity • Exponential functions • Equal loudness contours • Head-related transfer functions (HRTF) • Omnidirectional sounds • NOTE: vocal effort, Lombard effect!
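A widely used analytic form of the mel scale (the exact constants vary slightly between implementations) maps frequency f in Hz to mels as:

```latex
m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
f(m) = 700\left(10^{m/2595} - 1\right)
```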
Speech technology • Segmentation • Active speech level, voiced/unvoiced, phonemes, etc. NOTE Linguistics: segmentals/suprasegmentals • Speech feature extraction • Acoustic features • Spectral features and other transformations • Glottal source parameterization (voice quality) • Formant analysis • Prosodic features (suprasegmentals) • Pitch/F0, Loudness/Intensity, Rhythm/Duration • “Voice quality” NOTE the taxonomical ambiguity!
Segmentation • Active voice level • Energy and frequency measures • Silence, noises, speech • Voiced/Unvoiced • Periodicity (autocorrelation, cepstrum) • Phonation • Higher level segmentation • Model selection/clustering, F0/intensity trends, pauses, phoneme/lexical analysis • Diarization, words, sentences and utterances
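A minimal frame-level sketch of the first two segmentation steps (active level and voiced/unvoiced); the thresholds and frame sizes are illustrative, not the course's actual settings:

```python
import numpy as np

def segment_frames(x, fs, frame_len=0.03, hop=0.01,
                   energy_ratio=0.05, periodicity_thresh=0.3,
                   f0_min=60.0, f0_max=400.0):
    """Label frames as 'silence', 'unvoiced' or 'voiced'.

    Energy relative to the utterance mean separates silence from
    speech; the normalized autocorrelation peak in the F0 lag range
    separates voiced from unvoiced frames. Thresholds are illustrative.
    """
    n, h = int(frame_len * fs), int(hop * fs)
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    ref_energy = np.mean(x ** 2)
    labels = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] * np.hanning(n)
        if np.mean(frame ** 2) < energy_ratio * ref_energy:
            labels.append("silence")
            continue
        # Normalized autocorrelation: a strong peak in the F0 lag
        # range indicates periodic (voiced) speech.
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        ac = ac / (ac[0] + 1e-12)
        peak = np.max(ac[lag_min:lag_max])
        labels.append("voiced" if peak > periodicity_thresh else "unvoiced")
    return labels
```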
Acoustic and prosodic features • Acoustic features • Traditional acoustic features (MFCC, LPC, PLP, etc.) • Simple signal measures (e.g. zero-crossings, HNR) • Other spectral measures (e.g. formants, long-term spectrum) • Prosodic features (suprasegmental properties) • Pitch • Pitch tracker • F0 contour and derivative distributions • Duration • Voiced/unvoiced/silence segmentation • Distributions of segments and segment ratios • Phoneme segmentation • Speech rate • Intensity • FFT and short-segment energy • Energy contours and spectral parameters • Quality • Inverse filtering • Vocal source parameters
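Once the F0, energy and voicing contours are available, many of the prosodic features above reduce to distribution statistics; a minimal sketch (the feature set shown here is illustrative, not the slides' exact set):

```python
import numpy as np

def prosodic_features(f0, energy, labels):
    """Distribution statistics over per-frame contours.

    f0     : per-frame F0 in Hz (0 for unvoiced/silent frames)
    energy : per-frame short-time energy
    labels : per-frame 'voiced'/'unvoiced'/'silence' labels
    """
    f0v = f0[f0 > 0]                          # voiced-frame F0 values
    df0 = np.diff(f0v) if f0v.size > 1 else np.zeros(1)
    n_voiced = labels.count("voiced")
    n_unvoiced = labels.count("unvoiced")
    n_silence = labels.count("silence")
    n = len(labels)
    return {
        # F0 contour and derivative distributions
        "f0_mean": f0v.mean(), "f0_std": f0v.std(), "f0_range": np.ptp(f0v),
        "df0_abs_mean": np.abs(df0).mean(),
        # Intensity/energy contour statistics
        "energy_mean": energy.mean(), "energy_std": energy.std(),
        # Duration-type features from the segmentation
        "voiced_ratio": n_voiced / n,
        "voiced_to_unvoiced": n_voiced / max(n_unvoiced, 1),
        "pause_ratio": n_silence / n,
    }
```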
Features: Mel-cepstrum • Mel-Frequency Cepstral Coefficients (MFCC) • Mel-scale spaced filter bank • Corresponds to the human auditory system (equal perceived pitch increments) • Usually ~12–24 coefficients with a 50% overlapping window • Cepstral mean subtraction for relative features • Delta and delta-delta features possible for sequences
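A minimal MFCC extraction sketch using the librosa package; the file name and frame settings are placeholders:

```python
import numpy as np
import librosa

# Load speech and compute 13 MFCCs on overlapping frames.
y, sr = librosa.load("utterance.wav", sr=16000)         # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=256)   # 50% overlap

# Cepstral mean subtraction: remove the per-utterance mean of each
# coefficient so that features are relative to channel/speaker bias.
mfcc_cms = mfcc - mfcc.mean(axis=1, keepdims=True)

# Delta and delta-delta features capture the temporal dynamics.
delta = librosa.feature.delta(mfcc_cms)
delta2 = librosa.feature.delta(mfcc_cms, order=2)
features = np.vstack([mfcc_cms, delta, delta2])          # (39, n_frames)
```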
Emotional F0 contour examples • [Figure] Same text with different speakers and different emotions • [Figure] Same text and speakers with different emotions
Inverse filtering • Glottal flow and its derivative from a microphone-recorded pressure signal • Uses the source-filter model • E.g. Iterative Adaptive Inverse Filtering (IAIF) • Parameterization (direct or model fitting) • Time-based • OQ, SQ, etc. • Amplitude-based • AC flow, min. flow, etc. • Frequency-domain • FFT, AR • Spectral decay, HNR
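A deliberately simplified single-pass sketch of glottal inverse filtering. This is not the full IAIF iteration: one LPC fit stands in for the vocal tract and a leaky integrator for the radiation cancellation, so part of the glottal spectral tilt leaks into the filter estimate.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def simple_inverse_filter(x, fs, lpc_order=None, leak=0.99):
    """Single-pass glottal inverse filtering sketch (NOT full IAIF)."""
    if lpc_order is None:
        lpc_order = int(fs / 1000) + 2            # common rule of thumb
    # 1) Estimate the vocal tract with linear prediction on the
    #    recorded pressure signal.
    a = librosa.lpc(x, order=lpc_order)
    # 2) Inverse filter: the LPC residual approximates the glottal
    #    flow derivative (lip radiation acts roughly as a differentiator).
    flow_deriv = lfilter(a, [1.0], x)
    # 3) Leaky integration cancels the differentiation and yields an
    #    estimate of the glottal flow itself.
    flow = lfilter([1.0], [1.0, -leak], flow_deriv)
    return flow, flow_deriv
```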
Features: Glottal parameters • open quotient (OQ) • (opening + closing)/total cycle • closed quotient (CQ) • 1 - OQ • closing quotient (ClQ) • closing/total cycle • quasi-open quotient (QOQ) • quasi-open phase/total cycle • speed quotient (SQ) • opening/closing • amplitude quotient (AQ) • (flow amplitude)/(flow derivative amplitude) • normalized amplitude quotient (NAQ) • AQ/total cycle
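Assuming the opening/closing durations and flow amplitudes of a cycle have already been measured from the inverse-filtered signal, the time- and amplitude-based quotients above are simple ratios (QOQ is left out because it needs an additional level-crossing analysis of the flow pulse):

```python
def glottal_quotients(t_open, t_close, T, flow_amp, dflow_min):
    """Time- and amplitude-based quotients for one glottal cycle.

    t_open    : duration of the opening phase (s)
    t_close   : duration of the closing phase (s)
    T         : total cycle length (s)
    flow_amp  : AC amplitude of the glottal flow
    dflow_min : magnitude of the negative peak of the flow derivative
    The opening/closing instants are assumed to be detected beforehand.
    """
    oq  = (t_open + t_close) / T        # open quotient
    cq  = 1.0 - oq                      # closed quotient
    clq = t_close / T                   # closing quotient
    sq  = t_open / t_close              # speed quotient
    aq  = flow_amp / dflow_min          # amplitude quotient (seconds)
    naq = aq / T                        # normalized amplitude quotient
    return {"OQ": oq, "CQ": cq, "ClQ": clq, "SQ": sq, "AQ": aq, "NAQ": naq}
```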
Emotion recognition from speech • Machine learning tools are used frequently • Feature selection and transformations • Sequential floating search (SFFS), principal component analysis (PCA), nonlinear manifold modeling, etc. • Classifiers • Linear discriminant analysis (LDA), k-nearest neighbors (kNN), support vector machines (SVM), hidden Markov models (HMM), neural networks (NN) • Validation and bias • Cross-validation, structural risk minimization (SRM), etc.
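A minimal scikit-learn pipeline in the spirit of the list above (scaling + PCA + SVM with speaker-grouped cross-validation for a speaker-independent estimate); the data here are synthetic placeholders, not any of the course corpora:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GroupKFold

# Placeholder data: 200 utterances, 40 features, 4 emotions, 10 speakers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)
speakers = rng.integers(0, 10, size=200)

clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                    SVC(kernel="rbf", C=10.0, gamma="scale"))

# Grouping folds by speaker avoids the optimistic bias of letting the
# same speaker appear in both training and test sets.
scores = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5),
                         groups=speakers)
print("mean accuracy: %.2f" % scores.mean())
```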
Feature Selection • Sequential Forward/Backward Floating Search • The single best feature initializes the selected set (or all features, for backward search) • Forward step: the feature whose addition improves performance the most is added • Backward (floating) step: an already selected feature is removed if its removal improves performance • The process continues until the desired dimensionality is reached
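A compact sketch of the forward floating search; scoring uses cross-validated accuracy, and the details (tie handling, stopping rule) are simplified relative to full SFFS:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def sffs(X, y, estimator, target_dim, cv=5):
    """Sequential forward floating selection, minimal sketch."""
    def score(idx):
        return cross_val_score(estimator, X[:, idx], y, cv=cv).mean()

    selected, best = [], -np.inf
    while len(selected) < target_dim:
        # Forward step: add the feature that helps the most.
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        selected.append(f_best)
        best = score(selected)
        # Floating (backward) steps: drop a previously selected feature
        # if that improves the score; never drop the one just added.
        improved = True
        while improved and len(selected) > 2:
            improved = False
            for f in [g for g in selected if g != f_best]:
                trial = [g for g in selected if g != f]
                s = score(trial)
                if s > best:
                    selected, best, improved = trial, s, True
                    break
    return selected
```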
Manifold Learning • Specialized data transformation methods • Attempt to retain the topological structure of the data • Nonlinear methods • Neighborhood-connected graphs/trees • Structurally motivated distance metrics • Many methods have been developed (with large mutual similarities) • Global, full-spectral methods • PCA (linear or kernel based) • Isomap (geodesic distance) • Sparse, local-linearity-based methods • LLE (Locally Linear Embedding)
Supervised Isomap • K-nearest neighbor linking • Shortest path calculation (Dijkstra’s algorithm) • Classical multidimensional scaling • Supervised weights/modified divergence function
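A minimal end-to-end sketch of the steps listed above. The supervision used here (inflating between-class distances by a fixed factor before building the neighborhood graph) is a simple stand-in for the actual class weights/modified divergence function:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def supervised_isomap(X, y, n_neighbors=8, n_components=2, alpha=0.3):
    """Minimal supervised Isomap sketch (heuristic class weighting)."""
    y = np.asarray(y)
    d = cdist(X, X)
    same = (y[:, None] == y[None, :])
    d_sup = np.where(same, d, d * (1.0 + alpha))   # supervised weighting

    # k-nearest-neighbor linking: zero entries mean "no edge" in the
    # dense csgraph representation.
    n = len(X)
    graph = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d_sup[i])[1:n_neighbors + 1]
        graph[i, nn] = d_sup[i, nn]
        graph[nn, i] = d_sup[i, nn]                # keep the graph symmetric

    # Geodesic distances via Dijkstra's algorithm
    # (assumes the k-NN graph is connected; otherwise raise n_neighbors).
    G = shortest_path(graph, method="D", directed=False)

    # Classical multidimensional scaling on the geodesic distance matrix.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```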
Nonlinear mapping: Isomap • [Figure: 2-D Isomap embedding of emotional speech on valence/activation axes; classes: neutral, sad, angry, happy]
Embedding performance: Isomap • Emotional speech data • Prosodic/acoustic features • Supervised Isomap • Class weights • Nonlinear divergence measure • Supervised learning • Sequential forward floating search • Parameter grid search • Classifier optimization target • kNN classifier in embedding space • GRNN mapping of data • Validation • Hold-out cross-validation • Person independent • [Figure: recognition accuracy compared with a human reference]
Emotion classification: LDA • Basic emotions: (1) Neutral, (2) Sad, (3) Angry, (4) Happy • [Figure: LDA class regions in the F0 mean vs. voiced segment ratio feature plane]
State-of-the-art methods • Pitch tracker • Autocorrelation is probably the best short-term method • Cepstrum is also acceptable in practice • A better estimate of glottal closures is needed • e.g. waveform matching (time-domain) • Classifier • SVM or neural network • Any classifier that handles nonlinear data will do • Training • Genetic algorithms, floating search • PCA transformation of features seems to help very little; nonlinear methods (e.g. Isomap) work better
State-of-the-art performance • Theoretical performance according to the literature • 60–70% in an automatic, speaker-independent, limited emotion case (discrimination) • Neutral, sad, happy, angry • 55–70% for a human reference in non-limited recognition of basic emotions in a multicultural context • Neutral, sad, happy, angry, disgusted, surprised, fearful • In practice • 40–90+% depending on the scenario constraints, sample size, quality, number of emotions, and available features
Available databases, speech • Emotional speech databases • MediaTeam emotional speech corpus (easy) • Finnish, acted (stereotypical), unimodal, basic emotions (4–7 discrete classes), proprietary • Hytke (challenging) • Finnish, spontaneous (emphasized), multimodal, SAM scale (dimensional annotation: valence, activation) • Speech, facial video, heart rate (RR), eye tracking, posture (via eye tracking)
The MediaTeam Emotional Speech Corpus • The MediaTeam Emotional Speech Corpus is a large database of simulated emotional speech in continuous spoken Finnish • 14 professional actors (8 men, 6 women, aged between 25 and 50) were recruited to simulate the basic emotions: neutral, sadness, happiness, anger • The speech material was a semantically neutral text dealing with the nutritional value of the Finnish crowberry: about one minute long, some 100 words, phonetically rich • Recorded in an anechoic room, 44.1 kHz, mono, 16 bits
Listening tests • 50 Finnish 8th–9th grade school students • Male and female • Random-order forced-choice test; emotion discrimination • Weekly sessions over 2 months • Accuracy 57–93%, 76.9% on average
Listener/Actor performances • [Figures: recognition performance per listener and per actor]
Speech tools (GPL licenses) • Praat • Phonetics toolbox/software for prosody and general speech analysis • http://www.fon.hum.uva.nl/praat/ • Voicebox • MATLAB toolbox for speech processing, feature extraction, etc. • http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html • TKK Aparat • Inverse filtering and voice source parameterization MATLAB toolkit (needs a runtime) • http://sourceforge.net/projects/aparat/
Bibliography • Airas M (2008) TKK Aparat: An environment for voice inverse filtering and parameterization. Logopedics Phoniatrics Vocology 33(1): 49–64. • Alku P (2011) Glottal inverse filtering analysis of human voice production – A review of estimation and parameterization methods of the glottal excitation and their applications. Sadhana 36(5): 623–650. • Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences 17: 97–110. • Bosch L ten (2003) Emotions, speech and the ASR framework. Speech Communication 40(1–2): 213–225. • Fant G, Liljencrants J & Lin Q (1985) A four-parameter model of glottal flow. STL-QPSR 4: 1–13. • Scherer KR (2003) Vocal communication of emotion: A review of research paradigms. Speech Communication 40: 227–256. • Tenenbaum JB, de Silva V & Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319–2323. • Ververidis D & Kotropoulos C (2006) Emotional speech recognition: Resources, features, and methods. Speech Communication 48: 1162–1181.