Some Statements on the Automatic Classification of Emotion in Speech
or: A Set of Hypotheses and Statements re Automatic Extraction and Evaluation of Acoustic (mainly Prosodic) and other Features for the Automatic Classification of Emotional States in Spontaneous Speech, Based on Experience and Educated Guessing
Anton Batliner, Christian Hacker, Elmar Nöth, Stefan Steidl
FAU (University of Erlangen)
HUMAINE WP4, Santorin, September 2004
Overview
• why these hypotheses / statements?
• our basic approach:
  • features
  • classifiers
  • examples
• the hypotheses / statements
• suggestions and outlook
Why these Hypotheses
• significance and benchmarking in single studies are no proof; cumulative evidence is better, and that is how it works, i.e., via repetition
• there is much expertise in HUMAINE at different sites
• the aim is to put forth "one man's opinion", to be compared with other experience
• hopefully, by the end of HUMAINE: corroboration, modification, or refutation
Feature vector for a context of 2 words: 95 prosodic, 80 spectral, 30 POS
• mean values of duration, energy, F0
• duration features: absolute; normalized by mean duration; absolute duration divided by number of syllables
• energy features: regression coefficient with mean square error; mean; maximum with its position on the time axis; absolute energy; energy normalized by mean energy
• F0 features: regression coefficient with mean square error; mean, maximum, minimum, onset, and offset values, and their positions on the time axis; all F0 features logarithmized and normalized with respect to the mean F0 value
• length of pause before and after the word
• HNR (harmonics-to-noise ratio) and formant-based features for the most frequent vowels (frequency and energy), MFCC
• 6 coarse POS (part-of-speech) features
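The F0 items in the list above can be sketched roughly as follows. This is a minimal illustration, not the extraction pipeline behind the slides: the function name `f0_features`, its parameters, and the choice to take the logarithm before fitting the regression line are assumptions made for the example. The input is assumed to be the F0 values of the voiced frames of one word, with their times relative to the word start.

```python
import numpy as np

def f0_features(f0_hz, times_s, mean_f0_hz):
    """Word-level F0 features (hypothetical helper, for illustration only).

    f0_hz:      F0 values in Hz for the voiced frames of one word, in time order
    times_s:    time of each frame in seconds, relative to the word start
    mean_f0_hz: reference mean F0 (e.g., of the speaker) used for normalization
    """
    f0 = np.asarray(f0_hz, dtype=float)
    t = np.asarray(times_s, dtype=float)
    if f0.size < 2 or f0.size != t.size:
        raise ValueError("need at least two frames and matching times")

    # Logarithmize and normalize with respect to the mean F0 value.
    log_f0 = np.log(f0) - np.log(mean_f0_hz)

    # Straight-line fit over time: the slope is the regression coefficient,
    # the mean squared residual is the accompanying mean square error.
    slope, intercept = np.polyfit(t, log_f0, 1)
    residual = log_f0 - (slope * t + intercept)

    i_max, i_min = int(np.argmax(log_f0)), int(np.argmin(log_f0))
    return {
        "reg_coeff": float(slope),
        "mse": float(np.mean(residual ** 2)),
        "mean": float(np.mean(log_f0)),
        "max": float(log_f0[i_max]), "max_pos": float(t[i_max]),
        "min": float(log_f0[i_min]), "min_pos": float(t[i_min]),
        "onset": float(log_f0[0]), "onset_pos": float(t[0]),
        "offset": float(log_f0[-1]), "offset_pos": float(t[-1]),
    }
```

Energy features could be computed along the same lines; the pitch tracker that supplies `f0_hz` is outside the scope of this sketch.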
Classifiers
• Linear Discriminant Analysis (LDA), Decision Trees (DT; e.g., Classification and Regression Trees, CART), Neural Networks (NN), Support Vector Machines (SVM), Language Models (LM), Gaussian Mixtures (GM), ...
• heretical suggestion: classifiers are not that important, at least in the context of HUMAINE
Problem vs. no problem (= joy/neutral): LDA, Sympafly (automatic call centre), cross classification, different feature classes, turn-based classification
RR: overall recognition rate
CL: class-wise averaged recognition rate, i.e., the mean of the recognition rates for problem and no problem
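The two measures differ whenever the classes are unevenly distributed. A minimal sketch, assuming reference and predicted labels are given as equally long lists (the function name `recognition_rates` is hypothetical); CL as defined here is what is often called unweighted average recall:

```python
from collections import Counter

def recognition_rates(reference, predicted):
    """Return (RR, CL) for two equally long label sequences.

    RR: share of all items classified correctly.
    CL: mean of the per-class recognition rates (each class counts equally).
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")

    per_class_total = Counter(reference)
    per_class_correct = Counter(r for r, p in zip(reference, predicted) if r == p)

    rr = sum(per_class_correct.values()) / len(reference)
    cl = sum(per_class_correct[c] / n for c, n in per_class_total.items()) / len(per_class_total)
    return rr, cl

# Toy labels (not data from the slides): 2 "problem" turns, 8 "no problem" turns,
# one of the "problem" turns is missed.
reference = ["problem"] * 2 + ["no_problem"] * 8
predicted = ["problem", "no_problem"] + ["no_problem"] * 8
rr, cl = recognition_rates(reference, predicted)
print(rr, cl)  # 0.9 0.75
```

With skewed classes, RR is dominated by the frequent class, while CL penalizes the missed "problem" turn much more strongly.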
The Hypothesis
• and the reason why, e.g., black swans do not exist
• reason: I've never seen any (and I've seen plenty of white ones)
• caveat: a few black swans do not falsify the hypothesis (the exception proves the rule because it is unlikely, i.e., this is no natural law)
Best way to proceed: compute many (more or less) relevant features, and let the classifier do the work
• cons: features are often (highly) correlated; only wood, no trees; "brute force, dumb shotgun approach", i.e., interpretation not possible*
• pros: best classification possible, i.e., the classifier does the work of finding and discarding irrelevant features
• and if optimization is too costly, performance is still good
• note: our feature vector has been selected carefully: > 500 → 276 → 125 → 95
* Interpretation is something else, cf. below!
Omnibus approach: with many features, you can always use the same feature vector, irrespective of the task
• e.g., for accents, boundaries, questions, offtalk, repairs, etc., and emotions
• this makes life easier (= less effort)
• and, possibly, makes classifiers more robust when coping with new tasks / databases (cf. questions)
• not much deterioration w.r.t. an optimized feature vector
• and there is not much difference between 79% and 81% if what counts is not the world record but the overall quality of the system: it is always 4/5
No dramatic decline in performance if all features are used; still good performance if only a few are used
• but of course, these few have to be the most important ones, and they can only be obtained by automatic assessment
• slight deterioration with more than some 40 or fewer than some 20 features; can such results be generalized somehow?
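One simple form of automatic assessment is to rank features by a univariate separability score and keep the top k. The sketch below uses a Fisher-type score for a two-class problem; this is an assumption chosen for illustration, not the selection procedure used for the results on these slides, and it ignores correlations between features. The names `fisher_scores` and `top_k_features` are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature score: squared difference of class means over summed class variances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    between = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    within = a.var(axis=0) + b.var(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    order = np.argsort(fisher_scores(X, y))[::-1]
    return order[:k].tolist()

# Toy data: feature 0 separates the classes well, feature 1 barely, feature 2 not at all.
X = [[0.0, 5.0, 1.0], [1.0, 5.1, 0.0], [10.0, 4.9, 1.0], [11.0, 5.0, 0.0]]
y = [0, 0, 1, 1]
print(top_k_features(X, y, 2))  # [0, 1]
```

Classification can then be repeated with the top 20, 40, ... features to see where performance starts to deteriorate.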
Raw values always contribute more than complex / combined values (other things being equal)
• because the classifier is better at evaluating the impact of single features than we are, and normally there is some added (= more) information in raw values (range vs. max + min, integral vs. energy + duration, etc.)
There is no "most important" feature (or set of features) in prosody
• all the features are, more or less, correlated with each other
• thus, it is not detrimental if one or several are missing
• and we are far from any definitive assessment
• still, of course, it might matter whether one feature is added or missing; but if the effect is pronounced, chances are that you've done something wrong earlier (note: there are exceptions)
• and of course, different phenomena can be characterized by different features
F0 features are not more important than other prosodic features
• 15 years ago, this would have been a very indecent suggestion in some sub-cultures (not emotion research?)
• two possible, and not competing, reasons: they simply aren't, or they aren't because extraction is error-prone
• "They simply aren't" means: high F0 range correlates with longer duration and vice versa, etc.; does it matter which is the chicken and which is the egg?
• assessment is only possible with manually corrected feature values
Intonational models are sub-optimal for use in automatic classification*
• they concentrate on pitch
• they are designed for something else
• they incur quantisation error
* note: this holds for classification but not necessarily for synthesis / typology / etc.!
Two new "dimensions" [figure: classification performance (bad to good) and interpretation (bad to good), for many features vs. few features]
Classification performance is negatively correlated with interpretability
• so it is best to separate them
• and maybe use different classifiers for the two tasks, e.g., LDA and DT (with PC) for interpretation, and, if you have plenty of time, NN or SVM for classification
• context features are good for performance but often not easy to interpret, maybe because of spurious effects
• open question: how to incorporate which context? To represent a "neutral" baseline? "Neutral" w.r.t. speaker or task? Which unit of analysis?
Spectral features are not irrelevant but much less important than people would like to believe
• either because of extraction problems, sparse data, or noise, or because they simply are not important / indicate different things:
• spectral features are good at the micro-level (= segmental), prosodic features at the macro-level (= supra-segmental)
• the "sparse data problem" in spontaneous speech is maybe the most important one, because spectral features depend much more on segmental context than prosody does
• definitely no bi-uniqueness of form and function
An example: what laryngealisation can indicate (phonation type / voice quality, prosodic and spectral features)
• accentuation
• vowels
• word boundaries
• native language
• the end of an utterance, i.e., turn-taking
• speaker idiosyncrasies
• speech pathology
• too many drinks / cigarettes
• competence / power
• social class membership
• and: emotional state (sadness, etc.)
Multi-modality does not enhance classification performance, because emotions are not brontosauruses* but sausages**
• because humans are holistic beings, i.e., if the emotion is strong, then all simultaneous modalities are pronounced, and vice versa (no chickens, no eggs)
• of course, one modality might be "complementary" in the absence / ambiguity of the other modality (the "open mouth problem"): "sequential multi-modality"
• fusion is a problem in itself! No added value but added noise?
* All brontosauruses are thin at one end, much MUCH thicker in the middle, and then thin again at the far end. (J. Cleese alias Miss Anne Elk)
** Sausages are either thick or thin.
In "representative" data, performance for a two-class problem is below 80%; for a four-class problem, it is below 60%
• thus, results that are (much!) better are unicorns!
• this will hopefully change slightly, but we will face an upper limit pretty soon, unless new knowledge sources are detected and taken into account
• inconsistency of annotations: such "low" recognition rates are maybe the best ones one can get?
• question: are there similar statements for (facial and hand) gestures?
All this only holds for speaker-independent analyses of spontaneous, "representative" speech; but: do not use acted and/or speaker-dependent data unless this is your intended application!
• it is like read vs. spontaneous speech (remember the sobering breakdown of classification performance: 98% → 20%)
• thus, results obtained from acted emotions might be taken as a basis for where to look, but not more!
• two types of "emotional" data: rich and poor?
Open for discussion
• consent to / rejection of statements?
• if rejection: same type of data / features / procedures?
• catalogue of old and new / alternative statements