Compensating for Hyperarticulation by Modeling Articulatory Properties Hagen Soltau, Florian Metze, Alex Waibel Interactions between Speech Recognition Problems and User Emotions Mihai Rotaru, Diane J. Litman, Kate Forbes-Riley. Feb.21.2006 Rohit Kumar Affective Dialog Systems.
(Affective Computing) User Centered Computing Audience Centered Presentation
Queries & Concerns • What are Articulatory Features? • Large conflicts in the enumeration of these features • Use of Articulatory Features to detect Emotions • Training data for Hyperarticulation models • Use of isolated words • No annotation of Hyperarticulation • Methodology of data collection • Task specific, …
Queries & Concerns • Humans use Hyperarticulation to recover from errors in human-human (HH) interaction, while Hyperarticulation is a source of error in human-computer (HC) interaction. Why? • Lots of big questions • Should we make human-like ASRs? • Can we? • What is different? • Gaussian Mixture Models (GMM) • No significance numbers for the WERs
Queries & Concerns • Applicability test of Chi-Square • Hypotheses to explain the lack of dependencies where they are expected • Users more forgiving in Tutorial Dialog (higher tolerance to error) • May be due to conflation of Emotions • Separate out +ves and -ves • Due to YES/NO turns after semantic misrecognition • Difficult to capture emotion in Yes/No • Better recognition to not reject
But before we turn into “Self”-Centered Maniacs, let’s look at what Soltau and Rotaru have to say
What are these papers about? Both papers are about • Automatic (& Human) Speech Recognition • Error Handling Strategies in Spoken Dialog • Interaction between Affect and Misrecognitions by ASR
Soltau et al. • Suggest using Articulatory Features to improve ASR performance on hyperarticulated speech • Assumption: People don’t substitute a whole phone to contrast a previous recognition error • Basically, more precise modeling of what’s being hyperarticulated • How did they do it? • Besides what HMM-based ASRs usually do • Trained additional GMMs for Articulatory Features (and also anti-models) • Get probability scores (from the GMMs) for the Articulatory Features • Linearly combine (with different weights) the scores from all the models • Get a better hypothesis (just like “get more minutes”)
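The score combination on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: the scores are taken to be log-likelihoods, each feature contributes its model score when the hypothesized phone has that feature and its anti-model score otherwise, and the function name, weights, and numbers are all hypothetical rather than taken from the paper.

```python
def combined_score(hmm_logp, af_logps, phone_features, weights, hmm_weight=1.0):
    """Weighted linear combination of an acoustic score and AF scores.

    hmm_logp:       log-likelihood from the standard acoustic model
    af_logps:       {feature: (logp_model, logp_anti_model)} for one frame
    phone_features: set of features the hypothesized phone is assumed to have
    weights:        {feature: weight} (hypothetical values, would be tuned)
    """
    total = hmm_weight * hmm_logp
    for feature, (logp_model, logp_anti) in af_logps.items():
        # Assumption: use the feature model if the phone has the feature,
        # otherwise use the anti-model.
        logp = logp_model if feature in phone_features else logp_anti
        total += weights.get(feature, 0.0) * logp
    return total


# Hypothetical frame: the feature detectors favor "voiced" and "labial".
af_logps = {"voiced": (-1.0, -3.0), "labial": (-0.5, -2.5)}
weights = {"voiced": 0.2, "labial": 0.2}

score_b = combined_score(-5.0, af_logps, {"voiced", "labial"}, weights)
score_p = combined_score(-4.8, af_logps, {"labial"}, weights)
print(score_b > score_p)
```

In this made-up frame the acoustic model alone slightly prefers /p/ (-4.8 vs. -5.0), but the voicing evidence tips the combined score toward /b/, which is the kind of correction the extra feature models are meant to provide.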
Soltau et al. (continued) (Add in if I am missing something) • Methodology • Acoustic Models • Feature Extraction (MFCC + context, reduced to 40 features by an LDA transform) • Other front-end processing • AF Models • Same front end • GMMs (48 per feature) trained on middle-state time alignments • Data collection for hyperarticulated speech • 2 sessions: Normal / Induced Hyperarticulated • Simulated recognition errors • 45 subjects
Soltau et al. (continued) • Various Experiments • Classification of Articulatory Features • Decoding with Adapted Acoustic Models + AF • Decoding with Specialized models + AF
Rotaru et al. • Domain: Spoken Tutorial Dialog • Chaining Effect of misrecognition across turns • Recognition Problems & Emotions in student turns
Rotaru et al. (continued) • Methodology • ITSPOKE Corpus + Emotion Annotation • Student Utterances annotated for • ASR Misrecognitions • Rejections • Semantic Misrecognition • Student Emotion • Emotion Source
Rotaru et al. (continued) • Chi-Square Analysis • Rejection in previous turn vs. Rejection in current turn • ASR Mis. in previous turn vs. ASR Mis. in current turn • ASR Mis. in previous turn vs. Rejection in current turn • Rejection in previous turn vs. Emotion in current turn • Rejection in previous turn vs. Emotion Src. in current turn • Sem. Mis. in previous turn vs. Emotion in current turn • Emotion in previous turn vs. (ASR) Mis. in current turn • Emotion in current turn vs. (ASR) Mis. in current turn
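Each pairing listed on this slide amounts to a chi-square test of independence on a contingency table. A minimal sketch of the statistic follows; the counts are invented for illustration and are not from the ITSPOKE corpus, and the function name is hypothetical.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = rejection in previous turn (yes, no),
# columns = rejection in current turn (yes, no).
table = [[20, 30],
         [30, 120]]
stat, dof = chi_square_statistic(table)
print(stat, dof)
```

For one degree of freedom the 0.05 critical value is about 3.84, so a statistic above that would suggest the two turn-level variables are not independent.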
Articulatory Features • Speech Production Mechanism
Articulatory Features • Vowels • Vowel Height • High, Mid, Low • Vowel Backness • Front, Mid, Back • Long / Short Vowel • Diphthong • Schwa • Lip Rounding (+/-) • Voicing ! • Oral / Nasal
Articulatory Features • Consonant • Place of Articulation • Labial, Alveolar, Palatal, Labio-Dental, Dental, Velar, Glottal, {Retroflex} • Manner of Articulation • Stop, Fricative, Affricate, Nasal, Lateral, Approximant, {Liquids, Semivowels} • Voicing (+/-) References: Rohit Kumar, Amit Kataria, Sanjeev Sofat, "Building Non-Native Pronunciation Lexicon for English using a Rule based Approach," International Conference on Natural Language Processing (ICON) 2003, Mysore, India; http://en.wikipedia.org/wiki/Articulatory_phonetics
Training data for Hyperarticulation models • Use of Isolated words • No Annotation of Hyperarticulation • Methodology of data collection • Task Specific, …
Humans use Hyperarticulation to recover from errors in human-human (HH) interaction, while Hyperarticulation is a source of error in human-computer (HC) interaction. Why? • Lots of big questions • Should we make human-like ASRs? • Could we? Would we? • What is different?
Gaussian Mixture Models Andrew Moore’s Lecture Slides Pg 7 - 10, 20 - 24 http://www.autonlab.org/tutorials/gmm.html
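As a quick reminder of what a GMM score is, here is a minimal sketch of the log-likelihood of one feature vector under a diagonal-covariance mixture. The diagonal-covariance choice, the function name, and the parameter values are assumptions for illustration, not details from either paper.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_k w_k * N(x; mean_k, diag(var_k))."""
    component_logps = []
    for w, mean, var in zip(weights, means, variances):
        logp = math.log(w)
        for x_d, m_d, v_d in zip(x, mean, var):
            # Log-density of a 1-D Gaussian, summed over dimensions.
            logp += -0.5 * (math.log(2 * math.pi * v_d) + (x_d - m_d) ** 2 / v_d)
        component_logps.append(logp)
    # Log-sum-exp for numerical stability.
    top = max(component_logps)
    return top + math.log(sum(math.exp(lp - top) for lp in component_logps))


# Hypothetical 2-component mixture over 2-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(gmm_log_likelihood([0.1, -0.2], weights, means, variances))
```

An articulatory-feature detector of the kind discussed earlier would compare such a score from the feature's GMM against the score from its anti-model.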
Applicability Test of Chi-Square (χ²) The following minimum frequency thresholds should be obeyed: • for a 1 × 2 or 2 × 2 table, expected frequencies in each cell should be at least 5 • for a 2 × 3 table, expected frequencies should be at least 2 • for a 2 × 4 or 3 × 3 or larger table, if all expected frequencies but one are at least 5 and if the one small cell is at least 1, chi-square is still a good approximation In general, the greater the degrees of freedom (i.e., the more values/categories on the independent and dependent variables), the more lenient the minimum expected frequencies threshold. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html
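The thresholds on this slide can be turned into a small check on the expected frequencies. This is a sketch of one reading of the rules: table shapes the slide does not list are assigned to the nearest rule by cell count, which is an assumption, and the function names are hypothetical.

```python
def expected_frequencies(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_applicable(table):
    """Apply the slide's minimum expected-frequency thresholds."""
    expected = [e for row in expected_frequencies(table) for e in row]
    cells = len(expected)
    if cells <= 4:
        # 1 x 2 or 2 x 2: every expected frequency at least 5.
        return min(expected) >= 5
    if cells <= 6:
        # 2 x 3: every expected frequency at least 2.
        return min(expected) >= 2
    # 2 x 4, 3 x 3, or larger: at most one cell below 5, and it is at least 1.
    small = [e for e in expected if e < 5]
    return len(small) == 0 or (len(small) == 1 and small[0] >= 1)


# Hypothetical tables, not data from the paper.
print(chi_square_applicable([[20, 30], [30, 120]]))
print(chi_square_applicable([[3, 2], [4, 100]]))
```

Note that the check looks at expected counts, not observed ones, so a table with a small observed cell can still pass if its row and column totals are large enough.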
Hypotheses to explain the lack of dependencies where they are expected • Users more forgiving in Tutorial Dialog (higher tolerance to error) • May be due to conflation of Emotions • Separate out +ves and -ves • Due to YES/NO turns after semantic misrecognition • Difficult to capture emotion in Yes/No • Better recognition to not reject
That’s all Folks Unless you have something to say ?!