Feb.21.2006 Rohit Kumar Affective Dialog Systems

Compensating for Hyperarticulation by Modeling Articulatory Properties
Hagen Soltau, Florian Metze, Alex Waibel

Interactions between Speech Recognition Problems and User Emotions
Mihai Rotaru, Diane J. Litman, Kate Forbes-Riley

(Affective Computing) User Centered Computing  Audience Centered Presentation             

Queries & Concerns
• What are Articulatory Features?
  • Large conflicts in enumeration of these features
• Use of Articulatory Features to detect Emotions
• Training data for Hyperarticulation models
  • Use of Isolated words
  • No Annotation of Hyperarticulation
  • Methodology of data collection
  • Task Specific, …

Queries & Concerns
• Humans use Hyperarticulation to recover from errors in HH (human-human) interaction, while Hyperarticulation is a source of error in HC (human-computer) interaction. Why?
  • Lots of big Questions
  • Should we make Human-like ASRs?
  • Can we?
  • What is different?
• Gaussian Mixture Models (GMM)
• No Significance Numbers of WERs

Queries & Concerns
• Applicability test of Chi-Square
• Hypotheses to explain the lack of dependencies where they are expected
  • Users more forgiving in Tutorial Dialog (higher tolerance to error)
  • May be due to Conflation of Emotions
  • Separate out +ves and -ves
  • Due to YES/NO turns after semantic misrecognition
  • Difficult to capture emotion in Yes/No
  • Better recognition to not reject

But before we turn into “Self” Centered Maniacs, let's look at what Soltau and Rotaru have to say

What are these papers about?
Both these papers are about
• Automatic (& Human) Speech Recognition
• Error Handling Strategies in Spoken Dialog
• Interaction between Affect and Misrecognitions by ASR

Soltau et al.
• Suggest that Articulatory Features be used to improve the performance of ASR on Hyperarticulated speech
• Assumption: People don't substitute a whole phone to contrast a previous recognition error
• Basically, more precise modeling of what's being hyperarticulated
• How did they do it?
  • Besides what HMM-based ASRs usually do
  • Trained additional GMMs for Articulatory Features (and also anti-models)
  • Get probability scores (from the GMMs) for the Articulatory Features
  • Linearly combine (with different weights) the scores from all the models
  • Get better hypothesis (just like “get more minutes”)
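The score combination on the slide above can be sketched roughly as follows. This is only an illustration of a weighted linear combination of per-model log scores, not the authors' implementation: the model names, weights, and scores are hypothetical, and how the anti-model scores enter the sum is not specified on the slide, so they are left out here.

```python
def combine_scores(log_scores, weights):
    """Weighted linear combination of per-model log scores."""
    if set(log_scores) != set(weights):
        raise ValueError("every model needs exactly one weight")
    return sum(weights[name] * log_scores[name] for name in log_scores)


def best_hypothesis(hypotheses, weights):
    """Return the label of the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda label: combine_scores(hypotheses[label], weights))


# Hypothetical weights and log scores, for illustration only.
weights = {"hmm": 1.0, "voiced": 0.5, "nasal": 0.5}
hypotheses = {
    "bat": {"hmm": -10.0, "voiced": -1.0, "nasal": -4.0},
    "mat": {"hmm": -10.5, "voiced": -1.0, "nasal": -1.0},
}
print(best_hypothesis(hypotheses, weights))
```

In this toy case the HMM score alone slightly prefers "bat", while the articulatory feature scores tip the combined score towards "mat".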

Soltau et al. (continued) (Add in if I am missing something)
• Methodology
  • Acoustic Models
    • Feature Extraction (MFCC + Context, reduced to 40 features by LDA transform)
    • Other front-end processing
  • AF Models
    • Same front end
    • GMMs (48 per feature) trained on middle-state time alignments
  • Data collection for Hyperarticulated speech
    • 2 Sessions: Normal / Induced Hyperarticulated
    • Simulated Recognition Errors
    • 45 Subjects

Soltau et al. (continued)
• Various Experiments
  • Classification of Articulatory Features
  • Decoding with Adapted Acoustic Models + AF
  • Decoding with Specialized models + AF

Rotaru et al.
• Domain: Spoken Tutorial Dialog
• Chaining Effect of misrecognition across turns
• Recognition Problems & Emotions in student turns

Rotaru et al. (continued)
• Methodology
  • ITSPOKE Corpus + Emotion Annotation
  • Student Utterances annotated for
    • ASR Misrecognitions
    • Rejections
    • Semantic Misrecognition
    • Student Emotion
    • Emotion Source

Rotaru et al. (continued)
• Chi-Square Analysis
  • Rejection in previous turn vs. Rejection in current turn
  • ASR Misrecognition in previous turn vs. ASR Misrecognition in current turn
  • ASR Misrecognition in previous turn vs. Rejection in current turn
  • Rejection in previous turn vs. Emotion in current turn
  • Rejection in previous turn vs. Emotion Source in current turn
  • Semantic Misrecognition in previous turn vs. Emotion in current turn
  • Emotion in previous turn vs. (ASR) Misrecognition in current turn
  • Emotion in current turn vs. (ASR) Misrecognition in current turn
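Each comparison listed above is a test of independence on a contingency table. A minimal sketch for the 2 × 2 case is given below, assuming a plain Pearson chi-square without continuity correction. The counts are made up for illustration and are not taken from the paper, and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2 x 2 table of counts.

    Assumes every row and column total is positive (no zero expected cells).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: rows = rejection in previous turn (yes / no),
# columns = rejection in current turn (yes / no).
statistic, p_value = chi_square_2x2([[20, 5], [5, 20]])
print(statistic, p_value)
```

A small p-value suggests that the two annotations are dependent. Larger tables need the general chi-square distribution for the p-value, which this sketch does not cover.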

Articulatory Features
• Speech Production Mechanism

Articulatory Features
• Vowels
  • Vowel Height
    • High, Mid, Low
  • Vowel Backness
    • Front, Mid, Back
  • Long / Short Vowel
  • Diphthong
  • Schwa
  • Lip Rounding (+/-)
  • Voicing !
  • Oral / Nasal

Articulatory Features
• Consonant
  • Place of Articulation
    • Labial, Alveolar, Palatal, Labio-Dental, Dental, Velar, Glottal, {Retroflex}
  • Manner of Articulation
    • Stop, Fricative, Affricate, Nasal, Lateral, Approximant, {Liquids, Semivowels}
  • Voicing (+/-)
Rohit Kumar, Amit Kataria, Sanjeev Sofat, "Building Non-Native Pronunciation Lexicon for English using a Rule based Approach," International Conference on Natural Language Processing (ICON) 2003, Mysore, India
http://en.wikipedia.org/wiki/Articulatory_phonetics

Use of Articulatory Features to detect Emotions

Training data for Hyperarticulation models
• Use of Isolated words
• No Annotation of Hyperarticulation
• Methodology of data collection
• Task Specific, …

Humans use Hyperarticulation to recover from errors in HH (human-human) interaction, while Hyperarticulation is a source of error in HC (human-computer) interaction. Why?
• Lots of big Questions
• Should we make Human-like ASRs?
• Could we? Would we?
• What is different?

Gaussian Mixture Models
Andrew Moore’s Lecture Slides, Pg 7-10, 20-24
http://www.autonlab.org/tutorials/gmm.html
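As a quick reminder of what a GMM score is, the log-likelihood of one feature vector under a mixture with diagonal covariances can be sketched as below. The function names and the toy parameters are hypothetical, and the diagonal-covariance form is an assumption made for simplicity, not something the slides state.

```python
from math import exp, log, pi


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (log(2.0 * pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp for stability."""
    terms = [
        log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))


# Toy two-component mixture over 2-dimensional features (made-up parameters).
score = gmm_log_likelihood(
    [0.2, -0.1],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
print(score)
```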

No Significance Numbers of WERs
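One simple way to attach a significance number to a WER comparison is a paired sign test over utterances. The sketch below is only one possible choice of test, not something either paper reports. The function name and the per-utterance error counts are hypothetical.

```python
from math import comb


def sign_test_p_value(errors_a, errors_b):
    """Two-sided exact sign test on paired per-utterance error counts.

    Ties (utterances where both systems make the same number of errors)
    are discarded, as is usual for the sign test.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same utterances")
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(b < a for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up word error counts per utterance for two recognizers.
baseline = [1, 1, 1, 1, 1, 1]
with_af = [0, 0, 0, 0, 0, 0]
print(sign_test_p_value(with_af, baseline))
```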

Applicability Test of Chi-Square (χ²)
The following minimum frequency thresholds should be obeyed:
• for a 1 × 2 or 2 × 2 table, expected frequencies in each cell should be at least 5
• for a 2 × 3 table, expected frequencies should be at least 2
• for a 2 × 4 or 3 × 3 or larger table, if all expected frequencies but one are at least 5 and if the one small cell is at least 1, chi-square is still a good approximation
In general, the greater the degrees of freedom (i.e., the more values/categories on the independent and dependent variables), the more lenient the minimum expected frequencies threshold.
http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html
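The thresholds above can be checked mechanically from a table of observed counts. The sketch below is one reading of those rules, with hypothetical function names. It leaves out the 1 × 2 case, since there the expected frequencies come from an external hypothesis rather than from the table margins, and it treats "larger" as any table with at least 2 rows, 2 columns and 8 cells, which is an assumption.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_applicable(table):
    """Apply the minimum expected-frequency thresholds quoted above."""
    rows, cols = len(table), len(table[0])
    shape = tuple(sorted((rows, cols)))
    cells = [e for row in expected_frequencies(table) for e in row]
    if shape == (2, 2):
        return min(cells) >= 5
    if shape == (2, 3):
        return min(cells) >= 2
    if shape[0] >= 2 and rows * cols >= 8:
        small = [e for e in cells if e < 5]
        # At most one cell may fall below 5, and it must still be at least 1.
        return not small or (len(small) == 1 and small[0] >= 1)
    raise ValueError("table shape not covered by the quoted thresholds")


# Made-up counts for illustration.
print(chi_square_applicable([[10, 20], [30, 40]]))
print(chi_square_applicable([[1, 2], [3, 40]]))
```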

Hypotheses to explain the lack of dependencies where they are expected
• Users more forgiving in Tutorial Dialog (higher tolerance to error)
• May be due to Conflation of Emotions
• Separate out +ves and -ves
• Due to YES/NO turns after semantic misrecognition
• Difficult to capture emotion in Yes/No
• Better recognition to not reject

That’s all, Folks! Unless you have something to say?!
