Incorporating In-domain Confidence and Discourse Coherence Measures in Utterance Verification (ドメイン内の信頼度と談話の整合性を用いた音声認識誤りの検出: Detecting speech recognition errors using in-domain confidence and discourse coherence)
Ian R. Lane, Tatsuya Kawahara
Spoken Language Communications Research Laboratories, ATR
School of Informatics, Kyoto University
Introduction
• Current ASR technologies are not robust against:
• Acoustic mismatch: noise, channel, speaker variance
• Linguistic mismatch: disfluencies, out-of-vocabulary (OOV) and out-of-domain (OOD) utterances
• Assess the confidence of the recognition hypothesis and detect recognition errors → effective user feedback
• Select a recovery strategy based on the type of error and the specific application
Previous Work on Confidence Measures
• Feature-based
• [Kemp] word duration, AM/LM back-off
• Explicit model-based
• [Rahim] likelihood-ratio test against a cohort model
• Posterior-probability-based
• [Komatani, Soong, Wessel] estimate the posterior probability of a hypothesis given all competing hypotheses in a word graph
→ These approaches are limited to "low-level" information available during ASR decoding
Proposed Approach
• Exploit knowledge sources outside the ASR framework for estimating recognition confidence, e.g. knowledge about the application domain and discourse flow
→ Incorporate confidence measures (CMs) based on "high-level" knowledge sources:
• In-domain confidence: degree of match between the utterance and the application domain
• Discourse coherence: consistency between consecutive utterances in the dialogue
Utterance Verification Framework
• CMin-domain(Xi): in-domain confidence
• CMdiscourse(Xi|Xi-1): discourse coherence
• CM(Xi): joint confidence score, combining the above with the generalized posterior probability CMgpp(Xi)
• Per-utterance pipeline: input utterance Xi → ASR front-end → topic classification → in-domain verification (CMin-domain(Xi)) → out-of-domain detection; the distance dist(Xi, Xi-1) between the current and previous utterances yields CMdiscourse(Xi|Xi-1)
In-domain Confidence
• Measure of topic consistency with the application domain
• Previously applied to out-of-domain utterance detection
• Examples of errors detected via in-domain confidence:
• Mismatch of domain
• REF: How can I print this WORD file double-sided
• ASR: How can I open this word on the pool-side
→ hypothesis not topically consistent → in-domain confidence low
• Erroneous recognition hypothesis
• REF: I want to go to Kyoto, can I go by bus
• ASR: I want to go to Kyoto, can I take a bath
→ hypothesis not topically consistent → in-domain confidence low
(REF: correct transcription; ASR: speech recognition hypothesis)
In-domain Confidence
Input utterance Xi (recognition hypothesis)
→ Transformation to vector space (feature vector)
→ Classification over multiple topics, SVMs 1…m (topic confidence scores C(t1|Xi), …, C(tm|Xi))
→ In-domain verification Vin-domain(Xi)
→ CMin-domain(Xi): in-domain confidence
In-domain Confidence: Example
• Input utterance Xi (recognition hypothesis), e.g. "could I have a non-smoking seat"
• Transformation to vector space: binary features over words and word pairs, e.g. (a, an, …, room, …, seat, …, I+have, …) → (1, 0, …, 0, …, 1, …, 1, …)
• Classification over multiple topics (SVMs 1…m) gives topic confidence scores, e.g. accommodation 0.05, airplane 0.36, airport 0.94, …
• In-domain verification Vin-domain(Xi) → CMin-domain(Xi), e.g. 90%
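The feature-extraction and topic-scoring steps above can be sketched as follows. This is a toy illustration only: the vocabulary, topic list, and weight values are made up, and hand-set linear scorers stand in for the trained topic SVMs.

```python
import math

# Toy vocabulary of word and word-pair features (the actual system uses a
# much larger feature set derived from the BTEC corpus).
VOCAB = ["a", "an", "room", "seat", "non-smoking", "I+have"]

# Hand-set linear weights standing in for the trained topic SVMs
# (one weight vector per topic; purely illustrative values).
TOPIC_WEIGHTS = {
    "accommodation": [0.1, 0.0, 0.9, -0.8, -0.5, 0.1],
    "airplane":      [0.0, 0.0, -0.3, 0.8, 0.7, 0.2],
    "airport":       [0.0, 0.0, -0.1, 0.3, 0.2, 0.1],
}

def to_vector(utterance):
    """Map an utterance to a binary feature vector over VOCAB."""
    words = set(utterance.lower().split())
    if "i" in words and "have" in words:
        words.add("i+have")  # word-pair feature
    return [1 if f.lower() in words else 0 for f in VOCAB]

def topic_scores(utterance):
    """Per-topic confidence scores C(t_j | X), squashed to (0, 1)."""
    x = to_vector(utterance)
    return {
        topic: 1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(ws, x))))
        for topic, ws in TOPIC_WEIGHTS.items()
    }

scores = topic_scores("could I have a non-smoking seat")
print(scores)  # higher score for "airplane" than for "accommodation"
```

With these toy weights the airplane topic dominates for the seat-reservation utterance, mirroring how the per-topic SVM confidences feed the verification stage.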
In-domain Verification Model
• A linear discriminant verification model is applied:
Vin-domain(Xi) = Σj λj · C(tj|Xi)
• λ1, …, λm are trained on in-domain data using "deleted interpolation of topics" and GPD [Lane '04]
• C(tj|Xi): topic classification confidence score of topic tj for input utterance Xi
• λj: discriminant weight for topic tj
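The linear verification model is a weighted sum over topic confidences; a minimal sketch, where the λ weights and topic scores are illustrative placeholders rather than trained GPD values:

```python
def v_in_domain(topic_scores, lambdas):
    """V_in-domain(X) = sum_j lambda_j * C(t_j | X)."""
    return sum(lambdas[t] * c for t, c in topic_scores.items())

# Toy example with three topics; in the actual model the weights are
# trained with deleted interpolation of topics and GPD on in-domain data.
scores  = {"accommodation": 0.05, "airplane": 0.94, "airport": 0.36}
lambdas = {"accommodation": 0.3, "airplane": 0.4, "airport": 0.3}
print(v_in_domain(scores, lambdas))  # 0.499
```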
Discourse Coherence
• Topic consistency with the preceding utterance
• Example of an error detected via discourse coherence:
• Erroneous recognition hypothesis
• Speaker A: previous utterance [Xi-1]
• REF: What type of shirt are you looking for?
• ASR: What type of shirt are you looking for?
• Speaker B: current utterance [Xi]
• REF: I'm looking for a white T-shirt.
• ASR: I'm looking for a white teacher.
→ topic not consistent across utterances → discourse coherence low
(REF: correct transcription; ASR: speech recognition hypothesis)
Discourse Coherence
• Euclidean distance between the current (Xi) and previous (Xi-1) utterances in topic confidence space:
dist(Xi, Xi-1) = √( Σj (C(tj|Xi) − C(tj|Xi-1))² )
• CMdiscourse(Xi|Xi-1) is large when Xi and Xi-1 are topically related, and low when they differ
Joint Confidence Score: Generalized Posterior Probability
• Confusability of the recognition hypothesis against competing hypotheses [Lo & Soong]
• At the utterance level, CMgpp(Xi) is computed from the word-level scores GWPP(xj)
• GWPP(xj): generalized word posterior probability of xj
• xj: j-th word in the recognition hypothesis of Xi
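One way to aggregate word-level GWPPs into an utterance-level score is the geometric mean; note this particular aggregation is an assumption for illustration, not necessarily the exact formula used here:

```python
import math

def cm_gpp(word_gwpps):
    """Utterance-level confidence from word-level GWPPs.

    Aggregation choice (an assumption, not necessarily the original
    formula): geometric mean of the word posteriors, so a single
    low-confidence word drags the whole utterance score down.
    """
    logs = [math.log(p) for p in word_gwpps]
    return math.exp(sum(logs) / len(logs))

print(cm_gpp([0.9, 0.8, 0.95]))  # close to 1: hypothesis likely correct
print(cm_gpp([0.9, 0.2, 0.95]))  # pulled down by one unreliable word
```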
Joint Confidence Score
CM(Xi) = λgpp·CMgpp(Xi) + λin-domain·CMin-domain(Xi) + λdiscourse·CMdiscourse(Xi|Xi-1)
• For utterance verification, CM(Xi) is compared to a threshold (θ)
• The model weights (λgpp, λin-domain, λdiscourse) and the threshold (θ) are trained on a development set
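The combination-and-threshold step can be sketched as follows; the weight and threshold values are placeholders (in practice both are trained on the development set):

```python
def joint_cm(cm_gpp, cm_indomain, cm_discourse, weights):
    """CM(X_i) as a weighted sum of the three confidence measures."""
    lam_gpp, lam_id, lam_dc = weights
    return lam_gpp * cm_gpp + lam_id * cm_indomain + lam_dc * cm_discourse

def accept(cm, threshold):
    """Accept the hypothesis when the joint confidence clears the threshold."""
    return cm >= threshold

# Toy weights and threshold (illustrative values only).
cm = joint_cm(0.7, 0.6, 0.8, (0.5, 0.3, 0.2))
print(cm, accept(cm, 0.6))  # 0.69 True
```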
Experimental Setup
• Training set: ATR BTEC (Basic Travel Expressions Corpus)
• ~400k sentences (Japanese/English pairs)
• 14 topic classes (accommodation, shopping, transit, …)
• Used to train the topic-classification and in-domain verification models
• Evaluation data: ATR MAD (Machine-Aided Dialogue)
• Natural dialogue between English and Japanese speakers via the ATR speech-to-speech translation system
• Dialogue data collected based on a set of pre-defined scenarios
• Development set: 270 dialogues; Test set: 90 dialogues
• On the development set, trained: CM sigmoid transforms, CM weights (λgpp, λin-domain, λdiscourse), and the verification threshold (θ)
Speech Recognition Performance
• ASR performed with ATRASR; a 2-gram LM is applied during decoding, and the lattice is rescored with a 3-gram LM
Evaluation Measure
• Utterance-based verification
• No definite "keyword" set exists in speech-to-speech translation
• If a recognition error occurs (one or more word errors), prompt the user to rephrase the entire utterance
• CER (confidence error rate): CER = (FA + FR) / (number of utterances) × 100
• FA: false acceptance of an incorrectly recognized utterance
• FR: false rejection of a correctly recognized utterance
GPP-based Verification Performance
• Accept All: assume all utterances are correctly recognized
• GPP: generalized posterior probability
• Large reduction in verification errors compared with the "Accept All" case
• CER: 17.3% (Japanese) and 15.3% (English)
Incorporation of IC and DC Measures (Japanese)
• GPP: generalized posterior probability; IC: in-domain confidence; DC: discourse coherence
• CER reduced by 5.7% and 4.6% for the "GPP+IC" and "GPP+DC" cases
• CER 17.3% → 15.9% (8.0% relative reduction) for the "GPP+IC+DC" case
Incorporation of IC and DC Measures (English)
• GPP: generalized posterior probability; IC: in-domain confidence; DC: discourse coherence
• Similar performance on the English side
• CER 15.3% → 14.4% for the "GPP+IC+DC" case
Conclusions
• Proposed a novel utterance verification scheme incorporating "high-level" knowledge:
• In-domain confidence: degree of match between the utterance and the application domain
• Discourse coherence: consistency between consecutive utterances
• Both proposed measures are effective
• Relative reductions in CER of 8.0% (Japanese) and 6.1% (English)
Future Work
• "High-level" content-based verification
• Ignore ASR errors that do not affect translation quality → further improvement in performance
• Topic switching
• Determine when users switch tasks; currently a single task is considered per dialogue session