Incorporating In-domain Confidence and Discourse Coherence Measures in Utterance Verification (ドメイン内の信頼度と談話の整合性を用いた音声認識誤りの検出: Detecting speech recognition errors using in-domain confidence and discourse coherence)
Ian R. Lane, Tatsuya Kawahara
Spoken Language Communications Research Laboratories, ATR
School of Informatics, Kyoto University
Introduction
• Current ASR technologies are not robust against:
• Acoustic mismatch: noise, channel, speaker variance
• Linguistic mismatch: disfluencies, out-of-vocabulary (OOV) and out-of-domain (OOD) utterances
• Assess the confidence of the recognition hypothesis and detect recognition errors → effective user feedback
• Select a recovery strategy based on the type of error and the specific application
Previous Work on Confidence Measures
• Feature-based
• [Kemp] word duration, AM/LM back-off
• Explicit model-based
• [Rahim] likelihood-ratio test against a cohort model
• Posterior-probability-based
• [Komatani, Soong, Wessel] estimate the posterior probability of a hypothesis given all competing hypotheses in a word graph
→ These approaches are limited to "low-level" information available during ASR decoding
Proposed Approach
• Exploit knowledge sources outside the ASR framework for estimating recognition confidence, e.g. knowledge about the application domain and discourse flow
→ Incorporate confidence measures (CMs) based on "high-level" knowledge sources:
• In-domain confidence: degree of match between the utterance and the application domain
• Discourse coherence: consistency between consecutive utterances in the dialogue
Utterance Verification Framework
• CMin-domain(Xi): in-domain confidence
• CMdiscourse(Xi|Xi-1): discourse coherence
• CM(Xi): joint confidence score, combining the above with the generalized posterior probability CMgpp(Xi)
• Per-utterance pipeline: input utterance Xi → ASR front-end → topic classification → in-domain verification (CMin-domain(Xi)) → out-of-domain detection; the distance dist(Xi, Xi-1) between the current and previous utterances yields CMdiscourse(Xi|Xi-1)
In-domain Confidence
• Measure of topic consistency with the application domain
• Previously applied to out-of-domain utterance detection
• Examples of errors detected via in-domain confidence:
• Mismatch of domain
• REF: How can I print this WORD file double-sided
• ASR: How can I open this word on the pool-side
→ hypothesis not topically consistent → in-domain confidence low
• Erroneous recognition hypothesis
• REF: I want to go to Kyoto, can I go by bus
• ASR: I want to go to Kyoto, can I take a bath
→ hypothesis not topically consistent → in-domain confidence low
(REF: correct transcription; ASR: speech recognition hypothesis)
In-domain Confidence
Input utterance Xi (recognition hypothesis)
→ Transformation to vector space (feature vector)
→ Classification over multiple topics, SVMs 1…m (topic confidence scores C(t1|Xi), …, C(tm|Xi))
→ In-domain verification Vin-domain(Xi)
→ CMin-domain(Xi): in-domain confidence
In-domain Confidence: Example
• Input utterance Xi (recognition hypothesis), e.g. "could I have a non-smoking seat"
• Transformation to vector space: binary features over words and word pairs, e.g. (a, an, …, room, …, seat, …, I+have, …) → (1, 0, …, 0, …, 1, …, 1, …)
• Classification over multiple topics (SVMs 1…m) gives topic confidence scores, e.g. accommodation 0.05, airplane 0.36, airport 0.94, …
• In-domain verification Vin-domain(Xi) → CMin-domain(Xi), e.g. 90%
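The feature-extraction and topic-scoring steps above can be sketched as follows. This is a toy illustration only: the vocabulary, topic list, and weight values are made up, and hand-set linear scorers stand in for the trained topic SVMs.

```python
import math

# Toy vocabulary of word and word-pair features (the actual system uses a
# much larger feature set derived from the BTEC corpus).
VOCAB = ["a", "an", "room", "seat", "non-smoking", "I+have"]

# Hand-set linear weights standing in for the trained topic SVMs
# (one weight vector per topic; purely illustrative values).
TOPIC_WEIGHTS = {
    "accommodation": [0.1, 0.0, 0.9, -0.8, -0.5, 0.1],
    "airplane":      [0.0, 0.0, -0.3, 0.8, 0.7, 0.2],
    "airport":       [0.0, 0.0, -0.1, 0.3, 0.2, 0.1],
}

def to_vector(utterance):
    """Map an utterance to a binary feature vector over VOCAB."""
    words = set(utterance.lower().split())
    if "i" in words and "have" in words:
        words.add("i+have")  # word-pair feature
    return [1 if f.lower() in words else 0 for f in VOCAB]

def topic_scores(utterance):
    """Per-topic confidence scores C(t_j | X), squashed to (0, 1)."""
    x = to_vector(utterance)
    return {
        topic: 1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(ws, x))))
        for topic, ws in TOPIC_WEIGHTS.items()
    }

scores = topic_scores("could I have a non-smoking seat")
print(scores)  # higher score for "airplane" than for "accommodation"
```

With these toy weights the airplane topic dominates for the seat-reservation utterance, mirroring how the per-topic SVM confidences feed the verification stage.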
In-domain Verification Model
• A linear discriminant verification model is applied:
Vin-domain(Xi) = Σj λj · C(tj|Xi)
• λ1, …, λm are trained on in-domain data using "deleted interpolation of topics" and GPD [Lane '04]
• C(tj|Xi): topic classification confidence score of topic tj for input utterance Xi
• λj: discriminant weight for topic tj
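The linear verification model is a weighted sum over topic confidences; a minimal sketch, where the λ weights and topic scores are illustrative placeholders rather than trained GPD values:

```python
def v_in_domain(topic_scores, lambdas):
    """V_in-domain(X) = sum_j lambda_j * C(t_j | X)."""
    return sum(lambdas[t] * c for t, c in topic_scores.items())

# Toy example with three topics; in the actual model the weights are
# trained with deleted interpolation of topics and GPD on in-domain data.
scores  = {"accommodation": 0.05, "airplane": 0.94, "airport": 0.36}
lambdas = {"accommodation": 0.3, "airplane": 0.4, "airport": 0.3}
print(v_in_domain(scores, lambdas))  # 0.499
```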
Discourse Coherence
• Topic consistency with the preceding utterance
• Example of an error detected via discourse coherence:
• Erroneous recognition hypothesis
• Speaker A: previous utterance [Xi-1]
• REF: What type of shirt are you looking for?
• ASR: What type of shirt are you looking for?
• Speaker B: current utterance [Xi]
• REF: I'm looking for a white T-shirt.
• ASR: I'm looking for a white teacher.
→ topic not consistent across utterances → discourse coherence low
(REF: correct transcription; ASR: speech recognition hypothesis)
Discourse Coherence
• Euclidean distance between the current (Xi) and previous (Xi-1) utterances in topic confidence space:
dist(Xi, Xi-1) = √( Σj (C(tj|Xi) − C(tj|Xi-1))² )
• CMdiscourse(Xi|Xi-1) is large when Xi and Xi-1 are topically related, and low when they differ
Joint Confidence Score: Generalized Posterior Probability
• Confusability of the recognition hypothesis against competing hypotheses [Lo & Soong]
• At the utterance level, CMgpp(Xi) is computed from the word-level scores GWPP(xj)
• GWPP(xj): generalized word posterior probability of xj
• xj: j-th word in the recognition hypothesis of Xi
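One way to aggregate word-level GWPPs into an utterance-level score is the geometric mean; note this particular aggregation is an assumption for illustration, not necessarily the exact formula used here:

```python
import math

def cm_gpp(word_gwpps):
    """Utterance-level confidence from word-level GWPPs.

    Aggregation choice (an assumption, not necessarily the original
    formula): geometric mean of the word posteriors, so a single
    low-confidence word drags the whole utterance score down.
    """
    logs = [math.log(p) for p in word_gwpps]
    return math.exp(sum(logs) / len(logs))

print(cm_gpp([0.9, 0.8, 0.95]))  # close to 1: hypothesis likely correct
print(cm_gpp([0.9, 0.2, 0.95]))  # pulled down by one unreliable word
```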
Joint Confidence Score
CM(Xi) = λgpp·CMgpp(Xi) + λin-domain·CMin-domain(Xi) + λdiscourse·CMdiscourse(Xi|Xi-1)
• For utterance verification, CM(Xi) is compared to a threshold (θ)
• The model weights (λgpp, λin-domain, λdiscourse) and the threshold (θ) are trained on a development set
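The combination-and-threshold step can be sketched as follows; the weight and threshold values are placeholders (in practice both are trained on the development set):

```python
def joint_cm(cm_gpp, cm_indomain, cm_discourse, weights):
    """CM(X_i) as a weighted sum of the three confidence measures."""
    lam_gpp, lam_id, lam_dc = weights
    return lam_gpp * cm_gpp + lam_id * cm_indomain + lam_dc * cm_discourse

def accept(cm, threshold):
    """Accept the hypothesis when the joint confidence clears the threshold."""
    return cm >= threshold

# Toy weights and threshold (illustrative values only).
cm = joint_cm(0.7, 0.6, 0.8, (0.5, 0.3, 0.2))
print(cm, accept(cm, 0.6))  # 0.69 True
```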
Experimental Setup
• Training set: ATR BTEC (Basic Travel Expressions Corpus)
• ~400k sentences (Japanese/English pairs)
• 14 topic classes (accommodation, shopping, transit, …)
• Used to train the topic-classification and in-domain verification models
• Evaluation data: ATR MAD (Machine-Aided Dialogue)
• Natural dialogue between English and Japanese speakers via the ATR speech-to-speech translation system
• Dialogue data collected based on a set of pre-defined scenarios
• Development set: 270 dialogues; Test set: 90 dialogues
• On the development set, trained: CM sigmoid transforms, CM weights (λgpp, λin-domain, λdiscourse), and the verification threshold (θ)
Speech Recognition Performance
• ASR performed with ATRASR; a 2-gram LM is applied during decoding, and the lattice is rescored with a 3-gram LM
Evaluation Measure
• Utterance-based verification
• No definite "keyword" set exists in speech-to-speech translation
• If a recognition error occurs (one or more word errors), prompt the user to rephrase the entire utterance
• CER (confidence error rate): CER = (FA + FR) / (number of utterances) × 100
• FA: false acceptance of an incorrectly recognized utterance
• FR: false rejection of a correctly recognized utterance
GPP-based Verification Performance
• Accept All: assume all utterances are correctly recognized
• GPP: generalized posterior probability
• Large reduction in verification errors compared with the "Accept All" case
• CER: 17.3% (Japanese) and 15.3% (English)
Incorporation of IC and DC Measures (Japanese)
• GPP: generalized posterior probability; IC: in-domain confidence; DC: discourse coherence
• CER reduced by 5.7% and 4.6% for the "GPP+IC" and "GPP+DC" cases
• CER 17.3% → 15.9% (8.0% relative reduction) for the "GPP+IC+DC" case
Incorporation of IC and DC Measures (English)
• GPP: generalized posterior probability; IC: in-domain confidence; DC: discourse coherence
• Similar performance on the English side
• CER 15.3% → 14.4% for the "GPP+IC+DC" case
Conclusions
• Proposed a novel utterance verification scheme incorporating "high-level" knowledge:
• In-domain confidence: degree of match between the utterance and the application domain
• Discourse coherence: consistency between consecutive utterances
• Both proposed measures are effective
• Relative reductions in CER of 8.0% (Japanese) and 6.1% (English)
Future Work
• "High-level" content-based verification
• Ignore ASR errors that do not affect translation quality → further improvement in performance
• Topic switching
• Determine when users switch tasks; currently a single task is considered per dialogue session