“k hypotheses + other” belief updating in spoken dialog systems
Dialogs on Dialogs Talk, March 2006
Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus | dbohus@cs.cmu.edu
problem
spoken language interfaces lack robustness when faced with understanding errors
• errors stem mostly from speech recognition
• typical word error rates: 20-30%
• significant negative impact on interactions
guarding against understanding errors
• use confidence scores
  • machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]
• engage in confirmation actions
  • explicit confirmation: did you say you wanted to fly to Seoul?
    • yes → trust hypothesis
    • no → delete hypothesis
    • “other” → non-understanding
  • implicit confirmation: traveling to Seoul … what day did you need to travel?
    • rely on new values overwriting old values
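The confirmation heuristics above can be summarized in a short sketch; the data shapes and names (the belief dict, "explicit_confirm", and the function itself) are assumptions made for this example, not the actual RoomLine/RavenClaw code:

```python
# Illustrative sketch of the confirmation heuristics listed above.

def heuristic_update(belief, system_action, user_response):
    """belief: {"value": str, "conf": float} or None (no hypothesis yet)
    system_action: "explicit_confirm" or "implicit_confirm"
    user_response: {"type": "yes" | "no" | "other" | "new_value",
                    "value": str, "conf": float}"""
    if system_action == "explicit_confirm":
        if user_response["type"] == "yes":      # yes -> trust the hypothesis
            return {"value": belief["value"], "conf": 1.0}
        if user_response["type"] == "no":       # no -> delete the hypothesis
            return None
        return belief                           # "other" -> treat as non-understanding

    if system_action == "implicit_confirm":
        if user_response["type"] == "new_value":  # new values overwrite old ones
            return {"value": user_response["value"],
                    "conf": user_response["conf"]}
        return belief

    return belief

# example: an explicit confirmation answered with "no" deletes the current hypothesis
print(heuristic_update({"value": "seoul", "conf": 0.65},
                       "explicit_confirm", {"type": "no"}))   # -> None
```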
today’s talk …
construct accurate beliefs by integrating information over multiple turns in a conversation

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
   destination = {?}
belief updating: problem statement
• given
  • an initial belief Binitial(C) over concept C
  • a system action SA
  • a user response R
• construct an updated belief
  • Bupdated(C) ← f(Binitial(C), SA, R)

destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion
belief updating: problem statement
• given
  • an initial belief Binitial(C) over concept C
  • a system action SA(C)
  • a user response R
• construct an updated belief
  • Bupdated(C) ← f(Binitial(C), SA(C), R)

destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
• most accurate representation: a probability distribution over the set of possible values
• however, the system will “hear” only a small number of conflicting values for a concept within a dialog session
• in our data
  • at most 3 conflicting values were heard
  • more than 1 value was heard in only 6.9% of cases
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
• compressed belief representation: k hypotheses + other
• at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m + n = k)
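A minimal sketch of this compression step, under the assumption that it only decides which values the belief keeps track of, while the probabilities over those values (plus "other") are assigned by the learned update model; function and variable names are hypothetical:

```python
# Sketch of selecting the k = m + n value slots for a "k hypotheses + other" belief.

def select_hypotheses(initial_belief, heard_hypotheses, m, n):
    """initial_belief:   dict value -> probability (current compressed belief)
    heard_hypotheses: dict value -> confidence (values heard in this turn)
    Returns the k = m + n values the updated belief will track."""
    keep = [v for v, _ in sorted(initial_belief.items(),
                                 key=lambda kv: kv[1], reverse=True)[:m]]
    add = [v for v, _ in sorted(heard_hypotheses.items(),
                                key=lambda kv: kv[1], reverse=True)
           if v not in keep][:n]
    return keep + add

# k = 2: keep the top initial hypothesis, add the top newly heard hypothesis
slots = select_hypotheses({"seoul": 0.65}, {"berlin": 0.60}, m=1, n=1)
# slots == ["seoul", "berlin"]; the updated belief is then a distribution over
# {"seoul", "berlin", "other"}
```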
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
• B(C) modeled as a multinomial variable over {h1, h2, …, hk, other}
• B(C) = <c_h1, c_h2, …, c_hk, c_other>, where c_h1 + c_h2 + … + c_hk + c_other = 1
• belief updating can then be cast as a multinomial regression problem:
  Bupdated(C) ← Binitial(C) + SA(C) + R
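One way to write this casting explicitly, as a sketch of a softmax/multinomial-regression parameterization (not necessarily the exact model form used in the talk):

```latex
% Sketch: x collects features of the initial belief, the system action, and the
% user response; the exact feature set and link used in the talk may differ.
\[
P\big(B_{\mathrm{updated}}(C) = h_i \mid x\big)
  \;=\; \frac{\exp(w_i^{\top} x)}
             {\sum_{j \in \{h_1,\dots,h_k,\mathrm{other}\}} \exp(w_j^{\top} x)},
\qquad
x = \big[\,B_{\mathrm{initial}}(C),\; SA(C),\; R\,\big]
\]
```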
system action Bupdated(C) ← f(Binitial(C), SA(C), R)
user response Bupdated(C) ← f(Binitial(C), SA(C), R)
approach
Bupdated(C) ← f(Binitial(C), SA(C), R)
• problem: <uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)
• approach: multinomial generalized linear model
  • regression model with a multinomial dependent variable
  • sample efficient
• stepwise approach
  • feature selection
  • BIC to control over-fitting
• one model for each system action:
  <uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)
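A sketch of this modeling recipe, assuming a statsmodels-based setup with a hypothetical data-frame layout and feature names; this is not the original implementation:

```python
# Sketch: for each system action, fit a multinomial regression over
# {h1, ..., hk, other} with forward stepwise feature selection scored by BIC.
import numpy as np
import statsmodels.api as sm

def fit_stepwise_mnlogit(df, target, candidate_features):
    """df: one row per (turn, concept) for a single system action;
    target: column holding the correct label in {h1, ..., hk, other};
    candidate_features: columns such as initial confidence score, expectation
    match, barge-in, repeated grammar slots, priors/confusability."""
    selected, best_bic = [], np.inf
    improved = True
    while improved:
        improved, best_feat = False, None
        for feat in candidate_features:
            if feat in selected:
                continue
            X = sm.add_constant(df[selected + [feat]])
            result = sm.MNLogit(df[target], X).fit(disp=0)
            if result.bic < best_bic:           # BIC controls over-fitting
                best_bic, best_feat, improved = result.bic, feat, True
        if improved:
            selected.append(best_feat)
    return selected, best_bic

# one model per system action, e.g.:
# models = {a: fit_stepwise_mnlogit(df[df.action == a], "correct_hyp", features)
#           for a in df.action.unique()}
```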
outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion
data
• collected with RoomLine
  • a phone-based mixed-initiative spoken dialog system
  • conference room reservation
  • explicit and implicit confirmations
• simple heuristic rules for belief updating
  • explicit confirm: yes / no
  • implicit confirm: new values overwrite old ones
corpus
• user study
  • 46 participants (naïve users)
  • 10 scenario-based interactions each
  • compensated per task success
• corpus
  • 449 sessions, 8848 user turns
  • orthographically transcribed
  • manually annotated for misunderstandings, corrections, and correct concept values
outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion
baselines
• initial baseline: accuracy of system beliefs before the update
• heuristic baseline: accuracy of the heuristic update rule used by the system
• oracle baseline: accuracy if we knew exactly when the user corrects
k=2 hypotheses + other
informative features:
• priors and confusability
• initial confidence score
• concept identity
• barge-in
• expectation match
• repeated grammar slots
outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion
a question remains … does this really matter? what is the effect on global dialog performance?
let’s run an experiment
• guinea pigs from Speech Lab for exp: $0
• getting change from guys in the lab: $2/$3/$5
• real subjects for the experiment: $25
• picture with advisor of the VERY last exp at CMU: priceless!!!!
[courtesy of Mohit Kumar]
a new user study …
• implemented the models in RavenClaw and performed a new user study
• 40 participants, first-time users
• 10 scenario-driven interactions each
• non-native speakers of North-American English
  • improvements more likely at higher WER, supported by empirical evidence
• between-subjects design; 2 gender-balanced groups
  • control: RoomLine using heuristic update rules
  • treatment: RoomLine using runtime models
effect on task success

                     control    treatment
average user WER     21.9%      24.2%
task success         73.6%      81.3%

even though the treatment group faced a higher word error rate, it achieved higher task success
effect on task success … a closer look
[figure: probability of task success vs. word error rate for the two conditions; data labels: 78%, 78%, 64%, at 16% WER and 30% WER]
Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition (p=0.001)
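Reading the fitted model as a logistic regression over binary task success, and assuming Condition is coded 0 for control and 1 for treatment, the equation predicts, for example:

```latex
% Worked example under two assumptions: logistic link, Condition = 0 control / 1 treatment.
\[
P(\text{success}) = \sigma\big(2.09 - 0.05\cdot\mathrm{WER} + 0.69\cdot\mathrm{Condition}\big),
\qquad \sigma(z) = \tfrac{1}{1+e^{-z}}
\]
\[
\mathrm{WER}=30:\;\; \sigma(0.59)\approx 0.64 \text{ (control)}, \quad
                     \sigma(1.28)\approx 0.78 \text{ (treatment)};
\qquad
\mathrm{WER}=16:\;\; \sigma(1.29)\approx 0.78 \text{ (control)}
\]
```

Under these assumptions, the treatment condition at 30% WER reaches roughly the control condition's predicted success at 16% WER, which is consistent with the 78% / 78% / 64% labels recovered from the figure.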
improvements at different WER
[figure: absolute improvement in task success as a function of word error rate]
effect on task duration (for successful tasks)
• ANOVA on task duration for successful tasks:
  Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
• significant improvement, equivalent to a 7.9% absolute reduction in WER
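The stated WER equivalence can be checked directly from the coefficients shown above; the small discrepancy with the reported 7.9% presumably comes from rounding of the displayed estimates:

```latex
% The Condition effect offsets roughly 0.106 / 0.013 ~ 8 absolute WER points,
% in line with the reported 7.9%.
\[
\frac{0.106}{0.013} \approx 8.2
\]
```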
outline • proposed approach • data • experiments and results • effect on dialog performance • conclusion
summary
• data-driven approach for constructing accurate system beliefs
  • integrates information across multiple turns
  • bridges the detection of misunderstandings and the detection of corrections
• significantly outperforms the current heuristics
• significantly improves dialog effectiveness and efficiency
other advantages
• sample efficient
• performs a local, one-turn optimization
  • good local performance leads to good global performance
• scalable
  • works independently on concepts (29 concepts, varying cardinalities)
• portable
  • decoupled from the dialog task specification
  • doesn’t make strong assumptions about dialog management technology
user study
• 10 scenarios, fixed order
• presented graphically (explained during briefing)
• participants compensated per task success