belief updating in spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
acknowledgements: Alex Rudnicky, CMU; Eric Horvitz & Tim Paek, MSR; Antoine Raux

spoken dialog systems
• use natural language processing technology
• engage in a goal-oriented conversation
• research community
  • information access, command-and-control
  • personal assistants, taskable agents, tutoring systems
• industry
  • simpler automated phone systems

the problem … misunderstandings lead to interaction breakdowns
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT same PAY]
S: Sorry I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [for minute since HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING to berlin P_M]
S: traveling in the afternoon. Okay what day would you be departing chicago
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [flight destination mr WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………

understanding errors
• stem from the speech recognition process
• difficult operating conditions
• typical word-error-rates
  • 20-30%
  • up to 50% for non-natives
• strong negative impact on interactions

pathways to increased robustness
• gracefully handle errors through conversation
• improve recognition
• detect the problems
• develop a set of recovery strategies
• know how to choose between them (policy)

guarding against misunderstandings
(recognition results are shown with their confidence scores)
S: […] What city are you leaving from?
U: [for minute since HEY] / 0.07
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: [CHICAGO] / 0.72
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65
arrival = {Seoul / 0.65}
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
S: traveling in the afternoon. Okay what day would you be departing chicago
U: [AT THE TENTH OF AUGUST] / 0.58
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: [flight destination mr WEEKEND] / 0.28
confirmation actions
• reject
• explicit confirmation: Did you say Seoul?
• implicit confirmation: traveling to Seoul … What day did you need to travel?
• accept

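One simple way to connect confidence scores to the four confirmation actions is a threshold policy. The sketch below is only an illustration: the function name and the three cut-off values are hypothetical and do not come from the talk, which treats the choice of action as a policy question.

```python
from typing import Literal

Action = Literal["reject", "explicit_confirm", "implicit_confirm", "accept"]


def choose_confirmation_action(
    confidence: float,
    reject_below: float = 0.2,     # hypothetical cut-off
    explicit_below: float = 0.5,   # hypothetical cut-off
    implicit_below: float = 0.8,   # hypothetical cut-off
) -> Action:
    """Map a recognition confidence score in [0, 1] to a confirmation action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < reject_below:
        return "reject"
    if confidence < explicit_below:
        return "explicit_confirm"
    if confidence < implicit_below:
        return "implicit_confirm"
    return "accept"
```

With these illustrative cut-offs, a score of 0.07 is rejected and a score of 0.65 triggers an implicit confirmation.
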
belief updating
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65
arrival = {Seoul / 0.65}
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
arrival = ?
[diagram: after each user turn, a function f maps the current beliefs (departure = { … }, arrival = { … }) to updated beliefs]

belief updating: problem statement
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
arrival = {Seoul / 0.65} → f → arrival = ?
• given
  • an initial belief B_initial(C) over concept C
  • a system action SA(C)
  • a user response R
• construct an updated belief
  • B_updated(C) ← f(B_initial(C), SA(C), R)

outline
• related work
• proposed approach
• data
• experiments and results
• effects on global performance
• conclusion and future work
related work : proposed approach : data : experiments and results : global performance : conclusion

detecting misunderstandings and corrections
• confidence annotation
  • word-level [Cox, Chase, Bansal, Ravishankar, etc.]
  • semantic confidence annotation [Walker, San-Segundo, Bohus, etc.]
• correction detection [Litman, Swerts, Hirschberg, Krahmer, Levow]
  • detect when the user corrects the system
arrival = {Seoul / 0.65}
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] Conf=0.35 Corr=0.47
arrival = ?

current solutions for tracking beliefs
• most systems only track single values
  • new values overwrite old values
• use simple heuristic rules
  • explicit confirmation
    S: did you say you wanted to fly to Seoul?
    • yes → trust hypothesis
    • no → delete hypothesis
    • “other” → non-understanding
  • implicit confirmation
    S: traveling to Seoul … what day did you need to travel?
    • rely on new values overwriting old values

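The heuristic rules above can be written down as a small single-value tracker. This is a minimal sketch, not the code of any particular system: the function name, the string labels, and the choice to leave the belief unchanged on an “other” response are assumptions made for illustration.

```python
from typing import Optional


def heuristic_update(
    current: Optional[str],
    system_action: str,
    response_kind: str,
    new_value: Optional[str] = None,
) -> Optional[str]:
    """Single-value belief tracking with simple heuristic rules.

    current       -- the one value currently held for the concept, or None
    system_action -- "explicit_confirm" or "implicit_confirm"
    response_kind -- "yes", "no" or "other" (used for explicit confirmation)
    new_value     -- a value for the same concept heard in the response, if any
    """
    if system_action == "explicit_confirm":
        if response_kind == "yes":
            return current   # trust the hypothesis
        if response_kind == "no":
            return None      # delete the hypothesis
        return current       # "other": non-understanding, belief unchanged (assumption)
    if system_action == "implicit_confirm":
        # rely on new values overwriting old values
        return new_value if new_value is not None else current
    raise ValueError(f"unsupported system action: {system_action}")
```

Note that under implicit confirmation a misrecognized correction simply leaves the old value in place, which is the failure seen in the Seoul example.
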
belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
• most accurate representation
  • probability distribution over the set of possible values
  [figure: distribution over all possible departure values (ABERDEEN, TX; ABILENE, TX; ALBANY, NY; ALBUQUERQUE, NM; ALEXANDRIA, LA; ALLAKAKET, AK; ALLENTOWN, PA; ALLIANCE, NE; ALPENA, MI; ALPINE, TX; … YUMA, AZ)]
• however
  • system “hears” only a small number of conflicting values for a concept throughout a session
  • max = 3 conflicting values heard
  • more than 1 value heard in only 7% of cases

belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
• compressed belief representation
  • k hypotheses + other
• dynamically add and drop hypotheses
  • remember m hypotheses, add n new ones (m+n=k)
• B…(C) is a multinomial variable of degree k+1
example: departure_city [k=3, m=2, n=1]
S: Did you say you were flying from Austin?
U: [NO ASPEN]
S: flying from Aspen… what is your destination?
U: [NO NO I DIDN’T THAT THAT]
[diagram: hypothesis sets over Austin, Houston, Boston, Aspen and other, before and after each turn]

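The compressed representation can be sketched as a small bookkeeping step that runs before each update. The code below is a hedged illustration: the function name, the rule of remembering the m most probable hypotheses, the zero placeholder mass for newly heard values, and the example probabilities are all assumptions, since the slides do not spell these details out.

```python
from typing import Dict, Iterable

OTHER = "other"


def compress_belief(
    belief: Dict[str, float],
    new_values: Iterable[str],
    m: int,
    n: int,
) -> Dict[str, float]:
    """Keep m remembered hypotheses, open up to n slots for newly heard
    values, and fold all remaining probability mass into `other`."""
    named = {v: p for v, p in belief.items() if v != OTHER}
    # Assumption: "remember m hypotheses" means the m most probable ones.
    kept = sorted(named, key=named.get, reverse=True)[:m]
    compressed = {v: named[v] for v in kept}
    added = 0
    for value in new_values:
        if value not in compressed and added < n:
            # A newly heard value starts with whatever mass it already had
            # (0.0 if never seen); the update model assigns its probability.
            compressed[value] = named.get(value, 0.0)
            added += 1
    compressed[OTHER] = max(0.0, 1.0 - sum(compressed.values()))
    return compressed


# Hypothetical numbers for a k=3 (m=2, n=1) departure_city belief.
before = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.1, OTHER: 0.1}
after = compress_belief(before, ["Aspen"], m=2, n=1)
```

Here `after` holds Boston, Austin, the new hypothesis Aspen, and `other`, so the belief stays a multinomial of degree k+1.
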
system action
B_updated(C) ← f(B_initial(C), SA(C), R)

user response
B_updated(C) ← f(B_initial(C), SA(C), R)

approach
B_updated(C) ← f(B_initial(C), SA(C), R)
• multinomial regression problem
• multinomial generalized linear model
  • sample efficient
• stepwise approach
  • feature selection
  • BIC to control over-fitting
• one separate model for each system action
  • B_updated(C) ← f_SA(C)(B_initial(C), R)

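A rough sketch of the pieces named above: a multinomial model that outputs a distribution over the k hypotheses plus other, one such model per system action, and the BIC score used to penalize extra features during stepwise selection. The softmax (multinomial logit) form is an assumption, as the slides do not state the link function, and the function names and data layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Weights = Sequence[Sequence[float]]  # one weight row per outcome


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn linear scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_belief(weights: Weights, features: Sequence[float]) -> List[float]:
    """Multinomial logit: one linear score per outcome (k hypotheses + other)."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


def update_belief(
    models: Dict[str, Weights],
    system_action: str,
    features: Sequence[float],
) -> List[float]:
    """f_SA(C): select the model trained for this system action and apply it."""
    return predict_belief(models[system_action], features)


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian information criterion; lower is better."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood
```

In a stepwise procedure, a candidate feature would be kept only if adding it lowers the BIC of the fitted model.
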
data
• collected with RoomLine
  • a phone-based mixed-initiative spoken dialog system
  • conference room reservation
• explicit and implicit confirmations
• simple heuristic rules for belief updating
  • explicit confirm: yes / no
  • implicit confirm: new values overwrite old ones

corpus
• user study
  • 46 participants (first-time users)
  • 10 scenario-based interactions each
• corpus
  • 449 sessions, 8848 user turns
  • orthographically transcribed
  • manually annotated
    • misunderstandings
    • corrections
    • correct concept values

models
• k=2 + other (m=1, n=1)
• k=3 + other (m=2, n=1)
• k=4 + other (m=3, n=1)
• full model
  • all features
• basic model
  • all features except priors and confusability
• runtime model
  • all features available at runtime

baselines
• initial baseline
  • accuracy of system beliefs before the update
• heuristic baseline
  • accuracy of heuristic update rule used by the system
• correction baseline
  • accuracy if we knew exactly when the user corrects the system

results for k=2 hyps + other
[figure: four bar charts, one per system action (explicit confirm, implicit confirm, request, other), with a percentage axis; bars: initial baseline (i), heuristic baseline (h), basic model (BM), full model (FM), runtime model (RM), correction baseline (c)]

a question remains … does this really matter?

a new user study …
• implemented models in RavenClaw
• 40 participants, first-time, non-native users
  • improvements more likely at high word-error-rates
• 10 scenario-driven interactions each
• between-subjects; 2 gender-balanced groups
  • control: RoomLine using heuristic update rules
  • treatment: RoomLine using runtime models

effect on task success
• logistic ANOVA on task success (p=0.009)
  logit(TaskSuccess) ← 2.09 - 0.05∙WER + 0.69∙Condition
[figure: probability of task success (0% to 100%) against word error rate (0% to 100%), one curve each for treatment and control; annotations: at 30% word error rate, treatment 78% vs. control 64%; control reaches 78% at 16% word error rate]

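The fitted model can be evaluated directly to see where the 78% and 64% figures come from. The sketch below assumes that WER is expressed in percentage points and that Condition is 1 for the treatment group and 0 for control; neither coding is stated on the slide, but together they reproduce the annotated percentages. The function names are hypothetical.

```python
import math


def task_success_probability(wer_percent: float, treatment: bool) -> float:
    """Evaluate logit(TaskSuccess) = 2.09 - 0.05*WER + 0.69*Condition."""
    condition = 1.0 if treatment else 0.0  # assumed coding of Condition
    logit = 2.09 - 0.05 * wer_percent + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-logit))


def equivalent_wer_reduction() -> float:
    """WER points that would shift the logit as much as the treatment does."""
    return 0.69 / 0.05


treatment_at_30 = task_success_probability(30.0, treatment=True)
control_at_30 = task_success_probability(30.0, treatment=False)
control_at_16 = task_success_probability(16.0, treatment=False)
```

Under these assumptions the treatment effect on the logit equals that of roughly 13.8 points of word error rate, which matches the gap between the 30% and 16% annotations.
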
how about efficiency?
• ANOVA on task duration for successful tasks
  Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
• significant improvement (p=0.0003)
• equivalent to 7.9% absolute reduction in word-error-rate

summary
U: [CHICAGO] / 0.72
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65
arrival = {Seoul / 0.65}
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
arrival = ?
S: traveling in the afternoon. Okay what day would you be departing chicago
[diagram: f updates departure = { … } and arrival = { … } after each turn]
• approach for constructing accurate beliefs
• integrate information across multiple turns
• large gains in task success and efficiency

other advantages
• learns from data
  • tuned to the domain in which it operates
• sample efficient / scalable
  • performs a local one-turn optimization
  • works independently on concepts
• portable
  • decoupled from dialog task specification
  • no strong assumptions about dialog management

future work
• integrate information from n-best list
• integrate other high-level knowledge
  • domain-specific constraints
  • inter-concept dependencies
• unsupervised / implicit learning
• domain-specificity

improvements at different WER
[figure: absolute improvement in task success (y-axis) against word-error-rate (x-axis)]

user study
• 10 scenarios, fixed order
• presented graphically (explained during briefing)
• participants compensated per task success

informative features
• priors and confusability
• initial confidence scores
• concept identity
• barge-in
• expectation match
• repeated grammar slots