a “k-hypotheses + other” belief updating model

Dan Bohus, Alex Rudnicky
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
acknowledgements: Tim Paek, Eric Horvitz (Microsoft Research)

motivation
• spoken language interfaces are still very brittle [Parade, Sunday, March 26]

misunderstandings
(U: [recognized words] / confidence score ← what the user said)
S: What city are you leaving from?
U: [CHICAGO] / 0.72 ← Chicago
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65 ← Huntsville
   arrival = {Seoul / 0.65}
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35 ← no no I’m traveling to Birmingham
S: traveling in the afternoon… okay, what day would you be departing Chicago?
U: [AT THE TENTH OF AUGUST] / 0.58 ← the tenth of August
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: [flight destination mr WEEKEND] / 0.28 ← my destination is Birmingham
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ……

misunderstandings: beliefs after each user turn
U: [CHICAGO] / 0.72 → arrival = { … }, departure = { … }
U: [SEOUL] / 0.65 → arrival = {Seoul / 0.65}, departure = { … }
U: [THE TRAVELING to berlin P_M] / 0.35 → f → arrival = ?, departure = { … }
U: [AT THE TENTH OF AUGUST] / 0.58 → arrival = { … }, departure = { … }
U: [flight destination mr WEEKEND] / 0.28 → arrival = { … }, departure = { … }

belief updating: problem statement
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M]
arrival = {Seoul / 0.65} → f → arrival = ?
• given
  • an initial belief B_initial(C) over concept C
  • a system action SA(C)
  • a user response R
• construct an updated belief
  • B_updated(C) ← f(B_initial(C), SA(C), R)

outline
• introduction
• current solutions
• approach
• experimental results
• effects on global performance
• conclusion and future work

current solutions
• confidence scores / detecting misunderstandings [Cox, Chase, Bansal, Hazen, Ravishankar, Walker, San-Segundo, Bohus]
• detecting corrections [Litman, Swerts, Hirschberg, Krahmer, Levow]
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65
S: traveling to Seoul… what day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
arrival = {Seoul / 0.65} → f → arrival = ?
• track single values
• use simple heuristic belief updating rules
  • explicit confirmations
    • yes / no
  • implicit confirmations
    • new values overwrite old values
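The heuristic rules listed above can be sketched roughly as follows. This is only an illustration: the class and function names are hypothetical, and the slide does not say what confidence a "yes" assigns or what a "no" leaves behind, so setting the confidence to 1.0 and clearing the value are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SingleValueBelief:
    """A belief that tracks one value and its confidence (hypothetical type)."""
    value: Optional[str] = None
    confidence: float = 0.0


def heuristic_update(belief, system_action, answer=None,
                     new_value=None, new_confidence=0.0):
    """Apply simple heuristic update rules to a single-value belief."""
    if system_action == "explicit_confirm":
        if answer == "yes":
            # Assumption: a "yes" fully confirms the current value.
            return SingleValueBelief(belief.value, 1.0)
        if answer == "no":
            # Assumption: a "no" discards the current value.
            return SingleValueBelief(None, 0.0)
        return belief
    if system_action == "implicit_confirm":
        if new_value is not None:
            # New values overwrite old values, whatever their confidence.
            return SingleValueBelief(new_value, new_confidence)
        return belief
    return belief


before = SingleValueBelief("Seoul", 0.65)
# A hypothetical low-confidence value heard after an implicit confirmation.
after = heuristic_update(before, "implicit_confirm",
                         new_value="Berlin", new_confidence=0.35)
print(after)
```

Under these rules the low-confidence new value replaces the higher-confidence old one, which is the kind of behavior a learned update function f is meant to improve on.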



belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
• probability distribution over the set of possible values
  • departure: YUMA, AZ; ALPINE, TX; ALPENA, MI; ALBANY, NY; ABILENE, TX; ALLIANCE, NE; ABERDEEN, TX; ALLAKAKET, AK; ALLENTOWN, PA; ALEXANDRIA, LA; ALBUQUERQUE, NM
• however
  • system “hears” only a small number of conflicting values for a concept throughout a session
  • max = 3 conflicting values heard

belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
• compressed belief representation
  • k hypotheses + other
• dynamically add and drop hypotheses
  • remember m hypotheses, add n new ones (m+n=k)
• B…(C) is a multinomial variable of degree k+1
departure_city [k=3, m=2, n=1]
  initial: Boston, Austin, Houston, other
  S: Did you say you were flying from Austin?
  U: [NO ASPEN]
  remembered + new: Boston, Austin, other, Ø ← Aspen
  updated: Boston, Aspen, other
  S: flying from Aspen… what is your destination?
  U: [NO NO I DIDN’T THAT THAT]
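The bookkeeping behind "remember m hypotheses, add n new ones" might look like the sketch below. It covers only how the k+1 outcomes are chosen, not how their probabilities are updated. The function names and the probabilities in the example are hypothetical.

```python
from typing import Dict, List

OTHER = "other"


def compress(belief: Dict[str, float], m: int) -> Dict[str, float]:
    """Keep the m most probable named hypotheses; fold the rest into 'other'."""
    named = {v: p for v, p in belief.items() if v != OTHER}
    kept = sorted(named, key=named.get, reverse=True)[:m]
    out = {v: named[v] for v in kept}
    out[OTHER] = 1.0 - sum(out.values())
    return out


def candidate_outcomes(belief: Dict[str, float], heard: List[str],
                       m: int, n: int) -> List[str]:
    """Outcomes of the updated belief: m remembered, up to n new, plus 'other'."""
    remembered = [v for v in compress(belief, m) if v != OTHER]
    new = [v for v in heard if v not in remembered][:n]
    return remembered + new + [OTHER]


# Hypothetical probabilities for the departure_city example (k=3, m=2, n=1).
initial = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.1, OTHER: 0.1}
print(compress(initial, m=2))
print(candidate_outcomes(initial, heard=["Aspen"], m=2, n=1))
```

Folding dropped hypotheses into "other" keeps the belief a proper distribution while bounding its size at k+1 outcomes.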

system action
B_updated(C) ← f(B_initial(C), SA(C), R)

user response
B_updated(C) ← f(B_initial(C), SA(C), R)

approach
B_updated(C) ← f(B_initial(C), SA(C), R)
• multinomial regression problem
• multinomial generalized linear model
  • sample efficient
  • stepwise approach → feature selection
• one separate model for each system action
  • B_updated(C) ← f_SA(C)(B_initial(C), R)
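A minimal sketch of a multinomial logistic model with one weight set per system action, under stated assumptions: the outcome order (initial hypothesis as reference class, then the new hypothesis, then "other"), the feature names and all weights below are hypothetical and are not the trained models.

```python
import math
from typing import Dict, List

# feature name -> one weight per non-reference outcome (new hypothesis, other)
Weights = Dict[str, List[float]]


def multinomial_logit(weights: Weights, features: Dict[str, float]) -> List[float]:
    """Return P(initial), P(new), P(other) with 'initial' as reference class."""
    n_other_classes = len(next(iter(weights.values())))
    scores = [0.0]  # the reference class always scores 0
    for j in range(n_other_classes):
        scores.append(sum(w[j] * features.get(name, 0.0)
                          for name, w in weights.items()))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One separate (hypothetical) model per system action.
MODELS: Dict[str, Weights] = {
    "explicit_confirm": {"bias": [-2.0, -1.0], "answer_no": [3.0, 1.0]},
    "implicit_confirm": {"bias": [-1.5, -1.0], "mark_disconfirm": [2.5, 1.0]},
}


def update(system_action: str, features: Dict[str, float]) -> List[float]:
    """B_updated(C) <- f_SA(C)(B_initial(C), R), with R encoded as features."""
    return multinomial_logit(MODELS[system_action], features)


print(update("explicit_confirm", {"bias": 1.0, "answer_no": 1.0}))
print(update("explicit_confirm", {"bias": 1.0}))
```

Selecting the weight set by system action is what makes f_SA(C) a separate model per action rather than a single model with the action as one more input.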


data
• RoomLine
  • conference room reservations
  • explicit and implicit confirmations
• user study
  • 46 participants
  • 10 scenario-based interactions each
• corpus
  • 449 sessions, 8848 user turns
  • transcribed & annotated
    • misunderstandings, corrections, correct concept values

model performance
• initial baseline (i) [error before update]
• heuristic baseline (h) [error after heuristic update]
• correction baseline (c) [error if we had perfect correction detection]
• Model (M) [k=2, all features]
[figure: error rate per system action, four bar-chart panels]
• explicit confirm and implicit confirm (bars i, h, M, c; axis 0% to 30%): 30.8, 30.3, 26.0, 21.5, 16.1, 15.0, 6.2, 5.0
• request (bars i, h, M; axis 0% to 12%) and no action (bars i, h, M; axis 0% to 45%): 98.2, 79.7, 44.8, 9.5, 5.7, 14.8

a new user study …
• implemented models in the system
• 2nd, between-subjects experiment
  • control: using heuristic update rules
  • treatment: using belief updating models
• 40 participants, non-native users
  • improvements more likely at high word-error-rates

effect on task success
• logistic ANOVA on task success
  • logit(TaskSuccess) ← 2.09 - 0.05∙WER + 0.69∙Condition
  • p=0.009
[figure: probability of task success (0% to 100%) vs. word error rate (0% to 100%), treatment and control curves]
• at 30% word error rate: treatment 78%, control 64%
• at 16% word error rate: control 78%
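The fitted model can be evaluated directly, as in the sketch below. The slide does not state the variable coding, so two assumptions are made: WER is entered in percentage points, and Condition is 1 for treatment and 0 for control. With that coding the formula reproduces the percentages on the slide.

```python
import math


def p_task_success(wer_percent: float, treatment: bool) -> float:
    """Probability of task success from the fitted logistic model."""
    condition = 1.0 if treatment else 0.0  # assumed coding: treatment = 1
    logit = 2.09 - 0.05 * wer_percent + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-logit))


for wer, treatment in [(30, True), (30, False), (16, False)]:
    label = "treatment" if treatment else "control"
    print(f"WER {wer}%, {label}: {p_task_success(wer, treatment):.0%}")
# WER 30%, treatment: 78%
# WER 30%, control: 64%
# WER 16%, control: 78%
```

The Condition coefficient is worth about 0.69 / 0.05 ≈ 13.8 points of word error rate, so the treatment at 30% WER performs about as well as the control at 16% WER.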

how about efficiency?
• ANOVA on task duration for successful tasks
  • Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
  • p=0.0003
• significant improvement
• equivalent to 7.9% absolute reduction in word-error

summary
U: [CHICAGO] / 0.72
S: traveling from Chicago. Where would you like to go?
U: [SEOUL] / 0.65
S: traveling to Seoul… What day did you need to travel?
U: [THE TRAVELING to berlin P_M] / 0.35
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
beliefs after each turn: departure = { … }, arrival = { … }
arrival = {Seoul / 0.65} → f → arrival = ?
• approach for constructing accurate beliefs
• integrate information across multiple turns
• significant gains in task success and efficiency

other advantages
• learns from data
  • tuned to the domain in which it operates
• sample efficient / scalable
  • local one-turn optimization, concepts are independent
  • RoomLine operates with 29 concepts
  • cardinality: 2 → several hundred
• portable
  • decoupled from dialog task specification
  • no assumptions about dialog management

future work
• integrate information from n-best list
• integrate other high-level knowledge
  • domain-specific constraints
  • inter-concept dependencies
• investigate technique in other domains

thank you! questions …

improvements at different WER
[figure: absolute improvement in task success vs. word-error-rate]

user study
• 10 scenarios, fixed order
• presented graphically (explained during briefing)
• participants compensated per task success

informative features
• priors and confusability
• initial confidence scores
• concept identity
• barge-in
• expectation match
• repeated grammar slots

Models (k=2, runtime features)
# The model for the explicit confirm action
                                 new_1    other
LR_MODEL(EC)
k                            =  -15.96     3.61
answer_type[YES]             =  -12.67    -5.90
answer_type[NO]              =    4.55     3.15
answer_type[OTHER]           =    1.20    -0.75
concept_id(equip)            =    6.96     4.42
i_th_confusability           =   -3.67    -4.80
ih_diff_lexical_one_word     =  -15.99    -1.17
lexw1[SMALL]                 =   17.63    20.26
response_new_hyps_in_selh    =   18.85     0.41
END

Models (k=2, runtime features)
# The model for the implicit confirm action
                                 new_1    other
LR_MODEL(IC)
mark_confirm                 =    0.31    -1.74
mark_disconfirm              =    3.39     1.57
i_th_conf                    =    0.39    -3.63
i_th_confusability           =   -4.17    -4.54
k                            =  -16.83     3.75
lex[THREE]                   =   -2.25    -2.68
response_new_hyps_in_selh    =   20.88     1.70
turn_number                  =    0.01     0.03
END

Models (k=2, runtime features)
# The model for the request action
                                       new_1    other
LR_MODEL(REQ)
k                                  =   -0.78     3.56
barge_in                           =   -2.07    -1.40
concept_id(date)                   =   11.29     9.80
concept_id(user_name)              =    1.93   -13.91
dialog_state[RequestSpecificTimes] =   13.29    14.26
ih_diff_lexical                    =   -1.54     0.17
initial_num_hyps_>_0               =  -21.70    -2.71
total_num_parses                   =   -1.06    -0.40
ur_selh_new_1_conf                 =    4.09     1.76
ur_selh_new_1_confusability        =    5.81     1.70
ur_selh_new_1_prior                =    0.67     0.98
ur_selh_new_1_prior_>_1            =   -1.00    -6.38
END
