Sorry, I didn’t catch that …
Non-understandings and recovery in spoken dialog systems
Part II: Sources & impact of non-understandings, performance of various recovery strategies
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
Non-understandings
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
• NON-understanding: the system cannot extract any meaningful information from the user’s turn
• How can we prevent non-understandings?
• How can we recover from them?
  • Detection
  • Set of recovery strategies
  • Policy for choosing between them
review: sources : impact : strategy performance
Issues under investigation
• Data Collection
• Detection / Diagnosis
  • What are the main causes (sources) of non-understandings?
  • What is their impact on global performance?
  • Can we diagnose non-understandings at run-time?
  • Can we optimize the rejection process in a more principled way?
• Set of recovery strategies
  • What is the relative performance of different recovery strategies?
  • Can we refine current strategies and find new ones?
• Policy for choosing between them
  • Can we improve performance by making smarter choices?
  • If so, can we learn how to make these smarter choices?
Data Collection: Experimental Design
• Subjects interact over the telephone with RoomLine
• Performed 10 scenario-based tasks
• Between-subjects experiment, 2 groups:
  • Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
  • Wizard: policy is determined at runtime by a human (wizard)
• 46 subjects, balanced for gender × native (vs. non-native) speaker status
• 449 sessions; 8278 user turns
• Sessions transcribed & annotated
Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON (MOVE): Sorry, I didn’t catch that. For which day do you need the room?
2. YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
3. TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
4. FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
5. ASK REPEAT (AREP): Could you please repeat that?
6. ASK REPHRASE (ARPH): Could you please try to rephrase that?
7. NOTIFY (NTFY): Sorry, I didn’t catch that ...
8. YIELD TURN (YLD): …
9. REPROMPT (RP): For when do you need the conference room?
10. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
Strategy groups (slide labels): MOVE-ON, HELP, REPEAT, NOTIFY, REPROMPT
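The control condition in the experimental design engages these strategies with a random (uniform) policy. A minimal sketch of such a policy, assuming the ten abbreviations above are the full strategy set; the function names and the simulation helper are hypothetical and are not taken from RoomLine:

```python
import random

# Strategy abbreviations as listed on the slide.
STRATEGIES = ["MOVE", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "RP", "DRP"]


def choose_strategy_uniform(rng: random.Random) -> str:
    """Pick one recovery strategy with equal probability (control condition)."""
    return rng.choice(STRATEGIES)


def simulate(n_nonunderstandings: int, seed: int = 0) -> dict:
    """Count how often each strategy is chosen over n non-understandings."""
    rng = random.Random(seed)
    counts = {s: 0 for s in STRATEGIES}
    for _ in range(n_nonunderstandings):
        counts[choose_strategy_uniform(rng)] += 1
    return counts


if __name__ == "__main__":
    print(simulate(1000))
```

A uniform policy gives every strategy roughly the same number of trials, which is what makes the later per-strategy comparison possible.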
Communication [Clark, Horvitz, Paek]
[Figure: processing chain between User and System (audio, end-pointing, recognition, text, parsing, semantics, interpretation, goal), shown alongside four levels: channel, signal, intention, conversation]
Modeling and Breakdowns
[Figure: processing chain between User and System (audio, end-pointing, recognition, text, parsing, semantics, interpretation, goal), shown alongside four levels: channel, signal, intention, conversation]
“Location” & “types” of errors
[Figure: error types placed along the processing chain between User and System: out-of-domain, out-of-application, false rejections, out-of-grammar, out-of-relevance, ASR errors (accents, noises), end-pointer errors]
% of non-understandings
• Out-of-domain: 0.14%
• Out-of-application: 12.89%
• False rejections: 18.59%
• Out-of-grammar: 8.02%
• Out-of-relevance: 3.21%
• ASR errors (accents, noises): 56.05%
• End-pointer errors: 3.91%
Out-of-application (13% of non-understandings)
• 2 main classes, about equally split
  • Requests for nonexistent task functionality
    • “A room Monday or Tuesday”
    • “do you have anything anytime Thursday afternoon?”
  • Requests for nonexistent “meta” functionality
    • Corrections:
      • “Can I change the date”
      • “You got the time wrong”
      • “Wrong day”
• Q: How to better convey system boundaries?
• Q: Extend system language for corrections?
Out-of-grammar (8% of non-understandings)
• Imperfect grammar coverage
  • “Doesn’t matter” vs. “It doesn’t matter”
  • “Internet connection” vs. “Network connection”
  • “Vaguely” vs. “So so” / “Generally” / etc.
• Q: Bring users into the grammar?
  • Carefully craft & use the “You Can Say” prompts
• Q: Extend the grammar?
  • Online & in an unsupervised fashion?
Grammaticality - Summary
• It’s important: 25% of non-understandings
• Stems (about equally) from:
  • Requests for nonexistent task functionality
  • Requests for nonexistent meta/corrections functionality
  • Lack of grammar coverage
• Solutions
  • Offline: enlarge grammar, include correction language
  • Online
    • Carefully design “You Can Say”
    • All You Can Say [Collagen / USI]
    • Unsupervised learning of new grammar expressions
All You Can Say
• How much of the system functionality is actually used? [work in progress]
• Certain “task” and “meta” aspects of functionality are very rarely or never used
Impact on system performance
• Logistic regression model
  • Task success ~ % non-understandings per session
  • Native speakers are more likely to succeed at the same non-understanding rate
  • (So are participants in the wizard condition)
• 2nd model (also uses misunderstandings)
  • Task success ~ % Non + % Mis
  • Better fit
  • Adding native-speaker information does not improve the model
  • A non-understanding is on average half as costly as a misunderstanding
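A minimal sketch of the first model above, assuming a single predictor (the fraction of non-understood turns in a session) and a binary task-success outcome. The data are synthetic and the generating coefficients 3.0 and -8.0 are invented for illustration; they are not values from the study. The fit uses plain batch gradient ascent on the log-likelihood, which is one common way to fit a logistic regression and is not necessarily what was used for the talk.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=1.0, epochs=3000):
    """Fit P(success) = sigmoid(b0 + b1 * x) by batch gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual on the probability scale
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def synthetic_sessions(n=400, seed=0):
    """Hypothetical sessions: x = fraction of non-understood turns, y = success."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.random() * 0.6
        p_success = sigmoid(3.0 - 8.0 * x)  # invented ground truth
        xs.append(x)
        ys.append(1 if rng.random() < p_success else 0)
    return xs, ys


if __name__ == "__main__":
    b0, b1 = fit_logistic(*synthetic_sessions())
    print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

A negative fitted slope means that predicted task success drops as the non-understanding rate rises. The second model on the slide would add a second predictor (% misunderstandings) with its own coefficient.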
How to evaluate performance?
• Recovery
  • Next turn is okay (not a non-understanding, not a misunderstanding)
• Finer-grained recovery
  • Next turn CER
  • Next turn concept transfer (dialog cost)
• Time (+recovery) ??
  • Time lost: 0 if next turn okay, time lost otherwise
  • Time to recovery (has some problems)
• [More under construction]
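The recovery and time-lost measures above can be sketched as follows. This is an illustration under stated assumptions: the event format (strategy, label of the next user turn), the label names, and the reading of "time lost" as the seconds spent on an unsuccessful recovery turn are all hypothetical and are not taken from the study's annotation scheme.

```python
from collections import defaultdict

# Hypothetical labels for the turn that follows a recovery strategy.
OK, NONU, MIS = "ok", "nonu", "mis"


def recovery_rates(events):
    """Per-strategy fraction of cases where the next turn was understood.

    events: iterable of (strategy, next_turn_label) pairs.
    """
    attempts = defaultdict(int)
    recovered = defaultdict(int)
    for strategy, next_label in events:
        attempts[strategy] += 1
        if next_label == OK:  # neither a non-understanding nor a misunderstanding
            recovered[strategy] += 1
    return {s: recovered[s] / attempts[s] for s in attempts}


def time_lost(next_label, seconds_spent):
    """0 if the next turn is okay, otherwise the time spent on that turn."""
    return 0.0 if next_label == OK else seconds_spent


if __name__ == "__main__":
    sample = [("MOVE", OK), ("MOVE", NONU), ("HELP", OK), ("HELP", OK)]
    print(recovery_rates(sample))
```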
Which strategies are better?
Which strategies are better?
• Recovery performance ranked list, based on pair-wise t-tests:
• CER evaluation shows similar results
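A sketch of how pair-wise comparisons between strategies could be computed from binary recovery outcomes (1 = next turn okay, 0 = not). The slide does not say which t-test variant was used; Welch's unequal-variance form is an assumption made here, and only the t statistic and approximate degrees of freedom are computed (a p-value would additionally need the t distribution). The outcome lists are invented.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def pairwise_tests(outcomes):
    """Run welch_t for every unordered pair of strategies."""
    return {
        (s1, s2): welch_t(outcomes[s1], outcomes[s2])
        for s1, s2 in combinations(sorted(outcomes), 2)
    }


if __name__ == "__main__":
    # Invented outcomes for two strategies.
    outcomes = {"MOVE": [1, 1, 1, 0], "NTFY": [0, 0, 1, 0]}
    for pair, (t, df) in pairwise_tests(outcomes).items():
        print(pair, round(t, 3), round(df, 1))
```

Ranking strategies then amounts to ordering them by mean recovery and using the pair-wise tests to decide which adjacent differences are reliable.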
Which strategies are better?
MoveOn ≥ Help > Signal (* p = 0.1089)
What is the Impact on User Response?
• Labeled user responses in 5 classes [same tagging scheme as Shin, Choularton]
  • Answer (1st)
  • Repeat
  • Rephrase
  • Change
  • Contradict
  • Other
  • Hang-up
• Percentages shown on the slide (assignment to classes not recoverable): 17.95%, 44.30%, 30.70%, 3.63%, 3.13%
Comparing with other systems
What responses are the best?
• Recovery as a function of response type
  • Answer (1st)
  • Repeat
  • Rephrase
  • Change
  • Contradict
  • Other
  • Hang-up
• Percentages shown on the slide (assignment to response types not recoverable): 45.45%, 39.33%, 63.29%, 19.05%
More to come …
• Per-strategy analysis
• Barge-in & impact on recovery
Refining the current set of strategies
• Introduce more alternative dialog plans
  • Opportunities for Move-On
• “You Can Say”
  • Carefully tune the prompts
  • Smarter barge-in control
• “All You Can Say”
• “Speak shorter”
  • Anecdotal evidence to be corroborated by analysis
• “Speak louder / go to a quieter place”
  • Not so much in these experiments, but evidence from Let’s go!
• More prevention measures
  • If a user is having trouble, the system can give the YCS prompts without waiting for a non-understanding to happen
Thank You!!