Evaluating Human-Machine Conversation for Appropriateness

  1. Evaluating Human-Machine Conversation for Appropriateness David Benyon, Preben Hansen, Oli Mival and Nick Webb

  2. Overview • www.companions-project.org • Companions are targeted as persistent, collaborative, conversational partners • Rather than a single task, Companions support a range of tasks • Completion of tasks is important • So is conversational performance

  3. Metrics • Objective measures • WER, CER, Turn Duration, Vocabulary… • Subjective user measures • User satisfaction surveys • Appropriateness
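Of these objective measures, WER (word error rate) is the most standard: the word-level edit distance between the reference transcript and the recogniser output, normalised by reference length. A minimal illustrative sketch, not the project's actual tooling:

```python
# Minimal sketch of word error rate (WER): word-level Levenshtein
# distance between reference and hypothesis, normalised by the
# length of the reference.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("how about ordering lunch", "how about ordering lunch today"))  # 0.25
```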

  4. Appropriateness • D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction. In Proceedings of LREC, 2004. • Alongside traditional measures, introduces the concept of “response appropriateness” • Created for the ICT/ISI Mission Rehearsal Exercise system

  5. Initial Companion Evaluation • 2 Companion prototypes • Health &amp; Fitness • Senior Companion • 8 users completed the entire protocol • All participants were native English speakers without strong accents • Ages from 27 to 61 • 2 were female, 6 were male

  6. Initial Companion Evaluation • New version (2.0) of Senior Companion • 12 new participants • 9 male, 3 female (ages 21-38) • Key changes • Facebook photographs (pre-tagged) • Loquendo TTS elements (cough, laugh) • Additional “chat” ability from a chatbot • Improved metric results • Avg. words / utterance • 4.27 (v1) to 6.1 (v2)

  7. v1.0 SC vs v2.0 • [Chart: user ratings for “I found the Companion engaging”, v1.0 vs v2.0]

  8. v1.0 SC vs v2.0 • [Chart: user ratings for “The Companion demonstrated emotion at times”, v1.0 vs v2.0]

  9. Appropriateness • Traum et al. devised an “appropriateness” coding scheme. • Split system and user utterances. • Users: • Response To System [RTS] • Gets RESponse [RES] • No Response: Appropriate [NRA] • No Response: Not appropriate [NRN]

  10. 3rd Phase - Appropriateness • For agents: • Filled Pause [FP] • Request for Repair [RR] • Appropriate Response [AR] • Appropriate Question [AQ] • Appropriate new INItiative [INI] • Appropriate CONtinuation [CON] • iNAPpropriate response, initiative or continuation [NAP]

  11. Scoring Intuitions • Filled pauses are human-like and good for virtual agents to perform, but add little (0) • Appropriate responses and questions are very good (+2); initiatives that push the interaction back on track are better (+3) • Extended contributions on topic are somewhat good (+0.5) • Repairs and clarifications are bad (-0.5), but their use can still gain points by enabling a subsequent appropriate response • An inappropriate response is bad (-1); no response at all is worse (-2) • (See the scoring sketch below)
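Putting the numbers from slide 11 together, a minimal sketch of how a tagged dialogue could be scored. The tag names follow the scheme on slides 9-10; the NONE tag for a missing system response is our own placeholder, not part of the published scheme:

```python
# Per-tag scores from slide 11, applied to one tagged dialogue.

TAG_SCORES = {
    "FP": 0.0,     # filled pause: human-like, but adds little
    "AR": 2.0,     # appropriate response
    "AQ": 2.0,     # appropriate question
    "INI": 3.0,    # appropriate new initiative, pushes dialogue back on track
    "CON": 0.5,    # appropriate continuation on topic
    "RR": -0.5,    # request for repair / clarification
    "NAP": -1.0,   # inappropriate response, initiative or continuation
    "NONE": -2.0,  # no response at all (our placeholder tag)
}

def dialogue_score(tags: list[str]) -> tuple[float, float]:
    """Return (total score, per-utterance average) for one dialogue."""
    total = sum(TAG_SCORES[t] for t in tags)
    return total, total / len(tags)

print(dialogue_score(["AQ", "RR", "AR", "CON", "NAP"]))  # (3.0, 0.6)
```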

  12. Appropriateness Evaluation • 7 HFC, 13 SC dialogues • 4 pre-chatbot SC, 9 post-chatbot • All annotated by a single annotator • To check consistency, one dialogue of each type was annotated by at least 3 annotators • Inter-coder reliability scored over 0.8 (Krippendorff’s alpha) in all cases (see the sketch below)
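A sketch of how such a reliability check could be reproduced with NLTK's AnnotationTask, assuming one appropriateness tag per turn per annotator; the (coder, item, tag) triples below are invented for illustration:

```python
# Inter-coder reliability via Krippendorff's alpha, using NLTK.

from nltk.metrics.agreement import AnnotationTask

# (coder, item, label) triples: three annotators tagging the same turns
data = [
    ("ann1", "turn_16", "AQ"), ("ann2", "turn_16", "AQ"), ("ann3", "turn_16", "AQ"),
    ("ann1", "turn_18", "RR"), ("ann2", "turn_18", "RR"), ("ann3", "turn_18", "NAP"),
    ("ann1", "turn_20", "AR"), ("ann2", "turn_20", "AR"), ("ann3", "turn_20", "AR"),
]

task = AnnotationTask(data=data)
print(task.alpha())  # > 0.8 is conventionally taken as reliable
```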

  13. Sample HFC Dialogue
  <System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:06 PM" turn_no="16">Alright. How about ordering lunch from a takeaway?</System> [Appropriate Question]
  <User time="02:51:13 PM" turn_no="17" /> [Response To System]
  <System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:25 PM" turn_no="18">Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway?</System> [Request Repair]

  14. Sample SC Dialogue
  <turn end_time="03:49:01 PM" mode="photo" speaker="S" start_time="03:48:59 PM" turn_no="65">What is your relationship to Libby.</turn> [Appropriate Question]
  <turn end_time="03:49:04 PM" mode="photo" speaker="S" start_time="03:49:01 PM" turn_no="66">I'm sorry I didn't understand your relationship to Libby.</turn> [Inappropriate Response]
  …
  <turn end_time="03:49:19 PM" mode="photo" speaker="U" turn_no="70">could be as my friend</turn> [Response To System]
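The transcripts are plain XML, so loading them for annotation is straightforward. A sketch using ElementTree, assuming turns like those above sit under a single root element (the <dialogue> wrapper name is our assumption):

```python
# Load a Senior Companion transcript and walk its turns.

import xml.etree.ElementTree as ET

xml = """<dialogue>
  <turn speaker="S" turn_no="65" mode="photo">What is your relationship to Libby.</turn>
  <turn speaker="U" turn_no="70" mode="photo">could be as my friend</turn>
</dialogue>"""

for turn in ET.fromstring(xml).iter("turn"):
    who = "system" if turn.get("speaker") == "S" else "user"
    print(turn.get("turn_no"), who, (turn.text or "").strip())
```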

  15. Average Score • [Chart]

  16. Per Utterance Score • [Chart]

  17. Tag Distribution • [Chart]

  18. Initial Conclusions • Seems to correlate with improvement in user responses (needs further investigation) • Reliably encoded by annotators • Indicates problem areas in dialogue

  19. Tools and Resources • XML encoded dialogue corpus • Corpus collection tool • Appropriateness annotation guidelines • Appropriateness annotation tool

  20. Next Steps • Refine appropriateness measures • Add NEW tags • confirmation, politeness, emotion • Modify existing tags • more specific inappropriate tags • We don’t yet have upper bounds on performance; these require Wizard-of-Oz (WoZ) models • Need to monitor users’ behaviour over time • Use the scoring system to inform reinforcement learning (see the sketch below)
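How the appropriateness scores might feed reinforcement learning is left open on the slide; one loose sketch is to treat each system turn's score as an immediate reward and optimise the discounted return per dialogue. This is entirely our illustration, not the project's formulation:

```python
# Treat per-turn appropriateness scores as rewards and compute a
# discounted return for one dialogue.

def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. per-turn scores derived from the appropriateness tags above
print(discounted_return([2.0, -0.5, 2.0, 3.0]))  # ≈ 5.90
```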
