Evaluating Human-Machine Conversation for Appropriateness
David Benyon, Preben Hansen, Oli Mival and Nick Webb
Overview
• www.companions-project.org
• Companions are designed as persistent, collaborative, conversational partners
• Rather than supporting a single task, a Companion addresses a range of tasks
• Completion of tasks is important
• So is conversational performance
Metrics
• Objective measures: WER, CER, turn duration, vocabulary…
• Subjective user measures: user satisfaction surveys
• Appropriateness
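As a rough illustration, WER can be computed with the open-source jiwer package (pip install jiwer); the reference and hypothesis strings below are invented examples, not Companion transcripts.

# Sketch: word error rate for one recognised utterance.
# The strings are invented placeholders, not corpus data.
from jiwer import wer

reference = "how about ordering lunch from a takeaway"
hypothesis = "how about ordering lunch from takeaway"
print(wer(reference, hypothesis))  # fraction of word-level errors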
Appropriateness
• D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction. In Proceedings of LREC, 2004.
• Alongside traditional measures, introduces the concept of "response appropriateness"
• Created for the ICT/ISI Mission Rehearsal Exercise system
Initial Companion Evaluation
• 2 Companion prototypes: Health & Fitness Companion (HFC) and Senior Companion (SC)
• 8 users completed the entire protocol
• All participants were native English speakers without strong accents
• Ages ranged from 27 to 61
• 2 were female, 6 were male
Initial Companion Evaluation
• New version (v2.0) of the Senior Companion
• 12 new participants: 9 male, 3 female, ages 21-38
• Key changes:
• Facebook photographs (pre-tagged)
• Loquendo TTS elements (cough, laugh)
• Additional "chat" ability from a chatbot
• Improved metric results: average words per utterance rose from 4.27 (v1) to 6.1 (v2)
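The words-per-utterance figure is simply the mean token count over user turns; a minimal sketch with invented turns (not corpus data):

# Sketch: average words per user utterance, the metric quoted above.
turns = ["hello", "tell me about this photo", "could be as my friend"]
avg_words = sum(len(t.split()) for t in turns) / len(turns)
print(round(avg_words, 2))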
[Chart: user agreement with "I found the Companion engaging", v1.0 SC vs v2.0]
[Chart: user agreement with "The Companion demonstrated emotion at times", v1.0 SC vs v2.0]
Appropriateness
• Traum et al. devised an "appropriateness" coding scheme that splits system and user utterances.
• User tags:
• Response To System [RTS]
• Gets RESponse [RES]
• No Response: Appropriate [NRA]
• No Response: Not appropriate [NRN]
3rd Phase - Appropriateness
• Agent tags (both tagsets are collected in the sketch below):
• Filled Pause [FP]
• Request for Repair [RR]
• Appropriate Response [AR]
• Appropriate Question [AQ]
• Appropriate new INItiative [INI]
• Appropriate CONtinuation [CON]
• iNAPpropriate response, initiative or continuation [NAP]
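Taken together, the user and agent tags form a small closed tagset; one way to represent it in code (a sketch; the variable names are my own, not taken from the project's annotation tool):

# Sketch: the appropriateness tagset as Python constants.
# Groupings follow the slides above.
USER_TAGS = {"RTS", "RES", "NRA", "NRN"}
AGENT_TAGS = {"FP", "RR", "AR", "AQ", "INI", "CON", "NAP"}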
Scoring Intuitions (see the scoring sketch below)
• Filled pauses are generally human-like and good for virtual agents to perform, but add little (0)
• Appropriate responses and questions are very good (+2); initiatives that push the interaction back on track are better (+3)
• Extended contributions on topic are somewhat good (+0.5)
• Repairs and clarifications are bad (-0.5), but their use can still gain points by enabling a subsequent appropriate response
• An inappropriate response is bad (-1); no response is worse (-2)
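These intuitions translate directly into a per-tag weight table. A minimal scoring sketch, with weights taken from the list above; the function name, the "NONE" tag, and the per-dialogue summation are assumptions rather than the project's published scoring code.

# Sketch: scoring one annotated dialogue with the weights above.
TAG_WEIGHTS = {
    "FP": 0.0,     # filled pause: human-like but adds little
    "AR": 2.0,     # appropriate response
    "AQ": 2.0,     # appropriate question
    "INI": 3.0,    # initiative pushing the interaction back on track
    "CON": 0.5,    # on-topic continuation
    "RR": -0.5,    # request for repair / clarification
    "NAP": -1.0,   # inappropriate response
    "NONE": -2.0,  # no response at all (tag name assumed; slides give only the weight)
}

def appropriateness_score(tags):
    """Sum the weights of the agent tags for one dialogue."""
    return sum(TAG_WEIGHTS.get(t, 0.0) for t in tags)

print(appropriateness_score(["AQ", "RR", "AR", "INI"]))  # 6.5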
Appropriateness Evaluation
• 7 HFC and 13 SC dialogues (4 pre-chatbot SC, 9 post-chatbot)
• All annotated by a single annotator
• To check consistency, one dialogue of each type was annotated by at least 3 annotators
• Inter-coder reliability scored above 0.8 (Krippendorff's alpha) in all cases
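A consistency check like this can be reproduced with the open-source krippendorff package (pip install krippendorff), assuming each annotator's tags are aligned by turn; the tag sequences below are invented, not the project's annotations.

# Sketch: nominal-level Krippendorff's alpha for three annotators
# over one dialogue. Tags are mapped to integers before scoring.
import krippendorff

TAGS = ["FP", "RR", "AR", "AQ", "INI", "CON", "NAP"]
INDEX = {t: i for i, t in enumerate(TAGS)}

annotations = [              # one row per annotator, one column per turn
    ["AQ", "RR", "AR", "INI"],
    ["AQ", "RR", "AR", "INI"],
    ["AQ", "RR", "NAP", "INI"],
]
data = [[INDEX[t] for t in row] for row in annotations]
print(krippendorff.alpha(reliability_data=data,
                         level_of_measurement="nominal"))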
Sample HFC Dialogue
<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:06 PM" turn_no="16">Alright. How about ordering lunch from a takeaway?</System> [Appropriate Question]
<User time="02:51:13 PM" turn_no="17" /> [Response To System]
<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:25 PM" turn_no="18">Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway?</System> [Request Repair]
Sample SC Dialogue
<turn end_time="03:49:01 PM" mode="photo" speaker="S" start_time="03:48:59 PM" turn_no="65">What is your relationship to Libby.</turn> [Appropriate Question]
<turn end_time="03:49:04 PM" mode="photo" speaker="S" start_time="03:49:01 PM" turn_no="66">I'm sorry I didn't understand your relationship to Libby.</turn> [Inappropriate Response]
…
<turn end_time="03:49:19 PM" mode="photo" speaker="U" turn_no="70">could be as my friend</turn> [Response To System]
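Both excerpts come from the project's XML-encoded dialogue corpus (see Tools and Resources below). A minimal parsing sketch with the standard library, assuming a well-formed file whose root wraps the <turn> elements shown above; the file name and wrapper element are assumptions, and the real corpus schema may differ.

# Sketch: reading turns from an SC-style dialogue file.
import xml.etree.ElementTree as ET

root = ET.parse("sc_dialogue_01.xml").getroot()   # hypothetical file name
for turn in root.iter("turn"):
    print(turn.get("turn_no"), turn.get("speaker"),
          (turn.text or "").strip())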
Initial Conclusions
• Appropriateness seems to correlate with improvement in user responses (needs further investigation)
• Reliably encoded by annotators
• Indicates problem areas in dialogue
Tools and Resources
• XML-encoded dialogue corpus
• Corpus collection tool
• Appropriateness annotation guidelines
• Appropriateness annotation tool
Next Steps
• Refine appropriateness measures
• Add new tags: confirmation, politeness, emotion
• Modify existing tags: more specific inappropriate tags
• We don't yet have upper bounds on performance; Wizard-of-Oz (WoZ) models are required
• Need to monitor users' behaviour over time
• Use the scoring system to inform reinforcement learning