Using Reinforcement Learning to Build a Better Model of Dialogue State
Joel Tetreault & Diane Litman
University of Pittsburgh LRDC
April 7, 2006
Problem
• Problems with designing spoken dialogue systems:
  • What features to use?
  • How to handle noisy data or miscommunications?
  • Hand-tailoring policies for complex dialogues?
• Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ’02; Walker, ’00; Henderson et al., ’05]
• However, there has been very little empirical work on testing the utility of adding specialized features to construct a better dialogue state
Goal
• Many features can be used to describe the user state; which ones do you use?
• Goal: show that adding more complex features to a state is a worthwhile pursuit, since it alters what actions a system should take
• 5 features: certainty, student dialogue move, concept repetition, frustration, student performance
• All are important to tutoring systems, but also to dialogue systems in general
Outline
• Markov Decision Processes (MDPs)
• MDP Instantiation
• Experimental Method
• Results
Markov Decision Processes
• What is the best action for an agent to take in any state to maximize reward at the end?
• MDP Input:
  • States
  • Actions
  • Reward Function
MDP Output
• Use policy iteration to propagate the final reward back to the states and determine:
  • V-value: the worth of each state
  • Policy: the optimal action to take in each state
• Values and policies depend on the reward function, but also on the probabilities of getting from one state to the next given a certain action
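As a concrete illustration of the computation sketched above, here is a minimal policy-iteration routine in Python. This is a sketch only: the slides used an off-the-shelf MDP toolkit, and the transition matrices `P` and state rewards `R` below are placeholders that would be estimated from data.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P: list of (S x S) transition matrices, one per action; R: length-S state rewards.
    Returns the V-values and a deterministic policy (one action index per state)."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R for the current policy
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: for each state, pick the action with the highest expected value
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy        # converged: the policy no longer changes
        policy = new_policy
```

In this corpus the reward sits on final dialogue states (the +100 / −100 described later), with 0 elsewhere, so the evaluation step is exactly the propagation of the final reward back through the states.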
MDP Frog Example (grid figure: the final state is worth +1 and every other hop is labelled −1)
MDP Frog Example (grid figure: the final reward of +1 propagated back through the grid, giving V-values from 0 down to −3 for states farther from the goal)
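A tiny worked version of that intuition, assuming a simple chain of states where the final state is worth +1 and every hop costs 1 (the layout and exact values of the slide's grid are not reproduced here):

```python
# Illustrative only: a 4-state chain ending in a terminal state worth +1,
# with a cost of 1 per hop. Backing the final reward up through the chain
# gives earlier states progressively lower V-values, which is the idea
# behind the numbers on the frog slide.
V = {3: +1.0}                  # terminal state
for s in (2, 1, 0):            # work backwards from the goal
    V[s] = -1.0 + V[s + 1]     # one hop costs 1, then take the successor's value
print(V)                       # {3: 1.0, 2: 0.0, 1: -1.0, 0: -2.0}
```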
MDPs in Spoken Dialogue (diagram): the MDP works offline, turning training data into a policy; the dialogue system then interacts online with a user simulator or human users
ITSPOKE Corpus
• 100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al., ’04]
• All possible dialogue paths were authored by physics experts
• Dialogues informally follow a question-answer format
• 50 turns per dialogue on average
• Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
Corpus Annotations
• Manual annotations:
  • Tutor and Student Moves (similar to Dialogue Acts) [Forbes-Riley et al., ’05]
  • Frustration and certainty [Litman et al., ’04] [Liscombe et al., ’05]
• Automated annotations:
  • Correctness (based on the student’s response to the last question)
  • Concept Repetition (whether a concept is repeated)
  • %Correctness (past performance)
MDP Reward Function
• Reward Function: use normalized learning gain to do a median split on the corpus:
  • 10 students are “high learners” and the other 10 are “low learners”
  • High learner dialogues had a final state with a reward of +100; low learner dialogues had one of −100
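A sketch of how such a reward assignment could be computed, assuming normalized learning gain is defined as (posttest − pretest) / (1 − pretest); the data format and field names here are hypothetical, not the authors' code.

```python
def assign_rewards(students):
    """students: list of dicts with 'id', 'pretest', 'posttest' scores in [0, 1).
    Returns {student_id: reward} from a median split on normalized learning gain."""
    gains = {s['id']: (s['posttest'] - s['pretest']) / (1.0 - s['pretest'])
             for s in students}                       # assumes pretest < 1
    median = sorted(gains.values())[len(gains) // 2]  # upper median as the split point
    # High learners' final dialogue states get +100, low learners' get -100
    return {sid: (100 if g >= median else -100) for sid, g in gains.items()}
```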
Infrastructure
1. State Transformer:
  • Based on RLDS [Singh et al., ’99]
  • Outputs a state-action probability matrix and a reward matrix
2. MDP Matlab Toolkit (from INRA) to generate policies
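The state transformer is described only at a high level; the sketch below shows one plausible way to turn annotated dialogue trajectories into the transition probabilities an MDP toolkit expects (the data format and function name are assumptions, not the authors' code).

```python
import numpy as np

def estimate_transitions(dialogues, states, actions):
    """dialogues: list of trajectories, each a list of (state, action, next_state) tuples.
    Returns P with shape (A, S, S): maximum-likelihood transition probabilities."""
    s_idx = {s: i for i, s in enumerate(states)}
    a_idx = {a: i for i, a in enumerate(actions)}
    counts = np.zeros((len(actions), len(states), len(states)))
    for trajectory in dialogues:
        for s, a, s2 in trajectory:
            counts[a_idx[a], s_idx[s], s_idx[s2]] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Normalize counts into probabilities, leaving unseen (state, action) rows at zero
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```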
Methodology
• Construct MDPs to test the inclusion of new state features against a baseline:
  • Develop a baseline state and policy
  • Add a feature to the baseline and compare policies
• A feature is deemed important if adding it results in a change in policy from the baseline policy (“shifts”)
• For each MDP: verify that policies are reliable (V-value convergence)
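Counting those shifts amounts to checking, for each augmented state, whether its optimal action differs from what the baseline recommends for the underlying state. A minimal sketch, assuming each augmented state is a (baseline state, feature value) pair:

```python
def count_shifts(baseline_policy, augmented_policy):
    """baseline_policy: {state: action}; augmented_policy: {(state, feature_value): action}.
    A shift is an augmented state whose optimal action differs from the action
    the baseline policy recommends for its underlying baseline state."""
    return sum(1 for (state, _feat), action in augmented_policy.items()
               if action != baseline_policy[state])
```

For example, a baseline of {'C': 'Feed', 'I': 'Feed'} compared with an augmented policy that recommends 'NonFeed' for ('C', 'certain') would count one shift.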
Hypothetical Policy Change Example (figure contrasting a 0-shift case with a 5-shift case)
Tests (roadmap): Baseline 1 = {Correctness}; Baseline 2 = B1 + Certainty; then B2 + SMove, B2 + Goal, B2 + Frustration, and B2 + %Correct are each compared against Baseline 2
Baseline
• Actions: {Feed, NonFeed, Mix}
• Baseline State: {Correctness}
(Baseline network figure: correct [C] and incorrect [I] states with Feed | NonFeed | Mix transitions leading to a FINAL state)
Baseline 1 Policies
• Trend: when student correctness is the only model of student state, the best tactic is to always give simple feedback, regardless of the student’s response
But are our policies reliable?
• The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work
• Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
• Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data, and rerun the MDP on each subset)
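A sketch of that convergence check, where `solve_mdp` stands in for a run of the MDP toolkit on a given subset of students and returns a dict of V-values (a hypothetical name, not the authors' code):

```python
def v_value_convergence(students, solve_mdp):
    """Re-solve the MDP on growing prefixes of the student list (each student adds
    5 dialogues) and report how much the V-values move at each step."""
    deltas, prev = [], None
    for k in range(1, len(students) + 1):
        V = solve_mdp(students[:k])          # dict: state -> V-value
        if prev is not None:
            common = set(V) & set(prev)      # states present in both solutions
            deltas.append(max((abs(V[s] - prev[s]) for s in common), default=float('inf')))
        prev = V
    return deltas                            # small, flattening deltas suggest convergence
```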
Methodology: Adding More Features
• Create a more complicated baseline by adding the certainty feature (new baseline = B2)
• Add the other 4 features (student moves, concept repetition, frustration, performance) individually to the new baseline
• Check that V-values converge
• Analyze policy changes
Tests (roadmap, revisited): Baseline 1 → Baseline 2 (B1 + Certainty) → B2 + SMove | + Goal | + Frustration | + %Correct
Certainty
• Previous work [Bhatt et al., ’04] has shown the importance of certainty in intelligent tutoring systems (ITS)
• A student who is certain and correct may not need feedback, but a student who is correct yet showing some doubt may be becoming confused, so give more feedback
B2: Baseline + Certainty Policies
• Trend: if the student is neutral, give Feed or Mix; otherwise give NonFeed
Tests (roadmap, revisited): Baseline 1 → Baseline 2 (B1 + Certainty) → B2 + SMove | + Goal | + Frustration | + %Correct
Student Move Policies (7 shifts)
• Trend: give Mix if the student move is shallow (S); give NonFeed if Other (O)
Concept Repetition Policies (4 shifts)
• Trend: if the concept is repeated (R), give complex or mix feedback
Frustration Policies (4 shifts)
• Trend: if the student is frustrated (F), give NonFeed
Percent Correct Policies (3 shifts)
• Trend: if the student is a low performer (L), give NonFeed
Discussion
• Incorporating more information into the representation of the student state has an impact on tutor policies
• Despite not having human or simulated users, we can still claim that our findings are reliable due to the convergence of V-values and policies
• Including Certainty, Student Moves, and Concept Repetition effected the most change
Future Work
• Developing user simulations and annotating more human-computer experiments to further verify that our policies are correct
• More data allows us to develop more complicated policies, such as:
  • More complex tutor actions (hints, questions)
  • Combinations of state features
  • More refined reward functions (PARADISE)
• Developing more complex convergence tests
Related Work
• [Paek and Chickering, ’05]
• [Singh et al., ’99] – optimal dialogue length
• [Frampton et al., ’05] – last dialogue act
• [Williams et al., ’03] – automatically generate good state/action sets
Diff Plots
• Diff Plot: compare the final policy (20 students) with the policies generated at smaller cuts