
How much data is enough? – Generating reliable policies w/MDP’s

This study explores the utility of adding different features to a spoken dialogue system's model of dialogue state. Using a reinforcement learning methodology, it compares the impact of each feature on the learned dialogue policies to determine which features contribute most effectively to a better model of dialogue state.


Presentation Transcript


  1. How much data is enough? – Generating reliable policies w/MDP’s Joel Tetreault University of Pittsburgh LRDC July 14, 2006

  2. Problem • Problems with designing spoken dialogue systems: • How to handle noisy data or miscommunications? • Hand-tailoring policies for complex dialogues? • What features to use? • Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ‘02; Walker, ‘00; Henderson et al., ‘05] • However, very little empirical work [Paek et al., ‘05; Frampton ‘05] on comparing the utility of adding specialized features to construct a better dialogue state

  3. Goal • How does one choose which features best contribute to a better model of dialogue state? • Goal: show the comparative utility of adding four different features to a dialogue state • 4 features: concept repetition, frustration, student performance, student moves • All are important to tutoring systems, but also are important to dialogue systems in general

  4. Previous Work • In complex domains, annotation and testing is time-consuming, so it is important to choose the best features properly beforehand • Developed a methodology for using Reinforcement Learning to determine whether adding complex features to a dialogue state will beneficially alter policies [Tetreault & Litman, EACL ’06] • Extensions: • Methodology to determine which features are the best • Also show our results generalize over different action choices (feedback vs. questions)

  5. Outline • Markov Decision Processes (MDP) • MDP Instantiation • Experimental Method • Results • Policies • Feature Comparison

  6. Markov Decision Processes • What is the best action an agent should take at any state to maximize reward at the end? • MDP Input: • States • Actions • Reward Function
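The three MDP inputs on this slide can be written down concretely. In the sketch below, every state name, action name, probability, and reward value is an illustrative placeholder, not a number from the study:

```python
# Sketch of the three MDP inputs: states, actions, and a reward function,
# plus the transition model P[(state, action)] = {next_state: probability}
# that would be estimated by counting transitions in a dialogue corpus.
# All names and numbers are hypothetical.
states = ["correct", "incorrect", "final"]
actions = ["SAQ", "CAQ", "Mix", "NoQ"]

P = {
    ("correct", "Mix"):   {"correct": 0.6, "incorrect": 0.2, "final": 0.2},
    ("correct", "SAQ"):   {"correct": 0.7, "incorrect": 0.2, "final": 0.1},
    ("incorrect", "Mix"): {"correct": 0.4, "incorrect": 0.4, "final": 0.2},
    ("incorrect", "NoQ"): {"correct": 0.5, "incorrect": 0.3, "final": 0.2},
}

def reward(state):
    """Final-state-only reward, as in the +100/-100 scheme used later."""
    return 100 if state == "final" else 0  # sign depends on learner group

# Sanity check: each transition row is a proper probability distribution.
for row in P.values():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```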

  7. MDP Output • Policy: optimal action for system to take in each state • Calculated using policy iteration which depends on: • Propagating final reward to each state • the probabilities of getting from one state to the next given a certain action • Additional output: V-value: the worth of each state
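A toy sketch of how the policy and V-values fall out of the inputs. It uses value iteration, which converges to the same optimal policy and V-values as the policy iteration named on the slide; all states, actions, probabilities, and rewards are invented for illustration:

```python
# Toy dynamic-programming sketch: compute V-values by repeated Bellman
# backups, then read off the greedy policy. Numbers are illustrative.
GAMMA = 0.95

# P[s][a] = list of (next_state, prob, reward); "final" is absorbing
# and carries the propagated final reward of +100.
P = {
    "correct": {
        "Mix": [("correct", 0.5, 0), ("incorrect", 0.2, 0), ("final", 0.3, 100)],
        "NoQ": [("correct", 0.6, 0), ("incorrect", 0.3, 0), ("final", 0.1, 100)],
    },
    "incorrect": {
        "Mix": [("correct", 0.4, 0), ("incorrect", 0.4, 0), ("final", 0.2, 100)],
        "NoQ": [("correct", 0.5, 0), ("incorrect", 0.4, 0), ("final", 0.1, 100)],
    },
}

def solve(P, gamma=GAMMA, iters=500):
    V = {s: 0.0 for s in P}
    V["final"] = 0.0  # absorbing final state: no further reward
    for _ in range(iters):
        for s in P:
            # Bellman backup: best expected reward-to-go over actions.
            V[s] = max(
                sum(p * (r + gamma * V[ns]) for ns, p, r in P[s][a])
                for a in P[s]
            )
    # Policy = the action achieving the max at each state.
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[ns])
                                       for ns, p, r in P[s][a]))
        for s in P
    }
    return V, policy

V, policy = solve(P)
```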

  8. MDP’s in Spoken Dialogue [Diagram: offline, the MDP is trained on training data to produce a policy; online, the dialogue system uses that policy in interactions with a user simulator or a human user]

  9. ITSPOKE Corpus • 100 dialogues with ITSPOKE spoken dialogue tutoring system [Litman et al. ’04] • All possible dialogue paths were authored by physics experts • Dialogues informally follow question-answer format • 60 turns per dialogue on average • Each student session has 5 dialogues bookended by a pretest and posttest to calculate how much student learned

  10. Corpus Annotations • Manual annotations: • Tutor Moves (similar to Dialog Acts) [Forbes-Riley et al., ’05] • Student Frustration and Certainty [Litman et al. ’04] [Liscombe et al. ’05] • Automated annotations: • Correctness (based on student’s response to last question) • Concept Repetition (whether a concept is repeated) • %Correctness (past performance)

  11. MDP State Features

  12. MDP Action Choices

  13. MDP Reward Function • Reward Function: use normalized learning gain to do a median split on the corpus: • 10 students are “high learners” and the other 10 are “low learners” • High learner dialogues had a final state with a reward of +100; low learner dialogues, -100
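The reward construction above can be sketched as follows; the pretest/posttest scores are invented, and `nlg` is the standard normalized-learning-gain formula, (posttest - pretest) / (1 - pretest):

```python
# Sketch of the reward function: compute each student's normalized
# learning gain, median-split the group, and assign +100 / -100 to the
# final state of that student's dialogues. Scores are illustrative.
import statistics

pre_post = {"s1": (0.4, 0.7), "s2": (0.5, 0.55),
            "s3": (0.2, 0.8), "s4": (0.6, 0.65)}

def nlg(pre, post):
    """Normalized learning gain."""
    return (post - pre) / (1 - pre)

gains = {sid: nlg(pre, post) for sid, (pre, post) in pre_post.items()}
median = statistics.median(gains.values())

# "High learners" (above the median) get +100, the rest -100.
final_reward = {sid: (100 if g > median else -100) for sid, g in gains.items()}
```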

  14. Methodology • Construct MDP’s to test the inclusion of new state features to a baseline: • Develop baseline state and policy • Add a feature to baseline and compare policies • A feature is deemed important if adding it results in a change in policy from a baseline policy given 3 metrics: • # of Policy Differences (Diff’s) • %Policy Change (%PC) • Expected Cumulative Reward (ECR) • For each MDP: verify policies are reliable (V-value convergence)

  15. Hypothetical Policy Change Example [Diagram: two hypothetical policy comparisons, one with 0 Diffs and one with 5 Diffs]

  16. Tests [Diagram: Baseline 1 = {Correctness}; Baseline 2 = Baseline 1 + Certainty; Concept, Frustration, and %Correct are each added to Baseline 2]

  17. Baseline • Actions: {SAQ, CAQ, Mix, NoQ} • Baseline State: {Correctness} • [Diagram: baseline network in which the correctness states [C] and [I] transition via SAQ|CAQ|Mix|NoQ among themselves and to FINAL]

  18. Baseline 1 Policies • Trend: if you only have student correctness as a model of student state, give a hint or other non-question act to the student, otherwise give a Mix of complex and short answer questions

  19. But are our policies reliable? • Best way to test is to run real experiments with human users with the new dialogue manager, but that is months of work • Our tack: check if our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus • Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset)
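The subset-and-rerun loop can be sketched as below. The corpus, the "solve the MDP" step, and all numbers are synthetic stand-ins, not the study's actual pipeline:

```python
# Sketch of the reliability check: re-estimate the model on growing
# subsets of the corpus (one student, i.e. 5 dialogues, at a time) and
# watch the resulting value estimate stabilize. Everything here is
# synthetic and stands in for the real "rerun the MDP" step.
import random

random.seed(0)

def simulate_student():
    """Fake per-turn outcomes for one student (~60 turns per dialogue)."""
    return [random.random() < 0.3 for _ in range(60)]

def v_estimate(students):
    """Stand-in for solving the MDP: expected reward (+100 on success)
    under the empirical success rate of the data seen so far."""
    total = sum(len(s) for s in students)
    hits = sum(sum(s) for s in students)
    return 100 * hits / total

corpus, v_curve = [], []
for _ in range(20):                      # 20 students in the corpus
    corpus.append(simulate_student())
    v_curve.append(v_estimate(corpus))   # rerun on each incremental subset

# Convergence check: the last few estimates should sit close together.
```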

  20. Baseline Convergence Plot

  21. Methodology: Adding more Features • Create a more complicated baseline by adding the certainty feature (new baseline = B2) • Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline • Check V-value and policy convergence • Analyze policy changes • Use Feature Comparison Metrics to determine the relative utility of the four features

  22. Tests [Diagram: Baseline 1 = {Correctness}; Baseline 2 = Baseline 1 + Certainty; Concept, Frustration, and %Correct are each added to Baseline 2]

  23. Certainty • Previous work (Bhatt et al., ’04) has shown the importance of certainty in ITS • A student who is certain and correct may require a harder question, since he or she is doing well; but a student who is correct yet showing some doubt may be becoming confused, so give an easier question

  24. B2: Baseline + Certainty Policies Trend: if neutral, give SAQ or NoQ, else give Mix

  25. Baseline 2 Convergence Plots

  26. Baseline 2 Diff Plots • Diff: for each subset corpus, compare its policy with the policy generated from the full corpus

  27. Tests [Diagram: Baseline 1 = {Correctness}; Baseline 2 = Baseline 1 + Certainty; Concept, Frustration, and %Correct are each added to Baseline 2]

  28. Feature Comparison (3 metrics) • # Diff’s • Number of new states whose policies differ from the original • Insensitive to how frequently a state occurs • % Policy Change (%P.C.) • Take into account the frequency of each state-action sequence

  29. Feature Comparison • Expected Cumulative Reward (E.C.R.) • One issue with %P.C. is that frequently occurring states have low V-values and thus may bias the score • Use the expected value of being at the start of the dialogue to compare features • ECR = average V-value of all start states
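The three metrics can be sketched directly. The policies, state counts, and start-state V-values below are invented for illustration:

```python
# Sketch of the three feature-comparison metrics: # Diffs, %Policy
# Change, and Expected Cumulative Reward. All data is hypothetical.
baseline_policy = {"certain+correct": "Mix", "neutral+correct": "SAQ",
                   "certain+incorrect": "Mix", "neutral+incorrect": "Mix"}
new_policy      = {"certain+correct": "CAQ", "neutral+correct": "SAQ",
                   "certain+incorrect": "Mix", "neutral+incorrect": "NoQ"}
state_counts    = {"certain+correct": 40, "neutral+correct": 80,
                   "certain+incorrect": 30, "neutral+incorrect": 50}

# 1) # Diffs: states whose optimal action changed.
diffs = [s for s in baseline_policy if baseline_policy[s] != new_policy[s]]

# 2) %Policy Change: frequency-weighted share of changed states.
pct_change = 100 * sum(state_counts[s] for s in diffs) / sum(state_counts.values())

# 3) Expected Cumulative Reward: average V-value over dialogue start states.
v_start = {"d1": 42.0, "d2": 35.5, "d3": 51.0}  # illustrative V-values
ecr = sum(v_start.values()) / len(v_start)
```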

  30. Feature Comparison Results • Trend of SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics • Baseline: Also tested the effects of a binary random feature • If enough data, a random feature should not alter policies • Average diff of 5.1

  31. How reliable are policies? [Convergence plots for Frustration and Concept] • Possibly the data size is small, and with increased data we may see more fluctuations

  32. Confidence Bounds • Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value • As data increases, confidence of V-value should shrink to reflect a better model of the world • Additionally, the policies should converge as well

  33. Confidence Bounds • CB’s can also be used to distinguish how much better an additional state feature is over a baseline state space • That is, if the lower bound of a new state space is greater than the upper bound of the baseline state space

  34. Crossover Example [Plot: ECR vs. amount of data; the confidence interval of the more complicated model eventually crosses above that of the Baseline]

  35. Confidence Bounds: App #2 • Automatic model switching • If you know that a model at its worst (i.e., its lower bound) is better than another model’s upper bound, then you can automatically switch to the more complicated model • Good for online RL applications

  36. Confidence Bound Methodology • For each data slice, calculate upper and lower bounds on the V-value • Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution • We do this because the observed data only approximates the real world, though it may be close • This yields 1000 new transition matrices that are all very similar • Run the MDP on all 1000 transition matrices to get a range of ECR’s • Rows with little data are very volatile, so expect a large range of ECR’s; but as data increases, the transition matrices should stabilize such that most of the sampled matrices produce policies and values similar to the original • Take the upper and lower bounds at the 2.5% percentiles
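The sampling procedure can be sketched with NumPy's Dirichlet sampler. The transition counts are hypothetical, and the one-step expected reward below is a stand-in for re-running the full MDP on each sampled matrix:

```python
# Sketch of the confidence-bound procedure: treat a row of observed
# transition counts as a Dirichlet posterior, sample many plausible
# transition rows, "solve" for each, and take percentile bounds on the
# resulting ECRs. Counts are illustrative; the one-step expected reward
# stands in for the full MDP solve.
import numpy as np

rng = np.random.default_rng(0)

# Observed transitions from one state under one action
# (to: correct, incorrect, final).
counts = np.array([30, 12, 8])
reward = np.array([0.0, 0.0, 100.0])  # +100 on reaching the final state

ecrs = []
for _ in range(1000):
    row = rng.dirichlet(counts + 1)   # one plausible transition row
    ecrs.append(row @ reward)         # stand-in for re-solving the MDP

lower, upper = np.percentile(ecrs, [2.5, 97.5])
```

With little data the sampled rows vary widely and the bounds are loose; as counts grow, the Dirichlet samples concentrate and the interval tightens.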

  37. Experiment • Original action/state setup did not show anything promising • State/action space too large for the data? • Not the best MDP instantiation? • Looked at a variety of MDP configurations • Refined reward metric • Adding discourse segmentation

  38. +essay Instantiation with ’03+’05 data

  39. +essay Baseline1

  40. +essay Baseline2

  41. +essay B2+SMove

  42. Feature Comparison Results • Reduced state size: Certainty = {Cert+Neutral, Uncert} • Trend that SMove and Concept Repetition are the best features • B2 ECR = 31.92

  43. Baseline 1 Upper = 23.65 Lower = 0.24

  44. Baseline 2 Upper = 57.16 Lower = 39.62

  45. B2+Concept Repetition Upper = 64.30 Lower = 49.16

  46. B2+Percent Correctness Upper = 48.42 Lower = 32.86

  47. B2+Student Move Upper = 61.36 Lower = 39.94

  48. Discussion • Baseline 2 – has crossover effect and policy stability • More complex features (B2 + X) – have crossover effect, but it is not clear the policies are stable (some stabilize at 17 students) • Does this indicate that 100 dialogues isn’t enough for even this simple MDP (though it is enough to feel confident about Baseline 2)?
