10 likes | 88 Views
2. 3. 4. 5. 1. System interacts with each simulation model 40,000 times to generate training corpora. Train MDP policies from each training corpus. Implement the policies into 3 new systems. New systems interact with PM 500 times each to generate testing corpora. Apply Evaluation Measures.
E N D
2 3 4 5 1 System interacts with each simulation model 40,000 times to generate training corpora Train MDP policies from each training corpus Implement the policies into 3 new systems New systems interact with PM 500 times each to generate testing corpora Apply Evaluation Measures New Sys1 TrainingCorpus1 TestingCorpus1 PM Policy1 New Sys2 PM TrainingCorpus2 TestingCorpus2 TRM Old System Policy2 Evaluation RRM TrainingCorpus3 New Sys3 TestingCorpus3 Policy3 The most realistic model Background Research Question User simulation has been used in generating training corpora for using Reinforcement Learning to learn dialog strategies (e.g., Henderson et al., 2005, Pietquin, 2004). It is not clear how realistic versus how exploratory the training corpus should be (Singh et al., 2002). This paper investigates what kind of user simulation is good for using Markov Decision Processes (MDPs) to learn dialog strategies. We compare three simulation models which differ in their efforts on modeling the dialog behaviors in a training corpus versus exploring a potentially larger dialog space. Comparing User Simulation Models For Dialog Strategy LearningHua Ai, Joel R. Tetreault and Diane J. Litman, University of Pittsburgh ITSPOKE System and Corpora Sample coded dialog Why2-Atlas text-based backend (VanLehn, Jordan, Rose et al., 2002) ITSPOKE1: Which law of motion would you use? S1: Newton’s second law. [ic, %ic=1, ncert] ITSPOKE2: Well… [WF] The best law to use is Newton’s third law. Do you recall what it says? S2: For every action there is an equal and opposite reaction. [c, %ic=50%, ncert] ITSPOKE3:Great! [SF] … 130 dialogs (1019 student turns) were collected with 26 human subjects. Correctness[c, ic] and percent incorrectness [%ic] of student answers is judged by the system and logged Certainty [cert, ncert] of student is manually annotated (kappa = 0.68) Learning Task MDP Configurations The current system can only respond to the correctness of student answers. However, other underlying information such as certainty is also useful for the tutor. Learning Task: to learn how to manipulate the strength of the tutor feedback (Strong Feedback [SF], Weak Feedback [WF] ) in order to maximize student’s overall certainty in the entire dialog. • State Space Representation (SSR): SSR1: c+%ic; SSR2: c+%ic+cert • Actions: Strong Feedback (SF), Weak Feedback (WF) • Reward Function (RF): • RF1: +100, -100; (median split on cert%, assign +100 to dialogs cert%>median) • RF2: 0 … +100 (RF2=cert%*100) Evaluation Measures Simulation Models Eval1: number of dialogs that would be assigned +100 using the median split on training corpus (Baseline=250) Eval2: the average of percent certainty in new dialogs (Baseline=35.21%) • Probabilistic Model (PM): • generate realistic student behaviors in a probabilistic way Results and Discussion • Total Random Model (TRM): • Tries to explore all the possible dialog states. Not based on any distribution • Restricted Random Model (RRM): • compromise between the exploration of the dialog state space and the realness of generated user behaviors. In blue: cases where RRM significantly outperforms the other two models; Underline: the learned policy significantly outperforms the corresponding baseline (under Eval1/Eval2). *Example Learned Policy: Give SF when the current student answer is ic and %ic>50%, otherwise give WF. Methodology • Analysis: • The performance of the PM is harmed by the data sparsity issue in the real corpus. • Percentage of frequent states seen in testing that appeared in different simulated corpora: • PM; 70.1% RRM: 76.3% TRM: 65.2% • Summary: • A restricted random model performs better than a more realistic model when trained from sparse data • State space representation may not impact evaluation results as much as reward functions and evaluation measures • It is helpful to use user simulation instead of the system to explore dialog state space