Student simulation and evaluation DOD meeting Hua Ai (hua@cs.pitt.edu) 03/03/2006
Outline • Motivations • Backgrounds • Corpus • Student Simulation Model • Comparisons • Conclusions & Future Work
Motivations • For a larger corpus • Reinforcement Learning (RL) is used to learn the best policy for spoken dialogue systems automatically • The best strategy may often not even be present in a small dataset • For a cheaper corpus • Human subjects are expensive
[Figure: Strategy learning using a simulated user (Schatzmann et al., 2005). Diagram components: Dialog Corpus, Simulation models, Simulated User, Dialog Manager, Reinforcement Learning, Strategy]
Backgrounds (1) • Education community • Focusing on changes in the student's internal knowledge representation • Usually not dialogue based • Simulated students (VanLehn et al., 1994) used for • Tutor training • Collaborative learning
Backgrounds (2) • Dialogue community • Focusing on interactions and dialogue behaviors • Simulated users have limited actions to take • (Schatzmann et al., 2005) • Simulating on the dialogue act (DA) level
Corpus (1) • Spoken dialogue physics tutor (ITSPOKE)
Corpus (2) • Tutoring procedure • 5 problems [Figure: for each problem, a dialogue of tutor (T) questions and student (S) answers is followed by an essay revision, then a further dialogue, and so on …]
Corpus (3) • Tutor's behaviors • Defined in KCD (Knowledge Construction Dialogues) [Figure: KCD flow branching on the student's answer: Correct vs. Incorrect/Partially Correct]
Corpus (4) • Two corpora: f03 and s05 • Different groups of subjects
Simulation Models (1) • Simulating on word level • Students have more complex behaviors • DA info alone isn't enough for the system • Two models trained on two corpora • ProbCorrect: 03ProbCorrect (trained on f03), 05ProbCorrect (trained on s05) • Random: 03Random (trained on f03), 05Random (trained on s05)
Simulation Models (2) • ProbCorrect Model • Simulates the average knowledge level of real students • Simulates meaningful dialogue behaviors • Random Model • Gives nonsense answers • Serves as a contrast
ProbCorrect Model • Real corpus • question1: Answer1_1 (c), Answer1_2 (ic), Answer1_3 (ic) • question2: Answer2_1 (c), Answer2_2 (ic) • Candidate answers • For question1: c:ic = 1:2; c: Answer1_1; ic: Answer1_2, Answer1_3 • For question2: c:ic = 1:1; c: Answer2_1; ic: Answer2_2 • ProbCorrect Model, answering question1: • Choose to give a c/ic answer with the same average probability as real students • Randomly choose one answer from the corresponding answer set
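The two-step ProbCorrect procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the talk: the function names (`train_prob_correct`, `p_correct`, `simulate_answer`) and the `(question, answer, label)` corpus layout are hypothetical, and every observed answer is kept in the candidate set, so an answer seen more than once is proportionally more likely to be drawn.

```python
import random
from collections import defaultdict

# Toy corpus mirroring the slide: (question, answer, label),
# where label is "c" (correct) or "ic" (incorrect).
CORPUS = [
    ("question1", "Answer1_1", "c"),
    ("question1", "Answer1_2", "ic"),
    ("question1", "Answer1_3", "ic"),
    ("question2", "Answer2_1", "c"),
    ("question2", "Answer2_2", "ic"),
]

def train_prob_correct(corpus):
    """Group the observed answers per question into c and ic sets."""
    model = defaultdict(lambda: {"c": [], "ic": []})
    for question, answer, label in corpus:
        model[question][label].append(answer)
    return dict(model)

def p_correct(model, question):
    """Fraction of real student answers to this question that were correct."""
    sets = model[question]
    return len(sets["c"]) / (len(sets["c"]) + len(sets["ic"]))

def simulate_answer(model, question, rng=random):
    """Pick c or ic with the real students' average probability,
    then pick one answer uniformly from the matching set."""
    label = "c" if rng.random() < p_correct(model, question) else "ic"
    return rng.choice(model[question][label])

model = train_prob_correct(CORPUS)
answer = simulate_answer(model, "question1", random.Random(0))
```

With the toy corpus above, `p_correct(model, "question1")` is 1/3, matching the c:ic = 1:2 ratio on the slide.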
Random Model • Real corpus (HC03&05) • Question1: Answer1_1, Answer1_2, Answer1_3, Answer1_4 • Question2: Answer2_1, Answer2_2 • Candidate answers: 1) Answer1_1 2) Answer1_2 3) Answer1_3 4) Answer1_4 5) Answer2_1 6) Answer2_2 • Big random Model, answering Question i: • Any of the 6 answers with the same probability (regardless of the question!)
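For contrast, the Random model reduces to a single uniform draw from the pooled candidate answers. The sketch below is a hedged illustration: the names `build_answer_pool` and `simulate_random_answer` and the `(question, answer)` corpus layout are hypothetical, and the pool is assumed to hold each distinct answer once, as in the slide's list of 6 candidates.

```python
import random

# Toy corpus mirroring the slide: (question, answer) pairs.
CORPUS = [
    ("Question1", "Answer1_1"),
    ("Question1", "Answer1_2"),
    ("Question1", "Answer1_3"),
    ("Question1", "Answer1_4"),
    ("Question2", "Answer2_1"),
    ("Question2", "Answer2_2"),
]

def build_answer_pool(corpus):
    """Pool every distinct answer, discarding which question it belonged to."""
    pool = []
    for _question, answer in corpus:
        if answer not in pool:
            pool.append(answer)
    return pool

def simulate_random_answer(pool, question, rng=random):
    """Return any pooled answer with equal probability.
    The question argument is deliberately ignored."""
    return rng.choice(pool)

pool = build_answer_pool(CORPUS)
samples = [simulate_random_answer(pool, "Question2", random.Random(i)) for i in range(200)]
```

Because the question is ignored, the samples drawn for Question2 will typically include answers that real students only ever gave to Question1, which is what makes this model a nonsense baseline.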
Experiments • Comparisons between real corpora • Comparisons between real & simulated corpora • Comparisons between simulated corpora
Real Corpora Comparisons (1) • Evaluation metrics • High-level dialog features • Dialog style and cooperativeness • Dialog Success Rate and Efficiency • Learning Gains
Real corpora comparisons (2) • High-level dialog features
Real corpora comparisons (3) • Dialogue style features
Real corpora comparisons (4) • Dialogue success rate
Real corpora comparisons (5) • Learning gains features
Results • Differences captured by these simple metrics are not enough to conclude whether a corpus is real or simulated (Schatzmann et al., 2005) • Differences could be due to different user populations
Results (1) • Most of the measurements are able to distinguish between the Random and ProbCorrect models • The ProbCorrect model generates more realistic behaviors • We can't draw conclusions about the power of these metrics, since the two simulated corpora are very different
Results (2) • Differences between the real corpora and the Random model are captured clearly, but differences between the real corpora and the ProbCorrect model are not clear • We don't expect this simple model to produce a very realistic corpus, so it's surprising that the differences are small
Results (3) • s05 variety > f03 variety • 05ProbCorrect variety > 03ProbCorrect variety • However, we don't get significantly more variety in the simulated corpora than in the real ones • Could be because the computer tutor is simple (c/ic) • Could be because we're using the same candidate answer set
Results (4) • ProbCorrect models trained on different real corpora are quite different • A ProbCorrect model is more similar to the real corpus it is trained on than to the other real corpus
Comparisons between simulated dialogues with different dialogue structure
Results • Larger differences between the two simulated corpora in prob7 than in prob34 • Dialogue structure of prob34 is more restricted • The power of these simple metrics is restricted by the dialogue structure
Conclusions • The simple measurements can distinguish between • real corpora • Different populations • simulated and real corpora • To different extents • simulated corpora • Different models • Trained on different corpora • Their power is limited by the dialogue structure
Future work • Explore “deep” evaluation metrics • Test simulated corpus on policy • More simulation models • More human features • Emotion, learning • Special cases • Quick learners, slow learners