
Predicting User Satisfaction and Student Learning in Dialogue Tutoring Systems

This study explores modeling user satisfaction and student learning in a spoken dialogue tutoring system, training prediction models over generic, tutoring-specific, and user affect interaction parameters. The goal is to use these models to improve system performance for future users.





Presentation Transcript


  1. Modeling User Satisfaction and Student Learning in a Spoken Dialogue Tutoring System with Generic, Tutoring, and User Affect Parameters Kate Forbes-Riley and Diane Litman University of Pittsburgh

  2. Outline • Overview • PARADISE • System and Corpora • Interaction Parameters • Prediction Models • Conclusions and Future Work

  3. Overview • Goals: • PARADISE: Model performance in our spoken dialogue tutoring system in terms of interaction parameters • Focus design efforts on improving parameters - predict better performance for future users • Use model to predict simulated user performance - as different system versions are designed

  4. Overview • What is Performance in our spoken dialogue tutoring system? • User Satisfaction: primary metric for many spoken dialogue systems, e.g. travel-planning (user surveys) • Hypothesis: less useful • Student Learning: primary metric for tutoring systems (student pre/post tests) • Hypothesis: more useful

  5. Overview • What Interaction Parameters for our spoken dialogue tutoring system? • Spoken Dialogue System-Generic (e.g. time): shown useful in non-tutoring PARADISE applications modeling User Satisfaction • Tutoring-Specific (e.g. correctness) • Hypothesis: task-specific parameters impact performance • User Affect (e.g. uncertainty) • Hypothesis: affect impacts performance - generic too

  6. Overview • Are the resulting Performance Models useful? • Generic and Tutoring parameters yield useful Student Learning models • Affect parameters increase usefulness • Generic and Tutoring parameters yield less useful User Satisfaction models than prior non-tutoring applications • (Bonneau-Maynard et al., 2000), (Walker et al., 2002), (Möller, 2005): better models with generic only • Too little data to include Affect parameters

  7. PARADISE Framework (Walker et al., 1997) • Measure parameters (interaction costs and benefits) and performance in system corpus • Train model via multiple linear regression (MLR) over parameters, predict performance (R² = variance predicted) • SPSS stepwise MLR determines parameter inclusion (add the most correlated parameter until R² no longer improves or the model is non-significant): System Performance = ∑_{i=1}^{n} w_i · p_i • Test model usefulness (generalization) on new corpus (R²)
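
Below is a minimal sketch of this stepwise procedure, with scikit-learn standing in for SPSS. The z-score normalization of parameters follows the PARADISE framework; the stopping threshold (min_gain) and all variable names are illustrative assumptions, not the paper's settings.

```python
# PARADISE-style forward stepwise MLR (a sketch, not the paper's SPSS setup).
# X: one row of interaction parameters per student; y: performance measure.
import numpy as np
from sklearn.linear_model import LinearRegression

def stepwise_paradise(X, y, names, min_gain=0.01):
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score parameters (PARADISE convention)
    selected, best_r2, model = [], 0.0, None
    remaining = list(range(X.shape[1]))
    while remaining:
        # Candidate = remaining parameter most correlated with performance.
        corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in remaining]
        j = remaining[int(np.argmax(corrs))]
        cand = LinearRegression().fit(X[:, selected + [j]], y)
        r2 = cand.score(X[:, selected + [j]], y)
        if r2 - best_r2 < min_gain:            # no useful R² gain: stop
            break
        selected.append(j)
        remaining.remove(j)
        best_r2, model = r2, cand
    # model.coef_ holds the weights w_i over the selected parameters p_i.
    return [names[j] for j in selected], model, best_r2
```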

  8. Qualitative Physics Tutor • Speech front-end for text-based Why2-Atlas (VanLehn et al., 2002)

  9. Sphinx2 speech recognizer • Why2-Atlas performs NLP on the recognized transcript

  10. 3 ITSPOKE Corpora • Synthesized voice: Cepstral text-to-speech system • Pre-Recorded voice: paid voice talent

  11. Experimental Procedure • Subjects without college physics: • Read a small background document • Took a pretest • Worked 5 problems (dialogues) with ITSPOKE • Took a posttest

  12. User Satisfaction Survey after Posttest • Task Ease: It was easy to learn from the tutor. • TTS Performance: It was easy to understand the tutor. • User Expertise: I knew what I could say or do at each point in the conversations with the tutor. • Expected Behavior: The tutor worked the way I expected it to. • Future Use: Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly. • ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)
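
For concreteness, a small sketch of how the total survey score is derived from the five Likert items above; the item keys mirror the slide, everything else is illustrative:

```python
# Total User Satisfaction = sum of five Likert items (1-5 each), range 5-25.
LIKERT = {"ALMOST ALWAYS": 5, "OFTEN": 4, "SOMETIMES": 3,
          "RARELY": 2, "ALMOST NEVER": 1}

def total_survey_score(responses):
    """responses: dict mapping each of the 5 survey items to a Likert answer."""
    assert len(responses) == 5
    return sum(LIKERT[answer] for answer in responses.values())

score = total_survey_score({
    "Task Ease": "OFTEN", "TTS Performance": "SOMETIMES",
    "User Expertise": "OFTEN", "Expected Behavior": "RARELY",
    "Future Use": "SOMETIMES",
})  # -> 16, within the observed 9-24 range
```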

  13. User Affect Annotation • Each student turn labeled with 1 Certainness State: • certain, uncertain, mixed, neutral • Each student turn labeled with 1 Frustration State: • frustrated, non-frustrated

  14. SYN03 Annotated Excerpt T: What is the NET force acting on the truck equal to? S: I don’t know (ASR: I don’t know) [Uncertain, Frustrated, Incorrect] T: Since net force is defined as the vector sum of all the forces exerted on an object, let's first find the forces acting on the truck. Try to name them. S: The impact force, the normal force, and gravity (ASR: the impact force but normal force and gravity) [Certain, Non-Frustrated, Correct] T: Yep.

  15. Interaction Parameters • Prior PARADISE applications • Each dialogue between user and system represents a single task (e.g., booking airline travel) • Parameters calculated on a per-dialogue basis • Our tutoring application • Entire tutoring session (5 dialogues) between student and ITSPOKE represents a single task • Parameters calculated on a per-student basis
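
A toy sketch of this per-student unit of analysis, assuming hypothetical per-dialogue records; a student's five problem dialogues collapse into a single parameter vector:

```python
# Collapse one student's 5 problem dialogues into per-student parameters,
# the unit of analysis here (vs. PARADISE's usual per-dialogue unit).
# The record fields ("seconds", "student_turns", ...) are hypothetical.
def per_student_parameters(dialogues):
    time = sum(d["seconds"] for d in dialogues)
    turns = sum(d["student_turns"] for d in dialogues)
    words = sum(d["student_words"] for d in dialogues)
    return {"Time on Task": time,
            "Total Student Turns": turns,
            "Ave. Student Words/Turn": words / turns}
```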

  16. 13 Dialogue System-Generic Parameters • Most from prior PARADISE applications (Möller, 2005), (Walker et al., 2002), (Bonneau-Maynard et al., 2000) • Time on Task • Total ITSPOKE Turns, Total Student Turns • Total ITSPOKE Words, Total Student Words • Ave. ITSPOKE Words/Turn, Ave. Student Words/Turn • Word Error Rate, Concept Accuracy • Total Timeouts, Total Rejections • Ratio of Student Words to ITSPOKE Words • Ratio of Student Turns to ITSPOKE Turns
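
Word Error Rate is one generic parameter that benefits from a concrete definition. Here is a self-contained sketch using the standard Levenshtein alignment (not the project's actual scoring code), applied to the recognized turn from the slide-14 excerpt:

```python
# Word error rate: (insertions + deletions + substitutions) / reference length,
# computed with a standard Levenshtein dynamic program over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the impact force the normal force and gravity",
          "the impact force but normal force and gravity"))  # 0.125 (1 sub / 8 words)
```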

  17. 12 Tutoring-Specific Parameters • 9 Parameters related to Correctness of Student Turn • ITSPOKE labels: Correct, Incorrect, Partially Correct • Total and Percent for each label • Ratio of each label to every other label • Total number of essays per student • Student pretest and posttest score (for User Satisfaction models) • Similar parameters available in most tutoring systems
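
A sketch of how the 9 correctness parameters could be derived from per-turn ITSPOKE labels; the function and dictionary key names are illustrative, not the paper's analysis code:

```python
# 9 correctness parameters per student: Total and Percent for each of the
# 3 labels, plus the 3 pairwise label ratios.
from collections import Counter
from itertools import combinations

LABELS = ("Correct", "Incorrect", "Partially Correct")

def correctness_parameters(turn_labels):
    counts, n = Counter(turn_labels), len(turn_labels)
    params = {}
    for lab in LABELS:
        params[f"Total {lab}"] = counts[lab]
        params[f"Percent {lab}"] = 100.0 * counts[lab] / n
    for a, b in combinations(LABELS, 2):       # ratio of each label to each other
        params[f"{a}/{b}"] = counts[a] / max(counts[b], 1)  # guard against /0
    return params                              # 9 parameters in total
```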

  18. 25 User Affect Parameters • For each of our 4 Certainness labels: • Total, Percent, and Ratio to each other label • Total for each sequence of identical labels (e.g. Certain:Certain) • For each of our 2 Frustration labels: • Total, Percent, and Ratio to each other label • Total for each sequence of identical labels
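
The "sequence of identical labels" parameters count adjacent repeats of the same affect label across a student's turns. A minimal sketch, with illustrative label strings:

```python
# Count adjacent repeats of the same affect label (e.g. Certain:Certain)
# over a student's turn sequence. Illustrative code, not the paper's.
def identical_label_sequences(turn_labels):
    """turn_labels: per-turn affect labels in dialogue order."""
    repeats = {}
    for prev, cur in zip(turn_labels, turn_labels[1:]):
        if prev == cur:
            key = f"{cur}:{cur}"
            repeats[key] = repeats.get(key, 0) + 1
    return repeats

print(identical_label_sequences(
    ["uncertain", "uncertain", "neutral", "certain", "certain", "certain"]))
# -> {'uncertain:uncertain': 1, 'certain:certain': 2}
```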

  19. User Satisfaction Prediction Models • Predicted Variable: Total Survey Score • Range: 9-24 out of a possible 5-25; no difference between corpora (p = .46) • Input Parameters: Generic and Tutoring • Do models generalize across corpora (system versions)? • Train on PR05 → Test on SYN05 • Train on SYN05 → Test on PR05 • Do models generalize better within corpora? • Train on half of PR05 → Test on the other half (for each half) • Train on half of SYN05 → Test on the other half (for each half)
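
A sketch of these generalization tests with scikit-learn; the corpus arrays (X_pr05, y_pr05, X_syn05, y_syn05) are placeholders for the measured parameters and survey scores:

```python
# Fit the regression on one corpus and report R² on the other, and compare
# with half-splits inside a single corpus.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def test_r2(X_train, y_train, X_test, y_test):
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Inter-corpus: r2 = test_r2(X_pr05, y_pr05, X_syn05, y_syn05)  (and swapped)
# Intra-corpus: split one corpus in half and test each half on the other:
# Xa, Xb, ya, yb = train_test_split(X_pr05, y_pr05, test_size=0.5)
# r2_halves = (test_r2(Xa, ya, Xb, yb), test_r2(Xb, yb, Xa, ya))
```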

  20. User Satisfaction Prediction Models • Best Results (on Test Data) • Inter-corpus models are weak and don’t generalize well • Intra-corpus models generalize better, but are still weak predictors of User Satisfaction • Generic and Tutoring parameters selected

  21. User Satisfaction Prediction Models • Comparison to Prior Work • Some of the same parameters also selected as predictors, e.g. (User Words/Turn) in (Walker et al., 2002) • Higher best test results (R² = .3-.5) in (Möller, 2005), (Walker et al., 2002), and (Bonneau-Maynard et al., 2000)

  22. Student Learning Prediction Models • First Experiments: • Data and Input Parameters: same as for User Satisfaction experiments • Predicted Variable: Posttest controlled for Pretest (learning gains); significant learning independent of corpus (p < .001)
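
One way to realize "Posttest controlled for Pretest" is to include pretest as a covariate in the regression, so the remaining weights model learning gain. A minimal statsmodels sketch, with placeholder data arrays:

```python
# Pretest enters as a covariate; the other parameters' weights then model
# gain over the pretest baseline. Data arrays are placeholders.
import numpy as np
import statsmodels.api as sm

def learning_model(pretest, params, posttest):
    X = sm.add_constant(np.column_stack([pretest, params]))
    return sm.OLS(posttest, X).fit()

# fit = learning_model(pretest_scores, interaction_params, posttest_scores)
# fit.rsquared is the variance in posttest predicted by the model.
```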

  23. Student Learning Prediction Models • First Experiments: (Best Results on Test Data in table) • All models account for ~ 50% of Posttest variance in train and test data • Intra-corpus models don’t show higher generalizability • Generic and Tutoring parameters selected

  24. Student Learning Prediction Models • Further experiments: • Including third corpus (SYN03) with Generic and Tutoring parameters yields similar results • Best Result (on Test Data):

  25. Student Learning Prediction Models • Further experiments: including User Affect Parameters can improve results: Posttest = .86 * Time + .65 * Pretest - .54 * #Neutrals • Same experiment without User Affect Parameters:
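
To apply such a model to a new student, the raw parameters must first be standardized, since PARADISE trains its weights on z-score-normalized parameters. In the sketch below, the training means and standard deviations are made-up placeholders, not values from the paper:

```python
# Applying Posttest = .86*Time + .65*Pretest - .54*#Neutrals to one student.
# TRAIN_STATS are invented placeholders, NOT statistics from the paper.
TRAIN_STATS = {"Time": (45.0, 10.0), "Pretest": (0.5, 0.15), "Neutrals": (60.0, 20.0)}
WEIGHTS = {"Time": 0.86, "Pretest": 0.65, "Neutrals": -0.54}

def predict_posttest_z(raw):
    """raw: dict of a student's unnormalized Time, Pretest, #Neutrals."""
    z = {k: (raw[k] - mean) / std for k, (mean, std) in TRAIN_STATS.items()}
    return sum(WEIGHTS[k] * z[k] for k in WEIGHTS)  # predicted posttest, z units

print(predict_posttest_z({"Time": 50.0, "Pretest": 0.6, "Neutrals": 40.0}))
```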

  26. Summary: Student Learning Models • This method of developing a Student Learning model: • useful for our tutoring application • User Affect parameters can increase usefulness of Student Learning Models

  27. Summary: User Satisfaction Models • This method of developing a User Satisfaction model: • less useful for our tutoring application as compared to prior non-tutoring applications • Why are our User Satisfaction models less useful? • Per-student measure of User Satisfaction not fine-grained enough • Tutoring systems not designed to maximize User Satisfaction; goal is to maximize Student Learning

  28. Conclusions • For the tutoring community: • PARADISE provides an effective method of extending single correlations with Student Learning to multi-parameter predictive models • For the spoken dialogue community: • When using PARADISE: • other performance metrics may be more useful for applications not optimized for User Satisfaction • task-specific and user affect parameters may be useful

  29. Future Work • Investigate usefulness of additional input parameters for predicting Student Learning and User Satisfaction • User Affect annotations (once complete) • Tutoring Dialogue Acts (e.g. Möller, 2005; Litman and Forbes-Riley, 2006) • Discourse Structure annotations (Rotaru and Litman, 2006)

  30. Thank You! Questions? Further information: http://www.cs.pitt.edu/~litman/itspoke.html

  31. Student Learning Prediction Models • Further experiments: • Including third corpus (SYN03) with same Generic and Tutoring-Specific parameters yields similar results • Training set most similar to test set yields highest generalizability

  32. User Satisfaction Prediction Models • Comparison to Prior Work • Some of the same parameters also selected as predictors, e.g. (User Words/Turn) in (Walker et al., 2002) • Higher best test results (R² = .3-.5) in (Möller, 2005), (Walker et al., 2002), and (Bonneau-Maynard et al., 2000) • Similar sensitivity to changes in training data in (Möller, 2005) and (Walker et al., 2000)

  33. Student Learning Prediction Models • First Experiments: (Best Results on Test Data in table) • All models account for ~ 50% of Posttest variance in train and test data; less sensitive to training data changes • Intra-corpus models don’t have higher generalizability • Generic and Tutoring parameters selected
