This study explores modeling user satisfaction and student learning in a spoken dialogue tutoring system by analyzing various interaction parameters and prediction models. The research aims to improve system performance for future users.
Modeling User Satisfaction and Student Learning in a Spoken Dialogue Tutoring System with Generic, Tutoring, and User Affect Parameters Kate Forbes-Riley and Diane Litman University of Pittsburgh
Outline • Overview • PARADISE • System and Corpora • Interaction Parameters • Prediction Models • Conclusions and Future Work
Overview • Goals: • PARADISE: Model performance in our spoken dialogue tutoring system in terms of interaction parameters • Focus design efforts on improving parameters that predict better performance for future users • Use the model to predict simulated user performance as different system versions are designed
Overview • What is Performance in our spoken dialogue tutoring system? • User Satisfaction: primary metric for many spoken dialogue systems, e.g. travel-planning (user surveys) • Hypothesis: less useful • Student Learning: primary metric for tutoring systems (student pre/post tests) • Hypothesis: more useful
Overview • What Interaction Parameters for our spoken dialogue tutoring system? • Spoken Dialogue System-Generic (e.g. time): shown useful in non-tutoring PARADISE applications modeling User Satisfaction • Tutoring-Specific (e.g. correctness) • Hypothesis: task-specific parameters impact performance • User Affect (e.g. uncertainty) • Hypothesis: affect impacts performance - in generic applications too
Overview • Are the resulting Performance Models useful? • Generic and Tutoring parameters yield useful Student Learning models • Affect parameters increase usefulness • Generic and Tutoring parameters yield less useful User Satisfaction models than prior non-tutoring applications • (Bonneau-Maynard et al., 2000), (Walker et al., 2002), (Möller, 2005): better models with generic parameters only • Too little data to include Affect parameters
PARADISE Framework (Walker et al., 1997) • Measure parameters (interaction costs and benefits) and performance in a system corpus • Train a model via multiple linear regression (MLR) over the parameters to predict performance (R² = variance predicted): System Performance = ∑ᵢ₌₁ⁿ wᵢ * pᵢ • SPSS stepwise MLR determines parameter inclusion (add the most correlated parameter until no better R² or a non-significant model; sketch below) • Test model usefulness (generalization) on a new corpus (R²)
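As a concrete illustration, here is a minimal Python sketch of the forward stepwise procedure described above, using scikit-learn. It is a stand-in under stated assumptions, not the authors' actual setup: SPSS stepwise MLR uses F-test entry/removal criteria, while this sketch uses a simple R² gain threshold (min_gain).

```python
from sklearn.linear_model import LinearRegression

def stepwise_mlr(X, y, names, min_gain=0.01):
    """Greedy forward selection, a simplified stand-in for SPSS
    stepwise MLR: repeatedly add the parameter that most improves
    training R^2, stopping when the gain falls below min_gain
    (SPSS actually uses F-test entry/removal criteria)."""
    selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Try each remaining parameter and keep the best resulting R^2.
        r2, j = max(
            (LinearRegression().fit(X[:, selected + [j]], y)
                                .score(X[:, selected + [j]], y), j)
            for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    if not selected:
        return None, [], 0.0
    # Final model: System Performance = sum_i w_i * p_i
    model = LinearRegression().fit(X[:, selected], y)
    return model, [names[j] for j in selected], best_r2
```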
Qualitative Physics Tutor: ITSPOKE • Speech front-end for the text-based Why2-Atlas tutor (VanLehn et al., 2002) • Sphinx2 speech recognizer; Why2-Atlas performs NLP on the recognized transcript
3 ITSPOKE Corpora • Synthesized tutor voice (SYN03, SYN05): Cepstral text-to-speech system • Pre-recorded tutor voice (PR05): paid voice talent
Experimental Procedure • Subjects without college physics: • Read a small background document • Took a pretest • Worked 5 problems (dialogues) with ITSPOKE • Took a posttest
User Satisfaction Survey after Posttest Task Ease: It was easy to learn from the tutor. TTS Performance: It was easy to understand the tutor. User Expertise: I knew what I could say or do at each point in the conversations with the tutor. Expected Behavior: The tutor worked the way I expected it to. Future Use: Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly. ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)
User Affect Annotation • Each student turn labeled with 1 Certainness State: • certain, uncertain, mixed, neutral • Each student turn labeled with 1 Frustration State: • frustrated, non-frustrated
SYN03 Annotated Excerpt T: What is the NET force acting on the truck equal to? S: I don’t know (ASR: I don’t know) [Uncertain, Frustrated, Incorrect] T: Since net force is defined as the vector sum of all the forces exerted on an object, let's first find the forces acting on the truck. Try to name them. S: The impact force, the normal force, and gravity (ASR: the impact force but normal force and gravity) [Certain, Non-Frustrated, Correct] T: Yep.
Interaction Parameters • Prior PARADISE applications • Each dialogue between user and system represents a single task (e.g., booking airline travel) • Parameters calculated on a per-dialogue basis • Our tutoring application • Entire tutoring session (5 dialogues) between student and ITSPOKE represents a single task • Parameters calculated on a per-student basis
13 Dialogue System-Generic Parameters • Most from prior PARADISE applications: (Möller, 2005), (Walker et al., 2002), (Bonneau-Maynard et al., 2000) • Time on Task • Total ITSPOKE Turns, Total Student Turns • Total ITSPOKE Words, Total Student Words • Ave. ITSPOKE Words/Turn, Ave. Student Words/Turn • Word Error Rate, Concept Accuracy • Total Timeouts, Total Rejections • Ratio of Student Words to ITSPOKE Words • Ratio of Student Turns to ITSPOKE Turns • (computation sketched below)
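A sketch of how such per-student parameters might be computed from session logs. The Turn structure and field names are illustrative assumptions, not ITSPOKE's actual log format; Word Error Rate and Concept Accuracy are omitted because they require reference transcripts.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Turn:
    speaker: str        # "tutor" or "student" (assumed log format)
    words: List[str]
    timeout: bool = False
    rejection: bool = False

def generic_parameters(turns: List[Turn], seconds: float) -> Dict[str, float]:
    """Per-student generic parameters, pooled over the whole
    5-dialogue session rather than computed per dialogue."""
    tut = [t for t in turns if t.speaker == "tutor"]
    stu = [t for t in turns if t.speaker == "student"]
    tut_w = sum(len(t.words) for t in tut)
    stu_w = sum(len(t.words) for t in stu)
    return {
        "TimeOnTask": seconds,
        "TotalSystemTurns": len(tut),
        "TotalStudentTurns": len(stu),
        "TotalSystemWords": tut_w,
        "TotalStudentWords": stu_w,
        "AvgSystemWordsPerTurn": tut_w / max(len(tut), 1),
        "AvgStudentWordsPerTurn": stu_w / max(len(stu), 1),
        "TotalTimeouts": sum(t.timeout for t in stu),
        "TotalRejections": sum(t.rejection for t in stu),
        "WordRatio": stu_w / max(tut_w, 1),
        "TurnRatio": len(stu) / max(len(tut), 1),
    }
```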
12 Tutoring-Specific Parameters • 9 parameters related to Correctness of Student Turn • ITSPOKE labels: Correct, Incorrect, Partially Correct • Total and Percent for each label • Ratio of each label to every other label • Total number of essays per student • Student pretest and posttest score (for User Satisfaction models) • Similar parameters available in most tutoring systems
25 User Affect Parameters • For each of our 4 Certainness labels: • Total, Percent, and Ratio to each other label • Total for each sequence of identical labels (e.g. Certain:Certain) • For each of our 2 Frustration labels: • Total, Percent, and Ratio to each other label • Total for each sequence of identical labels • (see the sketch below)
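The correctness parameters above and these affect parameters share the same total/percent/ratio structure, so one sketch covers both. "Total for each sequence of identical labels" is read here as the count of adjacent identical-label pairs (e.g. Certain:Certain); that reading is an assumption.

```python
from collections import Counter
from typing import Dict, List

def label_parameters(labels: List[str], states: List[str]) -> Dict[str, float]:
    """Totals, percents, pairwise ratios, and identical-label sequence
    counts for one annotation dimension (certainness, frustration, or
    correctness); `labels` is the per-turn annotation sequence."""
    counts = Counter(labels)
    n = len(labels)
    params: Dict[str, float] = {}
    for a in states:
        params[f"Total_{a}"] = counts[a]
        params[f"Pct_{a}"] = counts[a] / n if n else 0.0
        # Adjacent identical-label pairs, e.g. Certain:Certain.
        params[f"Seq_{a}_{a}"] = sum(
            1 for x, y in zip(labels, labels[1:]) if x == y == a)
        for b in states:
            if a != b:
                params[f"Ratio_{a}_{b}"] = counts[a] / max(counts[b], 1)
    return params

# e.g. label_parameters(per_turn_certainness,
#                       ["certain", "uncertain", "mixed", "neutral"])
```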
User Satisfaction Prediction Models • Predicted Variable: Total Survey Score • Range: 9-24 out of a possible 5-25; no differences across corpora (p = .46) • Input Parameters: Generic and Tutoring • Do models generalize across corpora (system versions)? • Train on PR05, test on SYN05 • Train on SYN05, test on PR05 • Do models generalize better within corpora? • Train on half of PR05, test on the other half (for each half) • Train on half of SYN05, test on the other half (for each half) • (evaluation step sketched below)
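A minimal sketch of the evaluation step: fit on one corpus (or corpus half) and report R² on the held-out data. The matrix names are hypothetical; `cols` would come from running the stepwise selection on the training corpus.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def test_r2(X_train, y_train, X_test, y_test, cols):
    """Train-on-one, test-on-another R^2, as used for both the
    inter-corpus and intra-corpus (half/half) comparisons."""
    model = LinearRegression().fit(X_train[:, cols], y_train)
    return r2_score(y_test, model.predict(X_test[:, cols]))

# Inter-corpus:  test_r2(X_pr05, y_pr05, X_syn05, y_syn05, cols)
# Intra-corpus:  split one corpus in half and evaluate both directions.
```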
User Satisfaction Prediction Models • Best Results (on Test Data) • Inter-corpus models are weak and don’t generalize well • Intra-corpus models generalize better, but are still weak predictors of User Satisfaction • Generic and Tutoring parameters selected
User Satisfaction Prediction Models • Comparison to Prior Work • Some of the same parameters also selected as predictors, e.g. User Words/Turn in (Walker et al., 2002) • Higher best test results (R² = .3-.5) in (Möller, 2005), (Walker et al., 2002), and (Bonneau-Maynard et al., 2000) • Similar sensitivity to changes in training data in (Möller, 2005) and (Walker et al., 2000)
Student Learning Prediction Models • First Experiments: • Data and Input Parameters: same as for the User Satisfaction experiments • Predicted Variable: Posttest controlled for Pretest (learning gains); significant learning independent of corpus (p < .001) • (one operationalization sketched below)
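One standard way to operationalize "Posttest controlled for Pretest" is to force pretest into the regression as a covariate, so that selected parameters explain variance beyond incoming knowledge. Whether the authors did exactly this is an assumption; the sketch below shows that reading.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_learning_model(pretest, params, posttest):
    """Predict posttest with pretest always included as a covariate
    (an assumed operationalization of 'controlled for Pretest');
    `params` holds the interaction-parameter columns."""
    X = np.column_stack([pretest, params])
    return LinearRegression().fit(X, posttest)
```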
Student Learning Prediction Models • First Experiments: (Best Results on Test Data in table) • All models account for ~50% of Posttest variance in train and test data, and are less sensitive to changes in training data • Intra-corpus models don't show higher generalizability • Generic and Tutoring parameters selected
Student Learning Prediction Models • Further Experiments: • Including a third corpus (SYN03) with the same Generic and Tutoring parameters yields similar results • The training set most similar to the test set yields the highest generalizability • Best Result (on Test Data) in table
Student Learning Prediction Models • Further Experiments: including User Affect parameters can improve results (see the sketch below): Posttest = .86 * Time + .65 * Pretest - .54 * #Neutrals • The same experiment without User Affect parameters yields a weaker model
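Read as code, the reported model is just a weighted sum. In the sketch below the inputs are assumed to be standardized (z-scored) parameter values, as is typical for PARADISE models; the slide does not state the normalization.

```python
def predict_posttest(time_on_task, pretest, n_neutrals):
    """The slide's best affect-augmented model; inputs are assumed
    to be z-score-normalized parameter values (an assumption)."""
    return 0.86 * time_on_task + 0.65 * pretest - 0.54 * n_neutrals
```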
Summary: Student Learning Models • This method of developing a Student Learning model: • useful for our tutoring application • User Affect parameters can increase usefulness of Student Learning Models
Summary: User Satisfaction Models • This method of developing a User Satisfaction model: • less useful for our tutoring application as compared to prior non-tutoring applications • Why are our User Satisfaction models less useful? • Per-student measure of User Satisfaction not fine-grained enough • Tutoring systems not designed to maximize User Satisfaction; goal is to maximize Student Learning
Conclusions • For the tutoring community: • PARADISE provides an effective method of extending single-parameter Student Learning correlations into multi-parameter predictive models • For the spoken dialogue community, when using PARADISE: • other performance metrics may be more useful for applications not optimized for User Satisfaction • task-specific and user affect parameters may be useful
Future Work • Investigate usefulness of additional input parameters for predicting Student Learning and User Satisfaction • User Affect annotations (once complete) • Tutoring Dialogue Acts (e.g. Möller, 2005; Litman and Forbes-Riley, 2006) • Discourse Structure annotations (Rotaru and Litman, 2006)
Thank You! Questions? Further information: http://www.cs.pitt.edu/~litman/itspoke.html