Metacognition and Learning in Spoken Dialogue Computer Tutoring

Kate Forbes-Riley and Diane Litman, Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA, USA

Outline
• Overview
• Spoken Dialogue Computer Tutoring Data
• Metacognitive Metrics based on Student Uncertainty and Correctness labels
• Do Metacognitive Metrics Predict Learning?
• Conclusions, Future Work

Background
• Metacognition: important measure of performance and learning
• Uncertainty: metacognitive state in tutorial dialogue research
  • Signals learning impasses (e.g., VanLehn et al., 2003)
  • Correlates with learning (Litman & Forbes-Riley, 2009; Craig et al., 2004)
  • Computer tutor responses improve performance (Forbes-Riley & Litman, 2010; Aist et al., 2002; Tsukahara & Ward, 2001)
• Complex metrics: combine dimensions (uncertainty, correctness)
  • Learning impasse severity (Forbes-Riley et al., 2008)
  • Knowledge monitoring accuracy (Nietfeld et al., 2006)
  • Bias (Kelemen et al., 2000; Saadawi et al., 2009)
  • Discrimination (Kelemen et al., 2000; Saadawi et al., 2009)

Our Research
• Prior work: do metacognitive metrics predict learning in a wizarded spoken dialogue tutoring corpus? (Litman & Forbes-Riley, 2009)
  • Computed on manually labeled uncertainty and correctness
  • All four complex metrics predicted learning
• Current work: do metrics also predict learning in a comparable fully automated corpus?
  • One set computed on real-time automatic (noisy) labels
  • One set computed on post-experiment manual labels
  • Most complex metrics still predict learning (noisy or manual)
• Worthwhile/feasible to remediate noisy metacognitive metrics

Spoken Dialogue Computer Tutoring Data
• ITSPOKE: speech-enhanced, modified version of the Why2-Atlas qualitative physics tutor (VanLehn, Jordan, Rosé et al., 2002)
• Two prior controlled experiments evaluated the utility of responding to uncertainty over and above correctness
  • Uncertainty and incorrectness are learning impasses (opportunities to learn) (e.g., VanLehn et al., 2003)
  • Enhanced ITSPOKE: response contingent on the student turn’s combined uncertainty and correctness labels (impasse state)
  • Details in Forbes-Riley & Litman, 2010
• Procedure: reading, pretest, 5 problems, survey, posttest

Spoken Dialogue Computer Tutoring Data
• 1st experiment: ITSPOKE-WOZ corpus (wizarded)
  • 405 dialogues, 81 students
  • Speech recognition, uncertainty and correctness labeling performed by a human
• 2nd experiment: ITSPOKE-AUTO corpus (fully automated)
  • 360 dialogues, 72 students
  • Manually transcribed and labeled after the experiment
  • Speech recognition accuracy: 74.6% (Sphinx2)
  • Correctness accuracy: 84.7% (TuTalk; Jordan et al., 2007)
  • Uncertainty accuracy: 80.3% (logistic regression model built with speech/dialogue features, trained on the ITSPOKE-WOZ corpus)

ITSPOKE-AUTO Corpus Excerpt
t1: […] How does the man’s velocity compare to that of the keys?
sAUTO: his also the is same as that of his keys [incorrect+certain]
sMANU: his velocity is the same as that of his keys [correct+uncertain]
t2: […] What forces are exerted on the man after he releases his keys?
sAUTO: the only force is [incorrect+certain]
sMANU: the only force is [incorrect+uncertain]
t3: […] What’s the direction of the force of gravity on the man?
sAUTO: that in the pull in the man vertically down [correct+certain]
sMANU: gravity will be pulling the man vertically down [correct+certain]
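
The gap between automatic and manual labels in this excerpt can be made concrete with a small agreement check. This is only an illustrative sketch over the three student turns shown above; the function name `agreement` and the tuple layout are assumptions for the example, not part of ITSPOKE.

```python
# Labels for the three student turns in the excerpt, as (correctness, uncertainty).
auto_labels = [("incorrect", "certain"), ("incorrect", "certain"), ("correct", "certain")]
manual_labels = [("correct", "uncertain"), ("incorrect", "uncertain"), ("correct", "certain")]


def agreement(auto, manual, index):
    """Fraction of turns where the automatic label matches the manual label.

    index 0 compares correctness labels, index 1 compares uncertainty labels.
    """
    matches = sum(1 for a, m in zip(auto, manual) if a[index] == m[index])
    return matches / len(auto)


correctness_agreement = agreement(auto_labels, manual_labels, 0)
uncertainty_agreement = agreement(auto_labels, manual_labels, 1)
print(round(correctness_agreement, 2), round(uncertainty_agreement, 2))  # 0.67 0.33
```

Three turns are far too few to say anything about the corpus-level accuracies; the sketch only shows how such agreement figures are counted.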

Metacognitive Performance Metrics
• Metrics computed using four equations that combine uncertainty and correctness labels in different ways
• Metrics computed per student (over all 5 dialogues)
• Two sets of metrics:
  • one set used real-time automatic (noisy) labels (-auto)
  • one set used post-experiment manual labels (-manu)
• Metrics represent inferred (tutor-perceived) values, because uncertainty was labeled by the system or a human judge
• For each metric, we computed a partial Pearson’s correlation with posttest, controlling for pretest
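
A partial Pearson correlation of a metric with posttest, controlling for pretest, can be sketched with the standard first-order formula. The per-student numbers below are made up purely for illustration, and the function names are assumptions; the slides do not say which statistics package was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(metric, posttest, pretest):
    """Correlation of metric with posttest after removing the linear effect of pretest."""
    r_xy = pearson(metric, posttest)
    r_xz = pearson(metric, pretest)
    r_yz = pearson(posttest, pretest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical per-student values (one entry per student), not data from the corpus.
metric = [0.2, 0.5, 0.1, 0.8, 0.4]
pretest = [0.4, 0.6, 0.3, 0.5, 0.7]
posttest = [0.7, 0.6, 0.8, 0.5, 0.75]

r = partial_corr(metric, posttest, pretest)
print(round(r, 3))
```

A significance test on the resulting coefficient would still be needed before claiming that a metric predicts learning.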

Metacognitive Performance Metrics: Average Learning Impasse Severity (Forbes-Riley & Litman, 2008)
• Uncertainty and incorrectness are learning impasses
• We distinguish four impasse states: all combinations of binary uncertainty (UNC, CER) and correctness (INC, COR)
• We rank impasse states by severity based on impasse awareness
• We label the state of each turn and compute average impasse severity
State:    INC_CER   INC_UNC   COR_UNC    COR_CER
Severity: most (3)  less (2)  least (1)  none (0)
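
As a minimal sketch of this metric, the severity ranking can be stored as a lookup table and averaged over a student's turns. The state sequences below are the three turns from the ITSPOKE-AUTO corpus excerpt; the function and variable names are assumptions for illustration.

```python
# Severity ranking from the slide: INC_CER is the most severe impasse, COR_CER is no impasse.
SEVERITY = {"INC_CER": 3, "INC_UNC": 2, "COR_UNC": 1, "COR_CER": 0}


def average_impasse_severity(turn_states):
    """Mean severity over a student's labeled turns."""
    if not turn_states:
        raise ValueError("at least one labeled turn is required")
    return sum(SEVERITY[state] for state in turn_states) / len(turn_states)


# The three excerpt turns under manual and automatic labels.
manual_states = ["COR_UNC", "INC_UNC", "COR_CER"]
auto_states = ["INC_CER", "INC_CER", "COR_CER"]

severity_manu = average_impasse_severity(manual_states)
severity_auto = average_impasse_severity(auto_states)
print(severity_manu, severity_auto)  # 1.0 2.0
```

On these three turns the noisy labels make the impasses look more severe than the manual labels do, which is the kind of discrepancy the -auto and -manu metric sets are meant to expose.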

Metacognitive Performance Metrics: Knowledge Monitoring Accuracy (HC) (Nietfeld et al., 2006)
• Monitoring one’s own knowledge ≈ one’s certainty level ≈ one’s Feeling of Knowing (FOK)
• HC has been used to measure FOK accuracy (Smith & Clark, 1993): the accuracy with which one’s certainty corresponds to correctness
• Feeling of Another’s Knowing (FOAK): inferring the FOK of someone else (Brennan & Williams, 1995)
• We use HC to measure FOAK accuracy (our certainty is inferred)
$$HC = \frac{(COR\_CER + INC\_UNC) - (INC\_CER + COR\_UNC)}{(COR\_CER + INC\_UNC) + (INC\_CER + COR\_UNC)}$$
• (COR_CER + INC_UNC): cases where (un)certainty and (in)correctness agree
• (INC_CER + COR_UNC): cases where (un)certainty and (in)correctness are at odds
• Denominator sums over all cases
• Scores range from -1 (no accuracy) to 1 (perfect accuracy)
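
The HC formula can be sketched directly from per-student counts of the four impasse states. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name is likewise an assumption.

```python
def knowledge_monitoring_accuracy(counts):
    """HC = (agreeing turns - conflicting turns) / all turns."""
    agree = counts["COR_CER"] + counts["INC_UNC"]      # certainty matches correctness
    conflict = counts["INC_CER"] + counts["COR_UNC"]   # certainty is at odds with correctness
    total = agree + conflict
    if total == 0:
        raise ValueError("HC is undefined without any labeled turns")
    return (agree - conflict) / total


# Hypothetical counts for one student over all dialogues (not corpus data).
example_counts = {"COR_CER": 50, "INC_UNC": 20, "INC_CER": 10, "COR_UNC": 20}
hc = knowledge_monitoring_accuracy(example_counts)
print(hc)  # 0.4
```

Here 70 turns agree and 30 conflict, so HC = (70 - 30) / 100 = 0.4.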

Metacognitive Performance Metrics: Bias (Kelemen et al., 2000; Saadawi et al., 2009)
• Measures how much more certainty than correctness there is
• Scores less/greater than 0 indicate under-/over-confidence
$$Bias = \frac{COR\_CER + INC\_CER}{COR\_CER + INC\_CER + COR\_UNC + INC\_UNC} - \frac{COR\_CER + COR\_UNC}{COR\_CER + INC\_CER + COR\_UNC + INC\_UNC}$$
• First numerator: total certain answers; second numerator: total correct answers
• Denominator sums over all cases
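
A minimal sketch of the Bias computation, again from per-student state counts. The counts are hypothetical and the function name is an assumption for illustration.

```python
def bias(counts):
    """Proportion of certain answers minus proportion of correct answers."""
    total = sum(counts[state] for state in ("COR_CER", "INC_CER", "COR_UNC", "INC_UNC"))
    if total == 0:
        raise ValueError("Bias is undefined without any labeled turns")
    certain = counts["COR_CER"] + counts["INC_CER"]
    correct = counts["COR_CER"] + counts["COR_UNC"]
    return certain / total - correct / total


# Hypothetical counts for one student over all dialogues (not corpus data).
example_counts = {"COR_CER": 50, "INC_UNC": 20, "INC_CER": 10, "COR_UNC": 20}
bias_score = bias(example_counts)
print(round(bias_score, 3))  # -0.1
```

With 60 certain and 70 correct answers out of 100, the score is negative, which the slide reads as underconfidence.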

Metacognitive Performance Metrics: Discrimination (Kelemen et al., 2000; Saadawi et al., 2009)
• Measures one’s ability to discriminate whether one is correct
• Scores greater than 0 indicate higher performance
$$Discrimination = \frac{COR\_CER}{COR\_CER + COR\_UNC} - \frac{INC\_CER}{INC\_CER + INC\_UNC}$$
• First term: proportion of correct answers that are certain (denominator: correct answers)
• Second term: proportion of incorrect answers that are certain (denominator: incorrect answers)
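
Discrimination can be sketched the same way. The counts are hypothetical, the function name is an assumption, and raising an error when a student has no correct or no incorrect answers is a choice made for this example; the slides do not say how that case was handled.

```python
def discrimination(counts):
    """Share of correct answers that are certain minus share of incorrect answers that are certain."""
    correct = counts["COR_CER"] + counts["COR_UNC"]
    incorrect = counts["INC_CER"] + counts["INC_UNC"]
    if correct == 0 or incorrect == 0:
        # Assumption for this sketch: treat the metric as undefined in this case.
        raise ValueError("needs at least one correct and one incorrect answer")
    return counts["COR_CER"] / correct - counts["INC_CER"] / incorrect


# Hypothetical counts for one student over all dialogues (not corpus data).
example_counts = {"COR_CER": 50, "INC_UNC": 20, "INC_CER": 10, "COR_UNC": 20}
discrimination_score = discrimination(example_counts)
print(round(discrimination_score, 3))  # 0.381
```

This student is certain on 50 of 70 correct answers but on only 10 of 30 incorrect ones, so the score is positive.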

Prior ITSPOKE-WOZ Corpus Results
• In ideal conditions, higher learning correlates with:
  • Less severe impasses (that include uncertainty)/no impasses
  • Higher knowledge monitoring accuracy
  • Underconfidence about correctness
  • Better discrimination of when one is correct
  • Being correct

Current ITSPOKE-AUTO Corpus Results: -auto labels
• In noisy/realistic conditions, higher learning still correlates with:
  • Less severe/no impasses
  • Higher knowledge monitoring accuracy
  • Underconfidence about correctness
  • Being correct

Current ITSPOKE-AUTO Corpus Results: -manu labels
• In corrected noisy conditions, higher learning still correlates with:
  • Less severe/no impasses
  • Higher knowledge monitoring accuracy
  • Being correct

Discussion
• Does metacognition add value over correctness for predicting learning in ideal and realistic conditions?
• Recomputed correlations controlling for pretest and %Correct:
  • ITSPOKE-WOZ: all complex metrics correlate with posttest
  • ITSPOKE-AUTO: no metrics correlate with posttest
  • Metacognition adds value in ideal conditions
• Stepwise linear regression greedily selects from all metrics + pretest:
  • ITSPOKE-WOZ: selects HC after %Correct and pretest
  • ITSPOKE-AUTO: selects Impasse Severity_auto after pretest
  • Metacognition adds value in realistic conditions too

Conclusions
• Metacognitive performance metrics predict learning in a fully automated spoken dialogue computer tutoring corpus
  • Prior work: four metrics predict learning in a wizarded corpus
  • Three metrics still predict learning even with automated speech recognition, uncertainty and correctness labeling: average impasse severity, knowledge monitoring accuracy, bias
• Metacognitive metrics add value over correctness for predicting learning in ideal and realistic conditions
  • At least some metrics (e.g., noisy average impasse severity)

Current and Future Work
• Use results to inform system modification aimed at improving metacognitive abilities (and therefore learning)
  • Feasible to use a fully automated system and noisy metacognitive metrics, rather than an expensive wizarded system
• Metacognitive metrics represent inferred values
  • Self-judged values differ from inferred values (Pon-Barry & Shieber, 2010); expert-judged values are most reliable (D’Mello et al., 2008)
  • FOK ratings in future system versions can help measure metacognitive improvement
• “Metacognition in ITS” literature will also inform system modification (e.g., AIED’07 and ITS’08 workshops)

Questions/Comments? Further Information? web search: ITSPOKE Thank You!

Future Work cont.
• Why didn’t Discrimination_auto, Discrimination_manu and Bias_manu correlate with learning in ITSPOKE-AUTO?
• Due to NLP errors in ITSPOKE-AUTO?
  • Rerun correlations over students with few speech recognition, uncertainty and correctness errors to see if results pattern like ITSPOKE-WOZ
• Due to different user populations?
  • Run ITSPOKE-AUTO on the ITSPOKE-WOZ corpus, then compute noisy metric correlations to see if results pattern like the ITSPOKE-AUTO corpus

SimpleAdaptation to Uncertainty
• For C+U, I+U, I+nonU answers, ITSPOKE gives the same content with the same dialogue act
• ITSPOKE gives feedback on (in)correctness

SimpleAdaptation Example
TUTOR1: By the same reasoning that we used for the car, what’s the overall net force on the truck equal to?
STUDENT1: The force of the car hitting it?? [C+U]
TUTOR2: Fine. [FEEDBACK] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE]
• Same TUTOR2 subdialogue if the student was I+U or I+nonU
