Cohesion and Learning in a Tutorial Spoken Dialog System Art Ward Diane Litman
Outline • Tutoring • Goals • 4 issues in measuring cohesion • Why they’re interesting • How we test them • Results
Natural Language Dialog Tutoring • Human tutors are better than classroom instruction (Bloom 84) • Intelligent Tutoring Systems (ITSs) hope to replicate this advantage • Is Dialog important to learning? • Dialog acts: question answering, explanatory reasoning, deep student answers (Graesser et al. 95, Forbes-Riley et al. 05) • Difficult to automatically tag dialog input, so: • Automatically detectable dialog features • Average turn length, etc. (Litman et al. 04) • We look at Cohesion • Lexical Co-occurrence between turns
Goals and Results • Goals • Want to find if cohesion is correlated with learning in our tutoring dialogs. • If it is, may inform ITS design • Want to find a computationally tractable measure of cohesion • So can be used in a real-time tutor • Results • Do find strong correlations with learning • For low pre-testers • For interactive (tutor to student) measures of cohesion • Robust to multiple measures of lexical cohesion
4 Issues • Why/How identify cohesion in dialogs? • Do students of different skill levels respond to cohesion in the same way? (Is there an aptitude/treatment interaction?) • Is Interactivity Important? • What other processing steps help?
Issue 1: How identify cohesion in dialogs? • Why might cohesion be important in tutoring? • McNamara & Kintsch (96) • Students read high & low coherence text • High coherence text was low coherence version altered to: • Use consistent referring expressions • Identify anaphora • Supply background information • Interaction between pre-test score & response to textual coherence • Low pre-testers learned more from more coherent text • High pre-testers learned LESS from more coherent text
Measuring Cohesion • Measurements from Computational Linguistics • Hearst (94) topic segmentation, text • Word-count similarity of spans of text • Olney & Cai (05) topic segmentation, tutorial dialog • Several measures, including Hearst’s • Morris & Hirst (91) Lexical Chains • Thesaurus entries • Barzilay & Elhadad (97) Automatic Lexical Chains • WordNet senses • We develop measures similar to Hearst’s • But novel in that: • Applied to dialog rather than text, used to find correlations with learning
Issue 1: How identify cohesion in dialogs? Defining Cohesion • Halliday and Hasan (76) • Grammatical vs Lexical Cohesion • Lexical Cohesion • Reiteration • Exact word repetition • Synonym repetition • Near Synonym repetition • Super-ordinate class • General referring noun • Cohesion measured by counting “cohesive ties” • Two words joined by a cohesive device (e.g. reiteration)
Issue 1: How identify cohesion in dialogs? • How we measure Lexical Cohesion • We count cohesive ties between turns • Tokens (with stop words) • (token = “word”) • Tokens (stop words removed) • (Stops = high frequency, low information words) • Stems (stop words removed)
Stems • Stem = non-inflected core of a word • Porter Stemmer • Allows us to find ties between various inflected forms of the same word in adjacent turns. • “Turns” are tutor and student contributions to Tutoring Dialogs collected by the ITSPOKE group.
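A minimal sketch of this tie counting at the three levels above, under the assumption that a tie is counted once per word type shared by two adjacent turns; the tokenizer, toy stop list, and NLTK’s PorterStemmer are illustrative choices, not necessarily the ones used in this work:

```python
# Illustrative sketch of cohesive-tie counting between adjacent dialog turns.
# Assumes NLTK is installed; the stop list and tokenizer are toy stand-ins.
import re
from nltk.stem.porter import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "it", "and", "that", "in"}
stemmer = PorterStemmer()

def tokenize(turn):
    """Lowercase a turn and split it into word tokens."""
    return re.findall(r"[a-z']+", turn.lower())

def tie_count(turn_a, turn_b, remove_stops=False, stem=False):
    """Count word types shared by two adjacent turns (one tie per shared type)."""
    a, b = tokenize(turn_a), tokenize(turn_b)
    if remove_stops:
        a = [w for w in a if w not in STOP_WORDS]
        b = [w for w in b if w not in STOP_WORDS]
    if stem:
        a = [stemmer.stem(w) for w in a]
        b = [stemmer.stem(w) for w in b]
    return len(set(a) & set(b))

def dialog_ties(turns, **options):
    """Total ties summed over all adjacent turn pairs in one dialog."""
    return sum(tie_count(t1, t2, **options) for t1, t2 in zip(turns, turns[1:]))
```

With stem=True and remove_stops=True, this would find a tie between “accelerating” and “acceleration”, which exact token matching misses.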
Issue 2: Is there an aptitude/treatment interaction? • Why there might be: • McNamara & Kintsch • How we test it: • Mean pre-test split • All students • Above-mean pretest students (“high” pre-testers) • Below-mean pretest students (“low” pre-testers)
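As a sketch, the mean split might look like this, assuming a pandas table of per-student scores with hypothetical column names:

```python
import pandas as pd

# Hypothetical per-student table with 'pretest', 'posttest', and 'cohesion' columns.
students = pd.read_csv("students.csv")
mean_pre = students["pretest"].mean()
low_pre = students[students["pretest"] < mean_pre]    # "low" pre-testers
high_pre = students[students["pretest"] >= mean_pre]  # "high" pre-testers
```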
Issue 3: Is Interactivity Important? • Why it might be: • Chi et al. (01) • Tutor centered, Student centered, Interactive • Deep learning through self construction • Not tutor actions alone • Litman & Forbes-Riley (05) • Learning correlated with both: • student utterances that display reasoning • tutor questions that require reasoning • How we test it: • Interactive corpus – compare tutor to student turns • Tutor–only corpus • Student–only corpus
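One way to build these three corpus views, sketched below, is to filter the turn sequence before counting ties; dialog_ties is the counter sketched earlier, and the speaker labels are assumptions about how turns are annotated:

```python
def corpus_view(dialog, view):
    """Return the turn texts for one view of a dialog.

    dialog: list of (speaker, text) pairs, speaker in {"tutor", "student"}.
    view:   "interactive" keeps all turns in order (adjacent pairs are then
            tutor-to-student); "tutor" or "student" keeps only that speaker.
    """
    if view == "interactive":
        return [text for _, text in dialog]
    return [text for speaker, text in dialog if speaker == view]

# e.g. dialog_ties(corpus_view(d, "interactive"), remove_stops=True, stem=True)
#      vs. the "tutor"-only and "student"-only views of the same dialog d.
```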
Issue 4: What other processing steps help? • Tried several on training corpus: • Removing stop words • N-turn spans • Selecting “substantive” turns • TF-IDF normalization • Turn-normalized counts • (Raw tie count / # of turns in dialog) • Found final options on training corpus: • One turn spans, turn normalization, no TF-IDF, no substantive turn selection • All reported results use these options • Tested options on new corpus
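As a sketch, the turn-normalized count used in the reported results is just the raw tie count divided by the number of turns in the dialog (dialog_ties as sketched earlier):

```python
def normalized_cohesion(turns, **options):
    """Turn-normalized cohesion: raw tie count / number of turns in the dialog."""
    return dialog_ties(turns, **options) / len(turns) if turns else 0.0
```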
Where did the corpora come from? • ITSPOKE is a speech-enabled version of Why2-Atlas (VanLehn et al. 02) • Qualitative physics • Tutoring Cycle • Student reads instructional materials • Takes a pre-test • Starts Interactive tutoring cycle • Problem • Essay • Tutor evaluates essay, engages in dialog • Revise essay • Repeat • Takes a post-test
Tutoring Corpora • Transcripts of tutoring sessions • Training corpus (fall 2003): • 20 students, 5 problems each • 95 dialogs (5 had no dialog) • 13 low pre-testers, 7 high pre-testers • Testing corpus (spring 2005): • 34 students, 5 problems each • 163 dialogs (7 had no dialog) • 18 low pre-testers, 16 high pre-testers
Results: Aptitude/Treatment • Test: partial correlation of post-test & cohesion count, controlling for pre-test • Cohesion correlated with learning for low pre-test students • Not for high pre-test students • Little difference between types of measurement • Less significant on testing data, “token with stops” level reduced to a trend
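A minimal sketch of this test, computing the partial correlation as the Pearson correlation between the residuals of post-test and cohesion after regressing each on pre-test (numpy/scipy assumed; column names are hypothetical):

```python
import numpy as np
from scipy import stats

def partial_corr(post, cohesion, pre):
    """Partial correlation of post-test and cohesion, controlling for pre-test."""
    post, cohesion, pre = (np.asarray(v, dtype=float) for v in (post, cohesion, pre))
    X = np.column_stack([np.ones_like(pre), pre])              # intercept + pre-test
    residual = lambda y: y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return stats.pearsonr(residual(post), residual(cohesion))  # (r, p-value)

# e.g. run separately on the below-mean ("low") pre-test group:
# r, p = partial_corr(low_pre["posttest"], low_pre["cohesion"], low_pre["pretest"])
```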
Results: Aptitude/Treatment (2003 data) • No significant difference between amounts of (turn normalized) cohesion in high and low pre-test groups. • Difference in correlation between high and low pre-testers not due to different amounts of cohesion.
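This kind of group comparison could be checked with a simple two-sample test on the turn-normalized scores; the sketch below assumes Welch’s t-test, since the slide does not say which test was used, and the values are placeholders rather than data from the study:

```python
from scipy import stats

# Placeholder turn-normalized cohesion scores per student (illustrative only).
high_scores = [3.1, 2.8, 3.4, 2.9]   # above-mean pre-testers
low_scores = [3.0, 3.3, 2.7, 3.2]    # below-mean pre-testers
t, p = stats.ttest_ind(high_scores, low_scores, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}")
```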
Results: Interactivity (2003) • Cohesion between tutor utterances is not correlated with learning
Results: Interactivity (2003) • No evidence that cohesion between student productions is correlated with learning (but student utterances are very short with computer tutor)
Discussion • Both high and low pre-testers successfully learned from these dialogs • Our measure of lexical cohesion seems to reflect only what the low pre-testers do to learn; it is not correlated with whatever the high pre-testers do. • McNamara & Kintsch also found a positive correlation for low pre-testers, but a negative correlation for high pre-testers.
Discussion • Our measures are slightly different: • McNamara & Kintsch: Manipulated coherence in text • Reader does not contribute to coherence • Coherence is the extent to which semantic relations are spelled out in the text, rather than inferred by the reader. • Their low pre-testers probably learned because the high coherence text allowed them to make inferences they couldn’t make from the low coherence text. • Low pre-testers & low coherence: didn’t know the terms • High coherence may allow a greater number of successful inferences for their low pre-testers • Our work: Dialog • Student does contribute to cohesion • Higher cohesion means using more of the same terms • Speculation: High cohesion may indicate the number of successful inferences our low pre-testers have already made. • High pre-testers already know the terms, so new inferences are not involved in using them.
Summary • We have taken automatically computable measures of cohesion from computational linguistics • Applied them to tutorial dialog • Found correlations with student learning
Conclusions • Simple, automatically computable measures of lexical cohesion correlate with learning • But only for students with low pre-test scores, even though low and high pre-testers showed similar amounts of cohesion. • Correlation is robust to differences in type of measurement • It’s the cohesion between student and tutor that’s important
Future Work • Short term: • Cohesion may also be related to learning in high pre-testers, but we’re measuring the wrong kind of cohesion • Work underway to try “sense” level measures • Halliday & Hasan’s “synonym” levels of reiteration • “Acceleration” & “speeding up” • New issues: • Word sense disambiguation (one sense per discourse?) • Or measuring it in the wrong places • Try finding cohesion at impasses (VanLehn 03) • Try finding change in cohesion over time (Pickering & Garrod 04) • Is it the dialog, or the essay? • Long term: • Test by manipulating cohesion in ITSPOKE
Thanks • Diane Litman • ITSPOKE group
Cohesion vs Coherence • Cohesive Devices • Things that “tie” different parts of a discourse together: • Anaphora, repetition, etc… • But still may not make sense: • John hid Bill’s car keys. He likes spinach. (Jurafsky & Martin 00) • Coherence relations • Semantic relations between utterances. • Result, Explanation, elaboration, etc. (Hobbs 79)
Britton & Gulgoz 91 • Original text: Air war in the North, 1965 By the fall of 1964, Americans in both Saigon and Washington had begun to focus on Hanoi as the source of the continuing problem in the south. • Modified text: Air war in North Vietnam, 1965 By the beginning of 1965, Americans in both Saigon and Washington had begun to focus on Hanoi, capital of North Vietnam, as the source of the continuing problems in the south.