This paper explores the potential benefits of using prosody in spoken tutoring dialogues to recognize student emotions and attitudes. It discusses the ITSPOKE system, its corpora, emotion prediction from prosody, methods used, and current directions.
Using Prosody to Recognize Student Emotions and Attitudes in Spoken Tutoring Dialogues
Diane Litman
Department of Computer Science and Learning Research and Development Center
University of Pittsburgh
Outline • Introduction • The ITSPOKE System and Corpora • Emotion Prediction from Prosody & Other Features • Method • Human-human tutoring • Computer-human tutoring • Current Directions and Summary
Motivation • Working hypothesis regarding learning gains • Human Dialogue > Computer Dialogue > Text • Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based • Evens et al., 2001; Zinn et al., 2002; Vanlehn et al., 2002; Aleven et al., 2001 • Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?
Potential Benefits of Speech • Self-explanation correlates with learning and occurs more in speech • Hausmann and Chi, 2002 • Speech contains prosodic information, providing new sources of information for dialogue adaptation • Forbes-Riley and Litman, 2004 • Spoken computational environments may prime a more social interpretation that enhances learning • Moreno et al., 2001; Graesser et al., 2003 • Potential for hands-free interaction • Smith, 1992; Aist et al., 2003
Spoken Tutorial Dialogue Systems • Recent tutoring systems have begun to add spoken language capabilities • Rickel and Johnson, 2000; Graesser et al. 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003 • However, little empirical analysis of the learning ramifications of using speech
Outline • Introduction • The ITSPOKE System and Corpora • Emotion Prediction from Prosody & Other Features • Method • Human-human tutoring • Computer-human tutoring • Current Directions and Summary
ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System • Back-end is text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002) • Student speech digitized from microphone input; Sphinx2 speech recognizer • Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer • Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.
Spoken Tutoring Corpora • Human-Human Corpus • 14 students, 128 physics problems (dialogues), 5948 student turns, 5505 tutor turns • Student and tutor turns were manually transcribed for content and segmented for boundaries
Spoken Tutoring Corpora (cont.) • Computer-Human Corpus • 20 students, 100 physics problems (dialogues), 2445 student turns, 2967 tutor turns • Noisy student turn boundaries and transcriptions extractable from ITSPOKE logs • Content of student turns also manually transcribed
ITSPOKE Corpora Comparison: Human-Computer vs. Human-Human

Human-Computer excerpt (…3.5 minutes into session…)
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant

Human-Human excerpt (…1.3 minutes into session…)
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Student: So the key and the face-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)
Outline • Introduction • The ITSPOKE System and Corpora • Emotion Prediction from Prosody & Other Features • Method • Human-human tutoring • Computer-human tutoring • Current Directions and Summary
Monitoring Student State (motivation)
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <…omitted…>
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)
Related Research in Emotional Speech • Elicited Speech (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003) • Naturally-Occurring Speech (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003) • Our Work • naturally-occurring tutoring data • analysis of comparable human and computer corpora
Methodology • Emotion Annotation • Machine Learning Experiments • Extract linguistic features from student turns • Use different feature sets to predict emotions • Significant reduction of baseline error
Emotion Annotation Scheme (Sigdial’04) • ‘Emotion’: emotions/attitudes that may impact learning • Annotation of Student Turns • 3 Main Emotion Classes: negative (e.g. uncertain, bored, irritated, confused, sad), positive (e.g. confident, enthusiastic), neutral (no expression of negative or positive emotion) • 3 Minor Emotion Classes: weak negative, weak positive, mixed
Feature Extraction per Student Turn • Five feature types • Acoustic-prosodic (1) • Non acoustic-prosodic • Lexical (2) • Other Automatic (3) • Manual (4) • Identifiers (5) • Research questions • Relative predictive utility of feature types • Impact of speech recognition • Comparison across computer and human tutoring
Feature Types (1) Acoustic-Prosodic Features • 4 pitch (f0): max, min, mean, standard dev. • 4 energy (RMS): max, min, mean, standard dev. • 4 temporal: turn duration (seconds), pause length preceding turn (seconds), tempo (syllables/second), internal silence in turn (zero f0 frames) • All features are available to ITSPOKE in real time
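As an illustration only (the slides do not give implementation details), a minimal sketch of extracting these turn-level acoustic-prosodic features, assuming one WAV file per student turn, the librosa library, and the preceding pause and syllable count taken from the dialogue logs:

```python
# A minimal sketch (not the authors' code): turn-level pitch, energy, and
# temporal features, assuming librosa and one WAV file per student turn.
import numpy as np
import librosa

def prosodic_features(wav_path, prior_pause_sec, n_syllables):
    y, sr = librosa.load(wav_path, sr=16000)

    # Pitch (f0) via probabilistic YIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
    voiced = f0[~np.isnan(f0)]

    # Frame-level energy (RMS).
    rms = librosa.feature.rms(y=y)[0]

    duration = len(y) / sr
    return {
        "f0_max": voiced.max(), "f0_min": voiced.min(),
        "f0_mean": voiced.mean(), "f0_std": voiced.std(),
        "rms_max": rms.max(), "rms_min": rms.min(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        "turn_duration": duration,                       # seconds
        "prior_pause": prior_pause_sec,                  # from the dialogue logs
        "tempo": n_syllables / duration,                 # syllables per second
        "internal_silence": float(np.isnan(f0).mean()),  # fraction of zero-f0 frames
    }
```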
Feature Types (2) Lexical (Word Occurrence Vectors) • Human-transcribed lexical items in the turn • ITSPOKE-recognized lexical items
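The slides do not name a toolkit for building these vectors; as an assumed illustration, binary word-occurrence vectors can be built with scikit-learn's CountVectorizer, applied identically to the human transcripts or to ITSPOKE's ASR hypotheses:

```python
# Sketch only: binary word-occurrence vectors over student turns.
from sklearn.feature_extraction.text import CountVectorizer

turns = ["uh the force of gravity", "it will be moving", "freefall"]  # toy turns
vectorizer = CountVectorizer(binary=True)     # occurrence, not frequency
X_lexical = vectorizer.fit_transform(turns)   # sparse turns-by-vocabulary matrix
print(vectorizer.get_feature_names_out())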
Feature Types (3) Other Automatic Features: available from logs • Turn Begin Time (seconds from dialog start) • Turn End Time (seconds from dialog start) • Is Temporal Barge-in (student begins before tutor turn ends) • Is Temporal Overlap (student begins and ends in tutor turn) • Number of Words in Turn • Number of Syllables in Turn
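A minimal sketch of these log-derived features, assuming each turn's begin/end times (in seconds from dialogue start) and a transcript string are available; the barge-in and overlap tests follow the glosses on the slide, and the syllable count here is only a crude vowel-based proxy:

```python
# Sketch only: turn features derived from ITSPOKE-style logs.
def automatic_features(student_begin, student_end, tutor_begin, tutor_end, transcript):
    words = transcript.split()
    return {
        "turn_begin_time": student_begin,
        "turn_end_time": student_end,
        # student begins speaking before the tutor turn has ended
        "is_temporal_barge_in": student_begin < tutor_end,
        # student turn begins and ends within the tutor turn
        "is_temporal_overlap": tutor_begin <= student_begin and student_end <= tutor_end,
        "n_words": len(words),
        # crude syllable proxy (vowel count); a real system would use the
        # recognizer's pronunciation lexicon
        "n_syllables": sum(max(1, sum(c in "aeiou" for c in w)) for w in words),
    }
```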
Feature Types (4) Manual Features: (currently) available only from human transcription • Is Prior Tutor Question (tutor turn contains “?”) • Is Student Question (student turn contains “?”) • Is Semantic Barge-in (student turn begins at tutor word/pause boundary) • Number of Hedging/Grounding Phrases (e.g. “mm-hm”, “um”) • Is Grounding (canonical phrase turns not preceded by a tutor question) • Number of False Starts in Turn (e.g. acc-acceleration)
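In the study these features came from manual transcription and annotation; purely as an illustration (not the annotators' procedure), a few of them could be approximated with simple heuristics over the transcripts:

```python
# Illustrative heuristics only, not the manual annotation used in the work.
import re

HEDGE_GROUND = {"mm-hm", "um", "uh", "ok", "right"}   # assumed phrase list

def manual_features(student_turn, prior_tutor_turn):
    tokens = [t.strip(".,?") for t in student_turn.lower().split()]
    return {
        "is_prior_tutor_question": "?" in prior_tutor_turn,
        "is_student_question": "?" in student_turn,
        "n_hedge_grounding": sum(t in HEDGE_GROUND for t in tokens),
        # a false start like "acc-acceleration": a truncated fragment followed
        # by a word beginning with the same characters
        "n_false_starts": len(re.findall(r"\b(\w+)-\s*\1\w*", student_turn.lower())),
    }
```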
Feature Types (5) Identifier Features • student number • student gender • problem number
Empirical Results I Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources Kate Forbes-Riley and Diane Litman Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2004)
Annotated Human-Human Excerpt (weak, mixed -> neutral)
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it exert force on itself?
Student: um… (EMOTION = NEGATIVE)
Human Tutoring: Annotation Agreement Study • 453 student turns, 10 dialogues • 2 annotators (the authors) • 385/453 agreed (85%, Kappa .7)
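For reference, the agreement figures above (85% raw agreement, Kappa .7) correspond to a standard Cohen's kappa computation; a toy version with scikit-learn (an assumed toolkit, and made-up labels rather than the study's data) looks like:

```python
# Toy recomputation of raw agreement and Cohen's kappa over two annotators.
from sklearn.metrics import cohen_kappa_score

labels_a = ["neutral", "negative", "neutral", "positive", "neutral"]  # annotator 1
labels_b = ["neutral", "negative", "positive", "positive", "neutral"] # annotator 2

agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
kappa = cohen_kappa_score(labels_a, labels_b)
print(f"raw agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```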
Machine Learning Experiments • Task: predict negative/positive/neutral using 5 feature types • Data: “agreed” subset of annotated student turns • Weka software: boosted decision trees • Methodology: 10 runs of 10-fold cross validation • Evaluation Metrics • Mean Accuracy: %Correct • Relative Improvement Over Baseline: RI(x) = (error(baseline) − error(x)) / error(baseline)
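A rough sketch of this setup, substituting recent scikit-learn's AdaBoost over decision trees for Weka's boosted decision trees (an assumption, not the authors' exact configuration), with 10 runs of 10-fold cross-validation and the RI metric above; X and y below are random placeholders standing in for the extracted features and agreed emotion labels:

```python
# Sketch of the experimental setup; placeholders replace the real data.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(385, 20))          # placeholder feature matrix
y = rng.integers(0, 3, size=385)        # placeholder labels: neg/neutral/pos

clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                         n_estimators=50)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10)   # 10 runs of 10-fold CV
accuracy = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()

baseline = np.bincount(y).max() / len(y)                  # majority-class baseline
ri = ((1 - baseline) - (1 - accuracy)) / (1 - baseline)   # relative improvement
print(f"accuracy={accuracy:.3f}, baseline={baseline:.3f}, RI={ri:.3f}")
```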
Acoustic-Prosodic vs. Other Features Acoustic-prosodic features (“speech”) outperform majority baseline, but other feature types yield even higher accuracy, and the more the better • Baseline = 72.74%; RI range = 12.69% - 43.87%
Acoustic-Prosodic plus Other Features Adding acoustic-prosodic to other feature sets doesn’t significantly improve performance • Baseline = 72.74%; RI range = 23.29% - 42.26%
Adding Contextual Features • Prior work (Litman et al. 2001; Batliner et al. 2003) shows that adding contextual features improves prediction accuracy • Local Features: the values of all features for the two student turns preceding the student turn to be predicted • Global Features: running averages and totals for all features, over all student turns preceding the student turn to be predicted
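A minimal sketch of building these contextual features, assuming `turns` is the chronologically ordered list of per-turn feature dictionaries for one dialogue (names are illustrative, not the authors' code):

```python
# Sketch only: add local (previous two turns) and global (running average)
# context to each student turn's feature dictionary.
import numpy as np

def add_context(turns, n_local=2):
    keys = list(turns[0].keys())
    out = []
    for i, turn in enumerate(turns):
        feats = dict(turn)
        # local: features of the two student turns preceding this one
        for j in range(1, n_local + 1):
            prev = turns[i - j] if i - j >= 0 else dict.fromkeys(keys, 0.0)
            feats.update({f"prev{j}_{k}": prev[k] for k in keys})
        # global: running averages over all preceding student turns
        history = turns[:i]
        for k in keys:
            feats[f"avg_{k}"] = np.mean([t[k] for t in history]) if history else 0.0
        out.append(feats)
    return out
```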
Previous Feature Sets plus Context • Adding global contextual features marginally improves performance, e.g. • Same feature set with no context: 83.69%
Empirical Results II Predicting Student Emotions in Computer-Human Tutoring Dialogues Diane J. Litman and Kate Forbes-Riley Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004)
Computer Tutoring Study • Additional dataset • Consensus (all turns after annotators resolved disagreements) • Different treatment of minor classes • Additional binary prediction tasks (in paper) • Emotional/non-emotional and negative/non-negative • Slightly different features • strict turn-taking protocol (no barge-in) • ASR output rather than actual student utterances
Annotated Computer-Human Excerpt (weak -> pos/neg, mixed -> neutral)
ITSPOKE: What happens to the velocity of a body when there is no force acting on it?
Student: dammit (NEGATIVE)
ASR: it is
ITSPOKE: Could you please repeat that?
Student: same (NEUTRAL)
ASR: i same
Computer Tutoring: Annotation Agreement Study • 333 student turns, 15 dialogues • 2 annotators (the authors) • 202/333 agreed (61%; Kappa = .4)
Acoustic-Prosodic vs. Lexical Features(Agreed Turns) • Both acoustic-prosodic (“speech”) and lexical features significantly outperform the majority baseline • Combining feature types yields an even higher accuracy • Baseline = 46.52%
Adding Identifier Features (Agreed Turns) • Adding identifier features improves all results • With identifier features, lexical information now yields the highest accuracy • Baseline = 46.52%
Using Automatic Speech Recognition (Agreed Turns) • Surprisingly, using ASR output rather than human transcriptions does not particularly degrade accuracy • Baseline = 46.52%
Comparison with Human Tutoring - In human tutoring dialogues, emotion prediction (and annotation) is more accurate and based on somewhat different features
Summary of Results (Consensus Turns) - Using consensus rather than agreed data decreases predictive accuracy for all feature sets, but other observations generally hold
Recap • Recognition of annotated student emotions in spoken computer and human tutoring dialogues, using multiple knowledge sources • Significant improvements in predictive accuracy compared to majority class baselines • A first step towards implementing emotion prediction and adaptation in ITSPOKE
Outline • Introduction • The ITSPOKE System and Corpora • Emotion Prediction from Prosody & Other Features • Method • Human-human tutoring • Computer-human tutoring • Current Directions and Summary
Word Level Emotion Models (joint research with Mihai Rotaru) • Motivation • Emotion might not be expressed over the entire turn • Some pitch features make more sense at a smaller level • Simple word-level emotion model • Label each word with turn class • Learn a word level emotion model • Predict the class of each word in a test turn • Combine word classes using majority/weighted voting
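A sketch of the word-level model and voting step described above, using only lexical word features for brevity (the actual work also uses word-level pitch features) and scikit-learn as an assumed toolkit:

```python
# Sketch only: every training word inherits its turn's emotion label; a test
# turn's label is the majority vote over its words' predicted classes.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def train_word_model(train_turns, train_labels):
    words, labels = [], []
    for turn, lab in zip(train_turns, train_labels):
        for w in turn.split():
            words.append(w)        # one training instance per word
            labels.append(lab)     # word inherits the turn's label
    vec = CountVectorizer()
    clf = DecisionTreeClassifier().fit(vec.fit_transform(words), labels)
    return vec, clf

def predict_turn(vec, clf, turn):
    word_preds = clf.predict(vec.transform(turn.split()))
    return Counter(word_preds).most_common(1)[0][0]   # majority vote
```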
Word Level Emotion Models - Results • Feature sets (Turn and Word levels) • Lexical • Pitch • PitchLex • Results • Word-level better than Turn-level counterpart • PitchLex at Word-level always among the best performers • PitchLex at Word-level comparable with state-of-the-art on our corpora (HC, EnE, MBL)
Prosody-Learning Correlations (joint work with Kate Forbes-Riley) • What aspects of spoken tutoring dialogues correlate with learning gains? • Dialogue features (Litman et al. 2004) • Student emotions (frequency or patterns) • Acoustic-prosodic features • Human Tutoring • Faster tempos (syllables/second) and longer turns (seconds) negatively correlate with learning (p < .09) • Computer Tutoring • Higher pitch features (average, max, min) negatively correlate with learning (p < .07)
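The correlations above are standard Pearson correlations between per-student prosodic aggregates and learning gains; a toy recomputation (with made-up numbers, not the study's data) might look like:

```python
# Toy example of a prosody-learning correlation; values are illustrative only.
from scipy.stats import pearsonr

mean_tempo = [2.1, 3.4, 2.8, 3.9, 2.5, 3.1]        # syllables/second per student
learning_gain = [0.30, 0.12, 0.22, 0.08, 0.25, 0.18]  # post-test minus pre-test

r, p = pearsonr(mean_tempo, learning_gain)
print(f"r = {r:.2f}, p = {p:.3f}")                  # a negative r would match the slide
```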