Predicting Student Emotions in Computer-Human Tutoring Dialogues
Diane J. Litman and Kate Forbes-Riley
University of Pittsburgh, Pittsburgh, PA 15260 USA
Motivation
• Bridge the learning gap between human tutors and computer tutors
• Our approach: add emotion prediction and adaptation to ITSPOKE, our Intelligent Tutoring SPOKEn dialogue system
• Back-end is the Why2-Atlas system (VanLehn et al., 2002)
• Sphinx2 speech recognition and Cepstral text-to-speech
Human-Computer Excerpt
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <…omitted…>
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)
Outline
• Data and Emotion Annotation
• Machine Learning Experiments
  – extract linguistic features from student turns
  – use different feature sets to predict emotions
  – 19-36% relative reduction of baseline error
  – comparison with human tutoring
ITSPOKE Dialogue Corpus
• 100 spoken tutoring dialogues (physics problems) with ITSPOKE
• on average, 19.4 minutes and 25 student turns per dialogue
• 20 subjects: university students who had never taken college physics and who are native speakers
Emotion Annotation Scheme (SIGdial '04)
• 'Emotion': emotions/attitudes that may impact learning
• Annotation of student turns
• Emotion classes:
  – negative: e.g. uncertain, bored, irritated, confused, sad
  – positive: e.g. confident, enthusiastic
  – neutral: no weak or strong expression of negative or positive emotion
Example Annotated Excerpt
ITSPOKE: What happens to the velocity of a body when there is no force acting on it?
Student: dammit (NEGATIVE) (ASR: it is)
ITSPOKE: Could you please repeat that?
Student: same (NEUTRAL) (ASR: i same)
Agreement Study
• 333 student turns, 15 dialogues
• 2 annotators (the authors)
Emotion Classification Tasks
• Negative, Neutral, Positive
  – Kappa = .4, Weighted Kappa = .5
  – focus of this talk
• Negative, Non-Negative
  – Kappa = .5
• Emotional, Non-Emotional
  – Kappa = .3
• Results on par with prior research: Kappas of .32-.48 in (Ang et al. 2002; Narayanan 2002; Shafran et al. 2003)
(a sketch of computing these kappa figures follows below)
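As a concrete illustration, here is a minimal sketch of computing kappa and weighted kappa with scikit-learn. The label arrays are invented for illustration, not the study's annotations, and the class ordering used for the weighted variant is an assumption.

```python
# Minimal sketch: inter-annotator agreement on emotion labels.
# The two label lists below are hypothetical, not the real data.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["negative", "neutral", "neutral", "positive", "negative"]
annotator_b = ["negative", "neutral", "positive", "positive", "negative"]

# Plain kappa: chance-corrected agreement over exact label matches
print(f"Kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Weighted kappa penalizes near-misses (neutral vs. negative) less
# than distant ones (positive vs. negative); it needs an ordered encoding.
order = {"negative": 0, "neutral": 1, "positive": 2}
a = [order[label] for label in annotator_a]
b = [order[label] for label in annotator_b]
print(f"Weighted kappa: {cohen_kappa_score(a, b, weights='linear'):.2f}")
```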
Feature Extraction per Student Turn
• Three feature types:
  – acoustic-prosodic
  – lexical
  – identifiers
• Research questions:
  – relative utility of acoustic-prosodic, lexical, and identifier features
  – impact of speech recognition
  – comparison with human tutoring (HLT/NAACL 2004)
Feature Types (1): Acoustic-Prosodic Features
• 4 pitch (f0): max, min, mean, standard dev.
• 4 energy (RMS): max, min, mean, standard dev.
• 4 temporal:
  – turn duration (seconds)
  – pause length preceding turn (seconds)
  – tempo (syllables/second)
  – internal silence in turn (zero f0 frames)
• All available to ITSPOKE in real time
(an offline feature-extraction sketch follows below)
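For intuition, per-turn statistics like these could be approximated offline roughly as follows. This is a sketch assuming librosa and one WAV file per student turn; the f0 search range, the unvoiced-frame proxy for internal silence, and the `pause_before` argument are assumptions, and the real system computed its features in real time.

```python
# Sketch of per-turn acoustic-prosodic features (pitch, energy, temporal).
# Assumes librosa and a turn that contains some voiced speech.
import numpy as np
import librosa

def turn_features(wav_path, pause_before):
    y, sr = librosa.load(wav_path, sr=None)

    # Pitch (f0) track; unvoiced frames come back as NaN
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    f0_voiced = f0[~np.isnan(f0)]

    # Energy (RMS) per frame
    rms = librosa.feature.rms(y=y)[0]

    return {
        "f0_max": f0_voiced.max(), "f0_min": f0_voiced.min(),
        "f0_mean": f0_voiced.mean(), "f0_std": f0_voiced.std(),
        "rms_max": rms.max(), "rms_min": rms.min(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        "duration": len(y) / sr,                       # turn duration (s)
        "pause_before": pause_before,                  # from the dialogue log
        "silence_frac": float(np.mean(np.isnan(f0))),  # zero-f0 frames in turn
        # tempo (syllables/second) would need a syllable count, e.g. from ASR
    }
```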
Feature Types (2): Word Occurrence Vectors
• human-transcribed lexical items in the turn
• ITSPOKE-recognized lexical items
Feature Types (3): Identifier Features
• student id
• student gender
• problem id
(a sketch of building the lexical and identifier feature vectors follows below)
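A minimal sketch of turning feature types (2) and (3) into a feature matrix, assuming scikit-learn; the turns and identifier values are invented, and `sparse_output` assumes scikit-learn 1.2 or later.

```python
# Sketch: binary word-occurrence vectors plus one-hot identifier features.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

turns = ["it is", "i same", "the zero"]       # ASR output, one string per turn
identifiers = [["s01", "female", "p03"],      # [student id, gender, problem id]
               ["s01", "female", "p03"],
               ["s02", "male",   "p01"]]

# Word occurrence: 1 if the word appears in the turn, 0 otherwise
X_lex = CountVectorizer(binary=True).fit_transform(turns).toarray()

# Identifier features are categorical, so one-hot encode them
X_id = OneHotEncoder(sparse_output=False).fit_transform(identifiers)

X = np.hstack([X_lex, X_id])  # combined per-turn feature matrix
```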
Machine Learning Experiments
• Weka software: boosted decision trees
  – gave best results in pilot studies (ASRU 2003)
• Baseline: majority class (neutral)
• Methodology: 10 runs of 10-fold cross-validation
• Evaluation metric: accuracy
• Datasets:
  – Agreed (202/333 turns where annotators agreed)
  – Consensus (all 333 turns after annotators resolved disagreements)
(an analogous scikit-learn setup is sketched below)
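The original experiments used boosted decision trees in Weka; the sketch below approximates that setup in scikit-learn with AdaBoost over shallow trees, 10 runs of 10-fold cross-validation, and a majority-class baseline. `X` and `y` are assumed to come from the feature extraction above, and the tree depth and number of boosting rounds are guesses, not the paper's settings (`estimator=` assumes scikit-learn 1.2 or later).

```python
# Sketch: boosted decision trees vs. a majority-class baseline,
# evaluated with 10 runs of 10-fold cross-validation on accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    models = {
        "majority baseline": DummyClassifier(strategy="most_frequent"),
        "boosted trees": AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=3), n_estimators=50),
    }
    for name, clf in models.items():
        scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```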
Acoustic-Prosodic vs. Lexical Features (Agreed Turns)
• Both acoustic-prosodic ("speech") and lexical features significantly outperform the majority baseline
• Combining the two feature types yields an even higher accuracy
• Baseline = 46.52%
Adding Identifier Features (Agreed Turns)
• Adding identifier features improves all results
• With identifier features, lexical information now yields the highest accuracy
• Baseline = 46.52%
Using Automatic Speech Recognition (Agreed Turns)
• Surprisingly, using ASR output rather than human transcriptions does not particularly degrade accuracy
• Baseline = 46.52%
Summary of Results (Consensus Turns)
• Using consensus rather than agreed data decreases predictive accuracy for all feature sets, but the other observations generally hold
Comparison with Human Tutoring (Agreed Turns)
• In human tutoring dialogues, emotion prediction (and annotation) is more accurate and based on somewhat different features
Related Research in Emotional Speech
• Elicited speech (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
• Naturally-occurring speech (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
• Our work:
  – naturally-occurring tutoring data
  – analysis of comparable human and computer corpora
Current Directions
• Develop adaptive strategies for ITSPOKE
  – annotate human tutor turns
  – evaluate ITSPOKE with emotion adaptation
• Co-training to address the annotation bottleneck
  – Maeireizo, Litman, and Hwa: Saturday poster
Summary
• Recognition of annotated student emotions in spoken computer tutoring dialogues
• Feature sets containing acoustic-prosodic, lexical, and/or identifier features yield significant improvements in predictive accuracy over majority-class baselines
  – role of differing feature types and speech recognition errors
  – comparable analysis of human tutoring dialogues
  – the paper details two other emotion prediction tasks
• This research is a first step toward implementing emotion prediction and adaptation in ITSPOKE
Thank You! Questions?
Example Annotated Excerpt
ITSPOKE: What else do you need to know to find the box's acceleration?
Student: the direction (NEGATIVE, UNCERTAIN) (ASR: add directions)
ITSPOKE: If you see a body accelerate, what caused that acceleration?
Student: force (POSITIVE, CONFIDENT) (ASR: force)
ITSPOKE: Good job. Say there is only one force acting on the box. How is this force, the box's mass, and its acceleration related?
Student: velocity (NEGATIVE, UNCERTAIN) (ASR: velocity)
ITSPOKE: Could you please repeat that?
Student: velocity (NEGATIVE, IRRITATED) (ASR: velocity)
Prior Research: Affective Computer Tutoring
• (Kort, Reilly, and Picard, 2001): propose a cyclical model of emotion change during learning; developing a non-dialogue computer tutor that will use eye-tracking/facial features to predict emotion and support movement into positive emotions
• (Aist, Kort, Reilly, Mostow, and Picard, 2002): adding human-provided emotional scaffolding to an automated reading tutor increases student persistence
• (Evens et al., 2002): for CIRCSIM, a computer dialogue tutor for physiology problems, hypothesize adaptive strategies for recognized student emotional states; e.g., if frustration is detected, the system should respond to hedges and self-deprecation by supplying praise and restructuring the problem
• (de Vicente and Pain, 2002): use human observations of student motivational states in videoed interactions with a non-dialogue computer tutor to develop rules for detection
• (Ward and Tsukahara, 2003): a spoken dialogue computer "tutor-support" uses prosodic and contextual features of the user turn (e.g. "on a roll", "lively", "in trouble") to infer an appropriate response as users remember train stations; preferred over randomly chosen acknowledgments (e.g. "yes", "right", "that's it", "that's it <echo>", …)
• (Conati and Zhou, 2004): use Dynamic Bayesian Networks to reason under uncertainty about abstracted student knowledge and emotional states through time, based on student moves in a non-dialogue computer game, and to guide selection of "tutor" responses
• Most will be relevant to developing ITSPOKE adaptation techniques
Experimental Procedure
• Students take a physics pretest
• Students read background material
• Students use the web and voice interface to work through up to 10 problems with either ITSPOKE or a human tutor
• Students take a post-test