Advanced Methods and Analysis for the Learning and Social Sciences

PSY505 • Spring term, 2012 • February 13, 2012

Today’s Class • Classification and Behavior Detection

Prediction • Pretty much what it says • A student is using a tutor right now. Is he gaming the system or not? • A student has used the tutor for the last half hour. How likely is it that she knows the skill in the next step? • A student has completed three years of high school. What will be her score on the college entrance exam?

Two Key Types of Prediction (This slide adapted from a slide by Andrew W. Moore, Google, http://www.cs.cmu.edu/~awm/tutorials)

Classification • There is something you want to predict (“the label”) • The thing you want to predict is categorical • The answer is one of a set of categories, not a number • CORRECT/WRONG (sometimes expressed as 0,1) • HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE • WILL DROP OUT/WON’T DROP OUT • WILL SELECT PROBLEM A,B,C,D,E,F, or G

Where do those labels come from? • Field observations (take PSY503) • Text replays (take PSY503) • Post-test data (take PSY503) • Tutor performance • Survey data • School records • Where else?

Classification • Associated with each label is a set of “features”, which you may be able to use to predict the label
Skill          pknow  time  totalactions  right
ENTERINGGIVEN  0.704  9     1             WRONG
ENTERINGGIVEN  0.502  10    2             RIGHT
USEDIFFNUM     0.049  6     1             WRONG
ENTERINGGIVEN  0.967  7     3             RIGHT
REMOVECOEFF    0.792  16    1             WRONG
REMOVECOEFF    0.792  13    2             RIGHT
USEDIFFNUM     0.073  5     2             RIGHT
…

Classification • The basic idea of a classifier is to determine which features, in which combination, can predict the label
Skill          pknow  time  totalactions  right
ENTERINGGIVEN  0.704  9     1             WRONG
ENTERINGGIVEN  0.502  10    2             RIGHT
USEDIFFNUM     0.049  6     1             WRONG
ENTERINGGIVEN  0.967  7     3             RIGHT
REMOVECOEFF    0.792  16    1             WRONG
REMOVECOEFF    0.792  13    2             RIGHT
USEDIFFNUM     0.073  5     2             RIGHT
…

Classification • Of course, usually there are more than 4 features • And more than 7 actions/data points • These days, 800,000 student actions, and 26 features, would be a medium-sized data set

Classification • One way to classify is with a Decision Tree (like J48)
PKNOW
  <0.5  -> TIME
    <6s.  -> RIGHT
    >=6s. -> WRONG
  >=0.5 -> TOTALACTIONS
    <4  -> RIGHT
    >=4 -> WRONG

Classification • One way to classify is with a Decision Tree (like J48)
PKNOW
  <0.5  -> TIME
    <6s.  -> RIGHT
    >=6s. -> WRONG
  >=0.5 -> TOTALACTIONS
    <4  -> RIGHT
    >=4 -> WRONG
Skill          pknow  time  totalactions  right
COMPUTESLOPE   0.544  9     1             ?
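The example tree can be traced by hand, or with a few lines of code. The sketch below assumes the tree reads as follows: PKNOW is the root, the <0.5 branch splits on TIME at 6 seconds, and the >=0.5 branch splits on TOTALACTIONS at 4. The function name `classify` and its argument names are illustrative, not part of J48.

```python
def classify(pknow, time_s, totalactions):
    """Walk the slide's example tree and return the predicted label."""
    if pknow < 0.5:
        # Low estimated knowledge: split on time taken (seconds)
        return "RIGHT" if time_s < 6 else "WRONG"
    # Higher estimated knowledge: split on the number of actions so far
    return "RIGHT" if totalactions < 4 else "WRONG"


# The unlabeled COMPUTESLOPE row from the slide
prediction = classify(pknow=0.544, time_s=9, totalactions=1)
print(prediction)
```

Under that reading, the COMPUTESLOPE row goes down the >=0.5 branch and then the <4 branch, so the tree predicts RIGHT.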

J48/C4.5 • Can handle both numerical and categorical predictor variables • Tries to find the optimal split for each numerical variable • Repeatedly looks for the variable that best splits the data in terms of predictive power • Later prunes out branches that turned out to have low predictive power
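To make "best splits the data" concrete, here is a simplified sketch that scores every candidate threshold on one numerical variable by information gain. It is not the J48 implementation: C4.5 uses a gain-ratio criterion and adds pruning, which are left out here. The helper names `entropy` and `best_threshold` are illustrative, and the data are the seven example rows from the Classification slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split value < t vs >= t."""
    base = entropy(labels)
    best = (None, 0.0)
    # Skip the smallest value so that both sides of the split are non-empty
    for t in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


# totalactions and right columns from the seven example rows
totalactions = [1, 2, 1, 3, 1, 2, 2]
right = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]

threshold, gain = best_threshold(totalactions, right)
print(threshold, round(gain, 3))
```

On these seven rows, splitting totalactions at 2 separates the labels completely, so the gain equals the entropy of the full label set. A tree learner would repeat this search over every variable, pick the best, and recurse on each side.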

Step Regression • Linear regression (discussed in detail in a later class), with a cut-off • Essentially assigns a weight to each parameter, and then computes a numerical value • Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1
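A minimal sketch of the prediction step: a weighted sum followed by the 0.5 cut-off. The weights and intercept below are made up for illustration and are not fitted to any data; in practice they would come from fitting a linear regression. The function name `step_regression_predict` is likewise illustrative.

```python
def step_regression_predict(features, weights, intercept, cutoff=0.5):
    """Linear combination of the features, then a hard cut-off to 0 or 1."""
    value = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if value >= cutoff else 0


# Hypothetical weights for (pknow, time, totalactions); NOT fitted values
weights = [0.8, -0.01, 0.05]
intercept = 0.1

high = step_regression_predict([0.967, 7, 3], weights, intercept)
low = step_regression_predict([0.049, 6, 1], weights, intercept)
print(high, low)
```

With these illustrative weights, the first row's value lands above the cut-off and is treated as 1, and the second lands below and is treated as 0.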

And of course… • There are lots of other classification algorithms you can use... • K* (instance-based classification) • JRip (rule-based classification) • PART (rule-based classification using trees) • Neural Network • Logistic Regression • SMO (support vector machine) • In your favorite Machine Learning package

If there’s time at the end of class… • We could go through some of these algorithms

Comments? Questions?

What data set should you generally test on? • A vote… • Raise your hands as many times as you like

What data set should you generally test on? • The data set you trained your classifier on • A data set from a different tutor • Split your data set in half (by students), train on one half, test on the other half • Split your data set in ten (by actions). Train on nine of the sets, test on the tenth. Do this ten times. • Votes?

What data set should you generally test on? • The data set you trained your classifier on • A data set from a different tutor • Split your data set in half (by students), train on one half, test on the other half • Split your data set in ten (by actions). Train on nine of the sets, test on the tenth. Do this ten times. • What are the benefits and drawbacks of each?

The dangerous one (though still sometimes OK) • The data set you trained your classifier on • If you do this, there is serious danger of over-fitting

The dangerous one (though still sometimes OK) • You have ten thousand data points. • You fit a parameter for each data point. • “If data point 1, RIGHT. If data point 78, WRONG…” • Your accuracy is 100% • Your kappa is 1 • Your model will neither work on new data, nor will it tell you anything.
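The "one parameter per data point" model can be sketched as a lookup table. The data below are invented for illustration, and `cohens_kappa` is a plain implementation of Cohen's kappa written for this example (it assumes more than one label occurs, so expected agreement is below 1). Unseen points fall back to a default guess, which is an assumption of this sketch.

```python
def cohens_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum((actual.count(l) / n) * (predicted.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


# Invented training data: data point id -> label
train = {1: "RIGHT", 2: "WRONG", 3: "RIGHT", 4: "WRONG", 5: "RIGHT", 6: "WRONG"}
memorized = dict(train)  # one "parameter" per data point


def predict(point_id, default="RIGHT"):
    """Look the point up; unseen points get the default guess."""
    return memorized.get(point_id, default)


train_pred = [predict(i) for i in train]
train_kappa = cohens_kappa(list(train.values()), train_pred)

# Invented new data the model has never seen
new = {101: "WRONG", 102: "RIGHT", 103: "WRONG", 104: "RIGHT"}
new_pred = [predict(i) for i in new]
new_kappa = cohens_kappa(list(new.values()), new_pred)

print(train_kappa, new_kappa)
```

On the training set the lookup table is perfect, with kappa of 1. On the new points it can only guess, and kappa drops to chance level.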

The dangerous one (though still sometimes OK) • The data set you trained your classifier on • When might this one still be OK?

The dangerous one (though still sometimes OK) • The data set you trained your classifier on • When might this one still be OK? • Computing complexity-based goodness metrics such as BIC • Determining the maximum possible performance of a modeling approach

K-fold cross validation (standard) • Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. • What can you infer from this?

K-fold cross validation (standard) • Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. • What can you infer from this? • Your detector will work with new data from the same students

K-fold cross validation (standard) • Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. • What can you infer from this? • Your detector will work with new data from the same students • How often do we really care about this?

K-fold cross validation (student-level) • Split your data set in half (by student), train on one half, test on the other half • What can you infer from this?

K-fold cross validation (student-level) • Split your data set in half (by student), train on one half, test on the other half • What can you infer from this? • Your detector will work with data from new students from the same population (whatever it was) • Possible to do in RapidMiner • Not possible to do in Weka
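The difference from action-level cross-validation is that whole students, not individual actions, are assigned to folds. A minimal sketch follows, assuming each row is a dict with a hypothetical "student" key; the function name `student_level_folds` is illustrative. Students are assigned round-robin in sorted order to keep the example deterministic; a real analysis would more likely shuffle them first.

```python
def student_level_folds(rows, k):
    """Yield (train, test) row lists such that no student appears in both."""
    students = sorted({row["student"] for row in rows})
    # Assign each whole student to one fold, round-robin
    fold_of = {s: i % k for i, s in enumerate(students)}
    for fold in range(k):
        train = [r for r in rows if fold_of[r["student"]] != fold]
        test = [r for r in rows if fold_of[r["student"]] == fold]
        yield train, test


# Invented example rows
rows = [
    {"student": "s1", "right": "RIGHT"},
    {"student": "s1", "right": "WRONG"},
    {"student": "s2", "right": "RIGHT"},
    {"student": "s3", "right": "WRONG"},
    {"student": "s4", "right": "RIGHT"},
    {"student": "s4", "right": "RIGHT"},
]

# k=2 matches the slide's "split in half (by student)"
folds = list(student_level_folds(rows, k=2))
for train, test in folds:
    print(sorted({r["student"] for r in train}), sorted({r["student"] for r in test}))
```

The same pattern applies to any grouping level: swapping the "student" key for a tutor or population identifier gives tutor-level or population-level cross-validation.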

K-fold or leave-one-out • Really not clear which one is best (as discussed in previous lecture) • Certain kinds of re-sampling/bootstrapping/etc. are easier to do with k-fold cross-validation

A data set from a different tutor • The most stringent test • When your model succeeds at this test, you know you have a good/general model • When it fails, it’s sometimes hard to know why

An interesting alternative • Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006) • Train on data from 3 or more tutors • Test on data from a different tutor • (Repeat for all possible combinations) • Good for giving a picture of how well your model will perform in new lessons

Worth noting • If you want to know if your model will work on new populations • Cross-validate at the population level rather than the student level

Homework 3 • Let’s look at some of the homework 3 solutions • Please comment on what’s right and wrong, what’s clever, etc. • We’ll look at the approaches, the goodness, the final models

Homework 3 • Now let’s take the best homework • Any other ideas for how to come up with a better model? • Let’s try them!

Feature Engineering • There are lots of fancy algorithms • But typically your detector is no better than your features • Features that have good construct validity are more likely to produce a good model • Particularly nice example of this in Sao Pedro et al. (under review) • In the next assignment, you’ll create your own features to try to produce a better model

Assignment 4 • Let’s review Assignment 4

Next Class • Wednesday, February 15 • 3pm-5pm • AK232 • Feature engineering and feature distillation • SPECIAL GUEST LECTURER: SUJITH GOWDA • Assignments Due: 4. Feature Engineering

The End

Bonus Slides • If there’s time

BKT with Multiple Skills

Conjunctive Model (Pardos et al., 2008) • The probability a student can answer an item with skills A and B is • P(CORR|A^B) = P(CORR|A) * P(CORR|B) • But how should credit or blame be assigned to the various skills?
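The product rule above is short enough to sketch directly. The per-skill probabilities below are invented for illustration, and the function name `p_correct_conjunctive` is hypothetical. It takes the per-skill values P(CORR|skill) as given and does not address how they are estimated or how credit is assigned.

```python
import math


def p_correct_conjunctive(per_skill_p_correct):
    """P(CORR | all skills) as the product of the per-skill P(CORR | skill)."""
    return math.prod(per_skill_p_correct)


# Invented values: P(CORR|A) = 0.9, P(CORR|B) = 0.8
p_ab = p_correct_conjunctive([0.9, 0.8])
print(round(p_ab, 2))
```

Because every factor is at most 1, the combined probability is never higher than that of the weakest skill, which is part of what makes the credit-and-blame question non-trivial.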

Koedinger et al.’s (2011) Conjunctive Model • Equations for 2 skills

Koedinger et al.’s (2011) Conjunctive Model • Generalized equations

Koedinger et al.’s (2011) Conjunctive Model • Handles case where multiple skills apply to an item better than classical BKT

Other BKT Extensions? • Additional parameters? • Additional states?

Many others • Compensatory Multiple Skills (Pardos et al., 2008) • Clustered Skills (Ritter et al., 2009)
