Data Annotation for Classification
Prediction • Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables) • Which students are off-task? • Which students will fail the class?
Classification • Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data • Which students will fail the class? • Is the student currently gaming the system? • Which type of gaming the system is occurring?
We will… • We will go into detail on classification methods tomorrow
In order to use prediction methods • We need to know what we’re trying to predict • And we need to have some labels of it in real data
For example… • If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class… • We need to first collect some data • And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when
So we need to label some data • We need to obtain outside knowledge to determine what the value is for the construct of interest
In some cases • We can get a gold-standard label • For instance, if we want to know if a student passed a class, we just go ask their instructor
But for behavioral constructs… • There’s no one to ask • We can’t ask the student (self-presentation) • There’s no gold-standard metric • So we use data labeling methods or observation methods • (e.g. quantitative field observations, video coding) • To collect bronze-standard labels • Not perfect, but good enough
One such labeling method • Text replay coding
Text replays • Pretty-prints of student interaction behavior from the logs
Sampling • You can set up any sampling schema you want, if you have enough log data • 5 action sequences • 20 second sequences • Every behavior on a specific skill, but other skills omitted
Sampling • Equal number of observations per lesson • Equal number of observations per student • Observations that machine learning software needs help to categorize (“biased sampling”)
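As one illustration of these sampling options, the sketch below draws an equal number of observations per student. It is a minimal example under stated assumptions: the `(student_id, clip_id)` layout and the function name `sample_equal_per_student` are hypothetical and are not taken from the text replay software.

```python
import random
from collections import defaultdict

def sample_equal_per_student(clips, per_student, seed=0):
    """Pick up to `per_student` clips at random for each student.

    `clips` is a list of (student_id, clip_id) pairs; this layout is a
    hypothetical stand-in for whatever the real log export provides.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_student = defaultdict(list)
    for student_id, clip_id in clips:
        by_student[student_id].append(clip_id)

    sample = {}
    for student_id, clip_ids in by_student.items():
        # Students with fewer clips than requested contribute all they have.
        k = min(per_student, len(clip_ids))
        sample[student_id] = rng.sample(clip_ids, k)
    return sample

# Hypothetical clips: three for s1, two for s2, one for s3.
clips = [("s1", 1), ("s1", 2), ("s1", 3), ("s2", 4), ("s2", 5), ("s3", 6)]
sample = sample_equal_per_student(clips, per_student=2)
```

Grouping by lesson instead of by student would give the "equal number of observations per lesson" variant.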
Major Advantages • Both video and field observations hold some risk of observer effects • Text replays are based on logs that were collected completely unobtrusively
Major Advantages • Blazing fast to conduct • 8 to 40 seconds per observation
Notes • Decent inter-rater reliability is possible (Baker, Corbett, & Wagner, 2006) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010) (Montalvo et al., 2010) • Agree with other measures of constructs (Baker, Corbett, & Wagner, 2006) • Can be used to train machine-learned detectors (Baker & de Carvalho, 2008) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010)
Major Limitations • Limited range of constructs you can code • Gaming the System – yes • Collaboration in online chat – yes (Prata et al., 2008) • Frustration, Boredom – sometimes • Off-Task Behavior outside of software – no • Collaborative Behavior outside of software – no
Major Limitations • Lower precision (because lower bandwidth of observation)
Find a partner • Could be your project team-mate, but doesn’t have to be • You will do this exercise with them
Get a copy of the text replay software • On your flash drive • Or at http://www.joazeirodebaker.net/algebra-obspackage-LSRM.zip
Skim the instructions • At Instructions-LSRM.docx
Log into text replay software • Using exploratory login • Try to figure out what the student’s behavior means, with your partner • Do this for ~5 minutes
Now pick a category you want to code • With your partner
Now code data • According to your coding scheme • (is-category versus is-not-category) • Separate from your partner • For 20 minutes
Now put your data together • Using the observations-NAME files you obtained • Make a table (in excel?) showing
Now… • We can compute your inter-rater reliability… (also called agreement)
Agreement/ Accuracy • The easiest measure of inter-rater reliability is agreement, also called accuracy • $\text{Agreement} = \frac{\text{number of agreements}}{\text{total number of codes}}$
Agreement/ Accuracy • There is general agreement across fields that agreement/accuracy is not a good metric • What are some drawbacks of agreement/accuracy?
Agreement/ Accuracy • Let’s say that Tasha and Uniqua agreed on the classification of 9200 time sequences, out of 10000 • For a coding scheme with two codes • 92% accuracy • Good, right?
Non-even assignment to categories • Percent Agreement does poorly when there is non-even assignment to categories • Which is almost always the case • Imagine an extreme case • Uniqua (correctly) picks category A 92% of the time • Tasha always picks category A • Agreement/accuracy of 92% • But essentially no information
An alternate metric • Kappa • $\kappa = \frac{\text{Agreement} - \text{Expected Agreement}}{1 - \text{Expected Agreement}}$
Kappa • Expected agreement computed from a table of the form • Note that Kappa can be calculated for any number of categories (but only 2 raters)
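The table itself did not survive in this transcript. A common layout for two coders is a cross-tabulation of coder 1's label against coder 2's label, and the sketch below builds one. Treat that layout as an assumption rather than the slide's actual table; the labels and the helper name `cross_tab` are hypothetical.

```python
from collections import Counter

def cross_tab(codes_1, codes_2):
    """Count how often each (coder 1 label, coder 2 label) pair occurs."""
    if len(codes_1) != len(codes_2):
        raise ValueError("both coders must label the same clips")
    return Counter(zip(codes_1, codes_2))

# Hypothetical labels for six clips, in the same clip order for both coders.
coder_1 = ["A", "A", "B", "A", "B", "A"]
coder_2 = ["A", "B", "B", "A", "B", "A"]
table = cross_tab(coder_1, coder_2)

# Print one row per coder 1 label, one column per coder 2 label.
categories = sorted(set(coder_1) | set(coder_2))
for row in categories:
    print(row, [table[(row, col)] for col in categories])
```

The diagonal cells hold the agreements; the row and column totals give each coder's proportion per category.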
Cohen’s (1960) Kappa • The formula for 2 categories • Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ raters • I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you
Expected agreement • Look at the proportion of labels each coder gave to each category • To find the proportion of agreement on category A that could be expected by chance, multiply pct(coder1/categoryA)*pct(coder2/categoryA) • Do the same thing for categoryB • Add these two values together; with proportions, the sum is your expected agreement • If you multiply raw counts instead of proportions, divide the sum by the square of the total number of labels
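The expected-agreement recipe and the Kappa formula can be sketched in a few lines of Python. This is a minimal illustration, not the Excel spreadsheet mentioned earlier; the function name `cohens_kappa` is hypothetical, and the 100-clip data is an assumed scaling of the Tasha and Uniqua extreme case.

```python
from collections import Counter

def cohens_kappa(codes_1, codes_2):
    """Cohen's Kappa for two coders who labeled the same clips."""
    n = len(codes_1)
    if n == 0 or n != len(codes_2):
        raise ValueError("both coders must label the same, non-empty set of clips")

    # Observed agreement: share of clips given the same label by both coders.
    agreement = sum(a == b for a, b in zip(codes_1, codes_2)) / n

    # Expected agreement: for each category, multiply the two coders'
    # proportions for that category, then sum over categories.
    counts_1, counts_2 = Counter(codes_1), Counter(codes_2)
    categories = set(counts_1) | set(counts_2)
    expected = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)

    if expected == 1.0:
        raise ValueError("Kappa is undefined when expected agreement is 1")
    return (agreement - expected) / (1 - expected)

# Extreme case: Uniqua picks A 92% of the time, Tasha always picks A.
uniqua = ["A"] * 92 + ["B"] * 8
tasha = ["A"] * 100
kappa = cohens_kappa(uniqua, tasha)
print(kappa)  # 0.0: 92% agreement, but no better than chance
```

Here the observed and expected agreement are both 0.92, so Kappa is 0 even though percent agreement looks high.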
Example • What is the percent agreement? • 80%
Example • What is Tyrone’s expected frequency for on-task? • 75%
Example • What is Pablo’s expected frequency for on-task? • 65%
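Putting the three answers together gives the expected agreement and Kappa for this example. The working below assumes the example has only two categories (on-task and off-task), so each coder's off-task proportion is one minus the on-task proportion; $P_o$ and $P_e$ are shorthand introduced here for observed and expected agreement.

```latex
\begin{align*}
P_e &= (0.75)(0.65) + (0.25)(0.35) = 0.4875 + 0.0875 = 0.575 \\
\kappa &= \frac{P_o - P_e}{1 - P_e} = \frac{0.80 - 0.575}{1 - 0.575} = \frac{0.225}{0.425} \approx 0.529
\end{align*}
```

Under that assumption, an 80% raw agreement corresponds to a Kappa of roughly 0.53 once chance agreement is removed.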