Analysis of Classification Errors in Assessment Systems

Classification Errors in a Domain-Independent Assessment System
Rodney D. Nielsen1,2, Wayne Ward1,2 and James H. Martin1
1 Center for Computational Language and Education Research, CU, Boulder
2 Boulder Language Technologies

Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Learner Answer: When the string gets longer it makes the pitch lower.
Reference Answer: A long string produces a low pitch.
(Lawrence Hall of Science 2006, Assessing Science Knowledge)

Tailoring the Tutor’s Response
• Question:
  • A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
• Reference answer:
  • A long string produces a low pitch.
• Learner answers:
  • When the string gets longer it makes the pitch lower.
  • A long string produces a pitch.
  • It makes a loud pitch.
  • It makes a high pitch.
  • If the string is tighter, the pitch is higher.

Necessity of Finer-Grained Analysis
• Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands
• Reference Answer: A long string produces a low pitch.
  [Figure: dependency parse of the reference answer, with arcs labeled det, nmod, subject and object]
• Break the reference answer down into low-level facets derived from a dependency parse and thematic roles
  • NMod(string, long): The string is long.
  • Agent(produces, string): A string is producing something.
  • Product(produces, pitch): A pitch is being produced.
  • NMod(pitch, low): The pitch is low.
• Assess whether an understanding of each facet is implied by the student’s response
• Follow-up Question: Does a long string produce a higher or lower pitch?
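The facet decomposition above can be written down as a small data structure. This is a minimal sketch, not the authors' representation: the `Facet` class and its field names (`relation`, `governor`, `modifier`) are hypothetical, and only the four facets themselves come from the slide.

```python
from typing import NamedTuple


class Facet(NamedTuple):
    """One low-level facet: a labeled relation between two words (hypothetical layout)."""
    relation: str   # dependency or thematic-role label, e.g. "NMod" or "Agent"
    governor: str   # first argument as written on the slide
    modifier: str   # second argument as written on the slide

    def __str__(self) -> str:
        return f"{self.relation}({self.governor}, {self.modifier})"


# Facets of the reference answer "A long string produces a low pitch."
REFERENCE_FACETS = [
    Facet("NMod", "string", "long"),
    Facet("Agent", "produces", "string"),
    Facet("Product", "produces", "pitch"),
    Facet("NMod", "pitch", "low"),
]

for facet in REFERENCE_FACETS:
    print(facet)
```

Keeping each facet as a separate record is what lets a tutor ask about exactly one of them, such as NMod(pitch, low), rather than the whole reference answer.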

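Representing Fine-Grained Semantics
Assess the relationship between the student’s answer and the reference answer facets at a finer grain
Reference Ans: A long string produces a low pitch.
• Learner answer: A long string produces a pitch.
  • NMod(string, long): Expressed (Yes)
  • Agent(produces, string): Expressed (Yes)
  • Product(produces, pitch): Expressed (Yes)
  • NMod(pitch, low): Unaddressed (No)
• Learner answer: It produces a loud pitch.
  • NMod(string, long): Assumed
  • Agent(produces, string): Expressed
  • Product(produces, pitch): Expressed
  • NMod(pitch, low): Different Argument
• Learner answer: It produces a high pitch.
  • NMod(string, long): Assumed
  • Agent(produces, string): Expressed
  • Product(produces, pitch): Expressed
  • NMod(pitch, low): Contradiction Expressed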

Answer Annotation Labels
• Understood: Facets that are understood by the student
  • Assumed: Assumed to be understood a priori based on the question
  • Expressed: Directly expressed or inferred by simple reasoning
  • Inferred: Inferred by pragmatics or nontrivial logical reasoning
• Contradicted: Facets contradicted by the learner answer
  • Contra-Expr: Directly contradicted by negation, antonymous expressions and their paraphrases
  • Contra-Infr: Contradicted by pragmatics or complex reasoning
• Self-Contra: Facets that are both contradicted and implied (self contradictions)
• Diff-Arg: The core relation is expressed, but it has a different modifier or argument
• Unaddressed: Facets that are not addressed at all by the student’s answer
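The label inventory above forms a two-level hierarchy, which can be sketched as an enumeration plus a lookup from fine-grained label to coarse group. The names `Label`, `COARSE_GROUP` and `coarse` are hypothetical, and the grouping is read off the slide's wording (Assumed, Expressed and Inferred under Understood; Contra-Expr and Contra-Infr under Contradicted), so treat it as an assumption rather than the authors' definition.

```python
from enum import Enum


class Label(Enum):
    ASSUMED = "Assumed"
    EXPRESSED = "Expressed"
    INFERRED = "Inferred"
    CONTRA_EXPR = "Contra-Expr"
    CONTRA_INFR = "Contra-Infr"
    SELF_CONTRA = "Self-Contra"
    DIFF_ARG = "Diff-Arg"
    UNADDRESSED = "Unaddressed"


# Fine-grained label -> coarse group; labels without a parent group map to themselves.
COARSE_GROUP = {
    Label.ASSUMED: "Understood",
    Label.EXPRESSED: "Understood",
    Label.INFERRED: "Understood",
    Label.CONTRA_EXPR: "Contradicted",
    Label.CONTRA_INFR: "Contradicted",
    Label.SELF_CONTRA: "Self-Contra",
    Label.DIFF_ARG: "Diff-Arg",
    Label.UNADDRESSED: "Unaddressed",
}


def coarse(label: Label) -> str:
    """Return the coarse group a fine-grained annotation label belongs to."""
    return COARSE_GROUP[label]


print(coarse(Label.INFERRED))     # Understood
print(coarse(Label.CONTRA_EXPR))  # Contradicted
```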

Assessment Technology Overview
• Start with hand-generated reference answer facets
• Automatically parse the reference and learner answers and automatically extract their representations
• Extract a feature vector for each reference answer (RA) facet indicative of the student’s understanding of that facet
  • From answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
• Train a machine learning classifier on the training set feature vectors
• Use the classifier to assess the test set answers, assigning one of five Tutor-Labels to each RA facet
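To make the "feature vector per RA facet" step concrete, here is a deliberately simplified sketch that uses exact word matching only. It is an illustration under stated assumptions, not the system's feature set: the function names and the three feature names are hypothetical, and the parse-based and corpus co-occurrence features described above are not modeled.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def facet_features(governor: str, modifier: str, learner_answer: str) -> dict:
    """Toy lexical features for one reference-answer facet (hypothetical feature names)."""
    tokens = set(tokenize(learner_answer))
    gov = int(governor in tokens)
    mod = int(modifier in tokens)
    return {"governor_match": gov, "modifier_match": mod, "both_match": gov & mod}


# Facet NMod(pitch, low) against two learner answers from the slides.
print(facet_features("pitch", "low", "A long string produces a pitch."))
print(facet_features("pitch", "low", "When the string gets longer it makes the pitch lower."))
```

Both calls report that the modifier is missing, although the second answer does express the facet through "lower". Exact matching cannot see such derivational variants, which is one reason richer features are needed.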

Machine Learning Features
• Lexical
• Syntactic
• Other

Results (C4.5 decision tree)
• Results on Tutor-Labels are:
  • 24.4% and 15.4% over the majority class baseline (in-domain and out-of-domain, respectively)
  • 19.4% and 5.9% over the lexical baseline

Error Analysis of Domain-Independent Assessment
• Leave-one-module-out cross-validation on the 13 training set science modules
  • Train on 12 modules and test on the held-out module; do this for each of the 13 modules
  • Simulates the Unseen Modules (domain-independent) test set
  • Trained and tested on all non-Assumed facets
• Analyzed a random selection of a subset of errors
  • 100 Expressed and 100 Unaddressed
  • Consistently annotated by all annotators
• Consider the factors involved in the decision by humans
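The leave-one-module-out procedure described above can be sketched as a generator of train/test splits. This is a minimal sketch: the function name, the dictionary layout and the module names in the demo are placeholders, not the actual science modules or the authors' code.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_module_out(modules: Dict[str, List]) -> Iterator[Tuple[str, List, List]]:
    """Yield (held_out_name, train_examples, test_examples), one fold per module."""
    for held_out in modules:
        train = [
            example
            for name, examples in modules.items()
            if name != held_out
            for example in examples
        ]
        test = list(modules[held_out])
        yield held_out, train, test


# Demo with 13 placeholder modules holding two dummy examples each.
demo = {f"module_{i:02d}": [f"m{i}_ex{j}" for j in range(2)] for i in range(1, 14)}
for name, train, test in leave_one_module_out(demo):
    print(name, len(train), len(test))
```

Because no example from the held-out module is ever seen in training, each fold approximates assessing answers from an unseen domain.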

Errors in Expressed Facets
• Four main error factors by frequency:
  • 72% Paraphrases
    • 43% Phrase-based paraphrasing
    • 35% Lexical substitution
    • 26% Coreference
    • 1% Syntactic alternation (Vanderwende et al. 2005)
  • 22% Logical Inference
  • 22% Pragmatics
  • 6% Preprocessing Errors

Errors in Expressed Facets
• 43% Phrase-based paraphrasing
  • 32 typical paraphrase occurrences
    • in the middle versus halfway between
    • one mineral will leave a scratch versus one will scratch the other
  • 14 uses of concept definitions
    • circuit versus electrical pathway
  • 6 negations of antonyms
    • not a lot for a little
    • no one has the same fingerprint for everyone has a different print

Errors in Expressed Facets
• 35% Lexical substitution
  • Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
  • Half detectable by a broad-coverage resource
    • tiny for small, CO2 for gas, put for place, pen for ink and push for carry
  • Many not easily detectable in lexical resources
    • put the pennies for distribute the pennies, and have for contain

Errors in Expressed Facets
• 26% Coreference Resolution
  • 15 pronouns (11 it, 3 she, 1 one)
  • 6 NP term substitutions
    • Ref Ans: clay particles are light
    • Learner Ans: clay is the lightest
  • 6 other common noun coreference issues

Errors in Expressed Facets
• 22% Logical inference
  • no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar => the two cups have the same amount of water
  • … it is easy to discriminate… => the two sounds are very different
• 22% Pragmatics
  • Because the vibrations => the rubberband is vibrating
  • … the fulcrum is too close to the earth => the earth is the load in the system

Errors in Expressed Facets
• 6% Preprocessing errors
  • Normalization issues
  • Parser errors

Errors in Expressed Facets
• Over half of the errors involved more than one of the fine-grained factors
  • There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb. => the tree blocks the light
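The overlap between factors is also visible in the percentages reported for the Expressed-facet errors (72% paraphrases, 22% logical inference, 22% pragmatics, 6% preprocessing). A short arithmetic check, assuming every percentage is taken over the same 100 analyzed errors:

```python
# Percentages as reported on the error-factor slides for Expressed facets.
top_level = {
    "Paraphrases": 72,
    "Logical inference": 22,
    "Pragmatics": 22,
    "Preprocessing errors": 6,
}
paraphrase_subtypes = {
    "Phrase-based paraphrasing": 43,
    "Lexical substitution": 35,
    "Coreference": 26,
    "Syntactic alternation": 1,
}

top_total = sum(top_level.values())
sub_total = sum(paraphrase_subtypes.values())

print(top_total)  # 122
print(sub_total)  # 105
```

The top-level factors sum to 122%, so some errors must carry more than one factor. The paraphrase subtypes sum to 105%, which exceeds the 72% paraphrase total, so some paraphrase errors must also involve more than one subtype.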

Errors in Unaddressed Facets
• Many are questionable annotations
  • You could take a couple of cardboard houses and … 1 with thick glazed insulation. … =/> installing the insulation in the houses
  • Because the darker the color the faster it will heat up =/> darkest color

Errors in Unaddressed Facets
• Biggest source of error: lexical similarity
  • Ignorance of context
    • [the electromagnet] has to be iron… => steel is made from iron
  • Antonyms
    • closer versus greater distance and absorbs energy versus reflects energy
  • Misguided trust
    • I learned it in class
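A small sketch of why lexical similarity misleads, using phrase pairs quoted on these slides. Token-level Jaccard overlap is a stand-in chosen for illustration; it is an assumption, not the similarity measure used in the system, and the function name `jaccard` is hypothetical.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard overlap between two phrases (0.0 when both are empty)."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Antonymous pair: opposite meaning, but one shared word out of three.
print(jaccard("absorbs energy", "reflects energy"))

# True paraphrase pair: same meaning, but no shared words.
print(jaccard("in the middle", "halfway between"))
```

The antonymous pair scores higher than the genuine paraphrase, so a classifier that leans on lexical overlap can label a contradicted or unaddressed facet as understood and miss a facet that is actually expressed.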

Conclusion
• New assessment paradigm
  • Fine-grained facets and labels
• Corpus of 146K fine-grained inference annotations
• Answer assessment system
  • 24.4% and 15.4% over baseline results for in-domain and out-of-domain, respectively
  • First successful assessment of Grade 3-6 constructed responses
• Error analysis provides insight into where future work is most appropriate

Thanks!
• We are grateful to the anonymous reviewers, whose comments improved the paper, and the Lawrence Hall of Science for the data.
• This work was partially funded by Award Numbers:
  • NSF 0551723,
  • IES R305B070434, and
  • NSF DRL-0733323.
