
Classification Errors in a Domain-Independent Assessment System

Rodney D. Nielsen 1,2, Wayne Ward 1,2 and James H. Martin 1
1 Center for Computational Language and Education Research, CU Boulder
2 Boulder Language Technologies


Presentation Transcript


  1. Classification Errors in a Domain-Independent Assessment System
Rodney D. Nielsen 1,2, Wayne Ward 1,2 and James H. Martin 1
1 Center for Computational Language and Education Research, CU Boulder
2 Boulder Language Technologies
• Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
• Learner answer: When the string gets longer it makes the pitch lower.
• Reference answer: A long string produces a low pitch.
(Lawrence Hall of Science 2006, Assessing Science Knowledge)

  2. Tailoring the Tutor’s Response
• Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
• Reference answer: A long string produces a low pitch.
• Learner answers:
  • When the string gets longer it makes the pitch lower.
  • A long string produces a pitch.
  • It makes a loud pitch.
  • It makes a high pitch.
  • If the string is tighter, the pitch is higher.

  3. Necessity of Finer-Grained Analysis
[Slide figure: dependency parse of “A long string produces a low pitch.” with det, subject, object and nmod arcs]
• Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands
• Reference answer: A long string produces a low pitch.
• Break the reference answer down into low-level facets derived from a dependency parse and thematic roles (see the extraction sketch below):
  • NMod(string, long): The string is long.
  • Agent(produces, string): A string is producing something.
  • Product(produces, pitch): A pitch is being produced.
  • NMod(pitch, low): The pitch is low.
• Assess whether an understanding of each facet is implied by the student’s response
• Follow-up question: Does a long string produce a higher or lower pitch?
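A minimal sketch of how such facets could be extracted automatically, assuming spaCy and its small English model. The system described in these slides used its own parser and layered thematic roles (Agent, Product) on top of plain dependencies, so this is illustrative only.

```python
# Facet extraction from a dependency parse: a sketch assuming spaCy
# (pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facets(sentence):
    """Return (relation, governor, dependent) triples for the
    content-bearing dependencies that approximate answer facets."""
    doc = nlp(sentence)
    keep = {"amod", "nsubj", "dobj", "nmod"}  # relations of interest
    return [(tok.dep_, tok.head.text, tok.text)
            for tok in doc if tok.dep_ in keep]

print(extract_facets("A long string produces a low pitch."))
# [('amod', 'string', 'long'), ('nsubj', 'produces', 'string'),
#  ('dobj', 'produces', 'pitch'), ('amod', 'pitch', 'low')]
```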

  4. Representing Fine-Grained Semantics
• Assess the relationship between the student’s answer and the reference answer facets at a finer grain
• Reference answer: A long string produces a low pitch.
  Facets: NMod(string, long) | Agent(produces, string) | Product(produces, pitch) | NMod(pitch, low)
• A long string produces a pitch. → Expressed | Expressed | Expressed | Unaddressed (understood? Yes | Yes | Yes | No)
• It produces a loud pitch. → Assumed | Expressed | Expressed | Diff-Arg (different argument)
• It produces a high pitch. → Assumed | Expressed | Expressed | Contra-Expr (contradiction)
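One hypothetical way to encode the table above in code, pairing each learner answer with one label per reference-answer facet, in facet order:

```python
# Hypothetical encoding of the per-facet assessment table on this slide.
facets = ["NMod(string, long)", "Agent(produces, string)",
          "Product(produces, pitch)", "NMod(pitch, low)"]

assessments = {
    "A long string produces a pitch.":
        ["Expressed", "Expressed", "Expressed", "Unaddressed"],
    "It produces a loud pitch.":
        ["Assumed", "Expressed", "Expressed", "Diff-Arg"],
    "It produces a high pitch.":
        ["Assumed", "Expressed", "Expressed", "Contra-Expr"],
}

for answer, labels in assessments.items():
    print(answer)
    for facet, label in zip(facets, labels):
        print(f"  {facet:26} {label}")
```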

  5. Answer Annotation Labels
• Understood: facets that are understood by the student
  • Assumed: assumed to be understood a priori based on the question
  • Expressed: directly expressed or inferred by simple reasoning
  • Inferred: inferred by pragmatics or nontrivial logical reasoning
• Contradicted: facets contradicted by the learner answer
  • Contra-Expr: directly contradicted by negation, antonymous expressions and their paraphrases
  • Contra-Infr: contradicted by pragmatics or complex reasoning
  • Self-Contra: facets that are both contradicted and implied (self-contradictions)
• Diff-Arg: the core relation is expressed, but it has a different modifier or argument
• Unaddressed: facets that are not addressed at all by the student’s answer
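The label set lends itself to a small enumeration. The sketch below is a hypothetical encoding, with names taken directly from this slide:

```python
# The annotation label set from this slide as a Python Enum; grouping
# comments mirror the Understood/Contradicted hierarchy above.
from enum import Enum

class FacetLabel(Enum):
    # Understood facets
    ASSUMED = "Assumed"          # understood a priori from the question
    EXPRESSED = "Expressed"      # directly expressed or simple inference
    INFERRED = "Inferred"        # pragmatics or nontrivial reasoning
    # Contradicted facets
    CONTRA_EXPR = "Contra-Expr"  # negation, antonyms, their paraphrases
    CONTRA_INFR = "Contra-Infr"  # pragmatics or complex reasoning
    SELF_CONTRA = "Self-Contra"  # both contradicted and implied
    # Other
    DIFF_ARG = "Diff-Arg"        # core relation, different argument
    UNADDRESSED = "Unaddressed"  # not addressed at all

print([label.value for label in FacetLabel])
```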

  6. Assessment Technology Overview
• Start with hand-generated reference answer facets
• Automatically parse the reference and learner answers and automatically extract their representations
• Extract a feature vector for each reference answer (RA) facet indicative of the student’s understanding of that facet
  • From the answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
• Train a machine learning classifier on the training set feature vectors
• Use the classifier to assess the test set answers, assigning one of five Tutor-Labels for each RA facet (see the sketch below)
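A toy sketch of the per-facet feature extraction step. Everything here is a stand-in: word_sim() approximates the corpus co-occurrence statistics, and the three features are illustrative, not the real lexical/syntactic feature set.

```python
def word_sim(w1, w2):
    """Stub standing in for corpus co-occurrence similarity."""
    return 1.0 if w1.lower() == w2.lower() else 0.0

def facet_feature_vector(ra_facet, learner_facets):
    """Illustrative features for one RA facet vs. a learner answer."""
    rel, gov, dep = ra_facet
    answer_words = {w for _, g, d in learner_facets for w in (g, d)}
    return {
        "gov_best_sim": max((word_sim(gov, w) for w in answer_words), default=0.0),
        "dep_best_sim": max((word_sim(dep, w) for w in answer_words), default=0.0),
        "relation_matched": any(r == rel for r, _, _ in learner_facets),
    }

ra_facet = ("amod", "pitch", "low")   # NMod(pitch, low)
learner = [("amod", "pitch", "high"), ("dobj", "produces", "pitch")]
print(facet_feature_vector(ra_facet, learner))
# {'gov_best_sim': 1.0, 'dep_best_sim': 0.0, 'relation_matched': True}
```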

  7. Machine Learning Features
• Three feature groups: lexical, syntactic, and other
[The slide’s table of individual features is not preserved in the transcript]

  8. Results (C4.5 decision tree)
• Results on Tutor-Labels (in-domain and out-of-domain, respectively):
  • 24.4 and 15.4% over the majority-class baseline
  • 19.4 and 5.9% over the lexical baseline
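The original experiments used a C4.5 decision tree. Below is a rough, hypothetical analogue with scikit-learn: its CART trees with the entropy criterion approximate C4.5's information-gain splits, but this is not a reimplementation, and the training data here is made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set of per-facet feature vectors and labels.
train_vectors = [
    {"gov_best_sim": 1.0, "dep_best_sim": 1.0, "relation_matched": 1},
    {"gov_best_sim": 1.0, "dep_best_sim": 0.0, "relation_matched": 1},
    {"gov_best_sim": 0.0, "dep_best_sim": 0.0, "relation_matched": 0},
]
train_labels = ["Expressed", "Diff-Arg", "Unaddressed"]

vec = DictVectorizer()
X = vec.fit_transform(train_vectors)
clf = DecisionTreeClassifier(criterion="entropy").fit(X, train_labels)

test = {"gov_best_sim": 1.0, "dep_best_sim": 1.0, "relation_matched": 1}
print(clf.predict(vec.transform([test])))  # ['Expressed']
```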

  9. Error Analysis of Domain-Independent Assessment
• Leave-one-module-out cross-validation on the 13 training set science modules
  • Train on 12 modules, test on the held-out module; repeat for each of the 13 modules
  • Simulates the Unseen Modules (domain-independent) test set (see the sketch below)
• Trained and tested on all non-Assumed facets
• Analyzed a random subset of the errors
  • 100 Expressed and 100 Unaddressed
  • All consistently annotated by all annotators
  • Considered the factors involved in the human annotators’ decisions
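The leave-one-module-out protocol maps naturally onto scikit-learn's LeaveOneGroupOut, with module identity as the group key. The features, labels, and module names below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(12, 3)                                    # fake facet features
y = np.array(["Expressed", "Unaddressed", "Expressed"] * 4)  # fake labels
modules = np.repeat(["Magnetism", "Water", "Levers", "Sound"], 3)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=modules):
    # Train on all other modules, evaluate on the held-out one.
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc = clf.score(X[test_idx], y[test_idx])
    print(f"held-out module: {modules[test_idx][0]:<10} accuracy: {acc:.2f}")
```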

  10. Errors in Expressed Facets
• Four main error factors, by frequency:
  • 72% Paraphrases
    • 43% Phrase-based paraphrasing
    • 35% Lexical substitution
    • 26% Coreference
    • 1% Syntactic alternation (Vanderwende et al. 2005)
  • 22% Logical inference
  • 22% Pragmatics
  • 6% Preprocessing errors

  11. Errors in Expressed Facets
• 43% Phrase-based paraphrasing
  • 32 typical paraphrase occurrences
    • in the middle versus halfway between
    • one mineral will leave a scratch versus one will scratch the other
  • 14 uses of concept definitions
    • circuit versus electrical pathway
  • 6 negations of antonyms
    • not a lot for a little
    • no one has the same fingerprint for everyone has a different print

  12. Errors in Expressed Facets
• 35% Lexical substitution
  • Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
  • Half are detectable by a broad-coverage lexical resource (see the WordNet sketch below)
    • tiny for small, CO2 for gas, put for place, pen for ink and push for carry
  • Many are not easily detectable in lexical resources
    • put the pennies for distribute the pennies, and have for contain
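A shallow sketch of the "detectable by a broad-coverage resource" test, using WordNet through NLTK (assumes nltk.download("wordnet") has been run). It checks shared synsets, direct hypernym/hyponym links, and adjective similar-to links; deeper relations would need more work.

```python
from nltk.corpus import wordnet as wn

def wordnet_related(w1, w2):
    """True if the words share a synset or sit one WordNet link apart."""
    syns1, syns2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    if syns1 & syns2:
        return True  # share a synset (synonyms)
    for s in syns1:
        neighbors = set(s.hypernyms()) | set(s.hyponyms()) | set(s.similar_tos())
        if neighbors & syns2:
            return True
    return False

print(wordnet_related("tiny", "small"))    # True, via an adjective link
print(wordnet_related("put", "place"))     # True: share a verb synset
print(wordnet_related("have", "contain"))  # pairs like this are often missed
```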

  13. Errors in Expressed Facets
• 26% Coreference resolution
  • 15 pronouns (11 it, 3 she, 1 one)
  • 6 NP term substitutions
    • Ref Ans: clay particles are light; Learner Ans: clay is the lightest
  • 6 other common-noun coreference issues

  14. Errors in Expressed Facets
• 22% Logical inference
  • no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar => the two cups have the same amount of water
  • … it is easy to discriminate … => the two sounds are very different
• 22% Pragmatics
  • Because the vibrations => the rubber band is vibrating
  • … the fulcrum is too close to the earth => the earth is the load in the system

  15. Errors in Expressed Facets
• 6% Preprocessing errors
  • Normalization issues
  • Parser errors

  16. Errors in Expressed Facets
• Over half of the errors involved more than one of the fine-grained factors
  • There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb. => the tree blocks the light

  17. Errors in Unaddressed Facets
• Many are questionable annotations
  • You could take a couple of cardboard houses and … 1 with thick glazed insulation. … =/> installing the insulation in the houses
  • Because the darker the color the faster it will heat up =/> darkest color

  18. Errors in Unaddressed Facets
• Biggest source of error: lexical similarity
  • Ignorance of context
    • [the electromagnet] has to be iron … => steel is made from iron
  • Antonyms (see the antonym lookup sketch below)
    • closer versus greater distance and absorbs energy versus reflects energy
  • Misguided trust
    • I learned it in class
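The antonym problem is easy to see in WordNet itself: antonyms sit one lemma-level link from each other, so a context-blind lexical similarity measure treats them much like near-synonyms. A small NLTK sketch (wordnet corpus assumed downloaded; the probe words are illustrative):

```python
from nltk.corpus import wordnet as wn

def wordnet_antonyms(word):
    """Collect antonym lemma names across all synsets of `word`."""
    return sorted({ant.name()
                   for syn in wn.synsets(word)
                   for lemma in syn.lemmas()
                   for ant in lemma.antonyms()})

# Exact output depends on the WordNet version installed.
print(wordnet_antonyms("dark"))   # e.g. includes 'light'
print(wordnet_antonyms("close"))  # e.g. includes 'open' (verb sense)
```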

  19. Conclusion
• New assessment paradigm
  • Fine-grained facets and labels
  • Corpus of 146K fine-grained inference annotations
• Answer assessment system
  • 24.4 and 15.4% over baseline results for in-domain and out-of-domain, respectively
  • First successful assessment of grade 3-6 constructed responses
• Error analysis provides insight into where future work is most appropriate

  20. Thanks!
• We are grateful to the anonymous reviewers, whose comments improved the paper, and to the Lawrence Hall of Science for the data.
• This work was partially funded by Award Numbers:
  • NSF 0551723,
  • IES R305B070434, and
  • NSF DRL-0733323.
