Parameters Driving Effectiveness of Automated Essay Scoring with LSA
9th CAA, July 6th 2005, Loughborough
Fridolin Wild, Christina Stahl, Gerald Stermsek, Gustaf Neumann
Department of Information Systems and New Media, Vienna University of Economics and Business Administration
Agenda • TEL @ WUW • Essay Scoring with LSA • Latent Semantic Analysis (LSA) • Parameters Driving Effectiveness • Experiment Results • Summary & Future Work
TEL @ WUW • Learn@WU • > 19,000 users • > 27,000 resources • Research-Driven Development • (EducaNext.org) • (HCD-online.com)
Electronic Assessment @ WUW • The Situation • No Entrance Limitations in Austria • High Drop-out Rates • Varying Number of Freshmen (by 1,000) • Space Problems • Highly Scalable Courses (with Large-Scale Assessments) • Concentrate Resources on Higher Semesters • Currently: many multiple-choice tests (for practice; scanner for exams) • Feeding Answers • Tempting: learning answers by heart instead of critical thinking (negative effects found) • Future: Free-Text Assessment • Increase Quality of Feedback (formative, no autograding!) • With Latent Semantic Analysis (LSA)
Software: The R Package ‘lsa’ • Currently in Version 0.4 • available upon request • public domain • Can be integrated on DB-Level • into PostgreSQL • ‘Essay Scoring by Stored Procedures’ • Easy to use (students!) • Wrapper Module for .LRN (Diploma Thesis)
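The slides describe version 0.4 of the package (available upon request); the minimal sketch below follows the function names of the later public CRAN releases of 'lsa', so details may differ from the version presented here. Directory names are hypothetical.

```r
library(lsa)
data(stopwords_de)   # German stop-word list shipped with the package

# Build a document-term matrix from the text files in a directory,
# with stop-word filtering.
tm <- textmatrix("corpus/", stopwords = stopwords_de, stemming = FALSE)

# Compute the latent semantic space (truncated SVD).
space <- lsa(tm, dims = dimcalc_share(share = 0.5))

# Fold a new essay into the existing space instead of recomputing the SVD.
essay <- textmatrix("essays/", vocabulary = rownames(tm))
essay_lsa <- fold_in(essay, space)
```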
Convert to Document-Term Matrix: Input (Docs) → M
Singular Value Decomposition: M = T S D'
“Latent Semantics” • Assumption: documents have a semantic structure • Structure is obscured by word usage (noise, synonyms, homographs, …) • Therefore: map the doc-term matrix using conceptual indices derived statistically (truncated SVD): M2 = T S2 D'
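A base-R sketch of the decomposition and the rank-k reconstruction; the toy matrix is hypothetical (rows = terms, columns = documents):

```r
# Toy document-term matrix M (4 terms x 4 documents).
M <- matrix(c(1, 0, 1, 0,
              0, 1, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 1), nrow = 4, byrow = TRUE)

# Full SVD: M = T S D' (svd() returns T as $u, the diagonal of S as $d, D as $v).
s <- svd(M)

# Keep only the k largest singular values and reconstruct the reduced matrix.
k  <- 2
Mk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
```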
Reconstructed, Reduced Matrix (example document m4: “Graph minors: A survey”)
doc2doc Similarities • Unreduced: based on M = T S D', Pearson correlation over document vectors • Reduced: based on M2 = T S2 D', Pearson correlation over document vectors
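Continuing the base-R sketch above: cor() on a matrix correlates its columns, i.e. the document vectors, so each variant is a single call.

```r
sim_unreduced <- cor(M)    # based on M = T S D' (the raw matrix)
sim_reduced   <- cor(Mk)   # based on the reduced reconstruction
```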
SVD-Updating: Folding-In • SVD Factor Stability • SVD calculates factors over a textbase • Different texts – different factors • Challenge: avoid unwanted factor changes (e.g. bad essays) • Solution: folding-in of essays instead of recalculating • SVD is computationally expensive • 14 seconds (300 docs textbase) • 10 minutes (3500 docs textbase) • … and rising!
Folding-In in Detail • (1) Convert the original term vector v to “Dk” format: dnew = v' Tk Sk^-1 • (2) Convert the “Dk”-format vector to “Mk” format: mnew = Tk Sk dnew' • (cf. Berry et al., 1995)
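The two conversion steps in base R, reusing s and k from the SVD sketch above (v is a hypothetical new essay vector over the same vocabulary):

```r
Tk <- s$u[, 1:k]
Sk <- diag(s$d[1:k])

v <- c(1, 0, 1, 0)   # raw term frequencies of the new document

# (1) original vector -> "Dk" format: dnew = v' Tk Sk^-1
dnew <- t(v) %*% Tk %*% solve(Sk)

# (2) "Dk" format -> "Mk" format, comparable to the columns of Mk
mnew <- Tk %*% Sk %*% t(dnew)
```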
Parameters • 4 pre-processing methods × 12 term weighting schemes × 7 dimensionality settings × 2 scoring methods × 3 similarity measures = 2016 combinations (see the enumeration sketch below)
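The grid can be enumerated directly; the level names below are hypothetical shorthand for the options detailed on the following slides:

```r
# Full evaluation grid: 4 x 12 x 7 x 2 x 3 = 2016 combinations.
grid <- expand.grid(
  preprocessing = c("raw", "stemming", "stopping", "stemming+stopping"),
  weighting     = paste(rep(c("rawtf", "bintf", "logtf"), each = 4),
                        c("none", "normalisation", "idf", "entropy")),
  dims          = c("share50", "share40", "share30", "ndocs",
                    "frac1/50", "frac1/30", "magic10"),
  method        = c("best_hit", "mean_of_best"),
  measure       = c("pearson", "cosine", "spearman")
)
nrow(grid)   # 2016
```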
Pre-Processing • Stemming • Porter Stemmer (snowball.tartarus.org) • ‘move’, ‘moving’, ‘moves’ => ‘move’ • even more important in German (more inflections) • Stop Word Elimination • 373 Stop Words in German • Stemming plus Stop Word Elimination • Unprocessed (‘raw’) Terms
Term Weighting Schemes • weight_ij = lw(tf_ij) · gw(term_i): the local weight depends on the term's frequency in document j, the global weight on the term's distribution over all documents • Local Weights (LW) • None (‘raw’ tf) • Binary Term Frequency • Logarithmised Term Frequency (log) • Global Weights (GW) • None (‘raw’ tf) • Normalisation • Inverse Document Frequency (IDF) • 1 + Entropy • 3 × 4 = 12 combinations (see the sketch below)
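A hand-rolled base-R version of one local and one global scheme; the later CRAN releases of the 'lsa' package ship ready-made weighting functions with similar names. m is a term-by-document matrix such as M above.

```r
lw_logtf <- function(m) log(m + 1)    # local: logarithmised tf
gw_idf   <- function(m) {             # global: idf per term
  log2(ncol(m) / rowSums(m > 0))      # one common idf variant
}

# weight_ij = lw(tf_ij) * gw(i): a vector of length nrow(m) recycles down
# the columns in R, so each row i is scaled by its global weight gw[i].
weighted <- lw_logtf(M) * gw_idf(M)
```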
SVD-Dimensionality • Percentage of Cumulated Singular Values • Shares of 50%, 40%, 30% • Number of Values = Number of Docs (ndocs) • Absolute Fraction of k • 1/50 and 1/30 • Fixed Number k (‘magic 10’)
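In base R the 'share' criterion reduces to a cumulated sum over the singular values (s$d from the SVD sketch above); the package's dimcalc_share() encapsulates the same idea.

```r
# Smallest k whose cumulated singular values reach a given share (here 50%).
share   <- 0.5
k_share <- which(cumsum(s$d) / sum(s$d) >= share)[1]

# The other options, for a textbase of n documents:
n       <- ncol(M)
k_frac  <- max(1, floor(n / 50))   # absolute fraction 1/50
k_magic <- 10                      # fixed number: 'magic 10'
```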
Similarity Measures & Methods • Measures: Pearson Correlation, (Cosine Correlation), Spearman's Rho • Methods: Best Hit of Best Solutions, Mean of Best Solutions (image credit: http://davidmlane.com/hyperstat/A62891.html)
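A sketch of the three measures and the two methods for one essay; essay_vec and gold (a matrix whose columns are the golden-essay vectors in the same space) are hypothetical inputs.

```r
sims_pearson  <- cor(essay_vec, gold)                       # Pearson
sims_spearman <- cor(essay_vec, gold, method = "spearman")  # Spearman's rho
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
sims_cosine   <- apply(gold, 2, function(g) cos_sim(essay_vec, g))

# The two scoring methods:
best_hit <- max(sims_spearman)    # best hit of best solutions
mean_sol <- mean(sims_spearman)   # mean of best solutions
```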
Assessing Effectiveness • Compare Machine Scores with Human Scores • Human-to-Human Correlation • Usually around .6 (literature, own experiments) • Increased by familiarity between assessors, tighter assessment schemes, … • Scores vary even more strongly with decreasing subject familiarity (.8 at high familiarity; worst test: -.07)
Experiment Settings • Test Collection • 43 Students' Essays in German • Scored by a Human Assessor from 0 to 5 points (ratio-scaled) • Average essay length: 56.4 words • Training Collection • 3 ‘golden essays’ • Plus 302 documents from a marketing glossary • Average glossary entry length: 56.1 words
Overall • Significance of the machine-human correlations (counts cumulative): • 48 combinations with p < 0.001 • 459 with p < 0.01 • 885 with p < 0.05 • 1235 with p < 0.1 • Remaining 781: not significant
Pre-Processing • Best: Stop Word Filtering (Ø .31) • Stemming / Stemming & Stopping worsen results (by .06 and .03) • Raw: .26 • Best 50: • 21 x stopping • 14 x raw • 12 x stemming • 3 x stemming & stopping • (chart: sorted Spearman correlation of all experiments)
Term Weighting • Global Weights: • IDF overall best (.36 with logtf) • Normalisation worsens results (.15 - .17) • 1+Entropy: nearly no effect • Local Weights: • hardly any effect • raw and logtf squeeze the curve • Best 50: • 20 x bintf • 19 x logtf • 11 x raw • 26 x IDF • 13 x raw • 6 x normalisation • 5 x 1+entropy
Dimensionality • ‘Share’ Scores Best • 50%: .29 (40% and 30%: .28) • Curve favours 30% • Rest: .22 to .24 • Negative Correlations at Share 30%: Normalisation as GW • Best 50: • 13 x 1/50th • 10 x share 50% • 8 x 1/30th • 8 x magic ten • 5 x share 40% • 3 x share 30% • 3 x ndocs
Correlation Measures • Cosine & Pearson slightly better across the curve (on average, Spearman is) • Best 50: • 21 x Spearman • 15 x Cosine • 14 x Pearson
Correlation Method • Mean Correlation with Best Essays is slightly better • Best 50: • 31 x Mean • 19 x Maximum
Summary • Effectiveness can be tuned in advance • Recommendation (not a guarantee): • Use Stop Word Filtering • Use IDF as global weight (with any local weight) • Use Spearman's Rho • Use Average Correlation to Best Essays • However: other combinations can still be successful! • Optimisations are not independent • (see the pipeline sketch below)
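Put together, the recommendation maps onto the 'lsa' package roughly as follows; again a sketch assuming the later CRAN API, with hypothetical directory names, using the hand-rolled lw_logtf()/gw_idf() from the weighting sketch above (the CRAN package ships equally named functions).

```r
library(lsa)
data(stopwords_de)

# Training space: marketing glossary plus golden essays,
# stop-word filtered (no stemming), idf as global weight.
train <- textmatrix("training/", stopwords = stopwords_de, stemming = FALSE)
train <- lw_logtf(train) * gw_idf(train)

space <- lsa(train, dims = dimcalc_share(share = 0.5))

# Fold student essays and golden essays into the space
# (for brevity, the folded vectors are left unweighted here).
essays <- fold_in(textmatrix("essays/", vocabulary = rownames(train)), space)
gold   <- fold_in(textmatrix("golden/", vocabulary = rownames(train)), space)

# Score: mean Spearman correlation of each essay with the golden essays.
scores <- apply(essays, 2, function(e) mean(cor(e, gold, method = "spearman")))
```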
Future Work • A Model of Influencing Factors • Stability across Changing Corpora / Contexts • Different Text Assessment Methods • Similarity Measurement Method • Doc versus Query (= Aspect) • Way of Corpus Splitting • Bag-of-Words: Documents vs. Sentences, Paragraphs, N-Grams • Summaries or Controlled Vocabulary • Norm-referenced vs. criterion-referenced (NRT, CRT) • ‘Definitory’ vs. ‘case-based’ QAs (I, E) • Compare with other Scoring Methods
Thanks for your attention! Get these slides at www.educanext.org
The Original Question (translated from German) • Question: “Select the plan with which you could best achieve the communication goal, and justify your choice in keywords! (5 points)” • Best Essay (5 pts): “The choice depends on whether I want to achieve breadth of impact (reach) or depth of impact (OTS), e.g. for image improvement. In my opinion, Plan 2 should be chosen here, since the OTS values are almost equal (22 for Plan 1 vs. 20 for Plan 2 = no big difference), but the costs per 1,000 users and per 1,000 contacts are lower with Plan 2 (users: Plan 2 cheaper by €101; contacts: Plan 2 cheaper by €3 than Plan 1)” • Essay (2 pts): “Plan 2: because with minimal additional costs for the placements -> I get more placements and -> a considerably larger reach, both overall and within the target group”