Detecting Missing Hyphens in Learner Text Aoife Cahill, Susanne Wolff, Nitin Madnani Educational Testing Service Martin Chodorow Hunter College and the Graduate Center ACL 2013
Outline • Introduction • Baselines • System Description • Evaluation • Conclusions
Introduction Missing Hyphens: • (1) Schools may have more after school sports. • (2) I went to the dentist after school today. • (3) My father like play basketball with me.
Baselines • Collins Dictionary • More than 1,000 occurrences in Wikipedia • Probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
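The three baselines above can be sketched as simple lookups. This is a minimal illustration, not the authors' implementation: the function names, the toy counts, the toy lexicon, and the way the probability is estimated (hyphenated count divided by hyphenated plus open-form count) are all assumptions made for this example.

```python
from collections import Counter

def hyphen_probability(counts, first, second):
    """Assumed estimate: share of the hyphenated spelling among both spellings."""
    hyphenated = counts[f"{first}-{second}"]
    open_form = counts[f"{first} {second}"]
    total = hyphenated + open_form
    return hyphenated / total if total else 0.0

def dictionary_baseline(lexicon, first, second):
    """Flag the pair if the hyphenated form is listed in the lexicon."""
    return f"{first}-{second}" in lexicon

def count_baseline(counts, first, second, min_count=1000):
    """Flag the pair if the hyphenated form occurs more than min_count times."""
    return counts[f"{first}-{second}"] > min_count

def probability_baseline(counts, first, second, threshold=0.66):
    """Flag the pair if the estimated probability exceeds the threshold."""
    return hyphen_probability(counts, first, second) > threshold

# Hypothetical counts and lexicon, invented for illustration only.
toy_counts = Counter({
    "well-known": 5000, "well known": 1200,
    "after-school": 300, "after school": 900,
})
toy_lexicon = {"well-known"}
```

None of these baselines looks at the sentence itself, so each one gives the same answer for a word pair in every context.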
System Description Learner text: Schools may have more after school sports.
System Description Model: Logistic regression model Probability: Only predict a missing hyphen error when the probability of the prediction is >0.99
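The high-confidence decision rule can be sketched as follows. The feature names, weights, and bias are hypothetical values chosen for illustration; only the logistic form and the 0.99 threshold come from the slide.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_missing_hyphen(features, weights, bias, threshold=0.99):
    """Return (flag, probability); flag is True only above the threshold."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    probability = sigmoid(score)
    return probability > threshold, probability

# Hypothetical weights and features, not taken from the paper.
toy_weights = {"bigram=after_school": 2.5, "next_pos=NOUN": 3.0}
toy_bias = -0.5
before_noun = {"bigram=after_school": 1.0, "next_pos=NOUN": 1.0}
elsewhere = {"bigram=after_school": 1.0}

flag_a, prob_a = predict_missing_hyphen(before_noun, toy_weights, toy_bias)
flag_b, prob_b = predict_missing_hyphen(elsewhere, toy_weights, toy_bias)
```

A threshold this high trades recall for precision: the system stays silent unless the model is nearly certain.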
System Description SJM-trained: - San Jose Mercury News corpus - For training, hyphenated words are automatically split (e.g. well-known becomes well known) - The training data contains 1% of the positive examples and 3% of the negative examples
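The automatic splitting step can be sketched as below. This is an assumed labeling scheme for illustration: each output token gets label 1 if a hyphen followed it in the original text and 0 otherwise. The function name and the scheme are not taken from the paper.

```python
def split_hyphenated(tokens):
    """Split hyphenated tokens and record where the hyphens were.

    Returns (split_tokens, labels), where labels[i] is 1 if a hyphen
    followed split_tokens[i] in the original text, else 0.
    """
    split_tokens, labels = [], []
    for token in tokens:
        parts = token.split("-")
        # Only split tokens made purely of alphabetic parts, e.g. "well-known".
        if len(parts) > 1 and all(part.isalpha() for part in parts):
            split_tokens.extend(parts)
            labels.extend([1] * (len(parts) - 1) + [0])
        else:
            split_tokens.append(token)
            labels.append(0)
    return split_tokens, labels

example_tokens, example_labels = split_hyphenated(
    ["a", "well-known", "author", "-", "went", "home"]
)
```

Positions labeled 1 serve as positive examples (a hyphen is missing after the split), and positions labeled 0 serve as negative examples.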
System Description Negative examples selected: Only contexts that occur more than 20 times are selected during training.
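The frequency cutoff for negative examples can be sketched as a simple count filter. The slide does not define "context"; this example assumes a context is a word pair, and the function name and toy data are hypothetical.

```python
from collections import Counter

def frequent_negatives(negative_contexts, min_count=20):
    """Keep only negative examples whose context occurs more than min_count times."""
    counts = Counter(negative_contexts)
    return [ctx for ctx in negative_contexts if counts[ctx] > min_count]

# Toy data: one context seen 21 times, another seen exactly 20 times.
toy_negatives = [("after", "school")] * 21 + [("the", "dog")] * 20
kept = frequent_negatives(toy_negatives)
```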
System Description Wiki-revision-trained: - Wikipedia articles
System Description Combined: - Combine both data sources
Evaluation • Artificial Data: - Brown corpus - 24,243 sentences - 2,072 hyphenated words
Evaluation Evaluation 1 • Learner Text: - CLC-FCE - The corpus contains 1,244 exam scripts - A total of 173 instances of missing hyphen errors
Evaluation An analysis of the 131 true positives for the learner data reveals that 87 of these are cases of a single type, the word “make-up”.
Evaluation Evaluation 2 • Learner Text: - A data set of 1,000 student GRE and TOEFL essays - Drawn from 295 prompts - Essays ranged in length from 1 to 50 sentences - Average of 378 words per essay
Evaluation Learner Text (Cont.): - Manually inspected a random sample of 100 instances where each system detected a missing hyphen - Two native English speakers judged the instances - Used the Chicago Manual of Style as a guide - High agreement
Conclusions 1) Automatically detecting missing hyphen errors in learner text 2) The classifiers generally performed better than the baseline systems 3) Taking context into account when detecting the errors is important.