A method for unsupervised broad-coverage lexical error detection and correction
4th Workshop on Innovative Uses of NLP for Building Educational Applications, NAACL, June 5, 2009
Nai-Lung Tsao and David Wible, National Central University, Taiwan
The Research Context: IWiLL Online Writing Platform (www.iwillnow.org)
• Supported since 2000 by the MOE and the Taipei Bureau of Education.
• IWiLL has been used in Taiwan by 455 schools, 2,804 teachers, 161,493 students, and 22,791 independent learners.
• Teachers have authored 9,429 web-based lessons with the system's authoring tool.
• The learner corpus (English TLC) has archived over 32,000 English essays: 5 million words of machine-readable running text written by Taiwan's learners using the IWiLL writing platform.
• 100,000 tokens of teacher comments on these student texts.
Second Language Learners' Error Detection and Correction
• Lexical and lexico-grammatical errors:
- an open-ended class
- driving teachers crazy
- either no rules involved, or rules of very limited productivity
Two Components to Our System
1. Target Language Knowledgebase: hybrid n-grams extracted from the BNC.
2. Edit Distance Algorithm: error detection/correction by comparing a user-produced string (e.g., 'on my opinion') against the hybrid n-grams.
1. The Target Language Knowledgebase: Hybrid N-grams (What, Why, and How)
What is a hybrid n-gram? An n-gram that admits items of different levels of representation.
- Traditional n-gram: 'in my opinion'
- Hybrid n-gram: 'in [dps] opinion' (where [dps] is a possessive-determiner slot)
Why use hybrid n-grams?
- Traditional n-grams and error precision:
True positive: 'enjoy to canoe' is unattested, so it is marked as an error.
False positive: 'enjoy canoeing' is also unattested, so it too is marked as an error.
- POS n-grams and recall:
From attested strings like 'enjoy hiking' or 'like watching' we could extract the POS gram V + VVg, but this would also accept 'hope exploring'.
How are hybrid n-grams extracted for the knowledgebase? See the next slide.
1. The Target Language Knowledgebase: How the hybrid n-grams are extracted
Four categories of information are available for each item in an n-gram. For the string 'enjoyed hiking':
- word form: enjoyed, hiking
- lexeme: enjoy, hike
- detailed POS: VVd, VVg
- rough POS: V, V
Some hybrid n-grams for 'enjoyed hiking': enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc. (A minimal generation sketch follows.)
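To make the extraction concrete, here is a minimal sketch in Python (not the authors' implementation) of how the hybrid n-grams for a tagged string can be enumerated as the Cartesian product of the four representation levels per slot; the token records and level names are illustrative assumptions.

from itertools import product

# Each token carries four levels of representation:
# word form, lexeme, detailed POS, and rough POS
# (VVd = past tense form, VVg = -ing form in the BNC's tagset).
TOKENS = [
    {"form": "enjoyed", "lexeme": "enjoy", "pos": "VVd", "rough": "V"},
    {"form": "hiking",  "lexeme": "hike",  "pos": "VVg", "rough": "V"},
]

LEVELS = ("form", "lexeme", "pos", "rough")

def hybrid_ngrams(tokens):
    """Enumerate every hybrid n-gram for the token sequence:
    one choice of representation level per slot."""
    per_slot = [[tok[level] for level in LEVELS] for tok in tokens]
    return list(product(*per_slot))

if __name__ == "__main__":
    for gram in hybrid_ngrams(TOKENS):
        print(" + ".join(gram))   # enjoyed + VVg, enjoy + hike, V + hiking, ...

For a two-word string this yields 4 x 4 = 16 hybrid bigrams, of which the slide lists a subset.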
2. The Edit Distance Algorithm: compares the user-produced string (e.g., 'on my opinion') against the hybrid n-grams in the target language knowledgebase for error detection/correction.
Edit Distance Component: Steps in measuring edit distance
1. Generate all hybrid n-grams from the learner input string (Set C).
2. a. Find all hybrid n-grams in the target language knowledgebase derivable from content words in the learner input string (Set S).
   b. Prune Set S using a filter factor, or coverage.
   c. Eliminate n-grams below a frequency threshold.
3. Rank candidates by the weighted edit distance between members of C and S.
We limit edit distance to substitution, so we limit the search to n-grams of the same length as the learner's input string. (A schematic sketch of these steps follows.)
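As a schematic sketch only, the three steps could be wired together as below; the helper functions are placeholders supplied by the caller, their names are not from the paper, and simplified versions of the generation, lookup, pruning, and weighting helpers are sketched alongside the relevant slides.

def suggest_corrections(learner_tokens, build_set_c, lookup_set_s, prune_set_s, distance, top_k=5):
    """Sketch of the pipeline; every helper is a caller-supplied placeholder."""
    # Step 1: all hybrid n-grams derivable from the learner string (Set C).
    set_c = build_set_c(learner_tokens)
    # Step 2a: knowledgebase hybrid n-grams built over the content words of the
    # learner string, restricted to the input's length (Set S), since edit
    # distance is limited to substitution.
    set_s = lookup_set_s(learner_tokens, length=len(learner_tokens))
    # Steps 2b/2c: prune Set S by subsumption coverage and by frequency.
    set_s = prune_set_s(set_s)
    # Step 3: rank candidates by weighted edit distance between members of C and S
    # (one plausible aggregation: score each s by its closest member of C).
    scored = [(s, min(distance(c, s) for c in set_c)) for s in set_s]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]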
Step 1 illustrated. Input from the learner: 'enjoyed hiking'. Hybrid n-grams generated from the learner string (Set C): enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.
Step 2a illustrated. From the content lexemes of the learner string 'enjoyed hiking' (enjoy and hike), retrieve from the target knowledgebase all hybrid n-grams built over those lexemes (Set S). Each slot of a retrieved n-gram may be specified at any of the four levels: word form (enjoyed, hiking), lexeme (enjoy, hike), detailed POS (VVd, VVg), or rough POS (V). (An illustrative index structure for this lookup is sketched below.)
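One plausible way to support the Step 2a retrieval, purely as an illustration (this data structure is not described in the paper): index the knowledgebase's hybrid n-grams by the content lexemes of the corpus strings they were extracted from, and filter by length at lookup time, since the substitution-only edit distance requires equal-length strings.

from collections import defaultdict

class HybridNgramKB:
    """Toy knowledgebase: hybrid n-grams with corpus frequencies,
    indexed by content lexeme for Step 2a retrieval."""

    def __init__(self):
        self.freq = defaultdict(int)        # hybrid n-gram -> corpus frequency
        self.by_lexeme = defaultdict(set)   # content lexeme -> hybrid n-grams

    def add(self, ngram, lexemes, count=1):
        ngram = tuple(ngram)
        self.freq[ngram] += count
        for lex in lexemes:
            self.by_lexeme[lex].add(ngram)

    def lookup(self, content_lexemes, length):
        """Set S: knowledgebase n-grams of the required length built over
        any of the learner string's content lexemes."""
        hits = set()
        for lex in content_lexemes:
            hits |= {g for g in self.by_lexeme[lex] if len(g) == length}
        return hits

# Toy entries, as if extracted from BNC strings such as 'enjoyed hiking'.
kb = HybridNgramKB()
kb.add(("enjoy", "VVg"), lexemes=["enjoy"], count=80)
kb.add(("enjoy", "V"),   lexemes=["enjoy"], count=100)
print(kb.lookup(["enjoy", "hike"], length=2))   # both enjoy-based bigrams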
Pruning Set S of Candidates
We prune a subsuming hybrid n-gram when a subsumed one accounts for 80% or more of its tokens. For example, if enjoy + V has 100 tokens and the subsumed enjoy + VVg accounts for 80 of them, enjoy + V is pruned and enjoy + VVg is kept.
Pruning of the knowledgebase will affect error recall. The remaining Set S is then filtered by the frequency of its member hybrid n-grams. (A minimal sketch of this pruning step follows.)
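A minimal sketch of the pruning step under the stated 80% coverage threshold; the subsumption test and the frequency cutoff below are caller-supplied assumptions, not values from the paper.

COVERAGE = 0.8   # a subsumed n-gram must cover >= 80% of the subsuming one
MIN_FREQ = 5     # illustrative frequency threshold; the paper's value is not given here

def prune(freq, subsumes, coverage=COVERAGE, min_freq=MIN_FREQ):
    """Prune Set S. `freq` maps each hybrid n-gram to its corpus frequency;
    `subsumes(general, specific)` is True when `general` is a more abstract
    hybrid n-gram covering `specific` (e.g. enjoy + V subsumes enjoy + VVg)."""
    kept = set(freq)
    for general in freq:
        for specific in freq:
            if general == specific or not subsumes(general, specific):
                continue
            if freq[specific] >= coverage * freq[general]:
                kept.discard(general)   # drop enjoy + V (100) when enjoy + VVg has 80
                break
    return {g for g in kept if freq[g] >= min_freq}

# Toy example matching the slide's numbers.
freq = {("enjoy", "V"): 100, ("enjoy", "VVg"): 80}
subsumes = lambda g, s: g == ("enjoy", "V") and s == ("enjoy", "VVg")
print(prune(freq, subsumes))   # {('enjoy', 'VVg')}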
Weighting of Edit Distance
Learner string: 'enjoyed to hike'. Generate Set C of hybrid n-grams from the learner string and Set S of hybrid n-grams from the knowledgebase.
• Distance = 1: string c and string s are identical but for one slot. Correction candidates are those with a distance of 1 or lower.
• Ranking of candidates at distance 1 from the learner string (candidates here include enjoyed hiking, enjoyed hike, enjoy + VVg, VVd + hiking, V + hiking, VVd + hike, enjoy learning):
- If the differing element is the same lexeme but a different word form, the candidate is closer than one with a different lexeme.
- If the differing element has the same rough POS but a different detailed POS, the candidate is closer than one with a different rough POS.
(A sketch of such a weighting scheme follows.)
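A minimal sketch of the weighting idea, with illustrative costs that are not the paper's (the actual weights are slated for machine learning in Future Work). For simplicity it compares slots as full token records carrying all four levels, whereas the paper compares hybrid n-grams directly.

# Per-slot substitution cost: smaller = closer to the learner's string.
COST = {
    "form":   0.0,    # same word form: no change in this slot
    "lexeme": 0.25,   # same lexeme, different word form   (hike vs hiking)
    "pos":    0.5,    # different lexeme, same detailed POS
    "rough":  0.75,   # same rough POS only                (VVd vs VVg is still V)
}
COST_MAX = 1.0        # different lexeme and different rough POS

def slot_cost(learner_tok, candidate_tok):
    """Cost of one slot: charge the cost of the most specific level shared."""
    for level in ("form", "lexeme", "pos", "rough"):
        if learner_tok[level] == candidate_tok[level]:
            return COST[level]
    return COST_MAX

def weighted_distance(learner_toks, candidate_toks):
    """Substitution-only weighted edit distance over equal-length strings;
    candidates at distance <= 1 are offered as corrections, ranked by distance."""
    assert len(learner_toks) == len(candidate_toks)
    return sum(slot_cost(a, b) for a, b in zip(learner_toks, candidate_toks))

# Toy bigram example: a candidate differing only in the word form of the same
# lexeme (hiking for hike) outranks one with a different lexeme (learning).
enjoyed  = {"form": "enjoyed",  "lexeme": "enjoy", "pos": "VVd", "rough": "V"}
hike     = {"form": "hike",     "lexeme": "hike",  "pos": "VVi", "rough": "V"}
hiking   = {"form": "hiking",   "lexeme": "hike",  "pos": "VVg", "rough": "V"}
learning = {"form": "learning", "lexeme": "learn", "pos": "VVg", "rough": "V"}
print(weighted_distance([enjoyed, hike], [enjoyed, hiking]))    # 0.25
print(weighted_distance([enjoyed, hike], [enjoyed, learning]))  # 0.75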
Examples 1 C-selection
• Enjoy to swim > enjoy swimming
• Enjoy to shop > enjoy shopping
• Enjoy to canoe > enjoy canoeing
• Enjoy to learn > *need to learn; ?want to learn; enjoy learning
• Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
• Hope finding > hope to find
• Let us to know > let us know
• Get used to say > *get used to; *have used to say
Collocation with C-selection
• Spend time to fix > spend time fixing; take time to fix
• Take time fixing > take time to fix
• Take time recuperating > take time to recuperate
• Spend time to recuperate > spend time recuperating; take time to recuperate
Examples 2 Preposition Fixed expressions: • On the outset > At the outset • In different reasons > For different reasons • In that time > at that time; by that time • On that time > at that time; by that time • On my opinion > in my opinion • In my point of view > from my point of view • I am interested of > I am interested in • She is interested of > she is interested in • I am interesting in > I am interested in • She is interesting in > She is interested in • Just on the time when > just at the time when; *just to the time when
Examples 3 Preposition/Particle: Verb + preposition (particle) • Discuss to each other > *discussing to each other (should be discuss WITH each other) • Discuss this to them > discuss this with them • Waited to her > waited for her • Waited to them > waited for them Noun + preposition • His admiration to > his admiration for • His accomplishment on > * No suggestion • The opposite side to > the opposite side of • A crisis on > a crisis of; a crisis in • A crisis on his work > a crisis of his work (*a crisis on his work)
Examples 4 Content Word Choice
• Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
• Frame of mood > ??change of mood; frame of mind; *frame of reference
Examples 5 Morpho-syntactic
• She will ran > She will run
• She will runs > She will run
Pronoun case:
• What made she change > *what made she change (no correction; should be made HER change)
Noun countability or number errors:
• In modern time > in modern times
Number agreement between head noun and determiner:
• Too much people > too many people
• So much things > so many things
• So many thing > so many things
• One of the man > one of the men
• One of the problem > one of the problems
• In my opinions > in my opinion
• A lot of problem > a lot of problems
Complementizer selection:
• I wonder that > I wonder if; I wonder whether
Future Work
• Improve POS tagging using a second-order model
• Machine learning of the weights for the various features determining edit distance
• Incorporate the system into our IWiLL online writing environment
• Incorporate mutual information (MI) scores for the knowledgebase's hybrid n-grams