
DJ Hovermale 11 June 2010

CALICO 2010. An analysis of the spelling errors of L2 English learners. DJ Hovermale, 11 June 2010.


Presentation Transcript


  1. CALICO 2010: An analysis of the spelling errors of L2 English learners. DJ Hovermale, 11 June 2010

  2. Presentation Overview

  3. Introduction

  4. Introduction
     • In 1997 there were 375 million native English speakers and 750 million people who spoke English as a second language (Crystal 1997).
     • NLP tools for these English Language Learners (ELLs) have been on the rise for the past 10 years.
     • Among these tools are many which are intended for Intelligent Computer-Aided Language Learning (ICALL). These tools provide not only error diagnosis, but also appropriate feedback to their users.
     • These NLP tools often rely on the output of parsers, part-of-speech taggers, content matching modules, etc.
     • When learner input contains spelling errors, NLP tools are unable to properly process the input, and they therefore cannot provide the output needed for ICALL modules to work.
     • De Felice (2008) analyzes preposition usage by context, a process which is thwarted by misspellings.
     • Nagata’s (2002) Robo-Sensei tutor flags words that are not in its lexicon and asks users to correct the spelling errors themselves.

  5. Introduction: Conventional Spelling Correction Programs
     Don’t we have spell checkers? Why not use them?
     • HELC-2: approximately 2400 unique sentences; about 500 spelling errors; 30 real-word errors
     • MSWord 2007: all errors 65%; non-word 72%; real-word 36%
     • ASPELL: all errors 62%; non-word 81%; real-word 0%

  6. Introduction
     How do researchers develop ICALL tools if the input has spelling errors?
     • They correct the errors by hand and send the input on for further processing by the ICALL modules
     • They assume that the user input will somehow become free of spelling errors
     Why don’t they make their own spelling correction program?
     • No annotated corpora available for development/evaluation
     • No standard for error annotation, so they can’t combine resources
     • The ICALL people want diagnosis and feedback, not just a list of suggestions
     • Need to know about Second Language Acquisition (SLA)
     • Might need to know about phonetics/phonology of individual languages
     • SOMEBODY ELSE’S PROBLEM

  7. Spellchecking 101

  8. How does spelling correction work?
     There are really two problems:
     • Detection: determining that a spelling error has been made
     • Correction: suggesting the correct (target) word to the user

  9. How does spelling correction work?
     There are two main classes of errors:
     • Non-word errors: “words” (specific character sequences) which are not in the “dictionary” (a list of acceptable words)
       - He refure to welcome the other man.
       - Is it really, you go to Rondon to study?
       - I felt someone tach my shoulder.
     • Real-word errors: errors which are errors only in the given context (context sensitive)
       - Do you know there phone number?
       - I felt someone torch my shoulder.
       - I heard him rock the door.

  10. How does spelling correction work?
     Non-word errors, detection: check each “word” (character sequence) to make sure it is in the “dictionary” (word list)
     Word list: a able about account acid across act addition adjustment advertisement … harmony hat hate have he head healthy hear hearing … refurbish refurbishment refurnish refusal refuse refutable refutation refute refuter …
     Example: He refure to welcome the other man. (“refure” is not in the word list, so it is flagged)
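The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular spell checker: the tiny `WORD_LIST` and the function name `find_non_words` are assumptions made for the example.

```python
import re

# Toy word list standing in for a real dictionary (an assumption for this sketch).
WORD_LIST = {
    "he", "refusal", "refuse", "refute", "to", "welcome", "the", "other", "man",
}

def find_non_words(sentence, word_list=WORD_LIST):
    """Return the tokens of `sentence` that do not appear in `word_list`."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [token for token in tokens if token.lower() not in word_list]

print(find_non_words("He refure to welcome the other man."))
```

Anything the list does not contain is flagged, which is why this approach cannot see real-word errors such as "herd" for "heard".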

  11. How does spelling correction work?
     Non-word errors, correction: recommend a word from the list
     Example: He refure to welcome the other man.
     Word list: … refurbish refurbishment refurnish refusal refuse refutable refutation refute refuter …
     How do we choose which word to suggest? Edit Distance!!

  12. How does spelling correction work?
     Edit Distance
     We compute how many steps it would take to transform the misspelling into one of the words in the list. We then suggest the words that take the fewest operations.
     Available operations are:
     - Insertion (add a letter)
     - Deletion (remove a letter)
     - Substitution (change one letter for another)
     - Transposition (two adjacent letters switch positions)
     Example for “refure”:
     • refuge: 1 substitution
     • refuse: 1 substitution
     • refute: 1 substitution
     • banana: 6 substitutions
     Suggestion list: refuge, refuse, refute, refugee, referee (followed by Ignore All / Add to Dictionary)
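The four operations above can be counted with a standard dynamic-programming table. The sketch below uses the optimal-string-alignment variant of Damerau-Levenshtein distance with every operation costing 1; the slides do not say which exact variant the spell checkers use, so treat this as one reasonable reading. The function name `edit_distance` is chosen for the example.

```python
def edit_distance(source, target):
    """Number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn `source` into `target` (each costs 1)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i letters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j letters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]

for word in ["refuge", "refuse", "refute", "refugee", "referee", "banana"]:
    print(word, edit_distance("refure", word))
```

Sorting the word list by this distance gives the kind of suggestion list shown on the slide, with "banana" far down at distance 6.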

  13. How does spelling correction work?
     Edit Distance
     Pollock and Zamora 1983, 1984
     - studied 50,000 English spelling errors
     - found that 94% of errors had only one mistake in them
     - only 3.3% of the errors in first letter
     - mistakes were minimal deviations from the target word
     - 34% of all errors were omissions
     - 23% of errors occur in third letter
     Yannakoudakis and Fawthrop 1983b
     - first-letter error rate of 1.5%
     Gentner et al. 1983
     - 58% of the 3800 substitution errors involved adjacent typewriter keys
     Example: words one operation away from “det”
     • 1 substitution: bet, wet, pet, dot, set, let
     • 1 deletion: et
     • 1 insertion: debt, diet, deft, dent, duet

  14. How does spelling correction work?
     Edit Distance
     • first-letter errors heavily penalized
     • low cost for vowel substitutions
     • lower cost for omissions
     • lower cost for third-letter errors
     • low frequency words cost more
     Example: candidates for “det”, all one operation away
     • 1 substitution: bet, wet, pet, dot, set, let
     • 1 deletion: et
     • 1 insertion: debt, diet, deft, dent, duet
     Suggestion lists shown as the heuristics are applied (each followed by Ignore All / Add to Dictionary):
     • debt, diet, duet, deft, dent, dot
     • dot, debt, deft, dent, diet, duet
     • dot, debt, deft, dent, diet
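A weighted ranking in the spirit of these heuristics might look like the following sketch. The numeric weights, the `RARE_WORDS` set, and the function names are all invented for illustration; the slides give only the direction of each adjustment, not the values any real spell checker uses.

```python
VOWELS = set("aeiou")
RARE_WORDS = {"duet"}  # hypothetical stand-in for a word-frequency table

def single_edit(misspelling, candidate):
    """Return (operation, position) if one insertion, deletion, or substitution
    turns `misspelling` into `candidate`; otherwise None. Position is 1-based
    and marks the first character where the two words differ."""
    i = 0
    while (i < min(len(misspelling), len(candidate))
           and misspelling[i] == candidate[i]):
        i += 1
    if len(candidate) == len(misspelling):
        if misspelling != candidate and misspelling[i + 1:] == candidate[i + 1:]:
            return ("substitution", i + 1)
    elif len(candidate) == len(misspelling) + 1:
        if misspelling[i:] == candidate[i + 1:]:
            return ("insertion", i + 1)
    elif len(candidate) + 1 == len(misspelling):
        if misspelling[i + 1:] == candidate[i:]:
            return ("deletion", i + 1)
    return None

def weighted_cost(misspelling, candidate):
    """Cost of suggesting `candidate`; every weight here is an assumption."""
    edit = single_edit(misspelling, candidate)
    if edit is None:
        return float("inf")
    operation, position = edit
    cost = 1.0
    if position == 1:
        cost += 2.0   # first-letter errors heavily penalized
    if (operation == "substitution"
            and misspelling[position - 1] in VOWELS
            and candidate[position - 1] in VOWELS):
        cost -= 0.5   # low cost for vowel substitutions
    if operation == "insertion":
        cost -= 0.25  # the writer omitted a letter, so the fix is an insertion
    if position == 3:
        cost -= 0.1   # lower cost for third-letter errors
    if candidate in RARE_WORDS:
        cost += 0.5   # low frequency words cost more
    return cost

candidates = ["bet", "wet", "pet", "dot", "set", "let", "et",
              "debt", "diet", "deft", "dent", "duet"]
ranked = sorted(candidates, key=lambda word: (weighted_cost("det", word), word))
print(ranked)
```

With these made-up weights the candidates that change the first letter sink to the bottom and "dot" rises to the top, which is the general effect the slide describes.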

  15. How does spelling correction work?
     Real-word errors, detection: check each “word” (character sequence) to make sure it is in the “dictionary” (word list)? NO!!
     Word list: … hear hearing heard heckle hem hemp herd hero heron herons …
     Example: I herd him lock the door. (“herd” is in the word list, so nothing is flagged)

  16. How does spelling correction work?
     Real-word errors, detection: how can we find them then?
     Golding and Roth, 1999
     - determine which words are commonly confused with each other
     - create confusion sets from these words (they had 21 in this study)
     - if we find a word that is in one of these sets, check to see if any other word in that set is more likely
     - if another word from the set is more likely in the given context, flag the current word as an error
     - suggest the words in the confusion set with higher likelihood as corrections
     Confusion sets: … {he’ll, heal, heel}, {heard, herd} … {there, their, they’re}, {to, two, too} …
     Example: I herd him lock the door.
     • I herd him: .01741
     • I heard him: .98259 (suggested as the correction)
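The confusion-set check can be sketched as follows. A hard-coded lookup table stands in for the context model here, which is a simplification and not a description of Golding and Roth's classifier; only the two "herd"/"heard" likelihoods come from the slide, and the names `CONFUSION_SETS`, `CONTEXT_LIKELIHOOD`, and `suggest_real_word_corrections` are invented for the example.

```python
CONFUSION_SETS = [
    {"he'll", "heal", "heel"},
    {"heard", "herd"},
    {"there", "their", "they're"},
    {"to", "two", "too"},
]

# Toy context model: likelihood of the middle word given its neighbours.
# The two values are the ones shown on the slide; a real system would
# compute such scores from a trained model.
CONTEXT_LIKELIHOOD = {
    ("I", "herd", "him"): 0.01741,
    ("I", "heard", "him"): 0.98259,
}

def suggest_real_word_corrections(left, word, right, scores=CONTEXT_LIKELIHOOD):
    """Return confusion-set alternatives that are more likely than `word`
    in the context (left, right), most likely first. An empty list means
    the word is not flagged."""
    for confusion_set in CONFUSION_SETS:
        if word in confusion_set:
            current = scores.get((left, word, right), 0.0)
            better = [
                (scores.get((left, alternative, right), 0.0), alternative)
                for alternative in confusion_set - {word}
            ]
            better = [pair for pair in better if pair[0] > current]
            return [alternative for _, alternative in sorted(better, reverse=True)]
    return []

print(suggest_real_word_corrections("I", "herd", "him"))
```

Words outside every confusion set are never examined, so coverage depends entirely on which sets have been built.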

  17. Hypotheses, Approach & Results

  18. Hypothesis
     Why do conventional spelling correction programs perform so poorly on learner English?
     • MSWord 2007: all errors 65%; non-word 72%; real-word 36%
     • ASPELL: all errors 62%; non-word 81%; real-word 0%
     Hypothesis:
     • Learners make different errors than native speakers do
     • Conventional spell checkers are geared toward native speakers
     • Learner errors do not follow the underlying assumptions of heuristics used by spell checkers

  19. Hypotheses
     Native Speaker Errors:
     • 94% of errors had only one mistake in them
     • 1.5-3.3% of the errors in first letter
     • mistakes are minimal deviations from target word
     • 34% of all errors were omissions
     • 23% of errors occur in third position
     English Learner Errors:
     • Contain multiple errors much more frequently
     • Have more first letter errors
     • Mistakes deviate greatly from the target word
     • Majority of errors will be substitutions
     • Mistakes are equally likely everywhere in the word

  20. The HELC (Hiroshima English Learners’ Corpus, Miura 1998) contains 3600 sentences from high-school-age native Japanese-speaking ELLs. These sentences were taken from translation tasks. About 500 spelling errors with clear target (about 300 unique).
     Target: Can you smell something burning?
     • Are you burning anyone?
     • Can you smell something is bern
     • Do you feel something is burning?
     • Do you feel that something burning smell.
     • Do you smell anything fring
     • Do you smell of anything smoked.
     • Do you smell anyone is boreing?
     • Do you feel the smell which anything is burnt?
     Target: I felt someone touch my shoulder.
     • I feet who tatch my sholder.
     • I felt somebody attacked me.
     • I felt someone fitting my eye.
     • I felt someone put my shoulder.
     • I felt who touch my door.
     • I find someone cominicate my sholder.
     • I find who touch my shorder.
     • I felt that someone touched my siolder.
     • I felt that someone toughed my shoulder.
     • I feeled that who was tuch my shoulder.

  21. Asao Kojiro Learner Corpus
     The AKLC (Kojiro 2006) is a corpus of learner English errors produced by English learners; it was collected and distributed by Asao Kojiro. In total there were about 650 non-word errors for which a target word was obvious, with about 450 of these being unique. There were about 60 errors for which an English target could not be determined even with context, the majority of which were direct transliterations of Japanese words.
     The current study analyzes four of the subcorpora, totaling 277 essays:
     • Traveling and Gardening – 123 essays
     • Cooking and Gardening – 42 essays
     • How far did the kite go? – 78 essays
     • Momotaro – 34 essays
     Sample learner text: There is a small garden in my house. There are azisai and several kind of trees. And my father planted tomatoes and nasubi this year. I want to plant some flowers such as asagao and himawari, but it is so troublesome that I don't plant nothing. Also I hate "mushi" it is the reason that I don't want to gardening. But in my room, there are some flowers with a vase.

  22. Approach
     Examine the spelling errors in the corpus, keeping track of:
     • The error itself
     • The target word
     • Number of times the error appears
     • Position of error (or pos. of first error)
     • Number of operations to the target
     • If only one operation – which one?
     • If only one operation – which letters?
     • If MS Word 2007 offers target in list
     • If offered in list, which position in list?
     Since the word list is not used for real-word error correction, the study looks only at non-word errors.
     For some errors the target word will not be obvious; these will be removed from the study.
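One way to hold the fields listed above is a small record type. The sketch below is an assumption about how such an annotation could be laid out, not the author's actual format; the class and field names are invented, the count is a placeholder, and fields the slides give no data for are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SpellingErrorRecord:
    error: str                      # the error itself
    target: str                     # the target word
    count: int                      # number of times the error appears
    first_error_position: int       # 1-based position of the (first) error
    operations_to_target: int       # edit operations needed to reach the target
    single_operation: Optional[str] = None          # which one, if only one
    letters: Optional[Tuple[str, str]] = None       # (written, intended), if only one
    offered_by_ms_word: Optional[bool] = None       # is the target in Word's list?
    rank_in_ms_word_list: Optional[int] = None      # position in that list, if offered

def first_error_position(error, target):
    """1-based position of the first character where the two words differ."""
    for position, (written, intended) in enumerate(zip(error, target), start=1):
        if written != intended:
            return position
    return min(len(error), len(target)) + 1

# "sholder" for "shoulder" appears in the HELC sample sentences on slide 20.
record = SpellingErrorRecord(
    error="sholder",
    target="shoulder",
    count=1,  # placeholder, not a corpus count
    first_error_position=first_error_position("sholder", "shoulder"),
    operations_to_target=1,
    single_operation="insertion",
    letters=("", "u"),  # nothing was written where "u" was intended
)
print(record)
```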

  23. Results – Operations to target word
     • 1 Operation*: 889/1193 errors = 74.5%
     • 2 Operations: 211/1193 errors = 17.7%
     • 3 Operations: 66/1193 errors = 5.5%
     • >3 Operations: 27/1193 errors = 2.3%
     * This includes omitted spaces, without which the results would be more dramatic.
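The percentages on this slide follow directly from the counts. The short check below recomputes them; the variable names are chosen for the example.

```python
# Counts of errors by number of operations to the target word (from the slide).
operation_counts = {"1": 889, "2": 211, "3": 66, ">3": 27}

total = sum(operation_counts.values())
shares = {
    operations: round(100 * count / total, 1)
    for operations, count in operation_counts.items()
}
multi_operation_share = round(
    100 * (total - operation_counts["1"]) / total, 1
)

print(total)                  # number of errors analysed
print(shares)                 # percentage per category
print(multi_operation_share)  # share of errors needing more than one operation
```

About a quarter of the learner errors need more than one operation, compared with the 6% implied by Pollock and Zamora's 94% single-mistake figure for native speakers.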

  24. Hypothesis
     Native Speaker Errors:
     • 94% of errors had only one mistake in them
     • 34% of all errors were omissions
     English Learner Errors:
     • Contain multiple errors much more frequently. Result: only 74.5% of errors have one mistake in them
     • Majority of errors will be substitutions. Result: omission just as likely as substitution

  25. Results – Error Position
     • Position 1: 63/1193 errors = 5.2%
     • Position 2: 270/1193 errors = 22.6%
     • Position 3: 179/1193 errors = 15.0%
     • Position 4: 257/1193 errors = 21.5%
     • Position 5: 205/1193 errors = 17.2%
     • Position 6: 115/1193 errors = 9.6%
     • Position 7: 63/1193 errors = 5.2%
     • > Position 7: 41/1193 errors = 3.4%

  26. Hypotheses
     Native Speaker Errors:
     • 1.5-3.3% of the errors in first letter
     • 23% of errors occur in third position
     English Learner Errors:
     • Have more first letter errors. Result: 5.2% of errors in first position
     • Mistakes are equally likely everywhere in the word. Result: positions 2-6 are hot spots

  27. Conclusion & Future Work

  28. Hypotheses
     Native Speaker Errors:
     • 94% of errors had only one mistake in them
     • 1.5-3.3% of the errors in first letter
     • mistakes are minimal deviations from target word
     • 34% of all errors were omissions
     • 23% of errors occur in third position
     English Learner Errors:
     • Contain multiple errors much more frequently
     • Have more first letter errors
     • Mistakes deviate greatly from the target word
     • Majority of errors will be substitutions
     • Mistakes are equally likely everywhere in the word

  29. Conclusions and Future Work
     Conclusions:
     • The hypothesis that conventional spelling correction programs perform poorly on learner English because of the heuristics they employ is plausible
     • JLE spelling errors DO, in fact, differ in predictable ways from those of native English speakers
     • I looked at 1200 errors; Pollock and Zamora looked at 50,000. I should look at more
     • I looked at Japanese Learners of English. Will these patterns hold for other learner populations?
     Future Work:
     • Creating a spelling correction program which adjusts these heuristics to learner English might have better results
     • If we can make an effective error model then we might be able to increase spellchecker performance on JLE text
     • Create a corpus of annotated learner English errors for use in developing ELL spell checker (ICLE)
     • Is it worthwhile to customize heuristics for each learner population or better to just have one ELL spellchecker?

  30. MS Word 2007 performance by Operations to Target

  31. MS Word 2007 performance by Error Position
