{Cute System Name Here} An L1 Specific English Spell Checker for EFL Learners DJ Hovermale 27 October 2006
The Problem For non-native speakers of English, the task of writing in English can be daunting. English has many complicated grammar rules and spelling challenges that can intimidate non-native writers. Proofing tools such as grammar checkers and style checkers offer some protection from these hazards. However, the problems that non-native writers encounter with proofing tools are well known (Tschichold 1999, Fandrych 2001).
The Problem Another proofing tool, the spell checker, has been very successful for native speakers; however, as Rimrott (2005) explains, “…generic spell checkers are generally geared towards typical native speakers who mostly make accidental typographical mistakes.”
The Problem Gupta (1998) points out that far less attention has been paid to spell checkers than to other proofing methods. There has been a considerable amount of research in the field of EFL (English as a Foreign Language) on errors made by non-native English speakers. Many of the advances in understanding brought on by this research, such as the study of language transfer errors (Odlin 1989), have not been applied to spell checking.
The Problem The insight behind this proposal is to incorporate explicit learner population modeling and transfer error modeling into contextual spell checking for learners of English as a foreign language.
Types of Errors This modeling will be applied in four stages to address the four major types of spelling errors that I identified in a corpus of writing samples from Japanese learners of English. The sentence below was constructed to exhibit each of these four types of spelling errors. The intended sentence follows.
We eated(1) our runch(2) buy(3) the liver(4).
We ate(1) our lunch(2) by(3) the river(4).
Types of Errors In discussing these four types of errors we must recognize the difference between non-word errors (misspellings which result in ‘words’ which are not in the spell checker's word list) and real word errors (the results of which are words in the word list). The fundamental difference between these two types of errors is in the spell checker’s ability to detect them.
Type-1 Errors The first type of error that we encounter is a non-word error. 'Eated' is not a word, so conventional spell checkers have no trouble detecting this mistake. However, they provide suggestions like 'elated' and 'eased', which are not even close to the intended word 'ate'. These spell checkers use edit distance (Levenshtein 1966) to determine which words in the dictionary to suggest as corrections, but they do not consider the mistakes made by non-native speakers.
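To make the edit-distance point concrete, here is a minimal sketch of the standard unit-cost Levenshtein distance. The four-word list is a toy assumption for illustration, not the dictionary of any real spell checker.

```python
def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance: insert, delete, and substitute each cost 1."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Toy word list (assumption): a real checker would use a full dictionary.
WORD_LIST = ["elated", "eased", "eaten", "ate"]

# Rank candidate corrections for the misspelling by plain edit distance.
ranked = sorted(WORD_LIST, key=lambda word: levenshtein("eated", word))
```

Under this measure 'elated' and 'eased' are each one edit away from 'eated', while the intended 'ate' is two edits away, so a ranking by plain edit distance puts the right answer last.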
Type-1 Errors 'Eated' is a morphological overgeneration formed by applying the regular rule for the past-tense morpheme to a verb that has an irregular past-tense form. Since words with irregular forms are a closed class, we can tackle this issue by anticipating what the regular form of these words (verbs, plurals, comparatives, superlatives, etc.) would be according to the regular rules, training the program to recognize these forms as Type-1 errors, and suggesting the correct irregular form.
Type-1 Errors This could also be used for other Type-1 errors such as ‘foots’, ‘gooder’, and so on, which are either not detected (‘foots’) or wrongly corrected (‘gooder’ is corrected to ‘goober’).
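The Type-1 idea can be sketched as a lookup table built by applying regular rules to a closed list of irregular words. Everything named here (`IRREGULAR_FORMS`, `regularize`, `suggest_type1`) is hypothetical, the rules are deliberately simplified, and the three entries stand in for a full list of irregular forms.

```python
# Closed class of irregular forms (assumption: three sample entries only).
IRREGULAR_FORMS = {
    ("eat", "past"): "ate",
    ("foot", "plural"): "feet",
    ("good", "comparative"): "better",
}


def regularize(base: str, feature: str) -> str:
    """Apply a simplified regular rule to produce the overgenerated form."""
    if feature == "past":
        return base + ("d" if base.endswith("e") else "ed")
    if feature == "plural":
        return base + "s"
    if feature == "comparative":
        return base + "er"
    raise ValueError(f"unknown feature: {feature}")


# Map each anticipated overgeneration to the correct irregular form.
OVERGENERATIONS = {
    regularize(base, feature): correct
    for (base, feature), correct in IRREGULAR_FORMS.items()
}


def suggest_type1(word: str):
    """Return the irregular form if word is an anticipated overgeneration."""
    return OVERGENERATIONS.get(word.lower())
```

With this table, `suggest_type1("eated")` yields 'ate' and `suggest_type1("foots")` yields 'feet', while a word outside the table returns `None` and falls through to the other stages.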
Type-2 Errors The second type of error is also a non-word error, but these are slightly more difficult to anticipate than Type-1 errors. Type-2 errors are non-word errors due to learner confusion or language transfer. In the case of ‘runch’, conventional spell checkers would make suggestions such as ‘ranch’ and ‘raunchy’, when the intended word is 'lunch'.
Type-2 Errors This can be addressed by adding an L1 specific, character-based confusion matrix based on the phonological confusion matrices used in approaches to correcting EFL learners' accents (Berkling 1998). This would inform the system, for instance, that Japanese learners of English often confuse the phonemes /l/ and /r/ (and therefore the characters ‘l’ and ‘r’ when writing), and this could be added to the edit distance algorithm. Armed with this information the system will easily be able to provide ‘lunch’ as a suggested correction for ‘runch’.
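One way to fold such a confusion matrix into edit distance is to discount substitutions between confusable characters. The sketch below assumes a single l/r entry and an arbitrary discounted cost of 0.2; a real matrix would be estimated from learner data, and the word list is again a toy.

```python
# L1-specific confusion costs (assumption: one pair, cost chosen arbitrarily).
CONFUSION_COST = {frozenset(("l", "r")): 0.2}


def substitution_cost(a: str, b: str) -> float:
    """0 for a match, a discount for confusable pairs, 1 otherwise."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)


def weighted_distance(source: str, target: str) -> float:
    """Edit distance with confusion-aware substitution costs."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,      # delete s_char
                current[j - 1] + 1.0,   # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char),
            ))
        previous = current
    return previous[-1]


# Toy word list (assumption) containing the suggestions named on the slide.
WORD_LIST = ["ranch", "raunchy", "lunch"]
ranked = sorted(WORD_LIST, key=lambda word: weighted_distance("runch", word))
```

Here 'lunch' costs 0.2, 'ranch' costs 1.0, and 'raunchy' costs 2.0, so the intended word moves to the top of the suggestion list.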
Type-2 Errors Whereas Type-1 and Type-2 errors can be detected by conventional spell checkers, Type-3 and Type-4 errors cannot. Once we can identify them as potential errors, we can apply the methods used on Type-2 errors to correct spelling mistakes of the last two classes.
Type-3,4 Errors Our task now changes from determining a correct suggestion for a spelling error to detecting real word spelling errors in the text. This task is divided between two types of errors: Type-3 and Type-4.
Type-3 Errors A Type-3 error is a mistake that results in a real word, but that word is not the same part of speech as the target word. In the constructed sentence above, the word ‘buy’ is not a preposition, but the target word, ‘by’, is.
Type-3 Errors I can identify this part-of-speech discrepancy by using a modified HMM part-of-speech (POS) tagger. I will modify the tagger by weighting the transition probabilities more heavily until I detect a potential error, and then use edit distance on the emission probabilities of the given POS (in this case prepositions) to suggest the most likely candidate for correction.
Type-4 Errors The final type of error presents the biggest challenge for detection and correction. Although Type-3 and Type-4 errors are similar, Type-4 errors differ in that the misspelled word is not only a real word error, but also has the same part of speech as the target word. This means that the POS detection method is of no use.
Type-4 Errors To solve this problem I turn to context-sensitive spelling correction (Golding and Roth 1999), which employs statistics to detect mistakes such as homophone errors (rain/reign), typographic errors (post/pots), and errors that cross word boundaries (maybe/may be).
Type-4 Errors Although previous context-based spell checking focused on closed classes such as those mentioned above, I can extend this method to transfer errors such as Type-4 (which is an open class) by using learner population modeling to restrict the class to a manageable size.
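A rough sketch of the restricted-class idea: use the learner model (here, only l/r confusion) to generate a small confusion set for each real word, then let the surrounding context choose among the members. The co-occurrence counts are made-up placeholders rather than measured data, and simple count summing is swapped in for the statistical classifier of Golding and Roth; all names are hypothetical.

```python
from itertools import product

# Learner population model (assumption): 'l' and 'r' are interchangeable.
SWAPS = {"l": "lr", "r": "lr"}

# Toy word list (assumption).
WORD_LIST = {"we", "ate", "our", "lunch", "by", "the", "liver", "river"}

# Placeholder (candidate, context word) counts, invented for illustration.
TOY_COUNTS = {
    ("river", "by"): 8, ("river", "the"): 20,
    ("liver", "by"): 1, ("liver", "the"): 15,
}


def confusion_set(word: str) -> set:
    """All l/r variants of word that are themselves real words."""
    options = [SWAPS.get(ch, ch) for ch in word]
    return {"".join(chars) for chars in product(*options)} & WORD_LIST


def best_candidate(tokens: list, index: int, window: int = 2) -> str:
    """Pick the confusion-set member that best fits the nearby words."""
    context = tokens[max(0, index - window):index]
    context += tokens[index + 1:index + 1 + window]
    candidates = confusion_set(tokens[index])
    if not candidates:
        return tokens[index]

    def score(candidate: str) -> int:
        return sum(TOY_COUNTS.get((candidate, w), 0) for w in context)

    return max(sorted(candidates), key=score)


tokens = "we ate our lunch by the liver".split()
suggestion = best_candidate(tokens, 6)
```

A word would be flagged as a potential Type-4 error only when the best candidate differs from what the learner wrote, as with 'liver' here. Because the confusion sets come from the L1 model instead of a fixed list, the open class stays small enough to score.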
Type-4 Errors Because this is an open class, the system will undoubtedly encounter errors that were not anticipated. Even so, I believe that many errors of this type can be detected using this restricted-class method.
Questions/Comments Please don’t hesitate to give me feedback in person, via email, or via the feedback form that you have been given. THANKS!
Cute System Names
• PISCES (Probabilistic Intelligent Spell Checker for English as a Second Language)
• HELPEM (Heuristic-driven English Learner Population Error Modeling)