The Montclair Electronic Language Learner Database (MELD)
www.chss.montclair.edu/linguistics/MELD/
Eileen Fitzpatrick & Steve Seegmiller
Montclair State University
Non-native speaker (NNS) corpora
• Begun in the early 1990s
• Data
  • written performance only
  • essays of students of English as a foreign language
• Corpus development (academic)
  • in Europe: Louvain, Lodz, Uppsala
  • in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology
• Annotation
  • Lodz: part of speech
  • HKUST, Lodz: error tags
Gaps in NNS Corpus Creation
• No NNS corpus in America, so no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types
MELD Goals
• Initial goals
  • collect ESL student writing
  • tag the writing for errors
  • provide publicly available NNS data
• The initial goals support
  • second-language pedagogy
  • language acquisition research
  • tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)
MELD Overview
• Data
  • 44,477 words of annotated text
  • 53,826 more words of raw data
  • language and education data for each student author
  • upper-level ESL students
• Tools written to
  • link essays to student background data
  • produce an error-free version from the tagged text
  • allow fast entry of background data
Annotation
• Annotators "reconstruct" a grammatical form: {error/reconstruction}
  • school systems {is/are}
  • since children {0/are} usually inspired
  • becoming {a/0} good citizens
• Agreement between annotators is an issue
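The tag format lends itself to mechanical processing; the "produce an error-free version from the tagged text" tool mentioned under MELD Overview can be approximated in a few lines. Below is a minimal Python sketch, not MELD's actual tool: the regex and function name are illustrative, and it assumes (consistent with the examples above) that 0 marks a null form, so {0/are} supplies a missing word and {a/0} removes a superfluous one.

```python
import re

# MELD-style tags: {error/reconstruction}, e.g. {is/are}, {0/are}, {a/0}.
TAG = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")

def reconstruct(tagged: str) -> str:
    """Replace each {error/reconstruction} tag with its reconstruction.

    A "0" on the reconstruction side means the erroneous form is deleted.
    """
    fixed = TAG.sub(lambda m: "" if m.group(2) == "0" else m.group(2), tagged)
    # Collapse the double spaces left behind by deletions such as {a/0}.
    return re.sub(r"\s{2,}", " ", fixed).strip()

print(reconstruct("school systems {is/are}"))        # -> "school systems are"
print(reconstruct("becoming {a/0} good citizens"))   # -> "becoming good citizens"
```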
Error Classification from a Predetermined List
• Benefit
  • annotators agree on what an error is: only those items in the classification scheme
• Problems
  • annotators have to learn a classification scheme
  • the existence of a classification scheme means that annotators can misclassify
  • errors not in the scheme will be missed
Error Identification & Reconstruction
• Benefits
  • speed in annotating, since there is no classification scheme to learn
  • no chance of misclassifying
  • less common errors will be captured
  • a reconstructed text can be more easily parsed and tagged for part of speech
• Question
  • How well can we agree on what is an error?
Agreement Measures
• Reliability: What percentage of the errors do both taggers tag?
  Reliability = |T1 ∩ T2| / ((|T1| + |T2|) / 2)
• Precision: What percentage of the non-expert's (T2) tags are accurate?
  Precision = |T1 ∩ T2| / |T2|
• Recall: What percentage of true errors did the non-expert (T2) find?
  Recall = |T1 ∩ T2| / |T1|
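Reading T1 and T2 as the sets of positions each tagger marked as errors, all three measures reduce to set arithmetic. A minimal sketch under that assumption (the set-of-positions representation is ours, not necessarily MELD's evaluation code):

```python
def agreement(t1: set, t2: set) -> dict:
    """Agreement between an expert tagger (t1) and a non-expert (t2).

    t1, t2: sets of tagged error positions, e.g. (essay_id, token_index).
    """
    both = t1 & t2                          # errors tagged by both
    return {
        "recall": len(both) / len(t1),      # |T1 ∩ T2| / |T1|
        "precision": len(both) / len(t2),   # |T1 ∩ T2| / |T2|
        "reliability": len(both) / ((len(t1) + len(t2)) / 2),
    }

expert = {(1, 4), (1, 9), (2, 3), (2, 17)}
novice = {(1, 4), (2, 3), (2, 20)}
print(agreement(expert, novice))
# recall 0.5, precision ~0.67, reliability ~0.57
```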
[Venn diagram: expert (T1) and non-expert (T2) tag sets; the non-expert's overlap pattern illustrates high precision, low recall, low reliability]
Agreement Measures

Tagger pair   Essays   Recall   Precision   Reliability
J&L           1-10     .54      .58         .39
J&L           11-22    .57      .78         .49
J&N           1-10     .58      .48         .23
J&N           11-22    .37      .54         .27
L&N           1-10     .65      .70         .37
L&N           11-22    .60      .78         .36
Conclusions on Tagging Agreement
• Unsatisfactory level of agreement as to what counts as an error
• Disagreements are resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored
The Future
• Immediate
  • Internet access to the data and tools
  • an error concordancer (see the sketch below)
  • automatic part-of-speech and syntactic markup
  • data from different ESL skill levels
• Long range
  • statistical tool to correlate error frequency with student background
  • student editing aid
  • grammar checker
  • NNS speech data
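The error concordancer could work as a keyword-in-context (KWIC) search over the tagged essays. A rough sketch reusing the {error/reconstruction} tag format; the function and its parameters are illustrative, not a description of MELD's eventual tool:

```python
import re

TAG = re.compile(r"\{[^/{}]*/[^/{}]*\}")

def concordance(text: str, window: int = 4):
    """Yield each tagged error with `window` tokens of context on each side."""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if TAG.search(tok):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>28}  {tok}  {right}"

essay = "school systems {is/are} since children {0/are} usually inspired"
for line in concordance(essay):
    print(line)
```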
Some Possible Applications • Preparation of instructional materials • Studies of progress over a semester • Research on error types by L1 • Research on writing characteristics by L1
Writing Characteristics by L1

L1 Spanish, tense errors (word count: 2,305; total: 6)
1 {would/will}  1 {went/go}  1 {stay/stayed}  1 {gave/give}  1 {cannot/could}  1 {can/could}

L1 Gujarati, tense errors (word count: 2,500; total: 31)
5 {was/is}  3 {were/are}  2 {would/will}  2 {is/was}  2 {have/had}  2 {had/have}
1 {passes/passed}  1 {love/loved}  1 {left/leave}  1 {kept/keeps}  1 {involved/involves}  1 {get/got}
1 {would start/started}  1 {do/did}  1 {will/0}  1 {can/could}  1 {will/were to}  1 {are/were}
1 {was/were}  1 {wanted/want}  1 {spend/spent}
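Because the two samples differ in length, the raw totals above are easier to compare as rates. A quick normalization per 1,000 words, using only the counts from the table (the per-1,000 convention is ours):

```python
# (tense errors, word count) per L1, taken from the table above
tense_errors = {"Spanish": (6, 2305), "Gujarati": (31, 2500)}

for l1, (errors, words) in tense_errors.items():
    print(f"{l1}: {errors / words * 1000:.1f} tense errors per 1,000 words")
# Spanish: 2.6 tense errors per 1,000 words
# Gujarati: 12.4 tense errors per 1,000 words
```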
Acknowledgments Jacqueline Cassidy Jennifer Higgins Norma Pravec Lenore Rosenbluth Donna Samko Jory Samkoff Kae Shigeta