The Montclair Electronic Language Learner Database (MELD)

Presentation Transcript


  1. The Montclair Electronic Language Learner Database (MELD) www.chss.montclair.edu/linguistics/MELD/ Eileen Fitzpatrick & Steve Seegmiller Montclair State University

  2. Non-native speaker (NNS) corpora • Begun in the early 1990s • Data • written performance only • essays of students of English as a foreign language • Corpus development (academic) • in Europe: Louvain, Lodz, Uppsala • in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology • Annotation • Lodz: part of speech • HKUST, Lodz: error tags

  3. Gaps in NNS Corpus Creation • No NNS corpus in America, so no corpus of English as a Second Language (ESL) • No NNS corpus is publicly available • No NNS corpus annotates errors without a predetermined list of error types

  4. MELD Goals • Initial goals • collect ESL student writing • tag the writing for error • provide publicly available NNS data • These initial goals support • second-language pedagogy • language acquisition research • tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)

  5. MELD Overview • Data • 44,477 words of annotated text • 53,826 more words of raw data • language and education background for each student author • upper-level ESL students • Tools written to • link essays to student background data • produce an error-free version from the tagged text • allow fast entry of background data

  6. Annotation • Annotators "reconstruct" a grammatical form with {error/reconstruction} tags, where 0 marks a missing or superfluous form: school systems {is/are}; since children {0/are} usually inspired; becoming {a/0} good citizens • Agreement between annotators is an issue (a tag-stripping sketch follows below)
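
Slide 5 mentions a tool that produces an error-free version from the tagged text; the {error/reconstruction} notation makes that straightforward. Below is a minimal sketch of the idea in Python, not the actual MELD tool; the function name and the whitespace handling are our assumptions.

    import re

    # One {error/reconstruction} tag; neither side may contain { } or /.
    TAG = re.compile(r"\{([^{}/]*)/([^{}/]*)\}")

    def detag(text, keep="reconstruction"):
        """Replace each {error/reconstruction} tag with one side.
        A '0' stands for a missing or superfluous form, so it
        becomes the empty string."""
        side = 2 if keep == "reconstruction" else 1
        def pick(m):
            part = m.group(side)
            return "" if part == "0" else part
        # Collapse the doubled spaces that deletions leave behind.
        return re.sub(r" {2,}", " ", TAG.sub(pick, text)).strip()

    print(detag("since children {0/are} usually inspired"))
    # since children are usually inspired
    print(detag("becoming {a/0} good citizens"))
    # becoming good citizens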

  7. Error Classification from a Predetermined List • Benefit • annotators agree on what an error is: only those items in the classification scheme • Problems • annotators have to learn a classification scheme • the existence of a classification scheme means that the annotators can misclassify • errors not in the scheme will be missed

  8. Error Identification & Reconstruction • Benefits • speed in annotating since there is no classification scheme to learn • no chance of misclassifying • less common errors will be captured • a reconstructed text can be more easily parsed and tagged for part of speech • Question • How well can we agree on what is an error?

  9. Agreement Measures • Reliability: What percentage of the errors do both taggers tag? |T1 ∩ T2| / ((|T1| + |T2|) / 2) • Precision: What percentage of the non-expert's (T2) tags are accurate? |T1 ∩ T2| / |T2| • Recall: What percentage of true errors did the non-expert (T2) find? |T1 ∩ T2| / |T1| (computed in the sketch below)
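
Each measure compares the sets of items the two taggers flagged, so all three reduce to set arithmetic. A minimal sketch in Python; representing a tagged error as an (essay, token offset) pair is our illustration, not a MELD convention.

    def agreement(expert, other):
        """Agreement between an expert tagger (T1) and another tagger (T2)."""
        both = len(expert & other)
        return {
            "recall": both / len(expert),    # true errors the other tagger found
            "precision": both / len(other),  # the other tagger's tags that are real
            "reliability": both / ((len(expert) + len(other)) / 2),
        }

    t1 = {("essay1", 4), ("essay1", 9), ("essay2", 3), ("essay2", 17)}
    t2 = {("essay1", 4), ("essay2", 3), ("essay2", 21)}
    print(agreement(t1, t2))
    # recall 0.5, precision ~0.67, reliability ~0.57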

  10. Agreement Measures [diagram comparing the expert's and non-expert's tag sets: the overlap illustrates high precision, low recall, and low reliability for the non-expert]

  11. Agreement Measures

      Pair   Essays   Recall   Precision   Reliability
      J&L    1-10      .54       .58          .39
      J&L    11-22     .57       .78          .49
      J&N    1-10      .58       .48          .23
      J&N    11-22     .37       .54          .27
      L&N    1-10      .65       .70          .37
      L&N    11-22     .60       .78          .36

  12. Conclusions on Tagging Agreement • Unsatisfactory level of agreement as to what counts as an error • Disagreements resolved through regular meetings • There are now two types of tags: one for lexico-syntactic errors and one for stylistic ones • The tags are transparent to the user and can be deleted or ignored

  13. The Future • Immediate • Internet access to data and tools • an error concordancer (see the sketch below) • automatic part-of-speech and syntactic markup • data from different ESL skill levels • Long range • a statistical tool to correlate error frequency with student background • a student editing aid • a grammar checker • NNS speech data
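
One plausible shape for the planned error concordancer is a keyword-in-context listing over the existing {error/reconstruction} tags. A rough sketch under that assumption; the window size and output format are ours, not the planned tool's.

    import re

    TAG = re.compile(r"\{[^{}]*\}")

    def concordance(text, window=25):
        """Print each tag with `window` characters of context per side."""
        for m in TAG.finditer(text):
            left = text[max(0, m.start() - window):m.start()]
            right = text[m.end():m.end() + window]
            print(f"{left:>{window}}  {m.group(0)}  {right}")

    concordance("school systems {is/are} since children {0/are} usually inspired")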

  14. Some Possible Applications • Preparation of instructional materials • Studies of progress over a semester • Research on error types by L1 • Research on writing characteristics by L1

  15. Writing Characteristics by L1

      L1 Spanish, tense (6 tags / 2,305 words):
      1 {would/will}, 1 {went/go}, 1 {stay/stayed}, 1 {gave/give}, 1 {cannot/could}, 1 {can/could}

      L1 Gujarati, tense (31 tags / 2,500 words):
      5 {was/is}, 3 {were/are}, 2 {would/will}, 2 {is/was}, 2 {have/had}, 2 {had/have}, 1 {would start/started}, 1 {will/0}, 1 {will/were to}, 1 {was/were}, 1 {wanted/want}, 1 {spend/spent}, 1 {passes/passed}, 1 {love/loved}, 1 {left/leave}, 1 {kept/keeps}, 1 {involved/involves}, 1 {get/got}, 1 {do/did}, 1 {can/could}, 1 {are/were}
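
Normalizing these totals by essay length makes the contrast concrete: about 2.6 tense tags per 1,000 words in the Spanish L1 sample against 12.4 in the Gujarati one. A small sketch of that normalization (the per-1,000-words rate is our presentation choice, not the slide's):

    # Tense-tag totals and word counts taken from the slide above.
    samples = {"Spanish": (6, 2305), "Gujarati": (31, 2500)}
    for l1, (tags, words) in samples.items():
        print(f"{l1}: {1000 * tags / words:.1f} tense tags per 1,000 words")
    # Spanish: 2.6 tense tags per 1,000 words
    # Gujarati: 12.4 tense tags per 1,000 words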

  16. Acknowledgments Jacqueline Cassidy, Jennifer Higgins, Norma Pravec, Lenore Rosenbluth, Donna Samko, Jory Samkoff, Kae Shigeta
