The Montclair Electronic Language Learner Database (MELD)
www.chss.montclair.edu/linguistics/MELD/
Eileen Fitzpatrick & Steve Seegmiller
Montclair State University
Non-native speaker (NNS) corpora
• Begun in the early 1990s
• Data
  • written performance only
  • essays of students of English as a foreign language
• Corpus development (academic)
  • in Europe: Louvain, Lodz, Uppsala
  • in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology
• Annotation
  • Lodz: part of speech
  • HKUST, Lodz: error tags
Gaps in NNS Corpus Creation
• No NNS corpus in America, so no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types
MELD Goals
• Initial goals
  • collect ESL student writing
  • tag the writing for errors
  • provide publicly available NNS data
• The initial goals support
  • second-language pedagogy
  • language acquisition research
  • tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)
MELD Overview
• Data
  • 44,477 words of annotated text
  • 53,826 more words of raw data
  • language and education data for each student author
  • upper-level ESL students
• Tools written to
  • link essays to student background data
  • produce an error-free version from the tagged text
  • allow fast entry of background data
Annotation
• Annotators "reconstruct" a grammatical form: {error/reconstruction}
  • school systems {is/are}
  • since children {0/are} usually inspired
  • becoming {a/0} good citizens
• Agreement between annotators is an issue
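The tag format lends itself to mechanical processing; the "produce an error-free version from the tagged text" tool mentioned under MELD Overview can be approximated in a few lines. Below is a minimal Python sketch, not MELD's actual tool: the regex and function name are illustrative, and it assumes (consistent with the examples above) that 0 marks a null form, so {0/are} supplies a missing word and {a/0} removes a superfluous one.

```python
import re

# MELD-style tags: {error/reconstruction}, e.g. {is/are}, {0/are}, {a/0}.
TAG = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")

def reconstruct(tagged: str) -> str:
    """Replace each {error/reconstruction} tag with its reconstruction.

    A "0" on the reconstruction side means the erroneous form is deleted.
    """
    fixed = TAG.sub(lambda m: "" if m.group(2) == "0" else m.group(2), tagged)
    # Collapse the double spaces left behind by deletions such as {a/0}.
    return re.sub(r"\s{2,}", " ", fixed).strip()

print(reconstruct("school systems {is/are}"))        # -> "school systems are"
print(reconstruct("becoming {a/0} good citizens"))   # -> "becoming good citizens"
```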
Error Classification from a Predetermined List
• Benefit
  • annotators agree on what an error is: only those items in the classification scheme
• Problems
  • annotators have to learn a classification scheme
  • the existence of a classification scheme means that annotators can misclassify
  • errors not in the scheme will be missed
Error Identification & Reconstruction
• Benefits
  • speed in annotating, since there is no classification scheme to learn
  • no chance of misclassifying
  • less common errors will be captured
  • a reconstructed text can be more easily parsed and tagged for part of speech
• Question
  • How well can we agree on what is an error?
Agreement Measures
• Reliability: What percentage of the errors do both taggers tag?
  Reliability = |T1 ∩ T2| / ((|T1| + |T2|) / 2)
• Precision: What percentage of the non-expert's (T2) tags are accurate?
  Precision = |T1 ∩ T2| / |T2|
• Recall: What percentage of true errors did the non-expert (T2) find?
  Recall = |T1 ∩ T2| / |T1|
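Reading T1 and T2 as the sets of positions each tagger marked as errors, all three measures reduce to set arithmetic. A minimal sketch under that assumption (the set-of-positions representation is ours, not necessarily MELD's evaluation code):

```python
def agreement(t1: set, t2: set) -> dict:
    """Agreement between an expert tagger (t1) and a non-expert (t2).

    t1, t2: sets of tagged error positions, e.g. (essay_id, token_index).
    """
    both = t1 & t2                          # errors tagged by both
    return {
        "recall": len(both) / len(t1),      # |T1 ∩ T2| / |T1|
        "precision": len(both) / len(t2),   # |T1 ∩ T2| / |T2|
        "reliability": len(both) / ((len(t1) + len(t2)) / 2),
    }

expert = {(1, 4), (1, 9), (2, 3), (2, 17)}
novice = {(1, 4), (2, 3), (2, 20)}
print(agreement(expert, novice))
# recall 0.5, precision ~0.67, reliability ~0.57
```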
[Venn diagram: expert (T1) and non-expert (T2) tag sets; the non-expert's overlap pattern illustrates high precision, low recall, low reliability]
Agreement Measures

Tagger pair   Essays   Recall   Precision   Reliability
J&L           1-10     .54      .58         .39
J&L           11-22    .57      .78         .49
J&N           1-10     .58      .48         .23
J&N           11-22    .37      .54         .27
L&N           1-10     .65      .70         .37
L&N           11-22    .60      .78         .36
Conclusions on Tagging Agreement
• Unsatisfactory level of agreement as to what counts as an error
• Disagreements are resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored
The Future
• Immediate
  • Internet access to the data and tools
  • an error concordancer (see the sketch below)
  • automatic part-of-speech and syntactic markup
  • data from different ESL skill levels
• Long range
  • statistical tool to correlate error frequency with student background
  • student editing aid
  • grammar checker
  • NNS speech data
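The error concordancer could work as a keyword-in-context (KWIC) search over the tagged essays. A rough sketch reusing the {error/reconstruction} tag format; the function and its parameters are illustrative, not a description of MELD's eventual tool:

```python
import re

TAG = re.compile(r"\{[^/{}]*/[^/{}]*\}")

def concordance(text: str, window: int = 4):
    """Yield each tagged error with `window` tokens of context on each side."""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if TAG.search(tok):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>28}  {tok}  {right}"

essay = "school systems {is/are} since children {0/are} usually inspired"
for line in concordance(essay):
    print(line)
```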
Some Possible Applications • Preparation of instructional materials • Studies of progress over a semester • Research on error types by L1 • Research on writing characteristics by L1
Writing Characteristics by L1

L1 Spanish, tense errors (word count: 2,305; total: 6)
1 {would/will}  1 {went/go}  1 {stay/stayed}  1 {gave/give}  1 {cannot/could}  1 {can/could}

L1 Gujarati, tense errors (word count: 2,500; total: 31)
5 {was/is}  3 {were/are}  2 {would/will}  2 {is/was}  2 {have/had}  2 {had/have}
1 {passes/passed}  1 {love/loved}  1 {left/leave}  1 {kept/keeps}  1 {involved/involves}  1 {get/got}
1 {would start/started}  1 {do/did}  1 {will/0}  1 {can/could}  1 {will/were to}  1 {are/were}
1 {was/were}  1 {wanted/want}  1 {spend/spent}
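Because the two samples differ in length, the raw totals above are easier to compare as rates. A quick normalization per 1,000 words, using only the counts from the table (the per-1,000 convention is ours):

```python
# (tense errors, word count) per L1, taken from the table above
tense_errors = {"Spanish": (6, 2305), "Gujarati": (31, 2500)}

for l1, (errors, words) in tense_errors.items():
    print(f"{l1}: {errors / words * 1000:.1f} tense errors per 1,000 words")
# Spanish: 2.6 tense errors per 1,000 words
# Gujarati: 12.4 tense errors per 1,000 words
```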
Acknowledgments Jacqueline Cassidy Jennifer Higgins Norma Pravec Lenore Rosenbluth Donna Samko Jory Samkoff Kae Shigeta