Linguistic annotation of learner corpora

A. Díaz-Negrillo, D. Meurers & H. Wunsch
University of Jaén (Spain), University of Tübingen (Germany)

1. Introduction

This study examines linguistic annotation of learner corpora, in particular part-of-speech (POS) annotation. It aims to show where native POS tagsets fail to describe learner language accurately, by:
• describing POS annotation practice in learner corpora, and
• characterizing the areas where properties of learner language differ from those assumed by native POS annotation schemes.

Learner corpora can play a role in identifying areas of relevance in fields such as FLT, SLA and materials design.
• The terminology used to single out aspects of learner language needs to be mapped to instances in the corpus; this mapping is annotation.

Linguistic annotation of learner corpora, in particular POS tagging, is becoming common practice because:
• By using generally agreed linguistic categories, it allows units of interest to be identified objectively.
• Other annotation specific to learner corpora (error tagging) mostly supports research into deviances, is costly and involves a degree of subjectivity.
• SLA research is interested in the developmental stages of the acquisition process.
• POS tagging can be done automatically.

Recent initiatives:
• International Corpus of Learner English (ICLE)
• Cambridge Learner Corpus (CLC)
• Japanese EFL Learner Corpus (JEFLL)
• Polish Learner Corpus of English

Automatic POS tagging consists of 2 parts:
• Tag look-up: all possible tags for a given token are determined by reference to a lexical database or by morphological analysis.
• Tag disambiguation: the possible tags are reduced to the correct tag on the basis of distribution.
Fallback strategies: weaker versions of the 3 sources of evidence above (lexicon, morphology and distribution) and, as a last resort, use of the most frequent tags.
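
The two parts and the last-resort fallback can be sketched in a few lines of Python. This is only a toy illustration, not the procedure of any particular tagger: the lexicon, the suffix rules, the bigram counts and the function names `look_up`, `disambiguate` and `tag` are all invented for the example, and the tags follow Penn Treebank-style labels.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> set of possible tags.
LEXICON = {
    "the": {"DT"},
    "demonstrations": {"NNS"},
    "ended": {"VBD", "VBN"},
    "without": {"IN"},
    "confrontation": {"NN"},
}

# Hypothetical suffix rules standing in for morphological analysis.
SUFFIX_RULES = [
    ("ing", {"VBG", "NN"}),
    ("ed", {"VBD", "VBN"}),
    ("s", {"NNS", "VBZ"}),
]

# Hypothetical tag-bigram counts standing in for distributional evidence.
BIGRAMS = Counter({
    ("DT", "NNS"): 50,
    ("NNS", "VBD"): 40,
    ("NNS", "VBN"): 5,
    ("VBD", "IN"): 30,
    ("IN", "NN"): 60,
})

MOST_FREQUENT_TAG = "NN"  # assumed last-resort tag


def look_up(token):
    """Tag look-up: lexicon first, then suffix rules, then the last resort."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    for suffix, tags in SUFFIX_RULES:
        if word.endswith(suffix):
            return set(tags)
    return {MOST_FREQUENT_TAG}


def disambiguate(candidates, previous_tag):
    """Tag disambiguation: pick the candidate seen most often after previous_tag."""
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=lambda t: BIGRAMS[(previous_tag, t)])


def tag(tokens):
    """Tag a token list left to right using look-up, then disambiguation."""
    tagged, previous = [], "<s>"
    for token in tokens:
        best = disambiguate(look_up(token), previous)
        tagged.append((token, best))
        previous = best
    return tagged


result = tag("The demonstrations ended without confrontation".split())
```

Real taggers replace the bigram table with a trained statistical model, but the division of labour between look-up and disambiguation is the same.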

POS tagging of learner language is essentially perceived as an instance of domain transfer (van Rooy & Schäfer 2003; Thouësny 2009):
• Automatic POS taggers trained on native data are run on learner data.
• Due to differences in genre and data type, the annotations are less accurate.
• To make up for this degradation of performance, post-correction is often added.

De Haan (2000) and van Rooy & Schäfer (2002) investigated POS tagging error types. Spelling errors seem to be a major source of problems, but they can be handled rather straightforwardly, especially if they result in non-words.
• De Haan (2000) proposes a fine-grained classification of the learner errors that become relevant to the POS tagging process. He suggests adapting the TOSCA-ICLE POS tagset to cater for these learner-specific features.
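
Why non-words are the easy case can be shown with a simple lexicon check. The sketch below is an illustration only, not the method of the studies cited: the word list and the function name `find_non_words` are invented, and the test sentence is learner example (10) from these slides, where reading "by" as a misspelling of "buy" is an assumption.

```python
# Hypothetical mini word list standing in for a real lexical database.
KNOWN_WORDS = {"he", "ran", "run", "to", "by", "buy", "one"}


def find_non_words(tokens, lexicon=KNOWN_WORDS):
    """Return the tokens that are absent from the lexicon (case-insensitive)."""
    return [token for token in tokens if token.lower() not in lexicon]


tokens = "He runned to by one".split()
non_words = find_non_words(tokens)
```

Here `runned` is flagged because it is not in the word list, whereas `by` passes the check because it is an existing English word. Errors of the second kind cannot be found by look-up alone.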

Native taggers
• map the linguistic categories of native language onto POS tags, based on the combinatory possibilities of stem, morphology and distribution:
  The demonstrations ended without confrontation (demonstrations: NNS)
but learner language
• does not always present the same POS categories, because the combinatory possibilities of stem, morphology and distribution are different:
  […] If he want to know this […] (want: VB/VBP?)
Do native taggers always provide the categories needed to describe learner language?

2. Method

• This paper is based on a sample of around 40,000 words from the NOn-native Corpus of English (NOCE; Díaz Negrillo, 2007).
• NOCE is a written corpus of EFL:
  • over 300,000 words of written English by Spanish undergraduates,
  • 1,054 samples averaging 250 words each.

The samples were collected:
• from 2003 to 2009, primarily among first-year students on the English degree programme at the Universities of Granada and Jaén (Spain),
• at 3 stages in the academic year (beginning, mid-term and end),
• by the students’ lecturers, assisted by corpus compilers, in 1-hour teaching sessions,
• as a timed classroom task (essay writing), and
• on a voluntary basis and under appropriate conditions of anonymity.

The corpus contains 3 types of annotation:
• Editorial annotation: the corpus is annotated for students’ edits to their own writing (e.g. strikeouts, late insertions, reordering of units and missing/unreadable text).
• Error annotation: a section of the corpus of around 40,000 words is error-tagged with the EARS tagset (Error-Annotation and Retrieval System; Díaz Negrillo, 2009).
• POS annotation: the corpus is annotated using 3 automatic POS taggers: TnT, Stanford and Treebank.

General observations of the corpus’ POS annotations by the 3 POS taggers suggest:
• There are areas where the taggers do not provide the same tag for a given token.
• Certain cases are easy to disambiguate manually, but
• in other cases disambiguation is difficult because the tagsets do not fully map the categories present in the learner corpus.
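
Tokens on which the taggers disagree can be located by comparing their outputs position by position. The sketch below assumes each tagger's output is available as a list of tags aligned with the tokens. The function name `find_disagreements`, the tagger labels and the tag sequences are invented for illustration and are not actual output from TnT, Stanford or any other tagger.

```python
def find_disagreements(tokens, taggings):
    """List the positions where the taggers do not all assign the same tag.

    taggings maps a tagger name to a list of tags aligned with tokens.
    """
    for name, tag_list in taggings.items():
        if len(tag_list) != len(tokens):
            raise ValueError(f"{name}: expected {len(tokens)} tags, got {len(tag_list)}")
    disagreements = []
    for i, token in enumerate(tokens):
        tags_here = {name: tag_list[i] for name, tag_list in taggings.items()}
        if len(set(tags_here.values())) > 1:
            disagreements.append((i, token, tags_here))
    return disagreements


tokens = "If he want to know this".split()

# Invented tag sequences, for illustration only.
taggings = {
    "tagger_a": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
    "tagger_b": ["IN", "PRP", "VB", "TO", "VB", "DT"],
    "tagger_c": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
}

disagreements = find_disagreements(tokens, taggings)
```

With these invented sequences, only `want` is reported. Such a list identifies where to look, but it does not decide whether the tagset offers a suitable category at all.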

A preliminary examination of the mismatches between native and learner POS categories suggests 4 main types of mismatch.
• The mismatches are discussed on the basis of the 3 sources of information that automatic POS taggers use to select tags for tokens:
  • Lexical look-up: the token’s stem,
  • Morphology: the token’s derivational and inflectional markings, and
  • Distribution: the token’s syntactic context.

3. Mismatches in POS classification variables

Case 1. Stem-Distribution mismatch
(1) You can find a big vary of beautiful beaches […]
    vary: Stem = Verb; Distribution = Noun
(2) They are very kind and friendship […]
    friendship: Stem = Noun; Distribution = Adjective; Morphology = Noun

Case 2. Stem-Distribution and Stem-Morphology mismatch
(3) […] one of the favourite places to visit for foreigns.
    foreigns: Stem = Adjective; Distribution = Noun; Morphology = Noun
(4) […] to be choiced for a job […]
    choiced: Stem = Noun; Distribution = Verb; Morphology = Verb

Case 3. Stem-Morphology mismatch
(5) […] this film is one of the bests ever.
    bests: Stem = Adjective; Distribution = Adjective; Morphology = Noun
(6) […] television, radio are very subjectives […]
    subjectives: Stem = Adjective; Distribution = Adjective; Morphology = Noun

Case 4. Distribution-Morphology mismatch
(7) […] for almost every jobs nowadays.
    jobs: Stem = Noun; Distribution = Noun Sing; Morphology = Noun Pl
(8) […] it has grew up a lot especially since 1996 […]
    grew: Stem = Verb; Distribution = Verb PP; Morphology = Verb PT
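
The four cases can be read as a small decision rule over the three sources of evidence. The sketch below is one possible reading, not a procedure given in the slides: the function names `coarse` and `classify_mismatch` are invented, the word class is taken to be the first word of each label (so `Noun Sing` and `Noun Pl` are both `Noun`), and a missing morphology value is treated as "no evidence" rather than as a conflict.

```python
def coarse(category):
    """Return the main word class, e.g. 'Noun' for 'Noun Pl'."""
    return category.split()[0]


def classify_mismatch(stem, distribution, morphology=None):
    """Assign one of the four mismatch cases, or 'no mismatch'."""
    stem_vs_dist = coarse(stem) != coarse(distribution)
    stem_vs_morph = morphology is not None and coarse(stem) != coarse(morphology)
    dist_vs_morph = morphology is not None and distribution != morphology
    if stem_vs_dist and stem_vs_morph:
        return "Case 2"  # stem conflicts with both other sources
    if stem_vs_dist:
        return "Case 1"  # stem conflicts with distribution only
    if stem_vs_morph:
        return "Case 3"  # stem conflicts with morphology only
    if dist_vs_morph:
        return "Case 4"  # same word class, conflicting subcategory
    return "no mismatch"


# Category values for examples (1) to (8), as listed in the slides.
EXAMPLES = {
    "vary": ("Verb", "Noun", None),
    "friendship": ("Noun", "Adjective", "Noun"),
    "foreigns": ("Adjective", "Noun", "Noun"),
    "choiced": ("Noun", "Verb", "Verb"),
    "bests": ("Adjective", "Adjective", "Noun"),
    "subjectives": ("Adjective", "Adjective", "Noun"),
    "jobs": ("Noun", "Noun Sing", "Noun Pl"),
    "grew": ("Verb", "Verb PP", "Verb PT"),
}

cases = {token: classify_mismatch(*values) for token, values in EXAMPLES.items()}
```

Under these assumptions the rule assigns examples (1) and (2) to Case 1, (3) and (4) to Case 2, (5) and (6) to Case 3, and (7) and (8) to Case 4, matching the grouping in the slides.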

4. POS tagging learner data and deviances

Not all learner errors demand special attention in POS tagging:
(9) […] Internet can modificate […]
(10) He runned to by one […]
(11) […] The 11th March cames to out minds.
(12) Childrens spend so much time […]
(13) […] people shouldn’t be menospreciated […]

5. Conclusions

• Linguistic annotation of learner data is a powerful means of gaining access to learner properties with a view to conducting theoretical and applied research.
• Applying native automatic POS taggers is a sensible point of departure.
• However, for linguistic annotation to be fully relevant in learner corpus research, it should capture the properties of learner language systematically.
• Adapting existing native POS tagsets to the specifications of learner data seems necessary.

References

de Haan, P. 2000. Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus Linguistics and Linguistic Theory (pp. 69-79). Amsterdam: Rodopi.
Díaz Negrillo, A. 2007. A Fine-Grained Error Tagger for Learner Corpora. Unpublished Ph.D. thesis, University of Jaén, Jaén.
Díaz Negrillo, A. 2009. EARS: A User’s Manual. Munich: LINCOM.
Thouësny, S. 2009. Increasing the reliability of a part-of-speech tagging tool for use with learner language. Paper presented at the Automatic Analysis of Learner Language (AALL’09) Workshop, Tempe, AZ.
van Rooy, B. & Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20, 325-335.
van Rooy, B. & Schäfer, L. 2003. An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 Conference, Lancaster University (UK), 28-31 March 2003. Vol. 16 (pp. 835-844). Lancaster: UCREL, Lancaster University.

Contact: adinegri@ujaen.es, dm@sfs.uni-tuebingen.de, wunsch@sfs-tuebingen.de
