GIVE and TAKE: towards overcoming of the bottlenecks in learner corpus linguistics
ICAME 2001, Louvain-la-Neuve, 16-20 May 2001
Przemysław Kaszubski
School of English, Adam Mickiewicz University, Poznań, Poland
Central issues
• Pedagogical perspective: annotation (& disambiguation) of data
• Procedures accessible for applied (non-specialist) researchers
• Corpus-based goals vs. corpus-driven methods
ICAME 2001, Louvain-la-Neuve
Example: research into learners’ core-word based phraseology
• EFL learners’ overuse of high-frequency words: what does it mean?
• intensive collocability of core lexical items
• multi-word extensions (compounds, coinages, idioms, expressions, phrasals)
• need for an idiomatic scale
• multi-corpus scheme with Polish advanced EFL learner data as hub data
Corpus-driven methods: precision & recall problems
• Language-based obstacles:
  • Nature of learner language
  • Cross-corpus comparability
• Technical aspects:
  • POS tagging error margin
  • Word-sense disambiguation and/or syntactic parsing
  • Co-occurrence statistics
Problem 1: the nature of learner data
• Difference in proficiency levels essential in cross-corpus comparisons
• Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand
• Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’
• Unrecognised words vs. tagger default option tag
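As an illustration of the hyphenation point, a single search pattern can cover the spaced, hyphenated and solid spellings so that none is overlooked before tagging. This is only a minimal sketch: the helper name `variant_pattern` and the sample sentence are hypothetical and not part of the talk.

```python
import re

def variant_pattern(first: str, second: str) -> re.Pattern:
    """Match 'first second', 'first-second' and 'firstsecond' as whole words."""
    return re.compile(
        rf"\b{re.escape(first)}[\s-]?{re.escape(second)}\b", re.IGNORECASE
    )

# Hypothetical sample sentence containing all three spellings.
sample = "Money making is fun; moneymaking schemes; a money-making plan."
hits = [m.group(0) for m in variant_pattern("money", "making").finditer(sample)]
print(hits)
```

The matched variants could then be normalised to one spelling, or simply counted together, before the corpus is passed to a tagger.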
Problem 2: cross-corpus comparability
• genre homogeneity
• topic-skewed distribution: heuristic method of isolation: sort by standard deviation
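The slide does not say what exactly is sorted by standard deviation. One plausible reading, sketched below as an assumption, is to compute each word's relative frequency per text (or per topic subcorpus) and rank words by the standard deviation of those values, so that topic-bound items rise to the top. The function name and toy data are hypothetical.

```python
from collections import Counter
from statistics import pstdev

def dispersion_ranking(texts: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Rank words by the standard deviation of their per-text relative frequency.

    Assumes every text contains at least one token.
    """
    rel = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        rel[name] = {w: c / len(tokens) for w, c in counts.items()}
    vocab = set().union(*(r.keys() for r in rel.values()))
    scores = {w: pstdev([r.get(w, 0.0) for r in rel.values()]) for w in vocab}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: 'tv' is concentrated in one essay, 'the' is evenly spread.
essays = {
    "essay_a": ["the", "tv", "tv", "tv"],
    "essay_b": ["the", "cat", "sat", "on"],
    "essay_c": ["the", "dog", "ran", "off"],
}
ranking = dispersion_ranking(essays)
print(ranking[0])
```

Words at the top of the ranking are candidates for topic skew; evenly distributed words such as 'the' end up at the bottom.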
Technical 3: performance of POS taggers
• Affected: extraction of lemmas meeting POS criteria
• Tagset vs. research criteria
  • e.g. gerund = noun or verb?
• Precision (noise in data): non-verbs tagged as verbs, e.g.:
  • Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)
  • agressive VB(lex,intr,infin) ?agressive? ...(3)
  • well-behaved VB(lex,montr,edp) ?well-behave? ...(2)
• Recall (data ignored): verbs tagged as non-verbs, or lexical verbs tagged as auxiliaries, e.g.:
  • ... who in sharing their lives with a retarded sibiling [sic!] and taking <ADJ(ge,pos,ingp)> {taking} part in every-day care problems, may decide never to have ...
• Untagged and heuristically tagged items: explicit marking vs. default tag
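The two error types above correspond to the standard precision and recall measures. The sketch below assumes a manually checked set of true verb tokens is available for comparison; the token identifiers are hypothetical.

```python
def precision_recall(tagged_as_verb: set[str], true_verbs: set[str]) -> tuple[float, float]:
    """Precision: share of tagged items that are real verbs (low = noise).
    Recall: share of real verbs that the tagger found (low = data ignored)."""
    true_positives = len(tagged_as_verb & true_verbs)
    precision = true_positives / len(tagged_as_verb) if tagged_as_verb else 0.0
    recall = true_positives / len(true_verbs) if true_verbs else 0.0
    return precision, recall

# Hypothetical token ids: t3 and t4 are non-verbs tagged as verbs (noise),
# t5 is a verb the tagger labelled as something else (ignored data).
tagged = {"t1", "t2", "t3", "t4"}
gold = {"t1", "t2", "t5"}
precision, recall = precision_recall(tagged, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```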
Tracking & rectifying POS errors
• TOSCA-ICLE tagger built-in tag editor: on-line targeting of precision & recall errors
  • Problem: insufficient query language: word OR lemma OR tag pattern
• No built-in tagger editor:
  • Problem 1: comprehensive check or intuitive selection
  • Problem 2: most concordancers are browsers only
  • Problem 3: large corpora may not load into common editors
  • Use of text-processing UNIX tools
Technical 4: semantic disambiguation and associations
• sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998: 4)
• automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project
• University of Lancaster disambiguation tool (ACASD package, Thomas & Wilson 1996)
• Tools unavailable or not yet at an implementable stage
• Consequently: manual, POS-aided disambiguation
Technical 5: corpus-driven phraseology extraction (1)
• collocation vs. co-occurrence & adjacency
• word clusters
  • precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
  • recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs. ‘TAKE good care of’; ‘the chance which were not eager to take’)
  • stop-listing not quite possible with high-frequency items (except Ted Pedersen’s ‘Bigram Statistics Package’)
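The recall problem with a fixed window can be made concrete with a small span-based collocate counter. This is a generic sketch, not the procedure of any tool named in the talk; the function name is hypothetical, and the example phrase is the one quoted above, lower-cased and split on whitespace.

```python
from collections import Counter

def window_collocates(tokens: list[str], node: str, left: int = 4, right: int = 4) -> Counter:
    """Count words occurring within `left`/`right` tokens of each node occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        counts.update(tokens[start:i] + tokens[i + 1:i + 1 + right])
    return counts

phrase = "the chance which were not eager to take".split()
narrow = window_collocates(phrase, "take")          # default 4:4 span
wide = window_collocates(phrase, "take", left=6)    # widened left span
print("chance" in narrow, "chance" in wide)
```

With the 4:4 span the object 'chance' is missed because it sits six tokens to the left of 'take'; widening the span recovers it, but at the cost of admitting more irrelevant words.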
Technical 5: corpus-driven phraseology extraction (2)
• co-occurrence statistics (WordSmith, TACT)
  • precision: not all co-occurrence patterns testify to meaningful collocations
  • recall: collocations may extend beyond typical 4:4 word spans
• MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):

  GIVE 172 2458
  collocate      MI
  birth          4.65
  vote           4.24
  opening        4.24
  antibiotic     4.01
  vaccination    3.91
  ingenuity      3.91
  isolate        3.43
  habit          3.43
  happiness      3.24
  away           2.91

• WordSmith: output limited to 10 collocates
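For reference, the textbook form of the mutual information score compares the observed co-occurrence frequency with the frequency expected if node and collocate were independent. The sketch below uses that basic formula; the variants implemented in WordSmith or TACT may differ (for example by adjusting for span size), and the frequencies in the example are invented for illustration only.

```python
import math

def mutual_information(joint: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """MI = log2(observed / expected), with expected = f(node) * f(collocate) / N."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(joint / expected)

# Invented counts: a rare collocate seen a few times with the node
# outscores a frequent collocate seen many times.
rare = mutual_information(joint=5, node_freq=1000, collocate_freq=10, corpus_size=100_000)
frequent = mutual_information(joint=200, node_freq=1000, collocate_freq=5000, corpus_size=100_000)
print(round(rare, 2), round(frequent, 2))
```

Because the expected frequency shrinks with the collocate's overall frequency, rare words dominate the top of an MI list, which is consistent with the observation that MI mostly surfaces idiosyncratic collocations.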
Enhancing collocation extraction
• Oliver Mason’s QWICK (www.english.bham.ac.uk):
  • MI with weighting factors for frequent words
  • unlimited display of collocates
  • multi-test package: incl. t-score; log-likelihood; modified log-likelihood; expected/observed ratio
• Remaining problems
  • effective stop-listing not quite possible with high-frequency item tests
  • collocations outside a heuristic window
  • lexical associations between collocates (synsets)
• semi-manual grouping of data essential
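Two of the tests listed above can be sketched in their common textbook forms: the t-score as (observed - expected) / sqrt(observed), and log-likelihood as the G2 statistic over a 2x2 contingency table. These are not taken from QWICK itself, whose implementations (and its 'modified log-likelihood') are not described in the talk; the function names and counts are hypothetical.

```python
import math

def t_score(joint: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """t = (O - E) / sqrt(O), with E = f(node) * f(collocate) / N."""
    expected = node_freq * collocate_freq / corpus_size
    return (joint - expected) / math.sqrt(joint)

def log_likelihood(joint: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    not_node = corpus_size - node_freq
    not_coll = corpus_size - collocate_freq
    cells = [
        (joint, node_freq, collocate_freq),                           # node with collocate
        (node_freq - joint, node_freq, not_coll),                     # node without collocate
        (collocate_freq - joint, not_node, collocate_freq),           # collocate without node
        (corpus_size - node_freq - collocate_freq + joint, not_node, not_coll),
    ]
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # an empty cell contributes nothing
            g2 += observed * math.log(observed * corpus_size / (row_total * col_total))
    return 2 * g2

# Hypothetical counts: the pair co-occurs five times more often than chance predicts.
print(round(t_score(50, 100, 100, 1000), 2), round(log_likelihood(50, 100, 100, 1000), 2))
```

Unlike MI, both measures grow with the amount of evidence, so they tend to favour frequent, well-attested pairs rather than rare ones.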
Semi-manual disambiguation (WordSmith Tools)
Problems with semi-manual editing
• CONCORDANCER:
  • not all important information can be marked: insufficient single-letter annotation
  • limited saving options: no possibility to circulate concordance data
• TEXT EDITOR:
  • no node-based display: no easy sorting
  • large corpora: handling of large or multiple files
Solution: dedicated concordancer-annotator
• Feature 1: simple built-in POS tagger [?]
• Feature 2: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors
• Feature 3: allow adding custom information to concordance lines (specialised annotation / grouping of data)
• Feature 4: allow saving concordances as text BACK into the corpus (pasting)
• Feature 5: multiple co-occurrence tests
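Features 3 and 4 can be pictured with a very small data model: each concordance line remembers where its node sits in the corpus, carries a free-text note, and can be written back. This is a speculative sketch of the idea only, not a design from the talk; the class and function names and the `word<note>` write-back format are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ConcLine:
    index: int      # token position of the node in the corpus
    left: str
    node: str
    right: str
    note: str = ""  # free-form annotation, not limited to a single letter

def concordance(tokens: list[str], node: str, span: int = 4) -> list[ConcLine]:
    """Build KWIC lines for every occurrence of `node` (case-insensitive)."""
    return [
        ConcLine(i, " ".join(tokens[max(0, i - span):i]), tok,
                 " ".join(tokens[i + 1:i + 1 + span]))
        for i, tok in enumerate(tokens)
        if tok.lower() == node.lower()
    ]

def write_back(tokens: list[str], lines: list[ConcLine]) -> list[str]:
    """Return a copy of the corpus with each note attached to its node token."""
    out = list(tokens)
    for line in lines:
        if line.note:
            out[line.index] = f"{line.node}<{line.note}>"
    return out

corpus = "you must take care of them and take part".split()
lines = concordance(corpus, "take")
lines[0].note = "delexical"  # manual grouping decision made by the analyst
print(" ".join(write_back(corpus, lines)))
```

Keeping the token position on every line is what makes pasting annotations back into the corpus possible after the lines have been sorted or grouped.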
Summary
• Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity
• With small corpora, purely automated methods of processing and analysis display insufficient precision and recall
• Loss of data may prove too costly when pedagogical conclusions are sought
• Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)
• Dedicated new type of hybrid concordancer-editor needed
This show will shortly be available from: http://main.amu.edu.pl/~przemka/rsearch.html