GIVE and TAKE: towards overcoming of the bottlenecks in learner corpus linguistics


Presentation Transcript


  1. GIVE and TAKE: towards overcoming of the bottlenecks in learner corpus linguistics
     ICAME 2001, Louvain-la-Neuve, 16-20 May 2001
     Przemysław Kaszubski, School of English, Adam Mickiewicz University, Poznań, Poland

  2. Central issues
     • Pedagogical perspective: annotation (& disambiguation) of data
     • Procedures accessible for applied (non-specialist) researchers
     • Corpus-based goals vs. corpus-driven methods

  3. Example: research into learners’ core-word based phraseology
     • EFL learners’ overuse of high-frequency words: what does it mean?
     • intensive collocability of core lexical items
     • multi-word extensions (compounds, coinages, idioms, expressions, phrasals)
     • need for an idiomatic scale
     • multi-corpus scheme with Polish advanced EFL learner data as hub data

  4. Corpus-driven methods: precision & recall problems
     • Language-based obstacles:
        • Nature of learner language
        • Cross-corpus comparability
     • Technical aspects:
        • POS tagging error margin
        • Word-sense disambiguation and / or syntactic parsing
        • Cooccurrence statistics

  5. Problem 1: the nature of learner data
     • Difference in proficiency levels essential in cross-corpus comparisons
     • Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand
     • Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’
     • Unrecognised words vs. tagger default option tag

  6. Problem 2: cross-corpus comparability
     • genre homogeneity
     • topic-skewed distribution: heuristic method of isolation: sort by standard deviation
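The "sort by standard deviation" heuristic can be sketched as follows. This is only an illustration under stated assumptions: the function name `topic_skew_ranking`, the whitespace tokenisation, and the per-1,000-token normalisation are hypothetical choices, not the procedure used in the talk. Words whose relative frequency varies most across texts float to the top as candidates for topic-driven vocabulary.

```python
from collections import Counter
from statistics import pstdev


def topic_skew_ranking(texts):
    """Rank words by the standard deviation of their relative frequency
    (per 1,000 tokens) across the texts of a corpus.

    Hypothetical helper: tokenisation is a plain lowercase whitespace split.
    """
    rel_freqs = []
    for text in texts:
        tokens = text.lower().split()
        counts = Counter(tokens)
        rel_freqs.append({w: 1000 * c / len(tokens) for w, c in counts.items()})

    vocab = set().union(*rel_freqs)
    scored = [
        (word, pstdev([rf.get(word, 0.0) for rf in rel_freqs]))
        for word in vocab
    ]
    # Highest deviation first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


texts = [
    "the pollution of the river",
    "the law of the land",
    "the pollution and the pollution",
]
ranking = topic_skew_ranking(texts)
print(ranking[:3])
```

In this toy input "the" is evenly spread (deviation 0), while "pollution" is concentrated in two of the three texts and therefore ranks first.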

  7. Technical 3: performance of POS taggers
     • Affected: extraction of lemmas meeting POS criteria
     • Tagset vs. research criteria
        • e.g. gerund = noun or verb?
     • Precision (noise in data): non-verbs tagged as verbs, e.g.:
        • Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)
        • agressive VB(lex,intr,infin) ?agressive? ...(3)
        • well-behaved VB(lex,montr,edp) ?well-behave? ...(2)
     • Recall (data ignored): verbs tagged as non-verbs, or lexical verbs tagged as auxiliaries, e.g.:
        • ... who in sharing their lives with a retarded sibiling [sic!] and taking <ADJ(ge,pos,ingp)> {taking} part in every-day care problems, may decide never to have ...
     • Untagged and heuristically tagged items: explicit marking vs default tag
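To make the two error types concrete, precision and recall for a tag-based extraction can be computed against a manually checked sample. The sketch below is an assumption-laden illustration: `precision_recall` is a hypothetical helper, and the token positions are invented for the example rather than taken from the corpus discussed here.

```python
def precision_recall(extracted, gold):
    """Compare the items a tagger-based query extracted with a manually
    verified gold set. Both arguments are collections of token positions."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: positions the tagger marked as lexical verbs ...
tagged_as_verb = {1, 2, 3, 4}
# ... versus positions a human annotator accepts as lexical verbs.
true_verbs = {2, 3, 4, 5, 6}

precision, recall = precision_recall(tagged_as_verb, true_verbs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Position 1 stands for noise (a non-verb tagged as a verb, lowering precision); positions 5 and 6 stand for ignored data (verbs tagged as something else, lowering recall).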

  8. Tracking & rectifying POS errors
     • TOSCA-ICLE tagger built-in tag editor: on-line targeting of precision & recall errors
        • Problem: insufficient query language: word OR lemma OR tag pattern
     • No built-in tagger editor:
        • Problem 1: comprehensive check or intuitive selection
        • Problem 2: most concordancers are browsers only
        • Problem 3: large corpora may not load into common editors
        • Use of text-processing UNIX tools
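A query that combines a word pattern AND a tag pattern, run line by line so that a large corpus never has to be loaded whole, might look like the sketch below. It assumes a simple `word_TAG` token format, which is a hypothetical stand-in and not the TOSCA-ICLE output format; `find_suspect_tokens` is likewise an invented name.

```python
import re


def find_suspect_tokens(lines, word_pattern, tag_pattern):
    """Yield (line_number, word, tag) for tokens whose word AND tag both match.

    `lines` can be any iterable of strings, e.g. an open file object, so the
    corpus is streamed rather than loaded into memory.
    Assumes tokens are written as word_TAG (hypothetical format).
    """
    word_re = re.compile(word_pattern)
    tag_re = re.compile(tag_pattern)
    for line_no, line in enumerate(lines, start=1):
        for token in line.split():
            word, sep, tag = token.rpartition("_")
            if sep and word_re.fullmatch(word) and tag_re.fullmatch(tag):
                yield line_no, word, tag


sample = [
    "and_CONJ taking_ADJ part_N",
    "an_ART interesting_ADJ book_N",
    "agressive_VB behaviour_N",
]
# Candidate recall errors: -ing forms tagged as adjectives.
for hit in find_suspect_tokens(sample, r"\w+ing", r"ADJ"):
    print(hit)
```

The hits are only candidates: "interesting" is a legitimate adjective, so the list still needs a manual check.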

  9. Technical 4: semantic disambiguation and associations
     • sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998:4)
     • automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project
     • University of Lancaster disambiguation tool (ACASD package, Thomas & Wilson 1996)
     • Tools unavailable or not at an implementable stage
     • Consequently: manual, POS-aided disambiguation

  10. Technical 5: corpus-driven phraseology extraction (1)
     • collocation vs cooccurrence & adjacency
     • word clusters
        • precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
        • recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs ‘TAKE good care of’; ‘the chance which were not eager to take’)
     • stop-listing not quite possible with high-frequency items (except Ted Pedersen’s ‘Bigram Statistics Package’)
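The recall problem with a fixed 4:4 span can be shown with a minimal window-based collocate counter. This is a sketch, not the algorithm of any tool named here: `window_collocates` is a hypothetical function that matches word forms rather than lemmas and tokenises on whitespace.

```python
from collections import Counter


def window_collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of
    each occurrence of `node` (word-form match, no lemmatisation)."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


near = "you should take good care of him".split()
far = "the chance which were not eager to take".split()

print(window_collocates(near, "take")["care"])          # inside the 4:4 window
print(window_collocates(far, "take")["chance"])         # outside the 4:4 window
print(window_collocates(far, "take", span=6)["chance"]) # found with a wider span
```

In the second example ‘chance’ sits six tokens to the left of ‘take’, so a 4:4 window misses it; widening the span recovers it but also admits more noise.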

  11. Technical 5: corpus-driven phraseology extraction (2)
     • co-occurrence statistics (WordSmith, TACT)
        • precision: not all co-occurrence patterns testify to meaningful collocations
        • recall: collocations may extend beyond typical 4:4 word spans
     • MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):
        GIVE 172 2458
        birth 4.65
        vote 4.24
        opening 4.24
        antibiotic 4.01
        vaccination 3.91
        ingenuity 3.91
        isolate 3.43
        habit 3.43
        happiness 3.24
        away 2.91
     • WordSmith: outputs only 10 collocates
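Why MI favours ‘idiosyncratic’ collocates can be seen from one common formulation of the score, log2 of observed over expected co-occurrence. The sketch below assumes that simple formulation; WordSmith and TACT may compute MI differently (for example with span adjustments), and all frequencies are invented for illustration.

```python
import math


def mutual_information(joint, node_freq, coll_freq, corpus_size):
    """MI = log2(observed / expected), with
    expected = node_freq * coll_freq / corpus_size.
    One common formulation; not necessarily what a given tool implements."""
    expected = node_freq * coll_freq / corpus_size
    return math.log2(joint / expected)


# Invented counts in a 1,000-token corpus with a node occurring 100 times.
print(mutual_information(8, 100, 10, 1000))   # 3.0

rare = mutual_information(2, 100, 2, 1000)        # rare word, always with the node
frequent = mutual_information(20, 100, 100, 1000)  # frequent word, often with the node
print(rare > frequent)
```

A word seen only twice, both times next to the node, outscores a collocate that co-occurs twenty times, which is the bias the MI list above illustrates.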

  12. Enhancing collocation extraction
     • Oliver Mason’s QWICK (www.english.bham.ac.uk):
        • MI with weighting factors for frequent words
        • unlimited display of collocates
        • multi-test package: incl. t-score; log-likelihood; modified log likelihood; expected/observed ratio
     • Remaining problems
        • effective stop-listing not quite possible with high-frequency item tests
        • collocations outside a heuristic window
        • lexical associations between collocates (synsets)
        • semi-manual grouping of data essential
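Two of the tests listed, t-score and log-likelihood, can be sketched in their textbook forms. This is not QWICK's implementation: the functions are hypothetical, treat the counts as a simple 2x2 contingency table over corpus tokens, and ignore window-size adjustments.

```python
import math


def t_score(joint, node_freq, coll_freq, corpus_size):
    """t = (observed - expected) / sqrt(observed)."""
    expected = node_freq * coll_freq / corpus_size
    return (joint - expected) / math.sqrt(joint)


def log_likelihood(joint, node_freq, coll_freq, corpus_size):
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    o11 = joint                                          # node with collocate
    o12 = node_freq - joint                              # node without collocate
    o21 = coll_freq - joint                              # collocate without node
    o22 = corpus_size - node_freq - coll_freq + joint    # neither
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    cells = ((o11, row1, col1), (o12, row1, col2),
             (o21, row2, col1), (o22, row2, col2))
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            g2 += observed * math.log(observed * corpus_size / (row * col))
    return 2 * g2


# Invented counts: co-occurrence exactly at chance level vs well above it.
print(t_score(10, 100, 100, 1000), log_likelihood(10, 100, 100, 1000))
print(t_score(50, 100, 100, 1000), log_likelihood(50, 100, 100, 1000))
```

Unlike MI, both scores grow with the amount of evidence, so they tend to rank frequent collocates higher; running several tests side by side is what a multi-test package makes convenient.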

  13. Semi-manual disambiguation (WordSmith Tools)

  14. Problems with semi-manual editing
     • CONCORDANCER:
        • not all important information can be marked: insufficient single letter annotation
        • limited saving options: no possibility to circulate concordance data
     • TEXT EDITOR:
        • no node-based display: no easy sorting
        • large corpora: handling of large or multiple files

  15. Solution: dedicated concordancer-annotator
     • Feature 1: simple built-in POS tagger [?]
     • Feature 2: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors
     • Feature 3: allow adding custom information to concordance lines (specialised annotation / grouping of data)
     • Feature 4: allow saving concordances as text BACK into the corpus (pasting)
     • Feature 5: multiple co-occurrence tests
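Features 3 and 4 can be illustrated with a toy model of the proposed tool: concordance lines that carry a free-text annotation and can be written back into the token stream. Everything here is hypothetical, including the function names, the dictionary layout, and the `word<NOTE>` write-back format; it sketches the idea and does not describe an existing program.

```python
def concordance(tokens, node, width=4):
    """Build KWIC lines for `node`; each line keeps its corpus position and
    an initially empty `note` field for custom annotation (Feature 3)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            lines.append({
                "pos": i,
                "left": " ".join(tokens[max(0, i - width):i]),
                "node": token,
                "right": " ".join(tokens[i + 1:i + 1 + width]),
                "note": "",
            })
    return lines


def write_back(tokens, lines):
    """Return a copy of the corpus with annotated nodes marked as
    word<NOTE> (Feature 4; the markup format is an assumption)."""
    out = list(tokens)
    for line in lines:
        if line["note"]:
            out[line["pos"]] = f'{line["node"]}<{line["note"]}>'
    return out


tokens = "You must take care of it and take a bus".split()
lines = concordance(tokens, "take")
lines[0]["note"] = "MWU"   # manual decision: 'take care of' is a multi-word unit
print(" ".join(write_back(tokens, lines)))
```

Because each line remembers its position, annotation done in the node-sorted concordance view survives when the data is saved back as running text.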

  16. Summary
     • Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity
     • With small corpora, mere automated methods of processing and analysis display insufficient precision and recall
     • Loss of data may prove too costly when pedagogical conclusions are sought
     • Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)
     • Dedicated new type of hybrid concordancer-editor needed

  17. This show will shortly be available from: http://main.amu.edu.pl/~przemka/rsearch.html
