A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project • Adam Kilgarriff • Lexical Computing Ltd • http://www.sketchengine.co.uk
English Profile • From 2006 • Cambridge Univ, Univ Press, ESOL (+ others) • Goal • for each CEFR level, find characteristic lexis and grammar • CEFR: Common European Framework of Reference • A1, A2: Beginner • B1, B2: Intermediate • C1, C2: Advanced • Main resource: CLC Kilgarriff
Cambridge Learner Corpus (CLC) • Since 1993 • Leading resource • CUP and Cambridge Assessment • For better dictionaries, ELT courses, tests • Material: all from exams (levels A1-C2) • 45m words; 22m error-tagged • 200,000 scripts, 138 L1s, 203 nationalities
Sketch Engine • Leading corpus tool • Word sketches • One-page summaries of a word’s grammatical and collocational behaviour • In use at OUP, CUP, Collins, Macmillan, INL … • 55 languages • 175 corpora • Since May including CHILDES: demo • Since last year including CLC
Macmillan English Dictionary For Advanced Learners Ed: Rundell, 2002
Error-coded corpus • Challenge • Intuitive to search for x • anywhere • only where it is part of an error • only where it is part of a correction • where x can be a word, phrase, grammar pattern … • Requirement for CLC in Sketch Engine
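The three search scopes above can be sketched in a few lines. This is only an illustration: the slides do not describe the CLC's actual error markup or the Sketch Engine query syntax, so the `Token` class, its `role` values and the example sentence are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical token model: NOT the CLC's real error-coding scheme.
@dataclass(frozen=True)
class Token:
    text: str
    role: str  # "plain", "error" (the learner's form) or "correction"

def search(tokens, x, scope="anywhere"):
    """Return positions of word x, optionally only inside errors or corrections."""
    if scope not in {"anywhere", "error", "correction"}:
        raise ValueError(f"unknown scope: {scope}")
    return [
        i for i, tok in enumerate(tokens)
        if tok.text.lower() == x.lower()
        and (scope == "anywhere" or tok.role == scope)
    ]

# Invented example: "I depend of my parents", with "of" corrected to "on".
sentence = [
    Token("I", "plain"), Token("depend", "plain"),
    Token("of", "error"), Token("on", "correction"),
    Token("my", "plain"), Token("parents", "plain"),
]

print(search(sentence, "of"))                # every occurrence
print(search(sentence, "of", "error"))       # only where it is part of an error
print(search(sentence, "on", "correction"))  # only where it is part of a correction
```

Here x is a single word; phrases and grammar patterns would need a richer query language than this sketch provides.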
Error-coded corpora in SkE • demo
HOO / HOO+ • Helping Our Own • HOO: English-NNS NLP researchers • Developer = user: motivation • Shared task/competitive evaluation • Organisers define task and prepare ‘gold standard’ • Teams participate by running their software over test data • Six teams (incl Tübingen), workshop end Sept
HOO+ (2012) • Probably • English: learner data from CLC • Other languages? • Tasks • Essay scoring • Determiner, preposition errors • ? • http://www.clt.mq.edu.au/research/projects/hoo/
DANTE Highlights of English lexicography
DANTE http://webdante.com
The KELLY Project • EU Lifelong Learning Project • Word cards • 9 languages • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish • All 36 pairs • Words the learner should know (at A1 … C2) • Partners • Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd
Interesting question • How close to purely corpus-based can a pedagogic list be?
Method • Take a general corpus • Count • Review, add, delete using other lists and corpora • Translate (72 directed language pairs) • Words not in source list which occur in translations: • Review source list • http://kelly.sketchengine.co.uk
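The pair counts and the final review step can be checked with a short sketch. The word lists are invented and `review_candidates` is a hypothetical helper, not the project's own tooling.

```python
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# 9 languages give 36 unordered pairs and 72 directed (source -> target) pairs.
unordered_pairs = list(combinations(LANGUAGES, 2))
directed_pairs = list(permutations(LANGUAGES, 2))

def review_candidates(source_list, translations_into_language):
    """Words that turn up as translations into a language but are missing
    from that language's own list, so the list should be reviewed."""
    return sorted(set(translations_into_language) - set(source_list))

# Invented toy data: the English list, and English words produced when
# translating the other languages' lists into English.
english_list = {"sun", "music"}
into_english = ["sun", "library", "music", "library"]

print(len(unordered_pairs), len(directed_pairs))
print(review_candidates(english_list, into_english))
```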
Symmetrical pairs: <x,y> and <y,x> • Cliques: • For x, y, z, … all pairs are symmetrical • 9-language cliques (English members) • hospital library music sun theory
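A minimal sketch of the two definitions, assuming translations are stored as a set of directed links between (language, word) nodes. That representation and the three-language toy data are assumptions made for illustration.

```python
from itertools import combinations

def is_symmetric(x, y, directed):
    """<x,y> is symmetrical when x translates to y and y translates back to x."""
    return (x, y) in directed and (y, x) in directed

def is_clique(words, directed):
    """A clique: every pair of its members is a symmetrical pair."""
    return all(is_symmetric(x, y, directed) for x, y in combinations(words, 2))

# Toy data (invented): nodes are (language, word); links are directed.
en, sv, it = ("en", "sun"), ("sv", "sol"), ("it", "sole")
directed = {(en, sv), (sv, en), (en, it), (it, en), (sv, it), (it, sv)}
one_way = directed - {(it, sv)}   # drop one direction of the Swedish-Italian pair

print(is_clique([en, sv, it], directed))  # all three pairs are symmetrical
print(is_clique([en, sv, it], one_way))   # the Swedish-Italian pair is now one-way
```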
Web corpora • Replaceable or replacable? • http://googlefight.com • http://looglefight.com
The web is • Very very large • Most languages • Most language types • Up-to-date • Free • Instant access
Web corpus types • Large, general corpora • Small, specialised corpora • Specially for translators
Basic steps • Gather pages • CSE hits • Select and gather whole sites • General crawl • Filter • De-duplicate • Linguistic processing • Load into corpus tool
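The filter and de-duplicate steps can be sketched as below. This is a toy version under stated assumptions: the five-word threshold is arbitrary, and only exact duplicates (after whitespace and case normalisation) are caught, whereas the slides do not say which filtering or de-duplication method was actually used.

```python
import hashlib
import re

def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def keep_page(text, min_words=5):
    """Toy filter (assumed threshold): drop pages with too little running text."""
    return len(text.split()) >= min_words

def deduplicate(pages):
    """Keep the first copy of each page, comparing hashes of normalised text."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

def build_corpus(pages):
    """Filter, then de-duplicate; gathering and linguistic processing are omitted."""
    return deduplicate([p for p in pages if keep_page(p)])

pages = [
    "The cat sat on the mat today.",
    "the  cat sat on the mat today.",         # same text, different spacing and case
    "Click here",                             # too short: filtered out
    "A quite different page about corpora.",
]
print(build_corpus(pages))
```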
WaC family corpora • 100m – 2b word corpora • 2-month project each • All major world languages available in Sketch Engine • Currently 42 languages • Growing monthly • Pioneers: Marco Baroni, Serge Sharoff • Corpus Factory • Seeds: mid-frequency words from ‘core vocab’ lists and corpora • Google on seed words, then crawl
How good are they? • How to assess? • Hard question, open research topic • Good coverage • Newspapers: news, politics bias • Web corpora: also cover personal, kitchen vocab • Web corpus / BNC / journalism corpus • First two are close
Evaluating word sketches • 11 years 1999-2011 • Feedback • Good but anecdotal • Formal evaluation • Method also lets us evaluate corpora
Goal • Collocations dictionary • Model: Oxford Collocations Dictionary • Publication-quality • Ask a lexicographer • For 42 headwords • For 20 best collocates per headword • “Should we include this collocation in a published dictionary?”
Sample of headwords • Nouns, verbs, adjectives, random • High (top 3000) • N: space solution opinion mass corporation leader • V: serve incorporate mix desire • Adj: high detailed open academic • Mid (3000-9999) • N: cattle repayment fundraising elder biologist sanitation • V: grieve classify ascertain implant • Adj: adjacent eldest prolific ill • Low (10,000-30,000) • N: predicament adulterer bake bombshell candy shellfish • V: slap outgrow plow traipse • Adj: neoclassical votive adulterous expandable
Precision and recall • A request for information • Find me all the fat cats
High recall • Lots of responses • Maybe not all good
High precision • Fewer hits • Higher confidence
Precision and recall • We test precision • Recall is harder • How do we find all the collocations that the system should have found? • Current work • 200 collocates per headword • Selected from • All the corpora we have • Various parameter settings • Plus just-in-time evaluation for 'new' collocates
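For reference, the two measures under their standard definitions, on invented data. The hard part noted above is the `gold` set: recall needs the full list of collocations the system should have found, while precision only needs judgements on what it returned.

```python
def precision(returned, relevant):
    """Share of the returned items that are good."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Share of the good items that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

# Invented example: collocations a system proposes for "tea" vs. a gold list.
system = ["strong tea", "powerful tea", "green tea", "tea bag"]
gold = ["strong tea", "green tea", "tea bag", "herbal tea", "iced tea", "cup of tea"]

print(precision(system, gold))  # 3 of the 4 proposals are in the gold list
print(recall(system, gold))     # 3 of the 6 gold items were proposed
```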
Four languages, three families • Dutch: ANW, 102m-word lexicographic corpus • English: UKWaC, 1.5b-word web corpus • Japanese: JpWaC, 400m-word web corpus • Slovene: FidaPlus, 620m-word lexicographic corpus
User evaluation • Evaluate whole system • Will it help with my task? • E.g. preparing a collocations dictionary • Contrast: developer evaluation • Can I make the system better? • Evaluate each module separately • Current work
Components • Corpus • NLP tools • Segmenter, lemmatiser, POS-tagger • Sketch grammar • Statistics
Practicalities • Interface • Good, Good-but • Merge to good • Maybe, Maybe-specialised, Bad • Merge to bad • For each language • Two/three linguists/lexicographers • If they disagree • Don't use for computing performance
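One way to read this scoring procedure, as a sketch: merge the five interface labels into good/bad, leave out items on which the judges disagree after merging, and report the share of good items among the rest. The data structure and the example judgements are invented, and treating disagreement as "differs after merging" is an assumption.

```python
MERGE = {"Good": "good", "Good-but": "good",
         "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad"}

def score(judgements):
    """judgements maps each collocation to the labels given by each judge.
    Returns (precision over agreed items, number of items left out)."""
    good = agreed = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:   # judges disagree: not used for computing performance
            continue
        agreed += 1
        if merged == {"good"}:
            good += 1
    return (good / agreed if agreed else 0.0), len(judgements) - agreed

# Invented judgements from two judges.
judgements = {
    "strong tea":   ["Good", "Good-but"],           # both merge to good
    "tea bag":      ["Good", "Good"],
    "tea the":      ["Bad", "Maybe"],               # both merge to bad
    "tea ceremony": ["Good", "Maybe-specialised"],  # disagreement: left out
}
result, left_out = score(judgements)
print(result, left_out)
```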
Results • Dutch 66% • English 71% • Japanese 87% • Slovene 71%
Two thirds of a collocations dictionary can be gathered automatically
Thank you • http://www.sketchengine.co.uk
Lexicography: finding facts about words • collocations • grammatical patterns • idioms • synonyms • meanings • translations
Four ages of corpus lexicography
Age 1: Pre computer • Oxford English Dictionary: • 5 million index cards
Age 2: KWIC Concordances • From 1980 • Computerised • Overhauled lexicography
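A keyword-in-context (KWIC) concordance lists every occurrence of a word with a few words of context on either side. A minimal sketch on an invented sentence; the window width of 3 is an arbitrary choice.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) line per occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "the fat cat sat on the mat while another cat slept".split()
for left, key, right in kwic(text, "cat"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20}  {key}  {right}")
```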
Age 2: limitations • As corpora get bigger: too much data • 50 lines for a word: read all • 500 lines: could read all, but it takes a long time • 5000 lines: no
Age 3: Collocation statistics • Problem: too much data - how to summarise? • Solution: list of words occurring in neighbourhood of headword, with frequencies • Sorted by salience
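A sketch of such a list on invented data. The slide does not name the salience statistic, so logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))), is used here purely as an assumed example; the window size is also arbitrary.

```python
import math
from collections import Counter

def collocates(tokens, headword, window=2):
    """List words near the headword with their co-occurrence frequency,
    sorted by logDice (an assumed salience score; the slides name none)."""
    freq = Counter(tokens)
    near = set()                      # positions within the window of any occurrence
    for i, tok in enumerate(tokens):
        if tok == headword:
            near.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    cooc = Counter(tokens[j] for j in near if tokens[j] != headword)
    scored = [
        (word, f, 14 + math.log2(2 * f / (freq[headword] + freq[word])))
        for word, f in cooc.items()
    ]
    # Highest salience first; ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[2], item[0]))

tokens = ["strong", "tea", "the", "tea", "strong", "tea",
          "the", "cat", "the", "dog", "the", "mat"]
for word, f, salience in collocates(tokens, "tea", window=1):
    print(word, f, round(salience, 2))
```

In this toy text "strong" and "the" occur next to "tea" equally often, but "strong" ranks higher because "the" is also frequent elsewhere; that is the point of sorting by salience and not by raw frequency.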
Age-3 collocation statistics: limitations • Lists contain junk • Unsorted for type: mixes together adverbs, subjects, objects, prepositions • What we really want: • noise-free lists • one list for each grammatical relation
Age 4: The word sketch • Large well-balanced corpus • Parse to find subjects, objects, heads, modifiers etc • One list for each grammatical relation • Statistics to sort each list, as before
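The grouping step can be sketched as follows, assuming the parser has already produced (relation, headword, collocate) triples. The relation names and triples are invented, and raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group a headword's collocates into one ranked list per grammatical relation."""
    counts = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            counts[relation][collocate] += 1
    # Raw frequency is a stand-in here for a proper salience statistic.
    return {rel: [w for w, _ in c.most_common()] for rel, c in counts.items()}

# Invented parser output: (relation, headword, collocate).
triples = [
    ("object_of", "tea", "drink"),
    ("modifier", "tea", "strong"),
    ("object_of", "tea", "drink"),
    ("object_of", "tea", "pour"),
    ("modifier", "tea", "green"),
    ("modifier", "tea", "strong"),
    ("object_of", "coffee", "drink"),
]
for relation, words in word_sketch(triples, "tea").items():
    print(relation, words)
```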
Working practice • Lexicographers mainly used sketches, not concordances • Missed less, more consistent • Faster
Euralex 2002 • Can I have them for my language, please?