
Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project. Adam Kilgarriff, Lexical Computing Ltd, http://www.sketchengine.co.uk

Presentation Transcript


  1. A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project • Adam Kilgarriff • Lexical Computing Ltd • http://www.sketchengine.co.uk

  2. English Profile • From 2006 • Cambridge Univ, Univ Press, ESOL (+ others) • Goal • for each CEFR level, find characteristic lexis and grammar • CEFR: Common European Framework of Reference • A1, A2: Beginner • B1, B2: Intermediate • C1, C2: Advanced • Main resource: CLC KIlgarriff

  3. Cambridge Learner Corpus (CLC) • Since 1993 • Leading resource • CUP and Cambridge Assessment • For better dictionaries, ELT courses, tests • Material: all from exams (levels A1-C2) • 45m words; 22m error-tagged • 200,000 scripts, 138 L1s, 203 nationalities

  4. Sketch Engine • Leading corpus tool • Word sketches • One-page summaries of a word’s grammatical and collocational behaviour • In use at OUP, CUP, Collins, Macmillan, INL … • 55 languages • 175 corpora • Since May including CHILDES: demo • Since last year including CLC

  5. Macmillan English Dictionary For Advanced Learners • Ed: Rundell, 2002

  6. Error-coded corpus • Challenge • Intuitive to search for x • anywhere • only where it is part of an error • only where it is part of a correction • where x can be a word, phrase, grammar pattern … • Requirement for CLC in Sketch Engine
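
The three search scopes on this slide can be illustrated with a minimal sketch. It assumes a hypothetical token representation in which each token carries a role label ("plain", "error" or "correction"); this is not the CLC's actual error coding or Sketch Engine's query syntax, and the sentence is invented.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    role: str  # hypothetical labels: "plain", "error" or "correction"


def search(tokens, word, scope="anywhere"):
    """Return the positions of `word`, restricted to the given scope."""
    allowed = {
        "anywhere": {"plain", "error", "correction"},
        "error": {"error"},
        "correction": {"correction"},
    }[scope]
    return [
        i for i, tok in enumerate(tokens)
        if tok.text.lower() == word.lower() and tok.role in allowed
    ]


# Invented example: the learner wrote "on"; the annotator corrected it to "in".
sentence = [
    Token("She", "plain"), Token("lives", "plain"),
    Token("on", "error"), Token("in", "correction"),
    Token("London", "plain"), Token("in", "plain"), Token("winter", "plain"),
]

print(search(sentence, "in", "anywhere"))
print(search(sentence, "in", "correction"))
print(search(sentence, "on", "error"))
```

The same scope argument would extend to phrases or grammar patterns if the matching condition were generalised beyond a single word.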

  7. Error-coded corpora in SkE • demo

  8. HOO / HOO+ • Helping Our Own • HOO: English-NNS NLP researchers • Developer = user: motivation • Shared task/competitive evaluation • Organisers define task and prepare ‘gold standard’ • Teams participate by running their software over test data • Six teams (incl Tübingen), workshop end Sept

  9. HOO+ (2012) • Probably • English: learner data from CLC • Other languages? • Tasks • Essay scoring • Determiner, preposition errors • ? • http://www.clt.mq.edu.au/research/projects/hoo/

  10. DANTE Highlights of English lexicography

  11. DANTE

  12. DANTE

  13. DANTE

  14. DANTE • http://webdante.com

  15. The KELLY Project • EU Lifelong Learning Project • Word cards • 9 languages • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish • All 36 pairs • Words the learner should know (at A1 … C2) • Partners • Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

  16. Interesting question • How close to purely corpus-based can a pedagogic list be?

  17. Method • Take a general corpus • Count • Review, add, delete using other lists and corpora • Translate (72 directed-lg-pairs) • Words not in source list which occur in translations: • Review source list • http://kelly.sketchengine.co.uk
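
A minimal sketch of the counting and checking steps, under stated assumptions: the corpus is already tokenised, the candidate list is simply the most frequent word forms, and the helper names (`candidate_list`, `missing_from_source`) are hypothetical rather than the project's own tooling. It also shows where the figures of 36 pairs and 72 directed pairs come from.

```python
from collections import Counter
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]


def candidate_list(tokens, size):
    """Take a corpus (as tokens), count, and keep the `size` most frequent words."""
    counts = Counter(tok.lower() for tok in tokens)
    return [word for word, _ in counts.most_common(size)]


def missing_from_source(source_list, translated_words):
    """Words that turn up as translations but are absent from the source list.

    These are the items that would prompt a review of the source list.
    """
    return sorted(set(translated_words) - set(source_list))


# 9 languages give 36 unordered pairs and 72 directed (source -> target) pairs.
unordered_pairs = list(combinations(LANGUAGES, 2))
directed_pairs = list(permutations(LANGUAGES, 2))
print(len(unordered_pairs), len(directed_pairs))
```

The manual review step (add, delete using other lists and corpora) is deliberately left out, since it depends on editorial judgement.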

  18. Symmetrical pairs: <x,y> and <y,x> • Cliques: • For x, y, z, … all pairs are symmetrical • 9-language cliques (English members) • hospital library music sun theory
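
The definitions of symmetrical pairs and cliques can be written down as a short sketch. The data structure (a dictionary from a `(language, word)` item to the set of items it was translated to) and the three-language toy data are assumptions made for illustration, not the project's database.

```python
from itertools import combinations


def is_symmetric(translations, x, y):
    """True if x was translated as y AND y was translated as x."""
    return y in translations.get(x, set()) and x in translations.get(y, set())


def is_clique(translations, items):
    """True if every pair drawn from `items` is symmetrical."""
    return all(is_symmetric(translations, x, y)
               for x, y in combinations(items, 2))


# Invented toy data: the Swedish -> Italian direction is missing.
translations = {
    ("en", "music"): {("it", "musica"), ("sv", "musik")},
    ("it", "musica"): {("en", "music"), ("sv", "musik")},
    ("sv", "musik"): {("en", "music")},
}
items = [("en", "music"), ("it", "musica"), ("sv", "musik")]

print(is_symmetric(translations, items[0], items[1]))
print(is_clique(translations, items))
```

A 9-language clique is the same check applied to nine items, one per language, which requires all 36 pairs to be symmetrical.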

  19. Web corpora • Replaceable or replacable? • http://googlefight.com • http://looglefight.com

  20. The web is • Very very large • Most languages • Most language types • Up-to-date • Free • Instant access

  21. Web corpus types • Large, general corpora • Small, specialised corpora • Specially for translators

  22. Basic steps • Gather pages • CSE hits • Select and gather whole sites • General crawl • Filter • De-duplicate • Linguistic processing • Load into corpus tool
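
The filter and de-duplicate steps can be sketched as follows. This is a simplified illustration, not the pipeline used for the corpora described here: the filter is a bare minimum-length check, the de-duplication only catches exact duplicates after whitespace and case normalisation (near-duplicate detection is not attempted), and the threshold `min_words` is an arbitrary assumption.

```python
import hashlib


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_pages(pages, min_words=5):
    """Drop pages too short to contain running text (threshold is an assumption)."""
    return [page for page in pages if len(page.split()) >= min_words]


def deduplicate(pages):
    """Keep the first copy of each page, comparing hashes of normalised text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


# Invented pages standing in for the output of the gathering step.
pages = [
    "Home | Login",
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",
    "A different page with enough words here.",
]
clean = deduplicate(filter_pages(pages))
print(len(clean))
```

Linguistic processing and loading into a corpus tool would follow on the `clean` list.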

  23. WaC family corpora • 100m – 2b word corpora • 2-month project each • All major world languages available in Sketch Engine • Currently 42 languages • Growing monthly • Pioneers: Marco Baroni, Serge Sharoff • Corpus Factory • Seeds: mid-frequency words from ‘core vocab’ lists and corpora • Google on seed words, then crawl

  24. How good are they? • How to assess? • Hard question, open research topic • Good coverage • Newspapers: news, politics bias • Web corpora: also cover personal, kitchen vocab • Web corpus / BNC / journalism corpus • First two are close

  25. Evaluating word sketches • 11 years 1999-2011 • Feedback • Good but anecdotal • Formal evaluation • Method also lets us evaluate corpora

  26. Goal • Collocations dictionary • Model: Oxford Collocations Dictionary • Publication-quality • Ask a lexicographer • For 42 headwords • For 20 best collocates per headword • “should we include this collocation in a published dictionary?”

  27. Sample of headwords • Nouns, verbs, adjectives, random • High (Top 3000) • N: space solution opinion mass corporation leader • V: serve incorporate mix desire • Adj: high detailed open academic • Mid (3000-9999) • N: cattle repayment fundraising elder biologist sanitation • V: grieve classify ascertain implant • Adj: adjacent eldest prolific ill • Low (10,000-30,000) • N: predicament adulterer bake bombshell candy shellfish • V: slap outgrow plow traipse • Adj: neoclassical votive adulterous expandable

  28. Precision and recall • a request for information • Find me all the fat cats

  29. High recall • Lots of responses • Maybe not all good

  30. High precision • Fewer hits • Higher confidence

  31. Precision and recall • We test precision • Recall is harder • How do we find all the collocations that the system should have found? • Current work • 200 collocates per headword • Selected from • All the corpora we have • Various parameter settings • Plus just-in-time evaluation for 'new' collocates
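
For reference, the two measures can be computed as below, using the standard set-based definitions. The "fat cats" data is invented to echo the earlier slide; the point it illustrates is that precision needs only judgements on what the system returned, while recall needs the full set of items it should have found.

```python
def precision(returned, relevant):
    """Share of returned items that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Share of relevant items that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


# Invented example: the system returns three animals, two of them fat cats;
# there are four fat cats in total.
returned = {"tom", "felix", "rex"}
relevant = {"tom", "felix", "garfield", "sylvester"}

print(precision(returned, relevant))
print(recall(returned, relevant))
```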

  32. Four languages, three families • Dutch: ANW, 102m-word lexicographic corpus • English: UKWaC, 1.5b web corpus • Japanese: JpWaC, 400m web corpus • Slovene: FidaPlus, 620m lexicographic corpus

  33. User evaluation • Evaluate whole system • Will it help with my task • Eg preparing a collocations dictionary • Contrast: developer evaluation • Can I make the system better? • Evaluate each module separately • Current work

  34. Components • Corpus • NLP tools • Segmenter, lemmatiser, POS-tagger • Sketch grammar • Statistics

  35. Practicalities • Interface • Good, Good-but • Merge to good • Maybe, Maybe-specialised, Bad • Merge to bad • For each language • Two/three linguists/lexicographers • If they disagree • Don't use for computing performance
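
The scoring procedure on this slide can be sketched as follows. Two points are assumptions: that disagreement is checked after the five labels have been merged to good/bad, and that the reported figure is the share of good items among those the judges agree on. The judgements are invented.

```python
# Merge the five interface labels into two classes, as on the slide.
MERGE = {
    "Good": "good", "Good-but": "good",
    "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad",
}


def score(judgements):
    """Share of 'good' collocations among those all judges agree on.

    `judgements` maps a collocation to its list of labels, one per judge.
    Items where the judges disagree (after merging) are left out.
    """
    agreed = {}
    for collocation, labels in judgements.items():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:
            agreed[collocation] = merged.pop()
    if not agreed:
        return None
    good = sum(1 for verdict in agreed.values() if verdict == "good")
    return good / len(agreed)


# Invented judgements from two judges.
judgements = {
    "high price": ["Good", "Good-but"],
    "high tree": ["Bad", "Maybe"],
    "high hopes": ["Good", "Bad"],      # disagreement: not counted
    "high standard": ["Good", "Good"],
}
print(score(judgements))
```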

  36. Results • Dutch 66% • English 71% • Japanese 87% • Slovene 71%

  37. Two thirds of a collocations dictionary can be gathered automatically

  38. Thank you • http://www.sketchengine.co.uk

  40. Lexicography: finding facts about words • collocations • grammatical patterns • idioms • synonyms • meanings • translations

  41. Four ages of corpus lexicography

  42. Age 1: Pre-computer • Oxford English Dictionary: • 5 million index cards

  43. Age 2: KWIC Concordances • From 1980 • Computerised • Overhauled lexicography
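
A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side. A minimal sketch, assuming a pre-tokenised text and a context width counted in tokens:

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented example text.
tokens = "the fat cat sat on the mat near the other cat".split()
for left, key, right in kwic(tokens, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12}  {key}  {right}")
```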

  44. Age 2: limitations • as corpora get bigger: too much data • 50 lines for a word: read all • 500 lines: could read all, takes a long time, slow • 5000 lines: no

  45. Age 3: Collocation statistics • Problem: too much data - how to summarise? • Solution: list of words occurring in neighbourhood of headword, with frequencies • Sorted by salience
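
A minimal sketch of such a list: count the words within a fixed window of the headword, then sort by a salience score. The slide does not name the statistic; logDice, computed as 14 + log2(2·f(x,y) / (f(x) + f(y))), is used here as one possible choice, and the window size and example text are assumptions.

```python
import math
from collections import Counter


def collocates(tokens, headword, window=2):
    """List (word, co-occurrence frequency, logDice), most salient first."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    co[tokens[j]] += 1
    scored = []
    for word, f_xy in co.items():
        # logDice is an assumed choice of salience measure, not from the slide.
        log_dice = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        scored.append((word, f_xy, round(log_dice, 2)))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Invented example text.
tokens = "strong tea and strong coffee but weak tea".split()
for row in collocates(tokens, "tea", window=1):
    print(row)
```

Note that the resulting list mixes all kinds of neighbours together regardless of their grammatical relation to the headword.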

  46. Age-3 collocation statistics: limitations • Lists contain junk • unsorted for type – mixes together adverbs, subjects, objects, prepositions • What we really want: • noise-free lists • one list for each grammatical relation

  47. Age 4: The word sketch • Large well-balanced corpus • Parse to find subjects, objects, heads, modifiers etc • One list for each grammatical relation • Statistics to sort each list, as before
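
The "one list for each grammatical relation" idea can be sketched as a grouping step over parser output. The sketch assumes the parse has already produced `(relation, headword, collocate)` triples, uses hypothetical relation names, and sorts by raw frequency as a stand-in for a salience statistic; it is not Sketch Engine's implementation.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one sorted list per relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Invented triples standing in for the output of a parser.
triples = [
    ("object_of", "tea", "drink"),
    ("object_of", "tea", "drink"),
    ("object_of", "tea", "pour"),
    ("modifier", "tea", "strong"),
    ("modifier", "coffee", "strong"),
]
for relation, items in word_sketch(triples, "tea").items():
    print(relation, items)
```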

  48. Working practice • Lexicographers mainly used sketches not concordances • missed less, more consistent • Faster

  49. Euralex 2002

  50. Euralex 2002 • Can I have them for my language please
