1 / 47

Using Corpora in Linguistics and Lexicography

Using Corpora in Linguistics and Lexicography. Adam Kilgarriff Lexical Computing Ltd Universities of Leeds, Sussex, UK. Outline. Precision and recall History of corpus lexicography The Sketch Engine Demo Web corpora Corpus and dictionary. Find me all the fat cats.

geordi
Download Presentation

Using Corpora in Linguistics and Lexicography

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Corpora in Linguistics and Lexicography Adam Kilgarriff Lexical Computing Ltd Universities of Leeds, Sussex, UK

  2. Outline • Precision and recall • History of corpus lexicography • The Sketch Engine • Demo • Web corpora • Corpus and dictionary Kilgarriff

  3. Find me all the fat cats • a request for information Kilgarriff

  4. High recall • Lots of responses • Maybe not all good Kilgarriff

  5. High precision • Fewer hits • Higher confidence Kilgarriff

  6. Information-seeking Kilgarriff

  7. Lexicography: finding facts about words • collocations • grammatical patterns • idioms • synonyms • meanings • translations Kilgarriff

  8. Four ages of corpus lexicography Kilgarriff

  9. Age 1: • Pre • computer • Oxford English • Dictionary: • 5 million • index cards Kilgarriff

  10. Age 2: KWIC Concordances • From 1980 • Computerised • Overhauled lexicography Kilgarriff

  11. Age 2: limitations as corpora get bigger: too much data • 50 lines for a word: :read all • 500 lines: could read all, takes a long time, slow • 5000 lines: no Kilgarriff

  12. Age 3: Collocation statistics • Problem:too much data - how to summarise? • Solution:list of words occurring in neighbourhood of headword, with frequencies • Sorted by salience Kilgarriff

  13. Age-3 collocation statistics: limitations Lists contain • junk • unsorted for type – mixes together adverbs, subjects, objects, prepositions What we really want: • noise-free lists • one list for each grammatical relation Kilgarriff

  14. Collocation listing For collocates of save (>5 hits), window 1-5 words to right of nodeword Kilgarriff

  15. Age 4: The word sketch • Large well-balanced corpus • Parse to find • subjects, objects, heads, modifiers etc • One list for each grammatical relation • Statistics to sort each list, as before Kilgarriff

  16. Macmillan English Dictionary For Advanced Learners Ed: Rundell, 2002 Kilgarriff

  17. Working practice • Lexicographers mainly used sketches not concordances • missed less, more consistent • Faster Kilgarriff

  18. Euralex 2002 Kilgarriff

  19. Euralex 2002 • Can I have them for my language please Kilgarriff

  20. The Sketch Engine • Input: • any corpus, any language • Lemmatised, part-of-speech tagged • specification of grammatical relations • Word sketches integrated with • Corpus query system • Supports complex searching, sorting etc • Credit: Pavel Rychly, Masaryk Univ Kilgarriff

  21. Customers • Dictionary publishers • Oxford University Press • Cambridge University Press • Collins • Macmillan • FrameNet Project (Berkeley, US) • National dictionary projects in • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia • Universities • Teaching and research • Languages, linguistics, language technology • UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia,… • Other • Language teaching, textbook writing • Information management, web search companies • Automatic translation Kilgarriff

  22. Web corpora • Replaceable or replacable? • http://googlefight.com • http://looglefight.com Kilgarriff

  23. The web is • Very very large • Most languages • Most language types • Up-to-date • Free • Instant access Kilgarriff

  24. Web corpus types • Large, general corpora • Small, specialised corpora • Specially for translators Kilgarriff

  25. Basic steps • Gather pages • CSE hits • Select and gather whole sites • General crawl • Filter • De-duplicate • Linguistic processing • Load into corpus tool Kilgarriff

  26. WaC family corpora • 100m – 2b word corpora • 2-month project each • All major world languages available in Sketch Engine • Currently 30 languages • Growing monthly • Pioneers: Marco Baroni, Serge Sharoff • Corpus Factory • Seeds: • mid-frequency words from ‘core vocab’ lists and corpora • Google on seed words, then crawl Kilgarriff

  27. Corpora Kilgarriff

  28. How good are they? • How to assess? • Hard question, open research topic • Good coverage • Newspapers: news, politics bias • Web corpora: also cover personal, kitchen vocab • Web corpus / BNC / journalism corpus • First two are close Kilgarriff

  29. Evaluating word sketches • 11 years • 1999-2010 • Feedback • Good but anecdotal • Formal evaluation • Method also lets us evaluate corpora Kilgarriff

  30. Goal • Collocations dictionary • Model: Oxford Collocations Dictionary • Publication-quality • Ask a lexicographer • For 42 headwords • For 20 best collocates per headwords • “should we include this collocation in a published dictionary?” Kilgarriff

  31. Sample of headwords • Nouns verbs adjectives, random • High (Top 3000)‏ • N space solution opinion mass corporation leader • V serve incorporate mix desire • Adj high detailed open academic • Mid (3000- 9999)‏ • N cattle repayment fundraising elder biologist sanitation • V grieve classify ascertain implant • Adj adjacent eldest prolific ill • Low (10,000- 30,000)‏ • N predicament adulterer bake bombshell candy shellfish • V slap outgrow plow traipse • Adj neoclassical votive adulterous expandable Kilgarriff

  32. Precision and recall • We test precision • Recall is harder • How do we find all the collocations that the system should have found? • Current work • 200 collocates per headword • Selected from • All the corpora we have • Various parameter settings • Plus just-in-time evaluation for 'new' collocates Kilgarriff

  33. Four languages, three families • Dutch • ANW, 102m-word lexicographic corpus • English • UKWaC, 1.5b web corpus • Japanese • JpWaC, 400m web corpus • Slovene • FidaPlus, 620m lexicographic corpus Kilgarriff

  34. User evaluation • Evaluate whole system • Will it help with my task • Eg preparing a collocations dictionary • Contrast: developer evaluation • Can I make the system better? • Evaluate each module separately • Current work Kilgarriff

  35. Components • Corpus • NLP tools • Segmenter, lemmatiser, POS-tagger • Sketch grammar • Statistics Kilgarriff

  36. Practicalities • Interface • Good, Good-but • Merge to good • Maybe, Maybe-specialised, Bad • Merge to bad • For each language • Two/three linguists/lexicographers • If they disagree • Don't use for computing performance Kilgarriff

  37. Results • Dutch 66% • English 71% • Japanese 87% • Slovene 71% Kilgarriff

  38. Two thirds of a collocations dictionary can be gathered automatically Kilgarriff

  39. Small specialised corpora • Terminologists • Translators needing target-language domain-specific vocab • Specialist dictionaries • Don’t exist • Expensive/inaccessible • Out of date • Instant small web corpora • BootCaT: Baroni and Bernardini 2004 • WebBootCaT demo Kilgarriff

  40. Cyborgs • A creature that is partly human and partly machine • Macmillan English Dictionary Kilgarriff

  41. Kilgarriff

  42. Kilgarriff

  43. Kilgarriff

  44. Kilgarriff

  45. Cyborgs and the Information Society The dictionary-making agent is part human (for precision), part computer (for recall). Kilgarriff

  46. Treat your computer with respect. You and it can do great things together. Kilgarriff

  47. Thank you http://www.sketchengine.co.uk Kilgarriff

More Related