Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project
Adam Kilgarriff, Lexical Computing Ltd, http://www.sketchengine.co.uk

English Profile
• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal: for each CEFR level, find characteristic lexis and grammar
• CEFR: Common European Framework of Reference
  • A1, A2: Beginner
  • B1, B2: Intermediate
  • C1, C2: Advanced
• Main resource: CLC

Cambridge Learner Corpus (CLC)
• Since 1993
• Leading resource
• CUP and Cambridge Assessment
• For better dictionaries, ELT courses, tests
• Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities

Sketch Engine
• Leading corpus tool
• Word sketches: one-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
• 175 corpora
• Since May including CHILDES: demo
• Since last year including CLC

Macmillan English Dictionary for Advanced Learners (ed. Rundell, 2002)

Error-coded corpus
• Challenge: make it intuitive to search for x
  • anywhere
  • only where it is part of an error
  • only where it is part of a correction
• x can be a word, phrase, grammar pattern …
• Requirement for CLC in Sketch Engine
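The three search scopes can be illustrated with a minimal Python sketch. It assumes a simplified inline markup, `<e><i>error</i><c>correction</c></e>`, which is a hypothetical format chosen for illustration and not necessarily how CLC or the Sketch Engine encode errors; the function name `find` is likewise an assumption.

```python
import re

# Hypothetical inline markup: <e><i>learner's error</i><c>correction</c></e>
ERROR_RE = re.compile(r"<e><i>(.*?)</i><c>(.*?)</c></e>")

def find(word, text, where="anywhere"):
    """Return True if `word` occurs in `text` within the requested scope."""
    if where == "error":
        scopes = [m.group(1) for m in ERROR_RE.finditer(text)]
    elif where == "correction":
        scopes = [m.group(2) for m in ERROR_RE.finditer(text)]
    else:
        # "anywhere": plain text plus both the error and its correction
        scopes = [ERROR_RE.sub(lambda m: m.group(1) + " " + m.group(2), text)]
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return any(pattern.search(scope) for scope in scopes)

sentence = "She gave me many <e><i>informations</i><c>information</c></e> about it"

print(find("informations", sentence, "error"))       # match inside an error
print(find("informations", sentence, "correction"))  # no match inside a correction
print(find("gave", sentence))                        # match anywhere
```

Keeping the scope as a parameter means the same query x can be reused unchanged across all three searches.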

Error-coded corpora in SkE
• demo

HOO / HOO+
• Helping Our Own
• HOO: English-NNS NLP researchers
• Developer = user: motivation
• Shared task/competitive evaluation
• Organisers define task and prepare ‘gold standard’
• Teams participate by running their software over test data
• Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)
• Probably
  • English: learner data from CLC
  • Other languages?
• Tasks
  • Essay scoring
  • Determiner, preposition errors
  • ?
• http://www.clt.mq.edu.au/research/projects/hoo/

DANTE
• Highlights of English lexicography
• http://webdante.com

The KELLY Project
• EU Lifelong Learning Project
• Word cards
• 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• All 36 pairs
• Words the learner should know (at A1 … C2)
• Partners: Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question
• How close to purely corpus-based can a pedagogic list be?

Method
• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations: review source list
• http://kelly.sketchengine.co.uk

Symmetrical pairs: <x,y> and <y,x>
• Cliques: for x, y, z, … all pairs are symmetrical
• 9-language cliques (English members): hospital, library, music, sun, theory
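The ideas of symmetrical pairs and cliques can be made concrete with a small Python sketch. The translation tables below are toy data invented for illustration (they are not Kelly project data), and the names `symmetrical` and `is_clique` are assumptions. The block also checks the pair counts: 9 languages give 36 unordered pairs and 72 directed pairs.

```python
from itertools import combinations, permutations

# Toy data (hypothetical): translations[(source, target)][word] = translation
translations = {
    ("en", "sv"): {"music": "musik", "sun": "sol", "big": "stor"},
    ("sv", "en"): {"musik": "music", "sol": "sun", "stor": "large"},
    ("en", "it"): {"music": "musica", "sun": "sole"},
    ("it", "en"): {"musica": "music", "sole": "sun"},
    ("sv", "it"): {"musik": "musica", "sol": "sole"},
    ("it", "sv"): {"musica": "musik", "sole": "sol"},
}

def symmetrical(l1, w1, l2, w2):
    """True if w1 translates to w2 and w2 translates back to w1."""
    return (translations.get((l1, l2), {}).get(w1) == w2
            and translations.get((l2, l1), {}).get(w2) == w1)

def is_clique(members):
    """members maps language -> word; a clique needs every pair symmetrical."""
    return all(symmetrical(a, members[a], b, members[b])
               for a, b in combinations(sorted(members), 2))

languages = range(9)
undirected_pairs = len(list(combinations(languages, 2)))  # 36
directed_pairs = len(list(permutations(languages, 2)))    # 72

print(is_clique({"en": "music", "sv": "musik", "it": "musica"}))
print(symmetrical("en", "big", "sv", "stor"))  # back-translation differs
```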

Web corpora
• Replaceable or replacable?
• http://googlefight.com
• http://looglefight.com

The web is
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

Web corpus types
• Large, general corpora
• Small, specialised corpora
• Specially for translators

Basic steps
• Gather pages
  • CSE hits
  • Select and gather whole sites
  • General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
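The filter and de-duplicate steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the length threshold is a hypothetical filter, and the de-duplication catches exact duplicates (after case and whitespace normalisation) only, whereas a production pipeline would typically need near-duplicate detection. All function names are invented for the example.

```python
import hashlib
import re

def clean(page):
    """Collapse runs of whitespace left over from HTML extraction."""
    return re.sub(r"\s+", " ", page).strip()

def keep(page, min_words=5):
    """Hypothetical filter: drop pages too short to contain running text."""
    return len(page.split()) >= min_words

def deduplicate(pages):
    """Drop exact duplicates, comparing case-folded text by hash."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha1(page.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

def build_corpus(raw_pages):
    pages = [clean(p) for p in raw_pages]
    pages = [p for p in pages if keep(p)]
    return deduplicate(pages)

raw = [
    "The  cat sat on the mat today.",
    "the cat sat on the mat today.",   # duplicate apart from case
    "Home | Login",                    # navigation residue, too short
    "A second page with enough words here.",
]
corpus = build_corpus(raw)
print(len(corpus))
```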

WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
• Currently 42 languages
• Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
Corpus Factory
• Seeds: mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
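Seed selection can be sketched as follows. The rank band, the three-word query size and the function names are all assumptions made for illustration; the slide says only that seeds are mid-frequency words that are then sent to a search engine. The search and crawl steps themselves are not shown.

```python
import random

def seed_words(freqs, start, stop):
    """Return words whose frequency rank falls in [start, stop), most frequent first."""
    ranked = [w for w, _ in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))]
    return ranked[start:stop]

def seed_queries(seeds, n_queries, words_per_query=3, rng=None):
    """Build random multi-word queries from the seed list (query size is an assumption)."""
    rng = rng or random.Random(0)
    return [tuple(rng.sample(seeds, words_per_query)) for _ in range(n_queries)]

# Toy frequency list (hypothetical counts)
freqs = {"the": 1000, "of": 900, "house": 50, "garden": 40,
         "window": 30, "kitchen": 20, "zygote": 1}

seeds = seed_words(freqs, 2, 6)   # skip the most frequent and the rarest words
queries = seed_queries(seeds, 5)
print(seeds)
```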

How good are they?
• How to assess? Hard question, open research topic
• Good coverage
• Newspapers: biased towards news and politics
• Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus: the first two are close

Evaluating word sketches
• 11 years, 1999-2011
• Feedback: good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Goal
• Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality
• Ask a lexicographer
  • For 42 headwords
  • For the 20 best collocates per headword
  • “Should we include this collocation in a published dictionary?”

Sample of headwords
• Nouns, verbs, adjectives; random
• High (top 3000)
  • N: space, solution, opinion, mass, corporation, leader
  • V: serve, incorporate, mix, desire
  • Adj: high, detailed, open, academic
• Mid (3000-9999)
  • N: cattle, repayment, fundraising, elder, biologist, sanitation
  • V: grieve, classify, ascertain, implant
  • Adj: adjacent, eldest, prolific, ill
• Low (10,000-30,000)
  • N: predicament, adulterer, bake, bombshell, candy, shellfish
  • V: slap, outgrow, plow, traipse
  • Adj: neoclassical, votive, adulterous, expandable
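A stratified random sample of this shape (three frequency bands, with 6 nouns, 4 verbs and 4 adjectives per band, 42 headwords in all) could be drawn roughly as below. The exact band boundaries and the input format are assumptions: the slide gives only approximate rank ranges and does not say how the draw was implemented.

```python
import random

# Rank bands as read from the slide; the exact boundaries are an assumption.
BANDS = [("high", 1, 2999), ("mid", 3000, 9999), ("low", 10000, 30000)]
PER_POS = {"N": 6, "V": 4, "Adj": 4}

def sample_headwords(ranked, rng=None):
    """ranked: list of (lemma, pos) in descending frequency order (rank 1 first)."""
    rng = rng or random.Random(0)
    sample = {}
    for label, first, last in BANDS:
        band = ranked[first - 1:last]
        for pos, k in PER_POS.items():
            pool = [lemma for lemma, p in band if p == pos]
            sample[(label, pos)] = rng.sample(pool, k)
    return sample

# Synthetic frequency list standing in for a real lemma list
ranked = [(f"w{i}", ["N", "V", "Adj"][i % 3]) for i in range(30000)]
sample = sample_headwords(ranked)
print(sum(len(words) for words in sample.values()))
```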

Precision and recall
• A request for information: “Find me all the fat cats”

High recall
• Lots of responses
• Maybe not all good

High precision
• Fewer hits
• Higher confidence

Precision and recall
• We test precision
• Recall is harder
  • How do we find all the collocations that the system should have found?
• Current work
  • 200 collocates per headword
  • Selected from all the corpora we have, with various parameter settings
  • Plus just-in-time evaluation for 'new' collocates
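For reference, the two measures can be written out in a short Python sketch using the "fat cats" request from the earlier slide. The cat names are invented for illustration. Precision is the share of returned items that are correct; recall is the share of all correct items that were returned, which is why it requires knowing the full set of items the system should have found.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: the gold set is "all the fat cats"
fat_cats = {"tom", "garfield", "felix", "salem"}
answer = {"tom", "garfield", "rex"}   # what the system returned

p, r = precision_recall(answer, fat_cats)
print(p, r)
```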

Four languages, three families
• Dutch: ANW, 102m-word lexicographic corpus
• English: UKWaC, 1.5b web corpus
• Japanese: JpWaC, 400m web corpus
• Slovene: FidaPlus, 620m lexicographic corpus

User evaluation
• Evaluate whole system
• Will it help with my task?
  • E.g. preparing a collocations dictionary
• Contrast: developer evaluation
  • Can I make the system better?
  • Evaluate each module separately
  • Current work

Components
• Corpus
• NLP tools: segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Practicalities
• Interface
  • Good, Good-but: merge to good
  • Maybe, Maybe-specialised, Bad: merge to bad
• For each language: two/three linguists/lexicographers
• If they disagree: don't use for computing performance
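The scoring procedure described above can be sketched as follows: merge the five interface labels into good/bad, discard any collocation on which the judges disagree after merging, and report the share of good among the rest. The lowercase label strings and the function name are assumptions for illustration.

```python
# Merge rules from the slide; the label spellings are an assumption.
MERGE = {
    "good": "good", "good-but": "good",
    "maybe": "bad", "maybe-specialised": "bad", "bad": "bad",
}

def precision(judgements):
    """judgements: one tuple of labels per collocation, one label per judge.

    Collocations where the merged labels disagree are left out of the score.
    Returns None if no collocation has an agreed label.
    """
    agreed = []
    for labels in judgements:
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)

judgements = [
    ("good", "good-but"),   # agree after merging: good
    ("maybe", "bad"),       # agree after merging: bad
    ("good", "bad"),        # disagree: excluded
    ("good", "good"),       # agree: good
]
score = precision(judgements)
print(score)
```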

Results
• Dutch: 66%
• English: 71%
• Japanese: 87%
• Slovene: 71%

Two thirds of a collocations dictionary can be gathered automatically

Thank you
http://www.sketchengine.co.uk

Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography

Age 1: Pre-computer
• Oxford English Dictionary: 5 million index cards

Age 2: KWIC concordances
• From 1980
• Computerised
• Overhauled lexicography
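A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on each side. A minimal Python sketch, with a hypothetical function name and a toy sentence:

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the fat cat sat on the mat".split()
for left, key, right in kwic(tokens, "cat", width=2):
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>12}  {key}  {right}")
```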

Age 2: limitations
• As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in the neighbourhood of the headword, with frequencies
• Sorted by salience
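A window-based collocate list of this kind can be sketched in Python. The slide does not name the salience statistic; the example assumes logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))), as one possible choice, and the window size of 2 is likewise an assumption.

```python
import math
from collections import Counter

def collocates(tokens, headword, window=2):
    """List (word, co-occurrence frequency, salience), highest salience first."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    co[tokens[j]] += 1
    scored = []
    for word, f_xy in co.items():
        # logDice is an assumed salience measure, not stated on the slide
        log_dice = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        scored.append((word, f_xy, round(log_dice, 2)))
    return sorted(scored, key=lambda item: (-item[2], item[0]))

tokens = "the fat cat sat on the mat and the fat cat ate the fish".split()
result = collocates(tokens, "cat")
for word, f_xy, salience in result:
    print(word, f_xy, salience)
```

On this toy text the function word "the" scores as high as "fat", which illustrates the junk and mixed word classes that raw window-based lists tend to contain.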

Age-3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions
• What we really want
  • noise-free lists
  • one list for each grammatical relation

Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
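The step from one mixed list to one list per grammatical relation can be sketched as follows. The example assumes the parser output is already available as (relation, headword, collocate) triples; the relation names and data are invented for illustration, and raw frequency stands in for the salience statistic used to sort each list.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group a headword's collocates into one frequency-sorted list per relation."""
    lists = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            lists[relation][collocate] += 1
    return {relation: counts.most_common() for relation, counts in lists.items()}

# Hypothetical parser output: (relation, headword, collocate)
triples = [
    ("object_of", "opinion", "express"),
    ("object_of", "opinion", "express"),
    ("object_of", "opinion", "share"),
    ("modifier", "opinion", "public"),
    ("modifier", "opinion", "public"),
    ("modifier", "opinion", "strong"),
    ("object_of", "solution", "find"),
]

sketch = word_sketch(triples, "opinion")
for relation, collocate_list in sketch.items():
    print(relation, collocate_list)
```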

Working practice
• Lexicographers mainly used sketches, not concordances
• Missed less, more consistent
• Faster

Euralex 2002
• “Can I have them for my language, please?”
