
What computers can and cannot do for lexicography or Us precision, them recall

What computers can and cannot do for lexicography or Us precision, them recall. Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton, UK. Outline. Precision and recall History of corpus lexicography Natural Language Processing Cyborgs. Find me all the fat cats.


Presentation Transcript


  1. What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton, UK

  2. Outline • Precision and recall • History of corpus lexicography • Natural Language Processing • Cyborgs Adam Kilgarriff: Us precision them recall

  3. Find me all the fat cats • a request for information

  4. High recall • Lots of responses • Maybe not all good

  5. High precision • Fewer hits • Higher confidence
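
The two slides above contrast high recall with high precision. A minimal sketch of how the two measures are computed, using invented cat names and search results (none of this data is from the talk):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items judged against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the responses are good.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the good items were found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the four cats that really are fat.
fat_cats = {"tom", "felix", "garfield", "misty"}

# A broad search returns lots of responses, not all good.
broad_search = ["tom", "felix", "garfield", "misty", "rex", "slim", "tiny", "bony"]
# A narrow search returns fewer hits, with higher confidence.
narrow_search = ["tom", "garfield"]

print(precision_recall(broad_search, fat_cats))   # high recall, lower precision
print(precision_recall(narrow_search, fat_cats))  # high precision, lower recall
```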

  6. Us precision, them recall

  7. Us precision, them recall • True in many areas • web searching, Google • finding an image to illustrate a talk • Nowhere more so than lexicography

  8. Lexicography: finding facts about words • collocations • grammatical patterns • idioms • synonyms • antonyms • meanings • translations

  9. Outline • Precision and recall • History of corpus lexicography • Natural Language Processing • Cyborgs

  10. Four ages of corpus lexicography

  11. Age 1: Pre-computer • Oxford English Dictionary: 5 million index cards

  12. Age 2: KWIC Concordances • From 1980 • Computerised • COBUILD project was innovator • asian-kwic.html • the coloured-pens method
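
A KWIC (keyword in context) concordance lines up every occurrence of a word with its neighbours on either side. A minimal sketch, assuming the corpus is already a list of tokens; the function name, the `width` parameter and the sample sentence are illustrative and not taken from COBUILD:

```python
def kwic(tokens, keyword, width=4):
    """Return one concordance line per occurrence of keyword, keyword in the middle."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords form a column.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Hypothetical one-sentence corpus.
corpus = "we must save money now and save time later".split()
for line in kwic(corpus, "save", width=2):
    print(line)
```

With 50 such lines a lexicographer can read them all; the limitation discussed next is what happens when there are thousands.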

  13. Age 2: limitations as corpora get bigger: too much data • 50 lines for a word: read all • 500 lines: could read all, but it takes a long time • 5000 lines: no

  14. Age 3: Collocation statistics • Problem: too much data - how to summarise? • Solution: list of words occurring in neighbourhood of headword, with frequencies • Sorted by salience

  15. Collocation listing for right collocates of save (>5 hits)

  16. Collocation statistics • Which words? • next word • last word • window, +1 to +5; window, -5 to -1 • How sorted? • most common collocates -- but for most nouns it's "the" • most salient collocates -- how to measure salience?
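
The window idea can be sketched as a simple count of the words in positions +1 to +N after the headword. The function name and the sample text are assumptions made for illustration; the result also shows why sorting by raw frequency is unhelpful, since "the" comes out on top:

```python
from collections import Counter


def right_collocates(tokens, headword, window=5):
    """Count words occurring in positions +1..+window after each hit of headword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical mini-corpus.
tokens = "save the money and save the time to save the planet".split()
counts = right_collocates(tokens, "save", window=2)
print(counts.most_common())  # the most common collocate is "the"
```

A left-hand window (-5 to -1) would slice `tokens[max(0, i - window):i]` instead.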

  17. Mutual Information • Church and Hanks 1989 • How much more often does a word pair occur than one might expect by chance? • “Chance” of x and y occurring together: p(x) * p(y) • Probabilities approximated by frequencies: p(x) ≈ f(x)/N
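
The definition above can be written out directly: compare the observed probability of the pair with p(x) * p(y), each probability approximated by f/N. This is a sketch; the base-2 logarithm is an assumption here (the slide does not state a base), and the counts are invented:

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI of a word pair from raw frequencies and corpus size n."""
    p_xy = f_xy / n         # observed probability of x and y together
    p_x = f_x / n           # p(x) approximated by f(x)/N
    p_y = f_y / n           # p(y) approximated by f(y)/N
    chance = p_x * p_y      # "chance" of x and y occurring together
    return math.log2(p_xy / chance)  # log base 2 is an assumption


# Hypothetical counts: the pair occurs 8 times in a corpus of 2**20 words.
print(mutual_information(f_xy=8, f_x=64, f_y=128, n=2 ** 20))
```

A score of 0 means the pair occurs exactly as often as chance predicts; positive scores mean more often.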

  18. Mutual Information * numbers are log-proportional to MI

  19. Problem • mathematical salience = lexicographic salience? • no! higher-frequency items are lexicographically more salient • Solution: multiply MI by raw frequency
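
A sketch of the fix, under two assumptions: "raw frequency" is taken to mean the co-occurrence count of the pair, and MI uses a base-2 logarithm. The candidate words and counts are invented to show how the ranking changes:

```python
import math


def mi(f_xy, f_x, f_y, n):
    """Mutual information from raw frequencies (log base 2 assumed)."""
    return math.log2(f_xy * n / (f_x * f_y))


def salience(f_xy, f_x, f_y, n):
    """MI multiplied by the pair's raw co-occurrence frequency (assumed reading)."""
    return mi(f_xy, f_x, f_y, n) * f_xy


N = 2 ** 20  # hypothetical corpus size
# Hypothetical collocates of one headword: (f_xy, f_headword, f_collocate).
candidates = {
    "rare_word": (2, 64, 2),        # always with the headword, but seen only twice
    "common_word": (32, 64, 1024),  # a frequent, useful collocate
}

by_mi = sorted(candidates, key=lambda w: mi(*candidates[w], N), reverse=True)
by_salience = sorted(candidates, key=lambda w: salience(*candidates[w], N), reverse=True)
print(by_mi)        # the rare word ranks first on MI alone
print(by_salience)  # the common word ranks first once frequency is factored in
```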

  20. Mutual Information

  21. Collocation listing for right collocates of save (>5 hits)

  22. Age-3 collocation statistics: limitations • Lists contain • junk • unsorted for type -- MI lists mix adverbs, subjects, objects, prepositions • What we really want: • noise-free lists • one list for each grammatical relation

  23. Age 4: The word sketch • Large well-balanced corpus • Parse to find • subjects, objects, heads, modifiers etc • One list for each grammatical relation • Statistics to sort each list, as before

  24. Can we do it? • high-accuracy parsing is hard • lots of NLP work, many parsing frameworks exist • if any parser can handle a large corpus, it's probably good enough -- sorting and statistics make us error-tolerant

  25. Can we do it? • high-accuracy parsing is hard • lots of NLP work, many parsing frameworks exist • if any parser can handle a large corpus, it's probably good enough -- sorting and statistics make us error-tolerant • Poor man’s parsing: • object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
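
The object rule can be sketched over a list of (lemma, tag) pairs. The tag names below are a simplified, hypothetical tagset, not the BNC's, and the check that the verb is active is left out:

```python
# Tags allowed in the sequence after the verb (hypothetical simplified tagset).
NP_TAGS = {"N", "ADJ", "DET", "NUM", "ADV"}


def find_objects(tagged):
    """Return <object, verb, noun> triples using the last-noun rule."""
    triples = []
    for i, (lemma, tag) in enumerate(tagged):
        if tag != "V":
            continue
        last_noun = None
        # Walk the run of nouns, adjectives, determiners, numbers and adverbs.
        for next_lemma, next_tag in tagged[i + 1:]:
            if next_tag not in NP_TAGS:
                break
            if next_tag == "N":
                last_noun = next_lemma
        if last_noun is not None:
            triples.append(("object", lemma, last_noun))
    return triples


# Hypothetical tagged, lemmatized sentences.
s1 = [("she", "PRON"), ("sip", "V"), ("the", "DET"), ("very", "ADV"),
      ("hot", "ADJ"), ("coffee", "N"), ("in", "PREP"), ("the", "DET"), ("garden", "N")]
s2 = [("they", "PRON"), ("drink", "V"), ("two", "NUM"), ("orange", "N"),
      ("juice", "N"), (".", "PUNCT")]
print(find_objects(s1))
print(find_objects(s2))
```

The rule will make mistakes, but as the slide says, sorting and statistics over a large corpus make the result error-tolerant.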

  27. The word sketch • British National Corpus (BNC) • 100 M words, already POS-tagged • lemmatized using John Carroll's lemmatizer • poor man’s parsing • database of 70 million triples • <object, sip, coffee> • <subject, arrive, coffee> • <and-or, tea, coffee> • <modifier, coffee, instant> • coffee_n.html
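
Given triples like these, a word sketch is one sorted list of collocates per grammatical relation. A minimal sketch of that grouping step; it sorts by raw count as a stand-in for the salience statistic, and the function name and the repeated triples are invented:

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the headword's collocates by grammatical relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, first, second in triples:
        # The headword may sit in either word slot of the triple.
        if first == headword:
            by_relation[relation][second] += 1
        elif second == headword:
            by_relation[relation][first] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# The slide's four example triples, plus invented repeats and one unrelated triple.
triples = [
    ("object", "sip", "coffee"),
    ("object", "sip", "coffee"),
    ("object", "drink", "coffee"),
    ("subject", "arrive", "coffee"),
    ("and-or", "tea", "coffee"),
    ("modifier", "coffee", "instant"),
    ("object", "sip", "tea"),
]
for relation, collocates in word_sketch(triples, "coffee").items():
    print(relation, collocates)
```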

  28. Macmillan Dictionary of English for Advanced Learners, 2002 • Editor: Rundell. Work done 1999. • 6000 word sketches • most common nouns, verbs, adjectives of English • HTML files with hyperlinked corpus examples • lexicographers used them extensively • main use of corpus • positive feedback

  29. The WASPbench • with David Tugwell, UK EPSRC, grant M54971 • A lexicographer's workbench • runtime creation of word sketches • integration with Word Sense Disambiguation technology • output is "disambiguating dictionary" - analysis of word's meaning into senses, plus computer program for disambiguating contextualised instances of the word • First release now available. http://wasps.itri.brighton.ac.uk/

  30. The Sketch Engine • Input: • any corpus, any language • Lemmatised, part-of-speech tagged • specification of grammatical relations • Word sketches integrated with • Corpus query system • Supports complex searching, sorting etc • First release early 2004

  31. Outline • Precision and recall • History of corpus lexicography • Natural Language Processing • Cyborgs

  32. Natural Language Processing • The academic discipline which provides the tools • Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering • Good at evaluation of its tools • Good news for lexicography: • identify the best tools, apply them to our corpora

  33. An Anglophone Apology • Technology, tools, resources most often available for English • This talk centres on English • Other languages often present new problems • Finding word delimiters for Chinese is hard • Finding bunsetsu for Japanese is hard • Fewer resources available, less work done • Recommendation: • find the local experts for your language

  34. Recap: Lexicography: finding facts about words • collocations • grammatical patterns • idioms • synonyms • antonyms • meanings • translations

  35. Recap: Lexicography: finding facts about words • collocations - sketches • grammatical patterns - sketches • idioms • synonyms • antonyms • meanings • translations

  36. Idioms • Extreme case of collocation/multi word expressions • Sequence of workshops on collocations, MWE • Technical terms (of great interest to technologists, technical): TERMIGHT

  37. Antonyms • Essential semantic relation

  38. Antonyms • Essential semantic relation but • Justeson and Katz 1995: distributional evidence for typical antonym pairs • rich men and poor men • the big ones and the small ones • black and white issues • Perhaps antonyms are ‘really’ distributional
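
The first two examples on the slide share a parallel frame: "X noun and (the) Y noun", with the same noun repeated. A crude sketch that collects candidate pairs from that one frame with a regular expression; this is an illustration of the distributional idea, not Justeson and Katz's method, and the sample text is invented:

```python
import re
from collections import Counter

# "X NOUN and (the) Y NOUN", where the same NOUN is repeated (\2).
FRAME = re.compile(r"\b(\w+) (\w+) and (?:the )?(\w+) \2\b")


def contrast_pairs(text):
    """Count unordered word pairs that fill the X and Y slots of the frame."""
    pairs = Counter()
    for first, _noun, second in FRAME.findall(text.lower()):
        pairs[tuple(sorted((first, second)))] += 1
    return pairs


# Hypothetical text built around the slide's examples.
text = ("Rich men and poor men alike. The big ones and the small ones. "
        "Poor men and rich men agree.")
print(contrast_pairs(text).most_common())
```

The frame does not cover "black and white issues", and it will also pick up pairs that are not antonyms, so the counts are candidates for a lexicographer to check.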

  39. Thesauruses • Also near-synonyms • are there any true synonyms? • Distributional: which words share same distributions • if corpus contains <object, drink, wine>, <object, drink, beer> • 1 pt similarity between wine and beer • gather all points; find nearest neighbours • Sparck Jones, Lin, Grefenstette
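
The point-scoring idea can be sketched directly: two words get one point for every (relation, word) context they share, and the nearest neighbours are the words with the most points. The function name and all triples beyond the slide's wine/beer pair are invented, and real systems weight the points rather than simply counting them:

```python
from collections import defaultdict


def nearest_neighbours(triples, target, top=3):
    """Rank other words by the number of (relation, head) contexts shared with target."""
    contexts = defaultdict(set)
    for relation, head, word in triples:
        contexts[word].add((relation, head))
    target_contexts = contexts.get(target, set())
    # One point of similarity per shared context.
    scores = {
        other: len(target_contexts & ctx)
        for other, ctx in contexts.items()
        if other != target
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score) for word, score in ranked if score > 0][:top]


# Hypothetical triples in the slide's <relation, head, word> order.
triples = [
    ("object", "drink", "wine"),
    ("object", "drink", "beer"),
    ("object", "pour", "wine"),
    ("object", "pour", "beer"),
    ("object", "drink", "tea"),
    ("object", "read", "book"),
]
print(nearest_neighbours(triples, "wine"))
```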

  40. Nearest neighbours • In WASPbench • Will be generated in Sketch Engine • NOUNS • zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

  41. exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy • pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

  42. VERBS • measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust • boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

  43. ADJECTIVES • hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky • pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

  44. Translation • Parallel corpora • Texts and their translations or • Comparable corpora • Matched for source and target (genre and subject matter), not translations • Which L1 words occur in equivalent L1 settings to L2 words in L2 settings? • They are candidate translation pairs • Very hard problem • Lots of high quality research

  45. Outline • Precision and recall • History of corpus lexicography • Natural Language Processing • Cyborgs

  46. Cyborgs • Robots: will they take over? • Rod Brooks’s answer: • Wrong question: greatest advances are in what the human+computer ensemble can do

  47. Cyborgs • A creature that is partly human and partly machine • Macmillan English Dictionary
