Thesauruses for Natural Language Processing


  1. Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton

  2. Outline • Definition • Uses for NLP • WASPS thesaurus • web thesauruses • Argument: words not word senses • Evaluation proposals • Cyborgs

  3. What is a thesaurus? a resource that groups words according to similarity

  4. Manual and automatic • Manual • Roget, WordNets, many publishers • Automatic • Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999) • aka distributional • two words are similar if they occur in same contexts • Are they comparable?

  6. Thesauruses in NLP • sparse data • does x go with y? • don’t know, they have never been seen together • New question: does x+friends go with y+friends • indirect evidence for x and y • thesaurus tells us who friends are • “backing off”
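
The "x+friends" idea on this slide can be sketched in a few lines of Python. This is only an illustration: the counts in `PAIR_COUNTS`, the neighbour lists in `FRIENDS`, and the function names are all invented here and are not taken from the talk or from any real thesaurus.

```python
from collections import Counter

# Toy co-occurrence counts and neighbour lists, invented for illustration only.
PAIR_COUNTS = Counter({
    ("crocodile", "river"): 4,
    ("alligator", "swamp"): 3,
    ("caiman", "upwaters"): 1,
})
FRIENDS = {
    "alligator": ["crocodile", "caiman"],
    "upwaters": ["river", "swamp"],
}


def expand(word):
    """Return the word together with its thesaurus neighbours ("friends")."""
    return [word] + FRIENDS.get(word, [])


def backed_off_count(x, y):
    """Use the direct count for (x, y) if there is one.

    Otherwise back off: sum the counts over every pairing of
    x+friends with y+friends.
    """
    direct = PAIR_COUNTS[(x, y)]
    if direct > 0:
        return direct
    return sum(PAIR_COUNTS[(a, b)] for a in expand(x) for b in expand(y))


print(backed_off_count("alligator", "upwaters"))
print(backed_off_count("allegory", "upwaters"))
```

With this toy data the pair alligator/upwaters has never been seen directly, but its friends supply indirect evidence, while allegory/upwaters stays at zero.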

  7. Relevant in: • Parsing • PP-attachment • conjunction scope • Bridging anaphors • Text cohesion • Word sense disambiguation (WSD) • Speech understanding • Spelling correction

  12. Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze • allegory? in upwaters? No • alligator? in upwaters? No • allegory+friends in upwaters? No • alligator+friends in upwaters? Yes

  13. PP-attachment • investigate stromatolite with microscope/speckles • microscope: verb attachment • speckles: noun attachment • inspect jasper with spectrometer • which?

  15. PP attachment (cont) • compare frequencies of • <inspect, with, spectrometer> • <jasper, with, spectrometer> • both zero? Try • <inspect+friends, with, spectrometer+friends> • <jasper+friends, with, spectrometer+friends>
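
A rough sketch of the comparison on this slide, assuming triple counts are available as a simple table. The counts, the neighbour lists, and the names `triple_count` and `attach` are hypothetical and chosen only to make the example run.

```python
from collections import Counter

# Hypothetical <head, preposition, noun> counts, invented for illustration.
TRIPLES = Counter({
    ("examine", "with", "microscope"): 5,
    ("inspect", "with", "microscope"): 2,
    ("rock", "with", "speckles"): 1,
})
# Hypothetical thesaurus neighbours ("friends").
FRIENDS = {
    "inspect": ["examine", "investigate"],
    "jasper": ["rock", "stromatolite"],
    "spectrometer": ["microscope"],
}


def expand(word):
    """Return the word plus its neighbours."""
    return [word] + FRIENDS.get(word, [])


def triple_count(head, prep, noun):
    """Direct count if non-zero, else the count over head+friends and noun+friends."""
    direct = TRIPLES[(head, prep, noun)]
    if direct:
        return direct
    return sum(TRIPLES[(h, prep, n)] for h in expand(head) for n in expand(noun))


def attach(verb, obj, prep, noun):
    """Compare <verb, prep, noun> against <obj, prep, noun> and pick the larger."""
    verb_score = triple_count(verb, prep, noun)
    noun_score = triple_count(obj, prep, noun)
    if verb_score > noun_score:
        return "verb"
    if noun_score > verb_score:
        return "noun"
    return "undecided"


print(attach("inspect", "jasper", "with", "spectrometer"))
print(attach("inspect", "jasper", "with", "speckles"))
```

Both direct counts for "inspect jasper with spectrometer" are zero in the toy table, so the decision rests entirely on the friends.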

  20. Conjunction scope • Compare • old boots and shoes • old boots and apples • Are the shoes old? • Are the apples old? • Hypothesis: • wide scope only when words are similar • hard problem: thesaurus might help
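
The hypothesis could be tried out with a test as simple as the one below. It is a sketch under strong assumptions: the neighbour lists are invented, and "similar" is reduced to "appears in the other word's neighbour list", which is only one of many possible definitions.

```python
# Hypothetical neighbour lists, invented for illustration.
NEIGHBOURS = {
    "boots": {"shoes", "trainers", "sandals"},
    "shoes": {"boots", "trainers", "slippers"},
    "apples": {"pears", "oranges", "plums"},
}


def are_similar(a, b):
    """Treat two words as similar if either is listed as a neighbour of the other."""
    return b in NEIGHBOURS.get(a, set()) or a in NEIGHBOURS.get(b, set())


def modifier_scope(first_noun, second_noun):
    """Guess wide scope for the modifier only when the conjoined nouns are similar."""
    return "wide" if are_similar(first_noun, second_noun) else "narrow"


print(modifier_scope("boots", "shoes"))   # old [boots and shoes]
print(modifier_scope("boots", "apples"))  # [old boots] and apples
```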

  22. Bridging anaphor resolution • Maria bought a large apple. The fruit was red and crisp. • fruit and apple co-refer • How to find co-referring terms?

  23. Text cohesion • words on same theme • same segment • change in theme of words • new segment • same theme: same thesaurus class
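
One way to read "same theme: same thesaurus class" as a procedure is sketched below. The word-to-class table and the rule "start a new segment when a sentence shares no class with the current segment" are assumptions made for this example, not a method described in the talk.

```python
# Hypothetical mapping from words to thesaurus classes, invented for illustration.
WORD_CLASS = {
    "pike": "fish", "carp": "fish", "bream": "fish",
    "parser": "language", "grammar": "language", "corpus": "language",
}


def classes(sentence):
    """Return the set of thesaurus classes of the known words in a sentence."""
    return {WORD_CLASS[w] for w in sentence.lower().split() if w in WORD_CLASS}


def segment(sentences):
    """Group sentences into segments, splitting where the theme classes change."""
    segments = []
    current, current_classes = [], set()
    for sentence in sentences:
        found = classes(sentence)
        # A boundary is proposed only when both sides have classes and none overlap.
        if current and found and current_classes and not (found & current_classes):
            segments.append(current)
            current, current_classes = [], set()
        current.append(sentence)
        current_classes |= found
    if current:
        segments.append(current)
    return segments


text = [
    "we caught a pike",
    "the carp escaped",
    "the parser failed",
    "the grammar was small",
]
print(segment(text))
```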

  24. Word Sense Disambiguation (WSD) • pike: fish or weapon • We caught a pike this afternoon • probably no direct evidence for • catch pike • probably is direct evidence for • catch {pike,carp,bream,cod,haddock,…}
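
The pike example can be sketched as follows: there is no count for "catch pike", so each candidate class is scored by how often its members occur as the object of the verb. The counts, the two class lists, and the function name `disambiguate` are invented for illustration.

```python
from collections import Counter

# Hypothetical <verb, object> counts, invented for illustration.
OBJ_COUNTS = Counter({
    ("catch", "carp"): 6,
    ("catch", "cod"): 9,
    ("catch", "bream"): 2,
    ("carry", "spear"): 4,
    ("carry", "sword"): 7,
})
# Hypothetical thesaurus classes; "pike" belongs to both.
SENSE_CLASSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "sword", "lance"],
}


def disambiguate(verb, noun):
    """Score each class containing the noun by its members' counts as object of verb."""
    scores = {
        sense: sum(OBJ_COUNTS[(verb, member)] for member in members)
        for sense, members in SENSE_CLASSES.items()
        if noun in members
    }
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return best, scores


print(disambiguate("catch", "pike"))
print(disambiguate("carry", "pike"))
```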

  25. WordNet, Roget • widely used for all the above

  26. The WASPS thesaurus • credit: David Tugwell • EPSRC grant K8931 • POS-tag, lemmatise and parse the BNC (100M words) • Find all grammatical relations • <obj, climb, bank> • <modifier, big, bank> • <subject, bank, refuse> • 70 million triples

  27. WASPS thesaurus (cont) • Similarity: • <obj, drink, beer> • <obj, drink, wine> • one point similarity between beer and wine • count all points of similarity between all pairs of words • weight according to frequencies • product of MI: Lin (1998)
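
The counting scheme on this slide might be sketched like this. The triples are toy data, the mutual information here is plain pointwise MI between a word and a (relation, head) context, and each shared context contributes the product of the two words' MI scores. That is one reading of "product of MI"; it is not claimed to reproduce the WASPS implementation or the exact formula in Lin (1998).

```python
import math
from collections import Counter

# Toy <relation, head, word> triples, invented for illustration.
TRIPLES = [
    ("obj", "drink", "beer"), ("obj", "drink", "wine"),
    ("obj", "pour", "beer"), ("obj", "pour", "wine"),
    ("modifier", "red", "wine"),
    ("obj", "climb", "bank"), ("obj", "rob", "bank"),
    ("modifier", "big", "bank"),
]


def build_mi(triples):
    """Map each word to {context: MI}, keeping only positive MI scores."""
    pair, word, ctx = Counter(), Counter(), Counter()
    for rel, head, w in triples:
        context = (rel, head)
        pair[(w, context)] += 1
        word[w] += 1
        ctx[context] += 1
    total = len(triples)
    mi = {}
    for (w, context), freq in pair.items():
        score = math.log(freq * total / (word[w] * ctx[context]))
        if score > 0:
            mi.setdefault(w, {})[context] = score
    return mi


def similarity(mi, w1, w2):
    """Sum, over shared contexts, the product of the two words' MI scores."""
    first, second = mi.get(w1, {}), mi.get(w2, {})
    shared = first.keys() & second.keys()
    return sum(first[c] * second[c] for c in shared)


mi = build_mi(TRIPLES)
print(similarity(mi, "beer", "wine"))
print(similarity(mi, "beer", "bank"))
```

In the toy data beer and wine share the contexts <obj, drink> and <obj, pour>, so they score above zero, while beer and bank share nothing.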

  28. Word Sketches • one-page summary of a word’s grammatical and collocational behaviour • demo: http://wasps.itri.bton.ac.uk • the Sketch Engine • input any corpus • generate word sketches and thesaurus • just available now

  29. Nearest neighbours to zebra

  30. Nearest neighbours • zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

  31. • exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy • pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

  32. VERBS • measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust • boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

  33. ADJECTIVES • hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky • pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

  34. Nearest neighbours

  35. no clustering (tho’ could be done) • no hierarchy (tho’ could be done) • rhythm • all on the web: http://wasps.itri.bton.ac.uk • registration required

  36. The web • an enormous linguist’s playground • Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds) 29 (3) • (coming soon)

  37. Google sets • http://labs.google.com/sets • Input: zebra giraffe buffalo

  38. Google sets • http://labs.google.com/sets • Input: zebra giraffe buffalo • kudu hyena impala leopard hippo waterbuck elephant cheetah eland

  39. Google sets • http://labs.google.com/sets • Input: harbin beijing nanking

  40. Google sets • http://labs.google.com/sets • Input: harbin beijing nanking • Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou

  41. Tree structure • Roget • all human knowledge as tree structure • 1000 top categories • subdivisions • like this • etc • etc

  42. Directories and thesauruses • Yahoo, http://www.yahoo.com • Open directory project, http://dmoz.org • all human activity as tree structure plus corpus at every node • gather corpus, identify domain vocabulary • Gonzalo and colleagues, Madrid, CL Special Issue • Agirre and colleagues, ‘topic signatures’

  45. Words and word senses • automatic thesauruses • words • manual thesauruses • simple hierarchy is appealing • homonyms • “aha! objects must be word senses”

  46. Problems • Theoretical • Practical

  47. Theoretical
