Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton
Outline • Definition • Uses for NLP • WASPS thesaurus • web thesauruses • Argument: words not word senses • Evaluation proposals • Cyborgs
What is a thesaurus? a resource that groups words according to similarity
Manual and automatic • Manual • Roget, WordNets, many publishers • Automatic • Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999) • aka distributional • two words are similar if they occur in same contexts • Are they comparable?
Thesauruses in NLP • sparse data • does x go with y? • don’t know, they have never been seen together • New question: does x+friends go with y+friends? • indirect evidence for x and y • thesaurus tells us who friends are • “backing off”
Relevant in: • Parsing • PP-attachment • conjunction scope • Bridging anaphors • Text cohesion • Word sense disambiguation (WSD) • Speech understanding • Spelling correction
Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze • allegory? in upwaters? No • alligator? in upwaters? No • allegory+friends in upwaters? No • alligator+friends in upwaters? Yes
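The "x+friends" back-off in the alligator example can be sketched as follows. This is a minimal illustration, not the method behind the slides: the co-occurrence counts, the friend lists, and the function names are all invented for the example.

```python
from collections import Counter

# Toy co-occurrence counts (invented for illustration).
PAIR_COUNTS = Counter({
    ("crocodile", "river"): 4,
    ("alligator", "swamp"): 3,
    ("allegory", "poem"): 5,
})

# Toy thesaurus: each word mapped to its "friends" (invented for illustration).
FRIENDS = {
    "alligator": ["crocodile"],
    "allegory": ["metaphor", "parable"],
    "upwaters": ["river", "swamp"],
}


def direct_count(x, y):
    """How often x and y were seen together (0 if never)."""
    return PAIR_COUNTS[(x, y)]


def backed_off_count(x, y):
    """Sum the counts over every pairing of x+friends with y+friends."""
    xs = [x] + FRIENDS.get(x, [])
    ys = [y] + FRIENDS.get(y, [])
    return sum(PAIR_COUNTS[(a, b)] for a in xs for b in ys)


for candidate in ("allegory", "alligator"):
    print(candidate,
          direct_count(candidate, "upwaters"),
          backed_off_count(candidate, "upwaters"))
```

With these toy counts neither candidate has direct evidence, but only alligator+friends has any evidence with upwaters+friends.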
PP-attachment • investigate stromatolite with microscope/speckles • microscope: verb attachment • speckles: noun attachment • inspect jasper with spectrometer • which?
PP attachment (cont) • compare frequencies of • <inspect, with, spectrometer> • <jasper, with, spectrometer> • both zero? Try • <inspect+friends, with, spectrometer+friends> • <jasper+friends, with, spectrometer+friends>
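The comparison above can be sketched as a small decision procedure: compare the direct triple frequencies, and only if both are zero, retry with thesaurus friends. The counts, the friend lists, and the names `triple_count` and `attach` are hypothetical, chosen only to make the example run.

```python
from collections import Counter

# Toy <head, preposition, object> counts (invented for illustration).
TRIPLES = Counter({
    ("examine", "with", "microscope"): 6,
    ("inspect", "with", "instrument"): 2,
    ("stone", "with", "vein"): 1,
})

# Toy thesaurus friends (invented for illustration).
FRIENDS = {
    "inspect": ["examine", "investigate"],
    "jasper": ["stone", "quartz"],
    "spectrometer": ["microscope", "instrument"],
}


def expand(word):
    """The word itself plus its thesaurus friends."""
    return [word] + FRIENDS.get(word, [])


def triple_count(head, prep, obj, back_off=False):
    """Frequency of <head, prep, obj>, optionally summed over friends."""
    heads = expand(head) if back_off else [head]
    objs = expand(obj) if back_off else [obj]
    return sum(TRIPLES[(h, prep, o)] for h in heads for o in objs)


def attach(verb, noun, prep, obj):
    """Return 'verb', 'noun' or 'undecided' for the PP attachment."""
    v = triple_count(verb, prep, obj)
    n = triple_count(noun, prep, obj)
    if v == 0 and n == 0:
        # Both zero: fall back to x+friends and y+friends.
        v = triple_count(verb, prep, obj, back_off=True)
        n = triple_count(noun, prep, obj, back_off=True)
    if v > n:
        return "verb"
    if n > v:
        return "noun"
    return "undecided"


print(attach("inspect", "jasper", "with", "spectrometer"))
```

A real system would need smoothing and a principled way of choosing how many friends to include; the sketch only shows where the thesaurus enters the decision.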
Conjunction scope • Compare • old boots and shoes • old boots and apples • Are the shoes old? • Are the apples old? • Hypothesis: • wide scope only when words are similar • hard problem: thesaurus might help
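One very rough way to put the hypothesis to work is to treat two conjuncts as similar when either appears in the other's thesaurus neighbour list. The neighbour sets and function names below are invented, and the slides present this only as a hard problem where a thesaurus might help, so the sketch should be read as an illustration of the idea rather than a solution.

```python
# Toy nearest-neighbour sets (invented for illustration).
NEIGHBOURS = {
    "boot": {"shoe", "sandal", "glove"},
    "shoe": {"boot", "sandal", "slipper"},
    "apple": {"pear", "orange", "plum"},
}


def similar(a, b):
    """True if either word is listed among the other's neighbours."""
    return b in NEIGHBOURS.get(a, set()) or a in NEIGHBOURS.get(b, set())


def modifier_scope(modifier, first, second):
    """Guess whether `modifier` applies to both conjuncts ('wide') or
    only to the first ('narrow'), using thesaurus similarity alone."""
    return "wide" if similar(first, second) else "narrow"


print(modifier_scope("old", "boot", "shoe"))
print(modifier_scope("old", "boot", "apple"))
```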
Bridging anaphor resolution • Maria bought a large apple. The fruit was red and crisp. • fruit and apple co-refer • How to find co-referring terms?
Text cohesion • words on same theme • same segment • change in theme of words • new segment • same theme: same thesaurus class
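As a minimal sketch of the idea, assume each content word maps to a single thesaurus class and start a new segment whenever two adjacent sentences share no class. The word-to-class table and the function names are hypothetical; practical segmenters use wider windows and weighted overlap.

```python
# Toy word-to-class table (invented for illustration).
WORD_CLASS = {
    "river": "water", "boat": "water", "fish": "water",
    "loan": "finance", "interest": "finance", "mortgage": "finance",
}


def classes(sentence):
    """Set of thesaurus classes found among the words of a sentence."""
    return {WORD_CLASS[w] for w in sentence.lower().split() if w in WORD_CLASS}


def segment(sentences):
    """Group sentences; break where adjacent sentences share no class."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if classes(prev) & classes(cur):
            segments[-1].append(cur)   # same theme: same segment
        else:
            segments.append([cur])     # change in theme: new segment
    return segments


text = [
    "the boat drifted down the river",
    "we caught a fish",
    "the loan carries high interest",
    "the mortgage is a loan",
]
for seg in segment(text):
    print(seg)
```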
Word Sense Disambiguation (WSD) • pike: fish or weapon • We caught a pike this afternoon • probably no direct evidence for • catch pike • probably is direct evidence for • catch {pike,carp,bream,cod,haddock,…}
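The pike example can be sketched by scoring each sense with the evidence for its whole thesaurus class rather than for the word itself. The object counts, the weapon class members, and the function names are assumptions made for illustration; only the fish class comes from the slide.

```python
from collections import Counter

# Toy <verb, object> counts (invented for illustration).
OBJ_COUNTS = Counter({
    ("catch", "carp"): 7,
    ("catch", "cod"): 5,
    ("wield", "spear"): 3,
    ("carry", "lance"): 2,
})

# One thesaurus class per sense; the weapon members are hypothetical.
SENSE_CLASSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "lance", "halberd"],
}


def class_evidence(verb, members):
    """Total count of the verb taking any class member as object."""
    return sum(OBJ_COUNTS[(verb, m)] for m in members)


def disambiguate(verb, noun):
    """Pick the sense of `noun` whose class has most evidence with `verb`."""
    scores = {
        sense: class_evidence(verb, members)
        for sense, members in SENSE_CLASSES.items()
        if noun in members
    }
    best = max(scores, key=scores.get)
    return best, scores


print(disambiguate("catch", "pike"))
```

There is no direct count for catch + pike in the toy data, so the decision rests entirely on the other members of each class.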
WordNet, Roget • widely used for all the above
The WASPS thesaurus • credit: David Tugwell • EPSRC grant K8931 • POS-tag, lemmatise and parse the BNC (100M words) • Find all grammatical relations • <obj, climb, bank> • <modifier, big, bank> • <subject, bank, refuse> • 70 million triples
WASPS thesaurus (cont) • Similarity: • <obj, drink, beer> • <obj, drink, wine> • one point similarity between beer and wine • count all points of similarity between all pairs of words • weight according to frequencies • product of MI: Lin (1998)
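The scoring steps above can be sketched as follows. The slide says only "product of MI: Lin (1998)"; this sketch reads that as multiplying the two words' mutual-information scores for each shared context and summing the products, which is an assumption. The triple counts, the use of positive pointwise MI, and all function names are likewise illustrative and are not the WASPS implementation.

```python
import math
from collections import Counter, defaultdict

# Toy <relation, head, word> triples with invented counts.
TRIPLES = Counter({
    ("obj", "drink", "beer"): 8,
    ("obj", "drink", "wine"): 6,
    ("obj", "brew", "beer"): 3,
    ("obj", "pour", "wine"): 2,
    ("obj", "pour", "beer"): 2,
    ("obj", "climb", "bank"): 4,
    ("obj", "rob", "bank"): 5,
})


def build_profiles(triples):
    """Map each word to a Counter of its (relation, head) contexts."""
    profiles = defaultdict(Counter)
    for (rel, head, word), n in triples.items():
        profiles[word][(rel, head)] += n
    return profiles


def mi_table(profiles):
    """Positive pointwise MI for every (word, context) pair seen."""
    total = sum(sum(ctxs.values()) for ctxs in profiles.values())
    word_tot = {w: sum(ctxs.values()) for w, ctxs in profiles.items()}
    ctx_tot = Counter()
    for ctxs in profiles.values():
        ctx_tot.update(ctxs)
    table = {}
    for w, ctxs in profiles.items():
        for ctx, joint in ctxs.items():
            mi = math.log2(joint * total / (word_tot[w] * ctx_tot[ctx]))
            table[(w, ctx)] = max(mi, 0.0)  # ignore negative associations
    return table


def shared_contexts(w1, w2, profiles):
    """The 'points of similarity': contexts both words occur in."""
    return set(profiles.get(w1, {})) & set(profiles.get(w2, {}))


def similarity(w1, w2, profiles, table):
    """Sum over shared contexts of the product of the two MI scores."""
    return sum(table[(w1, c)] * table[(w2, c)]
               for c in shared_contexts(w1, w2, profiles))


PROFILES = build_profiles(TRIPLES)
MI = mi_table(PROFILES)
print(similarity("beer", "wine", PROFILES, MI))
print(similarity("beer", "bank", PROFILES, MI))
```

Ranking every other word by this score for a given headword would give a nearest-neighbour list of the kind a distributional thesaurus provides.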
Word Sketches • one-page summary of a word’s grammatical and collocational behaviour • demo: http://wasps.itri.bton.ac.uk • the Sketch Engine • input any corpus • generate word sketches and thesaurus • just available now
Nearest neighbours
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey
exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup
VERBS
measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften
ADJECTIVES
hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp
no clustering (tho’ could be done) • no hierarchy (tho’ could be done) • rhythm • all on the web: http://wasps.itri.bton.ac.uk • registration required
The web • an enormous linguist’s playground • Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds) 29 (3) • (coming soon)
Google sets • http://labs.google.com/sets • Input: zebra giraffe buffalo • Output: kudu hyena impala leopard hippo waterbuck elephant cheetah eland
Google sets • http://labs.google.com/sets • Input: harbin beijing nanking • Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou
Tree structure • Roget • all human knowledge as tree structure • 1000 top categories • subdivisions • like this • etc • etc
Directories and thesauruses • Yahoo, http://www.yahoo.com • Open directory project, http://dmoz.org • all human activity as tree structure plus corpus at every node • gather corpus, identify domain vocabulary • Gonzalo and colleagues, Madrid, CL Special Issue • Agirre and colleagues, ‘topic signatures’
Words and word senses • automatic thesauruses • words • manual thesauruses • simple hierarchy is appealing • homonyms • “aha! objects must be word senses”
Problems • Theoretical • Practical