Word sense disambiguation and information retrieval Chapter 17 Jurafsky, D. & Martin J. H. SPEECH and LANGUAGE PROCESSING Jarmo Ritola - jarmo.ritola@hut.fi
Lexical Semantic Processing • Word sense disambiguation • which sense of a word is being used • non-trivial task • robust algorithms • Information retrieval • broad field • storage and retrieval of requested text documents • vector space model
Word Sense Disambiguation • (17.1) “..., everybody has a career and none of them includes washing DISHES” • (17.2) “In her tiny kitchen at home, Ms. Chen works efficiently, stir-frying several simple DISHES, including braised pig’s ears and chicken livers with green peppers” • (17.6) “I’m looking for a restaurant that SERVES vegetarian DISHES”
Selectional Restriction • Rule-to-rule approach • Blocks the formation of representations with selectional restriction violations • Correct sense obtained as a side effect • PATIENT roles, mutual exclusion • dishes + stir-fry => food sense • dishes + wash => artifact sense • Need: hierarchical types and restrictions
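The type checking behind this slide can be pictured with a toy lookup table. A minimal sketch in Python, assuming invented semantic types and PATIENT restrictions rather than the chapter's actual knowledge base:

```python
# Toy sense inventory: each sense of "dish" with the semantic type it belongs to.
SENSES = {
    "dish/artifact": "physical-object",
    "dish/food": "edible",
}

# Toy selectional restriction each verb places on its PATIENT role.
RESTRICTIONS = {
    "wash": "physical-object",
    "stir-fry": "edible",
}

def disambiguate(verb, senses=SENSES, restrictions=RESTRICTIONS):
    """Return the senses whose semantic type satisfies the verb's PATIENT restriction."""
    required = restrictions[verb]
    return [s for s, sem_type in senses.items() if sem_type == required]

print(disambiguate("stir-fry"))  # ['dish/food']
print(disambiguate("wash"))      # ['dish/artifact']
```

A real system would replace the flat type labels with a hierarchy (e.g. from WordNet) so that a restriction like "edible" is satisfied by any descendant type.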
S.R. Limitations • Selectional restrictions too general • (17.7) … kind of DISHES do you recommend? • True restriction violations • (17.8) …you can’t EAT gold for lunch… • negative environment • (17.9) … Mr. Kulkarni ATE glass … • Metaphoric and metonymic uses • Selectional association (Resnik)
Robust Word Sense Disambiguation • Robust, stand-alone systems • Preprocessing • part-of-speech tagging, context selection, stemming, morphological processing, parsing… • Feature selection, feature vector • Train a classifier to assign words to senses • Supervised, bootstrapping, unsupervised • Does the system scale?
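To make the preprocessing bullets concrete, here is a rough standard-library-only sketch of tokenization, stop-word removal, crude stemming, and context selection; the stop list and suffix rules are stand-ins for real components such as a POS tagger and a Porter stemmer:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "or", "not"}

def crude_stem(token):
    # Toy suffix stripping; a real system would use a proper stemmer (e.g. Porter).
    return re.sub(r"(ing|ed)$", "", token) if len(token) > 4 else token

def context_window(sentence, target, window=3):
    """Lowercase, tokenize, drop stop words, keep +/- `window` words around the target."""
    tokens = [t for t in re.findall(r"[a-z']+", sentence.lower())
              if t not in STOP_WORDS]
    i = tokens.index(target)
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return [crude_stem(t) for t in left + right]

print(context_window("An electric guitar and bass player stand off to one side", "bass"))
# ['electric', 'guitar', 'player', 'stand', 'off']
```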
Inputs: Feature Vectors • Target word, context • Select relevant linguistic features • Encode them in a usable form • Numeric or nominal values • Collocational features • Co-occurrence features
Inputs: Feature Vectors (2) • (17.11) An electric guitar and BASS player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps. • Collocational: [guitar, NN1, and, CJC, player, NN1, stand, VVB] • Co-occurrence: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] over the vocabulary [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
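The two encodings for example (17.11) can be reproduced roughly as follows; the vocabulary list mirrors the slide, while the window size and helper names are illustrative choices (POS tags are omitted):

```python
SENTENCE = ("an electric guitar and bass player stand off to one side , "
            "not really part of the scene").split()

VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def collocational_features(tokens, target="bass", window=2):
    """Words immediately around the target, kept in order (POS tags omitted here)."""
    i = tokens.index(target)
    return tokens[i - window:i] + tokens[i + 1:i + 1 + window]

def cooccurrence_features(tokens, vocab=VOCAB):
    """Binary vector: does each vocabulary word occur anywhere in the context?"""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

print(collocational_features(SENTENCE))  # ['guitar', 'and', 'player', 'stand']
print(cooccurrence_features(SENTENCE))   # 1s in the 'player' and 'guitar' slots
```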
Supervised Learning • Feature-encoded inputs + categories • Naïve Bayes classifier • Decision list classifiers • case statements • tests ordered according to sense likelihood
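A minimal naive Bayes sense classifier over bag-of-context-words features might look like the sketch below; add-alpha smoothing and log-space scoring are standard choices here, not details taken from the chapter:

```python
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: list of (context_words, sense). Returns sense priors and per-sense word counts."""
    sense_counts = Counter(sense for _, sense in examples)
    word_counts = defaultdict(Counter)
    for words, sense in examples:
        word_counts[sense].update(words)
    return sense_counts, word_counts

def classify(context, sense_counts, word_counts, alpha=1.0):
    """Pick argmax_s P(s) * prod_w P(w|s), with add-alpha smoothing, in log space."""
    total = sum(sense_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + alpha * len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + alpha) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best
```

A decision list classifier would instead apply a sequence of single-feature tests, ordered by how strongly each feature indicates one sense over the others.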
Bootstrapping Approaches • Seeds: a small number of labeled instances • Initial classifier extracts a larger training set • Repeat => a series of classifiers with improving accuracy and coverage • Seeds from hand-labeled examples • One sense per collocation • Also automatic selection from a machine-readable dictionary
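The loop itself can be sketched generically; the classifier interface (train_fn, predict_fn) and the confidence threshold are assumptions made for illustration:

```python
def bootstrap(seed_labeled, unlabeled, train_fn, predict_fn, threshold=0.9, rounds=5):
    """Grow the training set by repeatedly labeling the examples
    the current classifier is most confident about."""
    labeled = list(seed_labeled)
    for _ in range(rounds):
        model = train_fn(labeled)
        newly_labeled = []
        for x in unlabeled:
            sense, confidence = predict_fn(model, x)
            if confidence >= threshold:
                newly_labeled.append((x, sense))
        if not newly_labeled:
            break
        labeled.extend(newly_labeled)
        # Remove the newly labeled items from the unlabeled pool.
        unlabeled = [x for x in unlabeled
                     if all(x is not y for y, _ in newly_labeled)]
    return train_fn(labeled)
```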
Unsupervised Methods • Unlabeled feature vectors are grouped into clusters according to a similarity metric • Clusters are labeled by hand • Agglomerative clustering • Challenges • the correct senses may not be known in advance • heterogeneous clusters • the number of clusters and the number of senses may differ
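One way to picture the clustering step is a bare-bones agglomerative procedure over context vectors; average-link merging and cosine similarity are my choices here, the slide only names the general family of methods:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerate(vectors, target_clusters):
    """Start with one cluster per vector; repeatedly merge the most similar pair."""
    clusters = [[v] for v in vectors]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-link similarity between clusters i and j.
                sim = sum(cosine(u, v) for u in clusters[i] for v in clusters[j]) \
                      / (len(clusters[i]) * len(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```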
Dictionary-Based Approaches • Large-scale disambiguation possible • Sense definitions retrieved from the dictionary • Choose the sense whose definition overlaps most with the context words • Dictionary entries are relatively short • Not enough overlap • expand word lists, subject codes
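A simplified Lesk-style sketch of the dictionary approach, assuming two paraphrased glosses for "dish" (not actual dictionary text):

```python
GLOSSES = {
    "dish/artifact": "a container such as a plate or bowl used for holding or serving a meal",
    "dish/food": "a particular item or variety of prepared food",
}

def lesk(context_words, glosses=GLOSSES):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(context_words)
    def overlap(sense):
        return len(context & set(glosses[sense].split()))
    return max(glosses, key=overlap)
```

With glosses this short the overlap is often zero, which is exactly the problem the last bullet addresses by expanding the word lists or using subject codes.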
Information Retrieval • Compositional semantics • Bag of words methods • Terminology • document • collection • term • query • Ad hoc retrieval
The Vector Space Model • List of terms occurring in the collection • document vector: presence/absence of terms • raw term frequency • length normalization => only the direction of the vector matters • similarity is the cosine of the angle between vectors
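A small vector space model sketch, assuming a fixed four-term vocabulary and raw term frequency weights; documents are ranked by the cosine of the angle between their vector and the query vector:

```python
import math

TERMS = ["speech", "language", "retrieval", "vector"]

def tf_vector(text, terms=TERMS):
    """Raw term frequency vector over the fixed term list."""
    tokens = text.lower().split()
    return [tokens.count(t) for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["speech and language processing",
        "vector space retrieval of language documents"]
query = "speech language"
ranked = sorted(docs, key=lambda d: cosine(tf_vector(query), tf_vector(d)), reverse=True)
print(ranked[0])  # the first document is the closer match to the query
```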
The Vector Space Model • Document collection • term-by-document matrix of weights
Term Weighting • Enormous impact on effectiveness • Term frequency within a single document • Distribution of the term across the collection • Same weighting scheme for documents and queries • Alternative weighting methods for queries • AltaVista: indexes on the order of 1,000,000,000 words • average query: 2.3 words
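One common instantiation of these two factors is tf-idf; the sketch below uses the basic tf × idf form (weighting schemes in the chapter differ in damping and normalization):

```python
import math

def tf_idf_matrix(docs):
    """Return, for each document, a dict mapping term -> tf * idf weight."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: how many documents each term appears in.
    df = {}
    for tokens in tokenized:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    weights = []
    for tokens in tokenized:
        w = {}
        for t in set(tokens):
            tf = tokens.count(t)
            idf = math.log(n / df[t])  # 0 for terms that occur in every document
            w[t] = tf * idf
        weights.append(w)
    return weights
```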
Recall versus precision • Stemming • Stop list • Homonymy, polysemy, synonymy, hyponymy • Improving user queries • relevance feedback • query expansion, thesaurus, thesaurus generation, term clustering
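Relevance feedback is usually implemented with a Rocchio-style update that moves the query vector toward relevant documents and away from non-relevant ones; the alpha/beta/gamma weights below are conventional defaults rather than values from the chapter:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reweight the query vector using vectors of judged documents."""
    new_q = [alpha * q for q in query]
    for doc in relevant:
        for i, value in enumerate(doc):
            new_q[i] += beta * value / len(relevant)
    for doc in nonrelevant:
        for i, value in enumerate(doc):
            new_q[i] -= gamma * value / len(nonrelevant)
    # Negative weights are conventionally clipped to zero.
    return [max(0.0, w) for w in new_q]
```

In the exercise on the last slide, the initial query over ("speech", "language") would be the vector [1, 1]; marking returned documents as relevant or non-relevant shifts those two weights toward the desired balance.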
Summary • WSD: assign words to senses • Selectional restrictions • Machine learning approaches (small scale) • supervised, bootstrapping, unsupervised • Machine-readable dictionaries (large scale) • Bag-of-words methods, vector space model • Query improvement (relevance feedback)
Exercise - Relevance Feedback The document collection is ordered according to the raw term frequency of the words "speech" and "language". The values and ordering are shown in the table below. • You want to find documents with many "speech" words but few "language" words (e.g. a ratio of 8 : 2). Your initial query is {"speech", "language"}, i.e. the terms have equal weights. • The search engine always returns the three most similar documents. • Show that with relevance feedback you get the documents you want. • How important is the correctness of the feedback from the user?