Chapter 20 Computational Lexical Semantics
Supervised Word-Sense Disambiguation (WSD) • Methods that learn a classifier from manually sense-tagged text using machine learning techniques. • Classifier: machine learning model for classifying instances into one of a fixed set of classes • Treats WSD as a classification problem, where a target word is assigned the most likely sense (from a given sense inventory), based on the context in which the word appears.
Supervised Learning for WSD • Assume the POS of the target word is already determined. • Encode context using a set of features to be used for disambiguation. • Given labeled training data, encode it using these features and train a machine learning algorithm. The result is a classifier. • Use the trained classifier to disambiguate future instances of the target word (test data), given their contextual features (the same feature set).
Feature Engineering • The success of machine learning requires instances to be represented using an effective set of features that are correlated with the categories of interest. • Feature engineering can be a laborious process that requires substantial human expertise and knowledge of the domain. • In NLP it is common to extract many (even thousands of) potential features and use a learning algorithm that works well with many relevant and irrelevant features.
Contextual Features • Surrounding bag of words • POS of neighboring words • Local collocations • Syntactic relations Experimental evaluations indicate that all of these features are useful, and the best results come from integrating all of these cues in the disambiguation process.
Surrounding Bag of Words • Unordered individual words near the ambiguous word (their exact positions are ignored) • To create the features: • Let BOW be an empty hash table • For each sentence in the training data: • For each word W within ±N words of the target word: • If W not in BOW: then BOW[W] = 0 • BOW[W] += 1 • Let Fs be a list of the K most frequent words in BOW, excluding “stop words” • “Stop words”: pronouns, numbers, conjunctions, and other “function” words. Standard lists of stop words are available. • Define K features for each sentence, one for each of the K words: • Feature i is the number of times Fs[i] appears within ±N words of the target word
Surrounding Bag of Words Features: Example • Example: disambiguating bass.n • The 12 most frequent content words from a collection of bass.n sentences from the WSJ (J&M p. 641): • [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band] • “An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.” • Features for that sentence: [0,0,0,1,0,0,0,0,0,0,1,0] • In an ARFF file, these would be the values in 12 of the feature (attribute) columns
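As a minimal sketch (not from the original slides), the pseudocode above can be turned into Python roughly as follows; the window size n, the whitespace tokenizer, and the stop-word handling are simplifying assumptions, while the 12 frequent words and the example sentence come from the bass.n example.

```python
from collections import Counter

def tokenize(sentence):
    # Simplifying assumption: whitespace tokenization with basic punctuation stripping.
    return [w.strip(".,").lower() for w in sentence.split()]

def window(tokens, target, n):
    """Words within +-n positions of the (first) occurrence of the target word."""
    if target not in tokens:
        return []
    i = tokens.index(target)
    return tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]

def top_k_words(sentences, target, n=10, k=12, stop_words=frozenset()):
    """The list Fs from the pseudocode: the k most frequent non-stop words
    within +-n words of the target across the training sentences."""
    bow = Counter()
    for s in sentences:
        bow.update(w for w in window(tokenize(s), target, n) if w not in stop_words)
    return [w for w, _ in bow.most_common(k)]

def bow_features(sentence, target, fs, n=10):
    """Feature i = number of times fs[i] appears within +-n words of the target."""
    counts = Counter(window(tokenize(sentence), target, n))
    return [counts[w] for w in fs]

# The 12 frequent content words for bass.n from the slide above:
fs = ["fishing", "big", "sound", "player", "fly", "rod",
      "pound", "double", "runs", "playing", "guitar", "band"]
sent = ("An electric guitar and bass player stand off to one side, not really part "
        "of the scene, just as a sort of nod to gringo expectations perhaps.")
print(bow_features(sent, "bass", fs))   # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```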
Surrounding Bag of Words • The idea: these words serve as general topical cues about the context (“global” features)
POS of Neighboring Words • Use part-of-speech of immediately neighboring words. • Provides evidence of local syntactic context. • P-i is the POS of the word i positions to the left of the target word. • Pi is the POS of the word i positions to the right of the target word. • Typical to include features for: P-3, P-2, P-1, P1, P2, P3
POS of Neighboring Words • “An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.” • Features for the sentence: • [JJ,NN,CC,NN,VB,IN] • 6 more feature/attribute columns in the arff file
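A hedged sketch of extracting P-3..P3; NLTK's pos_tag is just one assumed tagger (the slides do not prescribe one), so the tags it produces may differ slightly from the ones shown above.

```python
import nltk  # assumed tagger; requires nltk.download('averaged_perceptron_tagger')

def pos_features(tokens, target_index, width=3):
    """POS tags of the words at positions -width..-1 and +1..+width relative to
    the target (P-3..P-1 and P1..P3). Positions off the edge of the sentence
    get the placeholder value 'NONE'."""
    tagged = nltk.pos_tag(tokens)              # [(word, tag), ...]
    feats = []
    for offset in list(range(-width, 0)) + list(range(1, width + 1)):
        i = target_index + offset
        feats.append(tagged[i][1] if 0 <= i < len(tagged) else "NONE")
    return feats

tokens = "An electric guitar and bass player stand off to one side".split()
print(pos_features(tokens, tokens.index("bass")))
# Something close to the slide's [JJ, NN, CC, NN, VB, IN]; exact tags depend on the tagger.
```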
Local Collocations • Specific lexical context immediately adjacent to the word. • For example, to determine whether “interest” as a noun refers to “readiness to give attention” or “money paid for the use of money”, the following collocations are useful: • “in the interest of” • “an interest in” • “interest rate” • “accrued interest” • Ci,j is a feature consisting of the sequence of words from position i to position j relative to the target word (the target word itself is excluded). • C-2,1 for “in the interest of” is “in the of” • Typical to include: • Single word context: C-1,-1, C1,1, C-2,-2, C2,2 • Two word context: C-2,-1, C-1,1, C1,2 • Three word context: C-3,-1, C-2,1, C-1,2, C1,3
Local Collocations • Typical to include: • Single word context: C-1,-1, C1,1, C-2,-2, C2,2 • Two word context: C-2,-1, C-1,1, C1,2 • Three word context: C-3,-1, C-2,1, C-1,2, C1,3 • “An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.” • Features for this sentence: • [and, player, guitar, stand, guitar and, and player, player stand, electric guitar and, guitar and player, and player stand, player stand off] (11 more columns in the ARFF file) • What’s the difference from the bag-of-words features? • These features reflect position and are N-grams (fixed sequences); they capture the local context of the target word more richly. Bag-of-words features, in contrast, are more general clues about the topic.
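A minimal sketch of the Ci,j collocation features, assuming a simple whitespace tokenization; the spans listed above are reproduced directly, and positions that fall outside the sentence are simply dropped.

```python
def collocation(tokens, t, i, j):
    """C[i,j]: the words from position i to position j relative to the target at
    index t, with the target itself excluded (e.g. C[-2,1] of 'in the interest of'
    is 'in the of'). Out-of-range positions are dropped."""
    words = [tokens[t + k] for k in range(i, j + 1)
             if k != 0 and 0 <= t + k < len(tokens)]
    return " ".join(words)

def collocation_features(tokens, t):
    spans = [(-1, -1), (1, 1), (-2, -2), (2, 2),        # single-word context
             (-2, -1), (-1, 1), (1, 2),                  # two-word context
             (-3, -1), (-2, 1), (-1, 2), (1, 3)]         # three-word context
    return [collocation(tokens, t, i, j) for i, j in spans]

tokens = "An electric guitar and bass player stand off to one side".split()
print(collocation_features(tokens, tokens.index("bass")))
# -> ['and', 'player', 'guitar', 'stand', 'guitar and', 'and player', 'player stand',
#     'electric guitar and', 'guitar and player', 'and player stand', 'player stand off']
```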
Syntactic Relations (Ambiguous Verbs) • For an ambiguous verb, it is very useful to know its direct object. • “played the game” • “played the guitar” • “played the risky and long-lasting card game” • “played the beautiful and expensive guitar” • “played the big brass tuba at the football game” • “played the game listening to the drums and the tubas” • May also be useful to know its subject: • “The game was played while the band played.” • “The game that included a drum and a tuba was played on Friday.”
Syntactic Relations (Ambiguous Nouns) • For an ambiguous noun, it is useful to know what verb it is an object of: • “played the piano and the horn” • “poached the rhinoceros’ horn” • May also be useful to know what verb it is the subject of: • “the bank near the river loaned him $100” • “the bank is eroding and the bank has given the city the money to repair it”
Syntactic Relations (Ambiguous Adjectives) • For an ambiguous adjective, it is useful to know the noun it is modifying. • “a brilliant young man” • “a brilliant yellow light” • “a wooden writing desk” • “a wooden acting performance”
Using Syntax in WSD (per-word classifiers) • Produce a parse tree for a sentence using a syntactic parser, e.g. for “John played the piano”: (S (NP (ProperN John)) (VP (V played) (NP (DET the) (N piano)))) • For ambiguous verbs, use the head word of its direct object and of its subject as features. • For ambiguous nouns, use the verbs for which it is the object and the subject as features. • For ambiguous adjectives, use the head word (noun) of its NP as a feature.
Syntactic Relations (Ambiguous Verbs) • Feature: head of direct object (special value null if none) • “played the game” → game • “played the guitar” → guitar • “played the risky and long-lasting card game” → game • “played the beautiful and expensive guitar” → guitar • “played the big brass tuba at the football game” → tuba • “played the game listening to the drums and the tubas” → game • Feature: head of subject (special value null if none) • “The game was played while the band played.” → game, band (two instances of “played” in one sentence) • “The game that included a drum and a tuba was played on Friday.” → game
Syntactic Relations (Ambiguous Nouns) • Feature: head verb that the target is the object of • “played the piano and the horn” → played • “poached the rhinoceros’ horn” → poached • Feature: head verb that the target is the subject of • “the bank near the river loaned him $100” → loaned • “the bank is eroding and the bank has given the city the money to repair it” → eroding, given (two instances of “bank” in one sentence)
Syntactic Relations (Ambiguous Adjectives) • Feature: noun the adjective modifies • “a brilliant young man” → man • “a brilliant yellow light” → light • “a wooden writing desk” → desk • “a wooden acting performance” → performance
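A hedged sketch of extracting these syntactic-relation features; spaCy and its dependency labels (dobj/obj, nsubj/nsubjpass) are assumed tools rather than something the slides prescribe, and any parser that exposes subjects, objects, and modified heads would do.

```python
import spacy  # assumed parser; the slides do not prescribe a particular one

nlp = spacy.load("en_core_web_sm")

def syntactic_features(sentence, target, pos):
    """Syntactic-relation feature(s) for the first occurrence of the target word.
    Verbs: (head of direct object, head of subject); nouns: the verb the noun is
    the object or subject of; adjectives: the noun being modified.
    'null' is returned when the relation is absent, as on the slides."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.text.lower() != target:
            continue
        if pos == "verb":
            dobj = next((c.text for c in tok.children if c.dep_ in ("dobj", "obj")), "null")
            subj = next((c.text for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")), "null")
            return dobj, subj
        if pos == "noun":
            return tok.head.text if tok.head.pos_ == "VERB" else "null"
        if pos == "adj":
            return tok.head.text if tok.head.pos_ == "NOUN" else "null"
    return "null"

print(syntactic_features("He played the risky and long-lasting card game.", "played", "verb"))
# -> ('game', 'He'): direct-object head 'game', as on the slide; the subject here is 'He'.
print(syntactic_features("She saw a brilliant yellow light.", "brilliant", "adj"))  # -> 'light'
```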
Summary: Supervised Methodology • Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities. • One tagged word per instance • Select a set of features with which to represent context. • co-occurrences, collocations, POS tags, verb-obj relations, etc... • Convert sense-tagged training instances to feature vectors. • Apply a machine learning algorithm to induce a classifier. • Form – structure or relation among features • Parameters – strength of feature interactions • Convert a held out sample of test data into feature vectors. • “correct” sense tags are known but not used • Apply classifier to test instances to assign a sense tag.
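As a sketch of this methodology end to end, the following uses scikit-learn with a Naïve Bayes classifier (one of the algorithms listed on the next slide); the feature dictionaries and sense labels are hypothetical stand-ins for real sense-tagged data encoded with the features above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Hypothetical sense-tagged instances for the target word 'bass':
# each instance is a dict of (feature name -> value) plus a gold sense label.
train_X = [{"bow:guitar": 1, "bow:player": 1, "P-1": "CC", "C-1,-1": "and"},
           {"bow:fishing": 1, "bow:rod": 1, "P-1": "JJ", "C-1,-1": "sea"}]
train_y = ["bass.music", "bass.fish"]

test_X = [{"bow:band": 1, "P-1": "NN", "C-1,-1": "electric"}]
test_y = ["bass.music"]   # gold tags: known for evaluation, never used in training

# DictVectorizer one-hot encodes string-valued features and keeps counts as-is.
clf = make_pipeline(DictVectorizer(), MultinomialNB())
clf.fit(train_X, train_y)          # induce the classifier from the training vectors
pred = clf.predict(test_X)         # assign a sense tag to each test instance
print(pred, accuracy_score(test_y, pred))
```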
Supervised Learning Algorithms • Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results: • Support Vector Machines • Nearest Neighbor Classifiers • Decision Trees • Decision Lists • Naïve Bayesian Classifiers • Perceptrons • Neural Networks • Graphical Models • Log Linear Models
Summary: Supervised WSD with Individual Classifiers • Many supervised machine learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well. • (Witten and Frank, 2000) is a great introduction to supervised learning. • The choice of features tends to differentiate systems more than the choice of learning algorithm. • Good sets of features tend to include: • Co-occurrences or keywords • Collocations • Bigrams and trigrams • Part of speech • Syntactic features
Convergence of Results • Accuracy of different systems applied to the same data tends to converge on a particular value, with no one system dramatically better than another. • Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task (a small number of words, so it is feasible to develop one classifier per word) • Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task • Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task
Evaluation of WSD • “In vitro”: • A corpus is developed in which one or more ambiguous words are labeled with explicit sense tags according to some sense inventory. • The corpus is used for training and testing WSD, evaluated using accuracy (percentage of labeled words correctly disambiguated). • Use most-frequent-sense selection as a baseline. • “In vivo”: • Incorporate the WSD system into some larger application, such as machine translation, information retrieval, or question answering. • Evaluate the relative contribution of different WSD methods by measuring their impact on the overall system’s performance on the final task (accuracy of MT, IR, or QA results).
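A small sketch of in-vitro evaluation against the most-frequent-sense baseline; the sense labels are hypothetical.

```python
from collections import Counter

def accuracy(gold, predicted):
    """In-vitro accuracy: fraction of labeled test words disambiguated correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def mfs_baseline(train_senses, n_test):
    """Predict the most frequent training sense for every test instance."""
    mfs = Counter(train_senses).most_common(1)[0][0]
    return [mfs] * n_test

train_senses = ["bass.music", "bass.music", "bass.fish"]
gold = ["bass.music", "bass.fish", "bass.music"]
system = ["bass.music", "bass.fish", "bass.music"]

print(accuracy(gold, system))                                 # 1.0
print(accuracy(gold, mfs_baseline(train_senses, len(gold))))  # baseline: 0.667
```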