Word sense disambiguation (2) Instructor: Paul Tarau, based on Rada Mihalcea’s original slides Note: Some of the material in this slide set was adapted from a tutorial given by Rada Mihalcea & Ted Pedersen at ACL 2005
What is Supervised Learning? • Collect a set of examples that illustrate the various possible classifications or outcomes of an event. • Identify patterns in the examples associated with each particular class of the event. • Generalize those patterns into rules. • Apply the rules to classify a new event.
Task Definition: Supervised WSD • Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques. • Resources • Sense Tagged Text • Dictionary (implicit source of sense inventory) • Syntactic Analysis (POS tagger, Chunker, Parser, …) • Scope • Typically one target word per context • Part of speech of target word resolved • Lends itself to “targeted word” formulation • Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs
Two Bags of Words (Co-occurrences in the “window of context”)
Simple Supervised Approach • Given a sentence S containing “bank”: • For each word Wi in S • If Wi is in FINANCIAL_BANK_BAG then • Sense_1 = Sense_1 + 1; • If Wi is in RIVER_BANK_BAG then • Sense_2 = Sense_2 + 1; • If Sense_1 > Sense_2 then print “Financial” • else if Sense_2 > Sense_1 then print “River” • else print “Can’t Decide”;
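The pseudocode above can be made runnable directly. The sketch below is a minimal Python rendering of it; the two word bags are tiny hand-picked lists added here purely for illustration, not a real sense inventory.

```python
# Minimal, runnable version of the bag-of-words counting approach above.
# The two bags are illustrative only.
FINANCIAL_BANK_BAG = {"money", "interest", "check", "loan", "credit", "account"}
RIVER_BANK_BAG = {"water", "river", "shore", "mud", "fish", "boat"}

def disambiguate_bank(sentence: str) -> str:
    sense_1 = sense_2 = 0
    for word in sentence.lower().split():
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("the bank issued a check for the interest"))  # Financial
print(disambiguate_bank("we fished along the muddy river bank"))      # River
```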
Supervised Methodology • Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities. • One tagged word per instance/lexical sample disambiguation • Select a set of features with which to represent context. • co-occurrences, collocations, POS tags, verb-obj relations, etc... • Convert sense-tagged training instances to feature vectors. • Apply a machine learning algorithm to induce a classifier. • Form – structure or relation among features • Parameters – strength of feature interactions • Convert a held out sample of test data into feature vectors. • “correct” sense tags are known but not used • Apply classifier to test instances to assign a sense tag.
From Text to Feature Vectors • My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1) • The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)
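As a hedged illustration of this step, the sketch below turns the two tagged sentences into binary bag-of-words vectors; taking the vocabulary to be the union of the context words is an assumption made here for brevity.

```python
# Turn the two sense-tagged sentences above into binary feature vectors.
instances = [
    ("SHORE",   "my grandfather used to fish along the banks of the mississippi river"),
    ("FINANCE", "the bank issued a check for the amount of interest"),
]

# Vocabulary = union of all context words (illustrative choice).
vocabulary = sorted({w for _, text in instances for w in text.split()})

def to_vector(text: str) -> list[int]:
    words = set(text.split())
    return [1 if w in words else 0 for w in vocabulary]

for sense, text in instances:
    print(sense, to_vector(text))
```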
Supervised Learning Algorithms • Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results: • Support Vector Machines • Nearest Neighbor Classifiers • Decision Trees • Decision Lists • Naïve Bayesian Classifiers • Perceptrons • Neural Networks • Graphical Models • Log Linear Models
Naïve Bayesian Classifier • Naïve Bayesian Classifier well known in Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997) • …Word Sense Disambiguation is no exception • Assumes conditional independence among features, given the sense of a word. • The form of the model is assumed, but parameters are estimated from training instances • When applied to WSD, features are often “a bag of words” that come from the training data • Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)
Bayesian Inference • Given the observed features, what is the most likely sense? • Estimate the probability of the observed features given the sense • Estimate the unconditional probability of the sense • The unconditional probability of the features is a normalizing term and doesn’t affect sense classification
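Written out, this slide is Bayes’ rule plus the naïve independence assumption from the previous slide (notation added here for clarity):

```latex
% Bayes' rule with the naive independence assumption over features f_1,...,f_n
\hat{s} \;=\; \arg\max_{s} P(s \mid f_1,\ldots,f_n)
        \;=\; \arg\max_{s} \frac{P(f_1,\ldots,f_n \mid s)\,P(s)}{P(f_1,\ldots,f_n)}
        \;\approx\; \arg\max_{s} P(s)\prod_{i=1}^{n} P(f_i \mid s)
```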
The Naïve Bayesian Classifier • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) • P(S=1) = 1,500/2,000 = .75 • P(S=2) = 500/2,000 = .25 • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. • P(F1=“credit”) = 204/2,000 = .102 • P(F1=“credit”|S=1) = 200/1,500 = .133 • P(F1=“credit”|S=2) = 4/500 = .008 • Given a test instance that has one feature “credit” • P(S=1|F1=“credit”) = .133*.75/.102 = .978 • P(S=2|F1=“credit”) = .008*.25/.102 = .020
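A short Python check of the numbers on this slide (the slide’s .978 reflects rounding .133 before multiplying; computing exactly gives about .980):

```python
# Reproducing the worked example for the single feature "credit".
n_bank1, n_bank2 = 1500, 500           # sense-tagged instances of "bank"
credit_with_1, credit_with_2 = 200, 4  # co-occurrences of "credit"

p_s1 = n_bank1 / (n_bank1 + n_bank2)                # 0.75
p_s2 = n_bank2 / (n_bank1 + n_bank2)                # 0.25
p_credit = (credit_with_1 + credit_with_2) / 2000   # 0.102
p_credit_given_s1 = credit_with_1 / n_bank1         # ~0.133
p_credit_given_s2 = credit_with_2 / n_bank2         # 0.008

p_s1_given_credit = p_credit_given_s1 * p_s1 / p_credit  # ~0.98
p_s2_given_credit = p_credit_given_s2 * p_s2 / p_credit  # ~0.02
print(round(p_s1_given_credit, 3), round(p_s2_given_credit, 3))
```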
Comparative Results • (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line… • (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line… • (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule-Based Learner, a Probabilistic Model, etc. when disambiguating line and 12 other words… • …All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
Decision Lists and Trees • Very widely used in Machine Learning. • Decision trees used very early for WSD research (e.g., Kelly and Stone, 1975; Black, 1988). • Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word. • List decides between two senses after one positive answer • Tree allows for decision among multiple senses after a series of answers • Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes. • More descriptive and easier to interpret.
Decision List for WSD (Yarowsky, 1994) • Identify collocational features from sense tagged data. • Word immediately to the left or right of target : • I have my bank/1 statement. • The river bank/2 is muddy. • Pair of words to immediate left or right of target : • The world’s richest bank/1 is here in New York. • The river bank/2 is muddy. • Words found within k positions to left or right of target, where k is often 10-50 : • My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
Building the Decision List • Sort order of collocation tests using log of conditional probabilities. • Words most indicative of one sense (and not the other) will be ranked highly.
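Concretely, the score used to sort the list (as computed on the next slide) can be written as:

```latex
% Decision-list score for a feature f, used to rank the collocation tests
\mathrm{DLscore}(f) \;=\; \left|\, \log \frac{P(S=1 \mid f)}{P(S=2 \mid f)} \,\right|
```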
Computing DL score • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) • P(S=1) = 1,500/2,000 = .75 • P(S=2) = 500/2,000 = .25 • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. • P(F1=“credit”) = 204/2,000 = .102 • P(F1=“credit”|S=1) = 200/1,500 = .133 • P(F1=“credit”|S=2) = 4/500 = .008 • From Bayes Rule… • P(S=1|F1=“credit”) = .133*.75/.102 = .978 • P(S=2|F1=“credit”) = .008*.25/.102 = .020 • DL Score = abs (log (.978/.020)) = 3.89
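The last line of the slide, as a two-line check of the arithmetic:

```python
import math
# Decision-list score for the feature "credit", using the posteriors above.
dl_score = abs(math.log(0.978 / 0.020))  # ~3.89 (natural log)
print(round(dl_score, 2))
```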
Using the Decision List • Sort the features by DL score, then go through the test instance looking for a matching feature. The first match reveals the sense…
Learning a Decision Tree • Identify the feature that most “cleanly” divides the training data into the known senses. • “Cleanly” measured by information gain or gain ratio. • Create subsets of training data according to feature values. • Find another feature that most cleanly divides a subset of the training data. • Continue until each subset of training data is “pure” or as clean as possible. • Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993) • In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)
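A minimal sketch of this procedure, assuming scikit-learn purely for illustration (the systems cited on the slide used ID3/C4.5-style learners); entropy-based splits correspond to the information-gain criterion mentioned above. The toy data and feature names are invented.

```python
# Induce a small decision tree over binary word-presence features.
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["credit", "river", "water", "loan"]
X = [              # one row per training instance of "bank"; 1 = word in context
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
y = ["finance", "finance", "shore", "shore"]

tree = DecisionTreeClassifier(criterion="entropy")  # information-gain-style splits
tree.fit(X, y)
print(export_text(tree, feature_names=features))
print(tree.predict([[0, 1, 1, 0]]))  # -> ['shore']
```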
Supervised WSD with Individual Classifiers • Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation; most work reasonably well. • (Witten and Frank, 2000) is a great introduction to supervised learning. • Features tend to differentiate among methods more than the learning algorithms. • Good sets of features tend to include: • Co-occurrences or keywords (global) • Collocations (local) • Bigrams (local and global) • Part of speech (local) • Predicate-argument relations • Verb-object, subject-verb • Heads of Noun and Verb Phrases
Convergence of Results • Accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another. • In Senseval-1, a number of systems were in the range of 74-78% accuracy for the English Lexical Sample task. • In Senseval-2, a number of systems were in the range of 61-64% accuracy for the English Lexical Sample task. • In Senseval-3, a number of systems were in the range of 70-73% accuracy for the English Lexical Sample task… • What to do next?
Ensembles of Classifiers • Classifier error has two components (Bias and Variance) • Some algorithms (e.g., decision trees) try to build a representation of the training data – Low Bias/High Variance • Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias/Low Variance • Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy • “Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996) • Sample with replacement from the training data to learn multiple decision trees. • Outliers in training data will tend to be obscured/eliminated.
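A hedged sketch of bagging as described above, assuming a recent scikit-learn (≥1.2, for the `estimator` keyword); the toy data reuses the invented word features from earlier.

```python
# Bagging decision trees (Breiman, 1996): each tree is trained on a
# bootstrap sample (drawn with replacement), so outliers in the training
# data tend to be averaged away.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
y = ["finance", "finance", "shore", "shore"]

bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=25, bootstrap=True
)
bagged.fit(X, y)
print(bagged.predict([[1, 0, 0, 0]]))  # -> ['finance']
```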
Ensemble Considerations • Must choose different learning algorithms with significantly different bias/variance characteristics. • Naïve Bayesian Classifier versus Decision Tree • Must choose feature representations that yield significantly different (independent?) views of the training data. • Lexical versus syntactic features • Must choose how to combine classifiers. • Simple Majority Voting • Averaging of probabilities across multiple classifier outputs • Maximum Entropy combination (e.g., Klein et al., 2002)
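For the combination step, one minimal sketch of majority voting / probability averaging over classifiers with different bias/variance characteristics, again assuming scikit-learn and toy data:

```python
# Combine a Naive Bayesian classifier and a decision tree.
# voting="soft" averages class probabilities; voting="hard" is majority vote.
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
y = ["finance", "finance", "shore", "shore"]

ensemble = VotingClassifier(
    estimators=[("nb", BernoulliNB()), ("tree", DecisionTreeClassifier())],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict([[0, 1, 0, 0]]))  # -> ['shore']
```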
Ensemble Results • (Pedersen, 2000) achieved state of the art for the interest and line data using an ensemble of Naïve Bayesian Classifiers. • Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words. • Classifiers combined by a weighted vote • (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers. • Rich set of collocational and syntactic features. • Combined via a linear combination of the top three classifiers. • Many Senseval-2 and Senseval-3 systems employed ensemble methods.
Task Definition: Minimally supervised WSD • Supervised WSD = learning sense classifiers starting with annotated data • Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision • Examples • Automatically bootstrap a corpus starting with a few human annotated examples • Use monosemous relatives / dictionary definitions to automatically construct sense tagged data • Rely on Web users + active learning for corpus annotation
Bootstrapping WSD Classifiers • Build sense classifiers with little training data • Expand applicability of supervised WSD • Bootstrapping approaches • Co-training • Self-training • Yarowsky algorithm
Bootstrapping Recipe • Ingredients • (Some) labeled data • (Large amounts of) unlabeled data • (One or more) basic classifiers • Output • Classifier that improves over the basic classifiers
[Bootstrapping illustration for the word plant: a few seed-tagged examples (“… plant#1 growth is retarded …”, “… a nuclear power plant#2 …”) train Classifier 1 and Classifier 2, which label unannotated contexts (“… building the only atomic plant …”, “… plant growth is retarded …”, “… a herb or flowering plant …”, “… a nuclear power plant …”, “… building a new vehicle plant …”, “… the animal and plant life …”, “… the passion-fruit plant …”); the most confidently labeled examples (“… plants#1 and animals …”, “… industry plant#2 …”) are added back to the training data.]
Co-training / Self-training • Given: a set L of labeled training examples, a set U of unlabeled examples, and classifiers Ci • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P
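The loop above, rendered as a hedged Python sketch of self-training (one classifier retrained on its own most confident output). The classifier choice and parameter names are assumptions, and the “maintain distribution in L” step is omitted for brevity.

```python
import random
from sklearn.naive_bayes import BernoulliNB

def self_train(X_labeled, y_labeled, X_unlabeled,
               iterations=5, pool_size=100, grow=10):
    L_X, L_y = list(X_labeled), list(y_labeled)
    U = list(X_unlabeled)
    clf = BernoulliNB()  # illustrative choice of base classifier
    for _ in range(iterations):
        # 1. create / refill the pool U' with P random unlabeled examples
        random.shuffle(U)
        pool, U = U[:pool_size], U[pool_size:]
        if not pool:
            break
        # 2. train on L and label U'
        clf.fit(L_X, L_y)
        preds = clf.predict(pool)
        conf = clf.predict_proba(pool).max(axis=1)
        # keep the G most confident examples, return the rest to U
        ranked = sorted(zip(pool, preds, conf), key=lambda t: t[2], reverse=True)
        for x, label, _ in ranked[:grow]:
            L_X.append(x)
            L_y.append(label)
        U.extend(x for x, _, _ in ranked[grow:])
    return clf
```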
Co-training • (Blum and Mitchell 1998) • Two classifiers • independent views • [independence condition can be relaxed] • Co-training in Natural Language Learning • Statistical parsing (Sarkar 2001) • Co-reference resolution (Ng and Cardie 2003) • Part of speech tagging (Clark, Curran and Osborne 2003) • ...
Self-training • (Nigam and Ghani 2000) • One single classifier • Retrain on its own output • Self-training for Natural Language Learning • Part of speech tagging (Clark, Curran and Osborne 2003) • Co-reference resolution (Ng and Cardie 2003) • several classifiers through bagging