Part 4: Supervised Methods of Word Sense Disambiguation
Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
What is Supervised Learning? • Collect a set of examples that illustrate the various possible classifications or outcomes of an event. • Identify patterns in the examples associated with each particular class of the event. • Generalize those patterns into rules. • Apply the rules to classify a new event.
Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
Task Definition • Supervised WSD: A class of methods that induce a classifier from manually sense-tagged text using machine learning techniques. • Resources • Sense Tagged Text • Dictionary (implicit source of sense inventory) • Syntactic Analysis (POS tagger, Chunker, Parser, …) • Scope • Typically one target word per context • Part of speech of target word resolved • Lends itself to the “lexical sample” formulation • Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs
Two Bags of Words (Co-occurrences in the “window of context”)
Simple Supervised Approach
Given a sentence S containing “bank”:
    For each word Wi in S:
        If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1;
        If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1;
    If Sense_1 > Sense_2 then print “Financial”;
    else if Sense_2 > Sense_1 then print “River”;
    else print “Can’t Decide”;
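A minimal runnable Python version of the sketch above; the two bags here are small illustrative samples rather than real training output, which would be collected from sense-tagged data:

```python
# Illustrative bags of words for the two senses of "bank" (assumed, not learned).
FINANCIAL_BANK_BAG = {"money", "credit", "loan", "account", "deposit"}
RIVER_BANK_BAG = {"river", "water", "muddy", "shore", "fish"}

def disambiguate_bank(sentence):
    """Count context words from each bag and pick the sense with more hits."""
    sense_1 = sense_2 = 0
    for word in sentence.lower().split():
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("my bank approved the loan on credit"))  # Financial
```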
Supervised Methodology • Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities. • One tagged word per instance/lexical sample disambiguation • Select a set of features with which to represent context. • co-occurrences, collocations, POS tags, verb-obj relations, etc... • Convert sense-tagged training instances to feature vectors. • Apply a machine learning algorithm to induce a classifier. • Form – structure or relation among features • Parameters – strength of feature interactions • Convert a held out sample of test data into feature vectors. • “correct” sense tags are known but not used • Apply classifier to test instances to assign a sense tag.
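As a concrete sketch of the feature-vector conversion step, the following turns sense-tagged instances into binary bag-of-words vectors; the instances and vocabulary are invented for illustration:

```python
# Hypothetical sense-tagged training instances: (context, sense) pairs.
training = [
    ("i deposited the check at the bank", "bank/1"),
    ("the river bank was muddy after the rain", "bank/2"),
]

# Vocabulary of context words (excluding the target word itself).
vocabulary = sorted({w for text, _ in training for w in text.split() if w != "bank"})

def to_feature_vector(text):
    """Binary vector: 1 if the vocabulary word occurs in the context, else 0."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocabulary]

# Feature vectors paired with their sense tags, ready for a learning algorithm.
vectors = [(to_feature_vector(text), sense) for text, sense in training]
```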
Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
Naïve Bayesian Classifier • The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997) …Word Sense Disambiguation is no exception • Assumes conditional independence among features, given the sense of a word • The form of the model is assumed, but parameters are estimated from training instances • When applied to WSD, the features are often a “bag of words” drawn from the training data • Usually thousands of binary features that indicate whether a word is present in the context of the target word (or not)
Bayesian Inference • Given observed features, what is most likely sense? • Estimate probability of observed features given sense • Estimate unconditional probability of sense • Unconditional probability of features is a normalizing term, doesn’t affect sense classification
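In symbols, for a sense $S$ and observed features $F_1, \dots, F_n$, the classifier chooses:

$$
\hat{S} \;=\; \arg\max_{S} \, P(S \mid F_1,\dots,F_n)
\;=\; \arg\max_{S} \, \frac{P(F_1,\dots,F_n \mid S)\,P(S)}{P(F_1,\dots,F_n)}
\;=\; \arg\max_{S} \, P(S)\prod_{i=1}^{n} P(F_i \mid S)
$$

where the final step applies the naïve conditional-independence assumption and drops the denominator, which is constant across senses.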
The Naïve Bayesian Classifier • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) • P(S=1) = 1,500/2,000 = .75 • P(S=2) = 500/2,000 = .25 • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. • P(F1=“credit”) = 204/2,000 = .102 • P(F1=“credit”|S=1) = 200/1,500 = .133 • P(F1=“credit”|S=2) = 4/500 = .008 • Given a test instance that has one feature “credit” • P(S=1|F1=“credit”) = .133*.75/.102 = .978 • P(S=2|F1=“credit”) = .008*.25/.102 = .020
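The arithmetic above can be reproduced in a few lines of Python (counts taken directly from the slide):

```python
# Counts from the worked example above.
n_total, n_s1, n_s2 = 2000, 1500, 500
credit_s1, credit_s2 = 200, 4

p_s1 = n_s1 / n_total                         # P(S=1)    = .75
p_s2 = n_s2 / n_total                         # P(S=2)    = .25
p_credit = (credit_s1 + credit_s2) / n_total  # P(F1)     = .102
p_credit_given_s1 = credit_s1 / n_s1          # P(F1|S=1) = .133
p_credit_given_s2 = credit_s2 / n_s2          # P(F1|S=2) = .008

# Bayes rule: P(S|F1) = P(F1|S) * P(S) / P(F1)
print(p_credit_given_s1 * p_s1 / p_credit)  # ~.980 (slide shows .978 from rounding .133)
print(p_credit_given_s2 * p_s2 / p_credit)  # ~.020 -> bank/1 is chosen
```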
Comparative Results • (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line… • (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line… • (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule Based Learner, a Probabilistic Model, etc. when disambiguating line and 12 other words… • …All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
Decision Lists and Trees • Very widely used in Machine Learning. • Decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988). • Represent the disambiguation problem as a series of questions (about the presence of features) that reveal the sense of a word. • A list decides between two senses after one positive answer • A tree allows a decision among multiple senses after a series of answers • Use a smaller, more refined set of features than the “bag of words” of the Naïve Bayesian Classifier. • More descriptive and easier to interpret.
Decision List for WSD (Yarowsky, 1994) • Identify collocational features from sense tagged data. • Word immediately to the left or right of target : • I have my bank/1 statement. • The river bank/2 is muddy. • Pair of words to immediate left or right of target : • The world’s richest bank/1 is here in New York. • The river bank/2 is muddy. • Words found within k positions to left or right of target, where k is often 10-50 : • My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
Building the Decision List • Sort the collocation tests by the log of their conditional probability ratio, as defined below. • Words most indicative of one sense (and not the other) will be ranked highly.
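Concretely, the score used to sort the list for a feature (collocation) $F$ over two senses is:

$$
\text{DL-score}(F) \;=\; \left|\, \log \frac{P(S=1 \mid F)}{P(S=2 \mid F)} \,\right|
$$

The worked example on the next slide uses the natural log.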
Computing DL score • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) • P(S=1) = 1,500/2,000 = .75 • P(S=2) = 500/2,000 = .25 • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. • P(F1=“credit”) = 204/2,000 = .102 • P(F1=“credit”|S=1) = 200/1,500 = .133 • P(F1=“credit”|S=2) = 4/500 = .008 • From Bayes Rule… • P(S=1|F1=“credit”) = .133*.75/.102 = .978 • P(S=2|F1=“credit”) = .008*.25/.102 = .020 • DL Score = abs (log (.978/.020)) = 3.89
Using the Decision List • Sort the tests by DL score, then go through a test instance looking for a matching feature. The first match reveals the sense…
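A minimal sketch of that lookup; the features and scores here are illustrative, loosely following the earlier “credit” example:

```python
# Decision list entries: (feature, sense, DL score), sorted by descending score.
# The entries are invented for illustration.
decision_list = sorted([
    ("credit", "bank/1", 3.89),
    ("river", "bank/2", 3.10),
    ("water", "bank/2", 1.20),
], key=lambda entry: entry[2], reverse=True)

def classify(context_words, default="bank/1"):
    """Return the sense of the first (highest-scoring) matching test."""
    for feature, sense, _score in decision_list:
        if feature in context_words:
            return sense
    return default  # fall back, e.g. to the most frequent sense

print(classify({"the", "muddy", "river"}))  # bank/2
```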
Learning a Decision Tree • Identify the feature that most “cleanly” divides the training data into the known senses. • “Cleanly” is measured by information gain or gain ratio (formulas below). • Create subsets of the training data according to feature values. • Find another feature that most cleanly divides a subset of the training data. • Continue until each subset of training data is “pure” or as clean as possible. • Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986; 1993) • In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000)
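The entropy-based quantities usually used to measure “cleanliness”: for a training set $T$ with sense proportions $p_s$, and a feature $F$ that splits $T$ into subsets $T_v$,

$$
H(T) \;=\; -\sum_{s} p_s \log_2 p_s,
\qquad
\text{Gain}(T, F) \;=\; H(T) \;-\; \sum_{v \in \text{values}(F)} \frac{|T_v|}{|T|}\, H(T_v)
$$

The feature with the highest gain (or gain ratio) is chosen at each node.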
Supervised WSD with Individual Classifiers • Most supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well. • Features tend to differentiate among methods more than the learning algorithms do. • Good sets of features tend to include: • Co-occurrences or keywords (global) • Collocations (local) • Bigrams (local and global) • Part of speech (local) • Predicate-argument relations (verb-object, subject-verb) • Heads of Noun and Verb Phrases
Convergence of Results • The accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another. • In Senseval-1, a number of systems were in the range of 74-78% accuracy for the English Lexical Sample task. • In Senseval-2, a number of systems were in the range of 61-64% accuracy for the English Lexical Sample task. • In Senseval-3, a number of systems were in the range of 70-73% accuracy for the English Lexical Sample task… • What to do next?
Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
Ensembles of Classifiers • Classifier error has two components (bias and variance) • Some algorithms (e.g., decision trees) try to build a representation of the training data – Low Bias/High Variance • Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias/Low Variance • Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy • “Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996) • Sample with replacement from the training data to learn multiple decision trees. • Outliers in the training data will tend to be obscured/eliminated.
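A minimal bagging sketch in the spirit of (Breiman, 1996); it assumes scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays X, y:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Majority vote across the individual trees' predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])
    return [max(set(col), key=list(col).count) for col in votes.T]
```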
Ensemble Considerations • Must choose learning algorithms with significantly different bias/variance characteristics. • Naïve Bayesian Classifier versus Decision Tree • Must choose feature representations that yield significantly different (independent?) views of the training data. • Lexical versus syntactic features • Must choose how to combine classifiers. • Simple majority voting • Averaging of probabilities across multiple classifier outputs • Maximum Entropy combination (e.g., Klein et al., 2002)
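Minimal sketches of the first two combination schemes; `classifiers` is assumed to be a list of trained objects exposing scikit-learn-style predict/predict_proba, all sharing the same class ordering:

```python
import numpy as np
from collections import Counter

def majority_vote(classifiers, X):
    """Each trained classifier casts one vote per instance."""
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return [Counter(col).most_common(1)[0][0] for col in votes.T]

def average_probabilities(classifiers, X, labels):
    """Average each classifier's sense distribution, then take the argmax.
    Assumes every classifier orders its probability columns as in `labels`."""
    mean_probs = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return [labels[i] for i in np.argmax(mean_probs, axis=1)]
```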
Ensemble Results • (Pedersen, 2000) achieved state of the art for the interest and line data using an ensemble of Naïve Bayesian Classifiers. • Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words. • Classifiers combined by a weighted vote. • (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers. • Rich set of collocational and syntactic features. • Combined via a linear combination of the top three classifiers. • Many Senseval-2 and Senseval-3 systems employed ensemble methods.
References
• (Black, 1988) E. Black. (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32), pp. 185-194.
• (Breiman, 1996) L. Breiman. (1996) Heuristics of instability and stabilization in model selection. Annals of Statistics (24), pp. 2350-2383.
• (Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning (29), pp. 103-130.
• (Domingos, 2000) P. Domingos. (2000) A unified bias-variance decomposition for zero-one and squared loss. In Proceedings of AAAI, pp. 564-569.
• (Florian and Yarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling consensus: Classifier combination for word sense disambiguation. In Proceedings of EMNLP, pp. 25-32.
• (Kelly and Stone, 1975) E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
• (Klein et al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning. (2002) Combining heterogeneous classifiers for word-sense disambiguation. In Proceedings of Senseval-2, pp. 87-89.
• (Leacock et al., 1993) C. Leacock, G. Towell, and E. Voorhees. (1993) Corpus-based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology, pp. 260-265.
• (Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of EMNLP, pp. 82-91.
• (Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation, Southern Methodist University.
• (Pedersen, 2000) T. Pedersen. (2000) A simple approach to building ensembles of Naïve Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
• (Quinlan, 1986) J.R. Quinlan. (1986) Induction of decision trees. Machine Learning (1), pp. 81-106.
• (Quinlan, 1993) J.R. Quinlan. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
• (Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL, pp. 88-95.
• (Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities (34).