Part 5. Minimally Supervised Methods for Word Sense Disambiguation
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
Task Definition • Supervised WSD = learning sense classifiers starting with annotated data • Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision • Examples • Automatically bootstrap a corpus starting with a few human annotated examples • Use monosemous relatives / dictionary definitions to automatically construct sense tagged data • Rely on Web users + active learning for corpus annotation
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
Bootstrapping WSD Classifiers • Build sense classifiers with little training data • Expand applicability of supervised WSD • Bootstrapping approaches • Co-training • Self-training • Yarowsky algorithm
Bootstrapping Recipe • Ingredients • (Some) labeled data • (Large amounts of) unlabeled data • (One or more) basic classifiers • Output • Classifier that improves over the basic classifiers
Co-training / Self-training • A set L of labeled training examples • A set U of unlabeled examples • Classifiers Ci • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P
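A minimal sketch of this loop in Python, under assumed interfaces: fit and predict_with_confidence stand in for whatever base classifiers are used (one for self-training, two for co-training), and the step that maintains the class distribution in L is omitted.

```python
# Minimal sketch of the co-training / self-training loop above. The classifier
# interface (fit, predict_with_confidence) is an assumption for illustration;
# pass one classifier for self-training, two for co-training.
import random

def bootstrap(L, U, classifiers, P=1000, G=10, I=20):
    """L: list of (example, sense) pairs; U: list of unlabeled examples."""
    U = list(U)
    random.shuffle(U)
    pool, U = U[:P], U[P:]                       # 1. pool U' of P random examples
    for _ in range(I):                           # 2. loop for I iterations
        candidates = []
        for clf in classifiers:
            clf.fit(L)                           # train C_i on the labeled set L
            for idx, example in enumerate(pool):
                sense, confidence = clf.predict_with_confidence(example)
                candidates.append((confidence, idx, sense))
        candidates.sort(key=lambda c: c[0], reverse=True)
        grown = {}                               # pick the G most confident examples
        for confidence, idx, sense in candidates:
            if idx not in grown and len(grown) < G:
                grown[idx] = sense
        L = L + [(pool[idx], sense) for idx, sense in grown.items()]
        pool = [x for idx, x in enumerate(pool) if idx not in grown]
        refill = P - len(pool)                   # refill U' back to constant size P
        pool, U = pool + U[:refill], U[refill:]
    return L
```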
Co-training • (Blum and Mitchell 1998) • Two classifiers • independent views • [independence condition can be relaxed] • Co-training in Natural Language Learning • Statistical parsing (Sarkar 2001) • Co-reference resolution (Ng and Cardie 2003) • Part of speech tagging (Clark, Curran and Osborne 2003) • ...
Self-training • (Nigam and Ghani 2000) • One single classifier • Retrain on its own output • Self-training for Natural Language Learning • Part of speech tagging (Clark, Curran and Osborne 2003) • Co-reference resolution (Ng and Cardie 2003) • several classifiers through bagging
Parameter Setting for Co-training/Self-training • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P • A major drawback of bootstrapping • “No principled method for selecting optimal values for these parameters” (Ng and Cardie 2003)
Experiments with Co-training / Self-training for WSD • (Mihalcea 2004) • Training / Test data • Senseval-2 nouns (29 ambiguous nouns) • Average corpus size: 95 training examples, 48 test examples • Raw data • British National Corpus • Average corpus size: 7,085 examples • Co-training • Two classifiers: local and topical classifiers • Self-training • One classifier: global classifier
Optimal Parameter Settings • Optimized on the test set • Upper bound in co-training/self-training performance • Parameter ranges • P = {1, 100, 500, 1000, 1500, 2000, 5000} • G = {1, 10, 20, 30, 40, 50, 100, 150, 200} • I = {1, ..., 40} • 29 nouns → 120,000 runs • Accuracy: • Basic classifier: 53.84% • Optimal self-training: 65.61% • Optimal co-training: 65.75% • ~25% error reduction • Example: lady • basic = 61.53% • self-training = 84.61% [20/100/39] • co-training = 82.05% [1/1000/3]
Empirical Parameter Settings • How to detect parameter settings in practice? • 20% training data → validation set • Same range of parameter values • Method 1: Per-word parameter setting • Identify best parameter setting for each word • No improvement over basic classifier • Basic = 53.84% • Co-training = 51.73% • Self-training = 52.88%
Empirical Parameter Settings • Method 2: Overall parameter setting • For each parameter setting P, G, I • Determine the total relative growth in performance • Select the “best” setting • Co-training: • G = 1, P = 1500, I = 2 • Basic = 53.84%, Co-training = 55.67% • Self-training • G = 1, P = 1, I = 1 • Basic = 53.84%, Self-training = 54.16%
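The slides do not spell out how the total relative growth is computed; a plausible sketch, assuming per-word baseline accuracies and a validation_accuracy helper that runs the bootstrapping loop for one word with a given (P, G, I) setting:

```python
# Hypothetical sketch of Method 2: choose one global (P, G, I) setting by the
# total relative growth over the baseline across all words, measured on the
# validation sets. `validation_accuracy` and `baseline_acc` are assumed inputs.
def best_global_setting(words, settings, baseline_acc, validation_accuracy):
    def total_relative_growth(setting):
        return sum(
            (validation_accuracy(w, setting) - baseline_acc[w]) / baseline_acc[w]
            for w in words
        )
    return max(settings, key=total_relative_growth)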
Empirical Parameter Setting • Method 3: Smoothed co-training • Combine iterations of co-training with voting • Effect • similar shape • “smoothed” learning curve • larger range with better-than-baseline performance • Results (avg.) • Basic = 53.84% • Co-training, global setting • basic = 55.67% • smoothed = 58.35% • Co-training, per-word setting • basic = 51.73% • smoothed = 56.68%
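A minimal sketch of the voting idea, assuming the classifier produced at each co-training iteration is kept: the label of a test example is a majority vote over all iterations rather than the output of the final classifier alone.

```python
# Hypothetical sketch of Method 3 (smoothed co-training): label each test
# example by majority vote over the classifiers from every iteration.
# The classifier interface (predict) is an assumption.
from collections import Counter

def smoothed_label(example, classifiers_per_iteration):
    votes = [clf.predict(example) for clf in classifiers_per_iteration]
    return Counter(votes).most_common(1)[0][0]
```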
Yarowsky Algorithm • (Yarowsky 1995) • Similar to co-training • Differs in the basic assumption • “view independence” (co-training) vs. “precision independence” (Yarowsky algorithm) • (Abney 2002) • Relies on two classifiers and a decision list • One sense per collocation: • Nearby words provide strong and consistent clues as to the sense of a target word • One sense per discourse: • The sense of a target word is highly consistent within a single document
Learning Algorithm • A decision list is used to classify instances of the target word: • “the loss of animal and plant species through extinction …” • Classification is based on the highest ranking rule that matches the target context
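A minimal sketch of decision-list classification: rules are (collocate, sense, score) entries sorted by score (a log-likelihood ratio in Yarowsky 1995), and the highest-ranking rule that matches the context decides. The example rules and scores below are illustrative, not taken from the paper.

```python
# Decision-list classification: the first (highest-scoring) matching rule wins.
def classify(context_words, decision_list):
    """decision_list: list of (collocate, sense, score), sorted by descending score."""
    for collocate, sense, score in decision_list:
        if collocate in context_words:       # rule matches the target context
            return sense
    return None                              # no rule fires; leave the instance untagged

# Illustrative rules and context for the target word "plant"
rules = [("species", "plant#flora", 8.1), ("factory", "plant#industrial", 7.6)]
context = {"loss", "of", "animal", "and", "plant", "species", "through", "extinction"}
print(classify(context, rules))              # -> plant#flora
```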
Bootstrapping Algorithm • Example target word: plant (Sense-A: “life”, Sense-B: “factory”) • All occurrences of the target word are identified • A small training set of seed data is tagged with word sense
Bootstrapping Algorithm • Iterative procedure: • Train decision list algorithm on seed set • Classify residual data with decision list • Create new seed set by identifying samples that are tagged with a probability above a certain threshold • Retrain classifier on new seed set • Selecting training seeds • Initial training set should accurately distinguish among possible senses • Strategies: • Select a single, defining seed collocation for each possible sense. Ex: “life” and “manufacturing” for target plant • Use words from dictionary definitions • Hand-label most frequent collocates
Bootstrapping Algorithm Seed set grows and residual set shrinks ….
Bootstrapping Algorithm Convergence: Stop when residual set stabilizes
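Putting the iterative procedure together, a minimal sketch assuming helper functions train_decision_list and classify_with_confidence, and an illustrative confidence threshold:

```python
# Sketch of the bootstrapping loop above; train_decision_list and
# classify_with_confidence are assumed helpers, and THRESHOLD is illustrative.
THRESHOLD = 0.95

def yarowsky_bootstrap(seed, residual, train_decision_list, classify_with_confidence):
    """seed: list of (example, sense) pairs; residual: untagged examples."""
    while True:
        dl = train_decision_list(seed)                     # 1. train decision list on seed set
        newly_tagged, still_residual = [], []
        for x in residual:
            sense, prob = classify_with_confidence(dl, x)  # 2. classify residual data
            if prob >= THRESHOLD:                          # 3. keep high-confidence tags
                newly_tagged.append((x, sense))
            else:
                still_residual.append(x)
        if not newly_tagged:                               # convergence: residual set stabilizes
            return dl, seed
        seed = seed + newly_tagged                         # 4. retrain on the grown seed set
        residual = still_residual
```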
One Sense per Discourse The algorithm can be improved by applying the “One Sense per Discourse” constraint • After the algorithm has converged: identify tokens tagged with low confidence, label them with the dominant tag of that document • After each iteration: extend the tag to all examples in a single document once enough examples in it are tagged with a single sense
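A minimal sketch of the post-convergence variant: low-confidence tokens take the dominant sense of their document. Field names and the confidence threshold are assumptions for illustration.

```python
# One-sense-per-discourse post-processing: relabel low-confidence tokens with
# the dominant sense of their document.
from collections import Counter

def one_sense_per_discourse(tokens, confidence_threshold=0.9):
    """tokens: list of dicts with keys 'doc_id', 'sense', 'confidence'."""
    by_doc, dominant = {}, {}
    for t in tokens:
        by_doc.setdefault(t["doc_id"], []).append(t)
    for doc_id, doc_tokens in by_doc.items():
        senses = [t["sense"] for t in doc_tokens if t["sense"] is not None]
        if senses:
            dominant[doc_id] = Counter(senses).most_common(1)[0][0]
    for t in tokens:
        if t["confidence"] < confidence_threshold and t["doc_id"] in dominant:
            t["sense"] = dominant[t["doc_id"]]
    return tokens
```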
Evaluation • Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.) • Performance of multiple models compared with: • supervised decision lists • the unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
The Web as a Corpus • Use the Web as a large textual corpus • Build annotated corpora using monosemous relatives • Bootstrap annotated corpora starting with few seeds • Use the (semi)automatically tagged data to train WSD classifiers
Monosemous Relatives • IDEA: determine a phrase (SP) which uniquely identifies the sense of a word (W#i) • 1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics • 2. Search the Internet using the Search Phrases from step 1 • 3. Replace the Search Phrases in the examples gathered at step 2 with W#i • Output: sense annotated corpus for the word sense W#i
Heuristics to Identify Monosemous Relatives • Heuristic 1 • Determine a monosemous synonym • remember#1 has recollect as a monosemous synonym → SP = recollect • Heuristic 2 • Parse the gloss and determine the set of single phrase definitions • produce#5 has the definition “bring onto the market or release” → 2 definitions: “bring onto the market” and “release” → eliminate “release” as being ambiguous → SP = bring onto the market • Heuristic 3 • Parse the gloss and determine the set of single phrase definitions • Replace the stop words with the NEAR operator • Strengthen the query: concatenate the words from the current synset using the AND operator • produce#6 has the synset {grow, raise, farm, produce} and the definition “cultivate by growing” → SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)
Heuristics to Identify Monosemous Relatives • Heuristic 4 • Parse the gloss and determine the set of single phrase definitions • Keep only the head phrase • Strengthen the query: concatenate the words from the current synset using the AND operator • company#5 has the synset {party, company} and the definition “band of people associated in some activity” → SP = band of people AND (company OR party)
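An illustrative sketch of Heuristic 1, using NLTK's WordNet interface (an assumption; the original method queried WordNet and a Web search engine directly): for each sense of a word, collect synonyms that have only that one sense, to use as search phrases.

```python
# Heuristic 1 sketch: monosemous synonyms of a synset become search phrases.
from nltk.corpus import wordnet as wn

def monosemous_synonyms(synset):
    """Return lemmas of `synset` that appear in no other synset (i.e. are monosemous)."""
    phrases = []
    for lemma in synset.lemmas():
        if len(wn.synsets(lemma.name())) == 1:    # the word form has a single sense
            phrases.append(lemma.name().replace("_", " "))
    return phrases

# Example: candidate search phrases for each sense of "remember"
for synset in wn.synsets("remember", pos=wn.VERB):
    print(synset.name(), monosemous_synonyms(synset))
```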
Example • Building annotated corpora for the noun interest.
Example • Gather 5,404 examples • Check the first 70 examples → 67 correct; 95.7% accuracy. 1. I appreciate the genuine interest#1 which motivated you to write your message. 2. The webmaster of this site warrants neither accuracy, nor interest#2. 3. He forgives us not only for our interest#3, but for his own. 4. Interest#4 coverage, including rents, was 3.6x 5. As an interest#5, she enjoyed gardening and taking part into church activities. 6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand. 7. The Adam Smith Society is a new interest#7 organized within the APA.
Experimental Evaluation • Tests on 20 words • 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings) • manually check the first 10 examples of each sense of a word => 91% accuracy • (Mihalcea 1999)
Web-based Bootstrapping • Similar to the Yarowsky algorithm • Relies on data gathered from the Web 1. Create a set of seeds (phrases) consisting of: • Sense tagged examples in SemCor • Sense tagged examples from WordNet • Additional sense tagged examples, if available (created with the substitution method or the Open Mind method) • Phrase? • At least two open class words; • Words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.) 2. Search the Web using queries formed with the seed expressions found in Step 1 • Add to the generated corpus a maximum of N text passages • (Mihalcea 2002)
The Web as Collective Mind • Two different views of the Web: • a collection of Web pages • a very large group of Web users • Millions of Web users can contribute their knowledge to a data repository • Open Mind Word Expert (Chklovski and Mihalcea, 2002) • Fast growth rate: • Started in April 2002 • Currently more than 100,000 examples of noun senses in several languages
OMWE online: http://teach-computers.org
Open Mind Word Expert: Quantity and Quality • Data • A mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus • Word senses • Based on WordNet definitions • Active learning to select the most informative examples for learning • Use two classifiers trained on existing annotated data • Select items where the two classifiers disagree for human annotation • Quality: • Two tags per item • One tag per item per contributor • Evaluations: • Agreement rates of about 65% - comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers • Replicability: tests on 1,600 examples of “interest” led to 90%+ replicability
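A minimal sketch of the active-learning selection step: items on which two classifiers trained on the existing annotated data disagree are the ones sent to Web contributors for tagging. The classifier interface is an assumption.

```python
# Disagreement-based active learning: keep only the items the two classifiers
# label differently; these are the most informative ones to hand-annotate.
def select_for_annotation(unlabeled_items, clf_a, clf_b):
    return [x for x in unlabeled_items if clf_a.predict(x) != clf_b.predict(x)]
```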
References • (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002. • (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998. • (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD. • (Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003. • (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999. • (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002. • (Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004. • (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003. • (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000. • (Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001. • (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.