Part 5. Minimally Supervised Methods for Word Sense Disambiguation
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
Task Definition • Supervised WSD = learning sense classifiers starting with annotated data • Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision • Examples • Automatically bootstrap a corpus starting with a few human annotated examples • Use monosemous relatives / dictionary definitions to automatically construct sense tagged data • Rely on Web users + active learning for corpus annotation
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
Bootstrapping WSD Classifiers • Build sense classifiers with little training data • Expand applicability of supervised WSD • Bootstrapping approaches • Co-training • Self-training • Yarowsky algorithm
Bootstrapping Recipe • Ingredients • (Some) labeled data • (Large amounts of) unlabeled data • (One or more) basic classifiers • Output • Classifier that improves over the basic classifiers
Co-training / Self-training • A set L of labeled training examples • A set U of unlabeled examples • Classifiers Ci • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P
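A minimal sketch of this loop in Python, under assumed interfaces: fit and predict_with_confidence stand in for whatever base classifiers are used (one for self-training, two for co-training), and the step that maintains the class distribution in L is omitted.

```python
# Minimal sketch of the co-training / self-training loop above. The classifier
# interface (fit, predict_with_confidence) is an assumption for illustration;
# pass one classifier for self-training, two for co-training.
import random

def bootstrap(L, U, classifiers, P=1000, G=10, I=20):
    """L: list of (example, sense) pairs; U: list of unlabeled examples."""
    U = list(U)
    random.shuffle(U)
    pool, U = U[:P], U[P:]                       # 1. pool U' of P random examples
    for _ in range(I):                           # 2. loop for I iterations
        candidates = []
        for clf in classifiers:
            clf.fit(L)                           # train C_i on the labeled set L
            for idx, example in enumerate(pool):
                sense, confidence = clf.predict_with_confidence(example)
                candidates.append((confidence, idx, sense))
        candidates.sort(key=lambda c: c[0], reverse=True)
        grown = {}                               # pick the G most confident examples
        for confidence, idx, sense in candidates:
            if idx not in grown and len(grown) < G:
                grown[idx] = sense
        L = L + [(pool[idx], sense) for idx, sense in grown.items()]
        pool = [x for idx, x in enumerate(pool) if idx not in grown]
        refill = P - len(pool)                   # refill U' back to constant size P
        pool, U = pool + U[:refill], U[refill:]
    return L
```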
Co-training • (Blum and Mitchell 1998) • Two classifiers • independent views • [independence condition can be relaxed] • Co-training in Natural Language Learning • Statistical parsing (Sarkar 2001) • Co-reference resolution (Ng and Cardie 2003) • Part of speech tagging (Clark, Curran and Osborne 2003) • ...
Self-training • (Nigam and Ghani 2000) • One single classifier • Retrain on its own output • Self-training for Natural Language Learning • Part of speech tagging (Clark, Curran and Osborne 2003) • Co-reference resolution (Ng and Cardie 2003) • several classifiers through bagging
Parameter Setting for Co-training/Self-training • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P • A major drawback of bootstrapping • “No principled method for selecting optimal values for these parameters” (Ng and Cardie 2003)
Experiments with Co-training / Self-training for WSD • (Mihalcea 2004) • Training / Test data • Senseval-2 nouns (29 ambiguous nouns) • Average corpus size: 95 training examples, 48 test examples • Raw data • British National Corpus • Average corpus size: 7,085 examples • Co-training • Two classifiers: local and topical classifiers • Self-training • One classifier: global classifier
Optimal Parameter Settings • Optimized on the test set • Upper bound in co-training/self-training performance • Parameter ranges • P = {1, 100, 500, 1000, 1500, 2000, 5000} • G = {1, 10, 20, 30, 40, 50, 100, 150, 200} • I = {1, ..., 40} • 29 nouns → 120,000 runs • Accuracy: • Basic classifier: 53.84% • Optimal self-training: 65.61% • Optimal co-training: 65.75% • ~25% error reduction • Example: lady • basic = 61.53% • self-training = 84.61% [20/100/39] • co-training = 82.05% [1/1000/3]
Empirical Parameter Settings • How to detect parameter settings in practice? • 20% training data → validation set • Same range of parameter values • Method 1: Per-word parameter setting • Identify best parameter setting for each word • No improvement over basic classifier • Basic = 53.84% • Co-training = 51.73% • Self-training = 52.88%
Empirical Parameter Settings • Method 2: Overall parameter setting • For each parameter setting P, G, I • Determine the total relative growth in performance • Select the “best” setting • Co-training: • G = 1, P = 1500, I = 2 • Basic = 53.84%, Co-training = 55.67% • Self-training • G = 1, P = 1, I = 1 • Basic = 53.84%, Self-training = 54.16%
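The slides do not spell out how the total relative growth is computed; a plausible sketch, assuming per-word baseline accuracies and a validation_accuracy helper that runs the bootstrapping loop for one word with a given (P, G, I) setting:

```python
# Hypothetical sketch of Method 2: choose one global (P, G, I) setting by the
# total relative growth over the baseline across all words, measured on the
# validation sets. `validation_accuracy` and `baseline_acc` are assumed inputs.
def best_global_setting(words, settings, baseline_acc, validation_accuracy):
    def total_relative_growth(setting):
        return sum(
            (validation_accuracy(w, setting) - baseline_acc[w]) / baseline_acc[w]
            for w in words
        )
    return max(settings, key=total_relative_growth)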
Empirical Parameter Setting • Method 3: Smoothed co-training • Combine iterations of co-training with voting • Effect • similar shape • “smoothed” learning curve • larger range with better-than-baseline performance • Results (avg.) • Basic = 53.84% • Co-training, global setting • basic = 55.67% • smoothed = 58.35% • Co-training, per-word setting • basic = 51.73% • smoothed = 56.68%
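A minimal sketch of the voting idea, assuming the classifier produced at each co-training iteration is kept: the label of a test example is a majority vote over all iterations rather than the output of the final classifier alone.

```python
# Hypothetical sketch of Method 3 (smoothed co-training): label each test
# example by majority vote over the classifiers from every iteration.
# The classifier interface (predict) is an assumption.
from collections import Counter

def smoothed_label(example, classifiers_per_iteration):
    votes = [clf.predict(example) for clf in classifiers_per_iteration]
    return Counter(votes).most_common(1)[0][0]
```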
Yarowsky Algorithm • (Yarowsky 1995) • Similar to co-training • Differs in the basic assumption • “view independence” (co-training) vs. “precision independence” (Yarowsky algorithm) • (Abney 2002) • Relies on two classifiers and a decision list • One sense per collocation: • Nearby words provide strong and consistent clues as to the sense of a target word • One sense per discourse: • The sense of a target word is highly consistent within a single document
Learning Algorithm • A decision list is used to classify instances of the target word: • “the loss of animal and plant species through extinction …” • Classification is based on the highest ranking rule that matches the target context
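A minimal sketch of decision-list classification: rules are (collocate, sense, score) entries sorted by score (a log-likelihood ratio in Yarowsky 1995), and the highest-ranking rule that matches the context decides. The example rules and scores below are illustrative, not taken from the paper.

```python
# Decision-list classification: the first (highest-scoring) matching rule wins.
def classify(context_words, decision_list):
    """decision_list: list of (collocate, sense, score), sorted by descending score."""
    for collocate, sense, score in decision_list:
        if collocate in context_words:       # rule matches the target context
            return sense
    return None                              # no rule fires; leave the instance untagged

# Illustrative rules and context for the target word "plant"
rules = [("species", "plant#flora", 8.1), ("factory", "plant#industrial", 7.6)]
context = {"loss", "of", "animal", "and", "plant", "species", "through", "extinction"}
print(classify(context, rules))              # -> plant#flora
```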
Bootstrapping Algorithm • Example target word: plant (Sense-A: “life”, Sense-B: “factory”) • All occurrences of the target word are identified • A small training set of seed data is tagged with word sense
Bootstrapping Algorithm • Iterative procedure: • Train decision list algorithm on seed set • Classify residual data with decision list • Create new seed set by identifying samples that are tagged with a probability above a certain threshold • Retrain classifier on new seed set • Selecting training seeds • Initial training set should accurately distinguish among possible senses • Strategies: • Select a single, defining seed collocation for each possible sense. Ex: “life” and “manufacturing” for target plant • Use words from dictionary definitions • Hand-label most frequent collocates
Bootstrapping Algorithm Seed set grows and residual set shrinks ….
Bootstrapping Algorithm Convergence: Stop when residual set stabilizes
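Putting the iterative procedure together, a minimal sketch assuming helper functions train_decision_list and classify_with_confidence, and an illustrative confidence threshold:

```python
# Sketch of the bootstrapping loop above; train_decision_list and
# classify_with_confidence are assumed helpers, and THRESHOLD is illustrative.
THRESHOLD = 0.95

def yarowsky_bootstrap(seed, residual, train_decision_list, classify_with_confidence):
    """seed: list of (example, sense) pairs; residual: untagged examples."""
    while True:
        dl = train_decision_list(seed)                     # 1. train decision list on seed set
        newly_tagged, still_residual = [], []
        for x in residual:
            sense, prob = classify_with_confidence(dl, x)  # 2. classify residual data
            if prob >= THRESHOLD:                          # 3. keep high-confidence tags
                newly_tagged.append((x, sense))
            else:
                still_residual.append(x)
        if not newly_tagged:                               # convergence: residual set stabilizes
            return dl, seed
        seed = seed + newly_tagged                         # 4. retrain on the grown seed set
        residual = still_residual
```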
One Sense per Discourse The algorithm can be improved by applying the “One Sense per Discourse” constraint • After the algorithm has converged: identify tokens tagged with low confidence, label them with the dominant tag of that document • After each iteration: extend the tag to all examples in a single document once enough examples in it are tagged with a single sense
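A minimal sketch of the post-convergence variant: low-confidence tokens take the dominant sense of their document. Field names and the confidence threshold are assumptions for illustration.

```python
# One-sense-per-discourse post-processing: relabel low-confidence tokens with
# the dominant sense of their document.
from collections import Counter

def one_sense_per_discourse(tokens, confidence_threshold=0.9):
    """tokens: list of dicts with keys 'doc_id', 'sense', 'confidence'."""
    by_doc, dominant = {}, {}
    for t in tokens:
        by_doc.setdefault(t["doc_id"], []).append(t)
    for doc_id, doc_tokens in by_doc.items():
        senses = [t["sense"] for t in doc_tokens if t["sense"] is not None]
        if senses:
            dominant[doc_id] = Counter(senses).most_common(1)[0][0]
    for t in tokens:
        if t["confidence"] < confidence_threshold and t["doc_id"] in dominant:
            t["sense"] = dominant[t["doc_id"]]
    return tokens
```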
Evaluation • Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.) • Performance of multiple models compared with: • supervised decision lists • the unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses
Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind
The Web as a Corpus • Use the Web as a large textual corpus • Build annotated corpora using monosemous relatives • Bootstrap annotated corpora starting with few seeds • Use the (semi)automatically tagged data to train WSD classifiers
Monosemous Relatives • IDEA: determine a phrase (SP) which uniquely identifies the sense of a word (W#i) • 1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics • 2. Search the Internet using the Search Phrases from step 1 • 3. Replace the Search Phrases in the examples gathered at step 2 with W#i • Output: sense annotated corpus for the word sense W#i
Heuristics to Identify Monosemous Relatives • Heuristic 1 • Determine a monosemous synonym • remember#1 has recollect as a monosemous synonym → SP = recollect • Heuristic 2 • Parse the gloss and determine the set of single phrase definitions • produce#5 has the definition “bring onto the market or release” → 2 definitions: “bring onto the market” and “release” → eliminate “release” as being ambiguous → SP = bring onto the market • Heuristic 3 • Parse the gloss and determine the set of single phrase definitions • Replace the stop words with the NEAR operator • Strengthen the query: concatenate the words from the current synset using the AND operator • produce#6 has the synset {grow, raise, farm, produce} and the definition “cultivate by growing” → SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)
Heuristics to Identify Monosemous Relatives • Heuristic 4 • Parse the gloss and determine the set of single phrase definitions • Keep only the head phrase • Strengthen the query: concatenate the words from the current synset using the AND operator • company#5 has the synset {party, company} and the definition “band of people associated in some activity” → SP = band of people AND (company OR party)
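An illustrative sketch of Heuristic 1, using NLTK's WordNet interface (an assumption; the original method queried WordNet and a Web search engine directly): for each sense of a word, collect synonyms that have only that one sense, to use as search phrases.

```python
# Heuristic 1 sketch: monosemous synonyms of a synset become search phrases.
from nltk.corpus import wordnet as wn

def monosemous_synonyms(synset):
    """Return lemmas of `synset` that appear in no other synset (i.e. are monosemous)."""
    phrases = []
    for lemma in synset.lemmas():
        if len(wn.synsets(lemma.name())) == 1:    # the word form has a single sense
            phrases.append(lemma.name().replace("_", " "))
    return phrases

# Example: candidate search phrases for each sense of "remember"
for synset in wn.synsets("remember", pos=wn.VERB):
    print(synset.name(), monosemous_synonyms(synset))
```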
Example • Building annotated corpora for the noun interest.
Example • Gather 5,404 examples • Check the first 70 examples → 67 correct; 95.7% accuracy. 1. I appreciate the genuine interest#1 which motivated you to write your message. 2. The webmaster of this site warrants neither accuracy, nor interest#2. 3. He forgives us not only for our interest#3, but for his own. 4. Interest#4 coverage, including rents, was 3.6x 5. As an interest#5, she enjoyed gardening and taking part into church activities. 6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand. 7. The Adam Smith Society is a new interest#7 organized within the APA.
Experimental Evaluation • Tests on 20 words • 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings) • manually check the first 10 examples of each sense of a word => 91% accuracy • (Mihalcea 1999)
Web-based Bootstrapping • Similar to the Yarowsky algorithm • Relies on data gathered from the Web 1. Create a set of seeds (phrases) consisting of: • Sense tagged examples in SemCor • Sense tagged examples from WordNet • Additional sense tagged examples, if available (created with the substitution method or the Open Mind method) • Phrase? • At least two open class words; • Words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.) 2. Search the Web using queries formed with the seed expressions found in Step 1 • Add to the generated corpus a maximum of N text passages • (Mihalcea 2002)
The Web as Collective Mind • Two different views of the Web: • a collection of Web pages • a very large group of Web users • Millions of Web users can contribute their knowledge to a data repository • Open Mind Word Expert (Chklovski and Mihalcea, 2002) • Fast growth rate: • Started in April 2002 • Currently more than 100,000 examples of noun senses in several languages
OMWE online: http://teach-computers.org
Open Mind Word Expert: Quantity and Quality • Data • A mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus • Word senses • Based on WordNet definitions • Active learning to select the most informative examples for learning • Use two classifiers trained on existing annotated data • Select items where the two classifiers disagree for human annotation • Quality: • Two tags per item • One tag per item per contributor • Evaluations: • Agreement rates of about 65% - comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers • Replicability: tests on 1,600 examples of “interest” led to 90%+ replicability
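A minimal sketch of the active-learning selection step: items on which two classifiers trained on the existing annotated data disagree are the ones sent to Web contributors for tagging. The classifier interface is an assumption.

```python
# Disagreement-based active learning: keep only the items the two classifiers
# label differently; these are the most informative ones to hand-annotate.
def select_for_annotation(unlabeled_items, clf_a, clf_b):
    return [x for x in unlabeled_items if clf_a.predict(x) != clf_b.predict(x)]
```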
References • (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002. • (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998. • (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD. • (Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003. • (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999. • (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002. • (Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004. • (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003. • (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000. • (Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001. • (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.