
Learning a joint model of word sense and syntactic preference

  1. Learning a joint model of word sense and syntactic preference Galen Andrew and Teg Grenager NLP Lunch December 3, 2003

  2. Motivation • Divide and conquer: NLP has divided the problem of language processing into small problems (tagging, parsing, WSD, anaphora resolution, information extraction…) • Traditionally, we build separate models for each problem • But linguistic phenomena are correlated! • In some cases, a joint model can better represent the phenomena • We may be better able to perform one task if we have a joint model and can use many kinds of evidence

  3. Motivation • In particular, syntactic and semantic properties of language are (of course!) very correlated • For example, semantic information (word sense, coreference resolution, etc.) is useful when doing syntactic processing (tagging, parsing, movement, etc.) and vice versa • Evidence that humans use this information (e.g., Clifton et al. 1984, Ferreira & McClure 1997, Garnsey et al. 1997) • Evidence that it is useful in NLP (e.g., Yarowsky 2000, Lin 1997, Bikel 2000)

  4. Verb Sense and Subcat • We’ve chosen to focus on modelling two specific phenomena: verb sense and verb subcategorization preference • Roland and Jurafsky (1998) demonstrate that models which condition on verb sense are better able to predict verb subcategorization • Others (e.g., Yarowsky, Lin) have shown that models that condition on syntactic information are better able to predict word sense • We believe that a joint model of verb sense and subcategorization may be more accurate than separate models on either task

  5. Example • The word “admit” has 8 senses in WordNet, with different distributions over subcategories:

     Sense | Definition     | Subcategorization                                 | Example
     1     | Acknowledge    | Somebody admits that something. (Sfin)            | The defendant admitted that he had lied.
     1     | Acknowledge    | Somebody admits something. (NP)                   | He admitted his guilt.
     2     | Allow in       | Somebody admits somebody. (NP)                    | I will admit only those with passes.
     2     | Allow in       | Somebody admits someone (in)to somewhere. (NP PP) | The student was admitted.
     6     | Give access to | Something admits to somewhere. (PP)               | The main entrance admits to the foyer.

  6. Lack of Training Data • To learn a joint model over verb sense and subcategorization preference, we’d ideally have a (large) dataset marked for both • No such dataset exists (parts of the Brown corpus appear in both SemCor and the PTB, but that overlap is small and the annotations are not aligned) • However, we have some datasets marked for sense (Senseval, SemCor), and others that can easily be marked for subcategory (PTB) • We can think of this as one big corpus with missing data

  7. Lack of Training Data

     [Table: each row is an instance with fields seq, bow, sense, subcat]
     SemCor data (marked for sense): seq, bow, and sense observed; subcat missing
     Penn Treebank data (marked for subcat): seq, bow, and subcat observed; sense missing
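To make the “one big corpus with missing data” view concrete, here is a minimal sketch of how the merged corpus might be represented; the Instance class and the example field encodings are our assumptions, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    seq: list                      # word sequence, observed in both corpora
    bow: set                       # bag of context words, observed in both corpora
    sense: Optional[int] = None    # WordNet sense number; SemCor instances only
    subcat: Optional[str] = None   # subcat frame label; Penn Treebank instances only

corpus = [
    Instance(["He", "admitted", "his", "guilt"], {"he", "his", "guilt"}, sense=1),
    Instance(["The", "student", "was", "admitted"], {"the", "student", "was"}, subcat="NP"),
]
```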

  8. EM to the Rescue • How do people usually deal with model parameter estimation when there is missing data? The expectation-maximization algorithm. • Big idea: it’s easy to • E: fill in missing data if you have a good model, and • M: compute maximum likelihood model parameters if you have complete data • So you initialize somehow, and then loop over the above two steps until convergence

  9. EM to the Rescue • More formally, for data x, missing data z, and parameters θ: • E-step: For each instance i, set q_i(z) = P(z | x_i; θ) • M-step: Set θ ← argmax_θ Σ_i Σ_z q_i(z) log P(x_i, z; θ)
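A minimal skeleton of that loop, to fix ideas; the e_step and m_step callables stand in for the model-specific computations and are hypothetical:

```python
def em(corpus, params, e_step, m_step, max_iters=50, tol=1e-4):
    """Generic EM loop. e_step returns, for each instance, a posterior
    q_i(z) over its missing fields plus the data log-likelihood;
    m_step re-estimates parameters from the expected counts."""
    prev_ll = float("-inf")
    for _ in range(max_iters):
        posteriors, ll = e_step(corpus, params)   # E: fill in missing data
        params = m_step(corpus, posteriors)       # M: maximum-likelihood re-estimate
        if ll - prev_ll < tol:                    # stop when likelihood converges
            break
        prev_ll = ll
    return params
```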

  10. The Model [Graphical model: nodes Subcat, Sense, Seq, BOW]

  11. The Model: E-step [Graphical model: nodes Subcat, Sense, Seq, BOW; legend distinguishes observed, unobserved, and query nodes]

  12. The Model: E-step [Graphical model: nodes Subcat, Sense, Seq, BOW; legend distinguishes observed, unobserved, and query nodes]

  13. The Model: M-step [Graphical model: nodes Subcat, Sense, Seq, BOW; annotation: Deterministic]

  14. The Model: M-step [Graphical model as above; annotation: prior, estimated from counts]

  15. The Model: M-step [Graphical model as above; annotation: estimated from counts]

  16. The Model: M-step [Graphical model as above; annotation: multinomial NB model]

  17. The Model: M-step [Graphical model as above; annotations: encoded as PCFG grammar, only computed once]
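Reading slides 13 through 17 together, the joint distribution appears to factor as P(sense) · P(subcat | sense) · P(bow | sense) · P(seq | subcat), with a count-based prior on sense, count-based P(subcat | sense), a multinomial naive Bayes model for the bag of words, and a PCFG inside probability for the sequence. A sketch under that assumed factorization; all names on the params container are hypothetical:

```python
import math

def log_joint(sense, subcat, seq, bow, params):
    """Log P(sense, subcat, seq, bow) under the factorization suggested
    by slides 13-17 (a sketch; `params` is a hypothetical container)."""
    lp = math.log(params.sense_prior[sense])                   # sense prior (slide 14)
    lp += math.log(params.subcat_given_sense[sense][subcat])   # counts (slide 15)
    for w in bow:                                              # multinomial NB (slide 16)
        lp += math.log(params.word_given_sense[sense][w])
    lp += params.pcfg_logprob(subcat, seq)                     # PCFG inside score (slide 17)
    return lp
```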

  18. Subcategory Grammars • In order to represent P(seq|subcat) we needed to learn separate grammars/lexicons for each subcategory of the target verb • When reading in PTB trees, we first make a separate copy of the tree for each verb • Then for each tree, we mark the selected verb for subcategory (using Tgrep expressions) and propagate the markings to the top of the tree • Then trees are annotated (tag split, for accuracy) and binarized, and we read off the grammars and lexicons • Thus at parse time, each root symbol must parse some verb to its specified subcategory
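A small sketch of the marking-and-propagation step, using NLTK trees. Identifying the frame itself (done with Tgrep expressions in the talk) is not shown, and the ^TAG suffix convention is our assumption:

```python
from nltk.tree import Tree

def mark_for_subcat(tree, verb_pos, subcat):
    """Copy `tree` and suffix the labels on the path from the selected
    verb's preterminal up to the root with the subcat tag, so the root
    symbol records the frame."""
    marked = tree.copy(deep=True)
    for i in range(len(verb_pos), -1, -1):
        node = marked[verb_pos[:i]]        # node on the path to the root
        if isinstance(node, Tree):
            node.set_label(node.label() + "^" + subcat)
    return marked

t = Tree.fromstring("(S (NP (PRP He)) (VP (VBD admitted) (NP (PRP$ his) (NN guilt))))")
print(mark_for_subcat(t, (1, 0), "NP"))    # (1, 0) is the VBD preterminal
```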

  19. Model Testing • Once we’ve trained a model with EM, we can use it to predict sense and/or subcat in a completely unmarked instance • For example, to infer sense given only the sequence (and bow): sense* = argmax_s P(s) P(bow|s) Σ_c P(c|s) P(seq|c) • Inferring subcat given only the sequence is similar
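A sketch of that inference, marginalizing subcat out of the joint; it reuses the same hypothetical params container as the model sketch above:

```python
import math

def predict_sense(seq, bow, params):
    """argmax_s P(s) P(bow|s) sum_c P(c|s) P(seq|c): infer the sense
    with the subcategorization frame marginalized out."""
    best_sense, best_lp = None, float("-inf")
    for s in params.senses:
        lp = math.log(params.sense_prior[s])
        lp += sum(math.log(params.word_given_sense[s][w]) for w in bow)
        marg = sum(params.subcat_given_sense[s][c]
                   * math.exp(params.pcfg_logprob(c, seq))
                   for c in params.subcats)        # marginalize over frames
        lp += math.log(marg)
        if lp > best_lp:
            best_sense, best_lp = s, lp
    return best_sense
```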

  20. Results • None yet, but we should have them soon

  21. Future Work • More features, and a more complex model • Learn separate distributions over words inside the VP and outside the VP, conditioned on sense • Learn distributions over the words contained in particular arguments and adjuncts, conditioned on sense
