CS546: Machine Learning and Natural Language
Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing
Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture.
Parsing Problem • Annotation refines base treebank symbols to improve statistical fit of the grammar • Parent annotation [Johnson 98] • Head lexicalization [Collins 99,...] • Automatic Annotation [Matsuzaki et al, 05;...] • Manual Annotation [Klein and Manning 03]
Manual Annotation • Manually split categories • NP: subject vs object • DT: determiners vs demonstratives • IN: sentential vs prepositional • Advantages: • Fairly compact grammar • Linguistic motivations • Disadvantages: • Performance leveled out • Requires manual annotation effort
Automatic Annotation • Use latent variable models • Split (“annotate”) each node: e.g., NP -> (NP[1], NP[2], ..., NP[T]) • Each node in the tree is annotated with a latent sub-category • Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables (written out below)
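A sketch of the quantity in the last bullet, in our own notation and assuming a binarized grammar: for a parse tree T, the latent-annotated PCFG assigns

P(T) = \sum_{a} \prod_{(A \to B\,C) \in T} P(A[a_A] \to B[a_B]\,C[a_C]) \prod_{(A \to w) \in T} P(A[a_A] \to w)

where the sum ranges over all assignments a of latent sub-categories to the nodes of T. Because the tree is fixed, the sum can be computed with dynamic programming (an inside pass over the tree).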
How to perform this clustering? • Estimating model parameters (and model structure): • Decide how to split each nonterminal (what is T in NP -> (NP[1], NP[2], ..., NP[T])?) • Estimate probabilities for all annotated rules • Parsing: • Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)? • Usually (2), but the inferred latent variables can be useful for other tasks • Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables
Estimating the model • Estimating parameters: • If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...): • E-Step: estimate the posterior distribution over latent annotations for each treebank tree and obtain fractional counts of annotated rules • M-Step: re-estimate the rule probabilities by normalizing the fractional counts (the update is written out below) • Also can use variational methods (mean-field): [Titov and Henderson, 07; Liang et al, 07] • Recall: we considered variational methods in the context of LDA
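A sketch of the standard EM updates for this model, in our notation: in the E-step the expected (“fractional”) counts of annotated rules are computed with inside-outside scores over each observed treebank tree; in the M-step they are renormalized per annotated parent symbol:

P_{new}(A[x] \to B[y]\,C[z]) = \frac{E[c(A[x] \to B[y]\,C[z])]}{\sum_{B',y',C',z'} E[c(A[x] \to B'[y']\,C'[z'])]}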
Estimating the model • How to decide how many sub-categories each node gets? • Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05, ...], with T selected by hand • The resulting grammars are large: parameter estimates are not reliable (data sparsity) and parsing time is large
Estimating the model • How to decide how many sub-categories each node gets? • Later, different approaches were considered: • (Petrov and Klein 06): Split-and-merge approach – recursively split each symbol in two; if the likelihood is (significantly) improved, keep the split, otherwise merge it back; continue until no improvement (see the sketch below) • (Liang et al 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar • The larger the training set, the more fine-grained the annotation becomes
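A minimal, hypothetical sketch (in Python) of the split-and-merge control loop described in the first bullet. The grammar, EM training and likelihood computation are hidden behind caller-supplied stand-ins, and all names here (split_and_merge, train_and_score, ...) are ours; only the keep-or-merge-back decision logic is illustrated.

def split_and_merge(symbols, train_and_score, split, threshold=1e-3, max_rounds=5):
    """symbols: dict mapping each grammar symbol to its number of sub-categories.
    train_and_score(symbols): stand-in for running EM with this split and returning a log-likelihood.
    split(symbols, sym): returns a copy of symbols with the sub-categories of sym doubled."""
    best_score = train_and_score(symbols)
    for _ in range(max_rounds):
        improved = False
        for sym in list(symbols):
            candidate = split(symbols, sym)        # tentatively split this symbol in two
            score = train_and_score(candidate)
            if score - best_score > threshold:     # keep the split only if the gain is significant
                symbols, best_score, improved = candidate, score, True
            # otherwise the candidate is simply discarded, i.e. "merged back"
        if not improved:                           # stop when no symbol benefits from splitting
            break
    return symbols, best_score

# Toy usage: the "likelihood" just rewards matching a hand-picked target granularity.
if __name__ == "__main__":
    target = {"NP": 4, "VP": 2, "DT": 1}
    score = lambda symbols: -sum(abs(n - target[s]) for s, n in symbols.items())
    split = lambda symbols, sym: {**symbols, sym: symbols[sym] * 2}
    print(split_and_merge({"NP": 1, "VP": 1, "DT": 1}, score, split, threshold=0.5))

The actual Petrov and Klein procedure splits all symbols at once, runs EM, and then merges back the splits whose removal hurts the likelihood least; the loop above follows the simpler description in the bullet.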
Estimating the model • How to decide how many sub-categories each node gets? • (Titov and Henderson 07; current work): • Instead of annotating with a single label, annotate with a binary vector • Log-linear models for the rule probabilities instead of counts of productions (a generic form is sketched below) • The number of parameters can be large: standard Gaussian regularization to avoid overtraining • Efficient approximate parsing algorithms
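A generic form of such a parameterization, in our notation (not the exact model of Titov and Henderson): the probability of an annotated production is defined through a weight vector w and feature function \phi rather than through normalized rule counts,

P(A[x] \to B[y]\,C[z]) = \frac{\exp(w^\top \phi(A, B, C, x, y, z))}{\sum_{B',C',y',z'} \exp(w^\top \phi(A, B', C', x, y', z'))}

with x, y, z binary latent vectors; the Gaussian prior on w corresponds to an L2 penalty during training.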
How to parse? • Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)? • How to parse: • (1) – easy – just usual parsing with the extended grammar (if all nodes are split in T) • (2) – not tractable (NP-hard, [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding (i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]) => instead of predicting the most probable tree you output the tree with the minimal expected error (written out below) • Not always a great idea, because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures • Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables
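The minimum Bayes risk rule mentioned above, written out (notation ours): instead of the most probable tree, output the tree with the smallest expected loss under the model distribution,

\hat{T} = \arg\min_{T} \sum_{T'} P(T' \mid s)\, L(T, T')

For constituency parsing, L is typically based on the number of incorrect labeled constituents, so the expectation decomposes over spans and can be computed from inside-outside quantities [Goodman 96].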
Adaptive splitting • (Petrov and Klein, 06): Split and Merge: number of induced sub-categories per symbol [charts omitted] • Constituent labels shown: NP, VP, PP • POS tags shown: JJ, NNP, NNS, NN vs. POS, ',', TO
Induced POS tags • Examples of induced sub-categories (word clusters) for Proper nouns (NNP), Personal pronouns (PRP), Relative adverbs (RBR) and Cardinal numbers (CD) [tables omitted]
LVs in Parsing • In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs) • In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes • In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort • Latent variable models have become popular in many applications: • syntactic dependency parsing [Titov and Henderson, 07] – best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007) • joint semantic role labeling and parsing [Henderson et al, 09] – again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009) • hidden (dynamics) CRFs [Quattoni, 09] • ...
Hidden CRFs • CRF (Lafferty et al, 2001): a linear-chain CRF has no long-distance statistical dependencies between the labels y • Latent Dynamic CRF: long-distance dependencies can be encoded through the latent state vectors • (Standard forms of both models are written out below)
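The two model forms above in standard notation (ours): a linear-chain CRF defines

P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t}\sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big)

while a latent-dynamic CRF adds a chain of hidden states h, each hidden state belonging to exactly one label, and marginalizes them out:

P(y \mid x) = \sum_{h:\; h_t \in \mathcal{H}_{y_t} \,\forall t} P(h \mid x)

where P(h \mid x) is itself a linear-chain CRF over the hidden states; the hidden chain can carry information over longer distances than the label chain alone.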
Latent Variables • Drawbacks: • Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...) • The optimization problem is often non-convex – many local optima • Inference (decoding) can be more expensive • Advantages: • Reduces feature engineering effort • Especially preferable if little domain knowledge is available and complex features are needed • The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by LA-PCFGs can be useful, e.g., for SRL) • Latent variables (= hidden representations) can be useful in multi-task learning: a hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]
Conclusions • We considered latent variable models in different contexts: • Topic modeling • Structured prediction models • We demonstrated where and why they are useful • Reviewed basic inference/learning techniques: • EM-type algorithms • Variational approximations • Sampling • Only a very basic review • Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)