1 / 22

CS546: Machine Learning and Natural Language Latent-Variable Models for Structured Prediction Problems: Syntactic Pars

CS546: Machine Learning and Natural Language Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing. Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture. TexPoint fonts used in EMF.

anja
Download Presentation

CS546: Machine Learning and Natural Language Latent-Variable Models for Structured Prediction Problems: Syntactic Pars

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS546: Machine Learning and Natural LanguageLatent-Variable Models for Structured Prediction Problems: Syntactic Parsing Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA

  2. Parsing Problem • Annotation refines base treebank symbols to improve statistical fit of the grammar • Parent annotation [Johnson 98]

  3. Parsing Problem • Annotation refines base treebank symbols to improve statistical fit of the grammar • Parent annotation [Johnson 98] • Head lexicalization [Collins 99,...]

  4. Parsing Problem • Annotation refines base treebank symbols to improve statistical fit of the grammar • Parent annotation [Johnson 98] • Head lexicalization [Collins 99,...] • Automatic Annotation [Matsuzaki et al, 05;...] • Manual Annotation [Klein and Manning 03]

  5. Manual Annotation • Manually split categories • NP: subject vs object • DT: determiners vs demonstratives • IN: sentential vs prepositional • Advantages: • Fairly compact grammar • Linguistic motivations • Disadvantages: • Performance leveled out • Manually annotated

  6. Automatic Annotation • Use Latent Variable Models • Split (“annotate”) each node: E.g., NP -> ( NP[1], NP[2],...,NP[T]) • Each node in the tree is annotated with a latent sub-category: • Latent Annotated Probablistic CFG: To obtain the probability of a tree you need to sum over all the latent variables

  7. How to perform this clustering? • Estimating model parameters (and models structure) • Decide how do you split each terminal (what is T in ., NP -> ( NP[1], NP[2],...,NP[T]) • Estimate probabilities for all • Parsing: • Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)? • Usually (2), but the inferred latent variables can can be useful for other tasks • Latent Annotated Probablistic CFG: To obtain the probability of a tree you need to sum over all the latent variables

  8. Estimating the model • Estimating parameters: • If we decide on the structure of the model (how we split) we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...): • E-Step: estimate - obtain fractional counts of rules • M-Step: • Also can use variational methods (mean-field): [Titov and Henderson, 07; Liang et al, 07] • Recall: We considered the variational methods in the context of LDA

  9. Estimating the model • How to decide on how many nodes to split? • Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05,...] with T selected by hand • Models are sparse (parameter estimates are not reliable), parsing time is large

  10. Estimating the model • How to decide on how many nodes to split? • Later different approaches were considered: • (Petrov and Klein 06): Split and merge approach – recursively split each node in 2, if the likelihood is (significantly) improved – keep, otherwise, merge back; continue until no improvement • (Liang et al 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar • Larger is the training set: more fine grain the annotation is

  11. Estimating the model • How to decide on how many nodes to split? • (Titov and Henderson 07; current work): • Instead of annotating with a single label annotate with a binary vector: • log-linear models for instead of counts of productions • - can be large: standard Gaussian regularization to avoid overtraining • efficient approximate parsing algorithms

  12. How to parse? • Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)? • How to parse: • (1) – easy – just usual parsing with the extended grammar (if all nodes split in T) • (2) - not tractable (NP-complete, [Matsuzaki et al, 2005]), • instead you can do Minimum Bayes Risk decoding (i.e., output the minimum loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]) => instead of predicting the best tree you output the tree with the minimal expected error (Not always a great idea because we often do not know good loss measures: like optimizing the Hamming loss for sequence labelingcan lead to lingustically non-plausible structures) • Latent Annotated Probablistic CFG: To obtain the probability of a tree you need to sum over all the latent variables

  13. Adaptive splitting • (Petrov and Klein, 06): Split and Merge: number of induced constituent labels: NP VP PP

  14. Adaptive splitting • (Petrov and Klein, 06): Split and Merge: number of induced POS tags: POS , TO

  15. Adaptive splitting • (Petrov and Klein, 06): Split and Merge: number of induced POS tags: JJ NNP NNS NN POS , TO

  16. Induced POS-tags • Proper Nouns (NNP): • Personal pronouns (PRP):

  17. Induced POS tags • Relative adverbs (RBR): • Cardinal Numbers (CD):

  18. Results for this model

  19. LVs in Parsing • In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into the parts (e.g., weighted CFGs / PCFGs) • In latent variable models you relax this assumption: you assume how the structure annotated with latent variables decomposes • In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort • Latent variable models become popular in many applications: • syntactic dependency parsing [Titov and Henderson, 07] – best single model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007) • joint semantic role labeling and parsing [Henderson et al, 09] – again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009) • hidden (dynamics) CRFs [Quattoni, 09] • ...

  20. Hidden CRFs • CRF (Lafferty et al, 2001): • Latent Dynamic CRF • No long-distance statistical dependencies between y • Long-distance dependencies can be encoded using latent vectors

  21. Latent Variables • Drawbacks: • Learning LVs models usually involves using slower iterative algorithms (EM, Variation methods, sampling...) • Optimization problem is often non-convex – many local minima • Inference (decoding) can be more expensive • Advantages: • Reduces feature engineering effort • Especially preferable if little domain knowledge is available and complex features are needed • Induced representation can be used for other tasks (e.g., LA-PCFGs induce fine-grain grammar can be useful, e.g., for SRL) • Latent variables (= hidden representations) can be useful in muti-task learning: hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]. • #

  22. Conclusions • We considered latent variable models in different contexts: • Topic modeling • Structured prediction models • We demonstrated where and why they are useful • Reviewed basic inference/learning techniques: • EM-type algorithms • Variational approximations • Sampling • Only very basic review • Next time: a guest lecture by Ming-Wei Chang on Domain-Adaptation (really hot and important topic in NLP!)

More Related